Teaching a Small Model to Hack Like a Real Attacker
How we built a reinforcement-trained XSS agent that discovers and bypasses hidden defense layers
Modern web apps ship with layers of protection—HTML sanitizers, attribute filters, CSP rules, WAF heuristics, encoding frameworks, and custom enterprise logic stacked on top of each other. On paper, XSS should be a solved problem.
But if you’ve ever worked in security, you know the punchline: attackers still find ways through.
At Novee, we asked ourselves an ambitious question:
If a human attacker can probe, adapt, and reason their way through layered defenses… can an AI model learn to do the same?
This is the story of how we tried to answer that question.
It starts with a deceptively simple challenge… and ends with a small 4B model outperforming scanners and even frontier LLMs in a realistic, browser-verified exploitation environment.
Act I — The “Simple” Problem That Wasn’t Simple at All
We began with what looked like a clean, constrained task:
Given an HTML injection point and a list of forbidden keywords, can a model generate a working XSS payload?
For example:
Context:

```html
<button onclick='[inject-here]'>Click</button>
```

Forbidden keywords:

```json
["alert", "script", "eval", "img"]
```
The model’s job is to craft something that still executes JavaScript when the button is clicked—while respecting every blacklist constraint.
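To make that concrete: a payload can reach blocked functionality without ever containing a forbidden keyword verbatim. A minimal sketch, assuming the blacklist is a simple case-insensitive substring check (the string-split payload below is a hypothetical illustration, not a sample from our dataset):

```python
def violates_blacklist(payload: str, forbidden: list[str]) -> bool:
    """True if any forbidden keyword appears verbatim (case-insensitive)."""
    lowered = payload.lower()
    return any(word.lower() in lowered for word in forbidden)

# A string-split construction reaches the blocked function without ever
# spelling a forbidden keyword:
payload = "top['al'+'ert'](1)"
assert not violates_blacklist(payload, ["alert", "script", "eval", "img"])
```

Real filters are more elaborate than a substring scan, but this is exactly the kind of constraint-satisfaction puzzle the model has to solve.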
This isn’t a prompt-completion exercise.
It’s puzzle solving.
It’s reasoning.
And it’s fundamentally verifiable: either the payload triggers the browser, or it doesn’t.
To keep the model disciplined, we designed a structured output format:

```
<think> short, precise reasoning </think>
<output> payload only </output>
```
The <think> block captures the model’s reasoning process.
The <output> block is the actual payload we inject.
This format later became the backbone of multi-turn reasoning.
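Enforcing the format is a few lines of parsing; a sketch (the regex-based `parse_turn` helper is hypothetical, and one common choice is to give malformed completions zero reward during RL):

```python
import re

# Non-greedy match so the payload can itself contain angle brackets.
TURN_RE = re.compile(r"<think>(.*?)</think>\s*<output>(.*?)</output>", re.DOTALL)

def parse_turn(completion: str):
    """Split a model completion into (reasoning, payload); None if malformed."""
    match = TURN_RE.search(completion)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```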
Collecting the Right Data (It Was Harder Than Writing the Model)
We started with 300 high-quality samples—real HTML contexts, real blacklists, real payloads. Enough to test ideas, but nowhere near enough to train a model capable of deep reasoning.
So we did what any pragmatic ML team does: we got scrappy.
1. We collected and scraped realistic contexts
Not toy problems. Realistic DOM structures.
2. We augmented aggressively
Using LLMs and systematic transformations, we expanded to ~3,000 training samples by:
- Varying contexts
- Mutating blacklists
- Synthesizing plausible payloads
3. Every single sample was tested in a real browser
If Playwright didn’t confirm the payload executed, the sample was thrown out.
That gave us a dataset grounded in actual browser behavior, not speculation.
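Our verification harness looked roughly like the following sketch, built on Playwright's sync Python API. The `__xss_hit__` sentinel (payloads call it instead of `alert()`) and the click-everything loop are illustrative assumptions, not our exact harness:

```python
def build_harness_html(template: str, payload: str) -> str:
    """Substitute the payload into the injection point of an HTML context."""
    body = template.replace("[inject-here]", payload)
    return f"<html><body>{body}</body></html>"

def payload_executes(template: str, payload: str) -> bool:
    """Render the page in headless Chromium and check whether JS actually ran."""
    # Imported lazily so the pure HTML helper stays dependency-free.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        hit = {"fired": False}
        # Payloads call __xss_hit__() as their execution sentinel.
        page.expose_function("__xss_hit__", lambda: hit.__setitem__("fired", True))
        page.set_content(build_harness_html(template, payload))
        # Click every interactive element so event-handler payloads fire.
        for el in page.query_selector_all("button, a, [onclick], [onmouseover]"):
            try:
                el.click(timeout=500)
            except Exception:
                pass
        page.wait_for_timeout(200)
        browser.close()
    return hit["fired"]
```

The key property is that success is decided by the browser, never by string matching on the payload.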
With that, we moved into supervised fine-tuning (SFT).
The model learned to imitate strong one-shot XSS payloads.
But imitation only gets you so far.
Attackers don’t imitate—they search.
That’s where reinforcement learning comes in.
Teaching the Model Through the Browser Itself
One of the best things about XSS is that success is brutally objective:
- Either the payload triggers execution → success
- Or it doesn’t → failure
This makes XSS perfect for reinforcement learning (RL), because the model receives a crisp, unambiguous signal straight from the environment.
So after SFT, we switched to an RL loop:
- The model proposes multiple candidate payloads
- Each payload is injected into a live browser
- Browser behavior determines the reward
- The model updates accordingly
No human labeling.
No heuristics.
The browser becomes the teacher.
We used a group-relative reinforcement method: the model's candidates are scored against each other within the same batch. In practice, this let the model explore creatively without needing an expensive value model.
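A minimal sketch of that group-relative scoring (pure Python; the z-score normalization shown here is one common formulation, not our exact training code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each candidate payload against its own group: the advantage is
    the z-score of its reward within the batch, so no value model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All candidates tied (all failed, or all succeeded): no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With binary browser rewards, a group where one payload executes and the rest fail pushes probability mass toward the working payload and away from the failures.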
The question was: would RL actually make a difference?
The answer was loud and clear.
Phase 1 Results — RL vs. Everyone Else
We tested on two levels:
- Base difficulty: ~4 forbidden keywords
- Hard difficulty: ~25 forbidden keywords (very restrictive)
We compared:
- XSStrike (standard)
- XSStrike with 500-shot hints
- Gemini 2.5 Pro
- Claude 4 Sonnet
- Our own model (SFT vs. SFT+RL)
Here are the full results:
| Model | Base (300 samples) | Hard (300 samples) |
|---|---|---|
| XSStrike (500-shot) | 180 / 300 (60.0%) | 162 / 300 (54.0%) |
| XSStrike | 237 / 300 (79.0%) | 187 / 300 (62.3%) |
| Gemini 2.5 Pro | 197 / 300 (65.7%) | 148 / 300 (49.3%) |
| Claude 4 Sonnet | 235 / 300 (78.3%) | 194 / 300 (64.7%) |
| Novee – SFT | 261 / 300 (87.0%) | 167 / 300 (55.7%) |
| Novee – RL | 275 / 300 (91.7%) | 270 / 300 (90.0%) |
The important patterns:
1. SFT alone is strong
It beats scanners and frontier LLMs on the Base set.
2. SFT collapses under extreme constraints
On Hard difficulty, SFT drops to 55.7%.
3. RL is shockingly robust
Even on the Hard set with ~25 forbidden tokens:
- Our RL model hits 90% accuracy
- No baseline model comes close
This was the moment we realized:
Reinforcement learning isn’t optional for this problem.
It’s the key.
With one-shot payload generation solved, we moved on to a much harder goal.
Act II — Can an AI Infer Hidden Defenses Over Multiple Turns?
Attackers don’t fire one payload and leave.
They probe. They adapt. They observe how filters mutate their inputs.
They gradually form a mental model of the defense stack.
We wanted the model to do exactly that.
So we escalated the challenge:
- The model only sees the HTML context
- It does not see the blacklist
- It does not see any sanitization or defense rules
- It can send multiple payloads over several turns
- After each turn, it gets browser feedback
- The defense stack is a hidden combination of:
- HTML sanitization
- Output encoding
- Attribute filtering
- JS function blocking
- CSP
- WAF transformations
The task:
Infer what’s blocking you, adapt your strategy, and eventually bypass the defenses.
This is extremely hard—even for humans.
And that meant the dataset had to be carefully constructed.
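The episode loop for this setting can be sketched as follows. The `agent` and `environment` callables are placeholders for the model and the browser harness; the dict-shaped feedback is an assumption of this sketch:

```python
def multi_turn_episode(agent, context, environment, max_turns=8):
    """Hidden-defense episode loop: the agent sees only the HTML context
    and the browser feedback from its own earlier attempts."""
    history = []
    for _ in range(max_turns):
        payload = agent(context, history)        # model emits its next attempt
        feedback = environment(payload)          # browser-verified outcome
        history.append((payload, feedback))
        if feedback.get("executed"):
            return history, True
    return history, False
```

Everything the model learns about the defense stack has to be inferred from the growing `history`.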
Building Multi-Turn Data Without Going Broke
Our first idea was naïve: randomly generate defense profiles and ask a strong closed-source model to solve them.
This failed spectacularly. Why?
- Many random combinations had no possible bypass
- The strong model wasted compute trying to solve hopeless puzzles
- Costs skyrocketed
- We collected tons of failed trajectories and almost no good ones
So we flipped the process.
Start from a known working payload.
Then find the hardest defense layers it still bypasses.
This produced defense profiles that were:
- Realistic
- Non-trivial
- Guaranteed solvable
For broader coverage, we also took ~20,000 collected payloads and tested them against systematically generated defense stacks to identify solvable combinations.
This was slower, but massively more cost-efficient and far higher quality.
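The flipped procedure can be sketched as a greedy search: keep stacking defense layers as long as the known-good payload still executes through the whole stack. The callables below are toy stand-ins for real sanitizers and for the browser check:

```python
def hardest_solvable_profile(payload, defense_layers, executes):
    """Greedily grow a defense stack that the known-good payload still
    bypasses. `defense_layers` are callables transforming the injected
    HTML; `executes` is the browser-verified success check."""
    stack = []
    for layer in defense_layers:
        candidate = stack + [layer]
        html = payload
        for defense in candidate:
            html = defense(html)
        if executes(html):
            stack = candidate  # this layer is survivable; keep it
    return stack
```

Because the search starts from a payload that is already known to work, every profile it emits is solvable by construction.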
From this we created a dataset of:
(context, defense profile, successful payload)
But a single successful payload doesn’t teach multi-turn reasoning.
We needed trajectories.
Teaching the Model “How Attackers Think”
To get multi-turn trajectories, we took puzzles we knew were solvable and used a closed-source model to generate full attack sequences—but with a twist:
Whenever the model approached the maximum allowed number of turns, we revealed the “ground truth” payload and asked it to produce a realistic sequence of intermediate attempts that would logically lead to that payload.
This produced:
100 complete multi-turn trajectories
Each containing:
- Reasoning
- Multiple attempted payloads
- Browser outcomes
- A final successful bypass
These traces became the cornerstone of our second-phase supervised training.
Training the Multi-Turn Agent
We trained the model on three data types:
- Single-turn puzzles with blacklists (from Phase 1)
- Single-turn puzzles with visible defense layers
- Multi-turn trajectories with hidden defense layers
This mixture taught the model:
- How payloads break in different contexts
- How defenses reshape inputs
- How to escalate complexity across turns
- How to connect observations across attempts
One important decision:
Should the model learn from only the final successful attempt?
Or from all intermediate turns, including failures?
We chose to train on all turns, because failure carries information.
Later, RL would help the model unlearn the bad ones.
This felt risky—but it paid off.
The Gibberish Problem: When Reasoning Models Lose Their Mind
If you’ve worked with chain-of-thought models, you’ve probably seen it: sometimes they produce weird, high-entropy nonsense (“gibberish”) in their reasoning.
In supervised learning, this is annoying but harmless.
In RL, it’s disastrous—because if a gibberish trajectory gets rewarded once, it can be reinforced.
So we adopted the gibberish filter from Meta's CWM work:
- If a reasoning trace uses extremely rare tokens,
- or tokens to which the model assigns extremely low probability,
- we discard the trajectory entirely.
This simple rule prevented collapse and kept reasoning clean.
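A filter in that spirit can be sketched from per-token log-probabilities; the thresholds below are illustrative, not the ones from the CWM paper:

```python
def is_gibberish(token_logprobs: list[float],
                 min_logprob: float = -12.0,
                 max_rare_frac: float = 0.02) -> bool:
    """Flag a reasoning trace whose tokens the model itself finds
    implausible: too many tokens fall below a log-probability floor."""
    if not token_logprobs:
        return False
    rare = sum(1 for lp in token_logprobs if lp < min_logprob)
    return rare / len(token_logprobs) > max_rare_frac
```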
Reinforced Fine-Tuning (RFT) — Progressive RL to Extract Better Traces
After SFT, we moved to Reinforced Fine-Tuning (RFT) for the multi-turn setting.
The first RL attempts were unstable. Directly optimizing the policy in this environment—hidden defenses, multi-turn interaction, small number of trajectories—made training fragile.
So we adopted a different strategy:
Use progressive RL primarily as a way to extract better traces for offline training.
Concretely:
We ran RL and collected successful trajectories, applying additional filtering criteria:
- We discarded traces whose thinking contained gibberish
- We made sure not to oversample specific or easy samples
These successful, high-quality trajectories were then used to enrich the training data for subsequent fine-tuning.
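The trace-selection step can be sketched like this (function and field names are hypothetical; the per-task cap implements the no-oversampling rule):

```python
from collections import Counter

def select_rft_traces(trajectories, is_gibberish, max_per_task=4):
    """Rejection-sample RL rollouts into SFT data: keep only successful,
    gibberish-free traces, and cap each task's contribution so easy
    puzzles don't dominate the mixture."""
    kept, per_task = [], Counter()
    for task_id, trace, success in trajectories:
        if not success or is_gibberish(trace):
            continue
        if per_task[task_id] >= max_per_task:
            continue
        per_task[task_id] += 1
        kept.append((task_id, trace))
    return kept
```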
Reinforcement Learning in Phase 2
After training on:
- The original data
- Plus the rejection-sampled outputs from progressive RL
we ran RL again on the multi-turn task, using the GRPO objective, filtering out zero-advantage trajectories, and adding penalties for gibberish thinking and for repeated payloads across turns.
An interesting observation we made:
When the penalty for repeated payloads was too strong, the model collapsed into overly conservative behavior, predicting a single safe payload each turn and hurting performance.
Reducing that negative reward helped restore healthy exploration: the model learned not to spam the exact same payload, but it still reused promising ideas when appropriate.
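The resulting per-turn reward shaping looks roughly like this; the +1.0 and 0.1 constants are illustrative, not our tuned values:

```python
def turn_reward(payload: str, previous: list[str], executed: bool,
                repeat_penalty: float = 0.1) -> float:
    """Per-turn reward: +1 for browser-verified execution, minus a *mild*
    penalty for resending an earlier payload. A large penalty collapsed
    exploration, so the repeat term is kept small."""
    reward = 1.0 if executed else 0.0
    if payload in previous:
        reward -= repeat_penalty
    return reward
```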
Phase 2 Results — Multi-Turn Hidden-Defense Bypass
Final evaluation: 113 scenarios, each with hidden defenses and several allowed turns.
Here’s how models performed:
| Model | Accuracy | Avg. Turns |
|---|---|---|
| Qwen3-4B-Thinking | 0% | – |
| Claude 4 Sonnet | 18.3% | 5.11 |
| Novee – SFT | 11.3% | 2.87 |
| Novee – SFT + RFT | 23.2% | 2.92 |
| Novee – SFT + RFT + RL | 28.4% | 3.61 |
Important notes:
- Even frontier models struggled: this task is hard.
- The best version of our model reached 28.4%, starting from a Qwen3-4B baseline that solved no scenarios and more than doubling our SFT-only variant (11.3%).
- The model used more turns on average—but used them wisely.
- This was achieved with a 4B parameter model, not a frontier LLM.
This is one of the clearest signs that environment-coupled RL unlocks capabilities that pure language training cannot.
Why This Matters (Far Beyond XSS)
Although our work focuses on XSS, the implications are broad.
This research shows:
1. Small models can outperform huge LLMs when trained with the right signals
General-purpose LLMs shrink under constrained or adversarial tasks.
Small models trained with environment feedback thrive.
2. AI can truly simulate attacker behavior
Not just guessing—but:
- probing
- observing
- analyzing
- adapting
Just like a security engineer or red-teamer would.
3. Offense-driven ML is the future of application security
Static scanners check signatures.
LLMs guess text.
But an agent that interacts with your actual application?
That’s a new category.
4. This method generalizes to many vulnerability classes
We’re already extending this to:
- SSTI
- CSP bypasses
- Prototype pollution
- DOM-based sinks
- and more
Anywhere the attacker must “probe → observe → adapt”, this approach works.
Closing Thoughts
What started as a question—
“Can an AI generate a constrained XSS payload?”
—turned into a demonstration that a small, well-trained model can:
- Analyze HTML contexts
- Infer hidden sanitization and encoding layers
- Adapt strategies over multiple turns
- Learn directly from browser behavior
- And outperform both specialized scanners and frontier LLMs
At Novee, we see this as the beginning of a new era of offensive security automation:
environment-aware, reinforcement-trained, adaptive attacker agents.
If you want to:
- experiment with these agents on your own applications,
- explore joint research on multi-turn exploit reasoning, or
- integrate this into your CI/CD pipeline as a continuous offensive simulator,
then get in touch with us.