Teaching a Small Model to Hack Like a Real Attacker
How we built a reinforcement-trained XSS agent that discovers and bypasses hidden defense layers
Modern web apps ship with layers of protection—HTML sanitizers, attribute filters, CSP rules, WAF heuristics, encoding frameworks, and custom enterprise logic stacked on top of each other. On paper, XSS should be a solved problem.
But if you’ve ever worked in security, you know the punchline: attackers still find ways through.
At Novee, we asked ourselves an ambitious question:
If a human attacker can probe, adapt, and reason their way through layered defenses… can an AI model learn to do the same?
This is the story of how we tried to answer that question.
It starts with a deceptively simple challenge… and ends with a small 4B model outperforming scanners and even frontier LLMs in a realistic, browser-verified exploitation environment.
Act I — The “Simple” Problem That Wasn’t Simple at All
We began with what looked like a clean, constrained task:
Given an HTML injection point and a list of forbidden keywords, can a model generate a working XSS payload?
For example:
Context:

```html
<button onclick='[inject-here]'>Click</button>
```

Forbidden keywords:

```json
["alert", "script", "eval", "img"]
```
The model’s job is to craft something that still executes JavaScript when the button is clicked—while respecting every blacklist constraint.
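To make that concrete: a payload can reach blocked functionality without ever containing a forbidden keyword verbatim. A minimal sketch, assuming the blacklist is a simple case-insensitive substring check (the string-split payload below is a hypothetical illustration, not a sample from our dataset):

```python
def violates_blacklist(payload: str, forbidden: list[str]) -> bool:
    """True if any forbidden keyword appears verbatim (case-insensitive)."""
    lowered = payload.lower()
    return any(word.lower() in lowered for word in forbidden)

# A string-split construction reaches the blocked function without ever
# spelling a forbidden keyword:
payload = "top['al'+'ert'](1)"
assert not violates_blacklist(payload, ["alert", "script", "eval", "img"])
```

Real filters are more elaborate than a substring scan, but this is exactly the kind of constraint-satisfaction puzzle the model has to solve.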
This isn’t a prompt-completion exercise.
It’s puzzle solving.
It’s reasoning.
And it’s fundamentally verifiable: either the payload triggers the browser, or it doesn’t.
To keep the model disciplined, we designed a structured output format:

```
<think> short, precise reasoning </think>
<output> payload only </output>
```
The <think> block captures the model’s reasoning process.
The <output> block is the actual payload we inject.
This format later became the backbone of multi-turn reasoning.
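Enforcing the format is a few lines of parsing; a sketch (the regex-based `parse_turn` helper is hypothetical, and one common choice is to give malformed completions zero reward during RL):

```python
import re

# Non-greedy match so the payload can itself contain angle brackets.
TURN_RE = re.compile(r"<think>(.*?)</think>\s*<output>(.*?)</output>", re.DOTALL)

def parse_turn(completion: str):
    """Split a model completion into (reasoning, payload); None if malformed."""
    match = TURN_RE.search(completion)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```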
Collecting the Right Data (It Was Harder Than Writing the Model)
We started with 300 high-quality samples—real HTML contexts, real blacklists, real payloads. Enough to test ideas, but nowhere near enough to train a model capable of deep reasoning.
So we did what any pragmatic ML team does: we got scrappy.
1. We collected and scraped realistic contexts
Not toy problems. Realistic DOM structures.
2. We augmented aggressively
Using LLMs and systematic transformations, we expanded to ~3,000 training samples by:
- Varying contexts
- Mutating blacklists
- Synthesizing plausible payloads
3. Every single sample was tested in a real browser
If Playwright didn’t confirm the payload executed, the sample was thrown out.
That gave us a dataset grounded in actual browser behavior, not speculation.
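Our verification harness looked roughly like the following sketch, built on Playwright's sync Python API. The `__xss_hit__` sentinel (payloads call it instead of `alert()`) and the click-everything loop are illustrative assumptions, not our exact harness:

```python
def build_harness_html(template: str, payload: str) -> str:
    """Substitute the payload into the injection point of an HTML context."""
    body = template.replace("[inject-here]", payload)
    return f"<html><body>{body}</body></html>"

def payload_executes(template: str, payload: str) -> bool:
    """Render the page in headless Chromium and check whether JS actually ran."""
    # Imported lazily so the pure HTML helper stays dependency-free.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        hit = {"fired": False}
        # Payloads call __xss_hit__() as their execution sentinel.
        page.expose_function("__xss_hit__", lambda: hit.__setitem__("fired", True))
        page.set_content(build_harness_html(template, payload))
        # Click every interactive element so event-handler payloads fire.
        for el in page.query_selector_all("button, a, [onclick], [onmouseover]"):
            try:
                el.click(timeout=500)
            except Exception:
                pass
        page.wait_for_timeout(200)
        browser.close()
    return hit["fired"]
```

The key property is that success is decided by the browser, never by string matching on the payload.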
With that, we moved into supervised fine-tuning (SFT).
The model learned to imitate strong one-shot XSS payloads.
But imitation only gets you so far.
Attackers don’t imitate—they search.
That’s where reinforcement learning comes in.
Teaching the Model Through the Browser Itself
One of the best things about XSS is that success is brutally objective:
- Either the payload triggers execution → success
- Or it doesn’t → failure
This makes XSS perfect for reinforcement learning (RL), because the model receives a crisp, unambiguous signal straight from the environment.
So after SFT, we switched to an RL loop:
- The model proposes multiple candidate payloads
- Each payload is injected into a live browser
- Browser behavior determines the reward
- The model updates accordingly
No human labeling.
No heuristics.
The browser becomes the teacher.
We used a group-relative reinforcement method: the model's candidates are scored against each other within the same batch. In practice, this let the model explore creatively without needing an expensive value model.
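A minimal sketch of that group-relative scoring (pure Python; the z-score normalization shown here is one common formulation, not our exact training code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each candidate payload against its own group: the advantage is
    the z-score of its reward within the batch, so no value model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All candidates tied (all failed, or all succeeded): no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With binary browser rewards, a group where one payload executes and the rest fail pushes probability mass toward the working payload and away from the failures.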
The question was: would RL actually make a difference?
The answer was loud and clear.
Phase 1 Results — RL vs. Everyone Else
We tested on two levels:
- Base difficulty: ~4 forbidden keywords
- Hard difficulty: ~25 forbidden keywords (very restrictive)
We compared:
- XSStrike (standard)
- XSStrike with 500-shot hints
- Gemini 2.5 Pro
- Claude 4 Sonnet
- Our own model (SFT vs. SFT+RL)
Here are the full results:
| Model | Base (300 samples) | Hard (300 samples) |
|---|---|---|
| XSStrike (500-shot) | 180 / 300 (60.0%) | 162 / 300 (54.0%) |
| XSStrike | 237 / 300 (79.0%) | 187 / 300 (62.3%) |
| Gemini 2.5 Pro | 197 / 300 (65.7%) | 148 / 300 (49.3%) |
| Claude 4 Sonnet | 235 / 300 (78.3%) | 194 / 300 (64.7%) |
| Novee – SFT | 261 / 300 (87.0%) | 167 / 300 (55.7%) |
| Novee – RL | 275 / 300 (91.7%) | 270 / 300 (90.0%) |
The important patterns:
1. SFT alone is strong
It beats scanners and frontier LLMs on the Base set.
2. SFT collapses under extreme constraints
On Hard difficulty, SFT drops to 55.7%.
3. RL is shockingly robust
Even on the Hard set with ~25 forbidden tokens:
- Our RL model hits 90% accuracy
- No baseline model comes close
This was the moment we realized:
Reinforcement learning isn’t optional for this problem.
It’s the key.
With one-shot payload generation solved, we moved on to a much harder goal.
Act II — Can an AI Infer Hidden Defenses Over Multiple Turns?
Attackers don’t fire one payload and leave.
They probe. They adapt. They observe how filters mutate their inputs.
They gradually form a mental model of the defense stack.
We wanted the model to do exactly that.
So we escalated the challenge:
- The model only sees the HTML context
- It does not see the blacklist
- It does not see any sanitization or defense rules
- It can send multiple payloads over several turns
- After each turn, it gets browser feedback
- The defense stack is a hidden combination of:
- HTML sanitization
- Output encoding
- Attribute filtering
- JS function blocking
- CSP
- WAF transformations
The task:
Infer what’s blocking you, adapt your strategy, and eventually bypass the defenses.
This is extremely hard—even for humans.
And that meant the dataset had to be carefully constructed.
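The episode loop for this setting can be sketched as follows. The `agent` and `environment` callables are placeholders for the model and the browser harness; the dict-shaped feedback is an assumption of this sketch:

```python
def multi_turn_episode(agent, context, environment, max_turns=8):
    """Hidden-defense episode loop: the agent sees only the HTML context
    and the browser feedback from its own earlier attempts."""
    history = []
    for _ in range(max_turns):
        payload = agent(context, history)        # model emits its next attempt
        feedback = environment(payload)          # browser-verified outcome
        history.append((payload, feedback))
        if feedback.get("executed"):
            return history, True
    return history, False
```

Everything the model learns about the defense stack has to be inferred from the growing `history`.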
Building Multi-Turn Data Without Going Broke
Our first idea was naïve: randomly generate defense profiles and ask a strong closed-source model to solve them.
This failed spectacularly. Why?
- Many random combinations had no possible bypass
- The strong model wasted compute trying to solve hopeless puzzles
- Costs skyrocketed
- We collected tons of failed trajectories and almost no good ones
So we flipped the process.
Start from a known working payload.
Then find the hardest defense layers it still bypasses.
This produced defense profiles that were:
- Realistic
- Non-trivial
- Guaranteed solvable
For broader coverage, we also took ~20,000 collected payloads and tested them against systematically generated defense stacks to identify solvable combinations.
This was slower, but massively more cost-efficient and far higher quality.
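The flipped procedure can be sketched as a greedy search: keep stacking defense layers as long as the known-good payload still executes through the whole stack. The callables below are toy stand-ins for real sanitizers and for the browser check:

```python
def hardest_solvable_profile(payload, defense_layers, executes):
    """Greedily grow a defense stack that the known-good payload still
    bypasses. `defense_layers` are callables transforming the injected
    HTML; `executes` is the browser-verified success check."""
    stack = []
    for layer in defense_layers:
        candidate = stack + [layer]
        html = payload
        for defense in candidate:
            html = defense(html)
        if executes(html):
            stack = candidate  # this layer is survivable; keep it
    return stack
```

Because the search starts from a payload that is already known to work, every profile it emits is solvable by construction.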
From this we created a dataset of:
(context, defense profile, successful payload)
But a single successful payload doesn’t teach multi-turn reasoning.
We needed trajectories.
Teaching the Model “How Attackers Think”
To get multi-turn trajectories, we took puzzles we knew were solvable and used a closed-source model to generate full attack sequences—but with a twist:
Whenever the model approached the maximum allowed number of turns, we revealed the “ground truth” payload and asked it to produce a realistic sequence of intermediate attempts that would logically lead to that payload.
This produced:
100 complete multi-turn trajectories
Each containing:
- Reasoning
- Multiple attempted payloads
- Browser outcomes
- A final successful bypass
These traces became the cornerstone of our second-phase supervised training.
Training the Multi-Turn Agent
We trained the model on three data types:
- Single-turn puzzles with blacklists (from Phase 1)
- Single-turn puzzles with visible defense layers
- Multi-turn trajectories with hidden defense layers
This mixture taught the model:
- How payloads break in different contexts
- How defenses reshape inputs
- How to escalate complexity across turns
- How to connect observations across attempts
One important decision:
Should the model learn from only the final successful attempt?
Or from all intermediate turns, including failures?
We chose to train on all turns, because failure carries information.
Later, RL would help the model unlearn the bad ones.
This felt risky—but it paid off.
The Gibberish Problem: When Reasoning Models Lose Their Mind
If you’ve worked with chain-of-thought models, you’ve probably seen it: sometimes they produce weird, high-entropy nonsense (“gibberish”) in their reasoning.
In supervised learning, this is annoying but harmless.
In RL, it’s disastrous—because if a gibberish trajectory gets rewarded once, it can be reinforced.
So we adopted the gibberish filter from Meta's CWM work:
- If a reasoning trace uses extremely rare tokens,
- or tokens to which the model assigns extremely low probability,
- we discard the trajectory entirely.
This simple rule prevented collapse and kept reasoning clean.
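A filter in that spirit can be sketched from per-token log-probabilities; the thresholds below are illustrative, not the ones from the CWM paper:

```python
def is_gibberish(token_logprobs: list[float],
                 min_logprob: float = -12.0,
                 max_rare_frac: float = 0.02) -> bool:
    """Flag a reasoning trace whose tokens the model itself finds
    implausible: too many tokens fall below a log-probability floor."""
    if not token_logprobs:
        return False
    rare = sum(1 for lp in token_logprobs if lp < min_logprob)
    return rare / len(token_logprobs) > max_rare_frac
```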
Reinforced Fine-Tuning (RFT) — Progressive RL to Extract Better Traces
After SFT, we moved to Reinforced Fine-Tuning (RFT) for the multi-turn setting.
The first RL attempts were unstable. Directly optimizing the policy in this environment—hidden defenses, multi-turn interaction, small number of trajectories—made training fragile.
So we adopted a different strategy:
Use progressive RL primarily as a way to extract better traces for offline training.
Concretely:
We ran RL and collected successful trajectories, applying additional filtering criteria:
- We discarded traces whose thinking contained gibberish
- We made sure not to oversample specific or easy samples
These successful, high-quality trajectories were then used to enrich the training data for subsequent fine-tuning.
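The trace-selection step can be sketched like this (function and field names are hypothetical; the per-task cap implements the no-oversampling rule):

```python
from collections import Counter

def select_rft_traces(trajectories, is_gibberish, max_per_task=4):
    """Rejection-sample RL rollouts into SFT data: keep only successful,
    gibberish-free traces, and cap each task's contribution so easy
    puzzles don't dominate the mixture."""
    kept, per_task = [], Counter()
    for task_id, trace, success in trajectories:
        if not success or is_gibberish(trace):
            continue
        if per_task[task_id] >= max_per_task:
            continue
        per_task[task_id] += 1
        kept.append((task_id, trace))
    return kept
```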
Reinforcement Learning in Phase 2
After training on:
- The original data
- Plus the rejection-sampled outputs from progressive RL
we ran RL again on the multi-turn task, using the GRPO objective, filtering out zero-advantage trajectories, and adding penalties for gibberish thinking and for repeated payloads across turns.
An interesting observation we made:
When the penalty for repeated payloads was too strong, the model collapsed into overly conservative behavior, predicting a single safe payload each turn and hurting performance.
Reducing that negative reward helped restore healthy exploration: the model learned not to spam the exact same payload, but it still reused promising ideas when appropriate.
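The resulting per-turn reward shaping looks roughly like this; the +1.0 and 0.1 constants are illustrative, not our tuned values:

```python
def turn_reward(payload: str, previous: list[str], executed: bool,
                repeat_penalty: float = 0.1) -> float:
    """Per-turn reward: +1 for browser-verified execution, minus a *mild*
    penalty for resending an earlier payload. A large penalty collapsed
    exploration, so the repeat term is kept small."""
    reward = 1.0 if executed else 0.0
    if payload in previous:
        reward -= repeat_penalty
    return reward
```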
Phase 2 Results — Multi-Turn Hidden-Defense Bypass
Final evaluation: 113 scenarios, each with hidden defenses and several allowed turns.
Here’s how models performed:
| Model | Accuracy | Avg. Turns |
|---|---|---|
| Qwen3-4B-Thinking | 0% | – |
| Claude 4 Sonnet | 18.3% | 5.11 |
| Novee – SFT | 11.3% | 2.87 |
| Novee – SFT + RFT | 23.2% | 2.92 |
| Novee – SFT + RFT + RL | 28.4% | 3.61 |
Important notes:
- Even frontier models struggled: this task is hard.
- The best version of our model reached 28.4%, starting from a Qwen3-4B baseline that solved no scenarios and more than doubling our SFT-only variant (11.3%).
- The model used more turns on average—but used them wisely.
- This was achieved with a 4B parameter model, not a frontier LLM.
This is one of the clearest signs that environment-coupled RL unlocks capabilities that pure language training cannot.
Why This Matters (Far Beyond XSS)
Although our work focuses on XSS, the implications are broad.
This research shows:
1. Small models can outperform huge LLMs when trained with the right signals
General-purpose LLMs shrink under constrained or adversarial tasks.
Small models trained with environment feedback thrive.
2. AI can truly simulate attacker behavior
Not just guessing—but:
- probing
- observing
- analyzing
- adapting
Just like a security engineer or red-teamer would.
3. Offense-driven ML is the future of application security
Static scanners check signatures.
LLMs guess text.
But an agent that interacts with your actual application?
That’s a new category.
4. This method generalizes to many vulnerability classes
We’re already extending this to:
- SSTI
- CSP bypasses
- Prototype pollution
- DOM-based sinks
- and more
Anywhere the attacker must “probe → observe → adapt”, this approach works.
Closing Thoughts
What started as a question—
“Can an AI generate a constrained XSS payload?”
—turned into a demonstration that a small, well-trained model can:
- Analyze HTML contexts
- Infer hidden sanitization and encoding layers
- Adapt strategies over multiple turns
- Learn directly from browser behavior
- And outperform both specialized scanners and frontier LLMs
At Novee, we see this as the beginning of a new era of offensive security automation:
environment-aware, reinforcement-trained, adaptive attacker agents.
If you want to:
- experiment with these agents on your own applications,
- explore joint research on multi-turn exploit reasoning, or
- integrate this into your CI/CD pipeline as a continuous offensive simulator,
then get in touch with us.