We found 3 0-Day vulnerabilities in PDFs. Then our AI found 13 more. 👉

See how we did it

Why Small, Purpose-Trained AI Models Beat Frontier LLMs at Offensive Security

In live-browser exploit benchmarks, Novee’s 4B-parameter model achieved up to 90% accuracy, outperforming Claude Sonnet 4 and other frontier LLMs by as much as ~55%.

Omer Ninburg, Co-Founder & CTO
Barak Battash, PhD, Founding AI Researcher

7 min read


When we tested Novee’s proprietary, patent-pending 4-billion-parameter model against frontier large language models on real offensive security tasks, we did not expect the results to be this decisive.

On constrained web exploitation challenges validated in a live browser, Novee’s model reached ~90% accuracy. Claude Sonnet 4 peaked at 64.7%. That is a roughly 25-point advantage (about 39% relative) from a model that is a fraction of the size.


90% vs 65%

A small, purpose-trained offensive security AI model beating frontier LLMs on complex exploitation


This result reflects a core belief we started with: Despite their impressive coding capabilities, frontier LLMs haven’t been trained on the specific challenge of adversarial exploitation – and that specialized experience makes all the difference.

In this post, we walk through the test setup, the benchmark results, and the training choices that explain why Novee’s AI model outperformed frontier LLMs – as well as the implications for security leaders investing in offensive security and penetration testing.


Offensive security is an adversarial reasoning problem that depends on environment feedback and adaptation – capabilities that general-purpose training doesn’t prioritize

Large language models are trained to predict text. That makes them excellent at explanation, summarization, content generation, and general reasoning across domains.

Offensive security works differently. It is not a language problem. It is an adversarial reasoning problem grounded in real systems.

Take, for example, web applications. 

Modern web applications are protected by layers of defenses: HTML sanitizers, attribute filters, CSP rules, WAFs, encoding frameworks, and custom validation logic. These systems work as designed – they block known patterns and strip forbidden strings.
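To make the pattern-and-rule nature of these layers concrete, here is a deliberately naive sketch of keyword/pattern stripping. The regexes are our own illustrative assumptions; production sanitizers such as DOMPurify are parser-based and far harder to bypass:

```python
import re

# Illustrative "block known patterns" defense: strip anything matching a
# forbidden pattern. Real WAF and sanitizer rule sets are much larger.
FORBIDDEN_PATTERNS = [r"<\s*script", r"\bon\w+\s*=", r"javascript:"]

def naive_sanitize(html: str) -> str:
    """Remove every match of a known-bad pattern from the input."""
    for pattern in FORBIDDEN_PATTERNS:
        html = re.sub(pattern, "", html, flags=re.IGNORECASE)
    return html
```

As the post notes, such filters work as designed against known strings; the interesting question is what an attacker can infer from how they transform a probe.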

Real attackers do not reason in terms of patterns and rules.

They probe. They send a payload and watch how it gets transformed. They observe what breaks, what escapes, what is silently dropped. From that feedback, they infer what defenses are in place and adapt their strategy. Over multiple attempts, they build a mental model of the system and find a way through.

That process is interactive, stateful, constraint-driven, and grounded in real system behavior. It is not a single-shot reasoning task; it is iterative puzzle-solving driven by live feedback from the target.


A concrete test: constrained XSS exploitation

To test whether an AI model could actually reason like an attacker, we started with a focused but realistic problem: cross-site scripting (XSS).

The task:

  • You are given a real HTML injection point.
  • You are given a list of forbidden keywords enforced by sanitizers and filters.
  • Your goal is to generate a payload that still executes JavaScript in a real browser.

For example, an injection point inside an attribute, with keywords like script, alert, eval, and img blocked.
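Novee has not published its checking harness, so the sketch below only illustrates the keyword constraint as described above. The helper name and the candidate payload are our own illustrative assumptions; real success is only counted when the payload executes in a live browser:

```python
def violates_blacklist(payload: str, forbidden: list[str]) -> bool:
    """Case-insensitive substring check against the forbidden keywords."""
    lowered = payload.lower()
    return any(word.lower() in lowered for word in forbidden)

# The four example keywords from the task description above.
FORBIDDEN = ["script", "alert", "eval", "img"]

# A textbook payload trips the filter immediately...
assert violates_blacklist("<script>alert(1)</script>", FORBIDDEN)

# ...while an attribute-breaking payload avoids all four keywords.
# Passing this check is necessary but not sufficient: execution still
# has to be confirmed in a real browser.
candidate = '"><svg onload=confirm(document.domain)>'
assert not violates_blacklist(candidate, FORBIDDEN)
```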

This task has an important property: it is objectively verifiable. Either the payload triggers execution in the browser, or it does not.

We evaluated two difficulty levels:

  • Base difficulty: ~4 forbidden keywords
  • Hard difficulty: ~25 forbidden keywords (extremely restrictive)

For each level, we tested 300 real samples.


Phase 1 results: Novee’s Model Crushes the Competition

We compared Novee’s models against specialized scanners (XSStrike) and frontier LLMs (Gemini 2.5 Pro, Claude Sonnet 4) on 300 test samples for each difficulty level.

Model                Base Difficulty   Hard Difficulty
XSStrike (scanner)   79.0%             62.3%
Gemini 2.5 Pro       65.7%             49.4%
Claude Sonnet 4      78.3%             64.7%
Novee 1: SFT only    87.0%             55.7%
Novee 1: SFT + RL    91.7%             90.0%

What This Tells Us

Three critical patterns emerged:

  1. Supervised learning alone is strong. The SFT-only Novee 1 model beat the specialized scanner and the frontier LLMs on base difficulty (87.0%).
  2. SFT collapses under extreme constraints. With ~25 forbidden keywords, the SFT-only model dropped to 55.7% – below both XSStrike (62.3%) and Claude Sonnet 4 (64.7%).
  3. Reinforcement learning unlocks robustness. The RL-trained Novee 1 model held 90.0% accuracy even under extreme constraints. No other model came close – this is the ~25-point gap over Claude Sonnet 4 behind the headline result.

Phase 2: multi-turn reasoning against hidden defenses

Real attackers don’t fire one payload and leave. They probe multiple times, observing responses. They gradually form a mental model of the defense stack – what’s being sanitized, what’s encoded, what rules are in place. Then they adapt.

So we escalated the challenge. Now the model:

  • Only sees the HTML context (no blacklist visible)
  • Does not see sanitization or defense rules
  • Can send multiple payloads over several turns
  • Gets browser feedback after each turn

The defense stack is hidden: HTML sanitization, output encoding, attribute filtering, JS function blocking, CSP, WAF transformations.

Essentially the task became: Infer what’s blocking you, adapt your strategy, and eventually bypass the defenses.
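A toy stand-in for this loop (not the benchmark environment – the real stack runs in a live browser with sanitizers, encoders, CSP and WAF rules) might look like:

```python
# Toy model of a hidden defense: a transform that strips unknown tokens.
class ToyDefense:
    def __init__(self, blocked_tokens):
        self._blocked = blocked_tokens  # hidden from the attacker loop

    def reflect(self, payload: str) -> str:
        """Return the payload as the page would reflect it back."""
        out = payload
        for token in self._blocked:
            out = out.replace(token, "")  # WAF-style keyword stripping
        return out

def probe_and_adapt(defense, probe_tokens, candidates):
    """Multi-turn loop: infer what gets stripped, then pick a survivor."""
    inferred = {t for t in probe_tokens
                if defense.reflect(t) != t}       # feedback from each probe
    for payload in candidates:                    # adapt around the inference
        if not any(t in payload for t in inferred):
            if defense.reflect(payload) == payload:
                return payload                    # reflected intact
    return None
```

With hidden blocked tokens {"script", "alert"}, probing ["script", "alert", "onload"] infers the two stripped tokens, and a candidate such as `<svg onload=confirm(1)>` survives reflection intact.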

Building Multi-Turn Intelligence

We trained the model on full attack trajectories – sequences showing reasoning, multiple attempted payloads, browser feedback, and eventual bypasses. We used a combination of supervised learning on these trajectories, then applied reinforcement learning to help the model explore and adapt.

One critical insight: we trained on all turns, including failures. A blocked payload tells you something about the defenses. Later, RL helped the model learn which strategies actually work.
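The post does not disclose the reward design, so the shape below is purely an assumed sketch consistent with the description: reward only for browser-verified execution, with a mild bonus for using fewer turns.

```python
def trajectory_reward(executed: bool, turns_used: int, max_turns: int) -> float:
    """Hypothetical reward: only verified execution pays, and efficient
    trajectories (fewer probing turns) pay slightly more."""
    if not executed:
        # Failed trajectories earn no RL reward, but they still appear in
        # the supervised data: a blocked payload is informative feedback.
        return 0.0
    return 1.0 + (max_turns - turns_used) / max_turns
```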

Phase 2 Results: Novee’s Model Reaches 28% Success on Near-Impossible Tasks, Outperforming Claude Sonnet 4 by ~55%

Final evaluation: 113 scenarios with hidden defenses and multiple allowed turns. This is extremely difficult, even for elite hackers. Here’s how the models performed:

Model                      Success Rate   Avg. Turns Used
Qwen3-4B-Thinking          0%             –
Claude Sonnet 4            18.3%          5.11
Novee 1 – SFT only         11.3%          2.87
Novee 1 – SFT + RFT        23.2%          2.92
Novee 1 – SFT + RFT + RL   28.4%          3.61

What Phase 2 Proves

Even frontier models struggled on this task. Claude Sonnet 4 hit 18.3%. Our most advanced configuration, Novee 1 with SFT + RFT + RL, reached 28.4% – a ~55% relative improvement. The baseline Qwen3 model scored zero.

The model also used its turns wisely. While Claude averaged 5.11 turns per attempt, our model averaged 3.61 – probing more efficiently and adapting faster.

This was achieved with a 4-billion-parameter model, not a frontier LLM. Environment-coupled reinforcement learning unlocks capabilities that pure language training does not provide.

Even in near-impossible scenarios, the small reinforcement-trained model consistently outperformed frontier LLMs, while using its turns more efficiently.


Tying it all together: why small, purpose-trained AI models beat frontier LLMs at offensive security

These results point to structural limitations shared by general-purpose frontier LLMs.

Frontier LLMs struggle in adversarial security because:

  • they are not trained on environment feedback
  • they do not learn from failed interactions
  • they optimize for fluency, not exploitability
  • they break down under tight, adversarial constraints

Scale alone doesn’t compensate for the lack of specialized, feedback-driven training.

In offensive security, how you train matters more than how big you train.

Purpose-trained offensive models are better at:

  • uncovering business logic vulnerabilities
  • chaining small weaknesses into real attack paths
  • validating exploitability instead of producing noise
  • delivering results continuously, not quarterly

Why Novee built its own model – and the implications for security leaders

These results explain why we chose to build a proprietary, purpose-trained offensive security model rather than rely on frontier LLMs. Offensive security demands interaction with real systems, learning from failure, and reasoning under constraints. That requires different training objectives, different evaluation methods, and deep offensive expertise. Our model and platform were designed by a team that has spent years conducting real-world offensive operations, and the same environment-coupled reasoning demonstrated in these benchmarks is already running in production across customer environments today.

For security leaders, the implication is clear. As software and attackers both move at machine speed, episodic testing and generic automation fall behind. Effective penetration testing must operate continuously, validate real exploit paths, and focus on what can actually be exploited right now.

If this resonates, we invite you to see how Novee’s AI offensive security platform applies these principles in practice – get a live demo today.
