
The Hardest Part of Training Security AI Isn’t the Model. It’s the Environment

How we turned hundreds of broken open-source apps into deterministic training environments

Or Dancig, Founding AI Researcher

9 mins


The Problem with Training AI on Open Source Applications? The Applications Don’t Work.

At Novee, we train a small-scale AI model to think like an expert penetration tester. Rather than exfiltrating and training on our customers’ real, privileged data, we prepare our model ahead of deployment. But that training requires thousands of real web applications as environments, spun up and torn down repeatedly during reinforcement learning (RL). These applications are easy enough to get; it turns out they’re not so easy to work with.

Once we started running them, a trend emerged. By every automated metric, the open-source applications were healthy. Many of them, however, were completely empty: no database schema, no users, no content. Just a default installation wizard waiting for a human to click “Next” six times.

In other words, more than 70% of the open-source applications we collected for training didn’t actually work out of the box.

You can’t train an attacker model on empty shells, and when roughly three-quarters of the training environments don’t reflect reality, the training process is compromised. So before we could train the model, we had to solve the environment problem.

How Novee Trains AI to Think Like a Hacker

In our previous post, we showed our own custom small-scale model outperforming Claude Sonnet at XSS bypass in a live browser environment (91.7% vs. 64.7%). But reinforcement learning needs training environments: thousands of them, each starting in an identical, verified state.

You can’t achieve reinforcement learning without a functioning world for the agent to act in.

If we wanted an AI that could truly reason like an OffSec operator, we needed to take a step back and build an environment it could actually learn in.

The Scale of Breakage in Open-Source Web Applications

In theory, open source web applications should make perfect training data; they include historical commits with known vulnerabilities, they’re plentiful, and they’re easy to containerize.

In practice, we ran into a structural gap between how software is described (“just run `docker-compose up`”) and how it actually behaves.

Here’s what we found:

| Status | % of Collected Apps |
| --- | --- |
| Failed to build | 20% |
| Built but crashed on startup | 25% |
| Running but functionally broken (silent failures) | 10% |
| Blocked behind setup wizard / manual config | 20% |
| Fully functional out of the box | 25% |

Many projects assumed a human operator would finish setup. Database migrations were described in README files but never encoded in Docker. First-run configuration happened through browser-based wizards. Some builds even fetched live dependencies at runtime, meaning yesterday’s environment wasn’t identical to today’s. Infrastructure tools happily reported services as “healthy” even when they were functionally useless.

We learned quite quickly that a running container isn’t proof of a working system.

This matters for ML specifically: when you train on broken environments, you get a model that learns to navigate brokenness, not to find vulnerabilities:

  • If the system is partially initialized, the model learns incorrect assumptions.
  • If builds silently fetch updated dependencies, results become non-reproducible.
  • If authentication fails intermittently, training signals become noisy.

The environment *is* the data. Broken environments mean broken data.

Introducing the Fixer: An Agent to Build the AI Training Grounds

To properly lay the training grounds for our hive of agentic penetration testers, we built a new agent, which we call “the Fixer.”

Its job is to take an arbitrary open-source application and make it fully functional with a single `docker compose up`. Not “containers running”, but fully realized: databases initialized, admin users created, no wizards, no manual steps. We needed applications that could serve real content on the first request.

One hard constraint: we never modify application logic. Dockerfiles, compose configs, entrypoint scripts, environment variables – all fair game. The application’s behavior stays untouched. We’re fixing how it gets built and deployed, not how it works. That boundary protects the integrity of our training data.

As we analyzed why environments failed (or only appeared to work), patterns emerged. Each one required its own fix.

The Missing Manual Problem 

Many applications require post-startup commands documented in READMEs but never encoded in Docker. Database migrations, seed data, cache warming, all assuming a human operator will run them. Laravel apps need `php artisan migrate`, Django needs `python manage.py migrate`, Rails expects `rake db:migrate`. 

How We Solved It: In a container, there’s no human to read the README. The Fixer scans setup instructions and integrates them into the container’s entrypoint.
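A minimal sketch of what that looks like for a hypothetical Laravel app; the marker path and final server command are illustrative, not the Fixer’s actual output, and the same pattern applies to Django and Rails with their own migrate commands:

```shell
#!/bin/sh
# Sketch of a Fixer-style entrypoint for a Laravel app. It encodes the
# README's one-time setup steps, guarded by a marker file so container
# restarts don't re-run migrations or re-seed the database.
set -e

MARKER=/var/www/storage/.initialized   # illustrative marker path

if [ ! -f "$MARKER" ]; then
    php artisan migrate --force        # the "manual" README step, automated
    php artisan db:seed --force        # seed content so the app isn't empty
    touch "$MARKER"
fi

exec php-fpm                           # hand PID 1 to the real server
```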

The Phantom Success Problem

Containers showed “Healthy” status while the application was completely broken. A MySQL container can be “healthy” while the database has zero tables. An nginx container can return HTTP 200 while serving `phpinfo()` instead of the actual app – a surprisingly common document root misconfiguration. Docker’s default health checks just verify the process is running.

How We Solved It: The Fixer adds deep, service-specific checks across dozens of service types, including PostgreSQL, MySQL, MongoDB, Redis, RabbitMQ, and Elasticsearch – each with its own verification pattern. Not “is the port open?” but “can we execute a real query against actual data?”
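To illustrate the difference, here is the shape of a deep, data-aware health check in a compose file. The database name (`appdb`), table (`users`), and password are placeholders, not the Fixer’s actual probes:

```yaml
services:
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: example   # placeholder credential
      MYSQL_DATABASE: appdb
    healthcheck:
      # Not "is the port open?" but "does the schema exist and hold rows?"
      # This check fails until migrations and seed data have actually landed.
      test: ["CMD-SHELL", "[ \"$$(mysql -N -uroot -pexample appdb -e 'SELECT COUNT(*) FROM users')\" -gt 0 ]"]
      interval: 5s
      retries: 20
      start_period: 30s
```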

The Freshness Problem

We pinned specific commits with known vulnerabilities. But Dockerfiles that used `git clone` or `wget` during the build pulled the latest patched version, overwriting our selections.

How We Solved It: The Fixer detects these patterns, including subtle cases like multi-stage builds where `COPY --from=builder` copies from cloned sources, or Dockerfiles that download pre-built releases from GitHub rather than building from source, and transforms them into proper `COPY` commands using our local, version-controlled code.
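A minimal before-and-after sketch of that transformation; the repository URL and paths are illustrative:

```dockerfile
# Before (non-reproducible): the build fetched whatever HEAD was that day,
# silently replacing the known-vulnerable commit we had pinned.
#   RUN git clone https://github.com/example/app.git /src

# After (Fixer-rewritten): build from the local, version-controlled checkout,
# so the exact commit we selected is what ships in the image.
COPY ./app-pinned/ /src/
```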

The Installation Wizard Trap

Some apps redirect every request to a web-based setup wizard on first run. From the outside it looks like a working web server; underneath, it’s an endless series of “Click here to configure your database.”

How We Solved It: The Fixer finds CLI equivalents for wizard steps and bakes them into the startup script.
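For example, WordPress’s browser wizard has a full CLI equivalent in wp-cli, which can be baked into the startup script. The URL and credentials below are throwaway values for a disposable training target, not production advice:

```shell
# Illustrative startup snippet replacing WordPress's browser-based setup
# wizard with its CLI equivalent (wp-cli).
if ! wp core is-installed --allow-root --path=/var/www/html; then
    wp core install --allow-root --path=/var/www/html \
        --url="http://localhost:8080" \
        --title="Training Target" \
        --admin_user="admin" \
        --admin_password="admin" \
        --admin_email="admin@example.com"
fi
```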

The Human Gatekeeper Problem

Production applications include CAPTCHA on login, registration, and password reset, designed to stop exactly the kind of automated access our models need. 

How We Solved It: The Fixer detects CAPTCHA implementations across reCAPTCHA, hCAPTCHA, and built-in image CAPTCHAs, and surgically bypasses server-side validation – typically adding `return true;` as the first line of the validation function in PHP, JavaScript, or Python. The UI stays intact but the blocking behavior disappears.
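A toy sketch of that kind of surgical patch, here driven by `sed` against a stand-in PHP validator. The file and function name are hypothetical; the real Fixer locates the validation function itself:

```shell
# Toy sketch of a surgical CAPTCHA bypass: short-circuit a PHP validator
# while leaving the rest of the file (and the UI) untouched.
cat > /tmp/captcha.php <<'EOF'
<?php
function verify_captcha($response) {
    // ...real validation against the CAPTCHA provider...
}
EOF

# Insert `return true;` as the first statement of the validator.
sed -i '/function verify_captcha/a return true; // Fixer bypass' /tmp/captcha.php
```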

The Catalog of Chaos

The named patterns above handle problems we learned to recognize. But real-world software also fails in endlessly inventive ways, and no predefined catalog could cover them all. 

This is where the Fixer shows its real edge: when it encounters a failure it hasn’t seen before, it doesn’t give up. It reads error traces the way a DevOps engineer would – identifying root causes from stack traces, build logs, and container output – and adapts.

A few examples from the long tail: 

  • CentOS 8 hit EOL in 2021, but countless Dockerfiles still reference it, meaning package managers hang trying to reach mirrors that no longer exist. 
  • A Dockerfile runs `npm ci` but the repo has no lockfile; in this case, a single-word change (`npm install` instead of `npm ci`) separates a working build from a failing one. 
  • Base images reference Docker Hub tags deleted years ago, and the fix requires finding compatible replacements without breaking the dependency chain.
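The CentOS 8 case, for instance, has a well-known repair: repoint yum from the dead mirror network to the frozen `vault.centos.org` archive. A self-contained sketch against a stand-in repo file:

```shell
# Sketch of the Fixer's repair for the CentOS 8 EOL failure mode.
# The file below is a minimal stand-in for /etc/yum.repos.d/CentOS-Base.repo.
cat > /tmp/CentOS-Base.repo <<'EOF'
[baseos]
name=CentOS Linux - BaseOS
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&repo=BaseOS
#baseurl=http://mirror.centos.org/$contentdir/$releasever/BaseOS/$basearch/os/
EOF

# Disable the dead mirrorlist, enable the archived baseurl.
sed -i -e 's|^mirrorlist=|#mirrorlist=|' \
       -e 's|^#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|' \
       /tmp/CentOS-Base.repo
```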

Trust, But Verify

Here’s a problem unique to using AI to fix infrastructure: an LLM can confidently claim “I fixed it!” while the environment is still broken. We can’t trust self-assessment.

After the Fixer claims success, a completely separate, fully deterministic verification pass runs, with no AI involved.

It’s a clean rebuild from scratch:

  • Full stack startup with deep health checks. 
  • A baked-in wait period for stabilization (some containers crash 10-20 seconds after startup). 
  • HTTP probes against real endpoints. 
  • Content verification + clean teardown. 
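In shell terms, the verification pass is roughly this shape; the endpoint path and expected marker string are per-application values, shown here as placeholders:

```shell
# Deterministic verification sketch: a clean rebuild proves the fix, with
# no AI in the loop.
set -e

docker compose build --no-cache   # rebuild from scratch, no cached layers
docker compose up -d --wait       # blocks until deep healthchecks pass
sleep 20                          # stabilization window for late crashers

# Probe a real endpoint and assert real content, not just an HTTP 200.
curl -fsS http://localhost:8080/login | grep -q "Sign in"

docker compose down -v            # clean teardown, volumes included
```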

The Fixer operates in a sandbox and tracks every file it modifies so only tracked changes transfer to the final output. If verification passes, we know the environment works, because we just proved it using the exact process training will use.

What Still Doesn’t Work

Not every application can be turned into a training environment, and it’s important to be clear about the boundaries: 

  • External service dependencies: Apps requiring payment gateways, email providers, or third-party OAuth can’t be fully isolated. We can mock some external services, but the fidelity degrades.
  • Proprietary license checks: applications that phone home at startup to validate a license. There’s no way around these without modifying application logic, which violates principle #1.
  • Hostile build systems: One application’s build script downloaded and executed an unversioned shell script from a URL that now returns 404. Not fixable without rewriting the build, which crosses the line into modifying the project itself.

Current coverage: more than 80% of collected apps reach a verified state. 

Environments Custom-Built to Train AI Hackers

Every environment cycle in our training pipeline starts from a Fixer-verified state. During a single training run, environments are created and destroyed thousands of times. Each instance begins in the exact same verified state: same database content, same configuration, same dependencies. The model then learns from real, functioning application behavior.

This is what makes the RL loop for our custom-built model possible. The model attempts an attack, receives a reward signal, and the environment resets to a clean state. If that reset isn’t perfectly deterministic – if, for example, the database has leftover state from the previous episode, or a dependency has drifted – then the reward signal becomes noise. 

Verified environments across multiple frameworks and programming languages mean real applications with real complexity. 

Along the way, the Fixer has also produced something we didn’t plan for: a catalog of real-world deployment patterns, common misconfigurations, and the gap between how developers describe setup and what actually works.

Well-Trained, But Not on Customer Data

No Novee customer will ever interact with the Fixer. But without it, our small-scale reasoning model wouldn’t have the training signals that separate pattern-matching from genuine security reasoning. The Fixer is the quiet coach behind our team of star AI agents, preparing the training ground for our AI hackers to succeed.

And, most importantly, because we build our own training grounds for our AI hackers, they don’t need our customers’ privileged data.


See the results of that training at work. Explore our platform.
