How to Build an AI Agent That Audits Your Own Code — Adversarial Review, Fail-Closed Design, Idempotency, and Sandboxed Agents
I don’t trust my own code. So I built an AI team that doesn’t trust it either.
I’m a self-taught operator. No CS degree. I run a real California field-services business on software I wrote myself, and the thing nobody warns you about is the bug that passes. It compiles. Your tests are green. And it still quietly does the wrong thing in production three weeks later, because somewhere a function logged “done” without checking that the thing it claimed to do actually happened.
That bug is the whole enemy. Not the crash. The crash you see. The silent one is the killer, the one that reports success and lies. So before I name a single concept, let me show you one. I want you to feel it first.
// EXHIBIT A — the bug that passes every test
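A minimal sketch of the pattern in Python. The `client` object and its `charge()` method are hypothetical stand-ins for whatever payment SDK you use; only the shape matters.

```python
import logging

log = logging.getLogger("billing")


def charge_customer(client, customer_id, amount_cents):
    """Exhibit A: looks careful, always reports success."""
    try:
        # `client.charge` is a hypothetical payment call that may raise.
        client.charge(customer_id, amount_cents)
    except Exception:
        # The failure is swallowed here. Nothing is re-raised or recorded.
        pass
    # Runs whether or not the charge went through.
    log.info("charged %s %d cents", customer_id, amount_cents)
    return True
```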
Read it slow. The function takes money. It wraps the charge in a try. If the charge throws — network blip, the card declines, the payment service is down — the except eats it. Then, no matter what happened, it logs “charged” and returns True.
Here’s why it’s so dangerous. Your test mocks the payment client, calls the function, and asserts it returns True. It does. Green. Your test for the failure case mocks the client to raise, calls the function, asserts it “doesn’t crash.” It doesn’t crash. Green. The function passes every test you’d naturally write, because it was built to always succeed on paper. It cannot fail in a way your assertions can see.
Now ship it. A month later the payment provider has a bad afternoon. Charges silently fail. Your logs are a wall of green “charged” lines. Your dashboard says revenue is fine. Nothing pages you, because nothing threw. You find out when a customer who got the goods never got billed, or worse, when you go to reconcile and the money isn’t there. The log lied for a month and you believed it, because you wrote it to lie.
The rest of this guide is the answer to one question: how do I never write that again, and how do I build a machine that catches it when I do anyway? Five ideas. For each one I’ll name the principle, explain the real theory underneath it in plain English, hand you working Python, and call out the mistake almost everybody makes. You don’t need my system. You need the concepts. Take them.
1. Adversarial verification beats approval review
// THE PRINCIPLE
Don’t ask a reviewer “does this look good?” Point a reviewer at your code and tell it to break the thing. Default to “this is wrong” and force the code to survive the attack. One credible objection blocks the ship. Not a vote. A veto.
// WHY IT WORKS
There’s a piece of real philosophy of science under this, and it changes how you think about every test you write. Karl Popper’s idea of falsifiability says you can never prove a general claim true by piling up examples that agree with it. The precise version of the claim is this: “my code is correct” is a universal statement — it asserts something about every possible input. A test is a single observation. No finite stack of agreeing observations can confirm a statement about an infinite set. But one disagreeing observation refutes it instantly. A thousand passing tests don’t prove your code is correct. They prove it didn’t break on those thousand inputs. The only move that ever actually carries information is a serious attempt to refute the claim that fails anyway. Survive a real attack and you’ve learned something. Collect another green checkmark and you’ve learned almost nothing.
Think of it like a structural engineer. He doesn’t certify a bridge by watching a thousand cars roll across and saying “looks solid.” He loads it past spec, on purpose, hunting for the failure point. The information lives in the stress test, not the smooth traffic. Your tests should drive the truck at the bridge, not coast across it.
Now layer in a very human problem: confirmation bias. Ask anyone, a person or a model, “this looks fine, right?” and they’re wired to find the reason it’s fine. Language models are especially eager to agree; their whole training pulls toward the helpful, agreeable answer. A reviewer whose default is approval is a reviewer who will rationalize your bug into a feature. You have to flip the incentive. A reviewer told to assume the code is broken, and to earn its way to “okay” by failing to break it, surfaces things a friendly read will never see. The attitude is the entire trick. Same model, different instruction, completely different result.
And the veto rule matters as much as the attitude. If five reviewers “vote” and the majority says ship, you’ve built a machine that averages away your one sharp skeptic. But damage doesn’t average. A single reviewer who can demonstrate real harm is correct no matter how many others shrugged. Truth here isn’t a popularity contest; it’s whether the attack lands. So one credible objection, one P0, holds the whole thing.
// APPLY IT
Two changes, today, no fleet of agents required. First, fix your review prompt. Most people write the weak one. Write the strong one.
// WEAK — invites agreement
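An illustrative version, written as a Python constant so it can sit next to a review script. The wording is an assumption for demonstration, not a quoted prompt.

```python
# Hypothetical weak prompt: it asks for agreement, so it tends to get agreement.
WEAK_REVIEW_PROMPT = (
    "Here is my code. Does this look good? "
    "Let me know if you see any issues.\n\n{code}"
)
```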
// STRONG — demands an attack
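A sketch of the adversarial version in the same form. The exact wording, the severity label, and the `build_review_prompt` helper are all illustrative assumptions; what matters is that the default verdict is "broken" and the reviewer has to show its attacks.

```python
# Hypothetical strong prompt: the reviewer starts from "this is wrong"
# and has to earn its way to "no findings".
STRONG_REVIEW_PROMPT = (
    "Assume this code is wrong. Your job is to break it.\n"
    "Hunt for: swallowed exceptions, success logged without a read-back, "
    "and actions that are unsafe to run twice.\n"
    "For each finding give the input or failure that triggers it and the "
    "damage it causes, and label the severity (P0 blocks the ship).\n"
    "Answer NO FINDINGS only if every attack you tried failed, "
    "and list the attacks you tried.\n\n{code}"
)


def build_review_prompt(code: str) -> str:
    """Fill the template with the code under review."""
    return STRONG_REVIEW_PROMPT.format(code=code)
```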
Second, write your tests as attacks, not confirmations. A confirmation test proves the happy path works, which you already believed. An attack test goes hunting for the input that breaks it. Same function, two completely different tests:
That last test fails against Exhibit A, which is the point. Then give different reviewers different jobs so they don’t all chase the same obvious thing — one hunts swallowed errors, one hunts the security and money paths, one just checks the happy path still works. Then honor the veto. If one of them shows real damage, you don’t ship until you’ve answered it. Not out-voted it. Answered it.
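The split of jobs and the veto rule fit in a few lines. This is a sketch under assumptions: the role names, the `Finding` record, and the "P0" label as a string are illustrative, and `ship_by_vote` is included only to show what the veto replaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    reviewer: str
    severity: str   # "P0" means demonstrated damage; anything else is advisory
    evidence: str   # the input or failure that triggers the problem


# Hypothetical role split so reviewers do not all chase the same thing.
REVIEWER_ROLES = {
    "error-hunter": "Find swallowed errors and success reported without proof.",
    "money-and-security": "Attack the payment and auth paths.",
    "happy-path": "Confirm the normal flow still works.",
}


def ship_by_vote(findings, reviewer_count):
    """Majority rule: ships unless more than half the reviewers raised a P0."""
    blockers = {f.reviewer for f in findings if f.severity == "P0"}
    return len(blockers) <= reviewer_count / 2


def ship_by_veto(findings):
    """Veto rule: a single P0 with evidence holds the ship."""
    return not any(f.severity == "P0" and f.evidence for f in findings)
```

With one reviewer out of three holding a demonstrated P0, the vote ships and the veto does not. That gap is the one skeptic being averaged away.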
Here’s why I’ll never run without it. On a recent build, the adversarial pass caught three real failures that had already passed compile and the author’s own green tests:
Every one of those was a “success” the system was about to report to a human. Green tests didn’t catch them. A reviewer told to distrust did.
2. Verify, then claim — and fail closed
// THE PRINCIPLE
Never log “done” until you’ve checked that the thing you claim you did actually committed. And on anything that touches money or matters: when you’re not sure it worked, refuse. Don’t proceed. A log line is a claim. Evidence is re-reading the state and confirming it changed.
// WHY IT WORKS
This one has a beautiful, almost cruel piece of theory behind it: the Two Generals Problem. Two generals on opposite hills have to attack at the same time, and the only way to coordinate is to send messengers through a valley where messengers get captured. General A sends “attack at dawn.” Did it arrive? He needs a confirmation back. General B sends one. Did that arrive? Now B needs a confirmation of the confirmation. It never bottoms out. The proven result — and it really is proven, not a vibe — is that over an unreliable channel, no finite number of messages can make both sides certain they agree.
Your code lives in that valley constantly. You call an API to cancel something. The call returns. But think about what the return value actually witnesses. It witnesses that a response came back. It does not witness that the action committed, because the failure mode you fear most is exactly the one where the action succeeds on the far side and the acknowledgment gets lost on the way home. You logged “cancelled.” That log proves you sent the order and got some response. It does not prove the world changed. The log is the messenger you can’t confirm. So you treat it as a claim and you go look: re-read the record, confirm the new state with your own eyes, before you tell a human it’s done.
The second half is design philosophy straight from safety engineering: fail-safe vs. fail-deadly. A dead-man’s brake on a train fails closed — lose power, lose the operator, the brake engages and the train stops. The safe state is the default state. A traffic light that loses its controller and goes dark instead of flashing red has failed open, into the dangerous direction, and now there’s a four-way intersection with no rules. Look at how most software fails: it hits a snag, swallows the exception, logs a cheerful success, and rolls on. That’s the dark intersection. The error vanishes and the machine keeps moving as if nothing happened. Every silent killer I’ve ever chased lived in a fail-open path. Flip them. When unsure, stop.
// APPLY IT
Take Exhibit A from the top of this guide and rewrite it the right way. No swallowed error. A read-back that confirms the charge actually landed. And a money branch that fails closed — uncertain means refuse and alert a human, never “probably fine, keep going.”
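A sketch of that rewrite, under stated assumptions: `client.charge` returns a charge id, `client.get_charge` returns a dict with `status` and `amount_cents`, and `alert` is any callable that reaches a human. None of those names come from a real SDK.

```python
import logging

log = logging.getLogger("billing")


class ChargeNotVerified(Exception):
    """Raised whenever we cannot prove the charge committed."""


def charge_customer(client, customer_id, amount_cents, alert):
    """Charge, read the result back, and refuse to claim success without proof."""
    try:
        charge_id = client.charge(customer_id, amount_cents)
    except Exception as exc:
        # Unknown outcome: the charge may or may not have landed remotely.
        alert(f"charge for {customer_id} raised {exc!r}; needs manual check")
        raise ChargeNotVerified("charge call failed") from exc

    # Read-back: ask the provider what it actually recorded.
    # If this call itself raises, the exception propagates and no
    # success is logged, which is still the closed direction.
    record = client.get_charge(charge_id)
    confirmed = (
        record is not None
        and record.get("status") == "succeeded"
        and record.get("amount_cents") == amount_cents
    )
    if not confirmed:
        alert(f"charge {charge_id} for {customer_id} not confirmed: {record!r}")
        raise ChargeNotVerified(f"charge {charge_id} not confirmed")

    # Only now is "charged" a statement backed by evidence.
    log.info("charged %s %d cents (charge %s)", customer_id, amount_cents, charge_id)
    return charge_id
```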
The shape to internalize is three lines wide: do_thing(), then assert fetch_state() == expected, then and only then log("done"). The middle line is the one nobody writes and the one that catches the lie. Then hunt your codebase for three patterns and treat all three as bugs on sight: except: pass, a broad except Exception that logs success anyway, and any “done” written without a read-back.
3. Idempotency: it will fire twice, design for it
// THE PRINCIPLE
Anything triggered by a webhook, a timer, a retry, or a queue will eventually fire twice. Not might. Will. So you design every such action to be safe to repeat: a stable dedup key plus a time window, decided up front, not bolted on after the double-charge.
// WHY IT WORKS
This pairs with fail-closed, and it comes from the same valley. Remember the Two Generals: you can’t be sure your message arrived. So what does a sane system do when it’s unsure? It retries. That’s the correct behavior. The retry is the cure for the lost message. But the retry creates a new problem: maybe the first attempt actually landed and only the acknowledgment got lost. Now you’ve done the thing twice.
This is why “exactly-once” is a myth at the level of raw message delivery. You don’t get to choose between “once” and “twice.” Your real menu is at-most-once (never retry, sometimes silently drop the action) or at-least-once (always retry, sometimes do it twice). At-most-once loses things, which is unacceptable for a charge or a serve or an email. So every serious system picks at-least-once. And the way you turn at-least-once back into the effect of exactly-once is idempotency: you make the operation safe to apply repeatedly, so the second, third, and tenth delivery all collapse to the same result as the first.
Idempotence is a precise math word, not a buzzword. An operation is idempotent if applying it twice equals applying it once. Multiplying by one is idempotent: do it a hundred times, same number. Setting a field to a value is idempotent. Adding to a field is not: run it twice and you’ve applied the increase twice. Your job is to take a naturally non-idempotent action (“charge $5”, “send the email”) and wrap it so the system as a whole behaves idempotently. Stripe’s API supports this directly: you pass an idempotency key with a charge, and if the same key shows up again it returns the original result instead of charging twice. You can build the same guard yourself in about ten lines.
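The set-versus-add distinction is easy to check by hand. A tiny illustration, with made-up record fields:

```python
def set_status(record, status):
    """Idempotent: applying it twice gives the same record as applying it once."""
    return {**record, "status": status}


def add_credit(record, cents):
    """Not idempotent: every application changes the balance again."""
    return {**record, "balance": record["balance"] + cents}


account = {"status": "open", "balance": 0}

once = set_status(account, "closed")
twice = set_status(once, "closed")
assert once == twice

once = add_credit(account, 500)
twice = add_credit(once, 500)
assert once != twice  # a duplicate delivery would credit the account twice
```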
// APPLY IT
Pick a key that’s stable across retries of the same logical event — the webhook’s event id, or a hash of the meaningful fields — never something fresh like the current timestamp or a new UUID per call. Record it before you act. If you’ve seen it inside the window, skip. A small TTL store (Redis, a SQLite table, a Postgres row with an expiry) is plenty.
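A minimal sketch of such a guard, using a SQLite table with a primary-key constraint as the store. The table name, the `event["id"]` field, and the 24-hour default window are assumptions for illustration; Redis with SET NX and an expiry would follow the same shape.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the dedup table. SQLite is one choice among several."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_events ("
        "key TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def claim(conn, key, window_seconds=24 * 3600, now=None):
    """Return True if this caller is the first to record `key` in the window."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Drop keys older than the window so they can be claimed again.
        conn.execute(
            "DELETE FROM seen_events WHERE seen_at < ?", (now - window_seconds,)
        )
        try:
            # The PRIMARY KEY constraint makes this insert the atomic check.
            conn.execute(
                "INSERT INTO seen_events (key, seen_at) VALUES (?, ?)", (key, now)
            )
        except sqlite3.IntegrityError:
            return False  # already seen inside the window
    return True


def handle_webhook(conn, event, action):
    """Run `action` once per event id. `event["id"]` is an assumed field."""
    if not claim(conn, f"evt:{event['id']}"):
        return "skipped"
    # The key is recorded before acting. If `action` raises, retries will be
    # skipped; whether to release the key is a per-action decision that this
    # sketch deliberately leaves open.
    action(event)
    return "processed"
```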
One sharp edge worth naming: the atomic insert matters. If you “check, then set” in two steps, two copies of the webhook arriving in the same millisecond both check (neither seen), both set, both proceed — you’ve reintroduced the exact bug you were trying to kill. Use a single atomic operation (SET NX, or a unique-constraint insert that throws on the dup) so exactly one caller wins the race.
The common mistake is a key built from something fresh on every call: datetime.now(), an auto-increment id minted inside the handler. Then every retry gets a brand-new key, every key looks unseen, and the guard never fires. A dedup key that isn’t stable across retries is decorative. The key has to be the same fingerprint the second time the same event shows up.
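Side by side, a stable key and a decorative one might look like this. The field names (`type`, `customer_id`, `amount_cents`) are hypothetical; the point is that delivery metadata such as a receive time stays out of the hash.

```python
import hashlib
import json
import uuid


def stable_key(event):
    """Same event in, same key out, no matter how many times it is delivered."""
    if "id" in event:
        return f"evt:{event['id']}"
    # Fallback: hash only the fields that define the logical event.
    # These field names are hypothetical; pick the ones that matter for you.
    meaningful = {k: event.get(k) for k in ("type", "customer_id", "amount_cents")}
    payload = json.dumps(meaningful, sort_keys=True, separators=(",", ":"))
    return "hash:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def decorative_key(event):
    """The mistake: a fresh value per call, so no retry ever matches."""
    return f"evt:{uuid.uuid4()}"
```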
4. Single-lead orchestration: one brain, many hands
// THE PRINCIPLE
One lead model holds the plan and the context. It never does the grunt work itself. It fans the labor out to specialized workers running in parallel, and it scopes each worker to its own file so the fleet can’t collide.
// WHY IT WORKS
This is the supervisor pattern, and it rests on one of the oldest ideas in software: separation of concerns. Holding the whole plan and doing the detailed work are two different jobs, and they fight each other for the same scarce resource — attention. A model deep in the weeds of one file loses the thread of the overall goal. A model holding the goal can’t also be elbow-deep in four files at once without its focus smearing across all of them. There’s a real limit to how much context any one worker can hold sharply, human or model, and you spend it badly when one entity tries to plan and execute simultaneously. So you split the roles. One context-holder decides what and in what order. A set of workers each do one how and report back. Think of a contractor and his trades: the contractor doesn’t pick up the wrench, and the plumber doesn’t redraw the blueprint. Each holds exactly the context their job needs.
The “one file per worker” rule is the quiet genius, and it comes straight from how you avoid a race condition. If two workers can edit the same lines, you’ve got a merge conflict at best and silent clobbering at worst — one overwrites the other’s work and nobody notices. Give each worker exclusive ownership of its own file and there is nothing to collide on. No shared mutable state, no merge, no race. You didn’t manage the conflict, you designed it out of existence. And the workers stay stateless: they read what they need, do the work, write it back, and forget. The durable state lives in the files and in version control, never in a worker’s head — because a worker’s head evaporates the moment it finishes.
// APPLY IT
When a task is big enough to parallelize, split it by resource, not by vague topic. Assign each unit of work an owner that touches files nobody else touches. Keep the lead out of the implementation. Here’s the shape, stripped to a sketch:
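A sketch of that shape using a thread pool. `WorkUnit` and `run_worker` are hypothetical names: in a real setup `run_worker` would hand one unit to a model or agent, and the lead would produce the list of units.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    worker: str        # which specialist handles it
    path: str          # the one file this worker owns
    instruction: str   # what the lead wants done there


def orchestrate(units, run_worker):
    """Fan work out in parallel; the lead only plans and collects."""
    paths = [u.path for u in units]
    # The ownership rule as code: refuse to run if two workers share a file.
    # (Asserts are stripped under `python -O`; raise an exception instead
    # if you run optimized.)
    assert len(paths) == len(set(paths)), f"file owned twice: {sorted(paths)}"

    with ThreadPoolExecutor(max_workers=max(1, len(units))) as pool:
        # `run_worker` is a hypothetical callable: it takes one WorkUnit,
        # edits only unit.path, and returns a short report string.
        reports = list(pool.map(run_worker, units))

    # Workers keep no state; anything durable must already be on disk or in git.
    return dict(zip(paths, reports))
```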
That assert line is doing real work — it’s the rule “no two workers share a file” turned into code that refuses to run if you broke it. Anything that has to survive goes to disk or to git, because a worker’s memory is gone the second it returns.
5. Sandbox the verifier — it’s an attack surface
// THE PRINCIPLE
Your trust boundary sits wherever you execute someone else’s code. When you let an agent run unsupervised, the gate that checks its work runs the agent’s own tests. That test is now code you’re running. So the test is attack surface, and you lock it down at every layer.
// WHY IT WORKS
Here’s the subtle bug, the kind you only see once you let agents run on their own overnight. The naive way to gate an agent’s fix is: run its tests. But sit with what that means. The agent writes a test, and your gate executes it. Picture the dangerous gate in one line:
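Something like the following, where plain `python <file>` stands in for whatever test runner you use and `naive_gate` is a made-up name:

```python
import subprocess
import sys


def naive_gate(agent_test_path):
    """Run the agent's test file and trust the exit code."""
    # The dangerous line: agent-written code, your process, your environment.
    return subprocess.run([sys.executable, agent_test_path]).returncode == 0
```

Because no `env` argument is passed, the child process inherits every environment variable of the gate, credentials included.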
If the agent writes a test that imports a live module and charges a card, your “safety check” just charged the card. Restricting which tools the agent can call doesn’t save you, because the gate runs the test outside the agent’s sandbox, with your credentials, on your machine.
This is a textbook confused deputy problem. Your gate is the “deputy” — a trusted program with real authority. The agent has no authority of its own, but it can hand the deputy an instruction (a test file) that the deputy then carries out using the deputy’s power. The agent tricks your trusted process into doing the damage on its behalf. It’s the same shape as a classic privilege bug where a low-privilege caller gets a high-privilege service to misuse its access. Once you see your test runner as a deputy that will faithfully execute whatever the agent wrote, you stop trusting the test by default.
The defense is defense in depth: not one wall, but layers, so any single one failing doesn’t end the world. No individual control here is perfect — a static scan can be fooled, a stub env might miss a path. Stacked, they make the worst case a commit I throw away.
// APPLY IT
Replace the one dangerous line with the layered version. Scan the test before you run it, run it with no real credentials, and only on a throwaway branch:
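A sketch of the layered gate, under assumptions: the forbidden-module list, the environment allowlist, and the branch names are placeholders to adapt, the caller passes in the branch name and worktree path, and plain `python <file>` again stands in for the test runner.

```python
import ast
import os
import subprocess
import sys

# Assumption: extend this with the names of your own live modules.
FORBIDDEN_IMPORTS = {"socket", "http", "urllib", "requests", "smtplib", "subprocess"}
PROTECTED_BRANCHES = {"main", "master"}
# Only variables the interpreter may need to start; no secrets pass through.
ENV_ALLOWLIST = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def scan_test(source):
    """Static scan of an agent-written test. Returns a list of problems."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"unparseable test: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in FORBIDDEN_IMPORTS:
                problems.append(f"forbidden import: {name}")
        if isinstance(node, ast.Attribute) and node.attr in {"environ", "getenv"}:
            problems.append("reads the environment")
    # A scan like this can be evaded (for example via __import__), which is
    # why it is one layer and not the only one.
    return problems


def build_stub_env():
    """Environment for the test run, with credentials left out by construction."""
    return {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}


def sandboxed_gate(agent_test_path, worktree, branch):
    """Scan first, then run with a stub env, and never on a protected branch."""
    if branch in PROTECTED_BRANCHES:
        raise PermissionError(f"refusing to gate on protected branch {branch!r}")
    with open(agent_test_path, encoding="utf-8") as fh:
        problems = scan_test(fh.read())
    if problems:
        return False, problems  # rejected before anything executes
    result = subprocess.run(
        [sys.executable, os.path.abspath(agent_test_path)],
        cwd=worktree,
        env=build_stub_env(),
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.returncode == 0, []
```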
In plain terms, the layers are: each agent runs in its own throwaway git worktree with a credential-free stub environment, so there are no live secrets for it to reach. A restricted tool allowlist — edit files and run the test runner, that’s it, no arbitrary shell. A static scan that rejects any agent-written test that imports a live module, touches the network, or reads secrets, before the gate runs it. And work that lands on a dated branch only, never main, never live data, so the worst case is a bad commit thrown away with one command.
That’s the difference between “I gave an AI write access” and a system you can actually sleep through. The agent is powerful inside a box that can’t hurt anything outside it. The general rule outlasts my setup: find the exact line where your process runs input you didn’t write, and treat everything that crosses it as hostile until proven otherwise.
Putting it all together: the full pipeline
Five ideas in isolation are five tips. Wired into one flow they become a machine that turns “I think this works” into “this survived everything I could throw at it.” Here’s the whole pipeline as one pass, with each idea doing its job:
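A compact sketch of that flow, with every stage passed in as a callable so the skeleton runs on its own. The stage names, the dict shapes, and the returned status strings are all assumptions made for illustration.

```python
def run_pipeline(event, *, claim, plan, work, gate, reviewers, verify):
    """One pass through the five ideas. Every stage is an injected callable.

    Assumed shapes (all hypothetical): `event` has an "id"; `plan` returns
    dicts with a "path"; `work` returns a patch; each reviewer returns a
    list of dicts with a "severity"; `gate` and `verify` return booleans.
    """
    # Idea 3: the trigger carries a dedup key, so a second firing stops here.
    if not claim(event["id"]):
        return "skipped-duplicate"

    # Idea 4: the lead plans, and each unit owns a file nobody else touches.
    units = plan(event)
    paths = [u["path"] for u in units]
    assert len(paths) == len(set(paths)), "two workers share a file"

    # Idea 1: each worker returns a fix plus an attack test.
    # (Sequential here for brevity; the real run would be parallel.)
    patches = [work(u) for u in units]

    # Idea 5: agent-written tests only run inside the sandboxed gate.
    if not all(gate(p) for p in patches):
        return "rejected-by-gate"

    # Idea 1 again: adversarial review, and a single P0 is a veto.
    findings = [f for review in reviewers for p in patches for f in review(p)]
    if any(f["severity"] == "P0" for f in findings):
        return "vetoed"  # goes back to be root-caused, not noted for later

    # Idea 2: read the state back, and fail closed if it cannot be confirmed.
    if not verify(patches):
        raise RuntimeError("could not confirm the change committed")
    return "verified"
```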
Walk it once in English. A task comes in. The lead plans and hands each worker a file nobody else owns, so the parallel run has no races to manage (idea 4). Each worker writes its fix and an attack test, not a confirmation test, because it’s trying to break its own work (idea 1). Those tests hit a sandboxed gate: scanned before they run, starved of credentials, executed on a throwaway branch, so a test that reaches for something live dies harmless (idea 5). What survives the gate goes to adversarial reviewers whose default is “this is broken” (idea 1 again), and a single credible P0 vetoes the ship. A veto sends it back to be root-caused now, not noted for later. Only when nothing can break it does the pipeline verify: read the state back, confirm it committed, fail closed if it didn’t (idea 2). And the trigger that kicked the whole thing off carries a dedup key, so if the webhook fires twice the pipeline doesn’t run twice (idea 3).
That’s the composition. Adversarial review finds the bug. Fail-closed refuses to lie about it. Idempotency keeps a retry from doubling it. Orchestration lets the work go fast without colliding. Sandboxing means the whole thing can run while I sleep without burning the house down. Pull any one out and the machine springs a leak exactly where that idea was holding.
The philosophy: hope vs. proof
Strip away the agents and the orchestration and there’s one sentence underneath all of it:
Most software is built on hope. It hopes the API call landed. It hopes the input is shaped the way the happy path assumes. It hopes the retry didn’t double the charge. It hopes the test that passed actually tested the thing. Hope is a fine feeling and a terrible engineering control. Every place you hope, a silent bug can move in, because hope produces no evidence. It just keeps going and assumes the best — which is exactly what Exhibit A did, all the way to the bank.
The flip side is a posture I’ll defend to anyone: “I don’t trust my own code” is not a weakness. It’s a discipline. Philosophers have a name close to it: epistemic humility, knowing the limits of what you actually know. You don’t know your code is right. You can’t. So you stop claiming it and you build the thing that checks. The reviewer that tries to refute you, the read-back that confirms the world changed, the dedup key that assumes the message will arrive twice, the sandbox that assumes the worst: none of that is paranoia. It’s you taking your own uncertainty seriously enough to engineer around it.
Finding problems was never the scary part. Shipping blind is. The standard isn’t “no bugs ever.” That’s a fantasy, and anyone selling it is lying. The standard is “nothing reaches a human unverified.” Building the machine that enforces that automatically, while I sleep, that’s the actual job.
I didn’t learn this in a classroom. I learned it because a quiet bug in software I run my operation on can become a very loud problem at the worst possible moment. When that’s the stakes, you stop hoping your code is right and you build the thing that proves it. You can build it too. Go find one except: pass in your codebase right now and close it. That’s the first brick.