BuildDoing the work

Why Your Coding Agent Can't Be Your Testing Agent

Same-session green tests prove consistency — not that you shipped what the spec asked for.

Serhii MalyshevSoftware architect and tech lead6 min read · Jun 15, 2026

#ArtificialIntelligence #SoftwareEngineering #Cursor #SoftwareTesting #DevOps

Green tests prove the code matches the code. Something else has to prove the code matches the spec.

We ask the coding agent to implement the feature. We ask the same agent to write the tests. We merge when CI goes green. We call it TDD because the tests exist somewhere in the diff.

We are not testing. We are greenwashing our own mistakes.

The failure mode is specific: the agent removes a field from the API response, updates the assertion in the same commit, and the suite celebrates. The mobile client you do not have open crashes on undefined. The test did not miss the bug — it ratified the breaking change. That is not a flaky pipeline. That is Confirmation-Driven Development — tests written to validate the agent's version of reality, not the specification.

If your agent workflow treats a green run as "done," you have outsourced verification to the thing that most needed verifying.

That's the gap.

Confirmation-Driven Development — green CI is not a spec check

Kent Beck's TDD locked intent early: write the test from the requirement, watch it fail, then write code until the test passes. The test was the spec frozen in code.

Agentic coding inverts the incentive. The agent wants task completion — please the operator, close the loop, signal success. Tests become proof-of-work, not proof-of-correctness. Adrian Hall names the result Confirmation-Driven Development: exercises written so the code you already wrote gets a workout, not a cross-examination.

You see the symptoms everywhere. The agent "fixes" a failing test by loosening the assertion. Coverage climbs. Production still breaks on edge cases nobody asserted independently. PR review comments cluster on style and null checks, not "this violates requirement X." Requirement X lives in a ticket the reviewer never opened.

The agent is not lazy. It is coherent.

Same context window, same assumptions, same blind spots in both artifacts.

The Independence Wall — shared context is the bug

Call it the Independence Wall: verification is only trustworthy when it cannot see what it verifies.

Research on LLM self-verification is blunt. Valmeekam et al. found performance collapse when GPT-4 critiqued its own answers on formal reasoning tasks — while sound external verifiers produced large gains. The classical argument that verification is easier than generation does not transfer when the model is doing approximate retrieval, not proof. Ling et al. show the same fragility on logical fallacy detection — self-check is not a general-purpose QA department.

Code is the same shape of problem with better costumes.

Huang et al.'s AgentCoder setup quantified it: when one agent writes implementation and tests together, test accuracy sits around 61%. Split test design into an agent that sees the spec only — never the code — and accuracy jumps to roughly 88%. The test writer stopped inheriting the programmer's mental model.

That is the test oracle problem in one sprint. The oracle — the expected value your assertion checks against — got copied from the implementation. Wrong code, wrong oracle, green bar.

Line coverage measures execution, not protection. A test can touch every branch and survive every injected bug if the assertions mirror the code's mistakes.

Same model, same weights, opposed jobs can still work — because the wall is informational, not parametric.

Give the test agent the spec and a penalty for missed bugs. Give the coding agent error output only, not test source. When they disagree, a human picks the winner. That is not theory. It is how you rebuild what pre-agent QA departments used to do.

Three ways to split builder and tester

You do not need a research lab. You need a boundary the agent cannot casually cross.

Pattern A — The Fresh-Session Split (single-agent tools)

In Cursor or Claude Code, stop asking for "code and tests" in one thread. Sequence it:

Session one — paste the spec only. Output failing tests. No implementation files in context.
Session two — paste the test file only. Implement until green. No spec, no permission to edit tests.
Orchestrator (you, or a script) — run the suite. Feed stderr back to session two. Repeat.

Markaicode's Agent A / Agent B split is the same contract with explicit roles. The tests become the contract because the coder literally cannot see the requirement document — only the assertions.

Leaky? Yes — and say that out loud before you bet a release on it. A model with repo access can still open tests/ unless you block it. Pattern A is a discipline hack for prototypes; production paths deserve Pattern B or C.

Pattern B — The Worktree Wall (Red/Green teams)

For production paths, physical isolation beats politeness. The spec-impl-tdd pattern runs a Red team in one worktree (tests from spec, no impl visibility) and a Green team in another (impl from failing output, no assertion visibility). A QA agent or human adjudicates when tests and code disagree against the spec — not against whichever side wants to go home.

Cost: tokens and time roughly double, plus the human minutes when Red and Green disagree. Pay it when the feature has a spec worth enforcing — payment flows, auth boundaries, anything a tautological unit test has already lied about once. Skip it for throwaway scripts and one-off data transforms.

Worked example: merge_sorted_lists(a, b) -> list[int], sorted merge in O(n+m). A coupled agent tests happy path. A spec-only test agent adds merge_sorted_lists([], []), single-element lists, and duplicate-heavy inputs — cases the programmer-coupled writer skips because it already "knows" the loop structure. Independence buys edge cases for free.

Pattern C — The Filesystem Judge (roles + human arbiter)

The claude-code-tdd harness hard-codes the wall: tester writes only under test/, coder writes everywhere except test/ and role prompts, existing tests are read-only. A reviewer agent stages work; you approve functional prompts once.

TruLayer's production agent team generalizes the idea: eleven agents, each with authorized write directories — backend engineer touches backend/, QA writes tests, nobody merges their own PR. Orchestration is the bottleneck, not raw codegen skill.

Your minimum viable version: two markdown agent definitions in .cursor/agents/ or Claude's agent folder — Builder and Tester — with explicit "must not" lists. Human steps in when they fight. That disagreement is signal, not noise.

Spot the theater in sixty seconds

No Stryker install required.

Comment-out test. Pick a function with business logic. Delete one line. Run the suite. If nothing fails, your tests assert presence, not behavior.

Boundary flip. Change >= to > on a tier threshold. Green suite means the boundary was never tested independently.

Same-commit smell. If the test file and implementation changed together in one agent pass, treat every new assertion as guilty until it cites a requirement outside the diff.

Can you state the expected output without opening the source? If the answer requires reading the implementation, the oracle came from the code.

One minute. Brutal.

Worth it before you trust the dashboard.

What to keep in the stack

This is a yes-and article.

Keep AI test generation — it is fast and covers execution paths humans skip. Keep AI review for null dereferences, injection shapes, and obvious structural rot. Add what those layers structurally cannot provide: an external reference — spec, user flow, characterization of old behavior, property test, or E2E against a running preview with expectations formed from product language, not function output.

AgentCoder's lesson is restraint, not sprawl: three tight roles beat five generalist agents on benchmarks at lower token cost. You are not building a company of bots. You are building a wall.

The coding agent can write your tests. It just cannot be the agent that wrote the code — not in the same context, not in the same commit, not with the same oracle. Separate the builder from the tester, oppose their incentives, and judge the fights yourself.

Next time you run merge_sorted_lists, watch which list the spec-only agent tests first: empty inputs, or the path the implementation already took? That gap is where your confidence was hiding.