Engineering with AI agents · field guide
Agentic Loop
quality loops for shipping with AI agents
The goal, plain and simple: ship quality products by running quality loops and delegating as much as possible to agentic loops.
What is an agentic quality loop?
A quality loop is a cycle where one model builds, a model from a different family reviews it adversarially, a runtime gate verifies it against a real environment, and the human only steps in at the irreversible gates. The core idea: stop prompting every step by hand and design the loop that prompts your agents. Three things lift quality above an ordinary review: a reviewer of a different lineage (it doesn't share your blind spots), real runtime verification (diff-correct ≠ works), and structured findings (data, not prose).
1 · The why — the bottleneck is the human orchestrating
A manual adversarial loop already produces solid code, but orchestration costs human time in the seams: build, write the review prompt, confront, decide GO, and re-invoke between steps. You don't lose quality; you lose your time.
The painful evidence (anonymized): a change survived ~20 rounds of adversarial review and still failed in production, because no round ever executed the flow against the real API/environment (a provider permission wasn't approved). The review looks at the diff, not the runtime. Not a model failure — a process gap.
2 · The method
build → verify locally → adversarial review (different model) → confront/fix → re-review → runtime smoke vs real env → GO/NO-GO → integrate → deploy → next phase
- 1. Build → verify locally (build / typecheck / tests / lint).
- 2. Review adversarially with a model from a different family — eyes that don't share your blind spots.
- 3. Runtime smoke against a real environment (staging/preview). Separate gate from the diff: a failure here is NO-GO even if the diff is perfect.
- 4. GO/NO-GO with judgment: reachable defects are fixed; residual issues degrade safely (reviewable or idempotent, never producing wrong data) and are documented and backstopped.
- 5. Integrate → deploy → next phase. Human only on the irreversible.
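The steps above can be sketched as an ordered list of gates where the first failure yields NO-GO. This is a minimal illustration, not the skill's actual implementation: `GateResult`, `run_phase`, and the stubbed gate names are hypothetical, and real gates would shell out to the project's verify command, the reviewer, and a smoke test against staging.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


# Each gate is a zero-argument callable that returns True on success.
Gate = Callable[[], bool]


def run_phase(gates: List[Tuple[str, Gate]]) -> Tuple[str, List[GateResult]]:
    """Run gates in order and stop at the first failure (NO-GO)."""
    results: List[GateResult] = []
    for name, gate in gates:
        try:
            passed, detail = bool(gate()), ""
        except Exception as exc:  # a crashing gate counts as a failed gate
            passed, detail = False, repr(exc)
        results.append(GateResult(name, passed, detail))
        if not passed:
            return "NO-GO", results
    return "GO", results


# Stubbed gates: the diff passes review, but the runtime smoke fails.
demo_gates = [
    ("verify_locally", lambda: True),
    ("adversarial_review", lambda: True),
    ("runtime_smoke", lambda: False),
]
decision, results = run_phase(demo_gates)
print(decision, [r.name for r in results if not r.passed])
# prints: NO-GO ['runtime_smoke']
```

Keeping the runtime smoke as its own gate, after the review, is what makes a perfect diff insufficient for GO.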
3 · State of the art
How to work
Design the loop, not the prompt
- Delegate the "after". Run the server, verify, commit, push, PR, pull comments and fix, re-review, merge, next — all delegable. That's where the value is.
- Don't look at the code too early. Let another agent review it before you do; come in last.
- Loops that create loops; the shape follows the work. Don't hard-code a persona zoo; let the problem dictate the structure.
- Isolate (worktrees) so concurrent loops don't collide.
- Linear goal vs dynamic workflow — pick per task.
- Confront, don't obey — verify every finding against real code.
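One cheap way to "confront" a finding before acting on it is to check that it points at code that actually exists. The sketch below is an assumption about how that could look: `finding_is_grounded` and its parameters are hypothetical names, and a real check would go further (for example, by reproducing the defect).

```python
from pathlib import Path
import tempfile


def finding_is_grounded(repo_root: Path, file: str, line: int, quoted: str = "") -> bool:
    """Return True only if the finding points at a real line in the working tree."""
    path = repo_root / file
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    if not 1 <= line <= len(lines):
        return False
    # If the reviewer quoted code, it must actually appear on that line.
    return quoted in lines[line - 1]


# Demo against a throwaway directory standing in for the repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text("import os\nprint(os.getcwd())\n", encoding="utf-8")
    real = finding_is_grounded(root, "app.py", 2, "os.getcwd()")
    stale = finding_is_grounded(root, "app.py", 9)
    missing = finding_is_grounded(root, "gone.py", 1)

print(real, stale, missing)
# prints: True False False
```

Findings that fail this check are candidates for the false-positive bucket rather than for a fix.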
What to aim at
Think bigger
- Experimentation is cheap now. As the cloud made scaling cheap, AI makes writing code cheap → bigger bets.
- You can go horizontal: cover the whole range with features that are functional but simple, and add extensibility so users can go deep.
- Stop building glue. Fixing seams one after another is living in the margins; reinvent the whole piece when it makes sense.
- Don't just automate the old work — enable new work. Push until you hit the wall; it's farther than you think.
Synthesis: quality loops are the production system; ambition is where you aim them. One without the other falls short.
4 · Capabilities people under-use
OpenAI Codex (CLI)
The right engine
codex exec is the automation primitive (rather than headless codex review, which is not first-class yet).
--output-schema: findings as structured JSON (severity/file:line/status/fix). The key upgrade.
-s read-only · -a never · codex exec - (prompt via stdin).
- "Skills define the method, automations define the schedule."
- Subagents: agents.max_depth=1 by default (raising it = costly fan-out).
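To make "findings as structured JSON" concrete, here is a sketch of a schema covering the severity/file:line/status/fix fields listed above, plus a small stdlib-only check of reviewer output. The enum values, the top-level `findings` key, and the `check_findings` helper are assumptions for illustration; the exact schema the tool accepts should be taken from its own documentation.

```python
import json

# Hypothetical schema; the field names follow the guide, the enum values are assumed.
FINDINGS_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["blocker", "major", "minor"]},
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "status": {"type": "string", "enum": ["open", "fixed", "rejected"]},
                    "fix": {"type": "string"},
                },
                "required": ["severity", "file", "line", "status", "fix"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}


def check_findings(raw: str) -> list:
    """Parse reviewer output and reject items that do not match the expected shape."""
    data = json.loads(raw)
    item_schema = FINDINGS_SCHEMA["properties"]["findings"]["items"]
    for item in data["findings"]:
        missing = [key for key in item_schema["required"] if key not in item]
        if missing:
            raise ValueError(f"finding missing fields: {missing}")
        for key in ("severity", "status"):
            if item[key] not in item_schema["properties"][key]["enum"]:
                raise ValueError(f"unexpected {key}: {item[key]!r}")
    return data["findings"]


sample = json.dumps({"findings": [{
    "severity": "major", "file": "api/client.py", "line": 42,
    "status": "open", "fix": "handle the 403 from the provider",
}]})
findings = check_findings(sample)
print(len(findings), findings[0]["severity"])
# prints: 1 major
```

Once findings are data, a re-review can diff them against the previous round instead of re-reading prose.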
Claude Code / Agent SDK
Commonly under-used
- Headless: claude -p + --json-schema, --bare (reproducible CI), --resume (stateful loops), --permission-mode dontAsk.
- Native orchestration: multi-agent workflows (pipeline/parallel, schema, loop-until-dry, adversarial verify), subagents, background monitors, scheduled wake-ups.
- Hooks (pre/post-tool) as deterministic gates; cloud multi-agent review.
Insight: much of what people coordinate "by hand" with threads already exists natively in these tools. Don't invent infrastructure — use what's there.
5 · The autonomy ladder
Guiding principle: autonomy scales with reversibility, not with model quality. Self-drive where a mistake degrades safe; human-gate where it's irreversible.
- L0: The agent builds + reviews-with-another-model + decides GO per phase. The human: mission + irreversible gate + supervision.
- L1: The skill runs phase→phase without human re-invocation (notifies / interruptible) + runtime gate. The agent decides GO.
- L2: Dynamic multi-lens review + auto-merge to staging for the safe class (additive / behind a flag / reversible).
- L3: Self-served missions (pull from a queue/backlog on a schedule). The human = exception handler.
- NEVER: Production / money / destructive data / migrations: human-gated by policy, not by incapacity.
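The guiding principle can be written down as a small policy function: the gate depends on what the change touches, not on which model produced it. This is a sketch under assumptions: the `Change` fields and the gate labels are hypothetical, and treating "additive / behind a flag / reversible" as any-of is one possible reading of the ladder above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    touches_production: bool = False
    touches_money: bool = False
    destructive_data: bool = False
    is_migration: bool = False
    additive: bool = False
    behind_flag: bool = False
    reversible: bool = False


def required_gate(change: Change) -> str:
    """Map a change to the gate it needs, checking the irreversible class first."""
    if (change.touches_production or change.touches_money
            or change.destructive_data or change.is_migration):
        return "human"  # gated by policy, regardless of model quality
    if change.additive or change.behind_flag or change.reversible:
        return "auto-merge-staging"  # the safe class
    return "agent-go"  # the agent decides GO, but nothing merges automatically


print(required_gate(Change(is_migration=True, reversible=True)))
print(required_gate(Change(behind_flag=True)))
print(required_gate(Change()))
# prints: human, then auto-merge-staging, then agent-go (one per line)
```

Checking the irreversible class first means a "reversible" label can never override the human gate.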
6 · The skill
Installable as a Claude Code skill — /agentic-loop. It discovers the project's verify command, diff base, and runtime target, then runs the loop. See QUALITY_LOOPS.md and SKILL.md.
| Piece | What it does |
| --- | --- |
| Engine | Reviewer of a different family via codex exec (or claude -p + --json-schema), adversarial prompt via stdin, read-only, background. |
| Findings | Structured JSON via --output-schema → re-review with memory. |
| Synthesis | Triaged: real / false-positives and why / deferred with backstop — not the raw dump. |
| Runtime gate | Smoke against a real environment. NO-GO even if the diff is perfect. |
| Shape | Dynamic per problem — no fixed persona panel. |
| Decision | The agent decides GO; the human holds the irreversible gate. Once predictable → schedule it. |
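The "Synthesis" row can be illustrated as a triage step that sorts confronted findings into three buckets and refuses entries that lack their justification. The verdict names, the `why` and `backstop` keys, and the `synthesize` helper are assumptions for this sketch, not the skill's actual data model.

```python
VERDICTS = ("real", "false_positive", "deferred")


def synthesize(triaged: list) -> dict:
    """Group triaged findings by verdict, enforcing a reason or backstop where needed."""
    buckets = {verdict: [] for verdict in VERDICTS}
    for item in triaged:
        verdict = item.get("verdict")
        if verdict not in buckets:
            raise ValueError(f"unknown verdict: {verdict!r}")
        if verdict == "false_positive" and not item.get("why"):
            raise ValueError("a false positive must say why it was rejected")
        if verdict == "deferred" and not item.get("backstop"):
            raise ValueError("a deferred finding must name its backstop")
        buckets[verdict].append(item)
    return buckets


triaged = [
    {"id": 1, "verdict": "real"},
    {"id": 2, "verdict": "false_positive", "why": "input is validated upstream"},
    {"id": 3, "verdict": "deferred", "backstop": "alert on retry count"},
]
summary = synthesize(triaged)
print({verdict: len(items) for verdict, items in summary.items()})
# prints: {'real': 1, 'false_positive': 1, 'deferred': 1}
```

Rejecting a deferred finding without a backstop keeps "deferred" from quietly becoming "ignored".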
7 · 12 principles for quality loops
- Delegate the "after". The value is automating what you do after prompting, not the prompt.
- Don't look at the code too early. Let another agent review it before you do.
- Two model families > one. The reviewer must be a different lineage than the builder.
- Diff-correct ≠ works. Always a real runtime gate.
- Findings as data, not prose: --output-schema / --json-schema.
- Dynamic shape, not a persona zoo. Let the problem dictate structure.
- Isolate so loops don't collide.
- Confront, don't obey. Verify every finding against real code.
- Autonomy = reversibility. Human gate only on the irreversible.
- Treat the limit as a challenge. On a subscription, loop freely; on metered API usage, measure first.
- Skill = method; automation = schedule. In that order.
- Aim at something that seems impossible. The wall is farther than you think.
8 · Your next loop — quick guide
When you finish prompting, instead of reading the code, ask:
- Did I run build/typecheck/tests/lint? → delegate it.
- Did another model review it adversarially? → don't read it yet.
- Did I exercise it against a real environment? → the gate that matters.
- Do the residuals degrade safe and are they documented? → GO.
- Is it production/money/destructive-data/migration? → here's where you step in.
- Did I repeat this flow 2-3 times? → make it a skill. Already predictable? → put it on a schedule.
The habit that moves the needle most: when an agent finishes, don't open the editor — ask it "can you do the next step yourself?" and watch how far it gets. It'll surprise you.