Engineering with AI agents · field guide

Agentic Loop
quality loops for shipping with AI agents

The goal, plain and simple: ship quality products by running quality loops and delegating as much as possible to agentic loops.

What is an agentic quality loop?

A quality loop is a cycle where one model builds, a model from a different family reviews it adversarially, a runtime gate verifies it against a real environment, and the human only steps in at the irreversible gates. The core idea: stop prompting every step by hand and design the loop that prompts your agents. Three things lift quality above an ordinary review: a reviewer of a different lineage (it doesn't share your blind spots), real runtime verification (diff-correct ≠ works), and structured findings (data, not prose).

1 · The why — the bottleneck is the human orchestrating

A manual adversarial loop already produces solid code, but orchestration costs human time in the seams: build, write the review prompt, confront, decide GO, and re-invoke between steps. You don't lose quality; you lose your time.

The painful evidence (anonymized): a change survived ~20 rounds of adversarial review and still failed in production, because no round ever executed the flow against the real API/environment (a provider permission wasn't approved). The review looks at the diff, not the runtime. Not a model failure — a process gap.
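The gate that was missing in that story can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the guide's own implementation: the helper names `probe` and `smoke_verdict`, the probe labels, and the rule "any non-2xx status is NO-GO" are all hypothetical choices.

```python
import urllib.error
import urllib.request
from typing import Dict, List, Tuple


def probe(url: str, timeout: float = 5.0) -> int:
    """GET a URL in the real environment and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise, but they still carry the status we want.
        return err.code


def smoke_verdict(statuses: Dict[str, int]) -> Tuple[str, List[str]]:
    """Turn named probe results into GO / NO-GO plus the failing probes."""
    failures = [
        f"{name}: HTTP {code}"
        for name, code in statuses.items()
        if not 200 <= code < 300
    ]
    return ("NO-GO" if failures else "GO", failures)
```

A missing provider permission would typically surface here as a 401 or 403 on the affected flow, which no amount of diff review can reveal. Connection-level errors (DNS, timeouts) are deliberately not caught in `probe`; letting them raise also stops the loop.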

2 · The method

build → verify locally → adversarial review (different model) → confront/fix → re-review → runtime smoke vs real env → GO/NO-GO → integrate → deploy → next phase

1. Build → verify locally (build / typecheck / tests / lint).
2. Review adversarially with a model from a different family — eyes that don't share your blind spots.
3. Runtime smoke against a real environment (staging/preview). Separate gate from the diff: a failure here is NO-GO even if the diff is perfect.
4. GO/NO-GO with judgment: reachable defects are fixed; residuals degrade safe (reviewable or idempotent, never silently producing wrong data), are documented, and have a backstop.
5. Integrate → deploy → next phase. Human only on the irreversible.
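The five steps above can be sketched as one phase runner. This is a simplified sketch, not the skill's actual code: every step is an injected callable, the names `run_phase` and `Verdict` and the `max_rounds` limit are assumptions, and residual triage is left out (any unresolved finding is treated as NO-GO).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    decision: str                                   # "GO" or "NO-GO"
    reasons: List[str] = field(default_factory=list)


def run_phase(
    build: Callable[[], None],
    verify_locally: Callable[[], bool],             # build / typecheck / tests / lint
    review: Callable[[], List[str]],                # open findings from the other-family reviewer
    fix: Callable[[List[str]], None],
    runtime_smoke: Callable[[], bool],              # exercised against staging/preview
    max_rounds: int = 3,
) -> Verdict:
    build()
    if not verify_locally():
        return Verdict("NO-GO", ["local verification failed"])

    findings = review()
    rounds = 0
    while findings and rounds < max_rounds:
        fix(findings)
        if not verify_locally():
            return Verdict("NO-GO", ["local verification failed after fix"])
        findings = review()                         # re-review after every fix
        rounds += 1

    if findings:
        return Verdict("NO-GO", [f"unresolved finding: {f}" for f in findings])

    # Separate gate: a clean diff does not imply a working system.
    if not runtime_smoke():
        return Verdict("NO-GO", ["runtime smoke failed against the real environment"])
    return Verdict("GO")
```

Passing the steps in as callables keeps the loop itself independent of which CLI or model performs each one, so the builder and reviewer can be swapped without touching the control flow.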

3 · State of the art

How to work

Design the loop, not the prompt

Delegate the "after". Run the server, verify, commit, push, PR, pull comments and fix, re-review, merge, next — all delegable. That's where the value is.
Don't look at the code too early. Let another agent review it before you do; come in last.
Let loops create loops, and let the shape follow the work. Don't hard-code a persona zoo; let the problem dictate the structure.
Isolate (worktrees) so concurrent loops don't collide.
Linear goal vs dynamic workflow — pick per task.
Confront, don't obey — verify every finding against real code.
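"Confront, don't obey" can start with a cheap mechanical check before any judgment is spent: does the finding point at code that actually exists? A minimal sketch, assuming each finding carries a file, a line number, and a quoted snippet (the function name and that finding shape are hypothetical):

```python
from pathlib import Path


def finding_is_grounded(repo_root: Path, file: str, line: int, quoted: str) -> bool:
    """True only if the file exists inside the repo, the line is in range,
    and the snippet the reviewer quoted really appears on that line."""
    root = repo_root.resolve()
    path = (root / file).resolve()
    # Reject paths that escape the repository or do not exist.
    if root not in path.parents or not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    if not 1 <= line <= len(lines):
        return False
    snippet = quoted.strip()
    return bool(snippet) and snippet in lines[line - 1]
```

A finding that fails this check is a likely hallucination and can be rejected with a recorded reason. One that passes still needs a real read of the code; the check only filters out the ungrounded ones.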

What to aim at

Think bigger

Experimentation is cheap now. Just as the cloud made scaling cheap, AI makes writing code cheap, which makes bigger bets affordable.
You can go horizontal: cover the whole range with features that are functional but simple, and add extensibility so users can go deep.
Stop building glue. Fixing seams one after another is living in the margins; reinvent the whole piece when it makes sense.
Don't just automate the old work — enable new work. Push until you hit the wall; it's farther than you think.

Synthesis: quality loops are the production system; ambition is where you aim them. One without the other falls short.

4 · Capabilities people under-use

OpenAI Codex (CLI)

The right engine

codex exec is the automation primitive; headless codex review is not first-class yet.
--output-schema: findings as structured JSON (severity/file:line/status/fix). The key upgrade.
-s read-only · -a never · codex exec - (prompt via stdin).
"Skills define the method, automations define the schedule."
Subagents: agents.max_depth=1 by default (raising it = costly fan-out).
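A sketch of how the flags above might be assembled from a script. The schema fields mirror the severity/file:line/status/fix shape mentioned above, but the enum values, the `findings` wrapper, the file name, and both helper names are assumptions. The command uses only flags listed in this section and leaves out `-a never`; confirm placement and spelling against `codex exec --help` for your installed version.

```python
import json
from pathlib import Path
from typing import List

# Hypothetical JSON Schema for reviewer output; adjust enums to your own triage vocabulary.
FINDINGS_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["blocker", "major", "minor"]},
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "status": {"type": "string", "enum": ["open", "fixed", "rejected"]},
                    "fix": {"type": "string"},
                },
                "required": ["severity", "file", "line", "status", "fix"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}


def write_schema(directory: Path) -> Path:
    """Write the schema to disk so it can be passed by path."""
    path = directory / "findings.schema.json"
    path.write_text(json.dumps(FINDINGS_SCHEMA, indent=2), encoding="utf-8")
    return path


def review_command(schema_path: Path) -> List[str]:
    """Build the argv for a read-only review; the trailing "-" means the prompt comes via stdin."""
    return ["codex", "exec", "-s", "read-only", "--output-schema", str(schema_path), "-"]
```

The resulting list can be handed to `subprocess.run(cmd, input=prompt, text=True, capture_output=True)`, with the adversarial prompt supplied on stdin.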

Claude Code / Agent SDK

Commonly under-used

Headless claude -p + --json-schema, --bare (reproducible CI), --resume (stateful loops), --permission-mode dontAsk.
Native orchestration: multi-agent workflows (pipeline/parallel, schema, loop-until-dry, adversarial verify), subagents, background monitors, scheduled wake-ups.
Hooks (pre/post-tool) as deterministic gates; cloud multi-agent review.
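A pre-tool hook used as a deterministic gate might look like the sketch below. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input` fields, and that exit code 2 blocks the tool call and returns stderr to the agent; verify both against the hooks reference for your version. The blocked fragments are a hypothetical policy, not a recommended list.

```python
import json
import sys
from typing import Tuple

# Hypothetical policy: shell fragments the agent may never run unattended.
BLOCKED_FRAGMENTS = ("git push --force", "rm -rf", "drop table")


def decide(event: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for one pre-tool event."""
    if event.get("tool_name") != "Bash":
        return 0, ""                                # only shell commands are gated here
    tool_input = event.get("tool_input") or {}
    command = str(tool_input.get("command", "")).lower()
    for fragment in BLOCKED_FRAGMENTS:
        if fragment in command:
            return 2, f"blocked by policy: command contains '{fragment}'"
    return 0, ""


def main() -> int:
    """Read one event from stdin, print the reason on stderr, return the exit code."""
    event = json.load(sys.stdin)
    code, message = decide(event)
    if message:
        print(message, file=sys.stderr)
    return code
```

To use it as a script, end the file with `sys.exit(main())`. The point of the design is that the gate is plain code: it gives the same answer every time, regardless of how persuasive the agent's reasoning is.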

Insight: much of what people coordinate "by hand" with threads already exists natively in these tools. Don't invent infrastructure — use what's there.

5 · The autonomy ladder

Guiding principle: autonomy scales with reversibility, not with model quality. Self-drive where a mistake degrades safe; human-gate where it's irreversible.

The agent builds + reviews-with-another-model + decides GO per phase. The human: mission + irreversible gate + supervision.

The skill runs phase→phase without human re-invocation (notifies / interruptible) + runtime gate. The agent decides GO.

Dynamic multi-lens review + auto-merge to staging for the safe class (additive / behind a flag / reversible).

Self-served missions (pull from a queue/backlog on a schedule). The human = exception handler.

NEVER

Production / money / destructive data / migrations: human-gated by policy, not by incapacity.
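The ladder reduces to a policy function keyed on reversibility rather than on model quality. A minimal sketch, assuming a change can be described by the areas it touches and three safety flags; the `Change` type, the area names, and the returned labels are hypothetical:

```python
from dataclasses import dataclass

# Areas that are human-gated by policy, whatever the agent's capability.
HUMAN_GATED = {"production", "money", "destructive_data", "migration"}


@dataclass(frozen=True)
class Change:
    touches: frozenset           # areas the change affects, e.g. frozenset({"ui"})
    additive: bool = False
    behind_flag: bool = False
    reversible: bool = False


def gate(change: Change) -> str:
    """Decide who holds the gate for this change."""
    if change.touches & HUMAN_GATED:
        return "human"                        # irreversible class: never self-driven
    if change.additive or change.behind_flag or change.reversible:
        return "auto-merge-staging"           # safe class: a mistake degrades safe
    return "agent-go-with-supervision"        # agent decides GO, human supervises
```

The irreversible check runs first on purpose: a migration that claims to be reversible is still routed to a human.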

6 · The skill

Installable as a Claude Code skill — /agentic-loop. It discovers the project's verify command, diff base, and runtime target, then runs the loop. See QUALITY_LOOPS.md and SKILL.md.

Piece	What it does
Engine	Reviewer of a different family via `codex exec` (or `claude -p` + `--json-schema`), adversarial prompt via stdin, read-only, background.
Findings	Structured JSON via `--output-schema` → re-review with memory.
Synthesis	Triaged: real / false-positives and why / deferred with backstop — not the raw dump.
Runtime gate	Smoke against a real environment. NO-GO even if the diff is perfect.
Shape	Dynamic per problem — no fixed persona panel.
Decision	The agent decides GO; the human holds the irreversible gate. Once predictable → schedule it.
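The Findings and Synthesis rows can be illustrated with two small helpers. This is a sketch under assumptions: the `triage`, `why`, and `backstop` field names and the matching of findings by (file, line) are hypothetical, chosen only to show triage buckets and re-review with memory.

```python
from typing import Dict, List


def drop_known_false_positives(current: List[dict], memory: List[dict]) -> List[dict]:
    """Re-review with memory: skip findings already rejected in an earlier round."""
    known = {
        (f["file"], f["line"])
        for f in memory
        if f.get("triage") == "false_positive"
    }
    return [f for f in current if (f["file"], f["line"]) not in known]


def synthesize(findings: List[dict]) -> Dict[str, List[dict]]:
    """Bucket triaged findings; refuse entries that lack their justification."""
    buckets: Dict[str, List[dict]] = {"real": [], "false_positive": [], "deferred": []}
    for f in findings:
        triage = f.get("triage")
        if triage not in buckets:
            raise ValueError(f"untriaged finding: {f}")
        if triage == "false_positive" and not f.get("why"):
            raise ValueError("a false positive needs a recorded reason")
        if triage == "deferred" and not f.get("backstop"):
            raise ValueError("a deferred finding needs a backstop")
        buckets[triage].append(f)
    return buckets
```

Raising on a missing reason or backstop is what keeps the synthesis from degrading into a raw dump: nothing can be dismissed or postponed without saying why.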

7 · 12 principles for quality loops

Delegate the "after". The value is automating what you do after prompting, not the prompt.
Don't look at the code too early. Let another agent review it before you do.
Two model families > one. The reviewer must be a different lineage than the builder.
Diff-correct ≠ works. Always a real runtime gate.
Findings as data, not prose. --output-schema / --json-schema.
Dynamic shape, not a persona zoo. Let the problem dictate structure.
Isolate so loops don't collide.
Confront, don't obey. Verify every finding against real code.
Autonomy = reversibility. Human gate only on the irreversible.
Treat the limit as a challenge. On a subscription, loop freely; on a metered API, measure cost first.
Skill = method; automation = schedule. In that order.
Aim at something that seems impossible. The wall is farther than you think.

8 · Your next loop — quick guide

When you finish prompting, instead of reading the code, ask:

Did I run build/typecheck/tests/lint? → delegate it.
Did another model review it adversarially? → don't read it yet.
Did I exercise it against a real environment? → the gate that matters.
Do the residuals degrade safe and are they documented? → GO.
Is it production/money/destructive-data/migration? → here's where you step in.
Did I repeat this flow 2-3 times? → make it a skill. Already predictable? → put it on a schedule.

The habit that moves the needle most: when an agent finishes, don't open the editor — ask it "can you do the next step yourself?" and watch how far it gets. It'll surprise you.