# Quality Loops

An essay on, and justification for, the agentic-loop method for shipping with AI coding agents.
The goal, plain and simple: **ship quality products, by running quality loops, and delegating as much as possible to agentic loops.**

> Generic and reusable in any project. Assumes nothing about a specific repo.

---

## Abstract

A **quality loop** is a cycle where one model *builds*, a model from a **different family** *reviews it adversarially*, a **runtime gate** verifies it against a real environment, and the human only steps in at the **irreversible gates**. The core idea: stop prompting every step by hand and **design the loop that prompts your agents**. Three things lift quality above an ordinary review: a **reviewer of a different lineage** (it doesn't share your blind spots), **real runtime verification** (diff-correct ≠ works), and **structured findings** (data, not prose). The aim isn't "make the agent stop asking permission" — it's to **take the human out of every seam of the work except the irreversible ones** (production, money, destructive data, migrations).

---

## 1. The problem: the bottleneck is the human orchestrating

A manual adversarial loop already produces solid code. But orchestration costs **human time in the seams**: build, write the review prompt, confront findings, decide GO, and re-invoke between steps. You don't lose quality — you lose your time.

And there's a deeper hole: **the review looks at the diff, not the runtime.**

> **Real case (anonymized):** a change survived ~20 rounds of adversarial review and *still* failed in production, because no round ever *executed* the flow against the real external API/environment (a provider permission wasn't approved). Diff-correct ≠ works. Not a model failure — a **process gap**.

## 2. What a quality loop is (the method)

1. **Build** → verify locally (build / typecheck / tests / lint).
2. **Review adversarially** with a model from a **different family** than the builder. Its value is being a pair of eyes that *doesn't share your blind spots*.
3. **Runtime smoke against a real environment** (staging/preview): exercise the actual flow. This gate is **separate** from the diff review: a failure here is NO-GO even if the diff is perfect.
4. **GO / NO-GO with judgment:** advance when every reachable defect is fixed and the residuals **degrade safe** (to a reviewable / idempotent state, **never to wrong data**), documented and backstopped.
5. **Integrate → deploy → next phase.** The human steps in only at the **irreversible gates**.
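
The steps above can be sketched as a small driver. This is a minimal illustration, not a reference implementation: `Finding`, `LoopResult`, `run_quality_loop`, and the `degrades_safe` flag are hypothetical names, and the build, review, fix, and smoke steps are injected as plain callables so the sketch makes no assumption about which tools run them. It also simplifies step 4 by treating any finding that does not degrade safe as blocking.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    severity: str                # hypothetical scale, e.g. "high" / "medium" / "low"
    location: str                # "file:line"
    summary: str
    degrades_safe: bool = False  # True if the residual fails to a reviewable/idempotent state


@dataclass
class LoopResult:
    verdict: str                 # "GO" or "NO-GO"
    rounds: int
    residuals: List[Finding]


def run_quality_loop(
    build: Callable[[], bool],
    review: Callable[[], List[Finding]],
    fix: Callable[[List[Finding]], None],
    smoke: Callable[[], bool],
    max_rounds: int = 5,
) -> LoopResult:
    residuals: List[Finding] = []
    for round_no in range(1, max_rounds + 1):
        # Step 1: build and verify locally (build / typecheck / tests / lint).
        if not build():
            return LoopResult("NO-GO", round_no, residuals)

        # Step 2: adversarial review by a model from a different family.
        findings = review()
        blocking = [f for f in findings if not f.degrades_safe]
        residuals = [f for f in findings if f.degrades_safe]
        if blocking:
            fix(blocking)
            continue  # rebuild and re-review in the next round

        # Step 3: runtime smoke is a separate gate; a failure is NO-GO
        # even though the review came back clean.
        if not smoke():
            return LoopResult("NO-GO", round_no, residuals)

        # Step 4: nothing blocking remains and the residuals degrade safe.
        return LoopResult("GO", round_no, residuals)

    # Ran out of rounds with blocking findings still open.
    return LoopResult("NO-GO", max_rounds, residuals)
```

Step 5 (integrate, deploy, next phase) is deliberately left outside the function: that is where the irreversible gates sit.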

## 3. State-of-the-art principles (and why they matter)

**On the loop (how to work):**
- **Delegate the "after".** What you do *after* prompting — run the server, verify, commit, push, open the PR, pull the review comments and fix them, re-review, merge, next — is almost all delegable. That's where the value is, not in the prompt.
- **Don't look at the code too early.** If you read it before another agent reviews it, you're wasting your time. Come in last, once the obvious junk is gone.
- **Loops that create loops; the shape of the loop = the shape of the work.** Don't hard-code a zoo of "personas" (reviewer X, reviewer Y). Let the agent dynamically build the structure the problem needs (stacked vs parallel, how many phases).
- **Isolate** (e.g. git worktrees) so concurrent loops don't collide and an agent can watch/fix a PR for hours without blocking everything else.
- **Linear goal vs dynamic workflow.** A linear "goal" repeats "done? no → keep going." A dynamic workflow creates bespoke work from a goal. Pick per task.
- **Plans/reports in HTML** when a human must read them: easier to scan, even on a phone.
- **Confront, don't obey.** Verify every finding against the real code; push back when the reviewer is wrong, accept when it's right even if you said otherwise.
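
As one way to picture the isolation bullet above, the sketch below gives each loop its own git worktree and branch. The `loop/<task>` branch naming, the sibling-directory layout, and both function names are assumptions made for illustration; only `git -C` and `git worktree add -b` are standard git.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_path(repo: Path, task: str) -> Path:
    # Hypothetical layout: a sibling directory next to the main checkout.
    return repo.parent / f"{repo.name}-{task}"


def worktree_add_command(repo: Path, task: str, base: str = "main") -> List[str]:
    # git worktree add -b <new-branch> <path> <commit-ish>
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"loop/{task}",
        str(worktree_path(repo, task)),
        base,
    ]


def create_worktree(repo: Path, task: str, base: str = "main") -> Path:
    # Raises CalledProcessError if git refuses (e.g. the branch already exists).
    subprocess.run(worktree_add_command(repo, task, base), check=True)
    return worktree_path(repo, task)
```

Each agent then works inside its own returned path, so a long-running loop on one PR never touches the files another loop is editing.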

**On ambition (what to aim at):**
- **Experimentation is cheap now.** Just as the cloud made scaling cheap, AI makes writing code cheap → **take bigger bets**.
- **You can go horizontal.** You used to win by picking one vertical and going deep; now you can cover the whole range at a "functional but simple" level and add **extensibility** so users go deep where they need to.
- **Stop building glue.** Fixing the seam between pieces, one after another, is living in the margins; reinvent the whole piece when it makes sense.
- **Think bigger.** Don't just automate the old work — enable new work. Push until you hit the wall; it's farther than you think.

> **Synthesis:** quality loops are the *production system*; ambition is *where you aim them*. One without the other falls short.

## 4. Capabilities we tend to under-use

**Codex (CLI):**
- **`codex exec` is the automation primitive** (not `codex review`, whose structured headless form isn't first-class yet). Read-only and non-interactive: `-s read-only`, `-a never`, prompt via stdin with `codex exec -`.
- **`--output-schema <json-schema>`**: findings as **structured JSON** (severity / file:line / status / fix) → trivial re-review-with-memory and synthesis. *The key upgrade.*
- **"Skills define the method, automations define the schedule."** Write the skill first (the method); add the schedule (cron) later, once the loop is predictable.
- **Subagents:** `agents.max_depth` defaults to `1` (raising it = costly fan-out); one thread per coherent unit of work.
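
To make "findings as structured JSON" concrete, here is a hedged sketch of a findings schema and the matching invocation. The field names follow the severity / file:line / status / fix shape named above; the enum values, `FINDINGS_SCHEMA`, `codex_review_command`, and `parse_findings` are illustrative assumptions. The command uses only flags quoted in this section (`-s read-only`, `--output-schema`, and `-` for a stdin prompt); `-a never` is left out of the sketch, so check `codex exec --help` for how your installed version accepts it.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema; write it to a file and pass the path to --output-schema.
FINDINGS_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["high", "medium", "low"]},
                    "file": {"type": "string"},
                    "line": {"type": "integer"},
                    "status": {"type": "string", "enum": ["open", "fixed", "deferred"]},
                    "fix": {"type": "string"},
                },
                "required": ["severity", "file", "line", "status", "fix"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["findings"],
    "additionalProperties": False,
}


def codex_review_command(schema_path: str) -> List[str]:
    # Read-only sandbox, schema-constrained output, prompt read from stdin.
    return ["codex", "exec", "-s", "read-only", "--output-schema", schema_path, "-"]


def parse_findings(raw: str) -> List[Dict[str, Any]]:
    # Assumes `raw` is the reviewer's final message and that it conforms
    # to FINDINGS_SCHEMA; the required keys are re-checked defensively.
    findings = json.loads(raw)["findings"]
    required = FINDINGS_SCHEMA["properties"]["findings"]["items"]["required"]
    for item in findings:
        missing = [key for key in required if key not in item]
        if missing:
            raise ValueError(f"finding is missing fields: {missing}")
    return findings
```

Because every finding carries a `status`, the next review round can be handed the previous list as memory instead of starting cold.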

**Claude Code / Agent SDK:**
- **Headless `claude -p`** with `--output-format json` + `--json-schema` (equivalent to `--output-schema`), `--bare` (reproducible in CI), `--append-system-prompt`, `--continue`/`--resume` (**stateful loops**), `--permission-mode dontAsk` / `--allowedTools` (bounded autonomy), per-run cost in the JSON.
- **In-session:** deterministic multi-agent orchestration (pipeline/parallel, schema'd output, loop-until-dry, adversarial verify), subagents, background monitors (poll until a condition), scheduled wake-ups, **hooks** (deterministic pre/post-tool gates), and cloud multi-agent review.
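
Since the headless JSON reports per-run cost, a stateful loop can carry a simple budget guard. The sketch below is an assumption-laden illustration: `Budget` is a hypothetical helper, and the `total_cost_usd` field name is an assumption about the JSON envelope, so it is passed as a parameter and should be checked against the actual output of your installed version.

```python
import json
from dataclasses import dataclass


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, raw_envelope: str, cost_field: str = "total_cost_usd") -> float:
        # `cost_field` is an assumed key; a missing key counts as zero cost.
        envelope = json.loads(raw_envelope)
        cost = float(envelope.get(cost_field, 0.0))
        self.spent_usd += cost
        return cost

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd
```

A loop would call `record` on each run's JSON output and stop re-invoking the agent once `exhausted` is true, which turns "measure first" into a hard stop rather than a habit.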

> **Insight:** much of what people coordinate "by hand" with threads already exists **natively** in these tools. Don't invent infrastructure; use what's there.

## 5. The autonomy ladder (reversibility, not confidence)

Guiding principle: **autonomy scales with reversibility, not with model quality.** Self-drive where a mistake degrades safe; human-gate where it's irreversible.

- **L0 — typical today:** the agent builds + reviews-with-another-model + **decides GO per phase**; the human sets the mission, supervises, and approves the irreversible.
- **L1 — the skill:** runs phase→phase **without human re-invocation** (notifies / interruptible) + **runtime gate**. The agent decides GO; the human only the irreversible.
- **L2:** dynamic multi-lens review (correctness + runtime + security) + **auto-merge to staging** for the safe class (additive / behind a flag / reversible).
- **L3:** self-served missions (pull from a queue/backlog on a schedule); the human = exception handler.
- **Forever invariant:** production / money / destructive data / migrations stay human-gated **by policy**, not by incapacity.
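
The ladder's rule can be expressed as policy code rather than prompt text. A minimal sketch, assuming hypothetical names throughout (`Gate`, `Change`, `gate_for`, `may_auto_merge_to_staging`, and the tag strings in `IRREVERSIBLE`); the categories themselves come from the invariant above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Gate(Enum):
    AGENT = "agent decides"
    HUMAN = "human approves"


# The forever invariant: these stay human-gated by policy, not by incapacity.
IRREVERSIBLE: FrozenSet[str] = frozenset(
    {"production", "money", "destructive-data", "migration"}
)


@dataclass(frozen=True)
class Change:
    touches: FrozenSet[str]      # hypothetical tags describing what the change affects
    additive: bool = False
    behind_flag: bool = False
    reversible: bool = False


def gate_for(change: Change) -> Gate:
    # Autonomy scales with reversibility, not with model quality:
    # any overlap with the irreversible set forces a human gate.
    if change.touches & IRREVERSIBLE:
        return Gate.HUMAN
    return Gate.AGENT


def may_auto_merge_to_staging(change: Change) -> bool:
    # L2's "safe class": additive, behind a flag, or reversible.
    safe_class = change.additive or change.behind_flag or change.reversible
    return gate_for(change) is Gate.AGENT and safe_class
```

Note that nothing in `gate_for` looks at which model produced the change or how confident it was; that omission is the point of the guiding principle.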

## 6. The skill design

The skill combines:

- Builder model + reviewer of a **different family** via `codex exec` (or `claude -p` with `--json-schema`).
- Adversarial prompt via stdin.
- **Structured findings (`--output-schema`)**.
- Background run (re-notifies).
- **Triaged synthesis** (real / false-positives and why / deferred with backstop), not the raw dump.
- **Re-review with memory** of findings until GO.
- **Runtime smoke against a real environment** (separate gate).
- **Dynamic** loop shape.
- "Degrade-safe" GO/NO-GO checklist.

**The agent decides GO; the human holds the irreversible gate.** Once predictable → wrap on a schedule.
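
Triaged synthesis and re-review with memory, as named in the design above, might look like the following. This is a sketch under stated assumptions: findings are plain dicts with `file` and `summary` keys, `verify` stands in for the "confront" step (checking a finding against the real code), and every name here is hypothetical. Only false-positive and deferred decisions are remembered; anything judged real is re-verified each round, since a fix may have changed the answer.

```python
from typing import Callable, Dict, List, Tuple

Finding = Dict[str, object]
Verdict = Tuple[str, str]                 # (bucket, reason)
Memory = Dict[Tuple[object, object], Verdict]

BUCKETS = ("real", "false-positive", "deferred")


def finding_key(finding: Finding) -> Tuple[object, object]:
    # Line numbers drift between rounds, so the key uses file + summary only.
    return (finding["file"], finding["summary"])


def triage(
    findings: List[Finding],
    verify: Callable[[Finding], Verdict],
    memory: Memory,
) -> Dict[str, List[Finding]]:
    report: Dict[str, List[Finding]] = {bucket: [] for bucket in BUCKETS}
    for finding in findings:
        key = finding_key(finding)
        if key in memory:
            # Already settled in an earlier round; do not re-litigate.
            bucket, reason = memory[key]
        else:
            bucket, reason = verify(finding)
            if bucket not in BUCKETS:
                raise ValueError(f"unknown bucket: {bucket}")
            if bucket != "real":
                memory[key] = (bucket, reason)
        report[bucket].append({**finding, "reason": reason})
    return report


def is_go(report: Dict[str, List[Finding]]) -> bool:
    # GO when nothing real is left; deferred items need their backstop documented.
    return not report["real"]
```

The `reason` attached to each entry is what makes the synthesis readable: a human skimming the report sees why something was dismissed, not just that it was.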

## 7. Insights for quality loops (the principles)

1. **Delegate the "after".** The value is automating what you do after prompting, not the prompt.
2. **Don't look at the code too early.** Let another agent review it before you do.
3. **Two model families > one.** The reviewer must be a different lineage than the builder.
4. **Diff-correct ≠ works.** Always a real runtime gate.
5. **Findings as data, not prose.** `--output-schema` / `--json-schema`.
6. **Dynamic shape, not a persona zoo.** Let the problem dictate structure.
7. **Isolate** so loops don't collide.
8. **Confront, don't obey.** Verify every finding against real code.
9. **Autonomy = reversibility.** Human gate only on the irreversible.
10. **Treat the limit as a challenge.** On a flat subscription, loop freely up to the usage limit; on metered API pricing, measure cost per loop first.
11. **Skill = method; automation = schedule.** In that order.
12. **Aim at something that seems impossible.** The wall is farther than you think.

---

### References
- Reference talks on agentic loops: https://www.youtube.com/watch?v=iJVJwmCKW9o · https://www.youtube.com/watch?v=WBT-z_-OPhw
- Codex CLI — Non-interactive: https://developers.openai.com/codex/noninteractive
- Codex CLI — Features / Reference: https://developers.openai.com/codex/cli/features · https://developers.openai.com/codex/cli/reference
- Codex — Best practices / Subagents: https://developers.openai.com/codex/learn/best-practices · https://developers.openai.com/codex/subagents
- Codex — Headless review (open issue): https://github.com/openai/codex/issues/6432
- Claude Code — Headless / Agent SDK: https://code.claude.com/docs/en/headless