How to Design Agentic Loops: Two Worked Examples
From a single passing test to a checklist that builds itself, then to a swarm of agents that grade their own work.
The first article built a
Ralph loop: a while loop that drove a coding agent until
npm test went green. That is the simple rung. This
article climbs two more, reusing the same little
shortener project so you can feel it grow.
- Intermediate — a checklist-driven loop that ships a backlog of features one at a time, sandboxed in a container.
- Complex — a planner–generator–evaluator loop that fans work out to parallel agents in git worktrees and merges only what passes.
2 Intermediate: the checklist-driven loop
One passing test is a nice demo, but real work is a backlog. The intermediate pattern keeps the agent’s memory in two files on disk — a task list and a progress log — and works the list one item per iteration. Because the model is amnesiac, these files are the project’s memory between runs.
1. Write the backlog as a checklist
Each task is a checkbox with its own acceptance test named up front.
Save this as TASKS.md:
# Shortener backlog
Work the FIRST unchecked task only. Mark it `[x]` when its
acceptance test passes. Do not touch already-checked tasks.
- [ ] **store** — persist code→url mappings in a SQLite file `data.db`.
Acceptance: `test/store.test.js` passes.
- [ ] **api** — expose `POST /shorten` and `GET /:code` with Express.
Acceptance: `test/api.test.js` passes.
- [ ] **expiry** — links accept an optional `ttlSeconds` and 404 once expired.
Acceptance: `test/expiry.test.js` passes.
- [ ] **ratelimit** — cap `POST /shorten` at 10 req/min per IP.
Acceptance: `test/ratelimit.test.js` passes.
Create an empty PROGRESS.md for the agent to journal into.
This is where it records what it tried, so the next amnesiac iteration
does not repeat a dead end:
echo "# Progress log" > PROGRESS.md
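Before handing the checklist to an agent, it can help to parse it yourself and confirm the format is machine-readable. The sketch below is illustrative only: `first_unchecked` is a hypothetical helper, not something the loop requires, and it assumes every task line follows the `- [ ] **slug**` shape used in TASKS.md.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the slug of the first unchecked task.
# Assumes task lines look like "- [ ] **slug** ..." as in TASKS.md.
first_unchecked() {
  # sed -n prints only the lines where the substitution matched (the slug);
  # head keeps the first one, mirroring "work the FIRST unchecked task".
  sed -n 's/^- \[ \] \*\*\([^*]*\)\*\*.*/\1/p' "${1:-TASKS.md}" | head -n 1
}
```

On the fresh backlog, `first_unchecked TASKS.md` prints `store`; once that box is ticked, it prints `api`.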
2. Write the per-iteration prompt
The prompt tells the agent to read its memory, do exactly one task,
and write its memory back. Save it as PROMPT.md:
You are building a URL shortener, working through a backlog.
1. Read TASKS.md and PROGRESS.md.
2. Pick the FIRST unchecked `[ ]` task in TASKS.md.
3. Write its acceptance test if it does not exist yet, then
implement the feature until `npm test` passes for it.
4. When green, change that task's `[ ]` to `[x]` in TASKS.md
and append a short note to PROGRESS.md (what you did, any
gotchas for the next run).
5. Run the FULL suite with `npm test` to confirm nothing else broke.
Do exactly ONE task, then stop. Never weaken or delete a test.
If every task is `[x]`, write "BACKLOG COMPLETE" and stop.
3. The loop: stop when every box is checked
The objective stop condition is no longer “tests pass” but
“no unchecked boxes remain and the full suite is green.”
Save as loop.sh:
#!/usr/bin/env bash
set -euo pipefail
MAX_ITERATIONS=20
# Count unchecked boxes; grep -c exits 1 on zero matches, hence `|| true`.
remaining() { grep -c '^- \[ \]' TASKS.md || true; }
for i in $(seq 1 "$MAX_ITERATIONS"); do
  left=$(remaining)
  echo "──────── iteration $i: $left task(s) left ────────"
  # Done when the backlog is empty AND the whole suite is green.
  if [ "$left" -eq 0 ] && npm test >/dev/null 2>&1; then
    echo "✅ backlog complete and suite green"
    exit 0
  fi
  claude -p "$(cat PROMPT.md)" \
    --dangerously-skip-permissions \
    --output-format stream-json --verbose
  # Commit each finished step so progress is durable and reviewable.
  git add -A && git commit -q -m "loop: iteration $i" || true
done
echo "❌ hit the cap with $(remaining) task(s) unfinished"
exit 1
Notice the git commit at the end of each pass. The git
history becomes an audit trail: one commit per iteration (one per
completed task when the agent follows the prompt), easy to review in
the morning and easy to git revert if the agent took a wrong turn on
task three.
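Reviewing and undoing a pass might look like the sketch below. It assumes the commit subjects follow the `loop: iteration N` pattern from loop.sh; the helper names `loop_commits` and `revert_iteration` are made up for illustration.

```shell
#!/usr/bin/env bash
# List the loop's commits, newest first, one line each.
loop_commits() {
  git log --oneline --grep='^loop: iteration [0-9][0-9]*$'
}

# Undo one iteration with a new commit, leaving later history intact.
revert_iteration() {
  local n sha
  n=$1
  # Anchor the pattern so "iteration 1" does not also match "iteration 10".
  sha=$(git log --format=%H -n 1 --grep="^loop: iteration $n\$") || return 1
  [ -n "$sha" ] || { echo "no commit for iteration $n" >&2; return 1; }
  git revert --no-edit "$sha"
}
```

`revert_iteration 3` creates a new commit that undoes iteration three. If later iterations built on that work, git may stop and ask you to resolve conflicts by hand.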
4. Run it in a real container sandbox
This is the point where you stop trusting a folder and start trusting a box. Mount only the project, cut the network, and let it run:
docker run --rm -it \
--network none \
-v "$PWD":/work -w /work \
node:22 \
bash -c "npm install && ./loop.sh"
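--network none cuts off all networking, so as written the agent
cannot reach the internet or the model API, and npm install cannot
reach the registry either; the stock node:22 image also ships without
the claude CLI. Read the command as the shape of the fence, not a
turnkey invocation. In practice you run the loop in a sandbox that
allows only the model’s endpoint and the few hosts the build needs
(an allow-list firewall) while blocking everything else, which is the
approach Anthropic’s reference devcontainer takes. Use --network none
only when the agent binary runs outside the container and drives it
over a mount.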
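One coarse alternative to a full allow-list, sketched here as an assumption rather than the article's setup, is to split the work into phases: install dependencies with the network on, then run the test suite with `--network none`. It does not cover the agent itself, which still needs the model endpoint. The `DOCKER` variable and both function names are hypothetical conveniences so the commands can be dry-run with `DOCKER=echo`.

```shell
#!/usr/bin/env bash
# Override with DOCKER=echo to print the commands instead of running them.
DOCKER=${DOCKER:-docker}

# Phase 1: network on, so npm can reach the registry.
install_deps() {
  "$DOCKER" run --rm -v "$PWD":/work -w /work node:22 npm install
}

# Phase 2: network off; the suite runs against the already-installed deps.
offline_suite() {
  "$DOCKER" run --rm --network none -v "$PWD":/work -w /work node:22 npm test
}
```

Running `install_deps && offline_suite` gives a verification step that cannot phone home, which is a useful property for the final gate even when the agent itself runs behind an allow-list.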
Come back later to four commits, four checked boxes, and a
PROGRESS.md that reads like a junior engineer’s
stand-up notes. You designed the loop once; it shipped four features.
What changed from the simple rung
State moved into durable files (TASKS.md,
PROGRESS.md, git). The stop condition became a
compound check across a backlog. And the sandbox graduated from
“a folder I trust” to “a network-isolated
container.” Same three-beat loop — just more memory
and a better fence.
3 Complex: planner, generator, evaluator
The checklist loop is sequential: one task, then the next. The complex pattern, described by Addy Osmani as the planner–generator–evaluator split, does three things the sequential loop cannot: it separates planning from doing, runs independent work in parallel, and grades the output with a different agent than the one that wrote it. You stop designing one loop and start designing a small organisation.
The three roles
- Planner — expands a one-line goal into a set of independent, well-specified tasks, each with its own acceptance test. Runs once.
- Generators — one agent per task, each in its own isolated copy of the repo, implementing in parallel.
- Evaluator — a separate agent (or just CI) that runs the tests and grades each result, so the thing being judged is not the judge.
The isolation primitive that makes parallelism safe is the
git worktree: multiple working directories backed by
one repository, each on its own branch. Two agents editing
src/store.js at the same time never collide, because each
has its own checkout.
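To see that isolation for yourself, the throwaway demo below creates a scratch repository, adds two worktrees, and writes different content to the same path in each. Everything here is illustrative: the repo lives in a temp directory, and the branch names `agent/a` and `agent/b` are placeholders.

```shell
#!/usr/bin/env bash
# Throwaway demo: two worktrees edit the same path without colliding.
# The body runs in a subshell so the caller's working directory is untouched.
demo_worktrees() (
  root=$(mktemp -d)
  git init -q "$root/repo"
  cd "$root/repo" || exit 1
  mkdir src && echo "v0" > src/store.js
  git add -A
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "init"
  # Each worktree is a separate directory on its own branch.
  git worktree add -b agent/a "$root/wt-a" HEAD >/dev/null 2>&1
  git worktree add -b agent/b "$root/wt-b" HEAD >/dev/null 2>&1
  echo "A" > "$root/wt-a/src/store.js"
  echo "B" > "$root/wt-b/src/store.js"
  # Three checkouts, three independent copies of the same file.
  cat "$root/wt-a/src/store.js" "$root/wt-b/src/store.js" src/store.js
)
```

Calling `demo_worktrees` prints `A`, `B`, and `v0` on separate lines: each worktree sees only its own edit, and the original checkout is unchanged.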
1. Planner: turn a goal into parallel tasks
Run the planner once to produce a machine-readable plan. Save the
prompt as plan.md:
Read the codebase. The goal is:
"Add metrics, structured logging, and an OpenAPI spec to the
shortener — these three are independent of each other."
Output ONLY a file `plan.json`: an array of tasks, each with
`id` (slug), `title`, and `acceptance` (the exact `npm test`
sub-command that proves it done). Make the tasks independent so
they can be built in parallel without touching the same files.
claude -p "$(cat plan.md)" --dangerously-skip-permissions
cat plan.json
You get something like:
[
{ "id": "metrics", "title": "Prometheus /metrics endpoint",
"acceptance": "npm test -- metrics" },
{ "id": "logging", "title": "structured JSON request logging",
"acceptance": "npm test -- logging" },
{ "id": "openapi", "title": "serve an OpenAPI 3 spec at /openapi.json",
"acceptance": "npm test -- openapi" }
]
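Because the plan is model output, a pre-flight check before fanning out is cheap insurance. The sketch below is an assumption, not part of the original scripts: `validate_plan` is a hypothetical helper that checks the shape described in plan.md (string `id`, `title`, `acceptance`), requires unique ids, and requires each acceptance command to start with `npm test`. That last check is a sanity guard, not a security boundary.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for plan.json before any agent is launched.
validate_plan() {
  # -e makes jq exit non-zero when the final result is false or null;
  # malformed input makes jq fail outright, which also counts as invalid.
  jq -e '
    type == "array" and length > 0
    and all(.[]; (.id | type) == "string"
                 and (.title | type) == "string"
                 and (.acceptance | type) == "string"
                 and (.acceptance | startswith("npm test")))
    and ((map(.id) | unique | length) == length)
  ' "${1:-plan.json}" >/dev/null 2>&1
}
```

A typical use is `validate_plan plan.json || { echo "bad plan, re-run the planner"; exit 1; }` at the top of the fan-out script.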
2. Generators: one worktree per task, in parallel
For each task, spin up an isolated worktree on its own branch and
launch a generator agent in the background. This is the loop fanning
out. Save as fanout.sh:
#!/usr/bin/env bash
set -euo pipefail
# One worktree + one background generator per task.
for id in $(jq -r '.[].id' plan.json); do
  task=$(jq -c ".[] | select(.id==\"$id\")" plan.json)
  branch="agent/$id"
  git worktree add -b "$branch" "../wt-$id" HEAD
  (
    cd "../wt-$id"
    npm install --silent
    claude -p "Implement this task until its acceptance test passes.
Task: $task
Do not edit other tasks' files. Never weaken a test." \
      --dangerously-skip-permissions
  ) > "log-$id.txt" 2>&1 &
done
wait # block until every generator has finished
echo "all generators done"
Three agents now work at once, each in ../wt-metrics,
../wt-logging, ../wt-openapi, never stepping
on each other. On a real task list this is where the wall-clock
savings show up.
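A bare `wait` returns success even if a generator crashed, so a failed job is only visible in its log. One way to surface failures, sketched here under assumptions, is to record each PID and wait on them individually. `fan_out` and its worker argument are hypothetical stand-ins for the worktree-plus-agent invocation in fanout.sh, and task ids are assumed to be simple slugs without spaces or colons.

```shell
#!/usr/bin/env bash
# Launch one background job per id, then wait on each PID so that
# individual failures are reported instead of silently swallowed.
fan_out() {
  local worker pids entry pid id failed
  worker=$1; shift
  pids=""
  for id in "$@"; do
    "$worker" "$id" > "log-$id.txt" 2>&1 &
    pids="$pids $!:$id"          # remember which PID belongs to which task
  done
  failed=0
  for entry in $pids; do
    pid=${entry%%:*}; id=${entry#*:}
    if wait "$pid"; then
      echo "ok   $id"
    else
      echo "FAIL $id"; failed=1
    fi
  done
  return "$failed"
}
```

With a worker function named, say, `run_generator`, the call would be `fan_out run_generator metrics logging openapi`; the function returns non-zero if any generator exited with an error.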
3. Evaluator: grade, then merge only the winners
Crucially, the generator does not get to declare its
own work done. A separate evaluator pass re-runs each branch’s
acceptance test in that branch’s worktree, after a fresh npm install,
and merges only the branches that pass. Save as evaluate.sh:
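#!/usr/bin/env bash
set -euo pipefail
for id in $(jq -r '.[].id' plan.json); do
  acc=$(jq -r ".[] | select(.id==\"$id\") | .acceptance" plan.json)
  echo "──── evaluating $id: $acc ────"
  # Commit whatever the generator left uncommitted, otherwise the branch
  # has nothing to merge and removing the worktree discards the work.
  # Assumes node_modules is git-ignored; `|| true` covers "nothing to commit".
  ( cd "../wt-$id" && git add -A && git commit -q -m "agent: $id" ) || true
  # eval runs a planner-written command, so keep this inside the sandbox.
  if ( cd "../wt-$id" && npm install --silent && eval "$acc" ); then
    echo "✅ $id passed, merging"
    git merge --no-ff "agent/$id" -m "merge $id (evaluator-approved)"
  else
    echo "❌ $id failed, left on branch agent/$id for review"
  fi
  git worktree remove --force "../wt-$id"
done
# Final gate: the whole suite must be green after all merges.
# Reinstall first, since merged branches may have added dependencies.
npm install --silent
npm test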
That last npm test is the loop’s real stop
condition — the objective oracle, applied after integration. A
branch that fails its acceptance test is never merged; it is parked
for a human (or a follow-up Ralph loop) to look at. The judge is not
the judged.
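The prompts ask generators never to weaken a test, but a passing acceptance run alone does not prove they complied. One possible extra guard, sketched here as an assumption rather than part of the original evaluator, rejects any branch that modified, deleted, or renamed a test file that already existed on the base. `tests_untouched` is a hypothetical helper, and the `test/` directory is taken from the paths in TASKS.md.

```shell
#!/usr/bin/env bash
# Return 0 only if <branch> left every pre-existing file under test/ alone.
# New test files (status A) are allowed; M, D and R are not.
tests_untouched() {
  local base branch changed
  base=$1; branch=$2
  # Three-dot diff: changes on <branch> since it forked from <base>.
  changed=$(git diff --name-only --diff-filter=MDR "$base...$branch" -- test/) || return 2
  if [ -n "$changed" ]; then
    echo "rejected $branch: touched existing tests:" >&2
    echo "$changed" >&2
    return 1
  fi
}
```

In the evaluator this would sit in front of the merge, for example `tests_untouched HEAD "agent/$id" && git merge ...`, so a branch that edited an old test is parked for review even when its acceptance test passes.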
The self-improving twist
The most advanced version closes one more loop: feed the evaluator’s
failures back to the planner so it rewrites the failed
tasks’ specs, and let the system add lessons to its own
CLAUDE.md / AGENTS.md instructions so the
next run starts smarter. That is what people mean by
self-improving agents:
the loop edits not just the code, but the instructions that drive the
loop. Powerful — and exactly why the network-isolated sandbox is
non-negotiable at this tier.
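The instruction-editing half of that feedback loop can be as small as an append with a duplicate check. This is a sketch under assumptions: `record_lesson` is a hypothetical helper, lessons are stored as one bullet per line, and how the evaluator phrases a lesson is left open.

```shell
#!/usr/bin/env bash
# Append a lesson to an instructions file unless that exact line exists.
record_lesson() {
  local file lesson
  file=$1; lesson=$2
  touch "$file"
  # -F fixed string, -x whole-line match, -q quiet: skip known lessons.
  grep -Fxq -- "- $lesson" "$file" || printf -- '- %s\n' "$lesson" >> "$file"
}
```

For example, `record_lesson AGENTS.md "run npm install after adding a dependency"` adds the bullet once, no matter how many runs report the same lesson. Keeping the file append-only and deduplicated also makes each change easy to review in the git history.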
The three rungs, side by side
Simple (Ralph): one prompt, re-run until tests pass. Intermediate (checklist): a backlog in files, one task per iteration, committed and sandboxed. Complex (planner–generator–evaluator): parallel agents in worktrees, graded by a separate evaluator, merged only when green. Every rung is the same plan–act–observe loop — what grows is the memory, the parallelism, and the rigour of the stop condition. Your job, at every level, is to design that loop and then get out of it.
Further reading
- How to Stop Prompting Agents and Start Designing Loops? — the concept and the simple example
- Addy Osmani — Agent Harness Engineering
- Simon Willison — Designing agentic loops
- git worktree documentation
- Codex: non-interactive mode (swap claude -p for codex exec --sandbox workspace-write in any script above)