How to Design Agentic Loops: Two Worked Examples

From a single passing test to a checklist that builds itself, then to a swarm of agents that grade their own work.

Read time: about 18 minutes

Prerequisite: The concept & simple example

You need: a sandbox, Claude Code or Codex, git

The first article built a Ralph loop: a while loop that drove a coding agent until npm test went green. That is the simple rung. This article climbs two more, reusing the same little shortener project so you can feel it grow.

Intermediate — a checklist-driven loop that ships a backlog of features one at a time, sandboxed in a container.
Complex — a planner–generator–evaluator loop that fans work out to parallel agents in git worktrees and merges only what passes.

Reminder: every loop here runs an agent that approves its own actions. Run them in a disposable container with no network and no real credentials, never on your laptop. See the safety section of the first article, or our isolated toolbox guide.

2 Intermediate: the checklist-driven loop

One passing test is a nice demo, but real work is a backlog. The intermediate pattern keeps the agent’s memory in two files on disk — a task list and a progress log — and works the list one item per iteration. Because the model is amnesiac, these files are the project’s memory between runs.

1. Write the backlog as a checklist

Each task is a checkbox with its own acceptance test named up front. Save this as TASKS.md:

# Shortener backlog

Work the FIRST unchecked task only. Mark it `[x]` when its
acceptance test passes. Do not touch already-checked tasks.

- [ ] **store** — persist code→url mappings in a SQLite file `data.db`.
      Acceptance: `test/store.test.js` passes.
- [ ] **api** — expose `POST /shorten` and `GET /:code` with Express.
      Acceptance: `test/api.test.js` passes.
- [ ] **expiry** — links accept an optional `ttlSeconds` and 404 once expired.
      Acceptance: `test/expiry.test.js` passes.
- [ ] **ratelimit** — cap `POST /shorten` at 10 req/min per IP.
      Acceptance: `test/ratelimit.test.js` passes.

Create an empty PROGRESS.md for the agent to journal into. This is where it records what it tried, so the next amnesiac iteration does not repeat a dead end:

echo "# Progress log" > PROGRESS.md

2. Write the per-iteration prompt

The prompt tells the agent to read its memory, do exactly one task, and write its memory back. Save it as PROMPT.md:

You are building a URL shortener, working through a backlog.

1. Read TASKS.md and PROGRESS.md.
2. Pick the FIRST unchecked `[ ]` task in TASKS.md.
3. Write its acceptance test if it does not exist yet, then
   implement the feature until `npm test` passes for it.
4. When green, change that task's `[ ]` to `[x]` in TASKS.md
   and append a short note to PROGRESS.md (what you did, any
   gotchas for the next run).
5. Run the FULL suite with `npm test` to confirm nothing else broke.

Do exactly ONE task, then stop. Never weaken or delete a test.
If every task is `[x]`, write "BACKLOG COMPLETE" and stop.

Why one task per iteration? A fresh context window per task is the cheapest defence against context rot. The agent never carries four features’ worth of half-remembered detail; it reads only what it needs from disk, does one thing well, and hands a clean slate to the next pass.
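It helps to know which task an iteration is about to pick up, for example to print it in the loop's banner. The sketch below assumes the TASKS.md layout shown earlier (one `- [ ] **slug**` line per task); `next_task` is a hypothetical helper, not part of the original scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original loop.sh.
# Prints the slug of the first unchecked task in a TASKS.md-style file,
# or nothing when every box is checked.
next_task() {
  local file="${1:-TASKS.md}"
  # Match lines such as:  - [ ] **api** ...
  # and keep only the text between the first pair of ** markers.
  sed -n 's/^- \[ ] \*\*\([^*]*\)\*\*.*/\1/p' "$file" | head -n 1
}
```

Calling `next_task` with no argument reads TASKS.md in the current directory, so a loop could log `echo "next up: $(next_task)"` before launching the agent.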

3. The loop: stop when every box is checked

The objective stop condition is no longer “tests pass” but “no unchecked boxes remain and the full suite is green.” Save as loop.sh:

#!/usr/bin/env bash
set -euo pipefail

MAX_ITERATIONS=20

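# Count unchecked boxes. grep -c prints 0 but exits 1 when nothing
# matches, hence the || true under set -e.
remaining() { grep -c '^- \[ ]' TASKS.md || true; }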

for i in $(seq 1 "$MAX_ITERATIONS"); do
  left=$(remaining)
  echo "──────── iteration $i — $left task(s) left ────────"

  # Done when the backlog is empty AND the whole suite is green.
  if [ "$left" -eq 0 ] && npm test >/dev/null 2>&1; then
    echo "✅ backlog complete and suite green"
    exit 0
  fi

  claude -p "$(cat PROMPT.md)" \
    --dangerously-skip-permissions \
    --output-format stream-json --verbose

  # Commit each finished step so progress is durable and reviewable.
  git add -A && git commit -q -m "loop: iteration $i" || true
done

echo "❌ hit the cap with $(remaining) task(s) unfinished"
exit 1

Notice the git commit at the end of each pass. The git history becomes an audit trail: one commit per iteration, which ideally means one per completed task. It is easy to review in the morning, and easy to git revert if the agent took a wrong turn on task three.
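Because loop.sh writes a predictable commit subject ("loop: iteration N"), finding and undoing one pass can be scripted. This is a minimal sketch under that assumption; `iteration_commit` and `revert_iteration` are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reviewing the loop's history. They rely only on
# the commit subject "loop: iteration N" that loop.sh writes.

# Print the hash of the commit made by iteration N (empty if there is none).
# The trailing $ stops "iteration 1" from also matching "iteration 10".
iteration_commit() {
  git log --format=%H -n 1 --grep="^loop: iteration $1\$"
}

# Undo one iteration with a new revert commit, leaving later work in place.
revert_iteration() {
  local hash
  hash=$(iteration_commit "$1")
  [ -n "$hash" ] || { echo "no commit for iteration $1" >&2; return 1; }
  git revert --no-edit "$hash"
}
```

Run `git log --oneline` first to see what each iteration changed, then `revert_iteration 3` to back out only the third pass. If later iterations built on the reverted work, git may report conflicts that need a human.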

4. Run it in a real container sandbox

This is the point where you stop trusting a folder and start trusting a box. Mount only the project, cut the network, and let it run:

docker run --rm -it \
  --network none \
  -v "$PWD":/work -w /work \
  node:22 \
  bash -c "npm install && ./loop.sh"

Chicken-and-egg: --network none means the agent cannot reach the internet or the model API, and npm install cannot reach the registry either, so the command above only works as written if dependencies are already installed and the agent runs elsewhere. In practice you run the loop in a sandbox that allows only the model’s endpoint (an allow-list firewall) while blocking everything else, which is the approach Anthropic’s reference devcontainer takes. Use --network none only when the agent binary runs outside the container and drives it over a mount.
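To make the allow-list idea concrete, here is a dry-run sketch that only prints iptables rules, so you can read them before applying anything. It is not the devcontainer's actual script. It assumes IPv4, that iptables exists in the sandbox, and that you have already resolved the allowed hosts to addresses; `allowlist_rules` and the example addresses are hypothetical.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints iptables rules instead of applying them.
# Assumptions: IPv4 only, iptables available in the sandbox, and the caller
# passes the already-resolved IP addresses of the hosts to allow.
allowlist_rules() {
  local ip
  echo "iptables -A OUTPUT -o lo -j ACCEPT"   # keep loopback working
  echo "iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
  # DNS stays open so hostnames resolve; note that this is itself a
  # possible leak channel and a stricter setup would pin a resolver.
  echo "iptables -A OUTPUT -p udp --dport 53 -j ACCEPT"
  for ip in "$@"; do
    echo "iptables -A OUTPUT -d $ip -p tcp --dport 443 -j ACCEPT"
  done
  echo "iptables -P OUTPUT DROP"              # everything else is blocked
}
```

For example, `allowlist_rules 203.0.113.10` prints five rules (203.0.113.10 is a documentation address, not a real endpoint). Applying them needs root and the NET_ADMIN capability inside the container.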

Come back later to four commits, four checked boxes, and a PROGRESS.md that reads like a junior engineer’s stand-up notes. You designed the loop once; it shipped four features.

What changed from the simple rung

State moved into durable files (TASKS.md, PROGRESS.md, git). The stop condition became a compound check across a backlog. And the sandbox graduated from “a folder I trust” to “a network-isolated container.” Same three-beat loop — just more memory and a better fence.

3 Complex: planner, generator, evaluator

The checklist loop is sequential: one task, then the next. The complex pattern, described by Addy Osmani as the planner–generator–evaluator split, does three things the sequential loop cannot: it separates planning from doing, runs independent work in parallel, and grades the output with a different agent than the one that wrote it. You stop designing one loop and start designing a small organisation.

The three roles

Planner — expands a one-line goal into a set of independent, well-specified tasks, each with its own acceptance test. Runs once.
Generators — one agent per task, each in its own isolated copy of the repo, implementing in parallel.
Evaluator — a separate agent (or just CI) that runs the tests and grades each result, so the thing being judged is not the judge.

The isolation primitive that makes parallelism safe is the git worktree: multiple working directories backed by one repository, each on its own branch. Two agents editing src/store.js at the same time never collide on disk, because each has its own checkout; any overlap surfaces later as an ordinary merge conflict.
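You can see the isolation for yourself in a throwaway repository before trusting agents with it. The sketch below uses a temporary directory and made-up branch names (agent/a, agent/b); nothing in it touches the shortener project.

```shell
#!/usr/bin/env bash
# Demo in a throwaway repo: two worktrees edit the same path independently.
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q main
cd main
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "init"

# One working directory per branch, all backed by the same repository.
git worktree add -b agent/a ../wt-a HEAD
git worktree add -b agent/b ../wt-b HEAD

# Both "agents" write the same file name; neither sees the other's copy.
echo "from agent a" > ../wt-a/store.js
echo "from agent b" > ../wt-b/store.js

cat ../wt-a/store.js ../wt-b/store.js
git worktree list   # three entries: main, wt-a, wt-b
```

The main checkout has no store.js at all at this point: each edit lives only in its own worktree until it is committed and merged.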

1. Planner: turn a goal into parallel tasks

Run the planner once to produce a machine-readable plan. Save the prompt as plan.md:

Read the codebase. The goal is:

  "Add metrics, structured logging, and an OpenAPI spec to the
   shortener — these three are independent of each other."

Output ONLY a file `plan.json`: an array of tasks, each with
`id` (slug), `title`, and `acceptance` (the exact `npm test`
sub-command that proves it done). Make the tasks independent so
they can be built in parallel without touching the same files.

claude -p "$(cat plan.md)" --dangerously-skip-permissions
cat plan.json

You get something like:

[
  { "id": "metrics", "title": "Prometheus /metrics endpoint",
    "acceptance": "npm test -- metrics" },
  { "id": "logging", "title": "structured JSON request logging",
    "acceptance": "npm test -- logging" },
  { "id": "openapi", "title": "serve an OpenAPI 3 spec at /openapi.json",
    "acceptance": "npm test -- openapi" }
]
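The planner's output is model-generated, so it is worth checking before anything fans out from it. This is a minimal pre-flight sketch, assuming jq is installed and the plan has the shape shown above; `validate_plan` is a hypothetical helper, and the rules it enforces (unique ids, acceptance commands that start with `npm test`) are assumptions you may want to tighten.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for plan.json. Assumes jq is installed.
# Succeeds only if the plan is a non-empty array whose tasks each have a
# non-empty string id, an acceptance command starting with "npm test",
# and no duplicated ids.
validate_plan() {
  local plan="${1:-plan.json}"
  jq -e '
    type == "array" and length > 0
    and all(.[];
          (.id | type == "string" and length > 0)
          and (.acceptance | type == "string" and startswith("npm test")))
    and ((map(.id) | unique | length) == length)
  ' "$plan" >/dev/null
}
```

A gate such as `validate_plan plan.json || { echo "bad plan" >&2; exit 1; }` keeps a malformed plan from creating worktrees. Restricting acceptance commands to `npm test` also narrows what the evaluator will later run.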

2. Generators: one worktree per task, in parallel

For each task, spin up an isolated worktree on its own branch and launch a generator agent in the background. This is the loop fanning out. Save as fanout.sh:

#!/usr/bin/env bash
set -euo pipefail

# One worktree + one background generator per task.
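for id in $(jq -r '.[].id' plan.json); do
  # --arg passes the id as data rather than splicing it into the jq program.
  task=$(jq -c --arg id "$id" '.[] | select(.id == $id)' plan.json)
  branch="agent/$id"

  git worktree add -b "$branch" "../wt-$id" HEAD

  (
    cd "../wt-$id"
    npm install --silent
    claude -p "Implement this task until its acceptance test passes.
               Task: $task
               Do not edit other tasks' files. Never weaken a test." \
      --dangerously-skip-permissions
    # Commit the work so the branch, not just the worktree, carries it;
    # otherwise a later merge of agent/$id would have nothing to merge.
    git add -A && git commit -q -m "agent: $id" || true
  ) > "log-$id.txt" 2>&1 &
done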

wait   # block until every generator has finished
echo "all generators done"

Three agents now work at once, each in ../wt-metrics, ../wt-logging, ../wt-openapi, never stepping on each other. On a real task list this is where the wall-clock savings show up.
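A bare `wait` blocks until every background job ends but discards their exit codes. If you want to know which generator crashed before the evaluator runs, one option is to wait on each process id in turn. This is a sketch, assuming task ids are simple slugs without spaces or colons; `run_generator` is a hypothetical stand-in for the worktree and agent call.

```shell
#!/usr/bin/env bash
# Stand-in for the real generator; replace with the worktree + agent call.
run_generator() { echo "working on $1"; }

# Launch one background job per task id and report each exit status.
# Assumes ids are simple slugs (no spaces, no colons).
run_all() {
  local pairs="" failed=0 id pid pair
  for id in "$@"; do
    run_generator "$id" > "log-$id.txt" 2>&1 &
    pairs="$pairs $id:$!"          # remember which pid belongs to which task
  done
  for pair in $pairs; do
    id=${pair%%:*}
    pid=${pair##*:}
    if wait "$pid"; then
      echo "$id: finished"
    else
      echo "$id: exited non-zero, see log-$id.txt"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]              # succeed only if every generator did
}
```

A clean exit only means the agent process ended normally, not that the task is done; judging the result remains the evaluator's job.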

3. Evaluator: grade, then merge only the winners

Crucially, the generator does not get to declare its own work done. A separate evaluator pass runs each task’s acceptance test in that task’s worktree and merges only the branches that pass. Save as evaluate.sh:

#!/usr/bin/env bash
set -euo pipefail

for id in $(jq -r '.[].id' plan.json); do
  acc=$(jq -r ".[] | select(.id==\"$id\") | .acceptance" plan.json)
  echo "──── evaluating $id: $acc ────"

  if ( cd "../wt-$id" && npm install --silent && eval "$acc" ); then
    echo "✅ $id passed — merging"
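    # A conflict with an earlier merge must not abort the script mid-merge.
    if ! git merge --no-ff "agent/$id" -m "merge $id (evaluator-approved)"; then
      git merge --abort
      echo "⚠️ $id conflicts with an earlier merge; left on branch agent/$id"
    fi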
  else
    echo "❌ $id failed — left on branch agent/$id for review"
  fi

  git worktree remove "../wt-$id" --force
done

# Final gate: the whole suite must be green after all merges.
# Merged branches may have added dependencies, so reinstall first.
npm install --silent
npm test

That last npm test is the loop’s real stop condition — the objective oracle, applied after integration. A branch that fails its acceptance test is never merged; it is parked for a human (or a follow-up Ralph loop) to look at. The judge is not the judged.

This is just CI, in disguise. If your evaluator is a GitHub Actions workflow that runs the acceptance test on each agent branch, you have turned the pattern into a nightly “open PRs, let CI grade them, auto-merge the green ones” pipeline — the production form of the same three roles.

The self-improving twist

The most advanced version closes one more loop: feed the evaluator's failures back to the planner so it rewrites the failed tasks’ specs, and let the system add lessons to its own CLAUDE.md / AGENTS.md instructions so the next run starts smarter. That is what people mean by self-improving agents: the loop edits not just the code, but the instructions that drive the loop. Powerful — and exactly why the network-isolated sandbox is non-negotiable at this tier.
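The "add lessons to its own instructions" step can be as small as an append that refuses duplicates, so repeated failures do not bloat the file. This is a minimal sketch: `record_lesson` is a hypothetical helper, and the one-bullet-per-lesson format is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a lesson to the instructions file, once.
# Usage: record_lesson "text of the lesson" [file]   (file defaults to CLAUDE.md)
record_lesson() {
  local lesson="$1" file="${2:-CLAUDE.md}"
  touch "$file"
  # -F treats the lesson as a fixed string, -x requires a whole-line match,
  # so an identical bullet is never written twice.
  grep -qxF -- "- $lesson" "$file" || printf -- '- %s\n' "$lesson" >> "$file"
}
```

For example, the evaluator's failure branch could call `record_lesson "Run the full suite before marking a task done"`. Keeping a human review step on this file is sensible, since it steers every later run.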

The three rungs, side by side

Simple (Ralph): one prompt, re-run until tests pass.
Intermediate (checklist): a backlog in files, one task per iteration, committed and sandboxed.
Complex (planner–generator–evaluator): parallel agents in worktrees, graded by a separate evaluator, merged only when green.

Every rung is the same plan–act–observe loop; what grows is the memory, the parallelism, and the rigour of the stop condition. Your job, at every level, is to design that loop and then get out of it.
