How to Run Codex Unattended on a Flutter App?
Turn Flutter’s own toolchain — formatter, analyzer, tests, and compiler — into the guardrails that let Codex build for hours and course-correct on its own.
There is a piece of advice going around that, once you internalise it, changes how you run a coding agent on a real project:
“You can go a long way with just Codex and the standard tools integrated with the languages and frameworks you already use. For every task you need to be pretty deliberate about the exact set of work that needs to be done, have the definition of the work stored in a persistent location outside the context window, and have lots of external feedback and validation. The LLM itself cannot be responsible for determining pass / fail criteria on its own. Static checks are required to make sure it can course correct and run for extended periods.”
That is a complete operating manual in one paragraph. This guide turns it into a concrete, runnable setup for a Flutter app driven by Codex — and Flutter turns out to be close to the ideal case, because the standard Flutter toolchain hands you a wall of machine-checkable feedback for free.
It is the third in a short series. If you have not met the underlying idea, the concept article and its worked examples cover designing agentic loops in general. This one is the language-specific, Codex-specific recipe.
The three things the quote is really asking for
Unpack the paragraph and there are three jobs to do, in order:
- A persistent definition of the work, outside the context window. The model is amnesiac and its context rots over a long run. So the spec (the exact, deliberate set of tasks) lives in files on disk that survive every reset: AGENTS.md, SPEC.md, PROGRESS.md.
- Lots of external feedback and validation. Not one test at the end, but a dense, fast layer of checks the agent can run after every change: formatter, analyzer, unit and widget tests, golden tests, a compile.
- Static checks as the only pass/fail oracle. The agent never gets to declare its own work correct. A single script, verify.sh, exits 0 or it does not, and that exit code is the truth. This is the thing that lets it course correct: a precise error message is a far better instruction than any prompt you could write.
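The three jobs above imply a precondition that can itself be checked mechanically before a run starts. The following is a minimal sketch, assuming the file names used in this guide; the function name `preflight` is hypothetical and not part of Codex or Flutter.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: refuse to start a run unless the on-disk memory
# and the pass/fail oracle are in place. File names follow this guide.
preflight() {
  local dir="${1:-.}" missing=0 f
  for f in AGENTS.md SPEC.md PROGRESS.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  # verify.sh must exist and be executable, or nothing can decide pass/fail.
  if [ ! -x "$dir/verify.sh" ]; then
    echo "missing or not executable: verify.sh" >&2
    missing=1
  fi
  return "$missing"
}
```

Calling `preflight .` at the top of a loop script would stop a run that has no spec or no oracle before any model call is made.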
Why Flutter is a near-perfect fit
You do not have to assemble the validation layer from scratch — Flutter ships it. Every one of these is a standard command that returns a non-zero exit code on failure, which is exactly what an unattended loop needs:
- dart format: deterministic formatting; with --set-exit-if-changed it fails if anything is off.
- flutter analyze: the static analyzer, configurable to treat even info-level lints as fatal.
- dart fix: auto-applies a large class of analyzer fixes.
- flutter test: unit and widget tests, with coverage.
- Golden tests: pixel-level UI regression checks, so even “does it look right” becomes objective.
- flutter test integration_test: full end-to-end flows on a device or emulator.
- flutter build: the compiler itself as a final gate.
Together these mean the LLM is almost never the judge. The Dart toolchain is.
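Because each tool reports failure through its exit code, a wrapper needs no knowledge of any tool's output format. A minimal sketch of that idea follows; `run_gate` is a hypothetical helper name, and any command can be passed to it.

```shell
# Hypothetical helper: run one check and trust only its exit code.
# Usage: run_gate <label> <command> [args...]
run_gate() {
  local name="$1"; shift
  if "$@"; then
    echo "PASS $name"
  else
    local code=$?   # exit status of the command that just failed
    echo "FAIL $name (exit $code)"
    return "$code"
  fi
}

# Example (requires the Dart SDK on PATH):
#   run_gate format dart format --output=none --set-exit-if-changed .
```

The helper never inspects what the command printed; the judgement is delegated entirely to the tool.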
Before anything: sandbox the run
An unattended loop runs Codex with approvals turned off so it can edit
files and run commands without stopping to ask. That is the whole point
— and the reason it must not run loose on your machine. Run it in
a disposable container with no real credentials, and
restrict the network to only what the build needs (the model endpoint
plus pub.dev). Codex’s
workspace-write sandbox
limits edits to the project directory; pair it with container-level
network isolation for defence in depth. The
safety section of the first
article and our isolated
toolbox guide both cover this.
flutter pub get needs
network the first time. Either warm the pub cache before you
cut the network, or allow-list pub.dev and
storage.googleapis.com in the sandbox. A loop that dies on
iteration one because it cannot resolve dependencies is a sad loop.
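Warming the cache can be verified rather than assumed. The sketch below resolves dependencies once online and then repeats the resolution with pub's `--offline` flag, which uses only cached packages. The `FLUTTER` variable is a hypothetical override introduced here so the function can be pointed at a specific SDK binary.

```shell
# Sketch: resolve dependencies online once, then prove the cache is enough.
# FLUTTER is a hypothetical override; it defaults to the flutter on PATH.
FLUTTER="${FLUTTER:-flutter}"

warm_pub_cache() {
  "$FLUTTER" pub get || return 1            # online: fills the pub cache
  "$FLUTTER" pub get --offline || return 1  # offline: fails if a package is missing
  echo "pub cache is warm; safe to cut the network"
}
```

Run `warm_pub_cache` in the project directory while the container still has network access, and cut the network only after it succeeds.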
1 Build the validation harness first
Counter-intuitively, you write the checks before you let the agent write the app. The harness is what makes the rest safe, so it comes first.
1a. Make the analyzer strict
A lenient analyzer lets the agent paper over bugs. Crank it up. Add
flutter_lints (the official baseline) or the stricter
very_good_analysis,
then turn on Dart’s strict language modes. Edit
analysis_options.yaml at the project root:
include: package:very_good_analysis/analysis_options.yaml

analyzer:
  language:
    strict-casts: true      # no silent dynamic-to-T casts
    strict-inference: true  # no untyped inference holes
    strict-raw-types: true  # no raw generic types
  errors:
    # treat unfinished work as a hard stop, not a friendly note
    todo: warning
    fixme: warning

linter:
  rules:
    prefer_final_locals: true
    require_trailing_commas: true
Add the dev dependency:
flutter pub add --dev very_good_analysis
1b. Write the one script that decides pass/fail
This is the heart of the whole setup: a single script that runs the
entire feedback wall and exits non-zero the moment anything is wrong.
Order it “fail fast” — the cheap checks first —
so the agent gets the most relevant error soonest. Save it as
verify.sh:
#!/usr/bin/env bash
# The single source of truth for "is the work correct?".
# Exits 0 only if every check passes. Codex may not override this.
set -euo pipefail

echo "▶ 1/6 dependencies"
flutter pub get

echo "▶ 2/6 code generation"
# Run build_runner only if the project declares it, so a genuine codegen
# failure stops the script instead of being silently swallowed.
if grep -q 'build_runner' pubspec.yaml; then
  dart run build_runner build --delete-conflicting-outputs
else
  echo "  (no build_runner in pubspec.yaml, skipping)"
fi

echo "▶ 3/6 formatting"
dart format --output=none --set-exit-if-changed .

echo "▶ 4/6 static analysis"
flutter analyze --fatal-infos --fatal-warnings

echo "▶ 5/6 unit + widget tests (with coverage)"
flutter test --coverage

echo "▶ 6/6 compile gate"
flutter build apk --debug

echo "✅ verify.sh: all checks passed"
Spelling out --fatal-infos matters: dart analyze reports
info-level issues without failing unless you ask it to, and an agent will
happily ignore anything that does not break the run. Recent
flutter analyze releases already treat infos and warnings as fatal by
default, but passing --fatal-infos --fatal-warnings explicitly pins
that behaviour across tools and versions, so the analyzer stays a wall
the agent must get past rather than a suggestion it can skip.
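The test step in verify.sh writes coverage/lcov.info but nothing reads it. One way to turn coverage into a gate is sketched below, using the standard lcov record types LF (lines found) and LH (lines hit). The function names and the 80% default are illustrative assumptions, not a Flutter feature.

```shell
# Sketch: compute line coverage from an lcov tracefile and enforce a floor.
# LF = lines found, LH = lines hit (standard lcov record types).
coverage_percent() {
  awk -F: '
    /^LF:/ { found += $2 }
    /^LH:/ { hit += $2 }
    END {
      if (found == 0) print 0
      else printf "%d\n", (hit * 100) / found   # truncated to a whole percent
    }
  ' "$1"
}

# Usage: check_coverage [tracefile] [minimum-percent]
check_coverage() {
  local file="${1:-coverage/lcov.info}" min="${2:-80}" pct
  pct=$(coverage_percent "$file") || return 1
  echo "line coverage: ${pct}% (minimum ${min}%)"
  [ "$pct" -ge "$min" ]
}
```

Calling `check_coverage coverage/lcov.info 80` after the test step would make the floor part of the same exit code; the threshold is a placeholder to tune per project.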
Integration tests need a device or emulator, so they do not belong in
the fast inner loop. Keep them in a separate script,
verify-e2e.sh, that you run on a tier with an emulator
attached:
#!/usr/bin/env bash
set -euo pipefail
flutter test integration_test
chmod +x verify.sh verify-e2e.sh
2 Store the definition of work on disk
Now the persistent memory the quote insists on. Three files, each with a distinct job, all outside the context window so they survive every reset.
2a. AGENTS.md — the standing rules
Codex automatically reads AGENTS.md from the project root
at the start of every run — it is Codex’s equivalent of
CLAUDE.md. This is where you encode the non-negotiables, so
you never have to repeat them in a prompt. Save as AGENTS.md:
# AGENTS.md — how to work in this repository
This is a Flutter app. `./verify.sh` is the ONLY authority on whether your
work is correct. You may not decide a task is done on your own.
## The loop you must follow for every task
1. Read SPEC.md and PROGRESS.md.
2. Pick the FIRST unchecked `[ ]` task in SPEC.md.
3. Implement it. Write tests if the task names a test file that doesn't exist.
4. Run `./verify.sh`. If it exits non-zero, read the output, fix the
actual cause, and run it again. Repeat until it exits 0.
5. Only then: tick the task `[x]` in SPEC.md and append a short note to
PROGRESS.md (what you did, anything the next run should know).
6. Do EXACTLY ONE task per run, then stop.
## Hard rules
- Never edit, skip, or delete a test to make it pass. Tests are the spec.
- Never weaken analysis_options.yaml or add `// ignore:` to silence the analyzer.
- Keep `dart format` clean. Run `dart fix --apply` before hand-fixing lints.
- Prefer composition and small widgets. No business logic inside `build()`.
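A written rule is still only a request to the model. The rule against `// ignore:` comments can be backed by a static check of its own, in keeping with the idea that the agent never grades itself. A minimal sketch, assuming a grep that supports `--include`; `check_no_ignores` is a hypothetical helper name.

```shell
# Sketch: fail if any Dart file under the given directories contains an
# analyzer suppression comment (`// ignore:` or `// ignore_for_file:`).
check_no_ignores() {
  local hits
  # grep exits 1 when nothing matches; "|| true" keeps that from being fatal.
  hits=$(grep -rnE --include='*.dart' '//[[:space:]]*ignore(_for_file)?:' "$@" 2>/dev/null || true)
  if [ -n "$hits" ]; then
    echo "analyzer suppressions found:" >&2
    echo "$hits" >&2
    return 1
  fi
}
```

A call such as `check_no_ignores lib test` could sit just before the analysis step in verify.sh. Generated files may carry their own `ignore_for_file` header, so a project using code generation would need to exclude them.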
2b. SPEC.md — the deliberate, exact set of work
This is where “be pretty deliberate about the exact set of work” lives. Break the app into small, independently verifiable tasks, and pin each one to a concrete acceptance check. Vague tasks produce vague work; a task without an acceptance test cannot be looped. Here is a spec for a small todo app:
# Build spec — Todo app
Work the FIRST unchecked task only. A task is done when `./verify.sh`
passes AND the named test file exists and is exercised.
- [ ] **model** — immutable `Todo(id, title, done, createdAt)` with
`copyWith` and value equality.
Acceptance: `test/todo_model_test.dart` covers copyWith + equality.
- [ ] **repository** — `TodoRepository` persisting via `shared_preferences`,
with `load()`, `save(List<Todo>)`.
Acceptance: `test/todo_repository_test.dart` with a fake store.
- [ ] **controller** — `TodoController` (ChangeNotifier): add, toggle,
remove, clearCompleted; persists through the repository.
Acceptance: `test/todo_controller_test.dart`.
- [ ] **list-ui** — `TodoListPage` renders todos; tapping a checkbox toggles;
swipe-to-dismiss removes.
Acceptance: `test/todo_list_page_test.dart` (widget test).
- [ ] **empty-state** — friendly empty view when there are no todos.
Acceptance: widget test asserts the empty message.
- [ ] **golden** — golden test of the empty and populated list.
Acceptance: `test/golden/todo_list_golden_test.dart` with committed PNGs.
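Since a task without an acceptance check cannot be looped, the spec itself can be linted before the agent ever sees it. The sketch below assumes the layout used above: a checkbox line starts each task and an `Acceptance:` line appears somewhere before the next task. `lint_spec` is a hypothetical helper name.

```shell
# Sketch: every checkbox task needs an "Acceptance:" line before the next
# task begins. Prints each offending task and returns non-zero if any exist.
lint_spec() {
  awk '
    function flush() {
      if (task != "" && !ok) { print "no acceptance check: " task; bad = 1 }
    }
    /^- \[[ x]\] / { flush(); task = $0; ok = 0; next }  # a new task starts
    /Acceptance:/  { ok = 1 }                            # current task is covered
    END { flush(); exit bad }
  ' "${1:-SPEC.md}"
}
```

Running `lint_spec SPEC.md` before starting the loop catches a vague task while it is still cheap to fix.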
2c. PROGRESS.md — the journal
The agent writes here at the end of each task so the next, freshly amnesiac run does not repeat a dead end or undo a decision. Seed it empty:
echo "# Progress log" > PROGRESS.md
3 Configure Codex for unattended runs
For a loop, Codex needs to run without pausing for approval and with
permission to edit the workspace. Set this up once as a named profile
in ~/.codex/config.toml so every run is consistent and you
do not sprinkle flags everywhere:
# ~/.codex/config.toml
[profiles.flutter-loop]
# model = "..." # optional — omit to use your configured default
approval_policy = "never" # never stop to ask; the sandbox is the guardrail
sandbox_mode = "workspace-write" # read/edit the project + run local commands
Now a single non-interactive run looks like this —
codex exec is the scripted, finishes-and-exits mode:
codex exec --profile flutter-loop \
"Read AGENTS.md, SPEC.md and PROGRESS.md, then do the first unchecked task."
You could pass --sandbox workspace-write and
--approval-policy never directly, but baking them into a
profile keeps the loop script readable and means you change the policy
in one place. (The old --full-auto flag still works but is
deprecated: it just forces workspace-write and
prints a warning.)
4 Write the loop
Everything is now in place: a persistent spec, a wall of validation,
and an objective oracle. The loop just feeds Codex one task at a time
and lets verify.sh decide when the whole thing is done.
Save as loop.sh:
#!/usr/bin/env bash
set -euo pipefail

MAX_ITERATIONS=40

# Count unchecked tasks. grep -c prints 0 but exits 1 when nothing matches,
# hence the "|| true" under set -e.
remaining() { grep -c '^- \[ \]' SPEC.md || true; }

for i in $(seq 1 "$MAX_ITERATIONS"); do
  left=$(remaining)
  echo "════════ iteration $i: $left task(s) left ════════"

  # Objective stop condition: backlog empty AND the full harness green.
  if [ "$left" -eq 0 ] && ./verify.sh >/dev/null 2>&1; then
    echo "✅ spec complete and verify.sh green, stopping"
    exit 0
  fi

  # One task per run. Codex reads its memory from disk; AGENTS.md holds the rules.
  codex exec --profile flutter-loop \
    --output-last-message ".codex-last.txt" \
    "Read AGENTS.md, SPEC.md and PROGRESS.md. Implement the FIRST unchecked
task in SPEC.md. Run ./verify.sh and keep fixing until it exits 0.
Then tick the task in SPEC.md and append a note to PROGRESS.md."

  # Commit each finished step: a durable, reviewable audit trail.
  # Only the first line of the summary goes into the commit subject.
  git add -A
  git commit -q -m "codex: iteration $i: $(head -n 1 .codex-last.txt | cut -c1-72)" || true
done

echo "❌ hit the $MAX_ITERATIONS-iteration cap with $(remaining) task(s) left"
exit 1
Run it inside your sandbox container:
chmod +x loop.sh
./loop.sh 2>&1 | tee loop.log
Two things worth noting in the script. --output-last-message
captures Codex’s final summary to a file, which makes a tidy commit
message and a machine-readable trail of what each iteration claimed to
do. And the git commit per iteration means a wrong turn on
task four is a one-line git revert, not a forensic dig.
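The per-iteration commits make that rollback scriptable. A minimal sketch, assuming each commit subject starts with `codex: iteration N` as in loop.sh; `iteration_commit` is a hypothetical helper name.

```shell
# Sketch: print the hash of the commit the loop made for iteration N.
# The trailing [^0-9] keeps "iteration 4" from also matching "iteration 40".
iteration_commit() {
  git log --format=%H -n 1 --grep="^codex: iteration $1[^0-9]"
}

# Example: undo iteration 4 with a new commit, leaving history intact.
#   git revert --no-edit "$(iteration_commit 4)"
```

Later iterations may build on the reverted work, so running verify.sh straight after a revert is a sensible next step.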
What course-correction actually looks like
This is the payoff the quote promises — “static checks are required to make sure it can course correct and run for extended periods.” A real iteration on the repository task tends to go like this, with zero input from you:
- Codex writes TodoRepository and a test, then runs verify.sh.
- flutter analyze --fatal-infos fails: The argument type 'String?' can't be assigned to the parameter type 'String'. A nullability bug it would never have caught by “thinking harder.”
- It reads the exact file and line from the analyzer output, adds a null check, re-runs.
- Now flutter test fails: the fake store returns stale data after save(). It fixes the persistence call, re-runs.
- verify.sh exits 0. Only now does it tick the box and write to PROGRESS.md. The loop moves on.
At no point did the model decide it was done. The analyzer and the test runner decided, and their precise error messages were better course corrections than any follow-up prompt you could have typed. That is why a loop like this can grind through a whole spec while you are asleep.
A golden test (matchesGoldenFile) freezes the approved pixels into a
committed PNG, so any visual change becomes a hard, objective failure
the agent can see and fix. Generate the baselines once with
flutter test --update-goldens, review them by eye, commit
them, and from then on they are just another wall in
verify.sh.
Bonus: run the same gate in CI
The harness you built for the loop is exactly what you want guarding
every push. Run verify.sh in GitHub Actions and the
standard is identical whether code came from you or from Codex:
# .github/workflows/verify.yml
name: verify
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: subosito/flutter-action@v2
        with: { channel: stable }
      - run: ./verify.sh
And you can close the loop entirely with the
official Codex GitHub Action:
have a scheduled job pick the next unchecked task from
SPEC.md, run Codex against it, and open a pull request that
CI then grades with the very same verify.sh. The judge is
never the judged, all the way out to production.
What you end up with
A Flutter repository where Codex can run unattended for hours: a
deliberate spec in SPEC.md, standing rules in
AGENTS.md, a journal in PROGRESS.md, and a
single verify.sh — format, analyze, test, compile
— that is the sole authority on done. The loop ships one
committed, fully-validated task at a time, and you only step in when
it hits the iteration cap. You stopped prompting the agent; you
built the rails it runs on.
Further reading
- How to Stop Prompting Agents and Start Designing Loops? (the concept behind this series)
- How to Design Agentic Loops: Two Worked Examples
- Codex: non-interactive mode (codex exec)
- Codex: custom instructions with AGENTS.md
- Codex: configuration reference (profiles, sandbox)
- Dart: customizing static analysis
- Flutter: integration testing