How to Run Codex Unattended on a Flutter App?

Turn Flutter’s own toolchain — formatter, analyzer, tests, and compiler — into the guardrails that let Codex build for hours and course-correct on its own.

Read time: About 20 minutes
What you need: Flutter SDK, the Codex CLI, git, and a sandbox
The principle: Static checks decide pass/fail, never the model

There is a piece of advice going around that, once you internalise it, changes how you run a coding agent on a real project:

“You can go a long way with just Codex and the standard tools integrated with the languages and frameworks you already use. For every task you need to be pretty deliberate about the exact set of work that needs to be done, have the definition of the work stored in a persistent location outside the context window, and have lots of external feedback and validation. The LLM itself cannot be responsible for determining pass / fail criteria on its own. Static checks are required to make sure it can course correct and run for extended periods.”

That is a complete operating manual in one paragraph. This guide turns it into a concrete, runnable setup for a Flutter app driven by Codex — and Flutter turns out to be close to the ideal case, because the standard Flutter toolchain hands you a wall of machine-checkable feedback for free.

It is the third in a short series. If you have not met the underlying idea, the concept article and its worked examples cover designing agentic loops in general. This one is the language-specific, Codex-specific recipe.

The three things the quote is really asking for

Unpack the paragraph and there are three jobs to do, in order:

  1. A persistent definition of the work, outside the context window. The model is amnesiac and its context rots over a long run. So the spec — the exact, deliberate set of tasks — lives in files on disk that survive every reset: AGENTS.md, SPEC.md, PROGRESS.md.
  2. Lots of external feedback and validation. Not one test at the end — a dense, fast layer of checks the agent can run after every change: formatter, analyzer, unit and widget tests, golden tests, a compile.
  3. Static checks as the only pass/fail oracle. The agent never gets to declare its own work correct. A single script, verify.sh, exits 0 or it does not, and that exit code is the truth. This is the thing that lets it course correct: a precise error message is a far better instruction than any prompt you could write.
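The third point can be sketched in a few lines of shell. This is a minimal illustration rather than part of the setup: run_gate is a hypothetical helper name, and the only thing it looks at is the exit status of whatever command it is handed.

```shell
#!/usr/bin/env bash
# Sketch: the verdict comes from the exit status, never from what the
# command (or the agent) says about itself.
# run_gate is a hypothetical helper name used for illustration only.

run_gate() {
  # Run the given command quietly; report PASS only if it exits 0.
  if "$@" >/dev/null 2>&1; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

run_gate true    # prints PASS
run_gate false   # prints FAIL
```

In the real setup the command would be ./verify.sh, as in run_gate ./verify.sh.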

Why Flutter is a near-perfect fit

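You do not have to assemble the validation layer from scratch: Flutter ships it. Each of these standard commands, all used in verify.sh later in this guide, returns a non-zero exit code on failure, which is exactly what an unattended loop needs:

  - dart format --output=none --set-exit-if-changed . (formatting)
  - flutter analyze (static analysis and lints)
  - flutter test (unit, widget, and golden tests)
  - flutter build apk --debug (the compile gate)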

Together these mean the LLM is almost never the judge. The Dart toolchain is.

Before anything: sandbox the run

An unattended loop runs Codex with approvals turned off so it can edit files and run commands without stopping to ask. That is the whole point — and the reason it must not run loose on your machine. Run it in a disposable container with no real credentials, and restrict the network to only what the build needs (the model endpoint plus pub.dev). Codex’s workspace-write sandbox limits edits to the project directory; pair it with container-level network isolation for defence in depth. The safety section of the first article and our isolated toolbox guide both cover this.

Flutter gotcha: flutter pub get needs network the first time. Either warm the pub cache before you cut the network, or allow-list pub.dev and storage.googleapis.com in the sandbox. A loop that dies on iteration one because it cannot resolve dependencies is a sad loop.
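One way to handle the first option is to split dependency resolution into an online phase and an offline check. The sketch below assumes the pub get --offline flag behaves as documented (resolve from the local cache only); the function names and the FLUTTER_BIN variable are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: warm the pub cache while the network is up, then confirm that
# resolution still works from the cache alone.
# FLUTTER_BIN is a hypothetical override for the SDK binary to call.
FLUTTER_BIN=${FLUTTER_BIN:-flutter}

warm_pub_cache() {
  # Online phase: resolve and download every dependency once.
  "$FLUTTER_BIN" pub get
}

check_offline_resolution() {
  # Offline phase: resolve from the local cache only. A non-zero exit
  # here means something would need the network inside the sandbox.
  "$FLUTTER_BIN" pub get --offline
}

# Typical order (not executed here):
#   warm_pub_cache             # before cutting the network
#   check_offline_resolution   # after cutting it, before starting the loop
```

If the offline check fails, it is cheaper to find out before the loop starts than on iteration one.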

1 Build the validation harness first

Counter-intuitively, you write the checks before you let the agent write the app. The harness is what makes the rest safe, so it comes first.

1a. Make the analyzer strict

A lenient analyzer lets the agent paper over bugs. Crank it up. Add flutter_lints (the official baseline) or the stricter very_good_analysis, then turn on Dart’s strict language modes. Edit analysis_options.yaml at the project root:

include: package:very_good_analysis/analysis_options.yaml

analyzer:
  language:
    strict-casts: true        # no silent dynamic-to-T casts
    strict-inference: true    # no untyped inference holes
    strict-raw-types: true    # no raw generic types
  errors:
    # treat unfinished work as a hard stop, not a friendly note
    todo: warning
    fixme: warning

linter:
  rules:
    prefer_final_locals: true
    require_trailing_commas: true

Add the dev dependency:

flutter pub add --dev very_good_analysis

1b. Write the one script that decides pass/fail

This is the heart of the whole setup: a single script that runs the entire feedback wall and exits non-zero the moment anything is wrong. Order it “fail fast” — the cheap checks first — so the agent gets the most relevant error soonest. Save it as verify.sh:

#!/usr/bin/env bash
# The single source of truth for "is the work correct?".
# Exits 0 only if every check passes. Codex may not override this.
set -euo pipefail

echo "▶ 1/6 dependencies"
flutter pub get

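echo "▶ 2/6 code generation"   # runs only if pubspec.yaml mentions build_runner
if grep -q 'build_runner' pubspec.yaml; then
  # No "|| true" here: a broken generator must fail the gate, not hide.
  dart run build_runner build --delete-conflicting-outputs
fi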

echo "▶ 3/6 formatting"
dart format --output=none --set-exit-if-changed .

echo "▶ 4/6 static analysis"
flutter analyze --fatal-infos --fatal-warnings

echo "▶ 5/6 unit + widget tests (with coverage)"
flutter test --coverage

echo "▶ 6/6 compile gate"
flutter build apk --debug

echo "✅ verify.sh: all checks passed"
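
Why --fatal-infos matters: the analyzer grades issues as info, warning, or error, and an agent will happily ignore anything that does not fail the run. Plain dart analyze exits 0 when only info-level issues remain. Recent flutter analyze releases already treat infos and warnings as fatal by default, but passing --fatal-infos --fatal-warnings explicitly keeps the gate strict whichever command or default is in play. Every lint becomes a build-breaker, so the analyzer is a wall the agent must get past rather than a suggestion it can skip.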

Integration tests need a device or emulator, so they do not belong in the fast inner loop. Keep them in a separate script, verify-e2e.sh, that you run on a tier with an emulator attached:

#!/usr/bin/env bash
set -euo pipefail
flutter test integration_test

Make both scripts executable:

chmod +x verify.sh verify-e2e.sh
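
verify.sh collects coverage with flutter test --coverage but never checks it. If you want coverage to be one more objective gate, a sketch like the following could read the coverage/lcov.info file that the command writes. The helper names and any threshold you pick are assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: turn an lcov file into a pass/fail gate.
# coverage_percent and coverage_gate are hypothetical helper names.

coverage_percent() {
  # LF = lines found, LH = lines hit; lcov writes one pair per source file.
  # Prints the overall line coverage as a whole number, rounded down.
  awk -F: '
    /^LF:/ { found += $2 }
    /^LH:/ { hit += $2 }
    END {
      if (found == 0) { print 0 } else { printf "%d\n", (hit * 100) / found }
    }
  ' "$1"
}

coverage_gate() {
  # Usage: coverage_gate <lcov-file> <minimum-percent>
  local pct
  pct=$(coverage_percent "$1")
  echo "coverage: ${pct}% (minimum: $2%)"
  [ "$pct" -ge "$2" ]
}

# Demo on a tiny hand-written lcov file: 15 of 20 lines hit.
demo=$(mktemp)
printf 'SF:lib/a.dart\nLF:10\nLH:8\nend_of_record\nSF:lib/b.dart\nLF:10\nLH:7\nend_of_record\n' > "$demo"
coverage_gate "$demo" 70 && echo "gate passed"
coverage_gate "$demo" 80 || echo "gate failed"
rm -f "$demo"
```

In verify.sh this would be a single extra line after the test step, for example coverage_gate coverage/lcov.info 80, with the 80 chosen to suit the project.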

2 Store the definition of work on disk

Now the persistent memory the quote insists on. Three files, each with a distinct job, all outside the context window so they survive every reset.

2a. AGENTS.md — the standing rules

Codex automatically reads AGENTS.md from the project root at the start of every run — it is Codex’s equivalent of CLAUDE.md. This is where you encode the non-negotiables, so you never have to repeat them in a prompt. Save as AGENTS.md:

# AGENTS.md — how to work in this repository

This is a Flutter app. `./verify.sh` is the ONLY authority on whether your
work is correct. You may not decide a task is done on your own.

## The loop you must follow for every task
1. Read SPEC.md and PROGRESS.md.
2. Pick the FIRST unchecked `[ ]` task in SPEC.md.
3. Implement it. Write tests if the task names a test file that doesn't exist.
4. Run `./verify.sh`. If it exits non-zero, read the output, fix the
   actual cause, and run it again. Repeat until it exits 0.
5. Only then: tick the task `[x]` in SPEC.md and append a short note to
   PROGRESS.md (what you did, anything the next run should know).
6. Do EXACTLY ONE task per run, then stop.

## Hard rules
- Never edit, skip, or delete a test to make it pass. Tests are the spec.
- Never weaken analysis_options.yaml or add `// ignore:` to silence the analyzer.
- Keep `dart format` clean. Run `dart fix --apply` before hand-fixing lints.
- Prefer composition and small widgets. No business logic inside `build()`.

2b. SPEC.md — the deliberate, exact set of work

This is where “be pretty deliberate about the exact set of work” lives. Break the app into small, independently verifiable tasks, and pin each one to a concrete acceptance check. Vague tasks produce vague work; a task without an acceptance test cannot be looped. Here is a spec for a small todo app:

# Build spec — Todo app

Work the FIRST unchecked task only. A task is done when `./verify.sh`
passes AND the named test file exists and is exercised.

- [ ] **model** — immutable `Todo(id, title, done, createdAt)` with
      `copyWith` and value equality.
      Acceptance: `test/todo_model_test.dart` covers copyWith + equality.
- [ ] **repository** — `TodoRepository` persisting via `shared_preferences`,
      with `load()`, `save(List<Todo>)`.
      Acceptance: `test/todo_repository_test.dart` with a fake store.
- [ ] **controller** — `TodoController` (ChangeNotifier): add, toggle,
      remove, clearCompleted; persists through the repository.
      Acceptance: `test/todo_controller_test.dart`.
- [ ] **list-ui** — `TodoListPage` renders todos; tapping a checkbox toggles;
      swipe-to-dismiss removes.
      Acceptance: `test/todo_list_page_test.dart` (widget test).
- [ ] **empty-state** — friendly empty view when there are no todos.
      Acceptance: widget test asserts the empty message.
- [ ] **golden** — golden test of the empty and populated list.
      Acceptance: `test/golden/todo_list_golden_test.dart` with committed PNGs.
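
Because every task is a checkbox line with a bolded name, the backlog is readable by a script as well as by the agent. The following is a minimal sketch, assuming the exact "- [ ] **name**" layout used above; remaining_tasks and next_task are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: read the backlog straight from SPEC.md-style checkbox lines.
# remaining_tasks and next_task are hypothetical helper names.

remaining_tasks() {
  # Count lines that start with an unchecked box. grep exits 1 on zero
  # matches but still prints 0, so the failure status is swallowed.
  grep -c '^- \[ \] ' "$1" || true
}

next_task() {
  # Print the bolded name of the first unchecked task, e.g. "repository".
  # A line without a **name** is printed unchanged.
  grep -m1 '^- \[ \] ' "$1" | sed -E 's/^- \[ \] \*\*([^*]+)\*\*.*/\1/'
}

# Demo on a throwaway spec with one finished and two open tasks.
spec=$(mktemp)
printf '%s\n' \
  '- [x] **model** - immutable Todo' \
  '- [ ] **repository** - TodoRepository' \
  '- [ ] **controller** - TodoController' > "$spec"
remaining_tasks "$spec"   # prints 2
next_task "$spec"         # prints repository
rm -f "$spec"
```

A script that can name the next task can also log it, or check afterwards that the agent ticked that task and no other.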

2c. PROGRESS.md — the journal

The agent writes here at the end of each task so the next, freshly amnesiac run does not repeat a dead end or undo a decision. Seed it empty:

echo "# Progress log" > PROGRESS.md

3 Configure Codex for unattended runs

For a loop, Codex needs to run without pausing for approval and with permission to edit the workspace. Set this up once as a named profile in ~/.codex/config.toml so every run is consistent and you do not sprinkle flags everywhere:

# ~/.codex/config.toml

[profiles.flutter-loop]
# model = "..."              # optional — omit to use your configured default
approval_policy = "never"    # never stop to ask; the sandbox is the guardrail
sandbox_mode = "workspace-write"   # read/edit the project + run local commands

Now a single non-interactive run looks like this — codex exec is the scripted, finishes-and-exits mode:

codex exec --profile flutter-loop \
  "Read AGENTS.md, SPEC.md and PROGRESS.md, then do the first unchecked task."

Prefer profiles to flags. You can pass --sandbox workspace-write and --ask-for-approval never directly (flag names and where they are accepted vary between Codex CLI versions, so check codex --help), but baking them into a profile keeps the loop script readable and means you change the policy in one place. (The old --full-auto flag still works but is deprecated: it just forces workspace-write and prints a warning.)

4 Write the loop

Everything is now in place: a persistent spec, a wall of validation, and an objective oracle. The loop just feeds Codex one task at a time and lets verify.sh decide when the whole thing is done. Save as loop.sh:

#!/usr/bin/env bash
set -euo pipefail

MAX_ITERATIONS=40

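# Count unchecked tasks. grep exits 1 on zero matches, hence "|| true".
# The dash is not escaped: newer GNU grep warns about a stray backslash.
remaining() { grep -c '^- \[ \]' SPEC.md || true; }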

for i in $(seq 1 "$MAX_ITERATIONS"); do
  left=$(remaining)
  echo "════════ iteration $i — $left task(s) left ════════"

  # Objective stop condition: backlog empty AND the full harness green.
  if [ "$left" -eq 0 ] && ./verify.sh >/dev/null 2>&1; then
    echo "✅ spec complete and verify.sh green — stopping"
    exit 0
  fi

  # One task per run. Codex reads its memory from disk; AGENTS.md holds the rules.
  codex exec --profile flutter-loop \
    --output-last-message ".codex-last.txt" \
    "Read AGENTS.md, SPEC.md and PROGRESS.md. Implement the FIRST unchecked
     task in SPEC.md. Run ./verify.sh and keep fixing until it exits 0.
     Then tick the task in SPEC.md and append a note to PROGRESS.md."

  # Commit each finished step: a durable, reviewable audit trail.
  git add -A
  git commit -q -m "codex: iteration $i — $(head -c 72 .codex-last.txt)" || true
done

echo "❌ hit the $MAX_ITERATIONS-iteration cap with $(remaining) task(s) left"
exit 1

Run it inside your sandbox container:

chmod +x loop.sh
./loop.sh 2>&1 | tee loop.log

Two things worth noting in the script. --output-last-message captures Codex’s final summary to a file, which makes a tidy commit message and a machine-readable trail of what each iteration claimed to do. And the git commit per iteration means a wrong turn on task four is a one-line git revert, not a forensic dig.
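
One caveat with head -c 72 is that it cuts on bytes and will carry a newline into the commit subject if the summary's first line is short. A slightly more defensive variant, sketched here with a hypothetical helper name, takes only the first line and truncates that.

```shell
#!/usr/bin/env bash
# Sketch: derive a one-line commit subject from Codex's last message.
# commit_subject is a hypothetical helper name.

commit_subject() {
  # Usage: commit_subject <file>
  local line=""
  if [ -r "$1" ]; then
    # read returns non-zero when the line has no trailing newline,
    # but it still fills $line, so the status is ignored.
    IFS= read -r line < "$1" || true
  fi
  [ -n "$line" ] || line="(no summary)"
  # Keep at most the first 72 characters of that first line.
  printf '%s\n' "${line:0:72}"
}

# Demo with a two-line summary.
msg=$(mktemp)
printf 'Added Todo model with copyWith\nAlso wrote equality tests.\n' > "$msg"
commit_subject "$msg"   # prints: Added Todo model with copyWith
rm -f "$msg"
```

In loop.sh the commit line would then use $(commit_subject .codex-last.txt) in place of the head call.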

What course-correction actually looks like

This is the payoff the quote promises — “static checks are required to make sure it can course correct and run for extended periods.” A real iteration on the repository task tends to go like this, with zero input from you:

  1. Codex writes TodoRepository and a test, then runs verify.sh.
  2. flutter analyze --fatal-infos fails: The argument type 'String?' can't be assigned to the parameter type 'String'. A nullability bug it would never have caught by “thinking harder.”
  3. It reads the exact file and line from the analyzer output, adds a null check, re-runs.
  4. Now flutter test fails: the fake store returns stale data after save(). It fixes the persistence call, re-runs.
  5. verify.sh exits 0. Only now does it tick the box and write to PROGRESS.md. The loop moves on.

At no point did the model decide it was done. The analyzer and the test runner decided, and their precise error messages were better course corrections than any follow-up prompt you could have typed. That is why a loop like this can grind through a whole spec while you are asleep.

Golden tests pull UI into the same regime. “Does it look right” usually needs a human — but a golden test (matchesGoldenFile) freezes the approved pixels into a committed PNG, so any visual change becomes a hard, objective failure the agent can see and fix. Generate the baselines once with flutter test --update-goldens, review them by eye, commit them, and from then on they are just another wall in verify.sh.

Bonus: run the same gate in CI

The harness you built for the loop is exactly what you want guarding every push. Run verify.sh in GitHub Actions and the standard is identical whether code came from you or from Codex:

# .github/workflows/verify.yml
name: verify
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: subosito/flutter-action@v2
        with: { channel: stable }
      - run: ./verify.sh

And you can close the loop entirely with the official Codex GitHub Action: have a scheduled job pick the next unchecked task from SPEC.md, run Codex against it, and open a pull request that CI then grades with the very same verify.sh. The judge is never the judged, all the way out to production.

What you end up with

A Flutter repository where Codex can run unattended for hours: a deliberate spec in SPEC.md, standing rules in AGENTS.md, a journal in PROGRESS.md, and a single verify.sh — format, analyze, test, compile — that is the sole authority on done. The loop ships one committed, fully-validated task at a time, and you only step in when it hits the iteration cap. You stopped prompting the agent; you built the rails it runs on.
