
Effective harnesses for long-running agents | Anthropic

Anthropic tackles the long-running AI agent problem by making agents work like engineers handing off between shifts: each session leaves git commits, progress logs, and a structured feature list so the next session knows exactly where things stand.


• Long-running agents fail in predictable ways: they one-shot entire apps, leave half-implemented features, or declare victory too early—even with context compaction
• Solution is a two-agent system: "initializer agent" creates feature_list.json (200+ features), git repo, and init.sh; "coding agent" works on one feature at a time and commits progress
• Key artifacts: JSON feature list (harder for model to corrupt than Markdown), claude-progress.txt, git commits with descriptive messages, and init.sh for quick startup
• Testing is critical: explicitly prompt agents to use browser automation (Puppeteer MCP) and test end-to-end like a human user, not just unit tests
• Every session starts the same way: pwd, read git logs and progress files, pick one feature, run basic smoke test, then implement—mimics what good engineers do
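Concretely, the feature-list artifact and the "pick one feature" step might look like the sketch below. The schema (`id`, `description`, `priority`, `status` fields) is an assumption for illustration; the article specifies only that the list is JSON and that every feature starts as "failing":

```python
import json

# Hypothetical feature_list.json entries; the real schema produced by the
# initializer agent is not specified in the article.
features = [
    {"id": 1, "description": "User can create a new conversation", "priority": 1, "status": "failing"},
    {"id": 2, "description": "Messages stream token-by-token", "priority": 2, "status": "failing"},
    {"id": 3, "description": "Conversation history persists on reload", "priority": 3, "status": "failing"},
]

with open("feature_list.json", "w") as f:
    json.dump({"features": features}, f, indent=2)

def next_feature(path="feature_list.json"):
    """Return the highest-priority feature still marked 'failing', or None."""
    with open(path) as f:
        data = json.load(f)
    failing = [ft for ft in data["features"] if ft["status"] == "failing"]
    return min(failing, key=lambda ft: ft["priority"]) if failing else None
```

Keeping the list machine-readable is what lets each fresh session deterministically resume at the right place instead of re-deciding scope from scratch.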

The core problem with long-running AI agents is that they work in discrete sessions with no memory between context windows—like engineers working in shifts where each new person has no idea what the previous shift did. Even Opus 4.5 with the Claude Agent SDK fails to build production-quality apps when given high-level prompts like "build a claude.ai clone." The failures are predictable: agents try to implement everything at once, run out of context mid-feature, or look around after some progress and declare the job done.

Anthropic's solution mimics human engineering practices through a two-agent architecture. The "initializer agent" sets up the environment on first run: a feature_list.json file with 200+ granular features (all marked "failing"), a git repo with initial commit, a claude-progress.txt log, and an init.sh script. The "coding agent" then works incrementally—one feature at a time—and leaves clean artifacts: git commits with descriptive messages and progress updates. Using JSON for the feature list proved critical because models are less likely to inappropriately modify JSON than Markdown. Every session starts with a routine: run pwd, read git logs and progress files, choose the highest-priority incomplete feature, run init.sh to start the dev server, and execute a basic smoke test before implementing anything new.
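The session-start routine maps naturally onto a small script. A minimal Python sketch, using the file names from the article (claude-progress.txt, init.sh) but with shell commands standing in for the agent's actual tool calls:

```python
import subprocess
from pathlib import Path

def run(cmd):
    """Run a shell command; return its output, or a note if it fails."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else f"[{cmd} failed]"

def start_session():
    """Gather orientation context before implementing anything new."""
    context = {
        "cwd": run("pwd"),
        "recent_commits": run("git log --oneline -10"),
        "progress": Path("claude-progress.txt").read_text()
                    if Path("claude-progress.txt").exists() else "",
    }
    # Next steps in the routine (not run here): launch the dev server via
    # ./init.sh, execute a basic smoke test, then pick one feature to implement.
    return context
```

The point of the routine is that orientation is cheap and deterministic: the agent reads state from artifacts rather than trying to reconstruct it from a context window it no longer has.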

The final piece is rigorous testing. Without explicit prompting, Claude would make code changes and even run unit tests but fail to verify end-to-end functionality. Providing browser automation tools (Puppeteer MCP) and prompting the agent to test like a human user dramatically improved performance. The approach still has limitations—Claude can't see browser-native alert modals through Puppeteer, making those features buggier—but the framework demonstrates that better harness design, not just better models, is key to long-running agent success. Open questions remain about whether specialized agents (testing agent, QA agent, cleanup agent) would outperform a single general-purpose agent, and whether these patterns generalize beyond web development to fields like scientific research.
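One way to encode that testing discipline is to gate status changes on an end-to-end check, so a feature never flips to "passing" on the strength of unit tests alone. A minimal sketch, with a stand-in callable where the real Puppeteer-driven browser test would run:

```python
def mark_passing_if_verified(features, feature_id, end_to_end_check):
    """Flip a feature to 'passing' only when its end-to-end check succeeds.

    `end_to_end_check` is a stand-in for a real browser-level test (e.g.
    driving the running app through Puppeteer as a human user would); here
    it is any callable returning True or False.
    """
    for feature in features:
        if feature["id"] == feature_id and end_to_end_check():
            feature["status"] = "passing"
    return features

# Feature entries mirror the feature_list.json sketch: status starts "failing".
features = [{"id": 1, "description": "Send a chat message", "status": "failing"}]
mark_passing_if_verified(features, 1, lambda: True)
```

A failing check leaves the feature marked "failing", so the next session's routine picks it back up rather than building on top of unverified work.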