
Continually improving our agent harness · Cursor

Cursor reveals how they make the same AI models work noticeably better through obsessive harness engineering—measuring everything from "Keep Rate" of agent code to fixing "context anxiety" where models refuse work as their context fills up.

Summary

• The "harness" (infrastructure around the model) matters as much as the model itself—Cursor customizes prompts, tools, and context management per model to exploit each one's strengths and quirks
• They measure agent quality through "Keep Rate" (how much agent-generated code survives over time) and LLM-based analysis of user satisfaction, not just speed or token efficiency
• Tool call errors are classified (InvalidArguments, ProviderError, etc.) with anomaly detection alerts, and they use AI agents to automatically surface and fix infrastructure issues
• Different models need radically different treatment: OpenAI models are literal and precise, Claude is intuitive; they even discovered one model developing "context anxiety" and refusing work as context filled up
• Mid-conversation model switching is hard because of cache misses and out-of-distribution conversation histories—they mitigate with conversation summarization but recommend staying with one model per task

Cursor's approach to building their AI coding agent centers on what they call the "harness"—the infrastructure, tooling, and orchestration layer around the language model. Their core thesis is that the harness is as important as the model itself, and that obsessive iteration on this layer is what makes their agent noticeably faster and smarter than competitors using identical underlying models.

Their engineering methodology has evolved significantly as models improved. Early versions relied on heavy guardrails (surfacing lint errors after every edit, rewriting file reads, limiting tool calls) and lots of static context like folder layouts and code snippets. Now they've stripped most of that away in favor of dynamic context that agents fetch while working.

They measure improvements through both offline evals (their CursorBench suite) and online A/B tests on real usage. The key metrics go beyond obvious ones like latency: "Keep Rate" tracks what fraction of agent-generated code survives in the codebase over time, while an LLM analyzes user responses to determine satisfaction. Tool call errors are classified into categories (InvalidArguments for model mistakes, ProviderError for vendor outages, UnexpectedEnvironment for context contradictions), with anomaly detection to catch regressions and AI agents automatically surfacing issues and creating tickets.
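
That error taxonomy can be pictured as a tagged type plus a classifier. The sketch below is hypothetical: the category names come from the article, but the classification heuristics, the `classifyToolError` and `isAnomalous` names, and the anomaly threshold are illustrative assumptions, not Cursor's implementation.

```typescript
// Hypothetical sketch of the tool-call error taxonomy. The category names
// (InvalidArguments, ProviderError, UnexpectedEnvironment) are from the
// article; the classification heuristics are illustrative assumptions.

type ToolCallErrorKind =
  | "InvalidArguments"       // the model emitted malformed tool arguments
  | "ProviderError"          // the model vendor's API failed or timed out
  | "UnexpectedEnvironment"  // the environment contradicted the agent's context
  | "Unknown";

interface ToolCallError {
  kind: ToolCallErrorKind;
  tool: string;
  message: string;
}

function classifyToolError(
  tool: string,
  err: { status?: number; message: string }
): ToolCallError {
  let kind: ToolCallErrorKind = "Unknown";
  if (/argument|schema|parse/i.test(err.message)) {
    kind = "InvalidArguments";
  } else if (err.status !== undefined && err.status >= 500) {
    kind = "ProviderError";
  } else if (/no such file|stale|missing/i.test(err.message)) {
    kind = "UnexpectedEnvironment";
  }
  return { kind, tool, message: err.message };
}

// A simple anomaly gate over per-category error rates; real monitoring
// would use something more robust than a fixed multiplier over a baseline.
function isAnomalous(currentRate: number, baselineRate: number, factor = 3): boolean {
  return currentRate > baselineRate * factor;
}
```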
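
Similarly, the "Keep Rate" metric can be pictured as a diff-survival computation: check how many agent-added lines still exist in the file at a later checkpoint. This is a minimal sketch under that assumption; `AgentEdit`, `computeKeepRate`, and the line-matching heuristic are hypothetical, not Cursor's actual pipeline.

```typescript
// Hypothetical sketch of a keep-rate signal: the fraction of lines an
// agent added that still exist in the file at a later checkpoint.

interface AgentEdit {
  file: string;
  addedLines: string[]; // lines the agent inserted at edit time
}

function computeKeepRate(edit: AgentEdit, fileLinesLater: string[]): number {
  if (edit.addedLines.length === 0) return 1; // nothing to keep or lose
  const later = new Set(fileLinesLater.map((line) => line.trim()));
  const kept = edit.addedLines.filter((line) => later.has(line.trim())).length;
  return kept / edit.addedLines.length;
}

// Example: one of two agent-added lines survived a later refactor.
const edit: AgentEdit = {
  file: "src/retry.ts",
  addedLines: ["const maxRetries = 3;", "await sleep(backoff);"],
};
const fileOneWeekLater = [
  "const maxRetries = 3;",
  "await sleep(jitter(backoff));",
];
console.log(computeKeepRate(edit, fileOneWeekLater)); // 0.5
```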

The deepest work happens when customizing the harness for each model. OpenAI models get patch-based file editing tools because that's what they were trained on; Anthropic models get string replacement. Prompting differs too: OpenAI models respond best to literal, precise instructions, while Claude works more intuitively. When they get early access to new models, they spend weeks tuning the harness to each one's quirks. One model developed "context anxiety," refusing work as its context window filled up, which they mitigated through prompt adjustments.

Mid-conversation model switching is particularly tricky: different models expect different tool shapes, and the existing conversation history is out-of-distribution for the incoming model, so they use conversation summarization to reduce cache penalties. Looking forward, they see the future as multi-agent systems where specialized agents handle different subtasks, with the harness orchestrating which agent to dispatch and how to stitch results together, making harness engineering even more critical than it is today. Two sketches of these per-model mechanics follow.
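
The per-model editing tools described above can be pictured as different tool schemas selected by model family. The sketch below is an assumption-laden illustration: the tool names, JSON Schemas, and `editToolsFor` helper are hypothetical stand-ins for patch-style versus string-replacement editing, not Cursor's actual tool definitions.

```typescript
// Hypothetical sketch of exposing different edit-tool shapes per model
// family: patch-style editing for OpenAI models, string replacement for
// Anthropic models. Tool names and schemas here are illustrative.

type ModelFamily = "openai" | "anthropic";

interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
}

const patchEditTool: ToolDefinition = {
  name: "apply_patch",
  description: "Apply a unified-diff style patch to a file.",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string" },
      patch: { type: "string", description: "Unified diff hunks" },
    },
    required: ["path", "patch"],
  },
};

const stringReplaceTool: ToolDefinition = {
  name: "str_replace",
  description: "Replace one exact occurrence of old_str in a file.",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string" },
      old_str: { type: "string" },
      new_str: { type: "string" },
    },
    required: ["path", "old_str", "new_str"],
  },
};

function editToolsFor(family: ModelFamily): ToolDefinition[] {
  return family === "openai" ? [patchEditTool] : [stringReplaceTool];
}
```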
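
And mid-conversation switching with summarization might look like the following: instead of replaying the old model's tool-call transcript, the history is collapsed into a compact summary prefix. A minimal sketch, assuming a `summarize()` function backed by an LLM call; everything here is illustrative.

```typescript
// Hypothetical sketch of mid-conversation model switching. Replaying the
// old model's tool calls verbatim would be out-of-distribution for the new
// model and guarantees a prompt-cache miss, so the history is collapsed
// into a summary. summarize() stands in for an LLM summarization call.

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function switchModel(
  history: Message[],
  summarize: (msgs: Message[]) => Promise<string>
): Promise<Message[]> {
  const summary = await summarize(history);
  // The new model starts from a compact, in-distribution prefix instead
  // of the previous model's tool-call transcript.
  return [
    {
      role: "system",
      content: `Summary of the conversation so far:\n${summary}`,
    },
  ];
}
```

The trade-off this encodes is the one the article names: summarization discards detail to buy cache efficiency and in-distribution input, which is why the recommendation is still to stay with one model per task.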