
Training Composer for longer horizons

Cursor trained its AI coding agent, Composer, to learn which information matters by making self-summarization part of the RL training loop, enabling it to solve problems that require hundreds of actions and whose trajectories exceed its context window.


• Traditional context compaction (prompted summarization or sliding windows) causes models to forget critical information; Cursor instead trains Composer to learn what to preserve through reinforcement learning
• Self-summarization cuts compaction errors by 50% while using 1/5 the tokens (1,000 vs 5,000+) because the model learns contextually what information is high-value
• The training process rewards the summaries themselves, not just final outputs—poor summaries that lose critical info get downweighted
• Composer solved the notoriously hard "make-doom-for-mips" problem in 170 turns by compressing 100k+ tokens to 1k of what it deemed most useful
• Agent trajectories are growing faster than model context windows—this is a fundamental scaling bottleneck that compaction-in-the-loop addresses
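
Mechanically, compaction-in-the-loop looks roughly like the sketch below. Every name here (`run_agent`, `ToyModel`, `TOKEN_TRIGGER`) is illustrative, not Cursor's actual API, and the real trigger would be a context-window budget far larger than 100 tokens:

```python
# Illustrative sketch of compaction-in-the-loop generation.
# All names and thresholds are hypothetical, not Cursor's implementation.

TOKEN_TRIGGER = 100  # pause-and-summarize threshold; real budgets are much larger


def count_tokens(messages):
    """Crude token proxy: whitespace word count across messages."""
    return sum(len(m.split()) for m in messages)


def run_agent(model, task, max_turns=20):
    """Generate until the context exceeds TOKEN_TRIGGER, then replace the
    accumulated history with the model's own summary and continue."""
    context = [task]
    for _ in range(max_turns):
        action = model.act(context)
        context.append(action)
        if action == "DONE":
            break
        if count_tokens(context) > TOKEN_TRIGGER:
            # The model compresses its own history; the summary becomes the
            # new context. In training this is a learned behavior, not a
            # bolt-on prompt.
            summary = model.summarize(context)
            context = [task, summary]
    return context


class ToyModel:
    """Deterministic stand-in: emits fixed-length actions, summarizes by truncation."""

    def __init__(self):
        self.turn = 0

    def act(self, context):
        self.turn += 1
        return "DONE" if self.turn >= 10 else " ".join(["step"] * 20)

    def summarize(self, context):
        return "summary: " + " ".join(context[-1].split()[:5])
```

The point of the sketch is the control flow: summarization is an ordinary step inside the generation loop, so the same policy that acts is the one that decides what survives compaction.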

Cursor identified a core problem with AI agents: trajectories are expanding faster than context windows, and traditional compaction methods (prompted summarization or sliding windows) cause models to forget critical information. Their solution is to make self-summarization a trained behavior rather than a bolt-on feature. During training, Composer generates until hitting a token trigger, then pauses to summarize its own context before continuing. Critically, the summaries themselves get rewarded during RL training—good summaries that preserve important information are upweighted, while poor ones that lose critical details are downweighted.
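
One plausible way to give the summaries their own learning signal is to treat each summarization step as an action and weight it by the trajectory's advantage, so a summary followed by failure is pushed down. The group-relative baseline below is an assumption for illustration; Cursor has not published its exact objective:

```python
# Hypothetical credit-assignment sketch: summary steps are treated as actions
# and share the trajectory's advantage under a group-relative baseline.
# This is an assumption, not Cursor's published training objective.


def summary_advantages(trajectories, rewards):
    """For each trajectory, weight every step (summaries included) by
    reward - mean(rewards). A summary that preserved what mattered and led
    to success is upweighted; one that lost critical information and led to
    failure is downweighted."""
    baseline = sum(rewards) / len(rewards)
    return [
        [(step["kind"], reward - baseline) for step in traj]
        for traj, reward in zip(trajectories, rewards)
    ]
```

Because the summary tokens are not masked out of the policy-gradient loss, the model is optimized for what to preserve, not just how to act.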

The results are striking. Compared to a heavily engineered baseline with thousands of tokens of summarization prompts producing 5,000+ token summaries, Composer's learned approach uses a simple "please summarize" prompt and outputs ~1,000 token summaries while cutting compaction errors by 50%. It learns contextually what information is high-value rather than following rigid rules. As a case study, Composer solved the notoriously difficult "make-doom-for-mips" problem from Terminal-Bench 2.0 in 170 turns, compressing over 100,000 tokens down to the 1,000 it believed would most help solve the problem.

The broader implication is that this approach enables models to one-shot hard problems requiring long reasoning chains. By training compaction into the loop rather than treating it as a separate engineering problem, Cursor created a model that can handle tasks requiring hundreds of actions—a fundamental requirement for real-world software engineering where the most valuable problems can't be solved in a single context window.