
Everything We Got Wrong About Research-Plan-Implement - Dexter Horthy

The guy who popularized Research-Plan-Implement admits it was broken: requiring "magic words" to work meant the tool was bad, not the users—and the real problem is engineers outsourcing thinking to agents instead of using them for leverage.

Summary

• RPI's single 85-instruction mega-prompt, stacked on top of system prompts and tool definitions, exceeded models' "instruction budget" (~150-200 instructions max), causing inconsistent results and requiring "magic words" to work properly
• Reading 1000-line plans before 1000-line code reviews isn't leverage—split planning into 200-line design docs and structure outlines to catch bad decisions before code exists
• Models can only follow ~150-200 instructions reliably; split workflows into focused <40-instruction prompts and use control flow instead of prompt-based routing
• "Vertical plans" (build + test in slices) beat "horizontal plans" (all DB, then all API, then all frontend) because you catch errors early instead of debugging 1200 lines at once
• The new framework is QRSPI: Questions → Research → Structure → Plan → Implement, with human alignment at each stage and zero tolerance for not reading production code

Dexter Horthy, who helped popularize the Research-Plan-Implement methodology for AI coding agents, admits the approach had fundamental flaws that only became apparent when teams tried to scale it. The core problem: RPI relied on a single mega-prompt with 85+ instructions, which, stacked on top of system prompts, tool definitions, and MCP servers, exceeded the "instruction budget" of frontier LLMs (models can only reliably follow roughly 150-200 instructions total), and adherence drops dramatically past that point. The symptom was that skilled engineers got great results while teams struggled, and the "fix" was telling people to use "magic words" like "work back and forth with me starting with your open questions." If your tool requires hours of training and secret phrases to work, the tool is broken.
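The budget problem is back-of-envelope arithmetic. As a sketch (the per-component counts below are illustrative assumptions, not figures from the article):

```python
# Hypothetical arithmetic: an 85-instruction mega-prompt does not blow the
# budget on its own, but the full context window's instruction load does.
# Every per-component count here is an assumed, illustrative number.
BUDGET = 175  # midpoint of the ~150-200 instruction range cited above

context = {
    "system prompt": 60,
    "tool definitions": 40,
    "mcp servers": 30,
    "rpi mega-prompt": 85,
}
total = sum(context.values())
print(total, total > BUDGET)  # 215 True: adherence degrades past the budget
```

The point is that the mega-prompt's instruction count can only be judged alongside everything else already in context.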

The deeper issue was that RPI encouraged outsourcing thinking to agents. Engineers would feed tickets directly to research agents, which produced opinionated research instead of objective facts. The planning phase would skip critical alignment steps and jump straight to writing 1000-line plans. Teams were told to review these plans instead of code, but plans and implementation would diverge, forcing double reviews. The leverage was illusory—you still spent the same time on alignment and review, just in the wrong places. Worse, models naturally write "horizontal plans" (all database changes, then all services, then all API, then all frontend), which means you're 1200 lines deep before you can test anything.
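The horizontal-versus-vertical distinction can be made concrete by asking one question of a plan: how many steps land before anything is verifiable? A minimal sketch, with hypothetical plan data:

```python
# Hypothetical sketch of why vertical plans surface errors earlier.
# A plan is modeled as ordered steps, each flagged by whether anything
# can actually be run and verified once that step lands.

def steps_before_first_test(plan: list[dict]) -> int:
    """Count how many plan steps land before the first verification point."""
    for i, step in enumerate(plan):
        if step["testable"]:
            return i + 1
    return len(plan)

# Horizontal: every layer is finished before anything runs end to end
horizontal = [
    {"step": "all database changes", "testable": False},
    {"step": "all services", "testable": False},
    {"step": "all API endpoints", "testable": False},
    {"step": "all frontend", "testable": True},
]

# Vertical: each slice cuts through every layer and is testable on its own
vertical = [
    {"step": "slice 1: db + service + api + ui for one field", "testable": True},
    {"step": "slice 2: next field, end to end", "testable": True},
]

print(steps_before_first_test(horizontal))  # 4: debugging starts layers deep
print(steps_before_first_test(vertical))    # 1: the first slice verifies itself
```

With the horizontal ordering, the first failure you see could have been introduced anywhere in those four layers; with vertical slices, the blast radius of any error is one thin slice.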

The solution is QRSPI: Questions → Research → Structure → Plan → Implement. Split the mega-prompt into focused prompts with <40 instructions each. Hide the ticket from research context to keep it objective. Create a 200-line "design discussion" artifact that forces the agent to brain-dump its understanding, patterns it found, and open questions before any code exists. Add a "structure outline" (like C header files) showing the order of changes and test points—forcing vertical planning where you build and verify in slices. Use actual control flow instead of prompt-based routing. Most importantly: read the production code. The industry tried not reading code for six months and it produced slop that had to be ripped out. Aim for 2-3x productivity with quality, not 10x with garbage. The goal isn't faster shipping—it's sustainable AI-augmented engineering where humans own the thinking and agents provide leverage.
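The stages above can be sketched as explicit control flow. This is a hypothetical illustration, not Horthy's implementation: `call_agent` and `human_approves` are stubs standing in for a real LLM client and a real review checkpoint.

```python
# Hypothetical sketch: QRSPI as explicit control flow rather than one
# mega-prompt, with each stage getting its own small focused prompt.

def call_agent(prompt: str, context: dict) -> dict:
    """Stub for an LLM call; a real version would send a <40-instruction prompt."""
    return {"stage": prompt, "output": f"<{prompt} result>", **context}

def human_approves(artifact: dict) -> bool:
    """Stub for the human-alignment checkpoint at each stage."""
    return True  # in practice: show the artifact, wait for sign-off

def qrspi(ticket: str) -> list[dict]:
    results = []
    # Questions: surface open questions before any research happens
    results.append(call_agent("questions", {"ticket": ticket}))
    # Research: the ticket is deliberately withheld from this stage so the
    # research stays objective rather than opinionated
    results.append(call_agent("research", {}))
    # Later stages are gated by ordinary control flow, not by asking the
    # model to route itself between steps
    for stage in ("structure", "plan", "implement"):
        artifact = call_agent(stage, {"prior": results[-1]["output"]})
        if not human_approves(artifact):
            raise RuntimeError(f"alignment failed at {stage}")
        results.append(artifact)
    return results

stages = [r["stage"] for r in qrspi("TICKET-123")]
print(stages)  # the five stages run in fixed order with a gate at each
```

The design choice worth noting: the sequencing, gating, and context-hiding all live in ordinary code, so the model never has to spend instruction budget on routing decisions.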