GPT 4.1 Prompting Guide | OpenAI Cookbook
GPT-4.1 follows instructions more literally than GPT-4o, breaking many existing prompts—but three simple system prompt reminders can boost agentic coding performance by 20%, and this guide shows you exactly how to migrate.
TLDR
• GPT-4.1's literal instruction-following means it no longer infers intent—you must explicitly specify every behavior, but this makes it highly steerable with the right prompts
• Three critical agentic reminders (persistence, tool-calling, planning) increased SWE-bench Verified scores by ~20% in OpenAI's testing
• Use the API's tools field instead of manual injection (2% gain), and place instructions at both beginning and end of long context
• Conflicting instructions break the model—later instructions win, so audit your prompts for consistency and use the recommended structure (Role → Instructions → Steps → Examples)
• OpenAI provides a specific diff format and reference implementation that eliminates guesswork for coding agents
In Detail
GPT-4.1 represents a fundamental shift in how OpenAI models interpret prompts. Unlike GPT-4o, which liberally inferred user intent, GPT-4.1 follows instructions literally and precisely. This means many existing prompts will underperform or break entirely—but it also means the model is highly steerable if you know the right patterns. OpenAI's internal testing revealed that three simple system prompt additions can transform the model from "chatbot-like" to "eager agent," boosting SWE-bench Verified performance by nearly 20%.
The three critical agentic reminders are: (1) Persistence: tell the model to keep working until the problem is fully solved rather than yielding control back prematurely; (2) Tool-calling: instruct it to use its tools to gather information rather than guess or hallucinate answers; (3) Planning (optional): prompt it to reason out loud between tool calls rather than chaining them silently. These reminders, combined with well-structured tool definitions passed through the API's tools field (not manually injected into the prompt text), enable state-of-the-art agentic performance. OpenAI also provides a specific diff format and a Python reference implementation that the model was trained on, eliminating the trial-and-error of getting code edits right.
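A minimal sketch of how these pieces fit together. The reminder wording below paraphrases the guide rather than quoting it verbatim, and `read_file` is a hypothetical example tool; the exact tool schema shown matches the Responses API style and may differ slightly depending on which endpoint you use.

```python
# Three agentic reminders, joined into one system prompt.
PERSISTENCE = (
    "You are an agent: keep going until the user's query is completely "
    "resolved before ending your turn. Only terminate when you are sure "
    "the problem is solved."
)
TOOL_CALLING = (
    "If you are not sure about file content or codebase structure, use "
    "your tools to read files and gather information; do NOT guess or "
    "make up an answer."
)
PLANNING = (
    "Plan extensively before each tool call, and reflect on the outcome "
    "of previous calls, rather than chaining tool calls silently."
)

system_prompt = "\n\n".join([PERSISTENCE, TOOL_CALLING, PLANNING])

# Tool definitions go in the request's `tools` field, not pasted into the
# prompt text. `read_file` is an illustrative tool, not part of the guide.
tools = [
    {
        "type": "function",
        "name": "read_file",
        "description": "Read a file from the repository and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Repo-relative path of the file to read.",
                }
            },
            "required": ["path"],
        },
    }
]

# The request itself would then look roughly like (requires an API key):
# response = client.responses.create(
#     model="gpt-4.1",
#     instructions=system_prompt,
#     input=user_message,
#     tools=tools,
# )
```

Keeping tool definitions in the `tools` field rather than describing them inline is what OpenAI measured as the ~2% gain mentioned above: the model was trained on tools presented through that field.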
For long context and instruction following, the key insight is that GPT-4.1 is literal to a fault. Instructions should appear at both the beginning and end of long context for best performance. Conflicting instructions cause failures, with later instructions taking precedence. The recommended prompt structure is: Role/Objective → Instructions (with subsections for detail) → Reasoning Steps → Output Format → Examples → Context. Use markdown for structure, XML for nested documents (not JSON, which performs poorly in long context), and be exhaustively explicit about desired behaviors—the model won't fill in gaps anymore.
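The structure above can be sketched as a prompt builder. The section headings follow the recommended order from the guide; the builder repeats the instruction block before and after the long context, and wraps documents in XML tags rather than JSON. The `build_prompt` helper, its document format, and the sample instructions are illustrative assumptions, not OpenAI's reference code.

```python
# Instruction block using the recommended markdown section order:
# Role/Objective -> Instructions -> Reasoning Steps -> Output Format.
INSTRUCTIONS = """\
# Role and Objective
You are a support agent answering questions strictly from the provided documents.

# Instructions
- Cite the document id for every claim you make.
- If the answer is not in the documents, say so explicitly.

# Reasoning Steps
Identify the relevant documents, extract evidence, then answer.

# Output Format
A short answer followed by the list of cited document ids."""


def build_prompt(documents: list[tuple[str, str]], question: str) -> str:
    """Assemble a long-context prompt with instructions at both ends."""
    # XML tags for nested documents; the guide reports JSON performs
    # poorly as a long-context document wrapper.
    docs = "\n".join(
        f'<doc id="{doc_id}">\n{text}\n</doc>' for doc_id, text in documents
    )
    # Instruction sandwich: once before the context, once after it.
    return (
        f"{INSTRUCTIONS}\n\n# Context\n{docs}\n\n"
        f"{INSTRUCTIONS}\n\n# Question\n{question}"
    )
```

For shorter prompts where duplication feels heavy-handed, placing the instructions only above the context is the guide's second-best option; duplicating them matters most as the context grows long.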