
Demystifying evals for AI agents \ Anthropic

Anthropic's battle-tested playbook for evaluating AI agents: why the same capabilities that make agents useful (autonomy, tool use, multi-turn reasoning) make them uniquely hard to measure, and the specific grader types, metrics, and 8-step roadmap that work in production.


My Notes (5)

Understanding AI agent performance

| Method | Pros | Cons |
| --- | --- | --- |
| **Automated evals** (running tests programmatically without real users) | Faster iteration; fully reproducible; no user impact; can run on every commit; tests scenarios at scale without requiring a prod deployment | Requires more upfront investment to build; needs ongoing maintenance as the product and model evolve to avoid drift; can create false confidence if it doesn't match real usage patterns |
| **Production monitoring** (tracking metrics and errors in live systems) | Reveals real user behavior at scale; catches issues that synthetic evals miss; provides ground truth on how agents actually perform | Reactive: problems reach users before you know about them; signals can be noisy; requires investment in instrumentation; lacks ground truth for grading |
| **A/B testing** (comparing variants with real user traffic) | Measures actual user outcomes (retention, task completion); controls for confounds; scalable and systematic | Slow: days or weeks to reach significance, and requires sufficient traffic; only tests changes you deploy; little signal on the underlying "why" behind metric changes without thorough transcript review |
| **User feedback** (explicit signals like thumbs-down or bug reports) | Surfaces problems you didn't anticipate; comes with real examples from actual users; often correlates with product goals | Sparse and self-selected; skews toward severe issues; users rarely explain why something failed; not automated; relying primarily on users to catch issues can harm the user experience |
| **Manual transcript review** (humans reading through agent conversations) | Builds intuition for failure modes; catches subtle quality issues that automated checks miss; helps calibrate what "good" looks like | Time-intensive; doesn't scale; coverage is inconsistent; reviewer fatigue or differing reviewers can degrade signal quality; typically gives only qualitative rather than quantitative signal |
| **Systematic human studies** (structured grading of agent outputs by trained raters) | Gold-standard quality judgments from multiple human raters; handles subjective or ambiguous tasks; provides signal for improving model-based graders | Relatively expensive with slow turnaround; hard to run frequently; inter-rater disagreement requires reconciliation; complex domains (legal, finance, healthcare) require expert raters |

"in some internal evals we observed Claude gaining an unfair advantage on some tasks by examining the git history from previous trials"

lol

Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization: if you only test whether the agent searches when it should, you might end up with an agent that searches for almost everything. Try to avoid class-imbalanced evals.

We learned this firsthand when building evals for web search in Claude.ai. The challenge was preventing the model from searching when it shouldn't, while preserving its ability to do extensive research when appropriate. The team built evals covering both directions: queries where the model should search (like finding the weather) and queries where it should answer from existing knowledge (like "who founded Apple?"). Striking the right balance between undertriggering (not searching when it should) and overtriggering (searching when it shouldn't) was difficult, and took many rounds of refinement to both the prompts and the eval. As more example problems surface, we continue adding to the evals to improve coverage.
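The two-sided structure above can be sketched as a tiny trigger eval. The case lists and the `toy_should_search` heuristic below are illustrative stand-ins, not Anthropic's actual harness or data:

```python
# A minimal two-sided trigger eval: count failures in BOTH directions
# so optimizing one side can't silently wreck the other.
SHOULD_SEARCH = [
    "what's the weather in Paris today?",
    "latest score in the Lakers game",
]
SHOULD_NOT_SEARCH = [
    "who founded Apple?",
    "explain how binary search works",
]

def toy_should_search(query: str) -> bool:
    # Hypothetical stand-in for the agent's search decision:
    # trigger only on time-sensitive language.
    return any(w in query.lower() for w in ("today", "latest", "current"))

def evaluate(decide) -> tuple[int, int]:
    # Undertriggering: should have searched but didn't.
    undertriggers = sum(not decide(q) for q in SHOULD_SEARCH)
    # Overtriggering: searched when existing knowledge sufficed.
    overtriggers = sum(decide(q) for q in SHOULD_NOT_SEARCH)
    return undertriggers, overtriggers

print(evaluate(toy_should_search))  # (0, 0)
```

Tracking the two error counts separately, rather than one accuracy number, is what keeps the eval from rewarding "always search" or "never search."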

"auditing Terminal-Bench revealed that if a task asks the agent to write a script but doesn’t specify a filepath, and the tests assume a particular filepath for the script, the agent might fail through no fault of its own. Everything the grader checks should be clear from the task description; agents shouldn’t fail due to ambiguous specs. With frontier models, a 0% pass rate across many trials (i.e. 0% pass@100) is most often a signal of a broken task, not an incapable agent, and a sign to double-check your task specification and graders"
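That heuristic is easy to automate: before blaming the agent, flag any task with zero passes across many trials for spec and grader review. A minimal sketch (the `results` shape is an assumption for illustration):

```python
def flag_suspect_tasks(results: dict[str, list[bool]]) -> list[str]:
    """Return task IDs with a 0% pass rate across all trials;
    per the article, that usually signals a broken task or an
    ambiguous spec rather than an incapable agent."""
    return [task for task, trials in results.items()
            if trials and not any(trials)]

results = {
    "write-script": [False] * 100,   # 0% pass@100: audit the task spec
    "fix-bug": [True, False, True],  # flaky but capable
}
print(flag_suspect_tasks(results))  # ['write-script']
```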

Good metrics to account for eval non-determinism

pass@k measures the likelihood that an agent gets at least one correct solution in k attempts. As k increases, the pass@k score rises: more "shots on goal" means higher odds of at least one success. A score of 50% pass@1 means the model succeeds at half the tasks in the eval on its first try. In coding, we're often most interested in the agent finding the solution on the first try (pass@1). In other cases, proposing many solutions is valid as long as one works.
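Estimating pass@k from a single subsample of k trials is noisy; a standard unbiased estimator from n total trials with c successes is 1 − C(n−c, k)/C(n, k), i.e. one minus the chance that all k draws are failures. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trials,
    drawn without replacement from n trials of which c passed,
    succeeds. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))            # 0.5
print(round(pass_at_k(10, 5, 3), 3))  # 0.917
```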

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls since demanding consistency across more trials is a harder bar to clear. If your agent has a 75% per-trial success rate and you run 3 trials, the probability of passing all three is (0.75)³ ≈ 42%. This metric especially matters for customer-facing agents where users expect reliable behavior every time.
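Given a per-trial success rate, pass^k is simply that rate raised to the k-th power (assuming independent trials). Reproducing the 75%/3-trial example above:

```python
def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed,
    given per-trial success rate p."""
    return p ** k

print(round(pass_pow_k(0.75, 3), 3))  # 0.422
```

Note the opposite shapes: as k grows, pass@k climbs toward 1 while pass^k decays toward 0, which is why the two answer different product questions.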

Summary used for search

• Agents break traditional evals because mistakes compound across turns and frontier models find creative solutions that static tests miss—you need new evaluation infrastructure, not just better prompts
• Three grader types with clear tradeoffs: code-based (fast/cheap/brittle), model-based (flexible/non-deterministic), human (gold standard/expensive)—effective evals combine all three strategically
• Agent-specific strategies that work: coding agents need unit tests + static analysis, conversational agents need multi-dimensional rubrics + simulated users, research agents need groundedness + coverage checks
• pass@k measures "does it eventually succeed?" while pass^k measures "does it succeed consistently?"—pick the metric that matches your product requirements
• Start with 20-50 tasks from real failures, write unambiguous specs, read transcripts religiously, and treat evals as living infrastructure that compounds in value over the agent's lifecycle

Anthropic argues that evaluations are essential infrastructure for shipping AI agents, not optional overhead. The core challenge: the same capabilities that make agents useful—autonomy, multi-turn reasoning, tool use—make them uniquely difficult to evaluate. Mistakes propagate across turns, frontier models find creative solutions that break static tests, and success often depends on both the outcome and the interaction quality. Teams without evals get stuck in reactive loops, unable to distinguish real regressions from noise or measure improvements systematically.

The piece provides a complete taxonomy of evaluation components and three types of graders with specific tradeoffs. Code-based graders (string matching, unit tests, static analysis) are fast and objective but brittle to valid variations. Model-based graders (LLM rubrics, pairwise comparison) are flexible and handle nuance but require calibration against human judgment. Human graders provide gold-standard quality but don't scale. Effective evaluations combine all three strategically based on the agent type and task requirements.
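As a rough illustration of the cheapest tier, code-based graders are just deterministic checks on agent output. These two checkers are generic examples of the category, not the article's implementation:

```python
def grade_exact(output: str, expected: str) -> bool:
    # Exact string match: fast and objective, but brittle to
    # valid variations like "4.0" vs "4" or reordered lists.
    return output.strip() == expected.strip()

def grade_contains(output: str, required: list[str]) -> bool:
    # Looser substring check: tolerates rephrasing, but can pass
    # outputs that merely mention the answer without committing to it.
    return all(s in output for s in required)

print(grade_exact("  4  ", "4"), grade_contains("The answer is 4.", ["4"]))  # True True
```

Model-based graders trade this determinism for flexibility in exactly the cases where string checks break, which is why the piece recommends calibrating them against human judgment.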

Different agent types need different evaluation strategies. Coding agents work well with deterministic unit tests plus LLM rubrics for code quality. Conversational agents require multi-dimensional grading (task completion + interaction quality) and often need a second LLM to simulate users. Research agents need groundedness checks, coverage verification, and source quality assessment. Computer use agents require state verification in sandboxed environments. The guide also introduces pass@k (likelihood of eventual success) versus pass^k (consistency across trials) as complementary metrics for measuring agent reliability.

The practical roadmap:

1. Start with 20-50 tasks drawn from real failures, not hundreds of synthetic examples.
2. Write unambiguous task specifications where two experts would reach the same verdict.
3. Build balanced problem sets that test both positive and negative cases.
4. Create stable evaluation harnesses with isolated trials.
5. Design graders thoughtfully: grade outcomes, not paths taken.
6. Read transcripts religiously to verify graders are fair.
7. Monitor for eval saturation as agents approach 100% pass rates.
8. Treat evals as living infrastructure requiring ongoing maintenance and open contribution from product teams, not just engineers.

The value compounds over time: evals become regression tests, baselines for new models, and the communication channel between research and product teams.