Joshua Saxe - The Hard Part Isn't Building the Agent: Measuring Effectiveness | [un]prompted 2026
Cybersecurity's evaluation crisis: We're judging AI agents that generate 100,000 tokens of reasoning using the same binary metrics designed for cat-vs-dog classification—and it's blocking deployment of autonomous defenders we desperately need.
TLDR
• Classical ML metrics (precision/recall) assume perfect ground truth, but security experts disagree with each other more than 10% of the time on SOC alerts and access decisions—even 1% label noise makes real improvement unmeasurable
• We're reducing complex agent reasoning (evidence gathering, policy analysis, decision justification) to a single bit of error information, like hiring engineers based only on multiple-choice tests
• Solution: Multi-dimensional rubrics that evaluate reasoning quality, evidence gathering, and decision-making process—not just outcomes—similar to how you'd interview a security engineer
• This approach unblocks deployment: teams "hill climb" multiple dimensions until hitting a deployment bar, then use genetic algorithms to auto-optimize agent architectures
• Takes 50% of team time but enables 10x faster shipping with confidence—critical as AI-powered attacks scale and organizations need autonomous defenders making sensitive production decisions
In Detail
Saxe identifies a fundamental mismatch in how cybersecurity evaluates AI agents: the field borrowed classical ML metrics (precision, recall, F-score) designed for transparent problems with perfect labels, but security decisions are inherently uncertain—even expert analysts disagree at double-digit rates on whether SOC alerts are true positives or who should have database access. His simulation shows that even 1% label noise puts a hard cap on measurable accuracy, creating a "noise ceiling" above which system improvement is invisible. The current approach compresses 100,000 tokens of agent reasoning—evidence gathering, policy analysis, tool calling—into a single bit of error information compared against that noisy ground truth.
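The noise-ceiling effect is easy to reproduce. The sketch below is an illustrative toy simulation (not Saxe's actual code, and the 50/50 label split is an assumption): it flips 1% of benchmark labels, then scores a flawless system against them—the flawless system can never measure above roughly 99%, and systems near that ceiling become hard to tell apart.

```python
import random

random.seed(0)

N = 100_000
NOISE = 0.01  # fraction of benchmark labels that are simply wrong

truth = [random.random() < 0.5 for _ in range(N)]
# Noisy labels: what the benchmark actually contains
noisy = [t if random.random() > NOISE else not t for t in truth]

def measured_accuracy(true_acc):
    """Accuracy of a system whose real accuracy is `true_acc`,
    as measured against the noisy benchmark labels."""
    preds = [t if random.random() < true_acc else not t for t in truth]
    return sum(p == n for p, n in zip(preds, noisy)) / N

# A flawless system (true accuracy 1.0) measures only ~0.99:
# the label noise itself is the ceiling.
print(measured_accuracy(1.0))
print(measured_accuracy(0.995))
print(measured_accuracy(0.990))
```

Because measured accuracy is roughly `true_acc * (1 - NOISE) + (1 - true_acc) * NOISE`, genuine gains near the ceiling shrink in measurement and drown in sampling variance on realistically sized eval sets.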
The solution is treating AI agents like security engineer candidates: evaluate them on reasoning quality under uncertainty, not just binary outcomes. This means defining interview-style rubrics with dimensions like evidence gathering, first-principles reasoning, policy understanding, and auditability. You calibrate an LLM judge on ~100 samples to grade agent trajectories at scale, then teams "hill climb" these dimensions until hitting a deployment bar aligned with leadership. This multi-dimensional view cuts through the "is the model wrong or are the labels wrong?" debates that plague binary metrics.
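As a concrete sketch of this workflow, the code below shows one plausible shape for rubric-based grading with an LLM judge. The dimension names, 1–5 scale, and deployment bar are illustrative assumptions drawn from the summary, not Saxe's actual rubric; `llm_judge` stands in for a judge you would first calibrate against ~100 hand-graded trajectories.

```python
# Hypothetical rubric dimensions; names are illustrative, not from the talk.
RUBRIC = {
    "evidence_gathering": "Did the agent collect the logs and context a human analyst would?",
    "first_principles_reasoning": "Does the reasoning hold up without leaning on the label?",
    "policy_understanding": "Are organizational policies cited and applied correctly?",
    "auditability": "Could a reviewer reconstruct the decision from the trace alone?",
}

# Deployment bar agreed with leadership: mean score (1-5) per dimension.
DEPLOYMENT_BAR = {dim: 4.0 for dim in RUBRIC}

def grade(trajectory: str, llm_judge) -> dict:
    """Score one agent trajectory on every rubric dimension using a
    calibrated LLM judge (a callable: prompt -> float score)."""
    return {
        dim: llm_judge(f"Score 1-5. {question}\n\nTrajectory:\n{trajectory}")
        for dim, question in RUBRIC.items()
    }

def ready_to_ship(trajectories, llm_judge) -> bool:
    """Hill-climbing target: every dimension's mean must clear the bar."""
    scores = [grade(t, llm_judge) for t in trajectories]
    means = {dim: sum(s[dim] for s in scores) / len(scores) for dim in RUBRIC}
    return all(means[dim] >= DEPLOYMENT_BAR[dim] for dim in RUBRIC)
```

The per-dimension breakdown is what dissolves the "model vs. labels" argument: a trajectory that gathered strong evidence and reasoned soundly but disagreed with a noisy label scores well, while one that guessed right for bad reasons does not.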
The practical payoff is massive: evaluation takes 50% of team time but enables 10x faster shipping because you can deploy with confidence once above your bar. Once you have robust evals, you can use AI coding tools and genetic algorithms to automatically optimize agent architectures. This matters urgently because AI-powered attacks are scaling while most organizations remain understaffed—hospitals with two IT security workers will need autonomous agents making sensitive decisions like quarantining the CTO's account or patching production code, and we can't deploy those systems without evaluation frameworks that actually work.
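To make the "genetic algorithms over agent architectures" step concrete, here is a toy sketch of elitist genetic search over configuration knobs. The search space, knob names, and GA details are illustrative assumptions, not Saxe's setup; the key point it demonstrates is that the rubric eval becomes the fitness function, so automated optimization is only as trustworthy as the eval itself.

```python
import random

random.seed(1)

# Hypothetical agent-architecture knobs to search over.
SEARCH_SPACE = {
    "planning_depth": [1, 2, 3],
    "tools_enabled": [("search",), ("search", "sandbox")],
    "self_critique": [False, True],
}

def mutate(config):
    """Re-roll one randomly chosen knob."""
    knob = random.choice(list(SEARCH_SPACE))
    return {**config, knob: random.choice(SEARCH_SPACE[knob])}

def optimize(eval_fn, generations=20, pop=8):
    """Elitist genetic search: keep the top half each generation,
    refill with mutated copies of the survivors. `eval_fn` is the
    rubric-based evaluation score for a candidate architecture."""
    population = [
        {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        for _ in range(pop)
    ]
    for _ in range(generations):
        survivors = sorted(population, key=eval_fn, reverse=True)[: pop // 2]
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(population, key=eval_fn)
```

In practice `eval_fn` would run candidate agents over a trajectory set and aggregate LLM-judge rubric scores; with a weak eval, this loop happily optimizes toward configurations that game the judge.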