
Measuring AI Ability to Complete Long Tasks - METR

AI's ability to complete tasks autonomously has been doubling every 7 months for 6 years—a trend that, if continued, means AI agents will handle month-long projects by the early 2030s.

Summary

• METR found that measuring AI by task length (how long tasks take humans) resolves the paradox of superhuman benchmarks vs. limited real-world utility—current best models complete ~1-hour tasks about half the time but rarely succeed on tasks over 4 hours
• The length of tasks AI can complete with 50% reliability has doubled every 7 months since 2019, showing a remarkably consistent exponential trend across multiple task sets
• Extrapolating forward: if the trend holds, AI will autonomously complete week-long tasks within 2-4 years and month-long projects by the early 2030s
• The trend is robust to methodology changes and holds even on real-world tasks (SWE-Bench shows even faster 3-month doubling time)
• This metric directly predicts economic impact better than test scores—it explains why GPT-4 aces exams but can't reliably do remote assistant work

METR proposes a new lens for understanding AI progress: measuring the length of tasks (by human completion time) that AI agents can complete autonomously. This approach resolves a key puzzle—why do models that ace expert-level exams still struggle with basic remote work? The answer: current frontier models like Claude 3.7 Sonnet succeed nearly 100% of the time on tasks taking humans under 4 minutes, but their success rate drops below 10% for tasks over 4 hours. They can handle some hour-long tasks, but can only reliably complete tasks of a few minutes.

The real insight comes from historical data. When METR plots the task length that state-of-the-art models can complete with 50% probability over the past 6 years, they find a remarkably consistent exponential trend with a 7-month doubling time. This holds from GPT-2 (2019) through Claude 3.7 (2025), spanning orders of magnitude in capability. The trend replicates on independent datasets like SWE-Bench Verified (which shows an even faster 3-month doubling) and remains robust to various methodological perturbations and subset analyses.
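The extrapolation implied by this trend is simple to work out: task length grows as `anchor × 2^(months / 7)`. The sketch below is illustrative, not METR's code; the 60-minute anchor is an assumption matching the article's claim that current frontier models handle roughly hour-long tasks at 50% reliability.

```python
import math

# Illustrative extrapolation of the 7-month doubling trend (not METR's code).
# ANCHOR_MINUTES is an assumed present-day 50%-reliability task horizon.

DOUBLING_MONTHS = 7.0
ANCHOR_MINUTES = 60.0  # assumption: ~1-hour tasks at 50% success today

def projected_task_minutes(months_ahead: float) -> float:
    """Task length (in human-minutes) completable at 50% reliability,
    `months_ahead` months from the anchor point."""
    return ANCHOR_MINUTES * 2 ** (months_ahead / DOUBLING_MONTHS)

def months_until(target_minutes: float) -> float:
    """Months until the extrapolated trend reaches `target_minutes`."""
    return DOUBLING_MONTHS * math.log2(target_minutes / ANCHOR_MINUTES)

# A 40-hour work week: ~37 months, inside the article's 2-4 year window.
print(round(months_until(40 * 60) / 12, 1))   # ~3.1 years
# A ~167-hour work month: ~52 months, i.e. the early 2030s from a 2025 anchor.
print(round(months_until(167 * 60) / 12, 1))  # ~4.3 years
```

Changing the anchor only slides these dates modestly, which is the point the next paragraph makes about forecast robustness.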

The implications are stark: if this trend continues for just 2-4 more years, AI agents will autonomously complete week-long tasks. By the early 2030s, they'll handle month-long projects. The steepness of the exponential means forecasts are relatively robust—even a 10x measurement error only shifts timelines by ~2 years. This provides a concrete framework for forecasting when AI will have transformative economic impact, grounded in a metric that directly relates to real-world utility rather than abstract test scores.
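The robustness claim follows directly from the exponential form: a constant multiplicative error in measured task length shifts the forecast date by `doubling_time × log2(error)`, independent of the target. A quick check of the article's ~2-year figure:

```python
import math

DOUBLING_MONTHS = 7.0

def timeline_shift_years(error_factor: float) -> float:
    """Years by which a constant multiplicative measurement error shifts
    any forecast date, given a 7-month doubling time."""
    return DOUBLING_MONTHS * math.log2(error_factor) / 12.0

print(round(timeline_shift_years(10), 1))  # ~1.9 years for a 10x error
```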