
Intelligence Benchmarking | Artificial Analysis

Artificial Analysis discloses the complete methodology behind their Intelligence Index v4.0—including exact prompts, regex patterns, and custom benchmarks—revealing how benchmark design choices (from answer extraction to equality checking) fundamentally shape AI model comparisons.


• Intelligence Index v4.0 combines 10 evaluations across 4 categories (Agents 25%, Coding 25%, General 25%, Scientific Reasoning 25%), using pass@1 scoring with multiple repeats
• Custom benchmarks include AA-LCR (long context reasoning with ~100k token documents), AA-Omniscience (knowledge + hallucination penalty), and GDPval-AA (real-world work tasks scored via ELO)
• Full transparency on implementation: temperature 0 for non-reasoning models, 0.6 for reasoning; equality checker LLMs instead of exact string matching; specific regex patterns for answer extraction
• GDPval-AA uses pairwise comparisons graded by Gemini 3 Pro with frozen ELO scores normalized as clamp((ELO-500)/2000) for Intelligence Index stability
• Acknowledges biases: HLE dataset adversarially selected against GPT-4o/Claude/Gemini, making direct comparisons with other models potentially unfair
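The pass@1-with-repeats scoring mentioned above can be sketched as follows: each question is attempted several times independently, the per-question pass rate is the fraction of repeats that succeed, and the benchmark score is the mean over questions. This is a minimal illustration of that averaging, not Artificial Analysis's actual harness.

```python
def pass_at_1(results_per_question):
    """Mean pass@1 over questions, each attempted with multiple repeats.

    results_per_question: list of lists; each inner list holds 0/1
    outcomes for one question, one entry per independent repeat.
    (Hypothetical helper illustrating the scoring described in the article.)
    """
    per_question = [sum(repeats) / len(repeats) for repeats in results_per_question]
    return sum(per_question) / len(per_question)
```

Averaging over repeats reduces run-to-run variance, which matters when models are sampled at nonzero temperature.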

Artificial Analysis has published the complete methodology for their Intelligence Index v4.0, a composite benchmark they claim is "more useful than any other metric in existence today." The index combines 10 evaluations weighted across four categories: Agents (25%), Coding (25%), General (25%), and Scientific Reasoning (25%). Each category includes 2-3 specific benchmarks—for example, Agents includes GDPval-AA (real-world knowledge work) and τ²-Bench Telecom (conversational AI), while Coding includes Terminal-Bench Hard (terminal-based tasks) and SciCode (scientific computing).
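The composite weighting described above is straightforward to express. The sketch below assumes each category's benchmarks are averaged with equal weight before the 25% category weights are applied; the article does not spell out the within-category combination, so that part is an assumption.

```python
# Category weights from the Intelligence Index v4.0 description.
WEIGHTS = {
    "Agents": 0.25,
    "Coding": 0.25,
    "General": 0.25,
    "Scientific Reasoning": 0.25,
}

def intelligence_index(benchmark_scores):
    """Weighted composite over categories.

    benchmark_scores: {category: [scores in 0..1 for that category's
    2-3 benchmarks]}. Equal weighting within a category is assumed.
    """
    total = 0.0
    for category, weight in WEIGHTS.items():
        scores = benchmark_scores[category]
        total += weight * (sum(scores) / len(scores))
    return total
```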

The methodology reveals critical implementation details that affect benchmark results. They use temperature 0 for non-reasoning models and 0.6 for reasoning models, with maximum output tokens of 16,384 (or model-specific limits for reasoning models). For answer extraction, they employ multi-stage regex patterns rather than exact string matching—for multiple choice questions, they try 8+ different patterns to catch various answer formats (LaTeX boxed notation, natural language, standalone letters). For open-ended answers, they use "equality checker LLMs" (different models for different benchmarks) to assess semantic equivalence rather than requiring exact matches.
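The multi-stage extraction idea can be illustrated with a short sketch: try a sequence of patterns from most to least specific, and take the last match, since models often restate their final answer at the end. The patterns below are hypothetical examples in the spirit of the formats the article lists (LaTeX boxed notation, natural language, standalone letters), not the actual 8+ patterns Artificial Analysis uses.

```python
import re

# Illustrative patterns only; the article's real pattern set differs.
MC_PATTERNS = [
    r"\\boxed\{\(?([A-D])\)?\}",               # LaTeX boxed: \boxed{C}
    r"[Aa]nswer\s*(?:is|:)\s*\(?([A-D])\)?",   # "Answer: C" / "the answer is C"
    r"^\(?([A-D])\)?[.)]?\s*$",                # a letter standing alone on a line
]

def extract_choice(response):
    """Return the extracted multiple-choice letter, or None if no pattern matches."""
    for pattern in MC_PATTERNS:
        matches = re.findall(pattern, response, flags=re.MULTILINE)
        if matches:
            return matches[-1]  # last occurrence: models often restate the final answer
    return None
```

Ordering matters: a specific format like `\boxed{...}` should win over a bare letter, which is why the patterns are tried in sequence rather than combined into one alternation.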

Three custom benchmarks stand out: AA-LCR tests long context reasoning with ~100k token documents requiring 128K context windows; AA-Omniscience measures knowledge accuracy while penalizing hallucinations (score = 50% accuracy + 50% non-hallucination rate); and GDPval-AA evaluates real-world work tasks by having models complete 220 tasks with file outputs, then using pairwise comparisons graded by Gemini 3 Pro to compute ELO ratings. For Intelligence Index inclusion, GDPval ELO scores are frozen at model addition and normalized as clamp((ELO-500)/2000) to maintain stability over time.
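The two formulas above can be written out directly. The AA-Omniscience sketch follows the stated 50/50 split literally (the exact definitions of "accuracy" and "hallucination rate" are assumptions), and the ELO normalization assumes the clamp is to [0, 1], which the article's expression does not state explicitly.

```python
def omniscience_score(n_correct, n_hallucinated, n_total):
    """Sketch of the stated AA-Omniscience formula:
    50% accuracy + 50% non-hallucination rate (exact term definitions assumed)."""
    accuracy = n_correct / n_total
    non_hallucination = 1 - n_hallucinated / n_total
    return 0.5 * accuracy + 0.5 * non_hallucination

def normalize_elo(elo):
    """clamp((ELO - 500) / 2000); clamping to [0, 1] is an assumption."""
    x = (elo - 500) / 2000
    return max(0.0, min(1.0, x))
```

Under this normalization, an ELO of 500 maps to 0, 1500 to 0.5, and anything at or above 2500 saturates at 1, which is what lets frozen ELO scores feed a stable 0-1 index component.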

The document also acknowledges limitations and potential biases. For example, they note that HLE (Humanity's Last Exam) was adversarially curated using GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, making direct comparisons between these models and others "potentially biased." They estimate ±1% confidence intervals for the overall Intelligence Index but note individual evaluations may have wider intervals. The transparency extends to disclosing exact prompts, regex patterns, grading rubrics, and even the specific Python packages pre-installed in their evaluation environments.