Comprehensive LLM Benchmark Database – LLMIndex
A searchable database of 49 LLM benchmarks with detailed metadata—from MMLU to HumanEval to emotional intelligence tests—that catalogs how AI models are actually evaluated across reasoning, coding, math, and safety.
TLDR
• Indexes 49 benchmarks across 8 categories (general purpose, math, code, multilingual, safety, etc.) with papers, repos, and datasets linked
• Features the UGI Leaderboard tracking "uncensored general intelligence" across 765+ models, measuring knowledge breadth and willingness to engage
• Highlights specialized tests: AGIEval uses real SAT/GRE/LSAT exams, C-Eval has 13,948 Chinese questions across 52 disciplines, EQ-Bench measures emotional intelligence across 45 role-play scenarios
• Each benchmark entry includes its specific metrics (accuracy, pass@k, Elo), difficulty level, and the capabilities it actually tests
• Provides structured metadata to compare what different benchmarks measure—commonsense reasoning vs. graduate-level science vs. code generation
In Detail
LLMIndex presents a comprehensive directory of 49 evaluation benchmarks used to test large language models, organized across 8 categories including general purpose, mathematics, coding, multilingual capabilities, and safety. Each benchmark entry links the research paper, GitHub repository, and dataset, and records the specific metrics used (accuracy, pass@k scores, Elo ratings), making the site a centralized reference for understanding how AI capabilities are measured.
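Of those metrics, pass@k is the least self-explanatory: it estimates the probability that at least one of k sampled completions for a problem passes the benchmark's tests. The sketch below shows the standard unbiased estimator popularized by the HumanEval paper; the function and variable names here are illustrative, not taken from LLMIndex.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total completions sampled for the problem
    c: how many of those completions passed the tests
    k: sampling budget being evaluated (e.g. 1, 10, 100)
    """
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of them passing
print(pass_at_k(n=200, c=37, k=1))   # 0.185 (= 37/200)
print(pass_at_k(n=200, c=37, k=10))  # larger budget, higher estimate
```

The final pass@k score is then the mean of this per-problem estimate over all problems in the benchmark.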
The site features two prominent leaderboards: the UGI Leaderboard tracking "Uncensored General Intelligence" across 765+ models (measuring knowledge breadth, willingness to engage, and writing quality), and HuggingFace's Open LLM Leaderboard comparing open-source models on standard academic benchmarks. The benchmark profiles reveal the diversity of evaluation approaches—AGIEval uses real standardized tests (SAT, GRE, GMAT, LSAT, Chinese Gaokao), C-Eval provides 13,948 questions across 52 disciplines in Chinese, EQ-Bench measures emotional intelligence through 45 multi-turn role-play scenarios, and HumanEval tests code generation with 164 hand-crafted Python problems.
The database structure allows filtering by category, metric type, and tags (e.g., "chain-of-thought," "graduate-level," "anti-memorization"), making it practical for researchers selecting appropriate benchmarks and for developers working out what specific scores actually measure. With research papers linked for 30 of the benchmarks and all 49 marked as active, it serves as a living reference for the evolving landscape of AI evaluation standards.
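To make that filtering concrete, here is a minimal sketch in Python; the entry schema, field names, and sample records are illustrative assumptions meant to mirror the metadata described above, not the site's actual data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class BenchmarkEntry:
    """Hypothetical shape of one LLMIndex-style record."""
    name: str
    category: str                  # e.g. "code", "math", "safety"
    metrics: list[str]             # e.g. ["accuracy"], ["pass@k"]
    tags: list[str] = field(default_factory=list)
    paper_url: str | None = None
    repo_url: str | None = None

def filter_benchmarks(entries, category=None, metric=None, tag=None):
    """Return entries that match every filter that is not None."""
    return [
        e for e in entries
        if (category is None or e.category == category)
        and (metric is None or metric in e.metrics)
        and (tag is None or tag in e.tags)
    ]

# Illustrative sample records (categories and tags assumed for the example)
catalog = [
    BenchmarkEntry("HumanEval", "code", ["pass@k"], ["hand-crafted"]),
    BenchmarkEntry("C-Eval", "multilingual", ["accuracy"], ["graduate-level"]),
    BenchmarkEntry("EQ-Bench", "general purpose", ["accuracy"], ["role-play"]),
]

# All coding benchmarks scored with pass@k -> ["HumanEval"]
print([e.name for e in filter_benchmarks(catalog, category="code", metric="pass@k")])
```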