
AI Leaderboards 2025 - Compare LLM, TTS, STT, Video, Image & Embedding Models

A unified leaderboard tracking 220+ AI models across all modalities—LLM, image, video, audio, embeddings—with verified benchmark scores, pricing, and arena rankings in one searchable hub.

Summary

• Gemini 3 Pro leads coding (1,548 arena score) while Claude Opus 4.5 dominates chat (1,319); GPT-5.2 ranks third overall with 92.4% GPQA
• Compares models across six modalities with specific metrics: context windows (up to 1M tokens), pricing ($0.10-$75 per million tokens), and benchmark scores (GPQA, SWE-bench, MMLU)
• Arena-based rankings show real-world performance: Coding Arena, Chat Arena, Image Arena with head-to-head comparisons
• Tracks open-source vs proprietary licensing, with Chinese models (GLM-4.6, MiniMax M2.1) showing competitive performance at lower costs
• Updated daily with new model releases, providing filterable tables and scatter plots for multi-dimensional comparison

LLM-stats.com functions as a centralized intelligence hub for AI model selection, aggregating performance data across six distinct modalities: language models, image generation, video generation, text-to-speech, speech-to-text, and embeddings. The platform's core value is providing verified, comparable metrics that go beyond marketing claims—showing actual arena scores (where models compete head-to-head), benchmark performance (GPQA, SWE-bench, MMLU), and practical constraints like context windows and per-token pricing.

The current leaderboard reveals interesting competitive dynamics: Google's Gemini 3 Pro dominates coding tasks (1,548 arena score) with a massive 1M token context window at $2/$12 per million tokens, while Anthropic's Claude Opus 4.5 leads in chat applications (1,319 score) at premium pricing ($5/$25). Chinese models like Zhipu's GLM-4.6 and MiniMax's M2.1 demonstrate competitive performance at significantly lower costs ($0.30-$0.60 input), suggesting a price-performance arbitrage opportunity. The platform tracks both proprietary and open-source models, with licensing clearly indicated.
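The pricing gap described above can be made concrete with a quick per-request cost calculation. This sketch uses the two models whose full input/output prices are quoted in the article; the workload size (100k input tokens, 20k output tokens) is a hypothetical example, not a figure from the leaderboard:

```python
def request_cost(input_price, output_price, in_tokens, out_tokens):
    """Dollar cost of one request, with prices in $ per million tokens."""
    return (in_tokens * input_price + out_tokens * output_price) / 1_000_000

# ($/M input, $/M output) as quoted in the article.
PRICES = {
    "Gemini 3 Pro": (2.00, 12.00),
    "Claude Opus 4.5": (5.00, 25.00),
}

for model, (p_in, p_out) in PRICES.items():
    cost = request_cost(p_in, p_out, in_tokens=100_000, out_tokens=20_000)
    print(f"{model}: ${cost:.2f} per request")
```

At these list prices the hypothetical workload costs $0.44 per request on Gemini 3 Pro versus $1.00 on Claude Opus 4.5, which is the kind of spread the arbitrage argument rests on.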

The site's architecture enables multi-dimensional filtering: users can sort by specific benchmarks (GPQA for reasoning, SWE-bench for coding), view arena rankings based on real user preferences, or optimize for cost-efficiency. With 220+ models tracked and daily updates on new releases, it serves as both a decision-making tool for developers selecting models and a market intelligence platform showing competitive positioning across the rapidly evolving AI landscape.
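The multi-dimensional filtering described above amounts to filter-then-sort over tabular model records. A minimal local sketch, assuming a made-up schema and toy numbers purely to demonstrate the logic (the site's actual data model and API are not documented in the article):

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    gpqa: float          # reasoning benchmark score (%), illustrative
    input_price: float   # $/M input tokens, illustrative
    open_source: bool

# Toy rows with made-up values; not real leaderboard entries.
MODELS = [
    Model("model-a", gpqa=90.0, input_price=2.00, open_source=False),
    Model("model-b", gpqa=85.0, input_price=0.50, open_source=True),
    Model("model-c", gpqa=88.0, input_price=1.00, open_source=True),
]

# Filter to open-source models above a benchmark floor, then sort so the
# cheapest qualifying model comes first (the "cost-efficiency" view).
picks = sorted(
    (m for m in MODELS if m.open_source and m.gpqa >= 85.0),
    key=lambda m: m.input_price,
)
print([m.name for m in picks])  # cheapest qualifying model first
```

Swapping the filter predicate (SWE-bench instead of GPQA, licensing, context window) or the sort key reproduces each of the views the site offers.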