
Grok 3: Another Win For The Bitter Lesson

xAI's Grok 3 proves that massive compute (100K H100s) still beats clever optimization—DeepSeek's success was impressive engineering under constraints, not evidence that scaling is dead.

Summary

• Grok 3 reached state-of-the-art by brute-forcing with 100K H100s, not algorithmic tricks—validating the "Bitter Lesson" that scaling compute beats hand-crafted optimizations
• DeepSeek optimized brilliantly because export restrictions left them no choice, but their CEO admits GPUs are their main bottleneck—they'd scale up if they could
• The paradigm shift from pre-training to post-training (test-time compute, RL) helped latecomers like xAI catch up faster than the multi-year pre-training era allowed
• xAI is better positioned than OpenAI/Anthropic for the next phase—they have 100K+ H100s locked in while others are still securing compute
• When post-training scales to pre-training investment levels, only companies with massive compute clusters will remain competitive

The author argues that Grok 3's achievement validates Rich Sutton's "Bitter Lesson": the empirical observation that scaling compute beats algorithmic cleverness. While DeepSeek impressed by competing with roughly 50K Hopper GPUs against rivals' 100K+ H100s through aggressive optimization, that was necessity, not strategy. DeepSeek's CEO has explicitly said export controls are their main bottleneck, contradicting the "GPUs don't matter" narrative. The Bitter Lesson doesn't deny the value of optimization; it says that when you have enough of the primary resource (compute), you don't waste time squeezing secondary resources. DeepSeek had no choice but to optimize; xAI chose to scale.

The paradigm shift from pre-training (scaling model size and data) to post-training (reinforcement learning and supervised fine-tuning, which unlock scaling test-time compute) helped latecomers. Pre-training required multi-year head starts and massive upfront investment. Post-training is still early, so rapid improvements are achievable cheaply: it let OpenAI jump from o1 to o3 in three months and xAI reach state of the art in two years. This shift leveled the playing field temporarily, but once companies figure out how to scale post-training to pre-training investment levels, only those with massive compute will compete.
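To make "scaling test-time compute" concrete, here is a minimal sketch of one well-known recipe, best-of-N sampling with majority voting: buy accuracy with extra inference compute by drawing many candidate answers and keeping the most common one. The `generate` callable and the `noisy_model` stub below are hypothetical illustrations, not anything from the article or from any lab's actual post-training stack.

```python
import random
from collections import Counter
from typing import Callable


def best_of_n(generate: Callable[[str], str], prompt: str, n: int = 16) -> str:
    """Spend roughly n times the inference compute on one question by
    sampling n candidate answers and returning the consensus answer.
    `generate` is a hypothetical stand-in for any sampling model call."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def noisy_model(prompt: str) -> str:
    # Toy stand-in model: answers correctly 60% of the time,
    # otherwise returns a random wrong digit.
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))


if __name__ == "__main__":
    # More samples -> more inference compute -> a more reliable answer,
    # with no retraining at all.
    print(best_of_n(noisy_model, "What is 6 * 7?", n=32))  # almost always "42"
```

The trade here is pure compute for accuracy: no new training run, just more forward passes per query, which is one reason gains in this regime have been comparatively cheap to find so far.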

xAI positioned itself remarkably well with its 100K H100 cluster (expanding to 200K), arguably better than OpenAI and Anthropic. Nvidia's favoritism gives xAI priority access to next-generation hardware. For DeepSeek, engineering ingenuity won't compensate for a 150K-GPU gap (xAI's planned 200K versus DeepSeek's roughly 50K) once post-training scales up. The author concludes that despite media narratives celebrating efficiency, the fundamental truth remains: when it comes to building intelligence, scaling wins over cleverness every time. This has implications for who leads in AI, for the real value of export controls, and for what the "sell Nvidia" crowd misunderstands.