
Quantization from the ground up | ngrok blog

An interactive deep-dive into LLM quantization showing how you can compress models 4x with only 5-10% accuracy loss—enough to run capable models on your laptop—by understanding the bit-level mechanics of how computers store numbers.

Summary

• Quantization lets you make LLMs 4x smaller and 2x faster by compressing 16-bit floats down to 4-bit integers with surprisingly little quality loss
• The key insight: quality degradation is non-linear. 8-bit quantization has almost no penalty, and 4-bit is still ~90% as good, not the 25% you'd expect if quality scaled linearly with bit count
• Symmetric quantization scales values around zero; asymmetric quantization is more efficient by offsetting around the data's actual midpoint
• Block-wise quantization (32-256 parameters at a time) is necessary because outlier parameters would destroy whole-model quantization
• Multiple measurement approaches (perplexity, KL divergence, benchmarks) all confirm that 8-bit and 4-bit quantizations preserve model capability while 2-bit breaks completely

This post builds quantization understanding from first principles using interactive visualizations. It starts by showing that LLMs are fundamentally billions of parameters (weights) that get multiplied by inputs, then explains how computers represent numbers in bits, with a tradeoff between precision and range. A 32-bit float offers about 7 significant decimal digits and an enormous range, but LLM parameters cluster tightly around zero (most fall between -0.5 and 0.5), so most of that range and precision goes unused.
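To see the bit-level layout the post refers to, here is a small sketch (the helper name is ours, not from the post) that prints the sign, exponent, and mantissa fields of an IEEE 754 32-bit float:

```python
import struct

def float32_bits(x: float) -> str:
    """Return the 32 raw bits of x grouped as sign | exponent | mantissa."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret float bits as uint32
    bits = f"{raw:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"  # 1 sign bit, 8 exponent bits, 23 mantissa bits

print(float32_bits(0.5))    # → 0 01111110 00000000000000000000000
print(float32_bits(-0.25))  # → 1 01111101 00000000000000000000000
```

The 23 mantissa bits are what give roughly 7 decimal digits of precision; for weights that rarely leave [-0.5, 0.5], most of the 8-bit exponent's range is never exercised.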

The core technique is mapping high-precision floats into smaller integer ranges. Symmetric quantization scales values around zero by finding the max absolute value and dividing by the target range (e.g., -0.89 to 0.16 maps into -7 to 7). Asymmetric quantization is more efficient—it offsets around the data's actual midpoint rather than forcing symmetry around zero, reducing average error by ~10%. In practice, models are quantized in blocks of 32-256 parameters because outlier values would destroy whole-model quantization by forcing everything into a tiny number of buckets.
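The mappings above are straightforward to sketch in pure Python. This is an illustrative implementation of the ideas described (function names and rounding details are ours, not the post's code), using the post's example range of -0.89 to 0.16:

```python
def quantize_symmetric(xs, bits=4):
    """One scale, centered on zero: map floats into [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    scale = max(abs(v) for v in xs) / qmax
    return [round(v / scale) for v in xs], scale

def quantize_asymmetric(xs, bits=4):
    """Scale plus zero point, centered on the data's own midpoint: map into [0, 2^b - 1]."""
    qmax = 2 ** bits - 1                       # 15 for 4-bit
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)            # the integer that represents 0.0
    return [min(max(round(v / scale) + zero_point, 0), qmax) for v in xs], scale, zero_point

def quantize_blockwise(xs, block_size=32, bits=4):
    """Quantize each block with its own scale, so one outlier only hurts its block."""
    return [quantize_symmetric(xs[i:i + block_size], bits)
            for i in range(0, len(xs), block_size)]

xs = [-0.89, -0.3, 0.0, 0.16]
q_sym, s = quantize_symmetric(xs)              # e.g. [-7, -2, 0, 1]
recovered = [q * s for q in q_sym]             # dequantize: multiply back by the scale

q_asym, s_a, zp = quantize_asymmetric(xs)
recovered_a = [(q - zp) * s_a for q in q_asym]
```

Comparing `recovered` and `recovered_a` against `xs` shows why asymmetric wins on skewed data: the symmetric scheme spends half its integer buckets on values above 0.16 that never occur, while the asymmetric one spreads all 16 buckets across the actual range.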

The author tests Qwen 3.5 9B at different quantization levels using perplexity (how confident the model is in correct tokens), KL divergence (how much the full probability distribution shifts), the GPQA benchmark, and direct conversation. Results show 8-bit is nearly indistinguishable from the original, 4-bit scores ~90% on benchmarks with slightly higher perplexity, and 2-bit completely collapses. The non-linear degradation means you get massive compression benefits (4x smaller, 2x faster) before hitting a quality cliff. The practical takeaway: don't fear quantized models—they work well enough to run capable LLMs locally on consumer hardware.
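The two statistical measurements are simple to state precisely. A minimal sketch of both, with toy numbers that are illustrative rather than taken from the post's experiments:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability the model assigned to each
    correct next token; lower means the model is more confident and accurate."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q): how much the quantized model's next-token distribution Q
    has drifted from the full-precision model's distribution P, in nats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy example: a model assigns probabilities 0.5, 0.25, 0.125 to three correct tokens.
logps = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(logps))   # ~4.0 (the geometric mean of 2, 4, and 8)

# Toy distributions: a full-precision model vs. a slightly drifted quantized one.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.25, 0.15]
print(kl_divergence(p, q))  # small and positive; 0 would mean identical distributions
```

Perplexity only checks confidence in the correct token; KL divergence catches drift across the whole distribution, which is why the post uses both alongside benchmarks and direct conversation.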