How Vercel Cut Build Wait Times From 90 Seconds To 5
Vercel achieved an 18x speedup by deliberately choosing a harder foundation—microVMs instead of containers—because their threat model (running untrusted customer code on shared hardware) demanded it, then stacking three optimizations on top.
TLDR
• Containers were inadequate for "hostile multi-tenancy" because they share a kernel—one exploit could reach every customer's build on the same machine
• Firecracker microVMs provide VM-level isolation (separate kernels, CPU-enforced boundaries) at near-container speed: they boot in about 125ms and use only a few MB of memory, whereas traditional VMs take 30-60s to boot
• The 90s→5s drop came from three layers: faster boots via image caching and block device snapshotting, warm pools of pre-booted cells, and Firecracker's baseline speed
• Every cell is destroyed after each build (not reused) as a security choice—prevents customer state leakage even though reuse would be faster
• Building from primitives instead of using Kubernetes cost massive engineering effort but gave leverage to ship features like Secure Compute that would've been impossible on someone else's platform
In Detail
Vercel's Hive platform runs thousands of customer builds on shared infrastructure, which creates a "hostile multi-tenancy" problem: every build script could be a deliberate exploit trying to escape its sandbox and access other customers' data. Containers were the obvious choice, but they share a Linux kernel across all containers on the same machine—a single kernel exploit would breach every customer's build. Traditional VMs provide true isolation (separate kernels) but take 30-60 seconds to boot, making them impractical for 2-minute builds.
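To make the "impractical for 2-minute builds" point concrete, the short calculation below compares boot time with total wall-clock time (boot plus build). It is only an arithmetic sketch: the 30-60 second and 2-minute figures come from the text above, the 125ms figure is the Firecracker boot time quoted in this article, and the function name is illustrative.

```python
# Share of wall-clock time spent booting, for a 2-minute build.
BUILD_SECONDS = 120.0  # "2-minute builds" from the text


def boot_overhead_fraction(boot_seconds: float,
                           build_seconds: float = BUILD_SECONDS) -> float:
    """Return the fraction of (boot + build) time that is spent booting."""
    return boot_seconds / (boot_seconds + build_seconds)


# Boot times quoted in this article; labels are illustrative.
boot_times = {
    "traditional VM, 30s boot": 30.0,
    "traditional VM, 60s boot": 60.0,
    "Firecracker microVM, 125ms boot": 0.125,
}

for name, boot in boot_times.items():
    print(f"{name}: {boot_overhead_fraction(boot):.1%} of wall-clock time")
```

Under these numbers a traditional VM spends between a fifth and a third of the total time just booting, while the microVM boot is a rounding error.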
Vercel adopted Firecracker, AWS's open-source microVM technology originally built for Lambda. Firecracker boots in 125ms and uses only a few MB of memory while providing VM-level isolation with CPU-enforced boundaries. Each customer build runs in a "cell" (one Firecracker microVM containing one container), and cells are destroyed after every build rather than reused—a deliberate security choice that prevents state leakage between customers. The architecture splits orchestration between a box daemon (on the physical machine) and a cell daemon (inside each microVM), with dedicated CPU/memory per cell but rate-limited disk/network.
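The destroy-after-every-build rule can be sketched as a single-use resource. The Python below is a toy model, not Vercel's code and not the Firecracker API: `Cell`, `single_use_cell`, and `run_build` are hypothetical names, and "destroying" a cell is simulated by clearing an in-memory dictionary. It only illustrates the invariant that no cell, and therefore no customer state, survives from one build to the next.

```python
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical stand-in for one microVM holding one build container."""
    cell_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    scratch: dict = field(default_factory=dict)  # simulated per-build state
    destroyed: bool = False


@contextmanager
def single_use_cell():
    """Create a fresh cell and always tear it down, even if the build fails."""
    cell = Cell()
    try:
        yield cell
    finally:
        cell.scratch.clear()   # simulated teardown: all build state is discarded
        cell.destroyed = True  # the cell is never handed to another build


def run_build(customer: str, secret: str) -> Cell:
    """Run one (simulated) build in its own cell and return the dead cell."""
    with single_use_cell() as cell:
        cell.scratch["env"] = {"customer": customer, "secret": secret}
        # ... the customer's build steps would run here ...
    return cell


first = run_build("customer-a", "token-a")
second = run_build("customer-b", "token-b")
print(first.destroyed, second.destroyed, first.cell_id != second.cell_id)
```

Reusing a cell would skip a boot, but it would also require proving that every trace of the previous build had been scrubbed. Destroying the cell makes that guarantee structural.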
The 18x provisioning speedup (90 seconds down to 5) came from three compounding optimizations. First, faster cold starts: caching the build container image locally (saving 45s) and using block device snapshotting to start from a saved disk image instead of building one from scratch. Second, warm pools: keeping pre-booted cells idle and waiting, so most builds skip the 5-second cold path entirely. Third, Firecracker's baseline speed, which is what makes warm pools practical: a drained pool can be refilled quickly, whereas traditional VMs would require enormous pools to keep up with demand. The result was a 30% overall build performance improvement and a 40% improvement for cold-path builds.
The trade-off is real cost: warm pools burn money on idle compute, and building from primitives required massive engineering investment compared with using Kubernetes. But owning the substrate enabled product features like Secure Compute that would have been nearly impossible on top of someone else's platform. The lesson isn't "use microVMs"; it's that starting with the correct primitive for your threat model, even if it is harder, creates a foundation where optimizations compound.
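The warm-pool idea described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Vercel's implementation: `WarmPool` and `Cell` are hypothetical names, booting is simulated, and handing over a pre-booted cell is treated as taking zero seconds. Only the 5-second cold-path figure comes from the article.

```python
from collections import deque
from dataclasses import dataclass
from typing import Tuple

COLD_START_SECONDS = 5.0    # cold-path figure from the article
WARM_HANDOFF_SECONDS = 0.0  # assumption: a pre-booted cell is handed over instantly


@dataclass
class Cell:
    cell_id: int  # hypothetical identifier for a simulated microVM


class WarmPool:
    """Keeps up to `target_size` pre-booted cells idle and ready."""

    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self._next_id = 0
        self._idle: deque = deque()
        self.refill()

    def _boot(self) -> Cell:
        # Simulated boot; a real system would start a microVM here.
        self._next_id += 1
        return Cell(self._next_id)

    def refill(self) -> None:
        """Boot cells until the idle pool is back at its target size."""
        while len(self._idle) < self.target_size:
            self._idle.append(self._boot())

    def acquire(self) -> Tuple[Cell, float]:
        """Return a cell and the wait the build experiences, in seconds."""
        if self._idle:
            return self._idle.popleft(), WARM_HANDOFF_SECONDS
        # Pool is drained: fall back to the cold path.
        return self._boot(), COLD_START_SECONDS


pool = WarmPool(target_size=2)
waits = [pool.acquire()[1] for _ in range(3)]
print(waits)  # [0.0, 0.0, 5.0]: the third build arrives after the pool is drained
pool.refill()
```

The sketch also shows where the cost comes from: every cell sitting in the idle queue is compute that is paid for but not yet doing work, and the faster a cell boots, the smaller `target_size` needs to be to absorb a burst.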