
So You Want to Build Your Own Data Center

Railway spent 9 months building their own data centers from scratch to escape Google Cloud's existential risks—here's why managing fiber optic cables and 3-phase power turned out to be easier than dealing with GCP support.


• Railway migrated off GCP because hyperscalers posed existential business risk through pricing constraints, service limitations, and zero support despite millions in annual spend
• Building a data center cage involves power density calculations ($/kW varies 2x by geography), redundant ISP selection based on regional peering, and cold/hot aisle airflow design—closer to construction than DevOps
• Each install phase requires documenting 300+ cables across 60+ devices in Excel before contractors can rack and stack—wrong PDU orientations, fiber polarity issues, and faulty power sockets are common gotchas
• Network design uses BGP to consolidate routing tables from multiple ISPs, selecting optimal paths per IP prefix (e.g., Telstra for Australian traffic, PCCW for Japan)
• The payoff: control over pricing (no egress fees), service quality, and engineering capabilities that hyperscalers fundamentally can't provide
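The per-prefix path selection mentioned above boils down to longest-prefix matching against routes learned from each upstream. A minimal sketch in Python, with made-up prefixes (the actual Telstra/PCCW announcements are not in the article):

```python
import ipaddress

# Toy longest-prefix-match table. A real BGP router builds this from
# routes advertised by each upstream ISP; the prefixes below are
# illustrative placeholders, not real Telstra/PCCW announcements.
ROUTES = {
    ipaddress.ip_network("1.128.0.0/11"): "Telstra",    # hypothetical AU-bound block
    ipaddress.ip_network("101.78.0.0/16"): "PCCW",      # hypothetical JP-bound block
    ipaddress.ip_network("0.0.0.0/0"): "DefaultISP",    # fallback route
}

def next_hop(dst: str) -> str:
    """Pick the upstream whose route has the longest matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]
```

For example, `next_hop("8.8.8.8")` falls through to the default route, while an address inside the hypothetical Australian block selects Telstra. Real best-path selection also weighs AS-path length, local preference, and MED, which this sketch ignores.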

Railway's journey from Google Cloud Platform to bare metal infrastructure reveals the hidden costs and constraints of hyperscalers. Despite multi-million dollar annual spend, they received minimal support, faced unexplained outages, and hit engineering constraints that prevented them from delivering competitive pricing and features to customers. The decision to build their own data centers wasn't just about cost—it was about survival and control.

The physical buildout process is remarkably different from software infrastructure. Power is the critical resource, paid as a fixed monthly commit regardless of consumption, with costs varying 2x between geographies like US West Coast versus Singapore. They chose cage colocation (private space with mesh walls) over greenfield or rack colocation, then had to solve power density calculations, select PDUs with remote metering capabilities, and design cold/hot aisle airflow. Network connectivity required contracting with multiple Tier 1 ISPs selected for regional footprint maturity, then using BGP to consolidate routing tables and optimize packet handoff per destination: Australian traffic goes to Telstra, while Japanese traffic rides PCCW's peering with NTT.
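The billing model above is simple but unforgiving: you pay for the commit whether or not you draw it. A minimal sketch, with placeholder $/kW rates (the article only states a roughly 2x spread between geographies):

```python
# Rough monthly colocation power cost model. Rates are hypothetical
# illustrations of the ~2x geographic spread described in the article,
# not actual quotes for US West Coast or Singapore.
RATE_PER_KW = {"us-west": 150.0, "singapore": 300.0}  # assumed $/kW/month

def monthly_power_cost(region: str, committed_kw: float) -> float:
    """Cost of a fixed power commit: billed on the commit, not on consumption."""
    return RATE_PER_KW[region] * committed_kw
```

Under this model a 100 kW cage costs the same whether the racks idle or run at full load, which is why getting the power density calculation right up front matters so much.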

The operational complexity is intense: each install phase involves 60+ devices and 300+ discrete cables, all documented in Excel-based cabling matrices and rack elevations before contractors can begin. Common gotchas include upside-down PDUs that reverse socket numbering, facilities wiring power phase-to-neutral instead of phase-to-phase, contractors mounting reverse-airflow switches backwards, wrong fiber polarity that requires "rolling cables" (swapping the plugs on an LC connector), and faulty PDU sockets that need persuasion from a rubber mallet. The team built internal tools (Railyard and MetalCP) to automate generating build specifications and to manage the infrastructure. Their network runs FRR and SONiC on whitebox switches in a software-driven, L3-only design that integrates with their control plane.
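The cabling matrices described above are mechanical enough to generate rather than hand-type. A minimal sketch that emits one CSV row per leaf-to-spine cable; the device names, port scheme, and topology are invented for illustration, since Railway's Railyard/MetalCP tooling is not public:

```python
import csv
import io
import itertools

def cabling_matrix(spines, leaves, uplinks_per_leaf=2):
    """Yield one row per cable: leaf uplink port -> next free spine port."""
    rows = []
    spine_ports = {s: itertools.count(1) for s in spines}  # next free port per spine
    for leaf in leaves:
        for i in range(uplinks_per_leaf):
            spine = spines[i % len(spines)]  # spread uplinks across spines
            rows.append({
                "a_device": leaf, "a_port": f"eth{i + 1}",
                "z_device": spine, "z_port": f"eth{next(spine_ports[spine])}",
                "media": "fiber-LC",
            })
    return rows

def to_csv(rows):
    """Render the matrix as CSV, one line per cable, for contractors."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Generating the sheet from the intended topology, instead of typing 300+ rows by hand, also means polarity and port-numbering conventions are applied consistently everywhere.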