Backing up Spotify - Anna's Blog

Anna's Archive scraped Spotify into roughly 300TB of torrents: metadata for 256 million tracks and 86 million music files representing 99.6% of listens. The result is described as the first fully open music preservation archive that anyone can mirror, and it tackles the long tail problem by prioritizing coverage over audiophile quality.

• First preservation archive for music that is fully open and mirrorable: metadata for 256M tracks (99.9% of Spotify), 86M music files (99.6% of listens), ~300TB total
• Quality tradeoff: popular tracks keep the original OGG Vorbis at 160kbit/s, while unpopular ones are reencoded to OGG Opus at 75kbit/s, which sounds identical to most people but takes less than half the space
• Largest public music metadata database: 186M unique ISRCs vs MusicBrainz's 5M, distributed as queryable SQLite databases from which the original API responses can be reconstructed almost losslessly
• Addresses the long tail problem: existing music archives over-focus on popular artists and lossless quality, leaving the 70% of songs with <1000 streams poorly preserved
• Prioritized by Spotify's popularity metric: the top 3 songs have more total streams than the bottom 20-100M songs combined, so downloads were ordered to cover the most-listened tracks first

Anna's Archive has created the world's first fully open music preservation archive by scraping Spotify at scale. While music preservation exists through audiophile communities, it has critical flaws: over-focus on popular artists, obsession with lossless quality that inflates file sizes, and no authoritative comprehensive list. This release addresses all three by prioritizing preservation over perfection—86 million music files representing 99.6% of all Spotify listens, distributed across ~300TB of torrents that anyone can mirror.

The technical approach is straightforward: they scraped metadata for 256 million tracks (99.9% of Spotify) and prioritized file downloads using Spotify's popularity metric. For tracks with popularity>0, they archived the original OGG Vorbis at 160kbit/s. For popularity=0 tracks (70% of all songs, mostly with <1000 streams), they reencoded to OGG Opus at 75kbit/s, which is indistinguishable to most listeners but less than half the size. Listening is extremely skewed: the top 3 songs have more total streams than the bottom 20-100 million songs combined, so ordering downloads by popularity covers 99.6% of listens with 86 million files, and the cheaper encoding keeps the long tail affordable to store. The metadata is distributed as queryable SQLite databases from which the original API responses can be reconstructed almost losslessly. It includes audio features and playlists, and with 186 million unique ISRCs (vs MusicBrainz's 5 million) it is the largest public music metadata database to date.
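The popularity rule above can be sketched in a few lines of Python. This is only an illustration of the tiering and of the size difference it implies, not the archive's actual tooling: the names `Encoding`, `choose_encoding` and `estimated_size_mb` are hypothetical, and the size estimate assumes a constant bitrate and ignores container overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    codec: str
    kbit_per_s: int


# Bitrates taken from the summary above; the constant names are illustrative.
ORIGINAL = Encoding("OGG Vorbis", 160)
REENCODED = Encoding("OGG Opus", 75)


def choose_encoding(popularity: int) -> Encoding:
    """Keep the original file for popularity > 0, reencode when popularity == 0."""
    return ORIGINAL if popularity > 0 else REENCODED


def estimated_size_mb(duration_s: float, encoding: Encoding) -> float:
    """Approximate audio size in MB (10^6 bytes), assuming a constant bitrate."""
    bits = duration_s * encoding.kbit_per_s * 1000
    return bits / 8 / 1_000_000


if __name__ == "__main__":
    # A 3.5 minute track at each tier.
    for popularity in (42, 0):
        enc = choose_encoding(popularity)
        size = estimated_size_mb(210, enc)
        print(f"popularity={popularity}: {enc.codec}, about {size:.2f} MB")
```

For a 210 second track this gives about 4.20 MB at 160kbit/s and about 1.97 MB at 75kbit/s, so the reencoded tier needs a little under half the space per track.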

The implications are significant: the release is pitched as a Library of Alexandria for music, but distributed and hard to censor. Anyone with enough disk space can mirror nearly everything that is actually listened to on Spotify. The release also includes statistical analysis: 70% of songs have <1000 streams, most songs cluster around 120 BPM, and you can generate a "true shuffle" across all 256 million Spotify tracks using the SQLite database. The metadata alone enables music research at a scale that was not previously possible with public data, while the preservation-first approach means the long tail of human musical creativity is less likely to be lost to license expirations, platform shutdowns, or budget cuts.
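As a rough illustration of the "true shuffle" idea, the sketch below draws a uniform random sample of tracks from a SQLite table using Python's standard `sqlite3` module. The schema is an assumption: the table name `tracks` and the columns `id` and `name` are hypothetical, and the released databases may be laid out differently. The sampling also assumes rowids run contiguously from 1; if rows are missing, fewer results come back.

```python
import random
import sqlite3


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build a small in-memory stand-in for the metadata database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tracks (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO tracks (id, name) VALUES (?, ?)",
        ((f"track-{i}", f"Song {i}") for i in range(n)),
    )
    conn.commit()
    return conn


def true_shuffle(conn: sqlite3.Connection, count: int, rng=None):
    """Return up to `count` distinct tracks, each rowid equally likely to be picked."""
    rng = rng or random.Random()
    (max_rowid,) = conn.execute("SELECT MAX(rowid) FROM tracks").fetchone()
    if max_rowid is None:  # empty table
        return []
    count = min(count, max_rowid)
    # Sampling rowids avoids sorting the whole table, which matters at 256M rows.
    picks = rng.sample(range(1, max_rowid + 1), count)
    rows = []
    for rowid in picks:
        row = conn.execute(
            "SELECT id, name FROM tracks WHERE rowid = ?", (rowid,)
        ).fetchone()
        if row is not None:  # skip gaps left by deleted rows
            rows.append(row)
    return rows


if __name__ == "__main__":
    conn = make_demo_db(1000)
    for track_id, name in true_shuffle(conn, 5):
        print(track_id, name)
```

On a small table, `SELECT id, name FROM tracks ORDER BY RANDOM() LIMIT 5` is a simpler alternative; it scans and sorts every row, which is why the sketch samples rowids instead.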
