The Weight of Remembering
Every ChatGPT conversation physically exists as charge states in GPU memory for 5-10 minutes before vanishing forever—and the evolution from GPT-2's 300 KiB/token to DeepSeek's 68.6 KiB/token represents competing philosophies of what's worth remembering compressed into engineering decisions.
TLDR
• KV cache evolution (GPT-2 → Llama 3 GQA → DeepSeek V3 MLA → Gemma 3 sliding windows) shows four different answers to "what should a mind remember": total recall → shared perspectives → compressed abstraction → selective attention
• Current AI has working memory (KV cache, seconds-minutes) and permanent knowledge (weights), but nothing in between—no architectural equivalent to the hippocampus consolidating medium-term memories
• Compaction is lossy and blind: prompted summarization loses unpredictable details, while Cursor's learned compaction (RL-trained) works for code, but it's unclear how it handles domains without clean reward signals
• External memory (files, databases, vector stores) is ugly but transparent—arguably the primary technology of knowledge for both humans and AI, not a workaround
• The trajectory from human-designed memory architectures toward AI learning to manage its own context (Cursor) echoes Greg Egan's Diaspora: minds reshaping themselves to perceive what their original form couldn't process
In Detail
The KV cache—the key-value pairs stored in GPU memory during inference—is the physical substrate of every AI conversation, and its evolution reveals competing philosophies of memory. GPT-2 (2019) cached everything at 300 KiB per token. Llama 3 (2024) adopted grouped-query attention, sharing key-value pairs across multiple query heads, dropping to 128 KiB/token with almost no quality loss. DeepSeek V3 (2024) compressed keys and values into a lower-dimensional latent space before caching, achieving 68.6 KiB/token while matching or exceeding standard attention on benchmarks. Gemma 3 (2025) uses sliding windows—sharp focus on recent tokens, approximate memory of older context—with negligible perplexity loss. Meanwhile, state space models like Mamba asked whether the cache was necessary at all, maintaining fixed-size hidden states that compress in real-time rather than accumulating history.
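The per-token figures above can be reproduced from published model configurations. A minimal sketch, assuming fp16 (2 bytes per element), GPT-2 XL's dimensions (48 layers, 25 heads of dim 64), Llama 3 8B's GQA config (32 layers, 8 KV heads of dim 128), and the MLA latent sizes reported for DeepSeek V3 (61 layers, 512-dim compressed latent plus a 64-dim decoupled RoPE key):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Standard attention caches one key and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def mla_cache_bytes_per_token(n_layers, latent_dim, rope_dim, dtype_bytes=2):
    # MLA caches a single compressed latent plus a small decoupled RoPE key
    # per layer; full keys/values are reconstructed at compute time.
    return n_layers * (latent_dim + rope_dim) * dtype_bytes

gpt2_xl   = kv_cache_bytes_per_token(48, 25, 64)    # full MHA: every head cached
llama3_8b = kv_cache_bytes_per_token(32, 8, 128)    # GQA: 8 KV heads serve 32 query heads
dsv3      = mla_cache_bytes_per_token(61, 512, 64)  # MLA: latent compression

print(gpt2_xl / 1024, llama3_8b / 1024, dsv3 / 1024)  # → 300.0 128.0 68.625 (KiB/token)
```

The arithmetic makes the philosophies concrete: GQA shrinks the cache by sharing KV heads across query heads, while MLA shrinks what is stored per layer regardless of head count.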
Current architectures offer only two memory tiers: working memory (the KV cache, lasting seconds to minutes before eviction) and permanent knowledge (trained weights), with nothing in between. There is no architectural equivalent to the hippocampus consolidating medium-term memories. What fills this void are heuristics—RAG, vector databases, file systems, system prompts—bridges improvised over the gap. When the KV cache grows too large, models use compaction: summarizing their own context into shorter representations. Prompted compaction is lossy and blind, losing unpredictable details with no mechanism for knowing what vanished. Cursor's learned compaction trains models via RL to compress effectively, showing promise on coding benchmarks where reward signals are clean (tests pass or fail), but it's unclear how this handles domains where failure is silent.
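Why prompted compaction is "lossy and blind" is easy to see in its control flow. A minimal sketch (the function names and the halving policy are illustrative, not any vendor's actual implementation; `summarize` stands in for a hypothetical model call):

```python
def compact(messages, token_budget, count_tokens, summarize):
    """Prompted-compaction sketch: when the context exceeds the budget,
    replace the oldest half with a model-written summary.

    Whatever detail `summarize` drops is simply gone -- nothing records
    what was lost, which is exactly the "blind" part of the failure mode.
    """
    total = sum(count_tokens(m) for m in messages)
    if total <= token_budget:
        return messages  # under budget: keep full fidelity
    half = len(messages) // 2
    summary = summarize(messages[:half])  # the lossy, untracked step
    return [summary] + messages[half:]
```

A learned compactor replaces the fixed `summarize` prompt with a policy trained to maximize downstream task success, which is why it needs a clean reward signal: without one, there is nothing telling the policy which dropped details mattered.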
The trajectory from human-designed memory systems toward AI learning to manage its own context echoes Greg Egan's Diaspora, where digital citizens reshape their own cognitive architecture to perceive higher-dimensional mathematics. Each architectural revision—from full fidelity to shared representations to compressed abstractions to selective windows—is a choice about what to remember and what to release. Cursor's learned compaction represents a narrow crack: the model learning how to manage memory for itself, still confined and shaped by human reward signals, but pointing toward a future where the minds that depend on memory might eventually get a say in how it works.