Event Sourcing: Real-World Pitfalls and Lessons Learned
A battle-tested guide to event sourcing that skips the tutorials and focuses on what actually breaks in production—from framework pollution of your domain to split-brain scenarios to why "exactly-once delivery" is a fairy tale.
TLDR
• Event sourcing has three maturity levels: transactional (easy, slow), async projections (eventual consistency), and distributed with event bus (complex, scalable)—choose based on your actual load, not aspirations
• Keep your domain pure: frameworks like Axon/Lagom pollute business code with annotations; use libraries (Akka Persistence) and isolate infrastructure in separate packages
• Cassandra beats relational DBs for event stores at scale: leaderless architecture, optimized for writes, handles 100K+ transactions/sec with linear horizontal scaling
• Binary serialization (Protocol Buffers/Avro) gives 60-70% disk savings, faster performance, and better schema evolution than JSON
• Distributed systems realities: exactly-once delivery doesn't exist, handle duplicates with idempotent updates, use compensating actions (sagas) instead of distributed transactions, and prepare for split-brain with resolvers and chaos testing
In Detail
The speaker presents lessons from building five commercial event-sourced systems, focusing on production realities rather than theory. Event sourcing exists on a maturity spectrum: transactional event sourcing (saving events and updating read models in one transaction—simple but slow), async projections (eventual consistency but better performance), and distributed architecture with event bus (complex but highly scalable). Most teams should start simple and add complexity only when load demands it.
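To make the first maturity level concrete, here is a minimal sketch of transactional event sourcing: appending the event and updating the read model happen in one atomic step, which is simple but serializes all writes. All names (`eventLog`, `balanceView`, `AccountDeposited`) are illustrative, not from the talk; a `synchronized` block stands in for what would be a single database transaction in production.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransactionalEventSourcing {
    record AccountDeposited(String accountId, long amountCents) {}

    static final List<AccountDeposited> eventLog = new ArrayList<>(); // event store
    static final Map<String, Long> balanceView = new HashMap<>();     // read model

    // In a real system this is one database transaction; here the
    // synchronized block stands in for that atomicity.
    static synchronized void deposit(String accountId, long amountCents) {
        AccountDeposited event = new AccountDeposited(accountId, amountCents);
        eventLog.add(event);                                  // 1. append event
        balanceView.merge(accountId, amountCents, Long::sum); // 2. update projection
        // commit: both succeed or neither does
    }

    public static void main(String[] args) {
        deposit("acc-1", 500);
        deposit("acc-1", 250);
        System.out.println(balanceView.get("acc-1")); // 750
    }
}
```

The read model is never stale, which is exactly why this level is "easy, slow": every write pays for every projection inside the same transaction.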
A critical mistake is letting frameworks pollute your domain. Popular frameworks like Axon and Lagom force you to annotate business entities with infrastructure concerns (serialization, persistence). The solution: keep domain logic in pure functions (just process command → return events, apply event → return state), isolate framework code in application packages, and use libraries like Akka Persistence over full frameworks. This makes testing trivial and switching frameworks possible.
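The pure-function shape described above can be sketched as two side-effect-free methods — `decide(state, command) -> events` and `evolve(state, event) -> state` — with no framework annotations anywhere. The domain types here (`Account`, `Deposit`, `Deposited`) are illustrative:

```java
import java.util.List;

public class PureDomain {
    record Account(long balanceCents) {}

    sealed interface Command {}
    record Deposit(long amountCents) implements Command {}

    sealed interface Event {}
    record Deposited(long amountCents, long newBalanceCents) implements Event {}

    // process command -> return events (no I/O, trivially unit-testable)
    static List<Event> decide(Account state, Command cmd) {
        if (cmd instanceof Deposit d) {
            if (d.amountCents() <= 0) throw new IllegalArgumentException("amount must be positive");
            return List.of(new Deposited(d.amountCents(), state.balanceCents() + d.amountCents()));
        }
        return List.of();
    }

    // apply event -> return new state (a pure fold over the event log)
    static Account evolve(Account state, Event evt) {
        if (evt instanceof Deposited d) return new Account(d.newBalanceCents());
        return state;
    }

    public static void main(String[] args) {
        Account a = new Account(0);
        for (Event e : decide(a, new Deposit(500))) a = evolve(a, e);
        System.out.println(a.balanceCents()); // 500
    }
}
```

Because neither function touches a database or an annotation, tests need no framework bootstrapping, and the same domain code can be wired into Akka Persistence or any other persistence layer in a separate application package.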
For databases, relational stores are safe but scale only vertically. Cassandra offers partitioning by design, leaderless architecture (no single point of failure), and write optimization that handles 100K+ transactions/sec with linear horizontal scaling. Binary serialization (Protocol Buffers or Avro) beats JSON with 60-70% disk savings, faster performance, and superior schema evolution support. Enrich events with context (current balance, not just delta) to make projectors simple—they shouldn't need external queries.
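The enrichment point can be shown in a few lines: when the event already carries the resulting balance computed by the aggregate at write time, the projector is a dumb overwrite with no read-modify-write cycle and no external lookup. Class and field names below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class EnrichedProjection {
    // Enriched event: carries newBalanceCents, not just the delta
    record MoneyDeposited(String accountId, long deltaCents, long newBalanceCents) {}

    static final Map<String, Long> balances = new HashMap<>();

    // The projector just writes the precomputed value -- no queries needed
    static void project(MoneyDeposited e) {
        balances.put(e.accountId(), e.newBalanceCents());
    }

    public static void main(String[] args) {
        project(new MoneyDeposited("acc-1", 500, 500));
        project(new MoneyDeposited("acc-1", 250, 750));
        System.out.println(balances.get("acc-1")); // 750
    }
}
```

The trade-off is slightly larger events, but that cost is offset by the binary serialization savings mentioned above, and it keeps projectors stateless and replayable.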
The hardest lessons come from distributed systems realities: exactly-once delivery is a myth (use idempotent updates and sequence-number tracking for effectively-once semantics), multi-aggregate transactions require compensating actions via sagas rather than distributed transactions, GDPR erasure calls for crypto-shredding (encrypt each user's data with a per-user key, then delete the key on request), and split-brain scenarios will happen in clusters (use split-brain resolvers, extensive failover testing, and keep clusters small). The reward for handling this complexity is a system that scales like Twitter, Facebook, and Netflix—all built on event stream processing.
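The "effectively-once" idea above reduces to a small amount of bookkeeping: the projector remembers the highest sequence number it has applied per aggregate and silently drops redeliveries, so at-least-once delivery plus this check behaves like exactly-once. A minimal sketch, with all names illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentProjector {
    record Event(String aggregateId, long seqNr, long amountCents) {}

    // Per-aggregate high-water mark; in production this is persisted
    // transactionally together with the read-model update.
    static final Map<String, Long> lastSeenSeq = new HashMap<>();
    static final Map<String, Long> balances = new HashMap<>();

    static boolean apply(Event e) {
        long lastSeen = lastSeenSeq.getOrDefault(e.aggregateId(), 0L);
        if (e.seqNr() <= lastSeen) return false;      // duplicate: ignore
        balances.merge(e.aggregateId(), e.amountCents(), Long::sum);
        lastSeenSeq.put(e.aggregateId(), e.seqNr());
        return true;
    }

    public static void main(String[] args) {
        apply(new Event("acc-1", 1, 500));
        apply(new Event("acc-1", 2, 250));
        apply(new Event("acc-1", 2, 250)); // redelivery: not double-counted
        System.out.println(balances.get("acc-1")); // 750
    }
}
```

The crucial detail is that the sequence-number check and the view update must commit together; if they are stored separately, a crash between the two reintroduces the duplicate problem.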