Theory is essential, but deep understanding comes from seeing how distributed systems work (and fail) in production at massive scale. This module covers battle-tested architectures from the world’s leading technology companies.

Every case study in this chapter follows the same pattern: a company hit a scaling wall, made a set of engineering trade-offs under real constraints (time, money, team expertise, existing code), and lived with the consequences, both intended and unintended. These are not “the right way” to build systems; they are “one way that worked for a specific company at a specific point in time.” The lesson is never “copy Google’s architecture.” It is “understand the reasoning that led to Google’s decisions and apply that reasoning to your own, very different, situation.”
Google Spanner is the database that made the impossible possible: globally consistent transactions. Before Spanner, the conventional wisdom (hardened by the CAP theorem) was that you had to choose between strong consistency and global distribution. Spanner essentially said “what if we throw GPS receivers and atomic clocks at the problem?” and built a system where the laws of physics help enforce transaction ordering.
The Insight: You can’t synchronize clocks perfectly, but you can bound the uncertainty.
TRADITIONAL APPROACH:
"The time is 10:00:00.000"
(But actually, who knows? Could be off by milliseconds)

TRUETIME APPROACH:
"The time is between 10:00:00.000 and 10:00:00.007"
(Guaranteed! We measured the uncertainty)

HARDWARE:
• GPS receivers on datacenter roofs
• Atomic clocks (rubidium) as backup
• Multiple time masters per datacenter
• Typical uncertainty: 1-7ms
Usage in Transactions:
1. Transaction T1 gets commit timestamp s = TT.now().latest
2. T1 waits until TT.after(s) is true (the "commit wait")
3. Now GUARANTEED: any transaction that starts after T1 commits gets a timestamp greater than s
4. External consistency achieved without nodes coordinating their clocks on every commit!
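The commit-wait steps above can be sketched in a few lines. This is a toy model, not Spanner's implementation: the `TrueTime` class, its fixed `epsilon_s` bound, and the use of the local monotonic clock are assumptions made for illustration; only `TT.now()` and `TT.after()` come from the text.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy stand-in for TrueTime: a local clock plus an assumed fixed error bound."""

    def __init__(self, epsilon_s: float = 0.004):
        self.epsilon_s = epsilon_s  # assumption: constant uncertainty of 4 ms

    def now(self) -> TTInterval:
        t = time.monotonic()  # a real deployment would read a synchronized clock
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, ts: float) -> bool:
        # True only once ts is guaranteed to be in the past.
        return self.now().earliest > ts


def commit(tt: TrueTime) -> float:
    """Pick a commit timestamp, then wait until it has definitely passed."""
    ts = tt.now().latest
    while not tt.after(ts):          # commit wait, roughly 2 * epsilon
        time.sleep(tt.epsilon_s / 4)
    return ts


tt = TrueTime(epsilon_s=0.004)
t1 = commit(tt)  # T1 finishes committing...
t2 = commit(tt)  # ...before T2 starts, so T2 must get a larger timestamp
```

Because every commit waits out the uncertainty window, the cost of this guarantee is write latency proportional to the clock error bound, which is why shrinking that bound matters so much.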
Paxos Groups for Replication
Per-Tablet Paxos:
Each tablet (partition) has its own Paxos group
3-5 replicas across zones
Writes go through Paxos leader
Reads can go to any replica that is caught up to the read timestamp
Split and Merge:
Tablets automatically split when too large
Tablets merge when too small
Paxos ensures consistent split/merge
External Consistency
The Guarantee: If transaction T1 commits before T2 starts, T1’s timestamp < T2’s timestamp.

Why It Matters:
SCENARIO:
User in US writes a record
User in Europe reads immediately after

WITHOUT EXTERNAL CONSISTENCY:
European read might see old data (before US write)

WITH EXTERNAL CONSISTENCY:
European read guaranteed to see US write
(because read timestamp > write timestamp)
┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE 2012 LEAP SECOND INCIDENT                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  DATE: June 30, 2012                                                        │
│                                                                             │
│  WHAT HAPPENED:                                                             │
│  • Leap second added at midnight UTC                                        │
│  • Linux kernel bug in leap-second handling broke timer behavior            │
│  • Systems consuming 100% CPU doing nothing                                 │
│  • Widespread outages: Reddit, LinkedIn, Mozilla, Gawker                    │
│                                                                             │
│  HOW GOOGLE HANDLED IT:                                                     │
│  • Google had already implemented "leap smear"                              │
│  • Instead of adding 1 second at midnight:                                  │
│    - Slow down time by 11.6 μs per second                                   │
│    - Spread over 24 hours before midnight                                   │
│  • All Google services remained stable                                      │
│                                                                             │
│  LESSON:                                                                    │
│  • Time is a critical distributed systems primitive                         │
│  • You must control how time changes propagate                              │
│  • Google's investment in TrueTime paid dividends                           │
│                                                                             │
│  RESULT:                                                                    │
│  • Leap smearing became common among large operators                        │
│  • AWS and others now use similar approaches                                │
│  • International discussions to eliminate leap seconds entirely             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
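The 11.6 μs figure follows directly from spreading one second over 24 hours. A minimal sketch of that arithmetic, assuming a linear smear; the function names are illustrative, and real smear implementations may use a different curve or window:

```python
SMEAR_WINDOW_S = 24 * 60 * 60  # 86,400 seconds in the smear window
LEAP_S = 1.0                   # total adjustment to absorb


def smear_rate_us_per_s(window_s: float = SMEAR_WINDOW_S) -> float:
    """Microseconds of adjustment applied during each second of the window."""
    return LEAP_S / window_s * 1e6


def smeared_offset_s(elapsed_s: float, window_s: float = SMEAR_WINDOW_S) -> float:
    """Portion of the leap second already absorbed after elapsed_s seconds."""
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return LEAP_S * fraction


rate = smear_rate_us_per_s()          # about 11.57 microseconds per second
halfway = smeared_offset_s(12 * 3600)  # half the leap second absorbed at 12 hours
```

No clock ever jumps or repeats a second under this scheme; each second is simply stretched by a few microseconds, which is small enough that software never notices.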
SCENARIO:
Customer adding items to cart during peak shopping
Database partition occurs

OPTION A: Reject writes, show error
→ Customer leaves, buys from competitor
→ Lost revenue: $$$

OPTION B: Accept writes, reconcile later
→ Worst case: Duplicate items in cart
→ Customer removes duplicates at checkout
→ Revenue preserved

AMAZON CHOSE: Option B (Availability over Consistency)
Implementation:
Leaderless replication
Write to W of N replicas (W < N means some can be down)
Read from R replicas, resolve conflicts
Shopping cart uses “union” merge (keep all items)
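The W/R/N settings and the union merge can be illustrated with a small sketch. The helper names are hypothetical, and the `W + R > N` overlap rule is the standard quorum condition rather than something stated above.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n


def merge_carts(*versions: set) -> set:
    """Resolve divergent cart versions with a union, so an added item is never lost."""
    merged = set()
    for cart in versions:
        merged |= cart
    return merged


# Three replicas (N=3). During a partition, two writes for the same cart
# reached different replicas, so the stored versions diverged.
replica_versions = [
    {"book", "lamp"},     # replica 1 saw "add lamp"
    {"book", "headset"},  # replica 2 saw "add headset"
    {"book"},             # replica 3 missed both writes
]

# Read from R=2 replicas and reconcile whatever comes back.
cart = merge_carts(replica_versions[0], replica_versions[1])
```

The union is what turns the worst case into a surplus item rather than a lost one. The flip side, noted in the Dynamo paper, is that a deleted item can occasionally resurface after a merge.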
Consistent Hashing with Virtual Nodes
Problem: With one ring position per node, a node joining or leaving shifts its entire range onto a single neighbor, causing heavy reshuffling and uneven load

Solution: Virtual nodes
WITHOUT VIRTUAL NODES:
Hash ring:    [──A──|──B──|──C──|──D──]
Node B fails: [──A──|────C────|──D──]
Node C now has 2x data (overloaded!)

WITH VIRTUAL NODES:
Each physical node = 150+ virtual nodes
Hash ring: [A1|B2|C1|A2|D1|B1|C2|D2|...]
Node B fails: B's virtual nodes spread across A, C, D
Load stays balanced
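A minimal sketch of a hash ring with virtual nodes shows both properties at once: only the failed node's keys move, and they spread across all survivors. The `HashRing` class, the MD5-based `ring_hash`, and the key names are illustrative choices, not Dynamo's actual implementation.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(key: str) -> int:
    """Stable 64-bit ring position (MD5 is used for spread, not security)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self._points = []     # (position, node) pairs, kept sorted
        self._positions = []  # positions only, for bisect
        for node in nodes:
            self.add(node)

    def _reindex(self):
        self._points.sort()
        self._positions = [pos for pos, _ in self._points]

    def add(self, node):
        # Each physical node is placed on the ring at `vnodes` positions.
        self._points.extend((ring_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._reindex()

    def remove(self, node):
        self._points = [p for p in self._points if p[1] != node]
        self._reindex()

    def lookup(self, key):
        # Owner is the first virtual node clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._positions, ring_hash(key)) % len(self._positions)
        return self._points[idx][1]


keys = [f"cart-{i}" for i in range(10_000)]
ring = HashRing(["A", "B", "C", "D"])
before = {k: ring.lookup(k) for k in keys}
ring.remove("B")  # node B fails
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
inherited = Counter(after[k] for k in moved)  # which nodes picked up B's keys
```

Inspecting `inherited` shows B's former keys divided among A, C, and D rather than landing on one neighbor, while every key that B never owned stays where it was.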
Adaptive Capacity
Original Problem:
Provisioned throughput (e.g., 1000 WCU)
Uniform distribution assumed
Hot partition = throttling
Solution: Adaptive capacity
BEFORE (2017):
Table: 1000 WCU
Partition 1: 250 WCU limit (1000/4)
Hot key in Partition 1: Throttled!

AFTER (Adaptive Capacity):
Table: 1000 WCU
Partition 1: Can use up to 1000 WCU if others idle
Hot key: No throttling as long as total < 1000

PLUS: On-demand capacity (2018)
No provisioning, pay per request
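The before/after difference can be modeled with a few lines of arithmetic. This is a toy model under stated assumptions (equal base shares, idle capacity lent in partition order), not DynamoDB's actual algorithm, and both function names are hypothetical.

```python
def throttled_before(demand, table_wcu):
    """Fixed split: each partition is capped at table_wcu / partition_count."""
    limit = table_wcu / len(demand)
    return [max(0.0, d - limit) for d in demand]


def throttled_adaptive(demand, table_wcu):
    """Hot partitions may borrow capacity that other partitions leave idle."""
    share = table_wcu / len(demand)
    spare = sum(max(0.0, share - d) for d in demand)  # unused capacity in the table
    throttled = []
    for d in demand:
        excess = max(0.0, d - share)
        borrowed = min(excess, spare)
        spare -= borrowed
        throttled.append(excess - borrowed)
    return throttled


# Four partitions, 1000 WCU table, one hot partition wanting 700 WCU.
demand = [700, 100, 100, 50]
fixed = throttled_before(demand, 1000)       # hot partition loses 450 WCU
adaptive = throttled_adaptive(demand, 1000)  # total demand 950 < 1000, nothing throttled
```

When total demand exceeds the table limit, the adaptive model still throttles, but only by the amount the table as a whole is over budget.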
EVCache (Ephemeral Volatile Cache) = Memcached-based distributed cache

KEY FEATURES:
1. Replication across zones (survive AZ failure)
2. Local zone preference (lower latency)
3. Fast fallback to other zones
4. Shadow clusters for testing

TOPOLOGY:
┌─────────────────────────────────────────────┐
│               EVCache Cluster               │
│                                             │
│   Zone A          Zone B          Zone C    │
│   ┌─────┐         ┌─────┐         ┌─────┐   │
│   │MC-1 │ ◄─────► │MC-1 │ ◄─────► │MC-1 │   │
│   │MC-2 │         │MC-2 │         │MC-2 │   │
│   │MC-3 │         │MC-3 │         │MC-3 │   │
│   └─────┘         └─────┘         └─────┘   │
│                                             │
│  Writes: All zones (synchronous)            │
│  Reads: Local zone first, fallback to others│
└─────────────────────────────────────────────┘
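The write-everywhere, read-local-first behavior in the diagram can be sketched as follows. `ReplicatedCache` is a hypothetical class in which one dictionary per zone stands in for a Memcached pool; it is not the EVCache client API.

```python
class ReplicatedCache:
    """Toy zone-aware cache: one dict per zone stands in for a Memcached pool."""

    def __init__(self, zones, local_zone):
        self.local_zone = local_zone
        self.stores = {zone: {} for zone in zones}
        self.down = set()  # zones currently unreachable

    def set(self, key, value):
        # Write to every reachable zone so any zone can serve the key later.
        for zone, store in self.stores.items():
            if zone not in self.down:
                store[key] = value

    def get(self, key):
        # Local zone first for latency, then fall back to the remaining zones.
        order = [self.local_zone] + [z for z in self.stores if z != self.local_zone]
        for zone in order:
            if zone in self.down:
                continue
            if key in self.stores[zone]:
                return self.stores[zone][key], zone
        return None, None


cache = ReplicatedCache(zones=["A", "B", "C"], local_zone="A")
cache.set("user:42", {"plan": "premium"})
local_hit = cache.get("user:42")     # served from zone A
cache.down.add("A")                  # simulate losing the local zone
fallback_hit = cache.get("user:42")  # served from the next zone instead
```

Returning the serving zone alongside the value makes the fallback visible, which is handy when checking that cross-zone reads stay the exception rather than the norm.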
Zuul: Edge Gateway
Responsibilities:
Authentication
Dynamic routing
Load shedding
Request throttling
Attack detection
Scale: 1+ million RPS at the edge

Innovation: Zuul 2 (async/non-blocking)
Moved from thread-per-request to event loop
90% reduction in connection memory
Better tail latency under load
Chaos Engineering: The Simian Army
Chaos Monkey (2011): Randomly kills instances in production
THE SIMIAN ARMY:
Chaos Monkey      → Kill instances
Latency Monkey    → Inject artificial delays
Chaos Kong        → Fail entire AWS region
Doctor Monkey     → Health checks and remediation
Janitor Monkey    → Clean up unused resources
Conformity Monkey → Enforce best practices
Philosophy:
"The best way to avoid failure is to fail constantly"

If you can't handle instance failures at 3pm on Tuesday,
you definitely can't handle them at 3am on Black Friday.

RESULT:
- Engineers build resilient systems by default
- Failures become routine, not emergencies
- Recovery is automated, not heroic
Problem: Matching riders to drivers needs consistent, fast routing

Solution: Ringpop (SWIM + consistent hashing)
HOW IT WORKS:
1. SWIM Protocol for cluster membership
   - Nodes gossip about each other's health
   - Detect failures in seconds
   - No single point of failure
2. Consistent Hashing for request routing
   - Hash(rider_location) → specific node
   - That node has all nearby drivers in memory
   - Sub-millisecond matching decisions

TOPOLOGY:
City: San Francisco
┌────────────────────────────────────────────────────┐
│                     HASH RING                      │
│                                                    │
│   Node A ─────────────────── Node B                │
│   (SOMA,     ← gossip →      (Financial,           │
│    Mission)                   Marina)              │
│      │                          │                  │
│      └───────── Node C ─────────┘                  │
│                 (Castro,                           │
│                  Sunset)                           │
│                                                    │
│  Request for rider at (lat, lng):                  │
│  → Hash to Node A                                  │
│  → Node A has all drivers in that area             │
│  → Match in <10ms                                  │
└────────────────────────────────────────────────────┘
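The failure-detection half of SWIM can be sketched as a direct ping followed by indirect pings through a few helpers. This is a simplified in-memory model, not Ringpop's code: the `reachable` and `swim_probe` names, the `dead` set, and the `broken_links` set are all assumptions for illustration.

```python
import random


def reachable(src, dst, dead, broken_links):
    """A ping succeeds if the target is up and the link between the two is intact."""
    return dst not in dead and frozenset((src, dst)) not in broken_links


def swim_probe(prober, target, members, dead, broken_links, k=2, rng=random):
    """One SWIM-style probe round: direct ping, then indirect pings via k helpers."""
    if reachable(prober, target, dead, broken_links):
        return "alive"
    helpers = [m for m in members if m not in (prober, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # The helper pings the target on the prober's behalf.
        if (reachable(prober, helper, dead, broken_links)
                and reachable(helper, target, dead, broken_links)):
            return "alive"
    # Nobody could reach the target: mark it suspect, not dead, so it can refute.
    return "suspect"


members = ["A", "B", "C", "D"]

# Case 1: node C has really crashed.
crashed = swim_probe("A", "C", members, dead={"C"}, broken_links=set())

# Case 2: B is healthy, only the A<->B link is broken.
partitioned = swim_probe("A", "B", members, dead=set(),
                         broken_links={frozenset(("A", "B"))})
```

The indirect step is what keeps a single flaky link from evicting a healthy node, which matters when membership changes trigger hash-ring rebalancing.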
Schemaless: MySQL at Scale
Challenge:
Need horizontal scaling (MySQL doesn’t shard easily)
# Stripe's approach (simplified)
# `idempotency_store` and `actually_create_charge` are placeholders for a
# key-value store with async get/set/delete and the real charge logic.
import time


class ConflictError(Exception):
    """Raised when another request with the same key is still in flight."""


async def create_charge(idempotency_key: str, amount: int, **params):
    # 1. Check if we've seen this key before
    existing = await idempotency_store.get(idempotency_key)
    if existing:
        if existing["status"] == "completed":
            # Return cached response
            return existing["response"]
        if existing["status"] == "in_progress":
            # Someone else is processing
            raise ConflictError("Request in progress")

    # 2. Mark as in-progress. In production, steps 1 and 2 must be a single
    #    atomic insert-if-absent, or two concurrent requests can both proceed.
    await idempotency_store.set(
        idempotency_key,
        {"status": "in_progress", "started_at": time.time()},
    )

    try:
        # 3. Do the actual work
        result = await actually_create_charge(amount, **params)
        # 4. Store the result
        await idempotency_store.set(
            idempotency_key,
            {"status": "completed", "response": result},
        )
        return result
    except Exception:
        # 5. On failure, allow retry
        await idempotency_store.delete(idempotency_key)
        raise
TTL: Idempotency keys typically expire after 24 hours
Transactional Outbox Pattern
Problem: Need to update database AND send event atomically
WRONG APPROACH:
1. Update database (charge.status = "succeeded")
2. Send Kafka event
What if step 2 fails? Database updated, event lost!

TRANSACTIONAL OUTBOX:
1. In a single transaction:
   - Update charge.status = "succeeded"
   - Insert into outbox table: {event: "charge.succeeded", ...}
2. Separate process reads outbox, sends to Kafka
3. After Kafka confirms, delete from outbox

┌─────────────────────────────────────────────────────┐
│                      DATABASE                       │
│  ┌───────────────┐  ┌───────────────────────────┐   │
│  │ charges       │  │ outbox                    │   │
│  ├───────────────┤  ├───────────────────────────┤   │
│  │ id: ch_123    │  │ id: 1                     │   │
│  │ status: succ  │  │ event: charge.succeeded   │   │
│  │ ...           │  │ payload: {...}            │   │
│  └───────────────┘  └───────────────────────────┘   │
│                              │                      │
│                              │ Background process   │
│                              ▼                      │
│                       ┌─────────────┐               │
│                       │    Kafka    │               │
│                       └─────────────┘               │
└─────────────────────────────────────────────────────┘
Request Hedging
Problem: Tail latency (p99) is often much worse than median

Solution: Send to multiple replicas, use first response
HEDGING STRATEGY:
Time 0ms:  Send to replica A
Time 5ms:  If no response, also send to replica B
Time 10ms: If no response, also send to replica C
Use FIRST response, cancel others

BENEFITS:
• p99 latency dramatically reduced
• One slow replica doesn't hurt overall latency

CAUTION:
• Only for idempotent reads
• Increases load on backend (but usually worth it)
• Need cancellation to avoid wasted work
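The staggered-send strategy maps naturally onto `asyncio`. A minimal sketch, assuming each replica is represented by a zero-argument coroutine function; `hedged_request` and the two fake replicas are illustrative names, and error handling is deliberately left out (a failed first response simply propagates).

```python
import asyncio


async def hedged_request(replicas, hedge_delay_s=0.005):
    """Start replicas one at a time, hedge_delay_s apart; return the first result."""
    tasks = []
    try:
        for call in replicas:
            tasks.append(asyncio.create_task(call()))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay_s, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # Every replica has been tried; wait for whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # cancel the losers; a no-op for finished tasks


async def slow_replica():
    await asyncio.sleep(0.2)   # simulates a replica stuck in a long pause
    return "A"


async def fast_replica():
    await asyncio.sleep(0.01)
    return "B"


result = asyncio.run(hedged_request([slow_replica, fast_replica]))
```

Here the request to the slow replica is abandoned as soon as the second one answers. A common refinement is to set the hedge delay near the observed p95 latency, so that only a small fraction of requests ever send a second copy.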
Common across these companies: mutating operations accept an idempotency key so that retries are safe.
Stripe: Idempotency-Key header
AWS: ClientRequestToken
Google: requestId
Chaos Testing
You don’t know if you’re resilient until you test.
Netflix: Chaos Monkey, Chaos Kong
Amazon: GameDay exercises
Google: DiRT (Disaster Recovery Testing)
Circuit Breakers
Fail fast instead of cascading.
Netflix: Hystrix (now in maintenance mode; Resilience4j is the recommended successor)
Uber: Custom circuit breakers in every service
All use some variant of the pattern.
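The core of the pattern fits in a small state machine: closed, open, and half-open. The sketch below is a generic illustration, not Hystrix or Resilience4j code; the class name, thresholds, and injectable `clock` are assumptions chosen to make the behavior easy to observe.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a backend that is known to be failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on too many failures, or re-open if the trial call failed.
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


fake_now = [0.0]  # controllable clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0,
                         clock=lambda: fake_now[0])


def failing_call():
    raise RuntimeError("backend down")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except RuntimeError:
        pass
state_after_failures = breaker.state  # breaker has tripped

fake_now[0] = 11.0                    # reset timeout elapses
state_after_timeout = breaker.state
recovered = breaker.call(lambda: "ok")  # successful trial call closes the circuit
```

While the breaker is open, callers get an immediate `CircuitOpenError` rather than waiting on a timeout, which is what stops one failing dependency from tying up threads across the call chain.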
Observability
You can’t fix what you can’t see.
Distributed tracing: Zipkin/Jaeger-style
Metrics: RED method (Rate, Errors, Duration)
Logs: Structured, correlated by trace ID
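Two of these ideas are easy to show concretely: log lines that share a trace ID, and RED metrics computed from request observations. The `log_event` helper, the `RedMetrics` class, and the service names are hypothetical; a production setup would use a tracing and metrics library instead.

```python
import json
import statistics
import uuid


def log_event(trace_id: str, service: str, message: str, **fields) -> str:
    """Render one structured log line; every service logs the same trace_id."""
    record = {"trace_id": trace_id, "service": service, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


class RedMetrics:
    """Rate, Errors, Duration for one endpoint over a fixed window."""

    def __init__(self):
        self.durations_ms = []
        self.errors = 0

    def observe(self, duration_ms: float, ok: bool) -> None:
        self.durations_ms.append(duration_ms)
        if not ok:
            self.errors += 1

    def summary(self, window_s: float) -> dict:
        count = len(self.durations_ms)
        return {
            "rate_per_s": count / window_s,
            "error_ratio": self.errors / count if count else 0.0,
            "p50_ms": statistics.median(self.durations_ms) if count else 0.0,
        }


# One request crossing two services: both log lines carry the same trace ID.
trace_id = uuid.uuid4().hex
lines = [
    log_event(trace_id, "gateway", "request received", path="/charges"),
    log_event(trace_id, "payments", "charge created", duration_ms=42),
]

metrics = RedMetrics()
for duration_ms, ok in [(12, True), (40, True), (250, False), (18, True)]:
    metrics.observe(duration_ms, ok)
red = metrics.summary(window_s=2)
```

Searching logs for one `trace_id` then reconstructs a request's full path across services, while the RED summary shows at a glance whether an endpoint is busy, failing, or slow.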