Real-World Case Studies

Theory is essential, but seeing how distributed systems work (and fail) in production at massive scale is where deep understanding comes from. This module covers battle-tested architectures from the world’s leading technology companies.
Track Duration: 12-16 hours
Companies Covered: Google, Amazon, Netflix, Uber, Meta, Stripe
Focus: Architecture decisions, failure stories, lessons learned

Google Spanner

The database that made the impossible possible—globally consistent transactions.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GOOGLE SPANNER ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                          ┌─────────────────────┐                            │
│                          │     Spanner API     │                            │
│                          └──────────┬──────────┘                            │
│                                     │                                        │
│        ┌────────────────────────────┼────────────────────────────┐          │
│        │                            │                            │          │
│        ▼                            ▼                            ▼          │
│   ┌─────────────┐            ┌─────────────┐            ┌─────────────┐    │
│   │   Zone A    │            │   Zone B    │            │   Zone C    │    │
│   │  (Oregon)   │            │  (Iowa)     │            │  (Virginia) │    │
│   └──────┬──────┘            └──────┬──────┘            └──────┬──────┘    │
│          │                          │                          │            │
│   ┌──────┴──────┐            ┌──────┴──────┐            ┌──────┴──────┐    │
│   │ Spanserver  │            │ Spanserver  │            │ Spanserver  │    │
│   │  (Paxos     │◄──────────►│  (Paxos     │◄──────────►│  (Paxos     │    │
│   │   Leader)   │            │   Replica)  │            │   Replica)  │    │
│   └──────┬──────┘            └─────────────┘            └─────────────┘    │
│          │                                                                   │
│          ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           COLOSSUS                                  │   │
│   │                    (Distributed File System)                        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           TRUETIME                                  │   │
│   │              GPS + Atomic Clocks in every datacenter                │   │
│   │           API: TT.now() → [earliest, latest]                       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Innovations

The Insight: You can’t synchronize clocks perfectly, but you can bound the uncertainty.
TRADITIONAL APPROACH:
"The time is 10:00:00.000"
(But actually, who knows? Could be off by milliseconds)

TRUETIME APPROACH:
"The time is between 10:00:00.000 and 10:00:00.007"
(Guaranteed! We measured the uncertainty)

HARDWARE:
• GPS receivers on datacenter roofs
• Atomic clocks (rubidium) as backup
• Multiple time masters per datacenter
• Typical uncertainty: 1-7ms
Usage in Transactions:
1. Transaction T1 gets commit timestamp = TT.now().latest
2. T1 waits until TT.after(timestamp) is true
3. Now GUARANTEED: no other transaction can get timestamp ≤ T1's
4. External consistency achieved without distributed locks!
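A minimal sketch of that commit-wait rule, assuming a hypothetical TrueTime-like object that exposes an [earliest, latest] interval (the real API and clock hardware are internal to Google):

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the true time (seconds)
    latest: float    # upper bound on the true time (seconds)

class TrueTime:
    """Hypothetical stand-in that bounds uncertainty with a fixed epsilon."""
    def __init__(self, epsilon: float = 0.007):  # ~7 ms of uncertainty
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp: float) -> bool:
        # True only once the timestamp is definitely in the past.
        return self.now().earliest > timestamp

def commit(tt: TrueTime) -> float:
    # 1. Pick the commit timestamp at the top of the uncertainty window.
    commit_ts = tt.now().latest
    # 2. Commit wait: block until the timestamp has definitely passed.
    while not tt.after(commit_ts):
        time.sleep(0.001)
    # 3. Any transaction that starts now is guaranteed a larger timestamp.
    return commit_ts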
Per-Tablet Paxos:
  • Each tablet (partition) has its own Paxos group
  • 3-5 replicas across zones
  • Writes go through Paxos leader
  • Reads can go to any replica (with proper timestamp)
Split and Merge:
  • Tablets automatically split when too large
  • Tablets merge when too small
  • Paxos ensures consistent split/merge
The Guarantee: If transaction T1 commits before T2 starts, T1’s timestamp < T2’s timestamp.
Why It Matters:
SCENARIO:
User in US writes a record
User in Europe reads immediately after

WITHOUT EXTERNAL CONSISTENCY:
European read might see old data (before US write)

WITH EXTERNAL CONSISTENCY:
European read guaranteed to see US write
(because read timestamp > write timestamp)
Cost: Commit wait adds latency (~7ms average)

Spanner Failure Story: The Leap Second

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 LEAP SECOND INCIDENT                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: June 30, 2012                                                        │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • Leap second added at midnight UTC                                        │
│  • Linux kernel bug caused livelock in clock_gettime()                     │
│  • Systems consuming 100% CPU doing nothing                                 │
│  • Widespread outages: Reddit, LinkedIn, Mozilla, Gawker                   │
│                                                                              │
│  HOW GOOGLE HANDLED IT:                                                     │
│  • Google had already implemented "leap smear"                             │
│  • Instead of adding 1 second at midnight:                                 │
│    - Slow down time by 11.6 μs per second                                 │
│    - Spread over 24 hours before midnight                                  │
│  • All Google services remained stable                                      │
│                                                                              │
│  LESSON:                                                                    │
│  • Time is a critical distributed systems primitive                         │
│  • You must control how time changes propagate                              │
│  • Google's investment in TrueTime paid dividends                          │
│                                                                              │
│  RESULT:                                                                    │
│  • Leap smear became industry standard                                      │
│  • AWS, Azure, and others now use similar approaches                        │
│  • International discussions to eliminate leap seconds entirely            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
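The 11.6 μs figure is just the leap second spread over the smear window; a quick sketch of the arithmetic (the exact smear schedule Google used has varied over the years):

SMEAR_WINDOW_S = 24 * 60 * 60   # spread the extra second over 24 hours
LEAP_S = 1.0

rate = LEAP_S / SMEAR_WINDOW_S  # ~11.6 microseconds of smear per second
print(f"{rate * 1e6:.1f} us per second")

def smeared_offset(seconds_into_window: float) -> float:
    """Extra offset applied to served time at a point inside the smear window."""
    return min(seconds_into_window, SMEAR_WINDOW_S) * rate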

Amazon DynamoDB

The database that powers amazon.com checkout—when availability is everything.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DYNAMODB ARCHITECTURE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         REQUEST ROUTERS                              │   │
│  │              (Stateless, route requests to partitions)               │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │                                         │
│           ┌───────────────────────┼───────────────────────┐                 │
│           ▼                       ▼                       ▼                 │
│   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐        │
│   │  Partition 1  │       │  Partition 2  │       │  Partition N  │        │
│   │  ┌─────────┐  │       │  ┌─────────┐  │       │  ┌─────────┐  │        │
│   │  │ Leader  │  │       │  │ Leader  │  │       │  │ Leader  │  │        │
│   │  └────┬────┘  │       │  └────┬────┘  │       │  └────┬────┘  │        │
│   │       │       │       │       │       │       │       │       │        │
│   │  ┌────┴────┐  │       │  ┌────┴────┐  │       │  ┌────┴────┐  │        │
│   │  │Replicas │  │       │  │Replicas │  │       │  │Replicas │  │        │
│   │  │(2 more) │  │       │  │(2 more) │  │       │  │(2 more) │  │        │
│   │  └─────────┘  │       │  └─────────┘  │       │  └─────────┘  │        │
│   └───────────────┘       └───────────────┘       └───────────────┘        │
│                                                                              │
│  GLOBAL TABLES (Multi-Region):                                              │
│                                                                              │
│   Region: US-EAST-1           Region: EU-WEST-1                             │
│   ┌─────────────────┐         ┌─────────────────┐                           │
│   │     Table A     │ ◄─────► │     Table A     │                           │
│   │  (Replica)      │   Sync  │  (Replica)      │                           │
│   └─────────────────┘         └─────────────────┘                           │
│                                                                              │
│   Last-Writer-Wins conflict resolution                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Design Decisions

The Origin Story (from the 2007 Dynamo paper):
SCENARIO:
Customer adding items to cart during peak shopping
Database partition occurs

OPTION A: Reject writes, show error
→ Customer leaves, buys from competitor
→ Lost revenue: $$$

OPTION B: Accept writes, reconcile later
→ Worst case: Duplicate items in cart
→ Customer removes duplicates at checkout
→ Revenue preserved

AMAZON CHOSE: Option B (Availability over Consistency)
Implementation:
  • Leaderless replication
  • Write to W of N replicas (W < N means some can be down)
  • Read from R replicas, resolve conflicts
  • Shopping cart uses “union” merge (keep all items)
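A toy sketch of the quorum write and union merge just described, with in-memory dicts standing in for replicas (illustrative only, not DynamoDB’s actual implementation):

from typing import Dict, List, Set

N, W, R = 3, 2, 2
replicas: List[Dict[str, Set[str]]] = [dict() for _ in range(N)]

def write_cart(key: str, items: Set[str]) -> bool:
    acks = 0
    for rep in replicas:
        try:
            rep[key] = set(items)   # in reality this write can fail or be partitioned away
            acks += 1
        except Exception:
            pass
    return acks >= W                # success once W of N replicas acknowledge

def read_cart(key: str) -> Set[str]:
    versions = [rep.get(key, set()) for rep in replicas[:R]]
    return set().union(*versions)   # conflict resolution: keep all items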
Problem: Nodes joining or leaving cause massive data reshuffling
Solution: Virtual nodes
WITHOUT VIRTUAL NODES:
Hash ring: [──A──|──B──|──C──|──D──]
Node B fails: [──A──|────C────|──D──]
Node C now has 2x data (overloaded!)

WITH VIRTUAL NODES:
Each physical node = 150+ virtual nodes
Hash ring: [A1|B2|C1|A2|D1|B1|C2|D2|...]
Node B fails: B's virtual nodes spread across A, C, D
Load stays balanced
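A compact sketch of consistent hashing with virtual nodes (the hash function and vnode count here are illustrative):

import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy ring: each physical node owns many virtual points on the ring."""
    def __init__(self, nodes, vnodes_per_node=150):
        self.ring = sorted(
            (h(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.keys = [k for k, _ in self.ring]

    def owner(self, key: str) -> str:
        idx = bisect.bisect(self.keys, h(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
# Removing a node only reassigns the keys its virtual points owned,
# and those keys land spread across all remaining nodes.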
Original Problem:
  • Provisioned throughput (e.g., 1000 WCU)
  • Uniform distribution assumed
  • Hot partition = throttling
Solution: Adaptive capacity
BEFORE (2017):
Table: 1000 WCU
Partition 1: 250 WCU limit (1000/4)
Hot key in Partition 1: Throttled!

AFTER (Adaptive Capacity):
Table: 1000 WCU
Partition 1: Can use up to 1000 WCU if others idle
Hot key: No throttling as long as total < 1000

PLUS: On-demand capacity (2018)
No provisioning, pay per request

DynamoDB Failure Story: The 2015 US-EAST-1 Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2015 DYNAMODB OUTAGE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 20, 2015                                                   │
│  DURATION: ~5 hours                                                         │
│  IMPACT: Many high-profile sites down (Netflix, IMDB, Airbnb...)           │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Metadata service became overloaded                                       │
│  • Metadata = "where is partition X?"                                       │
│  • Clients retrying → exponential load increase                            │
│  • Vicious cycle of retries                                                │
│                                                                              │
│  THE CASCADE:                                                               │
│  ┌────────────┐     ┌────────────┐     ┌────────────┐                      │
│  │  Client    │────►│  Request   │────►│  Metadata  │                      │
│  │  Retry     │     │  Router    │     │  Service   │                      │
│  │  (10x)     │     │  (slow)    │     │  (dying)   │                      │
│  └────────────┘     └────────────┘     └────────────┘                      │
│       ▲                    │                 │                              │
│       │                    │                 │                              │
│       └────────── Timeout ─┴─── Overloaded ──┘                              │
│                                                                              │
│  LESSONS LEARNED:                                                           │
│  1. Metadata services are critical path                                     │
│  2. Retry storms can kill systems                                           │
│  3. Need circuit breakers and backoff                                       │
│  4. Client-side caching of metadata helps                                   │
│                                                                              │
│  CHANGES MADE:                                                              │
│  • More aggressive metadata caching                                         │
│  • Better retry backoff in SDKs                                             │
│  • Metadata service scaling improvements                                    │
│  • Cross-region tables for DR                                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
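One of those lessons in code: a generic retry helper with capped exponential backoff and full jitter, the kind of client-side behavior that keeps a retry storm from forming (a sketch, not the actual AWS SDK logic):

import random
import time

def call_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0):
    """Retry with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # out of retries, surface the error
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)                           # jitter spreads retries over time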

Netflix: The Chaos Engineering Pioneers

Netflix serves 230+ million subscribers with 99.99% availability. How?

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    NETFLIX CLOUD ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                           CDN (Open Connect)                          │   │
│  │          15,000+ servers in ISPs worldwide                           │   │
│  │          95% of traffic served from CDN                              │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │ (5% control plane traffic)              │
│                                   ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         AWS (Control Plane)                           │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                         ZUUL                                    │ │   │
│  │  │               (API Gateway / Edge Service)                      │ │   │
│  │  └───────────────────────────┬─────────────────────────────────────┘ │   │
│  │                              │                                       │   │
│  │  ┌───────────────────────────┼───────────────────────────────────┐   │   │
│  │  │           MICROSERVICES (1000+)                               │   │   │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐      │   │   │
│  │  │  │Ratings │ │Viewing │ │Profile │ │Catalog │ │Billing │ ...  │   │   │
│  │  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘      │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  │                                                                      │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    DATA LAYER                                   │ │   │
│  │  │   EVCache (Memcached)    Cassandra    Elasticsearch            │ │   │
│  │  │   30M+ req/sec           Petabytes    Search & Analytics       │ │   │
│  │  └─────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Patterns

Scale:
  • 30+ million requests/second
  • 1+ trillion operations/day
  • Petabytes of cached data
Architecture:
EVCache = Enhanced Memcached

KEY FEATURES:
1. Replication across zones (survive AZ failure)
2. Local zone preference (lower latency)
3. Fast fallback to other zones
4. Shadow clusters for testing

TOPOLOGY:
┌─────────────────────────────────────────────┐
│              EVCache Cluster                │
│                                             │
│  Zone A          Zone B          Zone C    │
│  ┌─────┐         ┌─────┐         ┌─────┐   │
│  │MC-1 │ ◄─────► │MC-1 │ ◄─────► │MC-1 │   │
│  │MC-2 │         │MC-2 │         │MC-2 │   │
│  │MC-3 │         │MC-3 │         │MC-3 │   │
│  └─────┘         └─────┘         └─────┘   │
│                                             │
│  Writes: All zones (synchronous)            │
│  Reads: Local zone first, fallback to others│
└─────────────────────────────────────────────┘
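A sketch of those read and write paths, with plain dicts standing in for the per-zone memcached clients:

def write_all_zones(key, value, zones):
    """Writes replicate to every zone's cache replica."""
    for client in zones.values():
        client[key] = value

def read_with_zone_fallback(key, local_zone, zones):
    """Read the local zone first for latency, then fall back to the other zones."""
    order = [local_zone] + [z for z in zones if z != local_zone]
    for zone in order:
        value = zones[zone].get(key)
        if value is not None:
            return value
    return None   # miss everywhere: caller goes to the backing store

zones = {"a": {}, "b": {}, "c": {}}
write_all_zones("profile:42", {"name": "Ada"}, zones)
print(read_with_zone_fallback("profile:42", "b", zones))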
Responsibilities:
  • Authentication
  • Dynamic routing
  • Load shedding
  • Request throttling
  • Attack detection
Scale: 1+ million RPS at the edge
Innovation: Zuul 2 (async/non-blocking)
  • Moved from thread-per-request to event loop
  • 90% reduction in connection memory
  • Better tail latency under load
Chaos Monkey (2011): Randomly kills instances in production
THE SIMIAN ARMY:

Chaos Monkey      → Kill instances
Latency Monkey    → Inject artificial delays
Chaos Kong        → Fail entire AWS region
Doctor Monkey     → Health checks and remediation
Janitor Monkey    → Clean up unused resources
Conformity Monkey → Enforce best practices
Philosophy:
"The best way to avoid failure is to fail constantly"

If you can't handle instance failures at 3pm on Tuesday,
you definitely can't handle them at 3am on Black Friday.

RESULT:
- Engineers build resilient systems by default
- Failures become routine, not emergencies
- Recovery is automated, not heroic
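In spirit, the tool itself is small; a hedged sketch where terminate is whatever hook your platform provides for killing an instance (hypothetical here):

import random

def chaos_monkey(instances, terminate, probability=0.05):
    """Randomly kill one instance per service group with some probability.

    instances maps service name -> list of instance IDs; terminate is a
    hypothetical callback for your platform's kill mechanism.
    """
    for service, ids in instances.items():
        if ids and random.random() < probability:
            victim = random.choice(ids)
            terminate(service, victim)   # the service is expected to survive this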

Netflix Failure Story: The 2012 Christmas Eve Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 CHRISTMAS EVE OUTAGE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: December 24, 2012                                                    │
│  DURATION: ~8 hours                                                         │
│  CAUSE: AWS ELB (Elastic Load Balancer) failure                            │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • AWS made changes to ELB configuration                                    │
│  • Bug triggered widespread ELB failures in US-EAST-1                      │
│  • Netflix (and many others) went down                                      │
│                                                                              │
│  THE IRONY:                                                                 │
│  • Netflix had Chaos Monkey                                                 │
│  • Netflix could survive instance failures                                  │
│  • Netflix could survive AZ failures                                        │
│  • But ELB was a single point of failure they didn't test!                 │
│                                                                              │
│  AFTERMATH:                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  1. Built their own load balancer: Zuul                             │    │
│  │     (Removed dependency on AWS ELB)                                 │    │
│  │                                                                     │    │
│  │  2. Created "Chaos Kong"                                            │    │
│  │     (Simulate entire region failures)                               │    │
│  │                                                                     │    │
│  │  3. Multi-region active-active                                      │    │
│  │     (Traffic can shift between regions in seconds)                  │    │
│  │                                                                     │    │
│  │  4. Invested in "Failure Injection Testing" (FIT)                   │    │
│  │     (Structured chaos experiments)                                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  LESSON:                                                                    │
│  "If you haven't tested a failure mode, you're not resilient to it"        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Uber: Real-Time at Scale

Uber processes millions of trips daily with sub-second dispatch decisions.

Architecture Evolution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    UBER ARCHITECTURE EVOLUTION                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  2010-2014: MONOLITH                                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                        Python Monolith                               │   │
│  │    (Everything in one codebase, one database)                        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  2014-2017: MICROSERVICES EXPLOSION                                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ... (1000+ services)        │
│  │Trip  │ │Map   │ │Price │ │Match │ │Pay   │                              │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘                              │
│                                                                              │
│  2017+: DOMAIN-ORIENTED SERVICES (DOMA)                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐       │    │
│  │  │   RIDES DOMAIN │  │  EATS DOMAIN   │  │ FREIGHT DOMAIN │       │    │
│  │  │  ┌───────────┐ │  │  ┌───────────┐ │  │  ┌───────────┐ │       │    │
│  │  │  │Trip       │ │  │  │Order      │ │  │  │Shipment   │ │       │    │
│  │  │  │Matching   │ │  │  │Restaurant │ │  │  │Carrier    │ │       │    │
│  │  │  │Pricing    │ │  │  │Delivery   │ │  │  │Route      │ │       │    │
│  │  │  └───────────┘ │  │  └───────────┘ │  │  └───────────┘ │       │    │
│  │  └────────────────┘  └────────────────┘  └────────────────┘       │    │
│  │                                                                     │    │
│  │  SHARED PLATFORM                                                    │    │
│  │  ┌─────────────────────────────────────────────────────────────┐   │    │
│  │  │  Maps  │  Payments  │  Identity  │  Messaging  │  Dispatch  │   │    │
│  │  └─────────────────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Systems

Problem: Matching riders to drivers needs consistent, fast routing
Solution: Ringpop (SWIM + consistent hashing)
HOW IT WORKS:

1. SWIM Protocol for cluster membership
   - Nodes gossip about each other's health
   - Detect failures in seconds
   - No single point of failure

2. Consistent Hashing for request routing
   - Hash(rider_location) → specific node
   - That node has all nearby drivers in memory
   - Sub-millisecond matching decisions

TOPOLOGY:

City: San Francisco

┌────────────────────────────────────────────────────┐
│                    HASH RING                        │
│                                                    │
│       Node A ─────────────────── Node B           │
│      (SOMA,      ← gossip →    (Financial,       │
│     Mission)                     Marina)          │
│         │                          │              │
│         └──────── Node C ──────────┘              │
│                 (Castro,                          │
│                  Sunset)                          │
│                                                    │
│    Request for rider at (lat, lng):               │
│    → Hash to Node A                               │
│    → Node A has all drivers in that area          │
│    → Match in <10ms                               │
└────────────────────────────────────────────────────┘
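A deliberately simplified sketch of the routing idea: bucket the rider's location into a cell and hash the cell to an owning node (Ringpop itself uses a consistent hash ring plus SWIM membership rather than a fixed modulo):

import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner_for_location(lat: float, lng: float) -> str:
    """Route a rider request to the node that owns their geo cell."""
    cell = f"{round(lat, 2)}:{round(lng, 2)}"         # nearby riders share a cell
    digest = int(hashlib.md5(cell.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]                 # that node holds nearby drivers in memory

print(owner_for_location(37.7749, -122.4194))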
Challenge:
  • Need horizontal scaling (MySQL doesn’t shard easily)
  • Need flexible schema (trip data evolves fast)
  • Need low latency (real-time dispatch)
Solution: Schemaless (MySQL + application-level sharding)
ARCHITECTURE:

┌─────────────────────────────────────────────────┐
│              Schemaless Layer                   │
│  (Application-level sharding + abstraction)     │
└────────────────────┬────────────────────────────┘

     ┌───────────────┼───────────────┐
     ▼               ▼               ▼
┌─────────┐    ┌─────────┐    ┌─────────┐
│ MySQL   │    │ MySQL   │    │ MySQL   │
│ Shard 1 │    │ Shard 2 │    │ Shard N │
└─────────┘    └─────────┘    └─────────┘

DATA MODEL:
┌─────────────────────────────────────────────────┐
│  Row Key (UUID)  │  Column  │  Body (JSON)     │
├──────────────────┼──────────┼──────────────────┤
│  trip-123-abc    │  base    │ {"rider": ...}   │
│  trip-123-abc    │  driver  │ {"driver": ...}  │
│  trip-123-abc    │  route   │ {"waypoints":..} │
└─────────────────────────────────────────────────┘

BENEFITS:
• Easy horizontal scaling (shard by row key)
• Schema evolution (just add columns)
• Point-in-time recovery (versioned cells)
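A toy version of that cell model: append-only, versioned cells keyed by (row key, column) with JSON bodies (the helpers are illustrative; real Schemaless shards these rows across MySQL):

import json
import time
from collections import defaultdict

cells = defaultdict(list)   # (row_key, column) -> list of versions, oldest first

def put_cell(row_key: str, column: str, body: dict) -> None:
    cells[(row_key, column)].append({"body": json.dumps(body), "written_at": time.time()})

def get_cell(row_key: str, column: str, version: int = -1) -> dict:
    history = cells[(row_key, column)]
    return json.loads(history[version]["body"]) if history else {}

put_cell("trip-123-abc", "base", {"rider": "r-42"})
put_cell("trip-123-abc", "driver", {"driver": "d-7"})
print(get_cell("trip-123-abc", "base"))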
Cadence: Uber’s solution for long-running, fault-tolerant workflows.
EXAMPLE: Trip Lifecycle

@workflow
def trip_workflow(rider_id, pickup, dropoff):
    # Each step is fault-tolerant and can be retried
    
    # Step 1: Find driver (may take minutes)
    driver = yield find_driver(pickup)
    
    # Step 2: Wait for pickup
    yield wait_for_event("driver_arrived", timeout="30m")
    
    # Step 3: Wait for trip completion
    yield wait_for_event("trip_completed", timeout="8h")
    
    # Step 4: Process payment
    yield process_payment(rider_id, driver.id, trip.amount)
    
    # Step 5: Request ratings
    yield send_rating_request(rider_id, driver.id)

GUARANTEES:
• Each step runs exactly once
• Workflow survives server failures
• State persisted at each step
• Full visibility into running workflows
Open Source: Cadence; its fork Temporal is now widely adopted

Uber Failure Story: The 2019 Mapping Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2019 MAPPING OUTAGE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 2019                                                       │
│  IMPACT: App crashes, dispatch failures, surge pricing errors              │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Map tile service returned malformed data                                 │
│  • Client apps crashed when parsing                                         │
│  • Retry storms overloaded map service                                      │
│  • Cascade to other services using maps                                     │
│                                                                              │
│  THE CASCADE:                                                               │
│                                                                              │
│  Malformed     →    Client    →    Retry    →    Service    →   Dispatch   │
│  Response           Crash          Storm        Overload        Failure    │
│                                                                              │
│  LESSONS:                                                                   │
│  ─────────                                                                  │
│  1. Client-side validation is critical                                      │
│     - Don't trust backend responses blindly                                │
│     - Graceful degradation when data is bad                                │
│                                                                              │
│  2. Circuit breakers needed everywhere                                      │
│     - Client-side circuit breakers                                          │
│     - Service-side admission control                                        │
│                                                                              │
│  3. Retry behavior must be controlled                                       │
│     - Exponential backoff                                                   │
│     - Jitter to spread load                                                 │
│     - Client-side retry budgets                                             │
│                                                                              │
│  4. Map data is critical path                                               │
│     - Cache aggressively                                                    │
│     - Multiple fallback sources                                             │
│     - Stale data better than no data                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Stripe: Financial Transactions at Scale

When money is involved, correctness is everything.

Architecture Principles

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRIPE ENGINEERING PRINCIPLES                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. EXACTLY-ONCE DELIVERY                                                   │
│  ────────────────────────                                                   │
│  Payment must happen exactly once, never more, never less                   │
│                                                                              │
│  Implementation:                                                            │
│  • Idempotency keys on all mutating operations                             │
│  • Request deduplication in API layer                                       │
│  • Transactional outbox pattern for events                                  │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  2. STRONG CONSISTENCY FOR MONEY                                            │
│  ───────────────────────────────                                            │
│  Unlike social media, eventual consistency is not acceptable                │
│                                                                              │
│  Approach:                                                                  │
│  • Single-leader PostgreSQL for financial data                              │
│  • Synchronous replication to standby                                       │
│  • Serializable isolation for critical paths                                │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  3. IDEMPOTENCY BY DEFAULT                                                  │
│  ──────────────────────────                                                 │
│  Every API accepts Idempotency-Key header                                   │
│                                                                              │
│  curl https://api.stripe.com/v1/charges \                                  │
│    -H "Idempotency-Key: order-12345" \                                     │
│    -d amount=2000 \                                                        │
│    -d currency=usd                                                         │
│                                                                              │
│  Same key = same result (even if called 100 times)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Patterns

How it works:
# Stripe's approach (simplified)

async def create_charge(idempotency_key: str, amount: int, ...):
    # 1. Check if we've seen this key before
    existing = await idempotency_store.get(idempotency_key)
    
    if existing:
        if existing.status == "completed":
            # Return cached response
            return existing.response
        elif existing.status == "in_progress":
            # Someone else is processing
            raise ConflictError("Request in progress")
    
    # 2. Mark as in-progress
    await idempotency_store.set(
        idempotency_key, 
        {"status": "in_progress", "started_at": now()}
    )
    
    try:
        # 3. Do the actual work
        result = await actually_create_charge(amount, ...)
        
        # 4. Store the result
        await idempotency_store.set(
            idempotency_key,
            {"status": "completed", "response": result}
        )
        
        return result
        
    except Exception as e:
        # 5. On failure, allow retry
        await idempotency_store.delete(idempotency_key)
        raise
TTL: Idempotency keys typically expire after 24 hours
Problem: Need to update database AND send event atomically
WRONG APPROACH:

1. Update database (charge.status = "succeeded")
2. Send Kafka event

What if step 2 fails? Database updated, event lost!

TRANSACTIONAL OUTBOX:

1. In a single transaction:
   - Update charge.status = "succeeded"
   - Insert into outbox table: {event: "charge.succeeded", ...}

2. Separate process reads outbox, sends to Kafka

3. After Kafka confirms, delete from outbox

┌─────────────────────────────────────────────────────┐
│                    DATABASE                         │
│  ┌───────────────┐    ┌───────────────────────────┐ │
│  │   charges     │    │        outbox             │ │
│  ├───────────────┤    ├───────────────────────────┤ │
│  │ id: ch_123    │    │ id: 1                     │ │
│  │ status: succ  │    │ event: charge.succeeded   │ │
│  │ ...           │    │ payload: {...}            │ │
│  └───────────────┘    └───────────────────────────┘ │
│                               │                     │
│                               │ Background process  │
│                               ▼                     │
│                        ┌─────────────┐             │
│                        │    Kafka    │             │
│                        └─────────────┘             │
└─────────────────────────────────────────────────────┘
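A minimal sketch of the outbox flow, with SQLite standing in for the database and a print call standing in for the broker:

import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE charges (id TEXT PRIMARY KEY, status TEXT);
  CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, payload TEXT);
""")

def mark_charge_succeeded(charge_id: str) -> None:
    with db:   # one transaction: both rows commit together or not at all
        db.execute("INSERT OR REPLACE INTO charges VALUES (?, ?)", (charge_id, "succeeded"))
        db.execute("INSERT INTO outbox (event, payload) VALUES (?, ?)",
                   ("charge.succeeded", json.dumps({"charge_id": charge_id})))

def relay_outbox(publish) -> None:
    rows = db.execute("SELECT id, event, payload FROM outbox").fetchall()
    for row_id, event, payload in rows:
        publish(event, payload)                       # e.g. produce to Kafka
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))

mark_charge_succeeded("ch_123")
relay_outbox(lambda event, payload: print("published", event, payload))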
Problem: Tail latency (p99) is often much worse than the median
Solution: Send to multiple replicas, use the first response
HEDGING STRATEGY:

Time 0ms:    Send to replica A
Time 5ms:    If no response, also send to replica B
Time 10ms:   If no response, also send to replica C

Use FIRST response, cancel others

BENEFITS:
• p99 latency dramatically reduced
• One slow replica doesn't hurt overall latency

CAUTION:
• Only for idempotent reads
• Increases load on backend (but usually worth it)
• Need cancellation to avoid wasted work
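A sketch of that hedging loop with asyncio, where each replica is an async callable (the names and the 5 ms hedge delay are illustrative):

import asyncio

async def hedged_get(replicas, key, hedge_delay=0.005):
    """Fan out to replicas with staggered delays and return the first response."""
    async def call(replica, delay):
        await asyncio.sleep(delay)    # hedges only fire if earlier attempts are still pending
        return await replica(key)

    tasks = [asyncio.create_task(call(r, i * hedge_delay)) for i, r in enumerate(replicas)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                 # cancel the slower attempts to avoid wasted work
    return done.pop().result()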

Common Patterns Across Companies

Idempotency Everywhere

All companies: Every mutating operation accepts an idempotency key.
  • Stripe: Idempotency-Key header
  • AWS: ClientRequestToken
  • Google: requestId

Chaos Testing

You don’t know if you’re resilient until you test.
  • Netflix: Chaos Monkey, Chaos Kong
  • Amazon: GameDay exercises
  • Google: DiRT (Disaster Recovery Testing)

Circuit Breakers

Fail fast instead of cascading.
  • Netflix: Hystrix (now in maintenance mode; Resilience4j recommended instead)
  • Uber: Custom circuit breakers in every service
All use some variant of the pattern.
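The core of the pattern fits in a few lines; a minimal sketch (production breakers usually track rolling error rates and half-open probes rather than a simple consecutive-failure count):

import time

class CircuitBreaker:
    """Open after N consecutive failures, fail fast, then retry after a cooldown."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cooldown elapsed: let one request probe the service
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0              # any success closes the circuit again
        return result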

Observability

You can’t fix what you can’t see.
  • Distributed tracing: Zipkin/Jaeger-style
  • Metrics: RED method (Rate, Errors, Duration)
  • Logs: Structured, correlated by trace ID
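A sketch of the logging piece: one structured line per event, correlated across services by a shared trace_id (field names are illustrative):

import json
import time
import uuid

def log_event(trace_id: str, service: str, message: str, **fields) -> None:
    """Emit one structured log line; the shared trace_id ties services together."""
    print(json.dumps({"ts": time.time(), "trace_id": trace_id,
                      "service": service, "msg": message, **fields}))

trace_id = uuid.uuid4().hex
log_event(trace_id, "checkout", "charge created", amount=2000, duration_ms=42)
log_event(trace_id, "payments", "charge captured", duration_ms=17)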

Key Takeaways

  1. Availability often trumps consistency — Amazon’s shopping cart chose availability. Know when this trade-off is acceptable.
  2. Test failure modes in production — Netflix’s Chaos Engineering isn’t optional, it’s essential.
  3. Build for horizontal scale from day one — Re-architecting a monolith is painful. Design for distribution early.
  4. Invest in your primitives — Google built TrueTime. Uber built Ringpop. Your foundations matter.
  5. Every outage is a learning opportunity — The companies with the best uptime have the best postmortems.