Real-World Case Studies

Theory is essential, but seeing how distributed systems work (and fail) in production at massive scale is where deep understanding comes from. This module covers battle-tested architectures from the world’s leading technology companies.

Track Duration: 12-16 hours
Companies Covered: Google, Amazon, Netflix, Uber, Meta, Stripe
Focus: Architecture decisions, failure stories, lessons learned

Google Spanner

The database that made the impossible possible—globally consistent transactions.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GOOGLE SPANNER ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                          ┌─────────────────────┐                            │
│                          │     Spanner API     │                            │
│                          └──────────┬──────────┘                            │
│                                     │                                        │
│        ┌────────────────────────────┼────────────────────────────┐          │
│        │                            │                            │          │
│        ▼                            ▼                            ▼          │
│   ┌─────────────┐            ┌─────────────┐            ┌─────────────┐    │
│   │   Zone A    │            │   Zone B    │            │   Zone C    │    │
│   │  (Oregon)   │            │  (Iowa)     │            │  (Virginia) │    │
│   └──────┬──────┘            └──────┬──────┘            └──────┬──────┘    │
│          │                          │                          │            │
│   ┌──────┴──────┐            ┌──────┴──────┐            ┌──────┴──────┐    │
│   │ Spanserver  │            │ Spanserver  │            │ Spanserver  │    │
│   │  (Paxos     │◄──────────►│  (Paxos     │◄──────────►│  (Paxos     │    │
│   │   Leader)   │            │   Replica)  │            │   Replica)  │    │
│   └──────┬──────┘            └─────────────┘            └─────────────┘    │
│          │                                                                   │
│          ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           COLOSSUS                                  │   │
│   │                    (Distributed File System)                        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           TRUETIME                                  │   │
│   │              GPS + Atomic Clocks in every datacenter                │   │
│   │           API: TT.now() → [earliest, latest]                       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Innovations

TrueTime: Making Time Trustworthy

The Insight: You can’t synchronize clocks perfectly, but you can bound the uncertainty.

TRADITIONAL APPROACH:
"The time is 10:00:00.000"
(But actually, who knows? Could be off by milliseconds)

TRUETIME APPROACH:
"The time is between 10:00:00.000 and 10:00:00.007"
(Guaranteed! We measured the uncertainty)

HARDWARE:
• GPS receivers on datacenter roofs
• Atomic clocks (rubidium) as backup
• Multiple time masters per datacenter
• Typical uncertainty: 1-7ms

Usage in Transactions:

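Transaction T1 picks commit timestamp s = TT.now().latest
T1 waits until TT.after(s) is true ("commit wait") before making its writes visible
Now GUARANTEED: any transaction that starts after T1 commits gets a timestamp > s
External consistency comes from bounded clock uncertainty, not from extra cross-datacenter coordination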
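
A minimal Python sketch of the commit-wait rule above. The `TrueTime` class here is a toy stand-in that wraps the local monotonic clock with a fixed, assumed uncertainty `epsilon_s`; the real service derives its bound from GPS and atomic clock masters. `commit` and `make_visible` are hypothetical names, not Spanner APIs.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # seconds
    latest: float


class TrueTime:
    """Toy stand-in: local clock plus a fixed uncertainty bound (an assumption)."""

    def __init__(self, epsilon_s=0.004, clock=time.monotonic):
        self.epsilon_s = epsilon_s
        self._clock = clock

    def now(self):
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp):
        # True only once `timestamp` has definitely passed, even in the worst case.
        return self.now().earliest > timestamp


def commit(tt, make_visible):
    """Pick a commit timestamp, wait out the uncertainty, then expose the write."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):  # commit wait, roughly 2 * epsilon
        time.sleep(tt.epsilon_s / 4)
    make_visible(commit_ts)
    return commit_ts
```

Because the second commit can only pick its timestamp after the first has finished waiting, two back-to-back calls to `commit` always produce increasing timestamps, which is the ordering property the section describes.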

Paxos Groups for Replication

Per-Tablet Paxos:

Each tablet (partition) has its own Paxos group
3-5 replicas across zones
Writes go through Paxos leader
Reads can go to any replica (with proper timestamp)

Split and Merge:

Tablets automatically split when too large
Tablets merge when too small
Paxos ensures consistent split/merge

External Consistency

The Guarantee: If transaction T1 commits before T2 starts, T1’s timestamp < T2’s timestamp.

Why It Matters:

SCENARIO:
User in US writes a record
User in Europe reads immediately after

WITHOUT EXTERNAL CONSISTENCY:
European read might see old data (before US write)

WITH EXTERNAL CONSISTENCY:
European read guaranteed to see US write
(because read timestamp > write timestamp)

Cost: Commit wait adds write latency proportional to the clock uncertainty (on the order of milliseconds)

Spanner Failure Story: The Leap Second

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 LEAP SECOND INCIDENT                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: June 30, 2012                                                        │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • Leap second added at midnight UTC                                        │
│  • Linux kernel leap-second bug made hrtimer-based timeouts fire early      │
│  • Systems consuming 100% CPU doing nothing                                 │
│  • Widespread outages: Reddit, LinkedIn, Mozilla, Gawker                   │
│                                                                              │
│  HOW GOOGLE HANDLED IT:                                                     │
│  • Google had already implemented "leap smear"                             │
│  • Instead of adding 1 second at midnight:                                 │
│    - Slow down time by 11.6 μs per second                                 │
│    - Spread over 24 hours before midnight                                  │
│  • All Google services remained stable                                      │
│                                                                              │
│  LESSON:                                                                    │
│  • Time is a critical distributed systems primitive                         │
│  • You must control how time changes propagate                              │
│  • Google's investment in TrueTime paid dividends                          │
│                                                                              │
│  RESULT:                                                                    │
│  • Leap smear became industry standard                                      │
│  • AWS, Azure, and others now use similar approaches                        │
│  • International discussions to eliminate leap seconds entirely            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
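
The smear rate quoted in the box is simple arithmetic: one second spread evenly over 86,400 seconds. The sketch below assumes a linear smear over a 24-hour window, as the figures above imply; the function names are illustrative only.

```python
SMEAR_WINDOW_S = 24 * 60 * 60  # 86,400 seconds


def smear_rate_us_per_s(leap_s=1.0, window_s=SMEAR_WINDOW_S):
    """Microseconds of adjustment applied per second of real time."""
    return leap_s / window_s * 1e6


def smeared_offset(elapsed_s, leap_s=1.0, window_s=SMEAR_WINDOW_S):
    """Portion of the leap second absorbed `elapsed_s` seconds into the window."""
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return leap_s * fraction


if __name__ == "__main__":
    print(f"{smear_rate_us_per_s():.1f} microseconds per second")
    print(f"{smeared_offset(12 * 3600):.3f} s absorbed at the halfway point")
```

No clock ever sees a repeated or skipped second under this scheme; every second is just very slightly longer than usual.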

Amazon DynamoDB

The database that powers amazon.com checkout—when availability is everything.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DYNAMODB ARCHITECTURE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         REQUEST ROUTERS                              │   │
│  │              (Stateless, route requests to partitions)               │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │                                         │
│           ┌───────────────────────┼───────────────────────┐                 │
│           ▼                       ▼                       ▼                 │
│   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐        │
│   │  Partition 1  │       │  Partition 2  │       │  Partition N  │        │
│   │  ┌─────────┐  │       │  ┌─────────┐  │       │  ┌─────────┐  │        │
│   │  │ Leader  │  │       │  │ Leader  │  │       │  │ Leader  │  │        │
│   │  └────┬────┘  │       │  └────┬────┘  │       │  └────┬────┘  │        │
│   │       │       │       │       │       │       │       │       │        │
│   │  ┌────┴────┐  │       │  ┌────┴────┐  │       │  ┌────┴────┐  │        │
│   │  │Replicas │  │       │  │Replicas │  │       │  │Replicas │  │        │
│   │  │(2 more) │  │       │  │(2 more) │  │       │  │(2 more) │  │        │
│   │  └─────────┘  │       │  └─────────┘  │       │  └─────────┘  │        │
│   └───────────────┘       └───────────────┘       └───────────────┘        │
│                                                                              │
│  GLOBAL TABLES (Multi-Region):                                              │
│                                                                              │
│   Region: US-EAST-1           Region: EU-WEST-1                             │
│   ┌─────────────────┐         ┌─────────────────┐                           │
│   │     Table A     │ ◄─────► │     Table A     │                           │
│   │  (Replica)      │   Sync  │  (Replica)      │                           │
│   └─────────────────┘         └─────────────────┘                           │
│                                                                              │
│   Last-Writer-Wins conflict resolution                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Design Decisions

Always Available: The Shopping Cart Story

The Origin Story (from the 2007 Dynamo paper):

SCENARIO:
Customer adding items to cart during peak shopping
Database partition occurs

OPTION A: Reject writes, show error
→ Customer leaves, buys from competitor
→ Lost revenue: $$$

OPTION B: Accept writes, reconcile later
→ Worst case: Duplicate items in cart
→ Customer removes duplicates at checkout
→ Revenue preserved

AMAZON CHOSE: Option B (Availability over Consistency)

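Implementation (original Dynamo design; today's DynamoDB instead uses a leader per partition, as in the diagram above):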

Leaderless replication
Write to W of N replicas (W < N means some can be down)
Read from R replicas, resolve conflicts
Shopping cart uses “union” merge (keep all items)
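
A small sketch of the two ideas in this list, assuming the usual quorum rule that a read is guaranteed to overlap the latest write when R + W > N. The cart is modeled as a set of item names, which is a simplification; the function names are hypothetical.

```python
def quorums_overlap(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def merge_carts(*versions):
    """Union merge: keep every item that appears in any conflicting version."""
    merged = set()
    for version in versions:
        merged |= version
    return merged


if __name__ == "__main__":
    # Two replicas accepted different writes during a partition.
    replica_a = {"book", "lamp"}
    replica_b = {"book", "pen"}
    print(sorted(merge_carts(replica_a, replica_b)))
    print(quorums_overlap(n=3, w=2, r=2))
```

The trade-off of a union merge is that an item removed on one replica can reappear after reconciliation, which is the "customer cleans up at checkout" cost accepted in Option B.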

Consistent Hashing with Virtual Nodes

Problem: With one ring position per node, a node joining or leaving shifts its whole range onto a single neighbor

Solution: Virtual nodes

WITHOUT VIRTUAL NODES:
Hash ring: [──A──|──B──|──C──|──D──]
Node B fails: [──A──|────C────|──D──]
Node C now has 2x data (overloaded!)

WITH VIRTUAL NODES:
Each physical node = 150+ virtual nodes
Hash ring: [A1|B2|C1|A2|D1|B1|C2|D2|...]
Node B fails: B's virtual nodes spread across A, C, D
Load stays balanced
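
The following is a minimal hash ring with virtual nodes, written as a sketch rather than a description of any production implementation. The MD5-based hash, the `node#i` naming of virtual nodes, and the class name are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, physical node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["A", "B", "C", "D"])
    keys = [f"key-{i}" for i in range(2000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove("B")
    moved = [k for k in keys if ring.lookup(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved, all previously on B")
```

Only keys that lived on the removed node change owner, and because that node's virtual positions are scattered around the ring, its keys land on several surviving nodes instead of one.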

Adaptive Capacity

Original Problem:

Provisioned throughput (e.g., 1000 WCU)
Uniform distribution assumed
Hot partition = throttling

Solution: Adaptive capacity

BEFORE (pre-2017):
Table: 1000 WCU
Partition 1: 250 WCU limit (1000/4)
Hot key in Partition 1: Throttled!

AFTER (Adaptive Capacity):
Table: 1000 WCU
Partition 1: Can use up to 1000 WCU if others idle
Hot key: No throttling as long as total < 1000

PLUS: On-demand capacity (2018)
No provisioning, pay per request
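
The before/after arithmetic can be modeled in a few lines. This is a deliberately simplified toy: it assumes capacity is split evenly across partitions and that a hot partition may borrow whatever its siblings leave unused. The real service has additional per-partition limits that are not modeled, and `throttled_wcu` is a hypothetical name.

```python
def throttled_wcu(demand, table_wcu, adaptive):
    """Return, per partition, how many requested WCU would be throttled."""
    share = table_wcu / len(demand)  # even split across partitions
    over = [max(0.0, d - share) for d in demand]
    if not adaptive:
        return over
    # Adaptive: unused capacity from quiet partitions is lent to hot ones.
    spare = sum(max(0.0, share - d) for d in demand)
    result = []
    for excess in over:
        borrowed = min(excess, spare)
        spare -= borrowed
        result.append(excess - borrowed)
    return result


if __name__ == "__main__":
    demand = [700, 100, 100, 50]  # partition 1 holds a hot key
    print(throttled_wcu(demand, table_wcu=1000, adaptive=False))
    print(throttled_wcu(demand, table_wcu=1000, adaptive=True))
```

With a 250 WCU share per partition, the hot partition is throttled in the first call but not in the second, because total demand stays under the table's 1000 WCU.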

DynamoDB Failure Story: The 2015 US-EAST-1 Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2015 DYNAMODB OUTAGE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 20, 2015                                                   │
│  DURATION: ~5 hours                                                         │
│  IMPACT: Many high-profile sites down (Netflix, IMDB, Airbnb...)           │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Metadata service became overloaded                                       │
│  • Metadata = "where is partition X?"                                       │
│  • Clients retrying → exponential load increase                            │
│  • Vicious cycle of retries                                                │
│                                                                              │
│  THE CASCADE:                                                               │
│  ┌────────────┐     ┌────────────┐     ┌────────────┐                      │
│  │  Client    │────►│  Request   │────►│  Metadata  │                      │
│  │  Retry     │     │  Router    │     │  Service   │                      │
│  │  (10x)     │     │  (slow)    │     │  (dying)   │                      │
│  └────────────┘     └────────────┘     └────────────┘                      │
│       ▲                    │                 │                              │
│       │                    │                 │                              │
│       └────────── Timeout ─┴─── Overloaded ──┘                              │
│                                                                              │
│  LESSONS LEARNED:                                                           │
│  1. Metadata services are critical path                                     │
│  2. Retry storms can kill systems                                           │
│  3. Need circuit breakers and backoff                                       │
│  4. Client-side caching of metadata helps                                   │
│                                                                              │
│  CHANGES MADE:                                                              │
│  • More aggressive metadata caching                                         │
│  • Better retry backoff in SDKs                                             │
│  • Metadata service scaling improvements                                    │
│  • Cross-region tables for DR                                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
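
Lesson 3 can be sketched as capped exponential backoff with jitter. This is a generic illustration, not the behavior of any particular AWS SDK; `call_with_retries` and its parameters are hypothetical, and the `sleep` and `rng` arguments exist only so the behavior can be tested without waiting.

```python
import random
import time


def call_with_retries(op, max_attempts=5, base_s=0.1, cap_s=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call `op`, retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)  # random delay in [0, ceiling) spreads retries out


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_lookup():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TimeoutError("metadata lookup timed out")
        return "partition-7"

    print(call_with_retries(flaky_lookup, base_s=0.01))
```

The jitter matters as much as the backoff: without it, clients that failed together retry together, and the overloaded service sees synchronized waves of traffic.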

Netflix: The Chaos Engineering Pioneers

Netflix serves 230+ million subscribers with 99.99% availability. How?

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    NETFLIX CLOUD ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                           CDN (Open Connect)                          │   │
│  │          15,000+ servers in ISPs worldwide                           │   │
│  │          95% of traffic served from CDN                              │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │ (5% control plane traffic)              │
│                                   ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         AWS (Control Plane)                           │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                         ZUUL                                    │ │   │
│  │  │               (API Gateway / Edge Service)                      │ │   │
│  │  └───────────────────────────┬─────────────────────────────────────┘ │   │
│  │                              │                                       │   │
│  │  ┌───────────────────────────┼───────────────────────────────────┐   │   │
│  │  │           MICROSERVICES (1000+)                               │   │   │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐      │   │   │
│  │  │  │Ratings │ │Viewing │ │Profile │ │Catalog │ │Billing │ ...  │   │   │
│  │  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘      │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  │                                                                      │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    DATA LAYER                                   │ │   │
│  │  │   EVCache (Memcached)    Cassandra    Elasticsearch            │ │   │
│  │  │   30M+ req/sec           Petabytes    Search & Analytics       │ │   │
│  │  └─────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Patterns

EVCache: Caching at Scale

Scale:

30+ million requests/second
1+ trillion operations/day
Petabytes of cached data

Architecture:

EVCache = Enhanced Memcached

KEY FEATURES:
1. Replication across zones (survive AZ failure)
2. Local zone preference (lower latency)
3. Fast fallback to other zones
4. Shadow clusters for testing

TOPOLOGY:
┌─────────────────────────────────────────────┐
│              EVCache Cluster                │
│                                             │
│  Zone A          Zone B          Zone C    │
│  ┌─────┐         ┌─────┐         ┌─────┐   │
│  │MC-1 │ ◄─────► │MC-1 │ ◄─────► │MC-1 │   │
│  │MC-2 │         │MC-2 │         │MC-2 │   │
│  │MC-3 │         │MC-3 │         │MC-3 │   │
│  └─────┘         └─────┘         └─────┘   │
│                                             │
│  Writes: All zones (synchronous)            │
│  Reads: Local zone first, fallback to others│
└─────────────────────────────────────────────┘
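
The read and write paths in the topology box can be sketched with plain dictionaries standing in for the Memcached nodes in each zone. `ZoneAwareCache` is a hypothetical class for illustration and does not reflect the EVCache client API.

```python
class ZoneAwareCache:
    def __init__(self, zones, local_zone):
        self.zones = {zone: {} for zone in zones}  # one in-memory store per zone
        self.local_zone = local_zone

    def set(self, key, value):
        """Write to every zone so any single zone can be lost."""
        for store in self.zones.values():
            store[key] = value

    def get(self, key):
        """Read the local zone first, then fall back to the others in order."""
        order = [self.local_zone] + [z for z in self.zones if z != self.local_zone]
        for zone in order:
            if key in self.zones[zone]:
                return self.zones[zone][key], zone
        return None, None


if __name__ == "__main__":
    cache = ZoneAwareCache(["zone-a", "zone-b", "zone-c"], local_zone="zone-a")
    cache.set("profile:42", "alice")
    cache.zones["zone-a"].clear()  # simulate losing the local zone
    print(cache.get("profile:42"))
```

Returning the zone that served the read makes it easy to measure how often the fallback path is used, which is a useful signal for zone health.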

Zuul: Edge Gateway

Responsibilities:

Authentication
Dynamic routing
Load shedding
Request throttling
Attack detection

Scale: 1+ million RPS at the edge

Innovation: Zuul 2 (async/non-blocking)

Moved from thread-per-request to event loop
90% reduction in connection memory
Better tail latency under load
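
Request throttling at an edge gateway is commonly built on a token bucket. The sketch below is a generic version of that technique and is not taken from Zuul's source; the class name and parameters are assumptions. The caller passes the current time explicitly so the logic stays deterministic.

```python
class TokenBucket:
    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s       # tokens added per second
        self.capacity = burst        # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed this request


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=2, burst=2)
    print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.5)])
```

Rejecting excess requests at the edge is cheap compared with letting them queue behind slow backends, which is the point of shedding load early.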

Chaos Engineering: The Simian Army

Chaos Monkey (2011): Randomly kills instances in production

THE SIMIAN ARMY:

Chaos Monkey      → Kill instances
Latency Monkey    → Inject artificial delays
Chaos Kong        → Fail entire AWS region
Doctor Monkey     → Health checks and remediation
Janitor Monkey    → Clean up unused resources
Conformity Monkey → Enforce best practices

Philosophy:

"The best way to avoid failure is to fail constantly"

If you can't handle instance failures at 3pm on Tuesday,
you definitely can't handle them at 3am on Black Friday.

RESULT:
- Engineers build resilient systems by default
- Failures become routine, not emergencies
- Recovery is automated, not heroic

Netflix Failure Story: The 2012 Christmas Eve Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 CHRISTMAS EVE OUTAGE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: December 24, 2012                                                    │
│  DURATION: ~8 hours                                                         │
│  CAUSE: AWS ELB (Elastic Load Balancer) failure                            │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • AWS made changes to ELB configuration                                    │
│  • Bug triggered widespread ELB failures in US-EAST-1                      │
│  • Netflix (and many others) went down                                      │
│                                                                              │
│  THE IRONY:                                                                 │
│  • Netflix had Chaos Monkey                                                 │
│  • Netflix could survive instance failures                                  │
│  • Netflix could survive AZ failures                                        │
│  • But ELB was a single point of failure they didn't test!                 │
│                                                                              │
│  AFTERMATH:                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  1. Built their own load balancer: Zuul                             │    │
│  │     (Removed dependency on AWS ELB)                                 │    │
│  │                                                                     │    │
│  │  2. Created "Chaos Kong"                                            │    │
│  │     (Simulate entire region failures)                               │    │
│  │                                                                     │    │
│  │  3. Multi-region active-active                                      │    │
│  │     (Traffic can shift between regions in seconds)                  │    │
│  │                                                                     │    │
│  │  4. Invested in "Failure Injection Testing" (FIT)                   │    │
│  │     (Structured chaos experiments)                                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  LESSON:                                                                    │
│  "If you haven't tested a failure mode, you're not resilient to it"        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Uber: Real-Time at Scale

Uber processes millions of trips daily with sub-second dispatch decisions.

Architecture Evolution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    UBER ARCHITECTURE EVOLUTION                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  2010-2014: MONOLITH                                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                        Python Monolith                               │   │
│  │    (Everything in one codebase, one database)                        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  2014-2017: MICROSERVICES EXPLOSION                                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ... (1000+ services)        │
│  │Trip  │ │Map   │ │Price │ │Match │ │Pay   │                              │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘                              │
│                                                                              │
│  2017+: DOMAIN-ORIENTED SERVICES (DOMA)                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐       │    │
│  │  │   RIDES DOMAIN │  │  EATS DOMAIN   │  │ FREIGHT DOMAIN │       │    │
│  │  │  ┌───────────┐ │  │  ┌───────────┐ │  │  ┌───────────┐ │       │    │
│  │  │  │Trip       │ │  │  │Order      │ │  │  │Shipment   │ │       │    │
│  │  │  │Matching   │ │  │  │Restaurant │ │  │  │Carrier    │ │       │    │
│  │  │  │Pricing    │ │  │  │Delivery   │ │  │  │Route      │ │       │    │
│  │  │  └───────────┘ │  │  └───────────┘ │  │  └───────────┘ │       │    │
│  │  └────────────────┘  └────────────────┘  └────────────────┘       │    │
│  │                                                                     │    │
│  │  SHARED PLATFORM                                                    │    │
│  │  ┌─────────────────────────────────────────────────────────────┐   │    │
│  │  │  Maps  │  Payments  │  Identity  │  Messaging  │  Dispatch  │   │    │
│  │  └─────────────────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Systems

Ringpop: Consistent Hashing for Dispatch

Problem: Matching riders to drivers needs consistent, fast routing

Solution: Ringpop (SWIM + consistent hashing)

HOW IT WORKS:

1. SWIM Protocol for cluster membership
   - Nodes gossip about each other's health
   - Detect failures in seconds
   - No single point of failure

2. Consistent Hashing for request routing
   - Hash(rider_location) → specific node
   - That node has all nearby drivers in memory
   - Matching decisions within milliseconds

TOPOLOGY:

City: San Francisco

┌────────────────────────────────────────────────────┐
│                    HASH RING                        │
│                                                    │
│       Node A ─────────────────── Node B           │
│      (SOMA,      ← gossip →    (Financial,       │
│     Mission)                     Marina)          │
│         │                          │              │
│         └──────── Node C ──────────┘              │
│                 (Castro,                          │
│                  Sunset)                          │
│                                                    │
│    Request for rider at (lat, lng):               │
│    → Hash to Node A                               │
│    → Node A has all drivers in that area          │
│    → Match in <10ms                               │
└────────────────────────────────────────────────────┘
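
The routing step can be sketched as: quantize the rider's coordinates into a grid cell, then hash the cell onto a ring of cluster members. The fixed-size degree grid, the hash function, and the function names are assumptions made for this example; they are not Ringpop's API or Uber's actual geo-indexing scheme.

```python
import bisect
import hashlib
import math


def _hash(key):
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def cell_id(lat, lng, cell_deg=0.01):
    """Quantize a coordinate into a grid cell so nearby points share a key."""
    return f"{math.floor(lat / cell_deg)}:{math.floor(lng / cell_deg)}"


def owner(cell, members, points_per_member=100):
    """Pick the member owning `cell` on a consistent-hash ring."""
    ring = sorted(
        (_hash(f"{member}#{i}"), member)
        for member in members
        for i in range(points_per_member)
    )
    idx = bisect.bisect(ring, (_hash(cell), ""))
    return ring[idx % len(ring)][1]


if __name__ == "__main__":
    members = ["node-a", "node-b", "node-c"]
    rider = cell_id(37.7749, -122.4194)
    print(rider, "->", owner(rider, members))
```

Every node that knows the same membership list computes the same owner, so any node can forward a request to the right place without consulting a central directory.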

Schemaless: MySQL at Scale

Challenge:

Need horizontal scaling (MySQL doesn’t shard easily)
Need flexible schema (trip data evolves fast)
Need low latency (real-time dispatch)

Solution: Schemaless (MySQL + application-level sharding)

ARCHITECTURE:

┌─────────────────────────────────────────────────┐
│              Schemaless Layer                   │
│  (Application-level sharding + abstraction)     │
└────────────────────┬────────────────────────────┘
                     │
     ┌───────────────┼───────────────┐
     ▼               ▼               ▼
┌─────────┐    ┌─────────┐    ┌─────────┐
│ MySQL   │    │ MySQL   │    │ MySQL   │
│ Shard 1 │    │ Shard 2 │    │ Shard N │
└─────────┘    └─────────┘    └─────────┘

DATA MODEL:
┌─────────────────────────────────────────────────┐
│  Row Key (UUID)  │  Column  │  Body (JSON)     │
├──────────────────┼──────────┼──────────────────┤
│  trip-123-abc    │  base    │ {"rider": ...}   │
│  trip-123-abc    │  driver  │ {"driver": ...}  │
│  trip-123-abc    │  route   │ {"waypoints":..} │
└─────────────────────────────────────────────────┘

BENEFITS:
• Easy horizontal scaling (shard by row key)
• Schema evolution (just add columns)
• Point-in-time recovery (versioned cells)
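
The cell model and sharding rule can be illustrated with an in-memory stand-in. In this sketch each "shard" is a Python dictionary rather than a MySQL instance, and cells are append-only lists of JSON bodies; `CellStore`, `shard_for`, and the SHA-1 based placement are assumptions, not Schemaless internals.

```python
import hashlib
import json

NUM_SHARDS = 4


def shard_for(row_key, num_shards=NUM_SHARDS):
    """All cells of a row land on the same shard, chosen by hashing the row key."""
    digest = hashlib.sha1(row_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class CellStore:
    def __init__(self, num_shards=NUM_SHARDS):
        # Each shard maps (row_key, column) -> list of JSON bodies, oldest first.
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, row_key):
        return self.shards[shard_for(row_key, len(self.shards))]

    def put(self, row_key, column, body):
        """Append a new immutable version and return its version number."""
        versions = self._shard(row_key).setdefault((row_key, column), [])
        versions.append(json.dumps(body))
        return len(versions)

    def get(self, row_key, column, version=None):
        """Return the latest version, or a specific older one."""
        versions = self._shard(row_key).get((row_key, column), [])
        if not versions:
            return None
        index = (version or len(versions)) - 1
        return json.loads(versions[index])


if __name__ == "__main__":
    store = CellStore()
    store.put("trip-123-abc", "base", {"rider": "r1"})
    store.put("trip-123-abc", "base", {"rider": "r1", "status": "done"})
    print(store.get("trip-123-abc", "base", version=1))
```

Adding a new "column" needs no schema migration, and keeping old versions is what makes reading a cell as of an earlier point possible.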

Cadence: Workflow Orchestration

Uber’s solution for long-running, fault-tolerant workflows.

EXAMPLE: Trip Lifecycle

# Illustrative pseudocode: the decorator, wait_for_event, and the step
# functions are placeholders, not the actual Cadence client API.
@workflow
def trip_workflow(rider_id, pickup, dropoff):
    # Each step is fault-tolerant and can be retried

    # Step 1: Find driver (may take minutes)
    driver = yield find_driver(pickup)

    # Step 2: Wait for pickup
    yield wait_for_event("driver_arrived", timeout="30m")

    # Step 3: Wait for trip completion (assumed to return the trip record)
    trip = yield wait_for_event("trip_completed", timeout="8h")

    # Step 4: Process payment
    yield process_payment(rider_id, driver.id, trip.amount)

    # Step 5: Request ratings
    yield send_rating_request(rider_id, driver.id)

GUARANTEES:
• Each step's result is recorded exactly once (a step may be retried, so its side effects should be idempotent)
• Workflow survives server failures
• State persisted at each step
• Full visibility into running workflows
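
One way such guarantees can be provided is by recording each step's result in a durable history and replaying that history after a crash. The sketch below shows the idea with a generator-based workflow and an in-memory list as the history. It is a toy model under those assumptions, not the Cadence implementation, and every name in it is hypothetical.

```python
def run_workflow(workflow, history, activities):
    """Drive a generator workflow, replaying recorded results before running new steps."""
    gen = workflow()
    step = 0
    try:
        request = next(gen)
        while True:
            if step < len(history):
                result = history[step]            # replay: do not re-run the activity
            else:
                name, args = request
                result = activities[name](*args)  # first execution of this step
                history.append(result)            # persist before moving on
            step += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value


def trip_workflow():
    driver = yield ("find_driver", ("pickup-1",))
    fare = yield ("complete_trip", (driver,))
    receipt = yield ("charge", ("rider-1", fare))
    return receipt


if __name__ == "__main__":
    activities = {
        "find_driver": lambda pickup: "driver-9",
        "complete_trip": lambda driver: 23.5,
        "charge": lambda rider, fare: f"charged {rider} {fare}",
    }
    history = ["driver-9", 23.5]  # steps finished before a simulated crash
    print(run_workflow(trip_workflow, history, activities))
```

On restart the workflow function runs again from the top, but the first two steps are answered from the history, so only the unfinished payment step actually executes. This only works if the workflow code is deterministic.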

Open Source: Cadence is open source; Temporal, a fork of Cadence, is now widely adopted

Uber Failure Story: The 2019 Mapping Outage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2019 MAPPING OUTAGE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 2019                                                       │
│  IMPACT: App crashes, dispatch failures, surge pricing errors              │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Map tile service returned malformed data                                 │
│  • Client apps crashed when parsing                                         │
│  • Retry storms overloaded map service                                      │
│  • Cascade to other services using maps                                     │
│                                                                              │
│  THE CASCADE:                                                               │
│                                                                              │
│  Malformed     →    Client    →    Retry    →    Service    →   Dispatch   │
│  Response           Crash          Storm        Overload        Failure    │
│                                                                              │
│  LESSONS:                                                                   │
│  ─────────                                                                  │
│  1. Client-side validation is critical                                      │
│     - Don't trust backend responses blindly                                │
│     - Graceful degradation when data is bad                                │
│                                                                              │
│  2. Circuit breakers needed everywhere                                      │
│     - Client-side circuit breakers                                          │
│     - Service-side admission control                                        │
│                                                                              │
│  3. Retry behavior must be controlled                                       │
│     - Exponential backoff                                                   │
│     - Jitter to spread load                                                 │
│     - Client-side retry budgets                                             │
│                                                                              │
│  4. Map data is critical path                                               │
│     - Cache aggressively                                                    │
│     - Multiple fallback sources                                             │
│     - Stale data better than no data                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
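
Lesson 3 above can be sketched in a few lines. This is a minimal illustration, not Uber's client code: the names RetryBudget, backoff_delay and call_with_retries, and the default values (10% budget, 0.1s base, 10s cap), are assumptions chosen for the example.

```python
import random
import time


class RetryBudget:
    """Allows retries only while they stay under a fraction of requests sent."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 3):
        self.ratio = ratio              # retries allowed per request sent
        self.min_retries = min_retries  # small floor so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, budget: RetryBudget, max_attempts: int = 4, sleep=time.sleep):
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of feeding a retry storm
            sleep(backoff_delay(attempt))
```

The budget is shared across calls, so when a backend is failing for everyone, the client stops retrying once the budget is spent rather than multiplying the load.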

Stripe: Financial Transactions at Scale

When money is involved, correctness is everything.

Architecture Principles

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRIPE ENGINEERING PRINCIPLES                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. EXACTLY-ONCE DELIVERY                                                   │
│  ────────────────────────                                                   │
│  Payment must happen exactly once, never more, never less                   │
│                                                                              │
│  Implementation:                                                            │
│  • Idempotency keys on all mutating operations                             │
│  • Request deduplication in API layer                                       │
│  • Transactional outbox pattern for events                                  │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  2. STRONG CONSISTENCY FOR MONEY                                            │
│  ───────────────────────────────                                            │
│  Unlike social media, eventual consistency is not acceptable                │
│                                                                              │
│  Approach:                                                                  │
│  • Single-leader PostgreSQL for financial data                              │
│  • Synchronous replication to standby                                       │
│  • Serializable isolation for critical paths                                │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  3. IDEMPOTENCY BY DEFAULT                                                  │
│  ──────────────────────────                                                 │
│  Every API accepts Idempotency-Key header                                   │
│                                                                              │
│  curl https://api.stripe.com/v1/charges \                                  │
│    -H "Idempotency-Key: order-12345" \                                     │
│    -d amount=2000 \                                                        │
│    -d currency=usd                                                         │
│                                                                              │
│  Same key = same result (even if called 100 times)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Patterns

Idempotency Keys

How it works:

# Stripe's approach (simplified). idempotency_store, ConflictError,
# actually_create_charge and now() are placeholders, not real Stripe APIs.

async def create_charge(idempotency_key: str, amount: int, **params):
    # 1. Check if we've seen this key before
    existing = await idempotency_store.get(idempotency_key)

    if existing:
        if existing["status"] == "completed":
            # Return cached response
            return existing["response"]
        if existing["status"] == "in_progress":
            # Someone else is processing
            raise ConflictError("Request in progress")

    # 2. Mark as in-progress.
    # Caveat: the get above and this set are two separate steps, so two
    # concurrent requests can both pass step 1. A production store needs
    # an atomic "insert if absent" here.
    await idempotency_store.set(
        idempotency_key,
        {"status": "in_progress", "started_at": now()}
    )

    try:
        # 3. Do the actual work
        result = await actually_create_charge(amount, **params)

        # 4. Store the result
        await idempotency_store.set(
            idempotency_key,
            {"status": "completed", "response": result}
        )

        return result

    except Exception:
        # 5. On failure, release the key so the client can retry
        #    (only safe if the failed attempt left no side effects)
        await idempotency_store.delete(idempotency_key)
        raise

TTL: Idempotency keys typically expire after 24 hours
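
The get-then-set sequence in steps 1 and 2 leaves a window in which two concurrent requests with the same key can both start work. One way to close it is to let the store claim the key atomically. The sketch below uses an in-memory SQLite table with a primary-key constraint as a stand-in store. The table name, claim_key and the charge_fn parameter are assumptions for illustration, not Stripe's implementation, and TTL expiry is left out.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY, status TEXT NOT NULL, response TEXT)"
)


def claim_key(key: str) -> bool:
    """Atomically claim a key; returns True only for the first caller."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (key, status)"
            " VALUES (?, 'in_progress')",
            (key,),
        )
    return cur.rowcount == 1  # 0 rows inserted means the key already existed


def create_charge(key: str, amount: int, charge_fn) -> dict:
    if claim_key(key):
        try:
            response = charge_fn(amount)  # hypothetical function doing the real work
        except Exception:
            # Release the key so the client can retry.
            with conn:
                conn.execute("DELETE FROM idempotency_keys WHERE key = ?", (key,))
            raise
        with conn:
            conn.execute(
                "UPDATE idempotency_keys SET status = 'completed', response = ?"
                " WHERE key = ?",
                (json.dumps(response), key),
            )
        return response

    # Key already claimed: replay the stored response or report a conflict.
    status, response = conn.execute(
        "SELECT status, response FROM idempotency_keys WHERE key = ?", (key,)
    ).fetchone()
    if status == "completed":
        return json.loads(response)
    raise RuntimeError("Request in progress")
```

Calling create_charge twice with the same key runs charge_fn once and returns the stored response the second time.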

Transactional Outbox Pattern

Problem: Need to update database AND send event atomically

WRONG APPROACH:

1. Update database (charge.status = "succeeded")
2. Send Kafka event

What if step 2 fails? Database updated, event lost!

TRANSACTIONAL OUTBOX:

1. In a single transaction:
   - Update charge.status = "succeeded"
   - Insert into outbox table: {event: "charge.succeeded", ...}

2. Separate process reads outbox, sends to Kafka

3. After Kafka confirms, delete from outbox

┌─────────────────────────────────────────────────────┐
│                    DATABASE                         │
│  ┌───────────────┐    ┌───────────────────────────┐ │
│  │   charges     │    │        outbox             │ │
│  ├───────────────┤    ├───────────────────────────┤ │
│  │ id: ch_123    │    │ id: 1                     │ │
│  │ status: succ  │    │ event: charge.succeeded   │ │
│  │ ...           │    │ payload: {...}            │ │
│  └───────────────┘    └───────────────────────────┘ │
│                               │                     │
│                               │ Background process  │
│                               ▼                     │
│                        ┌─────────────┐             │
│                        │    Kafka    │             │
│                        └─────────────┘             │
└─────────────────────────────────────────────────────┘
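
The three steps above can be sketched with SQLite standing in for the database and a plain callable standing in for the Kafka producer. The schema, mark_succeeded, relay_once and the publish callback are assumptions made for this example.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE charges (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    INSERT INTO charges VALUES ('ch_123', 'pending');
""")


def mark_succeeded(charge_id: str) -> None:
    # Step 1: one transaction, so both writes commit or neither does.
    with db:
        db.execute(
            "UPDATE charges SET status = 'succeeded' WHERE id = ?", (charge_id,)
        )
        db.execute(
            "INSERT INTO outbox (event, payload) VALUES (?, ?)",
            ("charge.succeeded", json.dumps({"charge_id": charge_id})),
        )


def relay_once(publish) -> int:
    """Steps 2 and 3: publish pending rows in order, deleting each row
    only after publish() has returned without raising."""
    rows = db.execute("SELECT id, event, payload FROM outbox ORDER BY id").fetchall()
    sent = 0
    for row_id, event, payload in rows:
        publish(event, json.loads(payload))  # if this raises, the row stays for retry
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the relay crashes after publishing but before deleting, the row is published again on the next run. The pattern therefore gives at-least-once delivery, and consumers still need to deduplicate.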

Request Hedging

Problem: Tail latency (p99) is often much worse than median

Solution: Send to multiple replicas, use first response

HEDGING STRATEGY:

Time 0ms:    Send to replica A
Time 5ms:    If no response, also send to replica B
Time 10ms:   If no response, also send to replica C

Use FIRST response, cancel others

BENEFITS:
• p99 latency dramatically reduced
• One slow replica doesn't hurt overall latency

CAUTION:
• Only for idempotent reads
• Increases load on backend (but usually worth it)
• Need cancellation to avoid wasted work
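
A minimal asyncio sketch of the staggered strategy above. The function name hedged_read and the representation of each replica as a zero-argument coroutine function are assumptions for illustration. In this sketch, an error from the first replica to finish is raised as-is rather than waiting for the others.

```python
import asyncio


async def hedged_read(replicas, hedge_delay: float = 0.005):
    """Start one replica call, add another every hedge_delay seconds until
    one finishes, then return that result and cancel the rest."""
    if not replicas:
        raise ValueError("need at least one replica")
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(replica()))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # Every replica has been contacted; wait for whichever finishes first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that already finished
```

When the first replica answers within hedge_delay, no extra request is sent, which keeps the added backend load limited to the slow cases.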

Common Patterns Across Companies

Idempotency Everywhere

Mutating operations commonly accept an idempotency key:
• Stripe: Idempotency-Key header
• AWS: ClientRequestToken
• Google: requestId

Chaos Testing

You don’t know if you’re resilient until you test.
• Netflix: Chaos Monkey, Chaos Kong
• Amazon: GameDay exercises
• Google: DiRT (Disaster Recovery Testing)

Circuit Breakers

Fail fast instead of cascading.
• Netflix: Hystrix (now in maintenance mode; Netflix points new projects to Resilience4j)
• Uber: Custom circuit breakers in every service
All use some variant of the pattern.
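
The pattern itself is small. Below is a minimal single-threaded sketch with closed, open and half-open behavior. The class name, thresholds and injectable clock are assumptions for illustration, and it does not reproduce the Hystrix or Resilience4j APIs.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on a dependency that is already struggling.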

Observability

You can’t fix what you can’t see.
• Distributed tracing: Zipkin/Jaeger-style
• Metrics: RED method (Rate, Errors, Duration)
• Logs: Structured, correlated by trace ID
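
As a rough illustration of RED metrics and trace-correlated structured logs, the sketch below wraps a handler call. The observe function, the in-process metrics dictionary and the log field names are assumptions for this example. A real service would export to a metrics system and take the trace ID from incoming request headers.

```python
import json
import time
import uuid
from collections import defaultdict

# Per-endpoint counters: request count (Rate), error count (Errors),
# and summed latency (Duration).
metrics = defaultdict(lambda: {"requests": 0, "errors": 0, "total_seconds": 0.0})


def observe(endpoint, handler, trace_id=None, emit=print):
    """Run handler(), record RED counters, and emit one JSON log line."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        return handler()
    except Exception:
        status = "error"
        raise
    finally:
        elapsed = time.perf_counter() - start
        m = metrics[endpoint]
        m["requests"] += 1
        if status == "error":
            m["errors"] += 1
        m["total_seconds"] += elapsed
        emit(json.dumps({
            "trace_id": trace_id,
            "endpoint": endpoint,
            "status": status,
            "duration_s": round(elapsed, 6),
        }))
```

Because every log line carries the trace ID, logs from different services handling the same request can be joined on that field.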

Key Takeaways

Availability often trumps consistency — Amazon’s shopping cart chose availability. Know when this trade-off is acceptable.
Test failure modes in production — Netflix’s Chaos Engineering isn’t optional, it’s essential.
Build for horizontal scale from day one — Re-architecting a monolith is painful. Design for distribution early.
Invest in your primitives — Google built TrueTime. Uber built Ringpop. Your foundations matter.
Every outage is a learning opportunity — The companies with the best uptime have the best postmortems.
