# Real-World Case Studies

> Deep dives into Google, Amazon, Netflix, and Uber's distributed systems with actual failure stories

Theory is essential, but seeing how distributed systems work (and fail) in production at massive scale is where deep understanding comes from. This module covers battle-tested architectures from the world's leading technology companies.

Every case study in this chapter follows the same pattern: a company hit a scaling wall, made a set of engineering trade-offs under real constraints (time, money, team expertise, existing code), and lived with the consequences -- both intended and unintended. These are not "the right way" to build systems; they are "one way that worked for a specific company at a specific point in time." The lesson is never "copy Google's architecture." It is "understand the reasoning that led to Google's decisions and apply that reasoning to your own, very different, situation."

<Info>
  **Track Duration**: 12-16 hours\
  **Companies Covered**: Google, Amazon, Netflix, Uber, Meta, Stripe\
  **Focus**: Architecture decisions, failure stories, lessons learned
</Info>

***

## Google Spanner

The database that made the impossible possible -- globally consistent transactions. Before Spanner, the conventional wisdom (hardened by the CAP theorem) was that you had to choose between strong consistency and global distribution. Spanner essentially said "what if we throw GPS satellites and atomic clocks at the problem?" and built a system where the laws of physics help enforce transaction ordering.

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    GOOGLE SPANNER ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                          ┌─────────────────────┐                            │
│                          │     Spanner API     │                            │
│                          └──────────┬──────────┘                            │
│                                     │                                        │
│        ┌────────────────────────────┼────────────────────────────┐          │
│        │                            │                            │          │
│        ▼                            ▼                            ▼          │
│   ┌─────────────┐            ┌─────────────┐            ┌─────────────┐    │
│   │   Zone A    │            │   Zone B    │            │   Zone C    │    │
│   │  (Oregon)   │            │  (Iowa)     │            │  (Virginia) │    │
│   └──────┬──────┘            └──────┬──────┘            └──────┬──────┘    │
│          │                          │                          │            │
│   ┌──────┴──────┐            ┌──────┴──────┐            ┌──────┴──────┐    │
│   │ Spanserver  │            │ Spanserver  │            │ Spanserver  │    │
│   │  (Paxos     │◄──────────►│  (Paxos     │◄──────────►│  (Paxos     │    │
│   │   Leader)   │            │   Replica)  │            │   Replica)  │    │
│   └──────┬──────┘            └─────────────┘            └─────────────┘    │
│          │                                                                   │
│          ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           COLOSSUS                                  │   │
│   │                    (Distributed File System)                        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           TRUETIME                                  │   │
│   │              GPS + Atomic Clocks in every datacenter                │   │
│   │           API: TT.now() → [earliest, latest]                       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Innovations

<AccordionGroup>
  <Accordion title="TrueTime: Making Time Trustworthy" icon="clock">
    **The Insight**: You can't synchronize clocks perfectly, but you can bound the uncertainty.

    ```
    TRADITIONAL APPROACH:
    "The time is 10:00:00.000"
    (But actually, who knows? Could be off by milliseconds)

    TRUETIME APPROACH:
    "The time is between 10:00:00.000 and 10:00:00.007"
    (Guaranteed! We measured the uncertainty)

    HARDWARE:
    • GPS receivers on datacenter roofs
    • Atomic clocks (rubidium) as backup
    • Multiple time masters per datacenter
    • Typical uncertainty: 1-7ms
    ```

    **Usage in Transactions**:

    ```
    1. Transaction T1 gets commit timestamp = TT.now().latest
    2. T1 waits until TT.after(timestamp) is true
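    3. Now GUARANTEED: any transaction that starts after T1 commits gets a timestamp > T1's
    4. External consistency follows from the clock bound alone, with no extra
       cross-node ordering protocol (read-write transactions still take locks)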
    ```
  </Accordion>

  <Accordion title="Paxos Groups for Replication" icon="clone">
    **Per-Tablet Paxos**:

    * Each tablet (partition) has its own Paxos group
    * 3-5 replicas across zones
    * Writes go through Paxos leader
    * Reads can go to any replica (with proper timestamp)

    **Split and Merge**:

    * Tablets automatically split when too large
    * Tablets merge when too small
    * Paxos ensures consistent split/merge
  </Accordion>

  <Accordion title="External Consistency" icon="shield">
    **The Guarantee**: If transaction T1 commits before T2 starts, T1's timestamp \< T2's timestamp.

    **Why It Matters**:

    ```
    SCENARIO:
    User in US writes a record
    User in Europe reads immediately after

    WITHOUT EXTERNAL CONSISTENCY:
    European read might see old data (before US write)

    WITH EXTERNAL CONSISTENCY:
    European read guaranteed to see US write
    (because read timestamp > write timestamp)
    ```

    **Cost**: Every read-write commit pays the commit wait, which scales with the clock uncertainty (single-digit milliseconds with the figures above)
  </Accordion>
</AccordionGroup>
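
The commit-wait rule described under "TrueTime: Making Time Trustworthy" can be sketched in a few lines of Python. This is a toy simulation, not Spanner's API: `TrueTimeSim`, `TTInterval`, `commit`, and the 4 ms `epsilon_s` default are assumptions made for illustration. Only the idea that `now()` returns `[earliest, latest]` and that `after()` gates the commit comes from the text above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSim:
    """Toy stand-in for TrueTime: a local clock plus a fixed uncertainty bound."""

    def __init__(self, epsilon_s: float = 0.004, clock=time.monotonic):
        self.epsilon_s = epsilon_s  # assumed uncertainty bound (hypothetical value)
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp: float) -> bool:
        # True only once `timestamp` is in the past even for the slowest clock.
        return self.now().earliest > timestamp


def commit(tt: TrueTimeSim, apply_write, sleep=time.sleep) -> float:
    """Pick a commit timestamp, wait out the uncertainty, then expose the write."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):  # the "commit wait"
        sleep(tt.epsilon_s / 4)
    apply_write(commit_ts)
    return commit_ts
```

Because the wait lasts until `earliest` has passed the chosen timestamp, it takes roughly twice the uncertainty bound, and any commit that begins afterwards necessarily picks a larger timestamp.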

### Spanner Failure Story: The Leap Second

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 LEAP SECOND INCIDENT                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: June 30, 2012                                                        │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • Leap second added at midnight UTC                                        │
│  • Linux kernel bug caused livelock in clock_gettime()                     │
│  • Systems consuming 100% CPU doing nothing                                 │
│  • Widespread outages: Reddit, LinkedIn, Mozilla, Gawker                   │
│                                                                              │
│  HOW GOOGLE HANDLED IT:                                                     │
│  • Google had already implemented "leap smear"                             │
│  • Instead of adding 1 second at midnight:                                 │
│    - Slow down time by 11.6 μs per second                                 │
│    - Spread over 24 hours before midnight                                  │
│  • All Google services remained stable                                      │
│                                                                              │
│  LESSON:                                                                    │
│  • Time is a critical distributed systems primitive                         │
│  • You must control how time changes propagate                              │
│  • Google's investment in TrueTime paid dividends                          │
│                                                                              │
│  RESULT:                                                                    │
│  • Leap smear became industry standard                                      │
│  • AWS, Azure, and others now use similar approaches                        │
│  • International discussions to eliminate leap seconds entirely            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
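
The 11.6 μs figure in the box above is just one second divided across a 24-hour window. A minimal sketch of that arithmetic, assuming a linear smear (the function names are hypothetical, and the shape of the smear Google actually deployed is not specified here):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def smear_rate_us_per_s(leap_seconds: float = 1.0,
                        window_s: int = SECONDS_PER_DAY) -> float:
    """Microseconds of adjustment applied per elapsed second in a linear smear."""
    return leap_seconds / window_s * 1_000_000


def smeared_offset_s(elapsed_s: float,
                     leap_seconds: float = 1.0,
                     window_s: int = SECONDS_PER_DAY) -> float:
    """How much of the leap second has been absorbed `elapsed_s` into the window."""
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return leap_seconds * fraction
```

Halfway through the window, `smeared_offset_s(43_200)` is 0.5 s, so clocks drift in by tiny steps instead of jumping a full second at once.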

***

## Amazon DynamoDB

The database that powers amazon.com checkout -- when availability is everything.

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DYNAMODB ARCHITECTURE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         REQUEST ROUTERS                              │   │
│  │              (Stateless, route requests to partitions)               │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │                                         │
│           ┌───────────────────────┼───────────────────────┐                 │
│           ▼                       ▼                       ▼                 │
│   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐        │
│   │  Partition 1  │       │  Partition 2  │       │  Partition N  │        │
│   │  ┌─────────┐  │       │  ┌─────────┐  │       │  ┌─────────┐  │        │
│   │  │ Leader  │  │       │  │ Leader  │  │       │  │ Leader  │  │        │
│   │  └────┬────┘  │       │  └────┬────┘  │       │  └────┬────┘  │        │
│   │       │       │       │       │       │       │       │       │        │
│   │  ┌────┴────┐  │       │  ┌────┴────┐  │       │  ┌────┴────┐  │        │
│   │  │Replicas │  │       │  │Replicas │  │       │  │Replicas │  │        │
│   │  │(2 more) │  │       │  │(2 more) │  │       │  │(2 more) │  │        │
│   │  └─────────┘  │       │  └─────────┘  │       │  └─────────┘  │        │
│   └───────────────┘       └───────────────┘       └───────────────┘        │
│                                                                              │
│  GLOBAL TABLES (Multi-Region):                                              │
│                                                                              │
│   Region: US-EAST-1           Region: EU-WEST-1                             │
│   ┌─────────────────┐         ┌─────────────────┐                           │
│   │     Table A     │ ◄─────► │     Table A     │                           │
│   │  (Replica)      │   Sync  │  (Replica)      │                           │
│   └─────────────────┘         └─────────────────┘                           │
│                                                                              │
│   Last-Writer-Wins conflict resolution                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Design Decisions

<AccordionGroup>
  <Accordion title="Always Available: The Shopping Cart Story" icon="cart-shopping">
    **The Origin Story** (from the 2007 Dynamo paper):

    ```
    SCENARIO:
    Customer adding items to cart during peak shopping
    Database partition occurs

    OPTION A: Reject writes, show error
    → Customer leaves, buys from competitor
    → Lost revenue: $$$

    OPTION B: Accept writes, reconcile later
    → Worst case: Deleted items reappear in cart
    → Customer removes them again at checkout
    → Revenue preserved

    AMAZON CHOSE: Option B (Availability over Consistency)
    ```

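    **Implementation** (in the original Dynamo design; DynamoDB, shown in the diagram above, instead routes writes through a per-partition leader):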

    * Leaderless replication
    * Write to W of N replicas (W \< N means some can be down)
    * Read from R replicas, resolve conflicts
    * Shopping cart uses "union" merge (keep all items)
  </Accordion>

  <Accordion title="Consistent Hashing with Virtual Nodes" icon="circle">
    **Problem**: Nodes joining/leaving causes massive data reshuffling

    **Solution**: Virtual nodes

    ```
    WITHOUT VIRTUAL NODES:
    Hash ring: [──A──|──B──|──C──|──D──]
    Node B fails: [──A──|────C────|──D──]
    Node C now has 2x data (overloaded!)

    WITH VIRTUAL NODES:
    Each physical node = 150+ virtual nodes
    Hash ring: [A1|B2|C1|A2|D1|B1|C2|D2|...]
    Node B fails: B's virtual nodes spread across A, C, D
    Load stays balanced
    ```
  </Accordion>

  <Accordion title="Adaptive Capacity" icon="gauge-high">
    **Original Problem**:

    * Provisioned throughput (e.g., 1000 WCU)
    * Uniform distribution assumed
    * Hot partition = throttling

    **Solution**: Adaptive capacity

    ```
    BEFORE (2017):
    Table: 1000 WCU
    Partition 1: 250 WCU limit (1000/4)
    Hot key in Partition 1: Throttled!

    AFTER (Adaptive Capacity):
    Table: 1000 WCU
    Partition 1: Can use up to 1000 WCU if others idle
    Hot key: No throttling as long as total < 1000

    PLUS: On-demand capacity (2018)
    No provisioning, pay per request
    ```
  </Accordion>
</AccordionGroup>
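
The virtual-node idea from "Consistent Hashing with Virtual Nodes" can be sketched as follows. This is a minimal illustration, not Dynamo's implementation: `HashRing`, `load`, the SHA-256 token function, and the `node#i` naming of virtual nodes are all assumptions.

```python
import bisect
import hashlib
from collections import Counter


def _token(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, physical_node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node owns many small, scattered arcs of the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_token(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(t, n) for t, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's token, wrapping around.
        i = bisect.bisect_left(self._ring, (_token(key), ""))
        return self._ring[i % len(self._ring)][1]


def load(ring: HashRing, keys) -> Counter:
    """Count how many keys each physical node owns."""
    return Counter(ring.lookup(k) for k in keys)
```

Removing a node only reassigns the keys that node owned, and because its virtual nodes are scattered around the ring, those keys land on several successors rather than on a single neighbour.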

### DynamoDB Failure Story: The 2015 US-EAST-1 Outage

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2015 DYNAMODB OUTAGE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 20, 2015                                                   │
│  DURATION: ~5 hours                                                         │
│  IMPACT: Many high-profile sites down (Netflix, IMDB, Airbnb...)           │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Metadata service became overloaded                                       │
│  • Metadata = "where is partition X?"                                       │
│  • Clients retrying → exponential load increase                            │
│  • Vicious cycle of retries                                                │
│                                                                              │
│  THE CASCADE:                                                               │
│  ┌────────────┐     ┌────────────┐     ┌────────────┐                      │
│  │  Client    │────►│  Request   │────►│  Metadata  │                      │
│  │  Retry     │     │  Router    │     │  Service   │                      │
│  │  (10x)     │     │  (slow)    │     │  (dying)   │                      │
│  └────────────┘     └────────────┘     └────────────┘                      │
│       ▲                    │                 │                              │
│       │                    │                 │                              │
│       └────────── Timeout ─┴─── Overloaded ──┘                              │
│                                                                              │
│  LESSONS LEARNED:                                                           │
│  1. Metadata services are critical path                                     │
│  2. Retry storms can kill systems                                           │
│  3. Need circuit breakers and backoff                                       │
│  4. Client-side caching of metadata helps                                   │
│                                                                              │
│  CHANGES MADE:                                                              │
│  • More aggressive metadata caching                                         │
│  • Better retry backoff in SDKs                                             │
│  • Metadata service scaling improvements                                    │
│  • Cross-region tables for DR                                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
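
Lesson 3 above calls for backoff. One common way to keep retries from synchronizing into a storm is capped exponential backoff with random jitter. The sketch below is a generic illustration under assumed names and defaults (`call_with_backoff`, `RetryBudgetExceeded`, `base_s`, `cap_s`); it is not the retry logic of any AWS SDK.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when every attempt failed; the caller should stop hammering."""


def call_with_backoff(op, max_attempts=5, base_s=0.1, cap_s=5.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep, rng=random.random):
    """Call `op`, retrying transient failures with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on as exc:
            if attempt == max_attempts - 1:
                raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from exc
            # Ceiling doubles each attempt: base, 2*base, 4*base, ... up to cap.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Jitter: sleep a random fraction of the ceiling so clients spread out.
            sleep(rng() * ceiling)
```

The hard attempt limit matters as much as the delay: without it, a dependency that stays down accumulates load from every client that is still retrying.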

***

## Netflix: The Chaos Engineering Pioneers

Netflix serves 230+ million subscribers with 99.99% availability. How?

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    NETFLIX CLOUD ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                           CDN (Open Connect)                          │   │
│  │          15,000+ servers in ISPs worldwide                           │   │
│  │          95% of traffic served from CDN                              │   │
│  └────────────────────────────────┬─────────────────────────────────────┘   │
│                                   │ (5% control plane traffic)              │
│                                   ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         AWS (Control Plane)                           │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                         ZUUL                                    │ │   │
│  │  │               (API Gateway / Edge Service)                      │ │   │
│  │  └───────────────────────────┬─────────────────────────────────────┘ │   │
│  │                              │                                       │   │
│  │  ┌───────────────────────────┼───────────────────────────────────┐   │   │
│  │  │           MICROSERVICES (1000+)                               │   │   │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐      │   │   │
│  │  │  │Ratings │ │Viewing │ │Profile │ │Catalog │ │Billing │ ...  │   │   │
│  │  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘      │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  │                                                                      │   │
│  │  ┌─────────────────────────────────────────────────────────────────┐ │   │
│  │  │                    DATA LAYER                                   │ │   │
│  │  │   EVCache (Memcached)    Cassandra    Elasticsearch            │ │   │
│  │  │   30M+ req/sec           Petabytes    Search & Analytics       │ │   │
│  │  └─────────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Patterns

<AccordionGroup>
  <Accordion title="EVCache: Caching at Scale" icon="bolt">
    **Scale**:

    * 30+ million requests/second
    * 1+ trillion operations/day
    * Petabytes of cached data

    **Architecture**:

    ```
    EVCache = Enhanced Memcached

    KEY FEATURES:
    1. Replication across zones (survive AZ failure)
    2. Local zone preference (lower latency)
    3. Fast fallback to other zones
    4. Shadow clusters for testing

    TOPOLOGY:
    ┌─────────────────────────────────────────────┐
    │              EVCache Cluster                │
    │                                             │
    │  Zone A          Zone B          Zone C    │
    │  ┌─────┐         ┌─────┐         ┌─────┐   │
    │  │MC-1 │ ◄─────► │MC-1 │ ◄─────► │MC-1 │   │
    │  │MC-2 │         │MC-2 │         │MC-2 │   │
    │  │MC-3 │         │MC-3 │         │MC-3 │   │
    │  └─────┘         └─────┘         └─────┘   │
    │                                             │
    │  Writes: All zones (synchronous)            │
    │  Reads: Local zone first, fallback to others│
    └─────────────────────────────────────────────┘
    ```
  </Accordion>

  <Accordion title="Zuul: Edge Gateway" icon="shield">
    **Responsibilities**:

    * Authentication
    * Dynamic routing
    * Load shedding
    * Request throttling
    * Attack detection

    **Scale**: 1+ million RPS at the edge

    **Innovation**: Zuul 2 (async/non-blocking)

    * Moved from thread-per-request to event loop
    * 90% reduction in connection memory
    * Better tail latency under load
  </Accordion>

  <Accordion title="Chaos Engineering: The Simian Army" icon="monkey">
    **Chaos Monkey** (2011): Randomly kills instances in production

    ```
    THE SIMIAN ARMY:

    Chaos Monkey      → Kill instances
    Latency Monkey    → Inject artificial delays
    Chaos Kong        → Fail entire AWS region
    Doctor Monkey     → Health checks and remediation
    Janitor Monkey    → Clean up unused resources
    Conformity Monkey → Enforce best practices
    ```

    **Philosophy**:

    ```
    "The best way to avoid failure is to fail constantly"

    If you can't handle instance failures at 3pm on Tuesday,
    you definitely can't handle them at 3am on Black Friday.

    RESULT:
    - Engineers build resilient systems by default
    - Failures become routine, not emergencies
    - Recovery is automated, not heroic
    ```
  </Accordion>
</AccordionGroup>
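
The read and write paths listed under "EVCache: Caching at Scale" (write to every zone, read from the local zone first, fall back to the others) can be modelled with plain dictionaries. `ZoneAwareCache` and its methods are hypothetical names for this sketch and do not reflect the EVCache client API.

```python
class ZoneAwareCache:
    """In-memory model of a cache replicated across availability zones."""

    def __init__(self, zones, local_zone):
        self.replicas = {zone: {} for zone in zones}
        self.local_zone = local_zone
        self.down = set()  # zones currently unreachable

    def mark_down(self, zone):
        self.down.add(zone)

    def set(self, key, value):
        # Write to every reachable zone so any one of them can serve reads.
        for zone, store in self.replicas.items():
            if zone not in self.down:
                store[key] = value

    def get(self, key):
        """Return (value, zone_that_served_it), or (None, None) on a miss."""
        others = [z for z in self.replicas if z != self.local_zone]
        for zone in [self.local_zone] + others:
            if zone in self.down:
                continue
            if key in self.replicas[zone]:
                return self.replicas[zone][key], zone
        return None, None
```

With zones `a`, `b`, `c` and `a` as the local zone, a read is served from `a` until `mark_down("a")` is called, after which the same key is served from `b`.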

### Netflix Failure Story: The 2012 Christmas Eve Outage

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2012 CHRISTMAS EVE OUTAGE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: December 24, 2012                                                    │
│  DURATION: ~8 hours                                                         │
│  CAUSE: AWS ELB (Elastic Load Balancer) failure                            │
│                                                                              │
│  WHAT HAPPENED:                                                             │
│  • AWS made changes to ELB configuration                                    │
│  • Bug triggered widespread ELB failures in US-EAST-1                      │
│  • Netflix (and many others) went down                                      │
│                                                                              │
│  THE IRONY:                                                                 │
│  • Netflix had Chaos Monkey                                                 │
│  • Netflix could survive instance failures                                  │
│  • Netflix could survive AZ failures                                        │
│  • But ELB was a single point of failure they didn't test!                 │
│                                                                              │
│  AFTERMATH:                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  1. Built their own load balancer: Zuul                             │    │
│  │     (Removed dependency on AWS ELB)                                 │    │
│  │                                                                     │    │
│  │  2. Created "Chaos Kong"                                            │    │
│  │     (Simulate entire region failures)                               │    │
│  │                                                                     │    │
│  │  3. Multi-region active-active                                      │    │
│  │     (Traffic can shift between regions in seconds)                  │    │
│  │                                                                     │    │
│  │  4. Invested in "Failure Injection Testing" (FIT)                   │    │
│  │     (Structured chaos experiments)                                  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  LESSON:                                                                    │
│  "If you haven't tested a failure mode, you're not resilient to it"        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
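
The lesson in the box above suggests testing each dependency's failure deliberately rather than waiting for it to happen. A minimal sketch of that idea follows: wrap calls to a dependency so a test can force them to fail, then check that the caller degrades gracefully. `FaultInjector`, `InjectedFailure`, and `load_homepage` are hypothetical names and are not Netflix's FIT interfaces.

```python
import random


class InjectedFailure(Exception):
    """Marks a failure that was caused on purpose by an experiment."""


class FaultInjector:
    def __init__(self, rng=None):
        self._rules = {}  # dependency name -> failure probability
        self._rng = rng or random.Random()

    def fail(self, dependency: str, probability: float = 1.0) -> None:
        self._rules[dependency] = probability

    def clear(self) -> None:
        self._rules.clear()

    def call(self, dependency: str, fn, *args, **kwargs):
        # random() is in [0, 1), so probability 1.0 always fails and 0.0 never does.
        if self._rng.random() < self._rules.get(dependency, 0.0):
            raise InjectedFailure(dependency)
        return fn(*args, **kwargs)


def load_homepage(injector: FaultInjector, fetch_recommendations):
    """Caller under test: must still return a page if recommendations fail."""
    try:
        rows = injector.call("recommendations", fetch_recommendations)
    except Exception:
        rows = ["popular-titles"]  # degraded but still usable response
    return {"rows": rows}
```

An experiment then amounts to calling `injector.fail("recommendations")` and asserting that `load_homepage` still returns a page.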

***

## Uber: Real-Time at Scale

Uber processes millions of trips daily with sub-second dispatch decisions.

### Architecture Evolution

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    UBER ARCHITECTURE EVOLUTION                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  2010-2014: MONOLITH                                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                        Python Monolith                               │   │
│  │    (Everything in one codebase, one database)                        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  2014-2017: MICROSERVICES EXPLOSION                                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ... (1000+ services)        │
│  │Trip  │ │Map   │ │Price │ │Match │ │Pay   │                              │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘                              │
│                                                                              │
│  2017+: DOMAIN-ORIENTED SERVICES (DOMA)                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐       │    │
│  │  │   RIDES DOMAIN │  │  EATS DOMAIN   │  │ FREIGHT DOMAIN │       │    │
│  │  │  ┌───────────┐ │  │  ┌───────────┐ │  │  ┌───────────┐ │       │    │
│  │  │  │Trip       │ │  │  │Order      │ │  │  │Shipment   │ │       │    │
│  │  │  │Matching   │ │  │  │Restaurant │ │  │  │Carrier    │ │       │    │
│  │  │  │Pricing    │ │  │  │Delivery   │ │  │  │Route      │ │       │    │
│  │  │  └───────────┘ │  │  └───────────┘ │  │  └───────────┘ │       │    │
│  │  └────────────────┘  └────────────────┘  └────────────────┘       │    │
│  │                                                                     │    │
│  │  SHARED PLATFORM                                                    │    │
│  │  ┌─────────────────────────────────────────────────────────────┐   │    │
│  │  │  Maps  │  Payments  │  Identity  │  Messaging  │  Dispatch  │   │    │
│  │  └─────────────────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Systems

<AccordionGroup>
  <Accordion title="Ringpop: Consistent Hashing for Dispatch" icon="ring">
    **Problem**: Matching riders to drivers needs consistent, fast routing

    **Solution**: Ringpop (SWIM + consistent hashing)

    ```
    HOW IT WORKS:

    1. SWIM Protocol for cluster membership
       - Nodes gossip about each other's health
       - Detect failures in seconds
       - No single point of failure

    2. Consistent Hashing for request routing
       - Hash(rider_location) → specific node
       - That node has all nearby drivers in memory
       - Matching decisions in single-digit milliseconds

    TOPOLOGY:

    City: San Francisco

    ┌────────────────────────────────────────────────────┐
    │                    HASH RING                        │
    │                                                    │
    │       Node A ─────────────────── Node B           │
    │      (SOMA,      ← gossip →    (Financial,       │
    │     Mission)                     Marina)          │
    │         │                          │              │
    │         └──────── Node C ──────────┘              │
    │                 (Castro,                          │
    │                  Sunset)                          │
    │                                                    │
    │    Request for rider at (lat, lng):               │
    │    → Hash to Node A                               │
    │    → Node A has all drivers in that area          │
    │    → Match in <10ms                               │
    └────────────────────────────────────────────────────┘
    ```
  </Accordion>

  <Accordion title="Schemaless: MySQL at Scale" icon="database">
    **Challenge**:

    * Need horizontal scaling (MySQL doesn't shard easily)
    * Need flexible schema (trip data evolves fast)
    * Need low latency (real-time dispatch)

    **Solution**: Schemaless (MySQL + application-level sharding)

    ```
    ARCHITECTURE:

    ┌─────────────────────────────────────────────────┐
    │              Schemaless Layer                   │
    │  (Application-level sharding + abstraction)     │
    └────────────────────┬────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ MySQL   │    │ MySQL   │    │ MySQL   │
    │ Shard 1 │    │ Shard 2 │    │ Shard N │
    └─────────┘    └─────────┘    └─────────┘

    DATA MODEL:
    ┌─────────────────────────────────────────────────┐
    │  Row Key (UUID)  │  Column  │  Body (JSON)     │
    ├──────────────────┼──────────┼──────────────────┤
    │  trip-123-abc    │  base    │ {"rider": ...}   │
    │  trip-123-abc    │  driver  │ {"driver": ...}  │
    │  trip-123-abc    │  route   │ {"waypoints":..} │
    └─────────────────────────────────────────────────┘

    BENEFITS:
    • Easy horizontal scaling (shard by row key)
    • Schema evolution (just add columns)
    • Point-in-time recovery (versioned cells)
    ```
  </Accordion>

  <Accordion title="Cadence: Workflow Orchestration" icon="diagram-project">
    Uber's solution for long-running, fault-tolerant workflows.

    ```
    EXAMPLE: Trip Lifecycle (illustrative pseudocode, not the Cadence client API)

    @workflow
    def trip_workflow(rider_id, pickup, dropoff):
        # Each step is fault-tolerant and can be retried

        # Step 1: Find driver (may take minutes)
        driver = yield find_driver(pickup)

        # Step 2: Wait for pickup
        yield wait_for_event("driver_arrived", timeout="30m")

        # Step 3: Wait for trip completion; the event carries the finished trip
        trip = yield wait_for_event("trip_completed", timeout="8h")

        # Step 4: Process payment for the completed trip
        yield process_payment(rider_id, driver.id, trip.amount)

        # Step 5: Request ratings
        yield send_rating_request(rider_id, driver.id)

    GUARANTEES:
    • Completed steps are not re-run on recovery (results replay from history)
    • Workflow survives server failures
    • State persisted at each step
    • Full visibility into running workflows
    ```

    **Open Source**: Cadence is open source; Temporal, a fork started by Cadence's original creators, is now widely adopted
  </Accordion>
</AccordionGroup>
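
To make the sharding and versioned-cell ideas from the Schemaless accordion concrete, here is a minimal in-memory sketch. It is not Uber's implementation: `shard_for`, `CellStore`, and the shard count are hypothetical names and values chosen for illustration, and a Python dictionary stands in for each MySQL shard.

```python
import hashlib
import json
from collections import defaultdict
from typing import Optional

NUM_SHARDS = 4  # hypothetical; a real deployment would fix a much larger count


def shard_for(row_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a row key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class CellStore:
    """In-memory stand-in for the shards: cells are appended, never updated."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.num_shards = num_shards
        # One dict per shard: (row_key, column) -> list of JSON bodies, oldest first
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def put(self, row_key: str, column: str, body: dict) -> int:
        shard = self.shards[shard_for(row_key, self.num_shards)]
        versions = shard[(row_key, column)]
        versions.append(json.dumps(body))
        return len(versions)  # version number of the cell just written

    def get(self, row_key: str, column: str,
            version: Optional[int] = None) -> Optional[dict]:
        shard = self.shards[shard_for(row_key, self.num_shards)]
        versions = shard.get((row_key, column))
        if not versions:
            return None
        body = versions[-1] if version is None else versions[version - 1]
        return json.loads(body)


store = CellStore()
store.put("trip-123-abc", "base", {"rider": "r1"})
store.put("trip-123-abc", "base", {"rider": "r1", "status": "completed"})
latest = store.get("trip-123-abc", "base")
original = store.get("trip-123-abc", "base", version=1)
```

Because writes only append, reading an older version is a plain lookup, which is the property the "versioned cells" bullet relies on.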

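The durability guarantees listed in the Cadence accordion rest on recording each step's result and replaying that record after a crash. The sketch below illustrates the replay idea with a plain Python generator. `run_workflow`, the step functions, and the in-memory `history` list are assumptions for illustration, not the Cadence or Temporal API.

```python
from typing import Any, Callable, Generator, List


def run_workflow(workflow: Callable[[], Generator], history: List[Any]) -> Any:
    """Drive a generator workflow, replaying recorded results before new steps."""
    gen = workflow()
    step_index = 0
    try:
        step = next(gen)  # each yielded value is a zero-argument callable
        while True:
            if step_index < len(history):
                result = history[step_index]  # replay: reuse the recorded result
            else:
                result = step()               # first execution of this step
                history.append(result)        # record it before moving on
            step_index += 1
            step = gen.send(result)
    except StopIteration as done:
        return done.value


# Hypothetical steps: the payment step crashes the first time it runs.
calls: List[str] = []
crash_once = {"armed": True}


def find_driver() -> str:
    calls.append("find_driver")
    return "driver-7"


def process_payment() -> int:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("worker crashed")
    calls.append("process_payment")
    return 2000


def trip_workflow():
    driver = yield find_driver
    amount = yield process_payment
    return (driver, amount)


history: List[Any] = []  # stands in for durable storage
try:
    run_workflow(trip_workflow, history)       # crashes during payment
except RuntimeError:
    pass
result = run_workflow(trip_workflow, history)  # recovery replays find_driver
```

In this sketch, a crash after a step's side effect but before its result is recorded would cause the step to run again on recovery, so steps with external effects (such as payments) still need to be idempotent.
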
### Uber Failure Story: The 2019 Mapping Outage

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE 2019 MAPPING OUTAGE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DATE: September 2019                                                       │
│  IMPACT: App crashes, dispatch failures, surge pricing errors              │
│                                                                              │
│  ROOT CAUSE:                                                                │
│  • Map tile service returned malformed data                                 │
│  • Client apps crashed when parsing                                         │
│  • Retry storms overloaded map service                                      │
│  • Cascade to other services using maps                                     │
│                                                                              │
│  THE CASCADE:                                                               │
│                                                                              │
│  Malformed     →    Client    →    Retry    →    Service    →   Dispatch   │
│  Response           Crash          Storm        Overload        Failure    │
│                                                                              │
│  LESSONS:                                                                   │
│  ─────────                                                                  │
│  1. Client-side validation is critical                                      │
│     - Don't trust backend responses blindly                                │
│     - Graceful degradation when data is bad                                │
│                                                                              │
│  2. Circuit breakers needed everywhere                                      │
│     - Client-side circuit breakers                                          │
│     - Service-side admission control                                        │
│                                                                              │
│  3. Retry behavior must be controlled                                       │
│     - Exponential backoff                                                   │
│     - Jitter to spread load                                                 │
│     - Client-side retry budgets                                             │
│                                                                              │
│  4. Map data is critical path                                               │
│     - Cache aggressively                                                    │
│     - Multiple fallback sources                                             │
│     - Stale data better than no data                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
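
Lesson 3 in the box above (exponential backoff, jitter, retry budgets) can be sketched as follows. This is a minimal illustration, not Uber's client code: `backoff_delay`, `RetryBudget`, and `call_with_retries` are hypothetical names, and the budget uses lifetime counters where a production version would more likely use a sliding window.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class RetryBudget:
    """Allow retries only while they stay under a fraction of all requests."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 3):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries >= allowed:
            return False
        self.retries += 1
        return True


def call_with_retries(
    fn: Callable[[], T],
    budget: RetryBudget,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up: fail fast instead of feeding a retry storm
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The jitter spreads retries from many clients across the backoff window, and the shared budget caps how much extra load retries can add when the service is already struggling.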

***

## Stripe: Financial Transactions at Scale

When money is involved, correctness is everything.

### Architecture Principles

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRIPE ENGINEERING PRINCIPLES                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. EXACTLY-ONCE DELIVERY                                                   │
│  ────────────────────────                                                   │
│  Payment must happen exactly once, never more, never less                   │
│                                                                              │
│  Implementation:                                                            │
│  • Idempotency keys on all mutating operations                             │
│  • Request deduplication in API layer                                       │
│  • Transactional outbox pattern for events                                  │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  2. STRONG CONSISTENCY FOR MONEY                                            │
│  ───────────────────────────────                                            │
│  Unlike social media, eventual consistency is not acceptable                │
│                                                                              │
│  Approach:                                                                  │
│  • Single-leader PostgreSQL for financial data                              │
│  • Synchronous replication to standby                                       │
│  • Serializable isolation for critical paths                                │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  3. IDEMPOTENCY BY DEFAULT                                                  │
│  ──────────────────────────                                                 │
│  Every POST endpoint accepts an Idempotency-Key header                      │
│                                                                              │
│  curl https://api.stripe.com/v1/charges \                                  │
│    -H "Idempotency-Key: order-12345" \                                     │
│    -d amount=2000 \                                                        │
│    -d currency=usd                                                         │
│                                                                              │
│  Same key = same result (even if called 100 times)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Patterns

<AccordionGroup>
  <Accordion title="Idempotency Keys" icon="key">
    **How it works**:

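    ```python theme={null}
    # Stripe's approach (simplified). `idempotency_store` and
    # `actually_create_charge` are placeholders; `set_if_absent` stands for
    # any atomic insert-if-missing primitive (e.g. a unique-key INSERT).
    from datetime import datetime, timezone


    class ConflictError(Exception):
        """Another request with the same key is still being processed."""


    async def create_charge(idempotency_key: str, amount: int, **params):
        # 1. Atomically claim the key. A separate get-then-set would let two
        #    concurrent requests both pass the check and charge twice.
        claimed = await idempotency_store.set_if_absent(
            idempotency_key,
            {"status": "in_progress", "started_at": datetime.now(timezone.utc)},
        )

        if not claimed:
            # 2. We've seen this key before
            existing = await idempotency_store.get(idempotency_key)
            if existing and existing["status"] == "completed":
                # Return cached response
                return existing["response"]
            # Someone else is processing (or just failed); the client can retry
            raise ConflictError("Request in progress")

        try:
            # 3. Do the actual work
            result = await actually_create_charge(amount, **params)

            # 4. Store the result
            await idempotency_store.set(
                idempotency_key,
                {"status": "completed", "response": result},
            )
            return result

        except Exception:
            # 5. On failure, release the key so the client can retry
            await idempotency_store.delete(idempotency_key)
            raise
    ```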

    **TTL**: Idempotency keys typically expire after 24 hours
  </Accordion>

  <Accordion title="Transactional Outbox Pattern" icon="inbox">
    **Problem**: Need to update database AND send event atomically

    ```
    WRONG APPROACH:

    1. Update database (charge.status = "succeeded")
    2. Send Kafka event

    What if step 2 fails? Database updated, event lost!

    TRANSACTIONAL OUTBOX:

    1. In a single transaction:
       - Update charge.status = "succeeded"
       - Insert into outbox table: {event: "charge.succeeded", ...}

    2. Separate process reads outbox, sends to Kafka

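    3. After Kafka confirms, delete from outbox
       (a crash before the delete re-sends the event, so delivery is
       at-least-once and consumers must deduplicate)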

    ┌─────────────────────────────────────────────────────┐
    │                    DATABASE                         │
    │  ┌───────────────┐    ┌───────────────────────────┐ │
    │  │   charges     │    │        outbox             │ │
    │  ├───────────────┤    ├───────────────────────────┤ │
    │  │ id: ch_123    │    │ id: 1                     │ │
    │  │ status: succ  │    │ event: charge.succeeded   │ │
    │  │ ...           │    │ payload: {...}            │ │
    │  └───────────────┘    └───────────────────────────┘ │
    │                               │                     │
    │                               │ Background process  │
    │                               ▼                     │
    │                        ┌─────────────┐             │
    │                        │    Kafka    │             │
    │                        └─────────────┘             │
    └─────────────────────────────────────────────────────┘
    ```
  </Accordion>

  <Accordion title="Request Hedging" icon="clone">
    **Problem**: Tail latency (p99) is often much worse than median

    **Solution**: Send to multiple replicas, use first response

    ```
    HEDGING STRATEGY:

    Time 0ms:    Send to replica A
    Time 5ms:    If no response, also send to replica B
    Time 10ms:   If no response, also send to replica C

    Use FIRST response, cancel others

    BENEFITS:
    • p99 latency dramatically reduced
    • One slow replica doesn't hurt overall latency

    CAUTION:
    • Only for idempotent reads
    • Increases load on backend (but usually worth it)
    • Need cancellation to avoid wasted work
    ```
  </Accordion>
</AccordionGroup>
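
The transactional outbox steps described above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not Stripe's code: the table layout mirrors the diagram, `mark_charge_succeeded` and `relay_outbox` are hypothetical names, and the `publish` callback stands in for a Kafka producer.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE charges (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        """
    )


def mark_charge_succeeded(conn: sqlite3.Connection, charge_id: str) -> None:
    # One transaction: both writes commit together or neither does.
    with conn:
        conn.execute(
            "UPDATE charges SET status = 'succeeded' WHERE id = ?", (charge_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event, payload) VALUES (?, ?)",
            ("charge.succeeded", json.dumps({"charge_id": charge_id})),
        )


def relay_outbox(conn: sqlite3.Connection,
                 publish: Callable[[str, dict], None]) -> int:
    """Background step: publish pending events in order, then delete them."""
    rows = conn.execute(
        "SELECT id, event, payload FROM outbox ORDER BY id"
    ).fetchall()
    for row_id, event, payload in rows:
        publish(event, json.loads(payload))  # may raise; the row stays queued
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
init_db(conn)
with conn:
    conn.execute("INSERT INTO charges (id, status) VALUES ('ch_123', 'pending')")
mark_charge_succeeded(conn, "ch_123")
```

Deleting the row only after `publish` returns is the design choice that keeps events from being lost: a failed publish leaves the row in place for the next relay run.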

***

## Common Patterns Across Companies

<CardGroup cols={2}>
  <Card title="Idempotency Everywhere" icon="arrows-repeat">
    Mutating APIs commonly accept a client-supplied idempotency key so that retries are safe.
    **Stripe**: `Idempotency-Key` header
    **AWS**: `ClientRequestToken` (`ClientToken` in some services)
    **Google**: `requestId`
  </Card>

  <Card title="Chaos Testing" icon="explosion">
    You don't know if you're resilient until you test.
    **Netflix**: Chaos Monkey, Chaos Kong
    **Amazon**: GameDay exercises
    **Google**: DiRT (Disaster Recovery Testing)
  </Card>

  <Card title="Circuit Breakers" icon="plug">
    Fail fast instead of cascading.
    **Netflix**: Hystrix (now in maintenance mode; Netflix points new projects to Resilience4j)
    **Uber**: Custom circuit breakers in every service
    All use some variant of the pattern.
  </Card>

  <Card title="Observability" icon="eye">
    You can't fix what you can't see.
    **Distributed tracing**: Zipkin/Jaeger-style
    **Metrics**: RED method (Rate, Errors, Duration)
    **Logs**: Structured, correlated by trace ID
  </Card>
</CardGroup>
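
As a rough illustration of the fail-fast behaviour named in the Circuit Breakers card, here is a minimal single-threaded sketch. It is not Hystrix, Resilience4j, or Uber's implementation: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical, and a production breaker would also need to limit concurrent probes in the half-open state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Timeout elapsed: half-open, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Injecting `clock` keeps the breaker testable without real waiting: a test can advance a fake clock past `reset_timeout` to exercise the half-open probe.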

***

## Key Takeaways

1. **Availability often trumps consistency** — Amazon's shopping cart chose availability. Know when this trade-off is acceptable.

2. **Test failure modes in production** — Netflix's Chaos Engineering isn't optional, it's essential.

3. **Build for horizontal scale from day one** — Re-architecting a monolith is painful. Design for distribution early.

4. **Invest in your primitives** — Google built TrueTime. Uber built Ringpop. Your foundations matter.

5. **Every outage is a learning opportunity** — The companies with the best uptime have the best postmortems.