Real-World Case Studies
Theory is essential, but seeing how distributed systems work (and fail) in production at massive scale is where deep understanding comes from. This module covers battle-tested architectures from the world’s leading technology companies.
Track Duration: 12-16 hours
Companies Covered: Google, Amazon, Netflix, Uber, Meta, Stripe
Focus: Architecture decisions, failure stories, lessons learned
Google Spanner
The database that made the impossible possible: globally consistent transactions.
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ GOOGLE SPANNER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ Spanner API │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Zone A │ │ Zone B │ │ Zone C │ │
│ │ (Oregon) │ │ (Iowa) │ │ (Virginia) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Spanserver │ │ Spanserver │ │ Spanserver │ │
│ │ (Paxos │◄──────────►│ (Paxos │◄──────────►│ (Paxos │ │
│ │ Leader) │ │ Replica) │ │ Replica) │ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ COLOSSUS │ │
│ │ (Distributed File System) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRUETIME │ │
│ │ GPS + Atomic Clocks in every datacenter │ │
│ │ API: TT.now() → [earliest, latest] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Innovations
TrueTime: Making Time Trustworthy
The Insight: You can’t synchronize clocks perfectly, but you can bound the uncertainty.
TRADITIONAL APPROACH:
"The time is 10:00:00.000"
(But actually, who knows? Could be off by milliseconds)
TRUETIME APPROACH:
"The time is between 10:00:00.000 and 10:00:00.007"
(Guaranteed! We measured the uncertainty)
HARDWARE:
• GPS receivers on datacenter roofs
• Atomic clocks (rubidium) as backup
• Multiple time masters per datacenter
• Typical uncertainty: 1-7ms
Usage in Transactions:
1. Transaction T1 gets commit timestamp = TT.now().latest
2. T1 waits until TT.after(timestamp) is true
3. Now GUARANTEED: no other transaction can get timestamp ≤ T1's
4. External consistency achieved without distributed locks!
Paxos Groups for Replication
Per-Tablet Paxos:
- Each tablet (partition) has its own Paxos group
- 3-5 replicas across zones
- Writes go through Paxos leader
- Reads can go to any replica (with proper timestamp)
- Tablets automatically split when too large
- Tablets merge when too small
- Paxos ensures consistent split/merge
External Consistency
The Guarantee: If transaction T1 commits before T2 starts, T1’s timestamp < T2’s timestamp.
Cost: Commit wait adds latency (~7ms average).
Why It Matters:
SCENARIO:
User in US writes a record
User in Europe reads immediately after
WITHOUT EXTERNAL CONSISTENCY:
European read might see old data (before US write)
WITH EXTERNAL CONSISTENCY:
European read guaranteed to see US write
(because read timestamp > write timestamp)
Spanner Failure Story: The Leap Second
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE 2012 LEAP SECOND INCIDENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATE: June 30, 2012 │
│ │
│ WHAT HAPPENED: │
│ • Leap second added at midnight UTC │
│ • Linux kernel bug caused livelock in clock_gettime() │
│ • Systems consuming 100% CPU doing nothing │
│ • Widespread outages: Reddit, LinkedIn, Mozilla, Gawker │
│ │
│ HOW GOOGLE HANDLED IT: │
│ • Google had already implemented "leap smear" │
│ • Instead of adding 1 second at midnight: │
│ - Slow down time by 11.6 μs per second │
│ - Spread over 24 hours before midnight │
│ • All Google services remained stable │
│ │
│ LESSON: │
│ • Time is a critical distributed systems primitive │
│ • You must control how time changes propagate │
│ • Google's investment in TrueTime paid dividends │
│ │
│ RESULT: │
│ • Leap smear became industry standard │
│ • AWS, Azure, and others now use similar approaches │
│ • International discussions to eliminate leap seconds entirely │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
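As a rough check on the smear arithmetic above (illustrative numbers, not Google's exact implementation):

SMEAR_WINDOW = 24 * 3600   # spread the adjustment over 24 hours (in seconds)
LEAP = 1.0                 # one extra second to absorb

# Each second, smeared clocks run slow by LEAP / SMEAR_WINDOW seconds.
per_second_slowdown_us = LEAP / SMEAR_WINDOW * 1e6
print(per_second_slowdown_us)        # ~11.57 microseconds per second

def smeared_offset(seconds_into_window: float) -> float:
    """How much of the leap second has been absorbed so far."""
    return LEAP * seconds_into_window / SMEAR_WINDOW

print(smeared_offset(12 * 3600))     # 0.5 s absorbed at the midpoint of the window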
Amazon DynamoDB
The database that powers amazon.com checkout—when availability is everything.
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ DYNAMODB ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ REQUEST ROUTERS │ │
│ │ (Stateless, route requests to partitions) │ │
│ └────────────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Partition 1 │ │ Partition 2 │ │ Partition N │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ Leader │ │ │ │ Leader │ │ │ │ Leader │ │ │
│ │ └────┬────┘ │ │ └────┬────┘ │ │ └────┬────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────┴────┐ │ │ ┌────┴────┐ │ │ ┌────┴────┐ │ │
│ │ │Replicas │ │ │ │Replicas │ │ │ │Replicas │ │ │
│ │ │(2 more) │ │ │ │(2 more) │ │ │ │(2 more) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ GLOBAL TABLES (Multi-Region): │
│ │
│ Region: US-EAST-1 Region: EU-WEST-1 │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Table A │ ◄─────► │ Table A │ │
│ │ (Replica) │ Sync │ (Replica) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ Last-Writer-Wins conflict resolution │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions
Always Available: The Shopping Cart Story
The Origin Story (from the 2007 Dynamo paper) and its implementation:
SCENARIO:
Customer adding items to cart during peak shopping
Database partition occurs
OPTION A: Reject writes, show error
→ Customer leaves, buys from competitor
→ Lost revenue: $$$
OPTION B: Accept writes, reconcile later
→ Worst case: Duplicate items in cart
→ Customer removes duplicates at checkout
→ Revenue preserved
AMAZON CHOSE: Option B (Availability over Consistency)
- Leaderless replication
- Write to W of N replicas (W < N means some can be down)
- Read from R replicas, resolve conflicts
- Shopping cart uses “union” merge (keep all items)
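A minimal sketch of the leaderless write/read path and the union merge described above, with assumed N=3, W=2, R=2 and plain Python objects standing in for replicas:

from dataclasses import dataclass, field

N, W, R = 3, 2, 2          # assumed replica count, write quorum, read quorum

@dataclass
class Replica:
    carts: dict = field(default_factory=dict)    # cart_id -> set of item names

def write_cart(reachable_replicas, cart_id, items):
    # Leaderless write: succeeds as soon as W replicas acknowledge.
    acks = 0
    for replica in reachable_replicas:
        replica.carts.setdefault(cart_id, set()).update(items)
        acks += 1
    return acks >= W

def read_cart(reachable_replicas, cart_id):
    # Read R replicas and reconcile divergent versions with a union merge:
    # a re-added item is an acceptable cost, a silently dropped item is not.
    versions = [r.carts.get(cart_id, set()) for r in reachable_replicas[:R]]
    return set().union(*versions)

replicas = [Replica() for _ in range(N)]
write_cart(replicas[:2], "cart-1", {"book"})         # partition: only 2 of 3 reachable
write_cart(replicas[1:], "cart-1", {"headphones"})   # later write hits a different pair
print(read_cart(replicas, "cart-1"))                 # {'book', 'headphones'}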
Consistent Hashing with Virtual Nodes
Problem: Nodes joining or leaving causes massive data reshuffling.
Solution: Virtual nodes.
WITHOUT VIRTUAL NODES:
Hash ring: [──A──|──B──|──C──|──D──]
Node B fails: [──A──|────C────|──D──]
Node C now has 2x data (overloaded!)
WITH VIRTUAL NODES:
Each physical node = 150+ virtual nodes
Hash ring: [A1|B2|C1|A2|D1|B1|C2|D2|...]
Node B fails: B's virtual nodes spread across A, C, D
Load stays balanced
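A toy hash ring with virtual nodes, showing why a node failure scatters its keys across the remaining nodes; MD5 and 150 vnodes are assumptions for illustration, not DynamoDB's actual parameters.

import bisect
import hashlib

VNODES = 150  # virtual nodes per physical node

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Place VNODES tokens per physical node around the ring.
        self.tokens = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(VNODES))

    def owner(self, key: str) -> str:
        # Walk clockwise to the first token at or after the key's hash.
        idx = bisect.bisect(self.tokens, (_hash(key),)) % len(self.tokens)
        return self.tokens[idx][1]

    def remove(self, node: str):
        # Removing a node drops only its tokens; its keys spread across all remaining nodes.
        self.tokens = [(t, n) for t, n in self.tokens if n != node]

ring = Ring(["A", "B", "C", "D"])
before = ring.owner("user#42")
ring.remove("B")
after = ring.owner("user#42")   # unchanged unless "B" happened to own this key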
Adaptive Capacity
Original Problem:
- Provisioned throughput (e.g., 1000 WCU)
- Uniform distribution assumed
- Hot partition = throttling
BEFORE (2017):
Table: 1000 WCU
Partition 1: 250 WCU limit (1000/4)
Hot key in Partition 1: Throttled!
AFTER (Adaptive Capacity):
Table: 1000 WCU
Partition 1: Can use up to 1000 WCU if others idle
Hot key: No throttling as long as total < 1000
PLUS: On-demand capacity (2018)
No provisioning, pay per request
DynamoDB Failure Story: The 2015 US-EAST-1 Outage
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE 2015 DYNAMODB OUTAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATE: September 20, 2015 │
│ DURATION: ~5 hours │
│ IMPACT: Many high-profile sites down (Netflix, IMDB, Airbnb...) │
│ │
│ ROOT CAUSE: │
│ • Metadata service became overloaded │
│ • Metadata = "where is partition X?" │
│ • Clients retrying → exponential load increase │
│ • Vicious cycle of retries │
│ │
│ THE CASCADE: │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Client │────►│ Request │────►│ Metadata │ │
│ │ Retry │ │ Router │ │ Service │ │
│ │ (10x) │ │ (slow) │ │ (dying) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ ▲ │ │ │
│ │ │ │ │
│ └────────── Timeout ─┴─── Overloaded ──┘ │
│ │
│ LESSONS LEARNED: │
│ 1. Metadata services are critical path │
│ 2. Retry storms can kill systems │
│ 3. Need circuit breakers and backoff │
│ 4. Client-side caching of metadata helps │
│ │
│ CHANGES MADE: │
│ • More aggressive metadata caching │
│ • Better retry backoff in SDKs │
│ • Metadata service scaling improvements │
│ • Cross-region tables for DR │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Netflix: The Chaos Engineering Pioneers
Netflix serves 230+ million subscribers with 99.99% availability. How?
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ NETFLIX CLOUD ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ CDN (Open Connect) │ │
│ │ 15,000+ servers in ISPs worldwide │ │
│ │ 95% of traffic served from CDN │ │
│ └────────────────────────────────┬─────────────────────────────────────┘ │
│ │ (5% control plane traffic) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ AWS (Control Plane) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ ZUUL │ │ │
│ │ │ (API Gateway / Edge Service) │ │ │
│ │ └───────────────────────────┬─────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────────────────────┼───────────────────────────────────┐ │ │
│ │ │ MICROSERVICES (1000+) │ │ │
│ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │
│ │ │ │Ratings │ │Viewing │ │Profile │ │Catalog │ │Billing │ ... │ │ │
│ │ │ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ DATA LAYER │ │ │
│ │ │ EVCache (Memcached) Cassandra Elasticsearch │ │ │
│ │ │ 30M+ req/sec Petabytes Search & Analytics │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Patterns
EVCache: Caching at Scale
Scale:
- 30+ million requests/second
- 1+ trillion operations/day
- Petabytes of cached data
EVCache = Enhanced Memcached
KEY FEATURES:
1. Replication across zones (survive AZ failure)
2. Local zone preference (lower latency)
3. Fast fallback to other zones
4. Shadow clusters for testing
TOPOLOGY:
┌─────────────────────────────────────────────┐
│ EVCache Cluster │
│ │
│ Zone A Zone B Zone C │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │MC-1 │ ◄─────► │MC-1 │ ◄─────► │MC-1 │ │
│ │MC-2 │ │MC-2 │ │MC-2 │ │
│ │MC-3 │ │MC-3 │ │MC-3 │ │
│ └─────┘ └─────┘ └─────┘ │
│ │
│ Writes: All zones (synchronous) │
│ Reads: Local zone first, fallback to others│
└─────────────────────────────────────────────┘
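A rough sketch of that read/write topology, with plain dicts standing in for per-zone memcached clients; this illustrates the policy described above, not the real EVCache client API.

class EVCacheClientSketch:
    def __init__(self, local_zone: str, zones: dict):
        self.local_zone = local_zone
        self.zones = zones                  # zone name -> dict (stand-in for a memcached client)

    def set(self, key, value):
        # Writes fan out to every zone so any zone can serve reads after an AZ failure.
        for cache in self.zones.values():
            cache[key] = value

    def get(self, key):
        # Prefer the local zone for latency; fall back to remote zones on a miss.
        value = self.zones[self.local_zone].get(key)
        if value is not None:
            return value
        for zone, cache in self.zones.items():
            if zone != self.local_zone and key in cache:
                return cache[key]
        return None

zones = {"zone-a": {}, "zone-b": {}, "zone-c": {}}
client = EVCacheClientSketch("zone-a", zones)
client.set("profile:42", {"name": "Ada"})
zones["zone-a"].clear()                     # simulate losing the local zone's data
print(client.get("profile:42"))             # still served from another zone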
Zuul: Edge Gateway
Responsibilities:
- Authentication
- Dynamic routing
- Load shedding
- Request throttling
- Attack detection
Later async rewrite (Zuul 2):
- Moved from thread-per-request to event loop
- 90% reduction in connection memory
- Better tail latency under load
Chaos Engineering: The Simian Army
Chaos Monkey (2011): Randomly kills instances in production.
THE SIMIAN ARMY:
Chaos Monkey → Kill instances
Latency Monkey → Inject artificial delays
Chaos Kong → Fail entire AWS region
Doctor Monkey → Health checks and remediation
Janitor Monkey → Clean up unused resources
Conformity Monkey → Enforce best practices
Philosophy:
"The best way to avoid failure is to fail constantly"
If you can't handle instance failures at 3pm on Tuesday,
you definitely can't handle them at 3am on Black Friday.
RESULT:
- Engineers build resilient systems by default
- Failures become routine, not emergencies
- Recovery is automated, not heroic
Netflix Failure Story: The 2012 Christmas Eve Outage
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE 2012 CHRISTMAS EVE OUTAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATE: December 24, 2012 │
│ DURATION: ~8 hours │
│ CAUSE: AWS ELB (Elastic Load Balancer) failure │
│ │
│ WHAT HAPPENED: │
│ • AWS made changes to ELB configuration │
│ • Bug triggered widespread ELB failures in US-EAST-1 │
│ • Netflix (and many others) went down │
│ │
│ THE IRONY: │
│ • Netflix had Chaos Monkey │
│ • Netflix could survive instance failures │
│ • Netflix could survive AZ failures │
│ • But ELB was a single point of failure they didn't test! │
│ │
│ AFTERMATH: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Built their own load balancer: Zuul │ │
│ │ (Removed dependency on AWS ELB) │ │
│ │ │ │
│ │ 2. Created "Chaos Kong" │ │
│ │ (Simulate entire region failures) │ │
│ │ │ │
│ │ 3. Multi-region active-active │ │
│ │ (Traffic can shift between regions in seconds) │ │
│ │ │ │
│ │ 4. Invested in "Failure Injection Testing" (FIT) │ │
│ │ (Structured chaos experiments) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ LESSON: │
│ "If you haven't tested a failure mode, you're not resilient to it" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Uber: Real-Time at Scale
Uber processes millions of trips daily with sub-second dispatch decisions.
Architecture Evolution
┌─────────────────────────────────────────────────────────────────────────────┐
│ UBER ARCHITECTURE EVOLUTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 2010-2014: MONOLITH │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Python Monolith │ │
│ │ (Everything in one codebase, one database) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ 2014-2017: MICROSERVICES EXPLOSION │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ... (1000+ services) │
│ │Trip │ │Map │ │Price │ │Match │ │Pay │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
│ 2017+: DOMAIN-ORIENTED SERVICES (DOMA) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ RIDES DOMAIN │ │ EATS DOMAIN │ │ FREIGHT DOMAIN │ │ │
│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ │
│ │ │ │Trip │ │ │ │Order │ │ │ │Shipment │ │ │ │
│ │ │ │Matching │ │ │ │Restaurant │ │ │ │Carrier │ │ │ │
│ │ │ │Pricing │ │ │ │Delivery │ │ │ │Route │ │ │ │
│ │ │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ │ │ │
│ │ SHARED PLATFORM │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Maps │ Payments │ Identity │ Messaging │ Dispatch │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Systems
Ringpop: Consistent Hashing for Dispatch
Problem: Matching riders to drivers needs consistent, fast routing.
Solution: Ringpop (SWIM gossip + consistent hashing).
HOW IT WORKS:
1. SWIM Protocol for cluster membership
- Nodes gossip about each other's health
- Detect failures in seconds
- No single point of failure
2. Consistent Hashing for request routing
- Hash(rider_location) → specific node
- That node has all nearby drivers in memory
- Sub-millisecond matching decisions
TOPOLOGY:
City: San Francisco
┌────────────────────────────────────────────────────┐
│ HASH RING │
│ │
│ Node A ─────────────────── Node B │
│ (SOMA, ← gossip → (Financial, │
│ Mission) Marina) │
│ │ │ │
│ └──────── Node C ──────────┘ │
│ (Castro, │
│ Sunset) │
│ │
│ Request for rider at (lat, lng): │
│ → Hash to Node A │
│ → Node A has all drivers in that area │
│ → Match in <10ms │
└────────────────────────────────────────────────────┘
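A toy version of the ownership-and-forwarding idea: hash a crude geo cell onto a ring of nodes and forward requests that land on the wrong node. SWIM membership and real geo indexing are omitted, and all names are illustrative rather than Ringpop's actual API.

import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class DispatchRing:
    """Toy ownership ring: each node owns a slice of rider locations."""
    def __init__(self, nodes):
        self.ring = sorted((_h(n), n) for n in nodes)

    def owner(self, lat: float, lng: float) -> str:
        cell = f"{lat:.2f}:{lng:.2f}"          # crude stand-in for a geo index cell
        idx = bisect.bisect(self.ring, (_h(cell),)) % len(self.ring)
        return self.ring[idx][1]

    def handle(self, local_node: str, lat: float, lng: float) -> str:
        target = self.owner(lat, lng)
        if target == local_node:
            return f"{local_node}: match against in-memory drivers for this area"
        return f"{local_node}: forward request to {target}"  # the owner does the matching

ring = DispatchRing(["node-a", "node-b", "node-c"])
print(ring.handle("node-a", 37.7749, -122.4194))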
Schemaless: MySQL at Scale
Challenge:
- Need horizontal scaling (MySQL doesn’t shard easily)
- Need flexible schema (trip data evolves fast)
- Need low latency (real-time dispatch)
ARCHITECTURE:
┌─────────────────────────────────────────────────┐
│ Schemaless Layer │
│ (Application-level sharding + abstraction) │
└────────────────────┬────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ MySQL │ │ MySQL │ │ MySQL │
│ Shard 1 │ │ Shard 2 │ │ Shard N │
└─────────┘ └─────────┘ └─────────┘
DATA MODEL:
┌─────────────────────────────────────────────────┐
│ Row Key (UUID) │ Column │ Body (JSON) │
├──────────────────┼──────────┼──────────────────┤
│ trip-123-abc │ base │ {"rider": ...} │
│ trip-123-abc │ driver │ {"driver": ...} │
│ trip-123-abc │ route │ {"waypoints":..} │
└─────────────────────────────────────────────────┘
BENEFITS:
• Easy horizontal scaling (shard by row key)
• Schema evolution (just add columns)
• Point-in-time recovery (versioned cells)
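A small sketch of the cell model above (row key + column + append-only JSON versions), with in-memory dicts standing in for MySQL shards; function and table names are illustrative.

import hashlib
import json

N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]      # stand-ins for MySQL shard instances

def _shard(row_key: str) -> dict:
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    return shards[h % N_SHARDS]                 # shard chosen purely by row key

def put_cell(row_key: str, column: str, body: dict):
    # Cells are append-only: a write adds a new version instead of mutating in place,
    # which is what makes point-in-time recovery possible.
    versions = _shard(row_key).setdefault((row_key, column), [])
    versions.append(json.dumps(body))

def get_cell(row_key: str, column: str, version: int = -1):
    versions = _shard(row_key).get((row_key, column), [])
    return json.loads(versions[version]) if versions else None

put_cell("trip-123-abc", "base", {"rider": "r-9"})
put_cell("trip-123-abc", "base", {"rider": "r-9", "status": "en_route"})
print(get_cell("trip-123-abc", "base"))          # latest version of the cell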
Cadence: Workflow Orchestration
Uber’s solution for long-running, fault-tolerant workflows.
Open Source: Temporal, a fork of Cadence, is now widely adopted.
EXAMPLE: Trip Lifecycle
@workflow
def trip_workflow(rider_id, pickup, dropoff):
    # Each step is fault-tolerant and can be retried

    # Step 1: Find driver (may take minutes)
    driver = yield find_driver(pickup)

    # Step 2: Wait for pickup
    yield wait_for_event("driver_arrived", timeout="30m")

    # Step 3: Wait for trip completion
    trip = yield wait_for_event("trip_completed", timeout="8h")

    # Step 4: Process payment
    yield process_payment(rider_id, driver.id, trip.amount)

    # Step 5: Request ratings
    yield send_rating_request(rider_id, driver.id)
GUARANTEES:
• Each step runs exactly once
• Workflow survives server failures
• State persisted at each step
• Full visibility into running workflows
Uber Failure Story: The 2019 Mapping Outage
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE 2019 MAPPING OUTAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DATE: September 2019 │
│ IMPACT: App crashes, dispatch failures, surge pricing errors │
│ │
│ ROOT CAUSE: │
│ • Map tile service returned malformed data │
│ • Client apps crashed when parsing │
│ • Retry storms overloaded map service │
│ • Cascade to other services using maps │
│ │
│ THE CASCADE: │
│ │
│ Malformed → Client → Retry → Service → Dispatch │
│ Response Crash Storm Overload Failure │
│ │
│ LESSONS: │
│ ───────── │
│ 1. Client-side validation is critical │
│ - Don't trust backend responses blindly │
│ - Graceful degradation when data is bad │
│ │
│ 2. Circuit breakers needed everywhere │
│ - Client-side circuit breakers │
│ - Service-side admission control │
│ │
│ 3. Retry behavior must be controlled │
│ - Exponential backoff │
│ - Jitter to spread load │
│ - Client-side retry budgets │
│ │
│ 4. Map data is critical path │
│ - Cache aggressively │
│ - Multiple fallback sources │
│ - Stale data better than no data │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
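Lesson 3 above is usually implemented as exponential backoff with jitter and a bounded attempt budget. A generic sketch (not Uber's code; names are illustrative):

import random
import time

def call_with_retries(do_request, max_attempts=5, base=0.1, cap=5.0):
    """Retry with exponential backoff, full jitter, and a bounded attempt budget."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # budget exhausted, surface the error
            backoff = min(cap, base * (2 ** attempt))   # exponential backoff, capped
            time.sleep(random.uniform(0, backoff))      # full jitter spreads retries out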
Stripe: Financial Transactions at Scale
When money is involved, correctness is everything.
Architecture Principles
┌─────────────────────────────────────────────────────────────────────────────┐
│ STRIPE ENGINEERING PRINCIPLES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. EXACTLY-ONCE DELIVERY │
│ ──────────────────────── │
│ Payment must happen exactly once, never more, never less │
│ │
│ Implementation: │
│ • Idempotency keys on all mutating operations │
│ • Request deduplication in API layer │
│ • Transactional outbox pattern for events │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ 2. STRONG CONSISTENCY FOR MONEY │
│ ─────────────────────────────── │
│ Unlike social media, eventual consistency is not acceptable │
│ │
│ Approach: │
│ • Single-leader PostgreSQL for financial data │
│ • Synchronous replication to standby │
│ • Serializable isolation for critical paths │
│ │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ 3. IDEMPOTENCY BY DEFAULT │
│ ────────────────────────── │
│ Every API accepts Idempotency-Key header │
│ │
│ curl https://api.stripe.com/v1/charges \ │
│ -H "Idempotency-Key: order-12345" \ │
│ -d amount=2000 \ │
│ -d currency=usd │
│ │
│ Same key = same result (even if called 100 times) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Patterns
Idempotency Keys
TTL: Idempotency keys typically expire after 24 hours.
How it works:
# Stripe's approach (simplified)
async def create_charge(idempotency_key: str, amount: int, ...):
    # 1. Check if we've seen this key before
    existing = await idempotency_store.get(idempotency_key)
    if existing:
        if existing.status == "completed":
            # Return cached response
            return existing.response
        elif existing.status == "in_progress":
            # Someone else is processing
            raise ConflictError("Request in progress")

    # 2. Mark as in-progress
    await idempotency_store.set(
        idempotency_key,
        {"status": "in_progress", "started_at": now()},
    )

    try:
        # 3. Do the actual work
        result = await actually_create_charge(amount, ...)

        # 4. Store the result
        await idempotency_store.set(
            idempotency_key,
            {"status": "completed", "response": result},
        )
        return result
    except Exception as e:
        # 5. On failure, allow retry
        await idempotency_store.delete(idempotency_key)
        raise
Transactional Outbox Pattern
Problem: Need to update database AND send event atomically
WRONG APPROACH:
1. Update database (charge.status = "succeeded")
2. Send Kafka event
What if step 2 fails? Database updated, event lost!
TRANSACTIONAL OUTBOX:
1. In a single transaction:
- Update charge.status = "succeeded"
- Insert into outbox table: {event: "charge.succeeded", ...}
2. Separate process reads outbox, sends to Kafka
3. After Kafka confirms, delete from outbox
┌─────────────────────────────────────────────────────┐
│ DATABASE │
│ ┌───────────────┐ ┌───────────────────────────┐ │
│ │ charges │ │ outbox │ │
│ ├───────────────┤ ├───────────────────────────┤ │
│ │ id: ch_123 │ │ id: 1 │ │
│ │ status: succ │ │ event: charge.succeeded │ │
│ │ ... │ │ payload: {...} │ │
│ └───────────────┘ └───────────────────────────┘ │
│ │ │
│ │ Background process │
│ ▼ │
│ ┌─────────────┐ │
│ │ Kafka │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────┘
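A minimal sketch of the pattern, with SQLite standing in for the production database and a stubbed broker producer; table and function names are illustrative.

import json
import sqlite3

# SQLite stands in for the production database; the schema is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE charges (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox  (id INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT, payload TEXT);
""")

def mark_charge_succeeded(charge_id: str):
    # The state change and the outbox row commit atomically in one transaction.
    with db:
        db.execute("INSERT OR REPLACE INTO charges VALUES (?, 'succeeded')", (charge_id,))
        db.execute("INSERT INTO outbox (event, payload) VALUES (?, ?)",
                   ("charge.succeeded", json.dumps({"charge_id": charge_id})))

def publish_outbox(send_to_broker):
    # A background relay publishes rows and deletes them only after the broker confirms.
    rows = db.execute("SELECT id, event, payload FROM outbox ORDER BY id").fetchall()
    for row_id, event, payload in rows:
        send_to_broker(event, payload)             # e.g. produce to Kafka, wait for the ack
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))

mark_charge_succeeded("ch_123")
publish_outbox(lambda event, payload: print("publish:", event, payload))

Because the relay deletes only after the broker acknowledges, events are delivered at least once; consumers still need to be idempotent.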
Request Hedging
Problem: Tail latency (p99) is often much worse than the median.
Solution: Send the request to multiple replicas and use the first response.
HEDGING STRATEGY:
Time 0ms: Send to replica A
Time 5ms: If no response, also send to replica B
Time 10ms: If no response, also send to replica C
Use FIRST response, cancel others
BENEFITS:
• p99 latency dramatically reduced
• One slow replica doesn't hurt overall latency
CAUTION:
• Only for idempotent reads
• Increases load on backend (but usually worth it)
• Need cancellation to avoid wasted work
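A sketch of hedged reads with asyncio; the 5 ms hedge delay and the replica functions are assumptions for illustration, not any particular service's values.

import asyncio

async def hedged_get(replicas, key, hedge_delay=0.005):
    """Send to one replica; if it is slow, hedge to the next; return the first result."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(replica(key)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # All hedges are in flight; take whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()        # cancel stragglers so hedging doesn't waste backend work

async def replica_a(key):        # hypothetical slow replica
    await asyncio.sleep(0.050)
    return f"A:{key}"

async def replica_b(key):        # hypothetical fast replica
    await asyncio.sleep(0.001)
    return f"B:{key}"

print(asyncio.run(hedged_get([replica_a, replica_b], "user:42")))   # B answers first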
Common Patterns Across Companies
Idempotency Everywhere
All companies: Every mutating operation accepts an idempotency key.
Stripe: Idempotency-Key header
AWS: ClientRequestToken
Google: requestId
Chaos Testing
You don’t know if you’re resilient until you test.
Netflix: Chaos Monkey, Chaos Kong
Amazon: GameDay exercises
Google: DiRT (Disaster Recovery Testing)
Circuit Breakers
Fail fast instead of cascading.
Netflix: Hystrix (now in maintenance mode; Resilience4j is its recommended successor)
Uber: Custom circuit breakers in every service
All use some variant of the pattern.
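A generic closed/open/half-open breaker sketch of that variant; thresholds are illustrative, and this is not the API of Hystrix or any specific library.

import time

class CircuitBreaker:
    """Small closed/open/half-open circuit breaker."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")   # shed load immediately
            self.opened_at = None                                  # half-open: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()                       # trip the breaker
            raise
        self.failures = 0                                          # success closes it again
        return result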
Observability
You can’t fix what you can’t see.
Distributed tracing: Zipkin/Jaeger-style
Metrics: RED method (Rate, Errors, Duration)
Logs: Structured, correlated by trace ID
Key Takeaways
- Availability often trumps consistency — Amazon’s shopping cart chose availability. Know when this trade-off is acceptable.
- Test failure modes in production — Netflix’s Chaos Engineering isn’t optional, it’s essential.
- Build for horizontal scale from day one — Re-architecting a monolith is painful. Design for distribution early.
- Invest in your primitives — Google built TrueTime. Uber built Ringpop. Your foundations matter.
- Every outage is a learning opportunity — The companies with the best uptime have the best postmortems.