Distributed Systems Mastery

A comprehensive, interview-focused curriculum designed for engineers targeting Staff/Principal roles at top tech companies (Google, Meta, Amazon, Netflix, Stripe, etc.). This course covers everything from fundamentals to cutting-edge distributed systems concepts. Distributed systems sit at the intersection of computer science theory (impossibility results, consensus proofs) and messy real-world practice (network jitter, clock skew, disk failures at 3 AM). This curriculum bridges both worlds — giving you the theoretical foundations to reason about correctness and the production battle scars to design systems that actually work under fire.
Course Duration: 18-24 weeks (self-paced)
Target Outcome: Staff+ Engineer at FAANG / Top-tier distributed systems expertise
Prerequisites: Strong programming, basic networking, database fundamentals
Language: Concepts with implementations in Go/Java/Python
New Content: 5 additional tracks with 45+ modules, real-world case studies, and Staff+ interview problems

Why This Course?

FAANG Interview Ready

Covers exact topics asked at Google, Amazon, Meta, and other top companies

Deep Theoretical Foundation

Understand Raft, Paxos, ZAB, and other consensus protocols inside-out

Production Battle-Tested

Patterns from systems handling millions of QPS at scale

Hands-On Projects

Build your own distributed KV store, consensus implementation, and more
Interview Reality: At Staff+ level, you’re expected to design systems that handle billions of requests, survive data center failures, and maintain consistency guarantees. This course prepares you for exactly that.

What Companies Ask

Company | Common Topics
Google | Consensus protocols, Spanner, Bigtable internals, Paxos, distributed transactions
Amazon | DynamoDB internals, eventual consistency, vector clocks, Dynamo paper
Meta | TAO, ZippyDB, consensus at scale, social graph distribution
Netflix | EVCache, Cassandra, chaos engineering, resilience patterns
Stripe | Distributed transactions, exactly-once delivery, idempotency
Uber | Ringpop, consistent hashing, real-time systems, Cadence workflows

Course Structure

The curriculum is organized into 10 tracks progressing from fundamentals to Staff+ expertise:
┌─────────────────────────────────────────────────────────────────────────────┐
│                   DISTRIBUTED SYSTEMS MASTERY                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TRACK 1: CORE FOUNDATIONS        TRACK 2: CONSENSUS & COORDINATION         │
│  ─────────────────────────        ──────────────────────────────            │
│  □ Why Distributed?               □ The Consensus Problem                   │
│  □ Network Fundamentals           □ Paxos (Single & Multi)                  │
│  □ Time & Ordering                □ Raft (In-Depth)                         │
│  □ Consistency Models (NEW)       □ Byzantine Fault Tolerance (NEW)         │
│  □ Distributed Snapshots (NEW)    □ Gossip Protocols (NEW)                  │
│  □ CAP/PACELC Theorems            □ Formal Verification (TLA+) (NEW)        │
│                                   □ Leader Election & Locks                 │
│                                                                              │
│  TRACK 3: REPLICATION             TRACK 4: TRANSACTIONS                     │
│  ─────────────────────            ──────────────────────                    │
│  □ Single-Leader                  □ ACID in Distributed World               │
│  □ Multi-Leader                   □ 2PC and 3PC                             │
│  □ Leaderless                     □ Saga Pattern                            │
│  □ Conflict Resolution            □ TCC Pattern                             │
│  □ CRDTs                          □ Distributed Locking                     │
│                                                                              │
│  TRACK 5: DATA SYSTEMS            TRACK 6: MESSAGING & EVENTS               │
│  ─────────────────────            ──────────────────────────                │
│  □ Partitioning Strategies (NEW)  □ Kafka Deep Dive (NEW)                   │
│  □ Consistent Hashing             □ Event Sourcing & CQRS                   │
│  □ Distributed Databases          □ Message Queue Patterns                  │
│  □ Distributed Storage            □ Exactly-Once Semantics                  │
│  □ Stream Processing              □ Dead Letter Queues                      │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│  ADVANCED PRODUCTION TRACKS                                                  │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  TRACK 7: CLOCK SYNCHRONIZATION   TRACK 8: FAULT TOLERANCE                  │
│  ───────────────────────────────  ────────────────────────                  │
│  □ TrueTime Deep Dive             □ Circuit Breaker Patterns                │
│  □ GPS & Atomic Clocks            □ Bulkhead Isolation                      │
│  □ Hybrid Logical Clocks          □ Retry Strategies                        │
│  □ Clock Synchronization          □ Timeout Patterns                        │
│  □ Spanner's Architecture         □ Graceful Degradation                    │
│                                                                              │
│  TRACK 9: DISTRIBUTED CACHING     TRACK 10: PRODUCTION & PRACTICE           │
│  ─────────────────────────────    ─────────────────────────────             │
│  □ Cache Strategies               □ Observability at Scale                  │
│  □ Cache Invalidation             □ Chaos Engineering                       │
│  □ Redis/Memcached Patterns       □ Real-World Case Studies                 │
│  □ CDN Caching                    □ Interview Practice Problems             │
│  □ Cache Stampede Prevention      □ Staff+ Level Problem Sets               │
│                                                                              │
│  CAPSTONE PROJECTS (Module 51)                                              │
│  ────────────────────────────                                              │
│  □ Build Distributed KV Store                                               │
│  □ Implement Raft Consensus                                                 │
│  □ Design Interview Practice                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Track 1: Foundations

Build the mental models that all distributed systems are built upon.
Duration: 4-6 hours
Understanding the fundamental reasons and challenges.
  • Single machine limitations (CPU, memory, disk, network)
  • Horizontal vs Vertical scaling trade-offs
  • The Eight Fallacies of Distributed Computing (deep dive)
  • Types: Compute clusters, Storage systems, Coordination systems
  • Real examples: Google’s evolution from single machines to global infrastructure
Interview Focus: Why not just use a bigger machine? Cost-benefit analysis
Duration: 6-8 hours
Networking knowledge every distributed systems engineer needs.
  • TCP guarantees and failure modes
  • Network partitions: What they are and how to detect them
  • Message passing: At-most-once, at-least-once, exactly-once
  • RPC frameworks: gRPC, Thrift, Protocol Buffers
  • Failure detection: Heartbeats, Phi accrual detector
Interview Focus: How do you detect if a node is dead vs slow?
Duration: 8-10 hours
Critical Topic: Time is the foundation of distributed systems reasoning.
  • Why physical clocks fail (NTP drift, leap seconds)
  • Logical clocks (Lamport timestamps)
  • Vector clocks (causality tracking)
  • Hybrid logical clocks (HLC)
  • TrueTime (Google’s GPS + atomic clock approach)
  • Happens-before relationship
Interview Focus: How does Google Spanner achieve external consistency?
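As a rough illustration of the logical clocks listed above, here is a minimal Lamport clock sketch. The class and method names (`LamportClock`, `tick`, `receive`) are illustrative assumptions, not part of the course materials.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch, names are assumptions)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event or a message send."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the timestamp carried on an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
sent = a.tick()              # node a stamps an outgoing message
received = b.receive(sent)   # node b moves past the sender's timestamp
```

If event x happened-before event y, the timestamp of x is smaller than that of y. The converse does not hold, which is the gap vector clocks close.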
Duration: 4-6 hours
Understanding what can go wrong is crucial for designing resilient systems.
  • Fail-stop vs Fail-recover
  • Byzantine failures
  • Network failures: Partition, delay, reordering
  • Partial failures: The hardest problem
  • Gray failures (subtle, hard-to-detect issues)
Interview Focus: Design for partial failures in a payment system
Duration: 6-8 hours
The fundamental trade-offs in distributed systems.
  • CAP Theorem: Proof and implications
  • CP vs AP: Real-world examples
  • PACELC: The more practical framework
  • Beyond CAP: Harvest and Yield
  • Consistency spectrum: Linearizable → Eventual
Interview Focus: Is your system CP or AP? What are the trade-offs?
Duration: 6-8 hours
Capturing a consistent global state in a distributed system.
  • The Global State Problem: Why we can’t just “freeze” time
  • Chandy-Lamport Algorithm: Markers and state recording
  • Consistent vs Inconsistent cuts
  • Practical uses: Distributed debugging, Checkpointing, Termination detection
Interview Focus: How do you take a backup of a distributed database without stopping writes?
Duration: 6-8 hours
Decentralized communication and membership.
  • Epidemic algorithms: Rumor mongering and anti-entropy
  • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership
  • Phi Accrual Failure Detection: Suspicion-based detectors
  • Use cases: Cassandra membership, Redis Cluster, HashiCorp Serf/Consul
Interview Focus: How does a 1000-node cluster detect a single node failure without a central master?

Track 2: Consensus Protocols

The heart of distributed systems. Master these for Staff+ interviews.
Duration: 4-6 hours
Why consensus is hard and why it matters.
  • FLP Impossibility result (and its implications)
  • Safety vs Liveness guarantees
  • Consensus use cases: Leader election, configuration, transactions
  • Relationship to State Machine Replication
Interview Focus: Can you achieve consensus in an asynchronous system?
Duration: 10-12 hours
The Original: Understanding Paxos is fundamental.
  • Basic Paxos: Prepare/Promise, Accept/Accepted
  • Why Paxos works (safety proofs)
  • Multi-Paxos optimizations
  • Paxos Made Simple (Lamport’s paper walkthrough)
  • Fast Paxos and Flexible Paxos
  • EPaxos (Leaderless variant)
Interview Focus: Walk through a Paxos round with failures
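To make the Prepare/Promise and Accept/Accepted phases concrete, the sketch below models only the acceptor side of single-decree Paxos. The message shapes (plain tuples) and the `Acceptor` class are assumptions made for illustration. A real implementation also needs durable storage, proposers, and learners.

```python
class Acceptor:
    """Acceptor role of single-decree Paxos (sketch; state is not persisted)."""

    def __init__(self) -> None:
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the last accepted proposal
        self.accepted_value = None  # value of the last accepted proposal

    def on_prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Report any previously accepted value so the proposer must reuse it.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("nack", self.promised, None)

    def on_accept(self, n: int, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("nack", self.promised)
```

The safety argument hinges on the promise reply: a proposer that sees an already accepted value in a majority of promises has to propose that value rather than its own.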
Duration: 12-14 hours
Most Asked: Raft is the go-to consensus protocol in interviews.
  • Leader election mechanism
  • Log replication and commit rules
  • Safety properties and proofs
  • Membership changes (joint consensus)
  • Log compaction and snapshots
  • Raft vs Paxos comparison
  • etcd/Consul implementation details
Hands-On: Implement Raft from scratch
Interview Focus: What happens when the leader fails during commit?
Duration: 4-6 hours
An alternative view of consensus.
  • VR protocol overview
  • View changes and recovery
  • Comparison with Raft
  • When to use VR vs Raft
Duration: 6-8 hours
How ZooKeeper maintains distributed coordination.
  • ZAB protocol phases
  • Leader activation and synchronization
  • Recovery and failover
  • ZooKeeper guarantees (FIFO client order, linearizable writes)
  • ZooKeeper use cases: Locking, configuration, leader election
Interview Focus: Design a distributed lock using ZooKeeper
Duration: 8-10 hours
Handling malicious nodes and arbitrary failures.
  • The Byzantine Generals Problem: Understanding the consensus bound (3f+1)
  • PBFT: Practical Byzantine Fault Tolerance internals
  • Modern BFT: Tendermint, HotStuff (used in Libra/Diem)
  • Proof of Work vs BFT: Performance and safety trade-offs
Interview Focus: How do you achieve consensus when some nodes might be actively lying or malicious?
Duration: 8-10 hours
Proving correctness before writing code.
  • Why testing isn’t enough: State space explosion
  • TLA+ and PlusCal basics
  • Modeling safety (invariants) vs liveness (progress)
  • The TLC Model Checker and error traces
  • Case studies: How AWS and MongoDB use TLA+
Interview Focus: How do you prove that your custom consensus protocol is correct?

Track 3: Replication Strategies

How data is copied and kept consistent across nodes.
Duration: 6-8 hours
The simplest and most common replication strategy.
  • Synchronous vs Asynchronous replication
  • Semi-synchronous replication
  • Replication lag and its problems
  • Read-your-writes, Monotonic reads, Consistent prefix
  • Failover handling and split-brain prevention
  • MySQL/PostgreSQL replication internals
Interview Focus: How do you handle replication lag in a user-facing feature?
Duration: 8-10 hours
When single-leader isn’t enough.
  • Use cases: Multi-datacenter, offline clients
  • Conflict detection and resolution
  • Last-write-wins (LWW) and its problems
  • Custom conflict resolution logic
  • CockroachDB and TiDB approach
Interview Focus: Design multi-region writes for a collaborative editor
Duration: 8-10 hours
The Dynamo-style approach used by Cassandra, Riak, etc.
  • Read/write quorums (R + W > N)
  • Sloppy quorums and hinted handoff
  • Anti-entropy: Read repair, Merkle trees
  • Dynamo paper deep dive
  • Cassandra consistency levels
Interview Focus: When would you choose leaderless over leader-based?
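The quorum condition R + W > N can be checked by brute force for small clusters. The sketch below enumerates every possible read and write quorum and confirms they always share at least one replica exactly when the inequality holds. The function name is an assumption chosen for this example.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r intersects every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )


# N=3 with R=2, W=2 guarantees a read sees the latest acknowledged write.
strict = quorums_always_overlap(3, 2, 2)
# N=3 with R=1, W=1 does not: a read may hit a replica the write skipped.
weak = quorums_always_overlap(3, 1, 1)
```

Overlap alone only guarantees that a read contacts one up-to-date replica. Picking the newest version still needs version numbers or vector clocks, and sloppy quorums weaken the guarantee further.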
Duration: 6-8 hours
When conflicts happen, how do you resolve them?
  • Application-level resolution
  • Version vectors
  • LWW strategies and pitfalls
  • Merge functions
  • Operational transformation (Google Docs)
Interview Focus: Design conflict resolution for a shopping cart
Duration: 8-10 hours
Advanced Topic: Conflict-free Replicated Data Types.
  • Operation-based vs State-based CRDTs
  • G-Counter, PN-Counter
  • G-Set, 2P-Set, OR-Set
  • LWW-Register, MV-Register
  • CRDT-based databases (Riak, Redis CRDT)
  • Performance and memory implications
Interview Focus: Design a collaborative text editor using CRDTs
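As a small example of a state-based CRDT from the list above, here is a G-Counter sketch. Each replica increments only its own slot, and merge takes the per-replica maximum, which makes merging commutative, associative, and idempotent. The class layout is an assumption for illustration.

```python
class GCounter:
    """Grow-only counter, state-based CRDT (illustrative sketch)."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Join two states by taking the element-wise maximum."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

A PN-Counter can be built from two such counters, one for increments and one for decrements, with the value being their difference.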

Track 4: Distributed Transactions

Maintaining data integrity across multiple nodes.
Duration: 6-8 hours
How ACID properties translate to distributed environments.
  • Local vs Distributed transactions
  • Isolation levels: Read uncommitted → Serializable
  • Snapshot isolation and write skew
  • Serializable Snapshot Isolation (SSI)
Interview Focus: What isolation level would you choose for a banking system?
Duration: 8-10 hours
The classic distributed transaction protocol.
  • Prepare and Commit phases
  • Coordinator failures and blocking
  • Participant failures and recovery
  • 2PC in practice: XA transactions
  • Why 2PC is often avoided (performance, availability)
Interview Focus: What happens if the coordinator crashes after prepare?
Duration: 4-6 hours
Attempting to solve 2PC’s blocking problem.
  • Pre-commit phase addition
  • Non-blocking under certain failures
  • Why 3PC isn’t commonly used
  • Network partition problems
Duration: 8-10 hours
Production Favorite: Long-running transactions without locks.
  • Choreography vs Orchestration
  • Compensating transactions
  • Semantic locks and countermeasures
  • Saga execution coordinator
  • Saga pattern in microservices
  • Temporal.io and Cadence workflows
Hands-On: Implement an order saga with compensation
Interview Focus: Design a travel booking saga with compensation logic
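The orchestration style with compensating transactions can be sketched in a few lines. The `run_saga` helper below is a hypothetical in-process orchestrator: it runs each step in order and, if one fails, runs the compensations of the already completed steps in reverse. Durable state, retries, and idempotent compensations, which a production saga needs, are deliberately left out.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True
```

Compensations are semantic undos (refund, release) rather than rollbacks, so intermediate states are visible to other transactions. That is why the module also lists semantic locks and countermeasures.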
Duration: 4-6 hours
Try-Confirm-Cancel for distributed transactions.
  • Two-phase approach at application level
  • Resource reservation
  • Timeout handling
  • When to use TCC vs Saga
Interview Focus: TCC vs 2PC vs Saga - when to use each?
Duration: 6-8 hours
Coordinating access to shared resources.
  • Single-node locks in distributed systems (Redis SETNX)
  • Redlock algorithm and its critique
  • Fencing tokens for safety
  • ZooKeeper-based locks
  • Lease-based locking
Interview Focus: Design a distributed rate limiter with locks
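Fencing tokens are easiest to understand from the storage side. In the sketch below, a hypothetical `LockService` hands out monotonically increasing tokens, and a hypothetical `FencedStore` rejects any write whose token is older than the newest one it has seen. Both classes are in-memory stand-ins invented for this example.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Stand-in lock service: every acquisition returns a larger token."""

    def __init__(self) -> None:
        self._next_token = 0

    def acquire(self) -> int:
        self._next_token += 1
        return self._next_token


class FencedStore:
    """Stand-in storage that enforces fencing on every write."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value
```

This covers the case where a client pauses (for example during a long GC), its lease expires, another client takes the lock, and the first client wakes up still believing it holds the lock. The store, not the client, makes the final call.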

Track 5: Data Systems at Scale

Partitioning, storage, and processing at massive scale.
Duration: 8-10 hours
How to split data across nodes effectively.
  • Key-range partitioning
  • Hash partitioning
  • Hybrid approaches
  • Secondary indexes: Local vs Global
  • Rebalancing strategies
  • Hot spots and skew handling
Interview Focus: How would you partition a social network’s posts?
Duration: 6-8 hours
The foundational algorithm for distributed systems.
  • Basic consistent hashing
  • Virtual nodes for load balancing
  • Bounded-load consistent hashing
  • Jump consistent hashing
  • Rendezvous hashing (HRW)
Hands-On: Implement consistent hashing with virtual nodes
Interview Focus: Design a distributed cache with consistent hashing
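A possible starting point for the hands-on exercise is sketched below: a hash ring where each physical node owns several virtual nodes. The `HashRing` class, the `node#i` virtual-node naming, and the use of SHA-256 for ring positions are all assumptions of this example rather than requirements of the technique.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _position(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        kept = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: self._owner[p] for p in kept}
        self._points = kept

    def get_node(self, key: str) -> str:
        """Return the owner of the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _position(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property worth verifying in your own implementation is that removing a node only moves the keys that node owned; every other key keeps its owner.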
Duration: 12-14 hours
How production databases work internally.
  • Spanner: TrueTime, external consistency, Paxos groups
  • CockroachDB: Raft, serializable isolation, SQL distribution
  • TiDB: Raft + Percolator, hybrid OLTP/OLAP
  • Cassandra: Gossip, consistent hashing, tunable consistency
  • DynamoDB: Multi-Paxos replication groups with a leader per partition, GSI, adaptive capacity
  • MongoDB: Raft-based replication, sharding
Interview Focus: How does Spanner achieve global consistency?
Duration: 8-10 hours
Block and object storage at scale.
  • GFS/HDFS architecture
  • Object storage (S3 architecture)
  • Erasure coding for durability
  • Ceph architecture
  • Tiered storage strategies
Interview Focus: Design a petabyte-scale storage system
Duration: 10-12 hours
Real-time data processing at scale.
  • Event sourcing and event-driven architecture
  • Kafka internals: Partitions, consumer groups, exactly-once
  • Stream processing: Flink, Kafka Streams
  • Windowing: Tumbling, Sliding, Session
  • Watermarks and late data handling
  • Exactly-once semantics in streaming
Interview Focus: Design a real-time analytics pipeline

Track 6: Production Excellence

Operating distributed systems at scale.
Duration: 6-8 hours
You can’t fix what you can’t see.
  • Distributed tracing (Jaeger, Zipkin, OpenTelemetry)
  • Metrics aggregation at scale
  • Log aggregation and analysis
  • Correlation across services
  • SLIs, SLOs, and error budgets
Interview Focus: How do you debug a latency spike across 100 services?
Duration: 6-8 hours
Netflix-style reliability through controlled chaos.
  • Chaos Monkey and the Simian Army
  • Designing chaos experiments
  • Blast radius control
  • Failure injection frameworks (Litmus, Chaos Mesh)
  • Game days and runbooks
Interview Focus: How would you test your system’s resilience?
Duration: 6-8 hours
Keeping systems running at scale.
  • Toil reduction and automation
  • On-call best practices
  • Postmortem culture (blameless)
  • Error budgets and release velocity
  • Progressive rollouts
Interview Focus: Describe your approach to a 50% latency increase
Duration: 6-8 hours
Designing systems that are inherently resistant to failure.
  • Static Stability and over-provisioning
  • Cell-based Architectures and blast radius control
  • Dependency Isolation (The Bulkhead Pattern)
  • Avoiding control-plane dependencies in recovery paths
Interview Focus: How do you design a system that survives an AZ failure without autoscaling?
Duration: 4-6 hours
When things go wrong at scale.
  • Incident response playbooks
  • Communication during outages
  • Escalation procedures
  • Root cause analysis
  • Learning from failures
Duration: 6-8 hours
Ensuring your system can handle growth.
  • Load testing strategies
  • Capacity modeling
  • Performance regression detection
  • Autoscaling strategies
  • Cost optimization at scale
Interview Focus: How do you prepare for a 10x traffic spike?

Track 7: Clock Synchronization (Advanced)

Time is the foundation of distributed systems. Master clock synchronization for Staff+ expertise.
Duration: 6-8 hours
How clocks stay synchronized across networks.
  • NTP architecture and stratum levels
  • PTP/IEEE 1588 for microsecond precision
  • Clock drift detection and correction
  • Network asymmetry compensation
  • Monitoring clock health in production
Interview Focus: How do you detect and handle clock skew in your system?
Duration: 8-10 hours
Capturing causality without physical time.
  • Lamport timestamps and happened-before relationship
  • Vector clocks for precise conflict detection
  • Comparison rules and concurrency proofs
  • Implementation in Dynamo and Riak
Interview Focus: When would you use vector clocks vs Lamport clocks?
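The comparison rules for vector clocks fit in a short function. The sketch below represents a clock as a dict from node id to counter (an assumed encoding) and classifies two clocks as equal, ordered, or concurrent. Concurrent clocks are what signal a conflict that needs resolution.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> counter; missing entries count as 0


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used when a node receives a remote clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

A Lamport timestamp would assign some order to two concurrent writes; only the vector form can report that neither happened before the other.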
Duration: 8-10 hours
Combining physical and logical time for practical systems.
  • HLC design and implementation
  • Timestamp encoding strategies
  • CockroachDB’s MVCC with HLC
  • Causality tracking with bounded skew
  • HLC vs Vector Clocks trade-offs
Hands-On: Implement HLC for a distributed database
Interview Focus: Why choose HLC over pure logical clocks?
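For the hands-on exercise, a hybrid logical clock can be sketched as a pair (l, c): l tracks the largest physical time seen so far and c breaks ties. The version below follows the commonly published update rules. The class name and the injectable `physical_clock` callable are assumptions that keep the example testable.

```python
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (l, c): physical component, logical counter


class HybridLogicalClock:
    """Hybrid logical clock sketch; timestamps compare as ordinary tuples."""

    def __init__(self, physical_clock: Callable[[], int]) -> None:
        self.physical_clock = physical_clock
        self.l = 0
        self.c = 0

    def now(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        pt = self.physical_clock()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried on an incoming message."""
        remote_l, remote_c = remote
        pt = self.physical_clock()
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l and new_l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

Unlike a vector clock, the timestamp stays a fixed size regardless of cluster size, and it stays close to wall-clock time as long as clock skew is bounded.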
Duration: 10-12 hours
Critical Topic: Google Spanner’s revolutionary approach to time.
  • GPS time transfer and accuracy bounds
  • Atomic clock drift characteristics (Rubidium vs Cesium)
  • TrueTime API: TT.now(), TT.after(), TT.before()
  • Uncertainty intervals and commit-wait protocol
  • How Spanner uses TrueTime for external consistency
Interview Focus: Walk through how Spanner commits a transaction using TrueTime
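Commit-wait is easier to reason about with a toy model. TrueTime itself is internal to Google, so the `SimulatedTrueTime` class below is purely a stand-in: it assumes true time lies within plus or minus `epsilon` of a simulated local clock. The commit picks the timestamp `TT.now().latest` and then waits until `TT.after(s)` holds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for TrueTime (assumption: fixed uncertainty epsilon)."""

    def __init__(self, epsilon: float, start: float = 0.0) -> None:
        self.epsilon = epsilon
        self.local = start

    def now(self) -> TTInterval:
        return TTInterval(self.local - self.epsilon, self.local + self.epsilon)

    def after(self, t: float) -> bool:
        """True once t has definitely passed."""
        return self.now().earliest > t

    def advance(self, dt: float) -> None:
        self.local += dt  # simulated passage of time instead of sleeping


def commit_with_wait(tt: SimulatedTrueTime, step: float = 0.001) -> Tuple[float, float]:
    """Pick a commit timestamp, then wait out the uncertainty before acking."""
    s = tt.now().latest
    waited = 0.0
    while not tt.after(s):
        tt.advance(step)
        waited += step
    return s, waited
```

In this model the wait is about twice epsilon, which is why keeping the uncertainty interval small matters so much for write latency.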

Track 8: Fault Tolerance Patterns

Building resilient systems that survive failures.
Duration: 8-10 hours
Production Essential: Prevent cascading failures.
  • State machine: Closed → Open → Half-Open
  • Failure threshold configuration
  • Timeout and retry integration
  • Hystrix and Resilience4j implementations
  • Monitoring circuit breaker health
Hands-On: Implement a circuit breaker with state transitions
Interview Focus: Design circuit breakers for a payment gateway
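One way to approach the hands-on exercise is sketched below. The breaker opens after a number of consecutive failures, rejects calls while open, and lets a single trial call through once the reset timeout has elapsed. All names, defaults, and the injectable `clock` are assumptions for this example. Libraries such as Resilience4j offer richer policies (sliding windows, failure rates).

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Closed -> Open -> Half-Open state machine (single-threaded sketch)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        self.failures = 0
        self.state = "closed"
```

A concurrent version would need a lock around the state transitions and a limit on how many trial calls run while half-open.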
Duration: 6-8 hours
Contain failures to prevent system-wide outages.
  • Thread pool isolation patterns
  • Semaphore-based bulkheads
  • Connection pool partitioning
  • Process-level isolation
  • Kubernetes resource limits as bulkheads
Interview Focus: How do you prevent one slow service from affecting others?
Duration: 6-8 hours
When and how to retry failed operations.
  • Exponential backoff with jitter
  • Retry budgets and thundering herd prevention
  • Idempotency keys for safe retries
  • Deadline propagation across services
  • Distinguishing transient vs permanent failures
Interview Focus: Design a retry strategy for a distributed task queue
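Exponential backoff with jitter can be sketched as follows. This version uses the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential bound. The `TransientError` type and the injectable `sleep` and `rng` parameters are assumptions that make the example testable without real waiting.

```python
import random
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Marker for failures that are worth retrying (assumed classification)."""


def retry_with_backoff(fn: Callable, attempts: int = 5, base: float = 0.1,
                       cap: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None):
    """Call fn, retrying transient failures with capped, jittered backoff."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted
            bound = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, bound))  # full jitter spreads out retries
```

Only retry operations that are idempotent or protected by an idempotency key. Permanent failures should propagate immediately, which is why only `TransientError` is caught here.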
Duration: 8-10 hours
Maintaining partial functionality during failures.
  • Feature flags for degradation
  • Fallback strategies and stale data serving
  • Load shedding and admission control
  • Quality-of-service tiering
  • Netflix’s fallback hierarchies
Interview Focus: How would you degrade an e-commerce site during database issues?
Duration: 4-6 hours
The most important but often misunderstood pattern.
  • Connection vs read vs write timeouts
  • Timeout cascades and deadline propagation
  • Context cancellation across service boundaries
  • Calculating appropriate timeout values
  • Timeout vs circuit breaker interaction
Interview Focus: How do you set timeouts for a microservices call chain?

Track 9: Distributed Caching

Caching patterns for high-performance distributed systems.
Duration: 8-10 hours
Choosing the right caching pattern for your use case.
  • Cache-aside (lazy loading) pattern
  • Read-through and write-through caching
  • Write-behind (write-back) caching
  • Refresh-ahead pattern
  • Cache eviction policies (LRU, LFU, TTL)
Interview Focus: When would you choose write-behind over write-through?
Duration: 8-10 hours
The Hard Problem: Keeping caches consistent.
  • TTL-based invalidation strategies
  • Event-driven invalidation with Kafka/pub-sub
  • Tag-based cache invalidation
  • Cascading invalidation patterns
  • Cache versioning strategies
Interview Focus: Design cache invalidation for a product catalog
Duration: 10-12 hours
Production-grade distributed cache implementations.
  • Redis Cluster architecture and slot migration
  • Redis Sentinel for high availability
  • Memcached consistent hashing
  • Memory management and eviction
  • Replication lag and read consistency
  • Redis vs Memcached decision framework
Interview Focus: Design a distributed session store using Redis
Duration: 6-8 hours
Caching at the edge for global performance.
  • CDN architecture and PoP design
  • Cache-Control header strategies
  • Origin shield and tiered caching
  • Cache purging at scale
  • Dynamic content caching patterns
Interview Focus: Design CDN caching for a video streaming platform
Duration: 6-8 hours
Preventing thundering herd on cache misses.
  • Locking and mutex patterns
  • Probabilistic early expiration
  • Request coalescing
  • Background refresh strategies
  • Circuit breaker integration
Interview Focus: How do you handle cache stampede during Black Friday?
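Probabilistic early expiration can be sketched with one commonly described formula: a reader refreshes the entry early when `now - recompute_cost * beta * ln(u)` reaches the expiry time, with `u` drawn uniformly from (0, 1]. The function and parameter names below are assumptions. `beta` above 1 makes early refreshes more likely.

```python
import math
import random


def should_refresh_early(now: float, expiry: float, recompute_cost: float,
                         beta: float = 1.0, rng=random) -> bool:
    """Decide whether this reader should recompute the value before it expires.

    rng is any object with a random() method (the random module by default).
    """
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    # -log(u) is non-negative, so the check fires earlier for costly entries
    # and always fires once the entry has actually expired.
    return now - recompute_cost * beta * math.log(u) >= expiry
```

Because each reader rolls independently, refreshes spread out over the final stretch before expiry instead of all readers missing at the same instant.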

Special Track: Real-World Case Studies

Learn from production systems at scale.
Duration: 8-10 hours
The first globally distributed, strongly consistent database.
  • TrueTime and external consistency
  • Paxos groups and data placement
  • Lock-free read-only transactions
  • Schema changes without downtime
  • Real failure stories and lessons learned
Interview Focus: How does Spanner achieve 5 nines availability?
Duration: 8-10 hours
The paper that launched NoSQL.
  • Consistent hashing with virtual nodes
  • Vector clocks and conflict resolution
  • Sloppy quorums and hinted handoff
  • Evolution from Dynamo to DynamoDB
  • Global Tables and cross-region replication
Interview Focus: Design a shopping cart using Dynamo-style storage
Duration: 6-8 hours
Chaos engineering pioneers.
  • Chaos Monkey and Simian Army
  • EVCache and caching at scale
  • Zuul gateway and load shedding
  • Failure injection testing
  • Multi-region active-active deployment
Interview Focus: Design a chaos engineering strategy for your system
Duration: 6-8 hours
Building reliable systems for millions of rides.
  • Ringpop for membership and routing
  • Cadence/Temporal workflow orchestration
  • Geospatial indexing at scale
  • Real-time dispatch and matching
  • Multi-region failover strategies
Interview Focus: Design a ride-matching system like Uber

Staff+ Interview Practice Problems

Curated problems for senior-level interviews.

Global Rate Limiter

Design a rate limiter that works across multiple data centers with sub-millisecond overhead

Distributed Transaction Coordinator

Build a transaction coordinator supporting 2PC, Saga, and TCC patterns

Real-Time Leaderboard

Design a leaderboard supporting millions of concurrent players with real-time updates

Multi-Region Database

Design a database with strong consistency guarantees across continents
Interview Tip: Each problem includes detailed solutions, trade-off analysis, and follow-up questions commonly asked at Google, Meta, Amazon, and other top companies.

Capstone Projects

Apply everything you’ve learned.

Project 1: Distributed KV Store

Build a key-value store with:
  • Raft-based replication
  • Consistent hashing for partitioning
  • Read/write quorums
  • Snapshot and recovery

Project 2: Implement Raft

A complete Raft implementation:
  • Leader election
  • Log replication
  • Membership changes
  • Persistence and recovery

Project 3: Distributed Lock Service

Build a coordination service:
  • Ephemeral nodes
  • Watch mechanism
  • Sequential ordering
  • Lock implementation

Project 4: Mock Interviews

Practice system design:
  • Design Uber’s dispatch system
  • Design Stripe’s payment processing
  • Design Netflix’s CDN
  • Design Twitter’s timeline

Key Papers to Read

Essential reading for deep understanding:
Paper | Why It Matters
Dynamo (Amazon) | Leaderless replication, vector clocks, eventual consistency
Spanner (Google) | TrueTime, globally consistent transactions
Raft (Stanford) | Understandable consensus
Paxos Made Simple (Lamport) | Accessible restatement of the original Paxos consensus algorithm
MapReduce (Google) | Distributed computation paradigm
Kafka (LinkedIn) | Distributed log architecture
Time, Clocks (Lamport) | Logical time foundations
CALM Theorem | Consistency without coordination
FLP Impossibility | Limits of distributed consensus
Harvest/Yield | Practical CAP trade-offs

Interview Preparation Strategy

1. Master the Theory (Weeks 1-6)

Complete Tracks 1-3. Focus on:
  • CAP/PACELC intuition
  • Raft protocol (draw from memory)
  • Replication trade-offs
2. Go Deep on Transactions (Weeks 7-9)

Complete Track 4. Be ready to:
  • Design a saga for any use case
  • Explain 2PC failure modes
  • Discuss distributed locking trade-offs
3. Study Real Systems (Weeks 10-12)

Complete Track 5. Know:
  • How Spanner achieves global consistency
  • How Kafka provides exactly-once
  • When to use Cassandra vs PostgreSQL
4. Master Advanced Patterns (Weeks 13-15)

Complete Tracks 7-9 (New!). Focus on:
  • TrueTime and clock synchronization
  • Circuit breakers and fault tolerance
  • Distributed caching patterns
5. Study Case Studies (Week 16)

Review real-world architectures:
  • Google Spanner, Amazon Dynamo
  • Netflix resilience, Uber real-time
  • Learn from actual failure post-mortems
6. Mock Interviews (Weeks 17-20)

  • Practice 3-4 system design problems per week
  • Use the Staff+ Interview Problems module
  • Focus on distributed aspects
  • Record yourself and review

Who This Course Is For

  • Senior Engineers (4+ years) aiming for Staff/Principal
  • Backend Engineers wanting deep distributed systems knowledge
  • Infrastructure Engineers building platforms
  • Anyone targeting FAANG/top-tier companies

Ready to Begin?

Start with Foundations

Begin with Track 1 to build your mental model of distributed systems

Jump to Case Studies

Learn from Google, Amazon, Netflix, and Uber’s production systems

Practice Interview Problems

Staff+ level problems with detailed solutions and trade-off analysis

Master Fault Tolerance

Circuit breakers, retries, and graceful degradation patterns

Interview Deep-Dive

Strong Answer:
  • The first thing I clarify is what the system’s core invariants are. For a payment ledger, an incorrect balance is catastrophic, so I lean CP. For a social media feed, a user seeing a slightly stale timeline for a few seconds is acceptable, so AP makes sense.
  • I never frame it as a binary choice for the entire system. Most real systems are a mix: the metadata service might be CP (using Raft-backed etcd), while the content-delivery layer is AP (eventual consistency with CDN caching).
  • I also move beyond CAP to PACELC, because CAP only describes behavior during partitions. During normal operation, the trade-off is between latency and consistency. A system like Spanner is PC/EC — it chooses consistency even when there is no partition, accepting higher latency for correctness. DynamoDB is PA/EL — it prioritizes availability during partitions and low latency during normal operation.
  • In practice, I would enumerate the critical user-facing flows, classify each by its tolerance for staleness or unavailability, and then assign the appropriate consistency level per flow rather than per system.
Follow-up: How would you handle the situation where a single user request touches both a CP subsystem and an AP subsystem?
At the boundary between CP and AP subsystems, you need to be very deliberate about what guarantees the end user actually sees. The pattern I use is to make the CP subsystem the “source of truth” for correctness-critical state (like account balance or inventory count) and let the AP subsystem handle best-effort, eventually-consistent derived data (like recommendation scores or timeline caches). The user-facing request first commits to the CP path (for example, debiting inventory via a Raft-backed service) and then asynchronously propagates to the AP path via an event bus like Kafka. If the AP path is temporarily inconsistent, the UI can show a “processing” state or fall back to the last-known-good value. The key principle is: never let the AP subsystem’s staleness corrupt the CP subsystem’s invariants. The CP path is the gatekeeper; the AP path is the fast lane.
Strong Answer:
  • I would say: “CAP means that when our network has a serious problem — a partition — we have to pick between two options. Option one: we stop serving some requests to make sure every answer is correct (consistency). Option two: we keep serving every request, but some answers might be outdated (availability). We cannot have both during the outage.”
  • The warning I would give is that CAP is about the extreme case — a network partition. Most of the time, partitions are rare and brief. The real day-to-day trade-off is between latency and consistency (PACELC). A VP should care more about “how fast are we during normal operation?” than “what happens during an outage that occurs once a year?”
  • I would also warn against the common misconception that “AP means our data is unreliable.” AP systems like DynamoDB or Cassandra are used by Amazon and Netflix for mission-critical workloads. They are not unreliable — they just define consistency differently and resolve conflicts after the fact.
Follow-up: A senior engineer on your team argues that since partitions are rare, you should always choose CP. How do you respond?
I would push back respectfully. Partitions are rare, but their effects are amplified. During a partition, a CP system rejects any write it cannot replicate to a quorum, which means user-facing errors or degraded functionality for the duration. A 99.99% availability SLA allows only about 4.3 minutes of downtime in a 30-day month, so a single 5-minute partition is enough to exhaust the error budget. The right answer depends on the business impact: is a wrong answer worse than no answer? For a bank ledger, yes, so CP is correct. For a shopping cart, no: Amazon’s Dynamo paper argues that rejecting a cart update hurts the customer experience more than occasionally showing a stale cart. I would bring data: the expected frequency of partitions and the business cost of unavailability versus inconsistency, and use that to drive the decision rather than an abstract preference.
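The error-budget claim is easy to check with arithmetic. The helper below is a small sketch; the function name and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


budget = error_budget_minutes(0.9999)  # 43200 minutes * 0.0001
partition_minutes = 5.0
over_budget = partition_minutes > budget
print(f"budget={budget:.2f} min, partition={partition_minutes} min, over={over_budget}")
```

With a 99.99% target the budget comes out to roughly 4.32 minutes per 30 days, so one 5-minute partition that makes a CP system unavailable already exceeds it.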
Strong Answer:
  • In almost every greenfield project today, I would default to Raft. The reason is understandability. Raft was explicitly designed to be easier to implement correctly, and correctness in consensus is everything. Bugs in consensus lead to data loss, split-brain, or silent corruption — the worst kind of production incidents.
  • I would choose Paxos (specifically Multi-Paxos or a variant like EPaxos) in specific scenarios: when I need flexible quorum configurations, when I need leaderless consensus for lower latency in geo-distributed deployments, or when I am extending an existing system that already uses Paxos (like Google’s internal infrastructure).
  • The practical trade-off: Raft has a single leader, which simplifies reasoning but creates a throughput bottleneck and a latency penalty for writes that must cross regions. EPaxos removes the leader but is significantly harder to implement and reason about.
  • In production, I look at the ecosystem too. etcd (Raft) and Consul (Raft) have mature, battle-tested implementations. If I am building a coordination service, I would use one of these rather than implementing consensus from scratch.
Follow-up: What happens in a Raft cluster if the leader is in a data center that becomes partitioned from the majority?
The partitioned leader will stop receiving acknowledgments from the majority and its heartbeats will not reach most followers. The followers in the majority partition will time out, increment their term, and elect a new leader. The old leader, stuck in the minority partition, may keep believing it is the leader until the partition heals; at that point it receives a message with a higher term number and steps down to follower. During the election there is a brief period where no leader exists and writes are rejected; this is the availability cost of CP consensus. The critical safety property is that the old leader cannot commit any new entries because it cannot reach a majority. Any entries it accepted but did not commit are discarded if they conflict with the new leader’s log. This is by design: Raft sacrifices availability during partitions to guarantee that committed entries are never lost.
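The two rules doing the work in that scenario, majority commit and step-down on a higher term, can be sketched as below. This is a deliberately tiny model under stated assumptions: a 5-node cluster, hypothetical `Node` and `has_majority` names, and no log replication, voting, or timers.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int
    role: str  # "leader", "follower", or "candidate"

    def observe_term(self, other_term: int) -> None:
        """Seeing a higher term forces a step-down to follower."""
        if other_term > self.term:
            self.term = other_term
            self.role = "follower"


def has_majority(acks: int, cluster_size: int) -> bool:
    """An entry commits only once a strict majority has stored it."""
    return acks > cluster_size // 2


CLUSTER_SIZE = 5
old_leader = Node("a", term=3, role="leader")

# Partition: the old leader can reach only itself and one follower.
old_leader_can_commit = has_majority(acks=2, cluster_size=CLUSTER_SIZE)

# The three nodes on the majority side time out and elect a leader in term 4.
new_leader = Node("c", term=4, role="leader")
new_leader_can_commit = has_majority(acks=3, cluster_size=CLUSTER_SIZE)

# Partition heals: the old leader hears a term-4 message and steps down.
old_leader.observe_term(new_leader.term)
```

Because two acknowledgments are not a majority of five, nothing the old leader accepts during the partition can be committed, which is why discarding those entries later is safe.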
Strong Answer:
  • A well-known example is Amazon’s shopping cart, the motivating case for Dynamo. With a strongly consistent store, network hiccups between data centers cause cart operations to fail or time out: users click “Add to Cart” and see an error. Amazon’s reasoning, as described in the Dynamo paper, was that a rejected cart update directly hurts the customer experience, so the cart was built on an eventually consistent model (Dynamo) that always accepts writes, even if that occasionally means a deleted item resurfaces or the cart shows slightly stale state. The trade-off was worth it: a cart with an extra item is a minor UI annoyance, but a cart that refuses to work loses a sale.
  • Another example is the MongoDB Jepsen findings. Early versions of MongoDB claimed “strong consistency”, but under network partitions stale reads were possible, even from the primary, because a partitioned former primary could keep serving reads before it realized it had been deposed. The core problem was a mismatch between what was advertised and what was actually implemented, which is a common and dangerous failure mode.
  • The lesson for interviews: always verify that your system’s actual behavior matches its claimed consistency model. Use tools like Jepsen, write linearizability checkers, and test under real failure conditions (network partitions, clock skew, process crashes).
Follow-up: How would you set up ongoing validation that your system actually provides the consistency guarantees it claims?
I would implement three layers of validation. First, a continuous integration Jepsen-style test suite that runs against a staging cluster, injecting network partitions, clock skew, and process kills while verifying linearizability or whatever consistency model you claim. Second, production-level consistency auditing: for critical paths like financial transactions, I would dual-write to an audit log and run an offline checker that reconstructs the serial order and flags any anomalies. Third, observability: track metrics like replication lag, leader election frequency, and quorum failures. If replication lag spikes above your SLA threshold, that is an early warning that your consistency guarantees are at risk before users notice. The goal is to detect consistency violations before they become customer-visible incidents.
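As one small piece of the offline checker, the sketch below flags stale reads in a recorded history for a single key. It is not a full linearizability checker: it tests only one necessary condition, and it assumes that writes carry increasing version numbers and that all timestamps come from one comparable clock. The `Op` record and `find_stale_reads` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str      # "write" or "read"
    version: int   # version written, or version the read returned
    start: float   # time the client issued the call
    end: float     # time the client received the response


def find_stale_reads(history):
    """Return reads that returned a version older than a write which had
    already completed before the read began."""
    writes = [op for op in history if op.kind == "write"]
    stale = []
    for read in (op for op in history if op.kind == "read"):
        completed = [w.version for w in writes if w.end < read.start]
        if completed and read.version < max(completed):
            stale.append(read)
    return stale


history = [
    Op("write", 1, start=0.0, end=1.0),
    Op("write", 2, start=2.0, end=3.0),
    Op("read", 2, start=3.5, end=4.0),  # fine: sees the latest write
    Op("read", 1, start=5.0, end=5.5),  # stale: version 2 was already acknowledged
    Op("read", 1, start=2.5, end=2.8),  # fine: overlaps the write of version 2
]

violations = find_stale_reads(history)
```

A read that overlaps an in-flight write may legitimately return either value, which is why only writes that finished before the read started are counted against it.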