Distributed Systems Mastery
A comprehensive, interview-focused curriculum designed for engineers targeting Staff/Principal roles at top tech companies (Google, Meta, Amazon, Netflix, Stripe, etc.). This course covers everything from fundamentals to cutting-edge distributed systems concepts.Target Outcome: Staff+ Engineer at FAANG / Top-tier distributed systems expertise
Prerequisites: Strong programming, basic networking, database fundamentals
Language: Concepts with implementations in Go/Java/Python
New Content: 5 additional tracks with 45+ modules, real-world case studies, and Staff+ interview problems
Why This Course?
FAANG Interview Ready
Deep Theoretical Foundation
Production Battle-Tested
Hands-On Projects
What Companies Ask
| Company | Common Topics |
|---|---|
| Consensus protocols, Spanner, Bigtable internals, Paxos, distributed transactions | |
| Amazon | DynamoDB internals, eventual consistency, vector clocks, Dynamo paper |
| Meta | TAO, ZippyDB, consensus at scale, social graph distribution |
| Netflix | EVCache, Cassandra, chaos engineering, resilience patterns |
| Stripe | Distributed transactions, exactly-once delivery, idempotency |
| Uber | Ringpop, consistent hashing, real-time systems, Cadence workflows |
Course Structure
The curriculum is organized into 9 tracks progressing from fundamentals to Staff+ expertise:Track 1: Foundations
Build the mental models that all distributed systems are built upon.Module 1: Why Distributed Systems?
Module 1: Why Distributed Systems?
- Single machine limitations (CPU, memory, disk, network)
- Horizontal vs Vertical scaling trade-offs
- The Eight Fallacies of Distributed Computing (deep dive)
- Types: Compute clusters, Storage systems, Coordination systems
- Real examples: Google’s evolution from single machines to global infrastructure
Module 2: Network Fundamentals
Module 2: Network Fundamentals
- TCP guarantees and failure modes
- Network partitions: What they are and how to detect them
- Message passing: At-most-once, at-least-once, exactly-once
- RPC frameworks: gRPC, Thrift, Protocol Buffers
- Failure detection: Heartbeats, Phi accrual detector
Module 3: Time and Ordering
Module 3: Time and Ordering
- Why physical clocks fail (NTP drift, leap seconds)
- Logical clocks (Lamport timestamps)
- Vector clocks (causality tracking)
- Hybrid logical clocks (HLC)
- TrueTime (Google’s GPS + atomic clock approach)
- Happens-before relationship
Module 4: Failure Models
Module 4: Failure Models
- Fail-stop vs Fail-recover
- Byzantine failures
- Network failures: Partition, delay, reordering
- Partial failures: The hardest problem
- Gray failures (subtle, hard-to-detect issues)
Module 5: CAP and PACELC Theorems
Module 5: CAP and PACELC Theorems
- CAP Theorem: Proof and implications
- CP vs AP: Real-world examples
- PACELC: The more practical framework
- Beyond CAP: Harvest and Yield
- Consistency spectrum: Linearizable → Eventual
Track 2: Consensus Protocols
The heart of distributed systems. Master these for Staff+ interviews.Module 6: The Consensus Problem
Module 6: The Consensus Problem
- FLP Impossibility result (and its implications)
- Safety vs Liveness guarantees
- Consensus use cases: Leader election, configuration, transactions
- Relationship to State Machine Replication
Module 7: Paxos Protocol
Module 7: Paxos Protocol
- Basic Paxos: Prepare/Promise, Accept/Accepted
- Why Paxos works (safety proofs)
- Multi-Paxos optimizations
- Paxos Made Simple (Lamport’s paper walkthrough)
- Fast Paxos and Flexible Paxos
- EPaxos (Leaderless variant)
Module 8: Raft Consensus (Deep Dive)
Module 8: Raft Consensus (Deep Dive)
- Leader election mechanism
- Log replication and commit rules
- Safety properties and proofs
- Membership changes (joint consensus)
- Log compaction and snapshots
- Raft vs Paxos comparison
- etcd/Consul implementation details
Module 9: Viewstamped Replication
Module 9: Viewstamped Replication
- VR protocol overview
- View changes and recovery
- Comparison with Raft
- When to use VR vs Raft
Module 10: ZAB (Zookeeper Atomic Broadcast)
Module 10: ZAB (Zookeeper Atomic Broadcast)
- ZAB protocol phases
- Leader activation and synchronization
- Recovery and failover
- Zookeeper guarantees (FIFO, linearizable writes)
- Zookeeper use cases: Locking, configuration, leader election
Track 3: Replication Strategies
How data is copied and kept consistent across nodes.Module 11: Single-Leader Replication
Module 11: Single-Leader Replication
- Synchronous vs Asynchronous replication
- Semi-synchronous replication
- Replication lag and its problems
- Read-your-writes, Monotonic reads, Consistent prefix
- Failover handling and split-brain prevention
- MySQL/PostgreSQL replication internals
Module 12: Multi-Leader Replication
Module 12: Multi-Leader Replication
- Use cases: Multi-datacenter, offline clients
- Conflict detection and resolution
- Last-write-wins (LWW) and its problems
- Custom conflict resolution logic
- CockroachDB and TiDB approach
Module 13: Leaderless Replication
Module 13: Leaderless Replication
- Read/write quorums (R + W > N)
- Sloppy quorums and hinted handoff
- Anti-entropy: Read repair, Merkle trees
- Dynamo paper deep dive
- Cassandra consistency levels
Module 14: Conflict Resolution
Module 14: Conflict Resolution
- Application-level resolution
- Version vectors
- LWW strategies and pitfalls
- Merge functions
- Operational transformation (Google Docs)
Module 15: CRDTs
Module 15: CRDTs
- Operation-based vs State-based CRDTs
- G-Counter, PN-Counter
- G-Set, 2P-Set, OR-Set
- LWW-Register, MV-Register
- CRDT-based databases (Riak, Redis CRDT)
- Performance and memory implications
Track 4: Distributed Transactions
Maintaining data integrity across multiple nodes.Module 16: ACID in Distributed Systems
Module 16: ACID in Distributed Systems
- Local vs Distributed transactions
- Isolation levels: Read uncommitted → Serializable
- Snapshot isolation and write skew
- Serializable Snapshot Isolation (SSI)
Module 17: Two-Phase Commit (2PC)
Module 17: Two-Phase Commit (2PC)
- Prepare and Commit phases
- Coordinator failures and blocking
- Participant failures and recovery
- 2PC in practice: XA transactions
- Why 2PC is often avoided (performance, availability)
Module 18: Three-Phase Commit (3PC)
Module 18: Three-Phase Commit (3PC)
- Pre-commit phase addition
- Non-blocking under certain failures
- Why 3PC isn’t commonly used
- Network partition problems
Module 19: Saga Pattern
Module 19: Saga Pattern
- Choreography vs Orchestration
- Compensating transactions
- Semantic locks and countermeasures
- Saga execution coordinator
- Saga pattern in microservices
- Temporal.io and Cadence workflows
Module 20: TCC Pattern
Module 20: TCC Pattern
- Two-phase approach at application level
- Resource reservation
- Timeout handling
- When to use TCC vs Saga
Module 21: Distributed Locking
Module 21: Distributed Locking
- Single-node locks in distributed systems (Redis SETNX)
- Redlock algorithm and its critique
- Fencing tokens for safety
- Zookeeper-based locks
- Lease-based locking
Track 5: Data Systems at Scale
Partitioning, storage, and processing at massive scale.Module 22: Partitioning Strategies
Module 22: Partitioning Strategies
- Key-range partitioning
- Hash partitioning
- Hybrid approaches
- Secondary indexes: Local vs Global
- Rebalancing strategies
- Hot spots and skew handling
Module 23: Consistent Hashing
Module 23: Consistent Hashing
- Basic consistent hashing
- Virtual nodes for load balancing
- Bounded-load consistent hashing
- Jump consistent hashing
- Rendezvous hashing (HRW)
Module 24: Distributed Databases Deep Dive
Module 24: Distributed Databases Deep Dive
- Spanner: TrueTime, external consistency, Paxos groups
- CockroachDB: Raft, serializable isolation, SQL distribution
- TiDB: Raft + Percolator, hybrid OLTP/OLAP
- Cassandra: Gossip, consistent hashing, tunable consistency
- DynamoDB: Leaderless, GSI, adaptive capacity
- MongoDB: Raft-based replication, sharding
Module 25: Distributed Storage Systems
Module 25: Distributed Storage Systems
- GFS/HDFS architecture
- Object storage (S3 architecture)
- Erasure coding for durability
- Ceph architecture
- Tiered storage strategies
Module 26: Stream Processing
Module 26: Stream Processing
- Event sourcing and event-driven architecture
- Kafka internals: Partitions, consumer groups, exactly-once
- Stream processing: Flink, Kafka Streams
- Windowing: Tumbling, Sliding, Session
- Watermarks and late data handling
- Exactly-once semantics in streaming
Track 6: Production Excellence
Operating distributed systems at scale.Module 27: Observability at Scale
Module 27: Observability at Scale
- Distributed tracing (Jaeger, Zipkin, OpenTelemetry)
- Metrics aggregation at scale
- Log aggregation and analysis
- Correlation across services
- SLIs, SLOs, and error budgets
Module 28: Chaos Engineering
Module 28: Chaos Engineering
- Chaos Monkey and the Simian Army
- Designing chaos experiments
- Blast radius control
- Failure injection frameworks (Litmus, Chaos Mesh)
- Game days and runbooks
Module 29: SRE Practices
Module 29: SRE Practices
- Toil reduction and automation
- On-call best practices
- Postmortem culture (blameless)
- Error budgets and release velocity
- Progressive rollouts
Module 30: Incident Management
Module 30: Incident Management
- Incident response playbooks
- Communication during outages
- Escalation procedures
- Root cause analysis
- Learning from failures
Module 31: Capacity Planning
Module 31: Capacity Planning
- Load testing strategies
- Capacity modeling
- Performance regression detection
- Autoscaling strategies
- Cost optimization at scale
Track 7: Clock Synchronization (Advanced)
Time is the foundation of distributed systems. Master clock synchronization for Staff+ expertise.Module 32: GPS and Atomic Clocks
Module 32: GPS and Atomic Clocks
- GPS time transfer and accuracy bounds
- Atomic clock drift characteristics
- Rubidium vs Cesium clock trade-offs
- Antenna placement and signal multipath
- Cost-benefit analysis for different accuracy needs
Module 33: TrueTime Deep Dive
Module 33: TrueTime Deep Dive
- TrueTime API:
TT.now(),TT.after(),TT.before() - Uncertainty intervals and commit-wait protocol
- GPS + atomic clock redundancy architecture
- How Spanner uses TrueTime for external consistency
- Implementing TrueTime-like systems without GPS
Module 34: Hybrid Logical Clocks
Module 34: Hybrid Logical Clocks
- HLC design and implementation
- Timestamp encoding strategies
- CockroachDB’s MVCC with HLC
- Causality tracking with bounded skew
- HLC vs Vector Clocks trade-offs
Module 35: Clock Synchronization Protocols
Module 35: Clock Synchronization Protocols
- NTP architecture and stratum levels
- PTP/IEEE 1588 for microsecond precision
- Clock drift detection and correction
- Network asymmetry compensation
- Monitoring clock health in production
Track 8: Fault Tolerance Patterns
Building resilient systems that survive failures.Module 36: Circuit Breaker Pattern
Module 36: Circuit Breaker Pattern
- State machine: Closed → Open → Half-Open
- Failure threshold configuration
- Timeout and retry integration
- Hystrix and Resilience4j implementations
- Monitoring circuit breaker health
Module 37: Bulkhead Isolation
Module 37: Bulkhead Isolation
- Thread pool isolation patterns
- Semaphore-based bulkheads
- Connection pool partitioning
- Process-level isolation
- Kubernetes resource limits as bulkheads
Module 38: Retry Strategies
Module 38: Retry Strategies
- Exponential backoff with jitter
- Retry budgets and thundering herd prevention
- Idempotency keys for safe retries
- Deadline propagation across services
- Distinguishing transient vs permanent failures
Module 39: Graceful Degradation
Module 39: Graceful Degradation
- Feature flags for degradation
- Fallback strategies and stale data serving
- Load shedding and admission control
- Quality-of-service tiering
- Netflix’s fallback hierarchies
Module 40: Timeout Patterns
Module 40: Timeout Patterns
- Connection vs read vs write timeouts
- Timeout cascades and deadline propagation
- Context cancellation across service boundaries
- Calculating appropriate timeout values
- Timeout vs circuit breaker interaction
Track 9: Distributed Caching
Caching patterns for high-performance distributed systems.Module 41: Cache Strategies
Module 41: Cache Strategies
- Cache-aside (lazy loading) pattern
- Read-through and write-through caching
- Write-behind (write-back) caching
- Refresh-ahead pattern
- Cache eviction policies (LRU, LFU, TTL)
Module 42: Cache Invalidation
Module 42: Cache Invalidation
- TTL-based invalidation strategies
- Event-driven invalidation with Kafka/pub-sub
- Tag-based cache invalidation
- Cascading invalidation patterns
- Cache versioning strategies
Module 43: Redis & Memcached Architecture
Module 43: Redis & Memcached Architecture
- Redis Cluster architecture and slot migration
- Redis Sentinel for high availability
- Memcached consistent hashing
- Memory management and eviction
- Replication lag and read consistency
- Redis vs Memcached decision framework
Module 44: CDN Caching
Module 44: CDN Caching
- CDN architecture and PoP design
- Cache-Control header strategies
- Origin shield and tiered caching
- Cache purging at scale
- Dynamic content caching patterns
Module 45: Cache Stampede Prevention
Module 45: Cache Stampede Prevention
- Locking and mutex patterns
- Probabilistic early expiration
- Request coalescing
- Background refresh strategies
- Circuit breaker integration
Special Track: Real-World Case Studies
Learn from production systems at scale.Google Spanner Architecture
Google Spanner Architecture
- TrueTime and external consistency
- Paxos groups and data placement
- Lock-free read-only transactions
- Schema changes without downtime
- Real failure stories and lessons learned
Amazon Dynamo & DynamoDB
Amazon Dynamo & DynamoDB
- Consistent hashing with virtual nodes
- Vector clocks and conflict resolution
- Sloppy quorums and hinted handoff
- Evolution from Dynamo to DynamoDB
- Global Tables and cross-region replication
Netflix Resilience Architecture
Netflix Resilience Architecture
- Chaos Monkey and Simian Army
- EVCache and caching at scale
- Zuul gateway and load shedding
- Failure injection testing
- Multi-region active-active deployment
Uber's Real-Time Systems
Uber's Real-Time Systems
- Ringpop for membership and routing
- Cadence/Temporal workflow orchestration
- Geospatial indexing at scale
- Real-time dispatch and matching
- Multi-region failover strategies
Staff+ Interview Practice Problems
Curated problems for senior-level interviews.Global Rate Limiter
Distributed Transaction Coordinator
Real-Time Leaderboard
Multi-Region Database
Capstone Projects
Apply everything you’ve learned.Project 1: Distributed KV Store
- Raft-based replication
- Consistent hashing for partitioning
- Read/write quorums
- Snapshot and recovery
Project 2: Implement Raft
- Leader election
- Log replication
- Membership changes
- Persistence and recovery
Project 3: Distributed Lock Service
- Ephemeral nodes
- Watch mechanism
- Sequential ordering
- Lock implementation
Project 4: Mock Interviews
- Design Uber’s dispatch system
- Design Stripe’s payment processing
- Design Netflix’s CDN
- Design Twitter’s timeline
Key Papers to Read
Essential reading for deep understanding:| Paper | Why It Matters |
|---|---|
| Dynamo (Amazon) | Leaderless replication, vector clocks, eventual consistency |
| Spanner (Google) | TrueTime, globally consistent transactions |
| Raft (Stanford) | Understandable consensus |
| Paxos Made Simple (Lamport) | The original consensus paper |
| MapReduce (Google) | Distributed computation paradigm |
| Kafka (LinkedIn) | Distributed log architecture |
| Time, Clocks (Lamport) | Logical time foundations |
| CALM Theorem | Consistency without coordination |
| FLP Impossibility | Limits of distributed consensus |
| Harvest/Yield | Practical CAP trade-offs |
Interview Preparation Strategy
Master the Theory (Weeks 1-6)
- CAP/PACELC intuition
- Raft protocol (draw from memory)
- Replication trade-offs
Go Deep on Transactions (Weeks 7-9)
- Design a saga for any use case
- Explain 2PC failure modes
- Discuss distributed locking trade-offs
Study Real Systems (Weeks 10-12)
- How Spanner achieves global consistency
- How Kafka provides exactly-once
- When to use Cassandra vs PostgreSQL
Master Advanced Patterns (Weeks 13-15)
- TrueTime and clock synchronization
- Circuit breakers and fault tolerance
- Distributed caching patterns
Study Case Studies (Week 16)
- Google Spanner, Amazon Dynamo
- Netflix resilience, Uber real-time
- Learn from actual failure post-mortems
Mock Interviews (Weeks 17-20)
- Practice 3-4 system design problems per week
- Use the Staff+ Interview Problems module
- Focus on distributed aspects
- Record yourself and review
Who This Course Is For
- Target Audience
- Prerequisites
- Time Commitment
- Senior Engineers (4+ years) aiming for Staff/Principal
- Backend Engineers wanting deep distributed systems knowledge
- Infrastructure Engineers building platforms
- Anyone targeting FAANG/top-tier companies