Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Distributed Systems Mastery
A comprehensive, interview-focused curriculum designed for engineers targeting Staff/Principal roles at top tech companies (Google, Meta, Amazon, Netflix, Stripe, etc.). This course covers everything from fundamentals to cutting-edge distributed systems concepts. Distributed systems sit at the intersection of computer science theory (impossibility results, consensus proofs) and messy real-world practice (network jitter, clock skew, disk failures at 3 AM). This curriculum bridges both worlds — giving you the theoretical foundations to reason about correctness and the production battle scars to design systems that actually work under fire.Target Outcome: Staff+ Engineer at FAANG / Top-tier distributed systems expertise
Prerequisites: Strong programming, basic networking, database fundamentals
Language: Concepts with implementations in Go/Java/Python
New Content: 5 additional tracks with 45+ modules, real-world case studies, and Staff+ interview problems
Why This Course?
FAANG Interview Ready
Deep Theoretical Foundation
Production Battle-Tested
Hands-On Projects
What Companies Ask
| Company | Common Topics |
|---|---|
| Consensus protocols, Spanner, Bigtable internals, Paxos, distributed transactions | |
| Amazon | DynamoDB internals, eventual consistency, vector clocks, Dynamo paper |
| Meta | TAO, ZippyDB, consensus at scale, social graph distribution |
| Netflix | EVCache, Cassandra, chaos engineering, resilience patterns |
| Stripe | Distributed transactions, exactly-once delivery, idempotency |
| Uber | Ringpop, consistent hashing, real-time systems, Cadence workflows |
Course Structure
The curriculum is organized into 9 tracks progressing from fundamentals to Staff+ expertise:Track 1: Foundations
Build the mental models that all distributed systems are built upon.Module 1: Why Distributed Systems?
Module 1: Why Distributed Systems?
- Single machine limitations (CPU, memory, disk, network)
- Horizontal vs Vertical scaling trade-offs
- The Eight Fallacies of Distributed Computing (deep dive)
- Types: Compute clusters, Storage systems, Coordination systems
- Real examples: Google’s evolution from single machines to global infrastructure
Module 2: Network Fundamentals
Module 2: Network Fundamentals
- TCP guarantees and failure modes
- Network partitions: What they are and how to detect them
- Message passing: At-most-once, at-least-once, exactly-once
- RPC frameworks: gRPC, Thrift, Protocol Buffers
- Failure detection: Heartbeats, Phi accrual detector
Module 3: Time and Ordering
Module 3: Time and Ordering
- Why physical clocks fail (NTP drift, leap seconds)
- Logical clocks (Lamport timestamps)
- Vector clocks (causality tracking)
- Hybrid logical clocks (HLC)
- TrueTime (Google’s GPS + atomic clock approach)
- Happens-before relationship
Module 4: Failure Models
Module 4: Failure Models
- Fail-stop vs Fail-recover
- Byzantine failures
- Network failures: Partition, delay, reordering
- Partial failures: The hardest problem
- Gray failures (subtle, hard-to-detect issues)
Module 5: CAP and PACELC Theorems
Module 5: CAP and PACELC Theorems
- CAP Theorem: Proof and implications
- CP vs AP: Real-world examples
- PACELC: The more practical framework
- Beyond CAP: Harvest and Yield
- Consistency spectrum: Linearizable → Eventual
Module 47: Distributed Snapshots
Module 47: Distributed Snapshots
- The Global State Problem: Why we can’t just “freeze” time
- Chandy-Lamport Algorithm: Markers and state recording
- Consistent vs Inconsistent cuts
- Practical uses: Distributed debugging, Checkpointing, Termination detection
Module 48: Gossip Protocols
Module 48: Gossip Protocols
- Epidemic algorithms: Rumor mongering and anti-entropy
- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership
- Phi Accrual Failure Detection: Suspicion-based detectors
- Use cases: Cassandra membership, Redis Cluster, HashiCorp Serf/Consul
Track 2: Consensus Protocols
The heart of distributed systems. Master these for Staff+ interviews.Module 6: The Consensus Problem
Module 6: The Consensus Problem
- FLP Impossibility result (and its implications)
- Safety vs Liveness guarantees
- Consensus use cases: Leader election, configuration, transactions
- Relationship to State Machine Replication
Module 7: Paxos Protocol
Module 7: Paxos Protocol
- Basic Paxos: Prepare/Promise, Accept/Accepted
- Why Paxos works (safety proofs)
- Multi-Paxos optimizations
- Paxos Made Simple (Lamport’s paper walkthrough)
- Fast Paxos and Flexible Paxos
- EPaxos (Leaderless variant)
Module 8: Raft Consensus (Deep Dive)
Module 8: Raft Consensus (Deep Dive)
- Leader election mechanism
- Log replication and commit rules
- Safety properties and proofs
- Membership changes (joint consensus)
- Log compaction and snapshots
- Raft vs Paxos comparison
- etcd/Consul implementation details
Module 9: Viewstamped Replication
Module 9: Viewstamped Replication
- VR protocol overview
- View changes and recovery
- Comparison with Raft
- When to use VR vs Raft
Module 10: ZAB (Zookeeper Atomic Broadcast)
Module 10: ZAB (Zookeeper Atomic Broadcast)
- ZAB protocol phases
- Leader activation and synchronization
- Recovery and failover
- Zookeeper guarantees (FIFO, linearizable writes)
- Zookeeper use cases: Locking, configuration, leader election
Module 49: Byzantine Fault Tolerance (BFT)
Module 49: Byzantine Fault Tolerance (BFT)
- The Byzantine Generals Problem: Understanding the consensus bound (3f+1)
- PBFT: Practical Byzantine Fault Tolerance internals
- Modern BFT: Tendermint, HotStuff (used in Libra/Diem)
- Proof of Work vs BFT: Performance and safety trade-offs
Module 50: Formal Verification (TLA+)
Module 50: Formal Verification (TLA+)
- Why testing isn’t enough: State space explosion
- TLA+ and PlusCal basics
- Modeling safety (invariants) vs liveness (progress)
- The TLC Model Checker and error traces
- Case studies: How AWS and MongoDB use TLA+
Track 3: Replication Strategies
How data is copied and kept consistent across nodes.Module 11: Single-Leader Replication
Module 11: Single-Leader Replication
- Synchronous vs Asynchronous replication
- Semi-synchronous replication
- Replication lag and its problems
- Read-your-writes, Monotonic reads, Consistent prefix
- Failover handling and split-brain prevention
- MySQL/PostgreSQL replication internals
Module 12: Multi-Leader Replication
Module 12: Multi-Leader Replication
- Use cases: Multi-datacenter, offline clients
- Conflict detection and resolution
- Last-write-wins (LWW) and its problems
- Custom conflict resolution logic
- CockroachDB and TiDB approach
Module 13: Leaderless Replication
Module 13: Leaderless Replication
- Read/write quorums (R + W > N)
- Sloppy quorums and hinted handoff
- Anti-entropy: Read repair, Merkle trees
- Dynamo paper deep dive
- Cassandra consistency levels
Module 14: Conflict Resolution
Module 14: Conflict Resolution
- Application-level resolution
- Version vectors
- LWW strategies and pitfalls
- Merge functions
- Operational transformation (Google Docs)
Module 15: CRDTs
Module 15: CRDTs
- Operation-based vs State-based CRDTs
- G-Counter, PN-Counter
- G-Set, 2P-Set, OR-Set
- LWW-Register, MV-Register
- CRDT-based databases (Riak, Redis CRDT)
- Performance and memory implications
Track 4: Distributed Transactions
Maintaining data integrity across multiple nodes.Module 16: ACID in Distributed Systems
Module 16: ACID in Distributed Systems
- Local vs Distributed transactions
- Isolation levels: Read uncommitted → Serializable
- Snapshot isolation and write skew
- Serializable Snapshot Isolation (SSI)
Module 17: Two-Phase Commit (2PC)
Module 17: Two-Phase Commit (2PC)
- Prepare and Commit phases
- Coordinator failures and blocking
- Participant failures and recovery
- 2PC in practice: XA transactions
- Why 2PC is often avoided (performance, availability)
Module 18: Three-Phase Commit (3PC)
Module 18: Three-Phase Commit (3PC)
- Pre-commit phase addition
- Non-blocking under certain failures
- Why 3PC isn’t commonly used
- Network partition problems
Module 19: Saga Pattern
Module 19: Saga Pattern
- Choreography vs Orchestration
- Compensating transactions
- Semantic locks and countermeasures
- Saga execution coordinator
- Saga pattern in microservices
- Temporal.io and Cadence workflows
Module 20: TCC Pattern
Module 20: TCC Pattern
- Two-phase approach at application level
- Resource reservation
- Timeout handling
- When to use TCC vs Saga
Module 21: Distributed Locking
Module 21: Distributed Locking
- Single-node locks in distributed systems (Redis SETNX)
- Redlock algorithm and its critique
- Fencing tokens for safety
- Zookeeper-based locks
- Lease-based locking
Track 5: Data Systems at Scale
Partitioning, storage, and processing at massive scale.Module 22: Partitioning Strategies
Module 22: Partitioning Strategies
- Key-range partitioning
- Hash partitioning
- Hybrid approaches
- Secondary indexes: Local vs Global
- Rebalancing strategies
- Hot spots and skew handling
Module 23: Consistent Hashing
Module 23: Consistent Hashing
- Basic consistent hashing
- Virtual nodes for load balancing
- Bounded-load consistent hashing
- Jump consistent hashing
- Rendezvous hashing (HRW)
Module 24: Distributed Databases Deep Dive
Module 24: Distributed Databases Deep Dive
- Spanner: TrueTime, external consistency, Paxos groups
- CockroachDB: Raft, serializable isolation, SQL distribution
- TiDB: Raft + Percolator, hybrid OLTP/OLAP
- Cassandra: Gossip, consistent hashing, tunable consistency
- DynamoDB: Leaderless, GSI, adaptive capacity
- MongoDB: Raft-based replication, sharding
Module 25: Distributed Storage Systems
Module 25: Distributed Storage Systems
- GFS/HDFS architecture
- Object storage (S3 architecture)
- Erasure coding for durability
- Ceph architecture
- Tiered storage strategies
Module 26: Stream Processing
Module 26: Stream Processing
- Event sourcing and event-driven architecture
- Kafka internals: Partitions, consumer groups, exactly-once
- Stream processing: Flink, Kafka Streams
- Windowing: Tumbling, Sliding, Session
- Watermarks and late data handling
- Exactly-once semantics in streaming
Track 6: Production Excellence
Operating distributed systems at scale.Module 27: Observability at Scale
Module 27: Observability at Scale
- Distributed tracing (Jaeger, Zipkin, OpenTelemetry)
- Metrics aggregation at scale
- Log aggregation and analysis
- Correlation across services
- SLIs, SLOs, and error budgets
Module 28: Chaos Engineering
Module 28: Chaos Engineering
- Chaos Monkey and the Simian Army
- Designing chaos experiments
- Blast radius control
- Failure injection frameworks (Litmus, Chaos Mesh)
- Game days and runbooks
Module 29: SRE Practices
Module 29: SRE Practices
- Toil reduction and automation
- On-call best practices
- Postmortem culture (blameless)
- Error budgets and release velocity
- Progressive rollouts
Module 30: Advanced Resiliency Patterns
Module 30: Advanced Resiliency Patterns
- Static Stability and over-provisioning
- Cell-based Architectures and blast radius control
- Dependency Isolation (The Bulkhead Pattern)
- Avoiding control-plane dependencies in recovery paths
Module 31: Incident Management
Module 31: Incident Management
- Incident response playbooks
- Communication during outages
- Escalation procedures
- Root cause analysis
- Learning from failures
Module 32: Capacity Planning
Module 32: Capacity Planning
- Load testing strategies
- Capacity modeling
- Performance regression detection
- Autoscaling strategies
- Cost optimization at scale
Track 7: Clock Synchronization (Advanced)
Time is the foundation of distributed systems. Master clock synchronization for Staff+ expertise.Module 33: Clock Synchronization Protocols
Module 33: Clock Synchronization Protocols
- NTP architecture and stratum levels
- PTP/IEEE 1588 for microsecond precision
- Clock drift detection and correction
- Network asymmetry compensation
- Monitoring clock health in production
Module 34: Logical and Vector Clocks
Module 34: Logical and Vector Clocks
- Lamport timestamps and happened-before relationship
- Vector clocks for precise conflict detection
- Comparison rules and concurrency proofs
- Implementation in DynamoDB and Riak
Module 35: Hybrid Logical Clocks
Module 35: Hybrid Logical Clocks
- HLC design and implementation
- Timestamp encoding strategies
- CockroachDB’s MVCC with HLC
- Causality tracking with bounded skew
- HLC vs Vector Clocks trade-offs
Module 36: TrueTime and Atomic Clocks
Module 36: TrueTime and Atomic Clocks
- GPS time transfer and accuracy bounds
- Atomic clock drift characteristics (Rubidium vs Cesium)
- TrueTime API:
TT.now(),TT.after(),TT.before() - Uncertainty intervals and commit-wait protocol
- How Spanner uses TrueTime for external consistency
Track 8: Fault Tolerance Patterns
Building resilient systems that survive failures.Module 37: Circuit Breaker Pattern
Module 37: Circuit Breaker Pattern
- State machine: Closed → Open → Half-Open
- Failure threshold configuration
- Timeout and retry integration
- Hystrix and Resilience4j implementations
- Monitoring circuit breaker health
Module 38: Bulkhead Isolation
Module 38: Bulkhead Isolation
- Thread pool isolation patterns
- Semaphore-based bulkheads
- Connection pool partitioning
- Process-level isolation
- Kubernetes resource limits as bulkheads
Module 39: Retry Strategies
Module 39: Retry Strategies
- Exponential backoff with jitter
- Retry budgets and thundering herd prevention
- Idempotency keys for safe retries
- Deadline propagation across services
- Distinguishing transient vs permanent failures
Module 40: Graceful Degradation
Module 40: Graceful Degradation
- Feature flags for degradation
- Fallback strategies and stale data serving
- Load shedding and admission control
- Quality-of-service tiering
- Netflix’s fallback hierarchies
Module 41: Timeout Patterns
Module 41: Timeout Patterns
- Connection vs read vs write timeouts
- Timeout cascades and deadline propagation
- Context cancellation across service boundaries
- Calculating appropriate timeout values
- Timeout vs circuit breaker interaction
Track 9: Distributed Caching
Caching patterns for high-performance distributed systems.Module 42: Cache Strategies
Module 42: Cache Strategies
- Cache-aside (lazy loading) pattern
- Read-through and write-through caching
- Write-behind (write-back) caching
- Refresh-ahead pattern
- Cache eviction policies (LRU, LFU, TTL)
Module 43: Cache Invalidation
Module 43: Cache Invalidation
- TTL-based invalidation strategies
- Event-driven invalidation with Kafka/pub-sub
- Tag-based cache invalidation
- Cascading invalidation patterns
- Cache versioning strategies
Module 44: Redis & Memcached Architecture
Module 44: Redis & Memcached Architecture
- Redis Cluster architecture and slot migration
- Redis Sentinel for high availability
- Memcached consistent hashing
- Memory management and eviction
- Replication lag and read consistency
- Redis vs Memcached decision framework
Module 45: CDN Caching
Module 45: CDN Caching
- CDN architecture and PoP design
- Cache-Control header strategies
- Origin shield and tiered caching
- Cache purging at scale
- Dynamic content caching patterns
Module 46: Cache Stampede Prevention
Module 46: Cache Stampede Prevention
- Locking and mutex patterns
- Probabilistic early expiration
- Request coalescing
- Background refresh strategies
- Circuit breaker integration
Special Track: Real-World Case Studies
Learn from production systems at scale.Google Spanner Architecture
Google Spanner Architecture
- TrueTime and external consistency
- Paxos groups and data placement
- Lock-free read-only transactions
- Schema changes without downtime
- Real failure stories and lessons learned
Amazon Dynamo & DynamoDB
Amazon Dynamo & DynamoDB
- Consistent hashing with virtual nodes
- Vector clocks and conflict resolution
- Sloppy quorums and hinted handoff
- Evolution from Dynamo to DynamoDB
- Global Tables and cross-region replication
Netflix Resilience Architecture
Netflix Resilience Architecture
- Chaos Monkey and Simian Army
- EVCache and caching at scale
- Zuul gateway and load shedding
- Failure injection testing
- Multi-region active-active deployment
Uber's Real-Time Systems
Uber's Real-Time Systems
- Ringpop for membership and routing
- Cadence/Temporal workflow orchestration
- Geospatial indexing at scale
- Real-time dispatch and matching
- Multi-region failover strategies
Staff+ Interview Practice Problems
Curated problems for senior-level interviews.Global Rate Limiter
Distributed Transaction Coordinator
Real-Time Leaderboard
Multi-Region Database
Capstone Projects
Apply everything you’ve learned.Project 1: Distributed KV Store
- Raft-based replication
- Consistent hashing for partitioning
- Read/write quorums
- Snapshot and recovery
Project 2: Implement Raft
- Leader election
- Log replication
- Membership changes
- Persistence and recovery
Project 3: Distributed Lock Service
- Ephemeral nodes
- Watch mechanism
- Sequential ordering
- Lock implementation
Project 4: Mock Interviews
- Design Uber’s dispatch system
- Design Stripe’s payment processing
- Design Netflix’s CDN
- Design Twitter’s timeline
Key Papers to Read
Essential reading for deep understanding:| Paper | Why It Matters |
|---|---|
| Dynamo (Amazon) | Leaderless replication, vector clocks, eventual consistency |
| Spanner (Google) | TrueTime, globally consistent transactions |
| Raft (Stanford) | Understandable consensus |
| Paxos Made Simple (Lamport) | The original consensus paper |
| MapReduce (Google) | Distributed computation paradigm |
| Kafka (LinkedIn) | Distributed log architecture |
| Time, Clocks (Lamport) | Logical time foundations |
| CALM Theorem | Consistency without coordination |
| FLP Impossibility | Limits of distributed consensus |
| Harvest/Yield | Practical CAP trade-offs |
Interview Preparation Strategy
Master the Theory (Weeks 1-6)
- CAP/PACELC intuition
- Raft protocol (draw from memory)
- Replication trade-offs
Go Deep on Transactions (Weeks 7-9)
- Design a saga for any use case
- Explain 2PC failure modes
- Discuss distributed locking trade-offs
Study Real Systems (Weeks 10-12)
- How Spanner achieves global consistency
- How Kafka provides exactly-once
- When to use Cassandra vs PostgreSQL
Master Advanced Patterns (Weeks 13-15)
- TrueTime and clock synchronization
- Circuit breakers and fault tolerance
- Distributed caching patterns
Study Case Studies (Week 16)
- Google Spanner, Amazon Dynamo
- Netflix resilience, Uber real-time
- Learn from actual failure post-mortems
Who This Course Is For
- Target Audience
- Prerequisites
- Time Commitment
- Senior Engineers (4+ years) aiming for Staff/Principal
- Backend Engineers wanting deep distributed systems knowledge
- Infrastructure Engineers building platforms
- Anyone targeting FAANG/top-tier companies
Ready to Begin?
Start with Foundations
Jump to Case Studies
Practice Interview Problems
Master Fault Tolerance
Interview Deep-Dive
You are designing a new distributed system from scratch. Walk me through how you decide between a CP and an AP architecture.
You are designing a new distributed system from scratch. Walk me through how you decide between a CP and an AP architecture.
- The first thing I clarify is what the system’s core invariants are. For a payment ledger, an incorrect balance is catastrophic, so I lean CP. For a social media feed, a user seeing a slightly stale timeline for a few seconds is acceptable, so AP makes sense.
- I never frame it as a binary choice for the entire system. Most real systems are a mix: the metadata service might be CP (using Raft-backed etcd), while the content-delivery layer is AP (eventual consistency with CDN caching).
- I also move beyond CAP to PACELC, because CAP only describes behavior during partitions. During normal operation, the trade-off is between latency and consistency. A system like Spanner is PC/EC — it chooses consistency even when there is no partition, accepting higher latency for correctness. DynamoDB is PA/EL — it prioritizes availability during partitions and low latency during normal operation.
- In practice, I would enumerate the critical user-facing flows, classify each by its tolerance for staleness or unavailability, and then assign the appropriate consistency level per flow rather than per system.
If you had to explain the CAP theorem to a VP of Engineering who is not deeply technical, how would you frame it -- and what would you warn them about?
If you had to explain the CAP theorem to a VP of Engineering who is not deeply technical, how would you frame it -- and what would you warn them about?
- I would say: “CAP means that when our network has a serious problem — a partition — we have to pick between two options. Option one: we stop serving some requests to make sure every answer is correct (consistency). Option two: we keep serving every request, but some answers might be outdated (availability). We cannot have both during the outage.”
- The warning I would give is that CAP is about the extreme case — a network partition. Most of the time, partitions are rare and brief. The real day-to-day trade-off is between latency and consistency (PACELC). A VP should care more about “how fast are we during normal operation?” than “what happens during an outage that occurs once a year?”
- I would also warn against the common misconception that “AP means our data is unreliable.” AP systems like DynamoDB or Cassandra are used by Amazon and Netflix for mission-critical workloads. They are not unreliable — they just define consistency differently and resolve conflicts after the fact.
You mentioned Raft and Paxos in the curriculum. When would you choose one over the other for a production system?
You mentioned Raft and Paxos in the curriculum. When would you choose one over the other for a production system?
- In almost every greenfield project today, I would default to Raft. The reason is understandability. Raft was explicitly designed to be easier to implement correctly, and correctness in consensus is everything. Bugs in consensus lead to data loss, split-brain, or silent corruption — the worst kind of production incidents.
- I would choose Paxos (specifically Multi-Paxos or a variant like EPaxos) in specific scenarios: when I need flexible quorum configurations, when I need leaderless consensus for lower latency in geo-distributed deployments, or when I am extending an existing system that already uses Paxos (like Google’s internal infrastructure).
- The practical trade-off: Raft has a single leader, which simplifies reasoning but creates a throughput bottleneck and a latency penalty for writes that must cross regions. EPaxos removes the leader but is significantly harder to implement and reason about.
- In production, I look at the ecosystem too. etcd (Raft) and Consul (Raft) have mature, battle-tested implementations. If I am building a coordination service, I would use one of these rather than implementing consensus from scratch.
Walk me through a real-world scenario where choosing the wrong consistency model caused a production incident.
Walk me through a real-world scenario where choosing the wrong consistency model caused a production incident.
- A well-known example is the early days of Amazon’s shopping cart. They chose strong consistency initially, which meant that during network hiccups between data centers, cart operations would fail or time out. Users would click “Add to Cart” and see an error. Amazon calculated that every failed cart addition had a direct revenue cost, so they moved to an eventually consistent model (Dynamo) where the cart would always accept writes, even if it meant occasionally showing duplicate items or slightly stale state. The trade-off was worth it: a cart with an extra item is a minor UI annoyance, but a cart that refuses to work loses a sale.
- Another example is the MongoDB Jepsen findings. Early versions of MongoDB claimed “strong consistency” but under network partitions, stale reads were possible because reads could be served by secondaries that had not replicated the latest write. This was not a bug in the consistency model itself but a mismatch between what was advertised and what was actually implemented — a common and dangerous failure mode.
- The lesson for interviews: always verify that your system’s actual behavior matches its claimed consistency model. Use tools like Jepsen, write linearizability checkers, and test under real failure conditions (network partitions, clock skew, process crashes).