Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Consistency Deep Dive
Linearizability vs Serializability
Most candidates confuse these — and interviewers love testing this distinction because it reveals whether you genuinely understand distributed consistency or just memorized surface-level definitions. The key: linearizability is about single operations on single objects appearing to happen atomically in real time. Serializability is about multi-operation transactions on multiple objects appearing to happen in some serial order (which may not correspond to real time). Strict serializability gives you both — and is what Google Spanner provides using GPS-synchronized atomic clocks (TrueTime), at the cost of cross-region write latency.Isolation Levels (Know All Four!)
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | ✓ | ✓ | ✓ |
| Read Committed | ✗ | ✓ | ✓ |
| Repeatable Read | ✗ | ✗ | ✓ |
| Serializable | ✗ | ✗ | ✗ |
Distributed Consensus Deep Dive
Leader Election: Why It’s Hard
Leader election sounds simple: “just pick one node to be in charge.” But the reason it is one of the hardest problems in distributed systems is that the very mechanism you would use to coordinate the election (the network) is the thing that fails. Imagine trying to elect a class president when students can only pass notes, some notes get lost, and some students might have already left the building without telling anyone. The note-passing is the problem you are trying to solve.Raft: The Algorithm You Should Know
Clock Synchronization
Clock synchronization is one of the sneakiest problems in distributed systems because wall clocks appear reliable — they give you a number that seems authoritative — but they silently drift. Think of it like three friends timing a race with unsynchronized stopwatches: each records a slightly different finish time, and there is no objective way to determine whose stopwatch is “right.” In distributed systems, this disagreement can corrupt conflict resolution, cause duplicate processing, or silently lose data when last-write-wins picks the wrong “last.”Why Wall Clocks Fail
Vector Clocks (Conflict Detection)
Data Partitioning Strategies
Data partitioning (sharding) is how you scale a database beyond what a single machine can handle — but the partition key choice is the most consequential and least reversible decision you will make. A bad partition key creates “hot partitions” (one shard drowning while others idle), forces expensive cross-partition queries on your hot path, or makes rebalancing a multi-day operational nightmare. The analogy: choosing a partition key is like organizing a library into separate buildings. If you split by the first letter of the author’s last name, the “S” building is overflowing (Smith, Singh, Suzuki…) while the “X” building is nearly empty. But if you split by a hash of the book’s ISBN, every building is equally full — at the cost of no longer being able to browse all books by the same author in one place.Partition Key Selection
Handling Hot Partitions
Exactly-Once Semantics
The Three Delivery Guarantees
Implementing Exactly-Once
Distributed Caching Patterns
Caching at scale introduces problems that simply do not exist with a single-server cache. The most dangerous: cache stampede (also called “thundering herd”), where a popular cache key expires and hundreds of servers simultaneously hit the database to re-populate it. At moderate scale this is an annoyance; at high scale it can take down your database and cascade into a full outage. The patterns below are battle-tested solutions from companies like Facebook, Twitter, and Netflix.Cache Stampede Prevention
Rate Limiting at Scale
Rate limiting is the seatbelt of distributed systems — you hope you never need it, but when a client misbehaves or a traffic spike hits, it is the difference between a graceful degradation and a cascading outage. The subtlety most engineers miss: rate limiting must be global, not per-server. If you have 10 API servers each allowing 100 requests/second from the same client, that client effectively gets 1,000 requests/second. Centralized rate limiting (typically via Redis) solves this, but introduces a new dependency on the rate limiter itself — which is why the most resilient implementations use local rate limiting as a fast first pass and centralized rate limiting for accuracy.Distributed Rate Limiting
CQRS (Command Query Responsibility Segregation)
CQRS separates read and write operations into different models, optimizing each for its specific use case. The core insight: in most systems, read and write workloads have fundamentally different characteristics. Writes need validation, consistency, and domain logic. Reads need speed, denormalization, and flexible queries. Trying to serve both through a single model forces painful compromises. Think of it like a library: the cataloging system (write side) carefully classifies books, enforces Dewey Decimal rules, and ensures no duplicates. The search terminals (read side) are optimized for patrons to find books fast — they might have denormalized data, multiple indexes, and even slightly stale information. You would not force librarians to catalog books through the search terminal, and you would not make patrons navigate the raw catalog system. The trade-off is operational complexity: you now maintain two models and a synchronization mechanism between them (usually events). CQRS is overkill for simple CRUD applications, but it shines for systems with high read-to-write ratios (100:1 or more), complex read queries that differ significantly from write structures, or requirements for different scaling characteristics on reads vs writes.- Python
- JavaScript
Event Sourcing
Event sourcing stores all changes as a sequence of events rather than overwriting current state, providing a complete audit trail and enabling time-travel debugging. Where a traditional database says “the account balance IS 525.” This seemingly small difference has profound implications: you can rebuild any past state by replaying events to a point in time, you get a full audit trail for free, and you can create new read models by replaying the event stream through new logic — without migrating any data. Event sourcing pairs naturally with CQRS (above): the write side appends events to an immutable log, and one or more read-side projections consume those events to build materialized views optimized for queries. The trade-off is complexity: your system now has eventual consistency between the event store and read models, you need snapshot strategies for aggregates with long event histories (an aggregate with 10 million events takes too long to replay from scratch), and schema evolution of events requires careful versioning since events are immutable once stored.- Python
- JavaScript
Interview Questions: Senior Level
How would you design for multi-region active-active?
How would you design for multi-region active-active?
- Data replication: Async replication between regions (eventual consistency)
- Conflict resolution: Last-write-wins (with vector clocks) or custom merge
- Routing: GeoDNS to route users to nearest region
- Failover: Health checks + automatic DNS failover
- Consistency: Accept that cross-region writes may conflict
- Latency vs consistency
- Cost of running in multiple regions
- Complexity of conflict resolution
How do you handle a database that can't keep up with writes?
How do you handle a database that can't keep up with writes?
- Batch writes: Accumulate and write in batches
- Write-behind cache: Write to Redis, async persist to DB
- Message queue: Queue writes, process at sustainable rate
- Sharding: Distribute writes across multiple DB nodes
- Different DB: Switch to write-optimized DB (Cassandra, ScyllaDB)
Explain how you'd implement distributed transactions
Explain how you'd implement distributed transactions
- First ask: “Do we really need distributed transactions?” Often can redesign.
- 2PC: Strong consistency, but blocking and slow
- Saga: Eventual consistency, compensating transactions
- Outbox pattern: Reliable event publishing with local transaction
How do you debug a latency spike in a distributed system?
How do you debug a latency spike in a distributed system?
- Observe: Check metrics dashboards (p99 latency by service)
- Trace: Use distributed tracing (Jaeger/Zipkin) to find slow span
- Correlate: Check if spike correlates with deployments, traffic, or GC
- Drill down: Once you find the service, check:
- CPU/memory usage
- DB query times (slow query log)
- Network latency between services
- Thread pool saturation
- Lock contention
How would you design a system that handles 1M requests per second?
How would you design a system that handles 1M requests per second?
- Back of envelope: 1M RPS = ~60K servers at 16 RPS each (conservative)
- Stateless compute: Horizontal scaling with load balancer
- Caching: Cache everything possible (aim for 99%+ cache hit)
- CDN: Serve static content from edge
- Database: Shard aggressively, read replicas
- Async: Queue non-critical work
- Network: 1M × 10KB = 10GB/s = 80Gbps (need multiple LBs)
- Compute: 1M / 10K (RPS per server) = 100 servers minimum
- Database: Can’t hit DB for every request, need 99%+ cache hit
Interview Deep-Dive
You are designing a global payment system. The PM says they want sub-100ms writes worldwide. Walk me through why that is physically impossible with strong consistency, and what you would actually propose.
You are designing a global payment system. The PM says they want sub-100ms writes worldwide. Walk me through why that is physically impossible with strong consistency, and what you would actually propose.
- Linearizable writes across regions: With a 3-node Raft cluster spanning US, EU, and APAC, a write from Singapore must reach a majority. If the leader is in US-East, that is ~160ms to Singapore and ~90ms to EU. The write latency floor is the second-fastest quorum member, so roughly 90ms best case — and that is before any application logic.
- What Spanner does: Google Spanner achieves external consistency using GPS-synchronized TrueTime clocks, but even Spanner reports single-digit millisecond writes only when the transaction’s data is colocated within a single region. Cross-region transactions in Spanner still take 100-200ms.
- What I would actually propose: Partition the data by geography. Payments originating in APAC are mastered in APAC, EU payments in EU. Each region runs its own Raft group with sub-10ms write latency. Cross-region reads can tolerate brief staleness (a merchant dashboard does not need real-time accuracy of a payment that happened 2 seconds ago in another continent). For the rare cross-region transaction (a US user paying an EU merchant), accept the latency hit on that specific path and use a Saga pattern for the settlement workflow.
Your team's Kafka consumers are processing events with exactly-once semantics using idempotency keys stored in Postgres. During a load test at 50K events/sec, you notice duplicate processing spiking. What is happening and how do you fix it?
Your team's Kafka consumers are processing events with exactly-once semantics using idempotency keys stored in Postgres. During a load test at 50K events/sec, you notice duplicate processing spiking. What is happening and how do you fix it?
- Consumer A reads event X, begins processing, writes to Postgres with idempotency key, but has not yet committed the Kafka offset.
- A Kafka rebalance triggers (perhaps because another consumer joined, or A’s heartbeat was slow under load). Consumer A loses its partition assignment.
- Consumer B picks up the partition, reads event X again (offset was not committed), checks Postgres for the idempotency key — and the timing matters here. If A’s Postgres transaction committed but the Kafka offset did not, you are fine (idempotency catches it). But if A’s Postgres transaction is still in-flight or was rolled back due to the rebalance interruption, B will not find the key and will process the event again.
max.poll.interval.ms setting (default 300 seconds, but effective throughput matters). If the consumer takes too long between polls — because it is doing synchronous Postgres writes for each event — Kafka assumes it is dead and triggers a rebalance.Fix, in order of impact:- Batch the idempotency writes: Instead of one Postgres round-trip per event, buffer 100-500 events and do a single batch INSERT ON CONFLICT. This reduces the per-event processing time and keeps the consumer polling frequently.
- Use the transactional outbox pattern: Write the idempotency key and the business result in the same Postgres transaction, then have a separate process commit Kafka offsets only after confirming the Postgres transaction committed.
- Tune Kafka consumer settings: Increase
max.poll.recordsand decreasemax.poll.interval.msappropriately. Usesession.timeout.ms= 10s andheartbeat.interval.ms= 3s to detect failures fast without false positives. - Consider Kafka transactions: Use Kafka’s built-in transactional API (
enable.idempotence=true+transactional.id) to achieve exactly-once between Kafka produce and consume, and handle the Postgres write separately with an outbox.
Walk me through how you would design the isolation level strategy for a large e-commerce platform. Not every table needs Serializable, and not every table can tolerate Read Uncommitted. How do you decide?
Walk me through how you would design the isolation level strategy for a large e-commerce platform. Not every table needs Serializable, and not every table can tolerate Read Uncommitted. How do you decide?
-
Serializable (or at minimum Repeatable Read with explicit locking): Inventory decrement on checkout. This is the classic “two users buy the last item” problem. If you use Read Committed, both transactions can read quantity=1, both decrement to 0, and you have oversold. You need either Serializable isolation or an explicit SELECT FOR UPDATE to prevent this. At high scale, I would actually avoid row-level locking entirely and use an atomic decrement:
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity >= 1, checking the affected row count. - Read Committed (the Postgres default): Order history queries, product catalog browsing, user profile reads. These are read-heavy, and a non-repeatable read (seeing a price change mid-transaction) is harmless — the user sees the updated price, which is correct behavior.
- Repeatable Read: Financial reporting and analytics queries that run for minutes. If you are generating a daily revenue report, you need a consistent snapshot — seeing some orders but not others because they committed during your query would produce incorrect totals. Postgres implements this efficiently with MVCC snapshots.
- Read Uncommitted: I almost never use this in practice. The one exception might be approximate analytics dashboards where you want maximum throughput and can tolerate seeing in-flight data. But even then, Read Committed is only marginally slower and avoids the confusion of dirty reads.
TransactionContext that each service method declares. The payment service always runs at Serializable for the actual charge, but the receipt generation that follows runs at Read Committed. The inventory service uses the atomic decrement pattern (no explicit isolation level needed because the atomicity is in the SQL statement itself). The reporting service uses Repeatable Read with long-running read-only transactions.Follow-up: Your Serializable transactions on the inventory table are causing lock contention and timeouts during flash sales. How do you fix this without dropping the isolation level?Three approaches, from simplest to most complex. First, use the atomic decrement pattern I mentioned — it avoids explicit locking entirely because the WHERE clause acts as a guard. Second, if you need Serializable for more complex invariants, shard the inventory by product_id so contention is per-product, not global. A flash sale for one product does not block purchases of other products. Third, for the actual flash sale scenario (1000 people buying 50 items), move the hot inventory count to Redis with DECR, let Redis handle the contention (single-threaded, no lock overhead), and reconcile back to Postgres asynchronously. You accept a brief window where Postgres is behind Redis, but Redis is the authoritative source for “is there stock left” during the sale.Estimate the cost and feasibility of adding cross-region replication to a service in AWS us-east-1 that handles 20K writes/sec and 100K reads/sec with a 50TB Postgres database.
Estimate the cost and feasibility of adding cross-region replication to a service in AWS us-east-1 that handles 20K writes/sec and 100K reads/sec with a 50TB Postgres database.
- Each write generates a WAL (Write-Ahead Log) record. At 20K writes/sec with an average WAL record of ~200 bytes, that is 4 MB/sec = ~345 GB/day of replication traffic.
- AWS cross-region data transfer: 7/day = ~$210/month just for ongoing replication.
- The initial seed: 50TB transferred cross-region at 1,000 one-time cost. But the bigger issue is time — at 5 Gbps sustained transfer, 50TB takes roughly 22 hours. During that time, the replica is not serving reads.
- Async replication across regions (us-east-1 to eu-west-1, ~80ms RTT): expect 80-200ms replication lag under normal load. This is fine for read replicas serving non-critical reads.
- Synchronous replication: every write now takes at least 80ms additional latency. At 20K writes/sec, this means each write occupies a connection for 80ms longer. Your connection pool needs to be roughly 20K * 0.08 = 1,600 additional connections just to maintain throughput.
- 100K reads/sec. If you route 50% to the EU replica, that is 50K reads/sec on a single Postgres instance, which is feasible with connection pooling and proper indexing but tight. You likely need 2-3 read replicas in EU.
- Async replication for the read replicas (accept 100-200ms lag).
- Do NOT make writes synchronous cross-region — the latency cost is too high for 20K writes/sec.
- If you need cross-region write availability (disaster recovery), use a warm standby that can be promoted in minutes, not a synchronous replica.
- Total monthly cost estimate: 3,000-5,000 (2-3 db.r6g.4xlarge RDS instances in EU) + ~4,000-5,500/month.
- CQRS: becomes valuable at roughly 10:1 or higher read-to-write ratio, or when read patterns diverge significantly from write patterns. Below that ratio, the synchronization overhead between read and write models is not worth the benefit.
- Event Sourcing: the event store grows linearly with write volume. At 10K writes/second with 1KB average event size, you generate roughly 850GB/day of event data. Snapshots every 100 events reduce replay time from O(n) to O(n/100) for aggregate reconstruction.
- Distributed Consensus (Raft): practical for up to 5-7 nodes. Beyond that, the leader must replicate every write to a majority, and AppendEntries RPCs become the bottleneck. For larger clusters, use multi-Raft (CockroachDB/TiKV style) where different data ranges have independent Raft groups.
- Vector Clocks: the vector grows with the number of nodes that have ever written. For systems with thousands of writers, consider hybrid logical clocks (HLC) or bounded vector clocks (Dynamo-style dotted version vectors) to cap metadata size.