Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
System Design Interview Questions (60+ Detailed Q&A)
How to Win the System Design Interview
Before diving into the questions, internalize the meta-skills that separate candidates who get hired from those who don’t.Senior vs Staff: What the Bar Actually Looks Like
Back-of-Envelope Estimation Techniques
Every system design interview involves numbers. Get comfortable with these: The Power-of-Two Cheat Sheet:- 1 million seconds = ~11.5 days
- 1 billion seconds = ~31.7 years
- 1 KB = 1,000 bytes (use powers of 10 for estimation, not 1,024)
- 1 MB = 1 million bytes, 1 GB = 1 billion bytes, 1 TB = 1 trillion bytes
- A single web server: 1K-10K requests/sec (depends on complexity)
- A single PostgreSQL instance: 5K-20K TPS (simple queries), 500-2K TPS (complex joins)
- Redis: 100K-200K ops/sec per instance
- Kafka: 100K-1M messages/sec per broker
- A single SSD: 10K-100K IOPS random, 500 MB/s sequential
- Network within a data center: 1-10 Gbps per machine
- Estimate daily active users (DAU) and actions/user/day
- Multiply to get daily writes
- Estimate average payload size per write
- Daily storage = daily writes x payload size
- Multiply by retention period for total storage
- Example: 50M DAU x 2 posts/day x 500 bytes/post = 50 GB/day = 18 TB/year (before replication)
Trade-Off Vocabulary You Must Know
CAP Theorem: During a network partition, choose Consistency or Availability. Partition tolerance is not optional — networks always fail eventually. The real question is: “What does my system do when a partition happens?” BASE (Basically Available, Soft state, Eventual consistency): The alternative to ACID for distributed systems. Not weaker — different. Appropriate when stale reads have low business cost. PACELC (the CAP extension most candidates don’t know): If there is a Partition, choose Availability or Consistency. Else (no partition), choose Latency or Consistency. This is more useful than CAP because it covers the normal (non-partition) case. DynamoDB is PA/EL. Spanner is PC/EC. Knowing PACELC signals senior-level distributed systems understanding. Reversibility: “Is this decision easy to change later?” Choosing a database is hard to reverse. Choosing a caching layer is easy. Spend more time on irreversible decisions.Reading the Interviewer’s Signals
- “What else could go wrong?” — They want failure modes. Talk about what happens when components crash, networks partition, or traffic spikes 10x.
- “How would you handle X at 100x scale?” — Your current design has a bottleneck they want you to find. Look for single points of failure and stateful components.
- “Let’s move on to…” — You’re spending too long on this area. Wrap up your current point and follow their lead.
- “That’s interesting, but what about…” — They’re steering you toward a specific concern. Take the hint.
- “How would you make this production-ready?” — They want observability, alerting, deployment strategy, runbooks.
- Silence after you finish a point — They want you to go deeper. Keep talking.
When to Pivot Your Design
- When your estimation shows the numbers don’t work (e.g., you designed for 1K QPS but the math shows 100K QPS)
- When the interviewer says “assume X is not available” — they’re testing your adaptability
- When you realize your design has a single point of failure you can’t mitigate
- When the interviewer asks a question that exposes a fundamental flaw — acknowledge it, don’t defend it: “Good point — that approach breaks under partitions. Let me restructure.”
Work-Sample: Practice Prompt
The Clarifying Questions Playbook
Strong candidates ask 3-5 clarifying questions before sketching anything. Here is the canonical list, organized by category. Pick 3-5 that shape the design most. Functional scope (what does the system actually do?):- “What are the core user actions we need to support? Read path vs write path?”
- “Is this feature for end users, internal tools, or both? (Different SLOs.)”
- “What is the MVP vs the full scope? What can we defer?”
- “How many users? DAU vs MAU? Peak concurrent?”
- “What is the read:write ratio? 100:1, 10:1, 1:1?”
- “What is the latency budget? p50 vs p99? (p99 drives architecture, not p50.)”
- “How much data do we retain, for how long?”
- “What is the consistency requirement? Strong, read-your-writes, eventual?”
- “What is the availability target? 99.9%, 99.99%, 99.999%?”
- “What is the current scale and expected growth over 2 years?”
- “Is traffic steady, bursty, or has strong daily/seasonal patterns?”
- “Geographic distribution — one region, multi-region, global?”
- “Are there specific technologies we must use or avoid? (Legacy DB, cloud provider restrictions.)”
- “What is the team size? A 3-person team cannot operate a 20-service microservice architecture.”
- “What is the cost budget? Back-of-envelope cost comparisons matter.”
- “Any compliance requirements? GDPR, HIPAA, PCI, SOC2?”
Interviewer Signal Decoder Ring
| Interviewer says… | What they actually mean |
|---|---|
| ”Interesting. What else?” | Your answer was shallow. Go deeper or consider another angle. |
| ”Why would you pick X over Y?” | You picked X without justifying. They want the trade-off articulated. |
| ”What happens if this component fails?” | Talk about failure modes; they think you are ignoring reliability. |
| ”Walk me through what happens when a user…” | They want you to trace a specific request end-to-end. |
| ”At 10x scale…” | Your design has a bottleneck they can see. Find it and solve it. |
| ”Is that the simplest solution?” | You’re over-engineering. Simpler is better. |
| ”What would you monitor?” | They want SLOs, alerts, dashboards, not just metrics. |
| ”Have you seen this in production?” | They want war stories and operational wisdom, not theory. |
| ”How would you roll this out?” | Talk about deployment strategy, feature flags, canaries, rollback. |
| Long silence after your answer | Not enough. Add another layer — failure mode, cost, trade-off. |
| ”Let’s say we can’t use X” | They are testing adaptability. Redesign without that component. |
| ”Draw the request flow” | They want boxes and arrows, not prose. |
Weak vs Strong: Common Design Moments
| Moment | Weak candidate | Strong candidate |
|---|---|---|
| Opening | Jumps into drawing immediately | Asks clarifying questions, quantifies scale, then drafts |
| Picking a DB | ”Let’s use Postgres” (no reason) | “Postgres because OLTP with complex queries at our scale fits in one node; I’d revisit at 50K TPS” |
| Caching | Adds Redis everywhere | ”Add Redis only on the hot path; measure the hit rate after adding — if <70%, pull it out” |
| Scale | ”Just add more servers" | "More servers past this point hits the DB; the real fix is partitioning by user_id” |
| Consistency | ”Everything must be consistent" | "Money is consistent, feed is eventual; here’s why per use case” |
| Failure | Doesn’t mention it | ”If the queue backs up, we shed load; if the DB fails over, we serve stale reads from cache with a TTL” |
| Cost | Doesn’t mention it | ”This design costs ~Y” |
| Tradeoffs | Presents one solution | ”Three options: A cheaper but slower, B faster but costlier, C matches your requirements at middle cost; I’d pick B” |
1. Core Concepts & Scalability
1. Vertical vs Horizontal Scaling
1. Vertical vs Horizontal Scaling
- Vertical Scaling (Scale Up): Add more CPU, RAM, or faster disks to a single machine. Think going from an
m5.xlargeto anm5.16xlargeon AWS. It is simple — your application code does not change, there is no distributed coordination, and a single-node database stays ACID-compliant trivially. The ceiling is real though: the largest EC2 instance (u-24tb1.metal) tops out at 24 TB RAM and 448 vCPUs. Beyond that, you are stuck. And you have a single point of failure — if that machine dies, everything dies. - Horizontal Scaling (Scale Out): Add more machines behind a load balancer. Theoretically infinite scale. Netflix, Google, and every hyperscaler runs this way. The trade-off is complexity: you now need load balancing, data partitioning, distributed consensus, and you have to deal with partial failures (one node dies, the others keep going).
- The real-world nuance: Most startups should scale vertically first. A single PostgreSQL instance on a beefy machine can handle 10K+ TPS and tens of millions of rows comfortably. Premature horizontal scaling adds operational overhead that kills small teams. The sweet spot is vertical until you hit a wall (usually around 20K/month on a single instance), then go horizontal for the bottleneck layer only.
- Example: Basecamp famously runs on a small number of very powerful servers. They scaled vertically for years before needing to shard. Meanwhile, Twitter had to go horizontal early because their write volume was inherently distributed.
- You have a monolithic app on one server hitting CPU limits at peak. Walk me through your decision process — scale up or out? Start with profiling. Is the CPU bottleneck in the application code (inefficient queries, N+1 problems, missing caching) or genuine load? If genuine, scale up first — double the CPU cores, it is a 10-minute operation with zero code changes. If you are already on a large instance and still peaking, then you break the monolith’s hot path into a separately scalable service. The decision hinges on whether the bottleneck is stateless (easy to scale out) or stateful (database — much harder).
- How does horizontal scaling interact with database design? This is where it gets hard. Stateless app servers scale horizontally trivially — spin up more containers behind an ALB. The database layer is the real challenge. You need read replicas for read-heavy workloads, sharding for write-heavy ones, and connection pooling (PgBouncer, ProxySQL) to avoid exhausting database connections as app instances multiply. A common failure mode is scaling app servers to 50 instances, each opening 20 DB connections, and your database drowns under 1000 connections.
- What is the cost model difference in cloud environments? Vertical scaling follows a roughly linear price curve up to a point, then becomes exponential — the top-tier instances cost disproportionately more per unit of compute. Horizontal scaling has a more linear cost curve but adds hidden costs: load balancers (50M ARR, the engineering time cost of horizontal scaling dwarfs the infrastructure savings.
r6g.4xlarge (128 GB RAM). Do you scale up the DB or add read replicas? If the bottleneck is read-heavy, add read replicas first — zero code changes for most ORMs with a read/write split. If write-heavy, you’ve hit the harder problem. Connection pooling (PgBouncer) buys time. Sharding is the last resort. Always profile first — most “database bottlenecks” are actually 3-4 unoptimized queries consuming 80% of resources.
5. What observability would you put in place before you need to scale? Track request latency percentiles (p50, p95, p99), CPU/memory utilization per instance, database connection pool usage, and queue depth. Set alerting thresholds at 70% utilization so you have time to react. Teams that scale reactively (after an outage) spend 3-5x more than those who scale proactively from metrics.
6. At what point does the cost of horizontal scaling justify rewriting the application? When operational complexity dominates engineering bandwidth. If your team spends more than 30% of time managing infrastructure instead of shipping features, it’s time to evaluate managed services (Aurora Serverless, DynamoDB) or a platform team. Shopify hit this inflection point around 2018 and invested heavily in their internal platform.Real-World Example: Shopify ran a monolithic Rails app on a beefy MySQL primary for years. They scaled vertically until ~2016 when write throughput on Black Friday pushed the box to its ceiling. Rather than sharding the whole monolith, they introduced “pod” architecture — sharding only the checkout path by shop ID while leaving admin and reporting on the original primary. They bought two more years of runway this way before investing in full horizontal scale-out.- Q: Can you scale a stateful service horizontally without sharding? A: Briefly, with read replicas — but writes still funnel to one primary. True horizontal scale of writes requires partitioning data across nodes, which is sharding by another name.
- Q: What single metric tells you it’s time to stop scaling vertically? A: CPU steady-state above 70% at normal traffic, with no room to absorb a 2x peak. Past that point the machine has no headroom for failover or spikes.
- highscalability.com — case studies of vertical-first architectures (Stack Overflow, Basecamp)
- martin.kleppmann.com — Designing Data-Intensive Applications, Chapter 6 on partitioning
2. CAP Theorem
2. CAP Theorem
- The theorem states: In a distributed data store, when a network partition occurs, you must choose between Consistency (every read returns the most recent write or an error) and Availability (every request returns a non-error response, though it may be stale). Partition Tolerance means the system keeps operating despite messages being lost or delayed between nodes.
- Why P is mandatory: Network partitions will happen in any distributed system — switches fail, cables get cut, cloud AZs lose connectivity. You cannot “choose” to not have partitions. So the real choice is always CP vs AP during a partition event. When there is no partition, you get both C and A.
- CP in practice (Banking, ZooKeeper, etcd): During a partition, the system refuses to serve requests rather than risk returning stale data. A bank transfer system would rather return an error (“service unavailable”) than let you withdraw money twice. ZooKeeper refuses reads if it cannot confirm it has the latest state — this is why it is used for leader election and distributed locks.
- AP in practice (Cassandra, DynamoDB, DNS): During a partition, the system keeps serving requests but may return stale data. A social media feed showing a post from 5 seconds ago instead of 2 seconds ago is acceptable. Cassandra lets you tune this with consistency levels —
ONEfor AP,QUORUMfor closer to CP. - The PACELC extension: Daniel Abadi argued CAP only covers the partition case. PACELC adds: Else (when no partition), do you optimize for Latency or Consistency? DynamoDB is PA/EL — during partition favor availability, else favor low latency. A traditional RDBMS with sync replication is PC/EC — always favor consistency.
-
Can you give me an example where a system switches between CP and AP behavior dynamically?
CockroachDB is a good example. Under normal operation, it provides serializable consistency (CP). But it has a “follower reads” feature where you can opt into slightly stale reads for lower latency (AP behavior) for specific queries. Similarly, Cassandra lets you choose consistency per query —
QUORUMreads for account balance checks,ONEreads for displaying a user’s avatar. - How does CAP apply to microservices that call each other synchronously? When Service A calls Service B synchronously and Service B is down, you have effectively chosen CP for that interaction — you get consistency (you do not serve stale data) but lose availability. To get AP behavior, Service A needs a local cache or fallback. This is why Netflix built Hystrix — circuit breakers with fallback responses let services degrade gracefully (AP) instead of cascading failures (neither C nor A).
- A PM asks you to build a feature that needs both strong consistency AND five-nines availability globally. What do you tell them? You explain the fundamental trade-off and negotiate. “Which matters more — correctness or uptime?” For a payment system, you might use strong consistency for the transaction itself but eventual consistency for the balance display. For a shopping cart, you can use CRDTs (conflict-free replicated data types) that give you both — Amazon’s Dynamo paper showed how shopping carts can merge conflicts automatically, giving you availability without losing data.
QUORUM takes 15ms, but a read at ONE takes 3ms. The PM wants the fast version. What’s your advice? This is the EL vs EC decision in PACELC. At ONE, you sacrifice consistency — you might read stale data. The question is: what’s the business cost of a stale read? For a product catalog, 3ms with ONE is fine — the price might be 5 seconds stale. For an account balance, use QUORUM and accept the 15ms. Frame this as a business decision, not a technical one.- Q: Is a single-node Postgres affected by CAP? A: No — CAP only applies to distributed systems with multiple nodes. A single instance is just C and A with no P to worry about.
- Q: Can a system change its CAP classification at runtime?
A: Yes. Cassandra lets you set consistency level per query (
ONE= AP,QUORUM= effectively CP). The system is flexible; individual operations make the choice. - Q: Why do people say “CA databases” when CAP says you can’t have that? A: Because non-distributed databases are often labeled CA for marketing. Technically there’s no P to sacrifice when there’s only one node — the label is misleading.
- martin.kleppmann.com — “A Critique of the CAP Theorem” paper
- highscalability.com — Amazon’s Dynamo paper walkthrough
3. ACID vs BASE
3. ACID vs BASE
- ACID (Relational Databases — PostgreSQL, MySQL InnoDB):
- Atomicity: A transaction is all-or-nothing. If you transfer $100 from Account A to B, either both the debit and credit happen, or neither does. Implemented via write-ahead logs (WAL) — changes are logged before applied, so the DB can roll back on crash.
- Consistency: The database moves from one valid state to another. Constraints (foreign keys, unique indexes, check constraints) are enforced. You cannot insert an order referencing a customer that does not exist.
- Isolation: Concurrent transactions do not interfere. Isolation levels range from
READ UNCOMMITTED(dirty reads possible, fast) toSERIALIZABLE(no anomalies, slow). Most production systems useREAD COMMITTED(PostgreSQL default) orREPEATABLE READ(MySQL InnoDB default) as a practical middle ground. - Durability: Once committed, data survives crashes. The WAL is fsynced to disk before the commit is acknowledged.
- BASE (NoSQL — Cassandra, DynamoDB, MongoDB):
- Basically Available: The system responds to every request, possibly with stale data.
- Soft State: The state may change over time, even without new input, as replicas converge.
- Eventual Consistency: Given enough time without new writes, all replicas will converge to the same value.
- The real trade-off: ACID gives you correctness guarantees at the cost of throughput and latency. A PostgreSQL transaction that touches rows on multiple shards requires two-phase commit (2PC), which is slow. BASE gives you throughput and availability at the cost of reasoning complexity — your application code must handle stale reads and conflicts. Most real systems mix both: ACID for the money path, BASE for the activity feed.
- Your e-commerce platform uses PostgreSQL for orders and Cassandra for product catalog. A customer places an order for a product that was just deleted from the catalog. How do you handle this? This is the classic cross-system consistency problem. You cannot do a distributed ACID transaction across Postgres and Cassandra. Options: (a) Event-driven saga — the order service publishes an OrderCreated event, the catalog service validates the product exists and publishes ProductValidated or ProductNotFound, the order service compensates (cancels) if invalid. (b) Cache the product data at order time — store a snapshot of the product in the order record so the order is self-contained. (c) Soft-delete products with a grace period — never hard-delete, mark as unavailable, clean up after 30 days. Most production systems use a combination of (b) and (c).
-
Explain isolation levels and when you would intentionally choose a weaker one.
SERIALIZABLEprevents all anomalies but uses predicate locks that tank throughput. In a reporting query that reads millions of rows, usingSERIALIZABLEwould lock out writers for minutes. Instead, you useREAD COMMITTEDor evenSNAPSHOT ISOLATION(PostgreSQLREPEATABLE READ). The key insight: for analytics/reporting, slightly stale data is fine. For financial transactions, you wantSERIALIZABLEor at minimumREPEATABLE READwith explicit row-level locks (SELECT ... FOR UPDATE).
PENDING state, publishes an OrderCreated event to Kafka. The shipping service consumes it, reserves inventory, and publishes ShipmentReserved. The order service transitions to CONFIRMED. If the shipping service fails, it publishes ShipmentFailed, and the order service runs a compensating transaction (cancel the order, refund the charge). This is BASE in action — the system is temporarily inconsistent but converges. Uber, Airbnb, and most microservice architectures use sagas.
4. What’s the most dangerous consistency bug you can imagine in production? A read replica serves a stale “account balance = 0. A second debit of 400 out of thin air. This is why financial reads must hit the primary or use synchronous replication. Banks that moved to eventually consistent systems without understanding this have suffered real losses.What weak candidates say: “ACID is for relational, BASE is for NoSQL.” What strong candidates say: “The boundary is blurring — Spanner gives you globally distributed ACID, and MongoDB 4.0+ supports multi-document transactions. The real decision axis is your tolerance for temporary inconsistency and whether your application logic can handle conflict resolution.”Real-World Example: Stripe uses Postgres (ACID) for every dollar that moves — the ledger is strictly serializable. But their webhook delivery system is BASE-style: an event might arrive twice or out of order, and consumers are expected to be idempotent. Same company, same request lifecycle, two completely different consistency models matched to the cost of being wrong.- Q: Can MongoDB fully replace Postgres now that it has multi-document transactions? A: For many workloads, yes — but at a throughput cost. MongoDB transactions span replica sets using 2PC-like coordination; they’re slower than Postgres’s local ACID. Use them, don’t abuse them.
- Q: What’s the simplest way to make a BASE consumer safe? A: Idempotency keys. Tag every message with a unique ID; the consumer records “processed” keys and skips duplicates. This turns at-least-once delivery into effectively-exactly-once at the application layer.
- martin.kleppmann.com — DDIA Chapter 7 on transactions
- postgresql.org/docs —
SET TRANSACTION ISOLATION LEVELdocumentation
4. Load Balancer Algorithms
4. Load Balancer Algorithms
- Round Robin: Requests go to servers sequentially: S1, S2, S3, S1, S2, S3… Simple and works well when servers are identical and requests are roughly equal cost. Fails when some requests are expensive (a video transcoding job vs a health check) — one server gets unlucky with heavy requests while others idle.
- Weighted Round Robin: Assign weights based on server capacity. A 16-core machine gets weight 4, an 8-core gets weight 2. The LB sends 4 requests to the big machine for every 2 to the small one. Useful during rolling deployments when old and new instances have different specs.
- Least Connections: Route to the server with the fewest active connections. Better than round robin for variable-cost requests because a server processing a slow request naturally accumulates connections and gets fewer new ones. This is the default for many production setups. Nginx uses this with
least_conn. - IP Hash: Hash the client IP to deterministically route to a server. Creates “sticky sessions” — the same user always hits the same server. Useful when server-side session state exists and you cannot externalize it to Redis. Downside: if a server dies, all its users get redistributed and lose sessions. Also creates uneven distribution if a large corporate NAT sends thousands of users from one IP.
- Least Response Time: Route to the server with the lowest average response time. Best for heterogeneous backends but requires the LB to track latency metrics. AWS ALB supports this as “least outstanding requests.”
- Random with Two Choices (Power of Two): Pick two random servers, send to the one with fewer connections. Surprisingly effective — proven mathematically to be exponentially better than pure random selection. Used in some service meshes.
- Your backend has 10 servers behind a round-robin LB. One server starts responding slowly (not failing, just slow). What happens and how do you fix it? Round robin keeps sending 10% of traffic to the slow server. Users hitting it see degraded performance, but the LB does not know — the server is not “down.” Fix: switch to least-connections or least-response-time. Add health checks with latency thresholds — if P99 latency exceeds 2s, mark unhealthy. Or use a circuit breaker at the client level. Envoy proxy handles this elegantly with outlier detection — it ejects hosts that deviate from the cluster’s baseline latency.
-
How does L4 vs L7 load balancing change your algorithm choice?
L4 (TCP level — AWS NLB) only sees IP and port, so it can do round robin, least connections, and IP hash. It cannot inspect HTTP headers or URLs. L7 (HTTP level — AWS ALB, Nginx) can do content-based routing: send
/api/v1/*to service A and/api/v2/*to service B. L7 can also do cookie-based sticky sessions (more reliable than IP hash for NAT scenarios). For gRPC or WebSockets, L7 is required because connections are long-lived and L4 round robin would just send all streams from one connection to one server. - You need to do a zero-downtime deployment. How does the LB algorithm affect your strategy? With round robin, you do rolling deployments: drain one server (stop sending new requests, let in-flight complete), deploy, add back. The LB needs connection draining support — ALB gives you a configurable deregistration delay (default 300s). With IP hash, draining is trickier because removing a server forces rehashing. Use consistent hashing to minimize disruption. Blue-green is simpler: spin up a parallel set of new servers, switch the LB target group atomically.
/health which returns 200 even though the database is down. Users get 500 errors. What’s wrong with your health check design? The health check is a liveness check, not a readiness check. It confirms the process is running but not that it can serve traffic. Fix: the /health endpoint should verify critical dependencies — run a lightweight DB query, check Redis connectivity, verify disk space. But don’t make it too deep or you create a fragile health check that flaps. The sweet spot: check the connection pool is healthy (fast) rather than running a full query (slow and fragile).
5. You’re running gRPC services. Why does L4 load balancing break, and what do you use instead? gRPC multiplexes many RPCs over a single HTTP/2 connection. An L4 load balancer sees one long-lived TCP connection and routes all RPCs to the same backend. You need L7 load balancing that can inspect individual gRPC frames. Options: Envoy (native gRPC support), Linkerd, or client-side load balancing with gRPC’s built-in round_robin or pick_first policies. Kubernetes headless services + client-side LB is the common pattern.Real-World Example: Dropbox runs their Nginx edge tier on least_conn because photo and file uploads have wildly variable durations — a round-robin scheme would pile multi-minute uploads onto unlucky servers. Their internal RPC mesh (gRPC) uses client-side round_robin over headless Kubernetes services because the request shape is uniform and latency matters more than sophistication.- Q: Why is IP hash a bad choice for a corporate network with a NAT gateway? A: Thousands of users exit from one public IP, so they all hash to the same server — defeating the load balancing entirely.
- Q: Can round-robin cause a cold start problem? A: Yes. A freshly added server gets traffic immediately, but its caches (JIT, connection pools, local caches) are empty so its p99 is much worse for the first 30-60 seconds. Use slow-start features (Envoy) to ramp up traffic gradually.
- highscalability.com — Dropbox’s edge architecture posts
- Cloudflare engineering blog — “Load Balancing at Cloudflare”
5. Consistent Hashing
5. Consistent Hashing
- The problem it solves: With naive hashing (
hash(key) % N), adding or removing a server changes N, which remaps almost every key. If you have a 100-node cache cluster and one node dies, roughly 99% of keys get remapped, causing a cache stampede that can take down your database. - How it works: Imagine a ring (hash space 0 to 2^32). Both servers and keys are hashed onto this ring. A key is assigned to the next server clockwise on the ring. When a server is added, it only takes keys from its immediate clockwise neighbor. When a server is removed, its keys move to the next server clockwise. Only ~1/N keys need to move (N = total servers).
- Virtual nodes (vnodes): Physical servers get multiple positions on the ring (e.g., 150 virtual nodes per physical server). This solves two problems: (a) uneven distribution — with few physical nodes, the hash ring can be lopsided; vnodes smooth it out. (b) heterogeneous hardware — a powerful server gets more vnodes, taking proportionally more keys. Cassandra uses 256 vnodes per node by default.
- Real-world usage: Cassandra (data partitioning), Amazon DynamoDB (partition assignment), Akamai CDN (content routing), Memcached (client-side consistent hashing via the
ketamaalgorithm), Discord (routing users to gateway servers).
-
You are using consistent hashing for a cache cluster. One node is getting 3x the traffic of others. Why, and how do you fix it?
Hot keys — a celebrity’s profile or a trending item maps to one node. Vnodes help with even distribution but do not help if a single key is hot (the key always maps to the same node). Fixes: (a) key replication — store hot keys on multiple nodes and load-balance reads across them; (b) local caching — each app server caches the top 100 hot keys in-process with a short TTL; (c) key salting — append a random suffix (
celebrity:123:shard_0throughcelebrity:123:shard_9) and read from a random shard. Instagram uses approach (c) for celebrity profiles. -
How does consistent hashing interact with replication?
In systems like DynamoDB and Cassandra, a key is not just placed on one node — it is replicated to the next N-1 nodes clockwise on the ring (where N is the replication factor, typically 3). Reads and writes use quorum (
R + W > N) for consistency. This means the “ring walk” to find responsible nodes is simple but the replication topology makes node addition more complex — you need to transfer data from multiple predecessor nodes, not just one. -
Can consistent hashing cause data loss? When?
Yes, during node removal if replication factor is 1 (no replication). Also, during rapid node additions/removals (“flapping”), keys can be in transit between nodes and neither node serves them — this is a brief unavailability window. Cassandra handles this with hinted handoff — if the target node is down, a neighbor temporarily stores the write as a “hint” and replays it when the node returns. But hints have a TTL (3 hours default) — if the node is down longer, you need full repair (
nodetool repair).
- Q: How do you handle a hot key under consistent hashing? A: The ring doesn’t help — a single key always maps to one node. Use key splitting (salt the key into N sub-keys) or replicate the hot key across multiple nodes and load-balance reads.
- Q: What breaks if your hash function has poor distribution?
A: Even with vnodes, clumped hash values create hotspot nodes. Use cryptographic hashes (MD5, SHA-1) or good non-crypto hashes (xxHash, MurmurHash) — not
Object.hashCode().
- highscalability.com — “How Discord Stores Trillions of Messages”
- martin.kleppmann.com — DDIA Chapter 6 on partitioning by hash
6. Database Sharding
6. Database Sharding
- Horizontal Sharding (most common): Split rows across databases by a shard key. Users with IDs 1-1M on Shard 1, 1M-2M on Shard 2 (range-based). Or
hash(user_id) % 4to distribute evenly (hash-based). Each shard is a complete database instance with the full schema. - Vertical Sharding: Split by table/feature. User profiles on DB1, orders on DB2, analytics on DB3. This is really just service decomposition by another name — microservices naturally do this.
- Choosing the shard key is the most critical decision: The wrong shard key causes hot shards (all writes to one shard) or makes queries impossible (need to scatter-gather across all shards). Good shard keys have high cardinality, even distribution, and align with your query patterns. For a multi-tenant SaaS app,
tenant_idis perfect — queries are almost always scoped to one tenant. For a social network,user_idworks for user-centric queries but makes “find all posts with hashtag X” a cross-shard nightmare. - Challenges in production:
- Cross-shard joins: Impossible at the database level. Your application must query both shards and join in memory, or you denormalize.
- Cross-shard transactions: Require distributed transactions (2PC), which are slow and fragile. Most teams avoid them entirely and use eventual consistency patterns (sagas).
- Rebalancing: When a shard gets too large, splitting it requires migrating data while serving live traffic. Vitess (YouTube’s MySQL sharding layer) automates this, but it took years to build.
- Global queries:
SELECT COUNT(*) FROM usersnow requires querying every shard and summing. Pagination across shards is particularly painful.
- Your sharded database has 8 shards and one shard has 10x the data of others. What happened and how do you fix it? Either the shard key has skewed distribution (range-based with a hot range) or organic data growth was uneven (one tenant grew massively). Immediate fix: split the hot shard into two. Long-term: switch to hash-based sharding for even distribution. If it is a hot tenant, consider giving them a dedicated shard (Slack does this for large workspaces). Vitess and Citus (Postgres extension) can do online shard splits.
-
How did Instagram handle sharding with PostgreSQL?
Instagram sharded PostgreSQL by user ID into thousands of logical shards mapped to fewer physical databases. Each logical shard ID was embedded in their Snowflake-style ID generation — the ID itself tells you which shard to query. They used PL/pgSQL functions to route queries. The key insight: they chose a shard key (
user_id) that aligned with 99% of their query patterns. - When should you NOT shard and what should you do instead? Before sharding, exhaust: (a) query optimization and indexing; (b) read replicas for read-heavy workloads; (c) caching (Redis/Memcached) for hot data; (d) vertical scaling; (e) archiving old data to cold storage. Sharding should be a last resort because it is irreversible in practice — once you shard, your entire application, ORM, migration tooling, and operational procedures change permanently.
user_id but now the product team wants a “global search” feature across all users. Your architecture doesn’t support cross-shard queries. What do you do? You don’t query across shards for search — you introduce a separate search index. Stream writes from all shards to Elasticsearch via CDC (Debezium). The search index is eventually consistent but handles the global query use case. This is the polyglot persistence pattern — sharded Postgres for transactional writes, Elasticsearch for search reads. Almost every sharded system at scale has a secondary index layer.
5. How do you handle database migrations (ALTER TABLE) across 64 shards? You cannot run ALTER TABLE simultaneously on 64 shards — any failure leaves you in a partially migrated state. Use a tool like gh-ost (GitHub) or pt-online-schema-change (Percona) that creates a shadow table, copies data, and atomically swaps. Run migrations sequentially across shards with automated rollback. Stripe’s approach: deploy code that’s compatible with both old and new schemas first, then run the migration, then deploy code that uses the new schema.- Q: What’s the alternative to sharding if your writes exceed one node? A: A write-optimized engine (Cassandra, ScyllaDB, DynamoDB) or a distributed SQL database (CockroachDB, Spanner, Yugabyte) that shards transparently.
- Q: Why is changing the shard key so painful? A: Every row has to move. Hash-based schemes require dual-write + backfill + cutover spanning weeks; the ORM, migrations, backup tooling all need updating. Treat the shard key as a one-way door.
- highscalability.com — “How Figma’s databases team lived to tell the scale”
- martin.kleppmann.com — DDIA Chapter 6 on rebalancing partitions
7. Caching Strategies
7. Caching Strategies
- Cache-Aside (Lazy Loading): The application checks the cache first. On a miss, it reads from the database, writes the result to the cache, then returns. The application controls all logic. This is the most common pattern (used with Redis/Memcached). Trade-off: First request after eviction is always slow (cold start). Stale data possible if the DB is updated without invalidating the cache.
- Read-Through: The cache itself is responsible for loading data from the DB on a miss. The application only talks to the cache. Trade-off: Cleaner application code, but the cache needs a data-loading plugin/callback. Same staleness issues as cache-aside.
- Write-Through: Every write goes to both cache and DB synchronously. Ensures cache and DB are always consistent. Trade-off: Write latency doubles (two writes per operation). Writes to data that is never read waste cache space. AWS DynamoDB Accelerator (DAX) uses this pattern.
- Write-Behind (Write-Back): Write to cache only, then asynchronously flush to DB in batches. Extremely fast writes. Trade-off: Risk of data loss if the cache crashes before flushing. Complexity in ordering and deduplication. Used in CPU caches (L1/L2) and sometimes in application-level write-heavy workloads.
- Write-Around: Write directly to DB, bypassing cache. Cache only gets populated on reads (via cache-aside). Trade-off: Avoids cache pollution with write-heavy data that is rarely read, but guarantees a cache miss on the first read after a write.
- The invalidation problem: “There are only two hard things in computer science: cache invalidation and naming things.” The core issue is: when the DB changes, how does the cache know? Options: TTL-based expiry (simple but stale for TTL duration), event-driven invalidation (DB change triggers cache delete via CDC/events — Debezium can stream Postgres WAL to Kafka to invalidate caches), or versioned keys (append a version number to cache keys, increment on write).
- You have a cache-aside setup and your DB gets updated by a background job that does not go through your app. Users see stale data for hours. How do you fix this? The background job bypasses the cache invalidation path. Options: (a) Have the background job also invalidate/update the cache. (b) Use Change Data Capture (CDC) — stream the DB’s write-ahead log (Debezium for Postgres/MySQL) to a Kafka topic, then a consumer invalidates cache keys. This decouples the invalidation from the writer. (c) Use short TTLs as a safety net. Production systems typically use (b) + (c) together.
-
How would you handle caching for data that is personalized per user vs data shared across all users?
Shared data (product catalog, feature flags) is cache-friendly — one entry serves millions of users. Use a centralized cache (Redis) with a moderate TTL. Personalized data (user dashboards, feeds) has low cache hit rates if stored centrally because each user’s data is unique. Strategies: (a) user-specific cache keys (
feed:user_123) with short TTLs; (b) precompute and cache on write (fanout) for very active data; (c) local in-process caching (Caffeine, Guava) for hot personalized data to avoid Redis round-trips. -
Your Redis cache uses 80 GB of memory and costs are mounting. How do you reduce it without hurting hit rates?
Analyze key-level memory usage with
redis-cli --bigkeysandMEMORY USAGE. Common wins: (a) compress values — store gzipped JSON, save 60-80% memory at the cost of CPU; (b) use Redis hashes for small objects instead of individual keys — Redis optimizes small hashes with ziplist encoding; (c) shorten key names; (d) audit TTLs — some keys may have no expiry and accumulate forever; (e) tiered caching — put the hottest 10% in Redis and the rest in a cheaper cache like Memcached or Redis on SSD.
user:*, product:*, session:*). Identify which key families have the lowest hit rates. Common culprits: (a) Keys with TTLs shorter than the access interval — a key with 30s TTL accessed every 60s has a 50% miss rate. (b) Large key spaces with low reuse — caching 10M user profiles when only 100K are active wastes memory and evicts hot keys. (c) Cold-start misses after deployments or cache restarts. Fix each pattern differently: extend TTLs for frequently-accessed keys, add a pre-warming job for critical keys, use tiered caching (L1 in-process for the hottest 1K keys, L2 Redis for the rest).
5. How do you test that your cache invalidation logic is correct? This is notoriously hard. Approach: (a) Integration tests that write to DB, trigger the invalidation path, and verify the cache reflects the change. (b) Shadow mode — run the application with and without cache, compare responses. (c) Production monitoring: track “stale read” events where a cache hit returns a value older than the DB value (sample 0.1% of reads). Airbnb runs a continuous “cache consistency checker” that randomly samples keys and compares cache vs DB.Work-sample prompt: “Your e-commerce site’s product page loads in 800ms. 500ms is a database query for product details that hasn’t changed in 3 months. What caching strategy do you implement, what TTL do you choose, and how do you handle the case where a product manager updates the price?”Real-World Example: Facebook’s TAO (the social graph cache) uses cache-aside with event-driven invalidation via their internal pub-sub system. When a user posts, the TAO write path invalidates affected cache entries across thousands of edge servers within ~10ms. They published a SIGMETRICS paper showing that their cache hit rate is ~96%, and the remaining 4% of misses are what drives nearly all their MySQL traffic.- Q: Why is write-behind caching dangerous for user-facing data? A: If the cache crashes before flushing, you lose committed writes. Fine for counters and metrics; a disaster for orders or payments.
- Q: How do you cache something that’s personalized per user? A: Short TTL (seconds to minutes), per-user key, and often an in-process L1 cache. Redis won’t help much if each user’s data is unique — you’re paying for network round-trips with minimal reuse.
- highscalability.com — Facebook TAO architecture deep-dive
- martin.kleppmann.com — DDIA Chapter 5 on replication & caching
8. Eviction Policies (LRU vs LFU)
8. Eviction Policies (LRU vs LFU)
- LRU (Least Recently Used): Evict the item that was accessed longest ago. Most common policy (Redis default). Works well for recency-biased workloads — “data accessed recently is likely to be accessed again.” Implementation: a doubly-linked list (ordered by access time) + a hashmap (O(1) lookup). On access, move the node to the head. On eviction, remove the tail.
- LFU (Least Frequently Used): Evict the item accessed the fewest times. Better for frequency-biased workloads where some items are always popular regardless of recency. Redis added LFU in 4.0 (
maxmemory-policy allkeys-lfu). Downside: a key that was popular yesterday but is no longer relevant stays cached because its frequency count is high — this is the “cache pollution” problem. Redis mitigates this by decaying frequency counts over time. - Other policies:
- FIFO: First in, first out. Simple but ignores access patterns entirely.
- Random: Surprisingly effective in practice — often within 10-15% of optimal. Used when the access pattern is unpredictable.
- TTL-based: Items expire after a set time regardless of usage. Prevents indefinitely stale data.
- ARC (Adaptive Replacement Cache): Balances between LRU and LFU dynamically. Used in ZFS and IBM’s storage systems. Patented, so less common in open-source.
- Redis specifically: Uses an approximated LRU (samples 5 keys and evicts the least recently used among them) rather than true LRU, because tracking exact LRU for millions of keys is memory-expensive. You can tune the sample size with
maxmemory-samples.
- Your cache uses LRU, but a nightly batch job scans all products and pollutes the cache, evicting hot user-session data. How do you fix this? The batch job touches every key once, making them “recently used” and pushing out genuinely hot data. Solutions: (a) switch to LFU — the batch job touches each key once (low frequency) while user sessions are accessed repeatedly (high frequency); (b) use separate cache instances for sessions and product data; (c) bypass the cache for the batch job entirely — have it read directly from the DB.
-
How would you implement LRU with O(1) operations for get and put?
Hashmap for O(1) key lookup, doubly-linked list for O(1) insertion/removal/reordering. On
get(key): look up in hashmap, move the node to the head of the list, return value. Onput(key, value): if key exists, update and move to head. If not, insert at head, and if over capacity, remove the tail node and delete it from the hashmap.
maxmemory-samples to 10 gets within 1% of true LRU. You’d notice the approximation failing when you have a large number of keys with very similar access times — the random sampling becomes less discriminating. For most workloads, the approximation is imperceptible.Real-World Example: Netflix’s Open Connect edge caches use a custom LFU variant they call LFRU (Least Frequently/Recently Used). Popular shows stay in cache for weeks; one-off content gets evicted quickly. They reported this change alone improved their cache hit rate from ~85% to ~95%, saving petabytes of origin-fetch traffic.- Q: What’s “cache pollution” in LFU? A: An item that was hot yesterday has a high frequency count but isn’t accessed today. LFU keeps it stuck in cache. Fix: LFU with decay (Redis does this).
- Q: When is FIFO acceptable as an eviction policy? A: Almost never for general caches — but fine for rate-limit counters or session data where you just want age-based expiry.
- highscalability.com — Netflix Open Connect architecture
- redis.io/docs/manual/eviction — canonical Redis eviction docs
9. CDN (Content Delivery Network)
9. CDN (Content Delivery Network)
- What it is: A geographically distributed network of edge servers that cache content close to end users. When a user in Tokyo requests a static asset, they hit the Tokyo edge server instead of the origin server in Virginia. Latency drops from 200ms to 20ms. Major CDNs: CloudFront (AWS), Cloudflare, Akamai, Fastly.
- What it caches: Static assets (images, CSS, JS bundles, fonts, videos), but modern CDNs also cache API responses, HTML pages, and even run compute at the edge (Cloudflare Workers, CloudFront Functions, Lambda@Edge).
- Push vs Pull CDN:
- Push: You proactively upload content to the CDN. Good for content you know in advance (video platforms uploading transcoded files). More control, but you manage the upload pipeline.
- Pull: The CDN fetches from your origin on first request, then caches. Good for web assets — zero operational overhead, just set
Cache-Controlheaders. First request per edge location is slow (cache miss goes to origin).
- Cache invalidation at CDN scale: This is where CDNs get tricky. You deploy new CSS but the CDN has the old version cached across 200+ edge locations. Solutions: (a) cache-busting with hashed filenames (
app.a3f2b1.css) — every deploy generates a new filename, so the CDN treats it as a new resource. This is the gold standard, used by Webpack/Vite. (b) Purge API — CloudFront’sCreateInvalidationpropagates across all edges in 5-15 minutes. (c) Short TTLs for frequently changing content withstale-while-revalidate. - Performance impact: A well-configured CDN typically reduces page load time by 40-60%, reduces origin server load by 70-90% for static-heavy sites, and improves availability. Netflix’s Open Connect CDN serves 95%+ of traffic from edge boxes installed directly in ISP data centers.
Cache-Control headers.Follow-up:-
Your CDN is serving stale API responses after a database update. How do you debug and fix this?
Check
Cache-Controlheaders on the API response — if the origin sendsCache-Control: public, max-age=3600, the CDN will serve stale data for up to an hour. Fix: for dynamic API responses, useCache-Control: private, no-cacheorno-store. Also check if there is aVaryheader mismatch — the CDN might be serving a cached response for a differentAccept-LanguageorAuthorizationcontext. -
How would you use a CDN for a single-page application (SPA)?
Serve the entire SPA from the CDN —
index.htmland all static assets. Hash all JS/CSS filenames for cache busting. Setindex.htmltoCache-Control: no-cache(always revalidate) but assets toCache-Control: public, max-age=31536000(one year, because the hash changes on every deploy). Use the CDN’s custom error page feature to returnindex.htmlfor all 404s (client-side routing). CloudFront + S3 with OAC is the standard AWS setup.
Vary headers that might prevent caching. (c) Large assets without compression — enable Brotli/gzip at the CDN edge. (d) Video content being served through the CDN instead of a dedicated video CDN like Mux or Cloudflare Stream. CDN costs are bandwidth-driven, so compressing and caching aggressively are the two biggest levers.
4. How do you handle user-specific content (e.g., a logged-in dashboard) with a CDN? You generally don’t cache authenticated responses at the CDN. Use Cache-Control: private for user-specific content. However, you can cache the page shell/layout at the CDN and fill in personalized content client-side via API calls. This is the “static shell + dynamic API” pattern. Alternatively, Cloudflare Workers or Lambda@Edge can personalize responses at the edge by reading a cookie/JWT and fetching user-specific data from a nearby origin.Real-World Example: Dropbox moved from a simple origin-CDN setup to running their own edge PoPs in 20+ cities around 2020. Their main driver: regulatory data residency in Europe and sub-50ms p99 for file previews. They saved ~30% on bandwidth and kept the same hit ratio (~94%) while gaining control over their edge logic.- Q: How do you invalidate a single URL across 200+ edge PoPs instantly? A: You don’t — it takes 30s-5min depending on CDN. For “instant” invalidation, hash the URL and deploy a new version. Purge APIs are a fallback, not a primary strategy.
- Q: When is CDN caching actively harmful?
A: For authenticated APIs where caching leaks one user’s data to another. Always set
Cache-Control: privateorno-storeon personalized responses, and auditVaryheaders.
- highscalability.com — Dropbox’s edge network deep-dive
- Cloudflare blog — “How Cloudflare’s architecture handles scale”
10. Stateless vs Stateful Architecture
10. Stateless vs Stateful Architecture
- Stateless: The server holds no session data between requests. Every request contains everything needed to process it (e.g., a JWT token with user identity, or an API key). Any server can handle any request. Scaling is trivial — add more servers behind a load balancer. If a server dies, no data is lost. This is the foundation of cloud-native architecture.
- Stateful: The server stores session data in memory (e.g., a login session, a shopping cart in server RAM). Subsequent requests from the same user must go to the same server (sticky sessions). Scaling is hard — you need sticky session configuration on the LB, and if that server dies, the user loses their session.
- Making stateful apps stateless: Externalize the state to a shared store: (a) Redis/Memcached for session data — all app servers read/write sessions to the same Redis cluster. (b) JWTs — encode session data in the token itself. No server-side storage needed. But JWTs cannot be revoked without a blocklist (which is server-side state again), and they can get large. (c) Client-side storage — cookies, localStorage. Limited by size and security constraints.
- WebSockets complicate things: WebSocket connections are inherently stateful — the connection is persistent between a specific client and a specific server. To scale WebSocket servers, you need a pub/sub backbone (Redis Pub/Sub, Kafka) so that messages reach clients regardless of which server they are connected to. Discord handles this with a custom gateway architecture.
-
You inherit a legacy app that stores sessions in server memory. You need to scale it horizontally by next week. What is your plan?
Fastest path: add Redis as a session store. Most frameworks have a drop-in session adapter — Express.js has
connect-redis, Spring hasspring-session-data-redis, Django hasdjango-redis. Configure the session middleware to use Redis, deploy Redis (ElastiCache on AWS), and remove sticky sessions from the load balancer. Test thoroughly — some apps implicitly depend on in-memory state beyond sessions (cached user objects, in-memory job queues). - JWTs vs server-side sessions — when would you choose each? JWTs when: you have a microservices architecture (each service can validate the token independently), you need cross-domain/cross-service authentication. Server-side sessions when: you need instant revocation (logout, password change), you have large session data, or you are concerned about JWT size inflating request headers. The hybrid approach is common: a short-lived JWT (15 minutes) for authentication + a server-side refresh token (in Redis, 7 days) for re-issuing JWTs.
- Q: Are WebSocket servers inherently stateful? A: The TCP connection is, yes — but you can keep application state externalized. The server holds the socket; Redis holds the messages. A server crash drops connections but doesn’t lose data.
- Q: What’s the fastest way to turn a stateful app stateless in a week? A: Drop-in session store (connect-redis, spring-session-data-redis) and remove sticky sessions from the LB. Tests everything else. Anything beyond sessions needs more care.
- highscalability.com — “How Slack works”
- martin.kleppmann.com — DDIA Chapter 1 on reliable and scalable systems
2. Distributed Systems Internals
11. Strong vs Eventual Consistency
11. Strong vs Eventual Consistency
- Strong Consistency: After a write completes, every subsequent read — from any node — returns the updated value. Implemented via synchronous replication: the write is not acknowledged until all (or a quorum of) replicas confirm. Latency is higher because you wait for the slowest replica. Examples: Google Spanner (uses TrueTime GPS clocks), ZooKeeper, etcd, CockroachDB.
- Eventual Consistency: After a write, replicas converge to the same value eventually — usually within milliseconds to seconds, but the guarantee is unbounded under adverse conditions. Reads may return stale data. Implemented via asynchronous replication. Examples: Cassandra (at
ONEconsistency level), DynamoDB (default reads), DNS (TTL-based propagation). - When eventual consistency is dangerous: Financial transactions (double spending), inventory management (overselling), access control (revoking permissions but the revocation has not propagated). A real example: DynamoDB’s eventual consistency caused issues with shopping cart merging during partitions, which led to the development of CRDTs and vector clocks (Dynamo paper).
- When eventual consistency is fine: Social media feeds (seeing a post 2 seconds late is invisible), analytics dashboards, DNS, content recommendations. The key question: “What is the business cost of a stale read?”
- The middle ground — Causal Consistency: Stronger than eventual but weaker than strong. Guarantees that causally related operations are seen in order. MongoDB supports this with causal sessions. Often the sweet spot for user-facing applications.
- You are building a collaborative document editor. What consistency model do you use? Strong consistency is too slow for every keystroke. Eventual consistency alone causes conflicts. The answer is Operational Transformation (OT, used by Google Docs) or CRDTs (Conflict-free Replicated Data Types, used by Figma). Both allow concurrent edits with deterministic conflict resolution. CRDTs are mathematically guaranteed to converge regardless of operation order.
-
A read replica has 5-second lag. Users update their profile and immediately see old data. How do you fix this without removing read replicas?
Read-your-own-writes consistency. After a write, route that user’s subsequent reads to the primary for a short window. Implementation: set a cookie/header with the write timestamp and compare against replica lag. MongoDB has
readPreference: primaryPreferred. Alternatively, write-through caching — the cache always has the user’s latest data.
UPDATE inventory SET qty = qty - 1, version = version + 1 WHERE id = X AND version = V). If the version changed, retry. (c) Reservation pattern — “soft-lock” the item for 10 minutes on checkout, decrement on payment confirmation. Most e-commerce platforms use (c) because it handles the payment-timeout edge case.
4. How do you observe and measure consistency lag in production? Publish a “canary write” every 30 seconds — write a known value to the primary, then read from each replica and measure the delta. Export as a Prometheus metric (replication_lag_seconds). Alert when lag exceeds your SLO (e.g., 5 seconds). PostgreSQL exposes pg_stat_replication with replay_lag. For application-level consistency, inject a tracing header on writes and verify it appears on subsequent reads through the same user session.What weak candidates say: “Just use strong consistency everywhere.” What strong candidates say: “I’d map each data path to a consistency requirement. User-facing profile updates need read-your-own-writes. Analytics aggregations are fine with 30-second staleness. Payment ledger needs linearizability. Then I’d pick the cheapest consistency model that meets each requirement.”Real-World Example: Meta’s Facebook feed runs on a mostly-eventual consistency model — posts take up to a few seconds to propagate. But when a user blocks someone, that block takes effect with strong consistency globally within milliseconds. They built different storage tiers precisely because one-size-fits-all consistency would either be slow everywhere or incorrect in the important places.- Q: What’s the difference between eventual and strong eventual consistency? A: Strong eventual adds a guarantee: given the same set of updates, all replicas converge to the same value deterministically. CRDTs provide this; raw LWW does not.
- Q: Can you detect stale reads automatically in production? A: Yes — attach a causal token (write LSN or vector clock) to user sessions, and sample-check reads against it. A stale read returns data older than the token. Airbnb runs a continuous consistency sampler.
- martin.kleppmann.com — DDIA Chapter 9 on consistency and consensus
- “Consistency Models” by Jepsen — visual map of models and their relationships
12. Quorum (N, R, W)
12. Quorum (N, R, W)
- The parameters: N = total number of replicas. W = number of replicas that must acknowledge a write. R = number of replicas that must respond to a read.
- The key formula: If
R + W > N, you have strong consistency — at least one node in the read set overlaps with the write set, guaranteeing you see the latest write. - Common configurations:
- N=3, W=2, R=2: Strong consistency. Tolerates 1 node failure for both reads and writes. This is the default for many distributed databases (Cassandra
QUORUM). - N=3, W=1, R=1: Eventual consistency. Fast but no overlap guarantee. Cassandra
ONE. - N=3, W=3, R=1: Strong consistency with fast reads, but writes cannot tolerate any node failure. Good for read-heavy workloads.
- N=3, W=1, R=3: Strong consistency with fast writes but slow reads. Good for write-heavy workloads.
- N=3, W=2, R=2: Strong consistency. Tolerates 1 node failure for both reads and writes. This is the default for many distributed databases (Cassandra
- Failure tolerance: You can tolerate
N - Wnode failures on writes andN - Ron reads. - Sloppy quorum (DynamoDB): During a partition, if the designated nodes are unreachable, writes go to any available node. This improves write availability at the cost of consistency. The “sloppy” write is later transferred to the correct node via hinted handoff.
-
Your Cassandra cluster is N=3 with
QUORUMreads and writes. One data center has a network issue and 2 of 3 nodes are unreachable. What happens? With W=2, writes fail because you cannot reach a quorum. With R=2, reads also fail. The system becomes unavailable — this is the CP side of CAP. If you switch toLOCAL_QUORUMwith a multi-DC setup, reads and writes succeed as long as a quorum exists in the local data center. - How would you configure quorum for a system that is 95% reads and 5% writes, optimizing for read latency? Set W=N (write to all replicas) and R=1 (read from any one). Since W=N, every replica has the latest write, so R=1 still gives strong consistency. Writes are slower but reads are maximally fast. The downside: if any replica is down, writes fail entirely.
LOCAL_QUORUM (Cassandra) to require quorum only within the local DC. This gives you availability within each DC at the cost of cross-DC consistency. The trade-off: during a partition, each DC serves its own latest data, which may diverge. After partition heals, anti-entropy repair reconciles.Work-sample prompt: “Your system has N=3, R=2, W=2. A user writes a value, then immediately reads it from a different client. Is the read guaranteed to see the write? Walk through the math, then describe a scenario where it might not.”Real-World Example: Cassandra’s default of QUORUM with RF=3 means N=3, R=2, W=2 — it tolerates one node failure and gives strong consistency. When Instagram’s backend hit multi-DC scale, they switched to LOCAL_QUORUM to keep reads fast within a region, accepting eventual cross-DC consistency for profile data that doesn’t need to be globally linearized.- Q: Why is N almost always odd? A: Even N gives no majority tie-breaker — N=4 tolerates only 1 failure (needs 3 for majority), the same as N=3 but with more machines. Odd numbers give better failure tolerance per dollar.
- Q: Can you pick R=W=1 and still get consistency? A: Only if N=1. With more replicas, R=W=1 gives eventual consistency — a read might hit a replica that hasn’t received the write yet.
- martin.kleppmann.com — DDIA Chapter 5 on quorums
- “Amazon’s Dynamo Paper” — the original quorum-based design
13. Leader Election (Raft / Paxos)
13. Leader Election (Raft / Paxos)
- The problem: In a distributed system, you often need one node to be the “leader” — to serialize writes, coordinate tasks, or hold a lock. But nodes crash, networks partition, and clocks drift.
- Raft (the understandable one): Designed by Diego Ongaro specifically to be easier to understand than Paxos. Core mechanics:
- Leader Election: Each node is in one of three states: Leader, Follower, Candidate. Followers expect heartbeats from the leader. If no heartbeat received within a randomized timeout (150-300ms), the follower becomes a candidate, increments its
term, and requests votes. A candidate needs a majority vote to become leader. The randomized timeout prevents split votes. - Log Replication: The leader receives all writes, appends them to its log, and replicates to followers. A write is committed once a majority of followers acknowledge it.
- Safety: Raft guarantees that once a log entry is committed, it appears in the log of every future leader.
- Leader Election: Each node is in one of three states: Leader, Follower, Candidate. Followers expect heartbeats from the leader. If no heartbeat received within a randomized timeout (150-300ms), the follower becomes a candidate, increments its
- Paxos: Invented by Leslie Lamport in 1989. Provably correct but notoriously difficult to implement. Multi-Paxos is the practical variant. Google uses it in Chubby (their distributed lock service).
- Split Brain: A network partition creates two groups of nodes, each electing its own leader. Raft prevents this: a leader needs a majority to commit writes. In a 5-node cluster split into 2 and 3, only the group of 3 has a majority. This is why cluster sizes are always odd (3, 5, 7).
- Real-world usage: etcd (Raft — Kubernetes cluster state), ZooKeeper (ZAB, a Paxos variant), CockroachDB (Raft), Consul (Raft).
- You have a 3-node etcd cluster and one node goes down permanently. What is the operational impact? With 2 of 3 nodes, you still have a majority — the cluster continues. But you can no longer tolerate any additional failure. Action: immediately provision a new etcd member. In Kubernetes, a lost etcd node means you are one failure away from a complete control plane outage — this is Priority 1.
- Why is Paxos considered harder than Raft, and does it matter in practice? Paxos describes a single-decree consensus algorithm. To build a practical system, you need Multi-Paxos, which Lamport never fully specified. Every implementation interprets it differently. Raft specifies the complete package. In practice, both achieve the same guarantees. CockroachDB chose Raft. Google Spanner uses a Paxos variant. Both work correctly.
- What is the FLP impossibility result and why should a system designer care? Fischer, Lynch, and Paterson proved in 1985 that in an asynchronous system, it is impossible to guarantee consensus if even one node can crash. This means all consensus protocols rely on timing assumptions. In production, this manifests as “leader election storms” where no candidate can win because timeouts keep expiring.
etcd_disk_wal_fsync_duration_seconds.
5. How does “split brain” actually manifest in production, and what’s the worst-case outcome? Split brain occurs when two nodes both believe they are the leader. In a database, both accept writes, creating conflicting data. Worst case: a financial system where both leaders approve conflicting transactions, leading to inconsistent ledger entries that require manual reconciliation. Prevention: fencing tokens — every leader gets a monotonically increasing token. Before performing any action, the leader presents its token. If the storage system sees a token lower than the last one it accepted, it rejects the request. This is how ZooKeeper-based leader election protects against zombie leaders.14. Bloom Filters
14. Bloom Filters
- What it is: A space-efficient probabilistic data structure that answers: “Is this element in the set?” It can say “Definitely not” (100% accurate) or “Probably yes” (with a configurable false-positive rate). It never produces false negatives. Uses a bit array of
mbits andkhash functions. - How it works: To insert an element, hash it with
kdifferent hash functions, each producing an index into the bit array. Set thosekbits to 1. To query, hash the element with the samekfunctions and check if allkbits are 1. If any bit is 0, the element is definitely not in the set. If all are 1, it is probably in the set. - Tuning the false-positive rate: A 1% false-positive rate requires about 9.6 bits per element. A 0.1% rate requires about 14.4 bits per element. A set of 1 billion URLs would take ~100 GB as strings but only ~1.2 GB as a Bloom filter with 1% FP rate.
- Limitations: You cannot delete elements from a standard Bloom filter. Counting Bloom Filters use counters instead of bits to support deletion, at 3-4x the memory cost.
- Real-world usage:
- Databases (LevelDB, RocksDB, Cassandra): Before doing a disk read for a key, check the Bloom filter. Cassandra reports that Bloom filters prevent 80-90% of unnecessary disk reads.
- Chrome Safe Browsing: The browser stores a Bloom filter of known malicious URLs locally.
- CDNs (Akamai): Determine whether content has been requested before (worth caching). Avoids caching one-hit wonders.
- You are building a username registration system. Can you use a Bloom filter to check if a username is taken? You can use it as a fast first check — if the Bloom filter says “not taken,” it is definitely available. If it says “maybe taken,” you check the database. This saves DB lookups for the majority of unique usernames. However, you cannot delete from a standard Bloom filter, so if a user deletes their account, their username appears “taken” forever unless you rebuild the filter. A Cuckoo Filter supports deletion and might be better here.
- How does a Bloom filter compare to a hash set in terms of space? A hash set stores the actual elements and gives exact answers but uses 50-100x more memory. For 1 billion strings averaging 50 bytes each: hash set = ~50 GB, Bloom filter = ~1.5 GB. Use a hash set when you need exact answers and can afford the memory. Use a Bloom filter when you need a fast, space-efficient pre-filter.
(1 - e^(-kn/m))^k. At 5x the designed capacity, the FP rate could jump from 1% to 30%+. Many of those false positives bypass the filter and hit the database, negating its purpose. Fix: rebuild the filter periodically (or use a Scalable Bloom Filter that chains multiple filters as data grows).
4. How would you use Bloom filters in a distributed system design interview? Give me three design problems where you’d reach for them. (a) Web crawler URL deduplication — avoid recrawling URLs already visited. (b) Weak password detection — check if a password appears in a known breach list of 500M passwords without storing the actual passwords. (c) Database read optimization — check if a row might exist before doing an expensive disk I/O (this is exactly how LevelDB/RocksDB use them). The common pattern: use a Bloom filter as a cheap, fast pre-check before an expensive operation.15. Rate Limiting Algorithms
15. Rate Limiting Algorithms
- Token Bucket: A bucket holds up to
Btokens. Tokens are added at a fixed rateR. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursts up toBrequests, then rate-limits toR. Used by: AWS API Gateway, Stripe. Best for: APIs that want to allow short bursts. - Leaky Bucket: Requests enter a FIFO queue. The queue is processed at a fixed rate. If the queue is full, new requests are dropped. Produces perfectly smooth output. Best for: Traffic shaping where the downstream cannot handle bursts.
- Fixed Window Counter: Divide time into fixed windows (e.g., 1-minute). Count requests per window. Problem: Boundary burst — a user sends 100 requests at 12:00:59 and 100 more at 12:01:01, getting 200 requests in 2 seconds while the limit is “100 per minute.”
- Sliding Window Log: Store the timestamp of each request. On a new request, remove timestamps older than the window and count the remaining. Perfectly accurate but memory-intensive.
- Sliding Window Counter: A hybrid of fixed window and sliding window. Uses two adjacent fixed windows and weights their counts based on the current position. This is what most production systems actually use.
-
You need to implement rate limiting across 20 API servers. How do you make it consistent?
Local in-process rate limiters give each server
global_limit / num_servers, but this is inflexible and inaccurate. Better: use a centralized Redis counter with an atomic Lua script. For ultra-high throughput, use local rate limiters as a first pass and Redis as the second pass. Envoy’s rate limit service follows this two-tier approach. - A client is rate-limited at 100 req/min but complains they are being limited at 80 req/min. What is happening? Possibilities: (a) clock skew between client and server; (b) distributed rate limiter race conditions — two servers read “99” simultaneously, both allow, then both increment to 100; (c) auto-retries from the client’s HTTP library counting against the limit; (d) fixed window boundary issue.
- How would you rate-limit by different dimensions (per user, per IP, per API key) simultaneously? Maintain separate counters for each dimension. Check all applicable limits before allowing a request. The most restrictive one wins.
INCRBY in Redis by the request’s cost instead of INCR by 1. GitHub’s API uses this approach — their rate limit headers show X-RateLimit-Resource to distinguish between different cost pools.16. Distributed ID Generation
16. Distributed ID Generation
- UUID v4 (Random): 128-bit random identifier. No coordination needed. Collision probability is astronomically low. Downsides: 36 characters as string, not time-sortable (B-tree index fragmentation — random inserts cause page splits, degrading write performance by 2-5x vs sequential IDs), not human-readable.
- UUID v7 (Time-ordered, RFC 9562): Embeds a Unix timestamp in the high bits. Time-sortable like Snowflake IDs but with UUID compatibility. Rapidly becoming the recommended default for new systems.
- Snowflake ID (Twitter): 64-bit ID: timestamp (41 bits, ~69 years) + machine ID (10 bits, 1024 machines) + sequence number (12 bits, 4096 IDs/ms/machine). Time-sorted, compact, ~4M unique IDs/sec/machine without coordination. Downsides: Requires machine ID assignment (ZooKeeper or startup parameter). Clock skew can cause duplicates.
- Database Auto-Increment: Simple, sequential, compact. But single point of failure and write bottleneck. You can shard it (odd IDs on machine 1, even on machine 2) but it is fragile.
- Snowflake variants: Instagram’s ID (timestamp + shard ID + sequence), Discord’s Snowflake, ULID (Crockford Base32, string-sortable).
- Your system generates 50,000 IDs per second across 100 servers. Which approach do you use? Snowflake IDs. 500 IDs/sec per server is well within capacity. Machine IDs can be assigned via Kubernetes pod ordinals. UUID v7 also works but at 128 bits, your indexes are 2x larger.
- What happens if a server’s clock drifts backward by 5 seconds? The timestamp component regresses, risking collisions. Snowflake’s approach: detect backward jumps and refuse to generate IDs until the clock catches up. Google’s TrueTime API solves this with GPS-synchronized clocks and confidence intervals.
uuid column alongside id. Generate UUIDs for all new rows and backfill existing rows. Update APIs to accept both formats (route by format: numeric goes to id, UUID format goes to uuid). Update clients to use UUIDs. After full migration, deprecate the integer endpoint. Keep the integer column for internal joins (smaller, faster). This dual-ID approach takes 3-6 months at scale. Stripe went through this migration and documented the challenges.17. Heartbeat & Health Checks
17. Heartbeat & Health Checks
- Basic heartbeat: Each server sends a periodic “I am alive” message every
Tseconds. IfKconsecutive heartbeats are missed, the node is declared dead. Tuning: SmallTandK= fast detection but more false positives. LargeTandK= fewer false positives but slow detection. Typical:T=5s,K=3= 15 seconds to detect failure. - Gossip Protocol: Each node periodically picks a random peer and exchanges health information. Information spreads exponentially — after O(log N) rounds, every node knows every other node’s status. Advantages: No central monitor (no SPOF), scales to thousands of nodes. Used by: Cassandra (uses Phi Accrual Failure Detector on top of gossip — calculates a “suspicion level” instead of binary dead/alive), Consul (SWIM protocol).
- Types of health checks:
- Liveness: “Is the process running?” Basic TCP or HTTP check.
- Readiness: “Can the process serve traffic?” Checks dependencies. Kubernetes uses both: liveness probes restart the pod, readiness probes remove it from the service endpoint.
- Deep health checks: Verify actual functionality — run a test query, check disk space. More thorough but more prone to false failures.
-
A node is healthy but its disk is full, so all writes fail. Heartbeats pass. How do you detect this?
The
/healthendpoint should attempt a small test write and report failure. Monitor disk usage via metrics (Prometheusnode_filesystem_avail_bytes). Kubernetes readiness probes can fail the pod so the LB stops sending traffic. -
You have a 100-node cluster. How does gossip scale compared to centralized heartbeats?
Centralized: the monitor receives 100 heartbeats every
Tseconds. Manageable, but the monitor is a SPOF. Gossip: each node contacts 1-3 random peers per round. Information propagates in O(log N) rounds — about 7 rounds for 100 nodes. No SPOF, but convergence is slower.
/health endpoint might still respond while the node drops 30% of real requests. Detection: (a) Outlier detection at the load balancer — Envoy tracks per-host success rates and ejects hosts that deviate from the cluster median. (b) Client-side latency tracking — if one server’s p99 is 5x higher than others, flag it. (c) Application-level health: track request success rate per instance in Prometheus and alert when it deviates. Netflix’s approach: “the health check passed but the error rate is 10x the cluster average” triggers automatic traffic drain.18. Circuit Breaker Pattern
18. Circuit Breaker Pattern
- The problem: Service A calls Service B. Service B is slow or down. Without protection, Service A’s thread pool fills up with requests waiting on Service B, and Service A itself becomes unresponsive. Now Service C, which depends on A, also fails. This is a cascading failure.
- The state machine:
- Closed (normal): Requests flow through. The breaker monitors failure rate. If failures exceed a threshold (e.g., 50% of last 20 requests), the circuit opens.
- Open (failing fast): All requests immediately fail with a fallback response without calling the downstream service. A timer starts.
- Half-Open (probing): After the timeout (e.g., 30 seconds), the breaker lets one test request through. If it succeeds, the circuit closes. If it fails, the circuit re-opens.
- Configuration: Failure thresholds should be calibrated to the service’s normal error rate. A service with a 2% baseline should not trip at 5%. Set the threshold at 3-5x the baseline.
- Implementation: Resilience4j (Java), Polly (.NET), Hystrix (maintenance mode), or service mesh level (Istio/Envoy’s outlier detection). Envoy’s approach is elegant: it ejects unhealthy hosts from the load balancing pool rather than failing all requests.
- Your circuit breaker trips, but the downstream service has actually recovered. Users see errors for 60 more seconds. How do you minimize this? Use active health checks alongside the circuit breaker — a background thread pings the downstream service’s health endpoint and closes the circuit proactively when health checks pass. Or increase the number of test requests in half-open state.
- How do you implement fallback behavior? Depends on the use case. For a recommendation service: return cached “popular items.” For a payment service: queue for later processing and show “payment pending.” For a non-critical feature: return a default and degrade gracefully. Netflix’s Hystrix fallbacks were critical — during a cascade failure, users saw slightly stale data but the site stayed up.
/api/v2/payments, a global circuit breaker won’t catch it, but an endpoint-scoped one will.
4. How do circuit breakers interact with retries? Can they make things worse? Yes — retry storms. If 100 clients each retry 3 times when they get a failure, you’ve turned 100 requests into 400. The downstream service that was struggling now gets 4x the load. Fix: exponential backoff with jitter, retry budgets (max 20% of requests can be retries), and the circuit breaker stopping retries entirely when open. The order matters: circuit breaker wraps the retry logic, not the other way around.19. Bulkhead Pattern
19. Bulkhead Pattern
- The analogy: A ship has bulkheads (watertight compartments). If one compartment floods, the others stay dry. In software, if one component fails or consumes excessive resources, it should not take down unrelated components.
- Implementation approaches:
- Thread Pool Isolation: Each downstream service gets its own thread pool. If the payment service is slow and its 20 threads are all blocked, the recommendation service still has its own 20 threads. Hystrix used this approach. Downside: thread pools add memory and context-switching overhead.
- Semaphore Isolation: Each downstream service gets a semaphore (max concurrent requests). Lighter than thread pools but no timeout protection. Resilience4j defaults to this.
- Process/Container Isolation: Separate containers with resource limits (CPU/memory cgroup limits in Kubernetes). The most common form in modern microservices.
- Connection Pool Isolation: Separate database connection pools for different features. A runaway report query should not prevent the checkout flow from accessing the database.
- Real-world example: Kubernetes pod resource limits are bulkheading at the infrastructure level — a pod that exceeds its memory limit is OOM-killed without affecting other pods on the same node.
- You have a monolith that handles both API requests and background reports. Reports consume all database connections. How do you apply the bulkhead pattern? Separate connection pools: allocate 80% to the API pool and 20% to the report pool. At the infrastructure level, consider moving reports to a read replica. This is the most impactful bulkheading for monoliths — separating OLTP and OLAP workloads.
- How does the bulkhead pattern interact with circuit breakers? They are complementary. The bulkhead limits the blast radius (how many resources a failing dependency consumes). The circuit breaker limits the duration (how long you keep trying). Without a bulkhead, a circuit breaker still helps but failing calls consume resources until the breaker trips. With both: the bulkhead contains resource consumption immediately, and the circuit breaker stops requests entirely after the threshold.
20. Idempotency
20. Idempotency
- Definition: An operation is idempotent if performing it multiple times has the same effect as performing it once.
f(f(x)) = f(x). Crucially, the side effects are the same, not just the return value. - Why it matters: Networks are unreliable. A client sends a payment request, the server processes it, but the response is lost. The client retries. Without idempotency, the customer is charged twice. Stripe reports that ~1% of API calls are retries.
- HTTP method idempotency:
- GET, PUT, DELETE: Idempotent by spec.
PUT /user/123replaces the user — doing it twice gives the same result.DELETE /user/123deletes the user — second call returns 404 but the state is the same. - POST: NOT idempotent by default.
POST /orderscreates a new order each time. - PATCH: Not guaranteed idempotent —
PATCHwith a relative change (increment age) is not idempotent.
- GET, PUT, DELETE: Idempotent by spec.
- Implementation pattern:
- Client generates a unique idempotency key (UUID) and sends it as a header.
- Server checks a deduplication table: “Have I seen this key before?”
- If yes: return the stored response (do not re-execute).
- If no: execute, store the result keyed by the idempotency key, return the result.
- Keys have a TTL (e.g., 24 hours) to prevent unbounded growth.
- Stripe’s implementation: They store idempotency keys in Postgres with the request hash, response body, and status code. If a retry comes with the same key but different request body, they return a 422 error.
- Your payment API is idempotent, but the downstream payment processor (Stripe) already has its own idempotency. Do you still need your own layer? Yes. The network between your server and Stripe can fail independently from client-to-server. If your server calls Stripe (succeeds), then crashes before responding, the client retries, and your server must recognize the retry without re-processing business logic.
-
How do you handle idempotency in an event-driven system?
Every event gets a unique event ID. Consumers maintain a processed-events table. Before processing, check if the ID has been seen. Kafka supports this natively with
enable.idempotence=true. - What about operations that are inherently non-idempotent, like sending an email? Add an idempotency layer before the side effect. Store a deduplication record before sending. Use a two-phase approach: (a) mark as “pending,” (b) send email, (c) mark as “sent.” A background job retries “pending” entries older than a threshold. This is the transactional outbox pattern applied to side effects.
CREATE TABLE idempotency_keys (...) PARTITION BY RANGE (created_at). Drop partitions older than 48 hours nightly. This keeps the table small and queries fast. Stripe TTLs their idempotency keys at 24 hours.
5. Two different API endpoints both modify the same underlying resource. Client sends idempotency key A to endpoint 1, and idempotency key B to endpoint 2. Both modify the same order. Is there a conflict? Yes — idempotency keys protect against retries of the same request, not against concurrent modifications from different requests. You still need optimistic concurrency control (version numbers) or pessimistic locking on the underlying resource. Idempotency and concurrency control are orthogonal concerns — you need both.Work-sample prompt: “A client POSTs a payment with idempotency key abc-123. Your server processes it and responds 200. The client never receives the response (network timeout). The client retries with the same key. Walk through the entire request lifecycle for both the first and second attempts, including every database operation.”3. Storage & Data
21. SQL vs NoSQL (When to choose?)
21. SQL vs NoSQL (When to choose?)
-
Choose SQL (PostgreSQL, MySQL) when:
- You need ACID transactions (financial systems, inventory).
- Your data has relationships that require joins.
- You need ad-hoc querying — SQL lets you answer questions you did not anticipate at design time.
- Your schema is relatively stable and you want database-enforced integrity.
- Scale: A single PostgreSQL instance handles most workloads up to millions of rows and thousands of TPS.
-
Choose NoSQL when:
- Document DB (MongoDB, DynamoDB): Semi-structured data with varying schemas, or access pattern is always “get document by ID.”
- Wide Column (Cassandra, HBase): High write throughput, time-series, append-heavy workloads.
- Key-Value (Redis, Memcached): Caching, sessions, rate limiting. Sub-millisecond latency.
- Graph DB (Neo4j, Neptune): Highly connected data where relationship traversal is the primary query pattern.
- The real nuance: Most production systems use multiple databases. A typical e-commerce platform might use PostgreSQL for orders (ACID), Redis for caching, Elasticsearch for search, and Cassandra for event logging. This is polyglot persistence.
jsonb), and the real differentiator is access patterns and consistency requirements.Follow-up:- Your team wants to use MongoDB for an order management system because “it is more flexible.” What is your response? Push back. Order management requires ACID transactions. MongoDB added multi-document transactions in 4.0, but with significant performance overhead. PostgreSQL gives you battle-tested ACID with decades of optimization. Use JSONB columns for flexible parts.
- When would you choose DynamoDB over PostgreSQL? When the access pattern is simple key lookups, you need guaranteed single-digit-ms latency at any scale, and you want zero operational overhead. The trade-off: the single-table design pattern requires you to know all access patterns upfront. PostgreSQL is more flexible but requires more operational investment.
mongodump to JSON, transform, COPY into Postgres). If larger, introduce Postgres for new features while keeping MongoDB for existing ones (strangler fig). Long-term, consolidate. The lesson: schema flexibility is not the same as structural flexibility — PostgreSQL’s JSONB gives you both.22. Database Indexing (B-Tree vs LSM Tree)
22. Database Indexing (B-Tree vs LSM Tree)
-
B-Tree (PostgreSQL, MySQL InnoDB):
- A balanced tree where each node is a disk page (4-16 KB). Lookups traverse from root to leaf in O(log N) — about 4-5 disk reads for 1 billion rows. Updates happen in-place: find the leaf node, modify it, write it back.
- Read-optimized: Point lookups and range scans are fast because data is sorted.
- Write penalty: Random writes cause write amplification (page read, modify, rewrite) and fragmentation.
-
LSM Tree (RocksDB, LevelDB, Cassandra, ScyllaDB):
- Writes go to an in-memory buffer (MemTable). When full, it is flushed to disk as a sorted, immutable file (SSTable). Periodically, SSTables are merged (compaction).
- Write-optimized: Writes are always sequential appends — 100x faster than random writes on HDD, 10x on SSD.
- Read penalty: A read might check multiple SSTables. Mitigated by Bloom filters, block caches, and compaction.
- Write Amplification: B-trees write a full page for any change. LSM trees rewrite data during compaction (10-30x amplification). Both have it, but in different places.
- When to choose: Read-heavy OLTP → B-tree. Write-heavy time-series or logging → LSM tree.
-
You are building a logging system that ingests 500,000 events/second. What storage engine?
LSM tree (Cassandra or ScyllaDB). The write pattern is append-only. Partition key:
(service_name, date_bucket). Avoid B-tree databases — 500K random inserts/sec would saturate the I/O. - What is compaction and why is it both necessary and dangerous? Compaction merges SSTables, removing tombstones and duplicate versions. Without it, reads degrade catastrophically. But compaction is I/O-intensive — Cassandra’s “compaction storm” is a known operational issue. Tuning compaction strategy (leveled vs size-tiered vs FIFO) is one of the most important operational tasks for LSM-tree databases.
nodetool compactionstats for pending compaction tasks. If the backlog is growing, either: (a) switch from size-tiered to leveled compaction (fewer files to check per read, but more I/O for compaction), (b) increase compaction throughput by provisioning more disk I/O, or (c) reduce the data’s TTL so expired data is cleaned up faster. Another possibility: tombstone accumulation from DELETE-heavy workloads — tombstones are not removed until compaction runs and gc_grace_seconds expires.
4. When would you use both B-tree and LSM-tree storage in the same system? CockroachDB uses RocksDB (LSM) as its storage engine but provides SQL (B-tree-like query interface). The write-heavy LSM engine handles the distributed writes, while the SQL layer provides the read-friendly query semantics. In your own architecture: use PostgreSQL (B-tree) for transactional data and Cassandra (LSM) for time-series/event data. The B-tree handles point reads and range scans on relational data; the LSM tree handles high-throughput append-only writes.23. Replication Types
23. Replication Types
-
Single Leader (Master-Slave / Primary-Replica):
- All writes go to the primary. Replicas replicate asynchronously (usually) or synchronously.
- Async: Fast writes but replicas may lag. If the primary dies before replicas catch up, those writes are lost.
- Sync: No data loss on primary failure, but write latency increases.
- Failover: Patroni (PostgreSQL), MHA (MySQL), Sentinel (Redis). The promoted replica may have fewer transactions than the dead primary.
- Use case: Most OLTP workloads.
-
Multi-Leader (Master-Master):
- Writes can go to any leader. Each replicates to the others. Enables writes in multiple regions.
- The conflict problem: Two leaders accept conflicting writes. Resolution: last-writer-wins (simple but lossy), merge logic, CRDTs.
- Use case: Multi-region deployments with low write latency requirements.
-
Leaderless (Dynamo-style — Cassandra, DynamoDB):
- No designated leader. Any node accepts writes. Reads and writes use quorums.
- Advantages: No failover needed, all nodes are equal.
- Disadvantages: Complex conflict resolution, no ordering guarantee across keys.
- Your PostgreSQL primary is in US-East and you have read replicas in EU and Asia. A user in Asia writes and immediately reads stale data. Why and how do you fix it? Replication lag (100-200ms cross-ocean). Fixes: read-your-own-writes (route reads to primary for a short window after writes), causal consistency with LSN tracking, or use a globally distributed database like CockroachDB.
- When would you use multi-leader replication and what is the biggest headache? Global apps needing low-latency writes in every region. The biggest headache is conflict resolution. If two users edit the same data in different regions simultaneously, you need LWW, custom merge logic, or CRDTs.
24. Partitioning Strategies
24. Partitioning Strategies
- Range Partitioning: Keys divided into contiguous ranges. Advantage: Range scans are efficient. Risk: Hotspots — if recent data is accessed more, the latest range gets all traffic. Mitigation: compound partition keys like
(user_id, date_bucket). - Hash Partitioning:
partition = hash(key) % N. Uniform distribution. Sacrifice: Range queries become scatter-gather operations —BETWEENqueries must hit every partition. - Composite/Hybrid (Cassandra): Partition key (hashed) + clustering key (sorted within partition).
PRIMARY KEY ((user_id), timestamp)— hash-distributed users, time-sorted events within each user. Best of both worlds for the right access patterns. - Directory-Based: A lookup service maps each key to its partition. Maximum flexibility but the directory is a SPOF and bottleneck.
-
You partition an events table by
hash(event_type). There are only 5 event types. What goes wrong? With only 5 distinct values, you have at most 5 partitions. Ifpage_viewis 80% of traffic, one partition handles 80% of load. Fix: compound key like(event_type, random_bucket)with 100 buckets = 500 possible partitions. This is the “salting” technique. - How does DynamoDB handle hot partitions internally? Adaptive capacity shifts throughput from underutilized partitions to hot ones. Burst capacity allows short-term spikes. For sustained hotspots, it splits partitions automatically. But if a single item is hot, no partitioning helps — you need application-level sharding.
PRIMARY KEY ((tenant_id), created_at). A single tenant generates 90% of all data. What breaks? The tenant_id partition key means all of that tenant’s data lives on the same set of nodes. Those nodes become overwhelmed while the rest of the cluster idles. This is the “noisy neighbor” problem in multi-tenant systems. Fixes: (a) Compound partition key with time bucketing: PRIMARY KEY ((tenant_id, month_bucket), created_at). This spreads one tenant’s data across 12 partitions per year. (b) Give the hot tenant a dedicated keyspace on separate nodes. (c) Rate-limit writes per tenant at the application level.
4. How do you decide between range and hash partitioning at design time? What questions do you ask? Three questions: (a) “Do I need range scans?” If yes, range partitioning or composite keys. (b) “Is my key space evenly distributed?” If not (e.g., timestamps, sequential IDs), hash partitioning avoids hotspots. (c) “Can I tolerate scatter-gather reads?” If your most common query needs to hit all partitions, you’ve chosen wrong. The golden rule: partition by the most common query’s WHERE clause.25. File Storage (Block vs Object vs File)
25. File Storage (Block vs Object vs File)
- Block Storage (AWS EBS, Azure Managed Disk): Raw volumes attached to a single compute instance. The OS manages the filesystem. Low-latency, high IOPS. Used for boot volumes, databases. Typically attached to one instance at a time.
- File Storage (AWS EFS, NFS): Shared filesystem accessible by multiple instances simultaneously. POSIX semantics. Used for shared application data, ML training data. Higher latency than block, ~3x more expensive than S3.
- Object Storage (AWS S3, Azure Blob): Flat namespace of objects accessed via HTTP API. Virtually unlimited scale, 11 nines of durability, extremely cheap. Not mountable as a filesystem. No in-place updates — you replace the entire object. Used for backups, static assets, data lake, video/image storage.
- You are designing a video processing pipeline. Where does each storage type fit? Upload to S3 (HTTP API, unlimited scale). Transcoding worker reads from S3, processes on EBS-backed local storage (fast I/O). Output back to S3. CDN pulls from S3 for delivery. Metadata in a database on EBS. Shared model files on EFS if multiple workers need them.
- Your application stores user-uploaded images on the local filesystem. When do you migrate to S3? When you need multiple servers (local FS is not shared), durability beyond a single disk, or CDN integration. Use pre-signed URLs for direct client-to-S3 upload, bypassing your server entirely.
26. Data Lake vs Data Warehouse
26. Data Lake vs Data Warehouse
- Data Lake (S3, ADLS): Raw, unprocessed data in native format. Schema-on-read. Cheap storage ($0.023/GB/month). Risk: Data swamp — without governance, it becomes a dumping ground of undocumented data.
- Data Warehouse (Snowflake, BigQuery, Redshift): Cleaned, structured data optimized for analytics. Schema-on-write. Columnar storage for fast aggregation. More expensive but massively parallel query performance.
- The Modern Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi): Combines cheap, flexible lake storage with schema enforcement and ACID transactions of a warehouse. Data lives in S3 as Parquet files, with a metadata layer adding schema evolution, time travel, and ACID. Eliminates the need to copy data from lake to warehouse.
- Your data lake on S3 has 50 TB of JSON logs. Analysts complain queries take 30+ minutes. What do you do? Convert to Parquet (columnar, compressed). Partition by date. Use AWS Glue crawler for a data catalog. Query with Athena. Expected improvement: 30 minutes to 30 seconds. Athena charges per TB scanned, so Parquet + column pruning reduces cost by 90%.
- When would you skip the warehouse entirely and query the lake directly? When data is too diverse for a fixed schema, queries are exploratory (ML feature engineering), or cost sensitivity is high. Use a lakehouse format (Iceberg/Delta) for ACID semantics and schema evolution on S3.
27. Message Queues (Kafka vs RabbitMQ)
27. Message Queues (Kafka vs RabbitMQ)
-
RabbitMQ (AMQP):
- Push-based. Smart broker, dumb consumer. Once acknowledged, messages are deleted.
- Routing: Extremely flexible — direct, topic, fanout, and header exchanges.
- Throughput: ~30K-50K messages/sec per node. Sufficient for most application workloads.
- Use for: Task queues, complex routing, request-reply patterns, transactional messaging with dead-letter queues.
-
Apache Kafka:
- Pull-based (distributed log). Dumb broker, smart consumer. Messages persisted for a configurable retention period. Multiple consumer groups read independently. Replay is trivial — reset the offset.
- Ordering: Strict within a partition. Not across partitions. Use entity ID as partition key for per-entity ordering.
- Throughput: 100K-1M+ messages/sec per broker. LinkedIn processes 7 trillion messages/day.
- Use for: Event sourcing, CDC, log aggregation, real-time analytics, microservice event backbone.
-
When to choose:
- Complex routing, task queues, or RPC → RabbitMQ.
- High throughput, event replay, or multiple consumers on the same stream → Kafka.
- You have a Kafka topic with 10 partitions and 20 consumers in the same group. What happens? Only 10 consumers are active — each partition maps to exactly one consumer in a group. The other 10 idle. Maximum parallelism equals partition count.
-
A Kafka consumer crashes mid-processing. What happens to the message?
Depends on commit strategy. With
auto.commit=true, the message may be re-delivered (at-least-once). For exactly-once, use Kafka’s transactional API or consumer-side idempotency. - How would you migrate from a monolith to microservices using Kafka? Strangler Fig pattern. The monolith publishes domain events to Kafka topics. New microservices consume these events. Kafka’s replay capability is critical — if a new service has a bug, reset the offset and reprocess after fixing.
user_123 going to the same partition), the ordering guarantee is temporarily violated during the rebalance. Mitigation: (a) Use a consistent hashing scheme for the partitioner (but Kafka’s default modulo partitioner doesn’t do this). (b) Pause producers briefly during the rebalance. (c) Design consumers to handle out-of-order messages idempotently. In practice, most teams accept a brief ordering disruption and ensure consumers are idempotent.
5. When would you choose NATS or Pulsar over Kafka or RabbitMQ? NATS: ultra-low latency (<1ms), lightweight, no persistence by default — perfect for ephemeral pub/sub in IoT or edge computing. NATS JetStream adds persistence if needed. Pulsar: when you need Kafka-like durability + multi-tenancy + built-in geo-replication without custom tooling. Pulsar’s tiered storage (offload old segments to S3) is cheaper than Kafka’s at retention periods over 7 days. Splunk and Yahoo use Pulsar for exactly this cost advantage.28. Long Polling vs WebSockets vs SSE
28. Long Polling vs WebSockets vs SSE
- Short Polling: Client sends requests at fixed intervals. Wasteful — most responses are empty. At 100K clients polling every 5s, that is 1.2M requests/minute of mostly empty responses.
- Long Polling: Server holds the connection open until data is available (or timeout). Efficient for infrequent events. Each open connection consumes a server thread/connection slot.
- WebSockets: Persistent, bidirectional, full-duplex over a single TCP connection. Sub-millisecond latency. Used by: Chat, collaboration tools, multiplayer games. Infrastructure: Connections are stateful — scaling needs a pub/sub backbone (Redis Pub/Sub, Kafka). LBs must support WebSocket upgrade.
- SSE (Server-Sent Events): Unidirectional server-to-client over HTTP. Built-in auto-reconnection with
Last-Event-Id. Works through all proxies and CDNs without special config. Used by: Stock tickers, live scores, ChatGPT streaming, CI/CD build logs. - Decision framework: Bidirectional → WebSockets. Server-to-client only → SSE (simpler). Universal compatibility needed → Long Polling as fallback. Socket.IO abstracts this automatically.
- You are building a live dashboard that shows metrics updating every second. WebSocket or SSE? SSE. Unidirectional (server to client), built-in reconnection, works through all HTTP infrastructure. WebSocket adds unnecessary complexity for this use case.
- You have 500K concurrent WebSocket connections. How do you scale? Each connection is ~2 KB memory. Scale horizontally with sticky sessions (connections cannot be re-routed mid-session). The hard part is message fanout: use Redis Pub/Sub for <100K messages/sec, Kafka or NATS for higher throughput. Discord uses ~1M connections per gateway node with a pub/sub layer for cross-node messaging.
29. Geohashing / Quadtree
29. Geohashing / Quadtree
- The problem: Given a user’s lat/lng, find all restaurants within 5 km. Naive distance calculation on every row is O(N) and unusable at scale.
- Geohashing: Encodes a 2D coordinate into a 1D string by recursively subdividing the world. Nearby locations share common prefixes. Proximity search becomes a prefix search — fast with a B-tree index. Also query the 8 neighboring cells to handle edge effects. Used by: Elasticsearch, Redis (
GEOADD/GEOSEARCH), MongoDB. Limitation: Uniform grid regardless of data density. - Quadtree: A tree where each node represents a rectangular region. Dense areas subdivide more. Advantage: Adapts to data density. Limitation: Must be stored in memory; harder to distribute. Used by: Uber (Google’s S2 geometry library).
- Choosing: Geohash for database-indexed solutions. Quadtree for in-memory spatial indexing with adaptive resolution.
-
You are building “find nearest 10 drivers” for a ride-hailing app. Drivers move constantly. How do you index them?
Redis with
GEOADDandGEOSEARCH— handles the spatial index in-memory with O(log N) complexity and 100K+ updates/sec. The index is ephemeral — if it crashes, drivers re-report their locations. - Two points are physically close but fall in different geohash cells. How do you handle this? Always query the target cell AND its 8 neighbors. Libraries compute adjacent geohashes. This adds 9 prefix queries instead of 1, but it is still fast with a B-tree index.
30. Row-based vs Columnar DB
30. Row-based vs Columnar DB
- Row-based (PostgreSQL, MySQL — OLTP): Each row stored contiguously. Reading a full row is one sequential read. Ideal for OLTP:
SELECT * FROM users WHERE id = 123. Weakness for analytics:SELECT AVG(salary) FROM employeesreads every column of every row, even though it only needssalary. - Columnar (BigQuery, Redshift, ClickHouse — OLAP): Each column stored contiguously. Column pruning: Analytics queries read only needed columns (2% of data for a 50-column table). Compression: Same-type column values compress 5-20x with dictionary encoding. Vectorized processing: Modern engines process column vectors using SIMD instructions for 10-100x throughput. Weakness for OLTP: Assembling a full row requires N random I/Os.
- Hybrid approaches: PostgreSQL has columnar table access methods (Citus). DuckDB is an in-process columnar database — “SQLite for analytics.”
-
Analytics queries on your 500M-row PostgreSQL events table take 10+ minutes. You cannot migrate to a warehouse yet. Quick wins?
(a) BRIN index on timestamp (1000x smaller than B-tree for naturally ordered data). (b) Partition by month with
pg_partman. (c) Materialized views for common aggregations. (d)EXPLAIN ANALYZEto find missing indexes. - When would you use ClickHouse over BigQuery? ClickHouse for sub-second real-time OLAP on streaming data and self-hosting for cost control. BigQuery for zero-ops serverless, bursty query patterns, and petabyte-scale scans. Cost crossover: BigQuery cheaper for infrequent large queries; ClickHouse cheaper for high-frequency smaller queries.
4. Design Cases (The “Design X” Questions)
31. Design a URL Shortener (TinyURL)
31. Design a URL Shortener (TinyURL)
- Functional requirements: Given a long URL, generate a short URL. Given a short URL, redirect to the original.
- ID generation / Encoding:
- MD5/SHA256: Too long even truncated, collisions on truncation.
- Auto-increment ID → Base62 encoding: ID 1000 →
qi. 7 characters = 62^7 = 3.5 trillion unique URLs. Simple, sequential. Downside: predictable (enumerable). - Random Base62 string: Check for collision with a Bloom filter.
- Custom hash: Hash the URL, take first 7 chars of Base62. On collision, append counter and rehash.
- Storage: Key-value store (DynamoDB, Redis). Reads are 100:1 vs writes — heavy caching layer. At 100M URLs, entire dataset fits in memory (~10 GB).
- Redirection: HTTP 301 (permanent, browser caches, fewer server hits) vs 302 (temporary, every click hits server — needed for analytics). Most shorteners use 302.
- Abuse prevention: Rate limit creation per IP/user. Block known malicious domains (Google Safe Browsing API). Scan destination URLs asynchronously. Expire URLs after a default TTL.
- A short URL for a phishing page goes viral. 10M people click it. How does your system handle this? Hot key caching (CDN, Redis, in-process cache). Abuse mitigation: scan destination URL against Google Safe Browsing and PhishTank APIs at creation time and periodically. If flagged, show a warning interstitial page.
-
How would you implement custom vanity URLs like
tny.url/my-brand? Check custom code availability (case-insensitive uniqueness constraint). Reserve common words and offensive terms via blocklist. Two lookup paths: custom codes and generated codes, both indexed in the same table.
32. Design Rate Limiter
32. Design Rate Limiter
- Where: API Gateway (centralized) or per-service middleware (more granular).
- Algorithm: Token Bucket for user-facing APIs (allows bursts). Sliding Window Counter for precision-sensitive scenarios.
- Distributed implementation with Redis:
INCR+EXPIREmust be atomic (Lua script). Return HTTP 429 withX-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Resetheaders. - Tiered limits: Free: 100 req/min. Pro: 1000. Enterprise: 10,000. Look up tier from cached config.
- Distributed challenges: With multiple gateway instances, each incrementing local counters, users get
N * limitthrough. Solution: centralized Redis counter. - Handling rate-limited requests: 429 with
Retry-After. Queue for later if important. Or degrade gracefully — serve cached/stale data.
- Redis goes down. What happens to your API? Fail open (allow all — safer for revenue-critical APIs) or fail closed (reject all — safer for abuse prevention). Standard: fail open with alerting and a local fallback rate limiter.
- How do you rate-limit by multiple dimensions simultaneously? Separate counters per dimension. Use a Lua script to check all atomically. Most restrictive limit wins.
33. Design Instagram Feed
33. Design Instagram Feed
- Data model: Users, Posts (media in S3/CDN), Follows (graph), Feed (timeline per user).
- Push (Fanout on Write): When User A posts, write post ID to every follower’s feed list in Redis. Fast reads, slow writes. Celebrity problem: 10M followers = 10M writes per post.
- Pull (Fanout on Read): No precomputation. On feed open, query all followed users’ posts and merge-sort. Fast writes, slow reads.
- Hybrid (what Instagram and Twitter actually use): Push for normal users (followers < 10K). Pull for celebrities. Feed assembly merges precomputed feed with real-time celebrity post query.
- Ranking: ML model scores posts by relationship affinity, content type, recency, engagement signals. Runs at feed assembly time on the candidate set.
- Storage: Feed lists in Redis sorted sets. Posts/metadata in DynamoDB or PostgreSQL. Media in S3 + CDN.
- A user follows 5,000 accounts and opens the app. How many milliseconds to assemble their feed? With hybrid model: fetch precomputed feed IDs from Redis (~2ms). Fetch celebrity posts (~5ms batched). Merge and rank top 50 (~1ms). Hydrate post objects (~3ms). Total: ~10-15ms. Instagram targets <100ms.
- How do you handle a user unfollowing someone? Lazily filter at read time — check the follow graph (cached, O(1)) during feed assembly. Stale post IDs naturally age out as newer posts push them down.
34. Design Chat (WhatsApp)
34. Design Chat (WhatsApp)
- Connection layer: WebSockets for persistent bidirectional communication. Each device connects to a gateway server. Connection registry in Redis maps
user_id → gateway_server_id. - Message delivery guarantees:
- Sent: Server ack to sender. Delivered: Recipient device ack (second checkmark). Read: Recipient opens chat (blue checkmark).
- Offline delivery: Messages stored in Cassandra/DynamoDB. Delivered when user reconnects. TTL: 30 days.
- Message ordering: Per-conversation sequence number assigned by server (not client clock). For groups, use vector clocks or centralized sequencer.
- Storage: Cassandra (time-series pattern). Partition key:
chat_id. Clustering key:timestamp. - E2E Encryption: Signal Protocol (Double Ratchet Algorithm). Server never sees plaintext. Key rotation on group membership changes.
- Online/Presence: Heartbeat to Redis with 30-second TTL. “Last seen” is the timestamp of the last heartbeat.
- User B’s gateway server crashes. What happens to messages sent to B? Chat service routes to dead gateway, TCP fails. Message goes to offline queue. When B reconnects to a new gateway, pending messages are delivered and the registry is updated.
- How does WhatsApp handle multiple devices? Multi-device model: each device has its own identity key pair. Sender encrypts once per recipient device. Server delivers independently. Capped at 4 linked devices.
- How would you implement “typing” indicators? Transient, fire-and-forget events. Not persisted. Debounced client-side timer. Rate-limited to 1 event/second. Best-effort delivery.
35. Design Youtube/Netflix
35. Design Youtube/Netflix
- Upload pipeline: Chunked/resumable upload (5 MB chunks). Original file to S3. Message to processing queue (Kafka/SQS).
- Transcoding: Convert source into multiple resolutions (240p-4K) and codecs (H.264, H.265, VP9, AV1). Split video into 2-10 second segments for parallel processing across worker fleet. Netflix uses per-title encoding — custom bitrate ladder optimized for content.
- Adaptive Bitrate Streaming (ABR): HLS and DASH standards. Video split into small segments with a manifest listing quality levels. Client dynamically switches quality per segment based on bandwidth and buffer level.
- CDN / Delivery: Netflix’s Open Connect places servers inside ISP data centers. During peak, 95% of traffic from local ISP caches. Popular content proactively pushed; long-tail content pulled on demand.
- Storage: S3 for all video files (hundreds of petabytes at Netflix scale). Metadata in Cassandra. Search in Elasticsearch.
- A user uploads a 30-minute 4K video. How long until playback, and how do you minimize it? Optimized: start with one fast encode (720p H.264) and publish immediately (2-5 minutes). Transcode higher resolutions in background. Segment-level parallelism across workers. Progressive publishing — add resolutions to manifest as they complete. YouTube shows “Processing” and makes video available at lower quality first.
-
How would you implement “resume playback” across devices?
Store
(user_id, video_id, position_seconds)in a fast KV store. Update periodically via fire-and-forget heartbeats. Use conditional updates for race conditions (two devices playing the same video).
36. Design Google Drive (Dropbox)
36. Design Google Drive (Dropbox)
- File storage: Files split into 4 MB blocks, each hashed (SHA-256). File represented as a list of block hashes. Block-level deduplication: Identical blocks stored once. Dropbox reports 40-60% storage savings.
- Sync engine: Client computes block hashes, sends list to server. Server responds with missing blocks. Client uploads only new/changed blocks (delta sync — editing 100 MB file uploads only 4 MB). Server notifies other devices via long polling/WebSockets. Other devices download only changed blocks.
- Conflict resolution: Two users edit offline and sync simultaneously. Keep both: save one as
report.docxand the other asreport (conflict copy - John).docx. Let users resolve manually. - Metadata database: File hierarchy, block references, permissions, version history. Deletes are soft-deletes with 30-day retention.
- Versioning: Old versions are stored as block lists (immutable blocks mean no extra storage for unchanged blocks). Users can restore any previous version.
- Two users edit a Google Sheets spreadsheet simultaneously. How is this different from Dropbox conflict? Google Sheets uses Operational Transformation (OT) — every keystroke is an operation transformed for concurrent edits. No “conflict copy.” Architecturally different from Dropbox’s file-level sync.
-
How does Dropbox handle a user uploading 100,000 small files (node_modules)?
Known pain point. Optimizations: batch metadata operations, skip dedup for very small files,
.dropboxignorefor generated directories. Most sync clients recommend excludingnode_modules.
37. Design Typeahead (Search Autocomplete)
37. Design Typeahead (Search Autocomplete)
- Requirements: Top 5-10 suggestions within 100ms as the user types each character.
- Trie (Prefix Tree): Each node represents a character. Path from root to node = prefix. Optimization: Store the top-K results at each trie node. Lookup is O(prefix_length). Use compressed tries (Patricia/radix tree) to reduce memory.
- Ranking: Popularity (search frequency), recency (trending queries), personalization (user’s search history), offensive content filtering (blocklist or ML classifier).
- Update pipeline: Offline aggregation (MapReduce/Spark) rebuilds trie periodically. Incremental “delta Trie” for near-real-time updates. For trending queries, a small Redis sorted set updated in near-real-time, merged with trie results at serving time.
- Serving: Trie replicated across servers in memory. Read-only between rebuilds. Latency target: <50ms P99.
LIKE 'prefix%'.” This takes seconds on billions of rows, not milliseconds.Follow-up:- Your autocomplete shows offensive suggestions that were trending. How do you handle this? Multi-layer: blocklist of known terms, ML classifier, human review, real-time serving filter that checks the blocklist before returning results.
- How would you add personalization without per-user tries? One global trie. At serving time, re-rank using a per-user cache of recent/frequent queries (Redis). Merge global top-5 with user’s personal top-5, deduplicate, re-rank. For deeper personalization, use an ML model scoring candidates with user features.
38. Design Web Crawler
38. Design Web Crawler
- URL Frontier: Priority queue of URLs to crawl (Kafka or custom). Prioritized by importance. Two sub-systems: priority queue (which URLs are important?) and politeness queue (per-domain rate limiting).
- Fetcher Workers: Horizontally scaled. Async HTTP clients for high throughput.
- URL Deduplication: Bloom filter of visited URLs. 10 billion URLs with 1% FP = ~12 GB.
- Content Deduplication: SimHash for near-duplicate detection. Skip re-indexing unchanged content.
- Politeness: Respect
robots.txt. Per-domain rate limit (1 req/sec). Rotating IP pool. - DNS: Custom resolver with large cache to avoid DNS bottleneck at crawl scale.
- Trap detection: Limit URL depth per domain. Deduplicate after normalizing query params. Monitor per-domain URL counts.
- A page returns a 301 redirect chain. How do you handle it? Follow redirects up to max depth (5-10 hops). Record the entire chain for deduplication. Detect loops. For permanent redirects, update the frontier to prefer the destination URL.
- How do you decide which pages to re-crawl and how often? Change frequency (estimated from previous crawls), importance (PageRank), and freshness requirements. News sites recrawled hourly, Wikipedia weekly.
39. Design Notification System
39. Design Notification System
- Multi-channel: Push (APNs/FCM), Email (SES/SendGrid), SMS (Twilio), In-app (WebSocket or pull). Handle token invalidation for push, bounce management for email.
- Architecture: Notification service receives requests via API/events (Kafka). Validates, checks user preferences, routes to channel handlers.
- User Preferences: Per-user config: enabled channels, quiet hours, notification categories. Stored in DB or cache.
- Priority Queues: P0 (security — bypasses quiet hours), P1 (transactional — immediate), P2 (social — can batch), P3 (marketing — scheduled). Separate queues or Kafka topics with dedicated consumers.
- Rate Limiting: Per-user limits (max 5 push/hour). Per-channel limits (APNs rate-limits at ~50K/sec).
- Deduplication: Idempotency key per notification. Redis set with TTL.
- Delivery tracking: queued → sent → delivered → opened → clicked.
- A flash sale sends 100,000 notifications. APNs rate-limits you. How do you fix it? Multiple persistent HTTP/2 connections to APNs. Pre-warm connections. Stagger dispatch over a time window. Queue time-sensitive notifications before the sale starts.
-
A user got the same notification 3 times. What went wrong?
Worker processed, sent, but crashed before acknowledging. Queue re-delivered. Fix:
SETNX notification:sent:{id}before sending. Guarantees at-most-once per notification ID.
40. Design Leaderboard
40. Design Leaderboard
- Naive:
SELECT * FROM scores ORDER BY score DESC. O(N log N) per query. Unusable at 100M players. - Redis Sorted Set:
ZADD leaderboard score user_id(O(log N)).ZREVRANK leaderboard user_id(O(log N)).ZREVRANGE leaderboard 0 9 WITHSCORES(top 10). At 100M entries, ~10-15 GB RAM. One Redis instance handles this. - Tie-breaking: Redis sorts ties lexicographically. To rank by score + time (first achiever ranks higher):
score = points * 10^10 + (MAX_TIMESTAMP - timestamp). - Types: Global, friends (filter by social graph), time-scoped (daily/weekly — separate sorted sets:
leaderboard:daily:2024-03-15, expire old keys).
-
The game has 50M players. Leaderboard resets daily. How without downtime?
Time-bucketed sorted sets:
leaderboard:2024-03-15. At midnight, start writing to new key. Old key still readable. No “reset” operation needed. -
A player’s score needs to be the sum of their last 100 games. How efficiently?
Per-player list of recent scores (
LPUSH+LTRIMto 100). On each game: push new score, pop oldest, computenew_sum = old_sum - popped + new, thenZADD. O(1) per game. Wrap in a Lua script for atomicity.
5. Reliability & Operations
41. Handling Hot Partitions
41. Handling Hot Partitions
- What causes hot partitions: Certain keys receive disproportionate traffic. A celebrity’s profile, a trending hashtag, a global counter, or a flash sale product.
- Mitigation strategies:
- Virtual nodes / Consistent hashing: Smooths key distribution but does not help if a single key is hot.
- Key salting / Write sharding: Append random suffix:
celebrity:123:shard_0throughcelebrity:123:shard_99. Writes distribute across 100 sub-keys. Reads aggregate. This is how Instagram handles celebrity profiles. - Local caching: Each app server caches hot keys in-process (Caffeine, Guava) with 1-5s TTL. Reduces backend load from 100K req/sec to ~1K req/sec.
- Dedicated shard: For known hot tenants, assign a dedicated shard with more resources (Slack does this for large workspaces).
- DynamoDB’s adaptive capacity: Automatically shifts throughput from underutilized partitions. Splits hot partitions automatically. But cannot help if a single item is hot.
-
A DynamoDB item (a view counter) gets 50,000 writes/sec. The partition limit is 1,000 WCU. How?
Write sharding: 100 items (
counter:0throughcounter:99). Each write goes to a random shard. Reads aggregate withBatchGetItem. Cache the aggregate in ElastiCache with 1s TTL. - A Kafka partition gets 10x throughput because one customer generates most events. How do you rebalance? Move the hot customer to a dedicated Kafka topic with its own partitioning scheme. Or split their events into sub-streams by event type, each with a different key.
42. Thundering Herd Problem
42. Thundering Herd Problem
- The problem: A popular cache key expires. 10,000 concurrent requests all get a cache miss and simultaneously query the database. The database is overwhelmed. This can cascade into a full outage.
- Variant — System startup: 1,000 processes start simultaneously and all warm caches at once.
- Variant — CDN miss: Breaking news drives millions of users to the same page; CDN cache misses all hit origin.
- Mitigation techniques:
- Cache stampede lock (Mutex): Only one request acquires a lock (
SETNXwith TTL) and fetches from DB. Others wait or return stale value. Reduces 10,000 DB queries to 1. - Jittered expiration:
TTL = base_ttl + random(0, jitter_window). Spreads expirations over a window instead of a single instant. - Stale-while-revalidate: Serve stale value immediately while refreshing in background. HTTP
Cache-Control: stale-while-revalidate=60at CDN level. - Request coalescing: If multiple identical requests arrive while one is in-flight, collapse into a single upstream request. Nginx’s
proxy_cache_lock, Go’ssingleflight. - Pre-warming: Background job refreshes popular keys before TTL expires.
- Cache stampede lock (Mutex): Only one request acquires a lock (
- Your Redis cluster goes down entirely during peak traffic. How do you recover without taking down the database? Circuit breaker on DB access. Semaphore limiting concurrent DB queries. Graceful degradation for most requests. Long-term: Redis Sentinel/Cluster for HA. Multi-tier cache: in-process L1 cache (Caffeine) survives Redis failures.
- You deploy a new feature that changes cache key format. All existing caches miss. How do you prevent a thundering herd? Gradual rollout: 1% of traffic primes new cache, then 10%, 50%, 100%. Or support both key formats during transition — read new key, fallback to old key, write to new key.
43. Blue-Green vs Canary Deployments
43. Blue-Green vs Canary Deployments
- Blue-Green: Two identical environments. Deploy to Green, switch traffic from Blue to Green. Instant rollback (switch back). Downsides: 2x infrastructure cost. Database migrations must be backward-compatible. All-or-nothing (no gradual validation).
- Canary: Deploy to 1-5% of servers. Monitor. Gradually increase. Downsides: Slower rollout. Requires sophisticated traffic routing and monitoring. Some bugs only manifest at high traffic percentages.
- Rolling (middle ground): Update servers one at a time. Kubernetes default
RollingUpdate. No extra infrastructure cost, but slower rollback. - Decision: Blue-green for stateless apps with critical instant-rollback needs. Canary for high-traffic consumer apps. Rolling for cost-sensitive environments.
- Blue-green deploy with a database migration that adds a column. How do you handle it? The migration must be backward-compatible. Step 1: Add column with default value (nullable). Both old and new code work. Step 2: Deploy new code. Step 3: After decommissioning old, clean up (make non-nullable, drop unused columns). This is the “expand and contract” pattern.
- Your canary at 1% shows 0.5% error rate vs 0.3% baseline. Significant enough to abort? With small sample sizes, statistical noise is significant. Use proper statistical tests (Mann-Whitney U). Wait for more data or increase canary percentage. Netflix’s Kayenta automates canary analysis with statistical methods.
44. Service Discovery Patterns
44. Service Discovery Patterns
- Client-Side Discovery: Client queries a service registry (Eureka, Consul) for available instances, then load-balances directly. Advantage: No extra hop. Disadvantage: Every client needs a discovery library. Example: Netflix Eureka + Ribbon.
- Server-Side Discovery: Client sends to a load balancer, which queries the registry and routes to a healthy instance. Client is simple. Example: Kubernetes kube-proxy + DNS-based discovery.
- Service Mesh (Istio, Linkerd): Sidecar proxy (Envoy) handles discovery, load balancing, retries, circuit breaking, mTLS. Zero application code changes. Disadvantage: Operational complexity, 1-3ms latency overhead per hop.
- Eureka registry is down. What happens? Existing connections continue. Clients use cached instance lists (Eureka clients cache aggressively). New services cannot register and stale entries persist. The system degrades gracefully — Eureka prioritizes availability (AP).
- How does Kubernetes service discovery work under the hood? Service gets a ClusterIP. Kube-proxy programs iptables/IPVS rules to route traffic from ClusterIP to healthy pod IPs. CoreDNS registers a DNS record. Readiness probes remove failing pods from Endpoints. No external registry needed.
45. Distributed Consensus
45. Distributed Consensus
- The problem: Getting multiple nodes to agree on a single value despite crashes, delays, and partitions.
- Why it is hard (FLP Impossibility, 1985): In a fully asynchronous system, consensus is impossible if even one node can crash. All practical protocols use timeouts (partial synchrony).
- Paxos (Lamport, 1989): Foundational. Provably correct but difficult to implement as a complete system. Multi-Paxos is the practical variant.
- Raft (Ongaro, 2014): Designed for understandability. Leader election (randomized timeouts), log replication (majority acknowledgment), safety (candidate must have most up-to-date log). Used in etcd, Consul, CockroachDB.
- ZAB (ZooKeeper): Optimized for primary-backup replication. Totally ordered broadcast.
- Practical systems: etcd (Raft — Kubernetes state), ZooKeeper (ZAB — Hadoop, Kafka metadata), CockroachDB (Raft — per-range consensus), Spanner (Paxos).
- When would you use a consensus protocol directly vs using etcd? Almost always use etcd or ZooKeeper. Implementing Raft correctly takes months and years of hardening. Only implement directly when building a distributed database with unique requirements.
- The leader writes to a database but dies before completing. How does the new leader handle this? Idempotent leader work with fencing tokens. The new leader retries with a monotonically increasing revision number. If the old leader’s stale write arrives later, the database rejects it.
46. Backpressure
46. Backpressure
- The problem: In a pipeline (API → Queue → Worker → Database), if the worker is slower than the API, work accumulates. Without backpressure, the queue grows unboundedly until the system crashes.
- Mechanisms:
- Bounded queues: Maximum size. When full, new messages rejected. Producer slows down or returns 429.
- TCP flow control: Receiver advertises zero window size when buffer is full.
- Reactive Streams: Subscriber tells publisher how many items it can handle (
request(N)). Demand-driven flow. - HTTP 429: The ultimate backpressure signal to external clients.
- Load shedding: Drop lower-priority requests to preserve capacity for high-priority ones. Amazon sheds browsing metadata to protect checkout during spikes.
- Kafka consumer backpressure: Consumers pull at their own pace. Monitoring consumer lag alerts when falling behind.
- Your Kafka consumer is falling behind — lag growing by 10K messages/minute. What do you do? Add more consumer instances (up to partition count). Diagnose: CPU-bound (optimize code), I/O-bound (batch writes, async I/O), or waiting on external services (circuit breakers). If unrecoverable, consider resetting offset to latest.
- How does backpressure work in a microservices call chain (A → B → C → D)? If D is slow, C fills up, B fills up, A times out. Without backpressure, retries amplify load. With circuit breakers at each level, the pressure propagates upstream. Service meshes automate this with retry budgets (max 20% retries).
47. Chaos Engineering
47. Chaos Engineering
- The philosophy: “The best way to avoid failure is to fail constantly.” Proactively inject failures to discover weaknesses before real outages.
- The methodology:
- Define steady state: P99 latency < 200ms, error rate < 0.1%.
- Hypothesize: “If we kill 1 of 3 DB replicas, the system should failover within 30s with no user-visible errors.”
- Inject failure. 4. Observe. 5. Learn.
- Types of experiments: Infrastructure (kill instances, terminate AZs), Network (latency, packet drops, partitions), Application (inject exceptions, exhaust threads), Dependency (external API errors/timeouts), State (corrupt cache, clock skew, fill disk).
- Tools: Chaos Monkey (Netflix), Litmus (Kubernetes-native), Gremlin (commercial), AWS Fault Injection Simulator, Toxiproxy (Shopify).
- Safety controls: Start in staging. Limit blast radius. Set automatic abort conditions. Run during business hours. Gradually increase scope.
- VP asks: “Why are you breaking things on purpose?” How do you justify it? “We are discovering things that are already broken but have not manifested yet. Our last experiment found a misconfigured circuit breaker that would have caused a cascading failure during peak traffic — estimated 0 and a few hours of engineering time.”
- How do you run chaos experiments without affecting real users? Synthetic traffic, feature flags routing small percentages through the chaos path, blast radius control (single pod/instance), low-traffic windows with on-call engineers ready.
48. Caching Hazards
48. Caching Hazards
- Cache Avalanche: Cache layer goes down entirely or mass key expiration. All traffic hits DB. Mitigation: Redis Cluster/Sentinel for HA. Jittered TTLs. Multi-tier caching (in-process L1 cache survives Redis failure). Circuit breaker on DB.
- Cache Penetration: Requests for keys that do not exist in cache or DB. Every request bypasses cache and hits DB. Mitigation: Bloom filter of valid keys. Cache null values with short TTL. Input validation.
- Cache Stampede (Hot Key Expiry): Single hot key expires. Thousands of concurrent requests hit DB simultaneously. Mitigation: Mutex/lock (one request fetches, others wait). Early refresh before TTL expires. Stale-while-revalidate. Never-expire hot keys with event-driven invalidation.
- You deploy a Bloom filter for penetration protection. A new product is added. Users cannot find it because the Bloom filter says it does not exist. How do you fix it? Rebuild periodically. Or maintain a secondary “recently added” set checked alongside the main filter. New keys added to both DB and secondary set. Merged on next rebuild.
-
Your cache has 10M keys, all with 1-hour TTL. At the top of each hour, DB gets hammered. Avalanche or stampede?
Avalanche — mass expiration. Fix: add TTL jitter (
3600 + random(0, 600)) so keys expire across 10 minutes instead of at the boundary.
49. Proxy vs Reverse Proxy
49. Proxy vs Reverse Proxy
- Forward Proxy (protects the client): Sits between client and internet. Server sees proxy’s IP, not client’s. Use cases: Anonymity (VPN, Tor), content filtering (corporate network blocks social media via Squid Proxy), caching (university proxy caches popular downloads), bypassing geo-restrictions.
- Reverse Proxy (protects the server): Sits between internet and server(s). Client never communicates directly with backend. Use cases: Load balancing, SSL termination, caching (Varnish, Nginx
proxy_cache), compression, security (hide backend topology, absorb DDoS, WAF), request routing. Tools: Nginx, HAProxy, Envoy, Cloudflare, Traefik, Caddy. - Key distinction: Forward proxy is configured by the client. Reverse proxy is configured by the server — clients do not know it exists.
-
Nginx as reverse proxy buffers entire 50 MB upload body before forwarding. This adds latency. How do you fix it?
proxy_request_buffering off;for streaming. Or have clients upload directly to S3 via pre-signed URL, bypassing the proxy for large payloads. - How does a reverse proxy enable zero-downtime deployments? Maintains client connection while swapping backends. Start new instances, add to upstream pool, drain old instances (stop new requests, complete in-flight), remove old. Client’s TLS connection to the proxy is uninterrupted.
50. API Gateway vs Load Balancer
50. API Gateway vs Load Balancer
- Load Balancer (L4/L7): Distribute traffic across instances. Health checks, connection draining, SSL termination, basic path-based routing. Does NOT do: Authentication, rate limiting, request transformation, API versioning.
- API Gateway (Application-level): Single entry point with cross-cutting concerns. Beyond load balancing: Auth/authz (JWT, API keys), rate limiting, request/response transformation, API composition/aggregation (fan out to multiple services), analytics, circuit breaking. Examples: Kong, AWS API Gateway, Apigee.
- In practice, you use both: Gateway handles application logic (auth, rate limiting). Routes to services through a load balancer (distributes across instances).
- When NOT to use an API Gateway: For internal service-to-service (east-west) traffic, use a service mesh (Istio/Linkerd). The gateway is for external (north-south) traffic.
- 30 microservices, 500 routing rules. How do you manage gateway complexity? Treat config as code (GitOps). CI/CD validation. Per-service route ownership (Ambassador supports per-service Kubernetes annotations). Shared behavior via plugins, not duplication per route.
- Should your API Gateway be synchronous or asynchronous? Synchronous for request-response APIs. Asynchronous for high-volume event ingestion — acknowledge immediately (202 Accepted) and enqueue for processing (SQS/Kafka). AWS API Gateway supports both.
6. Advanced Scenario-Based Questions
Scenario 1: Feed Ranking at Scale — Your feed service is serving 500M users. Monday morning, latency spikes from p99 50ms to 2s. Feed ranking scores are stale. What's happening and how do you fix it?
Scenario 1: Feed Ranking at Scale — Your feed service is serving 500M users. Monday morning, latency spikes from p99 50ms to 2s. Feed ranking scores are stale. What's happening and how do you fix it?
- “Add more servers” or “Scale up the database.” No diagnosis, just throwing resources at it.
- Ignore the staleness clue and focus only on latency.
- Suggest rebuilding the ranking system from scratch.
- Immediate diagnosis: Monday morning = traffic surge from weekend catch-up reads. The ranking model’s feature store (likely Redis or a feature serving layer like Feast) is cache-cold or overwhelmed. Stale scores suggest the offline pipeline (Spark/Flink job that computes ranking features) either failed over the weekend or its output hasn’t propagated.
- Short-term fix: Check the ML pipeline DAG (Airflow/Dagster) for failed weekend jobs. If the feature store is hot, add read replicas or fall back to a simpler scoring model (e.g., chronological + engagement weight) with a feature flag. Instagram does exactly this — they degrade to a “time-decay + like-count” fallback when their ML ranker is slow.
- Root cause pattern: Feed ranking has a hidden coupling — the serving path depends on a batch pipeline completing on time. Introduce an SLO on pipeline freshness (e.g., features must be <6 hours old), alert on staleness, not just latency. LinkedIn’s feed team monitors “feature age” as a first-class metric.
- Long-term fix: Move hot features to a streaming pipeline (Flink computing engagement signals in real-time) so you’re not dependent on a batch job completing by Monday 6 AM. Pre-warm caches Sunday night with a synthetic load test. Implement a two-tier ranking: lightweight first pass (cheap features, sub-10ms) then re-rank top 200 candidates with heavy ML features.
- Metrics to watch: Feature staleness age, cache hit ratio on feature store, model inference p99, fallback-mode activation rate.
- How would you design the fallback ranking so it doesn’t just show garbage? What signals are cheap enough to compute in real-time?
- Your ranking model uses 500 features. At 500M users, how do you serve feature lookups without the feature store becoming a bottleneck? Walk me through the data layout.
- The PM says “users are complaining that their feed is boring on Mondays.” Is this a ranking bug or a content supply problem? How do you tell the difference with data?
Scenario 2: Payment Processing Idempotency — A merchant reports that 3 out of 10,000 customers were charged twice for the same order. Your payment service uses an idempotency key. What went wrong?
Scenario 2: Payment Processing Idempotency — A merchant reports that 3 out of 10,000 customers were charged twice for the same order. Your payment service uses an idempotency key. What went wrong?
- “Just add a unique constraint on the order ID.” They don’t understand that the idempotency layer clearly exists already and failed.
- “Use a transaction.” Without explaining what transaction, where, and across which services.
- Treat it as a simple duplicate insert problem rather than a distributed systems failure.
- The usual suspect: The idempotency key was checked and the charge was recorded in two separate steps without atomicity. Between the check (“have I seen this key?”) and the write (“mark this key as processed”), a retry slipped through. This is the classic check-then-act race condition. At Stripe, they solve this by using a Postgres advisory lock on the idempotency key — the check and the insert happen inside a single serializable transaction.
- Second suspect: Client-side retry with a new idempotency key. The mobile app crashed after sending the request but before receiving the 200. On restart, it generated a new idempotency key for the “same” order. Fix: the idempotency key must be deterministic — derived from
order_id + customer_id, not randomly generated per request. - Third suspect: The idempotency store (Redis) lost the key between the first and second attempt. If you’re using Redis with no persistence or with a short TTL (say 60s), and the retry comes at 61s, the key is gone. Stripe keeps idempotency keys for 24 hours in Postgres, not Redis, for exactly this reason.
- Architecture fix: Implement idempotency at the payment gateway level, not just the API level. Use a state machine for each payment:
CREATED -> PROCESSING -> CHARGED -> CONFIRMED. The transition from PROCESSING to CHARGED must be atomic (single DB row update with a WHERE clause on the current state). Any duplicate attempt sees state=CHARGED and returns the cached response. - Detection and recovery: For the 3 customers already charged twice, you need an async reconciliation job that compares your ledger against the payment processor’s records (Stripe/Adyen API). Auto-refund duplicates with an explanation email. Run this reconciliation hourly, not daily — Shopify learned this the hard way during Black Friday 2022.
- The idempotency key is stored in Postgres. You’re processing 50,000 payments/second on Black Friday. How do you prevent the idempotency table from becoming a write bottleneck?
- Walk me through exactly what happens when a payment request times out at the HTTP level but the charge actually went through at the payment processor. How does your system recover?
- Your idempotency keys have a 24-hour TTL. A customer retries a failed payment 25 hours later. What happens and how should the system behave?
Scenario 3: Real-Time Leaderboard — You run a mobile game with 20M daily active players. The product team wants a global leaderboard that updates in real-time with each player's rank. Current implementation uses SQL ORDER BY and it's buckling. Design a replacement.
Scenario 3: Real-Time Leaderboard — You run a mobile game with 20M daily active players. The product team wants a global leaderboard that updates in real-time with each player's rank. Current implementation uses SQL ORDER BY and it's buckling. Design a replacement.
- “Use Redis sorted sets” — correct tool but with no consideration for 20M members and the write throughput required.
- Can’t explain how
ZRANKworks under the hood or what happens when 50,000 score updates hit per second. - No mention of how to handle ties or display “your rank is 1,234,567 out of 20,000,000.”
- Core data structure: Redis Sorted Set (
ZADD leaderboard score user_id) gives O(log N) inserts and O(log N) rank lookups viaZREVRANK. At 20M members, log2(20M) is about 24 operations — well within Redis single-thread capacity for reads. For writes, a single Redis instance can handle roughly 100KZADDops/sec, so 50K score updates/sec is feasible on one node. - The real problem is reads, not writes: If 20M players each check the leaderboard every 30 seconds, that’s 670K
ZREVRANKcalls/sec. A single Redis instance can’t handle that. Solution: read replicas. One primary handles writes, 5-10 replicas handle rank lookups. Stale by 1-2 seconds, which is fine for a game leaderboard. - Sharding for extreme scale: If you need multiple leaderboards (per-region, per-game-mode), shard by leaderboard ID. For a single massive global board beyond Redis capacity, partition scores into buckets (0-1000, 1001-2000, etc.), keep a count per bucket, and compute approximate rank as
sum of counts in higher buckets + rank within bucket. This is how Riot Games handles League of Legends ranked. - Tie-breaking: Redis sorted sets break ties lexicographically on the member name. To break ties by timestamp (earlier score wins), encode the score as
actual_score * 10^10 + (MAX_TIMESTAMP - timestamp). This way, higher scores and earlier timestamps both sort higher. - Displaying “nearby” ranks:
ZREVRANGE leaderboard (rank-5) (rank+5) WITHSCORESgives you the 11 players around the current user. This is what players actually care about — not ranks 1-10, but “am I beating my friends?” - Cost consideration: A Redis instance with 20M sorted set members uses roughly 2-3 GB of RAM. At AWS pricing, that’s about 2000/month RDS instance. This is a 40x cost reduction.
- The game goes viral in Asia and you now have 200M players. Your single sorted set no longer fits the write throughput on one primary. How do you shard a single global leaderboard without losing accurate global ranks?
- Product wants “weekly leaderboard that resets every Monday.” How do you handle the reset of 200M entries without a Redis outage?
- A cheater submits a score of 999,999,999. How does your system detect and handle this without manual intervention?
Scenario 4: Distributed Rate Limiter — Your API gateway rate-limits users to 100 requests/minute. It works fine on a single node. Now you have 12 gateway nodes behind a load balancer. Users are getting 1200 requests/minute through. Fix it.
Scenario 4: Distributed Rate Limiter — Your API gateway rate-limits users to 100 requests/minute. It works fine on a single node. Now you have 12 gateway nodes behind a load balancer. Users are getting 1200 requests/minute through. Fix it.
- “Use sticky sessions so each user always hits the same node.” This defeats the purpose of load balancing and creates hotspots.
- “Sync the counters between all 12 nodes on every request.” This creates O(N) network calls per request and adds 50ms+ latency.
- Don’t see the fundamental problem: local counters on N nodes give N times the actual limit.
- The core fix: Centralize the counter in Redis. Each gateway node does
INCR user:$id:minute:$timestampwith a 120-second TTL. Redis handles 100K+ INCR ops/sec, so 12 nodes checking limits adds negligible load. This is what Cloudflare, Kong, and most production API gateways actually do. - The latency concern: Adding a Redis round-trip (0.5-1ms) per request is acceptable for most APIs. If sub-millisecond matters, use a hybrid approach: each node keeps a local counter and a local “budget.” The central store allocates budgets (e.g.,
100/12 = 8 requests per node per window). Nodes rate-limit locally against their budget and periodically sync with Redis. Envoy Proxy calls this “local rate limiting with a global backstop.” - Sliding window implementation: Fixed windows have the burst-at-boundary problem (100 requests at 0:59, 100 more at 1:00 = 200 in 2 seconds). Use a sliding window counter: store counts for the current and previous minute, weight the previous minute’s count by how far into the current minute you are. Redis key:
user:$id:count:$minute. TwoGETcalls, oneINCR, all pipelined. GitHub’s API uses this exact approach. - Redis failure mode: If Redis goes down, what happens? Two choices: fail-open (allow all requests — prioritize availability) or fail-closed (reject all requests — prioritize safety). For a payments API, fail-closed. For a social media API, fail-open with local rate limiting as a degraded fallback. Document this decision explicitly in your runbook.
- Lua scripting for atomicity: The check-and-increment must be atomic. Use a Redis Lua script:
local count = redis.call('INCR', key); if count == 1 then redis.call('EXPIRE', key, ttl) end; return count. This prevents the race where two nodes both read count=99, both increment to 100, and both allow the request (actual count: 101).
- Your Redis rate-limit cluster handles 500K checks/sec. A DDoS floods you with 5M unique IPs/sec. Now Redis itself is the bottleneck. How do you rate-limit the rate limiter?
- The business wants different rate limits for different tiers: free users get 100/min, paid get 10,000/min, enterprise gets unlimited. How does this change your Redis key design and the Lua script?
- You discover that 10% of rate-limited users are being incorrectly blocked because clock skew between gateway nodes causes window misalignment. How do you fix it?
Scenario 5: Notification System Fan-Out — Your app has 50M users. A popular creator with 30M followers posts. Your notification service needs to notify all 30M followers. Currently, it takes 45 minutes and the queue backs up for other notifications. Redesign it.
Scenario 5: Notification System Fan-Out — Your app has 50M users. A popular creator with 30M followers posts. Your notification service needs to notify all 30M followers. Currently, it takes 45 minutes and the queue backs up for other notifications. Redesign it.
- “Just add more workers” — doesn’t address the architectural problem.
- Treat all notifications equally. No concept of priority or tiering.
- Don’t consider that 30M push notifications means 30M calls to APNs/FCM, which has its own rate limits and failure modes.
- Priority queues: The 45-minute backup means a single celebrity post is starving time-sensitive notifications (password resets, 2FA codes, order confirmations). Separate queues by priority: P0 (security/transactional — immediate), P1 (direct interactions — likes, comments on your post), P2 (broadcast/fan-out — celebrity posts). P2 can take 30 minutes; P0 must complete in under 10 seconds. RabbitMQ priority queues or separate Kafka topics with dedicated consumer groups.
- Batched fan-out: Don’t create 30M individual notification tasks. Batch followers into segments of 10,000. That’s 3,000 tasks instead of 30M. Each worker processes a batch, calling APNs/FCM batch APIs (APNs supports 1,000/request in HTTP/2 multiplexed, FCM supports topic messaging for up to 1M devices). Twitter’s notification system processes batches of 5,000 followers per worker.
- Tiered delivery: Not all 30M followers need a push notification. Users who haven’t opened the app in 30 days get no push (saves APNs calls). Users who have notifications muted for this creator get an in-app badge only. Active users in the last 24 hours get a push. This typically cuts the actual push fan-out from 30M to 3-5M. Instagram does this — they call it “notification relevance filtering.”
- Pre-computed follower segments: Don’t query the followers table at fan-out time. Pre-compute follower lists for large accounts (>100K followers) and store them partitioned by shard in a KV store. When the post happens, the fan-out service reads pre-sharded follower lists in parallel. Rebuild these segments hourly or on follow/unfollow events via a streaming pipeline.
- APNs/FCM rate limiting and failures: Apple rate-limits at roughly 30K-50K notifications/sec per provider certificate. FCM has similar limits per project. You need connection pooling (persistent HTTP/2 connections to APNs), exponential backoff on 429s, and a dead-letter queue for failed tokens. Invalid tokens (user uninstalled the app) should trigger a cleanup job — Twitter found that 15-20% of stored push tokens are stale at any given time.
- A bug in the notification template sends 30M users a garbled message. You need to “recall” or suppress the notification. How would you design a kill-switch for in-flight notifications?
- How do you handle the “notification storm” when 10 celebrities all post within the same 60-second window? That’s potentially 100M+ notifications queued simultaneously.
- Users complain they get the push notification 20 minutes after the post. Your SLO is 5 minutes for P1 notifications. Walk me through how you’d instrument the pipeline to find the bottleneck.
Scenario 6: URL Shortener Hot Keys — Your URL shortener handles 10B redirects/month. A single short URL (a viral tweet link) gets 500K hits/second, melting the cache node responsible for that key. The site is returning 503s. What do you do?
Scenario 6: URL Shortener Hot Keys — Your URL shortener handles 10B redirects/month. A single short URL (a viral tweet link) gets 500K hits/second, melting the cache node responsible for that key. The site is returning 503s. What do you do?
Scenario 7: Chat System Presence — Your chat app (think Slack/Discord) shows green dots for online users. With 5M concurrent users, the presence service is consuming 40% of your total infrastructure cost and generating 2B Redis writes/day from heartbeats. Optimize it.
Scenario 7: Chat System Presence — Your chat app (think Slack/Discord) shows green dots for online users. With 5M concurrent users, the presence service is consuming 40% of your total infrastructure cost and generating 2B Redis writes/day from heartbeats. Optimize it.
- “Just increase the heartbeat interval” — from 5s to 30s? Now users appear online for 30 seconds after closing the app.
- “Remove the presence feature” — not a technical solution, just feature deletion.
- Don’t calculate the actual write volume or understand why it’s expensive.
- Math first: 5M users sending a heartbeat every 5 seconds = 1M writes/sec to Redis. At 86,400 seconds/day = 86.4B operations/day (the 2B number means it’s probably already at a 30s interval). Each heartbeat is a
SET user:$id:presence online EX 10— tiny payload but enormous QPS. At 23K/month. The 40% cost figure checks out. - Optimization 1 — Channel-scoped presence: Users don’t need to know the online status of all 5M users, only people in their visible channels/DMs. Instead of a global presence store, scope presence to channels. When a user opens a channel, subscribe to presence updates for that channel’s members only. Discord does this — presence is per-guild, not global. This typically reduces the “presence audience” by 100x.
- Optimization 2 — Pub/Sub for changes, not polling: Instead of each client polling “is user X online?” every 5 seconds, push presence changes via WebSocket. When user A goes offline, publish a presence change event to the channels they’re in. Subscribers get notified. No polling at all. This converts O(N) continuous polling into O(1) event-based updates. Slack switched to this model and reduced presence-related traffic by 95%.
- Optimization 3 — Tiered heartbeat intervals: Active users (typing, sending messages) heartbeat every 10 seconds. Idle users (app open but no interaction for 5 minutes) heartbeat every 60 seconds. Backgrounded mobile users heartbeat every 5 minutes (or rely on push-reconnect). This alone can reduce write volume by 70%.
- Optimization 4 — Bloom filter for “is anyone online?”: Before fetching individual presence statuses for a channel with 500 members, check a per-channel Bloom filter that answers “does this channel have any online members?” If no, skip the presence fetch entirely. For channels with no active users (most channels most of the time), this eliminates the query entirely.
- Architecture: Replace the per-user Redis key with a per-server connection registry. Each WebSocket server knows which users are connected to it. Presence queries fan out to “which servers have user X connected?” using a lightweight gossip protocol or a central registry (ZooKeeper/etcd). When a server dies, all its users are marked offline in bulk — no need to expire 100K individual keys.
- A WebSocket server crashes with 50,000 connected users. How does the system detect this and mark all 50,000 users offline without a 60-second delay? What’s the blast radius?
- User A has the app open on their phone and laptop simultaneously. How do you handle multi-device presence? When should the green dot disappear?
- Your presence system uses pub/sub for change events, but you notice that “user came online” events are sometimes delivered out of order with messages. Users see a message from someone who appears “offline.” How do you fix this consistency issue?
Scenario 8: Search Autocomplete — Your e-commerce site serves autocomplete suggestions as users type. At 200M queries/day, the Trie-based service works fine. The product team now wants personalized suggestions (based on user's past searches and purchases) and trending suggestions (based on what's popular right now). Your Trie rebuild takes 6 hours. Redesign for freshness and personalization.
Scenario 8: Search Autocomplete — Your e-commerce site serves autocomplete suggestions as users type. At 200M queries/day, the Trie-based service works fine. The product team now wants personalized suggestions (based on user's past searches and purchases) and trending suggestions (based on what's popular right now). Your Trie rebuild takes 6 hours. Redesign for freshness and personalization.
- “Rebuild the Trie more often” — doesn’t solve personalization and 6 hours is a pipeline constraint, not a tuning knob.
- “Store per-user Tries” — 200M users with individual Tries would require petabytes of RAM.
- Conflate search ranking with autocomplete — they’re fundamentally different problems (prefix matching vs. relevance scoring).
- Separate the concerns into three layers: (1) Base suggestions from the global Trie — high-quality, pre-computed, updated every few hours. (2) Trending overlay — computed from a real-time streaming pipeline, refreshed every 60 seconds. (3) Personal overlay — derived from the user’s recent searches and purchase history at query time.
- Base layer (global Trie): Keep the existing Trie but optimize the rebuild. Instead of full rebuild, use incremental updates: maintain a “delta Trie” of new/changed suggestions since last full build. At query time, merge results from the main Trie and the delta Trie. Full rebuild becomes a compaction step that runs daily. Google’s autocomplete does this — the main index is rebuilt periodically, but trending queries appear within minutes via a streaming sidecar.
- Trending layer: Use a sliding-window counter (Redis Sorted Set or Apache Flink) tracking query frequency over the last 1 hour. Top 10K trending queries by prefix are stored in a small Redis cluster. At autocomplete time, fetch trending suggestions for the prefix and merge them into the Trie results with a boost score. Implementation: Kafka topic of search queries -> Flink job computing top-K per prefix per hour -> Redis sorted sets keyed by prefix. Latency: 1-2ms for the Redis lookup. Amazon does this for “trending searches” during Prime Day.
- Personalization layer: Don’t build per-user Tries. Instead, at query time: (a) Fetch user’s last 50 searches and last 20 purchased product categories from a user-profile service (stored in DynamoDB/Redis, ~1ms). (b) From the Trie’s candidate suggestions (say top 50 for the prefix), re-rank by computing a lightweight relevance score:
base_score * 0.6 + trending_boost * 0.2 + personal_relevance * 0.2. Personal relevance = cosine similarity between the suggestion’s category embedding and the user’s interest embedding. This adds ~5ms to the autocomplete latency but makes it feel “magical.” - Latency budget: Users expect autocomplete in under 100ms. Breakdown: network RTT (20-30ms) + prefix lookup in Trie (1-2ms) + trending fetch from Redis (1-2ms) + user profile fetch (2-5ms) + re-ranking (1-2ms) + serialization (1ms). Total: ~35-45ms server-side. Well within budget. If the personal or trending layers are slow, fall back to base Trie results with a timeout of 10ms on each enrichment call.
- Freshness for product catalog changes: When a product is discontinued or renamed, the autocomplete shouldn’t suggest it. Maintain a blacklist in Redis (checked at serving time) and a whitelist push from the catalog service. This is cheaper than rebuilding the Trie. Etsy found that stale autocomplete suggestions (recommending out-of-stock items) reduced click-through rate by 12%.
- A user types “i” — this matches millions of suggestions. How do you keep the Trie lookup fast for very short prefixes? What pruning strategy do you use?
- Your trending layer picks up a query that’s trending because of a tragedy (e.g., a mass shooting). How do you build content safety into the autocomplete pipeline? What’s the architecture for a real-time blocklist?
- You want to A/B test three different ranking formulas for personalized autocomplete. How do you design the system so ranking logic is swappable without redeploying the service?
Scenario 9: Content Moderation Pipeline — Your social platform handles 50M user-generated posts per day (text, images, and video). Currently, human moderators review flagged content, but the backlog is 72 hours and harmful content stays visible the entire time. Legal is threatening fines under the EU Digital Services Act (DSA) which requires removal of illegal content within 24 hours of notice. Design an automated content moderation pipeline that reduces the backlog to under 1 hour while keeping false-positive rates below 2%.
Scenario 9: Content Moderation Pipeline — Your social platform handles 50M user-generated posts per day (text, images, and video). Currently, human moderators review flagged content, but the backlog is 72 hours and harmful content stays visible the entire time. Legal is threatening fines under the EU Digital Services Act (DSA) which requires removal of illegal content within 24 hours of notice. Design an automated content moderation pipeline that reduces the backlog to under 1 hour while keeping false-positive rates below 2%.