System Design Interview Questions (50+ Detailed Q&A)

1. Core Concepts & Scalability

Answer:
  • Vertical (Scale Up): Bigger machine (more RAM/CPU). Limits: hardware cost and maximum capacity. The single machine remains a Single Point of Failure (SPOF).
  • Horizontal (Scale Out): More machines. No single-machine ceiling, so capacity grows by adding nodes. Added complexity: Load Balancing, Data Consistency.
Answer: CAP theorem. A distributed system can guarantee at most 2 of these 3 properties at the same time:
  • Consistency: Every read receives the most recent write or an error.
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
  • Partition Tolerance: The system continues to operate even when network messages are dropped or delayed (Partitions).
  • Real world: Partitions cannot be avoided, so P is mandatory. The choice during a partition is CP (Bank) vs AP (Social Feed).
Answer:
  • ACID (SQL): Atomicity, Consistency, Isolation, Durability. Strict.
  • BASE (NoSQL): Basically Available, Soft state, Eventual consistency. Flexible.
Answer:
  • Round Robin: Sequential.
  • Least Connections: Send to server with fewest open connections.
  • Weighted Round Robin: For servers with different specs.
  • IP Hash: Sticky Session (User always goes to same server).
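Answer: Solves rebalancing in distributed caching/sharding. Maps both keys and servers onto the same hash ring (often pictured as 0-360 deg). Each key belongs to the next server clockwise. Adding or removing a server only moves the keys between that server and its neighbor, so data movement is minimal (roughly 1/N of the keys, versus nearly all keys with Hash(Key) % N).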
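The ring lookup can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name `HashRing`, the use of MD5 for placement, and the default of 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each server is placed at many positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First position at or after the key's hash ("next server clockwise").
        # The 1-tuple (h,) sorts before every (h, node) entry.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

With this sketch, adding a fourth server to a three-server ring reassigns only the keys that now fall to the new server; every other key keeps its owner.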
Answer: Splitting data across multiple machines.
  • Horizontal: Rows 1-1000 on DB1, 1001-2000 on DB2.
  • Vertical: User table on DB1, Product table on DB2.
  • Challenges: Joins across shards (slow or unsupported), Rebalancing.
Answer:
  • Read-Through: App asks Cache. If miss, Cache fetches from DB.
  • Write-Through: Write goes to Cache and DB synchronously before it is acknowledged. Safe but slow write.
  • Write-Back: Write to Cache (RAM), flush to DB asynchronously. Fast, but data is lost if the cache crashes before the flush.
  • Cache Aside: App manages it. Check Cache -> Else DB -> Update Cache.
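The Cache Aside flow (check cache, fall back to DB, populate cache) can be sketched as follows. The class `CacheAside` and its two dictionaries are stand-ins invented for this example; a real setup would use a cache such as Redis and an actual database. Invalidating on write is one common choice and is an assumption here, not something the list above specifies.

```python
class CacheAside:
    """Hypothetical app-side wrapper: the app, not the cache, talks to the DB."""

    def __init__(self, db):
        self.db = db        # stand-in for the database (a dict here)
        self.cache = {}     # stand-in for Redis/Memcached
        self.db_reads = 0   # counter, only to make cache hits visible

    def get(self, key):
        if key in self.cache:            # 1. check cache
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)         # 2. miss: read from DB
        if value is not None:
            self.cache[key] = value      # 3. populate cache for next time
        return value

    def put(self, key, value):
        self.db[key] = value             # write the source of truth first
        self.cache.pop(key, None)        # then invalidate the stale entry


store = CacheAside({"u1": "Ada"})
store.get("u1")   # miss, reads DB
store.get("u1")   # hit, served from cache
```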
Answer:
  • LRU (Least Recently Used): Evict the item that has gone longest without being accessed. Most common.
  • LFU (Least Frequently Used): Evict the item with the fewest accesses.
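As an illustration of LRU eviction, here is a minimal sketch built on `collections.OrderedDict`. The class name `LRUCache` and its method names are chosen for the example only.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache sketch: least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)      # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used
cache.put("c", 3)   # evicts "b"
```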
Answer: Distributed network of proxy servers. Caches static content (Images, CSS, Video) at edge locations close to user. Reduces Latency and Server Load. Push vs Pull CDN.
Answer:
  • Stateless: Server keeps no session data. Each request carries everything needed (e.g. a JWT). Easy scaling, since any server can handle any request.
  • Stateful: Server keeps session in memory. Harder scaling (needs Sticky sessions or an external session store such as Redis).

2. Distributed Systems Internals

Answer:
  • Strong: Reading immediately after write returns new data. High Latency (Sync replication).
  • Eventual: Reading might return stale data for a few ms. Low Latency (Async replication).
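Answer: Configurable consistency. N = number of replicas. R = replicas that must respond to a read. W = replicas that must acknowledge a write. If R + W > N, then you have strong consistency (every read set overlaps every write set, so a read sees the latest write). Example: N=3, W=2, R=2.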
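The overlap argument can be checked by brute force for small clusters. The helper `quorums_overlap` below is a hypothetical name for this sketch; it enumerates every possible read set and write set and confirms they always share at least one node.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """True if every set of r read nodes shares a node with every set of w write nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


print(quorums_overlap(3, 2, 2))  # N=3, R=2, W=2: R + W > N
print(quorums_overlap(3, 1, 2))  # N=3, R=1, W=2: R + W == N, a read can miss the write
```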
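Answer: Nodes elect a Leader to handle writes. Followers replicate. If the Leader dies, an election happens. Split Brain: A network partition leaves two nodes each believing it is the leader. Solved by Quorum (a leader needs a majority vote, and only one side of a partition can hold a majority).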
Answer: Probabilistic Data Structure. Space efficient. Question: “Is element in set?” Answers: “No” (100% sure) or “Maybe” (false positives are possible, false negatives are not). Used in: DB (to avoid disk lookups for missing keys), CDNs.
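A minimal sketch of the idea follows. The class name `BloomFilter`, the sizes, and the way hash positions are derived from seeded SHA-256 digests are all assumptions for illustration; real implementations pack bits tightly and tune the size and hash count to a target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means the item was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))
```

A storage engine could consult `might_contain(key)` before touching disk and skip the lookup whenever it returns `False`.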
Answer:
  • Token Bucket: Allow burst.
  • Leaky Bucket: Constant outflow rate. Smooths traffic.
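  • Fixed Window: Counter resets at each window boundary (e.g. every minute). Simple, but a burst straddling the boundary can pass up to 2x the limit.
  • Sliding Window: Counts requests over the trailing window. More accurate, at the cost of more memory/compute.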
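Of the algorithms above, Token Bucket is short enough to sketch in full. The class and parameter names (`TokenBucket`, `rate`, `capacity`, `clock`) are assumptions for this example; the injectable clock exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Token bucket sketch: refill at `rate` tokens/second, burst up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically respond with HTTP 429
```

A bucket created with `rate=1, capacity=3` admits three back-to-back requests, rejects the fourth, and admits one more for each second that passes.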
Answer:
  • UUID: 128-bit. Unordered (random v4). Collisions are practically impossible, no coordination needed. Long.
  • Snowflake (Twitter): 64-bit. Time sorted. Epoch + MachineID + Sequence.
  • DB AutoIncrement: Hard to scale (Write bottleneck).
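A Snowflake-style generator can be sketched as below. The bit split (timestamp in the high bits, then 10 bits of machine ID and 12 bits of sequence) follows the commonly described layout; the epoch constant, the class name `SnowflakeGenerator`, and the injectable `clock_ms` are assumptions for this example.

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657   # custom epoch in ms; an assumed value for this sketch
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Sketch of a 64-bit, time-sorted ID: timestamp | machine_id | sequence."""

    def __init__(self, machine_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock_ms()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            return (
                ((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self.sequence
            )
```

Because the timestamp occupies the high bits, IDs from one generator sort by creation time, and the machine ID keeps generators on different hosts from colliding without coordination.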
Answer: Servers send pulse to central monitor every X seconds. If missed Y pulses -> Dead. Gossip Protocol: Nodes talk to neighbors randomly to propagate health status (Cassandra).
Answer: Prevent cascading failure. If a service fails repeatedly (e.g. 5 times in a row), Open Circuit (Fail fast immediately). After a timeout, Half-Open (Try one request). If it succeeds, Close Circuit (Resume). If it fails, Open again.
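The closed, open, and half-open states can be sketched as a small wrapper around a call. All names here (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`) are hypothetical, and counting consecutive failures is one simple policy among several.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0          # consecutive failures
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"   # timeout elapsed: allow one trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"    # trip (or re-trip) the breaker
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"          # success closes the circuit
        return result
```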
Answer: Isolate failure domains like ship compartments. Service A usage shouldn’t starve Service B threads. Separate Thread Pools / Resources.
Answer: f(f(x)) = f(x). Retrying a request multiple times has same effect as once. Crucial for Payment APIs. Impl: Unique Request ID + Deduplication table.
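The "Unique Request ID + Deduplication table" approach can be sketched as follows. `PaymentService` and its in-memory dictionary are stand-ins for the example; a real service would keep the dedup table in durable storage and record the ID atomically with the charge.

```python
class PaymentService:
    """Sketch of an idempotent charge endpoint keyed by a client-supplied request ID."""

    def __init__(self):
        self.processed = {}   # request_id -> stored response (the dedup table)
        self.charges = []     # side effects actually performed

    def charge(self, request_id, account, amount):
        if request_id in self.processed:
            # Retry of a request we already handled: replay the original result.
            return self.processed[request_id]
        self.charges.append((account, amount))
        response = {"status": "charged", "account": account, "amount": amount}
        self.processed[request_id] = response
        return response


svc = PaymentService()
first = svc.charge("req-123", "acct-1", 50)
retry = svc.charge("req-123", "acct-1", 50)   # network retry with the same ID
# first == retry, and only one charge was recorded
```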

3. Storage & Data

Answer:
  • SQL: Structured Data, Relations (Joins), ACID transactions needed. (E-comm, Bank).
  • NoSQL: Unstructured, High Write throughput, Flexible schema. (Logs, Social Feed, Metadata).
Answer:
  • B-Tree (SQL): Read optimized. Update in place. Random IO.
  • LSM Tree (NoSQL/Cassandra): Write optimized. Append only (MemTable -> SSTable). Sequential IO.
Answer:
  • Master-Slave: Write to Master. Read from Slaves. Async lag.
  • Master-Master: Write to any. Conflict resolution needed.
  • Leaderless: All nodes equal, any node accepts reads and writes (Dynamo).
Answer:
  • Range: Key A-M (Node 1), N-Z (Node 2). Hotspot risk (e.g. if most user names start with ‘A’).
  • Hash: Hash(Key) % N. Even distribution.
Answer:
  • Block (EBS): HDD/SSD attached to OS. Fast. Boot drive.
  • File (EFS/NFS): Shared folder hierarchy.
  • Object (S3): Flat ID/Metadata. HTTP API. Cheap, scalable. Not for OS boot.
Answer:
  • Lake (S3): Raw data (Logs, JSON, CSV). Schema-on-Read. Cheap.
  • Warehouse (Snowflake): Cleaned, Structured data. Schema-on-Write. Queries.
Answer:
  • RabbitMQ: Push-based. Smart broker. Good for complex routing/tasks. Message deleted after ack.
  • Kafka: Pull-based (Log). Dumb broker. High throughput. Message persisted for X days. Replayable.
Answer:
  • Polling: Client asks every 5s.
  • Long Polling: Server holds connection open until data available.
  • WebSockets: Bi-directional. Chat.
  • SSE (Server Sent Events): Uni-directional (Server -> Client). Stock ticker.
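Answer: Location indices (Yelp/Uber). Map 2D coordinates to a 1D string (Geohash) or a tree (Quadtree) so that nearby points share a prefix or a subtree, which makes radius ("what is near me") searches fast.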
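One well-known 2D-to-1D string encoding is Geohash. The source does not name a specific scheme, so treat this as an illustrative choice; the function name `geohash_encode` and the default precision are assumptions. The sketch interleaves longitude and latitude bits and emits one base-32 character per 5 bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string of `precision` chars."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid      # keep the upper half
        else:
            value <<= 1
            rng[1] = mid      # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:         # 5 bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

Points that are close together usually share a long prefix, so a prefix query on an ordinary string index can serve as a coarse proximity search. Points on opposite sides of a cell boundary are the known exception, which is why implementations typically also check neighboring cells.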
Answer:
  • Row (Postgres): Good for transactions (CRUD one user).
  • Columnar (BigQuery): Good for Analytics (Avg Salary of 1M rows), since only the needed columns are read. Compresses well. (Cassandra is a wide-column store, which is still organized by row.)

4. Design Cases (The “Design X” Questions)

Answer:
  • Core: How to generate the short key? A hash such as MD5 is too long (128 bits), and truncating it risks collisions. Instead, Base62-encode an AutoIncrement ID.
  • Scale: Billions of URLs.
  • DB: Key-Value (DynamoDB). Fast reads.
  • Cleanup: TTL or Lazy delete on access.
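The Base62 step can be sketched as below. The alphabet order and the function names `encode` and `decode` are choices made for this example; any fixed ordering of the 62 characters works as long as both directions agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode(n):
    """Turn a non-negative integer ID into a short Base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, BASE)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def decode(s):
    """Inverse of encode: recover the integer ID from the short string."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) IDs, which is why short links of that length are enough for billions of URLs.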
Answer:
  • Where: API Gateway/Middleware.
  • Store: Redis (Counters).
  • Algo: Token Bucket or Sliding Window Log (for precision).
  • Distributed: Consistent Hashing for counters.
Answer:
  • Model: User, Follow, Post.
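  • Push Model (Fanout on Write): When User A posts, push the post ID to all Followers’ timeline lists in Redis. Fast Read. Slow Write. (Bad for accounts with millions of followers, e.g. Justin Bieber.)
  • Pull Model (Fanout on Read): Aggregate followees’ posts at read time. Fast Write. Slow Read.
  • Hybrid: Push for normal users, Pull for celebs.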
Answer:
  • Proto: WebSocket.
  • Storage: HBase/Cassandra (Time series).
  • Status: Heartbeat to Redis.
  • Encryption: E2E (Signal protocol).
Answer:
  • Upload: Chunking.
  • Transcoding: Convert to varying bitrates/formats (Worker Queue).
  • Storage: S3.
  • Delivery: CDN (Open Connect).
  • Adaptive Streaming: DASH/HLS (Client picks quality).
Answer:
  • Chunking: File split into 4MB blocks. Deduplication (Hash blocks).
  • Sync: Sync only changed blocks.
  • Meta DB: File hierarchy in SQL/NoSQL.
  • Block Store: S3.
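Chunking with hash-based deduplication can be sketched as follows. `BlockStore`, `split_chunks`, and the choice of SHA-256 as the block hash are assumptions for illustration; the in-memory dictionary stands in for an object store such as S3.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB blocks


def split_chunks(data, chunk_size=CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class BlockStore:
    """Sketch of a content-addressed block store: identical blocks are stored once."""

    def __init__(self):
        self.blocks = {}  # sha256 hex digest -> block bytes

    def upload(self, data, chunk_size=CHUNK_SIZE):
        """Store a file; return (manifest of block hashes, count of new blocks sent)."""
        manifest, uploaded = [], 0
        for chunk in split_chunks(data, chunk_size):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blocks:   # only unseen blocks are transferred
                self.blocks[digest] = chunk
                uploaded += 1
            manifest.append(digest)
        return manifest, uploaded

    def download(self, manifest):
        return b"".join(self.blocks[d] for d in manifest)
```

The manifest (an ordered list of block hashes) is what would live in the metadata DB. Re-uploading a file after editing one block transfers just that block, which is the "sync only changed blocks" behavior.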
Answer:
  • DS: Trie (Prefix Tree).
  • Optimization: Store Top 5 results in each Trie Node.
  • Update: Offline aggregation (MapReduce) to rebuild Trie.
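A Trie that caches the top results at each node can be sketched as below. The names `TrieNode` and `Autocomplete`, and the `{phrase: count}` input shape, are assumptions for the example; the `build` method stands in for the offline rebuild step.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached (count, phrase) pairs for this prefix, best first


class Autocomplete:
    """Sketch: each node stores its top-k completions, so lookup is O(len(prefix))."""

    def __init__(self, k=5):
        self.k = k
        self.root = TrieNode()

    def build(self, counts):
        """Rebuild the whole trie from a {phrase: count} mapping."""
        self.root = TrieNode()
        for phrase, count in counts.items():
            node = self.root
            for ch in phrase:
                node = node.children.setdefault(ch, TrieNode())
                node.top.append((count, phrase))
                # Highest count first; ties broken alphabetically.
                node.top.sort(key=lambda pair: (-pair[0], pair[1]))
                del node.top[self.k:]   # keep only the k best per node

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]
```

Storing the top results per node trades memory for latency: a query walks the prefix once and returns the cached list without scanning the subtree.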
Answer:
  • Queue: URL Frontier (Kafka).
  • Dedup: Bloom Filter (Visited URLs).
  • Politeness: Per-domain rate limit.
  • DNS: Custom DNS cache.
Answer:
  • Pluggable Senders: Email, SMS, Push.
  • Queue: RabbitMQ (Priority Queues).
  • Rate Limit: Don’t spam users.
Answer:
  • Naive: SQL ORDER BY. Slow.
  • Fast: Redis Sorted Set (ZADD leaderboard score user). O(log N) per update.

5. Reliability & Operations

Answer: Virtual nodes (Consistent hashing). Salting keys (Append random number to ‘bieber’ key to spread load).
Answer: 1000 processes wake up to handle 1 event (or cache expiry causing DB spike). Fix: Jitter (Random sleep), Leasing/Locking.
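The Jitter fix can be sketched in two small helpers, one for retry delays and one for cache TTLs. Both function names and their default parameters are assumptions for this example; "full jitter" (a uniform delay between zero and the exponential cap) is one common variant.

```python
import random


def backoff_with_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def jittered_ttl(ttl, spread=0.1, rng=random):
    """Randomize a cache TTL by +/- spread so keys do not all expire together."""
    return ttl * (1 + rng.uniform(-spread, spread))


# Clients retrying after the same failure pick different delays,
# and keys cached at the same moment expire at slightly different times.
delay = backoff_with_jitter(attempt=3)
ttl = jittered_ttl(300)
```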
Answer:
  • Blue-Green: Instant switch. 2x cost.
  • Canary: Gradual rollout (1%, 10%). Safer.
Answer:
  • Client-side: Client queries Registry (Eureka).
  • Server-side: Client hits LB. LB queries Registry.
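Answer: Getting nodes to agree on a value despite failures. Paxos (hard to understand and implement). Raft (designed to be understandable; used by etcd and Consul).
Answer: System signals upstream to slow down. TCP Window. Reactive Streams. When a queue fills up, reject with HTTP 429 (Too Many Requests) and a Retry-After header.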
Answer: Killing random servers in Prod (Chaos Monkey) to test resilience.
Answer:
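  • Avalanche: Many keys expire at once (or the cache goes down), so millions of requests hit the DB. (Fix: Randomized TTLs, cache redundancy.)
  • Penetration: Requests for a non-existent key miss the cache and hit the DB every time. (Fix: Bloom filter, cache the empty result.)
  • Stampede: One hot key expires and many concurrent requests rebuild it at the same time. (Fix: Locking so only one request rebuilds.)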
Answer:
  • Forward: Protects Client. (Hide IP, Filter content).
  • Reverse: Protects Server. (LB, SSL term, Cache).
Answer:
  • LB: Distributes traffic across servers, at L4 (transport) or L7 (application).
  • Gateway: App Logic. (Auth, Rate Limit, Transformation, Routing).