System Design Interview Questions (50+ Detailed Q&A)

1. Core Concepts & Scalability

1. Vertical vs Horizontal Scaling

Answer:

Vertical (Scale Up): Bigger machine (more RAM/CPU). Limits: hardware cost and a hard capacity ceiling. Still a Single Point of Failure (SPOF).
Horizontal (Scale Out): More machines. Scales far beyond a single box, but adds complexity: Load Balancing, Data Consistency.

2. CAP Theorem

Answer: In a distributed system, you can only pick 2 of 3:

Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
Partition Tolerance: The system continues to operate even when network messages between nodes are dropped or delayed (Partitions).
Real world: Partitions will happen, so P is mandatory. The choice during a partition is CP (Bank) vs AP (Social Feed).

3. ACID vs BASE

Answer:

ACID (SQL): Atomicity, Consistency, Isolation, Durability. Strict.
BASE (NoSQL): Basically Available, Soft state, Eventual consistency. Flexible.

4. Load Balancer Algorithms

Answer:

Round Robin: Sequential.
Least Connections: Send to server with fewest open connections.
Weighted Round Robin: For servers with different specs.
IP Hash: Sticky Session (User always goes to same server).

5. Consistent Hashing

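Answer: Solves rebalancing in distributed caching/sharding. Maps data and servers onto a hash ring (the hash space wrapped into a circle, often pictured as 0-360 deg). Each key belongs to the next server clockwise. Adding/removing a server only moves the keys between that server and its neighbor, so data movement is minimal (roughly 1/N of the keys instead of nearly all of them).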
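A minimal sketch of the ring described above, assuming MD5 as the placement hash and 100 virtual nodes per server (both are illustrative choices, and HashRing is a hypothetical name, not a library class):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable hash works; MD5 is used for even spread, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each server appears at many ring positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash; wrap past the end.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

Adding a server to this ring only takes keys away from existing servers; no key moves between two old servers.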

6. Database Sharding

Answer: Splitting data across multiple machines.

Horizontal: Rows 1-1000 on DB1, 1001-2000 on DB2.
Vertical: User table on DB1, Product table on DB2.
Challenges: Joins across shards (Impossible/Slow), Rebalancing.

7. Caching Strategies

Answer:

Read-Through: App asks Cache. If miss, Cache fetches from DB.
Write-Through: Write to Cache and DB simultaneously. Safe but slow write.
Write-Back: Write to RAM (Cache), async write to DB. Fast but risk of data loss on crash.
Cache Aside: App manages it. Check Cache -> on miss, read DB -> Update Cache.
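The Cache Aside flow can be sketched as below. Plain dicts stand in for the cache and the DB, and the class name and db_reads counter are hypothetical, added only to make the behavior visible:

```python
class CacheAside:
    def __init__(self, db: dict):
        self.db = db          # stand-in for the database
        self.cache = {}       # stand-in for Redis/Memcached
        self.db_reads = 0     # illustrative counter of DB round trips

    def get(self, key):
        if key in self.cache:             # 1. check cache
            return self.cache[key]
        self.db_reads += 1                # 2. miss: read from DB
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value       # 3. populate cache
        return value

    def put(self, key, value):
        self.db[key] = value
        # Invalidate rather than update, so the next read repopulates.
        self.cache.pop(key, None)
```

Invalidating on write is one common choice; updating the cache entry in place is the other, with a higher risk of racing writers leaving stale data.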

8. Eviction Policies (LRU vs LFU)

Answer:

LRU (Least Recently Used): Evict the item that has gone longest without being accessed. Most common.
LFU (Least Frequently Used): Evict the item with the fewest accesses.
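A short LRU sketch built on collections.OrderedDict, which keeps insertion order and can move a key to the end in O(1). The LRUCache name and its methods are illustrative:

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
```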

9. CDN (Content Delivery Network)

Answer: Distributed network of proxy servers. Caches static content (Images, CSS, Video) at edge locations close to user. Reduces Latency and Server Load. Push vs Pull CDN.

10. Stateless vs Stateful Architecture

Answer:

Stateless: Server keeps no session data. Each request carries all needed info (e.g. a JWT). Easy scaling.
Stateful: Server keeps session in memory. Harder scaling (Sticky sessions/Redis store needed).

2. Distributed Systems Internals

11. Strong vs Eventual Consistency

Answer:

Strong: Reading immediately after write returns new data. High Latency (Sync replication).
Eventual: Reading might return stale data for a short window (often a few ms, longer under replication lag). Low Latency (Async replication).

12. Quorum (N, R, W)

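Answer: Configurable consistency. N = number of replicas. R = replicas that must respond to a read. W = replicas that must acknowledge a write. If R + W > N, every read set overlaps every write set, so a read sees the latest acknowledged write (strong consistency). Example: N=3, W=2, R=2.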
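The overlap argument can be checked by brute force. This sketch enumerates every possible read set and write set for small N and confirms they always intersect exactly when R + W > N (function names are illustrative):

```python
from itertools import combinations


def quorum_rule(n: int, r: int, w: int) -> bool:
    # The rule of thumb from the text.
    return r + w > n


def every_read_overlaps_write(n: int, r: int, w: int) -> bool:
    # Exhaustively check that any R replicas share a node with any W replicas.
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )
```

For N=3, W=2, R=2 both functions return True; for N=3, W=1, R=1 both return False.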

13. Leader Election (Raft / Paxos)

Answer: Nodes elect a Leader to handle writes. Followers replicate. If Leader dies, election happens. Split Brain: Network partition creates two leaders. Solved by Quorum (Majority vote).

14. Bloom Filters

Answer: Probabilistic Data Structure. Space efficient. Answers “Is element in set?” with either “No” (100% sure) or “Maybe” (probably present; false positives are possible, false negatives are not). Used in: DB (to avoid disk lookups for missing keys), CDNs.
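A small Bloom filter sketch. The bit-array size, the number of hashes, and the use of salted SHA-256 as the hash family are assumptions chosen for readability, not tuned values:

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "maybe present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```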

15. Rate Limiting Algorithms

Answer:

Token Bucket: Allow burst.
Leaky Bucket: Constant outflow rate. Smooths traffic.
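Fixed Window: Counter resets at each window boundary (e.g. every minute). Simple, but allows up to 2x the limit in a burst that straddles the boundary.
Sliding Window: Counts requests over the trailing window. More accurate, costs more memory/compute.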
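A Token Bucket sketch. The injectable clock parameter is an assumption added so the behavior can be exercised without sleeping; in production the default monotonic clock would be used:

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate=1 and capacity=3, three requests pass immediately (the burst), the fourth is rejected, and one more request is admitted per second after that.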

16. Distributed ID Generation

Answer:

UUID: 128-bit. Unordered (random v4). Collisions are practically impossible, not strictly impossible. Long.
Snowflake (Twitter): 64-bit. Time sorted. Timestamp (ms since a custom epoch) + MachineID + Sequence.
DB AutoIncrement: Hard to scale (Write bottleneck).
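A Snowflake-style generator sketch using the 41-bit timestamp / 10-bit machine / 12-bit sequence layout. EPOCH_MS is an arbitrary custom epoch chosen for this example, and the injectable clock_ms parameter is an assumption for testability:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)


class SnowflakeGenerator:
    def __init__(self, machine_id: int,
                 clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < 1024:   # must fit in 10 bits
            raise ValueError("machine_id must be in [0, 1023]")
        self.machine_id = machine_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = self.clock_ms()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                if self.sequence == 0:
                    # 4096 IDs used in this millisecond: wait for the next.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one machine sort by creation time, and the machine bits keep IDs from different machines distinct.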

17. Heartbeat & Health Checks

Answer: Servers send pulse to central monitor every X seconds. If missed Y pulses -> Dead. Gossip Protocol: Nodes talk to neighbors randomly to propagate health status (Cassandra).

18. Circuit Breaker Pattern

Answer: Prevent cascading failure. If a service fails repeatedly (e.g. 5 times in a row), Open Circuit (Fail fast immediately). After a timeout, Half-Open (Try one request). If it succeeds, Close Circuit (Resume); if it fails, Open again.
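The Closed / Open / Half-Open cycle can be sketched as follows. This version is single-threaded and counts consecutive failures; the class names, defaults, and injectable clock are assumptions for illustration:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial in half-open, or too many failures, opens it.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```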

19. Bulkhead Pattern

Answer: Isolate failure domains like ship compartments. Service A usage shouldn’t starve Service B threads. Separate Thread Pools / Resources.

20. Idempotency

Answer: f(f(x)) = f(x). Retrying a request multiple times has same effect as once. Crucial for Payment APIs. Impl: Unique Request ID + Deduplication table.
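A sketch of the Unique Request ID + Deduplication table approach. An in-memory dict stands in for the deduplication table, and PaymentService is a hypothetical name; a real service would persist the table and write it in the same transaction as the charge:

```python
class PaymentService:
    def __init__(self, balance: int = 100):
        self.balance = balance
        self._processed = {}  # request_id -> stored response (dedup table)

    def charge(self, request_id: str, amount: int) -> dict:
        # A retried request returns the saved response with no second charge.
        if request_id in self._processed:
            return self._processed[request_id]
        self.balance -= amount
        response = {"status": "ok", "balance": self.balance}
        self._processed[request_id] = response
        return response
```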

3. Storage & Data

21. SQL vs NoSQL (When to choose?)

Answer:

SQL: Structured Data, Relations (Joins), ACID transactions needed. (E-comm, Bank).
NoSQL: Unstructured, High Write throughput, Flexible schema. (Logs, Social Feed, Metadata).

22. Database Indexing (B-Tree vs LSM)

Answer:

B-Tree (SQL): Read optimized. Update in place. Random IO.
LSM Tree (NoSQL/Cassandra): Write optimized. Append only (MemTable -> SSTable). Sequential IO.

23. Replication Types

Answer:

Master-Slave: Write to Master. Read from Slaves. Async lag.
Master-Master: Write to any. Conflict resolution needed.
Leaderless: All nodes equal; any node accepts reads and writes (Dynamo).

24. Partitioning Strategies

Answer:

Range: Key A-M (Node 1). Hotspot risk (e.g. if most usernames start with ‘A’).
Hash: Hash(Key) % N. Even distribution, but changing N remaps most keys.
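Both strategies in a few lines, as a sketch. stable_hash is a hypothetical helper built on SHA-256, used because Python's built-in hash() for strings is randomized per process:

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    # Same result in every process, unlike the built-in hash() for str.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def hash_partition(key: str, num_nodes: int) -> int:
    return stable_hash(key) % num_nodes


def range_partition(key: str, split_points: list) -> int:
    # split_points are sorted boundaries: ["n"] sends keys < "n" to node 0
    # and keys >= "n" to node 1.
    return bisect.bisect_right(split_points, key)
```

With split_points ["n"], every key starting with ‘a’ lands on node 0, which is the hotspot risk. With hash partitioning, going from 4 to 5 nodes changes the target node for most keys.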

25. File Storage (Block vs Object vs File)

Answer:

Block (EBS): HDD/SSD attached to OS. Fast. Boot drive.
File (EFS/NFS): Shared folder hierarchy.
Object (S3): Flat ID/Metadata. HTTP API. Cheap, scalable. Not for OS boot.

26. Data Lake vs Data Warehouse

Answer:

Lake (S3): Raw data (Logs, JSON, CSV). Schema-on-Read. Cheap.
Warehouse (Snowflake): Cleaned, Structured data. Schema-on-Write. Queries.

27. Message Queues (Kafka vs RabbitMQ)

Answer:

RabbitMQ: Push-based. Smart broker. Good for complex routing/tasks. Message deleted after ack.
Kafka: Pull-based (Log). Dumb broker. High throughput. Message persisted for X days. Replayable.

28. Long Polling vs WebSockets vs SSE

Answer:

Polling: Client asks every 5s.
Long Polling: Server holds connection open until data available.
WebSockets: Bi-directional. Chat.
SSE (Server Sent Events): Uni-directional (Server -> Client). Stock ticker.

29. Geohashing / Quadtree

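Answer: Location indices (Yelp/Uber). Geohash encodes a 2D point as a 1D string in which nearby points usually share a prefix; a Quadtree recursively splits the map into 4 cells. Both enable fast radius (nearby) search.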
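A sketch of standard Geohash encoding: bits alternate between longitude and latitude (longitude first), each bit halves the remaining range, and every 5 bits become one base32 character. The function name and default precision are illustrative:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit always refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# geohash_encode(42.605, -5.603, 5) -> "ezs42"
```

A radius query then becomes a prefix lookup on the cell containing the point plus its neighboring cells, since two close points can fall on opposite sides of a cell boundary.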

30. Row-based vs Columnar DB

Answer:

Row (Postgres): Good for transactions (CRUD one user).
Columnar (Redshift/BigQuery): Good for Analytics (Avg Salary of 1M rows). Compresses well. Note: Cassandra is a wide-column store with row-oriented storage, not a columnar DB.

4. Design Cases (The “Design X” Questions)

31. Design a URL Shortener (TinyURL)

Answer:

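Core: Hashing the URL (e.g. MD5) gives a 128-bit digest, too long for a short link, and truncating it risks collisions. Prefer Base62 encoding of an AutoIncrement ID (7 chars cover 62^7, about 3.5 trillion IDs).
Scale: Billions of URLs.
DB: Key-Value (DynamoDB). Fast reads.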
Cleanup: TTL or Lazy delete on access.
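Base62 encoding of a numeric ID, as a sketch. The alphabet order (digits, lowercase, uppercase) is an assumption; any fixed ordering works as long as encode and decode agree:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    if n < 0:
        raise ValueError("id must be non-negative")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))  # most significant digit first


def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Sequential IDs produce guessable short codes; if that matters, the ID can be shuffled or combined with a random component before encoding.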

32. Design Rate Limiter

Answer:

Where: API Gateway/Middleware.
Store: Redis (Counters).
Algo: Token Bucket or Sliding Window Log (for precision).
Distributed: Consistent Hashing for counters.

33. Design Instagram Feed

Answer:

Model: User, Follow, Post.
Push Model (Fanout on Write): When User A posts, push ID to all Followers’ timeline lists in Redis. Fast Read. Slow Write. (Bad for Justin Bieber).
Pull Model: Read time aggregation.
Hybrid: Push for normal, Pull for celebs.

34. Design Chat (WhatsApp)

Answer:

Proto: WebSocket.
Storage: HBase/Cassandra (Time series).
Status: Heartbeat to Redis.
Encryption: E2E (Signal protocol).

35. Design Youtube/Netflix

Answer:

Upload: Chunking.
Transcoding: Convert to varying bitrates/formats (Worker Queue).
Storage: S3.
Delivery: CDN (Open Connect).
Adaptive Streaming: DASH/HLS (Client picks quality).

36. Design Google Drive (Dropbox)

Answer:

Chunking: File split into 4MB blocks. Deduplication (Hash blocks).
Sync: Sync only changed blocks.
Meta DB: File hierarchy in SQL/NoSQL.
Block Store: S3.
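Chunking with hash-based deduplication, sketched with a dict standing in for the block store. BlockStore and its method names are hypothetical; the 4MB chunk size comes from the text and SHA-256 is an assumed choice of block hash:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB blocks


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    def __init__(self):
        self.blocks = {}  # sha256 hex digest -> block bytes (stand-in for S3)

    def put_file(self, data: bytes, chunk_size: int = CHUNK_SIZE):
        """Store a file; return (manifest, number of blocks actually uploaded)."""
        manifest = []
        uploaded = 0
        for chunk in split_chunks(data, chunk_size):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blocks:   # skip blocks already stored
                self.blocks[digest] = chunk
                uploaded += 1
            manifest.append(digest)
        return manifest, uploaded

    def get_file(self, manifest: list) -> bytes:
        return b"".join(self.blocks[d] for d in manifest)
```

The manifest (ordered list of block hashes) is what the Meta DB would hold per file version. Fixed-size chunks only deduplicate in-place edits; an insertion shifts every later block, which is why some systems use content-defined chunking instead.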

37. Design Typeahead (Search Autocomplete)

Answer:

DS: Trie (Prefix Tree).
Optimization: Store Top 5 results in each Trie Node.
Update: Offline aggregation (MapReduce) to rebuild Trie.
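A Trie with the top results cached in each node, as a sketch. The build step takes a term-to-count dict, standing in for the output of the offline aggregation; class names and the tie-break rule (alphabetical) are assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached (count, term) pairs for this prefix


class Autocomplete:
    def __init__(self, k: int = 5):
        self.k = k
        self.root = TrieNode()

    def build(self, term_counts: dict) -> None:
        for term, count in term_counts.items():
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
                # Keep only the k most frequent terms under this prefix.
                node.top.append((count, term))
                node.top.sort(key=lambda pair: (-pair[0], pair[1]))
                del node.top[self.k:]

    def suggest(self, prefix: str) -> list:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [term for _, term in node.top]
```

A lookup walks one node per typed character and returns the cached list, so query time does not depend on how many terms share the prefix.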

38. Design Web Crawler

Answer:

Queue: URL Frontier (Kafka).
Dedup: Bloom Filter (Visited URLs).
Politeness: Per-domain rate limit.
DNS: Custom DNS cache.

39. Design Notification System

Answer:

Pluggable Senders: Email, SMS, Push.
Queue: RabbitMQ (Priority Queues).
Rate Limit: Don’t spam users.

40. Design Leaderboard

Answer:

Naive: SQL ORDER BY. Slow.
Fast: Redis Sorted Set (ZADD key score member). O(log N) per update.

5. Reliability & Operations

41. Handling Hot Partitions

Answer: Virtual nodes (Consistent hashing). Salting keys (Append random number to ‘bieber’ key to spread load).

42. Thundering Herd Problem

Answer: 1000 processes wake up to handle 1 event (or cache expiry causing DB spike). Fix: Jitter (Random sleep), Leasing/Locking.
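Two small jitter helpers, as a sketch: randomized exponential backoff for retries, and a randomized TTL so keys written together do not expire together. The function names, defaults, and the rng parameter are assumptions for illustration:

```python
import random


def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0,
                        rng=random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    # Pick uniformly in [0, exponential ceiling] so clients spread out.
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def jittered_ttl(base_ttl: float, spread: float = 0.1, rng=random) -> float:
    """TTL within +/- `spread` (as a fraction) of base_ttl."""
    return base_ttl * (1 + rng.uniform(-spread, spread))
```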

43. Blue-Green vs Canary

Answer:

Blue-Green: Instant switch. 2x cost.
Canary: Gradual rollout (1%, 10%). Safer.

44. Service Discovery Patterns

Answer:

Client-side: Client queries Registry (Eureka).
Server-side: Client hits LB. LB queries Registry.

45. Distributed Consensus

Answer: Getting nodes to agree on value. Paxos (Hard). Raft (Standard).

46. Backpressure

Answer: System signals upstream to slow down. TCP Window. Reactive Streams. Queue fill event -> 429 Retry Later.
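The queue-full case can be sketched with a bounded queue that rejects instead of blocking. BoundedIngress is a hypothetical name, and returning bare status codes (202 accepted, 429 retry later) is a simplification of a real HTTP handler:

```python
import queue


class BoundedIngress:
    def __init__(self, max_pending: int):
        self.pending = queue.Queue(maxsize=max_pending)

    def submit(self, job) -> int:
        try:
            self.pending.put_nowait(job)  # never block the caller
        except queue.Full:
            return 429  # signal upstream to slow down and retry later
        return 202      # accepted for processing

    def take(self):
        # Called by a worker; frees one slot for new submissions.
        return self.pending.get_nowait()
```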

47. Chaos Engineering

Answer: Killing random servers in Prod (Chaos Monkey) to test resilience.

48. Caching Hazards

Answer:

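Avalanche: Many keys expire at once (or the cache goes down), so the DB is hit by millions of requests. (Fix: Staggered TTLs).
Penetration: Requesting non-existent key hits DB always. (Fix: Bloom filter).
Stampede: Many requests miss on the same hot key at the moment it expires and all recompute it. (Fix: Per-key lock/lease).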

49. Proxy vs Reverse Proxy

Answer:

Forward: Protects Client. (Hide IP, Filter content).
Reverse: Protects Server. (LB, SSL term, Cache).

50. API Gateway vs Load Balancer

Answer:

LB: Distributes traffic across servers at the transport (L4) or application (L7) layer.
Gateway: App Logic. (Auth, Rate Limit, Transformation, Routing).
