System Design Interview Questions (50+ Detailed Q&A)
1. Core Concepts & Scalability
1. Vertical vs Horizontal Scaling
- Vertical (Scale Up): Bigger machine (more RAM/CPU). Limits: hardware cost and a maximum machine size. The one machine is still a Single Point of Failure (SPOF).
- Horizontal (Scale Out): More machines. Scales much further, but adds complexity: Load Balancing, Data Consistency.
2. CAP Theorem
- Consistency: Every read receives the most recent write or an error.
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
- Partition Tolerance: The system continues to operate even when network messages between nodes are dropped or delayed (a partition).
- Real world: Partitions will happen, so P is mandatory. During a partition the choice is CP (Bank) vs AP (Social Feed).
3. ACID vs BASE
- ACID (SQL): Atomicity, Consistency, Isolation, Durability. Strict.
- BASE (NoSQL): Basically Available, Soft state, Eventual consistency. Flexible.
4. Load Balancer Algorithms
- Round Robin: Sequential.
- Least Connections: Send to server with fewest open connections.
- Weighted Round Robin: For servers with different specs.
- IP Hash: Hash the client IP to pick a server. Gives Sticky Sessions (the same user goes to the same server while the server pool is unchanged).
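The selection rules above can be sketched in a few lines. This is a minimal illustration, not a production balancer: the class and function names (`RoundRobin`, `LeastConnections`, `ip_hash`) are hypothetical, and connection counts are tracked in memory only.

```python
import hashlib
import itertools


class RoundRobin:
    """Hands out servers in a fixed, repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Picks the server that currently has the fewest open connections."""

    def __init__(self, servers):
        self.open_connections = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first.
        server = min(self.open_connections, key=self.open_connections.get)
        self.open_connections[server] += 1
        return server

    def release(self, server):
        # Call when a connection to `server` closes.
        self.open_connections[server] -= 1


def ip_hash(servers, client_ip):
    """Maps a client IP to the same server as long as `servers` is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

Weighted Round Robin could reuse `RoundRobin` by listing a stronger server more than once in `servers`.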
5. Consistent Hashing
6. Database Sharding
- Horizontal: Rows 1-1000 on DB1, 1001-2000 on DB2.
- Vertical: User table on DB1, Product table on DB2.
- Challenges: Joins across shards (slow or impossible), Rebalancing.
7. Caching Strategies
- Read-Through: App asks Cache. If miss, Cache fetches from DB.
- Write-Through: Write to Cache and DB simultaneously. Safe but slow write.
- Write-Back: Write to RAM (Cache), async write to DB. Fast but risk of data loss on crash.
- Cache Aside: App manages it. Check Cache -> Else DB -> Update Cache.
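As an illustration of the Cache Aside flow, here is a minimal sketch in which plain dicts stand in for the cache and the database. The `CacheAside` class and its `db_reads` counter are assumptions made for the example; a real setup would talk to something like Redis and a SQL database, and would usually add a TTL.

```python
class CacheAside:
    """The application, not the cache, decides when to read or write the DB."""

    def __init__(self, db):
        self.db = db          # dict standing in for the database
        self.cache = {}       # dict standing in for the cache
        self.db_reads = 0     # counts cache misses that reached the DB

    def get(self, key):
        if key in self.cache:            # 1. check the cache
            return self.cache[key]
        self.db_reads += 1               # 2. miss: fall back to the DB
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value      # 3. populate the cache for next time
        return value

    def put(self, key, value):
        self.db[key] = value
        # Invalidate rather than update, so the next read reloads from the DB.
        self.cache.pop(key, None)
```

Invalidating on write is one common choice here; updating the cache entry in place is another, at the cost of more ways for the cache and the DB to disagree.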
8. Eviction Policies (LRU vs LFU)
- LRU (Least Recently Used): Evict the item that has gone longest without being accessed. Most common.
- LFU (Least Frequently Used): Evict the item with the fewest accesses.
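A minimal LRU sketch, assuming a fixed item capacity and using Python's `collections.OrderedDict` to track recency. The `LRUCache` name and its `get`/`put` interface are illustrative choices, not a reference implementation.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as a recent use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```

Both operations are O(1). An LFU variant would need a per-key access counter in addition to the ordering.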
9. CDN (Content Delivery Network)
10. Stateless vs Stateful Architecture
- Stateless: Server keeps no session data. Each request carries all the info it needs (e.g. a JWT). Easy scaling.
- Stateful: Server keeps session in memory. Harder scaling (needs Sticky sessions, or the session moved to a shared store such as Redis).
2. Distributed Systems Internals
11. Strong vs Eventual Consistency
- Strong: A read right after a write returns the new data. Higher latency (Sync replication).
- Eventual: A read may return stale data until the replicas catch up (often a few ms, longer under load). Lower latency (Async replication).
12. Quorum (N, R, W)
13. Leader Election (Raft / Paxos)
14. Bloom Filters
15. Rate Limiting Algorithms
- Token Bucket: Allow burst.
- Leaky Bucket: Constant outflow rate. Smooths traffic.
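- Fixed Window: Counter resets at each window boundary (e.g. every minute). Can let through up to 2x the limit around a boundary.
- Sliding Window: Counts requests over the last N seconds from now. Accurate, but costs more memory.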
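To make the burst behaviour of Token Bucket concrete, here is a minimal single-process sketch. The `TokenBucket` class, its `rate`/`capacity` parameters, and the injectable `clock` are assumptions made for the example (the clock is injectable only so the behaviour can be checked without sleeping).

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock
        self.tokens = float(capacity) # start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=2`, two back-to-back requests pass, the third is rejected, and one more passes after a second of refill.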
16. Distributed ID Generation
- UUID: 128-bit. Random (v4), so unordered. Collisions are negligibly unlikely, not impossible. Long.
- Snowflake (Twitter): 64-bit. Time sorted. Timestamp (ms since a custom epoch) + MachineID + Sequence.
- DB AutoIncrement: Hard to scale (Write bottleneck).
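A sketch of a Snowflake-style generator, assuming the commonly described 41-bit timestamp, 10-bit machine ID, and 12-bit sequence layout. `EPOCH_MS` is an arbitrary custom epoch chosen for this example, and `SnowflakeGenerator` and the injectable `clock_ms` are hypothetical names.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000   # arbitrary custom epoch (assumption)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, machine_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id does not fit in 10 bits")
        self.machine_id = machine_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self.clock_ms()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self.sequence
            )
```

Because the timestamp occupies the high bits, IDs from one generator sort by creation time, and no coordination between machines is needed beyond assigning distinct machine IDs.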
17. Heartbeat & Health Checks
18. Circuit Breaker Pattern
19. Bulkhead Pattern
20. Idempotency
f(f(x)) = f(x).
Retrying a request multiple times has same effect as once.
Crucial for Payment APIs.
Impl: Unique Request ID + Deduplication table.
3. Storage & Data
21. SQL vs NoSQL (When to choose?)
- SQL: Structured Data, Relations (Joins), ACID transactions needed. (E-comm, Bank).
- NoSQL: Unstructured, High Write throughput, Flexible schema. (Logs, Social Feed, Metadata).
22. Database Indexing (B-Tree vs LSM)
- B-Tree (SQL): Read optimized. Update in place. Random IO.
- LSM Tree (NoSQL/Cassandra): Write optimized. Append only (MemTable -> SSTable). Sequential IO.
23. Replication Types
- Master-Slave: Write to Master. Read from Slaves. Async lag.
- Master-Master: Write to any. Conflict resolution needed.
- Leaderless: All nodes equal; any node accepts reads and writes (Dynamo).
24. Partitioning Strategies
- Range: Keys A-M on Node 1, N-Z on Node 2. Hotspot risk (e.g. if most user names start with ‘A’).
- Hash: Hash(Key) % N. Even distribution, but changing N remaps most keys.
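The two strategies can be compared with a short sketch. The function names, the MD5-based `stable_hash`, and the synthetic `user{i}` keys are all assumptions made for the example; the last lines measure how many keys change node when a hash-partitioned cluster grows from 4 to 5 nodes.

```python
import bisect
import hashlib


def stable_hash(key):
    # Python's built-in hash() is salted per process, so use a digest instead.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def hash_partition(key, num_nodes):
    return stable_hash(key) % num_nodes


def range_partition(key, split_points):
    # split_points=["n"] sends keys below "n" to node 0 and the rest to node 1.
    return bisect.bisect_right(split_points, key.lower())


keys = [f"user{i}" for i in range(10_000)]
moved = sum(hash_partition(k, 4) != hash_partition(k, 5) for k in keys)
moved_fraction = moved / len(keys)
```

With a uniform hash, a key keeps its node only when `h % 4 == h % 5`, so about 4 in 5 keys are expected to move. That cost is the usual motivation for Consistent Hashing.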
25. File Storage (Block vs Object vs File)
- Block (EBS): HDD/SSD attached to OS. Fast. Boot drive.
- File (EFS/NFS): Shared folder hierarchy.
- Object (S3): Flat ID/Metadata. HTTP API. Cheap, scalable. Not for OS boot.
26. Data Lake vs Data Warehouse
- Lake (S3): Raw data (Logs, JSON, CSV). Schema-on-Read. Cheap.
- Warehouse (Snowflake): Cleaned, Structured data. Schema-on-Write. Optimized for analytical (SQL) queries.
27. Message Queues (Kafka vs RabbitMQ)
- RabbitMQ: Push-based. Smart broker. Good for complex routing/tasks. Message deleted after ack.
- Kafka: Pull-based (Log). Dumb broker. High throughput. Message persisted for X days. Replayable.
28. Long Polling vs WebSockets vs SSE
- Polling: Client asks every 5s.
- Long Polling: Server holds connection open until data available.
- WebSockets: Bi-directional. Chat.
- SSE (Server Sent Events): Uni-directional (Server -> Client). Stock ticker.
29. Geohashing / Quadtree
30. Row-based vs Columnar DB
- Row (Postgres): Good for transactions (CRUD one user).
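- Columnar (BigQuery/Redshift): Good for Analytics (Avg Salary over 1M rows), since a query reads only the columns it needs. Compresses well. (Cassandra is a wide-column store, not columnar in this sense.)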
4. Design Cases (The “Design X” Questions)
31. Design a URL Shortener (TinyURL)
- Core: How to generate the short key? An MD5 hash is too long (and truncating it risks collisions). Use Base62 encoding of an AutoIncrement ID.
- Scale: Billions of URLs.
- DB: Key-Value (DynamoDB). Fast reads.
- Cleanup: TTL or Lazy delete on access.
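The Base62 step can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an assumption; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars


def encode_base62(number):
    """Turns a non-negative integer ID into a short key."""
    if number < 0:
        raise ValueError("ID must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(key):
    """Recovers the integer ID from a short key."""
    number = 0
    for ch in key:
        number = number * 62 + ALPHABET.index(ch)
    return number
```

Seven characters cover 62^7 (about 3.5 trillion) IDs, which is why short keys of that length are enough for billions of URLs. Sequential IDs do produce guessable keys, so some designs shuffle or offset the ID before encoding.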
32. Design Rate Limiter
- Where: API Gateway/Middleware.
- Store: Redis (Counters).
- Algo: Token Bucket or Sliding Window Log (for precision).
- Distributed: Consistent Hashing for counters.
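A Sliding Window Log keeps one timestamp per accepted request, which is what buys the precision. The sketch below holds the log in process memory; the `SlidingWindowLog` class is hypothetical, and in the design above the per-client log would live in Redis instead. The caller passes `now` explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._log = defaultdict(deque)   # client_id -> timestamps of accepted requests

    def allow(self, client_id, now):
        timestamps = self._log[client_id]
        # Drop entries that have slid out of the window ending at `now`.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

Memory grows with `limit` per client, which is the usual reason to fall back to counters or a Token Bucket when limits are large.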
33. Design Instagram Feed
- Model: User, Follow, Post.
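- Push Model (Fanout on Write): When User A posts, push the Post ID to every Follower’s timeline list in Redis. Fast Read. Slow Write. (Bad for accounts with millions of followers, e.g. Justin Bieber).
- Pull Model (Fanout on Read): Build the timeline at read time from the posts of followed users. Fast Write. Slow Read.
- Hybrid: Push for normal users, Pull for celebs.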
34. Design Chat (WhatsApp)
- Proto: WebSocket.
- Storage: HBase/Cassandra (Time series).
- Status: Heartbeat to Redis.
- Encryption: E2E (Signal protocol).
35. Design YouTube/Netflix
- Upload: Chunking.
- Transcoding: Convert to varying bitrates/formats (Worker Queue).
- Storage: S3.
- Delivery: CDN (Open Connect).
- Adaptive Streaming: DASH/HLS (Client picks quality).
36. Design Google Drive (Dropbox)
- Chunking: File split into 4MB blocks. Deduplication (Hash blocks).
- Sync: Sync only changed blocks.
- Meta DB: File hierarchy in SQL/NoSQL.
- Block Store: S3.
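Chunking, hash-based deduplication, and syncing only changed blocks can be sketched together. Here a dict stands in for the block store, SHA-256 is an assumed choice of block hash, and `BlockStore` with its `upload`/`download` methods is a hypothetical interface.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4MB blocks, as in the outline above


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class BlockStore:
    def __init__(self):
        self.blocks = {}   # block hash -> block bytes (stand-in for S3)

    def upload(self, data, chunk_size=CHUNK_SIZE):
        """Stores only unseen blocks; returns (manifest, blocks actually uploaded)."""
        manifest = []
        uploaded = 0
        for chunk in split_into_chunks(data, chunk_size):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blocks:   # dedup: skip blocks already stored
                self.blocks[digest] = chunk
                uploaded += 1
            manifest.append(digest)
        return manifest, uploaded

    def download(self, manifest):
        return b"".join(self.blocks[digest] for digest in manifest)
```

The manifest (the ordered list of block hashes) is what the Meta DB would store per file version. Editing one block of a file changes one hash, so only that block is uploaded.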
37. Design Typeahead (Search Autocomplete)
- DS: Trie (Prefix Tree).
- Optimization: Store Top 5 results in each Trie Node.
- Update: Offline aggregation (MapReduce) to rebuild Trie.
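A sketch of a Trie that caches the top results in every node, so a lookup only walks the prefix and never scans the subtree. The `Typeahead` class is hypothetical, `build` is meant to be called once per offline rebuild, and the query counts are assumed to come from the offline aggregation step.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []   # cached (count, query) pairs for this prefix, best first


class Typeahead:
    def __init__(self, k=5):
        self.k = k
        self.root = TrieNode()

    def build(self, query_counts):
        """One-shot rebuild from a {query: count} mapping."""
        for query, count in query_counts.items():
            node = self.root
            for ch in query:
                node = node.children.setdefault(ch, TrieNode())
                node.top.append((count, query))
        self._trim(self.root)

    def _trim(self, node):
        # Keep only the k most frequent queries per node (ties broken alphabetically).
        node.top.sort(key=lambda item: (-item[0], item[1]))
        del node.top[self.k:]
        for child in node.children.values():
            self._trim(child)

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [query for _, query in node.top]
```

Lookup cost depends only on the prefix length. The trade-off is extra memory per node and stale suggestions until the next rebuild.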
38. Design Web Crawler
- Queue: URL Frontier (Kafka).
- Dedup: Bloom Filter (Visited URLs).
- Politeness: Per-domain rate limit.
- DNS: Custom DNS cache.
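For the visited-URL check, a Bloom Filter can be sketched with a bit array and a few derived hash positions. The sizes below (`num_bits`, `num_hashes`) are illustrative defaults rather than tuned values, and deriving the positions from two halves of one SHA-256 digest (double hashing) is an assumed implementation choice.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A false positive here means the crawler occasionally skips a URL it has not actually visited, which is usually an acceptable price for not storing every URL in full.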
39. Design Notification System
- Pluggable Senders: Email, SMS, Push.
- Queue: RabbitMQ (Priority Queues).
- Rate Limit: Don’t spam users.
40. Design Leaderboard
- Naive: SQL ORDER BY. Slow.
- Fast: Redis Sorted Set (ZADD leaderboard score user). O(log N).
5. Reliability & Operations
41. Handling Hot Partitions
42. Thundering Herd Problem
43. Blue-Green vs Canary
- Blue-Green: Instant switch. 2x cost.
- Canary: Gradual rollout (1%, 10%). Safer.
44. Service Discovery Patterns
- Client-side: Client queries Registry (Eureka).
- Server-side: Client hits LB. LB queries Registry.
45. Distributed Consensus
46. Backpressure
47. Chaos Engineering
48. Caching Hazards
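- Avalanche: Cache is empty or many keys expire together, so the DB is hit by millions of requests.
- Penetration: Requests for a non-existent key always miss the cache and hit the DB. (Fix: Bloom filter).
- Stampede: A hot key expires and many concurrent requests try to rebuild it from the DB at the same time.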
49. Proxy vs Reverse Proxy
- Forward: Protects Client. (Hide IP, Filter content).
- Reverse: Protects Server. (LB, SSL term, Cache).
50. API Gateway vs Load Balancer
- LB: Distributes traffic across servers, at the transport level (L4) or the HTTP level (L7).
- Gateway: App Logic. (Auth, Rate Limit, Transformation, Routing).