Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
The System Design Interview
System design interviews test your ability to design large-scale distributed systems. Unlike coding interviews, there’s no single “correct” answer — interviewers evaluate your thought process, communication, and trade-off analysis. The uncomfortable truth: most system design interviews are decided in the first 10 minutes. If you spend those 10 minutes silently thinking or jumping straight into database schemas, you’ve already lost. The interviewers who wrote “hire” on your scorecard did so because you asked sharp clarifying questions, framed the problem space clearly, and showed you could navigate ambiguity. The actual design is secondary to the process.What interviewers are looking for:
- Can you drive the conversation and ask good questions? This is the #1 signal. An engineer who asks “what’s our consistency requirement?” before drawing boxes is more impressive than one who draws a perfect architecture without asking.
- Do you understand scalability and distributed systems? Not just buzzwords — can you explain why you chose Cassandra over PostgreSQL for this specific workload?
- Can you identify and solve bottlenecks? The difference between a mid-level and senior answer is proactively finding the bottleneck vs. waiting for the interviewer to point it out.
- Do you make reasonable trade-offs and justify them? Every decision has a cost. Saying “I’ll use Redis for caching” is weak. Saying “I’ll use Redis for caching because our read:write ratio is 100:1 and I can tolerate stale data for up to 60 seconds” is strong.
- Can you communicate complex ideas clearly? If your interviewer can’t follow your reasoning, they can’t give you credit for it. Draw as you explain. Label your diagrams. Say “let me pause and make sure we’re aligned” before going deeper.
Interview Framework (45 minutes)
Step 1: Requirements Clarification (5 min)
Never skip this step! Jumping to solutions is the #1 mistake candidates make.Questions to Ask
- Functional
- Non-Functional
- Example Dialogue
Step 2: Capacity Estimation (5 min)
Quick Formulas
Numbers You Must Know
| Metric | Value | Rounded |
|---|---|---|
| Seconds/day | 86,400 | ~100,000 |
| Seconds/month | 2.6M | ~2.5M |
| Seconds/year | 31.5M | ~30M |
| Latency | Time |
|---|---|
| Memory access | 100 ns |
| SSD read | 100 μs |
| Network (same DC) | 0.5 ms |
| Network (cross-continent) | 150 ms |
| Capacity | Per Server |
|---|---|
| Concurrent connections | 10K-100K |
| Simple API QPS | 1K-10K |
| Complex API QPS | 100-500 |
Step 3: High-Level Design (10 min)
The Standard Architecture
Start with this template and modify based on requirements:When to Add Components
| Component | Add When… |
|---|---|
| CDN | Serving static files, global users |
| Load Balancer | Multiple servers (always) |
| Cache (Redis) | Read-heavy, repeated queries |
| Message Queue | Async processing, decoupling |
| Search (Elasticsearch) | Full-text search needed |
| Blob Storage (S3) | Images, videos, files |
| Rate Limiter | Public API, preventing abuse |
API Design Template
Step 4: Deep Dive (20 min)
What to Deep Dive On
The interviewer will guide you, but be prepared to discuss:Data Model
- Table schemas
- Relationships
- Indexing strategy
- Sharding key selection
Scaling
- Horizontal vs vertical
- Caching strategy
- Database partitioning
- Read replicas
Core Algorithm
- News feed generation
- Matching algorithm
- Ranking/scoring
- Rate limiting
Reliability
- Failure handling
- Data replication
- Consistency guarantees
- Monitoring/alerting
Database Schema Template
Step 5: Bottlenecks & Trade-offs (5 min)
Common Bottlenecks
Trade-off Discussions
Always mention trade-offs. This shows senior thinking:| Decision | Trade-off | When to pick each side |
|---|---|---|
| SQL vs NoSQL | Consistency vs Scale | SQL when you need JOINs, transactions, or complex queries (user accounts, orders, payments). NoSQL when your access pattern is simple key-value or wide-column lookups with high write throughput (messages, session data, time-series). The threshold: PostgreSQL handles ~10K QPS on a single machine before you need sharding. Cassandra handles ~50K writes/sec per node out of the box. |
| Sync vs Async | Latency vs Reliability | Sync when the caller needs the result immediately (payment authorization, search). Async via message queue when the caller only needs acknowledgment (send notification, process order, index document). The rule of thumb: if the operation takes >100ms or can fail independently, make it async. |
| Cache | Speed vs Staleness | Always cache when read:write ratio exceeds 10:1. Set TTL based on tolerance for stale data: 5 seconds for stock prices, 60 seconds for social feeds, 24 hours for product catalogs. Cache invalidation is the hard part — use cache-aside with TTL for simplicity, write-through for consistency. |
| Denormalization | Read speed vs Write complexity | Denormalize when the read path is 100x+ hotter than the write path and the join would cross service boundaries or database shards. Example: store author_name in the tweet record rather than joining with the users table on every timeline read. Accept that updating a username requires an async backfill job. |
| Microservices | Flexibility vs Complexity | Start with a modular monolith. Extract to microservices only when you need independent deployment (different teams, different scaling needs, different tech stacks). The operational overhead of microservices (service discovery, distributed tracing, network failures between services) is real — do not pay it until you must. |
| Strong consistency | Correctness vs Availability | Strong consistency for money (payments, inventory). Eventual consistency for content (social feeds, search indexes, recommendations). The cost of strong consistency is ~50ms of additional latency per operation and reduced availability during network partitions. The cost of eventual inconsistency in a payment system is real money lost. |
Quick Reference: Component Cheatsheet
Databases
| Type | Examples | Use When |
|---|---|---|
| SQL | PostgreSQL, MySQL | ACID, complex queries, joins |
| Key-Value | Redis, DynamoDB | Simple lookups, caching, sessions |
| Document | MongoDB | Flexible schema, hierarchical data |
| Wide-Column | Cassandra, HBase | High write throughput, time-series |
| Graph | Neo4j | Relationships, recommendations |
| Search | Elasticsearch | Full-text search, logs |
Caching Patterns
| Pattern | How It Works | Use When |
|---|---|---|
| Cache-Aside | App checks cache, then DB | Read-heavy, cache misses OK |
| Write-Through | Write to cache + DB together | Need consistency |
| Write-Behind | Write to cache, async to DB | High write throughput |
| Read-Through | Cache loads from DB on miss | Simplify app logic |
Message Queue Patterns
| Pattern | Use Case |
|---|---|
| Point-to-Point | Task distribution, job queues |
| Pub/Sub | Event broadcasting, notifications |
| Event Sourcing | Audit log, state reconstruction |
| CQRS | Separate read/write models |
Red Flags to Avoid
Green Flags That Impress
Practice Problems by Level
Entry Level (30 min)
- URL Shortener
- Paste bin
- Rate Limiter
- Key-Value Store
Mid Level (45 min)
- Twitter Timeline
- Notification System
- Web Crawler
Senior Level (60 min)
- YouTube
- Google Search
- Uber/Lyft
- Distributed Cache
- Payment System
- Ticket Booking (BookMyShow)
Power Phrases for Interviews
These phrases signal experience and depth. Use them naturally, not as scripts:Practice Questions with Model Answers
These questions mirror what you will actually face in a 45-minute system design interview. Each includes the kind of answer that gets a “strong hire” and the red flags that get you rejected.Question 1: “How would you design a rate limiter?”
Difficulty: Entry-Level / Intermediate What the interviewer is really testing: Can you think about distributed systems primitives? Do you understand the difference between local and global state? Can you pick the right algorithm for the constraint?Strong Answer
Strong Answer
The way I think about rate limiting is: it is fundamentally a counting problem with a time dimension. The key question is where the counter lives and how you define the time window.Algorithm choices and when to use each:
- Token Bucket: Best general-purpose choice. A bucket fills at a steady rate (say 10 tokens/sec), each request consumes one token, requests are rejected when empty. It naturally allows short bursts while enforcing long-term averages. This is what AWS API Gateway and Stripe use.
- Sliding Window Log: Store the timestamp of every request, count how many fall within the last N seconds. Precise but memory-hungry — for 1M users at 100 req/min, you are storing 100M timestamps.
- Sliding Window Counter: Hybrid approach. Keep counters for the current and previous fixed windows, then weight them by overlap. Uses far less memory than the log approach while being more accurate than a fixed window.
- Fixed Window Counter: Simplest to implement, but has the boundary burst problem — a user can send 2x the limit by timing requests at the window boundary.
EVAL (Lua script) in Redis: check the bucket, decrement if tokens available, return allow/deny. Redis gives us sub-millisecond latency and atomic operations.The gotcha most people miss: In a multi-region setup, you cannot have a single global Redis for rate limiting — the 150ms cross-continent latency would kill your P99. You either accept per-region limits (a user gets N requests per region) or use a “local fast path + async sync” pattern where each region has its own counter and they periodically reconcile, accepting temporary over-limit.Example: At a company doing 50K API requests/second across 3 regions, we used per-region token buckets with a 10% buffer on the limit. So a 1000 req/min limit was actually enforced as 367 req/min per region. If the async sync detected a user was globally over-limit, we tightened the per-region limit on the next sync cycle (every 5 seconds). This meant a user could briefly exceed their limit by up to 10%, which was acceptable.- “What happens to in-flight requests when a rate limit kicks in? Do you return 429 immediately or queue them?”
- “How would you rate-limit by different dimensions simultaneously — per user, per IP, and per endpoint?”
- “Your Redis rate limiter is adding 2ms to every request. The team says that is too much. How do you reduce it?”
- “A single user is sending requests from 10,000 different IPs. Your per-IP limiter does not catch them. Now what?”
Question 2: “You need to design a notification system. Walk me through it.”
Difficulty: Intermediate What the interviewer is really testing: Can you handle multiple delivery channels, prioritization, and failure handling? Do you understand the difference between at-least-once and at-most-once delivery?Strong Answer
Strong Answer
A notification system has three core problems: routing (which channel — push, SMS, email), reliability (guaranteeing delivery), and user experience (not spamming people).High-level architecture:
- Notification Service — accepts notification requests via API or from internal event bus
- Priority Queue (Kafka or SQS) — separates urgent (2FA codes, payment alerts) from non-urgent (marketing, social)
- Channel Workers — separate workers for each delivery channel (APNS for iOS push, FCM for Android, SendGrid/SES for email, Twilio for SMS)
- User Preference Store — what channels has the user enabled? What are their quiet hours?
- Deduplication Layer — idempotency keys prevent sending the same notification twice
- At-least-once delivery for critical notifications (2FA, payment). Use a transactional outbox pattern: write the notification to a database table in the same transaction as the triggering event, then a poller picks it up and sends it. If the send fails, retry.
- At-most-once for marketing — nobody wants two copies of a promotional email. Use idempotency keys and skip retries after the first attempt.
- Rate limiting per user per channel — even if the system generates 50 notifications for you in an hour, we batch or throttle. Instagram does this with their “X and 47 others liked your photo” aggregation.
user_id % N to ensure ordering per user while spreading load across consumers.- “A user has notifications enabled on 3 devices. How do you avoid sending the same alert to all 3 when they have already read it on one?”
- “Your SMS provider charges $0.01 per message. A bug causes 10M erroneous notifications to be queued. How do you build a safety net?”
- “How would you implement notification aggregation — batching 50 likes into one notification?”
- “Walk me through exactly what happens when a notification send fails at each stage of your pipeline.”
Question 3: “Design the data model and access patterns for a social media feed.”
Difficulty: Senior What the interviewer is really testing: Do you understand the fan-out problem? Can you reason about read-heavy vs write-heavy trade-offs? Do you know when to denormalize?Strong Answer
Strong Answer
The core tension in a social feed is write amplification vs read latency. You are choosing where to pay the cost — at tweet time or at feed-render time.Data model (I will use Twitter as the concrete example):Tables:
users—user_id(PK),username,follower_count, metadatatweets—tweet_id(PK, Snowflake ID for time-ordering),user_id,content,media_urls,created_atfollows—follower_id,followee_id(composite PK, indexed both ways)timeline_cache—user_id(PK),tweet_ids(sorted set in Redis, not SQL)
- Fan-out on write (push model): When User A tweets, push the tweet ID into every follower’s timeline cache. For a user with 500 followers, that is 500 Redis ZADD operations. Pre-computed timelines make reads trivially fast — just
ZRANGEfrom Redis. - Fan-out on read (pull model): When User B opens their feed, query all followees’ recent tweets in real time. For a user following 500 people, that means 500 lookups, merge-sort, and return top 20.
- Users with fewer than ~10K followers: fan-out on write. The write cost is bounded.
- Celebrities (10K+ followers): skip the fan-out. Store their tweets separately.
- At read time: fetch the pre-computed timeline from cache, THEN fetch recent tweets from any celebrities the user follows, merge, rank, return.
tweet_id using Snowflake ID ranges). Timeline cache is Redis sorted sets with tweet_id as both member and score (since Snowflake IDs are time-ordered). Follows table is in a graph store or sharded by follower_id.- “A user unfollows someone. How do you remove that person’s tweets from the pre-computed timeline? What if the cache has already been served?”
- “Your Redis timeline cache is consuming 500GB of RAM. How do you reduce it without degrading read latency?”
- “How do you handle a tweet that goes viral — 10M likes in an hour? What parts of your system are under stress?”
- “The product team wants to switch from chronological to ranked feed. What changes in your data model and pipeline?”
Question 4: “Design a distributed cache like Memcached or Redis.”
Difficulty: Senior / Staff-Level What the interviewer is really testing: Do you understand consistent hashing, cache invalidation strategies, and the operational challenges of distributed state? Can you reason about memory management at scale?Strong Answer
Strong Answer
A distributed cache needs to solve three problems: partitioning (which node holds which key), replication (what happens when a node dies), and eviction (what to throw away when memory is full).Partitioning with consistent hashing:
- Hash each key to a point on a ring (0 to 2^32).
- Each cache node owns a range of the ring.
- To find which node holds a key: hash the key, walk clockwise to the next node.
- With virtual nodes (150-200 per physical node), you get even distribution. Without virtual nodes, adding/removing a node causes massive key redistribution.
hash(key) % N): When you add a node, modular hashing remaps ~100% of keys (N/(N+1) keys move). Consistent hashing remaps only ~1/N of keys. At 10B cached keys, the difference between redistributing 10B keys vs 1B keys is the difference between a 2-hour cache warm-up outage and a 12-minute blip.Replication strategy:- For a cache (not a database), I would use async replication to 1 replica. Cache data is reconstructable from the source of truth, so losing a replica is annoying (cache miss spike) but not catastrophic.
- The replica handles reads during failover. Promotion is automatic via a gossip protocol or a coordinator like ZooKeeper.
- LRU (Least Recently Used): Default choice. Implemented with a hash map + doubly-linked list for O(1) access and eviction.
- LFU (Least Frequently Used): Better when you have items with very different access frequencies. Redis 4.0+ supports this with an approximated LFU using a logarithmic counter.
- TTL-based: Set per-key expiration. Good for data that has a natural staleness window.
key % num_servers.” This shows no awareness of consistent hashing, rebalancing costs, or what happens when a node fails.
Follow-ups:
- “Your cache cluster has a hot key — one key getting 100K reads/second while everything else gets 100 reads/second. How do you handle it?”
- “How do you handle cache invalidation across multiple services that cache the same underlying data?”
- “You are migrating from 10 cache nodes to 15. Walk me through the migration without downtime.”
- “What is the difference between caching at the application layer vs using a read-through cache? When would you choose each?”
Question 5: “Design a web crawler that indexes 1 billion pages.”
Difficulty: Senior What the interviewer is really testing: Can you design a system with massive parallelism, politeness constraints, and deduplication? Do you understand the trade-off between crawl freshness and resource consumption?Strong Answer
Strong Answer
A web crawler at billion-page scale has five core challenges: URL frontier management, politeness (not DDoSing websites), deduplication (same content at different URLs), distributed coordination, and prioritization (crawl important pages first and more frequently).Architecture:
- URL Frontier — a priority queue of URLs to crawl, partitioned by domain. Each domain has its own sub-queue with a politeness delay (typically 1-2 seconds between requests to the same domain). I would use a combination of Kafka (for durability) and an in-memory priority heap (for scheduling).
- Fetcher Workers — stateless workers that pull URLs from the frontier, download the page, and push the content to processing. At 1B pages over 1 week, we need ~1,650 pages/sec sustained. With average fetch time of 2 seconds, we need ~3,300 concurrent fetchers.
- Content Parser — extracts text, links, and metadata. Discovered URLs go back to the frontier (after dedup check).
- Dedup Service — two levels: (a) URL dedup using a Bloom filter (1B URLs at 1% false positive needs ~1.2 GB), (b) Content dedup using SimHash or MinHash to detect near-duplicate pages (same article on different URLs).
- Storage — raw HTML in blob storage (S3), parsed content and metadata in a document store (Elasticsearch for search indexing).
robots.txt before crawling any domain (cache it for 24 hours). Respect Crawl-delay directives. Use a per-domain rate limiter. If you ignore this, your IP ranges get blocked and your crawl coverage plummets.Prioritization strategy:- PageRank or domain authority for initial priority
- Freshness signal: news sites recrawl every 15 minutes, static corporate pages recrawl monthly
- Crawl budget per domain: even if a domain has 10M pages, only crawl the top 100K by link count
unbound) and batch resolution. Google’s crawler pre-resolves DNS for the entire frontier.- “How do you detect and handle spider traps — pages that generate infinite URLs dynamically?”
- “Your Bloom filter for URL dedup has a 1% false positive rate. That means 10M pages never get crawled. Is that acceptable? How would you improve it?”
- “How do you handle JavaScript-rendered pages where the content is not in the initial HTML response?”
- “The product team wants crawl results to be fresh within 1 hour for the top 10,000 domains. How does this change your architecture?”
Question 6: “Design a payment system for an e-commerce platform.”
Difficulty: Staff-Level What the interviewer is really testing: Do you understand exactly-once semantics, idempotency, and the operational realities of handling money? Can you design for auditability and regulatory compliance?Strong Answer
Strong Answer
Payment systems are the one domain where “eventually consistent” can mean “eventually sued.” The core principle is: every state transition must be auditable, idempotent, and recoverable.Architecture:
- Payment API — accepts payment requests with a client-generated idempotency key. Before doing anything, check if this key has been seen before. If yes, return the cached result. This prevents double-charges when a client retries after a timeout.
- Payment State Machine — every payment moves through explicit states:
CREATED -> AUTHORIZED -> CAPTURED -> SETTLED(orFAILED,REFUNDED). Each transition is a separate database write with the previous state as a precondition (optimistic locking). This prevents impossible transitions likeSETTLED -> AUTHORIZED. - Payment Service Provider (PSP) Integration — abstract behind an adapter pattern so you can support Stripe, Adyen, PayPal. Each PSP call is wrapped with: timeout (5 seconds), retry with exponential backoff (max 3 attempts), and circuit breaker (if a PSP fails 50% of calls in 30 seconds, stop sending traffic).
- Ledger — double-entry bookkeeping. Every payment creates at least two ledger entries (debit customer, credit merchant). The ledger is append-only, immutable, and the sum of all entries for any account must equal the account balance at all times. This is how you pass financial audits.
- Reconciliation Job — runs daily, compares your ledger against PSP settlement reports. Flags discrepancies for human review. In practice, you will have a 0.01-0.1% discrepancy rate from timeouts, partial failures, and edge cases.
$10.00 is stored as {amount: 1000, currency: "USD"}. For currencies like Japanese Yen that have no fractional unit, amount: 1000 means 1000 JPY.Regulatory considerations: PCI-DSS compliance means you never store raw card numbers. Use tokenization from your PSP. SOX compliance means your ledger is immutable and audit-logged. GDPR means you need to handle data deletion requests while maintaining financial records (anonymize, do not delete).- “A customer is charged twice for the same order. Walk me through how you detect this, prevent it, and recover from it.”
- “Your payment service processes $10M/day. How do you handle a 30-minute outage of your primary PSP without losing revenue?”
- “Walk me through the exact sequence of database writes and API calls for a single payment. Where are the failure points?”
- “How would you design a refund system that handles partial refunds, multi-currency refunds, and refunds after the merchant has already been paid out?”
Question 7: “How would you design a real-time leaderboard for a game with 10 million players?”
Difficulty: Intermediate / Senior What the interviewer is really testing: Do you understand sorted data structures at scale? Can you reason about the trade-off between real-time accuracy and computational cost?Strong Answer
Strong Answer
A real-time leaderboard is fundamentally a sorted set problem with rank queries. The key operations are: update a player’s score, get a player’s rank, and get the top-N players.Approach 1 — Redis Sorted Sets (best for most cases):
- Redis
ZADDto update scores: O(log N) - Redis
ZREVRANKto get a player’s rank: O(log N) - Redis
ZREVRANGEto get top-N: O(log N + N) - At 10M members, a sorted set uses ~1.5 GB of memory (16 bytes per member ID + 8 bytes score + overhead). Totally fits in a single Redis instance.
- Segment the leaderboard into buckets (score ranges). Each bucket tracks its count.
- To find a player’s rank: sum the counts of all buckets with higher scores, then find position within their bucket.
- This is essentially a B-tree or segment tree approach. More complex but horizontally scalable.
- If the game has score updates every second for 10M players, that is 10M writes/sec — too much even for Redis.
- Solution: batch updates. Collect score changes in a buffer, flush to the leaderboard every 5 seconds. Players see their rank with a 5-second delay, which is acceptable for most games.
- For the top-100 leaderboard shown on the homepage, cache it and refresh every 10 seconds.
ORDER BY score DESC LIMIT 100.” This works at small scale but becomes a full table scan at 10M rows and cannot return a specific player’s rank without counting all rows above them.
Follow-ups:
- “The product team wants leaderboards segmented by country, friend group, and globally. How does your design change?”
- “Two players have the same score. How do you break ties? What if the tiebreaker is ‘who reached the score first’?”
- “How would you implement a ‘players near me’ feature that shows the 5 players above and below your rank?”
- “Your game has 500M players. Redis sorted sets no longer fit in memory. What is your migration strategy?”
Question 8: “Design a system that detects and prevents fraud in real-time for a fintech app.”
Difficulty: Staff-Level What the interviewer is really testing: Can you design a system with sub-100ms latency requirements that combines rule-based and ML-based detection? Do you understand the false-positive vs false-negative trade-off when real money is at stake?Strong Answer
Strong Answer
Fraud detection has a fundamental tension: every millisecond of latency costs user experience, but every missed fraud event costs real money. The system must make a decision in under 100ms at the point of transaction while incorporating signals that span weeks of history.Two-layer architecture:Layer 1 — Real-time Rules Engine (synchronous, in the transaction path):
- Hard rules that block immediately: transaction from a country the user has never been in + amount > $500, velocity check (5+ transactions in 60 seconds), known-bad device fingerprint.
- Implemented as a stream processor (Flink or custom) with in-memory state. Rules are configured in a DSL, not code, so the fraud team can update them without deployments.
- Latency budget: 20ms. These rules catch the obvious fraud (stolen card used from a new location).
- Feature vector includes: transaction amount, merchant category, time of day, device fingerprint, user’s historical spending pattern, geographic velocity (how fast did they “travel” between transactions), network graph features (is this merchant connected to known fraud rings?).
- Model inference via a pre-trained model served from a feature store. Features are pre-computed and cached — you do not compute a user’s 90-day spending pattern at transaction time, you maintain it incrementally.
- Score threshold: transactions scoring > 0.8 are blocked, 0.5-0.8 are flagged for review, < 0.5 are approved.
- Latency budget: 50ms (including feature store lookup + model inference).
- When a flagged transaction is reviewed by a human analyst, that label (fraud/not-fraud) must feed back into the training pipeline.
- Without this, your model drifts as fraud patterns evolve. Fraudsters adapt within weeks.
- Cold start problem: new users have no history. Use device fingerprinting, email age, phone verification signals as substitutes.
- Kafka ingests transaction events
- Flink applies real-time rules and enriches with features from Redis (feature store cache)
- ML model served via TensorFlow Serving or ONNX Runtime on GPU instances
- Decision is written back to the transaction service within the 100ms SLA
- All decisions are logged to an immutable audit trail for regulatory compliance
- “A new type of fraud emerges that your ML model has never seen. How does your system detect it? How quickly can you respond?”
- “Your rules engine is blocking 2% of legitimate transactions. The business says that is unacceptable. How do you tune it without letting more fraud through?”
- “Walk me through the exact data pipeline from a transaction hitting your system to the fraud decision being made. Where are the latency bottlenecks?”
- “How do you handle fraud detection for a user’s very first transaction when you have zero historical data?”