Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
System Design Practice Problems
These six problems are structured using the Answer Framework: Clarify, Estimate, Design, Deep Dive, and Trade-Offs. Each solution demonstrates the thinking process that separates senior engineers from mid-level candidates.How Interviewers Score System Design
Before diving into the problems, you should understand exactly what interviewers are evaluating. Most companies — from startups to FAANG — use a rubric with four to five dimensions. Knowing the rubric does not mean gaming the interview; it means spending your time on the things that actually earn points.The Five Scoring Dimensions
The Five Scoring Dimensions
1. Requirements Gathering and Problem Scoping (15-20% of evaluation)
What interviewers look for: Do you ask clarifying questions before designing? Do you distinguish functional from non-functional requirements? Do you scope the problem to something achievable in 45 minutes instead of trying to design everything?Strong signal: The candidate asks 5-8 targeted questions, writes down explicit requirements, and gets confirmation before drawing anything. They proactively state assumptions: “I am going to assume we need 99.9% availability and eventual consistency is acceptable for analytics. Does that align with what you had in mind?”Weak signal: The candidate either asks zero questions (jumps to a solution) or asks 15+ unfocused questions (stalling). They design for vague requirements and get surprised by follow-ups.Scoring:- No hire: Starts designing without understanding the problem. Makes incorrect assumptions that fundamentally change the design.
- Lean no hire: Asks a few surface-level questions but misses critical requirements (scale, consistency, availability).
- Lean hire: Asks good questions, identifies the key requirements, and scopes the problem appropriately.
- Strong hire: Asks questions that reveal non-obvious constraints. Proactively identifies the hardest part of the problem before being prompted.
2. High-Level Design (25-30% of evaluation)
What interviewers look for: Can you sketch a coherent architecture with the right major components? Does data flow logically between them? Are the API contracts reasonable? Is the design appropriate for the scale you estimated?Strong signal: The candidate draws a clear diagram with labeled components and data flow arrows within 5-10 minutes. They explain the responsibility of each component and why it exists. The API design is clean and follows REST/gRPC conventions.Weak signal: The candidate draws a tangled diagram with unlabeled boxes, cannot explain how data flows from client to storage, or proposes an architecture that does not match the requirements they just gathered.Scoring:- No hire: Cannot produce a coherent end-to-end design. Major components are missing or misconnected.
- Lean no hire: Design works but is overly simplistic or missing critical components (no cache for a read-heavy system, no queue for async processing).
- Lean hire: Solid design with appropriate components. Data flow is clear. Handles the stated requirements.
- Strong hire: Design is elegant and scalable. Components are well-chosen for the access patterns. The candidate identifies the 2-3 critical paths and explains why they drive the architecture.
3. Deep Dive and Technical Depth (25-30% of evaluation)
What interviewers look for: Can you go deep on the most critical component? Do you understand algorithms, data structures, and protocols at a level that would let you actually build this? Can you walk through a request lifecycle step by step?Strong signal: The candidate picks the most important component and explains it in detail — data model, algorithm choice with trade-offs, failure handling, and performance characteristics. They use precise technical vocabulary (not buzzwords) and can answer follow-up questions with confidence.Weak signal: The candidate stays at the surface level on everything. They name technologies (Redis, Kafka, Cassandra) without explaining why or how they would configure them. They cannot walk through a concrete request lifecycle.Scoring:- No hire: Cannot go deep on any component. Lacks understanding of basic data structures, algorithms, or protocols.
- Lean no hire: Can go somewhat deep but makes factual errors or proposes approaches that would not work in practice.
- Lean hire: Demonstrates solid depth on 1-2 components. Makes correct technical choices and can justify them.
- Strong hire: Demonstrates expert-level depth. Brings up non-obvious considerations (race conditions, hot partitions, clock skew). Could convincingly lead implementation of this system.
4. Trade-Off Discussion and Decision Quality (15-20% of evaluation)
What interviewers look for: Does the candidate present alternatives before committing to a choice? Do they articulate what they are trading away with each decision? Are their trade-offs grounded in the specific requirements, not generic platitudes?Strong signal: “I am choosing a sliding window counter over a token bucket here because our requirements emphasize precision over burst tolerance. If the product team later says they want to allow controlled bursts, we would switch to token bucket — the Redis data model change is minimal.”Weak signal: “I chose Redis because it is fast.” (No alternative considered, no trade-off articulated, no connection to requirements.)Scoring:- No hire: Makes choices without considering alternatives. Cannot articulate downsides of their own design.
- Lean no hire: Mentions alternatives but the comparison is shallow or incorrect.
- Lean hire: Presents clear trade-offs for major decisions. Connects choices to requirements.
- Strong hire: Trade-offs are nuanced and specific. The candidate identifies second-order consequences and explains when they would revisit decisions. Demonstrates the judgment you would want from a tech lead making real architecture choices.
5. Communication and Collaboration (10-15% of evaluation)
What interviewers look for: Is the candidate easy to follow? Do they structure their explanation logically? Do they check in with the interviewer? Do they respond well to hints and pushback?Strong signal: The candidate signals transitions (“Now let me move to the deep dive on message ordering”), checks in periodically (“Should I go deeper here or move to the caching layer?”), and responds to interviewer pushback by genuinely reconsidering rather than defending their position rigidly.Weak signal: The candidate monologues for 10 minutes without checking in, gets defensive when challenged, or is disorganized in their presentation (jumping between components randomly).Scoring:- No hire: Difficult to follow. Defensive or dismissive of feedback. Cannot adapt when the interviewer redirects.
- Lean no hire: Somewhat organized but misses cues from the interviewer. Does not adjust depth based on interviewer interest.
- Lean hire: Clear, structured communication. Responds well to hints and redirections.
- Strong hire: Exceptional communicator. Treats the interview as a collaborative design session. Makes the interviewer feel like they are working with a great colleague.
How to Use the Rubric During Practice
How to Use the Rubric During Practice
| Dimension | Score (1-4) | Notes |
|---|---|---|
| Requirements gathering | Did I ask the right questions? | |
| High-level design | Is my architecture coherent and complete? | |
| Technical depth | Could I actually build the component I deep-dived on? | |
| Trade-off discussion | Did I present alternatives and justify my choices? | |
| Communication | Was I structured, clear, and collaborative? |
Cross-Chapter Connection: Interview Meta-Skills
Cross-Chapter Connection: Interview Meta-Skills
Senior vs Staff: What Separates the Levels in System Design
Senior vs Staff: What Separates the Levels in System Design
AI-Assisted Engineering Lens: How LLMs Change System Design Preparation
AI-Assisted Engineering Lens: How LLMs Change System Design Preparation
- Using LLMs for estimation sanity checks: Tools like ChatGPT and Claude can validate your back-of-envelope calculations in real-time. Paste your estimation and ask “Does this math check out for a system at Twitter’s scale?” The LLM will flag order-of-magnitude errors. This does not replace the skill of estimation — interviewers still want to see you do it — but it accelerates your practice iterations.
- Architecture diagram generation: Copilot and similar tools can generate PlantUML, Mermaid, or ASCII architecture diagrams from natural language descriptions. In practice, this means you can iterate on high-level designs faster during preparation. In the interview itself, you still draw by hand.
- Exploring trade-offs systematically: Ask an LLM “Give me 5 reasons to choose Cassandra over DynamoDB for a chat message store, and 5 reasons to choose the other way.” This forces you to consider angles you might have missed. Then critically evaluate the LLM’s output — it sometimes hallucinates capabilities or misses pricing details.
- The trap to avoid: Using LLMs to memorize answers rather than understand frameworks. An interviewer who asks “Why Cassandra?” and gets a memorized list of bullet points will immediately follow with “What is Cassandra’s compaction strategy and how does it affect write amplification?” If you cannot go deeper than the LLM-generated summary, the interviewer knows.
- In production system design: AI-assisted code generation (Copilot, Cursor) accelerates implementation but does not change architectural decisions. The systems in this chapter still need the same components — AI just helps you write the Lua scripts, the Kafka consumers, and the API endpoints faster. The design thinking remains human.
Work-Sample Warm-Up: Calibrate Your Thinking Before Diving In
Work-Sample Warm-Up: Calibrate Your Thinking Before Diving In
- The Estimation Drill: A photo-sharing app has 200M daily active users. Each user uploads 2 photos per day (average 3MB each). How much new storage per day? Per year? What is the upload bandwidth requirement at peak (5x average)? Can a single server handle the upload throughput?
- The Trade-Off Drill: You are choosing between PostgreSQL and DynamoDB for a user profile service. The access pattern is 95% reads by user ID, 5% writes. You expect 10M users and 50K reads/sec at peak. Argue for PostgreSQL in 3 sentences. Now argue for DynamoDB in 3 sentences. Which would you actually choose, and what is the deciding factor?
- The Failure Mode Drill: You have a simple web app: Load Balancer, 3 API Servers, and 1 PostgreSQL database. Name every distinct failure scenario (there are at least 6) and the user impact of each. Which failure is the most dangerous and why?
Problem 1: Design a URL Shortener
Difficulty: Easy | Time Target: 35 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Core Features | Do we need custom aliases (e.g., short.ly/my-brand)? | Changes the uniqueness and validation logic |
| Expiration | Should URLs expire? User-configurable TTL? | Affects storage strategy and cleanup jobs |
| Analytics | Do we need click tracking (geo, referrer, device)? | Adds an entire analytics pipeline |
| Scale | How many URLs are created per day? Read vs write ratio? | Drives every downstream architecture decision |
| Availability | Is it acceptable for a redirect to fail occasionally? | Determines replication and consistency model |
Requirements You Should Lock Down
Functional:- Given a long URL, generate a unique short URL
- Given a short URL, redirect (HTTP 301/302) to the original
- Optional: custom aliases, expiration, click analytics
- High availability (redirects must not fail)
- Low latency (redirect < 100ms)
- Short URLs should not be guessable (no sequential IDs)
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Hash Generation
Phase 4: Deep Dive -- Hash Generation
The Core Problem: How to Generate a Unique 7-Character Key?
This is the most important design decision. Three approaches:Approach A: MD5/SHA256 Truncation
Approach B: Base62 Counter (Pre-Generated)
Approach C: Key Generation Service (KGS) -- Recommended
Database Choice
| Criteria | SQL (PostgreSQL) | NoSQL (DynamoDB/Cassandra) |
|---|---|---|
| Schema | Fixed, well-defined | Flexible but unnecessary here |
| Read pattern | Key lookup (fast with index) | Key lookup (native strength) |
| Write pattern | Single row insert | Single row insert |
| Scale | Sharding is manual | Auto-sharding built-in |
| Consistency | Strong by default | Eventual (configurable) |
| Verdict | Good for < 1B records | Better for 100B+ records |
Caching Strategy
- Cache-aside with Redis: Check cache first, on miss read from DB and populate cache
- TTL: Set cache TTL to match URL expiration (or 24 hours for non-expiring URLs)
- Eviction: LRU — frequently accessed URLs stay hot, long-tail URLs get evicted
- Cache size: ~12 GB for 20% of daily URLs (fits a single Redis instance)
Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Hash method | KGS (pre-generated keys) | MD5 truncation | Zero runtime collisions, no coordination overhead |
| Database | NoSQL (DynamoDB) | PostgreSQL | Simple key-value at extreme scale |
| Redirect code | 302 (temporary) | 301 (permanent) | Preserves analytics; 301 causes browser caching |
| Cache | Redis cache-aside | Write-through | Read-heavy system benefits from cache-aside simplicity |
| Short URL length | 7 characters | 6 characters | 3.5T keyspace vs 56B; future-proofing is cheap |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You asked about 301 vs 302
You calculated before designing
You compared three hash approaches with trade-offs
You addressed the analytics pipeline separately
How the Real System Works
How the Real System Works
Bitly’s Architecture
Bitly processes billions of shortens and redirects. Their architecture reflects a decade of scaling lessons:- Key generation: Bitly uses a counter-based approach with Base62 encoding, allocating counter ranges to application servers to avoid coordination overhead. Each server gets a block of IDs and increments locally, only requesting a new block when the current one is exhausted.
- Storage: They migrated from MySQL to a distributed data store. Their link mapping data is sharded by the hash of the short URL key, ensuring even distribution across nodes.
- Caching: Bitly uses a multi-layer caching strategy. Popular links (the top 1-2% by click volume) are cached in memory on the application servers themselves. A shared Redis layer handles the next tier. The database is the fallback of last resort. Given the extreme read-to-write ratio, cache hit rates exceed 99%.
- Analytics pipeline: Click data flows through Kafka into a real-time processing pipeline. They decouple the redirect path (which must be sub-100ms) from analytics aggregation entirely. Click events are enriched asynchronously with geo, device, and referrer data.
- Global presence: Bitly uses Anycast DNS and edge servers in multiple regions to ensure redirects are fast worldwide. A redirect request in Tokyo does not need to round-trip to a US data center.
TinyURL’s Original Design
TinyURL, one of the earliest URL shorteners (launched around 2002), took a simpler approach that is instructive as a baseline:- Single MySQL database: All short-to-long URL mappings lived in a single MySQL instance with a B-tree index on the short key. For years, this was sufficient because the total dataset was small enough to fit in RAM.
- Sequential IDs: TinyURL used auto-incrementing MySQL IDs converted to a short alphanumeric string. This is the simplest possible approach and works until you need non-guessable URLs or multi-datacenter writes.
- No analytics: The original TinyURL had no click tracking at all. This dramatically simplified the architecture — a redirect was a single database lookup and an HTTP 301 response.
- Lesson: TinyURL proves that the simplest possible design can serve millions of users if the access pattern is simple enough (key-value lookup). The complexity in Bitly’s architecture exists because of analytics, custom domains, enterprise features, and global scale — not because URL shortening itself is hard.
What Happens When Things Fail?
What Happens When Things Fail?
- The system does not go down with it — it degrades gracefully. All reads fall through to the database read replicas. Latency increases from ~5ms (cache hit) to ~20-50ms (DB read). Users notice nothing.
- Risk: if cache is cold and all traffic suddenly hits the DB, you get a thundering herd. Mitigation: implement request coalescing (if 1,000 requests for the same key arrive simultaneously, only one hits the DB and the rest wait for its result) and bring the cache back incrementally using a warm-up script that pre-loads the top 20% of URLs by access frequency.
- Reads continue from replicas (redirects keep working). Writes (new URL creation) fail. This is acceptable for minutes — redirects are 100x more important than creates.
- Promote a read replica to primary (automated failover via PostgreSQL Patroni or DynamoDB’s built-in multi-AZ). Typical recovery: 15-30 seconds with automated failover.
- Application servers have locally cached batches of pre-generated keys (1,000+ keys each). They continue creating URLs from their local batch. You have minutes to hours before any server exhausts its cache, depending on batch size.
- If KGS stays down and local caches deplete, fall back to MD5 hash generation as a degraded-mode strategy.
- The bloom filter catches it before the DB write. Retry with a different key. If the bloom filter itself has a false positive, the DB’s unique constraint catches it. Two layers of defense, not one.
- Anycast DNS routes traffic to the nearest healthy data center. Cross-region replication ensures the URL database is available in multiple regions. Users in the affected region experience a brief DNS TTL delay (seconds) before traffic re-routes.
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| Load Balancer | Application Load Balancer (ALB) | Layer 7 routing, path-based routing for /api/* vs /{shortKey}, built-in health checks |
| API Servers | ECS Fargate or EKS | Fargate for simplicity (no server management), EKS if you need Kubernetes ecosystem. Auto-scaling based on request count |
| Database | DynamoDB | Purpose-built for key-value lookups at massive scale. Single-digit millisecond latency. On-demand capacity mode handles traffic spikes without pre-provisioning. Partition key = short_url, sort key not needed |
| Cache | ElastiCache for Redis | Managed Redis with Multi-AZ failover. Use cluster mode for >12 GB cache or >100K ops/sec. Set up read replicas for read-heavy traffic |
| Key Generation Service | DynamoDB + Lambda | Store pre-generated keys in a DynamoDB table. Lambda function pre-generates batches during off-peak hours. App servers fetch batches via DynamoDB BatchGetItem |
| Analytics Pipeline | Kinesis Data Streams to Kinesis Data Firehose to S3 to Athena | Kinesis replaces Kafka on AWS. Firehose auto-batches events to S3 in Parquet format. Athena provides serverless SQL queries over click data |
| CDN / Global Routing | CloudFront + Route 53 | CloudFront edge locations for redirect caching (if using 301). Route 53 latency-based routing for multi-region deployments |
| Monitoring | CloudWatch + X-Ray | CloudWatch for metrics and alarms (cache hit rate, redirect latency p99). X-Ray for distributed tracing across ALB, Lambda, DynamoDB |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- DynamoDB and ElastiCache are gone. You are on a single EC2 instance (or a small VPS). The KGS becomes an in-process counter with local file persistence. Redis becomes an in-memory LRU cache within the application process. The database becomes SQLite or a single PostgreSQL instance. This forces you to think about what the essential architecture is versus what is scaling convenience. The answer: a single Go or Rust binary with an embedded cache and SQLite can handle 10K redirects/sec on a $20/month VPS.
- Now you need a
user_idforeign key on every URL record and every click event. The analytics pipeline must support targeted deletion (not just append-only logs). Kafka retention policies are no longer sufficient — you need to be able to purge specific events from S3/Athena. This fundamentally changes the analytics storage from an immutable log to a mutable store, which means you might prefer a database-backed analytics pipeline over Kafka-to-S3.
- The entire CDN, Anycast DNS, and global replication discussion becomes irrelevant. A single PostgreSQL table behind an Nginx reverse proxy is the correct architecture. If you propose DynamoDB and CloudFront for 500 internal users, you have failed the judgment test. The interviewer is testing whether you can right-size, not whether you know the big architecture.
- Now you need authentication on the create endpoint, user attribution on click events, and an immutable audit log (append-only table or a service like AWS QLDB). The click event is no longer just analytics — it is a compliance record. This changes the priority: analytics can be eventually consistent, but the audit log must be durable and strongly consistent.
Cross-Chapter Connections
Cross-Chapter Connections
- Databases and storage — The NoSQL vs SQL decision, sharding strategies, and partition key design are covered in depth in APIs and Databases. Understanding when key-value access patterns favor NoSQL is critical here. For a deeper dive into DynamoDB’s partition key mechanics, single-table design, and when to choose DynamoDB over Cassandra, see Database Deep Dives.
- Caching — The Redis cache-aside pattern, eviction policies (LRU), and cache warming strategies are explored in Caching and Observability.
- Performance and scalability — Read replicas, horizontal scaling, and CDN usage for global latency reduction are covered in Performance and Scalability.
- Networking — HTTP redirect semantics (301 vs 302), DNS resolution, and Anycast routing are foundational networking concepts covered in Networking and Deployment.
- Reliability — Fail-safe patterns, graceful degradation, and the analytics buffering pattern connect to Reliability Principles.
- Cloud service patterns — The mapping of URL shortener components to AWS services (DynamoDB, ElastiCache, CloudFront, Kinesis) and cost optimization strategies are explored in Cloud Service Patterns. Understanding when to use DynamoDB on-demand vs provisioned capacity is a practical skill covered there.
Problem 2: Design a Rate Limiter
Difficulty: Medium | Time Target: 40 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Scope | Client-side or server-side rate limiting? | Client-side is unreliable; focus on server-side |
| Granularity | Per-user, per-API endpoint, per-IP, or global? | Determines key structure in the rate limit store |
| Response | Should we return rate limit headers (X-RateLimit-*)? | Industry standard; shows HTTP expertise |
| Throttling | Hard limit (reject) or soft limit (queue/slow down)? | Changes architecture from reject to backpressure |
| Distributed | Single server or distributed across multiple servers? | Single-server is trivial; distributed is the real problem |
| Rules | Who configures the rate limit rules? Dynamic or static? | Affects whether you need a rules engine or config file |
Requirements You Should Lock Down
Functional:- Rate limit requests based on configurable rules (user ID, IP, API key)
- Support multiple rate limit tiers (free: 10 req/min, pro: 1000 req/min)
- Return proper HTTP 429 with
Retry-Afterheader when limit is exceeded - Rate limit rules can be updated without redeployment
- Very low latency (added overhead < 1ms per request)
- Must work across distributed servers (shared state)
- Highly available — if the rate limiter fails, requests should pass through (fail-open)
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 3: High-Level Design
Phase 3: High-Level Design
Phase 4: Deep Dive -- Algorithms
Phase 4: Deep Dive -- Algorithms
Comparing Four Rate Limiting Algorithms
Algorithm 1: Fixed Window Counter
Algorithm 2: Sliding Window Log
Algorithm 3: Sliding Window Counter (Recommended)
Algorithm 4: Token Bucket
Distributed Rate Limiting: The Hard Part
The real challenge is making this work across multiple servers.Problem: If you have 10 API servers, each checking Redis independently, race conditions cause over-counting or under-counting.Solution: Lua Script in Redis (Atomic Operations)Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Algorithm | Sliding window counter | Token bucket | Better precision with minimal memory; token bucket is also excellent for APIs needing burst tolerance |
| Placement | API Gateway | Application middleware | Centralized enforcement, language-agnostic, single place to configure rules |
| State store | Redis Cluster | Local in-memory | Must work across all servers; local state gives inconsistent enforcement |
| Failure mode | Fail-open | Fail-closed | Availability > abuse protection; combine with secondary detection for abuse |
| Atomicity | Redis Lua scripts | Distributed locks | Lua scripts are atomic in Redis, simpler than external locks, sub-millisecond |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You compared four algorithms with concrete trade-offs
You addressed the distributed race condition explicitly
You discussed fail-open vs fail-closed
You mentioned clock skew
How the Real System Works
How the Real System Works
Cloudflare’s Rate Limiting at Scale
Cloudflare handles tens of millions of HTTP requests per second across 300+ data centers. Their rate limiting operates at a scale that most engineers never encounter:- Edge-first enforcement: Rate limiting happens at Cloudflare’s edge servers, not at a centralized service. Each data center runs its own rate limiting logic. This is critical because sending every request to a central rate limiter would add unacceptable latency and create a single point of failure.
- Sliding window counters: Cloudflare uses a sliding window approach similar to Algorithm 3 above. They chose this because it provides good accuracy with minimal memory per client, which matters when you are tracking millions of unique IPs and API keys simultaneously.
- Approximate counting across data centers: Since rate limiting is distributed, a user hitting multiple Cloudflare data centers (due to Anycast routing) could theoretically exceed their global limit. Cloudflare handles this with periodic synchronization between edge nodes — they trade perfect accuracy for speed. The result is rate limits that are “approximately correct” within a small margin, which is acceptable for virtually all use cases.
- Configurable actions: When a rate limit is exceeded, Cloudflare supports multiple responses: block (HTTP 429), challenge (show a CAPTCHA), JS challenge (verify browser), or simulate (log only, for testing rules before enforcement). This flexibility is a great interview detail to mention when discussing rate limiter behavior.
- IP reputation scoring: Beyond simple rate counting, Cloudflare layers in threat intelligence. An IP with a known bad reputation gets a lower effective rate limit. This hybrid approach (rate limiting + reputation) is more effective than either technique alone.
Stripe’s Rate Limiting API
Stripe’s approach to rate limiting is worth studying because they operate from the API provider’s perspective — they must protect their own infrastructure while giving developers a good experience:- Token bucket algorithm: Stripe uses a token bucket implementation for their API rate limits. Each API key gets a bucket with a defined capacity (e.g., 100 requests per second for live mode). Tokens refill at a steady rate. This allows short bursts (useful when a merchant processes a batch of charges) while enforcing a sustained rate limit.
- Tiered limits by endpoint: Not all API endpoints have the same rate limit. Creating a payment intent might be limited to 100/sec, while listing events could allow 1000/sec. Stripe separates read-heavy endpoints from write-heavy ones because the cost to Stripe’s backend is different.
- Graceful headers: Stripe returns
RateLimit-Limit,RateLimit-Remaining, andRateLimit-Resetheaders on every response (not just 429s). This allows well-behaved clients to self-throttle before hitting the limit. They also return aRetry-Afterheader on 429 responses with a specific number of seconds to wait. - Idempotency keys for retries: Stripe’s rate limiting is designed to work hand-in-hand with their idempotency key system. When a client gets rate-limited and retries, the idempotency key ensures the retry does not create a duplicate charge. This is a subtle but important integration point between rate limiting and application semantics.
- Load shedding at scale: When Stripe’s systems are under extreme load, they implement progressive load shedding — first throttling lower-priority traffic (webhooks, list operations) before restricting payment-critical paths. This priority-based approach is more sophisticated than a flat rate limit and is worth mentioning in interviews when discussing production-grade systems.
What Happens When Things Fail?
What Happens When Things Fail?
- This is why you chose fail-open. All requests pass through without rate limiting. Your backend services are temporarily unprotected. Mitigation: have a secondary, local in-memory rate limiter on each API server as a fallback. It will not be globally consistent, but it provides per-server protection. A server-local token bucket with 10x the normal per-user limit acts as a safety net.
- Alert immediately. Redis recovery is your top priority.
- If using Redis Cluster, keys on the failed shard become inaccessible. Some users are rate-limited correctly; others are not. This is worse than a total failure because the inconsistency is silent.
- Mitigation: monitor rate limiter health with a canary key that you check every second. If the canary read fails, switch the entire rate limiter to local-only mode until Redis is healthy.
- The rate limiter cannot check state. With fail-open, requests pass through. With fail-closed, all requests are rejected (an outage).
- This is the core argument for fail-open: a network partition should not become a denial-of-service against your own users.
- Fixed window counters may reset at different times on different servers, allowing a user to briefly exceed their limit by hitting different servers at the window boundary.
- Mitigation: use Redis server time (
TIMEcommand) for all timestamp comparisons, not the application server’s clock. The sliding window counter algorithm is also more tolerant of small clock differences because it interpolates between windows.
- Redis processes commands sequentially (single-threaded for commands). At extreme rates from a single user, the Lua script executes fast (~0.1ms) but the network round-trip dominates. This is not a real concern in practice — Redis handles 100K+ ops/sec, which means a single user would need to send 100K+ req/sec to cause contention.
- If this is a DDoS scenario, you need upstream protection (CDN-level blocking, IP reputation) before the request ever reaches your rate limiter.
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| API Gateway / Entry Point | Amazon API Gateway | Built-in rate limiting with usage plans and API keys. Supports throttling at per-method, per-client granularity. For custom algorithms beyond what API Gateway provides, place your limiter behind an ALB |
| Rate Limit State Store | ElastiCache for Redis (cluster mode) | Sub-millisecond latency for counter operations. Cluster mode shards data across nodes for >100K ops/sec. Multi-AZ for automatic failover |
| Rules Configuration | AWS AppConfig (part of Systems Manager) | Dynamic configuration updates without redeployment. Supports gradual rollout of new rate limit rules. Built-in validation to prevent deploying invalid configs |
| Monitoring and Alerting | CloudWatch Metrics + CloudWatch Alarms | Custom metrics for rate limit hits, 429 response rates, and Redis latency. Alarms trigger when 429 rate exceeds baseline (possible attack or misconfigured limits) |
| DDoS Protection (upstream) | AWS Shield + AWS WAF | Shield provides automatic DDoS protection at the network layer. WAF provides rate-based rules at the application layer — a first line of defense before your custom rate limiter |
| Logging and Analytics | Kinesis Data Firehose to S3 + Athena | Stream rate limit events (who was throttled, when, which endpoint) to S3 for analysis. Athena queries help identify abuse patterns and tune limits |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- Static rules files are no longer sufficient. You need a rules service that is queried per request or cached with aggressive invalidation. The rate limit check becomes: look up
tenant_idtotier, look uptiertolimits, apply the algorithm. A tier change mid-window means you either apply the new limit immediately (possibly rate-limiting a user who just upgraded) or finish the current window at the old limit. The product decision here has architectural implications — discuss both options.
- This eliminates synchronous Redis calls on every request. You need a local in-memory rate limiter on each server, periodically synced with a global state. Each server maintains its own sliding window counter and reports to a central aggregator every 1-5 seconds. You accept ~5% accuracy loss for zero added latency on allowed requests. Only rejected requests see the overhead (the 429 response). This is the Cloudflare edge model.
- The rate limit key now includes geography:
user:123:US:/api/search. You need to resolve the user’s geography at the edge (GeoIP lookup from CloudFront or Cloudflare) and pass it to the limiter. The rules engine becomes a matrix:(user_tier, region, endpoint) -> limit. This is a real-world pattern — many API providers have geography-specific rate limits for regulatory reasons.
- You cannot use Redis for every rate limit check at this price. Options: move to a local in-memory limiter with periodic Redis sync (reduces Redis ops by 90%), use DynamoDB on-demand for rate limit state (cheaper for bursty workloads), or use AWS API Gateway’s built-in rate limiting and eliminate the custom limiter entirely for most endpoints. The cost constraint forces you to evaluate the buy-vs-build decision with real numbers.
Cross-Chapter Connections
Cross-Chapter Connections
- APIs and HTTP — Rate limit headers (X-RateLimit-*, Retry-After), HTTP 429 semantics, and API versioning are covered in APIs and Databases.
- Networking — Understanding TCP connection management, load balancing algorithms, and how API gateways work is foundational. See Networking and Deployment.
- Caching — Redis as both a cache and a data structure server, TTL management, and eviction policies are covered in Caching and Observability.
- Concurrency and distributed state — Race conditions, atomic operations (Lua scripts), distributed locking, and clock synchronization are covered in Messaging, Concurrency, and State. The clock skew discussion in this problem connects directly to the logical clocks and vector clocks covered in Distributed Systems Theory — understanding why wall-clock time is unreliable in distributed systems is essential for designing correct rate limiting across multiple servers.
- Reliability — Fail-open vs fail-closed, circuit breakers, and graceful degradation patterns connect to Reliability Principles.
- Security — Rate limiting as a defense against DDoS and brute-force attacks connects to Auth and Security.
- Cloud service patterns — AWS API Gateway’s built-in throttling, WAF rate-based rules, and the decision framework for managed vs custom rate limiting are explored in Cloud Service Patterns.
Problem 3: Design a Chat System
Difficulty: Medium | Time Target: 45 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Type | 1-to-1 only, or group chat too? Max group size? | Group chat introduces fan-out complexity |
| Features | Online/offline status? Read receipts? Typing indicators? | Each adds real-time event complexity |
| Media | Text only, or images/files/voice? | Media requires object storage + CDN pipeline |
| History | How far back should chat history go? | Determines storage volume and partitioning |
| Delivery | What delivery guarantees? At-least-once? Exactly-once? | Drives dedup and acknowledgment protocol |
| Scale | How many concurrent users? Messages per day? | Determines WebSocket server capacity |
Requirements You Should Lock Down
Functional:- 1-to-1 messaging with delivery confirmation
- Group chat (up to 500 members)
- Online/offline presence indicators
- Message history (persistent, searchable)
- Push notifications for offline users
- Real-time delivery (< 200ms for online users)
- Message ordering guarantees (per-conversation)
- 99.99% message delivery (no lost messages)
- Support 50M concurrent connections
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Three Critical Components
Phase 4: Deep Dive -- Three Critical Components
Deep Dive 1: WebSocket vs Long-Polling
WebSocket (Recommended)
Long-Polling (Fallback)
Deep Dive 2: Message Storage and Partitioning
Database choice: Cassandra or ScyllaDBWhy: Write-heavy workload (2B messages/day), time-series-like access pattern (fetch recent messages for a conversation), horizontal scaling built-in.Deep Dive 3: Online Presence (Heartbeat Mechanism)
Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Transport | WebSocket | Long-polling | Real-time bidirectional; fall back to long-polling when WS unavailable |
| Message DB | Cassandra | PostgreSQL | Write-heavy, time-series access, horizontal scale |
| Inter-server messaging | Redis Pub/Sub | Direct RPC | Decouples chat servers; Pub/Sub handles routing naturally |
| Presence | Lazy evaluation + heartbeat | Active fan-out | Fan-out to all friends is prohibitively expensive at scale |
| Message ordering | Per-conversation ordering | Global ordering | Global ordering is unnecessary and impossible to scale; conversations are independent |
| Group message delivery | Write message once, fan-out reads | Write to each member’s inbox | Saves storage; read fan-out is cheaper for groups < 500 |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You identified the fan-out problem for groups
You designed lazy presence instead of active push
You addressed message ordering as per-conversation
You planned for the offline path
How the Real System Works
How the Real System Works
WhatsApp: 100B+ Messages/Day with ~50 Engineers
WhatsApp is the gold standard for doing more with less. At the time of the Facebook acquisition (2014), WhatsApp handled 50+ billion messages per day with roughly 50 engineers. Today that number exceeds 100 billion. Their architecture is a masterclass in simplicity:- Erlang/OTP: WhatsApp’s backend is built on Erlang, a language designed for telecom systems that need massive concurrency and fault tolerance. A single Erlang server can handle 2-3 million concurrent connections because Erlang processes are lightweight (a few hundred bytes each, compared to threads that cost megabytes). This is why 50 engineers could operate what would require hundreds of engineers in a Java or Python stack.
- Message storage philosophy: WhatsApp treats the server as a delivery mechanism, not a storage system. Messages are stored on the server only until they are delivered to the recipient’s device. Once the recipient acknowledges receipt, the message is deleted from the server. This “transient store” approach dramatically reduces storage requirements and simplifies the architecture.
- XMPP-based protocol (modified): WhatsApp started with the XMPP messaging protocol and heavily modified it for mobile efficiency. Their custom protocol minimizes battery drain and data usage — critical for WhatsApp’s user base in markets with limited data plans.
- No read receipts stored server-side: The blue check marks (read receipts) are end-to-end signals between devices. The server facilitates delivery but does not store read state. This is another example of pushing complexity to the edges and keeping the server simple.
- Mnesia for routing: WhatsApp uses Erlang’s built-in Mnesia database for connection routing — mapping which user is connected to which server. Mnesia is an in-memory distributed database that replicates across Erlang nodes, giving sub-millisecond lookups without an external dependency like Redis.
- FreeBSD over Linux: WhatsApp chose FreeBSD for their servers, partly because of superior network stack performance for their specific workload (millions of concurrent TCP connections). This is an unconventional choice that paid off — it shows that sometimes the right tool is not the popular tool.
Discord’s Message Storage
Discord’s approach is interesting because they had to solve a different problem than WhatsApp — Discord needs persistent, searchable message history (channels can have millions of messages spanning years):- Migration from MongoDB to Cassandra: Discord initially stored messages in MongoDB. As they scaled, MongoDB’s single-primary-per-shard model created write bottlenecks for hot channels. They migrated to Cassandra, which distributes writes evenly across nodes. The partition key is
(channel_id, bucket)where bucket is a time window, preventing any single partition from growing unbounded. - Bucket strategy: Each bucket covers a fixed time period (approximately 10 days of messages). When a user scrolls back through history, Discord fetches the appropriate bucket. New messages always write to the current bucket. This means the “hot” data (recent messages) is always on a small number of partitions, keeping reads fast.
- Data services layer: Discord added a data services layer between their API and Cassandra. This layer handles request coalescing — if 1,000 users in the same channel all request the same messages at the same time (e.g., after a server outage), the data services layer deduplicates these into a single Cassandra read and fans the result back to all requesters. Without this, a thundering herd could overwhelm Cassandra.
- ScyllaDB migration: Discord later migrated from Cassandra to ScyllaDB (a C++ reimplementation of Cassandra’s protocol) for better tail latency performance. Cassandra’s JVM garbage collection pauses caused p99 latency spikes that affected user experience. ScyllaDB, being C++ with a shard-per-core architecture, eliminated GC pauses entirely.
What Happens When Things Fail?
What Happens When Things Fail?
- All 10K users get disconnected simultaneously. Each client’s reconnection logic kicks in (exponential backoff with jitter). The load balancer routes reconnections to healthy servers.
- Critical: the Message Store (Cassandra) has all messages persisted. When users reconnect, they request messages since their last received message ID. No messages are lost — they were persisted before the server acknowledged them to the sender. This is why you write to the database before forwarding to the recipient.
- Risk: thundering herd on reconnection. 10K clients reconnecting simultaneously can overwhelm the remaining servers. Mitigation: client-side exponential backoff with random jitter (reconnect between 1-10 seconds, not all at once).
- Inter-server message routing stops. If User A is on Chat Server 1 and User B is on Chat Server 3, messages cannot be forwarded.
- Short-term: messages queue up on the sending server. Long-term: fall back to a polling mechanism where Chat Servers periodically check the Message Store for undelivered messages for their connected users.
- Alternative: run Redis Pub/Sub as a cluster with Sentinel for automatic failover. Recovery is typically under 30 seconds.
- Cassandra is designed for this — it uses a quorum-based consistency model. With a replication factor of 3 and quorum reads/writes, the system tolerates one node failure per partition.
- If too many nodes fail and quorum is lost: messages cannot be persisted. The Chat Server must buffer messages in memory and retry. Return a “message pending” status to the sender instead of “delivered.” This is an honest UX that users understand — they see a clock icon instead of a check mark.
- The heartbeat TTL is 60 seconds, so in the worst case, a user appears online for up to 60 seconds after going offline. This is acceptable for most chat applications — WhatsApp and Telegram both have similar tolerances.
- In the other direction (user appears offline but is online), this happens if a heartbeat is lost. The next heartbeat in 30 seconds corrects it. This is why presence should always be treated as “best effort” and never used for critical decisions.
- Messages sent within each region continue to work. Cross-region messages queue up and deliver once the partition heals.
- The critical design decision: do you allow both regions to accept writes (risk message ordering conflicts) or do you designate one region as primary (risk write unavailability in the secondary)? For chat, allowing both regions to accept writes with per-conversation ordering (guaranteed by partition key) is the better choice. Cross-conversation ordering does not matter.
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
edited_at and deleted_at columns. When a client fetches messages, it applies edits and filters deletes client-side. For real-time updates, broadcast an “edit” or “delete” event to all connected members of the conversation via the same Pub/Sub channel. If a client missed the event (was offline), it reconciles on reconnect when it fetches the latest messages. This event-sourcing approach is simpler than trying to mutate delivered messages across all clients.The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| WebSocket Management | API Gateway WebSocket APIs + Lambda or ECS | API Gateway handles WebSocket connection lifecycle (connect, disconnect, message routing) natively. For 5M concurrent connections, use ECS/EKS with custom WebSocket servers behind a Network Load Balancer (NLB) — NLB supports millions of concurrent connections and operates at Layer 4 |
| Inter-Server Messaging | Amazon ElastiCache for Redis (Pub/Sub) or Amazon SNS | Redis Pub/Sub for low-latency inter-server message routing. SNS for fan-out to multiple subscribers when message delivery to offline users triggers push notifications |
| Message Storage | Amazon Keyspaces (Managed Cassandra) or DynamoDB | Keyspaces provides Cassandra-compatible API with serverless scaling — ideal for the write-heavy, time-series message storage pattern. DynamoDB works if you prefer single-table design with conversation_id as partition key and message_id as sort key |
| Presence Service | ElastiCache for Redis | Store heartbeat state with TTL-based expiration. Redis hash maps for presence:user_id to {status, server, timestamp} |
| Push Notifications | Amazon SNS Mobile Push | SNS integrates with APNs (iOS) and FCM (Android) directly. Handles device token management, multi-platform delivery, and retry logic |
| Media Storage | S3 + CloudFront | S3 for image/file storage with lifecycle policies (move to Glacier after 90 days). CloudFront CDN for low-latency media delivery worldwide |
| Connection Routing | ElastiCache for Redis or DynamoDB | Map user_id to server_id for message routing. Redis for sub-millisecond lookups; DynamoDB if you want managed persistence without Redis operational overhead |
| Monitoring | CloudWatch + X-Ray | Track WebSocket connection counts, message delivery latency (p50, p95, p99), Pub/Sub lag, and presence accuracy |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- Server-side search is now impossible. The server stores encrypted blobs. Key exchange happens via the Signal Protocol (double-ratchet). Group chat encryption becomes significantly harder — you need the Sender Keys protocol where each group member maintains a session with the sender. Message editing and deletion must be handled as encrypted “edit” or “tombstone” events. The presence system is unaffected (metadata, not content). This is how WhatsApp and Signal work.
- The fan-out model from the original design breaks at 10K members. You cannot push messages to 10K Redis Pub/Sub channels. Instead, use a channel-centric model: members subscribe to the channel, and the Chat Server maintains a channel subscription map. A message published to a channel is broadcast to all servers that have at least one subscriber for that channel (typically far fewer than 10K servers). This is Discord’s approach. The presence system also changes — you cannot show individual online status for 10K members, so you show “X members online” as an aggregate.
- User-facing “delete” becomes a soft delete (hidden from UI, still in storage). The Cassandra TTL strategy from the original design must be disabled. Storage grows linearly forever — at 2B messages/day, that is ~5 PB over 7 years. You need a tiered storage strategy: hot (last 30 days in Cassandra), warm (30 days to 1 year in S3), cold (1-7 years in S3 Glacier). The audit trail must be immutable. Encryption key management becomes critical — you must be able to decrypt 7-year-old messages, which means key rotation must preserve old keys.
- Offline-first architecture becomes mandatory. Messages are stored locally on the device first, then synced when connectivity returns. You need a conflict resolution protocol for messages created offline by multiple users. Vector clocks or Lamport timestamps per conversation ensure causal ordering. Message payloads must be compact (Protocol Buffers instead of JSON). The app must work in a “compose and queue” mode where the user types and the message is sent later. This is WhatsApp’s design philosophy.
Cross-Chapter Connections
Cross-Chapter Connections
- Networking — WebSocket protocol, TCP connection management, HTTP long-polling, and SSE are covered in Networking and Deployment.
- Messaging and concurrency — Pub/Sub patterns, message queues, fan-out strategies, and exactly-once vs at-least-once delivery guarantees are explored in Messaging, Concurrency, and State.
- Databases — Cassandra data modeling, partition key design, hot partition mitigation, and the migration story (MongoDB to Cassandra to ScyllaDB) connect to APIs and Databases. For a deeper exploration of Cassandra vs ScyllaDB partition strategies, wide-row design, and when to choose time-series-optimized databases, see Database Deep Dives.
- Caching — Redis for Pub/Sub, presence state, and connection routing connects to Caching and Observability.
- Performance — WebSocket connection scaling, heartbeat optimization, and lazy vs active presence evaluation connect to Performance and Scalability.
- Security — End-to-end encryption, authentication of WebSocket connections, and the Signal Protocol connect to Auth and Security.
- Real-time systems — WebSocket lifecycle management, heartbeat protocols, presence tracking, and the trade-offs between push (WebSocket), pull (long-polling), and hybrid delivery models are core real-time patterns explored in Real-Time Systems. The chat system is the canonical real-time system design problem.
- Cloud service patterns — The decision between API Gateway WebSocket APIs and custom WebSocket servers on ECS, NLB vs ALB for persistent connections, and SNS Mobile Push integration are practical AWS patterns covered in Cloud Service Patterns.
Problem 4: Design a News Feed
Difficulty: Hard | Time Target: 45 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Content | Text posts only, or images/videos too? | Media adds CDN and transcoding complexity |
| Feed type | Chronological or ranked (algorithmic)? | Ranked feeds need an ML scoring service |
| Social graph | Follow model (Twitter) or friend model (Facebook)? | Follow is asymmetric and simpler; friends is symmetric |
| Celebrity problem | Do some users have millions of followers? | This single question changes the entire fan-out strategy |
| Real-time | Should the feed update in real-time or on refresh? | Real-time adds WebSocket/SSE complexity |
| Scale | DAU? Average follows per user? Posts per day? | Drives the fan-out vs fan-in decision |
Requirements You Should Lock Down
Functional:- Users create posts (text + images/video)
- Users follow other users (asymmetric follow model)
- Home feed shows posts from followed users, ranked by relevance
- Support likes, comments, shares (interaction signals)
- Feed generation latency < 500ms
- Support 500M DAU
- Handle celebrity accounts (10M+ followers)
- Feed should feel “fresh” (updates within minutes of posting)
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Feed Ranking and Caching
Phase 4: Deep Dive -- Feed Ranking and Caching
Deep Dive 1: The Hybrid Fan-Out Strategy
Post arrives at Post Service
Check follower count
Fan-out workers process asynchronously
Deep Dive 2: Feed Ranking
A pure chronological feed is simple but produces poor engagement. A ranked feed considers multiple signals:Deep Dive 3: Caching and CDN for Media
Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Fan-out strategy | Hybrid (push for normal, pull for celebrities) | Pure push or pure pull | Pure push breaks for celebrities; pure pull is too slow for reads |
| Celebrity threshold | 10K followers | Static vs dynamic | Start with static threshold, evolve to dynamic based on system load |
| Feed storage | Redis sorted sets (post IDs only) | Store full post content | IDs are tiny; fetch content separately with post cache hit rates > 99% |
| Ranking | Lightweight scoring at read time | Pre-computed ranked feeds | Ranking signals change in real-time (new likes); pre-ranking gets stale |
| Media delivery | CDN + object storage (S3) | Serve from app servers | Media is the bandwidth bottleneck; CDN reduces latency and server load by 90%+ |
| Inactive users | Skip fan-out, generate on demand | Fan-out to everyone | 60% of users do not check their feed daily; skipping saves enormous write volume |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You immediately identified the celebrity fan-out problem
You proposed a hybrid approach instead of picking one extreme
You separated the ranking layer from the data layer
You optimized for inactive users
How the Real System Works
How the Real System Works
Facebook’s News Feed Ranking Evolution
Facebook’s News Feed is perhaps the most studied feed system in the world. Its evolution from simple to sophisticated mirrors the journey every feed system takes:- 2006 — Chronological feed: The original News Feed was purely chronological. If your friend posted, it appeared in your feed sorted by time. Simple, but as users added hundreds of friends, the feed became noisy and engagement dropped.
- 2009 — EdgeRank: Facebook introduced EdgeRank, a relatively simple scoring formula:
Score = Affinity x Weight x Decay. Affinity measured how close you were to the poster (based on interactions). Weight reflected the content type (photos ranked higher than text). Decay was a time-based exponential falloff. EdgeRank was essentially the formula we described in the ranking section above. - 2013+ — Machine learning replaces EdgeRank: Facebook replaced the hand-tuned formula with a machine learning model that considers thousands of signals. The model predicts the probability that you will engage with (like, comment, share, click) each candidate post. Posts are ranked by predicted engagement score.
- Fan-out architecture: Facebook uses a hybrid fan-out model very similar to what we designed above. Normal users get fan-out-on-write (their posts are pushed to followers’ pre-computed feeds). Celebrities and pages with millions of followers use fan-out-on-read (their posts are merged at feed-fetch time). The threshold is dynamic and adjusts based on system load.
- Aggregator service: Facebook’s feed generation pipeline has a dedicated aggregation service (called “Multifeed”) that merges posts from multiple sources: the pre-computed feed cache, celebrity posts fetched on-read, ads injected by the ad auction, and “story bumping” (resurfacing older posts that are getting new comments). This merge step is where the ranking model runs.
- Feed cache (TAO + Memcache): Facebook built a custom distributed cache called TAO (The Associations and Objects cache) specifically for social graph data. Feed data lives in a combination of TAO, Memcache, and MySQL. The caching hierarchy is: L1 (in-process cache on the web server) to L2 (Memcache cluster in the same region) to L3 (TAO for social graph queries) to L4 (MySQL as source of truth).
Twitter’s Timeline Architecture
Twitter’s timeline system is a fascinating case study because they famously changed their approach:- Original approach (fan-out on write): For years, Twitter used aggressive fan-out-on-write. When you tweeted, a fleet of workers would push your tweet ID into the timelines of all your followers stored in Redis. This worked well because reads were instant — opening Twitter just meant reading a pre-computed list from Redis.
- The celebrity problem: This approach broke for users like Lady Gaga or Barack Obama with 30M+ followers. A single tweet triggered 30 million Redis writes. During events like the Super Bowl or elections, the fan-out queue would back up by minutes or even hours, meaning some users would see tweets with significant delay.
- Migration to hybrid: Twitter moved to a hybrid model around 2012-2013. Most users (those with fewer followers than a dynamic threshold) still get fan-out-on-write. High-follower accounts are excluded from fan-out. When you open your timeline, Twitter’s timeline service merges your pre-computed timeline with recent tweets from the high-follower accounts you follow.
- Timeline cache: Twitter stores timelines in a massive Redis cluster. Each user’s timeline is a Redis list of tweet IDs (not full tweet content). At read time, the IDs are hydrated by fetching the tweet content from a separate Tweet Store service. This separation means the timeline cache stays small (just IDs and scores) while tweet content is cached independently.
- Ranking and relevance: Twitter introduced algorithmic ranking (“top tweets”) alongside the chronological timeline. The ranking model considers engagement signals, recency, your past interactions with the author, and content type. Users can toggle between ranked and chronological views — a product decision that has architectural implications (you need to support both code paths).
What Happens When Things Fail?
What Happens When Things Fail?
- Users’ pre-computed feeds are gone. Every feed fetch becomes a fan-out-on-read: query the Post DB for recent posts from all followed users, merge, rank, and return. This is 100-1000x slower than a cache hit.
- Mitigation: multi-tier caching. L1 cache on the web server (in-process, last 5 minutes of feeds), L2 is the Redis cluster. If L2 dies, L1 buys you time for a percentage of users. For users not in L1, serve a degraded feed (e.g., top trending posts for everyone) while Redis recovers.
- Recovery priority: warm the cache for the most active users first (those currently online), not all 500M users. Use the active user list from the presence/session service to prioritize.
- New posts are being created but not distributed to followers’ feeds. Users see stale feeds. The Kafka queue grows.
- This is expected during traffic spikes (Super Bowl, elections, breaking news). Mitigation: auto-scale fan-out workers based on Kafka consumer lag. Set alerting thresholds: lag > 5 minutes triggers scaling, lag > 30 minutes triggers an incident.
- Graceful degradation: if fan-out lag exceeds a threshold, temporarily switch all users to fan-out-on-read. The feed will be slightly slower but always fresh. Switch back to fan-out-on-write once workers catch up.
- Fall back to chronological ordering. The feed still loads — it is just not ranked. Users see recent posts from people they follow, sorted by time. This is a perfectly acceptable degraded experience. Many users actually prefer chronological feeds.
- The important principle: the ranking service should be stateless and horizontally scalable, sitting behind a load balancer. A single instance failure should not affect any users. Only a total cluster failure causes degradation.
- If you are using the hybrid model correctly, this is a non-event. Celebrity posts are not fanned out — they live in the celebrity post cache and are merged at read time. The celebrity’s 10M followers experience zero additional latency because no fan-out occurs for this post.
- This is precisely why the hybrid model exists. Under pure fan-out-on-write, this celebrity post would add 10M items to an already-backed-up queue.
- Post content fetches slow down. Feeds still load (post IDs come from the Feed Cache) but individual posts render as “loading” on the client.
- Mitigation: aggressive caching of post content. Popular posts (anything with > 100 likes in the past hour) should be in the Post Cache (Redis) with very high TTL. Cache hit rate for viral content should be 99.9%+.
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| Post Service / API | ECS Fargate behind ALB | Auto-scaling based on request count. Fargate eliminates server management. ALB handles path-based routing for different API endpoints |
| Post Storage | DynamoDB or Amazon Keyspaces | DynamoDB for post metadata with user_id as partition key and post_id (ULID/Snowflake) as sort key. Keyspaces (managed Cassandra) if you want wider rows and more flexible time-range queries |
| Feed Cache | ElastiCache for Redis (cluster mode) | Redis sorted sets for pre-computed feeds (post IDs scored by timestamp). Cluster mode for 2+ TB of feed data across shards. Multi-AZ for failover |
| Fan-Out Workers | SQS + Lambda or Kinesis + ECS | SQS for simple fan-out: post creation triggers a message, Lambda workers fan out to follower feeds in Redis. Kinesis for higher throughput with ordered processing per partition |
| Social Graph | Amazon Neptune or DynamoDB | Neptune (managed graph database) for complex social graph queries (mutual friends, friend-of-friend). DynamoDB with adjacency list pattern if the graph queries are simple (just “who does user X follow?”) |
| Media Storage / CDN | S3 + CloudFront + MediaConvert | S3 for image and video storage. MediaConvert for video transcoding (multiple resolutions, adaptive bitrate). CloudFront for global delivery with Origin Shield to reduce origin load |
| Ranking Service | SageMaker Inference Endpoints | Host the feed ranking ML model on SageMaker. Real-time inference endpoints with auto-scaling based on request count. A/B testing via SageMaker endpoint variants |
| Celebrity Post Cache | ElastiCache for Redis or DynamoDB DAX | Separate cache for celebrity posts merged at read time. DAX (DynamoDB Accelerator) if your celebrity posts live in DynamoDB — provides microsecond read latency |
| Monitoring | CloudWatch + CloudWatch Contributor Insights | CloudWatch for fan-out lag, feed generation latency, cache hit rates. Contributor Insights to identify which users or posts are creating the most fan-out load (celebrity detection) |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- Pre-computed feeds now have a write-back problem. When a post is flagged, you must scan and remove it from potentially millions of pre-computed feed caches. This is expensive. Alternative: do not embed post content in feeds — store only post IDs. The Feed Service hydrates IDs to content at read time, checking a “blocked posts” set before returning. A blocked post simply returns nothing, creating a seamless removal. This is how Facebook and Instagram handle content moderation at scale — the moderation decision propagates instantly because the content fetch is decoupled from the feed cache.
- Two separate feed generation paths. The “Following” feed is the hybrid fan-out model from the original design. The “For You” feed is entirely fan-out-on-read: a recommendation service generates candidate posts from a broader pool (not just followed users), scores them, and returns the top N. The “For You” path requires a real-time ML inference service (SageMaker, TensorFlow Serving). The two feeds share the post cache but have completely different generation pipelines. The switching must be instant — both feeds should be pre-warmed or quickly generatable.
- The celebrity fan-out problem vanishes entirely. Pure fan-out-on-write works perfectly at 50K users. The ranking service is unnecessary — chronological is fine for enterprise. But the compliance requirement adds a new dimension: posts cannot be truly deleted (only soft-deleted from user-facing views), the audit trail must be immutable, and access control must respect organizational hierarchies (a post in the “Engineering” group should not be visible to “Sales” unless explicitly shared). The entire feed architecture simplifies dramatically, but the access control layer becomes the hardest part.
- The biggest costs in a feed system: (1) Redis cluster for pre-computed feeds (memory is expensive at scale), (2) fan-out workers (compute for write amplification), (3) CDN for media delivery. Cost reduction strategies: stop pre-computing feeds for inactive users (users who have not logged in for 14 days) — this can reduce Redis usage by 40-60%. Reduce the feed cache depth from 500 to 200 post IDs per user. Move fan-out workers to Spot instances (they are stateless and retry-safe). Compress media more aggressively (AVIF instead of JPEG saves 30-50% bandwidth). Each of these trades a small UX degradation for a measurable cost reduction.
Cross-Chapter Connections
Cross-Chapter Connections
- Caching — Multi-tier caching (CDN, Redis feed cache, post cache, social graph cache), cache invalidation strategies, and cache warming are covered in depth in Caching and Observability.
- Messaging and async processing — Kafka-based fan-out workers, consumer lag monitoring, and async event processing connect to Messaging, Concurrency, and State.
- Databases — Choosing between Redis sorted sets for feed storage, Cassandra for post storage, and the social graph data model connect to APIs and Databases. For a deeper exploration of graph databases (Neptune) vs adjacency list patterns in DynamoDB for social graph modeling, and the trade-offs between wide-column stores and document databases for post storage, see Database Deep Dives.
- Performance — CDN for media delivery, feed latency optimization, and the trade-off between pre-computation and on-demand computation connect to Performance and Scalability.
- Design patterns — The push/pull/hybrid pattern is a fundamental distributed systems design pattern explored in Design Patterns.
- Scalability estimation — The back-of-envelope math for fan-out write amplification connects to Capacity, Git, and Pipelines.
- Distributed systems theory — The fan-out problem is fundamentally about write amplification and eventual consistency. Understanding CAP theorem trade-offs (the feed cache is eventually consistent with the post store) and how read-your-writes consistency affects the posting user’s experience connects to Distributed Systems Theory.
- Cloud service patterns — DynamoDB Streams for event-driven fan-out, SageMaker for real-time ranking inference, Neptune for social graph queries, and the serverless fan-out pattern are AWS-specific implementations explored in Cloud Service Patterns.
Problem 5: Design a Distributed Task Scheduler
Difficulty: Hard | Time Target: 45 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Task types | One-time, delayed, recurring (cron), or all three? | Recurring tasks need a fundamentally different scheduling model |
| Execution | At-most-once, at-least-once, or exactly-once? | Determines dedup strategy and acknowledgment protocol |
| Priority | Do tasks have priority levels? | Priority requires a priority queue, not a FIFO queue |
| Payload | What is a task? Just a function name + args, or arbitrary code? | Arbitrary code execution is a security and sandboxing problem |
| Duration | How long can a task run? Timeout? | Long-running tasks need heartbeats and lease renewal |
| Scale | How many tasks per day? Concurrent workers? | Drives queue partitioning and worker scaling decisions |
| Failure | What happens when a task fails? Retry policy? Dead letter? | Failure handling is where most task schedulers get complex |
Requirements You Should Lock Down
Functional:- Schedule one-time tasks for immediate or delayed execution
- Schedule recurring tasks using cron expressions
- Retry failed tasks with configurable backoff (max 3 retries)
- Dead-letter queue for tasks that exceed retry limit
- Task status tracking: pending, scheduled, running, completed, failed, dead
- Cancel or pause scheduled/recurring tasks
- At-least-once execution guarantee (with idempotency support)
- Handle 10M task executions per day
- Scheduling accuracy: within 1 second of target time
- Horizontal scaling of both scheduler and workers
- No single point of failure
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Three Critical Components
Phase 4: Deep Dive -- Three Critical Components
Deep Dive 1: Task State Machine
Every task follows a strict state machine. Invalid transitions must be rejected.Database Schema
Deep Dive 2: Distributed Locking for Task Pickup
The critical problem: multiple workers must not pick up the same task.Approach A: Database-Level Locking (Simple, Recommended to Start)
Approach B: Redis-Based Distributed Lock (For Higher Scale)
Lease-Based Execution
- Worker acquires a lease (lock with TTL) on the task
- Worker must renew the lease periodically (heartbeat) if the task runs long
- If the worker crashes, the lease expires and another worker can pick up the task
- When the task completes, the worker releases the lease and updates state
Deep Dive 3: Failure Handling and Idempotency
Retry with Exponential Backoff
Idempotency via Idempotency Keys
idempotency_key field ensures the same logical task is not created twice:Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Task store | PostgreSQL | Cassandra | Need strong consistency + transactions for state machine; Cassandra’s eventual consistency causes double execution |
| Task queue | Redis sorted sets | SQS / RabbitMQ | Redis gives precise scheduled execution time ordering; managed queues add latency |
| Locking | DB-level SKIP LOCKED (start), Redis leases (scale) | ZooKeeper | Start simple with DB; graduate to Redis. ZooKeeper is operationally heavy for this use case |
| Execution guarantee | At-least-once + idempotency | Exactly-once | True exactly-once is impossible; at-least-once with idempotent tasks is the practical standard |
| Scheduler HA | Leader election (single active scheduler) | Multiple schedulers with distributed locking | Single scheduler is simpler; leader election via PostgreSQL advisory lock or Redis Redlock |
| Recurring tasks | Scheduler computes next_run_at after each execution | Worker computes next run | Scheduler owns the schedule; workers just execute. Separation of concerns. |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You defined a rigorous state machine
You addressed the zombie task problem
You explained why exactly-once is impossible and offered the practical alternative
You designed for progressive scaling
How the Real System Works
How the Real System Works
Uber’s Cadence / Temporal
Cadence was built at Uber to solve the problem of orchestrating long-running, reliable workflows (ride matching, payment processing, driver onboarding). It was later open-sourced and evolved into Temporal (founded by the same engineers):- Workflow as code: Unlike traditional task schedulers that treat tasks as independent units, Cadence/Temporal models entire workflows as code. A workflow function can call activities (individual tasks), wait for signals, set timers, and branch on conditions — all while being fully durable. If the worker crashes mid-workflow, the framework replays the workflow function from the event history to resume exactly where it left off.
- Event sourcing under the hood: Every state change in a workflow is persisted as an event in a history. The workflow’s current state is reconstructed by replaying this event history. This is fundamentally different from the state-machine-in-a-database approach we designed above. It is more flexible (arbitrary workflow logic instead of predefined states) but more complex to reason about.
- Task queues with sticky execution: Temporal uses task queues to dispatch activities to workers. It supports “sticky execution” — once a worker starts processing a workflow, subsequent activities in that workflow are preferentially routed to the same worker. This improves cache locality (workflow context is already in memory) and reduces replay overhead.
- Visibility and search: Temporal provides a built-in visibility layer that indexes workflow metadata (status, type, start time, custom search attributes) into Elasticsearch. Operators can search for workflows across millions of executions. This is the observability layer that is often an afterthought in custom task schedulers.
- Multi-cluster replication: For disaster recovery, Temporal supports active-passive replication across clusters. Workflows in a failed cluster can be resumed in the standby cluster. This is critical for Uber’s ride-booking workflows — a datacenter failure cannot cause rides to be lost.
- At Uber’s scale: At peak, Uber’s Cadence deployment handled hundreds of thousands of workflow executions per second across multiple Cassandra clusters totaling petabytes of event history. The system managed everything from ride lifecycle to financial settlement to promotional offer expiration.
Apache Airflow’s Architecture
Airflow, originally created at Airbnb (around 2014) and now an Apache project, is the most widely used open-source task orchestrator for data pipelines:- DAG-based scheduling: Airflow represents workflows as Directed Acyclic Graphs (DAGs). Each node is a task (e.g., “extract data from S3”, “run Spark job”, “load into warehouse”) and edges represent dependencies. The scheduler traverses the DAG, executing tasks only when their upstream dependencies are complete.
- Scheduler architecture: Airflow’s scheduler is a loop that runs every few seconds: (1) parse all DAG files to discover task dependencies, (2) identify tasks whose dependencies are met and whose scheduled time has arrived, (3) place those tasks into a queue (Celery, Kubernetes, or a local executor). The scheduler was historically single-threaded and a bottleneck, but Airflow 2.0+ supports multiple schedulers with database-level locking (similar to the SKIP LOCKED pattern we described above).
- Executor model: Airflow separates scheduling from execution. The executor is pluggable: CeleryExecutor distributes tasks to a fleet of Celery workers via Redis/RabbitMQ, KubernetesExecutor spins up a new pod for each task (strong isolation but higher overhead), and LocalExecutor runs tasks as subprocesses on the scheduler machine (fine for small deployments).
- Metadata database (PostgreSQL/MySQL): All task state, DAG definitions, execution history, and scheduling metadata live in a relational database. This is the source of truth. The choice of PostgreSQL or MySQL is deliberate — Airflow needs ACID transactions for state machine transitions, exactly as we discussed in Problem 5.
- XCom for inter-task communication: Tasks can pass small amounts of data to downstream tasks via “XComs” (cross-communications) stored in the metadata database. For large datasets, tasks write to external storage (S3, GCS) and pass only the reference via XCom. This pattern of passing pointers instead of payloads is a common production optimization.
- Limitations: Airflow is designed for batch workflows (run this ETL every hour), not for real-time event-driven tasks. It has no native concept of responding to events or running sub-second-latency tasks. For those use cases, Temporal or a custom solution (like the one we designed) is more appropriate.
What Happens When Things Fail?
What Happens When Things Fail?
- This is the most dangerous failure. No new tasks get enqueued from PENDING to SCHEDULED. Tasks already in the queue continue to be processed by workers, but no new tasks are picked up from the database.
- Mitigation: leader election with automatic failover. Run 2-3 scheduler instances. Only one is active (the leader), the others are on standby. Use PostgreSQL advisory locks or Redis Redlock for leader election. When the leader’s lock expires (because it crashed), a standby instance acquires the lock and becomes the new leader within seconds.
- Detection: the standby schedulers monitor the leader’s heartbeat. Additionally, a canary task (a no-op task scheduled every minute) serves as an end-to-end health check. If the canary does not execute within 2 minutes, alert on-call.
- The task is stuck in RUNNING state. The lease (lock TTL) expires after the configured timeout. The zombie detector (a periodic scheduler job) identifies tasks where
locked_at + timeout_sec < NOW()andstatus = RUNNING. These zombie tasks are moved back to SCHEDULED if retries remain, or to DEAD if they are exhausted. - The task may have partially completed before the crash. This is why idempotency is non-negotiable — the retry must be safe to execute even if the first attempt partially succeeded. Example: if the task is “charge customer $50,” the payment system must use an idempotency key so the retry does not double-charge.
- Workers cannot pull new tasks. Tasks accumulate in PostgreSQL with status SCHEDULED but no queue to feed them to workers.
- Short-term mitigation: workers fall back to polling PostgreSQL directly using the
SELECT ... FOR UPDATE SKIP LOCKEDquery. This is slower (10-100x less throughput than Redis) but functional. - Long-term: Redis Sentinel or Redis Cluster with automatic failover. Recovery in under 30 seconds for most configurations.
- This is the worst-case scenario. Task state is lost (or temporarily inaccessible). New tasks cannot be created. Task status updates from workers fail.
- Workers that are currently executing tasks continue to run (they already have the task payload). But they cannot report completion, so tasks will appear as zombies after lease expiry and get retried — making idempotency even more critical.
- Mitigation: PostgreSQL with streaming replication and automatic failover (Patroni, RDS Multi-AZ). Recovery time: 15-30 seconds. Data loss: zero or near-zero (depending on synchronous vs async replication).
- After exhausting retries, the task lands in the DLQ. But the recurring schedule generates a new execution for the next interval, which also fails and also lands in the DLQ. You now have a DLQ filling up with the same failing task.
- Mitigation: circuit breaker pattern for recurring tasks. After N consecutive failures (across multiple scheduled executions), automatically PAUSE the recurring task and alert the task owner. Require manual re-enablement after the root cause is fixed. This prevents a single broken dependency from flooding the DLQ and masking other legitimate failures.
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
depends_on field to the task schema (a list of task IDs). The scheduler only moves a task to SCHEDULED if all its dependencies are in COMPLETED state. This turns the task scheduler into a DAG executor (like Airflow). For simple linear dependencies, this is straightforward. For complex DAGs, you need a topological sort when evaluating which tasks are ready. Mention that at this point you are essentially building a workflow engine, and it may be worth using Temporal or Airflow instead of reinventing it.“How do you ensure scheduling accuracy — that a task scheduled for exactly 10:00:00 AM runs within 1 second of that time?”The scheduler polls the database on a tight loop (every 1-5 seconds). At each poll, it selects all tasks where next_run_at <= NOW() and enqueues them. If the poll interval is 1 second, the maximum scheduling delay is 1 second plus the queue wait time. For sub-second accuracy, switch from polling to a timer-based approach: load the next N tasks into memory, set OS-level timers for each, and enqueue at the precise moment. This is how cron works internally. The trade-off: in-memory timers are lost if the scheduler crashes, so you still need the DB as the source of truth.“How would you handle a deployment where you need to update the task handler code without dropping in-flight tasks?”Use a blue-green deployment for workers. Deploy the new worker code to a new set of workers (green). Drain the old workers (blue) — stop feeding them new tasks and wait for in-flight tasks to complete (with a timeout). Once all blue workers are idle, shut them down. If a blue worker’s in-flight task exceeds the drain timeout, it is forcibly killed and the lease expires, causing a retry on a green worker. The key: the task payload must be backward-compatible, or versioned (include a version field in the payload so the worker knows which handler to use).The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| Task API | ECS Fargate behind ALB or API Gateway + Lambda | Fargate for steady-state API traffic. Lambda for bursty task submission patterns (pay-per-invocation, auto-scales to thousands of concurrent invocations) |
| Task Store (source of truth) | Amazon RDS for PostgreSQL (Multi-AZ) | Managed PostgreSQL with automated failover (15-30 seconds). SELECT FOR UPDATE SKIP LOCKED works identically to self-managed PostgreSQL. Read replicas for status queries. Automated backups with point-in-time recovery |
| Task Queue | Amazon SQS (Standard or FIFO) | SQS Standard for at-least-once delivery with massive throughput (nearly unlimited). SQS FIFO for ordered processing with deduplication (300 msg/sec base, 3000 with high throughput mode). Built-in DLQ support — no need to build your own |
| Scheduler (cron/delayed tasks) | Amazon EventBridge Scheduler | Fully managed scheduler that triggers targets (Lambda, SQS, Step Functions) on a schedule or at a specific time. Supports cron expressions and one-time schedules. Scales to millions of schedules. Eliminates the need for a custom scheduler leader |
| Workers | ECS Fargate or Lambda | Fargate for long-running tasks (up to hours). Lambda for short tasks (<15 minutes) with automatic scaling to thousands of concurrent executions. Mix both: Lambda for fast tasks, Fargate for heavy tasks |
| Workflow Orchestration | AWS Step Functions | For complex task dependencies (DAGs), Step Functions provides visual workflow orchestration with built-in retry policies, error handling, timeouts, and parallel execution. Standard Workflows for long-running tasks, Express Workflows for high-volume short tasks |
| Dead Letter Queue | SQS DLQ + CloudWatch Alarm | SQS natively supports DLQ configuration. CloudWatch alarm when DLQ depth > 0 triggers SNS notification to on-call |
| Monitoring | CloudWatch + CloudWatch Logs Insights | Custom metrics for task completion rate, execution duration, queue depth, DLQ depth. Logs Insights for querying task execution logs across all workers |
| Completed Task Archive | S3 + Glacier | After 7 days, archive completed task records from RDS to S3 (Parquet via AWS Glue ETL). Lifecycle policy moves to Glacier after 90 days for long-term retention |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- The simple queue model breaks. You now need a DAG execution engine. Each task has a
depends_onfield listing prerequisite task IDs. The scheduler evaluates task readiness by checking whether all dependencies are in COMPLETED state. This requires a topological sort on the dependency graph. Cycles must be detected and rejected at submission time. At this point, you are building Airflow or Temporal. In an interview, acknowledge this and explain when to buy vs build: “If we need DAG execution, I would evaluate Temporal or Step Functions before building custom, because correct DAG scheduling with failure handling is a solved problem that is hard to get right.”
- Global deduplication becomes the hard problem. The idempotency key must be checked globally, not per-region. Options: (1) use a global DynamoDB table with a unique constraint on the idempotency key (cross-region replication with conflict resolution), (2) designate a single “coordinator region” that owns task deduplication and fans out execution to regional workers, or (3) partition tasks by a deterministic key (e.g., hash of idempotency key modulo 3 regions) so each task has exactly one “owning” region. Option 3 is the most scalable but requires re-routing submissions to the owning region.
- A single worker pool cannot handle both. Sub-second tasks need lightweight workers (Lambda functions) that spin up and down instantly. 24-hour tasks need persistent workers (ECS tasks) with lease renewal, heartbeats, and checkpointing. The scheduler routes tasks to different worker pools based on estimated duration. The lease-based locking pattern from the original design must support configurable TTLs — a 60-second TTL for sub-second tasks, a 1-hour TTL with renewal for long-running tasks. Checkpointing allows a long-running task to resume from its last checkpoint after a worker failure, rather than restarting from scratch.
- Per-tenant queues or per-tenant priority within a shared queue. The simplest approach: each tenant gets a fair-share allocation of worker capacity. If you have 100 workers and 10 active tenants, each tenant gets 10 workers. Burst capacity is allocated from an overflow pool. Critical tasks bypass the per-tenant queue and go to a dedicated high-priority queue (similar to the notification system’s priority routing). This is the multi-tenancy pattern used by managed services like AWS Step Functions and Temporal Cloud.
Cross-Chapter Connections
Cross-Chapter Connections
- Databases — PostgreSQL advisory locks,
SELECT FOR UPDATE SKIP LOCKED, ACID transactions for state machine integrity, and schema design connect to APIs and Databases. For a deeper dive into PostgreSQL’s locking mechanisms (advisory locks, row-level locks, SKIP LOCKED internals) and when to choose PostgreSQL over DynamoDB for transactional workloads, see Database Deep Dives. - Messaging and queues — Redis as a task queue, Kafka for event-driven architectures, dead-letter queues, and at-least-once delivery guarantees are covered in Messaging, Concurrency, and State.
- Reliability — Leader election, automatic failover, circuit breaker patterns, and graceful degradation connect to Reliability Principles.
- Distributed systems theory — The leader election for the scheduler, the fencing token pattern for zombie task detection, and the impossibility of exactly-once delivery in distributed systems are core consensus and consistency concepts explored in Distributed Systems Theory. The Raft or Paxos consensus protocols that underpin PostgreSQL’s Patroni failover and Redis’s Sentinel are covered there in depth.
- Observability — Task status dashboards, DLQ depth alerting, canary tasks, and execution latency metrics connect to Caching and Observability.
- Design patterns — The state machine pattern, lease-based locking, idempotency keys, and exponential backoff with jitter are fundamental patterns explored in Design Patterns.
- Deployment — Blue-green deployments for worker code updates, draining in-flight tasks, and zero-downtime deploys connect to Networking and Deployment.
- Cloud service patterns — Step Functions for workflow orchestration, EventBridge Scheduler for managed cron, SQS for task queuing with native DLQ support, and the “buy vs build” decision framework for task scheduling are AWS-specific patterns explored in Cloud Service Patterns.
Problem 6: Design a Notification System
Difficulty: Medium-Hard | Time Target: 45 minutesPhase 1: Clarify Requirements
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Channels | Which channels: push notifications, email, SMS, in-app? All four? | Each channel has different delivery infrastructure, cost, and latency characteristics |
| Priority | Do notifications have priority levels (urgent vs marketing)? | Priority determines queue ordering and whether to bypass rate limits |
| User preferences | Can users opt out of specific channels or notification types? | Requires a user preferences service and filtering layer before dispatch |
| Deduplication | Should we prevent sending the same notification twice? | Dedup is critical — sending duplicate SMS costs money and annoys users |
| Rate limiting | Should we limit notifications per user per time window? | Users who receive 50 push notifications in an hour will disable notifications entirely |
| Templating | Are notifications templated or free-form? Localization? | Templates with variable substitution simplify the API and enable A/B testing |
| Scale | How many notifications per day? Peak events? | Black Friday or breaking news can cause 100x traffic spikes |
| Delivery tracking | Do we need delivery confirmation and analytics? | Affects whether you track sent/delivered/opened/clicked states |
Requirements You Should Lock Down
Functional:- Send notifications across four channels: push (iOS + Android), email, SMS, and in-app
- Support priority levels: critical (immediate), high (within 1 minute), normal (within 5 minutes), low (batched hourly)
- Respect user channel preferences (opt-in/opt-out per channel per notification type)
- Deduplicate notifications within a configurable time window (default: 1 hour)
- Rate limit per-user per-channel (e.g., max 5 push notifications per hour for marketing)
- Notification templates with variable substitution and localization support
- Delivery tracking: sent, delivered, opened, clicked
- Deliver critical notifications within 10 seconds
- Handle 1 billion notifications per day (normal load), 10 billion during peak events
- 99.9% delivery rate (at-least-once per channel)
- Graceful degradation if one channel provider is down (do not block other channels)
Phase 2: Estimate Scale
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Four Critical Components
Phase 4: Deep Dive -- Four Critical Components
Deep Dive 1: Priority-Based Routing and Queue Architecture
Not all notifications are created equal. A two-factor authentication code must arrive in seconds; a weekly digest can wait hours. The priority system ensures critical notifications are never delayed by marketing volume.Deep Dive 2: Deduplication and Idempotency
Duplicate notifications are expensive (SMS costs money) and annoying (users uninstall your app). Dedup happens at two levels:API-Level Dedup (Idempotency Key)
idempotency_key. Before processing, check Redis:Content-Level Dedup (Similarity Hashing)
Deep Dive 3: User Preferences and Channel Selection
The preferences engine determines which channels to use for each notification:Deep Dive 4: Per-User Rate Limiting
Rate limiting per user per channel prevents notification fatigue:Phase 5: Trade-Offs Discussed
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Queue architecture | Separate queues per priority | Single priority queue | Separate queues ensure critical notifications are isolated from marketing volume floods. A single queue risks head-of-line blocking |
| Dedup approach | Redis-based with TTL | Database unique constraint | Redis is faster (sub-ms) for the dedup check that happens on every notification. DB constraint is a safety net but too slow for hot-path |
| User preferences cache | Redis with 5-min TTL + invalidation on update | Read-through cache | Explicit invalidation ensures preferences update within seconds. Read-through with long TTL causes stale reads |
| Channel selection | Preference engine at dispatch time | Pre-compute per-user channel lists | Dispatch-time evaluation respects real-time preference changes, quiet hours, and device availability. Pre-computation gets stale |
| Notification aggregation | Timer-based batching with aggregate keys | No aggregation | Aggregation dramatically reduces notification fatigue. “20 people liked your post” is better UX than 20 separate notifications |
| SMS provider | Multi-provider with failover (Twilio primary, Vonage backup) | Single provider | SMS providers have outages. Multi-provider with automatic failover ensures critical SMS (2FA) always gets delivered |
| Delivery tracking | Event-sourced status updates | Status polling | Each channel worker publishes status events (sent, delivered, opened, clicked) to a stream. Async processing decouples tracking from dispatch |
What Makes This Answer Senior-Level
What Makes This Answer Senior-Level
You designed priority-aware routing from the start
You implemented multi-level deduplication
You addressed the preferences-caching consistency problem
You designed for notification fatigue
How the Real System Works
How the Real System Works
LinkedIn’s Notification Infrastructure
LinkedIn’s notification system is one of the most sophisticated in the industry, delivering billions of notifications per week across push, email, in-app, and badge count updates:- Unified notification platform: LinkedIn built a centralized notification service that all product teams use. Teams do not build their own notification dispatch — they call the platform API with a notification type, user ID, and template variables. The platform handles channel selection, preference filtering, rate limiting, dedup, and delivery. This centralization is critical: without it, different teams implement conflicting rate limiting logic and users get bombarded.
- Relevance scoring: Before dispatching, LinkedIn runs each notification through a relevance model that predicts whether the user will engage with it. Low-relevance notifications are silently dropped or downgraded to in-app only. This is a notification-specific ranking model similar to the feed ranking model — it considers the user’s historical engagement with this notification type, time of day, and current notification load.
- Volume capping: LinkedIn enforces strict per-user per-day limits across all notification types. Even if 50 different product features want to notify a user, the total push notification count per day is capped. This forces product teams to compete for notification “budget” and naturally selects for the most relevant notifications.
- Email digests: Rather than sending individual emails for every event, LinkedIn batches low-priority notifications into periodic digest emails (daily or weekly, based on user preference). The digest engine aggregates events, deduplicates, and selects the most relevant items for the digest. This reduces email volume by 10x while maintaining engagement.
- Delivery feedback loop: LinkedIn tracks open rates, click-through rates, and unsubscribe rates per notification type per channel. If a notification type’s engagement drops below a threshold, it is automatically flagged for review. If a user stops engaging with push notifications for 30 days, LinkedIn reduces push volume to that user. This closed-loop system prevents notification fatigue at scale.
WhatsApp Business Notifications
WhatsApp’s notification system for businesses (appointment reminders, shipping updates, etc.) offers lessons in constraint-driven design:- Template-only messaging: Businesses can only send notifications using pre-approved message templates. This eliminates spam by design and ensures every notification the user receives is structured, relevant, and expected. Templates must be approved by WhatsApp before they can be used.
- Rate limiting by business quality: Each business account has a quality score based on user feedback (blocks, reports). High-quality businesses can send more notifications; low-quality businesses get throttled. This feedback-driven rate limiting is more sophisticated than simple request counting.
- Delivery windows: WhatsApp enforces a 24-hour conversation window. A business can only send template messages outside of a conversation window. Within a conversation (user has responded in the last 24 hours), the business can send free-form messages. This window-based approach balances business needs with user experience.
What Happens When Things Fail?
What Happens When Things Fail?
- Critical notifications (2FA) must still reach the user. Fall back to SMS for critical notifications when push delivery fails (check delivery receipt status from APNs/FCM within 30 seconds; if undelivered, escalate to SMS). For non-critical notifications, queue and retry when the provider recovers.
- Monitor APNs/FCM delivery rates. A sudden drop (from 98% to 50%) indicates a partial outage. Switch to the alternative channel (email) for high-priority notifications during the outage.
- Email providers throttle when you send too fast or your bounce rate spikes. Implement adaptive sending rate: start slow, increase until you hit the provider’s feedback signal (HTTP 429 or bounce rate > 2%). Back off and retry with exponential delay.
- Have a secondary email provider configured and ready. If the primary is throttled for more than 5 minutes, route email traffic to the secondary. This dual-provider setup costs negligibly more but provides critical redundancy.
- Without dedup, duplicate notifications will be sent. This is the worst outcome for SMS (costs money) and bad for user experience on all channels.
- Short-term: fall back to database-level dedup (check a dedup table in PostgreSQL with a unique constraint on
idempotency_key). This is 10-100x slower but prevents duplicates. - For SMS specifically: always check the database, not just Redis, before sending. The cost of a duplicate SMS (and potential regulatory issues) justifies the extra latency.
- The normal processing pipeline cannot handle 100x volume instantly. Implement backpressure: the notification API returns HTTP 202 (accepted, processing asynchronously) and queues all notifications. Workers process at their maximum sustainable rate.
- Prioritize by priority level: critical notifications process first, marketing notifications are automatically deferred during high-volume events.
- Auto-scale workers based on queue depth. Set CloudWatch alarms for queue depth and use ECS Service Auto Scaling or Lambda concurrency to add capacity within minutes.
- Notifications already in the queue were evaluated against old preferences. When the channel worker processes the notification, it does a final preference check just before dispatch. This “double-check” pattern ensures the user’s latest opt-out is respected, even if the notification was queued before the preference change.
- This adds one Redis/DB read per dispatch but prevents the most common user complaint: “I opted out but still received a notification.”
Interviewer Follow-Ups You Should Expect
Interviewer Follow-Ups You Should Expect
template_id + locale. Each template contains the text with variable placeholders (e.g., "Your order {{order_id}} has shipped"). At render time, look up the user’s preferred locale from the user profile, fetch the matching template, and substitute variables. If the user’s locale is not available, fall back to English. For push notifications, render the final text server-side. For emails, you can use a templating engine (Handlebars, Jinja2) with richer formatting. Store templates separately from code so they can be updated by content teams without a deployment.“How do you handle the case where a user has multiple devices (iPhone, iPad, Android phone) and should receive a push notification on all of them?”Maintain a device registry that maps user_id to a list of device_tokens with metadata (platform, last_active timestamp, app version). When sending a push notification, fan out to all active devices for that user. “Active” means the device token has not been invalidated by APNs/FCM and the user has opened the app on that device in the last 90 days. Prune stale device tokens by checking APNs/FCM feedback — they report invalid tokens on failed deliveries. The fan-out is typically small (1-3 devices per user) so this does not create a scaling challenge.“What if the notification system itself needs to send notifications about its own health? How do you avoid a circular dependency?”This is the “who watches the watchmen” problem. Notification system health alerts must use an independent alerting path that does not depend on the notification system itself. Options: (1) PagerDuty or Opsgenie integration directly from CloudWatch alarms (bypasses your system entirely), (2) a dedicated Slack webhook from your monitoring service, or (3) a minimal, independent SMS service that only sends operational alerts. The key principle: your alerting infrastructure must have zero dependency on the system it monitors.The Senior-Level Signal
The Senior-Level Signal
AWS Service Recommendations
AWS Service Recommendations
| Component | AWS Service | Why This Service |
|---|---|---|
| Notification API | API Gateway + Lambda or ECS Fargate | Lambda for bursty traffic with per-invocation billing. Fargate for steady-state traffic with predictable costs |
| Priority Queues | SQS (separate queue per priority) | SQS Standard for normal/low priority (nearly unlimited throughput). SQS FIFO for critical priority (strict ordering, exactly-once processing) |
| Push Notifications | Amazon SNS Mobile Push | Direct integration with APNs (iOS) and FCM (Android). Handles device token management, multi-platform delivery, and automatic retries. Scales to billions of pushes per day |
| Amazon SES | Extremely cost-effective ($0.10 per 1,000 emails). Dedicated IPs for reputation management. Built-in bounce and complaint handling. Suppression list management | |
| SMS | Amazon SNS SMS or Amazon Pinpoint | SNS SMS for transactional messages (2FA codes, alerts). Pinpoint for marketing messages with campaign management, segmentation, and analytics |
| In-App Notifications | DynamoDB + API Gateway WebSocket | DynamoDB table partitioned by user_id with notification_id as sort key. WebSocket API for real-time delivery to connected users |
| User Preferences | DynamoDB + ElastiCache (DAX or Redis) | DynamoDB for durable preference storage. DAX for microsecond reads on the hot path. Redis if you need complex data structures beyond key-value |
| Deduplication | ElastiCache for Redis | Redis SET with NX (not exists) and TTL for idempotency key dedup. Redis sorted sets for content-hash dedup with expiration |
| Notification Aggregation | Lambda + DynamoDB Streams + Step Functions | DynamoDB Streams capture notification events. Step Functions Wait state implements the aggregation timer (5-minute window). Lambda function builds the aggregated notification |
| Template Storage | DynamoDB or S3 | DynamoDB for templates that are read on every dispatch (sub-ms latency). S3 for rich email templates (HTML) fetched and cached by workers |
| Delivery Tracking | Kinesis Data Firehose to S3 + Athena | Stream delivery events (sent, delivered, opened, clicked) to S3 in Parquet format. Athena for analytics queries (open rates by notification type, delivery latency percentiles) |
| Monitoring | CloudWatch + SNS Alarms to PagerDuty | Custom metrics for delivery rate, queue depth per priority, dedup hit rate, preference cache hit rate. Alarms with different severity thresholds per metric |
Constraint Injection -- How Interviewers Mutate This Problem
Constraint Injection -- How Interviewers Mutate This Problem
- The user preferences model expands from simple opt-in/opt-out to a regulatory rules engine. Each country has a profile: Canada requires explicit opt-in for marketing SMS (CASL), Germany has strict quiet hours (no marketing before 8 AM or after 9 PM local time, GDPR), India requires a DND (Do Not Disturb) registry check before sending SMS. The channel selection algorithm from the original design becomes: apply country-specific rules AFTER user preferences, not before. This means you need the user’s country of residence (not just their current location) and a regulatory rules database that is updated by your legal team. The notification API should reject requests that violate country-specific rules, returning a clear error to the caller.
- Webhooks add a new “channel” with unique failure characteristics: the destination URL may be slow, unreliable, or completely down. You need per-webhook retry policies with exponential backoff, a circuit breaker per webhook URL (if it fails 5 times in a row, stop trying and alert the customer), and a webhook delivery log that customers can inspect. Webhook payloads must be signed (HMAC-SHA256) so the receiver can verify authenticity. The webhook delivery infrastructure is essentially a separate system from the notification system — it looks more like a task scheduler with per-destination queues.
- SMS cost optimization: (1) Replace SMS with push notifications wherever possible — push is free, SMS costs 0.05 per message. For any notification where the user has push enabled, never send SMS. (2) Batch low-priority SMS into daily digests. Instead of 5 separate SMS for 5 order updates, send one SMS at end of day: “You had 5 order updates today. Check the app for details.” (3) Use short codes and toll-free numbers (cheaper per-message than long codes). (4) Negotiate volume discounts with your SMS provider — at $1.2M/year, you have leverage. (5) Implement SMS verification before sending: if a phone number has bounced 3 times, remove it from the SMS channel. Each of these is a real optimization used by companies like Uber and DoorDash.
- This adds an inference step to the notification pipeline: before template rendering, the system calls an LLM (or a fine-tuned small model) to generate personalized content. Latency implications: an LLM call adds 200ms-2s per notification. For critical notifications (2FA), skip personalization. For marketing notifications, batch the LLM calls — generate personalized content for a cohort of users in bulk, then dispatch. The LLM call must be in the non-blocking path (after queuing, before dispatch), not in the API request path. Content moderation of LLM-generated text is non-negotiable — every generated message must pass through a toxicity filter before sending. This is how LinkedIn and Spotify personalize notification copy at scale.
Cross-Chapter Connections
Cross-Chapter Connections
- APIs and HTTP — RESTful API design for the notification endpoint, webhook delivery patterns, and HTTP status codes (202 Accepted for async processing) are covered in APIs and Databases.
- Messaging and queues — Priority queues, dead-letter queues, at-least-once delivery guarantees, and the fan-out dispatch pattern connect to Messaging, Concurrency, and State.
- Databases — DynamoDB for user preferences and notification history, partition key design for time-series notification data, and the trade-off between Redis and DynamoDB for dedup storage connect to Database Deep Dives.
- Caching — Redis for dedup state, user preference caching with TTL-based invalidation, and template caching connect to Caching and Observability.
- Reliability — Multi-provider failover for SMS/email, graceful degradation when a channel is down, and the “double-check” pattern for preferences connect to Reliability Principles.
- Security — Notification opt-out compliance (CAN-SPAM, GDPR, TCPA), secure delivery of sensitive content (2FA codes), and rate limiting as abuse prevention connect to Auth and Security.
- Real-time systems — In-app notification delivery via WebSocket, presence-aware channel selection (push to online users, email to offline users), and real-time delivery tracking connect to Real-Time Systems.
- Distributed systems theory — Deduplication in distributed systems (idempotency guarantees across retries), exactly-once delivery semantics, and the CAP theorem implications for the preference cache connect to Distributed Systems Theory.
- Cloud service patterns — SNS for push notifications, SES for email, Pinpoint for SMS, SQS for priority queuing, and the cost optimization strategies for high-volume notification delivery are AWS-specific patterns explored in Cloud Service Patterns.
Summary: The Pattern Across All Six Problems
- Ask before you draw — Requirements clarification is not a formality; it changes the design.
- Numbers before architecture — Back-of-envelope estimation reveals whether you need 3 servers or 3,000.
- Start with the right abstraction — High-level design captures components and data flow before implementation details.
- Go deep on what matters — You cannot deep-dive everything in 45 minutes. Pick the 2-3 components that define the system.
- Name the trade-offs explicitly — Every design decision has a cost. Senior engineers articulate both sides.
Think of It This Way
System design interviews are like architectural blueprints — the interviewer wants to see your thinking process, not a finished building. An architect does not walk into a client meeting and immediately start drawing floor plans. They ask: how many people will use this space? What is the budget? Are there zoning constraints? What is the climate? Only after understanding the requirements do they pick up a pencil. Your system design interview should follow the same rhythm: clarify, estimate, sketch, refine. Here is another way to think about it: a great system design answer is like a map with multiple zoom levels. Start at the continent level (high-level architecture), then zoom into the country level (key components), then the city level (critical algorithms and data models). The interviewer should be able to stop you at any zoom level and feel that you have a coherent picture. Candidates who jump straight to city-level details (arguing about Redis data structures before explaining the overall architecture) lose the forest for the trees.Deep Dive Resources
These resources will deepen your understanding beyond what any single guide can cover. They are organized by format so you can pick what fits your learning style.Books
Books
| Resource | Why It Matters |
|---|---|
| System Design Interview by Alex Xu (Vol 1 & 2) | The most interview-focused system design resource available. Each chapter walks through a specific system (rate limiter, chat, news feed, etc.) with the same Clarify-Estimate-Design-Deep Dive framework. Volume 1 covers fundamentals; Volume 2 covers more advanced topics like hotel reservation systems and stock exchanges. |
| Designing Data-Intensive Applications by Martin Kleppmann | The definitive deep reference for understanding the building blocks (replication, partitioning, consistency models, stream processing) that underpin every system design. Not interview-focused, but the mental models it builds are irreplaceable. Read this to understand why systems work the way they do. |
| Web Scalability for Startup Engineers by Artur Ejsmont | A practical, end-to-end guide to scaling web applications. Covers caching, async processing, search, and more. Especially useful if you want to understand the full stack rather than individual components. |
Online Courses and Guides
Online Courses and Guides
| Resource | Why It Matters |
|---|---|
| System Design Primer (github.com/donnemartin/system-design-primer) | A free, comprehensive GitHub repository that covers system design concepts, trade-offs, and practice problems. Think of it as a structured self-study curriculum. The diagrams and summaries are excellent for quick review before interviews. |
| Grokking the System Design Interview (educative.io) | An interactive course that walks through 20+ system design problems step by step. The format (text-based with diagrams, not video) makes it easy to study at your own pace. Widely regarded as the most popular paid resource for system design interview prep. |
| ByteByteGo by Alex Xu (blog.bytebytego.com / YouTube) | Alex Xu’s newsletter and YouTube channel provide visual, concise explanations of system design concepts. The animated diagrams are particularly useful for building intuition about data flow, fan-out patterns, and distributed system behavior. The YouTube channel is free; the newsletter has free and paid tiers. |
Engineering Blogs and Articles
Engineering Blogs and Articles
| Resource | Why It Matters |
|---|---|
| High Scalability (highscalability.com) | One of the oldest and most comprehensive collections of real-world architecture case studies. Their “This is how X works” posts dissect the architectures of companies like Netflix, Twitter, WhatsApp, and Uber. Reading 5-10 of these gives you a library of real patterns to reference in interviews. |
| InfoQ Architecture Articles (infoq.com/architecture-design) | In-depth articles and conference talks on distributed systems, microservices, and architecture patterns. The content skews more toward practicing architects than interview prep, which makes it excellent for building genuine expertise. |
| Company Engineering Blogs | The best primary sources for how real systems work. Prioritize: Netflix Tech Blog (resilience, streaming), Uber Engineering (real-time systems, Cadence/Temporal), Discord Blog (scaling real-time chat), Meta Engineering (feed systems, TAO, Memcache), Cloudflare Blog (edge computing, rate limiting, DNS). These are the sources that Alex Xu and other authors reference. |
Practice Platforms
Practice Platforms
| Resource | Why It Matters |
|---|---|
| Exponent (tryexponent.com) | Mock interview platform with system design practice, peer feedback, and expert-led sessions. Useful for practicing the verbal communication aspect that reading alone cannot build. |
| Pramp (pramp.com) | Free peer-to-peer mock interviews. You practice with another engineer and take turns as interviewer and candidate. The best way to practice thinking out loud under time pressure. |
| LeetCode System Design (leetcode.com/discuss/interview-question/system-design) | Community-contributed system design discussions. Quality varies, but the top posts have thousands of upvotes for a reason. Good for seeing how other candidates approach the same problems. |
Common Mistakes in System Design Interviews
Even well-prepared candidates fall into these traps. Knowing them in advance gives you a significant edge.Mistake 1: Jumping to the Solution Without Clarifying Requirements
Mistake 1: Jumping to the Solution Without Clarifying Requirements
Mistake 2: Over-Engineering for Scale You Don't Have
Mistake 2: Over-Engineering for Scale You Don't Have
Mistake 3: Ignoring Failure Modes and Edge Cases
Mistake 3: Ignoring Failure Modes and Edge Cases
Mistake 4: Not Calculating Back-of-Envelope Estimates
Mistake 4: Not Calculating Back-of-Envelope Estimates
Mistake 5: Designing in Silence Instead of Thinking Out Loud
Mistake 5: Designing in Silence Instead of Thinking Out Loud
Mistake 6: Focusing on One Component Too Deeply While Ignoring the Big Picture
Mistake 6: Focusing on One Component Too Deeply While Ignoring the Big Picture
How to Practice System Design
Reading about system design is necessary but not sufficient. You would not prepare for a piano recital by only reading sheet music. System design interviews are a performance — you must practice performing.The Practice Framework
Week 1-2: Learn the Framework
Week 3-4: Solo Practice
Week 5-6: Peer Practice
Ongoing: Requirement Mutations
Tips That Make a Real Difference
Talk in “phases,” not “steps.” Saying “Let me start with the clarification phase” signals that you have a structured approach. Interviewers hear this and relax — they know you will not waste their time. Draw before you talk. A diagram with 5 boxes and 4 arrows, drawn in 30 seconds, communicates more than 3 minutes of verbal description. Label every box. Label every arrow with data flow direction. Use the diagram as your anchor for the rest of the conversation. Narrate your trade-offs in real-time. Do not save trade-offs for the end. When you pick NoSQL over SQL, say why at that moment: “I am choosing DynamoDB here because the access pattern is pure key-value lookup at 100K+ reads per second, and I do not need joins or complex queries. If the access pattern changes, I would revisit this.” This is the behavior interviewers reward most. Ask the interviewer where to go deep. After your high-level design (15-20 minutes in), say: “I can deep dive on the caching strategy, the data partitioning, or the failure handling. Which would be most useful to discuss?” This shows time management, respect for the interviewer’s priorities, and confidence that you can go deep on any component. Practice the numbers until they are reflexive. Memorize these approximations:- 1 day = ~86,400 seconds (round to 100K for quick math)
- 1 million requests/day = ~12 requests/second
- 1 billion requests/day = ~12,000 requests/second
- 1 KB x 1 billion = 1 TB
- Redis handles ~100K operations/second per instance
- A single PostgreSQL instance handles ~10K transactions/second
- A typical server handles ~10K concurrent WebSocket connections
Interview Deep-Dive Questions
These questions simulate the kind of probing, multi-layered interrogation you will face from experienced system design interviewers. Each question starts at a specific point and then drills deeper through follow-ups — exactly how a real 45-minute interview unfolds. Use them for self-study or peer practice.Q1: Why would you choose fan-out on write over fan-out on read, and when does that decision break down?
Q1: Why would you choose fan-out on write over fan-out on read, and when does that decision break down?
Question
You are designing a social feed system. Walk me through the trade-off between fan-out on write (push model) and fan-out on read (pull model). When does one clearly beat the other, and what forces you into a hybrid?Difficulty: Senior | Concepts tested: Write amplification, read latency, hybrid architectures, real-world scaling judgmentStrong Answer
The way I think about this is through the lens of where you want to pay the cost — at write time or at read time.- Fan-out on write means when a user publishes a post, you immediately push that post ID into every follower’s pre-computed feed cache (typically a Redis sorted set). The read path becomes trivially fast — just read from the cache. The cost is write amplification: one post from a user with 1,000 followers generates 1,000 writes. For a system where the vast majority of users have a few hundred followers, this is the correct default because reads outnumber writes by 10-100x, and you want reads to be sub-50ms.
- Fan-out on read means you store posts per-author and at read time you query all followed authors, merge their posts, rank, and return the top N. There is zero write amplification — one post is one write. But the read path is expensive: if I follow 300 users, the system must scatter-gather 300 queries, merge-sort the results, and return. At scale, this is too slow for a good user experience (500ms+ for a feed load).
- The decision breaks down at the celebrity boundary. A user with 10 million followers posting once means 10 million cache writes. During peak events — an election, a Super Bowl halftime tweet — that can mean hundreds of millions of writes in a burst that backs up your fan-out queue by minutes or hours. This is exactly the problem Twitter faced publicly. The solution is a hybrid: fan-out on write for normal users (say, under 10K followers), and fan-out on read for celebrities. At read time, you merge the pre-computed feed with a small number of celebrity post lookups. The threshold should be dynamic based on current system load — during peak traffic, lower the threshold to reduce write pressure.
- Real-world example: Facebook, Twitter/X, and Instagram all converged on this hybrid approach independently. Twitter documented their migration from pure fan-out-on-write to hybrid around 2012-2013 specifically because of the celebrity problem. The fact that three of the largest feed systems arrived at the same architecture is strong evidence that the hybrid is the correct design for this problem class.
Follow-up: How do you decide the exact follower threshold for switching from push to pull?
A static threshold (like 10K) is a reasonable starting point, but in production you want it to be adaptive. The key insight is that the cost of a fan-out depends on two things: the follower count AND the current state of the system.- During normal load, you might fan out for users with up to 50K followers because the queue has spare capacity.
- During a traffic spike (breaking news, sporting event), you lower the threshold to 5K or even 1K to protect the fan-out workers from falling behind.
- The simplest implementation is a single configuration value that the fan-out service reads each batch:
MAX_FANOUT_FOLLOWERS. Monitor Kafka consumer lag — when lag exceeds 5 minutes, automatically lower the threshold. When lag returns to normal, raise it. - A more sophisticated approach uses a cost function:
fanout_cost = follower_count * current_queue_depth_factor. If the cost exceeds a budget, defer to read-time merge. This makes the system self-tuning.
Follow-up: What happens to the user experience during the transition period when you change the threshold?
This is a subtle UX question. When you lower the threshold from 50K to 5K:- Users who follow someone with 20K followers were previously getting that person’s posts via their pre-computed feed cache. Now that person is suddenly treated as a “celebrity” and their posts must be merged at read time.
- If the feed service does not know to look up this author’s posts, they simply disappear from the feed until the next full read-time merge.
- The fix: maintain a per-user “celebrity list” that is updated when the threshold changes. When a user’s feed is fetched, the Feed Service checks both the pre-computed cache AND the celebrity list. The celebrity list is rebuilt periodically or on threshold change.
- There is also a cache warming concern: when you raise the threshold back, posts from those users need to be fanned out again. You do not retroactively fan out old posts — you start from the next post. Users may see a slight discontinuity in their feed timeline. In practice, this is barely noticeable because feed ranking surfaces the most relevant content regardless of delivery mechanism.
Going Deeper: How would you handle the edge case where a normal user suddenly goes viral (10 followers yesterday, 5 million followers today)?
This is the “emergent celebrity” problem. Your system was happily fanning out for this user when they had 10 followers. Now they have 5 million, but your system has not reclassified them yet.- Detection: Run a periodic job (every 5-10 minutes) that checks follower counts against the current threshold. Flag users who have crossed the threshold as “newly promoted” celebrities.
- Transition: Stop fan-out for the newly promoted user immediately. Their future posts use the pull model. Existing posts already in followers’ caches remain there (no need to remove them).
- The dangerous window: If the user posts during the time between crossing the threshold and detection (up to 10 minutes), the fan-out service attempts to push to millions of followers. This is exactly the kind of burst that can back up the queue. Mitigation: add a pre-flight check in the fan-out worker itself — before processing a fan-out event, re-check the author’s current follower count. If it now exceeds the threshold, skip the fan-out and let the read path handle it. This adds one extra lookup per fan-out event but protects against the race condition.
- WhatsApp and Telegram sidestep this entirely by not having a public feed — every message is point-to-point or to a defined group. The viral celebrity problem is unique to open social graphs.
MAX_FANOUT_FOLLOWERS in a shared config store (etcd, Consul, or even Redis). When lag exceeds 5 minutes, lower the threshold by 20%. When lag drops below 1 minute, raise it back toward the default. The math does not have to be sophisticated — a PID-like controller with a dead band works fine, and you cap the range so the threshold cannot collapse to zero or balloon to infinity.- High Scalability — “The Architecture Twitter Uses to Deal with 150M Active Users” (highscalability.com) — the canonical public writeup of Twitter’s hybrid timeline.
- Facebook Engineering — “Scaling the Facebook Feed” (engineering.fb.com) — equivalent hybrid architecture at Facebook’s scale.
- Martin Kleppmann — “Designing Data-Intensive Applications” Chapter 1 — the “Twitter example” section that made this trade-off a standard interview question.
Q2: You said you would use Redis for caching. Walk me through exactly what happens when Redis goes down in your URL shortener.
Q2: You said you would use Redis for caching. Walk me through exactly what happens when Redis goes down in your URL shortener.
Question
In your URL shortener design, you have Redis as a cache-aside layer in front of the database. The Redis instance just died completely. Walk me through what happens to the system — second by second, from the user’s perspective and from the infrastructure’s perspective.Difficulty: Senior | Concepts tested: Graceful degradation, thundering herd, cache recovery strategies, production incident responseStrong Answer
Let me walk through this as a timeline, because the failure unfolds in stages.- Second 0 — Redis dies. The application servers’ Redis clients start getting connection refused errors or timeouts. How quickly they detect this depends on the client configuration — a health check interval of 1-2 seconds with a connection timeout of 500ms means detection within 2-3 seconds.
- Seconds 1-3 — Cache misses begin. Every redirect request that would have been a cache hit now falls through to the database. Latency jumps from ~5ms (cache hit) to ~20-50ms (database read). Users probably do not notice yet because 50ms is still fast for a redirect.
- Seconds 3-10 — Thundering herd risk. Here is where it gets dangerous. If your system processes 350K reads/sec at peak and your cache hit rate was 99%, that means only ~3,500 reads/sec were hitting the database. Now ALL 350K reads/sec hit the database simultaneously. A typical PostgreSQL instance handles ~10K TPS, a DynamoDB table in on-demand mode scales up but not instantly. You are looking at a potential 100x spike to the database.
-
The mitigation that matters: request coalescing (also called collapsed forwarding). If 1,000 requests for the same popular short URL arrive within a 100ms window, only ONE request hits the database. The other 999 wait for that single query to return and share the result. This is not a theoretical optimization — it is the difference between your database surviving and falling over. Libraries like
singleflightin Go or similar patterns in other languages implement this. - Seconds 10-30 — Stabilization. With request coalescing, the database load is manageable but elevated. Latency is higher than normal (50-100ms per redirect instead of 5ms). You are degraded but functional. The monitoring system fires alerts for the Redis outage and elevated database latency.
- Recovery plan: Do NOT just restart Redis and let it fill up organically. A cold cache with 350K reads/sec means the database stays under heavy load for potentially hours while the cache warms. Instead, run a cache warming script that pre-loads the top 20% of URLs by access frequency (this 20% generates 80% of traffic). You can derive this from access logs or a pre-computed popularity ranking. Load these into Redis before routing traffic back to it. Then switch traffic back to the cache-first path. The warm-up should take minutes, not hours.
- From the user’s perspective: Redirects were maybe 30-50ms slower for a few minutes. They almost certainly did not notice. This is the power of graceful degradation — the system got worse, not broken.
Follow-up: What if it is not a total Redis failure but a partial one — say, 1 out of 3 Redis cluster nodes goes down?
This is actually more insidious than a total failure.- With Redis Cluster, each node owns a subset of the hash slots (the keyspace). When one node dies, all keys on that node become inaccessible. The other two nodes continue serving their keys fine.
- This means some short URLs resolve instantly from cache (their keys are on healthy nodes) and others are slow (their keys were on the dead node). The inconsistency is invisible to monitoring that only tracks averages — your average latency might look fine while one-third of users experience degraded performance.
- Detection: track cache hit rate per node or per hash slot range, not just globally. A sudden drop from 99% to 66% is the signature of a single-node failure in a 3-node cluster.
- Redis Cluster has automatic failover (if you have replicas configured): the replica of the failed master gets promoted to master within 10-30 seconds. During the failover window, requests for keys on that shard fail. Your application code should handle this with a fallback to database read — the same cache-miss path, but triggered by a Redis error rather than a cache miss.
- The lesson: always configure at least one replica per master in a Redis Cluster. The cost is 2x the memory, but the availability improvement is enormous.
Follow-up: How would you design the cache warming script? What are the pitfalls?
The cache warming script has several non-obvious challenges.- Data source: You need to know which URLs are “hot.” Options: (1) maintain a real-time access counter in the database or a separate analytics system, (2) parse recent access logs, or (3) use the previous cache contents if you have periodic snapshots (Redis RDB dumps). Option 1 is most reliable. Option 3 is fastest but only works if you have regular backups.
- Pace the warming: If you dump 20 million key-value pairs into Redis as fast as possible, you saturate Redis’s network bandwidth and CPU during the load. Rate-limit the warming to ~50K writes/sec (about half Redis’s capacity) so the node can simultaneously serve live traffic.
- Stale data risk: The warming script reads from the database, but URLs may have been created, modified, or expired since the data was read. For a URL shortener this is low risk (URLs are mostly immutable), but for a system with mutable data, you need to validate freshness. Set a short TTL on warmed entries (say, 5 minutes) so stale entries expire quickly and get refreshed from the database.
- The biggest pitfall: warming the cache while traffic is still hitting the database. You need to coordinate the cutover — warm the cache, then redirect traffic to use it. If traffic starts using the cache while it is only 10% warmed, you get a 90% miss rate, which is barely better than no cache.
singleflight package is the canonical reference implementation.
Use it naturally: “We wrap every cache-miss path in singleflight, so if 10K concurrent requests miss the cache on the same short URL, only 1 goes to the database and 9,999 wait for the result.”
Warning: Request coalescing only helps with concurrent duplicate requests. It does not help with many distinct hot keys arriving simultaneously — for that you need per-key rate limiting or a bigger database.- Facebook Engineering — “Scaling Memcache at Facebook” (research.facebook.com) — the canonical paper on leases, gutter caches, and thundering herd mitigations at scale.
- Redis documentation — “Redis Cluster Specification” (redis.io/docs) — authoritative on replication, failover, and partial-failure semantics.
- Brad Fitzpatrick / LiveJournal — “Memcached tutorial” historical writeups on the cache-aside pattern that became industry standard.
Q3: Explain the difference between at-most-once, at-least-once, and exactly-once delivery. Why is exactly-once so hard?
Q3: Explain the difference between at-most-once, at-least-once, and exactly-once delivery. Why is exactly-once so hard?
Question
In your task scheduler design, you chose at-least-once delivery with idempotent tasks. Explain the three delivery semantics, why exactly-once is effectively impossible in a distributed system, and how you achieve the practical equivalent.Difficulty: Senior to Staff | Concepts tested: Distributed systems fundamentals, Two Generals Problem, idempotency, production trade-offsStrong Answer
The way I think about this is through the lens of what happens when things go wrong — specifically, when the acknowledgment message is lost.- At-most-once: Fire and forget. Send the message, do not retry. If the message is lost, it is gone. Simple and fast, but lossy. Use case: logging, metrics collection, non-critical analytics events. You accept that some data is lost.
- At-least-once: Send the message, wait for an acknowledgment. If you do not receive the ack within a timeout, retransmit. The problem: the receiver may have processed the message and sent the ack, but the ack was lost. You retransmit, and the receiver processes it again. You get duplicates. Use case: anything where losing a message is unacceptable (payment processing, task execution, order fulfillment). You accept potential duplicates and design for them.
- Exactly-once: Each message is processed exactly one time. No duplicates, no losses. Sounds ideal, but here is why it is effectively impossible in an unreliable network.
- The practical equivalent: At-least-once delivery plus idempotent processing. You guarantee the message is delivered at least once (with retries), and you guarantee that processing it multiple times has the same effect as processing it once. The receiver maintains a deduplication table — when it processes a message, it records the message ID. On subsequent deliveries, it checks the ID and skips reprocessing. From the outside, this looks like exactly-once, which is why Kafka calls its consumer semantics “exactly-once” — it is really at-least-once with built-in dedup and transactional writes.
- Real-world example: In the task scheduler, a worker picks up a “charge customer $50” task. It calls the payment API with an idempotency key. The payment succeeds. The worker tries to mark the task as COMPLETED, but the network times out. The zombie detector eventually re-enqueues the task. Another worker picks it up and calls the payment API with the same idempotency key. The payment system recognizes the key and returns the original result without charging again. From the customer’s perspective, they were charged exactly once.
Follow-up: How does Kafka achieve what it calls “exactly-once semantics” if true exactly-once is impossible?
Great question — Kafka’s “exactly-once” is carefully scoped and often misunderstood.- Kafka achieves exactly-once within the Kafka ecosystem by combining three mechanisms: idempotent producers (each message has a sequence number, the broker deduplicates), transactional writes (a producer can write to multiple partitions atomically), and transactional consumers (a consumer reads and writes offsets in the same transaction).
- What this means concretely: a Kafka Streams application that reads from topic A, processes the message, and writes to topic B will do so exactly once — even if the consumer crashes and restarts. The offset commit and the output write happen in the same transaction.
- What this does NOT mean: if your Kafka consumer calls an external system (sends an email, writes to a database, calls an API), Kafka has no way to make that external call idempotent. If the consumer crashes after calling the external system but before committing the offset, the message will be reprocessed and the external system will be called again. Exactly-once only applies to the Kafka-to-Kafka path.
- This is why the task scheduler design combines at-least-once delivery with application-level idempotency keys — because the “task” typically involves calling external systems where Kafka’s transactional guarantees do not extend.
Follow-up: What are the performance implications of idempotency? It sounds like you need a lookup on every operation.
Yes, and this is a real cost that you should not hand-wave.- Every idempotent operation requires a read-before-write: check the dedup table, then execute if the key is new. This doubles the number of I/O operations on the critical path.
- The dedup table grows over time: if you process 56 million tasks per day and store idempotency keys for 7 days, you have ~400 million entries in the dedup table. At ~50 bytes per entry, that is ~20 GB — easily fits in Redis or a dedicated DynamoDB table.
- TTL is essential: idempotency keys should expire after a reasonable window (1 hour to 7 days depending on the use case). Without TTL, the dedup table grows unbounded. The TTL must be longer than the maximum retry window — if your retry policy allows retries for up to 1 hour, the idempotency key must live for at least 1 hour.
- In practice, the overhead is minimal: the dedup check is a single key lookup in Redis or DynamoDB (~1ms). Compared to the task execution itself (seconds to minutes), 1ms of overhead is negligible. The correctness guarantee is worth far more than the performance cost.
- Where it gets expensive: high-throughput, low-latency systems where every microsecond matters (trading systems, game servers). In those cases, you might accept at-most-once for non-critical operations and reserve idempotent at-least-once for operations that must not be duplicated (trades, inventory decrements).
Q4: How do you handle hot partitions in a distributed database, and how would you detect one before it causes an outage?
Q4: How do you handle hot partitions in a distributed database, and how would you detect one before it causes an outage?
Question
In several of the system designs above — chat messages partitioned by conversation_id, feeds partitioned by user_id, tasks partitioned by task type — there is a risk of hot partitions. Explain what a hot partition is, how you would detect one, and what strategies you would use to mitigate it.Difficulty: Senior | Concepts tested: Database internals, partition key design, monitoring, data modelingStrong Answer
A hot partition is when one partition in a distributed database receives a disproportionate share of the traffic — reads, writes, or both — compared to other partitions. In a system like DynamoDB or Cassandra, each partition key maps to a physical partition on a specific set of nodes. If one partition key is orders of magnitude more popular than others, the nodes hosting it become a bottleneck while other nodes sit idle.- Concrete example: In the chat system, conversation_id is the partition key. Most conversations have a few hundred messages per day. But a group chat for a 500-person company channel might receive 50,000 messages per day — all hitting the same partition. That single partition is handling 100x the load of an average partition.
- Another example: In DynamoDB, each partition supports ~3,000 Read Capacity Units and ~1,000 Write Capacity Units per second. If a celebrity’s user feed receives 10,000 reads/sec, a single partition cannot handle it, regardless of how much total capacity you have provisioned.
- DynamoDB: Use CloudWatch Contributor Insights, which identifies the most-accessed partition keys. It shows the top N partition keys by read/write volume over a time window. If one key accounts for more than 10% of total traffic, that is a hot partition.
- Cassandra: Monitor per-partition read/write latencies using
nodetool cfstatsor through your monitoring system (Prometheus with Cassandra exporter). Look for P99 latency spikes that are localized to specific nodes. Thenodetool toppartitionscommand samples and reports the most accessed partitions. - General approach: Instrument your application to log partition key access patterns. Aggregate in your analytics pipeline and set alerts for skew beyond a threshold (e.g., any single key exceeding 5x the average).
- Write sharding (adding a suffix): Append a random suffix to the partition key. Instead of
conversation_id = "chat_ABC", useconversation_id = "chat_ABC#3"(where 3 is a random number from 0-9). This spreads writes across 10 partitions. The cost: reads must scatter-gather across all 10 sub-partitions and merge results. This is the standard DynamoDB pattern for handling hot keys. - Time-based sub-partitioning: For the chat system specifically, use
(conversation_id, year_month)as a composite partition key. Current messages go to the current month’s partition; historical queries read from older partitions. This naturally distributes the growing dataset and keeps the “hot” partition (current month) manageable. - Caching the hot partition: Put a cache (Redis, DAX) in front of the hot partition. If a celebrity’s profile is accessed 10K times per second, cache it in Redis and let 99% of reads hit the cache. Only 1% (100 reads/sec) reach the database partition.
- Application-level routing: If you know certain entities will always be hot (celebrity accounts, viral posts), route them to a dedicated, scaled-up data store separate from the main database. This is effectively application-level sharding by hotness.
Follow-up: In DynamoDB specifically, does adaptive capacity solve the hot partition problem, or do you still need to worry about it?
DynamoDB’s adaptive capacity helps but does not eliminate the problem.- Adaptive capacity allows DynamoDB to reallocate unused throughput from cold partitions to hot ones. If you have 10,000 RCUs provisioned across 10 partitions and one partition needs 5,000 RCUs while the others need 500 each, adaptive capacity can handle that redistribution.
- However, there are hard limits. A single partition cannot exceed its per-partition throughput limits (~3,000 RCU, ~1,000 WCU), regardless of adaptive capacity. If your hot partition genuinely needs 10,000 RCUs, adaptive capacity cannot help — you need write sharding or a cache.
- DynamoDB’s on-demand mode is more generous with burst handling (up to 2x your previous peak), but the per-partition limits still apply.
- The practical implication: for most workloads, adaptive capacity means you do not need to worry about moderate hotspots. But for extreme hotspots (celebrity accounts, viral content, flash sales), you still need explicit mitigation strategies.
Follow-up: How would you design the partition key for a notification system where some users receive 100x more notifications than others?
The challenge here is asymmetric write volume — a power user or a highly active account receives far more notifications than a typical user.- Option 1: User ID as partition key with write sharding: Use
user_id#shardas the partition key, where shard is computed ashash(notification_id) % N. For high-volume users, set N=10; for normal users, N=1 (no sharding). The application knows each user’s shard count from a user metadata table. - Option 2: Time-based partitioning: Use
user_id#dateas the partition key. Each day gets a new partition. This naturally bounds partition size and keeps recent notifications (the ones users actually read) on a small number of partitions. - Option 3: Separate hot and cold storage: Keep the last 24 hours of notifications in a Redis sorted set per user (fast reads, no partition limits). Older notifications live in DynamoDB with time-based partitioning. Most notification reads are for recent items, so the Redis layer handles 95%+ of reads.
- In my experience, Option 2 is the best balance of simplicity and effectiveness. It works without requiring the application to track per-user shard counts, and it naturally gives you a retention/archival boundary (delete partitions older than 90 days).
feed#celebrity_id#N where N is 0-9, spreading one hot partition into ten. Reads become a 10-way scatter-gather, but writes hit any of 10 partitions uniformly.”
Warning: Write sharding trades write hotness for read complexity. If your workload is read-heavy (social feed), this can make read latency worse. Only apply write sharding when the write hotspot is the actual bottleneck.user_id. A specific celebrity is hot. Do you reshape the whole table, or just that user?
A: Just that user. Reshaping the whole table is a multi-week migration that penalizes every other user for one celebrity’s traffic. The pattern that works is application-level routing: maintain a small “hot users” list (easy to keep in Redis), and for users on that list the application writes to user_id#shard_N while for normal users it writes to user_id. Reads check the list first. This is a day of work, not a month, and it localizes the complexity to just the celebrities. You pay the scatter-gather cost only for those users, not the entire user base.- AWS — “Best practices for designing and using partition keys effectively” (docs.aws.amazon.com/amazondynamodb) — the official DynamoDB guide to hot-partition patterns.
- Cassandra documentation — “Data modeling” chapter (cassandra.apache.org/doc) — authoritative on partition key design and
nodetoolobservability. - Instagram Engineering — “Sharding & IDs at Instagram” (instagram-engineering.com) — public writeup of how they solved the celebrity problem at their scale.
Q5: Your rate limiter uses Redis Lua scripts for atomicity. What happens under extreme contention, and what are the alternatives?
Q5: Your rate limiter uses Redis Lua scripts for atomicity. What happens under extreme contention, and what are the alternatives?
Question
You proposed using Redis Lua scripts to atomically check and increment rate limit counters. Walk me through how this works under extreme contention — thousands of requests per second for the same rate limit key. What are the failure modes, and when would you consider alternatives to Redis?Difficulty: Senior | Concepts tested: Redis internals, concurrency models, distributed rate limiting trade-offs, alternatives evaluationStrong Answer
First, let me explain why Lua scripts work for this. Redis is single-threaded for command execution. A Lua script executes atomically — no other Redis command can interleave during the script. This means our rate limit check (read counter, increment, check against limit, set TTL if needed) happens as one indivisible operation. There are no race conditions at the Redis level.- Under extreme contention (say, 100K requests/sec all hitting the same rate limit key for a popular API endpoint), the Lua script still works correctly because of Redis’s single-threaded model. Each script execution takes ~0.05-0.1ms. At 100K ops/sec, the total time spent executing Lua scripts for this key is ~5-10 seconds per second of wall-clock time — which is impossible. In practice, Redis pipelines and batches these, and the throughput limit is Redis’s overall ops/sec capacity (~100K-200K ops/sec for simple commands on a single instance), not per-key contention.
- The real bottleneck is network round-trips, not Lua execution. Each rate limit check requires one network round-trip from the API server to Redis (~0.5ms in the same AZ). At 100K requests/sec, that is 100K round-trips. If you have 10 API servers, each sends 10K requests/sec to Redis, which is well within Redis’s capacity.
-
When Redis becomes the bottleneck: If you need more than ~200K rate limit checks per second, a single Redis instance is insufficient. Solutions: Redis Cluster (shard by rate limit key) or multiple Redis instances with client-side routing. Since rate limit keys are typically
user_id:endpoint, the key distribution is naturally uniform across shards.
- Long-running Lua script blocks everything: Redis blocks all other commands while a Lua script runs. If your script has a bug (infinite loop) or does expensive operations (large ZRANGEBYSCORE), it blocks the entire Redis instance. Mitigation: keep Lua scripts minimal (ours is 6 lines), and set the
lua-time-limitconfig (default 5 seconds) to kill runaway scripts. - Memory growth: If you use sliding window log (storing every timestamp in a sorted set), a hot key can accumulate millions of entries. The Lua script that cleans up old entries (
ZREMRANGEBYSCORE) runs in O(log N + M) where M is the number of removed entries. For very hot keys, this cleanup can be expensive. This is another reason to prefer the sliding window counter algorithm — it uses only two counters per window, not a growing sorted set.
- Local in-memory rate limiting: Each API server maintains its own token bucket per user. No network round-trip, sub-microsecond latency. The trade-off: limits are enforced per-server, not globally. A user hitting 10 different servers gets 10x the effective limit. Use this as a secondary defense layer alongside Redis, or as a fallback when Redis is down.
- Envoy/Istio rate limiting in the service mesh: The sidecar proxy handles rate limiting with its own shared state (typically backed by Redis). The advantage is that rate limiting is decoupled from application code.
- Cell-based rate limiting (Cloudflare’s approach): Each edge node limits independently and periodically synchronizes counts to a coordination layer. Gives approximate global enforcement with zero cross-node latency for the check itself.
Follow-up: How would you implement a sliding window counter in Redis using Lua, and why does it beat the sorted set approach at scale?
The sliding window counter is a brilliant optimization. Here is how it works in Redis.- Maintain two keys per rate limit:
counter:{key}:{current_window}andcounter:{key}:{previous_window}. Each is a simple integer counter with a TTL equal to 2x the window size. - When a request arrives, the Lua script calculates the weighted count:
weighted = previous_count * (1 - elapsed_fraction_of_current_window) + current_count. If weighted is under the limit, increment the current window counter and allow. - Total Redis memory per rate limit: 2 keys with integer values — roughly 100 bytes. Compare that to the sorted set approach where you store a timestamp per request: 100 requests/minute per user means 100 entries of ~32 bytes each = 3,200 bytes. At 10 million users, that is 32 GB for sorted sets vs 1 GB for sliding window counters. A 32x memory reduction.
- The accuracy trade-off: the sliding window counter is an approximation. It assumes requests in the previous window were uniformly distributed, which may not be true if traffic is bursty within a window. In practice, the error is less than 1%, which is acceptable for all but the most precise rate limiting needs (financial transaction limits).
Follow-up: If Redis is unavailable and you fail-open, how do you prevent abuse during the outage window?
Failing open means accepting that for a brief period, rate limits are not enforced. Here is how you minimize the blast radius.- Local rate limiters as secondary defense: Each API server runs a local in-memory token bucket. Set the local limit to 10x the normal per-user limit (generous, to avoid false positives from traffic distribution unevenness). This catches obvious abuse (thousands of requests/sec from one user) even without Redis.
- Monitor and respond: Alert immediately on Redis failure. The average Redis recovery (with Sentinel or Cluster auto-failover) is 15-30 seconds. During this window, the local rate limiter is your only defense.
- Post-incident analysis: After Redis recovers, run an anomaly detection query on access logs during the outage window. Identify users who significantly exceeded their normal rate. Take retroactive action (warn, temporarily restrict) if abuse is detected.
- Circuit breaker with delayed close: After Redis comes back, do not immediately trust it. Use a circuit breaker that transitions from open (fail-open) to half-open (test Redis with a canary key every second) to closed (normal operation) over 30-60 seconds. This prevents thrashing if Redis is flapping.
Q6: Design the WebSocket connection management layer for a chat system handling 5 million concurrent connections.
Q6: Design the WebSocket connection management layer for a chat system handling 5 million concurrent connections.
Question
You need to maintain 5 million persistent WebSocket connections for a chat system. Walk me through how you would architect the connection management layer — from how connections are assigned to servers, to how you route a message from User A on Server 1 to User B on Server 3.Difficulty: Staff-Level | Concepts tested: WebSocket scaling, connection routing, pub/sub systems, stateful service challengesStrong Answer
Managing 5 million persistent connections is fundamentally different from handling 5 million stateless HTTP requests. WebSocket connections are stateful — the server holding a user’s connection must be the one that pushes messages to that user. This introduces a routing problem that does not exist in stateless architectures.Connection assignment and server capacity:- A typical server handles ~10K-50K concurrent WebSocket connections depending on the language runtime and how much processing happens per message. At 10K per server, you need 500 servers. At 50K per server (Erlang, Go, or optimized Node.js), you need 100 servers.
- Use a Network Load Balancer (Layer 4), not an Application Load Balancer (Layer 7), for WebSocket connections. NLB passes through TCP connections directly, adding minimal overhead. ALB terminates and re-establishes connections, adding latency and memory overhead.
- Connection assignment is typically random (NLB round-robin or least-connections). There is no need for sticky sessions at the LB level because the routing table (below) handles message delivery.
- When a user connects, the Chat Server registers the mapping
user_id -> server_idin a shared store (Redis hash map). When the user disconnects, the mapping is removed. - This routing table is the single most critical piece of infrastructure. Every message delivery starts with a lookup: “Which server holds User B’s connection?” The lookup must be sub-millisecond.
- Redis is ideal here:
HSET connections user:B "chat-server-42"gives O(1) lookups. With 5 million entries at ~50 bytes each, the table is ~250 MB — trivially fits in memory.
- User A sends a message via WebSocket to Chat Server 1.
- Chat Server 1 persists the message to the Message Store (Cassandra).
- Chat Server 1 looks up User B’s server:
HGET connections user:Breturnschat-server-3. - Chat Server 1 publishes the message to a Redis Pub/Sub channel that Chat Server 3 subscribes to. Alternatively, Chat Server 1 makes a direct gRPC call to Chat Server 3.
- Chat Server 3 receives the message and pushes it to User B via the WebSocket connection.
- Redis Pub/Sub: Each Chat Server subscribes to a channel named after itself (e.g.,
channel:chat-server-3). Senders publish to the target server’s channel. Pros: decoupled, simple. Cons: Redis Pub/Sub is fire-and-forget (no persistence), so if a server misses a message (brief disconnection from Redis), the message is lost. Also, every Chat Server subscribes to its own channel, so Redis maintains 500 subscriptions — lightweight. - Direct gRPC: Chat Servers call each other directly via gRPC using the routing table. Pros: point-to-point, no broker, lower latency. Cons: creates O(N^2) potential connections between servers, and requires service discovery.
- My recommendation: Redis Pub/Sub for simplicity. The fire-and-forget limitation is mitigated because messages are already persisted to Cassandra before routing. If a message is lost in Pub/Sub, the recipient client’s reconnection logic catches up from the database.
Follow-up: What happens when a Chat Server crashes and 10K users need to reconnect?
This is the thundering herd problem applied to WebSocket reconnection.- 10K clients detect the disconnection (TCP reset or heartbeat timeout) and trigger their reconnection logic. If they all reconnect immediately, the load balancer distributes 10K new connection requests across remaining servers in a burst.
- Client-side mitigation: exponential backoff with jitter. Each client waits
random(0, min(2^attempt * 100ms, 30s))before reconnecting. This spreads the 10K reconnections over 10-30 seconds instead of hitting all at once. - Server-side: When a client reconnects (to a different server now), it sends its last received message ID. The new server fetches all messages since that ID from Cassandra and pushes them. This “catch-up” ensures zero message loss.
- Routing table cleanup: The dead server’s entries in the routing table become stale. The stale entries will cause message delivery to fail (publishing to a Pub/Sub channel nobody subscribes to). Two cleanup strategies: (1) set a TTL on routing entries and refresh with heartbeats, or (2) have each Chat Server periodically validate its own entries on startup and remove stale ones from the previous instance.
- Capacity headroom: This is why you should not run Chat Servers at 100% connection capacity. If each server handles 10K connections and you have 500 servers, you are at 100% capacity — a single server failure means 10K users cannot reconnect (all other servers are full). Run at 80% capacity (servers handle 8K each), leaving 20% headroom for redistributing connections during failures.
Follow-up: How would you handle the routing table during a Redis outage?
Without the routing table, you cannot route messages to the correct server. This is a critical dependency.- Short-term (seconds): Chat Servers cache a local copy of recent routing lookups. If Server 1 recently looked up that User B is on Server 3, it uses the cached result. This works for ongoing conversations but fails for new conversations.
- Fallback broadcast: If the routing table is unavailable, fall back to broadcasting the message to ALL Chat Servers. Each server checks if it holds the recipient’s connection and delivers if yes, ignores if no. This is O(N) where N is the number of servers, which is expensive but functional. At 500 servers, you are sending 500 messages instead of 1 for each chat message — a 500x overhead. Acceptable for a brief outage, catastrophic if prolonged.
- Redis HA: This is why the routing table must be in a highly available Redis setup (Sentinel with automatic failover, or Redis Cluster with replicas). Target recovery time: under 15 seconds. The broadcast fallback covers the gap.
- Alternative: Use a consistent hashing ring to deterministically assign users to servers. No routing table needed — you compute
server = hash(user_id) % N. The problem: rebalancing when servers are added or removed causes mass disconnection and reconnection. Consistent hashing minimizes this (only ~1/N users are affected), but it is still disruptive. Most production systems prefer the routing table approach for its flexibility.
Q7: How do you design a system to handle scheduled tasks that must execute within 1 second of their target time?
Q7: How do you design a system to handle scheduled tasks that must execute within 1 second of their target time?
Question
Your task scheduler promises to execute delayed tasks within 1 second of their scheduled time. How do you achieve this precision at scale — 10 million tasks per day — without resorting to busy-polling the database every 100 milliseconds?Difficulty: Senior | Concepts tested: Scheduling algorithms, timer wheels, database polling trade-offs, distributed coordinationStrong Answer
The core tension is between precision and efficiency. Polling the database every 100ms gives you sub-second precision but hammers the database with queries. Polling every 10 seconds is efficient but you miss your 1-second SLA. The solution is a two-tier architecture: a database-backed durable store for long-horizon scheduling, and an in-memory timer for short-horizon precision.Tier 1 — Database scan (long horizon):- A scheduler process polls the database every 5-10 seconds:
SELECT * FROM tasks WHERE next_run_at <= NOW() + interval '30 seconds' AND status = 'PENDING'. - This query uses a partial index (
idx_tasks_pending ON tasks (next_run_at) WHERE status IN ('PENDING', 'RETRYING')) so it is fast even with millions of rows. - The scheduler loads these “soon-to-execute” tasks into memory and moves them to Tier 2.
- Tasks within the next 30 seconds are held in a timer wheel (or a priority queue sorted by execution time) in the scheduler’s memory.
- A timer thread checks the wheel every 100ms. When a task’s execution time arrives, it enqueues the task to the work queue (Redis sorted set or SQS).
- The precision is now bounded by the timer resolution (100ms), well within the 1-second SLA.
sleep():- A timer wheel (also called a timing wheel, Hashed and Hierarchical Timing Wheels from the Varghese and Lauck paper) is a ring buffer of time slots. Inserting a timer is O(1), firing expired timers is O(1) per timer. This beats a sorted list (O(log N) insert) when you have thousands of timers expiring per second.
- In practice, for 10M tasks/day, you have ~650 tasks coming due per second at peak. A simple min-heap priority queue is sufficient at this scale. Timer wheels become important at 100K+ timers per second.
- If the scheduler crashes, the in-memory timers are lost. On restart, the scheduler re-scans the database for tasks where
next_run_at <= NOW() + 30 seconds. Tasks that were in-memory but not yet enqueued are re-discovered and processed, possibly a few seconds late (the restart time). - This is acceptable: a scheduler restart that takes 5-10 seconds means a few tasks execute 5-10 seconds late during the restart window. The database is the source of truth; the in-memory tier is an optimization for precision.
Follow-up: How do you handle the case where two scheduler instances both load the same task from the database?
This is the double-scheduling problem, and it is surprisingly subtle.- If you run multiple scheduler instances for HA (active-passive with leader election), only the leader scans the database and loads tasks. The passive instance does nothing until it becomes leader.
- If you run active-active schedulers (for throughput, not just HA), you need to partition the work. Options: (1) each scheduler owns a range of task IDs (hash-based partitioning), (2) each scheduler uses
SELECT FOR UPDATE SKIP LOCKEDto atomically claim tasks without overlapping. - The
SKIP LOCKEDapproach is elegant: when Scheduler A selects tasks, it locks them. Scheduler B’s concurrent query skips the locked rows and picks up different tasks. No coordination protocol needed — PostgreSQL handles it. - Even with
SKIP LOCKED, you need the idempotency guarantee at the worker level. There is a window where the scheduler loads a task, marks it as SCHEDULED, but crashes before enqueuing it. The task stays in SCHEDULED state. A watchdog process (or the standby scheduler) picks up SCHEDULED tasks that have been in that state for longer than expected (say, 30 seconds) and re-enqueues them.
Follow-up: This design keeps 30 seconds of tasks in memory. What if the scheduler needs to handle a burst of 100K tasks all scheduled for the same second?
This is a thundering herd at the scheduling level.- If 100K tasks are all scheduled for exactly 12:00:00, the scheduler’s Tier 1 scan at 11:59:30 loads all 100K into memory. At 12:00:00, the timer fires for all of them.
- The scheduler cannot enqueue 100K tasks to the work queue in zero time. If each enqueue takes 0.05ms (Redis ZADD), that is 5 seconds — some tasks execute 5 seconds late.
- Mitigation 1: Pre-enqueue. When the scheduler loads tasks into memory, immediately enqueue them to the work queue with a “visibility delay” or “delayed execution” setting. The queue holds them until the execution time. This shifts the burst from the timer-fire moment to the database-scan moment, which is less time-sensitive.
- Mitigation 2: Shard the scheduler. If you have 4 scheduler instances, each handling ~25K of the 100K tasks, the per-instance burst is manageable.
- Mitigation 3: Accept imprecision during bursts. If 100K tasks are scheduled for the same second, delivering them over a 5-second window (12:00:00 to 12:00:05) is a reasonable degradation. The 1-second SLA is for normal conditions; burst conditions have a documented relaxed SLA.
- In my experience, the pre-enqueue approach (Mitigation 1) is the most practical because it leverages the queue’s existing scalability rather than requiring you to scale the scheduler itself.
Q8: Compare how you would store notification delivery state versus chat message state. Why are the requirements fundamentally different?
Q8: Compare how you would store notification delivery state versus chat message state. Why are the requirements fundamentally different?
Question
Both the notification system and the chat system need to track whether a message was delivered. But you would design the storage very differently. Explain why.Difficulty: Intermediate to Senior | Concepts tested: Data modeling, access patterns, consistency requirements, storage trade-offsStrong Answer
The way I think about this is through the lens of access patterns and consistency requirements. These two seemingly similar “was it delivered?” questions lead to radically different storage designs.Chat message storage:- Access pattern: Sequential reads within a conversation. “Give me the last 50 messages in this conversation” is the dominant query. Users scroll backward through history.
- Write pattern: Append-only. New messages are added at the end of a conversation’s timeline. Edits and deletes are rare (and modeled as new events, not mutations).
- Consistency requirement: Per-conversation causal ordering. If I reply to your message, my reply must appear after yours. But ordering across conversations does not matter.
- Partition strategy: Partition by
conversation_idwith a clustering key on message timestamp or sequence number. All messages in a conversation are co-located for efficient range reads. - Database choice: Cassandra or ScyllaDB — optimized for append-heavy, time-series-like workloads with range queries within a partition.
- Delivery state: Simple. Each message has a status:
sent(persisted on server),delivered(recipient’s device acknowledged receipt),read(recipient opened the conversation). This state is updated by the recipient’s client and is non-critical — if a delivery receipt is lost, the user sees a single check mark instead of a double check mark. Not ideal but not a correctness issue.
- Access pattern: Random reads by notification ID or user ID. “What is the status of notification X?” or “Show me all notifications for user Y sorted by recency.” Also analytical queries: “What is the delivery rate for notification type Z across all users this week?”
- Write pattern: Status transitions through a state machine:
queued -> sent -> delivered -> opened -> clicked. Each transition is an update, not an append. Multiple channels may each have their own delivery state (push: delivered, email: opened, SMS: sent). - Consistency requirement: Strong consistency for deduplication (you must not send the same notification twice) and for delivery tracking (analytics dashboards must reflect actual delivery counts). Eventual consistency for the user-facing in-app notification list is acceptable.
- Partition strategy: Partition by
user_idfor the user-facing queries, but you also need indexes bynotification_typeandtimestampfor analytics. This is a multi-access-pattern problem. - Database choice: DynamoDB with a GSI (Global Secondary Index) on
notification_type + timestampfor analytics queries. Or a dual-storage approach: DynamoDB for real-time delivery state and user-facing queries, S3 + Athena for analytics. - Delivery state: Complex. You track state per channel per notification. A single notification might be: push=delivered, email=sent, SMS=failed. Each state transition triggers different logic (retry SMS, increment delivery counter, fire a webhook to the calling service).
- Chat messages are immutable after creation; notification state is mutable (transitions through a state machine).
- Chat queries are conversation-scoped (narrow range scan); notification queries span multiple dimensions (by user, by type, by status, by time).
- Chat delivery is a nice-to-have UX signal; notification delivery is a business metric and sometimes a legal requirement (proof of SMS delivery for regulatory compliance).
- Chat scales by conversation count (millions of small partitions); notification scales by per-user volume (some users get 100x more notifications than others, creating hot partitions).
Follow-up: How would you handle the scenario where you need to show both chat messages and notifications in a unified inbox (like LinkedIn)?
This is a data federation problem — merging two different data models into one view.- Do NOT try to store both in the same table. Their access patterns and data models are too different. Merging them compromises the performance of both.
- Build a unified feed service that queries both the chat message store and the notification store in parallel, merges the results by timestamp, and returns a combined view.
- Caching the unified view: Pre-compute the unified inbox for active users in Redis (sorted set with timestamp as score, containing references to either chat messages or notifications). Update the sorted set on new chat messages AND new notifications.
- The tricky part is ordering: Chat messages have per-conversation sequence numbers. Notifications have creation timestamps. To merge them, normalize both to UTC timestamps. For chat messages, the timestamp is when the message was sent. For notifications, it is when the notification was created.
- LinkedIn and Facebook both take this approach — separate backend systems for messages and notifications, unified in a single “inbox” view for the user. The unification happens at the presentation layer, not the storage layer.
Follow-up: For the notification system, how do you track “opened” and “clicked” states across email, push, and in-app without over-engineering it?
Each channel has a fundamentally different mechanism for tracking engagement.- Email: Embed a 1x1 tracking pixel in the email body. When the user’s email client loads the image, your server logs the “opened” event. For “clicked,” use redirect URLs that pass through your tracking service before forwarding to the destination. Caveat: many email clients now block tracking pixels (Apple Mail Privacy Protection), so email open rates are increasingly unreliable. Track clicks as the more reliable engagement signal.
- Push: APNs and FCM provide delivery receipts (the notification reached the device). For “opened” (user tapped the notification), the mobile app sends a callback to your API when the notification is opened. This requires client-side instrumentation.
- In-app: Track “seen” (the notification appeared in the user’s notification list) and “clicked” (the user tapped it). Both are client-side events sent to your API. These are the most reliable engagement signals because you control the client.
- Storage: Write all engagement events to a stream (Kinesis, Kafka). A consumer updates the notification record’s status and publishes the event to an analytics pipeline. The real-time delivery state (in DynamoDB) reflects the latest status per channel. The analytics pipeline (S3 + Athena) provides aggregate views.
- Do not block dispatch on tracking setup. The notification should send regardless of whether tracking infrastructure is healthy. If the tracking pixel URL is down, the email still sends — you just lose the open tracking for that email.
Q9: What is the difference between a load balancer, an API gateway, and a reverse proxy? When would you use each in a system design?
Q9: What is the difference between a load balancer, an API gateway, and a reverse proxy? When would you use each in a system design?
Question
Candidates often draw a “load balancer” box at the top of their system design without explaining what it actually does. Walk me through the differences between a load balancer, an API gateway, and a reverse proxy. When would you use one versus the other, and when do they overlap?Difficulty: Intermediate | Concepts tested: Networking fundamentals, infrastructure architecture, separation of concernsStrong Answer
These three concepts have significant overlap, which is why they cause confusion. In practice, a single product (like NGINX or AWS ALB) can serve as all three simultaneously. But the responsibilities are distinct.- Reverse Proxy: A server that sits in front of your backend servers and forwards client requests to them. The client talks to the proxy, not directly to the backend. The core job is indirection — hiding the backend’s identity, IP addresses, and topology from the client. NGINX, HAProxy, and Envoy are all reverse proxies. A reverse proxy can do SSL termination, caching, compression, and request buffering. It is the broadest concept of the three.
- Load Balancer: A reverse proxy with a specific additional responsibility — distributing traffic across multiple backend servers. It implements a distribution algorithm (round-robin, least-connections, weighted, IP-hash, consistent hashing). The core job is spreading load evenly so no single server is overwhelmed. Load balancers operate at Layer 4 (TCP level, like AWS NLB) or Layer 7 (HTTP level, like AWS ALB). Layer 4 is faster (just forwards TCP packets) but cannot make decisions based on HTTP content. Layer 7 can route based on URL path, headers, cookies.
- API Gateway: A load balancer with application-layer intelligence. Beyond distributing traffic, it handles concerns like authentication, rate limiting, request transformation, API versioning, circuit breaking, and API key management. The core job is managing the API lifecycle. Kong, AWS API Gateway, and Apigee are API gateways. Think of it as a load balancer plus a middleware pipeline.
- Just a reverse proxy: You have one backend server and want SSL termination, caching, and protection from direct internet exposure. Small deployments, simple architectures.
- Load balancer: You have multiple backend servers and need to distribute traffic. Most system designs need this at minimum. Use L4 (NLB) for raw TCP throughput (WebSocket, gRPC) and L7 (ALB) when you need path-based routing.
- API Gateway: You have multiple microservices, external API consumers, and need centralized auth, rate limiting, and API management. The gateway is the single entry point. Most medium-to-large systems need this. In a system design interview, putting “API Gateway” at the top of your diagram signals that you understand these cross-cutting concerns are best handled in one place, not scattered across services.
Follow-up: In the chat system design, you used an NLB instead of an ALB for WebSocket connections. Why?
The choice comes down to how each handles persistent connections.- ALB (Layer 7) terminates the TCP connection from the client and creates a new TCP connection to the backend. For WebSocket, this means the ALB maintains TWO connections per user (client-to-ALB and ALB-to-backend). At 5 million concurrent users, the ALB maintains 10 million TCP connections. ALB also inspects HTTP headers on each frame, adding latency.
- NLB (Layer 4) passes TCP packets through without terminating the connection. The client has a direct TCP connection to the backend (through the NLB). At 5 million concurrent users, the NLB is forwarding packets for 5 million connections — it does not maintain state for each one (or minimal state). This is dramatically more efficient.
- NLB also supports millions of concurrent connections by design, while ALB has lower connection limits and higher per-connection memory overhead.
- The trade-off: NLB cannot do path-based routing, header inspection, or SSL termination (well, it can do TLS passthrough, and newer NLBs support TLS termination). If you need to route
/api/*to one service and/ws/*to another, you need ALB. A common pattern is: ALB for the REST API endpoints, NLB for the WebSocket endpoints.
Follow-up: Where does the rate limiter sit in this stack? Before the load balancer, inside the API gateway, or after?
The placement depends on what you are rate limiting.- At the CDN/edge level: Rate limiting by IP address to prevent DDoS. This is the first line of defense. AWS WAF rate-based rules or Cloudflare’s edge rate limiting do this before the request even reaches your infrastructure. Cost: very low latency, but you can only limit by IP or simple headers.
- At the API Gateway: Rate limiting by API key, user ID, or custom dimensions. This is where most per-user rate limiting happens. The gateway authenticates the request (extracts user identity), then checks the rate limit. AWS API Gateway has built-in throttling. Custom gateways (Kong) support plugin-based rate limiting.
- In the application: Rate limiting by business logic (e.g., max 3 password attempts per minute, max 10 friend requests per hour). This is too specific for the gateway and must live in the application code.
- Best practice is layered rate limiting: IP-level at the edge (coarse, prevents DDoS), user-level at the gateway (medium, enforces API quotas), business-logic-level in the application (fine, enforces product rules). Each layer catches different types of abuse, and the combination is more robust than any single layer.
Q10: Walk me through how you would debug a system where messages in a chat application are arriving out of order.
Q10: Walk me through how you would debug a system where messages in a chat application are arriving out of order.
Question
Users are reporting that messages in their conversations sometimes appear out of order — they see a reply before the original message. You are the on-call engineer. Walk me through how you would diagnose this and what the likely root causes are.Difficulty: Senior | Concepts tested: Debugging distributed systems, message ordering, clock skew, causal consistency, production investigationStrong Answer
Out-of-order messages in a chat system are one of the trickiest distributed systems bugs because the root cause can be at any layer — client, network, server, or database. Here is how I would systematically investigate.Step 1 — Reproduce and characterize the problem:- Is it happening for all users or specific users? Specific conversations? Specific server regions? Check the support tickets for patterns.
- Is it 1-to-1 chats, group chats, or both? Group chats have a fan-out step that 1-to-1 chats do not — more opportunities for reordering.
- Is the reordering consistent (Message B always appears before Message A) or transient (refresh fixes it)? Consistent reordering points to a storage or ordering problem. Transient reordering points to a delivery race condition.
-
Clock skew between Chat Servers: If you are using timestamps for ordering and two Chat Servers have clocks that differ by even 100ms, messages processed by different servers can be ordered incorrectly. Server 1’s clock is 200ms ahead of Server 2’s clock. User A sends at 12:00:00.000 (processed by Server 1, timestamp 12:00:00.200). User B replies at 12:00:00.100 (processed by Server 2, timestamp 12:00:00.100). The reply appears to have been sent before the original message.
- Diagnosis: Compare server clocks. Check NTP synchronization status. If clock offsets exceed 10ms, this is likely the cause.
- Fix: Use server-generated sequence numbers per conversation (monotonically increasing), not wall-clock timestamps. A single Chat Server (the “owner” for that conversation) assigns sequence numbers, guaranteeing causal order.
-
Race condition in the delivery path: User A and User B send messages nearly simultaneously. Message A is persisted to Cassandra but the Pub/Sub delivery to User B is slow. Message B is persisted and delivered to User A quickly. User A sees Message B before their own Message A is confirmed delivered. This is a perceived reordering on the sender side.
- Diagnosis: Check message persistence timestamps vs delivery timestamps. If persistence order is correct but delivery order differs, this is a delivery race condition.
- Fix: The client should render messages in sequence-number order, not delivery order. When Message B arrives with sequence number 5 and Message A (sequence 4) has not arrived yet, the client either waits briefly (50-100ms) for Message A or renders Message B and reorders when Message A arrives.
-
Cassandra read-repair or consistency level issue: If you are reading with consistency level ONE from a replica that has not yet received the most recent write (which was written at consistency level QUORUM), you might read stale data. The client fetches messages, gets an incomplete list, then on the next fetch gets the full list. Messages appear to “insert” into the middle of the conversation.
- Diagnosis: Check your Cassandra read/write consistency levels. If reads are at ONE and writes at QUORUM, there is a window where replicas disagree.
- Fix: Read at QUORUM as well (guarantees you read your own writes), or use LOCAL_QUORUM for both reads and writes in multi-datacenter setups.
- Add a distributed trace (X-Ray, Jaeger) to the message pipeline. Each message gets a trace ID that follows it from client send to server persist to Pub/Sub to recipient delivery. The trace shows exactly where time was spent and in what order events occurred.
- Add a client-side event: when the client renders a message, log the
{message_id, sequence_number, display_timestamp, delivery_timestamp}. This lets you compare what the server sent vs what the client displayed.
Follow-up: You mentioned using per-conversation sequence numbers. How do you generate those in a distributed system without a single point of failure?
This is the core of the problem. You need a monotonically increasing counter per conversation, and it must be fast and reliable.- Option 1 — Single writer per conversation: Assign each conversation to a single Chat Server (the “home” server for that conversation, determined by hashing the conversation_id). That server maintains an in-memory counter for the conversation and increments it for each message. Fast (no network hop for sequence generation) and correct (single writer = no conflicts). The downside: if the home server crashes, a new server takes over and must read the latest sequence number from the database before it can assign new ones.
- Option 2 — Database-generated sequences: Use the database to atomically assign sequence numbers. In Cassandra, this is not natively supported (no auto-increment). In PostgreSQL, use a
SEQUENCEor aSELECT ... FOR UPDATEon a counter row. This is slower (database round-trip per message) but eliminates the single-writer bottleneck. - Option 3 — Snowflake-like IDs: Generate globally unique, time-ordered IDs (timestamp + machine ID + sequence). These are not strictly sequential (they have gaps) but they are monotonically increasing, which is sufficient for ordering. Twitter’s Snowflake, Discord’s Snowflakes, and ULIDs all follow this pattern. The trade-off: ordering is timestamp-based, so clock skew still affects ordering across servers. But within a single server (likely for a single conversation), the ordering is guaranteed by the local sequence counter.
- My recommendation for a chat system: Option 1 (single writer per conversation) for message ordering. It is the simplest correct solution. Use consistent hashing to assign conversations to servers, with a fallback to database-read on server failover.
Follow-up: How would you test for message ordering issues before they hit production?
This is a testing methodology question that separates experienced engineers from theoretical ones.- Chaos testing: Inject artificial clock skew (50-200ms) between Chat Servers in a staging environment. Send rapid-fire messages in test conversations and verify ordering. If your ordering is timestamp-based, you will immediately see reordering. If it is sequence-based, it should be immune.
- Concurrent message injection: Write a load test that has N clients in the same conversation all sending messages as fast as possible. Verify that all clients see the exact same message order. This catches race conditions in the delivery path.
- Network delay injection: Use a tool like
tc(traffic control) or Toxiproxy to add random delays (10-500ms) to the Pub/Sub layer. Verify that the client-side rendering logic handles out-of-order delivery correctly (waits for gaps, reorders on display). - Partition testing: Simulate a Redis Pub/Sub outage. Messages should be persisted to Cassandra even though delivery is delayed. When Pub/Sub recovers, verify that clients catch up in the correct order by reading from the database.
- The key principle: ordering bugs only manifest under concurrency and latency variance. Single-client, low-latency testing will never find them. Your test environment must be adversarial.
Q11: You are designing for 99.99% availability. What does that actually mean in practice, and how does it change your architecture?
Q11: You are designing for 99.99% availability. What does that actually mean in practice, and how does it change your architecture?
Question
Many system design candidates say they want “four nines” availability without understanding the implications. Walk me through what 99.99% availability means in real terms, and how the architecture of a URL shortener changes when you go from 99.9% to 99.99%.Difficulty: Senior to Staff | Concepts tested: Availability math, redundancy, failure domains, operational maturity, SLA vs SLOStrong Answer
Let me start with what the numbers actually mean in terms of allowed downtime.- 99.9% (three nines): 8.76 hours of downtime per year, or ~43 minutes per month. This allows for a bad deployment that takes 30 minutes to roll back, plus one brief infrastructure incident per month.
- 99.99% (four nines): 52.6 minutes of downtime per year, or ~4.4 minutes per month. You essentially cannot have a single incident that takes more than 5 minutes to detect and mitigate. Manual intervention is too slow for most incidents.
- 99.999% (five nines): 5.26 minutes per year. This requires near-zero human involvement in failure recovery. Only the most critical systems (DNS root servers, core financial infrastructure) target this.
- Single-region deployment. Database primary with one replica. Manual or semi-automated failover. Redis with Sentinel. Single points of failure are acceptable if recovery is fast.
- An incident playbook where on-call can SSH in, diagnose, and fix within 30 minutes.
- Multi-region active-active: A single-region outage (AWS availability zone failure, even a full region failure) cannot exceed 4 minutes. You need at least two active regions, each capable of handling full traffic. DNS-based failover (Route 53 health checks) detects a region failure and routes traffic to the healthy region within seconds.
- Zero-downtime deployments: A bad deploy cannot cause more than 4 minutes of downtime. This means canary deployments (deploy to 1% of traffic, monitor for 5 minutes, then gradually increase), instant rollback (keep the previous version running and switch traffic back in seconds), and blue-green deployments.
- Automated failover everywhere: Database failover must be automated (Patroni, RDS Multi-AZ) with detection-to-recovery under 30 seconds. Redis failover must be automated (Sentinel, Cluster). No component can require a human to restart it.
- Health checking at every layer: Load balancers health-check the API servers every 5 seconds. API servers health-check Redis and the database. If any dependency is unhealthy, the server removes itself from the load balancer. Unhealthy servers are replaced automatically (ECS, Kubernetes).
- Blast radius isolation: Changes should only affect a fraction of users. Deploy to one AZ first. Use feature flags to gate new code paths. If something breaks, it breaks for 10% of users, not 100%.
- Observability budget: You need to detect issues within 60 seconds (leaves 3 minutes for mitigation). This means real-time dashboards, sub-minute alerting, and synthetic monitoring (a canary that continuously shortens and resolves URLs, alerting if latency exceeds thresholds or resolution fails).
Follow-up: How do you measure availability in practice? Is it just “server uptime”?
Great question — this is where many teams get it wrong.- Naive approach: Measure uptime as “the server process is running.” By this measure, a server that returns HTTP 500 on every request is “available.” This is meaningless.
- Correct approach: Measure availability from the user’s perspective. A request is “successful” if it returns a correct response within the latency SLA. Availability = (successful requests) / (total requests) over a time window.
- For the URL shortener: A redirect request is successful if it returns a 302 with the correct Location header within 100ms. Anything else (500 error, timeout, wrong redirect, latency > 100ms) counts as a failure.
- SLI vs SLO vs SLA: The SLI (Service Level Indicator) is the metric you measure (redirect success rate). The SLO (Service Level Objective) is your internal target (99.99%). The SLA (Service Level Agreement) is the contractual promise to customers (usually lower than SLO — promise 99.9%, target 99.99%, so you have an error budget).
- Error budgets: If your SLO is 99.99%, you have a 0.01% error budget per month — about 4 minutes of failures. When the error budget is consumed, the team should freeze feature deployments and focus on reliability. This is the Google SRE approach, and it creates a concrete, data-driven tension between shipping features and maintaining reliability.
Going Deeper: How do you handle the case where one region is “available” but returning degraded results — say, serving stale cached data because the database is lagging?
This is the “partially available” problem, and it is the hardest part of availability measurement.- Define what “correct” means: For the URL shortener, a redirect to the correct URL is correct even if it is served from a stale cache. A redirect to the WRONG URL (because of replication lag in a multi-region setup where the URL was recently updated) is incorrect and should count as a failure, even though the server returned a 302.
- Implement correctness checks: Periodically resolve a set of known test URLs and verify the destination. If a region is returning stale or incorrect results, its health check should fail, and traffic should be routed away.
- Replication lag monitoring: In a multi-region setup with asynchronous replication, the secondary region might be seconds behind the primary. A URL created in the primary region might not be resolvable in the secondary region for a few seconds. For a URL shortener, this is rarely a problem (users do not use a short URL within seconds of creating it). For systems where this matters, read-your-writes consistency is achieved by routing the creator’s subsequent requests to the primary region.
- The philosophical point: Availability is not binary. A system can be “up” but “degraded.” Mature organizations track multiple SLIs — latency, error rate, correctness, freshness — and define availability as the intersection of all of them meeting their thresholds. This is harder to measure but gives a much more honest picture of service health.
1 - SLO.
Use it naturally: “Our 99.9% SLO gives us 43 minutes/month of error budget; after a region failover consumed 30 minutes of it, we froze the next feature launch and spent the sprint on the failover runbook.”
Warning: The error budget is only useful if leadership honors the freeze. Teams that burn their budget and keep shipping features anyway are running without an SLO, regardless of what the dashboard says.- Google SRE Book — Chapters on Service Level Objectives and Embracing Risk (sre.google/sre-book) — the authoritative text on SLO/SLI/SLA distinctions and error budget mechanics.
- Alex Hidalgo — “Implementing Service Level Objectives” — practitioner-grade guidance on setting and operationalizing SLOs without the ceremony.
- Amazon Builders’ Library — “Reliability, constant work, and a good cup of coffee” (aws.amazon.com/builders-library) — essays on the engineering patterns behind high-availability AWS services.
Q12: If you had to pick one system from these six to build from scratch at a startup with three engineers, which one and how would you simplify it?
Q12: If you had to pick one system from these six to build from scratch at a startup with three engineers, which one and how would you simplify it?
Question
You are a tech lead at an early-stage startup with three backend engineers. The product needs a notification system. You do not have the time or team to build the full architecture described in Problem 6. Walk me through how you would simplify it to ship in two weeks while still building something that can scale later.Difficulty: Staff-Level | Concepts tested: Engineering judgment, build vs buy, incremental architecture, pragmatism, scope managementStrong Answer
This is fundamentally a judgment question, not a technical one. The right answer is NOT to build a simplified version of the full architecture. The right answer is to identify what you absolutely must build versus what you can buy.Week 1 — Buy everything you can:- Push notifications: Use Firebase Cloud Messaging (FCM) directly. It handles device token management, multi-platform delivery, and retry logic. Three API calls: register device token, send to token, send to topic. No custom push worker needed.
- Email: Use SendGrid or AWS SES with their template system. Store templates in SendGrid, not in your own database. SendGrid handles bounce management, unsubscribe links (CAN-SPAM compliance), and delivery tracking. One API call to send.
- SMS: Use Twilio. One API call to send. Twilio handles delivery receipts, phone number validation, and regulatory compliance per country.
- In-app notifications: This is the only part you must build, because it is tied to your application’s data model.
- A single
notificationstable in your existing PostgreSQL database:{id, user_id, type, title, body, read, created_at}. - A
notification_preferencestable:{user_id, channel, enabled}. Default everything to enabled. - A single API endpoint:
POST /internal/notifythat your other services call. This endpoint: (1) checks user preferences, (2) writes to the notifications table (for in-app), (3) calls SendGrid/FCM/Twilio directly (synchronous for now — yes, synchronous). - A REST endpoint for the frontend:
GET /api/notificationsfor the in-app notification list, andPATCH /api/notifications/{id}/read.
- No message queue. Send notifications synchronously. If SendGrid is slow (200ms), your API response is 200ms slower. At startup scale (hundreds of notifications per day), this is fine.
- No deduplication service. Use a unique constraint on
(user_id, type, idempotency_key)in PostgreSQL. If a duplicate arrives, the INSERT fails. Done. - No rate limiting per user. At startup scale with real users, nobody is getting spammed. Add this when you have data showing it is a problem.
- No priority queues. Everything is sent immediately. A 2FA code and a marketing email go through the same path.
- No aggregation. If a user gets 20 likes, they get 20 notifications. Fix this when users complain.
- No multi-provider failover. If Twilio is down, SMS fails. You have three engineers — operating a failover system is more expensive than the occasional missed SMS.
POST /internal/notify the single entry point. Every service in your codebase calls this one endpoint. When you need to scale:- Month 3: Replace the synchronous external API calls with a background job queue (Sidekiq, BullMQ, Celery). The API endpoint writes to the queue, returns immediately. Workers process the queue asynchronously. You have just added the async dispatch layer.
- Month 6: Add Redis for dedup (replacing the PostgreSQL unique constraint, which is getting slow under load). Add per-user rate limiting.
- Month 12: Split into separate queues by priority. Add notification aggregation. Consider replacing FCM/SendGrid direct calls with SNS or a dedicated notification service.
POST /internal/notify contract stays stable. This is what “building for the future without over-engineering the present” actually looks like.Follow-up: How do you convince the product team that skipping aggregation and rate limiting is okay for now?
With data and risk framing.- Aggregation: “At our current scale of 500 daily active users, the median user receives 3 notifications per day. The busiest user receives 15. Notification fatigue is not a problem yet. When we have data showing users are disabling notifications, we build aggregation. Building it now costs 2 engineer-weeks and delays the launch.”
- Rate limiting: “We do not have users who are being spammed because we control all the notification sources — it is our own code sending notifications. Rate limiting protects against runaway code or abuse, but we can achieve the same protection with a
MAX_NOTIFICATIONS_PER_HOUR = 20constant in the notify endpoint. One line of code, not a Redis-backed distributed rate limiter.” - The general principle: Defer complexity until you have evidence it is needed. Every premature abstraction has a maintenance cost. Three engineers maintaining a rate limiting service is three engineers not building features.
Follow-up: At what point does the “buy everything” approach break down and you need to build custom?
The trigger is usually one of three things:- Cost: At 10 million emails per month, SES costs 5,000+. At 100 million, the difference is significant. You start building custom email infrastructure (dedicated IP management, templating, bounce handling) when the cost of the managed service exceeds the cost of an engineer maintaining custom infrastructure.
- Control: You need notification behaviors that the managed service does not support. For example, “do not send push notifications during the user’s sleeping hours based on their timezone” requires preference logic that FCM does not provide. Or “aggregate 20 like notifications into one” requires application-level batching before calling FCM.
- Reliability: When your notification SLA exceeds what the managed service provides. If you need 99.99% delivery for 2FA codes, depending on a single provider (Twilio) is a risk. You need multi-provider failover, which means building a dispatch layer.
Advanced Interview Scenarios
These questions are deliberately harder than the section above. They target the blind spots that survive standard preparation — scenarios where the obvious answer is wrong, where production experience separates real builders from textbook readers, and where the interviewer is testing judgment more than knowledge. If you can handle these, you can handle anything a staff-level loop throws at you.Q13: Your team just got paged because auto-scaling is making an outage worse, not better. Walk me through how that happens and what you do.
Q13: Your team just got paged because auto-scaling is making an outage worse, not better. Walk me through how that happens and what you do.
Question
You are on-call for the notification system. A traffic spike hits — 10x normal volume due to a breaking news event. Auto-scaling kicks in and starts adding workers. But instead of recovering, latency gets worse and error rates climb. What is happening and how do you stabilize the system?Difficulty: Staff-Level | Concepts tested: Auto-scaling failure modes, cold start, connection pool exhaustion, thundering herd, incident command under pressureThe Trap
Most candidates say “auto-scaling handles traffic spikes” and move on. In reality, auto-scaling is one of the most common amplifiers of outages at scale. This question tests whether you have been paged at 3 AM and learned the hard way.What weak candidates say:“Auto-scaling should handle this. Maybe the scaling policy needs a higher max instance count. I would increase the max and wait for new instances to come online.”What strong candidates say:The way I think about this is that auto-scaling solves the capacity problem but can create three new problems simultaneously — and at 3 AM, you need to know which one you are fighting.-
Problem 1 — Connection pool exhaustion (the most common killer). Each new worker instance opens connections to PostgreSQL, Redis, and downstream services. Your RDS instance has a
max_connectionsof 500. You started with 20 workers at 10 connections each (200 total). Auto-scaling adds 50 more workers — that is 500 new connection attempts. PostgreSQL starts rejecting connections. Now all workers — old and new — start failing on database calls. The new capacity you added is not just useless, it is actively killing the healthy workers. I saw this exact scenario at a company processing payment notifications. Black Friday traffic triggered auto-scaling, which exhausted the RDS connection pool in 90 seconds. The fix was not more instances — it was adding PgBouncer as a connection pooler between the application and the database. PgBouncer multiplexes 500 application connections over 50 actual database connections. We deployed PgBouncer that weekend and never hit the issue again. -
Problem 2 — Cold start cache stampede. New instances have empty local caches. Every request on a new instance is a cache miss that hits the database. If you just scaled from 20 to 70 instances, you have 50 instances all hammering the database with requests that would normally be served from local cache. DynamoDB’s adaptive capacity takes 5-10 minutes to scale up the hot partitions; during that window, you get throttling (
ProvisionedThroughputExceededException). The mitigation is cache warming on startup. Before the instance starts accepting traffic, it pulls the top 1,000 most-requested items into its local cache. This adds 30 seconds to startup time but prevents the stampede. - Problem 3 — Downstream service collapse. Your notification workers call APNs, FCM, SendGrid, and Twilio. These external APIs have their own rate limits. 20 workers sending 100 req/sec each = 2,000 req/sec to SendGrid. Scale to 70 workers = 7,000 req/sec. SendGrid starts returning 429s. Your workers retry. Retries add to the load. You are now in a feedback loop where more workers means more retries means more load means more failures.
- Freeze auto-scaling immediately. Set the desired count to the current number. Stop the bleeding.
- Enable circuit breakers on downstream calls. If SendGrid is returning 429s, stop sending to it for 60 seconds. Buffer notifications in the queue. They are already in SQS — they will not be lost.
- Check the database connection count.
SELECT count(*) FROM pg_stat_activity;If it is nearmax_connections, that is the root cause. Kill idle connections and add PgBouncer if not already present. - Reduce to a known-good instance count. Scale down to the number of instances that were handling traffic before the spike. Let the queue absorb the burst. Process at a sustainable rate.
max_connections, every consumer started throwing errors, and consumer lag increased because no consumer could process anything. The fix: we added a connection pooler, capped auto-scaling at 2 new instances per minute (instead of 10), and added a startup health check that verified database connectivity before the instance joined the consumer group.Follow-up: How do you design an auto-scaling policy that does not amplify outages?
- Rate-limit the scaling itself. Add a cooldown period (5 minutes between scale-up events) and cap the number of new instances per scaling event (add 2 at a time, not 20). AWS calls this “step scaling with cooldown.”
- Scale on the right metric. Scaling on CPU is almost always wrong for I/O-bound services. Scale on queue depth, request latency p99, or consumer lag. For notification workers, scale on SQS
ApproximateNumberOfMessagesVisible— it directly measures backlog. - Pre-provision for known events. If you know Black Friday is coming, scale up before the traffic hits. Scheduled scaling at 5 AM on Black Friday morning is far safer than reactive scaling at 9 AM when the traffic arrives.
- Circuit breaker before scaling. If downstream services are failing, adding more workers makes things worse. Your scaling policy should check dependency health: if the database connection pool is above 80%, do not scale.
Follow-up: How do you prevent the connection pool problem architecturally, not just with PgBouncer?
- Serverless connection patterns. Use DynamoDB or Aurora Serverless v2 instead of standard RDS. Aurora Serverless manages connection pooling internally through its Data API. DynamoDB has no connection limit — it is HTTP-based.
- HTTP-based data access. Replace direct database connections with an internal data service API. Workers call
GET /api/internal/users/123instead of connecting to PostgreSQL directly. The data service manages a fixed connection pool. You can scale workers independently of database connections. - RDS Proxy. AWS’s managed connection pooler that sits between Lambda/ECS and RDS. It maintains a pool of warm connections and multiplexes application connections over them. Adds ~1ms latency but eliminates the connection exhaustion problem.
Q14: You need to migrate the chat system's message store from MongoDB to Cassandra with zero downtime. Walk me through the full migration plan.
Q14: You need to migrate the chat system's message store from MongoDB to Cassandra with zero downtime. Walk me through the full migration plan.
Question
Your chat system has 500 million messages in MongoDB. Performance is degrading because of the single-primary write bottleneck. You have decided to migrate to Cassandra. You cannot take the system offline. Users cannot lose messages or see duplicates during the migration. Walk me through the migration, step by step.Difficulty: Staff-Level | Concepts tested: Zero-downtime migration, dual-write, data consistency verification, rollback strategy, operational planningThe Trap
Most candidates describe “export from old database, import to new database, switch over.” That is an offline migration. The question explicitly says zero downtime. This tests whether you have actually performed a live migration.What weak candidates say:“I would set up Cassandra, copy all the data over, then switch the application to point at Cassandra. Maybe do it during off-peak hours.”This answer misses: what happens to messages written during the copy? How do you verify data consistency? How do you roll back if Cassandra has a problem? What about the different data models?What strong candidates say:I have done a migration like this before, and it is a 4-phase process that typically takes 4-6 weeks from start to full cutover. The phases are: dual-write, backfill, shadow-read verification, and cutover. Let me walk through each.-
Phase 1 — Dual-Write (Week 1-2). Deploy a code change so that every new message write goes to BOTH MongoDB and Cassandra. MongoDB remains the source of truth. The write path looks like: write to MongoDB (synchronous, must succeed), then write to Cassandra (asynchronous, best-effort with retry). If the Cassandra write fails, log the failure to a dead letter queue and retry later. No user-facing behavior changes.
The critical detail: the Cassandra data model is different from MongoDB. MongoDB stores messages as documents with embedded arrays for reactions and read receipts. Cassandra uses a wide-row model with
(conversation_id, message_id)as the primary key. The dual-write layer must translate between models. This translation logic is the highest-risk code — test it exhaustively with production-like data before enabling dual-write. -
Phase 2 — Backfill (Week 2-3). While dual-write handles new messages, you need to copy the 500 million historical messages. Run a backfill job that reads from MongoDB in batches (10,000 messages at a time, paginated by
_id), transforms each message to the Cassandra schema, and writes to Cassandra. The backfill must be idempotent — if it crashes and restarts, it should not create duplicates. Use the message ID as the Cassandra primary key so re-inserting the same message is a no-op (Cassandra upserts by default). Rate-limit the backfill to avoid overwhelming either database. At 5,000 writes/sec to Cassandra, 500 million messages takes ~28 hours. Run it over a weekend with monitoring. If the backfill causes latency spikes in MongoDB (from the read load), throttle it down. There is a race condition: a message written by the dual-write path might be overwritten by the backfill reading an older version from MongoDB. Mitigation: use awrite_timestampin Cassandra and implement a “latest wins” strategy using Cassandra’s built-in last-write-wins conflict resolution. Since the dual-write message is newer, it survives. - Phase 3 — Shadow-Read Verification (Week 3-4). This is the phase most people skip, and it is the one that saves you from a catastrophic cutover failure. Deploy code that reads from BOTH databases on every message fetch. The response is served from MongoDB (still the source of truth), but in the background, you also query Cassandra and COMPARE the results. Log every discrepancy: missing messages in Cassandra, different message content, ordering differences. Calculate a consistency score: “Cassandra agrees with MongoDB on 99.97% of reads.” Investigate the 0.03% discrepancies — they usually reveal bugs in the dual-write translation or backfill edge cases (messages with special characters, very long messages that were truncated, timezone conversion issues). Do not proceed to cutover until the consistency score is 99.99% or higher over 72 hours of shadow reads.
- Phase 4 — Cutover (Week 5). Flip the read path to Cassandra. MongoDB becomes the fallback. The cutover is a feature flag, not a deployment. Step 1: Route 1% of read traffic to Cassandra. Monitor latency and error rates for 1 hour. Step 2: Route 10%. Monitor for 4 hours. Step 3: Route 50%. Monitor for 24 hours. Step 4: Route 100%. MongoDB is now only receiving dual-writes. Step 5 (Week 6+): After 2 weeks of stable Cassandra-primary operation, remove the dual-write to MongoDB. MongoDB is decommissioned. At every step, you can roll back by flipping the feature flag. MongoDB has all the data because dual-write was still active.
Follow-up: What if you discover during shadow reads that Cassandra’s performance is actually worse than MongoDB for certain query patterns?
This is a realistic scenario — the data model translation might not map perfectly to Cassandra’s strengths.- Identify the specific query pattern. Common culprits: (1) queries that work on secondary indexes in MongoDB but require a full table scan in Cassandra (Cassandra secondary indexes are notoriously slow), (2) queries that fetch a small number of messages from many conversations (scatter-gather across partitions), (3) aggregation queries (count of unread messages across all conversations) that MongoDB handles natively but Cassandra cannot.
- For slow secondary index queries: Denormalize. Create a separate Cassandra table optimized for that query. Cassandra’s philosophy is “model your tables around your queries, not your entities.” If you need to look up messages by sender, create a
messages_by_sendertable. - For scatter-gather queries: Add a caching layer. If the query is “fetch the latest message from each of my 50 conversations” (for the conversation list view), cache the latest message per conversation in Redis. Update on every new message. The query becomes a single Redis MGET, not 50 Cassandra reads.
- If performance is fundamentally worse: Reconsider the migration. Maybe the right answer is not Cassandra but ScyllaDB (same data model, better tail latency due to no JVM GC pauses). Or maybe a different database entirely — for example, TiDB if you need distributed SQL.
Follow-up: How do you handle the dual-write failure mode where MongoDB succeeds but Cassandra fails?
- During the dual-write phase, MongoDB is the source of truth. A Cassandra write failure is tolerable — the message exists in MongoDB and will be served correctly.
- Log the failed Cassandra write to a dead letter queue (SQS or a separate PostgreSQL table). A reconciliation worker retries the failed writes every 30 seconds with exponential backoff.
- If a message is in MongoDB but not in Cassandra, and a shadow read occurs before the reconciliation worker catches up, the shadow read comparison will flag it as a discrepancy. This is expected and should not be counted against the consistency score — filter out discrepancies for messages written in the last 5 minutes.
- The dangerous case: Cassandra write appears to succeed (no error returned) but the data is silently corrupted or written to the wrong partition due to a data model bug. This is why shadow-read verification is non-negotiable. It catches bugs that error handling cannot.
Q15: The interviewer says 'Design a system that handles 1 million events per second.' You say 'Use Kafka.' The interviewer says 'Kafka is down. Now what?' How do you answer?
Q15: The interviewer says 'Design a system that handles 1 million events per second.' You say 'Use Kafka.' The interviewer says 'Kafka is down. Now what?' How do you answer?
Question
This question tests your ability to think beyond the happy path where your core infrastructure works. Every system has a “what if the thing you are depending on dies” scenario. Walk me through how you design a high-throughput event pipeline that survives the failure of its primary message broker.Difficulty: Senior to Staff | Concepts tested: Resilience patterns, multi-layer buffering, graceful degradation, “what if the obvious answer is wrong” thinkingThe Trap
The trap is designing a system with a single critical dependency (Kafka) and no fallback. The interviewer wants to see that you think in layers of defense, not single points of failure.What weak candidates say:“Kafka is highly available with replication. If one broker goes down, the others take over. You just configure replication factor 3 and min.insync.replicas 2.”This answer is technically correct for a single-broker failure, but it misses the point. The interviewer is asking about a total Kafka outage — maybe a cluster-wide misconfiguration, a ZooKeeper failure, a network partition that isolates the entire Kafka cluster, or a bad version upgrade.What strong candidates say:Let me think about this in layers, because resilience is never about one backup plan — it is about degradation tiers.- Tier 1 — Kafka is healthy (normal operation). Producers write events to Kafka topics. Consumers process them. Replication factor 3, min.insync.replicas 2. This handles individual broker failures. Kafka recovers automatically. This is the 99.9% case.
- Tier 2 — Kafka is degraded (some partitions unavailable). Producers get errors for specific partitions. The partition leader election is taking too long, or a minority of brokers are down. The producer should implement a local buffer: when a Kafka write fails, write the event to a local disk-backed queue (an embedded RocksDB instance, a memory-mapped file, or even just appending to a local file). Continue retrying Kafka in the background. When Kafka recovers, drain the local buffer into Kafka. Events are delayed but not lost. At Uber, the Kafka producer library (called uReplicator) does exactly this — it buffers to local disk when Kafka is unreachable and drains on recovery. They reported that during a 20-minute Kafka cluster restart, zero events were lost because the local buffer held approximately 400GB of events across their fleet.
-
Tier 3 — Kafka is fully down (total cluster outage). Local buffers on producers fill up. If the outage lasts hours, the local disk fills up. Now you need a fallback message broker.
Option A: Write-ahead to S3. Every event is written to both Kafka and an S3 bucket (partitioned by
yyyy/mm/dd/hh/producer_id). If Kafka is down, events still land in S3. A separate consumer reads from S3 when Kafka is unavailable. Latency goes from sub-second (Kafka) to minutes (S3 listing and reading), but no events are lost. Option B: Secondary message broker. Run a smaller Redis Streams or Amazon Kinesis cluster as a hot standby. When the producer detects Kafka is down (circuit breaker opens after 3 failures in 10 seconds), it switches to the secondary broker. Consumers subscribe to both Kafka and the secondary broker. This gives sub-second latency even during a Kafka outage, at the cost of maintaining a second streaming infrastructure. Option C (the pragmatic choice for most systems): Accept the degradation. If you are processing analytics events, a few hours of delay is acceptable. Use the local buffer (Tier 2) and do NOT invest in a secondary broker. Alert on-call, fix Kafka, drain the buffers. For critical events (payment notifications, 2FA codes), route through a separate, simpler path (SQS or a direct database write) that does not depend on Kafka at all. - The key architectural principle: Separate your event streams by criticality. Payment events and 2FA events should NEVER share infrastructure with analytics and clickstream events. Critical events get a simple, highly available path (SQS, which has 99.999% availability). Bulk events get the high-throughput path (Kafka) with best-effort resilience.
Follow-up: You mentioned writing to both Kafka and S3 simultaneously. How do you handle deduplication when the consumer reads from both sources?
- Every event has a globally unique event ID (UUID or ULID) assigned at the producer before writing to any destination.
- Consumers maintain a dedup window: a Bloom filter or a Redis set of recently processed event IDs, with a TTL of 2x the maximum expected delay between Kafka delivery and S3 delivery.
- When processing from S3 (the backup path), the consumer checks each event ID against the dedup set. If it was already processed from Kafka, skip it.
- The Bloom filter approach is memory-efficient: 1 billion event IDs in a Bloom filter with a 0.01% false positive rate requires approximately 1.2 GB of memory. At a 0.01% false positive rate, you silently drop 1 in 10,000 duplicate-seeming events that are actually unique. For analytics, this is acceptable. For payments, use an exact dedup set (Redis or DynamoDB).
Follow-up: How would you test this resilience in production without actually breaking Kafka?
- Chaos engineering with Gremlin or LitmusChaos. Inject Kafka broker failures in a controlled way during off-peak hours. Verify that the local buffer activates, events are not lost, and the buffer drains correctly on recovery.
- Dark traffic testing. Run a shadow pipeline that mirrors 1% of production events. Intentionally break the shadow pipeline’s Kafka and observe the fallback behavior. The production pipeline is unaffected.
- GameDay exercises. Schedule a quarterly drill where the team simulates a Kafka total outage. Practice the incident response: who gets paged, what is the communication plan, how do you verify events are buffered, how do you verify the buffer drains correctly. At AWS, GameDay exercises are how they discovered that their S3 fallback path had a bug where events with payloads larger than 256KB were silently dropped.
Q16: Your system uses microservices. A new engineer asks 'Why not just use a monolith?' Convince me the microservices are justified -- or admit they are not.
Q16: Your system uses microservices. A new engineer asks 'Why not just use a monolith?' Convince me the microservices are justified -- or admit they are not.
Question
This is not a “microservices vs monolith” definition question. I want you to give me the specific criteria for when a microservices architecture becomes the wrong choice, and how you would evaluate whether your current system should be simplified.Difficulty: Staff-Level | Concepts tested: Architecture judgment, organizational awareness, the “obvious answer is wrong” trap, intellectual honestyThe Trap
Most candidates default to defending microservices because it is the “modern” answer. The interviewer wants to see you challenge assumptions, reason from first principles, and potentially argue that a monolith is the better choice for many real teams.What weak candidates say:“Microservices are better because they allow independent deployments, scaling, and team autonomy. Each service can use the best technology for its domain.”This is the textbook answer, and it is only true for large organizations. For the majority of engineering teams, microservices add more cost than value.What strong candidates say:The honest answer is that microservices are the wrong architecture for most teams, most of the time. Let me explain my reasoning.- The organizational prerequisite that most people ignore. Conway’s Law tells us that system architecture mirrors organizational structure. Microservices work when you have multiple independent teams (6-10 engineers each) that own independent business domains and need to deploy independently. If you have 15 engineers in one team, splitting into microservices gives you all the distributed systems complexity (network calls, serialization, distributed tracing, service discovery, API versioning) without the organizational benefit. You are paying the microservices tax without the microservices dividend. Shopify has over 3,000 engineers and runs a monolith (a large Ruby on Rails application). They have invested heavily in making the monolith modular (component-based architecture with enforced boundaries), which gives them most of the isolation benefits of microservices without the operational overhead. Shopify explicitly chose NOT to migrate to microservices because the coordination cost across hundreds of services would have slowed their product velocity.
-
The specific criteria where microservices are justified:
- Different scaling profiles. Your notification dispatch service needs to scale to 10,000 instances during Black Friday, but your user preference service handles steady traffic. Putting them in the same monolith means you scale the entire application 10,000x just for the notification path.
- Different reliability requirements. The URL redirect service must have 99.99% availability. The analytics service can tolerate 99.9%. In a monolith, the analytics service’s failures can crash the redirect service. Separate processes isolate failure domains.
- Different deployment cadences. Team A deploys 10 times per day. Team B deploys once a week. In a monolith, Team A’s frequent deploys create risk for Team B’s code. Separate services let each team deploy independently.
- Genuine technology diversity needs. The ML ranking service is Python. The real-time event processor is Go. The CRUD API is Node.js. Different languages require different processes (i.e., separate services) by definition.
-
How I evaluate whether to simplify an existing microservices architecture:
- Count the inter-service calls per user request. If a single user action triggers a chain of 8 synchronous service-to-service calls, you have a distributed monolith. You got the worst of both worlds: the deployment complexity of microservices and the tight coupling of a monolith.
- Count the engineers per service. If you have 15 services and 10 engineers, that is 0.67 engineers per service. Nobody can maintain their services properly. Each service has stale dependencies, missing alerts, and outdated documentation.
- Measure deployment coordination. If deploying Feature X requires coordinating releases across 4 services in a specific order, your service boundaries are in the wrong places. Well-designed microservices should be deployable independently for 90%+ of changes.
Follow-up: If you do need microservices, what is the single most important thing to get right?
Service boundaries. Get them wrong and everything downstream is painful.- Boundaries should follow business domains, not technical layers. “User service, database service, logging service” is wrong. “Payments, notifications, content” is right. Each service should own a business capability end-to-end, including its own data store.
- The litmus test: Can this service be developed, deployed, and operated by a single team without coordinating with other teams for 80%+ of their work? If no, the boundary is wrong.
- Start with a monolith and extract. You cannot draw good service boundaries until you understand the domain. Build the monolith first, observe which parts change together and which change independently, then extract services along those natural seams. This is the Strangler Fig pattern, and it is how most successful microservices architectures were actually built (including Netflix and Amazon).
Follow-up: How do you handle a distributed transaction across microservices — say, “create order” requires updating inventory AND processing payment?
- Short answer: you do not. Distributed transactions (two-phase commit) across microservices are fragile, slow, and defeat the purpose of independent services.
- Use the Saga pattern instead. The order service creates the order in “pending” state, sends a “reserve inventory” event. The inventory service reserves the items and sends “inventory reserved.” The payment service processes the payment and sends “payment completed.” The order service moves the order to “confirmed.” If any step fails, compensating actions fire: release the reserved inventory, reverse the payment.
- The hard part is compensating transactions. “Reverse the payment” sounds simple but has edge cases: what if the payment API is also down? What if the payment was partially processed? This is why each service must be idempotent and support a “compensate” operation that is robust to partial failures.
- Temporal/Cadence is purpose-built for this. Instead of hand-coding sagas, define the workflow in Temporal. It handles the orchestration, retries, compensation, and state persistence. If a step fails, Temporal retries with backoff. If the entire workflow fails, Temporal runs the compensating actions. This is how Uber handles ride booking: a Temporal workflow coordinates rider matching, driver notification, payment authorization, and trip creation.
Q17: A developer on your team adds a new field to the notification API response. Three downstream services break in production. What went wrong and how do you prevent it?
Q17: A developer on your team adds a new field to the notification API response. Three downstream services break in production. What went wrong and how do you prevent it?
Question
This question is about API contract management, backward compatibility, and schema evolution. It seems simple on the surface, but the answer reveals how much production experience a candidate has with service-to-service communication at scale.Difficulty: Senior | Concepts tested: API versioning, schema evolution, backward/forward compatibility, contract testing, the Postel principleThe Trap
Most candidates jump to “use API versioning (v1, v2).” That is a partial answer. The real question is: why did adding a NEW field (which should be backward compatible) break downstream services?What weak candidates say:“We should version our APIs. Put/v1/ in the URL. When you make a breaking change, create /v2/.”This misses the core issue: adding a field should not be a breaking change. If it broke things, your serialization or your consumers are misconfigured.What strong candidates say:First, let me figure out WHY adding a new field broke anything, because in a well-designed system, adding a field is a backward-compatible change. The fact that it broke things tells me something is wrong at a deeper level.-
Root cause 1 — Strict deserialization. The downstream service is using a deserializer that fails on unknown fields. In Java, Jackson’s
DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIESistrueby default. A new field in the response causes Jackson to throw an exception. This is the most common cause. The fix is configuring all consumers to ignore unknown fields. In Jackson:@JsonIgnoreProperties(ignoreUnknown = true). In protobuf, this is the default behavior — unknown fields are silently ignored. I have seen this break production at multiple companies. At one, a developer added anavatar_urlfield to the user profile response. The payments service — which only neededuser_idandemail— deserialized the full response and crashed on the unknown field. An unrelated service broke because of a change it should never have noticed. -
Root cause 2 — Exact schema validation. If consumers use JSON Schema validation or OpenAPI spec validation with
additionalProperties: false, any extra field is rejected. The fix: always setadditionalProperties: truein your schemas, or better yet, do not validate the full schema on the consumer side — only validate the fields you use. - Root cause 3 — The field changes the response size beyond a buffer limit. In rare cases, the new field (especially if it contains large data like a base64-encoded image) pushes the response beyond a size limit in the consumer’s HTTP client, a proxy buffer, or a message queue’s maximum message size. This is the sneakiest failure because the error message is often “connection reset” or “timeout” with no obvious link to the schema change.
- Consumer-Driven Contract Testing. Each downstream service defines a “contract” — the specific fields it expects from the API response. The API’s CI pipeline runs all consumer contracts against the current API. If a change breaks a contract, the build fails BEFORE deployment. Pact is the most popular tool for this. At Atlassian, Pact tests prevented hundreds of breaking changes from reaching production over the years.
- The Robustness Principle (Postel’s Law). “Be conservative in what you send, be liberal in what you accept.” Consumers should accept responses with extra fields, missing optional fields, and fields in any order. Producers should only add fields (never remove or rename without a deprecation period).
- Schema registries for event-driven systems. If your services communicate via Kafka events, use a schema registry (Confluent Schema Registry or AWS Glue Schema Registry). The registry enforces compatibility rules: new schema versions must be backward compatible (consumers using the old schema can still read new messages) or forward compatible (consumers using the new schema can still read old messages). A schema change that breaks compatibility is rejected at publish time, not in production.
-
The compatibility matrix:
- Adding a field: Backward compatible (old consumers ignore it). Should NEVER break.
- Removing a field: Backward INCOMPATIBLE (old consumers expect it). Requires a deprecation period.
- Renaming a field: Equivalent to removing + adding. Breaking change.
- Changing a field type: Breaking change. Always.
Deprecation: true). Breaking changes in production dropped to zero within two quarters.Follow-up: How do you handle evolving a protobuf schema without breaking existing consumers?
- Protobuf has built-in field numbering that makes evolution safe. Each field has a number:
string name = 1; int32 age = 2;. As long as you never reuse a field number, you are safe. - Adding a field: Assign the next available number. Old consumers ignore it. New consumers use a default value if the field is missing from old messages.
- Removing a field: Mark the field number as
reservedso it can never be reused:reserved 3;. Old consumers see the field as empty (default value). Never delete the field number from the .proto file — reserve it. - The discipline: Never change a field’s type or number. Treat field numbers as permanent. This is why protobuf is the preferred serialization for long-lived service APIs — the wire format is inherently backward and forward compatible if you follow the rules.
Follow-up: What about database schema evolution? How do you add a column to a table that 50 services read from?
- Expand-and-contract pattern. Phase 1 (expand): add the new column with a default value and make it nullable. No service breaks because they do not read the new column. Phase 2: update the write path to populate the new column. Backfill existing rows. Phase 3: update consumers one by one to read the new column. Phase 4 (contract): once all consumers are updated, make the column non-nullable if needed, remove the old column if it was being replaced.
- Never add a non-nullable column without a default value. This is the database equivalent of a breaking API change — existing INSERT statements that do not include the new column will fail.
- Online schema migration tools. For large tables (billions of rows),
ALTER TABLE ADD COLUMNcan lock the table for seconds or minutes on MySQL. Use online schema migration tools: gh-ost (GitHub’s tool) or pt-online-schema-change (Percona). These create a shadow table, copy data, sync changes via triggers, and do an atomic rename. PostgreSQL handlesADD COLUMNwith a default value without locking since version 11 — another reason to prefer PostgreSQL.
Q18: Design the observability stack for the distributed task scheduler. What are the five alerts you set up first, and why those five?
Q18: Design the observability stack for the distributed task scheduler. What are the five alerts you set up first, and why those five?
Question
You have deployed the task scheduler from Problem 5 to production. Before you go home, you need to set up monitoring and alerting. You can only pick five alerts. Which five do you choose, and what thresholds do you set?Difficulty: Senior | Concepts tested: Operational maturity, signal vs noise in alerting, SLI/SLO thinking, production readiness, the difference between monitoring and observabilityThe Trap
Most candidates list generic infrastructure alerts: CPU > 80%, memory > 90%, disk > 85%. These are important but tell you nothing about whether the system is actually working. The interviewer wants to see you think from the USER’s perspective inward, not from the MACHINE’s perspective outward.What weak candidates say:“I would alert on CPU usage, memory usage, error rates, disk space, and response time.” These are machine-level metrics. CPU can be at 90% while the system is perfectly healthy (processing a backlog). Error rate can be 0% while the system is completely broken (scheduler is frozen, no tasks are being processed at all, so no errors are generated).What strong candidates say:My philosophy on alerting is: alert on symptoms first, causes second. A symptom alert tells you the user is affected. A cause alert helps you diagnose. If you only have five alerts, make them all symptoms.Alert 1 — Task execution delay exceeds SLA.- Metric:
p99(actual_execution_time - scheduled_execution_time)over a 5-minute window. - Threshold: > 60 seconds for P0 tasks, > 5 minutes for P1 tasks.
- Why this is the most important alert: This directly measures whether the system is fulfilling its core promise — executing tasks on time. If this alert fires, users are affected. Every other failure (scheduler down, queue backed up, workers crashing) eventually manifests as execution delay.
- What it catches that other alerts miss: A frozen scheduler that is not enqueuing tasks. The queue is empty (so no queue depth alert fires), error rate is zero (so no error alert fires), but no tasks are being processed because nothing is being scheduled.
- Metric:
count(tasks in DLQ added in last 15 minutes) - Threshold: > 10 tasks in 15 minutes. (In a healthy system, the DLQ is empty or nearly empty.)
- Why: DLQ growth means tasks are exhausting retries and failing permanently. This is a leading indicator of a systemic problem — a dependency is down, a task type has a bug, or a deployment broke the task handler. The DLQ is the canary in the coal mine.
- War Story: At a company running Sidekiq for background jobs, we had no DLQ alerting. A third-party API changed its response format. The integration tasks started failing, exhausted retries, and silently accumulated in the DLQ. We discovered 45,000 failed tasks three days later during a routine check. Those were customer onboarding emails that never sent. Three days of new users never received their welcome sequence.
- Metric: A canary task scheduled to run every 60 seconds. Alert if the canary has not executed in 3 minutes.
- Threshold: Canary task not completed within 180 seconds of its scheduled time.
- Why: This is an end-to-end health check. The canary exercises the entire pipeline: scheduler picks it up, enqueues it, worker executes it, completion is recorded. If the canary does not fire, something in the pipeline is broken. This catches failures that component-level metrics miss (e.g., the scheduler is running but its database query returns zero rows due to a schema migration that accidentally dropped the index).
- Metric:
(active_workers / total_workers) * 100averaged over 10 minutes. - Threshold: > 90% utilization for 10 minutes.
- Why: At 90% utilization, you have no headroom for traffic spikes. The next burst of tasks will cause queuing delays. This is a leading indicator — it fires BEFORE users are affected, giving you time to scale up. Unlike a queue depth alert (which fires when things are already bad), utilization tells you things are about to get bad.
- Metric:
(completed_tasks / (completed_tasks + failed_tasks)) * 100over a 15-minute window. - Threshold: < 95% success rate.
- Why: A sudden drop in success rate indicates either a deployment problem (new code is crashing) or a dependency problem (a downstream API is failing). The 95% threshold accounts for normal background failure rates (network blips, transient errors) while catching systemic issues.
- Why not 99%? In my experience, task schedulers have a natural failure rate of 1-3% due to transient errors (timeouts, rate limiting by third parties). Setting the threshold at 99% would create alert fatigue from false positives. 95% means something genuinely changed.
- CPU/memory/disk. These are useful for dashboards, not for paging someone at 3 AM. High CPU while processing a backlog is normal. These should be warning-level signals that feed into the symptom alerts, not page-worthy on their own.
- Individual task failures. One task failing is not an incident. Set this up as a log or a low-severity notification to the task owner, not an on-call page.
- Queue depth absolute number. A queue depth of 50,000 might be normal during a batch job run and dangerous during idle hours. Absolute thresholds on queue depth create false positives. Instead, alert on the rate of change or on the downstream symptom (execution delay).
Follow-up: How do you avoid alert fatigue when operating this system?
- Ruthlessly eliminate noisy alerts. If an alert fires more than once a week and the action is “acknowledge and ignore,” delete the alert. It is training your team to ignore pages.
- Every alert must have a documented action. If you cannot write a one-paragraph runbook for what to do when an alert fires, the alert should not page anyone. Put it on a dashboard instead.
- Separate alerting tiers: PagerDuty page (wake someone up) for symptom alerts only. Slack notification for leading indicators. Dashboard for infrastructure metrics. Most teams have the ratio inverted — they page on infrastructure and have no symptom alerts.
- Google’s SRE guidance: If your on-call engineer is getting paged more than twice per 12-hour shift, your system or your alerting is broken. Fix one of them.
Follow-up: How would you set up distributed tracing for the task scheduler specifically?
- Trace ID propagation. When a task is created via the API, generate a trace ID and attach it to the task payload. The trace ID follows the task through the scheduler (when it is enqueued), the queue (as a message attribute), the worker (when it is picked up), and any downstream calls the worker makes.
- Key spans to instrument: (1) API receives task creation request, (2) task is persisted to PostgreSQL, (3) scheduler selects and enqueues the task, (4) worker picks up the task, (5) worker executes the task handler, (6) worker reports completion to PostgreSQL. These 6 spans form the task lifecycle trace.
- What tracing reveals that metrics do not: Where is the time being spent? If execution delay is 30 seconds, tracing tells you: 2 seconds in the scheduler queue, 25 seconds waiting for a worker to pick it up (workers are saturated), 3 seconds for actual execution. Now you know to scale workers, not optimize the scheduler.
- Tools: AWS X-Ray integrates natively with ECS, Lambda, and SQS. Jaeger or Grafana Tempo for non-AWS or multi-cloud environments. The OpenTelemetry SDK provides vendor-neutral instrumentation.
Q19: Your URL shortener needs to comply with GDPR. A user requests deletion of all their data. Walk me through what actually happens in your distributed system.
Q19: Your URL shortener needs to comply with GDPR. A user requests deletion of all their data. Walk me through what actually happens in your distributed system.
Question
A user in Europe invokes their “right to erasure” under GDPR. They want all their data deleted — every short URL they created, every click event recorded on those URLs, every analytics record, every cache entry. Walk me through how your URL shortener system processes this request across all data stores.Difficulty: Senior to Staff | Concepts tested: Data lifecycle management, distributed data deletion, GDPR/privacy engineering, cache invalidation, the tension between analytics and privacyThe Trap
Most system design candidates never consider data deletion. They design for writes and reads, but not for erasure. In a distributed system with caches, analytics pipelines, backups, and CDN edge servers, “delete the user’s data” is a genuinely hard distributed systems problem.What weak candidates say:“Delete the rows from the database and clear the cache entries. Send a 200 OK to the user.”This misses: analytics data in the data warehouse, events in Kafka that have already been consumed, data in CDN edge caches, data in database backups and snapshots, data replicated to other regions, and the URL mapping entries that other people’s short URLs might reference.What strong candidates say:GDPR deletion in a distributed system is not a single operation — it is a workflow that touches every data store, and it must be auditable. Let me walk through each layer.-
Step 1 — Record the deletion request. Before deleting anything, create an immutable record:
{user_id, request_time, status: "processing", audit_trail: []}. This record itself is your proof of compliance. GDPR requires you to complete deletion within 30 days and provide evidence that you did. - Step 2 — Primary database (DynamoDB). Query all short URLs owned by this user. For each URL, you have two options: Option A — Hard delete: Remove the row entirely. The short URL stops working. Anyone who has this short URL bookmarked gets a 404. This is the cleanest for GDPR but breaks existing links. Option B — Soft delete with anonymization: Replace the user ID with a tombstone value, remove PII from metadata, but keep the URL mapping alive. The short URL continues to redirect, but it is no longer associated with the user. This preserves link functionality while complying with GDPR (the “right to erasure” applies to personal data, not necessarily to a random string mapping to a URL). The correct choice depends on your legal counsel’s interpretation. In my experience, Option B is what most companies implement because breaking millions of existing links creates customer support nightmares.
-
Step 3 — Redis cache. For every short URL key belonging to this user, issue
DELcommands. If you are using Redis Cluster, the keys are distributed across shards, so this is a scatter-gather operation. At 10,000 URLs per user, this takes seconds. The subtle problem: CDN edge caches (CloudFront). CloudFront caches redirect responses at edge locations worldwide. You must issue an invalidation for each URL:POST /2019-01-01/distribution/{id}/invalidationwith the URL paths. CloudFront invalidation takes 10-60 seconds to propagate globally. During that window, cached redirects still work. This is generally accepted as compliant (you initiated the deletion promptly; propagation delay is a technical limitation). - Step 4 — Analytics and click data. This is the hardest part. Click events are in your analytics pipeline — Kafka topics, S3 data lake, Athena tables, Redshift warehouse, possibly third-party analytics (Google Analytics, Amplitude). For Kafka: events that are still in Kafka (within the retention period) cannot be selectively deleted. Kafka does not support single-message deletion. Your options: (1) wait for the events to age out of Kafka’s retention window (typically 7 days), (2) produce “tombstone” events that downstream consumers interpret as “delete any records for this user,” or (3) implement Kafka’s log compaction where a null-value message for a key deletes that key. For the data warehouse (S3 + Athena): run a deletion job that scans Parquet files in S3, removes or anonymizes records matching the user ID, and rewrites the files. This is expensive for large datasets. Tools like Apache Spark or AWS Glue handle this, but it can take hours for terabytes of data. Schedule these as batch jobs that run nightly, processing all pending deletion requests. For third-party analytics: call their deletion APIs (Google Analytics User Deletion API, Amplitude’s user privacy API). Each has different SLAs — Google processes deletions within 72 hours.
- Step 5 — Backups. This is the most uncomfortable part. Your RDS snapshots and S3 backups contain the user’s data. Retroactively modifying backups is impractical (and some argue impossible without compromising backup integrity). The GDPR-accepted approach: document that backups contain data for deleted users, set backup retention policies (e.g., 30 days), and ensure that if a backup is restored, the deletion request is re-processed. Maintain a “deletion blocklist” — a list of user IDs whose data must be re-deleted if a backup is restored.
-
Step 6 — Confirmation. Update the deletion request record:
{status: "completed", completed_at, audit_trail: [{store: "DynamoDB", action: "anonymized", count: 1523}, {store: "Redis", action: "deleted"}, ...]}. Email the user confirming deletion. Retain the audit record (without PII) for compliance evidence.
Follow-up: How do you handle the case where one user’s short URL appears in another user’s analytics — for example, in a referrer field?
- This is the “interconnected data” problem and it has no clean technical solution.
- If User A created a short URL that User B clicked, User A’s deletion request should remove User A’s ownership and creation data, but User B’s click record (“I clicked link XYZ”) is User B’s data, not User A’s.
- The practical approach: anonymize the short URL in User B’s click record. Replace the URL creator’s user ID with a hash or “deleted user.” The click event still records that User B clicked something, but it is no longer linkable to User A.
- Legal nuance: GDPR distinguishes between “data controller” (you, the platform) and “data subject” (the user). The deletion obligation applies to the data subject’s personal data. A short URL string (e.g., “abc123”) is arguably not personal data unless it can be linked back to the user. Once you remove the link (anonymize the creator), the short URL itself is anonymous.
Follow-up: How would you build this deletion capability into the system from day one instead of retrofitting it?
- Data tagging at write time. Every record written to any store includes a
data_owner_idfield. This is the user ID whose deletion would affect this record. An analytics event about User B clicking User A’s link hasdata_owner_id = [user_A, user_B]— it is relevant to both users’ deletion requests. - Centralized data catalog. Maintain a registry (a simple database table or a service like AWS DataBrew or Collibra) that lists every data store, what PII it contains, and the deletion procedure. This is the first thing auditors ask for.
- Deletion as a first-class API.
POST /api/v1/users/{id}/deletion-requestis a production endpoint, not an admin script. It is tested, monitored, and covered by the same SLO as your creation endpoints. If creating a user is automated, deleting a user must be automated. - Retention policies by default. Every data store has a TTL or a lifecycle policy. Analytics events are retained for 90 days. Kafka topics have 7-day retention. Database records older than the retention period are archived and anonymized automatically. This means most deletion requests only need to touch data within the retention window — older data is already gone.
Q20: You just inherited a system where P99 latency is 10x worse than P50. The team says 'that is normal.' Is it?
Q20: You just inherited a system where P99 latency is 10x worse than P50. The team says 'that is normal.' Is it?
Question
Your notification system has a P50 latency of 50ms and a P99 latency of 500ms — a 10x ratio. The team says this is expected. Should you accept this, and what are the possible causes of such a large gap between P50 and P99?Difficulty: Senior | Concepts tested: Latency distribution analysis, tail latency causes, performance debugging methodology, when to optimize vs when to acceptThe Trap
Two opposite traps. Trap 1: “10x is fine, long tails are normal in distributed systems.” Trap 2: “10x is terrible, we need to fix this immediately.” The interviewer wants nuanced reasoning, not a knee-jerk reaction.What weak candidates say:“P99 is always higher than P50. That is just how distributed systems work. As long as the average is okay, it is fine.” This dismisses a potentially serious problem without investigation.Or: “We should optimize until P99 is closer to P50.” This ignores that tail latency has diminishing returns and the effort to reduce P99 by 50% might not be worth it.What strong candidates say:Whether a 10x P50-to-P99 ratio is acceptable depends entirely on what is causing it and whether the P99 latency violates the SLO. Let me walk through the most common causes of fat tail latency, because the cause determines the action.- Cause 1 — Garbage collection pauses (JVM, Go GC, .NET GC). If your workers run on a garbage-collected runtime, GC pauses cause periodic latency spikes. A typical JVM full GC takes 200-500ms — exactly matching the P99. This affects ~1-2% of requests (those that happen to land during GC). Diagnosis: correlate P99 spikes with GC logs. If every P99 spike aligns with a GC event, you found the cause. Fix: tune the GC (use G1GC or ZGC for Java, reduce heap pressure), reduce object allocation rate, or switch to a non-GC runtime for latency-sensitive paths (Rust, C++). Discord migrated from Go to Rust for their message read path specifically because Go’s GC pauses were causing P99 spikes.
- Cause 2 — Cache misses on the long tail. P50 represents the majority of requests that hit the cache (5ms). P99 represents the 1% that miss the cache and hit the database (500ms). A 99% cache hit rate means 1% of requests are 100x slower. Diagnosis: split latency metrics by cache hit vs cache miss. If cache-miss latency accounts for the entire P99 tail, that is the cause. Fix: improve cache hit rate (larger cache, better eviction policy, cache warming), or reduce cache-miss latency (faster database, pre-computed results, read replicas closer to the application).
- Cause 3 — One slow downstream dependency. The notification system calls APNs (push), SendGrid (email), and Twilio (SMS). APNs and SendGrid respond in 50ms. Twilio occasionally takes 500ms for international SMS delivery confirmation. If 1-2% of notifications go through SMS, Twilio’s latency becomes the P99. Diagnosis: break down latency by channel. If SMS latency accounts for the tail, that is the cause. Fix: make SMS delivery asynchronous — return a response to the caller before waiting for Twilio’s delivery confirmation. Track SMS delivery status via a webhook callback instead.
- Cause 4 — Resource contention at peak. During traffic bursts, workers compete for database connections, Redis connections, or CPU. Most requests proceed normally, but a few are delayed while waiting for a resource. This manifests as a latency spike that correlates with traffic volume. Diagnosis: overlay P99 latency with request rate. If P99 spikes correlate with traffic peaks, you have a contention problem. Fix: increase connection pool sizes, add read replicas, or implement admission control (shed load before contention causes cascading slowness).
- Cause 5 — Retry amplification. A small percentage of requests fail on the first attempt and are retried. The retried request’s total latency includes the original attempt timeout + the retry. If the timeout is 200ms and the retry succeeds in 50ms, the total latency is 250ms+ for retried requests. If 1-2% of requests are retried, this becomes the P99. Diagnosis: check retry rates and correlate with P99. Fix: reduce the initial timeout (if it is too generous), fix the root cause of failures (so retries are not needed), or use hedged requests (send the request to two backends simultaneously, use whichever responds first).
- Does the P99 violate the SLO? If the SLO is “95% of requests under 200ms” and P99 is 500ms, you are violating your SLO for 1% of users. Fix it.
- Is the P99 getting worse over time? A stable 10x ratio is one thing. A ratio that was 5x last month and 10x this month indicates a growing problem. Investigate.
- Does the P99 affect user experience? A 500ms notification delivery is imperceptible. A 500ms API response that blocks a UI render is painful. Context matters.
- What is the cost to improve? Reducing P99 from 500ms to 100ms might require a complete architecture change (add caching, change the database, rewrite in a different language). If the SLO is met and users are not complaining, the engineering effort is better spent elsewhere.
Follow-up: How do you measure and visualize latency distributions properly?
- Never use averages for latency. An average of 100ms could mean all requests are 100ms, or it could mean 99% are 10ms and 1% are 10,000ms. Averages hide the tail.
- Use histograms, not summaries. Prometheus histograms let you calculate arbitrary percentiles after the fact. Summaries calculate percentiles on the client side and cannot be aggregated across instances. If you have 50 instances, you cannot average their P99s to get the system P99 — you need the merged histogram.
- Track P50, P90, P95, P99, P99.9. The jumps between these percentiles tell you the shape of the tail. P50=50ms, P90=60ms, P99=500ms says the tail is caused by a small fraction of very slow requests (bimodal distribution). P50=50ms, P90=200ms, P99=500ms says latency degrades gradually (likely contention-related).
- Grafana heatmaps are the best visualization for latency distributions. They show request count at each latency bucket over time. You can visually spot bimodal distributions, GC pauses (regular spikes), and gradual degradation.
Follow-up: What is a “hedged request” and when would you use one to reduce tail latency?
- A hedged request sends the same request to two (or more) backend servers simultaneously and uses whichever response arrives first. The slower response is discarded.
- When to use: When your tail latency is caused by unpredictable backend slowness (e.g., one replica has a cold cache, one database node is doing a compaction). The probability that BOTH backends are slow simultaneously is much lower than either one being slow independently. If each backend has a 1% chance of being slow, the hedged request has a 0.01% chance.
- The cost: You double the load on the backend. This is acceptable when the tail latency problem is severe and the extra load is within capacity. Google uses hedged requests extensively in their internal systems (documented in their “The Tail at Scale” paper by Jeff Dean and Luiz Barroso).
- Optimization: Do not hedge immediately. Send the first request, wait for P50 latency (say, 50ms). If no response, send the hedge. This avoids doubling load for the 50% of requests that complete quickly.
Q21: Your notification system serves 50 tenants. One tenant sends 100x more traffic than the others. How do you prevent them from degrading everyone else's experience?
Q21: Your notification system serves 50 tenants. One tenant sends 100x more traffic than the others. How do you prevent them from degrading everyone else's experience?
Question
You have built the notification system as a multi-tenant SaaS platform. Tenant A (a large e-commerce company) is sending 500 million notifications per day. Tenant B (a small startup) sends 50,000 per day. Tenant A’s traffic is causing queue delays that affect Tenant B’s critical notifications. How do you architect tenant isolation?Difficulty: Staff-Level | Concepts tested: Multi-tenancy, noisy neighbor problem, resource isolation, fairness algorithms, SaaS architectureThe Trap
The obvious answer is “rate limit Tenant A.” But rate limiting does not solve the problem — it just rejects Tenant A’s traffic, which they are paying for. The real question is how to provide proportional service to all tenants without penalizing anyone.What weak candidates say:“Just rate limit per tenant. If Tenant A exceeds their quota, reject their notifications.” This treats the symptom, not the cause. Tenant A is paying for 500M notifications/day. They should get what they pay for without hurting others.What strong candidates say:The noisy neighbor problem in multi-tenant systems is one of the hardest problems in SaaS architecture because you need to simultaneously satisfy three competing goals: maximize utilization (do not waste capacity), guarantee fairness (no tenant degrades another), and honor commitments (each tenant gets what they paid for).-
Tier 1 — Queue isolation (the most impactful change). The root cause is that all tenants share the same processing queue. Tenant A’s 500M notifications are ahead of Tenant B’s critical 2FA code in the same FIFO queue.
Solution: separate queues per tenant tier. Create three queue pools:
- Enterprise queue pool (Tenants like A with high volume): separate SQS queue per enterprise tenant. Dedicated worker fleet.
- Standard queue pool (Mid-size tenants): shared SQS queue with weighted fair queuing. Workers process messages in round-robin across tenants, not FIFO.
- Critical queue (All tenants, P0 notifications only): 2FA codes and security alerts from ALL tenants go to a dedicated high-priority queue with its own worker fleet. This queue is never starved by bulk traffic.
-
Tier 2 — Compute isolation. Beyond queue isolation, you need to ensure Tenant A’s workers do not starve other tenants of compute resources.
- Per-tenant worker limits. In the shared worker pool, no single tenant can occupy more than 40% of workers at any time. Implement this as a semaphore: before a worker picks up a task, check
tenant_worker_count[tenant_A]. If it exceeds the limit, skip Tenant A’s messages and pick from another tenant. - Kubernetes namespace isolation. For enterprise tenants, run their workers in a separate Kubernetes namespace with resource quotas (CPU limits, memory limits). A memory leak in Tenant A’s notification handler cannot OOM-kill Tenant B’s workers.
- Per-tenant worker limits. In the shared worker pool, no single tenant can occupy more than 40% of workers at any time. Implement this as a semaphore: before a worker picks up a task, check
-
Tier 3 — Downstream dependency isolation. Tenant A’s 500M emails per day can exhaust your SendGrid API rate limit, blocking Tenant B’s emails.
- Separate API keys per tenant tier. Enterprise tenants get their own SendGrid subaccounts with their own sending limits and IP reputation. Standard tenants share a pooled subaccount.
- Per-tenant connection pools to the database. Use PostgreSQL’s
max_connectionsper role/user. Tenant A’s queries cannot exhaust the connection pool available to other tenants.
- Tier 4 — Weighted fair scheduling. For the shared infrastructure that remains, use a weighted fair queue algorithm. Each tenant has a weight proportional to their plan tier. Workers use the Weighted Fair Queuing (WFQ) algorithm: in each scheduling round, process messages from each tenant proportionally to their weight. Tenant A (weight 100) gets 100x more processing than Tenant B (weight 1), but Tenant B ALWAYS gets their share, even when Tenant A is flooding the system. This is conceptually similar to how network routers handle traffic from multiple flows. The CFS (Completely Fair Scheduler) in the Linux kernel does the same thing for CPU time across processes.
Follow-up: How do you price and enforce usage tiers without making the system overly complex?
- Token bucket per tenant. Each tenant has a token bucket with capacity and refill rate set by their plan tier. Free: 10K notifications/day. Pro: 1M/day. Enterprise: custom. The token bucket naturally handles bursts (a Pro tenant can send 10K notifications in a minute if they have accumulated tokens) while enforcing daily limits.
- Soft limits vs hard limits. For paying tenants, use soft limits: when they exceed 80% of their quota, send a warning notification. When they exceed 100%, start queuing (not rejecting) their notifications with lower priority. Only hard-reject at 200% (likely a bug or misconfiguration). This balances fairness with customer experience.
- Usage metering in a separate pipeline. Do not check quotas synchronously on every notification send — that adds latency. Instead, increment a counter asynchronously (in Redis or a metering service) and check it periodically (every 100 notifications or every 10 seconds). The 10-second granularity means a tenant can briefly exceed their limit before enforcement kicks in, but this is acceptable for all but the most cost-sensitive scenarios.
Follow-up: At what point do you move from multi-tenant to single-tenant (dedicated infrastructure per customer)?
- Revenue threshold. When a customer’s annual contract value exceeds the cost of dedicated infrastructure by 3-5x, it makes economic sense to give them their own stack. A customer paying 30K/year to run dedicated is a clear case.
- Compliance requirements. Customers in healthcare (HIPAA), finance (SOC 2 Type II with strict data isolation), or government (FedRAMP) often require single-tenant deployments. No amount of logical isolation satisfies an auditor who requires physical isolation.
- Performance isolation needs. If the customer’s workload is so large or unique that it would require constant tuning of the shared infrastructure, dedicate the infrastructure. The engineering time spent tuning shared infrastructure for one customer exceeds the cost of just giving them their own.
- The hybrid approach: Most SaaS companies end up with a “cell-based architecture” — a set of cells (independent deployment units), each serving a subset of customers. Small customers share a cell. Large customers get their own cell. Each cell is identical in code and infrastructure, just different in the customer set it serves. Slack, Notion, and Figma all use this pattern. It gives you the operational simplicity of a standard deployment with the isolation guarantees of single-tenant.
Q22: I hand you a system design that uses eventual consistency everywhere. Convince me there is a bug hiding in the design, even though all the tests pass.
Q22: I hand you a system design that uses eventual consistency everywhere. Convince me there is a bug hiding in the design, even though all the tests pass.
Question
Consider this scenario: a user creates a short URL, immediately copies it and shares it in a group chat. A friend clicks the link 2 seconds later. In your eventually consistent system, what happens, and why do all your integration tests pass while this fails in production?Difficulty: Senior to Staff | Concepts tested: Consistency model blind spots, read-your-writes, testing vs production divergence, real-world failure patternsThe Trap
The trap is that eventual consistency “sounds right” for a URL shortener (which is read-heavy with immutable data). But there is a specific window where it breaks, and the typical testing environment cannot reproduce the failure.What weak candidates say:“Eventual consistency is fine for a URL shortener because URLs are immutable once created. There is no conflict to resolve.”This is correct in steady state. But it ignores the creation-to-first-read window.What strong candidates say:The bug is in the creation-to-first-read window, and it is a classic read-your-writes consistency violation. Let me walk through exactly what happens.-
The sequence of events:
- User creates short URL
s.ly/abc123by calling the write endpoint. The API server writes to the DynamoDB primary in us-east-1. Returns success. - User copies the link and pastes it in a group chat (2 seconds later).
- Friend in Europe clicks the link. Their request is routed to eu-west-1 (closest region).
- The eu-west-1 DynamoDB replica has not yet received the replication of
abc123. DynamoDB cross-region replication (Global Tables) has a replication lag of typically under 1 second, but can spike to 5-10 seconds under load. - The friend gets a 404 Not Found. The URL does not exist yet in their region.
- User creates short URL
- Why tests do not catch this: Integration tests run in a single region with a single database instance. Writes are immediately visible to reads. The test creates a URL and immediately resolves it — it works every time because there is no replication lag. The production failure only occurs when the reader is in a different region than the writer, and the read happens within the replication window.
- Why this is more common than people think: At scale, this is not a rare edge case. If your URL shortener handles 100M creates/day and the average replication lag is 500ms, and 5% of links are clicked within 2 seconds of creation, that is 5M potentially affected clicks per day. Even if only 10% of those hit a lagging replica, that is 500K 404s per day for URLs that actually exist. This is a serious production bug.
- Layer 1 — Read-your-writes for the creator. After creating a URL, route the creator’s subsequent requests to the same region that handled the write (session affinity or a cookie that specifies the write region). This ensures the creator can always resolve their own URLs. But this does not help the friend in Europe.
-
Layer 2 — Retry on 404 with cross-region fallback. When a URL resolution returns 404, before returning the error to the user, the edge service checks the primary region directly: “Does this URL exist in us-east-1?” If yes, serve the redirect (with slightly higher latency due to the cross-region call). If no, return a genuine 404. This adds one network hop for the rare case of a replication-lagging read, but eliminates false 404s entirely.
This is similar to what DynamoDB Global Tables does with the
ReplicaReadMode: STRONGsetting when available, or what you can implement at the application level. - Layer 3 — Write-through to all regions. For the URL creation path (which is low-volume compared to reads), synchronously write to all regions before returning success to the user. This guarantees the URL is available everywhere before the creator shares it. The trade-off: create latency increases from 50ms (single-region write) to 200ms (multi-region synchronous write). For a URL shortener, this is acceptable because URL creation is infrequent and not latency-sensitive.
Follow-up: How would you test for this in your CI/CD pipeline?
- Simulated replication lag in integration tests. Add a test mode where writes to the “primary” test database are delayed before being visible to the “replica” test database. Use a proxy like Toxiproxy or a test double that introduces artificial read delays. Your test creates a URL, waits 0ms, and tries to resolve it against the “replica.” It should get a 404, which triggers the cross-region fallback.
- Contract test: The URL creation endpoint’s response should include a header like
X-Write-Region: us-east-1. The resolution endpoint, on receiving a 404, should check for aX-Fallback-Regionquery parameter. The contract test verifies this fallback behavior. - Canary in production. Deploy a canary that creates a URL in region A and immediately tries to resolve it in region B. Measure the time until the resolution succeeds. Alert if this exceeds 5 seconds. This is a real-time consistency monitor.
Follow-up: Are there other places in the six system designs where eventual consistency creates hidden bugs?
- News feed — ghost posts. A user unfollows someone, but the follow graph has not propagated to the fan-out workers yet. The unfollowed user’s next post still gets fanned out to the unfollower’s feed. The user sees a post from someone they just unfollowed — a creepy experience.
- Notification preferences — unwanted notifications. A user disables email notifications. The preference update has not propagated to the notification worker’s cache. The user receives one more email. This is not just a UX issue — for marketing emails, it can be a CAN-SPAM violation.
- Task scheduler — duplicate execution. A task’s status is updated to COMPLETED in the primary database, but the scheduler reads from a replica that still shows RUNNING. The scheduler re-enqueues the task. Without idempotency, the task executes twice.
Interview: You are architecting a payments platform expected to run for at least a decade. The CTO wants you to justify every technology choice from a 'survivability over a 10-year horizon' lens, not a 'shipped in a quarter' lens. How do you structure that review?
Interview: You are architecting a payments platform expected to run for at least a decade. The CTO wants you to justify every technology choice from a 'survivability over a 10-year horizon' lens, not a 'shipped in a quarter' lens. How do you structure that review?
- “The CTO asks you to pick between a managed cloud database and self-hosted PostgreSQL for the ledger, with a 10-year horizon. What do you choose?” - Strong answer: Self-hosted on the cloud provider’s raw disks or dedicated instances, because managed-DB feature deprecation and forced major-version upgrades on the vendor’s schedule are the single biggest source of “forced rewrite” surprises over a decade; you trade operational effort for schedule sovereignty.
- “How do you handle the case where a vendor you bet on ‘pivots’ away from your use case three years in?” - Strong answer: Every third-party dependency enters a quarterly review; if the vendor’s roadmap drifts from yours, you pre-position a replacement behind the same port and do a rotating canary over 12-18 months rather than waiting for the forcing function.
- “How do you convince a product org that is pushing for quarterly features to fund the 10-year hygiene work?” - Strong answer: Translate hygiene debt into a concrete forecasted incident rate and a forecasted migration cost; the conversation changes from “tech debt” to “a projected 3-month company-wide outage window in year 6 that we can avoid by spending 10% of eng capacity now.”
- “I would use the newest and most scalable option for everything to be future-proof.” - confuses novelty with longevity; newest tech has the shortest track record and the highest chance of being abandoned.
- “Design for 10x scale from day one.” - optimizes the wrong axis; 10-year survivability is about change tolerance, not peak throughput, and premature scale architecture often ossifies the very seams you will need to move later.
- Building Evolutionary Architectures by Neal Ford, Rebecca Parsons, Patrick Kua
- Stripe Engineering blog: “Online migrations at scale” (2017)
- Related chapter in this series: Design Patterns, section on Strangler Fig and Hexagonal Architecture
Interview: A large legacy service is slow, flaky, and feared. A staff engineer proposes a full rewrite. Another proposes a 2-year incremental refactor. You own the decision. Walk me through how you actually choose - not in principle, but with the specific signals you look for.
Interview: A large legacy service is slow, flaky, and feared. A staff engineer proposes a full rewrite. Another proposes a 2-year incremental refactor. You own the decision. Walk me through how you actually choose - not in principle, but with the specific signals you look for.
- “What if the rewrite is driven by a single senior engineer who wants to use a new language they are excited about?” - Strong answer: That is a red flag, not a reason; language choice should fall out of constraints (team skills, ecosystem, operational fit), not drive them. I would make them write the decision memo with explicit alternatives considered, which usually reveals whether it is an engineering decision or a resume decision.
- “You chose refactor, but two years in the team is burned out and morale is collapsing. Do you switch to a rewrite?” - Strong answer: Probably yes, but the real question is whether the refactor plan was wrong or whether leadership failed to protect capacity; switching tactics without understanding which of those it was just teleports the same failure mode into the new plan.
- “How do you frame this decision for a non-technical CEO who just wants to know ‘is it worth it?’” - Strong answer: In terms of customer-visible outcomes and risk bands - “refactor is 18 months of gradual P99 improvement with low outage risk; rewrite is 30 months of flat performance then a step function with meaningful outage risk on cutover.”
- “Always refactor, never rewrite.” - ignores the case where the data model or concurrency model is the problem and no amount of refactoring will fix it.
- “Rewrite in a new language because the old one is unfashionable.” - confuses aesthetics with engineering; the cost is paid by the future team, not the current one.
- Working Effectively with Legacy Code by Michael Feathers
- Joel Spolsky: “Things You Should Never Do, Part I” (the original Netscape rewrite essay, 2000)
- Related chapter in this series: Design Patterns, section on Strangler Fig Pattern
Interview: Product demands a new feature that will push P99 latency past the 99.95% SLA you committed to last quarter. Legal says the SLA is contractual. Engineering says the feature is feasible only if latency can degrade. How do you actually resolve this without everyone losing?
Interview: Product demands a new feature that will push P99 latency past the 99.95% SLA you committed to last quarter. Legal says the SLA is contractual. Engineering says the feature is feasible only if latency can degrade. How do you actually resolve this without everyone losing?
- “The product team says shipping the feature asynchronously would hurt the user experience too much. How do you push back?” - Strong answer: Ask for the UX research, not the opinion; often ‘hurts UX’ means ‘we have not designed the optimistic UI for it yet,’ and a short UX spike on loading states or optimistic rendering resolves the objection without changing the latency profile.
- “An exec overrides you and says ‘ship it anyway, we will deal with the SLA later.’ What do you do?” - Strong answer: Document the decision, the forecasted breach probability, and the accepted exposure in writing with legal and the exec’s signoff; senior engineers do not ship silent risk, they ship acknowledged risk with a paper trail.
- “How do you prevent this exact conflict from happening again on the next feature?” - Strong answer: Add a ‘latency budget review’ as a non-negotiable checkpoint in the feature kickoff template, so new features enter planning already knowing how many milliseconds of the SLA budget they are allowed to consume.
- “Just ship it, we will optimize later.” - optimistic rollouts that violate contractual SLAs are not engineering decisions; they are unilateral business decisions disguised as engineering ones.
- “Block the feature until latency is solved.” - ignores that SLAs can be renegotiated and that a tiered rollout may satisfy both constraints.
- Site Reliability Engineering by Google (chapter on Service Level Objectives)
- Cloudflare postmortem: “Details of the Cloudflare outage on July 2, 2019”
- Related chapter in this series: System Design Practice, section on 99.99% availability (Q11)