Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

System Design Practice Problems

These six problems are structured using the Answer Framework: Clarify, Estimate, Design, Deep Dive, and Trade-Offs. Each solution demonstrates the thinking process that separates senior engineers from mid-level candidates.
How to use this guide: Do not read the solutions immediately. For each problem, spend 30-45 minutes sketching your own design first. Then compare your approach against the guided solution. The goal is to internalize the framework, not memorize answers.
Common mistake: Candidates jump straight to drawing boxes. Interviewers want to see you think before you draw. The first 10 minutes of clarifying requirements and estimating scale are what set senior candidates apart.

How Interviewers Score System Design

Before diving into the problems, you should understand exactly what interviewers are evaluating. Most companies — from startups to FAANG — use a rubric with four to five dimensions. Knowing the rubric does not mean gaming the interview; it means spending your time on the things that actually earn points.

1. Requirements Gathering and Problem Scoping (15-20% of evaluation)

What interviewers look for: Do you ask clarifying questions before designing? Do you distinguish functional from non-functional requirements? Do you scope the problem to something achievable in 45 minutes instead of trying to design everything?Strong signal: The candidate asks 5-8 targeted questions, writes down explicit requirements, and gets confirmation before drawing anything. They proactively state assumptions: “I am going to assume we need 99.9% availability and eventual consistency is acceptable for analytics. Does that align with what you had in mind?”Weak signal: The candidate either asks zero questions (jumps to a solution) or asks 15+ unfocused questions (stalling). They design for vague requirements and get surprised by follow-ups.Scoring:
  • No hire: Starts designing without understanding the problem. Makes incorrect assumptions that fundamentally change the design.
  • Lean no hire: Asks a few surface-level questions but misses critical requirements (scale, consistency, availability).
  • Lean hire: Asks good questions, identifies the key requirements, and scopes the problem appropriately.
  • Strong hire: Asks questions that reveal non-obvious constraints. Proactively identifies the hardest part of the problem before being prompted.

2. High-Level Design (25-30% of evaluation)

What interviewers look for: Can you sketch a coherent architecture with the right major components? Does data flow logically between them? Are the API contracts reasonable? Is the design appropriate for the scale you estimated?Strong signal: The candidate draws a clear diagram with labeled components and data flow arrows within 5-10 minutes. They explain the responsibility of each component and why it exists. The API design is clean and follows REST/gRPC conventions.Weak signal: The candidate draws a tangled diagram with unlabeled boxes, cannot explain how data flows from client to storage, or proposes an architecture that does not match the requirements they just gathered.Scoring:
  • No hire: Cannot produce a coherent end-to-end design. Major components are missing or misconnected.
  • Lean no hire: Design works but is overly simplistic or missing critical components (no cache for a read-heavy system, no queue for async processing).
  • Lean hire: Solid design with appropriate components. Data flow is clear. Handles the stated requirements.
  • Strong hire: Design is elegant and scalable. Components are well-chosen for the access patterns. The candidate identifies the 2-3 critical paths and explains why they drive the architecture.

3. Deep Dive and Technical Depth (25-30% of evaluation)

What interviewers look for: Can you go deep on the most critical component? Do you understand algorithms, data structures, and protocols at a level that would let you actually build this? Can you walk through a request lifecycle step by step?Strong signal: The candidate picks the most important component and explains it in detail — data model, algorithm choice with trade-offs, failure handling, and performance characteristics. They use precise technical vocabulary (not buzzwords) and can answer follow-up questions with confidence.Weak signal: The candidate stays at the surface level on everything. They name technologies (Redis, Kafka, Cassandra) without explaining why or how they would configure them. They cannot walk through a concrete request lifecycle.Scoring:
  • No hire: Cannot go deep on any component. Lacks understanding of basic data structures, algorithms, or protocols.
  • Lean no hire: Can go somewhat deep but makes factual errors or proposes approaches that would not work in practice.
  • Lean hire: Demonstrates solid depth on 1-2 components. Makes correct technical choices and can justify them.
  • Strong hire: Demonstrates expert-level depth. Brings up non-obvious considerations (race conditions, hot partitions, clock skew). Could convincingly lead implementation of this system.

4. Trade-Off Discussion and Decision Quality (15-20% of evaluation)

What interviewers look for: Does the candidate present alternatives before committing to a choice? Do they articulate what they are trading away with each decision? Are their trade-offs grounded in the specific requirements, not generic platitudes?Strong signal: “I am choosing a sliding window counter over a token bucket here because our requirements emphasize precision over burst tolerance. If the product team later says they want to allow controlled bursts, we would switch to token bucket — the Redis data model change is minimal.”Weak signal: “I chose Redis because it is fast.” (No alternative considered, no trade-off articulated, no connection to requirements.)Scoring:
  • No hire: Makes choices without considering alternatives. Cannot articulate downsides of their own design.
  • Lean no hire: Mentions alternatives but the comparison is shallow or incorrect.
  • Lean hire: Presents clear trade-offs for major decisions. Connects choices to requirements.
  • Strong hire: Trade-offs are nuanced and specific. The candidate identifies second-order consequences and explains when they would revisit decisions. Demonstrates the judgment you would want from a tech lead making real architecture choices.

5. Communication and Collaboration (10-15% of evaluation)

What interviewers look for: Is the candidate easy to follow? Do they structure their explanation logically? Do they check in with the interviewer? Do they respond well to hints and pushback?Strong signal: The candidate signals transitions (“Now let me move to the deep dive on message ordering”), checks in periodically (“Should I go deeper here or move to the caching layer?”), and responds to interviewer pushback by genuinely reconsidering rather than defending their position rigidly.Weak signal: The candidate monologues for 10 minutes without checking in, gets defensive when challenged, or is disorganized in their presentation (jumping between components randomly).Scoring:
  • No hire: Difficult to follow. Defensive or dismissive of feedback. Cannot adapt when the interviewer redirects.
  • Lean no hire: Somewhat organized but misses cues from the interviewer. Does not adjust depth based on interviewer interest.
  • Lean hire: Clear, structured communication. Responds well to hints and redirections.
  • Strong hire: Exceptional communicator. Treats the interview as a collaborative design session. Makes the interviewer feel like they are working with a great colleague.
When you practice with the problems in this guide, score yourself (or have your practice partner score you) on each dimension after every session:
DimensionScore (1-4)Notes
Requirements gatheringDid I ask the right questions?
High-level designIs my architecture coherent and complete?
Technical depthCould I actually build the component I deep-dived on?
Trade-off discussionDid I present alternatives and justify my choices?
CommunicationWas I structured, clear, and collaborative?
Scoring guide: 1 = No hire, 2 = Lean no hire, 3 = Lean hire, 4 = Strong hireThe most common failure pattern: Candidates score 3-4 on technical depth but 1-2 on requirements gathering and trade-offs. They know the technology cold but do not demonstrate the engineering judgment that senior roles require. If this is you, force yourself to spend the first 8 minutes on requirements and the last 5 minutes on explicit trade-off discussion.
Inside knowledge: At most top companies, each interviewer fills out a scorecard with these dimensions (or similar) immediately after the interview. They write 2-3 sentences per dimension with specific evidence. Interviewers who cannot cite specific evidence for a “hire” rating are trained to downgrade. This means you need to give them quotable moments — specific statements that demonstrate each competency. “The candidate proactively identified the celebrity fan-out problem and proposed a dynamic threshold” is the kind of evidence that gets written down.
The scoring dimensions above connect directly to the behavioral frameworks covered in Interview Meta-Skills. The communication dimension in particular benefits from the structured thinking and verbal framing techniques discussed there. For a deeper understanding of how to manage your time within a 45-minute system design interview, see the time allocation strategies in that chapter.
Understanding the scoring rubric is necessary but not sufficient. The difference between a senior hire and a staff hire is not about knowing more components — it is about the altitude of your thinking.
What a Senior engineer demonstrates: They ask good clarifying questions, produce a coherent architecture with the right components, go deep on 1-2 critical areas, and articulate trade-offs grounded in the requirements. They show they can build the system.What a Staff/Principal engineer adds: They reframe the problem before solving it (“Are we sure a URL shortener is the right product? Could a redirect service at the CDN edge solve this without a custom backend?”). They connect the design to organizational constraints (“This architecture requires two on-call rotations — do we have the team size for that?”). They identify second-order consequences (“If we use 301 redirects for SEO, the analytics team loses visibility, which will create a cross-team conflict in Q3”). They demonstrate they can own the system across its entire lifecycle — build, operate, evolve, and eventually deprecate it.
Large language models and AI coding assistants are reshaping how engineers approach system design — both in preparation and in practice.
  • Using LLMs for estimation sanity checks: Tools like ChatGPT and Claude can validate your back-of-envelope calculations in real-time. Paste your estimation and ask “Does this math check out for a system at Twitter’s scale?” The LLM will flag order-of-magnitude errors. This does not replace the skill of estimation — interviewers still want to see you do it — but it accelerates your practice iterations.
  • Architecture diagram generation: Copilot and similar tools can generate PlantUML, Mermaid, or ASCII architecture diagrams from natural language descriptions. In practice, this means you can iterate on high-level designs faster during preparation. In the interview itself, you still draw by hand.
  • Exploring trade-offs systematically: Ask an LLM “Give me 5 reasons to choose Cassandra over DynamoDB for a chat message store, and 5 reasons to choose the other way.” This forces you to consider angles you might have missed. Then critically evaluate the LLM’s output — it sometimes hallucinates capabilities or misses pricing details.
  • The trap to avoid: Using LLMs to memorize answers rather than understand frameworks. An interviewer who asks “Why Cassandra?” and gets a memorized list of bullet points will immediately follow with “What is Cassandra’s compaction strategy and how does it affect write amplification?” If you cannot go deeper than the LLM-generated summary, the interviewer knows.
  • In production system design: AI-assisted code generation (Copilot, Cursor) accelerates implementation but does not change architectural decisions. The systems in this chapter still need the same components — AI just helps you write the Lua scripts, the Kafka consumers, and the API endpoints faster. The design thinking remains human.
Before tackling the six problems, try these micro-scenarios to warm up your system design muscles. Spend 5 minutes on each — no diagrams, just structured thinking out loud.
  1. The Estimation Drill: A photo-sharing app has 200M daily active users. Each user uploads 2 photos per day (average 3MB each). How much new storage per day? Per year? What is the upload bandwidth requirement at peak (5x average)? Can a single server handle the upload throughput?
  2. The Trade-Off Drill: You are choosing between PostgreSQL and DynamoDB for a user profile service. The access pattern is 95% reads by user ID, 5% writes. You expect 10M users and 50K reads/sec at peak. Argue for PostgreSQL in 3 sentences. Now argue for DynamoDB in 3 sentences. Which would you actually choose, and what is the deciding factor?
  3. The Failure Mode Drill: You have a simple web app: Load Balancer, 3 API Servers, and 1 PostgreSQL database. Name every distinct failure scenario (there are at least 6) and the user impact of each. Which failure is the most dangerous and why?

Problem 1: Design a URL Shortener

Difficulty: Easy | Time Target: 35 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
Core FeaturesDo we need custom aliases (e.g., short.ly/my-brand)?Changes the uniqueness and validation logic
ExpirationShould URLs expire? User-configurable TTL?Affects storage strategy and cleanup jobs
AnalyticsDo we need click tracking (geo, referrer, device)?Adds an entire analytics pipeline
ScaleHow many URLs are created per day? Read vs write ratio?Drives every downstream architecture decision
AvailabilityIs it acceptable for a redirect to fail occasionally?Determines replication and consistency model

Requirements You Should Lock Down

Functional:
  • Given a long URL, generate a unique short URL
  • Given a short URL, redirect (HTTP 301/302) to the original
  • Optional: custom aliases, expiration, click analytics
Non-Functional:
  • High availability (redirects must not fail)
  • Low latency (redirect < 100ms)
  • Short URLs should not be guessable (no sequential IDs)
Senior move: Ask whether the redirect should be 301 (permanent, browser caches it) or 302 (temporary, every request hits your server). This one question shows you understand HTTP semantics AND its impact on analytics — if you use 301, the browser caches the redirect and you lose click data.

Back-of-Envelope Calculation

Assumptions:
  New URLs created per day:     100M
  Read-to-write ratio:          100:1
  Average URL length:           500 bytes
  Short URL length:             7 characters
  Data retention:               5 years

Write (Create):
  100M / 86,400 seconds         ~ 1,160 writes/sec
  Peak (3x):                    ~ 3,500 writes/sec

Read (Redirect):
  100:1 ratio                   ~ 116,000 reads/sec
  Peak (3x):                    ~ 350,000 reads/sec

Storage (5 years):
  100M/day x 365 x 5            = 182.5 billion URLs
  Each record ~ 600 bytes       = ~110 TB total
  (short_url + long_url + metadata)

Short URL keyspace:
  7 chars, Base62 (a-z A-Z 0-9) = 62^7 = 3.5 trillion
  182.5 billion << 3.5 trillion  -> 7 chars is sufficient

Cache (80/20 rule):
  20% of URLs generate 80% of traffic
  Daily read requests: 100M x 100 = 10B
  Cache 20% of daily unique URLs:
  100M x 0.2 x 600 bytes        = ~12 GB (fits in memory)
Why this matters: The numbers reveal this is a read-heavy system. That single insight drives every design choice — caching strategy, database read replicas, CDN usage. If you skip estimation, you design blind.
Sanity check your numbers: Our estimate of ~110 TB total storage over 5 years is reasonable — Bitly reportedly stores hundreds of terabytes of link data accumulated over more than a decade. If your estimate is wildly different from known real-world systems, revisit your assumptions. Estimation is about getting the right order of magnitude, not the exact number.

Component Architecture

                           ┌──────────────┐
                           │   Clients    │
                           └──────┬───────┘

                           ┌──────▼───────┐
                           │ Load Balancer │
                           └──────┬───────┘

                    ┌─────────────┼─────────────┐
                    │             │             │
             ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
             │  API Server │ │  ...  │ │  API Server │
             └──────┬──────┘ └───┬───┘ └──────┬──────┘
                    │            │             │
                    └────────────┼─────────────┘

                  ┌──────────────┼──────────────┐
                  │              │              │
           ┌──────▼──────┐ ┌────▼────┐ ┌──────▼──────┐
           │  Redis Cache │ │   DB   │ │  Analytics  │
           │  (read-thru) │ │(write) │ │   (Kafka)   │
           └─────────────┘ └────┬────┘ └─────────────┘

                         ┌──────▼──────┐
                         │ DB Read     │
                         │ Replicas    │
                         └─────────────┘

API Design

POST /api/v1/urls
  Body: { "long_url": "https://...", "custom_alias": "my-brand", "ttl": 3600 }
  Response: { "short_url": "https://short.ly/aB3x7Kp" }

GET /{short_url_key}
  Response: HTTP 302 Redirect → Location: https://original-long-url.com

GET /api/v1/urls/{short_url_key}/stats
  Response: { "clicks": 10432, "created_at": "...", "last_accessed": "..." }

The Core Problem: How to Generate a Unique 7-Character Key?

This is the most important design decision. Three approaches:
1

Approach A: MD5/SHA256 Truncation

Hash the long URL, take the first 7 characters (Base62 encoded).Pros: Deterministic (same URL = same hash), no coordination needed.Cons: Collision risk with truncation. Must check DB and re-hash on collision (append counter, re-hash).
hash("https://example.com/very-long-path")
  → MD5: "a3f2b8c9d1e4..."
  → Base62 first 7: "aB3x7Kp"
  → Collision? Append counter: hash("https://...&1") → retry
2

Approach B: Base62 Counter (Pre-Generated)

Use a global counter (or counter service) and convert to Base62.Pros: Zero collisions, predictable, simple.Cons: Sequential = guessable. Counter is a single point of failure. Solution: allocate ranges to each server (Server 1 gets 1-1M, Server 2 gets 1M-2M).
counter = 1000000
Base62(1000000) = "4c92"  → pad to 7 chars: "004c92x"
3

Approach C: Key Generation Service (KGS) -- Recommended

Pre-generate random 7-character keys in a separate database. Application servers fetch keys in batches.Pros: No collision at runtime, no coordination between app servers, keys are random (not guessable).Cons: Extra service to maintain. Must mark keys as “used” atomically.
KGS DB:
┌────────────┬────────┐
│    key     │  used  │
├────────────┼────────┤
│  aB3x7Kp  │ false  │
│  Zk9mW2q  │ true   │
│  ...       │  ...   │
└────────────┴────────┘

App Server starts → fetches 1000 keys from KGS → uses from local cache
When local cache runs low → fetch another batch
Collision handling matters: If you choose MD5 truncation, you MUST explain your collision resolution strategy. “Just re-hash” is not enough — what if the retry also collides? Use a bloom filter for fast existence checks before hitting the DB.

Database Choice

CriteriaSQL (PostgreSQL)NoSQL (DynamoDB/Cassandra)
SchemaFixed, well-definedFlexible but unnecessary here
Read patternKey lookup (fast with index)Key lookup (native strength)
Write patternSingle row insertSingle row insert
ScaleSharding is manualAuto-sharding built-in
ConsistencyStrong by defaultEventual (configurable)
VerdictGood for < 1B recordsBetter for 100B+ records
Recommendation: NoSQL (DynamoDB or Cassandra). The access pattern is simple key-value lookup at massive scale. No joins, no complex queries. This is exactly what NoSQL was built for.

Caching Strategy

  • Cache-aside with Redis: Check cache first, on miss read from DB and populate cache
  • TTL: Set cache TTL to match URL expiration (or 24 hours for non-expiring URLs)
  • Eviction: LRU — frequently accessed URLs stay hot, long-tail URLs get evicted
  • Cache size: ~12 GB for 20% of daily URLs (fits a single Redis instance)
DecisionChoice MadeAlternativeWhy
Hash methodKGS (pre-generated keys)MD5 truncationZero runtime collisions, no coordination overhead
DatabaseNoSQL (DynamoDB)PostgreSQLSimple key-value at extreme scale
Redirect code302 (temporary)301 (permanent)Preserves analytics; 301 causes browser caching
CacheRedis cache-asideWrite-throughRead-heavy system benefits from cache-aside simplicity
Short URL length7 characters6 characters3.5T keyspace vs 56B; future-proofing is cheap
1

You asked about 301 vs 302

This shows you understand HTTP redirect semantics and their downstream impact on analytics.
2

You calculated before designing

The 100:1 read-write ratio drove the cache-first architecture. Without estimation, you might have over-engineered writes.
3

You compared three hash approaches with trade-offs

Juniors pick one approach. Seniors present options, compare them, and justify their choice.
4

You addressed the analytics pipeline separately

Writing click events to Kafka for async processing shows you know not to block the redirect path with analytics writes.
5

You mentioned bloom filters for collision detection

This demonstrates awareness of probabilistic data structures and performance optimization at scale.

Bitly’s Architecture

Bitly processes billions of shortens and redirects. Their architecture reflects a decade of scaling lessons:
  • Key generation: Bitly uses a counter-based approach with Base62 encoding, allocating counter ranges to application servers to avoid coordination overhead. Each server gets a block of IDs and increments locally, only requesting a new block when the current one is exhausted.
  • Storage: They migrated from MySQL to a distributed data store. Their link mapping data is sharded by the hash of the short URL key, ensuring even distribution across nodes.
  • Caching: Bitly uses a multi-layer caching strategy. Popular links (the top 1-2% by click volume) are cached in memory on the application servers themselves. A shared Redis layer handles the next tier. The database is the fallback of last resort. Given the extreme read-to-write ratio, cache hit rates exceed 99%.
  • Analytics pipeline: Click data flows through Kafka into a real-time processing pipeline. They decouple the redirect path (which must be sub-100ms) from analytics aggregation entirely. Click events are enriched asynchronously with geo, device, and referrer data.
  • Global presence: Bitly uses Anycast DNS and edge servers in multiple regions to ensure redirects are fast worldwide. A redirect request in Tokyo does not need to round-trip to a US data center.

TinyURL’s Original Design

TinyURL, one of the earliest URL shorteners (launched around 2002), took a simpler approach that is instructive as a baseline:
  • Single MySQL database: All short-to-long URL mappings lived in a single MySQL instance with a B-tree index on the short key. For years, this was sufficient because the total dataset was small enough to fit in RAM.
  • Sequential IDs: TinyURL used auto-incrementing MySQL IDs converted to a short alphanumeric string. This is the simplest possible approach and works until you need non-guessable URLs or multi-datacenter writes.
  • No analytics: The original TinyURL had no click tracking at all. This dramatically simplified the architecture — a redirect was a single database lookup and an HTTP 301 response.
  • Lesson: TinyURL proves that the simplest possible design can serve millions of users if the access pattern is simple enough (key-value lookup). The complexity in Bitly’s architecture exists because of analytics, custom domains, enterprise features, and global scale — not because URL shortening itself is hard.
The takeaway for interviews: Start with TinyURL’s simplicity, then explain how you would evolve it toward Bitly’s architecture as requirements grow. This “simple first, scale when needed” narrative is exactly what interviewers want to see.
Real systems fail. Senior engineers design for failure, not around it. Here is how each component in this design can break, and what you do about it.Redis cache goes down:
  • The system does not go down with it — it degrades gracefully. All reads fall through to the database read replicas. Latency increases from ~5ms (cache hit) to ~20-50ms (DB read). Users notice nothing.
  • Risk: if cache is cold and all traffic suddenly hits the DB, you get a thundering herd. Mitigation: implement request coalescing (if 1,000 requests for the same key arrive simultaneously, only one hits the DB and the rest wait for its result) and bring the cache back incrementally using a warm-up script that pre-loads the top 20% of URLs by access frequency.
Database primary goes down:
  • Reads continue from replicas (redirects keep working). Writes (new URL creation) fail. This is acceptable for minutes — redirects are 100x more important than creates.
  • Promote a read replica to primary (automated failover via PostgreSQL Patroni or DynamoDB’s built-in multi-AZ). Typical recovery: 15-30 seconds with automated failover.
Key Generation Service (KGS) goes down:
  • Application servers have locally cached batches of pre-generated keys (1,000+ keys each). They continue creating URLs from their local batch. You have minutes to hours before any server exhausts its cache, depending on batch size.
  • If KGS stays down and local caches deplete, fall back to MD5 hash generation as a degraded-mode strategy.
A hash collision occurs at massive scale:
  • The bloom filter catches it before the DB write. Retry with a different key. If the bloom filter itself has a false positive, the DB’s unique constraint catches it. Two layers of defense, not one.
An entire data center goes offline:
  • Anycast DNS routes traffic to the nearest healthy data center. Cross-region replication ensures the URL database is available in multiple regions. Users in the affected region experience a brief DNS TTL delay (seconds) before traffic re-routes.
The detail most candidates miss: What happens to analytics data when a component fails? Click events should be buffered in Kafka. Kafka retains events for hours or days. When the analytics pipeline recovers, it processes the backlog. No click data is lost — it is just delayed. This separation between the critical path (redirect) and the non-critical path (analytics) is a fundamental reliability pattern.
After you present your URL shortener design, a good interviewer will probe with follow-ups. Here are the most common ones and how to handle them:“What if a malicious user tries to create billions of URLs to exhaust your keyspace?”This is a rate limiting + abuse detection question. Apply per-user and per-IP rate limits on the creation endpoint (see the Rate Limiter problem below). Monitor for anomalous creation patterns. With a 7-character Base62 keyspace of 3.5 trillion keys, even creating 1 billion URLs only consumes 0.03% of the keyspace — but the storage and cost implications are real. Require authentication for URL creation and set per-account quotas.“How would you handle a URL shortener for a URL that itself is a shortened URL (redirect chains)?”Detect redirect chains on creation by following the submitted URL (with a depth limit of, say, 5 hops) to find the final destination. Store the final destination, not the intermediate shortened URL. This prevents redirect chains from degrading user experience and avoids being used as an open redirect attack vector.“If your system needs to work across multiple geographic regions, how do you handle consistency?”Use eventual consistency for the URL database with a conflict-free data model — short URL to long URL mappings are write-once and immutable, making them naturally conflict-free across regions. The KGS allocates disjoint key ranges per region (e.g., Region A gets keys starting with a-m, Region B gets n-z) to prevent cross-region key collisions entirely.
The detail that makes an interviewer’s eyes light up: When discussing the redirect flow, explain that you would log the click event to Kafka before returning the 302 redirect, using a non-blocking async write. But then immediately note the subtlety: if the Kafka producer fails, you must still return the redirect — you never sacrifice the user experience for analytics. This means your analytics will slightly undercount during Kafka outages, and you are okay with that trade-off. Saying this unprompted demonstrates that you understand the hierarchy of system priorities: correctness of the core function (redirect) > completeness of secondary functions (analytics) > consistency of tertiary functions (reporting).
If you were building this URL shortener on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
Load BalancerApplication Load Balancer (ALB)Layer 7 routing, path-based routing for /api/* vs /{shortKey}, built-in health checks
API ServersECS Fargate or EKSFargate for simplicity (no server management), EKS if you need Kubernetes ecosystem. Auto-scaling based on request count
DatabaseDynamoDBPurpose-built for key-value lookups at massive scale. Single-digit millisecond latency. On-demand capacity mode handles traffic spikes without pre-provisioning. Partition key = short_url, sort key not needed
CacheElastiCache for RedisManaged Redis with Multi-AZ failover. Use cluster mode for >12 GB cache or >100K ops/sec. Set up read replicas for read-heavy traffic
Key Generation ServiceDynamoDB + LambdaStore pre-generated keys in a DynamoDB table. Lambda function pre-generates batches during off-peak hours. App servers fetch batches via DynamoDB BatchGetItem
Analytics PipelineKinesis Data Streams to Kinesis Data Firehose to S3 to AthenaKinesis replaces Kafka on AWS. Firehose auto-batches events to S3 in Parquet format. Athena provides serverless SQL queries over click data
CDN / Global RoutingCloudFront + Route 53CloudFront edge locations for redirect caching (if using 301). Route 53 latency-based routing for multi-region deployments
MonitoringCloudWatch + X-RayCloudWatch for metrics and alarms (cache hit rate, redirect latency p99). X-Ray for distributed tracing across ALB, Lambda, DynamoDB
Cost optimization tip: DynamoDB on-demand pricing works well for unpredictable traffic. But if your traffic is steady, provisioned capacity with auto-scaling is 5-7x cheaper. Start with on-demand, analyze traffic patterns for 2 weeks, then switch to provisioned. This is a practical detail that impresses interviewers because it shows you think about cost, not just architecture.
Real interviewers rarely give you the textbook version. They inject constraints mid-design to see how you adapt. Here are the mutations you should practice:Constraint 1: “Your budget is $200/month. No managed services.”
  • DynamoDB and ElastiCache are gone. You are on a single EC2 instance (or a small VPS). The KGS becomes an in-process counter with local file persistence. Redis becomes an in-memory LRU cache within the application process. The database becomes SQLite or a single PostgreSQL instance. This forces you to think about what the essential architecture is versus what is scaling convenience. The answer: a single Go or Rust binary with an embedded cache and SQLite can handle 10K redirects/sec on a $20/month VPS.
Constraint 2: “The shortener must comply with GDPR. Users can request deletion of all their URLs and associated click data.”
  • Now you need a user_id foreign key on every URL record and every click event. The analytics pipeline must support targeted deletion (not just append-only logs). Kafka retention policies are no longer sufficient — you need to be able to purge specific events from S3/Athena. This fundamentally changes the analytics storage from an immutable log to a mutable store, which means you might prefer a database-backed analytics pipeline over Kafka-to-S3.
Constraint 3: “The URL shortener is internal-only. 500 users. No public internet exposure.”
  • The entire CDN, Anycast DNS, and global replication discussion becomes irrelevant. A single PostgreSQL table behind an Nginx reverse proxy is the correct architecture. If you propose DynamoDB and CloudFront for 500 internal users, you have failed the judgment test. The interviewer is testing whether you can right-size, not whether you know the big architecture.
Constraint 4: “URLs must be auditable. Every create and every click must be attributable to a specific user, with tamper-evident logging.”
  • Now you need authentication on the create endpoint, user attribution on click events, and an immutable audit log (append-only table or a service like AWS QLDB). The click event is no longer just analytics — it is a compliance record. This changes the priority: analytics can be eventually consistent, but the audit log must be durable and strongly consistent.
The meta-skill these constraints test: Can you shed complexity when constraints shrink, and add the right complexity when constraints expand? Candidates who give the same architecture for 500 internal users and 500M public users are demonstrating memorization, not engineering judgment.
This problem connects to several foundational topics covered in other chapters:
  • Databases and storage — The NoSQL vs SQL decision, sharding strategies, and partition key design are covered in depth in APIs and Databases. Understanding when key-value access patterns favor NoSQL is critical here. For a deeper dive into DynamoDB’s partition key mechanics, single-table design, and when to choose DynamoDB over Cassandra, see Database Deep Dives.
  • Caching — The Redis cache-aside pattern, eviction policies (LRU), and cache warming strategies are explored in Caching and Observability.
  • Performance and scalability — Read replicas, horizontal scaling, and CDN usage for global latency reduction are covered in Performance and Scalability.
  • Networking — HTTP redirect semantics (301 vs 302), DNS resolution, and Anycast routing are foundational networking concepts covered in Networking and Deployment.
  • Reliability — Fail-safe patterns, graceful degradation, and the analytics buffering pattern connect to Reliability Principles.
  • Cloud service patterns — The mapping of URL shortener components to AWS services (DynamoDB, ElastiCache, CloudFront, Kinesis) and cost optimization strategies are explored in Cloud Service Patterns. Understanding when to use DynamoDB on-demand vs provisioned capacity is a practical skill covered there.

Problem 2: Design a Rate Limiter

Difficulty: Medium | Time Target: 40 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
ScopeClient-side or server-side rate limiting?Client-side is unreliable; focus on server-side
GranularityPer-user, per-API endpoint, per-IP, or global?Determines key structure in the rate limit store
ResponseShould we return rate limit headers (X-RateLimit-*)?Industry standard; shows HTTP expertise
ThrottlingHard limit (reject) or soft limit (queue/slow down)?Changes architecture from reject to backpressure
DistributedSingle server or distributed across multiple servers?Single-server is trivial; distributed is the real problem
RulesWho configures the rate limit rules? Dynamic or static?Affects whether you need a rules engine or config file

Requirements You Should Lock Down

Functional:
  • Rate limit requests based on configurable rules (user ID, IP, API key)
  • Support multiple rate limit tiers (free: 10 req/min, pro: 1000 req/min)
  • Return proper HTTP 429 with Retry-After header when limit is exceeded
  • Rate limit rules can be updated without redeployment
Non-Functional:
  • Very low latency (added overhead < 1ms per request)
  • Must work across distributed servers (shared state)
  • Highly available — if the rate limiter fails, requests should pass through (fail-open)
Senior move: Ask about fail-open vs fail-closed behavior. Fail-open (allow requests if limiter is down) prevents the rate limiter from becoming a single point of failure. Fail-closed (block all requests if limiter is down) is safer against abuse but risky for availability.

Back-of-Envelope Calculation

Assumptions:
  Active users:                 10M
  Average requests per user:    500/day
  Total requests/day:           5 billion
  Peak QPS:                     ~175,000 req/sec (3x average)

Rate limiter overhead:
  Each check = 1 Redis round-trip  ~ 0.5ms
  At 175K QPS:                     ~ 175K Redis ops/sec
  Redis single instance handles:   ~ 100K ops/sec
  → Need 2-3 Redis instances (or Redis Cluster)

Storage per user (sliding window):
  Key: "user:12345:api:/search"     ~ 50 bytes
  Value: sorted set of timestamps   ~ 200 bytes (for 100 entries)
  Total per user (10 endpoints):    ~ 2.5 KB
  10M users:                        ~ 25 GB
  → Fits in Redis cluster
Sanity check your numbers: Our estimate of 175K peak QPS requiring 2-3 Redis instances is reasonable — Cloudflare reports handling tens of millions of requests per second across their edge, with each edge node running its own rate limiter. Stripe’s API likely handles hundreds of thousands of rate limit checks per second across their infrastructure. A 2-3 node Redis cluster for 175K ops/sec is well within Redis’s documented performance characteristics (~100K ops/sec per instance for simple commands).

Where to Place the Rate Limiter

Option A: API Gateway (Recommended for most cases)
  Client → API Gateway [Rate Limiter] → App Servers

Option B: Middleware in Application
  Client → App Server → [Rate Limiter Middleware] → Handler

Option C: Sidecar (Service Mesh)
  Client → Envoy Proxy [Rate Limit] → App Server

Component Architecture

  Client Request


  ┌──────────────┐     ┌─────────────────┐
  │ API Gateway  │────▶│  Rules Engine    │
  │ (rate check) │     │  (config store)  │
  └──────┬───────┘     └─────────────────┘

    ┌────▼────┐
    │  Redis  │  ← Centralized rate limit state
    │ Cluster │
    └────┬────┘

    Pass │ Reject
    ┌────▼────┐  ┌─────────────────┐
    │  App    │  │  HTTP 429       │
    │ Servers │  │  Retry-After: X │
    └─────────┘  └─────────────────┘

Rate Limit Headers (RFC 6585)

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531200
Retry-After: 30

Comparing Four Rate Limiting Algorithms

1

Algorithm 1: Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Increment a counter for each request. Reject when counter exceeds limit.
Window: 12:00:00 - 12:01:00 → counter = 87/100
Window: 12:01:00 - 12:02:00 → counter = 0/100 (reset)
Pros: Simple, low memory (one counter per window).Cons: Burst at window boundary. A user can send 100 requests at 12:00:59 and 100 more at 12:01:01, effectively 200 requests in 2 seconds.
2

Algorithm 2: Sliding Window Log

Store the timestamp of every request in a sorted set. For each new request, remove timestamps outside the window, count remaining.
Redis ZSET: user:123 → [12:00:01, 12:00:05, 12:00:33, ...]
New request at 12:01:10:
  Remove all timestamps < 12:00:10
  Count remaining: 45
  If 45 < 100 → ALLOW, add 12:01:10
Pros: Precise, no boundary burst problem.Cons: High memory (stores every timestamp). At 100 req/min per user with 10M users, that is billions of timestamps.
3

Algorithm 3: Sliding Window Counter (Recommended)

Hybrid of fixed window + sliding window. Maintain counters for the current and previous windows. Weight the previous window by overlap percentage.
Previous window (12:00-12:01): 84 requests
Current window  (12:01-12:02): 36 requests
Current time:   12:01:15 → 25% into current window

Weighted count = 84 * 0.75 + 36 = 99
Limit = 100 → ALLOW (barely)
Pros: Low memory (just two counters), smooth, no boundary burst.Cons: Approximation (not exact), but in practice the error is < 1%.
4

Algorithm 4: Token Bucket

Each user has a bucket that fills with tokens at a fixed rate. Each request consumes one token. If bucket is empty, reject.
Bucket capacity: 100 tokens
Refill rate:     100 tokens/minute
Current tokens:  23

Request arrives → 23 > 0 → ALLOW, tokens = 22
Pros: Allows controlled bursts (up to bucket capacity). Simple. Used by AWS API Gateway and Stripe.Cons: Two parameters to tune (capacity and refill rate). Requires per-user state.

Distributed Rate Limiting: The Hard Part

The real challenge is making this work across multiple servers.Problem: If you have 10 API servers, each checking Redis independently, race conditions cause over-counting or under-counting.Solution: Lua Script in Redis (Atomic Operations)
-- Atomic rate limit check in Redis
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- REJECTED
else
    return 1  -- ALLOWED
end
Clock skew in distributed systems: If your API servers have slightly different clocks, fixed window boundaries differ across servers. Mitigations: (1) Use Redis server time, not application server time. (2) Use NTP to keep clocks synchronized within 10ms. (3) The sliding window counter algorithm is inherently more tolerant of small clock differences.
DecisionChoice MadeAlternativeWhy
AlgorithmSliding window counterToken bucketBetter precision with minimal memory; token bucket is also excellent for APIs needing burst tolerance
PlacementAPI GatewayApplication middlewareCentralized enforcement, language-agnostic, single place to configure rules
State storeRedis ClusterLocal in-memoryMust work across all servers; local state gives inconsistent enforcement
Failure modeFail-openFail-closedAvailability > abuse protection; combine with secondary detection for abuse
AtomicityRedis Lua scriptsDistributed locksLua scripts are atomic in Redis, simpler than external locks, sub-millisecond
1

You compared four algorithms with concrete trade-offs

Not just naming them, but explaining boundary burst problems, memory implications, and why the sliding window counter wins for most cases.
2

You addressed the distributed race condition explicitly

The Lua script for atomic Redis operations shows you have dealt with real concurrency problems.
3

You discussed fail-open vs fail-closed

This is an operational concern that junior candidates never raise. It shows you think about failure modes, not just the happy path.
4

You mentioned clock skew

Distributed systems fundamentals. This single detail signals real-world experience with systems that span multiple servers.
5

You designed rate limit headers into the API response

Following RFC 6585 and including Retry-After shows you care about API ergonomics and client experience.

Cloudflare’s Rate Limiting at Scale

Cloudflare handles tens of millions of HTTP requests per second across 300+ data centers. Their rate limiting operates at a scale that most engineers never encounter:
  • Edge-first enforcement: Rate limiting happens at Cloudflare’s edge servers, not at a centralized service. Each data center runs its own rate limiting logic. This is critical because sending every request to a central rate limiter would add unacceptable latency and create a single point of failure.
  • Sliding window counters: Cloudflare uses a sliding window approach similar to Algorithm 3 above. They chose this because it provides good accuracy with minimal memory per client, which matters when you are tracking millions of unique IPs and API keys simultaneously.
  • Approximate counting across data centers: Since rate limiting is distributed, a user hitting multiple Cloudflare data centers (due to Anycast routing) could theoretically exceed their global limit. Cloudflare handles this with periodic synchronization between edge nodes — they trade perfect accuracy for speed. The result is rate limits that are “approximately correct” within a small margin, which is acceptable for virtually all use cases.
  • Configurable actions: When a rate limit is exceeded, Cloudflare supports multiple responses: block (HTTP 429), challenge (show a CAPTCHA), JS challenge (verify browser), or simulate (log only, for testing rules before enforcement). This flexibility is a great interview detail to mention when discussing rate limiter behavior.
  • IP reputation scoring: Beyond simple rate counting, Cloudflare layers in threat intelligence. An IP with a known bad reputation gets a lower effective rate limit. This hybrid approach (rate limiting + reputation) is more effective than either technique alone.

Stripe’s Rate Limiting API

Stripe’s approach to rate limiting is worth studying because they operate from the API provider’s perspective — they must protect their own infrastructure while giving developers a good experience:
  • Token bucket algorithm: Stripe uses a token bucket implementation for their API rate limits. Each API key gets a bucket with a defined capacity (e.g., 100 requests per second for live mode). Tokens refill at a steady rate. This allows short bursts (useful when a merchant processes a batch of charges) while enforcing a sustained rate limit.
  • Tiered limits by endpoint: Not all API endpoints have the same rate limit. Creating a payment intent might be limited to 100/sec, while listing events could allow 1000/sec. Stripe separates read-heavy endpoints from write-heavy ones because the cost to Stripe’s backend is different.
  • Graceful headers: Stripe returns RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers on every response (not just 429s). This allows well-behaved clients to self-throttle before hitting the limit. They also return a Retry-After header on 429 responses with a specific number of seconds to wait.
  • Idempotency keys for retries: Stripe’s rate limiting is designed to work hand-in-hand with their idempotency key system. When a client gets rate-limited and retries, the idempotency key ensures the retry does not create a duplicate charge. This is a subtle but important integration point between rate limiting and application semantics.
  • Load shedding at scale: When Stripe’s systems are under extreme load, they implement progressive load shedding — first throttling lower-priority traffic (webhooks, list operations) before restricting payment-critical paths. This priority-based approach is more sophisticated than a flat rate limit and is worth mentioning in interviews when discussing production-grade systems.
Key pattern: Both Cloudflare and Stripe separate their rate limiting into “protect the infrastructure” (backend stability) and “enforce fair usage” (per-customer limits). These are different concerns with different algorithms, thresholds, and failure modes. In an interview, distinguishing between these two motivations demonstrates operational maturity.
Rate limiters sit in the critical request path. When they fail, the consequences cascade immediately.Redis cluster goes down (total outage):
  • This is why you chose fail-open. All requests pass through without rate limiting. Your backend services are temporarily unprotected. Mitigation: have a secondary, local in-memory rate limiter on each API server as a fallback. It will not be globally consistent, but it provides per-server protection. A server-local token bucket with 10x the normal per-user limit acts as a safety net.
  • Alert immediately. Redis recovery is your top priority.
Redis cluster partially fails (one node in the cluster):
  • If using Redis Cluster, keys on the failed shard become inaccessible. Some users are rate-limited correctly; others are not. This is worse than a total failure because the inconsistency is silent.
  • Mitigation: monitor rate limiter health with a canary key that you check every second. If the canary read fails, switch the entire rate limiter to local-only mode until Redis is healthy.
Network partition between API Gateway and Redis:
  • The rate limiter cannot check state. With fail-open, requests pass through. With fail-closed, all requests are rejected (an outage).
  • This is the core argument for fail-open: a network partition should not become a denial-of-service against your own users.
Clock skew between API servers exceeds 10ms:
  • Fixed window counters may reset at different times on different servers, allowing a user to briefly exceed their limit by hitting different servers at the window boundary.
  • Mitigation: use Redis server time (TIME command) for all timestamp comparisons, not the application server’s clock. The sliding window counter algorithm is also more tolerant of small clock differences because it interpolates between windows.
A single user floods with requests faster than Redis can process:
  • Redis processes commands sequentially (single-threaded for commands). At extreme rates from a single user, the Lua script executes fast (~0.1ms) but the network round-trip dominates. This is not a real concern in practice — Redis handles 100K+ ops/sec, which means a single user would need to send 100K+ req/sec to cause contention.
  • If this is a DDoS scenario, you need upstream protection (CDN-level blocking, IP reputation) before the request ever reaches your rate limiter.
“How do you handle rate limiting in a multi-region deployment where the same user can hit different regions?”Two approaches: (1) Global rate limiting with a single Redis cluster in one region — adds cross-region latency (~50-100ms) for every rate limit check. (2) Per-region rate limiting with periodic synchronization — each region maintains its own counters and syncs to a global view every few seconds. The user’s effective limit becomes (global limit / number of regions) per region, with a small tolerance for cross-region burst. Approach 2 is what Cloudflare uses. The trade-off: you accept approximate enforcement (~10-20% overcount possible) for much lower latency.“How would you rate limit by cost instead of request count — for example, a GraphQL API where some queries are 10x more expensive than others?”Replace the simple counter with a weighted counter. Each request consumes tokens proportional to its cost (estimated from query complexity analysis). A simple list query costs 1 token; a deeply nested query costs 10. The token bucket algorithm handles this naturally — just deduct more tokens per expensive request. This is how GitHub’s GraphQL API works: they assign a “point cost” to each query and rate-limit on total points, not request count.“What happens when you need to update rate limit rules without downtime?”Store rules in a configuration service (etcd, Consul, or a simple database table) and poll for changes every 10-30 seconds from each API server. Alternatively, use a push model where the rules service publishes changes via a pub/sub channel. The key insight: rules should have a version number, and the rate limiter should apply the latest version it has seen. During a transition, some servers may briefly enforce old rules and others new rules — this is acceptable for a few seconds.
The detail that makes an interviewer’s eyes light up: When discussing distributed rate limiting, note that there is a fundamental tension between accuracy and latency. Checking a centralized Redis for every request gives perfect accuracy but adds a network hop. An alternative is the “local + global” pattern: each API server maintains a local token bucket and periodically syncs its count with Redis. Between syncs (every 1-5 seconds), the server enforces locally. This means the effective rate limit has a tolerance of (local limit * sync interval) — a user might briefly exceed the global limit by a small amount. But you trade ~0.5ms of latency per request for ~5% accuracy loss, and in most systems that is an excellent trade-off. Mentioning this pattern unprompted signals that you have operated rate limiters at scale and understand that “correct enough, fast enough” beats “perfectly correct, too slow.”
If you were building this rate limiter on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
API Gateway / Entry PointAmazon API GatewayBuilt-in rate limiting with usage plans and API keys. Supports throttling at per-method, per-client granularity. For custom algorithms beyond what API Gateway provides, place your limiter behind an ALB
Rate Limit State StoreElastiCache for Redis (cluster mode)Sub-millisecond latency for counter operations. Cluster mode shards data across nodes for >100K ops/sec. Multi-AZ for automatic failover
Rules ConfigurationAWS AppConfig (part of Systems Manager)Dynamic configuration updates without redeployment. Supports gradual rollout of new rate limit rules. Built-in validation to prevent deploying invalid configs
Monitoring and AlertingCloudWatch Metrics + CloudWatch AlarmsCustom metrics for rate limit hits, 429 response rates, and Redis latency. Alarms trigger when 429 rate exceeds baseline (possible attack or misconfigured limits)
DDoS Protection (upstream)AWS Shield + AWS WAFShield provides automatic DDoS protection at the network layer. WAF provides rate-based rules at the application layer — a first line of defense before your custom rate limiter
Logging and AnalyticsKinesis Data Firehose to S3 + AthenaStream rate limit events (who was throttled, when, which endpoint) to S3 for analysis. Athena queries help identify abuse patterns and tune limits
AWS API Gateway’s built-in rate limiting vs custom: API Gateway supports token bucket rate limiting out of the box with usage plans. For many applications, this is sufficient and saves you from building anything custom. Build a custom rate limiter only when you need: (1) algorithms beyond token bucket (sliding window), (2) rate limiting by custom dimensions (e.g., by tenant + endpoint + time-of-day), or (3) rate limiting applied at a different layer in your stack (service mesh, internal microservices). Knowing when the managed service is “good enough” versus when you need custom is itself a senior-level signal. For more on this managed-vs-custom decision framework, see Cloud Service Patterns.
Constraint 1: “The rate limiter must support per-tenant billing tiers that change in real time (upgrades/downgrades).”
  • Static rules files are no longer sufficient. You need a rules service that is queried per request or cached with aggressive invalidation. The rate limit check becomes: look up tenant_id to tier, look up tier to limits, apply the algorithm. A tier change mid-window means you either apply the new limit immediately (possibly rate-limiting a user who just upgraded) or finish the current window at the old limit. The product decision here has architectural implications — discuss both options.
Constraint 2: “You cannot add any latency to the request path. The rate limiter must be zero-overhead for allowed requests.”
  • This eliminates synchronous Redis calls on every request. You need a local in-memory rate limiter on each server, periodically synced with a global state. Each server maintains its own sliding window counter and reports to a central aggregator every 1-5 seconds. You accept ~5% accuracy loss for zero added latency on allowed requests. Only rejected requests see the overhead (the 429 response). This is the Cloudflare edge model.
Constraint 3: “Rate limits must be geography-aware: 100 req/min in the US, 50 req/min in the EU (GDPR region), unlimited in your private beta region.”
  • The rate limit key now includes geography: user:123:US:/api/search. You need to resolve the user’s geography at the edge (GeoIP lookup from CloudFront or Cloudflare) and pass it to the limiter. The rules engine becomes a matrix: (user_tier, region, endpoint) -> limit. This is a real-world pattern — many API providers have geography-specific rate limits for regulatory reasons.
Constraint 4: “The company’s cloud bill is $50K/month and your ElastiCache Redis cluster is 30% of that. Reduce cost by 70%.”
  • You cannot use Redis for every rate limit check at this price. Options: move to a local in-memory limiter with periodic Redis sync (reduces Redis ops by 90%), use DynamoDB on-demand for rate limit state (cheaper for bursty workloads), or use AWS API Gateway’s built-in rate limiting and eliminate the custom limiter entirely for most endpoints. The cost constraint forces you to evaluate the buy-vs-build decision with real numbers.
This problem connects to several foundational topics covered in other chapters:
  • APIs and HTTP — Rate limit headers (X-RateLimit-*, Retry-After), HTTP 429 semantics, and API versioning are covered in APIs and Databases.
  • Networking — Understanding TCP connection management, load balancing algorithms, and how API gateways work is foundational. See Networking and Deployment.
  • Caching — Redis as both a cache and a data structure server, TTL management, and eviction policies are covered in Caching and Observability.
  • Concurrency and distributed state — Race conditions, atomic operations (Lua scripts), distributed locking, and clock synchronization are covered in Messaging, Concurrency, and State. The clock skew discussion in this problem connects directly to the logical clocks and vector clocks covered in Distributed Systems Theory — understanding why wall-clock time is unreliable in distributed systems is essential for designing correct rate limiting across multiple servers.
  • Reliability — Fail-open vs fail-closed, circuit breakers, and graceful degradation patterns connect to Reliability Principles.
  • Security — Rate limiting as a defense against DDoS and brute-force attacks connects to Auth and Security.
  • Cloud service patterns — AWS API Gateway’s built-in throttling, WAF rate-based rules, and the decision framework for managed vs custom rate limiting are explored in Cloud Service Patterns.

Problem 3: Design a Chat System

Difficulty: Medium | Time Target: 45 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
Type1-to-1 only, or group chat too? Max group size?Group chat introduces fan-out complexity
FeaturesOnline/offline status? Read receipts? Typing indicators?Each adds real-time event complexity
MediaText only, or images/files/voice?Media requires object storage + CDN pipeline
HistoryHow far back should chat history go?Determines storage volume and partitioning
DeliveryWhat delivery guarantees? At-least-once? Exactly-once?Drives dedup and acknowledgment protocol
ScaleHow many concurrent users? Messages per day?Determines WebSocket server capacity

Requirements You Should Lock Down

Functional:
  • 1-to-1 messaging with delivery confirmation
  • Group chat (up to 500 members)
  • Online/offline presence indicators
  • Message history (persistent, searchable)
  • Push notifications for offline users
Non-Functional:
  • Real-time delivery (< 200ms for online users)
  • Message ordering guarantees (per-conversation)
  • 99.99% message delivery (no lost messages)
  • Support 50M concurrent connections

Back-of-Envelope Calculation

Assumptions:
  Daily Active Users (DAU):     50M
  Concurrent connections:       10% of DAU = 5M
  Messages per user per day:    40
  Total messages/day:           2 billion
  Average message size:         200 bytes
  Group chats per user:         5 (average 20 members)

Throughput:
  2B messages / 86,400 sec      ~ 23,000 messages/sec
  Peak (5x):                    ~ 115,000 messages/sec

WebSocket connections:
  5M concurrent connections
  Each connection ~ 10 KB memory
  Per server (10K connections): ~ 100 MB
  Servers needed:               5M / 10K = 500 WebSocket servers

Storage (1 year):
  2B messages/day x 365 x 200 bytes = ~146 TB/year
  With indexes and metadata:          ~200 TB/year

Presence (heartbeat):
  5M users sending heartbeat every 30 sec
  = 166K heartbeat events/sec
Key insight: This system is dominated by two challenges: (1) maintaining millions of persistent WebSocket connections, and (2) the group chat fan-out problem where one message must be delivered to hundreds of recipients simultaneously.
Sanity check your numbers: Our estimate of ~200 TB/year for message storage is in the right ballpark. WhatsApp reportedly processes 100B+ messages per day — at 200 bytes per message, that is ~20 TB/day or ~7 PB/year, but WhatsApp deletes messages from servers after delivery. Discord, which retains all messages permanently, has discussed storing trillions of messages across their Cassandra/ScyllaDB clusters. Our 2B messages/day for 50M DAU implies ~40 messages per user per day, which aligns with reported averages for messaging apps.

Component Architecture

 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │ Client A │  │ Client B │  │ Client C │
 └────┬─────┘  └────┬─────┘  └────┬─────┘
      │  WebSocket   │             │
      └──────────────┼─────────────┘

              ┌──────▼──────┐
              │   Gateway   │  ← Connection routing + auth
              │   (HAProxy) │
              └──────┬──────┘

      ┌──────────────┼──────────────┐
      │              │              │
 ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
 │  Chat   │   │  Chat   │   │  Chat   │
 │Server 1 │   │Server 2 │   │Server 3 │   ← Hold WebSocket connections
 └────┬────┘   └────┬────┘   └────┬────┘
      │              │              │
      └──────────────┼──────────────┘

      ┌──────────────┼──────────────┐
      │              │              │
 ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
 │  Redis  │   │ Message │   │ Presence │
 │ Pub/Sub │   │  Store  │   │ Service  │
 └─────────┘   └─────────┘   └──────────┘

Message Flow (1-to-1)

1. User A sends message via WebSocket to Chat Server 1
2. Chat Server 1:
   a. Persists message to Message Store (Cassandra)
   b. Looks up which Chat Server holds User B's connection
   c. Publishes message to Redis Pub/Sub channel for User B
3. Chat Server 2 (holding User B's connection):
   a. Receives message from Redis Pub/Sub
   b. Pushes message to User B via WebSocket
4. If User B is offline:
   a. Message is stored in an "undelivered" queue
   b. Push notification sent via APNS/FCM

Deep Dive 1: WebSocket vs Long-Polling

1

WebSocket (Recommended)

Full-duplex, persistent connection. Server can push messages to client at any time.Why it wins: A chat system needs real-time bidirectional communication. WebSocket is purpose-built for this. One connection handles both sending and receiving.Challenge: Maintaining 5M persistent connections requires careful resource management. Connection drops need automatic reconnection with message catch-up.
2

Long-Polling (Fallback)

Client sends a request, server holds it open until a message arrives (or timeout). Client immediately sends another request.When to use: Fallback for environments where WebSocket is blocked (corporate firewalls, older proxies). Keep as a compatibility layer, not the primary transport.
3

Server-Sent Events (SSE)

Unidirectional (server to client only). Client still uses HTTP POST to send messages.Verdict: Acceptable for notifications but awkward for chat because sending and receiving use different channels.

Deep Dive 2: Message Storage and Partitioning

Database choice: Cassandra or ScyllaDBWhy: Write-heavy workload (2B messages/day), time-series-like access pattern (fetch recent messages for a conversation), horizontal scaling built-in.
Table: messages
┌──────────────────┬───────────────┬────────────┬──────────┬─────────┐
│ conversation_id  │  message_id   │  sender_id │   body   │  ts     │
│  (partition key) │ (clustering)  │            │          │         │
├──────────────────┼───────────────┼────────────┼──────────┼─────────┤
│  conv_abc123     │  msg_001      │  user_A    │  "Hey!"  │  17...  │
│  conv_abc123     │  msg_002      │  user_B    │  "Hi!"   │  17...  │
└──────────────────┴───────────────┴────────────┴──────────┴─────────┘

Partition key: conversation_id
  → All messages for a conversation are on the same node
  → Efficient range query: "Get last 50 messages for conversation X"

Clustering key: message_id (time-based UUID / Snowflake ID)
  → Messages within a partition are sorted by time
  → Supports efficient pagination
Hot partition risk: A group chat with 500 members that is very active could create a hot partition. Mitigation: sub-partition by time bucket (e.g., conversation_id + year_month) so older messages are on different partitions.

Deep Dive 3: Online Presence (Heartbeat Mechanism)

Presence Flow:
1. User connects → status set to "online" in Redis
   HSET presence user:123 '{"status":"online","server":"chat-02","ts":...}'

2. Client sends heartbeat every 30 seconds
   → Redis TTL refreshed: EXPIRE presence:user:123 60

3. If no heartbeat for 60 seconds → TTL expires → user is "offline"

4. When User A opens chat with User B:
   → Check Redis: HGET presence user:456
   → Subscribe to presence changes for contacts
Senior move: Do not fan-out presence updates to all friends in real-time (if a user has 5,000 friends, that is 5,000 events on every status change). Instead, use a lazy approach: only check presence when a user opens a conversation or views their contact list. This reduces presence event volume by 100x.
DecisionChoice MadeAlternativeWhy
TransportWebSocketLong-pollingReal-time bidirectional; fall back to long-polling when WS unavailable
Message DBCassandraPostgreSQLWrite-heavy, time-series access, horizontal scale
Inter-server messagingRedis Pub/SubDirect RPCDecouples chat servers; Pub/Sub handles routing naturally
PresenceLazy evaluation + heartbeatActive fan-outFan-out to all friends is prohibitively expensive at scale
Message orderingPer-conversation orderingGlobal orderingGlobal ordering is unnecessary and impossible to scale; conversations are independent
Group message deliveryWrite message once, fan-out readsWrite to each member’s inboxSaves storage; read fan-out is cheaper for groups < 500
1

You identified the fan-out problem for groups

The difference between writing one copy vs N copies per group message is the core scalability challenge. Explaining the trade-off shows distributed systems maturity.
2

You designed lazy presence instead of active push

This optimization alone saves orders of magnitude of network traffic. It shows you think about what NOT to compute, not just what to compute.
3

You addressed message ordering as per-conversation

Global ordering across all conversations is impossible at scale and unnecessary. Scoping ordering to conversations shows pragmatic thinking.
4

You planned for the offline path

Push notifications, undelivered message queues, and message catch-up on reconnect. The offline path is where most chat systems get complex.
5

You chose the partition key carefully

conversation_id as partition key with time-based clustering is the correct Cassandra modeling for this access pattern. This is a detail that shows real database design experience.

WhatsApp: 100B+ Messages/Day with ~50 Engineers

WhatsApp is the gold standard for doing more with less. At the time of the Facebook acquisition (2014), WhatsApp handled 50+ billion messages per day with roughly 50 engineers. Today that number exceeds 100 billion. Their architecture is a masterclass in simplicity:
  • Erlang/OTP: WhatsApp’s backend is built on Erlang, a language designed for telecom systems that need massive concurrency and fault tolerance. A single Erlang server can handle 2-3 million concurrent connections because Erlang processes are lightweight (a few hundred bytes each, compared to threads that cost megabytes). This is why 50 engineers could operate what would require hundreds of engineers in a Java or Python stack.
  • Message storage philosophy: WhatsApp treats the server as a delivery mechanism, not a storage system. Messages are stored on the server only until they are delivered to the recipient’s device. Once the recipient acknowledges receipt, the message is deleted from the server. This “transient store” approach dramatically reduces storage requirements and simplifies the architecture.
  • XMPP-based protocol (modified): WhatsApp started with the XMPP messaging protocol and heavily modified it for mobile efficiency. Their custom protocol minimizes battery drain and data usage — critical for WhatsApp’s user base in markets with limited data plans.
  • No read receipts stored server-side: The blue check marks (read receipts) are end-to-end signals between devices. The server facilitates delivery but does not store read state. This is another example of pushing complexity to the edges and keeping the server simple.
  • Mnesia for routing: WhatsApp uses Erlang’s built-in Mnesia database for connection routing — mapping which user is connected to which server. Mnesia is an in-memory distributed database that replicates across Erlang nodes, giving sub-millisecond lookups without an external dependency like Redis.
  • FreeBSD over Linux: WhatsApp chose FreeBSD for their servers, partly because of superior network stack performance for their specific workload (millions of concurrent TCP connections). This is an unconventional choice that paid off — it shows that sometimes the right tool is not the popular tool.

Discord’s Message Storage

Discord’s approach is interesting because they had to solve a different problem than WhatsApp — Discord needs persistent, searchable message history (channels can have millions of messages spanning years):
  • Migration from MongoDB to Cassandra: Discord initially stored messages in MongoDB. As they scaled, MongoDB’s single-primary-per-shard model created write bottlenecks for hot channels. They migrated to Cassandra, which distributes writes evenly across nodes. The partition key is (channel_id, bucket) where bucket is a time window, preventing any single partition from growing unbounded.
  • Bucket strategy: Each bucket covers a fixed time period (approximately 10 days of messages). When a user scrolls back through history, Discord fetches the appropriate bucket. New messages always write to the current bucket. This means the “hot” data (recent messages) is always on a small number of partitions, keeping reads fast.
  • Data services layer: Discord added a data services layer between their API and Cassandra. This layer handles request coalescing — if 1,000 users in the same channel all request the same messages at the same time (e.g., after a server outage), the data services layer deduplicates these into a single Cassandra read and fans the result back to all requesters. Without this, a thundering herd could overwhelm Cassandra.
  • ScyllaDB migration: Discord later migrated from Cassandra to ScyllaDB (a C++ reimplementation of Cassandra’s protocol) for better tail latency performance. Cassandra’s JVM garbage collection pauses caused p99 latency spikes that affected user experience. ScyllaDB, being C++ with a shard-per-core architecture, eliminated GC pauses entirely.
Interview insight: WhatsApp’s story demonstrates that choosing the right runtime (Erlang) can be a 10x architectural advantage. Discord’s story demonstrates that database migrations are real and normal — you do not need to get the database choice right on day one, but you need to design so that migration is possible.
Chat systems have an unusually low tolerance for failure — users notice missing or delayed messages within seconds.A Chat Server crashes while holding 10K WebSocket connections:
  • All 10K users get disconnected simultaneously. Each client’s reconnection logic kicks in (exponential backoff with jitter). The load balancer routes reconnections to healthy servers.
  • Critical: the Message Store (Cassandra) has all messages persisted. When users reconnect, they request messages since their last received message ID. No messages are lost — they were persisted before the server acknowledged them to the sender. This is why you write to the database before forwarding to the recipient.
  • Risk: thundering herd on reconnection. 10K clients reconnecting simultaneously can overwhelm the remaining servers. Mitigation: client-side exponential backoff with random jitter (reconnect between 1-10 seconds, not all at once).
Redis Pub/Sub goes down:
  • Inter-server message routing stops. If User A is on Chat Server 1 and User B is on Chat Server 3, messages cannot be forwarded.
  • Short-term: messages queue up on the sending server. Long-term: fall back to a polling mechanism where Chat Servers periodically check the Message Store for undelivered messages for their connected users.
  • Alternative: run Redis Pub/Sub as a cluster with Sentinel for automatic failover. Recovery is typically under 30 seconds.
The Message Store (Cassandra) experiences a partition:
  • Cassandra is designed for this — it uses a quorum-based consistency model. With a replication factor of 3 and quorum reads/writes, the system tolerates one node failure per partition.
  • If too many nodes fail and quorum is lost: messages cannot be persisted. The Chat Server must buffer messages in memory and retry. Return a “message pending” status to the sender instead of “delivered.” This is an honest UX that users understand — they see a clock icon instead of a check mark.
Presence service gives stale data (user appears online but is actually offline):
  • The heartbeat TTL is 60 seconds, so in the worst case, a user appears online for up to 60 seconds after going offline. This is acceptable for most chat applications — WhatsApp and Telegram both have similar tolerances.
  • In the other direction (user appears offline but is online), this happens if a heartbeat is lost. The next heartbeat in 30 seconds corrects it. This is why presence should always be treated as “best effort” and never used for critical decisions.
Network split between data centers in a multi-region deployment:
  • Messages sent within each region continue to work. Cross-region messages queue up and deliver once the partition heals.
  • The critical design decision: do you allow both regions to accept writes (risk message ordering conflicts) or do you designate one region as primary (risk write unavailability in the secondary)? For chat, allowing both regions to accept writes with per-conversation ordering (guaranteed by partition key) is the better choice. Cross-conversation ordering does not matter.
“How would you add end-to-end encryption without changing the server architecture significantly?”The server becomes a “dumb pipe” — it stores encrypted blobs it cannot read. Clients exchange public keys via the server (or a separate key distribution service). Messages are encrypted on the sender’s device and decrypted on the recipient’s device. The server’s message storage, routing, and delivery logic remain identical — it just cannot inspect message content. This means server-side search over message history becomes impossible (or requires client-side search). Mention the Signal Protocol as the industry standard for this pattern.“What happens if a user is in a group chat with 500 members and sends a message? Walk me through the fan-out.”One message write to Cassandra (partitioned by conversation_id). Then fan-out to 500 recipients: for each member, look up their Chat Server (via the presence/routing table in Redis), and publish the message to the appropriate Redis Pub/Sub channel. If 500 members are spread across 100 Chat Servers, that is 100 Pub/Sub messages (one per server, each containing the recipient list for that server). For offline members, a single batch push notification request to APNS/FCM. The total fan-out cost: 1 DB write + ~100 Pub/Sub messages + N push notifications for offline users. This completes in under 200ms for online users.“How would you implement message editing and deletion with eventual consistency?”Store edits and deletes as new events (not mutations of the original message). The message table gets edited_at and deleted_at columns. When a client fetches messages, it applies edits and filters deletes client-side. For real-time updates, broadcast an “edit” or “delete” event to all connected members of the conversation via the same Pub/Sub channel. If a client missed the event (was offline), it reconciles on reconnect when it fetches the latest messages. This event-sourcing approach is simpler than trying to mutate delivered messages across all clients.
The detail that makes an interviewer’s eyes light up: When discussing message ordering, explain that you use a per-conversation sequence number (not a global timestamp) generated by the Chat Server. Here is why this matters: two users in the same conversation might send messages at the same millisecond from different time zones. If you use wall-clock timestamps, the order depends on clock synchronization across client devices (which is unreliable). Instead, the Chat Server assigns a monotonically increasing sequence number within each conversation. This guarantees causal ordering — if User A sees User B’s message and replies, User A’s reply always has a higher sequence number than User B’s message. Mentioning causal ordering (not just “timestamps”) signals that you understand the difference between wall-clock time and logical time in distributed systems, which is a Lamport clock concept that separates senior engineers from everyone else.
If you were building this chat system on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
WebSocket ManagementAPI Gateway WebSocket APIs + Lambda or ECSAPI Gateway handles WebSocket connection lifecycle (connect, disconnect, message routing) natively. For 5M concurrent connections, use ECS/EKS with custom WebSocket servers behind a Network Load Balancer (NLB) — NLB supports millions of concurrent connections and operates at Layer 4
Inter-Server MessagingAmazon ElastiCache for Redis (Pub/Sub) or Amazon SNSRedis Pub/Sub for low-latency inter-server message routing. SNS for fan-out to multiple subscribers when message delivery to offline users triggers push notifications
Message StorageAmazon Keyspaces (Managed Cassandra) or DynamoDBKeyspaces provides Cassandra-compatible API with serverless scaling — ideal for the write-heavy, time-series message storage pattern. DynamoDB works if you prefer single-table design with conversation_id as partition key and message_id as sort key
Presence ServiceElastiCache for RedisStore heartbeat state with TTL-based expiration. Redis hash maps for presence:user_id to {status, server, timestamp}
Push NotificationsAmazon SNS Mobile PushSNS integrates with APNs (iOS) and FCM (Android) directly. Handles device token management, multi-platform delivery, and retry logic
Media StorageS3 + CloudFrontS3 for image/file storage with lifecycle policies (move to Glacier after 90 days). CloudFront CDN for low-latency media delivery worldwide
Connection RoutingElastiCache for Redis or DynamoDBMap user_id to server_id for message routing. Redis for sub-millisecond lookups; DynamoDB if you want managed persistence without Redis operational overhead
MonitoringCloudWatch + X-RayTrack WebSocket connection counts, message delivery latency (p50, p95, p99), Pub/Sub lag, and presence accuracy
Scaling WebSockets on AWS: API Gateway WebSocket APIs are convenient but have a connection limit per account (default 500K, raisable to a few million). For 5M+ concurrent connections, you need custom WebSocket servers on ECS/EKS behind a Network Load Balancer. NLB is critical here — ALB adds latency and has lower connection limits. Each ECS task handles 10-50K connections depending on instance size. Use ECS Service Auto Scaling based on a custom CloudWatch metric (active connection count per task). For a deeper treatment of real-time connection management patterns, see Real-Time Systems.
Constraint 1: “Messages must be end-to-end encrypted. The server must never see plaintext. How does this change your architecture?”
  • Server-side search is now impossible. The server stores encrypted blobs. Key exchange happens via the Signal Protocol (double-ratchet). Group chat encryption becomes significantly harder — you need the Sender Keys protocol where each group member maintains a session with the sender. Message editing and deletion must be handled as encrypted “edit” or “tombstone” events. The presence system is unaffected (metadata, not content). This is how WhatsApp and Signal work.
Constraint 2: “The system must support 10,000-member channels (like Slack or Discord), not just 500-member groups.”
  • The fan-out model from the original design breaks at 10K members. You cannot push messages to 10K Redis Pub/Sub channels. Instead, use a channel-centric model: members subscribe to the channel, and the Chat Server maintains a channel subscription map. A message published to a channel is broadcast to all servers that have at least one subscriber for that channel (typically far fewer than 10K servers). This is Discord’s approach. The presence system also changes — you cannot show individual online status for 10K members, so you show “X members online” as an aggregate.
Constraint 3: “The chat system is for a healthcare company. All messages must be retained for 7 years per HIPAA. Messages cannot be permanently deleted by users.”
  • User-facing “delete” becomes a soft delete (hidden from UI, still in storage). The Cassandra TTL strategy from the original design must be disabled. Storage grows linearly forever — at 2B messages/day, that is ~5 PB over 7 years. You need a tiered storage strategy: hot (last 30 days in Cassandra), warm (30 days to 1 year in S3), cold (1-7 years in S3 Glacier). The audit trail must be immutable. Encryption key management becomes critical — you must be able to decrypt 7-year-old messages, which means key rotation must preserve old keys.
Constraint 4: “Users have unreliable internet (emerging markets, 2G connections, frequent disconnections).”
  • Offline-first architecture becomes mandatory. Messages are stored locally on the device first, then synced when connectivity returns. You need a conflict resolution protocol for messages created offline by multiple users. Vector clocks or Lamport timestamps per conversation ensure causal ordering. Message payloads must be compact (Protocol Buffers instead of JSON). The app must work in a “compose and queue” mode where the user types and the message is sent later. This is WhatsApp’s design philosophy.
This problem connects to several foundational topics covered in other chapters:
  • Networking — WebSocket protocol, TCP connection management, HTTP long-polling, and SSE are covered in Networking and Deployment.
  • Messaging and concurrency — Pub/Sub patterns, message queues, fan-out strategies, and exactly-once vs at-least-once delivery guarantees are explored in Messaging, Concurrency, and State.
  • Databases — Cassandra data modeling, partition key design, hot partition mitigation, and the migration story (MongoDB to Cassandra to ScyllaDB) connect to APIs and Databases. For a deeper exploration of Cassandra vs ScyllaDB partition strategies, wide-row design, and when to choose time-series-optimized databases, see Database Deep Dives.
  • Caching — Redis for Pub/Sub, presence state, and connection routing connects to Caching and Observability.
  • Performance — WebSocket connection scaling, heartbeat optimization, and lazy vs active presence evaluation connect to Performance and Scalability.
  • Security — End-to-end encryption, authentication of WebSocket connections, and the Signal Protocol connect to Auth and Security.
  • Real-time systems — WebSocket lifecycle management, heartbeat protocols, presence tracking, and the trade-offs between push (WebSocket), pull (long-polling), and hybrid delivery models are core real-time patterns explored in Real-Time Systems. The chat system is the canonical real-time system design problem.
  • Cloud service patterns — The decision between API Gateway WebSocket APIs and custom WebSocket servers on ECS, NLB vs ALB for persistent connections, and SNS Mobile Push integration are practical AWS patterns covered in Cloud Service Patterns.

Problem 4: Design a News Feed

Difficulty: Hard | Time Target: 45 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
ContentText posts only, or images/videos too?Media adds CDN and transcoding complexity
Feed typeChronological or ranked (algorithmic)?Ranked feeds need an ML scoring service
Social graphFollow model (Twitter) or friend model (Facebook)?Follow is asymmetric and simpler; friends is symmetric
Celebrity problemDo some users have millions of followers?This single question changes the entire fan-out strategy
Real-timeShould the feed update in real-time or on refresh?Real-time adds WebSocket/SSE complexity
ScaleDAU? Average follows per user? Posts per day?Drives the fan-out vs fan-in decision

Requirements You Should Lock Down

Functional:
  • Users create posts (text + images/video)
  • Users follow other users (asymmetric follow model)
  • Home feed shows posts from followed users, ranked by relevance
  • Support likes, comments, shares (interaction signals)
Non-Functional:
  • Feed generation latency < 500ms
  • Support 500M DAU
  • Handle celebrity accounts (10M+ followers)
  • Feed should feel “fresh” (updates within minutes of posting)

Back-of-Envelope Calculation

Assumptions:
  DAU:                          500M
  Average follows per user:     200
  Average posts per user/day:   2
  Total posts/day:              1 billion
  Average post size:            1 KB (text + metadata)
  Average feed fetch/day:       10 per user

Write (post creation):
  1B / 86,400                   ~ 11,600 posts/sec
  Peak (5x):                    ~ 58,000 posts/sec

Read (feed fetch):
  500M x 10 / 86,400           ~ 58,000 feed fetches/sec
  Peak (5x):                    ~ 290,000 feed fetches/sec

Fan-out calculation (the critical number):
  Average user: 200 followers
    → 1 post = 200 fan-out writes
  Celebrity (1M followers):
    → 1 post = 1,000,000 fan-out writes
    → 10 celebrities posting = 10M writes instantly

Storage for pre-computed feeds:
  500M users x 500 post IDs x 8 bytes = 2 TB
  (Only store post IDs in feed, not full content)
Sanity check your numbers: Our estimate of 1 billion posts/day for 500M DAU (2 posts per user per day) is slightly high but in the right range — Meta reported roughly 2 billion DAU on Facebook with a lower average post rate, but if you include Stories, Reels, and comments-as-content, the total content creation volume is in the billions per day. Twitter (now X) reported around 500M tweets per day at ~400M monthly active users. The 2 TB estimate for pre-computed feed storage (just post IDs) is feasible for a large Redis cluster — Twitter’s timeline Redis cluster reportedly used multiple terabytes of memory for exactly this purpose.
The celebrity problem is the entire interview: If a user with 10M followers posts, and you use fan-out-on-write, you must write to 10M feeds instantly. This is the single most important trade-off in this problem. If you do not raise it yourself, the interviewer will push you toward it.

The Two Fundamental Approaches

Fan-Out on Write (Push Model):
  User posts → immediately write to all followers' feed caches
  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐
  │  Post    │───▶│  Fan-out Service │───▶│ Feed Cache    │
  │  Service │    │  (async workers) │    │ (per-user)    │
  └──────────┘    └──────────────────┘    └───────────────┘
  Pro: Feed reads are instant (pre-computed)
  Con: Write amplification; celebrity post = millions of writes

Fan-Out on Read (Pull Model):
  User opens feed → query all followed users' posts in real-time
  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐
  │  Feed    │───▶│  Aggregation     │───▶│ Post Storage  │
  │  Request │    │  Service         │    │ (per-user)    │
  └──────────┘    └──────────────────┘    └───────────────┘
  Pro: No write amplification; simple writes
  Con: Slow read (must merge 200+ sorted lists at read time)
┌──────────────────────────────────────────────────────────────────┐
│                       Hybrid News Feed                           │
│                                                                  │
│  Normal user posts (< 10K followers):                           │
│    → Fan-out on WRITE to all followers' caches                   │
│                                                                  │
│  Celebrity posts (> 10K followers):                              │
│    → Do NOT fan out; store in celebrity post cache               │
│    → At read time, merge celebrity posts with pre-computed feed  │
└──────────────────────────────────────────────────────────────────┘

  ┌──────────┐
  │  User    │──── Creates post
  └────┬─────┘

  ┌────▼─────────────────────┐
  │  Post Service            │
  │  (writes to Post DB)     │
  └────┬─────────────────────┘

       ├── followers < 10K ──▶ Fan-out workers ──▶ Feed Cache
       │                                           (Redis sorted set)

       └── followers > 10K ──▶ Celebrity Post Cache
                                (read at feed-fetch time)

  Feed Fetch:
  ┌──────────┐    ┌─────────────────┐
  │  Client  │───▶│ Feed Service    │
  └──────────┘    │ 1. Read cache   │
                  │ 2. Merge celebs │
                  │ 3. Rank/sort    │
                  │ 4. Return top N │
                  └─────────────────┘

Deep Dive 1: The Hybrid Fan-Out Strategy

1

Post arrives at Post Service

Validate content, store in Post DB (PostgreSQL/Cassandra), generate post ID.
2

Check follower count

If the poster has fewer than 10K followers, proceed with fan-out on write. If more, only add to the celebrity post index.The 10K threshold is tunable. Some systems use dynamic thresholds based on current system load.
3

Fan-out workers process asynchronously

A Kafka consumer reads new-post events. For each followed user, the worker prepends the post ID to that user’s feed in Redis.
Redis sorted set per user:
ZADD feed:user:789 <timestamp_score> <post_id>

Keep only the most recent 500 entries:
ZREMRANGEBYRANK feed:user:789 0 -501
4

At read time, merge celebrity posts

When user fetches their feed, the Feed Service reads their pre-computed feed from Redis AND fetches recent posts from any celebrities they follow.Merge the two sorted lists, apply ranking, return top 50.

Deep Dive 2: Feed Ranking

A pure chronological feed is simple but produces poor engagement. A ranked feed considers multiple signals:
Score(post) = w1 * recency
            + w2 * affinity(user, author)
            + w3 * engagement(likes, comments, shares)
            + w4 * content_type_preference
            + w5 * diversity_penalty

Where:
  recency    = exponential decay from post creation time
  affinity   = how often user interacts with author
  engagement = normalized like/comment/share count
  diversity  = penalize seeing 5 posts from same author in a row
In an interview, you do not need to define the exact ML model. What matters is that you architect for ranking: the Feed Service fetches candidate posts, passes them through a lightweight ranker, and returns the top N. The ranking model can be iterated independently of the infrastructure.

Deep Dive 3: Caching and CDN for Media

Caching layers:
┌──────────────────────────────────────────────────┐
│  Layer 1: CDN (images, videos, static assets)    │
│  Layer 2: Feed Cache (Redis - post IDs per user) │
│  Layer 3: Post Cache (Redis - post content)      │
│  Layer 4: Social Graph Cache (who follows whom)  │
│  Layer 5: Database (source of truth)             │
└──────────────────────────────────────────────────┘

Cache invalidation strategy:
- Feed cache: append-only (new posts added, old posts expire via ZREMRANGEBYRANK)
- Post cache: invalidate on edit/delete (rare operations)
- Social graph cache: invalidate on follow/unfollow
Senior move: Pre-compute feeds for active users only. If a user has not logged in for 30 days, do not fan out posts to their feed cache. When they return, generate their feed on-demand (fan-out on read) and then resume pre-computation.
DecisionChoice MadeAlternativeWhy
Fan-out strategyHybrid (push for normal, pull for celebrities)Pure push or pure pullPure push breaks for celebrities; pure pull is too slow for reads
Celebrity threshold10K followersStatic vs dynamicStart with static threshold, evolve to dynamic based on system load
Feed storageRedis sorted sets (post IDs only)Store full post contentIDs are tiny; fetch content separately with post cache hit rates > 99%
RankingLightweight scoring at read timePre-computed ranked feedsRanking signals change in real-time (new likes); pre-ranking gets stale
Media deliveryCDN + object storage (S3)Serve from app serversMedia is the bandwidth bottleneck; CDN reduces latency and server load by 90%+
Inactive usersSkip fan-out, generate on demandFan-out to everyone60% of users do not check their feed daily; skipping saves enormous write volume
1

You immediately identified the celebrity fan-out problem

This is the central tension of the problem. Raising it proactively shows you have thought deeply about news feed systems.
2

You proposed a hybrid approach instead of picking one extreme

Real systems (Twitter, Facebook, Instagram) all use hybrid approaches. Knowing this signals industry awareness.
3

You separated the ranking layer from the data layer

Architecturally decoupling ranking from storage allows the ML team to iterate independently. This is how real organizations work.
4

You optimized for inactive users

Not fan-outing to inactive users is a massive optimization that most candidates miss. It shows you think about the 80% of wasted work, not just the 20% that matters.
5

You designed multiple caching layers with clear invalidation strategies

Feed cache, post cache, social graph cache, CDN — each with its own invalidation policy. This layered approach demonstrates production-grade thinking.

Facebook’s News Feed Ranking Evolution

Facebook’s News Feed is perhaps the most studied feed system in the world. Its evolution from simple to sophisticated mirrors the journey every feed system takes:
  • 2006 — Chronological feed: The original News Feed was purely chronological. If your friend posted, it appeared in your feed sorted by time. Simple, but as users added hundreds of friends, the feed became noisy and engagement dropped.
  • 2009 — EdgeRank: Facebook introduced EdgeRank, a relatively simple scoring formula: Score = Affinity x Weight x Decay. Affinity measured how close you were to the poster (based on interactions). Weight reflected the content type (photos ranked higher than text). Decay was a time-based exponential falloff. EdgeRank was essentially the formula we described in the ranking section above.
  • 2013+ — Machine learning replaces EdgeRank: Facebook replaced the hand-tuned formula with a machine learning model that considers thousands of signals. The model predicts the probability that you will engage with (like, comment, share, click) each candidate post. Posts are ranked by predicted engagement score.
  • Fan-out architecture: Facebook uses a hybrid fan-out model very similar to what we designed above. Normal users get fan-out-on-write (their posts are pushed to followers’ pre-computed feeds). Celebrities and pages with millions of followers use fan-out-on-read (their posts are merged at feed-fetch time). The threshold is dynamic and adjusts based on system load.
  • Aggregator service: Facebook’s feed generation pipeline has a dedicated aggregation service (called “Multifeed”) that merges posts from multiple sources: the pre-computed feed cache, celebrity posts fetched on-read, ads injected by the ad auction, and “story bumping” (resurfacing older posts that are getting new comments). This merge step is where the ranking model runs.
  • Feed cache (TAO + Memcache): Facebook built a custom distributed cache called TAO (The Associations and Objects cache) specifically for social graph data. Feed data lives in a combination of TAO, Memcache, and MySQL. The caching hierarchy is: L1 (in-process cache on the web server) to L2 (Memcache cluster in the same region) to L3 (TAO for social graph queries) to L4 (MySQL as source of truth).

Twitter’s Timeline Architecture

Twitter’s timeline system is a fascinating case study because they famously changed their approach:
  • Original approach (fan-out on write): For years, Twitter used aggressive fan-out-on-write. When you tweeted, a fleet of workers would push your tweet ID into the timelines of all your followers stored in Redis. This worked well because reads were instant — opening Twitter just meant reading a pre-computed list from Redis.
  • The celebrity problem: This approach broke for users like Lady Gaga or Barack Obama with 30M+ followers. A single tweet triggered 30 million Redis writes. During events like the Super Bowl or elections, the fan-out queue would back up by minutes or even hours, meaning some users would see tweets with significant delay.
  • Migration to hybrid: Twitter moved to a hybrid model around 2012-2013. Most users (those with fewer followers than a dynamic threshold) still get fan-out-on-write. High-follower accounts are excluded from fan-out. When you open your timeline, Twitter’s timeline service merges your pre-computed timeline with recent tweets from the high-follower accounts you follow.
  • Timeline cache: Twitter stores timelines in a massive Redis cluster. Each user’s timeline is a Redis list of tweet IDs (not full tweet content). At read time, the IDs are hydrated by fetching the tweet content from a separate Tweet Store service. This separation means the timeline cache stays small (just IDs and scores) while tweet content is cached independently.
  • Ranking and relevance: Twitter introduced algorithmic ranking (“top tweets”) alongside the chronological timeline. The ranking model considers engagement signals, recency, your past interactions with the author, and content type. Users can toggle between ranked and chronological views — a product decision that has architectural implications (you need to support both code paths).
The lesson from both Facebook and Twitter: Every major feed system ended up at a hybrid fan-out architecture. Pure push breaks at celebrity scale. Pure pull is too slow for reads. The hybrid model is not a compromise — it is the correct architecture for this problem class. If you propose a hybrid approach in your interview, you are aligned with what the biggest companies actually built.
News feed failures are highly visible — millions of users see an empty or stale feed simultaneously.The Feed Cache (Redis) goes down:
  • Users’ pre-computed feeds are gone. Every feed fetch becomes a fan-out-on-read: query the Post DB for recent posts from all followed users, merge, rank, and return. This is 100-1000x slower than a cache hit.
  • Mitigation: multi-tier caching. L1 cache on the web server (in-process, last 5 minutes of feeds), L2 is the Redis cluster. If L2 dies, L1 buys you time for a percentage of users. For users not in L1, serve a degraded feed (e.g., top trending posts for everyone) while Redis recovers.
  • Recovery priority: warm the cache for the most active users first (those currently online), not all 500M users. Use the active user list from the presence/session service to prioritize.
The Fan-Out Service falls behind (Kafka consumer lag):
  • New posts are being created but not distributed to followers’ feeds. Users see stale feeds. The Kafka queue grows.
  • This is expected during traffic spikes (Super Bowl, elections, breaking news). Mitigation: auto-scale fan-out workers based on Kafka consumer lag. Set alerting thresholds: lag > 5 minutes triggers scaling, lag > 30 minutes triggers an incident.
  • Graceful degradation: if fan-out lag exceeds a threshold, temporarily switch all users to fan-out-on-read. The feed will be slightly slower but always fresh. Switch back to fan-out-on-write once workers catch up.
The Ranking Service goes down:
  • Fall back to chronological ordering. The feed still loads — it is just not ranked. Users see recent posts from people they follow, sorted by time. This is a perfectly acceptable degraded experience. Many users actually prefer chronological feeds.
  • The important principle: the ranking service should be stateless and horizontally scalable, sitting behind a load balancer. A single instance failure should not affect any users. Only a total cluster failure causes degradation.
A celebrity with 10M followers posts during a system-wide fan-out lag:
  • If you are using the hybrid model correctly, this is a non-event. Celebrity posts are not fanned out — they live in the celebrity post cache and are merged at read time. The celebrity’s 10M followers experience zero additional latency because no fan-out occurs for this post.
  • This is precisely why the hybrid model exists. Under pure fan-out-on-write, this celebrity post would add 10M items to an already-backed-up queue.
The Post database experiences high latency:
  • Post content fetches slow down. Feeds still load (post IDs come from the Feed Cache) but individual posts render as “loading” on the client.
  • Mitigation: aggressive caching of post content. Popular posts (anything with > 100 likes in the past hour) should be in the Post Cache (Redis) with very high TTL. Cache hit rate for viral content should be 99.9%+.
“How do you prevent the feed from showing a post from someone the user just unfollowed?”The unfollow event should trigger two actions: (1) remove the unfollowed user’s posts from the follower’s feed cache (scan and delete from the Redis sorted set), and (2) update the social graph cache. For fan-out-on-read (celebrity posts), the merge step naturally excludes unfollowed users because it queries the current follow list. The tricky case: what if an unfollow event is processed after a new fan-out write? Use a “negative filter” — maintain a small set of recently-unfollowed user IDs per user, and filter out their posts at read time. This handles the race condition at the cost of a tiny read-time check.“How would you add sponsored posts (ads) to the feed without degrading user experience?”Ads are injected at the ranking/merge step, not at the fan-out step. The Feed Service fetches a ranked list of organic posts and a separate ranked list of ad candidates from the Ad Auction Service. It then interleaves them according to a policy (e.g., at most 1 ad per 5 organic posts, never two ads in a row, never an ad as the first item). The ad selection can be personalized using the same user signals that drive organic ranking. Key insight: ads are fetched in parallel with organic posts, not sequentially, so they add zero latency to feed generation.“What if a post goes viral — getting millions of likes in minutes — and all those like counts need to update in real-time on everyone’s feed?”Do not update like counts in real-time for every user. Instead, use approximate counts that refresh periodically. When a user sees a post in their feed, the displayed like count comes from a “counter cache” that is updated every 10-30 seconds (not on every like). For the post author’s view, you can show a more real-time count. This is why social media platforms show “1.2M likes” rather than an exact number — the approximation is intentional and serves an engineering purpose, not just a UX one.
The detail that makes an interviewer’s eyes light up: When discussing the hybrid fan-out, explain that the 10K follower threshold is not static — it should be dynamic based on system load. During normal traffic, fan out for users with up to 50K followers. During peak events (Super Bowl, New Year’s Eve), automatically lower the threshold to 5K to reduce write amplification. You can implement this with a single configuration value that the fan-out service reads on each batch. Going further: instead of a hard threshold, use a continuous cost function: fan_out_cost = follower_count * current_queue_depth_factor. If the result exceeds a budget, defer to read-time merge. This adaptive approach means your system self-tunes to match capacity, which is how mature systems actually work at Facebook and Twitter scale.
If you were building this news feed on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
Post Service / APIECS Fargate behind ALBAuto-scaling based on request count. Fargate eliminates server management. ALB handles path-based routing for different API endpoints
Post StorageDynamoDB or Amazon KeyspacesDynamoDB for post metadata with user_id as partition key and post_id (ULID/Snowflake) as sort key. Keyspaces (managed Cassandra) if you want wider rows and more flexible time-range queries
Feed CacheElastiCache for Redis (cluster mode)Redis sorted sets for pre-computed feeds (post IDs scored by timestamp). Cluster mode for 2+ TB of feed data across shards. Multi-AZ for failover
Fan-Out WorkersSQS + Lambda or Kinesis + ECSSQS for simple fan-out: post creation triggers a message, Lambda workers fan out to follower feeds in Redis. Kinesis for higher throughput with ordered processing per partition
Social GraphAmazon Neptune or DynamoDBNeptune (managed graph database) for complex social graph queries (mutual friends, friend-of-friend). DynamoDB with adjacency list pattern if the graph queries are simple (just “who does user X follow?”)
Media Storage / CDNS3 + CloudFront + MediaConvertS3 for image and video storage. MediaConvert for video transcoding (multiple resolutions, adaptive bitrate). CloudFront for global delivery with Origin Shield to reduce origin load
Ranking ServiceSageMaker Inference EndpointsHost the feed ranking ML model on SageMaker. Real-time inference endpoints with auto-scaling based on request count. A/B testing via SageMaker endpoint variants
Celebrity Post CacheElastiCache for Redis or DynamoDB DAXSeparate cache for celebrity posts merged at read time. DAX (DynamoDB Accelerator) if your celebrity posts live in DynamoDB — provides microsecond read latency
MonitoringCloudWatch + CloudWatch Contributor InsightsCloudWatch for fan-out lag, feed generation latency, cache hit rates. Contributor Insights to identify which users or posts are creating the most fan-out load (celebrity detection)
Serverless fan-out pattern on AWS: For the fan-out workers, a powerful AWS pattern is: Post creation writes to DynamoDB Streams to trigger a Lambda function. The Lambda reads the poster’s follower list and writes post IDs to each follower’s feed in ElastiCache. For celebrities (>10K followers), the Lambda publishes to an SQS FIFO queue, and a fleet of ECS workers handles the large fan-out asynchronously. This two-tier approach (Lambda for small fan-outs, ECS for large ones) gives you cost efficiency for the 99% of posts with small fan-out and throughput for the 1% with massive fan-out. For more on this serverless event-driven pattern, see Cloud Service Patterns.
Constraint 1: “The feed must support real-time content moderation. Posts containing harmful content must be removed from all feeds within 30 seconds of detection.”
  • Pre-computed feeds now have a write-back problem. When a post is flagged, you must scan and remove it from potentially millions of pre-computed feed caches. This is expensive. Alternative: do not embed post content in feeds — store only post IDs. The Feed Service hydrates IDs to content at read time, checking a “blocked posts” set before returning. A blocked post simply returns nothing, creating a seamless removal. This is how Facebook and Instagram handle content moderation at scale — the moderation decision propagates instantly because the content fetch is decoupled from the feed cache.
Constraint 2: “You must support a ‘For You’ algorithmic feed alongside a ‘Following’ chronological feed, and users can switch between them.”
  • Two separate feed generation paths. The “Following” feed is the hybrid fan-out model from the original design. The “For You” feed is entirely fan-out-on-read: a recommendation service generates candidate posts from a broader pool (not just followed users), scores them, and returns the top N. The “For You” path requires a real-time ML inference service (SageMaker, TensorFlow Serving). The two feeds share the post cache but have completely different generation pipelines. The switching must be instant — both feeds should be pre-warmed or quickly generatable.
Constraint 3: “The feed is for an internal enterprise social network. 50,000 users. No celebrities. Compliance requires that all posts are retained and auditable.”
  • The celebrity fan-out problem vanishes entirely. Pure fan-out-on-write works perfectly at 50K users. The ranking service is unnecessary — chronological is fine for enterprise. But the compliance requirement adds a new dimension: posts cannot be truly deleted (only soft-deleted from user-facing views), the audit trail must be immutable, and access control must respect organizational hierarchies (a post in the “Engineering” group should not be visible to “Sales” unless explicitly shared). The entire feed architecture simplifies dramatically, but the access control layer becomes the hardest part.
Constraint 4: “Your cloud bill for the feed system is 800K/month.TheCFOwantsitat800K/month. The CFO wants it at 400K. What do you cut?”
  • The biggest costs in a feed system: (1) Redis cluster for pre-computed feeds (memory is expensive at scale), (2) fan-out workers (compute for write amplification), (3) CDN for media delivery. Cost reduction strategies: stop pre-computing feeds for inactive users (users who have not logged in for 14 days) — this can reduce Redis usage by 40-60%. Reduce the feed cache depth from 500 to 200 post IDs per user. Move fan-out workers to Spot instances (they are stateless and retry-safe). Compress media more aggressively (AVIF instead of JPEG saves 30-50% bandwidth). Each of these trades a small UX degradation for a measurable cost reduction.
This problem connects to several foundational topics covered in other chapters:
  • Caching — Multi-tier caching (CDN, Redis feed cache, post cache, social graph cache), cache invalidation strategies, and cache warming are covered in depth in Caching and Observability.
  • Messaging and async processing — Kafka-based fan-out workers, consumer lag monitoring, and async event processing connect to Messaging, Concurrency, and State.
  • Databases — Choosing between Redis sorted sets for feed storage, Cassandra for post storage, and the social graph data model connect to APIs and Databases. For a deeper exploration of graph databases (Neptune) vs adjacency list patterns in DynamoDB for social graph modeling, and the trade-offs between wide-column stores and document databases for post storage, see Database Deep Dives.
  • Performance — CDN for media delivery, feed latency optimization, and the trade-off between pre-computation and on-demand computation connect to Performance and Scalability.
  • Design patterns — The push/pull/hybrid pattern is a fundamental distributed systems design pattern explored in Design Patterns.
  • Scalability estimation — The back-of-envelope math for fan-out write amplification connects to Capacity, Git, and Pipelines.
  • Distributed systems theory — The fan-out problem is fundamentally about write amplification and eventual consistency. Understanding CAP theorem trade-offs (the feed cache is eventually consistent with the post store) and how read-your-writes consistency affects the posting user’s experience connects to Distributed Systems Theory.
  • Cloud service patterns — DynamoDB Streams for event-driven fan-out, SageMaker for real-time ranking inference, Neptune for social graph queries, and the serverless fan-out pattern are AWS-specific implementations explored in Cloud Service Patterns.

Problem 5: Design a Distributed Task Scheduler

Difficulty: Hard | Time Target: 45 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
Task typesOne-time, delayed, recurring (cron), or all three?Recurring tasks need a fundamentally different scheduling model
ExecutionAt-most-once, at-least-once, or exactly-once?Determines dedup strategy and acknowledgment protocol
PriorityDo tasks have priority levels?Priority requires a priority queue, not a FIFO queue
PayloadWhat is a task? Just a function name + args, or arbitrary code?Arbitrary code execution is a security and sandboxing problem
DurationHow long can a task run? Timeout?Long-running tasks need heartbeats and lease renewal
ScaleHow many tasks per day? Concurrent workers?Drives queue partitioning and worker scaling decisions
FailureWhat happens when a task fails? Retry policy? Dead letter?Failure handling is where most task schedulers get complex

Requirements You Should Lock Down

Functional:
  • Schedule one-time tasks for immediate or delayed execution
  • Schedule recurring tasks using cron expressions
  • Retry failed tasks with configurable backoff (max 3 retries)
  • Dead-letter queue for tasks that exceed retry limit
  • Task status tracking: pending, scheduled, running, completed, failed, dead
  • Cancel or pause scheduled/recurring tasks
Non-Functional:
  • At-least-once execution guarantee (with idempotency support)
  • Handle 10M task executions per day
  • Scheduling accuracy: within 1 second of target time
  • Horizontal scaling of both scheduler and workers
  • No single point of failure
Senior move: Ask about exactly-once semantics. Then explain that true exactly-once is impossible in a distributed system. The practical solution is at-least-once delivery combined with idempotent task execution. This single clarification demonstrates deep distributed systems understanding.

Back-of-Envelope Calculation

Assumptions:
  Tasks scheduled per day:      10M
  One-time tasks:               60% (6M)
  Recurring tasks:              40% (4M definitions, generating 50M executions/day)
  Total executions per day:     56M
  Average task duration:        5 seconds
  Peak task burst:              10x average

Throughput:
  56M / 86,400                  ~ 650 tasks/sec average
  Peak (10x):                   ~ 6,500 tasks/sec

Worker capacity:
  Each worker handles 1 task at a time for 5 seconds
  Worker throughput: 12 tasks/min = 720 tasks/hour
  Peak needs: 6,500 concurrent tasks
  Workers needed: 6,500 (at peak, with 1 task per worker)
  With 10 tasks/worker (async): ~650 workers at peak

Task storage:
  56M tasks/day x 1 KB avg = 56 GB/day
  30-day retention: ~1.7 TB
  Completed tasks archived to cold storage after 7 days

Queue depth:
  At peak, 6,500 tasks queued per second
  If workers process in 5 seconds: ~32,500 tasks in queue at any time
Sanity check your numbers: Our estimate of 56M task executions per day and ~650 workers at peak is reasonable for a mid-to-large scale deployment. For context, Uber’s Cadence/Temporal deployment reportedly handled hundreds of thousands of workflow executions per second at peak across their entire infrastructure, but that represents one of the largest deployments in the world. Apache Airflow deployments at companies like Airbnb and Spotify typically handle millions of task executions per day with clusters of hundreds of workers. Our 1.7 TB of task storage over 30 days is modest — most of this can live in PostgreSQL with older completed tasks archived to S3 or cold storage.

Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│                  Distributed Task Scheduler                     │
│                                                                 │
│  ┌──────────┐       ┌──────────────┐       ┌───────────────┐   │
│  │  Client  │──────▶│  Task API    │──────▶│  Task Store   │   │
│  │  (SDK)   │       │  (REST/gRPC) │       │  (PostgreSQL) │   │
│  └──────────┘       └──────────────┘       └───────┬───────┘   │
│                                                     │           │
│                     ┌──────────────┐                │           │
│                     │  Scheduler   │◀───────────────┘           │
│                     │  (Leader)    │                             │
│                     └──────┬───────┘                             │
│                            │  Enqueues due tasks                │
│                     ┌──────▼───────┐                             │
│                     │  Task Queue  │                             │
│                     │  (Redis /    │                             │
│                     │   SQS)       │                             │
│                     └──────┬───────┘                             │
│                            │                                     │
│              ┌─────────────┼─────────────┐                      │
│              │             │             │                       │
│         ┌────▼────┐  ┌────▼────┐  ┌────▼────┐                  │
│         │Worker 1 │  │Worker 2 │  │Worker N │                  │
│         └────┬────┘  └────┬────┘  └────┬────┘                  │
│              │             │             │                       │
│              └─────────────┼─────────────┘                      │
│                            │                                     │
│                     ┌──────▼───────┐                             │
│                     │  Dead Letter │                             │
│                     │  Queue (DLQ) │                             │
│                     └──────────────┘                             │
└─────────────────────────────────────────────────────────────────┘

API Design

POST   /api/v1/tasks              — Schedule a new task
GET    /api/v1/tasks/{id}         — Get task status
DELETE /api/v1/tasks/{id}         — Cancel a task
PATCH  /api/v1/tasks/{id}/pause   — Pause a recurring task
GET    /api/v1/tasks?status=failed — List tasks by status

Task payload:
{
  "task_type": "send_email",
  "payload": { "to": "user@example.com", "template": "welcome" },
  "schedule": {
    "type": "delayed",          // "immediate" | "delayed" | "cron"
    "run_at": "2026-04-10T10:00:00Z",
    "cron": null                // "0 */6 * * *" for recurring
  },
  "retry_policy": {
    "max_retries": 3,
    "backoff": "exponential",   // "fixed" | "exponential"
    "initial_delay_sec": 10
  },
  "timeout_sec": 300,
  "idempotency_key": "email-welcome-user-123"
}

Deep Dive 1: Task State Machine

Every task follows a strict state machine. Invalid transitions must be rejected.
                  ┌───────────┐
                  │  PENDING   │  ← Task created, not yet due
                  └─────┬─────┘
                        │ (run_at time reached)
                  ┌─────▼─────┐
                  │ SCHEDULED  │  ← Placed in task queue
                  └─────┬─────┘
                        │ (worker picks up)
                  ┌─────▼─────┐
          ┌───────│  RUNNING   │───────┐
          │       └─────┬─────┘       │
          │ (success)   │    (failure + retries left)
    ┌─────▼─────┐       │       ┌─────▼─────┐
    │ COMPLETED  │       │       │ RETRYING   │
    └───────────┘       │       └─────┬─────┘
                        │             │ (back to SCHEDULED)
                        │             └──────▶ SCHEDULED

                        │ (failure + no retries left)
                  ┌─────▼─────┐
                  │   DEAD     │  ← Moved to DLQ
                  └───────────┘

Additional states:
  CANCELLED — user cancelled the task
  PAUSED    — recurring task paused by user

Database Schema

CREATE TABLE tasks (
    id              UUID PRIMARY KEY,
    idempotency_key VARCHAR(255) UNIQUE,
    task_type       VARCHAR(100) NOT NULL,
    payload         JSONB NOT NULL,
    status          VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    priority        INT DEFAULT 0,

    -- Scheduling
    schedule_type   VARCHAR(20) NOT NULL,  -- immediate/delayed/cron
    run_at          TIMESTAMPTZ,
    cron_expression VARCHAR(100),
    next_run_at     TIMESTAMPTZ,           -- For recurring tasks

    -- Execution tracking
    attempt         INT DEFAULT 0,
    max_retries     INT DEFAULT 3,
    timeout_sec     INT DEFAULT 300,
    locked_by       VARCHAR(100),          -- Worker ID holding the lock
    locked_at       TIMESTAMPTZ,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    error_message   TEXT,

    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Index for scheduler to find due tasks efficiently
CREATE INDEX idx_tasks_pending ON tasks (next_run_at)
    WHERE status IN ('PENDING', 'RETRYING');

-- Index for status queries
CREATE INDEX idx_tasks_status ON tasks (status, created_at);
Why PostgreSQL and not a NoSQL database here? Task scheduling requires strong consistency (you cannot execute a task twice because of eventual consistency), transactional state transitions (PENDING to RUNNING must be atomic), and complex queries (find all tasks where next_run_at < NOW() AND status = PENDING). These are SQL strengths.

Deep Dive 2: Distributed Locking for Task Pickup

The critical problem: multiple workers must not pick up the same task.
1

Approach A: Database-Level Locking (Simple, Recommended to Start)

Use SELECT … FOR UPDATE SKIP LOCKED to atomically claim tasks.
BEGIN;
-- Atomically claim the next available task
UPDATE tasks
SET status = 'RUNNING',
    locked_by = 'worker-042',
    locked_at = NOW(),
    attempt = attempt + 1
WHERE id = (
    SELECT id FROM tasks
    WHERE status IN ('SCHEDULED')
      AND next_run_at <= NOW()
    ORDER BY priority DESC, next_run_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
COMMIT;
Pros: No external lock service needed. SKIP LOCKED prevents workers from blocking each other. PostgreSQL handles all the concurrency.Cons: Database becomes the bottleneck at very high throughput (> 10K tasks/sec).
2

Approach B: Redis-Based Distributed Lock (For Higher Scale)

Use Redis sorted sets as the task queue. Workers use ZPOPMIN atomically.
-- Scheduler enqueues tasks (score = execution timestamp)
ZADD task_queue <run_at_timestamp> <task_id>

-- Worker atomically pops the next due task
ZPOPMIN task_queue

-- Worker sets a lock with TTL (lease)
SET lock:task:<task_id> worker-042 EX 300 NX
Pros: Much higher throughput than DB locking. Redis handles 100K+ ops/sec.Cons: Requires careful handling of Redis failures. Task state still lives in PostgreSQL (dual-write concern).
3

Lease-Based Execution

Whether using DB or Redis locks, the pattern is the same:
  1. Worker acquires a lease (lock with TTL) on the task
  2. Worker must renew the lease periodically (heartbeat) if the task runs long
  3. If the worker crashes, the lease expires and another worker can pick up the task
  4. When the task completes, the worker releases the lease and updates state
Lease renewal loop (in worker):
while task_is_running:
    sleep(lease_ttl / 3)
    EXPIRE lock:task:<task_id> lease_ttl  # Renew
The zombie task problem: A worker picks up a task, starts processing, then the process is killed (OOM, hardware failure). The task is stuck in RUNNING state forever. Solution: the scheduler periodically scans for tasks in RUNNING state where locked_at + timeout_sec < NOW(). These tasks are “zombies” and get moved back to SCHEDULED (if retries remain) or DEAD.

Deep Dive 3: Failure Handling and Idempotency

1

Retry with Exponential Backoff

When a task fails, calculate the next retry time:
next_retry_at = NOW() + initial_delay * (2 ^ attempt_number)

Attempt 1: retry after 10 seconds
Attempt 2: retry after 20 seconds
Attempt 3: retry after 40 seconds
Attempt 4: max retries exceeded → move to DLQ
Add jitter (random offset) to prevent thundering herd when many tasks fail simultaneously.
2

Idempotency via Idempotency Keys

Since at-least-once delivery means a task CAN run more than once (worker crash after execution but before acknowledgment), tasks must be idempotent.The idempotency_key field ensures the same logical task is not created twice:
POST /tasks with idempotency_key = "charge-order-456"
→ If key exists and task is COMPLETED → return existing result
→ If key exists and task is RUNNING   → return "in progress"
→ If key does not exist               → create new task
Task implementations must also be idempotent internally. Example: “send welcome email to user X” should check if the email was already sent before sending again.
3

Dead Letter Queue (DLQ)

Tasks that exhaust all retries are moved to a DLQ. The DLQ is:
  • A separate table or queue for manual inspection
  • Alerting triggers when DLQ depth exceeds threshold
  • Operations team can inspect, fix, and re-enqueue dead tasks
  • DLQ tasks are retained for 30 days, then archived
-- Move exhausted task to DLQ
UPDATE tasks
SET status = 'DEAD',
    completed_at = NOW(),
    error_message = 'Max retries exceeded: <last error>'
WHERE id = :task_id;

INSERT INTO dead_letter_queue (task_id, original_payload, failure_reason, dead_at)
VALUES (:task_id, :payload, :error, NOW());
DecisionChoice MadeAlternativeWhy
Task storePostgreSQLCassandraNeed strong consistency + transactions for state machine; Cassandra’s eventual consistency causes double execution
Task queueRedis sorted setsSQS / RabbitMQRedis gives precise scheduled execution time ordering; managed queues add latency
LockingDB-level SKIP LOCKED (start), Redis leases (scale)ZooKeeperStart simple with DB; graduate to Redis. ZooKeeper is operationally heavy for this use case
Execution guaranteeAt-least-once + idempotencyExactly-onceTrue exactly-once is impossible; at-least-once with idempotent tasks is the practical standard
Scheduler HALeader election (single active scheduler)Multiple schedulers with distributed lockingSingle scheduler is simpler; leader election via PostgreSQL advisory lock or Redis Redlock
Recurring tasksScheduler computes next_run_at after each executionWorker computes next runScheduler owns the schedule; workers just execute. Separation of concerns.
1

You defined a rigorous state machine

The task lifecycle (PENDING to RUNNING to COMPLETED/DEAD) with valid transitions shows you think about system correctness, not just the happy path.
2

You addressed the zombie task problem

Workers crash. Networks partition. Tasks get stuck. Proactively designing a zombie detector shows real operational experience.
3

You explained why exactly-once is impossible and offered the practical alternative

At-least-once + idempotency is how every production system (Kafka, SQS, Celery) actually works. This shows you know the theory AND the practice.
4

You designed for progressive scaling

Starting with PostgreSQL SKIP LOCKED and evolving to Redis queues as scale demands. This pragmatic “start simple, scale when needed” approach is exactly what senior engineers advocate.
5

You included observability from the start

Task status tracking, DLQ alerting, and execution metrics are not afterthoughts. In production, you cannot operate what you cannot observe.

Uber’s Cadence / Temporal

Cadence was built at Uber to solve the problem of orchestrating long-running, reliable workflows (ride matching, payment processing, driver onboarding). It was later open-sourced and evolved into Temporal (founded by the same engineers):
  • Workflow as code: Unlike traditional task schedulers that treat tasks as independent units, Cadence/Temporal models entire workflows as code. A workflow function can call activities (individual tasks), wait for signals, set timers, and branch on conditions — all while being fully durable. If the worker crashes mid-workflow, the framework replays the workflow function from the event history to resume exactly where it left off.
  • Event sourcing under the hood: Every state change in a workflow is persisted as an event in a history. The workflow’s current state is reconstructed by replaying this event history. This is fundamentally different from the state-machine-in-a-database approach we designed above. It is more flexible (arbitrary workflow logic instead of predefined states) but more complex to reason about.
  • Task queues with sticky execution: Temporal uses task queues to dispatch activities to workers. It supports “sticky execution” — once a worker starts processing a workflow, subsequent activities in that workflow are preferentially routed to the same worker. This improves cache locality (workflow context is already in memory) and reduces replay overhead.
  • Visibility and search: Temporal provides a built-in visibility layer that indexes workflow metadata (status, type, start time, custom search attributes) into Elasticsearch. Operators can search for workflows across millions of executions. This is the observability layer that is often an afterthought in custom task schedulers.
  • Multi-cluster replication: For disaster recovery, Temporal supports active-passive replication across clusters. Workflows in a failed cluster can be resumed in the standby cluster. This is critical for Uber’s ride-booking workflows — a datacenter failure cannot cause rides to be lost.
  • At Uber’s scale: At peak, Uber’s Cadence deployment handled hundreds of thousands of workflow executions per second across multiple Cassandra clusters totaling petabytes of event history. The system managed everything from ride lifecycle to financial settlement to promotional offer expiration.

Apache Airflow’s Architecture

Airflow, originally created at Airbnb (around 2014) and now an Apache project, is the most widely used open-source task orchestrator for data pipelines:
  • DAG-based scheduling: Airflow represents workflows as Directed Acyclic Graphs (DAGs). Each node is a task (e.g., “extract data from S3”, “run Spark job”, “load into warehouse”) and edges represent dependencies. The scheduler traverses the DAG, executing tasks only when their upstream dependencies are complete.
  • Scheduler architecture: Airflow’s scheduler is a loop that runs every few seconds: (1) parse all DAG files to discover task dependencies, (2) identify tasks whose dependencies are met and whose scheduled time has arrived, (3) place those tasks into a queue (Celery, Kubernetes, or a local executor). The scheduler was historically single-threaded and a bottleneck, but Airflow 2.0+ supports multiple schedulers with database-level locking (similar to the SKIP LOCKED pattern we described above).
  • Executor model: Airflow separates scheduling from execution. The executor is pluggable: CeleryExecutor distributes tasks to a fleet of Celery workers via Redis/RabbitMQ, KubernetesExecutor spins up a new pod for each task (strong isolation but higher overhead), and LocalExecutor runs tasks as subprocesses on the scheduler machine (fine for small deployments).
  • Metadata database (PostgreSQL/MySQL): All task state, DAG definitions, execution history, and scheduling metadata live in a relational database. This is the source of truth. The choice of PostgreSQL or MySQL is deliberate — Airflow needs ACID transactions for state machine transitions, exactly as we discussed in Problem 5.
  • XCom for inter-task communication: Tasks can pass small amounts of data to downstream tasks via “XComs” (cross-communications) stored in the metadata database. For large datasets, tasks write to external storage (S3, GCS) and pass only the reference via XCom. This pattern of passing pointers instead of payloads is a common production optimization.
  • Limitations: Airflow is designed for batch workflows (run this ETL every hour), not for real-time event-driven tasks. It has no native concept of responding to events or running sub-second-latency tasks. For those use cases, Temporal or a custom solution (like the one we designed) is more appropriate.
Choosing between them: Temporal is the better choice when you need durable execution of arbitrary application workflows (user signup flows, order processing, sagas). Airflow is the better choice when you need to orchestrate data pipelines with complex dependencies on a schedule. Our Problem 5 design sits between the two — it handles scheduled tasks (like Airflow) but with a simpler execution model (like Temporal’s activity workers). In an interview, knowing when to use which framework demonstrates architectural judgment.
Task schedulers are critical infrastructure — when they fail, downstream systems silently stop working.The Scheduler (leader) crashes:
  • This is the most dangerous failure. No new tasks get enqueued from PENDING to SCHEDULED. Tasks already in the queue continue to be processed by workers, but no new tasks are picked up from the database.
  • Mitigation: leader election with automatic failover. Run 2-3 scheduler instances. Only one is active (the leader), the others are on standby. Use PostgreSQL advisory locks or Redis Redlock for leader election. When the leader’s lock expires (because it crashed), a standby instance acquires the lock and becomes the new leader within seconds.
  • Detection: the standby schedulers monitor the leader’s heartbeat. Additionally, a canary task (a no-op task scheduled every minute) serves as an end-to-end health check. If the canary does not execute within 2 minutes, alert on-call.
A Worker crashes mid-task:
  • The task is stuck in RUNNING state. The lease (lock TTL) expires after the configured timeout. The zombie detector (a periodic scheduler job) identifies tasks where locked_at + timeout_sec < NOW() and status = RUNNING. These zombie tasks are moved back to SCHEDULED if retries remain, or to DEAD if they are exhausted.
  • The task may have partially completed before the crash. This is why idempotency is non-negotiable — the retry must be safe to execute even if the first attempt partially succeeded. Example: if the task is “charge customer $50,” the payment system must use an idempotency key so the retry does not double-charge.
The Task Queue (Redis) goes down:
  • Workers cannot pull new tasks. Tasks accumulate in PostgreSQL with status SCHEDULED but no queue to feed them to workers.
  • Short-term mitigation: workers fall back to polling PostgreSQL directly using the SELECT ... FOR UPDATE SKIP LOCKED query. This is slower (10-100x less throughput than Redis) but functional.
  • Long-term: Redis Sentinel or Redis Cluster with automatic failover. Recovery in under 30 seconds for most configurations.
PostgreSQL (Task Store) goes down:
  • This is the worst-case scenario. Task state is lost (or temporarily inaccessible). New tasks cannot be created. Task status updates from workers fail.
  • Workers that are currently executing tasks continue to run (they already have the task payload). But they cannot report completion, so tasks will appear as zombies after lease expiry and get retried — making idempotency even more critical.
  • Mitigation: PostgreSQL with streaming replication and automatic failover (Patroni, RDS Multi-AZ). Recovery time: 15-30 seconds. Data loss: zero or near-zero (depending on synchronous vs async replication).
A recurring task consistently fails (e.g., an API it depends on is down for days):
  • After exhausting retries, the task lands in the DLQ. But the recurring schedule generates a new execution for the next interval, which also fails and also lands in the DLQ. You now have a DLQ filling up with the same failing task.
  • Mitigation: circuit breaker pattern for recurring tasks. After N consecutive failures (across multiple scheduled executions), automatically PAUSE the recurring task and alert the task owner. Require manual re-enablement after the root cause is fixed. This prevents a single broken dependency from flooding the DLQ and masking other legitimate failures.
The detail most candidates miss: What happens when the scheduler and the database disagree about the current time? If the scheduler runs next_run_at <= NOW() but the database’s NOW() is 5 seconds behind the scheduler’s clock, tasks execute 5 seconds late. Conversely, if the database clock is ahead, tasks execute early. Solution: always use the database server’s time for scheduling comparisons (the query runs on the DB, so NOW() is the DB’s clock). Never compare timestamps from different machines.
“How would you handle task dependencies — for example, task B can only run after task A completes?”Add a depends_on field to the task schema (a list of task IDs). The scheduler only moves a task to SCHEDULED if all its dependencies are in COMPLETED state. This turns the task scheduler into a DAG executor (like Airflow). For simple linear dependencies, this is straightforward. For complex DAGs, you need a topological sort when evaluating which tasks are ready. Mention that at this point you are essentially building a workflow engine, and it may be worth using Temporal or Airflow instead of reinventing it.“How do you ensure scheduling accuracy — that a task scheduled for exactly 10:00:00 AM runs within 1 second of that time?”The scheduler polls the database on a tight loop (every 1-5 seconds). At each poll, it selects all tasks where next_run_at <= NOW() and enqueues them. If the poll interval is 1 second, the maximum scheduling delay is 1 second plus the queue wait time. For sub-second accuracy, switch from polling to a timer-based approach: load the next N tasks into memory, set OS-level timers for each, and enqueue at the precise moment. This is how cron works internally. The trade-off: in-memory timers are lost if the scheduler crashes, so you still need the DB as the source of truth.“How would you handle a deployment where you need to update the task handler code without dropping in-flight tasks?”Use a blue-green deployment for workers. Deploy the new worker code to a new set of workers (green). Drain the old workers (blue) — stop feeding them new tasks and wait for in-flight tasks to complete (with a timeout). Once all blue workers are idle, shut them down. If a blue worker’s in-flight task exceeds the drain timeout, it is forcibly killed and the lease expires, causing a retry on a green worker. The key: the task payload must be backward-compatible, or versioned (include a version field in the payload so the worker knows which handler to use).
The detail that makes an interviewer’s eyes light up: When discussing the zombie task detector, explain that there is a subtle race condition: the worker might complete the task and try to update its status to COMPLETED at the exact moment the zombie detector resets it back to SCHEDULED. This causes a duplicate execution on the retry. The fix is to use a “generation counter” (or “fence token”): when a worker picks up a task, it reads the current attempt number. When it completes, its status update includes a WHERE clause: WHERE id = :task_id AND attempt = :expected_attempt. If the zombie detector already incremented the attempt, this WHERE clause matches zero rows and the stale completion is silently discarded. This is the fencing token pattern from distributed systems literature (Martin Kleppmann describes it in DDIA), and mentioning it unprompted signals that you understand the subtleties of distributed locking that most candidates hand-wave past.
If you were building this task scheduler on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
Task APIECS Fargate behind ALB or API Gateway + LambdaFargate for steady-state API traffic. Lambda for bursty task submission patterns (pay-per-invocation, auto-scales to thousands of concurrent invocations)
Task Store (source of truth)Amazon RDS for PostgreSQL (Multi-AZ)Managed PostgreSQL with automated failover (15-30 seconds). SELECT FOR UPDATE SKIP LOCKED works identically to self-managed PostgreSQL. Read replicas for status queries. Automated backups with point-in-time recovery
Task QueueAmazon SQS (Standard or FIFO)SQS Standard for at-least-once delivery with massive throughput (nearly unlimited). SQS FIFO for ordered processing with deduplication (300 msg/sec base, 3000 with high throughput mode). Built-in DLQ support — no need to build your own
Scheduler (cron/delayed tasks)Amazon EventBridge SchedulerFully managed scheduler that triggers targets (Lambda, SQS, Step Functions) on a schedule or at a specific time. Supports cron expressions and one-time schedules. Scales to millions of schedules. Eliminates the need for a custom scheduler leader
WorkersECS Fargate or LambdaFargate for long-running tasks (up to hours). Lambda for short tasks (<15 minutes) with automatic scaling to thousands of concurrent executions. Mix both: Lambda for fast tasks, Fargate for heavy tasks
Workflow OrchestrationAWS Step FunctionsFor complex task dependencies (DAGs), Step Functions provides visual workflow orchestration with built-in retry policies, error handling, timeouts, and parallel execution. Standard Workflows for long-running tasks, Express Workflows for high-volume short tasks
Dead Letter QueueSQS DLQ + CloudWatch AlarmSQS natively supports DLQ configuration. CloudWatch alarm when DLQ depth > 0 triggers SNS notification to on-call
MonitoringCloudWatch + CloudWatch Logs InsightsCustom metrics for task completion rate, execution duration, queue depth, DLQ depth. Logs Insights for querying task execution logs across all workers
Completed Task ArchiveS3 + GlacierAfter 7 days, archive completed task records from RDS to S3 (Parquet via AWS Glue ETL). Lifecycle policy moves to Glacier after 90 days for long-term retention
The “buy vs build” question for task scheduling on AWS: Before building a custom task scheduler, consider whether AWS Step Functions + EventBridge Scheduler covers your requirements. Step Functions handles task state machines, retries, timeouts, and dead-letter handling natively. EventBridge Scheduler handles cron and one-time scheduling for millions of schedules. Together, they replace 80% of what the custom design above provides. Build custom only when you need: (1) sub-second scheduling precision (EventBridge has ~1 second accuracy), (2) custom priority queuing (Step Functions is FIFO per execution), or (3) tight integration with your own task execution framework. In an interview, mentioning this “buy vs build” analysis and explaining when managed services fall short is a strong signal of production judgment.
Constraint 1: “Tasks can have dependencies. Task B must run only after Task A completes. Support arbitrary DAGs.”
  • The simple queue model breaks. You now need a DAG execution engine. Each task has a depends_on field listing prerequisite task IDs. The scheduler evaluates task readiness by checking whether all dependencies are in COMPLETED state. This requires a topological sort on the dependency graph. Cycles must be detected and rejected at submission time. At this point, you are building Airflow or Temporal. In an interview, acknowledge this and explain when to buy vs build: “If we need DAG execution, I would evaluate Temporal or Step Functions before building custom, because correct DAG scheduling with failure handling is a solved problem that is hard to get right.”
Constraint 2: “The task scheduler runs in a multi-region active-active deployment. Tasks submitted in any region must execute exactly once globally.”
  • Global deduplication becomes the hard problem. The idempotency key must be checked globally, not per-region. Options: (1) use a global DynamoDB table with a unique constraint on the idempotency key (cross-region replication with conflict resolution), (2) designate a single “coordinator region” that owns task deduplication and fans out execution to regional workers, or (3) partition tasks by a deterministic key (e.g., hash of idempotency key modulo 3 regions) so each task has exactly one “owning” region. Option 3 is the most scalable but requires re-routing submissions to the owning region.
Constraint 3: “Some tasks are long-running (up to 24 hours). Others are sub-second. The scheduler must handle both efficiently.”
  • A single worker pool cannot handle both. Sub-second tasks need lightweight workers (Lambda functions) that spin up and down instantly. 24-hour tasks need persistent workers (ECS tasks) with lease renewal, heartbeats, and checkpointing. The scheduler routes tasks to different worker pools based on estimated duration. The lease-based locking pattern from the original design must support configurable TTLs — a 60-second TTL for sub-second tasks, a 1-hour TTL with renewal for long-running tasks. Checkpointing allows a long-running task to resume from its last checkpoint after a worker failure, rather than restarting from scratch.
Constraint 4: “The task scheduler must support tenant isolation. One tenant’s burst of 1M tasks must not delay another tenant’s 10 critical tasks.”
  • Per-tenant queues or per-tenant priority within a shared queue. The simplest approach: each tenant gets a fair-share allocation of worker capacity. If you have 100 workers and 10 active tenants, each tenant gets 10 workers. Burst capacity is allocated from an overflow pool. Critical tasks bypass the per-tenant queue and go to a dedicated high-priority queue (similar to the notification system’s priority routing). This is the multi-tenancy pattern used by managed services like AWS Step Functions and Temporal Cloud.
This problem connects to several foundational topics covered in other chapters:
  • Databases — PostgreSQL advisory locks, SELECT FOR UPDATE SKIP LOCKED, ACID transactions for state machine integrity, and schema design connect to APIs and Databases. For a deeper dive into PostgreSQL’s locking mechanisms (advisory locks, row-level locks, SKIP LOCKED internals) and when to choose PostgreSQL over DynamoDB for transactional workloads, see Database Deep Dives.
  • Messaging and queues — Redis as a task queue, Kafka for event-driven architectures, dead-letter queues, and at-least-once delivery guarantees are covered in Messaging, Concurrency, and State.
  • Reliability — Leader election, automatic failover, circuit breaker patterns, and graceful degradation connect to Reliability Principles.
  • Distributed systems theory — The leader election for the scheduler, the fencing token pattern for zombie task detection, and the impossibility of exactly-once delivery in distributed systems are core consensus and consistency concepts explored in Distributed Systems Theory. The Raft or Paxos consensus protocols that underpin PostgreSQL’s Patroni failover and Redis’s Sentinel are covered there in depth.
  • Observability — Task status dashboards, DLQ depth alerting, canary tasks, and execution latency metrics connect to Caching and Observability.
  • Design patterns — The state machine pattern, lease-based locking, idempotency keys, and exponential backoff with jitter are fundamental patterns explored in Design Patterns.
  • Deployment — Blue-green deployments for worker code updates, draining in-flight tasks, and zero-downtime deploys connect to Networking and Deployment.
  • Cloud service patterns — Step Functions for workflow orchestration, EventBridge Scheduler for managed cron, SQS for task queuing with native DLQ support, and the “buy vs build” decision framework for task scheduling are AWS-specific patterns explored in Cloud Service Patterns.

Problem 6: Design a Notification System

Difficulty: Medium-Hard | Time Target: 45 minutes

Questions You Should Ask the Interviewer

CategoryQuestionWhy It Matters
ChannelsWhich channels: push notifications, email, SMS, in-app? All four?Each channel has different delivery infrastructure, cost, and latency characteristics
PriorityDo notifications have priority levels (urgent vs marketing)?Priority determines queue ordering and whether to bypass rate limits
User preferencesCan users opt out of specific channels or notification types?Requires a user preferences service and filtering layer before dispatch
DeduplicationShould we prevent sending the same notification twice?Dedup is critical — sending duplicate SMS costs money and annoys users
Rate limitingShould we limit notifications per user per time window?Users who receive 50 push notifications in an hour will disable notifications entirely
TemplatingAre notifications templated or free-form? Localization?Templates with variable substitution simplify the API and enable A/B testing
ScaleHow many notifications per day? Peak events?Black Friday or breaking news can cause 100x traffic spikes
Delivery trackingDo we need delivery confirmation and analytics?Affects whether you track sent/delivered/opened/clicked states

Requirements You Should Lock Down

Functional:
  • Send notifications across four channels: push (iOS + Android), email, SMS, and in-app
  • Support priority levels: critical (immediate), high (within 1 minute), normal (within 5 minutes), low (batched hourly)
  • Respect user channel preferences (opt-in/opt-out per channel per notification type)
  • Deduplicate notifications within a configurable time window (default: 1 hour)
  • Rate limit per-user per-channel (e.g., max 5 push notifications per hour for marketing)
  • Notification templates with variable substitution and localization support
  • Delivery tracking: sent, delivered, opened, clicked
Non-Functional:
  • Deliver critical notifications within 10 seconds
  • Handle 1 billion notifications per day (normal load), 10 billion during peak events
  • 99.9% delivery rate (at-least-once per channel)
  • Graceful degradation if one channel provider is down (do not block other channels)
Senior move: Ask about notification aggregation. “If a user gets 20 likes on a post in 5 minutes, should we send 20 push notifications or one that says ‘20 people liked your post’?” This question reveals whether you need a batching and aggregation layer, which significantly changes the architecture. Every mature notification system (Facebook, LinkedIn, Twitter) uses aggregation to prevent notification fatigue.

Back-of-Envelope Calculation

Assumptions:
  Total notifications/day:         1 billion (normal), 10 billion (peak)
  Channel distribution:
    Push:    40% = 400M/day
    In-app:  30% = 300M/day
    Email:   20% = 200M/day
    SMS:     10% = 100M/day (most expensive, reserved for critical)

Throughput:
  1B / 86,400                      ~ 11,600 notifications/sec (average)
  Peak (10x):                      ~ 116,000 notifications/sec

Per-channel peak throughput:
  Push:   ~46,000/sec
  In-app: ~35,000/sec
  Email:  ~23,000/sec
  SMS:    ~12,000/sec

Storage (notification history, 90-day retention):
  1B/day x 90 days x 500 bytes     = ~45 TB
  (notification ID, user_id, type, channel, status, timestamps, metadata)

User preferences store:
  500M users x 1 KB preferences     = 500 GB
  (channel prefs, notification type prefs, quiet hours, locale)

Deduplication window:
  Track last 1 hour of notification hashes per user
  500M users x avg 10 notifications/hour x 32 bytes hash = ~160 GB
  → Fits in a Redis cluster
Sanity check your numbers: Our estimate of 1 billion notifications/day is in the range of major platforms. LinkedIn reportedly sends billions of notifications per week across email, push, and in-app. Facebook sends tens of billions of push notifications daily. Our peak estimate of 10 billion/day accounts for events like election nights or Black Friday where notification volume spikes by 5-10x. The 45 TB storage estimate for 90-day retention is manageable with DynamoDB or Cassandra partitioned by user_id + date.

Component Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                      Notification System                                 │
│                                                                         │
│  ┌──────────┐    ┌────────────────┐    ┌──────────────────┐             │
│  │ Services │───▶│ Notification   │───▶│  Validation &    │             │
│  │ (callers)│    │ API            │    │  Enrichment      │             │
│  └──────────┘    └────────────────┘    └────────┬─────────┘             │
│                                                  │                       │
│                                          ┌───────▼────────┐             │
│                                          │  Priority       │             │
│                                          │  Router         │             │
│                                          └───────┬────────┘             │
│                           ┌──────────────────────┼──────────────┐       │
│                           │                      │              │       │
│                    ┌──────▼──────┐    ┌──────────▼──┐   ┌──────▼─────┐ │
│                    │  Critical   │    │  Normal     │   │  Low       │ │
│                    │  Queue      │    │  Queue      │   │  Queue     │ │
│                    └──────┬──────┘    └──────┬──────┘   └──────┬─────┘ │
│                           │                  │                 │       │
│                    ┌──────▼──────────────────▼─────────────────▼─────┐ │
│                    │           Channel Dispatcher                     │ │
│                    │  ┌────────┐ ┌───────┐ ┌──────┐ ┌────────────┐  │ │
│                    │  │  Push  │ │ Email │ │ SMS  │ │  In-App    │  │ │
│                    │  │ Worker │ │Worker │ │Worker│ │  Worker    │  │ │
│                    │  └───┬────┘ └───┬───┘ └──┬───┘ └─────┬──────┘  │ │
│                    └──────┼──────────┼────────┼────────────┼─────────┘ │
│                           │          │        │            │           │
│                    ┌──────▼──┐ ┌─────▼──┐ ┌──▼─────┐ ┌───▼────────┐ │
│                    │APNs/FCM│ │SES/    │ │SNS/   │ │WebSocket  │ │
│                    │        │ │Sendgrid│ │Twilio │ │/SSE       │ │
│                    └────────┘ └────────┘ └───────┘ └───────────┘ │
│                                                                         │
│  ┌──────────────────────┐  ┌────────────────────┐                       │
│  │  User Preferences    │  │  Dedup + Rate      │                       │
│  │  Service             │  │  Limiter           │                       │
│  └──────────────────────┘  └────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────────┘

API Design

POST /api/v1/notifications
  Body: {
    "user_id": "user_123",
    "notification_type": "order_shipped",
    "priority": "high",
    "channels": ["push", "email"],       // Optional override; default from user prefs
    "template_id": "order_shipped_v2",
    "template_vars": {
      "order_id": "ORD-789",
      "tracking_url": "https://..."
    },
    "idempotency_key": "order-shipped-ORD-789",
    "schedule_at": null,                 // null = immediate, or ISO timestamp
    "aggregate_key": "post_likes:post_456"  // For batching similar notifications
  }

POST /api/v1/notifications/bulk
  Body: {
    "user_ids": ["user_123", "user_456", ...],
    "notification_type": "breaking_news",
    ...
  }

GET /api/v1/users/{user_id}/notifications
  Response: paginated list of in-app notifications

PATCH /api/v1/users/{user_id}/preferences
  Body: {
    "channels": { "push": true, "email": true, "sms": false },
    "quiet_hours": { "start": "22:00", "end": "07:00", "timezone": "America/New_York" },
    "notification_types": {
      "marketing": { "push": false, "email": true },
      "order_updates": { "push": true, "email": true, "sms": true }
    }
  }

Deep Dive 1: Priority-Based Routing and Queue Architecture

Not all notifications are created equal. A two-factor authentication code must arrive in seconds; a weekly digest can wait hours. The priority system ensures critical notifications are never delayed by marketing volume.
1

Priority Levels and Their SLAs

CRITICAL (P0): 2FA codes, security alerts, payment failures
  → SLA: < 10 seconds
  → Dedicated high-priority queue
  → Bypass rate limiting (but not dedup)
  → Never batched or aggregated

HIGH (P1): Order updates, direct messages, mentions
  → SLA: < 1 minute
  → Normal queue with priority ordering
  → Subject to rate limiting

NORMAL (P2): Social updates, content recommendations
  → SLA: < 5 minutes
  → Normal queue, processed after P0/P1
  → Subject to rate limiting and aggregation

LOW (P3): Marketing emails, weekly digests, surveys
  → SLA: < 1 hour
  → Batch queue, processed during off-peak hours
  → Heavily rate-limited, aggregated
2

Queue Architecture

Use separate queues per priority level (not a single priority queue) to ensure critical notifications are never blocked by a flood of low-priority ones.
Critical Queue (SQS FIFO or dedicated Kafka topic)
  → Dedicated worker pool (always running, never scaled down)
  → Independent of normal queue depth

Normal Queue (SQS Standard or Kafka topic, partitioned by channel)
  → Auto-scaling worker pool based on queue depth
  → Priority ordering within the queue if using Kafka (P1 before P2)

Batch Queue (SQS Standard with delayed visibility)
  → Workers process during configured windows
  → Aggregation happens before dispatch

Deep Dive 2: Deduplication and Idempotency

Duplicate notifications are expensive (SMS costs money) and annoying (users uninstall your app). Dedup happens at two levels:
1

API-Level Dedup (Idempotency Key)

Each notification request includes an idempotency_key. Before processing, check Redis:
SET dedup:{idempotency_key} 1 EX 3600 NX
→ If SET succeeds (NX = only if not exists): process the notification
→ If SET fails (key already exists): return "already processed", skip
This prevents the calling service from accidentally sending the same notification twice (e.g., retry after a timeout when the first request actually succeeded).
2

Content-Level Dedup (Similarity Hashing)

Even with different idempotency keys, the same logical notification can arrive from different code paths. Hash the notification content:
content_hash = SHA256(user_id + notification_type + template_id + key_vars)

Check: SISMEMBER user_dedup:{user_id} {content_hash}
→ If member exists: duplicate within the dedup window, skip
→ If not: SADD user_dedup:{user_id} {content_hash}
         EXPIRE user_dedup:{user_id} 3600
3

Aggregation-Level Dedup

For notifications that should be aggregated (e.g., “20 people liked your post”), use the aggregate_key:
INCR aggregate:{user_id}:{aggregate_key}
EXPIRE aggregate:{user_id}:{aggregate_key} 300

If count == 1: start a 5-minute timer
If count > 1:  do nothing (timer is already running)

When timer fires:
  count = GET aggregate:{user_id}:{aggregate_key}
  Send single notification: "{count} people liked your post"
  DEL aggregate:{user_id}:{aggregate_key}

Deep Dive 3: User Preferences and Channel Selection

The preferences engine determines which channels to use for each notification:
Channel Selection Algorithm:
1. Start with the requested channels (from API call)
2. If no channels specified, use defaults for this notification_type
3. Filter by user preferences:
   - Remove channels the user has opted out of globally
   - Remove channels the user has opted out of for this notification_type
4. Apply quiet hours:
   - If current time is within user's quiet hours:
     - CRITICAL: deliver anyway (2FA codes cannot wait)
     - HIGH: deliver push/in-app but defer email/SMS to end of quiet hours
     - NORMAL/LOW: defer all channels to end of quiet hours
5. Apply channel-specific rules:
   - SMS: only for critical and user-opted-in notification types (cost control)
   - Email: check bounce status; do not send to addresses that have bounced 3+ times
   - Push: check if user has a registered device token; fall back to in-app if not
6. Return the filtered list of channels to dispatch to
The preferences consistency problem: User preferences are read on every notification dispatch (1B+ reads/day). Caching is essential, but stale cache creates a bad experience — a user opts out of marketing emails, but the cache still has the old preferences and they receive one more email. Solution: cache preferences in Redis with a 5-minute TTL. On preference update, immediately invalidate the cache entry (DEL prefs:{user_id}). The next notification dispatch reads from the database and repopulates the cache. The worst case: a 5-minute window where a user receives a notification on a channel they just opted out of. For most applications, this is acceptable. For SMS (which costs money), read preferences from the database directly — the extra latency is worth the accuracy.

Deep Dive 4: Per-User Rate Limiting

Rate limiting per user per channel prevents notification fatigue:
Rate limits (configurable per notification_type):
  Push:  max 5 per hour for marketing, unlimited for critical
  Email: max 3 per day for marketing, max 10 per day total
  SMS:   max 2 per day total (cost control)
  In-app: unlimited (user controls their own inbox)

Implementation (sliding window counter in Redis):
  key = ratelimit:{user_id}:{channel}:{window}
  INCR key
  if first increment: EXPIRE key {window_seconds}
  if count > limit: reject (queue for later or drop based on priority)
The rate limiting vs priority interaction: Critical notifications (2FA codes, security alerts) must bypass rate limits. If a user has already received 5 push notifications this hour (at the marketing limit), a 2FA code must still get through. Implement this by checking priority before rate limiting — P0 notifications skip the rate limit check entirely. For P1-P3, apply rate limits and, if the limit is hit, either defer (P1-P2) or drop (P3). This priority-aware rate limiting is a production nuance that most candidates miss.
DecisionChoice MadeAlternativeWhy
Queue architectureSeparate queues per prioritySingle priority queueSeparate queues ensure critical notifications are isolated from marketing volume floods. A single queue risks head-of-line blocking
Dedup approachRedis-based with TTLDatabase unique constraintRedis is faster (sub-ms) for the dedup check that happens on every notification. DB constraint is a safety net but too slow for hot-path
User preferences cacheRedis with 5-min TTL + invalidation on updateRead-through cacheExplicit invalidation ensures preferences update within seconds. Read-through with long TTL causes stale reads
Channel selectionPreference engine at dispatch timePre-compute per-user channel listsDispatch-time evaluation respects real-time preference changes, quiet hours, and device availability. Pre-computation gets stale
Notification aggregationTimer-based batching with aggregate keysNo aggregationAggregation dramatically reduces notification fatigue. “20 people liked your post” is better UX than 20 separate notifications
SMS providerMulti-provider with failover (Twilio primary, Vonage backup)Single providerSMS providers have outages. Multi-provider with automatic failover ensures critical SMS (2FA) always gets delivered
Delivery trackingEvent-sourced status updatesStatus pollingEach channel worker publishes status events (sent, delivered, opened, clicked) to a stream. Async processing decouples tracking from dispatch
1

You designed priority-aware routing from the start

Separating critical from marketing notifications is not a feature — it is an architectural decision that affects queue design, rate limiting, and failure handling. Raising this proactively shows operational maturity.
2

You implemented multi-level deduplication

API-level dedup (idempotency keys), content-level dedup (similarity hashing), and aggregation-level dedup (batching similar notifications) are three distinct problems. Most candidates only address one.
3

You addressed the preferences-caching consistency problem

The tension between caching for performance and accuracy for opt-out respect is a real production problem. Explaining the trade-off (5-minute staleness window, immediate invalidation, direct DB read for SMS) shows you have operated notification systems.
4

You designed for notification fatigue

Rate limiting, aggregation, and quiet hours are not afterthoughts — they determine whether users keep your notifications enabled or disable them entirely. This user-experience awareness distinguishes senior engineers from those who only think about throughput.
5

You planned multi-provider failover for SMS

SMS is the most expensive and most unreliable channel. Designing multi-provider failover shows you understand vendor risk and cost management, not just system architecture.

LinkedIn’s Notification Infrastructure

LinkedIn’s notification system is one of the most sophisticated in the industry, delivering billions of notifications per week across push, email, in-app, and badge count updates:
  • Unified notification platform: LinkedIn built a centralized notification service that all product teams use. Teams do not build their own notification dispatch — they call the platform API with a notification type, user ID, and template variables. The platform handles channel selection, preference filtering, rate limiting, dedup, and delivery. This centralization is critical: without it, different teams implement conflicting rate limiting logic and users get bombarded.
  • Relevance scoring: Before dispatching, LinkedIn runs each notification through a relevance model that predicts whether the user will engage with it. Low-relevance notifications are silently dropped or downgraded to in-app only. This is a notification-specific ranking model similar to the feed ranking model — it considers the user’s historical engagement with this notification type, time of day, and current notification load.
  • Volume capping: LinkedIn enforces strict per-user per-day limits across all notification types. Even if 50 different product features want to notify a user, the total push notification count per day is capped. This forces product teams to compete for notification “budget” and naturally selects for the most relevant notifications.
  • Email digests: Rather than sending individual emails for every event, LinkedIn batches low-priority notifications into periodic digest emails (daily or weekly, based on user preference). The digest engine aggregates events, deduplicates, and selects the most relevant items for the digest. This reduces email volume by 10x while maintaining engagement.
  • Delivery feedback loop: LinkedIn tracks open rates, click-through rates, and unsubscribe rates per notification type per channel. If a notification type’s engagement drops below a threshold, it is automatically flagged for review. If a user stops engaging with push notifications for 30 days, LinkedIn reduces push volume to that user. This closed-loop system prevents notification fatigue at scale.

WhatsApp Business Notifications

WhatsApp’s notification system for businesses (appointment reminders, shipping updates, etc.) offers lessons in constraint-driven design:
  • Template-only messaging: Businesses can only send notifications using pre-approved message templates. This eliminates spam by design and ensures every notification the user receives is structured, relevant, and expected. Templates must be approved by WhatsApp before they can be used.
  • Rate limiting by business quality: Each business account has a quality score based on user feedback (blocks, reports). High-quality businesses can send more notifications; low-quality businesses get throttled. This feedback-driven rate limiting is more sophisticated than simple request counting.
  • Delivery windows: WhatsApp enforces a 24-hour conversation window. A business can only send template messages outside of a conversation window. Within a conversation (user has responded in the last 24 hours), the business can send free-form messages. This window-based approach balances business needs with user experience.
The key lesson from both systems: Mature notification platforms are as much about not sending notifications as they are about sending them. Relevance scoring (LinkedIn), quality-based throttling (WhatsApp), and user-level volume caps are what separate notification systems that users love from ones that get muted. In an interview, discussing these “negative” controls — the mechanisms that prevent sending — signals that you think about the user experience holistically, not just the delivery pipeline.
Notification systems have a unique failure characteristic: users often do not notice a missing notification until it is too late (a missed 2FA code, an undelivered order update). Silent failures are the most dangerous kind.Push notification provider (APNs/FCM) goes down:
  • Critical notifications (2FA) must still reach the user. Fall back to SMS for critical notifications when push delivery fails (check delivery receipt status from APNs/FCM within 30 seconds; if undelivered, escalate to SMS). For non-critical notifications, queue and retry when the provider recovers.
  • Monitor APNs/FCM delivery rates. A sudden drop (from 98% to 50%) indicates a partial outage. Switch to the alternative channel (email) for high-priority notifications during the outage.
Email provider (SES/SendGrid) experiences throttling:
  • Email providers throttle when you send too fast or your bounce rate spikes. Implement adaptive sending rate: start slow, increase until you hit the provider’s feedback signal (HTTP 429 or bounce rate > 2%). Back off and retry with exponential delay.
  • Have a secondary email provider configured and ready. If the primary is throttled for more than 5 minutes, route email traffic to the secondary. This dual-provider setup costs negligibly more but provides critical redundancy.
The dedup Redis cluster goes down:
  • Without dedup, duplicate notifications will be sent. This is the worst outcome for SMS (costs money) and bad for user experience on all channels.
  • Short-term: fall back to database-level dedup (check a dedup table in PostgreSQL with a unique constraint on idempotency_key). This is 10-100x slower but prevents duplicates.
  • For SMS specifically: always check the database, not just Redis, before sending. The cost of a duplicate SMS (and potential regulatory issues) justifies the extra latency.
A massive event triggers billions of notifications simultaneously (breaking news, system-wide alert):
  • The normal processing pipeline cannot handle 100x volume instantly. Implement backpressure: the notification API returns HTTP 202 (accepted, processing asynchronously) and queues all notifications. Workers process at their maximum sustainable rate.
  • Prioritize by priority level: critical notifications process first, marketing notifications are automatically deferred during high-volume events.
  • Auto-scale workers based on queue depth. Set CloudWatch alarms for queue depth and use ECS Service Auto Scaling or Lambda concurrency to add capacity within minutes.
A user changes their preferences while notifications are in-flight:
  • Notifications already in the queue were evaluated against old preferences. When the channel worker processes the notification, it does a final preference check just before dispatch. This “double-check” pattern ensures the user’s latest opt-out is respected, even if the notification was queued before the preference change.
  • This adds one Redis/DB read per dispatch but prevents the most common user complaint: “I opted out but still received a notification.”
The regulatory dimension most candidates miss: SMS and email notifications are subject to regulations (CAN-SPAM, GDPR, TCPA). Sending a marketing SMS to a user who has opted out is not just a bad experience — it is a legal violation with financial penalties. This is why SMS dedup and preference checking must be the most reliable part of your system, even at the cost of higher latency. In an interview, mentioning regulatory requirements for notification channels demonstrates business awareness beyond pure engineering.
“How would you implement notification templates with localization for 50+ languages?”Store templates in a template service (DynamoDB or PostgreSQL) keyed by template_id + locale. Each template contains the text with variable placeholders (e.g., "Your order {{order_id}} has shipped"). At render time, look up the user’s preferred locale from the user profile, fetch the matching template, and substitute variables. If the user’s locale is not available, fall back to English. For push notifications, render the final text server-side. For emails, you can use a templating engine (Handlebars, Jinja2) with richer formatting. Store templates separately from code so they can be updated by content teams without a deployment.“How do you handle the case where a user has multiple devices (iPhone, iPad, Android phone) and should receive a push notification on all of them?”Maintain a device registry that maps user_id to a list of device_tokens with metadata (platform, last_active timestamp, app version). When sending a push notification, fan out to all active devices for that user. “Active” means the device token has not been invalidated by APNs/FCM and the user has opened the app on that device in the last 90 days. Prune stale device tokens by checking APNs/FCM feedback — they report invalid tokens on failed deliveries. The fan-out is typically small (1-3 devices per user) so this does not create a scaling challenge.“What if the notification system itself needs to send notifications about its own health? How do you avoid a circular dependency?”This is the “who watches the watchmen” problem. Notification system health alerts must use an independent alerting path that does not depend on the notification system itself. Options: (1) PagerDuty or Opsgenie integration directly from CloudWatch alarms (bypasses your system entirely), (2) a dedicated Slack webhook from your monitoring service, or (3) a minimal, independent SMS service that only sends operational alerts. The key principle: your alerting infrastructure must have zero dependency on the system it monitors.
The detail that makes an interviewer’s eyes light up: When discussing rate limiting, explain that per-user rate limits are necessary but not sufficient. You also need a global notification budget per user per day across all notification types and channels. Even if each individual notification type stays within its rate limit, a user who is active in 30 different features could receive 30 x 5 = 150 push notifications per day. The solution is a two-tier rate limiter: per-type limits (configurable by the product team) and a global per-user cap (enforced by the platform, non-overridable). When the global cap is hit, the system uses the priority and relevance score to decide which notifications make the cut and which are silently dropped or downgraded to in-app only. This is exactly how LinkedIn’s notification platform works, and mentioning it unprompted signals that you have thought about notification systems from the user’s perspective, not just the infrastructure perspective.
If you were building this notification system on AWS, here is how the components map to specific services:
ComponentAWS ServiceWhy This Service
Notification APIAPI Gateway + Lambda or ECS FargateLambda for bursty traffic with per-invocation billing. Fargate for steady-state traffic with predictable costs
Priority QueuesSQS (separate queue per priority)SQS Standard for normal/low priority (nearly unlimited throughput). SQS FIFO for critical priority (strict ordering, exactly-once processing)
Push NotificationsAmazon SNS Mobile PushDirect integration with APNs (iOS) and FCM (Android). Handles device token management, multi-platform delivery, and automatic retries. Scales to billions of pushes per day
EmailAmazon SESExtremely cost-effective ($0.10 per 1,000 emails). Dedicated IPs for reputation management. Built-in bounce and complaint handling. Suppression list management
SMSAmazon SNS SMS or Amazon PinpointSNS SMS for transactional messages (2FA codes, alerts). Pinpoint for marketing messages with campaign management, segmentation, and analytics
In-App NotificationsDynamoDB + API Gateway WebSocketDynamoDB table partitioned by user_id with notification_id as sort key. WebSocket API for real-time delivery to connected users
User PreferencesDynamoDB + ElastiCache (DAX or Redis)DynamoDB for durable preference storage. DAX for microsecond reads on the hot path. Redis if you need complex data structures beyond key-value
DeduplicationElastiCache for RedisRedis SET with NX (not exists) and TTL for idempotency key dedup. Redis sorted sets for content-hash dedup with expiration
Notification AggregationLambda + DynamoDB Streams + Step FunctionsDynamoDB Streams capture notification events. Step Functions Wait state implements the aggregation timer (5-minute window). Lambda function builds the aggregated notification
Template StorageDynamoDB or S3DynamoDB for templates that are read on every dispatch (sub-ms latency). S3 for rich email templates (HTML) fetched and cached by workers
Delivery TrackingKinesis Data Firehose to S3 + AthenaStream delivery events (sent, delivered, opened, clicked) to S3 in Parquet format. Athena for analytics queries (open rates by notification type, delivery latency percentiles)
MonitoringCloudWatch + SNS Alarms to PagerDutyCustom metrics for delivery rate, queue depth per priority, dedup hit rate, preference cache hit rate. Alarms with different severity thresholds per metric
Cost optimization for email and SMS: SES is one of the most cost-effective AWS services (0.10/1Kemails).SMSisoneofthemostexpensive(0.10/1K emails). SMS is one of the most expensive (0.00645-0.05+permessagedependingoncountry).DesignthesystemsothatSMSisonlyusedforcriticalnotificationswheretheuserhasexplicitlyoptedin.Usepushoremailforeverythingelse.At100MSMS/dayat0.05+ per message depending on country). Design the system so that SMS is only used for critical notifications where the user has explicitly opted in. Use push or email for everything else. At 100M SMS/day at 0.01 average, that is $1M/day in SMS costs alone. A senior engineer raises cost awareness for SMS-heavy notification designs. For a deeper treatment of AWS cost optimization patterns, see Cloud Service Patterns.
Constraint 1: “Notifications must comply with regulations in 30 different countries. Each country has different opt-in requirements, quiet hours, and content restrictions.”
  • The user preferences model expands from simple opt-in/opt-out to a regulatory rules engine. Each country has a profile: Canada requires explicit opt-in for marketing SMS (CASL), Germany has strict quiet hours (no marketing before 8 AM or after 9 PM local time, GDPR), India requires a DND (Do Not Disturb) registry check before sending SMS. The channel selection algorithm from the original design becomes: apply country-specific rules AFTER user preferences, not before. This means you need the user’s country of residence (not just their current location) and a regulatory rules database that is updated by your legal team. The notification API should reject requests that violate country-specific rules, returning a clear error to the caller.
Constraint 2: “The notification system must support webhooks — delivering notifications to external URLs configured by enterprise customers.”
  • Webhooks add a new “channel” with unique failure characteristics: the destination URL may be slow, unreliable, or completely down. You need per-webhook retry policies with exponential backoff, a circuit breaker per webhook URL (if it fails 5 times in a row, stop trying and alert the customer), and a webhook delivery log that customers can inspect. Webhook payloads must be signed (HMAC-SHA256) so the receiver can verify authenticity. The webhook delivery infrastructure is essentially a separate system from the notification system — it looks more like a task scheduler with per-destination queues.
Constraint 3: “Your SMS costs are $1.2M/year. Reduce by 50% without losing critical notifications.”
  • SMS cost optimization: (1) Replace SMS with push notifications wherever possible — push is free, SMS costs 0.010.01-0.05 per message. For any notification where the user has push enabled, never send SMS. (2) Batch low-priority SMS into daily digests. Instead of 5 separate SMS for 5 order updates, send one SMS at end of day: “You had 5 order updates today. Check the app for details.” (3) Use short codes and toll-free numbers (cheaper per-message than long codes). (4) Negotiate volume discounts with your SMS provider — at $1.2M/year, you have leverage. (5) Implement SMS verification before sending: if a phone number has bounced 3 times, remove it from the SMS channel. Each of these is a real optimization used by companies like Uber and DoorDash.
Constraint 4: “The system must support AI-generated notification content. A language model writes personalized messages per user.”
  • This adds an inference step to the notification pipeline: before template rendering, the system calls an LLM (or a fine-tuned small model) to generate personalized content. Latency implications: an LLM call adds 200ms-2s per notification. For critical notifications (2FA), skip personalization. For marketing notifications, batch the LLM calls — generate personalized content for a cohort of users in bulk, then dispatch. The LLM call must be in the non-blocking path (after queuing, before dispatch), not in the API request path. Content moderation of LLM-generated text is non-negotiable — every generated message must pass through a toxicity filter before sending. This is how LinkedIn and Spotify personalize notification copy at scale.
This problem connects to several foundational topics covered in other chapters:
  • APIs and HTTP — RESTful API design for the notification endpoint, webhook delivery patterns, and HTTP status codes (202 Accepted for async processing) are covered in APIs and Databases.
  • Messaging and queues — Priority queues, dead-letter queues, at-least-once delivery guarantees, and the fan-out dispatch pattern connect to Messaging, Concurrency, and State.
  • Databases — DynamoDB for user preferences and notification history, partition key design for time-series notification data, and the trade-off between Redis and DynamoDB for dedup storage connect to Database Deep Dives.
  • Caching — Redis for dedup state, user preference caching with TTL-based invalidation, and template caching connect to Caching and Observability.
  • Reliability — Multi-provider failover for SMS/email, graceful degradation when a channel is down, and the “double-check” pattern for preferences connect to Reliability Principles.
  • Security — Notification opt-out compliance (CAN-SPAM, GDPR, TCPA), secure delivery of sensitive content (2FA codes), and rate limiting as abuse prevention connect to Auth and Security.
  • Real-time systems — In-app notification delivery via WebSocket, presence-aware channel selection (push to online users, email to offline users), and real-time delivery tracking connect to Real-Time Systems.
  • Distributed systems theory — Deduplication in distributed systems (idempotency guarantees across retries), exactly-once delivery semantics, and the CAP theorem implications for the preference cache connect to Distributed Systems Theory.
  • Cloud service patterns — SNS for push notifications, SES for email, Pinpoint for SMS, SQS for priority queuing, and the cost optimization strategies for high-volume notification delivery are AWS-specific patterns explored in Cloud Service Patterns.

Summary: The Pattern Across All Six Problems

Every senior-level system design answer follows the same meta-pattern (visible across all six problems above):
  1. Ask before you draw — Requirements clarification is not a formality; it changes the design.
  2. Numbers before architecture — Back-of-envelope estimation reveals whether you need 3 servers or 3,000.
  3. Start with the right abstraction — High-level design captures components and data flow before implementation details.
  4. Go deep on what matters — You cannot deep-dive everything in 45 minutes. Pick the 2-3 components that define the system.
  5. Name the trade-offs explicitly — Every design decision has a cost. Senior engineers articulate both sides.
Practice drill: Take each problem above and change one requirement. What if the URL shortener needs to support 10 billion URLs/day? What if the chat system needs end-to-end encryption? What if the task scheduler needs exactly-once semantics across data centers? What if the notification system needs to support regulatory-compliant opt-in for 30 different countries? Practicing with requirement mutations is how you build the adaptive thinking that aces real interviews.

Think of It This Way

System design interviews are like architectural blueprints — the interviewer wants to see your thinking process, not a finished building. An architect does not walk into a client meeting and immediately start drawing floor plans. They ask: how many people will use this space? What is the budget? Are there zoning constraints? What is the climate? Only after understanding the requirements do they pick up a pencil. Your system design interview should follow the same rhythm: clarify, estimate, sketch, refine. Here is another way to think about it: a great system design answer is like a map with multiple zoom levels. Start at the continent level (high-level architecture), then zoom into the country level (key components), then the city level (critical algorithms and data models). The interviewer should be able to stop you at any zoom level and feel that you have a coherent picture. Candidates who jump straight to city-level details (arguing about Redis data structures before explaining the overall architecture) lose the forest for the trees.

Deep Dive Resources

These resources will deepen your understanding beyond what any single guide can cover. They are organized by format so you can pick what fits your learning style.
ResourceWhy It Matters
System Design Interview by Alex Xu (Vol 1 & 2)The most interview-focused system design resource available. Each chapter walks through a specific system (rate limiter, chat, news feed, etc.) with the same Clarify-Estimate-Design-Deep Dive framework. Volume 1 covers fundamentals; Volume 2 covers more advanced topics like hotel reservation systems and stock exchanges.
Designing Data-Intensive Applications by Martin KleppmannThe definitive deep reference for understanding the building blocks (replication, partitioning, consistency models, stream processing) that underpin every system design. Not interview-focused, but the mental models it builds are irreplaceable. Read this to understand why systems work the way they do.
Web Scalability for Startup Engineers by Artur EjsmontA practical, end-to-end guide to scaling web applications. Covers caching, async processing, search, and more. Especially useful if you want to understand the full stack rather than individual components.
ResourceWhy It Matters
System Design Primer (github.com/donnemartin/system-design-primer)A free, comprehensive GitHub repository that covers system design concepts, trade-offs, and practice problems. Think of it as a structured self-study curriculum. The diagrams and summaries are excellent for quick review before interviews.
Grokking the System Design Interview (educative.io)An interactive course that walks through 20+ system design problems step by step. The format (text-based with diagrams, not video) makes it easy to study at your own pace. Widely regarded as the most popular paid resource for system design interview prep.
ByteByteGo by Alex Xu (blog.bytebytego.com / YouTube)Alex Xu’s newsletter and YouTube channel provide visual, concise explanations of system design concepts. The animated diagrams are particularly useful for building intuition about data flow, fan-out patterns, and distributed system behavior. The YouTube channel is free; the newsletter has free and paid tiers.
ResourceWhy It Matters
High Scalability (highscalability.com)One of the oldest and most comprehensive collections of real-world architecture case studies. Their “This is how X works” posts dissect the architectures of companies like Netflix, Twitter, WhatsApp, and Uber. Reading 5-10 of these gives you a library of real patterns to reference in interviews.
InfoQ Architecture Articles (infoq.com/architecture-design)In-depth articles and conference talks on distributed systems, microservices, and architecture patterns. The content skews more toward practicing architects than interview prep, which makes it excellent for building genuine expertise.
Company Engineering BlogsThe best primary sources for how real systems work. Prioritize: Netflix Tech Blog (resilience, streaming), Uber Engineering (real-time systems, Cadence/Temporal), Discord Blog (scaling real-time chat), Meta Engineering (feed systems, TAO, Memcache), Cloudflare Blog (edge computing, rate limiting, DNS). These are the sources that Alex Xu and other authors reference.
ResourceWhy It Matters
Exponent (tryexponent.com)Mock interview platform with system design practice, peer feedback, and expert-led sessions. Useful for practicing the verbal communication aspect that reading alone cannot build.
Pramp (pramp.com)Free peer-to-peer mock interviews. You practice with another engineer and take turns as interviewer and candidate. The best way to practice thinking out loud under time pressure.
LeetCode System Design (leetcode.com/discuss/interview-question/system-design)Community-contributed system design discussions. Quality varies, but the top posts have thousands of upvotes for a reason. Good for seeing how other candidates approach the same problems.

Common Mistakes in System Design Interviews

Even well-prepared candidates fall into these traps. Knowing them in advance gives you a significant edge.
What happens: The interviewer says “Design a chat system” and the candidate immediately starts drawing boxes and arrows. Five minutes later, the interviewer asks “Does this support group chat?” and the entire design needs rework.Why it is tempting: You want to show confidence and speed. You have practiced this exact problem and want to demonstrate your preparation.Why it hurts you: The interviewer is evaluating your engineering judgment, not your speed. Jumping to the solution signals that you build first and ask questions later — a dangerous trait in a senior engineer. It also means you are likely designing for the wrong requirements.The fix: Spend the first 5-8 minutes asking clarifying questions and explicitly stating your assumptions. Write down the functional and non-functional requirements. Get the interviewer to confirm before you draw a single box. This feels slow but actually saves time and earns significant points.
What happens: The candidate designs a URL shortener for Google-scale traffic with a Cassandra cluster, multi-region replication, and a custom sharding layer — when the interviewer specified “a small startup’s internal tool.”Why it is tempting: You have studied the advanced architectures and want to show you know them. Complexity feels impressive.Why it hurts you: Over-engineering is a signal of poor judgment. A senior engineer knows that a single PostgreSQL instance can handle thousands of requests per second and will last a startup years. Designing for 10x your actual scale is prudent; designing for 1000x is wasteful. Interviewers specifically watch for this.The fix: Let the estimation phase drive your architecture. If the numbers say you need 100 QPS, design for 1,000 (10x safety margin) and explain that you would add complexity only when monitoring shows you need it. Mention what you would change at 100x scale, but do not build it into your initial design.
What happens: The candidate presents a clean, happy-path design. The interviewer asks “What happens if this database goes down?” and the candidate has no answer.Why it is tempting: Failure handling is messy and hard to think about. The happy path is elegant and easier to explain within the time limit.Why it hurts you: In production, everything fails. Networks partition. Servers crash. Disks fill up. A design that only works when everything is healthy is not a design — it is a wish. Senior engineers are distinguished by how they think about failure.The fix: For every major component, ask yourself: “What happens when this fails?” Design for it explicitly. Use terms like “fail-open,” “circuit breaker,” “retry with backoff,” and “dead letter queue.” Even brief mentions of failure handling at each layer signal production experience.
What happens: The candidate designs the system without any numbers. They propose “a Redis cache” without knowing whether the dataset fits in memory, or “Cassandra for storage” without knowing if the write volume justifies it.Why it is tempting: Math feels slow and boring compared to architecture discussion. You worry about making arithmetic errors under pressure.Why it hurts you: Without numbers, every architectural choice is a guess. The estimation phase takes 3-5 minutes but shapes every decision that follows. Interviewers at top companies (especially Google and Meta) explicitly expect estimation.The fix: Practice the estimation framework until it is automatic: users per day, requests per second (peak and average), storage per year, bandwidth, cache size. Round aggressively — the goal is order of magnitude, not precision. 1,157 requests/sec and 1,200 requests/sec lead to the same architecture. State your assumptions explicitly so the interviewer can adjust them.
What happens: The candidate goes quiet for 2-3 minutes, thinking hard, then presents a fully formed component. The interviewer has no window into the reasoning.Why it is tempting: You want to present a polished answer. Thinking out loud feels vulnerable — what if you say something wrong?Why it hurts you: The interviewer is buying your thought process, not your final answer. Silence gives them nothing to evaluate. Worse, if you emerge from silence with a flawed design, the interviewer cannot tell whether you considered and rejected the right approach or never thought of it.The fix: Narrate your thinking. “I am considering two approaches here — fan-out on write and fan-out on read. Let me think about the trade-offs. Fan-out on write gives us fast reads but breaks for celebrity users, so I think a hybrid approach makes more sense for this scale.” Even if you change your mind, the interviewer sees your reasoning and gives you credit for it.
What happens: The candidate spends 20 minutes designing a perfect caching layer with eviction policies, warming strategies, and invalidation protocols — but never addresses the database, the API layer, or the overall data flow.Why it is tempting: You know caching really well and feel confident deep-diving there. Going deep feels like providing value.Why it hurts you: System design interviews evaluate breadth AND depth. The interviewer wants to see that you can design a complete system (breadth) and then go deep on 1-2 critical components (depth). Spending all your time on one component signals tunnel vision.The fix: Allocate your time explicitly. Spend the first 15 minutes on requirements, estimation, and the high-level architecture (all major components, their responsibilities, and how data flows between them). Only then pick 2-3 components to deep-dive on — ideally the ones most critical to the system’s success. Ask the interviewer: “I can go deeper on the caching strategy or the data partitioning scheme. Which would you prefer?” This shows time management and gives the interviewer control.

How to Practice System Design

Reading about system design is necessary but not sufficient. You would not prepare for a piano recital by only reading sheet music. System design interviews are a performance — you must practice performing.
The single most effective practice method: Set a 45-minute timer. Pick a problem from this guide (or any system design problem). Talk out loud — even if you are alone in a room. Draw boxes and arrows on paper or a whiteboard. Walk through each phase of the framework: Clarify, Estimate, Design, Deep Dive, Trade-Offs. When the timer goes off, stop and evaluate: Did I cover all phases? Did I get to a deep dive? Did I discuss trade-offs? Practice the framework until it is muscle memory, so that in the real interview, you can focus on the problem instead of the process.

The Practice Framework

1

Week 1-2: Learn the Framework

Read through all six problems in this guide. Do not try to memorize the solutions. Instead, internalize the structure: what questions to ask, how to estimate, how to organize your design on a whiteboard, and how to present trade-offs. Practice the estimation math until you can do it without hesitation — conversions like “100M requests per day = ~1,160 per second” should be automatic.
2

Week 3-4: Solo Practice

Pick one problem per day. Set a 45-minute timer. Talk out loud as if the interviewer is sitting across from you. Draw your design on paper. After each session, compare your design against the guided solution. Identify gaps: did you miss the celebrity problem in the news feed? Did you forget failure modes for the task scheduler? These gaps tell you what to study next.
3

Week 5-6: Peer Practice

Find a practice partner (a colleague, a friend who is also interviewing, or use a platform like Pramp). Take turns being the interviewer and the candidate. As the interviewer, practice asking follow-up questions that probe weaknesses. As the candidate, practice thinking out loud under the social pressure of another person watching. This is where most of the learning happens — the gap between solo practice and a live interview is enormous.
4

Ongoing: Requirement Mutations

Take each problem and change one requirement. What if the URL shortener needs end-to-end encryption? What if the rate limiter needs to support geographic rate limits (different limits per country)? What if the chat system needs to work offline (like iMessage)? What if the news feed needs to support video with real-time comments (like Twitch)? What if the notification system needs to support regulatory requirements in 30 countries with different opt-in rules? Practicing with requirement mutations builds the adaptive thinking that aces novel problems you have never seen before.

Tips That Make a Real Difference

Talk in “phases,” not “steps.” Saying “Let me start with the clarification phase” signals that you have a structured approach. Interviewers hear this and relax — they know you will not waste their time. Draw before you talk. A diagram with 5 boxes and 4 arrows, drawn in 30 seconds, communicates more than 3 minutes of verbal description. Label every box. Label every arrow with data flow direction. Use the diagram as your anchor for the rest of the conversation. Narrate your trade-offs in real-time. Do not save trade-offs for the end. When you pick NoSQL over SQL, say why at that moment: “I am choosing DynamoDB here because the access pattern is pure key-value lookup at 100K+ reads per second, and I do not need joins or complex queries. If the access pattern changes, I would revisit this.” This is the behavior interviewers reward most. Ask the interviewer where to go deep. After your high-level design (15-20 minutes in), say: “I can deep dive on the caching strategy, the data partitioning, or the failure handling. Which would be most useful to discuss?” This shows time management, respect for the interviewer’s priorities, and confidence that you can go deep on any component. Practice the numbers until they are reflexive. Memorize these approximations:
  • 1 day = ~86,400 seconds (round to 100K for quick math)
  • 1 million requests/day = ~12 requests/second
  • 1 billion requests/day = ~12,000 requests/second
  • 1 KB x 1 billion = 1 TB
  • Redis handles ~100K operations/second per instance
  • A single PostgreSQL instance handles ~10K transactions/second
  • A typical server handles ~10K concurrent WebSocket connections
The ultimate test of readiness: Can you design a system you have never practiced, using only the framework? Try “Design a distributed file storage system like Dropbox” or “Design a ride-sharing matching system like Uber.” If you can produce a coherent 45-minute answer to a novel problem using the Clarify-Estimate-Design-Deep Dive-Trade-Offs framework, you are ready. If you find yourself lost without a memorized solution to fall back on, practice the framework more — not more solutions.

Interview Deep-Dive Questions

These questions simulate the kind of probing, multi-layered interrogation you will face from experienced system design interviewers. Each question starts at a specific point and then drills deeper through follow-ups — exactly how a real 45-minute interview unfolds. Use them for self-study or peer practice.

Question

You are designing a social feed system. Walk me through the trade-off between fan-out on write (push model) and fan-out on read (pull model). When does one clearly beat the other, and what forces you into a hybrid?Difficulty: Senior | Concepts tested: Write amplification, read latency, hybrid architectures, real-world scaling judgment

Strong Answer

The way I think about this is through the lens of where you want to pay the cost — at write time or at read time.
  • Fan-out on write means when a user publishes a post, you immediately push that post ID into every follower’s pre-computed feed cache (typically a Redis sorted set). The read path becomes trivially fast — just read from the cache. The cost is write amplification: one post from a user with 1,000 followers generates 1,000 writes. For a system where the vast majority of users have a few hundred followers, this is the correct default because reads outnumber writes by 10-100x, and you want reads to be sub-50ms.
  • Fan-out on read means you store posts per-author and at read time you query all followed authors, merge their posts, rank, and return the top N. There is zero write amplification — one post is one write. But the read path is expensive: if I follow 300 users, the system must scatter-gather 300 queries, merge-sort the results, and return. At scale, this is too slow for a good user experience (500ms+ for a feed load).
  • The decision breaks down at the celebrity boundary. A user with 10 million followers posting once means 10 million cache writes. During peak events — an election, a Super Bowl halftime tweet — that can mean hundreds of millions of writes in a burst that backs up your fan-out queue by minutes or hours. This is exactly the problem Twitter faced publicly. The solution is a hybrid: fan-out on write for normal users (say, under 10K followers), and fan-out on read for celebrities. At read time, you merge the pre-computed feed with a small number of celebrity post lookups. The threshold should be dynamic based on current system load — during peak traffic, lower the threshold to reduce write pressure.
  • Real-world example: Facebook, Twitter/X, and Instagram all converged on this hybrid approach independently. Twitter documented their migration from pure fan-out-on-write to hybrid around 2012-2013 specifically because of the celebrity problem. The fact that three of the largest feed systems arrived at the same architecture is strong evidence that the hybrid is the correct design for this problem class.

Follow-up: How do you decide the exact follower threshold for switching from push to pull?

A static threshold (like 10K) is a reasonable starting point, but in production you want it to be adaptive. The key insight is that the cost of a fan-out depends on two things: the follower count AND the current state of the system.
  • During normal load, you might fan out for users with up to 50K followers because the queue has spare capacity.
  • During a traffic spike (breaking news, sporting event), you lower the threshold to 5K or even 1K to protect the fan-out workers from falling behind.
  • The simplest implementation is a single configuration value that the fan-out service reads each batch: MAX_FANOUT_FOLLOWERS. Monitor Kafka consumer lag — when lag exceeds 5 minutes, automatically lower the threshold. When lag returns to normal, raise it.
  • A more sophisticated approach uses a cost function: fanout_cost = follower_count * current_queue_depth_factor. If the cost exceeds a budget, defer to read-time merge. This makes the system self-tuning.
The trade-off is that lowering the threshold means more work at read time (more celebrity posts to merge), which slightly increases feed-fetch latency. But a 50ms increase in read latency is vastly preferable to a fan-out queue that is 30 minutes behind.

Follow-up: What happens to the user experience during the transition period when you change the threshold?

This is a subtle UX question. When you lower the threshold from 50K to 5K:
  • Users who follow someone with 20K followers were previously getting that person’s posts via their pre-computed feed cache. Now that person is suddenly treated as a “celebrity” and their posts must be merged at read time.
  • If the feed service does not know to look up this author’s posts, they simply disappear from the feed until the next full read-time merge.
  • The fix: maintain a per-user “celebrity list” that is updated when the threshold changes. When a user’s feed is fetched, the Feed Service checks both the pre-computed cache AND the celebrity list. The celebrity list is rebuilt periodically or on threshold change.
  • There is also a cache warming concern: when you raise the threshold back, posts from those users need to be fanned out again. You do not retroactively fan out old posts — you start from the next post. Users may see a slight discontinuity in their feed timeline. In practice, this is barely noticeable because feed ranking surfaces the most relevant content regardless of delivery mechanism.

Going Deeper: How would you handle the edge case where a normal user suddenly goes viral (10 followers yesterday, 5 million followers today)?

This is the “emergent celebrity” problem. Your system was happily fanning out for this user when they had 10 followers. Now they have 5 million, but your system has not reclassified them yet.
  • Detection: Run a periodic job (every 5-10 minutes) that checks follower counts against the current threshold. Flag users who have crossed the threshold as “newly promoted” celebrities.
  • Transition: Stop fan-out for the newly promoted user immediately. Their future posts use the pull model. Existing posts already in followers’ caches remain there (no need to remove them).
  • The dangerous window: If the user posts during the time between crossing the threshold and detection (up to 10 minutes), the fan-out service attempts to push to millions of followers. This is exactly the kind of burst that can back up the queue. Mitigation: add a pre-flight check in the fan-out worker itself — before processing a fan-out event, re-check the author’s current follower count. If it now exceeds the threshold, skip the fan-out and let the read path handle it. This adds one extra lookup per fan-out event but protects against the race condition.
  • WhatsApp and Telegram sidestep this entirely by not having a public feed — every message is point-to-point or to a defined group. The viral celebrity problem is unique to open social graphs.
Structured Answer Template — Fan-Out / Read-vs-Write Trade-Off Questions
  1. Frame the cost asymmetry — “This is a question of where you want to pay: at write time or at read time.”
  2. State the asymmetric read/write ratio — social feeds are typically 100:1 reads to writes, which is why fan-out-on-write wins by default.
  3. Name the breaking point — celebrity users, flash crowds, or viral events; give a concrete threshold (10K followers is the industry heuristic).
  4. Propose the hybrid explicitly — fan-out for normals, fan-in for celebrities, merge at read time. Name the dynamic threshold.
  5. Identify the monitoring signal — Kafka consumer lag, fan-out queue depth, or per-author post-to-fanout time.
Real-World Example — Twitter’s 2013 timeline rearchitecture: Twitter publicly described their migration from pure fan-out-on-write (“Timeline Service”) to a hybrid model around 2012-2013, specifically triggered by the celebrity problem. A single Justin Bieber tweet could trigger 40 million cache writes, backing up the fan-out queue by minutes during peak events. Their solution — hybrid push-pull with a celebrity threshold — has since been adopted by Instagram, Facebook, and most modern feed systems, which is strong evidence the pattern generalizes.
Big Word Alert — Write Amplification Write amplification is the ratio of actual storage/cache writes to logical application writes. In fan-out-on-write, one user post becomes N cache writes where N is the follower count — amplification of 1000x is typical, millions for celebrities. Use it naturally: “Fan-out on write has 1000x write amplification for a user with 1000 followers — that is our write budget, and celebrity accounts blow through it by orders of magnitude.” Warning: Write amplification is not just a storage cost — it is also a latency cost in the fan-out queue and a correctness risk if partial failures leave some followers with the post and some without.
Big Word Alert — Scatter-Gather Scatter-gather is a read pattern that sends one logical request to many backends in parallel, then merges the results. In fan-out-on-read, the feed service scatter-gathers posts from every followed author. Use it naturally: “The read path becomes a scatter-gather across 300 authors, which is why latency balloons — you are bottlenecked by the slowest of 300 parallel reads (tail-latency amplification).” Warning: Scatter-gather’s tail latency is dominated by the slowest backend, not the average. If P99 per-backend is 100ms, the P99 for a 300-way scatter-gather can be 500-1000ms due to the math of order statistics.
Follow-up Q&A Chain:Q: The interviewer says “what if we just do fan-out-on-read for everyone — CPUs are cheap?” Is there a defensible answer? A: Only at a very small scale. At the scale where fan-out-on-read works for everyone, you probably do not need a specialized feed service at all — a single SQL query with a JOIN across “follows” and “posts” suffices. The moment you have enough users that the scatter-gather becomes expensive (roughly 100K daily actives), you need pre-computation somewhere. “Fan-out on read for everyone” is a valid answer for a 1000-user MVP but collapses at scale because read latency is directly paid by the user, while write latency is hidden in a background queue.Q: How do you handle the edge case where a user has 10 million followers but tweets 1000 times per day? The write amplification destroys you. A: That user is functionally a broadcaster, not a social user, and should be treated as one. You move them to fan-out-on-read unconditionally, regardless of their follower count. The real heuristic is not just follower count but post-frequency-times-follower-count — the total write amplification per unit time. Bots, news accounts, and high-frequency celebrities cross this budget quickly. The detection is straightforward: track posts-per-day * follower-count and bucket accounts above a threshold into the pull model.Q: What does “dynamic threshold” actually look like in code — how do you build a self-tuning system? A: The simplest version is a feedback loop on consumer lag. The fan-out service publishes its current queue depth (or consumer lag) every 30 seconds. A controller reads that signal and adjusts MAX_FANOUT_FOLLOWERS in a shared config store (etcd, Consul, or even Redis). When lag exceeds 5 minutes, lower the threshold by 20%. When lag drops below 1 minute, raise it back toward the default. The math does not have to be sophisticated — a PID-like controller with a dead band works fine, and you cap the range so the threshold cannot collapse to zero or balloon to infinity.
Further Reading
  • High Scalability — “The Architecture Twitter Uses to Deal with 150M Active Users” (highscalability.com) — the canonical public writeup of Twitter’s hybrid timeline.
  • Facebook Engineering — “Scaling the Facebook Feed” (engineering.fb.com) — equivalent hybrid architecture at Facebook’s scale.
  • Martin Kleppmann — “Designing Data-Intensive Applications” Chapter 1 — the “Twitter example” section that made this trade-off a standard interview question.

Question

In your URL shortener design, you have Redis as a cache-aside layer in front of the database. The Redis instance just died completely. Walk me through what happens to the system — second by second, from the user’s perspective and from the infrastructure’s perspective.Difficulty: Senior | Concepts tested: Graceful degradation, thundering herd, cache recovery strategies, production incident response

Strong Answer

Let me walk through this as a timeline, because the failure unfolds in stages.
  • Second 0 — Redis dies. The application servers’ Redis clients start getting connection refused errors or timeouts. How quickly they detect this depends on the client configuration — a health check interval of 1-2 seconds with a connection timeout of 500ms means detection within 2-3 seconds.
  • Seconds 1-3 — Cache misses begin. Every redirect request that would have been a cache hit now falls through to the database. Latency jumps from ~5ms (cache hit) to ~20-50ms (database read). Users probably do not notice yet because 50ms is still fast for a redirect.
  • Seconds 3-10 — Thundering herd risk. Here is where it gets dangerous. If your system processes 350K reads/sec at peak and your cache hit rate was 99%, that means only ~3,500 reads/sec were hitting the database. Now ALL 350K reads/sec hit the database simultaneously. A typical PostgreSQL instance handles ~10K TPS, a DynamoDB table in on-demand mode scales up but not instantly. You are looking at a potential 100x spike to the database.
  • The mitigation that matters: request coalescing (also called collapsed forwarding). If 1,000 requests for the same popular short URL arrive within a 100ms window, only ONE request hits the database. The other 999 wait for that single query to return and share the result. This is not a theoretical optimization — it is the difference between your database surviving and falling over. Libraries like singleflight in Go or similar patterns in other languages implement this.
  • Seconds 10-30 — Stabilization. With request coalescing, the database load is manageable but elevated. Latency is higher than normal (50-100ms per redirect instead of 5ms). You are degraded but functional. The monitoring system fires alerts for the Redis outage and elevated database latency.
  • Recovery plan: Do NOT just restart Redis and let it fill up organically. A cold cache with 350K reads/sec means the database stays under heavy load for potentially hours while the cache warms. Instead, run a cache warming script that pre-loads the top 20% of URLs by access frequency (this 20% generates 80% of traffic). You can derive this from access logs or a pre-computed popularity ranking. Load these into Redis before routing traffic back to it. Then switch traffic back to the cache-first path. The warm-up should take minutes, not hours.
  • From the user’s perspective: Redirects were maybe 30-50ms slower for a few minutes. They almost certainly did not notice. This is the power of graceful degradation — the system got worse, not broken.

Follow-up: What if it is not a total Redis failure but a partial one — say, 1 out of 3 Redis cluster nodes goes down?

This is actually more insidious than a total failure.
  • With Redis Cluster, each node owns a subset of the hash slots (the keyspace). When one node dies, all keys on that node become inaccessible. The other two nodes continue serving their keys fine.
  • This means some short URLs resolve instantly from cache (their keys are on healthy nodes) and others are slow (their keys were on the dead node). The inconsistency is invisible to monitoring that only tracks averages — your average latency might look fine while one-third of users experience degraded performance.
  • Detection: track cache hit rate per node or per hash slot range, not just globally. A sudden drop from 99% to 66% is the signature of a single-node failure in a 3-node cluster.
  • Redis Cluster has automatic failover (if you have replicas configured): the replica of the failed master gets promoted to master within 10-30 seconds. During the failover window, requests for keys on that shard fail. Your application code should handle this with a fallback to database read — the same cache-miss path, but triggered by a Redis error rather than a cache miss.
  • The lesson: always configure at least one replica per master in a Redis Cluster. The cost is 2x the memory, but the availability improvement is enormous.

Follow-up: How would you design the cache warming script? What are the pitfalls?

The cache warming script has several non-obvious challenges.
  • Data source: You need to know which URLs are “hot.” Options: (1) maintain a real-time access counter in the database or a separate analytics system, (2) parse recent access logs, or (3) use the previous cache contents if you have periodic snapshots (Redis RDB dumps). Option 1 is most reliable. Option 3 is fastest but only works if you have regular backups.
  • Pace the warming: If you dump 20 million key-value pairs into Redis as fast as possible, you saturate Redis’s network bandwidth and CPU during the load. Rate-limit the warming to ~50K writes/sec (about half Redis’s capacity) so the node can simultaneously serve live traffic.
  • Stale data risk: The warming script reads from the database, but URLs may have been created, modified, or expired since the data was read. For a URL shortener this is low risk (URLs are mostly immutable), but for a system with mutable data, you need to validate freshness. Set a short TTL on warmed entries (say, 5 minutes) so stale entries expire quickly and get refreshed from the database.
  • The biggest pitfall: warming the cache while traffic is still hitting the database. You need to coordinate the cutover — warm the cache, then redirect traffic to use it. If traffic starts using the cache while it is only 10% warmed, you get a 90% miss rate, which is barely better than no cache.
Structured Answer Template — Cache Failure / Thundering Herd Questions
  1. Walk the timeline, not the static state — “at t=0 the cache dies, at t=3s clients detect, at t=10s thundering herd risk peaks.”
  2. Quantify the load multiplier — “99% hit rate means the database is about to see 100x its normal read load.”
  3. Name the mitigation by its pattern name — request coalescing / singleflight, not “we deduplicate.”
  4. Describe recovery, not just survival — cache warming strategy, not just “restart Redis.”
  5. Close with the user-visible outcome — “from the user’s perspective, 30ms of extra latency for 5 minutes, not a full outage.”
Real-World Example — Facebook’s 2010 Memcache “thundering herd” that took them down: Facebook has documented a real incident where a small fraction of memcache nodes restarted, and the cache miss storm on the databases caused a cascading overload that required hours to recover. The postmortem led to the now-famous “leases” feature in their memcache layer — a form of request coalescing baked into the protocol. This incident is why request coalescing is not a nice-to-have in any serious caching layer; it is the difference between degraded service and a full outage.
Big Word Alert — Thundering Herd Thundering herd is a failure mode where many clients simultaneously retry or fall through to the same expensive backend resource after a cache miss or service restart, overwhelming it and causing a cascading outage. Use it naturally: “When the cache node restarted, the thundering herd of 50K concurrent requests for the same hot key took the database’s read IOPS to its ceiling in under 2 seconds.” Warning: “Thundering herd” is sometimes confused with “hot key” — they are related but different. Hot key is too much traffic to a single key under normal conditions; thundering herd is a transient spike after cache loss. Different mitigations.
Big Word Alert — Request Coalescing (Singleflight) Request coalescing is a pattern where multiple concurrent requests for the same key are collapsed into a single backend call, with the other requests waiting for and sharing the result. Go’s singleflight package is the canonical reference implementation. Use it naturally: “We wrap every cache-miss path in singleflight, so if 10K concurrent requests miss the cache on the same short URL, only 1 goes to the database and 9,999 wait for the result.” Warning: Request coalescing only helps with concurrent duplicate requests. It does not help with many distinct hot keys arriving simultaneously — for that you need per-key rate limiting or a bigger database.
Follow-up Q&A Chain:Q: Request coalescing sounds free. What is the catch? A: Two catches. First, the coalesced request’s latency is paid by every waiter — if the backend takes 500ms, all 10K waiters see 500ms, not just the one who made the call. During a slow database, coalescing can make p99 latency worse even as it protects availability. Second, coalescing is per-process — if you have 200 application server instances, you still have up to 200 concurrent database reads for the same key (one per instance), not 1. For multi-instance coalescing you need a distributed lock or a shared cache-aside layer, which introduces its own complexity.Q: How do you test cache failure scenarios without taking down production? A: Three layers. First, chaos testing in staging — a Chaos Monkey-style tool that randomly kills a cache node during load tests and verifies the database survives. Second, a “game day” in production with a controlled blast — disable caching for 1% of traffic for 5 minutes at 3 AM and observe database load; this reveals whether your coalescing actually works at real scale. Third, shadow traffic — dual-read against a test cluster with caching disabled to measure the worst-case database load profile. The pattern across all three: do not find out at 3 AM during a real Redis outage whether your mitigation works.Q: What is the difference between cache-aside, write-through, and write-behind, and why does cache-aside win for a URL shortener? A: Cache-aside means the application reads from cache, falls through to the database on miss, and populates the cache. Write-through means every write goes to the cache and the database atomically. Write-behind means writes go to the cache immediately and are asynchronously flushed to the database. For URL shortener, cache-aside wins because URLs are mostly immutable — you rarely invalidate, so the consistency gymnastics of write-through add cost without value. Write-behind is a non-starter because losing a write in the async flush means losing a URL that was reported to the user as created. Cache-aside is the simplest model that matches the read-heavy, write-rare workload.
Further Reading
  • Facebook Engineering — “Scaling Memcache at Facebook” (research.facebook.com) — the canonical paper on leases, gutter caches, and thundering herd mitigations at scale.
  • Redis documentation — “Redis Cluster Specification” (redis.io/docs) — authoritative on replication, failover, and partial-failure semantics.
  • Brad Fitzpatrick / LiveJournal — “Memcached tutorial” historical writeups on the cache-aside pattern that became industry standard.

Question

In your task scheduler design, you chose at-least-once delivery with idempotent tasks. Explain the three delivery semantics, why exactly-once is effectively impossible in a distributed system, and how you achieve the practical equivalent.Difficulty: Senior to Staff | Concepts tested: Distributed systems fundamentals, Two Generals Problem, idempotency, production trade-offs

Strong Answer

The way I think about this is through the lens of what happens when things go wrong — specifically, when the acknowledgment message is lost.
  • At-most-once: Fire and forget. Send the message, do not retry. If the message is lost, it is gone. Simple and fast, but lossy. Use case: logging, metrics collection, non-critical analytics events. You accept that some data is lost.
  • At-least-once: Send the message, wait for an acknowledgment. If you do not receive the ack within a timeout, retransmit. The problem: the receiver may have processed the message and sent the ack, but the ack was lost. You retransmit, and the receiver processes it again. You get duplicates. Use case: anything where losing a message is unacceptable (payment processing, task execution, order fulfillment). You accept potential duplicates and design for them.
  • Exactly-once: Each message is processed exactly one time. No duplicates, no losses. Sounds ideal, but here is why it is effectively impossible in an unreliable network.
Consider the fundamental problem: the sender sends a message, the receiver processes it, and the receiver sends an acknowledgment. If the ack is lost, the sender cannot distinguish between “receiver processed it and the ack was lost” and “receiver never received it.” This is a variant of the Two Generals Problem, which has no deterministic solution over an unreliable channel. No matter how many acknowledgments you add (ack-of-ack-of-ack), you always have the problem of the last message potentially being lost.
  • The practical equivalent: At-least-once delivery plus idempotent processing. You guarantee the message is delivered at least once (with retries), and you guarantee that processing it multiple times has the same effect as processing it once. The receiver maintains a deduplication table — when it processes a message, it records the message ID. On subsequent deliveries, it checks the ID and skips reprocessing. From the outside, this looks like exactly-once, which is why Kafka calls its consumer semantics “exactly-once” — it is really at-least-once with built-in dedup and transactional writes.
  • Real-world example: In the task scheduler, a worker picks up a “charge customer $50” task. It calls the payment API with an idempotency key. The payment succeeds. The worker tries to mark the task as COMPLETED, but the network times out. The zombie detector eventually re-enqueues the task. Another worker picks it up and calls the payment API with the same idempotency key. The payment system recognizes the key and returns the original result without charging again. From the customer’s perspective, they were charged exactly once.

Follow-up: How does Kafka achieve what it calls “exactly-once semantics” if true exactly-once is impossible?

Great question — Kafka’s “exactly-once” is carefully scoped and often misunderstood.
  • Kafka achieves exactly-once within the Kafka ecosystem by combining three mechanisms: idempotent producers (each message has a sequence number, the broker deduplicates), transactional writes (a producer can write to multiple partitions atomically), and transactional consumers (a consumer reads and writes offsets in the same transaction).
  • What this means concretely: a Kafka Streams application that reads from topic A, processes the message, and writes to topic B will do so exactly once — even if the consumer crashes and restarts. The offset commit and the output write happen in the same transaction.
  • What this does NOT mean: if your Kafka consumer calls an external system (sends an email, writes to a database, calls an API), Kafka has no way to make that external call idempotent. If the consumer crashes after calling the external system but before committing the offset, the message will be reprocessed and the external system will be called again. Exactly-once only applies to the Kafka-to-Kafka path.
  • This is why the task scheduler design combines at-least-once delivery with application-level idempotency keys — because the “task” typically involves calling external systems where Kafka’s transactional guarantees do not extend.

Follow-up: What are the performance implications of idempotency? It sounds like you need a lookup on every operation.

Yes, and this is a real cost that you should not hand-wave.
  • Every idempotent operation requires a read-before-write: check the dedup table, then execute if the key is new. This doubles the number of I/O operations on the critical path.
  • The dedup table grows over time: if you process 56 million tasks per day and store idempotency keys for 7 days, you have ~400 million entries in the dedup table. At ~50 bytes per entry, that is ~20 GB — easily fits in Redis or a dedicated DynamoDB table.
  • TTL is essential: idempotency keys should expire after a reasonable window (1 hour to 7 days depending on the use case). Without TTL, the dedup table grows unbounded. The TTL must be longer than the maximum retry window — if your retry policy allows retries for up to 1 hour, the idempotency key must live for at least 1 hour.
  • In practice, the overhead is minimal: the dedup check is a single key lookup in Redis or DynamoDB (~1ms). Compared to the task execution itself (seconds to minutes), 1ms of overhead is negligible. The correctness guarantee is worth far more than the performance cost.
  • Where it gets expensive: high-throughput, low-latency systems where every microsecond matters (trading systems, game servers). In those cases, you might accept at-most-once for non-critical operations and reserve idempotent at-least-once for operations that must not be duplicated (trades, inventory decrements).

Question

In several of the system designs above — chat messages partitioned by conversation_id, feeds partitioned by user_id, tasks partitioned by task type — there is a risk of hot partitions. Explain what a hot partition is, how you would detect one, and what strategies you would use to mitigate it.Difficulty: Senior | Concepts tested: Database internals, partition key design, monitoring, data modeling

Strong Answer

A hot partition is when one partition in a distributed database receives a disproportionate share of the traffic — reads, writes, or both — compared to other partitions. In a system like DynamoDB or Cassandra, each partition key maps to a physical partition on a specific set of nodes. If one partition key is orders of magnitude more popular than others, the nodes hosting it become a bottleneck while other nodes sit idle.
  • Concrete example: In the chat system, conversation_id is the partition key. Most conversations have a few hundred messages per day. But a group chat for a 500-person company channel might receive 50,000 messages per day — all hitting the same partition. That single partition is handling 100x the load of an average partition.
  • Another example: In DynamoDB, each partition supports ~3,000 Read Capacity Units and ~1,000 Write Capacity Units per second. If a celebrity’s user feed receives 10,000 reads/sec, a single partition cannot handle it, regardless of how much total capacity you have provisioned.
Detection strategies:
  • DynamoDB: Use CloudWatch Contributor Insights, which identifies the most-accessed partition keys. It shows the top N partition keys by read/write volume over a time window. If one key accounts for more than 10% of total traffic, that is a hot partition.
  • Cassandra: Monitor per-partition read/write latencies using nodetool cfstats or through your monitoring system (Prometheus with Cassandra exporter). Look for P99 latency spikes that are localized to specific nodes. The nodetool toppartitions command samples and reports the most accessed partitions.
  • General approach: Instrument your application to log partition key access patterns. Aggregate in your analytics pipeline and set alerts for skew beyond a threshold (e.g., any single key exceeding 5x the average).
Mitigation strategies:
  • Write sharding (adding a suffix): Append a random suffix to the partition key. Instead of conversation_id = "chat_ABC", use conversation_id = "chat_ABC#3" (where 3 is a random number from 0-9). This spreads writes across 10 partitions. The cost: reads must scatter-gather across all 10 sub-partitions and merge results. This is the standard DynamoDB pattern for handling hot keys.
  • Time-based sub-partitioning: For the chat system specifically, use (conversation_id, year_month) as a composite partition key. Current messages go to the current month’s partition; historical queries read from older partitions. This naturally distributes the growing dataset and keeps the “hot” partition (current month) manageable.
  • Caching the hot partition: Put a cache (Redis, DAX) in front of the hot partition. If a celebrity’s profile is accessed 10K times per second, cache it in Redis and let 99% of reads hit the cache. Only 1% (100 reads/sec) reach the database partition.
  • Application-level routing: If you know certain entities will always be hot (celebrity accounts, viral posts), route them to a dedicated, scaled-up data store separate from the main database. This is effectively application-level sharding by hotness.

Follow-up: In DynamoDB specifically, does adaptive capacity solve the hot partition problem, or do you still need to worry about it?

DynamoDB’s adaptive capacity helps but does not eliminate the problem.
  • Adaptive capacity allows DynamoDB to reallocate unused throughput from cold partitions to hot ones. If you have 10,000 RCUs provisioned across 10 partitions and one partition needs 5,000 RCUs while the others need 500 each, adaptive capacity can handle that redistribution.
  • However, there are hard limits. A single partition cannot exceed its per-partition throughput limits (~3,000 RCU, ~1,000 WCU), regardless of adaptive capacity. If your hot partition genuinely needs 10,000 RCUs, adaptive capacity cannot help — you need write sharding or a cache.
  • DynamoDB’s on-demand mode is more generous with burst handling (up to 2x your previous peak), but the per-partition limits still apply.
  • The practical implication: for most workloads, adaptive capacity means you do not need to worry about moderate hotspots. But for extreme hotspots (celebrity accounts, viral content, flash sales), you still need explicit mitigation strategies.

Follow-up: How would you design the partition key for a notification system where some users receive 100x more notifications than others?

The challenge here is asymmetric write volume — a power user or a highly active account receives far more notifications than a typical user.
  • Option 1: User ID as partition key with write sharding: Use user_id#shard as the partition key, where shard is computed as hash(notification_id) % N. For high-volume users, set N=10; for normal users, N=1 (no sharding). The application knows each user’s shard count from a user metadata table.
  • Option 2: Time-based partitioning: Use user_id#date as the partition key. Each day gets a new partition. This naturally bounds partition size and keeps recent notifications (the ones users actually read) on a small number of partitions.
  • Option 3: Separate hot and cold storage: Keep the last 24 hours of notifications in a Redis sorted set per user (fast reads, no partition limits). Older notifications live in DynamoDB with time-based partitioning. Most notification reads are for recent items, so the Redis layer handles 95%+ of reads.
  • In my experience, Option 2 is the best balance of simplicity and effectiveness. It works without requiring the application to track per-user shard counts, and it naturally gives you a retention/archival boundary (delete partitions older than 90 days).
Structured Answer Template — Hot Partition Questions
  1. Define hot partition precisely — not “unbalanced” but “one partition receives more than N% of traffic” (typical threshold: 5-10%).
  2. Name the specific database’s per-partition limits — DynamoDB 3000 RCU / 1000 WCU, Cassandra replica-bound writes, Kafka per-partition ordering.
  3. Describe detection mechanics before mitigation — “CloudWatch Contributor Insights for DynamoDB, nodetool toppartitions for Cassandra, Kafka consumer lag per partition.”
  4. Propose 2-3 mitigations with trade-offs — write sharding (reads get expensive), time-based sub-partitioning (natural archival), dedicated store (operational overhead).
  5. Close with the monitoring artifact — a per-partition dashboard that would have alerted you before users noticed.
Real-World Example — DynamoDB and the celebrity follower problem: AWS publicly describes DynamoDB’s per-partition throughput ceiling (3000 RCU, 1000 WCU) as the reason hot-key mitigation is a first-class design concern. Instagram’s public engineering talks describe using per-user composite sharding with a random suffix specifically to handle accounts like @cristiano with 600M+ followers — the read pattern on a single profile key would otherwise exceed any partition’s ceiling regardless of total provisioned capacity.
Big Word Alert — Write Sharding (with Suffix) Write sharding prepends or appends a random suffix to a hot key so writes spread across many physical partitions. Reads become scatter-gather across the suffixes. Use it naturally: “We write-sharded the celebrity feed key as feed#celebrity_id#N where N is 0-9, spreading one hot partition into ten. Reads become a 10-way scatter-gather, but writes hit any of 10 partitions uniformly.” Warning: Write sharding trades write hotness for read complexity. If your workload is read-heavy (social feed), this can make read latency worse. Only apply write sharding when the write hotspot is the actual bottleneck.
Big Word Alert — Adaptive Capacity Adaptive capacity is DynamoDB’s automatic reallocation of unused throughput from cold partitions to hot ones. Helps with moderate skew but does not exceed per-partition hard limits. Use it naturally: “Adaptive capacity solves moderate hotspots up to the per-partition ceiling; for celebrity-scale hotspots (10x over the per-partition limit), we still need explicit write sharding or a cache.” Warning: Do not rely on adaptive capacity as a mitigation strategy in interviews. Interviewers expect you to know its limits — specifically, that it does not exceed 3000 RCU / 1000 WCU per partition regardless of total provisioned capacity.
Follow-up Q&A Chain:Q: You mentioned detection with Contributor Insights. What specifically would you alert on, and at what threshold? A: Two thresholds matter. First, “top partition traffic share” — if any single partition key exceeds 5% of total traffic, that is an early warning. Second, “per-partition throttling events” — if any partition is being throttled (HTTP 429 from DynamoDB), that is a page-worthy event because users are already being impacted. The 5% threshold is conservative; I would start there and tune based on false-positive rate. Below 5%, you probably do not have a problem; above 10%, you almost certainly do.Q: What is the cost of write sharding, in dollars and engineering terms? A: Dollars: with N=10, reads are roughly 10x more expensive because each logical read becomes 10 physical reads. For a read-heavy workload at 100K reads/sec, that multiplies read capacity units by 10x. Engineering: the application layer now has to know how many shards each key uses, which means either a metadata lookup (extra read) or a convention (always shard by 10). Testing gets harder because race conditions between shards can produce transient inconsistencies. Debugging gets harder because any query against the logical entity is now a scatter-gather. I only use write sharding when the alternatives (caching, dedicated store) are more expensive for the specific hot-key pattern.Q: Your partition key is user_id. A specific celebrity is hot. Do you reshape the whole table, or just that user? A: Just that user. Reshaping the whole table is a multi-week migration that penalizes every other user for one celebrity’s traffic. The pattern that works is application-level routing: maintain a small “hot users” list (easy to keep in Redis), and for users on that list the application writes to user_id#shard_N while for normal users it writes to user_id. Reads check the list first. This is a day of work, not a month, and it localizes the complexity to just the celebrities. You pay the scatter-gather cost only for those users, not the entire user base.
Further Reading
  • AWS — “Best practices for designing and using partition keys effectively” (docs.aws.amazon.com/amazondynamodb) — the official DynamoDB guide to hot-partition patterns.
  • Cassandra documentation — “Data modeling” chapter (cassandra.apache.org/doc) — authoritative on partition key design and nodetool observability.
  • Instagram Engineering — “Sharding & IDs at Instagram” (instagram-engineering.com) — public writeup of how they solved the celebrity problem at their scale.

Question

You proposed using Redis Lua scripts to atomically check and increment rate limit counters. Walk me through how this works under extreme contention — thousands of requests per second for the same rate limit key. What are the failure modes, and when would you consider alternatives to Redis?Difficulty: Senior | Concepts tested: Redis internals, concurrency models, distributed rate limiting trade-offs, alternatives evaluation

Strong Answer

First, let me explain why Lua scripts work for this. Redis is single-threaded for command execution. A Lua script executes atomically — no other Redis command can interleave during the script. This means our rate limit check (read counter, increment, check against limit, set TTL if needed) happens as one indivisible operation. There are no race conditions at the Redis level.
  • Under extreme contention (say, 100K requests/sec all hitting the same rate limit key for a popular API endpoint), the Lua script still works correctly because of Redis’s single-threaded model. Each script execution takes ~0.05-0.1ms. At 100K ops/sec, the total time spent executing Lua scripts for this key is ~5-10 seconds per second of wall-clock time — which is impossible. In practice, Redis pipelines and batches these, and the throughput limit is Redis’s overall ops/sec capacity (~100K-200K ops/sec for simple commands on a single instance), not per-key contention.
  • The real bottleneck is network round-trips, not Lua execution. Each rate limit check requires one network round-trip from the API server to Redis (~0.5ms in the same AZ). At 100K requests/sec, that is 100K round-trips. If you have 10 API servers, each sends 10K requests/sec to Redis, which is well within Redis’s capacity.
  • When Redis becomes the bottleneck: If you need more than ~200K rate limit checks per second, a single Redis instance is insufficient. Solutions: Redis Cluster (shard by rate limit key) or multiple Redis instances with client-side routing. Since rate limit keys are typically user_id:endpoint, the key distribution is naturally uniform across shards.
Failure modes specific to Lua scripts:
  • Long-running Lua script blocks everything: Redis blocks all other commands while a Lua script runs. If your script has a bug (infinite loop) or does expensive operations (large ZRANGEBYSCORE), it blocks the entire Redis instance. Mitigation: keep Lua scripts minimal (ours is 6 lines), and set the lua-time-limit config (default 5 seconds) to kill runaway scripts.
  • Memory growth: If you use sliding window log (storing every timestamp in a sorted set), a hot key can accumulate millions of entries. The Lua script that cleans up old entries (ZREMRANGEBYSCORE) runs in O(log N + M) where M is the number of removed entries. For very hot keys, this cleanup can be expensive. This is another reason to prefer the sliding window counter algorithm — it uses only two counters per window, not a growing sorted set.
Alternatives to Redis for rate limiting:
  • Local in-memory rate limiting: Each API server maintains its own token bucket per user. No network round-trip, sub-microsecond latency. The trade-off: limits are enforced per-server, not globally. A user hitting 10 different servers gets 10x the effective limit. Use this as a secondary defense layer alongside Redis, or as a fallback when Redis is down.
  • Envoy/Istio rate limiting in the service mesh: The sidecar proxy handles rate limiting with its own shared state (typically backed by Redis). The advantage is that rate limiting is decoupled from application code.
  • Cell-based rate limiting (Cloudflare’s approach): Each edge node limits independently and periodically synchronizes counts to a coordination layer. Gives approximate global enforcement with zero cross-node latency for the check itself.

Follow-up: How would you implement a sliding window counter in Redis using Lua, and why does it beat the sorted set approach at scale?

The sliding window counter is a brilliant optimization. Here is how it works in Redis.
  • Maintain two keys per rate limit: counter:{key}:{current_window} and counter:{key}:{previous_window}. Each is a simple integer counter with a TTL equal to 2x the window size.
  • When a request arrives, the Lua script calculates the weighted count: weighted = previous_count * (1 - elapsed_fraction_of_current_window) + current_count. If weighted is under the limit, increment the current window counter and allow.
  • Total Redis memory per rate limit: 2 keys with integer values — roughly 100 bytes. Compare that to the sorted set approach where you store a timestamp per request: 100 requests/minute per user means 100 entries of ~32 bytes each = 3,200 bytes. At 10 million users, that is 32 GB for sorted sets vs 1 GB for sliding window counters. A 32x memory reduction.
  • The accuracy trade-off: the sliding window counter is an approximation. It assumes requests in the previous window were uniformly distributed, which may not be true if traffic is bursty within a window. In practice, the error is less than 1%, which is acceptable for all but the most precise rate limiting needs (financial transaction limits).

Follow-up: If Redis is unavailable and you fail-open, how do you prevent abuse during the outage window?

Failing open means accepting that for a brief period, rate limits are not enforced. Here is how you minimize the blast radius.
  • Local rate limiters as secondary defense: Each API server runs a local in-memory token bucket. Set the local limit to 10x the normal per-user limit (generous, to avoid false positives from traffic distribution unevenness). This catches obvious abuse (thousands of requests/sec from one user) even without Redis.
  • Monitor and respond: Alert immediately on Redis failure. The average Redis recovery (with Sentinel or Cluster auto-failover) is 15-30 seconds. During this window, the local rate limiter is your only defense.
  • Post-incident analysis: After Redis recovers, run an anomaly detection query on access logs during the outage window. Identify users who significantly exceeded their normal rate. Take retroactive action (warn, temporarily restrict) if abuse is detected.
  • Circuit breaker with delayed close: After Redis comes back, do not immediately trust it. Use a circuit breaker that transitions from open (fail-open) to half-open (test Redis with a canary key every second) to closed (normal operation) over 30-60 seconds. This prevents thrashing if Redis is flapping.

Question

You need to maintain 5 million persistent WebSocket connections for a chat system. Walk me through how you would architect the connection management layer — from how connections are assigned to servers, to how you route a message from User A on Server 1 to User B on Server 3.Difficulty: Staff-Level | Concepts tested: WebSocket scaling, connection routing, pub/sub systems, stateful service challenges

Strong Answer

Managing 5 million persistent connections is fundamentally different from handling 5 million stateless HTTP requests. WebSocket connections are stateful — the server holding a user’s connection must be the one that pushes messages to that user. This introduces a routing problem that does not exist in stateless architectures.Connection assignment and server capacity:
  • A typical server handles ~10K-50K concurrent WebSocket connections depending on the language runtime and how much processing happens per message. At 10K per server, you need 500 servers. At 50K per server (Erlang, Go, or optimized Node.js), you need 100 servers.
  • Use a Network Load Balancer (Layer 4), not an Application Load Balancer (Layer 7), for WebSocket connections. NLB passes through TCP connections directly, adding minimal overhead. ALB terminates and re-establishes connections, adding latency and memory overhead.
  • Connection assignment is typically random (NLB round-robin or least-connections). There is no need for sticky sessions at the LB level because the routing table (below) handles message delivery.
Connection routing table:
  • When a user connects, the Chat Server registers the mapping user_id -> server_id in a shared store (Redis hash map). When the user disconnects, the mapping is removed.
  • This routing table is the single most critical piece of infrastructure. Every message delivery starts with a lookup: “Which server holds User B’s connection?” The lookup must be sub-millisecond.
  • Redis is ideal here: HSET connections user:B "chat-server-42" gives O(1) lookups. With 5 million entries at ~50 bytes each, the table is ~250 MB — trivially fits in memory.
Message routing flow:
  1. User A sends a message via WebSocket to Chat Server 1.
  2. Chat Server 1 persists the message to the Message Store (Cassandra).
  3. Chat Server 1 looks up User B’s server: HGET connections user:B returns chat-server-3.
  4. Chat Server 1 publishes the message to a Redis Pub/Sub channel that Chat Server 3 subscribes to. Alternatively, Chat Server 1 makes a direct gRPC call to Chat Server 3.
  5. Chat Server 3 receives the message and pushes it to User B via the WebSocket connection.
Redis Pub/Sub vs direct server-to-server communication:
  • Redis Pub/Sub: Each Chat Server subscribes to a channel named after itself (e.g., channel:chat-server-3). Senders publish to the target server’s channel. Pros: decoupled, simple. Cons: Redis Pub/Sub is fire-and-forget (no persistence), so if a server misses a message (brief disconnection from Redis), the message is lost. Also, every Chat Server subscribes to its own channel, so Redis maintains 500 subscriptions — lightweight.
  • Direct gRPC: Chat Servers call each other directly via gRPC using the routing table. Pros: point-to-point, no broker, lower latency. Cons: creates O(N^2) potential connections between servers, and requires service discovery.
  • My recommendation: Redis Pub/Sub for simplicity. The fire-and-forget limitation is mitigated because messages are already persisted to Cassandra before routing. If a message is lost in Pub/Sub, the recipient client’s reconnection logic catches up from the database.

Follow-up: What happens when a Chat Server crashes and 10K users need to reconnect?

This is the thundering herd problem applied to WebSocket reconnection.
  • 10K clients detect the disconnection (TCP reset or heartbeat timeout) and trigger their reconnection logic. If they all reconnect immediately, the load balancer distributes 10K new connection requests across remaining servers in a burst.
  • Client-side mitigation: exponential backoff with jitter. Each client waits random(0, min(2^attempt * 100ms, 30s)) before reconnecting. This spreads the 10K reconnections over 10-30 seconds instead of hitting all at once.
  • Server-side: When a client reconnects (to a different server now), it sends its last received message ID. The new server fetches all messages since that ID from Cassandra and pushes them. This “catch-up” ensures zero message loss.
  • Routing table cleanup: The dead server’s entries in the routing table become stale. The stale entries will cause message delivery to fail (publishing to a Pub/Sub channel nobody subscribes to). Two cleanup strategies: (1) set a TTL on routing entries and refresh with heartbeats, or (2) have each Chat Server periodically validate its own entries on startup and remove stale ones from the previous instance.
  • Capacity headroom: This is why you should not run Chat Servers at 100% connection capacity. If each server handles 10K connections and you have 500 servers, you are at 100% capacity — a single server failure means 10K users cannot reconnect (all other servers are full). Run at 80% capacity (servers handle 8K each), leaving 20% headroom for redistributing connections during failures.

Follow-up: How would you handle the routing table during a Redis outage?

Without the routing table, you cannot route messages to the correct server. This is a critical dependency.
  • Short-term (seconds): Chat Servers cache a local copy of recent routing lookups. If Server 1 recently looked up that User B is on Server 3, it uses the cached result. This works for ongoing conversations but fails for new conversations.
  • Fallback broadcast: If the routing table is unavailable, fall back to broadcasting the message to ALL Chat Servers. Each server checks if it holds the recipient’s connection and delivers if yes, ignores if no. This is O(N) where N is the number of servers, which is expensive but functional. At 500 servers, you are sending 500 messages instead of 1 for each chat message — a 500x overhead. Acceptable for a brief outage, catastrophic if prolonged.
  • Redis HA: This is why the routing table must be in a highly available Redis setup (Sentinel with automatic failover, or Redis Cluster with replicas). Target recovery time: under 15 seconds. The broadcast fallback covers the gap.
  • Alternative: Use a consistent hashing ring to deterministically assign users to servers. No routing table needed — you compute server = hash(user_id) % N. The problem: rebalancing when servers are added or removed causes mass disconnection and reconnection. Consistent hashing minimizes this (only ~1/N users are affected), but it is still disruptive. Most production systems prefer the routing table approach for its flexibility.

Question

Your task scheduler promises to execute delayed tasks within 1 second of their scheduled time. How do you achieve this precision at scale — 10 million tasks per day — without resorting to busy-polling the database every 100 milliseconds?Difficulty: Senior | Concepts tested: Scheduling algorithms, timer wheels, database polling trade-offs, distributed coordination

Strong Answer

The core tension is between precision and efficiency. Polling the database every 100ms gives you sub-second precision but hammers the database with queries. Polling every 10 seconds is efficient but you miss your 1-second SLA. The solution is a two-tier architecture: a database-backed durable store for long-horizon scheduling, and an in-memory timer for short-horizon precision.Tier 1 — Database scan (long horizon):
  • A scheduler process polls the database every 5-10 seconds: SELECT * FROM tasks WHERE next_run_at <= NOW() + interval '30 seconds' AND status = 'PENDING'.
  • This query uses a partial index (idx_tasks_pending ON tasks (next_run_at) WHERE status IN ('PENDING', 'RETRYING')) so it is fast even with millions of rows.
  • The scheduler loads these “soon-to-execute” tasks into memory and moves them to Tier 2.
Tier 2 — In-memory timer (short horizon):
  • Tasks within the next 30 seconds are held in a timer wheel (or a priority queue sorted by execution time) in the scheduler’s memory.
  • A timer thread checks the wheel every 100ms. When a task’s execution time arrives, it enqueues the task to the work queue (Redis sorted set or SQS).
  • The precision is now bounded by the timer resolution (100ms), well within the 1-second SLA.
Why a timer wheel instead of sleep():
  • A timer wheel (also called a timing wheel, Hashed and Hierarchical Timing Wheels from the Varghese and Lauck paper) is a ring buffer of time slots. Inserting a timer is O(1), firing expired timers is O(1) per timer. This beats a sorted list (O(log N) insert) when you have thousands of timers expiring per second.
  • In practice, for 10M tasks/day, you have ~650 tasks coming due per second at peak. A simple min-heap priority queue is sufficient at this scale. Timer wheels become important at 100K+ timers per second.
Durability during scheduler restart:
  • If the scheduler crashes, the in-memory timers are lost. On restart, the scheduler re-scans the database for tasks where next_run_at <= NOW() + 30 seconds. Tasks that were in-memory but not yet enqueued are re-discovered and processed, possibly a few seconds late (the restart time).
  • This is acceptable: a scheduler restart that takes 5-10 seconds means a few tasks execute 5-10 seconds late during the restart window. The database is the source of truth; the in-memory tier is an optimization for precision.

Follow-up: How do you handle the case where two scheduler instances both load the same task from the database?

This is the double-scheduling problem, and it is surprisingly subtle.
  • If you run multiple scheduler instances for HA (active-passive with leader election), only the leader scans the database and loads tasks. The passive instance does nothing until it becomes leader.
  • If you run active-active schedulers (for throughput, not just HA), you need to partition the work. Options: (1) each scheduler owns a range of task IDs (hash-based partitioning), (2) each scheduler uses SELECT FOR UPDATE SKIP LOCKED to atomically claim tasks without overlapping.
  • The SKIP LOCKED approach is elegant: when Scheduler A selects tasks, it locks them. Scheduler B’s concurrent query skips the locked rows and picks up different tasks. No coordination protocol needed — PostgreSQL handles it.
  • Even with SKIP LOCKED, you need the idempotency guarantee at the worker level. There is a window where the scheduler loads a task, marks it as SCHEDULED, but crashes before enqueuing it. The task stays in SCHEDULED state. A watchdog process (or the standby scheduler) picks up SCHEDULED tasks that have been in that state for longer than expected (say, 30 seconds) and re-enqueues them.

Follow-up: This design keeps 30 seconds of tasks in memory. What if the scheduler needs to handle a burst of 100K tasks all scheduled for the same second?

This is a thundering herd at the scheduling level.
  • If 100K tasks are all scheduled for exactly 12:00:00, the scheduler’s Tier 1 scan at 11:59:30 loads all 100K into memory. At 12:00:00, the timer fires for all of them.
  • The scheduler cannot enqueue 100K tasks to the work queue in zero time. If each enqueue takes 0.05ms (Redis ZADD), that is 5 seconds — some tasks execute 5 seconds late.
  • Mitigation 1: Pre-enqueue. When the scheduler loads tasks into memory, immediately enqueue them to the work queue with a “visibility delay” or “delayed execution” setting. The queue holds them until the execution time. This shifts the burst from the timer-fire moment to the database-scan moment, which is less time-sensitive.
  • Mitigation 2: Shard the scheduler. If you have 4 scheduler instances, each handling ~25K of the 100K tasks, the per-instance burst is manageable.
  • Mitigation 3: Accept imprecision during bursts. If 100K tasks are scheduled for the same second, delivering them over a 5-second window (12:00:00 to 12:00:05) is a reasonable degradation. The 1-second SLA is for normal conditions; burst conditions have a documented relaxed SLA.
  • In my experience, the pre-enqueue approach (Mitigation 1) is the most practical because it leverages the queue’s existing scalability rather than requiring you to scale the scheduler itself.

Question

Both the notification system and the chat system need to track whether a message was delivered. But you would design the storage very differently. Explain why.Difficulty: Intermediate to Senior | Concepts tested: Data modeling, access patterns, consistency requirements, storage trade-offs

Strong Answer

The way I think about this is through the lens of access patterns and consistency requirements. These two seemingly similar “was it delivered?” questions lead to radically different storage designs.Chat message storage:
  • Access pattern: Sequential reads within a conversation. “Give me the last 50 messages in this conversation” is the dominant query. Users scroll backward through history.
  • Write pattern: Append-only. New messages are added at the end of a conversation’s timeline. Edits and deletes are rare (and modeled as new events, not mutations).
  • Consistency requirement: Per-conversation causal ordering. If I reply to your message, my reply must appear after yours. But ordering across conversations does not matter.
  • Partition strategy: Partition by conversation_id with a clustering key on message timestamp or sequence number. All messages in a conversation are co-located for efficient range reads.
  • Database choice: Cassandra or ScyllaDB — optimized for append-heavy, time-series-like workloads with range queries within a partition.
  • Delivery state: Simple. Each message has a status: sent (persisted on server), delivered (recipient’s device acknowledged receipt), read (recipient opened the conversation). This state is updated by the recipient’s client and is non-critical — if a delivery receipt is lost, the user sees a single check mark instead of a double check mark. Not ideal but not a correctness issue.
Notification delivery state:
  • Access pattern: Random reads by notification ID or user ID. “What is the status of notification X?” or “Show me all notifications for user Y sorted by recency.” Also analytical queries: “What is the delivery rate for notification type Z across all users this week?”
  • Write pattern: Status transitions through a state machine: queued -> sent -> delivered -> opened -> clicked. Each transition is an update, not an append. Multiple channels may each have their own delivery state (push: delivered, email: opened, SMS: sent).
  • Consistency requirement: Strong consistency for deduplication (you must not send the same notification twice) and for delivery tracking (analytics dashboards must reflect actual delivery counts). Eventual consistency for the user-facing in-app notification list is acceptable.
  • Partition strategy: Partition by user_id for the user-facing queries, but you also need indexes by notification_type and timestamp for analytics. This is a multi-access-pattern problem.
  • Database choice: DynamoDB with a GSI (Global Secondary Index) on notification_type + timestamp for analytics queries. Or a dual-storage approach: DynamoDB for real-time delivery state and user-facing queries, S3 + Athena for analytics.
  • Delivery state: Complex. You track state per channel per notification. A single notification might be: push=delivered, email=sent, SMS=failed. Each state transition triggers different logic (retry SMS, increment delivery counter, fire a webhook to the calling service).
The fundamental differences:
  1. Chat messages are immutable after creation; notification state is mutable (transitions through a state machine).
  2. Chat queries are conversation-scoped (narrow range scan); notification queries span multiple dimensions (by user, by type, by status, by time).
  3. Chat delivery is a nice-to-have UX signal; notification delivery is a business metric and sometimes a legal requirement (proof of SMS delivery for regulatory compliance).
  4. Chat scales by conversation count (millions of small partitions); notification scales by per-user volume (some users get 100x more notifications than others, creating hot partitions).

Follow-up: How would you handle the scenario where you need to show both chat messages and notifications in a unified inbox (like LinkedIn)?

This is a data federation problem — merging two different data models into one view.
  • Do NOT try to store both in the same table. Their access patterns and data models are too different. Merging them compromises the performance of both.
  • Build a unified feed service that queries both the chat message store and the notification store in parallel, merges the results by timestamp, and returns a combined view.
  • Caching the unified view: Pre-compute the unified inbox for active users in Redis (sorted set with timestamp as score, containing references to either chat messages or notifications). Update the sorted set on new chat messages AND new notifications.
  • The tricky part is ordering: Chat messages have per-conversation sequence numbers. Notifications have creation timestamps. To merge them, normalize both to UTC timestamps. For chat messages, the timestamp is when the message was sent. For notifications, it is when the notification was created.
  • LinkedIn and Facebook both take this approach — separate backend systems for messages and notifications, unified in a single “inbox” view for the user. The unification happens at the presentation layer, not the storage layer.

Follow-up: For the notification system, how do you track “opened” and “clicked” states across email, push, and in-app without over-engineering it?

Each channel has a fundamentally different mechanism for tracking engagement.
  • Email: Embed a 1x1 tracking pixel in the email body. When the user’s email client loads the image, your server logs the “opened” event. For “clicked,” use redirect URLs that pass through your tracking service before forwarding to the destination. Caveat: many email clients now block tracking pixels (Apple Mail Privacy Protection), so email open rates are increasingly unreliable. Track clicks as the more reliable engagement signal.
  • Push: APNs and FCM provide delivery receipts (the notification reached the device). For “opened” (user tapped the notification), the mobile app sends a callback to your API when the notification is opened. This requires client-side instrumentation.
  • In-app: Track “seen” (the notification appeared in the user’s notification list) and “clicked” (the user tapped it). Both are client-side events sent to your API. These are the most reliable engagement signals because you control the client.
  • Storage: Write all engagement events to a stream (Kinesis, Kafka). A consumer updates the notification record’s status and publishes the event to an analytics pipeline. The real-time delivery state (in DynamoDB) reflects the latest status per channel. The analytics pipeline (S3 + Athena) provides aggregate views.
  • Do not block dispatch on tracking setup. The notification should send regardless of whether tracking infrastructure is healthy. If the tracking pixel URL is down, the email still sends — you just lose the open tracking for that email.

Question

Candidates often draw a “load balancer” box at the top of their system design without explaining what it actually does. Walk me through the differences between a load balancer, an API gateway, and a reverse proxy. When would you use one versus the other, and when do they overlap?Difficulty: Intermediate | Concepts tested: Networking fundamentals, infrastructure architecture, separation of concerns

Strong Answer

These three concepts have significant overlap, which is why they cause confusion. In practice, a single product (like NGINX or AWS ALB) can serve as all three simultaneously. But the responsibilities are distinct.
  • Reverse Proxy: A server that sits in front of your backend servers and forwards client requests to them. The client talks to the proxy, not directly to the backend. The core job is indirection — hiding the backend’s identity, IP addresses, and topology from the client. NGINX, HAProxy, and Envoy are all reverse proxies. A reverse proxy can do SSL termination, caching, compression, and request buffering. It is the broadest concept of the three.
  • Load Balancer: A reverse proxy with a specific additional responsibility — distributing traffic across multiple backend servers. It implements a distribution algorithm (round-robin, least-connections, weighted, IP-hash, consistent hashing). The core job is spreading load evenly so no single server is overwhelmed. Load balancers operate at Layer 4 (TCP level, like AWS NLB) or Layer 7 (HTTP level, like AWS ALB). Layer 4 is faster (just forwards TCP packets) but cannot make decisions based on HTTP content. Layer 7 can route based on URL path, headers, cookies.
  • API Gateway: A load balancer with application-layer intelligence. Beyond distributing traffic, it handles concerns like authentication, rate limiting, request transformation, API versioning, circuit breaking, and API key management. The core job is managing the API lifecycle. Kong, AWS API Gateway, and Apigee are API gateways. Think of it as a load balancer plus a middleware pipeline.
When to use each:
  • Just a reverse proxy: You have one backend server and want SSL termination, caching, and protection from direct internet exposure. Small deployments, simple architectures.
  • Load balancer: You have multiple backend servers and need to distribute traffic. Most system designs need this at minimum. Use L4 (NLB) for raw TCP throughput (WebSocket, gRPC) and L7 (ALB) when you need path-based routing.
  • API Gateway: You have multiple microservices, external API consumers, and need centralized auth, rate limiting, and API management. The gateway is the single entry point. Most medium-to-large systems need this. In a system design interview, putting “API Gateway” at the top of your diagram signals that you understand these cross-cutting concerns are best handled in one place, not scattered across services.
In practice, they layer: Internet -> CDN -> API Gateway (authentication, rate limiting) -> Load Balancer (traffic distribution) -> Backend Servers. But many products collapse multiple layers — AWS ALB acts as both a load balancer and a basic reverse proxy. Kong acts as both an API gateway and a load balancer.

Follow-up: In the chat system design, you used an NLB instead of an ALB for WebSocket connections. Why?

The choice comes down to how each handles persistent connections.
  • ALB (Layer 7) terminates the TCP connection from the client and creates a new TCP connection to the backend. For WebSocket, this means the ALB maintains TWO connections per user (client-to-ALB and ALB-to-backend). At 5 million concurrent users, the ALB maintains 10 million TCP connections. ALB also inspects HTTP headers on each frame, adding latency.
  • NLB (Layer 4) passes TCP packets through without terminating the connection. The client has a direct TCP connection to the backend (through the NLB). At 5 million concurrent users, the NLB is forwarding packets for 5 million connections — it does not maintain state for each one (or minimal state). This is dramatically more efficient.
  • NLB also supports millions of concurrent connections by design, while ALB has lower connection limits and higher per-connection memory overhead.
  • The trade-off: NLB cannot do path-based routing, header inspection, or SSL termination (well, it can do TLS passthrough, and newer NLBs support TLS termination). If you need to route /api/* to one service and /ws/* to another, you need ALB. A common pattern is: ALB for the REST API endpoints, NLB for the WebSocket endpoints.

Follow-up: Where does the rate limiter sit in this stack? Before the load balancer, inside the API gateway, or after?

The placement depends on what you are rate limiting.
  • At the CDN/edge level: Rate limiting by IP address to prevent DDoS. This is the first line of defense. AWS WAF rate-based rules or Cloudflare’s edge rate limiting do this before the request even reaches your infrastructure. Cost: very low latency, but you can only limit by IP or simple headers.
  • At the API Gateway: Rate limiting by API key, user ID, or custom dimensions. This is where most per-user rate limiting happens. The gateway authenticates the request (extracts user identity), then checks the rate limit. AWS API Gateway has built-in throttling. Custom gateways (Kong) support plugin-based rate limiting.
  • In the application: Rate limiting by business logic (e.g., max 3 password attempts per minute, max 10 friend requests per hour). This is too specific for the gateway and must live in the application code.
  • Best practice is layered rate limiting: IP-level at the edge (coarse, prevents DDoS), user-level at the gateway (medium, enforces API quotas), business-logic-level in the application (fine, enforces product rules). Each layer catches different types of abuse, and the combination is more robust than any single layer.

Question

Users are reporting that messages in their conversations sometimes appear out of order — they see a reply before the original message. You are the on-call engineer. Walk me through how you would diagnose this and what the likely root causes are.Difficulty: Senior | Concepts tested: Debugging distributed systems, message ordering, clock skew, causal consistency, production investigation

Strong Answer

Out-of-order messages in a chat system are one of the trickiest distributed systems bugs because the root cause can be at any layer — client, network, server, or database. Here is how I would systematically investigate.Step 1 — Reproduce and characterize the problem:
  • Is it happening for all users or specific users? Specific conversations? Specific server regions? Check the support tickets for patterns.
  • Is it 1-to-1 chats, group chats, or both? Group chats have a fan-out step that 1-to-1 chats do not — more opportunities for reordering.
  • Is the reordering consistent (Message B always appears before Message A) or transient (refresh fixes it)? Consistent reordering points to a storage or ordering problem. Transient reordering points to a delivery race condition.
Step 2 — Check the most common root causes:
  • Clock skew between Chat Servers: If you are using timestamps for ordering and two Chat Servers have clocks that differ by even 100ms, messages processed by different servers can be ordered incorrectly. Server 1’s clock is 200ms ahead of Server 2’s clock. User A sends at 12:00:00.000 (processed by Server 1, timestamp 12:00:00.200). User B replies at 12:00:00.100 (processed by Server 2, timestamp 12:00:00.100). The reply appears to have been sent before the original message.
    • Diagnosis: Compare server clocks. Check NTP synchronization status. If clock offsets exceed 10ms, this is likely the cause.
    • Fix: Use server-generated sequence numbers per conversation (monotonically increasing), not wall-clock timestamps. A single Chat Server (the “owner” for that conversation) assigns sequence numbers, guaranteeing causal order.
  • Race condition in the delivery path: User A and User B send messages nearly simultaneously. Message A is persisted to Cassandra but the Pub/Sub delivery to User B is slow. Message B is persisted and delivered to User A quickly. User A sees Message B before their own Message A is confirmed delivered. This is a perceived reordering on the sender side.
    • Diagnosis: Check message persistence timestamps vs delivery timestamps. If persistence order is correct but delivery order differs, this is a delivery race condition.
    • Fix: The client should render messages in sequence-number order, not delivery order. When Message B arrives with sequence number 5 and Message A (sequence 4) has not arrived yet, the client either waits briefly (50-100ms) for Message A or renders Message B and reorders when Message A arrives.
  • Cassandra read-repair or consistency level issue: If you are reading with consistency level ONE from a replica that has not yet received the most recent write (which was written at consistency level QUORUM), you might read stale data. The client fetches messages, gets an incomplete list, then on the next fetch gets the full list. Messages appear to “insert” into the middle of the conversation.
    • Diagnosis: Check your Cassandra read/write consistency levels. If reads are at ONE and writes at QUORUM, there is a window where replicas disagree.
    • Fix: Read at QUORUM as well (guarantees you read your own writes), or use LOCAL_QUORUM for both reads and writes in multi-datacenter setups.
Step 3 — Instrument and verify:
  • Add a distributed trace (X-Ray, Jaeger) to the message pipeline. Each message gets a trace ID that follows it from client send to server persist to Pub/Sub to recipient delivery. The trace shows exactly where time was spent and in what order events occurred.
  • Add a client-side event: when the client renders a message, log the {message_id, sequence_number, display_timestamp, delivery_timestamp}. This lets you compare what the server sent vs what the client displayed.

Follow-up: You mentioned using per-conversation sequence numbers. How do you generate those in a distributed system without a single point of failure?

This is the core of the problem. You need a monotonically increasing counter per conversation, and it must be fast and reliable.
  • Option 1 — Single writer per conversation: Assign each conversation to a single Chat Server (the “home” server for that conversation, determined by hashing the conversation_id). That server maintains an in-memory counter for the conversation and increments it for each message. Fast (no network hop for sequence generation) and correct (single writer = no conflicts). The downside: if the home server crashes, a new server takes over and must read the latest sequence number from the database before it can assign new ones.
  • Option 2 — Database-generated sequences: Use the database to atomically assign sequence numbers. In Cassandra, this is not natively supported (no auto-increment). In PostgreSQL, use a SEQUENCE or a SELECT ... FOR UPDATE on a counter row. This is slower (database round-trip per message) but eliminates the single-writer bottleneck.
  • Option 3 — Snowflake-like IDs: Generate globally unique, time-ordered IDs (timestamp + machine ID + sequence). These are not strictly sequential (they have gaps) but they are monotonically increasing, which is sufficient for ordering. Twitter’s Snowflake, Discord’s Snowflakes, and ULIDs all follow this pattern. The trade-off: ordering is timestamp-based, so clock skew still affects ordering across servers. But within a single server (likely for a single conversation), the ordering is guaranteed by the local sequence counter.
  • My recommendation for a chat system: Option 1 (single writer per conversation) for message ordering. It is the simplest correct solution. Use consistent hashing to assign conversations to servers, with a fallback to database-read on server failover.

Follow-up: How would you test for message ordering issues before they hit production?

This is a testing methodology question that separates experienced engineers from theoretical ones.
  • Chaos testing: Inject artificial clock skew (50-200ms) between Chat Servers in a staging environment. Send rapid-fire messages in test conversations and verify ordering. If your ordering is timestamp-based, you will immediately see reordering. If it is sequence-based, it should be immune.
  • Concurrent message injection: Write a load test that has N clients in the same conversation all sending messages as fast as possible. Verify that all clients see the exact same message order. This catches race conditions in the delivery path.
  • Network delay injection: Use a tool like tc (traffic control) or Toxiproxy to add random delays (10-500ms) to the Pub/Sub layer. Verify that the client-side rendering logic handles out-of-order delivery correctly (waits for gaps, reorders on display).
  • Partition testing: Simulate a Redis Pub/Sub outage. Messages should be persisted to Cassandra even though delivery is delayed. When Pub/Sub recovers, verify that clients catch up in the correct order by reading from the database.
  • The key principle: ordering bugs only manifest under concurrency and latency variance. Single-client, low-latency testing will never find them. Your test environment must be adversarial.

Question

Many system design candidates say they want “four nines” availability without understanding the implications. Walk me through what 99.99% availability means in real terms, and how the architecture of a URL shortener changes when you go from 99.9% to 99.99%.Difficulty: Senior to Staff | Concepts tested: Availability math, redundancy, failure domains, operational maturity, SLA vs SLO

Strong Answer

Let me start with what the numbers actually mean in terms of allowed downtime.
  • 99.9% (three nines): 8.76 hours of downtime per year, or ~43 minutes per month. This allows for a bad deployment that takes 30 minutes to roll back, plus one brief infrastructure incident per month.
  • 99.99% (four nines): 52.6 minutes of downtime per year, or ~4.4 minutes per month. You essentially cannot have a single incident that takes more than 5 minutes to detect and mitigate. Manual intervention is too slow for most incidents.
  • 99.999% (five nines): 5.26 minutes per year. This requires near-zero human involvement in failure recovery. Only the most critical systems (DNS root servers, core financial infrastructure) target this.
Going from three nines to four nines is not a 0.09% improvement — it is a 10x reduction in allowed downtime. That 10x has cascading architectural implications.How the URL shortener architecture changes:At 99.9% (the design we built):
  • Single-region deployment. Database primary with one replica. Manual or semi-automated failover. Redis with Sentinel. Single points of failure are acceptable if recovery is fast.
  • An incident playbook where on-call can SSH in, diagnose, and fix within 30 minutes.
At 99.99%, everything changes:
  • Multi-region active-active: A single-region outage (AWS availability zone failure, even a full region failure) cannot exceed 4 minutes. You need at least two active regions, each capable of handling full traffic. DNS-based failover (Route 53 health checks) detects a region failure and routes traffic to the healthy region within seconds.
  • Zero-downtime deployments: A bad deploy cannot cause more than 4 minutes of downtime. This means canary deployments (deploy to 1% of traffic, monitor for 5 minutes, then gradually increase), instant rollback (keep the previous version running and switch traffic back in seconds), and blue-green deployments.
  • Automated failover everywhere: Database failover must be automated (Patroni, RDS Multi-AZ) with detection-to-recovery under 30 seconds. Redis failover must be automated (Sentinel, Cluster). No component can require a human to restart it.
  • Health checking at every layer: Load balancers health-check the API servers every 5 seconds. API servers health-check Redis and the database. If any dependency is unhealthy, the server removes itself from the load balancer. Unhealthy servers are replaced automatically (ECS, Kubernetes).
  • Blast radius isolation: Changes should only affect a fraction of users. Deploy to one AZ first. Use feature flags to gate new code paths. If something breaks, it breaks for 10% of users, not 100%.
  • Observability budget: You need to detect issues within 60 seconds (leaves 3 minutes for mitigation). This means real-time dashboards, sub-minute alerting, and synthetic monitoring (a canary that continuously shortens and resolves URLs, alerting if latency exceeds thresholds or resolution fails).
The cost of four nines is roughly 3-5x the infrastructure cost of three nines (multi-region, more replicas, more redundancy) plus significantly higher operational investment (automated testing, canary pipelines, runbooks).

Follow-up: How do you measure availability in practice? Is it just “server uptime”?

Great question — this is where many teams get it wrong.
  • Naive approach: Measure uptime as “the server process is running.” By this measure, a server that returns HTTP 500 on every request is “available.” This is meaningless.
  • Correct approach: Measure availability from the user’s perspective. A request is “successful” if it returns a correct response within the latency SLA. Availability = (successful requests) / (total requests) over a time window.
  • For the URL shortener: A redirect request is successful if it returns a 302 with the correct Location header within 100ms. Anything else (500 error, timeout, wrong redirect, latency > 100ms) counts as a failure.
  • SLI vs SLO vs SLA: The SLI (Service Level Indicator) is the metric you measure (redirect success rate). The SLO (Service Level Objective) is your internal target (99.99%). The SLA (Service Level Agreement) is the contractual promise to customers (usually lower than SLO — promise 99.9%, target 99.99%, so you have an error budget).
  • Error budgets: If your SLO is 99.99%, you have a 0.01% error budget per month — about 4 minutes of failures. When the error budget is consumed, the team should freeze feature deployments and focus on reliability. This is the Google SRE approach, and it creates a concrete, data-driven tension between shipping features and maintaining reliability.

Going Deeper: How do you handle the case where one region is “available” but returning degraded results — say, serving stale cached data because the database is lagging?

This is the “partially available” problem, and it is the hardest part of availability measurement.
  • Define what “correct” means: For the URL shortener, a redirect to the correct URL is correct even if it is served from a stale cache. A redirect to the WRONG URL (because of replication lag in a multi-region setup where the URL was recently updated) is incorrect and should count as a failure, even though the server returned a 302.
  • Implement correctness checks: Periodically resolve a set of known test URLs and verify the destination. If a region is returning stale or incorrect results, its health check should fail, and traffic should be routed away.
  • Replication lag monitoring: In a multi-region setup with asynchronous replication, the secondary region might be seconds behind the primary. A URL created in the primary region might not be resolvable in the secondary region for a few seconds. For a URL shortener, this is rarely a problem (users do not use a short URL within seconds of creating it). For systems where this matters, read-your-writes consistency is achieved by routing the creator’s subsequent requests to the primary region.
  • The philosophical point: Availability is not binary. A system can be “up” but “degraded.” Mature organizations track multiple SLIs — latency, error rate, correctness, freshness — and define availability as the intersection of all of them meeting their thresholds. This is harder to measure but gives a much more honest picture of service health.
Structured Answer Template — Availability / SLO Questions
  1. Translate the nines into downtime — 99.99% is about 4 minutes/month, 53 minutes/year. Do the math out loud; it anchors the conversation.
  2. Separate SLI, SLO, and SLA explicitly — candidates who conflate these are a red flag; the interviewer tests this on purpose.
  3. Name the error budget — a 99.99% SLO means you get 4 minutes/month of failures before you freeze feature launches; this is the policy, not just a target.
  4. Describe the redundancy tier that matches — single-region single-replica is 99.9% at best; 99.99% requires multi-AZ with automated failover; 99.999% requires multi-region active-active.
  5. Close with what you give up — latency, cost, and consistency are all paid to buy more nines. Quantify one trade-off.
Real-World Example — Google’s error-budget policy: Google SRE (documented in the public SRE books at sre.google) popularized the error-budget concept as the boundary between product and reliability. When a service burns through its error budget, feature launches freeze until the budget is restored. This turns availability from an abstract goal into a concrete business constraint — it is why Google can ship new features and run 99.99%+ services simultaneously, because the two are in explicit negotiation rather than silent tension.
Big Word Alert — Error Budget Error budget is the amount of failure (downtime, error rate, latency violations) a service is allowed over a measurement window before the team must stop feature work and invest in reliability. It is calculated as 1 - SLO. Use it naturally: “Our 99.9% SLO gives us 43 minutes/month of error budget; after a region failover consumed 30 minutes of it, we froze the next feature launch and spent the sprint on the failover runbook.” Warning: The error budget is only useful if leadership honors the freeze. Teams that burn their budget and keep shipping features anyway are running without an SLO, regardless of what the dashboard says.
Big Word Alert — Multi-Region Active-Active Multi-region active-active means all regions serve live production traffic simultaneously, with data replication and failover between them. Contrasts with active-passive (one primary, others standby) or active-standby. Use it naturally: “For 99.99% availability we ran multi-region active-active with DNS-based traffic steering and cross-region replication — the cost was roughly 2x infrastructure and a significant consistency-model upgrade.” Warning: Active-active is not free. You need conflict-resolution strategies for writes (CRDTs, last-write-wins, or region-affinity), and you have to design for split-brain during partitions. Do not claim active-active in an interview without being ready to discuss these.
Follow-up Q&A Chain:Q: How do you set an SLO in the first place? Is it just “99.9% because that sounds good”? A: The principled approach is to measure what your users actually need. If your users cannot detect the difference between 99.9% and 99.95%, setting 99.95% is wasted engineering effort. The practical approach: start by measuring your current availability honestly for 30 days, then set the SLO slightly above that level — just enough to force improvement, not so high that you are perpetually in error-budget-burn mode. SLOs that are never met are worse than no SLOs because teams learn to ignore them. The calibration loop is: every quarter, compare actual to SLO, and if you are always hitting 99.99% easily, raise the bar; if you are always missing 99.9%, lower it and invest in reliability.Q: Your p99 latency meets the SLO but users are complaining. How is that possible? A: P99 means 1 in 100 requests is slow, but if a user makes 50 requests per session, the probability they hit at least one slow request is around 40%. This is the “tail-latency amplification” problem — a metric that looks good per-request is a bad experience per-user. The fix is to measure “session-level SLO” — what percentage of user sessions experience zero slow requests. This is a much harder target and often reveals that p99 is not a tight enough bound; you may need p99.9 or p99.99 depending on the session pattern.Q: What is the single most common mistake teams make when committing to an SLA? A: Not leaving margin between internal SLO and external SLA. If you promise customers 99.9% and your internal target is also 99.9%, the first minor incident makes you breach contract. The pattern that works: internal SLO is 99.95%, external SLA is 99.9%, and the 0.05% gap is your buffer. You will have incidents; they will eat budget; you need slack between “we missed our target” (OK, we learn) and “we breached the contract” (legal problem). Teams that set internal and external equal are setting themselves up for compliance drama every quarter.
Further Reading
  • Google SRE Book — Chapters on Service Level Objectives and Embracing Risk (sre.google/sre-book) — the authoritative text on SLO/SLI/SLA distinctions and error budget mechanics.
  • Alex Hidalgo — “Implementing Service Level Objectives” — practitioner-grade guidance on setting and operationalizing SLOs without the ceremony.
  • Amazon Builders’ Library — “Reliability, constant work, and a good cup of coffee” (aws.amazon.com/builders-library) — essays on the engineering patterns behind high-availability AWS services.

Question

You are a tech lead at an early-stage startup with three backend engineers. The product needs a notification system. You do not have the time or team to build the full architecture described in Problem 6. Walk me through how you would simplify it to ship in two weeks while still building something that can scale later.Difficulty: Staff-Level | Concepts tested: Engineering judgment, build vs buy, incremental architecture, pragmatism, scope management

Strong Answer

This is fundamentally a judgment question, not a technical one. The right answer is NOT to build a simplified version of the full architecture. The right answer is to identify what you absolutely must build versus what you can buy.Week 1 — Buy everything you can:
  • Push notifications: Use Firebase Cloud Messaging (FCM) directly. It handles device token management, multi-platform delivery, and retry logic. Three API calls: register device token, send to token, send to topic. No custom push worker needed.
  • Email: Use SendGrid or AWS SES with their template system. Store templates in SendGrid, not in your own database. SendGrid handles bounce management, unsubscribe links (CAN-SPAM compliance), and delivery tracking. One API call to send.
  • SMS: Use Twilio. One API call to send. Twilio handles delivery receipts, phone number validation, and regulatory compliance per country.
  • In-app notifications: This is the only part you must build, because it is tied to your application’s data model.
What you actually build:
  • A single notifications table in your existing PostgreSQL database: {id, user_id, type, title, body, read, created_at}.
  • A notification_preferences table: {user_id, channel, enabled}. Default everything to enabled.
  • A single API endpoint: POST /internal/notify that your other services call. This endpoint: (1) checks user preferences, (2) writes to the notifications table (for in-app), (3) calls SendGrid/FCM/Twilio directly (synchronous for now — yes, synchronous).
  • A REST endpoint for the frontend: GET /api/notifications for the in-app notification list, and PATCH /api/notifications/{id}/read.
What you explicitly skip:
  • No message queue. Send notifications synchronously. If SendGrid is slow (200ms), your API response is 200ms slower. At startup scale (hundreds of notifications per day), this is fine.
  • No deduplication service. Use a unique constraint on (user_id, type, idempotency_key) in PostgreSQL. If a duplicate arrives, the INSERT fails. Done.
  • No rate limiting per user. At startup scale with real users, nobody is getting spammed. Add this when you have data showing it is a problem.
  • No priority queues. Everything is sent immediately. A 2FA code and a marketing email go through the same path.
  • No aggregation. If a user gets 20 likes, they get 20 notifications. Fix this when users complain.
  • No multi-provider failover. If Twilio is down, SMS fails. You have three engineers — operating a failover system is more expensive than the occasional missed SMS.
The architecture that scales later:The key decision is making POST /internal/notify the single entry point. Every service in your codebase calls this one endpoint. When you need to scale:
  • Month 3: Replace the synchronous external API calls with a background job queue (Sidekiq, BullMQ, Celery). The API endpoint writes to the queue, returns immediately. Workers process the queue asynchronously. You have just added the async dispatch layer.
  • Month 6: Add Redis for dedup (replacing the PostgreSQL unique constraint, which is getting slow under load). Add per-user rate limiting.
  • Month 12: Split into separate queues by priority. Add notification aggregation. Consider replacing FCM/SendGrid direct calls with SNS or a dedicated notification service.
The point is: every enhancement is additive. You never rewrite the core. The POST /internal/notify contract stays stable. This is what “building for the future without over-engineering the present” actually looks like.

Follow-up: How do you convince the product team that skipping aggregation and rate limiting is okay for now?

With data and risk framing.
  • Aggregation: “At our current scale of 500 daily active users, the median user receives 3 notifications per day. The busiest user receives 15. Notification fatigue is not a problem yet. When we have data showing users are disabling notifications, we build aggregation. Building it now costs 2 engineer-weeks and delays the launch.”
  • Rate limiting: “We do not have users who are being spammed because we control all the notification sources — it is our own code sending notifications. Rate limiting protects against runaway code or abuse, but we can achieve the same protection with a MAX_NOTIFICATIONS_PER_HOUR = 20 constant in the notify endpoint. One line of code, not a Redis-backed distributed rate limiter.”
  • The general principle: Defer complexity until you have evidence it is needed. Every premature abstraction has a maintenance cost. Three engineers maintaining a rate limiting service is three engineers not building features.

Follow-up: At what point does the “buy everything” approach break down and you need to build custom?

The trigger is usually one of three things:
  1. Cost: At 10 million emails per month, SES costs 1,000.SendGridcosts1,000. SendGrid costs 5,000+. At 100 million, the difference is significant. You start building custom email infrastructure (dedicated IP management, templating, bounce handling) when the cost of the managed service exceeds the cost of an engineer maintaining custom infrastructure.
  2. Control: You need notification behaviors that the managed service does not support. For example, “do not send push notifications during the user’s sleeping hours based on their timezone” requires preference logic that FCM does not provide. Or “aggregate 20 like notifications into one” requires application-level batching before calling FCM.
  3. Reliability: When your notification SLA exceeds what the managed service provides. If you need 99.99% delivery for 2FA codes, depending on a single provider (Twilio) is a risk. You need multi-provider failover, which means building a dispatch layer.
In my experience, most startups hit trigger 2 (control) first — they need custom preferences, aggregation, or quiet hours. Trigger 1 (cost) usually comes later, after the user base is large enough that the engineering investment pays for itself. Trigger 3 (reliability) is rare until you are at significant scale.

Advanced Interview Scenarios

These questions are deliberately harder than the section above. They target the blind spots that survive standard preparation — scenarios where the obvious answer is wrong, where production experience separates real builders from textbook readers, and where the interviewer is testing judgment more than knowledge. If you can handle these, you can handle anything a staff-level loop throws at you.

Question

You are on-call for the notification system. A traffic spike hits — 10x normal volume due to a breaking news event. Auto-scaling kicks in and starts adding workers. But instead of recovering, latency gets worse and error rates climb. What is happening and how do you stabilize the system?Difficulty: Staff-Level | Concepts tested: Auto-scaling failure modes, cold start, connection pool exhaustion, thundering herd, incident command under pressure

The Trap

Most candidates say “auto-scaling handles traffic spikes” and move on. In reality, auto-scaling is one of the most common amplifiers of outages at scale. This question tests whether you have been paged at 3 AM and learned the hard way.What weak candidates say:“Auto-scaling should handle this. Maybe the scaling policy needs a higher max instance count. I would increase the max and wait for new instances to come online.”What strong candidates say:The way I think about this is that auto-scaling solves the capacity problem but can create three new problems simultaneously — and at 3 AM, you need to know which one you are fighting.
  • Problem 1 — Connection pool exhaustion (the most common killer). Each new worker instance opens connections to PostgreSQL, Redis, and downstream services. Your RDS instance has a max_connections of 500. You started with 20 workers at 10 connections each (200 total). Auto-scaling adds 50 more workers — that is 500 new connection attempts. PostgreSQL starts rejecting connections. Now all workers — old and new — start failing on database calls. The new capacity you added is not just useless, it is actively killing the healthy workers. I saw this exact scenario at a company processing payment notifications. Black Friday traffic triggered auto-scaling, which exhausted the RDS connection pool in 90 seconds. The fix was not more instances — it was adding PgBouncer as a connection pooler between the application and the database. PgBouncer multiplexes 500 application connections over 50 actual database connections. We deployed PgBouncer that weekend and never hit the issue again.
  • Problem 2 — Cold start cache stampede. New instances have empty local caches. Every request on a new instance is a cache miss that hits the database. If you just scaled from 20 to 70 instances, you have 50 instances all hammering the database with requests that would normally be served from local cache. DynamoDB’s adaptive capacity takes 5-10 minutes to scale up the hot partitions; during that window, you get throttling (ProvisionedThroughputExceededException). The mitigation is cache warming on startup. Before the instance starts accepting traffic, it pulls the top 1,000 most-requested items into its local cache. This adds 30 seconds to startup time but prevents the stampede.
  • Problem 3 — Downstream service collapse. Your notification workers call APNs, FCM, SendGrid, and Twilio. These external APIs have their own rate limits. 20 workers sending 100 req/sec each = 2,000 req/sec to SendGrid. Scale to 70 workers = 7,000 req/sec. SendGrid starts returning 429s. Your workers retry. Retries add to the load. You are now in a feedback loop where more workers means more retries means more load means more failures.
How I stabilize the system in the moment:
  1. Freeze auto-scaling immediately. Set the desired count to the current number. Stop the bleeding.
  2. Enable circuit breakers on downstream calls. If SendGrid is returning 429s, stop sending to it for 60 seconds. Buffer notifications in the queue. They are already in SQS — they will not be lost.
  3. Check the database connection count. SELECT count(*) FROM pg_stat_activity; If it is near max_connections, that is the root cause. Kill idle connections and add PgBouncer if not already present.
  4. Reduce to a known-good instance count. Scale down to the number of instances that were handling traffic before the spike. Let the queue absorb the burst. Process at a sustainable rate.
War Story: At a fintech company, we had a Kafka consumer auto-scaling policy tied to consumer lag. A burst of 2 million payment events created lag. Auto-scaling added 40 consumers in 3 minutes. Each consumer opened 5 connections to PostgreSQL. We went from 100 to 300 database connections instantly. PostgreSQL hit max_connections, every consumer started throwing errors, and consumer lag increased because no consumer could process anything. The fix: we added a connection pooler, capped auto-scaling at 2 new instances per minute (instead of 10), and added a startup health check that verified database connectivity before the instance joined the consumer group.

Follow-up: How do you design an auto-scaling policy that does not amplify outages?

  • Rate-limit the scaling itself. Add a cooldown period (5 minutes between scale-up events) and cap the number of new instances per scaling event (add 2 at a time, not 20). AWS calls this “step scaling with cooldown.”
  • Scale on the right metric. Scaling on CPU is almost always wrong for I/O-bound services. Scale on queue depth, request latency p99, or consumer lag. For notification workers, scale on SQS ApproximateNumberOfMessagesVisible — it directly measures backlog.
  • Pre-provision for known events. If you know Black Friday is coming, scale up before the traffic hits. Scheduled scaling at 5 AM on Black Friday morning is far safer than reactive scaling at 9 AM when the traffic arrives.
  • Circuit breaker before scaling. If downstream services are failing, adding more workers makes things worse. Your scaling policy should check dependency health: if the database connection pool is above 80%, do not scale.

Follow-up: How do you prevent the connection pool problem architecturally, not just with PgBouncer?

  • Serverless connection patterns. Use DynamoDB or Aurora Serverless v2 instead of standard RDS. Aurora Serverless manages connection pooling internally through its Data API. DynamoDB has no connection limit — it is HTTP-based.
  • HTTP-based data access. Replace direct database connections with an internal data service API. Workers call GET /api/internal/users/123 instead of connecting to PostgreSQL directly. The data service manages a fixed connection pool. You can scale workers independently of database connections.
  • RDS Proxy. AWS’s managed connection pooler that sits between Lambda/ECS and RDS. It maintains a pool of warm connections and multiplexes application connections over them. Adds ~1ms latency but eliminates the connection exhaustion problem.

Question

Your chat system has 500 million messages in MongoDB. Performance is degrading because of the single-primary write bottleneck. You have decided to migrate to Cassandra. You cannot take the system offline. Users cannot lose messages or see duplicates during the migration. Walk me through the migration, step by step.Difficulty: Staff-Level | Concepts tested: Zero-downtime migration, dual-write, data consistency verification, rollback strategy, operational planning

The Trap

Most candidates describe “export from old database, import to new database, switch over.” That is an offline migration. The question explicitly says zero downtime. This tests whether you have actually performed a live migration.What weak candidates say:“I would set up Cassandra, copy all the data over, then switch the application to point at Cassandra. Maybe do it during off-peak hours.”This answer misses: what happens to messages written during the copy? How do you verify data consistency? How do you roll back if Cassandra has a problem? What about the different data models?What strong candidates say:I have done a migration like this before, and it is a 4-phase process that typically takes 4-6 weeks from start to full cutover. The phases are: dual-write, backfill, shadow-read verification, and cutover. Let me walk through each.
  • Phase 1 — Dual-Write (Week 1-2). Deploy a code change so that every new message write goes to BOTH MongoDB and Cassandra. MongoDB remains the source of truth. The write path looks like: write to MongoDB (synchronous, must succeed), then write to Cassandra (asynchronous, best-effort with retry). If the Cassandra write fails, log the failure to a dead letter queue and retry later. No user-facing behavior changes. The critical detail: the Cassandra data model is different from MongoDB. MongoDB stores messages as documents with embedded arrays for reactions and read receipts. Cassandra uses a wide-row model with (conversation_id, message_id) as the primary key. The dual-write layer must translate between models. This translation logic is the highest-risk code — test it exhaustively with production-like data before enabling dual-write.
  • Phase 2 — Backfill (Week 2-3). While dual-write handles new messages, you need to copy the 500 million historical messages. Run a backfill job that reads from MongoDB in batches (10,000 messages at a time, paginated by _id), transforms each message to the Cassandra schema, and writes to Cassandra. The backfill must be idempotent — if it crashes and restarts, it should not create duplicates. Use the message ID as the Cassandra primary key so re-inserting the same message is a no-op (Cassandra upserts by default). Rate-limit the backfill to avoid overwhelming either database. At 5,000 writes/sec to Cassandra, 500 million messages takes ~28 hours. Run it over a weekend with monitoring. If the backfill causes latency spikes in MongoDB (from the read load), throttle it down. There is a race condition: a message written by the dual-write path might be overwritten by the backfill reading an older version from MongoDB. Mitigation: use a write_timestamp in Cassandra and implement a “latest wins” strategy using Cassandra’s built-in last-write-wins conflict resolution. Since the dual-write message is newer, it survives.
  • Phase 3 — Shadow-Read Verification (Week 3-4). This is the phase most people skip, and it is the one that saves you from a catastrophic cutover failure. Deploy code that reads from BOTH databases on every message fetch. The response is served from MongoDB (still the source of truth), but in the background, you also query Cassandra and COMPARE the results. Log every discrepancy: missing messages in Cassandra, different message content, ordering differences. Calculate a consistency score: “Cassandra agrees with MongoDB on 99.97% of reads.” Investigate the 0.03% discrepancies — they usually reveal bugs in the dual-write translation or backfill edge cases (messages with special characters, very long messages that were truncated, timezone conversion issues). Do not proceed to cutover until the consistency score is 99.99% or higher over 72 hours of shadow reads.
  • Phase 4 — Cutover (Week 5). Flip the read path to Cassandra. MongoDB becomes the fallback. The cutover is a feature flag, not a deployment. Step 1: Route 1% of read traffic to Cassandra. Monitor latency and error rates for 1 hour. Step 2: Route 10%. Monitor for 4 hours. Step 3: Route 50%. Monitor for 24 hours. Step 4: Route 100%. MongoDB is now only receiving dual-writes. Step 5 (Week 6+): After 2 weeks of stable Cassandra-primary operation, remove the dual-write to MongoDB. MongoDB is decommissioned. At every step, you can roll back by flipping the feature flag. MongoDB has all the data because dual-write was still active.
War Story: Discord’s migration from MongoDB to Cassandra is the canonical case study here. They had trillions of messages and the single-primary bottleneck on MongoDB was causing write latency spikes during peak hours. Their migration followed essentially this 4-phase pattern. The key lesson they shared publicly: the shadow-read phase caught dozens of edge cases they had not anticipated, including a timestamp precision difference between MongoDB and Cassandra that caused messages to sort differently. Without shadow reads, they would have discovered this in production after cutover — a terrifying prospect for a chat platform where message ordering is sacred.

Follow-up: What if you discover during shadow reads that Cassandra’s performance is actually worse than MongoDB for certain query patterns?

This is a realistic scenario — the data model translation might not map perfectly to Cassandra’s strengths.
  • Identify the specific query pattern. Common culprits: (1) queries that work on secondary indexes in MongoDB but require a full table scan in Cassandra (Cassandra secondary indexes are notoriously slow), (2) queries that fetch a small number of messages from many conversations (scatter-gather across partitions), (3) aggregation queries (count of unread messages across all conversations) that MongoDB handles natively but Cassandra cannot.
  • For slow secondary index queries: Denormalize. Create a separate Cassandra table optimized for that query. Cassandra’s philosophy is “model your tables around your queries, not your entities.” If you need to look up messages by sender, create a messages_by_sender table.
  • For scatter-gather queries: Add a caching layer. If the query is “fetch the latest message from each of my 50 conversations” (for the conversation list view), cache the latest message per conversation in Redis. Update on every new message. The query becomes a single Redis MGET, not 50 Cassandra reads.
  • If performance is fundamentally worse: Reconsider the migration. Maybe the right answer is not Cassandra but ScyllaDB (same data model, better tail latency due to no JVM GC pauses). Or maybe a different database entirely — for example, TiDB if you need distributed SQL.

Follow-up: How do you handle the dual-write failure mode where MongoDB succeeds but Cassandra fails?

  • During the dual-write phase, MongoDB is the source of truth. A Cassandra write failure is tolerable — the message exists in MongoDB and will be served correctly.
  • Log the failed Cassandra write to a dead letter queue (SQS or a separate PostgreSQL table). A reconciliation worker retries the failed writes every 30 seconds with exponential backoff.
  • If a message is in MongoDB but not in Cassandra, and a shadow read occurs before the reconciliation worker catches up, the shadow read comparison will flag it as a discrepancy. This is expected and should not be counted against the consistency score — filter out discrepancies for messages written in the last 5 minutes.
  • The dangerous case: Cassandra write appears to succeed (no error returned) but the data is silently corrupted or written to the wrong partition due to a data model bug. This is why shadow-read verification is non-negotiable. It catches bugs that error handling cannot.

Question

This question tests your ability to think beyond the happy path where your core infrastructure works. Every system has a “what if the thing you are depending on dies” scenario. Walk me through how you design a high-throughput event pipeline that survives the failure of its primary message broker.Difficulty: Senior to Staff | Concepts tested: Resilience patterns, multi-layer buffering, graceful degradation, “what if the obvious answer is wrong” thinking

The Trap

The trap is designing a system with a single critical dependency (Kafka) and no fallback. The interviewer wants to see that you think in layers of defense, not single points of failure.What weak candidates say:“Kafka is highly available with replication. If one broker goes down, the others take over. You just configure replication factor 3 and min.insync.replicas 2.”This answer is technically correct for a single-broker failure, but it misses the point. The interviewer is asking about a total Kafka outage — maybe a cluster-wide misconfiguration, a ZooKeeper failure, a network partition that isolates the entire Kafka cluster, or a bad version upgrade.What strong candidates say:Let me think about this in layers, because resilience is never about one backup plan — it is about degradation tiers.
  • Tier 1 — Kafka is healthy (normal operation). Producers write events to Kafka topics. Consumers process them. Replication factor 3, min.insync.replicas 2. This handles individual broker failures. Kafka recovers automatically. This is the 99.9% case.
  • Tier 2 — Kafka is degraded (some partitions unavailable). Producers get errors for specific partitions. The partition leader election is taking too long, or a minority of brokers are down. The producer should implement a local buffer: when a Kafka write fails, write the event to a local disk-backed queue (an embedded RocksDB instance, a memory-mapped file, or even just appending to a local file). Continue retrying Kafka in the background. When Kafka recovers, drain the local buffer into Kafka. Events are delayed but not lost. At Uber, the Kafka producer library (called uReplicator) does exactly this — it buffers to local disk when Kafka is unreachable and drains on recovery. They reported that during a 20-minute Kafka cluster restart, zero events were lost because the local buffer held approximately 400GB of events across their fleet.
  • Tier 3 — Kafka is fully down (total cluster outage). Local buffers on producers fill up. If the outage lasts hours, the local disk fills up. Now you need a fallback message broker. Option A: Write-ahead to S3. Every event is written to both Kafka and an S3 bucket (partitioned by yyyy/mm/dd/hh/producer_id). If Kafka is down, events still land in S3. A separate consumer reads from S3 when Kafka is unavailable. Latency goes from sub-second (Kafka) to minutes (S3 listing and reading), but no events are lost. Option B: Secondary message broker. Run a smaller Redis Streams or Amazon Kinesis cluster as a hot standby. When the producer detects Kafka is down (circuit breaker opens after 3 failures in 10 seconds), it switches to the secondary broker. Consumers subscribe to both Kafka and the secondary broker. This gives sub-second latency even during a Kafka outage, at the cost of maintaining a second streaming infrastructure. Option C (the pragmatic choice for most systems): Accept the degradation. If you are processing analytics events, a few hours of delay is acceptable. Use the local buffer (Tier 2) and do NOT invest in a secondary broker. Alert on-call, fix Kafka, drain the buffers. For critical events (payment notifications, 2FA codes), route through a separate, simpler path (SQS or a direct database write) that does not depend on Kafka at all.
  • The key architectural principle: Separate your event streams by criticality. Payment events and 2FA events should NEVER share infrastructure with analytics and clickstream events. Critical events get a simple, highly available path (SQS, which has 99.999% availability). Bulk events get the high-throughput path (Kafka) with best-effort resilience.
War Story: At LinkedIn, which operates one of the largest Kafka deployments in the world (trillions of messages per day), they experienced a multi-cluster Kafka outage in 2019 caused by a misconfigured deployment that rolled out bad configs to all brokers simultaneously. Their mitigation was exactly the layered approach: producers buffered locally, critical data paths (like the ads pipeline that generates revenue) had an independent Kafka cluster, and analytics data was accepted as delayed. The post-mortem led them to implement “blast radius isolation” — no single configuration change can affect more than one Kafka cluster.

Follow-up: You mentioned writing to both Kafka and S3 simultaneously. How do you handle deduplication when the consumer reads from both sources?

  • Every event has a globally unique event ID (UUID or ULID) assigned at the producer before writing to any destination.
  • Consumers maintain a dedup window: a Bloom filter or a Redis set of recently processed event IDs, with a TTL of 2x the maximum expected delay between Kafka delivery and S3 delivery.
  • When processing from S3 (the backup path), the consumer checks each event ID against the dedup set. If it was already processed from Kafka, skip it.
  • The Bloom filter approach is memory-efficient: 1 billion event IDs in a Bloom filter with a 0.01% false positive rate requires approximately 1.2 GB of memory. At a 0.01% false positive rate, you silently drop 1 in 10,000 duplicate-seeming events that are actually unique. For analytics, this is acceptable. For payments, use an exact dedup set (Redis or DynamoDB).

Follow-up: How would you test this resilience in production without actually breaking Kafka?

  • Chaos engineering with Gremlin or LitmusChaos. Inject Kafka broker failures in a controlled way during off-peak hours. Verify that the local buffer activates, events are not lost, and the buffer drains correctly on recovery.
  • Dark traffic testing. Run a shadow pipeline that mirrors 1% of production events. Intentionally break the shadow pipeline’s Kafka and observe the fallback behavior. The production pipeline is unaffected.
  • GameDay exercises. Schedule a quarterly drill where the team simulates a Kafka total outage. Practice the incident response: who gets paged, what is the communication plan, how do you verify events are buffered, how do you verify the buffer drains correctly. At AWS, GameDay exercises are how they discovered that their S3 fallback path had a bug where events with payloads larger than 256KB were silently dropped.

Question

This is not a “microservices vs monolith” definition question. I want you to give me the specific criteria for when a microservices architecture becomes the wrong choice, and how you would evaluate whether your current system should be simplified.Difficulty: Staff-Level | Concepts tested: Architecture judgment, organizational awareness, the “obvious answer is wrong” trap, intellectual honesty

The Trap

Most candidates default to defending microservices because it is the “modern” answer. The interviewer wants to see you challenge assumptions, reason from first principles, and potentially argue that a monolith is the better choice for many real teams.What weak candidates say:“Microservices are better because they allow independent deployments, scaling, and team autonomy. Each service can use the best technology for its domain.”This is the textbook answer, and it is only true for large organizations. For the majority of engineering teams, microservices add more cost than value.What strong candidates say:The honest answer is that microservices are the wrong architecture for most teams, most of the time. Let me explain my reasoning.
  • The organizational prerequisite that most people ignore. Conway’s Law tells us that system architecture mirrors organizational structure. Microservices work when you have multiple independent teams (6-10 engineers each) that own independent business domains and need to deploy independently. If you have 15 engineers in one team, splitting into microservices gives you all the distributed systems complexity (network calls, serialization, distributed tracing, service discovery, API versioning) without the organizational benefit. You are paying the microservices tax without the microservices dividend. Shopify has over 3,000 engineers and runs a monolith (a large Ruby on Rails application). They have invested heavily in making the monolith modular (component-based architecture with enforced boundaries), which gives them most of the isolation benefits of microservices without the operational overhead. Shopify explicitly chose NOT to migrate to microservices because the coordination cost across hundreds of services would have slowed their product velocity.
  • The specific criteria where microservices are justified:
    1. Different scaling profiles. Your notification dispatch service needs to scale to 10,000 instances during Black Friday, but your user preference service handles steady traffic. Putting them in the same monolith means you scale the entire application 10,000x just for the notification path.
    2. Different reliability requirements. The URL redirect service must have 99.99% availability. The analytics service can tolerate 99.9%. In a monolith, the analytics service’s failures can crash the redirect service. Separate processes isolate failure domains.
    3. Different deployment cadences. Team A deploys 10 times per day. Team B deploys once a week. In a monolith, Team A’s frequent deploys create risk for Team B’s code. Separate services let each team deploy independently.
    4. Genuine technology diversity needs. The ML ranking service is Python. The real-time event processor is Go. The CRUD API is Node.js. Different languages require different processes (i.e., separate services) by definition.
  • How I evaluate whether to simplify an existing microservices architecture:
    • Count the inter-service calls per user request. If a single user action triggers a chain of 8 synchronous service-to-service calls, you have a distributed monolith. You got the worst of both worlds: the deployment complexity of microservices and the tight coupling of a monolith.
    • Count the engineers per service. If you have 15 services and 10 engineers, that is 0.67 engineers per service. Nobody can maintain their services properly. Each service has stale dependencies, missing alerts, and outdated documentation.
    • Measure deployment coordination. If deploying Feature X requires coordinating releases across 4 services in a specific order, your service boundaries are in the wrong places. Well-designed microservices should be deployable independently for 90%+ of changes.
War Story: I worked at a company that had 42 microservices and 12 engineers. Every feature required touching 3-5 services. We spent 40% of our time on operational overhead — keeping service meshes healthy, debugging distributed traces, managing Kubernetes manifests. We consolidated to 3 services (API, workers, internal tools) over 6 months. Developer velocity roughly doubled. The consolidation was the single most impactful technical decision that year.Amazon famously advocated microservices (“two-pizza teams”), but what people miss is that Amazon has 10,000+ engineering teams. At that scale, independent services are essential. Applying Amazon’s architecture to a 15-person startup is cargo cult engineering.

Follow-up: If you do need microservices, what is the single most important thing to get right?

Service boundaries. Get them wrong and everything downstream is painful.
  • Boundaries should follow business domains, not technical layers. “User service, database service, logging service” is wrong. “Payments, notifications, content” is right. Each service should own a business capability end-to-end, including its own data store.
  • The litmus test: Can this service be developed, deployed, and operated by a single team without coordinating with other teams for 80%+ of their work? If no, the boundary is wrong.
  • Start with a monolith and extract. You cannot draw good service boundaries until you understand the domain. Build the monolith first, observe which parts change together and which change independently, then extract services along those natural seams. This is the Strangler Fig pattern, and it is how most successful microservices architectures were actually built (including Netflix and Amazon).

Follow-up: How do you handle a distributed transaction across microservices — say, “create order” requires updating inventory AND processing payment?

  • Short answer: you do not. Distributed transactions (two-phase commit) across microservices are fragile, slow, and defeat the purpose of independent services.
  • Use the Saga pattern instead. The order service creates the order in “pending” state, sends a “reserve inventory” event. The inventory service reserves the items and sends “inventory reserved.” The payment service processes the payment and sends “payment completed.” The order service moves the order to “confirmed.” If any step fails, compensating actions fire: release the reserved inventory, reverse the payment.
  • The hard part is compensating transactions. “Reverse the payment” sounds simple but has edge cases: what if the payment API is also down? What if the payment was partially processed? This is why each service must be idempotent and support a “compensate” operation that is robust to partial failures.
  • Temporal/Cadence is purpose-built for this. Instead of hand-coding sagas, define the workflow in Temporal. It handles the orchestration, retries, compensation, and state persistence. If a step fails, Temporal retries with backoff. If the entire workflow fails, Temporal runs the compensating actions. This is how Uber handles ride booking: a Temporal workflow coordinates rider matching, driver notification, payment authorization, and trip creation.

Question

This question is about API contract management, backward compatibility, and schema evolution. It seems simple on the surface, but the answer reveals how much production experience a candidate has with service-to-service communication at scale.Difficulty: Senior | Concepts tested: API versioning, schema evolution, backward/forward compatibility, contract testing, the Postel principle

The Trap

Most candidates jump to “use API versioning (v1, v2).” That is a partial answer. The real question is: why did adding a NEW field (which should be backward compatible) break downstream services?What weak candidates say:“We should version our APIs. Put /v1/ in the URL. When you make a breaking change, create /v2/.”This misses the core issue: adding a field should not be a breaking change. If it broke things, your serialization or your consumers are misconfigured.What strong candidates say:First, let me figure out WHY adding a new field broke anything, because in a well-designed system, adding a field is a backward-compatible change. The fact that it broke things tells me something is wrong at a deeper level.
  • Root cause 1 — Strict deserialization. The downstream service is using a deserializer that fails on unknown fields. In Java, Jackson’s DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES is true by default. A new field in the response causes Jackson to throw an exception. This is the most common cause. The fix is configuring all consumers to ignore unknown fields. In Jackson: @JsonIgnoreProperties(ignoreUnknown = true). In protobuf, this is the default behavior — unknown fields are silently ignored. I have seen this break production at multiple companies. At one, a developer added an avatar_url field to the user profile response. The payments service — which only needed user_id and email — deserialized the full response and crashed on the unknown field. An unrelated service broke because of a change it should never have noticed.
  • Root cause 2 — Exact schema validation. If consumers use JSON Schema validation or OpenAPI spec validation with additionalProperties: false, any extra field is rejected. The fix: always set additionalProperties: true in your schemas, or better yet, do not validate the full schema on the consumer side — only validate the fields you use.
  • Root cause 3 — The field changes the response size beyond a buffer limit. In rare cases, the new field (especially if it contains large data like a base64-encoded image) pushes the response beyond a size limit in the consumer’s HTTP client, a proxy buffer, or a message queue’s maximum message size. This is the sneakiest failure because the error message is often “connection reset” or “timeout” with no obvious link to the schema change.
How to prevent this systematically:
  • Consumer-Driven Contract Testing. Each downstream service defines a “contract” — the specific fields it expects from the API response. The API’s CI pipeline runs all consumer contracts against the current API. If a change breaks a contract, the build fails BEFORE deployment. Pact is the most popular tool for this. At Atlassian, Pact tests prevented hundreds of breaking changes from reaching production over the years.
  • The Robustness Principle (Postel’s Law). “Be conservative in what you send, be liberal in what you accept.” Consumers should accept responses with extra fields, missing optional fields, and fields in any order. Producers should only add fields (never remove or rename without a deprecation period).
  • Schema registries for event-driven systems. If your services communicate via Kafka events, use a schema registry (Confluent Schema Registry or AWS Glue Schema Registry). The registry enforces compatibility rules: new schema versions must be backward compatible (consumers using the old schema can still read new messages) or forward compatible (consumers using the new schema can still read old messages). A schema change that breaks compatibility is rejected at publish time, not in production.
  • The compatibility matrix:
    • Adding a field: Backward compatible (old consumers ignore it). Should NEVER break.
    • Removing a field: Backward INCOMPATIBLE (old consumers expect it). Requires a deprecation period.
    • Renaming a field: Equivalent to removing + adding. Breaking change.
    • Changing a field type: Breaking change. Always.
War Story: At a company running 150+ microservices, we had an average of 2 breaking API changes per month reaching production. We instituted three changes: (1) all services must configure JSON deserialization to ignore unknown fields, (2) all API changes must pass consumer contract tests in CI, (3) field removals require a 30-day deprecation period with a warning header (Deprecation: true). Breaking changes in production dropped to zero within two quarters.

Follow-up: How do you handle evolving a protobuf schema without breaking existing consumers?

  • Protobuf has built-in field numbering that makes evolution safe. Each field has a number: string name = 1; int32 age = 2;. As long as you never reuse a field number, you are safe.
  • Adding a field: Assign the next available number. Old consumers ignore it. New consumers use a default value if the field is missing from old messages.
  • Removing a field: Mark the field number as reserved so it can never be reused: reserved 3;. Old consumers see the field as empty (default value). Never delete the field number from the .proto file — reserve it.
  • The discipline: Never change a field’s type or number. Treat field numbers as permanent. This is why protobuf is the preferred serialization for long-lived service APIs — the wire format is inherently backward and forward compatible if you follow the rules.

Follow-up: What about database schema evolution? How do you add a column to a table that 50 services read from?

  • Expand-and-contract pattern. Phase 1 (expand): add the new column with a default value and make it nullable. No service breaks because they do not read the new column. Phase 2: update the write path to populate the new column. Backfill existing rows. Phase 3: update consumers one by one to read the new column. Phase 4 (contract): once all consumers are updated, make the column non-nullable if needed, remove the old column if it was being replaced.
  • Never add a non-nullable column without a default value. This is the database equivalent of a breaking API change — existing INSERT statements that do not include the new column will fail.
  • Online schema migration tools. For large tables (billions of rows), ALTER TABLE ADD COLUMN can lock the table for seconds or minutes on MySQL. Use online schema migration tools: gh-ost (GitHub’s tool) or pt-online-schema-change (Percona). These create a shadow table, copy data, sync changes via triggers, and do an atomic rename. PostgreSQL handles ADD COLUMN with a default value without locking since version 11 — another reason to prefer PostgreSQL.

Question

You have deployed the task scheduler from Problem 5 to production. Before you go home, you need to set up monitoring and alerting. You can only pick five alerts. Which five do you choose, and what thresholds do you set?Difficulty: Senior | Concepts tested: Operational maturity, signal vs noise in alerting, SLI/SLO thinking, production readiness, the difference between monitoring and observability

The Trap

Most candidates list generic infrastructure alerts: CPU > 80%, memory > 90%, disk > 85%. These are important but tell you nothing about whether the system is actually working. The interviewer wants to see you think from the USER’s perspective inward, not from the MACHINE’s perspective outward.What weak candidates say:“I would alert on CPU usage, memory usage, error rates, disk space, and response time.” These are machine-level metrics. CPU can be at 90% while the system is perfectly healthy (processing a backlog). Error rate can be 0% while the system is completely broken (scheduler is frozen, no tasks are being processed at all, so no errors are generated).What strong candidates say:My philosophy on alerting is: alert on symptoms first, causes second. A symptom alert tells you the user is affected. A cause alert helps you diagnose. If you only have five alerts, make them all symptoms.Alert 1 — Task execution delay exceeds SLA.
  • Metric: p99(actual_execution_time - scheduled_execution_time) over a 5-minute window.
  • Threshold: > 60 seconds for P0 tasks, > 5 minutes for P1 tasks.
  • Why this is the most important alert: This directly measures whether the system is fulfilling its core promise — executing tasks on time. If this alert fires, users are affected. Every other failure (scheduler down, queue backed up, workers crashing) eventually manifests as execution delay.
  • What it catches that other alerts miss: A frozen scheduler that is not enqueuing tasks. The queue is empty (so no queue depth alert fires), error rate is zero (so no error alert fires), but no tasks are being processed because nothing is being scheduled.
Alert 2 — Dead Letter Queue depth increasing.
  • Metric: count(tasks in DLQ added in last 15 minutes)
  • Threshold: > 10 tasks in 15 minutes. (In a healthy system, the DLQ is empty or nearly empty.)
  • Why: DLQ growth means tasks are exhausting retries and failing permanently. This is a leading indicator of a systemic problem — a dependency is down, a task type has a bug, or a deployment broke the task handler. The DLQ is the canary in the coal mine.
  • War Story: At a company running Sidekiq for background jobs, we had no DLQ alerting. A third-party API changed its response format. The integration tasks started failing, exhausted retries, and silently accumulated in the DLQ. We discovered 45,000 failed tasks three days later during a routine check. Those were customer onboarding emails that never sent. Three days of new users never received their welcome sequence.
Alert 3 — Scheduler heartbeat missing.
  • Metric: A canary task scheduled to run every 60 seconds. Alert if the canary has not executed in 3 minutes.
  • Threshold: Canary task not completed within 180 seconds of its scheduled time.
  • Why: This is an end-to-end health check. The canary exercises the entire pipeline: scheduler picks it up, enqueues it, worker executes it, completion is recorded. If the canary does not fire, something in the pipeline is broken. This catches failures that component-level metrics miss (e.g., the scheduler is running but its database query returns zero rows due to a schema migration that accidentally dropped the index).
Alert 4 — Worker pool utilization above capacity threshold.
  • Metric: (active_workers / total_workers) * 100 averaged over 10 minutes.
  • Threshold: > 90% utilization for 10 minutes.
  • Why: At 90% utilization, you have no headroom for traffic spikes. The next burst of tasks will cause queuing delays. This is a leading indicator — it fires BEFORE users are affected, giving you time to scale up. Unlike a queue depth alert (which fires when things are already bad), utilization tells you things are about to get bad.
Alert 5 — Task success rate drops below threshold.
  • Metric: (completed_tasks / (completed_tasks + failed_tasks)) * 100 over a 15-minute window.
  • Threshold: < 95% success rate.
  • Why: A sudden drop in success rate indicates either a deployment problem (new code is crashing) or a dependency problem (a downstream API is failing). The 95% threshold accounts for normal background failure rates (network blips, transient errors) while catching systemic issues.
  • Why not 99%? In my experience, task schedulers have a natural failure rate of 1-3% due to transient errors (timeouts, rate limiting by third parties). Setting the threshold at 99% would create alert fatigue from false positives. 95% means something genuinely changed.
What I deliberately did NOT alert on:
  • CPU/memory/disk. These are useful for dashboards, not for paging someone at 3 AM. High CPU while processing a backlog is normal. These should be warning-level signals that feed into the symptom alerts, not page-worthy on their own.
  • Individual task failures. One task failing is not an incident. Set this up as a log or a low-severity notification to the task owner, not an on-call page.
  • Queue depth absolute number. A queue depth of 50,000 might be normal during a batch job run and dangerous during idle hours. Absolute thresholds on queue depth create false positives. Instead, alert on the rate of change or on the downstream symptom (execution delay).

Follow-up: How do you avoid alert fatigue when operating this system?

  • Ruthlessly eliminate noisy alerts. If an alert fires more than once a week and the action is “acknowledge and ignore,” delete the alert. It is training your team to ignore pages.
  • Every alert must have a documented action. If you cannot write a one-paragraph runbook for what to do when an alert fires, the alert should not page anyone. Put it on a dashboard instead.
  • Separate alerting tiers: PagerDuty page (wake someone up) for symptom alerts only. Slack notification for leading indicators. Dashboard for infrastructure metrics. Most teams have the ratio inverted — they page on infrastructure and have no symptom alerts.
  • Google’s SRE guidance: If your on-call engineer is getting paged more than twice per 12-hour shift, your system or your alerting is broken. Fix one of them.

Follow-up: How would you set up distributed tracing for the task scheduler specifically?

  • Trace ID propagation. When a task is created via the API, generate a trace ID and attach it to the task payload. The trace ID follows the task through the scheduler (when it is enqueued), the queue (as a message attribute), the worker (when it is picked up), and any downstream calls the worker makes.
  • Key spans to instrument: (1) API receives task creation request, (2) task is persisted to PostgreSQL, (3) scheduler selects and enqueues the task, (4) worker picks up the task, (5) worker executes the task handler, (6) worker reports completion to PostgreSQL. These 6 spans form the task lifecycle trace.
  • What tracing reveals that metrics do not: Where is the time being spent? If execution delay is 30 seconds, tracing tells you: 2 seconds in the scheduler queue, 25 seconds waiting for a worker to pick it up (workers are saturated), 3 seconds for actual execution. Now you know to scale workers, not optimize the scheduler.
  • Tools: AWS X-Ray integrates natively with ECS, Lambda, and SQS. Jaeger or Grafana Tempo for non-AWS or multi-cloud environments. The OpenTelemetry SDK provides vendor-neutral instrumentation.

Question

A user in Europe invokes their “right to erasure” under GDPR. They want all their data deleted — every short URL they created, every click event recorded on those URLs, every analytics record, every cache entry. Walk me through how your URL shortener system processes this request across all data stores.Difficulty: Senior to Staff | Concepts tested: Data lifecycle management, distributed data deletion, GDPR/privacy engineering, cache invalidation, the tension between analytics and privacy

The Trap

Most system design candidates never consider data deletion. They design for writes and reads, but not for erasure. In a distributed system with caches, analytics pipelines, backups, and CDN edge servers, “delete the user’s data” is a genuinely hard distributed systems problem.What weak candidates say:“Delete the rows from the database and clear the cache entries. Send a 200 OK to the user.”This misses: analytics data in the data warehouse, events in Kafka that have already been consumed, data in CDN edge caches, data in database backups and snapshots, data replicated to other regions, and the URL mapping entries that other people’s short URLs might reference.What strong candidates say:GDPR deletion in a distributed system is not a single operation — it is a workflow that touches every data store, and it must be auditable. Let me walk through each layer.
  • Step 1 — Record the deletion request. Before deleting anything, create an immutable record: {user_id, request_time, status: "processing", audit_trail: []}. This record itself is your proof of compliance. GDPR requires you to complete deletion within 30 days and provide evidence that you did.
  • Step 2 — Primary database (DynamoDB). Query all short URLs owned by this user. For each URL, you have two options: Option A — Hard delete: Remove the row entirely. The short URL stops working. Anyone who has this short URL bookmarked gets a 404. This is the cleanest for GDPR but breaks existing links. Option B — Soft delete with anonymization: Replace the user ID with a tombstone value, remove PII from metadata, but keep the URL mapping alive. The short URL continues to redirect, but it is no longer associated with the user. This preserves link functionality while complying with GDPR (the “right to erasure” applies to personal data, not necessarily to a random string mapping to a URL). The correct choice depends on your legal counsel’s interpretation. In my experience, Option B is what most companies implement because breaking millions of existing links creates customer support nightmares.
  • Step 3 — Redis cache. For every short URL key belonging to this user, issue DEL commands. If you are using Redis Cluster, the keys are distributed across shards, so this is a scatter-gather operation. At 10,000 URLs per user, this takes seconds. The subtle problem: CDN edge caches (CloudFront). CloudFront caches redirect responses at edge locations worldwide. You must issue an invalidation for each URL: POST /2019-01-01/distribution/{id}/invalidation with the URL paths. CloudFront invalidation takes 10-60 seconds to propagate globally. During that window, cached redirects still work. This is generally accepted as compliant (you initiated the deletion promptly; propagation delay is a technical limitation).
  • Step 4 — Analytics and click data. This is the hardest part. Click events are in your analytics pipeline — Kafka topics, S3 data lake, Athena tables, Redshift warehouse, possibly third-party analytics (Google Analytics, Amplitude). For Kafka: events that are still in Kafka (within the retention period) cannot be selectively deleted. Kafka does not support single-message deletion. Your options: (1) wait for the events to age out of Kafka’s retention window (typically 7 days), (2) produce “tombstone” events that downstream consumers interpret as “delete any records for this user,” or (3) implement Kafka’s log compaction where a null-value message for a key deletes that key. For the data warehouse (S3 + Athena): run a deletion job that scans Parquet files in S3, removes or anonymizes records matching the user ID, and rewrites the files. This is expensive for large datasets. Tools like Apache Spark or AWS Glue handle this, but it can take hours for terabytes of data. Schedule these as batch jobs that run nightly, processing all pending deletion requests. For third-party analytics: call their deletion APIs (Google Analytics User Deletion API, Amplitude’s user privacy API). Each has different SLAs — Google processes deletions within 72 hours.
  • Step 5 — Backups. This is the most uncomfortable part. Your RDS snapshots and S3 backups contain the user’s data. Retroactively modifying backups is impractical (and some argue impossible without compromising backup integrity). The GDPR-accepted approach: document that backups contain data for deleted users, set backup retention policies (e.g., 30 days), and ensure that if a backup is restored, the deletion request is re-processed. Maintain a “deletion blocklist” — a list of user IDs whose data must be re-deleted if a backup is restored.
  • Step 6 — Confirmation. Update the deletion request record: {status: "completed", completed_at, audit_trail: [{store: "DynamoDB", action: "anonymized", count: 1523}, {store: "Redis", action: "deleted"}, ...]}. Email the user confirming deletion. Retain the audit record (without PII) for compliance evidence.
War Story: At a European SaaS company, we received our first GDPR deletion request and discovered that the user’s data existed in 14 different data stores across 3 AWS regions, including a legacy Elasticsearch cluster that nobody had documented. The deletion took a developer two full days to trace and execute manually. After that incident, we built an automated “data map” — a registry of every store that contains PII, with documented deletion procedures for each. The automated deletion workflow reduced the process from 2 days to 45 minutes. The lesson: data mapping is a prerequisite for data deletion. You cannot delete what you cannot find.

Follow-up: How do you handle the case where one user’s short URL appears in another user’s analytics — for example, in a referrer field?

  • This is the “interconnected data” problem and it has no clean technical solution.
  • If User A created a short URL that User B clicked, User A’s deletion request should remove User A’s ownership and creation data, but User B’s click record (“I clicked link XYZ”) is User B’s data, not User A’s.
  • The practical approach: anonymize the short URL in User B’s click record. Replace the URL creator’s user ID with a hash or “deleted user.” The click event still records that User B clicked something, but it is no longer linkable to User A.
  • Legal nuance: GDPR distinguishes between “data controller” (you, the platform) and “data subject” (the user). The deletion obligation applies to the data subject’s personal data. A short URL string (e.g., “abc123”) is arguably not personal data unless it can be linked back to the user. Once you remove the link (anonymize the creator), the short URL itself is anonymous.

Follow-up: How would you build this deletion capability into the system from day one instead of retrofitting it?

  • Data tagging at write time. Every record written to any store includes a data_owner_id field. This is the user ID whose deletion would affect this record. An analytics event about User B clicking User A’s link has data_owner_id = [user_A, user_B] — it is relevant to both users’ deletion requests.
  • Centralized data catalog. Maintain a registry (a simple database table or a service like AWS DataBrew or Collibra) that lists every data store, what PII it contains, and the deletion procedure. This is the first thing auditors ask for.
  • Deletion as a first-class API. POST /api/v1/users/{id}/deletion-request is a production endpoint, not an admin script. It is tested, monitored, and covered by the same SLO as your creation endpoints. If creating a user is automated, deleting a user must be automated.
  • Retention policies by default. Every data store has a TTL or a lifecycle policy. Analytics events are retained for 90 days. Kafka topics have 7-day retention. Database records older than the retention period are archived and anonymized automatically. This means most deletion requests only need to touch data within the retention window — older data is already gone.

Question

Your notification system has a P50 latency of 50ms and a P99 latency of 500ms — a 10x ratio. The team says this is expected. Should you accept this, and what are the possible causes of such a large gap between P50 and P99?Difficulty: Senior | Concepts tested: Latency distribution analysis, tail latency causes, performance debugging methodology, when to optimize vs when to accept

The Trap

Two opposite traps. Trap 1: “10x is fine, long tails are normal in distributed systems.” Trap 2: “10x is terrible, we need to fix this immediately.” The interviewer wants nuanced reasoning, not a knee-jerk reaction.What weak candidates say:“P99 is always higher than P50. That is just how distributed systems work. As long as the average is okay, it is fine.” This dismisses a potentially serious problem without investigation.Or: “We should optimize until P99 is closer to P50.” This ignores that tail latency has diminishing returns and the effort to reduce P99 by 50% might not be worth it.What strong candidates say:Whether a 10x P50-to-P99 ratio is acceptable depends entirely on what is causing it and whether the P99 latency violates the SLO. Let me walk through the most common causes of fat tail latency, because the cause determines the action.
  • Cause 1 — Garbage collection pauses (JVM, Go GC, .NET GC). If your workers run on a garbage-collected runtime, GC pauses cause periodic latency spikes. A typical JVM full GC takes 200-500ms — exactly matching the P99. This affects ~1-2% of requests (those that happen to land during GC). Diagnosis: correlate P99 spikes with GC logs. If every P99 spike aligns with a GC event, you found the cause. Fix: tune the GC (use G1GC or ZGC for Java, reduce heap pressure), reduce object allocation rate, or switch to a non-GC runtime for latency-sensitive paths (Rust, C++). Discord migrated from Go to Rust for their message read path specifically because Go’s GC pauses were causing P99 spikes.
  • Cause 2 — Cache misses on the long tail. P50 represents the majority of requests that hit the cache (5ms). P99 represents the 1% that miss the cache and hit the database (500ms). A 99% cache hit rate means 1% of requests are 100x slower. Diagnosis: split latency metrics by cache hit vs cache miss. If cache-miss latency accounts for the entire P99 tail, that is the cause. Fix: improve cache hit rate (larger cache, better eviction policy, cache warming), or reduce cache-miss latency (faster database, pre-computed results, read replicas closer to the application).
  • Cause 3 — One slow downstream dependency. The notification system calls APNs (push), SendGrid (email), and Twilio (SMS). APNs and SendGrid respond in 50ms. Twilio occasionally takes 500ms for international SMS delivery confirmation. If 1-2% of notifications go through SMS, Twilio’s latency becomes the P99. Diagnosis: break down latency by channel. If SMS latency accounts for the tail, that is the cause. Fix: make SMS delivery asynchronous — return a response to the caller before waiting for Twilio’s delivery confirmation. Track SMS delivery status via a webhook callback instead.
  • Cause 4 — Resource contention at peak. During traffic bursts, workers compete for database connections, Redis connections, or CPU. Most requests proceed normally, but a few are delayed while waiting for a resource. This manifests as a latency spike that correlates with traffic volume. Diagnosis: overlay P99 latency with request rate. If P99 spikes correlate with traffic peaks, you have a contention problem. Fix: increase connection pool sizes, add read replicas, or implement admission control (shed load before contention causes cascading slowness).
  • Cause 5 — Retry amplification. A small percentage of requests fail on the first attempt and are retried. The retried request’s total latency includes the original attempt timeout + the retry. If the timeout is 200ms and the retry succeeds in 50ms, the total latency is 250ms+ for retried requests. If 1-2% of requests are retried, this becomes the P99. Diagnosis: check retry rates and correlate with P99. Fix: reduce the initial timeout (if it is too generous), fix the root cause of failures (so retries are not needed), or use hedged requests (send the request to two backends simultaneously, use whichever responds first).
My framework for deciding whether to act:
  1. Does the P99 violate the SLO? If the SLO is “95% of requests under 200ms” and P99 is 500ms, you are violating your SLO for 1% of users. Fix it.
  2. Is the P99 getting worse over time? A stable 10x ratio is one thing. A ratio that was 5x last month and 10x this month indicates a growing problem. Investigate.
  3. Does the P99 affect user experience? A 500ms notification delivery is imperceptible. A 500ms API response that blocks a UI render is painful. Context matters.
  4. What is the cost to improve? Reducing P99 from 500ms to 100ms might require a complete architecture change (add caching, change the database, rewrite in a different language). If the SLO is met and users are not complaining, the engineering effort is better spent elsewhere.
In my experience, a 10x P50-to-P99 ratio is common but not something you should just accept without understanding the cause. Know the cause, quantify the impact, and make a deliberate decision.War Story: At a company running an API gateway, P50 was 12ms and P99 was 800ms. Everyone said “tail latency is normal.” We investigated and discovered that P99 was caused by TLS session resumption failures. 1% of requests were hitting the gateway with an expired TLS session ticket, causing a full TLS handshake (700ms extra due to multiple round trips to an HSM for certificate signing). The fix was increasing the TLS session ticket lifetime from 5 minutes to 24 hours. P99 dropped to 50ms. A “normal” 10x ratio was actually a misconfiguration that cost millions of slow requests per day.

Follow-up: How do you measure and visualize latency distributions properly?

  • Never use averages for latency. An average of 100ms could mean all requests are 100ms, or it could mean 99% are 10ms and 1% are 10,000ms. Averages hide the tail.
  • Use histograms, not summaries. Prometheus histograms let you calculate arbitrary percentiles after the fact. Summaries calculate percentiles on the client side and cannot be aggregated across instances. If you have 50 instances, you cannot average their P99s to get the system P99 — you need the merged histogram.
  • Track P50, P90, P95, P99, P99.9. The jumps between these percentiles tell you the shape of the tail. P50=50ms, P90=60ms, P99=500ms says the tail is caused by a small fraction of very slow requests (bimodal distribution). P50=50ms, P90=200ms, P99=500ms says latency degrades gradually (likely contention-related).
  • Grafana heatmaps are the best visualization for latency distributions. They show request count at each latency bucket over time. You can visually spot bimodal distributions, GC pauses (regular spikes), and gradual degradation.

Follow-up: What is a “hedged request” and when would you use one to reduce tail latency?

  • A hedged request sends the same request to two (or more) backend servers simultaneously and uses whichever response arrives first. The slower response is discarded.
  • When to use: When your tail latency is caused by unpredictable backend slowness (e.g., one replica has a cold cache, one database node is doing a compaction). The probability that BOTH backends are slow simultaneously is much lower than either one being slow independently. If each backend has a 1% chance of being slow, the hedged request has a 0.01% chance.
  • The cost: You double the load on the backend. This is acceptable when the tail latency problem is severe and the extra load is within capacity. Google uses hedged requests extensively in their internal systems (documented in their “The Tail at Scale” paper by Jeff Dean and Luiz Barroso).
  • Optimization: Do not hedge immediately. Send the first request, wait for P50 latency (say, 50ms). If no response, send the hedge. This avoids doubling load for the 50% of requests that complete quickly.

Question

You have built the notification system as a multi-tenant SaaS platform. Tenant A (a large e-commerce company) is sending 500 million notifications per day. Tenant B (a small startup) sends 50,000 per day. Tenant A’s traffic is causing queue delays that affect Tenant B’s critical notifications. How do you architect tenant isolation?Difficulty: Staff-Level | Concepts tested: Multi-tenancy, noisy neighbor problem, resource isolation, fairness algorithms, SaaS architecture

The Trap

The obvious answer is “rate limit Tenant A.” But rate limiting does not solve the problem — it just rejects Tenant A’s traffic, which they are paying for. The real question is how to provide proportional service to all tenants without penalizing anyone.What weak candidates say:“Just rate limit per tenant. If Tenant A exceeds their quota, reject their notifications.” This treats the symptom, not the cause. Tenant A is paying for 500M notifications/day. They should get what they pay for without hurting others.What strong candidates say:The noisy neighbor problem in multi-tenant systems is one of the hardest problems in SaaS architecture because you need to simultaneously satisfy three competing goals: maximize utilization (do not waste capacity), guarantee fairness (no tenant degrades another), and honor commitments (each tenant gets what they paid for).
  • Tier 1 — Queue isolation (the most impactful change). The root cause is that all tenants share the same processing queue. Tenant A’s 500M notifications are ahead of Tenant B’s critical 2FA code in the same FIFO queue. Solution: separate queues per tenant tier. Create three queue pools:
    • Enterprise queue pool (Tenants like A with high volume): separate SQS queue per enterprise tenant. Dedicated worker fleet.
    • Standard queue pool (Mid-size tenants): shared SQS queue with weighted fair queuing. Workers process messages in round-robin across tenants, not FIFO.
    • Critical queue (All tenants, P0 notifications only): 2FA codes and security alerts from ALL tenants go to a dedicated high-priority queue with its own worker fleet. This queue is never starved by bulk traffic.
    This is the pattern Twilio uses internally. Their “Super SIM” and enterprise customers have dedicated infrastructure. Smaller customers share pooled infrastructure.
  • Tier 2 — Compute isolation. Beyond queue isolation, you need to ensure Tenant A’s workers do not starve other tenants of compute resources.
    • Per-tenant worker limits. In the shared worker pool, no single tenant can occupy more than 40% of workers at any time. Implement this as a semaphore: before a worker picks up a task, check tenant_worker_count[tenant_A]. If it exceeds the limit, skip Tenant A’s messages and pick from another tenant.
    • Kubernetes namespace isolation. For enterprise tenants, run their workers in a separate Kubernetes namespace with resource quotas (CPU limits, memory limits). A memory leak in Tenant A’s notification handler cannot OOM-kill Tenant B’s workers.
  • Tier 3 — Downstream dependency isolation. Tenant A’s 500M emails per day can exhaust your SendGrid API rate limit, blocking Tenant B’s emails.
    • Separate API keys per tenant tier. Enterprise tenants get their own SendGrid subaccounts with their own sending limits and IP reputation. Standard tenants share a pooled subaccount.
    • Per-tenant connection pools to the database. Use PostgreSQL’s max_connections per role/user. Tenant A’s queries cannot exhaust the connection pool available to other tenants.
  • Tier 4 — Weighted fair scheduling. For the shared infrastructure that remains, use a weighted fair queue algorithm. Each tenant has a weight proportional to their plan tier. Workers use the Weighted Fair Queuing (WFQ) algorithm: in each scheduling round, process messages from each tenant proportionally to their weight. Tenant A (weight 100) gets 100x more processing than Tenant B (weight 1), but Tenant B ALWAYS gets their share, even when Tenant A is flooding the system. This is conceptually similar to how network routers handle traffic from multiple flows. The CFS (Completely Fair Scheduler) in the Linux kernel does the same thing for CPU time across processes.
War Story: At an observability SaaS company processing telemetry data, one customer accidentally enabled debug-level logging and sent 50x their normal volume overnight. The shared Kafka cluster filled up, ingestion lagged for all 200+ customers, and alerting was delayed for everyone. The post-mortem led to three changes: (1) per-customer Kafka topic quotas (bytes/sec hard limit per tenant), (2) a separate Kafka cluster for the top 5 customers by volume, and (3) an anomaly detector that alerts when any customer’s traffic exceeds 3x their 7-day rolling average. The anomaly detector would have caught the debug-logging incident within 10 minutes instead of 8 hours.

Follow-up: How do you price and enforce usage tiers without making the system overly complex?

  • Token bucket per tenant. Each tenant has a token bucket with capacity and refill rate set by their plan tier. Free: 10K notifications/day. Pro: 1M/day. Enterprise: custom. The token bucket naturally handles bursts (a Pro tenant can send 10K notifications in a minute if they have accumulated tokens) while enforcing daily limits.
  • Soft limits vs hard limits. For paying tenants, use soft limits: when they exceed 80% of their quota, send a warning notification. When they exceed 100%, start queuing (not rejecting) their notifications with lower priority. Only hard-reject at 200% (likely a bug or misconfiguration). This balances fairness with customer experience.
  • Usage metering in a separate pipeline. Do not check quotas synchronously on every notification send — that adds latency. Instead, increment a counter asynchronously (in Redis or a metering service) and check it periodically (every 100 notifications or every 10 seconds). The 10-second granularity means a tenant can briefly exceed their limit before enforcement kicks in, but this is acceptable for all but the most cost-sensitive scenarios.

Follow-up: At what point do you move from multi-tenant to single-tenant (dedicated infrastructure per customer)?

  • Revenue threshold. When a customer’s annual contract value exceeds the cost of dedicated infrastructure by 3-5x, it makes economic sense to give them their own stack. A customer paying 500K/yearforaservicethatcosts500K/year for a service that costs 30K/year to run dedicated is a clear case.
  • Compliance requirements. Customers in healthcare (HIPAA), finance (SOC 2 Type II with strict data isolation), or government (FedRAMP) often require single-tenant deployments. No amount of logical isolation satisfies an auditor who requires physical isolation.
  • Performance isolation needs. If the customer’s workload is so large or unique that it would require constant tuning of the shared infrastructure, dedicate the infrastructure. The engineering time spent tuning shared infrastructure for one customer exceeds the cost of just giving them their own.
  • The hybrid approach: Most SaaS companies end up with a “cell-based architecture” — a set of cells (independent deployment units), each serving a subset of customers. Small customers share a cell. Large customers get their own cell. Each cell is identical in code and infrastructure, just different in the customer set it serves. Slack, Notion, and Figma all use this pattern. It gives you the operational simplicity of a standard deployment with the isolation guarantees of single-tenant.

Question

Consider this scenario: a user creates a short URL, immediately copies it and shares it in a group chat. A friend clicks the link 2 seconds later. In your eventually consistent system, what happens, and why do all your integration tests pass while this fails in production?Difficulty: Senior to Staff | Concepts tested: Consistency model blind spots, read-your-writes, testing vs production divergence, real-world failure patterns

The Trap

The trap is that eventual consistency “sounds right” for a URL shortener (which is read-heavy with immutable data). But there is a specific window where it breaks, and the typical testing environment cannot reproduce the failure.What weak candidates say:“Eventual consistency is fine for a URL shortener because URLs are immutable once created. There is no conflict to resolve.”This is correct in steady state. But it ignores the creation-to-first-read window.What strong candidates say:The bug is in the creation-to-first-read window, and it is a classic read-your-writes consistency violation. Let me walk through exactly what happens.
  • The sequence of events:
    1. User creates short URL s.ly/abc123 by calling the write endpoint. The API server writes to the DynamoDB primary in us-east-1. Returns success.
    2. User copies the link and pastes it in a group chat (2 seconds later).
    3. Friend in Europe clicks the link. Their request is routed to eu-west-1 (closest region).
    4. The eu-west-1 DynamoDB replica has not yet received the replication of abc123. DynamoDB cross-region replication (Global Tables) has a replication lag of typically under 1 second, but can spike to 5-10 seconds under load.
    5. The friend gets a 404 Not Found. The URL does not exist yet in their region.
  • Why tests do not catch this: Integration tests run in a single region with a single database instance. Writes are immediately visible to reads. The test creates a URL and immediately resolves it — it works every time because there is no replication lag. The production failure only occurs when the reader is in a different region than the writer, and the read happens within the replication window.
  • Why this is more common than people think: At scale, this is not a rare edge case. If your URL shortener handles 100M creates/day and the average replication lag is 500ms, and 5% of links are clicked within 2 seconds of creation, that is 5M potentially affected clicks per day. Even if only 10% of those hit a lagging replica, that is 500K 404s per day for URLs that actually exist. This is a serious production bug.
The fix has three layers:
  • Layer 1 — Read-your-writes for the creator. After creating a URL, route the creator’s subsequent requests to the same region that handled the write (session affinity or a cookie that specifies the write region). This ensures the creator can always resolve their own URLs. But this does not help the friend in Europe.
  • Layer 2 — Retry on 404 with cross-region fallback. When a URL resolution returns 404, before returning the error to the user, the edge service checks the primary region directly: “Does this URL exist in us-east-1?” If yes, serve the redirect (with slightly higher latency due to the cross-region call). If no, return a genuine 404. This adds one network hop for the rare case of a replication-lagging read, but eliminates false 404s entirely. This is similar to what DynamoDB Global Tables does with the ReplicaReadMode: STRONG setting when available, or what you can implement at the application level.
  • Layer 3 — Write-through to all regions. For the URL creation path (which is low-volume compared to reads), synchronously write to all regions before returning success to the user. This guarantees the URL is available everywhere before the creator shares it. The trade-off: create latency increases from 50ms (single-region write) to 200ms (multi-region synchronous write). For a URL shortener, this is acceptable because URL creation is infrequent and not latency-sensitive.
War Story: This exact bug happened at a company operating a link shortener used by their marketing team. Marketing created campaign links and immediately posted them on Twitter. Thousands of users clicked within seconds. 3-5% got 404 errors during the first 10 seconds, creating a terrible first impression for the campaign. The fix was Layer 2 — a cross-region fallback on 404. They deployed it in a day, and 404 rates for newly created URLs dropped to zero.The broader lesson: eventual consistency is a correctness guarantee about the steady state, not about the transition period. The transition period is where the bugs hide. Every system that uses eventual consistency needs to explicitly handle the window between write and convergence.

Follow-up: How would you test for this in your CI/CD pipeline?

  • Simulated replication lag in integration tests. Add a test mode where writes to the “primary” test database are delayed before being visible to the “replica” test database. Use a proxy like Toxiproxy or a test double that introduces artificial read delays. Your test creates a URL, waits 0ms, and tries to resolve it against the “replica.” It should get a 404, which triggers the cross-region fallback.
  • Contract test: The URL creation endpoint’s response should include a header like X-Write-Region: us-east-1. The resolution endpoint, on receiving a 404, should check for a X-Fallback-Region query parameter. The contract test verifies this fallback behavior.
  • Canary in production. Deploy a canary that creates a URL in region A and immediately tries to resolve it in region B. Measure the time until the resolution succeeds. Alert if this exceeds 5 seconds. This is a real-time consistency monitor.

Follow-up: Are there other places in the six system designs where eventual consistency creates hidden bugs?

  • News feed — ghost posts. A user unfollows someone, but the follow graph has not propagated to the fan-out workers yet. The unfollowed user’s next post still gets fanned out to the unfollower’s feed. The user sees a post from someone they just unfollowed — a creepy experience.
  • Notification preferences — unwanted notifications. A user disables email notifications. The preference update has not propagated to the notification worker’s cache. The user receives one more email. This is not just a UX issue — for marketing emails, it can be a CAN-SPAM violation.
  • Task scheduler — duplicate execution. A task’s status is updated to COMPLETED in the primary database, but the scheduler reads from a replica that still shows RUNNING. The scheduler re-enqueues the task. Without idempotency, the task executes twice.
In every case, the pattern is the same: a write happens, and a subsequent read hits a stale replica. The fix is always one of: read-your-writes consistency, synchronous replication for critical paths, or application-level idempotency to tolerate the duplicate.
Strong Answer Framework:Step 1 - Separate choices by their decay curve: Not every decision ages at the same rate. I classify choices into three tiers: (a) structural decisions like data model, consistency boundaries, and tenancy architecture that are effectively irreversible after year two because every downstream system encodes their assumptions; (b) infrastructure decisions like queue brokers, databases, and compute substrate that are replaceable but expensive, with a typical half-life of 5-7 years; (c) tactical decisions like framework versions, serialization formats within a service, and observability vendors, which churn every 2-3 years and should be optimized for team velocity today. For a 10-year payments system, 80% of the review time goes to tier (a) because tier (c) will be rewritten anyway.Step 2 - Pick technologies with paid-off roadmaps, not hype curves: I look for a multi-decade track record, an open standard or multi-vendor escape hatch, and a slow-moving specification. For payments this points to PostgreSQL over trendier NewSQL options for the ledger, ISO 20022 for message formats because regulators are standardizing on it through 2030, and a pluggable KMS abstraction because HSM vendors churn. The filter question is “if this company disappeared tomorrow, how much of my system dies with it?” If the answer is nontrivial, I either wrap the dependency behind a port or pick a more boring substitute.Step 3 - Design explicit seams for the things you cannot predict: I do not try to predict 2036. I try to make the 2036 team able to replace any individual component without a full rewrite. That means: an event log (ideally append-only, persisted outside the primary DB) so future systems can rebuild state; strict schema evolution rules (additive-only, deprecation windows measured in years, not weeks); versioned public APIs with sunset timelines in contract; and a policy that any encrypted data at rest uses an envelope scheme so crypto primitives can rotate. The goal is not predicting change. It is making change cheap.Real-World Example: Stripe has been public about choosing PostgreSQL for its core ledger despite having the engineering capacity to build something custom; their reasoning, shared in multiple 2021-2023 talks, is that the operational knowledge around Postgres compounds over decades and they did not want to be the world’s leading expert in an obscure database. In contrast, Uber’s 2016 migration from Postgres to Schemaless (described in their public engineering blog) is cited as a cautionary tale internally because Schemaless required a dedicated team to keep it alive - a tax Stripe was unwilling to pay for a decade-scale system.Senior Follow-up Questions:
  • “The CTO asks you to pick between a managed cloud database and self-hosted PostgreSQL for the ledger, with a 10-year horizon. What do you choose?” - Strong answer: Self-hosted on the cloud provider’s raw disks or dedicated instances, because managed-DB feature deprecation and forced major-version upgrades on the vendor’s schedule are the single biggest source of “forced rewrite” surprises over a decade; you trade operational effort for schedule sovereignty.
  • “How do you handle the case where a vendor you bet on ‘pivots’ away from your use case three years in?” - Strong answer: Every third-party dependency enters a quarterly review; if the vendor’s roadmap drifts from yours, you pre-position a replacement behind the same port and do a rotating canary over 12-18 months rather than waiting for the forcing function.
  • “How do you convince a product org that is pushing for quarterly features to fund the 10-year hygiene work?” - Strong answer: Translate hygiene debt into a concrete forecasted incident rate and a forecasted migration cost; the conversation changes from “tech debt” to “a projected 3-month company-wide outage window in year 6 that we can avoid by spending 10% of eng capacity now.”
Common Wrong Answers:
  • “I would use the newest and most scalable option for everything to be future-proof.” - confuses novelty with longevity; newest tech has the shortest track record and the highest chance of being abandoned.
  • “Design for 10x scale from day one.” - optimizes the wrong axis; 10-year survivability is about change tolerance, not peak throughput, and premature scale architecture often ossifies the very seams you will need to move later.
Further Reading:
  • Building Evolutionary Architectures by Neal Ford, Rebecca Parsons, Patrick Kua
  • Stripe Engineering blog: “Online migrations at scale” (2017)
  • Related chapter in this series: Design Patterns, section on Strangler Fig and Hexagonal Architecture
Strong Answer Framework:Step 1 - Distinguish ‘the architecture is wrong’ from ‘the code is bad’: The most important triage is whether the pain is structural or surface. Structural signals that lean toward rewrite: the current data model cannot represent a new required concept without a schema overhaul; the concurrency model (e.g., a single-threaded event loop fronting a sync-only ORM) hard-caps throughput; the deployment unit cannot be split no matter how you factor the code. Surface signals that lean toward refactor: poor test coverage, tangled call graphs, outdated framework versions, slow CI. Surface problems are almost always cheaper to fix in place because a rewrite inherits the same business complexity and re-learns the same edge cases the hard way.Step 2 - Price both options honestly, including the opportunity cost curve: A rewrite’s true cost is rarely the engineering time; it is the freeze on the old system during the rewrite and the divergence risk where production keeps changing under the rewrite team. I budget a rewrite at 2-3x the initial estimate and a 12-24 month feature freeze on the old path. A refactor costs less peak effort but requires sustained discipline across many quarters and is easy to abandon when priorities shift. I also price the reversibility: a refactor gone wrong is recoverable in days, a rewrite gone wrong is a career-defining outage story.Step 3 - Run a 2-4 week de-risking spike before committing: Before either option gets a roadmap slot, I fund a spike: rebuild the single hottest hot path (e.g., the authorization check) as a vertical slice using the proposed target architecture, run it in shadow mode against production traffic, and measure. If the spike reveals that the new architecture is actually 30% slower or has a subtle semantic difference the team did not anticipate, the rewrite dies cheaply. If the spike nails it, the rewrite has evidence, not just faith. In practice this spike also reveals the hidden business logic that nobody remembered was in the old system - that revelation alone often tips the decision back toward Strangler Fig.Real-World Example: Basecamp / 37signals publicly rewrote their product multiple times (Basecamp Classic to Basecamp 2 to Basecamp 3) and DHH has written about the tax: each rewrite shed years of accumulated edge cases that turned out to be important, and customer migration took longer than the rewrite itself. In contrast, GitHub’s incremental Rails upgrade saga (documented in their 2018-2020 engineering blog posts about going from Rails 3 to Rails 6) kept the existing system running while pulling it forward - slower in absolute terms, but zero outage budget spent.Senior Follow-up Questions:
  • “What if the rewrite is driven by a single senior engineer who wants to use a new language they are excited about?” - Strong answer: That is a red flag, not a reason; language choice should fall out of constraints (team skills, ecosystem, operational fit), not drive them. I would make them write the decision memo with explicit alternatives considered, which usually reveals whether it is an engineering decision or a resume decision.
  • “You chose refactor, but two years in the team is burned out and morale is collapsing. Do you switch to a rewrite?” - Strong answer: Probably yes, but the real question is whether the refactor plan was wrong or whether leadership failed to protect capacity; switching tactics without understanding which of those it was just teleports the same failure mode into the new plan.
  • “How do you frame this decision for a non-technical CEO who just wants to know ‘is it worth it?’” - Strong answer: In terms of customer-visible outcomes and risk bands - “refactor is 18 months of gradual P99 improvement with low outage risk; rewrite is 30 months of flat performance then a step function with meaningful outage risk on cutover.”
Common Wrong Answers:
  • “Always refactor, never rewrite.” - ignores the case where the data model or concurrency model is the problem and no amount of refactoring will fix it.
  • “Rewrite in a new language because the old one is unfashionable.” - confuses aesthetics with engineering; the cost is paid by the future team, not the current one.
Further Reading:
  • Working Effectively with Legacy Code by Michael Feathers
  • Joel Spolsky: “Things You Should Never Do, Part I” (the original Netscape rewrite essay, 2000)
  • Related chapter in this series: Design Patterns, section on Strangler Fig Pattern