
System Design Practice Problems

These five problems are structured using the Answer Framework: Clarify, Estimate, Design, Deep Dive, and Trade-Offs. Each solution demonstrates the thinking process that separates senior engineers from mid-level candidates.
How to use this guide: Do not read the solutions immediately. For each problem, spend 30-45 minutes sketching your own design first. Then compare your approach against the guided solution. The goal is to internalize the framework, not memorize answers.
Common mistake: Candidates jump straight to drawing boxes. Interviewers want to see you think before you draw. The first 10 minutes of clarifying requirements and estimating scale are what set senior candidates apart.

Problem 1: Design a URL Shortener

Difficulty: Easy | Time Target: 35 minutes

Questions You Should Ask the Interviewer

Category | Question | Why It Matters
Core Features | Do we need custom aliases (e.g., short.ly/my-brand)? | Changes the uniqueness and validation logic
Expiration | Should URLs expire? User-configurable TTL? | Affects storage strategy and cleanup jobs
Analytics | Do we need click tracking (geo, referrer, device)? | Adds an entire analytics pipeline
Scale | How many URLs are created per day? Read vs write ratio? | Drives every downstream architecture decision
Availability | Is it acceptable for a redirect to fail occasionally? | Determines replication and consistency model

Requirements You Should Lock Down

Functional:
  • Given a long URL, generate a unique short URL
  • Given a short URL, redirect (HTTP 301/302) to the original
  • Optional: custom aliases, expiration, click analytics
Non-Functional:
  • High availability (redirects must not fail)
  • Low latency (redirect < 100ms)
  • Short URLs should not be guessable (no sequential IDs)
Senior move: Ask whether the redirect should be 301 (permanent, browser caches it) or 302 (temporary, every request hits your server). This one question shows you understand HTTP semantics AND their impact on analytics — if you use 301, the browser caches the redirect and you lose click data.

Back-of-Envelope Calculation

Assumptions:
  New URLs created per day:     100M
  Read-to-write ratio:          100:1
  Average URL length:           500 bytes
  Short URL length:             7 characters
  Data retention:               5 years

Write (Create):
  100M / 86,400 seconds         ~ 1,160 writes/sec
  Peak (3x):                    ~ 3,500 writes/sec

Read (Redirect):
  100:1 ratio                   ~ 116,000 reads/sec
  Peak (3x):                    ~ 350,000 reads/sec

Storage (5 years):
  100M/day x 365 x 5            = 182.5 billion URLs
  Each record ~ 600 bytes       = ~110 TB total
  (short_url + long_url + metadata)

Short URL keyspace:
  7 chars, Base62 (a-z A-Z 0-9) = 62^7 = 3.5 trillion
  182.5 billion << 3.5 trillion  -> 7 chars is sufficient

Cache (80/20 rule):
  20% of URLs generate 80% of traffic
  Daily read requests: 100M x 100 = 10B
  Cache 20% of daily unique URLs:
  100M x 0.2 x 600 bytes        = ~12 GB (fits in memory)
Why this matters: The numbers reveal this is a read-heavy system. That single insight drives every design choice — caching strategy, database read replicas, CDN usage. If you skip estimation, you design blind.
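The arithmetic above can be sanity-checked in a few lines; a minimal sketch using the assumptions stated above:

```python
SECONDS_PER_DAY = 86_400

new_urls_per_day = 100_000_000
read_write_ratio = 100
record_bytes = 600        # short_url + long_url + metadata
retention_years = 5

writes_per_sec = new_urls_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_write_ratio
total_urls = new_urls_per_day * 365 * retention_years
storage_tb = total_urls * record_bytes / 1e12
keyspace = 62 ** 7        # 7-character Base62 keys

print(f"writes/sec: {writes_per_sec:,.0f}")  # ~1,157
print(f"reads/sec:  {reads_per_sec:,.0f}")   # ~115,741
print(f"storage:    {storage_tb:.1f} TB")    # 109.5 TB
print(f"keyspace:   {keyspace:,}")           # 3,521,614,606,208
```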

Component Architecture

                           ┌──────────────┐
                           │   Clients    │
                           └──────┬───────┘

                           ┌──────▼────────┐
                           │ Load Balancer │
                           └──────┬────────┘

                    ┌─────────────┼─────────────┐
                    │             │             │
             ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
             │  API Server │ │  ...  │ │  API Server │
             └──────┬──────┘ └───┬───┘ └──────┬──────┘
                    │            │             │
                    └────────────┼─────────────┘

                  ┌──────────────┼──────────────┐
                  │              │              │
            ┌──────▼──────┐ ┌────▼────┐ ┌──────▼──────┐
            │ Redis Cache │ │   DB    │ │  Analytics  │
            │ (read-thru) │ │ (write) │ │   (Kafka)   │
            └─────────────┘ └────┬────┘ └─────────────┘

                         ┌──────▼──────┐
                         │ DB Read     │
                         │ Replicas    │
                         └─────────────┘

API Design

POST /api/v1/urls
  Body: { "long_url": "https://...", "custom_alias": "my-brand", "ttl": 3600 }
  Response: { "short_url": "https://short.ly/aB3x7Kp" }

GET /{short_url_key}
  Response: HTTP 302 Redirect → Location: https://original-long-url.com

GET /api/v1/urls/{short_url_key}/stats
  Response: { "clicks": 10432, "created_at": "...", "last_accessed": "..." }

The Core Problem: How to Generate a Unique 7-Character Key?

This is the most important design decision. Three approaches:

Approach A: MD5/SHA256 Truncation

Hash the long URL, take the first 7 characters (Base62 encoded).
Pros: Deterministic (same URL = same hash), no coordination needed.
Cons: Collision risk with truncation. Must check the DB and re-hash on collision (append a counter, re-hash).
hash("https://example.com/very-long-path")
  → MD5: "a3f2b8c9d1e4..."
  → Base62 first 7: "aB3x7Kp"
  → Collision? Append counter: hash("https://...&1") → retry
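A minimal Python sketch of this approach (the helper name, the alphabet ordering, and the 64-bit digest prefix are illustrative choices, not a specific library API):

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def short_key(long_url: str, salt: int = 0, length: int = 7) -> str:
    """Hash the URL (plus a retry salt) and Base62-encode a digest prefix."""
    data = f"{long_url}&{salt}" if salt else long_url
    digest = hashlib.md5(data.encode()).digest()
    n = int.from_bytes(digest[:8], "big")  # 64 bits is plenty for 7 chars
    chars = []
    for _ in range(length):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(chars)

key = short_key("https://example.com/very-long-path")
assert len(key) == 7
assert key == short_key("https://example.com/very-long-path")   # deterministic
retry = short_key("https://example.com/very-long-path", salt=1)  # on collision
```

Even with the salt, a truncated hash must still be checked against the database; the salt only changes the key after a collision is detected.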

Approach B: Base62 Counter (Pre-Generated)

Use a global counter (or counter service) and convert the value to Base62.
Pros: Zero collisions, predictable, simple.
Cons: Sequential = guessable. The counter is a single point of failure. Solution: allocate ranges to each server (Server 1 gets 1 to 1M, Server 2 gets 1M+1 to 2M).
counter = 1000000
Base62(1000000) = "4c92"  → pad to 7 chars: "0004c92"

Approach C: Key Generation Service (KGS) -- Recommended

Pre-generate random 7-character keys in a separate database. Application servers fetch keys in batches.
Pros: No collisions at runtime, no coordination between app servers, keys are random (not guessable).
Cons: Extra service to maintain. Keys must be marked as “used” atomically.
KGS DB:
┌────────────┬────────┐
│    key     │  used  │
├────────────┼────────┤
│  aB3x7Kp   │ false  │
│  Zk9mW2q   │ true   │
│  ...       │  ...   │
└────────────┴────────┘

App Server starts → fetches 1000 keys from KGS → uses from local cache
When local cache runs low → fetch another batch
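The batch flow above can be sketched in-process (the in-memory pool is a stand-in for the KGS database; a real deployment would claim each batch atomically, e.g. with a transactional UPDATE or a Redis SPOP):

```python
import secrets
import string
from collections import deque

B62 = string.digits + string.ascii_letters

class KeyGenerationService:
    """Pre-generates random 7-character keys and hands them out in batches."""
    def __init__(self, pool_size: int = 10_000):
        self._unused = {"".join(secrets.choice(B62) for _ in range(7))
                        for _ in range(pool_size)}

    def claim_batch(self, n: int) -> list:
        # Popping from the pool marks the keys "used" in the same step;
        # a real KGS must make this claim atomic
        return [self._unused.pop() for _ in range(min(n, len(self._unused)))]

class AppServer:
    def __init__(self, kgs: KeyGenerationService, batch_size: int = 1_000):
        self._kgs, self._batch_size = kgs, batch_size
        self._local = deque()

    def next_key(self) -> str:
        if not self._local:  # local cache ran dry → fetch another batch
            self._local.extend(self._kgs.claim_batch(self._batch_size))
        return self._local.popleft()

kgs = KeyGenerationService()
server = AppServer(kgs)
keys = {server.next_key() for _ in range(500)}
assert len(keys) == 500           # every key handed out exactly once
assert len(server._local) == 500  # 1,000 claimed, 500 still cached locally
```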
Collision handling matters: If you choose MD5 truncation, you MUST explain your collision resolution strategy. “Just re-hash” is not enough — what if the retry also collides? Use a bloom filter for fast existence checks before hitting the DB.
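The bloom filter pre-check can be sketched with nothing but hashing (the bit-array size and hash count below are illustrative; production sizing derives from the expected key count and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely absent' or 'possibly present'."""
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("aB3x7Kp")
assert bf.might_contain("aB3x7Kp")  # no false negatives, ever
# A negative answer means the key is definitely unused: skip the DB check
```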

Database Choice

Criteria | SQL (PostgreSQL) | NoSQL (DynamoDB/Cassandra)
Schema | Fixed, well-defined | Flexible but unnecessary here
Read pattern | Key lookup (fast with index) | Key lookup (native strength)
Write pattern | Single row insert | Single row insert
Scale | Sharding is manual | Auto-sharding built-in
Consistency | Strong by default | Eventual (configurable)
Verdict | Good for < 1B records | Better for 100B+ records
Recommendation: NoSQL (DynamoDB or Cassandra). The access pattern is simple key-value lookup at massive scale. No joins, no complex queries. This is exactly what NoSQL was built for.

Caching Strategy

  • Cache-aside with Redis: Check cache first, on miss read from DB and populate cache
  • TTL: Set cache TTL to match URL expiration (or 24 hours for non-expiring URLs)
  • Eviction: LRU — frequently accessed URLs stay hot, long-tail URLs get evicted
  • Cache size: ~12 GB for 20% of daily URLs (fits a single Redis instance)
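The read path can be sketched as follows (the `LRUCache` class and the `db` dict are stand-ins; in production these would be a Redis client and the DynamoDB/Cassandra driver):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU stand-in for Redis running with an allkeys-lru eviction policy."""
    def __init__(self, capacity: int):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # mark as recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used

db = {"aB3x7Kp": "https://example.com/very-long-path"}  # stand-in URL table
cache = LRUCache(capacity=1000)

def resolve(short_key):
    url = cache.get(short_key)          # 1. check the cache first
    if url is None:
        url = db.get(short_key)         # 2. on miss, read from the DB
        if url is not None:
            cache.set(short_key, url)   # 3. populate the cache for next time
    return url

assert resolve("aB3x7Kp") == "https://example.com/very-long-path"  # miss → DB
assert resolve("aB3x7Kp") == "https://example.com/very-long-path"  # hit → cache
```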

Trade-Off Summary

Decision | Choice Made | Alternative | Why
Hash method | KGS (pre-generated keys) | MD5 truncation | Zero runtime collisions, no coordination overhead
Database | NoSQL (DynamoDB) | PostgreSQL | Simple key-value at extreme scale
Redirect code | 302 (temporary) | 301 (permanent) | Preserves analytics; 301 causes browser caching
Cache | Redis cache-aside | Write-through | Read-heavy system benefits from cache-aside simplicity
Short URL length | 7 characters | 6 characters | 3.5T keyspace vs 56B; future-proofing is cheap
1. You asked about 301 vs 302

This shows you understand HTTP redirect semantics and their downstream impact on analytics.
2. You calculated before designing

The 100:1 read-write ratio drove the cache-first architecture. Without estimation, you might have over-engineered writes.
3. You compared three hash approaches with trade-offs

Juniors pick one approach. Seniors present options, compare them, and justify their choice.
4. You addressed the analytics pipeline separately

Writing click events to Kafka for async processing shows you know not to block the redirect path with analytics writes.
5. You mentioned bloom filters for collision detection

This demonstrates awareness of probabilistic data structures and performance optimization at scale.

Bitly’s Architecture

Bitly processes billions of shortens and redirects. Their architecture reflects a decade of scaling lessons:
  • Key generation: Bitly uses a counter-based approach with Base62 encoding, allocating counter ranges to application servers to avoid coordination overhead. Each server gets a block of IDs and increments locally, only requesting a new block when the current one is exhausted.
  • Storage: They migrated from MySQL to a distributed data store. Their link mapping data is sharded by the hash of the short URL key, ensuring even distribution across nodes.
  • Caching: Bitly uses a multi-layer caching strategy. Popular links (the top 1-2% by click volume) are cached in memory on the application servers themselves. A shared Redis layer handles the next tier. The database is the fallback of last resort. Given the extreme read-to-write ratio, cache hit rates exceed 99%.
  • Analytics pipeline: Click data flows through Kafka into a real-time processing pipeline. They decouple the redirect path (which must be sub-100ms) from analytics aggregation entirely. Click events are enriched asynchronously with geo, device, and referrer data.
  • Global presence: Bitly uses Anycast DNS and edge servers in multiple regions to ensure redirects are fast worldwide. A redirect request in Tokyo does not need to round-trip to a US data center.

TinyURL’s Original Design

TinyURL, one of the earliest URL shorteners (launched around 2002), took a simpler approach that is instructive as a baseline:
  • Single MySQL database: All short-to-long URL mappings lived in a single MySQL instance with a B-tree index on the short key. For years, this was sufficient because the total dataset was small enough to fit in RAM.
  • Sequential IDs: TinyURL used auto-incrementing MySQL IDs converted to a short alphanumeric string. This is the simplest possible approach and works until you need non-guessable URLs or multi-datacenter writes.
  • No analytics: The original TinyURL had no click tracking at all. This dramatically simplified the architecture — a redirect was a single database lookup and an HTTP 301 response.
  • Lesson: TinyURL proves that the simplest possible design can serve millions of users if the access pattern is simple enough (key-value lookup). The complexity in Bitly’s architecture exists because of analytics, custom domains, enterprise features, and global scale — not because URL shortening itself is hard.
The takeaway for interviews: Start with TinyURL’s simplicity, then explain how you would evolve it toward Bitly’s architecture as requirements grow. This “simple first, scale when needed” narrative is exactly what interviewers want to see.

Problem 2: Design a Rate Limiter

Difficulty: Medium | Time Target: 40 minutes

Questions You Should Ask the Interviewer

Category | Question | Why It Matters
Scope | Client-side or server-side rate limiting? | Client-side is unreliable; focus on server-side
Granularity | Per-user, per-API endpoint, per-IP, or global? | Determines key structure in the rate limit store
Response | Should we return rate limit headers (X-RateLimit-*)? | Industry standard; shows HTTP expertise
Throttling | Hard limit (reject) or soft limit (queue/slow down)? | Changes architecture from reject to backpressure
Distributed | Single server or distributed across multiple servers? | Single-server is trivial; distributed is the real problem
Rules | Who configures the rate limit rules? Dynamic or static? | Affects whether you need a rules engine or a config file

Requirements You Should Lock Down

Functional:
  • Rate limit requests based on configurable rules (user ID, IP, API key)
  • Support multiple rate limit tiers (free: 10 req/min, pro: 1000 req/min)
  • Return proper HTTP 429 with Retry-After header when limit is exceeded
  • Rate limit rules can be updated without redeployment
Non-Functional:
  • Very low latency (added overhead < 1ms per request)
  • Must work across distributed servers (shared state)
  • Highly available — if the rate limiter fails, requests should pass through (fail-open)
Senior move: Ask about fail-open vs fail-closed behavior. Fail-open (allow requests if limiter is down) prevents the rate limiter from becoming a single point of failure. Fail-closed (block all requests if limiter is down) is safer against abuse but risky for availability.

Back-of-Envelope Calculation

Assumptions:
  Active users:                 10M
  Average requests per user:    500/day
  Total requests/day:           5 billion
  Peak QPS:                     ~175,000 req/sec (3x average)

Rate limiter overhead:
  Each check = 1 Redis round-trip  ~ 0.5ms
  At 175K QPS:                     ~ 175K Redis ops/sec
  Redis single instance handles:   ~ 100K ops/sec
  → Need 2-3 Redis instances (or Redis Cluster)

Storage per user (sliding window):
  Key: "user:12345:api:/search"     ~ 50 bytes
  Value: sorted set of timestamps   ~ 800 bytes (100 entries x 8 bytes)
  Total per user (10 endpoints):    ~ 8.5 KB
  10M users:                        ~ 85 GB
  → Fits in a Redis cluster

Where to Place the Rate Limiter

Option A: API Gateway (Recommended for most cases)
  Client → API Gateway [Rate Limiter] → App Servers

Option B: Middleware in Application
  Client → App Server → [Rate Limiter Middleware] → Handler

Option C: Sidecar (Service Mesh)
  Client → Envoy Proxy [Rate Limit] → App Server

Component Architecture

  Client Request


  ┌──────────────┐     ┌──────────────────┐
  │ API Gateway  │────▶│   Rules Engine   │
  │ (rate check) │     │  (config store)  │
  └──────┬───────┘     └──────────────────┘

    ┌────▼────┐
    │  Redis  │  ← Centralized rate limit state
    │ Cluster │
    └────┬────┘

    Pass │ Reject
    ┌────▼────┐  ┌─────────────────┐
    │  App    │  │  HTTP 429       │
    │ Servers │  │  Retry-After: X │
    └─────────┘  └─────────────────┘

Rate Limit Headers (HTTP 429 and Retry-After per RFC 6585; X-RateLimit-* are a de facto standard)

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531200
Retry-After: 30
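A handler that emits these headers can be sketched as follows (framework-agnostic; the returned dict is a stand-in for a real HTTP response object):

```python
import time

def rate_limit_response(limit: int, reset_epoch: int, now=None) -> dict:
    """Build an HTTP 429 response carrying the standard rate limit headers."""
    now = time.time() if now is None else now
    retry_after = max(0, int(reset_epoch - now))
    return {
        "status": 429,
        "headers": {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(reset_epoch),
            "Retry-After": str(retry_after),
        },
    }

resp = rate_limit_response(limit=100, reset_epoch=1672531200, now=1672531170)
assert resp["headers"]["Retry-After"] == "30"
```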

Comparing Four Rate Limiting Algorithms

Algorithm 1: Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Increment a counter for each request. Reject when counter exceeds limit.
Window: 12:00:00 - 12:01:00 → counter = 87/100
Window: 12:01:00 - 12:02:00 → counter = 0/100 (reset)
Pros: Simple, low memory (one counter per window).
Cons: Burst at the window boundary. A user can send 100 requests at 12:00:59 and 100 more at 12:01:01, effectively 200 requests in 2 seconds.

Algorithm 2: Sliding Window Log

Store the timestamp of every request in a sorted set. For each new request, remove timestamps outside the window, count remaining.
Redis ZSET: user:123 → [12:00:01, 12:00:05, 12:00:33, ...]
New request at 12:01:10:
  Remove all timestamps < 12:00:10
  Count remaining: 45
  If 45 < 100 → ALLOW, add 12:01:10
Pros: Precise, no boundary burst problem.
Cons: High memory (stores every timestamp). At 100 req/min per user with 10M users, that is billions of timestamps.
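An in-process sketch of the log approach (in production the log would live in a Redis ZSET, pruned with ZREMRANGEBYSCORE and counted with ZCARD; a plain list and an explicit clock keep this testable):

```python
class SlidingWindowLog:
    def __init__(self, limit: int, window_sec: float):
        self.limit, self.window = limit, window_sec
        self.log = []  # one log per key in a real store

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the window
        self.log = [t for t in self.log if t > now - self.window]
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

rl = SlidingWindowLog(limit=3, window_sec=60)
assert all(rl.allow(t) for t in (0, 1, 2))  # first three requests pass
assert not rl.allow(3)                      # fourth within the window rejected
assert rl.allow(61.5)                       # old entries expired → allowed again
```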

Algorithm 3: Sliding Window Counter (Recommended)

Hybrid of fixed window + sliding window. Maintain counters for the current and previous windows. Weight the previous window by overlap percentage.
Previous window (12:00-12:01): 84 requests
Current window  (12:01-12:02): 36 requests
Current time:   12:01:15 → 25% into current window

Weighted count = 84 * 0.75 + 36 = 99
Limit = 100 → ALLOW (barely)
Pros: Low memory (just two counters), smooth, no boundary burst.
Cons: An approximation (not exact), but in practice the error is under 1%.
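The weighted calculation can be sketched with two counters and an explicit clock (field names are illustrative; a distributed version would keep the two counters in Redis):

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_sec: int):
        self.limit, self.window = limit, window_sec
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window_start = 0

    def allow(self, now: float) -> bool:
        window_start = int(now // self.window) * self.window
        if window_start != self.curr_window_start:
            # Roll windows forward; if a whole window was skipped, prev is 0
            self.prev_count = (self.curr_count
                               if window_start - self.curr_window_start == self.window
                               else 0)
            self.curr_count = 0
            self.curr_window_start = window_start
        overlap = 1 - (now - window_start) / self.window
        weighted = self.prev_count * overlap + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False

# Reproduce the worked example: 84 in the previous minute, 36 so far,
# 15 seconds (25%) into the current window → 84*0.75 + 36 = 99 < 100 → allow
rl = SlidingWindowCounter(limit=100, window_sec=60)
rl.prev_count, rl.curr_count, rl.curr_window_start = 84, 36, 60
assert rl.allow(75)         # weighted count 99 → allowed (barely)
assert rl.curr_count == 37
```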

Algorithm 4: Token Bucket

Each user has a bucket that fills with tokens at a fixed rate. Each request consumes one token. If bucket is empty, reject.
Bucket capacity: 100 tokens
Refill rate:     100 tokens/minute
Current tokens:  23

Request arrives → 23 > 0 → ALLOW, tokens = 22
Pros: Allows controlled bursts (up to bucket capacity). Simple. Used by AWS API Gateway and Stripe.
Cons: Two parameters to tune (capacity and refill rate). Requires per-user state.
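A minimal in-process sketch with an explicit clock (parameter names are illustrative; a distributed version would keep `tokens` and `last` in Redis and refill inside a Lua script):

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # bucket starts full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
assert all(bucket.allow(0) for _ in range(3))  # burst up to capacity
assert not bucket.allow(0)                     # bucket empty → reject
assert bucket.allow(1.0)                       # one token refilled after 1s
```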

Distributed Rate Limiting: The Hard Part

The real challenge is making this work across multiple servers.
Problem: If you have 10 API servers, each checking Redis independently, race conditions cause over-counting or under-counting.
Solution: a Lua script in Redis (atomic operations).
-- Atomic fixed-window rate limit check in Redis
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- REJECTED
else
    return 1  -- ALLOWED
end
Clock skew in distributed systems: If your API servers have slightly different clocks, fixed window boundaries differ across servers. Mitigations: (1) Use Redis server time, not application server time. (2) Use NTP to keep clocks synchronized within 10ms. (3) The sliding window counter algorithm is inherently more tolerant of small clock differences.

Trade-Off Summary

Decision | Choice Made | Alternative | Why
Algorithm | Sliding window counter | Token bucket | Better precision with minimal memory; token bucket is also excellent for APIs needing burst tolerance
Placement | API Gateway | Application middleware | Centralized enforcement, language-agnostic, single place to configure rules
State store | Redis Cluster | Local in-memory | Must work across all servers; local state gives inconsistent enforcement
Failure mode | Fail-open | Fail-closed | Availability > abuse protection; combine with secondary detection for abuse
Atomicity | Redis Lua scripts | Distributed locks | Lua scripts are atomic in Redis, simpler than external locks, sub-millisecond
1. You compared four algorithms with concrete trade-offs

Not just naming them, but explaining boundary burst problems, memory implications, and why the sliding window counter wins for most cases.
2. You addressed the distributed race condition explicitly

The Lua script for atomic Redis operations shows you have dealt with real concurrency problems.
3. You discussed fail-open vs fail-closed

This is an operational concern that junior candidates never raise. It shows you think about failure modes, not just the happy path.
4. You mentioned clock skew

Distributed systems fundamentals. This single detail signals real-world experience with systems that span multiple servers.
5. You designed rate limit headers into the API response

Following RFC 6585 and including Retry-After shows you care about API ergonomics and client experience.

Cloudflare’s Rate Limiting at Scale

Cloudflare handles tens of millions of HTTP requests per second across 300+ data centers. Their rate limiting operates at a scale that most engineers never encounter:
  • Edge-first enforcement: Rate limiting happens at Cloudflare’s edge servers, not at a centralized service. Each data center runs its own rate limiting logic. This is critical because sending every request to a central rate limiter would add unacceptable latency and create a single point of failure.
  • Sliding window counters: Cloudflare uses a sliding window approach similar to Algorithm 3 above. They chose this because it provides good accuracy with minimal memory per client, which matters when you are tracking millions of unique IPs and API keys simultaneously.
  • Approximate counting across data centers: Since rate limiting is distributed, a user hitting multiple Cloudflare data centers (due to Anycast routing) could theoretically exceed their global limit. Cloudflare handles this with periodic synchronization between edge nodes — they trade perfect accuracy for speed. The result is rate limits that are “approximately correct” within a small margin, which is acceptable for virtually all use cases.
  • Configurable actions: When a rate limit is exceeded, Cloudflare supports multiple responses: block (HTTP 429), challenge (show a CAPTCHA), JS challenge (verify browser), or simulate (log only, for testing rules before enforcement). This flexibility is a great interview detail to mention when discussing rate limiter behavior.
  • IP reputation scoring: Beyond simple rate counting, Cloudflare layers in threat intelligence. An IP with a known bad reputation gets a lower effective rate limit. This hybrid approach (rate limiting + reputation) is more effective than either technique alone.

Stripe’s Rate Limiting API

Stripe’s approach to rate limiting is worth studying because they operate from the API provider’s perspective — they must protect their own infrastructure while giving developers a good experience:
  • Token bucket algorithm: Stripe uses a token bucket implementation for their API rate limits. Each API key gets a bucket with a defined capacity (e.g., 100 requests per second for live mode). Tokens refill at a steady rate. This allows short bursts (useful when a merchant processes a batch of charges) while enforcing a sustained rate limit.
  • Tiered limits by endpoint: Not all API endpoints have the same rate limit. Creating a payment intent might be limited to 100/sec, while listing events could allow 1000/sec. Stripe separates read-heavy endpoints from write-heavy ones because the cost to Stripe’s backend is different.
  • Graceful headers: Stripe returns RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers on every response (not just 429s). This allows well-behaved clients to self-throttle before hitting the limit. They also return a Retry-After header on 429 responses with a specific number of seconds to wait.
  • Idempotency keys for retries: Stripe’s rate limiting is designed to work hand-in-hand with their idempotency key system. When a client gets rate-limited and retries, the idempotency key ensures the retry does not create a duplicate charge. This is a subtle but important integration point between rate limiting and application semantics.
  • Load shedding at scale: When Stripe’s systems are under extreme load, they implement progressive load shedding — first throttling lower-priority traffic (webhooks, list operations) before restricting payment-critical paths. This priority-based approach is more sophisticated than a flat rate limit and is worth mentioning in interviews when discussing production-grade systems.
Key pattern: Both Cloudflare and Stripe separate their rate limiting into “protect the infrastructure” (backend stability) and “enforce fair usage” (per-customer limits). These are different concerns with different algorithms, thresholds, and failure modes. In an interview, distinguishing between these two motivations demonstrates operational maturity.

Problem 3: Design a Chat System

Difficulty: Medium | Time Target: 45 minutes

Questions You Should Ask the Interviewer

Category | Question | Why It Matters
Type | 1-to-1 only, or group chat too? Max group size? | Group chat introduces fan-out complexity
Features | Online/offline status? Read receipts? Typing indicators? | Each adds real-time event complexity
Media | Text only, or images/files/voice? | Media requires object storage + CDN pipeline
History | How far back should chat history go? | Determines storage volume and partitioning
Delivery | What delivery guarantees? At-least-once? Exactly-once? | Drives dedup and acknowledgment protocol
Scale | How many concurrent users? Messages per day? | Determines WebSocket server capacity

Requirements You Should Lock Down

Functional:
  • 1-to-1 messaging with delivery confirmation
  • Group chat (up to 500 members)
  • Online/offline presence indicators
  • Message history (persistent, searchable)
  • Push notifications for offline users
Non-Functional:
  • Real-time delivery (< 200ms for online users)
  • Message ordering guarantees (per-conversation)
  • 99.99% message delivery (no lost messages)
  • Support 50M concurrent connections

Back-of-Envelope Calculation

Assumptions:
  Daily Active Users (DAU):     50M
  Concurrent connections:       10% of DAU = 5M
  Messages per user per day:    40
  Total messages/day:           2 billion
  Average message size:         200 bytes
  Group chats per user:         5 (average 20 members)

Throughput:
  2B messages / 86,400 sec      ~ 23,000 messages/sec
  Peak (5x):                    ~ 115,000 messages/sec

WebSocket connections:
  5M concurrent connections
  Each connection ~ 10 KB memory
  Per server (10K connections): ~ 100 MB
  Servers needed:               5M / 10K = 500 WebSocket servers

Storage (1 year):
  2B messages/day x 365 x 200 bytes = ~146 TB/year
  With indexes and metadata:          ~200 TB/year

Presence (heartbeat):
  5M users sending heartbeat every 30 sec
  = 166K heartbeat events/sec
Key insight: This system is dominated by two challenges: (1) maintaining millions of persistent WebSocket connections, and (2) the group chat fan-out problem where one message must be delivered to hundreds of recipients simultaneously.

Component Architecture

 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │ Client A │  │ Client B │  │ Client C │
 └────┬─────┘  └────┬─────┘  └────┬─────┘
      │  WebSocket   │             │
      └──────────────┼─────────────┘

              ┌──────▼──────┐
              │   Gateway   │  ← Connection routing + auth
              │   (HAProxy) │
              └──────┬──────┘

      ┌──────────────┼──────────────┐
      │              │              │
 ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
 │  Chat   │   │  Chat   │   │  Chat   │
 │Server 1 │   │Server 2 │   │Server 3 │   ← Hold WebSocket connections
 └────┬────┘   └────┬────┘   └────┬────┘
      │              │              │
      └──────────────┼──────────────┘

      ┌──────────────┼──────────────┐
      │              │              │
 ┌────▼────┐   ┌────▼────┐   ┌─────▼────┐
 │  Redis  │   │ Message │   │ Presence │
 │ Pub/Sub │   │  Store  │   │ Service  │
 └─────────┘   └─────────┘   └──────────┘

Message Flow (1-to-1)

1. User A sends message via WebSocket to Chat Server 1
2. Chat Server 1:
   a. Persists message to Message Store (Cassandra)
   b. Looks up which Chat Server holds User B's connection
   c. Publishes message to Redis Pub/Sub channel for User B
3. Chat Server 2 (holding User B's connection):
   a. Receives message from Redis Pub/Sub
   b. Pushes message to User B via WebSocket
4. If User B is offline:
   a. Message is stored in an "undelivered" queue
   b. Push notification sent via APNS/FCM
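The routing in steps 2-4 can be sketched in-process (every class and field below is a stand-in for the real chat servers, pub/sub layer, message store, and push gateway):

```python
from collections import defaultdict

class ChatCluster:
    """In-memory stand-in for chat servers + Redis Pub/Sub + offline queue."""
    def __init__(self):
        self.connection_of = {}               # user -> server holding the WebSocket
        self.inbox = defaultdict(list)        # delivered messages per online user
        self.undelivered = defaultdict(list)  # queued for offline users
        self.push_notifications = []          # APNS/FCM stand-in
        self.store = []                       # message store (Cassandra stand-in)

    def connect(self, user: str, server: str):
        self.connection_of[user] = server
        # On reconnect, flush the catch-up queue
        self.inbox[user].extend(self.undelivered.pop(user, []))

    def send(self, sender: str, recipient: str, body: str):
        self.store.append((sender, recipient, body))  # 2a. persist first
        if recipient in self.connection_of:           # 2b. look up connection
            self.inbox[recipient].append(body)        # 2c-3. route + push via WS
        else:                                         # 4. offline path
            self.undelivered[recipient].append(body)
            self.push_notifications.append(recipient)

cluster = ChatCluster()
cluster.connect("A", "chat-01")
cluster.send("A", "B", "Hey!")              # B is offline → queued + push
assert cluster.push_notifications == ["B"]
cluster.connect("B", "chat-02")             # B reconnects → catch-up delivery
assert cluster.inbox["B"] == ["Hey!"]
```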

Deep Dive 1: WebSocket vs Long-Polling

WebSocket (Recommended)

Full-duplex, persistent connection. The server can push messages to the client at any time.
Why it wins: A chat system needs real-time bidirectional communication, and WebSocket is purpose-built for this. One connection handles both sending and receiving.
Challenge: Maintaining 5M persistent connections requires careful resource management. Connection drops need automatic reconnection with message catch-up.

Long-Polling (Fallback)

Client sends a request; the server holds it open until a message arrives (or a timeout). The client immediately sends another request.
When to use: Fallback for environments where WebSocket is blocked (corporate firewalls, older proxies). Keep it as a compatibility layer, not the primary transport.

Server-Sent Events (SSE)

Unidirectional (server to client only). The client still uses HTTP POST to send messages.
Verdict: Acceptable for notifications but awkward for chat, because sending and receiving use different channels.

Deep Dive 2: Message Storage and Partitioning

Database choice: Cassandra or ScyllaDB.
Why: Write-heavy workload (2B messages/day), time-series-like access pattern (fetch recent messages for a conversation), horizontal scaling built in.
Table: messages
┌──────────────────┬───────────────┬────────────┬──────────┬─────────┐
│ conversation_id  │  message_id   │  sender_id │   body   │  ts     │
│  (partition key) │ (clustering)  │            │          │         │
├──────────────────┼───────────────┼────────────┼──────────┼─────────┤
│  conv_abc123     │  msg_001      │  user_A    │  "Hey!"  │  17...  │
│  conv_abc123     │  msg_002      │  user_B    │  "Hi!"   │  17...  │
└──────────────────┴───────────────┴────────────┴──────────┴─────────┘

Partition key: conversation_id
  → All messages for a conversation are on the same node
  → Efficient range query: "Get last 50 messages for conversation X"

Clustering key: message_id (time-based UUID / Snowflake ID)
  → Messages within a partition are sorted by time
  → Supports efficient pagination
Hot partition risk: A group chat with 500 members that is very active could create a hot partition. Mitigation: sub-partition by time bucket (e.g., conversation_id + year_month) so older messages are on different partitions.

Deep Dive 3: Online Presence (Heartbeat Mechanism)

Presence Flow:
1. User connects → status set to "online" in Redis
   SET presence:user:123 '{"status":"online","server":"chat-02","ts":...}' EX 60

2. Client sends heartbeat every 30 seconds
   → Redis TTL refreshed: EXPIRE presence:user:123 60

3. If no heartbeat for 60 seconds → TTL expires → user is "offline"

4. When User A opens chat with User B:
   → Check Redis: GET presence:user:456
   → Subscribe to presence changes for contacts
Senior move: Do not fan-out presence updates to all friends in real-time (if a user has 5,000 friends, that is 5,000 events on every status change). Instead, use a lazy approach: only check presence when a user opens a conversation or views their contact list. This reduces presence event volume by 100x.
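The heartbeat/TTL mechanics can be sketched with an explicit clock (key names mirror the flow above; a real system would rely on Redis SET ... EX and key expiry instead of checking timestamps itself):

```python
class PresenceService:
    TTL = 60  # seconds without a heartbeat before a user reads as offline

    def __init__(self):
        self.last_beat = {}  # user -> timestamp of last heartbeat

    def heartbeat(self, user: str, now: float):
        # Redis equivalent: SET presence:user:<id> <payload> EX 60
        self.last_beat[user] = now

    def status(self, user: str, now: float) -> str:
        # Lazy evaluation: computed only when someone opens a conversation
        beat = self.last_beat.get(user)
        if beat is not None and now - beat < self.TTL:
            return "online"
        return "offline"

presence = PresenceService()
presence.heartbeat("user:456", now=0)
assert presence.status("user:456", now=30) == "online"   # within TTL
assert presence.status("user:456", now=90) == "offline"  # TTL expired
```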

Trade-Off Summary

Decision | Choice Made | Alternative | Why
Transport | WebSocket | Long-polling | Real-time bidirectional; fall back to long-polling when WS unavailable
Message DB | Cassandra | PostgreSQL | Write-heavy, time-series access, horizontal scale
Inter-server messaging | Redis Pub/Sub | Direct RPC | Decouples chat servers; Pub/Sub handles routing naturally
Presence | Lazy evaluation + heartbeat | Active fan-out | Fan-out to all friends is prohibitively expensive at scale
Message ordering | Per-conversation ordering | Global ordering | Global ordering is unnecessary and impossible to scale; conversations are independent
Group message delivery | Write message once, fan out reads | Write to each member’s inbox | Saves storage; read fan-out is cheaper for groups < 500
1. You identified the fan-out problem for groups

The difference between writing one copy vs N copies per group message is the core scalability challenge. Explaining the trade-off shows distributed systems maturity.
2. You designed lazy presence instead of active push

This optimization alone saves orders of magnitude of network traffic. It shows you think about what NOT to compute, not just what to compute.
3. You addressed message ordering as per-conversation

Global ordering across all conversations is impossible at scale and unnecessary. Scoping ordering to conversations shows pragmatic thinking.
4. You planned for the offline path

Push notifications, undelivered message queues, and message catch-up on reconnect. The offline path is where most chat systems get complex.
5. You chose the partition key carefully

conversation_id as partition key with time-based clustering is the correct Cassandra modeling for this access pattern. This is a detail that shows real database design experience.

WhatsApp: 100B+ Messages/Day with ~50 Engineers

WhatsApp is the gold standard for doing more with less. At the time of the Facebook acquisition (2014), WhatsApp handled 50+ billion messages per day with roughly 50 engineers. Today that number exceeds 100 billion. Their architecture is a masterclass in simplicity:
  • Erlang/OTP: WhatsApp’s backend is built on Erlang, a language designed for telecom systems that need massive concurrency and fault tolerance. A single Erlang server can handle 2-3 million concurrent connections because Erlang processes are lightweight (a few hundred bytes each, compared to threads that cost megabytes). This is why 50 engineers could operate what would require hundreds of engineers in a Java or Python stack.
  • Message storage philosophy: WhatsApp treats the server as a delivery mechanism, not a storage system. Messages are stored on the server only until they are delivered to the recipient’s device. Once the recipient acknowledges receipt, the message is deleted from the server. This “transient store” approach dramatically reduces storage requirements and simplifies the architecture.
  • XMPP-based protocol (modified): WhatsApp started with the XMPP messaging protocol and heavily modified it for mobile efficiency. Their custom protocol minimizes battery drain and data usage — critical for WhatsApp’s user base in markets with limited data plans.
  • No read receipts stored server-side: The blue check marks (read receipts) are end-to-end signals between devices. The server facilitates delivery but does not store read state. This is another example of pushing complexity to the edges and keeping the server simple.
  • Mnesia for routing: WhatsApp uses Erlang’s built-in Mnesia database for connection routing — mapping which user is connected to which server. Mnesia is an in-memory distributed database that replicates across Erlang nodes, giving sub-millisecond lookups without an external dependency like Redis.
  • FreeBSD over Linux: WhatsApp chose FreeBSD for their servers, partly because of superior network stack performance for their specific workload (millions of concurrent TCP connections). This is an unconventional choice that paid off — it shows that sometimes the right tool is not the popular tool.

Discord’s Message Storage

Discord’s approach is interesting because they had to solve a different problem than WhatsApp — Discord needs persistent, searchable message history (channels can have millions of messages spanning years):
  • Migration from MongoDB to Cassandra: Discord initially stored messages in MongoDB. As they scaled, MongoDB’s single-primary-per-shard model created write bottlenecks for hot channels. They migrated to Cassandra, which distributes writes evenly across nodes. The partition key is (channel_id, bucket) where bucket is a time window, preventing any single partition from growing unbounded.
  • Bucket strategy: Each bucket covers a fixed time period (approximately 10 days of messages). When a user scrolls back through history, Discord fetches the appropriate bucket. New messages always write to the current bucket. This means the “hot” data (recent messages) is always on a small number of partitions, keeping reads fast.
  • Data services layer: Discord added a data services layer between their API and Cassandra. This layer handles request coalescing — if 1,000 users in the same channel all request the same messages at the same time (e.g., after a server outage), the data services layer deduplicates these into a single Cassandra read and fans the result back to all requesters. Without this, a thundering herd could overwhelm Cassandra.
  • ScyllaDB migration: Discord later migrated from Cassandra to ScyllaDB (a C++ reimplementation of Cassandra’s protocol) for better tail latency performance. Cassandra’s JVM garbage collection pauses caused p99 latency spikes that affected user experience. ScyllaDB, being C++ with a shard-per-core architecture, eliminated GC pauses entirely.
Interview insight: WhatsApp’s story demonstrates that choosing the right runtime (Erlang) can be a 10x architectural advantage. Discord’s story demonstrates that database migrations are real and normal — you do not need to get the database choice right on day one, but you need to design so that migration is possible.

Problem 4: Design a News Feed

Difficulty: Hard | Time Target: 45 minutes

Questions You Should Ask the Interviewer

Category          | Question                                           | Why It Matters
Content           | Text posts only, or images/videos too?             | Media adds CDN and transcoding complexity
Feed type         | Chronological or ranked (algorithmic)?             | Ranked feeds need an ML scoring service
Social graph      | Follow model (Twitter) or friend model (Facebook)? | Follow is asymmetric and simpler; the friend model is symmetric
Celebrity problem | Do some users have millions of followers?          | This single question changes the entire fan-out strategy
Real-time         | Should the feed update in real-time or on refresh? | Real-time adds WebSocket/SSE complexity
Scale             | DAU? Average follows per user? Posts per day?      | Drives the fan-out vs fan-in decision

Requirements You Should Lock Down

Functional:
  • Users create posts (text + images/video)
  • Users follow other users (asymmetric follow model)
  • Home feed shows posts from followed users, ranked by relevance
  • Support likes, comments, shares (interaction signals)
Non-Functional:
  • Feed generation latency < 500ms
  • Support 500M DAU
  • Handle celebrity accounts (10M+ followers)
  • Feed should feel “fresh” (updates within minutes of posting)

Back-of-Envelope Calculation

Assumptions:
  DAU:                          500M
  Average follows per user:     200
  Average posts per user/day:   2
  Total posts/day:              1 billion
  Average post size:            1 KB (text + metadata)
  Average feed fetch/day:       10 per user

Write (post creation):
  1B / 86,400                   ~ 11,600 posts/sec
  Peak (5x):                    ~ 58,000 posts/sec

Read (feed fetch):
  500M x 10 / 86,400           ~ 58,000 feed fetches/sec
  Peak (5x):                    ~ 290,000 feed fetches/sec

Fan-out calculation (the critical number):
  Average user: 200 followers
    → 1 post = 200 fan-out writes
  Celebrity (1M followers):
    → 1 post = 1,000,000 fan-out writes
    → 10 celebrities posting = 10M writes instantly

Storage for pre-computed feeds:
  500M users x 500 post IDs x 8 bytes = 2 TB
  (Only store post IDs in feed, not full content)
The celebrity problem is the entire interview: If a user with 10M followers posts, and you use fan-out-on-write, you must write to 10M feeds instantly. This is the single most important trade-off in this problem. If you do not raise it yourself, the interviewer will push you toward it.
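The arithmetic above is worth sanity-checking; a few lines of Python reproduce the estimates (pure arithmetic over the stated assumptions):

```python
# Sanity check of the back-of-envelope numbers above.
DAU = 500_000_000
SECONDS_PER_DAY = 86_400

posts_per_day = DAU * 2                       # 2 posts/user/day -> 1B posts/day
write_qps = posts_per_day / SECONDS_PER_DAY   # ~11,600 posts/sec
read_qps = DAU * 10 / SECONDS_PER_DAY         # 10 fetches/user/day -> ~58,000/sec
feed_cache_tb = DAU * 500 * 8 / 1e12          # 500 post IDs x 8 bytes per user

print(round(write_qps), round(read_qps), feed_cache_tb)  # 11574 57870 2.0
```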

The Two Fundamental Approaches

Fan-Out on Write (Push Model):
  User posts → immediately write to all followers' feed caches
  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐
  │  Post    │───▶│  Fan-out Service │───▶│ Feed Cache    │
  │  Service │    │  (async workers) │    │ (per-user)    │
  └──────────┘    └──────────────────┘    └───────────────┘
  Pro: Feed reads are instant (pre-computed)
  Con: Write amplification; celebrity post = millions of writes

Fan-Out on Read (Pull Model):
  User opens feed → query all followed users' posts in real-time
  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐
  │  Feed    │───▶│  Aggregation     │───▶│ Post Storage  │
  │  Request │    │  Service         │    │ (per-user)    │
  └──────────┘    └──────────────────┘    └───────────────┘
  Pro: No write amplification; simple writes
  Con: Slow read (must merge 200+ sorted lists at read time)
┌──────────────────────────────────────────────────────────────────┐
│                       Hybrid News Feed                           │
│                                                                  │
│  Normal user posts (< 10K followers):                            │
│    → Fan-out on WRITE to all followers' caches                   │
│                                                                  │
│  Celebrity posts (> 10K followers):                              │
│    → Do NOT fan out; store in celebrity post cache               │
│    → At read time, merge celebrity posts with pre-computed feed  │
└──────────────────────────────────────────────────────────────────┘

  ┌──────────┐
  │  User    │──── Creates post
  └────┬─────┘

  ┌────▼─────────────────────┐
  │  Post Service            │
  │  (writes to Post DB)     │
  └────┬─────────────────────┘

       ├── followers < 10K ──▶ Fan-out workers ──▶ Feed Cache
       │                                           (Redis sorted set)

       └── followers > 10K ──▶ Celebrity Post Cache
                                (read at feed-fetch time)

  Feed Fetch:
  ┌──────────┐    ┌─────────────────┐
  │  Client  │───▶│ Feed Service    │
  └──────────┘    │ 1. Read cache   │
                  │ 2. Merge celebs │
                  │ 3. Rank/sort    │
                  │ 4. Return top N │
                  └─────────────────┘

Deep Dive 1: The Hybrid Fan-Out Strategy

1. Post arrives at Post Service
   Validate content, store in Post DB (PostgreSQL/Cassandra), generate post ID.
2. Check follower count
   If the poster has fewer than 10K followers, proceed with fan-out on write. If more, only add to the celebrity post index. The 10K threshold is tunable; some systems use dynamic thresholds based on current system load.
3. Fan-out workers process asynchronously
   A Kafka consumer reads new-post events. For each follower of the poster, the worker prepends the post ID to that follower's feed in Redis.
Redis sorted set per user:
ZADD feed:user:789 <timestamp_score> <post_id>

Keep only the most recent 500 entries:
ZREMRANGEBYRANK feed:user:789 0 -501
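A minimal in-memory sketch of this worker logic, with a plain dict standing in for Redis so the trimming behaviour is visible (a real worker would pipeline the two commands above instead):

```python
FEED_MAX = 500
feed_cache: dict[str, list[tuple[float, str]]] = {}  # key -> [(score, post_id)]

def fan_out(post_id: str, created_at: float, follower_ids: list[str]) -> None:
    """Push one post into every follower's feed, keeping only the newest FEED_MAX."""
    for uid in follower_ids:
        feed = feed_cache.setdefault(f"feed:user:{uid}", [])
        feed.append((created_at, post_id))  # ZADD feed:user:<uid> <ts> <post_id>
        feed.sort()                         # sorted-set ordering by score
        del feed[:-FEED_MAX]                # ZREMRANGEBYRANK feed:user:<uid> 0 -501
```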
4. At read time, merge celebrity posts
   When the user fetches their feed, the Feed Service reads their pre-computed feed from Redis AND fetches recent posts from any celebrities they follow. Merge the two sorted lists, apply ranking, and return the top 50.
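The read-time merge is a k-way merge of already-sorted lists; a sketch using `heapq.merge` (the `(timestamp, post_id)` tuple shape is illustrative, standing in for a `ZREVRANGE ... WITHSCORES` result):

```python
import heapq
from itertools import islice

def build_feed(precomputed, celebrity_lists, limit=50):
    """Merge a pre-computed feed with celebrity post lists, newest first.

    Every input list must already be sorted newest-first by timestamp,
    which is what a reverse range over a Redis sorted set returns.
    """
    merged = heapq.merge(precomputed, *celebrity_lists,
                         key=lambda entry: entry[0], reverse=True)
    return [post_id for _ts, post_id in islice(merged, limit)]
```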

Deep Dive 2: Feed Ranking

A pure chronological feed is simple but produces poor engagement. A ranked feed considers multiple signals:
Score(post) = w1 * recency
            + w2 * affinity(user, author)
            + w3 * engagement(likes, comments, shares)
            + w4 * content_type_preference
            + w5 * diversity_penalty

Where:
  recency    = exponential decay from post creation time
  affinity   = how often user interacts with author
  engagement = normalized like/comment/share count
  diversity  = penalize seeing 5 posts from same author in a row
In an interview, you do not need to define the exact ML model. What matters is that you architect for ranking: the Feed Service fetches candidate posts, passes them through a lightweight ranker, and returns the top N. The ranking model can be iterated independently of the infrastructure.
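A lightweight read-time scorer along these lines might look as follows. The weights and signal definitions are illustrative, not a production model, and only a subset of the signals above is shown; the `viewer` dict fields are hypothetical names:

```python
import math

# Illustrative weights; in production these come from tuning or an ML model.
W_RECENCY, W_AFFINITY, W_ENGAGE, W_DIVERSITY = 0.4, 0.3, 0.2, 0.1

def score(post: dict, viewer: dict, now: float, half_life_hours: float = 6.0) -> float:
    age_hours = (now - post["created_at"]) / 3600
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)  # exponential decay
    engagement = math.log1p(post["likes"] + 2 * post["comments"] + 3 * post["shares"])
    affinity = viewer["affinity"].get(post["author_id"], 0.0)       # 0..1
    # Penalize authors already shown repeatedly in this session (diversity)
    diversity_penalty = viewer["seen_from_author"].get(post["author_id"], 0)
    return (W_RECENCY * recency + W_AFFINITY * affinity
            + W_ENGAGE * engagement - W_DIVERSITY * diversity_penalty)
```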

Deep Dive 3: Caching and CDN for Media

Caching layers:
┌──────────────────────────────────────────────────┐
│  Layer 1: CDN (images, videos, static assets)    │
│  Layer 2: Feed Cache (Redis - post IDs per user) │
│  Layer 3: Post Cache (Redis - post content)      │
│  Layer 4: Social Graph Cache (who follows whom)  │
│  Layer 5: Database (source of truth)             │
└──────────────────────────────────────────────────┘

Cache invalidation strategy:
- Feed cache: append-only (new posts added, old posts expire via ZREMRANGEBYRANK)
- Post cache: invalidate on edit/delete (rare operations)
- Social graph cache: invalidate on follow/unfollow
Senior move: Pre-compute feeds for active users only. If a user has not logged in for 30 days, do not fan out posts to their feed cache. When they return, generate their feed on-demand (fan-out on read) and then resume pre-computation.
Decision            | Choice Made                                    | Alternative               | Why
Fan-out strategy    | Hybrid (push for normal, pull for celebrities) | Pure push or pure pull    | Pure push breaks for celebrities; pure pull is too slow for reads
Celebrity threshold | 10K followers                                  | Static vs dynamic         | Start with a static threshold, evolve to dynamic based on system load
Feed storage        | Redis sorted sets (post IDs only)              | Store full post content   | IDs are tiny; fetch content separately with post cache hit rates > 99%
Ranking             | Lightweight scoring at read time               | Pre-computed ranked feeds | Ranking signals change in real time (new likes); pre-ranking gets stale
Media delivery      | CDN + object storage (S3)                      | Serve from app servers    | Media is the bandwidth bottleneck; CDN reduces latency and server load by 90%+
Inactive users      | Skip fan-out, generate on demand               | Fan-out to everyone       | 60% of users do not check their feed daily; skipping saves enormous write volume
1. You immediately identified the celebrity fan-out problem
   This is the central tension of the problem. Raising it proactively shows you have thought deeply about news feed systems.

2. You proposed a hybrid approach instead of picking one extreme
   Real systems (Twitter, Facebook, Instagram) all use hybrid approaches. Knowing this signals industry awareness.

3. You separated the ranking layer from the data layer
   Architecturally decoupling ranking from storage allows the ML team to iterate independently. This is how real organizations work.

4. You optimized for inactive users
   Not fanning out to inactive users is a massive optimization that most candidates miss. It shows you think about eliminating the 80% of work that is wasted, not just speeding up the 20% that matters.

5. You designed multiple caching layers with clear invalidation strategies
   Feed cache, post cache, social graph cache, CDN — each with its own invalidation policy. This layered approach demonstrates production-grade thinking.

Facebook’s News Feed Ranking Evolution

Facebook’s News Feed is perhaps the most studied feed system in the world. Its evolution from simple to sophisticated mirrors the journey every feed system takes:
  • 2006 — Chronological feed: The original News Feed was purely chronological. If your friend posted, it appeared in your feed sorted by time. Simple, but as users added hundreds of friends, the feed became noisy and engagement dropped.
  • 2009 — EdgeRank: Facebook introduced EdgeRank, a relatively simple scoring formula: Score = Affinity x Weight x Decay. Affinity measured how close you were to the poster (based on interactions). Weight reflected the content type (photos ranked higher than text). Decay was a time-based exponential falloff. EdgeRank was essentially the formula we described in the ranking section above.
  • 2013+ — Machine learning replaces EdgeRank: Facebook replaced the hand-tuned formula with a machine learning model that considers thousands of signals. The model predicts the probability that you will engage with (like, comment, share, click) each candidate post. Posts are ranked by predicted engagement score.
  • Fan-out architecture: Facebook uses a hybrid fan-out model very similar to what we designed above. Normal users get fan-out-on-write (their posts are pushed to followers’ pre-computed feeds). Celebrities and pages with millions of followers use fan-out-on-read (their posts are merged at feed-fetch time). The threshold is dynamic and adjusts based on system load.
  • Aggregator service: Facebook’s feed generation pipeline has a dedicated aggregation service (called “Multifeed”) that merges posts from multiple sources: the pre-computed feed cache, celebrity posts fetched on-read, ads injected by the ad auction, and “story bumping” (resurfacing older posts that are getting new comments). This merge step is where the ranking model runs.
  • Feed cache (TAO + Memcache): Facebook built a custom distributed cache called TAO (The Associations and Objects cache) specifically for social graph data. Feed data lives in a combination of TAO, Memcache, and MySQL. The caching hierarchy is: L1 (in-process cache on the web server) to L2 (Memcache cluster in the same region) to L3 (TAO for social graph queries) to L4 (MySQL as source of truth).

Twitter’s Timeline Architecture

Twitter’s timeline system is a fascinating case study because they famously changed their approach:
  • Original approach (fan-out on write): For years, Twitter used aggressive fan-out-on-write. When you tweeted, a fleet of workers would push your tweet ID into the timelines of all your followers stored in Redis. This worked well because reads were instant — opening Twitter just meant reading a pre-computed list from Redis.
  • The celebrity problem: This approach broke for users like Lady Gaga or Barack Obama with 30M+ followers. A single tweet triggered 30 million Redis writes. During events like the Super Bowl or elections, the fan-out queue would back up by minutes or even hours, meaning some users would see tweets with significant delay.
  • Migration to hybrid: Twitter moved to a hybrid model around 2012-2013. Most users (those with fewer followers than a dynamic threshold) still get fan-out-on-write. High-follower accounts are excluded from fan-out. When you open your timeline, Twitter’s timeline service merges your pre-computed timeline with recent tweets from the high-follower accounts you follow.
  • Timeline cache: Twitter stores timelines in a massive Redis cluster. Each user’s timeline is a Redis list of tweet IDs (not full tweet content). At read time, the IDs are hydrated by fetching the tweet content from a separate Tweet Store service. This separation means the timeline cache stays small (just IDs and scores) while tweet content is cached independently.
  • Ranking and relevance: Twitter introduced algorithmic ranking (“top tweets”) alongside the chronological timeline. The ranking model considers engagement signals, recency, your past interactions with the author, and content type. Users can toggle between ranked and chronological views — a product decision that has architectural implications (you need to support both code paths).
The lesson from both Facebook and Twitter: Every major feed system ended up at a hybrid fan-out architecture. Pure push breaks at celebrity scale. Pure pull is too slow for reads. The hybrid model is not a compromise — it is the correct architecture for this problem class. If you propose a hybrid approach in your interview, you are aligned with what the biggest companies actually built.

Problem 5: Design a Distributed Task Scheduler

Difficulty: Hard | Time Target: 45 minutes

Questions You Should Ask the Interviewer

Category   | Question                                                         | Why It Matters
Task types | One-time, delayed, recurring (cron), or all three?               | Recurring tasks need a fundamentally different scheduling model
Execution  | At-most-once, at-least-once, or exactly-once?                    | Determines dedup strategy and acknowledgment protocol
Priority   | Do tasks have priority levels?                                   | Priority requires a priority queue, not a FIFO queue
Payload    | What is a task? Just a function name + args, or arbitrary code?  | Arbitrary code execution is a security and sandboxing problem
Duration   | How long can a task run? Timeout?                                | Long-running tasks need heartbeats and lease renewal
Scale      | How many tasks per day? Concurrent workers?                      | Drives queue partitioning and worker scaling decisions
Failure    | What happens when a task fails? Retry policy? Dead letter?       | Failure handling is where most task schedulers get complex

Requirements You Should Lock Down

Functional:
  • Schedule one-time tasks for immediate or delayed execution
  • Schedule recurring tasks using cron expressions
  • Retry failed tasks with configurable backoff (max 3 retries)
  • Dead-letter queue for tasks that exceed retry limit
  • Task status tracking: pending, scheduled, running, completed, failed, dead
  • Cancel or pause scheduled/recurring tasks
Non-Functional:
  • At-least-once execution guarantee (with idempotency support)
  • Handle 10M task executions per day
  • Scheduling accuracy: within 1 second of target time
  • Horizontal scaling of both scheduler and workers
  • No single point of failure
Senior move: Ask about exactly-once semantics. Then explain that true exactly-once is impossible in a distributed system. The practical solution is at-least-once delivery combined with idempotent task execution. This single clarification demonstrates deep distributed systems understanding.

Back-of-Envelope Calculation

Assumptions:
  Tasks scheduled per day:      10M
  One-time tasks:               60% (6M)
  Recurring tasks:              40% (4M definitions, generating 50M executions/day)
  Total executions per day:     56M
  Average task duration:        5 seconds
  Peak task burst:              10x average

Throughput:
  56M / 86,400                  ~ 650 tasks/sec average
  Peak (10x):                   ~ 6,500 tasks/sec

Worker capacity:
  Each worker handles 1 task at a time for 5 seconds
  Worker throughput: 12 tasks/min = 720 tasks/hour
  Peak concurrency (Little's law): 6,500 tasks/sec x 5 sec ~ 32,500 tasks in flight
  Workers needed: ~32,500 (at peak, with 1 task per worker)
  With 10 tasks/worker (async): ~3,250 workers at peak

Task storage:
  56M tasks/day x 1 KB avg = 56 GB/day
  30-day retention: ~1.7 TB
  Completed tasks archived to cold storage after 7 days

Queue depth:
  At peak, 6,500 tasks queued per second
  If workers process in 5 seconds: ~32,500 tasks in queue at any time
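The worker-capacity and queue-depth figures are both instances of Little's law (items in flight = arrival rate x time in system); a quick check in Python:

```python
# Little's law: tasks in flight = arrival rate x average task duration.
executions_per_day = 56_000_000
avg_rate = executions_per_day / 86_400         # ~650 tasks/sec average
peak_rate = avg_rate * 10                      # ~6,500 tasks/sec at 10x burst
in_flight_at_peak = peak_rate * 5              # 5 s avg duration -> ~32,500 tasks
workers_needed_async = in_flight_at_peak / 10  # 10 concurrent tasks per worker
```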

Component Architecture

┌─────────────────────────────────────────────────────────────┐
│                 Distributed Task Scheduler                  │
│                                                             │
│  ┌──────────┐      ┌──────────────┐      ┌───────────────┐  │
│  │  Client  │─────▶│  Task API    │─────▶│  Task Store   │  │
│  │  (SDK)   │      │  (REST/gRPC) │      │  (PostgreSQL) │  │
│  └──────────┘      └──────────────┘      └───────┬───────┘  │
│                                                  │          │
│                    ┌──────────────┐              │          │
│                    │  Scheduler   │◀─────────────┘          │
│                    │  (Leader)    │                         │
│                    └──────┬───────┘                         │
│                           │  Enqueues due tasks             │
│                    ┌──────▼───────┐                         │
│                    │  Task Queue  │                         │
│                    │  (Redis/SQS) │                         │
│                    └──────┬───────┘                         │
│                           │                                 │
│             ┌─────────────┼─────────────┐                   │
│             │             │             │                   │
│        ┌────▼────┐   ┌────▼────┐   ┌────▼────┐              │
│        │Worker 1 │   │Worker 2 │   │Worker N │              │
│        └────┬────┘   └────┬────┘   └────┬────┘              │
│             │             │             │                   │
│             └─────────────┼─────────────┘                   │
│                           │                                 │
│                    ┌──────▼───────┐                         │
│                    │  Dead Letter │                         │
│                    │  Queue (DLQ) │                         │
│                    └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘

API Design

POST   /api/v1/tasks              — Schedule a new task
GET    /api/v1/tasks/{id}         — Get task status
DELETE /api/v1/tasks/{id}         — Cancel a task
PATCH  /api/v1/tasks/{id}/pause   — Pause a recurring task
GET    /api/v1/tasks?status=failed — List tasks by status

Task payload:
{
  "task_type": "send_email",
  "payload": { "to": "user@example.com", "template": "welcome" },
  "schedule": {
    "type": "delayed",          // "immediate" | "delayed" | "cron"
    "run_at": "2026-04-10T10:00:00Z",
    "cron": null                // "0 */6 * * *" for recurring
  },
  "retry_policy": {
    "max_retries": 3,
    "backoff": "exponential",   // "fixed" | "exponential"
    "initial_delay_sec": 10
  },
  "timeout_sec": 300,
  "idempotency_key": "email-welcome-user-123"
}

Deep Dive 1: Task State Machine

Every task follows a strict state machine. Invalid transitions must be rejected.
                  ┌───────────┐
                  │  PENDING  │  ← Task created, not yet due
                  └─────┬─────┘
                        │ (run_at time reached)
                  ┌─────▼─────┐
                  │ SCHEDULED │  ← Placed in task queue
                  └─────┬─────┘
                        │ (worker picks up)
                  ┌─────▼─────┐
          ┌───────│  RUNNING  │───────┐
          │       └─────┬─────┘       │
          │ (success)   │             │ (failure + retries left)
    ┌─────▼─────┐       │       ┌─────▼─────┐
    │ COMPLETED │       │       │ RETRYING  │
    └───────────┘       │       └─────┬─────┘
                        │             │ (back to SCHEDULED)
                        │             └──────▶ SCHEDULED
                        │ (failure + no retries left)
                  ┌─────▼─────┐
                  │   DEAD    │  ← Moved to DLQ
                  └───────────┘

Additional states:
  CANCELLED — user cancelled the task
  PAUSED    — recurring task paused by user
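One way to enforce this is an explicit transition table, so illegal moves (e.g. COMPLETED back to RUNNING) never reach the task store. A sketch; state names follow the diagram, including the two additional states:

```python
# Allowed next-states for each task state.
VALID_TRANSITIONS: dict[str, set[str]] = {
    "PENDING":   {"SCHEDULED", "CANCELLED", "PAUSED"},
    "SCHEDULED": {"RUNNING", "CANCELLED"},
    "RUNNING":   {"COMPLETED", "RETRYING", "DEAD"},
    "RETRYING":  {"SCHEDULED"},
    "PAUSED":    {"PENDING", "CANCELLED"},
    "COMPLETED": set(),   # terminal
    "CANCELLED": set(),   # terminal
    "DEAD":      set(),   # terminal (lives in the DLQ)
}

def transition(current: str, new: str) -> str:
    """Validate a single state transition; reject anything not in the table."""
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```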

Database Schema

CREATE TABLE tasks (
    id              UUID PRIMARY KEY,
    idempotency_key VARCHAR(255) UNIQUE,
    task_type       VARCHAR(100) NOT NULL,
    payload         JSONB NOT NULL,
    status          VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    priority        INT DEFAULT 0,

    -- Scheduling
    schedule_type   VARCHAR(20) NOT NULL,  -- immediate/delayed/cron
    run_at          TIMESTAMPTZ,
    cron_expression VARCHAR(100),
    next_run_at     TIMESTAMPTZ,           -- For recurring tasks

    -- Execution tracking
    attempt         INT DEFAULT 0,
    max_retries     INT DEFAULT 3,
    timeout_sec     INT DEFAULT 300,
    locked_by       VARCHAR(100),          -- Worker ID holding the lock
    locked_at       TIMESTAMPTZ,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    error_message   TEXT,

    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Index for scheduler to find due tasks efficiently
CREATE INDEX idx_tasks_pending ON tasks (next_run_at)
    WHERE status IN ('PENDING', 'RETRYING');

-- Index for status queries
CREATE INDEX idx_tasks_status ON tasks (status, created_at);
Why PostgreSQL and not a NoSQL database here? Task scheduling requires strong consistency (you cannot execute a task twice because of eventual consistency), transactional state transitions (PENDING to RUNNING must be atomic), and complex queries (find all tasks where next_run_at < NOW() AND status = PENDING). These are SQL strengths.

Deep Dive 2: Distributed Locking for Task Pickup

The critical problem: multiple workers must not pick up the same task.
Approach A: Database-Level Locking (Simple, Recommended to Start)

Use SELECT … FOR UPDATE SKIP LOCKED to atomically claim tasks.
BEGIN;
-- Atomically claim the next available task
UPDATE tasks
SET status = 'RUNNING',
    locked_by = 'worker-042',
    locked_at = NOW(),
    attempt = attempt + 1
WHERE id = (
    SELECT id FROM tasks
    WHERE status IN ('SCHEDULED')
      AND next_run_at <= NOW()
    ORDER BY priority DESC, next_run_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
COMMIT;
Pros: No external lock service needed. SKIP LOCKED prevents workers from blocking each other. PostgreSQL handles all the concurrency.
Cons: The database becomes the bottleneck at very high throughput (> 10K tasks/sec).
Approach B: Redis-Based Distributed Lock (For Higher Scale)

Use Redis sorted sets as the task queue. Workers use ZPOPMIN atomically.
-- Scheduler enqueues tasks (score = execution timestamp)
ZADD task_queue <run_at_timestamp> <task_id>

-- Worker atomically pops the next due task
ZPOPMIN task_queue

-- Worker sets a lock with TTL (lease)
SET lock:task:<task_id> worker-042 EX 300 NX
Pros: Much higher throughput than DB locking. Redis handles 100K+ ops/sec.
Cons: Requires careful handling of Redis failures. Task state still lives in PostgreSQL (dual-write concern).
Lease-Based Execution

Whether using DB or Redis locks, the pattern is the same:
  1. Worker acquires a lease (lock with TTL) on the task
  2. Worker must renew the lease periodically (heartbeat) if the task runs long
  3. If the worker crashes, the lease expires and another worker can pick up the task
  4. When the task completes, the worker releases the lease and updates state
Lease renewal loop (in worker):
while task_is_running():
    time.sleep(lease_ttl / 3)                        # renew well before expiry
    redis.expire(f"lock:task:{task_id}", lease_ttl)  # EXPIRE resets the TTL
The zombie task problem: A worker picks up a task, starts processing, then the process is killed (OOM, hardware failure). The task is stuck in RUNNING state forever. Solution: the scheduler periodically scans for tasks in RUNNING state where locked_at + timeout_sec < NOW(). These tasks are “zombies” and get moved back to SCHEDULED (if retries remain) or DEAD.

Deep Dive 3: Failure Handling and Idempotency

1. Retry with Exponential Backoff
   When a task fails, calculate the next retry time:
next_retry_at = NOW() + initial_delay * (2 ^ (attempt - 1))

Attempt 1: retry after 10 seconds
Attempt 2: retry after 20 seconds
Attempt 3: retry after 40 seconds
Attempt 4: max retries exceeded → move to DLQ
Add jitter (random offset) to prevent thundering herd when many tasks fail simultaneously.
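As a function (the +/-20% jitter range and the one-hour cap are common choices, not requirements from the text):

```python
import random

def next_retry_delay(attempt: int, initial_delay: float = 10.0,
                     cap: float = 3600.0, jitter: float = 0.2) -> float:
    """Exponential backoff: attempt 1 -> ~10s, 2 -> ~20s, 3 -> ~40s, capped,
    with jitter so simultaneous failures do not retry in lockstep."""
    base = min(initial_delay * 2 ** (attempt - 1), cap)
    return base * random.uniform(1.0 - jitter, 1.0 + jitter)
```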
2. Idempotency via Idempotency Keys
   Since at-least-once delivery means a task CAN run more than once (worker crash after execution but before acknowledgment), tasks must be idempotent. The idempotency_key field ensures the same logical task is not created twice:
POST /tasks with idempotency_key = "charge-order-456"
→ If key exists and task is COMPLETED → return existing result
→ If key exists and task is RUNNING   → return "in progress"
→ If key does not exist               → create new task
Task implementations must also be idempotent internally. Example: “send welcome email to user X” should check if the email was already sent before sending again.
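The lookup logic in miniature, with an in-memory dict standing in for the tasks table and its UNIQUE constraint on idempotency_key (a sketch; the real check happens transactionally in the database):

```python
def submit_task(store: dict, idempotency_key: str, payload: dict) -> dict:
    """Create a task only if its idempotency key has not been seen before."""
    existing = store.get(idempotency_key)
    if existing is not None:
        if existing["status"] == "COMPLETED":
            return {"outcome": "already_done", "task_id": existing["id"]}
        return {"outcome": "in_progress", "task_id": existing["id"]}
    task = {"id": f"task-{len(store) + 1}", "status": "PENDING", "payload": payload}
    store[idempotency_key] = task
    return {"outcome": "created", "task_id": task["id"]}
```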
3. Dead Letter Queue (DLQ)
   Tasks that exhaust all retries are moved to a DLQ:
   • The DLQ is a separate table or queue reserved for manual inspection
   • Alerting fires when DLQ depth exceeds a threshold
   • The operations team can inspect, fix, and re-enqueue dead tasks
   • DLQ tasks are retained for 30 days, then archived
-- Move exhausted task to DLQ
UPDATE tasks
SET status = 'DEAD',
    completed_at = NOW(),
    error_message = 'Max retries exceeded: <last error>'
WHERE id = :task_id;

INSERT INTO dead_letter_queue (task_id, original_payload, failure_reason, dead_at)
VALUES (:task_id, :payload, :error, NOW());
Decision            | Choice Made                                         | Alternative                                  | Why
Task store          | PostgreSQL                                          | Cassandra                                    | Need strong consistency + transactions for the state machine; Cassandra's eventual consistency causes double execution
Task queue          | Redis sorted sets                                   | SQS / RabbitMQ                               | Redis gives precise scheduled execution time ordering; managed queues add latency
Locking             | DB-level SKIP LOCKED (start), Redis leases (scale)  | ZooKeeper                                    | Start simple with the DB; graduate to Redis. ZooKeeper is operationally heavy for this use case
Execution guarantee | At-least-once + idempotency                         | Exactly-once                                 | True exactly-once is impossible; at-least-once with idempotent tasks is the practical standard
Scheduler HA        | Leader election (single active scheduler)           | Multiple schedulers with distributed locking | A single scheduler is simpler; leader election via PostgreSQL advisory lock or Redis Redlock
Recurring tasks     | Scheduler computes next_run_at after each execution | Worker computes next run                     | Scheduler owns the schedule; workers just execute. Separation of concerns.
1

You defined a rigorous state machine

The task lifecycle (PENDING to RUNNING to COMPLETED/DEAD) with valid transitions shows you think about system correctness, not just the happy path.
2

You addressed the zombie task problem

Workers crash. Networks partition. Tasks get stuck. Proactively designing a zombie detector shows real operational experience.
3

You explained why exactly-once is impossible and offered the practical alternative

At-least-once + idempotency is how every production system (Kafka, SQS, Celery) actually works. This shows you know the theory AND the practice.
4. You designed for progressive scaling. The design starts with PostgreSQL SKIP LOCKED and evolves to Redis queues only as scale demands. This pragmatic “start simple, scale when needed” approach is exactly what senior engineers advocate.
5. You included observability from the start. Task status tracking, DLQ alerting, and execution metrics are not afterthoughts. In production, you cannot operate what you cannot observe.
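The at-least-once + idempotency guarantee from point 3 can be made concrete: record an idempotency key in the same transaction as the task's side effect, so a redelivered task becomes a no-op. This is a minimal sketch under assumed names and schema, not the exact design from the problem:

```python
import sqlite3

# Illustrative idempotency ledger; in the real design this table would live
# alongside the tasks table in PostgreSQL.
conn = sqlite3.connect(":memory:")
with conn:
    conn.execute("CREATE TABLE executions (idempotency_key TEXT PRIMARY KEY)")

def run_task_once(task_id: str, side_effect) -> bool:
    """Run side_effect at most once per task_id; duplicate deliveries return False."""
    try:
        with conn:  # one transaction: the key insert commits together with the effect
            conn.execute("INSERT INTO executions VALUES (?)", (task_id,))
            side_effect()
        return True
    except sqlite3.IntegrityError:  # key already recorded: task already executed
        return False

charges = []
assert run_task_once("charge-42", lambda: charges.append(9.99)) is True
assert run_task_once("charge-42", lambda: charges.append(9.99)) is False  # redelivery
assert charges == [9.99]  # the side effect happened exactly once
```

Note the failure semantics: if the side effect raises, the transaction rolls back and the key is removed, so a later retry is still allowed. That is exactly the behavior an at-least-once queue needs.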

Uber’s Cadence / Temporal

Cadence was built at Uber to solve the problem of orchestrating long-running, reliable workflows (ride matching, payment processing, driver onboarding). It was later open-sourced and evolved into Temporal (founded by the same engineers):
  • Workflow as code: Unlike traditional task schedulers that treat tasks as independent units, Cadence/Temporal models entire workflows as code. A workflow function can call activities (individual tasks), wait for signals, set timers, and branch on conditions — all while being fully durable. If the worker crashes mid-workflow, the framework replays the workflow function from the event history to resume exactly where it left off.
  • Event sourcing under the hood: Every state change in a workflow is persisted as an event in a history. The workflow’s current state is reconstructed by replaying this event history. This is fundamentally different from the state-machine-in-a-database approach we designed above. It is more flexible (arbitrary workflow logic instead of predefined states) but more complex to reason about.
  • Task queues with sticky execution: Temporal uses task queues to dispatch activities to workers. It supports “sticky execution” — once a worker starts processing a workflow, subsequent activities in that workflow are preferentially routed to the same worker. This improves cache locality (workflow context is already in memory) and reduces replay overhead.
  • Visibility and search: Temporal provides a built-in visibility layer that indexes workflow metadata (status, type, start time, custom search attributes) into Elasticsearch. Operators can search for workflows across millions of executions. This is the observability layer that is often an afterthought in custom task schedulers.
  • Multi-cluster replication: For disaster recovery, Temporal supports active-passive replication across clusters. Workflows in a failed cluster can be resumed in the standby cluster. This is critical for Uber’s ride-booking workflows — a datacenter failure cannot cause rides to be lost.
  • At Uber’s scale: At peak, Uber’s Cadence deployment handled hundreds of thousands of workflow executions per second across multiple Cassandra clusters totaling petabytes of event history. The system managed everything from ride lifecycle to financial settlement to promotional offer expiration.

Apache Airflow’s Architecture

Airflow, originally created at Airbnb (around 2014) and now an Apache project, is the most widely used open-source task orchestrator for data pipelines:
  • DAG-based scheduling: Airflow represents workflows as Directed Acyclic Graphs (DAGs). Each node is a task (e.g., “extract data from S3”, “run Spark job”, “load into warehouse”) and edges represent dependencies. The scheduler traverses the DAG, executing tasks only when their upstream dependencies are complete.
  • Scheduler architecture: Airflow’s scheduler is a loop that runs every few seconds: (1) parse all DAG files to discover task dependencies, (2) identify tasks whose dependencies are met and whose scheduled time has arrived, (3) place those tasks into a queue (Celery, Kubernetes, or a local executor). The scheduler was historically single-threaded and a bottleneck, but Airflow 2.0+ supports multiple schedulers with database-level locking (similar to the SKIP LOCKED pattern we described above).
  • Executor model: Airflow separates scheduling from execution. The executor is pluggable: CeleryExecutor distributes tasks to a fleet of Celery workers via Redis/RabbitMQ, KubernetesExecutor spins up a new pod for each task (strong isolation but higher overhead), and LocalExecutor runs tasks as subprocesses on the scheduler machine (fine for small deployments).
  • Metadata database (PostgreSQL/MySQL): All task state, DAG definitions, execution history, and scheduling metadata live in a relational database. This is the source of truth. The choice of PostgreSQL or MySQL is deliberate — Airflow needs ACID transactions for state machine transitions, exactly as we discussed in Problem 5.
  • XCom for inter-task communication: Tasks can pass small amounts of data to downstream tasks via “XComs” (cross-communications) stored in the metadata database. For large datasets, tasks write to external storage (S3, GCS) and pass only the reference via XCom. This pattern of passing pointers instead of payloads is a common production optimization.
  • Limitations: Airflow is designed for batch workflows (run this ETL every hour), not for real-time event-driven tasks. It has no native concept of responding to events or running sub-second-latency tasks. For those use cases, Temporal or a custom solution (like the one we designed) is more appropriate.
Choosing between them: Temporal is the better choice when you need durable execution of arbitrary application workflows (user signup flows, order processing, sagas). Airflow is the better choice when you need to orchestrate data pipelines with complex dependencies on a schedule. Our Problem 5 design sits between the two — it handles scheduled tasks (like Airflow) but with a simpler execution model (like Temporal’s activity workers). In an interview, knowing when to use which framework demonstrates architectural judgment.

Summary: The Pattern Across All Five Problems

Every senior-level system design answer follows the same meta-pattern:
  1. Ask before you draw — Requirements clarification is not a formality; it changes the design.
  2. Numbers before architecture — Back-of-envelope estimation reveals whether you need 3 servers or 3,000.
  3. Start with the right abstraction — High-level design captures components and data flow before implementation details.
  4. Go deep on what matters — You cannot deep-dive everything in 45 minutes. Pick the 2-3 components that define the system.
  5. Name the trade-offs explicitly — Every design decision has a cost. Senior engineers articulate both sides.
Practice drill: Take each problem above and change one requirement. What if the URL shortener needs to support 10 billion URLs/day? What if the chat system needs end-to-end encryption? What if the task scheduler needs exactly-once semantics across data centers? Practicing with requirement mutations is how you build the adaptive thinking that aces real interviews.

Think of It This Way

System design interviews are like architectural blueprints — the interviewer wants to see your thinking process, not a finished building. An architect does not walk into a client meeting and immediately start drawing floor plans. They ask: how many people will use this space? What is the budget? Are there zoning constraints? What is the climate? Only after understanding the requirements do they pick up a pencil. Your system design interview should follow the same rhythm: clarify, estimate, sketch, refine.

Here is another way to think about it: a great system design answer is like a map with multiple zoom levels. Start at the continent level (high-level architecture), then zoom into the country level (key components), then the city level (critical algorithms and data models). The interviewer should be able to stop you at any zoom level and feel that you have a coherent picture. Candidates who jump straight to city-level details (arguing about Redis data structures before explaining the overall architecture) lose the forest for the trees.

Deep Dive Resources

These resources will deepen your understanding beyond what any single guide can cover. They are organized by format so you can pick what fits your learning style.
Books

| Resource | Why It Matters |
| --- | --- |
| System Design Interview by Alex Xu (Vol 1 & 2) | The most interview-focused system design resource available. Each chapter walks through a specific system (rate limiter, chat, news feed, etc.) with the same Clarify-Estimate-Design-Deep Dive framework. Volume 1 covers fundamentals; Volume 2 covers more advanced topics like hotel reservation systems and stock exchanges. |
| Designing Data-Intensive Applications by Martin Kleppmann | The definitive deep reference for understanding the building blocks (replication, partitioning, consistency models, stream processing) that underpin every system design. Not interview-focused, but the mental models it builds are irreplaceable. Read this to understand why systems work the way they do. |
| Web Scalability for Startup Engineers by Artur Ejsmont | A practical, end-to-end guide to scaling web applications. Covers caching, async processing, search, and more. Especially useful if you want to understand the full stack rather than individual components. |
Online Guides and Courses

| Resource | Why It Matters |
| --- | --- |
| System Design Primer (github.com/donnemartin/system-design-primer) | A free, comprehensive GitHub repository that covers system design concepts, trade-offs, and practice problems. Think of it as a structured self-study curriculum. The diagrams and summaries are excellent for quick review before interviews. |
| Grokking the System Design Interview (educative.io) | An interactive course that walks through 20+ system design problems step by step. The format (text-based with diagrams, not video) makes it easy to study at your own pace. Widely regarded as the most popular paid resource for system design interview prep. |
| ByteByteGo by Alex Xu (blog.bytebytego.com / YouTube) | Alex Xu’s newsletter and YouTube channel provide visual, concise explanations of system design concepts. The animated diagrams are particularly useful for building intuition about data flow, fan-out patterns, and distributed system behavior. The YouTube channel is free; the newsletter has free and paid tiers. |
Real-World Case Studies

| Resource | Why It Matters |
| --- | --- |
| High Scalability (highscalability.com) | One of the oldest and most comprehensive collections of real-world architecture case studies. Their “This is how X works” posts dissect the architectures of companies like Netflix, Twitter, WhatsApp, and Uber. Reading 5-10 of these gives you a library of real patterns to reference in interviews. |
| InfoQ Architecture Articles (infoq.com/architecture-design) | In-depth articles and conference talks on distributed systems, microservices, and architecture patterns. The content skews more toward practicing architects than interview prep, which makes it excellent for building genuine expertise. |
| Company Engineering Blogs | The best primary sources for how real systems work. Prioritize: Netflix Tech Blog (resilience, streaming), Uber Engineering (real-time systems, Cadence/Temporal), Discord Blog (scaling real-time chat), Meta Engineering (feed systems, TAO, Memcache), Cloudflare Blog (edge computing, rate limiting, DNS). These are the sources that Alex Xu and other authors reference. |
Practice Platforms

| Resource | Why It Matters |
| --- | --- |
| Exponent (tryexponent.com) | Mock interview platform with system design practice, peer feedback, and expert-led sessions. Useful for practicing the verbal communication aspect that reading alone cannot build. |
| Pramp (pramp.com) | Free peer-to-peer mock interviews. You practice with another engineer and take turns as interviewer and candidate. The best way to practice thinking out loud under time pressure. |
| LeetCode System Design (leetcode.com/discuss/interview-question/system-design) | Community-contributed system design discussions. Quality varies, but the top posts have thousands of upvotes for a reason. Good for seeing how other candidates approach the same problems. |

Common Mistakes in System Design Interviews

Even well-prepared candidates fall into these traps. Knowing them in advance gives you a significant edge.
Mistake 1: Jumping straight to the solution

What happens: The interviewer says “Design a chat system” and the candidate immediately starts drawing boxes and arrows. Five minutes later, the interviewer asks “Does this support group chat?” and the entire design needs rework.

Why it is tempting: You want to show confidence and speed. You have practiced this exact problem and want to demonstrate your preparation.

Why it hurts you: The interviewer is evaluating your engineering judgment, not your speed. Jumping to the solution signals that you build first and ask questions later — a dangerous trait in a senior engineer. It also means you are likely designing for the wrong requirements.

The fix: Spend the first 5-8 minutes asking clarifying questions and explicitly stating your assumptions. Write down the functional and non-functional requirements. Get the interviewer to confirm before you draw a single box. This feels slow but actually saves time and earns significant points.
Mistake 2: Over-engineering for imaginary scale

What happens: The candidate designs a URL shortener for Google-scale traffic with a Cassandra cluster, multi-region replication, and a custom sharding layer — when the interviewer specified “a small startup’s internal tool.”

Why it is tempting: You have studied the advanced architectures and want to show you know them. Complexity feels impressive.

Why it hurts you: Over-engineering is a signal of poor judgment. A senior engineer knows that a single PostgreSQL instance can handle thousands of requests per second and will last a startup years. Designing for 10x your actual scale is prudent; designing for 1000x is wasteful. Interviewers specifically watch for this.

The fix: Let the estimation phase drive your architecture. If the numbers say you need 100 QPS, design for 1,000 (10x safety margin) and explain that you would add complexity only when monitoring shows you need it. Mention what you would change at 100x scale, but do not build it into your initial design.
Mistake 3: Designing only the happy path

What happens: The candidate presents a clean, happy-path design. The interviewer asks “What happens if this database goes down?” and the candidate has no answer.

Why it is tempting: Failure handling is messy and hard to think about. The happy path is elegant and easier to explain within the time limit.

Why it hurts you: In production, everything fails. Networks partition. Servers crash. Disks fill up. A design that only works when everything is healthy is not a design — it is a wish. Senior engineers are distinguished by how they think about failure.

The fix: For every major component, ask yourself: “What happens when this fails?” Design for it explicitly. Use terms like “fail-open,” “circuit breaker,” “retry with backoff,” and “dead letter queue.” Even brief mentions of failure handling at each layer signal production experience.
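As one concrete piece of the failure-handling vocabulary above, “retry with backoff” usually means exponential delay with a cap and full jitter, so a fleet of failing clients does not retry in lockstep. A small sketch (the base and cap values are illustrative, not prescriptive):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    Exponential growth (base * 2^attempt), capped at `cap`, with full jitter:
    the actual delay is drawn uniformly from [0, capped_max].
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# attempt 0 waits up to 1s, attempt 3 up to 8s; attempt 10 is capped at 60s
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment, re-creating the overload that caused the failure.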
Mistake 4: Skipping back-of-envelope estimation

What happens: The candidate designs the system without any numbers. They propose “a Redis cache” without knowing whether the dataset fits in memory, or “Cassandra for storage” without knowing if the write volume justifies it.

Why it is tempting: Math feels slow and boring compared to architecture discussion. You worry about making arithmetic errors under pressure.

Why it hurts you: Without numbers, every architectural choice is a guess. The estimation phase takes 3-5 minutes but shapes every decision that follows. Interviewers at top companies (especially Google and Meta) explicitly expect estimation.

The fix: Practice the estimation framework until it is automatic: users per day, requests per second (peak and average), storage per year, bandwidth, cache size. Round aggressively — the goal is order of magnitude, not precision. 1,157 requests/sec and 1,200 requests/sec lead to the same architecture. State your assumptions explicitly so the interviewer can adjust them.
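A worked example of that framework, where every input is an assumption you would state out loud rather than a fact about any particular system:

```python
# Back-of-envelope traffic estimate (all inputs assumed, stated explicitly)
dau = 100_000_000           # daily active users
requests_per_user = 10      # requests per user per day
seconds_per_day = 86_400

avg_rps = dau * requests_per_user / seconds_per_day   # ~11,600 req/s
peak_rps = avg_rps * 3                                # rule of thumb: peak ~ 3x average

# Round aggressively: call it ~12K req/s average, ~35K req/s peak.
# Those orders of magnitude, not the exact digits, drive the architecture.
```

The useful output is the conversation it enables: 35K peak req/s rules out a single application server but is comfortably within reach of a modest load-balanced fleet.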
Mistake 5: Thinking in silence

What happens: The candidate goes quiet for 2-3 minutes, thinking hard, then presents a fully formed component. The interviewer has no window into the reasoning.

Why it is tempting: You want to present a polished answer. Thinking out loud feels vulnerable — what if you say something wrong?

Why it hurts you: The interviewer is buying your thought process, not your final answer. Silence gives them nothing to evaluate. Worse, if you emerge from silence with a flawed design, the interviewer cannot tell whether you considered and rejected the right approach or never thought of it.

The fix: Narrate your thinking. “I am considering two approaches here — fan-out on write and fan-out on read. Let me think about the trade-offs. Fan-out on write gives us fast reads but breaks for celebrity users, so I think a hybrid approach makes more sense for this scale.” Even if you change your mind, the interviewer sees your reasoning and gives you credit for it.
Mistake 6: Depth without breadth

What happens: The candidate spends 20 minutes designing a perfect caching layer with eviction policies, warming strategies, and invalidation protocols — but never addresses the database, the API layer, or the overall data flow.

Why it is tempting: You know caching really well and feel confident deep-diving there. Going deep feels like providing value.

Why it hurts you: System design interviews evaluate breadth AND depth. The interviewer wants to see that you can design a complete system (breadth) and then go deep on 1-2 critical components (depth). Spending all your time on one component signals tunnel vision.

The fix: Allocate your time explicitly. Spend the first 15 minutes on requirements, estimation, and the high-level architecture (all major components, their responsibilities, and how data flows between them). Only then pick 2-3 components to deep-dive on — ideally the ones most critical to the system’s success. Ask the interviewer: “I can go deeper on the caching strategy or the data partitioning scheme. Which would you prefer?” This shows time management and gives the interviewer control.