System Design Practice Problems
These five problems are structured using the Answer Framework: Clarify, Estimate, Design, Deep Dive, and Trade-Offs. Each solution demonstrates the thinking process that separates senior engineers from mid-level candidates.
Problem 1: Design a URL Shortener
Difficulty: Easy | Time Target: 35 minutes
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Core Features | Do we need custom aliases (e.g., short.ly/my-brand)? | Changes the uniqueness and validation logic |
| Expiration | Should URLs expire? User-configurable TTL? | Affects storage strategy and cleanup jobs |
| Analytics | Do we need click tracking (geo, referrer, device)? | Adds an entire analytics pipeline |
| Scale | How many URLs are created per day? Read vs write ratio? | Drives every downstream architecture decision |
| Availability | Is it acceptable for a redirect to fail occasionally? | Determines replication and consistency model |
Requirements You Should Lock Down
Functional:
- Given a long URL, generate a unique short URL
- Given a short URL, redirect (HTTP 301/302) to the original
- Optional: custom aliases, expiration, click analytics
Non-Functional:
- High availability (redirects must not fail)
- Low latency (redirect < 100ms)
- Short URLs should not be guessable (no sequential IDs)
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 4: Deep Dive -- Hash Generation
The Core Problem: How to Generate a Unique 7-Character Key?
This is the most important design decision. Three approaches:
Approach A: MD5/SHA256 Truncation
Approach B: Base62 Counter (Pre-Generated)
Approach C: Key Generation Service (KGS) -- Recommended
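Approaches B and C both come down to encoding a numeric counter (or pre-generated ID) into a short Base62 string. A minimal sketch, assuming a fixed 7-character key and a conventional alphabet order (both illustrative choices):

```python
# Base62 encoding of a numeric counter into a 7-character key (Approach B/C).
# Alphabet ordering and zero-padding are illustrative, not a standard.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int, length: int = 7) -> str:
    """Convert a non-negative counter into a fixed-length Base62 key."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    chars = []
    while n:
        n, rem = divmod(n, 62)       # peel off one Base62 digit at a time
        chars.append(ALPHABET[rem])
    key = "".join(reversed(chars)) or "0"
    return key.rjust(length, "0")    # pad so every key is exactly 7 chars
```

With 7 characters the keyspace is 62^7, roughly 3.5 trillion keys, which is the figure the trade-off table below relies on.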
Database Choice
| Criteria | SQL (PostgreSQL) | NoSQL (DynamoDB/Cassandra) |
|---|---|---|
| Schema | Fixed, well-defined | Flexible but unnecessary here |
| Read pattern | Key lookup (fast with index) | Key lookup (native strength) |
| Write pattern | Single row insert | Single row insert |
| Scale | Sharding is manual | Auto-sharding built-in |
| Consistency | Strong by default | Eventual (configurable) |
| Verdict | Good for < 1B records | Better for 100B+ records |
Caching Strategy
- Cache-aside with Redis: Check cache first, on miss read from DB and populate cache
- TTL: Set cache TTL to match URL expiration (or 24 hours for non-expiring URLs)
- Eviction: LRU — frequently accessed URLs stay hot, long-tail URLs get evicted
- Cache size: ~12 GB for 20% of daily URLs (fits a single Redis instance)
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Hash method | KGS (pre-generated keys) | MD5 truncation | Zero runtime collisions, no coordination overhead |
| Database | NoSQL (DynamoDB) | PostgreSQL | Simple key-value at extreme scale |
| Redirect code | 302 (temporary) | 301 (permanent) | Preserves analytics; 301 causes browser caching |
| Cache | Redis cache-aside | Write-through | Read-heavy system benefits from cache-aside simplicity |
| Short URL length | 7 characters | 6 characters | 3.5T keyspace vs 56B; future-proofing is cheap |
What Makes This Answer Senior-Level
You asked about 301 vs 302
You calculated before designing
You compared three hash approaches with trade-offs
You addressed the analytics pipeline separately
How the Real System Works
Bitly’s Architecture
Bitly processes billions of shortens and redirects. Their architecture reflects a decade of scaling lessons:
- Key generation: Bitly uses a counter-based approach with Base62 encoding, allocating counter ranges to application servers to avoid coordination overhead. Each server gets a block of IDs and increments locally, only requesting a new block when the current one is exhausted.
- Storage: They migrated from MySQL to a distributed data store. Their link mapping data is sharded by the hash of the short URL key, ensuring even distribution across nodes.
- Caching: Bitly uses a multi-layer caching strategy. Popular links (the top 1-2% by click volume) are cached in memory on the application servers themselves. A shared Redis layer handles the next tier. The database is the fallback of last resort. Given the extreme read-to-write ratio, cache hit rates exceed 99%.
- Analytics pipeline: Click data flows through Kafka into a real-time processing pipeline. They decouple the redirect path (which must be sub-100ms) from analytics aggregation entirely. Click events are enriched asynchronously with geo, device, and referrer data.
- Global presence: Bitly uses Anycast DNS and edge servers in multiple regions to ensure redirects are fast worldwide. A redirect request in Tokyo does not need to round-trip to a US data center.
TinyURL’s Original Design
TinyURL, one of the earliest URL shorteners (launched around 2002), took a simpler approach that is instructive as a baseline:
- Single MySQL database: All short-to-long URL mappings lived in a single MySQL instance with a B-tree index on the short key. For years, this was sufficient because the total dataset was small enough to fit in RAM.
- Sequential IDs: TinyURL used auto-incrementing MySQL IDs converted to a short alphanumeric string. This is the simplest possible approach and works until you need non-guessable URLs or multi-datacenter writes.
- No analytics: The original TinyURL had no click tracking at all. This dramatically simplified the architecture — a redirect was a single database lookup and an HTTP 301 response.
- Lesson: TinyURL proves that the simplest possible design can serve millions of users if the access pattern is simple enough (key-value lookup). The complexity in Bitly’s architecture exists because of analytics, custom domains, enterprise features, and global scale — not because URL shortening itself is hard.
Problem 2: Design a Rate Limiter
Difficulty: Medium | Time Target: 40 minutes
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Scope | Client-side or server-side rate limiting? | Client-side is unreliable; focus on server-side |
| Granularity | Per-user, per-API endpoint, per-IP, or global? | Determines key structure in the rate limit store |
| Response | Should we return rate limit headers (X-RateLimit-*)? | Industry standard; shows HTTP expertise |
| Throttling | Hard limit (reject) or soft limit (queue/slow down)? | Changes architecture from reject to backpressure |
| Distributed | Single server or distributed across multiple servers? | Single-server is trivial; distributed is the real problem |
| Rules | Who configures the rate limit rules? Dynamic or static? | Affects whether you need a rules engine or config file |
Requirements You Should Lock Down
Functional:
- Rate limit requests based on configurable rules (user ID, IP, API key)
- Support multiple rate limit tiers (free: 10 req/min, pro: 1000 req/min)
- Return proper HTTP 429 with a Retry-After header when the limit is exceeded
- Rate limit rules can be updated without redeployment
Non-Functional:
- Very low latency (added overhead < 1ms per request)
- Must work across distributed servers (shared state)
- Highly available — if the rate limiter fails, requests should pass through (fail-open)
Phase 2: Estimate Scale
Back-of-Envelope Calculation
Phase 3: High-Level Design
Phase 4: Deep Dive -- Algorithms
Comparing Four Rate Limiting Algorithms
Algorithm 1: Fixed Window Counter
Algorithm 2: Sliding Window Log
Algorithm 3: Sliding Window Counter (Recommended)
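The sliding window counter can be sketched with two fixed-window counters: the previous window's count is weighted by how much of it still overlaps the sliding window. A single-process sketch (in production both counters would live in Redis):

```python
# Sliding window counter (Algorithm 3): estimate the request count in the
# last `window` seconds as current_count + previous_count * overlap_fraction.
def allowed(counts: dict, key: str, limit: int, window: float, now: float) -> bool:
    cur_win = int(now // window)
    cur = counts.get((key, cur_win), 0)          # count in the current window
    prev = counts.get((key, cur_win - 1), 0)     # count in the previous window
    # fraction of the previous window still inside the sliding window
    overlap = 1.0 - (now % window) / window
    if cur + prev * overlap >= limit:
        return False                             # over the estimated limit
    counts[(key, cur_win)] = cur + 1
    return True
```

The estimate assumes requests were evenly spread across the previous window, which is the small accuracy trade-off this algorithm makes for its tiny memory footprint.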
Algorithm 4: Token Bucket
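A token bucket sketch: tokens refill at a steady rate up to a fixed capacity, and each request spends one token, so short bursts up to the capacity are allowed. Capacity and refill values here are illustrative:

```python
# Token bucket (Algorithm 4): lazy refill on each call, one token per request.
class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)    # start full: bursts allowed immediately
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # add the tokens that accrued since the last call, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The current time is passed in rather than read from a clock so the behavior is deterministic and easy to test.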
Distributed Rate Limiting: The Hard Part
The real challenge is making this work across multiple servers.
Problem: If you have 10 API servers, each checking Redis independently, race conditions cause over-counting or under-counting.
Solution: Lua Script in Redis (Atomic Operations)
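A sketch of the kind of script meant here, using a simple fixed-window INCR (key layout and arguments are illustrative). Redis executes a Lua script atomically, so ten API servers calling it concurrently cannot race between the read and the increment. A pure-Python mirror of the script's logic is included so the behavior can be seen without a Redis instance:

```python
# Atomic fixed-window check-and-increment as a Redis Lua script (a sketch).
# With redis-py this would be loaded via Script/EVAL; shown here as text.
LUA_FIXED_WINDOW = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current <= tonumber(ARGV[2]) and 1 or 0
"""

# Pure-Python mirror of the script's logic, for illustration only.
# A dict stands in for Redis; real deployments run the Lua above so the
# increment and the limit check happen in one atomic step.
def fixed_window_allow(store: dict, key: str, limit: int) -> bool:
    store[key] = store.get(key, 0) + 1    # INCR
    return store[key] <= limit
```

The window expiry (the EXPIRE call) is what resets the counter each window; the Python mirror omits it since a dict has no TTLs.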
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Algorithm | Sliding window counter | Token bucket | Better precision with minimal memory; token bucket is also excellent for APIs needing burst tolerance |
| Placement | API Gateway | Application middleware | Centralized enforcement, language-agnostic, single place to configure rules |
| State store | Redis Cluster | Local in-memory | Must work across all servers; local state gives inconsistent enforcement |
| Failure mode | Fail-open | Fail-closed | Availability > abuse protection; combine with secondary detection for abuse |
| Atomicity | Redis Lua scripts | Distributed locks | Lua scripts are atomic in Redis, simpler than external locks, sub-millisecond |
What Makes This Answer Senior-Level
You compared four algorithms with concrete trade-offs
You addressed the distributed race condition explicitly
You discussed fail-open vs fail-closed
You mentioned clock skew
How the Real System Works
Cloudflare’s Rate Limiting at Scale
Cloudflare handles tens of millions of HTTP requests per second across 300+ data centers. Their rate limiting operates at a scale that most engineers never encounter:
- Edge-first enforcement: Rate limiting happens at Cloudflare’s edge servers, not at a centralized service. Each data center runs its own rate limiting logic. This is critical because sending every request to a central rate limiter would add unacceptable latency and create a single point of failure.
- Sliding window counters: Cloudflare uses a sliding window approach similar to Algorithm 3 above. They chose this because it provides good accuracy with minimal memory per client, which matters when you are tracking millions of unique IPs and API keys simultaneously.
- Approximate counting across data centers: Since rate limiting is distributed, a user hitting multiple Cloudflare data centers (due to Anycast routing) could theoretically exceed their global limit. Cloudflare handles this with periodic synchronization between edge nodes — they trade perfect accuracy for speed. The result is rate limits that are “approximately correct” within a small margin, which is acceptable for virtually all use cases.
- Configurable actions: When a rate limit is exceeded, Cloudflare supports multiple responses: block (HTTP 429), challenge (show a CAPTCHA), JS challenge (verify browser), or simulate (log only, for testing rules before enforcement). This flexibility is a great interview detail to mention when discussing rate limiter behavior.
- IP reputation scoring: Beyond simple rate counting, Cloudflare layers in threat intelligence. An IP with a known bad reputation gets a lower effective rate limit. This hybrid approach (rate limiting + reputation) is more effective than either technique alone.
Stripe’s Rate Limiting API
Stripe’s approach to rate limiting is worth studying because they operate from the API provider’s perspective — they must protect their own infrastructure while giving developers a good experience:
- Token bucket algorithm: Stripe uses a token bucket implementation for their API rate limits. Each API key gets a bucket with a defined capacity (e.g., 100 requests per second for live mode). Tokens refill at a steady rate. This allows short bursts (useful when a merchant processes a batch of charges) while enforcing a sustained rate limit.
- Tiered limits by endpoint: Not all API endpoints have the same rate limit. Creating a payment intent might be limited to 100/sec, while listing events could allow 1000/sec. Stripe separates read-heavy endpoints from write-heavy ones because the cost to Stripe’s backend is different.
- Graceful headers: Stripe returns RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers on every response (not just 429s). This allows well-behaved clients to self-throttle before hitting the limit. They also return a Retry-After header on 429 responses with a specific number of seconds to wait.
- Idempotency keys for retries: Stripe’s rate limiting is designed to work hand-in-hand with their idempotency key system. When a client gets rate-limited and retries, the idempotency key ensures the retry does not create a duplicate charge. This is a subtle but important integration point between rate limiting and application semantics.
- Load shedding at scale: When Stripe’s systems are under extreme load, they implement progressive load shedding — first throttling lower-priority traffic (webhooks, list operations) before restricting payment-critical paths. This priority-based approach is more sophisticated than a flat rate limit and is worth mentioning in interviews when discussing production-grade systems.
Problem 3: Design a Chat System
Difficulty: Medium | Time Target: 45 minutes
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Type | 1-to-1 only, or group chat too? Max group size? | Group chat introduces fan-out complexity |
| Features | Online/offline status? Read receipts? Typing indicators? | Each adds real-time event complexity |
| Media | Text only, or images/files/voice? | Media requires object storage + CDN pipeline |
| History | How far back should chat history go? | Determines storage volume and partitioning |
| Delivery | What delivery guarantees? At-least-once? Exactly-once? | Drives dedup and acknowledgment protocol |
| Scale | How many concurrent users? Messages per day? | Determines WebSocket server capacity |
Requirements You Should Lock Down
Functional:
- 1-to-1 messaging with delivery confirmation
- Group chat (up to 500 members)
- Online/offline presence indicators
- Message history (persistent, searchable)
- Push notifications for offline users
Non-Functional:
- Real-time delivery (< 200ms for online users)
- Message ordering guarantees (per-conversation)
- 99.99% message delivery (no lost messages)
- Support 50M concurrent connections
Phase 2: Estimate Scale
Back-of-Envelope Calculation
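The original figures are not reproduced here; a hedged recomputation from numbers cited in this solution (2B messages/day, 50M concurrent connections). The peak multiplier, per-server connection capacity, and average message size are assumptions, stated so they can be defended:

```python
# Back-of-envelope for the chat system. Inputs from the requirements/solution;
# peak multiplier, per-server capacity, and message size are assumptions.
messages_per_day = 2_000_000_000
avg_qps = messages_per_day // 86_400        # ~23K messages/sec on average
peak_qps = avg_qps * 3                      # assume peak = 3x average

conns = 50_000_000
conns_per_server = 500_000                  # assumed WebSocket capacity per box
ws_servers = conns // conns_per_server      # ~100 chat servers

avg_msg_bytes = 100                         # assumed average message size
storage_per_day_gb = messages_per_day * avg_msg_bytes / 1e9   # ~200 GB/day
```

The point of the exercise is the order of magnitude: tens of thousands of writes per second and roughly a hundred connection servers, which is what drives the Cassandra and WebSocket-fleet choices below.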
Phase 4: Deep Dive -- Three Critical Components
Deep Dive 1: WebSocket vs Long-Polling
WebSocket (Recommended)
Long-Polling (Fallback)
Deep Dive 2: Message Storage and Partitioning
Database choice: Cassandra or ScyllaDB
Why: Write-heavy workload (2B messages/day), time-series-like access pattern (fetch recent messages for a conversation), horizontal scaling built-in.
Deep Dive 3: Online Presence (Heartbeat Mechanism)
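A minimal heartbeat-based presence sketch: each connected client pings on a fixed interval, and a user is considered online if their last ping is recent enough to tolerate a couple of missed heartbeats. The interval and timeout values are illustrative; in production last-seen timestamps would live in Redis rather than a local dict:

```python
# Heartbeat presence: clients ping every HEARTBEAT_S seconds; a user is
# "online" if we heard from them within OFFLINE_AFTER_S seconds.
HEARTBEAT_S = 5
OFFLINE_AFTER_S = 3 * HEARTBEAT_S   # tolerate ~2 missed heartbeats

last_seen: dict[str, float] = {}    # user_id -> timestamp of last heartbeat

def on_heartbeat(user_id: str, now: float) -> None:
    last_seen[user_id] = now

def is_online(user_id: str, now: float) -> bool:
    ts = last_seen.get(user_id)
    return ts is not None and now - ts <= OFFLINE_AFTER_S
```

This is the lazy-evaluation style chosen in the trade-off table: presence is computed when someone asks, rather than fanned out to all friends on every change.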
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Transport | WebSocket | Long-polling | Real-time bidirectional; fall back to long-polling when WS unavailable |
| Message DB | Cassandra | PostgreSQL | Write-heavy, time-series access, horizontal scale |
| Inter-server messaging | Redis Pub/Sub | Direct RPC | Decouples chat servers; Pub/Sub handles routing naturally |
| Presence | Lazy evaluation + heartbeat | Active fan-out | Fan-out to all friends is prohibitively expensive at scale |
| Message ordering | Per-conversation ordering | Global ordering | Global ordering is unnecessary and impossible to scale; conversations are independent |
| Group message delivery | Write message once, fan-out reads | Write to each member’s inbox | Saves storage; read fan-out is cheaper for groups < 500 |
What Makes This Answer Senior-Level
You identified the fan-out problem for groups
You designed lazy presence instead of active push
You addressed message ordering as per-conversation
You planned for the offline path
How the Real System Works
WhatsApp: 100B+ Messages/Day with ~50 Engineers
WhatsApp is the gold standard for doing more with less. At the time of the Facebook acquisition (2014), WhatsApp handled 50+ billion messages per day with roughly 50 engineers. Today that number exceeds 100 billion. Their architecture is a masterclass in simplicity:
- Erlang/OTP: WhatsApp’s backend is built on Erlang, a language designed for telecom systems that need massive concurrency and fault tolerance. A single Erlang server can handle 2-3 million concurrent connections because Erlang processes are lightweight (a few hundred bytes each, compared to threads that cost megabytes). This is why 50 engineers could operate what would require hundreds of engineers in a Java or Python stack.
- Message storage philosophy: WhatsApp treats the server as a delivery mechanism, not a storage system. Messages are stored on the server only until they are delivered to the recipient’s device. Once the recipient acknowledges receipt, the message is deleted from the server. This “transient store” approach dramatically reduces storage requirements and simplifies the architecture.
- XMPP-based protocol (modified): WhatsApp started with the XMPP messaging protocol and heavily modified it for mobile efficiency. Their custom protocol minimizes battery drain and data usage — critical for WhatsApp’s user base in markets with limited data plans.
- No read receipts stored server-side: The blue check marks (read receipts) are end-to-end signals between devices. The server facilitates delivery but does not store read state. This is another example of pushing complexity to the edges and keeping the server simple.
- Mnesia for routing: WhatsApp uses Erlang’s built-in Mnesia database for connection routing — mapping which user is connected to which server. Mnesia is an in-memory distributed database that replicates across Erlang nodes, giving sub-millisecond lookups without an external dependency like Redis.
- FreeBSD over Linux: WhatsApp chose FreeBSD for their servers, partly because of superior network stack performance for their specific workload (millions of concurrent TCP connections). This is an unconventional choice that paid off — it shows that sometimes the right tool is not the popular tool.
Discord’s Message Storage
Discord’s approach is interesting because they had to solve a different problem than WhatsApp — Discord needs persistent, searchable message history (channels can have millions of messages spanning years):
- Migration from MongoDB to Cassandra: Discord initially stored messages in MongoDB. As they scaled, MongoDB’s single-primary-per-shard model created write bottlenecks for hot channels. They migrated to Cassandra, which distributes writes evenly across nodes. The partition key is (channel_id, bucket) where bucket is a time window, preventing any single partition from growing unbounded.
- Bucket strategy: Each bucket covers a fixed time period (approximately 10 days of messages). When a user scrolls back through history, Discord fetches the appropriate bucket. New messages always write to the current bucket. This means the “hot” data (recent messages) is always on a small number of partitions, keeping reads fast.
- Data services layer: Discord added a data services layer between their API and Cassandra. This layer handles request coalescing — if 1,000 users in the same channel all request the same messages at the same time (e.g., after a server outage), the data services layer deduplicates these into a single Cassandra read and fans the result back to all requesters. Without this, a thundering herd could overwhelm Cassandra.
- ScyllaDB migration: Discord later migrated from Cassandra to ScyllaDB (a C++ reimplementation of Cassandra’s protocol) for better tail latency performance. Cassandra’s JVM garbage collection pauses caused p99 latency spikes that affected user experience. ScyllaDB, being C++ with a shard-per-core architecture, eliminated GC pauses entirely.
Problem 4: Design a News Feed
Difficulty: Hard | Time Target: 45 minutes
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Content | Text posts only, or images/videos too? | Media adds CDN and transcoding complexity |
| Feed type | Chronological or ranked (algorithmic)? | Ranked feeds need an ML scoring service |
| Social graph | Follow model (Twitter) or friend model (Facebook)? | Follow is asymmetric and simpler; friends is symmetric |
| Celebrity problem | Do some users have millions of followers? | This single question changes the entire fan-out strategy |
| Real-time | Should the feed update in real-time or on refresh? | Real-time adds WebSocket/SSE complexity |
| Scale | DAU? Average follows per user? Posts per day? | Drives the fan-out vs fan-in decision |
Requirements You Should Lock Down
Functional:
- Users create posts (text + images/video)
- Users follow other users (asymmetric follow model)
- Home feed shows posts from followed users, ranked by relevance
- Support likes, comments, shares (interaction signals)
Non-Functional:
- Feed generation latency < 500ms
- Support 500M DAU
- Handle celebrity accounts (10M+ followers)
- Feed should feel “fresh” (updates within minutes of posting)
Phase 2: Estimate Scale
Back-of-Envelope Calculation
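The original figures are not reproduced here; a hedged recomputation from the 500M DAU requirement above. The posting rate and feed-refresh rate are assumptions, stated so they can be defended:

```python
# Back-of-envelope for the news feed. 500M DAU comes from the requirements;
# the 10% posting rate and 5 refreshes/user/day are assumptions.
dau = 500_000_000
posts_per_day = dau // 10                    # assume 10% of users post daily
write_qps = posts_per_day // 86_400          # ~578 posts/sec

feed_loads = dau * 5                         # assume 5 feed refreshes per user/day
read_qps = feed_loads // 86_400              # ~29K feed reads/sec
read_write_ratio = read_qps // max(write_qps, 1)   # ~50:1, strongly read-heavy
```

The roughly 50:1 read-to-write ratio is the number that justifies pre-computing feeds (fan-out on write) for most users.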
Phase 4: Deep Dive -- Feed Ranking and Caching
Deep Dive 1: The Hybrid Fan-Out Strategy
1. Post arrives at Post Service
2. Check follower count
3. Fan-out workers process asynchronously
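The hybrid decision in the steps above can be sketched as follows. A dict of lists stands in for the Redis feed cache, and the 10K-follower threshold matches the trade-off table below; everything else is illustrative:

```python
# Hybrid fan-out: push post IDs to follower feeds for normal users,
# store once and merge at read time for celebrities.
CELEBRITY_THRESHOLD = 10_000

feeds: dict[str, list[str]] = {}            # user_id -> cached post IDs (push)
celebrity_posts: dict[str, list[str]] = {}  # author_id -> post IDs (pull)

def publish(author: str, post_id: str, followers: list[str]) -> None:
    if len(followers) >= CELEBRITY_THRESHOLD:
        # celebrity path: one write; merged into feeds at read time
        celebrity_posts.setdefault(author, []).append(post_id)
    else:
        # normal path: push the post ID into each follower's cached feed
        for f in followers:
            feeds.setdefault(f, []).append(post_id)

def read_feed(user: str, followed_celebrities: list[str]) -> list[str]:
    merged = list(feeds.get(user, []))
    for c in followed_celebrities:          # pull celebrity posts on read
        merged.extend(celebrity_posts.get(c, []))
    return merged                           # ranking would sort this list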
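The hybrid decision in the steps above can be sketched as follows. A dict of lists stands in for the Redis feed cache, and the 10K-follower threshold matches the trade-off table below; everything else is illustrative:

```python
# Hybrid fan-out: push post IDs to follower feeds for normal users,
# store once and merge at read time for celebrities.
CELEBRITY_THRESHOLD = 10_000

feeds: dict[str, list[str]] = {}            # user_id -> cached post IDs (push)
celebrity_posts: dict[str, list[str]] = {}  # author_id -> post IDs (pull)

def publish(author: str, post_id: str, followers: list[str]) -> None:
    if len(followers) >= CELEBRITY_THRESHOLD:
        # celebrity path: one write; merged into feeds at read time
        celebrity_posts.setdefault(author, []).append(post_id)
    else:
        # normal path: push the post ID into each follower's cached feed
        for f in followers:
            feeds.setdefault(f, []).append(post_id)

def read_feed(user: str, followed_celebrities: list[str]) -> list[str]:
    merged = list(feeds.get(user, []))
    for c in followed_celebrities:          # pull celebrity posts on read
        merged.extend(celebrity_posts.get(c, []))
    return merged                           # ranking would sort this list
```

Note that only post IDs move through the feed cache; post content is hydrated separately, which is the Redis-sorted-set design from the trade-off table.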
Deep Dive 2: Feed Ranking
A pure chronological feed is simple but produces poor engagement. A ranked feed considers multiple signals:
Deep Dive 3: Caching and CDN for Media
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Fan-out strategy | Hybrid (push for normal, pull for celebrities) | Pure push or pure pull | Pure push breaks for celebrities; pure pull is too slow for reads |
| Celebrity threshold | 10K followers | Static vs dynamic | Start with static threshold, evolve to dynamic based on system load |
| Feed storage | Redis sorted sets (post IDs only) | Store full post content | IDs are tiny; fetch content separately with post cache hit rates > 99% |
| Ranking | Lightweight scoring at read time | Pre-computed ranked feeds | Ranking signals change in real-time (new likes); pre-ranking gets stale |
| Media delivery | CDN + object storage (S3) | Serve from app servers | Media is the bandwidth bottleneck; CDN reduces latency and server load by 90%+ |
| Inactive users | Skip fan-out, generate on demand | Fan-out to everyone | 60% of users do not check their feed daily; skipping saves enormous write volume |
What Makes This Answer Senior-Level
You immediately identified the celebrity fan-out problem
You proposed a hybrid approach instead of picking one extreme
You separated the ranking layer from the data layer
You optimized for inactive users
How the Real System Works
Facebook’s News Feed Ranking Evolution
Facebook’s News Feed is perhaps the most studied feed system in the world. Its evolution from simple to sophisticated mirrors the journey every feed system takes:
- 2006 — Chronological feed: The original News Feed was purely chronological. If your friend posted, it appeared in your feed sorted by time. Simple, but as users added hundreds of friends, the feed became noisy and engagement dropped.
- 2009 — EdgeRank: Facebook introduced EdgeRank, a relatively simple scoring formula: Score = Affinity x Weight x Decay. Affinity measured how close you were to the poster (based on interactions). Weight reflected the content type (photos ranked higher than text). Decay was a time-based exponential falloff. EdgeRank was essentially the formula we described in the ranking section above.
- 2013+ — Machine learning replaces EdgeRank: Facebook replaced the hand-tuned formula with a machine learning model that considers thousands of signals. The model predicts the probability that you will engage with (like, comment, share, click) each candidate post. Posts are ranked by predicted engagement score.
- Fan-out architecture: Facebook uses a hybrid fan-out model very similar to what we designed above. Normal users get fan-out-on-write (their posts are pushed to followers’ pre-computed feeds). Celebrities and pages with millions of followers use fan-out-on-read (their posts are merged at feed-fetch time). The threshold is dynamic and adjusts based on system load.
- Aggregator service: Facebook’s feed generation pipeline has a dedicated aggregation service (called “Multifeed”) that merges posts from multiple sources: the pre-computed feed cache, celebrity posts fetched on-read, ads injected by the ad auction, and “story bumping” (resurfacing older posts that are getting new comments). This merge step is where the ranking model runs.
- Feed cache (TAO + Memcache): Facebook built a custom distributed cache called TAO (The Associations and Objects cache) specifically for social graph data. Feed data lives in a combination of TAO, Memcache, and MySQL. The caching hierarchy is: L1 (in-process cache on the web server) to L2 (Memcache cluster in the same region) to L3 (TAO for social graph queries) to L4 (MySQL as source of truth).
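The EdgeRank-style formula mentioned above (Score = Affinity x Weight x Decay) can be sketched in a few lines. The content-type weights and the 24-hour half-life are illustrative values, not Facebook's actual parameters:

```python
# EdgeRank-style scoring: Score = Affinity x Weight x Decay.
# Weights and half-life are illustrative, not real Facebook values.
CONTENT_WEIGHT = {"photo": 1.5, "link": 1.0, "text": 0.8}

def edgerank(affinity: float, content_type: str, age_hours: float,
             half_life_hours: float = 24.0) -> float:
    decay = 0.5 ** (age_hours / half_life_hours)   # exponential time falloff
    return affinity * CONTENT_WEIGHT[content_type] * decay
```

Even this toy version shows the key behavior: a fresh photo from a close friend outranks an old link from a distant acquaintance, which is exactly what the hand-tuned formula was designed to do before ML replaced it.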
Twitter’s Timeline Architecture
Twitter’s timeline system is a fascinating case study because they famously changed their approach:
- Original approach (fan-out on write): For years, Twitter used aggressive fan-out-on-write. When you tweeted, a fleet of workers would push your tweet ID into the timelines of all your followers stored in Redis. This worked well because reads were instant — opening Twitter just meant reading a pre-computed list from Redis.
- The celebrity problem: This approach broke for users like Lady Gaga or Barack Obama with 30M+ followers. A single tweet triggered 30 million Redis writes. During events like the Super Bowl or elections, the fan-out queue would back up by minutes or even hours, meaning some users would see tweets with significant delay.
- Migration to hybrid: Twitter moved to a hybrid model around 2012-2013. Most users (those with fewer followers than a dynamic threshold) still get fan-out-on-write. High-follower accounts are excluded from fan-out. When you open your timeline, Twitter’s timeline service merges your pre-computed timeline with recent tweets from the high-follower accounts you follow.
- Timeline cache: Twitter stores timelines in a massive Redis cluster. Each user’s timeline is a Redis list of tweet IDs (not full tweet content). At read time, the IDs are hydrated by fetching the tweet content from a separate Tweet Store service. This separation means the timeline cache stays small (just IDs and scores) while tweet content is cached independently.
- Ranking and relevance: Twitter introduced algorithmic ranking (“top tweets”) alongside the chronological timeline. The ranking model considers engagement signals, recency, your past interactions with the author, and content type. Users can toggle between ranked and chronological views — a product decision that has architectural implications (you need to support both code paths).
Problem 5: Design a Distributed Task Scheduler
Difficulty: Hard | Time Target: 45 minutes
Phase 1: Clarify Requirements
Questions You Should Ask the Interviewer
| Category | Question | Why It Matters |
|---|---|---|
| Task types | One-time, delayed, recurring (cron), or all three? | Recurring tasks need a fundamentally different scheduling model |
| Execution | At-most-once, at-least-once, or exactly-once? | Determines dedup strategy and acknowledgment protocol |
| Priority | Do tasks have priority levels? | Priority requires a priority queue, not a FIFO queue |
| Payload | What is a task? Just a function name + args, or arbitrary code? | Arbitrary code execution is a security and sandboxing problem |
| Duration | How long can a task run? Timeout? | Long-running tasks need heartbeats and lease renewal |
| Scale | How many tasks per day? Concurrent workers? | Drives queue partitioning and worker scaling decisions |
| Failure | What happens when a task fails? Retry policy? Dead letter? | Failure handling is where most task schedulers get complex |
Requirements You Should Lock Down
Functional:
- Schedule one-time tasks for immediate or delayed execution
- Schedule recurring tasks using cron expressions
- Retry failed tasks with configurable backoff (max 3 retries)
- Dead-letter queue for tasks that exceed retry limit
- Task status tracking: pending, scheduled, running, completed, failed, dead
- Cancel or pause scheduled/recurring tasks
Non-Functional:
- At-least-once execution guarantee (with idempotency support)
- Handle 10M task executions per day
- Scheduling accuracy: within 1 second of target time
- Horizontal scaling of both scheduler and workers
- No single point of failure
Phase 2: Estimate Scale
Back-of-Envelope Calculation
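The original figures are not reproduced here; a hedged recomputation from the 10M executions/day requirement above. The peak multiplier and per-task row size are assumptions:

```python
# Back-of-envelope for the task scheduler. 10M/day is from the requirements;
# the peak multiplier and row size are assumptions.
tasks_per_day = 10_000_000
avg_tps = tasks_per_day // 86_400        # ~115 executions/sec on average
peak_tps = avg_tps * 5                   # assume peak = 5x average (~575/sec)

row_bytes = 1_000                        # assumed task row: payload + metadata
storage_per_year_gb = tasks_per_day * 365 * row_bytes / 1e9   # ~3.65 TB/year
```

A few hundred executions per second at peak is well within a single PostgreSQL instance's write capacity, which supports the "start with PostgreSQL" choice in the trade-off table.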
Phase 4: Deep Dive -- Three Critical Components
Deep Dive 1: Task State Machine
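The six task states from the requirements can be encoded as a transition table, with any move not listed in the table rejected. The exact edge set, and the cancelled state implied by the cancel feature, are my reading rather than a spec:

```python
# Task state machine as a transition table. States come from the requirements;
# the edges (and the "cancelled" state) are one reasonable interpretation.
TRANSITIONS = {
    "pending":   {"scheduled", "cancelled"},
    "scheduled": {"running", "cancelled"},
    "running":   {"completed", "failed"},
    "failed":    {"scheduled", "dead"},   # retry, or give up to the DLQ
    "completed": set(),                   # terminal
    "dead":      set(),                   # terminal
}

def transition(state: str, new_state: str) -> str:
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"invalid transition: {state} -> {new_state}")
    return new_state
```

In the database this check runs inside the UPDATE (e.g., a WHERE clause on the current state), so a stale worker cannot resurrect a completed task.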
Every task follows a strict state machine. Invalid transitions must be rejected.
Database Schema
Deep Dive 2: Distributed Locking for Task Pickup
The critical problem: multiple workers must not pick up the same task.
Approach A: Database-Level Locking (Simple, Recommended to Start)
Approach B: Redis-Based Distributed Lock (For Higher Scale)
Lease-Based Execution
- Worker acquires a lease (lock with TTL) on the task
- Worker must renew the lease periodically (heartbeat) if the task runs long
- If the worker crashes, the lease expires and another worker can pick up the task
- When the task completes, the worker releases the lease and updates state
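The lease lifecycle above can be sketched with a dict standing in for Redis (where the real primitive would be SET with NX and PX options). The TTL value is illustrative:

```python
# Lease-based task pickup: a lock with a TTL that the worker must renew.
LEASE_TTL = 30.0   # seconds; illustrative

leases: dict[str, tuple[str, float]] = {}   # task_id -> (worker_id, expires_at)

def acquire(task_id: str, worker: str, now: float) -> bool:
    holder = leases.get(task_id)
    if holder is None or holder[1] <= now:       # free, or previous holder died
        leases[task_id] = (worker, now + LEASE_TTL)
        return True
    return False

def renew(task_id: str, worker: str, now: float) -> bool:
    holder = leases.get(task_id)
    if holder and holder[0] == worker and holder[1] > now:
        leases[task_id] = (worker, now + LEASE_TTL)
        return True
    return False   # lease already lost; the worker must stop work immediately
```

The renew-or-stop rule is what handles the zombie-worker case: a crashed or paused worker silently loses its lease, and the next acquire succeeds.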
Deep Dive 3: Failure Handling and Idempotency
Retry with Exponential Backoff
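A minimal sketch of exponential backoff with full jitter; the base delay and cap are illustrative defaults, not values the text specifies:

```python
import random

# Exponential backoff with full jitter for the retry policy (max 3 retries).
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Delay before retry `attempt` (1-based): min(cap, base * 2^(attempt-1)),
    drawn uniformly from [0, delay] to avoid synchronized retry storms."""
    delay = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0.0, delay)
```

The jitter matters as much as the exponent: without it, every task that failed in the same outage retries at the same instant and re-creates the overload.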
Idempotency via Idempotency Keys
The idempotency_key field ensures the same logical task is not created twice:
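A dedup sketch: submitting the same logical task twice returns the original row instead of inserting a duplicate. In PostgreSQL this maps to a UNIQUE constraint on idempotency_key with an ON CONFLICT clause; here a dict stands in for the table:

```python
# Idempotency-key dedup: the second create with the same key is a no-op
# that returns the existing task. A dict stands in for the tasks table.
tasks_by_key: dict[str, dict] = {}

def create_task(idempotency_key: str, payload: dict) -> dict:
    existing = tasks_by_key.get(idempotency_key)
    if existing is not None:
        return existing                  # duplicate submit: no second task
    task = {"id": len(tasks_by_key) + 1, "payload": payload}
    tasks_by_key[idempotency_key] = task
    return task
```

Combined with at-least-once delivery, this is what makes retries safe: a redelivered "create task" message cannot schedule the work twice.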
Phase 5: Trade-Offs Discussed
| Decision | Choice Made | Alternative | Why |
|---|---|---|---|
| Task store | PostgreSQL | Cassandra | Need strong consistency + transactions for state machine; Cassandra’s eventual consistency causes double execution |
| Task queue | Redis sorted sets | SQS / RabbitMQ | Redis gives precise scheduled execution time ordering; managed queues add latency |
| Locking | DB-level SKIP LOCKED (start), Redis leases (scale) | ZooKeeper | Start simple with DB; graduate to Redis. ZooKeeper is operationally heavy for this use case |
| Execution guarantee | At-least-once + idempotency | Exactly-once | True exactly-once is impossible; at-least-once with idempotent tasks is the practical standard |
| Scheduler HA | Leader election (single active scheduler) | Multiple schedulers with distributed locking | Single scheduler is simpler; leader election via PostgreSQL advisory lock or Redis Redlock |
| Recurring tasks | Scheduler computes next_run_at after each execution | Worker computes next run | Scheduler owns the schedule; workers just execute. Separation of concerns. |
What Makes This Answer Senior-Level
You defined a rigorous state machine
You addressed the zombie task problem
You explained why exactly-once is impossible and offered the practical alternative
You designed for progressive scaling
How the Real System Works
Uber’s Cadence / Temporal
Cadence was built at Uber to solve the problem of orchestrating long-running, reliable workflows (ride matching, payment processing, driver onboarding). It was later open-sourced and evolved into Temporal (founded by the same engineers):
- Workflow as code: Unlike traditional task schedulers that treat tasks as independent units, Cadence/Temporal models entire workflows as code. A workflow function can call activities (individual tasks), wait for signals, set timers, and branch on conditions — all while being fully durable. If the worker crashes mid-workflow, the framework replays the workflow function from the event history to resume exactly where it left off.
- Event sourcing under the hood: Every state change in a workflow is persisted as an event in a history. The workflow’s current state is reconstructed by replaying this event history. This is fundamentally different from the state-machine-in-a-database approach we designed above. It is more flexible (arbitrary workflow logic instead of predefined states) but more complex to reason about.
- Task queues with sticky execution: Temporal uses task queues to dispatch activities to workers. It supports “sticky execution” — once a worker starts processing a workflow, subsequent activities in that workflow are preferentially routed to the same worker. This improves cache locality (workflow context is already in memory) and reduces replay overhead.
- Visibility and search: Temporal provides a built-in visibility layer that indexes workflow metadata (status, type, start time, custom search attributes) into Elasticsearch. Operators can search for workflows across millions of executions. This is the observability layer that is often an afterthought in custom task schedulers.
- Multi-cluster replication: For disaster recovery, Temporal supports active-passive replication across clusters. Workflows in a failed cluster can be resumed in the standby cluster. This is critical for Uber’s ride-booking workflows — a datacenter failure cannot cause rides to be lost.
- At Uber’s scale: At peak, Uber’s Cadence deployment handled hundreds of thousands of workflow executions per second across multiple Cassandra clusters totaling petabytes of event history. The system managed everything from ride lifecycle to financial settlement to promotional offer expiration.
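The event-sourcing idea underpinning Cadence/Temporal can be sketched in a few lines. This is a toy model to build intuition, not the real Temporal history format or replay machinery: the current state is never stored directly, only the event history, and state is reconstructed by replaying it through a pure transition function.

```python
def apply(state, event):
    """Pure transition function: (state, event) -> new state."""
    kind = event["type"]
    if kind == "TaskScheduled":
        return {**state, "status": "scheduled"}
    if kind == "TaskStarted":
        return {**state, "status": "running", "worker": event["worker"]}
    if kind == "TaskCompleted":
        return {**state, "status": "completed", "result": event["result"]}
    return state

def replay(history):
    """Rebuild the current state from the full event history."""
    state = {"status": "created"}
    for event in history:
        state = apply(state, event)
    return state

history = [
    {"type": "TaskScheduled"},
    {"type": "TaskStarted", "worker": "worker-7"},
]
# Suppose the worker crashes here: a new worker replays the history
# and finds the task in "running" state, ready to resume.
assert replay(history)["status"] == "running"

history.append({"type": "TaskCompleted", "result": 42})
# Replaying the extended history yields status "completed", result 42.
```

The appeal is that recovery is trivial (replay is the only code path), at the cost of every state transition becoming a durable write, which is why Uber's deployment accumulated petabytes of event history.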
Apache Airflow’s Architecture
Airflow, originally created at Airbnb (around 2014) and now an Apache project, is the most widely used open-source task orchestrator for data pipelines:
- DAG-based scheduling: Airflow represents workflows as Directed Acyclic Graphs (DAGs). Each node is a task (e.g., “extract data from S3”, “run Spark job”, “load into warehouse”) and edges represent dependencies. The scheduler traverses the DAG, executing tasks only when their upstream dependencies are complete.
- Scheduler architecture: Airflow’s scheduler is a loop that runs every few seconds: (1) parse all DAG files to discover task dependencies, (2) identify tasks whose dependencies are met and whose scheduled time has arrived, (3) place those tasks into a queue (Celery, Kubernetes, or a local executor). The scheduler was historically single-threaded and a bottleneck, but Airflow 2.0+ supports multiple schedulers with database-level locking (similar to the SKIP LOCKED pattern we described above).
- Executor model: Airflow separates scheduling from execution. The executor is pluggable: CeleryExecutor distributes tasks to a fleet of Celery workers via Redis/RabbitMQ, KubernetesExecutor spins up a new pod for each task (strong isolation but higher overhead), and LocalExecutor runs tasks as subprocesses on the scheduler machine (fine for small deployments).
- Metadata database (PostgreSQL/MySQL): All task state, DAG definitions, execution history, and scheduling metadata live in a relational database. This is the source of truth. The choice of PostgreSQL or MySQL is deliberate — Airflow needs ACID transactions for state machine transitions, exactly as we discussed in Problem 5.
- XCom for inter-task communication: Tasks can pass small amounts of data to downstream tasks via “XComs” (cross-communications) stored in the metadata database. For large datasets, tasks write to external storage (S3, GCS) and pass only the reference via XCom. This pattern of passing pointers instead of payloads is a common production optimization.
- Limitations: Airflow is designed for batch workflows (run this ETL every hour), not for real-time event-driven tasks. It has no native concept of responding to events or running sub-second-latency tasks. For those use cases, Temporal or a custom solution (like the one we designed) is more appropriate.
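The core of DAG-based scheduling, "run a task only when every upstream dependency has finished", is a topological sort. Here is a minimal sketch using Kahn's algorithm in plain Python (the pipeline tasks are illustrative, not real Airflow operators):

```python
from collections import deque

def schedule_dag(deps):
    """Return an execution order in which every task runs only after
    all of its upstream dependencies (Kahn's topological sort).

    deps: {task: set of upstream tasks it depends on}
    """
    indegree = {t: len(up) for t, up in deps.items()}
    downstream = {t: [] for t in deps}
    for task, ups in deps.items():
        for up in ups:
            downstream[up].append(task)

    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream[task]:
            indegree[child] -= 1
            if indegree[child] == 0:  # all upstream tasks finished
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("cycle detected -- not a DAG")
    return order

# Illustrative ETL pipeline mirroring the example above.
pipeline = {
    "extract_s3": set(),
    "run_spark": {"extract_s3"},
    "load_warehouse": {"run_spark"},
    "notify": {"load_warehouse"},
}
print(schedule_dag(pipeline))
# ['extract_s3', 'run_spark', 'load_warehouse', 'notify']
```

A real scheduler does this incrementally (marking tasks ready as upstream runs complete, rather than computing a full order upfront), but the dependency-counting idea is the same.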
Summary: The Pattern Across All Five Problems
- Ask before you draw — Requirements clarification is not a formality; it changes the design.
- Numbers before architecture — Back-of-envelope estimation reveals whether you need 3 servers or 3,000.
- Start with the right abstraction — High-level design captures components and data flow before implementation details.
- Go deep on what matters — You cannot deep-dive everything in 45 minutes. Pick the 2-3 components that define the system.
- Name the trade-offs explicitly — Every design decision has a cost. Senior engineers articulate both sides.
Think of It This Way
System design interviews are like architectural blueprints — the interviewer wants to see your thinking process, not a finished building. An architect does not walk into a client meeting and immediately start drawing floor plans. They ask: how many people will use this space? What is the budget? Are there zoning constraints? What is the climate? Only after understanding the requirements do they pick up a pencil. Your system design interview should follow the same rhythm: clarify, estimate, sketch, refine.

Here is another way to think about it: a great system design answer is like a map with multiple zoom levels. Start at the continent level (high-level architecture), then zoom into the country level (key components), then the city level (critical algorithms and data models). The interviewer should be able to stop you at any zoom level and feel that you have a coherent picture. Candidates who jump straight to city-level details (arguing about Redis data structures before explaining the overall architecture) lose the forest for the trees.
Deep Dive Resources
These resources will deepen your understanding beyond what any single guide can cover. They are organized by format so you can pick what fits your learning style.
Books
| Resource | Why It Matters |
|---|---|
| System Design Interview by Alex Xu (Vol 1 & 2) | The most interview-focused system design resource available. Each chapter walks through a specific system (rate limiter, chat, news feed, etc.) with the same Clarify-Estimate-Design-Deep Dive framework. Volume 1 covers fundamentals; Volume 2 covers more advanced topics like hotel reservation systems and stock exchanges. |
| Designing Data-Intensive Applications by Martin Kleppmann | The definitive deep reference for understanding the building blocks (replication, partitioning, consistency models, stream processing) that underpin every system design. Not interview-focused, but the mental models it builds are irreplaceable. Read this to understand why systems work the way they do. |
| Web Scalability for Startup Engineers by Artur Ejsmont | A practical, end-to-end guide to scaling web applications. Covers caching, async processing, search, and more. Especially useful if you want to understand the full stack rather than individual components. |
Online Courses and Guides
| Resource | Why It Matters |
|---|---|
| System Design Primer (github.com/donnemartin/system-design-primer) | A free, comprehensive GitHub repository that covers system design concepts, trade-offs, and practice problems. Think of it as a structured self-study curriculum. The diagrams and summaries are excellent for quick review before interviews. |
| Grokking the System Design Interview (educative.io) | An interactive course that walks through 20+ system design problems step by step. The format (text-based with diagrams, not video) makes it easy to study at your own pace. Widely regarded as the most popular paid resource for system design interview prep. |
| ByteByteGo by Alex Xu (blog.bytebytego.com / YouTube) | Alex Xu’s newsletter and YouTube channel provide visual, concise explanations of system design concepts. The animated diagrams are particularly useful for building intuition about data flow, fan-out patterns, and distributed system behavior. The YouTube channel is free; the newsletter has free and paid tiers. |
Engineering Blogs and Articles
| Resource | Why It Matters |
|---|---|
| High Scalability (highscalability.com) | One of the oldest and most comprehensive collections of real-world architecture case studies. Their “This is how X works” posts dissect the architectures of companies like Netflix, Twitter, WhatsApp, and Uber. Reading 5-10 of these gives you a library of real patterns to reference in interviews. |
| InfoQ Architecture Articles (infoq.com/architecture-design) | In-depth articles and conference talks on distributed systems, microservices, and architecture patterns. The content skews more toward practicing architects than interview prep, which makes it excellent for building genuine expertise. |
| Company Engineering Blogs | The best primary sources for how real systems work. Prioritize: Netflix Tech Blog (resilience, streaming), Uber Engineering (real-time systems, Cadence/Temporal), Discord Blog (scaling real-time chat), Meta Engineering (feed systems, TAO, Memcache), Cloudflare Blog (edge computing, rate limiting, DNS). These are the sources that Alex Xu and other authors reference. |
Practice Platforms
| Resource | Why It Matters |
|---|---|
| Exponent (tryexponent.com) | Mock interview platform with system design practice, peer feedback, and expert-led sessions. Useful for practicing the verbal communication aspect that reading alone cannot build. |
| Pramp (pramp.com) | Free peer-to-peer mock interviews. You practice with another engineer and take turns as interviewer and candidate. The best way to practice thinking out loud under time pressure. |
| LeetCode System Design (leetcode.com/discuss/interview-question/system-design) | Community-contributed system design discussions. Quality varies, but the top posts have thousands of upvotes for a reason. Good for seeing how other candidates approach the same problems. |
Common Mistakes in System Design Interviews
Even well-prepared candidates fall into these traps. Knowing them in advance gives you a significant edge.
Mistake 1: Jumping to the Solution Without Clarifying Requirements
Mistake 2: Over-Engineering for Scale You Don't Have
Mistake 3: Ignoring Failure Modes and Edge Cases
Mistake 4: Not Calculating Back-of-Envelope Estimates
Mistake 5: Designing in Silence Instead of Thinking Out Loud
Mistake 6: Focusing on One Component Too Deeply While Ignoring the Big Picture