Interview Practice Problems

Staff+ distributed systems interviews are among the most challenging in the industry. This module provides comprehensive practice problems with detailed solutions, covering the kinds of topics commonly asked at Google, Meta, Amazon, and other top companies.
Problem Count: 15+ detailed problems
Difficulty: Staff/Principal level
Format: Problem → Hints → Full Solution → Follow-up Questions

Problem 1: Design a Distributed Rate Limiter

Difficulty: Medium-Hard
Companies: Stripe, Cloudflare, AWS
Time: 45 minutes

Problem Statement

Design a rate limiting system that:
  • Limits requests per user to 100 requests per minute
  • Works across multiple API servers (horizontally scaled)
  • Has low latency (< 5ms overhead)
  • Handles millions of users

Hints

Consider Token Bucket, Leaky Bucket, Sliding Window, or Fixed Window algorithms.
  • Token Bucket: Good for burst tolerance
  • Sliding Window: More accurate, more complex
  • Fixed Window: Simple but has edge case issues
Where do you store rate limit state?
  • Local memory: Fast but not shared
  • Redis: Shared but adds latency
  • Hybrid: Local + periodic sync
Watch for race conditions: multiple servers may check and update the same counter simultaneously. Consider atomic operations or Lua scripts in Redis.

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED RATE LIMITER ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ALGORITHM: Sliding Window Log (most accurate for distributed)              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                         API SERVERS                                │     │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │     │
│  │  │ Server 1 │  │ Server 2 │  │ Server 3 │  │ Server N │          │     │
│  │  │          │  │          │  │          │  │          │          │     │
│  │  │ Rate     │  │ Rate     │  │ Rate     │  │ Rate     │          │     │
│  │  │ Limiter  │  │ Limiter  │  │ Limiter  │  │ Limiter  │          │     │
│  │  │ Client   │  │ Client   │  │ Client   │  │ Client   │          │     │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │     │
│  │       │             │             │             │                 │     │
│  └───────┼─────────────┼─────────────┼─────────────┼─────────────────┘     │
│          │             │             │             │                        │
│          └─────────────┴──────┬──────┴─────────────┘                        │
│                               │                                             │
│                               ▼                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                      REDIS CLUSTER                                 │     │
│  │                                                                    │     │
│  │   Key: rate_limit:user_123                                        │     │
│  │   Type: Sorted Set                                                │     │
│  │   Members: [timestamp1, timestamp2, timestamp3, ...]              │     │
│  │   Score: timestamp (for range queries)                            │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  SLIDING WINDOW ALGORITHM:                                                  │
│  ─────────────────────────                                                  │
│  1. Remove all entries older than (now - window_size)                       │
│  2. Count remaining entries                                                 │
│  3. If count < limit, add current timestamp and allow                       │
│  4. Else, reject                                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Redis Lua Script (Atomic Operation):
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests allowed
-- ARGV[3] = current timestamp
-- ARGV[4] = unique request id (Redis seeds Lua's math.random deterministically
--           per invocation, so the caller must supply the unique member)

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = now - window

-- Remove entries that have fallen out of the window
redis.call('ZREMRANGEBYSCORE', key, 0, window_start)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Record this request with a unique member
    redis.call('ZADD', key, now, now .. '-' .. ARGV[4])
    redis.call('EXPIRE', key, window)
    return 1  -- Allowed
else
    return 0  -- Denied
end
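On each API server, a thin client loads the script once and calls it per request. A minimal sketch using the redis-py client (the key naming, 60-second window, and fail-open choice on Redis errors are assumptions, not part of the problem):
import time
import uuid
import redis  # redis-py

WINDOW_SECONDS = 60
LIMIT = 100

r = redis.Redis(host="localhost", port=6379)
# SLIDING_WINDOW_LUA is the Lua script shown above, held as a Python string
sliding_window = r.register_script(SLIDING_WINDOW_LUA)

def is_allowed(user_id: str) -> bool:
    """Return True if this request is within the user's rate limit."""
    key = f"rate_limit:{user_id}"
    try:
        allowed = sliding_window(
            keys=[key],
            args=[WINDOW_SECONDS, LIMIT, time.time(), uuid.uuid4().hex],
        )
        return allowed == 1
    except redis.RedisError:
        # Fail open: if Redis is unreachable, let the request through
        # (see follow-up question 1 for the fail-open vs fail-closed trade-off)
        return True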
Follow-up Questions:
  1. How do you handle Redis failures?
    • Fail open (allow requests) or fail closed (deny)?
    • Local cache with eventual sync?
  2. How do you prevent gaming the system?
    • User changes IP? Use account ID, not IP
    • Distributed attacks? Per-IP limits too
  3. How do you handle different rate limits per user tier?
    • Store tier in user metadata
    • Different limits for free vs paid

Problem 2: Design a Distributed Lock Service

Difficulty: Hard
Companies: Google, Amazon, Uber
Time: 45 minutes

Problem Statement

Design a distributed lock service similar to Zookeeper or etcd that:
  • Provides mutual exclusion
  • Handles node failures
  • Prevents deadlocks
  • Supports lock TTL (lease-based)

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED LOCK SERVICE DESIGN                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  KEY COMPONENTS:                                                            │
│                                                                              │
│  1. CONSENSUS LAYER (Raft/Paxos)                                           │
│     ┌──────────────────────────────────────────────────────────────┐        │
│     │                                                              │        │
│     │  Leader ◄────────► Follower ◄────────► Follower             │        │
│     │                                                              │        │
│     │  All lock operations go through Raft log                    │        │
│     │  Majority must agree before lock is granted                 │        │
│     │                                                              │        │
│     └──────────────────────────────────────────────────────────────┘        │
│                                                                              │
│  2. LOCK DATA STRUCTURE                                                     │
│     ┌──────────────────────────────────────────────────────────────┐        │
│     │  Lock {                                                      │        │
│     │    key: "/locks/resource-123"                               │        │
│     │    owner: "client-abc"        // Who holds the lock         │        │
│     │    lease_id: 12345            // For TTL/renewal            │        │
│     │    fence_token: 789           // Monotonically increasing   │        │
│     │    expires_at: 1699900000     // When lease expires         │        │
│     │  }                                                          │        │
│     └──────────────────────────────────────────────────────────────┘        │
│                                                                              │
│  3. FENCING TOKEN (Critical for correctness!)                              │
│                                                                              │
│     Without fencing:                                                        │
│     ┌────────────────────────────────────────────────────────────────┐      │
│     │  Client A acquires lock                                        │      │
│     │  Client A pauses (GC, network)                                 │      │
│     │  Lock expires, Client B acquires lock                          │      │
│     │  Client A resumes, thinks it still has lock                    │      │
│     │  BOTH clients write! Data corruption!                          │      │
│     └────────────────────────────────────────────────────────────────┘      │
│                                                                              │
│     With fencing:                                                           │
│     ┌────────────────────────────────────────────────────────────────┐      │
│     │  Client A acquires lock, fence_token = 34                      │      │
│     │  Client A pauses (GC, network)                                 │      │
│     │  Lock expires, Client B acquires, fence_token = 35             │      │
│     │  Client A resumes, sends token 34 to storage                   │      │
│     │  Storage rejects: 34 < current (35)                            │      │
│     │  Only Client B's writes succeed!                               │      │
│     └────────────────────────────────────────────────────────────────┘      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Lock Acquisition Algorithm:
import asyncio
import time

class DistributedLock:
    def __init__(self, lock_service, resource_key, client_id):
        self.lock_service = lock_service
        self.resource_key = resource_key
        self.client_id = client_id
        self.lease_id = None
        self.fence_token = None
        self._renewal_task = None

    async def acquire(self, timeout_seconds: float = 30) -> bool:
        """
        Acquire the lock with retries.
        Returns True if acquired, False if timeout.
        """
        deadline = time.time() + timeout_seconds

        while time.time() < deadline:
            # Try to create lock in consensus layer
            result = await self.lock_service.try_acquire(
                key=self.resource_key,
                owner=self.client_id,
                ttl_seconds=30  # Lease duration
            )

            if result.success:
                self.lease_id = result.lease_id
                self.fence_token = result.fence_token

                # Start background lease renewal
                self._start_renewal_loop()
                return True

            # Lock held by someone else, wait and retry
            await asyncio.sleep(0.1)

        return False

    async def release(self):
        """Release the lock and stop renewing the lease."""
        if self.lease_id:
            await self.lock_service.release(
                key=self.resource_key,
                lease_id=self.lease_id
            )
            self.lease_id = None
            self._stop_renewal_loop()

    def _start_renewal_loop(self):
        """Renew lease periodically to prevent expiry."""
        async def renew():
            while self.lease_id:
                try:
                    await self.lock_service.renew_lease(self.lease_id)
                    await asyncio.sleep(10)  # Renew every 10s (TTL is 30s)
                except Exception:
                    # Lost connection, lease will expire
                    break

        self._renewal_task = asyncio.create_task(renew())

    def _stop_renewal_loop(self):
        """Cancel the background renewal task."""
        if self._renewal_task:
            self._renewal_task.cancel()
            self._renewal_task = None
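The lock service alone cannot prevent the stale-client scenario shown in the diagram; the protected resource must also validate the fencing token. A minimal sketch of that storage-side check (the in-memory store and field names are illustrative):
class FencedStorage:
    """Toy storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token_seen = 0
        self.data = {}

    def write(self, key: str, value, fence_token: int) -> bool:
        if fence_token < self.highest_token_seen:
            # A newer lock holder has already written (e.g. token 35 beats 34)
            return False
        self.highest_token_seen = fence_token
        self.data[key] = value
        return True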
Follow-up Questions:
  1. What happens during network partition?
    • Clients on minority side can’t acquire new locks
    • Existing leases expire, preventing split-brain
  2. How do you handle leader election for the lock service itself?
    • Raft consensus with quorum reads/writes
  3. Redlock controversy: Why is Redis Redlock problematic?
    • Martin Kleppmann’s analysis: Clock assumptions too strong
    • Fencing tokens still needed for safety

Problem 3: Design a Global Payment System

Difficulty: Very Hard
Companies: Stripe, Square, Adyen
Time: 60 minutes

Problem Statement

Design a payment processing system that:
  • Handles credit card charges globally
  • Guarantees exactly-once processing
  • Supports refunds and chargebacks
  • Processes millions of transactions per day

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PAYMENT SYSTEM ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                           ┌─────────────────┐                               │
│                           │   API Gateway   │                               │
│                           │ (Idempotency)   │                               │
│                           └────────┬────────┘                               │
│                                    │                                        │
│                           ┌────────▼────────┐                               │
│                           │ Payment Service │                               │
│                           │   (Orchestrator)│                               │
│                           └────────┬────────┘                               │
│                                    │                                        │
│     ┌──────────────────────────────┼──────────────────────────────┐         │
│     │                              │                              │         │
│     ▼                              ▼                              ▼         │
│ ┌────────────┐            ┌────────────────┐           ┌────────────────┐  │
│ │  Fraud     │            │  Authorization │           │  Risk Engine   │  │
│ │  Detection │            │  Service       │           │                │  │
│ └────────────┘            └───────┬────────┘           └────────────────┘  │
│                                   │                                         │
│                    ┌──────────────┼──────────────┐                          │
│                    ▼              ▼              ▼                          │
│               ┌─────────┐   ┌─────────┐   ┌─────────┐                      │
│               │  Visa   │   │Mastercard│  │  AMEX   │ (Payment Networks)  │
│               └─────────┘   └─────────┘   └─────────┘                      │
│                                                                              │
│  STATE MACHINE FOR PAYMENT:                                                 │
│  ──────────────────────────                                                 │
│                                                                              │
│  ┌──────────┐    Auth     ┌────────────┐   Capture   ┌───────────┐         │
│  │ PENDING  │ ──────────► │ AUTHORIZED │ ──────────► │ CAPTURED  │         │
│  └──────────┘             └────────────┘             └───────────┘         │
│       │                         │                          │               │
│       │                         │ Void                     │ Refund       │
│       │                         ▼                          ▼               │
│       │                   ┌───────────┐             ┌───────────┐          │
│       └──────────────────►│  FAILED   │             │ REFUNDED  │          │
│          Decline          └───────────┘             └───────────┘          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
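One way to enforce the state machine in code is a transition table consulted before every status update; a small sketch following the diagram (Void and Decline both land in FAILED here):
# Allowed payment state transitions, mirroring the diagram above
ALLOWED_TRANSITIONS = {
    "PENDING":    {"AUTHORIZED", "FAILED"},   # Auth succeeds or is declined
    "AUTHORIZED": {"CAPTURED", "FAILED"},     # Capture or Void
    "CAPTURED":   {"REFUNDED"},               # Refund
    "FAILED":     set(),                      # Terminal
    "REFUNDED":   set(),                      # Terminal
}

def assert_transition(current: str, target: str) -> None:
    """Raise if a payment would move through an invalid transition."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Invalid payment transition: {current} -> {target}")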
Exactly-Once Processing:
class PaymentProcessor:
    async def process_payment(
        self,
        idempotency_key: str,
        amount: int,
        currency: str,
        card_token: str
    ) -> PaymentResult:
        """
        Process payment with exactly-once guarantee.
        """
        
        # Step 1: Check idempotency store
        existing = await self.idempotency_store.get(idempotency_key)
        if existing:
            if existing.status == "completed":
                return existing.result  # Return cached result
            elif existing.status == "in_progress":
                raise ConflictError("Payment in progress")

        # Mark this key as in progress so concurrent retries conflict
        # (in practice, an atomic set-if-not-exists)
        await self.idempotency_store.set(
            idempotency_key,
            {"status": "in_progress"},
            ttl=3600
        )

        # Step 2: Create payment record (source of truth)
        payment_id = generate_id()
        
        async with self.db.transaction():
            # Insert payment
            await self.db.insert("payments", {
                "id": payment_id,
                "idempotency_key": idempotency_key,
                "amount": amount,
                "currency": currency,
                "status": "PENDING"
            })
            
            # Insert into outbox for event
            await self.db.insert("outbox", {
                "event": "payment.created",
                "payload": {"payment_id": payment_id}
            })
        
        try:
            # Step 3: Call payment network
            auth_result = await self.call_payment_network(
                amount=amount,
                currency=currency,
                card_token=card_token
            )
            
            # Step 4: Update payment status
            async with self.db.transaction():
                await self.db.update(
                    "payments",
                    {"id": payment_id},
                    {"status": "AUTHORIZED" if auth_result.success else "FAILED",
                     "network_reference": auth_result.reference}
                )
                
                await self.db.insert("outbox", {
                    "event": "payment.authorized" if auth_result.success else "payment.failed",
                    "payload": {"payment_id": payment_id}
                })
            
            result = PaymentResult(
                payment_id=payment_id,
                success=auth_result.success,
                decline_code=auth_result.decline_code
            )
            
            # Step 5: Cache in idempotency store
            await self.idempotency_store.set(
                idempotency_key,
                {"status": "completed", "result": result},
                ttl=24 * 3600  # 24 hours
            )
            
            return result
            
        except Exception as e:
            # Network failure - payment status uncertain
            # Keep as PENDING, reconciliation job will resolve
            logger.error(f"Payment network error: {e}")
            raise
Key Design Decisions:
Decision             | Rationale
---------------------|---------------------------------------------------
Transactional outbox | Database and event publishing in same transaction
Idempotency keys     | Client-provided, prevents duplicate charges
State machine        | Clear payment lifecycle, no invalid transitions
Reconciliation       | Async job compares our state with payment networks
Saga for refunds     | Multi-step with compensation on failure
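The transactional outbox in the table above needs a relay process that actually publishes the rows; a minimal polling sketch (the outbox columns, db interface, and broker client are illustrative assumptions):
import asyncio

class OutboxRelay:
    def __init__(self, db, broker, poll_interval: float = 1.0):
        self.db = db                  # same db handle style as PaymentProcessor
        self.broker = broker          # hypothetical message broker client
        self.poll_interval = poll_interval

    async def run(self):
        """Publish unpublished outbox rows, then mark them as sent.
        A crash between publish and update re-publishes the row, so this
        is at-least-once delivery and consumers must be idempotent."""
        while True:
            rows = await self.db.query(
                "SELECT id, event, payload FROM outbox "
                "WHERE published_at IS NULL ORDER BY id LIMIT 100"
            )
            for row in rows:
                await self.broker.publish(row["event"], row["payload"])
                await self.db.execute(
                    "UPDATE outbox SET published_at = NOW() WHERE id = $1",
                    row["id"],
                )
            await asyncio.sleep(self.poll_interval)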

Problem 4: Design a Distributed Task Scheduler

Difficulty: Hard
Companies: Uber (Cadence), Airbnb, LinkedIn
Time: 45 minutes

Problem Statement

Design a system that:
  • Schedules tasks to run at specific times (cron-like)
  • Executes one-time delayed tasks
  • Guarantees at-least-once execution
  • Scales horizontally

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TASK SCHEDULER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ARCHITECTURE:                                                              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        SCHEDULER LAYER                             │     │
│  │                                                                    │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │     │
│  │  │ Scheduler 1  │  │ Scheduler 2  │  │ Scheduler N  │             │     │
│  │  │  (Leader)    │  │ (Follower)   │  │ (Follower)   │             │     │
│  │  └──────┬───────┘  └──────────────┘  └──────────────┘             │     │
│  │         │                                                          │     │
│  │         │ Leader election via consensus                            │     │
│  │         │ (Only leader schedules tasks)                            │     │
│  │         │                                                          │     │
│  └─────────┼──────────────────────────────────────────────────────────┘     │
│            │                                                                 │
│            ▼                                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        TASK QUEUE                                  │     │
│  │                                                                    │     │
│  │   Ready Queue (sorted by execution_time):                         │     │
│  │   ┌─────────────────────────────────────────────────────────────┐ │     │
│  │   │ task_1 (run at 10:00) │ task_2 (10:01) │ task_3 (10:05) │   │ │     │
│  │   └─────────────────────────────────────────────────────────────┘ │     │
│  │                                                                    │     │
│  │   Delay Queue (Redis ZSET or dedicated delay queue):              │     │
│  │   ┌─────────────────────────────────────────────────────────────┐ │     │
│  │   │ task_4 (due 11:00) │ task_5 (due 12:00) │ ...              │ │     │
│  │   └─────────────────────────────────────────────────────────────┘ │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│            │                                                                 │
│            ▼                                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        WORKER LAYER                                │     │
│  │                                                                    │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │     │
│  │  │  Worker 1    │  │  Worker 2    │  │  Worker N    │             │     │
│  │  │              │  │              │  │              │             │     │
│  │  │ - Pulls task │  │ - Pulls task │  │ - Pulls task │             │     │
│  │  │ - Executes   │  │ - Executes   │  │ - Executes   │             │     │
│  │  │ - Acks/Fails │  │ - Acks/Fails │  │ - Acks/Fails │             │     │
│  │  └──────────────┘  └──────────────┘  └──────────────┘             │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  TASK DATA MODEL:                                                           │
│  ────────────────                                                           │
│  {                                                                          │
│    "id": "task-123",                                                        │
│    "type": "send_email",                                                    │
│    "payload": {"to": "[email protected]", ...},                             │
│    "schedule_time": 1699900000,                                            │
│    "status": "PENDING|RUNNING|COMPLETED|FAILED",                           │
│    "retry_count": 0,                                                       │
│    "max_retries": 3,                                                       │
│    "created_at": 1699800000                                                │
│  }                                                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Execution Semantics (At-Least-Once + Idempotent Handlers):
from datetime import datetime, timedelta

class TaskWorker:
    async def process_task(self, task: Task):
        """
        Process task with at-least-once guarantee.
        Idempotency is caller's responsibility.
        """
        
        # Set visibility timeout (task invisible to other workers)
        await self.queue.set_visibility_timeout(task.id, seconds=300)
        
        try:
            # Execute the task
            handler = self.handlers[task.type]
            await handler(task.payload)
            
            # Mark as complete
            await self.store.update_task(task.id, status="COMPLETED")
            await self.queue.ack(task.id)
            
        except RetryableError as e:
            if task.retry_count < task.max_retries:
                # Reschedule with exponential backoff
                next_run = datetime.now() + timedelta(
                    seconds=2 ** task.retry_count * 60
                )
                await self.store.update_task(
                    task.id,
                    status="PENDING",
                    schedule_time=next_run,
                    retry_count=task.retry_count + 1
                )
                await self.queue.nack(task.id)
            else:
                # Max retries exceeded
                await self.store.update_task(
                    task.id, 
                    status="FAILED",
                    error=str(e)
                )
                await self.queue.ack(task.id)
                
        except Exception as e:
            # Non-retryable error
            await self.store.update_task(
                task.id,
                status="FAILED",
                error=str(e)
            )
            await self.queue.ack(task.id)
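The worker above assumes something moves due tasks onto the ready queue; a minimal leader-side polling sketch using a Redis sorted set as the delay queue (key names are assumptions, and the remove-and-enqueue step is simplified):
import time
import redis  # redis-py

def promote_due_tasks(r: redis.Redis,
                      delay_key: str = "delay_queue",
                      ready_key: str = "ready_queue") -> int:
    """Move tasks whose schedule_time (the ZSET score) has passed
    from the delay queue to the ready queue. Run by the leader only."""
    now = time.time()
    due = r.zrangebyscore(delay_key, "-inf", now)
    moved = 0
    for task_id in due:
        # In production this pair should be atomic (e.g. a Lua script)
        # so a crash cannot lose or duplicate a task.
        if r.zrem(delay_key, task_id):
            r.lpush(ready_key, task_id)
            moved += 1
    return moved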

Problem 5: Design a Distributed Unique ID Generator

Difficulty: Medium
Companies: Twitter (Snowflake), Instagram
Time: 30 minutes

Problem Statement

Design a system that generates globally unique IDs with:
  • 64-bit IDs
  • Sortable by time (roughly)
  • No coordination between nodes
  • Millions of IDs per second

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SNOWFLAKE ID FORMAT (64 bits)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────┬───────────────────────────┬────────────┬──────────────────────┐    │
│  │  0  │       41 bits            │  10 bits   │      12 bits         │    │
│  │     │    Timestamp (ms)        │  Node ID   │   Sequence Number    │    │
│  │     │  (69 years from epoch)   │ (1024 nodes)│  (4096 per ms)      │    │
│  └─────┴───────────────────────────┴────────────┴──────────────────────┘    │
│                                                                              │
│  CAPACITY:                                                                  │
│  ─────────                                                                  │
│  - 1024 nodes                                                               │
│  - 4096 IDs per millisecond per node                                        │
│  - 4.1 million IDs per second per node                                      │
│  - 69 years before timestamp overflow                                       │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ────────                                                                   │
│  Timestamp: 1699900000000, minus epoch = 411065025343 (41 bits)              │
│  Node ID: 42 (10 bits)                                                       │
│                                                                              │
│  Sequence: 0 (12 bits)                                                       │
│  ID = (411065025343 << 22) | (42 << 12) | 0                                  │
│     = 1724131680056418304                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation:
import time
import threading

class ClockMovedBackwardsError(Exception):
    """Raised when the system clock moves backwards between ID generations."""
    pass

class SnowflakeGenerator:
    def __init__(self, node_id: int, epoch: int = 1288834974657):
        """
        Initialize Snowflake ID generator.
        
        Args:
            node_id: Unique identifier for this generator (0-1023)
            epoch: Custom epoch (Twitter's is 2010-11-04)
        """
        if not 0 <= node_id < 1024:
            raise ValueError("Node ID must be between 0 and 1023")
        
        self.node_id = node_id
        self.epoch = epoch
        self.sequence = 0
        self.last_timestamp = -1
        self._lock = threading.Lock()
    
    def _current_time_millis(self) -> int:
        return int(time.time() * 1000)
    
    def generate(self) -> int:
        """Generate a unique Snowflake ID"""
        with self._lock:
            timestamp = self._current_time_millis()
            
            if timestamp < self.last_timestamp:
                # Clock went backwards! Wait for it to catch up.
                raise ClockMovedBackwardsError(
                    f"Clock moved backwards by {self.last_timestamp - timestamp}ms"
                )
            
            if timestamp == self.last_timestamp:
                # Same millisecond, increment sequence
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                
                if self.sequence == 0:
                    # Sequence exhausted, wait for next millisecond
                    timestamp = self._wait_next_millis(timestamp)
            else:
                # New millisecond, reset sequence
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            # Construct the ID
            id = (
                ((timestamp - self.epoch) << 22) |
                (self.node_id << 12) |
                self.sequence
            )
            
            return id
    
    def _wait_next_millis(self, last_timestamp: int) -> int:
        """Spin until we get to the next millisecond"""
        timestamp = self._current_time_millis()
        while timestamp <= last_timestamp:
            timestamp = self._current_time_millis()
        return timestamp
    
    @staticmethod
    def parse(id: int) -> dict:
        """Parse a Snowflake ID into its components"""
        return {
            "timestamp": (id >> 22) + 1288834974657,  # assumes the default Twitter epoch
            "node_id": (id >> 12) & 0x3FF,
            "sequence": id & 0xFFF
        }
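A quick usage check (node_id 42 is arbitrary):
generator = SnowflakeGenerator(node_id=42)
first = generator.generate()
second = generator.generate()
assert second > first                   # IDs are roughly time-sortable
print(SnowflakeGenerator.parse(first))  # shows timestamp, node_id, sequence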
Follow-up Questions:
  1. How do you assign node IDs?
    • Zookeeper/etcd for coordination
    • Use MAC address hash
    • Use Kubernetes pod ordinal
  2. What if clock goes backwards?
    • Throw exception (let operator fix NTP)
    • Wait for clock to catch up
    • Use logical clock component
  3. How do you handle datacenter failover?
    • Split node ID: 5 bits datacenter + 5 bits machine
    • Different epoch per datacenter

Problem 6: Design a Distributed Message Queue

Difficulty: Very Hard
Companies: Confluent, AWS (SQS), LinkedIn
Time: 60 minutes

Problem Statement

Design a message queue like Kafka that:
  • Handles millions of messages per second
  • Provides ordering guarantees (per partition)
  • Supports multiple consumers
  • Persists messages for replay

Key Design Points

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED MESSAGE QUEUE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TOPIC: orders (3 partitions, replication factor 3)                        │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                                                                    │     │
│  │   Partition 0          Partition 1          Partition 2           │     │
│  │   (Leader: B1)         (Leader: B2)         (Leader: B3)          │     │
│  │                                                                    │     │
│  │   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │     │
│  │   │[0][1][2][3]  │     │[0][1][2][3]  │     │[0][1][2][3]  │      │     │
│  │   └──────────────┘     └──────────────┘     └──────────────┘      │     │
│  │   Broker 1 (L)         Broker 2 (L)         Broker 3 (L)          │     │
│  │   Broker 2 (F)         Broker 3 (F)         Broker 1 (F)          │     │
│  │   Broker 3 (F)         Broker 1 (F)         Broker 2 (F)          │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  CONSUMER GROUP: order-processors                                           │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                                                                    │     │
│  │   Consumer 1           Consumer 2           Consumer 3            │     │
│  │   (Partition 0)        (Partition 1)        (Partition 2)         │     │
│  │                                                                    │     │
│  │   Each partition assigned to exactly ONE consumer in group        │     │
│  │   If Consumer 2 dies, Partition 1 reassigned to 1 or 3           │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  GUARANTEES:                                                                │
│  ───────────                                                                │
│  - Ordering: Within a partition only                                        │
│  - Delivery: At-least-once (consumer commits offset after processing)      │
│  - Durability: Message persisted to N replicas before ack                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
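To make the per-partition ordering and at-least-once guarantees concrete, here is a small sketch of key-based partition selection and a consumer loop that commits its offset only after processing (the broker client and its methods are illustrative, not a real Kafka API):
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Same key -> same partition, which is what preserves per-key order."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def consume_partition(broker, topic: str, partition: int, group: str, handle):
    """At-least-once: commit only AFTER handle() succeeds, so a crash
    re-delivers any uncommitted messages (handle must be idempotent)."""
    offset = broker.fetch_committed_offset(topic, partition, group)
    while True:
        batch = broker.poll(topic, partition, offset)
        for msg in batch:
            handle(msg)                  # may run again after a crash
            offset = msg.offset + 1
        broker.commit_offset(topic, partition, group, offset)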

More Practice Problems

Design Uber/Lyft

Topics: Real-time location, matching, surge pricing
Key challenges: Geospatial indexing, low-latency dispatch

Design Twitter Timeline

Topics: Fan-out, feed ranking, real-time updates
Key challenges: Hot users, read-heavy workload

Design Google Docs

Topics: CRDTs, operational transformation, collaboration
Key challenges: Conflict resolution, real-time sync

Design Netflix

Topics: CDN, video encoding, recommendations
Key challenges: Global scale, adaptive streaming

Interview Tips

  1. Clarify requirements (5 min)
    • Scale: Users, requests/second, data size
    • Consistency: Strong or eventual?
    • Latency: What’s acceptable?
  2. High-level design (10 min)
    • Draw boxes and arrows
    • Identify core components
    • Explain data flow
  3. Deep dive (20 min)
    • Pick 2-3 most critical components
    • Discuss algorithms, data structures
    • Explain trade-offs
  4. Handle edge cases (10 min)
    • Failures: Node crashes, network partitions
    • Scale: Hot spots, bottlenecks
    • Consistency: Race conditions

What Interviewers Evaluate

  • Breadth: Can you identify all components?
  • Depth: Can you go deep on key areas?
  • Trade-offs: Do you understand CAP, consistency vs availability?
  • Practicality: Is your design buildable?
  • Communication: Can you explain clearly?

Common Mistakes to Avoid

  • Starting to code before understanding the problem
  • Not asking clarifying questions
  • Ignoring scale constraints
  • Over-engineering simple problems
  • Forgetting failure modes
  • Not discussing trade-offs