Interview Practice Problems

Staff+ distributed systems interviews are among the most challenging in the industry. This module provides comprehensive practice problems with detailed solutions, covering the exact topics asked at Google, Meta, Amazon, and other top companies.

Problem Count: 15+ detailed problems
Difficulty: Staff/Principal level
Format: Problem → Hints → Full Solution → Follow-up Questions

Problem 1: Design a Distributed Rate Limiter

Difficulty: Medium-Hard
Companies: Stripe, Cloudflare, AWS
Time: 45 minutes

Problem Statement

Design a rate limiting system that:

Limits requests per user to 100 requests per minute
Works across multiple API servers (horizontally scaled)
Has low latency (< 5ms overhead)
Handles millions of users

Hints

Hint 1: Algorithm Choice

Consider Token Bucket, Leaky Bucket, Sliding Window, or Fixed Window algorithms.

Token Bucket: Good for burst tolerance
Sliding Window: More accurate, more complex
Fixed Window: Simple but has edge case issues

Hint 2: Storage

Where do you store rate limit state?

Local memory: Fast but not shared
Redis: Shared but adds latency
Hybrid: Local + periodic sync

Hint 3: Race Conditions

Multiple servers checking/updating same counter simultaneously. Consider atomic operations or Lua scripts in Redis.

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED RATE LIMITER ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ALGORITHM: Sliding Window Log (most accurate for distributed)              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                         API SERVERS                                │     │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │     │
│  │  │ Server 1 │  │ Server 2 │  │ Server 3 │  │ Server N │          │     │
│  │  │          │  │          │  │          │  │          │          │     │
│  │  │ Rate     │  │ Rate     │  │ Rate     │  │ Rate     │          │     │
│  │  │ Limiter  │  │ Limiter  │  │ Limiter  │  │ Limiter  │          │     │
│  │  │ Client   │  │ Client   │  │ Client   │  │ Client   │          │     │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │     │
│  │       │             │             │             │                 │     │
│  └───────┼─────────────┼─────────────┼─────────────┼─────────────────┘     │
│          │             │             │             │                        │
│          └─────────────┴──────┬──────┴─────────────┘                        │
│                               │                                             │
│                               ▼                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                      REDIS CLUSTER                                 │     │
│  │                                                                    │     │
│  │   Key: rate_limit:user_123                                        │     │
│  │   Type: Sorted Set                                                │     │
│  │   Members: [timestamp1, timestamp2, timestamp3, ...]              │     │
│  │   Score: timestamp (for range queries)                            │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  SLIDING WINDOW ALGORITHM:                                                  │
│  ─────────────────────────                                                  │
│  1. Remove all entries older than (now - window_size)                       │
│  2. Count remaining entries                                                 │
│  3. If count < limit, add current timestamp and allow                       │
│  4. Else, reject                                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Redis Lua Script (Atomic Operation):

-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests allowed
-- ARGV[3] = current timestamp

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_start = now - window

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, window_start)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add current request
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- Allowed
else
    return 0  -- Denied
end

Follow-up Questions:

How do you handle Redis failures?
- Fail open (allow requests) or fail closed (deny)?
- Local cache with eventual sync?
How do you prevent gaming the system?
- User changes IP? Use account ID, not IP
- Distributed attacks? Per-IP limits too
How do you handle different rate limits per user tier?
- Store tier in user metadata
- Different limits for free vs paid

Problem 2: Design a Distributed Lock Service

Difficulty: Hard
Companies: Google, Amazon, Uber
Time: 45 minutes

Problem Statement

Design a distributed lock service similar to Zookeeper or etcd that:

Provides mutual exclusion
Handles node failures
Prevents deadlocks
Supports lock TTL (lease-based)

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED LOCK SERVICE DESIGN                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  KEY COMPONENTS:                                                            │
│                                                                              │
│  1. CONSENSUS LAYER (Raft/Paxos)                                           │
│     ┌──────────────────────────────────────────────────────────────┐        │
│     │                                                              │        │
│     │  Leader ◄────────► Follower ◄────────► Follower             │        │
│     │                                                              │        │
│     │  All lock operations go through Raft log                    │        │
│     │  Majority must agree before lock is granted                 │        │
│     │                                                              │        │
│     └──────────────────────────────────────────────────────────────┘        │
│                                                                              │
│  2. LOCK DATA STRUCTURE                                                     │
│     ┌──────────────────────────────────────────────────────────────┐        │
│     │  Lock {                                                      │        │
│     │    key: "/locks/resource-123"                               │        │
│     │    owner: "client-abc"        // Who holds the lock         │        │
│     │    lease_id: 12345            // For TTL/renewal            │        │
│     │    fence_token: 789           // Monotonically increasing   │        │
│     │    expires_at: 1699900000     // When lease expires         │        │
│     │  }                                                          │        │
│     └──────────────────────────────────────────────────────────────┘        │
│                                                                              │
│  3. FENCING TOKEN (Critical for correctness!)                              │
│                                                                              │
│     Without fencing:                                                        │
│     ┌────────────────────────────────────────────────────────────────┐      │
│     │  Client A acquires lock                                        │      │
│     │  Client A pauses (GC, network)                                 │      │
│     │  Lock expires, Client B acquires lock                          │      │
│     │  Client A resumes, thinks it still has lock                    │      │
│     │  BOTH clients write! Data corruption!                          │      │
│     └────────────────────────────────────────────────────────────────┘      │
│                                                                              │
│     With fencing:                                                           │
│     ┌────────────────────────────────────────────────────────────────┐      │
│     │  Client A acquires lock, fence_token = 34                      │      │
│     │  Client A pauses (GC, network)                                 │      │
│     │  Lock expires, Client B acquires, fence_token = 35             │      │
│     │  Client A resumes, sends token 34 to storage                   │      │
│     │  Storage rejects: 34 < current (35)                            │      │
│     │  Only Client B's writes succeed!                               │      │
│     └────────────────────────────────────────────────────────────────┘      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Lock Acquisition Algorithm:

class DistributedLock:
    def __init__(self, lock_service, resource_key):
        self.lock_service = lock_service
        self.resource_key = resource_key
        self.lease_id = None
        self.fence_token = None
    
    async def acquire(self, timeout_seconds: float = 30) -> bool:
        """
        Acquire the lock with retries.
        Returns True if acquired, False if timeout.
        """
        deadline = time.time() + timeout_seconds
        
        while time.time() < deadline:
            # Try to create lock in consensus layer
            result = await self.lock_service.try_acquire(
                key=self.resource_key,
                owner=self.client_id,
                ttl_seconds=30  # Lease duration
            )
            
            if result.success:
                self.lease_id = result.lease_id
                self.fence_token = result.fence_token
                
                # Start background lease renewal
                self._start_renewal_loop()
                return True
            
            # Lock held by someone else, wait and retry
            await asyncio.sleep(0.1)
        
        return False
    
    async def release(self):
        """Release the lock"""
        if self.lease_id:
            await self.lock_service.release(
                key=self.resource_key,
                lease_id=self.lease_id
            )
            self._stop_renewal_loop()
    
    def _start_renewal_loop(self):
        """Renew lease periodically to prevent expiry"""
        async def renew():
            while self.lease_id:
                try:
                    await self.lock_service.renew_lease(self.lease_id)
                    await asyncio.sleep(10)  # Renew every 10s (TTL is 30s)
                except Exception:
                    # Lost connection, lease will expire
                    break
        
        asyncio.create_task(renew())

Follow-up Questions:

What happens during network partition?
- Clients on minority side can’t acquire new locks
- Existing leases expire, preventing split-brain
How do you handle leader election for the lock service itself?
- Raft consensus with quorum reads/writes
Redlock controversy: Why is Redis Redlock problematic?
- Martin Kleppmann’s analysis: Clock assumptions too strong
- Fencing tokens still needed for safety

Problem 3: Design a Global Payment System

Difficulty: Very Hard
Companies: Stripe, Square, Adyen
Time: 60 minutes

Problem Statement

Design a payment processing system that:

Handles credit card charges globally
Guarantees exactly-once processing
Supports refunds and chargebacks
Processes millions of transactions per day

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PAYMENT SYSTEM ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                           ┌─────────────────┐                               │
│                           │   API Gateway   │                               │
│                           │ (Idempotency)   │                               │
│                           └────────┬────────┘                               │
│                                    │                                        │
│                           ┌────────▼────────┐                               │
│                           │ Payment Service │                               │
│                           │   (Orchestrator)│                               │
│                           └────────┬────────┘                               │
│                                    │                                        │
│     ┌──────────────────────────────┼──────────────────────────────┐         │
│     │                              │                              │         │
│     ▼                              ▼                              ▼         │
│ ┌────────────┐            ┌────────────────┐           ┌────────────────┐  │
│ │  Fraud     │            │  Authorization │           │  Risk Engine   │  │
│ │  Detection │            │  Service       │           │                │  │
│ └────────────┘            └───────┬────────┘           └────────────────┘  │
│                                   │                                         │
│                    ┌──────────────┼──────────────┐                          │
│                    ▼              ▼              ▼                          │
│               ┌─────────┐   ┌─────────┐   ┌─────────┐                      │
│               │  Visa   │   │Mastercard│  │  AMEX   │ (Payment Networks)  │
│               └─────────┘   └─────────┘   └─────────┘                      │
│                                                                              │
│  STATE MACHINE FOR PAYMENT:                                                 │
│  ──────────────────────────                                                 │
│                                                                              │
│  ┌──────────┐    Auth     ┌────────────┐   Capture   ┌───────────┐         │
│  │ PENDING  │ ──────────► │ AUTHORIZED │ ──────────► │ CAPTURED  │         │
│  └──────────┘             └────────────┘             └───────────┘         │
│       │                         │                          │               │
│       │                         │ Void                     │ Refund       │
│       │                         ▼                          ▼               │
│       │                   ┌───────────┐             ┌───────────┐          │
│       └──────────────────►│  FAILED   │             │ REFUNDED  │          │
│          Decline          └───────────┘             └───────────┘          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Exactly-Once Processing:

class PaymentProcessor:
    async def process_payment(
        self,
        idempotency_key: str,
        amount: int,
        currency: str,
        card_token: str
    ) -> PaymentResult:
        """
        Process payment with exactly-once guarantee.
        """
        
        # Step 1: Check idempotency store
        existing = await self.idempotency_store.get(idempotency_key)
        if existing:
            if existing.status == "completed":
                return existing.result  # Return cached result
            elif existing.status == "in_progress":
                raise ConflictError("Payment in progress")
        
        # Step 2: Create payment record (source of truth)
        payment_id = generate_id()
        
        async with self.db.transaction():
            # Insert payment
            await self.db.insert("payments", {
                "id": payment_id,
                "idempotency_key": idempotency_key,
                "amount": amount,
                "currency": currency,
                "status": "PENDING"
            })
            
            # Insert into outbox for event
            await self.db.insert("outbox", {
                "event": "payment.created",
                "payload": {"payment_id": payment_id}
            })
        
        try:
            # Step 3: Call payment network
            auth_result = await self.call_payment_network(
                amount=amount,
                currency=currency,
                card_token=card_token
            )
            
            # Step 4: Update payment status
            async with self.db.transaction():
                await self.db.update(
                    "payments",
                    {"id": payment_id},
                    {"status": "AUTHORIZED" if auth_result.success else "FAILED",
                     "network_reference": auth_result.reference}
                )
                
                await self.db.insert("outbox", {
                    "event": "payment.authorized" if auth_result.success else "payment.failed",
                    "payload": {"payment_id": payment_id}
                })
            
            result = PaymentResult(
                payment_id=payment_id,
                success=auth_result.success,
                decline_code=auth_result.decline_code
            )
            
            # Step 5: Cache in idempotency store
            await self.idempotency_store.set(
                idempotency_key,
                {"status": "completed", "result": result},
                ttl=24 * 3600  # 24 hours
            )
            
            return result
            
        except Exception as e:
            # Network failure - payment status uncertain
            # Keep as PENDING, reconciliation job will resolve
            logger.error(f"Payment network error: {e}")
            raise

Key Design Decisions:

Decision	Rationale
Transactional outbox	Database and event publishing in same transaction
Idempotency keys	Client-provided, prevents duplicate charges
State machine	Clear payment lifecycle, no invalid transitions
Reconciliation	Async job compares our state with payment networks
Saga for refunds	Multi-step with compensation on failure

Problem 4: Design a Distributed Task Scheduler

Difficulty: Hard
Companies: Uber (Cadence), Airbnb, LinkedIn
Time: 45 minutes

Problem Statement

Design a system that:

Schedules tasks to run at specific times (cron-like)
Executes one-time delayed tasks
Guarantees at-least-once execution
Scales horizontally

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TASK SCHEDULER                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ARCHITECTURE:                                                              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        SCHEDULER LAYER                             │     │
│  │                                                                    │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │     │
│  │  │ Scheduler 1  │  │ Scheduler 2  │  │ Scheduler N  │             │     │
│  │  │  (Leader)    │  │ (Follower)   │  │ (Follower)   │             │     │
│  │  └──────┬───────┘  └──────────────┘  └──────────────┘             │     │
│  │         │                                                          │     │
│  │         │ Leader election via consensus                            │     │
│  │         │ (Only leader schedules tasks)                            │     │
│  │         │                                                          │     │
│  └─────────┼──────────────────────────────────────────────────────────┘     │
│            │                                                                 │
│            ▼                                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        TASK QUEUE                                  │     │
│  │                                                                    │     │
│  │   Ready Queue (sorted by execution_time):                         │     │
│  │   ┌─────────────────────────────────────────────────────────────┐ │     │
│  │   │ task_1 (run at 10:00) │ task_2 (10:01) │ task_3 (10:05) │   │ │     │
│  │   └─────────────────────────────────────────────────────────────┘ │     │
│  │                                                                    │     │
│  │   Delay Queue (Redis ZSET or dedicated delay queue):              │     │
│  │   ┌─────────────────────────────────────────────────────────────┐ │     │
│  │   │ task_4 (due 11:00) │ task_5 (due 12:00) │ ...              │ │     │
│  │   └─────────────────────────────────────────────────────────────┘ │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│            │                                                                 │
│            ▼                                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                        WORKER LAYER                                │     │
│  │                                                                    │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │     │
│  │  │  Worker 1    │  │  Worker 2    │  │  Worker N    │             │     │
│  │  │              │  │              │  │              │             │     │
│  │  │ - Pulls task │  │ - Pulls task │  │ - Pulls task │             │     │
│  │  │ - Executes   │  │ - Executes   │  │ - Executes   │             │     │
│  │  │ - Acks/Fails │  │ - Acks/Fails │  │ - Acks/Fails │             │     │
│  │  └──────────────┘  └──────────────┘  └──────────────┘             │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  TASK DATA MODEL:                                                           │
│  ────────────────                                                           │
│  {                                                                          │
│    "id": "task-123",                                                        │
│    "type": "send_email",                                                    │
│    "payload": {"to": "user@example.com", ...},                             │
│    "schedule_time": 1699900000,                                            │
│    "status": "PENDING|RUNNING|COMPLETED|FAILED",                           │
│    "retry_count": 0,                                                       │
│    "max_retries": 3,                                                       │
│    "created_at": 1699800000                                                │
│  }                                                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Handling Exactly-Once Semantics:

class TaskWorker:
    async def process_task(self, task: Task):
        """
        Process task with at-least-once guarantee.
        Idempotency is caller's responsibility.
        """
        
        # Set visibility timeout (task invisible to other workers)
        await self.queue.set_visibility_timeout(task.id, seconds=300)
        
        try:
            # Execute the task
            handler = self.handlers[task.type]
            await handler(task.payload)
            
            # Mark as complete
            await self.store.update_task(task.id, status="COMPLETED")
            await self.queue.ack(task.id)
            
        except RetryableError as e:
            if task.retry_count < task.max_retries:
                # Reschedule with exponential backoff
                next_run = datetime.now() + timedelta(
                    seconds=2 ** task.retry_count * 60
                )
                await self.store.update_task(
                    task.id,
                    status="PENDING",
                    schedule_time=next_run,
                    retry_count=task.retry_count + 1
                )
                await self.queue.nack(task.id)
            else:
                # Max retries exceeded
                await self.store.update_task(
                    task.id, 
                    status="FAILED",
                    error=str(e)
                )
                await self.queue.ack(task.id)
                
        except Exception as e:
            # Non-retryable error
            await self.store.update_task(
                task.id,
                status="FAILED",
                error=str(e)
            )
            await self.queue.ack(task.id)

Problem 5: Design a Distributed Unique ID Generator

Difficulty: Medium
Companies: Twitter (Snowflake), Instagram
Time: 30 minutes

Problem Statement

Design a system that generates globally unique IDs with:

64-bit IDs
Sortable by time (roughly)
No coordination between nodes
Millions of IDs per second

Solution

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SNOWFLAKE ID FORMAT (64 bits)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────┬───────────────────────────┬────────────┬──────────────────────┐    │
│  │  0  │       41 bits            │  10 bits   │      12 bits         │    │
│  │     │    Timestamp (ms)        │  Node ID   │   Sequence Number    │    │
│  │     │  (69 years from epoch)   │ (1024 nodes)│  (4096 per ms)      │    │
│  └─────┴───────────────────────────┴────────────┴──────────────────────┘    │
│                                                                              │
│  CAPACITY:                                                                  │
│  ─────────                                                                  │
│  - 1024 nodes                                                               │
│  - 4096 IDs per millisecond per node                                        │
│  - 4.1 million IDs per second per node                                      │
│  - 69 years before timestamp overflow                                       │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ────────                                                                   │
│  Timestamp: 1699900000000 (41 bits)                                         │
│  Node ID: 42 (10 bits)                                                      │
│  Sequence: 0 (12 bits)                                                      │
│                                                                              │
│  ID = (1699900000000 << 22) | (42 << 12) | 0                               │
│     = 7130486325248696320                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation:

import time
import threading

class SnowflakeGenerator:
    def __init__(self, node_id: int, epoch: int = 1288834974657):
        """
        Initialize Snowflake ID generator.
        
        Args:
            node_id: Unique identifier for this generator (0-1023)
            epoch: Custom epoch (Twitter's is 2010-11-04)
        """
        if not 0 <= node_id < 1024:
            raise ValueError("Node ID must be between 0 and 1023")
        
        self.node_id = node_id
        self.epoch = epoch
        self.sequence = 0
        self.last_timestamp = -1
        self._lock = threading.Lock()
    
    def _current_time_millis(self) -> int:
        return int(time.time() * 1000)
    
    def generate(self) -> int:
        """Generate a unique Snowflake ID"""
        with self._lock:
            timestamp = self._current_time_millis()
            
            if timestamp < self.last_timestamp:
                # Clock went backwards! Wait for it to catch up.
                raise ClockMovedBackwardsError(
                    f"Clock moved backwards by {self.last_timestamp - timestamp}ms"
                )
            
            if timestamp == self.last_timestamp:
                # Same millisecond, increment sequence
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                
                if self.sequence == 0:
                    # Sequence exhausted, wait for next millisecond
                    timestamp = self._wait_next_millis(timestamp)
            else:
                # New millisecond, reset sequence
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            # Construct the ID
            id = (
                ((timestamp - self.epoch) << 22) |
                (self.node_id << 12) |
                self.sequence
            )
            
            return id
    
    def _wait_next_millis(self, last_timestamp: int) -> int:
        """Spin until we get to the next millisecond"""
        timestamp = self._current_time_millis()
        while timestamp <= last_timestamp:
            timestamp = self._current_time_millis()
        return timestamp
    
    @staticmethod
    def parse(id: int) -> dict:
        """Parse a Snowflake ID into its components"""
        return {
            "timestamp": (id >> 22) + 1288834974657,
            "node_id": (id >> 12) & 0x3FF,
            "sequence": id & 0xFFF
        }

Follow-up Questions:

How do you assign node IDs?
- Zookeeper/etcd for coordination
- Use MAC address hash
- Use Kubernetes pod ordinal
What if clock goes backwards?
- Throw exception (let operator fix NTP)
- Wait for clock to catch up
- Use logical clock component
How do you handle datacenter failover?
- Split node ID: 5 bits datacenter + 5 bits machine
- Different epoch per datacenter

Problem 6: Design a Distributed Message Queue

Difficulty: Very Hard
Companies: Confluent, AWS (SQS), LinkedIn
Time: 60 minutes

Problem Statement

Design a message queue like Kafka that:

Handles millions of messages per second
Provides ordering guarantees (per partition)
Supports multiple consumers
Persists messages for replay

Key Design Points

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED MESSAGE QUEUE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TOPIC: orders (3 partitions, replication factor 3)                        │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                                                                    │     │
│  │   Partition 0          Partition 1          Partition 2           │     │
│  │   (Leader: B1)         (Leader: B2)         (Leader: B3)          │     │
│  │                                                                    │     │
│  │   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │     │
│  │   │[0][1][2][3]  │     │[0][1][2][3]  │     │[0][1][2][3]  │      │     │
│  │   └──────────────┘     └──────────────┘     └──────────────┘      │     │
│  │   Broker 1 (L)         Broker 2 (L)         Broker 3 (L)          │     │
│  │   Broker 2 (F)         Broker 3 (F)         Broker 1 (F)          │     │
│  │   Broker 3 (F)         Broker 1 (F)         Broker 2 (F)          │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  CONSUMER GROUP: order-processors                                           │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                                                                    │     │
│  │   Consumer 1           Consumer 2           Consumer 3            │     │
│  │   (Partition 0)        (Partition 1)        (Partition 2)         │     │
│  │                                                                    │     │
│  │   Each partition assigned to exactly ONE consumer in group        │     │
│  │   If Consumer 2 dies, Partition 1 reassigned to 1 or 3           │     │
│  │                                                                    │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  GUARANTEES:                                                                │
│  ───────────                                                                │
│  - Ordering: Within a partition only                                        │
│  - Delivery: At-least-once (consumer commits offset after processing)      │
│  - Durability: Message persisted to N replicas before ack                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Design Uber/Lyft

Topics: Real-time location, matching, surge pricing Key challenges: Geospatial indexing, low-latency dispatch

Design Twitter Timeline

Topics: Fan-out, feed ranking, real-time updates Key challenges: Hot users, read-heavy workload

Design Google Docs

Topics: CRDT, operational transformation, collaboration Key challenges: Conflict resolution, real-time sync

Design Netflix

Topics: CDN, video encoding, recommendations Key challenges: Global scale, adaptive streaming

Interview Tips

How to Structure Your Answer

Clarify requirements (5 min)
- Scale: Users, requests/second, data size
- Consistency: Strong or eventual?
- Latency: What’s acceptable?
High-level design (10 min)
- Draw boxes and arrows
- Identify core components
- Explain data flow
Deep dive (20 min)
- Pick 2-3 most critical components
- Discuss algorithms, data structures
- Explain trade-offs
Handle edge cases (10 min)
- Failures: Node crashes, network partitions
- Scale: Hot spots, bottlenecks
- Consistency: Race conditions

What Interviewers Look For

Breadth: Can you identify all components?
Depth: Can you go deep on key areas?
Trade-offs: Do you understand CAP, consistency vs availability?
Practicality: Is your design buildable?
Communication: Can you explain clearly?

Common Mistakes to Avoid

Starting to code before understanding the problem
Not asking clarifying questions
Ignoring scale constraints
Over-engineering simple problems
Forgetting failure modes
Not discussing trade-offs

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Interview Practice Problems

​Problem 1: Design a Distributed Rate Limiter

​Problem Statement

​Hints

​Solution

​Problem 2: Design a Distributed Lock Service

​Problem Statement

​Solution

​Problem 3: Design a Global Payment System

​Problem Statement

​Solution

​Problem 4: Design a Distributed Task Scheduler

​Problem Statement

​Solution

​Problem 5: Design a Distributed Unique ID Generator

​Problem Statement

​Solution

​Problem 6: Design a Distributed Message Queue

​Problem Statement

​Key Design Points

​More Practice Problems

Design Uber/Lyft

Design Twitter Timeline

Design Google Docs

Design Netflix

​Interview Tips

Interview Practice Problems

Problem 1: Design a Distributed Rate Limiter

Problem Statement

Hints

Solution

Problem 2: Design a Distributed Lock Service

Problem Statement

Solution

Problem 3: Design a Global Payment System

Problem Statement

Solution

Problem 4: Design a Distributed Task Scheduler

Problem Statement

Solution

Problem 5: Design a Distributed Unique ID Generator

Problem Statement

Solution

Problem 6: Design a Distributed Message Queue

Problem Statement

Key Design Points

More Practice Problems

Interview Tips