Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Scalability Patterns - From Single Server to Millions
Senior Level: This covers advanced scaling patterns expected at L5+ interviews. Know when and why to apply each pattern.

Horizontal vs Vertical: The Real Trade-offs

Vertical vs Horizontal Scaling
Interview Insight: Don’t immediately jump to “scale horizontally.” First ask: “What’s the actual bottleneck?” Sometimes a bigger machine or query optimization is the right answer.

Stateless vs Stateful Services

Making Services Stateless

Stateless Architecture

When Stateful is Okay

Stateful services are fine when:
• WebSocket connections (natural affinity)
• Real-time gaming (session state)
• In-memory caching (local cache + distributed)
• Batch processing (worker owns work)

Key: Design for graceful degradation when state is lost

Caching at Scale

Multi-Level Caching

Multi-Level Caching

Multi-Level Cache Implementation

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional, Any, Dict, List, Callable
from datetime import datetime, timedelta
import asyncio
import hashlib
import json
import logging

logger = logging.getLogger(__name__)

@dataclass
class CacheEntry:
    value: Any
    created_at: datetime
    ttl: timedelta
    tags: List[str] = field(default_factory=list)
    
    @property
    def is_expired(self) -> bool:
        return datetime.now() > self.created_at + self.ttl
    
    @property
    def remaining_ttl(self) -> timedelta:
        remaining = (self.created_at + self.ttl) - datetime.now()
        return max(timedelta(0), remaining)


class CacheLayer(ABC):
    """Abstract cache layer"""
    
    @abstractmethod
    async def get(self, key: str) -> Optional[CacheEntry]:
        pass
    
    @abstractmethod
    async def set(self, key: str, entry: CacheEntry) -> None:
        pass
    
    @abstractmethod
    async def delete(self, key: str) -> None:
        pass
    
    @abstractmethod
    async def delete_by_tag(self, tag: str) -> int:
        pass


class LocalCache(CacheLayer):
    """L1: In-process memory cache"""
    
    def __init__(self, max_size: int = 10000):
        self.cache: Dict[str, CacheEntry] = {}
        self.max_size = max_size
        self.access_order: List[str] = []
        self.tag_index: Dict[str, set] = {}
    
    async def get(self, key: str) -> Optional[CacheEntry]:
        entry = self.cache.get(key)
        if entry and not entry.is_expired:
            # Move to end (LRU)
            if key in self.access_order:
                self.access_order.remove(key)
            self.access_order.append(key)
            return entry
        elif entry:
            await self.delete(key)
        return None
    
    async def set(self, key: str, entry: CacheEntry) -> None:
        # Evict if full
        while len(self.cache) >= self.max_size:
            oldest = self.access_order.pop(0)
            del self.cache[oldest]
        
        self.cache[key] = entry
        self.access_order.append(key)
        
        # Update tag index
        for tag in entry.tags:
            if tag not in self.tag_index:
                self.tag_index[tag] = set()
            self.tag_index[tag].add(key)
    
    async def delete(self, key: str) -> None:
        if key in self.cache:
            entry = self.cache.pop(key)
            if key in self.access_order:
                self.access_order.remove(key)
            for tag in entry.tags:
                if tag in self.tag_index:
                    self.tag_index[tag].discard(key)
    
    async def delete_by_tag(self, tag: str) -> int:
        keys = self.tag_index.get(tag, set()).copy()
        for key in keys:
            await self.delete(key)
        return len(keys)


class RedisCache(CacheLayer):
    """L2: Distributed Redis cache"""
    
    def __init__(self, redis_client, prefix: str = "cache"):
        self.redis = redis_client
        self.prefix = prefix
    
    def _key(self, key: str) -> str:
        return f"{self.prefix}:{key}"
    
    def _tag_key(self, tag: str) -> str:
        return f"{self.prefix}:tag:{tag}"
    
    async def get(self, key: str) -> Optional[CacheEntry]:
        data = await self.redis.get(self._key(key))
        if data:
            entry_data = json.loads(data)
            return CacheEntry(
                value=entry_data["value"],
                created_at=datetime.fromisoformat(entry_data["created_at"]),
                ttl=timedelta(seconds=entry_data["ttl_seconds"]),
                tags=entry_data.get("tags", [])
            )
        return None
    
    async def set(self, key: str, entry: CacheEntry) -> None:
        data = json.dumps({
            "value": entry.value,
            "created_at": entry.created_at.isoformat(),
            "ttl_seconds": entry.ttl.total_seconds(),
            "tags": entry.tags
        })
        
        ttl_seconds = int(entry.remaining_ttl.total_seconds())
        if ttl_seconds > 0:
            await self.redis.setex(self._key(key), ttl_seconds, data)
            
            # Update tag sets
            for tag in entry.tags:
                await self.redis.sadd(self._tag_key(tag), key)
    
    async def delete(self, key: str) -> None:
        await self.redis.delete(self._key(key))
    
    async def delete_by_tag(self, tag: str) -> int:
        keys = await self.redis.smembers(self._tag_key(tag))
        if keys:
            await self.redis.delete(*[self._key(k) for k in keys])
            await self.redis.delete(self._tag_key(tag))
        return len(keys)


class MultiLevelCache:
    """
    Multi-level cache with automatic promotion/demotion.
    L1: Local in-memory (fastest, smallest)
    L2: Redis distributed (fast, shared)
    L3: Database (source of truth)
    """
    
    def __init__(
        self,
        l1_cache: LocalCache,
        l2_cache: RedisCache,
        db_loader: Callable,
        default_ttl: timedelta = timedelta(minutes=5)
    ):
        self.l1 = l1_cache
        self.l2 = l2_cache
        self.db_loader = db_loader
        self.default_ttl = default_ttl
        
        # Metrics
        self.hits = {"l1": 0, "l2": 0, "db": 0}
        self.misses = 0
    
    async def get(
        self, 
        key: str, 
        ttl: Optional[timedelta] = None,
        tags: List[str] = None
    ) -> Optional[Any]:
        """Get with automatic cache population"""
        
        # Try L1 (local)
        entry = await self.l1.get(key)
        if entry:
            self.hits["l1"] += 1
            return entry.value
        
        # Try L2 (Redis)
        entry = await self.l2.get(key)
        if entry:
            self.hits["l2"] += 1
            # Promote to L1
            await self.l1.set(key, entry)
            return entry.value
        
        # Load from DB
        value = await self.db_loader(key)
        if value is not None:
            self.hits["db"] += 1
            entry = CacheEntry(
                value=value,
                created_at=datetime.now(),
                ttl=ttl or self.default_ttl,
                tags=tags or []
            )
            # Populate both cache levels
            await asyncio.gather(
                self.l1.set(key, entry),
                self.l2.set(key, entry)
            )
            return value
        
        self.misses += 1
        return None
    
    async def invalidate(self, key: str) -> None:
        """Invalidate across all levels"""
        await asyncio.gather(
            self.l1.delete(key),
            self.l2.delete(key)
        )
    
    async def invalidate_by_tag(self, tag: str) -> int:
        """Invalidate all entries with a tag"""
        l1_count = await self.l1.delete_by_tag(tag)
        l2_count = await self.l2.delete_by_tag(tag)
        return max(l1_count, l2_count)
    
    def get_hit_rates(self) -> Dict[str, float]:
        total = sum(self.hits.values()) + self.misses
        if total == 0:
            return {"l1": 0, "l2": 0, "db": 0, "miss": 0}
        return {
            "l1": self.hits["l1"] / total,
            "l2": self.hits["l2"] / total,
            "db": self.hits["db"] / total,
            "miss": self.misses / total
        }


# Usage example
async def create_cache_system(redis_client, db):
    async def load_user(key: str):
        # key format: "user:123"
        user_id = key.split(":")[1]
        return await db.fetch_user(user_id)
    
    cache = MultiLevelCache(
        l1_cache=LocalCache(max_size=1000),
        l2_cache=RedisCache(redis_client, prefix="app"),
        db_loader=load_user,
        default_ttl=timedelta(minutes=10)
    )
    
    # Get user (auto-populates cache)
    user = await cache.get(
        "user:123",
        tags=["users", "user:123"]
    )
    
    # Invalidate on update
    await cache.invalidate("user:123")
    
    # Invalidate all users
    await cache.invalidate_by_tag("users")
    
    # Check hit rates
    print(cache.get_hit_rates())

Cache Consistency Strategies

# Strategy 1: Cache-Aside with TTL (Simple)
def get_user(user_id):
    # 1. Try cache
    user = cache.get(f"user:{user_id}")
    if user:
        return user
    
    # 2. Load from DB
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    
    # 3. Set cache with TTL
    cache.set(f"user:{user_id}", user, ttl=300)  # 5 min
    return user

def update_user(user_id, data):
    # Update DB
    db.update(user_id, data)
    # Invalidate cache
    cache.delete(f"user:{user_id}")


# Strategy 2: Write-Through (Strong Consistency)
def update_user(user_id, data):
    with transaction():
        # Update DB
        db.update(user_id, data)
        # Update cache in same transaction
        cache.set(f"user:{user_id}", data)


# Strategy 3: Event-Driven Invalidation (Best for microservices)
def update_user(user_id, data):
    db.update(user_id, data)
    # Publish event
    event_bus.publish("user.updated", {"user_id": user_id})

# Cache service listens for events
@event_handler("user.updated")
def invalidate_user_cache(event):
    cache.delete(f"user:{event.user_id}")

Database Scaling Patterns

Read Replicas Pattern

Read Replicas
class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.replica_index = 0
    
    def get_connection(self, query_type: str, user_session=None):
        if query_type == "write":
            return self.primary
        
        # Read-your-writes: Check if user recently wrote
        if user_session and user_session.last_write_time:
            if time.time() - user_session.last_write_time < 5:
                # Recent write, use primary to avoid stale reads
                return self.primary
        
        # Round-robin across replicas
        replica = self.replicas[self.replica_index % len(self.replicas)]
        self.replica_index += 1
        return replica

Sharding Strategies Deep Dive

Sharding Strategies

Cross-Shard Operations

# Problem: Query needs data from multiple shards

# Solution 1: Scatter-Gather
async def get_user_orders_all_time(user_id):
    # User's orders might be on different time-based shards
    shard_ids = get_all_shards()
    
    # Query all shards in parallel
    tasks = [query_shard(shard, user_id) for shard in shard_ids]
    results = await asyncio.gather(*tasks)
    
    # Merge results
    return merge_and_sort(results)

# Solution 2: Denormalization
# Store frequently-joined data together
# Instead of: users shard + orders shard + products shard
# Store: user_orders (denormalized) on user's shard

# Solution 3: Global Tables
# Some tables replicated to all shards (read-only)
# Example: countries, currencies, product categories

Async Processing Patterns

Task Queue Architecture

Task Queue
class ReliableTaskProcessor:
    """
    Production-grade task processor with exactly-once semantics
    """
    
    def __init__(self, queue, db, max_retries=3):
        self.queue = queue
        self.db = db
        self.max_retries = max_retries
    
    async def process_task(self, task):
        task_id = task.id
        
        # Idempotency check
        if await self.is_processed(task_id):
            await self.queue.ack(task)
            return
        
        try:
            # Process with timeout
            async with asyncio.timeout(30):
                result = await self.do_work(task)
            
            # Mark as processed (atomically with result storage)
            await self.mark_completed(task_id, result)
            await self.queue.ack(task)
            
        except asyncio.TimeoutError:
            await self.handle_timeout(task)
            
        except RetryableError as e:
            await self.retry_or_dlq(task, e)
            
        except Exception as e:
            # Non-retryable, send to DLQ immediately
            await self.send_to_dlq(task, e)
    
    async def retry_or_dlq(self, task, error):
        if task.retry_count < self.max_retries:
            delay = 2 ** task.retry_count  # Exponential backoff
            await self.queue.retry(task, delay_seconds=delay)
        else:
            await self.send_to_dlq(task, error)

Event-Driven Architecture

Event-Driven Architecture

Load Shedding & Backpressure

Graceful Degradation

Load Shedding
import asyncio
import time
import random
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional, Dict, Any
from enum import Enum
import logging

logger = logging.getLogger(__name__)

class Priority(Enum):
    CRITICAL = 0  # Never shed (health checks, admin)
    HIGH = 1      # Shed last (paid users, transactions)
    NORMAL = 2    # Standard requests
    LOW = 3       # Shed first (analytics, prefetch)

@dataclass
class LoadShedderConfig:
    target_latency_ms: float = 100.0
    max_latency_ms: float = 500.0
    window_size: int = 100
    adjustment_interval: float = 1.0  # seconds
    min_accept_rate: float = 0.1  # Always accept 10%

class AdaptiveLoadShedder:
    """
    Adaptively sheds load based on system health metrics.
    Uses latency as the primary signal with priority-based shedding.
    """
    
    def __init__(self, config: LoadShedderConfig = None):
        self.config = config or LoadShedderConfig()
        self.latencies = deque(maxlen=self.config.window_size)
        self.shed_rates: Dict[Priority, float] = {
            Priority.CRITICAL: 0.0,  # Never shed
            Priority.HIGH: 0.0,
            Priority.NORMAL: 0.0,
            Priority.LOW: 0.0
        }
        self.last_adjustment = time.time()
        
        # Metrics
        self.total_requests = 0
        self.shed_requests = 0
        self.accepted_requests = 0
    
    def should_accept(self, priority: Priority = Priority.NORMAL) -> bool:
        """Determine if request should be accepted"""
        self.total_requests += 1
        
        # Critical requests always pass
        if priority == Priority.CRITICAL:
            self.accepted_requests += 1
            return True
        
        # Check shed rate for this priority
        shed_rate = self.shed_rates[priority]
        if random.random() < shed_rate:
            self.shed_requests += 1
            logger.debug(f"Shedding {priority.name} request (rate: {shed_rate:.2%})")
            return False
        
        self.accepted_requests += 1
        return True
    
    def record_latency(self, latency_ms: float) -> None:
        """Record request latency and adjust shed rates"""
        self.latencies.append(latency_ms)
        
        # Adjust periodically
        if time.time() - self.last_adjustment > self.config.adjustment_interval:
            self._adjust_shed_rates()
            self.last_adjustment = time.time()
    
    def _adjust_shed_rates(self) -> None:
        """Adjust shed rates based on current latency"""
        if len(self.latencies) < 10:
            return
        
        sorted_latencies = sorted(self.latencies)
        p50 = sorted_latencies[len(sorted_latencies) // 2]
        p99 = sorted_latencies[int(len(sorted_latencies) * 0.99)]
        
        # Calculate pressure based on latency
        if p99 > self.config.max_latency_ms:
            # Emergency: shed aggressively
            pressure = 0.8
        elif p99 > self.config.target_latency_ms * 2:
            # High pressure
            pressure = 0.5
        elif p99 > self.config.target_latency_ms:
            # Moderate pressure
            pressure = 0.2
        elif p99 < self.config.target_latency_ms * 0.5:
            # Low pressure: recover
            pressure = -0.2
        else:
            pressure = 0.0
        
        # Apply pressure to each priority level differently
        priority_multipliers = {
            Priority.LOW: 1.5,      # Shed first
            Priority.NORMAL: 1.0,
            Priority.HIGH: 0.3,     # Shed last
        }
        
        for priority, multiplier in priority_multipliers.items():
            current = self.shed_rates[priority]
            adjustment = pressure * 0.1 * multiplier
            new_rate = max(0.0, min(0.9, current + adjustment))
            self.shed_rates[priority] = new_rate
        
        logger.info(
            f"Load shedder adjusted: p50={p50:.1f}ms p99={p99:.1f}ms "
            f"rates={{{k.name}: {v:.2%} for k, v in self.shed_rates.items()}}}"
        )
    
    def get_metrics(self) -> Dict[str, Any]:
        return {
            "total_requests": self.total_requests,
            "shed_requests": self.shed_requests,
            "shed_rate": self.shed_requests / max(1, self.total_requests),
            "current_shed_rates": {
                k.name: v for k, v in self.shed_rates.items()
            },
            "latency_p50": sorted(self.latencies)[len(self.latencies) // 2] if self.latencies else 0,
            "latency_p99": sorted(self.latencies)[int(len(self.latencies) * 0.99)] if self.latencies else 0
        }


class TokenBucketRateLimiter:
    """
    Token bucket for rate limiting with burst support.
    Use alongside load shedding for complete traffic management.
    """
    
    def __init__(
        self,
        rate: float,  # tokens per second
        bucket_size: int = None,  # max burst
        initial_tokens: int = None
    ):
        self.rate = rate
        self.bucket_size = bucket_size or int(rate * 2)
        self.tokens = initial_tokens if initial_tokens is not None else self.bucket_size
        self.last_update = time.time()
    
    def acquire(self, tokens: int = 1) -> bool:
        """Try to acquire tokens. Returns True if successful."""
        self._refill()
        
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
    
    def _refill(self) -> None:
        """Add tokens based on elapsed time"""
        now = time.time()
        elapsed = now - self.last_update
        self.last_update = now
        
        self.tokens = min(
            self.bucket_size,
            self.tokens + (elapsed * self.rate)
        )


class GracefulDegradationManager:
    """
    Manages graceful degradation strategies based on load.
    """
    
    def __init__(self):
        self.load_shedder = AdaptiveLoadShedder()
        self.degradation_level = 0  # 0=normal, 1=reduced, 2=minimal, 3=emergency
        self.feature_flags = {
            "recommendations": True,
            "analytics": True,
            "search_suggestions": True,
            "full_search": True,
            "image_processing": True,
            "notifications": True
        }
    
    def update_degradation_level(self) -> None:
        """Update degradation level based on metrics"""
        metrics = self.load_shedder.get_metrics()
        p99 = metrics.get("latency_p99", 0)
        
        if p99 > 1000:  # > 1 second
            self.degradation_level = 3
        elif p99 > 500:
            self.degradation_level = 2
        elif p99 > 200:
            self.degradation_level = 1
        else:
            self.degradation_level = 0
        
        self._apply_degradation()
    
    def _apply_degradation(self) -> None:
        """Disable features based on degradation level"""
        levels = {
            0: [],  # All features enabled
            1: ["analytics", "recommendations"],
            2: ["analytics", "recommendations", "search_suggestions", "notifications"],
            3: ["analytics", "recommendations", "search_suggestions", 
                "notifications", "image_processing"]
        }
        
        disabled = levels.get(self.degradation_level, [])
        for feature in self.feature_flags:
            self.feature_flags[feature] = feature not in disabled
        
        logger.warning(
            f"Degradation level: {self.degradation_level}, "
            f"Disabled: {disabled}"
        )
    
    def is_feature_enabled(self, feature: str) -> bool:
        return self.feature_flags.get(feature, True)


# FastAPI integration
from fastapi import FastAPI, Request, Response, HTTPException
from fastapi.middleware.base import BaseHTTPMiddleware

app = FastAPI()
degradation_manager = GracefulDegradationManager()

class LoadSheddingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Determine priority from request
        priority = self._get_priority(request)
        
        if not degradation_manager.load_shedder.should_accept(priority):
            return Response(
                content='{"error": "Service temporarily overloaded"}',
                status_code=503,
                headers={"Retry-After": "5"}
            )
        
        start = time.time()
        response = await call_next(request)
        latency_ms = (time.time() - start) * 1000
        
        degradation_manager.load_shedder.record_latency(latency_ms)
        degradation_manager.update_degradation_level()
        
        return response
    
    def _get_priority(self, request: Request) -> Priority:
        # Health checks are critical
        if request.url.path == "/health":
            return Priority.CRITICAL
        # Premium users get high priority
        if request.headers.get("X-Premium-User"):
            return Priority.HIGH
        # Analytics/prefetch are low priority
        if request.url.path.startswith("/analytics"):
            return Priority.LOW
        return Priority.NORMAL

app.add_middleware(LoadSheddingMiddleware)

@app.get("/recommendations")
async def get_recommendations():
    if not degradation_manager.is_feature_enabled("recommendations"):
        return {"items": [], "degraded": True}
    return {"items": ["rec1", "rec2", "rec3"]}

Senior Interview Questions

Approach:
  1. Identify the bottleneck: Is it DB? Network? Application?
  2. Batching: Combine multiple writes into one
  3. Async writes: Write to queue, persist later
  4. Sharding: Distribute writes across nodes
  5. LSM-tree databases: Cassandra, RocksDB (optimized for writes)
Example answer: “First, I’d batch writes on the application side - instead of 1000 individual inserts, do bulk insert. Then add a queue like Kafka as a buffer. Finally, use a write-optimized database like Cassandra if the volume is truly massive.”
Framework:
  1. Current baseline: Measure current QPS, latency, resource usage
  2. Growth projection: Expected traffic increase (e.g., 2x in 6 months)
  3. Headroom: Plan for 3x current load (for spikes)
  4. Load testing: Verify system handles projected load
  5. Monitoring: Track capacity metrics, alert at 70% utilization
Key metrics to track:
  • CPU utilization by service
  • Memory usage and GC pressure
  • Database connections and query latency
  • Queue depth and processing rate
  • Network bandwidth
Safe migration strategy:
  1. Dual-write: Write to both old and new schema
  2. Backfill: Migrate historical data in batches
  3. Shadow read: Read from new, compare with old
  4. Cutover: Switch reads to new schema
  5. Cleanup: Remove dual-write, drop old schema
Key principles:
  • Never lock tables in production
  • Migrations must be reversible
  • Test with production-sized data
  • Have a rollback plan
  • Do it during low-traffic periods
Systematic approach:
  1. Identify scope: All users? Some? Specific data?
  2. Check cache layers: CDN, app cache, Redis, DB cache
  3. Verify TTLs: Are caches expiring correctly?
  4. Check replication: Is replica lagging?
  5. Trace the write: Did write actually succeed?
Common causes:
  • Cache not being invalidated on write
  • Reading from stale replica
  • CDN caching dynamic content
  • Race condition between cache invalidation and read

Interview Questions

Strong answer:
  • The first thing I would do is resist the urge to immediately say “scale horizontally.” The real question is: where is the bottleneck? If the bottleneck is CPU-bound computation (e.g., image processing, encryption), a single beefy machine with 96 cores might handle 100K RPS cheaper and simpler than 50 small instances behind a load balancer. Vertical scaling has zero coordination overhead — no distributed state, no network hops, no split-brain risk.
  • That said, vertical scaling has a ceiling (you can only buy so much CPU/RAM), and it is a single point of failure. For a stateless API service at 100K RPS, horizontal scaling is almost always the right call because: (a) you get fault tolerance for free — losing one of 20 instances means 5% capacity loss, not total outage, (b) you can scale incrementally by adding instances rather than migrating to a bigger machine with downtime, and (c) modern orchestrators like Kubernetes make horizontal scaling operationally simple for stateless services.
  • The nuance is that most systems are a mix. Stateless application servers scale horizontally, but the database underneath is typically scaled vertically first (bigger instance, more RAM for buffer pool, faster disks), then with read replicas, and sharding is the last resort. I have seen teams jump to microservices and sharding at 1,000 RPS when a single Postgres on a db.r6g.4xlarge would have handled 20,000 QPS easily.
  • At 100K RPS specifically, my architecture would be: horizontal stateless app servers behind an ALB, a vertically-scaled primary database with read replicas, Redis for hot-path caching, and a CDN for static content. I would only introduce sharding or message queues if profiling shows a specific bottleneck that these patterns solve.
Red flag answer: “Always scale horizontally because it is more scalable.” This ignores that horizontal scaling introduces distributed systems complexity (consistency, coordination, network partitions) and that for many workloads, a single well-tuned machine is simpler, cheaper, and faster.Follow-ups:
  1. At what point would you introduce database sharding into this architecture, and what signals would trigger that decision?
  2. How does your scaling strategy change if the system is stateful — for example, it maintains WebSocket connections with clients?
Strong answer:
  • A stateless service stores no per-request or per-session data locally — every request contains everything the service needs to process it, and any instance can handle any request. This makes scaling trivial: spin up more instances, put a load balancer in front, done. If an instance dies, requests just route to another one with no data loss.
  • A stateful service keeps data in local memory that is tied to a specific client or session — a WebSocket connection, an in-memory game state, a local cache of user preferences. You cannot just kill and replace these instances because the state is lost. You need sticky sessions, graceful draining on shutdown, and state replication or checkpointing for fault tolerance.
  • The advice “make services stateless” is correct for the common case — most request/response API services have no reason to hold state locally. Externalize session data to Redis, tokens to JWTs, and you are done. But the advice is wrong when the latency cost of externalizing state is unacceptable. Real-time gaming servers that need sub-millisecond access to game state cannot round-trip to Redis on every frame. WebSocket servers inherently hold connection state. Stream processing workers that maintain windowed aggregations perform orders of magnitude better with local state (this is exactly why Kafka Streams uses local RocksDB stores).
  • The key insight is: the goal is not “no state” — it is “state that can be recovered.” Design stateful services so that if the instance dies, the state can be rebuilt from a durable source (Kafka changelog, database snapshot, peer replication). Kubernetes StatefulSets with persistent volumes exist precisely for this pattern.
Red flag answer: “Stateful services are always bad and you should externalize all state to a database or cache.” This is dogmatic and ignores legitimate use cases where local state is a performance requirement, not a design flaw.Follow-ups:
  1. You have a WebSocket service holding 100K concurrent connections. How do you handle deployments and instance failures without dropping all those connections?
  2. How does Kafka Streams handle stateful processing with fault tolerance? What role does the changelog topic play?
Strong answer:
  • Cache-aside (also called lazy loading) is the simplest: the application checks the cache first, and on a miss, loads from the database and populates the cache. On writes, you invalidate the cache and write to the DB. The cache only contains data that has actually been requested, so there is no wasted memory on unaccessed data. The downside is that every cache miss incurs the full database latency, and there is a window between “write to DB” and “invalidate cache” where another request can read stale data and re-populate the cache with it.
  • Write-through writes to both the cache and the database synchronously on every write. This means the cache is always consistent with the DB — no stale data window. The cost is higher write latency (you are doing two writes on the critical path) and the cache gets populated with data that may never be read. This works well for read-heavy workloads where you want strong consistency and can tolerate slightly slower writes — think user profile data that is written once but read thousands of times.
  • Write-behind (write-back) writes to the cache immediately and asynchronously flushes to the database in the background, often batching multiple writes together. This gives you the lowest write latency and the best write throughput. The trade-off is durability risk: if the cache node dies before flushing, those writes are lost. This is appropriate for data where speed matters more than durability — think analytics counters, view counts, or rate limiter state. DynamoDB Accelerator (DAX) uses this pattern.
  • In my experience, cache-aside covers 80% of use cases. I reach for write-through when I need strong consistency without complex invalidation logic, and write-behind only for high-throughput writes where losing a few seconds of data is acceptable.
Red flag answer: “I always use cache-aside because it is the simplest.” Not evaluating trade-offs or recognizing that different data has different consistency and durability requirements is a sign of shallow understanding.Follow-ups:
  1. With cache-aside, there is a race condition: Thread A gets a cache miss, Thread B updates the DB and invalidates the cache, then Thread A writes the stale value back into the cache. How do you prevent this?
  2. You are implementing write-behind caching for a leaderboard system. The cache node crashes and you lose 30 seconds of writes. What is your recovery strategy?
Strong answer:
  • The thundering herd (also called cache stampede) happens when a popular cache key expires and hundreds or thousands of concurrent requests all miss the cache simultaneously, all hit the database with the same expensive query, and the database gets crushed. If you have a product page cached for 5 minutes and 1,000 users per second are viewing it, the moment that cache key expires, 1,000 requests all try to regenerate it at the same time.
  • The most effective prevention is cache locking (or request coalescing). When a cache miss occurs, the first request acquires a lock (in Redis: SET key:lock NX EX 5), loads the data, and populates the cache. All other concurrent requests either wait for the lock to release and then read from cache, or return a slightly stale version if you keep the old value around. This turns 1,000 database queries into 1.
  • Another approach is proactive refresh: instead of letting the cache expire passively, have a background job refresh the cache before the TTL expires. If the TTL is 5 minutes, refresh at 4 minutes. The cache never actually goes empty. This works well for predictably hot keys but adds complexity and is wasteful for keys that may not be accessed again.
  • A third technique is staggered TTLs with jitter: instead of setting all cache keys to exactly 300 seconds, set them to 300 + random(0, 60). This prevents mass simultaneous expiration. Netflix uses this extensively — they call it “jittered expiry.” It does not solve the thundering herd on a single hot key, but it prevents a system-wide cache avalanche where many keys expire at the same wall-clock time (e.g., all set at server startup with identical TTLs).
Red flag answer: “Just set a longer TTL so the cache does not expire as often.” This does not solve the fundamental problem — when it does expire, the herd still thunders. It also means stale data persists longer.Follow-ups:
  1. How would you implement cache locking in a distributed system with multiple application servers? What happens if the lock holder crashes before populating the cache?
  2. Your cache layer is Redis, and you see Redis CPU spike to 100% periodically every 5 minutes. You suspect thundering herd. Walk me through your investigation and fix.
Strong answer:
  • Load shedding means intentionally dropping a fraction of requests when the system is overloaded, so the remaining requests can be served with acceptable latency. Without load shedding, an overloaded system tries to serve everything, latency degrades for all requests, timeouts cascade, retries pile up, and the system collapses entirely. It is better to serve 70% of requests successfully than to serve 0% because the system fell over.
  • Backpressure is the related concept of propagating overload signals upstream so that producers slow down. Instead of dropping requests silently, you tell the caller “I am overloaded, slow down” via HTTP 429 (rate limit) or 503 (service unavailable) with a Retry-After header. TCP itself implements backpressure through flow control — when the receiver’s buffer fills, the sender slows down.
  • The decision of what to drop should be priority-based, not random. Health check endpoints are never shed (Priority CRITICAL) — if you shed health checks, your orchestrator thinks the instance is dead and kills it, making things worse. Paid-user requests get shed last. Background analytics, prefetch requests, and speculative queries get shed first. The key is that priority must be assigned at the edge (the load balancer or API gateway) before the request consumes any resources.
  • I have seen systems use adaptive load shedding where the shed rate adjusts based on real-time latency measurements. When p99 latency exceeds the target, you increase the shed percentage for low-priority traffic. When latency recovers, you reduce shedding. The implementation uses a sliding window of recent latencies and adjusts every 1-2 seconds. This is far better than a fixed threshold because it adapts to changing capacity (e.g., a background job using CPU, a dependent service being slow).
Red flag answer: “If the system is overloaded, just add more servers.” Auto-scaling takes minutes; load shedding acts in milliseconds. You need both: shedding to survive the spike now, and scaling to handle sustained load increases.Follow-ups:
  1. How do you prevent load shedding from causing retry storms? If you return 503 to clients and they all retry immediately, you have made the problem worse.
  2. Describe how you would implement priority-based load shedding in a microservices architecture where a single user request fans out to 5 downstream services.
Strong answer:
  • True exactly-once processing is impossible in a distributed system (per the Two Generals’ Problem), but you can achieve effectively-exactly-once by combining at-least-once delivery with idempotent processing. The queue guarantees it will deliver the message at least once (retrying on failure), and the consumer guarantees that processing the same message twice produces the same result as processing it once.
  • The implementation has three pillars: (1) Persistent task storage — tasks are written to a durable queue (Kafka, SQS, RabbitMQ with disk persistence) so they survive broker restarts. (2) Visibility timeout / acknowledgment — when a consumer picks up a task, it becomes invisible to other consumers for a timeout period. If the consumer processes it and ACKs, the task is removed. If the consumer crashes, the timeout expires, and the task becomes visible again for another consumer. (3) Idempotency on the consumer — before processing, check a deduplication store (e.g., a processed_tasks table) with the task ID. If already processed, ACK and skip. Process, then atomically mark as processed and ACK.
  • The critical detail most people miss is the atomicity between “do the work” and “mark as done.” If you process the task, then your app crashes before marking it as done, the task will be redelivered and processed again. The solution is to make the side effect and the completion marker part of the same transaction: write the result and the task ID to the database in one transaction, then ACK the queue message. If the ACK fails, the task is redelivered, but the idempotency check catches it.
  • For dead letter queues (DLQ): after N retries with exponential backoff (e.g., 1s, 2s, 4s, 8s up to 3 retries), move the task to a DLQ. The DLQ is not a black hole — you need monitoring, alerting, and tooling to inspect and replay DLQ messages. I have seen teams create a DLQ and then never look at it, which is the same as silently dropping messages.
Red flag answer: “Kafka provides exactly-once semantics out of the box so you do not need to worry about it.” Kafka’s exactly-once is limited to Kafka-to-Kafka processing within its transactions API. The moment you have side effects outside Kafka (database writes, API calls), you need application-level idempotency.Follow-ups:
  1. How would you handle a “poison pill” message that causes the consumer to crash every time it is processed? It would cycle between the main queue and retries forever.
  2. Your task queue processes 10,000 tasks/second. The idempotency check hits the database on every task. How would you optimize this without losing the exactly-once guarantee?
Strong answer:
  • Read replicas are full copies of the database that serve read queries. They contain all the data, support arbitrary SQL queries (including ad-hoc joins and aggregations), and stay consistent with the primary (within replication lag). Caching (e.g., Redis) stores specific precomputed results in memory for ultra-fast retrieval, but only serves exact key lookups — you cannot run a novel SQL query against a cache.
  • I would choose read replicas when: the read queries are diverse and unpredictable (an admin dashboard where users run custom filters), when I need the full power of SQL (joins, aggregations, window functions), or when the data set is too large or too dynamic to cache effectively. Read replicas are also useful for isolating analytical workloads from the production database — run your nightly reports against a replica without impacting OLTP performance.
  • I would choose caching when: the access pattern is highly repetitive (the same product page viewed 10,000 times per minute), when latency requirements are sub-millisecond (cache returns in <1ms vs. 5-50ms for a database query), or when I need to reduce load on the database by orders of magnitude rather than just 2-3x. A cache hit avoids the database entirely, while a read replica still executes the full query — it just does it on a different machine.
  • In practice, you use both together. Read replicas handle the breadth of diverse queries and serve as a warm standby for failover. Caching handles the depth of the hottest access patterns. For a system doing 50K reads/sec, caching the top 1% most-accessed keys might handle 80% of the traffic, and read replicas handle the remaining 20% of diverse queries. The key metric is cache hit ratio — if it is below 80%, your cache is not meaningfully helping and you should investigate whether your access pattern is too diverse for caching.
Red flag answer: “Use caching for everything because it is faster.” This ignores that caching requires you to predefine what to cache, handle invalidation, and accept staleness. It is not a replacement for a database; it is an optimization layer for hot paths.Follow-ups:
  1. You have 3 read replicas and replication lag spikes to 30 seconds during a bulk data import. How do you handle this without sending stale data to users?
  2. Your cache hit ratio is 95% but your database is still under heavy load. What could explain this, and how would you investigate?
Strong answer:
  • Graceful degradation means that when a system is under stress or a dependency fails, it reduces functionality progressively rather than failing entirely. The user gets a worse experience, but they get an experience — the core functionality stays alive while non-critical features are disabled. This is fundamentally different from circuit breaking, which is about protecting a single dependency; graceful degradation is a system-wide strategy.
  • A concrete example: take an e-commerce product page. At Level 0 (normal), the user sees the product, personalized recommendations, reviews, inventory count, estimated delivery, and dynamic pricing. At Level 1 (elevated load), disable personalized recommendations and show generic “popular items” instead — the recommendations service is expensive and non-critical. At Level 2 (high load), disable reviews and show a static cached version of the page. At Level 3 (emergency), serve a fully static product page from CDN with a “Contact us for pricing” button instead of dynamic pricing — the page loads in 50ms and the core purchase flow still works.
  • Implementation requires three things: (1) A health signal — typically p99 latency or error rate, measured via a sliding window. (2) Feature flags tied to degradation levels — when the level changes, features toggle off automatically. (3) Pre-computed fallbacks for each degraded feature — you cannot compute a fallback when you are already overloaded; it needs to be ready (static HTML, cached responses, sensible defaults).
  • The most important lesson from production is to test degradation regularly. Netflix runs Chaos Monkey to randomly kill instances, but the more useful practice is deliberately triggering each degradation level in staging and verifying that the user experience is acceptable. I have seen teams build elaborate degradation systems that were never tested, and when they actually triggered, they had bugs that caused worse behavior than no degradation at all — like serving 500 errors instead of the fallback response.
Red flag answer: “Return an error page when things are overloaded and let the user retry.” This is the opposite of graceful degradation — it is ungraceful failure. The goal is to keep serving useful responses, not to punt the problem to the user.Follow-ups:
  1. How do you decide the order in which features are disabled during degradation? What metrics or criteria drive that prioritization?
  2. Your degradation system relies on a feature flag service, and that service itself goes down. How do you handle degradation of the degradation system?