Distribute traffic across multiple servers for scalability and reliability. Think of a load balancer as a restaurant host: customers (requests) arrive at the door, and the host decides which waiter (server) to assign them to based on who is least busy, who specializes in what section, or simply the next one in rotation. Without the host, all customers would crowd around the first waiter they see, leaving others idle.
Store frequently accessed data in fast storage to reduce latency and database load. Caching is the system design equivalent of keeping your most-used cooking ingredients on the counter instead of fetching them from the pantry every time. It is the single most impactful optimization you can make in most systems — a well-placed cache can reduce database load by 90% or more and drop response times from hundreds of milliseconds to single digits.The fundamental trade-off: you are trading memory (expensive, limited) for speed, and you are accepting the risk that cached data might be stale. Every caching decision is really a question of “how stale can this data be before it causes a problem?”
Write to cache and database simultaneously. Guarantees cache consistency.
Python
JavaScript
class WriteThroughCache: """Write-Through: Update cache and DB atomically""" def __init__(self, cache, db): self.cache = cache self.db = db def update_user(self, user_id: str, data: dict) -> bool: cache_key = f"user:{user_id}" # Use a transaction/pipeline for atomicity try: # 1. Update database first (source of truth) self.db.begin_transaction() self.db.update("UPDATE users SET data = %s WHERE id = %s", (json.dumps(data), user_id)) # 2. Update cache in same transaction self.cache.setex(cache_key, 3600, json.dumps(data)) # 3. Commit only if both succeed self.db.commit() return True except Exception as e: self.db.rollback() self.cache.delete(cache_key) # Ensure consistency raise e def get_user(self, user_id: str) -> dict: cache_key = f"user:{user_id}" # Always check cache first (it's always up-to-date) cached = self.cache.get(cache_key) if cached: return json.loads(cached) # Cache miss (cold start or expired) user = self.db.query("SELECT * FROM users WHERE id = %s", (user_id,)) if user: self.cache.setex(cache_key, 3600, json.dumps(user)) return user
class WriteThroughCache { constructor(cache, db) { this.cache = cache; this.db = db; } async updateUser(userId, data) { const cacheKey = `user:${userId}`; const client = await this.db.connect(); try { await client.query('BEGIN'); // 1. Update database await client.query( 'UPDATE users SET data = $1 WHERE id = $2', [JSON.stringify(data), userId] ); // 2. Update cache await this.cache.setex(cacheKey, 3600, JSON.stringify(data)); // 3. Commit transaction await client.query('COMMIT'); return true; } catch (error) { await client.query('ROLLBACK'); await this.cache.del(cacheKey); // Ensure consistency throw error; } finally { client.release(); } } async getUser(userId) { const cacheKey = `user:${userId}`; // Cache is always consistent, check it first const cached = await this.cache.get(cacheKey); if (cached) { return JSON.parse(cached); } // Populate cache on miss const result = await this.db.query( 'SELECT * FROM users WHERE id = $1', [userId] ); if (result.rows[0]) { await this.cache.setex(cacheKey, 3600, JSON.stringify(result.rows[0])); return result.rows[0]; } return null; }}
Enable asynchronous communication and decouple components. Think of a message queue like a post office: the sender drops off a letter (message) and goes about their day without waiting for the recipient to read it. The post office (queue) holds the letter until the recipient is ready to pick it up. This decoupling means the sender and recipient do not need to be available at the same time, and a surge of incoming mail does not overwhelm the recipient — it just queues up.This is one of the most powerful patterns for handling traffic spikes. Instead of your web server waiting synchronously for a slow operation (sending an email, resizing an image, generating a report), it drops a message on the queue and responds to the user immediately. A pool of background workers processes the queue at their own pace.
Distribute content globally for faster access. The speed of light is a hard physical limit: a round trip from Tokyo to a US-based origin server takes roughly 150ms at minimum, and no amount of code optimization can fix physics. CDNs solve this by placing copies of your content in data centers around the world, so a user in Tokyo hits a server in Tokyo instead. It is like a franchise model for your data — the headquarters (origin) has the master copy, but local branches (edge PoPs) serve most customers directly.
Single entry point for all client requests. In a microservices architecture, an API gateway acts like the front desk of a large hotel — guests (clients) do not wander the hallways looking for housekeeping, room service, or the concierge individually. Instead, they make one call to the front desk, which routes the request to the right department, handles authentication (“Are you a guest here?”), enforces rate limits (“Please do not call us 100 times per minute”), and can even aggregate responses from multiple departments into a single reply.
Copy data across multiple servers for availability and read scaling. Replication is your insurance policy against data loss and your scaling strategy for read-heavy workloads. The core trade-off is between consistency and latency: synchronous replication guarantees every replica has the latest data but slows down every write (the write is only confirmed after all replicas acknowledge it); asynchronous replication keeps writes fast but means replicas might serve slightly stale data for a brief window.
Design Tip: Don’t add components just because they’re common. Each adds complexity. Start simple and add components as specific problems arise. In interviews, this translates to: start with the simplest architecture that could work, then say “As we scale to X users, the bottleneck will be Y, so I would add Z.” This shows the interviewer you understand why each component exists, not just that you memorized a reference architecture.
Scalability Mental Model: When reasoning about which building blocks to introduce, think in orders of magnitude. At 100 QPS, a single server with a database is fine. At 1,000 QPS, you likely need a cache and a load balancer. At 10,000 QPS, you need database replication and possibly a CDN. At 100,000+ QPS, you are looking at sharding, message queues for async processing, and multi-region deployment. Each jump in scale introduces a new bottleneck that the next building block addresses.
You are designing an e-commerce platform. Starting from zero, walk me through which building blocks you would add and at what scale thresholds. Be specific about the numbers.
Strong Answer:I think about this in orders of magnitude, and at each stage the bottleneck shifts to a different component.Stage 1: 0 to 1,000 QPS (launch to ~1M DAU). Single app server, single Postgres database, Nginx as a reverse proxy. This handles a surprising amount of traffic — Postgres comfortably serves 1,000-5,000 simple queries per second. Total cost: maybe $200/month on a decent cloud instance. Do not add complexity prematurely.Stage 2: 1,000 to 10,000 QPS (~1M to 10M DAU). The database becomes the bottleneck. First add a Redis cache in front of Postgres — product catalog, user sessions, cart data. A single Redis instance handles 100K+ reads/sec. Cache hit ratio of 90% means Postgres only sees 100-1,000 QPS. Next, add a load balancer with 3-5 stateless app servers. Add read replicas to Postgres (1 primary, 2 replicas) and route read traffic to replicas. Total cost: ~$2,000-5,000/month.Stage 3: 10,000 to 100,000 QPS (~10M to 100M DAU). Network and bandwidth become factors. Add a CDN for static assets (product images, CSS, JS). This offloads 60-70% of total bandwidth from your servers. Add a message queue (Kafka or SQS) for async processing — order confirmation emails, inventory updates, analytics events. This prevents synchronous processing from blocking the checkout flow. Consider database sharding if your dataset exceeds what a single Postgres instance can hold (usually around 1-5TB with good indexing). Total cost: ~$20,000-50,000/month.Stage 4: 100,000+ QPS (100M+ DAU). Multi-region deployment. API gateway for routing, auth, and rate limiting. Shard the database by user_id or region. Multiple Kafka clusters. A caching tier with both local (in-process) and distributed (Redis cluster) layers. This is where you also need a dedicated search service (Elasticsearch) because Postgres full-text search cannot keep up. Total cost: $200,000+/month.The key insight: at each stage, you are solving the current bottleneck, not preemptively adding every component.Follow-up: You are at Stage 2, and your Redis cache goes down. What happens, and how do you prevent a cascading failure?Without protection, all 10,000 QPS suddenly hit Postgres, which can handle maybe 3,000 QPS. Response times spike, connection pool exhausts, app servers start timing out, and the load balancer marks them unhealthy. Total outage in 30-60 seconds. Prevention: First, implement circuit breakers — when cache miss rate exceeds 50%, start returning stale cached data from a local in-memory fallback (even 30-second-old data is better than a timeout). Second, use Redis Sentinel or Redis Cluster for automatic failover — a replica promotes to primary within seconds. Third, rate-limit the database connection pool so Postgres degrades gracefully (slower responses) instead of crashing (connection refused). Fourth, pre-warm the cache on recovery — do not rely on lazy loading because the “thundering herd” of 10K cache misses all hitting Postgres simultaneously will just crash it again.
Your team is debating whether to use a message queue or direct HTTP calls between your order service and email/inventory/analytics services. The order service processes 5,000 orders per minute. Make the case for one approach.
Strong Answer:For this use case, the message queue is clearly the right choice, and the numbers make it obvious.With direct HTTP calls: The order service makes 3 synchronous calls per order (email, inventory, analytics). At 5,000 orders/min = 83 orders/sec, that is 249 HTTP calls/sec. If each downstream service takes 200ms to respond, each order takes 600ms for the downstream calls alone. But here is the real problem: if the email service goes down for 2 minutes, 10,000 orders fail or time out. The checkout flow is now coupled to whether the email service is healthy — and your customer does not care about email, they care about completing their purchase.With a message queue: The order service writes 3 messages per order to the queue (83 events/sec * 3 = 249 messages/sec — trivial for Kafka or even SQS). The order service responds to the customer in ~50ms (just the database write + queue publish). Each downstream service consumes at its own pace. If email is down for 2 minutes, the 10,000 email messages sit in the queue and are processed when it comes back. Zero orders fail.The numbers that matter:
Queue latency: Kafka publish takes ~2ms. SQS takes ~10ms. Either is negligible compared to 200ms HTTP calls.
Throughput headroom: Kafka handles 10K+ messages/sec per partition. You are at 249/sec. You have 40x headroom before you even need to think about scaling.
Storage for buffering: 5,000 orders/min * 1KB/message * 3 topics * 60 min of buffer = 900 MB. Essentially free.
When I would NOT use a queue: If the downstream response is needed to complete the order (e.g., a real-time fraud check that must return “approved” before the order can proceed). That is inherently synchronous and should remain an HTTP call — but with a circuit breaker and a fallback policy (auto-approve if fraud service is down, with post-hoc review).Follow-up: You use Kafka. Six months later, the analytics consumer falls behind by 2 hours due to a processing bug. Orders keep flowing. What are the implications and how do you handle it?This is one of the beautiful properties of Kafka: the order service and email/inventory consumers are completely unaffected. The analytics consumer has its own consumer group offset, so its lag is isolated. The 2 hours of messages (5,000/min * 120 min * 1KB = ~600 MB) are sitting safely in the Kafka topic’s retention window (typically 7 days). Fix the analytics bug, redeploy, and the consumer catches up by processing the backlog at full speed. Monitor consumer lag as a key metric — alert when any consumer group falls behind by more than 15 minutes. The architectural insight: this failure mode is exactly why queues exist. With direct HTTP calls, those analytics events would simply be lost.
Estimate the cache sizing for a social media platform with 200M DAU where each user views their feed 5 times per day. Each feed response is 50KB. What is the right cache size, and what cache hit ratio should you target?
Strong Answer:Let me work through this systematically.Total request volume: 200M DAU * 5 feed views = 1 billion feed requests/day. QPS: 1B / 86,400 = ~11,500 QPS. Peak (3x): ~35,000 QPS.Unique feeds: Not every request is unique. 200M users each have a unique feed, but the feed changes slowly (maybe 10-50 new posts per day for an average user). If we cache feeds with a 5-minute TTL, a user who refreshes 5 times in a day will get a cache hit on most of those refreshes.Cache sizing using the 80/20 rule: 20% of users generate 80% of feed views (power users). 200M * 20% = 40M feeds to cache. At 50KB each: 40M * 50KB = 2 TB. That is a lot. But we can be smarter.Refined approach: Cache the computed feed (list of post IDs + metadata) separately from the post content. The feed index is ~2KB (200 post IDs with timestamps). Post content is cached separately and shared across feeds. 40M feed indexes * 2KB = 80 GB — fits comfortably in a Redis cluster. The actual post content: if 10M unique posts are “hot” at any time, at 5KB each = 50 GB. Total: ~130 GB for the hot set.Target cache hit ratio: For feed reads, target 95%+. At 11,500 QPS with 95% cache hits, only 575 QPS hit the database — well within Postgres capacity. At 80% cache hits, 2,300 QPS hit the database, which is manageable but means 4x more database load.Cost math: A Redis cluster with 130 GB of memory = roughly 5 r6g.2xlarge instances (64 GB RAM each, with overhead) at ~0.50/hreach=1,800/month. The alternative — serving 11,500 QPS from the database directly — would require multiple read replicas at roughly $5,000-10,000/month. The cache pays for itself 3-5x over.Follow-up: A celebrity with 50M followers posts something. 10% of their followers check their feed in the next 5 minutes. What happens to your cache?5 million feed cache entries need to be updated or invalidated within 5 minutes. If you use a “fan-out on write” approach (pre-computing feeds), that is 5M cache writes in 300 seconds = 16,700 cache writes/sec — achievable for a Redis cluster but a significant spike. The better approach for celebrities: use “fan-out on read.” Do not pre-compute feeds for celebrity followers. Instead, when a user reads their feed, merge their pre-computed feed (from normal users they follow) with a real-time lookup of recent posts from celebrities they follow. Cache the celebrity’s recent posts once (not per-follower), and the 5M users each read from that single cached entry. This converts 5M cache writes into 1 cache write + 5M cache reads, which is far more efficient.