Horizontal vs Vertical: The Real Trade-offs
Stateless vs Stateful Services
Making Services Stateless
When Stateful is Okay
Caching at Scale
Multi-Level Caching
Multi-Level Cache Implementation
Cache Consistency Strategies
Database Scaling Patterns
Read Replicas Pattern
Sharding Strategies Deep Dive
Cross-Shard Operations
Async Processing Patterns
Task Queue Architecture
Event-Driven Architecture
Load Shedding & Backpressure
Graceful Degradation
Senior Interview Questions
How do you scale a write-heavy system?
- Identify the bottleneck: Is it DB? Network? Application?
- Batching: Combine multiple writes into one
- Async writes: Write to queue, persist later
- Sharding: Distribute writes across nodes
- LSM-tree databases: Cassandra, RocksDB (optimized for writes)
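The batching idea above can be sketched as a small buffer that groups individual writes before handing them to storage. This is a minimal illustration only: `WriteBatcher`, `flush_fn`, and `max_batch` are hypothetical names, and a production version would also flush on a timer and persist the buffer so queued writes survive a crash.

```python
from typing import Callable, List


class WriteBatcher:
    """Buffers individual writes and hands them to flush_fn in groups."""

    def __init__(self, flush_fn: Callable[[List[dict]], None], max_batch: int = 100):
        self.flush_fn = flush_fn      # hypothetical sink, e.g. a bulk INSERT
        self.max_batch = max_batch
        self.buffer: List[dict] = []

    def write(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.flush_fn(batch)          # one round trip instead of len(batch)


batches: List[List[dict]] = []        # stand-in for the database
batcher = WriteBatcher(batches.append, max_batch=3)
for i in range(7):
    batcher.write({"id": i})
batcher.flush()                       # flush the remainder
print([len(b) for b in batches])      # [3, 3, 1]
```

Seven logical writes become three storage calls here; the trade-off is that buffered records are lost if the process dies before `flush` runs.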
What's your approach to capacity planning?
- Current baseline: Measure current QPS, latency, resource usage
- Growth projection: Expected traffic increase (e.g., 2x in 6 months)
- Headroom: Plan for 3x current load (for spikes)
- Load testing: Verify system handles projected load
- Monitoring: Track capacity metrics, alert at 70% utilization
Key metrics to track:
- CPU utilization by service
- Memory usage and GC pressure
- Database connections and query latency
- Queue depth and processing rate
- Network bandwidth
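The planning steps above reduce to simple arithmetic once a baseline is measured. The sketch below assumes a hypothetical baseline of 10,000 peak QPS and 1,000 QPS per instance; the 2x growth factor and the 70% utilization ceiling come from the list above. `instances_needed` is an illustrative helper, not a standard formula.

```python
import math


def instances_needed(peak_qps: float, per_instance_qps: float,
                     growth: float = 2.0, target_utilization: float = 0.7) -> int:
    """Instances required so projected peak load stays under the utilization ceiling."""
    projected_qps = peak_qps * growth
    usable_qps_per_instance = per_instance_qps * target_utilization
    return math.ceil(projected_qps / usable_qps_per_instance)


# Hypothetical baseline: 10,000 peak QPS, each instance load-tested at 1,000 QPS.
print(instances_needed(10_000, 1_000))              # 29
# Planning for 3x headroom instead of 2x growth:
print(instances_needed(10_000, 1_000, growth=3.0))  # 43
```

The per-instance figure should come from load testing rather than a guess, since it dominates the result.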
How do you handle database migrations at scale?
- Dual-write: Write to both old and new schema
- Backfill: Migrate historical data in batches
- Shadow read: Read from new, compare with old
- Cutover: Switch reads to new schema
- Cleanup: Remove dual-write, drop old schema
Key principles:
- Never lock tables in production
- Migrations must be reversible
- Test with production-sized data
- Have a rollback plan
- Do it during low-traffic periods
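The dual-write, backfill, and shadow-read steps above can be sketched against an in-memory SQLite database. The table names, the name-splitting rule, and the helper functions are all hypothetical, and the sketch assumes every name contains exactly one space; a real migration would also throttle the backfill and run the comparison on sampled live reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")


def split_name(full_name):
    first, _, last = full_name.partition(" ")
    return first, last


def dual_write(user_id, full_name):
    """Dual-write: each new write lands in both schemas in one transaction."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO users_old VALUES (?, ?)", (user_id, full_name))
        conn.execute("INSERT OR REPLACE INTO users_new VALUES (?, ?, ?)",
                     (user_id, *split_name(full_name)))


def backfill(batch_size=2):
    """Backfill: copy historical rows in small keyed batches, skipping rows already present."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, full_name FROM users_old WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            return
        with conn:  # one short transaction per batch
            for user_id, full_name in rows:
                conn.execute("INSERT OR IGNORE INTO users_new VALUES (?, ?, ?)",
                             (user_id, *split_name(full_name)))
        last_id = rows[-1][0]


def shadow_read_mismatches():
    """Shadow read: ids whose new-schema row is missing or disagrees with the old one."""
    rows = conn.execute(
        "SELECT o.id FROM users_old o LEFT JOIN users_new n ON n.id = o.id "
        "WHERE n.id IS NULL OR o.full_name != (n.first_name || ' ' || n.last_name) "
        "ORDER BY o.id").fetchall()
    return [r[0] for r in rows]


with conn:  # historical rows written before dual-write was enabled
    conn.executemany("INSERT INTO users_old VALUES (?, ?)",
                     [(1, "Ada Lovelace"), (2, "Alan Turing"), (3, "Grace Hopper")])
dual_write(4, "Edsger Dijkstra")
print(shadow_read_mismatches())  # [1, 2, 3]
backfill()
print(shadow_read_mismatches())  # []
```

Cutover is safe to consider only once the mismatch list stays empty under live traffic.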
How do you debug a system serving stale data?
- Identify scope: All users? Some? Specific data?
- Check cache layers: CDN, app cache, Redis, DB cache
- Verify TTLs: Are caches expiring correctly?
- Check replication: Is replica lagging?
- Trace the write: Did write actually succeed?
Common causes:
- Cache not being invalidated on write
- Reading from stale replica
- CDN caching dynamic content
- Race condition between cache invalidation and read
Interview Questions
You are designing a system that handles 100K requests per second. Walk me through your decision process for horizontal versus vertical scaling.
- The first thing I would do is resist the urge to immediately say “scale horizontally.” The real question is: where is the bottleneck? If the bottleneck is CPU-bound computation (e.g., image processing, encryption), a single beefy machine with 96 cores might handle 100K RPS cheaper and simpler than 50 small instances behind a load balancer. Vertical scaling has zero coordination overhead — no distributed state, no network hops, no split-brain risk.
- That said, vertical scaling has a ceiling (you can only buy so much CPU/RAM), and it is a single point of failure. For a stateless API service at 100K RPS, horizontal scaling is almost always the right call because: (a) you get fault tolerance for free — losing one of 20 instances means 5% capacity loss, not total outage, (b) you can scale incrementally by adding instances rather than migrating to a bigger machine with downtime, and (c) modern orchestrators like Kubernetes make horizontal scaling operationally simple for stateless services.
- The nuance is that most systems are a mix. Stateless application servers scale horizontally, but the database underneath is typically scaled vertically first (bigger instance, more RAM for buffer pool, faster disks), then with read replicas, and sharding is the last resort. I have seen teams jump to microservices and sharding at 1,000 RPS when a single Postgres on a db.r6g.4xlarge would have handled 20,000 QPS easily.
- At 100K RPS specifically, my architecture would be: horizontal stateless app servers behind an ALB, a vertically-scaled primary database with read replicas, Redis for hot-path caching, and a CDN for static content. I would only introduce sharding or message queues if profiling shows a specific bottleneck that these patterns solve.
- At what point would you introduce database sharding into this architecture, and what signals would trigger that decision?
- How does your scaling strategy change if the system is stateful — for example, it maintains WebSocket connections with clients?
Explain the difference between stateless and stateful services. Why does everyone say 'make services stateless,' and when is that advice wrong?
- A stateless service stores no per-request or per-session data locally — every request contains everything the service needs to process it, and any instance can handle any request. This makes scaling trivial: spin up more instances, put a load balancer in front, done. If an instance dies, requests just route to another one with no data loss.
- A stateful service keeps data in local memory that is tied to a specific client or session — a WebSocket connection, an in-memory game state, a local cache of user preferences. You cannot just kill and replace these instances because the state is lost. You need sticky sessions, graceful draining on shutdown, and state replication or checkpointing for fault tolerance.
- The advice “make services stateless” is correct for the common case — most request/response API services have no reason to hold state locally. Externalize session data to Redis, tokens to JWTs, and you are done. But the advice is wrong when the latency cost of externalizing state is unacceptable. Real-time gaming servers that need sub-millisecond access to game state cannot round-trip to Redis on every frame. WebSocket servers inherently hold connection state. Stream processing workers that maintain windowed aggregations perform orders of magnitude better with local state (this is exactly why Kafka Streams uses local RocksDB stores).
- The key insight is: the goal is not “no state” — it is “state that can be recovered.” Design stateful services so that if the instance dies, the state can be rebuilt from a durable source (Kafka changelog, database snapshot, peer replication). Kubernetes StatefulSets with persistent volumes exist precisely for this pattern.
- You have a WebSocket service holding 100K concurrent connections. How do you handle deployments and instance failures without dropping all those connections?
- How does Kafka Streams handle stateful processing with fault tolerance? What role does the changelog topic play?
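The "state that can be recovered" idea above can be illustrated with a worker that keeps counts in local memory but appends every update to a durable log first. This is a toy sketch, not how Kafka Streams is implemented: `CountingWorker` is a hypothetical class and the Python list stands in for a durable changelog.

```python
from typing import Dict, List, Tuple


class CountingWorker:
    """Holds counts in local memory; every update is logged so state can be rebuilt."""

    def __init__(self, changelog: List[Tuple[str, int]]):
        self.changelog = changelog          # stand-in for a durable log
        self.counts: Dict[str, int] = {}
        for key, delta in changelog:        # replay on startup to recover state
            self._apply(key, delta)

    def _apply(self, key: str, delta: int) -> None:
        self.counts[key] = self.counts.get(key, 0) + delta

    def record(self, key: str, delta: int = 1) -> None:
        self.changelog.append((key, delta))  # log first, then mutate local state
        self._apply(key, delta)


durable_log: List[Tuple[str, int]] = []
worker = CountingWorker(durable_log)
for page in ["home", "cart", "home"]:
    worker.record(page)

# The instance "dies"; a replacement rebuilds identical state from the log.
replacement = CountingWorker(durable_log)
print(replacement.counts)  # {'home': 2, 'cart': 1}
```

Reads stay local and fast, while recovery time grows with log length, which is why real systems pair the log with periodic snapshots.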
Compare cache-aside, write-through, and write-behind caching strategies. When would you pick each one?
- Cache-aside (also called lazy loading) is the simplest: the application checks the cache first, and on a miss, loads from the database and populates the cache. On writes, you write to the DB and then invalidate the cache entry. The cache only contains data that has actually been requested, so no memory is wasted on unaccessed data. The downside is that every cache miss incurs the full database latency, and there is a window between “write to DB” and “invalidate cache” where another request can read stale data and re-populate the cache with it.
- Write-through writes to both the cache and the database synchronously on every write. This means the cache is always consistent with the DB — no stale data window. The cost is higher write latency (you are doing two writes on the critical path) and the cache gets populated with data that may never be read. This works well for read-heavy workloads where you want strong consistency and can tolerate slightly slower writes — think user profile data that is written once but read thousands of times.
- Write-behind (write-back) writes to the cache immediately and asynchronously flushes to the database in the background, often batching multiple writes together. This gives you the lowest write latency and the best write throughput. The trade-off is durability risk: if the cache node dies before flushing, those writes are lost. This is appropriate for data where speed matters more than durability, such as analytics counters, view counts, or rate limiter state. Note that DynamoDB Accelerator (DAX) is not an example of this pattern: it is a write-through cache.
- In my experience, cache-aside covers 80% of use cases. I reach for write-through when I need strong consistency without complex invalidation logic, and write-behind only for high-throughput writes where losing a few seconds of data is acceptable.
- With cache-aside, there is a race condition: Thread A gets a cache miss, Thread B updates the DB and invalidates the cache, then Thread A writes the stale value back into the cache. How do you prevent this?
- You are implementing write-behind caching for a leaderboard system. The cache node crashes and you lose 30 seconds of writes. What is your recovery strategy?
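The cache-aside read and write paths described above look roughly like this. It is a single-threaded sketch with plain dictionaries standing in for Redis and the database; `CacheAside` and `db_reads` are illustrative names, and the sketch does not address the race condition raised in the follow-up question.

```python
from typing import Dict, Optional


class CacheAside:
    """Check the cache first, load from the DB on a miss, invalidate on write."""

    def __init__(self, db: Dict[str, str]):
        self.db = db                      # stand-in for the database
        self.cache: Dict[str, str] = {}   # stand-in for Redis
        self.db_reads = 0

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:             # cache hit: no database work
            return self.cache[key]
        self.db_reads += 1                # cache miss: pay full DB latency
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value       # populate lazily
        return value

    def put(self, key: str, value: str) -> None:
        self.db[key] = value              # write to the DB first...
        self.cache.pop(key, None)         # ...then invalidate the cached copy


store = CacheAside({"user:1": "Ada"})
store.get("user:1")
store.get("user:1")
print(store.db_reads)        # 1
store.put("user:1", "Grace")
print(store.get("user:1"))   # Grace
print(store.db_reads)        # 2
```

A write-through variant would replace the `pop` in `put` with `self.cache[key] = value`, trading the extra miss for a second write on the critical path.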
What is the thundering herd problem in caching, and how do you prevent it?
- The thundering herd (also called cache stampede) happens when a popular cache key expires and hundreds or thousands of concurrent requests all miss the cache simultaneously, all hit the database with the same expensive query, and the database gets crushed. If you have a product page cached for 5 minutes and 1,000 users per second are viewing it, the moment that cache key expires, 1,000 requests all try to regenerate it at the same time.
- The most effective prevention is cache locking (or request coalescing). When a cache miss occurs, the first request acquires a lock (in Redis: `SET key:lock <token> NX EX 5`, where the token is a unique value identifying the lock holder), loads the data, and populates the cache. All other concurrent requests either wait for the lock to release and then read from cache, or return a slightly stale version if you keep the old value around. This turns 1,000 database queries into 1.
- Another approach is proactive refresh: instead of letting the cache expire passively, have a background job refresh the cache before the TTL expires. If the TTL is 5 minutes, refresh at 4 minutes. The cache never actually goes empty. This works well for predictably hot keys but adds complexity and is wasteful for keys that may not be accessed again.
- A third technique is staggered TTLs with jitter: instead of setting all cache keys to exactly 300 seconds, set them to `300 + random(0, 60)`. This prevents mass simultaneous expiration. Netflix uses this extensively; they call it “jittered expiry.” It does not solve the thundering herd on a single hot key, but it prevents a system-wide cache avalanche where many keys expire at the same wall-clock time (e.g., all set at server startup with identical TTLs).
- How would you implement cache locking in a distributed system with multiple application servers? What happens if the lock holder crashes before populating the cache?
- Your cache layer is Redis, and you see Redis CPU spike to 100% periodically every 5 minutes. You suspect thundering herd. Walk me through your investigation and fix.
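Request coalescing and jittered TTLs, as described above, can be combined in one small in-process cache. This sketch uses per-key `threading.Lock` objects in place of a distributed Redis lock, so it only coalesces requests inside a single process; `CoalescingCache` and `slow_loader` are hypothetical names.

```python
import random
import threading
import time


class CoalescingCache:
    """On a miss, one caller per key runs the loader; the rest wait and reuse its result."""

    def __init__(self, loader, base_ttl=300.0, jitter=60.0):
        self.loader = loader
        self.base_ttl = base_ttl
        self.jitter = jitter
        self._data = {}                   # key -> (value, expires_at)
        self._locks = {}                  # key -> per-key lock
        self._guard = threading.Lock()    # protects the _locks dict

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            entry = self._fresh(key)      # re-check: another thread may have loaded it
            if entry is not None:
                return entry[0]
            value = self.loader(key)
            ttl = self.base_ttl + random.uniform(0, self.jitter)  # staggered expiry
            self._data[key] = (value, time.monotonic() + ttl)
            return value


calls = []

def slow_loader(key):
    calls.append(key)                     # runs while holding the per-key lock
    time.sleep(0.05)                      # simulate an expensive query
    return f"page for {key}"

cache = CoalescingCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("product:42",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1
```

Fifty concurrent misses produce a single loader call. Across multiple application servers the same double-check structure applies, but the lock has to live in shared storage and carry an expiry so a crashed holder cannot block everyone.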
Explain load shedding and backpressure. How do you decide which requests to drop when the system is overloaded?
- Load shedding means intentionally dropping a fraction of requests when the system is overloaded, so the remaining requests can be served with acceptable latency. Without load shedding, an overloaded system tries to serve everything, latency degrades for all requests, timeouts cascade, retries pile up, and the system collapses entirely. It is better to serve 70% of requests successfully than to serve 0% because the system fell over.
- Backpressure is the related concept of propagating overload signals upstream so that producers slow down. Instead of dropping requests silently, you tell the caller “I am overloaded, slow down” via HTTP 429 (Too Many Requests) or 503 (Service Unavailable) with a `Retry-After` header. TCP itself implements backpressure through flow control: when the receiver’s buffer fills, the sender slows down.
- The decision of what to drop should be priority-based, not random. Health check endpoints are never shed (Priority CRITICAL): if you shed health checks, your orchestrator thinks the instance is dead and kills it, making things worse. Paid-user requests get shed last. Background analytics, prefetch requests, and speculative queries get shed first. The key is that priority must be assigned at the edge (the load balancer or API gateway) before the request consumes any resources.
- I have seen systems use adaptive load shedding where the shed rate adjusts based on real-time latency measurements. When p99 latency exceeds the target, you increase the shed percentage for low-priority traffic. When latency recovers, you reduce shedding. The implementation uses a sliding window of recent latencies and adjusts every 1-2 seconds. This is far better than a fixed threshold because it adapts to changing capacity (e.g., a background job using CPU, a dependent service being slow).
- How do you prevent load shedding from causing retry storms? If you return 503 to clients and they all retry immediately, you have made the problem worse.
- Describe how you would implement priority-based load shedding in a microservices architecture where a single user request fans out to 5 downstream services.
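Priority-based adaptive shedding, as described above, can be sketched with a sliding window of latencies and a threshold that moves one priority tier at a time. This simplifies the text's "shed percentage" to whole tiers; the `Priority` values, the 200 ms target, and `AdaptiveShedder` are assumptions made for illustration.

```python
from collections import deque
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0   # health checks: never shed
    HIGH = 1       # paid users: shed last
    NORMAL = 2
    LOW = 3        # analytics, prefetch: shed first


NO_SHEDDING = Priority.LOW + 1


class AdaptiveShedder:
    def __init__(self, target_p99_ms: float = 200.0, window: int = 100):
        self.target_p99_ms = target_p99_ms
        self.latencies = deque(maxlen=window)   # sliding window of recent latencies
        self.shed_from = NO_SHEDDING            # requests with priority >= this are dropped

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def p99(self) -> float:
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

    def adjust(self) -> None:
        """Meant to be called every 1-2 seconds: shed one more tier, or restore one."""
        if self.p99() > self.target_p99_ms:
            self.shed_from = max(Priority.HIGH, self.shed_from - 1)
        else:
            self.shed_from = min(NO_SHEDDING, self.shed_from + 1)

    def admit(self, priority: Priority) -> bool:
        return priority < self.shed_from


shedder = AdaptiveShedder()
for _ in range(100):
    shedder.record(500.0)                 # sustained overload
shedder.adjust()
print(shedder.admit(Priority.LOW), shedder.admit(Priority.NORMAL))  # False True
for _ in range(100):
    shedder.record(50.0)                  # latency recovers
shedder.adjust()
print(shedder.admit(Priority.LOW))        # True
```

Because `shed_from` never drops below `Priority.HIGH`, `CRITICAL` requests are always admitted, matching the rule that health checks are never shed.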
Walk me through how you would design and implement a reliable task queue with exactly-once processing semantics.
- True exactly-once processing is impossible in a distributed system (per the Two Generals’ Problem), but you can achieve effectively-exactly-once by combining at-least-once delivery with idempotent processing. The queue guarantees it will deliver the message at least once (retrying on failure), and the consumer guarantees that processing the same message twice produces the same result as processing it once.
- The implementation has three pillars: (1) Persistent task storage: tasks are written to a durable queue (Kafka, SQS, RabbitMQ with disk persistence) so they survive broker restarts. (2) Visibility timeout / acknowledgment: when a consumer picks up a task, it becomes invisible to other consumers for a timeout period. If the consumer processes it and ACKs, the task is removed. If the consumer crashes, the timeout expires, and the task becomes visible again for another consumer. (3) Idempotency on the consumer: before processing, check a deduplication store (e.g., a `processed_tasks` table) with the task ID. If the task was already processed, ACK and skip. Otherwise, process it, mark it as processed, and then ACK.
- The critical detail most people miss is the atomicity between “do the work” and “mark as done.” If you process the task, then your app crashes before marking it as done, the task will be redelivered and processed again. The solution is to make the side effect and the completion marker part of the same transaction: write the result and the task ID to the database in one transaction, then ACK the queue message. If the ACK fails, the task is redelivered, but the idempotency check catches it.
- For dead letter queues (DLQ): after N retries with exponential backoff (e.g., 3 retries at 1s, 2s, and 4s), move the task to a DLQ. The DLQ is not a black hole: you need monitoring, alerting, and tooling to inspect and replay DLQ messages. I have seen teams create a DLQ and then never look at it, which is the same as silently dropping messages.
- How would you handle a “poison pill” message that causes the consumer to crash every time it is processed? It would cycle between the main queue and retries forever.
- Your task queue processes 10,000 tasks/second. The idempotency check hits the database on every task. How would you optimize this without losing the exactly-once guarantee?
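The "same transaction" rule above can be sketched with SQLite: the completion marker and the side effect commit together, so a redelivered task fails the primary-key insert and is skipped. The `processed_tasks` table name comes from the text; the `balances` table, the task shape, and `handle` are hypothetical, and the queue itself is simulated by calling `handle` twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_tasks (task_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")
conn.commit()


def handle(task: dict) -> bool:
    """Apply the task at most once. Returns True if work ran, False for a duplicate.

    Either way the caller should ACK the queue message afterwards.
    """
    try:
        with conn:  # one transaction: marker and side effect commit or roll back together
            conn.execute("INSERT INTO processed_tasks VALUES (?)", (task["id"],))
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (task["amount"], task["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # task_id already recorded: this is a redelivery


task = {"id": "task-123", "account": "acct-1", "amount": 50}
print(handle(task))  # True
print(handle(task))  # False (simulated redelivery after a lost ACK)
print(conn.execute("SELECT amount FROM balances").fetchone()[0])  # 50
```

This only gives effectively-exactly-once behavior when the side effect lives in the same database as the marker; external effects such as sending an email need their own idempotency key.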
What is the difference between read replicas and caching for scaling reads? When would you choose one over the other?
- Read replicas are full copies of the database that serve read queries. They contain all the data, support arbitrary SQL queries (including ad-hoc joins and aggregations), and stay consistent with the primary (within replication lag). Caching (e.g., Redis) stores specific precomputed results in memory for ultra-fast retrieval, but only serves exact key lookups — you cannot run a novel SQL query against a cache.
- I would choose read replicas when: the read queries are diverse and unpredictable (an admin dashboard where users run custom filters), when I need the full power of SQL (joins, aggregations, window functions), or when the data set is too large or too dynamic to cache effectively. Read replicas are also useful for isolating analytical workloads from the production database — run your nightly reports against a replica without impacting OLTP performance.
- I would choose caching when: the access pattern is highly repetitive (the same product page viewed 10,000 times per minute), when latency requirements are sub-millisecond (cache returns in <1ms vs. 5-50ms for a database query), or when I need to reduce load on the database by orders of magnitude rather than just 2-3x. A cache hit avoids the database entirely, while a read replica still executes the full query; it just does it on a different machine.
- In practice, you use both together. Read replicas handle the breadth of diverse queries and serve as a warm standby for failover. Caching handles the depth of the hottest access patterns. For a system doing 50K reads/sec, caching the top 1% most-accessed keys might handle 80% of the traffic, and read replicas handle the remaining 20% of diverse queries. The key metric is cache hit ratio: if it is below 80%, your cache is not meaningfully helping and you should investigate whether your access pattern is too diverse for caching.
- You have 3 read replicas and replication lag spikes to 30 seconds during a bulk data import. How do you handle this without sending stale data to users?
- Your cache hit ratio is 95% but your database is still under heavy load. What could explain this, and how would you investigate?
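The 50K reads/sec example above works out as follows. The replica count of 3 is an assumption, and `db_reads_per_replica` is an illustrative helper that assumes misses are spread evenly across replicas.

```python
def db_reads_per_replica(total_reads: float, hit_ratio: float, replicas: int = 1) -> float:
    """Reads per second each replica still serves after the cache absorbs its hits."""
    misses = total_reads * (1.0 - hit_ratio)
    return misses / replicas


# 50K reads/sec with an 80% hit ratio, spread over 3 replicas:
print(round(db_reads_per_replica(50_000, 0.80, replicas=3)))  # 3333
# Raising the hit ratio to 95% cuts the remaining database load by 4x:
print(round(db_reads_per_replica(50_000, 0.95, replicas=3)))  # 833
```

The count of misses says nothing about their cost: a small number of expensive uncached queries can still dominate database load even at a high hit ratio.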
How do you design a system for graceful degradation? Give me a concrete example of degradation levels.
- Graceful degradation means that when a system is under stress or a dependency fails, it reduces functionality progressively rather than failing entirely. The user gets a worse experience, but they get an experience — the core functionality stays alive while non-critical features are disabled. This is fundamentally different from circuit breaking, which is about protecting a single dependency; graceful degradation is a system-wide strategy.
- A concrete example: take an e-commerce product page. At Level 0 (normal), the user sees the product, personalized recommendations, reviews, inventory count, estimated delivery, and dynamic pricing. At Level 1 (elevated load), disable personalized recommendations and show generic “popular items” instead — the recommendations service is expensive and non-critical. At Level 2 (high load), disable reviews and show a static cached version of the page. At Level 3 (emergency), serve a fully static product page from CDN with a “Contact us for pricing” button instead of dynamic pricing — the page loads in 50ms and the core purchase flow still works.
- Implementation requires three things: (1) A health signal — typically p99 latency or error rate, measured via a sliding window. (2) Feature flags tied to degradation levels — when the level changes, features toggle off automatically. (3) Pre-computed fallbacks for each degraded feature — you cannot compute a fallback when you are already overloaded; it needs to be ready (static HTML, cached responses, sensible defaults).
- The most important lesson from production is to test degradation regularly. Netflix runs Chaos Monkey to randomly kill instances, but the more useful practice is deliberately triggering each degradation level in staging and verifying that the user experience is acceptable. I have seen teams build elaborate degradation systems that were never tested, and when they actually triggered, they had bugs that caused worse behavior than no degradation at all — like serving 500 errors instead of the fallback response.
- How do you decide the order in which features are disabled during degradation? What metrics or criteria drive that prioritization?
- Your degradation system relies on a feature flag service, and that service itself goes down. How do you handle degradation of the degradation system?
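The degradation levels from the e-commerce example above can be expressed as a health signal mapped to a level, plus a table of precomputed fallbacks. The latency thresholds and all names here are hypothetical; in practice the level would drive feature flags rather than a direct function call.

```python
# Lowest degradation level at which each feature is switched off (from the example above).
DISABLED_AT = {"recommendations": 1, "reviews": 2, "dynamic_pricing": 3}

# Fallbacks are prepared ahead of time, never computed while overloaded.
FALLBACKS = {
    "recommendations": "popular items",
    "reviews": "reviews hidden (static cached page)",
    "dynamic_pricing": "Contact us for pricing",
}

# Hypothetical p99 latency thresholds in milliseconds for levels 1, 2, and 3.
THRESHOLDS_MS = [300, 600, 1000]


def degradation_level(p99_ms: float) -> int:
    """Map the health signal to a level: one step per threshold crossed."""
    return sum(p99_ms >= threshold for threshold in THRESHOLDS_MS)


def product_page(level: int) -> dict:
    page = {"product": "core product details"}  # core content is always served
    for feature, disabled_at in DISABLED_AT.items():
        if level >= disabled_at:
            page[feature] = FALLBACKS[feature]
        else:
            page[feature] = f"live {feature}"
    return page


print(degradation_level(150))                     # 0
print(degradation_level(700))                     # 2
print(product_page(2)["recommendations"])         # popular items
print(product_page(2)["dynamic_pricing"])         # live dynamic_pricing
```

Keeping the mapping in plain data makes each level easy to trigger deliberately in staging, which is the testing practice the answer above recommends.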