Part X — Caching
Caching is not a performance optimization; it is a consistency trade-off. Every cache creates a second source of truth. The question is never “should we cache?” but “can we tolerate this data being stale for X seconds, and what happens if it is?” The reason caching bugs are so insidious is that they work perfectly 99% of the time and cause mysterious data corruption the other 1%.
Real-World Story: How Facebook Scaled Memcached to Billions of Requests
At Facebook’s scale, caching is not an optimization; it is a survival strategy. In their landmark 2013 paper “Scaling Memcache at Facebook,” the engineering team described how they evolved Memcached from a simple key-value cache into a distributed system handling billions of requests per second across multiple data centers. The challenge was staggering: Facebook’s social graph (who is friends with whom, who liked what post, which content to show in the News Feed) requires reading thousands of data points to render a single page. Hitting the database for every read was physically impossible at their scale.
Their solution was a multi-layered Memcached architecture that introduced several concepts now considered industry standard. They organized caches into pools (different pools for different access patterns), introduced lease tokens to solve the thundering herd problem (the cache gives a “lease” to exactly one client to refresh a stale key, while all other clients wait or get a slightly stale value), and built a system called McSqueal that listened to MySQL’s replication stream to invalidate cache keys, essentially using the database’s own change log as the invalidation trigger.
One of their most revealing findings was about cross-datacenter consistency. When a user in California updates their profile, and a friend in London loads the page a moment later, both data centers need to agree on what the profile says. Facebook handled this by designating one region as the master for writes and replicating its databases to the other regions. When a write originates in a non-master region, the web server sets a “remote marker” indicating that the local copy of that key might be stale, and reads that miss the cache while the marker exists are directed to the master region. This is a practical acknowledgment that perfect consistency across continents is not achievable without unacceptable latency.
The key takeaway for practitioners: Facebook did not build one big cache. They built a system of caches with clear rules about consistency, invalidation, and failure handling at every layer.
The paper is required reading for anyone designing caching at scale.
Real-World Story: Reddit and the Hot Post Stampede Problem
Reddit’s engineering team has publicly discussed one of the most elegant cache stampede problems in the industry: the “hot post” problem. When a post goes viral (say it hits the front page and suddenly receives tens of thousands of upvotes and comments per minute), the caching dynamics become extremely challenging.
Here is the core tension: the post’s content, vote count, and comment tree are changing rapidly (making caches stale almost immediately), while simultaneously being read by millions of users (making the cache essential to survival). If you set a short TTL to keep data fresh, the key expires constantly and every expiration triggers a stampede of database queries. If you set a long TTL to prevent stampedes, users see vote counts and comment threads that are minutes out of date, which on Reddit, where “real-time” conversation is the product, is unacceptable.
Reddit’s approach involved several strategies working together: probabilistic early expiration (where a small random subset of readers refresh the cache before it actually expires, spreading the load), write-through updates for vote counts (incrementing the cached counter directly on each vote rather than invalidating and re-reading), and tiered cache TTLs based on post “temperature”: a hot post gets a 5-second TTL while a cold post from last week gets a 5-minute TTL. They also separated the fast-changing data (vote count, comment count) from the slow-changing data (post title, body, author) into different cache keys with different TTLs, so a vote does not invalidate the entire post object.
This is a masterclass in the principle that caching strategy should match data access and mutation patterns: not a one-size-fits-all TTL, but a thoughtful decomposition of the data model based on how frequently each piece changes and how stale it can be.
Chapter 17: Caching Patterns and Tools
17.1 Types of Caching
Caching exists at every layer of the stack. Understanding which layer to cache at, and the staleness implications of each, is a key architectural skill.
Browser cache: Controlled by Cache-Control and ETag headers. The client stores responses locally. Fastest possible cache (zero network). But you cannot invalidate it from the server: you must wait for the TTL to expire or use cache-busting URLs (app.js?v=abc123).
CDN cache (Cloudflare, CloudFront, Akamai): Caches responses at edge locations globally. Reduces latency (users hit the nearest edge) and origin load. Best for: static assets (JS, CSS, images), infrequently changing HTML. Invalidation via cache purge API (takes seconds to propagate globally). Use Cache-Control: public, max-age=31536000, immutable for versioned static assets.
Application cache (in-memory LRU): Within a single application instance. Fastest after browser cache (no network). Problem: each instance has its own cache — inconsistency between instances, and cache is lost on restart. Good for: reference data that changes rarely (country list, config), computed results that are expensive but not critical to be fresh.
Distributed cache (Redis, Memcached): Shared across all application instances. Single source of cached truth. Adds ~1ms network latency per lookup. The standard caching layer for web applications. Good for: session data, user profiles, API responses, expensive database query results. In managed cloud environments, services like AWS ElastiCache abstract the operational overhead of running Redis or Memcached clusters — auto-failover, patching, and backup are handled for you. See Cloud Service Patterns — ElastiCache for the Redis vs Memcached decision matrix on AWS and when to use each.
Database cache (buffer pool): The database itself caches frequently accessed data pages in memory. PostgreSQL’s shared_buffers, MySQL’s InnoDB buffer pool. You rarely manage this directly, but understanding it explains why “the first query is slow, subsequent queries are fast” — the data pages are now in the buffer pool.
OS-level page cache: Below the database buffer pool sits the operating system’s page cache — the kernel caches recently accessed file data in unused RAM. This is why Kafka is fast despite writing to disk: it relies on the OS page cache rather than building its own in-memory caching layer. Understanding the page cache also explains why free -h on a Linux server shows very little “free” memory even when the system is healthy — the kernel is using that RAM productively for caching file I/O. For I/O-intensive workloads, memory-mapped files (mmap) can map database files directly into a process’s virtual address space, letting the OS handle paging transparently. See OS Fundamentals — Memory Management for a deep dive into page cache behavior, mmap trade-offs, and why these OS-level caching mechanisms underpin the performance of every database and caching layer above them.
Multi-Layer Caching
In production systems, caching is rarely a single layer. Requests flow through multiple caches before reaching the origin:
- Browser cache serves the response instantly if the asset is fresh (per Cache-Control/ETag). Zero latency.
- CDN edge catches requests that miss the browser. Serves from the nearest PoP (Point of Presence). Latency: 5-20ms.
- Application cache (Redis/Memcached) catches requests that miss the CDN — typically dynamic, personalized content. Latency: 1-5ms from the app server.
- Database buffer pool catches queries that miss the application cache. The DB serves from in-memory pages if available. Latency: 1-10ms.
- Disk is the last resort. Latency: 5-15ms (SSD) or 10-50ms (HDD).
17.2 Caching Patterns
The four fundamental caching strategies. Know them cold: interviewers expect you to name the pattern, explain the data flow, and articulate when each is appropriate.
Cache-Aside (Lazy Loading)
The application manages the cache directly. On read: check cache, if miss, read DB, populate cache, return. On write: update DB, then delete (not set) the cache key.
- Pro: Only caches data that is actually requested (no wasted memory).
- Pro: Application has full control over caching logic.
- Con: First request after a miss is always slow (cache-cold penalty).
- Con: Possible inconsistency if the DB is updated but the cache key is not deleted.
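The read and write paths of cache-aside can be sketched as follows. This is a minimal illustration, assuming plain dicts as stand-ins for the database and the cache; the names `get_user`, `update_user`, and `db_reads` are hypothetical and not taken from any library.

```python
from typing import Optional

# Hypothetical stand-ins: plain dicts play the role of the database and the cache.
db = {"user:1": "Ada"}
cache: dict = {}
db_reads = 0  # counts how often the "database" is actually hit


def get_user(key: str) -> Optional[str]:
    """Read path: check the cache, fall back to the DB on a miss, then populate."""
    global db_reads
    if key in cache:
        return cache[key]
    db_reads += 1
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def update_user(key: str, value: str) -> None:
    """Write path: update the DB first, then delete (not set) the cache key."""
    db[key] = value
    cache.pop(key, None)


get_user("user:1")             # miss: reads the DB and fills the cache
get_user("user:1")             # hit: served from the cache
update_user("user:1", "Grace")
latest = get_user("user:1")    # miss again, so the new value is loaded
```

Deleting on write means the next reader pays one extra miss, but the cache never holds a value the write path computed under a race.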
Read-Through
The cache itself loads data from the DB on a miss. The application only talks to the cache; it never directly queries the database for cached entities.
- Pro: Centralizes cache-loading logic, so the application code is simpler.
- Pro: Cache library handles miss logic, retries, and population.
- Con: The cache layer needs a data-loader callback or configuration for each entity type.
- Con: First-request penalty still exists (same as cache-aside).
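The difference from cache-aside is where the loading logic lives. A small sketch, assuming a hypothetical `ReadThroughCache` class that receives a loader callback (here `load_product`, a stand-in for a database query):

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Hypothetical cache that owns the loading logic via a loader callback."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._store: Dict[str, str] = {}
        self.loads = 0  # how many times the loader was invoked

    def get(self, key: str) -> str:
        if key not in self._store:
            # Miss: the cache, not the caller, goes to the backing store.
            self._store[key] = self._loader(key)
            self.loads += 1
        return self._store[key]


def load_product(key: str) -> str:
    # Stand-in for a database query.
    return f"row-for-{key}"


products = ReadThroughCache(load_product)
first = products.get("product:42")   # miss, loader runs
second = products.get("product:42")  # hit, loader is not called again
```

The caller only ever sees `get`; retries, population, and miss handling can change inside the class without touching application code.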
Write-Through
Every write goes to the cache AND the database synchronously. The cache is always current.
- Pro: Cache is always consistent with the database, so there are no stale reads.
- Pro: Simplifies read path (cache always has the latest data).
- Con: Write latency increases (must write to both cache and DB before returning).
- Con: Caches data that may never be read (wastes memory on write-heavy, read-light data).
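A write-through wrapper can be sketched like this. The class name `WriteThroughCache` and the dict-backed "database" are assumptions for illustration; a real implementation would also need to decide what happens when the second write fails.

```python
class WriteThroughCache:
    """Hypothetical wrapper: a write returns only after DB and cache are both updated."""

    def __init__(self, db: dict) -> None:
        self._db = db
        self._cache: dict = {}

    def put(self, key: str, value: str) -> None:
        self._db[key] = value      # 1. durable write first
        self._cache[key] = value   # 2. then the cache, in the same call

    def get(self, key: str):
        # Reads come from the cache; fall back to the DB for keys never written here.
        if key in self._cache:
            return self._cache[key]
        return self._db.get(key)


database: dict = {}
store = WriteThroughCache(database)
store.put("cart:9", "2 items")
```

Every `put` pays for both writes, which is the latency cost listed above.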
Write-Back (Write-Behind)
Writes go to the cache immediately. The cache asynchronously flushes to the database in batches or after a delay.
- Pro: Extremely fast writes (client does not wait for DB).
- Pro: Batching reduces DB write load.
- Con: Data loss risk — if the cache node fails before flushing, writes are lost.
- Con: Increased complexity for failure handling and ordering guarantees.
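The batching benefit and the data-loss window are easier to see in code. The sketch below assumes a hypothetical `WriteBackCache` with an explicit `flush` method; in practice the flush would run on a timer or background worker.

```python
class WriteBackCache:
    """Hypothetical write-behind cache with an explicit flush step."""

    def __init__(self, db: dict) -> None:
        self._db = db
        self._cache: dict = {}
        self._dirty: set = set()   # keys written to the cache but not yet to the DB

    def put(self, key: str, value: int) -> None:
        self._cache[key] = value   # fast path: no DB round trip
        self._dirty.add(key)

    def flush(self) -> int:
        """Write all dirty keys to the DB in one batch; returns the batch size."""
        batch = {k: self._cache[k] for k in self._dirty}
        self._db.update(batch)
        self._dirty.clear()
        return len(batch)


database: dict = {}
counters = WriteBackCache(database)
for n in range(1, 4):
    counters.put("views:post:1", n)   # three writes to the same key
pending_before_flush = "views:post:1" not in database
flushed = counters.flush()            # one DB write covers all three
```

Anything still in `_dirty` when the process dies is lost, which is the failure mode the cons above describe.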
17.3 Cache Invalidation
“There are only two hard things in Computer Science: cache invalidation and naming things.” (Phil Karlton)
This is not just a joke. Cache invalidation is genuinely one of the hardest problems in distributed systems because it requires coordinating state across multiple independent systems with different consistency models and failure modes.
Five core invalidation strategies, with trade-offs:
1. TTL-based (Time-to-Live) expiration: Data expires after a fixed duration. Simple. Tolerates staleness up to the TTL value. The “set it and forget it” approach.
- When to use: Data where brief staleness is acceptable and the cost of serving stale data is low (product catalog descriptions, user avatars, reference data).
- Trade-off: You are trading freshness for simplicity. A 60-second TTL means data can be up to 60 seconds stale. For most read-heavy data this is fine. For financial balances or inventory counts, it is not.
- Gotcha: TTL alone is not a strategy — it is a safety net. If all your invalidation relies on TTL, you are accepting the maximum staleness window for every read, even when the data has not changed. This wastes the cache’s potential to serve fresh data indefinitely for unchanged entries.
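2. Event-based invalidation: When the underlying data changes, the write path (or a change-data-capture stream) emits an event that deletes or refreshes the affected cache keys.
- When to use: Data where staleness is unacceptable or where you want near-real-time cache freshness (pricing, inventory, user permissions, feature flags).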
- Trade-off: You are trading simplicity for freshness. Every write path must know about every cache key it affects — miss one write path and you have a stale data bug that is extremely hard to detect. CDC-based approaches (listening to the database’s write-ahead log) are more robust because they catch all writes regardless of which code path made them, but they add infrastructure complexity.
- Gotcha: Event delivery is not guaranteed in most systems. A lost event means a permanently stale cache entry (until TTL saves you — which is why you always combine this with TTL as a backstop).
3. Version-based keys: Include a version identifier in the cache key (e.g., product:123:v7 or config:abc123). When the data changes, you increment the version. New reads use the new key and miss the cache (populating it fresh), while old versions expire naturally via TTL.
- When to use: Data that changes in discrete, versioned updates — configuration, feature flags, compiled templates, static asset manifests.
- Trade-off: You are trading cache space for simplicity. Old versions linger in cache until TTL evicts them, wasting memory. But you never need to explicitly delete anything; the key just changes. This is the pattern behind content-hashed filenames for static assets (app.a1b2c3.js), and it is bulletproof for that use case.
- Gotcha: You need a reliable way to propagate the “current version” to all readers. If different app instances disagree on the current version, some will read stale keys.
4. Write-through update (update-on-write): The write path stores the new value in the cache as part of the same operation that writes it to the database, instead of deleting the key.
- When to use: Data where the write path already has the full new value and you want zero cache misses after writes (session state, user profile after edit, shopping cart).
- Trade-off: You are trading write latency for read consistency. Every write now takes longer (must update both DB and cache before returning). You also risk caching data that is never read, which wastes memory on write-heavy, read-light entities.
- Gotcha: Concurrent writes can still cause race conditions. Thread A writes value X to DB, Thread B writes value Y to DB, then Thread A writes X to cache after Thread B wrote Y to cache: the cache now has X but the DB has Y. Use conditional writes (SET IF version = expected) or always prefer delete-on-write unless you have a strong reason for update-on-write.
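5. Pub/sub broadcast invalidation: When a value changes, the writer publishes an invalidation message on a shared channel, and every application instance drops the key from its local in-process cache.
- When to use: Multi-instance applications with in-process caches that need to stay in sync (feature flag caches, configuration caches, DNS-like lookup tables).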
- Trade-off: You are trading network overhead for consistency across instances. Every instance must process every invalidation message, even if it does not have that key cached. At high write rates, invalidation traffic can become significant.
- Gotcha: Pub/sub delivery is at-most-once in most implementations (Redis Pub/Sub, for example, does not persist messages — if an instance is briefly disconnected, it misses invalidations). Combine with short TTLs as a fallback.
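The versioned-key approach described above (keys such as product:123:v7) can be sketched in a few lines. The `versions` registry here is a hypothetical stand-in for wherever the current version is stored; how that registry is propagated to all instances is exactly the gotcha noted for this strategy.

```python
cache: dict = {}
versions = {"product:123": 7}   # hypothetical "current version" registry


def versioned_key(entity: str) -> str:
    return f"{entity}:v{versions[entity]}"


def read(entity: str, load) -> str:
    key = versioned_key(entity)
    if key not in cache:
        cache[key] = load()     # a miss on the new key populates it fresh
    return cache[key]


def bump(entity: str) -> None:
    """On change, increment the version; nothing is deleted explicitly."""
    versions[entity] += 1


old = read("product:123", lambda: "price=10")
bump("product:123")
new = read("product:123", lambda: "price=12")
```

After the bump both `product:123:v7` and `product:123:v8` sit in the cache; in a real deployment the old key would be left to expire via its TTL.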
Practices that make invalidation reliable:
- Delete, never set, on write. When data changes, delete the cache key; do not try to update it. The next read will trigger a cache miss and repopulate from the source of truth. This avoids race conditions where two concurrent writes leave the cache with stale data.
- Subscribe to change events (CDC). Use database change-data-capture (CDC) — such as Debezium for PostgreSQL/MySQL or DynamoDB Streams — or application-level events to trigger invalidation. This decouples the write path from cache management and catches invalidations that direct code paths miss. Facebook’s McSqueal system (described above) is the canonical example: it listened to MySQL’s replication stream and invalidated Memcached keys based on which rows changed.
- Use short TTLs as a safety net. Even with event-based invalidation, always set a TTL. If the invalidation event is lost (network blip, consumer crash), the TTL ensures the data eventually refreshes. This is defense in depth — the TTL is your “worst case” staleness guarantee.
- Tag-based invalidation. Assign tags to cache entries (e.g., product:123, category:electronics). When a category changes, invalidate all entries tagged with that category. Frameworks like Laravel and libraries like cache-manager support this natively. This is especially powerful for invalidating aggregate views: when one product in a category changes, you invalidate the cached category page rather than trying to figure out which specific page cache keys contained that product.
- Layered invalidation audit. For every write operation, draw the full invalidation path through every cache layer (browser, CDN, application cache, distributed cache). Verify each layer has a mechanism for receiving the invalidation signal. A common production bug: you invalidate the Redis key perfectly but forget the CDN, so users see stale data for the full CDN TTL. Build integration tests that verify invalidation reaches all layers.
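Tag-based invalidation boils down to a reverse index from tags to keys. The sketch below is an assumption about how such an index might look in memory; it does not reproduce the API of Laravel or any specific library, and the key and tag names are made up.

```python
from collections import defaultdict

cache: dict = {}
keys_by_tag = defaultdict(set)   # tag -> cache keys carrying that tag


def put(key: str, value: str, tags: list) -> None:
    cache[key] = value
    for tag in tags:
        keys_by_tag[tag].add(key)


def invalidate_tag(tag: str) -> int:
    """Delete every entry carrying the tag; returns how many were removed."""
    removed = 0
    for key in keys_by_tag.pop(tag, set()):
        if cache.pop(key, None) is not None:
            removed += 1
    return removed


put("page:category:electronics", "<html>...</html>", ["category:electronics"])
put("page:product:123", "<html>...</html>", ["product:123", "category:electronics"])
put("page:product:999", "<html>...</html>", ["category:books"])
removed = invalidate_tag("category:electronics")
```

Other tags may still reference keys that were just removed; that is harmless here because deleting a missing key is a no-op, but a production index would want periodic cleanup.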
| Data Type | Suggested TTL | Reasoning |
|---|---|---|
| Session data | 15-30 minutes | Security — stale sessions are a risk |
| User profile | 5-15 minutes | Changes infrequently, staleness is minor |
| Product catalog | 1-5 minutes | Changes occasionally, brief staleness acceptable |
| Feature flags | 30 seconds - 2 minutes | Changes must propagate quickly |
| Static reference data (countries, currencies) | 1-24 hours | Rarely changes |
| Search results | 30 seconds - 5 minutes | Freshness matters, expensive to compute |
| API rate limit counters | Match the rate limit window | Must be accurate |
| Computed aggregations (dashboards) | 1-5 minutes | Expensive to compute, brief staleness fine |
17.3.0 The Stale-Data UX Problem — When Caching Hurts the Product
A cache hit ratio of 95% can hide a product that is silently broken. The metric tells you the cache is working. It does not tell you whether the data the cache is serving is correct enough for the user experience.
The “green dashboard, broken product” trap: Consider an e-commerce catalog with a 5-minute TTL. The cache hit ratio is 98%, which is technically excellent. But during a flash sale, prices change every 30 seconds. For 5 minutes after each price change, users see the wrong price. They add items to their cart at 39.99. Support tickets spike. Conversion drops 15%. The cache dashboard is green. The revenue dashboard is red.
This is the stale-data UX problem: the cache’s health metrics and the user’s experience are measuring different things. A high hit ratio means the cache is serving data efficiently. It says nothing about whether that data is fresh enough for the business context.
Patterns that look good but hide bad behavior:
- High hit ratio with high staleness. A 99% hit ratio with a 30-minute TTL on data that changes every minute means 99% of users are seeing stale data. The cache is excellent at serving the wrong answer quickly.
- Uniform TTL on non-uniform data. Applying the same 5-minute TTL to product descriptions (changes weekly) and inventory counts (changes every second) means inventory is perpetually stale. The hit ratio is the same for both — but the business impact is vastly different.
- Cache hit ratio that ignores the “read-your-writes” gap. A user updates their profile and immediately refreshes the page. The page loads from cache — showing the old profile. The hit ratio says “success.” The user says “broken.” For the author of a mutation, the acceptable staleness window is zero seconds.
- Aggregate hit ratio hiding per-key skew. An overall 90% hit ratio might mean 100 popular keys have a 99.9% hit ratio while 10,000 long-tail keys have a 5% hit ratio. The aggregate looks healthy while the long tail is effectively uncached.
| Metric | What It Reveals | Why Hit Ratio Alone Misses It |
|---|---|---|
| cache_staleness_seconds (age of served data) | How old the data was when the user received it | A hit can serve 1-second-old or 5-minute-old data; the hit ratio treats both equally |
| cache_hit_ratio_per_key_prefix | Whether specific data types are underserved | Aggregate ratio hides per-type problems |
| stale_read_after_write_count | How often a user sees stale data within N seconds of their own write | Directly measures the read-your-writes gap |
| business_metric_divergence (e.g., displayed price vs checkout price) | Whether cached values are causing downstream business errors | The cache does not know what “wrong” means; business metrics do |
The UX of Stale Data — When the Cache Becomes the Product’s Enemy
Staleness is not an infrastructure metric; it is a UX metric. Users do not know or care that they are being served from a cache. They see a number, a status, a price, and they act on it. When that data is stale, the product is lying to the user, and no amount of infrastructure green lights changes that.
Stale data UX patterns by domain:
| Domain | Stale Data Symptom | User Impact | Business Cost |
|---|---|---|---|
| E-commerce pricing | Flash sale price shows 39.99 | Rage, support tickets, trust erosion | 10-15% conversion drop during sale events |
| Inventory / stock levels | “In Stock” badge on a sold-out item | Cart abandonment at checkout, wasted ad spend driving traffic to unavailable products | False demand signals, fulfillment cancellations |
| Social media counts | Like/share counts frozen for minutes on viral content | Users think the post is not performing, stop sharing | Reduced organic amplification during the critical first hour |
| Financial dashboards | Portfolio value shows yesterday’s close, not real-time | Users make buy/sell decisions on wrong data | Regulatory risk (displaying stale prices as current) |
| Collaborative editing | Document shows another user’s edits from 30 seconds ago | Conflicting edits, overwritten work, user frustration | Reduced trust in collaboration tool, users switch to competitors |
| Delivery tracking | Status shows “In Transit” when package was delivered 2 hours ago | Unnecessary support calls, user anxiety | $5-8 per support call, NPS damage |
- Show data freshness to the user. For dashboards and real-time data, display “Last updated: 30 seconds ago” or a subtle indicator. Bloomberg terminals, stock trading apps, and Grafana all do this. It sets the user’s expectation and turns a silent lie into an explicit contract.
- Read-your-writes bypass. When a user performs a mutation (updates profile, changes price, places order), bypass the cache for that user for the next N seconds. The rest of the world can tolerate staleness; the author of the change cannot. Implement with a short-lived per-user flag: user:123:cache_bypass_until=now+30s.
- Staleness-aware rendering. If the cache entry is older than a domain-specific threshold, render a visual indicator (dimmed text, a warning badge) or trigger an inline refresh. An inventory count that is 5 minutes old should not display a confident green “In Stock” badge; it should display “Last checked: 5 min ago.”
- Split display from transaction. Show cached data for browsing (acceptable staleness) but always read from the source of truth at the moment of transaction. The cart page can show a cached price; the checkout must read the real price. Amazon does this — the “price may have changed since you added this item” notice is a stale-data UX pattern.
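The read-your-writes bypass from the list above can be sketched as a per-user deadline. The dict-based `db`, `cache`, and `bypass_until` are hypothetical stand-ins, and the 30-second window simply mirrors the example flag; the shared cache is deliberately left stale so the bypass is visible.

```python
import time

db = {"user:123": "old bio"}
cache = {"user:123": "old bio"}
bypass_until: dict = {}          # user id -> monotonic deadline
BYPASS_SECONDS = 30.0            # assumed window, mirroring the 30s example


def update_profile(user_id: str, bio: str) -> None:
    db[user_id] = bio
    # This sketch leaves the shared cache stale on purpose to show the bypass.
    bypass_until[user_id] = time.monotonic() + BYPASS_SECONDS


def read_profile(viewer_id: str, user_id: str) -> str:
    author_is_reading = viewer_id == user_id
    if author_is_reading and time.monotonic() < bypass_until.get(user_id, 0.0):
        return db[user_id]       # the author reads the source of truth
    return cache.get(user_id, db[user_id])


update_profile("user:123", "new bio")
seen_by_author = read_profile("user:123", "user:123")
seen_by_others = read_profile("user:456", "user:123")
```

Only the author pays the extra database read, and only for the length of the window.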
17.3.1 Cache Eviction Policies — LRU vs LFU vs TTL-Based
When your cache reaches its memory limit, it must decide what to throw away. This is the eviction policy, and choosing the wrong one can tank your hit ratio overnight. The three policies you need to know cold are LRU, LFU, and TTL-based eviction. Each wins in different scenarios, and the differences are not academic.
LRU (Least Recently Used)
How it works: Evict the key that has not been accessed for the longest time. Maintains a recency order: every access moves the key to the “front” of the queue, and eviction removes from the “back.”
When LRU wins:
- Workloads with temporal locality: recently accessed items are likely to be accessed again soon. Web session data, recent API responses, and user profile caches during active sessions are classic LRU workloads.
- Rotating hot sets: if your access pattern naturally clusters around “hot” items that rotate over time (e.g., trending content that changes daily), LRU adapts quickly because yesterday’s hot items age out naturally.
When LRU fails:
- Frequency-skewed workloads: a product catalog page viewed 10,000 times/day but idle for the last 5 minutes gets evicted in favor of a page accessed once 2 minutes ago. LRU has no concept of “popularity”; it only knows recency.
- Scan pollution: a batch job that sequentially reads through many keys (e.g., a nightly report scanning all users) will push every scanned key to the “front,” evicting genuinely hot items. One bulk operation can destroy your cache effectiveness.
Redis implementation: Redis approximates LRU rather than tracking exact recency order. On eviction it samples a small number of keys (maxmemory-samples, default 5) and evicts the least recently used among the sample. Increasing the sample size improves accuracy at the cost of CPU.
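Scan pollution is easy to reproduce with a tiny exact LRU. The sketch below uses `collections.OrderedDict`; it is an exact LRU for illustration, not Redis's sampled approximation, and the key names are made up.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def __contains__(self, key) -> bool:
        return key in self._items


lru = LRUCache(capacity=3)
lru.put("hot", "popular page")
for _ in range(1000):
    lru.get("hot")                 # heavy traffic, but popularity is not recorded
for i in range(3):
    lru.put(f"scan:{i}", "row")    # a one-pass batch scan fills the cache
```

After the scan, `"hot"` has been evicted even though it was read a thousand times, because only the order of the last access matters.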
LFU (Least Frequently Used)
How it works: Evict the key that has been accessed the fewest times. Maintains an access counter per key. Redis’s LFU implementation (since Redis 4.0) uses a logarithmic counter that decays over time, so a key that was hot last week but cold this week will eventually be evictable.
When LFU wins:
- Popularity-skewed workloads: some items are accessed orders of magnitude more often than others (product catalogs, homepage widgets, frequently queried reference data). LFU keeps the popular items regardless of when they were last accessed.
- Scan resistance: a batch scan that touches every key once does not inflate frequency counters enough to displace genuinely popular items. This is LFU’s biggest practical advantage over LRU.
When LFU fails:
- Shifting popularity: a new product launch or trending topic needs cache space, but LFU has already filled the cache with historically popular items. The new item must accumulate enough frequency to displace incumbents, causing an elevated miss rate during the transition. Redis mitigates this with counter decay, but the adjustment is not instant.
- One-shot access patterns: if many keys are accessed exactly once (user-specific data for brief sessions), LFU treats them all equally and eviction becomes essentially random among them. LRU would at least keep the most recent ones.
Redis configuration: Set maxmemory-policy to allkeys-lfu or volatile-lfu. Tune the decay time with lfu-decay-time (minutes that must elapse before the counter is decremented, default 1) and the logarithmic factor with lfu-log-factor (higher = slower counter growth, default 10). For most caching workloads, allkeys-lfu is the best modern default; see Database Deep Dives: Redis Eviction Policies for the full policy comparison table and tuning guidance.
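To build intuition for why a one-byte counter can represent millions of hits, here is a sketch of a probabilistic logarithmic counter in the style Redis documents for LFU. The constants (initial value 5, saturation at 255) and the increment formula reflect my understanding of Redis's behavior and should be treated as assumptions; decay is omitted.

```python
import random

LFU_INIT_VAL = 5   # assumed starting value so brand-new keys are not evicted at once


def lfu_log_incr(counter: int, log_factor: int, rng: random.Random) -> int:
    """One access: increment with a probability that shrinks as the counter grows."""
    if counter >= 255:
        return 255
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def simulate(hits: int, log_factor: int = 10, seed: int = 0) -> int:
    rng = random.Random(seed)
    counter = LFU_INIT_VAL
    for _ in range(hits):
        counter = lfu_log_incr(counter, log_factor, rng)
    return counter


after_100 = simulate(100)
after_10k = simulate(10_000)
```

Raising `log_factor` makes each further increment less likely, which is why a higher lfu-log-factor means slower counter growth.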
TTL-Based Eviction
How it works: Keys are not evicted by a capacity policy — they simply expire after a fixed time-to-live. When memory is under pressure, Redis can prioritize evicting keys with the shortest remaining TTL (volatile-ttl policy).
When TTL-based wins:
- Freshness-critical data — when the primary concern is not “what to keep in cache” but “how long can stale data live.” Rate limit counters, feature flags, and short-lived tokens are TTL workloads. The TTL is the contract, not the capacity.
- Predictable memory usage — if all keys have similar TTLs and arrival rates, memory usage reaches a natural steady state. No surprise evictions, no cache churn.
When TTL-based fails:
- Variable-value data: some cached items are far more expensive to recompute than others. TTL evicts a key that took 500ms to compute just as readily as one that took 5ms. LRU and LFU at least consider access patterns; TTL is blind to them.
- Without a capacity backstop: TTL alone does not prevent memory exhaustion. If keys arrive faster than they expire, memory grows unbounded until maxmemory triggers a different eviction policy (or noeviction errors). Always pair TTL with a capacity-based policy.
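The "no capacity backstop" problem shows up clearly in a minimal TTL cache. This sketch assumes lazy expiry (entries are dropped when read after their deadline) and an injectable clock so the behavior is deterministic; the class and key names are hypothetical.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Hypothetical TTL cache with lazy expiry and an injectable clock for testing."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: dict = {}   # key -> (value, expires_at)

    def set(self, key: str, value: str, ttl: float) -> None:
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]   # expired: drop on access
            return None
        return value

    def __len__(self) -> int:
        return len(self._items)


now = [0.0]                           # fake clock, advanced manually
ttl_cache = TTLCache(clock=lambda: now[0])
for i in range(1000):
    ttl_cache.set(f"token:{i}", "x", ttl=60)
size_before_expiry = len(ttl_cache)   # nothing bounds growth until TTLs pass
fresh = ttl_cache.get("token:0")
now[0] = 61.0
expired = ttl_cache.get("token:0")
```

Nothing in this class limits how many entries accumulate inside the 60-second window, which is why a size cap or eviction policy is still needed alongside TTLs.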
The Decision Matrix
| Scenario | Best Policy | Why |
|---|---|---|
| General-purpose web app cache | allkeys-lfu | Most web workloads have power-law access patterns — a few items get most reads |
| Session store with expiry | volatile-lru or volatile-ttl | Sessions have natural lifetimes; evict expired/idle sessions first |
| Mixed cache + persistent data | volatile-lfu | Evict only among keys with TTLs; persistent keys (config, flags) are protected |
| Streaming/time-series data | allkeys-lru | Recent data is always more relevant; old data ages out naturally |
| Simple cache, no clear pattern | allkeys-lru | Safe default; reasonable performance across most workloads |
| Rate limiting / counters | TTL only (no eviction) | Counters must live exactly as long as their window; eviction would break correctness |
17.4 Cache Stampede (Thundering Herd)
A cache stampede occurs when a popular cache key expires and hundreds (or thousands) of requests simultaneously miss the cache and hit the database. The DB gets overwhelmed, latency spikes, and the system can cascade into failure.
Why it happens: Imagine a product page viewed 1,000 times per second. The cache key expires. All 1,000 requests in the next second find no cache entry and each independently queries the database. The DB goes from 0 queries/sec to 1,000 queries/sec instantly.
Solutions:
1. Lock-based rebuilding (mutex/sentry): Only one request is allowed to rebuild the cache. All others wait (spin or sleep) and retry. With Redis, the lock is typically taken with an atomic set-if-absent that carries an expiry, for example redis.set(lock_key, "1", NX=true, EX=5).
- Pro: Simple, effective, guaranteed single rebuild.
- Con: Other requests must wait — adds latency to the “waiting” requests. If the rebuilding request crashes, the lock TTL must expire before another request can try.
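2. Probabilistic early expiration: Each request may refresh the key slightly before its TTL expires, with a probability that rises as expiry approaches, so refreshes are spread out instead of piling up at the expiry instant.
- Pro: No locks, no waiting. Naturally distributes refreshes.
- Con: Multiple requests may still refresh simultaneously (but far fewer than without protection).
3. Background refresh (proactive warming): A background job refreshes known hot keys before they expire, so user requests never trigger the rebuild.
- Pro: Zero cache misses for known hot keys.
- Con: Requires knowing which keys are hot. Wastes resources refreshing keys that may not be requested.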
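Probabilistic early expiration is the least intuitive of these solutions, so here is a sketch of one commonly cited formulation (often called XFetch). It is offered as an illustration under assumptions, not as the method any particular company uses; `rebuild_cost` and `beta` are parameter names chosen for this example.

```python
import math
import random
from typing import Callable


def should_refresh_early(now: float, expires_at: float, rebuild_cost: float,
                         beta: float = 1.0,
                         rand: Callable[[], float] = random.random) -> bool:
    """Return True when this request should rebuild the key before it expires.

    rebuild_cost is how long the last rebuild took (seconds); beta > 1 refreshes earlier.
    """
    u = 1.0 - rand()                           # uniform in (0, 1], avoids log(0)
    head_start = -rebuild_cost * beta * math.log(u)
    return now + head_start >= expires_at


rng = random.Random(42)
# 10,000 simulated requests arriving 30 seconds before expiry, then 0.1 seconds before.
far = sum(should_refresh_early(0.0, 30.0, 0.5, rand=rng.random) for _ in range(10_000))
near = sum(should_refresh_early(29.9, 30.0, 0.5, rand=rng.random) for _ in range(10_000))
```

Far from expiry essentially no request volunteers to refresh; close to expiry most do. In a real cache the first volunteer would rewrite the key with a new expiry, so the remaining requests would see a fresh entry.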
17.5 Interview Questions — Caching
You are designing a product catalog for an e-commerce site. Which caching pattern would you use and why?
PriceChanged event, a consumer immediately deletes or updates the cache key. I would also reduce TTL to 5 seconds as a safety net. For the checkout flow specifically, I would always read the price from the database (source of truth), never from cache: the catalog page can show a briefly stale price, but the actual charge must be accurate.
What weak candidates say: “I would just reduce the TTL to 1 second everywhere.” They apply a single mechanism without distinguishing between browsing and transactional contexts, and they do not consider the database load implications of a 1-second TTL at scale.
What strong candidates say: “I would separate the display path from the transaction path. The catalog page uses cache-aside with event-based invalidation and a short TTL safety net. The checkout flow reads the price directly from the database, never from cache. For the author of a price change, I would add a read-your-writes bypass for 30 seconds so the admin sees the new price immediately.”
Follow-up chain:
- Failure mode: “What if the event bus drops the PriceChanged event?” The TTL safety net ensures the stale price expires within 5 seconds. I would monitor event delivery rate and alert on consumer lag exceeding 2 seconds. Dead letter queues capture failed events for replay.
- Rollout: “How do you migrate from TTL-only to event-based invalidation?” Shadow mode first (events flow but do not invalidate), then dual-mode for non-critical data, then feature-flag rollout for prices, measuring staleness reduction at each phase.
- Rollback: “The event pipeline fails at 2 AM on Black Friday. What happens?” The system gracefully degrades to TTL-only caching. The 5-second TTL means prices are at most 5 seconds stale. The rollback is automatic, with no human intervention needed.
- Measurement: “How do you prove the event-based invalidation is working?” Track cache_staleness_seconds (age of served data), event_delivery_lag_seconds, and price_mismatch_at_checkout_count (cached price vs DB price at the moment of charge).
- Cost: “What does this event pipeline cost?” A managed Kafka or SNS/SQS topic processing price-change events at typical e-commerce volume (1K-10K price changes/hour) costs 10K-50K in support tickets and refunds).
- Security/governance: “The price-change events contain product pricing data. Who should have access?” Restrict Kafka topic ACLs to the pricing service (producer) and catalog/cache services (consumers). Audit access logs. If events cross team boundaries, ensure pricing data is not considered commercially sensitive under internal data classification policies.
`price_mismatch_at_checkout_count` spiked from 0 to 150 in the last 5 minutes during a flash sale. Your event pipeline dashboard shows consumer lag of 8 seconds. Walk me through your first 10 minutes.”

You add caching to a slow API and response time drops from 500ms to 5ms. Two weeks later, users report seeing stale data. How do you investigate and fix?
- Failure mode: “What if the stale data is user account balances, not product descriptions?” — This changes the severity from UX annoyance to potential financial liability. I would reduce TTL to 5 seconds for balances, add a read-your-writes bypass for the account holder, and ensure the actual transaction path (transfers, purchases) always reads from the database.
- Measurement: “How do you detect stale data before users report it?” — Instrument `cache_staleness_seconds` per key prefix. Compare cached values against DB values periodically with a canary job. Alert if any cached entry is older than 2x its TTL (which indicates the invalidation event was missed).
- Cost: “Is event-based invalidation overkill for this API?” — It depends on the data sensitivity and change frequency. For a product catalog that changes a few times per day, TTL-only with a 5-minute window is sufficient and costs nothing. For inventory counts during a sale, event-based invalidation ($50-200/month for a message broker) pays for itself in prevented customer complaints.
How would you handle a cache stampede on a key that is read 50,000 times per second?
- Lock-based rebuild as the primary protection — use a distributed lock (Redis `SET NX EX`) so only one request rebuilds the key. Other requests either wait briefly or serve a slightly stale value (see next point).
- Stale-while-revalidate — keep serving the old cached value (even past its TTL) while the rebuilding request is in-flight. This eliminates latency spikes for the “waiting” requests entirely.
- Background refresh — for a key this hot, I would set up a background worker that refreshes it on a schedule (e.g., every 30 seconds), so the key effectively never expires under normal operation.
- Probabilistic early expiration as an additional layer — requests that read the key within the last 10% of its TTL have an increasing probability of triggering a background refresh, spreading the load.
- Failure mode: “What if the lock holder crashes mid-rebuild?” — The lock has a 5-second TTL (`EX 5` on the `SET NX`). If the holder crashes, the lock auto-expires and the next request acquires it. Meanwhile, all other requests serve the stale value. Maximum impact: 5 seconds of stale data.
- Rollout: “How do you test stampede protection before it is needed in production?” — Load test with a controlled cache key expiration. Use a tool like k6 or Locust to send 10K concurrent requests while simultaneously expiring the target key. Measure: how many requests hit the database (should be 1), what was the p99 latency for the “waiting” requests, and did any request return an error.
- Measurement: “How do you know the stampede protection is working in production?” — Track `cache_stampede_lock_waits_total`, `cache_stale_serves_total`, and `cache_rebuild_duration_seconds`. If lock waits spike regularly, your TTLs are too short or your background refresh is not keeping up.
- Security/governance: “At 50K reads/sec on a single key, what is the operational risk?” — This is a hot key problem. A single Redis node can handle ~100K ops/sec, so 50K reads on one key is within limits but leaves little headroom. If this key’s traffic doubles, consider client-side local caching (in-process LRU with a 1-second TTL) to reduce Redis load, or Redis read replicas to distribute reads.
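The lock-based rebuild combined with stale-while-revalidate can be sketched in-process. This is a minimal illustration rather than a Redis implementation: `StampedeSafeCache` and its methods are hypothetical names, a `threading.Lock` stands in for the distributed `SET NX EX` lock, and unlike the Redis lock it has no expiry, so recovery from a crashed lock holder is not modeled here.

```python
import threading
import time


class StampedeSafeCache:
    """In-process sketch: one rebuilder per key, everyone else serves stale."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader            # hypothetical function: key -> fresh value
        self._ttl = ttl_seconds
        self._clock = clock
        self._values = {}                # key -> (value, expires_at)
        self._locks = {}                 # key -> rebuild lock (stand-in for SET NX EX)
        self._guard = threading.Lock()   # protects the _locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]                       # normal cache hit
        stale = self._values.get(key)
        lock = self._lock_for(key)
        if lock.acquire(blocking=False):          # this caller won the rebuild
            try:
                entry = self._fresh(key)          # another caller may have just rebuilt
                if entry is None:
                    entry = (self._loader(key), self._clock() + self._ttl)
                    self._values[key] = entry
                return entry[0]
            finally:
                lock.release()
        if stale is not None:
            return stale[0]                       # stale-while-revalidate
        with lock:                                # cold key: wait for the rebuilder
            pass
        return self.get(key)
```

With a warm key, callers that lose the lock race return immediately with the old value; only a completely cold key forces them to wait for the single rebuild.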
`cache_stampede_lock_waits_total` jumped from 0 to 12,000 in the last 60 seconds on key prefix `product:homepage_featured`. The background refresh job last ran 3 minutes ago but its status shows ‘failed — connection timeout to database replica’. What do you do in the next 5 minutes?”

Your cache hit ratio dropped from 95% to 60% overnight. Walk me through your investigation.
product:123 to product:v2:123), effectively creating a brand-new cache with zero entries. Check git history and deployment logs.

Step 2: Check cache memory and eviction metrics. Look at Redis/Memcached memory usage and `evicted_keys` counters. If evictions spiked, the working set grew beyond cache capacity — maybe a new feature started caching a high-cardinality dataset, or a TTL change caused keys to accumulate. Run `INFO memory` on Redis (and `INFO stats`, which reports `evicted_keys`) and compare with the previous day.

Step 3: Analyze key-space changes. Are the cache misses concentrated on specific key prefixes, or spread uniformly? If concentrated, a specific data type lost its caching. If uniform, the problem is systemic (capacity, configuration, or infrastructure). Use Redis `MONITOR` briefly or log sampling to identify the miss patterns.

Step 4: Check for traffic pattern shifts. Did a marketing campaign or external event drive traffic to cold content that was not cached? A viral social media post linking to long-tail pages could cause a legitimate spike in misses for content that was never hot before.

Step 5: Check infrastructure. Did a Redis node restart, get replaced, or have a network partition? A node restart means a cold cache. If you are using Redis Cluster, check if a resharding event redistributed keys.

Step 6: Measure downstream impact. While investigating, confirm whether the lower hit ratio is actually causing problems — check database load, API latency, and error rates. A 60% hit ratio might be temporary and self-correcting if the cause is cold cache after a restart.

Recovery actions depend on root cause: If it is a cold cache from a restart, pre-warm the cache from a database scan of hot keys. If it is a key naming change, deploy a fix or add a migration path. If it is capacity, scale the cache cluster or review what is being cached.

Design a caching strategy for a social media feed where popular posts get millions of views but content changes frequently.
You are adding caching to a service for the first time. What would you instrument first -- before you even write the cache logic?
- Database query latency histogram (`db_query_duration_seconds`) by query type or endpoint. This is my “before” measurement. If I cannot prove the cache improved things, I cannot justify its existence.
- Request latency histogram at the application layer. Same reason — I need before/after comparison at the user-visible level, not just the DB level.
- Database QPS counter (`db_queries_total`). After caching, this should drop proportionally to the hit ratio. If it does not, the cache is not intercepting the right queries.
- `cache_hits_total` and `cache_misses_total` counters by key prefix. Not just a global hit ratio — I need to see which data types are benefiting and which are not.
- `cache_latency_seconds` histogram for cache reads and writes. If the cache layer itself adds 2ms overhead on every request, that erodes the benefit.
- `cache_staleness_seconds` gauge — the age of the data when served from cache. This is the metric that tells me whether the TTL is appropriate for the business context.
- `cache_stampede_lock_waits_total` counter — how often requests are waiting for a lock-based rebuild. If this is high, my stampede protection is firing frequently, which means my TTL or key design needs adjustment.
- “Six months later, a new team member asks ‘why do we have this cache?’ How do you answer with data?” — This is why the baseline metrics matter. You show the before/after DB QPS, the latency improvement, and the hit ratio. Without the baseline, you cannot justify the complexity.
- “The cache hit ratio is 95% but p99 latency got worse. How is that possible?” — Cache misses are now slower because they pay the overhead of checking the cache AND hitting the database. Or the cache itself (Redis) is under pressure. The `cache_latency_seconds` metric tells you.
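Per-prefix hit and miss counting is simple enough to sketch without a metrics library. The class below is an assumption-laden illustration: `CacheMetrics` is a hypothetical name, keys are assumed to follow a `prefix:id` naming scheme, and a real service would export these counts through its metrics client rather than hold them in memory.

```python
from collections import Counter


class CacheMetrics:
    """Tracks cache hits and misses per key prefix (illustrative only)."""

    def __init__(self):
        self.hits = Counter()      # mirrors cache_hits_total{prefix=...}
        self.misses = Counter()    # mirrors cache_misses_total{prefix=...}

    @staticmethod
    def prefix_of(key):
        # Assumes keys look like "product:123"; everything before ":" is the prefix.
        return key.split(":", 1)[0]

    def record(self, key, hit):
        target = self.hits if hit else self.misses
        target[self.prefix_of(key)] += 1

    def hit_ratio(self, prefix):
        total = self.hits[prefix] + self.misses[prefix]
        if total == 0:
            return None            # no traffic yet, so no meaningful ratio
        return self.hits[prefix] / total
```

Breaking the ratio out by prefix is what makes it possible to see that one data type sits at 99% while another never hits at all, which a single global number would hide.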
How would you use an AI coding assistant (Copilot, Claude, etc.) when designing or debugging a caching layer? Where does AI help and where does it mislead?
- Boilerplate generation. Cache-aside with stampede protection, Redis connection pooling, serialization/deserialization helpers — these are well-documented patterns. An AI can generate a correct implementation in minutes that would take 30 minutes to write manually. I would still review every line, but the time savings on scaffolding is real.
- TTL decision support. If I describe the access pattern (“read-heavy, updates twice daily, staleness tolerance of 5 minutes”), an AI can suggest a reasonable TTL and invalidation strategy. It is essentially pattern-matching against best practices — which is what TTL selection largely is.
- Generating test scenarios. “Write me integration tests for cache-aside behavior including: cache hit, cache miss, stampede under 100 concurrent requests, and stale-while-revalidate during Redis downtime.” AI is excellent at generating thorough test matrices.
- PromQL / LogQL query generation. “Write a PromQL query that shows cache hit ratio by key prefix over the last hour” — AI handles query syntax fluently.
- Invalidation design. AI will generate a plausible-looking invalidation strategy that misses edge cases specific to your data model. It does not know that your “update product” endpoint is also called by a batch job that bypasses the ORM, or that your CDC pipeline has a 3-second lag. Invalidation logic requires understanding your system’s write paths — not generic patterns.
- Capacity planning. “How much Redis memory do I need?” depends on key size distribution, TTL distribution, eviction pressure, and traffic patterns that the AI cannot observe. It will give you a formula, but the inputs require measurement.
- Production debugging. An AI can suggest “check SLOWLOG” or “look for scan pollution,” but it cannot look at your actual Redis metrics, correlate with your deployment timeline, or SSH into your server. Production debugging requires real-time system interaction, not pattern matching.
- “An AI suggests adding Redis caching to a write-heavy pipeline. How do you evaluate that suggestion?” — Apply the same critical thinking as any engineering proposal: what is the hit ratio likely to be? What is the read/write ratio? Would Kafka be a better fit? The AI does not know these answers for your workload.
- “How would you use AI to help during a cache-related incident?” — Feed it the symptoms (“p99 latency spiked, hit ratio dropped from 95% to 60%, SLOWLOG shows 200ms pauses”) and ask for a diagnostic checklist. Useful as a brainstorming partner, but you still need to run the checks yourself.
Tools
Redis — distributed cache, pub/sub, data structures. For Redis internals (persistence, cluster architecture, eviction tuning), see Database Deep Dives — Redis Architecture. For managed Redis on AWS (ElastiCache configuration, failover, sizing), see Cloud Service Patterns — ElastiCache. Memcached — simpler, pure caching. Varnish — HTTP reverse proxy cache. Caffeine — JVM in-memory cache (powered by the TinyLFU admission policy — the most sophisticated eviction algorithm in production use). node-cache — Node.js. Microsoft.Extensions.Caching — .NET.

Further Reading
- Redis in Action by Josiah Carlson — practical Redis usage patterns beyond simple caching.
- Redis Official Documentation — the authoritative reference for Redis commands, data structures, persistence, replication, and cluster configuration. Start with the “Introduction to Redis” and “Data types” sections for a solid foundation, then move to “Redis persistence” and “High availability with Redis Sentinel” for production-grade knowledge.
- Redis University (free courses) — free, self-paced courses covering Redis data structures, caching patterns, Streams, and RediSearch. The “RU101: Introduction to Redis Data Structures” and “RU301: Running Redis at Scale” courses are particularly relevant to caching architecture.
- Memcached Official Wiki — the definitive guide to Memcached’s architecture, slab allocation, memory management, and operational best practices. The wiki’s “ConfiguringServer” and “Performance” pages explain the design decisions behind Memcached’s simplicity and why it outperforms Redis for certain pure-caching workloads.
- Every Programmer Should Know About Memory by Ulrich Drepper — deep understanding of CPU caches and memory hierarchy.
- TinyLFU: A Highly Efficient Cache Admission Policy — the algorithm behind Caffeine (Java’s best caching library).
- Scaling Memcache at Facebook (2013) — the foundational paper on how Facebook evolved Memcached into a multi-datacenter distributed caching system handling billions of requests. Section 3.2 on the thundering herd problem and lease-based stampede prevention is especially relevant — it describes the exact lease-token mechanism that has since become the industry-standard approach to cache stampede protection.
- Netflix Tech Blog — Caching for a Global Netflix — Netflix’s engineering team regularly publishes deep dives on EVCache (their distributed caching layer built on Memcached), cache warming strategies, and how they handle caching across multiple AWS regions for their 200+ million subscribers.
- AWS ElastiCache Best Practices — AWS’s official guide covering cluster sizing, connection management, eviction policies, and replication strategies for Redis and Memcached. Especially useful for understanding the cache-aside pattern at scale, including connection pooling, lazy loading, and write-through configurations in managed environments.
- Cloudflare CDN Caching Documentation — comprehensive guide to CDN caching concepts including cache-control headers, edge TTLs, cache keys, purge strategies, and tiered caching. The “How caching works” and “Cache Rules” sections are the best freely available introduction to CDN-layer caching behavior and configuration.
- Fastly Caching Concepts — Fastly’s documentation on HTTP caching semantics, surrogate keys (their approach to tag-based CDN invalidation), stale-while-revalidate at the edge, and cache shielding. Particularly valuable for understanding advanced CDN patterns like instant purge and surrogate-key-based invalidation that go beyond simple TTL expiration.
Part XI — Observability
Monitoring vs Observability
These terms are often used interchangeably, but the distinction matters — and interviewers will test whether you understand the difference.

Monitoring answers known questions: “Is the error rate above 5%?” “Is CPU above 80%?” “Is the service up?” You define dashboards and alerts for expected failure modes in advance. Monitoring handles known unknowns — failure modes you have seen before and can anticipate.

Observability answers unknown questions: “Why are 2% of users in Brazil seeing slow responses?” “What is different about the requests that are failing?” You need high-cardinality data (individual request traces, structured logs with many fields) that you can slice and dice to investigate novel problems. Observability handles unknown unknowns — failure modes you have never seen and cannot predict.

The practical implication: Monitoring tells you that something is wrong. Observability helps you figure out why. You need both. Most teams start with monitoring (dashboards, alerts) and add observability (distributed tracing, high-cardinality logging) as their systems grow more complex.

The Three Pillars Are Complementary, Not Competing
A common mistake — especially in interviews — is to describe logs, metrics, and traces as three independent tools you can choose between. They are not alternatives. They are complementary lenses that each reveal different aspects of system behavior:

Metrics tell you SOMETHING is wrong. Logs tell you WHAT happened. Traces tell you WHERE in the chain it broke.

Here is how they work together in a real incident:
- Metrics fire the alert: “Error rate on `/api/checkout` just crossed 5% over the last 5 minutes.”
- Traces narrow the scope: you pull traces for failing checkout requests and see that 100% of failures have a slow span in the `payment-service` call, specifically timing out after 30 seconds.
- Logs reveal the root cause: you filter `payment-service` logs for the failing trace IDs and find: `"Connection pool exhausted — 50/50 connections in use, 23 requests queued"`.
Real-World Story: How Honeycomb Built Observability and Changed the Conversation
Honeycomb’s origin story is a case study in why observability as a discipline exists. Charity Majors, Honeycomb’s co-founder, was previously an infrastructure engineer at Parse (a mobile backend-as-a-service platform) and then at Facebook after it acquired Parse. At Parse, her team managed a system where hundreds of thousands of mobile apps — each with wildly different usage patterns — ran on shared infrastructure. When something went wrong, the question was never simple. It was not “is the database slow?” It was “why are requests from this specific app, using this specific query pattern, on this particular shard, slow only during this time window?”

Traditional monitoring tools could not answer these questions. Dashboards showed averages and aggregates — they could tell you that overall p99 latency was fine while completely hiding the fact that one customer’s app was experiencing 30-second timeouts. The problem was cardinality: to find the needle in the haystack, you needed to slice data by `app_id`, `query_type`, `shard`, time, and dozens of other dimensions simultaneously. Pre-aggregated metrics (the foundation of traditional monitoring) collapse these dimensions away by design.

This experience led Majors and co-founder Christine Yen to build Honeycomb around a fundamentally different data model: instead of pre-aggregating metrics, Honeycomb stores wide structured events — individual request records with dozens or hundreds of fields — and lets you query them interactively after the fact. Want to know the p99 latency for `user_id=abc123`, hitting `endpoint=/api/feed`, on `shard=7`, in the last 15 minutes? You can ask that question without having defined that specific combination of dimensions in advance.

The broader impact of Honeycomb’s approach was a shift in how the industry thinks about production debugging. Majors popularized the phrase “observability is about unknown unknowns” — the failures you did not anticipate and therefore could not build dashboards for. 
She argued (persuasively, and somewhat controversially at the time) that most teams were over-invested in dashboards for known failure modes and under-invested in the ability to explore novel failures. Her blog at charity.wtf became required reading for SRE teams, and the concept of “high-cardinality observability” entered the mainstream vocabulary. Whether or not you use Honeycomb specifically, the lesson is universal: if your observability tooling can only answer questions you thought to ask in advance, you are blind to the failures that will actually surprise you.

Real-World Story: Datadog vs New Relic vs Grafana — Why Companies Choose Different Observability Stacks
One of the most common questions engineering leaders face is which observability platform to standardize on. The answer reveals a lot about organizational priorities, and the trade-offs are genuinely instructive. Datadog has become the dominant commercial observability platform, particularly among cloud-native companies. Its strength is breadth: metrics, logs, traces, profiling, security monitoring, and synthetics all in one platform, with deep integrations for AWS, GCP, Azure, Kubernetes, and hundreds of other technologies. Datadog’s bet is that having everything in one place with correlated data is worth paying a premium for. The trade-off is cost — Datadog’s per-host and per-GB pricing model becomes very expensive at scale. Companies regularly report six- and seven-figure annual Datadog bills, and “Datadog cost optimization” has become its own mini-discipline. Companies like Coinbase and Peloton have publicly discussed building internal tooling specifically to manage Datadog costs. New Relic repositioned itself with a usage-based pricing model (100GB/month free, then per-GB) and a “full-stack observability” pitch. Their advantage is the free tier and the simpler pricing model — for mid-size companies, New Relic can be significantly cheaper than Datadog. The trade-off is that New Relic’s integrations ecosystem and query language (NRQL) are less mature in some areas, and their Kubernetes and infrastructure monitoring historically lagged Datadog. New Relic’s bet is that a lower price point with good-enough features wins in the mid-market. Grafana Labs (Grafana + Prometheus + Loki + Tempo + Mimir) represents the open-source-first approach. Grafana itself is the visualization layer; the data stores are separate, pluggable components. Companies like IKEA, Bloomberg, and Roblox run large-scale Grafana-based observability stacks. The advantage is cost control (you can self-host on your own infrastructure) and flexibility (mix and match components, avoid vendor lock-in). 
The trade-off is operational burden — running Prometheus, Loki, and Tempo at scale requires dedicated infrastructure engineering effort. Grafana Cloud offers a managed version, but at that point the cost comparison with Datadog becomes closer.

The decision framework in practice:
- Startup with a small team and no dedicated platform engineers: Datadog or New Relic (managed, low operational overhead). Choose New Relic if budget-constrained, Datadog if you want the deepest integrations.
- Mid-size company with platform engineering capacity: Grafana stack (self-hosted or Grafana Cloud) for cost control and flexibility, especially if you are already invested in Prometheus.
- Enterprise with compliance requirements: Often a mix — Datadog for application teams (ease of use), Grafana for infrastructure teams (flexibility and data sovereignty), with OpenTelemetry as the instrumentation layer to avoid lock-in.
Chapter 18: The Three Pillars
The three pillars of observability — logs, metrics, and traces — are not three competing approaches you pick from. They are three complementary perspectives on the same system. Think of them as three views of a building: the floor plan (metrics — the big picture, aggregated shape), the security camera footage (logs — detailed record of what happened), and the GPS tracker on a delivery (traces — following one specific journey through the building). You need all three to fully understand what is happening inside.

18.1 Logs
Structured logging (JSON with consistent fields). Correlation IDs across all services. Log levels: DEBUG, INFO, WARN, ERROR. Centralize logs for querying and analysis.

What a good structured log line looks like:

| Level | What to log | Example |
|---|---|---|
| DEBUG | Internal state, variable values, branch decisions | "Cache key product:123 not found, querying DB" |
| INFO | Business events, request completions, state transitions | "Order ord_321 created for user usr_789, total $49.99" |
| WARN | Recoverable problems, degraded operation, retries | "Redis connection timeout, retrying (attempt 2/3)" |
| ERROR | Failures requiring attention, unhandled exceptions | "Payment processing failed for order ord_321: gateway timeout" |
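One way to produce structured JSON log lines with consistent fields is a custom formatter on the standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class and the field names `service`, `trace_id`, `order_id`, and `user_id` are illustrative choices, not a schema prescribed by this guide.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    # Optional context fields copied over when the caller passes them via `extra`.
    CONTEXT_FIELDS = ("trace_id", "order_id", "user_id")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)
```

A caller would attach the formatter to a handler and log with `extra={"trace_id": ..., "service": ...}`, so the same field names appear on every line and can be indexed by the log backend.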
18.2 Metrics
Aggregated measurements: counters (total requests), gauges (current connections), histograms (latency distribution). Cheaper to store and query than logs. Foundation of dashboards and alerts.

The RED Method (for request-driven services): Rate (requests/second), Errors (error rate), Duration (latency distribution). The USE Method (for resources): Utilization, Saturation, Errors. USE comes from Brendan Gregg’s performance methodology; RED was proposed by Tom Wilkie as a counterpart for request-driven services.

What good metric names look like (Prometheus convention):
- `http_requests_total{method="POST", path="/api/orders", status="201"}` — counter
- `http_request_duration_seconds{method="GET", path="/api/products"}` — histogram
- `db_connections_active{pool="primary"}` — gauge
- `queue_messages_pending{queue="order-processing"}` — gauge
| Type | What it measures | Example metric | Why it matters |
|---|---|---|---|
| Counter | Cumulative count of events | http_requests_total, orders_created_total, cache_hits_total | Rate of change reveals throughput and trends |
| Gauge | Current value (can go up or down) | db_connections_active, queue_depth, memory_usage_bytes | Shows current state and saturation |
| Histogram | Distribution of values | http_request_duration_seconds, payload_size_bytes | Reveals p50/p95/p99 latency, not just averages |
| Summary | Pre-computed quantiles | rpc_duration_seconds{quantile="0.99"} | Client-side computed percentiles (less flexible than histograms) |
- Top row: Request rate (req/sec), error rate (%), p50/p95/p99 latency.
- Second row: CPU utilization, memory usage, active database connections.
- Third row: Downstream dependency latency, cache hit rate, queue depth.
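To see why a histogram reveals what an average hides, here is a small bucketed histogram in the spirit of Prometheus-style `le` buckets. It is a simplified sketch: `LatencyHistogram` is a hypothetical class, and its quantile method returns only the upper bound of the bucket containing the quantile, which is coarser than the interpolation that Prometheus's `histogram_quantile` performs.

```python
import bisect


class LatencyHistogram:
    """Counts observations into buckets defined by inclusive upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count if self.count else None

    def quantile_upper_bound(self, q):
        """Upper bound of the first bucket whose cumulative count covers q."""
        if self.count == 0:
            return None
        target = q * self.count
        cumulative = 0
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            cumulative += n
            if cumulative >= target:
                return bound
        return float("inf")    # quantile falls beyond the largest finite bucket
```

With 95 requests at 50 ms and 5 requests at 900 ms, the mean stays under 100 ms while the p99 bucket is the 1-second one, which is the tail a latency dashboard needs to show.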
18.3 Distributed Tracing
Follow a request across services. Each service creates a span. Spans are linked by trace ID. Visualize the full request path with timing.

What to capture in spans — concrete examples:

| Span Type | Key Attributes | Example |
|---|---|---|
| HTTP inbound | http.method, http.url, http.status_code, user_id | GET /api/orders/123 -> 200 (145ms) |
| HTTP outbound | http.method, peer.service, http.status_code | POST payment-service/charge -> 201 (89ms) |
| Database query | db.system, db.statement (sanitized), db.operation | SELECT orders WHERE user_id=? (12ms) |
| Cache operation | cache.hit, cache.key_prefix, db.system=redis | GET product:123 -> HIT (0.4ms) |
| Message publish | messaging.system, messaging.destination | PUBLISH order-events/order.created (2ms) |
OpenTelemetry (OTel) — The Industry Standard
OpenTelemetry is a CNCF project that provides a single set of APIs, libraries, and agents to capture distributed traces, metrics, and logs. Instrument once, export to any backend (Jaeger, Datadog, New Relic, Grafana, etc.).

Why it matters: Before OpenTelemetry, every observability vendor had its own proprietary instrumentation SDK. Switching vendors meant re-instrumenting your entire codebase. OTel provides vendor-neutral instrumentation — you write instrumentation code once and can switch backends by changing a configuration file. If you are starting fresh, use OpenTelemetry from day one. It is the converged standard (merging OpenTracing and OpenCensus), backed by every major observability vendor, and is the future of observability instrumentation.

Key OTel components:
- API — defines interfaces for traces, metrics, logs (what you code against)
- SDK — implements the API, handles sampling, batching, export
- Auto-instrumentation — automatic span creation for popular frameworks and libraries
- Collector — receives, processes, and exports telemetry data (acts as a pipeline between your app and your backends)
`@opentelemetry/auto-instrumentations-node` (Node.js), `opentelemetry-instrumentation` (Python), `go.opentelemetry.io/contrib` (Go), `io.opentelemetry:opentelemetry-javaagent` (Java). These automatically create spans for HTTP handlers, database clients, cache clients, and message queues with little or no application code changes (the Go contrib packages are wired in explicitly).

18.3.1 Observability for Distributed Systems — Tracing Across Service Boundaries
In a monolith, a stack trace tells you everything. In a distributed system, a single user request might traverse 5, 10, or 50 services — and a stack trace in one service tells you nothing about what happened in the others. This is the fundamental observability challenge of distributed systems: correlating signals across independent processes that share no memory, no call stack, and no clock.

Context Propagation — The Foundation of Distributed Tracing
When Service A calls Service B, how does Service B know it is part of the same user request? The answer is context propagation — passing metadata about the current trace alongside the actual request payload. This is the single most important concept in distributed observability.

What gets propagated:
- Trace ID — A globally unique identifier for the entire end-to-end request. Every service that participates in handling this request tags its logs, metrics, and spans with this trace ID. This is what lets you search “show me everything that happened for request abc123” across 20 services.
- Span ID — Identifies the current operation within the trace. When Service A calls Service B, Service A’s span ID becomes the `parent_span_id` in Service B’s span, creating the parent-child relationship that forms the trace tree.
- Trace flags — Sampling decisions and debug flags. Critically, the sampling decision is made once (at the entry point) and propagated to all downstream services. If the entry point decides “this request is sampled,” every downstream service must also record its spans — otherwise you get incomplete traces with missing segments.
- Baggage — Arbitrary key-value pairs that ride along with the trace context through the entire call chain. Baggage is the most powerful and most misunderstood part of context propagation.
Baggage Propagation — Carrying Context Through the Call Chain
Baggage lets you attach arbitrary metadata at the start of a request and have it available in every downstream service — without those services needing to know about each other. This is not just a convenience; it fundamentally changes what you can observe.

Practical baggage examples:
- `user.tier=premium` — Set at the API gateway based on the authenticated user. Now every downstream service can tag its spans, metrics, and logs with the user tier. You can answer “what is the p99 latency for premium users vs free users?” across all services without each service needing to look up user tiers.
- `experiment.variant=B` — Set by the feature flag service. Every service in the chain can now report metrics segmented by experiment variant, enabling accurate A/B test analysis even when the feature being tested affects downstream services.
- `request.source=mobile-app` — Set at the edge. You can now compare behavior across all services by client platform without retrofitting every service with client-detection logic.
Propagation Formats — W3C Trace Context
The industry has converged on the W3C Trace Context standard for propagation via HTTP headers:

`X-B3-TraceId` / `X-B3-SpanId` headers) or Jaeger propagation (`uber-trace-id` header). OTel supports all of these via configurable propagators — set them in the OTel SDK configuration.
For service-to-service calls over message queues (Kafka, SQS, RabbitMQ): Context propagation works the same way but through message headers/attributes instead of HTTP headers. OTel auto-instrumentation handles this for most popular messaging libraries. The critical point is that trace context must survive the asynchronous boundary — if a message sits in a queue for 30 seconds before being consumed, the consumer’s span should still be linked to the producer’s trace.
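The W3C `traceparent` header carries four dash-separated lowercase hex fields: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags whose lowest bit marks the trace as sampled. The sketch below parses an incoming header and builds the header for an outbound call. The function names are hypothetical, it handles only the version `00` layout, and starting every new trace as sampled is a simplifying assumption; in practice an OTel propagator does this work.

```python
import re
import secrets

# version - trace-id - parent-id - flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for a downstream call, continuing the incoming trace."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context: start a new trace (assumed sampled in this sketch).
        trace_id, sampled = secrets.token_hex(16), True
    else:
        # Keep the trace ID and the upstream sampling decision.
        trace_id, sampled = ctx["trace_id"], ctx["sampled"]
    span_id = secrets.token_hex(8)    # new span ID for this hop
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The key property is visible in `child_traceparent`: the trace ID and sampling flag are copied through unchanged, and only the span ID is regenerated at each hop.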
Correlation IDs — The Practical Glue
A correlation ID (often the trace ID, but sometimes a separate business-level identifier like `order_id` or `request_id`) is the single most important field in your structured logs. It is what lets you do this during an incident:
- Generate the ID at the system boundary (API gateway, load balancer) — not in application code.
- Accept an incoming `X-Request-ID` header if present (allows client-side correlation), generate one if not.
- Pass it to every downstream call — HTTP headers, message queue headers, database query comments (`/* trace_id=abc123 */`).
- Include it in every log line — not as a separate log statement, but as a field in every structured log entry.
- Use the OpenTelemetry trace ID as your correlation ID when possible — this gives you both log correlation and distributed trace visualization from the same identifier.
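The rules above can be sketched with the standard library alone. This is a framework-agnostic illustration, not a drop-in middleware: `begin_request`, `outbound_headers`, and the logger name `orders` are hypothetical, and the header lookup is case-sensitive for brevity (real HTTP header names are case-insensitive).

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with the correlation ID as a field."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", "-"),
        })


def begin_request(headers: dict) -> str:
    """Accept an incoming X-Request-ID, or generate one if it is missing."""
    request_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    correlation_id.set(request_id)
    return request_id


def outbound_headers() -> dict:
    """Headers to add to every downstream call made while handling this request."""
    return {"X-Request-ID": correlation_id.get()}


handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

begin_request({"X-Request-ID": "abc123"})
logger.info("order created")  # the JSON line carries correlation_id "abc123"
```

Using a context variable keeps the ID out of function signatures while still isolating concurrent requests from each other.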
18.3.2 The Cost of Observability
Observability is not free. Every log line, every metric time series, every trace span consumes compute, network, and storage resources. At scale, observability infrastructure can become one of the most expensive line items on your cloud bill, and one of the hardest to optimize because nobody wants to “fly blind.” The challenge is reducing cost without losing the signal you need during incidents.
Where the Money Goes
Log storage is typically the largest cost. A medium-sized microservices deployment (20-50 services) can easily generate 50-200 GB of logs per day. At Datadog’s log ingestion pricing (~5-20/day just for logs — 50,000-100,000+/month.
Metric cardinality explosion is the sneakiest cost driver. Every unique combination of label values creates a new time series. A metric like `http_request_duration{method, path, status, region, instance}` with 5 methods, 100 paths, 10 statuses, 4 regions, and 50 instances creates 5 x 100 x 10 x 4 x 50 = 1,000,000 time series. Prometheus stores ~1-3 bytes per sample. At a 15-second scrape interval, each series produces 4 samples per minute, so that is 4 million samples/minute (about 5.76 billion per day). Multiply by retention period, and you have a serious storage and query performance problem. Datadog charges per custom metric (starting around 50,000/year before you even count logs or traces.
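The cardinality arithmetic is simple enough to automate. The sketch below multiplies the number of distinct values per label to get a worst-case series count; `estimated_series`, `over_budget`, the 10,000-user figure, and the 1,000-series budget are all hypothetical choices for illustration. Real series counts are usually lower than this upper bound because not every label combination occurs.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


def over_budget(label_cardinalities: dict[str, int], max_series: int = 1000) -> bool:
    """True if the metric could exceed the (assumed) per-metric series budget."""
    return estimated_series(label_cardinalities) > max_series


labels = {"method": 5, "path": 100, "status": 10, "region": 4, "instance": 50}
baseline = estimated_series(labels)  # 5 x 100 x 10 x 4 x 50

# Hypothetical: adding a user_id label with 10,000 distinct users
# multiplies the worst case by 10,000.
with_user_id = estimated_series({**labels, "user_id": 10_000})
```

A check like `over_budget` could run in code review or CI against the label sets declared for new metrics, so the multiplication is seen before the bill is.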
Trace storage costs depend heavily on sampling. An unsampled trace pipeline for a system handling 10,000 requests/second produces about 864 million traces per day; at average trace sizes of 5-10 KB (10-20 spans each), that is roughly 4-9 TB of trace data per day. Most teams cannot afford to store every trace.
Sampling Strategies — Reducing Cost Without Losing Signal
Sampling is the primary lever for controlling observability costs, especially for traces. The key insight is that most requests are boring: they succeed, take a normal amount of time, and tell you nothing new. You need 100% coverage of the interesting requests and can afford to sample the rest.
1. Head-based sampling (decision at trace start):
The sampling decision is made when the trace begins (at the entry point) and propagated to all downstream services via trace context flags. Simple to implement (“keep 10% of traces”) but blind to outcomes. You might sample out the one request that would have revealed a bug because it looked normal at the start.
- Pro: Low overhead, simple configuration, consistent (all spans in a trace are kept or dropped together).
- Con: Cannot make decisions based on what happens during the request (errors, high latency). A 10% head sample means you miss 90% of errors.
2. Tail-based sampling (decision after the trace completes):
The sampling decision is made after the spans of a trace have been collected, so it can take outcomes such as errors and high latency into account: keep every interesting trace and sample the rest. The OTel Collector’s `tail_sampling` processor supports this natively.
- Pro: Keeps 100% of interesting traces. The best balance of cost and signal.
- Con: Requires buffering complete traces before deciding, which adds memory pressure and latency to the collector pipeline. Traces that span many services need all spans to arrive before the decision can be made; if a span from a slow service arrives after the decision window, the trace is incomplete.
3. Rule-based sampling:
Apply different sampling rates depending on request attributes, for example: `if endpoint == /api/checkout -> sample 100%`, `if user.tier == premium -> sample 50%`, `if status >= 500 -> sample 100%`, `default -> sample 5%`. This gives you surgical control over where you spend your observability budget.
4. Adaptive / dynamic sampling:
Adjust the sampling rate based on traffic volume. During normal traffic, sample 20%. During a traffic spike (Black Friday), drop to 5% to control costs. Honeycomb’s “dynamic sampling” and the OTel Collector’s probabilistic_sampler with rate limiting support this pattern.
A practical sampling configuration for most teams:
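One way to express such a policy is as a per-trace decision function. The sketch below is illustrative Python rather than collector configuration: it combines the rules listed above (errors and checkout at 100%, premium at 50%, default 5%) and adds a slow-trace rule whose 2,000 ms threshold is an assumption. `TraceSummary`, `sample_rate`, and `should_keep` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class TraceSummary:
    """The fields a tail-based sampler would know once a trace is complete."""
    endpoint: str
    status: int
    duration_ms: float
    user_tier: str = "free"


def sample_rate(trace: TraceSummary, slow_threshold_ms: float = 2000.0) -> float:
    """Return the probability of keeping this trace; first matching rule wins."""
    if trace.status >= 500:
        return 1.0   # always keep errors
    if trace.duration_ms >= slow_threshold_ms:
        return 1.0   # always keep slow traces (threshold is an assumption)
    if trace.endpoint == "/api/checkout":
        return 1.0   # business-critical path
    if trace.user_tier == "premium":
        return 0.5
    return 0.05      # everything else


def should_keep(trace: TraceSummary, rng: Callable[[], float] = random.random) -> bool:
    """Keep the trace if a uniform draw in [0, 1) falls below its sample rate."""
    return rng() < sample_rate(trace)
```

Passing the random source as a parameter keeps the decision testable; rule order matters because the first match wins.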
Controlling Metric Cardinality
The rule: Never use unbounded values as metric labels. `user_id`, `request_id`, email, IP address: none of these should be metric labels. They create millions of time series and will crash your Prometheus server or bankrupt your Datadog account.
Safe labels: method (GET/POST/PUT/DELETE), status_class (2xx/3xx/4xx/5xx), service, region, version. These have bounded, predictable cardinality.
Dangerous labels that seem safe: path — if your API has path parameters (/users/123, /users/456), each unique user ID becomes a label value. Normalize paths before labeling: /users/:id, not /users/123. error_message — every unique error string creates a new series. Use error codes or categories instead.
Monitoring your cardinality: Run prometheus_tsdb_head_series (Prometheus) or check the “Custom Metrics” count in Datadog. Set alerts when cardinality exceeds expected bounds. A sudden jump in time series count almost always means a new label with unbounded values was introduced.
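Path normalization is the piece most often missing in practice. The following is a minimal sketch that collapses numeric and UUID path segments into `:id` and reduces status codes to a class; `normalize_path` and `status_class` are hypothetical helpers, and a real service would usually take the route template from its web framework instead of guessing from the URL.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.IGNORECASE
)


def normalize_path(path: str) -> str:
    """Replace ID-like segments so /users/123 and /users/456 share one label value."""
    segments = []
    for segment in path.split("?", 1)[0].split("/"):  # drop the query string
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


def status_class(status: int) -> str:
    """Collapse an HTTP status code into a bounded label value such as '4xx'."""
    return f"{status // 100}xx"
```

With these two helpers, the `path` and status labels stay bounded no matter how many users or IDs pass through the API.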
Reducing Log Volume Without Losing Signal
- Log at the right level in production. `DEBUG` logs in production are almost never worth the cost. Set production log level to `INFO` and use dynamic log level changes (feature flags, runtime config) to temporarily enable `DEBUG` for a specific service during active investigation.
- Sample repetitive logs. If a service logs “cache miss for key X” 10,000 times/minute, you do not need all 10,000 lines. Log the first occurrence, then aggregate: “cache miss occurred 10,000 times in the last minute for key prefix `product:*`.” Libraries like Go’s `zap` support sampled logging natively.
- Use metric counters instead of log lines for high-frequency events. Instead of logging every cache hit/miss, increment `cache_hits_total` and `cache_misses_total` counters. You get the same information (hit ratio trends) at a fraction of the storage cost.
- Tiered log retention. Hot storage (7 days): full logs in Elasticsearch/Loki for interactive querying. Warm storage (30-90 days): compressed in S3, queryable via Athena. Cold storage (1+ year, if compliance requires): S3 Glacier. Automate the lifecycle with S3 lifecycle policies or your logging platform’s retention settings.
- Drop or filter known-noisy log sources. Health check logs (`GET /health` every 5 seconds from every load balancer) and Kubernetes liveness probes generate enormous volume with zero diagnostic value. Filter them at the log pipeline level (Fluentd, OTel Collector) before they reach your storage backend.
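For services that log through Python's standard `logging` module, two of these techniques can be sketched as filters. This is a simplified illustration: `SamplingFilter` counts per message template for the lifetime of the process, whereas production samplers usually reset their counters on a time window, and the class names and defaults are hypothetical.

```python
import logging
from collections import Counter


class SamplingFilter(logging.Filter):
    """Pass the first `first` records per message template, then every `thereafter`-th."""

    def __init__(self, first: int = 10, thereafter: int = 100) -> None:
        super().__init__()
        self.first = first
        self.thereafter = thereafter
        self.seen: Counter = Counter()

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted template so "cache miss for key %s"
        # is counted once regardless of the key value.
        key = (record.levelno, record.msg)
        self.seen[key] += 1
        count = self.seen[key]
        if count <= self.first:
            return True
        return (count - self.first) % self.thereafter == 0


class DropHealthChecks(logging.Filter):
    """Drop access-log lines for health check requests."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "GET /health" not in record.getMessage()
```

Attach either filter with `handler.addFilter(...)`. Filtering in the collector pipeline is still preferable where possible, because it covers every service regardless of language.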
Noisy Metrics — When More Data Makes You Dumber
Not all metric noise is obvious. Some of the most dangerous noise looks like signal: it presents patterns that appear meaningful but lead to wrong conclusions. Common noisy metric patterns:
- Correlated metrics that imply false causality. CPU spikes at the same time as latency increases. Obvious conclusion: CPU is causing the latency. Real cause: both are symptoms of a traffic spike. The CPU is a co-symptom, not the cause. If you alert on CPU and scale horizontally, you might discover the real bottleneck is a database connection pool that does not scale with more instances. The fix: Always ask “what changed that correlates with both signals?” Use Grafana annotations to overlay deployment timestamps, traffic changes, and config changes on metric charts.
- Seasonal patterns misread as anomalies. Latency increases every Monday at 9 AM. Is this a problem? Only if it deviates from the Monday 9 AM baseline. Comparing Monday 9 AM to Sunday 3 AM will always look like a spike. The fix: Use week-over-week comparison in PromQL: `http_request_duration_seconds - http_request_duration_seconds offset 7d`. Alert on deviation from the same time last week, not from the last hour.
- Metrics that measure the wrong granularity. A “p99 latency” across all endpoints hides the fact that `/api/search` has a 5-second p99 while `/api/health` has a 1ms p99. The aggregate p99 is “200ms”: technically correct, completely useless. The fix: Always segment latency metrics by endpoint (or at least by endpoint criticality tier).
- Survivor bias in success metrics. Your success rate is 99.9%. But that only counts requests that reached your server. Users whose DNS resolution fails, whose connection times out at the load balancer, or whose browser JavaScript crashes never generate a server-side metric. Your actual success rate might be 97%, but you only see the 99.9% of survivors. The fix: Combine server-side metrics with client-side RUM data and synthetic monitoring from external vantage points.
- Temporal ordering. The cause must precede the effect. If latency spiked 30 seconds before the deployment completed, the deployment is not the cause — something else is.
- Mechanism. You must be able to explain how A causes B, not just that they correlate. “CPU increased and latency increased” is correlation. “The garbage collector paused for 500ms because heap usage exceeded the G1GC threshold, and during that pause, 500 requests queued in the Netty event loop” is causality with mechanism.
- Counterfactual. Would B have happened without A? If you can roll back the suspected cause (revert a deploy, disable a feature flag) and the symptom resolves, you have strong evidence. If the symptom persists after rollback, the “cause” was correlation.
- Reproducibility. Can you reproduce the effect in a staging environment by introducing the same cause? If adding the same traffic pattern in staging causes the same latency behavior, your causal model is solid.
Log Retention Trade-offs — The Compliance vs Cost vs Debuggability Triangle
Log retention is a three-way trade-off that most teams resolve poorly because each stakeholder has a different optimization target:
- Engineering wants 90+ days of hot-tier logs for root cause analysis of slow-developing bugs
- Finance wants minimal retention because log storage is the #1 observability cost
- Compliance/Legal wants specific retention periods — sometimes mandating minimums (SOC 2: 1 year of security logs) and sometimes mandating maximums (GDPR: delete PII-containing logs within the data retention policy window)
- Classify logs by sensitivity and value before setting retention. Not all logs deserve the same retention:
| Log Category | Hot Retention | Warm Retention | Cold/Archive | Rationale |
|---|---|---|---|---|
| Security audit logs (auth, permission changes) | 30 days | 90 days | 1 year+ (compliance) | SOC 2 / regulatory requirement |
| Error and alert-triggering logs | 30 days | 60 days | None | High diagnostic value, moderate volume |
| Business event logs (orders, payments) | 14 days | 90 days | 1 year (compliance) | Financial reconciliation |
| Normal operational logs (INFO) | 7 days | 14 days | None | Low per-line value, high volume |
| Debug logs | 3 days | None | None | Only useful during active investigation |
| Health check / liveness probe logs | 0 days (drop at pipeline) | None | None | Zero diagnostic value, enormous volume |
- Automate lifecycle management. Use Elasticsearch ILM (Index Lifecycle Management), S3 Lifecycle Policies, or Loki’s retention configuration to enforce these tiers automatically. Manual retention management always drifts.
- Budget the hot tier. Set a hard GB limit for hot-tier log storage and treat it like compute capacity. When a team wants to add a new high-volume log source, they must identify something to drop or move to warm tier. This forces conscious trade-offs instead of unbounded growth.
18.4 Observability Maturity Model
Not every team needs (or can support) Level 5 observability from day one. This maturity model helps you understand where you are, where you should aim next, and what capabilities each level unlocks. Move up one level at a time; skipping levels creates fragile tooling that nobody trusts.
| Level | Name | Capabilities | What You Can Answer | Typical Team |
|---|---|---|---|---|
| 1 | Basic Health Checks | Uptime monitoring (ping/HTTP checks), basic server metrics (CPU, memory, disk), manual log file access via SSH | ”Is it up?” “Is the server running out of disk?” | Solo developer, early startup, side project |
| 2 | Metrics + Dashboards | Centralized metrics (Prometheus/CloudWatch), Grafana dashboards, basic alerting on thresholds, centralized log aggregation (ELK/Loki) | “What is the error rate?” “When did latency spike?” “Which endpoint is slowest?” | Small team, single-service architecture |
| 3 | Distributed Tracing | OpenTelemetry instrumentation, trace propagation across services, correlation IDs in logs, request waterfall visualization (Jaeger/Tempo), structured logging with high-cardinality fields | ”Where in the call chain did this request slow down?” “Which downstream service is the bottleneck?” | Team running microservices, moderate complexity |
| 4 | SLO-Based Alerting | SLI/SLO definitions for critical user journeys, error budget tracking and burn-rate alerts, symptom-based (not cause-based) alerting, automated runbooks linked to every alert, weekly error budget reviews | ”Are we meeting our reliability targets?” “How much risk budget do we have left for feature launches?” “Should we freeze deploys or keep shipping?” | Platform/SRE team, multiple services, business-critical systems |
| 5 | AIOps + Anomaly Detection | ML-based anomaly detection on metrics and logs, automated root cause correlation (e.g., Datadog Watchdog, Honeycomb BubbleUp), predictive alerting (forecast budget exhaustion before it happens), chaos engineering integrated with observability (verify detection capabilities), continuous profiling (CPU/memory flame graphs in production) | “What changed across all signals right before this incident?” “Which combination of dimensions explains the anomaly?” “Will we breach our SLO next Tuesday at current burn rate?” | Large-scale platform team, hundreds of services, strong data engineering culture |
- Assess honestly. Most teams overestimate their maturity. If your traces exist but nobody uses them during incidents, you are not at Level 3 — you are at Level 2 with unused tooling.
- Move up one level at a time. Jumping from Level 1 to Level 4 means you have SLO-based alerts but no dashboards to investigate when they fire. Each level builds on the one below it.
- The biggest ROI jump is from Level 2 to Level 3 — adding distributed tracing transforms your debugging speed in microservice architectures. This is where most teams should invest next.
- Level 5 is not a goal for most teams. AIOps and anomaly detection require significant data volume and engineering investment. Pursue it only when Levels 1-4 are solid and you have hundreds of services generating enough signal for ML models to be useful.
18.4.1 Business-Level Observability — Connecting Technical Signals to Revenue
Technical metrics tell you the system is healthy. Business metrics tell you the product is healthy. These are not the same thing, and the gap between them is where the most damaging production incidents hide.
Why technical metrics alone are insufficient: A service returning 200 OK in 50ms can still be silently dropping orders, charging the wrong amount, showing stale inventory, or serving the wrong experiment variant. HTTP status codes measure transport success, not business correctness. The most dangerous production incidents are the ones where every dashboard is green but the product is broken.
Essential business-level metrics to instrument:
| Business Metric | What It Catches | Technical Metrics That Miss It |
|---|---|---|
| `orders_completed_total` by payment method, region, platform | A specific payment method or region silently failing | HTTP error rate (the endpoint returns 200 with a “please try again” message) |
| `revenue_per_minute` | Revenue drop during a “healthy” deploy | Latency and error rate (both normal) |
| `checkout_funnel_completion_rate` | Frontend JS error preventing checkout | Server-side metrics (request never reaches the server) |
| `signup_completion_rate` | Broken OAuth flow or email verification | API latency (the API call succeeds, the downstream email service silently fails) |
| `search_zero_results_rate` | Broken search index or stale search cache | Search API latency (fast response with zero results is still a 200) |
How to instrument: emit these metrics at the point where the business event is confirmed. When an order is completed, increment `orders_completed_total`. When payment confirmation is received from the gateway, increment `payments_confirmed_total`. These counters live in your application code alongside the business logic, not in middleware.
The “green dashboards, broken product” anti-pattern in depth:
This anti-pattern deserves its own mental model because it is the most dangerous failure mode in observability. Every metric is green — latency is low, error rate is zero, throughput is stable — but the business is silently hemorrhaging. Real examples:
- Silent data corruption. A serialization bug causes 2% of orders to save with the wrong quantity. The API returns 200, the database writes succeed, the logs look clean. But the warehouse ships 1 item when the customer ordered 3. This only surfaces when customers complain days later.
- Experiment serving wrong variant. The A/B testing framework has a cache bug and serves the control variant to 100% of users instead of the intended 50/50 split. All technical metrics are healthy. But the experiment data is garbage, and product decisions made on it will be wrong.
- Downstream partner failure. Your payment service successfully sends the charge request and gets a 200 response from the payment gateway. But the gateway has an internal issue where charges are accepted but not settled. Your `payments_sent_total` counter is perfect. Your actual revenue is zero.
- Correct response, wrong data. A stale cache serves yesterday’s product recommendations. The recommendation API responds in 5ms with a valid JSON payload. The dashboard shows excellent latency. But every user sees the same stale recommendations, and click-through rate drops 40%.
- Instrument business outcomes, not just technical operations. For every critical flow, define the end-to-end success metric that only increments when the business outcome is confirmed: `orders_fulfilled_total` (not just `orders_placed_total`), `emails_delivered_total` (not just `emails_sent_total`), `payments_settled_total` (not just `payments_charged_total`).
- Create a “business vitals” dashboard that lives alongside the RED dashboard. The RED dashboard answers “is the system healthy?” The business vitals dashboard answers “is the product working?” Both should be visible during incident triage.
- Alert on business metric divergence. If `orders_placed_total` is increasing but `orders_fulfilled_total` is flat, something is broken in the fulfillment pipeline, even if every service in the pipeline reports healthy.
- Implement end-to-end synthetic transactions. Run a synthetic monitoring job that completes a real transaction (place an order, verify fulfillment, confirm email delivery) every 5 minutes. This is the ultimate “is the product working?” check because it exercises the full business flow, not just individual service health.
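The divergence check in particular reduces to comparing how much two counters grew over the same window. A minimal sketch, assuming you can read the increase of `orders_placed_total` and `orders_fulfilled_total` over a recent interval; the function names, the 0.9 ratio, and the 20-order minimum volume are hypothetical and would need tuning to your fulfillment delay and traffic.

```python
def fulfillment_ratio(placed_delta: int, fulfilled_delta: int) -> float:
    """Fraction of orders placed in the window that were also fulfilled."""
    if placed_delta == 0:
        return 1.0  # nothing placed, so nothing can be missing
    return fulfilled_delta / placed_delta


def diverged(
    placed_delta: int,
    fulfilled_delta: int,
    min_ratio: float = 0.9,
    min_volume: int = 20,
) -> bool:
    """True when enough orders were placed and too few of them were fulfilled."""
    if placed_delta < min_volume:
        return False  # too little traffic to judge; avoids noisy alerts at night
    return fulfillment_ratio(placed_delta, fulfilled_delta) < min_ratio
```

The minimum-volume guard matters: without it, a single unfulfilled order during a quiet hour would look like a 100% failure rate.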
18.4.2 Frontend Observability and Real User Monitoring (RUM)
Server-side observability has a blind spot: everything that happens in the user’s browser before, during, and after the server interaction. A JavaScript error that prevents the “Buy” button from working generates zero server-side signal: no HTTP request, no error log, no trace. The server thinks everything is fine.
What RUM captures that server-side observability cannot:
- JavaScript errors and unhandled promise rejections: the #1 cause of “invisible” outages where dashboards are green but users cannot interact with the product.
- Core Web Vitals (Largest Contentful Paint, Interaction to Next Paint, which replaced First Input Delay in March 2024, and Cumulative Layout Shift): these directly measure the user’s perceived performance. A 50ms API response means nothing if the page takes 4 seconds to become interactive because of render-blocking JavaScript.
- Client-side navigation timing — how long the full page load takes from the user’s perspective, including DNS resolution, TLS handshake, resource loading, and JavaScript execution. This is the real latency, not the API latency.
- User session context — which browser, OS, screen size, network type (4G vs WiFi), and geographic location. A bug that only affects Safari on iOS 16 is invisible in server-side metrics.
- Rage clicks and dead clicks — a user clicking the same button 5 times in 3 seconds is a signal that the UI is unresponsive or the click handler is broken.
Connecting frontend to backend: have the browser send a `traceparent` header with each API request so the backend trace is connected to the frontend session. RUM tools like Datadog RUM, Sentry, and New Relic Browser can propagate trace context automatically. This lets you do: “User X reported a problem at 3:15 PM -> find their RUM session -> see the JS error -> click through to the backend trace -> see the database timeout that caused the API to return an error that the frontend handled poorly.”
Tools: Datadog RUM, Sentry (error tracking + session replay), New Relic Browser, LogRocket (session replay), Vercel Analytics (for Next.js), Google Lighthouse / PageSpeed Insights (synthetic testing).
The RUM observability maturity ladder:
| Maturity Level | What You Capture | What You Can Answer |
|---|---|---|
| Level 1: Error tracking | JavaScript errors, unhandled rejections, network failures | ”Are there JS errors affecting users right now?” |
| Level 2: Performance monitoring | Core Web Vitals (LCP, INP, CLS), page load timing, resource waterfall | “How fast is the real user experience?” “Which pages are slowest?” |
| Level 3: Session context | User session recordings, rage clicks, dead clicks, user flow analysis | ”What was the user doing when they encountered the error?” “Where do users get stuck?” |
| Level 4: Full-stack correlation | Frontend sessions linked to backend traces via traceparent propagation | ”User X reported a problem -> frontend session -> JS error -> backend trace -> root cause” |
| Level 5: Business impact quantification | RUM data correlated with business metrics (conversion, revenue, churn) | “This 500ms LCP regression on the product page caused a 3% drop in add-to-cart rate” |
18.4.3 PII-Safe Observability
Observability data is only useful if you can actually store and query it. In regulated industries (fintech, healthcare, any company with EU users), storing PII in logs, traces, or metrics can create compliance violations that are far more expensive than the incidents the data helps you debug.
What counts as PII in observability data (often missed):
- Obvious: email addresses, names, phone numbers, SSNs, credit card numbers in log messages
- Less obvious: IP addresses (PII under GDPR), user-agent strings (can be fingerprinted), session tokens in URL parameters logged by web servers, full request/response bodies captured by auto-instrumentation
- Subtle: high-cardinality IDs that can be reverse-mapped to users (even if the ID itself is opaque, if your database maps `user_id` to a name, the ID is effectively PII in the hands of anyone with database access)
- Redact at the source. Use structured logging libraries with field-level redaction: with Go’s `zap` you can write `zap.String("email", redact(email))` (where `redact` is your own helper), and Python’s `structlog` supports processor-based field masking. Do not rely on downstream pipeline scrubbing alone: once PII reaches a log aggregator, you have already failed the compliance test.
- Use pseudonymized identifiers for tracing. Instead of logging `user_id=usr_789`, log a one-way hash: `user_hash=sha256(usr_789 + salt)`. You can still correlate all events for the same user (the hash is deterministic), but you cannot reverse it to the real user ID without access to the salt (which is stored separately with access controls).
- OTel Collector as a PII scrubbing layer. The OTel Collector’s `attributes` processor can delete or hash specific span attributes before export. Configure rules like: delete attributes matching `key=user.email`, hash attributes matching `key=user.id`. This adds a centralized backstop in one pipeline stage instead of relying on every service to get it right.
- Separate high-sensitivity and low-sensitivity telemetry. Route logs containing potential PII (authentication events, user profile changes, payment processing) to a restricted Elasticsearch index or Loki tenant with stricter access controls and shorter retention. Route infrastructure logs (health checks, cache metrics, deployment events) to a general-access store. This minimizes the blast radius of a log access breach.
- Retention as a compliance tool. GDPR requires the ability to delete a user’s data on request. If their `user_id` appears in 90 days of logs across 3 observability backends, “right to erasure” becomes an engineering nightmare. Shorter retention (7-14 days for PII-containing logs) and pseudonymization at ingest dramatically simplify compliance.
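The first two practices can be sketched together. Note one deliberate substitution: instead of the plain `sha256(id + salt)` concatenation described above, this sketch uses HMAC-SHA256 with a secret key, which serves the same purpose (deterministic, not reversible without the secret) and is the standard construction for keyed hashing. The field names in `SENSITIVE_FIELDS` and the functions `pseudonymize` and `scrub_event` are hypothetical.

```python
import hashlib
import hmac

# Fields to drop entirely before a log event leaves the process (assumed list).
SENSITIVE_FIELDS = {"email", "phone", "ssn"}


def pseudonymize(user_id: str, secret: bytes) -> str:
    """Deterministic keyed hash: same user gives the same token, useless without the secret."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_event(event: dict, secret: bytes) -> dict:
    """Return a copy of a structured log event that is safe to ship."""
    clean = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"
        elif key == "user_id":
            clean["user_hash"] = pseudonymize(str(value), secret)
        else:
            clean[key] = value
    return clean
```

In use, the secret would come from a secrets manager rather than source code, and rotating it breaks correlation across the rotation boundary, which is sometimes exactly what a retention policy wants.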
- Request/response body logging. Does your HTTP middleware log full request or response bodies? This is the #1 source of accidental PII in logs. Default Express.js `morgan('combined')` logs URLs (which may contain tokens or query params). Debug tooling in frameworks such as Django can expose request data as well. Audit every logging middleware.
- Auto-instrumentation span attributes. OTel auto-instrumentation for HTTP clients captures URL parameters by default. A URL like `/users/john.doe@email.com/profile` puts an email address in every span. Configure URL scrubbing in your OTel SDK: replace path parameters with `{id}` patterns.
- Error message capture. Exceptions often contain user data in the message: `"User john@example.com not found"`, `"Invalid phone number: 555-1234"`. Configure your error tracking tool (Sentry, Datadog) to scrub error messages, not just stack traces.
- Log aggregator search indexes. Even if you redact PII before shipping logs, check whether your log aggregator creates searchable indexes on fields that contained PII before you added redaction. Old indexed data may still be queryable.
- Trace context in downstream services. If Service A logs a redacted `user_hash` but Service B logs the original `user_email` in the same trace, anyone querying traces can correlate the hash to the email. PII safety must be consistent across the entire trace, not just per-service.
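Two of the items above (URL paths and error messages) come down to pattern-based scrubbing. The sketch below handles only email addresses, with a deliberately simple regular expression; `scrub_message` and `scrub_path` are hypothetical helpers, and phone numbers, tokens, and other identifiers would each need their own rule.

```python
import re

# A pragmatic email pattern, not a full RFC 5322 matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_message(message: str) -> str:
    """Replace email addresses embedded in an error or log message."""
    return EMAIL.sub("[email]", message)


def scrub_path(path: str) -> str:
    """Replace any path segment that looks like an email address with {id}."""
    return "/".join(
        "{id}" if EMAIL.fullmatch(segment) else segment
        for segment in path.split("/")
    )
```

Pattern-based scrubbing is a safety net, not a guarantee: it only catches the shapes it knows about, which is why redaction at the source remains the primary control.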
18.4.4 Telemetry Cost Discipline
Observability costs are one of the fastest-growing line items on cloud bills, and they have a unique political problem: nobody wants to be the person who reduced observability and then missed an incident. This creates a ratchet effect where telemetry only grows, never shrinks.
The cost discipline framework:
- Treat telemetry like infrastructure capacity. Set a monthly telemetry budget (GB of logs, number of metric time series, GB of traces) and review it in the same meeting where you review compute costs. Make it visible.
- Assign cost to teams. If Team A generates 60% of your log volume, they should see that number. Datadog and Grafana Cloud both support per-team cost attribution. When a team sees that their debug logging costs $3,000/month, they self-optimize faster than any top-down policy.
- Implement graduated retention. Not all data needs the same retention:
  - Error and alert-triggering data: 90 days in hot storage
  - Normal operational data: 14 days hot, 30 days warm (S3 + Athena)
  - Debug-level data: 3 days hot, then deleted
  - Compliance-required data: warm/cold storage per regulation, never hot
- Use the “cost per incident” metric. Total monthly observability spend / number of incidents where observability data was used in diagnosis = cost per incident. If each incident “costs” around $3,000 in observability, is that worth it? Almost certainly. A $25,000-per-incident cost deserves scrutiny.
- Audit quarterly. Every quarter, identify: metrics with zero dashboard/alert references, log sources with zero queries in 90 days, and trace endpoints with zero investigation clicks. Delete or reduce them. This is the observability equivalent of deleting dead code.
- “We are not reducing observability. We are making it more efficient.” The analogy: buying a smaller fire extinguisher is dangerous. Replacing an always-running firehose with a sprinkler system that activates on smoke detection is smart. Tail-based sampling, tiered retention, and cardinality controls are the sprinkler system.
- Cost per useful query. Not all telemetry is equally valuable. Debug-level logs that are never queried cost the same to store as error logs that are queried daily during incidents. Track: `monthly cost / number of queries against the data`. Data that is stored but never queried is waste, full stop.
- The “cardinality explosion” tax. A single high-cardinality label (e.g., `user_id` on a Prometheus metric) can create millions of time series and 10x your metrics storage bill. The most common culprits: request URLs with path parameters (`/users/12345` creates a unique series per user), error messages used as labels, and unbounded queue or topic names. A CI check that flags new metrics with >1000 estimated cardinality saves more money than any other cost optimization.
| Signal | Typical Volume (100-service org) | Monthly Cost Range | Cost Driver |
|---|---|---|---|
| Logs | 5-50 TB/month | 25,000 | Volume (GB ingested + stored) |
| Metrics | 500K-5M active time series | 15,000 | Cardinality (number of unique series) |
| Traces | 1-10 billion spans/month | 20,000 | Volume (spans ingested) + retention |
| RUM sessions | 1M-50M sessions/month | 10,000 | Session volume + replay storage |
18.5 Alerting
Symptom-Based vs Cause-Based Alerts
Cause-based alert: “CPU usage > 80%.” This tells you a technical fact but not whether users are affected. CPU at 85% might be perfectly fine if latency and error rates are normal.
Symptom-based alert: “Error rate > 5% for 5 minutes.” This tells you users are actually experiencing problems, regardless of the underlying cause.
Alert Fatigue
Alert fatigue is one of the most dangerous operational problems: when teams receive too many alerts, they start ignoring all of them, including the critical ones. Signs of alert fatigue:
- More than 5-10 actionable alerts per on-call shift per week
- Alerts that are routinely acknowledged and ignored
- “Flappy” alerts that fire and resolve repeatedly
- Alerts that have no runbook or clear remediation steps
- Every alert must be actionable. If the on-call person cannot take a specific action in response, delete the alert. Move it to a dashboard.
- Every alert must have a runbook link. The runbook describes: what this alert means, what to check first, how to mitigate, and when to escalate.
- Tune aggressively. Review alert noise monthly. Raise thresholds, increase evaluation windows, consolidate related alerts.
- Use severity levels. Page (wake someone up) only for P1/P2 — user-facing impact. P3/P4 go to a queue for next business day.
- Suppress during known events. Deployments, maintenance windows, and expected batch jobs should suppress related alerts.
SLI/SLO-Based Alerting and Burn Rate
The most sophisticated approach to alerting ties directly to your Service Level Objectives (SLOs).
SLI (Service Level Indicator): A quantitative measure of a specific aspect of service quality. Example: “The proportion of HTTP requests that return a 2xx status in under 500ms.”
SLO (Service Level Objective): A target for an SLI over a time window. Example: “99.9% of requests succeed in under 500ms over a rolling 30-day window.”
Error budget: The complement of the SLO. A 99.9% SLO means you have a 0.1% error budget: you can “afford” about 43 minutes of downtime per 30 days (0.1% of 43,200 minutes).
Burn rate alerts: Instead of alerting on instantaneous error rate spikes, alert when you are consuming your error budget faster than expected.
- 1x burn rate: You are burning the error budget at exactly the expected rate. You will exhaust it at the end of the window. No alert needed.
- 14.4x burn rate for 5 minutes: You are burning the error budget 14.4x faster than allowed. At this rate, the entire 30-day budget will be consumed in ~2 days. This is a high-severity page.
- 6x burn rate for 30 minutes: Burning 6x faster than allowed. Budget exhausted in ~5 days. Medium-severity alert.
- 1x burn rate sustained for 6 hours: You are consuming the budget at the full allowed rate with no margin left for any further incident in the window. Low-severity notification for next business day.
Burn-rate alerts have several advantages over static threshold alerts:
- They tolerate brief spikes (a 30-second blip does not page anyone).
- They catch slow degradations that threshold-based alerts miss.
- They are directly tied to user impact (the SLO).
- They give you a time-to-exhaustion estimate so you can prioritize appropriately.
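The arithmetic behind these numbers is small enough to check by hand. The sketch below is illustrative only: the `Slo` class and its method names are hypothetical, and it assumes a constant burn rate and a full, unspent budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float       # e.g. 0.999 for a 99.9% SLO
    window_days: float  # rolling window length, e.g. 30

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the window."""
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        """Error budget expressed as minutes of total downtime."""
        return self.error_budget * self.window_days * 24 * 60

    def burn_rate(self, observed_error_ratio: float) -> float:
        """How many times faster than allowed the budget is being spent."""
        return observed_error_ratio / self.error_budget

    def days_to_exhaustion(self, burn_rate: float) -> float:
        """Days until a full budget is gone at a constant burn rate."""
        return self.window_days / burn_rate


slo = Slo(target=0.999, window_days=30)
print(round(slo.budget_minutes(), 1))          # 43.2 minutes per 30 days
print(round(slo.burn_rate(0.0144), 1))         # 1.44% errors is a 14.4x burn
print(round(slo.days_to_exhaustion(14.4), 2))  # 2.08 days
print(round(slo.days_to_exhaustion(6), 2))     # 5.0 days
```

A real alert would evaluate the observed error ratio over the stated window (5 minutes, 30 minutes, 6 hours) in the metrics backend; this only shows how the thresholds relate to the budget.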
18.6 The Observability Day-1 Checklist
You just deployed a new service. Here is what to instrument before you call it production-ready:
- Structured logging middleware: Every inbound request logs method, path, status, duration_ms, trace_id, user_id
- Request metrics middleware: Emit http_request_duration_seconds and http_requests_total with method, path, status labels
- OpenTelemetry auto-instrumentation: Install the OTel SDK for your framework (Node.js: @opentelemetry/auto-instrumentations-node, Python: opentelemetry-instrumentation, Go: go.opentelemetry.io/contrib)
- Spans around every outbound call: Database queries, Redis calls, HTTP calls to other services, message publishes. Each gets a span with the operation name and duration
- RED dashboard: Request rate (req/sec), Error rate (%), Latency (p50/p95/p99). One row per service. One row per critical endpoint.
- Three baseline alerts: Error rate > 5% for 5 minutes, p99 latency > 2x your baseline for 10 minutes, health check down for 2 minutes
- Health endpoints: GET /health (liveness: is the process running? Keep it simple) and GET /ready (readiness: can this instance handle requests? Check DB connectivity, cache availability)
- Correlation ID propagation: Accept X-Request-ID header from upstream, generate one if missing, pass it to all downstream calls, include it in every log line
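The first and last checklist items can be combined in one small wrapper. This is a framework-agnostic sketch using only the standard library: `with_request_logging` and its `handler` and `emit` parameters are hypothetical names, header lookup is case-sensitive for brevity, and `trace_id` and `user_id` are left out because they would come from the tracing SDK and the auth layer.

```python
import json
import time
import uuid
from typing import Callable, Dict, Tuple

REQUEST_ID_HEADER = "X-Request-ID"


def with_request_logging(
    method: str,
    path: str,
    headers: Dict[str, str],
    handler: Callable[[Dict[str, str]], int],
    emit: Callable[[str], None] = print,
) -> Tuple[int, str]:
    """Run `handler`, emit one structured log line, return (status, request_id)."""
    # Accept the upstream ID if present, otherwise generate one.
    request_id = headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    # Headers the handler should attach to every downstream call.
    downstream_headers = {REQUEST_ID_HEADER: request_id}

    start = time.perf_counter()
    status = 500  # logged as-is if the handler raises
    try:
        status = handler(downstream_headers)
        return status, request_id
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 2)
        emit(json.dumps({
            "method": method,
            "path": path,
            "status": status,
            "duration_ms": duration_ms,
            "request_id": request_id,
        }))


log_lines = []
status, request_id = with_request_logging(
    "GET", "/orders", {"X-Request-ID": "abc-123"},
    handler=lambda downstream: 200,
    emit=log_lines.append,
)
```

In a real service the same logic would live in the framework's middleware hook, and `emit` would be the shared logging library rather than a list.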
Interview Questions — Observability
You are on-call and get paged at 3 AM for high error rates. Walk through your incident response.
- Google SRE Book — “Postmortem Culture: Learning from Failure” (sre.google/sre-book) — the foundational text on blameless postmortems.
- PagerDuty Incident Response Guide (response.pagerduty.com) — free, practical guidance on roles (Incident Commander, Scribe, SME) during major incidents.
- Cloudflare Blog — “The October 30, 2020 outage” (blog.cloudflare.com) — exemplary public postmortem showing MTTM optimization.
What is the difference between monitoring and observability? When do you need each?
status_code is low cardinality (~10 values). user_id is high cardinality (millions). Use this term when explaining why certain debugging questions are impossible with traditional metrics: Prometheus cannot tag metrics with user_id without time-series explosion.
- Charity Majors — “Observability — A Manifesto” (charity.wtf) — the foundational argument for event-based observability.
- Cindy Sridharan — “Distributed Systems Observability” (O’Reilly, free download) — concise primer on the three pillars.
- opentelemetry.io/docs — “Concepts” — explains the OTel data model and why high-cardinality attributes matter.
Your team gets paged 15 times per week. Most alerts do not require action. How do you fix this?
- Audit every alert over the past 30 days. Categorize each as: actionable (required human intervention), noise (auto-resolved or no action needed), or duplicate.
- Delete or demote noise alerts. If an alert fires and resolves within 2 minutes, it should not page — make it a dashboard metric or a low-severity notification.
- Raise thresholds and extend evaluation windows. “Error rate > 1% for 1 minute” is too sensitive. Try “Error rate > 5% for 5 minutes.”
- Consolidate related alerts. Five alerts about the same downstream dependency failure should be one alert.
- Transition to SLO-based burn-rate alerts where possible — these naturally tolerate brief spikes while catching sustained degradation.
- Require a runbook for every remaining alert. If you cannot write a runbook, the alert is not well-defined enough to keep.
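The audit in the first two steps can start as a short script over exported alert history. The sketch below assumes a hypothetical `AlertEvent` record with an `action_taken` flag filled in by whoever handled the page; duplicate detection is left out because it depends on how your alerts name their dependencies.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class AlertEvent:
    name: str
    fired_at: datetime
    resolved_at: datetime
    action_taken: bool  # did a human have to intervene?


def classify(event: AlertEvent, quick: timedelta = timedelta(minutes=2)) -> str:
    """Label one alert firing as actionable or as one of two kinds of noise."""
    if event.action_taken:
        return "actionable"
    if event.resolved_at - event.fired_at <= quick:
        return "noise: auto-resolved"
    return "noise: no action needed"


def audit(events: Iterable[AlertEvent]) -> Dict[str, Counter]:
    """Count classifications per alert name over the audit period."""
    summary: Dict[str, Counter] = {}
    for event in events:
        summary.setdefault(event.name, Counter())[classify(event)] += 1
    return summary


t0 = datetime(2024, 1, 1, 3, 0)
history = [
    AlertEvent("HighCPU", t0, t0 + timedelta(seconds=90), False),
    AlertEvent("HighCPU", t0, t0 + timedelta(minutes=20), False),
    AlertEvent("CheckoutErrors", t0, t0 + timedelta(minutes=30), True),
]
summary = audit(history)
# Alerts that never needed a human are candidates for deletion or demotion.
demote = [name for name, counts in summary.items() if counts["actionable"] == 0]
```

Here `demote` ends up containing only `HighCPU`, which fired twice and needed no action either time.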
- Google SRE Workbook — “Alerting on SLOs” (sre.google/workbook) — the canonical reference for burn-rate alerting.
- PagerDuty — “On-Call: The Complete Guide” (pagerduty.com/resources) — practical frameworks for sustainable rotations.
- Charity Majors — “The Alerting Manifesto” (charity.wtf) — philosophical argument for symptom-based alerting and alert culture reform.
Explain SLI, SLO, and error budgets. How would you use them to make engineering decisions?
- Google SRE Book — Chapter 4 “Service Level Objectives” (sre.google/sre-book) — the canonical definition of SLIs, SLOs, and error budgets.
- Google SRE Workbook — “Alerting on SLOs” (sre.google/workbook) — multi-burn-rate alerting with worked examples.
- Alex Hidalgo — “Implementing Service Level Objectives” (O’Reilly) — a practical book-length treatment of SLO adoption.
- Nobl9 blog (nobl9.com/blog) — applied SLO engineering patterns from a platform that specializes in SLO management.
You are joining a team that has zero observability on a critical service handling $2M/month in transactions. You have one week. What do you instrument first?
- Structured logging middleware. Every inbound request logs: method, path, status, duration_ms, trace_id. This takes 2-4 hours with a shared logging library. Without this, every future investigation is grep by timestamp and prayer.
- Health endpoint: GET /health (liveness) and GET /ready (readiness with DB/cache connectivity checks). 1 hour of work. Wire it into your uptime monitor (even a free Pingdom or UptimeRobot account).
- Three alerts: Error rate > 5% for 5 minutes, p99 latency > 2x current baseline for 10 minutes, health check down for 2 minutes. This catches the catastrophic failures.
- RED metrics: request rate, error rate, duration histogram. One Grafana dashboard with 6 panels. This is the dashboard I stare at when the alert fires.
- Business metric: transactions_completed_total by payment method and status. For a $2M/month service, “transactions per minute” dropping to zero is the most important signal. This catches the “green dashboards, broken product” scenario where HTTP metrics look fine but the business outcome is wrong.
- OpenTelemetry auto-instrumentation. Install the SDK, enable auto-instrumentation, export to a free Jaeger instance. This gives me distributed tracing with zero custom code.
- Span annotations on critical operations — database queries, cache calls, external API calls. Auto-instrumentation captures HTTP spans; I need manual spans for the business logic inside.
- Dependency health dashboard. For each external dependency (payment gateway, database, cache), show latency and error rate. When my service degrades, I need to know if it is my code or their service.
- “The CFO asks why you spent a week on observability instead of features. How do you justify it?” — “This service processes $2M/month, which is roughly $2,700 per hour. Every hour of undetected downtime is about $2,700 in lost transactions. Without observability, we would not even know it happened until customers called. The week of instrumentation work pays for itself the first time we detect an issue in 5 minutes instead of 45.”
- “What changes after you have 30 days of baseline data?” — Define SLOs based on actual performance, implement burn-rate alerting, add tail-based sampling if trace volume is high, and build the business-specific dashboards the team actually needs.
How would you use AI assistants (Copilot, Claude, etc.) for observability work -- writing queries, building dashboards, or diagnosing incidents? Where does AI help and where is it dangerous?
Where AI helps:
- PromQL / LogQL / NRQL query generation. “Write a PromQL query showing the error rate by endpoint for the last hour, excluding health checks, as a percentage” — AI handles query syntax far better than most engineers who write PromQL twice a month. This alone saves significant time during incidents when you are stress-typing queries.
- Alert rule generation. “Generate a Prometheus alerting rule for burn-rate alerting at 14.4x over 5 minutes and 6x over 30 minutes for a 99.9% SLO” — the math is well-documented and AI generates correct rules consistently.
- Dashboard JSON generation. Grafana dashboards are tedious JSON configurations. Describing what you want in natural language and letting AI generate the panel configuration is a legitimate productivity multiplier.
- Postmortem drafting. Feed the incident timeline (alert fired at T, root cause identified at T+15, mitigated at T+20) to an AI and ask it to draft a structured postmortem with timeline, root cause, action items, and prevention measures. You edit and refine, but the structure is solid.
- Log pattern analysis. “Here are 50 error log lines from the last 10 minutes. Identify the common patterns and group them.” AI is excellent at pattern extraction from semi-structured text.
Where AI is dangerous:
- Root cause diagnosis during a live incident. An AI given symptoms (“p99 latency spiked, CPU is high, error rate increased”) will confidently suggest plausible causes. But it cannot see your deployment timeline, does not know that a config change went out 10 minutes ago, and has no access to your actual traces. The risk is that the AI’s confident-sounding suggestion sends you investigating the wrong thing during a time-critical incident.
- Sampling rule design. “What sampling rate should I use?” depends on your traffic volume, budget, incident frequency, and SLO requirements — context AI cannot infer. It will give you a “reasonable default” that might sample out the exact data you need.
- PII assessment. “Is this log field safe to store?” requires understanding your regulatory context (GDPR, HIPAA, CCPA), your data processing agreements, and whether an opaque ID can be reversed. AI will guess; compliance requires certainty.
You are getting paged at 3 AM. The dashboard shows high latency but no errors. How do you diagnose this without clear error signals?
- Brendan Gregg — “USE Method for Performance Analysis” (brendangregg.com) — the canonical methodology.
- Google SRE Book — “The Four Golden Signals” (Chapter 6) — complements USE with a service-oriented view.
- Honeycomb blog (honeycomb.io/blog) — “BubbleUp” posts showing differential analysis in action for incident investigation.
- opentelemetry.io/docs — “Tracing best practices” — how to instrument for investigations like this.
Further Reading
- Observability Engineering by Charity Majors, Liz Fong-Jones, George Miranda — the definitive guide to modern observability practices.
- Distributed Systems Observability by Cindy Sridharan — free, concise guide focused on the three pillars. Sridharan’s writing is unusually clear for a technical book, and at ~100 pages it is the best time-to-value ratio of any observability resource.
- Practical Monitoring by Mike Julian — hands-on guide to building effective monitoring for real systems.
- Site Reliability Engineering (Google SRE Book) — chapters on monitoring, alerting, and SLOs are essential reading. Chapter 6 (“Monitoring Distributed Systems”) lays out the principles of symptom-based alerting and the four golden signals. Chapter 4 (“Service Level Objectives”) is the authoritative reference for SLI/SLO definitions and error budget mechanics.
- Google SRE Book — Chapter 11: Being On-Call — practical guidance on alerting philosophy, on-call load management, and the principle that alerts should be actionable, symptom-based, and tied to user impact. Pairs directly with the SLO-based alerting concepts covered in this chapter.
- The SRE Workbook — Alerting on SLOs — the definitive reference for burn-rate alerting. Walks through multi-window, multi-burn-rate alert configurations with worked examples — this is the document that popularized the 14.4x/6x/1x burn-rate approach described in Section 18.5 above.
- Prometheus Official Documentation — the authoritative reference for Prometheus architecture, metric types, instrumentation, service discovery, and alerting rules. Start with “Getting Started” for a hands-on walkthrough, then move to “Data Model” and “Metric Types” to understand counters, gauges, histograms, and summaries — the foundation of everything in Section 18.2.
- Prometheus PromQL Tutorial — PromQL is the query language that powers Prometheus alerting rules and Grafana dashboards. This official guide covers selectors, functions, aggregations, and the rate() vs irate() distinction that trips up most beginners. The “Querying Examples” page is especially useful for building the RED dashboard described in Section 18.2.
- Grafana Official Documentation — comprehensive guide to building dashboards, configuring data sources, creating alert rules, and managing organizations. The “Best practices for creating dashboards” section is required reading before building the RED dashboards recommended in this chapter: it covers panel layout, variable templating, and annotation strategies that separate useful dashboards from noisy ones.
- Grafana Labs Blog — Prometheus and Loki — deep technical content on running Prometheus at scale, LogQL query patterns for Loki, and Grafana dashboard best practices. Particularly useful if you are building a self-hosted observability stack. The “Prometheus at scale” series covers federation, Thanos, and Mimir for long-term metrics storage.
- OpenTelemetry Documentation — getting started guides for every major language. The “Getting Started” guides for Node.js, Python, Go, and Java walk you through auto-instrumentation in under 30 minutes. The “Collector” documentation explains how to deploy the OTel Collector as a pipeline between your applications and your observability backends.
- OpenTelemetry Concepts Guide — covers the OTel data model (spans, traces, metrics, logs), context propagation, sampling strategies, and the relationship between the API, SDK, and Collector. If you are implementing the Day-1 checklist from Section 18.6, start here to understand what you are instrumenting and why.
- Jaeger Documentation — the official guide for Jaeger, the open-source distributed tracing platform originally built by Uber. Covers architecture (agent, collector, query, storage backends), deployment patterns, sampling strategies, and the trace UI. The “Architecture” and “Getting Started” pages provide the quickest path to running distributed tracing locally and understanding trace propagation.
- Zipkin Documentation — the original open-source distributed tracing system, inspired by Google’s Dapper paper. Zipkin’s documentation covers its data model, instrumentation libraries (Brave for Java, zipkin-js for Node.js), and storage backends. Useful as a lighter-weight alternative to Jaeger, especially for teams already running Spring Boot (which has native Zipkin integration via Spring Cloud Sleuth / Micrometer Tracing).
- Elastic (ELK Stack) Documentation — the official reference for Elasticsearch (search and analytics), Logstash (log pipeline), and Kibana (visualization). For log-based observability, the Kibana Discover and Dashboard guides explain how to build log exploration views, create visualizations from structured log fields, and set up index patterns — the core skills for investigating incidents using centralized logs.
- PagerDuty Incident Response and Alerting Best Practices — PagerDuty’s freely available guide covers alert routing, escalation policies, on-call scheduling, incident severity classification, and strategies for reducing alert fatigue. Directly applicable to the alerting best practices in Section 18.5 — especially the guidance on making every alert actionable and requiring runbooks.
- Datadog Structured Logging Guide — a practical walkthrough of why structured logging (JSON with consistent fields) outperforms unstructured text logs for production debugging. Covers log parsing, attribute naming conventions, log pipelines, and correlation with traces and metrics. Useful context for understanding why the structured log format shown in Section 18.1 is the industry standard.
- Charity Majors’ Blog (charity.wtf) — Honeycomb’s co-founder writes some of the sharpest thinking on observability, on-call culture, and engineering management. Start with “Observability — A Manifesto” and “Logs vs Structured Events” for the foundational arguments on why high-cardinality structured events are superior to traditional logging and metrics.
- Ben Sigelman on Distributed Tracing — Sigelman co-created Dapper (Google’s internal distributed tracing system) and co-founded LightStep (now part of ServiceNow). His writing on why distributed tracing matters, the design of trace propagation, and the evolution from Dapper to OpenTelemetry provides the conceptual foundation that most tracing documentation assumes you already have.
Interview Deep-Dive Questions
These questions go beyond surface-level definitions. They simulate the multi-layered probing you will encounter in senior and staff-level interviews, where the interviewer keeps digging until they find the boundary of your knowledge. Each question includes follow-up chains that branch into different paths, just as a real interview would.
You have a microservices architecture where Service A writes to a database and Service B reads from a cache of that data. Users intermittently see stale data in Service B after updates through Service A. Walk me through how you would design the invalidation path.
- Service A writes to the database and publishes a domain event (e.g., ProductUpdated) to a message broker (Kafka, SNS/SQS, RabbitMQ). Critically, the write and the event publish must be atomic or near-atomic; otherwise you risk the database being updated but the event never being sent. For true atomicity, use the transactional outbox pattern: Service A writes the event to an outbox table in the same database transaction as the business write, and a separate relay process polls the outbox and publishes to the broker.
- Service B subscribes to that event and deletes (not updates) the relevant cache key when it receives the event. Deleting is safer than updating because it avoids race conditions where two concurrent events arrive out of order and the cache ends up with the older value.
- TTL as a safety net. Even with event-driven invalidation, every cache key in Service B has a TTL — say 60 seconds for this data. If the event is lost (broker hiccup, consumer crash), the cache self-heals within the TTL window. The TTL is not your primary invalidation mechanism; it is your fallback.
- Monitoring the invalidation pipeline. I would instrument the lag between the database write timestamp and the cache invalidation timestamp, and alert if the gap exceeds an acceptable threshold (say 5 seconds). If the event pipeline backs up, I want to know before users start complaining.
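The write path, relay, and consumer above can be sketched end to end with SQLite standing in for the real database and a plain dict standing in for Service B's cache. The table layout, the function names, and the `publish` callback are all assumptions made for illustration; a production relay would publish to a real broker and handle batching and concurrency.

```python
import json
import sqlite3
from typing import Callable, Dict


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, price_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published_at TEXT  -- NULL until the broker has accepted the event
        );
    """)


def update_price(conn: sqlite3.Connection, product_id: int, price_cents: int) -> None:
    """Service A: business write and outbox row commit together or not at all."""
    with conn:  # one transaction: commit on success, roll back on exception
        conn.execute(
            "INSERT OR REPLACE INTO products (id, price_cents) VALUES (?, ?)",
            (product_id, price_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("ProductUpdated", json.dumps({"product_id": product_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, Dict], None]) -> int:
    """Relay: publish unpublished rows in order, marking each only after success."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # if this raises, the row stays unpublished
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


def on_product_updated(cache: Dict[str, object], event: Dict) -> None:
    """Service B: delete the key, never update it. Deleting twice is harmless."""
    cache.pop(f"product:{event['product_id']}", None)


conn = sqlite3.connect(":memory:")
init_db(conn)
cache = {"product:42": {"price_cents": 1000}}
update_price(conn, 42, 1200)
relay_once(conn, lambda event_type, event: on_product_updated(cache, event))
```

Because a row is marked only after `publish` returns, a crash between publish and mark causes a re-publish on restart. That is at-least-once delivery, which is safe here because the consumer's delete is idempotent.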
Follow-up: What if the event pipeline introduces unacceptable latency — say events take 2-3 seconds to propagate — and the product team demands sub-second cache freshness?
Strong answer: This is a real tension I have seen in practice. You have a few options, and the right one depends on the consistency requirement:
- Write-through for the critical path. When Service A writes to the database, it also writes directly to the shared cache (Redis) in the same request. This gives you sub-second cache freshness for the happy path. The event pipeline still runs as a backup to handle edge cases (missed writes, Service A crashes between DB write and cache write). The trade-off is that Service A now needs to know about Service B’s cache keys, which couples them.
- CDC (Change Data Capture) instead of application events. Tools like Debezium read the database’s write-ahead log and publish change events with very low latency — typically under 500ms. This is faster than application-level event publishing (which depends on your broker’s batching and delivery semantics) and catches all writes regardless of which code path made them, including manual database patches.
- Accept eventual consistency for reads, enforce strong consistency at the decision point. This is the Facebook approach: let the cached catalog page show a price that might be 1-2 seconds stale, but at the actual checkout flow, always read from the database. The user sees a slightly stale price on the browse page, but the charge is always correct. This reframes the problem: you do not actually need sub-second cache freshness everywhere — you need it at the transaction boundary.
Follow-up: How would you handle the case where Service B has an in-process LRU cache in addition to Redis? Now you have two cache layers to invalidate.
Strong answer: This is the multi-layer invalidation problem, and it gets tricky fast. The Redis invalidation is straightforward: delete the key when the event arrives. But the in-process LRU cache sits inside each Service B instance’s memory, and there might be 20 instances. You need a broadcast mechanism.
The standard approach is Redis Pub/Sub (or a similar pub/sub system). When the invalidation event arrives, the consumer not only deletes the Redis key but also publishes an invalidation message on a Redis Pub/Sub channel. Every Service B instance subscribes to that channel and evicts the key from its local LRU cache.
The gotcha with Redis Pub/Sub is that it is fire-and-forget: if an instance is briefly disconnected (network blip, restart), it misses the invalidation. So every item in the in-process LRU cache must have a short TTL, say 5-10 seconds. This means the worst-case staleness for the local cache is bounded by the TTL even if the pub/sub message is lost.
I would also add a cache version or generation counter. When invalidating, increment the version in Redis. The local LRU cache entries include the version they were fetched at. On every read from the local LRU, compare the cached version with the current Redis version. If they differ, treat it as a miss. This adds one Redis call per read, but it is a single GET on a tiny key (sub-millisecond), and it provides a strong consistency guarantee for the local cache.
Going Deeper: The transactional outbox pattern you mentioned: what happens if the outbox relay process crashes or falls behind? How do you ensure exactly-once processing of those events?
Strong answer: The outbox relay reads rows from the outbox table and publishes them to the broker. If it crashes mid-batch, the rows remain in the outbox (they were not deleted yet), and the relay picks them up on restart, which gives you at-least-once delivery. The consumer side must be idempotent: deleting a cache key that is already deleted is a no-op, so cache invalidation is naturally idempotent. That is one reason I prefer delete-on-invalidate over update-on-invalidate.
For the relay itself, I would use a pattern where each outbox row has a published_at timestamp. The relay marks rows as published after successful broker acknowledgment. On restart, it re-publishes any rows where published_at is null. A separate cleanup job periodically deletes old published rows to keep the table small.
If the relay falls behind (the outbox table is growing faster than the relay can drain it), that is a capacity problem that shows up as increasing lag. I would monitor the size of the outbox table and the age of the oldest unpublished row, and alert if either exceeds a threshold. Scaling the relay (running multiple instances with partitioned reads) or switching to CDC (which reads the WAL directly, bypassing the outbox table entirely) are the escape hatches.
Practically, tools like Debezium handle all of this for you. I would only build a custom outbox relay if I had constraints that ruled out CDC tooling.
Follow-up: How would you roll out this event-driven invalidation system without risking a production incident during the migration?
Strong answer: The rollout must be incremental and reversible at every stage:
- Phase 1 — Shadow mode. Deploy the event pipeline alongside the existing TTL-based caching. Events flow and invalidation runs, but the application still uses TTL as the primary freshness mechanism. Monitor the invalidation events: are they arriving? How many? What is the lag? This phase is risk-free because the events do not change behavior.
- Phase 2 — Dual-mode with comparison. Enable event-driven invalidation for a single non-critical data type (e.g., product descriptions, not prices). Log every case where the event-driven invalidation would have refreshed the cache sooner than the TTL expiration. This tells you the improvement in freshness. Also log any case where the event pipeline lags behind the TTL — this tells you if the pipeline is slower than expected.
- Phase 3 — Feature-flag rollout. For the critical data types, put event-driven invalidation behind a feature flag. Enable it for 5% of traffic (or one region), monitor staleness metrics for that cohort, and expand gradually. The kill switch is the feature flag — disable it and you are back to TTL-only.
- Phase 4 — Extend TTLs. Once event-driven invalidation is proven reliable, extend the TTL from 60 seconds to 5 minutes. The TTL is now a safety net, not the primary freshness mechanism. If the event pipeline fails, the worst case is 5 minutes of staleness instead of the previous 60 seconds.
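For Phase 3, the percentage gate is usually a deterministic hash of a stable identifier, so the same user or region stays in the same cohort as the percentage grows. The sketch below is a hand-rolled illustration with hypothetical names (`in_rollout`, the flag name string); a real deployment would normally rely on an existing feature-flag service.

```python
import hashlib


def in_rollout(flag_name: str, unit_id: str, percent: int, enabled: bool = True) -> bool:
    """Deterministically decide whether `unit_id` is inside a percentage rollout."""
    if not enabled:  # kill switch: everyone falls back to TTL-only caching
        return False
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


FLAG = "event-driven-invalidation"
users = [f"user-{i}" for i in range(10_000)]
cohort_5 = {u for u in users if in_rollout(FLAG, u, 5)}
cohort_50 = {u for u in users if in_rollout(FLAG, u, 50)}
```

Because each unit's bucket never changes, `cohort_5` is always a subset of `cohort_50`: expanding the rollout only adds users, which keeps staleness metrics comparable between stages.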
Explain cache-aside versus read-through. When would you choose one over the other, and what would make you switch mid-project?
I would choose cache-aside:
- When different data types need different caching strategies (different TTLs, different invalidation logic).
- When I want full visibility into caching behavior in my application code.
- When multiple teams are working on the same codebase and I want caching logic to be explicit, not hidden in infrastructure configuration.
I would choose read-through:
- When I have many services all doing the same cache-aside boilerplate for the same data — read-through centralizes the “how to fetch on miss” logic in one place.
- When I am building an internal platform layer and I want to abstract caching away from application developers so they cannot accidentally bypass the cache or implement it incorrectly.
- When combined with write-through, for a data layer that needs to guarantee the cache is always populated.
Follow-up: If you are using cache-aside and the database goes down, what happens? How would you design for that failure mode?
Strong answer: With cache-aside, if the database goes down, all cache misses become errors: the application tries to read from DB, gets a connection failure, and returns a 500 to the user. Cache hits still work perfectly. So the system degrades gracefully proportional to your hit ratio: if you have a 95% hit ratio, 95% of reads still succeed during a DB outage.
To improve resilience, I would add a few layers:
- Serve stale on miss. Instead of returning an error when the DB is unreachable, return the expired cached value if one exists. In Redis, you can implement this by storing the value with a “soft TTL” (the intended freshness window) and a “hard TTL” (how long you are willing to serve stale data in an emergency). On miss, check if the key exists but is past its soft TTL: if the DB is down, return the stale value; if the DB is up, refresh it.
- Circuit breaker on the database call. After N consecutive failures, stop hitting the DB entirely for a cooldown period. During cooldown, serve stale values from cache or return a degraded response. This prevents the DB from being hammered with retry storms the moment it starts to recover.
- Pre-warm critical keys. For the most important data (homepage content, product catalog, authentication tokens), ensure the cache is warm before you need it. Background refresh jobs keep these keys perpetually populated, so even a cold-start scenario after a cache eviction does not depend on the DB being available right now.
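The first two layers fit together naturally. The sketch below uses an in-memory dict in place of Redis, and every name in it (`StaleTolerantCache`, `load`, the threshold and cooldown parameters) is hypothetical. The breaker is deliberately simplified: once the cooldown ends it lets all callers through instead of probing with a single request.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleTolerantCache:
    """Cache-aside with a soft TTL (freshness) and a hard TTL (emergency staleness)."""

    def __init__(self, load: Callable[[str], Any], soft_ttl: float, hard_ttl: float,
                 failure_threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load                  # reads from the database; may raise
        self._soft_ttl = soft_ttl
        self._hard_ttl = hard_ttl
        self._failure_threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)
        self._failures = 0
        self._open_until = 0.0             # breaker is open while now < this

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._soft_ttl:
            return entry[0]                # fresh hit
        if now >= self._open_until:        # breaker closed: try the database
            try:
                value = self._load(key)
            except Exception:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    self._open_until = now + self._cooldown  # stop hitting the DB
                    self._failures = 0
            else:
                self._failures = 0
                self._entries[key] = (value, now)
                return value
        # DB failed or breaker open: serve stale data if it is within the hard TTL.
        if entry is not None and now - entry[1] < self._hard_ttl:
            return entry[0]
        raise LookupError(f"{key!r} unavailable: database down and no usable cached value")
```

Injecting `clock` makes the TTL and cooldown behaviour testable without sleeping. With Redis, the hard TTL would typically be the key's actual expiry and the soft TTL a timestamp stored alongside the value.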
Follow-up: You mentioned Caffeine for Java. What makes Caffeine’s eviction algorithm better than a standard LRU, and when does it matter?
Strong answer: Caffeine uses an algorithm called Window TinyLFU, which is genuinely one of the most elegant eviction algorithms in production use. The core idea is that it combines the strengths of LRU and LFU while avoiding their weaknesses.
Standard LRU evicts the least recently used item, which is great for temporal locality, but it has no concept of frequency. A key accessed 10,000 times but idle for 5 seconds gets evicted in favor of a key accessed once 2 seconds ago. Standard LFU evicts the least frequently used item, which is great for popularity, but it cannot adapt when popularity shifts because established items have high counters.
Caffeine’s approach has three components: a small admission window (LRU, about 1% of the cache), a main space (segmented LRU, about 99%), and a TinyLFU frequency sketch that acts as an admission filter. When a new item enters the cache, it goes into the admission window. When it is evicted from the window, TinyLFU compares its estimated frequency against the item that would be evicted from the main space. The new item only gets into the main space if it is “more popular” than what it would replace. This means one-hit wonders (scans, batch reads, random lookups) get filtered out before they can pollute the main cache.
The frequency sketch itself uses a Count-Min Sketch (a probabilistic data structure) that is periodically halved to decay old frequencies. This is how it adapts to shifting popularity.
Where it matters: any workload with mixed access patterns, where some items are consistently popular, some are trending, and there is a long tail of items accessed rarely. Web application caches are the canonical example. In benchmarks, Caffeine consistently achieves 5-15% higher hit ratios than a standard LRU on real-world traces. For a high-traffic application where a 5% hit ratio improvement translates to millions fewer database queries per day, that is significant.
You are designing the observability stack for a new platform with 30 microservices. You have a limited budget. What do you instrument first, and what do you deliberately defer?
- OpenTelemetry auto-instrumentation on every service. This is nearly free in engineering effort — install the OTel SDK, enable auto-instrumentation, and you get HTTP request spans, database query spans, and outbound call spans with zero code changes. Export traces to Jaeger (open-source, cheap to run) and logs to Grafana Loki (also open-source). This gives you distributed tracing and centralized logs from day one.
- RED metrics on the API gateway and the 3-5 most critical services. Request rate, error rate, and latency distribution (histograms, not averages). Use Prometheus — it is free and the ecosystem is mature. Build one Grafana dashboard per critical service.
- Three baseline alerts per critical service: Error rate exceeding threshold for 5 minutes, p99 latency exceeding 2x baseline for 10 minutes, and health check failure for 2 minutes. These catch the vast majority of user-impacting incidents.
- Structured logging with correlation IDs everywhere. This is a one-time investment in a shared logging library. Every log line includes trace_id, service, level, timestamp, and enough context to be useful. This is cheap to implement and pays dividends forever.
- Custom business metrics for non-critical services. The 25 internal services that are not on the critical user path can wait. When an incident involves them, I will use traces and logs — I do not need pre-built dashboards for services that rarely cause user-facing issues.
- SLO-based burn-rate alerting. This requires defining SLIs, agreeing on SLO targets with stakeholders, and building the burn-rate calculation. It is the right end state, but it is premature in the first month when you do not even have baseline data to set meaningful targets.
- Continuous profiling and AIOps. CPU flame graphs, memory profiling, anomaly detection — these are Level 5 maturity capabilities. They are expensive to operate and require significant data volume to be useful. I would revisit after 6 months when the baseline observability is stable.
- Tail-based sampling. At 30 services with moderate traffic, you probably do not need sophisticated sampling yet. Store all traces for now. Implement tail-based sampling when trace storage costs become a real line item (usually around 10,000+ requests/second).
Follow-up: Six months in, your trace storage costs are spiking. How do you implement tail-based sampling without losing the signal you need during incidents?
Strong answer: The key principle is: most requests are boring. You need 100% coverage of the interesting ones and can aggressively sample the rest.

I would deploy the OTel Collector as a central aggregation point and configure its tail_sampling processor with these rules, in priority order:
- Keep 100% of traces containing any error span, since these are always interesting.
- Keep 100% of traces where total duration exceeds the p95 baseline — slow requests often reveal emerging problems before they become outages.
- Keep 100% of traces from synthetic monitors and health checks — these are my SLO measurement data, and they have low volume.
- Keep 50% of traces for high-value endpoints (checkout, payment, authentication) — I want denser coverage of the paths where money or security is at stake.
- Keep 5% of all remaining traces — enough to maintain baseline statistics and catch rare patterns.
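The priority order above can be sketched as a plain decision function. This is only an illustration of the rule ordering, not the OTel Collector's actual tail_sampling configuration; the `Trace` fields, the `HIGH_VALUE_ENDPOINTS` set, and the `keep_trace` name are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool      # any span in the trace recorded an error
    duration_ms: float   # total end-to-end duration
    is_synthetic: bool   # produced by a synthetic monitor or health check
    endpoint: str        # entry-point route of the trace


# Hypothetical route names for the "high-value" paths.
HIGH_VALUE_ENDPOINTS = {"/checkout", "/payment", "/login"}


def keep_trace(trace, p95_baseline_ms, rng=random.random):
    """Return True if the completed trace should be stored.

    Rules are evaluated in priority order; the first match decides.
    """
    if trace.has_error:
        return True                      # rule 1: keep all errors
    if trace.duration_ms > p95_baseline_ms:
        return True                      # rule 2: keep all slow traces
    if trace.is_synthetic:
        return True                      # rule 3: keep all synthetic traffic
    if trace.endpoint in HIGH_VALUE_ENDPOINTS:
        return rng() < 0.50              # rule 4: 50% of high-value paths
    return rng() < 0.05                  # rule 5: 5% of everything else
```

The decision runs only after the whole trace has been assembled, which is why it has to live in a collector tier that sees every span of a trace rather than in the application.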
Follow-up: A developer on your team argues that you should just use head-based sampling at 10% because it is simpler. How do you make the case for tail-based?
Strong answer: The fundamental problem with head-based sampling is that the sampling decision is made before you know the outcome of the request. A 10% head sample means you keep 10% of traces chosen at random at the entry point, before you know whether the request will succeed, fail, be slow, or be fast.

Here is the concrete impact: if your system has a 0.1% error rate and you are head-sampling at 10%, you keep 10% of error traces. That is 0.01% of all traces. During an incident where you need to examine failing requests, you might have 3 traces instead of 30. That is a very thin evidence base for root cause analysis.

With tail-based sampling at a comparable overall volume, you keep 100% of error traces and 5% of everything else. You get a similar or lower storage cost but 10x better coverage of the data you actually need during incidents.

I would frame it to the developer this way: “Head-based sampling optimizes for simplicity of implementation. Tail-based sampling optimizes for quality of signal during incidents. Since the whole point of tracing is incident investigation, we should optimize for the outcome we care about. The complexity of tail-based sampling lives in the Collector configuration, not in application code. It is a one-time infrastructure investment.”

That said, if the team is very early stage and the Collector infrastructure feels like too much right now, I would accept head-based as a starting point and plan the migration to tail-based within the next quarter. Imperfect sampling is better than no sampling.

You join a team that has been running Redis as their caching layer for two years. The cache hit ratio is 70%. They consider this acceptable. Walk me through how you would evaluate whether it actually is, and what you would do about it.
Step 1: Check capacity. Run INFO memory on Redis and compare used_memory against maxmemory. If the cache is full and evicting keys (the evicted_keys counter in INFO stats is non-zero and growing), the working set is larger than the cache, and you might simply need more memory. If the cache is only half full, the problem is not capacity.

Step 2: Analyze the miss pattern. Are misses concentrated on specific key prefixes or spread uniformly? If concentrated, one data type has a caching problem (wrong TTL, missing population logic, a write path that bypasses cache invalidation). If uniform, it is systemic. Redis only reports an aggregate keyspace_misses counter, so I would log or sample misses in the application, tagged by key prefix, to understand which keys are being missed.

Step 3: Check the TTL distribution. If TTLs are too short, keys expire before they get a second read: every entry is populated once and expires before it is read again. This is the “single-use cache” anti-pattern. For data that is read 10 times in 5 minutes, a 30-second TTL can mean it is repopulated up to 10 times instead of once. Extending TTL to 5 minutes would convert 9 of those 10 misses into hits.

Step 4: Check the eviction policy. If the team is using allkeys-lru but the workload is frequency-skewed (a small set of keys gets most reads), switching to allkeys-lfu might improve hit ratio by 10-15%, because LFU retains popular keys even if they have not been accessed in the last few seconds.

Step 5: Check for cache-defeating patterns. Common culprits: non-normalized cache keys (/users/123?utm_source=google vs /users/123 creating separate cache entries for the same data), per-request unique data embedded in cache keys (session tokens, timestamps), or batch jobs that scan the key space and pollute the cache with cold data.

Step 6: Measure the actual impact. Even if I determine the hit ratio could be 90%, the question is: does the 30% miss rate actually cause a problem? Check database load, API latency, and cost. If the database is running at 20% capacity and latency is within SLO, optimizing the cache might not be the highest-value work. If the database is at 80% capacity and you are about to scale it, improving cache hit ratio is cheaper than scaling the database.

My experience is that most teams with a “70% hit ratio” have at least one of the problems above, and getting to 85-90% is usually straightforward once you identify which one.

Follow-up: You discover that 40% of the misses are caused by a single batch job that runs hourly and scans through every user record for a report. How do you fix this without removing the batch job?
Strong answer: This is classic scan pollution: the batch job reads thousands of keys sequentially, promoting them all to the front of the LRU queue and evicting the genuinely hot keys that real users need.

Several options, in order of my preference:
- Use a separate Redis instance for the batch job, so that its reads do not share the same LRU eviction space as the real-time cache. A different logical database on the same instance is not enough, because all logical databases in an instance share one maxmemory limit and are evicted together. This is the cleanest isolation.
- Bypass the cache entirely for the batch job. If the batch job is generating a report, it should read from a read replica of the database, not from cache. The cache exists to serve real-time user traffic, and the batch job is not a user. Modify the batch job to query the database directly (preferably a read replica to avoid impacting the primary).
- Switch to allkeys-lfu eviction. LFU is naturally scan-resistant because a sequential read touches each key only once, which leaves its frequency too low to displace keys with high frequency counts. The batch job’s scanned keys will be the first to be evicted because their frequency is the lowest. This is the lowest-effort fix if you cannot change the batch job’s code.
- If the batch job must use the cache (because the database read replica does not exist and you cannot add one right now), prefix the batch job’s cache reads with a lower priority or use Redis’s OBJECT FREQ (available when an LFU policy is active) to monitor which keys the batch is displacing, and ensure the hot keys are repopulated immediately after the batch completes.
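To make the scan-resistance argument concrete, here is a toy simulation under stated assumptions: an exact LRU and an exact LFU (ties broken by recency) with hypothetical key names and sizes. Redis's real policies are sampled approximations with a probabilistic counter, so this shows the principle only, not Redis's exact behavior.

```python
from collections import OrderedDict


def run_lru(capacity, accesses):
    """Exact LRU: evict the least recently used key when full."""
    cache = OrderedDict()
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the oldest entry
            cache[key] = True
    return set(cache)


def run_lfu(capacity, accesses):
    """Exact LFU: evict the lowest-frequency key, oldest first on ties."""
    freq, last_used = {}, {}
    for tick, key in enumerate(accesses):
        if key not in freq and len(freq) >= capacity:
            victim = min(freq, key=lambda k: (freq[k], last_used[k]))
            del freq[victim], last_used[victim]
        freq[key] = freq.get(key, 0) + 1
        last_used[key] = tick
    return set(freq)


# Four hot keys read 10 times each, then a batch job scans 100 cold keys once.
hot_keys = [f"user:{i}" for i in range(4)]
workload = hot_keys * 10 + [f"scan:{i}" for i in range(100)]

lru_survivors = run_lru(5, workload) & set(hot_keys)
lfu_survivors = run_lfu(5, workload) & set(hot_keys)
```

With a capacity of 5, `lru_survivors` ends up empty (the scan flushed every hot key), while `lfu_survivors` still contains all four hot keys because each scanned key only ever displaces the previous scanned key.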
Going Deeper: You mentioned that switching from LRU to LFU can help with scan resistance. But what happens when your team launches a new product feature that introduces a new set of hot keys? Does LFU create a cold-start problem?
Strong answer: Yes, and this is LFU’s well-known weakness. When new keys enter the system, they start with a frequency count of 1. Meanwhile, the existing hot keys have high frequency counts built up over time. The new keys need to accumulate enough frequency to compete with the incumbents, which means they experience an elevated miss rate during the ramp-up period.

In Redis specifically, the main knob is the lfu-decay-time parameter, which controls how quickly the logarithmic frequency counter decays. The default is 1, meaning a key’s counter is reduced as minutes pass without an access (the exact decay rule has varied across Redis versions). Larger values make decay slower, so the default is already the fastest non-zero setting. Be careful with 0: recent Redis versions document it as never decaying the counter, not as decaying faster, so check the redis.conf comments for your version before changing it. Redis also softens the cold start by giving new keys a small non-zero initial counter so they are not evicted immediately.

For a product launch specifically, I would do two things: first, pre-warm the cache with the new feature’s data before launch. If I know which keys the new feature will need (product pages, configuration, user segments), populate them in advance so they enter the cache with at least one read. Second, confirm that lfu-decay-time has not been raised above its default of 1 (a larger value lets incumbents hold their counts for longer), and watch the hit ratio on the new key prefix for the first few hours until the new keys have established their frequency baseline.

If the cold-start problem is severe and ongoing (the system constantly introduces new key patterns), a hybrid like Caffeine’s Window TinyLFU is better, because it has a small LRU admission window specifically to give new keys a chance before they compete on frequency. Redis does not natively support this hybrid, so at that point I would evaluate whether an in-process cache (Caffeine for JVM services) as a first layer before Redis would solve the problem.

Your distributed tracing shows that a particular request takes 2 seconds end-to-end, but the sum of all span durations is only 400ms. Where did the other 1.6 seconds go?
- Uninstrumented code. The most common cause. If a span covers the HTTP handler but there is business logic inside that handler which is not wrapped in its own span — data transformation, validation, serialization/deserialization, file I/O — that time shows up as the gap between the parent span’s duration and the sum of child spans. The fix is straightforward: add spans around the uninstrumented sections.
- Queue wait time. If the request involves an asynchronous step — a message published to a queue and then consumed — the time the message sits in the queue is often not captured as a span. The producer creates a span when it publishes, the consumer creates a span when it processes, but the gap between “published” and “consumed” is dead time that is not represented in the trace. Adding a “queue wait” span (computed from the difference between the published and consumed timestamps) makes this visible.
- Connection acquisition time. Waiting for a database connection from the pool, waiting for a Redis connection, waiting for an HTTP connection from the connection pool — these waits happen before the actual operation span starts. If the database span starts when the query executes but the request spent 800ms waiting for a connection from an exhausted pool, that 800ms is invisible in the trace. Some auto-instrumentation libraries capture this; many do not.
- Garbage collection or process-level pauses. A long GC pause (in JVM, .NET, or Go services) stops the world. No spans are being created during the pause, but clock time is advancing. This shows up as a gap that is impossible to attribute to any specific operation.
- Network latency between spans. The time between “parent span ends” and “child span begins” includes network round-trip time, serialization, and middleware processing at the receiving service. For a call chain traversing 5 services, if each hop has 50ms of network overhead, that is 250ms of gap time.
- Clock skew across services. If the clocks on different service instances are not synchronized (NTP drift), span timestamps can be inconsistent. A child span might appear to start before its parent if the child’s clock is ahead. This can make gap analysis unreliable. Check if NTP is configured on all hosts.
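Before chasing any of these causes, it helps to locate the gaps precisely. The sketch below, assuming spans are available as (start_ms, end_ms) pairs relative to the root span, lists the intervals of the root that no child span covers. The function name and the sample timings are hypothetical.

```python
def find_gaps(root, children):
    """Return the (start, end) intervals of `root` not covered by any child.

    `root` is a (start_ms, end_ms) pair; `children` is a list of such pairs.
    Overlapping children are merged, and children are clipped to the root.
    """
    start, end = root
    gaps = []
    cursor = start  # everything before `cursor` is already accounted for
    for s, e in sorted(children):
        s = min(max(s, start), end)
        e = min(max(e, start), end)
        if s > cursor:
            gaps.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# A 2-second request whose child spans cover only 400ms in total.
root_span = (0, 2000)
child_spans = [(100, 300), (250, 400), (1800, 1900)]

gaps = find_gaps(root_span, child_spans)
unaccounted_ms = sum(e - s for s, e in gaps)
```

Here `gaps` is `[(0, 100), (400, 1800), (1900, 2000)]` and `unaccounted_ms` is 1600, which points the investigation at the 1.4-second hole between the second and third child spans.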
Follow-up: You add instrumentation and discover that 1.2 seconds of the gap is spent waiting for a database connection from an exhausted pool. The pool size is 20. What do you do?
Strong answer: First, resist the urge to just increase the pool size. A larger pool might mask the symptom but not the cause, and it shifts the bottleneck to the database (which now has to handle more concurrent connections, potentially degrading performance for everyone).

I would investigate why the pool is exhausted:
- Check for connection leaks. Are connections being properly returned to the pool after use? A single code path that opens a connection but does not close it in the error path will slowly drain the pool. Look at the pool’s active vs idle connections over time. If active connections grow monotonically and never return to idle, you have a leak.
- Check query duration. If individual queries are taking 500ms instead of 5ms, each connection is occupied 100x longer, so the pool is effectively 100x smaller. A slow query (missing index, lock contention, table scan) is the most common root cause of pool exhaustion. Check the database’s slow query log.
- Check for N+1 query patterns. A request that opens one connection and runs 50 sequential queries holds that connection for the entire duration. If 20 concurrent requests each do this, all 20 connections are occupied with long-running sessions.
- Check concurrency relative to pool size. If the service handles 200 concurrent requests and the pool size is 20, only 10% of requests can have an active database connection at any time. Either the pool needs to grow (if the database can handle it), or the application needs connection-pooling middleware like PgBouncer (for PostgreSQL) that multiplexes application connections over a smaller set of database connections.
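The relationship between query duration and pool pressure follows from Little's law (connections in use = arrival rate x time each connection is held). A minimal sketch, where the 400 requests/second figure is an assumed example and the function names are hypothetical:

```python
def required_connections(requests_per_second, hold_time_ms):
    """Average number of connections in use, by Little's law."""
    return requests_per_second * hold_time_ms / 1000


def max_throughput(pool_size, hold_time_ms):
    """Highest request rate the pool can sustain before requests queue."""
    return pool_size * 1000 / hold_time_ms


# Assumed load of 400 requests/second against a pool of 20 connections.
healthy = required_connections(400, 5)      # queries take 5ms
degraded = required_connections(400, 500)   # queries take 500ms
ceiling = max_throughput(20, 500)           # what the pool can absorb at 500ms
```

At 5ms per query the service needs about 2 connections on average; at 500ms it needs about 200, ten times the pool, and the pool of 20 can sustain only about 40 requests per second. This is why fixing the slow query usually helps more than enlarging the pool.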
Follow-up: How would you add observability to the connection pool itself so you catch this problem before it causes user-facing latency?
Strong answer: I would expose three metrics from the connection pool to Prometheus (two gauges and a histogram):
- db_pool_connections_active: number of connections currently in use
- db_pool_connections_idle: number of connections available
- db_pool_connections_wait_duration_seconds: histogram of how long requests waited for a connection

The alert I would set on these: “If db_pool_connections_wait_duration_seconds p95 exceeds 100ms for 5 minutes, alert.” That catches pool exhaustion long before it reaches the 1.2-second waits you saw.

I would also add the connection wait time as a span attribute (or a dedicated span) in the distributed trace. This way, when someone looks at a slow trace, they see “connection pool wait: 1200ms” explicitly instead of a mysterious gap.

Most connection pool libraries expose these metrics natively or through hooks: HikariCP (Java) has built-in Prometheus metrics, pgx (Go) has pool stats, and node-postgres has pool event hooks. If the library does not expose them, wrapping the pool’s acquire and release methods to record timing is straightforward.

Your team is deciding between Datadog and a self-hosted Grafana + Prometheus + Loki + Tempo stack. The CTO asks you to make the recommendation. How do you approach this?
- Datadog: Per-host pricing for infrastructure monitoring, per-GB pricing for log ingestion ($0.10/GB), per-span trace pricing (varies by plan). For 50 hosts, 100 GB/day logs, and moderate tracing, budget $5,000-15,000/month.
- Self-hosted: Compute for Prometheus, Loki, Tempo, Grafana (maybe 6-10 nodes on AWS), S3 storage for long-term data, and 20-40% of a platform engineer’s time for maintenance. Budget $2,000-5,000/month in infrastructure + the engineering opportunity cost.
- Early-stage startup, small team, no platform engineers: Datadog. Pay the premium for zero operational overhead. Focus engineering time on the product.
- Growing company, 50+ services, building a platform team: Start with Datadog + OTel instrumentation. Begin migrating metrics to Prometheus + Grafana first (lowest risk), then logs to Loki, then traces to Tempo — in that order. This gives a gradual migration with escape velocity.
- Enterprise, 200+ services, existing platform team: Self-hosted Grafana stack (or Grafana Cloud for the managed option, which lands between Datadog and fully self-hosted in both cost and operational burden).
Follow-up: The team goes with Datadog. Six months later the monthly bill is $18,000 and the CFO is asking hard questions. How do you reduce costs without losing critical observability?
Strong answer: Datadog cost optimization is almost its own discipline at this point. The three biggest cost drivers, in order, are usually: log ingestion, custom metric count, and APM trace volume.

Logs (typically 40-60% of the bill):
- Filter out health check logs, Kubernetes liveness probes, and other high-volume low-value log sources before they reach Datadog. Use the Datadog Agent’s log processing pipeline or OTel Collector’s filter processor to drop them at the source.
- Reduce log verbosity in production. Set production log levels to INFO and ensure no service is accidentally logging at DEBUG.
- Use Datadog’s log exclusion filters in the pipeline configuration to drop or sample logs matching patterns you have identified as noise.
- Consider sending less-critical logs to a cheaper backend (S3 + Athena) and only routing high-value logs (errors, critical business events) to Datadog.

Custom metrics:
- Audit for cardinality explosions. Run the Datadog metric cardinality estimator and look for metrics with unbounded label values (un-normalized URL paths, user IDs, error messages as labels).
- Remove unused metrics. Datadog’s “Metrics without Limits” feature lets you see which metrics are not used in any dashboard or alert. Delete them.
- Aggregate at the source. Instead of sending per-instance metrics and letting Datadog aggregate, pre-aggregate in the OTel Collector or StatsD layer.

APM traces:
- Implement sampling. Head-based sampling at the Datadog Agent level (keep 10% of successful traces, 100% of error traces) can reduce trace volume by 80-90%.
- Use Datadog’s ingestion controls to set per-service sampling rates — sample 100% for critical services, 5% for internal utility services.
Going Deeper: If you later decide to migrate off Datadog to the self-hosted Grafana stack, what is the migration path, and what are the biggest risks?
Strong answer: If we instrumented with OpenTelemetry (as I recommended earlier), the migration is primarily an infrastructure and routing exercise, not a re-instrumentation one. Here is the sequence I would follow:

Phase 1: Metrics (lowest risk). Deploy Prometheus and Grafana. Configure the OTel Collector to export metrics to both Prometheus and Datadog simultaneously (dual-write). Build equivalent dashboards in Grafana. Run both in parallel for 2-4 weeks. Once the team validates that the Grafana dashboards match Datadog’s data, disable the Datadog metrics export. This phase has the lowest risk because metrics are stateless aggregates: if you lose a few data points during the transition, nobody notices.

Phase 2: Traces. Deploy Tempo. Dual-write traces to both Tempo and Datadog. Verify that trace search, service maps, and latency analysis work in Grafana + Tempo. This phase is higher risk because the team has built muscle memory around Datadog’s trace UI (which is genuinely excellent), and Tempo + Grafana’s trace experience is functional but less polished.

Phase 3: Logs (highest risk). Deploy Loki. This is the most dangerous phase because logs are the most-queried observability data and Loki’s query language (LogQL) is different from Datadog’s log query syntax. The team needs training. Run dual ingest for at least 4 weeks and ensure every on-call engineer is comfortable with LogQL before cutting over.

Biggest risks:
- Alert migration. Datadog alerts (monitors) need to be recreated as Prometheus alerting rules or Grafana alert rules. This is tedious and error-prone: miss one alert and you have a gap in coverage.
- Institutional knowledge. The team has 6 months of Datadog-specific knowledge — saved queries, investigation workflows, bookmarked traces. This evaporates on migration. Document the most important investigation playbooks and translate them.
- On-call during transition. For the dual-write period, on-call engineers need to know which system is authoritative. Make this crystal clear.
A candidate says 'we should cache the result of this database query' during a system design interview. What questions would you ask them to probe whether they actually understand the implications?
Follow-up: How do you evaluate whether a candidate’s caching answer is “good enough” for a senior role versus a staff role?
Strong answer: The senior bar is: can they design a correct caching solution for a well-defined problem and anticipate the common failure modes (stampede, stale data, invalidation gaps)?

The staff bar is: can they reason about the systemic implications, meaning how this cache interacts with other caches in the system, how it affects the team’s operational burden, and when caching is the wrong solution entirely?

Specifically, a staff-level candidate:
- Challenges the premise. “Before we cache this, have we profiled the query? Maybe a missing index is the real fix.” A senior candidate solves the caching problem; a staff candidate questions whether caching is the right solution.
- Thinks about the cache’s lifecycle. Not just the happy path, but: what happens during a deploy (cold cache), during a traffic spike (stampede), during a data migration (mass invalidation), during a database failover (read replica lag)?
- Considers the team impact. “This cache adds a dependency that every on-call engineer needs to understand. Do we have runbooks? Dashboards? Is the team ready to operate this?” Staff engineers think about maintainability, not just correctness.
- Connects to the broader architecture. “If we cache this at the service level, but the CDN also caches the API response, we have two invalidation paths to manage. Let me design both.”
Walk me through how you would set up SLOs for a new payment processing service. What SLIs would you choose, what targets would you set, and how would you handle the error budget?
- Availability SLI: The proportion of payment requests that return a valid response (success or well-defined failure like “card declined”) vs. an error (timeout, 500, connection failure). A card being declined is not an error from the system’s perspective — it is the system correctly reporting the outcome. An error is when the system cannot process the request at all.
- Latency SLI: The proportion of payment requests that complete within an acceptable duration. I would set two thresholds: a “good” threshold (e.g., under 2 seconds for 99% of requests) and a “tolerable” threshold (under 10 seconds for 99.9% of requests). Payment processing involves external calls to payment gateways, so latency is inherently higher and more variable than a typical API.
- Correctness SLI (unique to payments): The proportion of payments where the charged amount matches the requested amount and the payment state is consistent between our system and the payment gateway. This catches subtle bugs like double-charges or amount mismatches. This is harder to measure — you need reconciliation processes — but for a payment service, correctness is more important than availability.
- Availability: 99.95% (about 22 minutes of downtime per 30-day window). Not 99.99% — payment gateways themselves are not 99.99% reliable, and setting an SLO higher than your dependencies can achieve is dishonest.
- Latency: 99% of requests under 2 seconds, 99.9% under 10 seconds.
- Correctness: 99.999% (this should be as close to 100% as possible — a correctness failure is a financial discrepancy).
- 14.4x burn rate over 5 minutes — P1 page, potential outage.
- 6x burn rate over 30 minutes — P2 alert, on-call investigates.
- 3x burn rate over 6 hours — P3 ticket for next business day.
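The burn-rate thresholds above can be sketched as a small calculation, assuming the 99.95% availability target over a 30-day window stated earlier. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends exactly the whole budget over the window. The function names and the sample counts are hypothetical.

```python
SLO_TARGET = 0.9995              # availability target from the list above
WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window


def error_budget_minutes(slo=SLO_TARGET, window_minutes=WINDOW_MINUTES):
    """Total minutes of full outage the SLO tolerates per window."""
    return (1 - slo) * window_minutes


def burn_rate(failed, total, slo=SLO_TARGET):
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total == 0:
        return 0.0
    return (failed / total) / (1 - slo)


def severity(burn_5m, burn_30m, burn_6h):
    """Map burn rates over the three lookback windows to an alert level."""
    if burn_5m >= 14.4:
        return "P1"   # page immediately, potential outage
    if burn_30m >= 6:
        return "P2"   # on-call investigates
    if burn_6h >= 3:
        return "P3"   # ticket for next business day
    return None


budget = error_budget_minutes()   # about 21.6 minutes
rate = burn_rate(72, 10_000)      # 0.72% of requests failing
```

With 72 failures in 10,000 requests, the burn rate is about 14.4, which is the P1 threshold: at that pace the 30-day budget would be gone in roughly two days.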
Follow-up: The product team pushes back on the deploy freeze policy. They argue that they have a critical feature launch in two weeks and cannot stop shipping. How do you handle this?
Strong answer: This is a common and important organizational challenge. The error budget framework only works if both sides respect it, and the first time it is tested is the hardest.

I would approach it in three steps:
- Present the data, not the rule. Instead of saying “the policy says we freeze,” I would show the actual numbers: “We have consumed 90% of our error budget this month. If we deploy this feature and it causes even a minor regression, we will breach our SLO. Here is what that means for customers: X% of payments will fail.” Data is persuasive in a way that policy is not.
- Negotiate risk mitigation, not a blanket freeze. Maybe the feature can be deployed behind a feature flag with a 1% canary rollout. If the canary shows no error budget impact after 24 hours, roll to 10%, then 50%, then 100%. This lets the product team keep shipping while giving us a kill switch if things go wrong.
- Escalate clearly if needed. If the product team insists on a full deployment despite the risk, I would escalate to our shared leadership (VP of Engineering or CTO) with a clear framing: “We can ship this feature now with a quantified risk of X% payment failures, or we can ship it next week after we recover error budget. I need a decision from someone who owns both the product timeline and the reliability commitment.”
Going Deeper: How would you instrument the correctness SLI you mentioned? What is the actual mechanism for detecting that a charge amount does not match?
Strong answer: Correctness measurement for payments is fundamentally different from availability and latency: you cannot measure it from a single request’s response code. You need a reconciliation pipeline.

Here is the mechanism:
- Dual-write the payment record. When we process a payment, we write the intended charge amount and the payment gateway’s response (including their transaction ID and confirmed amount) to our database. These are two separate fields: requested_amount and gateway_confirmed_amount.
- Near-real-time reconciliation. A background job runs every few minutes, comparing requested_amount vs gateway_confirmed_amount for recent payments. Any mismatch is flagged and counted as a correctness failure. For cases where the gateway is authoritative (they charged the amount they say they charged), the mismatch represents a bug in our system.
- End-of-day batch reconciliation. Pull the full transaction ledger from the payment gateway (most gateways expose this via API or SFTP file) and reconcile against our records. This catches edge cases the near-real-time reconciliation misses, like a charge that succeeded on the gateway side but our system recorded as failed (or vice versa, due to a timeout where the charge went through but we never received the confirmation).
- The correctness SLI is computed as: (total payments - correctness failures) / total payments over the rolling window. Each mismatch found in reconciliation subtracts from the numerator.
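The near-real-time reconciliation step and the SLI formula can be sketched as follows. This is a minimal illustration, assuming records have already been loaded from the database; the `PaymentRecord` class and function names are hypothetical, and treating unconfirmed payments as "not yet a mismatch" (left for the end-of-day pass) is an assumption, not something the mechanism above prescribes.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class PaymentRecord:
    payment_id: str
    requested_amount: Decimal
    # None means no gateway confirmation has been recorded yet.
    gateway_confirmed_amount: Optional[Decimal]


def find_mismatches(records):
    """Payments whose confirmed amount differs from the requested amount."""
    return [
        r for r in records
        if r.gateway_confirmed_amount is not None
        and r.gateway_confirmed_amount != r.requested_amount
    ]


def correctness_sli(records):
    """(total payments - correctness failures) / total payments."""
    total = len(records)
    if total == 0:
        return 1.0  # no traffic in the window, nothing was incorrect
    return (total - len(find_mismatches(records))) / total


recent = [
    PaymentRecord("p1", Decimal("19.99"), Decimal("19.99")),
    PaymentRecord("p2", Decimal("5.00"), Decimal("50.00")),   # amount mismatch
    PaymentRecord("p3", Decimal("12.50"), None),              # unconfirmed
    PaymentRecord("p4", Decimal("7.25"), Decimal("7.25")),
]
sli = correctness_sli(recent)
```

Amounts are held as `Decimal` rather than floats so that equality comparison is exact; in this sample one of four payments is mismatched, giving an SLI of 0.75.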
You are investigating a production incident. Your metrics show a spike in p99 latency, but p50 is unchanged. Your traces show that slow requests all have one thing in common: they pass through a caching layer. Explain what could be happening and how you would diagnose it.
cache_misses_total and correlate with the latency spike timing.

2. Redis (or your cache layer) experiencing intermittent slowness. A Redis node might be running a slow KEYS or DEBUG SLEEP command (which blocks the single-threaded event loop), performing a background save (RDB snapshot or AOF rewrite consuming I/O), or experiencing network micro-partitions. The affected requests wait for Redis to respond, adding hundreds of milliseconds. Check Redis SLOWLOG for commands that took longer than expected, and check Redis INFO persistence for background save activity during the spike window.

3. Thundering herd on a single key. One highly popular cache key expired, and the lock-based rebuild mechanism means N-1 requests are waiting (sleeping and retrying) while one request rebuilds. If the rebuild takes 500ms (complex database query), all waiting requests add that 500ms plus their sleep/retry overhead to their total latency. The p50 is fine because most requests are for other keys that are still cached. Check for lock keys (lock:* pattern) in Redis during the spike and measure their duration.

4. Connection pool exhaustion in the cache client library. The application has a connection pool to Redis. Under load, if the pool is too small, requests queue waiting for a connection. Most requests get a connection quickly (p50 is fast), but the tail (p99) includes the requests that waited longest. Check the Redis client’s connection pool metrics: active connections, wait time, pool size.

5. Serialization/deserialization cost for large cached objects. If some cache entries are significantly larger than others (e.g., a product catalog page with hundreds of items vs. a single product), the deserialization cost for the large entries creates tail latency. The cache hit is fast (Redis returns the bytes quickly), but the CPU time to deserialize a 500KB JSON blob adds 50-200ms on some requests. Check if the slow traces have larger cache values by adding a cache.value_size_bytes span attribute.

My diagnostic steps:
- Pull the slow traces and fast traces from the same time window. Compare them side by side. Where does the divergence happen?
- Check Redis SLOWLOG for the spike window.
- Check cache hit/miss rates. Did the miss rate increase around the time of the spike?
- Check Redis INFO stats for evicted_keys, keyspace_hits, keyspace_misses trends.
- Check the application’s Redis connection pool metrics.
- If still unclear, enable brief MONITOR on Redis (for 10 seconds only, because MONITOR itself is expensive) to see what commands the slow requests are executing.
Follow-up: You narrow it down to cause number 2 — Redis is intermittently slow. SLOWLOG shows periodic 200ms pauses. What is your next step?
Strong answer: 200ms pauses in Redis are almost certainly one of three things: background persistence operations, swapping, or operating system-level transparent huge pages (THP).

Check 1: Background saves. Run INFO persistence and check rdb_last_bgsave_time_sec and aof_rewrite_in_progress. If BGSAVE or BGREWRITEAOF is running during the spike windows, the fork operation to create a child process for the snapshot can block the main thread for hundreds of milliseconds, especially if Redis is using a lot of memory (the fork needs to copy page tables). The fix: if you are using AOF, switch to appendfsync everysec (not always). If BGSAVE is causing pauses and you do not need RDB snapshots (you are using Redis as a cache, not a persistent store), disable them with save "".

Check 2: Memory swapping. Run redis-cli INFO memory and compare used_memory_rss with used_memory. If RSS is significantly lower than used_memory, part of Redis’s memory has probably been swapped to disk (RSS well above used_memory points to fragmentation instead). Any swap activity causes catastrophic latency for an in-memory store. Check vmstat or the OS-level swap metrics. The fix: ensure Redis’s maxmemory is set below available physical RAM. Separately, set vm.overcommit_memory = 1 in the kernel so that the fork for background saves does not fail under memory pressure.

Check 3: Transparent Huge Pages. THP is a Linux kernel feature that allocates memory in 2MB pages instead of 4KB pages. For Redis, this is disastrous: every BGSAVE fork triggers copy-on-write at the 2MB granularity, amplifying write overhead 512x per page. Redis itself warns about this at startup: "WARNING you have Transparent Huge Pages (THP) support enabled". The fix: echo never > /sys/kernel/mm/transparent_hugepage/enabled.

I would check all three in parallel since each takes only a few seconds. In my experience, the most common cause is BGSAVE pauses on a cache node where persistence was enabled by default and nobody realized it was unnecessary.

Going Deeper: How would you architect Redis to avoid these pauses entirely for a pure caching use case?
Strong answer: For a pure cache (no durability needed, because the data can be reconstructed from the database), here is the optimal configuration:
- Disable all persistence: save "" (no RDB snapshots) and appendonly no (no AOF). This eliminates fork-related pauses entirely.
- Disable THP at the OS level.
- Set maxmemory explicitly to 75-80% of available RAM, leaving headroom for the OS and for Redis child processes (if you ever need to debug with BGSAVE temporarily).
- Set maxmemory-policy allkeys-lfu for most web caching workloads.
- Set tcp-backlog to 511 or higher and raise the kernel’s net.core.somaxconn to match, avoiding connection queue drops under burst load.
- Set latency-monitor-threshold 50 so Redis internally tracks any operation taking more than 50ms, giving you LATENCY LATEST and LATENCY HISTORY for diagnostics.
- Run Redis on dedicated hardware or isolated containers; co-locating with CPU-intensive workloads causes noisy-neighbor latency.
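The persistence and swap checks described earlier can be partly automated. The following is a minimal sketch, assuming you feed it the raw text of a Redis INFO reply; the function names and the 100ms fork threshold are illustrative assumptions, while the field names (rdb_bgsave_in_progress, aof_rewrite_in_progress, latest_fork_usec, used_memory, used_memory_rss) are standard INFO fields.

```python
def parse_info(text):
    """Parse the key:value lines of a Redis INFO reply, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def diagnose(info, fork_warn_usec=100_000):
    """Return human-readable findings for the persistence and swap checks.

    fork_warn_usec is an assumed threshold (100ms), not a Redis default.
    """
    findings = []
    if info.get("rdb_bgsave_in_progress") == "1" or info.get("aof_rewrite_in_progress") == "1":
        findings.append("background save or AOF rewrite in progress")
    if int(info.get("latest_fork_usec", "0")) >= fork_warn_usec:
        findings.append("last fork blocked the main thread for a long time")
    used = int(info.get("used_memory", "0"))
    rss = int(info.get("used_memory_rss", "0"))
    if used and rss and rss < used:
        findings.append("RSS is below used_memory: possible swapping")
    return findings
```

Run during a latency spike, an empty result would point away from persistence and swapping and toward THP or a noisy neighbor.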
Explain the difference between high-cardinality data and low-cardinality data in the context of observability. Why does it matter, and how does it change your tool selection?
Cardinality is the number of distinct values a field can take. http_method is low cardinality: maybe 5-7 unique values. user_id is high cardinality: potentially millions. This distinction is not just academic; it fundamentally determines what you can store as a metric versus what you must store as a log or trace.

Why it matters for metrics: Every unique combination of label values in a metric creates a new time series. Prometheus stores and indexes each time series independently. If you have a metric http_request_duration{method, path, status, user_id} and you have 5 methods, 200 paths, 10 statuses, and 1 million users, you have created 10 billion potential time series (5 x 200 x 10 x 1,000,000). No metrics system can handle that. You will exhaust memory, crash your TSDB, or receive a very large bill from your SaaS provider.

The rule is: metric labels must have bounded, predictable cardinality. Status codes, HTTP methods, service name, and region are all bounded, so they are safe. User ID, request ID, IP address, and email should never be metric labels.

What you do with high-cardinality data: You put it in logs and traces, not metrics. This is the fundamental reason logs and traces exist alongside metrics: they are the storage layer for high-cardinality data. When you need to answer “what is the p99 latency for user X?”, you cannot pre-compute a metric for every user. But you can query structured logs or traces that have user_id as a field.

How it changes tool selection: Traditional metrics tools (Prometheus, Graphite, InfluxDB) are optimized for low-cardinality data. They pre-aggregate at ingest time and struggle with high-cardinality queries. Observability-first tools (Honeycomb, Datadog’s high-cardinality mode, ClickHouse-backed solutions) are designed to handle high-cardinality data by storing raw events and computing aggregations at query time. This is the architectural difference Charity Majors has been evangelizing: “wide structured events” that you can slice by any dimension after the fact, versus “pre-aggregated metrics” that lock you into the dimensions you chose at instrumentation time.

If your debugging workflow frequently requires slicing by high-cardinality dimensions (which user, which tenant, which specific request), you need an observability tool that supports it natively. If you are mostly looking at aggregate health (total error rate, average latency), traditional metrics are sufficient and much cheaper.

Follow-up: A developer adds user_id as a label on a Prometheus counter in production. What happens, and how do you prevent it from happening again?
Strong answer: Depending on user volume, one of several bad things happens. At low user count (thousands), Prometheus’s memory usage increases noticeably and query performance degrades. At moderate count (tens of thousands), Prometheus’s head chunk memory fills up, scrape durations increase, and compaction takes longer. At high count (hundreds of thousands or more), Prometheus is OOM-killed and you lose your metrics entirely.

The immediate fix is to remove the label from the metric in the next deploy and restart Prometheus. The old time series will age out after the retention window, but you may need to manually compact or restart to reclaim memory immediately.

Prevention:
- Code review discipline. Treat new metric labels with the same scrutiny as new database columns. Every label addition should be reviewed with the question: “What is the maximum cardinality of this label?”
- Automated cardinality limits. Prometheus supports per-scrape limits such as sample_limit and label_limit in the scrape configuration, and tools like Mimir and Thanos have per-tenant cardinality limits. Set a hard cap so that a cardinality explosion causes a clear error rather than silently degrading the entire system.
- A shared instrumentation library. Instead of letting developers create raw Prometheus metrics directly, provide a wrapper library that defines the approved label set for common metric types (HTTP request metrics, database query metrics, etc.). The library rejects unknown labels at compile time or raises an error at startup.
- A metric linting CI check. Static analysis on metric definitions to flag any label that uses a known high-cardinality pattern (a variable from user input, a path parameter, a UUID). promtool check metrics can lint a service’s exposed metrics as part of the CI pipeline, although catching high-cardinality labels usually needs a custom rule on top.
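The shared instrumentation library idea can be sketched as a small label validator. This is an illustration under stated assumptions, not a real Prometheus client API: APPROVED_LABELS, its cardinality bounds, and MAX_SERIES_PER_METRIC are hypothetical values a platform team would choose.

```python
from math import prod

# Hypothetical allowlist: label name -> assumed upper bound on distinct values.
APPROVED_LABELS = {"method": 7, "status": 10, "service": 50, "region": 12}
# Hypothetical per-metric budget for the worst-case number of time series.
MAX_SERIES_PER_METRIC = 100_000


def worst_case_series(cardinalities):
    """Worst-case series count is the product of the label cardinalities."""
    return prod(cardinalities)


def validate_labels(labels):
    """Reject unapproved labels and label sets that exceed the series budget."""
    unknown = sorted(set(labels) - set(APPROVED_LABELS))
    if unknown:
        raise ValueError(f"unapproved metric labels: {unknown}")
    series = worst_case_series(APPROVED_LABELS[name] for name in labels)
    if series > MAX_SERIES_PER_METRIC:
        raise ValueError(f"label set allows {series} series, budget is {MAX_SERIES_PER_METRIC}")
    return series
```

A wrapper would call validate_labels once when a metric is registered, so a user_id label fails at service startup instead of in production.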
Follow-up: How does Honeycomb handle high-cardinality data differently from Prometheus, architecturally?
Strong answer: The architectural difference is fundamental. Prometheus aggregates data into time series at ingest time. Each unique label combination becomes a time series, and Prometheus stores sampled data points (timestamp + value) for each series. This is extremely efficient for low-cardinality queries but breaks down at high cardinality because the number of time series explodes.

Honeycomb takes the opposite approach: it stores raw, unaggregated events in a purpose-built columnar data store, similar in spirit to ClickHouse-style engines. Each event is a wide row, potentially hundreds of columns, and aggregation happens at query time, not ingest time. When you query “p99 latency for user_id=abc123,” Honeycomb scans the relevant column, filters, and computes the percentile on the fly.

This means Honeycomb can handle user_id as a field with millions of unique values without any special consideration; it is just another column in the event store. The trade-off is that query-time aggregation is more expensive per query than reading a pre-aggregated time series, so Honeycomb queries are slower than Prometheus queries for simple aggregate metrics (total error rate, average latency). But Honeycomb can answer questions that Prometheus cannot, like “show me the latency distribution for requests from user X, on endpoint Y, in region Z, during the last 15 minutes.”

This is why the tool choice matters: if your debugging workflow requires high-cardinality breakdowns, Prometheus cannot do it regardless of configuration. It is an architectural limitation, not a tuning problem. Honeycomb, Datadog’s log analytics, and ClickHouse-backed solutions can, because they store raw events rather than pre-aggregated time series.
You notice that your distributed traces are frequently incomplete — spans are missing from some services in the chain. What are the possible causes and how do you fix them?
traceparent flags). All downstream services must respect the upstream decision. Check that your OTel SDK is configured to propagate (not override) the sampling decision.

2. Missing context propagation through async boundaries. The trace context propagates automatically through synchronous HTTP calls (via headers), but asynchronous boundaries (message queues, background jobs, cron triggers) often lose context. If Service A publishes a message to Kafka, and Service B consumes it, the trace context must be embedded in the Kafka message headers. If the message producer is not instrumented to attach trace context, the consumer creates a new root span, disconnected from the original trace. Check every async boundary in your architecture and verify trace context is being propagated through message headers.

3. Clock skew causing spans to be misattributed. If Service C’s clock is 30 seconds ahead of Service A’s clock, the OTel Collector or your tracing backend might misassociate spans or drop them as “too old.” This is particularly insidious because the spans exist but they do not get linked to the correct trace. Ensure NTP is running on all hosts and check for clock drift.

4. OTel Collector pipeline drops. The Collector has finite memory for buffering spans before export. Under load, if the Collector’s export queue fills up (because the backend is slow to ingest or the Collector is undersized), spans are dropped. Check the Collector’s otelcol_exporter_send_failed_spans and otelcol_processor_dropped_spans metrics. Increase the Collector’s memory limit and export batch size, or scale horizontally.

5. Mixed instrumentation libraries. If some services use OpenTelemetry, some use Jaeger client, and some use Datadog’s dd-trace, the context propagation formats might be incompatible. OTel uses W3C Trace Context by default, Jaeger uses uber-trace-id, and Datadog uses x-datadog-* headers. If Service A sends W3C headers and Service B only reads Jaeger headers, the trace context is lost. Standardize on OTel with W3C Trace Context, and configure the OTel SDK with multi-format propagators (tracecontext,b3,jaeger) for backward compatibility during migration.

6. Client library bugs or version mismatches. OTel is still evolving, and auto-instrumentation for some libraries has gaps. A Redis client that is not auto-instrumented will not generate spans for cache operations, leaving gaps in the trace. Check the OTel auto-instrumentation registry for your language and ensure all outbound calls have instrumentation.

My investigation approach:
- Look at incomplete traces and identify the pattern: which service’s spans are missing? Is it always the same service?
- Check that the missing service is running the OTel SDK and exporting spans (verify by checking the Collector’s otelcol_receiver_accepted_spans metric per service).
- Check that context propagation is working: add a test endpoint that logs the traceparent header it receives and the trace_id it uses for its spans. If they differ, propagation is broken.
- Check the Collector for dropped spans.
- Check for async boundaries that lack context propagation.
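Propagation across an async boundary can be illustrated without any tracing SDK. The sketch below, a simplification rather than the OTel implementation, writes and reads a W3C traceparent value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) in a plain dictionary standing in for message headers; real Kafka clients carry headers as byte pairs, and the helper names are hypothetical.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<trace-id>-<parent-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random trace ID and span ID, sampled flag in bit 0."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def inject(headers, traceparent):
    """Producer side: copy the headers and attach the current trace context."""
    out = dict(headers)
    out["traceparent"] = traceparent
    return out


def extract(headers):
    """Consumer side: return the upstream context, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the specification
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(parent):
    """Continue the same trace with a new span ID, keeping the sampling decision."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

If extract returns None on the consumer side, the consumer would start a new root span, which is exactly the disconnected-trace symptom described in cause 2.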
traceparent headers with the service’s own trace_id is a practical debugging technique that shows hands-on experience. The fact that they organize the investigation by starting with pattern identification (“which service is always missing?”) rather than checking everything randomly demonstrates systematic debugging skill.

Follow-up: You fix the context propagation issues. Now traces are complete, but you are ingesting 500GB of trace data per day and costs are unsustainable. How do you reduce this without introducing the incomplete-trace problem again?
Strong answer: The key is to sample whole traces, never individual spans. If you drop spans independently, you are back to incomplete traces.

Step 1: Implement tail-based sampling at the Collector. Move from “keep everything” to “keep what matters.” My rules:
- 100% of traces with errors.
- 100% of traces exceeding the p95 latency threshold.
- 100% of traces for critical business flows (payment, auth).
- 10% of successful, normal-latency traces, enough for baseline analysis.
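The sampling rules above can be expressed as a single per-trace decision. This is a minimal sketch, assuming the Collector has already assembled each complete trace into a summary dictionary; the field names (trace_id, has_error, duration_ms, flow) are hypothetical. In a real deployment this logic would typically live in the Collector's tail_sampling processor rather than in application code.

```python
import hashlib

CRITICAL_FLOWS = {"payment", "auth"}


def keep_trace(trace, latency_threshold_ms, baseline_rate=0.10):
    """Decide for a whole trace, so its spans are kept or dropped together."""
    if trace["has_error"]:
        return True
    if trace["duration_ms"] > latency_threshold_ms:
        return True
    if trace.get("flow") in CRITICAL_FLOWS:
        return True
    # Hash the trace ID into [0, 1) so every Collector replica reaches
    # the same decision for the same trace.
    digest = hashlib.sha256(trace["trace_id"].encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < baseline_rate
```

Hashing the trace ID instead of drawing a random number keeps the baseline sample deterministic, which matters when several Collector instances might see the same trace.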
traces_sampled_total and traces_dropped_total by sampling rule. If error traces are a disproportionate share of your retained data, it might indicate a systemic problem worth investigating, not just a cost optimization success.

This approach typically reduces trace storage by 80-90% while keeping 100% of diagnostically useful traces.
Your observability dashboards are all green -- latency is normal, error rate is zero, throughput is steady. But customer support tickets tripled overnight. How do you investigate, and what does this tell you about your observability gaps?
- “I cannot complete checkout” — a broken user flow
- “I see the wrong data” — a stale cache or data corruption issue
- “The page does not load” — a frontend/client-side issue generating no server traffic
- “I was charged but did not receive confirmation” — a downstream processing failure
- Frontend errors. If the JavaScript bundle has a bug that prevents button clicks from firing API calls, the server sees fewer requests but zero errors — because the requests never happen. Check RUM data for JavaScript errors, rage clicks, and session recordings. If you have no RUM, this is blind spot #1.
- Business logic correctness. The API returns 200 with a valid JSON body, but the data is wrong (stale cache, incorrect calculation, wrong experiment variant). Check business metrics: orders_completed_total, checkout_funnel_completion_rate, search_zero_results_rate. If you have no business metrics, this is blind spot #2.
- Downstream async failures. The synchronous API call succeeds (returns 200 to the user), but the async downstream processing (email, payment settlement, inventory update) fails. The user’s HTTP request was “successful” but the business outcome was not delivered. Check downstream job queues, dead letter queues, and async processing metrics. If you have none, this is blind spot #3.
- Third-party integration failures. Your system is healthy, but a third-party service (payment gateway, email provider, SMS service) is returning success codes but not actually processing. Check reconciliation: do the outcomes (email delivered, payment settled) match the requests?
- Add RUM if you have none (captures the frontend blind spot).
- Add business metrics for every critical user journey (captures the “correct but wrong” blind spot).
- Add end-to-end synthetic monitoring that completes a real transaction and verifies the outcome.
- Add async pipeline monitoring with dead letter queue alerts.
- “How do you convince your team to invest in business-level observability when the infrastructure dashboards have always been ‘enough’?” — “These support tickets are the evidence. I would calculate: 3x increase in support tickets * Y per day in support costs that our dashboards did not prevent. That is the ROI of business metrics.”
- “What is the cheapest first step you can take today to prevent this from happening again?” — “A synthetic transaction monitor. One script that runs every 5 minutes, completes the critical user journey end-to-end, and alerts if it fails. This catches the ‘everything looks fine but nothing works’ scenario with a few hours of engineering effort.”
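A synthetic transaction monitor of the kind described in the last answer can be very small. The sketch below assumes each step of the journey is supplied as a function that takes and returns a context dictionary and raises on failure; the step names and the runner itself are hypothetical, and real steps would make HTTP calls and verify the outcome in the database.

```python
def run_synthetic_journey(steps):
    """Run the journey steps in order; report the first step that fails."""
    context = {}
    for name, step in steps:
        try:
            context = step(context)
        except Exception as exc:  # any failure means the journey is broken
            return False, f"{name} failed: {exc}"
    return True, "ok"
```

A scheduler would call this every few minutes and page when the first element of the result is False, using the message to name the broken step.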
Advanced Interview Scenarios
These scenarios are designed to break interview autopilot. Several contain traps where the obvious answer is wrong. They test judgment under ambiguity, cross-domain thinking, and the kind of scar tissue you only earn from production incidents.
Your team deploys a new version of the user service. Within 30 minutes, your CDN is serving stale user profile data to millions of users. The origin is returning correct data. What happened and how do you fix it — both immediately and permanently?
The Scenario
You pushed a change that modified the user profile API response schema, adding a new verified_badge field. The origin server returns the correct response immediately after deploy, but users across the globe keep seeing the old response without the badge for up to 24 hours.

What weak candidates say: “Just purge the CDN cache.” They treat it as a one-time operational task and move on. They do not explain how the stale data got there, why a 24-hour TTL existed, or how to prevent recurrence. Some will blame the CDN provider.

What strong candidates say: The root cause is a cache lifetime that does not match how often the data actually changes. Here is what likely happened and how I would handle it:
- Immediate fix: Issue a CDN cache purge via the provider’s API: Cloudflare’s purge-by-prefix (/api/users/*), CloudFront’s invalidation (/api/users/*), or Fastly’s surrogate key purge if we tagged entries with user-profile. Purge propagation is not instant: typically seconds at Cloudflare and potentially several minutes for a CloudFront invalidation. During that window, users still see stale data. If the data is user-facing and critical, flip the Cache-Control header to no-cache temporarily at the origin so that responses fetched during the purge are not re-cached with the long TTL, then remove it once the purge completes.
- Root cause: Someone set Cache-Control: public, max-age=86400 on the user profile endpoint, a 24-hour TTL on personalized, mutable data. This is a classic mistake: treating dynamic API responses like static assets. The CDN cached the pre-deploy response and will serve it until the TTL expires. The deploy changed the origin, but the CDN does not know or care; it has a valid cached copy.
- Permanent fix, in three layers:
  - Set appropriate cache headers at the origin. User profiles should use Cache-Control: private, max-age=0 or Cache-Control: public, s-maxage=60, stale-while-revalidate=30. The second form gives a 60-second edge TTL with stale-while-revalidate, so the CDN serves the old version for up to 30 more seconds while fetching fresh data in the background. Never a 24-hour TTL on mutable data.
  - Use surrogate keys (cache tags) for surgical invalidation. Tag every cached user profile with a surrogate key like user:12345. When user 12345 updates their profile, hit the CDN purge API for that specific surrogate key. Fastly supports this natively via the Surrogate-Key header. Cloudflare supports it via Cache Tags on Enterprise plans. This avoids the nuclear option of purging all user profiles.
  - Add a deploy-time cache bust. Include a build version or deploy hash in the cache key, either as a query parameter or as a request header named in Vary, so every deploy naturally misses the CDN cache for changed responses. For API responses, append a version to the cache key: the CDN sees /api/users/123?v=deploy-abc as a new resource.
- War Story: At a company I worked at, we had a 4-hour outage of “correct” data because a marketing team member had set a CDN page rule with a 7-day TTL on all /api/* endpoints to “speed things up.” No engineer reviewed it. We only caught it because a customer complained that their updated shipping address was not showing in checkout, and checkout was reading from the CDN-cached API response, not the origin. We added a CI check that flags any CDN rule change affecting API paths and requires engineering approval. The lesson: CDN configuration is infrastructure code and belongs in version control, not a web UI.
Follow-up: What if you cannot purge the CDN fast enough and the stale data is causing users to see incorrect account balances?
Strong answer: If the data is financially sensitive and purge propagation is too slow, I would bypass the CDN cache entirely for that endpoint as an emergency measure. At the CDN configuration level, add a rule that forces requests for that path to skip the edge cache: Cloudflare can bypass cache for a path with a cache rule or a Worker, and CloudFront can attach a cache behavior with caching disabled to that path pattern. For the most critical case, temporarily point the DNS for the API subdomain directly to the origin, bypassing the CDN entirely. This is the nuclear option and you lose all CDN benefits (DDoS protection, latency reduction), but it guarantees freshness. Revert once the purge is confirmed propagated.

The deeper question is: should financially sensitive data ever be CDN-cached at all? For account balances, the answer is almost certainly no. The CDN should serve the application shell (HTML, JS, CSS) and the balance should be fetched client-side from an API endpoint that returns Cache-Control: no-store. This separation (cache the container, never cache the financial data) is a pattern every fintech learns eventually, usually the hard way.

Follow-up: How do you test that your CDN caching configuration is correct before it reaches production?
Strong answer: Three approaches, layered:
- Cache-control header assertions in integration tests. Every API endpoint test asserts on the Cache-Control header value. If someone changes the header, the test fails. This catches configuration drift at the code level.
- CDN staging environment. Mirror production CDN configuration in a staging environment and run synthetic requests that verify cache behavior: request a resource, mutate the origin, request again, and assert you get the fresh version within the expected TTL window.
- Production cache-header monitoring. A synthetic monitor (Datadog Synthetic, Checkly) hits production endpoints every 60 seconds and reports the Cache-Control, Age, X-Cache (HIT/MISS), and CF-Cache-Status headers as metrics. Alert if a dynamic endpoint starts returning Age values above your expected maximum TTL, because that means something is being cached that should not be.
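The header assertions and the Age check can share one helper. This is a minimal sketch, assuming response headers arrive as a dictionary with canonical names and well-formed numeric directives; the function names and the max_edge_ttl parameter are illustrative assumptions.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def check_dynamic_endpoint(headers, max_edge_ttl):
    """Return problems found for an endpoint that should only be cached briefly."""
    problems = []
    directives = parse_cache_control(headers.get("Cache-Control", ""))
    shared_cacheable = "private" not in directives and "no-store" not in directives
    if shared_cacheable:
        # For shared caches such as a CDN, s-maxage takes precedence over max-age.
        ttl = directives["s-maxage"] if "s-maxage" in directives else directives.get("max-age")
        if ttl is not None and int(ttl) > max_edge_ttl:
            problems.append(f"edge TTL {ttl}s exceeds {max_edge_ttl}s")
    if int(headers.get("Age", "0")) > max_edge_ttl:
        problems.append("Age header exceeds the expected maximum TTL")
    return problems
```

In an integration test the assertion would be that the returned list is empty; in a production monitor, a non-empty list would raise the alert. When stale-while-revalidate is in use, max_edge_ttl should include that window.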
A senior engineer proposes adding Redis caching to a write-heavy internal analytics pipeline that processes 50,000 events/second. Each event is unique. You believe this is a mistake. Make the case against caching here.
The Trap
The obvious-sounding answer is “caching is always good for performance.” This question tests whether you know when caching is actively harmful.

What weak candidates say: “Sure, cache the events in Redis and batch-write to the database.” They apply the caching hammer to every nail. Or they argue against it only on cost grounds (“Redis is expensive”) without understanding the fundamental mismatch.

What strong candidates say: This is a textbook example of when caching makes things worse, not better. Here is why:
- Cache hit ratio will be near zero. Caching optimizes for repeated reads of the same data. In a write-heavy analytics pipeline processing 50K unique events/second, each event is written once and either never read again or read once for aggregation. There is no temporal locality to exploit. A cache with a 0% hit ratio is not a cache. It is an expensive write buffer that adds latency (the Redis round-trip) and a failure mode (Redis goes down, events are lost) without any performance benefit.
- The real bottleneck is not reads; it is write throughput. The pipeline’s performance problem is ingesting 50K events/second into the analytics store. Caching does not help write throughput. What helps is batching (accumulate events in memory for 100ms, write a batch of 5,000 at once), write-optimized storage (ClickHouse, TimescaleDB, or Kafka as a durable buffer), and partitioning (shard writes across multiple database nodes).
- If the proposal is to use Redis as a write buffer (write-back cache): This is a durability risk. Redis with persistence disabled loses all buffered events on crash. Redis with AOF persistence at appendfsync everysec can lose up to 1 second of events. For analytics data that drives business decisions or billing, losing even 1 second of data at 50K events/sec means 50,000 missing records. If the data is truly disposable, a write buffer might be acceptable, but then Kafka is a far better write buffer than Redis because it provides durable, replayable, partitioned storage designed exactly for this workload.
- What I would actually recommend:
  - Kafka as the ingestion buffer. Producers write events to Kafka (which handles 50K events/sec trivially). Consumers read from Kafka and batch-insert into the analytics store. Kafka provides durability, replay, and backpressure handling that Redis does not.
  - ClickHouse or TimescaleDB as the analytics store. Both are designed for high-volume time-series inserts; ClickHouse can ingest millions of rows/sec on modest hardware.
  - Cache the aggregations, not the raw events. If downstream dashboards query “events per minute by category,” cache those results with a 30-second TTL. The aggregation query is expensive but the result is read many times, which is a perfect caching use case. The raw events themselves should never touch a cache.
- War Story: I once saw a team add Redis caching to a logging pipeline because “Redis is fast.” They cached the last 1 million log entries in Redis for a search feature. Redis memory usage hit 40GB within a week. The search feature was used maybe 5 times a day. The cost of the r6g.2xlarge ElastiCache instance ($4,300/year) exceeded the cost of just running an Elasticsearch instance that would have handled the search use case natively with better full-text search capabilities. They decommissioned the Redis cache and saved both money and operational complexity.
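The batching idea from the list above (accumulate for 100ms, write about 5,000 events at once, since 50,000 events/sec x 0.1s = 5,000) can be sketched as a small in-memory batcher. This is an illustration under stated assumptions: BatchWriter and its flush callback are hypothetical names, the time limit is only checked when an event arrives, and a production version would also need a background timer, error handling, and a durable source such as Kafka behind it.

```python
import time


class BatchWriter:
    """Buffer events and hand them to flush_fn in batches."""

    def __init__(self, flush_fn, max_batch=5_000, max_wait_s=0.1, clock=time.monotonic):
        self._flush_fn = flush_fn      # e.g. a bulk insert into the analytics store
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._clock = clock            # injectable so tests can control time
        self._buffer = []
        self._first_at = None

    def add(self, event):
        if not self._buffer:
            self._first_at = self._clock()  # start the window on the first event
        self._buffer.append(event)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._first_at >= self._max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)
```

Calling flush() explicitly on shutdown drains whatever is still buffered; anything lost between flushes is the durability gap that a log such as Kafka is meant to cover.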
Follow-up: The engineer pushes back and says “but we need sub-millisecond reads of recent events for the real-time dashboard.” Does that change your answer?
Strong answer: Partially. If there is a genuine read path, such as a real-time dashboard showing the last N events, then caching the most recent events in Redis makes sense for that specific read pattern. But I would scope it tightly: cache only the last 1,000-10,000 events in a Redis Sorted Set (scored by timestamp), use ZRANGEBYSCORE for time-range queries, and trim the set with ZREMRANGEBYRANK after each insert to cap memory usage (or use a Redis Stream with XADD MAXLEN instead). This is Redis being used as a window buffer for the dashboard, not as a cache for the write pipeline.

The write pipeline itself still goes directly to Kafka and the analytics store. The Redis buffer is a parallel read optimization, not on the write path. If Redis goes down, the dashboard shows “data temporarily unavailable” but no events are lost.

The key distinction: caching is for reads. Write-heavy pipelines need buffering (Kafka) and write-optimized storage (ClickHouse), not caching.

Follow-up: How would you measure whether a cache is actually helping? What metrics would prove it?
Strong answer: Four metrics, and they must all be positive for the cache to justify its existence:
- Hit ratio. Below 50%, the cache is causing more harm than good: the majority of requests pay the Redis round-trip cost and still fall through to the origin. Below 80%, question whether the caching strategy is correct for this workload. Above 90% is the target for most read-heavy workloads.
- Origin load reduction. Compare database query rate before and after caching. If the database QPS did not drop proportionally to the hit ratio, something is wrong — maybe the cache is intercepting cheap queries while the expensive ones still hit the database.
- Latency improvement at the percentiles that matter. Check p50, p95, and p99 latency with and without cache. If p50 improved but p99 got worse (because cache misses now have the overhead of checking Redis AND querying the database), the cache is making the tail worse.
- Total cost of ownership. Sum the Redis infrastructure cost, the engineering time to maintain cache logic and handle cache-related bugs, and compare against the alternative (scaling the database, adding a read replica, optimizing the query). If the cache costs more than the problem it solves, rip it out.
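The first and third metrics can be tied together with a little arithmetic. The sketch below assumes a simple model in which a hit costs one cache lookup and a miss costs the cache lookup plus the origin query; the function names and the latency figures in the usage note are illustrative assumptions. Redis exposes the raw counters as keyspace_hits and keyspace_misses in INFO stats.

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def expected_latency_ms(ratio, cache_ms, origin_ms):
    """Average read latency: hits pay the cache, misses pay cache plus origin."""
    return ratio * cache_ms + (1 - ratio) * (cache_ms + origin_ms)
```

With a 1ms cache and a 20ms origin, a 90% hit ratio gives about 3ms on average, while a 0% hit ratio gives 21ms, which is worse than having no cache at all. The average hides the tail, so the percentile comparison in the third metric is still needed.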
You are debugging an outage. Your Grafana dashboard shows all services green — latency normal, error rate near zero, throughput steady. But customer support is flooded with complaints that the checkout flow is broken. What is happening?
The Scenario
Every metric looks healthy. But real users cannot complete purchases. This is the nightmare scenario that exposes observability blind spots.

What weak candidates say: “The monitoring must be wrong, restart the services.” Or “check if the load balancer is healthy.” They have no framework for when metrics and reality diverge.

What strong candidates say: This is one of the scariest production scenarios: green dashboards during a real outage. It means our observability is measuring the wrong things. Here is my systematic approach:
- Hypothesis 1: We are measuring the wrong SLI. Our metrics measure HTTP status codes and response times. But the checkout flow can “succeed” (return 200 OK) while silently doing the wrong thing: charging the wrong amount, creating a duplicate order, returning a success page but not actually processing the payment. This is a correctness failure, not an availability failure. Our metrics are technically correct (the API returned 200 in 150ms) but the business outcome is wrong. I would immediately check the payment gateway dashboard for discrepancies, pull recent order records and verify they are complete, and check for error logs in downstream services (payment, inventory, email) that our service-level metrics would not capture.
- Hypothesis 2: We are monitoring the synthetic path, not the real user path. If our health checks and synthetic monitors hit the checkout endpoint with test data, they might succeed while real user traffic is routed differently. For example: a canary deployment put 5% of traffic on a broken new version, but our synthetics hit the stable version. Or a feature flag enabled a new payment provider for users in a specific region, and that provider is down. The metrics from the stable path are fine; the broken path has no dedicated monitoring. Check feature flag states, deployment canary status, and segment the metrics by user cohort or deployment version.
- Hypothesis 3: The failure is in client-side code, not server-side. A JavaScript error in the checkout page prevents the “Place Order” button from working. The server never receives the request, so server-side metrics show nothing. No HTTP request = no latency metric = no error counter increment = green dashboard. Check: Real User Monitoring (RUM) data if we have it (Datadog RUM, Sentry, LogRocket), JavaScript error tracking, and browser console errors. If we do not have client-side observability, this is a critical gap to fix after the incident.
- Hypothesis 4: A third-party dependency failure that we do not monitor. The checkout page loads a third-party fraud detection script, a payment iframe, or an address validation service. If that third-party is down or slow, the checkout page hangs or errors on the client side. Our server metrics are perfect because our server is not involved in the failure. Check third-party status pages (Stripe Status, Google Maps API, etc.) and client-side error logs.
- Hypothesis 5: DNS or certificate issue affecting a subset of users. A DNS change propagated partially, or a TLS certificate renewal failed for one of several domains used by the checkout flow. Users in some regions or on some DNS resolvers cannot reach the checkout service at all, while others (including our monitoring) can. Check certificate expiry, DNS propagation status, and look for geographic patterns in customer complaints.
- War Story: At a previous company, we had a 45-minute “invisible outage” where dashboards were completely green but zero Apple Pay orders were being processed. The root cause: a database migration added a NOT NULL constraint to a column that the checkout service was not populating for a specific payment method (Apple Pay). Regular credit card checkouts worked fine, and our synthetic monitors only tested credit cards. Apple Pay orders silently failed at the database layer, but the service caught the exception and returned a vague “Please try again” message to users with a 200 status code. The error counter never incremented because the code treated it as a “handled” error. We found it by querying the orders table directly and noticing the Apple Pay order count dropped to zero at the deployment timestamp. After that incident, we added business metric monitoring (orders/minute by payment method) alongside technical metrics. If orders_per_minute{payment_method="apple_pay"} drops to zero, that fires an alert regardless of what the HTTP metrics say.
Follow-up: After this incident, how do you redesign your observability to catch this class of failure?
Strong answer: The fundamental lesson is that technical metrics alone are insufficient for business-critical flows. You need business-level observability.
- Business KPI metrics. `orders_completed_total{payment_method, region, platform}`, `revenue_dollars_total`, `checkout_funnel_drop_off_rate`. These are computed from database events or application logs, not HTTP metrics. If revenue drops 50% while HTTP metrics are green, the business metric catches it.
- Semantic health checks. Instead of just `GET /health`, add a synthetic that exercises the full checkout flow end-to-end: it creates a test cart, submits a test order with a test payment method, and verifies the order appears in the database. If this “canary order” fails, page immediately. Companies like Amazon run thousands of these “canary transactions” per minute across every payment method and region.
- Client-side error monitoring. Deploy Sentry, Datadog RUM, or a similar tool that captures JavaScript errors, unhandled promise rejections, and user session recordings. This closes the gap where server-side observability is blind.
- Anomaly detection on business metrics. A sudden change in orders/minute, conversion rate, or average order value should trigger an alert — even if all technical metrics are healthy. Datadog Watchdog and Grafana ML can detect these anomalies automatically.
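The business-metric alerting described in this list can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production alert rule: the function name `find_silent_segments`, the 10%-of-baseline threshold, the minimum-traffic cutoff, and the sample counts are all hypothetical.

```python
from typing import Dict, List


def find_silent_segments(
    current: Dict[str, int],
    baseline: Dict[str, float],
    min_baseline: float = 5.0,
    drop_ratio: float = 0.1,
) -> List[str]:
    """Return payment methods whose order rate collapsed versus baseline.

    current:  orders seen in the last minute, keyed by payment method.
    baseline: typical orders/minute for the same time of day.
    A segment is flagged when its baseline is large enough to be meaningful
    (>= min_baseline) and the current count is below drop_ratio * baseline.
    """
    alerts = []
    for method, expected in baseline.items():
        if expected < min_baseline:
            continue  # too little traffic to tell an outage from noise
        observed = current.get(method, 0)  # a missing key means zero orders
        if observed < expected * drop_ratio:
            alerts.append(method)
    return sorted(alerts)


if __name__ == "__main__":
    # Hypothetical numbers: HTTP metrics could be green while apple_pay is silent.
    baseline = {"credit_card": 120.0, "apple_pay": 30.0, "gift_card": 1.0}
    current = {"credit_card": 118, "gift_card": 0}
    print(find_silent_segments(current, baseline))
```

Segmenting by payment method is the point of the design: an aggregate orders/minute metric would have dipped only slightly in the Apple Pay incident, while the per-segment check sees a drop to zero.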
Follow-up: How do you balance the cost of all this additional observability against the cost of the outages it prevents?
Strong answer: Frame it as a risk equation. Quantify the cost of the invisible outage: 45 minutes of zero Apple Pay orders, at your average Apple Pay revenue rate. If Apple Pay represents 15% of revenue and your hourly revenue is $100,000, that is an $11,250 loss in 45 minutes, plus the customer trust erosion you cannot easily quantify. The business metrics dashboard costs on the order of $50/month plus a day of engineering time. The ROI is clear after preventing a single incident. The harder sell is client-side monitoring, which can run $1,000-5,000/month depending on traffic volume. For that, I would start with error-only monitoring (Sentry free tier captures 5K errors/month) and upgrade if the error data proves valuable during investigations.
You inherit a system where the previous team stored user sessions in Redis with no TTL and no eviction policy set. Redis is at 98% memory capacity. The system is in production with 2 million active users. What do you do?
The Scenario
Redis is about to hit `maxmemory` with no eviction policy (`noeviction` is the default). When it does, every write will return an error. Sessions cannot be created, users cannot log in, and the system is effectively down. You cannot just restart Redis: you would lose all sessions and force 2 million users to re-authenticate simultaneously, which would DDoS your authentication service.
What weak candidates say: “Increase the Redis memory” or “Set `maxmemory-policy allkeys-lru`.” Both are partially right but miss critical nuances: increasing memory buys time but does not fix the leak, and LRU eviction on session data will randomly log users out.
What strong candidates say: This is a ticking time bomb and requires both an immediate stabilization and a proper fix. Here is my plan:
- Immediate (next 30 minutes):
  - Increase `maxmemory` temporarily via `CONFIG SET maxmemory <higher value>` if the host has available RAM. This buys time without a restart. Check `free -h` on the host: if there is 4GB free and Redis is at 12GB, set `maxmemory` to 14GB. This is a bandaid, not a fix.
  - Set `maxmemory-policy volatile-lru` via `CONFIG SET`. This tells Redis: when you hit the limit, evict keys that have a TTL set, starting with the least recently used. Since current sessions have no TTL, nothing can be evicted yet, so writes will still fail with OOM errors if we hit the limit before any TTLs exist. The setting only prepares Redis for the next step, when keys start carrying TTLs.
  - Do NOT set `allkeys-lru`. This would make every key an eviction candidate, and because Redis approximates LRU by sampling, it can evict sessions that were accessed only minutes ago. Users would be randomly logged out. `volatile-lru` only evicts keys with TTLs, which is safer.
- Short-term fix (next few hours):
  - Add TTLs to existing sessions. Write a script that iterates through session keys using `SCAN` (never `KEYS *`, which blocks the single-threaded event loop) and sets a TTL on each one with `EXPIRE`. Set the TTL to match your intended session lifetime, say 24 hours for “remember me” sessions and 30 minutes for standard sessions. Batch the operations to avoid overwhelming Redis: for example, process 1,000 keys per second with a small sleep between batches.
  - Deploy a code fix that sets a TTL on every session at creation time. This stops the bleeding: new sessions will expire naturally.
  - Identify and remove zombie sessions. Run `OBJECT IDLETIME` sampling across session keys to find sessions that have not been accessed in days or weeks. These are likely abandoned: the user closed their browser but the session persists forever. A session not accessed in 7 days is almost certainly safe to delete.
- Long-term fix (next sprint):
  - Implement sliding window TTL. Every time a session is accessed (user makes a request), reset the TTL to 30 minutes. Active users keep their sessions alive; inactive users expire naturally. This is `EXPIRE session:user123 1800` on every authenticated request.
  - Add session metrics. `redis_session_count` (gauge), `redis_session_created_total` (counter), `redis_session_expired_total` (counter), `redis_memory_used_bytes` (gauge). Alert if session count grows faster than user count (indicates a leak) or if memory usage exceeds 80% of `maxmemory`.
  - Evaluate whether Redis is the right session store. For 2 million sessions with 24-hour TTLs, estimate the memory: if each session is 2KB, that is 4GB. Manageable. But if sessions grow (storing cart data, user preferences, feature flags in the session), the memory cost scales linearly with users. Consider: is a database-backed session store (PostgreSQL with row-level TTL via `pg_cron`, or DynamoDB with TTL) more appropriate? The latency difference (1ms Redis vs 5ms database) may not matter for session lookups that happen once per request.
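The sliding-window TTL from the long-term fix could look roughly like the sketch below. It assumes a client exposing redis-py style `get` and `expire` methods and the `session:` key prefix used in the example above; the function name `load_session` is hypothetical.

```python
SESSION_TTL_SECONDS = 1800  # 30 minutes, matching the example above


def load_session(client, session_id: str, ttl: int = SESSION_TTL_SECONDS):
    """Fetch a session and slide its expiry forward on every access.

    `client` is assumed to expose redis-py style `get` and `expire` methods.
    Returns the stored value, or None if the session expired or never existed.
    """
    key = f"session:{session_id}"
    value = client.get(key)
    if value is None:
        return None  # expired or unknown: the caller should force a re-login
    client.expire(key, ttl)  # reset the countdown so active users stay logged in
    return value
```

This costs two round trips per request. On Redis 6.2 or later, the `GETEX` command can read the value and reset the TTL in one call, which may be worth using if session lookups sit on a hot path.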
- War Story: I have personally seen a Redis instance go from 0% to 100% memory in 3 days because a session library was creating a new session for every bot request. Googlebot, Bingbot, health check bots, monitoring bots — each got a unique session that never expired. The fix was twofold: do not create sessions for requests without a valid user-agent or authentication token, and always set TTLs. The session count dropped from 8 million to 200,000 overnight after the fix deployed.
Follow-up: The SCAN-and-EXPIRE script you mentioned — how do you run it safely on a production Redis instance handling 50,000 operations/second without impacting latency?
Strong answer: Redis executes commands on a single thread, so every operation you issue competes with production traffic. The safety measures:
- Use `SCAN` with a small `COUNT` hint (e.g., `COUNT 100`). Each `SCAN` call does a small, bounded amount of work and returns a batch of keys, so it does not block like `KEYS`. Process each batch, then issue the next `SCAN` with the returned cursor.
- Pipeline the `EXPIRE` commands. Instead of sending one `EXPIRE` per round-trip, batch 50-100 `EXPIRE` commands into a single pipeline. This reduces network overhead and minimizes the time Redis spends on your operations.
- Throttle between batches. After each batch of 100-500 key expirations, sleep for 50-100ms. At 50K ops/sec, Redis processes one command roughly every 20 microseconds, so a batch of 100 `EXPIRE` commands takes about 2ms. A 50ms sleep between batches means your script uses about 4% of Redis’s capacity. Monitor `redis-cli --latency` during the script to verify you are not causing latency spikes.
- Run during low-traffic hours if possible. If your traffic has a daily trough (e.g., 3 AM local time), schedule the script then.
- Target a specific key pattern. If sessions use a `session:*` prefix, use `SCAN 0 MATCH session:* COUNT 100` to avoid touching non-session keys.
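The measures above can be combined into one throttled backfill loop. The following is a sketch, not a hardened tool: it assumes a redis-py style client (`scan`, `pipeline`, and pipelined `ttl`/`expire`), and the function name `backfill_ttls`, the 24-hour default, and the batch and pause sizes are illustrative choices.

```python
import time


def backfill_ttls(client, pattern="session:*", ttl=86400,
                  scan_count=100, pause_seconds=0.05, sleep=time.sleep):
    """Give every matching key that lacks a TTL an expiry, in throttled batches.

    `client` is assumed to be a redis-py style client (scan, pipeline).
    Returns the number of keys that received a TTL.
    """
    updated = 0
    cursor = 0
    while True:
        # SCAN returns a new cursor plus a (possibly empty) batch of keys.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=scan_count)
        if keys:
            # First pipeline: read the current TTLs in one round trip.
            pipe = client.pipeline(transaction=False)
            for key in keys:
                pipe.ttl(key)
            ttls = pipe.execute()

            # TTL returns -1 for keys that exist but have no expiry.
            missing = [k for k, t in zip(keys, ttls) if t == -1]
            if missing:
                # Second pipeline: set expiries only where none exists.
                pipe = client.pipeline(transaction=False)
                for key in missing:
                    pipe.expire(key, ttl)
                pipe.execute()
                updated += len(missing)
        if cursor == 0:
            break  # a zero cursor means the full keyspace has been visited
        sleep(pause_seconds)  # throttle so production traffic keeps priority
    return updated
```

Skipping keys that already have a TTL makes the script safe to re-run after an interruption, and it avoids extending sessions that the newly deployed code has already given a shorter expiry.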
Follow-up: After fixing the TTL issue, how do you prevent this class of problem from ever happening again — at the organizational level, not just the technical level?
Strong answer: The real failure here was not a missing TTL. It was the absence of a standard that required TTLs and the absence of monitoring that would have caught the growth before it became critical.
- Redis usage policy. Document and enforce: every key written to Redis must have a TTL. No exceptions. If data must persist indefinitely, it does not belong in Redis; use a database. Make this a code review checklist item.
- Linting/static analysis. If your language supports it, write a linting rule that flags Redis `SET` commands without an `EX` or `PX` parameter. In Go, wrap the Redis client so that the `Set` method requires a TTL parameter. Make the “wrong” thing hard to do.
- Automated alerting on key growth. Monitor `dbsize` (total key count) and `used_memory` over time. Alert on rate-of-change: if the key count is growing faster than your user count, something is leaking. A dashboard showing “keys per active user” reveals leaks instantly.
- Periodic Redis audits. Monthly: run `SCAN` + `OBJECT IDLETIME` sampling to find stale keys. Quarterly: review memory usage by key prefix to identify unexpected growth. This is the Redis equivalent of database table bloat monitoring.
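The “make the wrong thing hard to do” idea is described above for Go; the same shape in Python might look like the sketch below. The class name `TtlRequiredCache` and the 7-day ceiling are assumptions for illustration, and the wrapped client is assumed to accept redis-py's `set(name, value, ex=...)` form.

```python
class TtlRequiredCache:
    """Thin wrapper whose write path cannot be called without a TTL."""

    def __init__(self, client, max_ttl_seconds: int = 7 * 24 * 3600):
        self._client = client  # assumed redis-py style: set(name, value, ex=...)
        self._max_ttl = max_ttl_seconds

    def set(self, key: str, value, *, ttl_seconds: int):
        # ttl_seconds is keyword-only with no default, so omitting it
        # raises TypeError at the call site instead of creating an immortal key.
        if not 0 < ttl_seconds <= self._max_ttl:
            raise ValueError(f"ttl_seconds must be in (0, {self._max_ttl}]")
        return self._client.set(key, value, ex=ttl_seconds)

    def get(self, key: str):
        return self._client.get(key)
```

If application code only ever receives this wrapper rather than the raw client, the policy is enforced by the type of the object, and code review only needs to check that nobody imports the raw client directly.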
Your observability pipeline — Fluentd collecting logs, shipping to Elasticsearch — starts dropping logs during a traffic spike. Engineers discover gaps in the logs exactly when they need them most: during the incident caused by the spike. How do you redesign the pipeline to handle backpressure?
The Scenario
Your logging infrastructure is a single point of failure for incident response. During the exact moments when you need logs most (outages, spikes), the pipeline drops them because it cannot handle the volume. This is the observability paradox: the system fails when you need it most.
What weak candidates say: “Scale up Elasticsearch” or “Add more Fluentd instances.” They treat it as a capacity problem. It is partially a capacity problem, but the fundamental issue is the absence of backpressure handling and buffering in the pipeline.
What strong candidates say: This is the classic “you need your parachute most when you are falling” problem in observability. The pipeline must be designed to degrade gracefully under load, never to drop data silently. Here is how I would redesign it:
- Root cause analysis first. Where exactly are logs being dropped? The pipeline has three failure points: (a) input buffer overflow, where logs arrive faster than Fluentd can process them, (b) output buffer overflow, where Fluentd processes logs but Elasticsearch cannot ingest fast enough, or (c) Elasticsearch itself rejecting writes due to thread pool saturation or disk pressure. Check Fluentd’s `buffer_queue_length`, `buffer_total_queued_size`, and `retry_count` metrics. Check Elasticsearch’s `thread_pool.write.rejected` and `indexing_pressure` metrics. The fix depends on which component is the bottleneck.
- Redesigned pipeline architecture:
  - Add Kafka as a durable buffer between Fluentd and Elasticsearch. This is the single most impactful change. Instead of `Fluentd -> Elasticsearch`, do `Fluentd -> Kafka -> Logstash/Vector -> Elasticsearch`. Kafka absorbs traffic spikes by buffering messages on disk. It can handle millions of messages/second and retain data for days. If Elasticsearch falls behind, Kafka holds the logs until ES catches up. As long as the backlog fits within Kafka's retention window and disk capacity, you do not lose data; you just see it with a delay.
  - Tune Fluentd’s buffer configuration. Increase `buffer.chunk_limit_size` (how much data each chunk holds before flushing) and `buffer.total_limit_size` (how much data can be buffered in total), and configure `overflow_action` to `block` (slow down the input) rather than `drop_oldest_chunk` (lose data). Use file-based buffering (`@type file`) instead of memory-based: disk is cheaper and survives Fluentd restarts.
  - Implement priority-based routing. Not all logs are equal during an incident. Route `ERROR` and `WARN` logs to a dedicated high-priority Kafka topic with guaranteed delivery. Route `INFO` and `DEBUG` logs to a standard topic that can be sampled or dropped under pressure. This ensures that during a spike, you keep the most valuable logs even if you lose the noise.
  - Add a dead letter queue. Any log that fails to be written to Elasticsearch after N retries goes to a DLQ (S3 bucket, dedicated Kafka topic). After the incident, replay the DLQ into Elasticsearch to fill the gaps. This turns “data loss” into “data delay.”
  - Consider Vector as a Fluentd replacement. Vector (by Datadog, open-source) is written in Rust and can sustain much higher throughput than Fluentd on the same hardware (the 10x figure is a rough estimate; benchmark with your own workload). It has built-in adaptive concurrency, backpressure propagation, and disk-based buffering. If Fluentd is your bottleneck, the migration might be more effective than tuning Fluentd.
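The priority-based routing idea can be expressed as two small decisions: which topic a record belongs to, and whether to keep it when the pipeline is under pressure. The sketch below is pipeline-agnostic Python for illustration only; the topic names, the `level` and `trace_id` field names, and the function names are assumptions, and in practice this logic would live in the collector configuration.

```python
import zlib

HIGH_PRIORITY_TOPIC = "logs.high"      # hypothetical topic names
STANDARD_TOPIC = "logs.standard"
HIGH_PRIORITY_LEVELS = {"ERROR", "WARN"}


def choose_topic(record: dict) -> str:
    """Route by severity: ERROR/WARN go to the high-priority topic."""
    level = str(record.get("level", "")).upper()
    return HIGH_PRIORITY_TOPIC if level in HIGH_PRIORITY_LEVELS else STANDARD_TOPIC


def keep_under_pressure(record: dict, keep_percent: int) -> bool:
    """Always keep high-priority logs; sample the rest deterministically.

    keep_percent is the share (0-100) of low-priority logs to retain.
    """
    if choose_topic(record) == HIGH_PRIORITY_TOPIC:
        return True
    # Hash a stable field so all logs of one request are kept or dropped together.
    key = str(record.get("trace_id", record.get("message", "")))
    return zlib.crc32(key.encode("utf-8")) % 100 < keep_percent


if __name__ == "__main__":
    error = {"level": "error", "trace_id": "abc", "message": "payment failed"}
    info = {"level": "INFO", "trace_id": "abc", "message": "cart viewed"}
    print(choose_topic(error), choose_topic(info))
    print(keep_under_pressure(error, 0), keep_under_pressure(info, 0))
```

Sampling on a hash of the trace ID rather than at random is a deliberate choice: when a low-priority request survives sampling, all of its log lines survive, so the remaining logs are still coherent.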
- War Story: At a previous company, we lost 4 hours of logs during a Black Friday traffic spike — exactly the period where we later discovered a payment processing bug. The investigation took 3 days instead of 3 hours because we had no logs for the critical window. We redesigned the pipeline with Kafka in the middle and priority routing. The next Black Friday, we hit 5x normal traffic, Elasticsearch fell 20 minutes behind on ingestion, but zero logs were lost. The 20-minute delay was invisible to engineers because Kafka buffered everything. The Kafka cluster cost us 50,000 in engineering time and delayed customer refunds. The math was easy.
Follow-up: You mentioned priority-based routing for ERROR vs INFO logs. How would you implement this without requiring every application team to change their logging code?
Strong answer: Handle it in the pipeline, not in application code. The OTel Collector or Fluentd can inspect log content and route accordingly:
- In Fluentd: Use the `rewrite_tag_filter` plugin to inspect the `level` field of each log record and re-tag it (e.g., `log.error`, `log.info`). Then use `<match log.error>` to route to the high-priority Kafka topic and `<match log.info>` to the standard topic.
- In OTel Collector: Use the `attributes` processor to inspect log severity and the `routing` connector to send different severities to different exporters.
- In Vector: Use `route` transforms with conditions like `.level == "ERROR"` to split the stream.
All three approaches depend on structured logs (with a consistent `level` field). If your applications emit unstructured text logs, the pipeline needs a parser stage to extract the level. Regex parsing is fragile, so this is another argument for mandating structured logging from day one.
Follow-up: Elasticsearch is your biggest cost in this pipeline. How would you reduce log storage costs by 60% without losing query capability for incidents?
Strong answer: Tiered storage with intelligent retention:
- Hot tier (7 days): Full logs in Elasticsearch on SSD-backed nodes. This is where engineers query during incidents. Index only the fields you actually query on: do not index raw message bodies if you only search by `trace_id`, `service`, `level`, and `timestamp`.
- Warm tier (30 days): Older indices rolled to warm nodes (HDD-backed, fewer replicas). Reduce replica count from 2 to 1. Use `_forcemerge` to compact segments and reduce storage. Queries are slower but still possible.
- Cold/Frozen tier (90-365 days, if compliance requires): Indices stored in S3 via Elasticsearch’s searchable snapshots (or move to Grafana Loki backed by S3, which is significantly cheaper than Elasticsearch for cold storage). Queries take 10-30 seconds but are available for forensic investigation.
- Drop fields aggressively. Do you really need `user_agent`, `referrer`, and `request_headers` in your stored logs? These fields are useful for debugging specific issues but are rarely queried. Drop or sample them in the pipeline to reduce storage by 30-40%.
- Use Index Lifecycle Management (ILM) in Elasticsearch to automate the hot -> warm -> cold -> delete transitions based on index age.
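As a rough illustration of the tiers above, the sketch below builds an ILM policy body as a Python dictionary. The phase boundaries (warm after 7 days, cold after 30 days, delete after 365 days), the rollover thresholds, and the repository name are assumptions chosen to mirror the list; searchable snapshots also depend on your Elasticsearch version and subscription level, so treat the cold phase as optional.

```python
import json


def build_ilm_policy(snapshot_repository: str) -> dict:
    """Return an ILM policy body mirroring the hot/warm/cold/delete tiers.

    snapshot_repository is the name of an already-registered snapshot
    repository (for example one backed by S3); the name is hypothetical.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll to a new index daily or when shards grow large.
                        "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                    }
                },
                "warm": {
                    "min_age": "7d",
                    "actions": {
                        "allocate": {"number_of_replicas": 1},   # 2 -> 1 replicas
                        "forcemerge": {"max_num_segments": 1},   # compact segments
                    },
                },
                "cold": {
                    "min_age": "30d",
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repository
                        }
                    },
                },
                "delete": {"min_age": "365d", "actions": {"delete": {}}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy("logs-s3-repo"), indent=2))
```

The resulting JSON is the kind of body you would send to the ILM policy API; keeping it in code makes retention changes reviewable like any other change.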
Your team adopts OpenTelemetry and instruments all 15 services. Three months later, nobody uses the traces during incidents. Engineers still grep logs. What went wrong and how do you fix it?
The Scenario
You made the technical investment: OTel is deployed, traces are flowing, Jaeger is running. But when the pager goes off at 2 AM, engineers open Kibana and search logs. The tracing infrastructure is shelfware.
What weak candidates say: “We need to train the team on how to use Jaeger.” Training is necessary but insufficient. If the tool does not fit the workflow, no amount of training changes behavior.
What strong candidates say: This is an adoption failure, not a technology failure. I have seen this happen multiple times, and the causes are predictable:
- Cause 1: Traces are not discoverable from the alert. When the pager fires, the engineer sees an alert with a service name, an error message, and maybe a dashboard link. There is no direct link to a relevant trace. The engineer would need to open Jaeger, figure out the query syntax, guess at the right time window, and find the relevant trace, all at 2 AM under stress. They fall back to what they know: grep the logs. Fix: Every alert must include a deep link to a pre-filtered trace search. In Grafana, use the `traceID` field in log lines to create a “View Trace” link that opens the trace in Tempo/Jaeger with one click. In Datadog, traces are automatically linked to logs and alerts. The path from alert to trace must be zero friction.
traceIDfield in log lines to create a “View Trace” link that opens the trace in Tempo/Jaeger with one click. In Datadog, traces are automatically linked to logs and alerts. The path from alert to trace must be zero friction. - Cause 2: Traces are incomplete or noisy. If auto-instrumentation captured HTTP spans but missed database queries, Redis calls, and message queue operations, the traces show a series of HTTP calls with unexplained gaps. Engineers look at the trace, see nothing useful, and go back to logs. Fix: Audit trace completeness for the top 5 most-investigated request paths. Manually add spans for uninstrumented operations. The goal is that when an engineer opens a trace for a slow checkout request, they see every significant operation — not just the HTTP hops.
- Cause 3: No one demonstrated the workflow during a real incident. Engineers learn investigative tools by watching someone else use them effectively under pressure. If no one has ever said “here, let me show you how I found the root cause in 30 seconds using this trace” during a live incident, the tool remains theoretical. Fix: During the next incident, the most experienced engineer should deliberately use tracing as their primary investigation tool while sharing their screen. In the postmortem, show the trace that revealed the root cause and explain how it was found. This single demonstration is worth more than 10 training sessions.
- Cause 4: The investigation workflow does not start with traces. Most engineers start with metrics (dashboard) or logs (search for the error message). Traces are useful when you already know which request to investigate. If there is no easy path from “error rate is high” to “here are the traces of failing requests,” tracing is an island. Fix: Build the bridge. Add exemplars to metrics — Prometheus supports exemplars that link a metric data point to a specific trace ID. When an engineer sees a latency spike on the Grafana dashboard, they click on the spike and see the trace IDs of the slowest requests. One click takes them to the trace. Grafana supports this workflow natively with Tempo and Prometheus.
- Cause 5: Jaeger’s UI is not good enough. This sounds petty but it matters. Jaeger’s default UI is functional but minimal. The search UX is clunky, the timeline visualization is basic, and comparing two traces side-by-side is not straightforward. Engineers accustomed to the polish of Kibana or Datadog find Jaeger frustrating. Fix: Consider Grafana Tempo with the Grafana trace viewer (better UX than standalone Jaeger), or if budget allows, Datadog APM or Honeycomb (which has the best trace investigation UX in the industry). Tool UX directly impacts adoption.
- War Story: At a company I was at, trace adoption went from ~5% to ~80% usage during incidents after one change: we added a Slack bot that, when an alert fired, posted the alert details along with 3 links: the Grafana dashboard for the affected service, the Kibana log search pre-filtered by the service and time window, and a Jaeger search pre-filtered to error traces for that service in the last 15 minutes. Engineers naturally started clicking all three links. Within a month, they were starting with traces because they found root causes faster that way. The bottleneck was never training — it was making the tool accessible at the moment of need.
Follow-up: How do you measure whether your observability investment is actually reducing incident resolution time?
Strong answer: Track four metrics over time:
- Mean Time to Detect (MTTD): Time from the start of the incident to the first alert firing. This measures your alerting quality. Should decrease as you improve alert coverage and reduce false negatives.
- Mean Time to Root Cause (MTTRC): Time from the alert firing to identifying the root cause. This directly measures your observability’s diagnostic value. Track this per incident and look for trends. If MTTRC is not decreasing after deploying tracing, the traces are not being used or are not useful.
- Mean Time to Resolve (MTTR): Time from the alert to resolution. This includes mitigation and fix, so MTTR = MTTRC + time-to-fix. Total incident duration, measured from the start of impact, is MTTD + MTTR.
- Investigation method used. In postmortems, record which tools the investigator used to find the root cause: logs, metrics, traces, database queries, or “asked another engineer.” Track the distribution over time. If traces never appear in this list despite being deployed, you have an adoption problem.
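The timing metrics above fall out of four timestamps per incident. The sketch below shows one way to compute them, using the definitions in this list (MTTR is measured from the alert); the `Incident` record and its field names are hypothetical, and in practice the timestamps would come from your incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    started: datetime      # when user impact began (often reconstructed later)
    alerted: datetime      # first alert fired
    root_caused: datetime  # root cause identified
    resolved: datetime     # service restored


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    """Return mean detection, diagnosis, and resolution times in minutes."""
    def avg_minutes(deltas) -> float:
        return mean(d.total_seconds() for d in deltas) / 60

    return {
        "mttd_min": avg_minutes(i.alerted - i.started for i in incidents),
        "mttrc_min": avg_minutes(i.root_caused - i.alerted for i in incidents),
        "mttr_min": avg_minutes(i.resolved - i.alerted for i in incidents),
    }


if __name__ == "__main__":
    day = datetime(2024, 1, 15)
    incidents = [
        Incident(day.replace(hour=10, minute=0), day.replace(hour=10, minute=5),
                 day.replace(hour=10, minute=35), day.replace(hour=10, minute=50)),
    ]
    print(summarize(incidents))
```

Comparing `mttrc_min` for the quarters before and after a tracing rollout is the most direct test of whether the investment changed how quickly engineers find root causes.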
Follow-up: What is the single highest-ROI observability improvement you have ever made at a company?
Strong answer: Adding `trace_id` to every structured log line and making it a clickable link in our log viewer. Cost: approximately 4 hours of engineering work across 12 services (shared logging library update). Impact: investigation time for cross-service issues dropped from an average of 45 minutes to under 10 minutes. Before this change, engineers would find an error in one service’s logs, then manually search other services’ logs by timestamp trying to correlate events. After the change, they click the trace ID, see the entire request path, and immediately identify which downstream service caused the failure. This single change delivered more value than the entire tracing deployment, not because tracing is not valuable, but because connecting logs to traces is what makes both usable. Neither is sufficient alone.
Your company runs a multi-region deployment (us-east-1 and eu-west-1). A customer in Germany reports that they see their old profile photo for 10 minutes after uploading a new one. The customer in New York sees the update instantly. Diagnose this.
The Scenario
Cross-region cache consistency is one of the hardest problems in distributed systems, made worse by the fact that it only affects a subset of users and is intermittent.
What weak candidates say: “The European CDN cache has the old image.” Correct but shallow: they do not explain the replication mechanism, the consistency model, or how to fix it without destroying CDN performance for European users.
What strong candidates say: This is a multi-layer cross-region consistency problem. Let me trace the full path to identify where staleness can hide:
- Layer 1: Object storage replication lag. If the profile photo is stored in S3 with cross-region replication (CRR) to an EU bucket, there is a replication delay, typically seconds, but it can spike to minutes during high load. The user uploaded to `us-east-1`, the EU application reads from the EU bucket, and the old image is still there. Check: S3 CRR metrics, `ReplicationLatency` and `OperationsFailedReplication`, in CloudWatch. If replication is lagging, the EU bucket has stale data.
- Layer 2: CDN caching the old image. The CDN edge node in Frankfurt cached the old profile photo with a long TTL. Even after the S3 bucket in EU is updated, the CDN serves its cached copy. Check: Request the image URL with `curl -I` from an EU location and inspect the `Age` header and `X-Cache: HIT` status. If `Age` is high, the CDN has a stale copy. Fix: On upload, explicitly purge the CDN cache for that image URL. Use Cloudflare’s `Purge by URL` API or CloudFront invalidation. If using content-hashed URLs (`photo_abc123.jpg`), the new upload gets a new URL and the CDN naturally misses. This is the preferred approach but requires the application to update the URL reference atomically.
- Layer 3: Application-level caching of the image URL. The user profile API response includes `profile_photo_url`. If this response is cached in Redis with a 10-minute TTL, the EU application serves the old URL even after the new image is uploaded and replicated. The user requests their profile, gets the cached response with the old URL, and sees the old photo. Check: Query Redis in the EU region for the user’s profile cache key and compare the `profile_photo_url` with the actual latest URL in the database.
- Layer 4: Database replication lag. If the EU region reads from a database read replica that replicates from the US primary, the profile update (including the new `profile_photo_url`) may not have propagated yet. PostgreSQL streaming replication is typically sub-second, but during heavy write load, replica lag can spike. Check: `pg_stat_replication` on the primary and `pg_last_wal_replay_lsn()` on the replica. DynamoDB global tables have similar eventual consistency behavior: writes propagate to other regions asynchronously with “typically sub-second” latency but no hard guarantee.
- The fix requires addressing all four layers:
  - Use content-hashed image URLs so CDN cache is naturally busted on upload.
  - On profile update, invalidate the Redis cache key in all regions using a cross-region invalidation event (SNS cross-region subscription or Kafka MirrorMaker).
  - For the author of the upload (read-your-writes consistency), route the user’s subsequent reads to the primary region for 30 seconds after the write. This ensures the uploader sees their change immediately while other users converge within the replication window.
  - Set appropriate S3 CRR monitoring alerts so you know when replication is lagging.
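Two of the fixes above are small enough to sketch directly: deriving a content-hashed object key, and pinning a user's reads to the primary region for a short window after they write. This is an illustrative sketch only; the key layout, the 16-character digest prefix, and the function names are assumptions, and the region and window values are taken from the scenario.

```python
import hashlib
from typing import Optional

PRIMARY_REGION = "us-east-1"
READ_YOUR_WRITES_WINDOW_S = 30


def content_hashed_key(user_id: str, image_bytes: bytes, ext: str = "jpg") -> str:
    """Build an object key that changes whenever the image content changes.

    A new upload yields a new URL, so CDN edges never hold a stale copy
    under the URL that the profile now references.
    """
    digest = hashlib.sha256(image_bytes).hexdigest()[:16]
    return f"profiles/{user_id}/photo_{digest}.{ext}"


def region_for_read(local_region: str, last_write_at: Optional[float], now: float) -> str:
    """Route a user's reads to the primary region shortly after their own write.

    last_write_at and now are Unix timestamps in seconds; last_write_at is
    None when the user has not written recently.
    """
    if last_write_at is not None and now - last_write_at < READ_YOUR_WRITES_WINDOW_S:
        return PRIMARY_REGION
    return local_region


if __name__ == "__main__":
    print(content_hashed_key("user123", b"new photo bytes"))
    print(region_for_read("eu-west-1", last_write_at=100.0, now=110.0))
```

The last-write timestamp has to travel with the user rather than live in a regional store that is itself replicated lazily; a short-lived cookie or a claim in the session token is one common way to carry it.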
- War Story: A social media company I know about had this exact issue — users in Europe saw stale profile data for up to 15 minutes after updates. The root cause turned out to be a combination of Layer 2 (CDN) and Layer 3 (application cache). The CDN TTL was 5 minutes and the Redis TTL was 10 minutes, and in the worst case, a user would hit a CDN cache populated just before the update, which contained a Redis-cached URL from just before the update — stacking the staleness windows. The fix was: content-hashed image URLs (eliminating Layer 2), event-driven Redis invalidation (fixing Layer 3), and read-your-writes routing for the uploader (fixing the perception problem). Total staleness after the fix: under 2 seconds for the uploader, under 30 seconds for other users in the same region, under 60 seconds for users in the remote region.
Follow-up: The product team says “we want globally consistent reads — every user everywhere should see the update within 1 second.” What do you tell them?
Strong answer: I would explain the physics and then negotiate. The network round-trip between US-East and EU-West is approximately 85ms, and the speed of light in fiber puts a hard floor under that number. Any replication mechanism adds overhead on top of that. Achieving true sub-second global consistency requires one of two approaches:
- Strong consistency with global routing. All writes and reads go to a single primary region. EU users experience roughly 85-170ms of additional latency on every request (one or two transatlantic round-trips). This guarantees consistency but degrades performance for half your user base. For a profile photo update, this is probably not worth it.
- Synchronous cross-region replication. The write does not return success until the data is replicated to all regions. This adds 85-200ms to every write operation and creates a cross-region dependency: if the EU region is unreachable, writes fail globally. This is the approach DynamoDB global tables in “strongly consistent” mode or CockroachDB’s multi-region configuration use, but both come with significant latency and availability trade-offs.
You set up an SLO of 99.9% availability for your API. At the end of Q1, you have consumed 120% of your error budget — you breached the SLO. But the product team shipped 3 major features on time and revenue grew 15%. Was the SLO the right target?
The Trap
The obvious answer is “the SLO was too aggressive, so we should lower it to 99.5%.” But that misses the entire point of SLOs. This question tests whether you understand SLOs as a decision framework, not a score.
What weak candidates say: “We need to lower the SLO to something achievable.” Or “We need to improve reliability so we meet 99.9%.” Both treat the SLO as a report card to pass or fail, rather than as a tool for making trade-offs.
What strong candidates say: This question gets at the heart of what SLOs are for. The answer is not about whether 99.9% is “right” or “wrong.” It is about whether the team used the SLO to make informed decisions, and whether the breach had acceptable consequences.
- First, assess whether the breach was a conscious choice or an accident. If the team knowingly shipped features that burned error budget, understood the risk, and the business result (15% revenue growth) justified the reliability trade-off, the SLO worked perfectly. It surfaced the trade-off and the team made a deliberate decision. That is the entire purpose of error budgets: to create a structured conversation between “ship fast” and “be reliable.”
- If the breach was accidental — the team did not realize they were burning budget until the quarter ended — the problem is not the SLO target, it is the observability and process. The team should have had burn-rate alerts that warned them mid-quarter. They should have had a policy discussion about what to do when the budget runs low. The SLO target might be fine; the feedback loop is broken.
- Evaluate whether 99.9% is the right target for the users, not the engineering team. The SLO should reflect user expectations and business impact. If the API serves an internal tool where 99.5% is perfectly acceptable and nobody complains, 99.9% is over-engineering reliability at the expense of feature velocity. If the API serves external paying customers who churn when they see errors, 99.9% might even be too low. The right SLO is the one where breaching it causes unacceptable user or business impact — and “unacceptable” is a product decision, not an engineering one.
What I would actually do after this quarter:
- Hold a retrospective — not to assign blame, but to answer: “Did we have the data to make conscious trade-offs? Or were we flying blind?” If blind, fix the feedback loop (burn-rate alerts, weekly error budget reviews).
- Re-evaluate the target with product and business stakeholders. Present the data: “We spent 120% of our error budget. Here is what users experienced: X% of requests failed, Y users saw errors, Z support tickets were filed. Here is the business outcome: 15% revenue growth. Is this trade-off acceptable going forward?” If the answer is “yes, do it again next quarter,” then 99.9% is aspirational, and the effective SLO is whatever the actual performance was (spending 120% of a 0.1% error budget works out to roughly 99.88% availability). Adjust the formal target to match reality.
- Differentiate SLOs by user journey. Maybe the aggregate 99.9% is wrong because it blends critical flows (checkout, login) with non-critical flows (profile browsing, search). Set 99.95% for checkout, 99.5% for search. This lets the team ship aggressively on non-critical paths while protecting the critical ones.
- War Story: At one company, the platform team set a 99.99% SLO for an internal API used by 5 teams. The error budget was 4.3 minutes per month. One team’s monthly deploy routinely caused a 2-minute blip. The platform team flagged it every month. The deploying team’s response: “Our users don’t care about 2 minutes of degraded search results.” They were right — the SLO was calibrated for a criticality level that did not match the actual user impact. We lowered the SLO to 99.9% (43 minutes/month), which gave teams breathing room for deployments while still catching real outages. The insight: an SLO that nobody respects is worse than no SLO at all, because it trains the organization to ignore reliability signals.
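The budget figures quoted above (4.3 minutes for 99.99%, 43 minutes for 99.9%) and the meaning of “120% consumed” follow from simple arithmetic. A minimal sketch, assuming a 30-day window and a time-based availability SLO; the function names are illustrative and not taken from any SLO tool:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def achieved_availability(slo: float, budget_consumed: float) -> float:
    """Availability implied by consuming a fraction of the error budget.

    budget_consumed=1.2 means 120% of the budget was spent.
    """
    return 1.0 - (1.0 - slo) * budget_consumed
```

Under these assumptions, a 99.9% target allows about 43.2 minutes per 30 days, and a 99.99% target allows about 4.3 minutes.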
Follow-up: How do you get product and engineering leadership to agree on SLO targets? They typically have conflicting incentives.
Strong answer: The key is framing SLOs as a shared tool, not an engineering constraint on product.
- Translate reliability into business language. Do not say “99.9% availability.” Say “We commit to fewer than 43 minutes of downtime per month, which based on our traffic means about 500 users see errors.” Then ask: “Is that acceptable, or do we need to invest more in reliability?” When product leaders understand the error budget in terms of affected users and support tickets, they engage with the trade-off meaningfully.
- Give product the error budget to spend. Frame it as: “You have a risk budget of 43 minutes this month. You can spend it on aggressive feature launches, risky deploys, or experimental A/B tests. When it is gone, we slow down to protect users.” This gives product agency over the reliability trade-off instead of making it feel like engineering gatekeeping.
- Show the historical data. “Last quarter, we had 3 incidents totaling 90 minutes. 2 of those were caused by rushed feature deploys. If we had an SLO, we would have slowed down after the second incident and avoided the third.” Concrete historical examples are more persuasive than hypothetical risk calculations.
- Start with one critical journey, not the whole system. Propose SLOs for checkout and login — flows where nobody argues about the importance of reliability. Once the framework proves valuable there, expand to other journeys.
Follow-up: What is the most common mistake teams make when first implementing SLOs?
Strong answer: Setting too many SLOs. Teams define 15 SLOs across every service and every endpoint, then drown in noise. After a month, nobody checks the dashboards because there are too many numbers to track and most of them are green.
The right starting point: one SLO per critical user journey. For an e-commerce platform, that might be three: checkout completion rate, search availability, and login success rate. Three numbers that the entire team reviews weekly. Add more only when those three are stable and the team has muscle memory around error budget reviews.
The second most common mistake: defining SLOs without defining what happens when they are breached. An SLO without a response policy is just a number on a dashboard. You need the “if X then Y”: if the budget is 50% consumed with 50% of the month remaining, we review. If it is 80% consumed, we freeze non-critical deploys. If it is exhausted, all hands on reliability. Without this policy, the SLO is decoration.
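The “if X then Y” response policy can be written down as code so that it is unambiguous. A minimal sketch using the thresholds from the answer above; the action names, and the rule that the review only triggers while at least half the window remains, are assumptions for illustration:

```python
def budget_action(consumed: float, window_elapsed: float) -> str:
    """Map error-budget state to a response.

    consumed: fraction of the error budget spent (1.0 = exhausted).
    window_elapsed: fraction of the SLO window that has passed (0.0 to 1.0).
    """
    if consumed >= 1.0:
        return "all-hands-reliability"
    if consumed >= 0.8:
        return "freeze-non-critical-deploys"
    # Half the budget gone with half (or more) of the window still ahead.
    if consumed >= 0.5 and window_elapsed <= 0.5:
        return "review"
    return "ok"
```

Checks are ordered from most to least severe so that an exhausted budget never falls through to a milder action.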
A junior developer asks you: 'Why can't we just set a TTL of 1 second on everything? Then data is always fresh.' What is wrong with this reasoning and how do you explain it?
The Trap
At first glance, a 1-second TTL sounds like it solves staleness. This question tests whether you understand that ultra-short TTLs can be worse than no caching at all.
What weak candidates say: “1-second TTL would cause too many cache misses.” Correct but vague: they cannot explain the cascading consequences or quantify the impact.
What strong candidates say: This is a great question because the intuition seems sound: shorter TTL means fresher data. But a 1-second TTL on everything would likely cause more harm than having no cache at all. Here is why, and I would walk the junior developer through each point:
- 1. You are rebuilding the cache on nearly every request. If an endpoint receives 1,000 requests/second and the cache TTL is 1 second, the cache entry is valid for exactly one “generation” of requests. After it expires, the next request triggers a cache miss and a database query. In the best case (with stampede protection), that is one database query per second per cache key. Without stampede protection, all 1,000 requests in the next second pile into the database simultaneously. You have replaced “no cache” with “no cache plus the overhead of checking Redis on every request.”
- 2. You are adding latency, not removing it. Every request now pays the cost of: check Redis (0.5ms) -> cache miss (likely) -> query database (5ms) -> write to Redis (0.5ms) -> return. Without caching, the path is just: query database (5ms). The cache adds 1ms of overhead to the majority of requests (the misses) and saves 4.5ms for the minority (the hits within the 1-second window). If your hit ratio is below ~20%, you have made every request slower on average.
- 3. You are generating enormous Redis write volume. With a 1-second TTL, every key is written once per second. If you have 10,000 active cache keys, that is 10,000 writes/second to Redis just for cache population — in addition to the reads. This write volume can itself become a bottleneck, especially if the cached values are large (5KB+ per key) or if you are on a constrained Redis instance.
- 4. The right approach: match TTL to data change frequency and staleness tolerance. If data changes once per hour, a 5-minute TTL gives you 12 cache rebuilds per hour instead of 3,600 (one per second). That is a 300x reduction in database load. The trade-off: data can be up to 5 minutes stale. For most data, that is perfectly acceptable. The conversation should start with: “How often does this data actually change, and what is the worst thing that happens if a user sees a 5-minute-old version?”
- 5. When 1-second TTL actually makes sense. For a handful of keys that change very frequently and where staleness directly causes user-visible errors — like a real-time stock price, a live auction bid, or a rate limit counter — a very short TTL (or event-based invalidation instead of TTL) is appropriate. But these are the exceptions, not the default.
- The way I would frame it to the junior developer: “Think of TTL as a dial between two extremes. At one end, TTL=infinity means perfect cache performance but the data might be stale forever. At the other end, TTL=0 means the data is always fresh but you have no cache. A 1-second TTL is very close to TTL=0 — you are paying all the costs of a caching system (Redis infrastructure, code complexity, cache invalidation logic) while getting almost none of the benefits. The art of caching is finding the point on that dial where you get the most performance benefit for the staleness your users can tolerate.”
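The break-even in point 2 can be checked with a few lines of arithmetic. A minimal sketch using the illustrative latencies from the text (0.5ms Redis read, 0.5ms Redis write, 5ms database query); real numbers will differ per system:

```python
def avg_latency_ms(hit_ratio: float, redis_read: float = 0.5,
                   redis_write: float = 0.5, db_query: float = 5.0) -> float:
    """Expected per-request latency on a cache-aside read path."""
    hit_cost = redis_read
    miss_cost = redis_read + db_query + redis_write
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost


def break_even_hit_ratio(redis_read: float = 0.5, redis_write: float = 0.5,
                         db_query: float = 5.0) -> float:
    """Hit ratio at which the cached path matches querying the database directly."""
    overhead_per_miss = redis_read + redis_write
    saving_per_hit = db_query - redis_read
    return overhead_per_miss / (overhead_per_miss + saving_per_hit)
```

With these numbers the break-even lands at about 18%, in line with the ~20% figure above: below it, the cache makes the average request slower than going straight to the database.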
War Story: A team I advised set a 2-second TTL on a product catalog cache serving 5,000 requests/second. Their Redis instance was handling 5,000 reads/sec and ~2,500 writes/sec (populating keys after misses). Their database was handling ~2,500 queries/sec, barely less than the 5,000 it would handle without the cache. The cache was absorbing only ~50% of the read load, at the cost of a Redis instance and significant code complexity. We raised the TTL to 60 seconds. Redis writes dropped to ~83/sec (one per key per minute). Database queries dropped to ~83/sec. The cache was now absorbing 98.3% of the read load. The latency improvement was dramatic and the database CPU dropped from 70% to 8%. All we changed was one number: EX=2 to EX=60.
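The numbers in this story can be reproduced with a steady-state estimate: if every key is requested at least once per TTL window, each key is rebuilt once per TTL. A rough sketch; the 5,000-key count is an assumption inferred from ~2,500 writes/sec at a 2-second TTL, not a figure stated in the story:

```python
def rebuilds_per_second(active_keys: int, ttl_seconds: float) -> float:
    """Steady-state cache rebuilds (database queries) per second.

    Assumes every key is requested at least once per TTL window,
    so each key misses once per expiry.
    """
    return active_keys / ttl_seconds


def absorbed_fraction(requests_per_second: float, active_keys: int,
                      ttl_seconds: float) -> float:
    """Share of read traffic served from the cache instead of the database."""
    misses = min(rebuilds_per_second(active_keys, ttl_seconds), requests_per_second)
    return 1.0 - misses / requests_per_second
```

Under that assumption, 5,000 keys at a 2-second TTL give 2,500 rebuilds/sec (50% absorbed), and the same keys at 60 seconds give about 83 rebuilds/sec (about 98.3% absorbed).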
Follow-up: The junior developer then asks “But what if the product price changes? A customer could see a stale price for 60 seconds.” How do you answer?
Strong answer: Great question, and this is where you teach the junior developer about separating display tolerance from transactional accuracy.
The product catalog page showing a price that is 60 seconds stale is almost certainly fine. The user browses, adds to cart, and proceeds to checkout. At the checkout step, where money actually changes hands, you never read the price from cache. You read it from the database (source of truth). If the price changed between the catalog view and the checkout, you show the user the updated price and ask them to confirm.
This pattern is everywhere: Amazon shows you a cached product page, but the actual charge at checkout reads from the live inventory and pricing system. The “Add to Cart” button uses the cached price for display, but the “Place Order” button uses the real-time price for the transaction.
For the rare case where even the display price must be real-time (live auction, stock ticker), use event-based invalidation instead of TTL: when the price changes, publish an event that deletes the cache key immediately. The TTL becomes a safety net (60 seconds), not the primary invalidation mechanism. Most requests still hit the cache; the cache is only invalid during the sub-second window between the event and the cache repopulation.
Follow-up: How do you decide what TTL to set for a new piece of data you are caching for the first time?
Strong answer: I use three inputs:
- How often does the data change? Check the database write frequency for this table/entity. If it changes once per hour, a 5-minute TTL means you serve stale data for at most 5 minutes and rebuild the cache only 12 times/hour. If it changes every 5 seconds, a 5-minute TTL means you serve very stale data, so switch to event-based invalidation.
- What is the cost of staleness? Ask the product owner: “If a user sees this data from 30 seconds ago, what is the worst that happens?” For a product description: nothing. For an account balance: potential financial dispute. For a feature flag: users might see the wrong experience. This gives you the maximum acceptable TTL.
- What is the cost of rebuilding? If the cache rebuild query takes 500ms and hits 3 tables, you want a long TTL to minimize rebuilds. If it takes 2ms, a shorter TTL is affordable. Match the TTL to the rebuild cost — expensive rebuilds get longer TTLs.
Take the tighter of the limits these inputs give you as the starting TTL, for example min(5min, 5min) = 5 minutes. Monitor the hit ratio after deployment and adjust: if the hit ratio is above 95%, the TTL is fine. If it is below 80%, the TTL might be too short relative to the access pattern, or the data changes more often than expected.
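One way to turn the three inputs into a starting number is sketched below. The specific heuristics are assumptions for illustration, not established rules: the change-frequency limit is one twelfth of the change interval (matching the “changes hourly, 5-minute TTL” example), and an expensive rebuild (500ms or more) pushes the TTL up to the staleness ceiling but never past it.

```python
def starting_ttl_seconds(change_interval_s: float,
                         max_staleness_s: float,
                         rebuild_cost_ms: float) -> float:
    """Suggest an initial TTL from change frequency, staleness tolerance, and rebuild cost.

    All thresholds here are illustrative assumptions.
    """
    # Limit from change frequency: a fraction of how often the data changes.
    frequency_limit = change_interval_s / 12
    ttl = min(frequency_limit, max_staleness_s)
    # Expensive rebuilds justify using the full staleness allowance.
    if rebuild_cost_ms >= 500:
        ttl = max_staleness_s
    return ttl
```

The result is only a first guess; the hit ratio observed after deployment is what should drive later adjustments.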
Your traces show that a Kafka consumer takes 200ms to process a message, but the end-to-end latency from producer to the downstream HTTP response is 45 seconds. Where are the other 44.8 seconds?
The Scenario
Async architectures create a specific observability challenge: the time a message sits in a queue is often invisible in traces. This question tests whether you understand observability across synchronous and asynchronous boundaries.
What weak candidates say: “The Kafka consumer is slow.” They look at the consumer processing time and cannot explain the gap. Or: “Kafka is slow.” They blame the message broker without investigating.
What strong candidates say: The 44.8-second gap is almost certainly not in the consumer processing logic. Here is where to look:
- 1. Consumer lag: messages are sitting in the Kafka partition waiting to be consumed. This is the most common cause and the first thing I would check. If the consumer group has high lag, messages are enqueued faster than they are being consumed. A message produced at T=0 might not be picked up by the consumer until T=44.8s. Check: Kafka consumer group lag via kafka-consumer-groups.sh --describe, or (better) via Burrow/Kafka metrics in your monitoring. The metric is records-lag-max per partition. If lag is 44,800ms worth of messages (at your production rate), that is your entire gap.
- 2. Consumer rebalancing. If a consumer instance crashed, was redeployed, or the consumer group is scaling up/down, Kafka triggers a consumer group rebalance. During rebalancing, all consumers in the group stop processing for the duration of the rebalance, which can be 10 seconds to several minutes depending on the number of partitions and the session.timeout.ms/max.poll.interval.ms settings. Check: Consumer logs for rebalance events. Kafka’s rebalance-latency-avg metric.
- 3. A single slow message blocking the partition. Kafka guarantees ordering within a partition. If one message takes 30 seconds to process (maybe it triggers a slow downstream call), all subsequent messages in that partition are queued behind it. The consumer’s “average processing time” is 200ms, but one outlier message blocks the entire partition for 30 seconds. Check: Look for high-variance processing times. Check if the latency is consistent across all partitions or localized to one. If one partition has 45-second latency and others have 200ms, a single slow message is the likely cause.
- 4. Retry and dead-letter processing adding delay. If the consumer encounters a transient error (downstream service timeout, database lock), it might retry the message with exponential backoff. Three retries with 5-10 second delays between each adds 15-30 seconds to the message’s lifecycle. Check: Consumer logs for retry events. Count how many times the message was processed (via a retry_count header if you have one).
- 5. The message passed through multiple Kafka topics. The producer writes to Topic A. A stream processor reads from Topic A, transforms the message, and writes to Topic B. The consumer reads from Topic B. Each hop adds latency: the stream processor’s processing time, the time waiting in Topic B, the consumer’s pickup delay. Check: Trace the message’s journey through all topics. If you have trace context propagation through Kafka headers, the trace should show each hop. If not, correlate by message key or a correlation ID embedded in the message payload.
To make this visible in observability:
- Add a produced_at timestamp to every Kafka message header. The consumer reads this timestamp and computes queue_wait_time = now() - produced_at. Emit this as a metric: kafka_message_queue_wait_seconds{topic, consumer_group}. This makes the invisible wait time visible.
- Create a dedicated span for queue wait time. When the consumer picks up a message, create a span that starts at the message’s produced_at timestamp and ends at the current time. This fills the gap in the trace that would otherwise be invisible.
- Alert on consumer lag that exceeds your latency SLO. If your end-to-end SLO is “process within 5 seconds,” alert when consumer lag exceeds 4 seconds (leaving 1 second for processing).
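The produced_at measurement can be sketched without committing to a particular Kafka client. The example assumes headers arrive as a list of (key, bytes) pairs and that the producer writes epoch milliseconds as ASCII digits; both are assumptions, as are the helper names:

```python
import time
from typing import Iterable, Optional, Tuple

Header = Tuple[str, bytes]


def produced_at_header(now_s: Optional[float] = None) -> Header:
    """Producer side: build the produced_at header (epoch milliseconds)."""
    now_s = time.time() if now_s is None else now_s
    return ("produced_at", str(int(now_s * 1000)).encode("ascii"))


def queue_wait_seconds(headers: Iterable[Header],
                       now_s: Optional[float] = None) -> Optional[float]:
    """Consumer side: time between produce and pickup, or None if the header is missing."""
    now_s = time.time() if now_s is None else now_s
    for key, value in headers:
        if key == "produced_at":
            produced_s = int(value.decode("ascii")) / 1000.0
            # Clamp at zero: clock skew between hosts can make this negative.
            return max(0.0, now_s - produced_s)
    return None


def lag_alert(wait_s: float, end_to_end_slo_s: float, processing_budget_s: float) -> bool:
    """Alert when queue wait eats into the time reserved for processing."""
    return wait_s > end_to_end_slo_s - processing_budget_s
```

The value returned by queue_wait_seconds is what would be recorded in the kafka_message_queue_wait_seconds metric. Because producer and consumer clocks are on different hosts, the measurement is only as accurate as their clock synchronization.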
War Story: A team I worked with had a 60-second end-to-end latency on messages that individually processed in 50ms. The root cause was a consumer with max.poll.records=500 and max.poll.interval.ms=300000 (5 minutes). The consumer fetched 500 messages per poll, processed all 500 sequentially, and did not poll again until done. The last message in each batch waited for 499 messages to process before it. At 50ms per message, that is 25 seconds of queuing within a single poll cycle. The fix: reduce max.poll.records to 50 (2.5 seconds per batch), enable concurrent processing within each batch using a thread pool, and configure the consumer to poll more frequently. End-to-end latency dropped from 60 seconds to under 2 seconds.
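The in-batch queuing in this story is easy to estimate. A small sketch of the arithmetic, assuming uniform per-message cost and, when a thread pool is used, an even split of the batch across workers (the function name is illustrative):

```python
def worst_case_batch_wait_s(max_poll_records: int, per_message_ms: float,
                            workers: int = 1) -> float:
    """Time the last message in a polled batch waits before its own processing starts."""
    messages_ahead = max_poll_records - 1
    # Messages processed before the last one on the same worker.
    ahead_on_same_worker = messages_ahead // workers
    return ahead_on_same_worker * per_message_ms / 1000.0
```

With 500 records at 50ms each and one worker, the last message waits just under 25 seconds; with 50 records it waits about 2.5 seconds, and spreading those 50 across 10 workers brings it down to about 0.2 seconds.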
Follow-up: How do you propagate trace context through Kafka when the producer and consumer are different services written in different languages?
Strong answer: The standard approach is to embed the W3C Trace Context (traceparent and tracestate headers) into Kafka message headers. Both the OTel Java and OTel Python SDKs have native Kafka instrumentation that does this automatically:
- Producer side: The OTel instrumentation intercepts the Kafka produce() call and injects the current trace context into the message headers.
- Consumer side: The OTel instrumentation intercepts the poll()/consume() call, extracts the trace context from the message headers, and creates a new span that is a child of the producer’s span.
Without auto-instrumentation, the manual equivalent (in pseudocode) is:
- Producer: message.headers().add("traceparent", currentContext.toTraceparent()).
- Consumer: context = extract(message.headers().get("traceparent")), then start a new span with that context as parent.
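For services without auto-instrumentation, the producer and consumer steps above amount to writing and reading one header string. A minimal sketch of the W3C traceparent format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); the helper names are hypothetical, only version 00 is handled, and in practice an OTel propagator would do this for you:

```python
import re
import secrets
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool


_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_span_id() -> str:
    """Random 8-byte span ID as hex."""
    return secrets.token_hex(8)


def to_traceparent(ctx: TraceParent) -> bytes:
    """Producer side: serialize the current context into a header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}".encode("ascii")


def from_traceparent(value: bytes) -> Optional[TraceParent]:
    """Consumer side: parse the header; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(value.decode("ascii", errors="replace"))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: TraceParent) -> TraceParent:
    """Consumer span context: same trace, new span ID."""
    return TraceParent(parent.trace_id, new_span_id(), parent.sampled)
```

Because the header is a plain ASCII string, the same bytes can be produced by one language and parsed by another, which is what makes the format work across services.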
When a consumer span cannot sensibly be a child of a single producer span (for example, one span that covers a batch of messages from different traces), link the spans instead of parenting them: Span.addLink(producerContext). This preserves the relationship without creating an impossible parent-child tree.
For cross-language compatibility, the W3C Trace Context format is language-agnostic: it is just a string in the header. A Java producer and a Python consumer both understand the same traceparent header format as long as both use W3C-compatible OTel SDKs.
Follow-up: Consumer lag is growing and you need to scale. How do you decide between adding more consumers versus adding more partitions?
Strong answer: The fundamental constraint: the number of active consumers in a consumer group cannot exceed the number of partitions. If you have 10 partitions and 10 consumers, adding an 11th consumer is useless. It will sit idle because Kafka assigns each partition to at most one consumer in the group.
Add more consumers when you have more partitions than consumers. This is the fast fix: no Kafka configuration change required, just scale the consumer service. Each new consumer picks up partitions from the existing consumers (after a rebalance), distributing the work.
Add more partitions when you are already at consumer=partition parity and need more parallelism. This requires a Kafka admin operation (kafka-topics.sh --alter --partitions N) and has implications:
- Existing messages in the current partitions are not redistributed. Only new messages use the new partitions.
- If you rely on key-based partition assignment (all messages for user X go to the same partition), adding partitions changes the key-to-partition mapping. Messages for user X may now go to a different partition, which can break ordering guarantees if you depend on them.
- More partitions means more overhead in the Kafka broker (more file handles, more replication traffic, longer rebalance times).
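Both the consumer ceiling and the key remapping can be illustrated in a few lines. The sketch below uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2, so real partition numbers will differ); the point is only that hash(key) mod partition count changes when the count changes:

```python
import zlib
from typing import List


def active_consumers(consumers: int, partitions: int) -> int:
    """Consumers in one group that actually receive partitions."""
    return min(consumers, partitions)


def partition_for(key: str, partitions: int) -> int:
    """Key-based partition assignment with a stand-in hash (not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def remapped_keys(keys: List[str], old_partitions: int, new_partitions: int) -> List[str]:
    """Keys whose partition changes after the partition count is altered."""
    return [k for k in keys
            if partition_for(k, old_partitions) != partition_for(k, new_partitions)]
```

Running remapped_keys over a sample of production keys before altering a topic gives a rough sense of how many keys would move, which matters if consumers depend on per-key ordering.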
You are reviewing a pull request that adds 47 new custom Prometheus metrics to a service. The developer says 'more data is always better.' Argue against merging this PR.
The Trap
Observability culture often celebrates “instrument everything.” This question tests whether you understand that too much instrumentation has real costs and can actually reduce observability quality.
What weak candidates say: “Metrics are cheap, just add them.” They do not consider storage costs, dashboard noise, or maintainability.
What strong candidates say: I would block this PR with a constructive review. Here is my argument:
- 1. 47 metrics creates an unmaintainable observability surface. Each metric needs: a dashboard panel, context for what “normal” looks like, an understanding of what “abnormal” means, and ideally an alert or at least a threshold someone monitors. 47 metrics is 47 things an on-call engineer needs to understand. In practice, nobody will learn what all 47 mean. During an incident, the engineer will look at the 5-6 metrics they already know and ignore the other 41. The unused metrics are not free: they consume storage, slow down Prometheus queries, clutter dashboards, and create the illusion that the service is well-instrumented when it is actually just noisy.
- 2. Cardinality risk. 47 new metrics, each with even modest label sets, can easily create thousands of new time series. If even one metric has an unbounded label (a path parameter, a user-facing error message, a dynamic queue name), it can create hundreds of thousands of time series and degrade Prometheus for every service. I would audit every metric in the PR for label cardinality before even discussing the metric’s value.
- 3. The signal-to-noise ratio drops. Dashboards with 47 panels are unusable. When everything is highlighted, nothing stands out. Good observability is about surfacing the 5-10 signals that matter, not displaying every internal variable. The developer is confusing “data” with “information.” More data is not always better — more relevant data is better.
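The cardinality audit in point 2 is a multiplication: the worst-case series count for a metric is the product of the distinct values of each label. A rough sketch; the metric names, label names, and value counts are made up for illustration, and histograms multiply further by their bucket count:

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


def total_series(metrics: Dict[str, Dict[str, int]]) -> int:
    """Upper bound across all metrics in a change."""
    return sum(worst_case_series(labels) for labels in metrics.values())


# Hypothetical metrics from a PR under review.
pr_metrics = {
    "orders_created_total": {"region": 5, "status": 4},
    "queue_depth": {"queue_name": 200},                 # dynamic names: risky
    "http_errors_total": {"path": 10_000, "code": 6},   # raw path label: unbounded in practice
}
```

In this made-up example, a single metric with a raw path label accounts for 60,000 of the 60,220 worst-case series, which is why one unbounded label matters more than the other 46 metrics combined.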
What I would ask in the PR review:
- “For each metric, what question does it answer during an incident? If you cannot name a specific incident scenario where you would look at this metric, remove it.”
- “Which of these 47 would you put on a dashboard that an on-call engineer sees first during a page? Those are your top 5-10 — keep those, drop the rest.”
- “Can any of these be derived from existing metrics? If request_duration_seconds already exists, you do not need a separate request_fast_count and request_slow_count. Just use histogram quantiles.”
- “Have you checked the total cardinality impact? Run promtool to estimate the number of new time series this PR creates.”
- My counter-proposal: Start with the RED metrics (Rate, Error, Duration) if they do not already exist, plus 3-5 business-specific metrics that answer specific diagnostic questions. Deploy them. Use them in the next 2-3 incidents. Then add more metrics driven by the gaps you discovered during those incidents. This is observability-driven development: instrument in response to questions you could not answer, not in anticipation of questions you might someday ask.
- War Story: A team I know added 120 custom metrics to a service during an “observability sprint.” Within 3 months, their Prometheus storage doubled, query times tripled, and Grafana dashboards were so dense that engineers started building personal “simplified” dashboards that only showed the 8 metrics that actually mattered. We did a metric audit: of the 120, 15 were used in dashboards, 4 were used in alerts, and the remaining 101 were never queried. We deleted the unused ones and saved $2,000/month in Prometheus infrastructure costs. The lesson: every metric you add is a commitment to understand, maintain, and eventually clean up. Treat metrics like code: each one should justify its existence.
Follow-up: The developer pushes back: “But what if we have an incident and need a metric we did not add?” How do you handle that?
Strong answer: This is the “what if” fear that drives over-instrumentation. The answer is: that is exactly what logs and traces are for.
Metrics are your first-alert system: they tell you something is wrong and roughly where. They need to cover the critical signals: rate, errors, duration, and resource utilization. For the long tail of “what if” questions, you use logs (which can carry arbitrary high-cardinality data) and traces (which show the exact path and timing of a specific request).
If, during an incident, you discover you need a metric that does not exist, that is a valid finding for the postmortem. Add that specific metric as a follow-up. But add it because you proved you needed it, not because you might someday need it. This is the difference between reactive instrumentation (efficient, battle-tested) and speculative instrumentation (wasteful, creates noise).
The other escape hatch: if you instrument with OpenTelemetry and export structured logs with rich attributes, you can often answer the “what if” question by querying logs, even though the question was not anticipated. The log line {"service": "order-service", "db_pool_active": 18, "db_pool_max": 20} gives you the same insight as a db_pool_active gauge metric, just queried differently. Structured logging is your insurance policy for the questions you did not anticipate.
Follow-up: How do you maintain metric hygiene over time as the codebase evolves?
Strong answer: Four practices:
- Quarterly metric audits. Query Prometheus or Datadog for metrics that are never used in any dashboard, alert, or recording rule. Delete them. This is the equivalent of deleting dead code.
- Metric ownership. Every metric should be owned by a team. When a service changes ownership, the new team reviews all metrics and removes ones they do not understand or use. Orphaned metrics are the fastest-growing category of waste.
- Deprecation process. When removing a metric, do not just delete it — add a comment in the code for one release cycle (“deprecated: scheduled for removal in v2.4”) and check if any consumer (dashboard, alert, downstream system) references it. Grafana’s “unused panels” detection and Datadog’s “Metrics without Limits” both help identify references.
- CI validation. A CI check that counts the total number of custom metrics and fails if it exceeds a team-defined budget. This is a soft cap — the team can raise it, but they must justify the increase in the PR, which forces the “do we really need this?” conversation.
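The CI validation could be as small as the sketch below. It assumes the service can dump its metrics in the Prometheus text exposition format during the build, that custom metrics share a known name prefix, and that the budget number is team-defined; all three are assumptions for illustration.

```python
from typing import Set


def custom_metric_names(exposition_text: str, prefix: str) -> Set[str]:
    """Collect metric names declared via '# TYPE <name> <type>' lines with the given prefix."""
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            if parts[2].startswith(prefix):
                names.add(parts[2])
    return names


def metric_budget_exceeded(exposition_text: str, prefix: str, budget: int) -> bool:
    """True if the number of custom metrics is over the team's budget.

    A CI job would exit non-zero when this returns True, prompting the author
    to justify a higher budget in the PR or remove metrics.
    """
    return len(custom_metric_names(exposition_text, prefix)) > budget
```

Counting TYPE declarations rather than sample lines keeps the check about the number of metrics, not the number of time series; a cardinality budget would need a separate check.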