Part V — Reliability, Resilience, and Availability
Reliability is not about preventing failures — it is about choosing which failures to tolerate. Every system fails. The senior engineer’s job is to decide how much failure is acceptable (SLOs), invest proportionally (error budgets), and build systems that degrade gracefully rather than collapse catastrophically. The core insight: reliability is an economic decision, not a technical one.
This chapter draws heavily on the principles in Google’s Site Reliability Engineering book.
Cross-chapter foundations: Reliability engineering does not exist in a vacuum — it builds on several foundational disciplines covered elsewhere in this guide:
Distributed Systems Theory — Consensus algorithms (Raft, Paxos) are the mechanism behind high-availability leader election and replicated state machines. You cannot design HA systems without understanding what consensus guarantees (and what it costs in latency and partition tolerance). The CAP theorem and its practical implications for choosing consistency vs. availability during network partitions are covered there.
OS Fundamentals — Process crashes, OOM kills, file descriptor exhaustion, and signal handling are the low-level failure modes that reliability patterns must account for. When the Linux OOM Killer sends SIGKILL to your process, there is no graceful shutdown — understanding why this happens and how to prevent it (memory limits, proper resource budgeting) is covered in that chapter.
Cloud Service Patterns — Multi-AZ deployments, multi-region architectures, and cloud-native resilience primitives (ALB health checks, Route 53 failover, S3 cross-region replication) are the infrastructure building blocks that implement the HA and DR strategies described in this chapter.
When Reliability Fails: The Stories That Changed the Industry
Before we dive into SLOs and error budgets, let’s look at what happens when reliability goes wrong — because these stories are the reason all the theory in this chapter exists.
The Amazon S3 Outage (2017) — A Typo That Broke the Internet
On February 28, 2017, an Amazon engineer was debugging an issue with the S3 billing system in the US-East-1 region. The fix required removing a small number of servers from a subsystem. The engineer executed a command and typed the wrong number. Instead of removing a handful of servers, the command removed a massive chunk of the S3 index subsystem and the placement subsystem.
S3 is not just “file storage.” It is the foundation that half the internet runs on. When S3 went down, it took with it Slack, Trello, Quora, Business Insider, the IFTTT service, and, ironically, the AWS Service Health Dashboard itself (which was hosted on S3, so Amazon could not even update their own status page to say S3 was down).
The outage lasted about four hours. The root cause was not a software bug or a hardware failure. It was a human typing a number wrong in a maintenance command that had no guardrails, no confirmation prompt, and no rate limiter on how many servers could be removed at once. The fix was straightforward: Amazon added safeguards so that commands could not remove capacity below a minimum threshold, and they added confirmation steps for large-scale operations.
The lesson: Your system is only as reliable as the most dangerous manual command someone can run against it. Guardrails on human operations (confirmation prompts, blast radius limits, “are you sure?” steps) are as important as any software resilience pattern. Also: do not host your status page on the same infrastructure it reports on.
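A guardrail of the kind described in this story, one that refuses any removal that would drop capacity below a floor or take out too much of the fleet in one command, might look like the sketch below. It is only an illustration: `check_removal`, `min_capacity`, and `max_fraction` are hypothetical names and values, not Amazon's actual tooling.

```python
class UnsafeRemovalError(Exception):
    """Raised when a capacity-removal request violates a guardrail."""


def check_removal(current: int, to_remove: int, min_capacity: int,
                  max_fraction: float = 0.05) -> int:
    """Return the capacity left after removal, or raise if the request is unsafe.

    min_capacity and max_fraction are illustrative assumptions.
    """
    if to_remove <= 0:
        raise ValueError("to_remove must be positive")
    remaining = current - to_remove
    # Guardrail 1: never drop below the minimum capacity floor.
    if remaining < min_capacity:
        raise UnsafeRemovalError(
            f"would leave {remaining} servers, below the floor of {min_capacity}")
    # Guardrail 2: cap the blast radius of a single command.
    if to_remove > current * max_fraction:
        raise UnsafeRemovalError(
            f"removing {to_remove} exceeds {max_fraction:.0%} of the fleet in one command")
    return remaining
```

A real tool would likely pair a check like this with an explicit confirmation step for anything near the limits.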
Cloudflare's Regex Outage (2019) — One Bad Regular Expression, Global Impact
On July 2, 2019, Cloudflare pushed a routine update to their Web Application Firewall (WAF) rules. One of the new rules contained a regular expression that, when evaluated against certain HTTP request patterns, caused catastrophic backtracking: the regex engine spiraled into exponential CPU consumption.
Because Cloudflare’s WAF runs on every request at every Point of Presence (PoP) globally, the CPU spike was not isolated to one server or one region. It hit every Cloudflare edge server simultaneously. CPU utilization spiked to 100% across the entire network. For 27 minutes, Cloudflare, which proxies and protects millions of websites, effectively went offline. Sites behind Cloudflare showed 502 errors worldwide.
The fix was to roll back the WAF rule. But the deeper issue was that the deployment process had no canary phase: the rule went to 100% of production traffic immediately. There was no automated mechanism to detect “CPU is spiking globally” and auto-rollback. The regex had not been tested against a performance benchmark, only against correctness.
The lesson: Global infrastructure means global blast radius. Any change that touches every request needs canary deployments (roll out to 1% of traffic, observe, then expand). Automated rollback triggers (“if CPU exceeds X% within Y minutes of a deploy, revert”) are not optional for edge infrastructure. And regular expressions are surprisingly dangerous: a pattern that looks simple can have exponential worst-case performance.
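The rollback trigger “if CPU exceeds X% within Y minutes of a deploy, revert” can be sketched as a small predicate over post-deploy CPU samples. The names `CpuSample` and `should_rollback` and the default thresholds are assumptions for illustration; this is not Cloudflare's mechanism.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    seconds_since_deploy: float
    cpu_percent: float


def should_rollback(samples: list[CpuSample], cpu_limit: float = 90.0,
                    window_seconds: float = 300.0, min_breaches: int = 3) -> bool:
    """True if CPU exceeded cpu_limit at least min_breaches times inside the window.

    Requiring several breaches (an assumed default of 3) avoids reverting
    on a single noisy sample.
    """
    breaches = sum(
        1 for s in samples
        if s.seconds_since_deploy <= window_seconds and s.cpu_percent > cpu_limit
    )
    return breaches >= min_breaches
```

In a canary rollout the same check would run against the 1% slice first, so a bad change is reverted before it reaches the rest of the fleet.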
Facebook/Meta's BGP Outage (2021) — When DNS and BGP Cascade Together
On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger all went completely offline for approximately six hours. Not degraded. Not slow. Gone. For roughly 3 billion users.
The root cause was a maintenance command intended to assess the capacity of Facebook’s backbone network. A bug in the audit tool that should have blocked the command let it through, and the command took down the backbone connections between Facebook’s data centers. Facebook’s DNS servers, unable to reach the data centers, then withdrew their own BGP route advertisements. BGP (Border Gateway Protocol) is how routers on the internet know where to send traffic. When Facebook’s BGP routes disappeared, the rest of the internet simply forgot how to reach Facebook’s servers.
Here is where it cascaded: Facebook’s DNS servers became unreachable, so DNS lookups for facebook.com, instagram.com, and whatsapp.com all started failing. But Facebook’s internal tools also depended on that same DNS infrastructure. Engineers could not access their own dashboards, deployment tools, or even the internal communication systems they would normally use to coordinate the fix. Physical access to the data centers was also complicated, because the badge-reader systems depended on network services that were unreachable. Engineers had to be physically dispatched to data centers to manually restore the BGP routes.
The lesson: Your recovery tooling must not depend on the thing that is broken. If your DNS goes down and your incident response tools need DNS to function, you have a circular dependency that turns a bad day into a catastrophic one. Test your recovery path independently: can your team actually fix the system when the system itself is down? Also: BGP is the single most consequential protocol most engineers never think about.
Netflix and Chaos Monkey — Breaking Things on Purpose to Prevent Real Outages
In 2010, Netflix migrated from its own data centers to AWS. Early in the migration, they experienced a significant outage when an AWS availability zone went down and took Netflix services with it. Rather than just adding more redundancy and hoping for the best, Netflix took a radical approach: they built a tool called Chaos Monkey that randomly terminates virtual machine instances in production during business hours.
The philosophy was counterintuitive: deliberately cause failures so that engineers are forced to build services that tolerate them. If your service cannot survive one instance dying at random, it is not resilient enough for production. Chaos Monkey eventually grew into the “Simian Army”: Chaos Gorilla (simulates an entire availability zone failure), Latency Monkey (injects artificial delays), Conformity Monkey (finds instances that do not adhere to best practices), and more.
The result? During subsequent major AWS outages that took down competitors like Reddit, Imgur, and Heroku, Netflix continued streaming without interruption. Their services had been hardened by months of intentional, controlled failure injection. Engineers had already encountered and fixed the edge cases that only surface when things go wrong.
The lesson: You cannot test resilience by reading architecture diagrams. You test resilience by actually breaking things, in a controlled way, with safety nets, during business hours when everyone is awake and ready to respond. The teams that practice failure regularly are the teams that handle real incidents calmly.
These three terms sound similar and are often confused, even by experienced engineers. Here is the precise distinction, grounded in one concrete example so the relationship is unmistakable:
SLI (Service Level Indicator): A measurement of system behavior from the user’s perspective. It is a number you observe. Example: “Over the last 30 days, 99.2% of checkout API requests completed in under 200ms.”
SLO (Service Level Objective): A target you set for your SLI, the threshold you commit to internally. It is a goal your team agrees to meet. Example: “99.5% of checkout API requests must complete in under 200ms over any rolling 30-day window.”
SLA (Service Level Agreement): A contract between you and your customer, with explicit consequences if breached. It is a legal or business commitment. Example: “If checkout API availability drops below 99.0% in a calendar month, affected customers receive a 10% service credit.”
Notice the hierarchy: SLI is what you measure (99.2%). SLO is what you aim for (99.5%). SLA is what you promise externally with penalties (99.0%). The SLA is always less aggressive than the SLO, because you want your internal target to catch problems before they become contractual violations. If your SLO and SLA are the same number, you have zero safety margin: every near-miss becomes a breach.
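To make the hierarchy concrete, the sketch below computes an SLI from event counts and compares it with an SLO and an SLA. It simplifies by treating all three numbers as percentages on the same scale; the function names are illustrative, not from any monitoring library.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI: share of events that met the criterion (e.g. finished in under 200ms)."""
    if total_events == 0:
        return 100.0  # no traffic means nothing failed; a convention, not a rule
    return 100.0 * good_events / total_events


def status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against the internal target and the contract."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA intact"
    return "SLO met"


# The numbers from the example: measured 99.2%, target 99.5%, contract 99.0%.
print(status(sli_percent(992, 1000), slo=99.5, sla=99.0))
```

The middle state is the safety margin at work: the team is alerted to a missed SLO while the contractual threshold is still intact.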
Do not confuse availability with reliability. A system can be available (responding to every request) but unreliable (returning wrong data, corrupting state, dropping events silently). A checkout API that returns 200 OK but charges the wrong amount is available but catastrophically unreliable. Your SLIs should measure correctness, not just uptime. A service that returns errors honestly (503) is more reliable than one that returns garbage with a 200 status code.
Error Budget: If your SLO is 99.9% availability, you have 0.1% downtime budget — 43.2 minutes per month. When the budget is healthy, ship features aggressively. When it is burning, slow down and invest in reliability. Error budgets are the bridge between product velocity and reliability.
Cross-chapter connection: SLIs need to be measured — which means you need robust observability infrastructure. See the Observability chapter for how to instrument metrics, build dashboards, and set up alerting that tracks your SLIs in real time. Without observability, SLOs are aspirational fiction.
Each additional nine is roughly 10x harder and more expensive to achieve. Most services should target 99.9% and invest the saved engineering effort in features.
| Availability | Downtime / Month | Downtime / Year | Typical Use Case |
|---|---|---|---|
| 99% (“two nines”) | 7.2 hours | 3.65 days | Internal tools, dev environments |
| 99.9% (“three nines”) | 43.2 minutes | 8.76 hours | Most SaaS products, APIs |
| 99.95% | 21.6 minutes | 4.38 hours | E-commerce, business-critical apps |
| 99.99% (“four nines”) | 4.3 minutes | 52.6 minutes | Payment systems, core infrastructure |
| 99.999% (“five nines”) | 26 seconds | 5.26 minutes | Telecom, life-safety systems |
A quick mental model: 99.9% = about 8 hours and 46 minutes of allowed downtime per year. For most web services, this is the sweet spot. Going to 99.99% (about 52 minutes/year) typically requires multi-region deployment, automated failover, and a dedicated SRE team — a 10x cost increase for a 10x improvement.
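The downtime figures for each availability target follow from one line of arithmetic. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is illustrative):

```python
def allowed_downtime_minutes(availability_percent: float, window_days: float = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    return (1 - availability_percent / 100) * window_days * 24 * 60


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(target)
    per_year = allowed_downtime_minutes(target, window_days=365)
    print(f"{target}%: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

Changing `window_days` shows how the same target yields a different absolute budget for a rolling 7-day or 90-day window.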
Budget exhausted (0%): Full feature freeze. Only reliability work and critical security patches until the budget replenishes in the next window.
Who decides? The error budget policy is co-owned by the SRE team (or on-call engineering lead) and the product manager. The SRE team reports budget status. The PM acknowledges the trade-off. Escalation goes to the VP of Engineering if there is disagreement. The key: this is a pre-negotiated agreement, not a per-incident debate.
Analogy — Error Budgets Are Like a Bank Account for Reliability. Think of your error budget as a checking account. Every month it refills to a set balance (your allowed downtime). You can spend that balance on risky deploys, aggressive feature launches, or infrastructure migrations — each of which might cause a few minutes of downtime. When the account is flush, spend freely. When it is running low, tighten up and stop making withdrawals. When it hits zero, you are frozen — no discretionary spending (feature deploys) until the balance replenishes next month. The metaphor works because it reframes reliability not as a constraint but as a resource you manage.
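The bank-account framing translates directly into a balance check. In the sketch below only the "exhausted means freeze" rule comes from the text; the `caution_below` threshold is a placeholder assumption for whatever the team's pre-negotiated policy says.

```python
def budget_remaining(slo_percent: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overdrawn)."""
    budget = (1 - slo_percent / 100) * window_days * 24 * 60
    return 1 - bad_minutes / budget


def policy_action(remaining: float, caution_below: float = 0.5) -> str:
    """Map the remaining balance to a response; caution_below is an assumed threshold."""
    if remaining <= 0:
        return "feature freeze: reliability work and critical security patches only"
    if remaining < caution_below:
        return "caution: slow down risky deploys and prioritize reliability work"
    return "healthy: ship normally"
```

Because the thresholds are plain parameters, the same function can encode whatever tiers product and engineering agreed on in advance.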
The One Thing to Remember: SLOs are not technical targets — they are organizational contracts that align engineering and product on how much reliability to buy. Without SLOs, reliability is a never-ending argument. With SLOs, it is a budget you manage together.
When your service depends on other services, your effective SLO is bounded by the product of your dependency SLOs, not the minimum. This is the composition problem most teams get wrong.
The math: If Service A (99.95% SLO) calls Service B (99.9% SLO) and Service C (99.9% SLO) synchronously, the best-case end-to-end availability is 99.95% x 99.9% x 99.9% = ~99.75%. Service A cannot promise 99.95% to its users if its dependencies only deliver 99.8% combined. This has three practical implications:
Downstream services must have stricter SLOs than upstream services. If the user-facing API targets 99.95%, every synchronous dependency should target 99.99% or higher. The gap is your margin for the API’s own failure modes.
Every async boundary resets the composition. If Service A publishes an event to a queue and Service C processes it asynchronously, Service C’s availability no longer multiplies against Service A. The queue decouples them. This is the strongest reliability argument for event-driven architecture.
Parallel dependencies compose differently than serial ones. If Service A calls Service B and Service C in parallel and can succeed if either responds (e.g., primary vs. fallback), the effective availability is 1 - (1-SLO_B) * (1-SLO_C) — much better than serial composition. For two services at 99.9%, parallel composition gives 99.9999%.
The hidden dependency: Your SLO composition must include infrastructure dependencies that do not appear in your architecture diagram. DNS resolution, certificate validation, load balancer health checks, secret store lookups, and feature flag evaluation are all in the critical path. A team that carefully composes microservice SLOs but forgets that every request depends on DNS (99.99% at best) is fooling themselves.
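The serial and parallel composition rules can be checked numerically. This sketch assumes failures are independent, which real systems only approximate (shared DNS, shared region, shared deploys all correlate failures); `serial` and `parallel` are illustrative helper names.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component must work, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices, so only simultaneous failure counts."""
    return 1 - prod(1 - a for a in availabilities)


# Service A (99.95%) calling B and C (99.9% each) synchronously.
print(f"serial:   {serial(0.9995, 0.999, 0.999):.4%}")
# Primary and fallback at 99.9% each, where either one is enough.
print(f"parallel: {parallel(0.999, 0.999):.4%}")
```

Adding hidden dependencies such as DNS to the `serial` call is a quick way to see how much they lower the end-to-end number.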
Most SLO frameworks only track binary states: request succeeded or request failed. Real user experience has a spectrum between “fully functional” and “completely down”, and the middle of that spectrum is where most reliability problems live.
The degraded UX problem: Your checkout page loads in 200ms, but the recommendation widget times out and shows a spinner for 3 seconds. Your availability SLI says 100%. Your latency SLI for the checkout endpoint says 200ms. But the user’s perceived experience is degraded: they see a broken-looking page with a spinning section.
How to measure it:
Apdex score: Classifies user experience into Satisfied (response time at or below a threshold T), Tolerating (between T and 4T), and Frustrated (above 4T, or an error). The score is (Satisfied + Tolerating / 2) / total samples, so an Apdex SLI captures the full spectrum, not just pass/fail.
Core Web Vitals as SLIs: Largest Contentful Paint (LCP), Interaction to Next Paint (INP, which replaced First Input Delay as a Core Web Vital in March 2024), and Cumulative Layout Shift (CLS) measure what the user actually experiences, not what the server thinks it delivered. For frontend-heavy applications, these are better SLIs than server-side latency.
Composite page health: Track the percentage of page loads where all above-the-fold components rendered successfully within their timeout. A page where the main content loads but 2 of 5 widgets fail is not 100% healthy — it is 60% healthy.
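Of the measurements above, Apdex is the simplest to compute by hand. The sketch below uses the standard Apdex formula, with satisfied samples counting fully, tolerating samples counting half, and frustrated samples (including errors) counting zero; the function and parameter names are illustrative.

```python
def apdex(latencies_ms: list[float], threshold_ms: float, errors: int = 0) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples.

    Errors are passed separately and counted as frustrated.
    """
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms)
    total = len(latencies_ms) + errors
    if total == 0:
        raise ValueError("no samples")
    return (satisfied + tolerating / 2) / total
```

For example, with T = 200ms the samples 100, 150, 300, and 900 ms give two satisfied, one tolerating, and one frustrated request.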
Senior vs. Staff distinction: A senior engineer sets SLOs for their service and monitors them. A staff engineer designs the SLO composition model across the dependency graph, identifies where async boundaries should exist to break composition chains, and builds the organizational framework for cross-team SLO accountability. The senior answers “what is our SLO?” The staff answers “why does our SLO have to be this number given the system topology?”
Reliability is not free, and it is not infinitely valuable. The engineering leadership question is always: where is the next dollar of reliability investment best spent?
The Reliability Investment Quadrant:
| | Low Cost to Fix | High Cost to Fix |
|---|---|---|
| High Impact if Broken | Fix immediately: these are free wins (missing alerts, no rollback plan, unbounded retries) | Invest strategically: these are your quarterly reliability epics (multi-region, chaos engineering program) |
| Low Impact if Broken | Fix opportunistically: boy scout rule, fix during related work | Do not fix: this is over-engineering disguised as reliability |
How to calculate the ROI of a reliability investment:
Annual cost of unreliability = (incidents/year) * ((avg duration hours * revenue/hour) + (engineering hours per incident * hourly rate) + trust cost estimate per incident)
Annual cost of investment = infrastructure cost + engineering time + ongoing maintenance
ROI = (cost of unreliability - cost after investment) / cost of investment
A real example: A team experiences 4 incidents per year averaging 45 minutes each, with $15,000/hour in revenue impact and 10 engineering hours of follow-up per incident at $100/hour. Annual unreliability cost: 4 x (0.75 x $15,000 + 10 x $100) = $49,000. Adding automated canary deployments costs $30,000 in engineering time and $500/month in infrastructure. If canaries prevent 3 of 4 incidents, the annual savings is $37,000. The investment pays for itself in under a year.
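The arithmetic of this example can be reproduced in a few lines. The sketch assumes the per-incident cost is duration times revenue per hour plus follow-up hours times hourly rate, with no trust-cost term, and uses the inputs stated above: 4 incidents of 0.75 hours, 15,000 per hour of revenue impact, 10 follow-up hours at 100 per hour, and canaries costing 30,000 plus 500 per month.

```python
def annual_incident_cost(incidents: int, hours_each: float, revenue_per_hour: float,
                         eng_hours_each: float, eng_rate: float) -> float:
    """Yearly cost of incidents: lost revenue plus follow-up engineering time."""
    return incidents * (hours_each * revenue_per_hour + eng_hours_each * eng_rate)


before = annual_incident_cost(4, 0.75, 15_000, 10, 100)
after = annual_incident_cost(1, 0.75, 15_000, 10, 100)  # canaries prevent 3 of 4
investment = 30_000 + 12 * 500                           # first-year cost
savings = before - after
roi = savings / investment

print(f"before={before:,.0f} savings={savings:,.0f} investment={investment:,.0f} roi={roi:.2f}")
```

An ROI just above 1.0 in the first year means the investment roughly breaks even within twelve months; in later years only the recurring infrastructure cost remains.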
Further reading:
Google SRE Book, Chapter 4: Service Level Objectives. The definitive treatment of SLIs, SLOs, and SLAs, including how to choose meaningful indicators and set realistic targets.
Google SRE Book, Chapter 3: Embracing Risk. The chapter that introduces error budgets and frames reliability as an economic decision, not a technical absolute.
These two chapters together form the intellectual foundation for everything in this section.
Interview Question: How do you set SLOs for a new service?
Strong answer: Start by understanding what matters to users: for a checkout service, availability and latency matter most; for a batch report generator, completeness matters more than speed. Choose SLIs that reflect user experience: for an API, that is typically request success rate (successful requests / total requests) and latency at p99. Measure baseline performance for 2-4 weeks before setting targets. Set the SLO slightly above the baseline, achievable but ambitious. For example, if current p99 latency is 180ms, set the SLO at 200ms. Define error budgets for each SLO type separately.
Availability SLO: “99.9% of minutes in the month, the service returns non-error responses.” Error budget = 43.2 minutes of allowed downtime.
Latency SLO: “99% of requests complete in under 200ms.” Error budget = 1% of requests are allowed to be slow.
These are different measurements with different budgets. When either budget is burning, slow down and invest in reliability. When both are healthy, ship features aggressively.
Structured Answer Template:
Start from user experience, not server metrics: what would a user complain about first?
Measure the baseline for 2-4 weeks before committing to a target.
Set the SLO slightly above baseline — achievable but ambitious (not aspirational fiction).
Define availability AND latency separately with their own error budgets.
Tie the SLO to an error budget policy with pre-agreed thresholds and responses.
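A latency SLO has a budget counted in requests rather than minutes. The sketch below assumes a “99% of requests under 200ms” objective and an in-memory list of latencies; a real system would compute this from histogram metrics, and both function names are illustrative.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    fast = sum(1 for t in latencies_ms if t < threshold_ms)
    return fast / len(latencies_ms)


def slow_request_budget_left(latencies_ms: list[float], slo: float = 0.99,
                             threshold_ms: float = 200.0) -> int:
    """Slow requests still allowed in this window before the latency SLO is missed."""
    # round() guards against float noise such as 1000 * (1 - 0.99) = 10.000000000000009
    allowed = int(round(len(latencies_ms) * (1 - slo), 6))
    slow = sum(1 for t in latencies_ms if t >= threshold_ms)
    return allowed - slow
```

Tracking this alongside the availability budget keeps the two measurements separate: a service can be fully “up” and still exhaust its latency budget.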
Big Word Alert — Error Budget: The complement of your SLO, treated as a spendable resource (99.9% SLO = 0.1% = 43.2 minutes/month of allowed failure). Use it like: “Our checkout error budget is 43 minutes per month; we’ve burned 18 in the first week, so we’re in the caution zone.” Do not say “error budget” for simple downtime allowance — the key word is budget: it implies spending it deliberately on feature risk.
Real-World Example: Google’s Ads team famously sets SLOs based on user-perceived latency (how fast the ad creative appears on the page), not server-side metrics: they use RUM data and pick a p99 SLO that correlates with click-through rate dropping. Netflix sets tiered SLOs per service importance: “Play button click to start streaming” has a much tighter SLO than “profile picture upload.” This user-centric framing prevents the common mistake of SLO-ing whatever is easy to measure rather than what actually matters.
Follow-up Q&A Chain:
Q: What happens if you measure baseline for 4 weeks and find your service is already hitting 99.99%?
A: Either tighten the SLO toward what the service actually delivers, if users have come to rely on that level, or keep the looser target, use the unspent “permission to fail” to ship faster, and accept that you may be over-invested in reliability, with engineering capacity that could be redirected. A service that always meets its SLO by a huge margin is either mis-classified (should be lower tier) or over-engineered for its business need.
Q: How do you pick between 99.9% and 99.95% for a new user-facing service?
A: Ask “what’s the business impact of 43 minutes of downtime per month vs 21 minutes?” If the difference is meaningful (say, peak-hour outages hitting a paying customer’s workflow), go tighter. If the delta is imperceptible to users, stay at 99.9%: you spend less engineering effort maintaining it and have more error budget to ship features aggressively.
Q: Should the SLO include “successful” requests that return a 404?
A: Usually no. 404 is a client error, not a service failure. A good availability SLI filters out 4xx errors (which indicate bad client requests) and counts only 5xx and timeout-class failures. The exception: if a valid request is returning 404 due to a bad deploy (missing route), that should count as a failure — use synthetic probes of known-good endpoints to catch this.
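The filtering rule in this answer can be expressed as a small SLI function. This sketch assumes request counts are already grouped by HTTP status code and that timeouts are recorded as a 5xx code such as 504; `availability_sli` is an illustrative name.

```python
def availability_sli(status_counts: dict[int, int]) -> float:
    """Success rate over valid requests: 4xx excluded, 5xx counted as failures."""
    # Drop client errors entirely so that bad requests neither help nor hurt the SLI.
    valid = {code: n for code, n in status_counts.items() if not 400 <= code < 500}
    total = sum(valid.values())
    if total == 0:
        return 1.0  # no valid traffic in the window; a convention, not a rule
    failures = sum(n for code, n in valid.items() if code >= 500)
    return 1 - failures / total
```

This filter cannot tell a legitimate 404 from one caused by a missing route after a bad deploy, which is why the synthetic probes of known-good endpoints remain necessary.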
Further Reading:
Google SRE Workbook, Chapter 2 “Implementing SLOs” — the canonical reference on SLO calibration.
Alex Hidalgo, “Implementing Service Level Objectives” (O’Reilly, 2020) — book-length treatment with real case studies.
Charity Majors, “SLOs are the API for your engineering organization” (charity.wtf) — organizational framing.
Follow-up: The backend team says 99.99% availability but the frontend team says 99.9% is fine. Who wins?
The SLO should be set from the user’s perspective, not the team’s preference. If the frontend serves the user and it depends on the backend, the backend’s SLO must be at least as strict as the frontend’s. If the frontend targets 99.9%, the backend should target 99.95% or higher, because the frontend has its own failure modes on top of backend failures. If the backend targets only 99.9%, the end-to-end availability will be lower (roughly the product of both).
The real conversation: what does the business need? If losing the checkout flow for 43 minutes/month is fine, 99.99% on the backend (about 4 minutes/month) is over-investing.
Structured Answer Template:
Reframe: the user does not care about the team’s preference, only about the end-to-end experience.
Explain SLO composition math: serial dependencies multiply, so downstream must be stricter than upstream.
Connect to the business: what is the actual user-visible target, and work backward from there.
Call out the hidden dependencies (DNS, certs, LB health checks) that also compose.
Land on the judgment: the stricter team is usually right because they absorb their own failures on top.
Big Word Alert: SLO Composition. The mathematical fact that chained synchronous SLOs multiply, so end-to-end availability is lower than any single service’s SLO. Say it like: “Our 99.95% API depends on two 99.9% services synchronously, so composition gives us 99.75% end-to-end. We need to either tighten the dependencies or break the chain with async boundaries.” Do not confuse with simple “uptime”: composition is about the product of probabilities.
Real-World Example: At AWS, the S3 team publicly documents that S3’s 99.99% availability SLO is achieved by depending on internal services with 99.999% SLOs; they call this “SLO amplification up the stack.” If S3 depended on services at the same 99.99%, the composition math would give only 99.98% end-to-end. This is why platform teams consistently get the strictest SLOs: they must absorb their consumers’ error budgets, not just their own.
Follow-up Q&A Chain:
Q: What if the backend genuinely cannot hit 99.95% with current architecture?
A: Three options: (1) invest in the architecture (multi-AZ, async paths, more replicas) to raise capability, (2) add resilience patterns in the frontend (caching, circuit breakers, fallbacks) so the frontend tolerates backend degradation, or (3) lower the frontend SLO to something achievable. Whichever path, the conversation must involve the product owner, because engineering cannot unilaterally decide to degrade user experience.
Q: Does this composition math apply to asynchronous dependencies?
A: No. That is exactly why async boundaries are so valuable for reliability. If the frontend publishes an event to a queue and the backend processes it later, the frontend’s availability is independent of the backend’s current availability. The queue absorbs the temporal coupling. This is the strongest structural argument for event-driven architecture in reliability-critical systems.
Q: How do you present this to a product manager without drowning them in math?
A: Concretely: “Our frontend target is 99.95%, which is 21 minutes of downtime per month. The backend currently targets 99.9%, which is 43 minutes. When the backend goes down for its full budget, the frontend is down too. So we are really promising the user about 99.85%, roughly 65 minutes. To keep our promise, we need the backend at 99.97% or higher, or we need to stop making the frontend depend synchronously on the backend.”
Further Reading:
Google SRE Book, Chapter 4 “Service Level Objectives” — the composition math section.
Netflix blog, “Building Netflix’s Distributed Tracing Infrastructure” — how they measure composition in practice.
Scenario: Your service has consumed its entire monthly error budget in the first week. What do you do? Who do you talk to?
What they are really testing: Whether you understand error budgets as a governance mechanism, not just a metric, and whether you can navigate the organizational response.
What weak candidates say: “We should look at the dashboard and fix the bug.” This treats the error budget as a technical metric, not a governance mechanism. They focus only on the immediate fix without addressing the organizational response, the communication chain, or the policy enforcement.
What strong candidates say: They immediately frame this as both a technical and organizational event. They talk about the error budget policy, name specific stakeholders, propose a graduated response, and think about longer-term prevention.
Strong answer framework:
Immediate triage. Determine why the budget burned so fast. Was it a single catastrophic incident (a bad deploy that caused 30 minutes of downtime) or a slow bleed (elevated error rates over several days)? The response differs. Pull up the SLI dashboards and correlate the budget burn with specific events in the deploy log or dependency status.
Activate the error budget policy. This is why you pre-negotiate the policy. With the budget exhausted in week one, the policy should mandate a feature freeze for the remainder of the 30-day window. Only reliability improvements and critical security patches ship. Communicate this to the team immediately — not as a punishment, but as the agreed-upon protocol.
Who to talk to:
The on-call/SRE lead — to confirm the budget status and validate the root cause analysis.
The product manager — to invoke the error budget policy. The PM needs to know that feature work is paused and why. This is a collaborative conversation, not a decree.
Engineering leadership — if the PM pushes back on the freeze, escalate per the pre-agreed escalation path. This is exactly the scenario the policy was designed for.
The team — reorient the sprint. What reliability investments will prevent this from happening again? Fix the root cause, add missing alerts, improve rollback speed, or add canary deployments.
Longer-term. After the window resets, conduct a retrospective. Was the SLO set correctly? Was the budget burn caused by something preventable (missing canary, no rollback plan) or something structural (the service is fundamentally under-provisioned)? Adjust the SLO, the deployment process, or the infrastructure accordingly.
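One way to quantify “how fast did the budget burn” during triage is a burn rate: the share of budget consumed divided by the share of the window elapsed. The sketch below assumes a 30-day window; the function names are illustrative.

```python
def burn_rate(budget_consumed: float, days_elapsed: float, window_days: float = 30) -> float:
    """Budget-consumption speed relative to an even spend (1.0 = exactly on pace)."""
    return budget_consumed / (days_elapsed / window_days)


def days_until_exhausted(budget_consumed: float, days_elapsed: float) -> float:
    """Days left at the current pace before the budget reaches zero."""
    per_day = budget_consumed / days_elapsed
    return max(0.0, (1 - budget_consumed) / per_day)
```

For the scenario above, the whole budget gone in 7 days is a burn rate of about 4.3, meaning the service is failing more than four times faster than the SLO allows. Comparing a burn rate over the last hour with one over the last several days is a quick way to separate a single catastrophic incident from a slow bleed.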
Common mistake: Treating the error budget as just a dashboard number with no teeth. If burning the budget does not trigger a real response (feature freeze, reliability sprint), it is not actually governing anything.
Follow-up chain:
Failure mode: What if the budget burn was caused by a third-party dependency outage, not your own code? How do you attribute budget burn when the root cause is outside your control? (Tests whether they understand dependency SLO composition and shared accountability.)
Rollout/Rollback: If you deploy a reliability fix during the freeze period and that fix itself causes degradation, what is your rollback plan? (Tests whether they recognize the irony of reliability fixes causing incidents.)
Measurement: How do you measure whether the reliability sprint during the freeze actually improved anything? What metrics do you track? (Tests whether they can quantify reliability ROI.)
Cost: What is the dollar cost of a feature freeze to the business, and how do you present that trade-off to leadership? (Tests business acumen.)
Security/Governance: If a critical security patch needs to ship during the freeze, how does the exception process work? Who approves it, and what safeguards apply? (Tests policy nuance.)
Senior vs Staff distinction — SLOs and Error Budgets:
What a senior engineer says: “I would set SLOs based on measured baselines, define error budgets, and enforce the policy when the budget burns. I would instrument SLIs, create dashboards, and set up alerts for budget consumption thresholds.”
What a staff/principal engineer adds: “I would design the SLO composition model across the entire dependency graph, ensuring downstream services have stricter SLOs than upstream consumers. I would build the organizational framework for cross-team SLO accountability — including attribution models that distinguish self-caused vs dependency-caused budget burns. I would establish the quarterly SLO calibration process and ensure error budget policy is co-owned by product and engineering leadership, not just tolerated by them. I would also identify where async boundaries should exist to break serial SLO composition chains.”
AI-Assisted Engineering Lens: SLOs and Error Budgets
LLMs and AI coding assistants are changing how teams approach SLO work:
SLO definition acceleration: Tools like GitHub Copilot and Claude can generate boilerplate Prometheus recording rules, Grafana dashboard JSON, and alerting configurations from natural language descriptions of SLIs. What used to take an SRE 2-3 hours of YAML wrangling can be drafted in 15 minutes. The human judgment still matters — the AI generates the configuration, you validate that it measures what users actually care about.
Error budget dashboards: LLMs can generate custom Datadog or Grafana dashboard definitions from a description like “show error budget burn rate for the checkout service with a 30-day rolling window and threshold markers at 50%, 20%, and 0%.” The scaffolding is automated; the SLO target is still a business decision.
Post-incident analysis: AI tools can analyze incident timelines and correlate them with deployment logs, SLI metrics, and error budget consumption to draft post-mortem timelines. This reduces the toil of assembling the chronological narrative, letting the team focus on root cause analysis and action items.
The danger: Over-relying on AI-suggested SLO targets. An LLM might suggest 99.99% because it sounds good in training data, but your business might only need 99.9%. The economic analysis of “which SLO is the cheapest that keeps users happy” still requires human judgment about business context.
Work-sample prompt — SLOs:
“Debug this: Your team’s SLO dashboard shows 99.97% availability for the month, well within the 99.9% target. But support tickets about checkout failures have tripled this week. The dashboard is green. Customers are angry. You have 5 minutes before the VP asks what is going on. What do you check first, and what is the most likely explanation?” (Tests whether the candidate understands that aggregate SLOs can mask localized failures, and whether they can triage under time pressure.)
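One way to see how a green aggregate can hide a red endpoint is to compute availability per endpoint as well as overall. The request counts below are invented for illustration; they are chosen so that the aggregate lands at about 99.97% while checkout sits at 99%.

```python
def availability(good: int, total: int) -> float:
    return good / total

def aggregate_availability(traffic: dict) -> float:
    """Availability across all endpoints, weighted by request volume."""
    good = sum(g for g, _ in traffic.values())
    total = sum(t for _, t in traffic.values())
    return availability(good, total)

def endpoints_below_target(traffic: dict, target: float) -> list:
    """Names of endpoints whose own availability misses the target."""
    return sorted(
        name for name, (g, t) in traffic.items() if availability(g, t) < target
    )

# Hypothetical monthly counts per endpoint: (successful requests, total requests).
traffic = {
    "browse": (9_799_000, 9_800_000),   # high volume, healthy
    "checkout": (198_000, 200_000),     # low volume, failing 1% of the time
}

print(f"aggregate: {aggregate_availability(traffic):.2%}")
print("below 99.9%:", endpoints_below_target(traffic, 0.999))
```

Because checkout is only 2% of traffic in this example, its failures barely move the aggregate. Slicing the SLI by endpoint (or by customer journey) is what makes the problem visible.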
Toil (from the SRE book) is repetitive, manual, automatable, tactical work that scales linearly with service size. Responding to the same alert every week, manually provisioning accounts, manually rotating secrets: all toil.
The SRE principle: Keep toil below 50% of an engineer’s time. Invest the other 50% in automation that eliminates toil. If a task must be done more than twice, automate it.
The One Thing to Remember: Toil is insidious because it feels like “real work.” The test: if a task scales linearly with service growth (more users = more manual steps) and does not make the system permanently better, it is toil. Automate it or it will eventually consume your entire team.
Not all services need the same reliability. A marketing landing page can tolerate more downtime than a payment processing system. The SRE approach: quantify the cost of unreliability (lost revenue, user trust, SLA penalties) and invest proportionally.
The reliability cost curve: Going from 99% to 99.9% might require adding Redis and a second database replica. Going from 99.9% to 99.99% might require multi-region deployment, automated failover, and a dedicated SRE team. Going from 99.99% to 99.999% might require custom infrastructure, consensus protocols, and years of hardening. As a rule of thumb, each additional nine costs roughly 10x more than the last. The engineering question is always: what is the cost of an additional nine vs. the cost of not having it?
Cross-chapter connection: The jump from 99.9% to 99.99% typically requires multi-AZ or multi-region deployment — the practical cloud infrastructure for this (availability zone architecture, managed failover services, cross-region replication) is covered in Cloud Service Patterns. The jump from 99.99% to 99.999% often requires consensus protocols for zero-downtime leader election — see Distributed Systems Theory for how Raft and Paxos make this possible and what latency cost you pay.
Interview Question: A product manager asks you to guarantee 99.999% uptime for a new feature. How do you respond?
Strong answer: First, quantify what 99.999% means: about 26 seconds of downtime per 30-day month. Then ask: what is the business impact of 5 minutes of downtime vs 26 seconds? If the feature is an internal dashboard, 99.9% (about 43 minutes/month) is likely sufficient and dramatically cheaper. If it is a payment processing system where every second of downtime costs $10,000, the investment may be justified.
Present the cost-reliability curve to the PM: here is what 99.9% costs, here is what 99.99% costs, here is what 99.999% costs. Let the business decide which trade-off makes sense.
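The downtime figures quoted in this answer are simple arithmetic on the SLO. A minimal sketch, assuming a 30-day window (the function name and the window length are illustrative choices, not something the policy prescribes):

```python
def allowed_downtime_seconds(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as seconds of full downtime in the window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - slo) * window_seconds

for slo in (0.99, 0.999, 0.9999, 0.99999):
    seconds = allowed_downtime_seconds(slo)
    print(f"{slo:.3%} -> {seconds / 60:9.2f} minutes ({seconds:10.2f} seconds)")
```

For 99.9% this gives 43.2 minutes, and for 99.999% it gives 25.92 seconds, which is where the "43 minutes" and "26 seconds" in the answer come from.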
Scenario: Your VP says 'we cannot afford to slow down on features right now' but the error budget is exhausted. How do you navigate this?
What they are really testing: Whether you can handle the organizational pressure that is the real reason most reliability programs fail. The technical answer is “invoke the error budget policy.” The practical answer requires political judgment.
Strong answer framework:
Start with the data, not the policy. “Our error budget is exhausted, which means we have already used our 43 minutes of allowed downtime this month. In the last incident, we lost an estimated $12,000 in revenue and 6 engineering hours in incident response. Continuing to ship at current velocity without addressing the underlying causes means we are statistically likely to have another incident this month.”
Offer a compromise, not an ultimatum. A full feature freeze is the policy, but proposing it without alternatives makes you look inflexible. Instead: “I propose we continue shipping features that do not touch the critical path. For the checkout and payment services, we pause new features for one sprint and invest in the three highest-impact reliability items from our last post-mortem. That is a 40% velocity reduction for 2 weeks, not a total freeze.”
Make the cost of ignoring the policy explicit. “If we override the error budget policy and ship anyway, and we have another incident, we will have consumed twice our error budget. At that point, the SLA breach penalties kick in — that is a 10% service credit to our enterprise customers, which is roughly $50,000. That is the dollar amount we are betting by continuing to ship.”
Escalate if overridden, but document it. If leadership decides to override the policy, accept the decision but document it: “VP approved continued deployment despite exhausted error budget on [date]. Risk accepted: potential SLA breach.” This is not CYA — it is organizational memory. When the next incident happens, the retrospective can trace the decision chain.
Red flag answer: “The VP is in charge, so we should just keep shipping.” This shows no spine and no understanding of the engineer’s responsibility to advocate for reliability. The opposite red flag: “The error budget policy is the policy and leadership cannot override it.” This shows no organizational awareness: every policy has an override, and the engineer’s job is to make the cost of overriding visible, not to refuse.
Staff-level nuance: The recurring pattern where leadership overrides error budget policies is itself a signal. It usually means one of three things: the SLO is set too aggressively (relax the SLO), the error budget policy was never truly co-owned by product leadership (re-negotiate with buy-in), or reliability incidents are not causing enough visible pain to justify the investment (improve incident cost tracking). A staff engineer addresses the pattern, not just the individual override.
Follow-up: How do you decide which reliability work to prioritize when you only get 20% of sprint capacity?
The prioritization framework I use has four inputs:
Frequency x Impact matrix. List every reliability gap (missing alerts, no rollback plan, unconfigured circuit breaker, missing timeout, no DLQ monitoring). Score each by: how often this gap would cause or extend an incident (frequency) and how severe the impact would be (user-minutes, revenue, blast radius). The top-right quadrant — high frequency, high impact — is your priority list.
Dependency on upcoming work. If next sprint includes a major migration to a new payment provider, “circuit breaker and fallback for payment service” jumps to the top regardless of the matrix. Align reliability work with upcoming risk.
Leverage. Some reliability investments protect one service. Others protect many. Adding canary deployments to the CI/CD pipeline protects every service that deploys through it. Adding an alert for one endpoint protects one endpoint. Favor investments with high leverage — infrastructure-level improvements over service-level ones.
Time to value. A missing alert takes 30 minutes to add and provides value immediately. A chaos engineering program takes 3 months to stand up. With 20% capacity, front-load the quick wins to build credibility, then invest in the larger items.
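The four inputs above can be combined into a rough ranking. The sketch below is one possible scoring scheme, not a standard formula: the 1-5 scales, the multiplication of frequency, impact, and leverage, and the division by effort are all assumptions chosen for illustration, as are the example gaps.

```python
from dataclasses import dataclass

@dataclass
class ReliabilityGap:
    name: str
    frequency: int              # 1 (rare) .. 5 (likely every week); assumed scale
    impact: int                 # 1 (minor) .. 5 (revenue path down); assumed scale
    services_protected: int     # leverage: how many services the fix covers
    effort_hours: float         # time to value: smaller is better
    blocks_upcoming_work: bool = False  # tied to a risky change next sprint

def priority_key(gap: ReliabilityGap):
    score = gap.frequency * gap.impact * gap.services_protected / gap.effort_hours
    # Gaps tied to upcoming risky work sort first; then higher score wins.
    return (not gap.blocks_upcoming_work, -score)

def prioritize(gaps):
    return sorted(gaps, key=priority_key)

gaps = [
    ReliabilityGap("chaos engineering program", 2, 3, 20, 480),
    ReliabilityGap("missing checkout latency alert", 4, 4, 1, 0.5),
    ReliabilityGap("canary stage in CI/CD pipeline", 3, 4, 20, 40),
    ReliabilityGap("payment circuit breaker and fallback", 2, 5, 1, 16,
                   blocks_upcoming_work=True),
]

for gap in prioritize(gaps):
    print(gap.name)
```

With these made-up numbers the payment circuit breaker comes first because of the upcoming migration, the 30-minute alert comes second as a quick win, and the chaos program comes last. The value of writing it down is less the exact score than forcing an explicit estimate for each input.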
The prioritization anti-pattern: Spending the 20% on whatever the most recent incident exposed. This is reactive, not strategic. The most recent incident gets a fix regardless — it is in the post-mortem action items. Your 20% should go toward preventing the next class of incidents, not relitigating the last one.
The One Thing to Remember: Reliability is an economic decision, not an engineering flex. The right SLO is the cheapest one that keeps users happy — anything beyond that is wasted money that should be spent on features.
Further reading: Google SRE Book, Chapter 3: Embracing Risk. This is the chapter that frames reliability as a cost-benefit analysis, not a binary. It explains how Google quantifies the cost of each additional nine and why most services should explicitly choose to be less reliable than they technically could be.
8.4 Reliability Policy: The Organizational Operating Agreement
An error budget policy tells you what to do when the budget burns. A reliability policy is broader: it codifies how the organization makes reliability decisions across the entire lifecycle (planning, building, deploying, operating, and recovering). Without a written reliability policy, every reliability decision is a one-off negotiation that depends on who is in the room and how loudly they argue.
What a reliability policy covers:
1. SLO Ownership and Review Cadence. Every service has an SLO owner (a named human, not a team alias). SLOs are reviewed quarterly. The review asks: Is this SLO still appropriate given current traffic, architecture, and business requirements? Is the SLO being met with room to spare every period (indicating it may be too lenient) or consistently missed (indicating it may be too aggressive or the architecture needs investment)?
2. Error Budget Governance. The error budget policy defines thresholds and responses (covered in 8.1 above). But the reliability policy adds: who can override the policy, what documentation is required for an override, and how overrides are tracked over time. If leadership overrides the error budget freeze more than twice per quarter, that is a signal that either the SLO is wrong or the organization is not taking reliability seriously.
3. Feature Freeze Protocol. When an error budget is exhausted, the reliability policy defines exactly what “feature freeze” means:
Which deployments are blocked? (All deployments? Only deployments to the affected service? Only deployments that touch the critical path?)
What exceptions are allowed? (Security patches, contractual commitments, revenue-protecting fixes?)
Who approves exceptions? (The SLO owner? The VP of Engineering? A designated reliability council?)
How long does the freeze last? (Until the budget replenishes? Until the root cause is fixed? Until the next sprint boundary?)
Ambiguity in any of these answers guarantees conflict during the next incident. Write it down, get sign-off from engineering and product leadership, and revisit it quarterly.
4. Reliability Review for New Services. Before a new service goes to production, it passes a reliability review: Does it have SLOs? Are SLIs instrumented? Does it have health checks (liveness and readiness)? Does it have alerts? Does it have a runbook? Is there an on-call rotation that covers it? Is there a rollback procedure? Services that skip the review become the services that cause incidents, because nobody thought about what happens when they fail.
5. Incident Response Protocol. Who gets paged for which severity? What is the escalation path? How quickly must the incident be acknowledged (5 minutes for Sev1, 30 minutes for Sev2)? What tool is used for incident coordination (PagerDuty, Opsgenie, Slack #incident channel)? How soon must the post-incident review happen (within 5 business days for Sev1, 10 for Sev2)?
The reliability policy anti-pattern: policy without teeth. A reliability policy that exists as a Confluence page nobody reads is worse than no policy at all — it creates the illusion of governance while providing none. The policy has teeth only if: (1) the error budget freeze actually stops deployments (enforced in CI/CD, not just documented), (2) reliability review is a gate in the launch process (services cannot go to production without it), and (3) post-incident action items have owners and deadlines that are tracked to closure.
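Enforcing the freeze in CI/CD can be as small as a gate function that the pipeline calls before promoting a build. The sketch below hard-codes one possible set of answers to the freeze-protocol questions (which change types are exempt, whether off-critical-path deploys are allowed, how an override is represented); those answers, and all names here, are assumptions for illustration and would come from your own written policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: the exception list agreed in the reliability policy.
EXEMPT_CHANGE_TYPES = {"security_patch", "rollback"}

@dataclass
class DeployRequest:
    service: str
    change_type: str                  # e.g. "feature", "security_patch", "rollback"
    touches_critical_path: bool
    override_approved_by: Optional[str] = None  # named approver, if any

def freeze_gate(request: DeployRequest, budget_remaining: float) -> Tuple[bool, str]:
    """Return (allowed, reason). budget_remaining is the fraction of budget left."""
    if budget_remaining > 0:
        return True, "error budget available"
    if request.change_type in EXEMPT_CHANGE_TYPES:
        return True, f"exempt change type: {request.change_type}"
    if not request.touches_critical_path:
        return True, "change does not touch the critical path"
    if request.override_approved_by:
        return True, f"override approved by {request.override_approved_by} (log it)"
    return False, "error budget exhausted: feature freeze in effect"

allowed, reason = freeze_gate(
    DeployRequest("checkout", "feature", touches_critical_path=True),
    budget_remaining=0.0,
)
print(allowed, reason)
```

A pipeline step would exit non-zero when `allowed` is false, and would record the reason string alongside the deploy so that overrides are tracked over time rather than remembered.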
The most politically charged moment in SRE is when the error budget is exhausted and the PM says “we cannot freeze features.” This is not a technical problem — it is an organizational alignment problem. Here are the three most common conflict patterns and how to navigate them.
Conflict Pattern 1: 'The Launch Cannot Slip'
The situation: The error budget is exhausted. The reliability policy says feature freeze. But the PM has a contractual commitment to launch Feature X by end of month, and the launch requires deploying new code to the affected service.
Navigation:
Separate the risk. Does Feature X touch the code path that caused the budget burn? If the budget burned because of a database migration issue and Feature X is a UI change, the risk is low. Allow the deploy with extra safeguards (canary at 1%, extended soak time, manual rollback readiness).
Negotiate scope. Can Feature X launch with a feature flag that is off by default? Deploy the code (low risk — the code exists but does not execute), then enable the flag after the error budget has recovered. This satisfies the contractual commitment (“the feature shipped”) without adding operational risk (“the feature is not active during our reliability recovery period”).
Make the risk explicit and documented. If the deploy must proceed with the flag enabled, require the PM to sign off on the risk: “PM approved deploy to [service] despite exhausted error budget on [date]. Estimated risk: [probability of incident] x [estimated impact]. Risk accepted by: [name].” This is not CYA — it is organizational memory that informs future policy refinements.
The anti-pattern: The SRE says “the policy says no” and refuses to engage with the business context. This gets you overridden and makes you irrelevant. The SRE’s job is to quantify the risk and present options, not to be an unconditional gatekeeper.
Conflict Pattern 2: 'We Have Never Enforced the Freeze Before'
The situation: The error budget policy has existed for a year, but every time the budget has been exhausted, leadership has overridden the freeze. Now you are trying to enforce it for the first time and the response is “we never actually do that.”
Navigation:
Acknowledge the history. “You are right — we have not enforced this before. But we have also had 3 preventable incidents in the last 6 months that cost us $X in revenue and Y engineering hours in incident response. The error budget policy exists specifically to prevent the pattern we are in.”
Propose a graduated enforcement. Instead of a hard freeze (which has never been enforced and will face resistance), propose: “For the next 2 weeks, all deploys to the affected service require a 10-minute canary soak and manual promotion by the on-call engineer. This adds 30 minutes to each deploy but does not block them.” This builds the muscle of deployment discipline without the political cost of a full freeze.
Build the case for next time. Track every incident that occurs during the “unfrozen” period. If another incident happens, you have data: “We chose not to enforce the freeze on [date]. On [date+5], we had another incident that consumed 15 minutes of budget. If we had enforced the freeze, this incident would not have occurred.” Data builds policy credibility over time.
Conflict Pattern 3: 'Reliability Is the Platform Team's Problem'
The situation: The product team’s feature code caused the error budget burn (a new endpoint without a timeout on a downstream call). But the product team says reliability is the platform team’s responsibility. They want to keep shipping while “someone else” fixes the reliability issue.
Navigation:
Reframe with SLO ownership. “The SLO belongs to the service, not to a team. The endpoint that caused the budget burn is in [product team’s service]. The platform team can help with tooling (better circuit breaker libraries, improved monitoring), but the fix — adding a timeout and fallback to the endpoint — is in the product team’s code.”
Offer a reliability pairing session. Instead of an adversarial “your problem, fix it” conversation, offer: “Let’s pair for 2 hours this week. I’ll help instrument the timeout and circuit breaker, and we’ll add the alert together. That way the product team learns the pattern and can apply it to future endpoints independently.”
Establish shared accountability in the post-incident review. The action items from the post-mortem should have owners on both teams: the product team owns the specific fix (add timeout to the endpoint), the platform team owns the systemic improvement (add a linting rule that flags endpoints without timeouts, add a circuit breaker library to the service template).
Every incident produces two outputs: a fix for the immediate problem, and a list of systemic improvements that prevent the class of problem from recurring. Most teams ship the fix and ignore the list. This section is the checklist for the aftercare that turns a single incident into lasting reliability improvement.
Reliability Aftercare Checklist
Run this checklist after every Sev1 or Sev2 incident. The post-incident review produces the action items; this checklist ensures they actually get done and verified.
Alert Tuning (complete within 3 business days):
Was this incident caught by an existing alert, or discovered by a user report / manual check? If discovered externally, add the missing alert.
If an alert fired, did it fire early enough to prevent user impact? If not, tighten the threshold. For example, a P99 latency alert that fires at 5x baseline may need to fire at 2x baseline instead.
Did any noisy or irrelevant alerts fire during the incident and distract the responder? If so, tune or silence them. Alert fatigue during an incident is a force multiplier for slow response.
Is the alert’s notification channel correct? (Sev1 should page; Sev2 should Slack + email; Sev3 should only Slack.) Review and adjust.
Add an alert for the specific condition that caused the incident, not just the symptom. If pool exhaustion caused the P99 spike, add an alert on pool utilization at 80% — do not rely only on the P99 alert.
Runbook Update (complete within 5 business days):
Does a runbook exist for this service and failure mode? If not, create one.
If a runbook exists, did the on-call engineer follow it during the incident? If not, why? Was it outdated, hard to find, or unclear?
Update the runbook with the specific steps taken during this incident, including the diagnostic commands that worked, the ones that did not, and the resolution sequence.
Add a “quick reference” section at the top of the runbook: the 3 most common symptoms and the first command to run for each. During an incident at 3 AM, nobody reads a 5-page document — they need the first step in 10 seconds.
Verify the runbook is accessible from the alert notification (link in the PagerDuty description or Slack alert message).
Rollback Capability (complete within 5 business days):
Could the incident have been resolved faster with a rollback? If so, how long did the rollback take? Target: under 5 minutes for application deploys, under 15 minutes for infrastructure changes.
Is the rollback procedure automated or manual? If manual, automate it. A rollback that requires 8 manual steps at 3 AM will be done wrong.
Was the rollback tested recently (within the last quarter)? If not, test it now. An untested rollback procedure is an optimistic assumption, not a plan.
For database migrations: was the migration reversible? If not, add “reversibility review” to the migration approval checklist for future migrations.
Every action item from the post-incident review has an owner (a named person, not a team) and a deadline.
Action items are tracked in the same system as sprint work (Jira, Linear, etc.) — not in a separate Google Doc that nobody checks.
Action items are reviewed in the weekly team standup until closed. Stale action items (open > 2 weeks past deadline) are escalated to the engineering manager.
The post-incident review is re-reviewed 30 days after the incident to verify: (1) all action items are closed, (2) the fix is still in place (not reverted by a subsequent deploy), (3) the new alerts are firing correctly (not silenced or misconfigured).
The incident is added to the team’s “incident catalog” — a living document that tracks recurring failure patterns and the reliability investments made to address them. If the same failure class appears three times, it is a systemic issue that needs architectural investment, not another action item.
Metric that proves the aftercare worked: The same failure mode does not recur within 90 days. If it recurs, the aftercare was insufficient — either the fix was incomplete, the alert was not tight enough, or the root cause was misidentified.
The aftercare accountability rule: If a Sev1 incident’s action items are not closed within 14 business days, the incident is reopened and escalated. The psychological trick: it is much harder to let action items rot when doing so means the incident is officially “still open” in the tracking system. Nobody wants a 3-month-old open Sev1 on their dashboard.
Cross-chapter connections: Resilience patterns do not exist in isolation. They connect directly to Deployment strategies (canary deployments as a reliability mechanism), Testing (testing your retry/circuit-breaker/fallback behavior), and Incident Response (what happens when these patterns are not in place or fail). Read those chapters alongside this one.
Retry only transient failures (timeouts, 503s, network errors), not permanent ones (400s, 404s). Use exponential backoff with jitter. Set a maximum retry count. Ensure retried operations are idempotent.
Pseudocode: retry with exponential backoff and jitter:

    function retry_with_backoff(operation, max_retries=3, base_delay=1.0, max_delay=30.0):
        // Range is inclusive: 1 initial attempt + up to max_retries retries
        for attempt in 0..max_retries:
            try:
                return operation()
            catch error:
                if not is_retryable(error):    // 400, 404 = don't retry
                    throw error
                if attempt == max_retries:
                    throw error                // exhausted retries

                // Exponential backoff with equal jitter (guarantees minimum delay)
                exp_delay = min(base_delay * (2 ** attempt), max_delay)
                // Equal jitter: half the delay is guaranteed, half is random
                // This ensures some minimum backoff while still de-correlating clients
                sleep(exp_delay / 2 + random(0, exp_delay / 2))

    function is_retryable(error):
        // 502, 503, 504 = almost always transient (gateway/upstream issues)
        // 429 = rate limited (respect Retry-After header)
        // 408, Timeout, Connection = transient network issues
        // 500 = debatable: may be a bug (deterministic failure) or transient
        // Conservative approach: retry 500 once, then treat as non-retryable
        // (not implemented here; this sketch treats 500 like the other statuses)
        return error.status in [408, 429, 500, 502, 503, 504]
            or error is TimeoutError
            or error is ConnectionError
Idempotent — An operation is idempotent if doing it once produces the same result as doing it multiple times. GET requests are naturally idempotent. Creating an order is not — retrying might create duplicates unless you use an idempotency key. This concept connects to retry patterns, message processing, API design, and database operations.
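To make the idempotency-key idea concrete, here is a minimal in-memory sketch. The `OrderService` class and its fields are hypothetical; a production version would keep the key-to-result mapping in a durable store with an atomic insert, and would also handle concurrent requests carrying the same key, neither of which is shown here.

```python
import uuid

class OrderService:
    """Toy order service that deduplicates on a client-supplied idempotency key."""

    def __init__(self):
        self._orders_by_key = {}   # idempotency key -> previously returned order
        self.orders_created = 0

    def create_order(self, idempotency_key: str, items: list) -> dict:
        # A retry with the same key returns the original result
        # instead of creating a second order.
        if idempotency_key in self._orders_by_key:
            return self._orders_by_key[idempotency_key]
        self.orders_created += 1
        order = {"order_id": f"order-{self.orders_created}", "items": list(items)}
        self._orders_by_key[idempotency_key] = order
        return order

service = OrderService()
key = str(uuid.uuid4())            # generated once by the client, reused on retry

first = service.create_order(key, ["book"])
retry = service.create_order(key, ["book"])   # e.g. after a timeout on the first call

print(first["order_id"] == retry["order_id"], service.orders_created)
```

The client generates the key once per logical operation and reuses it for every retry of that operation; a new operation gets a new key.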
Retry Storms. If a downstream service is overloaded and 1000 clients all retry with the same backoff schedule, they all hit the service again simultaneously. Jitter (adding random delay) prevents this. Without jitter, retries make overload worse.
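A small simulation shows the effect. The sketch below assumes every client failed at the same instant and is about to make the same numbered retry; it counts how many retries land in the busiest 100 ms window, with and without equal jitter. The client count, bucket width, and function names are illustrative assumptions.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=True, rng=random):
    """Exponential backoff; with jitter=True, half the delay is randomized."""
    exp_delay = min(base_delay * (2 ** attempt), max_delay)
    if not jitter:
        return exp_delay
    return exp_delay / 2 + rng.uniform(0, exp_delay / 2)

def peak_retries_per_bucket(num_clients, attempt, jitter, bucket_seconds=0.1, seed=0):
    """Largest number of retries that land in any single time bucket."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(num_clients):
        delay = backoff_delay(attempt, jitter=jitter, rng=rng)
        bucket = int(delay / bucket_seconds)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values())

print("no jitter:", peak_retries_per_bucket(1000, attempt=2, jitter=False))
print("jitter:   ", peak_retries_per_bucket(1000, attempt=2, jitter=True))
```

Without jitter, all 1000 clients compute the same 4-second delay and arrive in the same window. With equal jitter, the same retries are spread over a 2-second span, so the busiest window sees only a fraction of them.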
The One Thing to Remember: Retries without backoff and jitter are a DDoS attack you launch against yourself. Always ask: “If every client retries simultaneously, does this make the problem worse?”
Further reading: Exponential Backoff and Jitter (AWS Architecture Blog). The canonical reference on why jitter matters and the difference between full jitter, equal jitter, and decorrelated jitter. Includes simulations showing how different strategies perform under contention. If you implement retries in production, read this first.
Analogy — Circuit Breakers Are Like Electrical Fuses. In your house, a fuse (or circuit breaker) does not exist to protect the one appliance that is drawing too much current — it exists to protect the entire house from catching fire. When the fuse trips, the broken appliance stops getting power, but your refrigerator, lights, and everything else keeps running. Software circuit breakers work the same way: when a downstream dependency starts failing, the circuit breaker trips to protect your service — and all its other callers — from being dragged down with it. You sacrifice one dependency’s functionality to preserve the health of the whole system.
The circuit breaker pattern prevents cascade failures and gives failing services time to recover. It operates as a state machine with three states:
CLOSED (normal): All requests pass through to the downstream service. Failures are counted. When the consecutive-failure count reaches the threshold (e.g., 5 consecutive failures), the breaker trips and transitions to OPEN.
OPEN (failing fast): All requests are immediately rejected with an error (or a fallback response) without calling the downstream service. This protects the failing service from additional load and protects the caller from waiting on timeouts. After a recovery timeout period (e.g., 30 seconds), the breaker transitions to HALF-OPEN.
HALF-OPEN (testing recovery): A limited number of requests are allowed through as a test. If they succeed (meeting a success threshold, e.g., 3 consecutive successes), the breaker transitions back to CLOSED. If any request fails, the breaker immediately returns to OPEN and the recovery timer resets.
Pseudocode implementation:
    // Not thread-safe as written; guard state changes with a lock in concurrent code.
    class CircuitBreaker:
        state = CLOSED
        failure_count = 0
        success_count = 0
        last_failure_time = null
        half_open_in_flight = false   // true while a probe request is outstanding

        FAILURE_THRESHOLD = 5     // open after 5 consecutive failures
        RECOVERY_TIMEOUT = 30s    // try half-open after 30 seconds
        SUCCESS_THRESHOLD = 3     // close after 3 successes in half-open

        function call(request):
            if state == OPEN:
                if now() - last_failure_time > RECOVERY_TIMEOUT:
                    state = HALF_OPEN          // enough time passed, start probing
                    success_count = 0
                else:
                    throw CircuitOpenException("Service unavailable, circuit is open")

            if state == HALF_OPEN:
                if half_open_in_flight:
                    throw CircuitOpenException("Half-open test in progress, rejecting")
                half_open_in_flight = true     // admit one probe at a time

            try:
                result = downstream.call(request)
                on_success()
                return result
            catch error:
                on_failure()
                throw error

        function on_success():
            if state == HALF_OPEN:
                half_open_in_flight = false
                success_count++
                if success_count >= SUCCESS_THRESHOLD:
                    state = CLOSED             // recovered: resume normal traffic
                    failure_count = 0
                    success_count = 0
            else:
                failure_count = 0              // reset on success in closed state

        function on_failure():
            failure_count++
            last_failure_time = now()
            if state == HALF_OPEN:
                state = OPEN                   // still failing: reopen
                success_count = 0
                half_open_in_flight = false
            elif failure_count >= FAILURE_THRESHOLD:
                state = OPEN                   // too many failures: trip the breaker
Tools: Polly (.NET), Resilience4j (JVM), cockatiel (Node.js), hystrix-go (Go). Istio service mesh provides circuit breaking at the infrastructure level.
Scenario: A downstream dependency starts returning 500s intermittently. Your circuit breaker opens. But business says we MUST serve traffic. What is your strategy?
What they are really testing: Whether you can balance resilience engineering with business requirements, and whether you understand graceful degradation as the bridge between the two.
Strong answer framework:
Acknowledge the tension. The circuit breaker is doing its job — protecting your service from a failing dependency. But “rejecting all requests” is not acceptable to the business. The answer is not “disable the circuit breaker” (that would cascade the failure into your service) — the answer is graceful degradation with fallbacks.
Implement a fallback strategy. When the circuit breaker is open, instead of returning an error to the user, serve a degraded experience:
If the dependency is a recommendation engine: serve a static “Popular Items” list from cache or a pre-computed fallback.
If the dependency is a pricing service: serve the last-known-good prices from cache, with a “prices as of X minutes ago” disclaimer.
If the dependency is critical-path (like payment processing): queue the request for retry, show the user “your order is being processed,” and process it asynchronously when the dependency recovers.
Use feature flags to disable non-essential UI components that depend on the failing service.
Tune the circuit breaker, do not disable it. Consider adjusting the half-open behavior to allow a higher percentage of probe requests through, so recovery is detected faster. But do not increase the failure threshold to the point where the breaker never trips — that defeats the purpose.
Communicate. Set up a status page update. Alert the dependency team. If the degraded experience has business impact (e.g., stale prices might cause revenue loss), loop in the product owner to make the cost-benefit decision.
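The pricing fallback in the framework above can be sketched as a thin wrapper around the protected call. Everything here is hypothetical scaffolding: `CircuitOpenError` stands in for whatever exception your circuit breaker library raises when it fails fast, and `LastKnownGood` is a single-value cache invented for the example.

```python
import time

class CircuitOpenError(Exception):
    """Stand-in for the error a circuit breaker raises while it is open."""

class LastKnownGood:
    """Remembers the most recent successful response and when it was stored."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._value = None
        self._stored_at = None

    def store(self, value):
        self._value = value
        self._stored_at = self._clock()

    def load(self):
        """Return (value, age_seconds), or None if nothing was ever stored."""
        if self._stored_at is None:
            return None
        return self._value, self._clock() - self._stored_at

def get_prices(fetch_live_prices, cache):
    """Serve live prices when possible, last-known-good prices when the circuit is open."""
    try:
        prices = fetch_live_prices()
    except CircuitOpenError:
        cached = cache.load()
        if cached is None:
            raise                  # nothing to degrade to; surface the failure
        prices, age_seconds = cached
        return {"prices": prices, "stale": True, "age_seconds": age_seconds}
    cache.store(prices)
    return {"prices": prices, "stale": False, "age_seconds": 0.0}
```

The `stale` flag and `age_seconds` give the UI what it needs to show a "prices as of X minutes ago" disclaimer. How old a cached price may be before you refuse to serve it is a business decision, which is why this sketch does not enforce a limit.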
Common mistake: Disabling the circuit breaker under business pressure. This is the software equivalent of bypassing a fuse because you need the appliance to work — it might work for a minute, but you risk burning down the house.
The One Thing to Remember: A circuit breaker does not exist to protect the failing service — it exists to protect everything else from the failing service. The question is never “can we tolerate this dependency failing?” but “can we tolerate this failure spreading?”
Further reading:
Martin Fowler, Circuit Breaker: the article that popularized the pattern for software, with clear state machine diagrams and implementation guidance.
Resilience4j Documentation: a widely used JVM resilience library, with excellent docs on circuit breaker configuration, metrics, and integration with Spring Boot. For .NET, see Polly; for Node.js, see cockatiel.
Every external call needs a timeout. Without one, a slow dependency hangs your thread/connection indefinitely.
Types:
Connection timeout: how long to wait for the TCP handshake (typically 1-5 seconds).
Read/response timeout: how long to wait for the response (depends on expected operation time).
Overall request timeout: the end-to-end deadline, including retries.
Setting timeouts too high negates their purpose. Setting them too low causes false failures. Base timeouts on measured p99 latency of the downstream service with a reasonable buffer (e.g., p99 x 2).
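Both ideas, a read timeout derived from measured p99 and an overall deadline that bounds all retries, fit in a few lines. This is a sketch under assumptions: the nearest-rank percentile method, the 2x multiplier from the text, and the `Deadline` helper are illustrative choices rather than a prescribed implementation.

```python
import time

def p99(latencies_ms):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]

def suggested_timeout_ms(latencies_ms, multiplier=2.0):
    """Read timeout as measured p99 times a buffer (p99 x 2 by default)."""
    return p99(latencies_ms) * multiplier

class Deadline:
    """Overall request budget shared by every attempt, including retries."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def per_attempt_timeout(self, read_timeout):
        # Never wait longer on one attempt than the overall deadline allows.
        return min(read_timeout, self.remaining())

samples_ms = list(range(1, 101))             # hypothetical measurements: 1..100 ms
print(suggested_timeout_ms(samples_ms))
```

Capping each attempt at the remaining budget is what keeps "3 retries with a 2-second read timeout" from silently turning into an 8-second request when the caller only has 3 seconds to spare.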
The One Thing to Remember: A missing timeout is an unbounded commitment. Every external call without a timeout is a promise to wait forever — and “forever” in production means your thread pool is drained and your service is dead.
Further reading: Microsoft, Retry Pattern (Cloud Design Patterns). Covers timeouts in the context of retry strategies, including how to set connection vs. read vs. overall request timeouts and how they interact with retries and circuit breakers.
Isolate components so failure in one does not affect others. Named after ship bulkheads that contain flooding to one compartment.
Concrete example: Your service calls both a fast internal database (5ms) and a slow third-party API (500ms). Both share a single thread pool of 50 threads. When the third-party API starts timing out at 30 seconds, all 50 threads get stuck waiting for it. Now your fast database queries also fail, not because the database is slow, but because there are no free threads.
Fix: Separate thread/connection pools. Give the database calls their own pool of 30 threads and the third-party API its own pool of 20 threads. When the API hangs, only its 20 threads are consumed. Database calls continue working normally.
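A lightweight way to get the same isolation is a per-dependency concurrency limit. Note the swap: the sketch below uses semaphores rather than separate thread pools, so it caps how many callers may be inside each dependency at once and rejects the rest immediately. The `Bulkhead` class and `BulkheadFullError` are hypothetical names; the 30/20 split mirrors the example above.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already reached."""

class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Non-blocking acquire: if the compartment is full, fail fast
        # rather than tying up yet another caller thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()

# One compartment per dependency, sized as in the example above.
database = Bulkhead("database", max_concurrent=30)
third_party_api = Bulkhead("third_party_api", max_concurrent=20)

print(database.run(lambda: "row fetched"))
```

If the third-party API hangs, at most 20 callers are stuck inside `third_party_api.run`; the 21st gets `BulkheadFullError` right away, and `database.run` is unaffected because it draws from a different set of slots.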
Consider an e-commerce backend that calls three downstream services:

| Dependency | Thread Pool Size | Timeout | Priority |
| --- | --- | --- | --- |
| Payment service | 25 threads | 5s | Critical (revenue path) |
| Search service | 15 threads | 2s | Important, but degradable |
| Recommendation engine | 10 threads | 1s | Nice-to-have; can show “Popular Items” fallback |

If the recommendation engine hangs, only its 10 threads are consumed. Payment and search continue unaffected. Without bulkheads, a slow recommendation engine could starve the payment service of threads and block checkout — a non-critical dependency taking down your revenue path.
Types of bulkheads:
Thread pool isolation — separate pools per dependency
Connection pool isolation — separate database connection pools for critical vs non-critical queries
Process isolation — separate services or containers
Infrastructure isolation — separate Kubernetes namespaces with resource quotas per team or service tier
In Kubernetes: Resource requests and limits are infrastructure-level bulkheads — they prevent one pod from consuming all CPU/memory on a node and starving other pods.
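A minimal sketch of application-level isolation, assuming a semaphore per dependency rather than a dedicated thread pool. The `Bulkhead` class and `BulkheadFullError` are hypothetical names for illustration; libraries such as Resilience4j provide their own implementations.

```python
import threading
from typing import Any, Callable


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency slots are all in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Fail fast instead of queueing: a full bulkhead means the dependency is slow.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Sizes mirror the e-commerce example above.
payment = Bulkhead("payment", 25)
search = Bulkhead("search", 15)
recommendations = Bulkhead("recommendations", 10)
```

When `recommendations` is saturated, callers get `BulkheadFullError` immediately and can serve a fallback, while `payment.call(...)` is unaffected.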
The One Thing to Remember: Without bulkheads, your system is only as reliable as your least reliable dependency. A slow recommendation engine should never be able to take down your payment flow.
Further reading: Microsoft — Bulkhead Pattern (Cloud Design Patterns) — the reference documentation for the bulkhead pattern, with detailed guidance on thread pool isolation, process isolation, and how to size partitions based on SLOs and dependency criticality. Part of Microsoft’s excellent Cloud Design Patterns collection, which also covers circuit breaker, retry, and throttling patterns.
Provide reduced functionality rather than complete failure. The goal: protect the critical path while letting non-critical features fail silently.
Concrete fallback examples:
Database is slow -> show cached data (stale but available)
Recommendation engine is down -> show “Popular Products” (static, pre-computed)
Review service is unavailable -> hide the reviews section (product page still works)
Payment service timeout -> queue the payment for retry, tell the user “processing”
Search service overloaded -> show category browsing instead
CDN is down -> serve directly from origin (slower but functional)
The principle: Identify your revenue-critical path (browse -> cart -> checkout -> payment) and protect it at all costs. Everything else (recommendations, reviews, analytics, notifications) can degrade. Use feature flags as kill switches — when a non-critical service is struggling, disable the feature entirely rather than let it drag down the page.
This is the operational checklist for building systems that degrade gracefully. Print it. Review it during design reviews. Revisit it quarterly as your architecture evolves.
Feature Flags as Kill Switches:
Every non-critical feature has a feature flag that can disable it in production without a deploy
Feature flags are evaluated at request time (not cached for hours) so kill switches take effect immediately
There is a runbook that documents which flags to flip during which failure scenarios
Feature flags have an owner and are reviewed quarterly — stale flags are removed to prevent flag debt
The flag evaluation service itself has a fallback (if LaunchDarkly is down, default to “feature off” for non-critical, “feature on” for critical-path)
Circuit Breakers and Fallback Responses:
Every synchronous external dependency has a circuit breaker with tuned thresholds (not just defaults)
Circuit breakers use error-rate thresholds (not just consecutive failures) to catch intermittent degradation
Each circuit breaker has a defined fallback: cached data, static response, degraded UI, or queued retry
Fallback responses are tested regularly — not just in unit tests, but in integration tests that simulate the circuit opening
Circuit breaker state is exposed as a metric and triggers an alert when it opens
Read-Only Mode / Reduced-Write Mode:
The system can operate in read-only mode if the write path (database primary, message queue) is impaired
Users see a clear “read-only mode” indicator — not a confusing error when they try to submit a form
Write operations are queued (not dropped) so they can be processed when write capability is restored
There is a toggle (feature flag or configuration) to activate read-only mode manually during incidents
Queue-Based Load Leveling:
Spiky or burst traffic is absorbed by a queue rather than hitting downstream services directly
Queue depth is monitored with alerts for abnormal growth (indicates consumers are falling behind or downstream is slow)
Consumers implement backpressure — if the downstream service is slow, consumers slow their consumption rate rather than flooding it
Dead letter queues capture messages that fail after max retries for investigation and replay
Queue-based paths are idempotent — processing the same message twice produces the same result
Static Fallback Content:
Pre-computed fallback content exists for every enhancing dependency (popular products, default recommendations, cached search results)
Fallback content is refreshed on a schedule (daily or hourly) so it does not become embarrassingly stale
The CDN can serve a static version of critical pages if the application tier is completely down (static site failover)
Error pages are informative, branded, and hosted independently of the application (not on the same infrastructure that is failing)
Load Shedding:
The system can shed non-critical traffic under extreme load (return 503 to analytics endpoints while serving checkout)
Load shedding priorities are pre-defined: revenue-critical > user-facing > internal > background jobs
Shedding is automatic (based on CPU, memory, or queue depth thresholds) with manual override capability
Shed requests receive a Retry-After header so clients know when to retry
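The load-shedding items above can be sketched as a small admission function. The utilization thresholds, the `admit` name, and the 30-second `Retry-After` value are assumptions chosen for illustration; real thresholds should come from load testing.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Priority(IntEnum):
    REVENUE_CRITICAL = 0
    USER_FACING = 1
    INTERNAL = 2
    BACKGROUND = 3


# Hypothetical utilization levels (0.0-1.0) at which each tier starts being shed.
# Revenue-critical traffic has no entry: it is never shed by this function.
SHED_AT: Dict[Priority, float] = {
    Priority.BACKGROUND: 0.70,
    Priority.INTERNAL: 0.80,
    Priority.USER_FACING: 0.95,
}


def admit(priority: Priority, utilization: float,
          retry_after_s: int = 30) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, headers) for a request at the given load."""
    threshold = SHED_AT.get(priority)
    if threshold is not None and utilization >= threshold:
        return 503, {"Retry-After": str(retry_after_s)}
    return 200, {}
```

Lower-priority tiers are shed first as utilization rises, which matches the ordering revenue-critical > user-facing > internal > background jobs.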
Cross-chapter connection — Cloud-native load leveling: Queue-based load leveling is most commonly implemented with SQS, SNS, or EventBridge in AWS environments. The specific configuration patterns — including visibility timeouts, message retention periods, and DLQ redrive policies — are covered in Cloud Service Patterns. For Kubernetes-native backpressure and horizontal pod autoscaling based on queue depth, see the container orchestration sections of that chapter.
The One Thing to Remember: Before building any feature, classify it: is this on the critical path or off it? Critical-path features need fallbacks. Off-path features need kill switches. If you cannot answer which category a feature belongs to, your architecture is not well-understood enough to operate safely.
Not every service needs every pattern. Applying circuit breakers, bulkheads, retry budgets, and multi-provider failover to a low-traffic internal dashboard is over-engineering that adds maintenance burden without proportional benefit.
Decision framework — match investment to criticality:
All of the above + bulkheads, DLQ, idempotency keys, Saga pattern, multi-provider failover
Nothing — invest in everything
Platform infrastructure (auth, API gateway, DNS)
All of the above + chaos engineering, multi-region, automated failover
Nothing — this is your blast radius multiplier
The “overkill” test: Ask these three questions before adding a resilience pattern:
What is the blast radius if this dependency fails without the pattern? If the answer is “one internal dashboard shows stale data for 10 minutes,” a circuit breaker is unnecessary. If the answer is “checkout goes down for all users,” the pattern is justified.
How often does this failure mode actually occur? If the dependency has been 99.99% available for 2 years and is a managed cloud service (S3, DynamoDB), spending 2 weeks building a multi-provider fallback is poor ROI. Spend that time on the service that fails monthly.
Can the team maintain this pattern? A circuit breaker that nobody understands, nobody monitors, and nobody tunes after initial deployment is not a resilience pattern — it is dead code that will surprise you during an incident. If the team cannot commit to maintaining a pattern, do not add it.
The ownership rule: Every resilience pattern must have an owner — a person or team responsible for tuning thresholds, monitoring state transitions, and validating behavior during chaos experiments. A circuit breaker with default thresholds, no dashboard, and no owner is resilience theater. It gives the illusion of protection while providing none.
Cross-chapter connection: Feature flags as kill switches are covered in detail in the Deployment chapter, including canary rollout patterns and automated rollback criteria. Graceful degradation is only as good as your ability to roll back or disable features quickly.
Messages that fail after maximum retries go to a DLQ for investigation. Without one, a poison message (a message that always fails processing) blocks the entire queue.
DLQ processing workflow:
Monitor DLQ depth — alert when > 0 messages (or > threshold for noisy systems).
Investigate: read the message payload, check error logs with the correlation_id, determine if the failure is transient (dependency was down) or permanent (malformed data, bug in consumer logic).
Fix: if transient — replay messages from DLQ back to the main queue. If permanent — fix the consumer bug, deploy, then replay. If truly unprocessable — move to a permanent failure store and alert the business.
Automate: set up a DLQ consumer that logs message details, sends alerts, and provides a UI for manual replay.
Infinite DLQ Loop. If you automatically replay DLQ messages without fixing the underlying issue, you create an infinite loop: message fails -> DLQ -> replay -> fails again -> DLQ. Always fix the root cause before replaying. Add a retry counter header to each message — if it exceeds a maximum (e.g., 5 total attempts across DLQ replays), move to a permanent failure store.
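A minimal sketch of the retry-counter guard described above, using plain lists as stand-ins for the main queue and the permanent failure store. The header name `x-total-attempts` and the function `replay_or_park` are hypothetical; a real broker would carry the counter in its own message attributes.

```python
from typing import Any, Dict, List

MAX_TOTAL_ATTEMPTS = 5                 # example limit from the text above
ATTEMPTS_HEADER = "x-total-attempts"   # hypothetical header name


def replay_or_park(
    message: Dict[str, Any],
    main_queue: List[Dict[str, Any]],
    permanent_failures: List[Dict[str, Any]],
) -> str:
    """Replay a DLQ message unless it has already used up its attempts."""
    headers = message.setdefault("headers", {})
    attempts = int(headers.get(ATTEMPTS_HEADER, 0))
    if attempts >= MAX_TOTAL_ATTEMPTS:
        # Stop the loop: park the message for manual handling.
        permanent_failures.append(message)
        return "parked"
    headers[ATTEMPTS_HEADER] = attempts + 1
    main_queue.append(message)
    return "replayed"
```

The counter travels with the message, so the limit holds even when several operators or automated jobs trigger replays.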
The One Thing to Remember: A DLQ is not a trash bin — it is a triage queue. Every message in the DLQ represents a promise you made to a user or another system that you have not kept yet. Monitor DLQ depth like you monitor error rates.
Further reading: AWS — Amazon SQS Dead-Letter Queues — practical documentation on configuring DLQs with maxReceiveCount, redrive policies, and monitoring. Microsoft — Competing Consumers Pattern — covers the broader message processing context in which DLQs operate, including how to handle poison messages and ensure exactly-once processing semantics.
Liveness (/health): Is the process running? Keep it simple — return 200 if the process is alive. Do NOT check dependencies here. If you check the database in your liveness probe and the database goes down, Kubernetes restarts all your pods — making the outage worse (you now have zero application capacity AND the database is down).
Readiness (/ready): Can this instance handle traffic right now? Check: database connection works, cache is reachable, any warmup (loading config, building in-memory indexes) is complete. When readiness fails, the instance is removed from the load balancer — no traffic is routed to it, but it is not killed.
Startup probes (Kubernetes): For applications that take a long time to start (JVM warmup, large model loading), use a startup probe with generous timeouts. Without it, Kubernetes may kill your pod during startup because the liveness probe fails during the warmup period.
Common mistake: Putting expensive checks (database query, downstream HTTP call) in the liveness probe with aggressive intervals (every 5 seconds). Under load, the probe itself becomes a source of load on the database.
The One Thing to Remember: Liveness answers “is this process alive?” — keep it trivial. Readiness answers “can this instance handle traffic right now?” — check dependencies here. Confusing the two is one of the most common causes of Kubernetes-amplified outages.
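The split between the two probes can be sketched as two framework-independent handlers. The function names and the shape of the `checks` mapping are assumptions; wire them to whatever HTTP framework serves /health and /ready.

```python
from typing import Callable, Dict, List, Tuple


def liveness() -> Tuple[int, str]:
    """Liveness: the process can respond at all. Deliberately no dependency checks."""
    return 200, "alive"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, List[str]]:
    """Readiness: every dependency check must pass before taking traffic."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed check
        if not ok:
            failed.append(name)
    return (200 if not failed else 503), failed
```

A failing database check here returns 503 from readiness only, so the instance leaves the load balancer without being restarted.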
Cross-chapter connection — OS-level failure modes: Health checks exist because processes die in ways that are invisible to the application layer. The Linux OOM Killer can SIGKILL your process with zero warning when memory pressure is high. A file descriptor leak can leave your process alive but unable to accept connections. A zombie process can hold a PID but do nothing useful. Understanding why processes fail — and why “the process is running” is not the same as “the process is healthy” — requires OS-level knowledge. See OS Fundamentals for the mechanics of process lifecycle, signals, OOM scoring, and resource limits that determine when and how your application dies.
Further reading: Kubernetes — Configure Liveness, Readiness and Startup Probes — the official documentation with concrete YAML examples for HTTP, TCP, and exec probes, including timing parameters (initialDelaySeconds, periodSeconds, failureThreshold) and common anti-patterns to avoid. Essential reading before configuring probes in production.
Deliberately inject failures to test resilience: kill instances, introduce network latency, simulate dependency outages. The goal is to find weaknesses before they cause real incidents.
Chaos engineering is not just “randomly breaking things.” It follows a disciplined scientific method:
Define steady state. Establish measurable indicators of normal system behavior — e.g., “p99 latency < 200ms, error rate < 0.1%, orders per minute > 500.” This is your baseline.
Form a hypothesis. “If we terminate 1 of 3 application instances, the load balancer will redistribute traffic and steady state will be maintained within 30 seconds.”
Introduce a real-world failure. Kill the instance, inject network latency, saturate CPU, drop packets, corrupt DNS responses.
Observe the difference. Compare actual system behavior against the steady-state hypothesis. Did latency spike? Did errors increase? How long until recovery?
Fix or accept. If the system handled it gracefully, increase the blast radius next time. If it did not, you found a weakness — fix it and retest.
Simulates network conditions (latency, bandwidth, timeouts) at the TCP level
Start small — kill one instance and verify the system recovers gracefully before injecting more complex failures. Run chaos experiments in staging first, then graduate to production with tight blast radius controls (affect 1% of traffic, auto-halt if error rate exceeds threshold). Chaos engineering in production without safety controls is just causing outages.
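A small sketch of how the steady-state definition and the auto-halt control might be encoded for an experiment harness. The default thresholds reuse the example numbers above; the metric keys, the `should_halt` helper, and its 1% abort threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SteadyState:
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001       # 0.1%
    min_orders_per_min: float = 500.0

    def violations(self, metrics: Dict[str, float]) -> List[str]:
        """Return the names of indicators that are outside steady state."""
        out: List[str] = []
        if metrics["p99_ms"] >= self.max_p99_ms:
            out.append("p99_ms")
        if metrics["error_rate"] >= self.max_error_rate:
            out.append("error_rate")
        if metrics["orders_per_min"] <= self.min_orders_per_min:
            out.append("orders_per_min")
        return out


def should_halt(metrics: Dict[str, float], abort_error_rate: float = 0.01) -> bool:
    """Safety control: stop the experiment if errors exceed the abort threshold."""
    return metrics["error_rate"] > abort_error_rate
```

An empty `violations` list after the fault is injected supports the hypothesis; a non-empty one is the weakness to fix and retest.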
Analogy — Chaos Engineering Is Like a Fire Drill. Nobody runs a fire drill because they want the building to catch fire. They run it because when a real fire happens, they want everyone to know exactly where the exits are, who is responsible for what, and which systems work under stress. Chaos engineering is the same: you inject controlled failures not because failure is fun, but because the rehearsal is what makes the real incident survivable. The organizations that never drill are the ones that panic during actual emergencies.
The One Thing to Remember: The goal of chaos engineering is not to cause failures — it is to find them before your users do. Every chaos experiment that reveals a weakness is a production incident you prevented.
Further reading: Principles of Chaos Engineering — the foundational manifesto that defines the discipline, written by the Netflix team that invented it. Short, precise, and essential. Netflix Chaos Monkey — GitHub — the source code and documentation for the tool that started it all. Chaos Engineering by Casey Rosenthal and Nora Jones (O’Reilly) — the comprehensive book that expands the principles into a full engineering practice with case studies from Netflix, Google, Amazon, and Microsoft.
Cross-chapter connection: Chaos engineering findings feed directly into your Testing strategy (adding regression tests for discovered failure modes) and your Incident Response runbooks (updating playbooks based on what you learned). Chaos experiments without follow-through are just controlled outages.
Further reading: Release It! by Michael Nygard — the essential book on resilience patterns. Covers stability patterns, capacity patterns, and real-world failure stories.
Every production service depends on things it does not control: third-party APIs (Stripe, Twilio, SendGrid), cloud managed services (RDS, ElastiCache, S3), internal platform services owned by other teams, and SaaS tools (LaunchDarkly, Datadog, Auth0). When any of these dependencies have an outage, your service’s reliability is only as good as the strategies you have in place to handle it. This section covers the patterns that separate services that survive dependency outages from services that cascade-fail.
Before you can manage dependencies, you need to classify them. Not all dependencies are equal, and the investment in resilience should match the criticality:
Classification
Definition
Example
Required Resilience
Critical-path, synchronous
Request cannot complete without this dependency responding successfully in real time
Fire-and-forget with async retry, accept data loss
The One Thing to Remember: Classify every dependency before an outage, not during one. When Stripe goes down at 2 AM is not the time to discover you have no fallback strategy for your payment flow. The classification matrix should be a living document reviewed quarterly.
Third-party services are the most dangerous dependencies because you have zero control over their performance characteristics, deployment schedule, or incident response. Your timeout strategy must account for this:
1. Aggressive timeouts with fallback. Set timeouts based on the dependency’s observed p99 latency, not their documented SLA. If Stripe’s p99 is 800ms, set your timeout to 1.5-2 seconds — generous enough to avoid false timeouts but tight enough to prevent thread pool starvation when Stripe is degraded.
2. Separate timeouts for connection vs. read. A connection timeout (TCP handshake) of 1-2 seconds catches DNS or routing failures fast. A read timeout (waiting for response body) should be longer and tuned per endpoint. Do not use a single blanket timeout for all third-party calls.
3. Deadline propagation. If your user-facing API has a 3-second SLA and you call two downstream services sequentially, you cannot give each one a 3-second timeout — the total would be 6 seconds. Pass a deadline through the call chain. Each downstream call gets remaining_deadline - estimated_overhead. This is what gRPC’s deadline propagation does natively.
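The remaining_deadline - estimated_overhead rule can be sketched with a small in-process deadline object. The `Deadline` class, `DeadlineExceeded`, and the 50ms default overhead are hypothetical; gRPC carries the equivalent value across service boundaries for you.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    """Raised when no usable time budget is left for another call."""


class Deadline:
    """Tracks one request's overall time budget across sequential calls."""

    def __init__(self, budget_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining_s(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_next_call(self, estimated_overhead_s: float = 0.05) -> float:
        """Timeout for the next downstream call: what is left minus local overhead."""
        timeout = self.remaining_s() - estimated_overhead_s
        if timeout <= 0:
            raise DeadlineExceeded("no budget left for another downstream call")
        return timeout
```

Create one `Deadline(3.0)` per incoming request and pass `timeout_for_next_call()` to each downstream client, so the second call automatically gets only what the first left over.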
Cross-chapter connection: Deadline propagation across service boundaries is a distributed systems coordination problem. The mechanics of how deadlines and cancellation signals propagate through RPC chains — and the subtle bugs that arise when one service respects the deadline but another ignores it — are explored in Distributed Systems Theory.
When a dependency is down, you have five options — and choosing the right one depends on the dependency classification:
1. Serve from cache (stale data). If the dependency provides data that changes infrequently (product catalog, user profile, pricing), cache the last-known-good response and serve it when the dependency is unreachable. Be transparent: add an X-Data-Freshness: stale header or show “prices as of 10 minutes ago” in the UI. This is the most common and most effective fallback for enhancing dependencies.
function get_product_recommendations(user_id):
    try:
        result = recommendation_service.get(user_id, timeout=500ms)
        cache.set(f"recs:{user_id}", result, ttl=1h)    // cache fresh response
        return result
    catch (TimeoutError, ServiceUnavailableError):
        cached = cache.get(f"recs:{user_id}")
        if cached:
            return cached.with_metadata(stale=true)     // serve stale cache
        return STATIC_POPULAR_ITEMS                     // ultimate fallback
2. Degrade the feature. If the recommendation engine is down, show “Popular Products” instead of personalized recommendations. If the review service is down, hide the review section entirely. The product page still loads, the user can still buy — the experience is reduced but functional. Use feature flags as kill switches to disable dependent features instantly.
3. Queue for async processing. For critical-path operations that cannot be skipped (payment capture, order confirmation email), accept the request synchronously but queue the downstream call for retry. Tell the user “your order is being processed” and complete the downstream call when the dependency recovers. This requires idempotency keys to prevent double-processing.
4. Multi-provider failover. For truly critical dependencies (payment processing, SMS delivery), maintain integrations with two providers. If Stripe is down, fail over to Adyen. If Twilio is down, fail over to Vonage. This doubles integration maintenance cost, so reserve it for dependencies where downtime has direct revenue impact. The circuit breaker on the primary provider triggers the failover.
5. Graceful rejection with clear communication. When no fallback is possible (e.g., the identity provider is down and you cannot authenticate users), return a clear error with context: “We’re experiencing issues with our authentication provider. Please try again in a few minutes.” This is better than a generic 500 or a hanging request. Set a Retry-After header.
Cache stampede during dependency recovery. When a dependency comes back online after an outage, do not let all cached clients simultaneously refresh. Implement staggered TTLs (add jitter to cache expiration) or use a single-flight pattern where only one request fetches fresh data and all others wait for that result. Without this, your “recovery” becomes a thundering herd that immediately overloads the dependency again.
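Staggered TTLs can be as simple as adding random jitter when each entry is written. The sketch below assumes a ±20% spread; `jittered_ttl_s` and its parameters are illustrative names, not a cache library API.

```python
import random


def jittered_ttl_s(base_ttl_s: float, jitter_fraction: float = 0.2,
                   rng: random.Random = random.Random()) -> float:
    """Return base_ttl_s shifted by a random amount within +/- jitter_fraction."""
    spread = base_ttl_s * jitter_fraction
    return base_ttl_s + rng.uniform(-spread, spread)
```

With a one-hour base TTL, entries written at the same moment expire anywhere between 48 and 72 minutes later, which spreads refresh traffic instead of concentrating it at a single instant.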
You cannot manage what you do not measure. For every external dependency, track:
Availability rate — percentage of calls that succeed (non-5xx) per rolling window
Latency percentiles — p50, p95, p99 per endpoint, trended over time
Circuit breaker state — how often each breaker opens, how long it stays open, and how quickly it recovers
Fallback activation rate — how often you are serving stale cache or degraded responses (if this number is high, the dependency is chronically unreliable)
Error budget consumption by dependency — which dependency is burning your SLO budget? If 80% of your error budget is consumed by one third-party API, that is actionable intelligence for a vendor conversation or a multi-provider strategy
Cross-chapter connection: Building dashboards that track dependency health, setting alerts on circuit breaker state changes, and correlating dependency latency with your own SLI degradation all require robust observability infrastructure. See Caching & Observability for how to instrument these metrics, build dependency health dashboards, and set up meaningful alerts that catch degradation before your users do.
Interview Question: A critical third-party API (e.g., Stripe) starts responding with 50% error rates but is not fully down. Your circuit breaker has not tripped because it is configured for consecutive failures. What do you do?
What they are really testing: Whether you understand that circuit breakers configured for consecutive failures miss intermittent degradation, and whether you can reason about circuit breaker tuning in a nuanced way.
Strong answer framework:
Diagnose the gap. The circuit breaker is configured to trip after N consecutive failures. With a 50% error rate, successes are interleaved with failures often enough to reset the consecutive-failure counter before it reaches N. The breaker rarely or never trips, but half your users are getting errors.
Switch to a sliding-window error-rate threshold. Instead of (or in addition to) consecutive failures, configure the circuit breaker to trip when the error rate exceeds a threshold over a time window — e.g., “if more than 30% of requests in the last 60 seconds fail, trip the breaker.” Libraries like Resilience4j support this with a slidingWindowType: COUNT_BASED or TIME_BASED configuration.
Activate the fallback immediately. While you retune the breaker, use a feature flag or manual override to force the fallback path (queue payments for retry, switch to a backup provider). Do not wait for the automated system to catch up — this is a case for human judgment.
After the incident. Review all circuit breaker configurations. Ensure they use error-rate thresholds, not just consecutive-failure thresholds. Add an alert for “dependency error rate > X% but circuit breaker still closed” to catch this gap in the future.
Common mistake: Leaving the circuit breaker in its default configuration and assuming it handles all failure modes. Intermittent degradation is harder to detect than total outage and often causes more cumulative damage.
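To make the gap concrete, the sketch below replays the same sequence of outcomes through a consecutive-failure trigger and a count-based sliding-window trigger. Both classes are simplified illustrations, not Resilience4j's API, and the thresholds (5 consecutive failures, 30% over the last 20 calls, minimum 10 calls) are assumptions.

```python
from collections import deque
from typing import Iterable, Tuple


class ConsecutiveFailureTrigger:
    """Trips only after `threshold` failures in a row."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.streak = 0
        self.tripped = False

    def record(self, success: bool) -> None:
        self.streak = 0 if success else self.streak + 1
        if self.streak >= self.threshold:
            self.tripped = True


class SlidingWindowTrigger:
    """Trips when the failure rate over the last `window_size` calls exceeds a threshold."""

    def __init__(self, window_size: int, failure_rate_threshold: float,
                 minimum_calls: int) -> None:
        self.outcomes = deque(maxlen=window_size)  # oldest outcomes fall off automatically
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls
        self.tripped = False

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) < self.minimum_calls:
            return  # too few samples to judge
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        if failure_rate > self.failure_rate_threshold:
            self.tripped = True


def replay(outcomes: Iterable[bool]) -> Tuple[bool, bool]:
    """Return (consecutive_tripped, windowed_tripped) for a sequence of call outcomes."""
    consecutive = ConsecutiveFailureTrigger(threshold=5)
    windowed = SlidingWindowTrigger(window_size=20, failure_rate_threshold=0.30,
                                    minimum_calls=10)
    for ok in outcomes:
        consecutive.record(ok)
        windowed.record(ok)
    return consecutive.tripped, windowed.tripped
```

Replaying strictly alternating success and failure (a 50% error rate) leaves the consecutive trigger closed while the windowed trigger trips once it has seen its minimum number of calls.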
Diagnose from Signals: Resilience Pattern Failures
The Alert Says: Circuit breaker for payment-service transitioned from CLOSED to OPEN. Fallback activated: payments being queued for retry. Queue depth: 340 and growing. Error rate on checkout endpoint: 0% (fallback is working). Time: 22:15 UTC.
The circuit breaker is doing exactly what it should. But “working as designed” does not mean “no action needed.” The queued payments represent real revenue at risk, and every minute the breaker stays open increases the retry backlog.
Triage in 5 minutes:
Check the dependency’s status. Is payment-service truly down, or did the breaker trip on transient errors? Check the payment service’s health endpoint directly: curl -s https://payment-service.internal/health. Check Stripe’s status page (status.stripe.com) if the payment service calls Stripe. A 50% error rate that triggered the breaker may be a transient network issue that is already resolving.
Check the half-open probe results. If the breaker is in OPEN state and the recovery timeout has not elapsed, you are waiting blind. Consider manually transitioning the breaker to HALF-OPEN (if your circuit breaker library supports runtime override) to test recovery sooner.
Monitor the fallback queue. 340 queued payments at an $80 average order value = $27,200 in unprocessed revenue. If the queue has a TTL or max size, payments could be dropped. Verify: does the queue have a DLQ? What is the max retention? Is the retry consumer idempotent (critical for payments)?
Assess the blast radius. Is only checkout affected, or are other services that depend on payment-service also degraded? Check the service mesh for other consumers of the payment service.
Metric that proves resolution: Circuit breaker transitions from OPEN to HALF-OPEN to CLOSED. The queued payments drain (queue depth returns to 0). All queued payments are successfully processed (verify by comparing queue drain count with payment confirmations). No duplicate charges (idempotency keys working correctly).
Rollback trigger: If the fallback queue exceeds 1,000 messages or the breaker has been OPEN for more than 15 minutes, escalate to Sev1 and activate the multi-provider failover (switch to backup payment processor if available).
The Alert Says: Retry rate on inventory-service calls jumped from 2% to 38%. Circuit breaker is still CLOSED (has not tripped). P99 latency on the product page increased from 180ms to 1.4s. No error rate increase on the product page endpoint.
This is the dangerous middle ground: the dependency is degraded but not failed. The circuit breaker has not tripped because requests are eventually succeeding (after retries). But the retries are consuming time and resources, inflating latency.
Triage in 5 minutes:
Check the retry pattern. 38% retry rate with eventual success means the first attempt fails but the second or third succeeds. This is a sign of intermittent degradation — possibly a single unhealthy instance behind the dependency’s load balancer, a network partition affecting some routes, or the dependency experiencing garbage collection pauses.
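Calculate the latency impact. If the base call takes 50ms and a retry adds 100ms (50ms backoff + 50ms second attempt), each request that retries once takes about 100ms longer, and a request that retries twice takes at least 200ms longer. That overhead alone does not account for a P99 jump from 180ms to 1.4s, which suggests the failing attempts are slow (running close to their timeout) rather than failing fast. Check the per-attempt latency distribution to confirm.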
Consider adjusting the circuit breaker. If the breaker is configured for consecutive failures, intermittent degradation resets the failure counter on every success. Switch to a sliding-window error-rate threshold: “trip if more than 30% of requests in the last 60 seconds fail.” This catches exactly this scenario.
Check if retries are making the dependency worse. If 38% of requests fail and each failure triggers 2 retries, you are sending 1.76x the original request volume to an already-degraded service. This is a retry storm — your retries are contributing to the degradation. Consider reducing the max retry count from 3 to 1, or adding a retry budget (max 20% of requests can be retries).
Metric that proves resolution: Retry rate drops below 5%. P99 latency returns to baseline. Secondary check: Verify the dependency’s health improved — not just that you stopped retrying. If you reduced retries and the dependency is still degraded, you moved from “slow” to “erroring” which may be worse for users.
Rollback trigger: If retry rate exceeds 40% for more than 5 minutes, activate the circuit breaker manually (force OPEN state) and serve the cached/degraded fallback. Better to serve stale data than to let retry storms drag down both your service and the dependency.
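The 1.76x figure from the triage steps and the retry-budget idea can be sketched as follows. The arithmetic assumes every failed request triggers a fixed number of retries; the `RetryBudget` class is a hypothetical illustration that measures retries against original requests.

```python
def load_multiplier(failure_rate: float, retries_per_failure: int) -> float:
    """Request volume sent to the dependency, relative to the original traffic."""
    return 1.0 + failure_rate * retries_per_failure


class RetryBudget:
    """Allows a retry only while retries stay within a fraction of original requests."""

    def __init__(self, max_retry_ratio: float = 0.20) -> None:
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        if self.requests == 0:
            return False
        if (self.retries + 1) / self.requests > self.max_retry_ratio:
            return False  # budget exhausted: fail fast instead of retrying
        self.retries += 1
        return True
```

With a 38% failure rate and 2 retries per failure, `load_multiplier(0.38, 2)` gives about 1.76; under a 20% budget the same traffic is capped at roughly 1.2x. A production version would track both counters over a rolling window rather than for the process lifetime.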
After deploying any resilience pattern (circuit breaker, retry, bulkhead, fallback), the pattern is not “done” until it has been validated, tuned, and operationalized. Most resilience patterns fail in production not because they were implemented wrong, but because they were never tuned after deployment.
Resilience Pattern Aftercare Checklist
For every circuit breaker deployed:
Verify it has been tested in a non-production environment by manually failing the dependency and confirming the breaker trips and the fallback activates.
Verify the fallback response is correct (not a default error, not empty data, not stale beyond acceptable limits).
The breaker’s state transitions (CLOSED to OPEN, OPEN to HALF-OPEN, HALF-OPEN to CLOSED) are emitted as metrics and visible on a dashboard.
An alert fires when the breaker transitions to OPEN (this is a signal that a dependency is degraded, even if your service is handling it gracefully).
Threshold tuning: the failure threshold, recovery timeout, and half-open probe count have been calibrated based on observed dependency behavior, not left at library defaults.
For every retry policy deployed:
Verify that the retried operation is idempotent. If it is not, retries can cause duplicate side effects (double charges, duplicate messages, duplicate database rows).
Verify that retry backoff includes jitter (not just exponential delay) to prevent synchronized retry storms.
Monitor retry rate as a metric. A baseline retry rate above 5% indicates a chronically unreliable dependency that needs investigation, not just retries.
Set a retry budget: cap the percentage of total requests that can be retries (e.g., max 20%). This prevents retry storms from amplifying load on a degraded dependency.
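One common way to combine exponential backoff with jitter is "full jitter": pick a random delay between zero and the exponential ceiling. The sketch below assumes a 100ms base and a 5-second cap; the function name and defaults are illustrative.

```python
import random


def backoff_delay_s(attempt: int, base_s: float = 0.1, cap_s: float = 5.0,
                    rng: random.Random = random.Random()) -> float:
    """Delay before retry number `attempt` (0-based), using full jitter."""
    ceiling = min(cap_s, base_s * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)               # jitter spreads clients apart
```

Because each client draws its own random delay, clients that failed at the same moment do not retry at the same moment.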
For every fallback deployed:
The fallback is tested periodically (not just at deployment time). Feature flag systems can force-activate the fallback in staging weekly to verify it still returns valid data.
The fallback data source (cache, static file, pre-computed list) has a freshness guarantee. A “Popular Products” fallback from 6 months ago is worse than no fallback.
The fallback activation is logged and metricked. If the fallback activates 500 times per day, something is wrong with the primary path — the fallback should be the exception, not the norm.
For every bulkhead deployed:
Thread pool sizes are based on measured dependency latency and throughput, not guesses. Use Little’s Law: pool_size = requests_per_second * average_latency_seconds.
Pool utilization is metricked. An alert fires when any pool exceeds 80% utilization.
Pool exhaustion is tested: simulate a slow dependency and verify that only the isolated pool is affected while other pools continue operating normally.
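The Little's Law sizing rule can be sketched as a one-line calculation. The `headroom` factor is an added assumption to absorb bursts above the average; the law itself only gives average concurrency.

```python
import math


def pool_size(requests_per_second: float, average_latency_s: float,
              headroom: float = 1.0) -> int:
    """Little's Law: concurrent requests = arrival rate x average time in system."""
    return math.ceil(requests_per_second * average_latency_s * headroom)
```

For example, 40 requests per second at 0.5s average latency occupies about 20 threads on average; with a headroom factor of 1.5 the pool would be 30. Recompute when measured latency or throughput shifts.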
Redundancy at every layer: multiple application instances, database replicas, multi-zone deployment. No single point of failure. Automated failover.
The HA checklist:
Application: multiple instances behind a load balancer, health checks, graceful shutdown
Cache: Redis Sentinel or Redis Cluster for automatic failover
DNS: multiple providers, low TTL for fast failover
Load balancer: managed (cloud) or active-passive pair
Secrets: replicated secret store (Vault with HA backend)
Each layer must answer: what happens when this component fails? How quickly does failover occur? Is it automatic or manual?
Interview Question: How would you design a system that survives an entire availability zone failure?
Strong answer: Deploy across at least 3 availability zones. Application instances spread across zones behind a zone-aware load balancer. Database primary in one zone with synchronous replicas in other zones; automated failover promotes a replica. Stateless application instances so any zone can handle any request. Cache warmed in each zone (or a distributed cache like Redis Cluster spanning zones). All dependent services must also be multi-zone. Test regularly by simulating zone failure.
The key insight: multi-zone is about eliminating single points of failure at the infrastructure level, not just the application level.
The One Thing to Remember: High availability is not a feature you add — it is a property that emerges from eliminating single points of failure at every layer. If you have HA at the application layer but a single database with no replica, you do not have HA.
Cross-chapter connection — Consensus and HA: Automated failover (e.g., promoting a database replica when the primary dies) is fundamentally a consensus problem. How do the remaining nodes agree on who the new leader is without a split-brain scenario where two nodes both think they are primary? This is exactly what Raft and Paxos solve. See Distributed Systems Theory for how leader election works under the hood — understanding this is the difference between configuring HA and actually knowing why it works (or fails during a network partition).
Cross-chapter connection — Cloud infrastructure for HA: The practical implementation of multi-AZ and multi-region architectures — including how AWS Availability Zones map to physical data centers, how Route 53 health checks enable DNS failover, and how S3 cross-region replication protects data durability — is covered in detail in Cloud Service Patterns. That chapter turns the abstract “deploy across zones” advice into concrete service configurations.
Recovery Time Objective (RTO): How long can the system be down? A 1-hour RTO means you must restore service within 1 hour of failure.
Recovery Point Objective (RPO): How much data can you lose? A 5-minute RPO means you must be able to restore data to within 5 minutes of the failure. RPO determines backup frequency and replication lag tolerance.
Many teams define RTO/RPO but never test them. Run disaster recovery drills regularly. The only way to know if your recovery process works in 1 hour is to actually do it under pressure.
The One Thing to Remember: RTO answers “how long can we be down?” RPO answers “how much data can we lose?” These two numbers, more than anything else, determine your entire backup, replication, and disaster recovery architecture. Get them from the business before you design the system.
| Strategy | RTO | Cost | How it works |
| --- | --- | --- | --- |
| Backup and restore | | Low | Back up to another region, restore when needed |
| Pilot light | 10-30 minutes | Low-Medium | Keep core infra running (DB replica), spin up compute on demand |
| Warm standby | Minutes | Medium-High | Scaled-down full system in secondary region, faster failover |
| Multi-region active-active | Seconds | Highest | Full system in multiple regions simultaneously, requires data sync and conflict resolution |
The One Thing to Remember: Pick the DR strategy that matches your RTO and budget — not the one that sounds most impressive. Most services are perfectly well-served by pilot light or warm standby. Multi-region active-active is the right answer for maybe 5% of services and the wrong answer for the other 95%.
Cross-chapter connection — Multi-region and distributed consistency: Multi-region active-active is a disaster recovery strategy, but it is also a distributed systems problem. When you have write-capable replicas in two regions, you must handle conflict resolution — what happens when the same record is updated in both regions simultaneously? The options (last-write-wins, CRDTs, application-level conflict resolution) are rooted in the CAP theorem and consistency models covered in Distributed Systems Theory. The practical cloud infrastructure for implementing multi-region — including DynamoDB Global Tables, Aurora Global Database, S3 cross-region replication, and Route 53 failover routing — is covered in Cloud Service Patterns.
Further reading:
AWS, Disaster Recovery of Workloads on AWS (Whitepaper): the comprehensive AWS whitepaper covering all four DR strategies (backup/restore, pilot light, warm standby, multi-region active-active) with architecture diagrams, cost comparisons, and implementation guidance. The best single resource for understanding DR trade-offs in cloud environments.
AWS Well-Architected Framework, Reliability Pillar: broader guidance on designing for failure, including multi-AZ, multi-region, and data backup strategies.
The Alert Says: RDS Multi-AZ failover completed automatically. Primary database endpoint switched from us-east-1a to us-east-1b. Application error rate spiked to 12% for 45 seconds then recovered. Connection pool metrics show 20 connections dropped and re-established. Time: 04:22 UTC.
An automated failover is working as designed, but the 45-second error spike means your application did not handle it gracefully. In a true HA system, failover should be invisible or nearly invisible to users.
Triage in 5 minutes:
Check what caused the failover. RDS failovers trigger for several reasons: hardware failure on the primary, network disruption in the primary AZ, storage failure, or a manual reboot/modification. Check RDS events: aws rds describe-events --source-identifier <db-identifier> --source-type db-instance. Understanding the cause determines whether this is a one-off (hardware failure) or recurring (storage IOPS exhaustion causing periodic unresponsiveness).
Analyze the 45-second error window. The application’s connection pool lost its connections when the primary died. The pool then attempted to reconnect, but the DNS endpoint had not yet resolved to the new primary. The gap between failover initiation and DNS propagation is typically 15-45 seconds for RDS Multi-AZ. During this window, every database call fails.
Check for a write-after-failover consistency issue. RDS Multi-AZ uses synchronous replication, so the standby should have all committed transactions. But if the application had in-flight transactions at failover time, those transactions are lost: uncommitted work is discarded when the standby is promoted, and the application may not know whether a commit issued at the moment of failover actually succeeded. Verify: are there any “orphaned” states in the application (orders marked “processing” that were never completed)?
Metric that proves resolution: Error rate returned to baseline (<0.1%). Connection pool re-established fully. No data inconsistencies found.
Reducing the blast radius for next time:
Connection pool retry with backoff. Configure the pool to retry failed connection attempts with 1-second intervals for up to 60 seconds before failing the request. HikariCP: connectionTimeout=60000. This absorbs the DNS propagation delay.
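Health check grace period. Give pods roughly 60 seconds of tolerance before database connectivity failures mark them unhealthy. Kubernetes: raise failureThreshold on the readiness (and liveness) probe so that failureThreshold × periodSeconds covers the failover window. A startup probe does not help here, because it only runs while a container is starting, not during a failover that hits an already-running pod.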
Application-level retry for idempotent operations. Wrap database calls in a retry that catches connection-reset errors and retries once after a 2-second delay. This covers the window where the pool is reconnecting.
Test failover quarterly. Trigger a manual RDS failover during business hours (aws rds reboot-db-instance --db-instance-identifier <id> --force-failover) and measure the actual user impact. A failover you have never tested is a failover you do not understand.
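The application-level retry from the list above can be sketched as a small wrapper. The function name and defaults are illustrative assumptions, and the built-in ConnectionResetError and BrokenPipeError stand in for the driver-specific exception your database client raises on a dropped connection. It is only safe around idempotent operations.

```python
import time


def retry_once_on_connection_error(operation, delay_seconds=2.0, sleep=time.sleep):
    """Run an idempotent operation; on a dropped connection, wait and try once more."""
    try:
        return operation()
    except (ConnectionResetError, BrokenPipeError):
        # The pool is probably reconnecting to the new primary; give it a moment.
        sleep(delay_seconds)
        return operation()  # a second failure propagates to the caller
```

A single delayed retry is deliberately modest: it covers the reconnect window without turning a prolonged outage into a pile of blocked requests.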
After every failover event (automated or manual), run this checklist:
Immediate (within 1 hour):
Verify the failover completed successfully. The primary endpoint is now pointing to the promoted standby.
Verify data integrity. Run a checksum or count comparison on critical tables between the last known backup and the current state. For RDS Multi-AZ (synchronous replication), data loss should be zero — but verify anyway.
Verify a new standby has been provisioned. RDS automatically provisions a new standby after failover, but this can take 10-30 minutes. Until the new standby is ready, you have no failover capability: a second failure in this window cannot be absorbed by another failover and means an outage until the instance recovers or is restored from backup.
Verify application health. All pods are in Ready state. Connection pools are full. Error rates are back to baseline. No stale connections hanging from the pre-failover primary.
Within 24 hours:
Root-cause the failover trigger. Was it a hardware failure (expected, no action needed beyond monitoring), an AZ issue (check AWS Health Dashboard, consider multi-region), or application-induced (e.g., the application overwhelmed the primary with queries, causing unresponsiveness)?
Review the error window duration. Was the 45-second error spike acceptable per your SLO? If your SLO is 99.95% (21.6 minutes/month), a 45-second spike consumed 3.5% of your monthly budget on this single event. If you expect 2-3 failovers per year, is that budget consumption acceptable?
Update the DR runbook with the actual timeline and steps from this event. Actual experience always reveals gaps in the documented procedure.
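The error-budget arithmetic in the review step can be made explicit. This sketch assumes a 30-day window; the error_fraction parameter is an added assumption for teams whose SLO is request-based rather than time-based, and it treats traffic as uniform across the window.

```python
def error_budget_seconds(slo, window_days=30):
    """Seconds of allowed unavailability in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60 * 60


def budget_consumed(outage_seconds, slo, error_fraction=1.0, window_days=30):
    """Fraction of the error budget consumed by one event.

    error_fraction=1.0 counts the whole window as downtime (time-based SLO);
    a lower value weights the window by the share of requests that failed.
    """
    return outage_seconds * error_fraction / error_budget_seconds(slo, window_days)
```

For a 99.95% SLO the budget is about 1,296 seconds (21.6 minutes), so a 45-second window counted as full downtime uses roughly 3.5% of it. Weighted by the 12% error rate from the alert, the same event uses roughly 0.4%. Which number applies depends on how your SLO is defined.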
Within 1 week:
If the error window exceeded 30 seconds, implement the connection pool retry and health check grace period improvements described above.
If this was the first failover the team has experienced, schedule a quarterly failover drill. The second failover is always smoother than the first because the team knows what to expect.
Review your RPO. The failover had zero data loss (synchronous replication). But what about your other data stores? Redis cache was warm on the old primary — is it cold on the new one? Elasticsearch indexes, Kafka consumer offsets, application-level caches — all need to be verified or rebuilt after failover.
These resources represent the best thinking on reliability engineering, incident response, and building resilient systems. Organized from foundational to advanced.
Essential Reading — Start Here
Google SRE Book (free online) — The foundational text. Chapters on SLOs, error budgets, toil, and incident response are required reading for any engineer working on production systems. Read chapters 1-4 and 28 first, then explore based on your area of focus.
“How Complex Systems Fail” by Richard Cook — A short paper (only 4 pages) originally written about medical systems, but every sentence applies to software. Its core insight: complex systems are always running in a partially broken state, and safety is a property of the whole system, not individual components. This paper will change how you think about incidents.
Charity Majors’ Blog on Observability — The best writing on observability, SLOs, and what it actually means to operate software in production. Start with her posts on “observability vs monitoring” and “SLOs are the API for your engineering organization.” She cuts through buzzwords with unusual clarity.
Intermediate — Deepening Your Practice
Netflix Tech Blog — Tagged: Chaos Engineering — First-hand accounts from the team that invented chaos engineering. Their posts on Chaos Monkey, the Simian Army, and failure injection testing are essential reading for understanding how to build confidence in distributed systems.
AWS Architecture Blog — Resilience — Detailed write-ups on resilience patterns (retry, circuit breaker, bulkhead, cell-based architecture) with AWS-specific implementation details. Particularly valuable for understanding how cloud-native resilience differs from traditional HA approaches.
Gergely Orosz’s “The Pragmatic Engineer” on Incidents — Orosz writes about engineering culture at scale, and his coverage of major incidents (including the Facebook BGP outage and Cloudflare’s outages) provides the organizational and human context that purely technical write-ups miss. His analysis of how companies handle postmortems is especially valuable.
Advanced — Becoming a Reliability Leader
The Site Reliability Workbook (free online) — The practical companion to the SRE book. Where the SRE book explains the philosophy, the Workbook shows implementation with real examples, sample SLO documents, error budget policies, and on-call procedures.
Release It! by Michael Nygard (2nd edition) — War stories from production systems combined with stability patterns (circuit breaker, bulkhead, timeout) and anti-patterns (cascading failure, blocked threads, unbounded result sets). The narrative style makes it both instructive and entertaining.
Learning from Incidents in Software — A community and collection of resources applying resilience engineering and human factors research to software operations. Goes beyond “what broke” to examine how organizations learn (or fail to learn) from incidents.
Coupling and Cohesion — The two most important metrics of code quality. Coupling measures how much one module depends on another — low coupling means changing module A rarely requires changing module B. Cohesion measures how related the responsibilities within a module are — high cohesion means everything in a module serves one purpose. The goal: high cohesion within modules, low coupling between modules. Every principle in this chapter (SOLID, DRY, SoC) is a strategy for achieving this goal.
A class has one reason to change. Not “a class does one thing”; it means “a class serves one actor/stakeholder.” If the Invoice class changes when the accounting rules change AND when the PDF rendering changes, it has two reasons to change. Split it: InvoiceCalculator (accounting logic) and InvoiceRenderer (PDF generation). Group things that change together, separate things that change for different reasons.
Code smells it prevents: Divergent Change and Shotgun Surgery. Divergent Change is one class that keeps being edited for unrelated reasons because it handles too many concerns. Shotgun Surgery is the mirror image: a single business change (e.g., “add a discount field”) forces you to modify 5 different files because one responsibility is scattered across them. In both cases the boundaries do not follow the reasons for change. The fix is not “make smaller classes”; it is “group things that change for the same reason.”
BAD - one class with two reasons to change:
class Invoice:
    def calculate_total(self, items, tax_rate):
        # Accounting logic: changes when tax rules change
        subtotal = sum(item.price * item.qty for item in items)
        return subtotal + (subtotal * tax_rate)

    def generate_pdf(self, invoice_data):
        # Rendering logic: changes when PDF layout changes
        pdf = PDFDocument()
        pdf.add_header("Invoice")
        pdf.add_table(invoice_data)
        return pdf.render()
GOOD — separated by stakeholder:
class InvoiceCalculator:
    """Changes only when accounting/tax rules change."""

    def calculate_total(self, items, tax_rate):
        subtotal = sum(item.price * item.qty for item in items)
        return subtotal + (subtotal * tax_rate)


class InvoiceRenderer:
    """Changes only when PDF presentation changes."""

    def generate_pdf(self, invoice_data):
        pdf = PDFDocument()
        pdf.add_header("Invoice")
        pdf.add_table(invoice_data)
        return pdf.render()
Open for extension, closed for modification. When a new payment method is added, you should add a new class (e.g., ApplePayProcessor), not modify existing code. Strategy pattern and polymorphism enable this.
Code smell it prevents: the ever-growing if/elif chain. When adding a new feature means modifying existing, working, tested code (adding another elif branch to a function that already has 12 branches), OCP is being violated. Every modification to working code risks introducing a regression. The fix: design so that new behavior is added by creating new classes, not editing old ones.
BAD - modifying existing code for every new payment method:
class PaymentProcessor:
    def process(self, payment):
        if payment.method == "stripe":
            # Stripe-specific logic
            stripe.charge(payment.amount, payment.token)
        elif payment.method == "paypal":
            # PayPal-specific logic
            paypal.create_payment(payment.amount, payment.email)
        elif payment.method == "apple_pay":
            # Every new method = modifying this growing if/elif chain
            apple.authorize(payment.amount, payment.device_token)
GOOD — extend by adding new classes:
from abc import ABC, abstractmethod


class PaymentProcessor(ABC):
    @abstractmethod
    def process(self, payment): ...


class StripeProcessor(PaymentProcessor):
    def process(self, payment):
        stripe.charge(payment.amount, payment.token)


class PayPalProcessor(PaymentProcessor):
    def process(self, payment):
        paypal.create_payment(payment.amount, payment.email)


# Adding Apple Pay = adding a new class, no existing code modified
class ApplePayProcessor(PaymentProcessor):
    def process(self, payment):
        apple.authorize(payment.amount, payment.device_token)
The pragmatic reality: OCP is aspirational, not absolute. Small, contained modifications are fine. OCP matters most for code that changes frequently and has many consumers.
Subtypes must be substitutable for their base types without breaking behavior.
Code smell it prevents: isinstance checks and surprise side effects. When you see code littered with if isinstance(obj, SpecificSubclass) before calling methods, or when a subclass method silently changes behavior that callers depend on, LSP is being violated. The contract of the base type is broken, and downstream code cannot trust polymorphism anymore.
BAD - classic violation (Square extends Rectangle):
class Rectangle:
    def __init__(self, width, height):
        self._width = width
        self._height = height

    def set_width(self, w):
        self._width = w  # Only changes width

    def set_height(self, h):
        self._height = h  # Only changes height

    def area(self):
        return self._width * self._height


class Square(Rectangle):
    def set_width(self, w):
        self._width = w
        self._height = w  # SURPRISE: also changes height

    def set_height(self, h):
        self._width = h  # SURPRISE: also changes width
        self._height = h


# Code that expects Rectangle behavior breaks:
def test_area(rect: Rectangle):
    rect.set_width(5)
    rect.set_height(4)
    assert rect.area() == 20  # FAILS for Square (returns 16)
GOOD — model shapes without misleading inheritance:
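One possible shape of that fix, as a minimal sketch: Rectangle and Square become siblings under a shared Shape abstraction that promises only area(). The Shape base class, the immutable dataclasses, and total_area are illustrative choices, not the only way to restore substitutability.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Shape(ABC):
    """The only contract callers may rely on: every shape has an area."""

    @abstractmethod
    def area(self) -> float: ...


@dataclass(frozen=True)
class Rectangle(Shape):
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


@dataclass(frozen=True)
class Square(Shape):
    side: float

    def area(self) -> float:
        return self.side * self.side


def total_area(shapes):
    """Works for any Shape; no isinstance checks needed."""
    return sum(shape.area() for shape in shapes)
```

Because Square no longer claims to be a Rectangle, no caller can rely on independent width and height setters that a square cannot honor. Making the shapes immutable removes the setter contract entirely.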
Split large interfaces into focused ones.
Code smell it prevents: NotImplementedError and dead methods. When a class is forced to implement methods it cannot support (raising NotImplementedError or returning None for methods that do not apply), the interface is too fat. Callers cannot trust the interface because some methods are traps. The fix: split the interface so each implementor only promises what it can deliver.
BAD - fat interface forces unused implementations:
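A minimal sketch of the smell and its fix, using a hypothetical office-device domain (all class names here are invented for illustration). The first half shows a fat interface whose implementor has to fake a method; the second half splits it so each class promises only what it can do.

```python
from abc import ABC, abstractmethod


# BAD: one fat interface; a print-only device must fake scan().
class MultiFunctionDevice(ABC):
    @abstractmethod
    def print_document(self, doc): ...

    @abstractmethod
    def scan(self): ...


class LegacyPrinter(MultiFunctionDevice):
    def print_document(self, doc):
        return f"printed {doc}"

    def scan(self):
        raise NotImplementedError("this device has no scanner")  # a trap for callers


# GOOD: focused interfaces; implementors opt in to each capability.
class Printer(ABC):
    @abstractmethod
    def print_document(self, doc): ...


class Scanner(ABC):
    @abstractmethod
    def scan(self): ...


class SimplePrinter(Printer):
    def print_document(self, doc):
        return f"printed {doc}"


class OfficeCopier(Printer, Scanner):
    def print_document(self, doc):
        return f"printed {doc}"

    def scan(self):
        return "scanned page"
```

Code that needs scanning now asks for a Scanner, so a print-only device can never be passed to it by mistake.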
Depend on abstractions, not concrete implementations.
Code smell it prevents: untestable code and vendor lock-in. When unit tests require spinning up a real database, real Stripe API, or real email server because the class directly instantiates concrete clients, DIP is being violated. The fix: inject an abstraction. You get testability (mock the interface) and flexibility (swap implementations) for free.
BAD - high-level module depends on low-level concrete class:
class OrderService:
    def __init__(self):
        self.payment = StripeClient()  # Hardcoded dependency

    def checkout(self, order):
        self.payment.charge(order.total)  # Can't swap, can't test
GOOD — depend on abstractions:
from abc import ABC, abstractmethod


class PaymentGateway(ABC):
    @abstractmethod
    def charge(self, amount): ...


class OrderService:
    def __init__(self, payment: PaymentGateway):  # Injected abstraction
        self.payment = payment

    def checkout(self, order):
        self.payment.charge(order.total)


# Easy to swap: StripeGateway, AdyenGateway, MockGateway for tests
This enables testing (mock the interface) and flexibility (swap Stripe for Adyen without changing OrderService).
A notification service originally had one class that decided, formatted, and delivered. Adding Slack and SMS meant modifying the core class every time. Refactored with a NotificationChannel interface (ISP, DIP), separate implementations for Email/Slack/SMS, and a NotificationRouter (SRP). Adding a new channel means adding a class, not modifying one (OCP).
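The refactored notification design might look roughly like this. It is a sketch of the structure described above, not the actual service: the method names, the string return values, and the register method are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class NotificationChannel(ABC):
    """Focused interface: a channel only knows how to deliver a message."""

    @abstractmethod
    def send(self, recipient, message): ...


class EmailChannel(NotificationChannel):
    def send(self, recipient, message):
        return f"email to {recipient}: {message}"


class SlackChannel(NotificationChannel):
    def send(self, recipient, message):
        return f"slack to {recipient}: {message}"


class NotificationRouter:
    """Decides which channel handles a notification; knows no channel details."""

    def __init__(self, channels):
        self._channels = dict(channels)

    def register(self, name, channel):
        self._channels[name] = channel

    def notify(self, channel_name, recipient, message):
        return self._channels[channel_name].send(recipient, message)


# Adding SMS later is a new class plus one registration; the router is untouched.
class SmsChannel(NotificationChannel):
    def send(self, recipient, message):
        return f"sms to {recipient}: {message}"
```

The router depends only on the NotificationChannel abstraction, so tests can pass in a fake channel without touching email, Slack, or SMS infrastructure.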
The One Thing to Remember: SOLID is not about following rules for their own sake — it is about making code that is cheap to change. The test: when the next feature request arrives, do you add new code or rewrite existing code? If you are always rewriting, SOLID violations are the likely cause.
Cross-chapter connection — SOLID and Design Patterns: SOLID principles are the why; design patterns are the how. OCP says “open for extension, closed for modification” — the Strategy pattern, Observer pattern, and Plugin architecture are concrete implementations of that principle. DIP says “depend on abstractions” — Dependency Injection, Factory pattern, and Repository pattern are the mechanisms that make it practical. ISP says “segregated interfaces” — the Adapter and Facade patterns help you present focused interfaces to consumers while wrapping complex implementations. If SOLID tells you what good design looks like, the Design Patterns chapter gives you the toolkit to build it. Read them as companions: SOLID without patterns is philosophy without tools; patterns without SOLID is technique without judgment.
Cross-chapter connection: SOLID principles directly affect reliability. Code that violates SRP (one class doing everything) is harder to test, harder to reason about during incidents, and harder to deploy safely. Well-structured code is more testable (Testing chapter) and easier to observe in production (Observability chapter).
DRY: Single authoritative representation of each piece of knowledge. But DRY is about duplicate knowledge, not duplicate code. Two functions with identical code but different concepts should not share an abstraction — that creates coupling.
Wrong Abstraction. Premature DRY is worse than duplication. If you abstract too early, you create the wrong abstraction, and changing it later is harder than duplicating. “Duplication is far cheaper than the wrong abstraction” — Sandi Metz.
WET stands for “Write Everything Twice” (or “We Enjoy Typing”). The conventional wisdom is that DRY is always good. The nuance: premature DRY creates coupling between things that should evolve independently.
Example - premature DRY that hurts:
# Two teams share a "validate_input" function because the code looks identical
def validate_input(data, context):
    if context == "user_registration":
        # registration-specific rules creep in
        ...
    elif context == "order_placement":
        # order-specific rules creep in
        ...
    # This function grows into a god function with branching for every caller
The registration validation and order validation looked the same initially, so someone DRY-ed them up. But they change for different reasons (different stakeholders, different compliance rules). Now every change to one requires careful testing of the other. The “shared” function becomes a liability.
The Rule of Three: Tolerate duplication until you see the same pattern three times. By the third occurrence, the correct abstraction usually becomes clear. Two occurrences can be coincidence.
KISS: Choose the simplest solution that works. Complexity has a cost in development speed, bugs, and onboarding.
YAGNI: Do not build for hypothetical future requirements. Build now, refactor when real requirements emerge.
The One Thing to Remember: DRY is about eliminating duplicate knowledge, not duplicate code. Two functions with identical code that serve different business domains should stay separate. “Duplication is far cheaper than the wrong abstraction” (Sandi Metz) is one of the most important sentences in software engineering.
11.3 Separation of Concerns, Cohesion, and Coupling
Cohesion measures how related the responsibilities within a module are. High cohesion: an EmailService that handles composing, sending, and tracking emails. Low cohesion: a Utils class with string formatting, date parsing, and HTTP helpers (unrelated responsibilities dumped together).
Coupling measures how much one module depends on another. Loose coupling: OrderService publishes an OrderPlaced event; EmailService subscribes and sends a confirmation, and neither knows about the other. Tight coupling: OrderService directly calls EmailService.sendOrderConfirmation(order), so changing the email service’s interface requires changing the order service.
The goal at every level: High cohesion within modules (everything serves one purpose), loose coupling between modules (changes in one rarely require changes in another). This applies to functions, classes, packages, services, and entire systems. When you feel a change “rippling” through many files, coupling is too high. When you struggle to understand where code belongs, cohesion is too low.
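The loosely coupled OrderService/EmailService arrangement can be sketched with a tiny in-process event bus. The EventBus class, method names, and event payload are assumptions for illustration; a real system would usually put a message broker where the bus is.

```python
from collections import defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)


class OrderService:
    """Knows about the bus and the event, not about who listens."""

    def __init__(self, bus):
        self._bus = bus

    def place_order(self, order_id, email):
        self._bus.publish("OrderPlaced", {"order_id": order_id, "email": email})


class EmailService:
    """Reacts to OrderPlaced without any reference to OrderService."""

    def __init__(self, bus):
        self.sent = []
        bus.subscribe("OrderPlaced", self.send_confirmation)

    def send_confirmation(self, event):
        self.sent.append(
            f"Order {event['order_id']} confirmation sent to {event['email']}"
        )
```

The remaining coupling is the event's shape: both sides depend on the OrderPlaced payload, which is a narrower and more stable contract than a direct method call.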
The One Thing to Remember: The litmus test for good architecture: “How many files do I need to change to add this feature?” If the answer is consistently “just one module,” you have high cohesion and low coupling. If the answer is “six files across three services,” your boundaries are wrong.
Further reading:
Martin Fowler, Coupling and Cohesion: a concise treatment of the relationship between these two properties and why they should be considered together, not separately.
Structured Design by Larry Constantine and Ed Yourdon: the original text that formalized coupling and cohesion as measurable design properties. The terminology has endured for 50 years because the concepts are that fundamental.
Track it explicitly, quantify its impact (“this adds 2 days to every payment feature”), prioritize strategically (fix what actively slows you down), budget time for reduction, prevent new debt through reviews.
Types of technical debt:
Deliberate: “we know this is a shortcut but need to ship by Friday”
Inadvertent: “we did not know there was a better pattern”
Bit rot: code decays as the world around it changes
Dependency debt: outdated libraries with known vulnerabilities
Not all debt is bad — deliberate debt with a plan to repay is a legitimate engineering strategy. The danger is untracked debt that compounds silently.
The One Thing to Remember: Technical debt is only useful as a metaphor if you track it like real debt — with a principal (what is the shortcut), an interest rate (how much does it slow you down per sprint), and a repayment plan (when and how you will fix it). Untracked debt is not “strategic” — it is negligence.
Further reading:
Martin Fowler, Technical Debt Quadrant: the classic 2x2 matrix (deliberate vs. inadvertent, reckless vs. prudent) that gives you a vocabulary for discussing different kinds of debt with your team. Invaluable for distinguishing “we chose this shortcut strategically” from “we did not know any better.”
Martin Fowler, Technical Debt: the broader article that traces the metaphor back to Ward Cunningham and explains when the metaphor helps vs. when it misleads.
Interview Question: How do you convince a product team to invest in paying down technical debt?
Strong answer: Translate debt into business impact. Do not say “we need to refactor the auth module.” Say “every new feature that touches user permissions takes 3 extra days because of the auth module design, which is 15 extra engineering days this quarter.”
Track velocity over time and show the slowdown. Propose a specific, bounded investment (“2 sprints to fix the top 3 bottlenecks”) with a measurable outcome (“feature velocity returns to Q1 levels”). Bundle small debt reduction with feature work when possible. The key: never frame it as “cleaning up”; frame it as “investing in speed.”
Scenario: You inherit a codebase where every class has 1000+ lines and violates SRP. How do you prioritize the refactor without stopping feature work?
What they are really testing: Whether you can be strategic about refactoring (resisting the urge to rewrite everything) and whether you understand how to interleave cleanup with delivery.
Strong answer framework:
Do not attempt a Big Bang rewrite. The codebase is working in production. A full rewrite is the highest-risk, longest-duration option, and it has a terrible track record. Joel Spolsky called this “the single worst strategic mistake that any software company can make.” Instead, adopt an incremental approach.
Identify the hot spots. Not all 1000-line classes are equally painful. Use two metrics to prioritize:
Change frequency — run git log --format=format: --name-only | sort | uniq -c | sort -rn to find which files change most often. A 1000-line class that has not been touched in a year is low priority. A 1000-line class that gets modified in every sprint is urgent.
Bug density — which classes are associated with the most production incidents or bug reports? Cross-reference change frequency with defect rate. Files that change often and break often are your top targets.
Apply the Strangler Fig pattern to classes. When you need to add a feature that touches a bloated class, extract the new functionality into a clean, well-tested class. Then extract closely related existing functionality into that new class. Over time, the old class shrinks as responsibilities migrate outward. You never stop feature work — you just do the feature work in new, clean modules and route calls through them.
Set a “boy scout” rule for the team. Every pull request that touches a bloated class must leave it slightly better — extract one method, split one responsibility, add one test for untested behavior. No PR makes the class worse. Incremental improvement compounds over time.
Timebox and track. Allocate 15-20% of sprint capacity to refactoring, focused on the hotspot list. Track the results: measure change-failure rate and cycle time for features touching refactored modules. Show the product team that refactored areas are delivering faster.
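The hot-spot step in this framework (change frequency cross-referenced with bug density) can be sketched as a small script. It assumes you have already captured the output of the git log command above as text and have a per-file bug count from your tracker. The scoring formula is an arbitrary illustrative choice, not an established metric.

```python
from collections import Counter


def change_counts(git_log_output):
    """Count how often each path appears in `git log --format=format: --name-only` output."""
    paths = (line.strip() for line in git_log_output.splitlines())
    return Counter(path for path in paths if path)  # skip the blank separator lines


def rank_hotspots(changes, bugs_by_file, top=5):
    """Rank files by churn weighted by bug count (hypothetical scoring)."""
    scores = {
        path: count * (1 + bugs_by_file.get(path, 0))
        for path, count in changes.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda path: (-scores[path], path))[:top]
```

Files that change often and break often float to the top, while a large but dormant class scores low, which matches the prioritization argument above.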
Common mistake: Trying to refactor everything at once, creating a massive PR that nobody can review, that conflicts with every other branch, and that introduces regressions because the refactoring is not covered by tests. The other common mistake: asking for “refactoring sprints” with no feature delivery — this burns product trust and rarely gets approved.
A Philosophy of Software Design by John Ousterhout — A short, opinionated book that challenges conventional wisdom. Ousterhout argues that deep modules (simple interfaces, complex implementations) are better than shallow ones, and that the most important goal of software design is reducing complexity. Read this if you find “Clean Code” too prescriptive.
“How Complex Systems Fail” by Richard Cook — Yes, this appears in the reliability section too. It belongs here as well because its insights on how complexity emerges from seemingly simple components apply directly to software architecture. Understanding that “complex systems run in degraded mode” changes how you think about code quality.
Advanced Practice
Charity Majors on Observability and Engineering Culture — Beyond her observability writing, Majors has excellent posts on engineering management, technical debt, and how to build a culture where quality is sustainable. Her post on “the engineer/manager pendulum” is essential reading for senior ICs considering management.
Gergely Orosz’s “The Pragmatic Engineer” — Covers engineering culture, career growth, and how decisions get made at scale. His deep dives on incident response culture and how different companies approach technical debt are informed by his experience at Uber and other high-scale companies.
Working Effectively with Legacy Code by Michael Feathers — If you are facing the “1000-line class” scenario from the interview question above, this book is your tactical manual. Feathers provides specific techniques for getting untested code under test, breaking dependencies, and refactoring safely when you have no safety net.
Every section in this chapter connects to a single question: “Is this safe to ship?” Before deploying any feature, change, or migration to production, walk through this checklist. Print it. Tape it to your monitor. Make it part of your team’s PR template.
This is not optional bureaucracy. Every major outage story in this chapter — the S3 typo, the Cloudflare regex, the Facebook BGP withdrawal — could have been prevented or significantly mitigated if someone had asked these questions before executing.
What is the SLO? What availability, latency, or correctness target does this feature fall under? If there is no SLO for this service, stop and define one before shipping. You cannot know if a change is safe if you have not defined what “safe” means.
What is the rollback plan? Can you revert this change in under 5 minutes? Is it a code rollback, a feature flag toggle, or a database migration rollback? If the rollback requires a manual database fix or a multi-step process, that is a red flag — simplify the rollback before shipping.
What alerts fire if this breaks? Which dashboards will show the problem? Which PagerDuty/Opsgenie alert will wake someone up? If the answer is “none” or “we will notice when users complain,” you are shipping blind. Add alerting before the deploy, not after the incident.
What is the blast radius? If this goes wrong, who is affected? All users? Users in one region? Users on one plan? 1% of traffic (canary) or 100% (big bang)? The blast radius determines how carefully you roll out and how aggressively you monitor.
Is this change idempotent and retry-safe? If the deploy fails halfway, can you safely re-run it? If a message gets processed twice, does it produce the correct result? Non-idempotent changes in distributed systems are ticking time bombs.
Have the failure modes been tested? Not just “does it work when everything is fine” but “what happens when the database is slow, the downstream API returns 500s, or the cache is cold?” See the Testing chapter for how to test failure paths, not just happy paths.
Who is on call and do they know this deploy is happening? The worst time to learn about a risky deploy is when you are paged at 2 AM with no context. Communicate deploy timing, expected impact, and rollback instructions to the on-call engineer before you ship.
The One Thing to Remember for This Entire Chapter: Reliability is not about building systems that never fail. It is about building systems where failure is expected, budgeted, contained, and recoverable. The most reliable teams are not the ones with the fewest incidents — they are the ones who recover fastest and learn the most from each one.
The pre-ship checklist gets you to production safely. The aftercare checklist keeps you safe once you are there. Most teams do the first part and skip the second — which is why incidents cluster in the 48 hours after a deploy and in the 2 weeks after people stop watching.
This is the work that separates teams that operate well from teams that just ship well. Shipping is the beginning, not the end. The aftercare items below should be completed within 48 hours of any significant production change.
1. Alert Tuning (within 24 hours)
Did the deploy change your baseline metrics? After a significant change, your p50 latency, error rate, or resource utilization may have shifted. If your alert thresholds were tuned to the old baseline, they are now either too sensitive (false alarms) or too lenient (missed incidents). Review the metrics for the first 24 hours and adjust thresholds.
Are your alerts actionable? Every alert that fires should have a clear “what do I do next?” If an alert fires and the on-call engineer’s response is “I do not know what this means,” the alert needs a runbook link or it needs to be deleted. Noisy, unactionable alerts train the team to ignore alerts, which is the precondition for every missed incident.
Alert hygiene rule: If an alert fired more than 3 times in the past month without requiring human action, it should be either auto-remediated (a script fixes it) or deleted. Alert fatigue is a reliability risk.
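The hygiene rule above is easy to automate. The sketch below is one possible reading of it, assuming you can export per-alert firing counts from your paging tool; `AlertStats` and its field names are hypothetical, not part of any real alerting API.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    fired_last_30d: int       # firings in the past month (assumed export)
    needed_human_action: int  # firings where someone actually had to act


def needs_hygiene_review(stats: AlertStats, max_noise: int = 3) -> bool:
    """True when more than `max_noise` firings required no human action."""
    no_action = stats.fired_last_30d - stats.needed_human_action
    return no_action > max_noise
```

Alerts flagged by this check are candidates for auto-remediation or deletion; the threshold of 3 comes from the rule above and should be tuned per team.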
2. Runbook Update (within 48 hours)
Does a runbook exist for this service’s failure modes? If not, write one. A minimal runbook answers four questions:
What does this alert mean? (one sentence)
How do I verify the issue? (which dashboard, which query, which log)
How do I mitigate immediately? (rollback command, feature flag toggle, scale-up command)
Who do I escalate to if mitigation fails? (name, Slack channel, phone number)
Runbook testing: A runbook that has never been executed is an untested hypothesis. Schedule a quarterly “runbook drill” where the on-call engineer follows the runbook step-by-step for a simulated incident. Fix every step that is outdated, unclear, or wrong.
3. Rollback Validation (before the deploy, verified after)
Can you actually roll back? Many teams assume rollback works but never test it. After deploying, verify:
Code rollback: Can you redeploy the previous version in under 5 minutes? Test the actual command or pipeline.
Database rollback: If the deploy included a migration, is the migration backward-compatible? Can the previous code version run against the new schema? If not, your rollback is blocked until you write a reverse migration.
Feature flag rollback: If the change is behind a feature flag, does disabling the flag return the system to its previous behavior? Test this explicitly — “flag off” is not always equivalent to “the code before the flag existed.”
Data rollback: If the deploy wrote data in a new format, can the old code read that data? Data format changes are the most common rollback blocker.
4. Action Item Closure (tracked weekly)
Are post-mortem action items actually getting done? The single biggest predictor of recurring incidents is incomplete action items from previous incidents. Track these in a dedicated board (not the general backlog) with:
Named owner (not “the team”)
Specific deliverable (not “improve monitoring”)
Deadline (not “next quarter”)
Verification criteria (not “done when merged” but “done when the alert fires correctly in staging”)
Weekly review cadence: Every Monday, review open action items. Items older than 30 days without progress get escalated to the engineering manager. Items that have been deprioritized twice get either re-scoped to something achievable or explicitly accepted as risk with a documented justification.
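The weekly review rules can be encoded so the Monday meeting starts from a pre-sorted list. This is a minimal sketch; `ActionItem`, its fields, and the returned labels are illustrative assumptions, and "older than 30 days without progress" is modeled here as days since the last recorded progress.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str                 # a named person, not "the team"
    last_progress: date        # assumed: date of the last recorded progress
    times_deprioritized: int = 0


def triage(item: ActionItem, today: date) -> str:
    """Apply the weekly review rules to one open action item."""
    if item.times_deprioritized >= 2:
        return "rescope-or-accept-risk"
    if (today - item.last_progress).days > 30:
        return "escalate"
    return "on-track"
```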
The aftercare mindset: Shipping a feature without aftercare is like performing surgery without post-op monitoring. The operation might have been perfect, but the patient can still deteriorate. The first 48 hours after a deploy are when your monitoring, runbooks, and rollback plans prove they work — or reveal they do not.
Reliability does not live in isolation. It is the thread that runs through every other engineering discipline. Here is how the topics in this chapter connect to the rest of the guide:
These questions go beyond surface-level recall. They are the kind of questions a senior interviewer asks when they want to separate candidates who have read about reliability from candidates who have lived it. Each question includes follow-up chains that branch into different areas — the way a real 45-minute interview unfolds.
1. Your team just launched a new microservice and you need to define its SLOs. Walk me through your process from scratch.
What the interviewer is really testing
Whether you understand SLOs as an organizational alignment tool, not just a number on a dashboard. They want to see if you start with user experience, consult stakeholders, measure before committing, and build governance around the target.
Strong answer
The way I think about this is that SLOs are not something engineering decides in isolation. They are a contract between engineering and the business about how much reliability we are buying. Here is how I would approach it:
Step 1: Identify the user-facing behaviors that matter. For a checkout service, that is probably request success rate and latency. For a batch data pipeline, it might be completeness and freshness. I would talk to the product manager and ask: “If this service were degraded, what would users complain about first?” That complaint is your SLI.
Step 2: Instrument and measure before setting a target. I would deploy with observability in place (request latency histograms at p50, p95, p99, error rates by status code, and throughput) and let it bake for 2 to 4 weeks. You cannot set a meaningful SLO without baseline data. I have seen teams pick 99.99% because it sounds good and then burn their error budget in week one because the baseline was actually 99.7%.
Step 3: Set the SLO slightly above the measured baseline. If our p99 latency is 180ms, I would set the SLO at 200ms. If our success rate is 99.85%, I would target 99.9%. The target should be achievable but require discipline. Too aggressive and you are in a permanent feature freeze. Too lenient and the SLO does not drive any behavior.
Step 4: Define separate SLOs for separate dimensions. Availability (percentage of successful responses) and latency (percentage of requests under threshold) burn their error budgets independently. A service can be available but slow, or fast but error-prone; each is a distinct failure mode.
Step 5: Build an error budget policy. This is the part most teams skip and then regret. I would write down explicitly: when the budget is above 50%, ship freely; between 20 and 50%, require rollback plans on every deploy; below 20%, freeze non-critical work; at zero, full reliability sprint. Get the PM and the engineering lead to co-sign it. Without teeth, the SLO is just a number people ignore.
Step 6: Set the external SLA lower than the internal SLO. If we target 99.9% internally, the SLA to customers should be 99.5% or lower. That gap is your safety margin. If the SLA and SLO are the same number, every near-miss becomes a contractual breach.
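To make the Step 5 policy concrete, here is a minimal sketch that turns request counts into a remaining-budget fraction and maps it to the policy tiers described above. The function names and the returned strings are illustrative assumptions; the thresholds are the ones from the answer.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in this window.

    Goes negative once the SLO has been violated.
    """
    allowed = (1.0 - slo) * total  # failures the SLO permits in this window
    if allowed <= 0:
        raise ValueError("no error budget: SLO is 100% or there was no traffic")
    return 1.0 - failed / allowed


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the agreed policy tier."""
    if remaining > 0.50:
        return "ship freely"
    if remaining >= 0.20:
        return "require rollback plan on every deploy"
    if remaining > 0.0:
        return "freeze non-critical work"
    return "reliability sprint"
```

For example, a 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures; 300 failures so far leaves about 70% of the budget, which falls in the "ship freely" tier.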
Follow-up: How do you handle a situation where different consumers of your service need different SLOs?
This is actually more common than people think. For example, your internal analytics pipeline might be fine with p99 latency of 2 seconds, but the customer-facing API that also calls your service needs sub-200ms responses.
The approach I have used is tiered SLOs. You define a “premium” tier and a “standard” tier. Premium requests get dedicated capacity, stricter circuit breaker thresholds, and higher priority in load shedding scenarios. You implement this at the infrastructure level: separate thread pools (the bulkhead pattern), separate rate limits, and sometimes even separate deployment targets for latency-sensitive consumers.
The key governance question is: who decides which consumer gets which tier? In my experience, you tie it to the consumer’s own SLO. If their SLO is 99.99%, they get premium. If their SLO is 99.9%, they get standard. This prevents every team from demanding premium treatment.
One gotcha: if you have 15 consumers and 12 of them are “premium,” you effectively have no tiering. Tiering only works if the premium tier is genuinely a minority of traffic.
Follow-up: Your SLO has been met every month for six consecutive months. Is that good or bad?
This is a trick question that catches a lot of people. The intuitive answer is “that is great, we are reliable!” But the experienced answer is: it might mean your SLO is too lenient.
If you are never approaching your error budget, you are probably over-investing in reliability at the expense of feature velocity. The whole point of SLOs and error budgets is to give you permission to take calculated risks. If the error budget is always at 95% remaining, you are leaving velocity on the table: you could be shipping faster, doing riskier migrations, and experimenting more aggressively.
The sweet spot is burning 50 to 80% of your error budget most months. That means you are shipping aggressively while maintaining discipline. If you are consistently under 30% budget consumption, you have two options. Either spend the unused “permission to fail” on shipping faster, or, if users genuinely depend on the higher reliability you are delivering, tighten the SLO (for example from 99.9% to 99.95%) so the target matches reality. Note that tightening the SLO shrinks the budget, so it is the alternative to spending it, not a way to free more of it.
The counterargument is that some teams are in a hardening phase where stability matters more than velocity, maybe after a series of bad incidents or during a compliance push. In that case, consistently meeting the SLO is intentional and correct. Context matters.
Going Deeper: How do you measure SLOs for an event-driven, asynchronous system where there is no synchronous request-response to measure?
This is where things get genuinely hard and where most teams wing it. In a synchronous API, your SLI is straightforward: request success rate and latency. In an async system, you need to think differently.
Freshness SLIs. For a data pipeline, the SLI might be “time between event occurrence and availability in the data warehouse.” You measure this by embedding timestamps at ingestion and comparing them at the query layer. The SLO could be “95% of events are queryable within 5 minutes of occurrence.”
Completeness SLIs. “What percentage of events that were produced were successfully consumed?” This requires correlating producer-side metrics (messages published) with consumer-side metrics (messages processed). Any delta is your incompleteness rate. The SLO could be “99.9% of published events are processed within 1 hour.”
Correctness SLIs. Harder to automate, but you can use techniques like data quality checks at pipeline boundaries: row counts, schema validation, checksum comparisons between source and destination.
The tooling challenge is real. Unlike HTTP services where every load balancer gives you latency and error metrics for free, async systems require explicit instrumentation at every stage. I have seen teams use distributed tracing with correlation IDs through message queues: each message carries a trace that spans from producer through every transformation to final consumer. This lets you measure end-to-end latency and identify which stage is the bottleneck.
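The freshness and completeness SLIs described above reduce to simple ratios once the timestamps and counts are collected. The sketch below assumes each event is represented as an `(occurred_at, available_at)` pair in seconds and that producer and consumer counts are already aggregated; both function names are hypothetical.

```python
from typing import Iterable, Tuple


def freshness_sli(events: Iterable[Tuple[float, float]], threshold_s: float) -> float:
    """Fraction of events that became queryable within `threshold_s` seconds.

    Each event is an (occurred_at, available_at) pair of timestamps.
    """
    delays = [available - occurred for occurred, available in events]
    if not delays:
        return 1.0  # no events: nothing was late
    return sum(d <= threshold_s for d in delays) / len(delays)


def completeness_sli(published: int, processed: int) -> float:
    """Fraction of published events that were processed (capped at 1.0)."""
    if published == 0:
        return 1.0
    return min(processed, published) / published
```

A freshness SLO of "95% within 5 minutes" then becomes the check `freshness_sli(events, 300) >= 0.95` over the measurement window.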
2. Explain how you would implement a circuit breaker from scratch. Then tell me why your implementation is probably wrong.
What the interviewer is really testing
Two things at once: can you implement the state machine correctly, and do you understand the production edge cases that naive implementations miss? The second part — “why your implementation is probably wrong” — is testing self-awareness and production experience.
Strong answer
The basic implementation is a three-state machine: Closed, Open, and Half-Open.
In the Closed state, every request passes through. I maintain a sliding window of recent requests, say the last 100 requests or the last 60 seconds. If the error rate in that window exceeds a threshold, say 50%, the breaker trips to Open. In Open state, all requests immediately fail with a fallback response, with no call to the downstream service. After a recovery timeout of maybe 30 seconds, the breaker transitions to Half-Open. In Half-Open, I let a small number of probe requests through. If they succeed, the breaker closes. If any fail, it reopens.
Now, here is why a first implementation of this is probably wrong in several ways:
Problem 1: Consecutive failure counting misses intermittent degradation. Many first attempts replace the sliding window with a simple consecutive-failure counter. With that shortcut, a service that alternates between success and failure (50% error rate) never trips the breaker because every success resets the counter. I need a sliding window error-rate threshold, not a consecutive failure count.
Problem 2: The state is per-instance, not shared. If I have 20 application instances, each has its own circuit breaker with its own failure counts. One instance might trip while others do not, or none trip because the failures are spread across instances. For some use cases, I need a shared circuit breaker state (stored in Redis, for instance) so all instances agree on the breaker state.
Problem 3: Half-Open is dangerous without traffic limiting. If I transition to Half-Open and let all queued requests through at once, I might overwhelm the recovering service. The probe should be a single request or a tiny percentage of traffic, not the full firehose.
Problem 4: Thread safety. Multiple threads are reading and writing the failure count and state simultaneously. Without proper synchronization (an atomic state machine or a lock) you get race conditions where different threads trip and close the breaker at the same time.
Problem 5: No differentiation between failure types. A timeout is different from a 400 Bad Request. I should only count transient failures (503, timeout, connection refused) toward the threshold, not client errors that indicate a bug in my request.
In practice, I would not implement this from scratch. I would use Resilience4j on the JVM, Polly in .NET, or cockatiel in Node.js. But understanding these edge cases is what helps me configure those libraries correctly.
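For illustration only, here is a minimal single-process sketch of the state machine that addresses Problems 1, 3, 4, and 5 above. It is not how Resilience4j or Polly are implemented; the class, parameter names, and defaults are assumptions, the window is count-based, and shared state across instances (Problem 2) is deliberately left out.

```python
import threading
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """The breaker rejected the call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, window=100, min_calls=10, error_rate=0.5,
                 recovery_timeout=30.0,
                 transient=(TimeoutError, ConnectionError),
                 clock=time.monotonic):
        self._results = deque(maxlen=window)  # True marks a transient failure
        self._min_calls = min_calls
        self._error_rate = error_rate
        self._recovery_timeout = recovery_timeout
        self._transient = transient
        self._clock = clock
        self._state = State.CLOSED
        self._opened_at = 0.0
        self._probe_in_flight = False
        self._lock = threading.Lock()  # guards every field above (Problem 4)

    @property
    def state(self):
        with self._lock:
            return self._state

    def _admit(self):
        """Decide whether a call may proceed; returns True if it is the probe."""
        with self._lock:
            if self._state is State.CLOSED:
                return False
            if self._state is State.OPEN:
                if self._clock() - self._opened_at < self._recovery_timeout:
                    raise CircuitOpenError("circuit is open")
                self._state = State.HALF_OPEN
            if self._probe_in_flight:  # Problem 3: one probe at a time
                raise CircuitOpenError("probe already in flight")
            self._probe_in_flight = True
            return True

    def _record(self, failed, is_probe):
        with self._lock:
            if is_probe:
                self._probe_in_flight = False
                if failed:
                    self._trip()
                else:
                    self._state = State.CLOSED
                return
            if self._state is not State.CLOSED:
                return  # late result from a call admitted before the trip
            self._results.append(failed)
            rate = sum(self._results) / len(self._results)  # Problem 1
            if len(self._results) >= self._min_calls and rate > self._error_rate:
                self._trip()

    def _trip(self):
        self._state = State.OPEN
        self._opened_at = self._clock()
        self._results.clear()

    def call(self, fn, *args, **kwargs):
        is_probe = self._admit()
        try:
            result = fn(*args, **kwargs)
        except self._transient:
            self._record(True, is_probe)
            raise
        except Exception:
            self._record(False, is_probe)  # Problem 5: not a dependency failure
            raise
        self._record(False, is_probe)
        return result
```

The injectable `clock` exists so the recovery timeout can be tested without sleeping. The `min_calls` guard is an addition not mentioned in the answer: it prevents a single early failure from producing a 100% error rate and tripping the breaker on one request.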
Follow-up: How does a circuit breaker interact with retries? Can they conflict?
Absolutely, and this is a common source of bugs. If you have retries configured outside the circuit breaker, here is what happens: a request fails, the retry logic sends it again, that fails too, the retry sends it a third time. Each of those failures increments the circuit breaker’s failure counter. So three attempts at one logical request look like three failures to the breaker, causing it to trip much faster than intended.
The correct architecture is retries inside the circuit breaker. The circuit breaker wraps the entire retry chain. So one logical request, even with three attempts, either succeeds or fails once from the breaker’s perspective.
The other conflict is between the circuit breaker’s recovery timeout and the retry backoff schedule. If the breaker opens for 30 seconds but the retry backoff is 60 seconds, the breaker may already have moved to Half-Open (or closed again) before the retry fires, and then the retry hits a service that has not actually recovered. Or conversely, the breaker might still be open when the retry fires, causing the retry to be immediately rejected.
The pattern I have used is: circuit breaker on the outside, retries on the inside, and the retry max duration is shorter than the circuit breaker recovery timeout. This way the retry policy gives up before the breaker has a chance to interact with it in weird ways.
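The difference between the two nesting orders can be shown with a few lines. In this sketch `FailureCounter` is a hypothetical stand-in for a breaker's bookkeeping (it only counts the failures it observes), and `retry` is a simplified retry helper with full jitter; neither is taken from a real library.

```python
import random
import time


def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn up to `attempts` times, sleeping with full jitter in between."""
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(random.uniform(0, base_delay * 2 ** attempt))


class FailureCounter:
    """Stand-in for a breaker: counts every failure it observes."""

    def __init__(self):
        self.failures = 0

    def observe(self, fn):
        try:
            return fn()
        except Exception:
            self.failures += 1
            raise


def always_down():
    raise ConnectionError("dependency unavailable")
```

Wrapping the retry chain (`breaker.observe(lambda: retry(always_down))`) records one failure for the logical request. Putting the breaker inside the retry loop (`retry(lambda: breaker.observe(always_down))`) records three.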
Follow-up: When should you NOT use a circuit breaker?
This is an underrated question. Circuit breakers add complexity, and they are not always the right tool.
Do not use a circuit breaker for dependencies you cannot degrade. If your service literally cannot function without this dependency (it is in the critical synchronous path and there is no fallback), the circuit breaker just turns a dependency outage into a faster failure. The user still gets an error, just quicker. The breaker helps your thread pool, but it does not help the user. In this case, you are better off investing in redundancy (multi-provider failover) rather than a breaker.
Do not use a circuit breaker for async fire-and-forget calls. If you are publishing analytics events to Kafka asynchronously, a circuit breaker adds complexity without much benefit. A bounded buffer with drop-on-overflow is simpler and achieves the same goal.
Be cautious with circuit breakers across service meshes. If you are running Istio, the service mesh already has circuit breaking at the infrastructure level. Adding application-level circuit breakers on top creates layered state machines that can interfere with each other: the app breaker might be half-open while the mesh breaker is open, leading to confusing behavior.
The general principle: use circuit breakers when you have a synchronous dependency with a viable fallback. If there is no fallback, invest in making the dependency more reliable instead.
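As a rough illustration of the bounded buffer with drop-on-overflow mentioned above, the sketch below uses the standard library `queue.Queue`. The class name and methods are hypothetical; a background publisher is assumed to call `drain` and forward the events to the broker.

```python
import queue


class DropOnOverflowBuffer:
    """Bounded in-memory buffer that drops new events when full."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0  # approximate under concurrency; good enough for a metric

    def offer(self, event) -> bool:
        """Enqueue without blocking the caller; returns False if dropped."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items: int) -> list:
        """Remove up to `max_items` events for the background publisher."""
        items = []
        while len(items) < max_items:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                break
        return items
```

The request path never blocks on the broker: when the broker is slow, the buffer fills and new events are dropped and counted, which bounds memory without a state machine.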
3. You are on call and get paged at 2 AM. Your service’s p99 latency has gone from 150ms to 3 seconds in the last 10 minutes, but the error rate has not changed. Walk me through your investigation.
What the interviewer is really testing
Incident response methodology under pressure. Can you form hypotheses systematically, or do you thrash between random guesses? They also want to see if you understand the difference between latency degradation and availability degradation — they have very different root causes.
Strong answer
High latency with normal error rates narrows the problem space significantly. The service is not crashing or returning errors; it is just slow. Here is how I would work through it:
First 2 minutes: establish scope and context.
Is this affecting all endpoints or just one? If it is one endpoint, the problem is likely in that endpoint’s specific dependency or query. If it is all endpoints, the problem is systemic — CPU, memory, GC, network, or a shared dependency.
Did anything deploy in the last hour? I check the deploy log immediately. A deployment that correlates with the latency spike is the most common root cause, and the fastest fix is a rollback.
Is the traffic volume normal? A sudden spike in requests (legitimate or DDoS) can cause latency increases without errors if the system is not overloaded enough to fail — just overloaded enough to queue.
Next 5 minutes — check the infrastructure stack top-down.
Application metrics: CPU and memory utilization on the application instances. A JVM garbage collection storm — particularly full GC pauses — is one of the most common causes of latency spikes without errors. If I see GC pause times spiking, that is the smoking gun.
Database: Are queries taking longer than usual? Check slow query logs and connection pool utilization. A missing index on a growing table can cause latency to degrade gradually, then suddenly spike once the table crosses a size threshold. Also check: did a new query get introduced with the recent deploy that is doing a full table scan?
Downstream dependencies: Check the latency of every service I call. If my service calls three downstream services, and one of them went from 50ms to 2 seconds, that explains my p99 spike. Use distributed tracing to identify which span is the bottleneck.
Network: Check for packet loss, retransmissions, or DNS resolution delays. A network issue at the cloud provider level can add latency without causing outright failures.
Connection pool exhaustion: If all connections in the pool are in use, new requests wait in a queue. This causes latency spikes without errors — the requests eventually succeed, just slowly. Check active vs. idle connections.
Decision point: mitigate before root-causing. If I have identified the likely cause within 10 minutes, I fix it. If not, and latency is severely impacting users, I take a mitigation action first: scale up instances, restart the worst-performing instances (clears connection pool and GC state), or toggle off a non-essential feature that might be contributing to load.
The mistake I would avoid: spending 45 minutes root-causing while users suffer. Mitigate first, root-cause after.
Follow-up: You discover the latency spike is caused by a database query that was fine for months but suddenly got slow. Nothing changed in the code. What happened?
This is actually one of my favorite debugging scenarios because the answer is almost never “nothing changed.” Something always changed; it just was not a code change.
Most likely culprits:
Query plan change. The database optimizer decided to use a different execution plan. In PostgreSQL, this can happen when table statistics are updated (after an ANALYZE or autovacuum-triggered analyze) and the optimizer concludes that a sequential scan is now cheaper than an index scan, or vice versa. The data distribution shifted. Maybe a column that used to have high cardinality now has a hot value that 80% of rows share, making the index less selective.
Data volume threshold. The table crossed a size boundary where the working set no longer fits in memory. Queries that were served entirely from the buffer cache now hit disk. This is insidious because it appears to happen “suddenly.” In reality, cache hit rates were slowly declining for weeks, and you just crossed the tipping point.
Lock contention. A background job (a large batch update, a schema migration, a long-running analytics query) is holding locks that your production queries are now waiting on. The background job might run weekly and just happened to overlap with peak traffic for the first time.
Connection saturation at the database level. The database has a max_connections limit. If another service started consuming more connections, your service has fewer left: requests wait longer for a free connection in the application pool or in a pooler such as PgBouncer, and once the limit is reached PostgreSQL refuses new connections outright.
The debugging approach: run EXPLAIN ANALYZE on the slow query and compare the plan to what it was before (if you have historical query plan logging). Check pg_stat_activity for blocked queries. Check pg_stat_user_tables for live row counts and the last vacuum/analyze timestamps, and pg_total_relation_size() for the on-disk table size. Check the buffer cache hit ratio: if it dropped below 99%, you have a memory-to-data-size mismatch.
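One detail worth spelling out about the buffer cache hit ratio: the `blks_hit` and `blks_read` counters in PostgreSQL's `pg_stat_database` view are cumulative since the last statistics reset, so a recent drop only shows up when you compare two samples. A minimal sketch of that calculation follows; `BlockCounters` and `hit_ratio_between` are hypothetical helpers, and fetching the counters from the database is left out.

```python
from typing import NamedTuple


class BlockCounters(NamedTuple):
    blks_hit: int   # blocks found in shared buffers (cumulative)
    blks_read: int  # blocks that had to be read from outside shared buffers


def hit_ratio_between(earlier: BlockCounters, later: BlockCounters) -> float:
    """Buffer cache hit ratio for the interval between two counter samples."""
    hit = later.blks_hit - earlier.blks_hit
    read = later.blks_read - earlier.blks_read
    total = hit + read
    if total <= 0:
        return 1.0  # no block activity (or a stats reset) in this interval
    return hit / total
```

With samples of (1,000,000 hit, 1,000 read) and later (1,090,000 hit, 11,000 read), the lifetime ratio still looks healthy at about 99%, but the interval ratio is 90%, which is the signal that matters during an incident.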
Going Deeper: How do you prevent this class of latency surprise from happening again?
This requires a shift from reactive debugging to proactive monitoring. Here is what I would put in place:
Query performance tracking. Log the execution time of every query (or at least the p99) and alert on week-over-week regression. Tools like pg_stat_statements in PostgreSQL or the slow query log in MySQL give you this. Set an alert: “if p99 query time for any tracked query increases by 2x compared to last week’s baseline, page.”
Table growth monitoring. Track row count and table size over time for critical tables. Set an alert when a table grows past a threshold where you have not validated performance, for example “alert if the orders table exceeds 100M rows, because our index performance was last validated at 50M.”
Synthetic canary queries. Run your critical queries against production (read replicas) on a schedule and track their latency. This catches plan regressions and data volume issues before real user traffic is affected.
Database review in the deploy process. Any PR that adds or modifies a query on a large table requires an EXPLAIN ANALYZE output in the PR description, run against a database with production-scale data. Reviewing query plans in CI against a small test database is misleading because the optimizer chooses different plans at different data volumes.
The organizational piece matters too: make sure the on-call rotation includes someone who can reason about database performance, not just application code. The worst latency incidents I have seen dragged on because the on-call engineer treated the database as a black box.
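The week-over-week regression alert can be sketched as a comparison of two snapshots. This assumes you already export a per-query p99 in milliseconds keyed by some query identifier; the function name and the dictionary shape are illustrative assumptions.

```python
def regressed_queries(last_week_p99_ms: dict, this_week_p99_ms: dict,
                      factor: float = 2.0) -> list:
    """Return (query_id, baseline, current) for queries that slowed by `factor`x.

    Results are sorted with the worst slowdown first.
    """
    regressions = []
    for query_id, baseline in last_week_p99_ms.items():
        current = this_week_p99_ms.get(query_id)
        if current is not None and baseline > 0 and current >= factor * baseline:
            regressions.append((query_id, baseline, current))
    return sorted(regressions, key=lambda r: r[2] / r[1], reverse=True)
```

Queries that appear only in this week's snapshot have no baseline and are skipped here; a real alert would probably want to surface new queries separately.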
4. Compare and contrast the bulkhead pattern, circuit breaker pattern, and retry pattern. When would you use each, and when would you use them together?
What the interviewer is really testing
Whether you understand these as complementary tools with different purposes, not interchangeable options. This also tests whether you can reason about how patterns compose — which is a staff-level skill.
Strong answer
The way I think about these three patterns is by the question each one answers:
Retry answers: “This failed. Should I try again?” It handles transient failures. A network blip, a momentary 503, a brief timeout. Retries assume the failure is temporary and the next attempt will succeed. The key constraint is idempotency: you can only safely retry operations that produce the same result when executed multiple times.
Circuit breaker answers: “This keeps failing. Should I stop trying?” It handles sustained failures. When a dependency is down, retrying just wastes resources and adds latency. The circuit breaker stops the bleeding by failing fast, giving the downstream time to recover. Where retries are optimistic (“try again, it might work”), circuit breakers are pessimistic (“stop trying, it is not going to work right now”).
Bulkhead answers: “This dependency is failing. How do I prevent it from taking down everything else?” It handles failure isolation. A bulkhead does not care whether requests succeed or fail; it limits the blast radius by constraining how many resources any single dependency can consume. If the recommendation engine is hanging, the bulkhead ensures it can only consume 10 threads, not all 50.
When to use them together, and the correct composition order:
In practice, you layer all three. The composition from outermost to innermost should be: Bulkhead -> Circuit Breaker -> Retry -> Actual Call.
The bulkhead ensures this entire chain, including its retries, cannot consume more than its allocated resources. The circuit breaker observes the result of the retry chain (so three retries that all fail count as one failure from the breaker’s perspective, not three). The retry handles transient blips that the circuit breaker should not react to.
A real example: in our e-commerce checkout, the payment service call has a bulkhead of 25 threads (isolating it from the search service’s pool), a circuit breaker that trips at 40% error rate over a 60-second window, and a retry policy of 2 retries with exponential backoff. If Stripe is having a brief blip, the retries handle it. If Stripe is down, the circuit breaker stops us from burning threads on hopeless requests. If we misconfigure something and the retry-and-breaker loop goes wrong, the bulkhead ensures we never consume more than 25 threads, leaving the rest of the system healthy.
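The composition order above can be sketched with a semaphore-based bulkhead and two pluggable wrappers. This is a simplified illustration: `Bulkhead`, `BulkheadFullError`, and `compose` are hypothetical names, the bulkhead limits concurrent calls rather than owning a thread pool, and `breaker_call` and `retry_call` stand in for whatever breaker and retry implementations you use.

```python
import threading


class BulkheadFullError(Exception):
    """All slots reserved for this dependency are in use."""


class Bulkhead:
    """Caps how many calls to one dependency may be in flight at once."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot pile up waiting callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return fn()
        finally:
            self._slots.release()


def compose(bulkhead, breaker_call, retry_call, fn):
    """Bulkhead -> Circuit Breaker -> Retry -> actual call (outermost first)."""
    return bulkhead.call(lambda: breaker_call(lambda: retry_call(fn)))
```

Because the slot is held for the whole nested call, every retry attempt happens inside the same bulkhead slot, and the breaker sees one outcome per logical request.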
Follow-up: What happens when retries interact badly with an overloaded downstream service?
This is the retry storm problem, and it is one of the most common ways well-intentioned resilience patterns make things worse.
Picture this: your downstream service is overloaded and starts returning 503s. You have 100 clients, each configured with 3 retries. Each client’s first request fails. Now all 100 clients retry simultaneously, bringing the total to 200 requests against the overloaded service. Those fail too. The second retry brings the total to 300, and the third to 400 requests, all aimed at a service that was already drowning under the original 100.
This is why jitter is non-negotiable. Full jitter means each client waits a random amount of time before retrying, spreading the retry wave over a window instead of hammering the service in synchronized bursts.
But even with jitter, retries under sustained overload are counterproductive. This is exactly where the circuit breaker earns its keep. Once the error rate crosses the threshold, the circuit breaker opens and stops the retries entirely. The downstream service gets breathing room. The half-open probe checks if it has recovered.
The other mitigation is adaptive retry budgets. Instead of retrying every failed request, maintain a ratio: “retry no more than 10% of total requests in a given window.” Google’s SRE book calls this the “retry budget.” If you are sending 1000 requests per second and 500 are failing, you retry at most 100, not 500. This bounds the amplification factor.
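The amplification arithmetic, full jitter, and the retry budget can each be written in a few lines. This is a sketch under stated assumptions: the names are hypothetical, the budget is tracked per process and per window, and the 10% default mirrors the ratio quoted above.

```python
import random


def worst_case_requests(clients: int, retries_per_client: int) -> int:
    """Total requests if every attempt fails: the original plus each retry."""
    return clients * (1 + retries_per_client)


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                      rng=random.random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)) before the next retry."""
    return rng() * min(cap, base * 2 ** attempt)


class RetryBudget:
    """Permit retries only up to `max_percent` of requests seen this window."""

    def __init__(self, max_percent: int = 10):
        self.max_percent = max_percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and consume budget if one more retry is allowed."""
        allowed = self.requests * self.max_percent // 100
        if self.retries >= allowed:
            return False
        self.retries += 1
        return True

    def start_new_window(self) -> None:
        self.requests = 0
        self.retries = 0
```

With 1000 recorded requests and 500 failures asking to retry, `try_spend` grants 100 and refuses the rest, which caps the extra load on the downstream at 10% regardless of how bad the failure rate gets.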
Follow-up: A junior engineer asks you — if circuit breakers protect us from failing dependencies, why do we also need bulkheads? Is it not redundant?
Great question, and it reveals a common misunderstanding. They protect against different failure modes.
A circuit breaker protects you from sustained failures, where the dependency is returning errors. But what about sustained slowness? The dependency is not failing; it is just taking 30 seconds instead of 500 milliseconds. Every request eventually times out, but while it is waiting, it is holding a thread. The circuit breaker might not trip because the requests technically do not fail; they just take forever before timing out.
Without a bulkhead, those slow requests consume all your threads. Your entire application grinds to a halt, not because of errors, but because of resource exhaustion. The bulkhead says “I do not care if requests are fast, slow, succeeding, or failing. This dependency gets 10 threads maximum, period.”
The concrete scenario: your recommendation engine has a p99 latency of 500ms, but it starts experiencing network issues and its p99 goes to 15 seconds. Requests are not failing; they are completing, just very slowly. Your timeout is set to 30 seconds (too generous). Without a bulkhead, 50 threads are all stuck waiting for slow recommendation responses, and your checkout flow, which shares the same thread pool, cannot serve any requests.
With a bulkhead, only the recommendation engine’s 10 threads are stuck. The payment service’s 25 threads and the search service’s 15 threads are completely unaffected. Checkout keeps working.
So no, circuit breakers and bulkheads are not redundant. Circuit breakers handle error surges. Bulkheads handle resource isolation. You need both.
5. Your company runs a multi-region active-active deployment. Describe the hardest engineering problems this creates.
What the interviewer is really testing
Whether you have actually thought through — or worked with — the genuine difficulty of multi-region, or whether you just know it as a buzzword. This is a staff-level question that tests distributed systems intuition.
Strong answer
Multi-region active-active is the most operationally complex deployment model, and in my experience most teams underestimate the difficulty. Here are the genuinely hard problems:
Problem 1: Data consistency and conflict resolution. When both regions can accept writes, you will get conflicting writes to the same record. A user updates their profile in us-east-1 at timestamp T, and a different update to the same profile hits eu-west-1 at timestamp T+50ms. Which one wins? You have three options, each with trade-offs:
Last-write-wins (LWW): Simple but lossy. One update is silently discarded. Acceptable for non-critical data like user preferences. Unacceptable for financial data.
CRDTs (Conflict-Free Replicated Data Types): Mathematically guarantee convergence without coordination. Work brilliantly for counters, sets, and certain map structures. But they cannot model all data types — try representing a bank account balance as a CRDT and you will quickly hit limitations.
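Application-level conflict resolution: The application merges conflicting versions. This is the approach described in the original Amazon Dynamo paper and used by stores such as Riak and CouchDB: the database keeps both versions and your code decides how to merge them. (DynamoDB Global Tables, by contrast, resolves concurrent writes with last-writer-wins.) Flexible but complex and error-prone.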
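To make the contrast between the first two options concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs, next to a last-write-wins merge. The class and function names are illustrative, and the region names are reused from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        regions = self.counts.keys() | other.counts.keys()
        return GCounter(
            {r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions}
        )

    @property
    def value(self) -> int:
        return sum(self.counts.values())


def lww_merge(a: tuple[float, str], b: tuple[float, str]) -> tuple[float, str]:
    """Last-write-wins on (timestamp, value): the older write is silently dropped."""
    return a if a[0] >= b[0] else b


us, eu = GCounter(), GCounter()
us.increment("us-east-1", 3)   # three increments accepted in one region
eu.increment("eu-west-1", 2)   # two accepted concurrently in the other
merged = us.merge(eu)
print(merged.value)  # 5: no increment is lost, regardless of merge order
```

The merge is commutative, associative, and idempotent, which is what lets replicas converge without coordination. It also shows the limitation: a counter that can only grow cannot enforce "balance must never go below zero".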
Problem 2: Cross-region latency in the data path. Replication between us-east-1 and eu-west-1 adds 70 to 100 milliseconds of latency. For eventual consistency, this is fine: your read might be 100ms stale. But for operations that require global ordering (like “deduct from inventory” or “assign a unique sequential ID”), you either pay the cross-region latency on every write or accept the possibility of over-selling or ID collisions.
Problem 3: Failover is not actually automatic. DNS-based failover (Route 53 health checks) has TTL propagation delays. Even with a 60-second TTL, some clients cache DNS for minutes. During a failover, you get a period where some traffic goes to the healthy region and some still goes to the degraded region. Session state, in-flight transactions, and WebSocket connections all break during the transition.
Problem 4: Testing is much harder. You now need to test: both regions operating normally, region A failing while B is healthy, region B failing while A is healthy, partial connectivity between regions, and the nightmare scenario, a split-brain where both regions think the other is down. Your chaos engineering scope has grown several times over.
Problem 5: Operational overhead. Every deployment must be coordinated across regions. Every database schema migration must be backward-compatible across potentially different code versions running in different regions. Every configuration change must propagate to all regions. Your CI/CD complexity roughly doubles.
The honest assessment: most companies that say they need multi-region active-active actually need multi-region active-passive (warm standby) with fast failover. The cost and complexity difference between the two is enormous, and active-passive with a 2-minute failover satisfies the large majority of business requirements.
Follow-up: How do you handle database schema migrations in a multi-region active-active setup?
This is one of those problems that sounds simple until you are actually doing it. The core constraint is: during a migration, different regions might be running different versions of your code, and all of them need to read and write the same database (or replicated databases) without breaking.
The answer is expand-and-contract migrations, rigorously enforced.
Phase 1: Expand. Add the new column or table, but do not remove or rename the old one. Deploy code to all regions that can write to both the old and new schema. This is backward-compatible. If region A has the new code and region B has the old code, both can still operate because the old schema is untouched.
Phase 2: Migrate. Run a backfill job that copies data from the old structure to the new one. All regions should now be running the new code version that reads from the new schema with a fallback to the old one.
Phase 3: Contract. Only after all regions are running the new code and the backfill is complete, drop the old column or table. This is a separate deploy, done days or weeks after the expand phase.
The common mistake is trying to do all three phases in a single migration script. In a single-region deployment, you might get away with that. In multi-region, it is a recipe for data loss or schema incompatibility errors during the rollout window.
The other gotcha: DDL statements (ALTER TABLE) can take locks on large tables, and because the DDL itself replicates, the same blocking or replication lag can show up in other regions and cause production latency. Use online DDL tools like pt-online-schema-change or gh-ost for MySQL, or ensure PostgreSQL migrations use CREATE INDEX CONCURRENTLY and avoid holding ACCESS EXCLUSIVE locks for long.
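The application side of expand-and-contract can be sketched as follows. This is an illustration under stated assumptions: rows are plain dicts, and the migration is a hypothetical rename of a `name` column to `full_name`; neither comes from the text above.

```python
def write_user(row: dict, full_name: str) -> None:
    """Expand phase: write both columns so old and new code can read the row."""
    row["name"] = full_name        # old column, still read by regions on old code
    row["full_name"] = full_name   # new column


def read_full_name(row: dict) -> str:
    """Migrate phase: prefer the new column, fall back for rows not yet backfilled."""
    value = row.get("full_name")
    return value if value is not None else row["name"]


def backfill(rows: list[dict]) -> int:
    """Copy old-column data into the new column; safe to re-run."""
    changed = 0
    for row in rows:
        if row.get("full_name") is None:
            row["full_name"] = row["name"]
            changed += 1
    return changed


legacy_row = {"name": "Ada Lovelace"}   # written before the expand deploy
new_row: dict = {}
write_user(new_row, "Grace Hopper")     # written by new code in any region
```

Only once `backfill` reports zero changes and every region runs code that no longer reads `name` is it safe to drop the old column in the contract phase.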
Going Deeper: Walk me through a split-brain scenario and how you would resolve it.
Split-brain is the nightmare scenario in multi-region: both regions lose connectivity to each other but remain healthy and continue accepting writes independently. When connectivity is restored, you have two divergent datasets that need to be reconciled.
How it happens: The inter-region replication link goes down (backbone network issue, BGP misconfiguration, cloud provider peering problem). Health checks between regions time out. Each region’s automated systems conclude the other region is dead and promote themselves as the sole primary. Both regions keep accepting writes.
Why it is dangerous: Suppose the split lasts 15 minutes. In that window, user A updates their shipping address in us-east-1 while user A’s order is being processed in eu-west-1 with the old address. An inventory counter is decremented in both regions for the same item, so you now have negative inventory or have sold more units than you have. Financial transactions may have been processed against stale account balances.
Resolution approach:
Detect it early. Monitor the replication lag metric between regions. If lag exceeds a threshold (or replication stops entirely), alert immediately. Do not wait for the regions to reconnect and discover the divergence.
Choose a source of truth. If you have a designated primary region, that region’s data wins. If you are truly active-active with no primary, you need application-level merge logic. This is where your conflict resolution strategy from the design phase pays off — or where you pay the price for not having one.
Audit and reconcile. After connectivity is restored, run a reconciliation job that compares records modified in both regions during the split. For each conflict, apply the resolution policy: LWW for low-stakes data, manual review for financial data, and CRDTs for data types that support automatic merge.
Communicate to affected users. If orders were placed during the split with potentially stale data, proactively notify those customers rather than waiting for complaints.
The prevention: many teams implement quorum-based write acceptance, where a write is only acknowledged if it has been confirmed by a majority of regions. This means the minority side of a partition cannot accept writes, because it cannot reach quorum. Note that this needs at least three voting locations: with only two regions, neither side holds a majority during a split, so a third region or a lightweight witness is required. The trade-off is higher write latency during normal operation. This is fundamentally the same trade-off that consensus protocols like Raft make: consistency guarantees at the cost of availability in the minority partition. The CAP theorem is not just theory; this is exactly the scenario it describes.
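The quorum rule reduces to a small piece of arithmetic, sketched below. The function names are hypothetical, and `reachable_regions` is assumed to count the local region plus every peer that acknowledged the write.

```python
def quorum_size(total_regions: int) -> int:
    """Smallest strict majority of the voting regions."""
    return total_regions // 2 + 1


def can_accept_write(reachable_regions: int, total_regions: int) -> bool:
    """A write is acknowledged only if a majority of regions confirmed it."""
    return reachable_regions >= quorum_size(total_regions)


# Three regions split 2 | 1: only the majority side keeps accepting writes.
print(can_accept_write(2, 3))  # True
print(can_accept_write(1, 3))  # False
# Two regions split 1 | 1: neither side has a majority.
print(can_accept_write(1, 2))  # False
```

Because at most one side of any partition can hold a majority, two sides can never both accept writes, which is what rules out split-brain.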
6. What is the difference between a liveness probe and a readiness probe in Kubernetes, and what is the most common way teams misconfigure them?
What the interviewer is really testing
Whether you have operated services on Kubernetes in production, not just read the docs. The misconfiguration question is the real test — it separates textbook knowledge from operational experience.
Strong answer
Liveness probe answers: “Is this process stuck?” If the liveness probe fails, Kubernetes kills the container and restarts it. The assumption is the process is in a state it cannot recover from: a deadlock, an infinite loop, a corrupted JVM heap.
Readiness probe answers: “Can this instance serve traffic right now?” If the readiness probe fails, Kubernetes removes the pod from the Service’s endpoint list, so it stops receiving traffic, but it is not killed. The assumption is the pod is temporarily unable to serve (warming up, overloaded, waiting for a dependency) but will recover on its own.
The most common misconfiguration, and I have seen this cause outages multiple times, is putting dependency health checks in the liveness probe.
Here is the scenario: your liveness probe checks the database connection. The database goes down for 2 minutes. All your pods’ liveness probes fail. Kubernetes restarts all of their containers at roughly the same time. The restarted containers come up and… the database is still down. Their liveness probes fail again. Kubernetes restarts them again. You are now in a crash loop with zero application capacity, even though your application was perfectly capable of serving cached responses or returning graceful 503 errors.
The correct approach: liveness probes should be trivial. Return 200 if the process is running. Maybe check that your main thread is not deadlocked. That is it. No database checks, no downstream service checks, no external calls.
Readiness probes should check dependencies. Database connection is active, cache is reachable, warmup is complete. When the database goes down, readiness fails and the pod is removed from the load balancer, so it stops getting traffic. But it is not killed. When the database recovers, readiness passes again, and traffic resumes. No restarts, no crash loops, no lost capacity.
Another common misconfiguration: setting the liveness probe’s initialDelaySeconds too low for JVM applications. A Spring Boot app might take 45 seconds to start. If your liveness probe starts checking at 10 seconds and exhausts its failure threshold before startup finishes, the container will be killed during startup, restarted, and killed again: an infinite restart loop. This is what startup probes solve: they give the application a generous window to initialize before liveness probes start evaluating.
Follow-up: Your readiness probe checks three downstream services. One of them is a non-critical recommendation engine that goes down frequently. What happens and how do you fix it?
What happens is that every time the recommendation engine has a blip, your readiness probe fails, and Kubernetes removes your pod from the load balancer. Because every pod runs the same check, your entire application goes offline (checkout, search, browsing, everything) because a non-critical recommendation engine is down. You have made your application’s availability a function of your least reliable dependency.
The fix is a tiered readiness approach:
Option 1: Only check critical dependencies in the readiness probe. The recommendation engine is not critical, so do not include it. Your readiness probe checks: database up, cache reachable, authentication service reachable. The recommendation engine is handled by an application-level circuit breaker with a fallback to “popular items.”
Option 2: Implement a composite health check that classifies dependencies by criticality. The probe reports “ready” if all critical dependencies are healthy, regardless of non-critical dependency status. Non-critical dependency failures are exposed as metrics and alerts, but they do not affect the pod’s readiness state.
The broader principle: readiness probes should reflect your service’s ability to serve its critical path, not its ability to serve every feature at full fidelity. A degraded response (showing generic recommendations instead of personalized ones) is better than no response at all.
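A composite check along the lines of Option 2 might look like the sketch below. The function name, the dependency names, and the shape of the return value are hypothetical; the point is only that non-critical failures are reported without flipping readiness.

```python
from typing import Callable

Check = Callable[[], bool]


def evaluate_readiness(
    critical: dict[str, Check], non_critical: dict[str, Check]
) -> tuple[int, list[str]]:
    """Return (HTTP status, degraded features).

    Only critical failures make the pod unready; non-critical failures are
    surfaced so they can be exported as metrics and alerts.
    """
    failed_critical = [name for name, check in critical.items() if not check()]
    degraded = [name for name, check in non_critical.items() if not check()]
    status = 200 if not failed_critical else 503
    return status, degraded


status, degraded = evaluate_readiness(
    critical={"database": lambda: True, "cache": lambda: True, "auth": lambda: True},
    non_critical={"recommendations": lambda: False},
)
print(status, degraded)  # 200 ['recommendations']
```

The pod keeps serving checkout and search, and the `degraded` list is what feeds the alert for the recommendation engine.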
Follow-up: How would you design health checks for a service that takes 5 minutes to warm up (loading ML models into memory)?
This is a real problem for services that load large models, build in-memory indexes, or have substantial JVM warmup. The standard approach has three phases:
Phase 1: Startup probe. Kubernetes has a dedicated startup probe for this. Configure it so that failureThreshold * periodSeconds exceeds your maximum startup time. For a 5-minute warmup: periodSeconds: 10, failureThreshold: 40 gives you 400 seconds (about 6.7 minutes). The startup probe checks a flag that the application sets after model loading is complete. Until the startup probe succeeds, liveness and readiness probes are disabled, so Kubernetes will not kill the pod for being “slow to start.”
Phase 2: Liveness takes over. Once the startup probe succeeds, the liveness probe activates with its normal schedule. It checks that the process is alive and the model is still loaded in memory.
Phase 3: Readiness with warmup validation. The readiness probe does not just check that the model is loaded; it runs a lightweight inference on a canary input and verifies the result is within expected bounds. This catches subtle issues like a corrupted model file that loads without errors but produces garbage outputs.
The deployment strategy also matters. For a service with a 5-minute warmup, you absolutely need rolling deployment with maxUnavailable: 0 and maxSurge: 1. You spin up the new pod, wait for it to pass startup and readiness, and only then terminate the old pod. If you kill old pods before new ones are warm, you have 5 minutes of reduced or zero capacity.
One additional trick: pre-pulling the model to a shared volume or node-local cache so that the warmup is loading from local disk rather than downloading over the network. When the download dominates startup, this can cut warmup from minutes to well under a minute for large ML models.
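The startup-probe budget is simple arithmetic, and it is worth computing rather than eyeballing. The helpers below are hypothetical; the 20% headroom is an assumption for the example, not a Kubernetes default.

```python
import math


def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Time the startup probe may keep failing before the container is restarted."""
    return period_seconds * failure_threshold


def min_failure_threshold(
    max_startup_seconds: int, period_seconds: int, headroom_percent: int = 120
) -> int:
    """Smallest failureThreshold that covers the startup time plus headroom."""
    return math.ceil(max_startup_seconds * headroom_percent / (100 * period_seconds))


print(startup_budget_seconds(10, 40))   # 400 seconds for periodSeconds=10, failureThreshold=40
print(min_failure_threshold(300, 10))   # 36 for a 5-minute warmup with 20% headroom
```

A threshold of 40 therefore leaves a little more slack than the 20% headroom would require for a 300-second warmup.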
7. “Duplication is far cheaper than the wrong abstraction.” Explain what Sandi Metz means and give me a real example from your experience.
What the interviewer is really testing
Whether you understand DRY at a nuanced level — not as a rule to blindly follow but as a trade-off to navigate. They want to hear about premature abstraction, the wrong abstraction, and the courage to duplicate when it is the right call.
Strong answer
The core insight is that when you see duplicate code and extract a shared abstraction, you are making a bet: “these two things will always change together in the same way.” If that bet is wrong, and the two use cases evolve in different directions, the shared abstraction becomes a liability. Every change to one use case risks breaking the other. You end up with a function full of if context == X then... elif context == Y then... branches, and it becomes harder to change than the original duplication.
A real example: I worked on a system where two teams had nearly identical input validation logic, one for user registration and one for merchant onboarding. An engineer saw the duplication and extracted a shared validate_entity function. For six months, it was elegant. Then the compliance team required KYC checks for merchants but not for users. The merchant team needed field-level validation that the user team did not. The shared function grew branches for each context. Every change to merchant validation required running the user registration test suite (because the function was shared), and vice versa. Eventually, a merchant validation change broke user registration in production because a test case was missed.
We eventually split the function back into two separate validators. The “duplication” was cheaper than the coupling the shared abstraction created.
The pattern I follow now: tolerate duplication until I see the same pattern three times. At two occurrences, the similarity might be coincidental. At three, the correct abstraction usually becomes clear because you can see which parts truly vary and which are genuinely shared. This is the “Rule of Three,” and it has saved me from premature abstraction more times than I can count.
The deeper lesson: DRY is about eliminating duplicate knowledge, not duplicate code. Two functions that happen to have identical code but represent different business concepts (different reasons to change, different stakeholders, different evolution paths) should remain separate. Coupling them creates a false dependency that will cause pain later.
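The validate_entity story can be compressed into a few lines. The field names (`email`, `kyc_document_id`) and the `kind` flag are invented for illustration; only the function name comes from the example above.

```python
def validate_entity(data: dict, kind: str) -> list[str]:
    """The shared abstraction after the compliance change: one branch per context."""
    errors = []
    if not data.get("email"):
        errors.append("email required")
    if kind == "merchant" and not data.get("kyc_document_id"):
        errors.append("KYC document required")
    return errors


# After splitting: each validator changes for its own reasons only.
def validate_user(data: dict) -> list[str]:
    return [] if data.get("email") else ["email required"]


def validate_merchant(data: dict) -> list[str]:
    errors = [] if data.get("email") else ["email required"]
    if not data.get("kyc_document_id"):
        errors.append("KYC document required")
    return errors
```

The email check is now duplicated, but a new merchant rule can no longer break user registration, and each function needs only its own test suite.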
Follow-up: How do you identify the 'wrong abstraction' in an existing codebase?
There are a few reliable signals:
Signal 1: The function has a type or context parameter that changes its behavior. If you see a function with a parameter like mode, type, variant, or context and the function body is a giant switch or if-else chain based on that parameter, you are looking at a premature abstraction. Each branch is a different concept wearing a shared function’s trenchcoat.
Signal 2: Changes to one consumer require understanding all other consumers. When a developer says “I need to add this feature but I am afraid to change the shared module because I do not know who else uses it and how,” that is the cost of the wrong abstraction. The fear of change is the interest payment on the abstraction debt.
Signal 3: Shotgun surgery inverted. Instead of changing one concept requiring edits in many places (shotgun surgery), you have the opposite: changing one file affecting many unrelated concepts. A single shared utility module becomes a coupling magnet.
Signal 4: The abstraction has more parameters than the original duplicated functions had. If the “clean” abstraction takes 8 parameters to handle all the variations it consolidated, it is not an abstraction; it is a configuration-driven branch router.
The fix is not always “split it back.” Sometimes the abstraction is correct but the interface needs to be redesigned; maybe it should accept a strategy object instead of a type flag. But often, especially when the original duplication was between different business domains, splitting is the cleanest resolution.
Going Deeper: How does the wrong abstraction interact with microservice boundaries? Can DRY violations exist across services?
This is where things get interesting. Within a monolith, the wrong abstraction creates coupling between modules. Across microservices, the temptation to eliminate “duplication” creates coupling between services, which is far worse.
The most common anti-pattern is the shared library that multiple services depend on. Team A and Team B both have code to validate and parse order data. Someone extracts this into a shared order-utils library. Now both services depend on the same library version. When Team A needs a change to the library, Team B must also upgrade, test, and redeploy. You have created a distributed monolith: the services are “micro” in name but coupled through their shared dependency.
The principle in microservices is: duplicate across service boundaries freely. Each service should own its own models, its own validation logic, its own data transformation code. The duplication is intentional; it is the price of independence. The only things that should be shared across services are stable, rarely-changing contracts (API schemas, event schemas) and true infrastructure utilities (logging format, tracing propagation).
There is a caveat: this applies to business logic, not infrastructure concerns. Sharing a logging library or a circuit breaker library across services is fine because those change for infrastructure reasons, not business reasons. But sharing a Customer model class across three services means those three services cannot evolve their understanding of a customer independently, and in my experience they always need to eventually.
8. You have a dead letter queue with 50,000 messages that accumulated over a weekend. Walk me through how you handle it.
What the interviewer is really testing
Operational maturity. Can you triage systematically, or do you panic and replay everything blindly? They also want to see if you understand the DLQ as a reliability mechanism, not just a dumping ground.
Strong answer
50,000 messages over a weekend means something is systematically wrong, not just a transient blip. Here is how I would approach it:
Step 1: Do NOT replay the messages yet. The most common mistake is to immediately replay everything. If the root cause is still present (a bug in the consumer, a downstream service still down, a schema mismatch), you will just generate 50,000 more failures and potentially DLQ them again, creating an infinite loop.
Step 2: Sample and categorize. Pull 50 to 100 messages from the DLQ and examine them. Group them by failure reason. In my experience, you will usually find 2 to 3 distinct categories:
Category A: a specific error type (maybe a null pointer on a field that used to always be present)
Category B: a timeout error (downstream dependency was unavailable)
Category C: a schema validation failure (a producer started sending a new field your consumer does not expect)
Each category gets a different treatment.
Step 3: Fix the root causes. For Category A (code bug), fix the consumer code, deploy, and validate with a single message from the DLQ before replaying the batch. For Category B (transient dependency failure), verify the dependency is healthy now; then those messages can be replayed safely. For Category C (schema mismatch), this requires coordination with the producing team. You might need to update your consumer’s schema, or the producer might need to fix a regression.
Step 4: Replay in controlled batches. Do not replay all 50,000 at once. Start with 100, monitor error rates and processing latency, then increase to 1,000, then 10,000. Each batch should be followed by a check: are new messages appearing in the DLQ? If yes, stop, because you have not fully fixed the issue.
Step 5: Handle the unreplayable. Some messages may be genuinely unprocessable: corrupted data, referring to entities that have been deleted, or representing actions that are no longer valid (a discount code that expired during the outage). These need business decisions, not engineering ones. Move them to a permanent failure store, alert the product owner, and let them decide whether those need manual remediation or can be written off.
Step 6: Fix the monitoring gap. 50,000 messages over a weekend means nobody was alerted when the DLQ started growing. Add an alert on DLQ depth that reaches the on-call engineer at any hour, weekends included: “DLQ depth > 100” should trigger an investigation, not “DLQ depth > 50,000 discovered Monday morning.”
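Step 4 can be sketched as a small driver loop. This is an illustration under simplifying assumptions: messages are held in a list, `process` is a hypothetical callback that returns True on success, and "new messages appearing in the DLQ" is approximated by counting failures within the batch.

```python
from typing import Callable, Sequence


def replay_in_batches(
    messages: Sequence,
    process: Callable[[object], bool],
    batch_sizes: Sequence[int] = (100, 1_000, 10_000),
    max_failures_per_batch: int = 0,
) -> tuple[int, bool]:
    """Replay with growing batches; stop as soon as a batch produces failures.

    Returns (messages attempted, whether the replay ran to completion).
    """
    position = 0
    step = 0
    while position < len(messages):
        # Grow the batch size, then stay at the largest configured size.
        size = batch_sizes[min(step, len(batch_sizes) - 1)]
        batch = messages[position:position + size]
        failures = sum(1 for message in batch if not process(message))
        position += len(batch)
        if failures > max_failures_per_batch:
            return position, False  # root cause not fully fixed: stop and investigate
        step += 1
    return position, True


dlq = list(range(50_000))
attempted, completed = replay_in_batches(dlq, process=lambda m: True)
print(attempted, completed)  # 50000 True
```

In a real system the pause between batches is where you check dashboards; the loop only encodes the rule that a failing batch halts the ramp-up.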
Follow-up: How do you ensure replayed messages do not cause duplicate side effects?
This is the idempotency question, and it is critical. If the original processing attempt partially succeeded before failing (say, it charged the customer’s credit card but failed before writing the order record), replaying the message would charge them again.
Pattern 1: Idempotency keys. Each message has a unique identifier (message ID, correlation ID, or a business-level idempotency key). Before processing, the consumer checks an idempotency store: “have I successfully processed this ID before?” If yes, skip it. If no, process it and record the ID on successful completion. The store can be a database table, a Redis set with TTL, or a DynamoDB table.
Pattern 2: Transactional idempotency record (the consumer-side “inbox” counterpart of the transactional outbox). The business effect and the “I processed this” record are written in the same database transaction. If the transaction commits, both are persisted atomically. If it rolls back, neither is persisted, and replay is safe. This only covers effects inside that database; an external call such as a card charge still needs its own idempotency key with the provider.
Pattern 3: Check before acting. For operations that modify state, check the current state before applying the change. If the message says “set order status to SHIPPED” and the order is already SHIPPED, skip it. This is a weaker form of idempotency: it works for idempotent state transitions but not for additive operations like “add $10 to balance.”
The gotcha with idempotency stores is TTL management. If your idempotency records expire after 7 days but a message sits in the DLQ for 14 days, the idempotency protection is gone when you replay. For DLQ scenarios specifically, either use a longer TTL than your maximum DLQ retention period, or accept that very old DLQ messages need manual review rather than automated replay.
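Writing the idempotency record and the business effect in one transaction can be sketched with the standard-library sqlite3 module. The table names, column names, and `handle_credit` function are hypothetical; the additive "credit the balance" operation is chosen because it is exactly the case that check-before-acting cannot protect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('acct-1', 0)")
conn.commit()


def handle_credit(conn: sqlite3.Connection, message_id: str, account_id: str, amount: int) -> bool:
    """Apply the credit at most once. Returns False if the message is a duplicate."""
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key violation: this message_id was already processed.
        return False


first = handle_credit(conn, "msg-1", "acct-1", 10)
replayed = handle_credit(conn, "msg-1", "acct-1", 10)
balance = conn.execute("SELECT balance FROM accounts WHERE account_id = 'acct-1'").fetchone()[0]
print(first, replayed, balance)  # True False 10
```

Because the marker row and the balance update commit together, a crash between them cannot leave the message half-applied, and a replay from the DLQ is a no-op.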
Follow-up: How do you distinguish between a 'poison message' that will never succeed and a message that failed due to a transient issue?
This is a classification problem, and in practice you need both automated heuristics and human judgment.
Automated classification:
Retry count header. Each time a message is retried, increment a header. If the message has been attempted 5 times with 5 different error messages, it is likely a code bug or data issue, not a transient failure. If it failed 5 times with the same “connection timeout” error, it is likely transient.
Error type analysis. Deserialize the error from the DLQ metadata. A NullPointerException or SchemaValidationError is deterministic — it will fail every time regardless of how many times you retry. A ConnectionTimeoutError or ServiceUnavailableError is transient — the next attempt might succeed.
Message age vs. error pattern. If all messages from a specific time window have the same error, it was likely a transient outage during that window. If messages from different time windows have the same error, the root cause is persistent.
The poison message pattern: A message that always fails processing. Maybe it references a deleted user, has an invalid field combination, or triggers an unhandled edge case in your consumer. Without DLQ handling, this message blocks the queue because it is retried infinitely and never succeeds. The DLQ isolates it.
For poisoned messages, the resolution is either: fix the consumer to handle the edge case (then replay), transform the message to make it valid (then replay to a repair queue), or discard it with an audit trail and notify the business.
The key discipline: every message in the DLQ must be accounted for. You never just “empty the DLQ.” Each message is either replayed, remediated, or explicitly written off with a documented reason.
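The error-type heuristic described above can be sketched as a small classifier. The error names are the ones used in the text; the three labels and the assumption that each attempt's error type is available as a string are illustrative.

```python
DETERMINISTIC_ERRORS = {"NullPointerException", "SchemaValidationError"}
TRANSIENT_ERRORS = {"ConnectionTimeoutError", "ServiceUnavailableError"}


def classify(error_types: list[str]) -> str:
    """Classify a DLQ message from the error type recorded on each failed attempt."""
    if not error_types:
        return "needs_review"
    if any(error in DETERMINISTIC_ERRORS for error in error_types):
        return "poison"        # will fail on every retry until code or data changes
    if all(error in TRANSIENT_ERRORS for error in error_types):
        return "transient"     # safe to replay once the dependency is healthy
    return "needs_review"      # unknown errors go to a human


print(classify(["ConnectionTimeoutError"] * 5))                          # transient
print(classify(["ConnectionTimeoutError", "SchemaValidationError"]))     # poison
print(classify(["SomeUnknownError"]))                                    # needs_review
```

The "needs_review" bucket is deliberate: the heuristic narrows the pile for humans rather than replacing their judgment.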
9. The Dependency Inversion Principle says “depend on abstractions, not concretions.” Give me a scenario where following this principle would actually make things worse.
What the interviewer is really testing
Critical thinking about SOLID principles. Can you reason about when a principle does not apply, or do you follow rules blindly? This is a senior to staff-level question about engineering judgment.
Strong answer
DIP is a powerful principle, but like all principles, it has a cost, and that cost is not always worth paying.
Scenario 1: You will never swap the implementation. If your service uses PostgreSQL and there is zero business or technical reason to ever switch to MySQL, MongoDB, or anything else, wrapping every database call behind a Repository interface adds a layer of abstraction that nobody will ever use. You write the interface, the concrete implementation, the wiring code, and the test mock, quadrupling the surface area. Every new developer has to navigate through the abstraction layer to understand what actually happens. The indirection makes debugging harder: stack traces are longer, and you cannot jump from the call site to the actual SQL.
In my experience, the “we might switch databases someday” argument is rarely true and almost never justifies the upfront cost. If you do need to switch databases, the repository interface is the least of your problems; you also need to deal with query syntax, transaction semantics, consistency guarantees, and data migration.
Scenario 2: Prototyping and throwaway code. When you are building a proof of concept to validate a business idea, the priority is speed to learning, not architectural purity. Abstractions slow you down and add code that obscures the prototype’s core logic. If the prototype works, you will rewrite it with proper architecture. If it does not, you saved the time of building abstractions that were never needed.
Scenario 3: The abstraction is leaky by nature. Some dependencies have such rich, specific APIs that any useful abstraction is either so thin it adds no value or so thick it reimplements the dependency. Wrapping Elasticsearch behind a generic SearchService interface means either exposing Elasticsearch-specific features (making the abstraction leaky) or limiting yourself to basic search capabilities (making the abstraction useless). You end up fighting the abstraction instead of benefiting from it.
The judgment call: Apply DIP when the abstraction boundary corresponds to a real point of variation: you have multiple implementations, or you need testability for a slow external dependency, or the dependency’s interface is unstable. Skip it when the coupling is stable, the implementation is singular, and the indirection cost exceeds the flexibility benefit.
Follow-up: How do you decide which abstractions are worth the cost in a new project?
I use a simple heuristic: abstract at the boundaries, not in the core.
Boundaries are where your system meets the outside world: databases, third-party APIs, message queues, file systems, external services. These are the points where implementations might change, where you need test doubles, and where failure modes are complex. Putting an abstraction at the database boundary (a repository) is valuable because it lets you test business logic without a database and it isolates you from database-specific concerns.
The core, your business logic, should be concrete and direct. Abstracting the core logic itself (interfaces for everything, factories for factories) adds indirection without flexibility. Business logic does not get swapped out; it gets modified. You want it to be readable and direct, not wrapped in three layers of indirection.
The practical test: for each proposed abstraction, ask “what is the second implementation?” If you can name a concrete, plausible second implementation (StripeGateway and AdyenGateway, PostgresRepository and InMemoryRepository for tests), the abstraction is justified. If the answer is “well, maybe someday…”, skip the abstraction. You can always add it later when the need is real. Extracting an interface from a concrete class is a low-risk refactoring; removing a premature abstraction is much harder because code has grown around it.
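"Abstract at the boundaries, not in the core" can be illustrated with a single boundary interface and a test double as its second implementation. `PaymentGateway` follows the naming used in this chapter; `InMemoryGateway`, `place_order`, and the amounts are hypothetical, and no real payment SDK is called.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """Boundary abstraction: the one place where a second implementation exists."""

    def charge(self, amount_cents: int) -> str: ...


class InMemoryGateway:
    """Test double; a StripeGateway or AdyenGateway would satisfy the same protocol."""

    def __init__(self) -> None:
        self.charges: list[int] = []

    def charge(self, amount_cents: int) -> str:
        self.charges.append(amount_cents)
        return f"test-charge-{len(self.charges)}"


def place_order(total_cents: int, gateway: PaymentGateway) -> dict:
    """Core business logic stays concrete and direct; only the boundary is injected."""
    if total_cents <= 0:
        raise ValueError("order total must be positive")
    return {"total_cents": total_cents, "payment_ref": gateway.charge(total_cents)}


gateway = InMemoryGateway()
order = place_order(2_500, gateway)
print(order)  # {'total_cents': 2500, 'payment_ref': 'test-charge-1'}
```

The order logic itself has no interface or factory around it; the only seam is the one where you can actually name the second implementation.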
Going Deeper: How does DIP interact differently in a microservices architecture versus a monolith?
This is a great question because the unit of abstraction changes at the architectural level.
In a monolith, DIP operates at the class and module level. You inject a PaymentGateway interface into OrderService and provide a StripeGateway implementation. The abstraction boundary is an in-process function call.
In microservices, the service boundary itself IS the abstraction. When OrderService calls PaymentService over HTTP or gRPC, the network protocol is the interface. You do not need a PaymentGateway interface inside OrderService, because the HTTP client calling POST /payments is already loosely coupled. Changing the payment provider means changing the implementation of PaymentService, not changing OrderService at all.
Where DIP still matters in microservices is at the infrastructure boundary within each service. Each service should abstract its database, its cache, its message queue, because each service might be tested independently, deployed independently, and might evolve its technology stack independently.
The anti-pattern I have seen is applying monolith-style DIP across service boundaries: creating shared interface libraries that multiple services import so they can “depend on the same abstraction.” This couples services at the library level and defeats the entire point of microservices. The shared contract between services should be the API schema (OpenAPI, protobuf), not a shared code dependency.
So the answer shifts: in a monolith, DIP is applied within the process. In microservices, DIP is applied within each service, and the service boundary replaces the abstraction boundary between services.
10. Describe a chaos engineering experiment you would run against a production e-commerce system. Walk me through the full lifecycle.
What the interviewer is really testing
Whether you understand chaos engineering as a disciplined practice with hypotheses and controls, or whether you think it is just “break stuff and see what happens.” They also want production judgment — can you design a safe experiment?
Strong answer
Let me walk through a specific experiment: testing what happens when the product catalog database replica goes offline during peak traffic.
Phase 1: Define steady state. Before touching anything, I establish what “normal” looks like. For our e-commerce system during peak hours: p99 latency < 300ms, error rate < 0.1%, orders per minute > 200, product page load time < 1.5 seconds. These are my steady-state indicators.
Phase 2: Form a hypothesis. “If one of three database read replicas becomes unavailable, the load balancer will redistribute read traffic to the remaining two replicas. p99 latency will increase by no more than 30% (to ~400ms) and error rate will remain below 0.5%. The product catalog will remain fully functional.”
Phase 3: Define safety controls. This is what separates chaos engineering from just causing outages:
Blast radius limit: Only affect one replica out of three. The system is designed to tolerate this — we are testing that the design works, not whether we can destroy the system.
Automatic abort conditions: If error rate exceeds 1% or p99 latency exceeds 800ms, immediately restore the replica. Set this up as an automated trigger, not something a human has to remember.
Time window: Run during business hours when the full engineering team is available, not at 3 AM. If something goes wrong, we want all hands on deck.
Rollback plan: The “injection” is network-level isolation of the replica (using iptables or a tool like Chaos Mesh). Rollback is removing the network rule — instant recovery.
Phase 4: Run the experiment. Isolate replica-3 from the application tier. Monitor all steady-state indicators in real time. Record everything: latency percentiles per second, connection pool metrics on remaining replicas, query queue depth, cache hit rates.
Phase 5: Observe and analyze. Maybe the hypothesis holds: latency increases 20%, error rate stays flat, and the remaining replicas handle the redistributed load. Or maybe we discover something: the connection pool on replica-1 maxes out because it was sized for one-third of read traffic, not one-half. Queries start queuing. Latency spikes to 2 seconds. The circuit breaker on the product catalog service trips and starts serving stale cache.
Phase 6: Act on findings. If the experiment revealed a weakness (undersized connection pool), fix it. Resize the connection pool to handle N-1 replica scenarios. Then re-run the experiment to validate the fix. Document the finding and the fix in the runbook for database failover scenarios.
Phase 7: Increase the blast radius. Next month, test with two replicas down. Then test with the primary failing and automated failover promoting a replica. Each experiment builds confidence incrementally.
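The automated abort trigger from Phase 3 can be sketched as a small guard evaluated on every metrics sample. This is an illustration only: `Snapshot`, `evaluate`, and `run_experiment` are hypothetical names, the thresholds are the ones quoted in the hypothesis and abort conditions above, and `rollback` stands in for whatever removes the network rule.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float       # fraction of failed requests, e.g. 0.004 = 0.4%
    p99_latency_ms: float


HYPOTHESIS_MAX_ERROR_RATE = 0.005   # hypothesis: error rate stays below 0.5%
HYPOTHESIS_MAX_P99_MS = 400.0       # hypothesis: p99 stays near 400 ms
ABORT_ERROR_RATE = 0.01             # abort: error rate above 1%
ABORT_P99_MS = 800.0                # abort: p99 above 800 ms


def evaluate(s: Snapshot) -> str:
    """Classify one sample as 'abort', 'violated' (hypothesis broken), or 'ok'."""
    if s.error_rate > ABORT_ERROR_RATE or s.p99_latency_ms > ABORT_P99_MS:
        return "abort"
    if s.error_rate > HYPOTHESIS_MAX_ERROR_RATE or s.p99_latency_ms > HYPOTHESIS_MAX_P99_MS:
        return "violated"
    return "ok"


def run_experiment(samples: Iterable[Snapshot], rollback: Callable[[], None]) -> list[str]:
    """Evaluate samples in order; call rollback() and stop at the first abort."""
    verdicts: list[str] = []
    for sample in samples:
        verdict = evaluate(sample)
        verdicts.append(verdict)
        if verdict == "abort":
            rollback()
            break
    return verdicts


rollbacks: list[str] = []
observed = [Snapshot(0.001, 310), Snapshot(0.004, 450), Snapshot(0.012, 900), Snapshot(0.001, 300)]
print(run_experiment(observed, lambda: rollbacks.append("replica restored")))
# ['ok', 'violated', 'abort']
```

Separating "hypothesis violated" from "abort" keeps the experiment running long enough to learn something while still bounding the blast radius.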
Follow-up: How do you get organizational buy-in for running chaos experiments in production?
This is honestly the hardest part of chaos engineering: not the technical execution, but the political and cultural work.
Start with a scary incident that already happened. If your team had an outage caused by a failure mode that chaos engineering would have caught, that is your strongest argument. “If we had tested failover last month, we would have caught the undersized connection pool before it caused a two-hour outage affecting 50,000 users.”
Start with staging, not production. Run experiments in staging first and present the findings: “We discovered three weaknesses in staging. If any of these had manifested in production, the impact would have been X. We want to validate that production does not have the same weaknesses, and here is how we would do it safely.” This builds trust.
Frame it as risk reduction, not risk introduction. The language matters. Do not say “we want to break things in production.” Say “we want to verify our failover works before a real incident forces us to find out.” Executives understand risk mitigation. They do not understand why engineers want to cause outages.
Start extremely small. First experiment in production: kill one non-critical pod and verify the deployment replaces it in under 30 seconds. This is almost zero risk and demonstrates the methodology. Then gradually increase scope and blast radius as trust builds.
Make results visible. After each experiment, share a brief report: hypothesis, result, findings, fixes applied. Over time, this creates a record of prevented incidents that justifies continuing investment. The most effective argument is: “In the last quarter, chaos experiments identified four weaknesses that we fixed before they caused production incidents.”
Going Deeper: What is the difference between chaos engineering and traditional failure testing? When is each appropriate?
The distinction is subtle but important.
Traditional failure testing (integration tests with injected failures, disaster recovery drills) verifies known failure modes: “if the database goes down, does the failover work?” You are testing a specific scenario with an expected outcome. It is confirmatory: you know what should happen and you are verifying it does.
Chaos engineering is exploratory. You are not testing a specific failure mode; you are probing the system to discover failure modes you did not anticipate. The hypothesis is not “failover works” but “the system maintains steady state under this perturbation.” The valuable outcome is when the hypothesis is disproven: when you discover an unexpected interaction, a cascading failure, a configuration gap that nobody knew about.
Think of it this way: traditional failure testing is like running a known test case. Chaos engineering is like fuzzing: you are looking for the unknown unknowns.
When to use each:
Traditional failure testing: for every deployment, in CI/CD, for known recovery procedures. It should be automated and run continuously. “Does our database failover actually work?” should be a regularly exercised runbook, not a hope.
Chaos engineering: periodically (monthly or quarterly), for systems with complex interactions, when you suspect there are failure modes your tests do not cover. It requires more human judgment: forming hypotheses, interpreting results, deciding what to explore next.
The maturity progression for most organizations is: first, get basic monitoring and alerting in place. Then, establish failure testing for known scenarios (DR drills). Then, graduate to chaos engineering for discovering unknown weaknesses. Jumping straight to chaos engineering without the foundation of monitoring and basic failure testing is like running before you can walk: you will cause outages without the ability to detect or recover from them.
11. You are designing the error budget policy for an organization with 30 microservices. How do you handle the fact that services have dependencies on each other?
What the interviewer is really testing
Systems thinking at the organizational level. Can you reason about error budgets as a governance mechanism across teams, not just within a single service? This is a staff-level question about reliability architecture.
Strong answer
The fundamental challenge is that in a microservice architecture, no service’s reliability is independent. If Service A depends on Service B, and Service B burns its error budget, Service A’s error budget burns too, even if Service A’s code is perfect.
The dependency chain problem. If Service A calls Service B calls Service C, and each has a 99.9% SLO, the end-to-end availability is roughly 99.9% x 99.9% x 99.9% = 99.7%. The user-facing service (A) is meeting its SLO, but the user experience reflects the compounded failure rate. This means downstream services must have stricter SLOs than the services that depend on them.
Here is how I would structure the policy:
Tier the services by criticality. Not all 30 services deserve the same SLO or the same error budget governance. Create tiers:
Attribution. When a Tier 1 service’s error budget burns because a Tier 2 dependency went down, who is responsible? You need a mechanism for attributing budget burn to the root cause. Use distributed tracing and dependency mapping to trace errors back to their source. If the checkout service had 10 minutes of errors caused by the payment service being down, those 10 minutes are charged against the payment service’s budget, not the checkout service’s. This prevents the checkout team from being penalized for someone else’s outage and ensures the right team feels the pressure to improve.
Dependency SLO contracts. Each service publishes an internal SLO to its consumers. If the search service promises 99.9% availability to the product listing service, and it drops below that, the search team’s error budget policy activates. The product listing service should have resilience patterns (circuit breaker, fallback) that let it function even when the search service is degraded, but the search team still owns the reliability of their service.
Aggregate monitoring. Build a dashboard that shows error budget status across all 30 services simultaneously, with dependency arrows. When the payment service’s budget drops to 20%, the checkout team should see that too: it is an early warning that their critical path is at risk.
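The dependency-chain arithmetic from this answer can be checked with a few lines. This is a sketch under simplifying assumptions that the text does not state explicitly: every request touches every service in the chain, and failures are independent. The function names are illustrative.

```python
import math


def serial_availability(slos: list[float]) -> float:
    """End-to-end availability when every request must pass through every service."""
    return math.prod(slos)


def required_per_service_slo(target: float, depth: int) -> float:
    """Per-service SLO needed for a chain of `depth` equal services to meet `target`."""
    return target ** (1.0 / depth)


chain = [0.999, 0.999, 0.999]
print(f"end-to-end: {serial_availability(chain):.4%}")  # end-to-end: 99.7003%

# To deliver 99.9% end to end across three hops, each hop needs roughly 99.967%.
print(f"per-service: {required_per_service_slo(0.999, 3):.3%}")
```

The second function is why dependencies deep in the call graph end up with stricter SLOs than the user-facing services that sit on top of them.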
Follow-up: How do you prevent gaming of the error budget system?
This comes up in every organization that implements error budgets, and it is as much a cultural problem as a technical one.
Common gaming patterns:
Setting lenient SLOs to avoid ever triggering the policy. If a team sets their SLO at 99% when their actual reliability is 99.8%, they will never burn their budget and never have to slow down on features. The fix: SLOs should be set based on user impact analysis and reviewed by a reliability council or architecture team, not unilaterally by the owning team.
Classifying incidents as “not counting.” After a bad deploy causes 20 minutes of downtime, the team argues it was a “one-time event” and should not count against the budget. The fix: the error budget is measured automatically from SLI metrics, not from manually classified incidents. If the SLI measured an outage, the budget is consumed. Period. No post-hoc reclassification.
Shifting blame to dependencies. “Our budget burned because the database team had an outage.” While attribution matters for accountability, the user-facing team’s budget still burned, because the user does not care whose fault it is. Both teams need to act: the database team fixes their issue, and the user-facing team adds a fallback so they are more resilient next time.
The cultural fix: frame error budgets as a shared resource that enables feature velocity, not as a punishment mechanism. Teams that manage their budget well get to ship faster. Teams that burn it frequently lose velocity. The incentive structure should reward reliability investment, not just feature output.
12. Walk me through how you would decide between investing in another nine of availability versus investing that same engineering effort in feature development.
What the interviewer is really testing
Whether you can think about reliability as an economic decision, not a technical one. This is the fundamental SRE insight, and the interviewer wants to see if you can make the business case in both directions — for more reliability AND for less.
Strong answer
The way I frame this is: every nine of availability has a cost, and every nine of unavailability has a cost. The right answer is where these two curves cross.
Quantify the cost of unreliability. Start with the business impact of downtime. For an e-commerce site doing $10M/month in revenue, an hour of downtime during business hours costs roughly $14,000 in lost revenue (assuming even distribution; it is worse during peak). At 99.9% availability (43 minutes/month downtime), you lose about $10,000/month. At 99.99%, you lose about $1,000/month. The delta, moving from three nines to four nines, saves about $9,000/month.
Quantify the cost of the reliability investment. Going from 99.9% to 99.99% typically requires: multi-AZ database deployment ($2,000-$5,000/month), automated failover testing infrastructure, a dedicated on-call rotation, canary deployment pipeline, and roughly 2-3 months of senior engineering time to implement and validate. The ongoing cost might be $10,000/month in infrastructure plus $5,000/month in engineering maintenance.
Compare the curves. In this example, the reliability investment costs more ($15,000/month) than the downtime it prevents ($9,000/month). The business should NOT pursue four nines: the money is better spent on features that grow the revenue denominator.
But here is where it gets nuanced. The cost of downtime is not just lost revenue. It includes:
Customer trust erosion (hard to quantify but real)
SLA penalty payouts to enterprise customers
Engineering productivity lost to firefighting
Reputation damage in competitive markets
And the cost calculation changes with scale. If that same business grows to $100M/month, the downtime cost becomes $100,000/month, and the four-nines investment is suddenly a no-brainer.
My recommendation framework: Present the cost-reliability curve to the product and business leadership. Show them three options (current state, one nine better, and two nines better) with costs and savings for each. Let the business decide. Engineering’s job is to make the trade-offs legible, not to make the decision unilaterally. Reliability is a business investment, not an engineering religion.
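The comparison above reduces to a short calculation. The sketch below uses the answer's own figures ($10M and $100M monthly revenue, $15,000/month ongoing investment) and its simplifying assumption that revenue is spread evenly over time; the function names are illustrative.

```python
def monthly_downtime_cost(monthly_revenue: float, availability: float) -> float:
    """Expected revenue lost per month, assuming revenue is spread evenly over time."""
    return monthly_revenue * (1.0 - availability)


def net_monthly_benefit(
    monthly_revenue: float,
    current: float,
    target: float,
    investment_per_month: float,
) -> float:
    """Downtime cost avoided by moving from `current` to `target`, minus the investment."""
    saved = monthly_downtime_cost(monthly_revenue, current) - monthly_downtime_cost(
        monthly_revenue, target
    )
    return saved - investment_per_month


for revenue in (10_000_000, 100_000_000):
    net = net_monthly_benefit(revenue, 0.999, 0.9999, 15_000)
    print(f"revenue ${revenue:,}/month -> net benefit ${net:,.0f}/month")
# revenue $10,000,000/month -> net benefit $-6,000/month
# revenue $100,000,000/month -> net benefit $75,000/month
```

The sign flip between the two rows is the whole argument: the same investment is a loss at one scale and an obvious win at ten times the revenue. The harder-to-quantify costs in the list above would be added to the avoided-cost side as estimated ranges.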
Follow-up: Your CTO says 'we need five nines.' How do you push back constructively?
I would not frame it as pushback. I would frame it as education and calibration.
Step 1: Make the math concrete. “Five nines means 26 seconds of downtime per month. That means no single deployment can take longer than 26 seconds to roll back. Every dependency, including DNS, certificate authorities, and cloud provider control planes, must be redundant. We need multi-region active-active with consensus-based replication. Our planned maintenance windows? Zero. Literally zero allowed downtime.”
Step 2: Show the cost. “Our current three-nines infrastructure costs $X/month. Moving to four nines would cost roughly 10X. Moving to five nines would cost roughly 100X, plus a dedicated reliability team of 3-5 engineers. Here is a rough estimate…” The sticker shock usually recalibrates expectations.
Step 3: Ask what problem they are actually solving. Often when a CTO says “five nines,” they mean “I do not want outages like the one we had last month.” The actual need might be solved by going from 99.9% to 99.95% (reducing 43 minutes of monthly downtime to 21 minutes), which is achievable with targeted investments like canary deploys and faster rollbacks. That is very different from five nines.
Step 4: Propose the incremental path. “Let us invest in moving from three nines to four nines for our Tier 1 services. That addresses 90% of the customer pain at 10% of the five-nines cost. In six months, we can reassess whether further investment is justified based on what the SLO data tells us.”
The key message: reliability beyond what users notice is a luxury, not a necessity. And the money spent on unnecessary reliability is money not spent on features that grow the business.
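The downtime figures quoted in Step 1 and Step 3 follow directly from the availability target. A small sketch, assuming a 30-day month (the function name is illustrative):

```python
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000


def allowed_downtime_seconds(availability: float) -> float:
    """Downtime budget per 30-day month for a given availability target."""
    return (1.0 - availability) * SECONDS_PER_30_DAY_MONTH


for slo in (0.999, 0.9995, 0.9999, 0.99999):
    seconds = allowed_downtime_seconds(slo)
    print(f"{slo:.3%}: {seconds / 60:.1f} min ({seconds:.0f} s)")
```

This prints roughly 43.2 minutes for three nines, 21.6 minutes for 99.95%, 4.3 minutes for four nines, and 26 seconds for five nines, which matches the numbers used in the conversation with the CTO.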
Going Deeper: How do you measure the cost of 'lost customer trust' that is hard to quantify?
This is the honest gap in the reliability-as-economics framework, and pretending you can precisely quantify trust erosion is dishonest. But you can approximate it.
Proxy metrics that correlate with trust:
Churn rate after incidents. Track customer retention in the 30 days following a major outage versus a normal 30-day period. If churn spikes 15% after an outage, and your average customer LTV is $1,200, you can estimate the trust cost per incident.
Support ticket volume. After an outage, support tickets increase. The cost of handling those tickets (engineering time triaging, support staff responding) is measurable.
NPS score movement. If your NPS drops 5 points after an outage and takes 2 months to recover, and you have data correlating NPS to revenue, you can estimate the impact.
Competitive loss. If you are in a market where reliability is a differentiator (cloud infrastructure, payment processing), every outage is a sales opportunity for your competitor. Talk to the sales team — they often have specific examples of deals lost due to reliability concerns.
The practical approach: Acknowledge that the trust cost is real but imprecise. Use a range: “We estimate the trust impact of a major outage is between $50K and $200K.” Then include that range in the cost-benefit analysis. Even with uncertainty, it shifts the economic calculation and helps justify reliability investments that pure revenue-loss calculations do not.
The teams that get this right treat reliability incidents the way finance treats bad debt: as a known cost category with imprecise but nonzero estimates, budgeted and managed rather than ignored or pretended away.
These questions are designed to expose the gap between engineers who have studied reliability and engineers who have been woken up by it at 3 AM. Each scenario involves a situation where the obvious answer is wrong, where real trade-offs have no clean resolution, or where multiple principles conflict with each other. These are the questions that separate senior from staff.
13. Your service implements retry with exponential backoff on all downstream calls. During an incident, you discover your retries are making the outage worse, not better. Explain how this happens and what you do about it — both right now and permanently.
What the interviewer is really testing
Whether you understand that resilience patterns can become amplification vectors under the exact conditions they were designed to handle. This is a “the obvious answer is wrong” question — retries are supposed to help, but in certain failure modes they are the primary cause of sustained outages.
What weak candidates say
“We should increase the backoff interval or reduce the retry count.” This is tweaking parameters on a fundamentally broken approach. Or: “We should disable retries during the outage.” This helps immediately but does not address the structural problem.
What strong candidates say
The way I think about this is that retries are an optimistic strategy: they assume the failure is transient and the next attempt will succeed. When the downstream is genuinely overloaded, every retry is additional load on a system that is already drowning. You end up in a positive feedback loop: service is slow, clients retry, service gets more load, service gets slower, more retries fire, and the whole thing spirals.
I have seen this play out concretely at a fintech company processing around 2M transactions per day. The payment orchestrator had 3 retries with exponential backoff on calls to a bank integration service. The bank’s API started returning 503s due to their own maintenance. Our 50 payment workers each had 3 retries, so 50 initial failures became 150 retry attempts within 30 seconds. The bank’s rate limiter kicked in and started rejecting everything, including requests from other customers. Our 503 rate went to 100%, and the retries kept hammering. The circuit breaker was configured for 10 consecutive failures, but the intermittent responses from the bank’s rate limiter (which returned 429, not 503, and so were not counted as failures) kept resetting the consecutive counter. The outage lasted 47 minutes instead of the bank’s planned 5-minute maintenance window.
What I would do right now during the incident:
First, manually trip the circuit breaker or use a feature flag to disable the call path entirely. Stop the bleeding. In our case, we had an admin endpoint that forced the circuit breaker to OPEN state; we built that after a previous incident where we wished we had it.
Second, drain the retry queues. Any in-flight retries need to be cancelled, not just paused. In Java with Resilience4j, this means calling circuitBreaker.transitionToForcedOpenState(). In a queue-based system, stop the consumers.
Third, implement emergency load shedding. If other services share the same downstream, shed non-critical traffic immediately. We kept payment retries for Tier 1 merchants and shed everything else.
What I would do permanently (four structural changes):
Retry budgets instead of per-request retries. Instead of “each request gets 3 retries,” implement “the system can retry at most 10% of total request volume in any 10-second window.” Google describes this in their SRE book as a retry budget. At 1,000 requests per second with 500 failing, you retry 100, not 1,500. This caps the amplification factor at 1.1x, instead of the 4x worst case when every request fails and uses all 3 retries.
Switch the circuit breaker from consecutive-failure to sliding-window error-rate. The bank’s intermittent 429s were resetting our consecutive counter. A 60-second sliding window with a 30% error-rate threshold would have tripped within 15 seconds of the bank going into maintenance.
Adaptive concurrency limiting. Tools like Netflix’s concurrency-limits library dynamically adjust how many outstanding requests you allow to a downstream based on observed latency. When latency starts climbing (the leading indicator of overload), the limiter reduces concurrency before errors even start. This is proactive where retries are reactive.
Distinguish between overload failures and fault failures. A 503 due to overload should not be retried, because retrying makes overload worse. A 503 due to a transient process restart should be retried. We started looking at the Retry-After header: if present, it is an explicit signal from the downstream that retrying sooner will not help. We also stopped retrying 429s entirely: a rate limit response is the downstream telling you to back off, and retrying is ignoring that signal.
War Story: After the 47-minute incident, we implemented retry budgets and the sliding-window circuit breaker. Three months later, the same bank had another maintenance event. This time, the circuit breaker tripped in 12 seconds, the retry budget capped amplification at 1.1x, and the total user-visible impact was 45 seconds of degraded payment processing. Our queued-payment fallback handled the rest. That 47 minutes versus 45 seconds comparison became the slide I used to justify the engineering investment to our VP.
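A retry budget of the kind described in this answer can be sketched in a few dozen lines. This is an illustrative, single-threaded version, not the implementation from the incident: the class name, method names, and the injectable `clock` parameter are assumptions made for the example, and a production version would need locking and per-downstream instances.

```python
import time
from collections import deque
from collections.abc import Callable


class RetryBudget:
    """Grant retries only up to `ratio` of the first attempts seen in the last `window_s` seconds."""

    def __init__(
        self,
        ratio: float = 0.10,
        window_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ratio = ratio
        self.window_s = window_s
        self._clock = clock
        self._requests: deque[float] = deque()  # timestamps of first attempts
        self._retries: deque[float] = deque()   # timestamps of granted retries

    def _expire(self, now: float) -> None:
        # Drop timestamps that have fallen out of the sliding window.
        for timestamps in (self._requests, self._retries):
            while timestamps and now - timestamps[0] >= self.window_s:
                timestamps.popleft()

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        now = self._clock()
        self._expire(now)
        self._requests.append(now)

    def try_acquire_retry(self) -> bool:
        """Return True if one more retry fits in the budget; False means fail fast."""
        now = self._clock()
        self._expire(now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False
        self._retries.append(now)
        return True


# Replay the example from the text with a fake clock: 1,000 requests in the
# window, 500 of them failing and each asking for up to 3 retries.
fake_now = [0.0]
budget = RetryBudget(clock=lambda: fake_now[0])
for _ in range(1000):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(1500))
print(granted)  # 100
```

With 100 retries granted against 1,000 original requests, total load on the downstream is 1,100 calls, which is the 1.1x amplification cap mentioned above.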
Follow-up: How do retry budgets work in a system with multiple layers of retries (edge -> API gateway -> service -> database)?
This is the retry amplification cascade, and it is the reason most microservice architectures have a hidden 10x-100x amplification factor that nobody has calculated.
Picture a three-layer call chain: the mobile client makes up to 3 attempts, the API gateway makes up to 2, and the backend service makes up to 3 against the database. In the worst case, one failed user request generates 3 (client) x 2 (gateway) x 3 (service) = 18 database attempts. (If those figures are retries on top of the first attempt, the worst case is 4 x 3 x 4 = 48.) If you have 1,000 users hitting this simultaneously, one database hiccup creates 18,000 database connection attempts from 1,000 original requests.
The fix is to retry at exactly one layer, typically the layer closest to the failure. The backend service retries against the database (it understands what is retryable). The API gateway does NOT retry calls to the backend service: if the backend returned an error after its own retries, the error is not transient from the gateway’s perspective. The client has a retry with long backoff for user-level recovery, but it is retrying the entire operation, not amplifying a known failure.
The general rule: if the layer below you already retries, you should not retry the same failure. Instead, trust the layer below to handle transient issues, and propagate genuine failures upward quickly.
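The multiplication in this answer treats each number as the total attempts a layer makes. A short sketch makes the worst case explicit and shows how the figure changes if the numbers are read as retries on top of the first attempt (the function name is illustrative):

```python
from math import prod


def worst_case_attempts(attempts_per_layer: list[int]) -> int:
    """Worst-case calls reaching the bottom layer when every attempt at every layer fails."""
    return prod(attempts_per_layer)


# Client, gateway, service: total attempts per layer as used in the text.
print(worst_case_attempts([3, 2, 3]))  # 18

# Same figures read as retries in addition to the first attempt.
print(worst_case_attempts([retries + 1 for retries in (3, 2, 3)]))  # 48

# Retrying only at the layer closest to the failure (service -> database).
print(worst_case_attempts([1, 1, 3]))  # 3
```

Running this over the real retry settings of every hop in a call chain is a quick way to find the hidden amplification factor before an incident does.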
Follow-up: How do you test that your retry and circuit breaker configuration actually works under realistic load?
You cannot test this with unit tests or even integration tests with one request at a time. The failure modes only emerge under concurrent load with realistic failure injection.
What I have used is a combination of load testing with Locust or k6, where we simulate normal traffic volume, and Toxiproxy sitting between our service and the downstream to inject failures. We run scenarios like: “at T=0, Toxiproxy starts failing 50% of requests to the payment service (Toxiproxy works at the TCP level, so the service sees timeouts and connection resets rather than literal 503 responses). Verify that: (a) the circuit breaker trips within 15 seconds, (b) retry amplification stays below 1.2x, (c) the fallback path activates, and (d) when Toxiproxy stops injecting at T=60, the circuit breaker transitions to half-open and recovers within 30 seconds.”
We run this as a weekly automated job in our staging environment. The test fails if any of those four conditions are violated. This has caught two regressions: once when someone bumped the circuit breaker failure threshold from 30% to 80% in a config change, and once when a new library version changed the default retry behavior.
14. Your team has been running blameless post-incident reviews for two years. Incidents keep recurring. The same classes of failures happen every quarter. Leadership asks why blameless culture is not working. What is actually going wrong?
What the interviewer is really testing
Whether you understand the difference between blameless and toothless. This is a question where the candidate’s instinct is to defend blameless culture, but the experienced answer is that blameless post-mortems without accountability for follow-through are organizational theater.
What weak candidates say
“Blameless culture is important and we should keep doing it. Maybe people are not being honest enough in the reviews.” This misses the point entirely — the problem is not the culture of the review, it is what happens after the review.
What strong candidates say
I have lived through exactly this. The phrase I use is: “We were blameless but also actionless.” The post-incident reviews were great: honest, detailed, well-attended. We produced thorough documents with root cause analysis and 8-12 action items per incident. The problem was that the action items went into a Jira backlog and competed with feature work. A quarter later, 70% of them were still open.
The root causes of recurring incidents are almost always organizational, not technical:
Problem 1: Action items have no owner with deadline and accountability. A post-mortem that says “improve monitoring for the payment service” is not an action item; it is a wish. A real action item is: “Sarah will add a Datadog alert for payment service p99 latency exceeding 500ms, with PagerDuty escalation, by March 15. The on-call lead will verify the alert fires correctly in staging before the next on-call rotation.” Name, deliverable, date, verification.
Problem 2: Follow-up items are not prioritized against feature work. At a previous company, we tracked this: out of 156 post-mortem action items in 2023, 41 were completed within the stated timeline. The rest were deprioritized by product managers who, reasonably, were optimizing for feature delivery. The fix was allocating 20% of each sprint to reliability work, ring-fenced and not available for feature overflow. The engineering director made this non-negotiable after our fourth repeat incident.
Problem 3: Post-mortems focus on proximate cause, not systemic cause. “The deploy was bad” is a proximate cause. “We have no canary deployment pipeline, so every deploy goes to 100% of traffic instantly” is a systemic cause. Fixing the proximate cause (roll back the bad deploy) stops this incident. Fixing the systemic cause (build canary deployments) prevents an entire class of future incidents. Most post-mortems stop at the proximate cause because systemic fixes are expensive and cross-cutting.
Problem 4: No mechanism to detect recurring patterns across incidents. Each incident is reviewed individually. Nobody is looking at the aggregate and noticing that 6 of the last 10 incidents involved configuration changes, or that 4 of the last 7 were caused by a dependency timeout that was never tuned. We started tagging incidents with failure categories (“config change,” “dependency failure,” “capacity exhaustion,” “deploy regression”) and reviewing the distribution quarterly. That review exposed that 40% of our incidents were deploy-related, which justified investing in canary infrastructure.
What I would tell leadership: “Blameless culture is working: people are honest in reviews and we understand our failures. What is not working is the feedback loop between learning and action. We need three changes: ring-fenced reliability sprint capacity, action items with named owners and deadlines tracked in a separate board with weekly review, and quarterly pattern analysis across incidents.”
War Story: At the fintech company I mentioned, we implemented what we called “Reliability Review”: a monthly meeting where the engineering director, two senior engineers, and the product lead reviewed all open post-mortem action items and the incident category distribution. The rule was: if an action item was deprioritized for two consecutive sprints, the engineering director either reassigned it to a different team or escalated it as a leadership priority. In the first year after implementing this, repeat incidents dropped from 11 to 3. The post-mortems did not get better; the follow-through did.
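The quarterly pattern analysis from Problem 4 needs very little tooling once incidents carry a category tag. A minimal sketch, using made-up incident tags for illustration (the data and the function name are hypothetical):

```python
from collections import Counter


def category_share(tags: list[str]) -> dict[str, float]:
    """Fraction of incidents per failure category, most common first."""
    counts = Counter(tags)
    total = len(tags)
    return {category: n / total for category, n in counts.most_common()}


# Hypothetical quarter: one tag per incident.
quarter_tags = [
    "deploy regression", "config change", "deploy regression", "dependency failure",
    "deploy regression", "capacity exhaustion", "config change", "deploy regression",
    "dependency failure", "config change",
]

for category, share in category_share(quarter_tags).items():
    print(f"{category}: {share:.0%}")
```

In this invented sample, deploy regressions account for 40% of incidents, which is the kind of aggregate signal that no single post-mortem would surface.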
Follow-up: How do you conduct a post-mortem for an incident that was clearly caused by one person's mistake without it becoming a blame session?
The key distinction is between the human error and the system that allowed the error to have that impact. Humans will always make mistakes. The question is not “who typed the wrong command” but “why did the system allow a single typo to take down production?”
The Amazon S3 outage in 2017 is the canonical example. An engineer typed the wrong number in a maintenance command and removed too many servers. The post-mortem did not focus on the engineer’s typo; it focused on: why was there no confirmation prompt? Why was there no rate limiter on how many servers could be removed at once? Why was there no minimum capacity threshold that blocked dangerous commands?
In practice, I structure post-mortems with an explicit section called “What organizational and systemic conditions made this outcome possible?” This redirects attention from the individual to the system. The person who made the error is actually the best source of information: they know what was confusing, what was missing, and what they wish had been in place. If they feel blamed, they clam up and you lose that insight.
One technique I have seen work well: the person involved in the incident writes the first draft of the post-mortem. This gives them control of the narrative and demonstrates trust. The team reviews and adds context, but the author’s perspective anchors the document.
Follow-up: How do you measure whether your post-incident process is actually improving system reliability over time?
Three metrics, tracked quarterly:
Mean time to recovery (MTTR) trend. If your process is working, you should be getting faster at recovering from incidents because your runbooks are better, your alerting catches issues sooner, and your rollback mechanisms are more reliable. Track MTTR by incident severity and look for a downward trend.
Repeat incident rate. Tag every incident with a failure category. Track what percentage of incidents in Q2 share a root cause category with incidents in Q1. If the rate is flat or increasing, your action items are not being completed or are not addressing the systemic cause.
Time from incident to action item completion. If the median time from “action item identified in post-mortem” to “action item verified in production” is 90 days, your feedback loop is too slow. Target under 30 days for critical items. Track this like you track feature delivery cycle time.
A bonus metric: error budget consumption trend by category. If “dependency failures” consumed 60% of your error budget in Q1 and 35% in Q2 after you invested in circuit breakers and fallbacks, that is measurable evidence that your reliability investments are working.
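Two of these metrics, repeat incident rate and time to action-item completion, can be computed from data most teams already have. The sketch below assumes incidents are tagged with a category and action items carry an identified date and an optional verified date; the function names and sample data are hypothetical.

```python
from datetime import date
from statistics import median
from typing import Optional


def repeat_incident_rate(previous: list[str], current: list[str]) -> float:
    """Share of this quarter's incidents whose category also appeared last quarter."""
    if not current:
        return 0.0
    seen = set(previous)
    return sum(1 for category in current if category in seen) / len(current)


def median_days_to_complete(items: list[tuple[date, Optional[date]]]) -> Optional[float]:
    """Median days from 'identified' to 'verified'; open items (None) are ignored."""
    durations = [(done - opened).days for opened, done in items if done is not None]
    return median(durations) if durations else None


q1 = ["config change", "deploy regression", "deploy regression"]
q2 = ["config change", "capacity exhaustion", "deploy regression", "dependency failure"]
print(repeat_incident_rate(q1, q2))  # 0.5

action_items = [
    (date(2024, 3, 1), date(2024, 3, 21)),   # 20 days
    (date(2024, 3, 1), date(2024, 4, 10)),   # 40 days
    (date(2024, 3, 5), None),                # still open
]
print(median_days_to_complete(action_items))  # 30.0
```

Ignoring open items flatters the median, so it is worth reporting the count of open items next to it.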
15. You are designing a load shedding strategy for a service that handles both paid enterprise API customers and free-tier users. During a capacity crunch, the system needs to shed load. Walk me through your architecture.
What the interviewer is really testing
Whether you can make hard trade-off decisions with incomplete information, and whether you understand that load shedding is a business decision wrapped in a technical implementation. They also want to see if you think about fairness, contractual obligations, and the operational mechanics of actually shedding load without creating worse problems.
What weak candidates say
“We should just rate limit the free tier users and let the enterprise customers through.” This is the right instinct but misses the implementation complexity, the edge cases, and the fact that naive rate limiting can be worse than the original overload.
What strong candidates say
Load shedding is one of those areas where the architecture must be designed before the incident, because you cannot build a priority queue while the house is on fire. Here is how I would structure it:
Step 1: Define the priority tiers and their contracts. This is a business decision that engineering facilitates:
| Priority | Customer Type | SLA Commitment | Shedding Behavior |
| --- | --- | --- | --- |
| P0 | Enterprise (contracted SLA) | 99.95% availability | Never shed unless system is at risk of total failure |
| P1 | Paid (standard) | 99.9% availability | Shed last, with Retry-After header |
| P2 | Free tier (authenticated) | Best effort | Shed first, with clear 503 and Retry-After |
| P3 | Anonymous/unauthenticated | No commitment | Shed aggressively, return static cached response |
Step 2: Implement priority-aware admission control at the edge. The load shedding decision happens at the API gateway or load balancer layer, not inside the application. Why? Because if the shedding happens inside the application, the request has already consumed connection and thread resources before being rejected. Shedding at the edge means rejected requests cost almost nothing.
With an edge layer such as Cloudflare or AWS API Gateway in front, an API key lookup maps each request to a customer tier, and the tier decides admission. When the system is under load, a controller adjusts the admission threshold: “accept P0 and P1, reject P2 and P3.” If load continues to increase: “accept only P0.”
Step 3: Use adaptive thresholds, not static ones. The shedding controller monitors CPU utilization, active connection count, and request queue depth. When CPU crosses 70%, start shedding P3. When it crosses 85%, shed P2. When it crosses 95%, shed P1. The thresholds are tuned based on load testing data. Specifically, we need to know “at what CPU utilization does p99 latency start exceeding our SLO?” and set the first shedding threshold 10 points below that.
Step 4: Make rejected responses useful. A bare 503 is hostile. Shed responses should include: a Retry-After: 30 header so clients know when to retry, a response body explaining the situation (“Service is experiencing high demand. Enterprise customers are unaffected. Please retry in 30 seconds.”), and for API consumers, a machine-readable error code that SDKs can use to implement intelligent backoff.
Step 5: The hard edge case: what about free-tier users who are in the middle of a multi-step operation? If a free-tier user started a checkout flow and gets shed on step 3 of 4, they lose their cart state and have a terrible experience. The answer is temporary priority elevation for active transactions: once a user has started a critical flow, they are treated as P1 until the flow completes or times out. This prevents the worst user experience without meaningfully increasing load during a capacity crunch, because the number of users mid-transaction is small relative to total traffic.
War Story: At a SaaS company serving about 15,000 API customers, we had an incident where a free-tier customer ran an automated script that generated 40x their normal traffic in a 10-minute window. Without load shedding, their traffic spike consumed 60% of our shared compute capacity. Paid customers started seeing timeouts. Our initial fix was per-customer rate limiting, which helped, but it did not solve the case where 50 free-tier customers each spiked 5x simultaneously: individually within limits, collectively overwhelming. That is when we implemented the priority-based adaptive shedding described above. The first time it activated in production, it shed 12,000 requests per minute from free-tier traffic while keeping enterprise customer p99 latency under 200ms. Without it, the entire platform would have degraded for 20+ minutes.
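Steps 2, 3, and 5 can be sketched as one small admission function. This is an illustrative sketch only: it uses CPU as the single load signal and the 70/85/95 thresholds from the text, and the names `Tier`, `SHED_AT`, `lowest_admitted_tier`, and `admit` are hypothetical rather than any gateway's real API.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    P0 = 0  # enterprise, contracted SLA
    P1 = 1  # paid, standard
    P2 = 2  # free tier, authenticated
    P3 = 3  # anonymous


# CPU percentage at which each tier starts being shed (values from the text).
SHED_AT = {Tier.P3: 70.0, Tier.P2: 85.0, Tier.P1: 95.0}


def lowest_admitted_tier(cpu_percent: float) -> Tier:
    """Return the least important tier that is still admitted at this load."""
    admitted = Tier.P3
    for tier in (Tier.P3, Tier.P2, Tier.P1):
        if cpu_percent >= SHED_AT[tier]:
            admitted = Tier(tier - 1)
    return admitted


@dataclass
class Decision:
    status: int
    headers: dict = field(default_factory=dict)


def admit(tier: Tier, cpu_percent: float, in_critical_flow: bool = False) -> Decision:
    # Step 5: a user in the middle of a critical flow is treated as P1.
    effective = min(tier, Tier.P1) if in_critical_flow else tier
    if effective <= lowest_admitted_tier(cpu_percent):
        return Decision(200)
    # Step 4: a rejection should at least tell the client when to come back.
    return Decision(503, {"Retry-After": "30"})
```

P0 has no entry in `SHED_AT` on purpose: in this sketch nothing ever sheds it, which matches the "never shed" contract. A production controller would also weigh connection count and queue depth, as described in Step 3.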
Follow-up: How do you prevent load shedding from causing a thundering herd when the capacity crunch resolves?
This is the recovery problem, and it is the part most load shedding implementations get wrong.
When the system recovers and you stop shedding, all the clients that received 503s with Retry-After: 30 headers retry simultaneously. You go from “under capacity” to “overwhelmed again” in one second. This is the thundering herd.
Three mitigations:
Jittered Retry-After. Do not return Retry-After: 30 to every client. Return Retry-After: 25 to some, 30 to others, 35 to others. Spread the retry wave over a window. In practice, I set Retry-After to a base value plus a random offset: 30 + random(0, 30).
Gradual readmission. Do not flip shedding from “on” to “off.” Ramp it down, highest-priority tier first. If you were shedding P2 and P3 at 85% CPU, start readmitting P2 once CPU falls to 70%, then P3 at 60%. Leave a buffer between “stop shedding” and “at capacity” so the readmitted traffic does not immediately push you back over the threshold.
Capacity reservation for recovery. While shedding, keep 10-15% of capacity unused. This headroom absorbs the initial retry burst when shedding is relaxed. Without this reservation, you oscillate between shedding and not-shedding: a feedback loop where the system is always on the edge.
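The first two mitigations are small enough to sketch. The following is a minimal illustration, assuming a single CPU signal and one tier per controller; `jittered_retry_after` and `HysteresisShedder` are hypothetical names, and the 85/70 pair is only an example of a shed threshold with a lower readmit threshold.

```python
import random


def jittered_retry_after(base_s: int = 30, spread_s: int = 30, rng=random) -> int:
    """Base delay plus a random offset, so retries spread over a window."""
    return base_s + rng.randint(0, spread_s)


class HysteresisShedder:
    """Starts shedding at shed_at, stops only once load falls to readmit_at.

    The gap between the two thresholds is the buffer that keeps readmitted
    traffic from immediately re-triggering shedding.
    """

    def __init__(self, shed_at: float, readmit_at: float):
        if readmit_at >= shed_at:
            raise ValueError("readmit_at must be below shed_at")
        self.shed_at = shed_at
        self.readmit_at = readmit_at
        self.shedding = False

    def update(self, cpu_percent: float) -> bool:
        if not self.shedding and cpu_percent >= self.shed_at:
            self.shedding = True
        elif self.shedding and cpu_percent <= self.readmit_at:
            self.shedding = False
        return self.shedding


shedder = HysteresisShedder(shed_at=85.0, readmit_at=70.0)
states = [shedder.update(cpu) for cpu in (86.0, 80.0, 69.0, 80.0)]
```

Note that 80% CPU produces different answers depending on history: still shedding on the way down, not shedding on the way up. That asymmetry is what stops the oscillation.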
Follow-up: A paying enterprise customer says 'I am paying for SLA guarantees — why was my request rejected during load shedding?' How do you respond?
If a P0 enterprise customer was shed, that is a bug in our shedding implementation, full stop. The contract says they are never shed unless the entire system is at risk of total failure. We owe them a post-incident report explaining what happened, a service credit per the SLA, and a fix.
But the more interesting scenario is when the enterprise customer’s traffic itself is the cause of the overload. If their API usage spiked 20x above their contracted rate, our shedding correctly protected the platform. The conversation then is: “Your usage exceeded your contracted rate limits. Traffic within your rate limit was served at full priority. Traffic above the limit was subject to shedding under our fair use policy.” This requires having per-customer rate limits that are explicitly part of the enterprise contract, not just a backend configuration. If rate limits are not in the contract, you have no contractual basis for rejecting their traffic, and that is a sales and legal conversation, not an engineering one.
16. A critical payment transaction partially completed — the customer was charged but the order was not created in your system. The customer is calling support. Walk me through how you investigate, resolve, and prevent this.
What the interviewer is really testing
Whether you understand distributed transactions, the impossibility of exactly-once processing across system boundaries, and how to design compensation patterns for the inevitable partial failures. They also want to see if you can balance technical investigation with customer impact.
What weak candidates say
“We should use distributed transactions with two-phase commit to prevent this.” Two-phase commit across external services like Stripe is not possible — you do not control Stripe’s transaction coordinator. Or: “We should just refund the customer.” This handles the symptom but not the systemic issue.
What strong candidates say
This is the classic distributed transaction partial failure, and it happens more often than anyone admits. In any system where “charge the customer” and “create the order” are two separate operations, especially when one involves a third-party payment provider, you will eventually have a state where one succeeded and the other did not.
Immediate resolution: the customer comes first. Before any debugging, I check the payment provider’s dashboard (Stripe, Adyen, whatever) using the customer’s email or payment reference to confirm the charge exists. If confirmed, I have two options: create the order manually in our system to match the charge, or issue an immediate refund. Which one depends on whether the product or service can still be fulfilled. The customer should hear back within 15 minutes of contacting support, not after we finish root-causing.
Investigation: how did this happen? I pull the distributed trace for the transaction using the correlation ID. In a well-instrumented system, the trace shows me: request arrives, payment service call starts at T=0, Stripe responds with charge confirmation at T=800ms, order creation call starts at T=810ms, and then… what? There are a few common failure modes:
Failure Mode A: crash after payment, before order creation. The application process died (OOM kill, deployment rolling restart, uncaught exception) in the 10ms window between receiving the Stripe confirmation and writing the order to the database. This is the most common cause. The payment succeeded, but the order write never executed.
Failure Mode B: order creation failed but error handling did not compensate. The database INSERT failed (constraint violation, timeout, connection pool exhausted), but the error handler did not trigger a Stripe refund. The code path was try { charge(); createOrder(); } catch { logError(); }, where the catch block logs but does not reverse the charge.
Failure Mode C: network partition between services. In a microservice architecture, the payment service confirmed the charge and sent a response, but the response was lost due to a network issue. The calling service timed out and assumed the payment failed, so it did not create the order. But the payment actually went through.
Permanent fix: the Saga pattern with compensation. The correct architecture for cross-service transactions is the Saga pattern: a sequence of local transactions with compensating actions for each step.
Step 1: Create the order in PENDING state (local database transaction).
Step 2: Charge the customer via Stripe with the order ID as the idempotency key.
Step 3: On payment success, update order to CONFIRMED. On payment failure, update order to CANCELLED.
The key insight: the order record is created BEFORE the payment is attempted. This means the order always exists in our system, even if the payment fails. If the payment succeeds but the order update fails, a reconciliation job detects orders in PENDING state with confirmed payments and moves them to CONFIRMED.
The reconciliation job is non-negotiable. It runs every 5 minutes, queries for orders in PENDING state older than 10 minutes, checks the payment provider’s API for the payment status, and reconciles. This is your safety net for every edge case your code path did not handle. At a previous company, this reconciliation job recovered 3-5 “orphaned” transactions per week, transactions that would have otherwise become support tickets.
War Story: At an e-commerce company processing about 8,000 orders per day, we had a recurring issue where roughly 0.02% of transactions ended up in the “charged but no order” state. That is 1-2 per day, which was enough to generate angry support tickets every week. The root cause was a 200ms window between receiving the Stripe webhook confirmation and completing the database write. Deploys during that window (we deployed 4-6 times per day with rolling restarts) killed processes mid-transaction. We implemented the Saga pattern with the PENDING state plus a reconciliation job. The reconciliation job recovered 100% of these cases automatically. Support tickets for “charged but no order” went from 8 per week to zero. The total engineering investment was about 3 weeks, and the support cost savings alone justified it within 2 months.
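The three saga steps and the reconciliation job can be sketched with in-memory stand-ins. Everything here is hypothetical: `FakeProvider` is not the Stripe client, the order store is a plain dict, and cancelling a stale PENDING order with no charge on record is an assumed policy, not something the text prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProvider:
    """In-memory stand-in for a payment provider, keyed by idempotency key."""
    charges: dict = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # Replaying a key returns the original charge, never a second one.
        return self.charges.setdefault(idempotency_key, f"ch_{idempotency_key}")

    def find_charge(self, idempotency_key: str):
        return self.charges.get(idempotency_key)


@dataclass
class Order:
    order_id: str
    amount_cents: int
    created_at: float
    status: str = "PENDING"


def checkout(orders, provider, order_id, amount_cents, now, crash_after_charge=False):
    # Step 1: local transaction; the order exists before any money moves.
    orders[order_id] = Order(order_id, amount_cents, created_at=now)
    # Step 2: charge, using the order ID as the idempotency key.
    provider.charge(order_id, amount_cents)
    if crash_after_charge:
        raise RuntimeError("simulated crash between charge and order update")
    # Step 3: confirm the order.
    orders[order_id].status = "CONFIRMED"


def reconcile(orders, provider, now, min_age_s=600):
    """Resolve PENDING orders older than min_age_s against the provider."""
    resolved = []
    for order in orders.values():
        if order.status != "PENDING" or now - order.created_at < min_age_s:
            continue
        if provider.find_charge(order.order_id) is not None:
            order.status = "CONFIRMED"
        else:
            # Assumed policy: no charge on record after the grace period.
            order.status = "CANCELLED"
        resolved.append(order.order_id)
    return resolved
```

A simulated Failure Mode A (`crash_after_charge=True`) leaves the order PENDING with a charge on record, and the next `reconcile` pass after the 10-minute grace period moves it to CONFIRMED.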
Follow-up: How does an idempotency key actually work in this context, and what happens if you get it wrong?
An idempotency key is a unique identifier that you send with a payment request so that if the request is retried, the payment provider recognizes it as a duplicate and returns the original result instead of creating a second charge.
In practice: when the user clicks “Pay,” your backend generates a UUID (or derives a deterministic key from the order ID) and sends it as the Idempotency-Key header to Stripe. If the first request succeeds but your backend crashes before processing the response, your retry sends the same key. Stripe sees the key, looks up the original result, and replays the response for the original charge instead of charging again.
What happens if you get it wrong:
Wrong key scope. If the idempotency key is too broad (same key for all requests from one user), legitimate separate purchases get deduplicated. If it is too narrow (different key on every retry), you get double charges. The key should be scoped to the logical operation: one key per checkout attempt.
Key reuse after TTL. Stripe can prune idempotency keys once they are at least 24 hours old. If a retry happens after that (a DLQ message that sat over a weekend), the key may no longer be recognized and you get a duplicate charge. For DLQ replays, always check the payment provider’s records before replaying a charge request.
Missing key on the retry path. If your initial request uses an idempotency key but your retry logic (or your DLQ consumer) does not propagate the same key, the retry is treated as a new request. The key must be stored alongside the message or order record so that any code path that retries the operation uses the original key.
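One way to get the key scope right and avoid losing the key on the retry path is to derive it deterministically from the order. A minimal sketch, assuming the order ID identifies one checkout attempt; the function name and the `orders/<id>/<operation>` naming scheme are hypothetical.

```python
import uuid


def idempotency_key(order_id: str, operation: str = "charge") -> str:
    """Derive a stable key for one logical operation on one order.

    Any code path that retries the same operation (request handler,
    DLQ consumer, reconciliation job) recomputes the same key.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"orders/{order_id}/{operation}"))
```

Because the key is a pure function of the order and the operation, there is nothing extra to persist, but the trade-off is that a customer who legitimately needs a second charge for the same order needs a distinct operation name.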
Follow-up: Why not just use a distributed transaction with two-phase commit instead of the Saga pattern?
Two-phase commit (2PC) requires all participants to be under the same transaction coordinator. When one participant is Stripe’s API, you do not control their transaction coordinator, so you cannot tell Stripe to “prepare but do not commit.” Stripe’s API is a black box: you send a charge request, and it either succeeds or fails. There is no “prepare” phase.
Even for internal services where you theoretically control both sides, 2PC has severe practical problems: it holds locks on both databases during the prepare phase, increasing latency and contention. If the coordinator crashes during the commit phase, all participants are stuck holding locks until the coordinator recovers, which can cascade into production outages. And 2PC does not scale horizontally: adding a third participant makes it worse.
The Saga pattern is the practical alternative for cross-service transactions. Each step is a local transaction, and each step has a defined compensating action. It is eventually consistent rather than strictly consistent, which means there is a brief window where the system is in an intermediate state (order PENDING, payment processing). For most business operations, this brief inconsistency is acceptable because the reconciliation job cleans it up within minutes.
The honest answer is: truly atomic cross-service transactions do not exist in distributed systems at scale. You are always choosing between availability and strict consistency (the CAP theorem is not optional). The Saga pattern with reconciliation chooses availability and eventual consistency, which is the right trade-off for payment processing, where a 5-minute delay in consistency is far preferable to rejecting the transaction entirely.
17. Your feature flag service (LaunchDarkly, Unleash, etc.) goes down. Half your application’s behavior is controlled by feature flags, including critical kill switches for graceful degradation. What happens and how should you have designed for this?
What the interviewer is really testing
Whether you recognize the meta-reliability problem: the tool you depend on for resilience is itself a single point of failure. This tests architectural thinking about dependencies-of-dependencies and whether the candidate has thought through what happens when the safety net itself fails.
What weak candidates say
“LaunchDarkly has high availability, so this is unlikely.” Unlikely is not never. Every SaaS dependency will eventually have an outage. Or: “We should self-host the feature flag service.” This just moves the availability problem from LaunchDarkly’s infrastructure to yours.
What strong candidates say
This is one of my favorite reliability thought experiments because it exposes a circular dependency that most teams do not notice until it bites them. You build graceful degradation using feature flags as kill switches. Then the feature flag service goes down. Now you cannot activate your kill switches. Your resilience mechanism has a single point of failure.
I have seen this happen in practice. At a company running about 200 microservices, we used LaunchDarkly for feature flags across all services. LaunchDarkly had a 45-minute outage. During those 45 minutes, we also had an unrelated issue with our recommendation service that we would normally have mitigated by flipping a kill switch flag. We could not flip the flag because LaunchDarkly was unreachable. The recommendation service issue, which should have been a 2-minute mitigation via kill switch, turned into a 45-minute customer-visible degradation.
How you should design for this: the defense-in-depth approach.
Layer 1: Cached flag values with a stale-serve policy. Every feature flag SDK worth using (LaunchDarkly, Unleash, Flagsmith) caches flag values locally. When the flag service is unreachable, the SDK serves the last known values. This handles short outages transparently. But you need to verify: how long does the cache persist? Is it in-memory only (lost on process restart) or persisted to disk? If your application restarts during the flag service outage, does it lose all cached values?
In our setup, we configured the LaunchDarkly SDK to write the flag cache to a local file every 60 seconds. On startup, the SDK reads from the file first, then tries to connect to LaunchDarkly. This means a process restart during a LaunchDarkly outage still has flag values. They might be up to 60 seconds stale, but that is acceptable for kill switches.
Layer 2: Hardcoded defaults for critical flags. Every critical kill switch flag has a hardcoded default value in the application code. If the SDK cannot reach LaunchDarkly AND the local cache is empty (fresh deployment, cache corruption), the default kicks in. For kill switch flags, the default should be “feature disabled” (fail safe). For critical-path flags, the default should be “feature enabled.” Document these defaults explicitly: a table of critical flags, their defaults, and the business impact of the default.
Layer 3: An independent override mechanism. For the “I absolutely need to change a flag value right now and LaunchDarkly is down” scenario, we built a lightweight override: an environment variable or a local config file that the application checks before consulting LaunchDarkly. During an incident, an engineer can SSH into the instances (or update a ConfigMap in Kubernetes) and set FEATURE_RECOMMENDATIONS=disabled. This is ugly, manual, and does not scale, which is fine, because it is the emergency backup for the backup.
Layer 4: Do not put everything behind feature flags. This is the most important lesson: not every behavior should be flag-controlled. Flags add a dependency. Use them for: new feature rollouts (temporary), kill switches for non-critical features (long-lived), and A/B experiments (temporary). Do not use them for: core business logic that never changes, security controls (use RBAC instead), and configuration that belongs in environment variables.
War Story: After the 45-minute incident, we audited our flag usage and found 847 active flags. 312 of them were stale: experiments that ended months ago but nobody removed the flag. 94 were controlling core business logic that had no reason to be behind a flag. We cleaned up 400 flags in a quarter, reducing our blast radius from “LaunchDarkly outage affects 847 behaviors” to “LaunchDarkly outage affects 447 behaviors, of which only 23 are kill switches with proper caching and defaults.”
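Layers 1 through 3 amount to a lookup order. The sketch below shows one possible ordering under stated assumptions: `sdk_lookup` is a hypothetical callable standing in for the flag SDK (it is not LaunchDarkly's API), `cache` is a plain dict standing in for the persisted flag file, and the flag names and defaults in `CRITICAL_DEFAULTS` are invented for illustration.

```python
import os

# Hypothetical defaults table: kill switch off (fail safe), critical path on.
CRITICAL_DEFAULTS = {"recommendations": False, "checkout": True}


def resolve_flag(name, sdk_lookup, cache, env=os.environ):
    """Resolve a boolean flag: override, then live value, then cache, then default."""
    # Layer 3: emergency override, e.g. FEATURE_RECOMMENDATIONS=disabled.
    override = env.get(f"FEATURE_{name.upper()}")
    if override is not None:
        return override.lower() == "enabled"
    try:
        # Normal path: ask the flag service via the SDK.
        return sdk_lookup(name)
    except ConnectionError:
        pass
    # Layer 1: last known value from the local cache.
    if name in cache:
        return cache[name]
    # Layer 2: hardcoded default for critical flags.
    return CRITICAL_DEFAULTS[name]
```

Flags missing from `CRITICAL_DEFAULTS` raise `KeyError` here on purpose: in this sketch a flag with no documented default is treated as a bug to surface, not something to guess at.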
Follow-up: How do you test that your feature flag defaults are correct?
This is something almost nobody does and almost everybody should. The test is simple in concept but requires discipline:
In your integration test suite, add a test mode where the feature flag SDK is configured to be permanently offline: it never connects to LaunchDarkly, never reads from cache, and serves only hardcoded defaults. Run your full integration test suite in this mode. Every test that fails reveals a feature whose default value produces incorrect or degraded behavior.
We ran this test quarterly and called it the “LaunchDarkly is dead” test. The first time we ran it, 14 tests failed. Three of those failures were in the critical checkout path, meaning a LaunchDarkly outage combined with a fresh deployment would have broken checkout. We fixed those defaults immediately.
The other approach is a startup health check: when the application starts, it logs every feature flag, its current value, and its default value. A monitoring rule alerts if any critical flag is serving its default in production. That means either the flag service is down or the cache is stale, and someone should investigate.
Follow-up: How do you manage the lifecycle of feature flags to prevent flag debt?
Flag debt is real and it compounds. Every flag in your system is a branch in your logic, so it doubles the state space your code can be in. 100 flags means 2^100 theoretical combinations, most of which have never been tested together.
The lifecycle I enforce:
Creation: Every flag has a Jira ticket linked, a type (release, experiment, kill-switch, ops), an owner, and an expiration date. Release flags expire 30 days after full rollout. Experiment flags expire when the experiment concludes. Kill-switch flags are long-lived but reviewed quarterly. Ops flags (infrastructure toggles) are reviewed monthly.
Monitoring: A weekly automated report shows flags past their expiration date. It goes to the flag owner and their manager. After 14 days past expiration with no action, the flag is queued for removal.
Removal: Removing a flag is a two-step process. First, the code that reads the flag is updated to hardcode the “on” or “off” behavior. Second, the flag is archived in LaunchDarkly. We never just delete the flag definition without removing the code, because that creates dead code branches.
At the company I mentioned, we reduced our active flags from 847 to about 450 and implemented the lifecycle process. Flag-related incidents (flags in unexpected states, conflicting flags, stale flags causing bugs) dropped from about 2 per month to roughly 1 per quarter.
18. You have a service with a 3-second end-to-end SLA for API responses. The call chain is: API Gateway -> Service A -> Service B -> Service C -> Database. Each hop has its own timeout configured. Explain how timeout budgets should work and what goes wrong when they do not.
What the interviewer is really testing
Whether you understand deadline propagation in distributed call chains and the cascading failures that happen when each service independently configures generous timeouts. This is a question that catches engineers who have configured timeouts but never reasoned about how they compose across a call chain.
What weak candidates say
“Set each service’s timeout to 3 seconds.” These timeouts do not compose: the Gateway always gives up first, no inner timeout ever fails fast, and once hops retry or call sequentially the waits can stack to 4 x 3 = 12 seconds, 4x the SLA. Or: “Set each service’s timeout to 750ms so they add up to 3 seconds.” This is better but ignores processing time, network latency, and the fact that Service A might call Service B twice (with a retry).
What strong candidates say
This is the deadline propagation problem, and getting it wrong is one of the most common causes of SLA violations in microservice architectures.
The naive approach and why it fails: If each service independently sets a 3-second timeout on its downstream call, the timeouts do not compose into a budget. Each downstream timer starts slightly later than its caller’s, so the API Gateway’s timeout always fires first and the inner timeouts never get a chance to fail fast. No hop leaves its caller time to handle an error or fall back, so a slow database turns into a full 3-second wait followed by a 504. Once any hop retries, or the calls run one after another instead of nested, the waits stack: up to 4 x 3 = 12 seconds, 4x the SLA. A request that squeaks through just before each timeout does not even show up as an error, only as a terrible latency metric.
Worse: when the API Gateway’s own timeout fires at 3 seconds and returns a 504 to the user, Services A, B, and C continue processing the request they will never be able to deliver. This is wasted work, consuming threads, connections, and database capacity for a response nobody will read.
The correct approach: deadline budgets. When the API Gateway receives the request, it sets a deadline: deadline = now() + 3 seconds. This deadline is propagated to every service in the chain. In gRPC, this happens natively via the grpc-timeout header. In HTTP-based systems, you pass it as a custom header like X-Request-Deadline: <epoch_ms>.
Each service in the chain does three things:
Check the remaining budget. When Service A receives the request, it reads the deadline and calculates remaining = deadline - now() - own_processing_overhead. If remaining is less than some minimum (say, 100ms), return an error immediately — there is not enough time to do anything useful.
Pass the reduced budget downstream. Service A estimates it needs 50ms for its own processing. It passes deadline = original_deadline - 50ms to Service B. Service B does the same calculation, and so on.
Cancel downstream work when the deadline expires. If Service A’s deadline fires before Service B responds, Service A sends a cancellation signal to Service B. Service B propagates the cancellation to Service C. This is cooperative cancellation — it stops wasted work at every layer.
In practice, this looks like:
User's 3000ms budget arrives at API Gateway
Gateway reserves 100ms for overhead -> passes 2900ms to Service A
Service A reserves 200ms for processing -> passes 2700ms to Service B
Service B reserves 150ms for processing -> passes 2550ms to Service C
Service C reserves 100ms for processing -> allows 2450ms for database query
If the database is slow and takes 2500ms (exceeding its 2450ms budget), Service C times out and propagates the failure back. The user gets a response at roughly 2700ms, within the 3-second SLA, telling them the request failed. Without deadline propagation, that same slow database query would have held threads at every layer for the full 3 seconds, and the user might have waited the full 3 seconds before getting a timeout.
The really insidious failure mode: “orphaned work.” Without cancellation propagation, the API Gateway returns a 504 to the user after 3 seconds, but Service C is still waiting for the database. The database query returns after 8 seconds. Service C processes the result, triggers some side effects (sends a notification, updates a cache), and returns a success to Service B. Service B processes it and returns to Service A. All of that work (the notification, the cache update, the processing CPU) was wasted. Worse, the side effects actually happened for a request the user already received an error for. If the user retries, those side effects may happen twice.
War Story: At a company running about 40 microservices, we traced a mysterious issue where our email service sent duplicate order confirmation emails. The root cause: the API Gateway timed out at 3 seconds and the user retried. But the first request was still working its way through the service chain. Both the original and the retry eventually reached the email service and both triggered emails. We implemented gRPC deadline propagation across the chain, and when the Gateway timed out, a cancellation signal propagated to all downstream services. The email service received the cancellation before it sent the email. Duplicate emails dropped from roughly 200 per day to near zero.
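The budget arithmetic in the walkthrough above is just a running subtraction with a floor. A minimal sketch, using the reserve figures from the text; the function name, the `(hop, reserve)` tuple shape, and the 100ms minimum (taken from the "say, 100ms" example earlier) are illustrative assumptions.

```python
def propagate_budget(total_ms, reserves, min_useful_ms=100):
    """Return the budget each hop passes downstream after reserving its own share.

    reserves is a list of (hop_name, reserve_ms) in call order. Raises
    TimeoutError as soon as a hop would pass on less than min_useful_ms.
    """
    remaining = total_ms
    passed_down = []
    for hop, reserve_ms in reserves:
        remaining -= reserve_ms
        if remaining < min_useful_ms:
            raise TimeoutError(f"{hop}: only {remaining}ms left, failing fast")
        passed_down.append((hop, remaining))
    return passed_down


chain = [
    ("API Gateway", 100),
    ("Service A", 200),
    ("Service B", 150),
    ("Service C", 100),
]
budgets = propagate_budget(3000, chain)
```

With these reserves the database query ends up with 2450ms, matching the walkthrough. Shrinking the total budget or padding the reserves makes the fail-fast branch trigger at whichever hop runs out first.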
Follow-up: How does deadline propagation interact with retries?
This is where the complexity compounds. If Service A has a 2-second remaining deadline and its call to Service B fails after 800ms, it has 1200ms remaining. Can it retry?
The answer is: only if the remaining deadline is enough for the retry to complete. If Service B’s expected response time is 500ms and you have 1200ms remaining, you can afford one retry. If Service B’s expected response time is 1000ms, a retry would consume almost all of your remaining budget and leave almost no time for your own processing.
The pattern I use is: retries consume from the same deadline budget. The retry policy checks the remaining deadline before each attempt. If remaining time is less than the expected p99 latency of the call, skip the retry and fail immediately. This prevents retries from causing SLA violations.
In gRPC, this happens automatically: the deadline is shared across all attempts. In HTTP systems, you implement it manually: if (deadline - now() < expectedLatency) { return fail; }.
The worst anti-pattern is retries with fixed delays that ignore the deadline. “Retry after 1 second” when you only have 500ms left means you waste 500ms sleeping and then make a call that immediately times out. Check the budget before sleeping.
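The "check the budget before each attempt" rule can be sketched as a small retry wrapper. This is an illustration under assumptions: `call` is any zero-argument callable that raises `ConnectionError` on failure, the deadline is an absolute time on the same clock as `clock`, and the names `call_within_deadline` and `RetryBudgetExhausted` are hypothetical.

```python
import time


class RetryBudgetExhausted(Exception):
    """No attempt succeeded before the deadline or the attempt limit."""


def call_within_deadline(call, deadline, expected_latency_s,
                         max_attempts=3, clock=time.monotonic):
    """Retry call(), but only while the remaining budget covers one more attempt."""
    last_error = None
    for _ in range(max_attempts):
        # Skip the attempt if it cannot plausibly finish in time.
        if deadline - clock() < expected_latency_s:
            break
        try:
            return call()
        except ConnectionError as exc:
            last_error = exc
    raise RetryBudgetExhausted("no attempt succeeded within the deadline") from last_error
```

Using the numbers from the text: with a 2-second budget, 800ms failures, and a 500ms expected latency, the wrapper makes a second attempt at 1200ms remaining and refuses a third at 400ms remaining. Backoff sleeps are omitted here; if added, the same remaining-budget check belongs before the sleep.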
Follow-up: Your team uses HTTP, not gRPC. How do you implement deadline propagation without native framework support?
You build it as middleware. Every service has an incoming middleware and an outgoing middleware.
Incoming middleware: reads the X-Request-Deadline header. If present, stores it in a request-scoped context (thread-local, async context, Go context, etc.). If absent (the request originated externally), sets a default deadline based on the service’s SLA.
Outgoing middleware: before making any downstream HTTP call, reads the deadline from the request context, subtracts the estimated own-processing overhead, and attaches the reduced deadline as X-Request-Deadline on the outgoing request. It also sets the HTTP client timeout to min(configured_timeout, remaining_deadline), so even if the configured timeout is 5 seconds, the actual timeout is whatever deadline budget remains.
Cancellation: HTTP does not have native cooperative cancellation like gRPC’s context cancellation. The practical approach is: when the deadline expires, the calling service closes the HTTP connection. The downstream service sees a broken pipe or connection reset. Well-written services handle this as a cancellation signal and stop processing. Poorly-written services catch the error, log it, and continue processing, which is why cancellation propagation is harder in HTTP than gRPC.
We built this as a shared middleware library at the infrastructure level. Every service included it. It took about 3 weeks to implement and 2 months to roll out across 40 services. The before-and-after was dramatic: p99 end-to-end latency dropped by 400ms because services stopped waiting for downstream calls that were going to be discarded anyway.
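The two middleware halves reduce to two pure functions over headers and timestamps, which keeps them framework-independent. A minimal sketch, assuming headers arrive as a plain dict and deadlines are epoch milliseconds; the function names are hypothetical, and wiring them into a real web framework or HTTP client is left out.

```python
DEADLINE_HEADER = "X-Request-Deadline"


def incoming_deadline_ms(headers, default_budget_ms, now_ms):
    """Incoming side: use the caller's deadline, or start a new one at the edge."""
    raw = headers.get(DEADLINE_HEADER)
    if raw is not None:
        return int(raw)
    return now_ms + default_budget_ms


def outgoing_call_params(deadline_ms, own_overhead_ms, configured_timeout_ms, now_ms):
    """Outgoing side: return (headers, timeout_seconds) for the downstream call."""
    downstream_deadline_ms = deadline_ms - own_overhead_ms
    remaining_ms = downstream_deadline_ms - now_ms
    if remaining_ms <= 0:
        raise TimeoutError("deadline already exhausted, not calling downstream")
    # Never wait longer than the budget, whatever the configured timeout says.
    timeout_ms = min(configured_timeout_ms, remaining_ms)
    return {DEADLINE_HEADER: str(downstream_deadline_ms)}, timeout_ms / 1000.0
```

One caveat with an absolute epoch deadline is that it relies on reasonably synchronized clocks across hosts; passing a relative remaining budget instead avoids that at the cost of not accounting for network transit time.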
19. You are on call and notice that your service’s memory usage has been creeping up by 2% per day for the past two weeks. It is now at 78% of the container limit. No alerts have fired because the threshold is 90%. What do you do?
What the interviewer is really testing
Whether you can reason about slow-burn failure modes that automated alerting misses, and whether you have the operational instinct to investigate proactively rather than waiting for the 90% alert to fire at 3 AM in about six days (12 points of headroom at 2 points per day). This also tests memory leak debugging methodology.
What weak candidates say
“I would increase the memory limit on the container.” This treats the symptom, not the cause. Or: “I would wait until it hits 90% and then investigate.” This is reactive — you know it will hit 90%, and you are choosing to be paged at 3 AM instead of investigating during business hours.
What strong candidates say
A 2% per day linear increase is a textbook memory leak, or a traffic growth pattern that is revealing an unbounded data structure. The fact that no alert fired is itself a finding: our alerting only catches sudden spikes, not slow trends. Let me walk through the investigation and the systemic fix.
First: is this a leak or is this growth?
Before debugging code, I check two things:
Is the traffic volume growing? If traffic increased 2% per day for the past two weeks (marketing campaign, seasonal growth, bot traffic), the memory growth might be proportional to legitimate load. Check the requests-per-second metric against the memory trend. If they correlate, this is capacity planning, not a leak.
Is the memory growth monotonic? A leak increases continuously and never decreases. Normal memory usage grows during business hours and drops at night (or during low-traffic periods) as objects are garbage collected. If the memory graph only goes up, even during low-traffic periods, it is almost certainly a leak.
Diagnosing the leak: language-specific approaches
For JVM services (Java, Kotlin, Scala): I would enable heap dump on OOM (-XX:+HeapDumpOnOutOfMemoryError) if not already enabled, but I would not wait for OOM. Note that this flag fires on a JVM OutOfMemoryError, not when the kernel OOM-kills the container. Instead, I would take a live heap dump now (jmap -dump:live,format=b,file=heap.hprof <pid>) and analyze it with Eclipse MAT or VisualVM. I am looking for the dominator tree: which objects are holding the most retained memory? Common culprits: an in-memory cache without TTL or size bounds, a HashMap that is populated on every request but never evicted, event listeners that register but never unregister, thread-local storage that accumulates across request threads.
For Node.js services: I would take a heap snapshot via the Chrome DevTools protocol or the --inspect flag, then compare two snapshots taken 10 minutes apart. Objects that grow between snapshots are the leak suspects. Common culprits: closures capturing request-scoped variables in a long-lived scope, event emitter listeners that accumulate (the MaxListenersExceededWarning you ignored six months ago was trying to tell you this), and streams that are opened but never closed or properly piped.
For Go services: pprof is the standard tool. I would hit the /debug/pprof/heap endpoint, take two profiles 10 minutes apart, and use go tool pprof -diff_base to see what grew. Common culprits: goroutines that are started but never finish (goroutine leak), slices that are appended to but never reset, and deferred closes that accumulate in a long-running function.
The three most common root causes I have seen in production:
Unbounded in-memory caches. Someone added a local cache (Map<String, Object>) to avoid hitting Redis on every request. They did not set a max size or TTL. Every unique key stays in memory forever. At 10,000 new keys per day, that can plausibly account for 2% daily growth, depending on entry size. Fix: use a bounded LRU cache (Caffeine for Java, lru-cache for Node, hashicorp/golang-lru or the lru package in groupcache for Go) with a max entry count and, where the library supports it, a TTL.
Connection or goroutine leaks. HTTP clients or database connections opened in an error path that does not close them. The garbage collector cannot reclaim them because they are still referenced by the underlying runtime. Fix: ensure every connection is closed in a finally/defer block, and add metrics for active connection count.
Accumulated metrics or traces. An in-process metrics library or tracing library that keeps all recorded data in memory until flush. If data accumulates faster than it is flushed, or flushes fail silently, memory grows. Fix: configure the library with a bounded buffer that drops oldest data.
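The first root cause above is easiest to see next to its fix. The following is a minimal sketch of a cache bounded by both entry count and TTL, written in Python for illustration; the class name and parameters are hypothetical, and a production service would normally use one of the libraries named in the list rather than hand-rolling this.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class BoundedTTLCache:
    """LRU cache with a hard entry limit and a per-entry time-to-live.

    Not thread-safe. A stored value of None is indistinguishable from a miss.
    """

    def __init__(self, max_entries: int, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = ttl_s
        self._clock = clock
        self._data: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]          # expired: drop it so memory is released
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)   # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```

The important property is that `len(cache)` can never exceed `max_entries`, so memory use has a ceiling regardless of how many unique keys arrive. Expired entries are only removed lazily on `get`, which is acceptable here because the entry limit still bounds the total.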
The systemic fix: trend alerting
After resolving the leak, I would add a trend-based alert: “if memory usage increases by more than 10% over any 7-day rolling window, alert.” This catches slow leaks that a static threshold misses. Prometheus supports this directly through the PromQL deriv() and predict_linear() functions, Grafana can alert on those queries, and Datadog offers comparable forecast monitors. Specifically, predict_linear(container_memory_usage_bytes[7d], 7*24*3600) > container_spec_memory_limit_bytes (cAdvisor metric names) alerts when the current trend predicts hitting the memory limit within the next 7 days.
War Story: At a company running Java microservices on Kubernetes, we had a service that leaked 50MB per day. It ran fine for 3 weeks after each deploy, then got OOM-killed on a Saturday morning, restarted, and ran fine for another 3 weeks. Nobody investigated because “it restarts automatically.” After 4 months, I traced it to a ConcurrentHashMap used as a request deduplication cache with no eviction. Every unique request ID was stored forever. The fix was one line: adding a Caffeine cache with a 10-minute TTL and 50,000 max entries. The service went from consuming 3.8GB at its peak to a stable 1.2GB. We then added the predict_linear alert and caught two more slow leaks in other services within the next quarter.
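To make the trend alert concrete, here is a rough Python sketch of the idea behind linear extrapolation: fit a least-squares line to recent samples and ask when it crosses the limit. It is an illustration of the reasoning only, not how any particular monitoring system implements it; the function names are hypothetical, and "2% per day" is read as 2 percentage points of the container limit per day.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days: Sequence[float], usage_pct: Sequence[float],
                     limit_pct: float = 100.0) -> Optional[float]:
    """Days from the last sample until the fitted trend reaches limit_pct.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_fit(days, usage_pct)
    if slope <= 0:
        return None
    current = slope * days[-1] + intercept
    return max(0.0, (limit_pct - current) / slope)


if __name__ == "__main__":
    # Scenario from the question: two weeks of growth ending at 78% of the limit.
    days = list(range(15))
    usage = [50.0 + 2.0 * d for d in days]
    print(days_until_limit(days, usage, limit_pct=90.0))   # time to the static alert
    print(days_until_limit(days, usage, limit_pct=100.0))  # time to the OOM kill
```

An alert built on this would fire whenever the projected time to the limit drops below some horizon, such as the 7 days used in the text.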
Follow-up: How do you distinguish between a genuine memory leak and JVM garbage collector tuning issues?
The key signal is what happens during garbage collection. A memory leak means the GC runs but cannot reclaim the leaked objects: memory after GC stays high and trends upward. A GC tuning issue means the GC is not running often enough or the heap regions are misconfigured: memory grows, then GC runs and recovers most of it, but the growing peak causes problems.
In a JVM heap graph: if the sawtooth pattern (memory grows, GC runs, memory drops, grows again) has a rising baseline, where each post-GC trough is higher than the previous one, that is a leak. If the sawtooth pattern has a flat baseline but the peaks are growing (GC runs less frequently), that is a tuning issue.
For GC tuning: check if you are using the appropriate collector. G1GC has been the default since Java 9 and works well for most workloads. ZGC or Shenandoah reduce pause times for latency-sensitive services. For services with large heaps (8GB+), check the G1 region size and the max pause target. A common mistake is setting -XX:MaxGCPauseMillis too low, which pushes G1 toward frequent small collections that reclaim little old-generation memory, so garbage builds up.
The diagnostic tool I reach for first: GC logging. Enable -Xlog:gc* (Java 9+) and check the log for Full GC events. If Full GC runs and reclaims significant memory, you do not have a leak. You have a heap that is too small or a GC policy that delays full collection too long.
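The sawtooth rule above can be expressed as a small heuristic. This Python sketch assumes you have already extracted heap sizes just before and just after each GC cycle (for example from GC logs); the function names and the tolerance value are hypothetical, and the output is a hint for where to look, not a diagnosis.

```python
from typing import Sequence


def _slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (units per cycle)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    return sxy / sxx


def classify_heap_trend(post_gc_mb: Sequence[float], pre_gc_mb: Sequence[float],
                        tolerance_mb_per_cycle: float = 1.0) -> str:
    """Classify a heap sawtooth from its troughs (post-GC) and peaks (pre-GC)."""
    if _slope(post_gc_mb) > tolerance_mb_per_cycle:
        # Rising baseline: GC runs but cannot reclaim what was retained.
        return "likely leak"
    if _slope(pre_gc_mb) > tolerance_mb_per_cycle:
        # Flat baseline with growing peaks: collection is delayed, not defeated.
        return "likely GC tuning issue"
    return "stable"
```

Checking the troughs first matters: a leak also raises the peaks, so testing the peaks first would misclassify leaks as tuning problems.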
Follow-up: The memory leak is in a third-party library you cannot modify. What are your options?
This happens more often than anyone likes. Your options, in order of preference:
Option 1: Upgrade the library. Check if a newer version fixes the leak. This is the cleanest solution but requires testing the upgrade against your test suite and potentially dealing with breaking API changes.
Option 2: Work around the leak. If the leak is in a specific code path, avoid that code path. Use a different API from the same library or a different library for the leaking functionality.
Option 3: Contain the leak. If you cannot avoid the leaking code, limit its impact. Run the leaking functionality in a separate process or sidecar container with a lower memory limit and automatic restarts. The leak is contained, and the restart is invisible to users because the main application process is unaffected.
Option 4: Scheduled restarts. If the leak is slow and predictable (50MB per day in a 4GB container), schedule a rolling restart every 3 weeks during low-traffic hours. This is the least elegant solution but it is pragmatic. Document it explicitly: “This service requires a rolling restart every 21 days due to a known memory leak in library X version Y. Issue tracking the fix: JIRA-1234.” Do not let it become forgotten tribal knowledge.
Option 5: Contribute the fix upstream. File a detailed bug report with a minimal reproduction. If you can identify the leak (heap dump showing the offending objects), contribute a fix via pull request. This takes longer but benefits everyone.
20. Your SLO is 99.9% availability. You just had two incidents: Incident A was 20 minutes of total downtime affecting 100% of users. Incident B was a subtle data corruption bug that served wrong prices to 5% of users for 6 hours. Which incident is worse, and how does your SLO framework handle Incident B?
What the interviewer is really testing
Whether you understand that availability metrics can miss entire categories of failure — specifically, correctness failures. A service that returns 200 OK with wrong data is “available” but catastrophically unreliable. This question tests whether the candidate’s SLO framework accounts for correctness, not just uptime.
What weak candidates say
“Incident A is worse because we had 20 minutes of total downtime versus no downtime for Incident B.” This reveals a naive understanding of reliability as pure uptime. Or: “Incident B is worse because 6 hours is longer than 20 minutes.” This recognizes the duration but does not reason about impact.
What strong candidates say
Incident B is almost certainly worse from a business impact perspective, and the fact that most SLO frameworks would not catch it is the real problem to discuss here.
Quantifying the two incidents:
Incident A: 20 minutes, 100% of users. In terms of user-minutes of impact: 20 × 100% = 20 user-minutes per user. Clearly visible. Alerts fired. Response was immediate. Post-incident, every affected user knows there was an outage and attributes their experience to “the site was down.”
Incident B: 6 hours, 5% of users seeing wrong prices. In terms of user-minutes: 360 × 5% = 18 user-minutes per user. Similar scale, but the nature of the impact is fundamentally different. Users did not see an error; they saw wrong prices. Some may have purchased at incorrect prices (revenue loss or margin erosion). Some may have lost trust without even knowing the cause. The business impact includes potential refunds, legal exposure if prices were misrepresented, and brand damage that is hard to quantify.
The SLO framework problem:
A standard availability SLO (percentage of non-error responses) would score Incident A as ~98.6% availability for that day (20 of 1,440 minutes returning 5xx, assuming roughly even traffic) and Incident B as 100% availability for the entire 6 hours (every response was 200 OK). Your availability SLO would not even register Incident B. The dashboards would be green. Error budgets would show no consumption.
This is why correctness SLIs are essential:
You need SLIs that measure correctness, not just availability. For a pricing service, a correctness SLI might be: “percentage of responses where the returned price matches the current price in the source-of-truth database.” You measure this by sampling: run a synthetic check every 60 seconds that fetches 10 product prices from the API and compares them against a direct database query. Any mismatch increments the correctness error counter.
A correctness SLO might be: “99.99% of price responses must be accurate within the last 5 minutes of the source-of-truth.” The 0.01% budget accounts for brief replication lag and cache staleness.
With this SLO in place, Incident B would have been detected automatically: the synthetic check would have flagged mismatches within 60 seconds, the correctness SLI would have dropped, and the error budget would have started burning. An alert would have fired within 2-3 minutes, not 6 hours.
How to implement correctness SLIs in practice:
For data-serving systems: periodic reconciliation between what the API returns and what the source of truth stores. Mismatch rate is the SLI.
For transaction systems: end-to-end validation. After a payment is processed, verify the charge amount matches the order amount. After an inventory update, verify the stock count is consistent.
For computation systems: run known-input-known-output validation. Process a canary input with a known correct output and compare.
The key insight: correctness SLIs require knowing what “correct” means, which means you need a source of truth or a validation oracle. This is harder to automate than availability (which just checks for 5xx responses), but it catches the class of failures that availability misses, and those failures are often more damaging.
War Story: At an e-commerce company, a cache invalidation bug caused a 4-hour window where 8% of product listings showed prices from 3 days earlier. Some prices were higher (customers complained), some were lower (the company lost margin on 1,200 orders). Total financial impact: $47,000 in refunds and margin loss. Our availability SLO was 100% for that day, green across the board. After this incident, we implemented a correctness SLI: a synthetic monitor that compared 50 random product prices from the API against the pricing database every 30 seconds. Any mismatch rate above 0.1% triggered a P1 alert. We also added a data freshness SLI: if the cache’s oldest entry exceeded the expected TTL by more than 2x, alert. These two SLIs would have detected the bug within 2 minutes instead of the 4 hours it took a customer to report it.
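The synthetic check described in this answer reduces to comparing two lookups over a sample and tracking the mismatch rate. Below is a minimal Python sketch under stated assumptions: the two lookup callables stand in for "call the pricing API" and "query the source-of-truth database", prices are integer cents to avoid floating-point comparison issues, and all names are hypothetical.

```python
from typing import Callable, Iterable


def mismatch_rate(product_ids: Iterable[str],
                  api_price_cents: Callable[[str], int],
                  truth_price_cents: Callable[[str], int]) -> float:
    """Fraction of sampled products whose API price differs from the source of truth."""
    ids = list(product_ids)
    if not ids:
        return 0.0
    mismatches = sum(1 for pid in ids
                     if api_price_cents(pid) != truth_price_cents(pid))
    return mismatches / len(ids)


def should_page(rate: float, threshold: float = 0.001) -> bool:
    """Alert when the mismatch rate exceeds the threshold (0.1% in the war story)."""
    return rate > threshold


if __name__ == "__main__":
    # Stub data standing in for the real API and database.
    truth = {"sku-1": 1000, "sku-2": 2500, "sku-3": 999, "sku-4": 100}
    stale_api = dict(truth, **{"sku-3": 1099})   # one stale cached price
    rate = mismatch_rate(truth.keys(), stale_api.__getitem__, truth.__getitem__)
    print(rate, should_page(rate))
```

With small samples a single mismatch exceeds a 0.1% threshold, which is usually what you want for prices: any confirmed mismatch is worth a look, and the rate over time becomes the SLI.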
Follow-up: How do you set an SLO for data correctness when you cannot easily define what 'correct' means?
This is genuinely hard, and being honest about the difficulty is better than pretending it is straightforward.
For some domains, correctness is well-defined: a price is either right or wrong, an account balance is either accurate or not. For other domains (search relevance, recommendation quality, ML model accuracy) “correct” is subjective or probabilistic.
The approach I use for ambiguous correctness: define correctness as consistency rather than absolute truth. “Did the API return the same result that the source-of-truth system would return?” is answerable even when “is this result correct?” is not. If the search API returns results that differ from what a direct database query would return, that is a consistency violation regardless of which result is “better.”
For ML-driven systems: use behavioral SLIs. “Does the model return a non-null result within the expected output range?” catches model corruption and catastrophic failures. “Does the model’s accuracy on a held-out validation set remain within 2% of the production baseline?” catches gradual model drift. Neither requires defining absolute correctness, just consistency with known-good behavior.
Follow-up: How would you communicate the severity of Incident B to non-technical stakeholders who see the dashboard was green?
This is a leadership communication challenge. The dashboard being green while customers were impacted is a failure of the dashboard, not evidence that things were fine.
The framing I use: “Our monitoring measured whether the service was responding. It did not measure whether the service was responding correctly. Imagine a doctor’s office that tracks whether patients get seen (availability) but not whether they get the right diagnosis (correctness). Our monitoring was only checking the first part. This incident exposed that gap.”
Then I present the business impact in concrete terms: “Over 6 hours, approximately 1,200 orders were placed with incorrect prices. The estimated financial impact is $47,000 in refunds and margin adjustments. A 20-minute total outage, by comparison, would have cost approximately $5,000 in lost sales.”
Then the action plan: “We are adding a new correctness metric to our dashboard that checks whether the data we serve matches our source of truth. Once implemented, this category of issue will be detected within 2 minutes rather than discovered by a customer after 6 hours.”
The key message for stakeholders: a green dashboard does not mean everything is fine. It means everything the dashboard measures is fine. We are expanding what the dashboard measures.
21. You join a team that has no SLOs, no error budgets, and ships whenever features are ready. The service has roughly 2-3 incidents per month that each last 30-60 minutes. Leadership asks you to “fix reliability.” Where do you start?
What the interviewer is really testing
Whether you can introduce reliability practices into an organization that has never had them, without alienating the team. This is a change management question disguised as a technical question. The interviewer wants to see if you start with measurement (not mandates), build allies (not bureaucracy), and show value quickly (not spend 6 months on infrastructure).
What weak candidates say
“First, we need to define SLOs for every service, implement error budgets, set up a full observability stack, and hire an SRE team.” This is the right destination but the wrong starting point — you cannot boil the ocean. Or: “We should stop shipping features until reliability is fixed.” Good luck getting buy-in for that.
What strong candidates say
I have been in exactly this situation, and the mistake I see most often is trying to implement the entire Google SRE book in month one. That approach fails because it creates a lot of process overhead before demonstrating any value, and the team resents it.
Week 1-2: Measure the current pain, not the current metrics.
I would not start with tooling or process. I would start by reading every incident report from the past 6 months (if they exist) or interviewing the team about recent incidents (if they do not). I am looking for patterns: what category of failures recur? Which services are the most fragile? What is the actual business impact per incident?
At one company, this analysis revealed that 7 out of 12 incidents in the past quarter were caused by configuration changes pushed without review. The fix was not SLOs; it was a PR review requirement on config changes. That one change eliminated roughly 60% of incidents within a month. Quick win, immediate credibility.
Week 3-4: Implement observability for the top pain point.
Pick the single most painful service, the one that causes the most incidents or wakes people up at night. Set up three things for that service only:
A Grafana or Datadog dashboard showing request rate, error rate, and latency (the RED method). This takes 2-4 hours if you are using a modern metrics stack.
An alert for error rate exceeding 1% sustained for 5 minutes. This replaces “a customer called support” as the detection mechanism.
A structured log format with correlation IDs so that when an alert fires, you can trace the error to a root cause in minutes instead of hours.
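The second item in the list above ("error rate exceeding 1% sustained for 5 minutes") is normally configured in the monitoring tool, but the logic is simple enough to show directly. This Python sketch assumes one (total_requests, error_count) sample per minute; the function name and sample shape are hypothetical.

```python
from typing import Sequence, Tuple


def sustained_error_alert(per_minute: Sequence[Tuple[int, int]],
                          threshold: float = 0.01,
                          window: int = 5) -> bool:
    """True if every one of the last `window` minutes had an error rate above threshold.

    Each sample is (total_requests, error_count). Minutes with no traffic
    do not count as breaching, so an idle service never pages.
    """
    if len(per_minute) < window:
        return False
    recent = per_minute[-window:]
    return all(total > 0 and errors / total > threshold
               for total, errors in recent)
```

Requiring every minute in the window to breach is what "sustained" buys you: a single bad minute during a deploy does not page anyone, but five in a row does.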
Month 2: Introduce one SLO for the most critical service.
Not all services. One service. The one the business cares about most. Define an availability SLI (error rate) and a latency SLI (p99 response time). Set the SLO based on the current baseline plus a reasonable improvement target. Create a simple error budget dashboard: a single number showing “X% of budget remaining this month.”
Present this to the team and the product manager. The conversation: “Last month we had 45 minutes of downtime on the checkout service. That consumed our entire error budget by day 12. If we had been tracking this, we would have slowed down deploys after day 12 instead of having the second incident on day 23.”
Month 3-4: Build the feedback loop.
Start lightweight post-incident reviews for P1 incidents only. Keep them short: 30 minutes, three questions: what happened, what was the impact, what one action item would prevent recurrence? Assign an owner and a deadline for each action item. Track completion weekly.
This is where credibility compounds. When the product manager sees that post-incident action items are actually reducing incident frequency, they become an ally for investing more in reliability.
Month 5-6: Expand.
Add SLOs for the next 2-3 services. Formalize an error budget policy. Start allocating 15-20% of sprint capacity to reliability work. By this point, you have data to justify the investment: “Since we started tracking SLOs, incident frequency dropped from 2.5 per month to 1.2. Here is the data.”
What I would explicitly NOT do in the first 6 months:
I would not propose hiring an SRE team. That is an expensive, slow organizational change that leadership will resist without evidence. Prove the value of reliability practices with the existing team first.
I would not implement chaos engineering. You need basic monitoring and alerting before injecting failures; otherwise you are just causing outages.
I would not create a 40-page SLO document. One page per service, three metrics, simple thresholds. Iterate.
War Story: At a 50-person startup, I joined a team with zero observability. Their “monitoring” was a Slack channel where customers reported issues. In week 1, I set up Datadog APM on their primary API service. In week 2, we discovered that their p99 latency was 8 seconds, something nobody knew because they only looked at average latency (which was 200ms). The p99 was caused by a database query that ran a full table scan on the 1% of requests that hit a specific code path. A single index addition dropped p99 from 8 seconds to 300ms. That one fix, discovered because we started measuring, was worth more than any SLO document. It built the trust that let me introduce SLOs, error budgets, and post-incident reviews over the following quarter.
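The "single number" error budget dashboard from Month 2 comes down to two lines of arithmetic. The Python sketch below assumes a time-based availability SLO over a 30-day window, and uses 99.9% as an example target (the answer does not state one); the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_pct(slo: float, downtime_minutes: float,
                         window_days: int = 30) -> float:
    """Percentage of the error budget still unspent, floored at zero."""
    budget = error_budget_minutes(slo, window_days)
    return max(0.0, 100.0 * (1.0 - downtime_minutes / budget))


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 45 minutes of downtime, as in the checkout example, leaves nothing.
    print(budget_remaining_pct(0.999, 45))
```

Under that assumed 99.9% target, the 45 minutes of checkout downtime quoted in the answer slightly exceeds the 43.2-minute budget, which is consistent with the budget being fully consumed.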
Follow-up: The team pushes back and says 'SLOs are just more process that slows us down.' How do you respond?
I would not argue with that framing because they are partly right: poorly implemented SLOs absolutely are bureaucratic overhead. The response:
“SLOs are not the goal. Shipping faster with confidence is the goal. Right now, we ship fast but then spend 30-60 minutes per incident, plus the investigation time, plus the customer trust damage. SLOs are a speedometer: they tell you whether you are going too fast for road conditions. Without a speedometer, you either drive too cautiously (shipping slowly because you are afraid of breaking things) or too recklessly (shipping fast and breaking things monthly). With a speedometer, you drive at the maximum safe speed, which is faster than cautious and more sustainable than reckless.”
Then I make it concrete: “Last month, we spent approximately 6 engineering days on incident response and follow-up. That is 30% of one engineer’s month. If SLOs and error budgets reduce incidents by half, which the data from the past quarter suggests they can, that is 3 engineering days per month redirected from firefighting to features. That is not more process. That is less wasted time.”
Follow-up: How do you prioritize reliability investments when every sprint is packed with feature work?
This is the fundamental tension, and pretending it does not exist is dishonest. The honest answer is that reliability competes with features for engineering time, and you need a mechanism to ensure reliability does not always lose.
Three mechanisms I have seen work:
The tax model. 20% of sprint capacity is ring-fenced for reliability, infrastructure, and tech debt. This is not negotiable: it is a “tax” on feature velocity that keeps the codebase healthy. The product manager plans features assuming 80% capacity, not 100%. Amazon and Google both use variants of this model.
The incident-driven model. Every P1 incident triggers a mandatory reliability sprint where the team spends the next sprint addressing the root cause and systemic issues. This is reactive but effective, because it creates a direct link between incidents and reliability investment. Teams that have frequent incidents spend more time on reliability, which is the right incentive.
The error budget model. When the error budget is healthy, ship features. When it is burning, shift to reliability. This is the SRE book approach and it is the most principled, but it requires having the SLO and error budget infrastructure in place first.
In practice, most teams need a blend. The tax model provides steady investment. The error budget model provides the alarm signal for when to ramp up. The incident-driven model provides urgency for the worst problems.
22. You are reviewing an architecture proposal where every microservice communicates synchronously via HTTP. The architect says “we have retries and circuit breakers everywhere, so it is resilient.” What concerns would you raise?
What the interviewer is really testing
Whether you understand that resilience patterns on top of a fundamentally fragile architecture do not make it resilient — they make it complex and fragile. This is a question about architectural judgment, specifically whether the candidate knows when synchronous communication is the wrong default.
What weak candidates say
“That sounds fine as long as the circuit breakers and retries are properly configured.” This accepts the premise without questioning it.
What strong candidates say
I would have several concerns, and they escalate from tactical to architectural.
Concern 1: Temporal coupling. When every service communicates synchronously, every service in a call chain must be healthy simultaneously for any single request to succeed. In a chain of 5 services, if each has 99.9% availability independently, the end-to-end availability is 0.999^5 ≈ 99.5%. That is about 3.6 hours of downtime per 30-day month, from services that are each individually “three nines.”
Retries and circuit breakers do not change this fundamental math. They improve recovery speed and prevent cascading failures, but they cannot make a synchronous chain more available than the product of its components.
Concern 2: Latency accumulation. Every synchronous hop adds latency. If each service call takes 50ms on a good day, a 5-service chain takes 250ms. Under load, when each call takes 200ms, the chain takes 1 second. If one service is slow, the entire chain is slow, and the calling service’s thread is held the entire time. Circuit breakers help by failing fast, but the user still gets an error after the timeout.
Concern 3: Resource coupling under load. In a synchronous architecture, a spike in user traffic propagates to every downstream service simultaneously. If the product catalog gets 10x traffic, every service it calls also gets 10x traffic. There is no buffering, no smoothing, no backpressure. Each service must be provisioned for peak load of every upstream service, which is expensive and wasteful during normal traffic.
Concern 4: Retries amplify the problem. With retries at every layer, one user request can generate dozens of downstream calls. If each of 5 layers makes up to 3 attempts per call, the deepest dependency can see 3^5 = 243 calls from a single user request in the worst case. The retries are not making the system more resilient; they are making the overload worse. This is the amplification problem I discussed earlier.
What I would recommend instead:
Default to asynchronous communication for anything that does not need a synchronous response. Order processing, email notifications, analytics events, inventory updates, search index updates: all of these can be done asynchronously via a message queue (SQS, Kafka, RabbitMQ). The producing service publishes an event and moves on. The consuming service processes it at its own pace. No temporal coupling. No latency accumulation. Natural backpressure via queue depth.
Use synchronous communication only when the user is waiting for the response. Fetching product details for a page render, checking authentication, processing a payment where the user needs immediate confirmation: these are legitimate synchronous calls.
Apply the CQRS pattern where possible. Reads and writes have different communication patterns. Reads can be served from materialized views or caches that are updated asynchronously. Writes can be accepted synchronously (acknowledgment) and processed asynchronously (execution).
For the services that remain synchronous, yes, add retries, circuit breakers, and bulkheads. But now you are applying these patterns to a smaller, targeted set of synchronous dependencies rather than an entire architecture built on synchronous communication.
The general rule: Synchronous = both parties must be healthy at the same time. Asynchronous = the producer and consumer can be healthy at different times. The more services in your architecture, the more important this distinction becomes.
War Story: At a company with about 25 microservices all communicating synchronously via REST, our average incident lasted 45 minutes and always involved at least 3 services. A slow database query in the order service would back up the checkout service, which would back up the API gateway, which would time out user requests. We spent 6 months migrating the non-user-facing communication paths to Kafka events. After the migration, an order service database issue only affected order processing latency: the checkout service published an “order requested” event and returned immediately. The user saw “order placed successfully” within 200ms. The order was actually processed 30 seconds later when the database recovered. Average incident blast radius dropped from 3+ services to 1 service. Incident duration dropped from 45 minutes to 12 minutes because we only had to fix one service, not untangle a cascade.
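The two numbers behind Concern 1 and Concern 4 are worth being able to reproduce on a whiteboard. This short Python sketch computes both, assuming independent failures, a 30-day (720-hour) month, and a worst case where every layer exhausts all of its attempts; the function names are hypothetical.

```python
def chain_availability(per_service: float, services: int) -> float:
    """End-to-end availability of a synchronous chain with independent failures."""
    return per_service ** services


def downtime_hours_per_month(availability: float, hours_in_month: float = 720.0) -> float:
    """Expected unavailable hours in a 30-day month."""
    return (1.0 - availability) * hours_in_month


def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest dependency if every layer uses all its attempts."""
    return attempts_per_layer ** layers


if __name__ == "__main__":
    a = chain_availability(0.999, 5)
    print(f"{a:.4%} available, {downtime_hours_per_month(a):.1f} h/month down")
    print(worst_case_calls(3, 5), "calls at the bottom of the chain")
```

The exponent in both formulas is the point: availability decays and retry load grows geometrically with chain depth, which is why shortening the synchronous path helps more than tuning any single hop.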
Follow-up: What are the trade-offs of moving to asynchronous communication? What new problems does it create?
The benefits of async are real, but so are the costs:
Eventual consistency. When the checkout service publishes “order placed” and the inventory service processes it asynchronously, there is a window where the order exists but inventory has not been decremented. If another user orders the same item in that window, you might over-sell. The fix is either reserving inventory synchronously before publishing the event (hybrid approach) or accepting eventual consistency and handling oversells as a business process (backorder, apology, refund).
Debugging complexity. In a synchronous system, a distributed trace shows the full call chain for a single request. In an async system, the trace breaks at every queue boundary. Correlating “this event was published” with “this event was consumed” requires explicit correlation IDs propagated through message headers. Without this, debugging becomes “the event went into the queue and… something happened eventually.”
Ordering guarantees. Synchronous calls have natural ordering: step 1 finishes before step 2 starts. Asynchronous messages can be processed out of order, especially with competing consumers. If “update address” and “ship order” events are processed in the wrong order, the order ships to the old address. You need either ordered partitions (Kafka partitioned by customer ID) or idempotent, commutative operations.
Operational complexity. Queues need monitoring (depth, consumer lag, DLQ size). Consumers need scaling independently of producers. Message schemas need versioning. DLQs need triage processes. This is real operational overhead that synchronous systems do not have.
The honest assessment: asynchronous communication trades immediate complexity (cascade failures, latency, temporal coupling) for deferred complexity (eventual consistency, ordering, operational overhead). For most teams, the trade is worth it because cascade failures are catastrophic while eventual consistency is manageable, but it is a trade, not a free lunch.
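The two remedies named under "Ordering guarantees" can be sketched briefly. The Python below shows key-based partitioning (so all events for one customer land on one partition and keep their order) and an idempotent consumer (so redelivered events are applied once). It is an illustration only: the hash used here is SHA-256 for simplicity and is not Kafka's actual partitioner, and the in-memory seen-set is an assumption that a real consumer would replace with a persistent, bounded store.

```python
import hashlib
from typing import Any, Callable, Set


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key (e.g. a customer ID) to a stable partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class IdempotentConsumer:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        # In-memory and unbounded for brevity; this would itself be a leak in production.
        self._seen: Set[str] = set()

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._handler(payload)
        self._seen.add(event_id)   # record only after the handler succeeds
        return True
```

Because the event ID is recorded only after the handler returns, a crash mid-handler leads to a retry rather than a silently dropped event, which is the at-least-once behavior most queues assume.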
23. Your company enforces a reliability policy: when the error budget is exhausted, non-critical deployments freeze until the budget replenishes. The PM for the highest-revenue product says “my feature launch cannot wait — the board is expecting it next week.” The error budget is at zero. What do you do?
What the interviewer is really testing
Two things: whether you understand that reliability policies exist to prevent exactly this kind of pressure from causing exactly the kind of incidents the policy was designed to prevent — and whether you have the organizational skill to navigate the conflict without either caving or becoming the “process police” that everyone routes around.
Strong answer (Senior level)
This is a governance test, not a technical one. The error budget policy was designed specifically for this moment: when the pressure to ship is highest, the risk of another incident is also highest because the system is already in a degraded state.
Step 1: Quantify the risk of shipping. “Our error budget is zero. That means any incident this month triggers SLA breach penalties. The feature launch touches the checkout flow, which is the service that burned the budget. If the launch causes even 5 minutes of degradation, we owe enterprise customers service credits. Here is the dollar estimate for a breach: $X.”
Step 2: Explore a scoped exception. Not all deploys carry equal risk. If the feature can be launched behind a feature flag with a 1% canary rollout, the blast radius of a failure is 1% of users for a short window, well within recoverable territory even at zero budget. Propose: “We ship the feature flagged to 1%, monitor for 24 hours, and ramp to 100% only if no SLI degradation is observed. If anything goes wrong, the flag is toggled off in under 60 seconds.”
Step 3: Document the decision. Whether the answer is “we ship with a canary” or “we freeze as policy dictates,” document the decision, who made it, and the risk assessment. This is not CYA; it is organizational learning. The next time this happens, the team can reference the precedent.
Step 4: Address the root cause. Why is the board expecting a feature launch in the same month the error budget burned? Either the reliability incidents were not communicated to leadership early enough, or the launch timeline did not account for reliability risk. After the immediate crisis, propose a change: “Feature launch timelines should include a ‘reliability health check’ gate at T-2 weeks. If the error budget is below 30%, the launch date gets a risk flag that goes to the PM and the VP.”
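The 1% canary in Step 2 depends on assigning users to the rollout deterministically, so the same user sees the same behavior on every request and a ramp from 1% to 10% only adds users. A minimal Python sketch of that bucketing follows; it is not the API of any specific feature-flag product, and the function name and bucket granularity are assumptions.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Deterministically decide whether a user is inside a percentage rollout.

    Users are hashed into 10,000 buckets (0.01% granularity). Including the
    flag name in the hash means different flags select different user subsets.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(percent * 100)
```

Setting `percent` to 0 acts as the kill switch: no bucket is below zero, so every user falls back to the old code path on their next request.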
Strong answer (Staff level) -- the systemic view
Everything in the senior answer is correct, but the staff-level response addresses why this conflict keeps happening and how to prevent it structurally.
The systemic problem: The error budget policy and the product roadmap are not connected. The PM committed to a board-facing launch date without consulting the error budget status. The engineering team burned the error budget without escalating the roadmap risk to the PM. Both sides operated independently, and the conflict was inevitable.
The structural fix:
Error budget status is a standing item in product planning. Every sprint planning meeting starts with a 30-second budget check: “Checkout service: 62% remaining. Payment service: 28% remaining, caution zone. Search: 91% healthy.” When the PM sees the payment service at 28% before committing to a launch date, they can either buffer the timeline or pre-invest in reliability to protect the budget.
Launch readiness includes a reliability gate. Major feature launches require a sign-off from the on-call lead or SRE: “Is the system healthy enough to absorb the risk of this launch?” This is not a veto; it is a risk assessment. The PM can override it, but they override with full visibility of the risk, not in ignorance.
Error budget burn triggers an automatic escalation to the PM. When budget drops below 30%, the PM gets an automated notification: “Your team’s error budget is in the caution zone. Feature freeze may be triggered within N minutes of additional downtime.” This eliminates the surprise of a freeze.
The meta-lesson: Every error budget policy conflict that escalates to “PM vs. engineering” is a symptom of disconnected planning. The policy should never be a surprise; it should be a visible, predictable constraint that shapes planning upstream, not a retroactive brake that creates conflict downstream.
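The 30-second budget check and the “N minutes of additional downtime” notification both reduce to one small calculation. This is a minimal sketch, assuming a time-based availability SLO over a 30-day window; `budget_status`, the zone names, and the downtime figures are illustrative assumptions chosen to reproduce the percentages quoted above.

```python
from dataclasses import dataclass

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


@dataclass
class BudgetStatus:
    remaining_fraction: float  # 1.0 means the budget is untouched
    minutes_left: float        # downtime the service can still absorb
    zone: str                  # "healthy", "caution", or "exhausted"


def budget_status(slo: float, downtime_minutes: float,
                  window_minutes: int = MINUTES_PER_30_DAYS,
                  caution_below: float = 0.30) -> BudgetStatus:
    """Classify a service's error budget for the planning meeting."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_budget = (1.0 - slo) * window_minutes
    minutes_left = max(total_budget - downtime_minutes, 0.0)
    remaining = minutes_left / total_budget
    if remaining == 0.0:
        zone = "exhausted"
    elif remaining < caution_below:
        zone = "caution"
    else:
        zone = "healthy"
    return BudgetStatus(remaining, minutes_left, zone)


# Illustrative downtime figures for the three services in the example.
for name, slo, downtime in [("checkout", 0.9995, 8.2),
                            ("payment", 0.9995, 15.5),
                            ("search", 0.999, 3.9)]:
    status = budget_status(slo, downtime)
    print(f"{name}: {status.remaining_fraction:.0%} remaining, "
          f"{status.minutes_left:.1f} min left, {status.zone}")
```

Feeding `minutes_left` into the automated notification gives the PM the concrete “N minutes” figure rather than an abstract percentage.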
Follow-up: What failure modes does the feature freeze itself create?
This is the counterintuitive part that most reliability advocates miss: feature freezes can make reliability worse in several ways.
Deployment atrophy. If a team does not deploy for 3 weeks during a freeze, the next deployment after the freeze is a “big bang” with 3 weeks of accumulated changes. That deploy carries about 15 working days of changes instead of one, so its blast radius is an order of magnitude larger than a normal daily deploy. Ironically, the post-freeze deploy is the riskiest deploy of the month.
Morale and velocity tax. Engineers who are blocked from shipping features during a freeze often feel punished for incidents they did not cause. If the budget burned because of an infrastructure issue outside the team’s control, the feature freeze feels unjust. Over time, this erodes buy-in for the error budget system.
Batch-and-burst pattern. Teams learn to front-load risky deploys at the start of the month when the error budget is full, creating an artificial deployment spike that itself increases risk. Then they coast toward the end of the month to preserve the remaining budget. This is the opposite of steady, continuous delivery.
Mitigations: Allow low-risk deploys (config changes, flag toggles, documentation) during freezes. Require freeze-period deploys to use canary rollouts with automated rollback. Keep the freeze window short, and use it to address the root cause of the budget burn rather than as a blanket prohibition.
Follow-up: How do you measure whether a feature freeze was the right call?
The measurement is retrospective: did the freeze period result in reliability improvements that prevented future incidents?
Track three things during and after the freeze:
Reliability items completed. How many post-mortem action items were closed? How many alerts were tuned? How many runbooks were updated? If the answer is “none — the team just waited for the budget to replenish,” the freeze was wasted time, not a reliability investment.
Post-freeze incident rate. Compare the incident rate in the 30 days after the freeze to the 30 days before. If the rate decreased, the reliability work during the freeze had impact. If it stayed flat, the team did not address the systemic causes.
Error budget consumption in the next window. If the team consistently exhausts the budget by week 2, a freeze alone is not solving the problem — the system is fundamentally under-reliable for its SLO. Either the SLO is too aggressive or the architecture needs investment, not just a deployment pause.
24. You are running a critical service at a company going through layoffs. Your team is halved. On-call rotation that had 6 people now has 3. Feature pressure has not decreased. How do you maintain reliability with half the team?
What the interviewer is really testing
Whether you can maintain engineering discipline under organizational stress — the exact moment when most teams abandon reliability practices because “we do not have time.” This is a question about resilience in the organizational sense, not just the technical sense.
What weak candidates say
“We should work harder and cover more on-call shifts.” This is the burnout path that leads to attrition of the remaining team members. Or: “We should deprioritize reliability until we hire replacements.” This is how incident spirals start.
What strong candidates say
I have been through this scenario, and the instinct to “just work harder” is exactly the wrong response. A team of 3 doing the work of 6 does not produce half the output; it produces 30% of the output at twice the error rate, because exhausted engineers make mistakes. Here is what I would actually do:
Immediate triage (Week 1): Reduce the surface area.
Audit and shed non-critical services. If the team owns 8 services, identify which ones are actively used and which are in maintenance mode. For maintenance-mode services, reduce the SLO, set up basic automated alerting, and stop active development. You cannot maintain 8 services with 3 people. You can maintain 4 well and 4 on life support.
Simplify the on-call rotation. With 3 people, a weekly rotation means each person is on call every 3 weeks. That is survivable. A 24/7 rotation is not. Implement “follow the sun” if the 3 people are in different time zones. If not, set up automatic escalation to a manager or VP after 30 minutes of an unacknowledged page — this is the safety net for when the on-call person is asleep or overwhelmed.
Raise the alerting threshold. Review every alert. If an alert requires human investigation but not immediate action, change it from PagerDuty to Slack. If an alert fires more than twice a week without requiring action, delete it. A team of 3 cannot afford alert fatigue. Only page for things that require immediate human intervention.
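The alert review in the last item above can be written down as an explicit rule so that every alert gets the same treatment. This is a sketch under stated assumptions: the `Alert` fields and the `route` function are hypothetical names, and the inputs would come from whatever history your paging tool exports.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    needs_immediate_action: bool  # must a human act right now?
    needs_investigation: bool     # worth a look during business hours?
    fires_per_week: float         # observed firing rate
    actioned_fraction: float      # share of firings that led to any action


def route(alert: Alert) -> str:
    """Return 'page', 'chat', or 'delete' following the triage rules above."""
    # Fires more than twice a week and never leads to action: pure noise.
    if alert.fires_per_week > 2 and alert.actioned_fraction == 0.0:
        return "delete"
    # Only page for things that require immediate human intervention.
    if alert.needs_immediate_action:
        return "page"
    # Needs a human eventually, but not at 2 AM: send it to chat.
    if alert.needs_investigation:
        return "chat"
    return "delete"


alerts = [
    Alert("checkout-5xx-spike", True, True, 0.5, 1.0),
    Alert("disk-70-percent", False, True, 1.0, 0.4),
    Alert("cpu-above-60", False, False, 9.0, 0.0),
]
for alert in alerts:
    print(f"{alert.name}: {route(alert)}")
```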
Short-term stabilization (Month 1) — Invest in automation, not process.
Automate the top 3 toil items. With half the team, the fastest ROI is eliminating the manual work that consumes disproportionate time. If the team spends 4 hours per week manually restarting a flaky service, that is 4 hours a smaller team cannot afford. Write the auto-restart script. Automate the deployment pipeline. Automate the DLQ replay for known-transient failures.
Implement feature flags on all new work. With reduced review capacity, every feature ships behind a flag with an instant kill switch. This reduces the blast radius of bugs that slip through thinner review coverage. The on-call person can disable a bad feature in 30 seconds instead of coordinating a rollback.
Reduce deployment frequency but increase deployment safety. Instead of deploying daily, deploy twice a week with mandatory canary rollouts. Fewer deploys means fewer potential incidents. Canary rollouts mean the incidents that do occur are caught at 1% traffic instead of 100%.
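The flag-plus-kill-switch pattern from the list above is small enough to sketch. The `FeatureFlags` class below is a hypothetical in-memory stand-in for a real flag service; it assumes users are bucketed by a stable hash so that a 1% rollout always hits the same users.

```python
import hashlib


class FeatureFlags:
    """In-memory flag store (illustrative); production systems use a shared service."""

    def __init__(self) -> None:
        self._rollout_percent: dict[str, int] = {}

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout_percent[flag] = percent

    def kill(self, flag: str) -> None:
        """Kill switch: turn the feature off for everyone without a deploy."""
        self._rollout_percent[flag] = 0

    def is_enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout_percent.get(flag, 0)  # unknown flags are off
        # Stable hash so a given user stays in the same bucket across requests.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < percent


flags = FeatureFlags()
flags.set_rollout("new-checkout", 1)   # canary: roughly 1% of users
exposed = sum(flags.is_enabled("new-checkout", f"user-{i}") for i in range(10_000))
print(f"users exposed at 1%: {exposed}")

flags.kill("new-checkout")             # on-call disables the feature
print(flags.is_enabled("new-checkout", "user-1"))
```

Because the kill switch is a data change rather than a rollback, the on-call engineer does not need a reviewer or a pipeline run to use it.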
Medium-term (Month 2-3) — Negotiate the scope.
Renegotiate SLOs with the business. “We previously committed to 99.95% on all services. With the current team size, we can maintain 99.95% on Tier 1 services and 99.9% on Tier 2. For Tier 3 services, we are moving to best-effort with a 99% target.” This is an honest conversation that most managers avoid. Having the conversation proactively is better than silently letting reliability degrade and having the conversation reactively after an outage.
Document everything obsessively. With 3 people, the bus factor is catastrophic. If one person gets sick for a week, the team loses 33% of its capacity and potentially 100% of its knowledge about a critical service. Write runbooks, document tribal knowledge, and pair on on-call incidents so knowledge is not siloed.
What I explicitly would not do:
I would not cancel post-incident reviews. A smaller team cannot afford repeat incidents. The 30-minute investment in a post-mortem saves hours in future incidents.
I would not skip code review. Shipping without review on a stressed team is how you create the incidents that consume the on-call time you do not have.
I would not promise the same output. Setting realistic expectations with leadership is a critical part of navigating this. “With 3 engineers, we can maintain reliability for our top 4 services and deliver 2 features per sprint, down from 5. Here are the 3 features I recommend deferring.” If leadership overrides this, document the risk.
War Story: After a layoff at a mid-stage startup, our platform team went from 5 to 2. We owned 6 services. We immediately put 2 services into “maintenance mode” with automated alerts only, no active development. We automated the deployment pipeline (which had been manual and took 45 minutes per deploy, an absurd time sink for 2 people). We raised alert thresholds so we were only paged for genuine P1 incidents. Over 3 months, we actually improved reliability on our top 4 services because the automation reduced human error, and the focused scope let us invest deeply in the services that mattered. Our incident count dropped from 3/month to 1/month. The 2 maintenance-mode services each had one incident during that period, which we handled reactively. When we eventually hired 2 more engineers, they inherited a more automated, better-documented system than what the original 5-person team had operated.
Follow-up: How does on-call quality degrade with a 3-person rotation, and what are the safety valves?
With 6 people, each engineer is on call roughly one week every 6 weeks, which is sustainable. With 3 people, it is one week every 3 weeks. That is at the edge of sustainability and depends heavily on incident volume.
The degradation pattern:
Week 1-4: The team manages. Energy is high, incidents are handled well.
Month 2-3: Fatigue accumulates. On-call responses slow. Judgment degrades. The on-call engineer starts making “quick fix” decisions that create tech debt because they do not have the energy for a proper fix at 2 AM.
Month 4+: Burnout. The best engineers start interviewing. You lose another person, and the 3-person rotation becomes 2 — which is unsustainable by any measure.
Safety valves:
Compensation and recognition. On-call frequency doubling without acknowledgment is demoralizing. At minimum: on-call stipend, comp time after heavy incident weeks, and explicit leadership recognition.
Escalation tiers. Not every alert needs the on-call engineer. Tier 1 (automated response): auto-scaling, auto-restart, auto-rollback. Tier 2 (delayed response): alert goes to Slack, engineer checks during business hours. Tier 3 (page): genuine P1 that requires immediate human intervention. Most incidents should be Tier 1 or 2.
On-call handoff protocol. With 3 people, the handoff must be thorough. Every handoff includes: current active incidents, ongoing investigations, upcoming risky deploys, and error budget status. A 15-minute handoff meeting prevents the incoming engineer from being surprised by context they missed.
The “one bad week” circuit breaker. If an engineer has more than 3 pages in a single on-call week, they get the next rotation off. Another engineer doubles up. This prevents the scenario where one person is burned out while the other two are fresh.
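The “one bad week” circuit breaker is easy to state and easy to get wrong in a spreadsheet, so it helps to write it as a rule. A minimal sketch, assuming a weekly round-robin rotation and that “next rotation off” means skipping one turn; `build_schedule` and the engineer names are hypothetical.

```python
from collections import deque


def build_schedule(engineers: list[str], pages_per_week: list[int],
                   page_limit: int = 3) -> list[str]:
    """Return who is on call each week, given the pages seen in each week.

    An engineer who takes more than `page_limit` pages in a week skips
    their next turn, and the next person in the queue covers it.
    """
    queue = deque(engineers)
    resting: set[str] = set()
    schedule: list[str] = []
    for pages in pages_per_week:
        # Anyone owed a rest gives up this turn and moves to the back.
        while queue[0] in resting:
            rested = queue.popleft()
            resting.discard(rested)
            queue.append(rested)
        on_call = queue.popleft()
        schedule.append(on_call)
        queue.append(on_call)
        if pages > page_limit:
            resting.add(on_call)
    return schedule


# Week 1 is a bad week (5 pages), so "ana" skips her turn in week 4.
print(build_schedule(["ana", "ben", "chi"], pages_per_week=[5, 1, 0, 2, 1]))
```

In this run the schedule comes out as ana, ben, chi, ben, chi: ben covers the week ana would otherwise have taken, which is the “doubles up” cost the team accepts in exchange for not burning one person out.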
Follow-up: What is the security risk dimension of running with a reduced team?
Security is the reliability investment that gets cut first under resource pressure, and it is the most dangerous one to cut.
Specific risks with a halved team:
Slower patching cadence. Security patches that used to be applied within 48 hours now wait 1-2 weeks because the team is prioritizing feature work and incident response. Every day a known CVE is unpatched is exposure.
Reduced code review depth. Reviewers are rushed. Security-sensitive code paths (authentication, authorization, input validation, crypto) get the same cursory review as UI changes. Injection vulnerabilities and access control gaps slip through.
Secret rotation stops. API keys and certificates that should be rotated quarterly are not rotated because nobody has time. Stale credentials accumulate.
Audit log monitoring lapses. Nobody is reviewing access logs, deployment logs, or data access patterns. An insider threat or a compromised credential could go unnoticed for weeks.
Mitigations that work with a small team:
Automate dependency vulnerability scanning (Dependabot, Snyk) and make critical CVE patches a deploy-day-zero requirement.
Add SAST (static analysis) to the CI pipeline — it catches common vulnerability patterns without requiring human review time.
Automate secret rotation with Vault or AWS Secrets Manager with auto-rotation policies.
Set up automated anomaly detection on audit logs — unusual access patterns trigger an alert rather than requiring manual review.
The key message: security debt compounds faster than technical debt because the downside is not “slower features” — it is “data breach.” A reduced team should increase security automation, not decrease security investment.
25. Walk me through how you would design an error budget policy that works across 5 teams, each owning 3-6 microservices, where some services are shared platform infrastructure used by all teams.
What the interviewer is really testing
Organizational reliability architecture. This is a staff-level question about governance, incentive design, and cross-team accountability. The technical implementation of error budgets is straightforward — the organizational implementation is where it breaks down.
Strong answer -- the organizational design
The fundamental challenge is that shared platform services create externalities: when the platform team’s service causes an incident, every consuming team’s error budget burns. Without proper attribution, consuming teams are punished for failures they did not cause, and the platform team is insulated from the impact of their own reliability gaps.
Tier 1: Service Classification and SLO Assignment
First, classify every service across all teams:
Platform services (auth, API gateway, service mesh, shared databases): These are blast radius multipliers. Their SLO must be stricter than that of any consuming service, typically 99.99% or higher. They are owned by the platform team, but their SLO is set by the strictest consumer. If the checkout team needs 99.95% and checkout depends synchronously on the auth service, auth must be at 99.99% or higher.
Business-critical services (checkout, payment, order management): 99.95% SLO. Error budget policy: deploy caution at 50%, freeze at 20%, full reliability sprint at 0%.
User-facing, degradable services (search, recommendations, reviews): 99.9% SLO. Error budget policy: caution at 30%, freeze at 10%.
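The rule that a platform SLO must be stricter than its strictest consumer follows from how availability composes across synchronous dependencies. The sketch below assumes independent failures and no fallback path, which is the pessimistic case; `composite_availability` and `required_own_availability` are illustrative helper names.

```python
from math import prod


def composite_availability(own: float, sync_dependencies: list[float]) -> float:
    """Availability seen by users when every synchronous dependency must succeed."""
    return own * prod(sync_dependencies)


def required_own_availability(target: float, sync_dependencies: list[float]) -> float:
    """How reliable the service's own code must be to still hit `target`."""
    return target / prod(sync_dependencies)


checkout_target = 0.9995

# Auth at the same SLO as checkout leaves checkout no budget of its own.
print(f"{required_own_availability(checkout_target, [0.9995]):.6f}")

# Auth at 99.99% leaves checkout roughly 0.04% for its own failures.
print(f"{required_own_availability(checkout_target, [0.9999]):.6f}")

# Two synchronous platform dependencies at 99.99% each eat more of the budget.
print(f"{composite_availability(0.9996, [0.9999, 0.9999]):.6f}")
```

Caching or a fallback path breaks the independence assumption in the consumer’s favor, which is why those patterns matter when the platform cannot reach the stricter target.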
Tier 2: Attribution Model
When a business-critical service’s error budget burns because a platform dependency failed, the budget burn is dual-attributed:
The platform team’s budget burns because their service was the root cause. This triggers their error budget policy.
The consuming team’s budget also burns from their users’ perspective. However, the consuming team is not penalized in performance reviews or planning for platform-caused burns. Their “controllable” error budget is tracked separately.
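Dual attribution amounts to keeping two ledgers per service. A minimal sketch, assuming each incident record already names a root-cause service and the services whose users were affected; the `Incident` fields and the “controllable” and “inherited” labels are illustrative, not a standard schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    root_cause_service: str
    affected_services: list[str]
    bad_minutes: float


def attribute(incidents: list[Incident]) -> dict[str, dict[str, float]]:
    """Split each service's budget burn into controllable and inherited minutes."""
    ledger: dict[str, dict[str, float]] = defaultdict(
        lambda: {"controllable": 0.0, "inherited": 0.0})
    for incident in incidents:
        # The owning team always carries the burn for its own root cause.
        ledger[incident.root_cause_service]["controllable"] += incident.bad_minutes
        for service in incident.affected_services:
            if service != incident.root_cause_service:
                # Users still felt it, but the consuming team is not penalized.
                ledger[service]["inherited"] += incident.bad_minutes
    return dict(ledger)


incidents = [
    Incident("auth", ["auth", "checkout", "search"], 10.0),
    Incident("checkout", ["checkout"], 4.0),
]
for service, burn in attribute(incidents).items():
    print(service, burn)
```

User-facing SLO reporting sums both columns; planning and performance conversations look only at the controllable one.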
This prevents two failure modes: the platform team not feeling the pain of their outages (no attribution) and the consuming team being punished for someone else’s failures (unfair attribution).
Tier 3: Governance Structure
Weekly reliability sync (15 minutes): Error budget status for all Tier 1 and Tier 2 services, displayed on a shared dashboard. Each team reports: budget remaining, trend direction, top risk for the coming week.
Monthly reliability review (45 minutes): Engineering director, all tech leads, and a product representative review: aggregate budget burn by category (deploy failure, dependency failure, capacity, configuration), cross-team attribution for shared-cause incidents, and reliability investment ROI from the prior month.
Quarterly SLO calibration: Review whether SLOs are correctly set. Services that never approach their budget get tightened. Services that consistently exhaust their budget either get investment or get their SLO relaxed (with business sign-off).
Tier 4: Conflict Resolution
The most common conflict: a product team wants to ship a feature, but the platform service their feature depends on has an exhausted error budget. The product team says “we are ready, the platform team is blocking us.” The platform team says “we are in a reliability sprint, we cannot take on additional load.”
Resolution protocol: The engineering director (or VP) makes the call, informed by: the business value of the feature launch (revenue impact, contractual commitment), the risk assessment from the platform team (probability and severity of another incident), and the cost of delaying (competitive pressure, customer commitment). The decision is documented and the risk is explicitly accepted.
The incentive design is critical. If reliability is measured only by incident count, teams game the system (lenient SLOs, reclassifying incidents). If reliability is measured by error budget consumption relative to SLO, the metric is harder to game because it is derived from automated SLI measurement, not human classification.
Follow-up: The platform team says their 99.99% SLO is unrealistic given their current architecture. How do you handle this?
This is one of the most productive conversations a reliability program can trigger, because it forces an honest assessment of what the architecture can actually deliver.
Step 1: Measure the current state. If the platform team says 99.99% is unrealistic, the first question is: what is their actual reliability over the last 6 months? If they are at 99.95%, the gap is 0.04%, roughly 17 minutes per month. That is a meaningful gap but not an insurmountable one.
Step 2: Identify the gap between current and target. What specific failure modes account for the 0.04% gap? In my experience, it is usually 2-3 specific issues: a deployment that takes the service down for 5 minutes (no canary), a dependency timeout that cascades (missing circuit breaker), and a capacity limit that is hit during peak hours (no autoscaling). Fixing those three issues might close the gap entirely.
Step 3: If the gap is architectural. If the platform service is a single-instance database or a non-replicated stateful service, 99.99% may genuinely be unrealistic without a significant re-architecture. In this case, the conversation shifts: “Given the current architecture, the best achievable SLO is 99.95%. Achieving 99.99% requires investment X over timeline Y. Is that investment justified by the business impact of the 0.04% gap?”
Step 4: Adjust the consuming teams’ architecture. If the platform service cannot deliver 99.99%, consuming teams must add resilience patterns (caching, circuit breakers, fallbacks) to tolerate the platform’s actual reliability. The consuming team’s SLO accounts for the platform’s limitations: “Our service targets 99.95%, which is achievable given that our auth dependency delivers 99.95% and we have a 5-second cached fallback for auth token validation.”
The meta-lesson: SLO-setting is an iterative negotiation between what the business needs, what the architecture can deliver, and what the team is willing to invest. SLOs that are set top-down without input from the teams that must deliver them will fail. SLOs that are set bottom-up without business context will be too lenient.
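The “roughly 17 minutes per month” figure in Step 1 comes from converting availability percentages into allowed downtime. A small sketch, assuming a 30-day month; `allowed_downtime_minutes` is an illustrative helper name.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for an availability SLO over `days` days."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return (100.0 - slo_percent) / 100.0 * days * 24 * 60


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}%: {allowed_downtime_minutes(slo):.2f} min per 30 days")

gap = allowed_downtime_minutes(99.95) - allowed_downtime_minutes(99.99)
print(f"gap between 99.95% and 99.99%: {gap:.2f} min per 30 days")
```

Expressing the gap in minutes makes Step 2 concrete: a single 5-minute deploy outage is already more than the entire monthly allowance at 99.99%.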
Follow-up: How do you handle the cost dimension -- who pays for the reliability infrastructure that serves all teams?
This is the organizational economics problem that kills shared platform reliability. Three models, each with trade-offs:
Model 1: Centralized platform budget. The platform team has its own budget for infrastructure (monitoring, CI/CD, shared databases). Consuming teams do not pay directly. Pro: Platform team can invest holistically. Con: No price signal, so consuming teams demand maximum reliability without bearing the cost, leading to over-investment.
Model 2: Chargeback/showback. Consuming teams are charged (or shown the cost) proportionally based on their usage of platform services. The checkout team, which drives 40% of platform load, sees 40% of the platform cost on their budget. Pro: A price signal, so teams think twice about demanding five nines when they see the infrastructure bill. Con: Accounting overhead, and it incentivizes teams to build their own infrastructure to avoid the chargeback, fragmenting the platform.
Model 3: Tiered service model. The platform offers two tiers: standard (included in the base platform cost, 99.9% SLO) and premium (additional cost, 99.99% SLO with dedicated capacity, faster incident response, priority in load shedding). Teams choose based on their business criticality. Pro: Self-selecting, because teams that need high reliability pay for it and teams that do not save the cost. Con: Complexity of maintaining two service tiers.
In practice, most organizations start with Model 1, grow into the over-investment problem, and evolve toward Model 3. The trigger for the transition is usually a budgeting conversation where a finance leader asks “why is our platform infrastructure cost growing 40% year-over-year while team count only grew 15%?”
26. You discover that your service is reliable in aggregate — 99.95% availability — but a small subset of users (those in a specific geographic region, or on a specific client version, or hitting a specific API endpoint) experience 95% availability. How do you find and fix this?
What the interviewer is really testing
Whether you understand that aggregate SLOs can hide catastrophic localized failures, the “averaging over misery” problem. A service that is 99.95% available overall but only 95% available for a small slice of users (the slice has to be under 1% of traffic for the aggregate to hold) is failing those users 100x worse than the SLO suggests. This tests analytical rigor and user empathy.
Strong answer
This is the “golden signals lie when averaged” problem, and it is more common than people realize. Aggregate metrics are inherently dangerous because they smooth over localized failures. An average latency of 200ms can hide a bimodal distribution where 95% of requests are 50ms and 5% are 3 seconds. An aggregate error rate of 0.05% can hide a region where the error rate is 5%, provided that region carries less than 1% of total traffic.
Detection: how to find localized reliability failures:
Step 1: Slice your SLIs along every user-relevant dimension. Instead of one availability number, compute availability per: geographic region (or cloud region/AZ), client version (mobile app version, SDK version), API endpoint, customer tier (free, paid, enterprise), and time-of-day window (peak vs. off-peak). Any slice that deviates significantly from the aggregate is a candidate for investigation.
Step 2: Set up per-slice alerting, not just aggregate alerting. An alert on “average error rate > 1%” will not fire when 5% of users have a 5% error rate and 95% have a 0% error rate (the average is 0.25%). Instead, alert on: “any region with error rate > 1%” or “any endpoint with p99 latency > 2x the global p99.”
Step 3: Use Real User Monitoring (RUM), not just server-side metrics. Server-side metrics show what your server experienced. RUM shows what the user experienced. A user on a slow mobile network, an old browser, or behind a corporate proxy may have a completely different experience than your server-side metrics suggest. RUM tools (Datadog RUM, New Relic Browser, Sentry) capture client-side errors, page load times, and JavaScript exceptions that server-side monitoring misses entirely.
Common root causes for localized reliability failures:
Region-specific infrastructure issues. One availability zone has degraded network connectivity. Or a regional CDN PoP is unhealthy. Or a database read replica in one region has replication lag causing stale reads that trigger application errors.
Client version skew. An old mobile app version has a bug that causes it to retry requests aggressively, overwhelming the service — but only for users who have not updated. Or a new API response format breaks parsing in older client versions.
Endpoint-specific performance cliffs. A specific search query pattern triggers a full table scan. A specific product category has 10x more items than others, causing pagination timeouts. These only affect users who hit that specific code path.
DNS/routing asymmetry. Users in a specific geography are being routed to a distant region due to a DNS misconfiguration or a BGP routing change, adding 200ms of latency that pushes them over the timeout threshold.
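The slicing and per-slice alerting from the detection steps can be sketched in a few lines. This assumes request logs are available as records with an `ok` flag and one field per dimension; `slice_stats`, `breaching_slices`, and the synthetic data are illustrative, and a real system would compute this in the metrics backend rather than in application code.

```python
from collections import defaultdict


def slice_stats(requests: list[dict], dimension: str) -> dict[str, tuple[float, int]]:
    """Map each value of `dimension` to (availability, request count)."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [successes, total]
    for request in requests:
        entry = counts[request[dimension]]
        entry[0] += int(request["ok"])
        entry[1] += 1
    return {value: (ok / total, total) for value, (ok, total) in counts.items()}


def breaching_slices(requests: list[dict], dimensions: list[str], slo: float,
                     min_requests: int = 50) -> list[tuple[str, str, float]]:
    """Return (dimension, value, availability) for every slice below the SLO."""
    breaches = []
    for dimension in dimensions:
        for value, (availability, total) in slice_stats(requests, dimension).items():
            # Ignore tiny slices where a single failure would dominate.
            if total >= min_requests and availability < slo:
                breaches.append((dimension, value, availability))
    return breaches


# Synthetic traffic: 1% of requests come from a region that fails 5% of the time.
requests = (
    [{"region": "us-east", "client": "v2", "ok": True}] * 9900
    + [{"region": "ap-southeast", "client": "v1", "ok": True}] * 95
    + [{"region": "ap-southeast", "client": "v1", "ok": False}] * 5
)

aggregate = sum(r["ok"] for r in requests) / len(requests)
print(f"aggregate availability: {aggregate:.4%}")
for dimension, value, availability in breaching_slices(requests, ["region", "client"], slo=0.9995):
    print(f"breach: {dimension}={value} at {availability:.2%}")
```

The aggregate sits exactly on the 99.95% target while two slices breach it, which is the situation the question describes.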
Fix: the stratified SLO model.
After identifying the localized failure, the structural fix is to track SLOs per dimension, not just in aggregate. At a minimum: one SLO per region, one SLO per customer tier, and separate SLOs for your top 5 highest-traffic endpoints. The aggregate SLO is the reporting metric; the stratified SLOs are the operating metrics that actually trigger action.
War Story: At a SaaS company, our aggregate API availability was 99.97%, well within our SLO. But we kept getting support tickets from customers in Southeast Asia complaining about timeouts. When we sliced our SLIs by region, we discovered that users routed through our Singapore PoP had 97.5% availability, an error rate roughly 80x worse than the aggregate suggested. The root cause was a third-party CDN node in Singapore that had degraded, adding 800ms of latency to every request, which pushed our responses over the client SDK’s 2-second timeout. The CDN provider’s global health dashboard showed green because the Singapore node was still responding, just slowly. We switched to a multi-CDN strategy for the APAC region, and Singapore availability went from 97.5% to 99.96% within a week.
Follow-up: How do you set SLOs for a global service where different regions inherently have different latency characteristics?
You cannot set one global latency SLO for a service that serves users in both Virginia and Mumbai. A 200ms p99 SLO is easy for Virginia (about 5ms to the nearest server) and close to impossible for Mumbai (roughly 150ms of physics-imposed network latency before the application does any work).
The approach: region-aware SLOs.
Define a base latency budget per region based on the minimum physical network round-trip time, then add your application’s processing budget to get that region’s SLO:
| Region | Network RTT (min) | Application Budget | Total SLO |
| --- | --- | --- | --- |
| US-East (local) | 5ms | 195ms | 200ms p99 |
| EU-West | 80ms | 195ms | 275ms p99 |
| APAC (Singapore) | 150ms | 195ms | 345ms p99 |
The application budget is the same everywhere: 195ms. If the application itself is slow, it shows up in every region. But the user-experienced latency differs by the network physics, and the SLO should reflect that reality.
For availability SLOs, the target should be the same across regions (99.95% everywhere). If a region has lower availability, that is an infrastructure problem to fix, not a target to relax.
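The region-aware latency SLO above is a fixed application budget plus a per-region network floor. A minimal sketch using the figures from the table; the region keys and the helper names `latency_slo_ms` and `meets_slo` are assumptions for illustration.

```python
APP_BUDGET_MS = 195  # same application budget in every region

# Minimum network round-trip time per region, taken from the table above.
REGION_RTT_MS = {
    "us-east": 5,
    "eu-west": 80,
    "apac-singapore": 150,
}


def latency_slo_ms(region: str) -> int:
    """p99 latency target for a region: network floor plus application budget."""
    return REGION_RTT_MS[region] + APP_BUDGET_MS


def meets_slo(region: str, measured_p99_ms: float) -> bool:
    return measured_p99_ms <= latency_slo_ms(region)


for region in REGION_RTT_MS:
    print(f"{region}: {latency_slo_ms(region)}ms p99")

# A 300ms p99 is a breach in EU-West but within target in Singapore.
print(meets_slo("eu-west", 300), meets_slo("apac-singapore", 300))
```

Keeping the application budget as a single constant is the design point: a regression in the application moves every region’s measured p99 at once, while a single region drifting points at the network or infrastructure.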
Follow-up: How does this change your rollout and rollback strategy?
Localized reliability failures change both how you deploy and how you revert.
Rollout: Instead of a global canary (1% of all traffic), use a regional canary: deploy to one region first, monitor that region’s SLIs for the stabilization window, then roll out to the next region. If a bug only manifests under specific network conditions or data distributions, a global 1% canary might miss it. A regional canary catches it because the entire region’s traffic exercises the code under that region’s specific conditions.
Rollback: A localized failure might only require a localized rollback. If the Singapore region is degraded after a deploy but all other regions are healthy, you can roll back Singapore to the previous version while keeping the new version running everywhere else. This requires your deployment infrastructure to support per-region versioning, which is straightforward with Kubernetes multi-cluster or regional deployment targets but requires planning.
Measurement: After rollout, do not just check the aggregate SLIs. Check every regional SLI and every endpoint SLI independently. A deploy that improves global p99 by 10% but degrades APAC p99 by 200% is not a good deploy; it just looks like one from the aggregate dashboard.
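The regional canary and localized rollback described above follow a simple control loop. This is a sketch, not a deployment tool: `regional_rollout` is a hypothetical orchestrator, and the `deploy`, `is_healthy`, and `rollback` callables stand in for whatever your pipeline and SLI checks actually provide.

```python
from typing import Callable


def regional_rollout(regions: list[str],
                     deploy: Callable[[str], None],
                     is_healthy: Callable[[str], bool],
                     rollback: Callable[[str], None]) -> dict[str, str]:
    """Deploy region by region; stop and roll back only the region that degrades."""
    status = {region: "previous" for region in regions}
    for region in regions:
        deploy(region)
        # In practice this check runs after the stabilization window.
        if not is_healthy(region):
            rollback(region)
            status[region] = "rolled_back"
            break  # later regions stay on the previous version
        status[region] = "new"
    return status


# Simulated run: the second region fails its SLI check.
running_new_version: list[str] = []
result = regional_rollout(
    ["us-east", "eu-west", "ap-southeast"],
    deploy=running_new_version.append,
    is_healthy=lambda region: region != "eu-west",
    rollback=running_new_version.remove,
)
print(result)
print(running_new_version)
```

Ordering the region list from lowest to highest traffic limits how many users see a bad build before the health gate trips.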
Interview: Product wants 99.99% availability, SRE wants 99.9%, finance wants to cut infra spend by 20%. You are the staff engineer in the room. How do you resolve this in one meeting?
Strong Answer Framework:
Step 1 - Reframe from targets to user outcomes: I would refuse to negotiate the number first. Instead, I would ask product: “What is the specific user journey where an extra nine matters, and what is the revenue impact of a 4-minute outage versus a 43-minute outage per month?” Nine out of ten times the PM cannot answer, which reveals that 99.99% is an aspiration, not a requirement. SLOs are negotiated in the currency of user pain and revenue, never in decimals.
Step 2 - Expose the cost curve: I would put the real numbers on the whiteboard. 99.9% to 99.95% often costs 1.5x because you need a warm standby and better alerting. 99.95% to 99.99% commonly costs 3-5x because you need multi-region active-active, synchronous replication, and a full follow-the-sun on-call rotation. Finance and product need to see that “one more nine” is not a rounding error; it usually at least doubles the reliability budget. Once the cost curve is visible, the conversation shifts from “we want 99.99%” to “we want 99.99% on the checkout path, 99.9% everywhere else.”
Step 3 - Propose a tiered SLO and a written trade: I would walk out with a signed document: checkout and auth are 99.99% with a dedicated budget; browse and search are 99.9% with a shared budget; internal tools are 99.5%. In return for giving product their checkout number, SRE gets an explicit error-budget-freeze policy (no feature launches for 2 weeks after a burn), and finance gets permission to retire the redundant DR region for the 99.5% tier. Everyone trades something, which is the only way these agreements survive the first incident.
Real-World Example:
Google’s SRE team formalized this exact pattern into the “error budget policy” document that every Google service owner signs. In 2016 the Google Cloud team famously used the error budget to freeze launches for Gmail for-business after a run of incidents burned through the monthly budget in week one, and the CFO accepted it because the cost of one more nine on Gmail had been made explicit upfront.
Senior Follow-up Questions:
“What if product refuses to accept a lower SLO and escalates to the CTO?” - Strong answer: Take it to the CTO with the cost curve, not with a complaint. Framing it as a capital allocation question (“do you want to spend 3M USD a year on one nine for search, or on the new ML platform?”) forces a real decision rather than an emotional one.
“Your error budget burns 80% in the first week of the month. What do you actually do on day 8?” - Strong answer: Trigger the pre-agreed freeze policy, shift the team to reliability work for the rest of the month, and publish a burn-down report. The policy only works if it is mechanical, not negotiated mid-incident.
“How do you stop SLOs from becoming a vanity metric that nobody looks at after Q1?” - Strong answer: Tie the SLO to the on-call rotation’s promotion and compensation conversation. If burning the budget does not change anyone’s week, the SLO is decoration.
Common Wrong Answers:
“Pick the middle ground, say 99.95%, and move on.” - Splitting the difference avoids the real conversation about user journeys and cost curves, and guarantees nobody is happy six months later.
“Promise 99.99% and figure out the budget later.” - Committing to a number you have not costed makes you the person who either overspends or misses the SLO; both end careers.
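Costing a number starts with knowing how much downtime it allows. The sketch below converts the three tiers from the framework above into allowed downtime per 30-day window. The function name and the 30-day window are assumptions chosen for illustration.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime a given availability SLO permits within the window."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return window_minutes * (100.0 - slo_percent) / 100.0


# Tier names and targets are taken from the tiered proposal in the text.
TIERS = {
    "checkout, auth": 99.99,
    "browse, search": 99.9,
    "internal tools": 99.5,
}

for name, slo in TIERS.items():
    minutes = allowed_downtime_minutes(slo)
    print(f"{name:15s} {slo:6.2f}%  {minutes:6.1f} min per 30 days")
```

The results (roughly 4.3, 43.2, and 216 minutes) are the "4-minute versus 43-minute" comparison from Step 1, plus the internal-tools tier.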
Further Reading:
Google SRE Workbook, Chapter 2 “Implementing SLOs” (Steven Thurgood and David Ferguson, with Alex Hidalgo and Betsy Beyer)
“Implementing Service Level Objectives” by Alex Hidalgo (2020)
Related chapter: performance-scalability.mdx on capacity cost modeling
Interview: Your VP wants you to start running chaos experiments in production. Half the team thinks this is reckless. How do you design the first 90 days without causing a real incident?
Strong Answer Framework:
Step 1 - Earn the right to break things: Before a single chaos experiment runs in production, the team must have three things in place: working SLOs with a real error budget, runbooks that have actually been executed (not just written), and a working rollback path measured in minutes, not hours. I would spend the first 30 days auditing these, not breaking things. Chaos engineering on a system without observability is not science, it is vandalism. Netflix spent years killing single instances with Chaos Monkey before Chaos Kong exercised a whole-region failure.
Step 2 - Start in the blast radius you can afford: Weeks 4-8 are for low-blast-radius experiments scoped to a single non-critical service, during business hours, with the on-call engineer in the same room, with a hard stop after 60 seconds. The first experiment I would run is killing a single pod of a stateless service that already has 3+ replicas behind a load balancer. If that causes user-visible pain, we learned something for free. If it passes, we escalate to AZ-level failures, then latency injection, then dependency failures. Every experiment has a written hypothesis (“we expect 0 user errors and recovery in under 30 seconds”) and a go/no-go gate before the next one.
Step 3 - Build the ritual, not just the tool: The hardest part is cultural. I would run “GameDay Fridays” where the whole team watches the experiment live on a shared dashboard, and we write a short doc afterwards: what we expected, what happened, what we learned. After 90 days, a chaos experiment should feel as routine as a deployment. The test of success is not how many experiments we ran, it is whether a junior engineer feels safe running one unsupervised.
Real-World Example:
Netflix built this playbook over a decade - Chaos Monkey (2011) only killed single instances, Chaos Kong (2014) simulated the loss of a whole AWS region, and the full Simian Army emerged only after years of hardening. The LinkedIn “Waterbear” program in 2017 followed the same staged approach and explicitly credited Netflix’s “start tiny, expand slowly” doctrine. The companies that skipped the ramp (several fintech teams in the 2018-2019 window) caused real customer outages and had chaos programs shut down by leadership.
Senior Follow-up Questions:
“The business is pushing back because a chaos experiment last month caused a 2-minute customer-visible blip. How do you defend the program?” - Strong answer: Reframe the blip as a controlled outage that cost 2 minutes and taught us about a real weakness, versus an uncontrolled outage that could have cost 2 hours and happened at 3 AM on a Saturday. If that trade is not defensible with data, the program is not ready.
“How do you run chaos on a stateful system like a primary database where you cannot just kill a replica?” - Strong answer: You do not kill the primary in production. You run failover drills on a production-shaped clone with production-shaped load, and in production you only inject non-destructive faults like added latency on read replicas, never faults that risk data loss or failed writes.
“One of your senior engineers refuses to participate and calls chaos engineering a vanity project. How do you handle it?” - Strong answer: Invite them to design the first experiment themselves, with full veto power. Skeptics who help design the rules become the strongest advocates because they trust the guardrails they wrote.
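The guardrails from Step 2 (a written hypothesis, a hard stop, and a go/no-go result) can be expressed as a small harness. This is a sketch under stated assumptions. `Experiment`, `run_experiment`, and the injected callables are hypothetical names, and the fault injection itself is left to whatever tool the team uses. The clock and sleep functions are parameters so the demo can run instantly on a simulated clock.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Experiment:
    name: str
    hypothesis: str        # written down before the run
    max_error_rate: float  # abort as soon as the observed rate exceeds this
    max_duration_s: float  # hard stop, e.g. 60 seconds


def run_experiment(exp: Experiment,
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   observe_error_rate: Callable[[], float],
                   poll_s: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True (go) if the hypothesis held until the hard stop, else False."""
    start = clock()
    try:
        inject()
        while clock() - start < exp.max_duration_s:
            if observe_error_rate() > exp.max_error_rate:
                return False  # hypothesis falsified: abort early
            sleep(poll_s)
        return True
    finally:
        rollback()  # always undo the fault: on success, abort, or exception


class SimulatedClock:
    """Stand-in clock so the demo does not wait 60 real seconds."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


sim = SimulatedClock()
events: List[str] = []
exp = Experiment(name="kill one pod of a stateless service",
                 hypothesis="0 user errors and recovery in under 30 seconds",
                 max_error_rate=0.0, max_duration_s=60.0)
passed = run_experiment(exp,
                        inject=lambda: events.append("inject"),
                        rollback=lambda: events.append("rollback"),
                        observe_error_rate=lambda: 0.0,
                        clock=sim.time, sleep=sim.sleep)
print(passed, events)
```

Putting the rollback in `finally` is the point of the design: the fault is removed whether the run passes, aborts, or crashes, which is the property a skeptical engineer is most likely to ask about.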
Common Wrong Answers:
“Install Gremlin or Chaos Mesh and start running experiments next week.” - The tool is the easy part; missing observability and rollback means the first experiment will be the last one leadership ever approves.
“Only run chaos in staging, it is too risky in prod.” - Staging does not have real traffic, real data distributions, or real dependencies; chaos in staging tells you almost nothing about production resilience.
Further Reading:
“Chaos Engineering” by Casey Rosenthal and Nora Jones (O’Reilly, 2020)
Netflix Tech Blog “The Netflix Simian Army” and “Chaos Kong” postmortems
Related chapter: reliability-principles.mdx earlier sections on failure mode analysis
Interview: Your team keeps burning 100% of the error budget every month and the SLO policy is ignored. Leadership asks you to fix it. Where do you start?
Strong Answer Framework:
Step 1 - Diagnose whether the SLO is the problem or the culture is: I would pull 6 months of burn data before changing anything. If the budget burns on the same root cause every month (say, a flaky deploy pipeline), the SLO is a symptom and the pipeline is the disease. If the budget burns on 12 different root causes, the team has a reliability-investment gap, not an SLO-calibration gap. If the budget burns in weeks 3-4 every month because a specific team ships risky features at end of sprint, the problem is release cadence, not reliability. The fix is different for each; conflating them is the most common staff-level mistake.
Step 2 - Make ignoring the budget expensive: An error budget only changes behavior if violating it costs something the team cares about. The mechanism I would install: missing the SLO for two consecutive months triggers an automatic feature freeze enforced by the deploy pipeline itself, not by manager discretion. The only way out is a written remediation plan approved by the skip-level. If the VP overrides the freeze to ship a feature, that override gets logged and reported to the CTO monthly. Policies that can be overridden quietly are policies that will be.
Step 3 - Rebuild trust by shipping visible reliability wins: In parallel with the enforcement change, I would pick the top 3 burn-causing root causes from the data and kill them in the next quarter. Concrete wins - like “we cut deploy-caused incidents from 8 per month to 1” - rebuild the team’s belief that the SLO is a tool, not a punishment. Error budgets fail culturally when engineers see them as blame mechanisms; they succeed when engineers see them as “the reason I am not paged at 2 AM.”
Real-World Example:
At Shopify around 2019, the platform team documented publicly how their checkout error budget was being ignored until they tied it directly to the Black Friday launch-freeze calendar - missing the budget in October meant your team could not ship new features in November, when every team wanted to ship. That one coupling - to a date engineers actually cared about - did more to change behavior than a year of SRE evangelism. Etsy’s 2015 “code freeze as consequence” policy followed a similar pattern.
Senior Follow-up Questions:
“What if the team genuinely cannot hit the SLO because it was set aspirationally, not based on current architecture?” - Strong answer: Lower the SLO and announce the lower number publicly, with a plan to raise it. A met 99.5% is better than an aspired-to and missed 99.95% because the met number is trustworthy and the missed number teaches the team to ignore SLOs.
“A director asks you to ‘pause the SLO’ for a critical launch. What do you say?” - Strong answer: Pausing an SLO is the same as deleting it. Instead, pre-allocate extra budget for the launch window by explicit written agreement, and document the trade; that is still accountable, whereas a pause is not.
“How do you get engineers to actually care about error budget burn when they already hate being on-call?” - Strong answer: Connect budget burn to on-call load with a public dashboard. When engineers see that “we burned 80% of the budget because of three deploys” directly maps to “the on-call person was paged nine times last week,” they internalize the SLO as self-interest, not management policy.
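The three-way diagnosis in Step 1 can be run against the burn data directly. The sketch below assumes each incident record carries a root cause, the week of the month it occurred in, and the budget minutes it consumed. The `Incident` shape, the `diagnose` function, and both thresholds (50% for a dominant cause, 70% for late-month burn) are illustrative assumptions, not values from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    month: int            # 1..6 within the review period
    week_of_month: int    # 1..4
    root_cause: str
    budget_minutes: float


def diagnose(incidents: Iterable[Incident],
             dominant_share: float = 0.5,
             late_share: float = 0.7) -> str:
    """Classify where the error budget goes, following Step 1's three cases."""
    records = list(incidents)
    total = sum(i.budget_minutes for i in records)
    if total == 0:
        return "no burn recorded"

    by_cause: Counter = Counter()
    for i in records:
        by_cause[i.root_cause] += i.budget_minutes
    top_cause, top_minutes = by_cause.most_common(1)[0]
    if top_minutes / total >= dominant_share:
        return f"single root cause: {top_cause}"

    late = sum(i.budget_minutes for i in records if i.week_of_month >= 3)
    if late / total >= late_share:
        return "release cadence: burn clusters in weeks 3-4"

    return "reliability-investment gap: burn spread across many causes"


# Hypothetical burn history for illustration only.
history = [
    Incident(1, 4, "flaky deploy pipeline", 30.0),
    Incident(2, 3, "flaky deploy pipeline", 25.0),
    Incident(2, 1, "expired certificate", 10.0),
    Incident(3, 4, "flaky deploy pipeline", 28.0),
]
print(diagnose(history))
```

The dominant-cause check runs first on purpose: a single recurring cause that happens to fire late in the month should be fixed at the source rather than treated as a cadence problem.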
Common Wrong Answers:
“Just lower the SLO so we stop missing it.” - Lowering to make numbers look good without changing underlying reliability trains the team that SLOs are negotiable downward, destroying the whole mechanism.
“Add a Slack alert when the budget is 50% burned.” - Alerts without consequences become noise; the team will mute the channel by week three.
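By contrast with an alert, the freeze from Step 2 has a consequence built in. A minimal sketch of such a gate follows, as a deploy pipeline might call it. `FreezeGate` and its fields are hypothetical names, and letting reliability fixes through during a freeze is an added assumption that the text does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FreezeGate:
    slo_met_by_month: List[bool]             # oldest first, one entry per completed month
    remediation_plan_approved: bool = False  # written plan signed off by the skip-level
    override_log: List[Dict[str, str]] = field(default_factory=list)

    def frozen(self) -> bool:
        """Freeze after two consecutive missed months, until a plan is approved."""
        missed_two_in_a_row = (len(self.slo_met_by_month) >= 2
                               and not any(self.slo_met_by_month[-2:]))
        return missed_two_in_a_row and not self.remediation_plan_approved

    def allow_deploy(self, change_type: str,
                     override_by: Optional[str] = None,
                     reason: str = "") -> bool:
        if not self.frozen():
            return True
        if change_type == "reliability":
            return True  # assumption: the freeze blocks features, not fixes
        if override_by is not None:
            # Overrides are permitted but never silent; this log feeds the monthly report.
            self.override_log.append(
                {"by": override_by, "reason": reason, "change_type": change_type})
            return True
        return False


gate = FreezeGate(slo_met_by_month=[True, False, False])
print(gate.allow_deploy("feature"))      # blocked by the freeze
print(gate.allow_deploy("reliability"))  # fixes still ship
print(gate.allow_deploy("feature", override_by="vp-eng",
                        reason="contractual launch"))
print(gate.override_log)
```

Because the override path appends to a log rather than bypassing the gate, a quiet override is not possible, which is the property Step 2 asks for.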
Further Reading:
Google SRE Workbook, Chapter 5 “Alerting on SLOs” and Chapter 8 “On-Call”
“Seeking SRE” edited by David Blank-Edelman (O’Reilly, 2018), especially the Etsy and Dropbox chapters
Related chapter: capacity-git-pipelines.mdx on release policy and freeze windows