Part V — Reliability, Resilience, and Availability
Reliability is not about preventing failures — it is about choosing which failures to tolerate. Every system fails. The senior engineer’s job is to decide how much failure is acceptable (SLOs), invest proportionally (error budgets), and build systems that degrade gracefully rather than collapse catastrophically. The core insight: reliability is an economic decision, not a technical one.

Chapter 8: Reliability (The SRE Perspective)
This chapter draws heavily from the principles in Google’s Site Reliability Engineering book. Reliability is not about preventing all failures — it is about defining acceptable failure rates and investing appropriately.

When Reliability Fails: The Stories That Changed the Industry
Before we dive into SLOs and error budgets, let’s look at what happens when reliability goes wrong — because these stories are the reason all the theory in this chapter exists.

The Amazon S3 Outage (2017) — A Typo That Broke the Internet
Cloudflare's Regex Outage (2019) — One Bad Regular Expression, Global Impact
Facebook/Meta's BGP Outage (2021) — When DNS and BGP Cascade Together
Netflix and Chaos Monkey — Breaking Things on Purpose to Prevent Real Outages
8.1 SLOs, SLAs, SLIs, and Error Budgets
These three terms sound similar and are often confused — even by experienced engineers. Here is the precise distinction, grounded in one concrete example so the relationship is unmistakable:
- SLI (Service Level Indicator): A measurement of system behavior from the user’s perspective. It is a number you observe. Example: “Over the last 30 days, 99.2% of checkout API requests completed in under 200ms.”
- SLO (Service Level Objective): A target you set for your SLI — the threshold you commit to internally. It is a goal your team agrees to meet. Example: “99.5% of checkout API requests must complete in under 200ms over any rolling 30-day window.”
- SLA (Service Level Agreement): A contract between you and your customer, with explicit consequences if breached. It is a legal or business commitment. Example: “If checkout API availability drops below 99.0% in a calendar month, affected customers receive a 10% service credit.”

Notice the hierarchy: SLI is what you measure (99.2%). SLO is what you aim for (99.5%). SLA is what you promise externally with penalties (99.0%). The SLA is always less aggressive than the SLO, because you want your internal target to catch problems before they become contractual violations. If your SLO and SLA are the same number, you have zero safety margin — every near-miss becomes a breach.

Error Budget: If your SLO is 99.9% availability, you have a 0.1% downtime budget — 43.2 minutes per month. When the budget is healthy, ship features aggressively. When it is burning, slow down and invest in reliability. Error budgets are the bridge between product velocity and reliability.

The Nines of Availability
Each additional nine is roughly 10x harder and more expensive to achieve. Most services should target 99.9% and invest the saved engineering effort in features.

| Availability | Downtime / Month | Downtime / Year | Typical Use Case |
|---|---|---|---|
| 99% (“two nines”) | 7.2 hours | 3.65 days | Internal tools, dev environments |
| 99.9% (“three nines”) | 43.2 minutes | 8.76 hours | Most SaaS products, APIs |
| 99.95% | 21.6 minutes | 4.38 hours | E-commerce, business-critical apps |
| 99.99% (“four nines”) | 4.3 minutes | 52.6 minutes | Payment systems, core infrastructure |
| 99.999% (“five nines”) | 26 seconds | 5.26 minutes | Telecom, life-safety systems |
Error Budget Policy in Practice
An error budget policy defines what happens when the budget is burning too fast. Here is how it works concretely:
- Measurement window: Rolling 30-day window. The SLO is 99.9% availability (error budget = 43.2 minutes of downtime).
- Budget remaining > 50%: Ship freely. Feature velocity is the priority.
- Budget remaining 20-50%: Caution. All deployments require a rollback plan. No risky migrations.
- Budget remaining < 20%: Freeze non-critical deployments. Engineering effort shifts to reliability improvements (fixing flaky tests, adding retries, improving monitoring).
- Budget exhausted (0%): Full feature freeze. Only reliability work and critical security patches until the budget replenishes in the next window.
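The arithmetic and the policy tiers above can be sketched in a few lines of Python (a hedged sketch: the function names are my own, the thresholds come from the policy above):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime for the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60

def policy_action(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to the policy tiers described above."""
    if budget_remaining_fraction > 0.50:
        return "ship freely"
    if budget_remaining_fraction >= 0.20:
        return "caution: rollback plan required"
    if budget_remaining_fraction > 0:
        return "freeze non-critical deploys"
    return "full feature freeze"

# A 99.9% SLO over a 30-day window allows 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```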
Interview Question: How do you set SLOs for a new service?
Follow-up: The backend team says 99.99% availability but the frontend team says 99.9% is fine. Who wins?
Scenario: Your service has consumed its entire monthly error budget in the first week. What do you do? Who do you talk to?
- Immediate triage. Determine why the budget burned so fast. Was it a single catastrophic incident (a bad deploy that caused 30 minutes of downtime) or a slow bleed (elevated error rates over several days)? The response differs. Pull up the SLI dashboards and correlate the budget burn with specific events in the deploy log or dependency status.
- Activate the error budget policy. This is why you pre-negotiate the policy. With the budget exhausted in week one, the policy should mandate a feature freeze for the remainder of the 30-day window. Only reliability improvements and critical security patches ship. Communicate this to the team immediately — not as a punishment, but as the agreed-upon protocol.
- Who to talk to:
- The on-call/SRE lead — to confirm the budget status and validate the root cause analysis.
- The product manager — to invoke the error budget policy. The PM needs to know that feature work is paused and why. This is a collaborative conversation, not a decree.
- Engineering leadership — if the PM pushes back on the freeze, escalate per the pre-agreed escalation path. This is exactly the scenario the policy was designed for.
- The team — reorient the sprint. What reliability investments will prevent this from happening again? Fix the root cause, add missing alerts, improve rollback speed, or add canary deployments.
- Longer-term. After the window resets, conduct a retrospective. Was the SLO set correctly? Was the budget burn caused by something preventable (missing canary, no rollback plan) or something structural (the service is fundamentally under-provisioned)? Adjust the SLO, the deployment process, or the infrastructure accordingly.
8.2 Toil
Toil (from the SRE book) is repetitive, manual, automatable, tactical work that scales linearly with service size. Responding to the same alert every week, manually provisioning accounts, manually rotating secrets — all toil. The SRE principle: Keep toil below 50% of an engineer’s time. Invest the other 50% in automation that eliminates toil. If a task must be done more than twice, automate it.

8.3 Risk and Reliability Investment
Not all services need the same reliability. A marketing landing page can tolerate more downtime than a payment processing system. The SRE approach: quantify the cost of unreliability (lost revenue, user trust, SLA penalties) and invest proportionally. The reliability cost curve: going from 99% to 99.9% might require adding Redis and a second database replica. Going from 99.9% to 99.99% might require multi-region deployment, automated failover, and a dedicated SRE team. Going from 99.99% to 99.999% might require custom infrastructure, consensus protocols, and years of hardening. Each nine costs roughly 10x more than the last. The engineering question is always: what is the cost of an additional nine vs. the cost of not having it?

Interview Question: A product manager asks you to guarantee 99.999% uptime for a new feature. How do you respond?
Chapter 9: Resilience Patterns
9.1 Retry with Exponential Backoff and Jitter
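A minimal Python sketch of the pattern this section describes, with exponential backoff, full jitter, and a retry cap (the TransientError type and the default limits are illustrative):

```python
import random
import time

class TransientError(Exception):
    """Stands in for timeouts, 503s, and network errors."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry an idempotent operation on transient failure, with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            # Full jitter spreads retries so many callers do not synchronize.
            time.sleep(random.uniform(0, backoff))
```

Permanent errors (400s, 404s) should not be mapped to TransientError; they must propagate immediately rather than be retried.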
Retry only transient failures (timeouts, 503s, network errors), not permanent ones (400s, 404s). Use exponential backoff with jitter. Set a maximum retry count. Ensure retried operations are idempotent.

9.2 Circuit Breaker
The circuit breaker pattern prevents cascade failures and gives failing services time to recover. It operates as a state machine with three states:

Circuit Breaker State Machine
- CLOSED (normal): All requests pass through to the downstream service. Failures are counted. When the failure count exceeds the threshold (e.g., 5 consecutive failures), the breaker trips and transitions to OPEN.
- OPEN (failing fast): All requests are immediately rejected with an error (or a fallback response) without calling the downstream service. This protects the failing service from additional load and protects the caller from waiting on timeouts. After a recovery timeout period (e.g., 30 seconds), the breaker transitions to HALF-OPEN.
- HALF-OPEN (testing recovery): A limited number of requests are allowed through as a test. If they succeed (meeting a success threshold, e.g., 3 consecutive successes), the breaker transitions back to CLOSED. If any request fails, the breaker immediately returns to OPEN and the recovery timer resets.
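The state machine above can be sketched as a small Python class (a simplified, single-threaded sketch; the thresholds mirror the examples above):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0      # consecutive failures while CLOSED
        self.successes = 0     # consecutive successes while HALF_OPEN
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # recovery timer elapsed: allow probes
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()                   # any probe failure reopens immediately
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"      # recovery confirmed
                self.failures = 0
        else:
            self.failures = 0              # success resets the consecutive counter

    def _trip(self):
        self.state = "OPEN"
        self.failures = 0
        self.opened_at = time.monotonic()
```

A production breaker would add thread safety, metrics, and per-endpoint configuration; this sketch only shows the state transitions.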
Scenario: A downstream dependency starts returning 500s intermittently. Your circuit breaker opens. But business says we MUST serve traffic. What is your strategy?
- Acknowledge the tension. The circuit breaker is doing its job — protecting your service from a failing dependency. But “rejecting all requests” is not acceptable to the business. The answer is not “disable the circuit breaker” (that would cascade the failure into your service) — the answer is graceful degradation with fallbacks.
- Implement a fallback strategy. When the circuit breaker is open, instead of returning an error to the user, serve a degraded experience:
- If the dependency is a recommendation engine: serve a static “Popular Items” list from cache or a pre-computed fallback.
- If the dependency is a pricing service: serve the last-known-good prices from cache, with a “prices as of X minutes ago” disclaimer.
- If the dependency is critical-path (like payment processing): queue the request for retry, show the user “your order is being processed,” and process it asynchronously when the dependency recovers.
- Use feature flags to disable non-essential UI components that depend on the failing service.
- Tune the circuit breaker; do not disable it. Consider adjusting the half-open behavior to allow a higher percentage of probe requests through, so recovery is detected faster. But do not increase the failure threshold to the point where the breaker never trips — that defeats the purpose.
- Communicate. Set up a status page update. Alert the dependency team. If the degraded experience has business impact (e.g., stale prices might cause revenue loss), loop in the product owner to make the cost-benefit decision.
9.3 Timeout Patterns
Every external call needs a timeout. Without one, a slow dependency hangs your thread/connection indefinitely. Types:
- Connection timeout (how long to wait for the TCP handshake — typically 1-5 seconds)
- Read/response timeout (how long to wait for the response — depends on expected operation time)
- Overall request timeout (end-to-end deadline, including retries)

9.4 Bulkhead Pattern
Isolate components so failure in one does not affect others. Named after ship bulkheads that contain flooding to one compartment.

Concrete example: your service calls both a fast internal database (5ms) and a slow third-party API (500ms). Both share a single thread pool of 50 threads. When the third-party API starts timing out at 30 seconds, all 50 threads get stuck waiting for it. Now your fast database queries also fail — not because the database is slow, but because there are no free threads.

Fix: separate thread/connection pools. Give the database calls their own pool of 30 threads and the third-party API its own pool of 20 threads. When the API hangs, only its 20 threads are consumed. Database calls continue working normally.

Bulkhead in a Real E-Commerce Service
Consider an e-commerce backend that calls three downstream services:

| Dependency | Thread Pool Size | Timeout | Priority |
|---|---|---|---|
| Payment service | 25 threads | 5s | Critical — revenue path |
| Search service | 15 threads | 2s | Important — but degradable |
| Recommendation engine | 10 threads | 1s | Nice-to-have — can show “Popular Items” fallback |
Bulkheads can be applied at several levels of isolation:
- Thread pool isolation — separate pools per dependency
- Connection pool isolation — separate database connection pools for critical vs non-critical queries
- Process isolation — separate services or containers
- Infrastructure isolation — separate Kubernetes namespaces with resource quotas per team or service tier
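Thread-pool isolation, the first level above, can be sketched with one bounded executor per dependency (pool sizes and timeouts are taken from the table above; the wiring is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency: a hang in one cannot starve the others.
POOLS = {
    "payment": ThreadPoolExecutor(max_workers=25, thread_name_prefix="payment"),
    "search": ThreadPoolExecutor(max_workers=15, thread_name_prefix="search"),
    "recommendations": ThreadPoolExecutor(max_workers=10, thread_name_prefix="recs"),
}

TIMEOUTS = {"payment": 5.0, "search": 2.0, "recommendations": 1.0}

def call_dependency(name, operation):
    """Run `operation` on the dependency's own pool, bounded by its timeout."""
    future = POOLS[name].submit(operation)
    # If the dependency hangs, only its own pool's threads are tied up,
    # and the caller gets a TimeoutError instead of waiting forever.
    return future.result(timeout=TIMEOUTS[name])
```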
9.5 Graceful Degradation and Fallbacks
Provide reduced functionality rather than complete failure. The goal: protect the critical path while letting non-critical features fail silently. Concrete fallback examples:
- Database is slow -> show cached data (stale but available)
- Recommendation engine is down -> show “Popular Products” (static, pre-computed)
- Review service is unavailable -> hide the reviews section (product page still works)
- Payment service timeout -> queue the payment for retry, tell the user “processing”
- Search service overloaded -> show category browsing instead
- CDN is down -> serve directly from origin (slower but functional)
9.6 Dead Letter Queues (DLQ)
Messages that fail after maximum retries go to a DLQ for investigation. Without one, a poison message (a message that always fails processing) blocks the entire queue. DLQ processing workflow:
- Monitor DLQ depth — alert when > 0 messages (or > threshold for noisy systems).
- Investigate: read the message payload, check error logs with the correlation_id, determine if the failure is transient (dependency was down) or permanent (malformed data, bug in consumer logic).
- Fix: if transient — replay messages from DLQ back to the main queue. If permanent — fix the consumer bug, deploy, then replay. If truly unprocessable — move to a permanent failure store and alert the business.
- Automate: set up a DLQ consumer that logs message details, sends alerts, and provides a UI for manual replay.
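The workflow above can be sketched with an in-memory queue pair (a hedged sketch; a real system would use its broker's DLQ support in SQS, RabbitMQ, or Kafka):

```python
from collections import deque

MAX_RETRIES = 3
main_queue: deque = deque()
dlq: deque = deque()

def process_with_dlq(handler):
    """Drain the main queue; repeatedly failing messages land in the DLQ
    instead of blocking everything behind them."""
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = repr(exc)  # kept for investigation
            if message["attempts"] >= MAX_RETRIES:
                dlq.append(message)            # park the poison message
            else:
                main_queue.append(message)     # transient? retry later

def replay_dlq():
    """After a consumer fix is deployed, move DLQ messages back for reprocessing."""
    while dlq:
        message = dlq.popleft()
        message["attempts"] = 0
        main_queue.append(message)
```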
9.7 Health Checks
Liveness (/health): Is the process running? Keep it simple — return 200 if the process is alive. Do NOT check dependencies here. If you check the database in your liveness probe and the database goes down, Kubernetes restarts all your pods — making the outage worse (you now have zero application capacity AND the database is down).
Readiness (/ready): Can this instance handle traffic right now? Check: database connection works, cache is reachable, any warmup (loading config, building in-memory indexes) is complete. When readiness fails, the instance is removed from the load balancer — no traffic is routed to it, but it is not killed.
Startup probes (Kubernetes): For applications that take a long time to start (JVM warmup, large model loading), use a startup probe with generous timeouts. Without it, Kubernetes may kill your pod during startup because the liveness probe fails during the warmup period.
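The liveness/readiness split above can be sketched as two framework-agnostic handlers (the dependency checks are stand-ins for real probes):

```python
def check_database() -> bool:
    """Stand-in for a real connection check (e.g., SELECT 1)."""
    return True

def check_cache() -> bool:
    """Stand-in for a cache ping."""
    return True

warmup_complete = True  # set once config is loaded / indexes are built

def liveness() -> tuple:
    # Deliberately trivial: the process is alive. No dependency checks here,
    # or a database outage turns into a pod restart storm.
    return (200, "alive")

def readiness() -> tuple:
    # Dependency and warmup checks belong here: a failing readiness probe
    # only removes this instance from the load balancer; it is not killed.
    if check_database() and check_cache() and warmup_complete:
        return (200, "ready")
    return (503, "not ready")
```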
The Kubernetes documentation on configuring liveness, readiness, and startup probes covers the key tuning parameters (initialDelaySeconds, periodSeconds, failureThreshold) and common anti-patterns to avoid. It is essential reading before configuring probes in production.

9.8 Chaos Engineering
Deliberately inject failures to test resilience: kill instances, introduce network latency, simulate dependency outages. The goal is to find weaknesses before they cause real incidents.

The Chaos Engineering Process
Chaos engineering is not just “randomly breaking things.” It follows a disciplined scientific method:
- Define steady state. Establish measurable indicators of normal system behavior — e.g., “p99 latency < 200ms, error rate < 0.1%, orders per minute > 500.” This is your baseline.
- Form a hypothesis. “If we terminate 1 of 3 application instances, the load balancer will redistribute traffic and steady state will be maintained within 30 seconds.”
- Introduce a real-world failure. Kill the instance, inject network latency, saturate CPU, drop packets, corrupt DNS responses.
- Observe the difference. Compare actual system behavior against the steady-state hypothesis. Did latency spike? Did errors increase? How long until recovery?
- Fix or accept. If the system handled it gracefully, increase the blast radius next time. If it did not, you found a weakness — fix it and retest.
Modern Chaos Engineering Tools
| Tool | Environment | Key Strength |
|---|---|---|
| Chaos Monkey (Netflix) | Cloud VMs | The original — randomly terminates instances |
| Litmus | Kubernetes-native | CRD-based chaos experiments, GitOps-friendly, large experiment hub |
| Gremlin | Any (SaaS) | Enterprise-grade, controlled blast radius, safety controls |
| AWS Fault Injection Simulator | AWS | Native AWS integration, targets EC2/ECS/RDS/etc. |
| Chaos Mesh | Kubernetes | CNCF project, fine-grained pod/network/IO fault injection |
| Toxiproxy (Shopify) | Any | Simulates network conditions (latency, bandwidth, timeouts) at the TCP level |
Chapter 10: Availability and Disaster Recovery
10.1 High Availability
Redundancy at every layer: multiple application instances, database replicas, multi-zone deployment. No single point of failure. Automated failover. The HA checklist:
- Application: multiple instances behind a load balancer, health checks, graceful shutdown
- Database: primary with synchronous replica, automated failover (RDS Multi-AZ, Cloud SQL HA)
- Cache: Redis Sentinel or Redis Cluster for automatic failover
- DNS: multiple providers, low TTL for fast failover
- Load balancer: managed (cloud) or active-passive pair
- Secrets: replicated secret store (Vault with HA backend)
Interview Question: How would you design a system that survives an entire availability zone failure?
10.2 RTO and RPO
Recovery Time Objective (RTO): how long can the system be down? A 1-hour RTO means you must restore service within 1 hour of failure. Recovery Point Objective (RPO): how much data can you lose? A 5-minute RPO means you must be able to restore data to within 5 minutes of the failure. RPO determines backup frequency and replication lag tolerance.

10.3 Disaster Recovery Strategies
| Strategy | RTO | Cost | Complexity |
|---|---|---|---|
| Backup and restore | Hours | Lowest | Low — back up to another region, restore when needed |
| Pilot light | 10-30 minutes | Low-Medium | Keep core infra running (DB replica), spin up compute on demand |
| Warm standby | Minutes | Medium-High | Scaled-down full system in secondary region, faster failover |
| Multi-region active-active | Seconds | Highest | Full system in multiple regions simultaneously, requires data sync and conflict resolution |
Curated Reading: Reliability and Resilience
These resources represent the best thinking on reliability engineering, incident response, and building resilient systems. Organized from foundational to advanced.

Essential Reading — Start Here
- Google SRE Book (free online) — The foundational text. Chapters on SLOs, error budgets, toil, and incident response are required reading for any engineer working on production systems. Read chapters 1-4 and 28 first, then explore based on your area of focus.
- “How Complex Systems Fail” by Richard Cook — A short paper (only 4 pages) originally written about medical systems, but every sentence applies to software. Its core insight: complex systems are always running in a partially broken state, and safety is a property of the whole system, not individual components. This paper will change how you think about incidents.
- Charity Majors’ Blog on Observability — The best writing on observability, SLOs, and what it actually means to operate software in production. Start with her posts on “observability vs monitoring” and “SLOs are the API for your engineering organization.” She cuts through buzzwords with unusual clarity.
Intermediate — Deepening Your Practice
- Netflix Tech Blog — Tagged: Chaos Engineering — First-hand accounts from the team that invented chaos engineering. Their posts on Chaos Monkey, the Simian Army, and failure injection testing are essential reading for understanding how to build confidence in distributed systems.
- AWS Architecture Blog — Resilience — Detailed write-ups on resilience patterns (retry, circuit breaker, bulkhead, cell-based architecture) with AWS-specific implementation details. Particularly valuable for understanding how cloud-native resilience differs from traditional HA approaches.
- Gergely Orosz’s “The Pragmatic Engineer” on Incidents — Orosz writes about engineering culture at scale, and his coverage of major incidents (including the Facebook BGP outage and Cloudflare’s outages) provides the organizational and human context that purely technical write-ups miss. His analysis of how companies handle postmortems is especially valuable.
Advanced — Becoming a Reliability Leader
- The Site Reliability Workbook (free online) — The practical companion to the SRE book. Where the SRE book explains the philosophy, the Workbook shows implementation with real examples, sample SLO documents, error budget policies, and on-call procedures.
- Release It! by Michael Nygard (2nd edition) — War stories from production systems combined with stability patterns (circuit breaker, bulkhead, timeout) and anti-patterns (cascading failure, blocked threads, unbounded result sets). The narrative style makes it both instructive and entertaining.
- Learning from Incidents in Software — A community and collection of resources applying resilience engineering and human factors research to software operations. Goes beyond “what broke” to examine how organizations learn (or fail to learn) from incidents.
Part VI — Software Engineering Principles
Chapter 11: Foundational Principles
11.1 SOLID
SRP (Single Responsibility Principle)
A class has one reason to change. Not “a class does one thing” — it means “a class serves one actor/stakeholder.” If the Invoice class changes when the accounting rules change AND when the PDF rendering changes, it has two reasons to change. Split it: InvoiceCalculator (accounting logic) and InvoiceRenderer (PDF generation). Group things that change together, separate things that change for different reasons.
Code smell it prevents: Shotgun Surgery. When a single business change (e.g., “add a discount field”) forces you to modify 5 different files because one class was handling too many concerns, SRP is being violated. The fix is not “make smaller classes” — it is “group things that change for the same reason.”
BAD — one class with two reasons to change:
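A hedged sketch of what that looks like, using the Invoice example from the text (line-item structure and strings are illustrative; the GOOD half shows the split):

```python
# BAD: one class, two reasons to change (accounting rules AND PDF layout).
class Invoice:
    def __init__(self, line_items):
        self.line_items = line_items

    def total(self):              # changes when accounting rules change
        return sum(qty * price for qty, price in self.line_items)

    def render_pdf(self):         # changes when the PDF layout changes
        return f"<pdf>Total: {self.total()}</pdf>"

# GOOD: each class serves one stakeholder and changes for one reason.
class InvoiceCalculator:
    def total(self, line_items):
        return sum(qty * price for qty, price in line_items)

class InvoiceRenderer:
    def render_pdf(self, total):
        return f"<pdf>Total: {total}</pdf>"
```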
OCP (Open/Closed Principle)
Open for extension, closed for modification. When a new payment method is added, you should add a new class (StripePaymentProcessor), not modify existing code. Strategy pattern and polymorphism enable this. Code smell it prevents: the ever-growing if/elif chain. When adding a new feature means modifying existing, working, tested code — adding another elif branch to a function that already has 12 branches — OCP is being violated. Every modification to working code risks introducing a regression. The fix: design so that new behavior is added by creating new classes, not editing old ones.
BAD — modifying existing code for every new payment method:
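A hedged sketch (the payment methods and return strings are illustrative; only StripePaymentProcessor is named in the text):

```python
# BAD: every new payment method edits this working, tested function.
def process_payment_bad(method, amount):
    if method == "card":
        return f"card charged {amount}"
    elif method == "paypal":
        return f"paypal charged {amount}"
    # ...adding Stripe means yet another elif here, touching tested code
    raise ValueError(f"unknown method: {method}")

# GOOD: new behavior arrives as a new class; existing code is untouched.
class PaymentProcessor:
    def charge(self, amount):
        raise NotImplementedError

class CardPaymentProcessor(PaymentProcessor):
    def charge(self, amount):
        return f"card charged {amount}"

class StripePaymentProcessor(PaymentProcessor):
    def charge(self, amount):
        return f"stripe charged {amount}"

def process_payment(processor: PaymentProcessor, amount):
    return processor.charge(amount)   # closed for modification
```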
LSP (Liskov Substitution Principle)
Subtypes must be substitutable for their base types without breaking behavior. Code smell it prevents: isinstance checks and surprise side effects. When you see code littered with if isinstance(obj, SpecificSubclass) before calling methods, or when a subclass method silently changes behavior that callers depend on, LSP is being violated. The contract of the base type is broken, and downstream code cannot trust polymorphism anymore.
BAD — classic violation (Square extends Rectangle):
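A hedged sketch of the violation (the stretch helper is my own, added to make the broken contract observable):

```python
# BAD: Square "is-a" Rectangle breaks the base type's contract.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def set_width(self, width):
        self.width = width

    def set_height(self, height):
        self.height = height

    def area(self):
        return self.width * self.height

class Square(Rectangle):
    def set_width(self, width):       # surprise side effect: also changes height
        self.width = self.height = width

    def set_height(self, height):
        self.width = self.height = height

def stretch(rect: Rectangle):
    """Callers rely on the Rectangle contract: width and height are independent."""
    rect.set_width(4)
    rect.set_height(5)
    return rect.area()   # any honest Rectangle returns 20

# stretch(Rectangle(2, 3)) == 20, but stretch(Square(2, 2)) == 25:
# Square is not substitutable for Rectangle, so it should not subclass it.
```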
ISP (Interface Segregation Principle)
Split large interfaces into focused ones. Code smell it prevents: NotImplementedError and dead methods. When a class is forced to implement methods it cannot support — raising NotImplementedError or returning None for methods that do not apply — the interface is too fat. Callers cannot trust the interface because some methods are traps. The fix: split the interface so each implementor only promises what it can deliver.
BAD — fat interface forces unused implementations:
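A hedged sketch (the store classes and their methods are illustrative):

```python
# BAD: a fat interface forces FileStore to fake a method it cannot support.
class DataStore:
    def read(self, key): raise NotImplementedError
    def write(self, key, value): raise NotImplementedError
    def run_transaction(self, ops): raise NotImplementedError  # databases only

class FileStore(DataStore):
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)
    def write(self, key, value):
        self.data[key] = value
    def run_transaction(self, ops):
        raise NotImplementedError("files have no transactions")  # a trap

# GOOD: split interfaces; each implementor promises only what it delivers.
class Readable:
    def read(self, key): raise NotImplementedError

class Writable:
    def write(self, key, value): raise NotImplementedError

class SimpleFileStore(Readable, Writable):
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)
    def write(self, key, value):
        self.data[key] = value
```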
DIP (Dependency Inversion Principle)
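A hedged sketch of the violation this section describes and its fix, using an email sender as the injected dependency (all names are illustrative):

```python
# BAD: the high-level policy hard-wires a low-level detail.
class SmtpEmailClient:
    def send(self, to, body):
        return f"SMTP -> {to}: {body}"    # imagine a real network call here

class OrderServiceBad:
    def __init__(self):
        self.email = SmtpEmailClient()    # untestable without a real mail server

    def place_order(self, user):
        return self.email.send(user, "order placed")

# GOOD: depend on an abstraction; inject the implementation.
class EmailSender:                        # the abstraction
    def send(self, to, body):
        raise NotImplementedError

class SmtpSender(EmailSender):
    def send(self, to, body):
        return f"SMTP -> {to}: {body}"

class FakeSender(EmailSender):            # trivial to use in unit tests
    def __init__(self):
        self.sent = []
    def send(self, to, body):
        self.sent.append((to, body))
        return "fake"

class OrderService:
    def __init__(self, email: EmailSender):
        self.email = email                # swap SMTP, SES, or a fake freely

    def place_order(self, user):
        return self.email.send(user, "order placed")
```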
Depend on abstractions, not concrete implementations. Code smell it prevents: untestable code and vendor lock-in. When unit tests require spinning up a real database, real Stripe API, or real email server because the class directly instantiates concrete clients, DIP is being violated. The fix: inject an abstraction. You get testability (mock the interface) and flexibility (swap implementations) for free.

Real-World SOLID Example
A notification service originally had one class that decided, formatted, and delivered. Adding Slack and SMS meant modifying the core class every time. Refactored with a NotificationChannel interface (ISP, DIP), separate implementations for Email/Slack/SMS, and a NotificationRouter (SRP). Adding a new channel means adding a class, not modifying one (OCP).
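The refactor described above might look like this (interface and class names follow the text; the routing logic is illustrative):

```python
class NotificationChannel:                 # ISP/DIP: one small interface
    def deliver(self, user, message):
        raise NotImplementedError

class EmailChannel(NotificationChannel):
    def deliver(self, user, message):
        return f"email to {user}: {message}"

class SlackChannel(NotificationChannel):
    def deliver(self, user, message):
        return f"slack to {user}: {message}"

class SmsChannel(NotificationChannel):
    def deliver(self, user, message):
        return f"sms to {user}: {message}"

class NotificationRouter:                  # SRP: routing only, no delivery logic
    def __init__(self, channels):
        self.channels = channels           # e.g. {"email": EmailChannel(), ...}

    def notify(self, channel_name, user, message):
        return self.channels[channel_name].deliver(user, message)

# OCP: adding a new channel means adding a class plus one registry entry;
# the router and the existing channels are never edited.
```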
11.2 DRY, KISS, YAGNI
DRY: Single authoritative representation of each piece of knowledge. But DRY is about duplicate knowledge, not duplicate code. Two functions with identical code but different concepts should not share an abstraction — that creates coupling.

DRY vs WET: When Duplication Is the Right Call
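An illustrative sketch of duplicate code that is not duplicate knowledge (the validation rules are invented for the example):

```python
# These two functions are textually identical today...
def validate_username(value):
    return 3 <= len(value) <= 20

def validate_project_name(value):
    return 3 <= len(value) <= 20

# Premature DRY: merging them couples two unrelated business rules.
# The day product decides project names may be 50 characters, a shared
# validate_name() forces a parameter, a flag, or a breaking change:
def validate_name(value, max_length=20):   # the abstraction starts leaking
    return 3 <= len(value) <= max_length

# Keeping the two "duplicate" functions lets each rule evolve independently:
# they encode different knowledge that merely looks the same right now.
```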
WET stands for “Write Everything Twice” (or “We Enjoy Typing”). The conventional wisdom is that DRY is always good. The nuance: premature DRY creates coupling between things that should evolve independently.

11.3 Separation of Concerns, Cohesion, and Coupling
Cohesion measures how related the responsibilities within a module are. High cohesion: an EmailService that handles composing, sending, and tracking emails. Low cohesion: a Utils class with string formatting, date parsing, and HTTP helpers (unrelated responsibilities dumped together).
Coupling measures how much one module depends on another. Loose coupling: OrderService publishes an OrderPlaced event; EmailService subscribes and sends a confirmation — neither knows about the other. Tight coupling: OrderService directly calls EmailService.sendOrderConfirmation(order) — changing the email service’s interface requires changing the order service.
The goal at every level: High cohesion within modules (everything serves one purpose), loose coupling between modules (changes in one rarely require changes in another). This applies to functions, classes, packages, services, and entire systems. When you feel a change “rippling” through many files, coupling is too high. When you struggle to understand where code belongs, cohesion is too low.
11.4 Technical Debt
Track it explicitly, quantify its impact (“this adds 2 days to every payment feature”), prioritize strategically (fix what actively slows you down), budget time for reduction, and prevent new debt through reviews. Types of technical debt:
- Deliberate: “we know this is a shortcut but need to ship by Friday”
- Inadvertent: “we did not know there was a better pattern”
- Bit rot: code decays as the world around it changes
- Dependency debt: outdated libraries with known vulnerabilities
Interview Question: How do you convince a product team to invest in paying down technical debt?
Scenario: You inherit a codebase where every class has 1000+ lines and violates SRP. How do you prioritize the refactor without stopping feature work?
- Do not attempt a Big Bang rewrite. The codebase is working in production. A full rewrite is the highest-risk, longest-duration option, and it has a terrible track record. Joel Spolsky called this “the single worst strategic mistake that any software company can make.” Instead, adopt an incremental approach.
- Identify the hot spots. Not all 1000-line classes are equally painful. Use two metrics to prioritize:
  - Change frequency — run git log --format=format: --name-only | sort | uniq -c | sort -rn to find which files change most often. A 1000-line class that has not been touched in a year is low priority. A 1000-line class that gets modified in every sprint is urgent.
  - Bug density — which classes are associated with the most production incidents or bug reports? Cross-reference change frequency with defect rate. Files that change often and break often are your top targets.
- Apply the Strangler Fig pattern to classes. When you need to add a feature that touches a bloated class, extract the new functionality into a clean, well-tested class. Then extract closely related existing functionality into that new class. Over time, the old class shrinks as responsibilities migrate outward. You never stop feature work — you just do the feature work in new, clean modules and route calls through them.
- Set a “boy scout” rule for the team. Every pull request that touches a bloated class must leave it slightly better — extract one method, split one responsibility, add one test for untested behavior. No PR makes the class worse. Incremental improvement compounds over time.
- Timebox and track. Allocate 15-20% of sprint capacity to refactoring, focused on the hotspot list. Track the results: measure change-failure rate and cycle time for features touching refactored modules. Show the product team that refactored areas are delivering faster.
Curated Reading: Software Engineering Principles
Foundational Texts
- Martin Fowler’s Articles on Refactoring and SOLID — Fowler’s website is a treasure trove. Start with his articles on Refactoring, the SOLID principles, and his bliki entries on coupling, cohesion, and design patterns. His writing is precise, opinionated, and grounded in decades of consulting experience. The article on Is High Quality Software Worth the Cost? is particularly useful for convincing product teams.
- A Philosophy of Software Design by John Ousterhout — A short, opinionated book that challenges conventional wisdom. Ousterhout argues that deep modules (simple interfaces, complex implementations) are better than shallow ones, and that the most important goal of software design is reducing complexity. Read this if you find “Clean Code” too prescriptive.
- “How Complex Systems Fail” by Richard Cook — Yes, this appears in the reliability section too. It belongs here as well because its insights on how complexity emerges from seemingly simple components apply directly to software architecture. Understanding that “complex systems run in degraded mode” changes how you think about code quality.
Advanced Practice
Advanced Practice
- Gergely Orosz’s “The Pragmatic Engineer” — Covers engineering culture, career growth, and how decisions get made at scale. His deep dives on incident response culture and how different companies approach technical debt are informed by his experience at Uber and other high-scale companies.
- Working Effectively with Legacy Code by Michael Feathers — If you are facing the “1000-line class” scenario from the interview question above, this book is your tactical manual. Feathers provides specific techniques for getting untested code under test, breaking dependencies, and refactoring safely when you have no safety net.
Reliability in Practice: The Pre-Ship Checklist
Every section in this chapter connects to a single question: “Is this safe to ship?” Before deploying any feature, change, or migration to production, walk through this checklist. Print it. Tape it to your monitor. Make it part of your team’s PR template.
Before Shipping Any Feature, Ask:
- What is the SLO? What availability, latency, or correctness target does this feature fall under? If there is no SLO for this service, stop and define one before shipping. You cannot know if a change is safe if you have not defined what “safe” means.
- What is the rollback plan? Can you revert this change in under 5 minutes? Is it a code rollback, a feature flag toggle, or a database migration rollback? If the rollback requires a manual database fix or a multi-step process, that is a red flag — simplify the rollback before shipping.
- What alerts fire if this breaks? Which dashboards will show the problem? Which PagerDuty/Opsgenie alert will wake someone up? If the answer is “none” or “we will notice when users complain,” you are shipping blind. Add alerting before the deploy, not after the incident.
- What is the blast radius? If this goes wrong, who is affected? All users? Users in one region? Users on one plan? 1% of traffic (canary) or 100% (big bang)? The blast radius determines how carefully you roll out and how aggressively you monitor.
- Is this change idempotent and retry-safe? If the deploy fails halfway, can you safely re-run it? If a message gets processed twice, does it produce the correct result? Non-idempotent changes in distributed systems are ticking time bombs.
- Have the failure modes been tested? Not just “does it work when everything is fine” but “what happens when the database is slow, the downstream API returns 500s, or the cache is cold?” See the Testing chapter for how to test failure paths, not just happy paths.
- Who is on call and do they know this deploy is happening? The worst time to learn about a risky deploy is when you are paged at 2 AM with no context. Communicate deploy timing, expected impact, and rollback instructions to the on-call engineer before you ship.
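The idempotency item on the checklist deserves a concrete illustration. Below is a minimal sketch of retry-safe message handling; the class and method names are hypothetical, and the in-memory set stands in for what would be a durable deduplication store (a database table or Redis set) in a real system.

```python
class IdempotentConsumer:
    """Sketch of a consumer that is safe under duplicate delivery."""

    def __init__(self):
        # Stand-in for a durable dedup store; in production this must
        # survive restarts, or the idempotency guarantee is lost.
        self._processed = set()
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> None:
        # A message we have already seen is a no-op, so a retry or
        # re-delivery never double-applies the side effect.
        if message_id in self._processed:
            return
        self.balance += amount
        self._processed.add(message_id)


consumer = IdempotentConsumer()
consumer.handle("msg-1", 50)
consumer.handle("msg-1", 50)  # duplicate delivery: no double-charge
```

The key design choice is recording the message ID and applying the side effect in one atomic step (a single transaction, in a real store); if they can be separated by a crash, duplicates slip through.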
Cross-Chapter Map: Where Reliability Connects
Reliability does not live in isolation. It is the thread that runs through every other engineering discipline. Here is how the topics in this chapter connect to the rest of the guide:

| This Chapter’s Topic | Connects To | Why It Matters |
|---|---|---|
| SLIs and SLOs | Observability | You cannot manage SLOs without metrics, dashboards, and alerts that track your SLIs in real time |
| Error Budgets | Deployment | Error budget status determines whether you ship aggressively (canary) or freeze deploys |
| Circuit Breakers and Retries | Testing | Resilience patterns must be tested — inject failures in integration tests to verify fallbacks work |
| Chaos Engineering | Incident Response | Chaos experiments produce findings that update runbooks and incident playbooks |
| Graceful Degradation | Deployment (Feature Flags) | Feature flags are the kill switches that enable graceful degradation in production |
| Health Checks | Deployment (Kubernetes) | Liveness and readiness probes determine how your orchestrator manages your service lifecycle |
| RTO/RPO | Databases | Recovery objectives drive your replication strategy, backup frequency, and failover architecture |
| SOLID Principles | Testing | Well-structured code (DIP, SRP) is testable code — violations make unit testing nearly impossible |
| Technical Debt | Observability | Debt shows up as increasing cycle time, change-failure rate, and incident frequency — measure it |
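The “resilience patterns must be tested” row is worth making concrete. Here is a minimal sketch of a failure-injected test: the function and argument names are hypothetical, and the injected `TimeoutError` simulates a slow or failing downstream dependency so the fallback path is actually exercised rather than assumed to work.

```python
def fetch_recommendations(api_call, fallback):
    """Return personalized results, degrading to a static fallback on error."""
    try:
        return api_call()
    except Exception:
        # Graceful degradation: serve something useful instead of failing.
        return fallback


def flaky_api():
    # Injected failure standing in for a slow or dead downstream API.
    raise TimeoutError("downstream API timed out")


degraded = fetch_recommendations(flaky_api, fallback=["top-sellers"])
healthy = fetch_recommendations(lambda: ["personalized"], fallback=["top-sellers"])
```

The test asserts two things: the happy path returns real results, and the injected failure returns the fallback. A test suite that only covers the first assertion tells you nothing about how the system behaves on its worst day.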