Part XXIII — Multi-Tenancy
Chapter 30: Multi-Tenant Architecture
Big Word Alert: Noisy Neighbor. In a multi-tenant system, a “noisy neighbor” is a tenant whose heavy usage degrades performance for other tenants sharing the same infrastructure. One tenant running a massive report saturates the database, making all other tenants’ queries slow. This is the central challenge of multi-tenant architecture — balancing cost efficiency (sharing resources) with isolation (protecting tenants from each other).
Tools: PostgreSQL Row-Level Security (database-level tenant isolation). Citus (distributed PostgreSQL with tenant-aware sharding). AWS SCP, Azure Policies (tenant-level cloud resource governance). Kubernetes namespaces + ResourceQuotas (infrastructure-level isolation).
Analogy: The Apartment Building. Multi-tenancy is like an apartment building — everyone shares the structure (the foundation, the plumbing, the elevator), but each unit has its own lock and you NEVER want to accidentally walk into the wrong apartment. The cheapest building crams everyone onto one floor with thin walls (shared schema) — cost-effective but noisy. The mid-tier building gives each tenant their own floor (separate schemas) — better isolation but the elevator is still shared. The luxury building gives each tenant their own wing with a private entrance (separate databases) — maximum privacy, maximum cost. The building manager’s hardest job? Making sure tenant A’s spare key never accidentally opens tenant B’s door. That is the entire discipline of multi-tenant architecture in one sentence.
How Salesforce Built the Most Successful Multi-Tenant SaaS Platform
In the early 2000s, when enterprise software meant million-dollar Oracle licenses, rack-mounted servers, and 18-month deployment cycles, Marc Benioff and his team at Salesforce made a bet that seemed reckless: they would run every customer — from a 5-person startup to a Fortune 500 bank — on the same shared infrastructure. Not separate instances. Not separate databases. The same tables, the same application servers, the same everything. The industry said it was impossible. Enterprise customers would never trust their data to a shared environment. The compliance requirements alone would kill it. And the technical challenges were staggering: how do you prevent one customer’s massive report from crippling the system for everyone else? How do you handle a customer who needs custom fields, custom objects, and custom workflows without polluting the shared schema? Salesforce solved it by building a metadata-driven architecture. Instead of creating physical database tables for each customer’s custom objects, they stored metadata that described the schema, and the platform interpreted that metadata at runtime. Every customer’s data lived in the same set of large, generic tables (imagine columns named Value0 through Value500), with a metadata layer that mapped those generic columns to customer-specific field names like Annual Revenue or Lead Score. This approach meant Salesforce could onboard a new customer in minutes (just insert metadata rows), deploy updates to every customer simultaneously (one codebase, one deployment), and scale to hundreds of thousands of tenants without the operational nightmare of managing hundreds of thousands of separate database instances.
The result? Salesforce became the poster child for multi-tenant SaaS, grew to over $30 billion in annual revenue, and proved that multi-tenancy — done right — is not a compromise but a competitive advantage. The lesson for engineers: the hardest part of multi-tenancy is not the isolation model you pick on day one. It is the tenant context propagation, the noisy neighbor mitigation, and the metadata flexibility that you build over years. Salesforce did not get it right immediately. They iterated relentlessly, and their architecture today looks nothing like their v1. But the core bet — shared infrastructure with logical isolation — never changed.
How Slack Evolved from Single-Tenant to Multi-Tenant
Slack’s early architecture tells a different multi-tenancy story — one of pragmatic evolution rather than upfront design. When Stewart Butterfield’s team pivoted from their failed game Glitch to build Slack in 2013, they made the reasonable startup decision: each workspace (team) got its own isolated MySQL shard. This was effectively a separate-database-per-tenant model. It was simple, provided strong isolation, and was perfectly fine when Slack had hundreds of teams. Then Slack exploded. By 2015, they had hundreds of thousands of active teams. The separate-shard-per-tenant model that had been a strength became a liability. Provisioning new shards was slow. Schema migrations had to be rolled out across thousands of database instances. Cross-workspace features (like Slack Connect, which lets users from different companies share channels) were architecturally painful because data lived in completely separate databases. Operational overhead was enormous — monitoring, backups, and failover multiplied by the number of shards. Slack’s engineering team spent years incrementally migrating toward a more consolidated architecture, introducing shared services, moving certain data to centralized stores (like their move to Vitess for MySQL clustering), and building abstraction layers that could route queries to the right shard transparently. They did not rip and replace — they evolved. The lesson here is crucial: your isolation model is not a permanent decision. It is a starting point. What matters is that you design your tenant context propagation layer cleanly enough that you can change the underlying isolation model without rewriting your entire application. Slack’s ability to evolve was a direct result of having a clean abstraction between “which tenant is this?” and “which database holds their data?”
30.1 Isolation Models
Shared DB, shared schema: All tenants in the same tables, distinguished by a tenant_id column. Cheapest. Requires rigorous filtering.
Shared DB, separate schemas: Each tenant has own schema/namespace. Better isolation. Migrations must apply to every schema.
Separate DB per tenant: Maximum isolation. Most expensive. Complex at scale (hundreds of databases).
Isolation Model Comparison
| Factor | Shared DB / Shared Schema | Shared DB / Separate Schema | Separate DB per Tenant |
|---|---|---|---|
| Cost | Lowest — one database, one set of tables | Moderate — one database, but N schemas to manage | Highest — N databases, N connection pools, N backups |
| Tenant Isolation | Weakest — a missing WHERE tenant_id = ? leaks data | Moderate — schema boundary prevents accidental cross-tenant queries | Strongest — complete physical separation |
| Security | Relies entirely on application-level or RLS filtering | Schema-level permissions add a layer of defense | Full database-level isolation; easiest to meet compliance requirements (SOC2, HIPAA) |
| Operational Complexity | Simplest — one schema to migrate, one backup to manage | Moderate — every migration must be applied to every schema; tooling needed | Highest — provisioning, patching, monitoring, and backing up N databases |
| Performance Isolation | None by default — noisy neighbor risk is highest | Partial — shared DB resources still contended | Full — one tenant’s load cannot affect another |
| Onboarding a New Tenant | Insert a row | Create a new schema + run migrations | Provision a new database + configure connection |
| Data Export / Deletion (GDPR) | Query by tenant_id, risk of missed tables | Drop the schema | Drop the database |
| Best For | SaaS with many small tenants, low compliance requirements | Mid-tier SaaS needing better isolation without per-tenant DB cost | Enterprise customers, regulated industries, tenants with strict SLAs |
30.2 Tenant Context Propagation
The most critical multi-tenant engineering challenge: ensuring tenant_id flows through every layer of your system.
The propagation chain: HTTP request arrives -> API gateway/middleware extracts tenant_id from JWT claims, subdomain (acme.app.com), or API key lookup -> tenant_id is set in the request context (Express req.tenantId, Go context.Value, Java ThreadLocal) -> every database query includes WHERE tenant_id = ? (enforced by query middleware or ORM scope) -> every log line includes tenant_id -> every event published includes tenant_id -> every downstream service call passes tenant_id in a header.
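The chain above is stack-agnostic; here is a minimal Python sketch using contextvars as the request-scoped carrier. The function names (set_tenant, scoped_query) are illustrative, not from any particular framework:

```python
from contextvars import ContextVar

# Holds the tenant for the current request; safe under async and threads,
# unlike a module-level global.
_current_tenant: ContextVar[str] = ContextVar("tenant_id")

def set_tenant(tenant_id: str) -> None:
    """Called once by middleware after extracting the tenant from the JWT,
    subdomain, or API key lookup."""
    _current_tenant.set(tenant_id)

def scoped_query(sql: str, params: tuple = ()) -> tuple:
    """Append the tenant filter so no query can be issued without it."""
    tenant_id = _current_tenant.get()  # raises LookupError if middleware was skipped
    if "where" in sql.lower():
        sql += " AND tenant_id = %s"
    else:
        sql += " WHERE tenant_id = %s"
    return sql, params + (tenant_id,)

# Middleware sets the context; every lower layer inherits it implicitly.
set_tenant("acme")
sql, params = scoped_query("SELECT * FROM orders")
print(sql)     # SELECT * FROM orders WHERE tenant_id = %s
print(params)  # ('acme',)
```

The key property: application code never passes tenant_id explicitly, and a query issued outside a tenant context fails loudly instead of silently returning all tenants' rows.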
The safety net: Application-level filtering can be missed (one developer forgets the WHERE clause). Database-level Row-Level Security (RLS) is your safety net:
Enable RLS with a policy that restricts every row to the tenant named in a per-session variable, which the application sets at the start of each request (SET app.tenant_id = 'acme'). Now even if the application forgets the filter, the database will never return another tenant’s data.
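A sketch of the PostgreSQL RLS setup implied above (the table and policy names are illustrative):

```sql
-- Enable RLS on each tenant-scoped table.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;  -- apply even to the table owner

-- Every query sees only rows whose tenant_id matches the session variable.
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id'));

-- Per request, the application sets the variable before running queries:
-- SET app.tenant_id = 'acme';
```

With this in place, a query missing its WHERE tenant_id clause returns only the current tenant's rows rather than leaking data across tenants.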
30.3 Tenant-Aware Concerns
Tenant-specific configuration (feature flags, limits, themes). Tenant-level rate limiting. Support access with audit trail. Tenant-aware logging and observability (include tenant_id in every log and metric).
30.4 Noisy Neighbor Mitigation Strategies
The noisy neighbor problem is the defining challenge of multi-tenant systems. Here are concrete strategies, ordered from cheapest to most expensive:
1. Per-Tenant Rate Limiting
Apply rate limits at the API gateway based on tenant_id. Each tenant gets a request budget (e.g., 1000 req/min). Exceeding the budget returns 429 Too Many Requests. This prevents one tenant from monopolizing API capacity.
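A minimal sketch of per-tenant limiting with a token bucket, keyed by tenant_id (class and parameter names are illustrative; a production gateway would keep the buckets in Redis rather than process memory):

```python
import time

class TenantRateLimiter:
    """One token bucket per tenant: capacity = burst, refill_per_sec = sustained rate."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}  # tenant_id -> (tokens, last_timestamp)

    def allow(self, tenant_id: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(tenant_id, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self.buckets[tenant_id] = (tokens - 1.0, now)
            return True   # under budget
        self.buckets[tenant_id] = (tokens, now)
        return False      # gateway responds 429 Too Many Requests

limiter = TenantRateLimiter(capacity=3, refill_per_sec=1000 / 60)  # ~1000 req/min
print([limiter.allow("acme", now=0.0) for _ in range(4)])  # [True, True, True, False]
# A noisy tenant exhausting its bucket does not affect other tenants:
print(limiter.allow("globex", now=0.0))  # True
```

Because each tenant has its own bucket, tenant "acme" hitting its budget never consumes capacity reserved for tenant "globex".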
2. Per-Tenant Resource Quotas
Enforce CPU, memory, and storage limits per tenant. In Kubernetes, use ResourceQuota objects scoped to tenant namespaces. In databases, use connection pool limits per tenant (e.g., tenant A gets max 20 connections, tenant B gets max 50).
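A sketch of a Kubernetes ResourceQuota scoped to one tenant's namespace (the namespace name and limit values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme        # one namespace per tenant
spec:
  hard:
    requests.cpu: "4"           # total CPU requested by all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
    pods: "50"
```

Any pod creation that would push the namespace over these totals is rejected at admission time, so a runaway tenant workload cannot consume cluster-wide capacity.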
3. Separate Processing Queues
Route tenant workloads to separate queues based on tier. Free-tier tenants share a queue with lower priority. Paid tenants get a dedicated queue. Enterprise tenants get a dedicated queue with guaranteed throughput.
4. Query Governors and Timeouts
Set per-tenant query timeouts at the database level. Kill queries that exceed the timeout. This prevents one tenant’s unoptimized report from locking tables for everyone.
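In PostgreSQL, a per-tenant timeout can be attached to the database role that a tenant tier's connections use (the role names here are illustrative):

```sql
-- Free-tier tenants: kill any query running longer than 5 seconds.
ALTER ROLE tenant_free_tier SET statement_timeout = '5s';

-- Enterprise tier gets more headroom.
ALTER ROLE tenant_enterprise SET statement_timeout = '60s';

-- Or per session, set by the application after identifying the tenant:
-- SET statement_timeout = '5s';
```

Queries exceeding the timeout are cancelled by the server, so one tenant's unoptimized report cannot hold locks indefinitely.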
5. Separate Compute Pools for Enterprise Tenants
For your largest customers, provision dedicated compute (separate Kubernetes node pools, separate database read replicas, or fully separate database instances). This is the most expensive strategy but offers contractual SLA guarantees.
6. Monitoring and Alerting Per Tenant
Track resource consumption per tenant. Alert when a single tenant’s usage exceeds a threshold (e.g., tenant X is consuming 40% of total DB CPU). This enables proactive intervention before other tenants are affected.
Interview Question: How do you handle noisy neighbors in a multi-tenant system?
- Rate limiting at the API gateway prevents request floods.
- Resource quotas at the infrastructure level (Kubernetes ResourceQuotas, DB connection pool limits) prevent resource monopolization.
- Separate processing queues ensure high-priority tenants are not blocked by bulk operations from free-tier tenants.
- Query timeouts prevent runaway queries from saturating the database.
- Dedicated infrastructure for enterprise tenants who need contractual SLA guarantees.
- Per-tenant monitoring and alerting so you can detect and respond before other tenants are impacted.
Further reading: Ultimate Guide to Multi-Tenant SaaS Data Modeling by Flightcontrol — excellent practical walkthrough of the trade-offs. AWS SaaS Tenant Isolation Strategies Whitepaper — deep dive into pool, silo, and bridge isolation models with AWS-native implementation patterns; essential reading if you are building on AWS. AWS SaaS Architecture Fundamentals — the Well-Architected SaaS Lens covers tenant onboarding, metering, billing integration, and operational patterns at scale.
Interview Question: You're building a B2B SaaS product. One enterprise client wants data residency in the EU. Others are fine with US. How do you architect this?
Step 1: Make data residency a first-class tenant attribute. Every tenant record carries a data_region field (e.g., us-east-1, eu-west-1). This is set during onboarding and drives all downstream routing decisions.
Step 2: Regional data planes, global control plane. The control plane (tenant management, authentication, billing) stays global — there is no compliance reason to regionalize it as long as it does not store regulated customer data. The data plane (the databases, object storage, and compute that process tenant data) is deployed per region. When a request arrives, the API gateway reads the tenant’s data_region from the control plane and routes the request to the correct regional data plane.
Step 3: Regional database instances. The EU tenant’s data lives in an EU database instance. US tenants’ data lives in a US instance. This is not “separate DB per tenant” — multiple EU tenants can share the same EU database using a shared-schema model with tenant_id filtering. You are regionalizing the infrastructure, not per-tenanting it.
Step 4: Cross-region concerns. Analytics and reporting that aggregate across regions need careful handling. Options: (a) replicate anonymized/aggregated data to a central analytics store, (b) run federated queries across regions, or (c) accept that some cross-region reports have higher latency. Avoid replicating raw PII across regions — that defeats the purpose.
Step 5: Compliance verification. Automated tests that assert no EU tenant data exists in US storage. Regular audits. Data residency is not a one-time setup — it is an ongoing operational concern.
Common mistakes: Trying to solve this with application-level filtering alone (you need infrastructure-level separation to satisfy auditors). Over-engineering by giving every tenant their own region (most tenants do not need it — only provision regional isolation for tenants that contractually require it). Forgetting that backups, logs, and cache layers also contain tenant data and must respect residency requirements.
Part XXIV — Domain Modeling and Business Logic
Chapter 31: Domain-Driven Design Basics
Big Word Alert: Ubiquitous Language. A shared vocabulary between developers and domain experts where each term has one precise meaning within a bounded context. If the business says “order” and the developers say “transaction,” misunderstandings will leak into the code. DDD insists that the code uses the same terms as the business. When the PM says “the customer’s subscription was paused,” the code should have subscription.pause(), not setStatus(INACTIVE).
Tools: Event Storming (workshop format for discovering domain events and bounded contexts — uses sticky notes on a wall). Context Mapper (open-source tool for modeling bounded contexts). Miro/FigJam (for remote event storming sessions).
Analogy: Bounded Contexts Are Like Countries. Bounded contexts are like countries — “football” means something completely different in the US vs the UK, and that is okay as long as you know which country you are in. The word “order” in the Fulfillment context means a shipment to pack and dispatch. The word “order” in the Billing context means an invoice to charge. The word “order” in the Analytics context means a data point in a revenue trend. Just like you do not try to create one universal definition of “football” that works in both countries, you do not try to create one universal Order model that serves all contexts. Each context gets its own model with its own language, and you translate at the borders — just like a currency exchange at an airport. The Anti-Corruption Layer in DDD is literally that currency exchange booth.
How Spotify’s “Spotify Model” Maps Bounded Contexts to Organizational Structure
Spotify’s engineering culture — widely documented around 2012-2014 through their “Spotify Model” whitepapers by Henrik Kniberg and Anders Ivarsson — offers one of the most tangible illustrations of how bounded contexts in DDD map to real organizational structure. Spotify organized its engineering teams into Squads (small, autonomous teams of 6-12 people, each owning a specific feature area), Tribes (collections of squads working in related areas), Chapters (groups of specialists across squads, like all backend engineers), and Guilds (informal communities of interest). What made this relevant to DDD was the alignment between squad ownership and bounded context boundaries. The Search squad owned the Search bounded context — its own data model, its own APIs, its own deployment pipeline. The Playlist squad owned the Playlist context. The Payment squad owned the Billing context. Each squad spoke its own ubiquitous language within its domain. A “track” meant something different to the Search squad (a searchable document with metadata and ranking signals) than to the Playback squad (a streamable audio file with codec information, bitrate options, and DRM licenses). The boundaries between squads were the context boundaries, and the APIs and events between squads were the context maps. When the Playlist squad needed information from the Social squad (to show which friends were listening to a playlist), they consumed integration events — they did not reach into the Social squad’s database. This organizational structure enforced the same decoupling that DDD prescribes at the software level. It is worth noting that Spotify itself has acknowledged the model evolved significantly over the years and was never as clean in practice as it appeared on paper. Many companies copied the labels (squads, tribes) without understanding the underlying principle: that organizational boundaries should align with domain boundaries, and that each team should own its context end-to-end. 
The lesson is not “copy Spotify’s org chart” — it is that Conway’s Law is real, and your bounded contexts will inevitably mirror your team structure. Design both intentionally.
31.1 Entities, Value Objects, and Aggregates
Entities have identity — two users with the same name are different users. Identity persists even if every attribute changes (a user changes their name, email, and password — still the same user). In code: entities have an id field and equality is based on id, not attributes.
Value objects are defined by their attributes — two Money(100, "USD") are the same. They are immutable (to change an amount, you create a new Money object). In code: equality is based on all attributes, no id field. Use for: addresses, date ranges, coordinates, money, email addresses.
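The Money example above, sketched in Python: a frozen dataclass gives attribute-based equality and immutability for free (the add method is an illustrative convenience, not a prescribed API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: attribute assignment raises FrozenInstanceError
class Money:
    amount: int       # minor units (cents) to avoid float rounding issues
    currency: str

    def add(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError("cannot add different currencies")
        # Returns a new object instead of mutating — value objects never change.
        return Money(self.amount + other.amount, self.currency)

# Equality is based on attributes, not identity — there is no id field.
assert Money(100, "USD") == Money(100, "USD")
assert Money(100, "USD") != Money(100, "EUR")
print(Money(100, "USD").add(Money(50, "USD")))  # Money(amount=150, currency='USD')
```

The same pattern fits addresses, date ranges, coordinates, and email addresses: equality by attributes, no identity, no mutation.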
Aggregates are clusters of entities and value objects treated as a unit for data changes. The aggregate root is the single entry point — external code can only modify the aggregate through the root. This enforces business rules.
Aggregate Rules
- Aggregate Root is the only entry point. External objects may only hold references to the aggregate root, never to internal entities. To add a line item to an order, you call order.addItem(), not lineItem.save().
- Consistency boundary. All invariants within an aggregate are enforced in a single transaction. If the business rule says “order total must be at least $10,” the aggregate root checks this on every mutation.
- Transactional boundary. One transaction = one aggregate. Never modify two aggregates in the same transaction. If placing an order must also update inventory, the Order aggregate publishes an OrderPlaced event and the Inventory aggregate handles it asynchronously.
- Keep aggregates small. Large aggregates cause lock contention and merge conflicts. If two users can independently modify different parts of a large aggregate, it needs to be split.
- Reference other aggregates by ID, not by object. An OrderLineItem stores product_id, not a reference to the Product aggregate. This prevents coupling and allows aggregates to live in different bounded contexts or services.
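These rules in one compact Python sketch: the Order root owns its line items, enforces the $10 invariant, references Product by ID, and records an event rather than touching the Inventory aggregate directly (all class and method names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LineItem:                 # internal to the aggregate; outsiders never hold one
    product_id: str             # reference by ID, not a Product object
    unit_price: int             # cents
    quantity: int

@dataclass
class Order:                    # aggregate root: the only entry point
    order_id: str
    items: list = field(default_factory=list)
    events: list = field(default_factory=list)  # published after the transaction commits

    MIN_TOTAL = 1000  # business rule: order total must be at least $10

    def add_item(self, product_id: str, unit_price: int, quantity: int) -> None:
        self.items.append(LineItem(product_id, unit_price, quantity))

    def total(self) -> int:
        return sum(i.unit_price * i.quantity for i in self.items)

    def place(self) -> None:
        if self.total() < self.MIN_TOTAL:   # invariant checked by the root
            raise ValueError("order total must be at least $10")
        # Inventory is a separate aggregate: publish an event, do not modify it here.
        self.events.append(("OrderPlaced", self.order_id))

order = Order("o-1")
order.add_item("p-42", unit_price=600, quantity=2)
order.place()
print(order.events)  # [('OrderPlaced', 'o-1')]
```

Note what is absent: no lineItem.save(), no Product object inside the aggregate, and no direct call into inventory.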
31.2 Bounded Contexts
A boundary within which a particular domain model is defined and applicable. Concrete example — the word “User” in different contexts. Consider a SaaS platform. The same real-world person — say, Jane — is modeled differently depending on the context:
- In the Authentication context, “User” means: email, password_hash, mfa_enabled, last_login, session_tokens. The concern is identity verification.
- In the Billing context, “User” means: subscription_plan, payment_method, invoice_history, mrr_contribution. The concern is revenue.
- In the Support context, “User” means: ticket_history, satisfaction_score, support_tier, assigned_agent. The concern is customer experience.
A single User model that serves all three contexts creates a bloated, tangled entity with 50+ fields that is painful to maintain and impossible to reason about. Bounded contexts let each team model the concept in the way that best serves their needs.
Another example — “Customer” across Sales and Support:
“Customer” in the Sales context has: name, email, payment methods, purchase history, loyalty tier. “Customer” in the Support context has: name, ticket history, satisfaction score, support tier, assigned agent. Same real-world person, different models optimized for different purposes.
How contexts communicate: Through well-defined interfaces at the boundary. The Sales context publishes a CustomerRegistered event. The Support context consumes it and creates its own representation. Each context owns its own database/tables. They never share database tables (that would couple the contexts).
Context mapping patterns:
- Shared Kernel: Two contexts share a small, carefully managed subset of code.
- Customer-Supplier: Upstream context provides, downstream consumes — the supplier accommodates the consumer’s needs.
- Conformist: Downstream accepts whatever upstream provides without negotiation.
- Anti-Corruption Layer: Downstream translates the upstream model into its own terms — essential when integrating with legacy or third-party systems.
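An anti-corruption layer is, at its simplest, a translation function at the boundary. A sketch translating a hypothetical legacy CRM payload into the Billing context's own model (all field names and tier codes are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BillingCustomer:          # the downstream context's own model
    customer_id: str
    email: str
    plan: str

def from_legacy_crm(payload: dict) -> BillingCustomer:
    """Anti-corruption layer: the legacy model never leaks past this function."""
    plan_map = {"GOLD": "enterprise", "SLVR": "pro", "BRNZ": "free"}  # legacy tier codes
    return BillingCustomer(
        customer_id=str(payload["CUST_NO"]),            # legacy numeric key -> string ID
        email=payload["EMAIL_ADDR"].strip().lower(),    # normalize at the border
        plan=plan_map.get(payload["TIER_CD"], "free"),  # unknown codes degrade safely
    )

legacy = {"CUST_NO": 9912, "EMAIL_ADDR": " Jane@Example.com ", "TIER_CD": "GOLD"}
print(from_legacy_crm(legacy))
# BillingCustomer(customer_id='9912', email='jane@example.com', plan='enterprise')
```

This is the “currency exchange booth” from the analogy: the rest of the Billing code only ever sees BillingCustomer, never CUST_NO or TIER_CD.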
31.3 Domain Events
Something meaningful that happened in the domain. OrderPlaced, PaymentReceived, InventoryReserved. Events represent facts — they are immutable and past tense. They drive integration between bounded contexts.
Design rules for domain events: Name them in past tense (something that happened, not something that should happen). Include all data the consumers need (do not force consumers to call back for details). Include: event type, aggregate ID, timestamp, causation ID (what triggered this event), and the relevant data. Events are the primary mechanism for decoupling bounded contexts — the Sales context does not call the Shipping context directly; it publishes OrderPlaced and the Shipping context reacts.
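The design rules above translate into a simple event envelope; a Python sketch (the field names follow the list above, but the exact shape is illustrative, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

@dataclass(frozen=True)   # events are immutable facts
class DomainEvent:
    event_type: str       # past tense: something that happened, not a command
    aggregate_id: str
    data: dict            # everything consumers need — no call-backs for details
    causation_id: str     # what triggered this event (command or prior event)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

event = DomainEvent(
    event_type="OrderPlaced",
    aggregate_id="order-123",
    data={"total_cents": 2500, "currency": "USD", "line_item_count": 2},
    causation_id="cmd-place-order-789",
)
print(event.event_type, event.aggregate_id)  # OrderPlaced order-123
```

The self-contained data payload is what lets the Shipping context react to OrderPlaced without calling back into the Sales context.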
Domain Events vs Integration Events
Understanding the distinction between these two types of events is critical: Domain Events are internal to a bounded context. They represent something that happened within the domain model and are used to trigger side effects inside the same context. They are typically dispatched in-memory (not through a message broker). Example: OrderLineItemAdded triggers a recalculation of the order total within the Order aggregate.
Integration Events cross bounded context boundaries. They are published to a message broker (Kafka, RabbitMQ, SNS) and consumed by other services. They carry a self-contained payload so consumers do not need to call back. Example: OrderPlaced is published by the Order Service and consumed by the Shipping Service, the Notification Service, and the Analytics Service.
| Aspect | Domain Events | Integration Events |
|---|---|---|
| Scope | Within a bounded context | Across bounded contexts / services |
| Transport | In-memory (mediator pattern) | Message broker (Kafka, RabbitMQ, SNS/SQS) |
| Payload | Can reference internal domain objects | Must be self-contained (no internal references) |
| Schema Evolution | Free to change (internal) | Must be versioned carefully (public contract) |
| Failure Handling | Part of the same transaction | Requires idempotency, retries, dead-letter queues |
Connection to Event Sourcing
Domain events are the foundation of event sourcing. In a traditional system, you store the current state (e.g., order.status = SHIPPED). In event sourcing, you store the sequence of events that led to the current state: OrderPlaced, then PaymentReceived, then OrderShipped. The current state is a derived projection, rebuilt by replaying the events in order.
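Replaying events to derive state is just a fold over the event log; a minimal sketch using the order example (event names from the text, payload fields illustrative):

```python
def apply(state: dict, event: tuple) -> dict:
    """One fold step: each event transforms the state. Events are never mutated."""
    kind, payload = event
    state = dict(state)  # copy, so replay stays side-effect free
    if kind == "OrderPlaced":
        state.update(status="PLACED", total=payload["total"])
    elif kind == "PaymentReceived":
        state["status"] = "PAID"
    elif kind == "OrderShipped":
        state.update(status="SHIPPED", carrier=payload["carrier"])
    return state

# The event log is the source of truth; current state is a derived projection.
events = [
    ("OrderPlaced", {"total": 2500}),
    ("PaymentReceived", {"amount": 2500}),
    ("OrderShipped", {"carrier": "UPS"}),
]
state = {}
for e in events:
    state = apply(state, e)
print(state)  # {'status': 'SHIPPED', 'total': 2500, 'carrier': 'UPS'}
```

Because the log is append-only, you can rebuild state as of any point in time by replaying a prefix of the events — the property that makes event sourcing valuable for auditing and debugging.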
Interview Question: How do you identify bounded context boundaries in an existing monolith?
Further reading: Domain-Driven Design by Eric Evans — the foundational text. Implementing Domain-Driven Design by Vaughn Vernon — the practical companion. Learning Domain-Driven Design by Vlad Khononov — a more modern, accessible introduction than Evans’ original. Eric Evans’ DDD Reference — a free, concise summary of all DDD patterns from the creator himself; keep this bookmarked as a quick-reference card. Vaughn Vernon’s Key Concepts from “Implementing DDD” — distilled summary of the most important tactical and strategic patterns with code examples. Martin Fowler on Bounded Contexts — Fowler’s clear, concise explanation of why bounded contexts are the most important pattern in DDD. Martin Fowler on Aggregate Design — practical guidance on sizing aggregates and enforcing consistency boundaries.
Interview Question: Two teams are building features that both need a 'User' entity but with different fields and behaviors. How do you resolve this with DDD?
The naive approach is to build one shared User model that serves both teams. That model will grow into a bloated, conflicted entity with 40+ fields, half of which are irrelevant to each team, and every change risks breaking the other team’s features.
The DDD approach: each team defines its own model of User within its own bounded context. Say Team A is building authentication and Team B is building billing. In the Auth context, User means: email, password_hash, mfa_enabled, last_login, session_tokens, login_attempts. In the Billing context, User means: subscription_plan, payment_method, invoice_history, mrr_contribution, billing_address. These are not the same entity — they are different projections of the same real-world person, optimized for different purposes.
How they stay in sync: A shared identifier (user_id) links the two models. When a new user registers, the Auth context publishes a UserRegistered integration event containing { user_id, email, name }. The Billing context consumes that event and creates its own BillingCustomer record with the user_id as a foreign reference. Each context owns its own database tables and evolves its schema independently.
What about shared fields like name and email? These are duplicated across contexts — and that is okay. The Auth context is the source of truth for email (because email changes go through the auth flow). If the Billing context needs an updated email (for invoice delivery), it listens for UserEmailChanged events. This is eventual consistency, and it is the right trade-off — it is far better than coupling two teams to a shared database table.
The anti-pattern to avoid: A shared User library or shared database table that both teams depend on.
This creates a coordination bottleneck — every schema change requires cross-team alignment, deployments become coupled, and you lose the autonomy that bounded contexts are designed to provide.
When bounded contexts are overkill: If the two teams are actually working in the same domain and the fields overlap by 80%+, they might belong in the same bounded context with a single model. Bounded contexts are not about giving every team its own copy of everything — they are about recognizing genuine differences in how a concept is modeled and used.
Part XXV — Documentation and Communication
Chapter 32: Engineering Documentation
Big Word Alert: ADR (Architecture Decision Record). A lightweight document that captures an important architectural decision along with its context and consequences. ADRs accumulate as a decision log — when a new engineer asks “why do we use Kafka instead of RabbitMQ?”, the ADR explains the reasoning at the time of the decision. Without ADRs, architectural knowledge lives only in people’s heads and leaves when they do.
Tools: adr-tools (CLI for managing ADRs). Backstage (developer portal with TechDocs). Notion, Confluence (team documentation). Swagger/OpenAPI (API documentation from code). Docusaurus, MkDocs (documentation sites from Markdown).
How Amazon’s “Working Backwards” Culture Drives Engineering Quality
Amazon is famously a writing culture. Jeff Bezos banned PowerPoint in executive meetings in the early 2000s, replacing slide decks with six-page narrative memos that meeting attendees read in silence for the first 20-30 minutes before discussion begins. But the most remarkable documentation practice at Amazon is the “Working Backwards” press release — and it has profound implications for how engineers think about documentation. Before building a new product or feature, Amazon teams write a mock press release announcing the finished product to the world. This is not marketing fluff — it is a forcing function for clarity. The press release must articulate: who is the customer, what is the problem, why do existing solutions fall short, what does this product do, and what does the customer say about it (a fictional quote that captures the value proposition). If the team cannot write a compelling one-page press release, the idea is not clear enough to build. What makes this relevant to engineering documentation is the underlying philosophy: writing is not something you do after you build. Writing is how you think. The act of putting an idea into clear prose exposes fuzzy thinking, unstated assumptions, and gaps in logic that whiteboards and verbal discussions miss. An ADR forces you to articulate why you chose PostgreSQL over MongoDB before you start coding. An RFC forces you to think through failure modes before you ship. Amazon’s press release forces product teams to define success before writing a single line of code. Amazon engineers have noted that the six-page memo culture creates better meetings (everyone has the same context), better decisions (arguments are written and structured, not improvised), and better institutional memory (memos are archived and searchable). The cost is real — writing a good six-page memo takes days, not hours. 
But Amazon’s bet is that the cost of building the wrong thing, or building the right thing without shared understanding, is far higher. For engineers at any company, the lesson is this: if you cannot explain what you are building and why in clear, jargon-free prose, you do not yet understand it well enough to build it.
Interview Question: You join a team and discover there is no documentation — no ADRs, no runbooks, no architecture diagrams. How do you fix this?
Follow-up: The team pushes back — they say documentation slows them down.
32.1 Architecture Decision Records (ADRs)
Capture: title, status (proposed/accepted/deprecated), context (why this decision is needed), decision (what we decided), consequences (what changes, what trade-offs we accept). Forces clarity of thought by requiring you to articulate reasoning before implementing.
ADR Template
Use this template for every architectural decision worth recording:
Title: ADR-NNN: [short noun phrase naming the decision]
Status: Proposed | Accepted | Deprecated | Superseded
Context: [The technical, organizational, and business forces at play, and why a decision is needed now.]
Decision: [What we decided, stated in full sentences and active voice.]
Consequences: [What becomes easier, what becomes harder, and which trade-offs we knowingly accept.]
32.2 Runbooks
Step-by-step operational playbooks written for someone who has never dealt with this situation before (because at 3 AM, that might be you — half-asleep with no context). Every runbook must include:
- Symptom: What alert or user report triggers this.
- Impact: What is broken for users.
- Diagnosis: Specific commands to run, dashboards to check.
- Mitigation: Steps to fix, ordered from fastest to most thorough.
- Escalation: Who to contact if mitigation fails, with phone numbers.
- Post-incident: Links to postmortem template, follow-up actions.
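Put together, a skeletal runbook covering these sections might look like the following. All names, URLs, and commands here are hypothetical placeholders to show the shape, not references to a real system:

```markdown
# Runbook: Stuck orders backlog

Owner: payments-team · Last verified: 2024-05-01

## Symptom
Alert `orders-stuck-high` fires when more than 100 orders have been in
status `stuck` for over 1 hour.

## Impact
Affected customers see orders as "processing" indefinitely; no charge
is made, and confirmation emails are not sent.

## Diagnosis
1. Open https://grafana.internal/d/orders-health (hypothetical URL).
2. Count stuck orders:
   psql -h prod-db.internal -U readonly -c \
     "SELECT count(*) FROM orders WHERE status = 'stuck';"

## Mitigation
- If the queue consumer is down:
  kubectl rollout restart deployment/order-consumer -n orders
- If the database is saturated: page the DBA on-call before retrying.

## Escalation
If mitigation fails within 30 minutes, page the payments secondary
on-call via PagerDuty.

## Post-incident
Postmortem template: https://wiki.internal/postmortem (hypothetical).
```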
What Makes a Good Runbook
- Step-by-step, copy-pasteable commands. Do not write “check the database.” Write `psql -h prod-db.internal -U readonly -c "SELECT count(*) FROM orders WHERE status = 'stuck' AND created_at > NOW() - INTERVAL '1 hour';"`. The reader should be able to follow the runbook without thinking.
- No assumptions about reader knowledge. Assume the reader has never seen this system before. Spell out which dashboard to open, which cluster to connect to, which namespace to look in. Include URLs, not descriptions (“open the Grafana dashboard” is bad; “open https://grafana.internal/d/orders-health” is good).
- Decision trees, not paragraphs. Use “If X, do Y. If Z, do W.” format. At 3 AM, nobody reads paragraphs.
- Tested regularly. A runbook that has never been followed is a runbook that does not work. Run through your runbooks during game days (scheduled incident simulations). Fix the steps that are wrong or missing.
- Owned and dated. Every runbook has an owner and a “last verified” date. If the last verified date is more than 6 months ago, the runbook is suspect.
- Linked from alerts. Every PagerDuty/Opsgenie alert should include a direct link to the relevant runbook. The on-call engineer should never have to search for documentation during an incident.
32.3 API Documentation
Principles: Keep it close to code (OpenAPI/Swagger generated from annotations or code — documentation that drifts from reality is worse than no documentation). Include request/response examples for every endpoint (developers read examples first, descriptions second). Document all error responses (not just the happy path — what does a 422 look like?). Document rate limits and pagination. Version documentation alongside the API. Provide a “Getting Started” guide (authentication, first API call, common workflows).
Tools: OpenAPI/Swagger (the standard — generates interactive documentation, client SDKs, and server stubs). Redoc (beautiful documentation from OpenAPI specs). Postman (API testing + documentation). Stoplight (API design-first platform).
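To make “document all error responses” concrete, here is an OpenAPI fragment for a hypothetical endpoint that documents the 422 and 429 cases alongside the happy path. The endpoint, fields, and example payloads are invented:

```yaml
paths:
  /orders:
    post:
      summary: Create an order
      responses:
        "201":
          description: Order created.
          content:
            application/json:
              example:
                id: ord_123
                status: pending
        "422":
          description: Validation failed. The response names the offending field.
          content:
            application/json:
              example:
                error: validation_failed
                details:
                  - field: sku
                    message: unknown SKU
        "429":
          description: Rate limit exceeded. Retry after the interval given in the Retry-After header.
```

A developer hitting a 422 at 2 AM can compare their response body to this example instead of guessing; that is the payoff of documenting the unhappy paths.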
32.4 Communication Skills
Explain trade-offs, not just solutions. Write good PR descriptions (what changed, why, how to test). Write good tickets (problem, context, acceptance criteria). Communicate incidents clearly (what happened, impact, timeline, what we are doing).
PR Description Template
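For illustration, a filled-in PR description following the what/why/how-to-test shape. The change, numbers, and make target are hypothetical:

```markdown
## What changed
Replaced per-request database connections with a pooled connection
(max 20) in the orders service.

## Why
P95 latency on /orders rose to 800ms under load; profiling showed
connection setup dominating. (Link the incident or issue here.)

## How to test
1. Run `make integration-test` (hypothetical target).
2. Load-test locally and confirm P95 drops below 200ms.

## Risks / rollback
Pool exhaustion under burst traffic. Rollback: revert this commit;
no schema changes are involved.
```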
Use this template for every non-trivial pull request.
Communication as an Engineering Skill
Technical skill gets you to mid-level. Communication is what makes you senior. The ability to write clearly, explain trade-offs, and align a team on a technical direction is the single most underrated engineering skill.
How Senior Engineers Write
Senior engineers write with three qualities:
- Precision. Every word carries meaning. “The service is slow” becomes “P95 latency for the /orders endpoint increased from 120ms to 800ms after the deployment at 14:32 UTC.” Precision eliminates follow-up questions.
- Structure. Information is organized so the reader gets what they need at the level of detail they need. An executive gets the one-line summary. A peer engineer gets the technical details. Both find what they need without reading the whole document.
- Audience awareness. The same information is framed differently for different audiences. To engineering: “We need to migrate from MySQL to PostgreSQL because of XYZ limitations.” To leadership: “The database migration will take 3 sprints, reduce incident frequency by ~40%, and unblock the multi-region initiative.”
RFC / Design Documents
For decisions that are too large for an ADR (new services, major refactors, platform changes), write an RFC (Request for Comments) or Design Document. The structure:
- Title and Authors — who is proposing this.
- Status — Draft, In Review, Accepted, Rejected, Implemented.
- Summary — one paragraph a busy VP could read and understand the proposal.
- Motivation — why is the current state insufficient? What problem are we solving? Include data (error rates, latency numbers, customer complaints).
- Proposed Solution — the detailed technical design. Diagrams, API schemas, data models, sequence diagrams. Enough detail that another engineer could implement it.
- Alternatives Considered — what else did you evaluate? Why did you reject it? This section builds trust — it shows you did not just pick the first idea.
- Risks and Mitigations — what could go wrong? How will you detect it? What is the rollback plan?
- Milestones and Timeline — break the work into phases. What can we ship incrementally?
- Open Questions — what do you not know yet? What input do you need from reviewers?
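Rendered as a document skeleton, the structure above might look like this. The project, numbers, and authors are invented, and the Proposed Solution is elided to a placeholder:

```markdown
# RFC-042: Extract billing into a standalone service

Authors: jane@, raj@ · Status: Draft

## Summary
Move billing out of the monolith into its own service so invoice
generation can scale independently and billing deploys stop being
coupled to the main release train.

## Motivation
Billing changes currently block the monolith release about twice a
week; invoice-generation spikes cause storewide P95 regressions.

## Proposed Solution
(Architecture diagram, API schema, data model, cutover plan.)

## Alternatives Considered
Keep billing in-process but move invoice generation to a worker pool.
Rejected: deploy coupling remains.

## Risks and Mitigations
Double-billing during cutover. Mitigation: idempotency keys plus a
nightly reconciliation job. Rollback: re-enable the in-process path.

## Milestones and Timeline
1. Shadow service (2 weeks). 2. Dual-write (2 weeks). 3. Cutover.

## Open Questions
Who owns the shared customer table after the split?
```

Note how the Summary alone would let a busy VP approve or redirect the effort; every later section only adds detail.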
Status Updates and Incident Communication
Project status updates follow a consistent format so stakeholders can scan quickly:
- Status: On Track / At Risk / Blocked
- Summary: One sentence on progress since last update.
- Completed: What shipped.
- In Progress: What is being worked on.
- Blocked: What is stuck and what is needed to unblock.
- Next: What is planned for the next cycle.
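Filled in, an “At Risk” update for a hypothetical migration might read:

```markdown
Status: At Risk

Summary: Orders-table migration is ~60% done; the index rebuild is
running slower than planned.

Completed: Dual-write enabled; backfill for Jan–Mar data.
In Progress: Backfill for Apr–Jun; index rebuild on the replica.
Blocked: Need a maintenance window approved by the SRE team.
Next: Finish backfill, start shadow reads.
```

A stakeholder can read the first two lines and know whether to act; the rest is there if they need it.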
Incident communication follows a stricter cadence:
- Initial notification (within 5 minutes of detection): What is happening, what is impacted, who is investigating.
- Updates every 15-30 minutes during the incident: What we know, what we have tried, what we are trying next.
- Resolution notification: What fixed it, what is the residual impact, when will the postmortem happen.
- Postmortem (within 48 hours): Timeline, root cause, contributing factors, action items with owners and deadlines.
Writing as Leverage
A well-written document is the highest-leverage activity a senior engineer can perform:
- A design doc aligns 10 engineers without 10 meetings.
- An ADR answers the same question for every future engineer who joins the team.
- A runbook saves hours of debugging at 3 AM.
- A clear status update prevents a panicked Slack thread from leadership.
Further reading:
- The Staff Engineer’s Path by Tanya Reilly — covers technical communication, influence, and documentation as core engineering skills.
- Docs for Developers by Jared Bhatti et al. — practical guide to writing documentation that people actually read.
- ADR GitHub Organization — comprehensive collection of ADR tools, templates, and examples; includes adr-tools, log4brains, and MADR (Markdown ADR) templates for different team workflows.
- Stripe’s Approach to API Documentation — widely considered the gold standard for developer-facing API docs; study how they structure endpoints, show request/response pairs inline, and provide copy-pasteable code in multiple languages.
- Google’s Technical Writing Courses — free, self-paced courses covering grammar for engineers, writing clear sentences, organizing documents, and illustrating technical concepts; required training for many Google engineering teams.
- Basecamp’s “Shape Up” Methodology — while primarily about product development, the “shaping” process is one of the best frameworks for writing effective technical proposals and design documents; the concepts of appetite, pitches, and fat-marker sketches translate directly to RFC writing.
Interview Question: Your team writes great code but terrible documentation. How do you change the culture without mandating docs nobody reads?