
Part XXIII — Multi-Tenancy

Chapter 30: Multi-Tenant Architecture

Big Word Alert: Noisy Neighbor. In a multi-tenant system, a “noisy neighbor” is a tenant whose heavy usage degrades performance for other tenants sharing the same infrastructure. One tenant running a massive report saturates the database, making all other tenants’ queries slow. This is the central challenge of multi-tenant architecture — balancing cost efficiency (sharing resources) with isolation (protecting tenants from each other).
Tools: PostgreSQL Row-Level Security (database-level tenant isolation). Citus (distributed PostgreSQL with tenant-aware sharding). AWS SCP, Azure Policies (tenant-level cloud resource governance). Kubernetes namespaces + ResourceQuotas (infrastructure-level isolation).
Analogy: The Apartment Building. Multi-tenancy is like an apartment building — everyone shares the structure (the foundation, the plumbing, the elevator), but each unit has its own lock and you NEVER want to accidentally walk into the wrong apartment. The cheapest building crams everyone onto one floor with thin walls (shared schema) — cost-effective but noisy. The mid-tier building gives each tenant their own floor (separate schemas) — better isolation but the elevator is still shared. The luxury building gives each tenant their own wing with a private entrance (separate databases) — maximum privacy, maximum cost. The building manager’s hardest job? Making sure tenant A’s spare key never accidentally opens tenant B’s door. That is the entire discipline of multi-tenant architecture in one sentence.

How Salesforce Built the Most Successful Multi-Tenant SaaS Platform

In the early 2000s, when enterprise software meant million-dollar Oracle licenses, rack-mounted servers, and 18-month deployment cycles, Marc Benioff and his team at Salesforce made a bet that seemed reckless: they would run every customer — from a 5-person startup to a Fortune 500 bank — on the same shared infrastructure. Not separate instances. Not separate databases. The same tables, the same application servers, the same everything. The industry said it was impossible. Enterprise customers would never trust their data to a shared environment. The compliance requirements alone would kill it. And the technical challenges were staggering: how do you prevent one customer’s massive report from crippling the system for everyone else? How do you handle a customer who needs custom fields, custom objects, and custom workflows without polluting the shared schema? Salesforce solved it by building a metadata-driven architecture. Instead of creating physical database tables for each customer’s custom objects, they stored metadata that described the schema, and the platform interpreted that metadata at runtime. Every customer’s data lived in the same set of large, generic tables (imagine columns named Value0 through Value500), with a metadata layer that mapped those generic columns to customer-specific field names like Annual Revenue or Lead Score. This approach meant Salesforce could onboard a new customer in minutes (just insert metadata rows), deploy updates to every customer simultaneously (one codebase, one deployment), and scale to hundreds of thousands of tenants without the operational nightmare of managing hundreds of thousands of separate database instances. The result? Salesforce became the poster child for multi-tenant SaaS, grew to over $30 billion in annual revenue, and proved that multi-tenancy — done right — is not a compromise but a competitive advantage. 
The lesson for engineers: the hardest part of multi-tenancy is not the isolation model you pick on day one. It is the tenant context propagation, the noisy neighbor mitigation, and the metadata flexibility that you build over years. Salesforce did not get it right immediately. They iterated relentlessly, and their architecture today looks nothing like their v1. But the core bet — shared infrastructure with logical isolation — never changed.

How Slack Evolved from Single-Tenant to Multi-Tenant

Slack’s early architecture tells a different multi-tenancy story — one of pragmatic evolution rather than upfront design. When Stewart Butterfield’s team pivoted from their failed game Glitch to build Slack in 2013, they made the reasonable startup decision: each workspace (team) got its own isolated MySQL shard. This was effectively a separate-database-per-tenant model. It was simple, provided strong isolation, and was perfectly fine when Slack had hundreds of teams. Then Slack exploded. By 2015, they had hundreds of thousands of active teams. The separate-shard-per-tenant model that had been a strength became a liability. Provisioning new shards was slow. Schema migrations had to be rolled out across thousands of database instances. Cross-workspace features (like Slack Connect, which lets users from different companies share channels) were architecturally painful because data lived in completely separate databases. Operational overhead was enormous — monitoring, backups, and failover multiplied by the number of shards. Slack’s engineering team spent years incrementally migrating toward a more consolidated architecture, introducing shared services, moving certain data to centralized stores (like their move to Vitess for MySQL clustering), and building abstraction layers that could route queries to the right shard transparently. They did not rip and replace — they evolved. The lesson here is crucial: your isolation model is not a permanent decision. It is a starting point. What matters is that you design your tenant context propagation layer cleanly enough that you can change the underlying isolation model without rewriting your entire application. Slack’s ability to evolve was a direct result of having a clean abstraction between “which tenant is this?” and “which database holds their data?”

30.1 Isolation Models

  • Shared DB, shared schema: all tenants live in the same tables, distinguished by a tenant_id column. Cheapest option; requires rigorous filtering on every query.
  • Shared DB, separate schemas: each tenant has its own schema/namespace. Better isolation, but migrations must be applied to every schema.
  • Separate DB per tenant: maximum isolation, most expensive, and operationally complex at scale (hundreds of databases).

Isolation Model Comparison

| Factor | Shared DB / Shared Schema | Shared DB / Separate Schema | Separate DB per Tenant |
| --- | --- | --- | --- |
| Cost | Lowest — one database, one set of tables | Moderate — one database, but N schemas to manage | Highest — N databases, N connection pools, N backups |
| Tenant Isolation | Weakest — a missing WHERE tenant_id = ? leaks data | Moderate — schema boundary prevents accidental cross-tenant queries | Strongest — complete physical separation |
| Security | Relies entirely on application-level or RLS filtering | Schema-level permissions add a layer of defense | Full database-level isolation; easiest to meet compliance requirements (SOC2, HIPAA) |
| Operational Complexity | Simplest — one schema to migrate, one backup to manage | Moderate — every migration must be applied to every schema; tooling needed | Highest — provisioning, patching, monitoring, and backing up N databases |
| Performance Isolation | None by default — noisy neighbor risk is highest | Partial — shared DB resources still contended | Full — one tenant’s load cannot affect another |
| Onboarding a New Tenant | Insert a row | Create a new schema + run migrations | Provision a new database + configure connection |
| Data Export / Deletion (GDPR) | Query by tenant_id, risk of missed tables | Drop the schema | Drop the database |
| Best For | SaaS with many small tenants, low compliance requirements | Mid-tier SaaS needing better isolation without per-tenant DB cost | Enterprise customers, regulated industries, tenants with strict SLAs |

30.2 Tenant Context Propagation

The most critical multi-tenant engineering challenge: ensuring tenant_id flows through every layer of your system. The propagation chain:
  1. HTTP request arrives.
  2. API gateway/middleware extracts tenant_id from JWT claims, subdomain (acme.app.com), or API key lookup.
  3. tenant_id is set in the request context (Express req.tenantId, Go context.Value, Java ThreadLocal).
  4. Every database query includes WHERE tenant_id = ? (enforced by query middleware or ORM scope).
  5. Every log line includes tenant_id.
  6. Every event published includes tenant_id.
  7. Every downstream service call passes tenant_id in a header.
Application-level filtering can be missed (one developer forgets the WHERE clause), so database-level Row-Level Security (RLS) is your safety net:
-- Policies have no effect until RLS is enabled on the table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id'));
Set the RLS variable in the connection middleware (SET app.tenant_id = 'acme'). Now even if the application forgets the filter, the database will never return another tenant’s data.
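The request-context step of the propagation chain can be sketched in Python using contextvars, which plays the same role as Go’s context.Value or Java’s ThreadLocal. This is a minimal illustration; the function names are invented for the example.

```python
from contextvars import ContextVar

# Request-scoped tenant context. The middleware sets it once per request,
# after authenticating the request and resolving the tenant.
current_tenant: ContextVar[str] = ContextVar("current_tenant")

def set_tenant(tenant_id: str) -> None:
    """Called by middleware once per request, after extracting tenant_id."""
    current_tenant.set(tenant_id)

def get_tenant() -> str:
    """Any layer (queries, logs, events) reads the tenant from context."""
    try:
        return current_tenant.get()
    except LookupError:
        # Fail closed: refuse to run an unscoped operation.
        raise RuntimeError("No tenant in context")

def log(message: str) -> str:
    # Every log line includes tenant_id automatically.
    return f"tenant={get_tenant()} {message}"

set_tenant("acme")
print(log("order created"))  # tenant=acme order created
```

The key design choice is that get_tenant() fails loudly when no tenant was set, rather than silently running an unscoped query.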
The most common multi-tenant data leak is a single query missing WHERE tenant_id = ?. It may be a reporting query, an admin endpoint, or a background job. The fix is defense in depth: application-level filtering + database-level RLS + integration tests that assert tenant isolation (create data in tenant A, query as tenant B, assert zero results).
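The isolation test described above might look like the following sketch, using an in-memory SQLite database and a hypothetical tenant_query helper that appends the tenant filter so callers cannot forget it (a real codebase would use an ORM scope or query builder rather than string composition):

```python
import sqlite3

def tenant_query(conn, tenant_id, sql, params=()):
    """Append the tenant filter to every query so callers cannot forget it."""
    return conn.execute(f"{sql} AND tenant_id = ?", (*params, tenant_id)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'tenant_a', 42.0)")

# Create data in tenant A, query as tenant B, assert zero results.
rows_a = tenant_query(conn, "tenant_a", "SELECT * FROM orders WHERE 1=1")
rows_b = tenant_query(conn, "tenant_b", "SELECT * FROM orders WHERE 1=1")
assert len(rows_a) == 1
assert len(rows_b) == 0
```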

30.3 Tenant-Aware Concerns

Tenant-specific configuration (feature flags, limits, themes). Tenant-level rate limiting. Support access with audit trail. Tenant-aware logging and observability (include tenant_id in every log and metric).
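One common way to implement tenant-specific configuration is tier defaults plus per-tenant overrides. A minimal sketch, where all field names and limits are illustrative:

```python
# Tier defaults apply to every tenant on that tier; overrides are per-tenant.
TIER_DEFAULTS = {
    "free": {"rate_limit_per_min": 100, "max_storage_gb": 1, "custom_theme": False},
    "pro":  {"rate_limit_per_min": 1000, "max_storage_gb": 50, "custom_theme": True},
}

TENANT_OVERRIDES = {
    "acme": {"rate_limit_per_min": 5000},  # e.g., a contractually negotiated limit
}

def tenant_config(tenant_id: str, tier: str) -> dict:
    """Resolve config: tier defaults first, then per-tenant overrides on top."""
    return {**TIER_DEFAULTS[tier], **TENANT_OVERRIDES.get(tenant_id, {})}

assert tenant_config("acme", "pro")["rate_limit_per_min"] == 5000
assert tenant_config("other", "free")["rate_limit_per_min"] == 100
```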

30.4 Noisy Neighbor Mitigation Strategies

The noisy neighbor problem is the defining challenge of multi-tenant systems. Here are concrete strategies, ordered from cheapest to most expensive:
  1. Per-Tenant Rate Limiting. Apply rate limits at the API gateway based on tenant_id. Each tenant gets a request budget (e.g., 1000 req/min). Exceeding the budget returns 429 Too Many Requests. This prevents one tenant from monopolizing API capacity.
  2. Per-Tenant Resource Quotas. Enforce CPU, memory, and storage limits per tenant. In Kubernetes, use ResourceQuota objects scoped to tenant namespaces. In databases, use connection pool limits per tenant (e.g., tenant A gets max 20 connections, tenant B gets max 50).
  3. Separate Processing Queues. Route tenant workloads to separate queues based on tier. Free-tier tenants share a queue with lower priority. Paid tenants get a dedicated queue. Enterprise tenants get a dedicated queue with guaranteed throughput.
  4. Query Governors and Timeouts. Set per-tenant query timeouts at the database level. Kill queries that exceed the timeout. This prevents one tenant’s unoptimized report from locking tables for everyone.
  5. Separate Compute Pools for Enterprise Tenants. For your largest customers, provision dedicated compute (separate Kubernetes node pools, separate database read replicas, or fully separate database instances). This is the most expensive strategy but offers contractual SLA guarantees.
  6. Monitoring and Alerting Per Tenant. Track resource consumption per tenant. Alert when a single tenant’s usage exceeds a threshold (e.g., tenant X is consuming 40% of total DB CPU). This enables proactive intervention before other tenants are affected.
Start with rate limiting and monitoring — they are cheap and catch most noisy neighbor problems. Only invest in separate compute pools when you have enterprise customers whose SLAs demand it.
Strong answer: Rate limiting per tenant. Resource quotas (CPU, memory, storage). Separate queues or processing lanes for different tiers. Monitoring per-tenant resource usage. For premium tenants, dedicated infrastructure. A layered approach works best:
  1. Rate limiting at the API gateway prevents request floods.
  2. Resource quotas at the infrastructure level (Kubernetes ResourceQuotas, DB connection pool limits) prevent resource monopolization.
  3. Separate processing queues ensure high-priority tenants are not blocked by bulk operations from free-tier tenants.
  4. Query timeouts prevent runaway queries from saturating the database.
  5. Dedicated infrastructure for enterprise tenants who need contractual SLA guarantees.
  6. Per-tenant monitoring and alerting so you can detect and respond before other tenants are impacted.
Further reading: Ultimate Guide to Multi-Tenant SaaS Data Modeling by Flightcontrol — excellent practical walkthrough of the trade-offs. AWS SaaS Tenant Isolation Strategies Whitepaper — deep dive into pool, silo, and bridge isolation models with AWS-native implementation patterns; essential reading if you are building on AWS. AWS SaaS Architecture Fundamentals — the Well-Architected SaaS Lens covers tenant onboarding, metering, billing integration, and operational patterns at scale.
Strong answer: This is a tenant-aware data routing problem, not just a database problem. The architecture needs to support per-tenant data residency without fragmenting the entire system.

Step 1: Tenant metadata. Every tenant record includes a data_region field (e.g., us-east-1, eu-west-1). This is set during onboarding and drives all downstream routing decisions.

Step 2: Regional data planes, global control plane. The control plane (tenant management, authentication, billing) stays global — there is no compliance reason to regionalize it as long as it does not store regulated customer data. The data plane (the databases, object storage, and compute that process tenant data) is deployed per region. When a request arrives, the API gateway reads the tenant’s data_region from the control plane and routes the request to the correct regional data plane.

Step 3: Regional database instances. The EU tenant’s data lives in an EU database instance. US tenants’ data lives in a US instance. This is not “separate DB per tenant” — multiple EU tenants can share the same EU database using a shared-schema model with tenant_id filtering. You are regionalizing the infrastructure, not per-tenanting it.

Step 4: Cross-region concerns. Analytics and reporting that aggregate across regions need careful handling. Options: (a) replicate anonymized/aggregated data to a central analytics store, (b) run federated queries across regions, or (c) accept that some cross-region reports have higher latency. Avoid replicating raw PII across regions — that defeats the purpose.

Step 5: Compliance verification. Automated tests that assert no EU tenant data exists in US storage. Regular audits. Data residency is not a one-time setup — it is an ongoing operational concern.

Common mistakes: Trying to solve this with application-level filtering alone (you need infrastructure-level separation to satisfy auditors). Over-engineering by giving every tenant their own region (most tenants do not need it — only provision regional isolation for tenants that contractually require it). Forgetting that backups, logs, and cache layers also contain tenant data and must respect residency requirements.
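The control-plane lookup and regional routing described in Steps 1 and 2 reduce to a small amount of code. In this sketch, the tenant records and endpoint URLs are invented for illustration:

```python
# Global control plane: tenant metadata including data residency.
CONTROL_PLANE = {
    "acme":   {"data_region": "eu-west-1"},
    "globex": {"data_region": "us-east-1"},
}

# Per-region data plane endpoints (databases, storage, compute live behind these).
DATA_PLANES = {
    "eu-west-1": "https://api.eu-west-1.example.com",
    "us-east-1": "https://api.us-east-1.example.com",
}

def route(tenant_id: str) -> str:
    """Gateway logic: read the tenant's data_region, pick the regional data plane."""
    region = CONTROL_PLANE[tenant_id]["data_region"]
    return DATA_PLANES[region]

assert route("acme") == "https://api.eu-west-1.example.com"
assert route("globex") == "https://api.us-east-1.example.com"
```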

Part XXIV — Domain Modeling and Business Logic

Chapter 31: Domain-Driven Design Basics

Big Word Alert: Ubiquitous Language. A shared vocabulary between developers and domain experts where each term has one precise meaning within a bounded context. If the business says “order” and the developers say “transaction,” misunderstandings will leak into the code. DDD insists that the code uses the same terms as the business. When the PM says “the customer’s subscription was paused,” the code should have subscription.pause(), not setStatus(INACTIVE).
Applying DDD Everywhere. DDD is powerful for complex business domains (insurance, finance, logistics) where the rules are nuanced and change frequently. For simple CRUD applications (a blog, a todo app, a basic admin panel), DDD adds unnecessary abstraction. Ask: “Is the complexity in the business rules or in the technical implementation?” If the business rules are simple, a straightforward layered architecture is better.
Tools: Event Storming (workshop format for discovering domain events and bounded contexts — uses sticky notes on a wall). Context Mapper (open-source tool for modeling bounded contexts). Miro/FigJam (for remote event storming sessions).
Analogy: Bounded Contexts Are Like Countries. Bounded contexts are like countries — “football” means something completely different in the US vs the UK, and that is okay as long as you know which country you are in. The word “order” in the Fulfillment context means a shipment to pack and dispatch. The word “order” in the Billing context means an invoice to charge. The word “order” in the Analytics context means a data point in a revenue trend. Just like you do not try to create one universal definition of “football” that works in both countries, you do not try to create one universal Order model that serves all contexts. Each context gets its own model with its own language, and you translate at the borders — just like a currency exchange at an airport. The Anti-Corruption Layer in DDD is literally that currency exchange booth.

How Spotify’s “Spotify Model” Maps Bounded Contexts to Organizational Structure

Spotify’s engineering culture — widely documented around 2012-2014 through their “Spotify Model” whitepapers by Henrik Kniberg and Anders Ivarsson — offers one of the most tangible illustrations of how bounded contexts in DDD map to real organizational structure. Spotify organized its engineering teams into Squads (small, autonomous teams of 6-12 people, each owning a specific feature area), Tribes (collections of squads working in related areas), Chapters (groups of specialists across squads, like all backend engineers), and Guilds (informal communities of interest). What made this relevant to DDD was the alignment between squad ownership and bounded context boundaries. The Search squad owned the Search bounded context — its own data model, its own APIs, its own deployment pipeline. The Playlist squad owned the Playlist context. The Payment squad owned the Billing context. Each squad spoke its own ubiquitous language within its domain. A “track” meant something different to the Search squad (a searchable document with metadata and ranking signals) than to the Playback squad (a streamable audio file with codec information, bitrate options, and DRM licenses). The boundaries between squads were the context boundaries, and the APIs and events between squads were the context maps. When the Playlist squad needed information from the Social squad (to show which friends were listening to a playlist), they consumed integration events — they did not reach into the Social squad’s database. This organizational structure enforced the same decoupling that DDD prescribes at the software level. It is worth noting that Spotify itself has acknowledged the model evolved significantly over the years and was never as clean in practice as it appeared on paper. Many companies copied the labels (squads, tribes) without understanding the underlying principle: that organizational boundaries should align with domain boundaries, and that each team should own its context end-to-end. 
The lesson is not “copy Spotify’s org chart” — it is that Conway’s Law is real, and your bounded contexts will inevitably mirror your team structure. Design both intentionally.

31.1 Entities, Value Objects, and Aggregates

Entities have identity — two users with the same name are different users. Identity persists even if every attribute changes (a user changes their name, email, and password — still the same user). In code: entities have an id field and equality is based on id, not attributes.

Value objects are defined by their attributes — two Money(100, "USD") are the same. They are immutable (to change an amount, you create a new Money object). In code: equality is based on all attributes, no id field. Use for: addresses, date ranges, coordinates, money, email addresses.

Aggregates are clusters of entities and value objects treated as a unit for data changes. The aggregate root is the single entry point — external code can only modify the aggregate through the root. This enforces business rules.
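These equality rules can be sketched in Python (class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:  # value object: immutable, equality over all attributes
    amount: int      # minor units (cents) to sidestep float rounding
    currency: str

class User:  # entity: equality is identity, not attributes
    def __init__(self, user_id: str, name: str):
        self.id = user_id
        self.name = name

    def __eq__(self, other) -> bool:
        return isinstance(other, User) and self.id == other.id

assert Money(100, "USD") == Money(100, "USD")        # same attributes, same value
assert User("u1", "Jane") == User("u1", "Jane Doe")  # same identity, renamed
assert User("u1", "Jane") != User("u2", "Jane")      # same attributes, different identity
```

frozen=True makes Money immutable: assigning to money.amount raises an error, so "changing" an amount means constructing a new Money.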

Aggregate Rules

  1. Aggregate Root is the only entry point. External objects may only hold references to the aggregate root, never to internal entities. To add a line item to an order, you call order.addItem(), not lineItem.save().
  2. Consistency boundary. All invariants within an aggregate are enforced in a single transaction. If the business rule says “order total must be at least $10,” the aggregate root checks this on every mutation.
  3. Transactional boundary. One transaction = one aggregate. Never modify two aggregates in the same transaction. If placing an order must also update inventory, the Order aggregate publishes an OrderPlaced event and the Inventory aggregate handles it asynchronously.
  4. Keep aggregates small. Large aggregates cause lock contention and merge conflicts. If two users can independently modify different parts of a large aggregate, it needs to be split.
  5. Reference other aggregates by ID, not by object. An OrderLineItem stores product_id, not a reference to the Product aggregate. This prevents coupling and allows aggregates to live in different bounded contexts or services.
Concrete example — e-commerce Order aggregate:
Order (Aggregate Root)
  |-- order_id (identity)
  |-- status: DRAFT -> SUBMITTED -> SHIPPED -> DELIVERED
  |-- OrderLineItem (Entity within aggregate)
  |     |-- line_item_id
  |     |-- product_id, quantity, unit_price
  |-- ShippingAddress (Value Object — immutable, no identity)
  |     |-- street, city, zip, country
  |-- Money total (Value Object)

Rules enforced by the aggregate root:
  - Cannot add items to a SHIPPED order
  - Total is recalculated when items change
  - Minimum order value is $10
  - External code calls order.addItem(), never modifies OrderLineItem directly
Why aggregate boundaries matter: The aggregate is the consistency boundary. Within an aggregate, changes are atomic (one transaction). Across aggregates, changes are eventually consistent (via domain events). Choosing the right aggregate size is a key design decision: too large leads to lock contention, too small leads to consistency issues across aggregates.
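A minimal sketch of the Order aggregate above, with the root enforcing the listed rules. Amounts are in cents; method and class names are illustrative, and line items are simplified to tuples.

```python
class Order:
    """Aggregate root: the only entry point for modifying an order."""

    MIN_TOTAL_CENTS = 1000  # business rule: minimum order value is $10

    def __init__(self, order_id: str):
        self.order_id = order_id        # entity identity
        self.status = "DRAFT"           # DRAFT -> SUBMITTED -> SHIPPED -> DELIVERED
        self._items = []                # internal entities, never exposed directly

    def add_item(self, product_id: str, quantity: int, unit_price_cents: int):
        # Rule: cannot add items to a SHIPPED order.
        if self.status == "SHIPPED":
            raise ValueError("Cannot add items to a SHIPPED order")
        # Rule: reference the Product aggregate by ID, not by object.
        self._items.append((product_id, quantity, unit_price_cents))

    @property
    def total_cents(self) -> int:
        # Rule: total is recalculated whenever items change.
        return sum(qty * price for _, qty, price in self._items)

    def submit(self):
        # Invariants are checked inside the aggregate, in one transaction.
        if self.total_cents < self.MIN_TOTAL_CENTS:
            raise ValueError("Minimum order value is $10")
        self.status = "SUBMITTED"

order = Order("123")
order.add_item("sku-1", 2, 700)   # $14.00 total
order.submit()
assert order.status == "SUBMITTED"
```

Note that external code never touches _items: it calls order.add_item(), which is exactly the "aggregate root is the only entry point" rule.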

31.2 Bounded Contexts

A boundary within which a particular domain model is defined and applicable. Concrete example — the word “User” in different contexts: Consider a SaaS platform. The same real-world person — say, Jane — is modeled differently depending on the context:
  • In the Authentication context, “User” means: email, password_hash, mfa_enabled, last_login, session_tokens. The concern is identity verification.
  • In the Billing context, “User” means: subscription_plan, payment_method, invoice_history, mrr_contribution. The concern is revenue.
  • In the Support context, “User” means: ticket_history, satisfaction_score, support_tier, assigned_agent. The concern is customer experience.
Trying to build one User model that serves all three contexts creates a bloated, tangled entity with 50+ fields that is painful to maintain and impossible to reason about. Bounded contexts let each team model the concept in the way that best serves their needs.

Another example — “Customer” across Sales and Support: “Customer” in the Sales context has: name, email, payment methods, purchase history, loyalty tier. “Customer” in the Support context has: name, ticket history, satisfaction score, support tier, assigned agent. Same real-world person, different models optimized for different purposes.

How contexts communicate: Through well-defined interfaces at the boundary. The Sales context publishes a CustomerRegistered event. The Support context consumes it and creates its own representation. Each context owns its own database/tables. They never share database tables (that would couple the contexts).

Context mapping patterns:
  • Shared Kernel: Two contexts share a small, carefully managed subset of code.
  • Customer-Supplier: Upstream context provides, downstream consumes — the supplier accommodates the consumer’s needs.
  • Conformist: Downstream accepts whatever upstream provides without negotiation.
  • Anti-Corruption Layer: Downstream translates the upstream model into its own terms — essential when integrating with legacy or third-party systems.
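An Anti-Corruption Layer can be as simple as a translation function at the border. The sketch below assumes the Support context consumes a CustomerRegistered event from the Sales context; all field names are invented for illustration.

```python
def translate_customer_registered(event: dict) -> dict:
    """ACL: map the upstream (Sales) vocabulary into Support's own model."""
    return {
        "customer_id": event["id"],          # the shared identifier survives
        "display_name": event["full_name"],  # renamed into Support's language
        "support_tier": "standard",          # Support-owned default, not from Sales
        "ticket_history": [],                # Support-only concept
    }

upstream_event = {"id": "c-42", "full_name": "Jane Doe", "loyalty_tier": "gold"}
support_customer = translate_customer_registered(upstream_event)

assert support_customer["customer_id"] == "c-42"
assert "loyalty_tier" not in support_customer  # Sales concepts stay out of Support
```

The point is the direction of dependency: Support depends only on the event contract, never on Sales' internal model, so Sales can change its internals without breaking Support.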

31.3 Domain Events

Something meaningful that happened in the domain. OrderPlaced, PaymentReceived, InventoryReserved. Events represent facts — they are immutable and past tense. They drive integration between bounded contexts. Design rules for domain events: Name them in past tense (something that happened, not something that should happen). Include all data the consumers need (do not force consumers to call back for details). Include: event type, aggregate ID, timestamp, causation ID (what triggered this event), and the relevant data. Events are the primary mechanism for decoupling bounded contexts — the Sales context does not call the Shipping context directly; it publishes OrderPlaced and the Shipping context reacts.

Domain Events vs Integration Events

Understanding the distinction between these two types of events is critical:

Domain Events are internal to a bounded context. They represent something that happened within the domain model and are used to trigger side effects inside the same context. They are typically dispatched in-memory (not through a message broker). Example: OrderLineItemAdded triggers a recalculation of the order total within the Order aggregate.

Integration Events cross bounded context boundaries. They are published to a message broker (Kafka, RabbitMQ, SNS) and consumed by other services. They carry a self-contained payload so consumers do not need to call back. Example: OrderPlaced is published by the Order Service and consumed by the Shipping Service, the Notification Service, and the Analytics Service.
| Aspect | Domain Events | Integration Events |
| --- | --- | --- |
| Scope | Within a bounded context | Across bounded contexts / services |
| Transport | In-memory (mediator pattern) | Message broker (Kafka, RabbitMQ, SNS/SQS) |
| Payload | Can reference internal domain objects | Must be self-contained (no internal references) |
| Schema Evolution | Free to change (internal) | Must be versioned carefully (public contract) |
| Failure Handling | Part of the same transaction | Requires idempotency, retries, dead-letter queues |

Connection to Event Sourcing

Domain events are the foundation of event sourcing. In a traditional system, you store the current state (e.g., order.status = SHIPPED). In event sourcing, you store the sequence of events that led to the current state:
1. OrderCreated { orderId: "123", customerId: "456", items: [...] }
2. PaymentReceived { orderId: "123", amount: 99.00 }
3. OrderShipped { orderId: "123", trackingNumber: "UPS-789" }
The current state is derived by replaying these events. This gives you a complete audit trail, the ability to rebuild state at any point in time, and natural integration with CQRS (Command Query Responsibility Segregation).
Event sourcing is not required to use domain events. Most systems use domain events without event sourcing. Adopt event sourcing only when you need a complete audit trail or temporal queries (“what was the state of this order on March 15th?”). It adds significant complexity to querying and storage.
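Replaying a sequence of events like the three above to derive current state might look like this sketch (the handler logic and field names are illustrative):

```python
def apply(state: dict, event: dict) -> dict:
    """Fold one event into the current state; unknown events are ignored."""
    kind = event["type"]
    if kind == "OrderCreated":
        return {"orderId": event["orderId"], "status": "CREATED", "paid": 0}
    if kind == "PaymentReceived":
        return {**state, "paid": state["paid"] + event["amount"], "status": "PAID"}
    if kind == "OrderShipped":
        return {**state, "status": "SHIPPED", "trackingNumber": event["trackingNumber"]}
    return state

events = [
    {"type": "OrderCreated", "orderId": "123", "customerId": "456"},
    {"type": "PaymentReceived", "orderId": "123", "amount": 99.00},
    {"type": "OrderShipped", "orderId": "123", "trackingNumber": "UPS-789"},
]

state = {}
for e in events:
    state = apply(state, e)

assert state["status"] == "SHIPPED"
assert state["paid"] == 99.00
```

Replaying only a prefix of the list answers the temporal question ("what was the state on March 15th?"): stop folding at the last event before that date.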
Strong answer: Look for natural seams — areas of the code that change together, use the same domain language, and serve the same business capability. Talk to domain experts — different teams often think about the same concept (e.g., “customer”) differently, which signals a context boundary. Look at the database — tables that are always queried together belong in the same context. Look at the deployment — code that should be deployable independently belongs in different contexts.

Common mistake: drawing boundaries too small (creating a “service” for every database table). The right size is a business capability: “Order Management,” “Inventory,” “Customer Accounts” — not “OrderLineItem.”
Further reading: Domain-Driven Design by Eric Evans — the foundational text. Implementing Domain-Driven Design by Vaughn Vernon — the practical companion. Learning Domain-Driven Design by Vlad Khononov — a more modern, accessible introduction than Evans’ original. Eric Evans’ DDD Reference — a free, concise summary of all DDD patterns from the creator himself; keep this bookmarked as a quick-reference card. Vaughn Vernon’s Key Concepts from “Implementing DDD” — distilled summary of the most important tactical and strategic patterns with code examples. Martin Fowler on Bounded Contexts — Fowler’s clear, concise explanation of why bounded contexts are the most important pattern in DDD. Martin Fowler on Aggregate Design — practical guidance on sizing aggregates and enforcing consistency boundaries.
Strong answer: This is the textbook scenario for bounded contexts. The mistake is trying to build one shared User model that serves both teams. That model will grow into a bloated, conflicted entity with 40+ fields, half of which are irrelevant to each team, and every change risks breaking the other team’s features.

The DDD approach: each team defines its own model of User within its own bounded context. Say Team A is building authentication and Team B is building billing. In the Auth context, User means: email, password_hash, mfa_enabled, last_login, session_tokens, login_attempts. In the Billing context, User means: subscription_plan, payment_method, invoice_history, mrr_contribution, billing_address. These are not the same entity — they are different projections of the same real-world person, optimized for different purposes.

How they stay in sync: A shared identifier (user_id) links the two models. When a new user registers, the Auth context publishes a UserRegistered integration event containing { user_id, email, name }. The Billing context consumes that event and creates its own BillingCustomer record with the user_id as a foreign reference. Each context owns its own database tables and evolves its schema independently.

What about shared fields like name and email? These are duplicated across contexts — and that is okay. The Auth context is the source of truth for email (because email changes go through the auth flow). If the Billing context needs an updated email (for invoice delivery), it listens for UserEmailChanged events. This is eventual consistency, and it is the right trade-off — it is far better than coupling two teams to a shared database table.

The anti-pattern to avoid: A shared User library or shared database table that both teams depend on. This creates a coordination bottleneck — every schema change requires cross-team alignment, deployments become coupled, and you lose the autonomy that bounded contexts are designed to provide.

When bounded contexts are overkill: If the two teams are actually working in the same domain and the fields overlap by 80%+, they might belong in the same bounded context with a single model. Bounded contexts are not about giving every team its own copy of everything — they are about recognizing genuine differences in how a concept is modeled and used.

Part XXV — Documentation and Communication

Chapter 32: Engineering Documentation

Big Word Alert: ADR (Architecture Decision Record). A lightweight document that captures an important architectural decision along with its context and consequences. ADRs accumulate as a decision log — when a new engineer asks “why do we use Kafka instead of RabbitMQ?”, the ADR explains the reasoning at the time of the decision. Without ADRs, architectural knowledge lives only in people’s heads and leaves when they do.
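A minimal ADR, loosely following the widely used Nygard template (Title, Status, Context, Decision, Consequences), might look like the sketch below. The number, date, and decision are invented for illustration.

```markdown
# ADR-007: Use Kafka for inter-service integration events

Status: Accepted (2024-03-01)

## Context
The Order and Shipping services need durable, replayable event delivery.
RabbitMQ was evaluated but does not retain events after consumption,
which blocks rebuilding downstream consumers from history.

## Decision
Adopt Kafka as the event backbone for all integration events.

## Consequences
+ Consumers can be rebuilt by replaying the log.
+ One shared contract per topic, versioned in the schema registry.
- Operational overhead of running and monitoring a Kafka cluster.
- Teams must learn partitioning and consumer-group semantics.
```

Kept in a /docs/adr folder and updated in the same PR as the code, this is exactly the searchable decision log that answers “why Kafka instead of RabbitMQ?” years later.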
Documentation Nobody Reads. The biggest documentation failure is writing docs that are never maintained. Outdated documentation is worse than no documentation — it actively misleads. Prefer documentation that is close to the code (README in the repo, OpenAPI spec, ADRs in a /docs folder) and part of the development workflow (update the docs in the same PR as the code change). Never maintain documentation in a separate wiki that is disconnected from the code.
Tools: adr-tools (CLI for managing ADRs). Backstage (developer portal with TechDocs). Notion, Confluence (team documentation). Swagger/OpenAPI (API documentation from code). Docusaurus, MkDocs (documentation sites from Markdown).

How Amazon’s “Working Backwards” Culture Drives Engineering Quality

Amazon is famously a writing culture. Jeff Bezos banned PowerPoint in executive meetings in the early 2000s, replacing slide decks with six-page narrative memos that meeting attendees read in silence for the first 20-30 minutes before discussion begins. But the most remarkable documentation practice at Amazon is the “Working Backwards” press release — and it has profound implications for how engineers think about documentation. Before building a new product or feature, Amazon teams write a mock press release announcing the finished product to the world. This is not marketing fluff — it is a forcing function for clarity. The press release must articulate: who is the customer, what is the problem, why do existing solutions fall short, what does this product do, and what does the customer say about it (a fictional quote that captures the value proposition). If the team cannot write a compelling one-page press release, the idea is not clear enough to build. What makes this relevant to engineering documentation is the underlying philosophy: writing is not something you do after you build. Writing is how you think. The act of putting an idea into clear prose exposes fuzzy thinking, unstated assumptions, and gaps in logic that whiteboards and verbal discussions miss. An ADR forces you to articulate why you chose PostgreSQL over MongoDB before you start coding. An RFC forces you to think through failure modes before you ship. Amazon’s press release forces product teams to define success before writing a single line of code. Amazon engineers have noted that the six-page memo culture creates better meetings (everyone has the same context), better decisions (arguments are written and structured, not improvised), and better institutional memory (memos are archived and searchable). The cost is real — writing a good six-page memo takes days, not hours. 
But Amazon’s bet is that the cost of building the wrong thing, or building the right thing without shared understanding, is far higher. For engineers at any company, the lesson is this: if you cannot explain what you are building and why in clear, jargon-free prose, you do not yet understand it well enough to build it.
Strong answer: Start small and make it part of the workflow, not a separate project. Week 1: write the first ADR for the next architectural decision your team makes — this sets the precedent. Create a README template for each service (what it does, how to run it, key dependencies, who owns it). Write a runbook for the last incident. Do not try to document everything retroactively — document as you go. Make documentation part of the PR checklist: “If this PR changes behavior, is the relevant doc updated?” Within a month, you will have a growing, living documentation base. The key: documentation is a habit, not a project.
Strong answer: Reframe it: undocumented systems slow everyone down more. Every time a new engineer asks “how does the auth flow work?” and someone has to explain it verbally, that is 30 minutes. An ADR takes 15 minutes to write once and saves hundreds of hours over its lifetime. Start with the three highest-impact documents: the service README (saves onboarding time), the incident runbook (saves 3 AM debugging time), and a system architecture diagram (saves “how does this all fit together?” questions). Prove the value with these three, and adoption follows.

32.1 Architecture Decision Records (ADRs)

Capture: title, status (proposed/accepted/deprecated), context (why this decision is needed), decision (what we decided), consequences (what changes, what trade-offs we accept). Forces clarity of thought by requiring you to articulate reasoning before implementing.

ADR Template

Use this template for every architectural decision worth recording:
# ADR-[NUMBER]: [Short Title of Decision]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]

## Date
[YYYY-MM-DD]

## Context
[What is the issue that we are seeing that motivates this decision?
What are the forces at play (technical, business, team, compliance)?
What constraints exist?]

## Decision
[What is the change that we are proposing and/or doing?
State the decision in full sentences, in active voice:
"We will use X because Y."]

## Consequences

### Positive
- [What becomes easier or better?]

### Negative
- [What becomes harder or worse?]
- [What trade-offs are we accepting?]

### Risks
- [What could go wrong? How will we mitigate it?]

## Alternatives Considered
- [Option A]: [Why rejected]
- [Option B]: [Why rejected]
Example ADR:
ADR-007: Use PostgreSQL over MongoDB for Order Service
Status: Accepted
Date: 2025-09-15

Context: Orders have complex relationships (line items, shipping, payments, discounts).
  We need ACID transactions for payment processing. Team has deep PostgreSQL expertise.

Decision: Use PostgreSQL with the existing RDS cluster. Use JSONB columns for
  flexible order metadata rather than a separate document store.

Consequences:
  Positive: Strong consistency and JOINs. Team expertise reduces ramp-up time.
  Negative: Less schema flexibility for order metadata (mitigated by JSONB).
  Risks: Additional load on the existing RDS cluster (monitor and consider a read replica).

Alternatives Considered:
  - MongoDB: Rejected because of the need for ACID transactions across related
    collections and the team's lack of MongoDB operational experience.
  - DynamoDB: Rejected because of complex querying requirements (JOINs across
    orders, line items, and payments) that DynamoDB handles poorly.
Keep ADRs short — one page maximum. If you need more than a page, you are probably making multiple decisions and should split them into separate ADRs. Store ADRs in the repository (e.g., /docs/adrs/) so they are versioned alongside the code they describe.
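Storing ADRs in the repository needs no special tooling — a directory and a sequential naming convention are enough. A minimal sketch (the directory path and ADR number are illustrative; the adr-tools CLI mentioned above automates the same steps):

```shell
# Keep ADRs versioned alongside the code they describe.
mkdir -p docs/adrs

# Each decision gets the next sequential number and a slugified title.
cat > docs/adrs/0007-use-postgresql-for-order-service.md <<'EOF'
# ADR-007: Use PostgreSQL over MongoDB for Order Service

## Status
Accepted

## Date
2025-09-15
EOF

# The decision log is now just a directory listing, searchable with grep.
ls docs/adrs/
```

Because the files live in the repo, an ADR change rides along in the same PR as the code it explains, and `git log` shows who decided what, and when.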

32.2 Runbooks

Step-by-step operational playbooks written for someone who has never dealt with this situation before (because at 3 AM, that might be you — half-asleep with no context). Every runbook must include:
  • Symptom: What alert or user report triggers this.
  • Impact: What is broken for users.
  • Diagnosis: Specific commands to run, dashboards to check.
  • Mitigation: Steps to fix, ordered from fastest to most thorough.
  • Escalation: Who to contact if mitigation fails, with phone numbers.
  • Post-incident: Links to postmortem template, follow-up actions.
Update the runbook after every incident — the next on-call engineer should not rediscover the same steps.
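The six required elements map directly onto a skeleton. A sketch — the service name, alert name, URL, and thresholds are placeholders, not real infrastructure:

```markdown
# Runbook: Order Service — Stuck Orders

## Symptom
PagerDuty alert `orders-stuck-high` fires, or support reports orders frozen in "processing".

## Impact
Customers cannot complete checkout; revenue-affecting.

## Diagnosis
1. Open https://grafana.internal/d/orders-health (placeholder URL).
2. Count stuck orders:
   psql -h prod-db.internal -U readonly -c "SELECT count(*) FROM orders WHERE status = 'stuck';"

## Mitigation
- If count < 100: requeue via the admin endpoint, fastest path first.
- If count >= 100: page the orders team before requeueing.

## Escalation
Orders on-call (PagerDuty schedule), then the team lead.

## Post-incident
File a postmortem within 48 hours; link it here.
```

Note the decision-tree style in Mitigation — “If X, do Y” — rather than paragraphs.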

What Makes a Good Runbook

  1. Step-by-step, copy-pasteable commands. Do not write “check the database.” Write psql -h prod-db.internal -U readonly -c "SELECT count(*) FROM orders WHERE status = 'stuck' AND created_at > NOW() - INTERVAL '1 hour';". The reader should be able to follow the runbook without thinking.
  2. No assumptions about reader knowledge. Assume the reader has never seen this system before. Spell out which dashboard to open, which cluster to connect to, which namespace to look in. Include URLs, not descriptions (“open the Grafana dashboard” is bad; “open https://grafana.internal/d/orders-health” is good).
  3. Decision trees, not paragraphs. Use “If X, do Y. If Z, do W.” format. At 3 AM, nobody reads paragraphs.
  4. Tested regularly. A runbook that has never been followed is a runbook that does not work. Run through your runbooks during game days (scheduled incident simulations). Fix the steps that are wrong or missing.
  5. Owned and dated. Every runbook has an owner and a “last verified” date. If the last verified date is more than 6 months ago, the runbook is suspect.
  6. Linked from alerts. Every PagerDuty/Opsgenie alert should include a direct link to the relevant runbook. The on-call engineer should never have to search for documentation during an incident.
The best test of a runbook: hand it to an engineer on another team and ask them to follow it. If they get stuck or confused, the runbook needs improvement.

32.3 API Documentation

Principles: Keep it close to code (OpenAPI/Swagger generated from annotations or code — documentation that drifts from reality is worse than no documentation). Include request/response examples for every endpoint (developers read examples first, descriptions second). Document all error responses (not just the happy path — what does a 422 look like?). Document rate limits and pagination. Version documentation alongside the API. Provide a “Getting Started” guide (authentication, first API call, common workflows).
Tools: OpenAPI/Swagger (the standard — generates interactive documentation, client SDKs, and server stubs). Redoc (beautiful documentation from OpenAPI specs). Postman (API testing + documentation). Stoplight (API design-first platform).
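As a concrete illustration of documenting the unhappy path, here is a minimal OpenAPI fragment; the endpoint, fields, and error shapes are invented for the sketch:

```yaml
paths:
  /orders/{orderId}:
    get:
      summary: Fetch a single order
      parameters:
        - name: orderId
          in: path
          required: true
          schema: { type: string }
      responses:
        "200":
          description: The order
          content:
            application/json:
              example: { id: "ord_123", status: "shipped", total_cents: 4999 }
        "404":
          description: Order not found
          content:
            application/json:
              example: { error: "not_found", message: "No order with id ord_123" }
        "422":
          description: Validation error — document this, not just the happy path
          content:
            application/json:
              example:
                error: "validation_failed"
                details: [{ field: "orderId", message: "must start with ord_" }]
```

Every response code carries an inline example, so a developer can see exactly what a 422 looks like without triggering one.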

32.4 Communication Skills

Explain trade-offs, not just solutions. Write good PR descriptions (what changed, why, how to test). Write good tickets (problem, context, acceptance criteria). Communicate incidents clearly (what happened, impact, timeline, what we are doing).

PR Description Template

Use this template for every non-trivial pull request:
## What
[One sentence: what does this PR do?]

## Why
[The business or technical reason. Link to the ticket/issue.]

## How
[Key design decisions. Why did you approach it this way?
Mention alternatives you considered and why you rejected them.]

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manually tested [describe the scenario]
- [ ] Edge cases considered: [list them]

## Risks
[What could go wrong after merging? What should we monitor?
If nothing, write "Low risk — isolated change with no external dependencies."]

## Screenshots
[If UI changed, include before/after screenshots.
If API changed, include example request/response.]

## Reviewer Notes
[Anything the reviewer should pay special attention to.
Files to start reviewing from. Context they might need.]
A good PR description saves 30 minutes of review time because the reviewer understands context before reading code. If your PR description is empty, you are shifting the burden of understanding onto the reviewer — and they will either rubber-stamp it (dangerous) or ask many questions (slow).
Writing for engineers vs. writing for stakeholders: For engineers: be precise, include technical details, link to relevant code. For stakeholders: lead with business impact, use metrics, skip implementation details, end with a recommendation. “We refactored the auth middleware” means nothing to a PM. “Login is now 3x faster and we eliminated the session timeout bug affecting 200 users/day” means everything.

Communication as an Engineering Skill

Technical skill gets you to mid-level. Communication is what makes you senior. The ability to write clearly, explain trade-offs, and align a team on a technical direction is the single most underrated engineering skill.

How Senior Engineers Write

Senior engineers write with three qualities:
  1. Precision. Every word carries meaning. “The service is slow” becomes “P95 latency for the /orders endpoint increased from 120ms to 800ms after the deployment at 14:32 UTC.” Precision eliminates follow-up questions.
  2. Structure. Information is organized so the reader gets what they need at the level of detail they need. An executive gets the one-line summary. A peer engineer gets the technical details. Both find what they need without reading the whole document.
  3. Audience awareness. The same information is framed differently for different audiences. To engineering: “We need to migrate from MySQL to PostgreSQL because of XYZ limitations.” To leadership: “The database migration will take 3 sprints, reduce incident frequency by ~40%, and unblock the multi-region initiative.”

RFC / Design Documents

For decisions that are too large for an ADR (new services, major refactors, platform changes), write an RFC (Request for Comments) or Design Document. The structure:
  1. Title and Authors — who is proposing this.
  2. Status — Draft, In Review, Accepted, Rejected, Implemented.
  3. Summary — one paragraph a busy VP could read and understand the proposal.
  4. Motivation — why is the current state insufficient? What problem are we solving? Include data (error rates, latency numbers, customer complaints).
  5. Proposed Solution — the detailed technical design. Diagrams, API schemas, data models, sequence diagrams. Enough detail that another engineer could implement it.
  6. Alternatives Considered — what else did you evaluate? Why did you reject it? This section builds trust — it shows you did not just pick the first idea.
  7. Risks and Mitigations — what could go wrong? How will you detect it? What is the rollback plan?
  8. Milestones and Timeline — break the work into phases. What can we ship incrementally?
  9. Open Questions — what do you not know yet? What input do you need from reviewers?
The RFC process is not about getting permission — it is about getting feedback. Circulate the RFC before you start building. The cost of changing a design document is near zero. The cost of changing a running system is enormous.

Status Updates and Incident Communication

Project status updates follow a consistent format so stakeholders can scan quickly:
  • Status: On Track / At Risk / Blocked
  • Summary: One sentence on progress since last update.
  • Completed: What shipped.
  • In Progress: What is being worked on.
  • Blocked: What is stuck and what is needed to unblock.
  • Next: What is planned for the next cycle.
Incident communication follows a different cadence:
  • Initial notification (within 5 minutes of detection): What is happening, what is impacted, who is investigating.
  • Updates every 15-30 minutes during the incident: What we know, what we have tried, what we are trying next.
  • Resolution notification: What fixed it, what is the residual impact, when will the postmortem happen.
  • Postmortem (within 48 hours): Timeline, root cause, contributing factors, action items with owners and deadlines.
During an incident, over-communicate. Silence makes people nervous. Even “We are still investigating, no new information” is better than no update for 45 minutes.
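Put together, a mid-incident update following this cadence might look like the following (the incident number, service, and times are invented):

```markdown
**[14:55 UTC] Incident #4821 — Checkout errors (update 2)**
- Impact: ~15% of checkout requests returning 500s since 14:32 UTC.
- What we know: error spike correlates with the 14:30 payments deploy.
- What we tried: restarted the payments pods — no effect.
- What we are trying next: rolling back the 14:30 deploy (ETA 10 min).
- Next update: 15:15 UTC, or sooner if status changes.
```

The “next update” line is the key detail: it tells stakeholders exactly when to expect more information, which stops the anxious pings.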

Writing as Leverage

A well-written document is the highest-leverage activity a senior engineer can perform:
  • A design doc aligns 10 engineers without 10 meetings.
  • An ADR answers the same question for every future engineer who joins the team.
  • A runbook saves hours of debugging at 3 AM.
  • A clear status update prevents a panicked Slack thread from leadership.
The engineers who get promoted to staff and principal are almost always strong writers. Not because writing is valued in isolation, but because writing forces clear thinking, and clear thinking produces better systems.
Further reading: The Staff Engineer’s Path by Tanya Reilly — covers technical communication, influence, and documentation as core engineering skills. Docs for Developers by Jared Bhatti et al. — practical guide to writing documentation that people actually read. ADR GitHub Organization — comprehensive collection of ADR tools, templates, and examples; includes adr-tools, log4brains, and MADR (Markdown ADR) templates for different team workflows. Stripe’s Approach to API Documentation — widely considered the gold standard for developer-facing API docs; study how they structure endpoints, show request/response pairs inline, and provide copy-pasteable code in multiple languages. Google’s Technical Writing Courses — free, self-paced courses covering grammar for engineers, writing clear sentences, organizing documents, and illustrating technical concepts; required training for many Google engineering teams. Basecamp’s “Shape Up” Methodology — while primarily about product development, the “shaping” process is one of the best frameworks for writing effective technical proposals and design documents; the concepts of appetite, pitches, and fat-marker sketches translate directly to RFC writing.
Strong answer: The mistake most people make is treating documentation as a separate activity — a chore to be done after the “real work.” That framing guarantees failure. Nobody wants to do chores. The fix is to embed documentation into the existing workflow so that writing docs is not extra work — it is part of the work.

Tactic 1: Make documentation a side effect of decisions, not a separate task. Do not ask people to “write documentation.” Instead, adopt ADRs — every time the team makes a technical decision, the decision-maker writes a short ADR in the PR. This is not documentation for documentation’s sake. It is capturing the reasoning while it is fresh. After three months, you have a searchable decision log and nobody felt like they were “writing docs.”

Tactic 2: Embed docs in the PR workflow. Add a section to your PR template: “If this changes behavior, what doc needs updating?” Make it a checklist item, not a gate. The goal is to create a habit, not a police state. When reviewers start asking “did you update the runbook?” in code review, the culture is shifting.

Tactic 3: Write the first docs yourself and make them visibly useful. Write the onboarding guide that saves the next new hire two days of confusion. Write the runbook that saves the on-call engineer an hour at 3 AM. When people experience the value of good documentation firsthand — when they are the ones saved by a runbook at 3 AM — they become converts. You cannot lecture people into caring about documentation. You can show them.

Tactic 4: Lower the bar for quality. A rough, slightly-wrong document that exists is infinitely more valuable than a perfect document that no one ever writes. Encourage “good enough” documentation. It will get better over time as people iterate on it.

Tactic 5: Kill zombie docs ruthlessly. Outdated documentation is worse than no documentation because it actively misleads. If you find a doc that is wrong, delete it or fix it immediately. This builds trust that your documentation base is reliable, which makes people more likely to both read it and contribute to it.

What does NOT work: mandating that every ticket requires a doc. Creating a “documentation sprint.” Assigning documentation to the most junior person on the team. Putting docs in a wiki that is disconnected from the codebase. All of these approaches treat documentation as punishment, and the results will reflect that.