Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Database Patterns
Data management is one of the most challenging aspects of microservices — and the place where most migrations go wrong. The fundamental tension: in a monolith, you get ACID transactions and JOINs for free. In microservices, every cross-service data operation becomes a distributed systems problem. The patterns in this chapter (database-per-service, sagas, outbox, CQRS) exist because you are trading database-level consistency for deployment independence. That trade-off is worth it at scale, but only if you understand what you are giving up and how to compensate.- Implement database-per-service pattern
- Handle data consistency across services
- Design shared data strategies
- Execute zero-downtime database migrations
- Choose the right database for each service
Database Per Service Pattern
Why Separate Databases?
Database Selection Guide for Microservices
Polyglot persistence — using different databases for different services — is a key advantage of microservices. But “use the right tool for the job” is only useful advice if you know which tool fits which job.| Data Characteristic | Best Database Type | Examples | Why |
|---|---|---|---|
| Transactional with complex relationships | Relational (SQL) | PostgreSQL, MySQL | ACID guarantees, JOINs, mature tooling |
| Flexible schema, document-shaped | Document store | MongoDB, Couchbase | No schema migrations, nested objects, fast iteration |
| High-write throughput, wide-column | Column-family | Cassandra, ScyllaDB | Linear write scaling, time-series friendly |
| Key-value lookups, caching, sessions | Key-value store | Redis, DynamoDB | Sub-millisecond reads, simple access patterns |
| Full-text search, log analytics | Search engine | Elasticsearch, OpenSearch | Inverted index, fuzzy matching, aggregations |
| Graph relationships (social, fraud) | Graph database | Neo4j, Amazon Neptune | Traversal queries that would be N+1 JOINs in SQL |
| Event log, audit trail | Append-only log | Kafka (log), EventStoreDB | Immutable history, replay capability |
| Service | Primary DB | Why | Cache Layer |
|---|---|---|---|
| User/Auth | PostgreSQL | User data is relational; login needs ACID | Redis (sessions) |
| Product Catalog | MongoDB | Varied attributes per category; read-heavy | Redis or CDN |
| Orders | PostgreSQL | Order lifecycle is transactional; financial data needs ACID | None (consistency critical) |
| Inventory | PostgreSQL | Decrement operations need transactions | Redis (read cache, not source of truth) |
| Cart | Redis | Ephemeral, session-scoped, fast expiry | N/A (Redis IS the cache) |
| Search | Elasticsearch | Full-text search, faceted filtering | N/A (Elasticsearch IS the index) |
| Notifications | MongoDB or DynamoDB | Simple writes, no complex queries, high volume | None |
| Analytics | ClickHouse or BigQuery | Column-oriented, aggregation-heavy | Pre-computed materialized views |
Implementation Patterns
Why does each service own its own connection pool and schema? Because in a distributed system, the database is an implementation detail of the service — not a shared resource. When you let two services talk to the same tables, you have created an invisible coupling that bypasses your API contracts. Schema changes require coordinated deploys, query performance problems in one service cause latency spikes in another, and a runaway connection pool from one service can exhaust slots for everyone else. The CAP theorem tells us we can only pick two of consistency, availability, and partition tolerance. By giving each service its own database, you are choosing availability and partition tolerance at the system level — each service can evolve, fail, and recover independently. The cost is that you can no longer use a database transaction to keep two services’ data in sync. That is what the rest of this chapter is about: the patterns you use to compensate. The trade-off worth being explicit about: operational complexity goes up (more databases to back up, monitor, patch), but blast radius goes down (one database failure no longer takes down the whole platform). For small systems, this is a bad trade. For systems with more than a handful of services or teams, it is the only way to stay sane.- Node.js
- Python
users table. That table lives in a different database — physically on a different server, possibly a different engine. You have given up referential integrity at the database layer and must now enforce it at the application layer (or tolerate brief inconsistency). This is the part that makes experienced DBAs nervous about microservices, and rightly so: you have replaced a compiler-verified invariant with a runtime contract.
The trade-off: database-per-service buys you deployment independence and the ability to pick the right storage engine per workload, at the cost of losing cross-entity constraints. You compensate by making the user_id an opaque string (no FK), validating existence at the API boundary, and using events to keep denormalized copies fresh.
- Node.js
- Python
Handling Data Ownership
Once each service owns its own database, the next question is: how does the Order Service work with user data it does not own? The guiding principle is that each piece of data has exactly one service that is the source of truth. Everyone else either asks that service at read time or keeps a local, read-only projection that is kept fresh via events. The pattern below uses synchronous API calls for validation (confirm the user and products exist) and denormalization for historical accuracy (capture the product name and price into the order). Denormalization is critical here: if the product’s price changes tomorrow, the order record must still reflect the price the customer actually paid. This is an example of where consistency requirements differ between the current-state view and the historical record — the product service owns “current price,” but the order owns “price at time of purchase.” The trade-off: synchronous calls create runtime coupling (User Service outage blocks order creation) and latency (network hop on every order). For high-throughput systems, you combine this with a read-through cache or a local projection updated via events. For correctness-critical fields (shipping address, payment method), you accept the coupling because stale data would be worse than a brief outage.- Node.js
- Python
Data Consistency Strategies
Eventual Consistency
Eventual consistency means that after a write, there is a window of time where different services may see different versions of the data. This is not a bug — it is a design choice. The key question is: how long is that window, and can your business tolerate it? For most use cases (user profile updates, catalog changes, notification preferences), a few seconds of inconsistency is invisible to users. For others (inventory counts during a flash sale, account balances), even brief inconsistency can cause real problems. The pattern below shows how services propagate changes through events. In CAP terms, eventual consistency is what you get when you choose AP (availability + partition tolerance). Both sides of the network partition can accept writes; when the partition heals, the writes are reconciled. The cost is that readers may see stale data during the partition — or even outside a partition, during the natural replication lag between services. The key design lever is the “inconsistency window” — the time between a write on the source of truth and the read model catching up. For in-process event handlers, this can be milliseconds. For events routed through Kafka consumers with batching, it can be hundreds of milliseconds. For cross-region replication, it can be seconds. Always know what your window is and whether your business can tolerate it — “eventual” without a number is just wishful thinking.- Node.js
- Python
Saga Pattern for Distributed Transactions
Sagas exist because two-phase commit (2PC) does not scale across services. 2PC requires every participant to hold locks while the coordinator waits for votes — any slow service blocks everyone else, and any coordinator failure leaves participants in an in-doubt state. A saga replaces “all-or-nothing atomicity” with “all-or-eventually-compensated.” The system passes through intermediate states where some steps are done and others aren’t, and if a later step fails, earlier steps are undone by explicit compensation. The trade-off is important: a saga is not a transaction. It has no isolation. Other readers will see the intermediate states, and two concurrent sagas can interleave in ways that a real transaction would prevent. You compensate with idempotency (every step and compensation can be retried safely), semantic locks (mark the order as “reserving” so no one else touches it), and careful ordering (do the reversible steps first, commit the irreversible steps last). There are two flavors worth knowing: orchestrated sagas use a central coordinator (easy to reason about, becomes a bottleneck and SPOF), and choreographed sagas use events between peers (more scalable, harder to debug because the “flow” is spread across many services). The example below uses orchestration because it is clearer to read — but in production systems beyond 3-4 steps, I usually prefer choreography with a saga log for observability.- Node.js
- Python
Outbox Pattern
The Outbox pattern solves one of the trickiest problems in distributed systems: how do you atomically update your database AND publish an event? If you write to the database and then publish to Kafka, what happens when your process crashes between those two operations? The database has the update but the event was never sent — now downstream services are permanently out of sync. The solution is elegant: write the event to an “outbox” table in the same database transaction as your business data. A separate background process reads the outbox and publishes events to the message broker. Since the business data and the outbox entry are in the same transaction, they either both succeed or both fail. Guaranteed consistency. One subtle but important point: the outbox gives you at-least-once delivery, not exactly-once. If the publisher crashes after sending the event to Kafka but before marking the outbox row as published, the event will be re-sent. Consumers must be idempotent — typically by including a message ID that consumers track in their own “inbox” table and ignoring duplicates. This is the distributed systems version of “measure twice, cut once”: plan for replays from day one, because you will have them.- Node.js
- Python
Data Replication Strategies
Change Data Capture (CDC)
CDC is the outbox pattern’s more ambitious cousin. Instead of writing events to an application-managed outbox table, CDC reads the database’s own transaction log (Postgres WAL, MySQL binlog, MongoDB oplog) and turns every row change into an event. Tools like Debezium do this transparently — your application code just writes to the database as normal, and Debezium streams every INSERT/UPDATE/DELETE into Kafka. The big advantage over outbox: zero application code. You do not need to maintain an outbox table, write publishers, or remember to add events for new tables. The big trade-off: you are now coupled to the database’s internal representation. Column renames, schema changes, and database engine migrations all ripple out to downstream consumers. You also get every change, including ones that should not have been events (internal columns, admin corrections, migration backfills) — so consumers need to filter. The pattern shines when you are extracting from a system you do not own (legacy monolith, vendor database), or when you want to build read models from an existing transactional database without touching application code. For new systems, I usually prefer the outbox pattern because it makes event contracts explicit rather than leaking schema details.- Node.js
- Python
CQRS (Command Query Responsibility Segregation)
CQRS is the observation that reads and writes want fundamentally different shapes. Writes want normalized data (one source of truth per fact, minimal update anomalies) and transactional invariants (decrement stock only if available >= requested). Reads want denormalized data (everything for one screen in one query, no joins) and aggregations (dashboards, rollups). A single data model that is optimal for both is rare — usually it is a compromise that is good at neither. CQRS separates them explicitly. The write model (command side) owns the business rules and enforces invariants. The read model (query side) is a projection — often in a different store entirely, like Elasticsearch for search or Redis for counters — that is updated asynchronously from events. You get to scale them independently, optimize them independently, and even use different databases. The CAP trade-off is explicit here: the read model is eventually consistent with the write model. A user who places an order may not see it immediately in their order history (read model has not caught up). Most users tolerate sub-second lag. If they cannot, you either read from the write model for that specific query (bypass CQRS) or accept the complexity. The reason CQRS is not default everywhere: for simple CRUD services, it is massive over-engineering. Apply it where read patterns diverge significantly from write patterns — search, dashboards, complex reporting.- Node.js
- Python
Zero-Downtime Database Migrations
Expand-Contract Pattern
Why is a one-shotALTER TABLE ADD COLUMN ... NOT NULL dangerous in production? Because for the moment between the migration running and the new code deploying, your old application code is still running against a schema it does not know about. If the column is NOT NULL with no default, old inserts break. If the migration takes a table lock (which many legacy databases do for certain ALTER operations), every query blocks until it completes. At scale, this is an outage.
The expand-contract pattern (sometimes called “parallel change”) inverts the dependency by splitting a schema change across multiple deploys. Each deploy is backward compatible with the previous one, so at no point do you need to coordinate “database and application update together.” You add the new thing (expand), move traffic to it gradually, then remove the old thing (contract). Each phase is safe to roll back independently.
This is one of those patterns that feels like overkill when you are three engineers on a startup but becomes non-negotiable at scale. The first time you take down production with a careless ALTER, you will understand why. The trade-off is calendar time — a rename that would take 5 minutes in a monolith takes 2-4 weeks in a mature microservices system — in exchange for zero downtime and full reversibility.
- Node.js
- Python
Renaming a Column
Renaming a column is the canonical example of why you need expand-contract thinking. Renames are “free” in a monolith — one commit, one deploy — but in a system with multiple services or replicas running the same code during a rolling deploy, a rename is actively dangerous. At the moment the migration runs, half your instances know the column asname, the other half know it as full_name, and every query from the wrong half errors.
The dual-write phase (writing to both columns) may feel wasteful, but it buys you independence: you can roll back the code change without touching the database, or roll back the database change without touching the code. At scale, that optionality is worth the extra writes. The critical discipline is the soak period between phases — “deploy the dual-write, wait a week, verify all instances are on the new code, then backfill.” Rushing the phases is how you end up with rows that have full_name but not name, breaking the old readers you forgot still existed.
- Node.js
- Python
Polyglot Persistence
Choosing the Right Database
- Node.js
- Python
Interview Questions
Q1: Why database-per-service? What are the challenges?
Q1: Why database-per-service? What are the challenges?
- Loose coupling between services
- Independent scaling
- Right database for each use case
- Isolated failures
- Independent deployments
- Data consistency (no ACID across services)
- Query complexity (no JOINs across databases)
- Duplicate data management
- Operational overhead
- Saga pattern for transactions
- Event sourcing for consistency
- CQRS for complex queries
- API composition for cross-service data
Q2: How do you handle transactions across services?
Q2: How do you handle transactions across services?
- Choreography: Services react to events, publish compensating events on failure
- Orchestration: Central coordinator manages the flow
- Design idempotent operations
- Store saga state for recovery
- Implement timeouts
- Log everything for debugging
Q3: What is the Outbox pattern?
Q3: What is the Outbox pattern?
- Begin transaction
- Update business data
- Insert event to outbox table
- Commit transaction
- Background worker publishes events from outbox
- Mark events as published
- Atomic: Data and event saved together
- Reliable: Won’t lose events if message broker is down
- Idempotent: Can replay if needed
Q4: How do you perform zero-downtime database migrations?
Q4: How do you perform zero-downtime database migrations?
- Expand: Add new column (nullable)
- Deploy: Code writes to both old and new
- Migrate: Backfill existing data
- Switch: Code reads from new column
- Contract: Make non-nullable, remove old
- Rename columns
- Change column types
- Drop columns
- Add NOT NULL without default
- Backward compatible migrations
- Small batches for data migration
- Test rollback procedures
Q5: When would you use CQRS?
Q5: When would you use CQRS?
- Read and write patterns differ significantly
- Complex read queries (aggregations, joins)
- Need to scale reads independently
- Event sourcing is used
- Optimized read models (denormalized)
- Better read performance
- Simpler write logic
- Simple CRUD operations
- Strong consistency required
- Small team (adds complexity)
Chapter Summary
- Each service should own its database (no shared schemas)
- Use Saga pattern for distributed transactions
- Outbox pattern ensures reliable event publishing
- CQRS separates read and write concerns
- Zero-downtime migrations use expand-contract pattern
- Choose the right database for each use case (polyglot persistence)
Interview Deep-Dive
'You need to perform a database migration that renames a column used by 3 other services through APIs. How do you execute this with zero downtime across all services?'
'You need to perform a database migration that renames a column used by 3 other services through APIs. How do you execute this with zero downtime across all services?'
'Your Order Service needs data from the User Service to create an order. Should it call the User Service API at order creation time, or should it maintain a local copy of user data? What are the trade-offs?'
'Your Order Service needs data from the User Service to create an order. Should it call the User Service API at order creation time, or should it maintain a local copy of user data? What are the trade-offs?'
'How do you choose the right database for each microservice? Give me a concrete example where using the wrong database caused real problems.'
'How do you choose the right database for each microservice? Give me a concrete example where using the wrong database caused real problems.'
Interview Questions with Structured Answers
Finance needs a monthly reconciliation report that joins data from five microservices' databases (Orders, Payments, Inventory, Users, Refunds). Each uses a different engine. Walk me through your options and their trade-offs.
Finance needs a monthly reconciliation report that joins data from five microservices' databases (Orders, Payments, Inventory, Users, Refunds). Each uses a different engine. Walk me through your options and their trade-offs.
- Reject the obvious wrong answers first. Explicitly state that you will not query the operational databases directly across service boundaries (creates coupling, OLTP load from OLAP queries) and you will not do runtime API composition for a report that joins millions of rows (N+1, fan-out timeouts, cost).
- Identify the workload class. Monthly reconciliation is OLAP: large joins, historical data, tolerance for minutes of staleness, high query complexity, low query frequency. This is the wrong workload for any of the five operational stores.
- Propose a data warehouse approach with streaming ingestion. Stream every domain event (OrderPlaced, PaymentCaptured, InventoryDeducted, UserCreated, RefundIssued) from each service into a warehouse (BigQuery, Snowflake, or Redshift) via Kafka plus a sink connector (Kafka Connect, Debezium for CDC where event streams are not yet available). The warehouse stores denormalized, queryable, historical data.
- Address the consistency question. Reconciliation tolerates eventual consistency on the order of minutes because the report runs monthly. Acknowledge that the warehouse will lag real-time by seconds to minutes, and design a “freeze boundary” (e.g., month-end plus 24 hours) after which data is considered final for the report.
- Call out operational costs honestly. A warehouse plus streaming pipeline is expensive. For a team under 20 engineers, a simpler approach is a nightly ETL job that reads from each service’s read replica (never the primary) and populates a dedicated reporting PostgreSQL with denormalized tables. This is less real-time but dramatically cheaper to operate.
- Explain the governance layer. Whichever path you pick, the reports live in a place with its own access controls, audit logs, and PII classification. Finance queries the warehouse, not the operational services.
- “I would just have each service expose a reporting API and the report service fans out to all five.” This fails because the report joins potentially millions of rows across five services. Fan-out APIs are for serving a user’s view (tens of rows), not reconciliation. It also puts OLAP load on OLTP databases, risking production impact.
- “Give the reporting team direct read access to each service’s database.” This recreates the shared-database anti-pattern. A harmless column rename in Orders now breaks Finance’s report. Reporting gets coupled to internal schemas that were never meant to be public, and service teams lose the ability to evolve their storage without breaking downstream consumers.
- Data Mesh principles (Zhamak Dehghani, martinfowler.com) — the foundational articulation of treating data as a product.
- “Data Contracts” on Chad Sanderson’s Substack — practical patterns for making service-to-warehouse handoffs reliable.
- Airbnb Minerva blog posts (medium.com/airbnb-engineering) — a worked example of a semantic-layer-on-warehouse architecture at scale.
Your team is migrating from a shared database to database-per-service. Three services need to coordinate on a column rename that will touch all their schemas. How do you sequence this without a platform-wide freeze?
Your team is migrating from a shared database to database-per-service. Three services need to coordinate on a column rename that will touch all their schemas. How do you sequence this without a platform-wide freeze?
You are designing the Orders Service. It needs current user info for every order. Do you call the User Service synchronously, maintain a local projection via events, or store a full copy via CDC? Defend your choice.
You are designing the Orders Service. It needs current user info for every order. Do you call the User Service synchronously, maintain a local projection via events, or store a full copy via CDC? Defend your choice.
- Clarify what “needs” means. Identify the specific fields needed (name, email, shipping address) and the specific moments they are needed (cart, checkout, post-order email, dispute resolution). Different moments have different freshness requirements.
- Reject the extremes. Full synchronous coupling makes Orders depend on User availability and adds latency to every order. Full local copy via CDC replicates User’s entire schema into Orders, recreating coupling at the database level. The answer is usually in the middle.
- Propose a “slim projection” via events. Orders subscribes to UserCreated, UserUpdated, UserDeleted events and maintains a small, well-defined projection: user_id, display_name, contact_email, default_shipping_address. Projection is read-only, owned by Orders, and decoupled from User’s internal schema.
- Define freshness boundaries by field. Display_name and contact_email from the projection are fine (seconds of staleness is invisible). Shipping address at checkout is verified synchronously against User because address correctness has financial and operational consequences. Payment method tokens are never copied; they are fetched synchronously at payment time.
- Handle the bootstrap case. What if Orders sees an order for a user it has never heard of (new user, event not yet consumed)? Fallback to a synchronous API call with a circuit breaker. This handles the race between user signup and first order gracefully.
- Call out the governance implications. The projection is now a derived dataset that Orders owns. If GDPR deletion hits User, Orders must handle it (UserDeleted event triggers projection cleanup). If the projection schema evolves, it evolves independently from User’s internal schema.
- “Call the User Service API synchronously for every order. It is simpler and always fresh.” This couples Orders availability to User availability. If User is down for 10 minutes, no orders can be placed, even though 90 percent of the order creation flow does not actually need user data. Synchronous coupling also adds tens of milliseconds of latency to every order, which at scale is a non-trivial throughput hit.
- “Use CDC to replicate the entire users table into Orders database.” This replicates User’s internal schema into Orders, which means any User schema change breaks Orders. It also copies every field, including ones Orders does not need, violating least-privilege for PII. The right pattern is a domain event contract, not raw table replication.
- “Designing Data-Intensive Applications” by Martin Kleppmann, Chapter 11 (Stream Processing) — the canonical treatment of event-driven projections and their trade-offs.
- “Event-driven architecture” patterns on microservices.io by Chris Richardson — pragmatic coverage of CQRS and read-model patterns.
- “The Outbox Pattern and CDC for Reliable Event Propagation” (debezium.io blog) — concrete implementation guidance for the event pipeline that feeds projections.