Data Management Patterns
Managing data across microservices is one of the hardest challenges. This chapter covers patterns for data ownership, distributed transactions, and consistency.
- Implement database per service pattern
- Handle distributed transactions with Saga pattern
- Understand event sourcing and its trade-offs
- Implement CQRS for complex queries
- Design for eventual consistency
Database Per Service
Each microservice owns its data and exposes it only through APIs.
Why This Pattern Exists
In a monolith, every module can reach into any table with a simple JOIN. That convenience is exactly what kills scalability and team autonomy at scale. When ten teams share one database, a schema change made by one team breaks queries in services owned by three others, and nobody can deploy independently because the shared schema becomes a coordination bottleneck.
The database per service pattern draws a hard boundary: each service owns its persistence layer, period. If you want another service’s data, you call its API or consume its events. This trades query convenience (no more cross-service JOINs) for deployment independence, polyglot persistence (MongoDB for users, PostgreSQL for orders, Redis for inventory cache), and blast-radius containment when a schema migration goes wrong.
What happens if you ignore this? You get the “distributed monolith” anti-pattern: services that look independent on the surface but share a database underneath. A single bad migration locks tables across five services. A slow query in one service starves connections for all the others. You have all the complexity of microservices with none of the benefits.
The key tradeoff to watch: you lose ACID transactions across service boundaries. An order cannot atomically deduct inventory and charge the customer if those tables live in different databases. This is why the rest of this chapter exists: Saga, Outbox, Event Sourcing, and CQRS are all responses to “we broke the database apart, now how do we keep the business logic consistent?”
Implementation Example
In this example, three services each pick a database that fits their access pattern. The User Service uses MongoDB because user profiles have flexible, nested attributes (preferences, addresses, metadata) that evolve often; a schema-rigid relational store would require frequent migrations. The Order Service uses PostgreSQL because orders have strong relational structure (order-to-items), need ACID guarantees on financial data, and benefit from rich query capability. The Inventory Service combines PostgreSQL (source of truth) with Redis (hot read cache), because stock lookups happen thousands of times per second but mutations are comparatively rare.
If you used one database for all three, you would either over-engineer the simple cases (MongoDB for simple orders) or under-engineer the complex ones (SQL for rapidly-evolving user preferences). Polyglot persistence is one of the genuine wins of microservices, but only if each team is prepared to operate the database they picked.
Cross-Service Data Access
The rule is simple but frequently violated: never reach into another service’s database. The moment Service A runs a query against Service B’s tables, you have secretly coupled their schemas. Now any refactor in Service B (renaming a column, splitting a table, migrating to a different database engine) silently breaks Service A.
The correct mental model: treat every other service’s data as if it were on a different company’s servers. You would never query a third-party vendor’s database directly; you would call their API. Apply the same discipline internally. APIs are contracts; database schemas are implementation details.
The cost is real. An API call is slower than a JOIN (milliseconds vs microseconds), adds a network dependency, and requires you to think about failure modes like “what if the user service is down?” But the benefit is also real: each team can evolve its storage independently, and a bad migration in one service cannot corrupt another.
If you find yourself wanting cross-service JOINs frequently, that’s a design smell. Either your service boundaries are wrong (and the two services should merge) or you need a dedicated read model via CQRS (covered later in this chapter).
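As a minimal sketch of the rule above, the Order Service below reads user data only through a client for the User Service API and degrades gracefully when that service is down. The names UserClient, UserServiceUnavailable, and build_order_view are hypothetical, and the fetch callable stands in for a real HTTP call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class UserServiceUnavailable(Exception):
    """Raised when the (hypothetical) User Service cannot be reached."""


@dataclass
class UserSummary:
    user_id: str
    name: str


class UserClient:
    """Thin client over the User Service API; `fetch` stands in for an HTTP call."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch
        self._cache: dict[str, UserSummary] = {}

    def get_user(self, user_id: str) -> Optional[UserSummary]:
        try:
            payload = self._fetch(user_id)
        except UserServiceUnavailable:
            # Degrade gracefully: serve the last known copy, or None.
            return self._cache.get(user_id)
        summary = UserSummary(user_id=payload["id"], name=payload["name"])
        self._cache[user_id] = summary
        return summary


def build_order_view(order: dict, users: UserClient) -> dict:
    """Order Service code path: no SQL against the User Service's tables."""
    user = users.get_user(order["user_id"])
    return {
        "order_id": order["id"],
        "customer_name": user.name if user else "(unavailable)",
    }
```

The in-process cache is only one possible answer to “what if the user service is down?”; a real client would also need timeouts and cache expiry.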
Saga Pattern
Manage distributed transactions across multiple services.
Why Sagas Exist
Once you split a database per service, the classic ACID transaction dies. You cannot wrap “reserve inventory, charge payment, create order” in a single BEGIN...COMMIT because those tables live in different databases owned by different teams. The academic answer is two-phase commit (2PC), but 2PC is blocking (locks held across network round-trips), fragile (coordinator failure wedges the system), and unsupported by most modern NoSQL stores. In practice it is rarely used across microservice boundaries.
The Saga pattern replaces atomicity with a sequence of local transactions, each of which has a corresponding compensating transaction to undo it. Think of it as “eventual consistency with explicit rollback.” If step 3 of a 5-step saga fails, you run the compensations for steps 2 and 1 in reverse order. The system converges to a consistent state — just not instantly.
If you did this naively (fire and forget, no compensation), you get stuck orders: payment charged, inventory reserved, but the order record was never created because the service crashed between steps. Money disappears, stock is locked forever, customer support is on fire.
The tradeoff: sagas accept temporary inconsistency in exchange for availability and decoupling. A saga in mid-flight is a real business state — “order pending inventory” is a thing the UI must handle. This is where many teams go wrong: they build sagas but treat the intermediate states as internal implementation details rather than first-class business states visible to users and operators.
Choreography vs Orchestration: The Fundamental Choice
There are two styles: choreography (services react to events with no central brain) and orchestration (a single orchestrator calls each step in order). The choice shapes everything downstream: debuggability, coupling, observability.
Choreography feels elegant at small scale: just services reacting to events, very “decoupled.” But as the saga grows to 5+ steps, nobody can answer the question “what’s the current flow?” without tracing events across five services. You get implicit coupling (Service B must know which event Service A emits, which event to emit next, and who else might listen).
Orchestration centralizes the flow in one place: you can read the orchestrator’s code and see the full saga. The downside is the orchestrator becomes a critical service and a potential coupling point, because it needs to know about every step. For most production systems, I prefer orchestration once the saga has more than three steps, because the observability win outweighs the “central coordinator” concern.
Choreography-Based Saga
Services communicate through events without a central coordinator.
The sagaState field on the Order record is a breadcrumb: it tells any operator examining the database “what step are we on?” Without this state tracking, a failure mid-saga becomes nearly impossible to diagnose: you’d have to reconstruct the flow from distributed event logs.
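A minimal sketch of a choreography-based saga, assuming an in-memory event bus in place of a real message broker. The event names (OrderCreated, InventoryReserved, InventoryReservationFailed) and the two-service flow are illustrative assumptions, not the original implementation.

```python
from collections import defaultdict
from typing import Callable

# A tiny in-memory event bus; a real system would use a message broker.
handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
orders: dict[str, dict] = {}      # Order Service's own store
stock = {"widget": 1}             # Inventory Service's own store


def publish(event_type: str, payload: dict) -> None:
    for handler in list(handlers[event_type]):
        handler(payload)


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    handlers[event_type].append(handler)


def create_order(order_id: str, sku: str) -> None:
    # Order Service: local transaction, then emit an event.
    orders[order_id] = {"sku": sku, "sagaState": "ORDER_CREATED"}
    publish("OrderCreated", {"order_id": order_id, "sku": sku})


def on_order_created(event: dict) -> None:
    # Inventory Service: reserve stock or report failure.
    if stock.get(event["sku"], 0) > 0:
        stock[event["sku"]] -= 1
        publish("InventoryReserved", event)
    else:
        publish("InventoryReservationFailed", event)


def on_inventory_reserved(event: dict) -> None:
    # Order Service: record progress as a breadcrumb for operators.
    orders[event["order_id"]]["sagaState"] = "INVENTORY_RESERVED"


def on_inventory_failed(event: dict) -> None:
    # Compensation: the Order Service cancels its own record.
    orders[event["order_id"]]["sagaState"] = "CANCELLED"


subscribe("OrderCreated", on_order_created)
subscribe("InventoryReserved", on_inventory_reserved)
subscribe("InventoryReservationFailed", on_inventory_failed)
```

Note that no single function describes the whole flow; it only emerges from the subscriptions, which is the debuggability cost discussed earlier.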
Orchestration-Based Saga
A central orchestrator controls the saga flow. In the orchestrator pattern, the saga’s full sequence lives in one file. You can read startOrderSaga top-to-bottom and see every step, every compensation, in order. Compare that to the choreography version where the flow was spread across five event handlers in three services.
The critical design choice here is storing compensations in a stack as each step succeeds. When something fails at step 4, you pop compensations off the stack and execute them in reverse order. This mirrors how a database rollback works — undo the most recent operation first. If you compensated in forward order instead, you would leave the system in bizarre intermediate states.
Pay attention to the executeStep helper: it persists the current step and the compensation function together. If the orchestrator itself crashes mid-saga, a restart can read the state store, see “we were at step PROCESS_PAYMENT with compensations [cancelOrder, releaseInventory],” and either retry or compensate. Without this durable state, a crash in the orchestrator leaves sagas permanently stuck.
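The compensation stack described above can be sketched as follows. This is a simplified illustration: run_saga and the step tuples are hypothetical, and state_store is a plain dict standing in for a durable store. A real orchestrator would persist step names (not function objects) so that a restarted process can look up the matching compensations.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


class SagaFailed(Exception):
    pass


def run_saga(ctx: dict, steps: list[Step], state_store: dict) -> None:
    """Run steps in order; on failure, undo completed steps in reverse."""
    compensations: list[tuple[str, Callable[[dict], None]]] = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            state_store[ctx["saga_id"]] = {"state": "COMPENSATING", "failed_step": name}
            while compensations:
                _, undo = compensations.pop()   # most recent first
                undo(ctx)
            state_store[ctx["saga_id"]] = {"state": "COMPENSATED", "failed_step": name}
            raise SagaFailed(name) from exc
        compensations.append((name, compensate))
        # Persist progress so a restarted orchestrator knows where it was.
        state_store[ctx["saga_id"]] = {
            "state": "RUNNING",
            "completed": [n for n, _ in compensations],
        }
    state_store[ctx["saga_id"]] = {"state": "COMPLETED"}
```

This sketch assumes compensations themselves succeed; handling failed compensations needs retries and an escalation path.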
Saga Caveats and Interview Deep-Dive
Your distributed transaction across 3 services fails halfway -- you have no 2PC -- how do you recover?
- Stop and diagnose before acting. First question: did the partial state happen because a service call failed, or because a service succeeded but the response was lost? These look identical but have different fixes. Check idempotency: if the downstream stores an idempotency_key, was it stored? If yes, the call succeeded and you just lost the response.
- Read the persisted saga state. Your saga row tells you exactly which step was last committed. If state is INVENTORY_RESERVED but PAYMENT_CHARGED never got written, you know payment is the uncertain step.
- Query downstreams by business key, not saga memory. Ask the Payment service “do you have a charge for order X?” using the order ID as the idempotency key. If yes, advance the saga. If no, retry the payment step.
- Trigger compensation only after confirming failure. Never compensate on timeout alone; query first. Compensating a step that actually succeeded doubles the damage.
- Emit the new saga state and continue. Either roll forward (retry the failed step) or roll backward (compensate prior steps). Log the decision and the evidence that led to it.
- Surface the incident. Any saga that requires recovery should page a human after N retries. Silent self-healing is good; silent self-failing is a disaster.
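The recovery steps above can be condensed into a small decision function. This is a sketch under stated assumptions: lookup_charge is a hypothetical query against the Payment service keyed by order ID, returning "CHARGED", "DECLINED", or None when it has no record, and raising TimeoutError when the service is unreachable.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    ROLL_FORWARD = "step succeeded; advance the saga"
    RETRY_STEP = "no record downstream; retry with the same idempotency key"
    COMPENSATE = "failure confirmed; undo earlier steps"
    CHECK_AGAIN = "downstream unreachable; query again later"
    ESCALATE = "retries exhausted; page a human"


def decide(saga: dict, lookup_charge: Callable[[str], Optional[str]],
           max_attempts: int = 3) -> Decision:
    """Pick a recovery action for a saga stuck after INVENTORY_RESERVED."""
    if saga["recovery_attempts"] >= max_attempts:
        return Decision.ESCALATE
    try:
        status = lookup_charge(saga["order_id"])   # business key, not saga memory
    except TimeoutError:
        return Decision.CHECK_AGAIN                # a timeout proves nothing
    if status == "CHARGED":
        return Decision.ROLL_FORWARD
    if status == "DECLINED":
        return Decision.COMPENSATE
    return Decision.RETRY_STEP
```

Whatever the decision, the caller should log it together with the evidence (the lookup result) before acting on it.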
In short: check the downstream’s idempotency_key before deciding to compensate.
Senior Follow-up Questions:
- “What if the saga state store itself is down when you try to recover?” The saga state must be persisted in the same transaction as the business operation (the same idea that underlies the Outbox pattern). If the state store is part of the primary DB (same Postgres as orders), it is either up together or down together, which simplifies reasoning. If the state store is external, you need an escape hatch: dump saga rows to a file log every N seconds, and recover from the file if the DB is gone.
- “How do you test compensation code? It only runs on failure.” Build a chaos-testing harness that randomly aborts saga steps in a staging environment. Also: every compensation must be callable from a CLI tool that operators can invoke manually in an incident. Write integration tests that drive the saga to each state and then kill it, verifying compensation runs correctly. Teams that skip this discover compensation bugs during real outages — the worst possible time.
- “What’s your policy for compensations that fail?” Retry with exponential backoff for N attempts. After that, move the saga to a MANUAL_INTERVENTION state and alert. Compensation failures are rare but real (what if the refund API is down for 6 hours?), so the system needs a path where a human acknowledges and resolves. Pretending compensations always succeed is how teams end up with stuck funds.
Common Wrong Answers:
- “I’d wrap the whole saga in a try/catch and roll back on exception.” This reveals the candidate does not understand distributed transactions. You cannot roll back a remote service with a try/catch because the database on the other side has already committed. This is exactly the problem that sagas exist to solve.
- “Use 2PC/XA transactions instead.” Shows lack of production experience. 2PC requires distributed locks held across network round-trips, kills throughput, and is not supported by most modern data stores (DynamoDB, Cassandra, most SaaS APIs). There is a reason it is rare in production microservices.
Further Reading:
- Chris Richardson, Microservices Patterns — chapters on Saga and Outbox patterns.
- Caitie McCaffrey, “Distributed Sagas” talk (QCon) — the foundational modern treatment.
- AWS Step Functions documentation — a production-grade saga orchestrator you can learn from.
How do you make compensating transactions safe when they might run twice?
- Idempotency key per compensation call. The key is typically {saga_id}:{step_id}:compensate so replays are recognized.
- Store the effect, not just the intent. Before executing, write a record saying “compensation X is being applied.” If you crash and retry, the retry checks the record and either completes or skips based on what was already done.
- Use database constraints to prevent double-effect. A unique index on (saga_id, step_id, direction='compensate') in the audit table makes a second insert fail cleanly rather than running the compensation logic twice.
- Design compensations as state-setting, not delta-applying. Instead of “subtract 1 from inventory” (dangerous if replayed), write “set inventory reservation status = RELEASED” (safe, same result on replay).
- Acknowledge the edge case where compensation cannot be made idempotent. Some external APIs (email sends, SMS, legacy ERPs) are inherently non-idempotent. For these, accept that double-compensation may happen and either (a) add a human-approval gate or (b) log the double-effect explicitly so support can follow up.
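A minimal sketch of the unique-constraint and state-setting ideas above, using SQLite from the Python standard library. The table names (reservations, saga_audit) and the release_reservation function are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reservations (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE saga_audit (
        saga_id TEXT NOT NULL, step_id TEXT NOT NULL, direction TEXT NOT NULL,
        UNIQUE (saga_id, step_id, direction)
    );
    INSERT INTO reservations VALUES ('order-1', 'RESERVED');
""")


def release_reservation(saga_id: str, order_id: str) -> bool:
    """Compensate RESERVE_INVENTORY. Returns False if it was already applied."""
    try:
        with conn:  # one local transaction: audit row + state change
            conn.execute(
                "INSERT INTO saga_audit VALUES (?, 'RESERVE_INVENTORY', 'compensate')",
                (saga_id,),
            )
            # State-setting, not delta-applying: safe to repeat.
            conn.execute(
                "UPDATE reservations SET status = 'RELEASED' WHERE order_id = ?",
                (order_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the unique index rejected it
```

Because the audit insert and the state change share one local transaction, a crash between them leaves neither applied, and a retry starts cleanly.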
Senior Follow-up Questions:
- “How do you garbage-collect idempotency records?” A TTL of 7-14 days is a common choice: long enough to catch all realistic retries, short enough that the table does not grow unbounded. Use a separate partitioned table with time-based partitions and drop old partitions rather than deleting rows, which is much faster.
- “What if the downstream service doesn’t support idempotency keys?” Wrap the call in an idempotency proxy: a small stateful layer that you own, which stores the key and the response, and only calls the downstream on first attempt. Some API gateways apply the same idea to deduplicate requests to non-idempotent backends.
- “Idempotency sounds expensive: every call now has a DB lookup. How much does it cost?” With proper indexing, a key lookup on Postgres or Redis is typically around a millisecond or less, which is small next to the cost of the downstream call it protects. Measure it for your own workload. The real cost is developer discipline: getting idempotency right requires every endpoint author to remember to check the key. Use middleware or decorators to make it automatic.
Common Wrong Answers:
- “Our message broker guarantees exactly-once delivery, so we don’t need idempotency.” Dangerous misconception. Even Kafka’s “exactly-once semantics” only applies within Kafka. The moment you call an external service or DB, you are back to at-least-once. Idempotency is required regardless of broker.
- “We use SERIALIZABLE isolation to prevent double-apply.” Confuses local transaction isolation with distributed effects. SERIALIZABLE protects one database, not the remote service you’re compensating.
Further Reading:
- Stripe Engineering Blog, “Designing robust and predictable APIs with idempotency.”
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 8 (The Trouble with Distributed Systems).
- AWS Well-Architected: Reliability Pillar, idempotency patterns.
Event Sourcing
Store state as a sequence of events instead of current state.
Why Event Sourcing Exists
Traditional storage keeps only the current state. An order row says “status: SHIPPED, total: $99.” But how did it get there? Was it ever CANCELLED and then re-created? Did the total change after a discount? The history is lost the moment you run UPDATE.
Event sourcing flips this around: the events are the database. You never UPDATE a row — you append an event like ItemAdded, DiscountApplied, OrderShipped. The current state is a derived view computed by replaying events from the beginning. This is how your bank account works: there is no “balance” field in the ledger, just a list of deposits and withdrawals.
The wins are real and hard-won: perfect audit trail (regulatory gold for finance and healthcare), time-travel debugging (replay to any point in history), and natural fit for domains with rich state transitions (insurance claims, legal workflows, shipping logistics).
The costs are equally real. Event schema evolution is hard — events are immutable, so renaming a field means versioning the schema and writing upcasters that translate old events to the new shape. Querying is hard — “show me all orders over $100” requires replaying every event or maintaining a separate read model (see CQRS). Storage grows unbounded unless you snapshot periodically. And teams new to the pattern consistently underestimate the mental shift: you can no longer reason about “the current state of X” without thinking about which events produced it.
Do not adopt event sourcing because it sounds cool. Adopt it when the business requires full history (finance, audit-heavy domains) or when you already have many state transitions that are painful to model in CRUD. For a simple CRUD app, it is dramatic overkill.
Implementation
The Event Store enforces optimistic concurrency via the version number. Two concurrent users acting on the same aggregate will both read version=5, both try to append their changes expecting version=5, but only one of them wins. The other gets a ConcurrencyError and must re-read the events, re-apply their change, and retry. This is how you prevent the classic lost-update problem without using pessimistic locks that would kill throughput.
The Order aggregate deserves careful attention. It has two ways to get into a state: create (factory for new orders) and fromEvents (replay past events to rebuild state). Every business method — addItem, submit, ship — does two things: validates the business rule, then applies an event. The event is both the record-of-truth and the mechanism that mutates in-memory state. This dual purpose is the whole point: your business logic becomes a function from (current state, command) to events, with no hidden side effects.
If you did this differently — say, mutated state directly and then “also published an event” — you would risk divergence between state and events. Bug in the publisher? State changes, event missed. Replay from events? The rebuilt state does not match what you saw earlier. By making events the single source of truth, you eliminate that class of bug entirely.
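A minimal in-memory sketch of the two pieces described above: an event store that checks the expected version on append, and an Order aggregate whose methods validate a rule and then record an event. The class and event names mirror the prose, but the details (in-memory storage, dict-shaped events) are assumptions.

```python
from dataclasses import dataclass, field


class ConcurrencyError(Exception):
    pass


class EventStore:
    """In-memory append-only store keyed by aggregate ID."""

    def __init__(self) -> None:
        self._streams: dict[str, list[dict]] = {}

    def load(self, aggregate_id: str) -> list[dict]:
        return list(self._streams.get(aggregate_id, []))

    def append(self, aggregate_id: str, events: list[dict], expected_version: int) -> None:
        stream = self._streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}"
            )
        stream.extend(events)


@dataclass
class Order:
    order_id: str
    items: list[str] = field(default_factory=list)
    status: str = "NEW"
    version: int = 0                      # number of events already stored
    pending: list[dict] = field(default_factory=list)

    @classmethod
    def from_events(cls, order_id: str, events: list[dict]) -> "Order":
        order = cls(order_id)
        for event in events:
            order._apply(event)
            order.version += 1
        return order

    def add_item(self, sku: str) -> None:
        if self.status != "NEW":
            raise ValueError("cannot add items after submission")
        self._record({"type": "ItemAdded", "sku": sku})

    def submit(self) -> None:
        if not self.items:
            raise ValueError("cannot submit an empty order")
        self._record({"type": "OrderSubmitted"})

    def _record(self, event: dict) -> None:
        self._apply(event)          # the event is what mutates state
        self.pending.append(event)

    def _apply(self, event: dict) -> None:
        if event["type"] == "ItemAdded":
            self.items.append(event["sku"])
        elif event["type"] == "OrderSubmitted":
            self.status = "SUBMITTED"
```

A caller saves with store.append(order.order_id, order.pending, expected_version=order.version); on ConcurrencyError it reloads the stream, re-applies the command, and retries.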
Snapshotting for Performance
If your event stream grows to thousands of events per aggregate, loading it becomes painfully slow: every read replays the entire history. Snapshots are checkpoints: periodic “here is the aggregate state at version N” records. When loading, you grab the latest snapshot, then replay only the events after it.
The tradeoff: snapshots introduce a new failure mode. If your aggregate logic changes (bug fix, new business rule), an old snapshot reflects the old behavior. You must either invalidate old snapshots on code changes or be certain snapshots capture only data, never derived calculations that might evolve. A common rule: snapshot every 100 events, keep only the latest, and be prepared to rebuild snapshots from scratch when aggregate logic changes.
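A minimal sketch of snapshot-based loading, assuming a toy aggregate whose state is a running balance. The function names and the dict-shaped snapshot are illustrative; SNAPSHOT_EVERY follows the “every 100 events” rule of thumb above.

```python
from typing import Optional

SNAPSHOT_EVERY = 100  # the "every 100 events" rule of thumb from the text


def apply(balance: int, event: dict) -> int:
    return balance + event["amount"]


def load(events: list[dict], snapshot: Optional[dict]) -> tuple[int, int, int]:
    """Return (state, version, events_replayed)."""
    state, version = (snapshot["state"], snapshot["version"]) if snapshot else (0, 0)
    tail = events[version:]            # only events after the snapshot
    for event in tail:
        state = apply(state, event)
    return state, version + len(tail), len(tail)


def maybe_snapshot(state: int, version: int, snapshot: Optional[dict]) -> Optional[dict]:
    """Take a new snapshot once enough events have accumulated since the last."""
    last = snapshot["version"] if snapshot else 0
    if version - last >= SNAPSHOT_EVERY:
        return {"state": state, "version": version}   # keep only the latest
    return snapshot
```

The snapshot here stores plain data plus the version it represents, so it can always be discarded and rebuilt from the event stream.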
Event Sourcing Caveats and Interview Deep-Dive
Your event store has grown to 3 TB. Queries are slow. What do you do?
- Measure first, cut second. Which aggregates dominate the storage? Which events are rarely replayed? Put numbers on it before touching anything.
- Introduce snapshotting if not already present. Snapshots cap replay cost. If you have 1M events per aggregate but a snapshot every 1K events, you replay at most 1K events to rebuild state instead of up to 1M, roughly a 1000x reduction in replay work.
- Tier storage by age. Events older than 90 days move to cheap object storage (S3 Glacier, GCS Coldline). Online queries go against the hot tier. Regulatory or audit queries go against the cold tier (slower, fine).
- Partition by aggregate ID. Events for aggregate A never need to touch aggregate B’s partition. Sharding by aggregate key reduces the per-query working set dramatically.
- Consider projections as the primary read path. Direct event-store queries should be rare — most reads should hit denormalized projections. If you are hitting the event store for user-facing reads, that is the problem, not the size.
- Archive dead aggregates. Orders completed 5 years ago do not need online access. Migrate them out with a verified archive and restore process.
Senior Follow-up Questions:
- “How do you handle snapshot corruption?” Every snapshot stores the event version it represents. On load, you rebuild from the snapshot, then replay events after that version. If the rebuilt state from the snapshot does not match a fresh replay from event 0 (checked periodically), the snapshot is corrupt — delete it and regenerate. Snapshots are cache, not truth. The event stream is truth.
- “What if you need to query events by business attribute (e.g., find all orders with total over $1K)?” Do not query the event store for this — it is not indexed by business fields and scanning is O(N). Build a projection indexed by the attribute you need. The event store answers “give me events for this aggregate”; projections answer everything else.
- “How do you delete personal data for GDPR compliance if events are immutable?” Two options. Option A: crypto-shredding — each user’s events are encrypted with a user-specific key stored separately. To “delete” the user, destroy the key; events become unreadable noise. Option B: a pseudonymization projection that replaces personal fields with hashes during replay. Both keep the event store immutable while making the data effectively unavailable.
Common Wrong Answers:
- “Delete old events.” Event sourcing’s core invariant is immutability. Deleting events destroys the audit trail, breaks replay, and means projections can never be rebuilt. This answer disqualifies the candidate for any serious event-sourcing role.
- “Just add more storage.” Misses the point. Storage is cheap, query latency is not. The problem is working-set size, not disk capacity.
Further Reading:
- Greg Young, “Versioning in an Event Sourced System” (free PDF) — the canonical reference.
- EventStoreDB documentation, projections chapter.
- “Event Sourcing and CQRS with Axon Framework” — deep dive on snapshotting patterns.
How do you evolve an event schema without breaking historical replay?
- Never mutate existing event types. Old events on disk are frozen. If you need a new shape, introduce EventTypeV2.
- Write upcasters. A function that takes a V1 event and returns a V2 event in memory. The disk stays V1 forever; the aggregate always sees V2. This is the single most important pattern in event schema evolution.
- Treat schema changes as additive when possible. New optional fields with defaults are backward-compatible. Renames, removals, and type changes require upcasters.
- Version the event envelope, not just the payload. The envelope carries event_type, event_version, timestamp, correlation_id. Upcasters key off event_version to decide how to translate.
- Test upcasters against real historical data. A synthetic test is not enough. Replay a recent production event stream through the new upcaster chain before deploying.
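The upcaster chain described above can be sketched as a lookup table keyed by (event_type, event_version). The specific schema changes (a rename from customer_name to buyer_name, then a new currency field defaulting to "USD") are hypothetical examples.

```python
from typing import Callable


# Hypothetical V1 -> V2 change: `customer_name` was renamed to `buyer_name`.
def order_created_v1_to_v2(event: dict) -> dict:
    payload = dict(event["payload"])          # copy; stored events stay frozen
    payload["buyer_name"] = payload.pop("customer_name")
    return {**event, "event_version": 2, "payload": payload}


# Hypothetical V2 -> V3 change: a new optional field with a default.
def order_created_v2_to_v3(event: dict) -> dict:
    payload = {"currency": "USD", **event["payload"]}
    return {**event, "event_version": 3, "payload": payload}


UPCASTERS: dict[tuple[str, int], Callable[[dict], dict]] = {
    ("OrderCreated", 1): order_created_v1_to_v2,
    ("OrderCreated", 2): order_created_v2_to_v3,
}


def upcast(event: dict) -> dict:
    """Apply upcasters until the event reaches the newest known version."""
    while (key := (event["event_type"], event["event_version"])) in UPCASTERS:
        event = UPCASTERS[key](event)
    return event
```

Each upcaster returns a new dict, so the event as read from storage is never modified; the aggregate only ever sees the newest shape.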
Senior Follow-up Questions:
- “What if the upcaster chain is slow? Every replay runs all upcasters.” True, but the cost is usually small compared with reading and deserializing the event. If measurements show otherwise, collapse the chain: at snapshot time, store the snapshot in the current format, so only events since the snapshot need upcasting. This amortizes the upcaster cost.
- “How do you handle event deletion requirements (GDPR, regulation)?” Covered above: crypto-shredding or pseudonymization projections. The key insight: event sourcing does not conflict with GDPR, it just requires intentional design. Most teams bolt this on late and regret it.
- “What if you need to merge two event streams (e.g., after an account merge)?” Introduce a StreamMerged event in the target stream that references the source. Do not copy events, because copying loses identity and breaks idempotency. The target projection reads both streams and reconciles, or you build a merged projection from both sources.
Common Wrong Answers:
- “I’d run a migration script to rewrite old events.” This breaks the immutability contract and makes audit impossible. A regulator or auditor asking “what did the system record on this date?” would get a different answer depending on when you ask, which is exactly the problem event sourcing is supposed to solve.
- “Use schema registry like Avro/Protobuf.” Schema registries help with wire format but do not solve semantic evolution. Renaming customer_name to buyer_name requires reinterpreting the field, not just reformatting it. Upcasters are what do that.
Further Reading:
- Greg Young, “Versioning in an Event Sourced System.”
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 4 (Encoding and Evolution).
- “Practical Event Sourcing” blog series from the EventStore team.
CQRS Pattern
Command Query Responsibility Segregation: separate read and write models.
Why CQRS Exists
The classic relational model assumes reads and writes share a schema. But in practice, they have opposite requirements: writes need normalization (no data duplication, referential integrity), while reads want denormalized, pre-joined views optimized for specific queries. Trying to serve both from one schema means either slow reads (expensive JOINs on every query) or polluted writes (denormalized write paths that invite data drift).
CQRS splits them. Commands go to the write model (typically normalized, fully consistent, source of truth). Queries hit a read model (denormalized, often in a different database like Elasticsearch or MongoDB, updated by subscribing to write-side events). Reads and writes scale independently. You can have five read replicas feeding a search index while the write side stays on a single strongly-consistent Postgres instance.
What breaks if you skip this at scale? Think Amazon’s product page. Joining products, inventory, pricing, reviews, and recommendations on every page load would grind the database into dust. The search team needs aggregated facets. The reporting team needs OLAP-style rollups. Without separation, every team is fighting over the same tables, adding indexes that help their queries and hurt everyone else’s writes.
The tradeoff is significant: eventual consistency between write and read. A user updates their profile, hits the query API, and sees stale data for a few hundred milliseconds until the read model catches up. This feels wrong to users who expect read-your-writes consistency. You handle it by (a) accepting the staleness for queries where it is fine (search, listings), and (b) reading from the write side for queries that must see fresh data (your own profile, the order you just placed).
Do not adopt CQRS for simple CRUD apps. The overhead of maintaining two models and a projection pipeline is wasted when your read and write schemas are essentially identical.
Implementation
The implementation has three distinct pieces. Command handlers load aggregates, invoke domain logic, and persist results. Read model updaters subscribe to events and maintain the denormalized view (Elasticsearch here, but could be Redis, MongoDB, or a denormalized Postgres table). Query handlers hit only the read model; they never touch the write side.
Notice what the read model does differently from the write model. When OrderCreated arrives, it doesn’t just copy the event data. It also calls getCustomerName() to enrich the record with data from another service, pre-computes itemCount and totalAmount, and indexes by multiple fields for fast search. This is the whole value of CQRS: the read model is a view purpose-built for how users query the data, not constrained by normalization rules.
The alternative — trying to answer “show me all shipped orders in the last 30 days with their customer names and item counts” from the normalized write side — requires JOINs across multiple tables (or service calls) on every query. Fine at 100 QPS, fatal at 100,000.
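A minimal sketch of the three pieces (command handler, read model updater, query handler) using in-memory structures in place of a write database and a search index. All names are illustrative assumptions, and the projection runs synchronously here for simplicity; in a real system it would consume events asynchronously, which is where the eventual consistency comes from.

```python
from collections import defaultdict

event_log: list[dict] = []                       # write side (source of truth)
subscribers = []                                 # read-model updaters
order_views: dict[str, dict] = {}                # read side (denormalized)
views_by_status: dict[str, set[str]] = defaultdict(set)
customers = {"c1": "Ada Lovelace"}               # stand-in for the Customer Service


def get_customer_name(customer_id: str) -> str:
    return customers.get(customer_id, "unknown")


def handle_create_order(order_id: str, customer_id: str, items: list[dict]) -> None:
    """Command handler: validate, record the event, notify projections."""
    if not items:
        raise ValueError("an order needs at least one item")
    event = {"type": "OrderCreated", "order_id": order_id,
             "customer_id": customer_id, "items": items}
    event_log.append(event)
    for notify in subscribers:
        notify(event)


def project_order_created(event: dict) -> None:
    """Read-model updater: enrich and pre-compute what queries need."""
    if event["type"] != "OrderCreated":
        return
    view = {
        "order_id": event["order_id"],
        "customer_name": get_customer_name(event["customer_id"]),
        "item_count": sum(i["qty"] for i in event["items"]),
        "total_amount": sum(i["qty"] * i["price"] for i in event["items"]),
        "status": "CREATED",
    }
    order_views[view["order_id"]] = view
    views_by_status[view["status"]].add(view["order_id"])


def query_orders_by_status(status: str) -> list[dict]:
    """Query handler: reads only the read model."""
    return [order_views[oid] for oid in sorted(views_by_status[status])]


subscribers.append(project_order_created)
```

The query handler never touches event_log; everything it needs was computed once, at projection time.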
CQRS Caveats and Interview Deep-Dive
A user updates their profile but sees old data when they navigate to the profile page. How do you handle this?
Your projection handler crashes on a bad event and the read model falls 10 minutes behind. How do you recover?
- Stop the bleeding first. Pause the projection so it is not retrying the poison event 10 times per second and flooding logs.
- Route the offending event to a dead-letter queue. The projection skips it and advances to the next event. The DLQ preserves the event for investigation without blocking the stream.
- Alert the on-call. A DLQ’d event means either a projection bug or an unexpected event shape. Both require human attention.
- Investigate and fix. Either patch the projection handler to handle the event, or produce a correction event that supersedes the bad one.
- Replay from the DLQ. Once the fix is deployed, re-queue the DLQ’d events and let the projection process them.
- Verify projection integrity. After catch-up, run a reconciliation job: rebuild a sample of projection rows from the event stream and compare to the live projection. Any discrepancy is a latent bug.
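The dead-letter flow in the steps above can be sketched as follows. The project handler, the list-based DLQ, and the integer checkpoint are simplifying assumptions; a real pipeline would use broker offsets and a durable dead-letter topic or table.

```python
def project(event: dict, read_model: dict) -> None:
    """Hypothetical projection handler; raises on events it cannot handle."""
    read_model[event["order_id"]] = event["total"]   # KeyError on a bad shape


def run_projection(events: list[dict], read_model: dict, dlq: list[dict],
                   start: int = 0) -> int:
    """Process events in order; park failures in the DLQ and keep going."""
    position = start
    for event in events[start:]:
        try:
            project(event, read_model)
        except Exception as exc:
            dlq.append({"event": event, "error": repr(exc), "position": position})
        position += 1
    return position       # checkpoint to persist


def replay_dlq(dlq: list[dict], read_model: dict, handler) -> list[dict]:
    """After a fix is deployed, retry parked events; return those still failing."""
    remaining = []
    for entry in dlq:
        try:
            handler(entry["event"], read_model)
        except Exception:
            remaining.append(entry)
    return remaining
```

One caveat with this sketch: replaying a parked event after later events for the same key were already applied can overwrite newer state, so a real handler should compare versions before writing.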
Senior Follow-up Questions:
- “How do you design a handler so it does not crash on unexpected data?” Treat events like untrusted input. Validate with Pydantic / JSON Schema at ingress. If validation fails, emit a metric and skip; do not throw. Log the full event payload for forensics. The projection should be resilient by default; crashing is a last resort.
- “What if the projection falls behind because of a traffic spike, not a bug?” Separate problem, separate fix. Scale out the projection workers (partition by aggregate key so ordering is preserved within a partition). Monitor consumer lag per partition. If a single partition is hot, consider repartitioning with a better key.
- “How do you test projection recovery?” Game-day exercises. Once a quarter, intentionally break a projection in staging (corrupt an event, kill the worker, block the DB write) and time the recovery. Teams that do not practice this are surprised by how long it takes in production — when a projection 10 minutes behind snowballs into customer-facing incidents.
Common Wrong Answers:
- “Delete the bad event from Kafka.” Kafka events are immutable once committed. Even if you could delete, you’d break replay for any future projection or any existing consumer. DLQ is the right answer.
- “Rebuild the entire projection from scratch.” Works but takes hours to days for a multi-billion-event stream. DLQ + patch + replay is faster and less disruptive. Full rebuild is the last resort, not the first.
Further Reading:
- Confluent Blog, “Handling Bad Data in Kafka with Dead Letter Queues.”
- Jay Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction.”
- Uber Engineering Blog, post-mortems on streaming pipeline incidents.
Interview Questions
Q1: How do you handle distributed transactions?
- Saga Pattern: Series of local transactions with compensating actions
- Two-Phase Commit (2PC): Coordinator-based, but slow and blocking
- Outbox Pattern: Store events in same transaction as data changes
Why Sagas are usually preferred over 2PC:
- Non-blocking
- Each service has autonomy
- Scales better
- Handles failures gracefully
Q2: When would you use Event Sourcing?
- Need full audit trail
- Business requires “what happened” history
- Complex domain with many state transitions
- Need to rebuild state at any point in time
- Event-driven architecture already in place
Avoid when:
- Simple CRUD operations
- Low complexity domain
- Team unfamiliar with pattern
- Query performance is critical (use with CQRS)
Q3: How does CQRS help with scalability?
- Independent Scaling: Scale reads (often 90%+) separately from writes
- Optimized Models: Read model denormalized for query performance
- Different Databases: Use best database for each (PostgreSQL for writes, Elasticsearch for reads)
- Caching: Read model can be cached aggressively
Trade-offs:
- Eventual consistency between read/write
- More complex architecture
- Need to handle stale reads
Q4: What is the Outbox Pattern?
- In same transaction: save data AND write event to outbox table
- Separate process reads outbox and publishes to message broker
- Mark event as published after successful publish
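The three steps above can be sketched with SQLite from the Python standard library. The table layout, create_order, and relay_once are assumptions for illustration, and the publish callable stands in for a message broker client.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str, total: int) -> None:
    # Step 1: business row and event row commit (or roll back) together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id, "total": total})),
        )


def relay_once(publish) -> int:
    """Steps 2 and 3: publish pending rows in order, then mark them published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))   # may raise; row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the relay crashes after publishing but before marking the row, the event is published again on the next run, so delivery is at-least-once and consumers still need to be idempotent.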
Summary
Key Takeaways
- Each service owns its data
- Use Saga for distributed transactions
- Event Sourcing for audit/history needs
- CQRS separates read/write concerns
- Accept eventual consistency
Interview Deep-Dive
'A customer placed an order, was charged, but then inventory reservation failed. The customer sees the charge on their credit card but no order confirmation. How do you design the system to handle this?'
'Explain the Outbox Pattern. Why can you not just write to the database and publish an event in the same method?'
'When would you use CQRS, and what are the operational challenges teams underestimate?'
- Eventual consistency lag. When a seller updates a product price, there is a delay (seconds to minutes) before the read model reflects the change. Users might see stale prices. You need to decide per use case whether this is acceptable. For product pages, a few seconds of staleness is fine. For inventory counts near zero, you might need a synchronous check before allowing “Add to Cart.”
- Projection rebuild complexity. When you change the read model schema, you need to rebuild the projection from the event history. For a table with 100 million rows, this can take hours. You need a strategy for running the old and new projections in parallel during migration.
- Monitoring for drift. How do you know the read model is correct? I run a daily consistency checker that samples records from the write side and verifies they match the read side. Any drift triggers an alert and a targeted re-projection.