Design patterns are not rules to follow; they are names for solutions that experienced engineers reach for repeatedly. The value is not memorizing Factory vs Strategy, but recognizing when a problem you are facing has the same shape as a problem that has been solved before. The anti-skill is applying patterns where they do not fit: every pattern adds indirection, and indirection has a cost. Use patterns when they solve a real problem, not to prove you know them.
Real-World Stories: Patterns in the Wild
These are not hypothetical scenarios. These are billion-dollar architectural decisions that shaped the companies behind them, and the lessons apply whether you are building for ten users or ten million.
Uber: The Monolith-to-Microservices Migration (and the Pain That Came With It)
Uber started, like most startups, as a monolith. A single Python application handled dispatch, payments, rider matching, and everything else. By 2014, that monolith was under extreme strain. Deployments were terrifying: a bug in the payment code could take down the entire dispatch system. Teams stepped on each other constantly. A single database became the bottleneck as Uber expanded to hundreds of cities.
So Uber broke the monolith apart. Aggressively. By 2016, Uber was running more than a thousand microservices. The result? They gained independent deployability and team autonomy, but they also inherited a sprawling distributed system that was enormously difficult to reason about. Debugging a single rider request meant tracing calls across dozens of services. Service-to-service failures cascaded in unexpected ways. The operational overhead was staggering: each service needed its own CI/CD pipeline, monitoring, alerting, and on-call rotation.
Uber eventually invested heavily in platform infrastructure: building Jaeger for distributed tracing, adopting CQRS and event sourcing for ride-state management, and creating internal tools to manage service dependencies. The takeaway is not “microservices are bad” or “microservices are good.” It is that microservices are an organizational scaling solution, not a technical silver bullet, and that the infrastructure investment required to make them work is often underestimated by an order of magnitude.
Amazon: Two-Pizza Teams and the Service-Oriented Architecture That Changed Everything
In the early 2000s, Amazon’s codebase was a tangled monolith that engineers called “the big ball of mud.” Jeff Bezos issued what became known as the “Bezos API Mandate”: a company-wide decree that all teams must expose their data and functionality through service interfaces, that all communication must happen through these interfaces, and that there would be no exceptions. The famous “two-pizza team” rule followed: every team should be small enough to be fed by two pizzas, and every team should own a service end-to-end.
This was not a technical decision; it was an organizational one. Amazon realized that the bottleneck was not the code; it was the coordination cost between teams. By forcing service boundaries that aligned with team boundaries, they eliminated cross-team deployment dependencies. Each team could deploy independently, choose their own technology stack, and scale their service according to its specific load profile. The pattern that emerged (services owning their own data, communicating through well-defined APIs, teams organized around business capabilities) became the blueprint for what we now call microservices. But it is worth noting: Amazon had the engineering resources, the platform infrastructure, and the organizational maturity to make this work. They did not start with microservices; they evolved into them out of genuine organizational pain.
Shopify: The Modular Monolith (Why They Chose NOT to Go Microservices)
While everyone else was rushing toward microservices around 2016-2018, Shopify made a deliberate, contrarian choice: they would stay on a monolith, but make it modular. Their core application is a large Ruby on Rails monolith that powers millions of merchants. Instead of breaking it into separate services, they introduced strict internal module boundaries, enforced through a tool called Packwerk that statically analyzes dependency violations between modules.
Why? Shopify’s engineering leadership calculated the cost. They had hundreds of engineers working on the same codebase, and yes, that created friction. But the friction of a distributed system (network failures, eventual consistency, distributed tracing, service-to-service contract management) would have been worse. A modular monolith gave them the key benefit they needed (team autonomy through clear module ownership) without the operational tax of microservices.
The result has been remarkably successful. Shopify handles massive traffic spikes (Black Friday/Cyber Monday) with a monolith. They deploy multiple times per day. They have clear team boundaries. And when a module genuinely needs to be extracted as a separate service (which has happened for a few performance-critical components), the clean module boundaries make that extraction straightforward. Shopify’s story is a powerful counter-narrative to the “microservices or bust” mentality, and a strong argument for the modular monolith as a default starting point.
Stripe: The Repository Pattern at Scale for Multi-Database Support
Stripe processes billions of dollars in payments, and their data access needs are anything but simple. They use the Repository pattern extensively to abstract away the details of their storage layer. Behind a single PaymentRepository interface, Stripe’s codebase can route queries to different databases depending on the context: a primary relational database for transactional writes, a read replica for analytics queries, a separate store for compliance and audit data.
This is the Repository pattern earning its keep at scale. When Stripe needed to migrate parts of their data layer from one database technology to another, the repository abstraction meant the migration was invisible to the hundreds of engineers writing business logic. They swapped the adapter behind the interface, ran both implementations in parallel during the migration window, and cut over without changing a single line of domain code. It is a textbook example of why the “unnecessary abstraction” crowd is wrong when the problem is complex enough: the Repository pattern’s value is not in day-one simplicity, but in year-three flexibility when the storage landscape inevitably shifts under your feet.
Chapter 12: Code-Level Patterns
12.1 Strategy Pattern
Define a family of algorithms, encapsulate each, make them interchangeable. Replace if-else chains with interface implementations.
Problem it solves: You have multiple algorithms or behaviors that differ only in implementation, and selecting between them with conditional logic creates brittle, growing if-else chains that violate the Open/Closed Principle.
Real example: A payment processing service supports credit cards, PayPal, and bank transfers. Without Strategy, you get a giant if-else chain that grows with every new payment method. With Strategy, define a PaymentProcessor interface with a process(amount, details) method. Implement CreditCardProcessor, PayPalProcessor, BankTransferProcessor. The payment service receives the right processor via configuration or a factory. Adding Stripe? Add a new class. No existing code changes. The if-else chain becomes a lookup map.
When to use: Any time you have multiple algorithms or behaviors that should be selectable at runtime. Pricing strategies (flat rate, tiered, usage-based), notification channels (email, SMS, push), file export formats (CSV, JSON, PDF).
When NOT to use: When you only have two behaviors and it is unlikely a third will ever appear. A simple if-else is easier to read than an interface, two implementations, and a factory for something that will never change. Do not introduce strategy for the sake of it — wait until the if-else chain starts growing.
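A minimal Python sketch of the payment example above, assuming a `PaymentProcessor` protocol with the `process(amount, details)` method named in the text. The return strings, the keys inside `details`, and the `pay` helper are illustrative assumptions, not part of any real payment API.

```python
from decimal import Decimal
from typing import Protocol


class PaymentProcessor(Protocol):
    """Strategy interface: every payment method exposes the same call."""

    def process(self, amount: Decimal, details: dict) -> str: ...


class CreditCardProcessor:
    def process(self, amount: Decimal, details: dict) -> str:
        return f"charged {amount} to card ending {details['last4']}"


class PayPalProcessor:
    def process(self, amount: Decimal, details: dict) -> str:
        return f"charged {amount} via PayPal account {details['email']}"


class BankTransferProcessor:
    def process(self, amount: Decimal, details: dict) -> str:
        return f"requested transfer of {amount} from IBAN {details['iban']}"


# The if-else chain becomes a lookup map. Supporting a new payment
# method means adding one class and one entry here.
PROCESSORS: dict[str, PaymentProcessor] = {
    "credit_card": CreditCardProcessor(),
    "paypal": PayPalProcessor(),
    "bank_transfer": BankTransferProcessor(),
}


def pay(method: str, amount: Decimal, details: dict) -> str:
    """Select the strategy at runtime and delegate to it."""
    processor = PROCESSORS.get(method)
    if processor is None:
        raise ValueError(f"unsupported payment method: {method!r}")
    return processor.process(amount, details)
```

The caller never branches on the payment method; an unknown method fails loudly at the lookup instead of falling through a conditional.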
AI-Assisted Engineering Lens: Strategy Pattern
Work-Sample Prompt: Strategy Pattern
PricingEngine.calculateDiscount(). The switch dispatches on customerTier (free, basic, pro, enterprise, partner, educational, government). Each branch is 15-25 lines of discount logic. The PR author says “I just added the government tier, same pattern as the others.”
Your task: Write your code review comment. Should this be refactored? If yes, what is the first step? If no, why not? What questions would you ask before deciding? What is the risk of refactoring now vs. leaving it?
12.2 Repository Pattern
Abstract data access behind a collection-like interface. Decouples business logic from persistence. Enables testing with in-memory implementations.
Problem it solves: Business logic becomes tangled with database queries, making it impossible to test domain rules without standing up a real database. Changes to the persistence layer ripple through the entire codebase.
Real example: Your OrderRepository has methods like findById(id), findByCustomer(customerId), save(order), delete(id). Your business logic calls orderRepo.findByCustomer(id) without knowing whether data comes from PostgreSQL, MongoDB, or an in-memory cache. In tests, you swap in an InMemoryOrderRepository that stores orders in a simple array. No database is needed, and tests run in milliseconds.
How this plays out with specific databases: The Repository pattern’s value becomes concrete when you see how the adapter layer maps to different database engines. A PostgresOrderRepository uses SQL joins and transactions — and understanding PostgreSQL’s MVCC and indexing internals (covered in depth in the Database Deep Dives chapter) directly affects how you implement query methods. A MongoOrderRepository uses document lookups and aggregation pipelines — no joins, so the repository method for findByCustomerWithOrders() embeds related data differently. A DynamoOrderRepository must design around partition keys and access patterns (single-table design), meaning the repository interface might need to accommodate DynamoDB’s constraints. The point: the Repository interface stays the same, but the adapter implementation requires deep knowledge of the specific database’s strengths and limitations.
When to use: When business logic is complex enough to benefit from isolation from persistence. Domain-driven design projects. Any time you want fast, reliable unit tests over domain logic.
When NOT to use: When you are building a simple CRUD app where the ORM already provides a clean enough interface. Adding a repository layer on top of an ORM that already abstracts the database can be unnecessary indirection. Use it when business logic is complex enough to benefit from isolation.
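A small Python sketch of the OrderRepository example, under stated assumptions: the `Order` fields and the `customer_lifetime_value` rule are hypothetical, the method names are snake_case versions of the ones in the text, and the in-memory adapter uses a dict keyed by order ID rather than a plain array.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Order:
    id: str
    customer_id: str
    total_cents: int


class OrderRepository(Protocol):
    """Collection-like interface the business logic depends on."""

    def find_by_id(self, order_id: str) -> Optional[Order]: ...
    def find_by_customer(self, customer_id: str) -> list[Order]: ...
    def save(self, order: Order) -> None: ...
    def delete(self, order_id: str) -> None: ...


class InMemoryOrderRepository:
    """Test adapter: no database, just a dict keyed by order id."""

    def __init__(self) -> None:
        self._orders: dict[str, Order] = {}

    def find_by_id(self, order_id: str) -> Optional[Order]:
        return self._orders.get(order_id)

    def find_by_customer(self, customer_id: str) -> list[Order]:
        return [o for o in self._orders.values() if o.customer_id == customer_id]

    def save(self, order: Order) -> None:
        self._orders[order.id] = order

    def delete(self, order_id: str) -> None:
        self._orders.pop(order_id, None)


def customer_lifetime_value(repo: OrderRepository, customer_id: str) -> int:
    """Business rule that only knows the interface, not the storage engine."""
    return sum(o.total_cents for o in repo.find_by_customer(customer_id))
```

A database-backed adapter would implement the same four methods; `customer_lifetime_value` would not change.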
AI-Assisted Engineering Lens: Repository Pattern
findAll, findById, save, delete) rather than domain-meaningful operations. Always review generated repositories and ask: “Does this method name describe a domain concept or a database operation?” Rename findByStatusAndCreatedAtBefore to findOverdueOrders. The other high-value use: prompt an LLM with “generate a contract test suite that both InMemoryOrderRepository and PostgresOrderRepository must pass”. This catches fake-drift bugs before they reach production.
Work-Sample Prompt: Repository Pattern
UserRepository has 23 methods. Fourteen of them are called by exactly one service method each. Five of them are pass-throughs to the ORM with no additional logic. Three of them contain complex query logic with joins and subqueries. One of them (findUsersEligibleForAnnualReview) is used by three different service classes.
Your task: Diagnose the health of this repository. Which methods justify their existence? What would you propose to simplify it? Draft a 3-sentence Slack message to the team explaining your recommendation without making anyone defensive.
A PostgresOrderRepository needs to understand MVCC, index strategies, and transaction isolation levels. A DynamoOrderRepository needs to understand partition key design and single-table patterns. A MongoOrderRepository needs to understand document modeling and aggregation pipelines. The Database Deep Dives chapter covers these internals for PostgreSQL, MongoDB, DynamoDB, and Redis. That knowledge directly affects how you implement repository adapters for each engine.
12.3 Factory Pattern
Encapsulate object creation. When creation logic is complex or varies by context, a factory centralizes it and hides the complexity from consumers.
Problem it solves: Object creation logic is scattered across the codebase, duplicated, and inconsistent. Callers need to know too many details about which concrete class to instantiate and how to configure it.
Real example: A notification system creates different notification objects based on type and user preferences. A NotificationFactory.create(type, user) method checks the user’s preferences, the notification type, the user’s timezone, and returns the right notification object fully configured. Without the factory, this creation logic is scattered across every caller, duplicated and inconsistent.
Analogy: The Factory pattern is like ordering food at a restaurant — you say WHAT you want (“I’ll have the salmon”), not HOW to make it (source the fish, season it, heat the grill to 400 degrees, cook for 6 minutes per side). The kitchen is the factory. You get back a finished dish without knowing or caring about the creation process. If the restaurant changes suppliers or cooking techniques, your ordering experience does not change. That is exactly what a factory does for object creation — it hides the “how” and lets callers focus on the “what.”
Variations: Simple Factory (a function that returns objects), Factory Method (subclasses decide which class to instantiate), Abstract Factory (creates families of related objects). In practice, the simple factory function is what you will use 90% of the time.
When to use: When object creation involves conditional logic, configuration, or multiple steps. When you want to decouple callers from concrete class names.
When NOT to use: When construction is trivial — new Thing(x, y) is perfectly fine. A factory for a single class with a simple constructor adds indirection for no gain.
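A simple-factory sketch in Python for the notification example. The `User` fields, the two notification classes, and the notification kinds (`order_update`, `receipt`) are assumptions made up for illustration; the point is only that the conditional creation logic lives in one function.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class User:
    email: str
    phone: str
    timezone: str = "UTC"
    prefers_sms: bool = False


@dataclass(frozen=True)
class EmailNotification:
    to: str
    timezone: str


@dataclass(frozen=True)
class SmsNotification:
    to: str
    timezone: str


Notification = Union[EmailNotification, SmsNotification]


def create_notification(kind: str, user: User) -> Notification:
    """Simple factory: callers say what they want, not how to build it."""
    if kind == "order_update" and user.prefers_sms:
        return SmsNotification(to=user.phone, timezone=user.timezone)
    if kind in {"order_update", "receipt"}:
        return EmailNotification(to=user.email, timezone=user.timezone)
    # Fail loudly rather than returning None for an unknown kind.
    raise ValueError(f"unknown notification kind: {kind!r}")
```

Raising on an unknown kind is a design choice in this sketch: a factory that silently returns nothing pushes the failure to whichever caller first touches the result.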
create() grows past 5 parameters, we refactor to a builder. And I would measure: how many developers are confused by the indirection the factory introduces vs. how many benefit from not knowing the construction details? At under 10 engineers, factories for simple objects are often a net negative on readability.”
AI-Assisted Engineering Lens: Factory Pattern
NotificationFactory even for two notification types. This is the LLM’s training bias toward “enterprise patterns.” The productive use: when you genuinely need a factory (conditional creation, environment-based configuration), prompting with “generate a factory for these 5 types with this configuration schema” saves 20 minutes of boilerplate. The high-value LLM assist: use Copilot to generate exhaustive test cases for factory edge cases, such as “what happens when the config specifies an unknown type? A null type? A type with missing required fields?” Edge-case test generation is where LLMs shine for factory code.
Work-Sample Prompt: Factory Pattern
NotificationFactory.create() returned null for type 'whatsapp'. A product team deployed a new WhatsApp notification feature at 6 PM, but they forgot to register the new type in the factory’s lookup map. The feature passed all unit tests because the tests instantiate WhatsAppNotification directly.
Your task: (1) What is your immediate fix to stop the errors? (2) What systemic change do you propose to prevent this class of bug? (3) How would you modify the CI pipeline or the factory’s design so that an unregistered type fails at build time, not at 2 AM?
12.4 Decorator Pattern
Add behavior to objects dynamically without modifying the original. Wrap a logging decorator around a repository to add logging without changing the repository.
Problem it solves: You need to add cross-cutting behavior (logging, caching, metrics, retries) to existing objects without modifying their source code or creating an explosion of subclass combinations.
Real example: You have a UserRepository that fetches users from the database. You need logging, caching, and metrics. Instead of modifying UserRepository, create wrappers: LoggingUserRepository wraps UserRepository and logs every call. CachingUserRepository wraps that and checks Redis before hitting the database. MetricsUserRepository wraps that and records timing. Each layer is independent, testable, and removable. The calling code sees the same interface.
In modern code: Decorators appear as middleware (Express, Koa), Python decorators (@cache, @retry), and higher-order functions. The pattern is everywhere even when not called by name.
When to use: When you need to compose behaviors around an object and want each behavior to be independently addable and removable. Middleware stacks, cross-cutting concerns, feature toggles.
When NOT to use: When deep nesting of decorators makes debugging a nightmare. If you find yourself wrapping 5+ layers deep and losing track of which decorator is responsible for which behavior, consider a different approach (like aspect-oriented programming or a pipeline pattern).
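The wrapper stack from the UserRepository example could be sketched in Python as below. Two simplifications are assumptions of this sketch: the cache is a plain dict standing in for Redis, and `DbUserRepository` fakes the database with a call counter so the effect of each layer is visible.

```python
from typing import Protocol


class UserRepository(Protocol):
    def find_by_id(self, user_id: int) -> dict: ...


class DbUserRepository:
    """Stand-in for the real database-backed repository."""

    def __init__(self) -> None:
        self.calls = 0

    def find_by_id(self, user_id: int) -> dict:
        self.calls += 1
        return {"id": user_id, "name": f"user-{user_id}"}


class LoggingUserRepository:
    """Decorator: same interface, records every call that reaches it."""

    def __init__(self, inner: UserRepository, log: list[str]) -> None:
        self._inner = inner
        self._log = log

    def find_by_id(self, user_id: int) -> dict:
        self._log.append(f"find_by_id({user_id})")
        return self._inner.find_by_id(user_id)


class CachingUserRepository:
    """Decorator: same interface, answers repeat lookups from a cache."""

    def __init__(self, inner: UserRepository) -> None:
        self._inner = inner
        self._cache: dict[int, dict] = {}  # a dict here; Redis in the text

    def find_by_id(self, user_id: int) -> dict:
        if user_id not in self._cache:
            self._cache[user_id] = self._inner.find_by_id(user_id)
        return self._cache[user_id]
```

Composition order matters: with `CachingUserRepository(LoggingUserRepository(db, log))`, a cache hit never reaches the logging layer or the database. Swapping the two wrappers would log every call, including cache hits.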
12.5 Observer Pattern
When one object changes, all dependents are notified. Foundation of event-driven programming. Used in UI frameworks, pub/sub systems, and reactive programming.
Problem it solves: An object needs to notify an unknown, extensible set of other objects when its state changes, without being tightly coupled to them.
Real example: An e-commerce system publishes an OrderPlaced event. The inventory service listens and reserves stock. The notification service listens and sends a confirmation email. The analytics service listens and updates dashboards. The order service does not know about any of these; it just publishes. Adding a loyalty points service means adding a new listener, not modifying the order service.
Trade-off: Loose coupling is the benefit. The cost is that the system’s behavior becomes harder to trace — “what happens when an order is placed?” requires checking all subscribers. Debugging a chain of events is harder than debugging a direct function call. Use event catalogs and tracing to manage this complexity.
When to use: When the set of “things that should react” will grow over time. UI state management, domain events, pub/sub messaging, webhook systems.
When NOT to use: When only one or two things need to react and the set is stable. A direct function call is simpler, more explicit, and easier to debug. Also avoid when ordering of notifications matters critically — observer does not guarantee execution order across listeners.
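As an in-process illustration of the OrderPlaced example, here is a minimal Python sketch. The `EventBus` class, the handler names, and the event payload fields are all hypothetical. This toy bus happens to call handlers in subscription order, which a real message broker would not promise across independent consumers.

```python
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Tiny synchronous publish/subscribe hub."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers.setdefault(event_name, []).append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        # The publisher does not know who, if anyone, is listening.
        for handler in list(self._handlers.get(event_name, [])):
            handler(payload)


bus = EventBus()
reserved_stock: list[str] = []
confirmation_emails: list[str] = []


def reserve_stock(event: dict) -> None:
    reserved_stock.append(event["order_id"])


def send_confirmation(event: dict) -> None:
    confirmation_emails.append(event["customer_email"])


bus.subscribe("OrderPlaced", reserve_stock)
bus.subscribe("OrderPlaced", send_confirmation)

# The order side only publishes; adding a loyalty-points listener
# later would be one more subscribe() call and no change here.
bus.publish("OrderPlaced", {"order_id": "o-1", "customer_email": "ada@example.com"})
```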
12.6 Adapter Pattern
Convert one interface to another. Wrap a third-party library so your code depends on your interface, not theirs. Essential for third-party dependency isolation.
Problem it solves: Your code needs to work with a class or API whose interface does not match what your code expects. Or you want to insulate your codebase from third-party API changes and vendor lock-in.
Real example: Your application uses Stripe for payments. Instead of calling Stripe’s SDK directly throughout your code, create a PaymentGateway interface that your code uses, and a StripePaymentGateway adapter that translates your interface calls into Stripe SDK calls. When the business decides to also support Adyen, you write an AdyenPaymentGateway adapter. Your application code does not change. When Stripe releases a breaking API change, only the adapter changes.
When it matters most: Third-party APIs (payment, email, SMS, cloud storage), legacy system integration, and any dependency you might need to swap. The adapter is your insulation layer.
When to use: Integrating with external services, wrapping legacy APIs, bridging incompatible interfaces during migrations.
When NOT to use: When you are wrapping an internal class you control. If you own both sides, just change the interface directly. Adapters for internal code add indirection without the vendor-isolation benefit.
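To show the translation step without misrepresenting a real SDK, the sketch below adapts an invented vendor client, `AcmePayClient`. Its `create_payment` method, parameter names, and response shape are hypothetical and do not correspond to Stripe's or Adyen's actual APIs. The `PaymentGateway` name comes from the text; its `charge` signature is an assumption.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The interface our application code depends on."""

    def charge(self, amount_cents: int, currency: str, token: str) -> str: ...


class AcmePayClient:
    """Hypothetical vendor SDK with its own vocabulary and formats."""

    def __init__(self) -> None:
        self.last_params: dict = {}

    def create_payment(self, params: dict) -> dict:
        self.last_params = params
        return {"payment_id": f"acme_{params['source']}", "state": "OK"}


class PaymentDeclined(Exception):
    pass


class AcmePaymentGateway:
    """Adapter: translates our calls into the vendor's calls and back."""

    def __init__(self, client: AcmePayClient) -> None:
        self._client = client

    def charge(self, amount_cents: int, currency: str, token: str) -> str:
        response = self._client.create_payment({
            # The invented vendor wants a decimal string, not integer cents.
            "amount": f"{amount_cents // 100}.{amount_cents % 100:02d}",
            "currency_code": currency.upper(),
            "source": token,
        })
        if response["state"] != "OK":
            raise PaymentDeclined(response["state"])
        return response["payment_id"]


def checkout(gateway: PaymentGateway, amount_cents: int, token: str) -> str:
    """Application code: knows PaymentGateway, never the vendor SDK."""
    return gateway.charge(amount_cents, "usd", token)
```

If the vendor changed its parameter names, only `AcmePaymentGateway.charge` would need editing; `checkout` and everything above it would stay as they are.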
Chapter 13: Architectural Patterns
13.1 Layered Architecture
Organize code into layers: Presentation → Business Logic → Data Access. Each layer only talks to the one below. Simple, well-understood, but can lead to unnecessary indirection for simple operations.
Problem it solves: Without layering, presentation code directly queries databases, business rules live in UI handlers, and everything is tangled together. Changes in one area cascade unpredictably.
When it works well: Most CRUD applications, team-based development where different teams own different layers, applications where the business logic is the most complex part.
When it breaks down: When a simple “get user by ID” requires passing through 4 layers of indirection. When cross-cutting concerns (logging, auth, validation) do not fit neatly into one layer. When the “business logic” layer becomes a thin pass-through that just calls the data layer.
When NOT to use: Highly event-driven systems, real-time streaming applications, or anything where the rigid top-to-bottom flow does not match the actual data flow of the system.
13.2 Hexagonal Architecture (Ports and Adapters)
Business logic at the center, surrounded by ports (interfaces) and adapters (implementations). The core has no dependency on infrastructure: databases, APIs, and UIs are all adapters plugged in from outside. Makes the core independently testable.
Problem it solves: In layered architecture, business logic often leaks into infrastructure concerns and vice versa. Testing business rules requires spinning up databases, HTTP servers, and message brokers. Hexagonal architecture enforces a hard boundary: the core is pure logic, everything else is pluggable.
How it works (Ports and Adapters explained):
- Ports are interfaces defined by the core. They represent what the core needs from the outside world (driven ports, e.g., OrderRepository, PaymentGateway) or what the outside world can ask of the core (driving ports, e.g., PlaceOrderUseCase).
- Adapters are implementations that connect ports to real infrastructure. A PostgresOrderRepository adapter implements the OrderRepository port. An ExpressHttpAdapter adapter calls the PlaceOrderUseCase port when an HTTP request arrives.
- The dependency rule: Adapters depend on ports. The core depends on nothing external. Dependencies always point inward.
The core contains OrderService, PricingEngine, and domain models: pure business logic with no imports from frameworks, databases, or HTTP libraries. Ports define interfaces: OrderRepository (port for data access), PaymentGateway (port for payments), NotificationSender (port for notifications). Adapters implement those ports: PostgresOrderRepository, StripePaymentGateway, SendGridNotificationSender. In tests, swap in InMemoryOrderRepository, FakePaymentGateway. The core is 100% testable without any infrastructure.
Why it matters for testability: Because the core has zero infrastructure dependencies, you can test all business rules with fast, in-memory fakes. No database containers, no network mocks, no flaky integration tests for logic validation. Integration tests only need to verify that adapters correctly translate between the port interface and the real infrastructure — a much smaller, more focused surface area.
When to use: When business logic is complex and you need fast, reliable tests. When you expect to swap infrastructure (migrate databases, change cloud providers, replace third-party services). Domain-driven design projects.
When NOT to use: Simple CRUD apps with minimal business logic. If your “business logic” is just “take the request, validate it, save it to the database, return it,” hexagonal architecture adds ceremony without proportional benefit.
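A compact Python sketch of one driving port and two driven ports, tested with in-memory fakes. The port names follow the text; the `Order` fields, the `execute` signature, the status strings, and the `PlaceOrderService` class are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Order:
    customer_id: str
    amount_cents: int
    status: str


# Driven ports: what the core needs from the outside world.
class OrderRepository(Protocol):
    def save(self, order: Order) -> None: ...


class PaymentGateway(Protocol):
    def charge(self, customer_id: str, amount_cents: int) -> bool: ...


# Driving port: what the outside world can ask of the core.
class PlaceOrderUseCase(Protocol):
    def execute(self, customer_id: str, amount_cents: int) -> Order: ...


class PlaceOrderService:
    """Core logic. Imports nothing from frameworks, databases, or HTTP."""

    def __init__(self, orders: OrderRepository, payments: PaymentGateway) -> None:
        self._orders = orders
        self._payments = payments

    def execute(self, customer_id: str, amount_cents: int) -> Order:
        if amount_cents <= 0:
            raise ValueError("order amount must be positive")
        paid = self._payments.charge(customer_id, amount_cents)
        order = Order(customer_id, amount_cents, "paid" if paid else "payment_failed")
        self._orders.save(order)
        return order


# Test adapters: plugged in from outside, no infrastructure involved.
class InMemoryOrderRepository:
    def __init__(self) -> None:
        self.saved: list[Order] = []

    def save(self, order: Order) -> None:
        self.saved.append(order)


class FakePaymentGateway:
    def __init__(self, succeed: bool = True) -> None:
        self._succeed = succeed

    def charge(self, customer_id: str, amount_cents: int) -> bool:
        return self._succeed
```

An HTTP controller, a CLI command, or an event consumer would each be a driving adapter that builds a `PlaceOrderService` with real adapters and calls `execute`.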
When to Migrate to Hexagonal Architecture
If you are on a Layered Architecture and feeling pain, here is a decision framework and step-by-step migration path.
Trigger signals (migrate when you see two or more of these):
- Unit tests require a running database, HTTP server, or message broker to verify business rules.
- A framework upgrade or swap (Express to Fastify, Django to FastAPI) would require rewriting business logic.
- Domain logic is scattered across controllers, middleware, and data access code, with no single place to understand “the rules.”
- Multiple teams need to integrate with the same business logic through different interfaces (REST API, CLI, event consumer, background job).
- Third-party service swaps (Stripe to Adyen, SendGrid to SES) require changes in dozens of files.
Migration steps:
- Identify the core domain logic. Look at your service/business-logic layer. Which parts are pure rules (pricing calculations, eligibility checks, state transitions) and which parts are infrastructure calls (database queries, HTTP calls, message publishing)? List them separately.
- Define your first port. Pick the most painful infrastructure dependency, usually the database. Create an interface (OrderRepository) that describes what your business logic needs from persistence, using domain language (findOverdueOrders(), not SELECT * WHERE...).
- Extract the adapter. Move the existing database code into a class that implements the port (PostgresOrderRepository implements OrderRepository). Your business logic now depends on the interface, not the implementation.
- Write an in-memory adapter. Create InMemoryOrderRepository that stores data in a hash map. Use it in tests. If your business logic works with both adapters, the port boundary is clean.
- Repeat for external services. Define ports for payment gateways, notification senders, external APIs. Extract adapters for each. This is where the Adapter pattern (Section 12.6) scales from a code-level concept to an architectural principle.
- Define driving ports. Create use case interfaces (PlaceOrderUseCase, CancelOrderUseCase) that represent what the outside world can ask of your core. Your HTTP controllers, CLI handlers, and event consumers all become adapters that call these ports.
- Enforce the dependency rule. The core module should have zero imports from infrastructure packages. Verify with static analysis or build-tool module boundaries. If the core imports pg, express, or stripe, the boundary has leaked.
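The last step, checking that the core imports nothing from infrastructure, can be approximated with a few lines of static analysis. The sketch below uses Python's standard `ast` module on source text. The banned set (`psycopg2`, `flask`, `stripe`) is an assumed Python analogue of the `pg`, `express`, and `stripe` packages named above, and `forbidden_imports` is a hypothetical helper, not an existing tool.

```python
import ast


def forbidden_imports(source: str, banned: set[str]) -> set[str]:
    """Return the banned top-level packages imported by `source`.

    Parses the code without executing it, so the packages do not
    need to be installed for the check to run.
    """
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports ("from . import x") have module=None.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in banned:
                found.add(top_level)
    return found


# Assumed infrastructure packages for a Python codebase.
BANNED = {"psycopg2", "flask", "stripe"}

clean_core = "from dataclasses import dataclass\nimport decimal\n"
leaky_core = (
    "import stripe\n"
    "from psycopg2.extras import RealDictCursor\n"
    "from . import models\n"
)
```

In a real project this function would be run over every file in the core package as part of CI, failing the build when the returned set is non-empty. Dedicated import-boundary linters do the same job more thoroughly.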
13.3 Clean Architecture
Similar to hexagonal: dependencies point inward. Entities at the center, use cases around them, interface adapters and frameworks on the outside. The dependency rule: inner circles know nothing about outer circles.
The practical difference from hexagonal: Clean Architecture prescribes more specific layers (entities, use cases, interface adapters, frameworks) while hexagonal is more flexible with just “inside” and “outside.” In practice, most teams use a hybrid. The key principle is the same: business logic has zero dependencies on infrastructure.
When to Migrate to Clean Architecture
Clean Architecture makes the most sense when you already buy the Hexagonal premise (business logic at the center, dependencies pointing inward) but need more prescriptive structure because developers on the team keep asking “where does this go?”
Migrate from Hexagonal to Clean Architecture when:
- The team has grown beyond 5-6 engineers and the flexible “inside vs outside” boundary of Hexagonal leads to inconsistent placement of code: one developer puts validation in the adapter, another puts it in the core, a third creates a new folder nobody else uses.
- You have complex use cases that deserve their own explicit layer. If your core logic has both stable domain entities (a Money value object that rarely changes) and volatile use cases (the checkout flow that changes every sprint), separating them into distinct rings keeps churn in the use cases from touching the entities.
- You need onboarding velocity. Clean Architecture’s named layers (Entities → Use Cases → Interface Adapters → Frameworks & Drivers) give new team members a concrete mental map. “Your new feature is a use case. It goes in this folder, depends on entities, and is called by interface adapters.”
Migration steps:
- Split your Hexagonal “core” into two layers: Entities (domain models, value objects, business rules that change rarely) and Use Cases (application-specific orchestration that changes with features).
- Rename your “adapters” to Interface Adapters (controllers, presenters, gateways) and explicitly separate the Frameworks & Drivers layer (the actual HTTP framework, ORM library, message broker client).
- Enforce that Use Cases depend only on Entities and port interfaces, never on Interface Adapters or Frameworks.
13.4 Event-Driven Architecture (EDA)
Systems structured around events rather than direct calls. Services publish events (OrderPlaced), others subscribe and react. The producer does not know or care who is listening.
Problem it solves: Tight coupling between services. In a synchronous world, the order service must know about the inventory service, the notification service, and the analytics service, and call each of them. Adding a new reaction means modifying the order service. EDA inverts this.
Why EDA is powerful: Adding a new reaction (send a loyalty points email when an order is placed) means adding a new consumer, with zero changes to the order service. Services are independently deployable and scalable. Temporal decoupling: the consumer can be down temporarily and process events when it recovers.
Trade-offs: Eventual consistency (the email is not sent at the same instant the order is placed; it is sent seconds later). Harder debugging (a user request triggers a chain of events across 5 services, and you need distributed tracing to follow the flow). Event ordering challenges (if OrderPlaced arrives after OrderShipped, your consumer logic must handle out-of-order events). Duplicate handling required (at-least-once delivery means every consumer must be idempotent).
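The last two trade-offs, out-of-order events and duplicates, can be handled in the consumer. The Python sketch below assumes every event carries a unique `event_id` and a per-order `version` that increases with each state change; both fields and the `OrderStatusConsumer` class are hypothetical. A production consumer would keep the processed IDs in durable storage rather than in memory.

```python
class OrderStatusConsumer:
    """Tracks the latest known status of each order."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()
        # order_id -> (version, event type)
        self.status: dict[str, tuple[int, str]] = {}

    def handle(self, event: dict) -> bool:
        """Apply an event. Returns False if it was a duplicate delivery."""
        # Idempotency: at-least-once delivery may hand us the same event twice.
        if event["event_id"] in self.processed_ids:
            return False
        self.processed_ids.add(event["event_id"])

        # Ordering: an older event arriving late must not overwrite newer state.
        current = self.status.get(event["order_id"])
        if current is None or event["version"] > current[0]:
            self.status[event["order_id"]] = (event["version"], event["type"])
        return True


consumer = OrderStatusConsumer()
placed = {"event_id": "e1", "order_id": "o-1", "version": 1, "type": "OrderPlaced"}
shipped = {"event_id": "e2", "order_id": "o-1", "version": 2, "type": "OrderShipped"}

# Delivered out of order, with the shipped event delivered twice.
results = [consumer.handle(e) for e in (shipped, placed, shipped)]
```

After these three deliveries the order is still recorded as shipped: the late OrderPlaced was accepted but did not overwrite newer state, and the repeated OrderShipped was dropped.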
When to Migrate to Event-Driven Architecture
Migrating from synchronous request-response to EDA is one of the most impactful, and most dangerous, architectural shifts a team can make. Do not do it all at once. Here is the incremental path.
Trigger signals (migrate when you see two or more of these):
- A producing service directly calls 4+ downstream services after each state change, and the list grows with every feature.
- The producing service’s latency includes the sum of all downstream call latencies — users are waiting for emails to send, analytics to log, and reports to generate before seeing a response.
- A downstream service being slow or down causes the entire upstream flow to fail, even when the downstream work is not essential to the user’s request.
- Different downstream services have dramatically different scaling needs (the email service handles 100 req/s, the analytics service handles 10,000 req/s).
- Adding a new “reaction” to a business event requires modifying the producing service — violating the Open/Closed Principle at the system level.
Migration steps:
- Identify fan-out points. Find the methods where a service calls multiple downstream services after a state change. These are your migration candidates. Rank them by the number of downstream calls and the pain of adding new ones.
- Introduce a message broker. Deploy Kafka, RabbitMQ, or your cloud provider’s equivalent (SNS/SQS, Cloud Pub/Sub, Event Hubs). Do not build your own. For most teams starting out, a managed service (Amazon SQS, Google Cloud Pub/Sub) minimizes operational overhead.
- Migrate one consumer at a time. Pick the least critical downstream call (analytics tracking is a great first candidate — if it fails, nobody notices immediately). Replace the synchronous call with an event publication + an event consumer. Keep all other downstream calls synchronous. Deploy. Monitor. Gain confidence.
- Use the Outbox Pattern (Section 14.4) from day one. Do not publish events directly after a database write — use the outbox table to guarantee atomicity between the data change and the event publication. This saves you from the “lost event” bugs that plague naive EDA migrations.
- Add idempotency to every consumer. At-least-once delivery is the norm. Every consumer must handle duplicate events gracefully — use idempotency keys, deduplication tables, or naturally idempotent operations.
- Migrate remaining consumers one by one. After each migration, verify that the producing service’s latency has decreased (because it no longer waits for that consumer) and that the consumer handles failures independently.
- Invest in observability immediately. Without distributed tracing and correlation IDs, an event-driven system is a black box. Set up tracing (Jaeger, Zipkin, AWS X-Ray) before the second consumer migration, not after the fifth when debugging becomes impossible. See the Observability chapter for correlation ID strategies across event chains.
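The outbox and idempotent-consumer steps above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions: the `orders` and `outbox` table layouts, `place_order`, `relay_once`, and `analytics_consumer` are all hypothetical, and a production relay would publish to a real broker rather than call a Python function.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_name TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: str, total_cents: int) -> None:
    # One local transaction: the order row and the outbox row commit together
    # or not at all, so an event can never be "lost" after a successful write.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)", (order_id, total_cents)
        )
        conn.execute(
            "INSERT INTO outbox (event_name, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(publish) -> int:
    """Publish pending outbox rows, then mark them as published.

    A crash between publishing and marking causes a re-publish on the next
    run, which is why the consumer below has to be idempotent.
    """
    rows = conn.execute(
        "SELECT id, event_name, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, name, payload in rows:
        publish(row_id, name, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


processed_ids: set[int] = set()
analytics_events: list[dict] = []


def analytics_consumer(event_id: int, name: str, payload: dict) -> None:
    if event_id in processed_ids:      # duplicate delivery: skip
        return
    processed_ids.add(event_id)
    analytics_events.append(payload)


place_order("o-1", 5000)
first_run = relay_once(analytics_consumer)    # publishes the pending event
second_run = relay_once(analytics_consumer)   # nothing left to publish
```

Polling is only one way to drive the relay; the same outbox table can instead be tailed with Change Data Capture, at the cost of more setup.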
13.5 CQRS (Command Query Responsibility Segregation)
Separate write model (optimized for consistency and business rules) from read model (optimized for query performance, denormalized). Scale reads and writes independently.
Problem it solves: A single data model cannot be optimal for both writing (normalized, constrained, consistent) and reading (denormalized, fast, shaped for the UI). When read and write loads differ dramatically (most apps are read-heavy), a unified model forces you to compromise on both.
How the read model gets populated: The write side persists data and publishes an event (or uses Change Data Capture). An event handler or projection builder listens for changes and updates the read model. The read model is a denormalized, query-optimized view — it may be in a different database (write side in PostgreSQL, read side in Elasticsearch for full-text search).
The consistency window: After a write, the read model is stale until the projection catches up. This is usually milliseconds to seconds. Handle it in the UI: after a user creates an item, redirect them to the item using data from the write response (not the read model). Or use “read your own writes” — route the writing user’s reads to the primary for a short period.
When CQRS without event sourcing is the right call: Most of the time. If you just need separate read and write models (e.g., normalized writes to PostgreSQL, denormalized reads from Redis or Elasticsearch), you do not need the complexity of event sourcing. CQRS + a simple CDC or event-publish-on-write is sufficient.
13.6 Event Sourcing
Store the full history of state changes as events rather than just current state. Instead of storing “Order #123: status=shipped, total=$50”, store the sequence of events that produced it: an order-creation event (total $50) → ItemAdded(Widget) → PaymentReceived($50) → OrderShipped.
Problem it solves: Traditional state-based persistence throws away history. You know the current state but not how you got there. In domains where the “how” matters (finance, compliance, audit), this is a critical gap.
How event replay works: To get the current state of an entity, read all its events from the event store (an append-only, ordered stream per aggregate) and replay them in order. Each event applies a state change. After replaying all events, you have the current state. This is powerful but slow for entities with thousands of events.
Snapshots: To avoid replaying thousands of events on every read, periodically save a snapshot (the materialized state at a point in time). Then replay only events after the snapshot. Snapshot every N events (e.g., every 100) or on a schedule.
Projections (read models): Event handlers that listen to the event stream and build query-optimized views. A “daily revenue” projection listens for PaymentReceived events and updates a running total. You can build new projections retroactively by replaying historical events — this is one of event sourcing’s strongest benefits.
When event sourcing is genuinely the right choice: Audit-heavy domains (finance, healthcare, legal) where you must prove what happened and when. Systems where the history itself is valuable (undo/redo, temporal queries). Systems where you need to build new read models from historical data.
When it is over-engineering: CRUD applications, simple data management, when you just need an audit log (use a changes table instead).
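Replay, snapshots, and projections as described above can be sketched as follows. This is an illustrative sketch, not a production event store: the `Event` and `OrderState` shapes, the event payload fields, and the function names are assumptions, and the stream is an in-memory list rather than an append-only store.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str
    data: dict = field(default_factory=dict)


@dataclass
class OrderState:
    status: str = "new"
    items: list = field(default_factory=list)
    paid_cents: int = 0
    version: int = 0   # number of events applied so far


def apply(state: OrderState, event: Event) -> OrderState:
    """Pure transition: current state plus one event gives the next state."""
    status, items, paid = state.status, list(state.items), state.paid_cents
    if event.kind == "ItemAdded":
        items.append(event.data["item"])
    elif event.kind == "PaymentReceived":
        paid += event.data["amount_cents"]
        status = "paid"
    elif event.kind == "OrderShipped":
        status = "shipped"
    return OrderState(status, items, paid, state.version + 1)


def replay(events: list, snapshot: Optional[OrderState] = None) -> OrderState:
    """Rebuild state; with a snapshot, only the events after it are applied."""
    state = snapshot if snapshot is not None else OrderState()
    for event in events[state.version:]:
        state = apply(state, event)
    return state


def revenue_cents(events: list) -> int:
    """A projection: a read model derived purely from the event history."""
    return sum(e.data["amount_cents"] for e in events if e.kind == "PaymentReceived")


stream = [
    Event("ItemAdded", {"item": "Widget"}),
    Event("PaymentReceived", {"amount_cents": 5000}),
    Event("OrderShipped"),
]

current = replay(stream)                    # full replay from the first event
snapshot = replay(stream[:2])               # state saved after two events
from_snapshot = replay(stream, snapshot)    # replays only the third event
revenue = revenue_cents(stream)
```

Because `revenue_cents` reads only the history, it could be written long after the events were recorded and still produce a correct total, which is the retroactive-projection benefit in miniature.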
Interview question: When would you choose event-driven architecture over a synchronous request-response model?
- Martin Fowler — What do you mean by “Event-Driven”? — Fowler’s clarification of event notification vs event-carried state transfer vs event sourcing.
- Confluent Blog — Event-Driven Architecture — covers EDA patterns with Kafka specifics.
Interview question: Explain CQRS and when you would — and would NOT — use it.
- Martin Fowler — CQRS — concise definition with Fowler’s honest warnings about misuse.
- Microsoft — CQRS Pattern — trade-off analysis and when to avoid.
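As a starting point for answering the CQRS question, here is a minimal sketch of separate write and read models with a projection in between, and no event sourcing. Everything in it is an assumption for illustration: plain dictionaries stand in for the two databases, `pending_changes` stands in for CDC or published events, and the function names are hypothetical.

```python
# Write side: normalized data, guarded by business rules.
customers = {"c-1": {"name": "Ada"}}
orders: dict[str, dict] = {}

# Read side: denormalized rows shaped for an "order list" screen.
order_list_view: dict[str, dict] = {}

# Stand-in for CDC or an event published on write (assumption).
pending_changes: list[dict] = []


def place_order(order_id: str, customer_id: str, total_cents: int) -> dict:
    """Command: validates, writes to the write model, records the change."""
    if total_cents <= 0:
        raise ValueError("total must be positive")
    if customer_id not in customers:
        raise ValueError("unknown customer")
    orders[order_id] = {"customer_id": customer_id, "total_cents": total_cents}
    pending_changes.append({"type": "OrderPlaced", "order_id": order_id})
    return orders[order_id]


def run_projection() -> None:
    """Projection builder: turns recorded changes into read-model rows."""
    while pending_changes:
        change = pending_changes.pop(0)
        order = orders[change["order_id"]]
        order_list_view[change["order_id"]] = {
            "order_id": change["order_id"],
            "customer_name": customers[order["customer_id"]]["name"],
            "total": f"${order['total_cents'] / 100:.2f}",
        }


def get_order_list() -> list:
    """Query: reads only the denormalized view, never the write model."""
    return list(order_list_view.values())


place_order("o-1", "c-1", 5000)
stale = get_order_list()    # projection has not run yet: the consistency window
run_projection()
fresh = get_order_list()
```

The gap between `stale` and `fresh` is the consistency window in code form; returning the write response to the user, as `place_order` does, is one way to hide it.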
Interview question: What are the trade-offs of event sourcing vs traditional state-based persistence?
- Martin Fowler — Event Sourcing — the canonical pattern description.
- Microsoft — Event Sourcing Pattern — includes the trade-offs most tutorials skip.
Chapter 14: Microservices
14.1 What Microservices Are
Independently deployable services, each owning a specific business capability. Each has its own data store, its own deployment pipeline, and communicates with others through well-defined APIs or events.
Analogy: Microservices are like independent food trucks vs. a single restaurant kitchen. Each food truck has its own menu, its own chef, its own supply chain, and can set up or shut down independently. That is powerful — a taco truck can upgrade its grill without affecting the sushi truck. But try coordinating a multi-course meal across five food trucks (appetizer from truck A, entree from truck B, dessert from truck C, all arriving at your table hot and in the right order) and you will immediately feel the coordination cost of distributed systems. A single restaurant kitchen handles that coordination trivially because everything is in one place. That is the monolith trade-off in a nutshell: easier coordination, harder independence.
What “independently deployable” actually means: You can deploy a new version of the Order Service at 2 PM on Tuesday without deploying, testing, or even notifying the Payment Service team. If this is not true — if deploying one service requires coordinating with other teams — you do not have microservices, you have a distributed monolith.
What “owns its data” actually means: The Order Service has its own database (or at minimum its own schema). No other service queries the Order tables directly. Other services get order data through the Order Service’s API or by consuming events it publishes. This is the hardest discipline in microservices and the most commonly violated.
14.2 Benefits of Microservices
Independent deployment: Ship changes to the order service without touching the payment service. Deploy 10 times a day per service. Roll back one service without affecting others.
Independent scaling: Scale the search service during peak traffic without scaling everything. Run the image processing service on GPU instances while the API runs on standard instances.
Technology flexibility: Use Python for the ML service, Go for the high-throughput API, TypeScript for the BFF.
Team autonomy: Each team owns their service end-to-end — they decide on the technology, the deployment schedule, and the internal architecture.
Fault isolation: A crash in the review service does not bring down the checkout flow (if properly designed with circuit breakers and graceful degradation).
14.3 Problems with Microservices (and Solutions)
Distributed system complexity. Network calls fail, latency is unpredictable, partial failures are normal. Solution: resilience patterns (retry, circuit breaker, timeout, bulkhead), async communication where possible.
Data consistency. No distributed transactions. Each service owns its data. Solution: saga pattern for multi-service workflows, eventual consistency, outbox pattern for reliable event publishing.
Service discovery. How does Service A find Service B? Solution: DNS-based discovery (Kubernetes services), service registries (Consul, Eureka), service mesh (Istio, Linkerd).
Distributed tracing. A single user request flows through 5 services — how do you debug it? Solution: distributed tracing (Jaeger, Zipkin, AWS X-Ray, Azure Application Insights), correlation IDs propagated through all calls.
Data duplication and joins. You cannot JOIN across service databases. Solution: each service maintains the data it needs (via events). API composition for queries that span services. CQRS with denormalized read models.
Testing complexity. Integration testing across services is hard. Solution: contract testing (Pact), consumer-driven contracts, service virtualization, robust CI/CD per service.
Operational overhead. Each service needs monitoring, alerting, deployment pipelines, log aggregation. Solution: platform team providing shared infrastructure, service mesh, standardized templates, internal developer platform.
Network latency. Every service call adds network round-trip time. Solution: minimize synchronous call chains, use async communication, batch requests, use gRPC for internal communication (binary Protobuf encoding over HTTP/2 typically carries less overhead than JSON over REST).
14.4 Key Microservices Patterns
API Gateway: Single entry point for external clients. Handles routing, authentication, rate limiting, request aggregation. Prevents clients from needing to know about individual services.
Backend for Frontend (BFF) Pattern — Deep Dive
The BFF pattern deserves more than a one-liner because it is increasingly the default approach for any company with multiple client types — and it is the natural complement to GraphQL federation.
Problem it solves: A single API serving all client types forces painful compromises. Mobile apps need small, battery-efficient payloads — 3 fields per card in a list view. Web dashboards need rich, nested data — 40 fields with related entities, all in one round trip. Third-party integrations need stable, versioned contracts. A single API either over-fetches for mobile (wasting bandwidth and battery), under-fetches for web (requiring N+1 round trips), or creates a bloated “god endpoint” that returns everything and lets each client pick what it wants.
How it works. Each BFF:
- Receives requests from one client type
- Calls the relevant backend microservices
- Aggregates, transforms, and shapes the response for that specific client’s needs
- Handles client-specific concerns (mobile pagination, web caching headers, partner API versioning)
When a BFF is worth it:
- You have 2+ client types with genuinely different data needs (not just “mobile shows fewer fields” — that is a UI concern, not an API concern).
- Mobile latency and bandwidth constraints require aggressive response shaping that the backend teams should not own.
- Different clients have different authentication flows, rate limits, or versioning cadences.
- You want to insulate backend services from client-specific churn — the web team’s redesign should not require backend API changes.
When a BFF is not worth it:
- You have one client type. A BFF for a single web app is just an API gateway with a fancy name.
- The data needs across clients are 90% identical. If mobile and web both need the same fields, a single API with field-level selection (or GraphQL) is simpler.
- You do not have the team capacity to maintain multiple BFF services. Each BFF is a service — it needs CI/CD, monitoring, on-call, and ownership.
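The response-shaping role of a BFF can be sketched as two thin functions over the same backend calls. This is an illustration only: `fetch_product` and `fetch_reviews` are hypothetical stand-ins for calls to backend microservices and return hard-coded data, and in practice each BFF would be its own deployed service.

```python
def fetch_product(product_id: str) -> dict:
    """Hypothetical call to a catalog service (hard-coded for the sketch)."""
    return {
        "id": product_id,
        "name": "Widget",
        "price_cents": 1999,
        "description": "A sturdy widget.",
        "stock": 12,
    }


def fetch_reviews(product_id: str) -> list:
    """Hypothetical call to a review service (hard-coded for the sketch)."""
    return [
        {"author": "ada", "rating": 5, "text": "Great"},
        {"author": "bob", "rating": 4, "text": "Good"},
    ]


def mobile_product_card(product_id: str) -> dict:
    """Mobile BFF: a small payload with only what a list card renders."""
    product = fetch_product(product_id)
    return {
        "id": product["id"],
        "name": product["name"],
        "price": f"${product['price_cents'] / 100:.2f}",
    }


def web_product_page(product_id: str) -> dict:
    """Web BFF: rich, nested data aggregated into one round trip."""
    product = fetch_product(product_id)
    reviews = fetch_reviews(product_id)
    average = sum(r["rating"] for r in reviews) / len(reviews) if reviews else None
    return {**product, "reviews": reviews, "average_rating": average}


mobile = mobile_product_card("p-1")
web = web_product_page("p-1")
```

Note that the mobile function never calls the review service at all: the saving is in backend calls as well as payload size.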
Saga Pattern (Deep Dive)
Manage distributed transactions as a sequence of local transactions with compensating actions. This is one of the most critical patterns in microservices — without it, multi-service workflows that require atomicity have no reliable coordination mechanism.
Problem it solves: In a monolith, you wrap a multi-step operation in a database transaction. In microservices, there is no distributed transaction (and 2PC does not scale). The saga pattern provides eventual consistency across services by chaining local transactions with explicit undo steps.
Concrete example — Order Processing Saga:
- Order Service: Create order (status: pending)
- Payment Service: Charge customer → if fails, compensate: cancel order
- Inventory Service: Reserve items → if fails, compensate: refund payment, cancel order
- Shipping Service: Create shipment → if fails, compensate: release inventory, refund payment, cancel order
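The order-processing saga above can be sketched as an orchestrator that runs steps in order and, on failure, runs the compensations of the completed steps in reverse. This is a minimal in-process sketch: `run_saga`, `SagaFailed`, and the step names are hypothetical, and each action would really be a call to another service.

```python
class SagaFailed(Exception):
    """Raised after the compensations for a failed saga have run."""


def run_saga(steps, ctx):
    """Run (name, action, compensate) steps; undo completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            # Compensate in reverse order of completion.
            for undo in reversed(completed):
                undo(ctx)
            raise SagaFailed(
                f"step {name!r} failed; compensated {len(completed)} step(s)"
            ) from exc
        completed.append(compensate)


def step(name):
    """Build a stand-in action that just records that it ran."""
    def run(ctx):
        ctx["log"].append(name)
    return run


def failing_shipment(ctx):
    raise RuntimeError("carrier API unavailable")  # simulated failure


order_saga = [
    ("create_order", step("create_order"), step("cancel_order")),
    ("charge_payment", step("charge_payment"), step("refund_payment")),
    ("reserve_inventory", step("reserve_inventory"), step("release_inventory")),
    ("create_shipment", failing_shipment, step("cancel_shipment")),
]

ctx = {"log": []}
try:
    run_saga(order_saga, ctx)
except SagaFailed as err:
    ctx["error"] = str(err)
```

The sketch assumes compensations always succeed. In a real system they can fail too, so they need to be retryable and idempotent, and the orchestrator's progress must be persisted so it can resume after a crash.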
Choreography vs Orchestration
This is the most important decision when implementing sagas. Both are valid — the right choice depends on complexity and observability needs.
Choreography — decentralized, event-driven: Each service publishes events and other services react. No central coordinator.
- Order Service publishes OrderCreated → Payment Service listens, charges, publishes PaymentCharged → Inventory Service listens, reserves, publishes InventoryReserved → Shipping Service listens, ships.
- If Inventory fails, it publishes InventoryReservationFailed → Payment Service listens and refunds → Order Service listens and cancels.
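The choreographed flow above can be sketched with handlers that only know about event names, never about each other. This is a toy in-process version under stated assumptions: the `on` and `publish` helpers are hypothetical, dispatch is synchronous, and `PaymentRefunded` and `ShipmentCreated` are event names invented here to close the chain.

```python
from collections import defaultdict

handlers = defaultdict(list)   # event name -> subscribed handler functions
log = []                       # every published event, in order
order_status = {}
OUT_OF_STOCK = {"gizmo"}       # hypothetical inventory state


def on(event_name):
    """Decorator that subscribes a handler to an event name."""
    def register(fn):
        handlers[event_name].append(fn)
        return fn
    return register


def publish(event_name, data):
    log.append(event_name)
    for fn in list(handlers[event_name]):
        fn(data)


@on("OrderCreated")
def payment_charge(data):                 # Payment Service
    publish("PaymentCharged", data)


@on("PaymentCharged")
def inventory_reserve(data):              # Inventory Service
    if data["sku"] in OUT_OF_STOCK:
        publish("InventoryReservationFailed", data)
    else:
        publish("InventoryReserved", data)


@on("InventoryReserved")
def shipping_create(data):                # Shipping Service
    publish("ShipmentCreated", data)


@on("InventoryReservationFailed")
def payment_refund(data):                 # Payment Service compensates
    publish("PaymentRefunded", data)


@on("PaymentRefunded")
def order_cancel(data):                   # Order Service compensates
    order_status[data["order_id"]] = "cancelled"


def create_order(order_id, sku):          # Order Service
    order_status[order_id] = "pending"
    publish("OrderCreated", {"order_id": order_id, "sku": sku})


create_order("o-1", "widget")   # happy path
create_order("o-2", "gizmo")    # inventory fails, compensation chain runs
```

Nothing in this code can answer "what state is saga o-2 in?" without reading the whole log, which is the visibility cost of choreography.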
Strangler Fig Pattern (Deep Dive)
The Strangler Fig is the most practical migration pattern in software engineering — named after the fig trees that grow around a host tree, eventually replacing it entirely while the host continues to live during the transition. Martin Fowler coined the term after observing these trees in Australia, and the metaphor is perfect: you do not kill the old system. You grow the new system around it until the old system is no longer needed.
Problem it solves: You have a legacy monolith that is too risky, too large, or too poorly understood to rewrite from scratch. A “big bang” rewrite — where you build the new system in parallel and cut over on a single date — fails far more often than it succeeds (see the Netscape rewrite, or the FBI’s Virtual Case File). The Strangler Fig gives you incremental migration with continuous delivery of value, manageable risk at each step, and the ability to stop or reverse at any point.
How the Strangler Fig works — the complete mechanism: The pattern has three core components:
- The Routing Layer (the “strangler proxy”). A reverse proxy, API gateway, or load balancer sits in front of both the monolith and the new services. All client traffic goes through this layer. It decides, on a per-request basis, whether to route to the monolith or to the new service.
- The New Service. A standalone service that implements one piece of functionality that currently lives in the monolith. It has its own database, its own deployment pipeline, and its own tests.
- The Migration Toggle. A mechanism (feature flag, routing rule, percentage-based traffic split) that controls which requests go to the new service vs the monolith. This is your safety valve.
Choose a first extraction candidate that is:
- Low risk: Not the core revenue path. Not the payment flow. Something where a bug is embarrassing, not catastrophic.
- Well-bounded: Has clear inputs and outputs. Does not deeply entangle with 15 other monolith modules.
- High value to modernize: Perhaps it needs independent scaling, a different technology, or it is the module that blocks monolith deployments most often.
Options for handling data during the transition:
- Option A: Shared database temporarily. The new service reads/writes the same database tables as the monolith during transition. This is pragmatic but creates coupling. Use it as a stepping stone, not a permanent state.
- Option B: Dual writes. Write to both the old and new databases during transition. Complex and error-prone — you need to handle failures in either write.
- Option C: CDC-based sync. Use Change Data Capture (Debezium, DynamoDB Streams) to replicate data from the monolith’s database to the new service’s database. The new service reads from its own store while the monolith continues writing to the original.
- Option D: Event-driven migration. If you are already using events, the new service builds its data store by consuming events. This is the cleanest approach but requires the event infrastructure to already exist.
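The routing layer and migration toggle can be sketched as a percentage-based router that buckets users deterministically. This is a sketch under assumptions: the `/notifications` prefix, the 25% figure, and the `route` and `bucket` names are illustrative, and in practice this logic usually lives in a reverse proxy or API gateway configuration rather than in application code.

```python
import hashlib

# Hypothetical toggle: route prefix -> percentage of users sent to the new service.
ROLLOUT_PERCENT = {"/notifications": 25}


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so routing does not flap."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(path: str, user_id: str, rollout: dict = ROLLOUT_PERCENT) -> str:
    """Decide per request whether the monolith or the new service handles it."""
    for prefix, percent in rollout.items():
        if path.startswith(prefix) and bucket(user_id) < percent:
            return "new-service"
    return "monolith"   # default: anything not migrated stays on the monolith


# Setting a prefix to 0 is the rollback; 100 completes the migration of that slice.
checkout_target = route("/checkout", "u-1")
rolled_back = route("/notifications/send", "u-1", {"/notifications": 0})
fully_migrated = route("/notifications/send", "u-1", {"/notifications": 100})
```

Hashing the user ID rather than picking randomly per request keeps each user on one side of the split, which makes divergences between old and new behavior much easier to reproduce.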
14.5 Microservice Anti-Patterns
Know these — they come up in interviews and are common in real organizations: The Distributed Monolith: All services must be deployed together, share a database, or cannot function independently. You have all the complexity of microservices with none of the benefits. Symptom: “We can’t deploy the Order Service without also deploying the User Service.” Fix: Enforce independent deployability as a hard rule. Each service owns its data. Communication through APIs or events only. The Shared Database: Multiple services read and write the same database tables. Any schema change requires coordinating across all services. Symptom: “We need to update 5 services because we added a column to the users table.” Fix: Each service owns its tables. Other services access data through the owning service’s API. Duplicate data via events where needed. The God Service: One service that everything depends on (often called “common-service” or “core-service”). It becomes the bottleneck — every team needs changes in it, and it cannot be deployed without risking everything. Symptom: The god service has 50+ API endpoints and is modified in every sprint by 3 different teams. Fix: Decompose by business capability. If UserService handles user profiles, authentication, preferences, and billing — those are 4 services waiting to be extracted. Chatty Microservices: A single user request triggers a sequential chain of 5+ synchronous service calls. Latency compounds (5 services × 50ms = 250ms minimum). Failure in any one breaks the chain. Symptom: A product page takes 2 seconds because it calls 8 services sequentially. Fix: Aggregate data at the BFF (Backend for Frontend) layer. Use async communication where possible. Cache aggressively. Denormalize data so services have what they need locally. The Entity Service Trap: Splitting by data entity (UserService, OrderService, ProductService) instead of by business capability (Checkout, Catalog, Fulfillment). 
Entity services become CRUD wrappers with no business logic, and real business operations span multiple services. Fix: Design around business capabilities and use cases, not database tables.
14.6 The Monolith-First Argument
Monolith: One deployment unit. Simple to develop, test, deploy. Right for most teams starting out.
Modular monolith: Monolith with strict internal boundaries. Each module has its own models, data access, and clear interfaces. Simplicity of monolith with modularity for future extraction.
Microservices: When you need independent deployment, independent scaling, technology diversity, or team autonomy at scale.
The rule: Start with a modular monolith. Extract services only when you have a clear, measurable reason.
When microservices are actually harmful:
- Small teams (fewer than 20-30 engineers). The operational overhead of running, monitoring, and debugging distributed services exceeds the organizational benefit. A small team does not need independent deployment per team because they are one team.
- Early-stage products where the domain is not yet understood. Microservice boundaries are domain boundaries. If you do not yet know your domain well (the product is still pivoting, requirements shift weekly), you will draw the boundaries wrong. Refactoring across service boundaries is orders of magnitude harder than refactoring within a monolith. Get the boundaries right in a modular monolith first, then extract.
- When there is no platform/infrastructure team. Microservices require investment in CI/CD per service, centralized logging, distributed tracing, service discovery, and deployment orchestration. Without this foundation, each team reinvents the wheel and operational incidents multiply.
- When the team lacks distributed systems experience. Microservices introduce failure modes that do not exist in monoliths: network partitions, eventual consistency, message ordering, partial failures, distributed debugging. If the team has not dealt with these before, the learning curve during a production system build is costly.
Interview question: Your team is debating whether to start a new project with microservices. What is your recommendation?
- Martin Fowler — MonolithFirst — the original argument for monolith-as-default.
- Shopify Engineering — Deconstructing the Monolith — how they use Packwerk for module enforcement.
Interview question: How do you handle a distributed transaction that spans three microservices?
- Microsoft — Saga Pattern — choreography vs orchestration with failure-handling diagrams.
- Martin Fowler — Saga pattern — historical context and the trade-offs.
Interview question: You're designing an e-commerce checkout. Payment, inventory, and shipping are separate services. How do you ensure consistency? Walk me through the Saga pattern.
- Create Order — the Order Service creates an order in pending status. This is the starting point, and the orchestrator records that step 1 succeeded.
- Reserve Inventory — the orchestrator calls the Inventory Service to reserve the items. If this fails (out of stock), we cancel the order immediately. No payment was taken, so no compensation is needed beyond updating the order status to cancelled.
- Process Payment — the orchestrator calls the Payment Service to charge the customer. If this fails (declined card), the compensating action is to release the inventory reservation, then cancel the order.
- Initiate Shipping — the orchestrator calls the Shipping Service to create a shipment. If this fails, we refund the payment, release inventory, and cancel the order.
- Microsoft — Compensating Transaction Pattern — how to design rollbacks when ACID transactions are off the table.
- Temporal Docs — Sagas — how a workflow engine expresses sagas in real code.
Interview question: Your team wants to adopt microservices. You have 5 engineers and a 6-month-old product. What do you advise and why?
- Segment Engineering — Goodbye Microservices — a real team’s story of collapsing services back into a monolith.
- Martin Fowler — MicroservicePremium — the productivity cost that microservices impose before paying off.
Interview question: You're migrating a legacy monolith to microservices. Walk me through your approach using the Strangler Fig pattern.
- Martin Fowler — StranglerFigApplication — Fowler’s original pattern description.
- Microsoft — Strangler Fig Pattern — operational guidance for routing and decommissioning.
Interview question: Your mobile and web apps need different API responses. How would you architect this?
- Sam Newman — “Backends For Frontends” (samnewman.io) — the canonical write-up of the BFF pattern.
- Apollo GraphQL documentation (apollographql.com/docs) — federation, persisted queries, query complexity analysis.
- Netflix Technology Blog — “Embracing the Differences: Inside the Netflix API Redesign” — the origin story of per-client APIs at scale.
Interview question: Show me how you'd refactor a God class using the Strategy pattern. What's the first step?
“Consider a ReportGenerator class with a 500-line generate() method containing a giant if-else chain: if format == 'pdf' does one thing, elif format == 'csv' does another, elif format == 'excel' does a third, and so on. Every new format means adding another branch, and the class has become a dumping ground for unrelated formatting logic.
The first step — and this is critical — is not to start extracting strategies. The first step is to write characterization tests. I need tests that capture the current behavior of each branch, so I can refactor with confidence that I am not breaking anything. I would write a test for PDF output, a test for CSV output, and a test for Excel output, each asserting on the actual output the current code produces.
With tests in place, step two is to define the Strategy interface: a single formatting method that every format-specific class will implement.
Step three: extract one strategy at a time, starting with PdfReportFormatter. Move the PDF logic out of the if-else branch and into this class. Run the tests. Green? Move to the next one. CsvReportFormatter. Run tests. ExcelReportFormatter. Run tests. Each extraction is a small, safe step.
Step four: replace the if-else chain with a lookup map from format name to formatter. ReportGenerator is now a thin coordinator. Adding a new format means adding a new class and one entry in the map — no existing code changes.
The key insight is that each step is independently committable and deployable. At no point did I do a big-bang rewrite. If I get pulled onto an incident after step three, the code is in a better state than when I started.”
Common mistakes: Jumping straight to the end state without describing the incremental steps. Forgetting to mention tests as the first step. Describing the pattern in the abstract without a concrete example. Not explaining why the God class is problematic in the first place (violates Open/Closed Principle, single class changing for multiple reasons).
Words that impress: Characterization tests, incremental extraction, Open/Closed Principle, each step is independently deployable, lookup map replacing conditional logic, thin coordinator.
Q: When does if/else actually beat Strategy?
A: When you have two or three branches that are unlikely to grow, and the branching logic is trivial. A Strategy for “free shipping for orders over $100, otherwise calculate shipping” is overkill — the indirection costs more than the one-line conditional. The Strategy earns its keep around 4+ branches that vary independently.
Q: What’s the single biggest risk during this refactor?
A: Silent behavior change in an edge case the tests didn’t cover. Characterization tests only catch what you tested. Before switching over, run the new code in shadow mode alongside the old code in production for a week and compare their outputs; any divergence is a missed test case.
- Michael Feathers — “Working Effectively with Legacy Code” — the definitive guide to refactoring untested code, including characterization tests.
- Martin Fowler — “Refactoring” (martinfowler.com/books/refactoring.html) — the standard refactoring catalog with the exact move sequence for Replace Conditional with Polymorphism.
- refactoring.guru/design-patterns/strategy — visual walkthrough of the Strategy pattern with language-specific examples.
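The end state of the refactor described in the answer above might look roughly like this. It is a sketch under stated assumptions, not the original answer's code: the `ReportFormatter` protocol, its `render` method, and `TextReportFormatter` are hypothetical, and the PDF and Excel formatters are left out because they would need third-party libraries.

```python
import csv
import io
from typing import Protocol


class ReportFormatter(Protocol):
    """The Strategy interface: one method, one responsibility."""

    def render(self, rows: list) -> str: ...


class CsvReportFormatter:
    def render(self, rows: list) -> str:
        if not rows:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()


class TextReportFormatter:
    """Hypothetical extra format, included to show a second strategy."""

    def render(self, rows: list) -> str:
        return "\n".join(
            " ".join(f"{key}={value}" for key, value in row.items()) for row in rows
        )


# The lookup map that replaces the if-else chain.
FORMATTERS: dict = {
    "csv": CsvReportFormatter(),
    "text": TextReportFormatter(),
}


class ReportGenerator:
    """Thin coordinator: picks a strategy and delegates to it."""

    def generate(self, fmt: str, rows: list) -> str:
        try:
            formatter = FORMATTERS[fmt]
        except KeyError:
            raise ValueError(f"unsupported format: {fmt!r}") from None
        return formatter.render(rows)


rows = [{"sku": "widget", "qty": 2}]
generator = ReportGenerator()
csv_report = generator.generate("csv", rows)
text_report = generator.generate("text", rows)
```

Adding a format is now one new class plus one entry in `FORMATTERS`; the characterization tests written in step one can be pointed at `generate` unchanged.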
Pattern Selection Guide
Use this table when choosing between patterns. Match your problem to the pattern, and weigh the trade-off honestly.
| Problem | Pattern | Trade-off |
|---|---|---|
| Multiple algorithms selectable at runtime (e.g., payment methods, pricing tiers) | Strategy | Adds interface + implementations per algorithm; overkill for 1-2 static behaviors |
| Business logic tangled with database code; need testable domain layer | Repository | Extra abstraction layer; unnecessary if ORM already provides clean separation |
| Complex or conditional object creation scattered across callers | Factory | Centralizes creation but hides what is being created; can obscure debugging |
| Need to add cross-cutting behavior (logging, caching, metrics) without modifying existing code | Decorator | Each layer adds indirection; deeply nested decorators are hard to debug |
| Unknown, extensible set of reactors to a state change | Observer / Event-Driven | Loose coupling at the cost of traceability; debugging event chains is hard |
| Insulate code from third-party API changes and vendor lock-in | Adapter | Extra wrapper layer; unnecessary for internal code you control |
| Simple app with clear layers (presentation, business, data) | Layered Architecture | Pass-through layers become ceremony; cross-cutting concerns do not fit neatly |
| Complex domain logic that must be testable without infrastructure | Hexagonal Architecture | More up-front structure; overhead not justified for simple CRUD |
| Read and write loads differ dramatically; need different query shapes | CQRS | Two models to maintain, eventual consistency to reason about; overkill for simple CRUD |
| Audit trail, history, temporal queries, retroactive projections | Event Sourcing | Schema evolution is hard, replay is slow at scale, storage grows unbounded |
| Multi-service workflow requiring atomicity without distributed transactions | Saga (Orchestration) | Orchestrator complexity; compensating transactions must be carefully designed |
| Simple 2-3 service reactive workflow | Saga (Choreography) | No central visibility; hard to answer “what state is this saga in?” |
| Multiple client types (web, mobile, partners) with different data needs | Backend for Frontend (BFF) | Each BFF is a service to maintain; coupling risk if backend APIs change frequently |
| Gradual migration from monolith to services | Strangler Fig | Dual running costs during migration; routing complexity at the boundary |
| Reliable event publishing tied to data changes | Outbox Pattern | Extra table + relay process; operational overhead of polling or CDC setup |
| Many teams, independent deploy/scale needs, mature platform | Microservices | Distributed system complexity; harmful for small teams or unclear domains |
| Small team, evolving domain, speed of iteration priority | Modular Monolith | Must enforce boundaries with discipline; extraction to services requires later effort |
When to Remove Patterns, Flatten Abstractions, and Refuse Indirection
Knowing when to apply a pattern is intermediate knowledge. Knowing when to remove one is senior knowledge. The hardest architectural conversation is not “should we add a pattern?” — it is “should we remove one that someone invested a sprint building?”
The Pattern Removal Decision Framework
Before removing a pattern, run this five-question diagnostic:
- Is the pattern serving its original purpose? Every pattern was introduced to solve a problem. Is that problem still present? If the team introduced the Repository pattern because they planned to swap databases, and that swap never happened and is no longer on any roadmap, the pattern is solving a phantom problem.
- What is the carrying cost? Count the daily tax: how many files does a new developer need to navigate to understand a single operation? How many indirection hops exist between the API controller and the actual work? If adding a new field requires touching 6 files across 3 abstraction layers, the pattern’s carrying cost is high.
- What is the removal cost? This is usually lower than people think. Inlining a Strategy with one implementation is a 30-minute refactoring. Collapsing a pass-through Repository is a find-and-replace. The fear of removal is almost always disproportionate to the actual effort.
- What is the re-introduction cost if we need it later? For code-level patterns (Strategy, Factory, Decorator), the re-introduction cost is low — you can extract the pattern again in under an hour. For architectural patterns (CQRS, Event Sourcing, microservice extraction), the re-introduction cost is high. This asymmetry matters: be aggressive about removing code-level patterns and conservative about removing architectural ones.
- Is the pattern creating a false sense of flexibility? An adapter that mirrors the SDK’s interface does not provide vendor isolation — it provides the illusion of it. A factory with one code path does not provide creation flexibility — it provides indirection. If the pattern’s flexibility has never been exercised, it is speculative, and speculative flexibility has a real daily cost.
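To make the removal-cost question concrete, here is a minimal TypeScript sketch of a Strategy that has only ever had one implementation, before and after inlining. The names (ShippingStrategy, FlatRateShipping, CheckoutWithStrategy, checkoutTotal) and the pricing formula are hypothetical, invented only for this illustration.

```typescript
// Hypothetical example: all names and the pricing formula are illustrative.

// Before: a Strategy interface with exactly one implementation.
interface ShippingStrategy {
  cost(weightKg: number): number;
}

class FlatRateShipping implements ShippingStrategy {
  cost(weightKg: number): number {
    return 5 + 0.5 * weightKg;
  }
}

class CheckoutWithStrategy {
  private readonly shipping: ShippingStrategy;

  constructor(shipping: ShippingStrategy) {
    this.shipping = shipping;
  }

  total(subtotal: number, weightKg: number): number {
    return subtotal + this.shipping.cost(weightKg);
  }
}

// After: the single implementation is inlined as a plain function.
// If a second algorithm ever appears, the interface can be extracted again.
function shippingCost(weightKg: number): number {
  return 5 + 0.5 * weightKg;
}

function checkoutTotal(subtotal: number, weightKg: number): number {
  return subtotal + shippingCost(weightKg);
}
```

Both versions compute the same total. The second removes one interface, one class, and one constructor argument that every caller previously had to supply.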
Patterns That Commonly Overstay Their Welcome
| Pattern | Common reason it was introduced | Common reason it should be removed |
|---|---|---|
| Repository | “We might swap databases” | Two years later, still on PostgreSQL, no swap planned, every repo method is a pass-through |
| Factory | “Creation might become complex” | Creation logic never grew beyond new Thing(a, b), but every constructor change requires updating the factory too |
| Strategy | “We’ll need more algorithms” | One implementation for 18 months, the second never materialized |
| Adapter | “We might switch vendors” | The vendor has been stable for 3 years, the adapter’s interface mirrors the SDK 1:1 |
| Decorator | “We need observability” | Three decorators were added during an incident, the incident was resolved, nobody removed the decorators |
| Observer/Events | “We’re going event-driven” | Half the events have exactly one subscriber, the other half have zero (dead events nobody cleaned up) |
| Hexagonal/Clean Architecture | ”We need testable business logic” | The app is CRUD, the “business logic” layer is pass-through, and tests hit the real database anyway |
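The Observer/Events row above can be checked mechanically. The sketch below assumes a hypothetical in-process EventBus (not a real library API) that can report how many subscribers each registered event has, and classifies events as dead, single-subscriber, or genuine fan-out.

```typescript
// Hypothetical in-process event bus; not the API of any real library.
type Handler = (payload: unknown) => void;

interface SubscriberCount {
  event: string;
  subscribers: number;
}

class EventBus {
  private readonly handlers = new Map<string, Handler[]>();

  // Registers an event name so it is tracked even with no subscribers.
  register(eventName: string): Handler[] {
    let list = this.handlers.get(eventName);
    if (list === undefined) {
      list = [];
      this.handlers.set(eventName, list);
    }
    return list;
  }

  subscribe(eventName: string, handler: Handler): void {
    this.register(eventName).push(handler);
  }

  subscriberCounts(): SubscriberCount[] {
    const counts: SubscriberCount[] = [];
    this.handlers.forEach((list, name) => {
      counts.push({ event: name, subscribers: list.length });
    });
    return counts;
  }
}

interface EventAudit {
  dead: string[];             // zero subscribers: candidates for deletion
  singleSubscriber: string[]; // one subscriber: candidates for a direct call
  fanOut: string[];           // two or more: events are earning their keep
}

function auditEvents(counts: SubscriberCount[]): EventAudit {
  const sorted = counts.slice().sort((a, b) => a.event.localeCompare(b.event));
  return {
    dead: sorted.filter((c) => c.subscribers === 0).map((c) => c.event),
    singleSubscriber: sorted.filter((c) => c.subscribers === 1).map((c) => c.event),
    fanOut: sorted.filter((c) => c.subscribers >= 2).map((c) => c.event),
  };
}
```

An audit like this only reports the current subscriber count. Whether a single-subscriber event should become a direct call is still a judgment about expected fan-out.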
How to Propose Pattern Removal Without Starting a War
Pattern removal is emotionally charged: someone built it, someone reviewed it, someone defended it in a design review. Here is the approach that works:
- Lead with data, not opinion. “This adapter has been modified 0 times in 14 months, while adding 3ms of indirection to every request” is harder to argue with than “I think this adapter is unnecessary.”
- Frame as evolution, not failure. “The team made a reasonable bet that we’d swap payment providers. That bet didn’t pay off, which is fine — now we can simplify.” Nobody wants their past work called a mistake.
- Propose incremental removal. “Let’s inline this one factory and see if anyone misses it in 2 weeks. If they do, we revert.” Low-risk experiments build confidence.
- Establish a pattern health review. Quarterly, the tech lead reviews each abstraction layer and asks: “What value did this provide in the last 3 months?” Patterns that provide no measurable value get flagged for removal. This makes removal a regular maintenance activity, not a political event.
Recognizing Pattern Misuse in Existing Codebases
Applying patterns to greenfield code is intermediate knowledge. Recognizing misapplied patterns in an existing codebase, and having the judgment to propose their removal, is senior knowledge. In interviews, questions about “what would you change in this codebase?” test this skill directly.
The Pattern Misuse Diagnostic
When you join a team or review a codebase, here are the prompts that reveal misapplied patterns:
Prompt 1: “Show me a pattern in this codebase that was introduced for a reason that no longer applies.” Every codebase has at least one. The adapter wrapping a vendor the team will never swap. The event bus connecting two modules that will never have a third subscriber. The factory creating one product type. The strongest signal of architectural maturity is a team that regularly retires patterns whose justification has expired.
Prompt 2: “Find a pattern whose carrying cost exceeds its benefit.” Count the files a new developer must navigate to understand a single request flow. If the answer is 12 files across 4 abstraction layers for a simple CRUD operation, the pattern is not serving the developers; the developers are serving the pattern. Measure the time-to-first-PR for new hires: if onboarding takes 3 weeks because of architectural complexity, the architecture is a tax on hiring velocity.
Prompt 3: “Identify a pattern that is being used inconsistently: some modules apply it, others do not.” Inconsistency is not always bad. It often indicates that the team tried the pattern, discovered it was not worth the cost in some contexts, and stopped applying it there. The danger is when inconsistency is accidental: some modules have repositories and others call the ORM directly, not by design but because different developers had different habits. If the inconsistency is intentional, document the criteria. If it is accidental, converge in one direction.
Prompt 4: “Find the pattern that would be easiest to remove with the most improvement in readability.” This is the highest-leverage simplification question. In most codebases, the answer is either (a) a pass-through layer that adds zero logic, or (b) a dead abstraction whose only implementation has been the sole implementation since it was created. Inlining these is a 30-minute refactoring that removes cognitive overhead from every subsequent code read.
Prompt 5: “What pattern is the team about to misapply next?” Listen for the signals: “we should go event-driven” when there is no fan-out need. “We need a factory” when there is one product type. “Let us add CQRS” when an index would solve the read performance problem. The best time to prevent a misapplied pattern is before the PR is merged, not 18 months later when it has become load-bearing ceremony.
Common Codebase Smells That Indicate Pattern Misuse
| What you see | What it likely means | What to do |
|---|---|---|
| An interface with exactly one implementation, and it has had one implementation for over a year | Speculative abstraction. The flexibility was never needed. | Inline the implementation. Re-extract the interface when a second implementation actually appears. |
| An event with zero or one subscriber, and the subscriber count has not changed since creation | Overenthusiastic event-driven adoption. A direct call would be clearer. | Replace the event with a direct method call for single-subscriber events. Keep events for fan-out. |
| A factory whose create() method has no conditionals | A constructor in disguise. The factory adds a file and a layer with no creation logic. | Delete the factory. Use new directly. |
| A decorator chain deeper than 3 layers | Composition has become a debugging tax. Each layer adds stack depth and cognitive overhead. | Consider collapsing into a pipeline pattern with explicit, named steps. |
| A repository that mirrors the ORM’s API 1:1 | Pass-through abstraction. The repository is not adding domain language. | Either enrich with domain-meaningful methods or delete and call the ORM. |
| Hexagonal architecture where the “business logic” layer is under 500 lines | Architecture is disproportionate to domain complexity. The infrastructure outweighs the logic it protects. | Simplify to layered architecture. Re-adopt hexagonal when domain complexity grows. |
| A saga running inside a single service with a single database | The saga is solving a distributed transaction problem that does not exist. A database transaction would suffice. | Replace the saga with a database transaction. |
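For the deep-decorator-chain row, one way to read “a pipeline pattern with explicit, named steps” is sketched below. The Order shape, the three steps, and runPipeline are hypothetical. The point is only that each step has a name that shows up in the trace and in error messages, instead of appearing as an anonymous wrapper frame in a stack trace.

```typescript
// Hypothetical example: the Order shape and the steps are illustrative.
interface Order {
  id: string;
  amount: number;
}

interface Step {
  name: string;
  run(order: Order): Order;
}

const validate: Step = {
  name: "validate",
  run(order: Order): Order {
    if (order.amount <= 0) throw new Error("amount must be positive");
    return order;
  },
};

const applyDiscount: Step = {
  name: "applyDiscount",
  run(order: Order): Order {
    return { id: order.id, amount: order.amount * 0.9 };
  },
};

const roundToCents: Step = {
  name: "roundToCents",
  run(order: Order): Order {
    return { id: order.id, amount: Math.round(order.amount * 100) / 100 };
  },
};

// Runs the steps in order and records which ones completed.
// A failure is re-thrown with the name of the step that caused it.
function runPipeline(steps: Step[], input: Order): { result: Order; trace: string[] } {
  const trace: string[] = [];
  let current = input;
  for (const step of steps) {
    try {
      current = step.run(current);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`step "${step.name}" failed: ${reason}`);
    }
    trace.push(step.name);
  }
  return { result: current, trace };
}
```

The order of concerns is visible in one array literal at the call site, for example runPipeline([validate, applyDiscount, roundToCents], order), rather than being implied by how wrappers were nested at construction time.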
Senior vs Staff Calibration Guide
Interview answers exist on a spectrum. Understanding what separates a senior-level answer from a staff-level answer on design pattern questions helps both interviewers calibrate and candidates target the right depth.
What Senior Engineers Demonstrate
- Pattern recognition: They identify the right pattern for the problem and explain why alternatives are worse.
- Trade-off articulation: They name the specific costs of the pattern, not just the benefits. “Strategy adds an interface and N implementations — that’s N+1 files to maintain. Worth it when N > 3 or when the behavior changes at runtime.”
- Production awareness: They mention failure modes, testing implications, and debugging considerations. “The decorator chain is great for composition but makes stack traces opaque.”
- When-not-to-use judgment: They can articulate when the pattern is overkill. “For two code paths, an if-else is clearer.”
- Incremental approach: They describe how to introduce a pattern safely — characterization tests first, one extraction at a time, each step deployable.
What Staff Engineers Demonstrate (in addition to the above)
- Pattern removal judgment: They identify patterns that should be removed, not just applied. “The first thing I’d simplify in this codebase is the adapter layer around the logging library — it’s pure ceremony.”
- Organizational impact reasoning: They reason about how patterns affect team structure, onboarding velocity, and cross-team coordination, not just code quality.
- Second-order effects: “If we adopt CQRS here, we need to hire someone with event store experience or invest 2 months in team ramp-up. The pattern is right, but the timing depends on hiring.”
- Migration and rollback planning: They describe how to adopt a pattern incrementally in an existing codebase, how to measure whether it’s working, and what rollback looks like if it’s not.
- Cost quantification: They put numbers on architectural decisions. “The hexagonal architecture adds ~15% to feature development time for the first 3 months, but reduces regression bug rate by ~40% after month 6. At our scale of 200 PRs/month, that math works.”
- “What would you simplify first?” instinct: Given a complex system, they identify the highest-leverage simplification. This is the inverse of pattern application — it’s pattern pruning, and it requires the confidence to say “this complexity is not serving us.”
Calibration Table: Same Question, Different Depths
| Signal | Senior answer | Staff answer |
|---|---|---|
| Pattern selection | “I’d use Strategy because we have 4 algorithms” | “I’d use Strategy, but first I’d check whether the existing if-else is actually causing problems. Stable code doesn’t need refactoring for aesthetics” |
| Failure mode awareness | “The saga compensates on failure” | “The saga compensates on failure, but compensation itself can fail. Here’s the dead-letter flow and the reconciliation job” |
| Measurement | “We’ll monitor the system” | “Success metric: deployment lead time should drop from 4 hours to 30 minutes within 6 weeks. If it doesn’t, the pattern isn’t solving the right problem” |
| Rollback thinking | “We can revert the PR” | “The migration has 3 checkpoints where we can stop and revert to the previous architecture without data loss. Here’s checkpoint 1…” |
| Six-months-later thinking | “The architecture should be extensible” | “In 6 months, the team will have grown from 5 to 12 engineers. The modular monolith boundaries need to be enforced with static analysis before that growth, or they’ll erode” |
| Simplification instinct | (Rarely surfaces unprompted) | “Before adding CQRS, I’d first check whether a database index solves the read performance problem. The simplest intervention that works is the right one” |
Curated Resources
These are not “further reading for completeness.” These are the resources that will genuinely move your understanding forward, organized by what you will get from each one.
Foundational References
- Martin Fowler — Patterns of Enterprise Application Architecture (articles) — The free online catalog from Fowler’s seminal book. Each pattern (Repository, Unit of Work, Data Mapper, Active Record, and dozens more) gets a concise explanation with diagrams. This is the vocabulary that senior engineers use when discussing data access and enterprise architecture. Start with Repository, Unit of Work, and Domain Model — those three appear in almost every design discussion.
- Refactoring.guru — Design Patterns — The best free visual catalog of design patterns available. Every pattern includes intent, motivation, structure diagrams, pseudocode, real-world analogies, and examples in multiple languages. If you learn better visually, this is your primary resource. The “Relations between patterns” section for each pattern is especially valuable — it shows when patterns complement each other and when one can substitute for another.
- Microsoft — Cloud Design Patterns — Despite the Azure branding, these are cloud-agnostic architectural patterns with exceptional depth. The Saga pattern, Circuit Breaker, CQRS, Event Sourcing, Strangler Fig, Ambassador, Sidecar — each has a detailed write-up with problem context, solution mechanics, when to use, and when not to use. This is the single best free resource for architectural patterns in distributed systems.
Books That Shift Your Thinking
- Building Microservices (2nd Edition) by Sam Newman — The definitive practical guide to microservices architecture. Newman is honest about trade-offs (the chapter on “should you even do microservices?” is worth the book alone). Key concepts to focus on: service decomposition strategies, data ownership, the monolith-first approach, and migration patterns. The second edition (2021) reflects lessons the industry learned the hard way since the microservices hype of 2015.
- Designing Data-Intensive Applications by Martin Kleppmann — Not a patterns book per se, but the best book on understanding the data systems that underpin every architectural pattern discussed here. If you want to truly understand why event sourcing has the trade-offs it does, or what eventual consistency really means at the database level, this is where you go. Chapters 5 (Replication), 7 (Transactions), and 11 (Stream Processing) are directly relevant to every pattern in this module.
Engineering Blogs for Real-World Application
- Uber Engineering Blog — CQRS and Domain Events — Uber’s engineering blog documents their journey through event-driven architecture, CQRS, and event sourcing at massive scale. Search for posts on their domain event platform and how they handle ride-state management. These are not theoretical discussions — they are battle reports from running these patterns with millions of concurrent users.
- Shopify Engineering — Deconstructing the Monolith — Shopify’s detailed explanation of their modular monolith approach, including how they use Packwerk for boundary enforcement, why they chose this path over microservices, and the concrete results. Essential reading for anyone considering (or being pressured toward) a microservices migration.
- ThoughtWorks Technology Radar — Published twice yearly, the Technology Radar tracks which patterns, tools, and techniques are being adopted, trialed, assessed, or put on hold across the industry. Check the “Techniques” quadrant for pattern trends. This is how you stay current on what the industry is learning about CQRS, event sourcing, modular monoliths, and architecture decision records.
Pattern Recognition in Interviews
The hardest part of pattern knowledge is not memorizing the patterns but recognizing when they apply. In interviews, the interviewer will rarely say “use the Strategy pattern here.” Instead, they will describe a problem, and your job is to hear the signal and reach for the right tool. This table maps common interviewer phrases and problem descriptions to the patterns they are testing.
| When the interviewer says… | Consider this pattern | Why it fits |
|---|---|---|
| “Different behavior based on type” / “The logic changes depending on the mode” / “We need to support multiple algorithms” | Strategy | Varying behavior behind a common interface: the classic Strategy signal |
| “We might switch vendors” / “What if we need to support a different payment provider?” / “How do you isolate third-party dependencies?” | Adapter | Vendor isolation through an interface that shields your code from external API changes |
| “How would you add logging/caching/metrics without changing existing code?” / “Cross-cutting concerns” | Decorator | Composable behavior wrapping: each concern is an independent, removable layer |
| “The object creation is complex” / “Different configurations depending on the environment” / “How do you avoid scattering new() calls?” | Factory | Centralized, encapsulated object creation that hides conditional construction logic |
| “Multiple services need to react when this happens” / “We need to add new reactions without modifying the source” | Observer / Event-Driven Architecture | Decoupled fan-out where the producer does not know or care about consumers |
| “How do you keep business logic testable without a database?” / “Separate domain logic from infrastructure” | Repository + Hexagonal Architecture | Abstracted data access (Repository) within a ports-and-adapters structure (Hexagonal) |
| “Read traffic is 100x write traffic” / “The dashboard query is killing the database” / “Reads need a different shape than writes” | CQRS | Separate read/write models optimized for their respective access patterns |
| “We need a complete audit trail” / “What was the state at this point in time?” / “We want to replay history” | Event Sourcing | Immutable event stream that preserves full history and enables temporal queries |
| “This workflow spans three services” / “How do you handle a distributed transaction?” / “What if step 3 fails?” | Saga (Orchestration or Choreography) | Coordinated multi-service workflow with compensating transactions for failure recovery |
| “We want to migrate off the monolith gradually” / “We cannot rewrite everything at once” | Strangler Fig | Incremental migration via routing: new functionality goes to new services, old monolith shrinks |
| “The data change and the event must be consistent” / “Sometimes events get lost” | Outbox Pattern | Atomic write of data + event in the same transaction, with a relay process for publishing |
| “We have 200 engineers and deployments take a week because everyone is coupled” | Microservices | Independent deployment and team autonomy at organizational scale |
| “We are a team of 8 and need clean boundaries without distributed system overhead” | Modular Monolith | Internal module boundaries with the operational simplicity of a single deployment |
| “Requests keep failing because one downstream service is slow” / “Cascading failures” | Circuit Breaker (covered in depth in Reliability chapters) | Fail fast when a dependency is unhealthy, preventing cascade failures |
| “Every service implements its own retry/auth/logging differently” | Sidecar / Service Mesh | Standardized cross-cutting infrastructure as a separate process alongside each service |
| “The frontend calls 6 different backends” / “Mobile needs smaller payloads than web” | API Gateway / BFF | Unified entry point (Gateway) or client-specific aggregation layer (BFF) |
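The Outbox row in the table above is the one whose mechanics are easiest to misremember, so here is a minimal in-memory sketch. FakeDb, placeOrder, and relayOnce are hypothetical stand-ins. A real implementation would use one SQL transaction covering both tables and a separate relay process or CDC pipeline.

```typescript
// Hypothetical in-memory stand-in for a relational database.
interface OrderRow {
  id: string;
  total: number;
}

interface OutboxRow {
  eventId: string;
  type: string;
  payload: string;
  published: boolean;
}

interface Tables {
  orders: OrderRow[];
  outbox: OutboxRow[];
}

class FakeDb {
  orders: OrderRow[] = [];
  outbox: OutboxRow[] = [];

  // Runs `work` against copies of both tables and commits only if it
  // does not throw, imitating an all-or-nothing database transaction.
  transaction(work: (tx: Tables) => void): void {
    const tx: Tables = { orders: this.orders.slice(), outbox: this.outbox.slice() };
    work(tx);
    this.orders = tx.orders;
    this.outbox = tx.outbox;
  }
}

// The order row and its event row are written in the same transaction,
// so neither can exist without the other.
function placeOrder(db: FakeDb, order: OrderRow): void {
  db.transaction((tx) => {
    tx.orders.push(order);
    if (order.total <= 0) throw new Error("invalid total"); // simulated mid-transaction failure
    tx.outbox.push({
      eventId: `evt-${order.id}`,
      type: "OrderPlaced",
      payload: JSON.stringify(order),
      published: false,
    });
  });
}

// Relay: publishes pending rows, then marks them as published.
// A crash between publish and mark would cause a re-publish, so
// consumers see at-least-once delivery and must be idempotent.
function relayOnce(db: FakeDb, publish: (row: OutboxRow) => void): number {
  let sent = 0;
  db.outbox.forEach((row) => {
    if (!row.published) {
      publish(row);
      row.published = true;
      sent += 1;
    }
  });
  return sent;
}
```

The design choice worth noticing is that the producer never talks to the broker directly. It only writes rows, and publishing is the relay's job.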
Interview Deep-Dive Questions
These are the questions a senior interviewer would actually ask to separate candidates who have read about patterns from candidates who have shipped systems using them. Each question starts at one level and drills deeper through follow-ups, the way a real 45-minute interview unfolds.
Q1: You join a team and discover the codebase has a Repository layer wrapping an ORM, but every repository method is just a pass-through to the ORM. Is this a problem? What would you do?
“A pass-through repository, where findById calls orm.findById, save calls orm.save, and nothing else, is adding a layer of indirection with zero value. The Repository pattern earns its keep when it exposes domain-meaningful operations like findOverdueInvoices() or findActiveSubscriptionsByRegion(), abstracting the query complexity behind a domain-language interface. If every method is just a thin proxy, the repository is not providing abstraction; it is providing bureaucracy.
But before I rip it out, I would ask three questions. First, does the codebase have unit tests that swap in an in-memory repository? If so, the repository is providing testability value even if the methods are simple, and I would keep it but start adding domain-specific query methods as new features require them. Second, is there any realistic chance of swapping the ORM or database? If the team is considering a migration from ActiveRecord to a different ORM, or from PostgreSQL to DynamoDB, the repository layer becomes genuinely valuable. Third, how complex is the domain? If this is a CRUD app with simple data access, the repositories are ceremony. If the domain is growing in complexity, the repositories are scaffolding for a good abstraction that just has not been filled in yet.
My default action: if the domain is simple and there are no plans to swap storage, I would remove the pass-through repositories and call the ORM directly. If the domain has complexity worth isolating, I would keep the repository interfaces but start migrating methods from pass-throughs to domain-meaningful operations: findUsersEligibleForPromotion() instead of findAll() followed by filtering in the service layer.”
Follow-up: How would your answer change if you were working in a hexagonal architecture?
“In a hexagonal architecture, the repository interface is a port: it is the contract between the domain core and the persistence adapter. Even if the current implementation is a simple pass-through, the interface has architectural value: it enforces the dependency rule that the core does not depend on infrastructure. I would keep the interface but still push for domain-specific methods on it. The pass-through implementation behind the interface is fine as a starting point, because the value is the boundary, not the implementation complexity. What I would watch for is whether the port interface uses domain language (findOverdueOrders) or infrastructure language (findByStatusAndDateLessThan). If it looks like a SQL query translated to a method name, the abstraction is leaking.”
Follow-up: A junior developer asks you when they should create a repository vs. just using the ORM directly. What heuristic do you give them?
“I would give them a simple rule: if you want to write a meaningful unit test for your business logic without a database, and the ORM is getting in the way of that, you need a repository. If your tests work fine calling the ORM directly, either because the logic is simple or because you are doing integration tests anyway, skip the repository. The moment you find yourself writing service.getActiveUsers() that does orm.findAll().filter(u => u.isActive && u.lastLoginAfter(thirtyDaysAgo)), that filtering logic belongs in a findActiveUsers() method on a repository, not scattered in the service. The repository gives the filter a name, makes it testable, and prevents three other developers from writing their own slightly different version.”
Going Deeper: How does the Repository pattern interact with the Unit of Work pattern, and when do you need both?
“The Repository handles querying and retrieving aggregates. The Unit of Work tracks changes across multiple entities and commits them as a single transaction. You need both when a business operation touches multiple aggregates that must be persisted atomically. For example, transferring money between accounts: you load both accounts through repositories, perform the transfer in domain logic, and the Unit of Work ensures both updated accounts are saved in a single transaction. Most ORMs (Entity Framework, Hibernate, SQLAlchemy sessions) implement Unit of Work under the hood: your ORM session is the Unit of Work. The question is whether you need to make it explicit. In DDD with complex aggregates, an explicit Unit of Work is valuable because it makes transaction boundaries visible. In simpler apps, the ORM’s implicit Unit of Work is sufficient. The gotcha is that a Unit of Work should not span multiple services in a microservices context: each service owns its own aggregates and database, and cross-service consistency should use sagas, not a shared Unit of Work.”
[…] findById, findOverdueInvoices) in domain language. Say “repository” when you want testable business logic that isn’t coupled to an ORM or query builder.
[…] findEligibleDriversNearLocation(geohash, radius) rather than generic find(). This keeps the geospatial query logic in one place, independently testable with an in-memory fake, and prevents the service layer from growing ad-hoc geohash filtering.
Follow-up Q&A Chain:
Q: Isn’t “maybe we’ll swap databases” a legitimate reason to keep pass-through repos?
A: Only if you have concrete evidence. In 15 years of engineering, I’ve seen the database swap happen twice, and both times the repository layer was ripped out and rewritten because the new database’s access patterns didn’t fit the old abstraction. Speculative swap insurance is almost always a loss.
Q: How do you introduce domain-meaningful methods without a big-bang refactor?
A: Opportunistically. When you touch the service layer to add a feature and find yourself writing filter-after-fetch logic, extract that filter into a new repository method as part of the PR. Over six months, the pass-through methods decay and the domain methods grow naturally.
- Martin Fowler, “Repository” (martinfowler.com/eaaCatalog/repository.html): the canonical definition.
- Vaughn Vernon, “Implementing Domain-Driven Design”: chapters on aggregate design and repository boundaries.
- Microsoft Docs, “Designing the infrastructure persistence layer”: concrete code examples of Repository + Unit of Work in .NET.
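The heuristic from Q1 (give the filter a name, keep it testable without a database) can be sketched as follows. The Invoice shape, the InvoiceRepository port, and overdueReminderIds are hypothetical. A production adapter would implement the same interface on top of whatever ORM the codebase uses.

```typescript
// Hypothetical domain type and port; names are illustrative.
interface Invoice {
  id: string;
  dueDate: Date;
  paid: boolean;
}

// The port speaks domain language ("overdue"), not query language.
interface InvoiceRepository {
  findOverdueInvoices(asOf: Date): Invoice[];
  save(invoice: Invoice): void;
}

// In-memory fake used by unit tests; no database required.
class InMemoryInvoiceRepository implements InvoiceRepository {
  private readonly invoices = new Map<string, Invoice>();

  findOverdueInvoices(asOf: Date): Invoice[] {
    const result: Invoice[] = [];
    this.invoices.forEach((invoice) => {
      if (!invoice.paid && invoice.dueDate.getTime() < asOf.getTime()) {
        result.push(invoice);
      }
    });
    return result;
  }

  save(invoice: Invoice): void {
    this.invoices.set(invoice.id, invoice);
  }
}

// Business logic depends only on the interface, so the "overdue" rule
// lives in one named place instead of being re-filtered in each service.
function overdueReminderIds(repo: InvoiceRepository, today: Date): string[] {
  return repo
    .findOverdueInvoices(today)
    .map((invoice) => invoice.id)
    .sort();
}
```

If every method on InvoiceRepository were instead a renamed ORM call, this sketch would collapse into the pass-through case that Q1 argues against.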
Q2: Explain the difference between the Observer pattern and Event-Driven Architecture. Are they the same thing at different scales?
“The Observer pattern is in-process. The subject keeps a list of registered observers and calls each observer.update() synchronously. Delivery is guaranteed: if the observer is registered, it gets called. Ordering is deterministic. Failures propagate immediately: if an observer throws, the subject knows. The cost is runtime coupling: all observers execute in the same thread (unless you explicitly make it async), and a slow observer blocks the subject. Think of React’s setState triggering re-renders, or a Java PropertyChangeListener.
Event-Driven Architecture is distributed. A producer publishes an event to a broker (Kafka, RabbitMQ, SQS). Consumers subscribe to topics. Now everything is different. Delivery is at-least-once or at-most-once, never exactly-once without application-level logic. Ordering is only guaranteed within a partition, not globally. Consumers can be down and catch up later (temporal decoupling). But you also get new failure modes that do not exist in-process: network partitions, message broker outages, consumer lag, out-of-order delivery, duplicate processing. You need idempotency. You need dead letter queues. You need distributed tracing to follow the event chain.
So yes, they are the same concept at different scales, but the scale change is not just ‘bigger.’ It fundamentally changes what guarantees you have and what failure modes you must handle. I would say the Observer pattern is where you learn the concept, and EDA is where you learn that distributed systems make everything harder.”
Follow-up: When would you use an in-process event bus (like MediatR or Spring’s ApplicationEventPublisher) instead of a full message broker?
“When all the consumers are in the same process, you do not need cross-service communication, and you want the decoupling benefit of events without the operational overhead of a broker. In a modular monolith, an in-process event bus is perfect: module A publishes OrderPlaced, module B’s listener sends the email, module C’s listener updates analytics, all within the same deployment. You get the extensibility of the Observer pattern with cleaner decoupling than direct method calls.
The trap is when people use an in-process event bus as a stepping stone to ‘eventual’ EDA and start treating it like a distributed system. An in-process event bus gives you synchronous, guaranteed delivery. The moment you add async or external consumers, you need the full machinery: broker, idempotency, dead letters, retry policies. Do not build half a distributed system. Either keep it in-process and simple, or go fully distributed with proper infrastructure.”
Follow-up: How do you debug a problem in an event-driven system where an expected side effect did not happen?
“This is one of the hardest operational problems in EDA, and it is where teams discover they underinvested in observability. My debugging approach is a systematic narrowing:
First, did the event get published? Check the producer’s logs or the outbox table. If using the outbox pattern, check whether the row exists and whether the published flag is set. If the event was never written to the outbox, the bug is in the producer’s business logic.
Second, did the event reach the broker? Check the topic in Kafka (consumer group lag, partition offsets) or the queue in SQS/RabbitMQ. If the event is in the outbox but not on the broker, the relay process has a problem: maybe it crashed, maybe CDC is lagging.
Third, did the consumer receive it? Check consumer group offsets and consumer logs. If the event is on the broker but the consumer has not processed it, either the consumer is down, it is lagging, or the event ended up in a partition the consumer is not assigned to.
Fourth, did the consumer process it successfully? Check for errors. If the consumer received the event but the side effect did not happen, the consumer logic has a bug: maybe it is an event schema mismatch, maybe a downstream dependency failed.
The tooling that makes this tractable: correlation IDs that flow from the original request through every event and every consumer, distributed tracing (Jaeger/Zipkin) so you can see the full event chain, and structured logging on every consumer that records event ID, processing duration, and outcome. Without these, you are grepping through logs across five services hoping to find the needle.”
Going Deeper: What is the difference between event notification, event-carried state transfer, and event sourcing, and when would you pick each?
“These are three of the patterns in Martin Fowler’s taxonomy of event-driven systems, and they represent different amounts of information in the event.
Event notification is the lightest: the event says ‘something happened’ with minimal data. OrderPlaced { orderId: 123 }. The consumer must call back to the producer to get details. This keeps events small and the producer as the source of truth, but creates runtime coupling: if the producer is down, the consumer cannot get the data it needs.
Event-carried state transfer includes enough data in the event that the consumer never needs to call back. OrderPlaced { orderId: 123, customerId: 456, items: [...], total: 50.00 }. The consumer stores what it needs locally. This eliminates the callback coupling: the consumer is self-sufficient even if the producer is down. The trade-off is larger events and data duplication across services.
Event sourcing stores every state change as an event and derives current state by replaying them. This is not about inter-service communication; it is about how a single service persists its own state.
In practice, I default to event-carried state transfer for inter-service events because the decoupling benefit is enormous. The consumer does not need to know the producer’s API, and a producer outage does not cascade. I use event notification only when the event payload would be very large and the consumer rarely needs the full data. I use event sourcing only when the domain genuinely requires history replay: finance, audit, compliance.”
- Martin Fowler, “What do you mean by Event-Driven?” (martinfowler.com): the canonical taxonomy of event notification, event-carried state transfer, and event sourcing.
- kafka.apache.org/documentation: the delivery semantics section is essential reading for anyone shipping Kafka consumers.
- Ben Stopford, “Designing Event-Driven Systems” (Confluent, free ebook): real-world patterns from the team that built Kafka.
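Q2 states that at-least-once delivery means consumers need idempotency. A minimal sketch of that idea follows, assuming each event carries a unique eventId (an assumption of this example). DomainEvent and IdempotentConsumer are hypothetical names. In a real consumer the set of processed IDs would be stored durably, ideally in the same transaction as the side effect.

```typescript
// Hypothetical event shape; eventId is assumed to be unique per event.
interface DomainEvent {
  eventId: string;
  type: string;
  orderId: string;
}

class IdempotentConsumer {
  // In-memory for illustration only; a real consumer would persist this.
  private readonly processed = new Set<string>();

  // Stand-in for the side effect (for example, sending a confirmation email).
  readonly emailsSent: string[] = [];

  // Returns true if the event caused a side effect,
  // false if it was recognized as a duplicate and skipped.
  handle(event: DomainEvent): boolean {
    if (this.processed.has(event.eventId)) {
      return false;
    }
    this.emailsSent.push(event.orderId);
    this.processed.add(event.eventId);
    return true;
  }
}
```

With this check in place, a broker redelivery or a relay re-publish results in one email per order rather than one per delivery attempt.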
Q3: You are designing a notification system that needs to support email, SMS, push notifications, and Slack — with more channels likely in the future. Walk me through your design.
NotificationSender interface with a send(recipient, message) method, then implement EmailSender, SmsSender, PushSender, SlackSender. Each handles its own protocol — SMTP for email, Twilio API for SMS, APNs/FCM for push, webhook for Slack. Adding a new channel means writing a new class that implements the interface. Zero changes to existing code.
For selecting the right sender, I would use a Factory backed by user preferences. The user’s settings say ‘notify me via email and Slack.’ The factory looks up the user’s preferences and returns the list of senders to invoke. This keeps the notification service clean — it calls factory.getSendersForUser(userId) and iterates over them.
For the trigger side, I would use the Observer pattern (or events). The notification system subscribes to domain events — OrderPlaced, PasswordReset, SubscriptionExpiring — and each event type has a template and a channel configuration. The services that produce these events do not know about notifications at all.
For failure handling — and this is where the real complexity lives — each channel has different failure modes. Email is fire-and-forget (SMTP accepts it, but delivery is not guaranteed). SMS can fail with carrier errors. Push tokens expire. Slack webhooks return 429 rate limits. I would wrap each sender in a retry decorator with channel-specific retry policies: exponential backoff for rate limits, immediate dead-lettering for invalid tokens, and a separate failure queue for investigation. Each send attempt gets logged with a correlation ID back to the originating event.
The architecture ends up being: Domain Event -> Notification Service (subscriber) -> User Preference Lookup -> Factory produces Senders -> Strategy Senders execute -> Retry Decorator handles failures.”
Follow-up: How would you handle a user who has notifications enabled for email and SMS, but the SMS provider is down for 2 hours?
“This is where the difference between channel independence and channel coupling matters. Each channel should be independently deliverable — the failure of SMS should never block or delay email. So the notification service sends to each channel independently, not sequentially. If SMS fails, it goes into a retry queue with exponential backoff. Email succeeds immediately.
The harder question is: should we tell the user? If the notification is time-sensitive (two-factor auth code), an SMS failure is critical — I would fall back to email delivery of the code and log an operational alert. If it is informational (order shipped), the SMS will be delivered when the provider recovers, and the user already got the email. The notification service should track delivery status per channel per notification, so the support team can see ‘email: delivered, SMS: pending retry’ in the admin dashboard.
I would also implement a circuit breaker around the SMS sender. After N consecutive failures, stop attempting SMS and route all SMS notifications to a dead letter queue so they can be replayed once the provider recovers. This prevents hammering a downed provider and wasting resources on retries that will not succeed. When the circuit breaker trips, fire an alert to the on-call team.”
Follow-up: How does this design change if notification volume goes from 1,000/day to 10 million/day?
“At 10 million notifications a day, two things break: synchronous processing and single-instance sending.
First, the notification service must become asynchronous. Domain events go into a message queue (Kafka or SQS). Notification workers pull from the queue and process in parallel. This decouples the event production rate from the notification sending rate — if the email provider is slow, messages queue up rather than creating backpressure on the producing services.
Second, each channel needs its own scaling profile. Email might handle 10M/day easily through a bulk provider like SendGrid or SES with batching. SMS through Twilio has rate limits per account and per phone number — you need request batching, rate limiting on the sender side, and potentially multiple Twilio accounts. Push notifications can be sent in bulk: FCM accepts multicast sends to many device tokens per call (up to 500 in the Firebase Admin SDK), while APNs takes one notification per request but multiplexes many requests over a single HTTP/2 connection.
I would partition the notification queue by channel — one queue per channel type — so each can scale independently. The email consumer pool might be 5 workers, while the push notification pool might be 50 because of higher volume and lower per-message cost.
The other thing that changes at scale is template rendering. At 1,000/day, rendering a Handlebars template per notification is fine. At 10M/day, template rendering becomes a measurable cost. I would pre-compile templates, cache rendered output for notifications sent to many users with the same content, and consider separating template rendering from delivery as distinct pipeline stages.”
notifications_sent table keyed on (event_id, user_id, channel) with a unique constraint — the send is conditional on the insert succeeding. Second, propagate an idempotency key to the provider (SendGrid, Twilio, APNs) so even if your consumer calls twice, the provider suppresses the duplicate.
Q: A user disabled email but your consumer still tried to send — how did that happen?
A: Almost always a stale cache of user preferences. The preference service updated at T0, but the notification worker was still holding a 60-second-old cache. Fix: version the preferences, include the version in the notification request, and reject sends where the version is stale — the notification goes back to the queue and picks up fresh preferences.
- martinfowler.com/bliki/CircuitBreaker.html — Fowler’s explanation with state diagram.
- Courier (courier.com) engineering blog — production notification system design, multi-channel delivery trade-offs.
- AWS Architecture Blog — “Scalable multi-channel notifications with SNS + SQS + Lambda” — AWS-native reference architecture.
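
The sender interface, preference-backed factory, and retry decorator described in the Q3 answer can be sketched together. This is a minimal sketch under assumptions: `RecordingSender` is a hypothetical test double standing in for real SMTP, Twilio, or webhook clients, a single `recipient` string is used for every channel, and the names `RetryingSender`, `SenderFactory`, and `notify` are illustrative rather than taken from any library.

```python
import time
from typing import Callable, Dict, List, Protocol, Tuple


class TransientSendError(Exception):
    """A failure worth retrying (rate limit, carrier error, timeout)."""


class NotificationSender(Protocol):
    def send(self, recipient: str, message: str) -> None: ...


class RecordingSender:
    """Hypothetical channel double; fails the first `failures` calls."""

    def __init__(self, failures: int = 0) -> None:
        self.failures = failures
        self.sent: List[Tuple[str, str]] = []

    def send(self, recipient: str, message: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise TransientSendError("provider unavailable")
        self.sent.append((recipient, message))


class RetryingSender:
    """Decorator: same interface as the wrapped sender, adds exponential backoff."""

    def __init__(self, inner: NotificationSender, max_attempts: int = 3,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.inner = inner
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep

    def send(self, recipient: str, message: str) -> None:
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.inner.send(recipient, message)
                return
            except TransientSendError:
                if attempt == self.max_attempts:
                    raise  # retries exhausted; caller decides what happens next
                self.sleep(self.base_delay * 2 ** (attempt - 1))


class SenderFactory:
    """Maps a user's channel preferences to the senders to invoke."""

    def __init__(self, senders: Dict[str, NotificationSender],
                 preferences: Dict[str, List[str]]) -> None:
        self.senders = senders
        self.preferences = preferences

    def get_senders_for_user(self, user_id: str) -> List[Tuple[str, NotificationSender]]:
        return [(ch, self.senders[ch]) for ch in self.preferences.get(user_id, [])]


def notify(factory: SenderFactory, user_id: str, recipient: str, message: str) -> Dict[str, str]:
    """Send on every preferred channel; one channel failing never blocks another."""
    status: Dict[str, str] = {}
    for channel, sender in factory.get_senders_for_user(user_id):
        try:
            sender.send(recipient, message)
            status[channel] = "delivered"
        except TransientSendError:
            status[channel] = "pending retry"
    return status
```

The per-channel status dictionary mirrors the "email: delivered, SMS: pending retry" view mentioned in the follow-up. A production version would hand failed sends to a durable retry queue instead of only recording the status.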
Q4: What is the difference between Hexagonal Architecture and Clean Architecture? When does the distinction actually matter in practice?
Money value object changes once a year, but the CheckoutUseCase changes every sprint. Separating them into distinct layers prevents checkout churn from touching stable domain primitives.
What I would never do is get into a religious debate about which one is ‘correct.’ In my experience, most teams end up with a pragmatic hybrid — hexagonal’s ports-and-adapters as the outer boundary, with some internal layering of the core inspired by Clean Architecture. The principle is what matters, not the label.”
Follow-up: How do you enforce the dependency rule in practice? What happens when someone violates it?
“Enforcement has to be automated — code reviews catch violations sometimes, but not reliably. There are a few mechanisms:
At the build level, you can use module systems to enforce boundaries. In Java, module-info.java or ArchUnit tests can assert that core does not import from infrastructure. In TypeScript, you can use project references or eslint-plugin-import with zone restrictions. In .NET, separate assemblies enforce compile-time dependency direction — the Core project literally cannot reference the Infrastructure project.
At the CI level, I would add an architectural fitness function — an automated test that scans imports and fails the build if the core module has any import from an infrastructure package. This is cheaper than it sounds — it is usually a 20-line test.
When someone violates it — and they will, usually because they need ‘just one quick database call’ in the core — you have a teaching moment. The fix is usually to define a new port. If the core needs to send an email, it does not import SendGrid — it defines a NotificationSender port and the infrastructure layer provides a SendGridNotificationSender adapter. The violation is a signal that a port is missing, not that the rule is wrong.”
Follow-up: Can you have too many ports? When does the abstraction become a burden?
“Absolutely. I have seen codebases where every single external call — logging, metrics, time, random number generation — is behind a port. The team was so committed to hexagonal purity that reading the code required navigating 30 interfaces to understand a single request flow.
My heuristic: create a port when you need to swap the implementation for testing or for production flexibility. A port for the database? Yes — you will swap it for in-memory in tests. A port for the payment gateway? Yes — you will swap between Stripe and Adyen, and you need a fake for tests. A port for the system clock? Maybe — if your domain logic is time-dependent and you need deterministic tests, yes. A port for the logger? Almost never — your tests should not be asserting on log output, and you are never going to swap your logging library mid-project.
The tell that you have over-abstracted: if adding a simple feature requires creating an interface, an implementation, a test double, and updating the dependency injection wiring — and the entire flow is still just calling one thing — you have optimized for a flexibility that will never be exercised.”
PaymentGateway, OrderRepository). “Adapters” are the implementations that plug into those ports (StripePaymentGateway, PostgresOrderRepository). Use the phrase when explaining how the core stays infrastructure-agnostic.
main() function or application bootstrap). Pure constructor injection. DI frameworks make wiring easier at scale but aren’t required, and many teams are better served by the explicitness of manual wiring.
Q: What’s the worst misuse of these patterns you’ve seen?
A: Hexagonal applied to a CRUD service with zero domain logic. The “core” was 50 lines of validation. The ports, adapters, DI, and tests were 3,000 lines. The architecture outweighed the logic it was protecting by 60x. We collapsed it to a straightforward layered design and shipped features twice as fast.
- Alistair Cockburn — “Hexagonal Architecture” (alistair.cockburn.us) — the original paper.
- Robert C. Martin — “The Clean Architecture” (blog.cleancoder.com) — Uncle Bob’s concentric-rings formulation.
- Tom Hombergs — “Get Your Hands Dirty on Clean Architecture” — concrete Java implementation with ArchUnit boundary tests.
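
The port heuristic from the Q4 follow-ups (a port for the notification sender, and a clock port only because the logic is time-dependent) can be sketched as follows. This is a minimal sketch, assuming a hypothetical `ExpiryReminderUseCase`; the `Clock` port, `FixedClock`, and `InMemorySender` names are illustrative, and wiring is plain constructor injection with no DI framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Protocol, Tuple


# --- Ports: defined by the core, in the core's own vocabulary ---
class Clock(Protocol):
    def now(self) -> datetime: ...


class NotificationSender(Protocol):
    def send(self, recipient: str, message: str) -> None: ...


# --- Core: depends only on the ports above, never on infrastructure ---
@dataclass(frozen=True)
class Subscription:
    email: str
    expires_at: datetime


class ExpiryReminderUseCase:
    def __init__(self, clock: Clock, sender: NotificationSender,
                 window: timedelta = timedelta(days=3)) -> None:
        self.clock = clock
        self.sender = sender
        self.window = window

    def run(self, subscriptions: List[Subscription]) -> int:
        """Remind every subscriber whose plan expires within the window."""
        now = self.clock.now()
        reminded = 0
        for sub in subscriptions:
            if now <= sub.expires_at <= now + self.window:
                self.sender.send(sub.email, "Your subscription expires soon")
                reminded += 1
        return reminded


# --- Adapters: test doubles here; production adapters would wrap real services ---
class FixedClock:
    def __init__(self, instant: datetime) -> None:
        self.instant = instant

    def now(self) -> datetime:
        return self.instant


class InMemorySender:
    def __init__(self) -> None:
        self.outbox: List[Tuple[str, str]] = []

    def send(self, recipient: str, message: str) -> None:
        self.outbox.append((recipient, message))
```

Because the core asks the `Clock` port for the time, a test can pin "now" to a fixed instant and assert exactly which subscriptions are reminded. The logger is deliberately not behind a port, in line with the heuristic above.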
Q5: You have a service that writes to a database and then publishes an event to Kafka. Sometimes the event gets lost. Diagnose the problem and propose a solution.
Follow-up: Debezium CDC vs. a polling relay — which would you choose and why?
“For most teams, I would start with a polling relay because it is operationally simpler. It is a cron job or a loop that runs every 500ms, queries SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100, publishes each to Kafka, and marks them as published. You can build it in an afternoon, and it is easy to understand and debug.
Debezium CDC is more powerful — it reads the database’s write-ahead log (WAL) directly, so there is no polling delay, no missed events, and it handles higher throughput without the overhead of repeated queries. But it is also more operationally complex: you need to run a Debezium connector (usually in Kafka Connect), manage its state (offsets into the WAL), handle schema changes, and monitor a new piece of infrastructure.
My decision framework: if your event throughput is under 1,000 events per second and latency of 500ms-1s is acceptable, use the poller. Above that, or when you need sub-100ms latency between write and publish, use Debezium. If you are already running Kafka Connect for other purposes, the marginal cost of adding a Debezium connector is low — go with CDC.”
Follow-up: The outbox table is growing and your DBA is concerned. How do you manage it?
“The outbox table is a temporary staging area, not a permanent store. Published rows should be cleaned up aggressively. I would set up a cleanup job that deletes (or moves to a cold-storage archive table) all outbox rows where published = true and created_at is older than a retention window — 24 hours is usually more than enough, or even 1 hour if you are confident in your relay.
If the DBA’s concern is about the rate of growth even for unpublished rows — which means the relay is not keeping up — that is a different problem. Either the relay is lagging (scale it up, add parallelism by partitioning the outbox by aggregate type), or Kafka is unavailable and rows are accumulating. In the latter case, the outbox is doing exactly what it should — buffering events during a downstream outage — and the fix is to resolve the Kafka issue, not to truncate the outbox.
For long-term hygiene, I would also add a monitoring alert on outbox lag: if there are unpublished rows older than 5 minutes, something is wrong with the relay or the broker. This is your early warning system before the table grows to problematic size.”
Going Deeper: Can you achieve the same guarantees without the outbox pattern — for example, by publishing the event first and then writing to the database?
“Interesting question. Flipping the order — publish to Kafka first, then write to the database — does not solve the problem; it just reverses the failure window. Now if the Kafka publish succeeds but the database write fails, you have published an event for a change that never happened. Consumers will react to a phantom event.
There is an alternative approach, usually called transaction log tailing: the service writes to the database, then a CDC stream from the database becomes the event source. The service does not publish to Kafka directly at all — Debezium captures every committed write and streams it to Kafka. This is similar to the outbox pattern but without the explicit outbox table — the ‘outbox’ is the database table itself. The downside is that you lose control over the event shape — CDC captures row-level changes, not domain events. You can mitigate this by having the CDC stream feed into a transformation layer that converts row changes into properly-shaped domain events, but that adds complexity.
Another option is Kafka Transactions, which provide exactly-once semantics within Kafka. But this only helps if both your read and write are within Kafka — it does not help with the database-to-Kafka dual-write problem. The outbox pattern remains the most practical solution when your source of truth is a relational database and you need to publish events to a message broker.”
- microservices.io/patterns/data/transactional-outbox.html — Chris Richardson’s canonical explanation.
- debezium.io/documentation — Debezium’s outbox event router docs are the reference implementation.
- martinfowler.com — “Patterns of Distributed Systems: Outbox” — Fowler’s treatment with sequence diagrams.
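
The outbox write and the polling relay discussed in Q5 can be sketched with the standard-library `sqlite3` module. This is a minimal sketch under assumptions: the schema and the `place_order` and `relay_once` helpers are hypothetical, the `publish` callable stands in for a Kafka producer, and rows are ordered by an auto-incrementing `id` rather than the `created_at` column mentioned above.

```python
import json
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: int, total_cents: int) -> None:
    # One local transaction: the order row and its event commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)",
                     (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, dict], None],
               batch_size: int = 100) -> int:
    """One polling pass: publish unpublished rows in order, then mark them published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried next pass.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Note the delivery guarantee this gives: if the process crashes after `publish` succeeds but before the `UPDATE` commits, the event is published again on the next pass. The relay is at-least-once, so consumers still need to be idempotent.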
Q6: A team lead proposes using Event Sourcing for a new e-commerce product catalog service. How do you evaluate this decision?
price_changes table with timestamps is orders of magnitude simpler than a full event-sourced system.
The costs of event sourcing for a catalog service would be significant. Event schema evolution: when the product model changes (add a new field, restructure categories), you need to handle multiple event versions. Every query needs a projection — ‘show me all products in category X under $50’ requires a dedicated read model, maintained and kept in sync. Debugging becomes harder — ‘why does this product show the wrong price?’ means replaying events rather than inspecting a row. And the team needs to learn event sourcing patterns, which has a real ramp-up cost.
I would push back and recommend a standard relational model with a change audit table if history is needed. If the team lead’s underlying concern is about building different read models — a search-optimized view in Elasticsearch, a recommendation-engine view in a graph database — CQRS without event sourcing handles that cleanly. Write to PostgreSQL, project to Elasticsearch via CDC, no event sourcing required.
The one scenario where I would reconsider: if this catalog feeds into a marketplace where pricing disputes, seller audit trails, or regulatory compliance require provable change history. In that case, event sourcing starts to make sense — but I would scope it to the pricing subdomain, not the entire catalog.”
Follow-up: The team lead argues that event sourcing gives them the flexibility to build new projections from historical data in the future. How do you respond?
“That is the most compelling argument for event sourcing, and it is not wrong in the abstract. The ability to replay history and build views you did not anticipate is genuinely powerful. But there is a cost-benefit analysis to make.
The cost is ongoing: event schema evolution, projection maintenance, operational complexity, developer cognitive overhead — every day, for the entire lifetime of the system. The benefit is speculative: we might need new projections from historical data at some unspecified future date.
My counterargument: if you are not event-sourcing, you can still capture change events in a change log or append-only table. It is not as clean as a purpose-built event store, but it gives you 80% of the ‘replay’ benefit at 20% of the cost. If the day comes when you genuinely need full event sourcing, you can migrate to it — and you will be migrating with a clear understanding of why, rather than bearing the cost from day one on a speculative benefit.
In engineering, the pattern that wins is usually the simplest one that solves today’s real problems while leaving the door open for tomorrow’s. YAGNI — You Aren’t Gonna Need It — applies to architectural patterns as much as it applies to features.”
Follow-up: Name a real scenario where event sourcing is clearly the right choice, and explain why alternatives fall short.
“A bank’s transaction ledger. The account balance is derived state — it is the sum of all credits and debits. You need to answer ‘what was this account’s balance at close of business on March 15?’ You need an immutable, tamper-evident record of every transaction for regulatory compliance. You need to be able to reconstruct the state from scratch if a projection has a bug. And you need to build new financial reports retroactively — the compliance team asks for a report that did not exist when the transactions happened.
A simple ‘current state + change log’ approach falls short here because the change log is the source of truth, not the current state. The balance is always derived. If your change log and your current state table disagree, the change log wins. That is event sourcing by definition — you have just given it a different name. The event store gives you ordering guarantees, concurrency control (optimistic concurrency via expected version), and purpose-built infrastructure for replaying and projecting.
Other strong candidates: a collaborative document editor (operational transformation events are the history), a regulatory system for pharmaceutical trials (every data change must be provably traceable), or an insurance claims system (the full chain of events — filed, assessed, approved, paid, disputed — is the business logic, not a side effect of it).”
changes audit table populated by database triggers or CDC. Every state-mutating operation writes a row with old_value, new_value, actor, timestamp. You get 80% of the audit benefit at 10% of the cost. The remaining 20% — retroactive projections, arbitrary temporal queries — is what event sourcing uniquely provides.
- martinfowler.com/eaaDev/EventSourcing.html — Fowler’s canonical definition.
- Greg Young — “CQRS Documents” (cqrs.files.wordpress.com) — the practitioner’s guide from one of the pattern’s originators.
- Stripe Engineering Blog — “Online migrations at scale” — discusses ledger design and why immutability matters for financial systems.
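
The ledger example from Q6 (balance as derived state, point-in-time queries, and optimistic concurrency via an expected version) can be sketched in memory. This is a minimal sketch under assumptions: the `Ledger` class, its method names, and the signed-cents convention are hypothetical, and a real event store would persist events durably.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Transaction:
    """One immutable ledger event; positive cents are credits, negative are debits."""
    on: date
    amount_cents: int


class ConcurrencyError(Exception):
    """Raised when a writer appends against a stale version of the stream."""


class Ledger:
    def __init__(self) -> None:
        self._events: List[Transaction] = []

    @property
    def version(self) -> int:
        # The version is simply the number of events appended so far.
        return len(self._events)

    def append(self, event: Transaction, expected_version: int) -> None:
        # Optimistic concurrency: reject the write if someone else appended first.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self._events.append(event)

    def balance(self, as_of: Optional[date] = None) -> int:
        # The balance is never stored; it is derived by replaying events.
        return sum(
            e.amount_cents for e in self._events if as_of is None or e.on <= as_of
        )
```

Answering "what was the balance at close of business on March 15?" is the same replay with a date filter, which is the kind of temporal query a current-state table cannot answer on its own.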
Q7: You are building a system where a single API request must coordinate across three microservices. One architect proposes choreography, another proposes orchestration. How do you decide?
OrderPlaced triggers PaymentCharged which triggers InventoryReserved can work as choreography. Each service is autonomous, there is no single point of coordination, and the coupling is minimal.
But choreography degrades badly with complexity. The moment you have conditional branching (‘if payment fails, do X; if inventory fails, do Y; if both have already succeeded but shipping fails, undo both’), the compensating logic is distributed across multiple services, and no single place shows you the full workflow. Debugging ‘what went wrong with order 12345?’ requires correlating events across three service logs. Adding a new step means modifying the event chain, which is harder to reason about than adding a step to an orchestrator.
Orchestration shines when the flow has 4+ steps, complex failure handling, conditional branching, or when the business frequently changes the workflow. The orchestrator is a single, readable state machine: step 1, step 2, if step 2 fails then compensate step 1, step 3, done. You can look at one file and understand the entire flow. You can monitor saga state — ‘how many orders are stuck in step 3?’ — with a simple query. The cost is that the orchestrator is a piece of infrastructure you must maintain, and it creates a logical coupling point (though not a runtime single point of failure if designed well).
For the three-service coordination described in this question, I would default to orchestration unless the flow is genuinely simple with no conditional compensation. The operational clarity is worth the overhead.”
Follow-up: How do you prevent the orchestrator from becoming a God service?
“The key rule is: the orchestrator coordinates, it does not contain business logic. It tells services what to do and in what order, but it does not decide how they do it. The orchestrator says ‘charge this customer $100’; it does not decide what the amount should be. If the orchestrator starts to contain business rules (‘if the order is over $100 and they are a premium member, then apply a discount before charging’), that logic belongs in a domain service, not the orchestrator.
Scope-wise, one orchestrator should manage one business workflow. The CheckoutOrchestrator manages checkout. The RefundOrchestrator manages refunds. If you have one WorkflowOrchestrator that handles checkout, refunds, subscription renewal, and inventory reconciliation, you have a God service. Each saga gets its own orchestrator with its own state machine.
Tooling helps too. Temporal, Step Functions, and Conductor all provide orchestration frameworks that enforce this separation — the workflow definition is declarative, and the business logic lives in the activity implementations (which are the services themselves).”
Follow-up: Have you evaluated Temporal or AWS Step Functions for orchestration? What are the trade-offs?
“I have used both. They solve the same fundamental problem — durable workflow orchestration with automatic retry, state persistence, and visibility — but they differ in important ways.
AWS Step Functions is managed and serverless. You define workflows as JSON state machines (or CDK constructs). The execution history, retries, and state persistence are handled for you. The trade-off is vendor lock-in (your workflow definition is AWS-specific), limited expressiveness for complex conditional logic (the Amazon States Language is awkward for deeply nested branching), and cost at high volume — Step Functions charges per state transition, which adds up quickly for workflows with many steps running millions of times.
Temporal is open-source (with a hosted option) and lets you write workflows in real programming languages — Go, Java, TypeScript, Python. Your workflow is actual code with loops, conditionals, and error handling, not a JSON state machine. This is dramatically more expressive for complex flows. It is also portable — no vendor lock-in. The trade-off is operational: you need to run the Temporal server (or pay for Temporal Cloud), manage its database, and handle upgrades. The learning curve is steeper because of Temporal’s determinism requirements — your workflow code must be deterministic since it is replayed.
My recommendation: if you are in AWS and your workflows are simple (under 10 steps, minimal branching), Step Functions is the pragmatic choice — zero operational overhead. For complex, long-running workflows with significant business logic in the orchestration (multi-day approval chains, complex compensations, human-in-the-loop steps), Temporal is worth the operational investment.”
- docs.temporal.io — Temporal’s documentation on durable execution is the best treatment of orchestration tradeoffs available.
- AWS Step Functions Developer Guide — concrete state-machine patterns and cost model.
- Sam Newman — “Building Microservices, 2nd Edition” — Chapter on sagas covers orchestration vs choreography with honest tradeoffs.
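
The orchestrator described in Q7 ("step 1, step 2, if step 2 fails then compensate step 1") can be sketched as a small in-process saga runner. This is a minimal sketch under assumptions: `Step` and `run_saga` are hypothetical names, the actions are plain callables standing in for service calls, and nothing is persisted. A durable orchestrator such as Temporal or Step Functions would also survive process crashes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[], None]       # the forward call to a service
    compensate: Callable[[], None]   # undoes the action if a later step fails


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"{step.name}: failed")
            for done in reversed(completed):
                done.compensate()
                log.append(f"{done.name}: compensated")
            return False, log
        completed.append(step)
        log.append(f"{step.name}: ok")
    return True, log
```

The runner only sequences and compensates. Pricing rules, discount logic, and similar decisions stay inside the services behind each `action`, which is what keeps the orchestrator from turning into a God service.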
Q8: Explain the Decorator pattern. Now tell me why middleware in Express.js or Django is the same pattern — and where the analogy breaks down.
MetricsDecorator(CachingDecorator(LoggingDecorator(RealService))). Each layer adds one concern, and they compose independently.
Express middleware follows this exact structure. Each middleware function receives (req, res, next), does something (authentication, logging, CORS headers), and calls next() to pass control to the next middleware. The ‘wrapped object’ is the next middleware in the chain, and the final handler is the real service. Each middleware can modify the request, modify the response, short-circuit the chain (by not calling next()), or add behavior after next() returns (by putting code after the next() call).
Django middleware is similar — it wraps the view function. Each middleware has process_request (before) and process_response (after), forming layers around the actual view.
Where the analogy breaks down is in interface conformity. In the classic Decorator pattern, every decorator implements the same interface as the object it wraps. CachingRepository has the same methods as Repository. This is what makes decorators composable and interchangeable. Middleware does not strictly follow this — each middleware has access to the full request and response objects, not a narrowly defined interface. A logging middleware and an auth middleware do not implement the same ‘business interface.’ They share a pipeline contract (next()), not a domain interface.
The other break is in bidirectionality. Classic decorators wrap a single method call — the decorator calls the wrapped object and optionally modifies the result. Middleware is bidirectional — it can modify both the inbound request and the outbound response. Some middleware does work on the way in (auth checks), some on the way out (compression, response headers), and some both. This is more like the Chain of Responsibility pattern with decoration characteristics than a pure Decorator.”
Follow-up: When would you use a Decorator over middleware, even if both are available in your framework?
“Decorators are object-scoped; middleware is request-scoped. If I want to add caching to a specific repository and not all repositories, a CachingRepositoryDecorator wrapping that one repository is precise. Middleware would apply to all requests through that endpoint, which is too broad.
I use middleware for cross-cutting concerns that apply uniformly to the request pipeline: authentication, request logging, CORS, rate limiting. These apply to all (or most) requests regardless of which service or repository handles them.
I use decorators for behavior that targets specific objects: caching a specific repository, adding metrics to a specific service, retry logic around a specific external call. The decorator composes at the object level, giving you fine-grained control.”
Follow-up: What happens when you have 7 decorators stacked on an object and something goes wrong? How do you debug it?
“This is the real-world pain point of the Decorator pattern. The stack trace shows MetricsDecorator.findById -> RetryDecorator.findById -> CachingDecorator.findById -> LoggingDecorator.findById -> CircuitBreakerDecorator.findById -> TimeoutDecorator.findById -> AuthDecorator.findById -> ActualRepository.findById. Figuring out which layer introduced a bug or unexpected behavior is a nightmare.
My approach: first, I would question whether 7 decorators is appropriate. In my experience, beyond 3-4 layers, the debugging cost outweighs the composition benefit. At that point, I would consider a pipeline pattern instead — an explicit, ordered list of steps where each step has a name and can be individually toggled, monitored, and debugged.
For the immediate debugging problem, I would add structured logging at each decorator boundary — entry, exit, and duration — with a correlation ID. This turns the opaque stack into a visible pipeline that follows the same outer-to-inner order as the stack trace: ‘MetricsDecorator: started -> RetryDecorator: first attempt -> CachingDecorator: cache miss -> LoggingDecorator: logged -> CircuitBreakerDecorator: passed in 0ms -> TimeoutDecorator: passed in 0ms -> AuthDecorator: passed in 2ms -> ActualRepository: returned in 45ms’. Now you can see exactly where the flow went and where it went wrong.”
next()).
A(B(C(D(real)))) — which is hard to reason about.
Q: What’s the biggest runtime cost of deep decorator chains?
A: In most languages, stack depth and virtual method dispatch — usually negligible. The real cost is cognitive: stack traces become noise and bug isolation takes longer. If your chain is 7+ layers deep, debuggability is the bottleneck, not performance.
- refactoring.guru/design-patterns/decorator — visual Decorator pattern walkthrough.
- expressjs.com/en/guide/writing-middleware.html — canonical middleware reference.
- Robert C. Martin — “Agile Software Development: Principles, Patterns, and Practices” — chapter on Decorator with Open/Closed Principle framing.
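
The interface-conformity point from Q8 (a decorator exposes the same methods as the object it wraps, so layers can be stacked in any order) can be sketched as follows. This is a minimal sketch under assumptions: the `Repository` protocol and the `InMemoryRepository`, `CachingRepository`, and `LoggingRepository` classes are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional, Protocol


class Repository(Protocol):
    def find_by_id(self, item_id: int) -> Optional[str]: ...


class InMemoryRepository:
    """The 'real' object being decorated; counts calls so caching is observable."""

    def __init__(self, rows: Dict[int, str]) -> None:
        self.rows = rows
        self.calls = 0

    def find_by_id(self, item_id: int) -> Optional[str]:
        self.calls += 1
        return self.rows.get(item_id)


class CachingRepository:
    """Decorator: same interface, answers repeat lookups without calling inward."""

    def __init__(self, inner: Repository) -> None:
        self.inner = inner
        self.cache: Dict[int, Optional[str]] = {}

    def find_by_id(self, item_id: int) -> Optional[str]:
        if item_id not in self.cache:
            self.cache[item_id] = self.inner.find_by_id(item_id)
        return self.cache[item_id]


class LoggingRepository:
    """Decorator: same interface, records entry and exit around the inner call."""

    def __init__(self, inner: Repository, log: List[str]) -> None:
        self.inner = inner
        self.log = log

    def find_by_id(self, item_id: int) -> Optional[str]:
        self.log.append(f"find_by_id({item_id}) -> enter")
        result = self.inner.find_by_id(item_id)
        self.log.append(f"find_by_id({item_id}) -> exit")
        return result
```

Wrapping a single repository as `LoggingRepository(CachingRepository(real), log)` affects only that object, which is the object-scoped precision the answer contrasts with request-scoped middleware.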
Q9: Your company has a monolith with 80 engineers. The CTO wants microservices. The VP of Engineering wants a modular monolith. You are the staff engineer asked to make the recommendation. Walk me through your analysis.
Follow-up: The CTO pushes back and says ‘but we need to be able to hire quickly, and microservices let teams work independently.’ How do you respond?
“The CTO’s concern is legitimate — team independence at 80 engineers is a real problem. But I would challenge the assumption that microservices are the only way to achieve it.
A modular monolith with enforced boundaries gives you team ownership of modules. Team A owns the Checkout module, Team B owns the Catalog module. They can develop independently within their modules. The enforcement tooling (Packwerk, ArchUnit, or even simple module-level dependency rules in CI) prevents cross-module coupling.
What a modular monolith does not give you is independent deployment — and that is the CTO’s strongest argument. If Team A’s change breaks the build, Team B cannot deploy either. This is real friction. My counter: invest in a good CI/CD pipeline with per-module testing. If the Checkout module’s tests pass, deploy. If the Catalog module’s tests fail, that should not block the Checkout deploy. This is harder to implement than microservices’ inherent independence, but it does not require the full distributed systems infrastructure.
I would also present data. Ask the CTO: ‘In the last 3 months, how many times has deployment coordination been the bottleneck vs. how many times has domain complexity been the bottleneck?’ If the answer is that deployments are blocked weekly because of cross-team coupling, extracting the most-coupled, most-deployed modules into microservices is warranted. If the answer is that the real pain is understanding the tangled codebase, microservices will not fix that — they will just distribute the tangle.”
Follow-up: Two years into the modular monolith, teams start violating module boundaries because it is ‘easier.’ How do you prevent this?
“This is the modular monolith’s Achilles heel, and it requires ongoing investment, not a one-time setup. Boundaries must be enforced technically, not just culturally.
First, automated enforcement in CI. Every PR must pass a boundary check. If an import in the Checkout module references an internal class in the Catalog module, the build fails. Tools like Packwerk (Ruby), ArchUnit (Java), import-linter (Python), or custom eslint rules (TypeScript) do this. The key is that it runs in CI, not just as a local lint: you cannot merge a boundary violation.
Second, ownership. Each module has an explicit owning team. Cross-module changes require approval from both teams. This is a code-review rule, but it must be enforced in the PR tool (GitHub CODEOWNERS, GitLab approval rules).
Third, public API enforcement. Each module exposes a public API (a set of interfaces or facade classes) and everything else is internal. The static analysis tool verifies that cross-module access only goes through the public API.
Fourth, education and visibility. Run a monthly ‘boundary health’ report that shows which modules have the most violations, which teams are the most frequent violators, and the trend over time. Make it visible. Nobody wants to be on the ‘most boundary violations’ list.
If despite all this, a team consistently circumvents boundaries because the module’s public API does not meet their needs, that is a signal that the API is missing a method, not that the boundary is wrong. Treat boundary violations as API gap indicators, not as evidence that modularity does not work.”
import/no-restricted-paths). Each module has a public API; cross-module access is limited to that API. You get team ownership and clean boundaries without the distributed systems tax.
- shopify.engineering: “Deconstructing the Monolith” and follow-up modular monolith posts are the reference material.
- Sam Newman, “Monolith to Microservices”: explicit about when not to migrate.
- martinfowler.com/bliki/MonolithFirst.html: Fowler’s case for monolith-first.
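The CI boundary check described in Q9 can be pictured with a small sketch. It is not how Packwerk or ArchUnit work internally; it only illustrates the idea using Python's standard `ast` module. The package layout (`app.<module>.api` as the only public surface) and the function name `boundary_violations` are assumptions made for this example.

```python
import ast


def boundary_violations(source: str, owner_module: str, root: str = "app") -> list[str]:
    """Return imports in `source` that reach into another module's internals.

    Assumed (hypothetical) layout: every module lives at `<root>.<module>` and
    only `<root>.<module>.api` may be imported from outside that module.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [f"{node.module}.{alias.name}" for alias in node.names]
        elif isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        else:
            continue  # not an absolute import statement
        for target in targets:
            parts = target.split(".")
            if len(parts) < 2 or parts[0] != root:
                continue  # stdlib or third-party import
            if parts[1] == owner_module:
                continue  # a module may use its own internals
            if parts[2:3] != ["api"]:
                violations.append(target)
    return violations
```

In a pipeline, a script along these lines would run over every changed file and exit non-zero when the returned list is not empty, which is what turns the boundary from a convention into a merge gate.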
Q10: Compare the Strangler Fig pattern with a big-bang rewrite. When might a rewrite actually be justified?
Follow-up: You choose the Strangler Fig approach. Six months in, the monolith is only 30% migrated, and leadership is frustrated with the pace. What do you do?
“First, I would reframe expectations. The first 30% takes the longest because you are building the migration infrastructure: the routing layer, the CDC pipeline, the deployment playbooks, the monitoring. Each subsequent extraction should be faster. Show the trend: ‘The first module took 6 weeks, the second took 3, the third took 2. At this rate, we will be 80% done in another 4 months.’
Second, I would question whether 100% migration is the right goal. Maybe 70% of the monolith handles critical, high-change business logic that benefits from independent deployment and scaling. The other 30% is stable utility code that nobody touches. Migrating that 30% has high cost and low value. Declare those modules ‘done’ in the monolith and focus migration effort on the modules that actually cause pain.
Third, I would look at what is actually slow. Is it the extraction work itself? The data migration? The testing? The organizational decision-making? Often, the bottleneck is not the technical migration but the cross-team coordination: ‘we cannot migrate the Order module because Team X still has 3 critical features planned that depend on direct database access.’ If that is the case, the fix is organizational (freeze new monolith features for that module), not technical.
Fourth, I would make sure we are celebrating the wins. The 30% that has been migrated: is it faster? Is it independently scalable? Are those teams deploying more frequently? Show leadership the concrete improvements from the migrated services to justify the continued investment.”
Going Deeper: What is your strategy for handling the data migration aspect of a Strangler Fig migration?
“Data migration is consistently the hardest part, and I have seen three strategies work in practice.
The first and simplest: shared database during transition. The new service reads and writes the same database tables as the monolith. This is fast to implement and avoids data duplication, but it couples the services at the database layer. I use this as a stepping stone: get the service running independently with its own code and deployment, then tackle data migration as a separate phase.
The second: CDC-based synchronization with Debezium. The monolith continues to own the database. Debezium streams changes from the monolith’s tables to the new service’s own database. The new service reads from its local copy. This is unidirectional: the new service does not write to the monolith’s database. The tricky part is the cutover: when you switch the new service from reading replicated data to being the primary owner, you need to reverse the CDC direction and update the monolith to stop writing to those tables.
The third: event-driven data migration. If the monolith publishes domain events (or you add event publishing to the monolith as a pre-migration step), the new service builds its own data store by consuming events. This is the cleanest long-term approach because it establishes the event-driven integration pattern that the microservices will use permanently. But it requires the monolith to publish events, which can be a significant change to a legacy system.
My recommendation: start with the shared database to unblock the service extraction, then migrate to CDC-based sync in a second phase. If you are building toward an event-driven architecture, invest in event publishing from the monolith early; it pays dividends across all future extractions.”
- martinfowler.com/bliki/StranglerFigApplication.html: Fowler’s pattern description.
- Joel Spolsky, “Things You Should Never Do, Part I” (joelonsoftware.com): the canonical anti-rewrite essay.
- Chris Richardson, “Microservices Patterns”: Chapter 13 on migrating monoliths with extensive Strangler Fig treatment.
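The "routing layer" that Q10 counts as part of the migration infrastructure can be sketched roughly as follows. This is an illustration under assumptions, not a production proxy: the class name `StranglerRouter`, the backend URLs, and routing by path prefix are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Decides per request whether a path goes to a new service or the monolith."""

    legacy_backend: str
    migrated: dict[str, str] = field(default_factory=dict)  # path prefix -> backend

    def migrate(self, prefix: str, backend: str) -> None:
        """Start sending traffic for `prefix` to the extracted service."""
        self.migrated[prefix] = backend

    def rollback(self, prefix: str) -> None:
        """Send traffic for `prefix` back to the monolith."""
        self.migrated.pop(prefix, None)

    def backend_for(self, path: str) -> str:
        # Match whole path segments only, so "/orders" does not capture "/ordersX".
        matches = [
            prefix
            for prefix in self.migrated
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            return self.legacy_backend  # everything not yet extracted
        # The longest prefix wins, so "/orders/export" can move before "/orders".
        return self.migrated[max(matches, key=len)]
```

Because rollback is a single dictionary change, each extraction can be reversed quickly, which is a large part of what makes the incremental approach lower risk than a big-bang rewrite.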
Q11: You see a 300-line switch statement in production code. Walk me through how you would evaluate whether and how to refactor it.
handlers.get(caseType).handle(input). Adding a new case means adding a new class and one entry in the map. No existing code changes.
But here is the key: I would only extract the cases that are actively changing. If 20 of the 30 cases have not changed in a year, I might leave them in a simplified switch or a default handler and only extract the volatile ones. Partial refactoring is fine; you do not have to pattern-ify everything.”
Follow-up: The switch statement dispatches on a string type field that comes from user input. What risks do you see?
“Dispatching on raw user input is a security and stability concern. If the type field is ‘premium’ or ‘basic’ and comes directly from a request body, an attacker can send ‘admin’ or ‘internal’ and potentially reach code paths they should not. Even if there is a default case that handles unknowns, the fact that user input drives control flow is a code smell.
I would validate and normalize the input before it reaches the dispatch logic. Map the raw string to an enum or a validated type at the boundary, in the controller or the input parser. The switch should operate on a validated enum, not a raw string. This also prevents typo-related bugs (‘premum’ instead of ‘premium’ silently hitting the default case).
With the Strategy pattern refactoring, this becomes a lookup in a map with explicit keys. If the key does not exist in the map, you get a clear ‘unknown type’ error rather than falling through to a default case that might silently do the wrong thing.”
Follow-up: A colleague says ‘a switch statement is fine, the Strategy pattern is over-engineering.’ How do you respond?
“They might be right. And I would not dismiss that perspective, because over-engineering is a real problem that is just as costly as under-engineering. Here is my decision framework.
If the switch has 2-3 cases and rarely changes, a switch is absolutely fine. The Strategy pattern for two cases is over-engineering: you have an interface, two classes, a factory or map, and DI wiring, all to avoid a 20-line if-else. The cure is worse than the disease.
The inflection point, in my experience, is around 5-7 cases that change independently, or 3+ cases when different developers frequently modify different cases and create merge conflicts. At that point, the switch forces you to reason about all cases simultaneously, and the independent classes of the Strategy pattern pay for themselves in isolation and testability.
I would tell my colleague: ‘You are right that a switch is simpler for a small, stable set of cases. Let us keep it until we hit the threshold where the maintenance cost justifies the abstraction.’ The pragmatic approach is to set a concrete trigger: ‘If we add a fourth payment method or if we get another merge conflict in this switch, we refactor.’ That way, both of us have a clear decision point rather than an ideological argument.”
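To make the Q11 discussion concrete, here is a minimal sketch of map-based dispatch behind a validated enum. It uses plain functions where the answer talks about handler classes, purely to keep the example short. The plan types, the discount, and every function name are hypothetical.

```python
from enum import Enum
from typing import Callable


class PlanType(Enum):  # hypothetical case types
    BASIC = "basic"
    PREMIUM = "premium"


def parse_plan_type(raw: str) -> PlanType:
    """Validate user input at the boundary; unknown values fail loudly."""
    try:
        return PlanType(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown plan type: {raw!r}") from None


def price_basic(amount: float) -> float:
    return amount


def price_premium(amount: float) -> float:
    return round(amount * 0.9, 2)  # illustrative 10% discount


# Adding a case means adding one function and one entry here.
HANDLERS: dict[PlanType, Callable[[float], float]] = {
    PlanType.BASIC: price_basic,
    PlanType.PREMIUM: price_premium,
}


def price_for(raw_type: str, amount: float) -> float:
    """Dispatch on the validated enum, never on the raw string."""
    return HANDLERS[parse_plan_type(raw_type)](amount)
```

A request carrying `"admin"` or the typo `"premum"` raises a `ValueError` at the boundary instead of falling through to a default branch.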
Q12: Design a plugin system for a document editor where third-party developers can add new export formats (PDF, DOCX, HTML, Markdown) without modifying the core editor code.
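As a rough picture of the design this question is after, the sketch below shows one possible shape for the `ExportPlugin` contract and its registry, covering the read-only snapshot, lifecycle hooks, error isolation, and "first registered is the default" ideas discussed in the answer. Everything beyond the names `ExportPlugin`, `DocumentSnapshot`, `export`, `initialize`, and `dispose` (for example `PluginRegistry`, `ExportError`, and `MarkdownExport`) is an assumption made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentSnapshot:
    """Immutable view handed to plugins instead of the live editor document."""
    title: str
    body: str


class ExportError(Exception):
    """Raised by the registry so a plugin failure never crashes the editor."""


class ExportPlugin(ABC):
    format_name: str = ""
    capabilities: frozenset = frozenset({"sync-export"})

    def initialize(self) -> None:
        """Optional setup hook, for example loading fonts."""

    def dispose(self) -> None:
        """Optional cleanup hook, for example deleting temporary files."""

    @abstractmethod
    def export(self, document: DocumentSnapshot) -> bytes:
        """Return the exported document as bytes."""


class PluginRegistry:
    def __init__(self) -> None:
        self._plugins: dict[str, list[ExportPlugin]] = {}

    def register(self, plugin: ExportPlugin) -> None:
        plugin.initialize()
        # The first plugin registered for a format stays at index 0 (the default).
        self._plugins.setdefault(plugin.format_name, []).append(plugin)

    def export(self, format_name: str, document: DocumentSnapshot, choice: int = 0) -> bytes:
        plugins = self._plugins.get(format_name, [])
        if not plugins:
            raise ExportError(f"no plugin registered for {format_name!r}")
        try:
            return plugins[choice].export(document)
        except Exception as exc:  # error isolation: report a clean failure
            raise ExportError(f"{format_name!r} export failed: {exc}") from exc


class MarkdownExport(ExportPlugin):
    format_name = "markdown"

    def export(self, document: DocumentSnapshot) -> bytes:
        return f"# {document.title}\n\n{document.body}\n".encode("utf-8")
```

A try/except is only the minimum level of isolation; untrusted plugins would still need a separate process or worker, which this sketch does not attempt.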
ExportPlugin interface; this is the Strategy pattern. Each plugin implements:
ExportPlugin. In a web editor, plugins could be registered at build time via a configuration file, or loaded dynamically from a CDN with a plugin manifest.
Now, the hard parts that separate a toy design from a production system:
Versioning the plugin API. The ExportPlugin interface will evolve. Version 1 has export(document). Version 2 might add exportAsync(document, progressCallback). I would version the interface and have the registry check the plugin’s declared API version. Old plugins continue to work: the core wraps them in an adapter that provides default behavior for new methods. This is the Adapter pattern applied to backward compatibility.
Security. Third-party plugins receive a document object. They should receive a read-only view, not a mutable reference to the editor’s internal state. I would use a Facade or a DTO: the plugin gets a DocumentSnapshot with the content it needs for export, not the live document object with mutation methods.
Error isolation. A buggy plugin should not crash the editor. Each export call runs in a sandbox: a try-catch at minimum, ideally a separate process or worker thread for untrusted plugins. If the plugin throws, the editor catches it and shows the user a meaningful error, not a crash.
Lifecycle hooks. Plugins might need initialization (loading fonts for PDF) or cleanup (releasing temporary files). The interface needs initialize() and dispose() lifecycle methods. The registry manages the lifecycle.
The Observer pattern enters if plugins need to react to editor events, for example a ‘live preview’ plugin that updates whenever the document changes. The editor publishes DocumentChanged events, and plugins can optionally subscribe.”
Follow-up: How do you handle a plugin that takes 30 seconds to export a large document?
“This is a UX and architecture problem. The user should not stare at a frozen UI for 30 seconds.
I would make the export async by default. The ExportPlugin interface includes exportAsync(document, progressCallback) where the callback reports progress (0-100%). The editor shows a progress bar and keeps the UI responsive.
For the architecture, the export runs in a background worker: a Web Worker in a browser context, a child process in a desktop app. This ensures the plugin’s CPU-intensive work does not block the main thread. The worker communicates progress back via message passing.
I would also add a timeout. If a plugin does not complete within a configurable limit (say 60 seconds), the editor cancels it and shows an error. This prevents a buggy plugin from hanging the export indefinitely.
For particularly large documents, I would consider a streaming export API where the plugin processes the document in chunks rather than receiving the entire document at once. This reduces memory pressure and allows progress reporting at a more granular level.”
Follow-up: Two plugins register for the same format. How do you handle the conflict?
“This is a registry design decision with three options:
First, last-write-wins: the second plugin to register for ‘PDF’ overwrites the first. This is simple but surprising. A user installs a new PDF plugin and their old one silently stops being available.
Second, first-write-wins: the second registration is rejected. The plugin system logs a warning. This is safer but means a user cannot upgrade their PDF plugin without uninstalling the old one first.
Third, explicit resolution: the registry allows multiple plugins per format and the user chooses which to use. The export menu shows ‘Export as PDF (built-in)’ and ‘Export as PDF (AwesomePDF plugin).’ This is the most user-friendly approach and is what most mature plugin systems (VS Code extensions, browser extensions) do.
I would implement the third option with a ‘default’ concept: one plugin is marked as the default for each format, and the user can change the default in settings. The first registered plugin for a format becomes the default, and subsequent registrations appear as alternatives.”
Going Deeper: How would you design the plugin API so it can evolve over 5 years without breaking existing plugins?
“This is a contract management problem, and it is the same problem that public APIs face. I would apply three principles:
First, additive-only evolution. New versions of the interface add optional methods with default implementations. ExportPlugin v2 adds exportAsync, but plugins that only implement v1’s export still work because the registry wraps them in an adapter that calls export synchronously. You never remove or change the signature of existing methods.
Second, capability negotiation. Instead of version numbers, plugins declare capabilities: ['sync-export', 'async-export', 'progress-reporting', 'streaming']. The core checks capabilities before calling methods. This is more flexible than linear versioning because capabilities can be mixed and matched.
Third, a stable data contract. The DocumentSnapshot that plugins receive should be versioned separately from the plugin API. If the document model changes (for example, to support a new element type), old plugins receive a snapshot that omits the new element type; they see the document as they knew it. This requires maintaining backward-compatible serialization, which is work, but it is the only way to avoid breaking third-party plugins when the editor evolves.
The meta-principle: treat your plugin API like a public API with external consumers you cannot control. Assume every breaking change will break someone’s plugin and create angry users. This discipline (backward compatibility by default, capability negotiation, and adapter-based compatibility layers) is what lets a plugin ecosystem thrive over years.”
Advanced Interview Scenarios
These questions are designed to expose the gap between theoretical pattern knowledge and battle-tested production experience. Several have deliberately counterintuitive answers. Each one is built from the kind of incident, migration failure, or architectural misfire that shapes how experienced engineers think. If you can answer these well, you have been in the room when things went wrong.
Q13: Your team wrapped every third-party dependency behind an Adapter interface — Stripe, SendGrid, Twilio, AWS S3, even the logging library. Six months later, engineers are complaining the codebase is harder to work with, not easier. What went wrong?
putObject or getObject; the AWS SDK is already a reasonable interface. You have wrapped an interface with another interface that looks identical.
Stripe and SendGrid adapters, on the other hand, are probably justified. You genuinely might switch payment providers (Stripe to Adyen) or email providers (SendGrid to SES). The adapter protects hundreds of callsites from a vendor swap that could happen in the next 12 months.
The diagnostic I would run: for each adapter, ask two questions. First, what is the probability this dependency gets swapped in the next 2 years? If the answer is under 10%, the adapter is speculative insurance with a daily premium. Second, does the adapter’s interface add domain meaning beyond the SDK’s interface? If PaymentGateway.charge(amount, customer) is meaningfully simpler than stripe.charges.create({amount, customer}), the adapter is earning its keep. If StorageAdapter.upload(key, data) is identical to s3.putObject({Key: key, Body: data}), it is pure ceremony.
At Shopify’s scale, they learned this the hard way: they initially wrapped everything in adapters as part of their modular monolith push, then found that the ‘adapter tax’ on development velocity was real. They pulled back to adapting only the dependencies where the swap probability justified it. The heuristic they landed on: if the vendor name appears in business conversations about switching, adapt it. If it does not, do not.”
Follow-up: How do you convince a team that already has these adapters to remove some of them without it feeling like wasted work?
“Frame it as pruning, not failure. The team made a reasonable bet (‘we might swap any of these’) and now has 6 months of data showing which bets paid off. The Stripe adapter saved us during the Adyen evaluation. The logging adapter has never been touched. Removing the low-value adapters is learning, not waste.
Practically, I would propose a ‘keep/remove’ scorecard in a team retro. For each adapter: how many times was it useful for testing? For vendor evaluation? For changing behavior? If the answer is zero across all dimensions, it is a candidate for removal. I would not remove them all at once: delete one per sprint, see if anyone notices, build confidence.”
Follow-up: A principal engineer argues ‘but what if we need the adapter later; we’ll have to touch every callsite.’ How do you respond?
“The cost of adding an adapter later is a well-scoped refactoring task: introduce the interface, wrap the existing calls, done. Modern IDEs make this a 30-minute job for a focused dependency. The cost of maintaining an unnecessary adapter is paid every single day: every new engineer has to learn the indirection, every debugging session has to navigate through it, every PR touches two files instead of one.
The math is: daily cost × 365 days vs. one-time future cost × probability of actually needing it. For a dependency with a 5% swap probability, you are paying 365 days of friction to avoid a 30-minute refactoring that has a 1-in-20 chance of ever being needed. That is not a good trade.”
War Story: A fintech startup I advised had 14 adapter interfaces wrapping everything from Redis to their date-formatting library. They called it ‘hexagonal architecture.’ In practice, a new hire took 3 weeks to become productive because every operation required understanding the adapter layer, the DI container wiring, and the actual library underneath. When they measured, 11 of the 14 adapters had been created and never modified since. They removed 9, kept the ones for Stripe, their KYC provider, and their SMS gateway (the three dependencies that had actually been swapped or seriously evaluated in the prior year). Developer onboarding time dropped to 1.5 weeks.
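The cost comparison from Q13 (daily cost × 365 days against one-time cost × probability) can be worked through as a small calculation. The 30-minute refactoring and the 5% swap probability come from the answer; the figure of 2 minutes of daily friction is an assumption chosen only to make the arithmetic concrete, and the function name is hypothetical.

```python
def adapter_cost_comparison(
    daily_friction_minutes: float,
    retrofit_minutes: float,
    swap_probability: float,
    days: int = 365,
) -> tuple[float, float]:
    """Return (carrying cost, expected retrofit cost), both in minutes.

    carrying cost          = friction paid every day the adapter exists
    expected retrofit cost = one-time cost of adding the adapter later,
                             weighted by the chance it is ever needed
    """
    carrying_cost = daily_friction_minutes * days
    expected_retrofit_cost = retrofit_minutes * swap_probability
    return carrying_cost, expected_retrofit_cost


# Assumed: 2 minutes of friction per day. From the text: 30 minutes, 5%.
carrying, expected = adapter_cost_comparison(2, 30, 0.05)
print(f"carrying: {carrying} min/year, expected retrofit: {expected:.1f} min")
```

Under these assumptions the adapter costs about 730 minutes a year to carry against roughly 1.5 expected minutes saved. The gap narrows as the swap probability or the number of affected callsites grows, which is why the payment and email adapters come out differently from the logging one.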
Q14: You inherit a system that the previous team called 'microservices.' It has 12 services, but you cannot deploy any of them independently — every release requires coordinating all 12. Diagnose what happened and propose a fix.
SELECT DISTINCT application_name FROM pg_stat_activity; if I see 8 different service names connected to the same database, that is the smoking gun.
Coupling vector 2: Synchronous call chains. Service A calls B, which calls C, which calls D, all synchronous REST. Deploying a new version of D with a slightly different response shape breaks C, which breaks B, which breaks A. I would map the call graph by enabling distributed tracing (Jaeger) for one week and visualizing the dependency chains. At a prior company, we did this and discovered a single request to our API gateway triggered 23 synchronous inter-service calls. The ‘microservices’ were functioning as a single distributed function call.
Coupling vector 3: Shared libraries with domain logic. A common anti-pattern: a shared-models library that contains domain entities used by all 12 services. Updating the Order model in the shared library triggers a rebuild and redeploy of every service that depends on it. This is a monolith packaged as a library.
Coupling vector 4: Integration tests that test the whole system. If the CI pipeline runs end-to-end tests across all 12 services before any single service can deploy, you have re-created monolithic deployment through your test infrastructure.
The fix is incremental, not a second big-bang migration. I would score each coupling vector by severity (how often does it force coordinated deploys?) and tractability (how hard is it to fix?). Then attack the highest-severity, most-tractable one first.
For a shared database, the playbook is: identify which service is the true owner of each table (the one that does the most writes), have that service expose an API for the data, and migrate other services to call the API instead of querying the table directly. Use CDC to populate local read replicas if latency is a concern. This is painful, typically 2-3 months per table cluster, but it is the single most impactful fix.
For shared libraries, split the library into independent packages per domain concept. The Order model lives in order-models, the User model in user-models. Services depend only on the packages they need. Better yet, have each service define its own representation of external data (an anti-corruption layer) rather than sharing model classes.
For synchronous call chains, introduce async communication (events) for the non-critical paths. The Notification Service does not need to be called synchronously after an order is placed: publish an OrderPlaced event and let it react. Keep synchronous calls only where the caller genuinely needs an immediate response.”
Follow-up: You discover that 4 of the 12 services are just CRUD wrappers around database tables with no business logic. What do you do with them?
“This is the Entity Service anti-pattern. A UserService that only does getUser, createUser, updateUser is not a microservice; it is a database table with a network hop in front of it. These services add latency, operational overhead, and failure modes without providing any independence or encapsulation benefit.
I would merge them back. Not into a monolith, but into the services that actually contain the business logic that uses that data. If the Order Service is the primary consumer of product data and the Product Service is just a CRUD proxy, the product data belongs inside the Order Service’s bounded context. More precisely, the Order Service should own the product data it needs and the Product Service should be eliminated.
The counter-argument I would preempt: ‘but then two services need user data.’ Fine. Each service stores the user data it needs locally, populated via events. The User Profile Service publishes UserUpdated events. The Order Service stores the user’s name and shipping address. The Notification Service stores the user’s email and notification preferences. Each service owns its local copy. This is data duplication, but it eliminates the runtime coupling of a shared User Service.”
Follow-up: The team pushes back: ‘we have been working on these microservices for 18 months, and rolling them back feels like admitting failure.’ How do you handle the organizational dynamics?
“This is the sunk cost fallacy applied to architecture, and it is one of the hardest conversations in engineering leadership. I would not frame it as rolling back. I would frame it as graduating.
‘We built microservices to get independence and deployment velocity. The measurement shows we do not have those benefits yet because of coupling at the database and library level. The proposal is not to undo the work; it is to fix the coupling that is preventing us from getting the benefits we invested in. Some of that means merging services that should never have been separated. Some of it means decoupling services that are coupled at the database. The outcome is fewer, genuinely independent services, which is what we wanted all along.’
Data makes this conversation easier. I would present: average deploy lead time (from merge to production), number of services affected per deploy, incidents caused by cross-service coupling in the last quarter, and developer satisfaction survey results on deployment friction. Numbers turn an emotional ‘are we admitting failure?’ into a pragmatic ‘how do we improve these metrics?’”
War Story: A Series B startup (60 engineers) I consulted for had 18 ‘microservices’ that all shared a single PostgreSQL database with 200+ tables. They called them microservices because they deployed from separate repos. In practice, a column rename required updating 4 services, and their ‘independent deployment’ meant running 18 CI pipelines that all ran the same integration test suite against the shared database. We spent 6 months untangling: merged the 6 CRUD-only entity services back into the 3 domain services that used them, introduced event-based data replication for the remaining cross-service data needs, and moved to service-owned schemas within the same PostgreSQL instance as an intermediate step. Deployment lead time dropped from 4 hours (coordinated deploy of all services) to 25 minutes (independent deploy of any service). They went from 3 production deploys per week to 8 per day.
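The "each service stores the user data it needs locally, populated via events" idea from Q14 can be sketched as below. The event shape, the `version` field used to discard stale deliveries, and the class name `OrderServiceUserView` are assumptions for illustration; the text only names the UserUpdated event and the fields the Order Service keeps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserUpdated:  # hypothetical event shape
    user_id: str
    name: str
    email: str
    shipping_address: str
    version: int  # assumed monotonically increasing per user


class OrderServiceUserView:
    """The Order Service's own copy of user data: only the fields it needs."""

    def __init__(self) -> None:
        self._users: dict[str, dict] = {}

    def apply(self, event: UserUpdated) -> None:
        current = self._users.get(event.user_id)
        if current is not None and current["version"] >= event.version:
            return  # duplicate or out-of-order delivery: keep the newer copy
        self._users[event.user_id] = {
            "name": event.name,
            "shipping_address": event.shipping_address,
            "version": event.version,
        }  # email is deliberately not stored; this service does not need it

    def shipping_label(self, user_id: str) -> str:
        user = self._users[user_id]
        return f"{user['name']}, {user['shipping_address']}"
```

Placing an order then reads from this local view with no network call to a User Service, at the price of the copy being briefly behind the source.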
Q15: You implemented CQRS with a separate read model. Users report that after creating an item, they sometimes see a blank page when redirected to view it. Diagnose and fix this.
/items/{id} (read path), but the read model has not yet processed the event that populates it. The projection lag (the time between the write committing and the read model being updated) is typically 50-500ms, but under load it can spike to seconds. The redirect happens in under 50ms. So the read model does not have the item yet.
There are four solutions, ordered from simplest to most robust:
Solution 1: Return the created item in the write response and use it directly. The write endpoint returns the full item data in the 201 response. The frontend does not redirect to a read endpoint; it uses the response data to render the page immediately. This completely avoids the read model for the ‘read your own write’ case. This is what I would implement first because it requires zero infrastructure changes. The catch: it only works for the user who created the item. If they share the URL, the recipient might still hit a stale read model.
Solution 2: Read-your-own-writes routing. After a write, set a short-lived cookie or header (e.g., X-Read-After-Write: {timestamp}). The API layer checks this header and, if present, routes the read to the primary database (the write model) instead of the read model, for a configurable window (e.g., 5 seconds). After the window expires, reads go back to the read model. Amazon DynamoDB’s consistent-read flag works on a similar principle: the caller opts into a strongly consistent read when it cannot tolerate staleness. The trade-off: you are briefly bypassing your read model’s scalability for one user’s session.
Solution 3: Synchronous projection for the writing user. After the write commits, synchronously update the read model for just the affected item before returning the 201 response. This adds latency to the write path (the projection must complete before the response), but guarantees the item exists in the read model when the redirect happens. The trade-off: you have coupled your write latency to your projection speed. If the projection involves Elasticsearch indexing, that could add 100-200ms to every write.
Solution 4: Polling with exponential backoff on the frontend. The frontend redirects to the read endpoint. If it returns 404, the frontend retries with increasing delays (100ms, 200ms, 400ms) up to a cap (3 seconds). This is the least elegant but most robust approach; it handles all edge cases including spikes in projection lag. Show a skeleton screen during polling, not a blank page.
I would implement Solution 1 as the immediate fix (30 minutes of work) and Solution 2 as the long-term approach if other read-your-own-write scenarios emerge.”
Follow-up: The projection lag is usually 100ms but spikes to 8 seconds during peak traffic. How do you investigate?
“The projection pipeline is a consumer: it reads events and updates the read model. When it lags, either it is consuming too slowly or events are arriving too quickly.
First, I would check consumer lag metrics. In Kafka, this is the consumer group lag, the delta between the latest offset and the committed offset. If lag correlates with traffic spikes, the consumer cannot keep up with the event production rate.
Second, I would profile the projection handler. Is the bottleneck the event deserialization, the business logic that transforms the event into a read model update, or the write to the read store (Elasticsearch, Redis, or whatever)? At a company I worked at, we found that 80% of projection time was spent in Elasticsearch bulk indexing; the projection logic was fine, but ES was the bottleneck. We batched projection writes (accumulate 100 events, bulk-index once) and lag dropped from 8 seconds to 200ms.
Third, I would check for head-of-line blocking. If the projection consumer processes events sequentially and one event type is slow (e.g., a CatalogRebuilt event that updates 10,000 read model entries), it blocks all subsequent events. The fix is to partition projections by event type: fast events (item created, item updated) get their own consumer, slow events (catalog rebuilt) get theirs.
Fourth, check if the read store itself is under pressure. During peak traffic, the read model is serving reads AND receiving projection writes. If reads are consuming all the IOPS, writes queue up. Separating read and write connections with different priority, or using a read replica for serving reads while projections write to the primary, can help.”
Follow-up: A product manager asks ‘why can’t we just make it consistent?’ How do you explain the trade-off?
“I would avoid the CAP theorem lecture and speak in their language. ‘We could make it fully consistent: the write would not return until the read model is updated. That means every create/update operation takes 200ms longer because it waits for the search index to update. At our current volume of 50,000 writes per hour, that adds about 2.8 hours of cumulative user-facing latency every hour (50,000 × 0.2 seconds), or roughly 67 hours per day if that rate holds around the clock. The alternative, which is what we have now, is that writes are fast, and 99.9% of the time the read model catches up before anyone notices. For the 0.1% where there is a visible delay, we show the created item directly from the write response, so the user never sees a blank page. The trade-off is: slower writes for everyone vs. a brief inconsistency window that we mask in the UI. I recommend the latter.’
Numbers and user impact: that is what product managers need to make a decision. Not ‘eventual consistency is a fundamental property of distributed systems.’”
War Story: At a mid-size e-commerce platform (500K orders/day), we rolled out CQRS to separate our product catalog reads (served from Elasticsearch) from writes (PostgreSQL). The first week, customer support tickets spiked 40%: merchants updating product prices would see the old price on their dashboard for up to 30 seconds during peak hours. The projection pipeline was consuming from a single Kafka partition. We re-partitioned by merchant ID, scaled the consumer group to 12 instances, and implemented Solution 1 (return updated product in the write response). Tickets dropped below baseline within a week. The lesson: CQRS’s consistency window is not a theoretical concern; it shows up as customer support load.
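Solution 2 from Q15 (read-your-own-writes routing) can be sketched in a few lines. The answer describes a cookie or header carrying the write timestamp; this sketch swaps that for a server-side map keyed by session ID so it stays self-contained. The class name `ReadRouter`, the target labels, and the injectable clock are assumptions.

```python
import time
from typing import Callable


class ReadRouter:
    """Routes a session's reads to the primary for a short window after it writes."""

    def __init__(self, window_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._last_write: dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        """Call after a successful write on behalf of this session."""
        self._last_write[session_id] = self._clock()

    def target_for_read(self, session_id: str) -> str:
        written_at = self._last_write.get(session_id)
        if written_at is not None and self._clock() - written_at < self.window_seconds:
            return "primary"  # recent writer: bypass the possibly stale read model
        return "read_model"  # everyone else keeps using the scalable path
```

Only sessions that wrote within the window reach the primary, so the extra load on the write database stays proportional to write traffic, not read traffic. A real deployment would also need to expire old entries from the map.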
Q16: Your saga's compensation step fails. The payment was charged, the inventory was reserved, shipping failed, you try to refund the payment — and the refund call times out. Now what?
Follow-up: How do you prevent the customer from being in a bad state while the compensation is being resolved?
“The customer’s experience and the system’s internal state are separate concerns. Even while the saga is stuck in ‘compensation-failed,’ the customer should see a consistent status.
I would immediately update the order status to ‘cancellation-in-progress’ when the shipping step fails — before attempting any compensation. The customer sees ‘Your order is being cancelled, and a refund is being processed.’ This is true regardless of whether the refund succeeds on the first try or takes 30 minutes of retries.
For the payment specifically: do not show ‘refunded’ until the payment provider confirms it. Show ‘refund pending.’ This manages the customer’s expectations accurately. If they call support, the support agent can see the saga’s actual state: ‘The refund has been initiated and is being processed. It will appear on your statement within 5-10 business days.’
The worst outcome is showing the customer ‘refunded’ optimistically and then discovering the refund actually failed. Now you have a customer who thinks they have been refunded but has not been, and you have a trust problem.”
Follow-up: You mentioned idempotency keys. What happens if the payment provider does not support them?
“Then you are in a harder situation, and you have to build idempotency yourself. Before calling the refund endpoint, check your own records: have I already successfully refunded this charge? If yes, skip the call. If the call is in ‘unknown’ status (timed out), query the provider’s transaction history API to check whether a refund for this charge already exists.
If the provider has no idempotency support AND no transaction query API — which is rare for any reputable provider but does happen with legacy systems — you need a reconciliation-first approach. Attempt the refund, record the attempt. Run a daily reconciliation that pulls all refunds from the provider’s settlement report and matches them against your records. Discrepancies (refund in your records but not in theirs, or in theirs but not in yours) get flagged for manual review.
This is where the choice of payment provider becomes an architectural decision. If your provider does not support idempotent operations, the operational cost of compensating for that gap is significant. I have seen this be the deciding factor in a Stripe vs. legacy-provider evaluation — Stripe’s idempotency key support alone saved an estimated 20 engineering hours per month in reconciliation work.”
War Story: At a travel booking platform processing 200K bookings/day, we had a saga for flight + hotel + car rental bundles. The car rental provider’s cancellation API had a 5% timeout rate during peak hours (their system was overwhelmed). Our initial saga design retried 3 times and then marked the booking as ‘cancelled’ — but the car reservation was still active on the provider’s side 30% of the time. Customers would get charged for car rentals they thought were cancelled.
The fix had three parts: idempotent cancellation with provider-specific confirmation polling, a reconciliation job that ran every 30 minutes against the provider’s booking API, and a Slack alert to the operations team for any reservation stuck in ‘cancellation-pending’ for more than 2 hours. The weekly customer complaints about phantom charges dropped from ~40 to zero. The reconciliation job alone found and resolved 15-20 stuck cancellations per day that would have otherwise become customer complaints.
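The two safeguards from the saga answers above, self-built idempotency for refunds and a reconciliation pass, can be sketched like this. Everything here is hypothetical scaffolding for illustration: the `ledger` dict stands in for your own refund records, and `send_refund` and `provider_has_refund` stand in for whatever calls your payment provider actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


def refund_once(
    charge_id: str,
    ledger: Dict[str, str],
    send_refund: Callable[[str], None],
    provider_has_refund: Callable[[str], bool],
) -> str:
    """Refund a charge at most once, using our own records as the idempotency guard."""
    status = ledger.get(charge_id)
    if status == "succeeded":
        return status  # already refunded: skip the call entirely
    if status == "unknown" and provider_has_refund(charge_id):
        # The earlier attempt timed out but did go through at the provider.
        ledger[charge_id] = "succeeded"
        return "succeeded"
    try:
        send_refund(charge_id)
    except TimeoutError:
        ledger[charge_id] = "unknown"  # outcome not known; resolve on the next pass
        return "unknown"
    ledger[charge_id] = "succeeded"
    return "succeeded"


@dataclass(frozen=True)
class Discrepancy:
    charge_id: str
    reason: str


def reconcile_refunds(
    ours: Iterable[str], provider: Iterable[str]
) -> List[Discrepancy]:
    """Compare refunded charge IDs in our records with the provider's report."""
    our_ids, provider_ids = set(ours), set(provider)
    issues = [
        Discrepancy(c, "in our records, not at provider")
        for c in sorted(our_ids - provider_ids)
    ]
    issues += [
        Discrepancy(c, "at provider, not in our records")
        for c in sorted(provider_ids - our_ids)
    ]
    return issues
```

The discrepancies returned by `reconcile_refunds` are the items that would be flagged for manual review or an operations alert.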
Q17: Your team adopted Event Sourcing 18 months ago. Now the business wants to add a 'middle name' field to the user profile. In a normal system this is a 5-minute migration. How long does it take in your event-sourced system, and why?
“In a normal system, this is ALTER TABLE users ADD COLUMN middle_name VARCHAR(100). Deployed in a migration, done in seconds. In an event-sourced system, the current state is derived from replaying events. There is no table to alter. The ‘schema’ lives in the events themselves, and events are immutable — you cannot change historical events.
Here is what is actually involved:
Step 1: Create a new event version. You now have UserCreated_v1 { firstName, lastName, email } and UserCreated_v2 { firstName, middleName, lastName, email }. Going forward, all new user creations publish v2. This part is quick.
Step 2: Handle historical events. When replaying events to build state, your projection code encounters v1 events (no middle name) and v2 events (with middle name). Every projection handler must handle both versions. If you have 5 projections that consume user events, all 5 need to be updated. The v1 handler defaults middleName to null. This is manageable.
Step 3: Upcasting (optional but recommended). Instead of polluting every projection handler with version-checking logic, implement an upcaster — a middleware in the event deserialization pipeline that transforms v1 events into v2 format before the projection sees them. The upcaster fills in middleName: null for v1 events. Now projections only need to handle v2. This is the clean approach, but it adds a layer of complexity to the event pipeline.
Step 4: Existing users who want to add a middle name. You need a new event: UserMiddleNameAdded { userId, middleName }. The projection handler applies this event by setting the middle name on the read model. The aggregate’s apply method must also handle this event to update in-memory state.
Step 5: Consider snapshots. If your system uses snapshots (serialized state at a point in time to avoid replaying all events), the snapshot schema must also be updated. Old snapshots do not have middleName. Your snapshot deserialization needs to handle the old format.
Alternatively, invalidate all existing snapshots and let them rebuild — but if you have millions of aggregates, the rebuild could take hours.
Step 6: Rebuild affected projections. If you want the middle name to appear in your search index, your reporting database, your user directory projection — each needs a full rebuild from the event stream. For a system with 10 million users and an average of 50 events per user, that is 500 million events to replay. At 10,000 events/second, that is ~14 hours.
The honest estimate: 2-5 days of engineering work and a projection rebuild window. Versus 5 minutes for a database migration. This is the real cost of event sourcing, and teams need to have this conversation before adopting it, not after.”
Follow-up: After 3 years, you have 12 event versions for the User aggregate. The upcasting chain is v1->v2->v3->…->v12. Is this sustainable?
“No. A 12-step upcasting chain is a maintenance nightmare and a performance concern — every historical event gets transformed through 12 functions before it reaches the projection.
The sustainable approach is periodic event stream compaction. You create a new event stream where every aggregate’s full history is replaced by a single ‘snapshot event’ in the latest schema version, followed by only the events after the snapshot. This is conceptually similar to how Kafka log compaction works.
The process: for each aggregate, replay all events to build current state, emit a single UserState_v12 { ...all current fields } event to a new stream, then only carry forward events newer than a cutoff date. Old event streams are archived to cold storage for compliance (you may legally need them) but are no longer used for projections.
This is a major operational effort — essentially a migration of your event store — and it is why I recommend doing it annually rather than letting 12 versions accumulate. Some teams implement it as a continuous background process: whenever an aggregate’s snapshot is rebuilt, the old events are marked for archival.
The broader lesson: event sourcing is not ‘store events forever and replay them.’ It is ‘store events as your source of truth, with a deliberate lifecycle management strategy for schema evolution and data growth.’ Teams that skip the lifecycle strategy end up in the 12-version upcasting chain.”
Follow-up: Given what you have described, how do you evaluate whether a new project should use event sourcing?
“I would apply a strict three-question test. First, does the domain genuinely require historical state reconstruction? Not an audit log — a changes table does that. Not event-driven communication — an outbox pattern does that. The question is: do you need to replay events to derive current state, build retroactive projections, or answer temporal queries like ‘what was the account balance at close of business on March 15?’
Second, is the team willing to invest in the infrastructure? Event sourcing requires an event store, a projection pipeline, a snapshotting strategy, an upcasting framework, and a schema evolution process. If the team does not have the capacity or expertise to build and maintain this infrastructure, event sourcing will become the system’s biggest liability within a year.
Third, what is the schema change frequency? If the domain model changes every sprint — which is normal for a product in its first 2 years — every change triggers the multi-step process I described. The event sourcing tax on schema changes is high enough that it actively slows down product development in fast-moving domains.
If all three answers are favorable, event sourcing is worth it. If even one is a no, I would use a state-based persistence model with a change log table for audit and domain events for inter-service communication.”
War Story: A healthcare platform used event sourcing for patient records — genuinely justified by regulatory requirements for immutable audit trails. After 2 years, they had 8 event versions for the core PatientRecord aggregate. Adding a ‘preferred pharmacy’ field required updating 3 upcasters, 7 projection handlers, and triggering a projection rebuild that took 22 hours because they had 4 million patients with an average of 200 events each. The rebuild had to run over a weekend with the projection serving stale data.
They now budget ‘event schema evolution overhead’ as a line item in sprint planning — typically 3-5 story points per schema change, compared to near-zero in their state-based services. Their principal engineer’s advice to other teams: “If you are not legally required to reconstruct historical state, do not use event sourcing. Use an append-only change log and a regular database.”
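The upcasting step from the middle-name walkthrough above can be sketched as a small lookup of per-version transforms that runs before any handler sees an event. The envelope used here (`type`, `version`, `data` keys) and the function names are assumptions for illustration; real event stores define their own formats.

```python
from typing import Any, Callable, Dict, Tuple

Event = Dict[str, Any]


def upcast_user_created_v1(event: Event) -> Event:
    """Turn a v1 UserCreated into v2 by filling in the missing middleName."""
    data = dict(event["data"])  # copy: stored events stay immutable
    data.setdefault("middleName", None)
    return {"type": "UserCreated", "version": 2, "data": data}


# (event type, version) -> function producing the next version
UPCASTERS: Dict[Tuple[str, int], Callable[[Event], Event]] = {
    ("UserCreated", 1): upcast_user_created_v1,
}


def upcast(event: Event) -> Event:
    """Apply upcasters until the event is at the latest known version."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event


def apply(state: Dict[str, Any], event: Event) -> Dict[str, Any]:
    """Fold one event into user state; only latest versions are handled here."""
    event = upcast(event)
    if event["type"] == "UserCreated":
        return dict(event["data"])
    if event["type"] == "UserMiddleNameAdded":
        return {**state, "middleName": event["data"]["middleName"]}
    return state
```

Each new schema version adds one entry to `UPCASTERS`, which is also why a long chain means every historical event passes through every transform on replay.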
Q18: You are reviewing a PR where a developer has introduced a Factory that creates objects based on a config file, a StrategyProvider that selects strategies based on a database flag, and a Decorator chain configured through a DI container — all for a feature that currently has two code paths. What is your code review feedback?
Follow-up: The developer responds ‘but the if-else violates the Open/Closed Principle.’ How do you handle this?
“The Open/Closed Principle says code should be open for extension and closed for modification. It is a real principle, and it is genuinely valuable in the right context. But it is not an absolute law — it is a heuristic that applies when extension is frequent and modification is risky.
For two code paths in a feature that was created last week, there is no evidence that extension will be frequent. The OCP’s value scales with the number of existing behaviors and the stability of the code. If we have 8 payment methods in a 3-year-old payment processing pipeline, OCP matters — adding a 9th should not risk breaking the other 8. For 2 code paths in a new feature, the OCP is a solution looking for a problem.
I would tell the developer: ‘You are right about the principle, and I appreciate the forward-thinking. But principles are tools, not rules. The OCP earns its complexity when we have evidence that the code path will grow. Right now, the evidence says two paths. Let us apply OCP when we add the third — and the refactoring will be easier then because we will understand the real variation points, not the guessed ones.’
The meta-lesson: the best engineers apply principles situationally, not universally. ‘When should I NOT apply this principle?’ is a more valuable question than ‘how do I apply this principle?’”
Follow-up: How do you balance ‘YAGNI’ with ‘it is harder to refactor later’?
“The claim that ‘it is harder to refactor later’ is usually wrong for code-level patterns. Extracting a Strategy from an if-else is a well-understood refactoring move that takes 30 minutes with tests in place. Introducing an Adapter around a third-party call is a mechanical transformation. These are not hard refactorings.
What IS harder to refactor later are architectural decisions: monolith vs. microservices, shared database vs. event-driven data replication, synchronous vs. async communication. Those decisions have high reversal costs because they span multiple services, teams, and deployment pipelines.
So my balance point is: YAGNI for code-level patterns (Strategy, Factory, Decorator). The refactoring cost when you need them is low, and the carrying cost of premature abstraction is paid daily. Think ahead for architectural patterns (CQRS, event sourcing, service boundaries). The reversal cost is high, so the upfront analysis investment is justified.
The shorthand: if the pattern affects one service’s internal code, wait until you need it. If the pattern affects how services interact, think hard about it now.”
War Story: At a previous company, a well-intentioned architect introduced what the team called the ‘AbstractBeanFactoryConfigurationStrategyProvider’ layer — a Spring Boot application where every service class had a corresponding interface, factory, strategy selector, and DI configuration. The application had 8 features, each with one implementation. Total: 8 interfaces with one implementation each, 8 factories that returned the only implementation, and a 200-line DI configuration class. When a new engineer joined, they timed how long it took to trace a single API request through the codebase: 25 minutes. After a team vote, they spent one sprint removing the premature abstractions, collapsing interfaces-with-one-implementation into concrete classes, and inlining factories. The same request trace took 4 minutes.
Code review throughput doubled because reviewers could actually understand the changes. The architect, to their credit, later said: ‘I was building for a scale of complexity we never hit.’
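To make the "wait for the third code path" argument from the premature-abstraction discussion concrete, here is a sketch of both stages. The shipping-cost domain, the rates, and the function names are invented for illustration and do not come from the PR being discussed.

```python
from typing import Callable, Dict


def shipping_cost(method: str, weight_kg: float) -> float:
    """Two code paths: a plain conditional is the whole design."""
    if method == "express":
        return 15.0 + 2.0 * weight_kg
    return 5.0 + 1.0 * weight_kg  # anything else is treated as standard


# Later, when a third path actually arrives, the extraction is mechanical:
# each branch becomes a function and the conditional becomes a lookup.
def _standard(weight_kg: float) -> float:
    return 5.0 + 1.0 * weight_kg


def _express(weight_kg: float) -> float:
    return 15.0 + 2.0 * weight_kg


def _freight(weight_kg: float) -> float:
    return 40.0 + 0.5 * weight_kg


PRICING: Dict[str, Callable[[float], float]] = {
    "standard": _standard,
    "express": _express,
    "freight": _freight,
}


def shipping_cost_v2(method: str, weight_kg: float) -> float:
    """Strategy via a dict of functions; unknown methods raise KeyError."""
    return PRICING[method](weight_kg)
```

Note that the second version is stricter about unknown methods than the first, which is the kind of real variation point that only becomes visible once a third case exists.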
Q19: You are designing a system where a user action triggers work across 5 services. The PM insists the user must see a success message within 200ms. But the full workflow takes 3 seconds. How do you architect this?
Option 1: Polling. The frontend polls GET /orders/{id}/status every 2 seconds. When the status changes from ‘processing’ to ‘confirmed,’ the UI updates. Simple to implement, wastes some bandwidth, works everywhere.
Option 2: WebSocket/SSE. The frontend opens a WebSocket or Server-Sent Events connection. When the saga completes, a notification pushes the status update to the client in real time. More responsive, more complex infrastructure (WebSocket server, connection management).
Option 3: Optimistic UI. The frontend immediately shows ‘Order confirmed’ after the 202, treating the acknowledgment as if the workflow will succeed. If the workflow later fails (payment declined), a notification corrects the state: ‘There was an issue with your order.’ This is how most food delivery apps work — you see ‘Order confirmed’ instantly, and the restaurant/driver assignment happens over the next 30 seconds.
I would choose Option 3 for e-commerce (order failure rate is under 2%, so optimistic is correct 98% of the time) and Option 2 for high-stakes workflows (financial transactions where the user needs to know the real status).
The critical implementation detail: the 202 response must include enough information for the frontend to function. The order ID, the order summary, and the expected completion time. Do not return just an ID — the frontend needs to render something meaningful while the workflow completes.”
Follow-up: The PM pushes back — ‘I do not want the user to see processing. I want them to see confirmed immediately. What if the payment fails?’
“This is the optimistic UI conversation, and it requires the PM to understand a trade-off. They can have instant ‘confirmed’ (optimistic) or they can have accurate ‘confirmed’ (synchronous). They cannot have both.
With optimistic UI: 98% of the time, the user sees ‘confirmed’ and it is correct. 2% of the time, the payment fails and the user gets a notification 30-60 seconds later saying ‘There was an issue with your payment.’ This is the model Uber, DoorDash, and Amazon use. The failure path must be graceful: a clear notification, a simple action the user can take (update payment method), and no data loss (the order items are still in their cart).
With synchronous confirmation: the user waits 3 seconds on a loading screen before seeing ‘confirmed.’ Every user pays the 3-second tax so that the 2% who would have failed get an inline error instead of a delayed notification.
I would present both options with the user experience impact and let the PM decide. In my experience, PMs choose optimistic UI once they see that Uber and Amazon do it — the social proof is powerful. The key is that the failure path must be polished, not an afterthought. If the failure notification is an ugly error toast that disappears after 3 seconds, the PM will regret the optimistic approach.”
Follow-up: How do you handle the case where 3 of the 5 services succeed but the 4th fails — and the user has already seen ‘confirmed’?
“This is the saga compensation problem combined with the optimistic UI correction problem. Both need to work together.
On the backend, the saga orchestrator initiates compensating transactions for the 3 successful steps. If payment was charged, refund it. If inventory was reserved, release it. If shipping was scheduled, cancel it. Each compensation is idempotent and retryable per the patterns we discussed earlier.
On the frontend, the user who saw ‘confirmed’ needs to be informed. I would send a push notification and an email: ‘We were unable to complete your order because [specific reason — e.g., an item went out of stock]. Your payment of $X will be refunded within 3-5 business days. Your cart has been preserved so you can try again.’
The notification should be specific, actionable, and empathetic. Not ‘Order failed. Error code 4312.’ But ‘We are sorry — the Blue Widget is no longer in stock. We have refunded your payment. Would you like us to notify you when it is back in stock?’
The architectural detail: the user notification must be triggered by the saga state machine reaching a ‘fully-compensated’ state, not by the initial failure. You do not want to tell the user ‘your payment will be refunded’ before the refund has actually been initiated.”
War Story: A ride-hailing app I worked on had exactly this problem. The user taps ‘Request Ride,’ and the backend needs to match a driver (2-15 seconds), verify the driver is available (500ms), calculate the route and price (300ms), and reserve the payment hold (200ms). We could not make the user wait 15 seconds. The solution: immediate 202 with an animated ‘Finding your driver’ screen, WebSocket updates for driver match and ETA, and optimistic pricing shown from the pre-calculated estimate. The key metric we tracked: ‘confirmation regret rate’ — how often we showed ‘driver found’ and then had to retract it (driver cancelled, payment hold failed). We kept it under 1.5%. Above 2%, user trust eroded measurably in NPS scores.
Below 1%, we were being too conservative with matching and losing ride volume. That 1-2% band was the sweet spot, and we tracked it on a real-time dashboard.
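The accept-then-complete flow discussed in the 200ms answer above can be sketched without any web framework. This is only an in-memory illustration: the `ORDERS` store, the field names in the response body, and the three function names are assumptions, and a real service would hand the work to a queue or saga orchestrator.

```python
import uuid
from typing import Any, Dict, List, Tuple

# In-memory stand-in for the order store (illustrative only).
ORDERS: Dict[str, Dict[str, Any]] = {}


def accept_order(items: List[str]) -> Tuple[int, Dict[str, Any]]:
    """Record the order and acknowledge at once with HTTP 202 Accepted."""
    order_id = str(uuid.uuid4())
    ORDERS[order_id] = {"status": "processing", "items": list(items)}
    # Return enough for the frontend to render something meaningful,
    # not just an ID: summary, expected completion, and where to poll.
    body = {
        "orderId": order_id,
        "status": "processing",
        "summary": {"itemCount": len(items), "items": list(items)},
        "expectedCompletionSeconds": 3,
        "statusUrl": f"/orders/{order_id}/status",
    }
    return 202, body


def finish_workflow(order_id: str, succeeded: bool) -> None:
    """Called when the multi-service workflow ends (from a worker, in practice)."""
    ORDERS[order_id]["status"] = "confirmed" if succeeded else "failed"


def get_status(order_id: str) -> Tuple[int, Dict[str, Any]]:
    """What a polling client would receive from the status endpoint."""
    order = ORDERS.get(order_id)
    if order is None:
        return 404, {"error": "order not found"}
    return 200, {"orderId": order_id, "status": order["status"]}
```

With an optimistic UI the frontend would show "confirmed" straight after the 202 and use the status endpoint (or a push channel) only to correct the rare failure.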
Q20: A junior developer asks you: 'Should I learn design patterns? My senior says they are outdated and everything should just be functions.' What do you tell them?
Follow-up: The senior developer overhears and says ‘I have shipped production systems for 10 years without ever naming a pattern. Patterns are academic.’ How do you respond?
“They have almost certainly used patterns without naming them. Every time they passed a function to customize behavior, they used Strategy. Every time they wrapped a third-party call behind an interface, they used Adapter. Every time they added middleware to a web server, they used Decorator. The question is not whether they use patterns — it is whether they benefit from the shared vocabulary.
For a solo developer or a tiny team that has worked together for years, implicit patterns work. You do not need to call it Strategy — everyone knows ‘we pass a function here.’ But at scale — 50 engineers, cross-team design reviews, architecture decision records — the vocabulary matters enormously. Saying ‘we need a saga with orchestration and the outbox pattern for reliable event publishing’ communicates more in one sentence than 30 minutes of whiteboard explanation.
I would not argue with the senior. I would ask: ‘When you are reviewing a PR from someone on another team, and they have introduced a complex coordination flow across three services, how do you quickly communicate what is right or wrong with the approach?’ If the answer involves describing the solution structure in detail, patterns give them shorthand. If they genuinely do not do cross-team design communication, they may be right that patterns are not valuable in their specific context.”
Follow-up: What is one pattern you think is actively harmful when applied naively?
“Singleton. In the GoF book, Singleton is presented as a pattern for ensuring a class has only one instance. In practice, it is global mutable state with a fancy name. It makes testing hard (you cannot swap the instance), creates hidden coupling (any code anywhere can access the singleton), makes concurrency dangerous (shared mutable state across threads), and makes dependency graphs invisible (the singleton is a dependency that does not appear in constructors or function signatures).
Dependency injection solves every legitimate Singleton use case better. Need one database connection pool? Register it as a singleton in your DI container. Need one configuration object? Inject it. The difference is that DI makes the dependency explicit and swappable, while the Singleton pattern hides it.
The reason I single out Singleton: it is often the first pattern juniors learn and apply, because it feels clever and useful. And it creates problems that do not surface until the codebase grows — tests that interfere with each other because they share a singleton’s state, race conditions in multithreaded code, and services that are impossible to test in isolation because they reach into a global singleton instead of accepting a dependency.”
War Story: I mentored a junior engineer who spent 2 weeks building a notification system using the full GoF playbook: NotificationFactory, NotificationStrategyInterface, EmailNotificationStrategy, PushNotificationStrategy, NotificationObserver, NotificationDecorator for retry logic. The system had exactly 2 notification types: email and push. I sat with them and we refactored it to a dictionary mapping notification types to handler functions, with a simple retry wrapper. The entire system went from 14 files and 600 lines to 1 file and 80 lines, with identical functionality and better readability. But I did not tell them the original work was wasted — I told them the truth: ‘You just learned when patterns pay for themselves and when they do not.
That judgment is worth more than knowing all 23 GoF patterns.’ They are now a senior engineer who writes some of the cleanest, most appropriately-abstracted code on their team.
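The refactored notification system from the mentoring story (a dictionary of handler functions plus a simple retry wrapper) might look roughly like the sketch below. It is a reconstruction of the idea, not the team's code: the handler bodies, the `OUTBOX` list used in place of real delivery, and all names are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, str], None]

# Stand-in for real delivery so the sketch has no external dependencies.
OUTBOX: List[Tuple[str, str, str]] = []


def with_retry(handler: Handler, attempts: int = 3, delay_seconds: float = 0.0) -> Handler:
    """Wrap a handler so failures are retried up to `attempts` times in total."""
    def wrapped(recipient: str, message: str) -> None:
        for attempt in range(1, attempts + 1):
            try:
                handler(recipient, message)
                return
            except Exception:
                if attempt == attempts:
                    raise  # out of attempts: surface the last error
                time.sleep(delay_seconds)
    return wrapped


def send_email(recipient: str, message: str) -> None:
    OUTBOX.append(("email", recipient, message))


def send_push(recipient: str, message: str) -> None:
    OUTBOX.append(("push", recipient, message))


# The whole "strategy selection" is a dictionary lookup.
HANDLERS: Dict[str, Handler] = {
    "email": with_retry(send_email),
    "push": with_retry(send_push),
}


def notify(kind: str, recipient: str, message: str) -> None:
    HANDLERS[kind](recipient, message)
```

Adding a third notification type is one function and one dictionary entry, which is the point at which heavier structure can be reconsidered.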
Q21: You notice that your Decorator-based caching layer is adding 15ms of latency to every request — even cache hits. The team says 'caching should make things faster, not slower.' What is happening and how do you fix it?
MetricsDecorator -> CachingDecorator -> LoggingDecorator -> Repository, add timing to each decorator’s entry and exit. I want to see: MetricsDecorator overhead (2ms), CachingDecorator overhead including the cache call (8ms), LoggingDecorator overhead (3ms), Repository never called (cache hit). Now I know where the 15ms is.
Step 2: Investigate the cache hit path specifically. In the CachingDecorator, a cache hit typically involves: generate the cache key (hash the method name + arguments), serialize the arguments for the key, call Redis GET, deserialize the cached value. Each step has a cost.
Common culprits I have seen in production:
Key generation is expensive. If the cache key is generated by JSON-serializing the method arguments and hashing them, and the arguments include large objects (a full user profile, a product catalog query), the serialization alone can take 5-10ms. Fix: use a simple, precomputed cache key based on the primary identifier (user:{userId}), not a hash of the entire argument object.
Deserialization on every hit. If the cached value is stored as JSON and deserialized on every cache hit, that is CPU work on every request. For a complex nested object, JSON.parse can take 2-5ms. Fix: cache already-deserialized objects in an in-memory L1 cache (a simple LRU map in the process) with a short TTL (10-30 seconds). Redis becomes the L2 cache. The L1 hit returns a reference with zero serialization cost.
Redis round trip. Even on a local network, a Redis call is 0.5-2ms. If the CachingDecorator makes this call on every single request, that is a floor of 0.5ms you cannot eliminate without an in-memory cache. This is usually acceptable, but if the cached method is called 50 times per request (inside a loop), the 50 Redis round trips add up to 25-100ms.
The decorator overhead itself.
Each decorator in the chain involves a function call, possibly object allocation (if the decorator creates wrapper objects), and potentially async/await overhead (if the decorators are async and the chain involves multiple promise resolutions). In Node.js, a chain of 5 async decorators adds measurable microtask overhead. In Java, the virtual method dispatch and object creation through the chain are usually negligible but can matter in hot paths called 10,000 times per second.
Logging on every cache hit. If the LoggingDecorator writes a log line for every request, and the logger does synchronous I/O (writing to a file or sending to a logging service), that blocks the thread. Even async loggers have buffer flush overhead. For a hot-path method called thousands of times per second, I would make cache-hit logging configurable and default it to off — log only misses.”
Follow-up: You profile and find that 12ms of the 15ms is Redis round trips — the method is called 6 times per request in a loop, each hitting Redis. What do you do?
“Six individual Redis calls in a loop is the real problem. Each call is ~2ms (network round trip), and they execute sequentially.
Option 1: Batch the cache lookups. Instead of 6 individual GET commands, use Redis MGET to fetch all 6 keys in a single round trip. This turns 6 * 2ms = 12ms into 1 * 2ms = 2ms. This requires changing the CachingDecorator to support batch operations — which might mean the caching interface needs a getMany(keys) method in addition to get(key).
Option 2: Prefetch and cache locally. Before the loop, fetch all needed data in one batch and store it in a local hash map. The loop reads from the local map (sub-microsecond) instead of hitting Redis. This is the ‘request-scoped cache’ pattern — a local cache that lives for the duration of one request.
Option 3: Restructure to avoid the loop. Why is this method called 6 times? If it is fetching user profiles for 6 users in a list view, the API should support a batch endpoint (getUsers(ids: [1,2,3,4,5,6])) instead of the caller looping over individual fetches. This is an API design fix, not a caching fix.
I would implement Option 2 as the quick fix (request-scoped cache is a well-understood pattern and non-invasive) and Option 3 as the proper fix if the loop pattern appears elsewhere.”
Follow-up: After fixing the cache performance, how do you prevent similar decorator overhead issues from reappearing?
“Add a performance budget to the decorator chain. Define a rule: the total overhead of all decorators wrapping a repository method must not exceed 5ms for a cache hit and 10ms for a cache miss (excluding the actual database query time).
Enforce this with a performance test: a benchmark that exercises the decorator chain with a mock repository and measures the overhead. Run it in CI. If the overhead exceeds the budget, the build fails. This catches regressions before they reach production.
More broadly, I would establish a team guideline: decorators on hot-path methods (called >100 times per second) should be audited for performance. Decorators on cold-path methods (admin endpoints, batch jobs) can prioritize clarity over performance. Not all code paths are equal, and the acceptable overhead of a decorator depends on how often it runs.”
War Story: At a SaaS platform handling 15,000 API requests/second, we had a CachingDecorator around our tenant configuration lookup. Every request needed tenant config — rate limits, feature flags, plan tier. The decorator used JSON.stringify(tenantId) as the cache key (wasteful but not the main issue), fetched from Redis (2ms), and deserialized with JSON.parse (1ms for the deeply nested config object). At 15K req/s, that was 45,000ms of cumulative lookup time per second, 15,000ms of it CPU time spent just on JSON parsing for cache hits. We added a process-local LRU cache with a 5-second TTL as an L1 in front of Redis. Cache hit rate on the L1: 99.7% (tenant config rarely changes within 5 seconds). P50 latency for the config lookup dropped from 3ms to 0.02ms. The Redis connection pool went from saturated to 0.3% utilization for this workload.
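The process-local L1 cache with a short TTL described in the caching answers above can be sketched as follows. This is a minimal single-threaded illustration: `TtlLruCache`, `get_tenant_config`, and the `fetch_from_l2` callback (which stands in for the Redis GET plus JSON decode) are assumed names, and the injectable `clock` exists only to make expiry easy to exercise.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class TtlLruCache:
    """Small in-process LRU cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        max_entries: int = 1024,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._entries: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used


def get_tenant_config(
    tenant_id: str,
    l1: TtlLruCache,
    fetch_from_l2: Callable[[str], Any],
) -> Any:
    """Check the in-process L1 first; fall back to the L2 lookup on a miss."""
    key = f"tenant:{tenant_id}"  # simple precomputed key, no argument hashing
    value = l1.get(key)
    if value is None:
        value = fetch_from_l2(key)  # stands in for Redis GET + JSON decode
        l1.put(key, value)
    return value
```

Two caveats of this sketch: a cached value of `None` is indistinguishable from a miss, and L1 hits return a shared object reference, so callers should treat the result as read-only.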
Q22: You join a company and discover the codebase has 47 interfaces, each with exactly one implementation. The previous architect called this 'clean architecture.' What do you do?
PaymentGateway, EmailSender, OrderRepository (if tests use in-memory fakes). These stay.
Bucket B — Unjustified but harmless: The interface is between two internal classes that we own and could change directly. UserServiceInterface with UserServiceImpl. These add one extra file and one extra hop but do not cause real pain. Low priority for removal.
Bucket C — Actively harmful: The interface obscures what is happening, makes debugging harder, and lives on a hot path where the indirection adds cognitive load during incidents. ConfigLoaderInterface, DateFormatterInterface, StringHelperInterface. These should go first.
Step 2: Remove Bucket C in small PRs. One interface per PR. Inline the implementation. Run the tests. The PR is small enough that review takes 5 minutes and the risk is near zero.
Step 3: Establish a going-forward rule. ‘Interfaces are introduced when there is a second implementation or a concrete test-double use case. An interface with one implementation and no in-memory fake in the test suite is not justified.’ Add this to the team’s architecture decision records.
Step 4: Let Bucket B erode naturally. When someone touches a Bucket B interface for a feature, they inline it as part of the feature work. No dedicated cleanup sprint needed.”
Follow-up: The original architect is still on the team and feels defensive. How do you handle this?
“I would not position it as ‘your architecture was wrong.’ I would position it as ‘the codebase has evolved, and some of the flexibility we planned for was not needed, which is a normal and healthy outcome. Now we can simplify.’

I would also acknowledge what the architect got right. If 5 of the 47 interfaces genuinely enabled painless vendor swaps or fast test suites, that is real value. ‘These 5 interfaces are excellent: they saved us weeks during the Stripe-to-Adyen evaluation. The other 42 were reasonable bets that did not pay off. Let’s keep the winners and clean up the rest.’

The key is that pattern removal is not a personal attack; it is maintenance. We remove dead code, we remove dead feature flags, and we remove dead abstractions. All for the same reason: they add cognitive load without providing value.”

Follow-up: Six months after the cleanup, a new requirement arrives that would have benefited from one of the interfaces you removed. What now?
“Then we re-introduce it. The cost of re-extracting an interface from a concrete class is 30 minutes with modern IDE refactoring tools. The cost of carrying 42 unnecessary interfaces for 6 months was far higher: in onboarding time, in debugging friction, in code review complexity.

This is the YAGNI trade-off made explicit: the cost of carrying an unused abstraction daily versus the cost of introducing it when actually needed. For code-level patterns, the re-introduction cost is almost always lower than the carrying cost. If someone says ‘but we might need it,’ the answer is ‘and if we do, we will add it then, in 30 minutes, with full knowledge of the actual requirement instead of the guessed one.’”

Follow-up (Staff-level): How do you measure whether this simplification was successful?
“Three metrics, measured at 30 and 90 days post-cleanup.

First, onboarding velocity. Time for a new engineer to make their first meaningful PR. If this drops, the codebase is easier to understand.

Second, code review throughput. Average time from PR opened to PR approved. Fewer indirection layers means reviewers understand changes faster.

Third, incident debugging time. Mean time from ‘alert fires’ to ‘root cause identified.’ Fewer abstraction layers means fewer hops to trace during an outage.

If all three improve, the cleanup was correct. If none improve, the interfaces were not the bottleneck and the problem is elsewhere.”
Q23: Two teams disagree on architecture. Team A wants a shared library for common domain logic (e.g., pricing rules). Team B wants each service to have its own copy. Both have valid arguments. How do you resolve this?
Q24: You are handed a 3-year-old codebase and asked 'what would you simplify first?' The system uses CQRS, event sourcing, the saga pattern, hexagonal architecture, and microservices -- for a product that has 200 daily active users. Walk me through your assessment.
changes audit table gives you the audit trail. Direct queries give you the read models. This simplification alone might save 2-3 days per sprint in schema change overhead.

Simplification 3: Remove CQRS. At 200 DAU, your read and write loads are trivially served by a single database with appropriate indexes. The separate read model, the projection pipeline, the eventual consistency window: these are solving problems that do not exist at this scale. Use one database, one model, add an index when a query is slow.

Simplification 4: Keep hexagonal architecture if the team finds it valuable for testing. Of all the patterns listed, hexagonal has the lowest carrying cost; it is primarily about code organization and dependency direction. If the team writes fast unit tests using in-memory fakes behind ports, the pattern is earning its keep even at low scale. If the tests hit the real database anyway, collapse to a simpler layered architecture.

Simplification 5: Replace saga orchestration with local transactions. If the services are now merged into a monolith, operations that previously spanned services now span modules within the same process. A database transaction replaces the saga. Compensating transactions, idempotency keys, and dead letter queues are all unnecessary when the operation is local.

The goal is not to ‘dumb down’ the system. The goal is to right-size the architecture for the actual problem. A simpler system ships features faster, has fewer failure modes, and is easier to reason about. If the product succeeds and grows to 200,000 DAU, you can re-introduce these patterns one at a time, each justified by a concrete, measurable need.”

Follow-up: The original architect argues ‘but this architecture will let us scale to millions of users without rearchitecting.’ How do you respond?
“Two counterarguments.

First, the product has 200 DAU. The overwhelming probability is that the product’s challenge is finding product-market fit and growing, not handling scale. Every sprint spent maintaining distributed system infrastructure is a sprint not spent on features that might grow the user base from 200 to 2,000. Architecture optimized for a scale that never arrives is not engineering; it is waste.

Second, the claim that this architecture ‘avoids rearchitecting later’ is usually false. The microservice boundaries drawn today (before the domain is mature, before the usage patterns are known) are almost certainly wrong. When you actually need to scale, you will rearchitect anyway: merging services that were split wrong, splitting services that grew too large, changing the event schema, rebuilding projections. You are not avoiding future work; you are paying for it prematurely AND you will pay for it again when the real requirements appear.

The right approach is: build the simplest thing that works, instrument it so you know where the bottlenecks are, and rearchitect the specific bottleneck when it appears. This is how most successful scaled systems were actually built. Netflix, Uber, Shopify: they all started simple and added complexity in response to measured pain, not anticipated pain.”

Follow-up: You convince the team to simplify. How do you roll it out without breaking production?
“The simplification is itself a migration, and I would apply the same discipline I would to any migration.

Phase 1: Merge services but keep the same database schema and event infrastructure. The microservices become modules in a monolith, but the data model does not change yet. This is the lowest-risk simplification because the data paths are unchanged; you are just removing the network boundary. Deploy. Monitor for 2 weeks.

Phase 2: Replace event-sourced aggregates with state-based tables one at a time. For each aggregate, create the new table, backfill from the current state (not by replaying events; just snapshot the current derived state), switch the module to read/write from the new table, verify, then decommission the event stream for that aggregate. One aggregate per sprint.

Phase 3: Collapse the separate read model. Once event sourcing is gone, the read model projections have no source. Replace them with direct queries (possibly with materialized views if the query pattern warrants it) and remove the projection pipeline.

Phase 4: Remove the saga orchestrator. Replace saga-coordinated workflows with local transactions within the monolith. Test each workflow thoroughly: this is where the highest regression risk lives, because the control flow changes from async-event-driven to synchronous-transactional.

At each phase, the system is in a working state. If any phase causes unexpected problems, stop there and investigate. The simplification is incremental, just like any good migration.”

Follow-up: How do you measure whether the simplification worked?
“Four metrics, tracked weekly before, during, and after the simplification.

Feature velocity: Average number of story points (or PRs, or features) shipped per sprint. This should increase as the architecture imposes less overhead.

Incident rate: Number of production incidents per month. This should decrease as distributed-system failure modes are eliminated.

Onboarding time: Days for a new engineer to ship their first PR. This should decrease as the codebase becomes easier to understand.

Deployment frequency: Number of production deploys per week. This should increase as the deployment process becomes simpler.

If all four improve, the simplification was correct. If feature velocity does not improve, the architecture was not the bottleneck; the problem is elsewhere (unclear requirements, slow code reviews, flaky CI). The numbers will tell you whether you made the right call.”

War Story: A B2B SaaS company ($2M ARR, 150 enterprise customers, 8 engineers) had a system built by a consulting firm that used event sourcing, CQRS, 6 microservices, and Kubernetes. Their deployment pipeline took 45 minutes. Adding a new field to an entity required changes in 4 services and took 3 days. In 6 months, we collapsed the 6 services into a modular Rails monolith, replaced event sourcing with PostgreSQL + an audit table, removed the CQRS read model in favor of database views, and replaced Kubernetes with a single Heroku dyno (their peak traffic was 50 requests per second). Deployment time dropped to 4 minutes. New field addition dropped to 2 hours. The engineering team shipped more features in the following quarter than they had in the prior two quarters combined. The CTO’s quote: “We spent 18 months building for Netflix scale and we have 150 customers.”
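The "state table plus audit table" replacement for event sourcing mentioned in the war story can be sketched as follows. This is an in-memory stand-in under stated assumptions: `Account`, `AuditRow`, and `AccountStore` are hypothetical names, and in a real system the two writes inside `update` would share one database transaction.

```typescript
// Hypothetical entity; the fields are illustrative only.
interface Account {
  id: string;
  email: string;
  plan: string;
}

type AuditedField = "email" | "plan";

// One audit row per changed field, standing in for an audit table.
interface AuditRow {
  accountId: string;
  field: AuditedField;
  oldValue: string;
  newValue: string;
  changedBy: string;
  changedAt: number;
}

class AccountStore {
  private accounts = new Map<string, Account>();
  private auditLog: AuditRow[] = [];

  insert(account: Account): void {
    this.accounts.set(account.id, { ...account });
  }

  // Overwrites current state and appends an audit row. With a real database,
  // both writes would sit inside the same transaction.
  update(id: string, field: AuditedField, newValue: string, changedBy: string, changedAt: number): void {
    const current = this.accounts.get(id);
    if (current === undefined) throw new Error(`unknown account ${id}`);
    const oldValue = current[field];
    if (oldValue === newValue) return; // nothing changed, nothing to audit
    const next: Account = { ...current };
    next[field] = newValue;
    this.accounts.set(id, next);
    this.auditLog.push({ accountId: id, field, oldValue, newValue, changedBy, changedAt });
  }

  // Read model: a direct lookup of current state, no projection pipeline.
  current(id: string): Account | undefined {
    return this.accounts.get(id);
  }

  // Audit trail: the change history for one entity.
  history(id: string): AuditRow[] {
    return this.auditLog.filter((row) => row.accountId === id);
  }
}

const store = new AccountStore();
store.insert({ id: "a-1", email: "ops@example.com", plan: "free" });
store.update("a-1", "plan", "team", "alice", 1000);
store.update("a-1", "plan", "enterprise", "bob", 2000);
const planHistory = store.history("a-1").map((row) => `${row.oldValue}->${row.newValue}`);
```

The current row answers "what is true now" and the audit rows answer "who changed what, and when", which covers the audit-trail role event sourcing was playing without requiring replay or projections.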
Q25: You inherit a codebase where half the team uses events for inter-module communication and the other half uses direct function calls. The event-based modules are harder to debug but more decoupled. The direct-call modules are easier to trace but tightly coupled. What is your strategy?
OrderPlaced triggers inventory, notifications, analytics, and loyalty points, events are the right pattern. The set of reactors will grow, and the producing module should not need to know about them. If any of these are currently direct calls, they should migrate to events.

Category 2: Point-to-point communication (one caller, one callee, stable relationship). When the CheckoutModule calls PricingModule.calculateTotal(), a direct function call is simpler, faster, and easier to debug than an event. If this is currently event-based with one subscriber, it should migrate to a direct call.

Category 3: Ambiguous (one caller today, maybe more tomorrow). This is the judgment call. My default: start with a direct call. If a second consumer appears, refactor to an event. The refactoring cost is low and you will have concrete knowledge of the actual event shape, not a guessed one.

My convergence strategy: do not mandate a single approach. Instead, establish clear criteria for when to use each. ‘Events for fan-out and temporal decoupling. Direct calls for point-to-point and synchronous needs.’ Write it in the team’s architecture decision records. Apply the criteria to new code. Migrate existing code opportunistically: when you touch a module for a feature, check whether its communication pattern matches the criteria.

The worst outcome is a ‘big-bang migration’ that converts all direct calls to events or vice versa. That is a massive effort with a high regression risk and the benefit is aesthetic consistency, not functional improvement.”

Follow-up: Six months later, the team has followed the criteria, but 3 modules are still inconsistent because nobody had a reason to touch them. Do you force the migration?
“No. Consistency for its own sake is not worth the risk and effort of migrating stable, untouched code. If those 3 modules are working, tested, and not causing problems, leave them. The cognitive cost of inconsistency is real but bounded; you can add a code comment or an ADR entry that says ‘this module uses direct calls for historical reasons; migrate to events when next modified for a feature.’

The modules that matter most are the ones that change often. If a frequently-modified module is inconsistent with the team’s criteria, fix it on the next feature change. If a stable module has not been touched in 9 months, the inconsistency is invisible to the team 99% of the time.”

Follow-up: How do you handle the security implications of each approach?
“Events create a broader data exposure surface. When OrderPlaced carries { orderId, customerId, items, total, paymentMethodLast4 } and is published to an event bus, every subscriber (including ones added later by other teams) can see that data. With a direct function call, data flows only to the explicitly called module.

For event-based communication, I would enforce data minimization: events carry the minimum data needed. The OrderPlaced event does not need paymentMethodLast4; that is only relevant to the payment module, which already has it. If a subscriber needs sensitive data, it fetches it from the source via an authenticated API call, not from the event payload.

For both approaches, I would ensure that PII in events is either encrypted or tokenized, and that event retention policies comply with GDPR/data retention requirements. Events are often stored in Kafka for weeks or months; that is a compliance surface that direct function calls do not have.”

Follow-up: What is the failure mode of each approach, and how do you design for it?
“Direct calls fail loudly and immediately. If the pricing module throws, the checkout module’s try-catch handles it synchronously. The failure is visible in the call stack, the error propagates to the user, and the system is in a known state.

Events fail silently and asynchronously. If the notification module’s event handler throws, the checkout module does not know. The event goes to a dead letter queue (if you have one) or disappears (if you do not). The user sees ‘order confirmed’ but never gets the confirmation email. The failure is only discovered when a customer complains or an alert on dead letter queue depth fires.

The design implication: for event-based paths, you need dead letter queues, retry policies, alerting on consumer lag, and a dashboard showing ‘events published vs. events successfully processed’ per consumer. For direct-call paths, standard error handling and circuit breakers are sufficient. The observability investment for event-based communication is significantly higher; if you adopt events, budget for the observability infrastructure or you will have silent failures.”
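The contrast between loud and silent failure can be made concrete with a small in-process event bus. This is a sketch under stated assumptions: `InProcessEventBus`, its retry count, and the consumer names are hypothetical, and a real broker would deliver asynchronously rather than inside `publish`.

```typescript
interface OrderPlaced {
  orderId: string;
}

type Handler<E> = (event: E) => void;

interface DeadLetter<E> {
  consumer: string;
  event: E;
  error: string;
  attempts: number;
}

// Minimal bus with per-consumer retries, a dead letter queue, and the
// "published vs. successfully processed" counters the answer asks for.
class InProcessEventBus<E> {
  publishedCount = 0;
  readonly processed = new Map<string, number>();
  readonly deadLetters: DeadLetter<E>[] = [];
  private handlers: { consumer: string; handle: Handler<E> }[] = [];
  private maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  subscribe(consumer: string, handle: Handler<E>): void {
    this.handlers.push({ consumer, handle });
    this.processed.set(consumer, 0);
  }

  // The publisher never sees a consumer failure: that is the silent part.
  publish(event: E): void {
    this.publishedCount += 1;
    for (const { consumer, handle } of this.handlers) {
      this.deliver(consumer, handle, event);
    }
  }

  private deliver(consumer: string, handle: Handler<E>, event: E): void {
    let lastError = "";
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        handle(event);
        this.processed.set(consumer, (this.processed.get(consumer) ?? 0) + 1);
        return;
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }
    // Retries exhausted: park the event instead of losing it.
    this.deadLetters.push({ consumer, event, error: lastError, attempts: this.maxAttempts });
  }
}

const bus = new InProcessEventBus<OrderPlaced>(3);
const analyticsSeen: string[] = [];
bus.subscribe("analytics", (event) => { analyticsSeen.push(event.orderId); });
bus.subscribe("notifications", () => { throw new Error("smtp unavailable"); });

bus.publish({ orderId: "o-1" }); // returns normally even though one consumer failed
```

After the publish, the only evidence of the failed confirmation email is the dead letter entry and the gap between the published count and the notifications consumer's processed count, which is why those two signals need alerts.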
Q26: Your team shipped a hexagonal architecture 18 months ago. The business logic layer is now 400 lines of code, while the port interfaces, adapter implementations, and DI wiring total 2,200 lines. A new hire says 'this feels backwards.' Are they right?
Follow-up: What would you simplify first in this codebase?
“The port interfaces that have exactly one adapter and no test fake. These are the highest-cost, lowest-value abstractions. Each one is an interface file, an implementation file, and a DI binding: three files that exist to abstract something that is never swapped. I would inline them one at a time, starting with the most frequently navigated ones (whatever developers interact with most during feature development). Each inlining is a small, safe PR.”

Follow-up: How would you measure the cost of this architecture?
“Three measurements.

First, navigation depth. Time a developer tracing a request from API entry to database query. In the hexagonal setup: controller -> use case port -> use case implementation -> repository port -> repository implementation -> ORM -> database. That is 6 hops. In a simplified architecture: controller -> service -> repository -> database. That is 3 hops. Each hop is a ‘Go to Definition’ click and a context switch.

Second, feature development overhead. When a new feature requires a new query, count how many files are touched. In hexagonal: add method to port interface, add method to adapter implementation, add method to test fake, update DI wiring, call from use case. That is 5 files. In simplified: add method to repository, call from service. That is 2 files.

Third, incident debugging time. During an outage, how long does it take to trace from ‘this endpoint is returning wrong data’ to ‘the bug is in this function’? More abstraction layers means more hops to trace.”

Follow-up: What is the rollback strategy if the simplification causes problems?
“Each simplification is a small PR that inlines one abstraction. If it causes a test failure or an unexpected behavior, revert the single PR. There is no big-bang rollback needed because the simplification is incremental.

The riskiest simplification is removing a port interface that a test fake depends on. Before removing it, I check whether the fake is actually used. If it is, I keep the port (it is earning its keep). If the fake exists but no test references it, I delete both the fake and the port. Dead test infrastructure is still dead code.”

War Story: A fintech startup (12 engineers) adopted hexagonal architecture for their loan origination system because the CTO read ‘Get Your Hands Dirty on Clean Architecture.’ After 18 months, the system had 14 port interfaces, 14 production adapters, 14 in-memory test fakes, and 14 DI binding configurations. The actual business logic (loan eligibility calculations, risk scoring, and disbursement rules) was 600 lines. A new hire took 3 weeks to understand the code structure before making their first PR. During a simplification initiative, the team discovered that 9 of the 14 in-memory fakes had drifted from the real adapter behavior and were not catching bugs; they were masking them. They kept the 5 ports where the fakes were genuinely useful (payment gateway, credit score API, KYC provider, document store, notification sender) and collapsed the other 9. Onboarding time dropped from 3 weeks to 1 week. Importantly, the 5 remaining ports continued to provide excellent test isolation for the complex integration points.
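For reference, this is roughly what one of the ports that "earns its keep" looks like, using the credit score API from the war story. All names here (`CreditScorePort`, `InMemoryCreditScores`, `isEligibleForLoan`) and the eligibility thresholds are hypothetical, chosen only to show the port, the fake, and the business logic that depends on the port alone.

```typescript
// Port: what the business logic needs from the outside world.
interface CreditScorePort {
  scoreFor(applicantId: string): number;
}

// Test fake: an in-memory adapter that replaces the real credit score API.
class InMemoryCreditScores implements CreditScorePort {
  private scores = new Map<string, number>();

  set(applicantId: string, score: number): void {
    this.scores.set(applicantId, score);
  }

  scoreFor(applicantId: string): number {
    const score = this.scores.get(applicantId);
    if (score === undefined) throw new Error(`no score on file for ${applicantId}`);
    return score;
  }
}

// Business logic: depends only on the port, never on a concrete adapter.
// The thresholds are made up for illustration, not a real lending policy.
function isEligibleForLoan(applicantId: string, amount: number, creditScores: CreditScorePort): boolean {
  const score = creditScores.scoreFor(applicantId);
  if (score >= 700) return true;
  return score >= 620 && amount <= 5000;
}

// A fast test needs no network and no vendor sandbox.
const fakeScores = new InMemoryCreditScores();
fakeScores.set("applicant-1", 720);
fakeScores.set("applicant-2", 640);

const decisions = [
  isEligibleForLoan("applicant-1", 20000, fakeScores),
  isEligibleForLoan("applicant-2", 5000, fakeScores),
  isEligibleForLoan("applicant-2", 20000, fakeScores),
];
```

The fake is only worth keeping while it behaves like the real adapter; the drift described in the war story is what happens when nothing checks that the two still agree.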
Q27: Walk me through a design pattern rollout plan. You have decided to adopt the Outbox Pattern for your team's event publishing. How do you go from 'decision made' to 'running in production' without incident?
- Create the outbox table schema. Fields: id (UUID), aggregate_type, aggregate_id, event_type, payload (JSONB), created_at, published (boolean), published_at. Add an index on (published, created_at) for the relay query.
- Write the relay process. Start with a polling relay (simplest to implement and debug). Query unpublished rows, publish to Kafka, mark as published.
- Write the cleanup job. Delete published rows older than 24 hours.
- Security review: the outbox table contains event payloads that may include PII. Ensure the table is encrypted at rest, access is restricted to the service’s database user, and the cleanup job’s retention period complies with data retention policies.
- Deploy the outbox table and relay process to production, but do not write to the outbox yet.
- Run the relay process against an empty table to verify it starts, polls, and handles the ‘no rows’ case gracefully without logging errors or consuming excessive resources.
- Monitor: relay process CPU, memory, database connection usage, Kafka producer metrics.
- Update the service to write to BOTH the outbox table (inside the existing transaction) AND publish directly to Kafka (the current path). This is a temporary dual-write.
- The relay process is also reading the outbox and publishing. So events may be published twice — by the direct path and by the relay. Consumers must be idempotent (they should be already, but verify).
- Monitor: compare events published directly vs. events published by the relay. They should match. Any discrepancy indicates a bug in the outbox writing or relay logic.
- This stage gives you confidence that the outbox path produces identical events to the direct path.
- Remove the direct Kafka publish. The service now only writes to the outbox table. The relay is the sole publisher.
- Monitor aggressively for the first 4 hours: consumer lag, event delivery latency (time from outbox insert to Kafka publish), error rates.
- Rollback plan: re-enable the direct publish path. This is a feature flag toggle, not a code revert. Design the dual-write stage with a feature flag so cutover and rollback are instant.
- Track: events published per hour (should match pre-cutover), consumer lag (should be stable), end-to-end event latency (should be slightly higher — the relay adds 100-500ms), error rate (should be zero or near-zero), outbox table size (should be bounded by the cleanup job).
- Success criteria: zero lost events over 2 weeks, latency increase under 1 second, no incidents attributable to the outbox path.
- Remove the feature flag and the dual-write code path.
- Document the outbox pattern in the team’s ADRs with the measurement results.
- Update runbooks: ‘if event delivery stops, check the relay process first, then the outbox table for unpublished rows.’”
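The core mechanics of the rollout above (write the outbox row inside the business transaction, let a polling relay publish it, clean up afterwards) can be sketched in memory. Everything here is an assumption made for illustration: `FakeDatabase` stands in for the real database, the `publish` callback stands in for a Kafka producer, and a numeric counter replaces the UUID so the sketch has no dependencies.

```typescript
interface OutboxRow {
  id: number; // the schema above uses a UUID; a counter keeps this sketch self-contained
  aggregateType: string;
  aggregateId: string;
  eventType: string;
  payload: string; // JSON text, standing in for JSONB
  createdAt: number;
  published: boolean;
  publishedAt: number | null;
}

interface Order {
  id: string;
  total: number;
}

class FakeDatabase {
  orders: Order[] = [];
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // All-or-nothing for inserts: if fn throws, both tables are restored.
  transaction(fn: (tx: FakeDatabase) => void): void {
    const ordersBefore = this.orders.slice();
    const outboxBefore = this.outbox.slice();
    try {
      fn(this);
    } catch (err) {
      this.orders = ordersBefore;
      this.outbox = outboxBefore;
      throw err;
    }
  }

  addOutboxRow(aggregateType: string, aggregateId: string, eventType: string, payload: string, now: number): void {
    this.outbox.push({
      id: this.nextId++, aggregateType, aggregateId, eventType, payload,
      createdAt: now, published: false, publishedAt: null,
    });
  }
}

// The business write and the outbox write commit or roll back together.
function placeOrder(db: FakeDatabase, order: Order, now: number, failBeforeCommit = false): void {
  db.transaction((tx) => {
    tx.orders.push(order);
    tx.addOutboxRow("Order", order.id, "OrderPlaced", JSON.stringify({ orderId: order.id, total: order.total }), now);
    if (failBeforeCommit) throw new Error("simulated failure before commit");
  });
}

// One polling pass: oldest unpublished rows first, mark each as published after sending.
function relayOnce(db: FakeDatabase, publish: (row: OutboxRow) => void, now: number, batchSize = 100): number {
  const pending = db.outbox
    .filter((row) => !row.published)
    .sort((a, b) => a.createdAt - b.createdAt || a.id - b.id)
    .slice(0, batchSize);
  let sent = 0;
  for (const row of pending) {
    publish(row); // if this throws, the row stays unpublished and is retried on the next poll
    row.published = true;
    row.publishedAt = now;
    sent += 1;
  }
  return sent;
}

// Cleanup job: delete published rows older than the retention window.
function cleanupPublished(db: FakeDatabase, now: number, retentionMs: number): number {
  const before = db.outbox.length;
  db.outbox = db.outbox.filter(
    (row) => !(row.published && row.publishedAt !== null && now - row.publishedAt > retentionMs),
  );
  return before - db.outbox.length;
}

const HOUR_MS = 60 * 60 * 1000;
const db = new FakeDatabase();
const topic: string[] = []; // stand-in for the Kafka topic

placeOrder(db, { id: "o-1", total: 40 }, 1000);
let rollbackError = "";
try {
  placeOrder(db, { id: "o-2", total: 15 }, 2000, true);
} catch (err) {
  rollbackError = err instanceof Error ? err.message : String(err);
}

const sentFirst = relayOnce(db, (row) => { topic.push(row.payload); }, 3000);
const sentSecond = relayOnce(db, (row) => { topic.push(row.payload); }, 4000);
const removed = cleanupPublished(db, 3000 + 25 * HOUR_MS, 24 * HOUR_MS);
```

Because a row is marked as published only after the send succeeds, a crash between the two steps causes a re-send on the next poll. That at-least-once behavior is why the rollout insists that consumers be idempotent.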
Follow-up: What is the failure mode of the outbox pattern, and how do you detect it?
“Three failure modes.

Relay process dies. The outbox table fills up with unpublished rows. Events stop reaching Kafka. Consumers see no new events. Detection: alert on ‘oldest unpublished outbox row age.’ If any row has been unpublished for more than 5 minutes, page the on-call. This is the highest-severity failure.

Relay process falls behind. The outbox table is being written to faster than the relay can publish. Unpublished row count grows. Event delivery latency increases. Detection: alert on ‘unpublished row count’ exceeding a threshold (e.g., 1,000 rows). The fix: scale the relay (run multiple instances partitioned by aggregate type) or batch publishes (fetch multiple rows in one query, publish as a batch).

Database transaction rolls back but the relay already read the row. This should not happen because the relay only reads committed rows (assuming READ COMMITTED isolation level or higher). But if the relay is configured with a lower isolation level, or if the database connection pool has stale transactions, you could read uncommitted outbox rows. Detection: consumers receive events for entities that do not exist. Prevention: ensure the relay’s database connection uses READ COMMITTED.”

Follow-up: What is the cost of this pattern?
“Four ongoing costs.

Operational cost: The relay process is infrastructure to monitor, alert on, and maintain. If it is a dedicated process, it needs its own health check, logging, and deployment pipeline. If it is a cron job, it needs monitoring for missed runs.

Latency cost: Events are no longer published in the same request cycle. The relay adds 100ms-1s of latency between the write and the Kafka publish. For most use cases, this is acceptable. For real-time requirements (live dashboards, instant notifications), it may require a synchronous publish in addition to the outbox.

Storage cost: The outbox table grows with every write. The cleanup job must keep pace. At 10,000 events/hour with a 24-hour retention, the table stays around 240,000 rows, which is manageable. At 1 million events/hour, the table management becomes a significant DBA concern.

Complexity cost: Every developer writing a new event must remember to write to the outbox inside the transaction, not publish directly. This is a convention that must be enforced through code review, linting, or a shared library that wraps the pattern.”

Follow-up: How do you secure the event payloads in the outbox table?
“The outbox table is a copy of event data sitting in your database, often in plaintext JSONB. Three security considerations.

First, PII in payloads. If events contain customer names, emails, or payment details, the outbox table is a PII store. Encrypt the payload column at the application level (not just database-level encryption at rest) if your compliance requirements demand it. The relay decrypts before publishing to Kafka, or publishes encrypted and consumers decrypt.

Second, access control. The outbox table should be readable only by the relay process’s database user. Other services or reporting tools should not query it directly; it is an internal implementation detail.

Third, retention. The cleanup job’s retention window must comply with data retention policies. If GDPR requires that deleted user data is purged within 30 days, outbox rows containing that user’s data must be cleaned up within the same window. The cleanup job should be aware of deletion events and proactively clean related outbox rows.”
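Returning to the failure-mode answer earlier in this question, its two detection rules (oldest unpublished row age, unpublished row count) reduce to a small check that a monitoring job could run. This is a sketch: `checkOutboxHealth` and its alert strings are hypothetical, and the thresholds are passed in rather than fixed, with the usage below plugging in the 5 minutes and 1,000 rows quoted in the text.

```typescript
// The two columns the checks need from each outbox row.
interface OutboxRowStatus {
  createdAt: number; // epoch milliseconds
  published: boolean;
}

interface OutboxHealth {
  unpublishedCount: number;
  oldestUnpublishedAgeMs: number; // 0 when nothing is pending
  alerts: string[];
}

function checkOutboxHealth(rows: OutboxRowStatus[], now: number, maxAgeMs: number, maxBacklog: number): OutboxHealth {
  let unpublishedCount = 0;
  let oldestCreatedAt = now;
  for (const row of rows) {
    if (row.published) continue;
    unpublishedCount += 1;
    if (row.createdAt < oldestCreatedAt) oldestCreatedAt = row.createdAt;
  }
  const oldestUnpublishedAgeMs = now - oldestCreatedAt;

  const alerts: string[] = [];
  // Relay dead or stuck: something has waited too long to be published.
  if (oldestUnpublishedAgeMs > maxAgeMs) alerts.push("page: oldest unpublished row exceeds age limit");
  // Relay alive but slower than the write rate: the backlog keeps growing.
  if (unpublishedCount > maxBacklog) alerts.push("warn: unpublished backlog exceeds threshold");
  return { unpublishedCount, oldestUnpublishedAgeMs, alerts };
}

const MINUTE_MS = 60 * 1000;
const checkTime = 10 * MINUTE_MS;

// One row published long ago, one row still pending after 6 minutes.
const stalled = checkOutboxHealth(
  [
    { createdAt: 1 * MINUTE_MS, published: true },
    { createdAt: 4 * MINUTE_MS, published: false },
  ],
  checkTime,
  5 * MINUTE_MS,
  1000,
);

// Nothing pending: no alerts.
const healthy = checkOutboxHealth([{ createdAt: 9 * MINUTE_MS, published: true }], checkTime, 5 * MINUTE_MS, 1000);
```

Keeping the age check separate from the count check matters because the two alerts point at different fixes: restart the relay in the first case, scale or batch it in the second.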
Q28: You are reviewing an existing codebase and notice the team adopted the Saga pattern for a workflow that runs entirely within a single service and a single database. The workflow has 4 steps, all of which could be wrapped in a database transaction. What is your assessment?
| Dimension | Database transaction | Saga in same database |
|---|---|---|
| Atomicity | Guaranteed by the database | You must write compensating actions for each step |
| Consistency | Immediate | Eventual (the saga state is intermediate during execution) |
| Error handling | Rollback is automatic | Each step needs explicit undo logic |
| Code complexity | ~10 lines (begin, steps, commit/rollback) | ~200 lines (orchestrator, state machine, compensations) |
| Failure modes | Transaction fails or succeeds | Saga can get stuck in intermediate states |
| Debugging | ‘Did the transaction commit?’ | ‘What state is the saga in? Did compensation succeed?’ |
BEGIN; ... COMMIT; does natively.

My recommendation: replace the saga with a database transaction. This is not a refactoring; it is a simplification. The test suite can verify the same business rules with less infrastructure.”

Follow-up: The developer who wrote the saga says ‘we might split this into microservices later, so the saga is future-proofing.’ How do you respond?
“The saga is future-proofing for a future that may never arrive, at a daily cost that is paid today. If and when the workflow splits across services, the saga is the right pattern. But the saga written for a single-service workflow will need to be rewritten anyway, because the service boundaries, the communication protocol, and the failure modes are different.

A saga within one service uses local function calls for each step. A saga across services uses HTTP or message queue calls. The retry logic is different. The compensation logic is different. The monitoring is different. You are not saving future work by building the saga now; you are building the wrong saga.

The pragmatic approach: use a database transaction today. If the service splits, write a new saga designed for the actual service boundaries that exist at that point. The transaction-to-saga refactoring is well-understood and can be done in a focused sprint.”

Follow-up: What would you simplify first, and how would you measure success?
“First simplification: replace the saga orchestrator with a single function that wraps the 4 steps in a database transaction. One function, one transaction, one rollback on failure.

Measurement:
- Lines of code: The saga implementation (orchestrator, compensations, state persistence) is probably 200-400 lines. The transaction wrapper is 20-40 lines. A 10x reduction in code for identical behavior.
- Failure modes eliminated: Stuck sagas, failed compensations, intermediate states — all gone. The only failure mode is ‘transaction commits’ or ‘transaction rolls back.’ Binary, not a state machine.
- Operational overhead removed: No more stuck-saga monitoring, no more dead letter queue for failed compensations, no more saga state table in the database.
- Developer velocity: Time to add a 5th step to the workflow. In the saga: write the step, write the compensation, update the state machine, add idempotency handling, add monitoring. In the transaction: add one function call inside the transaction block.”
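The "one function, one transaction, one rollback" shape can be sketched as follows. This is an in-memory stand-in, not a database client: `inTransaction` plays the role of BEGIN ... COMMIT by running the steps against a private copy, and the four steps and the `ShopState` fields are hypothetical.

```typescript
interface ShopState {
  stock: number;
  balanceCents: number;
  orders: string[];
  confirmationsQueued: string[];
}

// Stand-in for BEGIN ... COMMIT: the steps run against a private copy, and the
// copy replaces the live state only if every step succeeds.
function inTransaction(state: ShopState, steps: (tx: ShopState) => void): void {
  const draft: ShopState = {
    stock: state.stock,
    balanceCents: state.balanceCents,
    orders: state.orders.slice(),
    confirmationsQueued: state.confirmationsQueued.slice(),
  };
  steps(draft); // a throw here leaves `state` untouched: that is the rollback
  state.stock = draft.stock; // commit
  state.balanceCents = draft.balanceCents;
  state.orders = draft.orders;
  state.confirmationsQueued = draft.confirmationsQueued;
}

// The whole 4-step workflow: no orchestrator, no compensations, no saga state table.
function runCheckout(state: ShopState, orderId: string, priceCents: number): void {
  inTransaction(state, (tx) => {
    // Step 1: reserve inventory
    if (tx.stock < 1) throw new Error("out of stock");
    tx.stock -= 1;
    // Step 2: charge the customer
    if (tx.balanceCents < priceCents) throw new Error("insufficient funds");
    tx.balanceCents -= priceCents;
    // Step 3: record the order
    tx.orders.push(orderId);
    // Step 4: queue the confirmation
    tx.confirmationsQueued.push(orderId);
  });
}

const shop: ShopState = { stock: 2, balanceCents: 5000, orders: [], confirmationsQueued: [] };

runCheckout(shop, "o-1", 3000); // all four steps commit together

let failure = "";
try {
  runCheckout(shop, "o-2", 3000); // step 2 fails, so step 1's reservation is undone too
} catch (err) {
  failure = err instanceof Error ? err.message : String(err);
}
```

Adding a fifth step is one more statement inside the callback, and the failed second checkout shows why no compensating action is needed: the reservation made in step 1 never becomes visible.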