Choosing the right architecture pattern is crucial for building scalable, maintainable systems. Each pattern has trade-offs, and the best choice depends on your specific context: team size, domain complexity, scale requirements, and organizational structure. Think of architecture patterns like city planning: a small town does not need a highway interchange, and a metropolis cannot survive on dirt roads. The pattern must fit the traffic.
Conway’s Law: “Organizations design systems that mirror their own communication structure.” Your architecture should align with your team structure. Amazon famously reorganized into “two-pizza teams” — small, autonomous groups — and their architecture naturally evolved into independent microservices. The org chart became the system diagram.
Best For: Startups, MVPs, small teams, simple domains. Shopify ran as a monolith for years and scaled to billions in GMV before selectively extracting services. The monolith is not a stepping stone you outgrow — it is a legitimate architecture that, when well-structured, can serve enormous scale.
```
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│  User   │  │  Order  │  │ Payment │  │  Notif  │
│ Service │  │ Service │  │ Service │  │ Service │
└────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
     │            │            │            │
     └────────────┴─────┬──────┴────────────┘
                        │
              ┌─────────┴─────────┐
              │    API Gateway    │
              └───────────────────┘
                        │
              ┌─────────┴─────────┐
              │      Clients      │
              └───────────────────┘
```
✅ Pros
Independent scaling
Technology diversity
Fault isolation
Faster deployments
❌ Cons
Distributed system complexity
Network latency
Data consistency challenges
Operational overhead
Best For: Large teams, complex domains, high scale requirements. Netflix, Uber, and Amazon run thousands of microservices, but each started with a monolith and decomposed only after hitting real scaling walls. The rule of thumb: if you do not have at least 50 engineers and clearly bounded domains, microservices will likely slow you down rather than speed you up.
Two fundamental approaches exist, and choosing the wrong one is a frequent source of production pain. Synchronous calls are like phone calls — you wait on the line until the other party responds. Asynchronous messaging is like dropping a letter in the mailbox — you move on with your day and trust it will be handled.
```python
# Synchronous (HTTP/REST) -- Simple but creates tight coupling.
# If order-service is down, this entire call fails. Use when you
# need an immediate response (e.g., checkout flow showing order total).
import requests

def get_user_orders(user_id):
    user = requests.get(f"http://user-service/users/{user_id}").json()
    orders = requests.get(f"http://order-service/orders?user={user_id}").json()
    return {"user": user, "orders": orders}

# Asynchronous (Message Queue) -- Decoupled and resilient.
# The publisher does not care if consumers are slow or temporarily down.
# Use for tasks that do not need an immediate response (e.g., sending
# confirmation emails, updating analytics, syncing inventory).
import json
import pika

def publish_order_created(channel, order):
    # `channel` is an open pika channel, e.g.
    # pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
    channel.basic_publish(
        exchange='orders',
        routing_key='order.created',
        body=json.dumps(order)
    )
```
Practical tip: Default to async messaging between services and only use synchronous calls when the user is actively waiting for the result. This single decision prevents most cascading failure scenarios.
Instead of storing the current state of an entity (like a bank balance), you store every event that led to that state (every deposit and withdrawal). Think of it like an accounting ledger: you never erase entries, you only append new ones. To know the current balance, you replay the ledger from the beginning. This gives you a complete, auditable history and the ability to reconstruct state at any point in time.
```python
# Store events, not state -- every change is an immutable fact that happened.
# This is how bank ledgers, Git version control, and event-sourced systems work.
from datetime import datetime

class OrderEventStore:
    def __init__(self):
        self.events = []  # Append-only log of everything that has happened

    def append(self, event):
        # Events are immutable facts -- never update or delete them.
        self.events.append({
            "type": event.type,
            "data": event.data,
            "timestamp": datetime.now()
        })

    def get_order_state(self, order_id):
        # Replay all events to reconstruct current state.
        # For production systems, use snapshots to avoid replaying
        # thousands of events on every read (snapshot every N events).
        order = Order()  # Domain class with an apply(event) method, defined elsewhere
        for event in self.events:
            if event["data"]["order_id"] == order_id:
                order.apply(event)
        return order

# Event types -- each represents something that already happened (past tense).
# Illustrative only; the event classes are defined elsewhere.
# OrderCreated(order_id="123", items=[...])
# PaymentReceived(order_id="123", amount=100)
# OrderShipped(order_id="123", tracking="XYZ")
```
When to use Event Sourcing: Audit-critical domains (finance, healthcare, legal), systems that need temporal queries (“what was the inventory at 3 PM Tuesday?”), or when you need to rebuild read models from events. When to avoid it: simple CRUD applications where the added complexity is not justified by the benefits.
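The ledger analogy and the temporal-query use case can be sketched in a few lines. This is a minimal in-memory illustration, not a production event store: `LedgerEvent`, `balance_as_of`, and the sample amounts and dates are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LedgerEvent:
    """An immutable fact: money moved at a point in time (hypothetical type)."""
    kind: str      # "deposit" or "withdrawal"
    amount: int    # in cents, to avoid floating-point rounding
    at: datetime


def balance_as_of(events, moment):
    """Replay events up to and including `moment` to reconstruct the balance."""
    balance = 0
    for event in sorted(events, key=lambda e: e.at):
        if event.at > moment:
            break
        balance += event.amount if event.kind == "deposit" else -event.amount
    return balance


ledger = [
    LedgerEvent("deposit", 10_000, datetime(2024, 1, 1, 9, 0)),
    LedgerEvent("withdrawal", 2_500, datetime(2024, 1, 2, 14, 0)),
    LedgerEvent("deposit", 4_000, datetime(2024, 1, 3, 11, 0)),
]

current = balance_as_of(ledger, datetime.max)
on_jan_2 = balance_as_of(ledger, datetime(2024, 1, 2, 23, 59))
```

Because nothing is ever overwritten, the same replay function answers both "what is the balance now?" and "what was it on January 2?". A snapshot would simply cache the result of a replay at some point so later reads start from there.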
The most intuitive architecture pattern — like a layer cake where each layer only talks to the one directly below it. Most web frameworks (Django, Spring, Rails) naturally guide you toward this pattern. It works well until your application grows complex enough that the layers become bloated “god layers” with tangled responsibilities.
```python
# Clean Architecture Example
# The key rule: dependencies always point inward. The domain layer
# knows nothing about databases, HTTP, or frameworks. This makes
# your business logic testable without spinning up infrastructure.

# Domain Layer (innermost - ZERO external dependencies)
# This code should work identically whether your app uses
# PostgreSQL, MongoDB, or stores data in a spreadsheet.
class Order:
    def __init__(self, id, items):
        self.id = id
        self.items = items
        self.status = "pending"

    def calculate_total(self):
        return sum(item.price for item in self.items)

# Application Layer (use cases -- orchestrates domain objects)
# Contains application-specific business rules but stays ignorant
# of how data is persisted or how payments are actually processed.
class CreateOrderUseCase:
    def __init__(self, order_repo, payment_service):
        self.order_repo = order_repo            # Interface, not concrete class
        self.payment_service = payment_service  # Interface, not concrete class

    def execute(self, order_data):
        order = Order(**order_data)
        self.payment_service.process(order.calculate_total())
        self.order_repo.save(order)
        return order

# Infrastructure Layer (outermost -- the "dirty" details)
# Swap PostgresOrderRepository for MongoOrderRepository and
# the domain and application layers do not change at all.
class PostgresOrderRepository:
    def save(self, order):
        # SQL implementation details live here, nowhere else
        pass

class StripePaymentService:
    def process(self, amount):
        # Stripe API call -- if you switch to Braintree, only this file changes
        pass
```
The core business logic is isolated from external concerns through ports (interfaces) and adapters (implementations). Think of it like a power outlet: the port defines the shape of the plug (the interface), and different adapters (US, EU, UK plugs) can connect without changing the device. Your domain logic is the device — it does not care where the electricity comes from.
```python
# Port (Interface) -- defines WHAT the domain needs, not HOW it's done.
# This is the "socket shape" that adapters must match.
# `User` is the domain entity, defined elsewhere, with validate(),
# to_dict(), and from_dict().
from abc import ABC, abstractmethod

class UserRepository(ABC):
    @abstractmethod
    def save(self, user: User) -> None:
        pass

    @abstractmethod
    def find_by_id(self, user_id: str) -> User:
        pass

# Adapter (Implementation) -- the concrete plug that fits the socket.
# You can swap PostgresUserRepository for InMemoryUserRepository in
# tests without touching any business logic.
class PostgresUserRepository(UserRepository):
    def __init__(self, connection):
        self.connection = connection

    def save(self, user: User) -> None:
        self.connection.execute(
            "INSERT INTO users ...",
            user.to_dict()
        )

    def find_by_id(self, user_id: str) -> User:
        row = self.connection.query("SELECT * FROM users WHERE id = ?", user_id)
        return User.from_dict(row)

# Domain service (core business logic, no external dependencies).
# Notice: UserService depends on the abstract UserRepository port,
# never on PostgresUserRepository directly. This is the power of
# hexagonal architecture -- your domain is framework-agnostic.
class UserService:
    def __init__(self, user_repo: UserRepository):  # Inject port, not adapter
        self.user_repo = user_repo

    def register_user(self, name: str, email: str) -> User:
        user = User(name=name, email=email)
        user.validate()  # Business rule enforcement stays in the domain
        self.user_repo.save(user)
        return user
```
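To make the "swap the adapter in tests" claim concrete, here is a self-contained sketch with an in-memory adapter. It redefines a simplified port and service so it runs on its own. The `User` dataclass, its generated `id`, and the email rule in `validate()` are assumptions made for the example, not part of the code above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class User:
    name: str
    email: str
    id: str = field(default_factory=lambda: uuid4().hex)

    def validate(self) -> None:
        if "@" not in self.email:  # hypothetical business rule for the sketch
            raise ValueError("invalid email")


class UserRepository(ABC):  # the port
    @abstractmethod
    def save(self, user: User) -> None: ...

    @abstractmethod
    def find_by_id(self, user_id: str) -> Optional[User]: ...


class InMemoryUserRepository(UserRepository):  # a test adapter
    def __init__(self) -> None:
        self._users = {}  # user id -> User

    def save(self, user: User) -> None:
        self._users[user.id] = user

    def find_by_id(self, user_id: str) -> Optional[User]:
        return self._users.get(user_id)


class UserService:
    def __init__(self, user_repo: UserRepository) -> None:
        self.user_repo = user_repo

    def register_user(self, name: str, email: str) -> User:
        user = User(name=name, email=email)
        user.validate()
        self.user_repo.save(user)
        return user


repo = InMemoryUserRepository()
service = UserService(repo)
registered = service.register_user("Ada", "ada@example.com")
```

The service is exercised end to end without a database, and nothing in `UserService` would change if a Postgres-backed adapter were injected instead.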
For maintaining data consistency across microservices without distributed transactions. In a monolith, you wrap everything in a database transaction and either all changes commit or all roll back. In microservices, there is no single database to wrap. The Saga pattern solves this by breaking a distributed transaction into a sequence of local transactions, each with a compensating action (an “undo”) if something downstream fails. Think of it like booking a vacation: if the hotel reservation succeeds but the flight booking fails, you cancel the hotel reservation (the compensating action).
In the orchestration approach, a central coordinator (the “orchestra conductor”) tells each service what to do and handles compensation when things fail. This is easier to understand and debug than choreography but creates a single point of coordination.
```python
# Order, InventoryError, and SagaResult are assumed to be defined elsewhere.
class OrderSagaOrchestrator:
    """Central coordinator that manages the distributed transaction.
    Each step has a compensating action -- if step N fails, we undo
    steps N-1, N-2, ... in reverse order."""

    def __init__(self, payment_service, inventory_service, notification_service):
        self.payment = payment_service
        self.inventory = inventory_service
        self.notification = notification_service

    def execute(self, order: Order):
        try:
            # Step 1: Process payment (can fail if card declined)
            payment_id = self.payment.process(order.total)

            # Step 2: Reserve inventory (can fail if out of stock)
            try:
                self.inventory.reserve(order.items)
            except InventoryError:
                # Compensate step 1: refund the payment we already took
                self.payment.refund(payment_id)
                raise

            # Step 3: Send confirmation (best-effort). A failure here must
            # not fail the saga: payment and inventory are already committed.
            try:
                self.notification.send_confirmation(order)
            except Exception:
                pass  # In production: log the error and queue a retry

            return SagaResult.success(order.id)
        except Exception as e:
            return SagaResult.failure(str(e))
```
Choreography vs. Orchestration: Choreography (event-based, no central coordinator) scales better and avoids a single point of failure, but is harder to debug because the flow is implicit. Orchestration (central coordinator) is easier to reason about but introduces coupling to the orchestrator. For most teams, start with orchestration — you can always move to choreography later when you understand the domain boundaries well.
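For contrast with the orchestrator, the choreographed version of the same flow might look like the sketch below. The `EventBus` class is a tiny in-process stand-in for a real message broker, and the service functions, event names, and stock data are hypothetical. No single function describes the whole flow; it emerges from the subscriptions.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # every published event type, in order

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
stock = {"widget": 1}


def payment_service(order):
    # Reacts to OrderCreated; always succeeds in this sketch.
    bus.publish("PaymentProcessed", order)


def inventory_service(order):
    # Reacts to PaymentProcessed; may fail and trigger compensation.
    if stock.get(order["item"], 0) > 0:
        stock[order["item"]] -= 1
        bus.publish("InventoryReserved", order)
    else:
        bus.publish("InventoryFailed", order)


def refund_service(order):
    # Compensating action: reacts to InventoryFailed.
    bus.publish("PaymentRefunded", order)


bus.subscribe("OrderCreated", payment_service)
bus.subscribe("PaymentProcessed", inventory_service)
bus.subscribe("InventoryFailed", refund_service)

bus.publish("OrderCreated", {"id": "1", "item": "widget"})  # succeeds
bus.publish("OrderCreated", {"id": "2", "item": "widget"})  # out of stock
```

To understand what happens to an order you have to read the subscriptions and trace the event log, which is exactly the debugging cost described above.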
Prevent cascading failures in distributed systems. Named after electrical circuit breakers in your house: when too much current flows (too many failures), the breaker trips (opens) and stops all current (requests) to prevent a fire (cascading system failure). After a cooldown period, it lets a small test current through (half-open state) to see if the problem is resolved.

Without a circuit breaker, a single failing downstream service can exhaust your thread pool and connection pool, causing your healthy service to fail too: a cascading failure that can take down your entire system in minutes.
```python
import time
from enum import Enum

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation -- requests flow through
    OPEN = "open"            # Failing -- reject requests immediately (fail fast)
    HALF_OPEN = "half_open"  # Recovery test -- allow one request to probe health

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage -- wrap any unreliable external call with a circuit breaker.
# In production, use battle-tested libraries like resilience4j (Java),
# Polly (.NET), or pybreaker (Python) instead of rolling your own.
# `external_service` and `cached_data` are placeholders defined elsewhere.
circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=60)
try:
    result = circuit.call(external_service.fetch_data)
except CircuitOpenError:
    # Graceful degradation: serve stale cached data rather than
    # showing an error page. Users get slightly outdated data,
    # which is almost always better than no data at all.
    result = cached_data  # Fallback
```
Practical tip: Combine circuit breakers with retries (with exponential backoff) and timeouts. The three together form the “resilience trinity” for any distributed system. Netflix’s Hystrix popularized this pattern, and modern alternatives like Istio handle it at the infrastructure layer via service mesh.
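The retry half of that combination can be sketched as follows. This is a minimal illustration under stated assumptions: `retry_with_backoff` and `flaky` are hypothetical names, the `sleep` parameter exists only so the example can record delays instead of waiting, and real implementations usually add random jitter and a per-call timeout.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}
delays = []  # recorded instead of actually sleeping


def flaky():
    """Fails twice, then succeeds (simulates a transient outage)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
```

One reasonable arrangement is to put the retries inside the function the circuit breaker wraps, so that a burst of retried failures counts as a single failure toward the breaker's threshold. Whether that fits depends on how aggressive you want the breaker to be.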
Document important architectural decisions. ADRs answer the question future engineers will inevitably ask: “Why on earth did we build it this way?” Without ADRs, architectural knowledge lives only in the heads of people who may leave the team. GitHub, Spotify, and ThoughtWorks all use ADRs extensively. Each ADR is a short document (one page) capturing the context, decision, and consequences at the time the decision was made.
```markdown
# ADR-001: Use PostgreSQL for Primary Database

## Status
Accepted

## Context
We need a database for our e-commerce platform that handles:
- Complex queries with JOINs
- ACID transactions for orders
- ~1M daily transactions

## Decision
Use PostgreSQL as our primary database.

## Consequences

### Positive
- Strong ACID compliance
- Rich query capabilities
- Mature ecosystem

### Negative
- Horizontal scaling is complex
- May need read replicas for scale

### Neutral
- Team has moderate PostgreSQL experience
```
Interview Tip: Always discuss trade-offs. There’s no “best” architecture—only the right one for your context. Consider team size, domain complexity, scale requirements, and time-to-market.
Common Mistake: Don’t start with microservices. Start with a well-structured monolith, then extract services as needed. Premature decomposition causes more problems than it solves. Martin Fowler calls this the “Monolith First” approach — you cannot correctly define service boundaries until you deeply understand the domain, and you cannot deeply understand the domain until you have built and iterated on the system.
Your team is running a monolith that is starting to show scaling pain. Walk me through how you would decide what to extract into a microservice first, and what you would leave behind.
Strong Answer:
The first step is NOT to extract anything. I would instrument the monolith heavily — APM traces, database query analysis, CPU/memory profiling per module — to find the actual bottleneck. In my experience, “scaling pain” is often one or two hot paths, not the entire system. At one e-commerce company, 80% of the load came from the product catalog search, not from orders or payments.
Once I have identified the bottleneck module, I evaluate it against three extraction criteria: (1) Does it have a clearly defined bounded context with minimal data coupling to other modules? (2) Does it need to scale independently — meaning its load profile is fundamentally different from the rest? (3) Does the team structure support owning it independently (Conway’s Law)?
I would draw the dependency graph. If the candidate module makes 15 synchronous calls back into the monolith, extracting it creates a distributed monolith — the worst of both worlds. You get network latency and partial failure modes without any of the independence benefits.
The extraction itself follows the Strangler Fig pattern: stand up the new service, route a small percentage of traffic to it (canary), run both in parallel, compare results, then cut over. Keep the monolith code intact until the new service has proven itself in production for at least 2-4 weeks.
What I would leave behind: anything with heavy transactional coupling. If placing an order requires atomically updating inventory, processing payment, and creating a shipment record, keeping those in a monolith with a single database transaction is dramatically simpler and more reliable than a distributed saga across three services.
Follow-up: You mentioned the Strangler Fig pattern. What happens if the new service and the old monolith module produce different results during the parallel-run phase?

This is the critical validation step. I would implement a “dark launch” or “shadow mode” where the new service processes every request but its results are logged and compared against the monolith’s actual response rather than served to users. We build a diff report that flags discrepancies. Common causes: subtle differences in business logic, timezone handling, floating-point rounding in financial calculations, or database query ordering differences. Each discrepancy gets investigated and resolved before we shift any real traffic. At Shopify, they used this exact approach when extracting their storefront rendering: they ran shadow comparisons for weeks and caught dozens of edge cases that unit tests missed.
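A shadow-mode comparison can be as small as the sketch below. All names here (`shadow_compare`, `legacy_total`, `candidate_total`) are hypothetical, and the two handlers are contrived so that they disagree on a rounding edge case. In a real system the candidate call would typically run asynchronously so it cannot add latency.

```python
def shadow_compare(request, legacy_handler, candidate_handler, diffs):
    """Serve the legacy result; record any mismatch from the candidate."""
    served = legacy_handler(request)
    try:
        shadow = candidate_handler(request)
    except Exception as exc:  # candidate errors must never reach users
        diffs.append({"request": request, "error": repr(exc)})
        return served
    if shadow != served:
        diffs.append({"request": request, "legacy": served, "candidate": shadow})
    return served


def legacy_total(cart):
    # Rounds once, after summing (hypothetical monolith behavior).
    return round(sum(cart), 2)


def candidate_total(cart):
    # Rounds each line item first: a subtle business-logic difference.
    return round(sum(round(price, 1) for price in cart), 2)


diffs = []
a = shadow_compare([1.0, 2.5], legacy_total, candidate_total, diffs)
b = shadow_compare([0.14, 0.14], legacy_total, candidate_total, diffs)
```

Users always receive the legacy answer. The `diffs` list is the raw material for the diff report, and here it captures the one cart where per-item rounding changes the total.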
Explain the Saga pattern to me. When would you use choreography versus orchestration, and what failure modes keep you up at night with each approach?
Strong Answer:
A Saga is a sequence of local transactions across multiple services where each step has a compensating action (an undo) if a downstream step fails. It replaces the distributed two-phase commit (2PC) that does not scale in microservices architectures.
Choreography means each service listens to events and decides what to do next. There is no central coordinator. Service A publishes “OrderCreated,” Service B hears it and processes payment, publishes “PaymentProcessed,” Service C hears that and reserves inventory. The flow is implicit — it lives in the event subscriptions, not in any single piece of code.
Orchestration means a central saga coordinator tells each service what to do in sequence and handles rollbacks when things fail. The flow is explicit — you can read the orchestrator code and see every step.
I use orchestration when: the saga has more than 3-4 steps, the team is new to distributed systems, or the compensation logic is complex (partial refunds, conditional rollbacks). The explicit flow makes debugging far easier. I use choreography when: the workflow is simple (2-3 steps), services are owned by different teams who deploy independently, or we need to avoid a single point of failure.
Failure modes that keep me up at night: (1) In choreography, “ghost events” — a service processes an event, crashes before publishing its output event, and the saga is stuck in limbo. No central coordinator knows it is stalled. You need dead-letter queues and timeout monitors to detect these. (2) In orchestration, the orchestrator itself failing mid-saga — you have charged the customer but the orchestrator dies before reserving inventory. This requires the orchestrator to persist its state (saga log) so it can resume on restart. (3) In both: compensating actions that themselves fail. You tried to refund the payment but Stripe is down. Now you need compensation for your compensation — this is where things get genuinely ugly and you need manual intervention queues.
Follow-up: How would you test a saga that spans four services to make sure the compensating actions actually work?

Testing sagas is one of the hardest problems in microservices. I would use three layers: (1) Unit tests for each service’s compensating action in isolation: given this state, does the undo produce the expected result? (2) Contract tests (using Pact or similar) to verify that each service’s published events match what downstream consumers expect. (3) Integration tests in a staging environment with chaos injection: we deliberately fail each step of the saga and verify the system reaches a consistent terminal state (either fully committed or fully compensated). I would inject failures using a tool like Toxiproxy between services. The key assertion is eventual consistency: after the saga completes or fails, every service’s local state is consistent. We also run a background reconciliation job in production that compares states across services daily and flags any that are inconsistent.
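The "fail each step and check the terminal state" idea can be illustrated in-process before any staging environment exists. The sketch below is an assumption-laden toy: `run_saga` and `build_steps` are hypothetical helpers, the four service names are invented, and a shared set stands in for each service's local state.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure undo completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return "compensated"
        completed.append(compensate)
    return "committed"


def build_steps(state, fail_at=None):
    """Four hypothetical services; `fail_at` injects a failure at that step index."""
    names = ["payment", "inventory", "shipping", "loyalty"]

    def make(index, name):
        def action():
            if index == fail_at:
                raise RuntimeError(f"{name} failed (injected)")
            state.add(name)

        def compensate():
            state.discard(name)

        return action, compensate

    return [make(i, n) for i, n in enumerate(names)]


# Run once with no failure, then once per step with a failure injected there.
outcomes = []
for fail_at in [None, 0, 1, 2, 3]:
    state = set()
    result = run_saga(build_steps(state, fail_at))
    outcomes.append((fail_at, result, sorted(state)))
```

The assertion to make for every injected failure is the same: the saga ends in a terminal state and no partial work remains.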
You are designing a system for a company that processes 50,000 orders per day. A colleague proposes an event-sourced architecture with CQRS. Walk me through your response.
Strong Answer:
My first reaction is skepticism, and I would push back respectfully. 50,000 orders per day is roughly 0.6 orders per second on average, maybe 5-10 per second at peak. This is trivially handled by a well-indexed PostgreSQL database on a single server. Event sourcing and CQRS add enormous operational complexity — separate read and write models, eventual consistency between them, event schema versioning, replay infrastructure, snapshot management. That complexity has to earn its place.
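The back-of-the-envelope numbers are easy to check. The peak multiplier of 10 to 15 times the average is an assumption for illustration; real traffic shapes vary by business.

```python
ORDERS_PER_DAY = 50_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

average_per_second = ORDERS_PER_DAY / SECONDS_PER_DAY

# Assumption for the sketch: peak traffic is 10-15x the daily average.
peak_low = average_per_second * 10
peak_high = average_per_second * 15

print(f"average: {average_per_second:.2f} orders/s")
print(f"peak estimate: {peak_low:.1f} to {peak_high:.1f} orders/s")
```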
The question I would ask is: “What problem are we solving?” If the answer is “scale,” then event sourcing is overkill at this volume by two orders of magnitude. If the answer is “we need a complete audit trail of every state change for regulatory compliance,” then event sourcing starts to make sense because it gives us that for free. If the answer is “our read and write patterns are wildly different — writes are simple but reads require complex aggregations across multiple dimensions,” then CQRS alone (without event sourcing) might help.
If we do proceed, I would start with CQRS only: separate the read model (optimized denormalized views) from the write model (normalized transactional tables), synchronized via database-level change data capture (CDC) using something like Debezium. This gives us 80% of the benefit with 20% of the complexity. We would only add full event sourcing if we hit a concrete need for temporal queries, audit logs, or the ability to replay and rebuild state.
The hidden cost most people miss: event schema evolution. On day 1, your OrderCreated event has 5 fields. Six months later, it has 12. You now have millions of stored events in two different schemas, and your event replay logic needs to handle both. Tools like Avro with a schema registry help, but it is still a continuous maintenance burden.
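One common way to cope with mixed schemas is an "upcaster" that converts old events to the current shape at read time. The sketch below assumes a hypothetical `version` field on each event and an invented change (v2 adds `currency`); it is not tied to Avro or any particular schema registry.

```python
CURRENT_VERSION = 2


def upcast_order_created(event):
    """Return the event in the current schema without mutating the stored one."""
    upgraded = dict(event)
    if upgraded.get("version", 1) == 1:
        # Hypothetical change: v2 added `currency`. v1 events predate it,
        # so a default is filled in when the event is read.
        upgraded["currency"] = "USD"
        upgraded["version"] = CURRENT_VERSION
    return upgraded


stored = [
    {"type": "OrderCreated", "version": 1, "order_id": "123", "amount": 100},
    {"type": "OrderCreated", "version": 2, "order_id": "456", "amount": 50,
     "currency": "EUR"},
]

# Replay logic only ever sees the current schema.
replayable = [upcast_order_created(e) for e in stored]
```

The stored events stay untouched, which preserves the append-only guarantee, but every schema change adds another branch to maintain. That is the ongoing cost described above.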
Follow-up: Your colleague insists on event sourcing because “Netflix uses it.” How do you handle this conversation without damaging the relationship?

I would acknowledge that event sourcing is a powerful pattern used by sophisticated engineering teams, and that their instinct to think about scalable architecture is good. Then I would reframe the conversation around our specific constraints: “Netflix processes billions of events per day across thousands of services with hundreds of engineers. We have a 15-person team processing 50K orders per day. The pattern that is optimal for their scale can actively harm us at ours: the operational overhead of event sourcing would consume engineering time we should spend on product features.” I would propose a concrete experiment: “Let us build one bounded context with event sourcing as a proof of concept, measure the development velocity impact over 4 weeks, and decide based on data.” This respects their idea, avoids an authority-based argument, and lets reality settle the debate.
What is the difference between a distributed monolith and a proper microservices architecture? How do teams accidentally end up with a distributed monolith?
Strong Answer:
A distributed monolith has all the operational complexity of microservices (network latency, partial failures, distributed debugging, separate deployments) with none of the benefits (independent scaling, independent deployment, technology diversity). You cannot deploy service A without also deploying service B because they are coupled through shared databases, synchronous call chains, or shared libraries.
Teams end up here through three common paths. First, extracting services along the wrong boundaries — splitting by technical layer (a “data service,” an “API service,” a “business logic service”) instead of by business domain. This forces every business operation to traverse all three services synchronously. Second, sharing a database across services — two services reading and writing the same tables means any schema change requires coordinated deployment of both services. You have one deployment unit masquerading as two. Third, using synchronous request chains where service A calls B, which calls C, which calls D. Any single service being slow or down causes the entire chain to fail. This is a synchronous distributed monolith.
The test for whether you have a distributed monolith: “Can this service be deployed independently without coordinating with other teams?” If the answer is no, you have a distributed monolith regardless of how many Docker containers you run.
The fix is painful but straightforward: identify the coupling points, and either merge tightly-coupled services back into one (yes, sometimes the right answer is fewer services) or introduce asynchronous communication and separate data ownership. Each service should own its data and expose it only through well-defined APIs.
Follow-up: Your team has inherited a distributed monolith with 12 services that all share one PostgreSQL database. What is your migration plan?

I would not try to fix all 12 at once; that is a multi-quarter project with high risk. I would start by mapping the coupling: which services read from which tables, which write, and where there are cross-service transactions. Then I would pick the service with the cleanest boundary, the one that reads and writes tables that no other service touches. That service gets its own database first. For services with shared read access, I would introduce a data synchronization layer: the owning service publishes change events, consuming services maintain their own read-optimized copies. For shared writes, the solution is usually an API call to the owning service rather than direct database access. We would do this one service at a time over 3-6 months, validating after each extraction that behavior is unchanged. The key metric to track is deployment coupling: can we deploy each extracted service independently? When the answer becomes yes for all 12, we are done.
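The coupling map from the first step lends itself to a small script. The sketch below assumes you have already collected which tables each service touches (for example from query logs); the `access` data and the service and table names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical access map: service -> tables it reads or writes.
access = {
    "billing": {"invoices", "payments"},
    "catalog": {"products", "categories"},
    "notifications": {"email_log"},
    "orders": {"orders", "products", "payments"},
    "search": {"products"},
}


def table_users(access):
    """Invert the map: table -> set of services that touch it."""
    users = defaultdict(set)
    for service, tables in access.items():
        for table in tables:
            users[table].add(service)
    return users


def exclusive_services(access):
    """Services whose tables are touched by no other service."""
    users = table_users(access)
    return sorted(
        service
        for service, tables in access.items()
        if all(users[table] == {service} for table in tables)
    )


def shared_tables(access):
    """Tables touched by more than one service, with the services involved."""
    return {t: sorted(s) for t, s in table_users(access).items() if len(s) > 1}


first_candidates = exclusive_services(access)
contested = shared_tables(access)
```

Services in `first_candidates` can be given their own database with the least risk. The entries in `contested` are where change events or API calls will have to replace direct table access.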
Walk me through how you would design an API Gateway for a system with 20 backend services. What concerns does it handle, and what are the risks of getting it wrong?
Strong Answer:
An API Gateway is a single entry point for all client requests that handles cross-cutting concerns so individual services do not have to. The core responsibilities are: routing (mapping /users/* to user-service, /orders/* to order-service), authentication (validating JWT tokens once at the edge, not in every service), rate limiting (protecting backend services from traffic spikes), request/response transformation (aggregating responses from multiple services for mobile clients), and observability (logging every request with a correlation ID for distributed tracing).
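Two of those responsibilities, routing and correlation IDs, can be sketched in a few lines. This is only an illustration of the idea, not how Kong or Envoy implement it. The `ROUTES` table and `route` function are hypothetical, and `X-Correlation-ID` is a common convention rather than a standard header.

```python
import uuid

# Hypothetical route table: path prefix -> backend service name.
ROUTES = {
    "/users": "user-service",
    "/orders": "order-service",
}


def route(path, headers):
    """Pick a backend by longest matching prefix and ensure a correlation ID."""
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    if not matches:
        return None, headers  # no backend: the gateway would answer 404
    prefix = max(matches, key=len)
    forwarded = dict(headers)
    # Reuse the caller's ID if present so a trace spans every hop.
    forwarded.setdefault("X-Correlation-ID", str(uuid.uuid4()))
    return ROUTES[prefix], forwarded


backend, forwarded = route("/users/42", {})
kept_backend, kept = route("/orders", {"X-Correlation-ID": "abc"})
missing_backend, _ = route("/usersx", {})
```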
For 20 services, I would use an off-the-shelf gateway like Kong, AWS API Gateway, or Envoy-based solutions rather than building custom. Building your own API gateway is tempting but almost always a mistake — it becomes a critical path component that needs to handle TLS termination, connection pooling, circuit breaking, retries, and load balancing, all at high throughput with low latency. That is a full-time infrastructure project.
The Backend-for-Frontend (BFF) pattern becomes important at 20 services. A mobile client should not make 6 API calls to render one screen. I would create BFF layers — one for mobile, one for web, one for third-party partners — that aggregate backend calls and return tailored responses. The mobile BFF returns smaller payloads and fewer fields. The web BFF returns richer data.
Risks of getting it wrong: (1) The gateway becomes a single point of failure — if it goes down, the entire system is unreachable. Mitigation: deploy multiple gateway instances behind a load balancer, with health checks and auto-scaling. (2) The gateway becomes a “god service” where developers dump business logic because “it is easy to add middleware.” The gateway should only handle cross-cutting infrastructure concerns, never business logic. (3) Latency overhead — every request now has an extra network hop. At high throughput (100K+ RPS), the gateway’s connection pool and CPU become bottlenecks. Profile it continuously.
Follow-up: How would you handle authentication at the API Gateway level without creating a bottleneck? What if the auth service is down?

JWT tokens are the solution here because they are self-contained and verifiable without a network call. The gateway validates the JWT signature and expiration locally using the public key, which it caches. No call to the auth service is needed for validation, only for token issuance (login). This means the auth service being down prevents new logins but does not affect already-authenticated users. For the public key, I would use a JWKS (JSON Web Key Set) endpoint that the gateway polls periodically (every 5 minutes) and caches. If the auth service is down during a key rotation, the gateway falls back to the previously cached key. For token revocation (user logged out or compromised), I would maintain a small revocation list (blocklist) in Redis that the gateway checks. This is a fast O(1) lookup and covers the gap between token issuance and expiration. The revocation list only needs to hold tokens that have not yet expired, so it stays small.
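The reason the revocation list stays small is that each entry can be dropped once the token it refers to has expired. The sketch below shows that behavior with an in-memory dictionary standing in for Redis. `RevocationList` is a hypothetical class, and the injectable `clock` exists only so the example can simulate time passing.

```python
import time


class RevocationList:
    """In-memory stand-in for the Redis blocklist (hypothetical class)."""

    def __init__(self, clock=time.time):
        self._revoked = {}  # token ID (jti) -> expiry timestamp
        self._clock = clock

    def revoke(self, jti, expires_at):
        # Only tokens that have not yet expired need an entry.
        if expires_at > self._clock():
            self._revoked[jti] = expires_at

    def is_revoked(self, jti):
        expires_at = self._revoked.get(jti)
        if expires_at is None:
            return False
        if expires_at <= self._clock():
            # Expired tokens fail signature/expiry validation anyway.
            del self._revoked[jti]
            return False
        return True

    def __len__(self):
        return len(self._revoked)


now = [1_000.0]  # fake clock, in seconds
blocklist = RevocationList(clock=lambda: now[0])
blocklist.revoke("token-a", expires_at=1_300.0)
revoked_before_expiry = blocklist.is_revoked("token-a")

now[0] = 1_301.0  # the token's own expiry has passed
revoked_after_expiry = blocklist.is_revoked("token-a")
```

With Redis, the same effect is usually achieved by setting each key with a time-to-live equal to the token's remaining lifetime, so expired entries disappear without any cleanup code.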