Before diving into implementation, it’s crucial to understand when and why to use microservices. Many teams adopt microservices for the wrong reasons and end up with a distributed monolith that’s harder to maintain than what they started with.
Think of it like home renovation. A monolith is a studio apartment: everything is in one room, which is perfectly fine when you live alone. Microservices are like converting that studio into a multi-room house. You get privacy, dedicated spaces, and the ability to renovate one room without tearing down the whole building. But you also need hallways (networks), doors (APIs), a shared electrical system (infrastructure), and a much bigger budget. If you are a solo occupant, the studio was always the right call. The multi-room house only makes sense when you have a family (multiple teams) with genuinely different needs.
Learning Objectives:
Understand the evolution from monolith to microservices
Before we look at code, it’s worth pausing on why monoliths dominate early in a product’s life. The honest answer is that they are dramatically simpler along almost every axis that matters when you have a small team: one git repo, one deployment pipeline, one runtime, one database connection pool, one set of logs. When something goes wrong at 2am, you open one debugger, attach to one process, and follow one stack trace.
If you were to skip this and start with microservices instead, you would spend your first three months building infrastructure (service discovery, distributed tracing, event buses) rather than building the features that prove your product idea is worth pursuing. In the broader microservices architecture, the monolith is your starting point and often your fallback: you extract services from a monolith as organizational pressure mounts, not the other way around. The tradeoff to watch: the monolith will eventually become painful if your team grows past ~15 engineers or if you have wildly different scaling needs between features. Recognizing that moment is an art, not a science.
Node.js
```javascript
// Monolithic Express Application
// Everything lives in one process, one deployment artifact, one database connection.
// This is not inherently bad -- for a team of 5, this is the right architecture.
const express = require('express');
const app = express();

// All modules in one application -- notice how easy refactoring is:
// renaming a function in userRoutes can be done with a single find-and-replace.
const userRoutes = require('./routes/users');
const orderRoutes = require('./routes/orders');
const inventoryRoutes = require('./routes/inventory');
const paymentRoutes = require('./routes/payments');
const notificationRoutes = require('./routes/notifications');

// Single database connection -- ACID transactions across all modules come free.
// This is the biggest thing you give up when you move to microservices.
const db = require('./database');

app.use('/api/users', userRoutes);
app.use('/api/orders', orderRoutes);
app.use('/api/inventory', inventoryRoutes);
app.use('/api/payments', paymentRoutes);
app.use('/api/notifications', notificationRoutes);

// One process, one deployment -- a single CI pipeline, one health check, one log stream.
app.listen(3000);
```
Python
```python
# Monolithic FastAPI Application
# Everything lives in one process, one deployment artifact, one database connection.
# This is not inherently bad -- for a team of 5, this is the right architecture.
from fastapi import FastAPI
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

from app.routers import users, orders, inventory, payments, notifications

app = FastAPI(title="Monolith")

# Single database connection pool -- ACID transactions across all modules come free.
# This is the biggest thing you give up when you move to microservices.
engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/app")
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)

# All routers mounted in one application -- notice how easy refactoring is:
# renaming a function in users router can be done with a single find-and-replace.
app.include_router(users.router, prefix="/api/users", tags=["users"])
app.include_router(orders.router, prefix="/api/orders", tags=["orders"])
app.include_router(inventory.router, prefix="/api/inventory", tags=["inventory"])
app.include_router(payments.router, prefix="/api/payments", tags=["payments"])
app.include_router(notifications.router, prefix="/api/notifications", tags=["notifications"])

# One process, one deployment -- a single CI pipeline, one health check, one log stream.
# Run with: uvicorn main:app --host 0.0.0.0 --port 3000
```
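The comment about ACID transactions coming "free" in a monolith can be made concrete. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the shared Postgres instance; the table layout and the `make_db` and `place_order` helpers are hypothetical and not part of the application above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding tables from two different 'modules'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, product_id TEXT, quantity INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
    conn.commit()
    return conn


def place_order(conn: sqlite3.Connection, product_id: str, quantity: int) -> bool:
    """Insert the order and decrement stock in ONE transaction.

    If stock is insufficient, the exception rolls back the order insert too,
    so the orders and inventory tables can never disagree.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO orders (product_id, quantity) VALUES (?, ?)",
                (product_id, quantity),
            )
            cur = conn.execute(
                "UPDATE inventory SET quantity = quantity - ? "
                "WHERE product_id = ? AND quantity >= ?",
                (quantity, product_id, quantity),
            )
            if cur.rowcount == 0:
                raise ValueError("insufficient stock")
        return True
    except ValueError:
        return False
```

Once orders and inventory live in separate services with separate databases, this single `with conn:` block has no direct equivalent; the same guarantee has to be rebuilt with sagas or reconciliation.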
The Scalability Fallacy: Microservices Are Not A Scaling Strategy
One of the most persistent myths in our industry is that microservices are “how you scale.” They are not. Microservices are how you scale teams, not traffic. A well-tuned monolith on a beefy machine routinely handles more requests per second than a tangle of 15 poorly-designed microservices with network hops between every step.
Caveats & Common Pitfalls — The “Microservices = Scalability” Fallacy
Teams reach for microservices when their bottleneck is a single N+1 query. 90% of “scalability problems” in a monolith are actually missing indexes, bad caching, or a single slow endpoint. Microservices will make these worse, not better, because you now pay network latency on top.
The “scale independently” argument is usually false. Teams claim they need to scale user-service separately from order-service. In reality, user-service gets hit once per session; order-service gets hit 3x per session. The ratio never justifies the operational cost of splitting.
Network calls are orders of magnitude slower than function calls. A monolith function call is roughly 10 nanoseconds; a well-tuned intra-cluster gRPC call is roughly 1-5 milliseconds. That makes each boundary you introduce 100,000x to 500,000x slower than the call it replaced. You had better be solving a real problem.
Distributed systems add partial-failure modes the monolith did not have. “Timeout,” “retry storm,” “thundering herd,” and “cascading failure” are things that in-process calls between a monolith’s modules simply cannot experience. Congratulations, you have new bug categories.
Solutions & Patterns: Scale Vertically Before You Scale Organizationally
The honest scaling playbook, in order:
Profile first. 95% of “scaling problems” are fixed by adding an index, caching hot paths, or batching database calls. Run pg_stat_statements, APM flamegraphs, or py-spy before redesigning anything.
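Vertical scaling. Bigger box. The largest EC2 instances go well past 96 vCPUs and 768GB RAM (the high-memory families reach hundreds of vCPUs and multiple terabytes of memory). Even a 96-vCPU box gets you to 5-10M users for most SaaS.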
Read replicas. Postgres streaming replication handles most read-heavy workloads without architectural changes.
Cache. Redis or memcached in front of hot endpoints.
Only then consider extraction, and only of components with genuinely different scaling profiles (e.g., video encoding vs. profile lookups).
Decision rule: If you cannot point to a specific P99 latency metric on a specific endpoint that is failing your SLO, you do not have a scaling problem. You have a vibes problem.
Before/after example: A B2B SaaS team was convinced they needed microservices because “checkout is slow under load.” Three days of profiling showed the real issue: their tax calculator was serializing JSON in a Ruby hot loop. One afternoon of optimization (replace JSON serialization with a struct) brought P99 from 2800ms to 340ms. They did not need microservices. They needed a profiler.
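The decision rule is mechanical enough to check in code. The following is a minimal sketch, assuming latency samples for one endpoint have already been exported from an APM tool as a list of milliseconds; `percentile`, `violates_slo`, and the sample data are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def violates_slo(samples_ms: list[float], slo_p99_ms: float) -> bool:
    """True when the measured P99 for this endpoint exceeds its SLO target."""
    return percentile(samples_ms, 99) > slo_p99_ms


# Hypothetical checkout samples: 98 fast requests and 2 slow ones.
checkout_ms = [120.0] * 98 + [2800.0] * 2
```

With these samples the median looks healthy while the P99 is dominated by the two slow requests, which is why the rule asks for P99 on a specific endpoint and not an average across the service.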
Team Cognitive Load: The Hidden Scalability Ceiling
What microservices really scale is not traffic but the cognitive load your teams can carry. Below 10-15 engineers, one team can hold the whole monolith in their collective head. Above 30-40 engineers, nobody can, and the monolith becomes a graveyard of code that everyone is afraid to touch.
Caveats & Common Pitfalls — Cognitive Load
Teams are given more services than they can cognitively own. Team Topologies research suggests a team can deeply own 3-5 services max. Give a team 15 services and ownership degenerates into “whoever got paged last week is the expert.”
“Polyglot persistence” destroys context. One team owning services in 4 languages and 3 databases spends more time context-switching than delivering. Language diversity is a benefit only when teams, not services, diversify.
Senior engineers become “mandatory reviewers” on every service. This is a load-bearing anti-pattern — the org claims team ownership but actually routes every non-trivial change through 2-3 staff engineers. Their calendars become the bottleneck.
On-call rotations span too many services. An on-call rotation that covers 20 services will always be in reactive mode; deep investigation of any single incident becomes impossible.
Solutions & Patterns: Bounded Cognitive Load
From Team Topologies (Skelton and Pais, 2019), the load-bearing heuristic:
A Stream-aligned team (the default) should own one bounded context and 3-5 services max.
A Platform team supports stream-aligned teams by offering self-service capabilities (CI/CD, observability, service mesh) — not by owning business services.
An Enabling team is a rotating coaching team that helps stream-aligned teams adopt new practices.
Decision rule: Count the services per team. If any team has more than 5, you are over-fragmented. If any team has fewer than 2 and is fully staffed, you are under-fragmented.
Before/after example: A healthtech team owned 14 services across 3 programming languages. Before: sprint velocity dropped 40% YoY as context-switching cost compounded. After consolidating to 4 services (all Node.js) and moving 2 generic concerns to shared platform libraries, velocity returned to year-1 levels within two quarters.
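The services-per-team count is easy to automate against a service catalog. The following is a minimal sketch, assuming ownership is available as a mapping from team name to service names; `fragmentation_report` is a hypothetical helper, and it does not model the "fully staffed" condition, which still needs a human judgment.

```python
def fragmentation_report(
    ownership: dict[str, list[str]],
    max_services: int = 5,
    min_services: int = 2,
) -> dict[str, str]:
    """Label each team using the thresholds from the decision rule above."""
    report = {}
    for team, services in ownership.items():
        if len(services) > max_services:
            report[team] = "over-fragmented"
        elif len(services) < min_services:
            report[team] = "under-fragmented"
        else:
            report[team] = "ok"
    return report
```

Running this in CI against the catalog turns the rule into a recurring signal instead of a one-off audit.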
Interview: Your CTO says 'Our Rails monolith can't scale past 50K RPS. We need to migrate to microservices immediately.' How do you respond?
Strong Answer Framework:
Challenge the premise with data. Ask for the current P50/P99 latency numbers, the bottleneck endpoint, and the CPU/DB utilization at peak. If they do not have these, scaling is a guess.
Walk through the vertical-scaling options first. Bigger instance, read replicas, query optimization, caching. A well-tuned monolith can go well past 50K RPS; Shopify’s Rails monolith, discussed in the example below, served about 80K RPS at Black Friday peak.
Quantify the microservices cost. Roughly 30-40% engineering capacity for 12-18 months, plus ongoing operational tax. Present this as a real budget line.
Offer a diagnostic spike. Two weeks to profile the monolith and identify what is actually slow. Publish the findings.
If extraction is truly needed, extract one bottleneck component. Strangler Fig, measure, learn, then decide on the next one.
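The Strangler Fig step in the framework above usually comes down to a routing decision at the edge. The following is a minimal sketch, assuming a proxy that picks an upstream by path prefix; `StranglerRouter` and the URLs are hypothetical, and a real deployment would typically express the same rule in gateway or load-balancer configuration.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Sends extracted path prefixes to their new service and everything else to the monolith."""

    monolith_url: str
    extracted: dict[str, str] = field(default_factory=dict)  # prefix -> service base URL

    def extract(self, prefix: str, service_url: str) -> None:
        """Start routing one prefix to the extracted service."""
        self.extracted[prefix] = service_url

    def rollback(self, prefix: str) -> None:
        """Send the prefix back to the monolith if the extraction misbehaves."""
        self.extracted.pop(prefix, None)

    def upstream_for(self, path: str) -> str:
        # Match whole path segments only, so /api/searchable does not match /api/search.
        matches = [
            p for p in self.extracted
            if path == p or path.startswith(p.rstrip("/") + "/")
        ]
        if not matches:
            return self.monolith_url
        return self.extracted[max(matches, key=len)]  # longest prefix wins
```

Keeping `rollback` as cheap as `extract` is the point of the pattern: one component moves, it gets measured, and the move can be undone before the next one is attempted.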
Real-World Example: Shopify (2017-2020). Their Ruby monolith handled 80K RPS at Black Friday peak before they extracted services. The extractions that did happen (Storefront Renderer, Checkout) were for specific pain points, not a blanket migration. Their core admin still runs on the monolith today.
Senior Follow-up Questions:
“How do you explain the ‘distributed systems tax’ concretely to a non-technical CTO?” Use a concrete analogy: “Today, calling getUserById is a function call — 10 nanoseconds. After microservices, it is a network call — 5 milliseconds. We just made every user lookup 500,000x slower. For that to be worth it, we need to be saving engineer-months, not CPU cycles.”
“What if the CTO insists?” Agree on metrics that would prove the migration is working and metrics that would prove it is failing. Commit to re-evaluating in 90 days. Documented kill criteria are the most powerful tool against resume-driven architecture.
“What if profiling shows the monolith really is the bottleneck?” Usually the answer is a specific hot path — search, video transcoding, real-time chat. Extract that one thing as a service with its own scaling profile. Keep the rest in the monolith.
Common Wrong Answers:
“Yes, microservices will solve the scaling issue.” Fails because it accepts the false premise that microservices are a scaling pattern. Microservices add network latency; they do not remove work.
“We should just move to Kubernetes.” Fails because it conflates orchestration with architecture. K8s runs monoliths just fine and does not by itself improve latency.
Further Reading:
Shopify Engineering, “Surviving Black Friday at Monolith Scale” (2019).
Sam Newman, “Building Microservices” 2nd ed., Chapter 4 (on coupling and scaling).
DHH, “The Majestic Monolith” and “The Modular Monolith” posts on 37signals.
Interview: How would you explain the cost of microservices to a skeptical CTO who just read 'Monolith to Microservices' and wants to migrate?
Strong Answer Framework:
Frame the conversation around total cost of ownership, not migration cost. Migration is a one-time hit; operational overhead is forever.
Break the cost into five explicit buckets (observability, data consistency, testing, network reliability, operational toil) and estimate each.
Present the “steady-state tax.” In the first year, expect 30-40% of engineering capacity on infrastructure. Years 2+, expect 15-20% ongoing.
Identify the break-even point. At what team size or deploy frequency do these costs pay back? Usually 30-50 engineers and 10+ deploys per day.
Propose the modular monolith as the first investment. Strict module boundaries + separate schemas + architectural tests gets you 70% of the value for ~5% of the cost.
Real-World Example: Etsy (2011-2018). Etsy ran a PHP monolith with hundreds of engineers contributing, deploying 50+ times per day. They famously did not migrate to microservices during that period, because their deploy pipeline, feature flags, and rollback capabilities were so good that team independence was achievable within the monolith. Their “monolith tax” was lower than their projected “microservices tax.”
Senior Follow-up Questions:
“What’s the single most-underestimated cost?” Distributed debugging. In a monolith, a bug is a stack trace. In microservices, a bug is “something happened somewhere across 12 services in the last 200ms — find it.” Distributed tracing helps, but engineers still take 3-5x longer to diagnose cross-service issues.
“How do you budget for the transition period?” Assume the migrated services will have more bugs for 6-12 months, because the team is learning new patterns (idempotency, sagas, circuit breakers). Add a 20% feature velocity drop to your forecast.
“What’s the one investment that pays for itself fastest?” Distributed tracing (Jaeger, Tempo, Datadog APM). Without it, every cross-service incident becomes a multi-hour investigation. Set it up before extracting service #2.
Common Wrong Answers:
“The cost is mostly just the extra servers.” Fails because infrastructure cost is the smallest line item. Engineering time (both migration and ongoing operational) dominates.
“We can avoid most costs by using managed services.” Fails because managed services reduce operational cost but not the fundamental complexity cost. Your team still needs to understand sagas, idempotency, and eventual consistency.
Further Reading:
Sam Newman, “Monolith to Microservices” (O’Reilly, 2019), Chapters 1-3.
Etsy Engineering, “Continuous Deployment at Etsy” (2014) — how they achieved team independence within a monolith.
Google SRE book, Chapters 21-22 (Handling Overload; Addressing Cascading Failures): how overload turns into cascading failure in distributed systems.
Interview: You're consulting for a team hitting 'partial failures' in production -- one service sometimes returns stale data. How do you explain eventual consistency to their confused PM?
Strong Answer Framework:
Define the terms without jargon. “Strong consistency” means everyone sees the latest value immediately. “Eventual consistency” means everyone will see it — eventually, usually within seconds. You trade immediacy for availability.
Map it to a business concept the PM already understands. Email delivery. You send an email; it arrives “eventually.” Nobody expects instant delivery. That is eventual consistency as a business norm.
Explain why microservices force this choice. Each service has its own database. Writes to service A cannot atomically update service B’s DB without distributed transactions (which have their own severe costs — 2PC, performance, failure modes).
Show the concrete symptoms the PM should expect. “After you click Save, it may take up to 2 seconds for search results to update.” That is a product spec, not a bug.
Offer the mitigation. UI patterns (optimistic updates, “saving…” indicators), idempotency keys, and event-driven reconciliation.
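The "up to 2 seconds" behavior in the framework above can be shown with a toy model. The following is a minimal sketch, assuming a primary store that accepts writes immediately and a replica (for example, a search index) that applies them later; `EventuallyConsistentStore` is hypothetical and replication is triggered by hand instead of by a background worker.

```python
from collections import deque


class EventuallyConsistentStore:
    """A primary that accepts writes at once and a replica that catches up later."""

    def __init__(self) -> None:
        self.primary = {}
        self.replica = {}
        self._replication_log = deque()  # writes not yet applied to the replica

    def write(self, key: str, value: str) -> None:
        """Acknowledge the write as soon as the primary has it."""
        self.primary[key] = value
        self._replication_log.append((key, value))

    def read_replica(self, key: str):
        """Reads served from the replica may lag behind the primary."""
        return self.replica.get(key)

    def replicate(self) -> int:
        """Apply all pending writes to the replica, in order; returns how many were applied."""
        applied = 0
        while self._replication_log:
            key, value = self._replication_log.popleft()
            self.replica[key] = value
            applied += 1
        return applied
```

Between `write` and `replicate`, a replica read returns the old value. That gap is the consistency window the PM should see written into the product spec.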
Real-World Example: Amazon shopping cart (2007-present). When you add an item to your cart, it is written to a distributed, eventually-consistent store. Sometimes two tabs show slightly different cart contents for a few seconds. Amazon chose availability over strong consistency because a brief cart mismatch is a better UX than “Sorry, try again” during a shopping spree. Werner Vogels wrote about this in “Eventually Consistent” (ACM Queue, 2008).
Senior Follow-up Questions:
“How do you handle the rare case where the user sees truly inconsistent data?” Idempotency + reconciliation. The backend reconciles within seconds; the UI shows a merge banner if the user’s view drifts. Google Docs takes a similar stance, continuously merging concurrent edits instead of rejecting them.
“What if the PM says ‘just make it strongly consistent’?” Show them the CAP theorem in plain English: in a distributed system, during a network partition, you must choose between consistency and availability. Most consumer products choose availability, because “briefly wrong” beats “briefly down.”
“When is eventual consistency not acceptable?” Financial correctness. Do not let the checkout charge the user $100 when their cart actually shows $90. Use sagas with compensations and verify the total synchronously at checkout time.
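The "sagas with compensations" idea from the last answer can be sketched in a few lines. This is an illustration under stated assumptions, not a production saga engine: `run_saga` and the step names are hypothetical, steps run in-process, and a real implementation would persist saga state so compensation survives a crash.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo the completed steps in reverse order."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

For a checkout, the steps might be "reserve stock" (compensation: release it) followed by "charge card" (compensation: refund). If the charge fails, the reservation is released and no single database transaction ever has to span both services.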
Common Wrong Answers:
“Use distributed transactions to keep everything consistent.” Fails because 2PC across services is a known anti-pattern — it creates single points of failure and dramatically increases latency.
“It’s just a caching bug — invalidate the cache.” Fails because eventual consistency is structural, not a cache artifact. You cannot invalidate your way out of it.
Further Reading:
Werner Vogels, “Eventually Consistent” (ACM Queue, 2008).
Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), Chapter 9.
Pat Helland, “Life Beyond Distributed Transactions: An Apostate’s Opinion” (CIDR, 2007) — the foundational paper.
The microservices vs. monolith decision is not a technology choice — it is an organizational choice. You are trading local complexity (one big codebase) for distributed complexity (many small codebases connected by a network). Neither is inherently better.
| What you gain | What it costs | Worth it when… |
| --- | --- | --- |
| Independent deployment | Network latency between modules | Teams are blocked waiting to deploy |
| Technology diversity | Operational overhead of many runtimes | Problems genuinely require different tools |
| Fault isolation | Distributed debugging complexity | Partial outages are acceptable, total outages are not |
| Team autonomy | Coordination cost for cross-cutting changes | You have 50+ engineers with clear domain ownership |
| Granular scaling | Infrastructure management overhead | Scaling needs vary wildly between components |
The most common mistake in the industry is treating microservices as an upgrade from monoliths. It is not an upgrade. It is a different set of trade-offs optimized for different constraints. Shopify runs a massive modular monolith. Netflix runs thousands of microservices. Both are correct for their context.
Every microservices architecture pays a “distributed systems tax” — a set of costs that do not exist in a monolith. Understanding this tax deeply is what separates a senior engineer from a mid-level one in system design interviews.
Caveats & Common Pitfalls — Underestimating The Tax
“The network is reliable” is Fallacy #1 of Distributed Computing. Network calls fail, time out, return partial data, and occasionally succeed twice. Every HTTP call in your service graph is a potential failure point. L. Peter Deutsch and colleagues at Sun Microsystems codified the fallacies in the 1990s (Deutsch’s 1994 list had seven; the eighth was added later); most teams rediscover them painfully in production.
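“Latency is zero” is Fallacy #2. Latencies add up along a sequential call chain: with a P99 of 50ms per service across a 5-service chain, a request that hits the tail on every hop takes ~250ms, and, assuming independent hops, roughly 5% of requests (1 - 0.99^5) hit the tail on at least one hop. Fan-out makes this tail-latency amplification worse. Users feel it.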
Partial failure is the dominant failure mode. In a monolith, you either work or you do not. In microservices, you partially work: checkout succeeds but notification fails, payment processes but inventory does not update. Every cross-service operation needs explicit handling for each partial-failure state.
Eventual consistency bugs are invisible in test environments. Your staging cluster does not have 10K RPS, so the 2-second consistency window never manifests. Production traffic exposes race conditions that no test caught.
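The tail-latency arithmetic behind Fallacy #2 is short enough to compute directly. The following is a minimal sketch, assuming each hop’s latency is independent of the others (real systems often correlate under load, which makes things worse); both function names are hypothetical.

```python
def prob_any_hop_slow(hops: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `hops` independent calls exceeds its own percentile latency."""
    return 1 - percentile ** hops


def all_hops_at_p99_ms(per_hop_p99_ms: float, hops: int) -> float:
    """End-to-end latency of a sequential chain when every hop lands exactly at its P99."""
    return per_hop_p99_ms * hops
```

With 5 hops, about 4.9% of requests see at least one tail-latency call; with 100 hops (a wide fan-out), it is about 63%. This is why a per-service P99 that looks fine in isolation can still produce a poor end-to-end experience.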
Solutions & Patterns: Paying the Tax Deliberately
You cannot avoid the distributed systems tax, but you can pay it knowingly, with explicit budgets and engineering discipline.
Decision rule: For every cross-service call, define explicitly: (a) timeout budget, (b) retry policy with backoff, (c) idempotency strategy, (d) fallback behavior, (e) observability hooks (trace spans + metrics). No default-configuration HTTP clients in production code.
The five investments that pay back fastest:
Distributed tracing from day one (OpenTelemetry → Jaeger/Tempo). Without it, cross-service incidents take 3-5x longer to diagnose.
Idempotency keys on every mutation. Every POST / PATCH takes an Idempotency-Key header. This is how Stripe makes retries safe.
Timeouts on every outbound call. Default library timeouts are usually “forever.” Set explicit P99-budget-aware timeouts (e.g., 2x expected P99).
Circuit breakers around every downstream dependency. Use opossum (Node), pybreaker (Python), or service-mesh-level breakers (Istio, Linkerd).
Correlation IDs in every log line. trace_id and span_id must flow through every HTTP call, queue message, and log entry.
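Three of the items above (explicit timeouts, retries with backoff, and idempotency keys) fit naturally into one wrapper. The following is a minimal sketch, assuming the actual network call is supplied as a `send(idempotency_key, timeout_s)` callable; `call_with_retries` and `TransientError` are hypothetical names, not part of any library.

```python
import random
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 503)."""


def call_with_retries(
    send: Callable[[str, float], T],
    *,
    timeout_s: float = 0.5,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(idempotency_key, timeout_s), retrying transient failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once and reused on every retry, so the server can deduplicate.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return send(idempotency_key, timeout_s)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter, to avoid synchronized retry storms.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

The key design choice is that the idempotency key is created outside the retry loop. A retry that arrives after the first attempt actually succeeded is then recognized by the server as a duplicate and not executed twice.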
Before/after example: A fintech platform had 25 services and roughly 40 production incidents per quarter attributed to “random network issues.” Before: Each incident took 90 minutes to diagnose because logs from different services had no shared identifier. After enforcing trace_id propagation through every service and every log line, P50 diagnosis time dropped to under 10 minutes and unresolved “mystery” incidents dropped by 80%.
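The trace_id propagation described in this example can be sketched with the standard library alone. The following is a minimal sketch, assuming the ID travels in a hypothetical `X-Trace-Id` header (OpenTelemetry setups normally use the W3C `traceparent` header instead); `TraceIdFilter`, `build_logger`, and `handle_request` are illustrative names.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Configure a logger whose every line includes trace_id=..."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_headers: dict[str, str]) -> dict[str, str]:
    """Reuse the caller's trace ID if present, else start a new one; return outbound headers."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("creating order")
        return {"X-Trace-Id": trace_id}  # attach to every downstream call
    finally:
        trace_id_var.reset(token)
```

Because the ID lives in a context variable, code deep in the call stack logs it without every function needing a `trace_id` parameter.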
Each service should do one thing well and have a clear, bounded responsibility.
GOOD Service Boundaries:
┌──────────────────┬──────────────────┬──────────────────┐
│ User Service     │ Order Service    │ Payment Service  │
├──────────────────┼──────────────────┼──────────────────┤
│ • Registration   │ • Cart mgmt      │ • Process payment│
│ • Authentication │ • Order creation │ • Refunds        │
│ • Profile mgmt   │ • Order history  │ • Payment history│
│ • Preferences    │ • Order status   │ • Payment methods│
└──────────────────┴──────────────────┴──────────────────┘
BAD Service Boundaries:
┌──────────────────────────────────────────────────────────┐
│ User and Order Service                                   │
│ • Users + Orders + Payments (too coupled)                │
└──────────────────────────────────────────────────────────┘
Services should minimize dependencies on other services. Think of loose coupling like departments in a company communicating through formal memos rather than by rummaging through each other’s filing cabinets. Each department controls access to its own records and exposes only what others need through defined channels.
Loose coupling is the single most important property of a healthy microservices system. The moment two services share a database table, a Redis key, or even an in-memory cache, you have re-invented the monolith’s coupling with none of its benefits, and now you cannot even use database-level foreign keys to protect yourself. A tightly-coupled pair of services must be deployed together, tested together, and often debugged together. They share an implicit schema that no compiler enforces, so breakages happen silently at runtime.
If you ignore this principle and let services reach directly into each other’s storage, your architecture looks distributed on paper but behaves as a monolith in practice. Any change to one schema cascades through the entire system, and independent deployment becomes impossible. The tradeoff to watch: loose coupling via API calls introduces network failures, latency, and serialization overhead. You are buying maintainability at the cost of runtime complexity. This is almost always a good trade, but you must design for the failure modes (circuit breakers, timeouts, retries) from day one.
Node.js
```javascript
// ❌ Tight Coupling - Direct database access
// This is the filing-cabinet-rummaging approach. Order Service reaches directly into
// User Service's database. If User Service changes its schema, Order Service breaks
// silently -- no compile error, no failed contract test, just wrong data at runtime.
// (Named differently from the class below so both can live in one file.)
class OrderServiceTightlyCoupled {
  async createOrder(orderData) {
    // Directly accessing user database - BAD!
    const user = await userDatabase.users.findById(orderData.userId);
    // Directly accessing inventory database - BAD!
    const stock = await inventoryDatabase.products.findById(orderData.productId);
    if (!user || !stock || stock.quantity < orderData.quantity) {
      throw new Error('Invalid order');
    }
    // ... create order
  }
}

// ✅ Loose Coupling - API calls
// Dependencies are injected (constructor injection), making this testable and swappable.
// Each client encapsulates the network call, retry logic, and circuit breaking.
class OrderService {
  constructor(userClient, inventoryClient) {
    this.userClient = userClient;
    this.inventoryClient = inventoryClient;
  }

  async createOrder(orderData) {
    // Call User Service API -- if this fails, we get a clear HTTP error,
    // not a cryptic database connection error from a foreign schema.
    const user = await this.userClient.getUser(orderData.userId);
    // Call Inventory Service API -- the inventory team can change their DB
    // from Postgres to Redis without us knowing or caring.
    const available = await this.inventoryClient.checkStock(
      orderData.productId,
      orderData.quantity
    );
    if (!user || !available) {
      throw new Error('Invalid order');
    }
    // ... create order
  }
}

// Production pitfall: The "loose coupling via API calls" pattern above introduces
// a new failure mode -- network calls can timeout, return 503, or hang. Without
// circuit breakers and timeouts, this "loosely coupled" service can still cascade-fail
// when User Service goes down. Loose coupling is necessary but not sufficient for resilience.
```
Python
```python
# Tight Coupling - Direct database access (BAD)
# This is the filing-cabinet-rummaging approach. Order Service reaches directly into
# User Service's database. If User Service changes its schema, Order Service breaks
# silently -- no type error, no failed contract test, just wrong data at runtime.
from sqlalchemy.ext.asyncio import AsyncSession

# User and Product are ORM models owned by the User and Inventory services (not shown).


class OrderServiceTightlyCoupled:
    def __init__(self, user_db: AsyncSession, inventory_db: AsyncSession):
        self.user_db = user_db
        self.inventory_db = inventory_db

    async def create_order(self, order_data: dict) -> None:
        # Directly accessing user database - BAD!
        user = await self.user_db.get(User, order_data["user_id"])
        # Directly accessing inventory database - BAD!
        stock = await self.inventory_db.get(Product, order_data["product_id"])
        if not user or not stock or stock.quantity < order_data["quantity"]:
            raise ValueError("Invalid order")
        # ... create order


# Loose Coupling - API calls (GOOD)
# Dependencies are injected (constructor injection), making this testable and swappable.
# Each client encapsulates the network call, retry logic, and circuit breaking.
from pydantic import BaseModel
import httpx  # used inside the UserClient / InventoryClient implementations (not shown)


class OrderCreateRequest(BaseModel):
    user_id: str
    product_id: str
    quantity: int


class OrderService:
    def __init__(self, user_client: "UserClient", inventory_client: "InventoryClient"):
        self.user_client = user_client
        self.inventory_client = inventory_client

    async def create_order(self, order_data: OrderCreateRequest) -> None:
        # Call User Service API -- if this fails, we get a clear HTTP error,
        # not a cryptic database connection error from a foreign schema.
        user = await self.user_client.get_user(order_data.user_id)
        # Call Inventory Service API -- the inventory team can change their DB
        # from Postgres to Redis without us knowing or caring.
        available = await self.inventory_client.check_stock(
            order_data.product_id,
            order_data.quantity,
        )
        if not user or not available:
            raise ValueError("Invalid order")
        # ... create order


# Production pitfall: The "loose coupling via API calls" pattern above introduces
# a new failure mode -- network calls can timeout, return 503, or hang. Without
# circuit breakers and timeouts, this "loosely coupled" service can still cascade-fail
# when User Service goes down. Loose coupling is necessary but not sufficient for resilience.
```
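The production pitfall above names circuit breakers as the missing piece. The following is a minimal sketch of the pattern, assuming a single-threaded caller and a simple consecutive-failure count; `CircuitBreaker` and `CircuitOpenError` are hypothetical and far simpler than libraries such as pybreaker or a service-mesh breaker.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping `user_client.get_user` in `breaker.call(...)` means that when User Service is down, Order Service fails in microseconds instead of holding connections open until each timeout expires, which is what stops the cascade.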
Related functionality should be grouped together within a service. Cohesion is the positive mirror of coupling: where loose coupling minimizes what crosses service boundaries, high cohesion maximizes what belongs together within a single boundary. A cohesive service is one you can describe in a single sentence without using the word “and.” “Manages user identity and authentication” is cohesive. “Manages users and sends marketing emails” is not: those are two jobs pretending to be one.
If you get cohesion wrong and split related logic across services, you will pay the “distributed transaction” tax on every feature. Adding a new user preference field would require changes in five services coordinated via a release train. In the broader architecture, high cohesion is what allows a team to move fast: they touch one service, one database, one test suite, one deploy pipeline. The tradeoff: a highly cohesive service can grow large. That is fine: “small” is a consequence of cohesion and coupling being correct, not a goal unto itself.
Node.js
```javascript
// ✅ High Cohesion - User Service handles all user-related operations
class UserService {
  // All user-related operations in one service
  async createUser(userData) { /* ... */ }
  async getUser(userId) { /* ... */ }
  async updateUser(userId, updates) { /* ... */ }
  async deleteUser(userId) { /* ... */ }
  async authenticateUser(credentials) { /* ... */ }
  async resetPassword(email) { /* ... */ }
  async updatePreferences(userId, prefs) { /* ... */ }
}

// ❌ Low Cohesion - Mixed responsibilities
class MixedService {
  async createUser(userData) { /* ... */ }
  async createOrder(orderData) { /* ... */ }      // Should be in Order Service
  async processPayment(paymentData) { /* ... */ } // Should be in Payment Service
}
```
Each service should own its data and expose it only through APIs. This is polyglot persistence: each service picks the storage engine best suited to its access patterns, rather than every module sharing one Postgres instance and fighting over schema migrations.
The trade-off is explicit: you gain deployment independence and technology freedom, but you lose cross-service joins and ACID transactions. In a monolith, “give me all orders for users who signed up last week” is a single SQL join. In microservices, it becomes two API calls and client-side joining. That cost is real, and it is worth paying only when the benefits of independent deployment outweigh it.
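The "two API calls and client-side joining" can be sketched as follows. This is an illustration under stated assumptions: the two callables stand in for HTTP clients of the User and Order services, and `recent_signup_orders` plus the order dictionary shape are hypothetical.

```python
from typing import Callable


def recent_signup_orders(
    list_recent_user_ids: Callable[[], list[str]],
    list_orders_for_users: Callable[[list[str]], list[dict]],
) -> dict[str, list[dict]]:
    """Replace the SQL join with two service calls and a join in application code."""
    user_ids = list_recent_user_ids()  # call 1: User Service
    if not user_ids:
        return {}
    # Call 2: Order Service, batched so we do not make one request per user.
    orders = list_orders_for_users(user_ids)
    by_user: dict[str, list[dict]] = {uid: [] for uid in user_ids}
    for order in orders:
        if order["user_id"] in by_user:
            by_user[order["user_id"]].append(order)
    return by_user
```

Note what was lost relative to the SQL version: the two calls observe the data at slightly different moments, and the join runs in application memory, so large result sets need pagination.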
This principle is where most teams hesitate, and for good reason: it feels wrong. Your DBA has spent a decade optimizing one Postgres cluster with foreign keys, referential integrity, and beautiful query plans. Now you are being told to throw that away and accept eventual consistency across a dozen heterogeneous stores. Yes. That is exactly what you are being told, and there is no way around it if you want true service independence.
If you violate this principle and share a database, you get “independent deployment” in name only. Schema migrations become globally coordinated events. One team’s index change degrades another team’s queries. Worst of all, you lose your ability to evolve services independently, because the database schema is the strongest source of coupling in the entire stack. Within the broader architecture, the database-per-service rule is what actually unlocks technology diversity, scaling per-service, and bounded-context ownership. The tradeoff to watch: you will need to rebuild cross-service queries using event-driven read models (CQRS, materialized views), API composition, or dedicated reporting databases fed by change data capture.
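An event-driven read model, one of the options named above, can be sketched as a small projection. The following is a minimal sketch, assuming events arrive as dictionaries with hypothetical `UserRegistered` and `OrderPlaced` types and at-least-once delivery; `OrdersByUserReadModel` keeps its state in memory, where a real one would write to its own reporting store.

```python
from collections import defaultdict


class OrdersByUserReadModel:
    """Denormalized view built from events published by the User and Order services."""

    def __init__(self) -> None:
        self.user_emails = {}
        self.order_totals = defaultdict(float)
        self._seen_event_ids = set()

    def apply(self, event: dict) -> None:
        """Fold one event into the view; duplicates are ignored."""
        if event["id"] in self._seen_event_ids:  # at-least-once delivery: skip repeats
            return
        self._seen_event_ids.add(event["id"])
        if event["type"] == "UserRegistered":
            self.user_emails[event["user_id"]] = event["email"]
        elif event["type"] == "OrderPlaced":
            self.order_totals[event["user_id"]] += event["total"]

    def report(self) -> list[tuple[str, float]]:
        """The 'join' is now a local lookup: (email, total spent) per known user."""
        return sorted(
            (email, self.order_totals.get(uid, 0.0))
            for uid, email in self.user_emails.items()
        )
```

The cross-service query becomes a local read, at the price of the view lagging behind its sources by however long event delivery takes.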
Node.js
Python
// User Service - MongoDB
// Why MongoDB? User profiles are document-shaped (nested preferences, variable fields).
// Schema flexibility matters here because product adds new profile fields every sprint.
const mongoose = require('mongoose');

const userDb = mongoose.connect(process.env.USER_DB_URI);
const UserSchema = new mongoose.Schema({
  email: { type: String, unique: true },
  name: String,
  passwordHash: String,
  preferences: Object
});

// Order Service - PostgreSQL
// Why Postgres? Orders are relational (order -> items -> payments), need ACID guarantees
// within the order aggregate, and benefit from strong indexing for status-based queries.
const { Pool } = require('pg');

const orderDb = new Pool({
  connectionString: process.env.ORDER_DB_URI
});

// Inventory Service - Redis for fast access
// Why Redis? Stock levels are read thousands of times per second (every product page load)
// but written infrequently. The in-memory speed of Redis makes this a natural fit.
// Durable state (purchase history) lives in a separate Postgres behind this service.
const redis = require('redis');

const inventoryDb = redis.createClient({
  url: process.env.INVENTORY_REDIS_URI
});

// Production pitfall: "Database per service" does not mean "database server per service."
// Running 15 separate Postgres clusters in production is operational madness. You can share
// a database server with separate schemas or databases, as long as services only access
// their own schema. The boundary is logical, not necessarily physical.
# User Service - MongoDB via Motor (async driver)
# Why MongoDB? User profiles are document-shaped (nested preferences, variable fields).
# Schema flexibility matters here because product adds new profile fields every sprint.
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel, EmailStr, Field
from pydantic_settings import BaseSettings


class UserSettings(BaseSettings):
    user_db_uri: str = "mongodb://localhost:27017"


user_settings = UserSettings()
user_db = AsyncIOMotorClient(user_settings.user_db_uri).users_db


class UserDocument(BaseModel):
    email: EmailStr
    name: str
    password_hash: str
    preferences: dict = Field(default_factory=dict)


# Order Service - PostgreSQL via SQLAlchemy 2.0 async
# Why Postgres? Orders are relational (order -> items -> payments), need ACID guarantees
# within the order aggregate, and benefit from strong indexing for status-based queries.
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker


class OrderSettings(BaseSettings):
    order_db_uri: str = "postgresql+asyncpg://user:pass@localhost/orders"


order_settings = OrderSettings()
order_engine = create_async_engine(order_settings.order_db_uri, pool_size=10)
OrderSession = async_sessionmaker(order_engine, expire_on_commit=False)

# Inventory Service - Redis for fast access
# Why Redis? Stock levels are read thousands of times per second (every product page load)
# but written infrequently. The in-memory speed of Redis makes this a natural fit.
# Durable state (purchase history) lives in a separate Postgres behind this service.
import redis.asyncio as redis


class InventorySettings(BaseSettings):
    inventory_redis_uri: str = "redis://localhost:6379/0"


inventory_settings = InventorySettings()
inventory_db = redis.from_url(inventory_settings.inventory_redis_uri, decode_responses=True)

# Production pitfall: "Database per service" does not mean "database server per service."
# Running 15 separate Postgres clusters in production is operational madness. You can share
# a database server with separate schemas or databases, as long as services only access
# their own schema. The boundary is logical, not necessarily physical.
Assume other services will fail and design accordingly. In a monolith, a function call either returns or throws, with no ambiguity. In microservices, a network call can succeed, fail, succeed slowly, succeed but return stale data, or hang forever. You are designing for five failure modes instead of two, and every one of them needs a plan.
The biggest mental adjustment for engineers new to microservices is accepting that every network call is a tiny distributed system with its own partial failure modes. The naive approach is to wrap every call in try/catch and call it a day. That does not work, because the most dangerous failure mode is not an error but a slow response. A 30-second timeout on a call that is made 1,000 times per second will exhaust your thread pool and take down your service, even though technically nothing “failed.”
If you skip resilience patterns and assume the network is reliable, you will experience the classic cascade failure: User Service gets slow, Order Service blocks waiting for it, threads pile up, memory fills with pending requests, and now Order Service crashes too. Then Payment Service, which was calling Order Service, fails. Then everything fails. Circuit breakers, timeouts, and fallbacks exist specifically to contain this cascade. In the broader architecture, resilience patterns are what make independent failure recovery actually work; without them, “fault isolation” is just a diagram on a slide. Tradeoff: fallbacks add code complexity and can mask real problems if not instrumented carefully.
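The arithmetic behind the slow-response warning follows from Little's law: the average number of calls in flight equals the arrival rate multiplied by the time each call is held. The sketch below applies it to the 1,000 calls per second example; the 3-second timeout and the 50 ms healthy latency are assumed values for comparison.

```python
def in_flight_calls(calls_per_second: float, seconds_held: float) -> float:
    """Little's law: average in-flight calls = arrival rate x time each call is held."""
    return calls_per_second * seconds_held


# Dependency hangs and every call waits out a 30-second timeout.
stalled_30s = in_flight_calls(1_000, 30)

# Same outage with a 3-second timeout (assumed value for comparison).
stalled_3s = in_flight_calls(1_000, 3)

# Healthy dependency answering in about 50 ms (assumed value).
healthy = in_flight_calls(1_000, 0.05)
```

With a 30-second timeout the caller has to hold about 30,000 requests at once during an outage, against roughly 50 when the dependency is healthy. Shortening the timeout to 3 seconds cuts the outage figure to about 3,000, which is why fail-fast timeouts matter more than catching errors.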
Node.js
Python
const CircuitBreaker = require('opossum');
const axios = require('axios');

// Assumption: the User Service base URL is supplied through an environment variable.
const USER_SERVICE_URL = process.env.USER_SERVICE_URL;

class ResilientUserClient {
  constructor() {
    // The circuit breaker wraps the actual HTTP call. If the call fails too often,
    // the breaker "trips open" and returns the fallback immediately -- no network
    // call, no wasted time, no thread blocked waiting for a timeout.
    this.breaker = new CircuitBreaker(
      async (userId) => {
        const response = await axios.get(`${USER_SERVICE_URL}/users/${userId}`);
        return response.data;
      },
      {
        timeout: 3000,                 // Fail fast: 3 seconds max per call
        errorThresholdPercentage: 50,  // Trip after 50% of calls fail
        resetTimeout: 30000            // Try again after 30 seconds
      }
    );

    // Fallback when the call fails or the circuit is open -- the order page still loads,
    // just with "Unknown User" instead of the real name. This is graceful
    // degradation: partial data beats a 500 error every time.
    this.breaker.fallback((userId) => ({
      id: userId,
      name: 'Unknown User',
      _fallback: true
    }));
  }

  async getUser(userId) {
    return this.breaker.fire(userId);
  }
}

// Production pitfall: Fallback data that is "too good" is dangerous. If your fallback
// returns a plausible-looking user object with default values, downstream code may treat
// it as real data and charge the wrong customer or ship to the wrong address. Always include
// a clear marker (_fallback: true) and have downstream logic check it before proceeding
// with irreversible operations like payments.
# Resilient client using pybreaker (circuit breaker) + httpx (async HTTP)
# Install: pip install pybreaker httpx
import httpx
import pybreaker
from pydantic import BaseModel
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    user_service_url: str = "http://user-service:3000"


settings = Settings()


class UserResponse(BaseModel):
    id: str
    name: str
    fallback: bool = False  # marker so downstream code knows this is degraded data


def _is_client_error(exc: Exception) -> bool:
    # raise_for_status() raises HTTPStatusError for both 4xx and 5xx responses,
    # so only 4xx is excluded here; 5xx and transport errors still count as failures.
    return isinstance(exc, httpx.HTTPStatusError) and exc.response.status_code < 500


class ResilientUserClient:
    def __init__(self) -> None:
        # The circuit breaker wraps the actual HTTP call. If the call fails too often,
        # the breaker "trips open" and returns the fallback immediately -- no network
        # call, no wasted time, no coroutine blocked waiting for a timeout.
        self._breaker = pybreaker.CircuitBreaker(
            fail_max=5,                   # Trip after 5 consecutive failures
            reset_timeout=30,             # Try again after 30 seconds
            exclude=[_is_client_error],   # 4xx should not trip the breaker
        )
        self._client = httpx.AsyncClient(timeout=3.0)  # Fail fast: 3s per call

    async def get_user(self, user_id: str) -> UserResponse:
        try:
            # Note: pybreaker implements call_async with Tornado coroutines, so tornado
            # needs to be installed; check that this works with your asyncio setup.
            return await self._breaker.call_async(self._fetch_user, user_id)
        except pybreaker.CircuitBreakerError:
            # Fallback when circuit is open -- the order page still loads,
            # just with "Unknown User" instead of the real name. This is graceful
            # degradation: partial data beats a 500 error every time.
            return UserResponse(id=user_id, name="Unknown User", fallback=True)

    async def _fetch_user(self, user_id: str) -> UserResponse:
        response = await self._client.get(f"{settings.user_service_url}/users/{user_id}")
        response.raise_for_status()
        data = response.json()
        return UserResponse(id=data["id"], name=data["name"])

# Production pitfall: Fallback data that is "too good" is dangerous. If your fallback
# returns a plausible-looking user object with default values, downstream code may treat
# it as real data and charge the wrong customer or ship to the wrong address. Always include
# a clear marker (fallback=True) and have downstream logic check it before proceeding
# with irreversible operations like payments.
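The production pitfall above asks downstream logic to check the fallback marker before doing anything irreversible. A minimal sketch of that check might look like the following; it uses a plain dataclass instead of the pydantic model so it stays self-contained, and `DegradedDataError`, `render_order_header`, and `charge_customer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserResponse:
    id: str
    name: str
    fallback: bool = False  # True when the circuit breaker served degraded data


class DegradedDataError(RuntimeError):
    """Raised when an irreversible step is attempted with fallback data."""


def render_order_header(user: UserResponse) -> str:
    # Read-only display can tolerate degraded data.
    return f"Order for {user.name}"


def charge_customer(user: UserResponse, amount_cents: int) -> str:
    # Irreversible step: refuse to proceed on placeholder data.
    if user.fallback:
        raise DegradedDataError(f"refusing to charge user {user.id} on fallback data")
    return f"charged {user.id} {amount_cents}"
```

The split is deliberate: pages that only display data keep working during an outage, while payment and shipping paths fail loudly instead of acting on a placeholder.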
Services are deployed separately but still tightly coupled.
Distributed Monolith:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Service A  │───▶│  Service B  │───▶│  Service C  │
│             │◀───│             │◀───│             │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                   SHARED DATABASE

Signs:
• Must deploy all services together
• Shared database between services
• Synchronous chain calls
• Breaking one service breaks everything
Services are too small, creating unnecessary complexity.
Too Granular:
• UserCreationService
• UserUpdateService
• UserDeletionService
• UserQueryService
• UserValidationService

Right Size:
• UserService (handles all user operations)
Too much shared code creates hidden coupling. Think of shared libraries like a shared lease on an apartment — every change requires all tenants to agree, and one tenant’s renovation can break another’s furniture arrangement.
Node.js
Python
// ❌ Shared library that's too large
// Every service depends on this, so updating it requires coordinating deployments
// across all 12 services. You have effectively recreated the monolith's coupling
// but now with the added pain of package versioning.
import {
  User, Order, Payment, Inventory,
  validateUser, validateOrder,
  formatCurrency, calculateTax,
  sendEmail, sendSMS
} from '@company/mega-shared-lib';

// ✅ Minimal, focused shared code
// Only share infrastructure concerns (logging, HTTP, tracing) that change rarely.
// Business logic stays in respective services -- this is the key distinction.
import { Logger } from '@company/logger';
import { HttpClient } from '@company/http-client';
# Shared library that's too large (BAD)
# Every service depends on this, so updating it requires coordinating deployments
# across all 12 services. You have effectively recreated the monolith's coupling
# but now with the added pain of package versioning.
from company_mega_shared_lib import (
    User, Order, Payment, Inventory,
    validate_user, validate_order,
    format_currency, calculate_tax,
    send_email, send_sms,
)

# Minimal, focused shared code (GOOD)
# Only share infrastructure concerns (logging, HTTP, tracing) that change rarely.
# Business logic stays in respective services -- this is the key distinction.
from company_logger import Logger
from company_http_client import HttpClient
Production pitfall: Shared libraries with domain models (User, Order, Payment types) are the most insidious form of coupling. When the Order team needs to add a field to the shared Order type, they must now coordinate with every team that imports it. Version pinning helps but creates drift — eventually service A uses v2.3 and service B uses v2.7, and they disagree on what an Order looks like.
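One way to avoid sharing domain types is for each consumer to define its own narrow view of the data it receives and to ignore everything else. The sketch below assumes a hypothetical Shipping Service that needs only two fields from an order payload; the field names `id` and `shipping_address` are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ShippableOrder:
    """The Shipping Service's own view of an order: only what it needs."""

    order_id: str
    address: str

    @classmethod
    def from_payload(cls, payload: Mapping[str, Any]) -> "ShippableOrder":
        # Read the required fields by name and ignore everything else, so the
        # Order team can add fields without forcing a coordinated release.
        return cls(
            order_id=str(payload["id"]),
            address=str(payload["shipping_address"]),
        )
```

When the Order team adds a field, this consumer keeps working unchanged. It breaks only if a field it actually depends on is renamed or removed, which is the change that deserves coordination anyway.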
'You join a company with a distributed monolith -- 12 services that all share one PostgreSQL database and must be deployed together. How do you untangle this?'
Strong Answer:
A distributed monolith is arguably worse than either a proper monolith or proper microservices because you have all the operational complexity of distributed systems with none of the independence benefits. The first thing I would do is stop extracting more services. The instinct is usually “we need to keep splitting,” but the root cause is shared data coupling, not insufficient decomposition.
My approach would be in three phases.
Phase one: map all the database dependencies. I would use query logging or a tool like pg_stat_statements to identify which service reads from and writes to which tables. This gives you the actual coupling graph, not the one in the architecture diagram.
Phase two: introduce API boundaries for the most critical shared tables. If five services read the users table directly, I would pick the service that logically owns user data, build a proper API on it, and migrate the other services to call that API instead of querying the table.
Phase three: once services communicate through APIs rather than shared tables, you can start splitting the database. Move one table at a time, using Change Data Capture (Debezium) to keep data synchronized during the transition. Each table migration should be independently reversible.
Follow-up: “What would you do if two services legitimately need the same data with strong consistency guarantees?”
That is a signal that those two services might actually be one bounded context and should be merged. If order creation and inventory reservation absolutely must be atomic, maybe they belong in the same service with a single database transaction. The alternative, distributed transactions via two-phase commit, is something I would avoid because it creates a single point of failure and dramatically increases latency. Saga pattern with compensation works, but only if you can tolerate eventual consistency for that workflow.
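Phase one of the answer above boils down to building a table-level coupling graph. As a rough sketch, assume the query logs have already been reduced to (service, table, mode) records; that record format and the ownership heuristic are assumptions for illustration, not output produced by pg_stat_statements.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Access(NamedTuple):
    service: str
    table: str
    mode: str  # "read" or "write"


def shared_tables(accesses: Iterable[Access]) -> dict[str, dict[str, set[str]]]:
    """Return tables touched by more than one service, with each service's access modes."""
    by_table: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for access in accesses:
        by_table[access.table][access.service].add(access.mode)
    return {
        table: {service: set(modes) for service, modes in services.items()}
        for table, services in by_table.items()
        if len(services) > 1
    }


def likely_owner(services: dict[str, set[str]]) -> Optional[str]:
    """Heuristic: if exactly one service writes the table, treat it as the owner."""
    writers = [service for service, modes in services.items() if "write" in modes]
    return writers[0] if len(writers) == 1 else None
```

Tables with a single writer are the easy candidates for phase two, because the owner is unambiguous. Tables written by several services need a design conversation before any API boundary is drawn.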
'Walk me through how you would determine the right size for a microservice. What is too small? What is too big?'
Strong Answer:
The way I think about service sizing is through three lenses: team ownership, data ownership, and change frequency.
A service is the right size when one team (ideally 4-8 people) can own it completely: understand its codebase, deploy it independently, and be on-call for it. If a service requires two teams to coordinate on changes, it is probably too big. If one engineer owns six services and none of them can be modified without touching the others, they are probably too small (nano-services).
On data ownership: a properly sized service owns a coherent set of data that represents one business capability. The “Order Service” owns orders, order items, and order status. If it also owns product catalog data, it is too big. If you have separate services for order creation, order status, and order history that all share the same database, they are too small and should be merged.
The practical heuristic I use: if a new team member can understand the service’s purpose, codebase, and data model within their first week, it is about the right size.
Follow-up: “How does Conway’s Law factor into your sizing decisions?”
Conway’s Law is not just an observation; I treat it as a design constraint. If you have three teams, you will end up with roughly three natural service boundaries. The inverse Conway maneuver, structuring your teams to match the architecture you want, is more effective than fighting that pull. If you want an independent order processing service, you need a team whose primary responsibility is order processing. Without that organizational alignment, the service boundary erodes over time.
'Your team is building an MVP for a new product. The CTO wants microservices from day one. Do you push back?'
Strong Answer:
Yes, I push back firmly. The argument for microservices at MVP stage optimizes for a problem you do not have yet (scale) while making your actual problem harder (speed of iteration). At MVP stage, your biggest risk is building the wrong product, not technical scalability.
In a monolith, renaming a field is a find-and-replace. In microservices, it is a coordinated API change across services requiring versioning and staged rollouts. That friction kills iteration speed.
What I would propose instead is a “modular monolith”: a single deployable application with clear internal module boundaries, well-defined interfaces, and separate database schemas per module. This gives you most of the architectural benefits without the operational overhead of a distributed system. When you actually hit scaling problems, you can extract the bottleneck module into a service.
Follow-up: “The CTO insists. How do you handle the disagreement?”
I would ask for specifics about their previous experience: how many engineers, what DevOps maturity, how long did setup take, what was the impact on feature velocity? Then I would propose a compromise: start with the modular monolith, agree on specific metrics (team blocking frequency, scaling bottlenecks), and commit to re-evaluating in three months. Data-driven decisions beat opinion-driven ones.
The monolith-to-microservices journey is not something you decide on a whiteboard. It happens to you. You start out with a perfectly reasonable monolith because you have a team of four, a product that might not exist in a year, and no operational budget. Then you succeed. And as you succeed, the monolith that was perfect for four people starts to groan under the weight of your growth. Here is what that actually feels like, stage by stage.
You have one application, one database, maybe a Redis instance for sessions. Your CI pipeline takes 8 minutes. Deploys are a git push and a kubectl rollout. When something breaks, you open one log stream and find the stack trace in seconds. Four engineers share the codebase and meet in person to resolve conflicts. The monolith was absolutely the right call and remains so. Nobody has any reason to change anything.
Traffic has grown 100x. The checkout flow, which hits the payment gateway, writes to 3 tables, and sends a confirmation email, is slow under load. Your first instinct is correct: resize the box. You go from a 4-core VM to a 32-core VM. The database goes from 8GB of RAM to 64GB. Everything is fast again. Total operational cost: $2,000/month. Total engineering time spent: one afternoon. This is vertical scaling, and for most applications it is genuinely the best answer until you hit somewhere between 500K and 5M users, depending on workload.
The problem with vertical scaling is not that it does not work (it works great); it is that it stops working eventually, and when it stops working there are only a few knobs left to turn. AWS’s biggest EC2 instance caps out. Your Postgres single-writer architecture caps out. You cannot double performance by buying more expensive hardware forever.
You start seeing specialized scaling needs. Your product catalogue gets 10,000 reads per second; every page view hits it. Your checkout gets 50 writes per second. These have wildly different profiles. You add a Redis cache in front of the catalogue. Great: catalogue reads now hit Redis 95% of the time and fly. But the cache is bolted onto the monolith, which means every module now has access to it, whether they should or not. A well-meaning engineer uses the cache for session data and another for rate limits. The cache’s blast radius is now the entire app.
Meanwhile, checkout is still the bottleneck. It is CPU-bound, doing payment cryptography and tax calculations. The rest of the app would be happy on 8 cores, but checkout wants 32. You provision the whole fleet for checkout’s needs and pay 4x the compute cost on every pod. You cannot scale checkout independently because it lives in the same binary as catalogue, user profiles, recommendations, and 40 other modules.
Stage 4: 10,000,000 Users and 40 Engineers — Team Coupling
This is the stage where the organizational problems eclipse the technical ones. You now have eight teams of five engineers each. Every team wants to ship features. But there is only one codebase, one test suite, one deploy pipeline.
Here is what a Tuesday looks like: the recommendations team pushes a change that introduces a subtle memory leak. It does not manifest until four hours after deploy, under load. By then, the checkout team has merged their work, the notifications team has merged theirs, and five other teams are queued up. At 10:47 PM, checkout errors start climbing. Nobody knows whose change caused it because the last deploy contained 14 merges. The on-call engineer rolls back the whole deploy, which reverts everyone’s work. Tomorrow morning, seven teams have to rebase and retest. Velocity has dropped to near-zero. One bad line in recommendations broke checkout. This is deploy coupling: unrelated work is strapped together by the single deploy artifact.
Deploy coupling is the single most reliable predictor that you have outgrown the monolith. When a five-person team blocks another five-person team from releasing because one PR broke the shared test suite, you are now paying a coordination tax on every feature. That tax compounds. A team that ships weekly under monolith constraints can often ship daily once it owns an independently deployable service, and the compounded velocity difference over a year is enormous.
There is no magic number of users or engineers at which microservices become correct. What matters is the ratio of coordination cost to build cost. When your engineers spend more time resolving merge conflicts, waiting for CI, untangling shared-state bugs, and coordinating deploys than they spend writing the features those activities support, you have hit the inflection point. In my experience this tends to happen around 15-25 engineers for a typical web app, or earlier for a product with dramatically different scaling profiles per feature (real-time chat + video encoding + catalogue browse in one codebase is asking for pain by engineer #12).
The mistake is either reaching for microservices too early (stage 2 or 3, when vertical scaling is still winning) or too late (well past stage 4, when the coordination tax has already destroyed velocity for a year). The right move is usually: start extracting services at the precise team-coupling boundaries that hurt the most, one at a time, using the Strangler Fig pattern. Do not do a big-bang rewrite. Do not extract services that are not yet causing pain. Pain is the signal.
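The core of the Strangler Fig pattern is a routing layer that sends each request either to an already-extracted service or to the monolith. A minimal sketch of that routing decision follows; the `StranglerRouter` class, the URLs, and the path prefixes are all hypothetical, and a real deployment would usually put this logic in a gateway or reverse proxy.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Routes each request path to an extracted service or falls back to the monolith."""

    monolith_url: str
    extracted: dict[str, str] = field(default_factory=dict)  # path prefix -> service URL

    def extract(self, prefix: str, service_url: str) -> None:
        """Start sending traffic under `prefix` to the new service."""
        self.extracted[prefix] = service_url

    def rollback(self, prefix: str) -> None:
        """Send traffic under `prefix` back to the monolith."""
        self.extracted.pop(prefix, None)

    def upstream_for(self, path: str) -> str:
        # Match whole path segments so "/checkouts" does not match "/checkout".
        matches = [
            prefix
            for prefix in self.extracted
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            return self.monolith_url
        # Longest prefix wins, so a sub-path can be extracted separately.
        return self.extracted[max(matches, key=len)]
```

Because every extraction is one routing entry, each step can be undone by removing that entry, which is what makes the one-service-at-a-time approach low risk.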
+-------------------------------------------------------------------------------+
|                        SCALING PATH OF A TYPICAL SAAS                         |
+-------------------------------------------------------------------------------+
|                                                                               |
|             1K users    100K users      1M users        10M users + 40 eng    |
|             |           |               |               |                     |
|             monolith    bigger box      add caches      ??? microservices ??? |
|             is fine     still fine      still mostly    pain is now           |
|                                         mono; cracks    organizational,       |
|                                         show on the     not technical         |
|                                         hottest path                          |
|                                                                               |
| cost curve: ~flat       ~flat, doubled  rising fast     exponential           |
| velocity:   ~constant   ~constant       slight drop     falling off a cliff   |
|                                                                               |
+-------------------------------------------------------------------------------+
The single best trick for delaying this inflection point is the modular monolith. Enforce module boundaries inside the monolith with separate schemas per module and architecture tests (tools like ArchUnit for Java or ts-arch for TypeScript) that fail the build if a module imports from a sibling’s internals. This buys you years of runway before you need to cross into real microservices.
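For a Python codebase, the same kind of boundary check can be approximated with the standard library's `ast` module. The sketch below assumes a hypothetical package layout where `app.<module>.internal.*` is private to `app.<module>`; the function name and layout are illustrative, and relative imports are deliberately not handled.

```python
import ast


def forbidden_imports(source: str, own_module: str, internal_suffix: str = "internal") -> list[str]:
    """Return absolute imports in `source` that reach into another module's internals.

    Assumed layout: app.<module>.internal.* is private to app.<module>.
    """
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            # parts looks like ["app", "<module>", "internal", ...]
            reaches_internal = (
                len(parts) >= 3 and parts[0] == "app" and parts[2] == internal_suffix
            )
            if reaches_internal and parts[1] != own_module:
                violations.append(name)
    return violations
```

Run over every file in a module as part of the test suite, a non-empty result fails the build, which is the property that keeps the boundaries from eroding.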
Every API call your system receives represents a human intent. “I want to buy this.” “I want to send this message.” “I want to update my profile.” “I want to cancel my subscription.” Behind every HTTP request is a human being with a goal, and that human does not care about your thread pools, your distributed transactions, or your circuit breakers. They care about one thing: did the system do what they asked?
Microservices are reliable precisely to the degree that they honor these intents even when the underlying execution is degraded. Intent parity is the architectural commitment that a user’s original request (their intent, captured at the moment of the click or API call) will eventually be fulfilled, not silently dropped. The outcome may be delayed. The intent is never lost.
Intent Parity: The architectural property that user intent, once captured, is preserved durably and eventually fulfilled, regardless of transient infrastructure failures. “We honor what you asked for, even if we cannot execute it right this second.”
Intent parity is not a feature. It is not a pattern you toggle on. It is a design philosophy that shapes every component in your system. Once you adopt it, patterns like event sourcing, the outbox, and sagas stop looking like abstract academic ideas — they become tools for keeping the promise you made when the user clicked “Submit.”
Event sourcing stores the complete sequence of state-change events, every intent the system has recorded, as the source of truth. Instead of storing “cart currently contains: [item A, item B]”, you store the events: “user added item A at 10:01”, “user added item B at 10:03”, “user removed item A at 10:05”. The current state is derived by replaying those events.
The reason this matters for intent parity: every intent is a first-class, immutable record. If your downstream materialized view (the query model) gets corrupted or lost, you replay from the events and the system self-heals. The intent is never lost, because the intent is the storage. Traditional CRUD loses the “why” and the “when” the moment you overwrite state. Event sourcing keeps every intent forever.
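The cart example can be written as a small replay function. This is a minimal in-memory sketch: the `CartEvent` shape and the event kinds are assumptions for illustration, and a real event store would persist the log durably.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartEvent:
    kind: str  # "added" or "removed"
    item: str
    at: str    # timestamp kept as plain text for the sketch


def replay(events: list[CartEvent]) -> list[str]:
    """Derive the current cart by folding over the event log in order."""
    cart: list[str] = []
    for event in events:
        if event.kind == "added":
            cart.append(event.item)
        elif event.kind == "removed" and event.item in cart:
            cart.remove(event.item)
    return cart


log = [
    CartEvent("added", "item A", "10:01"),
    CartEvent("added", "item B", "10:03"),
    CartEvent("removed", "item A", "10:05"),
]

current_cart = replay(log)        # state after all three events
cart_at_10_04 = replay(log[:2])   # state before the removal
```

Replaying the full log yields a cart containing only item B, while replaying the first two events reconstructs the cart as it was at 10:04. A table that stores only the current state cannot answer that second question.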
The outbox pattern is the simplest, most production-ready way to guarantee intent parity for services that cannot yet adopt full event sourcing. The pattern: when your service handles a request, it writes two things in the same local database transaction: the business state change AND an “outbox” row describing the event that should be sent to downstream systems. A separate process reads new outbox rows and publishes them to Kafka, SQS, or wherever.
Why this guarantees intent parity: both writes commit atomically (they are in the same transaction). If the database commits, the intent is durably recorded. If the subsequent Kafka publish fails, the outbox row is still there, and the publisher retries until it succeeds. If the service crashes mid-publish, the next instance picks up the outbox and continues. The intent cannot be lost short of catastrophic data corruption. Because a row can be published more than once (for example, when the publisher crashes after sending but before marking the row as sent), delivery is at-least-once and consumers should be idempotent.
Contrast this with the “naive” approach: write to the database, then publish to Kafka. If the database commit succeeds but the Kafka publish fails (network blip), you have a state change with no event, and downstream systems never find out. The user’s intent was recorded but never fulfilled end-to-end. That gap is exactly what intent parity demands you close.
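The mechanics of the outbox pattern fit in a short sketch. This one uses the standard library's `sqlite3` so it runs anywhere; the table layout, the `order.placed` topic name, and the `publish` callback are assumptions for illustration, and in production the relay would publish to a real broker such as Kafka or SQS.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    # `with conn` wraps both inserts in one transaction: commit on success,
    # rollback if either statement raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)", (order_id, total_cents)
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish pending rows in order; stop at the first failure so it is retried next run."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)
        except Exception:
            break  # row stays unpublished; the next relay run picks it up
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the process dies between `publish` and the `UPDATE`, the row is published again on the next run. That is the at-least-once behaviour the pattern accepts in exchange for never losing an event, so consumers need to tolerate duplicates.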
When one user intent requires changes across multiple services (“place this order” touches Orders, Inventory, Payment, and Notifications), you need a mechanism to drive all of those changes to completion or to a consistent compensation state. That is a saga.
A saga is a sequence of local transactions, each in a different service, coordinated by events (choreography) or a central orchestrator. If step 3 of 5 fails, the saga runs compensating actions for steps 1 and 2 to leave the system in a consistent state. The user’s intent is either fully honored (happy path) or cleanly reverted (compensation path); it is never left permanently in a half-complete, ambiguous state.
This is intent parity for distributed workflows. The saga is the machinery that turns “user clicked Place Order” into either “order placed successfully, inventory reserved, payment captured, notification sent” or “order cancelled, no inventory held, no charge, apology email sent.” There is no lasting middle ground where the user is charged but the order was not created. That middle ground is what intent parity refuses to accept.
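An orchestrated saga can be reduced to a loop that remembers which steps completed and undoes them in reverse order when a later step fails. The sketch below is in-process only; the `Step` structure and `run_saga` function are hypothetical, and a production orchestrator would also persist saga state and retry compensations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[], None]       # the local transaction in one service
    compensate: Callable[[], None]   # undoes the action if a later step fails


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

The failed step itself is not compensated, because its local transaction never committed. Only the steps that succeeded before it are undone, most recent first.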
+-----------------------------------------------------------------------------+
|                   HOW THESE PATTERNS SERVE INTENT PARITY                    |
+-----------------------------------------------------------------------------+
|                                                                             |
|  Intent arrives --> captured durably --> fulfilled (eventually)             |
|                                                                             |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Event Sourcing    |    | Outbox Pattern    |    | Sagas             |    |
|  +-------------------+    +-------------------+    +-------------------+    |
|  | Every intent is   |    | Intent + state    |    | Multi-service     |    |
|  | an immutable      |    | change committed  |    | intent driven to  |    |
|  | event; state is   |    | atomically;       |    | success OR        |    |
|  | replayable        |    | republished until |    | compensated       |    |
|  |                   |    | downstream gets   |    | cleanly           |    |
|  |                   |    | it                |    |                   |    |
|  +-------------------+    +-------------------+    +-------------------+    |
|                                                                             |
+-----------------------------------------------------------------------------+
If you remember one thing from this chapter, let it be this: your users do not care about your architecture; they care whether the system kept its promises to them. Every pattern you build — circuit breakers, retries, queues, events — ultimately serves intent parity. A system that loses a user’s intent under load is a system the user cannot trust, regardless of how beautiful the architecture diagram is.
Melvin Conway wrote a short paper in 1967 (published in 1968) containing an observation that has haunted every software architecture decision ever since:
“Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure.”
This is Conway’s Law, and it is not advice; it is a description of what will happen whether you plan for it or not. If you have three teams, you will end up with roughly three major components. If your backend and frontend teams never talk to each other, you will end up with a backend API that does not fit what the frontend actually needs. If your payments team is organizationally separate from your orders team, the seam between those two systems will calcify into an API boundary.
You cannot fight this. Engineers have tried, and they lose. What you can do is use Conway’s Law deliberately: design your org chart to match the architecture you want. This is sometimes called the inverse Conway maneuver, and it is one of the highest-leverage moves in organizational design.
Spotify’s famous “squad” model is often misunderstood as “agile, but trendier.” It is actually a direct application of inverse Conway. Each squad owns a specific user-facing capability (search, recommendations, library management) end-to-end: frontend, backend, data pipeline, operational concerns, everything. Because one team owns the whole vertical, the service boundary naturally forms at the edge of that team’s concern. Squads do not step on each other’s code because they do not share code in ways that matter to daily work.
The magic is not the word “squad.” It is the alignment between team ownership and service boundary. If you replaced “squad” with “team” and dropped all the Spotify marketing, the underlying principle (one team owns one service end-to-end) would still work. The reason Spotify’s architecture held up through hypergrowth was that the team structure and the service structure were the same shape.
The mirror image: companies that try to impose a microservices architecture on a traditional functional org (separate frontend team, backend team, database team, QA team). What happens is predictable and sad. Every feature requires coordination across four teams. Every service ends up with four different owners, none of whom can unilaterally deploy it. Services get architected to minimize cross-team coordination, which means they absorb responsibilities they should not have just to avoid calling another team’s API. Within a year, you have a monolith held together with HTTP calls (a distributed monolith) because the org chart made real service boundaries impossible.
If you want independent services, you need independent teams. If you only have functional silos, you will get a system with functional-silo boundaries, no matter how many diagrams you draw.
Here is the practical play. When you start planning a microservices migration or a significant architecture shift, redraw your org chart first.
List the services you want. For each service, identify the team that owns it end-to-end. If no such team exists, you need to form one before the service exists, or the service will degenerate.
Minimize cross-team dependencies. If two teams constantly need to coordinate to ship a feature, their services are probably in the wrong places, or the two teams should merge.
Give each team its own deploy pipeline and on-call rotation. This is the real test of service ownership: can this team deploy at 3 AM without waking anyone else up?
Accept some redundancy. Inverse Conway often means two teams build similar-looking internal tools, because neither is willing to depend on the other. That is fine. The coordination cost of sharing is often higher than the cost of a little duplication.
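The first two checks in that list can be run against a plain inventory of services before any code moves. The following is a minimal sketch, assuming you can write down, for each planned service, which teams must approve a deploy and which services it calls. The `Service` type, its fields, and all team and service names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owners: tuple[str, ...]      # teams that must sign off on a deploy (hypothetical field)
    calls: tuple[str, ...] = ()  # names of services this one depends on (hypothetical field)


def audit_ownership(services: list[Service]) -> dict[str, list[str]]:
    """Flag services that do not have exactly one end-to-end owning team."""
    findings: dict[str, list[str]] = {"unowned": [], "shared": []}
    for svc in services:
        if len(svc.owners) == 0:
            findings["unowned"].append(svc.name)
        elif len(svc.owners) > 1:
            findings["shared"].append(svc.name)
    return findings


def cross_team_edges(services: list[Service]) -> Counter:
    """Count dependency edges between each pair of distinct owning teams."""
    # Only services with a single clear owner take part in the edge count.
    owner_of = {s.name: s.owners[0] for s in services if len(s.owners) == 1}
    edges: Counter = Counter()
    for svc in services:
        src = owner_of.get(svc.name)
        for dep in svc.calls:
            dst = owner_of.get(dep)
            if src and dst and src != dst:
                edges[tuple(sorted((src, dst)))] += 1
    return edges


# Illustrative inventory; every name here is made up.
inventory = [
    Service("checkout", ("checkout-team",), calls=("catalog", "notifier")),
    Service("catalog", ("catalog-team",)),
    Service("notifier", ("notifications-team",)),
    Service("reporting", ("frontend-team", "backend-team")),
    Service("legacy-export", ()),
]

print(audit_ownership(inventory))
print(cross_team_edges(inventory).most_common())
```

Anything listed under `unowned` or `shared` is a service that needs a team formed or an owner chosen before it is built. A team pair with a high edge count is a candidate for moving a boundary or merging the two teams. What counts as "high" is a judgment call this sketch does not make for you.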
+-----------------------------------------------------------------------------+
|                   CONWAY'S LAW: TEAM SHAPE = SYSTEM SHAPE                   |
+-----------------------------------------------------------------------------+
|                                                                             |
|   ORG CHART (the cause)            ARCHITECTURE (the effect)                |
|                                                                             |
|   +-----------------+              +-------------------+                    |
|   | Checkout Team   | <----------->| Checkout Service  |                    |
|   | (8 engineers)   |              | - Own DB          |                    |
|   +-----------------+              | - Own deploy      |                    |
|                                    | - Own on-call     |                    |
|                                    +-------------------+                    |
|                                                                             |
|   +-----------------+              +-------------------+                    |
|   | Catalog Team    | <----------->| Catalog Service   |                    |
|   | (6 engineers)   |              | - Own ES index    |                    |
|   +-----------------+              | - Own deploy      |                    |
|                                    +-------------------+                    |
|                                                                             |
|   +-----------------+              +-------------------+                    |
|   | Notifications   | <----------->| Notifier Service  |                    |
|   | Team            |              | - Own queue       |                    |
|   +-----------------+              | - Own deploy      |                    |
|                                    +-------------------+                    |
|                                                                             |
|   The arrows go BOTH ways: team shape determines service shape,             |
|   but changing service shape without changing team shape does not stick.    |
|                                                                             |
+-----------------------------------------------------------------------------+
The most common Conway-related failure I have seen: an architecture team designs a beautiful microservices diagram, hands it to a traditionally organized engineering department, and expects the diagram to become reality. It never does. A year later, the “microservices” share databases, cannot deploy independently, and require cross-team tickets to change. The architecture reverted to match the org chart. If you want the architecture on the diagram, you must first make the org chart match the diagram.
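Of the three symptoms above, shared databases are the easiest to detect mechanically. Here is a rough sketch, assuming you can export a mapping from each service to the datastores it connects to (from connection strings or infrastructure config, for example). The function name and every service and datastore name are hypothetical.

```python
from collections import defaultdict


def shared_datastores(service_to_stores: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return each datastore used by two or more services, with the services that use it."""
    users: dict[str, list[str]] = defaultdict(list)
    for service, stores in service_to_stores.items():
        for store in stores:
            users[store].append(service)
    # Keep only stores with more than one user; sort for stable output.
    return {store: sorted(svcs) for store, svcs in users.items() if len(svcs) > 1}


# Illustrative data; all names are made up.
connections = {
    "checkout": {"orders_db"},
    "catalog": {"catalog_es", "orders_db"},
    "notifier": {"notify_queue"},
}

print(shared_datastores(connections))
```

A non-empty result does not prove you have a distributed monolith, but each entry is a place where two services cannot change their schema independently, and it is worth asking which single team owns that data.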
Conway’s Law is not a rule you can break. It is a force you can harness. Every architecture decision is also an organizational decision, and every organizational decision is also an architecture decision. Staff engineers who understand this deeply are the ones who successfully shepherd large migrations — they know that changing the code is the easy part, and changing the team structure is the real work.