# Microservices Mastery

> Master microservices architecture with Node.js and Python — From fundamentals to production-ready systems for top tech company interviews

A comprehensive, interview-focused curriculum designed for engineers targeting **Senior/Staff roles at top tech companies** (Google, Amazon, Netflix, Uber, Stripe, etc.). This course covers everything from microservices fundamentals to production-ready distributed systems.

<Info>
  **Course Duration**: 16-20 weeks (self-paced)\
  **Target Outcome**: Senior+ Engineer at FAANG / Top-tier microservices expertise\
  **Prerequisites**: Node.js or Python basics, REST APIs, basic database knowledge\
  **Languages**: Every code example is shown in **both Node.js/TypeScript and Python** so you can learn in the stack you use at work\
  **Projects**: 10+ hands-on projects including a full e-commerce platform\
  **Chapters**: 27 in-depth chapters covering all aspects of microservices
</Info>

***

## Why This Course?

<CardGroup cols={2}>
  <Card title="Interview Ready" icon="building">
    Covers exact topics asked at top tech companies for backend/distributed systems roles
  </Card>

  <Card title="Production Patterns" icon="shield-halved">
    Real patterns from systems handling millions of requests at Netflix, Uber, Amazon
  </Card>

  <Card title="Hands-On Projects" icon="laptop-code">
    Build a complete e-commerce platform with 10+ microservices
  </Card>

  <Card title="Deep Technical Foundation" icon="book">
    Understand the "why" behind every pattern, not just the "how"
  </Card>
</CardGroup>

<Warning>
  **Interview Reality**: At Senior+ level, you're expected to design microservices systems that handle **millions of requests**, maintain **data consistency**, and recover from **failures gracefully**. This course prepares you for exactly that.
</Warning>

***

## Before You Begin: Organizational Readiness

Before you write a single `docker-compose.yml`, stop and check whether your organization is ready. The most expensive microservices mistakes are not technical -- they are organizational. Teams adopt microservices because they read about Netflix or Amazon, not because their own context calls for it. The result is a distributed monolith with twice the operational cost and half the velocity of what they had before.

<Warning>
  **Caveats & Common Pitfalls — Jumping to Microservices Without Org Readiness**

  * **Teams adopt microservices before they can deploy reliably.** If your current monolith takes 2 hours to deploy manually and rollback requires a Slack thread of 6 engineers, splitting into 12 services multiplies that pain across twelve pipelines, twelve rollback procedures, and every dependency between them. You need mature CI/CD, container orchestration, and observability *before* decomposition, not as a follow-up project.
  * **Engineers mistake microservices for modularity.** A well-structured modular monolith gives you 80% of the boundary benefits with 10% of the operational tax. If your pain is "code is tangled," the answer is module boundaries and architectural tests, not a network hop.
  * **Leadership pushes microservices for resume-driven reasons.** "Our CTO wants microservices" is not a requirement; it is a preference. Convert it to measurable goals (deploy frequency, team independence, scaling granularity) and check whether monolith improvements could hit those goals first.
  * **Conway's Law is treated as a suggestion, not a law.** If you have four functional teams (frontend, backend, database, QA), you will produce a four-tier architecture, regardless of the microservices diagram on the wall. The org chart wins every time.
</Warning>

<Tip>
  **Solutions & Patterns — The Org Readiness Checklist**

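  Before extracting your first service, verify these six prerequisites. The first three come from Martin Fowler's *"Microservice Prerequisites"*, which also names DevOps culture as the organizational shift they imply. The last two extend that minimum bar to domain boundaries and team autonomy, themes Sam Newman's books return to repeatedly.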

  **Decision rule:** If you cannot confidently check all six, your next investment should be in the gap -- not in extracting services.

  1. **Rapid provisioning** — You can spin up a new environment (DB, runtime, load balancer) in under 30 minutes via automation.
  2. **Basic monitoring** — You have structured logs, metrics, and at least one shared dashboard; you can answer "is service X healthy?" without SSH.
  3. **Rapid deployment** — You can deploy a single service to production without a human approval chain longer than 15 minutes.
  4. **DevOps culture** — Developers own deployment, not a separate ops team who "releases" code.
  5. **Explicit domain boundaries** — At least one Event Storming session has produced a context map that the whole team agrees on.
  6. **Team autonomy** — Teams can make technical choices (framework, DB, language) without a central architecture review board blocking each decision.

  **Before/after example:** A fintech team I worked with wanted to split their 400K-LOC Rails monolith into 20 services. They had manual deploys, one shared Postgres, and no distributed tracing. *Before* readiness work: 14 months of pain projected. *After* investing 4 months in CI/CD, tracing, and a modular-monolith refactor: they extracted the first service (fraud scoring) in 6 weeks with zero production incidents. The readiness work paid for itself before service #2.
</Tip>
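
The decision rule above can be made mechanical. The following is a minimal sketch, assuming a hypothetical `ReadinessAudit` record that your team fills in by hand. The class, the field names, and the `readiness_gaps` helper are illustrative inventions, not part of any standard tool; only the six items and the 30- and 15-minute thresholds come from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessAudit:
    """Hypothetical self-assessment: one field per checklist item."""
    provisioning_minutes: float    # time to stand up a new environment via automation
    has_basic_monitoring: bool     # structured logs, metrics, one shared dashboard
    approval_chain_minutes: float  # human approval time for a single-service deploy
    developers_own_deploys: bool   # DevOps culture: no separate team "releases" code
    has_agreed_context_map: bool   # output of at least one Event Storming session
    teams_choose_own_stack: bool   # no review board blocking each technical decision


def readiness_gaps(audit: ReadinessAudit) -> list[str]:
    """Return the checklist items that are not yet satisfied, in checklist order."""
    checks = {
        "rapid provisioning": audit.provisioning_minutes < 30,
        "basic monitoring": audit.has_basic_monitoring,
        "rapid deployment": audit.approval_chain_minutes <= 15,
        "devops culture": audit.developers_own_deploys,
        "explicit domain boundaries": audit.has_agreed_context_map,
        "team autonomy": audit.teams_choose_own_stack,
    }
    return [name for name, ok in checks.items() if not ok]


def next_investment(audit: ReadinessAudit) -> str:
    """Apply the decision rule: close the gaps before extracting anything."""
    gaps = readiness_gaps(audit)
    if gaps:
        return "close gaps first: " + ", ".join(gaps)
    return "ready to extract the first service"
```

Calling `next_investment` on an audit with a 45-minute provisioning time and no context map names those two gaps instead of approving an extraction. The value of writing it down is that "are we ready?" becomes a list of specific gaps rather than a matter of opinion.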

<AccordionGroup>
  <Accordion title="Interview: A VP of Engineering asks you 'Why shouldn't we just adopt microservices now so we're ready when we grow?' How do you respond?">
    **Strong Answer Framework:**

    1. **Reframe the question from "when" to "why."** Microservices are not an upgrade path -- they are a different set of trade-offs. Ask what specific problems they are trying to solve.
    2. **Quantify the current pain.** Deploy frequency, failed deploys, onboarding time, cross-team blockers. If these numbers are fine, the VP is optimizing for a future problem.
    3. **Calculate the distributed-systems tax.** Roughly 30-40% of engineering capacity goes to infrastructure for the first year post-migration.
    4. **Propose the modular monolith as a bridge.** You get boundary discipline, separate schemas, and architectural tests without the network hop. Extract later when the pain is real.
    5. **Agree on explicit triggers for extraction.** "When deploy coordination costs exceed X hours/week per team" beats "when we're bigger."

    **Real-World Example:**

    **Segment (2017-2018).** Segment famously moved from a monolith to microservices, then back to a monolith. With about 100 engineers, their fine-grained services (150+ at peak) meant a single event crossed 15+ services, and debugging consumed more engineering time than feature work. They consolidated back, and their 2018 blog post became the canonical cautionary tale for premature decomposition.

    **Senior Follow-up Questions:**

    * *"What metrics would prove to the VP that microservices are working?"* Deploy frequency per team, change failure rate, mean time to recovery, and team-independence index (how often does team A block team B on a deploy?).
    * *"How would you handle a direct order to proceed anyway?"* Propose a pilot: extract one service, run it for one quarter, and measure the metrics above. If they improve, continue. If not, pause.
    * *"What's the cheapest experiment that de-risks the decision?"* Run a two-week modular-monolith spike: introduce strict module boundaries with architectural tests and see if the deploy coupling pain drops. If it does, you may not need microservices at all.

    **Common Wrong Answers:**

    * *"Sure, let's start extracting services next sprint."* Fails because it skips the readiness audit and the org analysis. You will end up with a distributed monolith.
    * *"Microservices are always better at scale; we should migrate."* Fails because it treats microservices as universally superior. Shopify, Basecamp, and Stack Overflow all run successful monoliths at scale.

    **Further Reading:**

    * Martin Fowler, *"Microservice Prerequisites"* — the canonical readiness checklist.
    * Sam Newman, *"Monolith to Microservices"* (O'Reilly, 2019) — when and how to extract.
    * Segment's *"Goodbye Microservices"* blog post (2018) — the pivot-back story in detail.
  </Accordion>

  <Accordion title="Interview: 'When would you tell a team microservices would hurt more than help?' Give me your explicit decision criteria.">
    **Strong Answer Framework:**

    1. **Team size under 15-20 engineers.** Operational overhead per service is roughly constant; fixed costs swamp small teams.
    2. **Unclear or unstable domain boundaries.** If you cannot name the contexts, extraction will freeze the wrong lines into APIs.
    3. **Strong consistency requirements across most workflows.** If 70% of your flows need ACID across domains, distributed sagas will hurt every feature.
    4. **Limited DevOps maturity.** No CI/CD, no observability, manual deploys -- microservices will amplify these weaknesses.
    5. **MVP or rapid-pivot stage.** You need to change data models weekly; API contracts slow you down.
    6. **Uniform scaling profile.** If every part of the system scales together, you gain nothing by scaling independently.

    **Real-World Example:**

    **Istio itself (2017-2019).** Istio, the service mesh, initially shipped its control plane as a set of microservices (Pilot, Mixer, Citadel, Galley). In 2020 the project consolidated the control plane into a single binary called `istiod`. Reason: the microservices pattern added operational pain for users without buying internal development velocity, because a small team owned all four components.

    **Senior Follow-up Questions:**

    * *"What about a 10-person team building a real-time trading platform?"* Still probably a monolith, because latency budgets are tight and network hops are measurable. Extract only the pieces with truly different scaling profiles.
    * *"You said 'strong consistency across domains kills microservices' -- but Amazon uses microservices for orders and payments."* Amazon uses saga patterns with idempotent compensations, and they accept eventual consistency on non-critical paths. They invested a decade of platform engineering to make that cheap -- it is not cheap for most teams.
    * *"How do you spot 'unclear domain boundaries' early?"* Run Event Storming. If the domain experts disagree on basic vocabulary ("is a customer the same as a user?"), freeze extraction until the business model clarifies.

    **Common Wrong Answers:**

    * *"Microservices are always right once you're big enough."* Size is necessary but not sufficient. Basecamp (37signals) has stayed monolithic at scale by choice.
    * *"You should extract when the codebase hits X lines of code."* LOC is a terrible metric. Coupling and team friction are the real signals.

    **Further Reading:**

    * Google's Istio v1 -> istiod consolidation blog (2020).
    * DHH, *"The Majestic Monolith"* — Basecamp's explicit anti-microservices position.
    * Matt Klein (Envoy creator), *"Microservices: A Retrospective"* on why service meshes should not themselves be mesh-of-services.
  </Accordion>

  <Accordion title="Interview: Your team of 8 engineers has a 300K-LOC monolith and leadership wants to break it into 20 services in 6 months. What do you recommend?">
    **Strong Answer Framework:**

    1. **Push back on the timeline, not the goal.** 20 services in 6 months with 8 engineers is a recipe for a distributed monolith. Propose extracting 1-2 services and reassessing.
    2. **Audit readiness first (2-4 weeks).** CI/CD, observability, deployment automation. Fix gaps before extracting.
    3. **Identify the highest-pain boundary.** Which module causes the most merge conflicts, has the most distinct scaling needs, or requires the most cross-team coordination? That is extraction #1.
    4. **Use the Strangler Fig pattern.** Route new traffic to the new service behind the existing API gateway. Old code path remains until the new service is proven.
    5. **Define success metrics explicitly.** Deploy frequency for the extracted service, number of cross-service incidents, team velocity changes.
    6. **Plan to stop.** Commit to not extracting service #3 until #1 and #2 are fully stable.

    **Real-World Example:**

    **Shopify (2016-present).** Despite serving millions of merchants, Shopify runs what they call the "majestic modular monolith" in Ruby. When they extract services, they do it surgically -- for example, their Storefront Renderer was extracted specifically because its caching profile was dramatically different from admin. They did not set out to have "X services" -- they set out to solve a specific scaling problem per extraction.

    **Senior Follow-up Questions:**

    * *"What if leadership insists on all 20?"* Present the risk in business terms: projected production incident rate, feature velocity drop, time-to-recovery projections. Offer a 3-month checkpoint with explicit kill criteria.
    * *"How do you pick extraction #1?"* Two factors: highest current pain (pick the module causing the most friction) and lowest coupling to the rest (pick something with a clean data boundary). You want an early win that builds confidence.
    * *"What's the sign you should stop extracting?"* When cross-service debugging starts costing more engineering hours than the benefits of independence provide. Measure this explicitly.

    **Common Wrong Answers:**

    * *"Yes, we can do 20 services in 6 months if we hire contractors."* Staffing up mid-migration multiplies coordination cost. You will end with more services and less expertise per service.
    * *"Let's do a big-bang rewrite."* Historically, rewrites from monolith to microservices-at-once have a failure rate above 70% (per DORA research). Incremental extraction is the only safe path.

    **Further Reading:**

    * Shopify Engineering blog, *"Deconstructing the Monolith"* (2019).
    * Sam Newman, *"Building Microservices"* 2nd edition, Chapter 3 (Strangler Fig).
    * DORA *Accelerate State of DevOps* reports on rewrite failure rates.
  </Accordion>
</AccordionGroup>

***

## What Companies Ask

| Company     | Common Microservices Topics                                             |
| ----------- | ----------------------------------------------------------------------- |
| **Amazon**  | Service decomposition, eventual consistency, DynamoDB patterns, SQS/SNS |
| **Netflix** | Circuit breakers, chaos engineering, service mesh, Eureka               |
| **Uber**    | Event-driven architecture, Kafka, real-time systems, CQRS               |
| **Stripe**  | Distributed transactions, idempotency, exactly-once delivery            |
| **Google**  | gRPC, service discovery, load balancing, observability                  |
| **Meta**    | Graph services, fan-out patterns, caching at scale                      |

***

## Complete Curriculum

| #  | Chapter                  | Topics                                              |
| -- | ------------------------ | --------------------------------------------------- |
| 00 | Overview                 | Course structure, learning path, prerequisites      |
| 01 | Foundations              | Monolith vs microservices, when to use, trade-offs  |
| 02 | Domain-Driven Design     | Bounded contexts, aggregates, service decomposition |
| 03 | Sync Communication       | REST, gRPC, protocol buffers, API versioning        |
| 04 | Async Communication      | Message queues, RabbitMQ, Kafka, event-driven       |
| 05 | API Gateway              | Routing, authentication, rate limiting, Kong        |
| 06 | Data Management          | Database per service, saga pattern, CQRS            |
| 07 | Resilience Patterns      | Circuit breaker, retry, bulkhead, timeout           |
| 08 | Service Discovery        | Consul, Eureka, DNS-based discovery                 |
| 09 | Observability            | Tracing, logging, metrics, Prometheus, Jaeger       |
| 10 | Security                 | OAuth2, JWT, mTLS, secrets management               |
| 11 | Containerization         | Docker, multi-stage builds, best practices          |
| 12 | Kubernetes               | Deployments, services, ConfigMaps, scaling          |
| 13 | Testing                  | Unit, integration, contract, E2E testing            |
| 14 | Interview Prep           | Common questions, system design, coding             |
| 15 | Capstone Project         | E-commerce platform with 10+ services               |
| 16 | Service Mesh             | Istio, Linkerd, traffic management, mTLS            |
| 17 | Configuration Management | Consul, feature flags, hot reload                   |
| 18 | CI/CD                    | GitOps, ArgoCD, GitHub Actions, canary deploys      |
| 19 | Database Patterns        | Data partitioning, migrations, replication          |
| 20 | Caching Strategies       | Redis, cache-aside, invalidation, distributed       |
| 21 | Chaos Engineering        | Chaos Monkey, LitmusChaos, game days                |
| 22 | Case Studies             | Netflix, Uber, Amazon, Spotify architectures        |
| 23 | Load Balancing           | Client/server-side, algorithms, health checks       |
| 24 | Migration Patterns       | Strangler fig, branch by abstraction, CDC           |
| 25 | Event Sourcing Deep Dive | Event stores, projections, snapshots                |
| 26 | GraphQL Federation       | Apollo Federation, schema composition               |

***

## Course Structure

The curriculum is organized into **10 tracks** progressing from fundamentals to Staff+ expertise:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      MICROSERVICES MASTERY                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TRACK 1: FOUNDATIONS           TRACK 2: COMMUNICATION                      │
│  ─────────────────────          ────────────────────────                    │
│  ☑ Monolith vs Microservices    ☑ Synchronous (REST/gRPC)                   │
│  ☑ Domain-Driven Design         ☑ Asynchronous (Message Queues)             │
│  ☑ Service Decomposition        ☑ Event-Driven Architecture                 │
│  ☑ API Design Principles        ☑ API Gateway Patterns                      │
│  ☑ Database per Service         ☑ GraphQL Federation                        │
│                                                                              │
│  TRACK 3: DATA MANAGEMENT       TRACK 4: RESILIENCE                         │
│  ─────────────────────────      ──────────────────────                      │
│  ☑ Database per Service         ☑ Circuit Breaker Pattern                   │
│  ☑ Saga Pattern                 ☑ Retry Strategies                          │
│  ☑ Event Sourcing               ☑ Bulkhead Isolation                        │
│  ☑ CQRS Pattern                 ☑ Timeout Patterns                          │
│  ☑ Caching Strategies           ☑ Chaos Engineering                         │
│                                                                              │
│  TRACK 5: DEPLOYMENT            TRACK 6: OBSERVABILITY                      │
│  ─────────────────────          ────────────────────────                    │
│  ☑ Containerization             ☑ Distributed Tracing                       │
│  ☑ Kubernetes Basics            ☑ Centralized Logging                       │
│  ☑ Service Discovery            ☑ Metrics & Dashboards                      │
│  ☑ Service Mesh (Istio)         ☑ Health Checks                             │
│  ☑ CI/CD Pipelines              ☑ Alerting Systems                          │
│                                                                              │
│  TRACK 7: SECURITY              TRACK 8: ADVANCED PATTERNS                  │
│  ─────────────────────          ────────────────────────────                │
│  ☑ Service-to-Service Auth      ☑ Strangler Fig Pattern                     │
│  ☑ API Gateway Security         ☑ Backend for Frontend (BFF)                │
│  ☑ Secrets Management           ☑ Sidecar Pattern                           │
│  ☑ Zero Trust Architecture      ☑ Ambassador Pattern                        │
│  ☑ Rate Limiting & Throttling   ☑ Migration Patterns                        │
│                                                                              │
│  TRACK 9: REAL-WORLD            TRACK 10: INTERVIEW PREP                    │
│  ─────────────────────          ────────────────────────                    │
│  ☑ Netflix Architecture         ☑ System Design Questions                   │
│  ☑ Uber Dispatch System         ☑ Coding Challenges                         │
│  ☑ Amazon SOA Journey           ☑ Behavioral Questions                      │
│  ☑ Spotify Squad Model          ☑ Architecture Deep Dives                   │
│  ☑ Configuration Management     ☑ Capstone Project                          │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│  CAPSTONE PROJECT: E-COMMERCE PLATFORM                                       │
│  ─────────────────────────────────────                                       │
│  □ User Service          □ Order Service        □ Inventory Service          │
│  □ Payment Service       □ Notification Service □ Search Service             │
│  □ API Gateway           □ Event Bus (Kafka)    □ Full Observability         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Learning Path

<Steps>
  <Step title="Foundations (Week 1-2)">
    Understand when and why to use microservices. Learn DDD and service decomposition strategies.
  </Step>

  <Step title="Communication Patterns (Week 3-4)">
    Master sync/async communication, event-driven architecture, and API gateway patterns.
  </Step>

  <Step title="Data Management (Week 5-6)">
    Handle distributed data with sagas, event sourcing, CQRS, and caching strategies.
  </Step>

  <Step title="Resilience Patterns (Week 7-8)">
    Build fault-tolerant services with circuit breakers, retries, bulkheads, and chaos engineering.
  </Step>

  <Step title="Deployment & DevOps (Week 9-10)">
    Deploy with Docker, Kubernetes, service mesh, and implement CI/CD pipelines.
  </Step>

  <Step title="Observability (Week 11-12)">
    Implement distributed tracing, logging, metrics, and comprehensive monitoring.
  </Step>

  <Step title="Security & Advanced (Week 13-14)">
    Secure your services and learn advanced architectural patterns including migration strategies.
  </Step>

  <Step title="Case Studies (Week 15-16)">
    Learn from Netflix, Uber, Amazon, and Spotify architectures. Understand real-world trade-offs.
  </Step>

  <Step title="GraphQL & Event Sourcing (Week 17-18)">
    Master GraphQL Federation and deep dive into event sourcing patterns.
  </Step>

  <Step title="Capstone Project (Week 19-20)">
    Build a production-ready e-commerce platform with 10+ services.
  </Step>
</Steps>

***

## Conway's Law as a Warning, Not a Tattoo

Every course on microservices mentions Conway's Law. Most treat it as an inspirational quote. It is not. It is a load-bearing constraint on your architecture, and ignoring it is the fastest way to a distributed monolith.

<Warning>
  **Caveats & Common Pitfalls — The Conway Traps**

  * **The "diagram first, org last" trap.** Architects draw a beautiful service mesh diagram, hand it to a functionally-organized engineering department, and expect the diagram to materialize. Six months later the "microservices" share databases and require cross-team tickets for every change. The org chart won.
  * **The shared-ownership trap.** "This service is owned by the platform team *and* the payments team" means it is owned by nobody. On-call gaps, slow decisions, and conflicting roadmaps follow within a quarter.
  * **The "extract first, reorg later" trap.** Teams extract services while keeping the old functional org. The new service is touched by 6 teams, each making small changes, none deploying independently. This is the exact definition of a distributed monolith.
  * **The "too many teams, too few services" trap.** If you have 15 teams and 5 services, you will see queue-heavy feature delivery because every service release requires coordinating 3 teams. Service count and team count should grow roughly in parallel.
</Warning>

<Tip>
  **Solutions & Patterns — The Inverse Conway Maneuver**

  Instead of fighting Conway's Law, use it. Before you extract a service, verify there is a single team that can own it end-to-end: code, database, deploy pipeline, on-call rotation.

  **Decision rule:** *No service should exist without a single owning team, and no team should own more than 2-3 services without platform support.*

  **Before/after example:** At a healthtech company I advised, the "Appointments" service was owned by three teams -- Scheduling, Notifications, and Billing -- because the service touched all three concerns. Deploys required Slack coordination across team leads. *After* reorganizing so that the Scheduling team fully owned Appointments (with Billing and Notifications subscribing to its events via Kafka), deploy frequency for Appointments went from once every two weeks to daily, and incidents dropped by about 60%. The code barely changed; the org chart did.

  **Practical implementation:**

  1. **Start with the teams you want.** Sketch the team topology (Stream-aligned teams, Platform team, Enabling team) from Team Topologies (Skelton and Pais, 2019).
  2. **Map services to teams before extraction.** If two teams both claim a service, the service boundary is wrong.
  3. **Give each team deploy independence as the acceptance test.** Can they deploy at 3 AM without waking anyone else? If not, the boundary is not real yet.
</Tip>
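
The decision rule above is easy to check against a service catalog. Here is a minimal sketch, assuming a hypothetical in-memory mapping from service name to its owning teams and a limit of 3 services per team (the upper end of the "2-3" rule). It does not model the "without platform support" exception, and the function name is an illustration rather than an existing tool.

```python
from collections import Counter

MAX_SERVICES_PER_TEAM = 3  # assumption: upper end of the "2-3 services" rule


def ownership_violations(owners: dict[str, list[str]]) -> list[str]:
    """Report services without exactly one owner, then teams that own too many services."""
    problems = []
    for service, teams in sorted(owners.items()):
        if len(teams) == 0:
            problems.append(f"{service}: no owning team")
        elif len(teams) > 1:
            problems.append(f"{service}: shared by {', '.join(sorted(teams))}")

    # Count only cleanly owned services toward each team's load.
    load = Counter(teams[0] for teams in owners.values() if len(teams) == 1)
    for team, count in sorted(load.items()):
        if count > MAX_SERVICES_PER_TEAM:
            problems.append(f"{team}: owns {count} services (limit {MAX_SERVICES_PER_TEAM})")
    return problems


# Example catalog resembling the shared-ownership case described above.
catalog = {
    "appointments": ["scheduling", "notifications", "billing"],
    "invoices": ["billing"],
}
violations = ownership_violations(catalog)
```

Run as a CI check on the catalog file, a function like this turns the shared-ownership trap into a failing build instead of a discovery made during an incident.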

***

## Projects You'll Build

<CardGroup cols={2}>
  <Card title="API Gateway" icon="door-open">
    Build a custom API gateway with rate limiting, auth, and intelligent routing
  </Card>

  <Card title="Event Bus" icon="envelope">
    Implement message broker integration with RabbitMQ and Apache Kafka
  </Card>

  <Card title="Saga Orchestrator" icon="sitemap">
    Build distributed transaction handling with compensation logic
  </Card>

  <Card title="Service Mesh Demo" icon="network-wired">
    Deploy services with Istio for traffic management, mTLS, and observability
  </Card>

  <Card title="Observability Stack" icon="chart-line">
    Set up Prometheus, Grafana, Jaeger, and centralized logging
  </Card>

  <Card title="Chaos Engineering Lab" icon="shuffle">
    Implement chaos experiments with LitmusChaos and game days
  </Card>

  <Card title="GraphQL Federation" icon="code-branch">
    Build unified APIs across microservices with Apollo Federation
  </Card>

  <Card title="E-Commerce Platform" icon="store">
    Complete microservices application with 10+ production-ready services
  </Card>
</CardGroup>

***

## Interview Topics Covered

### System Design Questions

* Design a URL shortener with microservices
* Design an e-commerce checkout system
* Design a notification service at scale
* Design a real-time messaging system
* Design a rate limiter service

### Coding Questions

* Implement circuit breaker from scratch
* Design an event-driven order processing system
* Build a distributed cache invalidation system
* Implement saga pattern with compensation
* Create a service discovery mechanism

### Behavioral/Architecture Questions

* When would you choose microservices over monolith?
* How do you handle distributed transactions?
* How do you debug issues in a microservices system?
* How do you handle service failures gracefully?
* How do you ensure data consistency across services?

***

## Tech Stack

| Category              | Technologies                                                          |
| --------------------- | --------------------------------------------------------------------- |
| **Languages**         | Node.js/TypeScript **and** Python 3.11+ (every example shown in both) |
| **Frameworks**        | Express, Fastify, NestJS (Node) / FastAPI (Python)                    |
| **Databases**         | PostgreSQL, MongoDB, Redis                                            |
| **Message Queues**    | RabbitMQ, Apache Kafka                                                |
| **Containers**        | Docker, Docker Compose                                                |
| **Orchestration**     | Kubernetes, Helm                                                      |
| **API Gateway**       | Kong, Express Gateway                                                 |
| **Service Mesh**      | Istio, Linkerd                                                        |
| **Observability**     | Prometheus, Grafana, Jaeger, OpenTelemetry                            |
| **CI/CD**             | GitHub Actions, ArgoCD, GitOps                                        |
| **GraphQL**           | Apollo Server, Apollo Federation                                      |
| **Chaos Engineering** | Chaos Monkey, LitmusChaos                                             |
| **Configuration**     | Consul, ConfigMaps, Feature Flags                                     |

***

## Prerequisites

<AccordionGroup>
  <Accordion title="Node.js Fundamentals">
    * JavaScript/TypeScript basics
    * async/await, Promises
    * Express.js basics
    * npm/yarn package management
  </Accordion>

  <Accordion title="Database Knowledge">
    * SQL basics (PostgreSQL preferred)
    * NoSQL concepts (MongoDB)
    * Basic caching concepts
  </Accordion>

  <Accordion title="Web Development">
    * REST API design
    * HTTP methods and status codes
    * JSON data format
    * Basic authentication concepts
  </Accordion>

  <Accordion title="DevOps Basics">
    * Command line basics
    * Git version control
    * Basic Docker knowledge (helpful but not required)
  </Accordion>
</AccordionGroup>

***

## Languages Used in This Course

Every non-trivial code example in this course is shown in **both Node.js/TypeScript and Python**, side by side using tabbed blocks. You pick the stack you already use at work — the underlying distributed-systems patterns are identical in either language.

<Info>
  **Why two languages?** Most microservices interviewers care about the *pattern* (how do you implement a circuit breaker, idempotency key, saga step?), not your syntax. Showing both stacks makes it obvious which lines are the pattern and which lines are ceremony. If you can read the Node.js and the Python side of a circuit-breaker example, you understand the pattern deeply.
</Info>
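
To make the pattern-versus-ceremony distinction concrete, here is a minimal circuit-breaker sketch in Python. It is an illustration only: it is not the course's implementation and not the API of `pybreaker` or `opossum`. The class name, the default thresholds, and the injectable clock are assumptions, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable so tests can control time
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        # PATTERN: while open, fail fast instead of hitting the struggling dependency.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so this one trial call is allowed through.
        try:
            result = fn()
        except Exception:
            # PATTERN: count failures; open at the threshold, or re-open after a failed trial.
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # PATTERN: any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The lines marked `PATTERN` are the three state transitions (closed to open, open to half-open, half-open back to closed or open) that would look the same in TypeScript. Everything else, including the type hints and the exception class, is the language-specific ceremony.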

### Node.js/TypeScript Stack

| Layer                 | Library                           | Purpose                             |
| --------------------- | --------------------------------- | ----------------------------------- |
| **HTTP framework**    | Express, Fastify                  | REST APIs, middleware               |
| **HTTP client**       | axios, undici                     | Outbound calls with retries         |
| **ORM / data access** | Prisma, TypeORM                   | Type-safe queries, migrations       |
| **Validation**        | zod, class-validator              | Request/response schemas            |
| **Message broker**    | amqplib, kafkajs                  | RabbitMQ, Kafka producers/consumers |
| **Resilience**        | opossum, cockatiel                | Circuit breakers, retries, timeouts |
| **Observability**     | @opentelemetry/api, pino          | Tracing, structured logs            |
| **GraphQL**           | Apollo Server, @apollo/subgraph   | Federated gateway + subgraphs       |
| **Testing**           | Jest, Vitest, Pact                | Unit, integration, contract tests   |

### Python Stack

| Layer                 | Library                              | Purpose                                  |
| --------------------- | ------------------------------------ | ---------------------------------------- |
| **HTTP framework**    | FastAPI                              | Async REST APIs with Pydantic validation |
| **HTTP client**       | httpx                                | Async outbound calls with timeouts       |
| **ORM / data access** | SQLAlchemy 2.0 (async)               | Typed queries, Alembic migrations        |
| **Validation**        | Pydantic v2                          | Request/response schemas                 |
| **Message broker**    | aiokafka, aio-pika                   | Kafka, RabbitMQ async clients            |
| **Resilience**        | pybreaker, tenacity                  | Circuit breakers, retry with backoff     |
| **Observability**     | opentelemetry-sdk, structlog         | Tracing, structured logs                 |
| **GraphQL**           | strawberry-graphql                   | Federation-compatible subgraphs          |
| **Testing**           | pytest, pytest-asyncio, schemathesis | Unit, integration, contract tests        |

Both stacks are production-grade and widely used at large tech companies. Node.js is common in edge and API-gateway tiers; Netflix and Uber have both written publicly about running it there. Python (especially FastAPI with SQLAlchemy 2.0) is everywhere at ML-heavy companies, fintech, and the data-engineering side of most large organizations.

***

## Learning Path by Background

Not everyone should read this course linearly. Pick the path that matches how you plan to use it.

<AccordionGroup>
  <Accordion title="Backend Engineer Migrating to Microservices">
    You've built monoliths, you're now on a team that's decomposing one, and you need the full picture fast. Skip the philosophical "should we use microservices" debates — you're already committed. Focus on the mechanics.

    **Suggested reading order:**

    1. **01 Foundations** — Just the failure modes and trade-offs section, so you know what to watch for.
    2. **02 Domain-Driven Design** — Bounded contexts are how you decide where to cut the monolith. This is non-negotiable.
    3. **24 Migration Patterns** — Strangler fig, branch by abstraction, CDC. Read this *before* writing any new service.
    4. **03 Sync Communication + 04 Async Communication** — You'll use both. Understand when to reach for which.
    5. **06 Data Management + 19 Database Patterns** — The hardest part of migration is the data. Budget weeks here.
    6. **07 Resilience Patterns + 09 Observability** — Non-negotiable before production traffic hits the new service.
    7. **13 Testing** — Contract tests are how you avoid 3am pages when your service schema drifts.

    Estimated time: **6-8 weeks** of focused study alongside day-job work.
  </Accordion>

  <Accordion title="SRE / Platform Engineer">
    You own the infrastructure, not the business logic. Your job is keeping 50 services running, not designing one. Skip the DDD chapter — you don't decide bounded contexts. Go deep on the operational chapters.

    **Suggested reading order:**

    1. **11 Containerization + 12 Kubernetes** — Your bread and butter. Read carefully.
    2. **09 Observability** — Distributed tracing, metrics, log aggregation. This is your on-call lifeline.
    3. **08 Service Discovery + 23 Load Balancing** — How requests actually find services in production.
    4. **16 Service Mesh** — Istio, Linkerd. Decide whether you need it (usually only at 20+ services).
    5. **07 Resilience Patterns + 21 Chaos Engineering** — You'll be the one running game days.
    6. **10 Security** — mTLS, secrets rotation, zero-trust networking.
    7. **17 Configuration Management + 18 CI/CD** — Platform plumbing.
    8. **22 Case Studies** — How Netflix, Uber, Amazon structured their platform teams.

    Estimated time: **5-7 weeks**. Skip: 02, 14, 15, 25, 26 unless curious.
  </Accordion>

  <Accordion title="Full-Stack Engineer Preparing for Interviews">
    You have a system-design round coming up in 4-8 weeks. You're not building microservices at work, but the interviewer will ask about them. You need breadth, strong vocabulary, and 2-3 case studies you can cite confidently.

    **Suggested reading order (interview-optimized):**

    1. **14 Interview Prep** — Read this *first* to calibrate what's actually asked.
    2. **01 Foundations + 02 Domain-Driven Design** — The "why" questions.
    3. **03 + 04 Communication** — Sync vs async trade-offs, the most common follow-up.
    4. **06 Data Management** — Sagas, CQRS, eventual consistency. Expect a drill-down here.
    5. **07 Resilience Patterns** — Circuit breakers come up in almost every interview.
    6. **09 Observability** — "How would you debug this?" is a senior-level screener.
    7. **22 Case Studies** — Memorize 2-3 you can cite by name (Netflix Chaos Monkey, Uber Cadence, Amazon service ownership).
    8. **15 Capstone** — Skim the architecture diagrams. You don't need to build it.

    Estimated time: **3-5 weeks** if you're studying evenings and weekends. Skip: 11, 12, 16, 17, 18, 21, 24, 25 unless the JD mentions them.
  </Accordion>

  <Accordion title="Engineering Manager">
    You won't write the code, but you'll own the decision: "should we adopt this pattern?" You need enough depth to push back on bad proposals, ask the right questions in design reviews, and budget correctly. Focus on trade-offs and case studies, skim the mechanics.

    **Suggested reading order:**

    1. **01 Foundations** — Read the trade-offs and "hidden costs" sections carefully. Skim the code.
    2. **22 Case Studies** — How Netflix, Uber, Amazon, Spotify structured their teams. This is Conway's Law in practice.
    3. **24 Migration Patterns** — If you're leading a decomposition, this chapter is what you'll reference for stakeholder conversations.
    4. **14 Interview Prep** — Read the "What are the hidden costs?" answer. You'll use this framing in budget discussions.
    5. **13 Testing + 18 CI/CD** — Skim. Understand what contract testing buys you so you can advocate for it.
    6. **09 Observability** — Skim the "infrastructure cost" commentary. Know the rough price tag.
    7. **21 Chaos Engineering** — Just the philosophy section. Know what game days are and why they matter.

    Estimated time: **2-3 weeks** of reading. You're optimizing for vocabulary and judgment, not implementation skill.
  </Accordion>
</AccordionGroup>

***

## How to Use This Course

There's no single right way to work through 27 chapters. Pick the strategy that matches your timeline and goals.

<CardGroup cols={2}>
  <Card title="Strategy 1: Linear Deep Dive" icon="book-open">
    **Best for:** Engineers with 16-20 weeks to invest who want mastery, not just interview readiness.

    Read chapters 00 through 26 in order. Do the code exercises in your daily-driver language and skim the other language's version. Build the capstone project in chapter 15 in parallel, since it reinforces everything.

    **Pace:** 1-2 chapters per week. Expect 8-12 hours per chapter if you're doing the exercises seriously.
  </Card>

  <Card title="Strategy 2: Topic-Focused" icon="bullseye">
    **Best for:** Engineers who already work with microservices but have specific gaps.

    Identify the 4-6 chapters that cover your weak spots (common gaps: sagas, circuit breakers, observability, Kubernetes internals). Read those deeply. Skim everything else. Use the "Key Takeaways" sidebars to self-assess.

    **Pace:** 4-6 weeks total. Focus on the gaps, don't re-read what you already know.
  </Card>

  <Card title="Strategy 3: Interview Prep Only" icon="user-tie">
    **Best for:** Engineers with an interview in 3-5 weeks who need breadth and vocabulary.

    Follow the Full-Stack Engineer reading path above. Prioritize the "Strong Answer" boxes and "Vocabulary" sections. Memorize 2-3 real case studies you can cite. Practice explaining trade-offs out loud.

    **Pace:** 3-5 weeks. Skim code, memorize arguments. Don't waste time on K8s manifests if your interview is about system design.
  </Card>

  <Card title="Strategy 4: Reference Guide" icon="book-bookmark">
    **Best for:** Senior engineers who already know this material but want a trusted lookup.

    Bookmark the chapters that match your current project. Use Ctrl+F for specific topics. Read the "Honest Truth" and "Pitfalls" sections of chapters you're about to implement. The code blocks are production-quality and meant to be copied/adapted.

    **Pace:** Ongoing. Dip in when needed. The course is designed to hold up as a reference, not just a linear tutorial.
  </Card>
</CardGroup>

<Tip>
  **Whatever strategy you pick:** do the "Self-Assessment" section of each chapter you read. If you can't hold the debate its "If you can debate X and Y, you're at senior level" prompt asks for, you haven't finished the chapter, no matter how many lines you've read.
</Tip>

***

## Ready to Begin?

<Card title="Start with Foundations" icon="play" href="./01-foundations">
  Begin your microservices journey by understanding when and why to use microservices architecture.
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You are interviewing for a Senior Backend role. The interviewer says: 'We have a successful monolith serving 5 million users. Leadership wants to move to microservices. Walk me through how you would advise them.'">
    **Strong Answer:**

    The first thing I would push back on is the premise. "Leadership wants microservices" is not a technical requirement -- it is an organizational desire. I would start by asking three questions: How many engineering teams are blocked on each other during deploys? Which components have fundamentally different scaling profiles? And what is their DevOps maturity -- do they have CI/CD, container orchestration, and observability in place?

    If the answer is "two teams, uniform scaling, and we deploy manually," microservices will make things worse, not better. At 5 million users with only one or two teams, a well-structured modular monolith is almost certainly the right answer. You get module boundaries, clear interfaces, and the ability to extract later -- without the operational tax of distributed systems.

    If they genuinely have 50+ engineers, clear domain boundaries, and mature DevOps, I would recommend an incremental approach using the Strangler Fig pattern. Start with the domain that has the most independent scaling needs or the highest deployment friction. Extract one service, run it in production for a quarter, learn what your infrastructure gaps are, and then decide whether to continue. The worst outcome is extracting 15 services in parallel and discovering your observability stack cannot handle distributed tracing across them.

    **Follow-up: "What specific metrics would you track to know whether the migration is succeeding or failing?"**

    I would track four things. First, deployment frequency per team -- if microservices are working, teams should be deploying more often, not less. Second, change failure rate -- are we breaking things more frequently because of distributed complexity? Third, mean time to recovery -- when something breaks, can we isolate and fix it faster? Fourth, developer satisfaction scores, because if engineers hate the new system, adoption will stall regardless of the architecture's elegance. The first three are standard DORA delivery metrics, so baseline them before the migration starts. If deployment frequency goes up but change failure rate also spikes, that tells you the team is moving fast without adequate testing or contract enforcement.
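
The first three metrics can be computed from a deployment log. The sketch below assumes a hypothetical `Deployment` record (the field names are illustrative, not taken from any specific CI/CD tool) and, as a simplification, measures recovery time from the moment of the failed deploy rather than from incident detection. Developer satisfaction comes from surveys and is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record for one team."""
    team: str
    deployed_at: datetime
    failed: bool = False                    # did this deploy cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def migration_metrics(deployments, window_days):
    """Summarize one team's deployments over a window of `window_days` days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    # Simplification: recovery time is measured from the failed deploy itself.
    recoveries = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_week": total / (window_days / 7),
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mttr": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
    }
```

Running this per team before and after each extraction gives a like-for-like comparison: rising `deploys_per_week` with a flat `change_failure_rate` is the healthy signal, while both rising together is the warning sign described above.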
  </Accordion>

  <Accordion title="An interviewer asks: 'You mentioned that microservices are not always the right answer. Give me a concrete example of a company or team that adopted microservices too early and what went wrong.'">
    **Strong Answer:**

    The canonical example is Segment. They wrote a detailed blog post ("Goodbye Microservices") about moving from a monolith to microservices and then back to a single service. They had split their event-delivery pipeline into one service per destination, which grew to well over a hundred services, and the shared libraries those services depended on drifted across versions. The operational overhead was staggering: a small team spent more time keeping the services running than building features. They eventually consolidated back into a single well-structured service and reported that making changes became significantly faster.

    The pattern I see repeatedly is: small teams adopt microservices because they read about Netflix doing it, without recognizing that Netflix has thousands of engineers and a dedicated platform team that builds internal tools for service management. When a 20-person startup has 30 services, each engineer is responsible for multiple services, and nobody understands the full system. You end up with a distributed monolith where every change requires coordinated deployments across services, which is strictly worse than the original monolith.

    The rule of thumb I use: if your team is small enough that everyone can fit in one room and understand the whole codebase, you do not need microservices. You need good module boundaries inside your monolith.

    **Follow-up: "So when would you say the tipping point is -- when does the pain of a monolith exceed the pain of microservices?"**

    In my experience, the tipping point is not about lines of code or request volume -- it is about team structure. When you have 3 or more teams that need to deploy independently on different cadences, and they are constantly blocking each other with merge conflicts and coordinated release windows, that is the signal. Conway's Law is real: your architecture will mirror your organization. If you have autonomous teams with clear domain ownership, microservices support that autonomy. If you have one team wearing many hats, microservices just add overhead. The other signal is genuinely divergent scaling needs -- if your search component needs 10x the compute of your user profile component, and you are paying for 10x compute across the entire monolith, that is a real financial argument for extraction.
  </Accordion>

  <Accordion title="'What are the hidden costs of microservices that most teams underestimate during planning?'">
    **Strong Answer:**

    There are five costs that consistently blindside teams:

    * **Observability infrastructure.** In a monolith, you grep a log file. In microservices, you need distributed tracing (Jaeger or Zipkin), centralized logging (ELK or Loki), metrics aggregation (Prometheus plus Grafana), and correlation IDs flowing through every service. At a previous company, setting up the observability stack for 12 services took a dedicated engineer three months and cost around \$4,000/month in infrastructure.

    * **Data consistency complexity.** You lose ACID transactions across service boundaries. Suddenly every cross-service operation needs a saga pattern or event-driven choreography, with compensation logic for failures. A simple "place order, charge payment, reserve inventory" flow that was one database transaction becomes an asynchronous multi-step workflow with at least six failure modes.

    * **Testing overhead.** Contract testing, integration testing across services, and end-to-end testing all get harder with every service-to-service dependency you add. Teams that skip contract testing inevitably discover on a Friday afternoon that Service A changed its response format and Service B is now throwing null pointer exceptions in production.

    * **Network reliability.** Every inter-service call is a network call that can fail, time out, or return stale data. You need circuit breakers, retries with exponential backoff, fallback strategies, and timeout budgets. This is code that did not exist in the monolith and adds cognitive load to every feature.

    * **Operational toil.** Each service needs its own CI/CD pipeline, health checks, deployment configuration, secret management, and on-call rotation. Multiply every operational concern by the number of services. Teams with 20 services and no platform engineering team often spend 40-50% of their time on operational tasks rather than feature development.
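
The retry-with-backoff piece of the network reliability cost can be sketched as follows. This is a minimal standard-library version, assuming that `ConnectionError` and `TimeoutError` are the transient failures worth retrying; the function name and its parameters are hypothetical. It uses full jitter (a random delay up to the exponential ceiling) so that many clients do not retry in lockstep.

```python
import random
import time


def retry_with_backoff(
    func,
    max_attempts=4,
    base_delay=0.1,
    max_delay=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,  # injectable so tests do not actually wait
):
    """Call `func` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Only retry operations that are safe to repeat. A non-idempotent call such as charging a card needs an idempotency key on the server side before a wrapper like this is appropriate.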

    **Follow-up: "How would you budget for these costs when pitching a migration to leadership?"**

    I would frame it as a tax rate. For the first year of migration, expect 30-40% of engineering capacity to go toward infrastructure and tooling rather than features. That decreases to maybe 15-20% once the platform stabilizes. I would also budget for a dedicated platform or infrastructure team -- at minimum two engineers -- once you pass 10 services. Without that investment, every product team reinvents the same wheels (logging, auth, deployment) differently, which compounds technical debt rapidly.
  </Accordion>
</AccordionGroup>
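
The "place order, charge payment, reserve inventory" flow from the hidden-costs answer can be sketched as an orchestrated saga. This is a minimal in-process illustration under stated assumptions: `run_saga`, `SagaFailed`, and the step functions are hypothetical names, the steps are plain synchronous functions instead of calls to separate services, and compensations are assumed never to fail. A production saga would also persist its progress so it can resume after a crash.

```python
class SagaFailed(Exception):
    """Raised after a step fails and completed steps have been compensated."""


def run_saga(steps, context):
    """Run (name, action, compensation) steps in order; undo on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(context)
        except Exception as exc:
            # Undo the steps that already succeeded, most recent first.
            for _done_name, undo in reversed(completed):
                undo(context)
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc
        completed.append((name, compensate))
    return context


# Hypothetical steps: each one just records what it did in the context.
def create_order(ctx):
    ctx["log"].append("order created")

def cancel_order(ctx):
    ctx["log"].append("order cancelled")

def charge_payment(ctx):
    ctx["log"].append("payment charged")

def refund_payment(ctx):
    ctx["log"].append("payment refunded")

def reserve_inventory(ctx):
    if not ctx["in_stock"]:
        raise RuntimeError("out of stock")
    ctx["log"].append("inventory reserved")

def release_inventory(ctx):
    ctx["log"].append("inventory released")


ORDER_SAGA = [
    ("create_order", create_order, cancel_order),
    ("charge_payment", charge_payment, refund_payment),
    ("reserve_inventory", reserve_inventory, release_inventory),
]
```

Calling `run_saga(ORDER_SAGA, {"log": [], "in_stock": False})` raises `SagaFailed` after refunding the payment and cancelling the order, in that order. Each compensation is a new forward action (a refund, a cancellation) and not a database rollback, which is why the single-transaction version of this flow was so much simpler.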