Capstone Project: E-Commerce Platform

Build a production-ready e-commerce platform using all the microservices patterns you have learned. This is where theory meets reality. The patterns that sounded clean in isolation — sagas, event sourcing, circuit breakers — start bumping into each other in interesting ways when you wire them together. That friction is the point: wrestling with real trade-offs is what transforms knowledge into skill.
Project Goals:
  • Apply all microservices patterns in practice
  • Build a portfolio-worthy project
  • Gain hands-on experience with real-world challenges
  • Prepare for system design interviews

Project Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    E-COMMERCE MICROSERVICES PLATFORM                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                              ┌─────────────┐                                │
│                              │   Client    │                                │
│                              │  (React)    │                                │
│                              └──────┬──────┘                                │
│                                     │                                        │
│                              ┌──────▼──────┐                                │
│                              │ API Gateway │                                │
│                              │   (Kong)    │                                │
│                              └──────┬──────┘                                │
│                                     │                                        │
│      ┌──────────────────────────────┼──────────────────────────────┐        │
│      │                              │                              │        │
│      ▼                              ▼                              ▼        │
│ ┌─────────┐                  ┌─────────────┐                ┌──────────┐   │
│ │  User   │                  │   Order     │                │ Product  │   │
│ │ Service │                  │  Service    │                │ Catalog  │   │
│ │         │                  │             │                │          │   │
│ │ MongoDB │                  │ PostgreSQL  │                │ MongoDB  │   │
│ └─────────┘                  └──────┬──────┘                └──────────┘   │
│                                     │                                        │
│                    ┌────────────────┼────────────────┐                      │
│                    │                │                │                      │
│                    ▼                ▼                ▼                      │
│             ┌───────────┐    ┌───────────┐    ┌───────────┐                │
│             │  Payment  │    │ Inventory │    │   Cart    │                │
│             │  Service  │    │  Service  │    │  Service  │                │
│             │           │    │           │    │           │                │
│             │  Stripe   │    │PostgreSQL │    │   Redis   │                │
│             └───────────┘    └───────────┘    └───────────┘                │
│                                     │                                        │
│                              ┌──────▼──────┐                                │
│                              │    Kafka    │                                │
│                              │  (Events)   │                                │
│                              └──────┬──────┘                                │
│                                     │                                        │
│                              ┌──────▼──────┐                                │
│                              │Notification │                                │
│                              │  Service    │                                │
│                              └─────────────┘                                │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                      OBSERVABILITY STACK                             │   │
│  │  Prometheus → Grafana    Jaeger (Tracing)    Loki (Logs)            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Phase 1: Project Setup

Directory Structure

ecommerce-microservices/
├── docker-compose.yml
├── docker-compose.override.yml
├── Makefile
├── README.md

├── services/
│   ├── api-gateway/
│   ├── user-service/
│   ├── product-service/
│   ├── cart-service/
│   ├── order-service/
│   ├── payment-service/
│   ├── inventory-service/
│   └── notification-service/
│
├── shared/
│   ├── proto/               # gRPC definitions
│   ├── events/              # Event schemas
│   └── utils/               # Shared utilities
│
├── infrastructure/
│   ├── k8s/                 # Kubernetes manifests
│   ├── helm/                # Helm charts
│   ├── prometheus/          # Monitoring config
│   └── grafana/             # Dashboards
│
└── tests/
    ├── unit/
    ├── integration/
    ├── contract/
    └── e2e/

Initial Setup Script

#!/bin/bash
# setup.sh
set -euo pipefail  # stop on the first failed command instead of continuing from the wrong directory

# Create service directories
services=("api-gateway" "user-service" "product-service" "cart-service" "order-service" "payment-service" "inventory-service" "notification-service")

for service in "${services[@]}"; do
    mkdir -p "services/$service/src"

    # Initialize npm project in a subshell so the working directory is restored automatically
    (
        cd "services/$service"
        npm init -y
        npm install express cors helmet morgan
        npm install --save-dev jest nodemon
    )
done

# Create shared directories
mkdir -p shared/{proto,events,utils}
mkdir -p infrastructure/{k8s,helm,prometheus,grafana}
mkdir -p tests/{unit,integration,contract,e2e}

echo "Project structure created!"

Phase 2: Core Services

User Service

Why this service exists and its bounded context. The User Service owns everything related to identity: authentication credentials, profile data, addresses, and user preferences. Its bounded context is deliberately narrow: it does not know about orders, carts, or payments. It answers one question: “Who is this person, and what do we know about them?” Separating identity from the rest of the domain means we can swap authentication providers (e.g., move from password-based auth to OAuth) without touching order logic.

How it communicates with other services. The User Service exposes a synchronous REST API for user lookups (e.g., GET /users/:id called by the Order Service when creating an order). It also publishes domain events (user.created, user.updated, user.deleted) to Kafka so downstream services like Notification and Cart can react without polling. Inter-service reads are sync because callers need the data immediately to complete their own request; fan-out notifications are async because latency tolerance is higher.

What data it owns and why. It owns the canonical user record, including the email (the unique identifier) and the password hash. No other service stores passwords; that would multiply the attack surface. Other services reference users by userId and fetch additional profile data on demand or subscribe to profile-updated events and cache what they need.

Key design decisions. MongoDB was chosen because user profiles evolve (new fields get added frequently) and have natural document shape (addresses nested inside users). PostgreSQL was considered and rejected because migrating schemas for every new profile field would slow feature velocity. We rejected a shared auth library for multiple services (tempting but creates hidden coupling) in favor of short-lived JWTs issued here and verified everywhere.
// services/user-service/src/index.js
// Initialize tracing before other imports so instrumentation can hook into express and mongoose
const { initTracing } = require('./observability/tracing');
initTracing('user-service');

const express = require('express');
const mongoose = require('mongoose');
const { Metrics } = require('./observability/metrics');
const { logger } = require('./observability/logger');

const app = express();
const metrics = new Metrics({ serviceName: 'user-service' });

// Middleware
app.use(express.json());
app.use(metrics.middleware());

// User Schema
const UserSchema = new mongoose.Schema({
  email: { type: String, required: true, unique: true },
  passwordHash: { type: String, required: true },
  profile: {
    firstName: String,
    lastName: String,
    phone: String
  },
  addresses: [{
    street: String,
    city: String,
    country: String,
    postalCode: String,
    isDefault: Boolean
  }],
  createdAt: { type: Date, default: Date.now }
});

const User = mongoose.model('User', UserSchema);

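// Routes
// Password hashing uses Node's built-in scrypt. The `password` request field and the
// salt:hash storage format are assumptions made for this example.
const crypto = require('crypto');
const { promisify } = require('util');
const scrypt = promisify(crypto.scrypt);

app.post('/users', async (req, res) => {
  try {
    // Accept only known fields so clients cannot set passwordHash or createdAt directly
    const { email, password, profile, addresses } = req.body;
    if (!email || !password) {
      return res.status(400).json({ error: 'email and password are required' });
    }

    const salt = crypto.randomBytes(16).toString('hex');
    const derivedKey = await scrypt(password, salt, 64);
    const passwordHash = `${salt}:${derivedKey.toString('hex')}`;

    const user = new User({ email, passwordHash, profile, addresses });
    await user.save();

    logger.info('User created', { userId: user._id });
    res.status(201).json({ id: user._id, email: user.email });
  } catch (error) {
    logger.error('Failed to create user', { error: error.message });
    res.status(400).json({ error: error.message });
  }
});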

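app.get('/users/:id', async (req, res) => {
  try {
    const user = await User.findById(req.params.id).select('-passwordHash');
    if (!user) {
      return res.status(404).json({ error: 'User not found' });
    }
    res.json(user);
  } catch (error) {
    // A malformed id makes Mongoose throw a CastError; anything else is a server-side failure
    logger.error('Failed to fetch user', { error: error.message });
    const status = error.name === 'CastError' ? 400 : 500;
    res.status(status).json({ error: status === 400 ? 'Invalid user id' : 'Internal error' });
  }
});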

// Health checks
app.get('/health/live', (req, res) => res.json({ status: 'UP' }));
app.get('/health/ready', async (req, res) => {
  const dbReady = mongoose.connection.readyState === 1;
  res.status(dbReady ? 200 : 503).json({ 
    status: dbReady ? 'READY' : 'NOT_READY',
    database: dbReady
  });
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.register.contentType);
  res.send(await metrics.getMetrics());
});

// Start server
const PORT = process.env.PORT || 3000;

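mongoose.connect(process.env.MONGODB_URI)
  .then(() => {
    app.listen(PORT, () => {
      logger.info(`User service running on port ${PORT}`);
    });
  })
  .catch((error) => {
    // Exit so the container runtime can restart the service instead of leaving it half-started
    logger.error('Failed to connect to MongoDB', { error: error.message });
    process.exit(1);
  });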
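
The "short-lived JWTs issued here and verified everywhere" decision can be sketched with nothing but Node's built-in crypto module. This is a minimal illustration, not the service's actual auth code: the function names `signToken` and `verifyToken`, the 15-minute default TTL, and the choice of HS256 are all assumptions. HS256 keeps the example short, but it means every verifying service holds the signing secret; an asymmetric algorithm would let other services verify with a public key only.

```javascript
const crypto = require('crypto');

// Encode a string as base64url, the encoding JWT segments use
const b64url = (input) => Buffer.from(input).toString('base64url');

// Issue a token that expires ttlSeconds after `now` (hypothetical helper)
function signToken(payload, secret, ttlSeconds = 900, now = Date.now()) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const iat = Math.floor(now / 1000);
  const body = b64url(JSON.stringify({ ...payload, iat, exp: iat + ttlSeconds }));
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest('base64url');
  return `${header}.${body}.${signature}`;
}

// Verify signature and expiry; returns the claims or throws (hypothetical helper)
function verifyToken(token, secret, now = Date.now()) {
  const [header, body, signature] = token.split('.');
  if (!header || !body || !signature) {
    throw new Error('Malformed token');
  }
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest();
  const actual = Buffer.from(signature, 'base64url');
  // Constant-time comparison; the length check is required by timingSafeEqual
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    throw new Error('Invalid signature');
  }
  const claims = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  if (claims.exp <= Math.floor(now / 1000)) {
    throw new Error('Token expired');
  }
  return claims;
}
```

A downstream service would call `verifyToken` in its own middleware and read `claims.sub` as the userId, so no network call to the User Service is needed per request.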

Order Service with Saga

Why this service exists and its bounded context. The Order Service is the orchestrator of the checkout journey. Its bounded context covers order lifecycle: creation, confirmation, fulfillment, and cancellation. Crucially, it does not directly reserve inventory or charge cards; it coordinates those operations through a saga. This separation means the Order Service remains the single source of truth for “what orders exist and what state are they in,” even as downstream services come and go.

How it communicates with other services. Order Service talks to the outside world (API Gateway, clients) synchronously via REST, because the client needs an immediate order ID to show on screen. Internally, the saga uses Kafka for all coordination. The Order Service publishes commands (inventory.reserve, payment.process) and subscribes to outcomes (inventory.reserved, payment.completed, payment.failed). Async messaging is non-negotiable here because sync chaining across inventory + payment + notification would mean any slow downstream service blocks checkout.

What data it owns and why. Orders, order items, shipping addresses (snapshotted at order time, not referenced, because addresses change but the order’s shipping address must not), and the saga state machine. The saga state (STARTED, INVENTORY_RESERVED, COMPLETED, COMPENSATING, FAILED) is stored alongside the order so that a crashed service can recover mid-saga by reading the last known state.

Key design decisions. We chose orchestration over choreography for the saga. Choreography (each service reacts to events from other services) scales poorly once sagas exceed three steps: debugging becomes a nightmare of “who sent what to whom.” Orchestration centralizes the flow at the cost of a slight coupling increase. PostgreSQL was chosen because order state transitions need ACID guarantees; a half-written order is worse than no order.
// services/order-service/src/sagas/OrderSaga.js
const { Kafka } = require('kafkajs');

class OrderSaga {
  constructor(orderRepository, kafka) {
    this.orderRepository = orderRepository;
    this.producer = kafka.producer();
    this.consumer = kafka.consumer({ groupId: 'order-saga' });
  }

  async start() {
    await this.producer.connect();
    await this.consumer.connect();
    
    await this.consumer.subscribe({ 
      topics: ['payment.completed', 'payment.failed', 'inventory.reserved', 'inventory.failed'] 
    });

    await this.consumer.run({
      eachMessage: async ({ topic, message }) => {
        const event = JSON.parse(message.value.toString());
        await this.handleEvent(topic, event);
      }
    });
  }

  async createOrder(orderData) {
    // Step 1: Create order in PENDING state
    const order = await this.orderRepository.create({
      ...orderData,
      status: 'PENDING',
      sagaState: 'STARTED'
    });

    // Step 2: Request inventory reservation
    await this.producer.send({
      topic: 'inventory.reserve',
      messages: [{
        key: order.id,
        value: JSON.stringify({
          orderId: order.id,
          items: order.items
        })
      }]
    });

    return order;
  }

  async handleEvent(topic, event) {
    const order = await this.orderRepository.findById(event.orderId);
    if (!order) return; // Order may have been deleted or this is a stale event -- safe to ignore

    switch (topic) {
      case 'inventory.reserved':
        // Step 3: Request payment
        await this.orderRepository.update(order.id, { sagaState: 'INVENTORY_RESERVED' });
        await this.producer.send({
          topic: 'payment.process',
          messages: [{
            key: order.id,
            value: JSON.stringify({
              orderId: order.id,
              customerId: order.customerId,
              amount: order.total
            })
          }]
        });
        break;

      case 'payment.completed':
        // Step 4: Complete order
        await this.orderRepository.update(order.id, { 
          status: 'CONFIRMED',
          sagaState: 'COMPLETED',
          paymentId: event.paymentId
        });
        await this.producer.send({
          topic: 'order.confirmed',
          messages: [{
            key: order.id,
            value: JSON.stringify({ orderId: order.id })
          }]
        });
        break;

      case 'payment.failed':
        // Compensate: Release inventory -- this is the saga's compensation logic.
        // If payment fails, we must undo the inventory reservation or those items
        // remain locked forever (a common production bug).
        await this.orderRepository.update(order.id, { 
          status: 'FAILED',
          sagaState: 'COMPENSATING'
        });
        await this.producer.send({
          topic: 'inventory.release',
          messages: [{
            key: order.id,
            value: JSON.stringify({
              orderId: order.id,
              items: order.items
            })
          }]
        });
        break;

      case 'inventory.failed':
        // No compensation needed, just fail the order
        await this.orderRepository.update(order.id, { 
          status: 'FAILED',
          sagaState: 'FAILED',
          failureReason: 'Insufficient inventory'
        });
        break;
    }
  }
}

module.exports = { OrderSaga };
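
Because Kafka can deliver a message more than once, a saga handler benefits from checking that an incoming event is valid for the order's current sagaState before acting on it. The sketch below is one way to express that as a transition table; it is an illustration, not part of the OrderSaga class above. The names `SAGA_TRANSITIONS` and `nextSagaState` are hypothetical, and `inventory.released` is an assumed confirmation topic that would let a compensating saga reach FAILED.

```javascript
// Allowed saga transitions, keyed by current state and then by incoming topic.
// 'inventory.released' is a hypothetical confirmation topic (an assumption).
const SAGA_TRANSITIONS = {
  STARTED: {
    'inventory.reserved': 'INVENTORY_RESERVED',
    'inventory.failed': 'FAILED'
  },
  INVENTORY_RESERVED: {
    'payment.completed': 'COMPLETED',
    'payment.failed': 'COMPENSATING'
  },
  COMPENSATING: {
    'inventory.released': 'FAILED'
  },
  COMPLETED: {},
  FAILED: {}
};

// Returns the next state, or null when the event does not apply to the
// current state (for example a duplicate or out-of-order delivery).
function nextSagaState(currentState, topic) {
  const allowed = SAGA_TRANSITIONS[currentState];
  if (!allowed || !Object.prototype.hasOwnProperty.call(allowed, topic)) {
    return null;
  }
  return allowed[topic];
}
```

A handler could call `nextSagaState(order.sagaState, topic)` first and return early on `null`, so a redelivered inventory.reserved event does not publish a second payment.process command.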

Payment Service

Why this service exists and its bounded context. The Payment Service is the only component allowed to talk to payment providers (Stripe, PayPal, etc.). Its bounded context is narrow on purpose: capture charges, record transactions, and expose idempotent payment outcomes. Keeping this logic isolated also keeps PCI-DSS scope minimal: only one service (and its hosts) is in scope for the audit.

How it communicates with other services. Payment Service is almost entirely async. It consumes payment.process commands from Kafka and emits payment.completed / payment.failed events. Avoiding sync HTTP to the Order Service decouples checkout latency from Stripe latency: if Stripe has a 5-second spike, the Order Service is not blocked. The Payment Service does make one sync outbound call: to Stripe itself, because that is the nature of third-party payment APIs.

What data it owns and why. Payment intents, transaction records, and idempotency keys. It does not store raw card data. Stripe tokenizes cards, and we only store Stripe’s customer ID and payment method ID. This is the core of keeping PCI scope manageable: if we never hold a PAN (primary account number), we do not have to secure one.

Key design decisions. Idempotency is implemented at three layers: an idempotency key per incoming command (deduplicates retries), Stripe’s Idempotency-Key header (deduplicates upstream calls), and a database unique constraint on (order_id, payment_intent_id) (last-line defense). We rejected an in-memory-only idempotency store because it loses state on pod restart, and deploys would produce duplicate charges. Redis or the database are the only acceptable stores.
// services/payment-service/src/index.js
// Initialize tracing first so instrumentation can hook into the modules required below
const { initTracing } = require('./observability/tracing');
initTracing('payment-service');

const express = require('express');
const { Kafka } = require('kafkajs');
const Stripe = require('stripe');

const app = express();
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

const kafka = new Kafka({
  clientId: 'payment-service',
  brokers: process.env.KAFKA_BROKERS.split(',')
});

const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: 'payment-service' });

// Idempotency store -- In production, use Redis or a database table, not an in-memory Map.
// An in-memory Map is lost on restart, meaning duplicate charges can slip through during deploys.
const processedPayments = new Map();

async function processPayment(event) {
  const { orderId, customerId, amount } = event;
  
  // Check idempotency
  if (processedPayments.has(orderId)) {
    console.log(`Payment already processed for order ${orderId}`);
    return processedPayments.get(orderId);
  }

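  let paymentIntent;
  try {
    // Get customer's payment method (getCustomerPaymentMethod is assumed to be defined elsewhere in this service)
    const customer = await getCustomerPaymentMethod(customerId);

    // Create payment intent. The Stripe idempotency key makes a retried request
    // return the original intent instead of creating a second charge.
    paymentIntent = await stripe.paymentIntents.create({
      amount: Math.round(amount * 100),
      currency: 'usd',
      customer: customer.stripeCustomerId,
      payment_method: customer.defaultPaymentMethod,
      confirm: true,
      metadata: { orderId }
    }, { idempotencyKey: `payment-${orderId}` });

    // A confirmed intent can still be waiting on customer action (e.g. 3D Secure)
    if (paymentIntent.status !== 'succeeded') {
      throw new Error(`Payment not completed: ${paymentIntent.status}`);
    }
  } catch (error) {
    // Publish failure event so the saga can release the reserved inventory
    await producer.send({
      topic: 'payment.failed',
      messages: [{
        key: orderId,
        value: JSON.stringify({
          orderId,
          error: error.message
        })
      }]
    });

    // Do not rethrow: the failure has been reported, and rethrowing would make
    // kafkajs redeliver the command and retry the charge.
    return { orderId, status: 'FAILED' };
  }

  const result = {
    paymentId: paymentIntent.id,
    orderId,
    status: 'COMPLETED'
  };

  // Publish success event outside the try block so a publish error is never
  // reported as a failed payment
  await producer.send({
    topic: 'payment.completed',
    messages: [{
      key: orderId,
      value: JSON.stringify(result)
    }]
  });

  // Store for idempotency only after the event is published, so a redelivered
  // command can still publish it if the first attempt failed
  processedPayments.set(orderId, result);

  return result;
}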

// Start consumer
async function start() {
  await producer.connect();
  await consumer.connect();
  
  await consumer.subscribe({ topic: 'payment.process' });
  
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      await processPayment(event);
    }
  });
  
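  const PORT = process.env.PORT || 3000;
  app.listen(PORT, () => {
    console.log(`Payment service running on port ${PORT}`);
  });
}

start().catch((error) => {
  // Exit on startup failure so the orchestrator can restart the service
  console.error('Payment service failed to start', error);
  process.exit(1);
});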
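
The prose above names Redis or the database as the only acceptable idempotency stores. A minimal sketch of the Redis variant follows, assuming an ioredis-style client whose `set` accepts `'EX', seconds, 'NX'` arguments. The function name `processOnce`, the key prefix, and the 24-hour TTL are assumptions for illustration; `charge` stands in for whatever performs the actual payment.

```javascript
// Runs `charge` at most once per orderId across restarts and replicas.
// `redis` is assumed to be an ioredis-style client (set/del returning promises).
async function processOnce(redis, orderId, charge, ttlSeconds = 86400) {
  const key = `payment:processed:${orderId}`;

  // SET ... NX succeeds only for the first caller; EX bounds how long the marker lives
  const claimed = await redis.set(key, 'IN_PROGRESS', 'EX', ttlSeconds, 'NX');
  if (claimed !== 'OK') {
    return { duplicate: true };
  }

  try {
    const result = await charge();
    // Replace the marker with the outcome so later lookups can return it
    await redis.set(key, JSON.stringify(result), 'EX', ttlSeconds);
    return { duplicate: false, result };
  } catch (error) {
    // Release the claim so a later retry can attempt the charge again
    await redis.del(key);
    throw error;
  }
}
```

Releasing the claim on failure is only safe when the provider call itself carries an idempotency key; otherwise a timed-out charge that actually succeeded could be repeated on retry.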

Phase 3: Infrastructure

Docker Compose

# docker-compose.yml
version: '3.8'

services:
  api-gateway:
    build: ./services/api-gateway
    ports:
      - "8080:3000"
    environment:
      - USER_SERVICE_URL=http://user-service:3000
      - ORDER_SERVICE_URL=http://order-service:3000
      - PRODUCT_SERVICE_URL=http://product-service:3000
    depends_on:
      - user-service
      - order-service
      - product-service

  user-service:
    build: ./services/user-service
    environment:
      - MONGODB_URI=mongodb://user-db:27017/users
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
    depends_on:
      - user-db

  order-service:
    build: ./services/order-service
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@order-db:5432/orders
      - KAFKA_BROKERS=kafka:9092
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
    depends_on:
      - order-db
      - kafka

  product-service:
    build: ./services/product-service
    environment:
      - MONGODB_URI=mongodb://product-db:27017/products
      - REDIS_URL=redis://redis:6379
    depends_on:
      - product-db
      - redis

  payment-service:
    build: ./services/payment-service
    environment:
      - STRIPE_SECRET_KEY=${STRIPE_SECRET_KEY}
      - KAFKA_BROKERS=kafka:9092
    depends_on:
      - kafka

  inventory-service:
    build: ./services/inventory-service
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@inventory-db:5432/inventory
      - KAFKA_BROKERS=kafka:9092
    depends_on:
      - inventory-db
      - kafka

  cart-service:
    build: ./services/cart-service
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis

  notification-service:
    build: ./services/notification-service
    environment:
      - KAFKA_BROKERS=kafka:9092
      - SENDGRID_API_KEY=${SENDGRID_API_KEY}
    depends_on:
      - kafka

  # Databases
  user-db:
    image: mongo:6
    volumes:
      - user-db-data:/data/db

  order-db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=orders
      - POSTGRES_PASSWORD=postgres
    volumes:
      - order-db-data:/var/lib/postgresql/data

  product-db:
    image: mongo:6
    volumes:
      - product-db-data:/data/db

  inventory-db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=inventory
      - POSTGRES_PASSWORD=postgres
    volumes:
      - inventory-db-data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

  # Message Broker
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
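      # KRaft expects a base64-encoded UUID here; generate your own with `kafka-storage random-uuid`
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'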
    volumes:
      - kafka-data:/var/lib/kafka/data

  # Observability
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./infrastructure/prometheus:/etc/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    volumes:
      - ./infrastructure/grafana:/etc/grafana/provisioning

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "4317:4317"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

volumes:
  user-db-data:
  order-db-data:
  product-db-data:
  inventory-db-data:
  redis-data:
  kafka-data:

Phase 4: Testing

Contract Test Example

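Why contract tests here. Contract tests verify that the Order Service (the consumer) and the Payment Service (the provider) agree on the shape of their interaction, without requiring both services to be deployed at the same time during CI. The test generates a pact file from the consumer side that the provider verifies independently. This lets teams ship independently without a shared staging gate. The example models the interaction as an HTTP request for illustration; the Kafka commands and events described earlier can be covered the same way with Pact's message-based contracts.
// tests/contract/OrderPayment.consumer.test.js
const { Pact, Matchers } = require('@pact-foundation/pact');
const path = require('path');

const { like } = Matchers;

// Hypothetical HTTP client under test; the module path and constructor options are assumptions
const { PaymentClient } = require('../../services/order-service/src/clients/PaymentClient');
const paymentClient = new PaymentClient({ baseUrl: 'http://localhost:8081' });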

describe('Order Service - Payment Service Contract', () => {
  const provider = new Pact({
    consumer: 'OrderService',
    provider: 'PaymentService',
    port: 8081,
    dir: path.resolve(process.cwd(), 'pacts')
  });

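  beforeAll(() => provider.setup());
  // Fail the test if the client did not make the registered request
  afterEach(() => provider.verify());
  afterAll(() => provider.finalize());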

  it('should process payment successfully', async () => {
    await provider.addInteraction({
      state: 'customer has valid payment method',
      uponReceiving: 'payment request',
      withRequest: {
        method: 'POST',
        path: '/payments',
        body: {
          orderId: '12345',
          amount: 99.99,
          customerId: 'cust-123'
        }
      },
      willRespondWith: {
        status: 200,
        body: {
          paymentId: like('pay_abc123'),
          status: 'COMPLETED'
        }
      }
    });

    // Test actual client
    const result = await paymentClient.process({
      orderId: '12345',
      amount: 99.99,
      customerId: 'cust-123'
    });

    expect(result.status).toBe('COMPLETED');
  });
});

E2E Test

Why this E2E test exists. Unit and contract tests validate individual services; E2E tests validate the wired-up system. This test drives a full checkout flow through the API Gateway and asserts that the saga completes. It is slow (seconds, not milliseconds) and flaky-prone (async timing), so we only use it for critical revenue paths — not for every branch.
// tests/e2e/checkout.test.js
const axios = require('axios');

describe('Checkout Flow E2E', () => {
  const api = axios.create({ baseURL: 'http://localhost:8080' });
  let authToken, orderId;

  beforeAll(async () => {
    const { data } = await api.post('/auth/login', {
      email: 'test@example.com',
      password: 'password'
    });
    authToken = data.token;
    api.defaults.headers.Authorization = `Bearer ${authToken}`;
  });

  it('should complete checkout successfully', async () => {
    // 1. Add to cart
    await api.post('/cart/items', {
      productId: 'prod-123',
      quantity: 2
    });

    // 2. Create order
    const orderResponse = await api.post('/orders', {
      shippingAddressId: 'addr-123'
    });
    orderId = orderResponse.data.id;
    expect(orderResponse.data.status).toBe('PENDING');

    // 3. Process payment
    await api.post(`/orders/${orderId}/pay`, {
      paymentMethodId: 'pm-123'
    });

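    // 4. Poll until the saga leaves PENDING instead of sleeping for a fixed time
    let finalOrder;
    const deadline = Date.now() + 15000;
    do {
      await new Promise(r => setTimeout(r, 500));
      finalOrder = await api.get(`/orders/${orderId}`);
    } while (finalOrder.data.status === 'PENDING' && Date.now() < deadline);

    // 5. Verify order confirmed
    expect(finalOrder.data.status).toBe('CONFIRMED');
  }, 20000); // Jest timeout raised above the polling deadline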
});

Technology Selection Rationale

Choosing the right database, broker, and cache for each service is one of the most impactful decisions in the project. Here is why the capstone uses what it does:
| Service | Technology | Why This Choice | Alternative Considered | Why Not |
|---|---|---|---|---|
| User Service | MongoDB | Flexible user profile schema; no complex joins needed | PostgreSQL | Profile data is document-shaped; schema evolves frequently |
| Order Service | PostgreSQL | ACID transactions for order lifecycle; complex queries for reporting | MongoDB | Order state machines need transactional guarantees |
| Cart Service | Redis | Ephemeral data; sub-millisecond reads; natural TTL for cart expiry | PostgreSQL | Carts are session-scoped, not permanent records |
| Product Catalog | MongoDB | Varied product attributes per category; nested reviews | PostgreSQL | Each product category has different fields (clothing vs electronics) |
| Inventory Service | PostgreSQL | Exact counts require strong consistency; decrement-and-check is transactional | Redis | Race conditions on concurrent stock decrements without DB transactions |
| Payment Service | External (Stripe) | PCI compliance offloaded; no card data stored locally | Self-hosted | PCI-DSS compliance costs more than the Stripe fee for most businesses |
| Event Bus | Kafka | Durable, replayable event log; supports consumer groups; handles backpressure | RabbitMQ | Need event replay for rebuilding projections; Kafka’s log model fits event sourcing |

Edge Cases to Handle in Your Implementation

These are the scenarios that separate a portfolio project from a production system. Addressing even a few of them demonstrates senior-level thinking:
  1. Double-submit on checkout — User clicks “Pay” twice. Without idempotency keys, you charge them twice. Solution: generate a client-side idempotency key and check it server-side before processing.
  2. Inventory reserved but payment times out — The saga reserved inventory, but the payment provider never responds. If you do not set a reservation TTL, those items are locked forever. Solution: inventory reservations expire after 15 minutes; a background job releases expired reservations.
  3. Kafka consumer lag during a flash sale — Your order service publishes events faster than the notification service can consume them. Users get order confirmations 30 minutes late. Solution: monitor consumer lag in Grafana; auto-scale consumers based on lag metrics.
  4. Partial failure in the observability stack — Jaeger is down, but your services are fine. If your health check includes Jaeger connectivity, you’ll mark healthy services as unhealthy. Solution: observability dependencies should never be in the critical path; degrade tracing gracefully.
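
The reservation-expiry job from item 2 can be sketched as a pure function that a background worker would run on a timer. This is an illustration under assumptions: the reservation shape (`orderId`, `reservedAt` as epoch milliseconds) and the name `sweepExpiredReservations` are hypothetical, and a real Inventory Service would do this with a query against its own table rather than an in-memory array.

```javascript
const RESERVATION_TTL_MS = 15 * 60 * 1000; // 15 minutes, matching the TTL suggested above

// Splits reservations into those still active and those whose TTL has elapsed.
// Each reservation is assumed to look like { orderId, productId, quantity, reservedAt }.
function sweepExpiredReservations(reservations, now = Date.now()) {
  const active = [];
  const toRelease = [];
  for (const reservation of reservations) {
    const age = now - reservation.reservedAt;
    if (age >= RESERVATION_TTL_MS) {
      toRelease.push(reservation);
    } else {
      active.push(reservation);
    }
  }
  return { active, toRelease };
}
```

The worker would then return the quantities in `toRelease` to stock and mark the matching orders as failed, which covers the case where the payment provider never responds.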

Caveats & Common Pitfalls: Integrating All the Patterns

Integration traps that only surface once the patterns meet each other:
  • Saga + circuit breaker + retries = double-debits. Your circuit breaker opens on Stripe timeout, the saga’s retry policy fires, and the payment provider may have succeeded on the first attempt. Without an end-to-end idempotency key (not just at the HTTP layer), you charge twice. Teams routinely discover this in chargebacks three weeks post-launch.
  • Health checks that recursively depend on downstream services. The user service’s readiness probe calls the auth service which calls the user service. One pod restarts, a brief cascade takes down all of them. Keep readiness probes local: check your own DB connection, not your neighbors’.
  • Distributed tracing becomes the biggest log producer. At 100% sampling across 8 services, you will generate more trace volume than application logs. Sample at 1-5% in production; sample 100% only for flagged requests or error paths.
  • The observability stack consumes more resources than the app. Prometheus scraping every pod every 15s + Jaeger + Loki routinely eats 30-40% of cluster CPU on small-to-medium deployments. Budget for it, or you will be surprised on the AWS bill.
Solutions & Patterns:
  • Pass an end-to-end correlation key (not just an HTTP idempotency header) through every saga step and store it on the payment record with a unique constraint. Even a retry that crosses a circuit-breaker boundary will collide at the DB layer.
  • Readiness probes check self, liveness probes check critical self-functions. If your pod cannot serve traffic because the DB is down, that is a readiness failure (stop routing traffic) not a liveness failure (do not restart it — the DB will not come back faster). Confusing these two is the #1 cause of restart storms.
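  • Use tail-based sampling with error-biased upsampling. Keep 1% of happy-path traces, but keep 100% of traces that touched an error response code. Because an error is only known once the trace has finished, this decision has to be made after the fact (for example in an OpenTelemetry Collector), not at the head of the request. You get cheap baseline observability plus full detail where it matters.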
  • Allocate an observability budget upfront (e.g., 20% of cluster CPU/memory) and treat it as a hard constraint. When the budget is breached, the team must cut cardinality (drop labels) or sampling rate rather than scale up infrastructure silently.
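
The error-biased sampling rule can be written as a small decision function applied once a trace's spans are known. This is a sketch of the policy only, not a collector configuration: the span shape (`statusCode`, `error`) and the name `shouldKeepTrace` are assumptions, and treating only 5xx codes as errors is a choice made here to keep client mistakes from inflating trace volume.

```javascript
// Decide whether to keep a finished trace.
// `spans` is assumed to be an array of { statusCode, error } objects;
// `random` is injectable so the decision can be tested deterministically.
function shouldKeepTrace(spans, baseRate = 0.01, random = Math.random) {
  const hasError = spans.some((span) => span.error === true || span.statusCode >= 500);
  if (hasError) {
    return true; // always keep traces that touched an error
  }
  return random() < baseRate; // keep a small share of happy-path traces
}
```

With `baseRate` at 0.01, roughly one in a hundred successful traces is retained while every failing trace survives.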

Interview Questions: Greenfield E-commerce Architecture

Strong Answer Framework:
  1. Start with domain decomposition, not services. Map the bounded contexts (Identity, Catalog, Cart, Order, Payment, Inventory, Fulfillment, Notification). Services follow contexts, not the other way around. If two “services” share a database, they are one service with an extra network hop.
  2. Pick the sync/async axis per interaction. Checkout fan-out (order then inventory then payment) uses async via Kafka; user lookup uses sync REST because the caller cannot proceed without the answer.
  3. Choose storage per service based on access pattern. Postgres for transactional state (orders, payments), MongoDB for document-shaped entities (users, products), Redis for ephemeral state (carts, sessions).
  4. Decide on saga orchestration vs choreography. Orchestration for checkout because the flow is linear and debuggable; choreography for non-critical fan-out (notifications, analytics).
  5. Define the deployment substrate. Kubernetes + GitOps (ArgoCD) + per-service Helm charts. Start with rolling deploys, add canary when volume justifies the tooling cost.
  6. Observability from day one, not day 90. OpenTelemetry SDK in every service. Prometheus for metrics, Jaeger or Tempo for traces, Loki or CloudWatch for logs. Start with 5% sampling and RED metrics per service.
  7. Security baselines. mTLS via service mesh or in-app TLS, secrets in Vault or a managed secrets store, per-service IAM/service account scoping. PCI scope is isolated to payment service only.
  8. Testing strategy. Contract tests between every service pair that communicates; integration tests per service against real dependencies in CI; E2E only for revenue-critical flows.
  9. Plan for v2. Year three (500K/day, ~6 orders/sec on average) is well within a well-tuned Postgres + 10-service cluster. Do not over-architect for imagined scale; plan to re-shard inventory when it hits 200M rows.
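The throughput figure in step 9 is simple arithmetic and worth checking by hand. The sketch below converts a daily volume into a per-second rate; the 5x peak factor is purely an illustrative assumption, not a number from this course.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_sec(events_per_day: int) -> float:
    """Average events per second if the daily volume were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


def peak_rate_per_sec(events_per_day: int, peak_factor: float) -> float:
    """Peak rate under an assumed peak-to-average multiplier."""
    return average_rate_per_sec(events_per_day) * peak_factor


avg = average_rate_per_sec(500_000)
peak = peak_rate_per_sec(500_000, peak_factor=5.0)  # assumed multiplier
print(f"average: {avg:.1f}/sec")          # average: 5.8/sec
print(f"peak at assumed 5x: {peak:.1f}/sec")  # peak at assumed 5x: 28.9/sec
```

Even with a generous peak multiplier the rate stays in the tens per second, which supports the point about not over-architecting.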
Real-World Example: Shopify’s public architecture talks (2019-2022) describe exactly this evolution: they started with a Rails monolith, extracted services (identity, checkout, payments) at clear bounded contexts, and still run a pod-based sharding model for the core storefront. They explicitly rejected the “100 services from day one” approach.
Senior Follow-up Questions:
Q: How would you handle inventory consistency when three users try to buy the last unit simultaneously? A: Optimistic concurrency with UPDATE inventory SET qty = qty - 1 WHERE product_id = ? AND qty >= 1 and check the affected-rows count. Only one UPDATE succeeds. The saga’s inventory-reservation step then uses a short-TTL lock (15 minutes). This is more scalable than pessimistic locks and avoids the distributed-lock-via-Redis trap where a network partition can issue two locks.
Q: What would you not build as microservices at this stage? A: Reporting and analytics. Shove all of that into a data warehouse (BigQuery, Snowflake) and let BI tools query it. Building a “reporting service” is where most teams accidentally rebuild a SQL database on top of microservices. Also: admin panels — let the monolithic admin talk directly to shared reporting views.
Q: When would you split the Order service into multiple services? A: When the bounded context stops being “orders” and becomes “orders + returns + disputes + subscriptions.” Each of those has its own state machine, its own integrations, and different release cadences. The signal: multiple teams making conflicting changes to the same codebase, or the PR queue getting backed up for weeks.
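The conditional-update answer to the last-unit question above can be sketched as follows. It uses an in-memory SQLite database as a stand-in for the Postgres inventory store, and the `inventory` table layout and `reserve_one` helper are assumptions for illustration. The three attempts run sequentially here; the database applies concurrent updates to one row one at a time, so the outcome is the same.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, product_id: str) -> bool:
    """Try to take one unit; True only if a row was actually updated."""
    cur = conn.execute(
        "UPDATE inventory SET qty = qty - 1 "
        "WHERE product_id = ? AND qty >= 1",
        (product_id,),
    )
    conn.commit()
    # rowcount is the affected-rows check from the answer above.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id TEXT PRIMARY KEY, qty INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")
conn.commit()

# Three buyers race for the last unit.
results = [reserve_one(conn, "sku-1") for _ in range(3)]
print(results)  # [True, False, False]
```

The `qty >= 1` guard in the `WHERE` clause is what prevents overselling: a losing buyer's statement matches zero rows and changes nothing.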
Common Wrong Answers:
  • “Start with 20 microservices, one per noun in the requirements.” This is the distributed monolith trap — you get all the operational cost of microservices with none of the team-autonomy benefit. Start with 3-5 coarse services, split when pain justifies it.
  • “Use event sourcing for everything because it’s the cleanest architecture.” Event sourcing is expensive (replay complexity, schema evolution, projection cost) and only pays off for domains with complex audit/temporal requirements. Using it for product catalog is tool-driven thinking, not problem-driven.
Further Reading:
  • Sam Newman, Building Microservices (2nd ed., O’Reilly, 2021) — the definitive reference on decomposition and bounded contexts.
  • Shopify Engineering blog posts on “deconstructing the monolith” and “Shopify Pods” (2017-2023).
  • Chris Richardson, Microservices Patterns (Manning, 2018) — saga, CQRS, and event sourcing patterns grounded in production use cases.
Strong Answer Framework:
  1. Pull the actual data before theorizing. Get 90 days of incidents with root-cause tags from the incident tracker. Do not trust your vibes about “which service is broken” — teams are systematically bad at this.
  2. Categorize the incidents. Code bugs (fix with better testing), infrastructure flakes (fix with infra investment), upstream failures (fix with resilience patterns), or design flaws (fix with rearchitecting).
  3. Look at the distribution. If 80% of incidents are one failure mode, that’s the thing to fix. If incidents are spread evenly across many modes, your Order service is probably just too busy — it owns too many responsibilities.
  4. Check the change log. Is the Order service the highest-churn codebase? High churn + high incidents = quality problem. Low churn + high incidents = architectural problem (the design amplifies external instability).
  5. Decide the intervention. Code quality: add integration tests, pair rotations, SLO-backed error budget. Operational: fix alerting thresholds, improve runbooks. Architectural: split the service or redesign the saga.
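Steps 3 and 4 above amount to two small checks that can be run against exported incident data. The sketch below is illustrative only: the function names, the 80% threshold parameter, and the sample tags are assumptions, and real root-cause tags would come from your incident tracker.

```python
from collections import Counter
from typing import Optional


def dominant_failure_mode(root_causes: list[str],
                          threshold: float = 0.8) -> Optional[str]:
    """Return the failure mode that reaches `threshold` share, else None."""
    if not root_causes:
        return None
    mode, count = Counter(root_causes).most_common(1)[0]
    return mode if count / len(root_causes) >= threshold else None


def diagnose(high_churn: bool, high_incidents: bool) -> str:
    """Map the churn/incident combination from step 4 to a diagnosis."""
    if not high_incidents:
        return "no clear signal"
    return "quality problem" if high_churn else "architectural problem"


tags = ["upstream failure"] * 9 + ["code bug"]
print(dominant_failure_mode(tags))                      # upstream failure
print(diagnose(high_churn=False, high_incidents=True))  # architectural problem
```

A `None` result from `dominant_failure_mode` corresponds to the "spread evenly across many modes" case, which points at a service with too many responsibilities.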
Real-World Example: Monzo (UK neobank) published an engineering blog (2019-2020) describing exactly this analysis. Their payments service was consuming disproportionate on-call time. Root cause: it was owning both “money movement” and “reconciliation,” which had wildly different failure modes. Splitting them dropped on-call load ~60%.
Senior Follow-up Questions:
Q: How do you avoid a “rewrite the Order service” 6-month project? A: Extract by pain, not by plan. Identify the hottest single responsibility (e.g., the saga orchestration) and move just that out. Build the new thing, migrate traffic, retire the old code path. Each extraction should be 2-4 weeks, not 6 months. If a rewrite proposal has “phase 1 of 4” in the roadmap, push back hard — that plan has never shipped in the history of software.
Q: The team insists the problem is “not enough resources” and wants 3 more engineers. How do you respond? A: Resources without architectural changes usually just distribute the pain more widely. Ask: “If we added 3 engineers, what specific on-call pages would go away?” If the answer is vague, the issue isn’t headcount. Fix the top 3 incident causes first, then re-evaluate.
Q: How do you present this diagnosis to leadership without sounding like you’re throwing the team under the bus? A: Frame it as “system outgrew its design” rather than “team made mistakes.” The original design was right for launch; the business changed. Leadership respects honest system-level diagnosis more than heroic firefighting narratives.
Common Wrong Answers:
  • “The Order service needs to be rewritten in Go/Rust.” Language is almost never the cause of on-call load. Pattern issues and coupling are.
  • “Add more automated tests and the incidents will stop.” Tests help code-quality incidents but not architectural ones. If the service fails because it owns too many responsibilities, tests just make the existing complexity more tolerable.
Further Reading:
  • Google SRE Book, Chapter 11 (“Being On-Call”) — metrics and thresholds for on-call health.
  • Will Larson, An Elegant Puzzle (Stripe Press, 2019) — “How to invest in technical infrastructure” chapter on diagnosing system pain.
  • Monzo Engineering blog, “How Monzo runs its payments platform” (2020).

Evaluation Checklist

Architecture

  • Clear service boundaries
  • API Gateway implemented
  • Service discovery working
  • Database per service
  • Event-driven communication

Resilience

  • Circuit breakers
  • Retry with backoff
  • Fallback strategies
  • Health checks
  • Graceful degradation

Data

  • Saga pattern for orders
  • Event sourcing (optional)
  • Idempotency handling
  • Data consistency

Observability

  • Distributed tracing
  • Centralized logging
  • Metrics & dashboards
  • Alerting configured

Summary

Congratulations on completing the Microservices Mastery course! You now have:
  • Deep understanding of microservices patterns
  • Hands-on experience with Node.js and Python implementations
  • Production-ready code for your portfolio
  • Interview preparation for top tech companies

What's Next?

  1. Deploy your capstone to Kubernetes (local or cloud)
  2. Add more features (search, recommendations)
  3. Practice system design interviews
  4. Share your project on GitHub

Interview Deep-Dive

Strong Answer:
First, I would add comprehensive error handling and compensation for every saga path. The capstone likely handles the happy path and one or two failure modes, but production needs coverage for every permutation: payment succeeds but notification service is down, inventory reserved but payment gateway times out and comes back with an ambiguous response (did it charge or not?), user cancels during the processing window. Each of these needs explicit handling, not just a generic catch-all error.
Second, I would add a proper secrets management system. The capstone probably uses environment variables or docker-compose env files. For production, I need HashiCorp Vault or AWS Secrets Manager with automatic rotation, especially for database credentials and payment API keys. A leaked Stripe secret key in production is a company-ending event for a small business.
Third, I would implement proper database backup, disaster recovery, and data retention policies. The capstone has no backup strategy. In production, I need automated daily backups with point-in-time recovery (PITR), tested restore procedures (backups you have not tested are not backups), and data retention policies for PCI compliance (payment card data has specific storage and deletion requirements).
Honorable mention: load testing. The capstone works at 10 requests per second. Does it work at 1,000? At 10,000? I would run load tests with a tool like k6 or Artillery to find the breaking points: which service falls over first, which database query becomes slow, which connection pool gets exhausted.
Follow-up: “If you could only add one observability tool to your capstone project for production, which would it be and why?”
Distributed tracing with OpenTelemetry and Jaeger. In a microservices system, the most common production issue is “the request is slow but I do not know which service is the bottleneck.” Distributed tracing answers that question in seconds by showing the latency breakdown across every service in the call chain. You can add metrics and logging later, but without tracing, debugging a 10-service call chain is like finding a needle in a haystack.
Strong Answer:
Preparation starts weeks before, not the day of. I would break it into three phases: capacity planning, pre-scaling, and real-time response.
Capacity planning: load test the system at 10x current traffic. Identify which components break first. Usually it is the database connection pool (PostgreSQL max_connections hit), then the payment gateway rate limit, then memory on the order service pods. For each bottleneck, decide: can I scale it horizontally (more pods), vertically (bigger instances), or do I need an architectural change (add caching, convert sync to async)?
Pre-scaling: the morning of Black Friday, I would manually scale critical services to their expected peak capacity. Do not rely on autoscaling to react fast enough: HPA-driven scale-out can take 3-5 minutes, and during a traffic spike those minutes mean thousands of failed requests. Pre-scale Order Service to 20 pods (instead of the usual 5), Payment Service to 15 pods, and double the Kafka partition count for the order events topic (more partitions = more consumer parallelism).
Real-time response: feature degradation. If the system is under stress, I would use feature flags to disable non-critical features: turn off product recommendations (reduces load on the ML service and product catalog), disable real-time inventory counts on the product page (show “In Stock” or “Out of Stock” from cache instead of live counts), and queue non-essential notifications (marketing emails) for post-event processing.
The nuclear option: if a downstream dependency (like the payment provider) starts failing, the circuit breaker opens and I show users “Your order is being processed, you will receive confirmation within 30 minutes.” The order and payment details are queued in Kafka and processed when the provider recovers. This is better than a checkout error page.
Follow-up: “After Black Friday, how do you handle the scale-down without disrupting in-progress orders?”
Graceful scale-down: I reduce replica counts gradually (20 -> 10 -> 5) over several hours, monitoring error rates at each step. Each pod that terminates must finish processing its current request before shutting down (graceful shutdown with a pre-stop hook). For Kafka consumers, I use the cooperative rebalancing protocol so that scaling down does not cause a full consumer group rebalance that pauses all consumers. The Kafka consumer lag metric tells me when it is safe to remove consumers: if lag is zero, consumption is keeping up with production.
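The "nuclear option" described above, a circuit breaker that queues orders instead of failing checkout, can be sketched in a few lines. This is a simplified illustration under stated assumptions: `CircuitBreaker` and `checkout` are hypothetical names, an in-memory `deque` stands in for the Kafka topic, and `ConnectionError` stands in for whatever failure the payment client raises.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cooldown, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def checkout(order, charge, breaker, pending) -> str:
    """Charge now if possible; otherwise queue the order for later processing."""
    if breaker.is_open():
        pending.append(order)  # provider is known to be down: skip the call
        return "queued"
    try:
        charge(order)
    except ConnectionError:
        breaker.record_failure()
        pending.append(order)
        return "queued"
    breaker.record_success()
    return "confirmed"


calls = []


def failing_charge(order):
    calls.append(order)
    raise ConnectionError("provider unavailable")


breaker = CircuitBreaker(failure_threshold=2)
pending = deque()
outcomes = [checkout({"id": i}, failing_charge, breaker, pending) for i in range(4)]
print(outcomes, len(calls), len(pending))
# ['queued', 'queued', 'queued', 'queued'] 2 4
```

Only the first two orders reach the failing provider. Once the breaker opens, later orders are queued without a call, which is what protects the provider and keeps checkout responsive.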
Strong Answer:
This is the idempotency problem, and for payments it is non-negotiable: charging a customer twice is both a revenue problem and a trust problem.
I implement idempotency at three layers. First, at the API level: every order creation request includes an idempotency key (a client-generated UUID). The Order Service stores this key in Redis with a TTL. If the same key arrives again (because the client retried after a timeout), the service returns the original response without creating a new order. Stripe’s API works exactly this way.
Second, at the saga level: the saga orchestrator generates a unique payment_idempotency_key when it sends the charge command to the Payment Service. The Payment Service forwards this key to Stripe (as Stripe’s Idempotency-Key header). If the saga retries the charge command (because it did not receive a response due to a crash), Stripe returns the result of the original charge without charging again.
Third, at the database level: the Payment Service has a unique constraint on (order_id, payment_intent_id). Even if the application logic somehow sends a duplicate charge, the database rejects the duplicate record. This is the last line of defense.
The tricky edge case: the Payment Service sends the charge to Stripe, Stripe processes it successfully, but the Payment Service crashes before recording the result. On restart, the saga retries. The Payment Service sends the same idempotency key to Stripe, which returns the original successful result. The Payment Service now records this result. From the saga’s perspective, the payment succeeded on the retry. From Stripe’s perspective, only one charge was made. This is why the idempotency key at the payment provider level is critical: it bridges the gap when our system’s state is ambiguous.
Follow-up: “How long do you keep the idempotency keys, and what happens if a user legitimately wants to place another order for the same items?”
Idempotency keys have a TTL, typically 24-48 hours at the API level (matching Stripe’s default). After that, the same key can be reused. But the key is generated per request, not per item combination. If a user places two separate orders for the same product, each order creation request has a different idempotency key. The idempotency key protects against retries of the same request, not against intentionally repeated business actions.
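The API-level layer described above can be sketched with a small key store. This is an assumption-laden illustration: `IdempotencyStore` and `create_order` are hypothetical names, and a Python dict with expiry timestamps stands in for Redis with a TTL.

```python
import time
import uuid


class IdempotencyStore:
    """In-memory stand-in for a Redis key store with per-key TTL."""

    def __init__(self, ttl_s: float = 24 * 3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, response)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: the key may be reused
            return None
        return response

    def put(self, key: str, response: dict) -> None:
        self._entries[key] = (self._clock() + self._ttl_s, response)


def create_order(store: IdempotencyStore, idempotency_key: str, items: list) -> dict:
    """Return the original response for a retried key; otherwise create an order."""
    cached = store.get(idempotency_key)
    if cached is not None:
        return cached
    response = {"order_id": str(uuid.uuid4()), "items": items}
    store.put(idempotency_key, response)
    return response


store = IdempotencyStore()
key = str(uuid.uuid4())
first = create_order(store, key, ["sku-1"])
retry = create_order(store, key, ["sku-1"])                 # client retry, same key
second = create_order(store, str(uuid.uuid4()), ["sku-1"])  # new request, new key
print(first["order_id"] == retry["order_id"])   # True
print(first["order_id"] == second["order_id"])  # False
```

The get-then-put sequence here is not atomic, so two simultaneous requests with the same key could both create an order. With Redis, a single `SET` using the `NX` and `EX` options closes that gap, and the database unique constraint remains the final safeguard.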