Microservices Design Patterns

Monolith vs Microservices

Architecture Comparison

┌─────────────────────────────────────────────────────────────────┐
│                        Monolith                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌───────────────────────────────────────────────────────┐    │
│   │                   Single Application                   │    │
│   │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐         │    │
│   │  │  User  │ │ Order  │ │Payment │ │  Ship  │         │    │
│   │  │ Module │ │ Module │ │ Module │ │ Module │         │    │
│   │  └────────┘ └────────┘ └────────┘ └────────┘         │    │
│   │                     │                                  │    │
│   │           ┌─────────▼─────────┐                       │    │
│   │           │  Shared Database  │                       │    │
│   │           └───────────────────┘                       │    │
│   └───────────────────────────────────────────────────────┘    │
│                                                                 │
│   + Simple to develop, test, deploy                            │
│   - Scales as a unit, hard to modify                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      Microservices                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│   │  User   │   │  Order  │   │ Payment │   │Shipping │       │
│   │ Service │   │ Service │   │ Service │   │ Service │       │
│   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘       │
│        │             │             │             │              │
│   ┌────▼────┐   ┌────▼────┐   ┌────▼────┐   ┌────▼────┐       │
│   │ User DB │   │Order DB │   │  Stripe │   │ Ship DB │       │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘       │
│                                                                 │
│   + Independent scaling, tech diversity, team autonomy         │
│   - Distributed system complexity, operational overhead        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to Use Microservices

Use Microservices

  • Large, complex applications
  • Multiple teams working independently
  • Components with different scaling needs
  • Different technology requirements
  • Frequent, independent deployments

Avoid Microservices

  • Small team (< 10 engineers)
  • Simple domain
  • Early-stage startup
  • Unclear domain boundaries
  • Team lacks distributed systems experience

Start with a Monolith: Most successful microservice architectures evolved from monoliths, once the domain boundaries had become clear over time. Don’t start with microservices unless you have a clear reason.

Service Communication

Synchronous vs Asynchronous

┌─────────────────────────────────────────────────────────────────┐
│               Communication Patterns                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Synchronous (Request-Response)                                 │
│  ─────────────────────────────                                  │
│                                                                 │
│  Order ─────► User ─────► Payment                              │
│  Service     Service     Service                                │
│     │           │           │                                   │
│     │◄──────────│◄──────────│                                   │
│                                                                 │
│  + Simple, immediate response                                   │
│  - Cascading failures, tight coupling                           │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  Asynchronous (Event-Driven)                                    │
│  ──────────────────────────                                     │
│                                                                 │
│  Order ─────► Message Queue ─────► Payment                     │
│  Service         │                Service                       │
│                  │                                              │
│                  └───────────────► Notification                │
│                                   Service                       │
│                                                                 │
│  + Decoupled, resilient, scalable                               │
│  - Complex, eventual consistency                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
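
The two styles in the diagram can be sketched in a few lines of asyncio. This is a minimal illustration, not a production design: the service functions are hypothetical in-process stand-ins for remote calls, and an `asyncio.Queue` per subscriber stands in for a real message broker.

```python
import asyncio

# Hypothetical in-process stand-ins for remote services.
async def check_user(user_id: str) -> bool:
    return True

async def charge(user_id: str, amount: float) -> str:
    return f"charged {user_id} {amount:.2f}"

async def place_order_sync(user_id: str, amount: float) -> str:
    """Request-response: the caller waits on every downstream reply,
    so a slow or failing dependency stalls or fails the whole request."""
    if not await check_user(user_id):
        raise ValueError("unknown user")
    return await charge(user_id, amount)

async def place_order_async(subscribers: list[asyncio.Queue], order: dict) -> None:
    """Event-driven: hand the event to each subscriber queue and return."""
    for queue in subscribers:
        await queue.put(order)

async def consumer(name: str, queue: asyncio.Queue, log: list) -> None:
    """Each consumer drains its own queue at its own pace."""
    while True:
        order = await queue.get()
        log.append(f"{name} handled order {order['order_id']}")
        queue.task_done()

async def demo() -> tuple[str, list]:
    sync_result = await place_order_sync("u1", 20.0)

    log: list = []
    payment_q, notify_q = asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(consumer("payment", payment_q, log)),
        asyncio.create_task(consumer("notification", notify_q, log)),
    ]
    await place_order_async([payment_q, notify_q], {"order_id": "o1"})
    await payment_q.join()   # wait until both consumers have processed the event
    await notify_q.join()
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sync_result, log
```

Running `asyncio.run(demo())` returns the synchronous result plus one log entry per consumer. In the asynchronous path the order service never learns whether payment succeeded, which is where eventual consistency comes from.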

API Gateway Pattern

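┌─────────────────────────────────────────────────────────────────┐
│                    API Gateway Pattern                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                        ┌─────────────┐                          │
│                        │ API Gateway │                          │
│                        └──────┬──────┘                          │
│                               │                                 │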
│         ┌─────────────────────┼─────────────────────┐          │
│         │                     │                     │           │
│    ┌────▼────┐          ┌─────▼─────┐         ┌────▼────┐     │
│    │  Users  │          │  Orders   │         │ Products │     │
│    │ Service │          │  Service  │         │ Service  │     │
│    └─────────┘          └───────────┘         └──────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Popular Options: Kong, AWS API Gateway, Nginx, Traefik
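
As a rough sketch of what a gateway does with each request, the code below routes by path prefix to one of the three services in the diagram. The routing table, the service URLs, and the API-key check are all assumptions made for illustration; edge authentication is a commonly cited gateway duty, but it is not taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: path prefix -> internal service base URL.
ROUTES = {
    "/users": "http://users-service:8080",
    "/orders": "http://orders-service:8080",
    "/products": "http://products-service:8080",
}

@dataclass
class Request:
    path: str
    api_key: Optional[str] = None

def resolve_upstream(path: str) -> Optional[str]:
    """Return the upstream URL for a path, trying the longest prefix first."""
    for prefix in sorted(ROUTES, key=len, reverse=True):
        # Match "/orders" and "/orders/42", but not "/ordersx".
        if path == prefix or path.startswith(prefix + "/"):
            return ROUTES[prefix] + path
    return None

def handle(request: Request, valid_keys: set[str]) -> tuple[int, str]:
    """Authenticate at the edge, then route.

    Returns (status code, upstream URL or error message)."""
    if request.api_key not in valid_keys:
        return 401, "unauthorized"
    upstream = resolve_upstream(request.path)
    if upstream is None:
        return 404, "no route"
    return 200, upstream
```

For example, `handle(Request("/orders/42", "k1"), {"k1"})` resolves to the orders service, while a request without a valid key is rejected before any backend is contacted.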

Backend for Frontend (BFF)

┌─────────────────────────────────────────────────────────────────┐
│                    BFF Pattern                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    ┌─────────┐         ┌─────────┐         ┌─────────┐        │
│    │   Web   │         │ Mobile  │         │   IoT   │        │
│    │  Client │         │   App   │         │ Device  │        │
│    └────┬────┘         └────┬────┘         └────┬────┘        │
│         │                   │                   │              │
│    ┌────▼────┐         ┌────▼────┐         ┌────▼────┐        │
│    │  Web    │         │ Mobile  │         │  IoT    │        │
│    │   BFF   │         │   BFF   │         │   BFF   │        │
│    │ (Rich)  │         │ (Lean)  │         │(Minimal)│        │
│    └────┬────┘         └────┬────┘         └────┬────┘        │
│         │                   │                   │              │
│         └───────────────────┼───────────────────┘              │
│                             │                                   │
│                    ┌────────▼────────┐                         │
│                    │    Services     │                         │
│                    └─────────────────┘                         │
│                                                                 │
│  Each BFF tailored for its client:                             │
│  - Web: Rich data, complex UIs                                 │
│  - Mobile: Optimized payloads, less bandwidth                  │
│  - IoT: Minimal data, low power                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
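
One way to picture the per-client tailoring is three small functions shaping the same backend record. The product record and its field names are hypothetical; a real BFF would also handle aggregation, auth, and caching.

```python
# Hypothetical product record as returned by a backend service.
PRODUCT = {
    "id": "p1",
    "name": "Desk Lamp",
    "description": "Adjustable LED desk lamp with USB charging port.",
    "price_cents": 3499,
    "images": ["lamp-front.jpg", "lamp-side.jpg", "lamp-box.jpg"],
    "in_stock": True,
}

def web_bff(product: dict) -> dict:
    """Web: keep the rich record and add a display-ready price."""
    return {**product, "price": f"${product['price_cents'] / 100:.2f}"}

def mobile_bff(product: dict) -> dict:
    """Mobile: smaller payload with a single image."""
    return {
        "id": product["id"],
        "name": product["name"],
        "price_cents": product["price_cents"],
        "thumbnail": product["images"][0] if product["images"] else None,
    }

def iot_bff(product: dict) -> dict:
    """IoT: only the fields a constrained device needs."""
    return {"id": product["id"], "in_stock": product["in_stock"]}
```

The backend service stays unaware of client differences; each BFF owns the shape of its own response and can change it without coordinating with the other clients.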

Service Discovery

Client-Side vs Server-Side

┌─────────────────────────────────────────────────────────────────┐
│                    Service Discovery                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client-Side Discovery                                          │
│  ────────────────────────                                       │
│                                                                 │
│    Client ───► Service Registry ───► "Order service at 1.2.3.4"│
│       │              │                                          │
│       └──────────────┴──────────► Order Service                │
│                                                                 │
│    Client decides which instance to call                       │
│    Examples: Netflix Eureka, Consul client                     │
│                                                                 │
│  ────────────────────────────────────────────────────────────  │
│                                                                 │
│  Server-Side Discovery                                          │
│  ─────────────────────                                          │
│                                                                 │
│    Client ───► Load Balancer ───► Service Registry             │
│                     │                    │                      │
│                     └────────────────────┘                      │
│                     │                                           │
│                     ▼                                           │
│              Order Service                                      │
│                                                                 │
│    Load balancer handles discovery                             │
│    Examples: AWS ALB, Kubernetes Services                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Service Discovery Implementation

import asyncio
import random
import time
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Callable, Any
from enum import Enum
from abc import ABC, abstractmethod
import aiohttp  # third-party HTTP client (pip install aiohttp)
import logging

logger = logging.getLogger(__name__)

class ServiceStatus(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DRAINING = "draining"

@dataclass
class ServiceInstance:
    service_name: str
    instance_id: str
    host: str
    port: int
    metadata: Dict[str, str] = field(default_factory=dict)
    status: ServiceStatus = ServiceStatus.HEALTHY
    last_heartbeat: float = field(default_factory=time.time)
    weight: int = 100
    
    @property
    def address(self) -> str:
        return f"{self.host}:{self.port}"

class ServiceRegistry:
    """In-memory service registry with health checking"""
    
    def __init__(self, heartbeat_timeout: float = 30.0):
        self.services: Dict[str, Dict[str, ServiceInstance]] = {}
        self.heartbeat_timeout = heartbeat_timeout
        self._cleanup_task = None
    
    def register(self, instance: ServiceInstance) -> None:
        """Register a service instance"""
        if instance.service_name not in self.services:
            self.services[instance.service_name] = {}
        
        self.services[instance.service_name][instance.instance_id] = instance
        logger.info(f"Registered {instance.service_name}/{instance.instance_id} at {instance.address}")
    
    def deregister(self, service_name: str, instance_id: str) -> None:
        """Deregister a service instance"""
        if service_name in self.services:
            if instance_id in self.services[service_name]:
                del self.services[service_name][instance_id]
                logger.info(f"Deregistered {service_name}/{instance_id}")
    
    def heartbeat(self, service_name: str, instance_id: str) -> bool:
        """Update heartbeat for an instance; returns False if it is unknown"""
        instance = self.services.get(service_name, {}).get(instance_id)
        if instance is None:
            return False
        instance.last_heartbeat = time.time()
        # Revive expired instances, but keep DRAINING ones out of rotation
        if instance.status == ServiceStatus.UNHEALTHY:
            instance.status = ServiceStatus.HEALTHY
        return True
    
    def get_instances(
        self, 
        service_name: str, 
        healthy_only: bool = True
    ) -> List[ServiceInstance]:
        """Get all instances of a service"""
        if service_name not in self.services:
            return []
        
        instances = list(self.services[service_name].values())
        
        if healthy_only:
            instances = [i for i in instances if i.status == ServiceStatus.HEALTHY]
        
        return instances
    
    async def start_cleanup(self) -> None:
        """Periodically flag instances with stale heartbeats (runs until cancelled)"""
        while True:
            await asyncio.sleep(10)
            self._cleanup_expired()
    
    def _cleanup_expired(self) -> None:
        """Mark instances whose heartbeat has expired as unhealthy (they are not removed)"""
        now = time.time()
        for service_name in list(self.services.keys()):
            for instance_id in list(self.services[service_name].keys()):
                instance = self.services[service_name][instance_id]
                if now - instance.last_heartbeat > self.heartbeat_timeout:
                    instance.status = ServiceStatus.UNHEALTHY
                    logger.warning(f"Instance {service_name}/{instance_id} marked unhealthy")


class LoadBalancer(ABC):
    """Abstract load balancer"""
    
    @abstractmethod
    def select(self, instances: List[ServiceInstance]) -> Optional[ServiceInstance]:
        pass

class RoundRobinLoadBalancer(LoadBalancer):
    def __init__(self):
        self.counters: Dict[str, int] = {}
    
    def select(self, instances: List[ServiceInstance]) -> Optional[ServiceInstance]:
        if not instances:
            return None
        
        service_name = instances[0].service_name
        if service_name not in self.counters:
            self.counters[service_name] = 0
        
        index = self.counters[service_name] % len(instances)
        self.counters[service_name] += 1
        
        return instances[index]

class WeightedRoundRobinLoadBalancer(LoadBalancer):
    def __init__(self):
        self.current_weights: Dict[str, Dict[str, int]] = {}
    
    def select(self, instances: List[ServiceInstance]) -> Optional[ServiceInstance]:
        if not instances:
            return None
        
        service_name = instances[0].service_name
        
        if service_name not in self.current_weights:
            self.current_weights[service_name] = {}
        
        weights = self.current_weights[service_name]
        
        # Initialize weights
        for instance in instances:
            if instance.instance_id not in weights:
                weights[instance.instance_id] = 0
        
        # Add original weights
        for instance in instances:
            weights[instance.instance_id] += instance.weight
        
        # Select instance with highest current weight
        max_weight = -1
        selected = None
        for instance in instances:
            if weights[instance.instance_id] > max_weight:
                max_weight = weights[instance.instance_id]
                selected = instance
        
        # Decrease selected instance's weight
        if selected:
            total_weight = sum(i.weight for i in instances)
            weights[selected.instance_id] -= total_weight
        
        return selected

class ConsistentHashLoadBalancer(LoadBalancer):
    """Consistent hashing for sticky sessions"""
    
    def __init__(self, replicas: int = 100):
        self.replicas = replicas
        self.ring: Dict[int, ServiceInstance] = {}
        self.sorted_keys: List[int] = []
    
    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def _build_ring(self, instances: List[ServiceInstance]) -> None:
        self.ring.clear()
        for instance in instances:
            for i in range(self.replicas):
                key = f"{instance.instance_id}:{i}"
                hash_val = self._hash(key)
                self.ring[hash_val] = instance
        self.sorted_keys = sorted(self.ring.keys())
    
    def select(
        self, 
        instances: List[ServiceInstance], 
        key: Optional[str] = None
    ) -> Optional[ServiceInstance]:
        if not instances:
            return None
        
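        # Rebuilt on every call for simplicity; cache the ring if instances rarely change
        self._build_ring(instances)
        
        # Without a routing key there is nothing to stick to, so pick a random ring position
        if not key:
            key = str(random.random())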
        
        hash_val = self._hash(key)
        
        # Find first node >= hash
        for ring_key in self.sorted_keys:
            if ring_key >= hash_val:
                return self.ring[ring_key]
        
        return self.ring[self.sorted_keys[0]]


class ServiceDiscoveryClient:
    """Client-side service discovery with load balancing"""
    
    def __init__(
        self,
        registry: ServiceRegistry,
        load_balancer: Optional[LoadBalancer] = None,
        cache_ttl: float = 30.0
    ):
        self.registry = registry
        self.load_balancer = load_balancer or RoundRobinLoadBalancer()
        self.cache_ttl = cache_ttl
        self.cache: Dict[str, tuple] = {}  # (instances, timestamp)
    
    def discover(self, service_name: str) -> Optional[ServiceInstance]:
        """Discover and select a service instance"""
        instances = self._get_instances(service_name)
        if not instances:
            raise ServiceNotFoundException(f"No instances found for {service_name}")
        
        return self.load_balancer.select(instances)
    
    def _get_instances(self, service_name: str) -> List[ServiceInstance]:
        """Get instances with caching"""
        if service_name in self.cache:
            instances, timestamp = self.cache[service_name]
            if time.time() - timestamp < self.cache_ttl:
                return instances
        
        instances = self.registry.get_instances(service_name)
        self.cache[service_name] = (instances, time.time())
        return instances
    
    async def call(
        self,
        service_name: str,
        path: str,
        method: str = "GET",
        **kwargs
    ) -> Any:
        """Make HTTP call to discovered service"""
        instance = self.discover(service_name)
        url = f"http://{instance.address}{path}"
        
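        # A new session per call keeps the example short; reuse one session in real code
        async with aiohttp.ClientSession() as session:
            async with session.request(method, url, **kwargs) as response:
                response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
                return await response.json()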

class ServiceNotFoundException(Exception):
    pass


# Self-registering service
class SelfRegisteringService:
    """Base class for services that self-register"""
    
    def __init__(
        self,
        service_name: str,
        host: str,
        port: int,
        registry: ServiceRegistry,
        heartbeat_interval: float = 10.0
    ):
        self.instance = ServiceInstance(
            service_name=service_name,
            instance_id=f"{host}:{port}:{random.randint(1000, 9999)}",
            host=host,
            port=port
        )
        self.registry = registry
        self.heartbeat_interval = heartbeat_interval
        self._heartbeat_task = None
    
    async def start(self) -> None:
        """Register and start heartbeat"""
        self.registry.register(self.instance)
        self._heartbeat_task = asyncio.create_task(self._send_heartbeats())
    
    async def stop(self) -> None:
        """Deregister and stop heartbeat"""
        if self._heartbeat_task:
            self._heartbeat_task.cancel()
        self.registry.deregister(
            self.instance.service_name,
            self.instance.instance_id
        )
    
    async def _send_heartbeats(self) -> None:
        while True:
            await asyncio.sleep(self.heartbeat_interval)
            self.registry.heartbeat(
                self.instance.service_name,
                self.instance.instance_id
            )


# Usage example
async def main():
    # Create registry
    registry = ServiceRegistry()
    
    # Register some services
    registry.register(ServiceInstance(
        service_name="order-service",
        instance_id="order-1",
        host="10.0.0.1",
        port=8080,
        weight=100
    ))
    registry.register(ServiceInstance(
        service_name="order-service",
        instance_id="order-2",
        host="10.0.0.2",
        port=8080,
        weight=50  # Less capacity
    ))
    
    # Create discovery client
    client = ServiceDiscoveryClient(
        registry,
        load_balancer=WeightedRoundRobinLoadBalancer()
    )
    
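    # Discover instances; order-1 (weight 100) is picked about twice as often as order-2 (weight 50)
    for _ in range(10):
        instance = client.discover("order-service")
        print("Selected:", instance.address)

if __name__ == "__main__":
    asyncio.run(main())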
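
To see how the weighted selection above interleaves instances, here is a compact standalone sketch of the same smooth weighted round-robin idea. It works on a plain name-to-weight mapping (a simplification of `ServiceInstance`), and the function name is made up for this example.

```python
def smooth_weighted_round_robin(weights: dict[str, int], n: int) -> list[str]:
    """Return the first n picks for the given static weights."""
    current = {name: 0 for name in weights}   # running "current weight" per instance
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # 1. Every instance gains its configured weight.
        for name, weight in weights.items():
            current[name] += weight
        # 2. The instance with the highest current weight wins (first one on ties).
        chosen = max(current, key=current.get)
        # 3. The winner pays back the total, which pushes it down the order.
        current[chosen] -= total
        picks.append(chosen)
    return picks

picks = smooth_weighted_round_robin({"order-1": 100, "order-2": 50}, 6)
print(picks)
# ['order-1', 'order-2', 'order-1', 'order-1', 'order-2', 'order-1']
```

The heavier instance receives two of every three requests, but its picks are spread out rather than sent back to back, which is the main reason to prefer this variant over simply repeating each instance `weight` times.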

Service Registration

┌─────────────────────────────────────────────────────────────────┐
│                   Service Registration                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Self-Registration:                                             │
│  ┌─────────────┐                                               │
│  │   Service   │──► Register on startup                        │
│  │             │──► Send heartbeats                            │
│  │             │──► Deregister on shutdown                     │
│  └─────────────┘                                               │
│                                                                 │
│  Third-Party Registration:                                      │
│  ┌─────────────┐    ┌───────────────┐                          │
│  │   Service   │◄───│  Registrar    │  (Sidecar/Agent)        │
│  │             │    │  (monitors)   │                          │
│  └─────────────┘    └───────────────┘                          │
│                            │                                    │
│                            ▼                                    │
│                    Service Registry                             │
│                                                                 │
│  Popular Registries: Consul, etcd, ZooKeeper, Kubernetes DNS   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
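
The self-registration lifecycle (register on startup, heartbeat while running, deregister on shutdown) maps naturally onto an async context manager. The sketch below assumes a deliberately tiny, hypothetical registry that only tracks heartbeat timestamps; it is not the API of Consul, etcd, or any other real registry.

```python
import asyncio
import contextlib
import time

class TinyRegistry:
    """Hypothetical minimal registry: instance_id -> last heartbeat timestamp."""

    def __init__(self) -> None:
        self.instances: dict[str, float] = {}
        self.heartbeats = 0

    def register(self, instance_id: str) -> None:
        self.instances[instance_id] = time.monotonic()

    def heartbeat(self, instance_id: str) -> None:
        if instance_id in self.instances:
            self.instances[instance_id] = time.monotonic()
            self.heartbeats += 1

    def deregister(self, instance_id: str) -> None:
        self.instances.pop(instance_id, None)

@contextlib.asynccontextmanager
async def registered(registry: TinyRegistry, instance_id: str, interval: float):
    """Register on entry, heartbeat in the background, deregister on exit."""
    async def beat() -> None:
        while True:
            await asyncio.sleep(interval)
            registry.heartbeat(instance_id)

    registry.register(instance_id)
    task = asyncio.create_task(beat())
    try:
        yield
    finally:
        # Stop heartbeating first, then remove the instance from the registry.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        registry.deregister(instance_id)

async def demo() -> tuple[bool, int, bool]:
    registry = TinyRegistry()
    async with registered(registry, "order-1", interval=0.01):
        present_while_running = "order-1" in registry.instances
        await asyncio.sleep(0.05)  # simulate serving traffic
    return present_while_running, registry.heartbeats, "order-1" in registry.instances
```

Because deregistration lives in a `finally` block, it also runs when the service body raises. A crash that kills the process still skips it, which is why registries pair deregistration with heartbeat timeouts.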

Data Management

Database per Service

┌─────────────────────────────────────────────────────────────────┐
│              Database per Service Pattern                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐              │
│  │   User    │    │   Order   │    │  Product  │              │
│  │  Service  │    │  Service  │    │  Service  │              │
│  └─────┬─────┘    └─────┬─────┘    └─────┬─────┘              │
│        │                │                │                      │
│  ┌─────▼─────┐    ┌─────▼─────┐    ┌─────▼─────┐              │
│  │ PostgreSQL│    │  MongoDB  │    │Elasticsearch│             │
│  │  (Users)  │    │ (Orders)  │    │ (Products) │              │
│  └───────────┘    └───────────┘    └───────────┘              │
│                                                                 │
│  + Independent scaling                                          │
│  + Tech flexibility (right DB for the job)                      │
│  + Loose coupling                                               │
│                                                                 │
│  - Cross-service queries are hard                               │
│  - Data consistency challenges                                  │
│  - More infrastructure to manage                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
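
A common workaround for the "cross-service queries are hard" drawback is to join data in application code, often called API composition. The sketch below uses in-memory dictionaries as stand-ins for two services' private databases; the function names and data are hypothetical.

```python
# Hypothetical data owned by two separate services.
USERS_DB = {"u1": {"name": "Ada"}, "u2": {"name": "Lin"}}
ORDERS_DB = [
    {"order_id": "o1", "user_id": "u1", "total": 40.0},
    {"order_id": "o2", "user_id": "u2", "total": 15.5},
    {"order_id": "o3", "user_id": "u1", "total": 9.5},
]

def user_service_get_many(user_ids: set[str]) -> dict[str, dict]:
    """Stand-in for one batched call to the user service."""
    return {uid: USERS_DB[uid] for uid in user_ids if uid in USERS_DB}

def order_service_list() -> list[dict]:
    """Stand-in for a call to the order service."""
    return list(ORDERS_DB)

def orders_with_user_names() -> list[dict]:
    """Join orders with user names without touching either database directly."""
    orders = order_service_list()
    # Batch the lookup: one call for all distinct users instead of one per order.
    users = user_service_get_many({o["user_id"] for o in orders})
    return [
        {**o, "user_name": users.get(o["user_id"], {}).get("name")}
        for o in orders
    ]
```

What would be a single SQL join in a monolith becomes two network calls plus an in-memory merge, and the two reads are not taken from one consistent snapshot.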

Saga Pattern for Distributed Transactions

┌─────────────────────────────────────────────────────────────────┐
│                    Saga Pattern                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Example: E-commerce Order Flow                                 │
│                                                                 │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│  │ Create  │──►│ Reserve │──►│ Charge  │──►│  Ship   │       │
│  │  Order  │   │Inventory│   │ Payment │   │  Order  │       │
│  └────┬────┘   └────┬────┘   └────┬────┘   └─────────┘       │
│       │             │             │                            │
│  Compensating Actions (on failure):                            │
│       │             │             │                            │
│  ┌────▼────┐   ┌────▼────┐   ┌────▼────┐                      │
│  │ Cancel  │◄──│ Release │◄──│  Refund │  ← If shipping fails │
│  │  Order  │   │Inventory│   │ Payment │                       │
│  └─────────┘   └─────────┘   └─────────┘                       │
│                                                                 │
│  Choreography: Services publish/consume events                  │
│  Orchestration: Central saga coordinator                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
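
For contrast with a central coordinator, a choreographed saga can be sketched as services reacting to each other's events. The synchronous in-process `EventBus` and the event names are assumptions for illustration; a real system would use a message broker and durable events.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Hypothetical synchronous in-process bus standing in for a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []   # every published event, in order

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.log.append(event)
        for handler in self.handlers[event]:
            handler(payload)

def wire_services(bus: EventBus, payment_ok: bool) -> None:
    # Inventory service reacts to new orders.
    bus.subscribe("OrderCreated", lambda p: bus.publish("InventoryReserved", p))
    # Payment service reacts to reserved inventory and reports the outcome.
    bus.subscribe(
        "InventoryReserved",
        lambda p: bus.publish("PaymentCharged" if payment_ok else "PaymentFailed", p),
    )
    # Shipping service reacts to a successful payment.
    bus.subscribe("PaymentCharged", lambda p: bus.publish("OrderShipped", p))
    # Compensation path: inventory releases stock, then the order is cancelled.
    bus.subscribe("PaymentFailed", lambda p: bus.publish("InventoryReleased", p))
    bus.subscribe("InventoryReleased", lambda p: bus.publish("OrderCancelled", p))

def run(payment_ok: bool) -> list[str]:
    bus = EventBus()
    wire_services(bus, payment_ok)
    bus.publish("OrderCreated", {"order_id": "o1"})
    return bus.log
```

`run(True)` ends with `OrderShipped`, while `run(False)` ends with `OrderCancelled` after the stock is released. No component holds the whole flow, which keeps services decoupled but makes the overall process harder to trace than with an orchestrator.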

Saga Pattern Implementation

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional, Callable
from enum import Enum
from datetime import datetime
import uuid
import logging

logger = logging.getLogger(__name__)

class SagaStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    FAILED = "failed"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensation: Callable
    timeout: float = 30.0
    retries: int = 3

@dataclass
class StepResult:
    step_name: str
    success: bool
    data: Any = None
    error: Optional[str] = None

@dataclass
class Saga:
    saga_id: str
    name: str
    steps: List[SagaStep]
    status: SagaStatus = SagaStatus.PENDING
    current_step: int = 0
    context: Dict[str, Any] = field(default_factory=dict)
    results: List[StepResult] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    completed_at: Optional[datetime] = None

class SagaOrchestrator:
    """
    Central saga orchestrator for distributed transactions.
    Handles step execution, compensation, and state management.
    """
    
    def __init__(self, store: Optional['SagaStore'] = None):
        self.store = store or InMemorySagaStore()
    
    async def execute(
        self,
        name: str,
        steps: List[SagaStep],
        context: Optional[Dict[str, Any]] = None
    ) -> Saga:
        """Execute a saga"""
        saga = Saga(
            saga_id=str(uuid.uuid4()),
            name=name,
            steps=steps,
            context=context or {}
        )
        
        await self.store.save(saga)
        
        try:
            saga.status = SagaStatus.RUNNING
            await self._execute_steps(saga)
            saga.status = SagaStatus.COMPLETED
            
        except SagaStepFailed as e:
            logger.error(f"Saga {saga.saga_id} step failed: {e}")
            saga.status = SagaStatus.COMPENSATING
            await self._compensate(saga)
            saga.status = SagaStatus.COMPENSATED
            
        except Exception as e:
            logger.error(f"Saga {saga.saga_id} failed unexpectedly: {e}")
            saga.status = SagaStatus.FAILED
        
        finally:
            saga.completed_at = datetime.now()
            await self.store.save(saga)
        
        return saga
    
    async def _execute_steps(self, saga: Saga) -> None:
        """Execute saga steps in order"""
        for i, step in enumerate(saga.steps):
            saga.current_step = i
            await self.store.save(saga)
            
            result = await self._execute_step_with_retry(saga, step)
            saga.results.append(result)
            
            if not result.success:
                raise SagaStepFailed(step.name, result.error)
            
            # Store step result in context for next steps
            saga.context[f"{step.name}_result"] = result.data
    
    async def _execute_step_with_retry(
        self, 
        saga: Saga, 
        step: SagaStep
    ) -> StepResult:
        """Execute a step with retries"""
        last_error = None
        
        for attempt in range(step.retries):
            try:
                # asyncio.timeout() requires Python 3.11+; use asyncio.wait_for() on older versions
                async with asyncio.timeout(step.timeout):
                    result = await step.action(saga.context)
                    return StepResult(
                        step_name=step.name,
                        success=True,
                        data=result
                    )
                    
            except asyncio.TimeoutError:
                last_error = f"Step timed out after {step.timeout}s"
                
            except Exception as e:
                last_error = str(e)
                
            if attempt < step.retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
        
        return StepResult(
            step_name=step.name,
            success=False,
            error=last_error
        )
    
    async def _compensate(self, saga: Saga) -> None:
        """Execute compensating actions in reverse order"""
        # Only compensate completed steps
        completed_steps = saga.results[:saga.current_step]
        
        for i in range(len(completed_steps) - 1, -1, -1):
            step = saga.steps[i]
            
            try:
                logger.info(f"Compensating step: {step.name}")
                async with asyncio.timeout(step.timeout):
                    await step.compensation(saga.context)
                    
            except Exception as e:
                logger.error(f"Compensation failed for {step.name}: {e}")
                # Continue compensating other steps


class SagaStepFailed(Exception):
    def __init__(self, step_name: str, error: str):
        self.step_name = step_name
        self.error = error
        super().__init__(f"Step '{step_name}' failed: {error}")


class SagaStore(ABC):
    @abstractmethod
    async def save(self, saga: Saga) -> None:
        pass
    
    @abstractmethod
    async def get(self, saga_id: str) -> Optional[Saga]:
        pass

class InMemorySagaStore(SagaStore):
    def __init__(self):
        self.sagas: Dict[str, Saga] = {}
    
    async def save(self, saga: Saga) -> None:
        self.sagas[saga.saga_id] = saga
    
    async def get(self, saga_id: str) -> Optional[Saga]:
        return self.sagas.get(saga_id)


# Example: Order processing saga
async def create_order_saga(order_data: dict) -> Saga:
    orchestrator = SagaOrchestrator()
    
    # Define saga steps
    steps = [
        SagaStep(
            name="create_order",
            action=lambda ctx: create_order(ctx["order_data"]),
            compensation=lambda ctx: cancel_order(ctx["create_order_result"]["order_id"]),
            timeout=10.0
        ),
        SagaStep(
            name="reserve_inventory",
            action=lambda ctx: reserve_inventory(
                ctx["create_order_result"]["order_id"],
                ctx["order_data"]["items"]
            ),
            compensation=lambda ctx: release_inventory(
                ctx["create_order_result"]["order_id"]
            ),
            timeout=15.0
        ),
        SagaStep(
            name="charge_payment",
            action=lambda ctx: charge_payment(
                ctx["create_order_result"]["order_id"],
                ctx["order_data"]["amount"]
            ),
            compensation=lambda ctx: refund_payment(
                ctx["charge_payment_result"]["transaction_id"]
            ),
            timeout=30.0,
            retries=3
        ),
        SagaStep(
            name="ship_order",
            action=lambda ctx: initiate_shipping(
                ctx["create_order_result"]["order_id"]
            ),
            compensation=lambda ctx: cancel_shipping(
                ctx["ship_order_result"]["shipment_id"]
            ),
            timeout=20.0
        )
    ]
    
    return await orchestrator.execute(
        name="order_processing",
        steps=steps,
        context={"order_data": order_data}
    )


# Service functions (would be actual service calls)
async def create_order(order_data: dict) -> dict:
    return {"order_id": str(uuid.uuid4()), "status": "created"}

async def cancel_order(order_id: str) -> None:
    logger.info(f"Cancelling order {order_id}")

async def reserve_inventory(order_id: str, items: list) -> dict:
    return {"reservation_id": str(uuid.uuid4())}

async def release_inventory(order_id: str) -> None:
    logger.info(f"Releasing inventory for order {order_id}")

async def charge_payment(order_id: str, amount: float) -> dict:
    return {"transaction_id": str(uuid.uuid4()), "amount": amount}

async def refund_payment(transaction_id: str) -> None:
    logger.info(f"Refunding transaction {transaction_id}")

async def initiate_shipping(order_id: str) -> dict:
    return {"shipment_id": str(uuid.uuid4())}

async def cancel_shipping(shipment_id: str) -> None:
    logger.info(f"Cancelling shipment {shipment_id}")


# Usage
async def main():
    result = await create_order_saga({
        "items": [{"product_id": "123", "quantity": 2}],
        "amount": 99.99,
        "customer_id": "cust_123"
    })
    
    print(f"Saga status: {result.status}")
    print(f"Results: {result.results}")


if __name__ == "__main__":
    asyncio.run(main())

CQRS (Command Query Responsibility Segregation)

[Figure: CQRS Pattern]

Event Sourcing

Store all changes to application state as a sequence of events.

[Figure: Event Sourcing]
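
As a rough illustration of storing state as a sequence of events, the sketch below rebuilds an account balance by replaying an append-only log. The `Account` aggregate, the event names (`Deposited`, `Withdrawn`), and the in-memory list standing in for an event store are all assumptions made for this example, not part of any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "Deposited" or "Withdrawn"
    amount: int  # amount in cents


class Account:
    """State is derived only by applying events in order."""

    def __init__(self) -> None:
        self.balance = 0

    def apply(self, event: Event) -> None:
        if event.kind == "Deposited":
            self.balance += event.amount
        elif event.kind == "Withdrawn":
            self.balance -= event.amount
        else:
            raise ValueError(f"Unknown event kind: {event.kind}")


def replay(events: List[Event]) -> Account:
    """Rebuild the current state from the event history."""
    account = Account()
    for event in events:
        account.apply(event)
    return account


# The append-only log is the source of truth (an in-memory stand-in here).
event_log = [
    Event("Deposited", 10_000),
    Event("Withdrawn", 2_500),
    Event("Deposited", 500),
]

current = replay(event_log)
as_of_second_event = replay(event_log[:2])  # temporal query: state after 2 events
```

Because the log is never mutated, replaying a prefix of it answers "what was the state at that point?", which is where the audit trail and temporal queries come from.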

Resilience Patterns

Circuit Breaker

┌─────────────────────────────────────────────────────────────────┐
│                    Circuit Breaker States                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│      ┌─────────────────────────────────────────────────┐       │
│      │                                                 │        │
│      ▼                                                 │        │
│  ┌────────┐    Failure    ┌────────┐    Timeout   ┌─────────┐ │
│  │ CLOSED │───threshold──►│  OPEN  │─────────────►│HALF-OPEN│ │
│  │        │               │        │              │         │ │
│  └────────┘               └────────┘              └────┬────┘ │
│      ▲                                                 │       │
│      │                    Success                      │       │
│      └─────────────────────────────────────────────────┘       │
│                                                                 │
│  CLOSED:    Requests pass through, count failures             │
│  OPEN:      Requests fail immediately, don't call service     │
│  HALF-OPEN: Allow limited requests to test if service is back │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
# Python implementation using the circuitbreaker library
import requests
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id):
    response = requests.post(
        "https://payment-service/charge",
        json={"order_id": order_id},
        timeout=5,  # requests has no default timeout; always set one
    )
    response.raise_for_status()
    return response.json()

# Usage with fallback
def charge_with_fallback(order_id):
    try:
        return call_payment_service(order_id)
    except CircuitBreakerError:
        # Circuit is open. Fallback: queue for later processing.
        # queue_for_retry is an application-specific helper (not shown here).
        queue_for_retry(order_id)
        return {"status": "pending", "message": "Payment will be processed shortly"}
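
To make the state transitions in the diagram concrete, here is a minimal hand-rolled sketch of the three states. It is not how the `circuitbreaker` library is implemented; the class name `SimpleCircuitBreaker`, its parameters, and the injectable `clock` are assumptions for illustration. It counts consecutive failures and is not thread-safe.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow one probe

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise

        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Passing the clock in as a parameter keeps the recovery timeout testable without real sleeps.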

Bulkhead Pattern

┌─────────────────────────────────────────────────────────────────┐
│                    Bulkhead Pattern                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Without Bulkhead:                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                  Shared Thread Pool (100)               │   │
│  │                                                         │   │
│  │   Product (fast)  Order (slow)  Payment (medium)       │   │
│  │                                                         │   │
│  │   Slow Order service exhausts all threads              │   │
│  │   → Product & Payment also fail                        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  With Bulkhead:                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │  Product    │  │   Order     │  │  Payment    │            │
│  │  Pool (30)  │  │  Pool (40)  │  │  Pool (30)  │            │
│  │             │  │ ██████████  │  │             │            │
│  │  Still      │  │ (exhausted) │  │  Still      │            │
│  │  working    │  │             │  │  working    │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
│                                                                 │
│  Failure is isolated to Order service only                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
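
One way to approximate the isolated pools in the diagram in async Python is a semaphore per dependency. The sketch below is an assumption-laden illustration: the `Bulkhead` class, the `BulkheadFullError` exception, and the pool names and sizes are hypothetical, and it rejects calls when a pool is full rather than queueing them.

```python
import asyncio
from typing import Awaitable, Callable, Dict, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slots."""


class Bulkhead:
    """Caps concurrent calls per dependency so one slow service cannot starve the rest."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {name: asyncio.Semaphore(n) for name, n in limits.items()}

    async def run(self, name: str, call: Callable[[], Awaitable[T]]) -> T:
        slot = self._slots[name]
        if slot.locked():
            # Reject immediately instead of queueing behind a slow dependency.
            raise BulkheadFullError(f"{name} pool exhausted")
        async with slot:
            return await call()


async def demo() -> Dict[str, str]:
    # Tiny "order" pool so the example exhausts it with a single slow call.
    bulkhead = Bulkhead({"product": 30, "order": 1, "payment": 30})
    results: Dict[str, str] = {}

    async def slow_order() -> str:
        await asyncio.sleep(0.05)
        return "order ok"

    async def fast_product() -> str:
        return "product ok"

    busy = asyncio.create_task(bulkhead.run("order", slow_order))
    await asyncio.sleep(0)  # let the slow call start and take the only slot

    try:
        await bulkhead.run("order", slow_order)
    except BulkheadFullError:
        results["order"] = "rejected"

    # The product pool is unaffected by the exhausted order pool.
    results["product"] = await bulkhead.run("product", fast_product)
    await busy
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

For blocking (non-async) clients, the same idea is usually expressed with one bounded thread pool per dependency.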

Retry with Backoff

import time
import random
from functools import wraps

import requests

def retry_with_exponential_backoff(
    max_retries=3,
    base_delay=1,
    max_delay=60,
    exponential_base=2,
    jitter=True
):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    retries += 1
                    if retries > max_retries:
                        raise
                    
                    delay = min(
                        base_delay * (exponential_base ** (retries - 1)),
                        max_delay
                    )
                    
                    if jitter:
                        delay = delay * (0.5 + random.random())
                    
                    print(f"Retry {retries}/{max_retries} in {delay:.2f}s")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_exponential_backoff(max_retries=5)
def call_external_service():
    response = requests.get("https://api.example.com/data")
    response.raise_for_status()
    return response.json()
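
To see what the formula in the decorator produces, the helper below (a hypothetical `backoff_delays` function written for this illustration) lists the delay before each retry with jitter turned off.

```python
def backoff_delays(max_retries=5, base_delay=1, max_delay=60, exponential_base=2):
    """Delay before each retry, using the same formula as the decorator (no jitter)."""
    return [
        min(base_delay * (exponential_base ** (attempt - 1)), max_delay)
        for attempt in range(1, max_retries + 1)
    ]


print(backoff_delays())               # [1, 2, 4, 8, 16]
print(backoff_delays(max_retries=8))  # [1, 2, 4, 8, 16, 32, 60, 60]
```

With `jitter=True`, the decorator multiplies each of these values by a random factor in [0.5, 1.5), so an individual sleep can exceed `max_delay` by up to 50%. If a hard cap matters, apply `min(..., max_delay)` after the jitter step instead.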

Service Mesh

What is a Service Mesh?

┌─────────────────────────────────────────────────────────────────┐
│                    Service Mesh Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                    ┌─────────────────────┐                     │
│                    │   Control Plane     │                     │
│                    │  (Istio, Linkerd)   │                     │
│                    │  - Config           │                     │
│                    │  - Policies         │                     │
│                    │  - Certificates     │                     │
│                    └──────────┬──────────┘                     │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐          │
│         │                     │                     │           │
│   ┌─────▼─────┐         ┌─────▼─────┐         ┌─────▼─────┐   │
│   │┌─────────┐│         │┌─────────┐│         │┌─────────┐│   │
│   ││ Sidecar ││◄───────►││ Sidecar ││◄───────►││ Sidecar ││   │
│   ││ (Envoy) ││         ││ (Envoy) ││         ││ (Envoy) ││   │
│   │└────┬────┘│         │└────┬────┘│         │└────┬────┘│   │
│   │     │     │         │     │     │         │     │     │   │
│   │┌────▼────┐│         │┌────▼────┐│         │┌────▼────┐│   │
│   ││ Service ││         ││ Service ││         ││ Service ││   │
│   ││    A    ││         ││    B    ││         ││    C    ││   │
│   │└─────────┘│         │└─────────┘│         │└─────────┘│   │
│   └───────────┘         └───────────┘         └───────────┘   │
│        Pod                   Pod                   Pod         │
│                                                                 │
│  Data Plane: Sidecar proxies handle all traffic               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Service Mesh Features

Traffic Management

  • Load balancing
  • A/B testing
  • Canary deployments
  • Traffic splitting
  • Retries & timeouts

Security

  • mTLS encryption
  • Service-to-service auth
  • Access policies
  • Certificate management

Observability

  • Distributed tracing
  • Metrics collection
  • Access logs
  • Service topology

Resilience

  • Circuit breaking
  • Rate limiting
  • Fault injection
  • Health checks

Observability

The Three Pillars

┌─────────────────────────────────────────────────────────────────┐
│                    Observability Pillars                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │   LOGS      │    │   METRICS   │    │   TRACES    │        │
│  │             │    │             │    │             │        │
│  │ What        │    │ How much    │    │ Where       │        │
│  │ happened?   │    │ & how fast? │    │ did it go?  │        │
│  │             │    │             │    │             │        │
│  │ • Events    │    │ • Counters  │    │ • Spans     │        │
│  │ • Errors    │    │ • Gauges    │    │ • Context   │        │
│  │ • Debug     │    │ • Histograms│    │ • Latency   │        │
│  │             │    │             │    │             │        │
│  │ ELK Stack   │    │ Prometheus  │    │ Jaeger      │        │
│  │ Splunk      │    │ Grafana     │    │ Zipkin      │        │
│  │ DataDog     │    │ DataDog     │    │ DataDog     │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Distributed Tracing

┌─────────────────────────────────────────────────────────────────┐
│                    Distributed Trace Example                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Request: GET /orders/123                                       │
│                                                                 │
│  Trace ID: abc-123                                              │
│  ├── API Gateway (50ms)                                        │
│  │   └── Order Service (200ms)                                 │
│  │       ├── User Service (30ms)                               │
│  │       │   └── User DB (15ms)                                │
│  │       ├── Product Service (80ms)                            │
│  │       │   └── Product DB (40ms)                             │
│  │       └── Order DB (50ms)                                   │
│  │                                                              │
│  Total: 250ms                                                   │
│                                                                 │
│  Waterfall View:                                                │
│  ─────────────────────────────────────────────────────────────  │
│  API Gateway   [██████]                                         │
│  Order Service      [████████████████████████████████████████] │
│  User Service           [████]                                  │
│  User DB                  [██]                                  │
│  Product Service              [████████████]                    │
│  Product DB                       [████████]                    │
│  Order DB                                    [██████████]      │
│                                                                 │
│  0ms        50ms       100ms      150ms      200ms      250ms  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Deployment Strategies

Common Strategies

┌─────────────────────────────────────────────────────────────────┐
│                    Deployment Strategies                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Blue-Green Deployment                                       │
│     ┌────────────┐     ┌────────────┐                          │
│     │   Blue     │     │   Green    │                          │
│     │   (v1)     │     │   (v2)     │                          │
│     │  Current   │ ──► │   New      │                          │
│     └────────────┘     └────────────┘                          │
│     Switch traffic instantly, easy rollback                    │
│                                                                 │
│  2. Canary Deployment                                           │
│     ┌────────────┐     ┌────────────┐                          │
│     │    v1      │ ─── │    v2      │                          │
│     │   95%      │     │    5%      │                          │
│     └────────────┘     └────────────┘                          │
│     Gradually increase v2 traffic, monitor for issues          │
│                                                                 │
│  3. Rolling Update                                              │
│     [v1] [v1] [v1] [v1]                                        │
│     [v2] [v1] [v1] [v1]  → Update one at a time               │
│     [v2] [v2] [v1] [v1]                                        │
│     [v2] [v2] [v2] [v1]                                        │
│     [v2] [v2] [v2] [v2]                                        │
│                                                                 │
│  4. A/B Testing                                                 │
│     Users A-M  → v1 (control)                                  │
│     Users N-Z  → v2 (experiment)                               │
│     Measure business metrics, not just technical               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
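
The percentage split used by canary deployments and A/B tests is often implemented with stable hashing, so the same user always lands on the same version. The sketch below shows one such approach; `bucket_for` and `choose_version` are hypothetical names, and in practice this logic typically lives in a load balancer, service mesh, or feature-flag system rather than application code.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) using a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent% of users to v2; the rest stay on v1."""
    return "v2" if bucket_for(user_id) < canary_percent else "v1"
```

Because a user's bucket never changes, raising `canary_percent` from 5 to 25 only moves additional users to v2; nobody who already saw v2 is sent back to v1.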

Key Takeaways

| Pattern | When to Use |
|---|---|
| API Gateway | Single entry point, cross-cutting concerns |
| Service Discovery | Dynamic service locations |
| Circuit Breaker | Prevent cascading failures |
| Saga | Distributed transactions |
| CQRS | Different read/write patterns |
| Event Sourcing | Audit trail, temporal queries |
| Service Mesh | Complex microservices, need observability |

Interview Tip: When discussing microservices, always mention trade-offs. “Microservices add operational complexity, distributed system challenges, and require mature DevOps practices. The benefits (independent scaling, team autonomy) should outweigh these costs.”

Interview Deep-Dive Questions

What the interviewer is really testing: Whether you understand that architecture decisions are contextual, not ideological, and whether you can articulate the real costs of premature decomposition.

Strong Answer:
  • I would push back on this proposal in most cases. Starting with microservices at a 50-person startup is almost always premature optimization. The reason is not that microservices are bad — it is that the costs are front-loaded while the benefits only materialize at a scale and organizational complexity the startup has not reached yet.
  • The concrete costs of starting with microservices on day one: (1) You need service discovery, an API gateway, distributed tracing, centralized logging, and a container orchestration platform before you can ship your first feature. That is 2-4 weeks of infrastructure work that delivers zero customer value. (2) Every feature that spans services requires coordinated deployments, API contracts, and cross-service debugging. At a startup, almost every feature spans what would be service boundaries. (3) You do not yet know where the domain boundaries are. The biggest risk is drawing service boundaries wrong and spending months refactoring when the business model pivots.
  • The right approach: start with a well-structured monolith. Use clear module boundaries internally (separate packages or modules for user management, orders, payments). Enforce these boundaries with code review and linting rules — no direct database access across module boundaries, communication through well-defined interfaces. This gives you 80% of the organizational benefit (clear ownership, clean interfaces) with none of the distributed systems overhead.
  • When to extract: extract a service when you have a concrete, measurable reason. Examples: the payment module needs to scale independently because it has different resource requirements; the recommendation engine team wants to deploy on a different cadence; a specific module needs to be in a different language for performance reasons. Each extraction should be driven by a pain point, not a prediction.
  • The exception: if the startup is building a platform product (like an API or marketplace) where different components have fundamentally different scaling characteristics from day one, a small number of services (2-3, not 15) might be justified.
  • Example: Shopify ran as a monolith (the “Shopify monolith” is famous) serving millions of merchants for years before selectively extracting services. They explicitly chose to invest in making the monolith modular rather than splitting prematurely. When they did extract services, they had years of production data telling them exactly where the boundaries should be.
Follow-up: The senior engineer counters that the monolith will become a “big ball of mud” and will be impossible to decompose later. How do you address this concern?

The “big ball of mud” is a code quality problem, not an architecture problem. If the team cannot enforce clean module boundaries in a monolith, they will not enforce clean API contracts between microservices either; they will just create a distributed big ball of mud, which is strictly worse because now you have the same coupling plus network latency and partial failure modes. The solution is discipline: enforce module boundaries with automated tooling (dependency checks, architecture fitness functions), use internal interfaces as if they were API contracts, and write integration tests at module boundaries. If you can maintain discipline in a monolith, you can extract services cleanly when the time comes. If you cannot, microservices will not save you.

What the interviewer is really testing: Whether you understand resilience patterns beyond just naming “circuit breaker”: specifically, how to degrade gracefully when a dependency fails and what data you can live without.

Strong Answer:
  • The first step is to categorize the dependency: is the User Service data critical or supplementary for order processing? The user’s shipping address is critical — you cannot fulfill an order without it. The user’s display name for the order confirmation email is supplementary — you can send the email later.
  • For critical data (shipping address): implement a local cache in the Order Service. When the User Service is healthy, cache user profile data with a reasonable TTL (5-15 minutes for addresses). When the User Service returns a 500, fall back to the cached data. If the cache is also empty (new user, cache eviction), then the order must fail — but fail fast with a clear error message, not after a 30-second timeout chain.
  • For supplementary data (display name, preferences): use a circuit breaker. After 5 consecutive failures to the User Service, open the circuit. While the circuit is open, skip the User Service call entirely and use default values. The order proceeds without the supplementary data. A background job re-enriches the order when the User Service recovers.
  • The circuit breaker configuration matters: failure threshold of 5, timeout of 30 seconds in the open state, then allow one probe request in half-open state. If the probe succeeds, close the circuit. If it fails, reset the timeout. Add jitter to the timeout so all Order Service instances do not probe the User Service simultaneously when it is recovering.
  • For the Inventory Service, the pattern is different because inventory checks have side effects (reservation). Use the Saga pattern: create the order optimistically, then attempt inventory reservation as a separate step. If inventory reservation fails, compensate by canceling the order. This decouples order creation from inventory availability at the cost of occasionally needing to cancel.
  • Timeout configuration: set aggressive timeouts (200-500ms) for supplementary calls and longer ones (2-5s) for critical calls. Never rely on the default HTTP client timeout (often 30-60 seconds, and in some clients no timeout at all); a slow dependency will then tie up request threads and cascade into thread pool exhaustion in the Order Service.
  • Example: Netflix’s Hystrix (now in maintenance mode, with the same patterns available in Resilience4j) popularized pairing the bulkhead pattern with circuit breakers. Each downstream dependency gets its own thread pool. If the User Service thread pool is exhausted, Order Service can still call Inventory Service because those threads are isolated. This prevents a single bad dependency from starving all other calls.
Follow-up: The User Service recovers, but your circuit breaker is still open for another 20 seconds. During that window, you are processing orders with cached or default user data. When the circuit closes, do you need to backfill any of those orders? How?

Yes, you need a reconciliation process. Log every order that was processed with degraded data (add a flag like user_data_degraded: true to the order record). When the circuit closes, a background consumer picks up these flagged orders, fetches fresh user data from the User Service, and updates any fields that were stale or defaulted. For critical fields like shipping address, if the cached version differs from the fresh version, flag the order for manual review before shipping. This is cheaper than blocking all orders during the outage.

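
The cache fallback for critical data and the degraded-data flag described above can be sketched together. This is a minimal illustration under stated assumptions: `UserProfileCache`, `UserServiceUnavailable`, and the `fetch_user` callable are hypothetical names, the cache is a plain in-process dict, and stale entries are deliberately kept past their TTL so they remain available as a fallback.

```python
import time
from typing import Callable, Dict, Tuple


class UserServiceUnavailable(Exception):
    """Hypothetical error raised by the User Service client on a 5xx or timeout."""


class UserProfileCache:
    def __init__(
        self,
        fetch_user: Callable[[str], dict],
        ttl_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_user = fetch_user
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, user_id: str) -> Tuple[dict, bool]:
        """Return (profile, degraded); degraded=True means stale cached data was used."""
        cached = self._entries.get(user_id)
        if cached is not None and self._clock() - cached[0] < self._ttl:
            return cached[1], False  # fresh cache hit
        try:
            profile = self._fetch_user(user_id)
        except UserServiceUnavailable:
            if cached is None:
                raise  # nothing to fall back to: fail fast
            return cached[1], True  # serve stale data; caller flags the order
        self._entries[user_id] = (self._clock(), profile)
        return profile, False
```

The caller would store the returned `degraded` value on the order (for example as `user_data_degraded`) so the reconciliation job can find it later.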
What the interviewer is really testing: Whether you understand that microservice decomposition is primarily a data problem, not a code problem, and whether you can navigate the loss of referential integrity and joins.

Strong Answer:
  • Data decomposition is the hardest part of migrating to microservices, and it should be done incrementally, not all at once. The Strangler Fig pattern applied to data: start by creating logical boundaries in the existing database, then physically separate over time.
  • Phase 1 — Logical separation: Within the monolith, introduce a data access layer per domain. The Order module can only access the Orders table through an OrderRepository. The User module owns the Users table. Enforce this with code review and eventually with database views or schemas. Foreign keys still exist at this point — that is fine.
  • Phase 2 — API boundary: The Order Service stops joining directly to the Users table. Instead, it calls a User Service API (which initially is just another module in the same monolith) to get user data. Replace the join with an application-level join: fetch the order, then fetch the user by ID. Yes, this is slower. Cache user data aggressively.
  • Phase 3 — Physical separation: Move the Users table to its own database (or schema) that only the User Service can access. Drop the foreign key constraint. The Order table stores user_id as a plain column with no foreign key. Referential integrity is now the application’s responsibility.
  • Handling the loss of foreign keys: (1) Use soft deletes — never hard delete a user; mark them as deleted. This prevents dangling references. (2) Implement eventual consistency checks — a periodic job that scans orders for user_ids that no longer exist in the User Service and flags them. (3) Use events — when a user is deleted, publish a UserDeleted event. The Order Service subscribes and handles orphaned orders (archive them, anonymize them, whatever the business requires).
  • Handling the loss of joins: for read-heavy queries that previously joined Orders and Users (e.g., “show all orders with user names”), either (1) denormalize — store the user name in the Order record and update it via events when the name changes, or (2) use an API composition layer that fetches from both services and merges the results, or (3) maintain a read-optimized view using CQRS — an event-driven projection that materializes the joined view into a query-optimized store.
  • Example: Uber’s migration from a monolith to microservices took years. They specifically called out the data layer as the bottleneck. They used the “database-per-service” pattern but maintained a shared schema registry to prevent drift. Their biggest challenge was cross-service reporting, which they solved by streaming all events into a central data lake for analytics queries.
Follow-up: During the migration, you are running both the monolith (which still has the foreign key) and the new User Service (which has its own database). A user is created in the new User Service but the monolith does not know about it yet. An order comes in for that user through the monolith path. What happens?

Until the monolith learns about the user, the order insert fails the foreign key check because the referenced row does not exist in the monolith’s Users table. This is the dual-write problem during migration. The solution is to ensure a single source of truth during the transition. Use the Strangler Fig approach: route all user writes through the new User Service, which writes to its own database AND publishes a sync event that updates the monolith’s Users table (via a CDC connector or a sync job). The monolith can still read from its local Users table but never writes to it for user data. This way, the monolith always has a slightly-lagged but complete view of all users. The reverse path (monolith creates a user) should be disabled once the User Service is live, or also routed through the User Service API.

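
The application-level join from Phase 2 (fetch the orders, then fetch the users by ID) might look like the sketch below. The function name `orders_with_user_names` and the batched `fetch_users_by_ids` lookup are assumptions for illustration; a real User Service client would sit behind that callable.

```python
from typing import Callable, Dict, Iterable, List


def orders_with_user_names(
    orders: List[dict],
    fetch_users_by_ids: Callable[[Iterable[str]], Dict[str, dict]],
) -> List[dict]:
    """Replace a SQL join with one batched User Service lookup plus an in-memory merge."""
    user_ids = {order["user_id"] for order in orders}
    users = fetch_users_by_ids(user_ids)  # one call, not one per order
    merged = []
    for order in orders:
        user = users.get(order["user_id"])
        merged.append({
            **order,
            # A missing user is possible once the foreign key is gone.
            "user_name": user["name"] if user else None,
        })
    return merged
```

Batching the lookup avoids the N+1 call pattern, and the explicit `None` makes a dangling `user_id` visible to the caller.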
What the interviewer is really testing: Whether you understand distributed tracing as a first-class requirement in microservices, not an afterthought, and whether you can use it systematically to debug cross-service issues.

Strong Answer:
  • This is exactly the scenario where distributed tracing becomes non-negotiable. The three pillars — metrics, logs, and traces — need to work together, but traces are the primary tool for debugging cross-service request failures.
  • The foundation is a correlation ID (trace ID) that is generated at the edge (API gateway or first service) and propagated through every service call via HTTP headers (W3C Trace Context standard or B3 propagation). Every log line, every metric, every error report includes this trace ID. When a user reports a failure, you search by trace ID and see the entire request path.
  • The trace shows you: which services were called, in what order, how long each took, and which one failed. For an intermittent failure across 8 services, the trace will immediately narrow it down to “the Payment Service returned a 500 after 12 seconds” or “the Inventory Service timed out on the database query.”
  • Beyond basic tracing, instrument span attributes: add business context to spans (user_id, order_id, product_id). This lets you search for “all traces where user X’s requests to the Inventory Service took longer than 2 seconds.” Without these attributes, you have infrastructure data but no business context.
  • For intermittent failures specifically, use tail-based sampling: capture 100% of traces for errored or slow requests, and sample 1-5% of successful requests. This ensures you never miss a failure trace while keeping storage costs manageable. Head-based sampling (decide at the edge whether to trace) risks missing the exact failures you need to debug.
  • Service dependency maps: generate these automatically from trace data. “The Order Service calls User, Inventory, Payment, and Notification. Payment calls Fraud Detection and Bank Gateway.” This map is invaluable for understanding blast radius — if the Bank Gateway goes down, which user-facing flows are affected?
  • Example: Uber built Jaeger specifically for this problem. With thousands of microservices, they needed to trace requests that could touch dozens of services. They found that most debugging time was spent not on finding the failing service (traces made that obvious) but on understanding why it failed (which required correlated logs and metrics). Their solution was a unified observability platform where clicking a span in a trace opens the relevant logs and metrics for that exact time window and service.
Follow-up: Distributed tracing adds latency overhead to every request (header propagation, span creation, async export). Your p99 latency is already tight at 200ms. How do you minimize the observability tax?

The overhead of trace header propagation is negligible (adding a few HTTP headers). The real cost is span creation and export. Use asynchronous, batched span export; never block the request thread to send telemetry data. Buffer spans in memory and flush in batches every 5-10 seconds. Use a lightweight trace SDK (OpenTelemetry’s Go or Rust SDKs add microseconds, not milliseconds). For the 200ms p99 budget, proper implementation adds less than 1ms of overhead. If you are seeing more, it is usually because someone configured synchronous export or is creating too many spans per request. Limit span depth: trace at the service boundary level, not at every function call within a service.

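
To show how small the header propagation step is, here is a simplified sketch of continuing a trace with the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). The helper names are hypothetical, and the sketch skips parts of the standard (for example rejecting all-zero IDs, case-insensitive header lookup, and `tracestate`); in production an OpenTelemetry SDK would normally handle this.

```python
import re
import secrets
from typing import Dict

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Continue the caller's trace if a valid traceparent is present, else start one."""
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    if match:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # "01" marks the trace as sampled
    # Same trace ID, fresh span ID: downstream spans attach to this service's span.
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-{flags}"}
```

Each service calls `outgoing_headers` with the headers it received and attaches the result to its own downstream requests, so every hop shares one trace ID.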

Interview Questions

Strong Answer:
  • The Saga pattern manages data consistency across services without a single ACID transaction. Instead of locking all resources at once (as 2PC does), a saga breaks a business operation into a sequence of local transactions, each with a compensating action that undoes its effect if a later step fails. For example, an e-commerce checkout saga might go: Create Order, Reserve Inventory, Charge Payment, Schedule Shipping — and if Payment fails, it runs Release Inventory then Cancel Order in reverse.
  • Two-phase commit does not work across microservices for practical reasons: it requires all participants to hold locks until the coordinator says “commit.” In a distributed system with services that have independent databases, this means holding row-level locks across network boundaries for the duration of the slowest participant. At 1,000 concurrent orders, you are holding 1,000 sets of cross-service locks. One slow or crashed participant blocks the entire system. 2PC is a single point of failure disguised as a coordination protocol.
  • Sagas give you eventual consistency instead of strong consistency. The trade-off is that intermediate states are visible — a customer might briefly see an order as “created” before payment is confirmed. You handle this with careful UI design (showing “processing” states) and idempotent compensations.
  • There are two saga execution styles: choreography (each service publishes events and the next service reacts) and orchestration (a central coordinator tells each service what to do). Choreography is simpler for 3-4 step flows but becomes spaghetti at 8+ steps because the flow logic is scattered. Orchestration adds a single point of coordination but makes complex flows readable and testable.
Red flag answer: “We use distributed transactions to keep everything consistent” or “We just use a shared database so we can use regular transactions.” Both answers reveal the candidate does not understand the fundamental constraint of microservices: each service owns its own data store.
Follow-ups:
  1. A compensation step in your saga fails (e.g., the refund API to Stripe returns a 500). Now you have an inconsistent state — the order is cancelled but the customer was still charged. How do you handle this?
  2. Your orchestration-based saga has 6 steps. During step 4, the saga orchestrator itself crashes. How do you ensure the saga resumes or compensates correctly after the orchestrator restarts?
Strong Answer:
  • The circuit breaker is a resilience pattern that prevents a service from repeatedly calling a failing downstream dependency, which would waste resources and cascade the failure. It has three states: Closed (requests flow normally, failures are counted), Open (all requests fail immediately without calling the dependency, giving it time to recover), and Half-Open (after a timeout, one probe request is allowed through — if it succeeds the circuit closes, if it fails it reopens).
  • The key insight is that calling a failing service is worse than not calling it. Without a circuit breaker, 100 threads can pile up waiting for 30-second timeouts from a dead Payment Service, exhausting the Order Service’s thread pool and causing the Order Service to also appear dead to its callers. The circuit breaker converts a 30-second timeout into a 5-millisecond “fail fast” response.
  • Choosing the failure threshold: this depends on the expected error rate under normal conditions. If the downstream service has a normal error rate of 0.1%, a threshold of 5 failures in 10 seconds is reasonable. If the normal error rate is 2% (some services are flaky by nature), you need a higher threshold like 10-15 failures in a 30-second window, or you will get false trips. The window matters as much as the count — 5 failures in 1 second is a signal; 5 failures in 10 minutes is noise.
  • Choosing the timeout (how long the circuit stays open): this should match the dependency’s expected recovery time. If the dependency typically recovers within 30 seconds (e.g., a pod restart on Kubernetes), use a 30-second open timeout. For services with longer recovery (database failover, 2-5 minutes), use longer timeouts. Start conservative (30 seconds) and tune based on production data.
  • In practice, combine the circuit breaker with bulkheading (separate thread pools per dependency) and fallbacks (return cached data or a degraded response when the circuit is open). The circuit breaker without a fallback just changes the error from “timeout” to “circuit open” — the user still gets an error.
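The three states described above can be sketched as a small state machine. This is an illustrative, single-threaded sketch, not a production library; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions for the example, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, open_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_timeout_s = open_timeout_s
        self._clock = clock
        self.state = "closed"
        self._failures = []      # timestamps of recent failures
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.open_timeout_s:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"   # timeout elapsed: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        if self.state == "half_open":  # probe succeeded: close the circuit
            self.state = "closed"
            self._failures.clear()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":  # probe failed: reopen immediately
            self._trip(now)
            return
        # Only failures inside the sliding window count toward the threshold.
        self._failures = [t for t in self._failures if now - t < self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

A caller would typically catch `CircuitOpenError` and return a fallback (cached data or a degraded response). A multi-threaded service would also need a lock around the state transitions, which this sketch omits.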
Red flag answer: “The circuit breaker automatically retries failed requests”, which confuses circuit breakers with retry logic. Or giving hardcoded values like “I always use a threshold of 5” without explaining how context drives the choice.
Follow-ups:
  1. You have a circuit breaker on the Payment Service. The circuit opens during a flash sale with 50,000 users actively checking out. How do you handle the business impact of failing all payment attempts for 30 seconds?
  2. Your service calls the same downstream dependency for two different use cases — one is critical (payment verification) and one is supplementary (fetching display metadata). Should they share a circuit breaker or have separate ones? Why?
Strong Answer:
  • An API Gateway is the single entry point for all client requests in a microservices architecture. It sits between external clients and internal services and handles cross-cutting concerns: request routing, authentication/authorization, rate limiting, request/response transformation, protocol translation (e.g., REST to gRPC), and API composition (aggregating responses from multiple services into one).
  • A reverse proxy (like Nginx) forwards requests to backend servers based on URL patterns. It handles basic routing and TLS termination but has no awareness of your API semantics or business logic. A load balancer distributes traffic across instances of a single service for availability and throughput. The API Gateway is a superset — it does what both do, plus API-aware features like request validation, response shaping, and per-route rate limiting.
  • The critical difference: an API Gateway understands that /api/orders/123 should go to the Order Service, that the response should be enriched with user data from the User Service, that this particular endpoint requires an OAuth2 token with the orders:read scope, and that free-tier users are limited to 100 requests per minute on this endpoint. A load balancer knows none of this.
  • The trade-off: the API Gateway is a single point of failure and a potential bottleneck. If it goes down, everything goes down. You mitigate this with horizontal scaling (multiple gateway instances behind a load balancer — yes, an LB in front of the gateway), health checks, and keeping the gateway thin — business logic belongs in services, not the gateway.
  • Real-world examples: Kong (open-source, plugin ecosystem), AWS API Gateway (managed, pay-per-request), Envoy (often used as both gateway and service mesh data plane), and custom BFF gateways built with Node.js or Go for specific client needs.
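To make the "API-aware" distinction concrete, here is a toy sketch of the per-route decisions a gateway makes before forwarding a request: route lookup, scope check, and per-tier rate limiting. `Route`, `Gateway`, the fixed one-minute window, and the status codes returned are assumptions for illustration, not the behavior of any specific gateway product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str                   # e.g. "/api/orders"
    service: str                  # internal service that owns this path
    required_scope: str           # OAuth2 scope the token must carry
    free_tier_limit_per_min: int  # per-user limit for free-tier callers


class Gateway:
    def __init__(self, routes):
        # Longest prefix first so more specific routes win.
        self._routes = sorted(routes, key=lambda r: len(r.prefix), reverse=True)
        self._counts = defaultdict(int)  # (user, prefix, minute) -> request count

    def handle(self, path, user, scopes, tier, now_s):
        """Return (status, target_service); target is None when the request is rejected."""
        route = next((r for r in self._routes if path.startswith(r.prefix)), None)
        if route is None:
            return 404, None
        if route.required_scope not in scopes:
            return 403, None
        if tier == "free":
            key = (user, route.prefix, int(now_s // 60))
            if self._counts[key] >= route.free_tier_limit_per_min:
                return 429, None
            self._counts[key] += 1
        return 200, route.service
```

A plain load balancer has no equivalent of `required_scope` or `free_tier_limit_per_min`; it only picks an instance. Note that nothing in this sketch is business logic, which is consistent with keeping the gateway thin.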
Red flag answer: “It is just a reverse proxy for microservices”, which misses the API-aware, cross-cutting concern handling that distinguishes a gateway. Or “we put all our business logic in the gateway”, which creates a monolithic gateway that defeats the purpose.
Follow-ups:
  1. Your team is debating between a single API Gateway for all clients versus a Backend-for-Frontend (BFF) pattern with separate gateways for web, mobile, and IoT. What are the trade-offs and when would you choose BFF?
  2. The API Gateway is doing request aggregation — calling 3 services and combining the results. One of those services is slow. How do you prevent the gateway from becoming the bottleneck?
Strong Answer:
  • Service discovery solves the problem of “where is the service I need to call?” In a dynamic environment (Kubernetes, auto-scaling groups), service instances come and go constantly — IPs change, ports shift, new instances spin up during traffic spikes. Hardcoding addresses is not viable.
  • There are two models: client-side discovery, where the client queries a service registry (like Consul or Eureka) and picks an instance itself using a load balancing algorithm; and server-side discovery, where the client sends the request to a load balancer or DNS endpoint that handles routing to a healthy instance. Kubernetes uses server-side discovery by default — a Service object gives you a stable DNS name that routes to healthy pods.
  • The registry itself stores service name to instance mappings and health status. Services register on startup, send periodic heartbeats, and deregister on shutdown. The registry marks instances as unhealthy if heartbeats stop (typically after 30 seconds).
  • When the registry goes down, the failure mode depends on the discovery model. With client-side discovery, clients typically cache the last known set of healthy instances locally. As long as those instances are still running, requests succeed. The cache has a TTL (usually 30-60 seconds), and smart implementations extend the TTL when the registry is unreachable rather than evicting the cache. The risk: if an instance goes down while the registry is unavailable, clients will still route to it until their cache expires. Combine with active health checking on the client side — if the client gets a connection refused, remove that instance from the local cache immediately.
  • With server-side discovery (like Kubernetes DNS), the failure mode is different. Kubernetes keeps the endpoints in etcd, and kube-proxy maintains iptables/IPVS rules locally on each node. If the control plane goes down, existing rules continue working — traffic flows to the last known set of pods. New pods cannot register, and dead pods are not removed, but existing routing works.
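The client-side caching behavior described above (serve from cache, extend the TTL when the registry is unreachable, evict on connection refused) can be sketched as follows. `DiscoveryCache`, `RegistryUnavailable`, and `fetch_instances` are hypothetical names; a real client for Consul or Eureka would wrap that library's own lookup call.

```python
import time


class RegistryUnavailable(Exception):
    """Raised by the lookup function when the registry cannot be reached."""


class DiscoveryCache:
    def __init__(self, fetch_instances, ttl_s=30.0, clock=time.monotonic):
        self._fetch = fetch_instances  # hypothetical: service name -> list of "host:port"
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache = {}               # service -> (expires_at, [instances])

    def instances(self, service):
        now = self._clock()
        entry = self._cache.get(service)
        if entry and now < entry[0]:
            return list(entry[1])      # fresh cache hit
        try:
            fresh = list(self._fetch(service))
        except RegistryUnavailable:
            if entry is None:
                raise                  # nothing cached: cannot route at all
            # Registry is down: extend the TTL instead of evicting stale data.
            self._cache[service] = (now + self._ttl_s, entry[1])
            return list(entry[1])
        self._cache[service] = (now + self._ttl_s, fresh)
        return list(fresh)

    def report_connection_refused(self, service, instance):
        """Active health signal from the caller: drop the dead instance right away."""
        entry = self._cache.get(service)
        if entry and instance in entry[1]:
            entry[1].remove(instance)
```

The trade-off is visible in the `except` branch: the client keeps working through a registry outage, at the price of possibly routing to instances that died in the meantime, which is why `report_connection_refused` exists.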
Red flag answer: “If the registry goes down, everything stops working”, which shows no understanding of caching, local health checks, or how production registries handle partitions. Or describing only one model (client-side or server-side) without acknowledging the other exists.
Follow-ups:
  1. You are using Consul for service discovery. A network partition splits your data center into two halves, each with some Consul servers. How does service discovery behave in each partition?
  2. Your service registry shows 5 healthy instances of the Payment Service, but 2 of them are actually in a degraded state — they pass health checks but respond very slowly (5-second latency). How do you detect and handle this?
Strong Answer:
  • Database per service means each microservice owns and exclusively manages its own data store. No other service can read or write to it directly — only through the service’s API. This is arguably the most important microservices principle because without it, you have a distributed monolith: services that are deployed independently but coupled through shared data.
  • The immediate pain point: in a monolith, “get all orders with customer names and product details” is a single SQL query with two JOINs. In microservices, this data lives in three separate databases owned by three separate services. You cannot JOIN across them.
  • There are four strategies, each with different trade-offs. (1) API Composition: the caller makes three API calls (Order Service, User Service, Product Service) and joins the data in application code. Simple but slow for large result sets and creates runtime coupling. (2) Denormalization via events: the Order Service stores a copy of the customer name and product name in its own database, updated via events (CustomerNameChanged, ProductNameChanged). Fast reads, but data can be slightly stale and you pay storage cost. (3) CQRS with event-driven projections: build a dedicated read model that consumes events from all three services and materializes the joined view. Best for complex queries but adds architectural complexity. (4) Data lake/warehouse: stream all events or CDC changes into a central analytics store (BigQuery, Redshift) for reporting queries. Not real-time, but perfect for business intelligence.
  • In practice, most teams use a mix: API composition for simple, low-volume queries; denormalization for high-traffic read paths (e.g., order detail page); and a data warehouse for cross-domain analytics.
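As a rough sketch of strategy (1), API composition, the caller can batch its lookups and join in application code. `compose_order_view`, `fetch_users`, and `fetch_products` are hypothetical; in a real system the two fetch functions would be HTTP or gRPC calls to the User and Product services, and the field names are assumptions.

```python
from typing import Callable, Dict, Iterable, List


def compose_order_view(
    orders: List[Dict],
    fetch_users: Callable[[Iterable[str]], Dict[str, Dict]],
    fetch_products: Callable[[Iterable[str]], Dict[str, Dict]],
) -> List[Dict]:
    """Join orders with customer and product names without a cross-database JOIN."""
    # Collect distinct IDs so each downstream service gets one batched call,
    # not one call per order (which would be an N+1 problem over the network).
    user_ids = {o["user_id"] for o in orders}
    product_ids = {o["product_id"] for o in orders}

    users = fetch_users(user_ids)           # {user_id: {"name": ...}}
    products = fetch_products(product_ids)  # {product_id: {"name": ...}}

    return [
        {
            "order_id": o["id"],
            # A missing record yields None rather than failing the whole view.
            "customer_name": users.get(o["user_id"], {}).get("name"),
            "product_name": products.get(o["product_id"], {}).get("name"),
        }
        for o in orders
    ]
```

The runtime coupling mentioned above is easy to see here: if either fetch call is slow or down, the composed view is slow or incomplete, which is what pushes high-traffic read paths toward denormalization or a CQRS projection.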
Red flag answer: “We just connect the Order Service to the User database to do the JOIN”, which directly violates the pattern. Or “we use a shared database but with separate schemas” without acknowledging this is effectively not database-per-service.
Follow-ups:
  1. You chose denormalization: the Order Service stores the customer name. The customer changes their name. The event gets published, but 10,000 historical orders still have the old name. Do you backfill? What are the trade-offs?
  2. Your read model (CQRS projection) is running 30 seconds behind the write model due to event processing lag. A user places an order and immediately navigates to “My Orders” but does not see the new order. How do you handle this read-your-own-writes problem?
Strong Answer:
  • A service mesh is an infrastructure layer that handles service-to-service communication transparently. It deploys a sidecar proxy (typically Envoy) alongside every service instance. All traffic goes through the sidecar, which handles mTLS encryption, retries, circuit breaking, load balancing, observability (metrics and traces), and access control — without the application code knowing about any of it.
  • The architecture has two components: the data plane (the sidecar proxies that handle traffic) and the control plane (Istio, Linkerd, or Consul Connect, which configures the proxies and manages certificates). The application talks to localhost, the sidecar intercepts the traffic and applies policies before forwarding.
  • When it is justified: (1) you have 20+ services and enforcing consistent resilience patterns, TLS, and observability across all of them via application libraries is becoming a maintenance nightmare — every service needs the same circuit breaker library, the same retry config, the same TLS setup, and they drift; (2) you operate in a polyglot environment (Go, Java, Python, Node.js) and cannot share a single SDK; (3) you need zero-trust networking (mTLS everywhere) for compliance and doing it per-service is error-prone; (4) you need fine-grained traffic management like canary deployments at the network level.
  • When it is not justified: fewer than 10 services, a single-language stack where a shared library handles cross-cutting concerns, a team that does not have Kubernetes expertise (most meshes assume Kubernetes), or when latency overhead matters — each sidecar adds 1-3ms per hop, which compounds across a 5-service call chain.
  • The hidden cost: operational complexity. The mesh itself needs to be monitored, upgraded, and debugged. When something goes wrong, you are debugging both the application AND the mesh. Istio in particular has a reputation for being difficult to operate. Linkerd is lighter-weight but has fewer features.
Red flag answer: “You should always use a service mesh with microservices”, which shows no understanding of the complexity cost. Or “a service mesh replaces the need for application-level resilience”; the mesh handles transport-level concerns, but application-level idempotency, saga logic, and business retries still belong in the application.
Follow-ups:
  1. Your service mesh sidecar is adding 2ms of latency per hop. A single user request passes through 6 services. That is 12ms of pure mesh overhead on a 100ms budget. How do you reduce this?
  2. A new engineer on the team deploys a service without the sidecar (they used a plain Kubernetes Deployment instead of the mesh-injected one). What breaks, and how do you prevent this from happening again?
Strong Answer:
  • In choreography, there is no central coordinator. Each service listens for events and reacts by performing its local transaction and publishing the next event. The flow emerges from the event chain: Order Service publishes OrderCreated, Payment Service hears it and publishes PaymentProcessed, Inventory Service hears that and publishes StockReserved. The “saga” exists implicitly in the event chain, not as a single artifact.
  • In orchestration, a central Saga Orchestrator explicitly defines the flow: “Step 1: call Order Service. Step 2: call Payment Service. Step 3: call Inventory Service.” The orchestrator tracks state, decides next steps, and initiates compensations if a step fails. The flow is explicitly defined in one place.
  • Choreography advantages: no single point of failure (no orchestrator to crash), services are more loosely coupled (they react to events, not commands), and it is simpler for short, linear flows (3-4 steps). Think of it as the publish-subscribe equivalent of a relay race — each runner knows to start when the previous runner finishes.
  • Choreography disadvantages: the flow logic is distributed across multiple services, making it very hard to understand, test, and debug. If you have 8 services in a saga and step 5 fails, understanding the compensation chain requires reading code in 5 different services. Adding a new step means modifying existing services to react to new events. Circular dependencies can emerge (Service A reacts to Service B which reacts to Service A).
  • Orchestration advantages: the flow is defined in one place, making it readable, testable, and debuggable. Adding a step means modifying one file. Compensations are clearly defined next to their corresponding actions. Complex branching logic (if payment fails AND order is over $500, escalate to manual review) is straightforward.
  • Orchestration disadvantages: the orchestrator is a potential single point of failure (mitigate with persistent saga state and recovery on restart), and it introduces tighter coupling between the orchestrator and the services it calls.
  • Rule of thumb: use choreography for simple, linear flows with 3-4 steps where the participants are unlikely to change. Use orchestration for anything more complex — 5+ steps, branching logic, or flows that change frequently.
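To contrast with an orchestrator, here is a minimal in-memory sketch of the choreography flow described above. `EventBus` and `wire_services` are hypothetical stand-ins for a real message broker and three separate services, and the `PaymentFailed` and `OrderCancelled` events are assumed names for the failure path.

```python
from collections import defaultdict, deque


class EventBus:
    """Toy in-process pub/sub; a real system would use a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._pending = deque()
        self.history = []  # event types in delivery order, for inspection

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self._pending.append((event_type, payload))

    def run(self):
        """Deliver queued events until no handler publishes anything new."""
        while self._pending:
            event_type, payload = self._pending.popleft()
            self.history.append(event_type)
            for handler in self._handlers[event_type]:
                handler(payload)


def wire_services(bus, payment_ok=True):
    # Payment Service: reacts to OrderCreated.
    def on_order_created(order):
        bus.publish("PaymentProcessed" if payment_ok else "PaymentFailed", order)

    # Inventory Service: reacts to PaymentProcessed.
    def on_payment_processed(order):
        bus.publish("StockReserved", order)

    # Order Service: compensates by cancelling when payment fails.
    def on_payment_failed(order):
        bus.publish("OrderCancelled", order)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("PaymentProcessed", on_payment_processed)
    bus.subscribe("PaymentFailed", on_payment_failed)
```

Publishing `OrderCreated` and calling `bus.run()` produces the chain OrderCreated, PaymentProcessed, StockReserved. No function in the sketch describes the whole flow: it only exists as the sum of the subscriptions, which is exactly the debugging difficulty noted above.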
Red flag answer: “Choreography is always better because it is more decoupled”, which ignores the practical debugging nightmare. Or “orchestration is always better because it is centralized”, which ignores the coupling trade-off.
Follow-ups:
  1. You have a choreography-based saga with 6 services. A bug in Service D causes it to publish the wrong event. How do you trace the issue and understand the impact on downstream services?
  2. Your orchestration-based saga orchestrator is processing 10,000 sagas per second. It stores saga state in PostgreSQL and is hitting write throughput limits. How do you scale it?
Strong Answer:
  • Canary deployment routes a small percentage of production traffic (typically 1-5% to start) to the new version while the rest continues hitting the old version. The goal is to detect problems with real production traffic before they affect all users. For a payment service, this is especially critical because a bad deploy directly impacts revenue.
  • The process: (1) Deploy the new version alongside the old version (do not replace). (2) Configure the traffic router (service mesh, API gateway, or Kubernetes Ingress) to send 1% of traffic to the canary. (3) Monitor key metrics for a bake period (15-30 minutes minimum for a payment service). (4) If metrics are healthy, gradually increase to 5%, 10%, 25%, 50%, 100%. (5) If any metric breaches the threshold, automatically roll back by routing 100% to the old version.
  • The metrics for a payment service specifically: (a) Error rate — compare the canary’s 5xx rate to the baseline. If the canary is 2x higher, roll back. (b) Latency — compare p50, p95, and p99 latency. A jump in p99 from 200ms to 800ms signals a problem even if p50 looks fine. (c) Business metrics — payment success rate is the most important. If the canary’s payment success rate drops from 98.5% to 96%, that is a revenue-impacting regression even if there are zero 5xx errors (the failure might be in the payment gateway integration, returning a 200 with an error payload). (d) Downstream health — is the canary causing increased errors in downstream services like fraud detection?
  • The subtlety most people miss: statistical significance. At 1% traffic, you might see 50 requests per minute. A single failed request is a 2% error rate, but it could be noise. You need enough traffic volume for the comparison to be meaningful. For low-traffic services, you might need to run the canary for hours instead of minutes, or increase the initial percentage to 5-10%.
  • For a payment service, I would also do a shadow/dark launch first: route duplicate traffic to the new version but do not use its responses. Compare the new version’s responses to the production version’s responses. This catches logic bugs without any customer impact.
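The promote-or-roll-back decision described above can be sketched as a single function. `Metrics`, `canary_verdict`, and every default threshold are assumptions for illustration; only the 2x error-rate rule comes from the text, and the minimum-request guard is a crude stand-in for a proper statistical significance test.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors_5xx: int
    payments_ok: int   # business metric: successful payments
    p99_ms: float


def canary_verdict(baseline: Metrics, canary: Metrics,
                   min_requests: int = 1000,
                   max_error_ratio: float = 2.0,
                   max_success_drop: float = 0.01,
                   max_p99_ratio: float = 2.0) -> str:
    """Return "wait", "rollback", or "promote" for the current bake window."""
    # Too little canary traffic: a single failure would dominate the rate.
    if canary.requests < min_requests:
        return "wait"

    # Floor the baseline error rate so a zero-error baseline does not make
    # every canary error an automatic rollback (the floor is an assumption).
    base_err = max(baseline.errors_5xx / baseline.requests, 1e-4)
    canary_err = canary.errors_5xx / canary.requests
    if canary_err > max_error_ratio * base_err:
        return "rollback"

    # Business metric: catches failures that still return 200 OK.
    base_success = baseline.payments_ok / baseline.requests
    canary_success = canary.payments_ok / canary.requests
    if base_success - canary_success > max_success_drop:
        return "rollback"

    # Tail latency regression, even if the median looks fine.
    if canary.p99_ms > max_p99_ratio * baseline.p99_ms:
        return "rollback"

    return "promote"
```

With a baseline of 98.5% payment success, a canary at 96% triggers a rollback here even when its 5xx rate matches the baseline, which mirrors the "200 with an error payload" case above. Downstream health checks are left out of the sketch for brevity.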
Red flag answer: “We just deploy to one server and check if it works”, which is manual, not systematic. Or only watching error rates without business metrics: a payment service can return 200 OK while silently declining transactions.
Follow-ups:
  1. Your canary is at 5% traffic and the payment success rate drops by 0.3%. That is within normal variance for a 15-minute window but outside normal variance for a 1-hour window. Do you roll back immediately or wait? How do you decide?
  2. The canary version introduces a new database schema migration. How do you handle the fact that both the old version and the canary version need to work with the same database simultaneously?