December 2025 Update: Covers serverless AI, GPU inference, caching strategies, rate limiting, and production infrastructure patterns.

The Production Gap

Building an AI demo takes hours. Making it production-ready takes weeks. The gap is not about the model — it is about everything around the model. Your demo worked because you tested it 10 times with friendly inputs on a fast laptop. Production means 10,000 concurrent users, each sending unexpected inputs, while the OpenAI API occasionally returns 429 (rate limited) or 500 (server error) and your Redis cache decides to evict entries at the worst possible moment. This module covers:
  • Reliability (error handling, retries, fallbacks)
  • Performance (caching, batching, async)
  • Cost (model routing, token optimization)
  • Scaling (rate limits, queues, load balancing)
Reality Check: 90% of AI projects fail to reach production. The difference is infrastructure, not models.

Production Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          LOAD BALANCER                               │
│                    (CloudFlare, AWS ALB, nginx)                     │
└─────────────────────────────────────────────────────────────────────┘

                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
            ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
            │  API Pod 1  │ │  API Pod 2  │ │  API Pod N  │
            └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
                   │               │               │
                   └───────────────┼───────────────┘

            ┌─────────────────────────────────────────────────┐
            │              RATE LIMITER / QUEUE               │
            │                    (Redis)                      │
            └─────────────────────────────────────────────────┘

                   ┌───────────────┼───────────────┐
                   ▼               ▼               ▼
            ┌───────────┐   ┌───────────┐   ┌───────────┐
            │   Cache   │   │  Model    │   │ Fallback  │
            │  (Redis)  │   │  Router   │   │  Queue    │
            └───────────┘   └─────┬─────┘   └───────────┘

                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
             ┌──────────┐  ┌──────────┐  ┌──────────┐
             │  OpenAI  │  │ Anthropic│  │  Local   │
             │   API    │  │   API    │  │  (Ollama)│
             └──────────┘  └──────────┘  └──────────┘

Error Handling & Retries

LLM APIs fail in ways that traditional APIs don’t. A database query either works or throws an error in milliseconds. An LLM call can succeed in 2 seconds, time out at 30 seconds, return a rate-limit error, or — most annoyingly — succeed but return garbage because the model hallucinated through your format requirements. The patterns below handle each failure mode.

Robust API Wrapper

import asyncio
from openai import AsyncOpenAI, APIError, RateLimitError, APIConnectionError
from tenacity import (
    retry, 
    stop_after_attempt, 
    wait_exponential,
    retry_if_exception_type
)
import logging

logger = logging.getLogger(__name__)

class RobustLLMClient:
    """Production-grade LLM client with retries and fallbacks"""
    
    def __init__(self):
        # Async client: awaiting a call does not block the event loop, so
        # asyncio.wait_for() in complete() can actually enforce its timeout.
        self.openai = AsyncOpenAI()
        self.fallback_models = [
            "gpt-4o",
            "gpt-4o-mini",
            "gpt-3.5-turbo",
        ]
    
    @retry(
        # Only retry transient errors. A 400 Bad Request (BadRequestError, a
        # subclass of APIError) means your prompt is broken -- retrying won't
        # help and wastes money.
        retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
        stop=stop_after_attempt(3),
        # Exponential backoff clamped to between 4s and 60s. With only three
        # attempts there are at most two waits, and both sit at the 4s floor;
        # the growth only shows with more attempts. Backing off avoids
        # hammering a rate-limited endpoint.
        wait=wait_exponential(multiplier=1, min=4, max=60),
        before_sleep=lambda retry_state: logger.warning(
            f"Retry {retry_state.attempt_number} after {retry_state.outcome.exception()}"
        ),
        # Re-raise the original exception once attempts are exhausted, so the
        # except clauses in complete() see RateLimitError rather than RetryError.
        reraise=True,
    )
    async def _call_with_retry(self, model: str, messages: list, **kwargs):
        """Single model call with retries."""
        return await self.openai.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
    
    async def complete(
        self, 
        messages: list,
        model: str = "gpt-4o",
        timeout: float = 30.0,
        **kwargs
    ) -> str:
        """Complete with automatic fallback"""
        
        models_to_try = [model] + [m for m in self.fallback_models if m != model]
        last_error = None
        
        for try_model in models_to_try:
            try:
                response = await asyncio.wait_for(
                    self._call_with_retry(try_model, messages, **kwargs),
                    timeout=timeout
                )
                
                if try_model != model:
                    logger.info(f"Succeeded with fallback model: {try_model}")
                
                return response.choices[0].message.content
                
            except asyncio.TimeoutError:
                logger.warning(f"Timeout with {try_model}")
                last_error = TimeoutError(f"Timeout after {timeout}s")
            except RateLimitError as e:
                logger.warning(f"Rate limited on {try_model}: {e}")
                last_error = e
            except APIError as e:
                logger.error(f"API error on {try_model}: {e}")
                last_error = e
        
        raise last_error or RuntimeError("All models failed")

Circuit Breaker Pattern

The circuit breaker is borrowed from electrical engineering: when too much current flows through a circuit, the breaker trips to prevent a fire. In software, when an API fails repeatedly, the circuit breaker “trips” and immediately rejects new requests instead of waiting 30 seconds for each one to time out. This prevents cascading failures: without a circuit breaker, a downed OpenAI API causes your entire server to hang because every request thread is stuck waiting. The three states map to a simple lifecycle: CLOSED (everything is fine, let requests through), OPEN (things are broken, fail fast), and HALF_OPEN (cautiously let one request through to see if the service recovered).
from datetime import datetime, timedelta
from dataclasses import dataclass
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation -- requests flow through
    OPEN = "open"          # Service is down -- reject immediately (fail fast)
    HALF_OPEN = "half_open"  # Testing recovery -- allow one probe request

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: int = 60  # seconds
    
    def __post_init__(self):
        self.failures = 0
        self.last_failure_time: datetime | None = None
        self.state = CircuitState.CLOSED
    
    def record_success(self):
        self.failures = 0
        self.state = CircuitState.CLOSED
    
    def record_failure(self):
        self.failures += 1
        self.last_failure_time = datetime.now()
        
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        
        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        
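        # HALF_OPEN: let requests through until one reports success or
        # failure. Note that this simple version does not cap the number of
        # concurrent probes; a stricter breaker would admit exactly one.
        return True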

# Usage with LLM
llm_client = RobustLLMClient()
circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

async def protected_llm_call(prompt: str):
    if not circuit.can_execute():
        raise RuntimeError("Circuit breaker is OPEN - service unavailable")
    
    try:
        # complete() expects a list of chat messages, not a bare string
        result = await llm_client.complete([{"role": "user", "content": prompt}])
        circuit.record_success()
        return result
    except Exception:
        circuit.record_failure()
        raise

Caching Strategies

Semantic Cache

import hashlib
import json
from datetime import datetime
import redis
import numpy as np
from openai import OpenAI

class SemanticCache:
    """Cache LLM responses with semantic similarity matching"""
    
    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        similarity_threshold: float = 0.95,
        ttl_seconds: int = 3600
    ):
        self.redis = redis.from_url(redis_url)
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.openai = OpenAI()
    
    def _get_embedding(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def _hash_key(self, text: str) -> str:
        return f"llm_cache:{hashlib.sha256(text.encode()).hexdigest()[:16]}"
    
    async def get(self, prompt: str) -> str | None:
        """Try to get cached response"""
        
        # First try exact match
        exact_key = self._hash_key(prompt)
        cached = self.redis.get(exact_key)
        if cached:
            return json.loads(cached)["response"]
        
        # Try semantic match
        prompt_embedding = self._get_embedding(prompt)
        
        # Scan recent cache entries (in production, use vector DB)
        for key in self.redis.scan_iter("llm_cache:*"):
            raw = self.redis.get(key)
            if raw is None:
                continue  # entry expired between the scan and the read
            cached = json.loads(raw)
            if "embedding" in cached:
                similarity = self._cosine_similarity(
                    prompt_embedding, 
                    cached["embedding"]
                )
                if similarity >= self.similarity_threshold:
                    return cached["response"]
        
        return None
    
    async def set(self, prompt: str, response: str):
        """Cache a response"""
        key = self._hash_key(prompt)
        embedding = self._get_embedding(prompt)
        
        self.redis.setex(
            key,
            self.ttl,
            json.dumps({
                "prompt": prompt,
                "response": response,
                "embedding": embedding,
                "cached_at": datetime.now().isoformat()
            })
        )

# Usage
cache = SemanticCache()

async def cached_llm_call(prompt: str) -> str:
    # Try cache first
    cached = await cache.get(prompt)
    if cached:
        logger.info("Cache hit!")
        return cached
    
    # Call LLM (complete() expects a list of chat messages)
    response = await llm_client.complete([{"role": "user", "content": prompt}])
    
    # Cache the response
    await cache.set(prompt, response)
    
    return response
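
The exact-match key above is derived from the prompt alone, so two calls with the same prompt but different models or sampling settings would share one cache entry. One way to avoid that, sketched here under the assumption that model and request parameters should be part of the cache identity, is to hash a canonical JSON form of the whole request. `make_cache_key` is a hypothetical helper, not part of `SemanticCache`.

```python
import hashlib
import json


def make_cache_key(model: str, messages: list[dict], **params) -> str:
    """Build a deterministic cache key from everything that shapes the answer."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Keep the full digest; truncating it raises the odds of a collision.
    return f"llm_cache:{digest}"
```

The same request always maps to the same key regardless of keyword order, while changing the model or any parameter yields a different key.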

Response Streaming with Cache

import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_with_cache(prompt: str, cache: SemanticCache):
    """Stream response while building cache"""
    
    # Check cache first
    cached = await cache.get(prompt)
    if cached:
        # Stream from cache
        for chunk in cached.split():
            yield f"data: {chunk} \n\n"
            await asyncio.sleep(0.02)  # Simulate streaming
        return
    
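    # Stream from LLM and collect.
    # NOTE: stream() is assumed to be an async generator of text chunks;
    # the RobustLLMClient shown earlier does not define it.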
    full_response = []
    
    async for chunk in llm_client.stream(prompt):
        full_response.append(chunk)
        yield f"data: {chunk}\n\n"
    
    # Cache complete response
    await cache.set(prompt, "".join(full_response))

@app.post("/chat/stream")
async def chat_stream(request: dict):
    return StreamingResponse(
        stream_with_cache(request["prompt"], cache),
        media_type="text/event-stream"
    )

Rate Limiting

Rate limiting protects both your wallet and your upstream API quotas. Without it, a single power user (or an accidental infinite loop in a client) can burn through your entire OpenAI budget in minutes. The token bucket algorithm below is the industry standard: imagine a bucket that fills with tokens at a steady rate. Each request consumes tokens. When the bucket is empty, requests are rejected until it refills. The elegance is that it naturally allows bursts (the bucket can be full) while enforcing an average rate.
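
The bucket described above can be sketched in memory to make the burst-versus-average behavior concrete. This is an illustration for a single process only, with hypothetical names (`InMemoryTokenBucket`, `allow`); the clock is injectable so the refill logic can be exercised without waiting.

```python
import time
from typing import Callable


class InMemoryTokenBucket:
    """Single-process token bucket; not shared across replicas."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start full, so bursts are allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed, never beyond the bucket's capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with capacity 3 and a rate of 1 token per second accepts three requests back to back, rejects the fourth, and accepts one more after a second has passed.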

Token Bucket Rate Limiter

import time
from dataclasses import dataclass
import redis

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    reset_at: float
    retry_after: float = 0

class TokenBucketLimiter:
    """Token bucket rate limiter with Redis backend"""
    
    def __init__(
        self,
        redis_url: str,
        tokens_per_minute: int = 60,
        bucket_size: int = 100
    ):
        self.redis = redis.from_url(redis_url)
        self.rate = tokens_per_minute / 60  # tokens per second
        self.bucket_size = bucket_size
    
    def check(self, key: str, tokens: int = 1) -> RateLimitResult:
        """Check if request is allowed and consume tokens"""
        
        now = time.time()
        bucket_key = f"ratelimit:{key}"
        
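        # Get current bucket state.
        # NOTE: the read-modify-write below is not atomic. Two replicas can
        # read the same state and both consume tokens. For strict limits,
        # run the whole check in a Lua script or a WATCH/MULTI transaction.
        result = self.redis.hgetall(bucket_key)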
        
        if result:
            tokens_available = float(result.get(b"tokens", self.bucket_size))
            last_update = float(result.get(b"last_update", now))
        else:
            tokens_available = self.bucket_size
            last_update = now
        
        # Refill tokens based on time passed
        time_passed = now - last_update
        tokens_available = min(
            self.bucket_size,
            tokens_available + (time_passed * self.rate)
        )
        
        # Check if we have enough tokens
        if tokens_available >= tokens:
            # Consume tokens
            tokens_available -= tokens
            self.redis.hset(bucket_key, mapping={
                "tokens": tokens_available,
                "last_update": now
            })
            self.redis.expire(bucket_key, 120)  # Cleanup old keys
            
            return RateLimitResult(
                allowed=True,
                remaining=int(tokens_available),
                reset_at=now + (self.bucket_size - tokens_available) / self.rate
            )
        else:
            # Calculate wait time
            tokens_needed = tokens - tokens_available
            wait_time = tokens_needed / self.rate
            
            return RateLimitResult(
                allowed=False,
                remaining=0,
                reset_at=now + wait_time,
                retry_after=wait_time
            )

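# FastAPI middleware
import math
from fastapi import Request
from fastapi.responses import JSONResponse

limiter = TokenBucketLimiter(
    redis_url="redis://localhost:6379",
    tokens_per_minute=100
)

# Register on the FastAPI instance, e.g. app.middleware("http")(rate_limit_middleware)
async def rate_limit_middleware(request: Request, call_next):
    # Get user identifier (request.client can be None behind some proxies)
    client_host = request.client.host if request.client else "unknown"
    user_id = request.headers.get("X-API-Key", client_host)
    
    result = limiter.check(user_id)
    
    if not result.allowed:
        # Return the 429 directly: an HTTPException raised inside HTTP
        # middleware is not seen by FastAPI's exception handlers.
        return JSONResponse(
            status_code=429,
            content={"detail": "Rate limit exceeded"},
            headers={
                # Round up so clients are never told to retry too early
                "Retry-After": str(math.ceil(result.retry_after)),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(result.reset_at))
            }
        )
    
    response = await call_next(request)
    response.headers["X-RateLimit-Remaining"] = str(result.remaining)
    return response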

Model Router

Model routing at the infrastructure level is different from the application-level routing covered in Cost Optimization. Here, you are also considering latency, context window size, and provider availability — not just cost. The router below acts as a smart proxy that can switch between OpenAI, Anthropic, and local models based on the task, your budget, and which APIs are currently healthy.

Cost-Optimized Routing

from dataclasses import dataclass
from enum import Enum
from openai import OpenAI

class TaskComplexity(Enum):
    SIMPLE = "simple"       # Classification, extraction
    MODERATE = "moderate"   # Summarization, Q&A
    COMPLEX = "complex"     # Reasoning, coding
    CRITICAL = "critical"   # High-stakes decisions

@dataclass
class ModelConfig:
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_context: int
    speed: float  # tokens/second

MODELS = {
    "gpt-4o": ModelConfig("gpt-4o", 0.0025, 0.01, 128000, 100),
    "gpt-4o-mini": ModelConfig("gpt-4o-mini", 0.00015, 0.0006, 128000, 200),
    "claude-3-5-sonnet": ModelConfig("claude-3-5-sonnet", 0.003, 0.015, 200000, 80),
    "claude-3-5-haiku": ModelConfig("claude-3-5-haiku", 0.0008, 0.004, 200000, 150),
}

class ModelRouter:
    """Route requests to optimal model based on task"""
    
    def __init__(self, default_model: str = "gpt-4o-mini"):
        self.default = default_model
        self.classifier = OpenAI()
    
    def classify_task(self, prompt: str) -> TaskComplexity:
        """Use a cheap model to classify task complexity"""
        response = self.classifier.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Classify this task's complexity:
                
Task: {prompt[:500]}

Options:
- simple: Basic extraction, classification, formatting
- moderate: Summarization, Q&A, translation
- complex: Multi-step reasoning, coding, analysis
- critical: High-stakes decisions, medical/legal advice

Reply with just the complexity level."""
            }],
            max_tokens=10
        )
        
        level = response.choices[0].message.content.strip().lower()
        return TaskComplexity(level) if level in [c.value for c in TaskComplexity] else TaskComplexity.MODERATE
    
    def select_model(
        self,
        prompt: str,
        priority: str = "balanced"  # cost, speed, quality
    ) -> str:
        """Select best model for the task"""
        
        complexity = self.classify_task(prompt)
        
        routing = {
            TaskComplexity.SIMPLE: {
                "cost": "gpt-4o-mini",
                "speed": "gpt-4o-mini",
                "quality": "gpt-4o-mini",
                "balanced": "gpt-4o-mini"
            },
            TaskComplexity.MODERATE: {
                "cost": "gpt-4o-mini",
                "speed": "claude-3-5-haiku",
                "quality": "gpt-4o",
                "balanced": "gpt-4o-mini"
            },
            TaskComplexity.COMPLEX: {
                "cost": "gpt-4o-mini",
                "speed": "gpt-4o",
                "quality": "gpt-4o",
                "balanced": "gpt-4o"
            },
            TaskComplexity.CRITICAL: {
                "cost": "gpt-4o",
                "speed": "gpt-4o",
                "quality": "gpt-4o",
                "balanced": "gpt-4o"
            }
        }
        
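        # Unknown priority values fall back to the "balanced" column
        by_priority = routing[complexity]
        return by_priority.get(priority, by_priority["balanced"])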

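# Usage
router = ModelRouter()

async def smart_complete(prompt: str, priority: str = "balanced"):
    model = router.select_model(prompt, priority)
    logger.info(f"Routing to {model}")
    # NOTE: RobustLLMClient only talks to OpenAI. Routing to a claude-* model
    # needs an Anthropic client behind the same interface.
    return await llm_client.complete(
        [{"role": "user", "content": prompt}], model=model
    )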
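
To see what routing is worth, a rough per-request cost estimate helps. The sketch below reuses the per-1K-token prices from the MODELS table above, which may be out of date; `Pricing`, `PRICES`, and `estimate_cost` are illustrative names and not part of the router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1k: float   # USD per 1,000 input tokens
    output_per_1k: float  # USD per 1,000 output tokens


# Prices copied from the MODELS table in this section; treat as assumptions.
PRICES = {
    "gpt-4o": Pricing(0.0025, 0.01),
    "gpt-4o-mini": Pricing(0.00015, 0.0006),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES[model]
    return (
        input_tokens / 1000 * price.input_per_1k
        + output_tokens / 1000 * price.output_per_1k
    )
```

For a request with 1,000 input tokens and 500 output tokens, this comes to roughly $0.0075 on gpt-4o versus $0.00045 on gpt-4o-mini, about a 17x difference per request.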

Deployment Options

The deployment landscape for AI apps has three tiers, each suited to a different stage. Serverless (Vercel, Lambda) is perfect for launching: zero infrastructure management, pay-per-request, scales to zero when idle. Docker/Compose gives you more control when you need persistent connections (Redis, Postgres) and predictable latency. Kubernetes is for when you have multiple services, need fine-grained autoscaling, or are running at a scale where the engineering cost of K8s is justified by the operational benefits. Most teams should start serverless and graduate to Docker when they hit their first real scaling pain.
# Vercel / AWS Lambda with FastAPI
from mangum import Mangum
from fastapi import FastAPI

app = FastAPI()

@app.post("/chat")
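async def chat(request: dict):
    # llm_client is a RobustLLMClient; complete() expects chat messages
    return await llm_client.complete(
        [{"role": "user", "content": request["prompt"]}]
    )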

# Handler for AWS Lambda
handler = Mangum(app)

Docker Deployment

Docker gives you reproducibility: if it works in the container, it works in production. The Dockerfile below is intentionally minimal — no multi-stage build, no poetry, no complexity. Get it working first, optimize later.
# Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
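# docker-compose.yml
# (the top-level "version" key is obsolete in current Compose and is omitted)
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    # A fixed host port can only be bound by one container. To run several
    # replicas, drop the ports mapping and put a reverse proxy in front.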
  
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  redis_data:

Kubernetes

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-api
  template:
    metadata:
      labels:
        app: ai-api
    spec:
      containers:
      - name: ai-api
        image: your-registry/ai-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: api-secrets
              key: openai-key
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: ai-api
spec:
  selector:
    app: ai-api
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Deployment Option Comparison

| Factor | Serverless (Vercel/Lambda) | Docker Compose (Railway/Render) | Kubernetes |
|---|---|---|---|
| Time to first deploy | Minutes | Hours | Days |
| Monthly cost (low traffic) | $0-20 | $10-50 | $50-200+ (cluster overhead) |
| Monthly cost (high traffic) | Unpredictable: scales with requests | Predictable: fixed instance cost | Predictable with autoscaling bands |
| Cold start latency | 500ms-5s | None (always running) | None (always running) |
| Persistent connections (Redis, Postgres) | Difficult: connections drop between invocations | Native: services stay connected | Native |
| Scaling ceiling | Provider-dependent (Lambda: 1000 concurrent by default) | Manual replica scaling | Autoscale based on CPU/memory/custom metrics |
| Operational complexity | Near zero | Low-medium | High: requires K8s expertise |
| Best for | MVP launch, low/spiky traffic | Steady traffic, need persistent services | Multi-service architectures, enterprise requirements |

Decision framework:
  1. Launching a new product with uncertain traffic? Start serverless. You pay nothing when idle and scale automatically.
  2. Hit cold-start issues or need persistent Redis/Postgres connections? Move to Docker Compose on Railway or Render. This is the sweet spot for most AI products with steady traffic.
  3. Running multiple services (API, worker, scheduler) with autoscaling requirements? Graduate to Kubernetes. But only when the operational overhead is justified by real scaling needs, not hypothetical ones.

Production Edge Cases

Streaming responses and load balancers. Server-Sent Events (SSE) require the load balancer to keep the connection open for the entire response duration (potentially 30+ seconds). Many default load balancer configs have a 30-second idle timeout. If your LLM response takes 35 seconds, the connection drops mid-stream and the user sees a truncated answer. Configure your load balancer’s idle timeout to at least 120 seconds for SSE endpoints.

Secrets in container images. Docker images are layered. If you COPY .env . in an early layer and delete it in a later layer, the secret still exists in the image history. Never copy .env files into images. Use runtime environment variables or secrets managers (AWS Secrets Manager, Vault).

Health checks that lie. A /health endpoint that returns 200 without checking database and Redis connectivity tells your orchestrator “everything is fine” when the data layer is down. Your health check should verify the critical dependencies your app needs to serve requests. But keep it fast: a health check that takes 5 seconds defeats the purpose.

Graceful shutdown for in-flight requests. When Kubernetes or Railway deploys a new version, it sends SIGTERM to the old instance. If your app ignores SIGTERM, in-flight LLM requests get killed mid-response. Handle the signal: stop accepting new requests, wait for current requests to complete (up to a timeout), then exit.

Rate limit synchronization across replicas. If you run 3 API replicas with in-memory rate limiting, each replica tracks limits independently. A user gets 3x the intended rate. Use Redis-backed rate limiting (as shown in the Token Bucket section) so all replicas share a single counter.
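
The health-check advice above (verify dependencies, but stay fast) can be sketched with the standard library alone. The helper below runs each dependency check concurrently under a per-check timeout; `run_health_checks` and the shape of its result are assumptions made for illustration, and the real checks (a Redis ping, a trivial database query) are left to the caller.

```python
import asyncio
from typing import Awaitable, Callable

# A check is any zero-argument coroutine function that raises on failure.
Check = Callable[[], Awaitable[None]]


async def run_health_checks(
    checks: dict[str, Check], timeout: float = 1.0
) -> tuple[bool, dict[str, str]]:
    """Run all checks concurrently; each one gets its own time budget."""

    async def run_one(name: str, check: Check) -> tuple[str, str]:
        try:
            await asyncio.wait_for(check(), timeout=timeout)
            return name, "ok"
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:
            return name, f"error: {type(exc).__name__}"

    pairs = await asyncio.gather(*(run_one(n, c) for n, c in checks.items()))
    results = dict(pairs)
    healthy = all(status == "ok" for status in results.values())
    return healthy, results
```

A /health handler would typically return 200 when `healthy` is true and 503 otherwise, with the per-dependency detail in the body. Because the checks run concurrently, the endpoint takes about as long as the slowest check, bounded by `timeout`.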

Key Takeaways

Retry Everything

LLM APIs fail. Retry transient errors, and add fallbacks and circuit breakers for the rest.

Cache Aggressively

Semantic caching can reduce costs by 50%+ for common queries.

Route Smart

Use cheap models for simple tasks and expensive ones for complex tasks.

Monitor Everything

Track latency, costs, errors. You can’t optimize what you don’t measure.

What’s Next

Capstone Project

Build a complete production AI application from scratch

Interview Deep-Dive

Strong Answer:
  • Low CPU/memory with high latency is the signature of I/O-bound bottlenecks, and AI applications are almost entirely I/O-bound. Your replicas are not compute-limited — they are waiting on external calls. The most likely culprits: (1) OpenAI API latency increases during peak hours (their infrastructure gets loaded too), (2) database connection pool exhaustion (all connections are held by requests waiting on LLM responses), or (3) rate limiting from the LLM provider causing retry backoffs.
  • Diagnosis: add tracing to every external call. Measure time spent waiting for OpenAI, time waiting for database connections, time in your application code. In my experience, 80-90% of the latency in AI applications is OpenAI response time. If OpenAI’s p99 goes from 1.5s to 12s during peak hours, there is nothing your infrastructure can do to fix that — it is upstream.
  • For OpenAI-side latency: implement request-level timeouts (30 seconds), add a circuit breaker that fails fast after 5 consecutive timeouts, and have a fallback model (Claude, a local model via Ollama) that kicks in when the primary provider is slow. You can also pre-warm common responses with a cache so peak-hour traffic hits cache more often.
  • For connection pool exhaustion: the classic mistake is holding a database connection while waiting for the LLM. A request acquires a DB connection to load context, then calls OpenAI for several seconds while holding that connection. With 20 pool connections and 3-second average LLM calls, a replica tops out at about 7 requests per second (20 / 3 ≈ 6.7), and anything beyond 20 concurrent requests queues for a connection. The fix: release the connection before calling the LLM, re-acquire after.
Follow-up: You add a fallback to Claude when OpenAI is slow, but now users complain that responses are inconsistent: sometimes they get GPT-4o style answers, sometimes Claude style. How do you handle this?
Response consistency across providers is a real production problem. The approach is to standardize output through your system prompt and post-processing. First, make system prompts provider-agnostic: do not rely on OpenAI-specific behaviors. Second, add a response normalization layer that enforces consistent formatting (same JSON structure, same tone, same length constraints) regardless of which model generated the response. Third, if strict consistency matters (e.g., a legal product), use the fallback only for clearly defined degradation scenarios (return a cached response or a “system is busy, try again” message) rather than switching models silently. Most users cannot tell the difference between GPT-4o and Claude 3.5 Sonnet if the prompt and formatting are consistent, but they absolutely notice if one response is a paragraph and the next is three sentences.
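
The pool-exhaustion point in the answer above can be illustrated without a real database. In this sketch an `asyncio.Semaphore` stands in for the connection pool and a sleep stands in for the LLM call; every name here is hypothetical. It measures how many LLM calls are in flight at once under each pattern.

```python
import asyncio


class Stats:
    def __init__(self) -> None:
        self.in_llm = 0
        self.peak_in_llm = 0


async def fake_llm_call(stats: Stats, seconds: float = 0.05) -> None:
    stats.in_llm += 1
    stats.peak_in_llm = max(stats.peak_in_llm, stats.in_llm)
    await asyncio.sleep(seconds)  # stand-in for the upstream LLM latency
    stats.in_llm -= 1


async def handle_holding(pool: asyncio.Semaphore, stats: Stats) -> None:
    async with pool:                # connection held for the whole request
        await asyncio.sleep(0)      # load context from the database
        await fake_llm_call(stats)  # ...and keep the connection while waiting


async def handle_releasing(pool: asyncio.Semaphore, stats: Stats) -> None:
    async with pool:
        await asyncio.sleep(0)      # load context, then give the connection back
    await fake_llm_call(stats)      # no connection held during the slow part
    async with pool:
        await asyncio.sleep(0)      # re-acquire briefly to save the result


async def peak_llm_concurrency(handler, pool_size: int, requests: int) -> int:
    pool = asyncio.Semaphore(pool_size)
    stats = Stats()
    await asyncio.gather(*(handler(pool, stats) for _ in range(requests)))
    return stats.peak_in_llm
```

With the holding pattern, the number of concurrent LLM calls can never exceed the pool size, so the pool becomes the throughput ceiling. With the releasing pattern, the pool only limits the short database steps, and LLM concurrency is free to rise above it.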
Strong Answer:
  • The circuit breaker prevents cascading failures when an upstream service is down. It has three states: CLOSED (normal — requests flow through), OPEN (service is broken — reject immediately without waiting), and HALF_OPEN (tentatively allow one probe request to test recovery).
  • In LLM applications, the circuit breaker fires when the OpenAI API fails repeatedly — say, 5 consecutive 500 errors or timeouts within a minute. Instead of every new request waiting 30 seconds for a timeout, the circuit breaker immediately returns an error in milliseconds. This is “fail fast” — it is better to give the user a quick “service temporarily unavailable” than to make them wait 30 seconds for the same error.
  • Without a circuit breaker, here is what happens: OpenAI goes down, every request in your server blocks for 30 seconds on timeout, your thread/connection pool fills up, new requests queue behind the stuck ones, your server becomes completely unresponsive — not just for LLM calls, but for everything including health checks. The load balancer sees failing health checks and restarts your pods, but the new pods immediately fill up again because OpenAI is still down. You now have a cascading failure: one upstream issue has taken down your entire application.
  • The recovery_timeout parameter is critical. After the circuit opens, you wait (say, 60 seconds) before testing with a single request. If that succeeds, the circuit closes and traffic resumes. If it fails, the circuit stays open for another 60 seconds. This prevents a “thundering herd” where all queued requests hit the recovering service simultaneously and overwhelm it again.
Follow-up: Your circuit breaker for OpenAI trips during a 10-minute outage. During those 10 minutes, how do you still serve users?
This is where your fallback strategy becomes real, not theoretical. Tier 1: serve cached responses for queries that match the semantic cache (covers 20-30% of traffic). Tier 2: route to a secondary provider (Anthropic, Together AI) if you have one configured. Tier 3: for queries that cannot be answered by cache or fallback, return a degraded response, such as a pre-written FAQ answer or an honest “I am experiencing issues and will follow up shortly.” Track which users got degraded responses and optionally re-process their queries when the circuit closes. The worst option is to silently drop requests or show a generic error. The best teams I have worked with maintain a “degradation playbook” that specifies exactly what each endpoint returns when the circuit is open.
Strong Answer:
  • At 100 requests per minute, start with the simplest thing that works: a single Docker container running FastAPI with uvicorn behind a managed load balancer (Railway, Render, or AWS ECS). No Kubernetes, no microservices, no over-engineering. Add Redis for caching and rate limiting, Postgres for data. Total infrastructure: 3 services. Monthly cost: $50-200.
  • The critical early investments are not infrastructure — they are instrumentation. From day one, track: latency per endpoint (p50, p95, p99), cost per request by model, error rates by type (timeout, rate limit, invalid response), and cache hit rates. These metrics tell you what to optimize when traffic increases. Without them, you are guessing.
  • At 1,000 requests per minute (the first scaling pain point): add horizontal scaling with 3-5 replicas, move rate limiting to Redis (shared across replicas), implement semantic caching, and add model routing to reduce cost. This handles 10x growth with the same architecture.
  • At 10,000 requests per minute: this is where architecture changes become necessary. Separate the API layer (fast, handles HTTP) from the worker layer (slow, handles LLM calls) using a task queue (Celery, BullMQ). The API accepts requests and enqueues them, workers process at their own pace. This decouples request acceptance from LLM latency. You might also need to shard your vector database, add read replicas for Postgres, and implement request deduplication at the queue level.
  • The principle: solve today’s problems today, and build the observability to know when tomorrow’s problems arrive. Every premature architecture decision is technical debt you pay interest on until you actually need it.
Follow-up: Your task queue approach means users no longer get synchronous responses. How do you handle the UX transition from real-time chat to async processing?
You do not need to make everything async. The hybrid approach is: for simple queries (cache hits, cheap model responses), return synchronously; they complete in 200-500ms. For complex queries that will take 5+ seconds, return immediately with a request ID and stream updates via WebSocket or SSE. The frontend shows a loading state with progress indicators (“retrieving context… generating response…”). This is actually better UX than a synchronous endpoint that hangs for 15 seconds with no feedback. The task queue handles the heavy processing, pushes results to Redis, and the SSE endpoint polls Redis for updates. Users perceive this as faster because they get feedback immediately even though the total time is the same.
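
The API/worker split described in this answer can be sketched in a single process. Here an `asyncio.Queue` stands in for the task queue (Celery, BullMQ) and a dict stands in for the Redis result store; `JobBoard` and its methods are hypothetical names, and a real deployment would run workers as separate processes.

```python
import asyncio
import uuid
from typing import Awaitable, Callable


class JobBoard:
    """In-process stand-in for a task queue plus a result store."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.results: dict[str, dict] = {}

    async def submit(self, prompt: str) -> str:
        """API side: accept the request and return an ID immediately."""
        job_id = uuid.uuid4().hex
        self.results[job_id] = {"status": "queued", "output": None}
        await self.queue.put((job_id, prompt))
        return job_id

    async def worker(self, llm_call: Callable[[str], Awaitable[str]]) -> None:
        """Worker side: drain the queue at whatever pace the LLM allows."""
        while True:
            job_id, prompt = await self.queue.get()
            self.results[job_id]["status"] = "running"
            try:
                output = await llm_call(prompt)
                self.results[job_id] = {"status": "done", "output": output}
            except Exception as exc:
                self.results[job_id] = {"status": "failed", "output": str(exc)}
            finally:
                self.queue.task_done()
```

The client keeps the returned ID and reads `results[job_id]` (or subscribes over SSE) until the status is `done` or `failed`. The number of worker tasks, rather than the number of incoming requests, then bounds how many LLM calls run at once.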