The Production Gap
Building an AI demo takes hours. Making it production-ready takes weeks. The gap is not about the model; it is about everything around the model. Your demo worked because you tested it 10 times with friendly inputs on a fast laptop. Production means 10,000 concurrent users, each sending unexpected inputs, while the OpenAI API occasionally returns 429 (rate limited) or 500 (server error) and your Redis cache decides to evict entries at the worst possible moment.
This module covers:
- Reliability (error handling, retries, fallbacks)
- Performance (caching, batching, async)
- Cost (model routing, token optimization)
- Scaling (rate limits, queues, load balancing)
Production Architecture
Error Handling & Retries
LLM APIs fail in ways that traditional APIs don’t. A database query either works or throws an error in milliseconds. An LLM call can succeed in 2 seconds, time out at 30 seconds, return a rate-limit error, or, most annoyingly, succeed but return garbage because the model hallucinated through your format requirements. The patterns below handle each failure mode.
Robust API Wrapper
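A minimal sketch of the retry part of such a wrapper, assuming transient failures (429, 5xx, timeouts) are surfaced as a single exception type. `TransientLLMError`, `call_with_retries`, and the default delays are hypothetical names and values chosen for illustration; they do not come from any provider SDK.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical stand-in for a 429, 5xx, or timeout from the provider."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Backoff doubles each attempt, capped at max_delay, with full jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Demo: a call that fails twice, then succeeds (sleep disabled to keep it fast)
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientLLMError("rate limited")
    return "ok"


result = call_with_retries(flaky_call, sleep=lambda seconds: None)
```

Only errors that are worth retrying should be mapped to the transient type; a malformed request would fail the same way on every attempt.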
Circuit Breaker Pattern
The circuit breaker is borrowed from electrical engineering: when too much current flows through a circuit, the breaker trips to prevent a fire. In software, when an API fails repeatedly, the circuit breaker “trips” and immediately rejects new requests instead of waiting 30 seconds for each one to time out. This prevents cascading failures: without a circuit breaker, a downed OpenAI API causes your entire server to hang because every request thread is stuck waiting. The three states map to a simple lifecycle: CLOSED (everything is fine, let requests through), OPEN (things are broken, fail fast), and HALF_OPEN (cautiously let one request through to see if the service recovered).
Caching Strategies
Semantic Cache
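A semantic cache returns a stored response when a new prompt is close enough in meaning to an earlier one, rather than requiring an exact string match. The sketch below illustrates the lookup logic only. It assumes a toy bag-of-words embedding (`toy_embed`) so that it runs without external services; a real system would call an embedding model and usually a vector store. The class name and the 0.8 threshold are illustrative assumptions.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8, embed=toy_embed):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.embed = embed
        self.entries = []           # list of (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return the most similar cached response, or None below the threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("tell me what is the capital of france")
miss = cache.get("how do I bake bread")
```

The threshold is the main tuning knob: set it too low and users receive answers to a different question, set it too high and the cache rarely hits.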
Response Streaming with Cache
Rate Limiting
Rate limiting protects both your wallet and your upstream API quotas. Without it, a single power user (or an accidental infinite loop in a client) can burn through your entire OpenAI budget in minutes. The token bucket algorithm below is the industry standard: imagine a bucket that fills with tokens at a steady rate. Each request consumes tokens. When the bucket is empty, requests are rejected until it refills. The elegance is that it naturally allows bursts (the bucket can be full) while enforcing an average rate.
Token Bucket Rate Limiter
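A minimal in-memory sketch of the token bucket described above. It keeps state in a single process, so it is only an illustration of the algorithm; sharing limits across replicas would require moving the counters into a shared store such as Redis. The class name, parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (the average rate)
        self.capacity = capacity  # maximum tokens held (the burst size)
        self.clock = clock
        self.tokens = capacity    # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        # Refill for the time elapsed since the last call, never beyond capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Demo with a fake clock so the behaviour is deterministic
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(6)]         # five allowed, sixth rejected
fake_now[0] += 1.0                                 # one second passes: 2 tokens refill
after_refill = [bucket.allow() for _ in range(3)]  # two allowed, third rejected
```

For LLM traffic the `cost` argument can represent estimated tokens in the prompt instead of a flat one unit per request.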
Model Router
Model routing at the infrastructure level is different from the application-level routing covered in Cost Optimization. Here, you are also considering latency, context window size, and provider availability, not just cost. The router below acts as a smart proxy that can switch between OpenAI, Anthropic, and local models based on the task, your budget, and which APIs are currently healthy.
Cost-Optimized Routing
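One way to sketch the routing decision: filter the candidates by health, context window, and capability, then pick the cheapest one that remains. The model names, prices, and context sizes below are placeholders invented for the example, not real provider data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical identifier
    cost_per_1k_tokens: float  # illustrative price, not a real quote
    context_window: int        # maximum prompt size in tokens
    handles_complex: bool      # whether we trust it with hard tasks


def route(options, healthy, prompt_tokens, complex_task):
    """Return the cheapest healthy model that fits the prompt and the task."""
    candidates = [
        m for m in options
        if m.name in healthy
        and m.context_window >= prompt_tokens
        and (m.handles_complex or not complex_task)
    ]
    if not candidates:
        raise RuntimeError("no healthy model can serve this request")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


OPTIONS = [
    ModelOption("local-small", 0.0, 8_000, False),
    ModelOption("hosted-small", 0.5, 16_000, False),
    ModelOption("hosted-large", 5.0, 128_000, True),
]
ALL_HEALTHY = {"local-small", "hosted-small", "hosted-large"}

simple = route(OPTIONS, ALL_HEALTHY, prompt_tokens=2_000, complex_task=False)
long_prompt = route(OPTIONS, ALL_HEALTHY, prompt_tokens=12_000, complex_task=False)
hard = route(OPTIONS, ALL_HEALTHY, prompt_tokens=2_000, complex_task=True)
```

The `healthy` set is where provider availability enters: a circuit breaker per provider can add and remove names from it.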
Deployment Options
The deployment landscape for AI apps has three tiers, each suited to a different stage. Serverless (Vercel, Lambda) is perfect for launching: zero infrastructure management, pay-per-request, scales to zero when idle. Docker/Compose gives you more control when you need persistent connections (Redis, Postgres) and predictable latency. Kubernetes is for when you have multiple services, need fine-grained autoscaling, or are running at a scale where the engineering cost of K8s is justified by the operational benefits. Most teams should start serverless and graduate to Docker when they hit their first real scaling pain.
Serverless (Recommended Start)
Docker Deployment
Docker gives you reproducibility: if it works in the container, it works in production. The Dockerfile below is intentionally minimal: no multi-stage build, no poetry, no complexity. Get it working first, optimize later.
Kubernetes
Deployment Option Comparison
| Factor | Serverless (Vercel/Lambda) | Docker Compose (Railway/Render) | Kubernetes |
|---|---|---|---|
| Time to first deploy | Minutes | Hours | Days |
| Monthly cost (low traffic) | $0-20 | $10-50 | $50-200+ (cluster overhead) |
| Monthly cost (high traffic) | Unpredictable — scales with requests | Predictable — fixed instance cost | Predictable with autoscaling bands |
| Cold start latency | 500ms-5s | None (always running) | None (always running) |
| Persistent connections (Redis, Postgres) | Difficult — connections drop between invocations | Native — services stay connected | Native |
| Scaling ceiling | Provider-dependent (Lambda: 1000 concurrent by default) | Manual replica scaling | Autoscale based on CPU/memory/custom metrics |
| Operational complexity | Near zero | Low-medium | High — requires K8s expertise |
| Best for | MVP launch, low/spiky traffic | Steady traffic, need persistent services | Multi-service architectures, enterprise requirements |
- Launching a new product with uncertain traffic? Start serverless. You pay nothing when idle and scale automatically.
- Hit cold-start issues or need persistent Redis/Postgres connections? Move to Docker Compose on Railway or Render. This is the sweet spot for most AI products with steady traffic.
- Running multiple services (API, worker, scheduler) with autoscaling requirements? Graduate to Kubernetes. But only when the operational overhead is justified by real scaling needs, not hypothetical ones.
Production Edge Cases
Streaming responses and load balancers. Server-Sent Events (SSE) require the load balancer to keep the connection open for the entire response duration (potentially 30+ seconds). Many default load balancer configs have a 30-second idle timeout. If your LLM response takes 35 seconds, the connection drops mid-stream and the user sees a truncated answer. Configure your load balancer’s idle timeout to at least 120 seconds for SSE endpoints.
Secrets in container images. Docker images are layered. If you COPY .env . in an early layer and delete it in a later layer, the secret still exists in the image history. Never copy .env files into images. Use runtime environment variables or secrets managers (AWS Secrets Manager, Vault).
Health checks that lie. A /health endpoint that returns 200 without checking database and Redis connectivity tells your orchestrator “everything is fine” when the data layer is down. Your health check should verify the critical dependencies your app needs to serve requests. But keep it fast — a health check that takes 5 seconds defeats the purpose.
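A sketch of a health check that probes its dependencies while staying fast: every probe runs concurrently and is bounded by a short timeout. The probe functions here are hypothetical stand-ins; in a real app they would run something like a trivial database query and a Redis ping.

```python
import asyncio


async def check_dependencies(checks, timeout=1.0):
    """Run all dependency probes concurrently, each bounded by `timeout` seconds."""
    async def run(name, probe):
        try:
            await asyncio.wait_for(probe(), timeout)
            return name, "ok"
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:
            return name, f"error: {exc}"

    pairs = await asyncio.gather(*(run(n, p) for n, p in checks.items()))
    results = dict(pairs)
    healthy = all(status == "ok" for status in results.values())
    # 503 tells the orchestrator to stop routing traffic to this instance
    return (200 if healthy else 503), results


# Hypothetical probes standing in for real connectivity checks
async def ping_database():
    await asyncio.sleep(0)


async def ping_redis():
    raise ConnectionError("connection refused")


async def ping_slow_service():
    await asyncio.sleep(5)


status, detail = asyncio.run(
    check_dependencies({"database": ping_database, "redis": ping_redis}, timeout=0.1)
)
```

Because the probes run concurrently, the worst-case duration of the check is roughly one timeout, not the sum of all of them.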
Graceful shutdown for in-flight requests. When Kubernetes or Railway deploys a new version, it sends SIGTERM to the old instance. If your app ignores SIGTERM, in-flight LLM requests get killed mid-response. Handle the signal: stop accepting new requests, wait for current requests to complete (up to a timeout), then exit.
Rate limit synchronization across replicas. If you run 3 API replicas with in-memory rate limiting, each replica tracks limits independently. A user gets 3x the intended rate. Use Redis-backed rate limiting (as shown in the Token Bucket section) so all replicas share a single counter.
Key Takeaways
Retry Everything
Cache Aggressively
Route Smart
Monitor Everything
Interview Deep-Dive
Your LLM-powered API is deployed with 3 replicas behind a load balancer. During peak hours, p99 latency spikes from 2 seconds to 15 seconds even though CPU and memory are at 30% utilization. What is going on?
- Low CPU/memory with high latency is the signature of I/O-bound bottlenecks, and AI applications are almost entirely I/O-bound. Your replicas are not compute-limited — they are waiting on external calls. The most likely culprits: (1) OpenAI API latency increases during peak hours (their infrastructure gets loaded too), (2) database connection pool exhaustion (all connections are held by requests waiting on LLM responses), or (3) rate limiting from the LLM provider causing retry backoffs.
- Diagnosis: add tracing to every external call. Measure time spent waiting for OpenAI, time waiting for database connections, time in your application code. In my experience, 80-90% of the latency in AI applications is OpenAI response time. If OpenAI’s p99 goes from 1.5s to 12s during peak hours, there is nothing your infrastructure can do to fix that — it is upstream.
- For OpenAI-side latency: implement request-level timeouts (30 seconds), add a circuit breaker that fails fast after 5 consecutive timeouts, and have a fallback model (Claude, a local model via Ollama) that kicks in when the primary provider is slow. You can also pre-warm common responses with a cache so peak-hour traffic hits cache more often.
- For connection pool exhaustion: the classic mistake is holding a database connection while waiting for the LLM. A request acquires a DB connection to load context, then calls OpenAI for several seconds while holding that connection. With 20 pool connections and 3-second average LLM calls, each replica can sustain only about 7 requests per second (20 / 3) before the pool starves. The fix: release the connection before calling the LLM, re-acquire after.
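The pool arithmetic can be checked with a back-of-the-envelope calculation: a pool of N connections, each held for T seconds per request, sustains roughly N / T requests per second. The 50 ms figure for database-only hold time is an assumption for illustration.

```python
def max_requests_per_second(pool_size, hold_ms):
    """Rough throughput ceiling for a pool whose connections are held hold_ms each."""
    return pool_size * 1000 / hold_ms


# Connection held for the whole 3-second LLM call
held_during_llm = max_requests_per_second(pool_size=20, hold_ms=3000)

# Connection released before the LLM call; assume ~50 ms of actual DB work
released_before_llm = max_requests_per_second(pool_size=20, hold_ms=50)
```

Under these assumptions the ceiling moves from about 7 to 400 requests per second per replica without adding a single connection.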
Explain the circuit breaker pattern in the context of LLM applications. When would it fire, and what is the risk of not having one?
- The circuit breaker prevents cascading failures when an upstream service is down. It has three states: CLOSED (normal — requests flow through), OPEN (service is broken — reject immediately without waiting), and HALF_OPEN (tentatively allow one probe request to test recovery).
- In LLM applications, the circuit breaker fires when the OpenAI API fails repeatedly — say, 5 consecutive 500 errors or timeouts within a minute. Instead of every new request waiting 30 seconds for a timeout, the circuit breaker immediately returns an error in milliseconds. This is “fail fast” — it is better to give the user a quick “service temporarily unavailable” than to make them wait 30 seconds for the same error.
- Without a circuit breaker, here is what happens: OpenAI goes down, every request in your server blocks for 30 seconds on timeout, your thread/connection pool fills up, new requests queue behind the stuck ones, your server becomes completely unresponsive — not just for LLM calls, but for everything including health checks. The load balancer sees failing health checks and restarts your pods, but the new pods immediately fill up again because OpenAI is still down. You now have a cascading failure: one upstream issue has taken down your entire application.
- The recovery_timeout parameter is critical. After the circuit opens, you wait (say, 60 seconds) before testing with a single request. If that succeeds, the circuit closes and traffic resumes. If it fails, the circuit stays open for another 60 seconds. This prevents a “thundering herd” where all queued requests hit the recovering service simultaneously and overwhelm it again.
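The lifecycle described in this answer can be sketched as a small state machine. This is a single-threaded illustration: `failure_threshold`, `CircuitOpenError`, and the injectable `clock` are names assumed for the example, and a concurrent server would need a lock so that only one probe runs in the HALF_OPEN state.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open (fail fast)."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("upstream marked unhealthy")
            self.state = "HALF_OPEN"  # timeout elapsed: allow one probe request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.state = "CLOSED"  # any success resets the breaker
        self.failures = 0
        return result


# Demo with a fake clock
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=60.0,
                         clock=lambda: now[0])


def failing_call():
    raise TimeoutError("upstream timed out")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except TimeoutError:
        pass

state_after_failures = breaker.state
```

While the breaker is open, callers get `CircuitOpenError` in microseconds, which is the point at which a fallback model or a cached answer can be served.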
You need to deploy an AI application that handles 100 requests per minute today but might scale to 10,000 per minute in 6 months. Walk me through your deployment architecture decisions.
- At 100 requests per minute, start with the simplest thing that works: a single Docker container running FastAPI with uvicorn behind a managed load balancer (Railway, Render, or AWS ECS). No Kubernetes, no microservices, no over-engineering. Add Redis for caching and rate limiting, Postgres for data. Total infrastructure: 3 services. Monthly cost: $50-200.
- The critical early investments are not infrastructure — they are instrumentation. From day one, track: latency per endpoint (p50, p95, p99), cost per request by model, error rates by type (timeout, rate limit, invalid response), and cache hit rates. These metrics tell you what to optimize when traffic increases. Without them, you are guessing.
- At 1,000 requests per minute (the first scaling pain point): add horizontal scaling with 3-5 replicas, move rate limiting to Redis (shared across replicas), implement semantic caching, and add model routing to reduce cost. This handles 10x growth with the same architecture.
- At 10,000 requests per minute: this is where architecture changes become necessary. Separate the API layer (fast, handles HTTP) from the worker layer (slow, handles LLM calls) using a task queue (Celery, BullMQ). The API accepts requests and enqueues them, workers process at their own pace. This decouples request acceptance from LLM latency. You might also need to shard your vector database, add read replicas for Postgres, and implement request deduplication at the queue level.
- The principle: solve today’s problems today, and build the observability to know when tomorrow’s problems arrive. Every premature architecture decision is technical debt you pay interest on until you actually need it.
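The API/worker split described for the 10,000 requests per minute stage can be illustrated in a single process with `asyncio.Queue`. This is only a stand-in for a real task queue such as Celery or BullMQ, which would put a broker between separate processes; `fake_llm_call` and the worker count are assumptions for the example.

```python
import asyncio


async def fake_llm_call(prompt):
    """Stand-in for a slow provider call."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def worker(queue, results):
    """Worker layer: pull jobs at its own pace and store the results."""
    while True:
        job_id, prompt = await queue.get()
        try:
            results[job_id] = await fake_llm_call(prompt)
        finally:
            queue.task_done()


async def main():
    queue, results = asyncio.Queue(maxsize=100), {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(3)]

    # API layer: accepting a request only means enqueueing it
    for job_id in range(10):
        await queue.put((job_id, f"question {job_id}"))

    await queue.join()  # wait until every enqueued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(main())
```

The bounded queue (`maxsize=100`) provides backpressure: when workers fall behind, the API layer can reject or delay new requests instead of accumulating unbounded work.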