LLM orchestration is the infrastructure layer that sits between your application code and the zoo of LLM providers. Think of it like an ORM for LLMs — just as SQLAlchemy lets you swap between PostgreSQL and MySQL without rewriting queries, an orchestration layer like LiteLLM lets you swap between OpenAI, Anthropic, and Groq without touching business logic. Without orchestration, every provider switch means updating API calls, response parsing, error handling, and retry logic throughout your codebase. With it, you change a single model string.Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
LiteLLM Overview
LiteLLM provides a unified interface for 100+ LLM providers. Under the hood, it translates the OpenAI-compatible request format into each provider’s native format (Anthropic’s separatesystem parameter, Bedrock’s AWS auth, etc.) and normalizes the responses back:
Async Support
For production workloads, async calls are essential. Without async, each LLM call blocks your server for 1-3 seconds — meaning a single-threaded server can only handle one request at a time.Provider Configuration
Environment Variables
Programmatic Configuration
Fallback Configuration
Fallbacks are your safety net. When the primary model returns an error (rate limit, timeout, outage), the system automatically tries the next model in the chain. This is the single most impactful reliability pattern for production LLM applications.Load Balancing
When you have multiple deployments of the same model (e.g., two Azure OpenAI regions plus direct OpenAI), load balancing distributes requests to avoid hitting rate limits on any single endpoint. This is especially important for Azure, where each deployment has independent TPM/RPM quotas.Rate Limiting
Rate limiting protects you from two things: exceeding provider quotas (which causes errors) and runaway costs from bugs or attacks. A single misconfigured loop can generate thousands of API calls in seconds — rate limiting is your circuit breaker.Caching Integration
Custom Provider Wrapper
When LiteLLM’s built-in abstraction is not enough — for example, you need cost tracking, custom logging, or semantic model aliases (“fast”, “smart”, “cheap”) — wrap it in a thin client that adds your business logic.Streaming with Router
Observability Integration
You cannot optimize what you cannot measure. LLM observability means tracking every call’s latency, token usage, cost, model, and success/failure status. Without this, you are flying blind — you will not know which model is slow, which prompts are expensive, or when your error rate spikes.Common Pitfalls
Over-abstracting too early
Over-abstracting too early
Not pinning model versions
Not pinning model versions
gpt-4o instead of a dated version like gpt-4o-2024-08-06 means your application behavior can change without any code deployment when the provider updates the model alias. In production, always pin model versions and test before upgrading.Caching responses that should not be cached
Caching responses that should not be cached
Ignoring cold-start latency in routing decisions
Ignoring cold-start latency in routing decisions
Model Comparison
| Provider | Model | Speed | Quality | Cost |
|---|---|---|---|---|
| Groq | llama-3.3-70b | Fastest | Good | Low |
| OpenAI | gpt-4o-mini | Fast | Good | Low |
| OpenAI | gpt-4o | Medium | Excellent | Medium |
| Anthropic | claude-3-5-sonnet | Medium | Excellent | Medium |
| gemini-1.5-pro | Medium | Excellent | Medium |
What is Next
Semantic Search
Interview Deep-Dive
You are designing an LLM orchestration layer for a product that needs to call multiple providers. What are the key concerns and how do you architect it?
You are designing an LLM orchestration layer for a product that needs to call multiple providers. What are the key concerns and how do you architect it?
- The first concern is provider abstraction. You need a unified interface so that switching from OpenAI to Anthropic to a self-hosted model does not require changes in your application code. This means normalizing the request format (messages array, temperature, max_tokens), the response format (content, usage, finish_reason), and the error taxonomy (rate limit, auth failure, context overflow). LiteLLM does this well out of the box, but in production I have found you often need a thin wrapper on top for your own cost tracking and routing logic.
- The second concern is failover and resilience. LLM APIs go down more often than people expect — OpenAI has had multiple multi-hour outages. You need automatic failover with a priority chain: try GPT-4o first, fall back to Claude 3.5 Sonnet, then to a self-hosted Llama model as the last resort. The failover logic needs to distinguish between retryable errors (429 rate limit, 500 server error) and non-retryable errors (400 bad request, 401 auth failure). Retrying on a non-retryable error wastes time and money.
- The third concern is cost-aware routing. Not every query needs your most expensive model. I build a routing layer that classifies incoming requests by complexity — simple extraction goes to gpt-4o-mini at 2.50/M. This classification itself can be a lightweight rule-based system or a small model. In one system I worked on, this routing saved about 60% on the monthly API bill without measurable quality degradation.
- The fourth concern is observability. Every request through the orchestration layer must be logged with: provider, model, latency, token counts, cost, and whether it was a primary call or a fallback. Without this, you cannot debug failures, optimize costs, or detect quality regressions after a model update.
system parameter, not as a message with role “system.” OpenAI supports response_format for JSON mode, Anthropic uses tool-use with a specific schema. Function calling schemas differ between providers. My approach is a provider-specific adapter layer beneath the unified interface. The adapter translates the canonical request format into provider-specific format on the way in, and normalizes the response on the way out. The critical thing is to have integration tests per provider that validate the adapter behavior, because providers change their APIs without warning. I have been burned by Anthropic changing their message format validation rules in a minor version update that broke our adapter silently.Explain the trade-offs between different load balancing strategies for LLM API calls: round-robin, least-busy, and latency-based.
Explain the trade-offs between different load balancing strategies for LLM API calls: round-robin, least-busy, and latency-based.
- Round-robin is the simplest: you cycle through your available endpoints in order. Its strength is simplicity and even distribution. Its weakness is that it is completely blind to the actual state of each endpoint. If one Azure deployment is overloaded and responding in 5 seconds while another is idle at 200ms, round-robin still sends them equal traffic. In practice, round-robin works fine when all your deployments have identical capacity and are healthy, which is often the case in steady state.
- Least-busy routing sends each request to the deployment with the fewest in-flight requests. This adapts naturally to varying processing speeds — a slow deployment accumulates in-flight requests and naturally receives less new traffic. The trade-off is that you need to track in-flight counts accurately across your routing layer, which in a distributed system means shared state or approximate counting. The failure mode is thundering herd: if a deployment recovers from being slow, it suddenly has zero in-flight requests and gets hammered with all new traffic simultaneously.
- Latency-based routing tracks the rolling average (or P95) latency of each deployment and prefers the fastest one. This is the most sophisticated and usually gives the best user experience because it directly optimizes for what users care about — response time. The trade-off is cold-start bias: a deployment that has not been used recently has no latency data, so you need an exploration strategy. I typically use epsilon-greedy: 90% of traffic goes to the lowest-latency endpoint, 10% is randomly distributed to keep latency estimates fresh for all endpoints.
- In production with multiple Azure OpenAI deployments, I use a combination: latency-based routing as the primary strategy, with a fallback to round-robin when latency data is stale (no requests in the last 60 seconds). I also layer on rate-limit-aware routing — if a deployment returns a 429 with a
Retry-Afterheader, I remove it from the pool for that duration. This combination handles the common failure modes: regional outages, temporary rate limits, and gradual performance degradation.
When would you use semantic caching for LLM responses versus exact-match caching, and what are the risks of semantic caching?
When would you use semantic caching for LLM responses versus exact-match caching, and what are the risks of semantic caching?
- Exact-match caching is straightforward: hash the request (model, messages, temperature, etc.) and use that as a cache key. If the exact same request comes in again, serve the cached response. This is safe and deterministic — you will never serve a wrong cached response. The limitation is that “What is machine learning?” and “Explain machine learning to me” are completely different cache keys despite being semantically identical. Hit rates for exact-match caching are typically 5-15% for conversational applications and 30-50% for applications with templated queries.
- Semantic caching embeds the query, compares it against cached query embeddings using cosine similarity, and returns the cached response if similarity exceeds a threshold. This dramatically increases hit rates — 30-60% in typical applications — because it catches paraphrases and minor variations. The cost savings can be substantial: if your average request costs 0.008 per request.
- The risk of semantic caching is serving stale or wrong responses. A similarity threshold of 0.95 feels safe, but embedding models can assign 0.95+ similarity to queries that are semantically related but require different answers. “What is the refund policy for electronics?” and “What is the refund policy for software?” might score 0.96 similarity but have completely different correct answers. You have essentially introduced a new class of bug: the cache collision.
- The second risk is temporal staleness. If your underlying data changes — a product price updates, a policy changes — the cached response is now wrong and will be served confidently. You need cache invalidation tied to your data freshness, not just a TTL.
- My recommendation: use exact-match caching as the default for all applications. Layer semantic caching on top only for specific query patterns where you have validated that the similarity threshold does not produce false matches in your domain. Always log cache hits so you can audit them for correctness.