Why Multi-Provider Strategy?
Think of LLM providers like airlines. If you book every flight on one carrier and they cancel due to weather, you are stranded. Experienced travelers keep backup reservations on a different airline. Multi-provider LLM strategies work the same way: when OpenAI has an outage or Anthropic hits rate limits, your application gracefully re-routes to a healthy provider instead of showing users an error page.
Unified LLM Interface
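A minimal sketch of what such a unified interface could look like, assuming the three methods named later on this page (complete(), stream(), health_check()). The LLMResponse fields and the EchoClient stand-in are hypothetical and exist only so the example runs without network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class LLMResponse:
    """Provider-neutral response shape (hypothetical field names)."""
    text: str
    provider: str
    model: str
    input_tokens: int = 0
    output_tokens: int = 0


class BaseLLMClient(ABC):
    """Common contract that every provider adapter implements."""

    name: str = "base"

    @abstractmethod
    def complete(self, messages: list[dict], **kwargs) -> LLMResponse:
        ...

    @abstractmethod
    def stream(self, messages: list[dict], **kwargs) -> Iterator[str]:
        ...

    @abstractmethod
    def health_check(self) -> bool:
        ...


class EchoClient(BaseLLMClient):
    """Stand-in provider for local testing; makes no network calls."""

    name = "echo"

    def complete(self, messages, **kwargs):
        last = messages[-1]["content"]
        return LLMResponse(text=f"echo: {last}", provider=self.name, model="echo-1")

    def stream(self, messages, **kwargs):
        # Yield the completed text word by word to mimic token streaming.
        for word in self.complete(messages).text.split():
            yield word

    def health_check(self):
        return True
```

Application code then depends only on BaseLLMClient, so swapping a real provider adapter in for EchoClient should not require changes elsewhere.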
Provider Implementations
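One difference an adapter has to absorb is where the system prompt goes: OpenAI-style chat requests keep it inside the messages array, while Anthropic-style requests carry it in a separate top-level field. The sketch below only builds request payloads as plain dictionaries and does not call any SDK. The function names and the default max_tokens value are assumptions for illustration.

```python
def to_openai_payload(messages: list[dict], model: str) -> dict:
    """OpenAI-style chat payload: system messages stay in the messages array."""
    return {"model": model, "messages": list(messages)}


def to_anthropic_payload(messages: list[dict], model: str, max_tokens: int = 1024) -> dict:
    """Anthropic-style payload: system text moves to a top-level `system` field."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    payload = {"model": model, "max_tokens": max_tokens, "messages": chat}
    if system_parts:
        # Several system messages are joined into one block of text.
        payload["system"] = "\n\n".join(system_parts)
    return payload
```

Keeping this translation inside each adapter means callers can always pass one message format, whichever provider ends up serving the request.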
Fallback Chain
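A minimal sketch of a fallback chain that tries providers in priority order and applies a per-error retry policy. The exception classes are hypothetical placeholders for whatever errors each adapter raises, providers are modeled as plain callables, and the zero-second backoff for timeouts is an assumption.

```python
import logging
import time

logger = logging.getLogger("fallback")


class RateLimitError(Exception): ...
class AuthError(Exception): ...
class ProviderTimeout(Exception): ...


# (max retries, backoff seconds) per error type. Unknown errors get no retry.
RETRY_POLICY = {
    RateLimitError: (1, 2.0),   # usually transient
    ProviderTimeout: (1, 0.0),  # backoff value is an assumption
    AuthError: (0, 0.0),        # permanent, retrying will not help
}


class AllProvidersFailed(Exception):
    pass


class FallbackChain:
    def __init__(self, providers, sleep=time.sleep):
        # providers: list of (name, callable) pairs in priority order
        self.providers = providers
        self.sleep = sleep  # injectable so tests do not have to wait

    def _call_with_retries(self, name, call, prompt):
        attempts = 0
        while True:
            try:
                return call(prompt)
            except Exception as exc:
                retries, backoff = RETRY_POLICY.get(type(exc), (0, 0.0))
                logger.warning("provider %s failed: %r", name, exc)
                if attempts >= retries:
                    raise
                attempts += 1
                self.sleep(backoff)

    def complete(self, prompt):
        errors = {}
        for name, call in self.providers:
            try:
                return name, self._call_with_retries(name, call, prompt)
            except Exception as exc:
                errors[name] = exc  # remember why this provider was skipped
        raise AllProvidersFailed(errors)
```

Returning the provider name alongside the result makes it easy to count how often requests are served by a fallback rather than the primary.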
Intelligent Model Router
Route requests to optimal models based on task. This is analogous to a hospital triage system: a broken finger goes to the ER nurse, a cardiac arrest goes to the trauma surgeon. Cheap tasks get cheap models; hard reasoning tasks get expensive frontier models. The router makes the call so every individual request handler does not have to.
Load Balancing
Load balancing distributes requests across providers the same way a load balancer distributes HTTP traffic across web servers. The goal is to avoid hammering a single provider to the point where you hit rate limits or experience degraded latency while other providers sit idle.
Cost-Based Routing
Think of this like a household budget. Early in the month you eat at restaurants (premium models); as the budget runs low you switch to home cooking (cheap models). The router tracks daily spend and automatically downgrades model quality to stay within budget. Your application keeps working, just with a thriftier model behind the scenes.
Common Pitfalls
Treating all providers as interchangeable
No observability on fallback frequency
Falling back on content policy rejections
Ignoring response quality differences during fallback
Key Takeaways
Unified Interface
Automatic Fallback
Smart Routing
Cost Control
What’s Next
Evaluation & Testing
Interview Deep-Dive
Your LLM application relies on a single provider and it goes down during peak hours. Walk me through how you design a multi-provider fallback system.
- The first principle is that a real multi-provider strategy requires providers with genuinely independent infrastructure. OpenAI and Azure OpenAI serve the same model family and are closely tied operationally, so I do not treat them as independent: a model-level or platform-level incident can affect both at once. A more robust fallback chain looks like OpenAI as primary, Anthropic as secondary, Groq as tertiary. These are separate companies running separate serving stacks.
- The architecture starts with a unified interface: an abstract BaseLLMClient class that every provider implements with the same complete(), stream(), and health_check() methods. This abstraction is essential because each provider has different message formats (Anthropic separates system messages, OpenAI includes them in the messages array), different response structures, and different error types. The adapter layer normalizes these differences so the rest of my application is provider-agnostic.
- The fallback chain wraps a list of these clients and tries them in priority order. When the primary provider raises an exception, it catches it, logs the failure, and tries the next provider. I configure retry behavior per provider: 1 retry with 2-second backoff for rate limit errors (which are usually transient), 0 retries for authentication errors (permanent), and 1 retry for timeout errors. Each provider also has a circuit breaker that trips after 3 consecutive failures within a 60-second window. Once tripped, the circuit breaker skips that provider entirely for 30 seconds before testing it again with a probe request.
- The critical thing most implementations miss is that fallback is not free. Different providers produce different quality outputs, have different token limits, and support different features. If my primary is GPT-4o with function calling and my fallback is Llama-3 on Groq, the fallback might not support structured function calling. I design the fallback to gracefully degrade: the core text generation works on all providers, but advanced features like function calling or structured output might only be available on the primary. The application handles this by checking provider capabilities before making feature-dependent calls.
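The circuit breaker described in the answer above can be sketched as follows, using the same numbers (3 consecutive failures within 60 seconds, 30-second cooldown). The class and method names are hypothetical. As a simplification, every request after the cooldown is allowed through as a probe until a result is recorded, where a stricter version would admit only one.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, window_s=60.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock      # injectable so tests can control time
        self.failures = []      # timestamps of consecutive failures
        self.opened_at = None   # time the breaker tripped, or None if closed

    def allow_request(self) -> bool:
        """True if closed, or if the cooldown has elapsed and a probe is due."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        # Any success resets the consecutive-failure count and closes the breaker.
        self.failures.clear()
        self.opened_at = None

    def record_failure(self):
        now = self.clock()
        if self.opened_at is not None:
            # A failed probe restarts the cooldown.
            self.opened_at = now
            return
        # Keep only failures that fall inside the sliding window.
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now
```

A fallback chain would call allow_request() before each provider and skip it when the answer is False, then report the outcome with record_success() or record_failure().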
You need to route different types of requests to different LLM providers -- coding questions to one model, creative writing to another, quick lookups to a fast model. How do you design this router?
- The model router is essentially a task classifier followed by a model selection strategy. The classifier determines the task type (coding, creative, analysis, chat, math, fast-lookup), and the selector picks the optimal model for that task based on configurable priorities: quality, cost, or speed.
- For the classifier, I have used three approaches with increasing sophistication. Keyword matching works for prototypes: if the prompt contains “code,” “function,” “debug,” route to the coding model. It is fast (microseconds) but brittle — “write a story about a coder” gets misrouted. Embedding-based classification is my production default: I compute the query embedding and compare it against precomputed centroids for each task category. This handles paraphrases and costs a fraction of a cent per classification. LLM-based classification (asking a fast model “what type of task is this?”) is most accurate but adds 200-500ms of latency per request, so I only use it for high-stakes routing decisions.
- The model selection maintains a configuration table mapping task types to ranked model lists. Coding goes to Claude Sonnet first (strong at code), then GPT-4o as fallback. Creative writing goes to Claude first (strong at creative), then GPT-4o. Fast lookups go to Groq’s Llama-3 (sub-200ms latency) first, then GPT-4o-mini. The ranking can be optimized for different dimensions: “quality” uses the default ranking, “cost” sorts by price per token, “speed” sorts by average latency.
- The key production concern is that the router itself should not be a single point of failure. If the classifier errors or returns an unknown task type, the fallback is a default model (GPT-4o-mini for cost efficiency, or GPT-4o for quality). I also track routing decisions in production metrics so I can detect when the classifier starts misrouting — for example, if coding questions are being sent to the cheap chat model and users report quality degradation.
- At one company, the router reduced our LLM costs by 40% because 60% of requests were simple chat queries that did not need GPT-4o. Routing those to GPT-4o-mini at 17x cheaper tokens made a huge dent in the bill without measurable quality loss on those specific query types.
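The classifier-plus-selector design above can be sketched with the keyword approach and a small configuration table. The cost and latency figures are placeholders rather than real pricing or measurements, the model names are labels rather than API model IDs, and the keywords beyond "code", "function", and "debug" are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    cost_per_1k: float  # placeholder units, not real pricing
    latency_ms: int     # placeholder, not a measurement


MODELS = {
    "claude-sonnet": ModelInfo("claude-sonnet", 3.0, 900),
    "gpt-4o": ModelInfo("gpt-4o", 2.5, 800),
    "gpt-4o-mini": ModelInfo("gpt-4o-mini", 0.15, 400),
    "groq-llama-3": ModelInfo("groq-llama-3", 0.1, 150),
}

# Task type -> models ranked by quality.
ROUTES = {
    "coding": ["claude-sonnet", "gpt-4o"],
    "creative": ["claude-sonnet", "gpt-4o"],
    "fast-lookup": ["groq-llama-3", "gpt-4o-mini"],
}
DEFAULT_MODEL = "gpt-4o-mini"

KEYWORDS = {
    "coding": ("code", "function", "debug"),
    "creative": ("story", "poem"),
    "fast-lookup": ("define", "what is"),
}


def classify(prompt: str) -> str:
    """Keyword classifier: fast but brittle, suitable for prototypes only."""
    text = prompt.lower()
    for task, words in KEYWORDS.items():
        if any(w in text for w in words):
            return task
    return "unknown"


def select_models(task: str, priority: str = "quality") -> list[str]:
    candidates = ROUTES.get(task)
    if not candidates:
        return [DEFAULT_MODEL]  # unknown task type falls back to the default
    if priority == "cost":
        return sorted(candidates, key=lambda m: MODELS[m].cost_per_1k)
    if priority == "speed":
        return sorted(candidates, key=lambda m: MODELS[m].latency_ms)
    return list(candidates)


def route(prompt: str, priority: str = "quality") -> list[str]:
    try:
        task = classify(prompt)
    except Exception:
        task = "unknown"  # the router must not become a single point of failure
    return select_models(task, priority)
```

With this table, route("write a story about a coder") returns the coding route because "coder" contains "code", which is the misrouting weakness of keyword matching noted above.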
How do you implement cost-based routing that respects a daily budget while maintaining quality?
- Cost-based routing is a dynamic optimization problem: early in the day when budget is plentiful, I route to the best-quality model. As the budget depletes, I progressively downgrade to cheaper models to ensure the application stays within budget and does not go dark at 3pm.
- The implementation tracks daily spend in a fast datastore (Redis counter with daily TTL). Each request estimates its cost before execution (input token count estimate from character count divided by 4, multiplied by model pricing per token). The routing logic has three tiers based on remaining budget percentage. Above 80% remaining: route to the quality-optimal model for the task. Between 20-80% remaining: route to a balanced model (GPT-4o-mini instead of GPT-4o). Below 20% remaining: route to the cheapest available option (Groq Llama-3 or a self-hosted model). At 0% remaining, I either reject requests with a clear error or serve from cache only.
- The nuance is that not all requests are equal in business value. A paying enterprise customer’s request should always get the best model, even when budget is tight. I implement a priority system: critical requests (enterprise tier, revenue-impacting features) always get quality routing, while low-priority requests (free tier, internal tooling) absorb the budget cuts first. This means the budget depletion curve affects different user segments at different thresholds.
- After execution, I update the daily spend with the actual cost (from the API response’s usage field, not the estimate). The discrepancy between estimated and actual cost is usually under 20%, but I track it to catch cases where the model generates unexpectedly long outputs that blow the estimate.
- One lesson from production: the budget needs a reserve. I treat the hard limit as a ceiling rather than a target, halting normal traffic at about 95% of the budget and keeping the last 5% in reserve for critical requests and retries. Without the reserve, a burst of expensive requests at 95% budget can overshoot before the system reacts.
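The tiering logic in the answer above can be sketched as a small in-memory router. The class name, tier labels, and priority values are hypothetical. A production version would keep the spend counter in a shared store such as Redis, and this sketch rejects all traffic at 0% remaining rather than modeling the reserve or a cache-only mode.

```python
def estimate_cost(prompt: str, price_per_1k_input: float) -> float:
    """Rough pre-call estimate: about 4 characters per token, input tokens only."""
    est_tokens = len(prompt) / 4
    return est_tokens / 1000 * price_per_1k_input


class BudgetRouter:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.spent = 0.0  # in-memory stand-in for a shared daily counter

    def remaining_fraction(self) -> float:
        return max(0.0, 1 - self.spent / self.daily_budget)

    def pick_tier(self, priority: str = "normal") -> str:
        remaining = self.remaining_fraction()
        if remaining <= 0:
            return "reject"      # budget exhausted
        if priority == "critical":
            return "quality"     # critical traffic is shielded from downgrades
        if remaining > 0.8:
            return "quality"
        if remaining >= 0.2:
            return "balanced"
        return "cheapest"

    def record_actual(self, cost: float) -> None:
        # Use the real cost from the provider's usage data, not the estimate.
        self.spent += cost
```

Comparing estimate_cost() against the value later passed to record_actual() is one way to track how far estimates drift from real spend.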
"model_tier": "standard" versus "model_tier": "premium". The frontend can use this to show a subtle indicator like “Using fast mode” or “Responses may be shorter during peak usage.” This is better than silently degrading quality and having users think the product is broken. For chat applications, I also adjust the system prompt for cheaper models to compensate: simpler instructions, fewer examples, more explicit constraints on output format. A GPT-4o-mini with a well-tuned prompt often matches GPT-4o with a generic prompt for straightforward tasks. The other strategy is strategic caching: during high-budget periods, I aggressively cache responses for common queries. During low-budget periods, the cache hit rate is higher because I have been warming it all day, which reduces the number of live API calls needed and effectively stretches the remaining budget.