Semantic routing directs queries to the most appropriate handler, model, or pipeline based on content understanding. Think of it as a smart receptionist at a hospital: instead of sending every patient to the same doctor, they assess symptoms and route to the right specialist. Sending “what’s 2+2?” to GPT-4o is like sending a paper-cut patient to the ER: expensive and wasteful. Semantic routing fixes this. The payoff is significant: teams that implement intelligent routing typically see 40-70% cost reductions with no quality loss on simple queries, because the cheap model handles them just fine.
Intent Classification
Embedding-Based Classification
LLM-Based Classification
Query Routing
Multi-Model Router
The core insight: not every query needs your most expensive model. “What’s 2+2?” doesn’t need GPT-4o, but “Design a distributed system for…” does. Routing by complexity can cut your API bill by 50-70% with negligible quality loss on the queries that get routed down. Route queries to the most appropriate model based on complexity.
Topic-Based Routing
Routing Approach Comparison
| Approach | Latency Overhead | Accuracy | Cost | Best For |
|---|---|---|---|---|
| Embedding-based | ~50ms (one embed call) | Good for well-separated intents | Lowest | High-throughput classification with 5-20 distinct intents |
| LLM-based | 200-500ms (one LLM call) | Best for nuanced/overlapping intents | Medium | Complex routing where intent boundaries are fuzzy |
| Rule-based (keyword/regex) | <1ms | Limited to exact patterns | Free | Simple filters (profanity, PII detection, language detection) |
| Hybrid (rules + embedding fallback) | 1-50ms | Good overall | Low | Production systems: rules catch the obvious, embeddings handle the rest |
- Under 10 intents with clear boundaries (billing, support, sales): Embedding-based. Fast, cheap, and the centroid approach handles it well.
- Overlapping intents or nuanced classification (“is this a complaint or a feature request?”): LLM-based. The model’s reasoning catches subtlety that cosine similarity misses.
- Cost-sensitive at high volume (10K+ queries/day): Hybrid. Use regex/keyword rules to catch 60-70% of queries instantly, then route the ambiguous remainder through embeddings.
- Multi-model routing (choosing between GPT-4o-mini, GPT-4o, Claude): Two-stage. First classify complexity with a fast model, then route based on the classification. The routing call should never cost more than the cheapest model in your fleet.
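The two-stage idea in the last bullet can be sketched as follows. This is a minimal illustration rather than a reference implementation: `classify_complexity`, `REASONING_HINTS`, and `TIER_TO_MODEL` are hypothetical names, only two tiers are shown, and the keyword heuristic is a stand-in for whatever fast classifier (embeddings or a small model) you actually use.

```python
# Hypothetical tier-to-model mapping; model names are the ones mentioned in the text.
TIER_TO_MODEL = {
    "simple": "gpt-4o-mini",
    "complex": "gpt-4o",
}

# Assumed hint words that suggest a query needs multi-step reasoning.
REASONING_HINTS = ("design", "architecture", "trade-off", "prove", "debug", "optimize")


def classify_complexity(query: str) -> str:
    """Stage one: placeholder for a fast classifier.

    A keyword-and-length heuristic is used only so the sketch runs offline.
    """
    text = query.lower()
    if len(text.split()) > 40 or any(hint in text for hint in REASONING_HINTS):
        return "complex"
    return "simple"


def route(query: str) -> str:
    """Stage two: map the complexity tier to a model name."""
    return TIER_TO_MODEL[classify_complexity(query)]
```

Keeping the two stages as separate functions means the classifier can be swapped out later without touching the tier-to-model mapping.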
Cost-Optimized Routing
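A minimal sketch of what such a router might look like, assuming you maintain a per-model profile from your own evals. `ModelProfile`, `FLEET`, and `pick_model` are hypothetical names, and every number below is a placeholder rather than real pricing or a measured latency.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # placeholder value, not real pricing
    p95_latency_ms: int        # placeholder value
    quality: float             # 0..1 score from your own evaluation set


# Hypothetical fleet; replace the numbers with your own measurements.
FLEET = (
    ModelProfile("gpt-4o-mini", 0.0005, 400, 0.80),
    ModelProfile("gpt-4o", 0.0100, 900, 0.93),
)


def pick_model(min_quality: float, max_latency_ms: int,
               fleet: Sequence[ModelProfile] = FLEET) -> Optional[ModelProfile]:
    """Return the cheapest model that meets both constraints, or None."""
    eligible = [m for m in fleet
                if m.quality >= min_quality and m.p95_latency_ms <= max_latency_ms]
    return min(eligible, key=lambda m: m.cost_per_1k_tokens, default=None)
```

Returning `None` when nothing qualifies forces the caller to decide explicitly whether to relax the quality floor or the latency budget.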
In production, you’re optimizing three variables simultaneously: cost, latency, and quality. This router makes those trade-offs explicit and configurable rather than using one model for everything and hoping for the best.
Hybrid Routing
Combine multiple routing strategies.
- Use fast models for routing decisions themselves: if your router uses GPT-4o to decide which model to call, the routing overhead defeats the purpose. Use GPT-4o-mini or embeddings.
- Cache routing decisions — similar queries should route the same way. Hash the query and cache for 5-10 minutes.
- Monitor routing accuracy — track cases where the cheap model produced bad answers. This is your “mis-routing” rate.
- Implement fallbacks — if the fast model returns low confidence, automatically escalate to the powerful model. Don’t make users retry.
- Track cost savings — measure actual cost with routing vs. what it would have been with the powerful model for everything. This justifies the engineering investment.
- Pitfall to avoid: Don’t over-engineer routing for small scale. If you’re under $100/month in API costs, just use one good model. Routing ROI kicks in at scale.
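Two of the practices above, caching and fallback, can be sketched together. This is one possible shape under stated assumptions: `cached_route`, `CONFIDENCE_FLOOR`, and the `decide` callback are hypothetical, the confidence here is the router's own score used as a stand-in for a low-confidence signal, and hashing only matches queries that are identical after normalization (catching merely similar queries would need an embedding lookup instead).

```python
import hashlib
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 300   # 5 minutes, the low end of the range suggested above
CONFIDENCE_FLOOR = 0.6    # hypothetical escalation threshold; tune on your data
POWERFUL_MODEL = "gpt-4o"

# Maps query hash -> (time stored, chosen model).
_cache: Dict[str, Tuple[float, str]] = {}


def _key(query: str) -> str:
    """Hash the query after lowercasing and collapsing whitespace."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_route(query: str,
                 decide: Callable[[str], Tuple[str, float]],
                 now: Callable[[], float] = time.monotonic) -> str:
    """Return a cached decision if still fresh, else decide and cache it."""
    key = _key(query)
    current = now()
    hit = _cache.get(key)
    if hit is not None and current - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    model, confidence = decide(query)
    if confidence < CONFIDENCE_FLOOR:
        model = POWERFUL_MODEL  # escalate instead of making the user retry
    _cache[key] = (current, model)
    return model
```

The `now` parameter exists only so the expiry logic can be exercised with a fake clock.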
Practice Exercise
Build a production routing system that:
- Classifies queries by intent and complexity
- Routes to appropriate models based on requirements
- Optimizes for cost while meeting quality thresholds
- Tracks routing decisions and outcomes
- Adapts routing rules based on feedback
- Low-latency routing decisions
- Graceful degradation on failures
- A/B testing different routing strategies
- Cost and quality monitoring
Interview Deep-Dive
Your company spends $50K/month on LLM API calls. Walk me through how you would design a semantic routing layer to cut that by 40% without degrading user-facing quality.
- First, I would instrument every request to capture the query text, the model used, the latency, the token count, and a quality signal — either explicit user feedback (thumbs up/down) or an automated LLM-as-judge evaluation on a sampled subset. You cannot optimize what you do not measure, and without a quality baseline, any cost reduction is a gamble.
- The routing layer itself has two stages. Stage one is a fast classifier (either an embedding-based centroid approach or a fine-tuned small model) that buckets queries into complexity tiers: simple, moderate, and complex. The classifier must run on something cheap like `text-embedding-3-small` or `gpt-4o-mini`, never on the expensive model you are trying to avoid. If the classifier itself costs significant tokens, you have defeated the purpose.
- Stage two maps tiers to models: simple queries go to `gpt-4o-mini` (or an even cheaper model), moderate to `gpt-4o`, and complex to the most capable model available. The key insight is that 50-70% of production queries in most customer-facing products are simple (greetings, FAQ-type questions, single-fact lookups), and the cheap model handles them indistinguishably from the expensive one.
- I would deploy this with a shadow mode first: route all queries to both the current model and the proposed cheaper model, then compare outputs using an automated eval. This gives you a real mis-routing rate before you flip any traffic. A mis-routing rate above 5% means your classifier needs more training examples or a different threshold.
- The fallback mechanism is critical. If the cheap model returns low confidence or the user re-asks the same question, automatically escalate to the powerful model. This catch-net prevents the worst user experiences while still capturing the cost savings on the majority of traffic.
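The shadow-mode check described above reduces to a small calculation once each shadowed query has a verdict. The sketch below assumes a hypothetical `ShadowResult` record whose `cheap_is_acceptable` flag comes from your automated eval or human review; the 5% ceiling is the one named in the answer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ShadowResult:
    query: str
    routed_tier: str            # tier the classifier would have chosen
    cheap_is_acceptable: bool   # verdict from an automated eval or human review


MAX_MISROUTING_RATE = 0.05  # the 5% ceiling mentioned above


def misrouting_rate(results: Iterable[ShadowResult], tier: str = "simple") -> float:
    """Share of queries sent to `tier` whose cheap answer was not acceptable."""
    routed = [r for r in results if r.routed_tier == tier]
    if not routed:
        return 0.0
    bad = sum(1 for r in routed if not r.cheap_is_acceptable)
    return bad / len(routed)


def ready_to_shift_traffic(results: Iterable[ShadowResult]) -> bool:
    """True when the shadow run stays at or under the mis-routing ceiling."""
    return misrouting_rate(results) <= MAX_MISROUTING_RATE
```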
What are the failure modes of using an LLM to classify query complexity before routing to another LLM? When does this architecture break down?
- The most insidious failure mode is latency amplification. If you use `gpt-4o-mini` to classify before routing, you have added 200-400ms of latency to every single request. For a chat application where perceived responsiveness matters, this overhead can negate the UX benefit of streaming. The fix is to use embeddings for classification instead: a single embedding call takes about 50ms and does not require a full LLM inference pass.
- Second failure mode: the classifier becomes an extra dependency and a potential single point of failure. The LLM-based classifier is itself an API call that can fail, rate-limit, or time out. If your routing layer goes down, all queries stall. You need a fast fallback: if classification fails, default to the middle-tier model. Never default to the cheapest model on failure, because that degrades quality silently without any signal.
- Third: confidence miscalibration. LLMs are notoriously overconfident when asked to self-rate. If you ask `gpt-4o-mini` “how complex is this query?” and it says 0.95 confidence that it is simple, that 0.95 is not a real probability. It is a language pattern. You cannot trust model-generated confidence scores for routing thresholds without calibrating them against actual outcomes.
- The architecture fundamentally breaks down when query complexity is not predictable from the query text alone. For instance, “Tell me about the Johnson account” looks simple, but the answer might require reasoning across five documents in a RAG system. Complexity is often a function of the retrieval results, not the query. In these cases, you need a two-phase approach: do a cheap retrieval first, assess the complexity of the retrieved context, then route.
- Finally, adversarial or ambiguous inputs — sarcasm, multi-intent queries (“book me a flight and also explain quantum physics”), or queries in mixed languages — tend to confuse simple classifiers. These edge cases get mis-routed to the cheap model and produce visibly bad outputs.
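The "default to the middle tier on failure" rule from the second point can be sketched as a thin wrapper. `safe_classify` and the `classify` callback are hypothetical names; timeouts are assumed to be enforced inside `classify` by whatever client you use, so that a timeout surfaces here as an exception.

```python
from typing import Callable

VALID_TIERS = ("simple", "moderate", "complex")
FALLBACK_TIER = "moderate"  # middle tier, never the cheapest


def safe_classify(query: str, classify: Callable[[str], str]) -> str:
    """Run the classifier, falling back to the middle tier on any failure.

    A malformed label is treated the same way as an exception.
    """
    try:
        tier = classify(query)
    except Exception:
        # In production you would log this; a silent fallback hides outages.
        return FALLBACK_TIER
    return tier if tier in VALID_TIERS else FALLBACK_TIER
```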
You are building a multi-model router that uses GPT-4o-mini, GPT-4o, and Claude. A product manager asks: how do we know the router is actually working and not just randomly assigning queries? What observability would you build?
- The core metric is the “routing accuracy rate” — the percentage of queries where the routed model produced an answer of equivalent or better quality compared to always using the most expensive model. You measure this by running an offline evaluation: take a random sample of routed queries (say 500 per week), re-run them through the expensive model, and compare outputs using an LLM-as-judge or human eval. If the cheap-model answers are rated equally good 95%+ of the time for queries routed to the cheap tier, your router is working.
- Second, track the “escalation rate” — queries where the user re-asked the same question, gave a thumbs-down, or where a downstream quality check flagged the response. A rising escalation rate for the cheap tier is the earliest signal of router degradation. I would set up an alert if the weekly escalation rate for any tier increases by more than 2 percentage points.
- Third, build a routing distribution dashboard that shows: what percentage of queries go to each tier, the average cost per query per tier, the p50/p95 latency per tier, and the total monthly cost. If the distribution shifts suddenly (e.g., 80% going to the cheap model when it was 60% last week), something changed — either user behavior shifted or the classifier drifted.
- Fourth, log every routing decision with the classifier’s confidence score and the actual model used. This lets you build confusion matrices: for queries the classifier labeled “simple” that users rated poorly, what were the common patterns? These misclassified queries become new training examples for the next version of the classifier.
- Finally, run a continuous A/B test where 5% of traffic bypasses the router and goes to the expensive model regardless. This control group gives you a live quality benchmark to compare against routed traffic. If the control group’s quality metrics are significantly better, the router is losing value somewhere.
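Two of the pieces above, the 5% control group and the escalation-rate alert, are simple enough to sketch. The function names are hypothetical, and hashing the request ID is just one way to get a stable holdout assignment; it assumes request IDs are roughly unique per request or per user.

```python
import hashlib

HOLDOUT_PERCENT = 5             # share of traffic that bypasses the router
ESCALATION_ALERT_POINTS = 2.0   # alert threshold in percentage points


def in_control_group(request_id: str) -> bool:
    """Deterministically assign about 5% of request IDs to the control group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < HOLDOUT_PERCENT


def escalation_rate(escalated: int, total: int) -> float:
    """Escalation rate as a percentage; 0.0 when there was no traffic."""
    return 100.0 * escalated / total if total else 0.0


def should_alert(last_week_rate: float, this_week_rate: float) -> bool:
    """True when the weekly rate rose by more than the alert threshold."""
    return this_week_rate - last_week_rate > ESCALATION_ALERT_POINTS
```

Because the assignment is a pure function of the ID, the same request always lands in the same group, which keeps the comparison between routed and control traffic clean.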
Compare embedding-based intent classification versus LLM-based classification. In what specific production scenarios would you choose one over the other, and why?
- Embedding-based classification computes a vector for the query and compares it against pre-computed intent centroids using cosine similarity. It is fast (one embedding API call, ~50ms), cheap (embedding models cost 10-100x less than chat models), and deterministic given the same model version. The trade-off is that it cannot reason about context, negation, or multi-intent queries. “I do NOT want to cancel my subscription” has high cosine similarity to “cancel my subscription” because the embedding captures topic proximity, not logical negation.
- LLM-based classification sends the query to a chat model with a prompt listing the available intents and asks it to classify. It handles nuance, negation, and ambiguity well because it reasons about the full meaning. But it is 5-10x more expensive, 3-5x slower, and non-deterministic — the same query can get different classifications on different runs if temperature is above zero.
- I would choose embeddings for high-throughput, low-stakes routing — a customer support bot handling thousands of queries per hour where 90% of queries cleanly fall into one of five categories. The 5-10% edge cases that get misclassified can be caught by a confidence threshold and escalated to a human or a more expensive model.
- I would choose LLM-based classification for low-throughput, high-stakes decisions — routing a medical triage question to the right specialist pipeline, or classifying a compliance query where misclassification has regulatory consequences. Here, the extra 300ms and $0.001 per classification is trivial compared to the cost of getting it wrong.
- The hybrid approach is often best in production: use embeddings as the fast path for clear-cut queries (confidence above 0.85), and fall back to LLM classification only for ambiguous queries (confidence between 0.5 and 0.85). This gives you the speed of embeddings for the 80% easy case and the accuracy of LLMs for the 20% hard case.
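The hybrid path in the last point can be sketched as centroid matching with two thresholds. This is an illustration under assumptions: `embed` and `llm_classify` are hypothetical callbacks standing in for real API calls, cosine similarity is used directly as the "confidence", and returning `"unknown"` below 0.5 is a choice made here because the text does not say what happens in that range.

```python
import math
from typing import Callable, Dict, List, Sequence

HIGH_CONFIDENCE = 0.85  # at or above this, trust the embedding match
LOW_CONFIDENCE = 0.5    # below this, treat the query as unrecognized (assumption)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the example vectors for one intent."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(query: str,
             embed: Callable[[str], Sequence[float]],
             centroids: Dict[str, Sequence[float]],
             llm_classify: Callable[[str], str]) -> str:
    """Fast path via centroids; ambiguous queries go to the LLM classifier."""
    query_vec = embed(query)
    scores = {intent: cosine(query_vec, c) for intent, c in centroids.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= HIGH_CONFIDENCE:
        return best
    if scores[best] >= LOW_CONFIDENCE:
        return llm_classify(query)
    return "unknown"
```

The thresholds are the ones quoted in the text; in practice they would be tuned against labeled examples, since raw cosine scores vary between embedding models.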