Semantic routing directs queries to the most appropriate handler, model, or pipeline based on content understanding. Think of it as a smart receptionist at a hospital: instead of sending every patient to the same doctor, they assess symptoms and route to the right specialist. Sending “what’s 2+2?” to GPT-4o is like sending a paper cut patient to the ER — expensive and wasteful. Semantic routing fixes this. The payoff is significant: teams that implement intelligent routing typically see 40-70% cost reductions with no quality loss on simple queries, because the cheap model handles them just fine.

Intent Classification

Embedding-Based Classification

import numpy as np
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    """Represents an intent with example queries."""
    name: str
    description: str
    examples: list[str]
    embedding: Optional[np.ndarray] = None  # Filled in by IntentClassifier


class IntentClassifier:
    """Classify queries into predefined intents using embeddings.
    
    How it works: Each intent has example queries. We embed those examples
    and average them to create a "centroid" -- a point in embedding space
    that represents the intent. New queries are classified by finding the 
    nearest centroid. Fast (one embedding call), cheap (no LLM needed), 
    and surprisingly accurate for well-defined intents.
    """
    
    def __init__(self, intents: list[Intent], model: str = "text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model
        self.intents = intents
        self._compute_intent_embeddings()
    
    def _compute_intent_embeddings(self):
        """Compute embeddings for all intent examples.
        
        We average the description + all examples into one centroid vector.
        More examples = better centroid. Aim for 5-10 diverse examples per intent.
        Tip: Include edge-case phrasings, not just "happy path" examples.
        """
        for intent in self.intents:
            # Combine description and examples for a richer representation
            texts = [intent.description] + intent.examples
            
            response = self.client.embeddings.create(
                model=self.model,
                input=texts
            )
            
            # Average all embeddings -- the centroid of this intent's meaning
            embeddings = [e.embedding for e in response.data]
            intent.embedding = np.mean(embeddings, axis=0)
    
    def classify(self, query: str, threshold: float = 0.5) -> tuple[str, float]:
        """Classify a query into an intent."""
        # Get query embedding
        response = self.client.embeddings.create(
            model=self.model,
            input=[query]
        )
        query_embedding = np.array(response.data[0].embedding)
        
        # Find most similar intent
        best_intent = None
        best_score = -1
        
        for intent in self.intents:
            score = self._cosine_similarity(query_embedding, intent.embedding)
            if score > best_score:
                best_score = score
                best_intent = intent.name
        
        if best_score < threshold:
            return "unknown", best_score
        
        return best_intent, best_score
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def classify_batch(
        self,
        queries: list[str],
        threshold: float = 0.5
    ) -> list[tuple[str, float]]:
        """Classify multiple queries."""
        response = self.client.embeddings.create(
            model=self.model,
            input=queries
        )
        
        results = []
        for embedding_data in response.data:
            query_embedding = np.array(embedding_data.embedding)
            
            best_intent = None
            best_score = -1
            
            for intent in self.intents:
                score = self._cosine_similarity(query_embedding, intent.embedding)
                if score > best_score:
                    best_score = score
                    best_intent = intent.name
            
            if best_score < threshold:
                results.append(("unknown", best_score))
            else:
                results.append((best_intent, best_score))
        
        return results


# Usage
intents = [
    Intent(
        name="technical_support",
        description="Questions about technical issues, bugs, and troubleshooting",
        examples=[
            "My application keeps crashing",
            "How do I fix this error?",
            "The feature isn't working properly"
        ]
    ),
    Intent(
        name="billing",
        description="Questions about payments, invoices, and subscriptions",
        examples=[
            "How do I update my payment method?",
            "Where can I find my invoice?",
            "I want to cancel my subscription"
        ]
    ),
    Intent(
        name="product_info",
        description="Questions about features, capabilities, and product details",
        examples=[
            "What features are included?",
            "Can your product do X?",
            "Tell me about your enterprise plan"
        ]
    ),
]

classifier = IntentClassifier(intents)

queries = [
    "My app won't start after the update",
    "How much does the pro plan cost?",
    "Does it support Python 3.12?",
]

for query in queries:
    intent, confidence = classifier.classify(query)
    print(f"Query: {query}")
    print(f"Intent: {intent} (confidence: {confidence:.2f})\n")
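
The nearest-centroid step can be tried without any API calls. The following is a minimal sketch, assuming toy 3-dimensional vectors in place of real embeddings; the helper names (`centroid`, `nearest_centroid`) and the sample vectors are illustrative and not part of the classifier above. It averages example vectors into a centroid, then picks the intent whose centroid has the highest cosine similarity to the query.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors component-wise."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_centroid(query, centroids: dict[str, list[float]], threshold=0.5):
    """Return (intent name, score), or ("unknown", score) below the threshold."""
    best_name, best_score = max(
        ((name, cosine_similarity(query, vec)) for name, vec in centroids.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return "unknown", best_score
    return best_name, best_score


# Toy 3-d vectors standing in for real embeddings (illustrative values only)
centroids = {
    "technical_support": centroid([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "billing": centroid([[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]),
}

label, score = nearest_centroid([0.9, 0.1, 0.0], centroids)
off_topic, off_score = nearest_centroid([0.0, 0.0, 1.0], centroids)
print(label, round(score, 3))
print(off_topic, round(off_score, 3))
```

The second query is orthogonal to both centroids, so it falls below the threshold and comes back as "unknown". With real embeddings the scores are rarely this clean, which is why the threshold usually needs tuning on held-out queries.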

LLM-Based Classification

from openai import OpenAI
import json


class LLMIntentClassifier:
    """Classify intents using LLM reasoning."""
    
    def __init__(
        self,
        intents: dict[str, str],
        model: str = "gpt-4o-mini"
    ):
        self.client = OpenAI()
        self.model = model
        self.intents = intents
    
    def classify(self, query: str) -> dict:
        """Classify a query with reasoning."""
        intent_list = "\n".join(
            f"- {name}: {desc}"
            for name, desc in self.intents.items()
        )
        
        prompt = f"""Classify this query into one of the following intents:

{intent_list}

Query: {query}

Respond with JSON:
{{
    "intent": "intent_name",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def classify_with_fallback(
        self,
        query: str,
        confidence_threshold: float = 0.7
    ) -> dict:
        """Classify with fallback for low-confidence results."""
        result = self.classify(query)
        
        if result["confidence"] < confidence_threshold:
            # Save the model's label before overwriting it
            result["original_intent"] = result.get("intent")
            result["intent"] = "requires_human_review"
        
        return result


# Usage
intents = {
    "order_status": "Inquiries about order tracking and delivery",
    "refund_request": "Requests for refunds or returns",
    "product_question": "Questions about product features or availability",
    "complaint": "Complaints about service or product quality",
    "general_inquiry": "General questions not fitting other categories"
}

classifier = LLMIntentClassifier(intents)

result = classifier.classify("When will my order arrive? I've been waiting for a week.")
print(f"Intent: {result['intent']}")
print(f"Confidence: {result['confidence']}")
print(f"Reasoning: {result['reasoning']}")
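
JSON mode guarantees syntactically valid JSON, but not that the model picked one of your intent names or returned a numeric confidence. The sketch below shows one way to normalize the parsed response before acting on it. `validate_classification` and its `fallback_intent` default are hypothetical helpers, not part of the class above.

```python
def validate_classification(
    result: dict,
    allowed_intents: set[str],
    fallback_intent: str = "general_inquiry",
) -> dict:
    """Coerce a raw model response into a predictable shape."""
    intent = result.get("intent")
    if intent not in allowed_intents:
        # The model invented or misspelled a label: fall back and record it
        cleaned = {"intent": fallback_intent, "original_intent": intent}
    else:
        cleaned = {"intent": intent}

    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to [0, 1] so downstream thresholds behave consistently
    cleaned["confidence"] = min(1.0, max(0.0, confidence))
    cleaned["reasoning"] = str(result.get("reasoning", ""))
    return cleaned


allowed = {"order_status", "refund_request", "general_inquiry"}

ok = validate_classification(
    {"intent": "order_status", "confidence": 0.92, "reasoning": "asks about delivery"},
    allowed,
)
bad = validate_classification({"intent": "shipping", "confidence": "high"}, allowed)
print(ok)
print(bad)
```

Treating a missing or non-numeric confidence as 0.0 is a conservative choice: it pushes malformed responses toward whatever low-confidence path you already have.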

Query Routing

Multi-Model Router

The core insight: not every query needs your most expensive model. “What’s 2+2?” doesn’t need GPT-4o, but “Design a distributed system for…” does. Routing by complexity can cut your API bill by 50-70% with negligible quality loss on the queries that get routed down. Route queries to the most appropriate model based on complexity:
from openai import OpenAI
from anthropic import Anthropic
from dataclasses import dataclass
from enum import Enum
import json


class ModelTier(Enum):
    FAST = "fast"      # Simple queries
    BALANCED = "balanced"  # Moderate complexity
    POWERFUL = "powerful"  # Complex reasoning


@dataclass
class RouteConfig:
    """Configuration for a route."""
    model: str
    provider: str
    max_tokens: int
    temperature: float


class QueryRouter:
    """Route queries to appropriate models based on complexity."""
    
    ROUTES = {
        ModelTier.FAST: RouteConfig(
            model="gpt-4o-mini",
            provider="openai",
            max_tokens=512,
            temperature=0.3
        ),
        ModelTier.BALANCED: RouteConfig(
            model="gpt-4o",
            provider="openai",
            max_tokens=1024,
            temperature=0.5
        ),
        ModelTier.POWERFUL: RouteConfig(
            model="claude-sonnet-4-20250514",
            provider="anthropic",
            max_tokens=2048,
            temperature=0.7
        ),
    }
    
    def __init__(self):
        self.openai = OpenAI()
        self.anthropic = Anthropic()
    
    def analyze_complexity(self, query: str) -> ModelTier:
        """Determine query complexity."""
        prompt = f"""Analyze the complexity of this query:

Query: {query}

Consider:
1. Does it require multi-step reasoning?
2. Does it need domain expertise?
3. Is it a simple factual question?
4. Does it require creativity or nuance?

Respond with JSON:
{{
    "complexity": "simple" | "moderate" | "complex",
    "reasoning": "brief explanation"
}}"""
        
        response = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        
        complexity_map = {
            "simple": ModelTier.FAST,
            "moderate": ModelTier.BALANCED,
            "complex": ModelTier.POWERFUL
        }
        
        # Fall back to BALANCED if the key is missing or the label is unexpected
        return complexity_map.get(result.get("complexity"), ModelTier.BALANCED)
    
    def route(self, query: str) -> tuple[str, RouteConfig]:
        """Route query and get response."""
        tier = self.analyze_complexity(query)
        config = self.ROUTES[tier]
        
        if config.provider == "openai":
            response = self.openai.chat.completions.create(
                model=config.model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": query}]
            )
            return response.choices[0].message.content, config
        
        elif config.provider == "anthropic":
            response = self.anthropic.messages.create(
                model=config.model,
                max_tokens=config.max_tokens,
                temperature=config.temperature,
                messages=[{"role": "user", "content": query}]
            )
            return response.content[0].text, config
        
        raise ValueError(f"Unknown provider: {config.provider}")


# Usage
router = QueryRouter()

queries = [
    "What is 2 + 2?",
    "Explain the concept of dependency injection.",
    "Design a distributed system for real-time collaboration with CRDT support."
]

for query in queries:
    response, config = router.route(query)
    print(f"Query: {query[:50]}...")
    print(f"Routed to: {config.model}")
    print(f"Response: {response[:100]}...\n")

Topic-Based Routing

from openai import OpenAI
from dataclasses import dataclass
from typing import Callable
import json


@dataclass
class TopicHandler:
    """Handler for a specific topic."""
    topic: str
    description: str
    handler: Callable[[str], str]
    keywords: list[str]


class TopicRouter:
    """Route queries to topic-specific handlers."""
    
    def __init__(self, handlers: list[TopicHandler]):
        self.client = OpenAI()
        self.handlers = {h.topic: h for h in handlers}
        self._build_topic_index()
    
    def _build_topic_index(self):
        """Build a text description for each topic (description plus keywords)."""
        self.topic_texts = {}
        for topic, handler in self.handlers.items():
            text = f"{handler.description}. Keywords: {', '.join(handler.keywords)}"
            self.topic_texts[topic] = text
    
    def route(self, query: str) -> tuple[str, str]:
        """Route query to appropriate handler."""
        # Use LLM to classify topic, giving it each description and keyword list
        topics = "\n".join(
            f"- {topic}: {text}"
            for topic, text in self.topic_texts.items()
        )
        
        prompt = f"""Match this query to the most appropriate topic:

Topics:
{topics}

Query: {query}

Respond with JSON: {{"topic": "topic_name", "confidence": 0.0-1.0}}"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        topic = result["topic"]
        
        if topic in self.handlers:
            handler = self.handlers[topic]
            return topic, handler.handler(query)
        
        # Fallback to default handler
        return "unknown", f"I don't have a specialized handler for this query: {query}"


# Define handlers
def handle_coding(query: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert programmer."},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content


def handle_writing(query: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a professional writer and editor."},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content


def handle_math(query: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a mathematics expert. Show your work."},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content


# Create router
handlers = [
    TopicHandler(
        topic="coding",
        description="Programming, software development, and debugging",
        handler=handle_coding,
        keywords=["code", "program", "function", "bug", "python", "javascript"]
    ),
    TopicHandler(
        topic="writing",
        description="Writing, editing, and content creation",
        handler=handle_writing,
        keywords=["write", "edit", "essay", "article", "grammar"]
    ),
    TopicHandler(
        topic="math",
        description="Mathematics, calculations, and problem solving",
        handler=handle_math,
        keywords=["calculate", "equation", "solve", "math", "number"]
    ),
]

router = TopicRouter(handlers)

query = "How do I implement binary search in Python?"
topic, response = router.route(query)
print(f"Routed to: {topic}")
print(f"Response: {response}")

Routing Approach Comparison

| Approach | Latency Overhead | Accuracy | Cost | Best For |
| --- | --- | --- | --- | --- |
| Embedding-based | ~50ms (one embed call) | Good for well-separated intents | Lowest | High-throughput classification with 5-20 distinct intents |
| LLM-based | 200-500ms (one LLM call) | Best for nuanced/overlapping intents | Medium | Complex routing where intent boundaries are fuzzy |
| Rule-based (keyword/regex) | <1ms | Limited to exact patterns | Free | Simple filters (profanity, PII detection, language detection) |
| Hybrid (rules + embedding fallback) | 1-50ms | Good overall | Low | Production systems: rules catch the obvious, embeddings handle the rest |

Decision framework for choosing your routing approach:
  • Under 10 intents with clear boundaries (billing, support, sales): Embedding-based. Fast, cheap, and the centroid approach handles it well.
  • Overlapping intents or nuanced classification (“is this a complaint or a feature request?”): LLM-based. The model’s reasoning catches subtlety that cosine similarity misses.
  • Cost-sensitive at high volume (10K+ queries/day): Hybrid. Use regex/keyword rules to catch 60-70% of queries instantly, then route the ambiguous remainder through embeddings.
  • Multi-model routing (choosing between GPT-4o-mini, GPT-4o, Claude): Two-stage. First classify complexity with a fast model, then route based on the classification. The routing call should never cost more than the cheapest model in your fleet.
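
The hybrid option in the list above can be sketched as a two-step function: try cheap regex rules first, and defer to a slower classifier only when no rule fires or the rules disagree. The patterns in `RULES` and the stub classifier are assumptions for illustration; in a real system the fallback would be something like `IntentClassifier.classify`.

```python
import re
from typing import Callable

# Hypothetical rules: each pattern maps straight to an intent name
RULES: list[tuple[re.Pattern, str]] = [
    (re.compile(r"\b(invoice|refund|payment|subscription)\b", re.IGNORECASE), "billing"),
    (re.compile(r"\b(crash(es|ed|ing)?|error|bug)\b", re.IGNORECASE), "technical_support"),
]


def hybrid_classify(
    query: str,
    fallback: Callable[[str], tuple[str, float]],
) -> tuple[str, float, str]:
    """Return (intent, confidence, source) using rules first, then the fallback."""
    matched = {intent for pattern, intent in RULES if pattern.search(query)}
    if len(matched) == 1:
        # Exactly one intent matched: treat it as an unambiguous hit
        return matched.pop(), 1.0, "rule"
    # No rule fired, or rules disagree: defer to the slower classifier
    intent, confidence = fallback(query)
    return intent, confidence, "fallback"


def stub_embedding_classifier(query: str) -> tuple[str, float]:
    """Stand-in for an embedding classifier so the sketch runs offline."""
    return "product_info", 0.6


print(hybrid_classify("Where is my invoice?", stub_embedding_classifier))
print(hybrid_classify("Does it support Python 3.12?", stub_embedding_classifier))
print(hybrid_classify("I got an error on the payment page", stub_embedding_classifier))
```

Returning the `source` alongside the intent makes it easy to measure what share of traffic the rules actually absorb, which is the number that determines whether the hybrid design pays off.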

Cost-Optimized Routing

In production, you’re optimizing three variables simultaneously: cost, latency, and quality. This router makes those trade-offs explicit and configurable rather than using one model for everything and hoping for the best.
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional
import time


@dataclass
class ModelConfig:
    """Configuration for a model.
    
    Tip: Update these numbers quarterly -- model pricing changes frequently
    and new models often shift the cost/quality frontier.
    """
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    avg_latency_ms: float
    quality_score: float  # 0-1, calibrated against your specific eval set


class CostOptimizedRouter:
    """Route queries to minimize cost while meeting quality requirements."""
    
    MODELS = [
        ModelConfig("gpt-4o-mini", 0.00015, 0.0006, 500, 0.85),
        ModelConfig("gpt-4o", 0.0025, 0.01, 800, 0.95),
        ModelConfig("gpt-4-turbo", 0.01, 0.03, 1000, 0.93),
    ]
    
    def __init__(
        self,
        quality_threshold: float = 0.8,
        max_latency_ms: float = 2000,
        budget_per_query: float = 0.01
    ):
        self.client = OpenAI()
        self.quality_threshold = quality_threshold
        self.max_latency_ms = max_latency_ms
        self.budget_per_query = budget_per_query
    
    def estimate_tokens(self, text: str) -> int:
        """Rough token estimation."""
        return len(text) // 4
    
    def select_model(
        self,
        query: str,
        required_quality: Optional[float] = None
    ) -> ModelConfig:
        """Select the most cost-effective model."""
        quality_req = (
            required_quality if required_quality is not None
            else self.quality_threshold
        )
        
        # Estimate query cost (token counts expressed in thousands)
        input_tokens = self.estimate_tokens(query) / 1000
        output_tokens = 0.5  # Estimate 500 output tokens
        
        def estimate_cost(model: ModelConfig) -> float:
            return (
                input_tokens * model.cost_per_1k_input +
                output_tokens * model.cost_per_1k_output
            )
        
        # Filter models that meet quality, latency, and budget requirements
        viable_models = [
            m for m in self.MODELS
            if m.quality_score >= quality_req
            and m.avg_latency_ms <= self.max_latency_ms
            and estimate_cost(m) <= self.budget_per_query
        ]
        
        if not viable_models:
            # Fallback to highest quality model (may exceed the budget)
            return max(self.MODELS, key=lambda m: m.quality_score)
        
        # Select cheapest viable model
        return min(viable_models, key=estimate_cost)
    
    def route(
        self,
        query: str,
        required_quality: Optional[float] = None
    ) -> tuple[str, ModelConfig, dict]:
        """Route query and return response with metadata."""
        model = self.select_model(query, required_quality)
        
        start_time = time.perf_counter()
        
        response = self.client.chat.completions.create(
            model=model.name,
            messages=[{"role": "user", "content": query}]
        )
        
        latency_ms = (time.perf_counter() - start_time) * 1000
        
        usage = response.usage
        actual_cost = (
            (usage.prompt_tokens / 1000) * model.cost_per_1k_input +
            (usage.completion_tokens / 1000) * model.cost_per_1k_output
        )
        
        metadata = {
            "model": model.name,
            "latency_ms": latency_ms,
            "cost": actual_cost,
            "input_tokens": usage.prompt_tokens,
            "output_tokens": usage.completion_tokens
        }
        
        return response.choices[0].message.content, model, metadata


# Usage
router = CostOptimizedRouter(
    quality_threshold=0.85,
    max_latency_ms=1500,
    budget_per_query=0.005
)

# Simple query - should use cheaper model
simple_query = "What is the capital of France?"
response, model, meta = router.route(simple_query)
print(f"Query: {simple_query}")
print(f"Model: {model.name}, Cost: ${meta['cost']:.6f}")

# Complex query with high quality requirement
complex_query = "Explain the mathematical foundations of transformer attention mechanisms."
response, model, meta = router.route(complex_query, required_quality=0.95)
print(f"\nQuery: {complex_query}")
print(f"Model: {model.name}, Cost: ${meta['cost']:.6f}")
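
To see where the savings come from, it helps to work the arithmetic for a blended workload. The sketch below reuses the per-1K-token prices from the `MODELS` list above; the traffic split (60% to the cheap model) and the token counts are assumptions chosen for illustration, and the cost of the routing call itself is left out.

```python
# Per-1K-token prices copied from the MODELS list above
PRICES = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}


def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (
        (input_tokens / 1000) * price["input"]
        + (output_tokens / 1000) * price["output"]
    )


def blended_cost(split: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Expected cost per query given the share of traffic sent to each model."""
    return sum(
        share * cost_per_query(model, input_tokens, output_tokens)
        for model, share in split.items()
    )


# Assumed workload: 200 input tokens, 500 output tokens, 60% routed to the cheap model
baseline = cost_per_query("gpt-4o", 200, 500)
routed = blended_cost({"gpt-4o-mini": 0.6, "gpt-4o": 0.4}, 200, 500)
savings = 1 - routed / baseline

print(f"All gpt-4o: ${baseline:.6f} per query")
print(f"Routed:     ${routed:.6f} per query")
print(f"Savings:    {savings:.1%}")
```

Under these assumed numbers the routed cost is about 0.0024 dollars per query against 0.0055 for gpt-4o alone, a saving of roughly 56%. The result is dominated by the traffic split, so measuring your real share of simple queries matters more than fine-tuning the token estimates.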

Hybrid Routing

Combine multiple routing strategies:
from openai import OpenAI
from dataclasses import dataclass
from typing import Callable, Any
import json


@dataclass
class RoutingDecision:
    """Detailed routing decision."""
    model: str
    handler: str
    reasoning: str
    confidence: float
    metadata: dict


class HybridRouter:
    """Combine intent, complexity, and cost-based routing."""
    
    def __init__(self):
        self.client = OpenAI()
    
    def analyze_query(self, query: str) -> dict:
        """Comprehensive query analysis."""
        prompt = f"""Analyze this query comprehensively:

Query: {query}

Provide analysis as JSON:
{{
    "intent": "question" | "task" | "creative" | "analysis" | "code",
    "complexity": "simple" | "moderate" | "complex",
    "domain": "general" | "technical" | "creative" | "analytical",
    "expected_length": "short" | "medium" | "long",
    "requires_reasoning": true/false,
    "requires_creativity": true/false,
    "requires_accuracy": true/false
}}"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def decide_route(self, query: str) -> RoutingDecision:
        """Make routing decision based on analysis."""
        analysis = self.analyze_query(query)
        
        # Determine best model based on analysis
        if analysis["complexity"] == "simple" and not analysis["requires_reasoning"]:
            model = "gpt-4o-mini"
            reasoning = "Simple query, fast model sufficient"
        elif analysis["requires_creativity"]:
            model = "gpt-4o"
            reasoning = "Creative task benefits from capable model"
        elif analysis["complexity"] == "complex" or analysis["requires_reasoning"]:
            model = "gpt-4o"
            reasoning = "Complex reasoning requires powerful model"
        else:
            model = "gpt-4o-mini"
            reasoning = "Balanced query, using efficient model"
        
        # Determine handler
        if analysis["intent"] == "code":
            handler = "code_specialist"
        elif analysis["intent"] == "creative":
            handler = "creative_writer"
        elif analysis["domain"] == "technical":
            handler = "technical_expert"
        else:
            handler = "general"
        
        return RoutingDecision(
            model=model,
            handler=handler,
            reasoning=reasoning,
            confidence=0.85,  # Placeholder: fixed value, not derived from the analysis
            metadata=analysis
        )
    
    def route_and_respond(self, query: str) -> tuple[str, RoutingDecision]:
        """Route query and generate response."""
        decision = self.decide_route(query)
        
        # Build system prompt based on handler
        system_prompts = {
            "code_specialist": "You are an expert programmer. Provide clean, documented code.",
            "creative_writer": "You are a creative writer. Be imaginative and engaging.",
            "technical_expert": "You are a technical expert. Be precise and thorough.",
            "general": "You are a helpful assistant."
        }
        
        system = system_prompts.get(decision.handler, system_prompts["general"])
        
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": query}
            ]
        )
        
        return response.choices[0].message.content, decision


# Usage
router = HybridRouter()

queries = [
    "What is 5 + 5?",
    "Write a Python function to sort a list of dictionaries by a key",
    "Write a short story about a robot learning to paint",
    "Explain the CAP theorem and its implications for distributed databases"
]

for query in queries:
    response, decision = router.route_and_respond(query)
    print(f"Query: {query[:50]}...")
    print(f"Model: {decision.model}")
    print(f"Handler: {decision.handler}")
    print(f"Reasoning: {decision.reasoning}")
    print(f"Response: {response[:100]}...\n")

Routing Best Practices

  • Use fast models for routing decisions themselves — if your router uses GPT-4o to decide which model to call, the routing overhead defeats the purpose. Use GPT-4o-mini or embeddings.
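  • Cache routing decisions: identical queries should route the same way. Normalize the query (lowercase, collapse whitespace), hash it, and cache the decision for 5-10 minutes. A hash only matches exact repeats, so near-duplicate phrasings still go through the classifier.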
  • Monitor routing accuracy — track cases where the cheap model produced bad answers. This is your “mis-routing” rate.
  • Implement fallbacks — if the fast model returns low confidence, automatically escalate to the powerful model. Don’t make users retry.
  • Track cost savings — measure actual cost with routing vs. what it would have been with the powerful model for everything. This justifies the engineering investment.
  • Pitfall to avoid: Don’t over-engineer routing for small scale. If you’re under $100/month in API costs, just use one good model. Routing ROI kicks in at scale.
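
The caching practice above can be sketched as a small in-memory cache with a time-to-live. `RoutingCache`, `fake_router`, and the injectable `clock` are hypothetical names used for illustration; the sketch has no size limit or eviction, which a production cache would need.

```python
import hashlib
import time
from typing import Callable


class RoutingCache:
    """Cache routing decisions keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_route(self, query: str, route_fn: Callable[[str], str]) -> str:
        key = self._key(query)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # cache hit: skip the classifier call
        decision = route_fn(query)
        self._entries[key] = (now, decision)
        return decision


calls = []


def fake_router(query: str) -> str:
    """Stand-in for a real classifier; records each invocation."""
    calls.append(query)
    return "fast"


fake_now = [0.0]
cache = RoutingCache(ttl_seconds=300, clock=lambda: fake_now[0])

cache.get_or_route("What is 2 + 2?", fake_router)
cache.get_or_route("  what is 2 + 2?  ", fake_router)  # same key after normalization
fake_now[0] = 301.0
cache.get_or_route("What is 2 + 2?", fake_router)      # entry expired, routes again
print(len(calls))
```

Only the first and third lookups reach the router: the second is served from the cache because normalization maps it to the same key, and the third misses because the entry is older than the TTL.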

Practice Exercise

Build a production routing system that:
  1. Classifies queries by intent and complexity
  2. Routes to appropriate models based on requirements
  3. Optimizes for cost while meeting quality thresholds
  4. Tracks routing decisions and outcomes
  5. Adapts routing rules based on feedback
Focus on:
  • Low-latency routing decisions
  • Graceful degradation on failures
  • A/B testing different routing strategies
  • Cost and quality monitoring

Interview Deep-Dive

Strong Answer:
  • First, I would instrument every request to capture the query text, the model used, the latency, the token count, and a quality signal — either explicit user feedback (thumbs up/down) or an automated LLM-as-judge evaluation on a sampled subset. You cannot optimize what you do not measure, and without a quality baseline, any cost reduction is a gamble.
  • The routing layer itself has two stages. Stage one is a fast classifier — either an embedding-based centroid approach or a fine-tuned small model — that buckets queries into complexity tiers: simple, moderate, and complex. The classifier must run on something cheap like text-embedding-3-small or gpt-4o-mini, never on the expensive model you are trying to avoid. If the classifier itself costs significant tokens, you have defeated the purpose.
  • Stage two maps tiers to models: simple queries go to gpt-4o-mini (or an even cheaper model), moderate to gpt-4o, and complex to the most capable model available. The key insight is that 50-70% of production queries in most customer-facing products are simple — greetings, FAQ-type questions, single-fact lookups — and the cheap model handles them indistinguishably from the expensive one.
  • I would deploy this with a shadow mode first: route all queries to both the current model and the proposed cheaper model, then compare outputs using an automated eval. This gives you a real mis-routing rate before you flip any traffic. A mis-routing rate above 5% means your classifier needs more training examples or a different threshold.
  • The fallback mechanism is critical. If the cheap model returns low confidence or the user re-asks the same question, automatically escalate to the powerful model. This catch-net prevents the worst user experiences while still capturing the cost savings on the majority of traffic.
Follow-up: How do you handle the cold-start problem, when you have no historical data to train the classifier on?
Start with a rule-based heuristic as a bootstrap: queries under 20 tokens with no technical jargon go to the cheap model, everything else goes to the expensive one. Log everything. After a week of production traffic, you have enough labeled data to train an embedding-based classifier. The heuristic is intentionally conservative: it routes more to the expensive model than necessary, which means quality stays high while you gather data. Once the classifier is trained, A/B test it against the heuristic and compare both cost and quality metrics. In my experience, even the crude heuristic captures 20-30% savings because so many production queries are genuinely simple.
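
The bootstrap heuristic from this follow-up fits in a few lines. This is a minimal sketch: the jargon list in `TECHNICAL_TERMS` is a made-up placeholder, and the four-characters-per-token estimate is the same rough approximation used in `CostOptimizedRouter.estimate_tokens`.

```python
import re

# Hypothetical jargon list; in practice this would come from your own domain
TECHNICAL_TERMS = {"api", "database", "kubernetes", "latency", "regex", "sql", "thread"}


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token."""
    return len(text) // 4


def bootstrap_route(query: str, max_tokens: int = 20) -> str:
    """Return "cheap" for short, jargon-free queries and "expensive" otherwise."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    has_jargon = bool(words & TECHNICAL_TERMS)
    if estimate_tokens(query) < max_tokens and not has_jargon:
        return "cheap"
    return "expensive"


for q in [
    "What are your opening hours?",
    "Why does my SQL query time out?",
    "Please compare the trade-offs between optimistic and pessimistic locking "
    "for a high-contention inventory table.",
]:
    print(bootstrap_route(q), "<-", q)
```

Only the first query goes to the cheap model: the second contains a jargon term and the third exceeds the length limit. Logging the returned tier next to the eventual quality signal is what turns this heuristic into training data for the later classifier.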
Strong Answer:
  • The most insidious failure mode is latency amplification. If you use gpt-4o-mini to classify before routing, you have added 200-400ms of latency to every single request. For a chat application where perceived responsiveness matters, this overhead can negate the UX benefit of streaming. The fix is to use embeddings for classification instead — a single embedding call is 50ms and does not require a full LLM inference pass.
  • Second failure mode: the classifier becomes a single point of failure. The LLM-based classifier is itself an API call that can fail, be rate-limited, or time out. If your routing layer goes down, all queries stall. You need a fast fallback: if classification fails, default to the middle-tier model. Never default to the cheapest model on failure, because that degrades quality silently without any signal.
  • Third: confidence miscalibration. LLMs are notoriously overconfident when asked to self-rate. If you ask gpt-4o-mini “how complex is this query?” and it says 0.95 confidence that it is simple, that 0.95 is not a real probability. It is a language pattern. You cannot trust model-generated confidence scores for routing thresholds without calibrating them against actual outcomes.
  • The architecture fundamentally breaks down when query complexity is not predictable from the query text alone. For instance, “Tell me about the Johnson account” looks simple, but the answer might require reasoning across five documents in a RAG system. Complexity is often a function of the retrieval results, not the query. In these cases, you need a two-phase approach: do a cheap retrieval first, assess the complexity of the retrieved context, then route.
  • Finally, adversarial or ambiguous inputs — sarcasm, multi-intent queries (“book me a flight and also explain quantum physics”), or queries in mixed languages — tend to confuse simple classifiers. These edge cases get mis-routed to the cheap model and produce visibly bad outputs.
Follow-up: You mentioned embedding-based classification as faster than LLM-based. What is the practical accuracy trade-off, and when would you accept LLM-based classification despite the latency?
In my experience, embedding centroids with 5-10 examples per intent achieve 85-90% accuracy on well-separated intents like “billing vs. technical support vs. product info.” LLM-based classification gets you to 95%+ because it can reason about nuance, but at 5-10x the latency cost. I would use LLM-based classification only for high-stakes routing decisions where a mis-route has significant consequences, like routing a compliance question to a model that hallucinates, versus routing a casual greeting to a slightly less capable model. For most consumer applications, the embedding approach is the right trade-off. You can also run LLM classification asynchronously as a quality check: route immediately using embeddings, but log the LLM classification result for monitoring and retraining the embedding classifier over time.
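
The fallback rule from the failure modes above (default to the middle tier when classification fails or times out) can be sketched with a thread pool and a deadline. `classify_with_fallback`, the tier names, and the 0.3 second default are assumptions for illustration. Note that a timed-out call keeps running in its worker thread; the sketch only stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

DEFAULT_TIER = "balanced"  # middle tier, never the cheapest one
_executor = ThreadPoolExecutor(max_workers=4)


def classify_with_fallback(
    query: str,
    classify: Callable[[str], str],
    timeout_s: float = 0.3,
) -> tuple[str, str]:
    """Return (tier, status); fall back to the middle tier on any failure."""
    future = _executor.submit(classify, query)
    try:
        return future.result(timeout=timeout_s), "classified"
    except FutureTimeout:
        return DEFAULT_TIER, "timeout"
    except Exception:
        # Rate limits, network errors, malformed JSON, and so on
        return DEFAULT_TIER, "error"


def slow_classifier(query: str) -> str:
    """Simulates a classifier call that exceeds the deadline."""
    time.sleep(0.2)
    return "fast"


def broken_classifier(query: str) -> str:
    """Simulates a classifier call that fails outright."""
    raise RuntimeError("rate limited")


print(classify_with_fallback("hi", lambda q: "fast"))
print(classify_with_fallback("hi", slow_classifier, timeout_s=0.05))
print(classify_with_fallback("hi", broken_classifier))
```

Returning the status next to the tier gives you the signal the text asks for: a rising count of "timeout" or "error" results shows that the routing layer is degrading even though users still get answers.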
Strong Answer:
  • The core metric is the “routing accuracy rate” — the percentage of queries where the routed model produced an answer of equivalent or better quality compared to always using the most expensive model. You measure this by running an offline evaluation: take a random sample of routed queries (say 500 per week), re-run them through the expensive model, and compare outputs using an LLM-as-judge or human eval. If the cheap-model answers are rated equally good 95%+ of the time for queries routed to the cheap tier, your router is working.
  • Second, track the “escalation rate” — queries where the user re-asked the same question, gave a thumbs-down, or where a downstream quality check flagged the response. A rising escalation rate for the cheap tier is the earliest signal of router degradation. I would set up an alert if the weekly escalation rate for any tier increases by more than 2 percentage points.
  • Third, build a routing distribution dashboard that shows: what percentage of queries go to each tier, the average cost per query per tier, the p50/p95 latency per tier, and the total monthly cost. If the distribution shifts suddenly (e.g., 80% going to the cheap model when it was 60% last week), something changed — either user behavior shifted or the classifier drifted.
  • Fourth, log every routing decision with the classifier’s confidence score and the actual model used. This lets you build confusion matrices: for queries the classifier labeled “simple” that users rated poorly, what were the common patterns? These misclassified queries become new training examples for the next version of the classifier.
  • Finally, run a continuous A/B test where 5% of traffic bypasses the router and goes to the expensive model regardless. This control group gives you a live quality benchmark to compare against routed traffic. If the control group’s quality metrics are significantly better, the router is losing value somewhere.
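A rough sketch of the dashboard and alerting numbers described above, computed from logged routing decisions. The `RoutingLog` fields and tier names are hypothetical; the 2-percentage-point alert threshold comes from the answer above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RoutingLog:
    tier: str          # e.g. "cheap" or "expensive" (hypothetical tier names)
    confidence: float  # classifier confidence at routing time
    escalated: bool    # re-ask, thumbs-down, or failed quality check


def tier_distribution(logs: list[RoutingLog]) -> dict[str, float]:
    """Fraction of traffic sent to each tier."""
    counts = Counter(log.tier for log in logs)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()} if total else {}


def escalation_rate(logs: list[RoutingLog]) -> dict[str, float]:
    """Fraction of queries per tier that were escalated."""
    totals = Counter(log.tier for log in logs)
    escalated = Counter(log.tier for log in logs if log.escalated)
    return {tier: escalated[tier] / totals[tier] for tier in totals}


def escalation_alerts(
    last_week: dict[str, float],
    this_week: dict[str, float],
    max_increase_pp: float = 2.0,
) -> list[str]:
    """Tiers whose escalation rate rose by more than max_increase_pp points."""
    alerts = []
    for tier, rate in this_week.items():
        previous = last_week.get(tier)
        if previous is not None and (rate - previous) * 100 > max_increase_pp:
            alerts.append(tier)
    return sorted(alerts)
```

Comparing `tier_distribution` week over week gives the sudden-shift signal, while `escalation_alerts` covers the per-tier degradation check.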
Follow-up: The A/B test shows routed traffic has 3% worse quality ratings than the control group. Is that acceptable, and how do you decide?
It depends entirely on the cost savings and the domain. If routing saves $20K/month and the 3% quality gap is on non-critical queries (casual chat, simple lookups), most businesses would accept that trade-off happily. But if the 3% gap concentrates in high-value interactions (enterprise customer queries, medical advice, legal analysis), even 1% degradation can be unacceptable because the cost of a bad answer far exceeds the API savings. I would segment the quality gap by query type and user tier. If premium users see any degradation, tighten their routing to always use the powerful model. If free-tier users see a 3% gap on casual queries, that is likely acceptable. The decision is a product call, not an engineering call, but engineering’s job is to provide the segmented data so the product team can make an informed decision.
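The segmentation step could be sketched like this. The record fields (`user_tier`, `query_type`, `arm`, `score`) and the 0.03 tolerance are assumptions for illustration; scores are taken to be quality ratings where higher is better.

```python
from collections import defaultdict
from statistics import mean

Segment = tuple[str, str]  # (user_tier, query_type)


def quality_gap_by_segment(records: list[dict]) -> dict[Segment, float]:
    """Mean control score minus mean routed score, per segment.

    Each record is assumed to carry: user_tier, query_type,
    arm ("control" or "routed"), and a numeric score.
    """
    scores = defaultdict(lambda: {"control": [], "routed": []})
    for r in records:
        scores[(r["user_tier"], r["query_type"])][r["arm"]].append(r["score"])

    gaps = {}
    for segment, arms in scores.items():
        # Skip segments that lack data in either arm; no comparison possible.
        if arms["control"] and arms["routed"]:
            gaps[segment] = mean(arms["control"]) - mean(arms["routed"])
    return gaps


def segments_to_tighten(gaps: dict[Segment, float], max_gap: float = 0.03) -> list[Segment]:
    """Segments where routed traffic trails control by more than max_gap."""
    return sorted(segment for segment, gap in gaps.items() if gap > max_gap)
```

In practice each segment needs enough samples for the gap to be meaningful, so a significance check would sit between these two functions.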
Strong Answer:
  • Embedding-based classification computes a vector for the query and compares it against pre-computed intent centroids using cosine similarity. It is fast (one embedding API call, ~50ms), cheap (embedding models cost 10-100x less than chat models), and effectively deterministic given the same model version. The trade-off is that it cannot reason about context, negation, or multi-intent queries. “I do NOT want to cancel my subscription” has high cosine similarity to “cancel my subscription” because the embedding captures topic proximity, not logical negation.
  • LLM-based classification sends the query to a chat model with a prompt listing the available intents and asks it to classify. It handles nuance, negation, and ambiguity well because it reasons about the full meaning. But it is 5-10x more expensive, 3-5x slower, and non-deterministic — the same query can get different classifications on different runs if temperature is above zero.
  • I would choose embeddings for high-throughput, low-stakes routing — a customer support bot handling thousands of queries per hour where 90% of queries cleanly fall into one of five categories. The 5-10% edge cases that get misclassified can be caught by a confidence threshold and escalated to a human or a more expensive model.
  • I would choose LLM-based classification for low-throughput, high-stakes decisions — routing a medical triage question to the right specialist pipeline, or classifying a compliance query where misclassification has regulatory consequences. Here, the extra 300ms and $0.001 per classification is trivial compared to the cost of getting it wrong.
  • The hybrid approach is often best in production: use embeddings as the fast path for clear-cut queries (confidence above 0.85), and fall back to LLM classification only for ambiguous queries (confidence between 0.5 and 0.85). This gives you the speed of embeddings for the 80% easy case and the accuracy of LLMs for the 20% hard case.
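The hybrid approach can be sketched as below, using the 0.85 and 0.5 thresholds from the answer above. Two things here are assumptions: the best cosine similarity is used directly as the "confidence", and scores below 0.5 fall through to an `"unknown"` label for escalation, which the answer does not specify. `llm_classify` is a hypothetical callable standing in for a chat-model call.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def hybrid_classify(
    query: str,
    query_vec: list[float],
    centroids: dict[str, list[float]],
    llm_classify: Callable[[str], str],  # hypothetical slow-path classifier
    high: float = 0.85,
    low: float = 0.5,
) -> tuple[str, str]:
    """Return (intent, path) where path is 'embedding', 'llm', or 'fallback'."""
    scores = {name: cosine_similarity(query_vec, c) for name, c in centroids.items()}
    best_intent = max(scores, key=scores.get)
    best_score = scores[best_intent]

    if best_score >= high:
        return best_intent, "embedding"   # clear-cut: fast path
    if best_score >= low:
        return llm_classify(query), "llm"  # ambiguous: pay for the LLM
    return "unknown", "fallback"          # assumption: escalate elsewhere
```

Raw cosine similarities from real embedding models tend to cluster in a narrow range, so the thresholds usually need calibrating against labeled queries before they behave like the confidence bands described here.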
Follow-up: You mentioned embedding models cannot handle negation well. What other semantic nuances do embeddings consistently miss, and how would you build test cases to catch these gaps?
Beyond negation, embeddings struggle with: sarcasm (“Oh great, another meeting” classifies as positive), conditional intent (“I would cancel IF the price goes up” classifies as cancellation), comparative queries (“Is Plan A better than Plan B” is ambiguous about which plan the user cares about), and code-switched language where the user mixes English and another language mid-sentence. I would build an adversarial test set specifically targeting these categories, with 10-20 examples per failure mode. Run the embedding classifier against this set monthly. If accuracy on the adversarial set drops below 70%, it is time to either add more training examples for those edge cases or implement the hybrid approach with LLM fallback. The adversarial set is your canary: it catches classifier drift before your users notice it.
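A minimal sketch of such an adversarial harness follows. The case format, the intent labels, and the helper names are hypothetical; `classify` stands in for whatever classifier is under test, and the 70% threshold is the one quoted above.

```python
from collections import defaultdict
from typing import Callable

# (failure_mode, query, expected_intent); intent labels are hypothetical.
AdversarialCase = tuple[str, str, str]


def accuracy_by_failure_mode(
    cases: list[AdversarialCase], classify: Callable[[str], str]
) -> dict[str, float]:
    """Accuracy per failure mode, to show which nuance is being missed."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for mode, query, expected in cases:
        total[mode] += 1
        correct[mode] += int(classify(query) == expected)
    return {mode: correct[mode] / total[mode] for mode in total}


def overall_accuracy(
    cases: list[AdversarialCase], classify: Callable[[str], str]
) -> float:
    """Accuracy across the whole adversarial set."""
    if not cases:
        return 0.0
    return sum(classify(q) == expected for _, q, expected in cases) / len(cases)


def needs_attention(accuracy: float, threshold: float = 0.70) -> bool:
    """True when accuracy has dropped below the alert threshold."""
    return accuracy < threshold
```

Tracking the per-mode breakdown alongside the overall number makes it easier to decide whether to add examples for one failure mode or move to the LLM fallback.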