December 2025 Update: Practical strategies for reducing LLM costs by 50-90% while maintaining quality.

The Cost Challenge

LLM costs can explode in production. A chatbot that costs $5/day during development can hit $5,000/month once real users show up, and that is before you factor in retries, context stuffing, and the developer who accidentally left gpt-4o hardcoded in the logging pipeline. The strategies in this chapter are not premature optimization; they are the difference between a sustainable product and one that bleeds money.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | 1M requests/month |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | ~$5,000+ |
| GPT-4o-mini | $0.15 | $0.60 | ~$300 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | ~$7,000+ |
| Claude 3.5 Haiku | $0.25 | $1.25 | ~$600 |

Token Counting and Tracking

Understanding Token Costs

The single most important cost insight: output tokens cost 4-5x more than input tokens at the prices above. This means a chatty system prompt that causes longer responses costs you far more than the prompt itself. If you can get a concise 50-token answer instead of a 200-token one, you cut the expensive side of the bill by 75%.
import tiktoken
from dataclasses import dataclass

# Pricing per 1M tokens (as of Dec 2024; confirm against current provider pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku": {"input": 0.25, "output": 1.25},
}

@dataclass
class TokenUsage:
    input_tokens: int
    output_tokens: int
    model: str
    
    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens
    
    @property
    def cost_usd(self) -> float:
        pricing = PRICING.get(self.model, {"input": 0, "output": 0})
        input_cost = (self.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (self.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

class TokenCounter:
    """Count and track token usage"""
    
    def __init__(self):
        self.encoders = {}
        self.total_usage = {"input": 0, "output": 0, "cost": 0.0}
    
    def get_encoder(self, model: str):
        if model not in self.encoders:
            try:
                self.encoders[model] = tiktoken.encoding_for_model(model)
            except KeyError:
                self.encoders[model] = tiktoken.get_encoding("cl100k_base")
        return self.encoders[model]
    
    def count(self, text: str, model: str = "gpt-4o") -> int:
        """Count tokens in text"""
        encoder = self.get_encoder(model)
        return len(encoder.encode(text))
    
    def count_messages(
        self,
        messages: list[dict],
        model: str = "gpt-4o"
    ) -> int:
        """Count tokens in message list"""
        total = 0
        encoder = self.get_encoder(model)
        
        for message in messages:
            # Message overhead
            total += 4  # role, content, etc.
            total += len(encoder.encode(message.get("content", "")))
            
            if "name" in message:
                total += len(encoder.encode(message["name"]))
        
        total += 2  # Assistant prefix
        return total
    
    def record(self, usage: TokenUsage):
        """Record usage for tracking"""
        self.total_usage["input"] += usage.input_tokens
        self.total_usage["output"] += usage.output_tokens
        self.total_usage["cost"] += usage.cost_usd
    
    def get_summary(self) -> dict:
        return self.total_usage.copy()

# Usage
counter = TokenCounter()

# Before API call - estimate cost
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing"}
]
estimated_input = counter.count_messages(messages)
print(f"Estimated input tokens: {estimated_input}")

# After API call - record actual usage
# (`response` is the object returned by client.chat.completions.create)
usage = TokenUsage(
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens,
    model="gpt-4o"
)
counter.record(usage)
print(f"Cost: ${usage.cost_usd:.6f}")
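
To make the output-versus-input asymmetry concrete, here is a small arithmetic sketch using the GPT-4o list prices from the table above. The helper name `request_cost` and the 500-token prompt size are illustrative assumptions, not part of the tracker above.

```python
# Illustrative arithmetic only; prices are the GPT-4o list prices quoted above.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at the prices above."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


# Same hypothetical 500-token prompt, verbose vs. concise answer.
verbose = request_cost(input_tokens=500, output_tokens=200)
concise = request_cost(input_tokens=500, output_tokens=50)

print(f"verbose: ${verbose:.6f}")
print(f"concise: ${concise:.6f}")
print(f"saved per request: {1 - concise / verbose:.0%}")
```

With this prompt size, trimming the answer from 200 to 50 tokens removes 75% of the output cost and a little under half of the total request cost. The shorter the prompt, the closer the total saving gets to 75%.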

Model Routing

Model routing is the highest-leverage cost optimization you can make. The idea is simple: not every question needs GPT-4o. “What’s your return policy?” can be answered by GPT-4o-mini for roughly 1/17th the cost, while “Analyze this contract for liability risks” genuinely needs the bigger model. Routing is the 80/20 rule in action: if 80% of requests are simple enough for the cheap model, you cut the cost of that traffic by about 94% at the list prices above. Route each request to the cheapest capable model:
from openai import OpenAI
from enum import Enum

client = OpenAI()

class TaskComplexity(Enum):
    SIMPLE = "simple"      # FAQ, basic Q&A
    MEDIUM = "medium"      # Summarization, analysis
    COMPLEX = "complex"    # Reasoning, coding, creative

class ModelRouter:
    """Route requests to appropriate models based on complexity"""
    
    MODEL_MAP = {
        TaskComplexity.SIMPLE: "gpt-4o-mini",
        TaskComplexity.MEDIUM: "gpt-4o-mini",
        TaskComplexity.COMPLEX: "gpt-4o"
    }
    
    def classify_complexity(self, query: str) -> TaskComplexity:
        """Classify query complexity"""
        # Use cheap model to classify
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Classify the complexity of this task:
                    - simple: Basic Q&A, greetings, simple lookups
                    - medium: Summarization, explanation, simple analysis
                    - complex: Multi-step reasoning, coding, creative writing
                    
                    Respond with just: simple, medium, or complex"""
                },
                {"role": "user", "content": query}
            ],
            max_tokens=10
        )
        
        result = response.choices[0].message.content.lower().strip()
        
        if "complex" in result:
            return TaskComplexity.COMPLEX
        elif "medium" in result:
            return TaskComplexity.MEDIUM
        return TaskComplexity.SIMPLE
    
    def route(self, query: str) -> str:
        """Get appropriate model for query"""
        complexity = self.classify_complexity(query)
        return self.MODEL_MAP[complexity]

# Usage
router = ModelRouter()

def smart_chat(user_input: str) -> str:
    model = router.route(user_input)
    print(f"Using model: {model}")
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}]
    )
    
    return response.choices[0].message.content
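
A rough way to estimate what routing buys you is to compute the blended cost relative to sending everything to GPT-4o. The sketch below uses the list prices from the table above and assumes both models see the same token counts per request. The function name `blended_cost_ratio` and the `classifier_overhead` parameter are hypothetical.

```python
# Per-1M-token list prices from the table above: (input, output) in USD.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}


def blended_cost_ratio(cheap_share: float, classifier_overhead: float = 0.0) -> float:
    """Cost of routed traffic relative to sending everything to gpt-4o.

    cheap_share: fraction of requests routed to gpt-4o-mini (an assumption
        you would measure from your own traffic).
    classifier_overhead: extra cost per request for the routing call,
        expressed as a fraction of one gpt-4o request (also an assumption).
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    # Input and output prices share the same mini/4o ratio (0.06),
    # so a single ratio covers the whole request.
    mini_ratio = PRICES["gpt-4o-mini"][0] / PRICES["gpt-4o"][0]
    return cheap_share * mini_ratio + (1.0 - cheap_share) + classifier_overhead


ratio = blended_cost_ratio(cheap_share=0.8)
print(f"routed traffic costs {ratio:.1%} of the all-GPT-4o baseline")
```

Under these assumptions, routing 80% of traffic to the cheap model leaves you paying about a quarter of the baseline. Almost all of the remaining cost comes from the 20% that still goes to GPT-4o, which is why misrouting simple queries upward is more expensive than it looks.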

Rule-Based Routing

For lower overhead, use rules instead of LLM classification. The irony of LLM-based routing is that you are spending tokens to decide how to save tokens. For high-throughput systems (over 1000 requests/minute), the classification call itself becomes a meaningful cost. Rule-based routing is free, instant, and surprisingly effective — simple regex patterns catch 70-80% of cases correctly.
import re

class RuleBasedRouter:
    """Route based on patterns and keywords"""
    
    COMPLEX_PATTERNS = [
        r"write.*code",
        r"debug",
        r"explain.*step",
        r"analyze.*complex",
        r"compare.*and.*contrast",
        r"create.*story",
        r"design.*system",
    ]
    
    SIMPLE_PATTERNS = [
        r"^(hi|hello|hey)\b",
        r"what time",
        r"weather",
        r"define\s+\w+$",
        r"^(yes|no|ok|thanks)\b",
    ]
    
    def __init__(self):
        self.complex_regex = [
            re.compile(p, re.IGNORECASE) for p in self.COMPLEX_PATTERNS
        ]
        self.simple_regex = [
            re.compile(p, re.IGNORECASE) for p in self.SIMPLE_PATTERNS
        ]
    
    def route(self, query: str) -> str:
        # Check simple patterns first
        for pattern in self.simple_regex:
            if pattern.search(query):
                return "gpt-4o-mini"
        
        # Check complex patterns
        for pattern in self.complex_regex:
            if pattern.search(query):
                return "gpt-4o"
        
        # Default to cheaper model
        return "gpt-4o-mini"
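
Before trusting a rule set in production, it helps to measure how often it agrees with hand-labeled examples. The following is a minimal sketch of such a check. The `evaluate_router` helper, the `toy_route` stand-in, and the four labeled queries are all made up for illustration; in practice you would pass in `RuleBasedRouter().route` and a sample of real traffic.

```python
import re
from typing import Callable


def evaluate_router(
    route: Callable[[str], str],
    labeled: list[tuple[str, str]],
) -> dict:
    """Compare a router's choices against hand-labeled expected models."""
    misroutes = []
    for query, expected in labeled:
        chosen = route(query)
        if chosen != expected:
            misroutes.append({"query": query, "expected": expected, "chosen": chosen})
    total = len(labeled)
    return {
        "accuracy": (total - len(misroutes)) / total if total else 0.0,
        "misroutes": misroutes,
    }


# Stand-in router so the sketch runs on its own.
_COMPLEX = re.compile(r"write.*code|debug|design.*system", re.IGNORECASE)


def toy_route(query: str) -> str:
    return "gpt-4o" if _COMPLEX.search(query) else "gpt-4o-mini"


LABELED = [
    ("hi there", "gpt-4o-mini"),
    ("write code to parse a CSV file", "gpt-4o"),
    ("debug this stack trace", "gpt-4o"),
    ("review this contract for liability risks", "gpt-4o"),  # no rule matches
]

report = evaluate_router(toy_route, LABELED)
print(f"accuracy: {report['accuracy']:.0%}")
for miss in report["misroutes"]:
    print("misrouted:", miss["query"])
```

The misroute list is the useful part: each entry is a candidate for a new pattern, or evidence that the query class needs a smarter classifier than regex.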

Caching Strategies

Caching is the closest thing to free money in AI engineering. If 100 users ask “What’s your refund policy?” today, you should call the LLM exactly once and serve the cached response 99 times. The two approaches below handle different scenarios: exact caching for deterministic queries (same input always means same output), and semantic caching for the real world where “refund policy,” “how do I get my money back,” and “return policy details” should all hit the same cache entry.

Semantic Caching

Cache responses for semantically similar queries:
import hashlib
import json
from openai import OpenAI
import numpy as np
from datetime import datetime, timedelta

client = OpenAI()

class SemanticCache:
    """Cache LLM responses with semantic similarity matching"""
    
    def __init__(
        self,
        similarity_threshold: float = 0.95,
        ttl_hours: int = 24
    ):
        # 0.95 is conservative -- very similar queries only. Lower to 0.90
        # for more cache hits, but test that answer quality doesn't degrade.
        # The sweet spot depends on how much variation your domain has.
        self.similarity_threshold = similarity_threshold
        self.ttl = timedelta(hours=ttl_hours)
        self.cache: list[dict] = []  # In production, use Redis/DB
    
    def _get_embedding(self, text: str) -> np.ndarray:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(response.data[0].embedding)
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def get(self, query: str) -> tuple[str | None, bool]:
        """Get cached response if similar query exists"""
        query_embedding = self._get_embedding(query)
        
        now = datetime.now()
        
        for entry in self.cache:
            # Check TTL
            if now - entry["timestamp"] > self.ttl:
                continue
            
            # Check similarity
            similarity = self._cosine_similarity(
                query_embedding,
                entry["embedding"]
            )
            
            if similarity >= self.similarity_threshold:
                return entry["response"], True  # Cache hit
        
        return None, False
    
    def set(self, query: str, response: str):
        """Cache a query-response pair"""
        embedding = self._get_embedding(query)
        
        self.cache.append({
            "query": query,
            "response": response,
            "embedding": embedding,
            "timestamp": datetime.now()
        })
    
    def clear_expired(self):
        """Remove expired entries"""
        now = datetime.now()
        self.cache = [
            e for e in self.cache
            if now - e["timestamp"] <= self.ttl
        ]

# Usage
cache = SemanticCache(similarity_threshold=0.92)

def cached_chat(user_input: str) -> dict:
    # Check cache
    cached, hit = cache.get(user_input)
    
    if hit:
        return {
            "response": cached,
            "cached": True,
            "tokens_saved": True
        }
    
    # Generate new response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}]
    )
    
    result = response.choices[0].message.content
    
    # Cache it
    cache.set(user_input, result)
    
    return {
        "response": result,
        "cached": False,
        "usage": response.usage
    }

Exact Match Caching

For deterministic queries (temperature=0, same system prompt, same user input), the output is effectively the same every time, so there is no reason to pay for the call twice. This pattern is especially powerful for classification, extraction, and structured output tasks where you control the full prompt and the user input has low variance.

Pitfall: Do not use exact caching with temperature > 0. The whole point of temperature is to introduce randomness; caching defeats that purpose and gives every user the same “creative” response.
import hashlib
import json

class ExactCache:
    """Simple exact-match cache for deterministic queries"""
    
    def __init__(self, max_size: int = 10000):
        self.cache = {}
        self.max_size = max_size
    
    def _hash_key(self, model: str, messages: list, **kwargs) -> str:
        """Create deterministic hash for request"""
        key_data = {
            "model": model,
            "messages": messages,
            **kwargs
        }
        key_str = json.dumps(key_data, sort_keys=True)
        return hashlib.sha256(key_str.encode()).hexdigest()
    
    def get(self, model: str, messages: list, **kwargs) -> str | None:
        key = self._hash_key(model, messages, **kwargs)
        return self.cache.get(key)
    
    def set(self, model: str, messages: list, response: str, **kwargs):
        if len(self.cache) >= self.max_size:
            # Remove oldest entry (FIFO)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
        
        key = self._hash_key(model, messages, **kwargs)
        self.cache[key] = response

# Usage with temperature=0 for deterministic responses
exact_cache = ExactCache()

def deterministic_chat(system: str, user: str) -> str:
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user}
    ]
    
    # Check cache
    cached = exact_cache.get("gpt-4o", messages, temperature=0)
    if cached is not None:  # an empty string is still a valid cached answer
        return cached
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0  # Deterministic
    )
    
    result = response.choices[0].message.content
    exact_cache.set("gpt-4o", messages, result, temperature=0)
    
    return result

Prompt Optimization

Prompt optimization is the low-hanging fruit that most teams skip. A system prompt full of filler words like “Please provide me with a detailed analysis” can be trimmed to “Analyze this” with no loss in quality. Over millions of requests, those extra 15 tokens per call add up to real money. The function below is a starting point — in practice, A/B test your shortened prompts to verify quality holds.

Reduce Prompt Length

import re

def optimize_prompt(prompt: str) -> str:
    """Reduce prompt tokens while preserving meaning.
    
    Tip: Run this on your system prompts, not user input. You control
    system prompts; modifying user input risks changing the meaning.
    """
    # Literal phrase swaps, applied in order (longer phrases first so
    # "Please provide me with" wins over "Please provide")
    phrase_swaps = [
        # Remove redundant phrases
        ("Please provide me with", "Give"),
        ("Please provide", "Give"),
        ("I would like you to", ""),
        ("Can you please", ""),
        ("It would be great if you could", ""),
        
        # Shorten instructions
        ("In the context of", "For"),
        ("With respect to", "For"),
        ("Make sure to", ""),
    ]
    
    result = prompt
    for old, new in phrase_swaps:
        result = result.replace(old, new)
    
    # Collapse runs of whitespace (including newlines) and trim the ends
    return re.sub(r"\s+", " ", result).strip()

# Example
long_prompt = """
Please provide me with a detailed analysis of the following text. 
I would like you to identify the main themes and summarize them.
It would be great if you could also highlight any key insights.
"""

short_prompt = optimize_prompt(long_prompt)
# "Give a detailed analysis of the following text. identify the main themes and summarize them. also highlight any key insights."
# Note: capitalization is not restored after a sentence opener is removed.

Context Compression

class ContextCompressor:
    """Compress context to reduce tokens"""
    
    def compress_for_rag(
        self,
        documents: list[str],
        query: str,
        max_tokens: int = 2000
    ) -> str:
        """Compress retrieved documents to fit token budget"""
        
        # Get most relevant sentences
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": f"""Extract the most relevant sentences from these documents for answering the query.
                    Keep only essential information. Target: {max_tokens} tokens max.
                    
                    Query: {query}"""
                },
                {
                    "role": "user",
                    "content": "\n\n".join(documents)
                }
            ]
        )
        
        return response.choices[0].message.content
    
    def summarize_history(
        self,
        messages: list[dict],
        max_tokens: int = 500
    ) -> str:
        """Summarize conversation history"""
        
        history = "\n".join([
            f"{m['role']}: {m['content']}"
            for m in messages
        ])
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": f"Summarize this conversation in under {max_tokens} tokens, preserving key facts and decisions."
                },
                {"role": "user", "content": history}
            ]
        )
        
        return response.choices[0].message.content
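
One way to put `summarize_history` to work is to keep the most recent turns verbatim and replace everything older with a single summary message. The sketch below shows that shape. The `compact_history` function and its `keep_last` parameter are hypothetical names, and `fake_summarize` is a stand-in so the example runs without an API call; in practice you could pass `ContextCompressor().summarize_history` instead.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_last: int = 4,
) -> list[dict]:
    """Replace older turns with one summary message, keeping recent turns verbatim."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_last:
        return list(messages)  # nothing old enough to summarize

    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(older)}",
    }
    return system + [summary] + recent


# Stand-in summarizer so the sketch runs without an API call.
def fake_summarize(older: list[dict]) -> str:
    return f"{len(older)} earlier messages omitted"


history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

compacted = compact_history(history, fake_summarize, keep_last=4)
print(len(history), "messages ->", len(compacted), "messages")
```

The summary call itself costs tokens, so this only pays off once the conversation is long enough that resending the older turns would cost more than summarizing them once.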

Batching and Async

Batching is about amortizing overhead. If 10 classification requests arrive within 100ms of each other, sending them one after another means paying the round-trip latency 10 times in series. There are two ways to do better: combine them into a single prompt (one request, and the shared instructions are billed once) or issue them as parallel async calls (still 10 requests and the same tokens billed, but roughly the latency of one). The pattern below takes the second approach: it collects requests into a batch, waits briefly for stragglers, then fires them all concurrently.

Batch Similar Requests

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

class RequestBatcher:
    """Batch similar requests for efficiency"""
    
    def __init__(self, batch_size: int = 10, wait_time: float = 0.1):
        self.batch_size = batch_size
        self.wait_time = wait_time
        self.pending: list[tuple] = []
        self.lock = asyncio.Lock()
    
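    async def add_request(
        self,
        messages: list[dict],
        model: str = "gpt-4o-mini"
    ) -> str:
        """Add request to batch and wait for result"""
        future = asyncio.get_running_loop().create_future()
        
        async with self.lock:
            self.pending.append((messages, model, future))
            full = len(self.pending) >= self.batch_size
            batch = self._take_pending() if full else []
        
        if not full:
            # Wait a bit for more requests to batch
            await asyncio.sleep(self.wait_time)
            async with self.lock:
                batch = self._take_pending()
        
        # Call the API outside the lock so new requests can keep queuing
        await self._process_batch(batch)
        return await future
    
    def _take_pending(self) -> list[tuple]:
        """Detach and return the current batch (call while holding the lock)"""
        batch, self.pending = self.pending, []
        return batch
    
    async def _process_batch(self, batch: list[tuple]):
        """Send every request in the batch concurrently"""
        if not batch:
            return
        
        tasks = [
            async_client.chat.completions.create(
                model=model,
                messages=messages
            )
            for messages, model, _ in batch
        ]
        
        # return_exceptions=True so one failed call cannot leave the
        # other callers waiting forever on an unresolved future
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Resolve futures
        for (_, _, future), response in zip(batch, responses):
            if isinstance(response, BaseException):
                future.set_exception(response)
            else:
                future.set_result(response.choices[0].message.content)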

# Usage
batcher = RequestBatcher(batch_size=10)

async def batch_chat(queries: list[str]) -> list[str]:
    tasks = [
        batcher.add_request([{"role": "user", "content": q}])
        for q in queries
    ]
    return await asyncio.gather(*tasks)
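
Firing a whole batch at once can itself trip rate limits. A common complement is to cap how many calls are in flight at any moment. The sketch below does this with `asyncio.Semaphore`. The `run_bounded` helper and the `fake_llm_call` stand-in are hypothetical; in real code each job would wrap an `async_client.chat.completions.create(...)` call.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    jobs: list[Callable[[], Awaitable[str]]],
    max_concurrent: int = 5,
) -> list[str]:
    """Run jobs concurrently, never more than max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))


# Stand-in for an API call so the sketch runs without network access.
in_flight = 0
peak_in_flight = 0


async def fake_llm_call(query: str) -> str:
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"answer to {query}"


async def main() -> list[str]:
    queries = [f"q{i}" for i in range(12)]
    # q=q binds the current query to each lambda
    jobs = [lambda q=q: fake_llm_call(q) for q in queries]
    return await run_bounded(jobs, max_concurrent=3)


results = asyncio.run(main())
print(len(results), "results, peak concurrency", peak_in_flight)
```

The right value for `max_concurrent` depends on your account's rate limits; treat the numbers here as placeholders.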

Cost Monitoring Dashboard

from dataclasses import dataclass, field
from datetime import date
from collections import defaultdict

# Reuses TokenUsage (and PRICING) from "Understanding Token Costs" above

@dataclass
class CostTracker:
    """Track and analyze LLM costs"""
    
    daily_costs: dict = field(default_factory=lambda: defaultdict(float))
    model_costs: dict = field(default_factory=lambda: defaultdict(float))
    request_count: dict = field(default_factory=lambda: defaultdict(int))
    
    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int
    ):
        today = date.today().isoformat()
        
        usage = TokenUsage(input_tokens, output_tokens, model)
        cost = usage.cost_usd
        
        self.daily_costs[today] += cost
        self.model_costs[model] += cost
        self.request_count[model] += 1
    
    def get_daily_report(self) -> dict:
        return {
            "daily_costs": dict(self.daily_costs),
            "model_breakdown": dict(self.model_costs),
            "request_counts": dict(self.request_count),
            "total_cost": sum(self.daily_costs.values()),
            "avg_cost_per_request": (
                sum(self.daily_costs.values()) / 
                max(sum(self.request_count.values()), 1)
            )
        }
    
    def check_budget(
        self,
        daily_limit: float,
        alert_threshold: float = 0.8
    ) -> dict:
        today = date.today().isoformat()
        current = self.daily_costs.get(today, 0)
        
        return {
            "current_spend": current,
            "daily_limit": daily_limit,
            "remaining": daily_limit - current,
            "utilization": current / daily_limit,
            "alert": current >= daily_limit * alert_threshold
        }

# Usage
tracker = CostTracker()

def tracked_chat(user_input: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}]
    )
    
    # Track costs
    tracker.record(
        model=model,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens
    )
    
    # Check budget
    budget = tracker.check_budget(daily_limit=100.0)
    if budget["alert"]:
        print(f"⚠️ Budget alert: ${budget['current_spend']:.2f} / ${budget['daily_limit']}")
    
    return response.choices[0].message.content
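
The tracker above only prints an alert. If you also want a hard stop, one option is a small guard that refuses a call once the daily limit would be exceeded. This is a sketch under assumptions: `BudgetGuard`, `BudgetExceeded`, and the in-memory counter are hypothetical, and a multi-process deployment would need shared storage instead.

```python
from datetime import date


class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the daily limit."""


class BudgetGuard:
    """In-memory daily spend limit (single process only)."""

    def __init__(self, daily_limit_usd: float):
        if daily_limit_usd <= 0:
            raise ValueError("daily_limit_usd must be positive")
        self.daily_limit_usd = daily_limit_usd
        self._day = date.today()
        self._spent = 0.0

    def _roll_over(self) -> None:
        # Reset the counter when the calendar day changes
        today = date.today()
        if today != self._day:
            self._day, self._spent = today, 0.0

    def check(self, estimated_cost_usd: float = 0.0) -> None:
        """Call before the API request; raises if the limit would be exceeded."""
        self._roll_over()
        if self._spent + estimated_cost_usd > self.daily_limit_usd:
            raise BudgetExceeded(
                f"spent ${self._spent:.2f} of ${self.daily_limit_usd:.2f} today"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Call after the API request with the real cost."""
        self._roll_over()
        self._spent += actual_cost_usd


guard = BudgetGuard(daily_limit_usd=1.00)
guard.check(estimated_cost_usd=0.50)   # passes: nothing spent yet
guard.record(actual_cost_usd=0.90)
try:
    guard.check(estimated_cost_usd=0.20)  # 0.90 + 0.20 > 1.00
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Whether to block, degrade to a cheaper model, or queue the request when the guard trips is a product decision; the guard only tells you the limit has been reached.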

Cost Optimization Checklist

Use Cheaper Models

GPT-4o-mini is roughly 17x cheaper than GPT-4o at list prices and good enough for many tasks

Implement Caching

Cache responses to avoid repeated API calls for similar queries

Compress Context

Reduce prompt and context size before sending to API

Set Budgets

Implement daily/monthly budget limits with alerts

Quick Wins

| Strategy | Effort | Savings |
| --- | --- | --- |
| Switch to mini models | Low | 50-80% |
| Add response caching | Medium | 20-50% |
| Prompt optimization | Low | 10-30% |
| Model routing | Medium | 30-50% |
| Context compression | High | 20-40% |

Cost Optimization Decision Framework

The order matters. Most teams jump straight to complex solutions (fine-tuning, custom models) when the cheapest wins are sitting right in front of them. Follow this sequence.
| Priority | Action | Expected Savings | Prerequisites |
| --- | --- | --- | --- |
| 1 (do first) | Audit which model is used where; replace GPT-4o with GPT-4o-mini for non-critical paths | 50-80% on those paths | None (this is a config change) |
| 2 | Add exact-match caching for deterministic requests (classification, extraction) | 20-40% of total volume | temperature=0 on those endpoints |
| 3 | Shorten system prompts: remove filler, redundant instructions | 10-20% on input tokens | A/B test to verify quality holds |
| 4 | Implement model routing (cheap model for simple, expensive for complex) | 30-50% overall | A complexity classifier or rule set |
| 5 | Add semantic caching for user-facing chat | 15-30% on repeated queries | Embedding infrastructure for similarity matching |
| 6 | Compress RAG context before generation | 20-40% on input tokens | A compression model or extractive pipeline |
| 7 | Fine-tune a smaller model to replace GPT-4o on your most common task | 60-90% on that task | 500+ labeled examples, evaluation pipeline |
When NOT to optimize:
  • If your total LLM spend is under $100/month, your engineering time costs more than the savings. Focus on product, not cost.
  • If you have no monitoring, you are optimizing blind. Instrument first (track cost per request, per user, per endpoint), then optimize.
  • If quality is not measured, you cannot tell whether your “optimization” degraded the product. Set up an evaluation suite before cutting costs.

Edge Cases in Cost Management

Retry storms after API outages. Your retry logic fires 3 attempts per failed request. During an OpenAI outage affecting 1000 requests/minute, that becomes 3000 retries/minute, tripling your cost on the recovery spike and potentially hitting rate limits. Add circuit breakers (see the Deployment and Scaling chapter) and cap total retries per time window, not just per request.

Streaming responses that get cancelled. A user starts a chat, gets impatient after 2 seconds, and navigates away. The backend keeps generating tokens until completion. Those output tokens are billed even though nobody reads them. Implement cancellation propagation: when the SSE connection drops, abort the API call.

Embedding costs hiding in plain sight. Each RAG query embeds the user’s question. Each document upload embeds every chunk. At 1000 queries/day with 5 re-uploads, that is 1000+ embedding calls that don’t show up in your chat cost tracking. Track embedding costs separately; they can be 10-30% of total spend for RAG-heavy applications.

Development and testing costs. Your test suite makes real API calls. Your developers run ad-hoc experiments. Without a separate budget tracker for non-production usage, these costs blend into production metrics and distort your per-user economics. Use separate API keys for dev/staging/prod.
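
For the retry-storm case, capping retries per time window can be sketched as a shared sliding-window budget that every retry must draw from. The `RetryBudget` class and its parameters are hypothetical, and the injectable `clock` exists only to make the example testable without sleeping.

```python
import time
from collections import deque
from typing import Callable


class RetryBudget:
    """Allow at most max_retries retries, across all requests, per window."""

    def __init__(
        self,
        max_retries: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps: deque[float] = deque()

    def try_acquire(self) -> bool:
        """Return True if a retry is allowed now, recording it if so."""
        now = self._clock()
        # Drop retries that have aged out of the window
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_retries:
            return False  # budget exhausted: fail fast instead of retrying
        self._timestamps.append(now)
        return True


# Demo with a fake clock so the behavior is deterministic.
fake_now = [0.0]
budget = RetryBudget(max_retries=2, window_seconds=10.0, clock=lambda: fake_now[0])

print(budget.try_acquire())  # first retry in the window
print(budget.try_acquire())  # second retry in the window
print(budget.try_acquire())  # third is refused
fake_now[0] = 10.0           # window has passed
print(budget.try_acquire())  # allowed again
```

In a retry loop you would call `try_acquire()` before each retry attempt and surface the error to the caller when it returns False, so an outage cannot multiply your request volume.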

What’s Next

Multi-Agent Design Patterns

Learn advanced patterns for building multi-agent AI systems

Interview Deep-Dive

Strong Answer:
  • The highest-leverage move is model routing. Analyze traffic logs and you will find 60-80% of queries are simple: greetings, FAQ lookups, status checks. Route those to GPT-4o-mini at roughly 1/17th the cost. Reserve GPT-4o for complex reasoning and analysis. A rule-based router (regex patterns) costs zero and catches 70% of cases. This alone cuts costs by 50-70%, bringing $5,000 down to $1,500-2,500.
  • Second: response caching. In most chatbot deployments, 20-30% of queries are near-duplicates. A semantic cache with a 0.92 similarity threshold eliminates redundant calls. The embedding cost for cache lookup is negligible compared to the GPT-4o call it replaces. This saves another 15-25%.
  • Third: optimize prompts. Output tokens cost 4x more than input tokens. If your system prompt encourages verbose responses, you are paying a premium for wordiness. Shorten system prompts, add “be concise” instructions, set max_tokens appropriately. Trimming average response length from 200 to 80 tokens saves 60% on output costs.
  • Fourth: compress conversation context. Sending full history on every turn means paying for the same messages repeatedly. Summarize older turns and keep only the last 4-5 verbatim. A 20-turn conversation that sends all history uses 10x more input tokens than one that summarizes after turn 5.
  • Combined, these achieve 70-90% reduction. $5,000 becomes $500-1,000 without perceptible quality loss.
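
Savings from stacked optimizations do not simply add up, because each step only reduces what the previous steps left behind. The sketch below shows that arithmetic for the routing and caching ranges quoted above. It assumes each percentage applies to the remaining spend, which is one reading of those figures; `combined_savings` is a hypothetical helper.

```python
def combined_savings(*step_savings: float) -> float:
    """Total fraction saved when each step cuts what earlier steps left."""
    remaining = 1.0
    for saving in step_savings:
        if not 0.0 <= saving <= 1.0:
            raise ValueError("each saving must be between 0 and 1")
        remaining *= 1.0 - saving
    return 1.0 - remaining


# Routing (50-70%) followed by caching (15-25%), low and high ends.
low = combined_savings(0.50, 0.15)
high = combined_savings(0.70, 0.25)
print(f"routing + caching: {low:.1%} to {high:.1%}")
```

Under this reading, routing plus caching lands somewhere between the high 50s and high 70s in percent, so prompt trimming and context summarization would have to supply the rest of the 70-90% range.
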
Follow-up: The product team worries about routing mistakes: ‘What if the cheap model gives a bad answer to a complex question the router misclassified?’ How do you handle this?
Implement a quality feedback loop. Log which model handled each request. Add a confidence check on the cheap model’s output: if GPT-4o-mini’s response is unusually short, contains hedging, or the user immediately rephrases the same question, auto-escalate to GPT-4o and re-answer. Sample 100 routed requests weekly for human quality review. The cost of occasional escalation (maybe 5% of cheap-routed queries re-sent to GPT-4o) is far less than routing everything to the expensive model.
Strong Answer:
  • The similarity threshold is a precision-recall trade-off for your cache. Too high (0.98+) and the cache barely hits — only near-identical queries match, defeating the purpose. Too low (0.85) and you serve wrong answers for queries that are similar but not similar enough. The sweet spot is 0.92-0.95 for most applications.
  • The right way to choose: collect 500+ real queries, compute pairwise similarity, and have humans label whether the same response is appropriate for both. Find the threshold where precision exceeds 95% (almost never serve a wrong cached response) while recall is reasonable (catch 20-30% of cache-eligible queries).
  • Too-high failure mode: you spend on embedding lookups but almost never hit. A cache with a 2% hit rate might increase costs.
  • Too-low failure mode: “What is your refund policy?” and “What is your privacy policy?” might score 0.88 similarity because they share structure, but they need completely different answers. Serving the wrong cached answer is a trust-destroying silent bug.
  • Domain matters enormously. In customer service, similar-sounding questions often need different answers (“cancel my order” vs “cancel my account”). In technical documentation, similar questions often have similar answers. Tune per-domain.
Follow-up: Your cache works well but responses become stale: the return policy changed last week but the cache still serves the old version. How do you handle invalidation?
TTL is the first defense: 24-48 hours for changing information, longer for static content. The smarter approach is event-driven invalidation: when someone updates the return policy in your knowledge base, invalidate all cache entries whose source documents include that page. This requires tagging cached responses with source document IDs at write time. For critical domains (pricing, legal terms), add a freshness check: before serving a cached response, verify source documents have not been modified since the entry was created.
Strong Answer:
  • Rule-based routing is free, instant (microseconds), and deterministic. Write 20 regex patterns for simple queries and 15 for complex ones. It catches 70-80% correctly with zero API cost and zero latency. The downside: it misclassifies the 20-30% that match no pattern, and maintaining rules as the product evolves is tedious.
  • LLM-based routing is more accurate (90%+) and handles novel query types without rule updates. The downside: every routing decision costs an API call. At GPT-4o-mini rates, the cost is small, but at 1,000 requests per minute the routing calls alone cost $200/month. More importantly, they add 100-200ms latency to every request.
  • Ship rule-based first, for three reasons. First, immediate savings with zero spend. Second, it generates data — log every query with its classification and actual model, then review misclassifications. Third, after two weeks you have 50,000+ labeled examples to train a tiny local classifier that is more accurate than the LLM router and runs in 1ms.
  • The production answer is a hybrid: rules for the easy 70%, a local classifier for the next 20%, LLM classification for the ambiguous 10%. This gives 95%+ accuracy at near-zero cost.
Follow-up: A month later, 12% of queries routed to GPT-4o-mini get poor responses because they were more complex than the rules detected. What is your fix?
Add a quality gate on the cheap model’s output. After GPT-4o-mini responds, run a fast heuristic: is the response under 10 tokens? Does it contain hedging? Did the user immediately rephrase? If any trigger fires, silently re-send to GPT-4o and replace the response. Track the escalation rate: under 5% means the system works. Above 10% means rules need tightening. Even with 10% escalation, you save 85% compared to routing everything to GPT-4o.
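
A quality gate of this kind can start as a few lines of heuristics. The sketch below is one possible shape, not a tuned implementation: `needs_escalation` is a hypothetical function, the hedging phrases are examples, and word count stands in for token count to keep the example dependency-free.

```python
import re

# Example hedging phrases; a real list would come from reviewing bad answers.
HEDGING = re.compile(
    r"\b(i[’']m not sure|i am not sure|i don[’']t know|it depends|cannot determine)\b",
    re.IGNORECASE,
)


def needs_escalation(
    response_text: str,
    user_rephrased: bool = False,
    min_words: int = 10,
) -> bool:
    """Return True if the cheap model's answer should be redone by the big model."""
    too_short = len(response_text.split()) < min_words  # proxy for "under 10 tokens"
    hedged = HEDGING.search(response_text) is not None
    return too_short or hedged or user_rephrased


print(needs_escalation("Yes."))
print(needs_escalation(
    "Our return policy allows refunds within 30 days of purchase with a valid receipt."
))
print(needs_escalation(
    "I'm not sure, it could be many things depending on the details of your contract."
))
```

Some tasks legitimately produce short answers (classification, yes/no), so the length trigger should be disabled or lowered on those endpoints to avoid escalating everything.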
Strong Answer:
  • Explicit length constraints in the system prompt are most effective. “Respond in 1-2 sentences” or “Maximum 50 words” dramatically reduces output. But crude limits hurt quality — “explain quantum computing in 10 words” produces garbage. Match the constraint to the task: classification needs 1 token, yes/no needs 1 sentence, explanations need a paragraph, code needs as long as necessary.
  • The max_tokens parameter is a hard ceiling, not quality control. Setting max_tokens=100 stops mid-sentence, which looks broken. Use it as a safety net (2x expected length), not the primary control. System prompt instructions let the model self-regulate.
  • Structured output eliminates prose overhead. Instead of “The sentiment is positive with 92% confidence because the user expressed satisfaction,” return JSON: {"sentiment": "positive", "confidence": 0.92}. JSON responses are 30-60% shorter for extraction tasks.
  • Response streaming with early termination is advanced but powerful. Stream the response, and if the first 50 tokens already contain the answer, cancel the stream. This requires careful UX but cuts output costs significantly for factual queries.
  • The meta-insight: output optimization and model routing are multiplicative. A simple query on GPT-4o with a 200-token response costs about $0.01. The same query on GPT-4o-mini with a 30-token response costs about $0.0001, a 100x reduction.
Follow-up: Your PM says users prefer longer, detailed responses and NPS is higher with verbose answers. How do you balance cost and satisfaction?
This is a business decision. The answer is segmentation: free-tier users get concise GPT-4o-mini responses, paid users get detailed GPT-4o responses. If you cannot tier by plan, A/B test response length and measure both NPS and cost. The NPS difference is usually smaller than PMs expect; users care more about accuracy and speed than verbosity. But if data shows verbose wins, calculate the ROI: the NPS improvement from verbosity is worth $X in retention, and the extra cost is $Y. Ship the verbose version only if X > Y.