December 2025 Update: Production-ready caching patterns including semantic caching, Redis integration, and OpenAI’s prompt caching.

Why Cache LLM Responses?

Think of LLM caching like a restaurant kitchen. If ten customers order the same dish, a smart kitchen does not cook it from scratch each time — it prepares a batch and plates from that. LLM caching works the same way: identical or semantically similar requests get served from a stored result instead of burning GPU cycles and dollars on a fresh inference. LLM calls are expensive and slow:
Metric        Without Caching    With Caching
Latency       500-3000ms         5-50ms
Cost          $0.01-0.10/call    $0 for cache hits
Rate Limits   Easily hit         Reduced pressure
Cache Hit Rate   Cost Savings   Latency Improvement
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   50%              50%              10x
   80%              80%              50x
   95%              95%              100x
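
To sanity-check figures like these for your own workload, a back-of-the-envelope model helps. The sketch below assumes cache hits are free and uses illustrative numbers (a $0.05 call, 1500 ms on a miss, 20 ms on a hit); the function name and defaults are hypothetical, not measurements.

```python
def expected_per_request(hit_rate, cost_per_call=0.05, miss_ms=1500.0, hit_ms=20.0):
    """Return (expected cost in dollars, expected latency in ms) for one request."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    # Assumption: a cache hit costs nothing, a miss costs one full LLM call.
    cost = miss_rate * cost_per_call
    # Mean latency is a weighted average of the hit path and the miss path.
    latency = hit_rate * hit_ms + miss_rate * miss_ms
    return cost, latency


for rate in (0.5, 0.8, 0.95):
    cost, latency = expected_per_request(rate)
    print(f"hit rate {rate:.0%}: ${cost:.4f}/request, {latency:.0f} ms mean latency")
```

Under this model, cost falls in direct proportion to the hit rate, while mean latency improves more slowly because every miss still pays the full inference time. The large speedups apply to the individual requests that hit.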

OpenAI Prompt Caching (Built-in)

OpenAI automatically caches the shared prefix of prompts that are 1,024 tokens or longer, with no code changes required:
from openai import OpenAI

client = OpenAI()

# Long system prompt - gets cached after first call
SYSTEM_PROMPT = """You are an expert customer service agent for TechCorp.

[Insert 2000+ tokens of product documentation, FAQs, policies...]

Always be helpful, accurate, and follow company guidelines.
"""

# First call: Full price
response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What's your return policy?"}
    ]
)

# Second call: Cached prefix = 50% discount on cached tokens!
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Cached!
        {"role": "user", "content": "How do I track my order?"}
    ]
)

# Check cache usage
print(f"Cached tokens: {response2.usage.prompt_tokens_details.cached_tokens}")
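
To turn the cached-token count into a dollar figure, a small helper is enough. This is a sketch that assumes the 50% cached-token discount described above and the $2.50 per million input tokens quoted for GPT-4o later in this guide; the function name and the token counts are hypothetical.

```python
def effective_input_cost(prompt_tokens, cached_tokens, price_per_million, cached_discount=0.5):
    """Estimate the input-side cost of one request, in dollars."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens are billed at a discount (assumed 50% here).
    billable_tokens = uncached_tokens + cached_tokens * (1 - cached_discount)
    return billable_tokens * price_per_million / 1_000_000


# Hypothetical request: 3000 prompt tokens, 2000 of them served from the prefix cache.
cold = effective_input_cost(3000, 0, price_per_million=2.50)
warm = effective_input_cost(3000, 2000, price_per_million=2.50)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

In practice the two token counts would come from response.usage.prompt_tokens and response.usage.prompt_tokens_details.cached_tokens.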

Maximizing Prompt Cache Hits

class PromptCacheOptimizer:
    """Optimize prompts for OpenAI's prompt caching"""
    
    def __init__(self, base_system_prompt: str):
        # Static content at the beginning (gets cached)
        self.static_prefix = base_system_prompt
        
    def build_prompt(
        self,
        dynamic_context: str,
        user_message: str
    ) -> list[dict]:
        """Build prompt with cacheable prefix"""
        
        # Structure: [Static (cached)] + [Dynamic] + [User]
        return [
            {
                "role": "system",
                "content": self.static_prefix + "\n\n" + dynamic_context
            },
            {"role": "user", "content": user_message}
        ]

# Usage
optimizer = PromptCacheOptimizer("""
You are an expert assistant with deep knowledge of our products.

# Product Catalog
[2000+ tokens of static product info...]

# Company Policies  
[1000+ tokens of static policies...]

# Response Guidelines
- Be concise and helpful
- Always cite sources
- Admit when unsure
""")

# The static prefix will be cached across all calls
messages = optimizer.build_prompt(
    dynamic_context="Customer is a VIP member since 2020",
    user_message="Can I get a discount?"
)

Exact Match Caching

Cache identical requests with deterministic settings.
Caching pitfall: Only cache requests where temperature=0. With any temperature above zero, the model is intentionally non-deterministic — you would be returning stale creative output and masking the intended variety. A common production bug is caching all responses regardless of temperature, then wondering why the chatbot sounds repetitive.
import hashlib
import json
from datetime import datetime, timedelta
from typing import Optional
import redis

class ExactMatchCache:
    """Cache exact query matches"""
    
    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        ttl_hours: int = 24
    ):
        self.redis = redis.from_url(redis_url)
        self.ttl = timedelta(hours=ttl_hours)
    
    def _hash_request(
        self,
        model: str,
        messages: list[dict],
        temperature: float = 1.0,  # matches the API default; avoids a TypeError when omitted
        **kwargs
    ) -> str:
        """Create deterministic hash of request"""
        key_data = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            **{k: v for k, v in sorted(kwargs.items())}
        }
        key_str = json.dumps(key_data, sort_keys=True)
        return hashlib.sha256(key_str.encode()).hexdigest()
    
    def get(
        self,
        model: str,
        messages: list[dict],
        **kwargs
    ) -> Optional[str]:
        """Get cached response"""
        key = self._hash_request(model, messages, **kwargs)
        cached = self.redis.get(f"llm:{key}")
        return cached.decode() if cached else None
    
    def set(
        self,
        model: str,
        messages: list[dict],
        response: str,
        **kwargs
    ):
        """Cache a response"""
        key = self._hash_request(model, messages, **kwargs)
        self.redis.setex(
            f"llm:{key}",
            self.ttl,
            response
        )
    
    def get_stats(self) -> dict:
        """Get cache statistics"""
        info = self.redis.info()
        return {
            "hits": info.get("keyspace_hits", 0),
            "misses": info.get("keyspace_misses", 0),
            "hit_rate": info.get("keyspace_hits", 0) / 
                       max(info.get("keyspace_hits", 0) + info.get("keyspace_misses", 0), 1)
        }

# Usage with OpenAI
from openai import OpenAI

client = OpenAI()
cache = ExactMatchCache()

def cached_completion(messages: list[dict], **kwargs) -> str:
    # Only cache deterministic requests
    if kwargs.get("temperature", 1.0) == 0:
        cached = cache.get("gpt-4o", messages, **kwargs)
        if cached is not None:
            return cached
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        **kwargs
    )
    
    result = response.choices[0].message.content
    
    # Cache deterministic responses
    if kwargs.get("temperature", 1.0) == 0:
        cache.set("gpt-4o", messages, result, **kwargs)
    
    return result
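
The hashing step is what makes exact-match caching reliable, so it is worth checking in isolation. The sketch below is a standalone variant of the key derivation (the request_key name and the sample parameters are hypothetical) showing that keyword order does not change the key, while any change in a parameter value does.

```python
import hashlib
import json


def request_key(model, messages, **params):
    """Derive a deterministic cache key from the full request."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys gives a canonical ordering, so keyword order does not affect the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "llm:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


messages = [{"role": "user", "content": "What's your return policy?"}]
key_a = request_key("gpt-4o", messages, temperature=0, max_tokens=200)
key_b = request_key("gpt-4o", messages, max_tokens=200, temperature=0)
key_c = request_key("gpt-4o", messages, temperature=0, max_tokens=300)
key_d = request_key("gpt-4o", messages, temperature=0.0, max_tokens=200)

print(key_a == key_b)  # same request, different keyword order
print(key_a == key_c)  # different max_tokens
print(key_a == key_d)  # 0 versus 0.0 serialize differently in JSON
```

One gotcha this exposes: temperature=0 and temperature=0.0 serialize to different JSON, so they produce different keys. Normalizing numeric parameters before hashing avoids silently splitting the cache.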

Semantic Caching

Cache based on meaning, not exact match:
from openai import OpenAI
import numpy as np
from typing import Optional
import json

class SemanticCache:
    """Cache based on semantic similarity"""
    
    def __init__(
        self,
        similarity_threshold: float = 0.95,
        embedding_model: str = "text-embedding-3-small"
    ):
        self.client = OpenAI()
        self.threshold = similarity_threshold
        self.embedding_model = embedding_model
        
        # In-memory store (use Redis/Pinecone in production)
        self.cache: list[dict] = []
    
    def _get_embedding(self, text: str) -> np.ndarray:
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return np.array(response.data[0].embedding)
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def get(self, query: str) -> Optional[str]:
        """Find semantically similar cached response"""
        if not self.cache:
            return None
        
        query_embedding = self._get_embedding(query)
        
        best_match = None
        best_score = 0
        
        for entry in self.cache:
            similarity = self._cosine_similarity(
                query_embedding, 
                entry["embedding"]
            )
            if similarity > best_score and similarity >= self.threshold:
                best_score = similarity
                best_match = entry
        
        if best_match:
            return best_match["response"]
        return None
    
    def set(self, query: str, response: str):
        """Cache a query-response pair"""
        embedding = self._get_embedding(query)
        
        self.cache.append({
            "query": query,
            "response": response,
            "embedding": embedding
        })

# Usage
semantic_cache = SemanticCache(similarity_threshold=0.92)

def smart_completion(user_query: str) -> str:
    # Check semantic cache
    cached = semantic_cache.get(user_query)
    if cached:
        print("🎯 Semantic cache hit!")
        return cached
    
    # Call LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_query}]
    )
    
    result = response.choices[0].message.content
    
    # Cache the response
    semantic_cache.set(user_query, result)
    
    return result

# These would likely hit the same cache entry:
# "What is machine learning?"
# "Can you explain machine learning?"
# "What's ML?"
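
The lookup logic can be tested without calling the embeddings API. The sketch below uses tiny hand-made 3-dimensional vectors as stand-ins for real embeddings (purely illustrative values) and the standard library instead of NumPy; best_match is a hypothetical helper that mirrors the threshold check in SemanticCache.get.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries, threshold):
    """Return the most similar entry at or above the threshold, else None."""
    best_entry, best_score = None, threshold
    for entry in entries:
        score = cosine_similarity(query_vec, entry["embedding"])
        if score >= best_score:
            best_entry, best_score = entry, score
    return best_entry


# Toy embeddings: made-up vectors, not output from a real embedding model.
entries = [
    {"query": "What is machine learning?", "embedding": [0.9, 0.1, 0.0], "response": "ML answer"},
    {"query": "How do I reset my password?", "embedding": [0.0, 0.2, 0.9], "response": "Password answer"},
]

hit = best_match([0.88, 0.15, 0.02], entries, threshold=0.92)   # close to the ML entry
miss = best_match([0.5, 0.5, 0.5], entries, threshold=0.92)     # not close to anything
print(hit["response"] if hit else None, miss)
```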

Production Semantic Cache with Redis

import hashlib
import json
from typing import Optional

import numpy as np
import redis
from openai import OpenAI

class ProductionSemanticCache:
    """Production-ready semantic cache with Redis"""

    def __init__(
        self,
        redis_url: str,
        similarity_threshold: float = 0.93,
        max_entries: int = 10000
    ):
        self.redis = redis.from_url(redis_url)
        self.threshold = similarity_threshold
        self.max_entries = max_entries
        self.client = OpenAI()

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def get(self, query: str, context_key: str = "default") -> Optional[str]:
        """Get cached response with context isolation"""
        query_embedding = np.array(self._get_embedding(query))

        best_match = None
        best_score = 0

        # SCAN iterates incrementally; KEYS would block Redis on a large keyspace.
        # This is still a linear scan, so use a vector index for large caches.
        for key in self.redis.scan_iter(match=f"semantic:{context_key}:*"):
            raw = self.redis.get(key)
            if raw is None:
                continue  # entry expired between SCAN and GET
            entry = json.loads(raw)
            cached_embedding = np.array(entry["embedding"])

            similarity = np.dot(query_embedding, cached_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
            )

            if similarity > best_score and similarity >= self.threshold:
                best_score = similarity
                best_match = entry

        return best_match["response"] if best_match else None

    def set(
        self,
        query: str,
        response: str,
        context_key: str = "default",
        ttl_seconds: int = 86400
    ):
        """Cache with TTL and context isolation"""
        embedding = self._get_embedding(query)

        entry = {
            "query": query,
            "response": response,
            "embedding": embedding
        }

        # Use a stable digest of the query as key. The built-in hash() is
        # salted per process, so it would yield different keys on each worker.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        key = f"semantic:{context_key}:{digest}"
        self.redis.setex(key, ttl_seconds, json.dumps(entry))

        # Enforce max entries
        self._enforce_limit(context_key)

    def _enforce_limit(self, context_key: str):
        """Trim the context to 90% of max_entries when over the limit"""
        keys = list(self.redis.scan_iter(match=f"semantic:{context_key}:*"))
        if len(keys) > self.max_entries:
            # SCAN returns keys in no particular order, so sort by remaining TTL:
            # with a uniform TTL, the smallest remaining TTL is the oldest entry.
            keys.sort(key=self.redis.ttl)
            to_remove = len(keys) - int(self.max_entries * 0.9)
            for key in keys[:to_remove]:
                self.redis.delete(key)

Multi-Layer Caching

Combine caching strategies for maximum efficiency:
from abc import ABC, abstractmethod
from typing import Optional
import time

class CacheLayer(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        pass
    
    @abstractmethod
    def set(self, key: str, value: str):
        pass

class L1MemoryCache(CacheLayer):
    """In-memory cache for hot data"""
    
    def __init__(self, max_size: int = 1000, ttl_seconds: int = 300):
        self.cache = {}
        self.max_size = max_size
        self.ttl = ttl_seconds
    
    def get(self, key: str) -> Optional[str]:
        entry = self.cache.get(key)
        if entry and time.time() - entry["time"] < self.ttl:
            return entry["value"]
        return None
    
    def set(self, key: str, value: str):
        if len(self.cache) >= self.max_size:
            # Remove oldest
            oldest = min(self.cache, key=lambda k: self.cache[k]["time"])
            del self.cache[oldest]
        
        self.cache[key] = {"value": value, "time": time.time()}

class L2RedisCache(CacheLayer):
    """Redis cache for shared state"""
    
    def __init__(self, redis_client, ttl_seconds: int = 3600):
        self.redis = redis_client
        self.ttl = ttl_seconds
    
    def get(self, key: str) -> Optional[str]:
        value = self.redis.get(f"l2:{key}")
        return value.decode() if value else None
    
    def set(self, key: str, value: str):
        self.redis.setex(f"l2:{key}", self.ttl, value)

class L3SemanticCache(CacheLayer):
    """Semantic similarity cache"""
    
    def __init__(self, semantic_cache: SemanticCache):
        self.cache = semantic_cache
    
    def get(self, key: str) -> Optional[str]:
        return self.cache.get(key)
    
    def set(self, key: str, value: str):
        self.cache.set(key, value)

class MultiLayerCache:
    """Tiered caching system"""
    
    def __init__(self, layers: list[CacheLayer]):
        self.layers = layers
    
    def get(self, key: str) -> tuple[Optional[str], int]:
        """Get value, returns (value, layer_index) or (None, -1)"""
        for i, layer in enumerate(self.layers):
            value = layer.get(key)
            if value is not None:
                # Backfill higher layers
                for j in range(i):
                    self.layers[j].set(key, value)
                return value, i
        return None, -1
    
    def set(self, key: str, value: str):
        """Set in all layers"""
        for layer in self.layers:
            layer.set(key, value)

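# Usage (reuses `client` and `semantic_cache` from the earlier sections)
redis_client = redis.from_url("redis://localhost:6379")  # shared connection for L2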
cache = MultiLayerCache([
    L1MemoryCache(max_size=100, ttl_seconds=60),     # Hot cache
    L2RedisCache(redis_client, ttl_seconds=3600),    # Shared cache
    L3SemanticCache(semantic_cache)                   # Semantic matching
])

def cached_llm_call(query: str) -> str:
    # Check cache layers
    cached, layer = cache.get(query)
    if cached:
        print(f"Cache hit at L{layer + 1}")
        return cached
    
    # Call LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )
    result = response.choices[0].message.content
    
    # Populate all cache layers
    cache.set(query, result)
    
    return result
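
The backfill behaviour is easy to verify with plain dictionaries standing in for the real layers. The sketch below is a simplified, self-contained version of the read path; DictLayer and tiered_get are hypothetical names used only for this illustration.

```python
class DictLayer:
    """Minimal stand-in for a cache layer, backed by a dict."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def tiered_get(layers, key):
    """Check layers fastest-first; on a hit, copy the value into every faster layer."""
    for index, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for faster_layer in layers[:index]:
                faster_layer.set(key, value)  # backfill
            return value, index
    return None, -1


memory, shared = DictLayer("memory"), DictLayer("shared")
shared.set("What's your return policy?", "cached answer")  # only the slower layer has it

first = tiered_get([memory, shared], "What's your return policy?")
second = tiered_get([memory, shared], "What's your return policy?")
print(first, second)  # the second lookup is served by the faster layer
```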

Cache Invalidation Strategies

from datetime import datetime, timedelta
from typing import Optional
import re

class SmartCacheInvalidator:
    """Intelligent cache invalidation"""
    
    def __init__(self, cache: MultiLayerCache):
        self.cache = cache
        self.invalidation_rules = []
    
    def add_time_rule(
        self,
        pattern: str,
        ttl: timedelta
    ):
        """Invalidate entries matching pattern after TTL"""
        self.invalidation_rules.append({
            "type": "time",
            "pattern": re.compile(pattern),
            "ttl": ttl
        })
    
    def add_event_rule(
        self,
        event_type: str,
        pattern: str
    ):
        """Invalidate on specific events"""
        self.invalidation_rules.append({
            "type": "event",
            "event": event_type,
            "pattern": re.compile(pattern)
        })
    
    def on_event(self, event_type: str, data: dict):
        """Handle invalidation events"""
        for rule in self.invalidation_rules:
            if rule["type"] == "event" and rule["event"] == event_type:
                # Invalidate matching cache entries
                self._invalidate_pattern(rule["pattern"])
    
    def _invalidate_pattern(self, pattern: re.Pattern):
        """Invalidate all entries matching pattern"""
        # Implementation depends on cache backend
        pass

# Usage
invalidator = SmartCacheInvalidator(cache)

# Invalidate product queries after product update
invalidator.add_event_rule("product_updated", r".*product.*")

# On product update event
invalidator.on_event("product_updated", {"product_id": "123"})
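
Since _invalidate_pattern is left as a stub, here is one way it might look for a simple backend. This sketch assumes the cache is a dict keyed by the original query text, which is an assumption: with hashed keys, as in the exact-match cache, the query text would have to be stored alongside the entry for a regex to match against it.

```python
import re


def invalidate_matching(store: dict[str, str], pattern: re.Pattern) -> list[str]:
    """Delete every entry whose query matches the pattern; return the removed queries."""
    # Collect first, then delete, to avoid mutating the dict while iterating.
    doomed = [query for query in store if pattern.search(query)]
    for query in doomed:
        del store[query]
    return doomed


store = {
    "What is the product warranty?": "Two years.",
    "How do I reset my password?": "Use the reset link.",
    "Is product 123 in stock?": "Yes.",
}

removed = invalidate_matching(store, re.compile(r"product", re.IGNORECASE))
print(removed)
print(list(store))
```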

Key Takeaways

Use OpenAI's Cache

50% discount on cached prompt prefixes - structure prompts accordingly

Layer Your Caches

Memory → Redis → Semantic for optimal hit rates

Semantic for Flexibility

Similar questions get cached answers, improving hit rate

Invalidate Smartly

Time-based + event-based invalidation keeps cache fresh

What’s Next

Embeddings Deep Dive

Master embedding models and similarity search

Interview Deep-Dive

Strong Answer:
  • These are three fundamentally different caching strategies operating at different layers, and a strong production system usually combines all three.
  • Exact-match caching hashes the entire request (model, messages, temperature, all parameters) and returns a stored response for identical requests. The hit rate depends entirely on how often users send exactly the same query. For internal tools, customer support bots with common FAQs, and batch processing with repeated inputs, exact-match cache hit rates can reach 30-50%. The critical constraint is to only cache deterministic requests where temperature equals 0. If you cache responses from temperature 0.7 calls, you are serving stale creative output and killing the intended variety. I have seen this exact bug in production — users complained the chatbot gave the same answer to every question because someone cached non-deterministic responses.
  • Semantic caching uses embedding similarity to match queries by meaning rather than exact text. “What is machine learning?” and “Can you explain ML to me?” would hit the same cache entry if their embedding similarity exceeds a threshold (typically 0.92-0.95). This dramatically increases hit rates for user-facing applications where people phrase the same question differently. The tradeoff is the cost of an embedding call per cache lookup (though embeddings are 100x cheaper than completions), the risk of false positives (returning a cached answer for a question that is similar but meaningfully different), and the O(N) comparison against the cache for each lookup unless you use a vector index.
  • Prompt prefix caching is a provider-side optimization (OpenAI offers this natively). When multiple requests share the same long prefix (same system prompt, same few-shot examples), the provider caches the KV-cache for that prefix and gives you a 50% discount on cached input tokens. You do not manage this cache yourself — you just structure your prompts so the static content comes first and the dynamic content comes last. The optimization is: put your 2000-token system prompt and 1000-token few-shot examples at the top, and the user’s 50-token question at the bottom. Every request after the first gets 3000 tokens at half price.
  • My production stack layers all three: prompt prefix caching reduces per-request cost at the provider level, exact-match caching eliminates redundant API calls entirely for repeated queries, and semantic caching catches the paraphrased queries that exact-match misses.
Red Flags: Candidate conflates these three types, suggests caching all LLM responses regardless of temperature, or does not know about OpenAI’s native prompt caching.
Follow-up: How do you set the similarity threshold for semantic caching, and what happens if you set it wrong?
The threshold is a precision-recall tradeoff. Too high (0.98+) and the cache rarely hits because queries need to be near-identical, so you pay for embedding lookups with almost no benefit. Too low (0.85) and you get false positives: the cache returns an answer about Python the programming language for a question about Python the snake. I calibrate the threshold empirically using a labeled dataset of query pairs annotated as “same intent” or “different intent.” I compute the embedding similarity for all pairs and find the threshold that maximizes F1 score on the “same intent” classification. In my experience, 0.92-0.95 is the sweet spot for most customer-facing applications. For safety-critical applications (medical, legal), I push it to 0.97+ because a false positive could give dangerously wrong information.
I also segment the cache by context: a query about “returns” in the context of a retail support bot should not match a cached answer about “returns” from a programming support bot. Context keys in the cache prevent cross-domain contamination.
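The calibration procedure described in that answer can be sketched in a few lines. The labeled pairs below are made-up similarity scores for illustration, and the function names are hypothetical; a real calibration set would be far larger and built from production queries.

```python
def f1_at_threshold(pairs, threshold):
    """F1 for predicting 'same intent' when similarity >= threshold."""
    tp = sum(1 for score, same in pairs if score >= threshold and same)
    fp = sum(1 for score, same in pairs if score >= threshold and not same)
    fn = sum(1 for score, same in pairs if score < threshold and same)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pairs, candidates):
    """Pick the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(pairs, t))


# (similarity, same_intent) pairs: illustrative values only.
labeled_pairs = [
    (0.97, True), (0.95, True), (0.94, True), (0.93, True),
    (0.93, False), (0.91, False), (0.88, False), (0.80, False),
]
candidates = [0.85, 0.90, 0.92, 0.95, 0.98]

chosen = best_threshold(labeled_pairs, candidates)
print(chosen, round(f1_at_threshold(labeled_pairs, chosen), 3))
```

For safety-critical domains, the same data could be used to pick the lowest threshold that meets a minimum precision instead of maximizing F1.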
Strong Answer:
  • The read path checks layers in order of speed. L1 (in-memory dictionary with TTL) is checked first — sub-millisecond latency, lives in the application process, perfect for hot queries. If L1 misses, check L2 (Redis) — 1-5ms latency, shared across all application instances, persists across restarts. If L2 misses, check L3 (semantic cache backed by a vector store) — 10-50ms latency because it requires an embedding call plus similarity search. If all three miss, call the LLM, get the response, and populate all three layers on the write path.
  • The write path writes to all layers simultaneously after an LLM call. L1 gets the exact query-response pair with a short TTL (60-300 seconds for hot data). L2 gets the same pair with a longer TTL (1-24 hours). L3 gets the query embedding and response, stored until eviction.
  • The critical detail most people miss is the backfill on read. If L2 hits but L1 missed, I backfill L1 from L2 so subsequent requests for the same query are served from the fastest layer. Same for L3 to L2 backfill. This is the same principle CPU caches use — a slower cache hit should populate all faster caches above it.
  • Cache consistency is managed through TTLs and event-based invalidation. TTLs handle gradual staleness: L1 has the shortest TTL so stale data ages out fastest from the hottest cache. Event-based invalidation handles immediate changes: when a product price changes, I publish an invalidation event that clears all cache entries matching a pattern (any query containing that product name) across all three layers. The invalidation fan-out goes from bottom to top: clear L3 first (most entries), then L2, then broadcast to all instances to clear L1.
  • One gotcha: the semantic cache (L3) cannot be invalidated by exact key because it is similarity-based. For event-based invalidation of L3, I either clear the entire context partition or re-embed the invalidation pattern and clear all entries within a similarity radius.
Red Flags: Candidate describes a flat cache without layering, does not mention backfill from slower to faster layers, or has no invalidation strategy beyond TTL expiry.
Follow-up: How do you measure and optimize cache hit rates in production?
I instrument every cache lookup to log hit/miss, which layer hit, lookup latency, and the query fingerprint. The primary metric is aggregate hit rate across all layers, but the more actionable metric is hit rate per layer. If L1 hit rate is low but L2 is high, my L1 TTL might be too short or my L1 max-size is too small: hot entries are getting evicted before they are reused. If L2 hit rate is low but L3 is high, my exact-match cache is not capturing the variety of user phrasings and semantic caching is doing the heavy lifting.
For optimization, I analyze the miss log: I cluster cache misses by semantic similarity and look for clusters where many similar queries all missed the cache. If a cluster has 50 similar queries and none hit the semantic cache, the similarity threshold might be too high for that domain. I also track cache freshness: the average age of served cached responses. If 90% of served responses are from 23 hours ago (near TTL expiry), the cache is serving stale data and I need either shorter TTLs or event-based invalidation for that content type. I set up dashboards showing hit rate, cost savings (estimated by multiplying hits by average LLM call cost), and latency improvement (p50 and p95 of cached vs uncached response times).
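A minimal version of the per-layer instrumentation might look like the sketch below. It assumes the (value, layer_index) convention returned by MultiLayerCache.get, where -1 means every layer missed; the CacheMetrics class and the sample lookup sequence are hypothetical.

```python
from collections import Counter


class CacheMetrics:
    """Count hits per layer and total misses."""

    def __init__(self, layer_names):
        self.layer_names = list(layer_names)
        self.hits = Counter()
        self.misses = 0

    def record(self, layer_index):
        """layer_index is the layer that hit, or -1 when every layer missed."""
        if layer_index < 0:
            self.misses += 1
        else:
            self.hits[self.layer_names[layer_index]] += 1

    def report(self):
        """Return the share of all lookups served by each layer, plus the overall hit rate."""
        total = sum(self.hits.values()) + self.misses
        if total == 0:
            return {}
        rates = {name: self.hits[name] / total for name in self.layer_names}
        rates["overall"] = sum(self.hits.values()) / total
        return rates


metrics = CacheMetrics(["L1", "L2", "L3"])
for layer_index in [0, 0, 1, 2, -1, 0, -1, 1, 0, -1]:  # simulated lookups
    metrics.record(layer_index)

report = metrics.report()
print(report)
```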
Strong Answer:
  • The classic quote is that there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. For LLM caches, this is especially hard because the relationship between source data changes and cached responses is fuzzy — a product price change does not just invalidate the one cached response about that product’s price; it potentially invalidates any response that mentioned pricing.
  • I use a three-strategy approach. First, time-based TTLs as the baseline guarantee. Every cache entry has a maximum TTL that ensures it is eventually refreshed even if I miss an invalidation event. For rapidly changing data (stock prices, inventory levels), TTLs are minutes. For slow-changing data (company policies, product descriptions), TTLs are hours to days. This is the safety net.
  • Second, event-driven invalidation for known change events. When the product catalog updates, the pricing API changes, or a new policy document is published, the system publishes an invalidation event. The cache listener receives the event and invalidates all entries matching the affected domain. The tricky part is mapping a data change to the right cache entries. I use cache tags: every cached response is tagged with the data sources it depends on (e.g., tags ["product:123", "pricing:q4-2025"]). When product 123 changes, I invalidate all entries tagged with product:123.
  • Third, versioned caching for prompt and model changes. When I update a system prompt or switch models, the entire cache is logically stale because a new model or prompt would generate different responses. I include a prompt version hash and model identifier in the cache key, so a prompt change automatically starts a new cache namespace without requiring explicit invalidation.
  • The thing most people get wrong is trying to be too precise with invalidation. If you spend more engineering time on surgical cache invalidation than you save from caching, you have over-optimized. For most LLM applications, aggressive TTLs (1-4 hours) plus event-driven invalidation for the highest-impact changes (pricing, availability, critical policies) covers 95% of cases.
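The tag-based and versioned strategies above can be combined in one small structure. This is an in-memory sketch under stated assumptions: the TaggedCache class is hypothetical, a production version would keep the tag index in Redis sets or similar, and the namespace scheme (model name plus a short hash of the system prompt) is one possible choice.

```python
import hashlib
from collections import defaultdict


class TaggedCache:
    """In-memory cache with a versioned namespace and tag-based invalidation."""

    def __init__(self, model, system_prompt):
        # A prompt or model change yields a new namespace, so old entries are never read.
        prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]
        self.namespace = f"{model}:{prompt_hash}"
        self.entries = {}
        self.keys_by_tag = defaultdict(set)

    def _key(self, query):
        return f"{self.namespace}:{query}"

    def set(self, query, response, tags):
        key = self._key(query)
        self.entries[key] = response
        for tag in tags:
            self.keys_by_tag[tag].add(key)

    def get(self, query):
        return self.entries.get(self._key(query))

    def invalidate_tag(self, tag):
        """Drop every entry that depends on the given data source; return how many."""
        keys = self.keys_by_tag.pop(tag, set())
        for key in keys:
            self.entries.pop(key, None)
        return len(keys)


cache_v1 = TaggedCache("gpt-4o", "You are a support agent. v1")
cache_v1.set("Price of product 123?", "It costs $20.", tags=["product:123", "pricing:q4-2025"])
cache_v1.set("Return policy?", "30 days.", tags=["policy:returns"])

removed_count = cache_v1.invalidate_tag("product:123")  # product 123 changed
cache_v2 = TaggedCache("gpt-4o", "You are a support agent. v2")  # prompt changed
print(removed_count, cache_v1.get("Return policy?"), cache_v1.namespace != cache_v2.namespace)
```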
Red Flags: Candidate says “just use short TTLs” without considering event-driven invalidation, does not think about prompt or model version changes invalidating the cache, or tries to build a perfect invalidation system instead of accepting pragmatic staleness bounds.
Follow-up: What if your cache serves a stale response that gives a user incorrect pricing information? How do you prevent that?
For high-stakes data like pricing, I do not cache the LLM response at all for the data-dependent portion. Instead, I separate the response into a cached template (the conversational structure and tone) and a live data lookup (the actual price). The LLM generates a response with a placeholder like “The price of [PRODUCT] is [PRICE],” and the application fills in the live price from the source-of-truth database at serving time. This hybrid approach gives me caching benefits for the expensive LLM generation while ensuring live data is always current.
For cases where the LLM needs the price to reason (not just insert it), I include the price in the prompt context at request time rather than relying on cached responses. The cache still helps because the prompt prefix (system prompt, examples) is cached via the provider’s prefix caching, and only the dynamic portion (current price, user question) is new.
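The template-plus-live-lookup idea can be illustrated as follows. This is a sketch: the placeholder syntax, the render_with_live_price helper, and the dict standing in for the source-of-truth database are all assumptions made for the example.

```python
# Cached once: the expensive LLM-generated wording, with placeholders instead of data.
cached_template = "The price of {product} is {price}."

# Stand-in for the source-of-truth pricing database (hypothetical values).
live_prices = {"Widget": 19.99}


def render_with_live_price(template, product, price_lookup):
    """Fill the cached template with a price fetched at serving time."""
    price = price_lookup(product)
    return template.format(product=product, price=f"${price:.2f}")


before = render_with_live_price(cached_template, "Widget", live_prices.__getitem__)
live_prices["Widget"] = 24.50  # the price changes; the cached template is untouched
after = render_with_live_price(cached_template, "Widget", live_prices.__getitem__)
print(before)
print(after)
```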
Strong Answer:
  • First, I would profile the spend to understand where the money goes. I would break down costs by: endpoint (which features use the most tokens), model (GPT-4o vs GPT-4o-mini), request type (how many are unique vs repeated), and token composition (how much is system prompt vs user input vs output). At most companies I have seen, 60-70% of the token spend is the system prompt being resent identically on every request.
  • Quick win number one: prompt prefix caching. If 70% of our token spend is the system prompt, and OpenAI gives 50% off cached prefix tokens, that is an immediate 35% cost reduction with zero code changes beyond restructuring our prompts (static prefix first, dynamic content last). On $50K/month, that saves $17,500.
  • Quick win number two: exact-match caching with Redis. I would analyze request logs and identify the repeat rate. For internal tools and support bots, 20-40% of queries are repeats. Implementing exact-match caching with a 24-hour TTL on deterministic (temperature=0) requests would eliminate those API calls entirely. Conservatively, if 25% of requests are cacheable repeats, that saves another $8,000 on the remaining $32,500.
  • Medium-term win: semantic caching for the remaining non-exact-repeat traffic. With a 0.93 similarity threshold, I would expect an additional 10-15% cache hit rate on queries that are paraphrases of cached queries. That saves another $2,500-3,500.
  • Model optimization: audit which requests actually need GPT-4o versus GPT-4o-mini. For straightforward classification, formatting, and simple Q-and-A, GPT-4o-mini at $0.15/$0.60 per million input/output tokens is roughly 17x cheaper than GPT-4o at $2.50/$10.00. If 40% of current GPT-4o requests can be downgraded to mini without quality loss, that saves significantly.
  • Combined realistic projection: prefix caching ($17,500) plus exact-match ($8,000) plus semantic ($3,000) gets to roughly $28,500 in savings, which is a 57% reduction. Adding model downgrading pushes it past 60% comfortably. Total cost of implementation: one Redis instance ($50/month), engineering time for caching layer (1-2 weeks), and ongoing monitoring.
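The projection above is simple compounding arithmetic, which the sketch below reproduces. The inputs are the figures quoted in the answer ($50K monthly spend, 70% prompt share, 50% prefix discount, 25% exact-match repeats) plus an assumed 12.5% semantic hit rate, the midpoint of the 10-15% range; the function name is hypothetical.

```python
def project_savings(monthly_spend, prompt_share, prefix_discount,
                    exact_hit_rate, semantic_hit_rate):
    """Apply each caching layer to whatever spend the previous layer left over."""
    prefix = monthly_spend * prompt_share * prefix_discount
    after_prefix = monthly_spend - prefix
    exact = after_prefix * exact_hit_rate
    after_exact = after_prefix - exact
    semantic = after_exact * semantic_hit_rate
    total = prefix + exact + semantic
    return {
        "prefix": prefix,
        "exact": exact,
        "semantic": semantic,
        "total": total,
        "reduction": total / monthly_spend,
    }


projection = project_savings(
    monthly_spend=50_000,
    prompt_share=0.70,
    prefix_discount=0.50,
    exact_hit_rate=0.25,
    semantic_hit_rate=0.125,  # assumed midpoint of the 10-15% range
)
for name, value in projection.items():
    print(f"{name}: {value:,.2f}")
```

The unrounded result lands within a few hundred dollars of the rounded figures in the answer, which is a useful check that each layer's savings were computed on the remaining spend rather than on the original total.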
Red Flags: Candidate suggests only one strategy instead of layering multiple approaches, does not profile the spend first, or proposes solutions that require months of engineering for marginal gains.
Follow-up: How do you prevent cache warming issues when you deploy a new version of your application?
Cache cold starts after deployment are a real problem: for the first few hours, every request misses the cache and hits the LLM API, creating a latency spike and cost burst. I use three strategies.
First, pre-warming: before deployment, I run the top 500 most common queries (extracted from production logs) through the new application version and populate the cache. This covers 30-40% of expected traffic immediately.
Second, gradual rollout: I deploy the new version to 10% of traffic first, which warms the shared Redis cache (L2) from the small traffic slice. By the time I roll out to 100%, the L2 cache is already warm.
Third, stale-while-revalidate: during the warming period, I allow serving slightly stale responses from the old cache (if still available) while the new cache populates in the background. The stale response has a “refresh-in-background” flag that triggers an async LLM call to populate the new cache without making the user wait.
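The pre-warming step can be sketched as two small functions. This is an illustration under assumptions: the log is a list of raw query strings, queries are normalized by lowercasing and stripping whitespace, the cache is a plain dict, and generate stands in for the real LLM call.

```python
from collections import Counter


def top_queries(log_lines, n):
    """Return the n most frequent queries after simple normalization."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return [query for query, _ in counts.most_common(n)]


def prewarm(cache, queries, generate):
    """Populate the cache for queries it does not already hold; return how many were added."""
    warmed = 0
    for query in queries:
        if query not in cache:
            cache[query] = generate(query)
            warmed += 1
    return warmed


# Hypothetical production log excerpt.
log = [
    "Return policy?", "return policy?", "Track my order", "",
    "return policy?", "Track my order", "Reset password",
]

popular = top_queries(log, 2)
warm_cache = {}
warmed_count = prewarm(warm_cache, popular, generate=lambda q: f"answer to {q}")
print(popular, warmed_count)
```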