Model optimization is crucial for production deployments. This chapter covers techniques for reducing latency, memory usage, and cost while maintaining quality.
Quantization
GGUF and llama.cpp
Quantization Comparison
vLLM Serving
Basic vLLM Setup
vLLM API Server
Speculative Decoding
KV Cache Optimization
Batch Processing
Memory Optimization
Optimization Trade-offs
- Quantization reduces memory use and improves speed at the cost of some quality
- Batching increases throughput but can add per-request latency
- KV caching avoids recomputing attention state for repeated prefixes
- Speculative decoding works best when the draft model closely matches the target model
- Always measure actual performance on your workload
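To make the last point concrete, a minimal measurement harness might look like the sketch below. It is an illustration, not part of any serving framework: `benchmark`, `generate`, and `count_tokens` are hypothetical names, `generate` stands in for whatever call reaches your model, and the default token counter is a crude whitespace split rather than a real tokenizer.

```python
import time


def benchmark(generate, prompts, count_tokens=lambda text: len(text.split())):
    """Time generate(prompt) for each prompt, one request at a time.

    generate and count_tokens are hypothetical stand-ins for your serving
    call and your tokenizer. Returns latency percentiles and throughput.
    """
    latencies = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        total_tokens += count_tokens(output)
    elapsed = time.perf_counter() - start

    latencies.sort()

    def percentile(p):
        # Nearest-rank style lookup, clamped to the last sample.
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "requests": len(latencies),
        "p50_s": percentile(0.50),
        "p95_s": percentile(0.95),
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Fake model that just echoes the prompt; replace with a real client call.
    print(benchmark(lambda p: p + " ok", ["hello world", "a longer test prompt"]))
```

This version sends requests sequentially, so it measures single-stream latency. Measuring throughput under batching would need concurrent requests, which is left out here to keep the sketch short.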
Practice Exercise
Build an optimized inference service that:
- Supports multiple quantization levels
- Implements dynamic batching
- Uses KV cache for prefix reuse
- Monitors and optimizes memory usage
- Benchmarks throughput and latency
Focus on:
- Balancing speed vs quality tradeoffs
- Efficient memory utilization
- Production-ready batching
- Meaningful performance metrics
Interview Deep-Dive
Explain quantization for LLMs. What are the trade-offs between different quantization levels, and how do you decide which one to use in production?
Strong Answer:
- Quantization reduces the numerical precision of model weights from their training precision (typically FP16 or BF16, 16 bits per parameter) to lower bit widths (8-bit, 4-bit, even 2-bit). The immediate benefit is memory reduction: a 7B parameter model in FP16 requires roughly 14 GB of GPU memory. At 4-bit quantization (Q4_K_M in GGUF format), that drops to about 4 GB. This means you can run models on consumer GPUs that would otherwise require expensive data center hardware.
- The trade-off is quality degradation, and it is not linear. Going from FP16 to 8-bit (Q8_0) produces almost imperceptible quality loss — typically less than 1% on standard benchmarks. Going to 4-bit (Q4_K_M) produces a 3-6% quality drop that is noticeable on reasoning-heavy tasks but often acceptable for general conversation and simple extraction. Going to 2-bit (Q2_K) produces a 10-15% quality drop that is clearly noticeable — the model makes more factual errors, struggles with complex instructions, and produces less coherent long-form text.
- The decision framework I use has three inputs: available GPU memory, minimum acceptable quality, and throughput requirements. First, I run my application’s evaluation suite against FP16, Q8, Q5, and Q4 variants and measure quality. I find the lowest quantization level where quality remains above my threshold (usually 95% of FP16 performance). Then I check if that variant fits in my available GPU memory with room for the KV cache. If not, I either drop to a lower quantization or switch to a smaller model — a Q5 quantized 13B model often outperforms a Q2 quantized 70B model because the 70B at Q2 has degraded too much.
- One production nuance: quantization affects different tasks differently. Math and code generation degrade faster than creative writing or summarization because numerical precision matters more for exact reasoning. If your application involves multiple task types, you should evaluate quantization impact per task, not just overall.
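The memory arithmetic and the selection rule described above can be sketched in a few lines. This is a rough illustration under stated assumptions: `weight_memory_gb` and `pick_variant` are hypothetical helpers, bit widths are nominal (real GGUF variants such as Q4_K_M average somewhat more than 4 bits per weight, which is one reason a 7B file lands nearer 4 GB than 3.5 GB), and the evaluation scores are made-up placeholders for your own results.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return params_billion * bits_per_weight / 8


def pick_variant(scores, params_billion, gpu_gb, kv_reserve_gb, min_ratio=0.95):
    """Pick the lowest bit width that keeps quality, then check that it fits.

    scores maps a nominal bit width to an evaluation score; key 16 is the
    FP16 baseline. Returns None when nothing qualifies, which is the cue to
    consider a smaller model instead.
    """
    threshold = min_ratio * scores[16]
    acceptable = [bits for bits in sorted(scores) if scores[bits] >= threshold]
    if not acceptable:
        return None
    bits = acceptable[0]  # lowest bit width that still meets the quality bar
    needed_gb = weight_memory_gb(params_billion, bits) + kv_reserve_gb
    return bits if needed_gb <= gpu_gb else None


if __name__ == "__main__":
    print(weight_memory_gb(7, 16))   # FP16 weights of a 7B model
    print(weight_memory_gb(70, 4))   # nominal 4-bit weights of a 70B model
    # Hypothetical evaluation scores per bit width, for illustration only.
    scores = {16: 0.80, 8: 0.79, 5: 0.78, 4: 0.77, 2: 0.68}
    print(pick_variant(scores, params_billion=7, gpu_gb=8, kv_reserve_gb=2))
```

Because the scores come from a single aggregate number here, the per-task caveat still applies: in practice you would run this once per task type and take the most conservative answer.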
What is speculative decoding, and why does it speed up LLM inference without changing the output distribution?
Strong Answer:
- Speculative decoding exploits the fact that autoregressive LLM decoding at small batch sizes is memory-bandwidth-bound, not compute-bound. For each token generated, the model reads all of its weights from GPU memory, does a relatively small amount of computation, and outputs one token. The GPU compute units are largely idle during this memory read. Speculative decoding uses that idle compute to verify multiple candidate tokens in parallel.
- The mechanism works in three steps. First, a small “draft” model (say 1B parameters) generates k candidate tokens quickly, typically 4-8. Because the draft model is small, this takes roughly the same wall-clock time as generating one token from the large model. Second, the large “target” model processes the entire candidate sequence in a single forward pass. Because a Transformer scores every position of a known sequence in parallel, verifying k tokens takes approximately the same time as generating one token. Third, you check each drafted token against the target model’s probability distribution at that position. Under greedy decoding, a token is accepted when it matches the target’s top choice. Under sampling, it is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token. At the first rejection, you discard the remaining draft tokens and sample a correction token from the target model: its top choice under greedy decoding, or the normalized residual max(0, p - q) under sampling.
- The key mathematical property is that the output distribution is identical to what the target model would have produced on its own. This is not an approximation: the rejection sampling procedure guarantees distributional equivalence. The speedup depends on k and the acceptance rate. If each drafted token is accepted independently with probability α, one target forward pass yields (1 - α^(k+1)) / (1 - α) tokens on average. With α = 0.7 and k = 5, that is about 2.9 tokens per target model forward pass instead of 1. In practice, this yields 2-3x speedup for well-matched draft-target pairs.
- The critical design choice is the draft model. It must be fast (otherwise the drafting step becomes the bottleneck) and well-aligned with the target model’s distribution (otherwise the acceptance rate is low). Using a 1B model from the same family as a 70B target works well because the two share a tokenizer and general language patterns. A draft from an unrelated model family performs poorly: standard token-level verification requires a shared vocabulary, and even when the vocabularies match, the token probability distributions diverge too much.
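The verification step is easier to follow on a toy example. The sketch below uses plain Python lists as probability distributions over a tiny vocabulary; it is not how any particular library implements speculative decoding, and `verify_draft`, `sample`, and `expected_tokens_per_pass` are hypothetical names. It assumes the sampling variant of the algorithm, and the expected-token formula assumes each drafted token is accepted independently with the same probability.

```python
import random


def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target forward pass for acceptance rate alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def sample(probs, rng):
    """Draw an index from a discrete distribution (weights need not sum to 1)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject k drafted tokens against the target distributions.

    draft_probs[i] and target_probs[i] are the two models' distributions at
    position i. target_probs needs one extra entry, used to sample a bonus
    token when every drafted token is accepted.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]
        q = draft_probs[i][tok]  # > 0, since tok was sampled from the draft
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # then drop the rest of the draft.
        residual = [max(0.0, pt - qd)
                    for pt, qd in zip(target_probs[i], draft_probs[i])]
        out.append(sample(residual, rng))
        return out
    # All k accepted: the same target pass also yields one extra token.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    draft = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
    target = [[0.5, 0.4, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
    drafted = [sample(d, rng) for d in draft]
    print(drafted, verify_draft(drafted, draft, target, rng))
    print(round(expected_tokens_per_pass(0.7, 5), 2))
```

When the draft and target distributions are identical, every drafted token is accepted, which is the limiting case a well-matched draft model tries to approach.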
You need to serve a 70B parameter model with sub-second latency for a real-time application. What is your infrastructure and optimization strategy?
Strong Answer:
- For a 70B model at sub-second latency, you need to make several architectural decisions. First, quantization: a 70B model in FP16 requires roughly 140 GB of GPU memory, which means at least two A100 80GB GPUs just for the weights. At 4-bit quantization (AWQ), you need about 35 GB, which fits on a single 80 GB A100 with room for KV cache. I would start with AWQ 4-bit and validate quality on my evaluation suite. If quality is acceptable, single-GPU serving dramatically simplifies the infrastructure.
- Second, the serving framework. vLLM is my default choice for high-throughput serving because of its PagedAttention implementation, which manages KV cache memory dynamically rather than pre-allocating for the maximum sequence length. This can increase throughput by 2-4x compared to naive serving because it eliminates memory waste from partially-filled KV cache slots. vLLM also supports continuous batching, which means new requests can be added to an in-flight batch without waiting for the current batch to complete.
- Third, speculative decoding if latency is the primary constraint. A well-matched draft model (7B from the same family as the 70B target) can reduce per-token decode latency by 40-60%. It does not shorten time-to-first-token, which is dominated by prefill of the prompt. Combined with streaming, this means the user sees the first few tokens within 200-300ms even though the full response takes 2-3 seconds.
- Fourth, KV cache optimization via prefix caching. If many requests share a common system prompt (which is typical), caching the KV state for that prefix means prefill only has to process the user-specific tokens, which attend to the cached keys and values instead of recomputing them. For a 2,000-token system prompt, this saves roughly 30-40% of the first-token latency.
- Fifth, scaling strategy. For a real-time application, I would deploy behind a load balancer with autoscaling based on the pending request queue depth, not CPU or GPU utilization. GPU utilization is a misleading metric for LLM serving because a GPU can be 100% utilized while serving requests slowly if the batch size is too large. Queue depth tracks user-facing latency far more closely.
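The "room for KV cache" claim in the first step can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: `kv_bytes_per_token` and `max_cached_tokens` are hypothetical helpers, the model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption modeled on a Llama-2-70B-style architecture, and the 5 GB runtime overhead is a placeholder you should replace with what you observe.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes for one token: one key and one value vector per layer
    per KV head, stored at bytes_per_value each (2 for FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_gb, weights_gb, overhead_gb, bytes_per_token):
    """Tokens of KV cache that fit after weights and runtime overhead.

    Uses 1 GB = 1e9 bytes; overhead_gb is an assumed allowance for
    activations, CUDA context, and fragmentation.
    """
    free_bytes = (gpu_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // bytes_per_token))


if __name__ == "__main__":
    # Assumed Llama-2-70B-like shape; check your model's config for real values.
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
    print(per_token)
    # 80 GB GPU, about 35 GB of 4-bit weights, 5 GB assumed overhead.
    print(max_cached_tokens(80, 35, 5, per_token))
```

The resulting token budget is shared across all in-flight requests, so dividing it by the expected context length gives a rough ceiling on concurrent sequences for a single GPU.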