December 2025 Update: Covers the latest open-source models including Llama 3.3, Mistral Large 2, Qwen 2.5, and DeepSeek-V3, plus deployment options for every scale.

Why Open Source LLMs?

The decision between open-source and closed-source LLMs is like the decision between renting an apartment and owning a house. Renting (OpenAI, Anthropic) is easier to start, someone else handles maintenance, and you can move quickly, but you are at the mercy of the landlord's pricing, rules, and availability. Owning (open-source) requires more upfront investment and maintenance, but gives you complete control over your data, costs, and customization. Most production teams end up using both: closed-source for complex reasoning tasks and open-source for high-volume, cost-sensitive workloads.

Open-source and local LLMs offer compelling advantages:

| Benefit | Description |
|---|---|
| Privacy | Data never leaves your infrastructure |
| Cost | No per-token API fees after hardware costs |
| Customization | Fine-tune on your data |
| Latency | No network round-trip for local models |
| Compliance | Meet data residency requirements |
| Availability | No dependence on external services |

Open Source vs. Closed Source: Decision Framework

| Factor | Choose Open Source When | Choose Closed Source When |
|---|---|---|
| Data sensitivity | PII, medical records, legal docs, classified data | Public-facing content, non-sensitive queries |
| Volume | 100K+ requests/day (cost crossover point) | Under 10K requests/day |
| Quality requirements | "Good enough" is acceptable (classification, extraction) | Best-possible quality required (complex reasoning, creative) |
| Latency | Sub-100ms needed (local inference) | 500ms-3s acceptable |
| Team size | Have infra/ML engineers to manage deployment | Small team, no ML ops capacity |
| Customization | Need domain-specific fine-tuning | General-purpose tasks |
| Budget | Upfront GPU investment acceptable | Prefer pay-per-use |
| Regulatory | GDPR, HIPAA, SOC2 with strict data residency | Standard compliance sufficient |

The hybrid pattern most production teams converge on: Use open-source models for high-volume, cost-sensitive, or privacy-critical workloads (embedding generation, classification, simple extraction). Use closed-source models for complex reasoning, creative generation, and anything requiring frontier capabilities. Route between them with LiteLLM.
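The routing rule in the hybrid pattern can be sketched as a small lookup. This is only an illustration under stated assumptions: the task labels are hypothetical, and the aliases `llama-local` and `gpt-4o` are borrowed from the LiteLLM proxy config shown later in this guide. It is not a LiteLLM feature, just the decision logic you might put in front of it.

```python
# Hypothetical task labels; adjust to whatever your request logs actually contain.
OPEN_SOURCE_TASKS = {"embedding", "classification", "extraction"}
FRONTIER_TASKS = {"complex_reasoning", "creative_generation"}


def pick_model(task: str, contains_sensitive_data: bool = False) -> str:
    """Return the model alias a request should be routed to."""
    # Privacy-critical data stays on infrastructure you control, whatever the task.
    if contains_sensitive_data:
        return "llama-local"
    if task in OPEN_SOURCE_TASKS:
        return "llama-local"
    if task in FRONTIER_TASKS:
        return "gpt-4o"
    # Unknown task types default to the cheaper local model in this sketch.
    return "llama-local"


print(pick_model("classification"))
print(pick_model("complex_reasoning"))
print(pick_model("complex_reasoning", contains_sensitive_data=True))
```

The returned alias would be passed as the `model` argument of whatever client talks to the proxy.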

2025 Open Source Model Landscape

Model Capabilities (December 2025)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Frontier Class (70B+ params, GPT-4 competitive):
├── Llama 3.3 70B            - Best open weights
├── Mistral Large 2 123B     - Strong reasoning
├── Qwen 2.5 72B             - Excellent multilingual
└── DeepSeek-V3 671B (MoE)   - State-of-the-art open

Mid-tier (7B-34B, GPT-3.5 competitive):
├── Llama 3.1 8B             - Fast, efficient
├── Mistral 7B               - Great for fine-tuning
├── Qwen 2.5 7B/14B/32B      - Balanced performance
└── Gemma 2 9B/27B           - Google's open models

Edge/Mobile (<7B, device-friendly):
├── Llama 3.2 1B/3B          - On-device inference
├── Phi-3.5 3.8B             - Microsoft's small model
├── Gemma 2 2B               - Minimal footprint
└── Qwen 2.5 0.5B/1.5B       - Ultra-lightweight

Inference Server Comparison

Choosing the right inference server can matter as much as choosing the right model: a server without continuous batching can leave most of your GPU idle under concurrent load.

| Feature | Ollama | vLLM | TGI (HuggingFace) | llama.cpp |
|---|---|---|---|---|
| Primary use | Development, prototyping | Production serving | Production serving | Edge/CPU inference |
| Setup complexity | 1 command | Moderate | Docker required | Compile from source |
| GPU utilization | Good | Excellent (PagedAttention) | Good | Optional (CPU-first) |
| Concurrent users | Low (1-5) | High (100+) | High (50+) | Low (1-3) |
| Throughput (tok/s) | Good | Best (2-4x others) | Good | Moderate |
| Continuous batching | No | Yes | Yes | Yes (server mode) |
| Tensor parallelism | No | Yes (multi-GPU) | Yes (multi-GPU) | No |
| OpenAI-compatible API | Yes | Yes | Yes (Messages API, alongside the native /generate API) | Yes (via server mode) |
| Quantization support | GGUF auto | AWQ, GPTQ, FP8 | AWQ, GPTQ, bitsandbytes | GGUF (best in class) |
| Apple Silicon | Excellent | No | No | Excellent |
| Best for | Local dev, Mac users | Production GPU clusters | HuggingFace ecosystem | Raspberry Pi, laptops |

Ollama: Easiest Local LLM Setup

Ollama is the “Docker for LLMs” — it packages models with their runtime so you can pull and run them with a single command. It handles quantization, GPU detection, and memory management automatically. If you have never run a local LLM before, start here. The experience is genuinely as simple as ollama run llama3.2, and the OpenAI-compatible API means your existing code works with minimal changes.

Installation

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - Download from ollama.com
# Or with winget:
winget install Ollama.Ollama

Basic Usage

# Pull and run a model
ollama run llama3.2

# Pull without running
ollama pull mistral

# List installed models
ollama list

# Set parameters inside an interactive session
# (`ollama run` has no --num-ctx or --num-gpu flags; use /set or a Modelfile)
ollama run llama3.2
>>> /set parameter num_ctx 4096

API Usage

Ollama exposes an OpenAI-compatible API:
from openai import OpenAI

# Point to local Ollama server -- same OpenAI SDK, different base_url
# This is the key insight: Ollama exposes an OpenAI-compatible API,
# so you can swap between local and cloud models by changing one line
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the SDK but Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing briefly."}
    ]
)

print(response.choices[0].message.content)

Custom Modelfiles

Create custom model configurations:
# Modelfile
FROM llama3.2

# Set system prompt
SYSTEM """
You are a senior software engineer specializing in Python.
Provide concise, production-ready code with clear explanations.
"""

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

# Custom stop tokens
PARAMETER stop "<|endoftext|>"
PARAMETER stop "Human:"

# Create and run the custom model (shell commands, separate from the Modelfile)
ollama create code-assistant -f Modelfile
ollama run code-assistant

Running Multiple Models

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

async def query_multiple_models(prompt: str):
    """Query multiple models and compare responses"""
    models = ["llama3.2", "mistral", "qwen2.5:7b"]
    
    async def query_model(model: str):
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return model, response.choices[0].message.content
    
    results = await asyncio.gather(*[query_model(m) for m in models])
    return dict(results)

# Compare responses
responses = asyncio.run(query_multiple_models("What is machine learning?"))
for model, response in responses.items():
    print(f"\n{model}:\n{response[:200]}...")

vLLM: Production-Scale Inference

If Ollama is for development, vLLM is for production. The difference is like running a single-threaded web server vs. nginx: vLLM is designed from the ground up for high-throughput, concurrent inference. Its key innovation, PagedAttention, manages GPU memory the way an operating system manages RAM with virtual memory pages — which means it can serve 2-4x more concurrent requests than naive implementations on the same hardware.

Installation

pip install vllm

Starting the Server

# Basic server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000

# With optimizations. Comments go above the command because a comment after
# a trailing backslash breaks the shell line continuation:
#   --tensor-parallel-size 2       split the model across 2 GPUs
#   --gpu-memory-utilization 0.9   let vLLM use 90% of VRAM (weights + KV cache)
#   --max-model-len 8192           cap context length to save memory
#   --enable-prefix-caching        reuse KV cache for repeated system prompts
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --enable-prefix-caching

Key Features

from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Batched inference: vLLM batches requests that are in flight at the same
# time, so send them concurrently instead of one after another
prompts = [
    "Explain Python decorators",
    "What is async/await?",
    "How does garbage collection work?"
]

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    responses = list(pool.map(ask, prompts))

# Streaming
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    stream=True
)

for chunk in stream:
    # Some chunks carry no choices or no text, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

vLLM Optimizations Explained

┌─────────────────────────────────────────────────────────────┐
│                    vLLM Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PagedAttention                                              │
│  ├── Manages KV cache like virtual memory pages             │
│  └── Enables serving many concurrent requests               │
│                                                             │
│  Continuous Batching                                         │
│  ├── Dynamically adds requests to running batch             │
│  └── Maximizes GPU utilization                              │
│                                                             │
│  Tensor Parallelism                                          │
│  ├── Splits model across multiple GPUs                      │
│  └── Near-linear speedup at best (comms overhead)           │
│                                                             │
│  Prefix Caching                                              │
│  ├── Reuses KV cache for shared prefixes                    │
│  └── Great for system prompts and few-shot examples         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
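To make the "KV cache like virtual memory pages" idea concrete, here is a toy block allocator. It is a sketch of the general paging idea only, not vLLM's implementation: the class name, block size, and methods are all hypothetical. The point it illustrates is that memory is handed out in small fixed-size blocks as a sequence grows, and returned to the pool as soon as the sequence finishes, instead of reserving the full context length up front.

```python
class PagedKVCache:
    """Toy block allocator: each sequence gets fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):  # 20 tokens need 2 blocks (16 + 4)
    cache.append_token("request-a")
print(len(cache.block_tables["request-a"]), len(cache.free_blocks))  # 2 2
cache.release("request-a")
print(len(cache.free_blocks))  # 4
```

A request that stops after 20 tokens holds two blocks rather than a full-context reservation, which is where the extra concurrency comes from.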

Text Generation Inference (TGI)

Text Generation Inference (TGI) is Hugging Face's production inference server, typically run from its Docker image.

Running with Docker

# Pull and run TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192

API Usage

import requests

def query_tgi(prompt: str) -> str:
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "do_sample": True
            }
        }
    )
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
    return response.json()["generated_text"]

# With streaming
def stream_tgi(prompt: str):
    response = requests.post(
        "http://localhost:8080/generate_stream",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 500}},
        stream=True
    )
    
    for line in response.iter_lines():
        if line:
            yield line.decode("utf-8")
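The streaming helper above yields raw server-sent-event lines, so the caller still has to extract the token text. The sketch below shows one way to do that. It assumes each event is a `data:` line carrying a JSON object with a `token` field that has `text` and `special` keys; treat that shape as an assumption and check it against the output of your TGI version. The function name `iter_token_text` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_token_text(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each non-special token from raw SSE lines."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        event = json.loads(line[len("data:"):].strip())
        token = event.get("token", {})
        if token.get("text") and not token.get("special", False):
            yield token["text"]


# Sample lines in the assumed format (not captured from a real server).
sample = [
    'data:{"token":{"id":1,"text":"Hello","special":false},"generated_text":null}',
    "",
    'data:{"token":{"id":2,"text":" world","special":false},"generated_text":null}',
    'data:{"token":{"id":3,"text":"</s>","special":true},"generated_text":"Hello world"}',
]
print("".join(iter_token_text(sample)))  # Hello world
```

Against a live server, the same function could wrap the generator directly, for example `"".join(iter_token_text(stream_tgi(prompt)))`.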

LiteLLM: Unified Interface

The problem LiteLLM solves is real: every LLM provider has a slightly different API, different parameter names, different response formats. Switching providers means rewriting integration code. LiteLLM acts as a universal adapter — one interface that translates to any provider. More importantly, its proxy server adds routing, fallbacks, and load balancing, which is exactly what you need when running multiple models in production.
from litellm import completion

# Works with any provider
def chat(prompt: str, provider: str = "ollama"):
    configs = {
        "ollama": {"model": "ollama/llama3.2", "api_base": "http://localhost:11434"},
        "openai": {"model": "gpt-4o"},
        "anthropic": {"model": "claude-3-5-sonnet-20241022"},
        "local_vllm": {"model": "openai/meta-llama/Llama-3.1-8B-Instruct", 
                       "api_base": "http://localhost:8000/v1"}
    }
    
    config = configs[provider]
    
    response = completion(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        api_base=config.get("api_base")
    )
    
    return response.choices[0].message.content

# Seamlessly switch providers
print(chat("Hello!", provider="ollama"))
print(chat("Hello!", provider="openai"))

LiteLLM Proxy Server

Run a unified proxy for all your models:
# config.yaml
model_list:
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
      
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      
  - model_name: claude-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 2
  fallbacks: [{"gpt-4o": ["claude-sonnet", "llama-local"]}]

# Start the proxy (shell command, separate from config.yaml)
litellm --config config.yaml --port 4000
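The `num_retries` and `fallbacks` settings describe a simple control flow: retry the requested model a few times, then walk down its fallback list. The sketch below illustrates that flow in plain Python. It is not LiteLLM's implementation; `complete_with_fallbacks` and `fake_call` are hypothetical names, and the model aliases are the ones from the config above.

```python
from typing import Callable


def complete_with_fallbacks(
    prompt: str,
    primary: str,
    fallbacks: dict,
    call_model: Callable[[str, str], str],
    num_retries: int = 2,
) -> tuple:
    """Try the primary model, then each fallback; return (model_used, text)."""
    last_error = None
    for model in [primary, *fallbacks.get(primary, [])]:
        for _ in range(1 + num_retries):  # one attempt plus retries per model
            try:
                return model, call_model(model, prompt)
            except Exception as error:  # a real client would catch narrower errors
                last_error = error
    raise RuntimeError("all models failed") from last_error


def fake_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real client call: gpt-4o is 'down'."""
    if model == "gpt-4o":
        raise ConnectionError("simulated outage")
    return f"[{model}] {prompt}"


used, text = complete_with_fallbacks(
    "Hello!", "gpt-4o", {"gpt-4o": ["claude-sonnet", "llama-local"]}, fake_call
)
print(used, text)  # claude-sonnet [claude-sonnet] Hello!
```

Letting the proxy own this logic keeps application code to a single call against one endpoint.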

Hardware Requirements

The most common question in open-source LLMs: "can I run this model on my hardware?" The answer depends on three factors: model size (parameter count), quantization level, and how much context you need. The rule of thumb for VRAM is bytes per parameter times parameter count: 2 bytes at FP16 (7B model = 14 GB), 1 byte at INT8 (7 GB), and about 0.5 bytes at INT4 (~4 GB). Add 1-2 GB for the KV cache (more for longer contexts or many concurrent requests).
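The rule of thumb is easy to turn into a quick estimator. This is a minimal sketch that counts weights only, with an optional flat allowance for the KV cache; real usage also includes activations and framework overhead, so treat the result as a lower bound. The function name and dictionary are illustrative.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billions: float, precision: str, kv_cache_gb: float = 0.0) -> float:
    """Weights-only estimate plus an optional KV cache allowance, in GB."""
    return params_billions * BYTES_PER_PARAM[precision] + kv_cache_gb


for size in (7, 13, 34, 70):
    row = [estimate_vram_gb(size, p) for p in ("fp16", "int8", "int4")]
    print(f"{size}B: {row}")
```

For example, `estimate_vram_gb(70, "int4")` gives 35.0, the figure used throughout this guide for a 70B model at INT4.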

GPU Memory Requirements

| Model Size | Min VRAM (FP16) | Min VRAM (INT8) | Min VRAM (INT4) |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 4 GB |
| 13B | 26 GB | 13 GB | 7 GB |
| 34B | 68 GB | 34 GB | 17 GB |
| 70B | 140 GB | 70 GB | 35 GB |

Local Development:
├── MacBook Pro M3 Max (64GB) → Up to 34B models (quantized)
├── RTX 4090 (24GB)           → 7B at FP16, 13B at INT8, 34B at INT4
└── RTX 3090 (24GB)           → Same as 4090, slightly slower

Production (per-node):
├── A100 80GB                 → 70B at INT8 (FP16 needs two cards)
├── A100 40GB × 2             → 70B at INT8 with tensor parallelism (tight), or INT4
├── H100 80GB                 → 70B at INT8, faster than A100
└── L40S 48GB                 → 34B at INT8, 13B at FP16

Quantization for Efficiency

Quantization is like compressing a high-resolution photo to JPEG: you lose some detail, but the file becomes dramatically smaller. In LLM terms, you are reducing the precision of model weights from 16-bit floats to 8-bit or 4-bit integers. A 70B parameter model that normally needs 140 GB of VRAM can fit in 35 GB at INT4, making it runnable on two consumer GPUs instead of two A100s. The quality loss is surprisingly small for most tasks, though it is measurable on benchmarks.

Run larger models on smaller GPUs with quantization:
# Using llama-cpp-python for quantized models
from llama_cpp import Llama

# Load GGUF quantized model (4-bit)
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
    n_threads=8
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
)

print(response["choices"][0]["message"]["content"])
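To see what "reducing the precision of model weights" means numerically, here is a toy symmetric (absmax) round-to-nearest INT8 quantizer over a handful of made-up weights. It is a deliberately simplified sketch: GGUF formats quantize in blocks with per-block scales and are more elaborate than this, and the helper names are hypothetical.

```python
def quantize_int8(weights: list) -> tuple:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(values: list, scale: float) -> list:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in values]


weights = [0.12, -0.53, 0.98, -1.27, 0.004]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max error = {max_error:.4f}")
```

Each weight is now stored as one small integer plus a shared scale, and the rounding error per weight is bounded by half the scale. The smallest weight (0.004) rounds to zero, which is the kind of detail that gets lost.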

Quantization Comparison

| Quantization | Size Reduction | Quality Loss | Perplexity Increase | Use Case |
|---|---|---|---|---|
| FP16 (BF16) | Baseline | None | 0% | Maximum quality, benchmarking |
| INT8 (Q8_0) | 50% | Minimal | ~0.5% | Production inference, best quality-to-size ratio |
| INT5 (Q5_K_M) | 65% | Very small | ~1% | Good balance for most tasks |
| INT4 (Q4_K_M) | 75% | Small | ~2-3% | Sweet spot for resource-constrained deployments |
| INT4 (Q4_0) | 75% | Moderate | ~5% | Maximum compression, simpler tasks only |
| INT3 (Q3_K_S) | 82% | Noticeable | ~8-10% | Extreme constraint: expect degraded output quality |
| INT2 (Q2_K) | 87% | Significant | ~15-20% | Experimental only, not recommended for production |

Edge case: quantization and task type. Quality loss is not uniform across tasks. Quantized models handle simple classification and extraction almost as well as FP16, but they degrade noticeably on complex reasoning, multi-step math, and creative writing. Test your specific use case and do not rely on general benchmark numbers.

Edge case: mixed quantization (GGUF K-quants). Formats like Q4_K_M keep selected tensors (parts of the attention and feed-forward weights) at higher precision and quantize the rest more aggressively. This is why Q4_K_M generally outperforms Q4_0 at a similar file size; the "K" stands for "k-quant," a smarter allocation of the precision budget.

Cloud Alternatives for Open Models

Run open-source models without managing infrastructure.

Cloud Inference Provider Comparison

| Provider | Speed | Price (Llama 3.3 70B) | OpenAI-Compatible API | Unique Strength | Limitation |
|---|---|---|---|---|---|
| Together AI | Fast | ~$0.90/M tokens | Yes | Wide model selection, fine-tuning support | Can be slow during peak hours |
| Groq | Fastest (500+ tok/s) | ~$0.59/M tokens | Yes | Custom LPU hardware, sub-second responses | Limited model selection, lower rate limits |
| Fireworks AI | Fast | ~$0.90/M tokens | Yes | Function calling support, long context | Smaller community |
| Replicate | Moderate | Pay-per-second GPU | Yes (via proxy) | Run any HuggingFace model, custom models | Cold starts can be slow (10-30s) |
| Modal | Fast | Pay-per-second GPU | Build your own | Full Python control, serverless GPUs | Requires more setup, not a hosted API |
| Anyscale | Fast | ~$1.00/M tokens | Yes | Ray-based scaling, production-grade | More enterprise-focused |

Decision shortcut: Need the fastest inference possible? Use Groq. Need the widest model selection? Use Together AI. Need to run a custom or fine-tuned model? Use Replicate or Modal. Need enterprise SLAs? Use Anyscale.

Together AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

Groq (Ultra-fast Inference)

Groq runs on custom LPU (Language Processing Unit) hardware that is purpose-built for inference. The result is staggeringly fast: 500+ tokens/second, which means a typical response arrives in under a second. The trade-off is a limited model selection and lower rate limits compared to other providers.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key"
)

# Extremely fast inference (500+ tokens/sec) -- great for latency-sensitive apps
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain recursion."}]
)

Fireworks AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-api-key"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
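Because all three providers above speak the OpenAI protocol, the only things that change between them are the base URL, the API key, and the model name. One way to avoid three near-identical client blocks is a small lookup table, sketched below. The base URLs and model names are the ones used in the examples above; the environment variable names and the `client_settings` helper are assumptions for illustration.

```python
import os

# Environment variable names are assumptions; use whatever your deployment defines.
PROVIDERS = {
    "together": {
        "base_url": "https://api.together.xyz/v1",
        "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
        "key_env": "TOGETHER_API_KEY",
    },
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "model": "llama-3.3-70b-versatile",
        "key_env": "GROQ_API_KEY",
    },
    "fireworks": {
        "base_url": "https://api.fireworks.ai/inference/v1",
        "model": "accounts/fireworks/models/llama-v3p3-70b-instruct",
        "key_env": "FIREWORKS_API_KEY",
    },
}


def client_settings(provider: str) -> dict:
    """Return the base URL, API key, and model name for one provider."""
    entry = PROVIDERS[provider]
    return {
        "base_url": entry["base_url"],
        "api_key": os.environ.get(entry["key_env"], ""),
        "model": entry["model"],
    }


settings = client_settings("groq")
print(settings["base_url"], settings["model"])
```

The result plugs into the same SDK calls as before: `OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])`, with `settings["model"]` passed to `chat.completions.create`.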

Production Deployment Patterns

The pattern below is a common small-scale stack: Ollama serves the model, LiteLLM provides a unified API with routing and fallbacks, and your application talks to LiteLLM. This gives you model-agnostic code (swap providers by changing config, not code), automatic fallbacks (if Ollama goes down, route to OpenAI), and a single point for logging and rate limiting. For high-concurrency serving, replace the Ollama service with vLLM, as in the Kubernetes example further down.

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
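    volumes:
      # In this config, point Ollama's api_base at http://ollama:11434 (the
      # service name), not localhost, which would resolve to the proxy container
      - ./litellm-config.yaml:/app/config.yaml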
    command: --config /app/config.yaml --port 4000
    depends_on:
      - ollama

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LLM_BASE_URL=http://litellm-proxy:4000/v1
    depends_on:
      - litellm-proxy

volumes:
  ollama_models:

Kubernetes with GPU Support

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc

Key Takeaways

Start with Ollama

For local development and prototyping, Ollama is unbeatable for ease of use.

vLLM for Production

When you need high throughput and concurrent users, vLLM’s optimizations shine.

Quantize for Efficiency

INT4 quantization lets you run 70B models on two 24 GB consumer GPUs, typically with only a small quality loss on simpler tasks.

Use LiteLLM for Flexibility

A unified interface lets you switch providers and add fallbacks easily.

What’s Next

Tool Calling

Learn how to give LLMs the ability to call functions and interact with external systems

Interview Deep-Dive

Strong Answer:
  • The first thing I would NOT do is try to replace everything at once. The key insight is that not all LLM workloads are equal. I would start by pulling our usage logs and segmenting by task type: classification, extraction, summarization, code generation, complex reasoning, and so on. In my experience, 60-80% of API spend comes from high-volume, simple tasks (classification, extraction, summarization) where a fine-tuned 7B-8B model can match GPT-4o quality. The remaining 20-40% is complex reasoning where you genuinely need a frontier model.
  • For the simple workloads, I would benchmark Llama 3.1 8B and Mistral 7B against our current GPT-4o-mini outputs using our actual production prompts, not generic benchmarks. I would build a 200-query evaluation set with human-labeled gold answers and measure accuracy, latency, and cost. The target: within 5% accuracy of the API model at 10x lower cost.
  • Deployment-wise, I would use vLLM on 2xA100 instances for the production inference server, with LiteLLM as a unified proxy that routes simple tasks to the local model and complex tasks to OpenAI. This gives us a single API interface for the application layer — the app code does not know or care which model is answering.
  • The realistic outcome: we move 60% of traffic to open-source, cut costs to around $40K/month (the $80K minus 60% leaves $32K, plus maybe $8-10K in GPU compute), and maintain quality. The remaining 40% stays on OpenAI because the accuracy gap on complex reasoning tasks is still real. This is roughly a 50% cost reduction with no user-visible quality degradation.
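The cost estimate in that answer is simple arithmetic, spelled out below with the figures quoted in it ($80K monthly API spend, 60% of traffic moved, $8-10K of GPU compute). The sketch assumes API spend scales in proportion to traffic share, which is a simplification; the function name is hypothetical.

```python
def hybrid_monthly_cost(api_spend: float, share_moved: float, gpu_cost: float) -> float:
    """API spend that remains after moving a share of traffic, plus GPU compute."""
    return api_spend * (1 - share_moved) + gpu_cost


for gpu_cost in (8_000, 10_000):
    total = hybrid_monthly_cost(80_000, 0.60, gpu_cost)
    print(f"GPU ${gpu_cost:,}: ${total:,.0f}/month")
```

With these inputs the total lands at about $40-42K per month, roughly half of the original bill.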
Follow-up: Your benchmarks show the open-source model matches GPT-4o-mini on accuracy, but your P95 latency is 3x worse. How would you diagnose and fix this?
  • The first question is where the latency is: is it time-to-first-token (TTFT) or token generation throughput? If TTFT is high, the model is spending too long on the prefill phase — likely because the KV cache is not warmed up or the batch size is too large. If throughput is low, the GPU is memory-bandwidth-bound, which is common with large models on insufficient hardware.
  • For TTFT, I would enable prefix caching in vLLM. If most of our requests share a common system prompt (which they do), prefix caching avoids recomputing the KV cache for that shared prefix on every request. This alone can cut TTFT by 40-60%.
  • For throughput, I would check GPU utilization. If the GPU is at 95%+ utilization but throughput is still low, we need more GPUs or a smaller model. If utilization is low, we have a batching problem — vLLM’s continuous batching should handle this, but I would check that max concurrent requests is set appropriately.
  • I would also look at quantization: if we are running FP16 but INT8 gives us within 1% accuracy, switching to INT8 halves weight memory and, because decoding is memory-bandwidth-bound, can raise throughput substantially (up to roughly 2x in the best case). The Q4_K_M quantization on GGUF models is surprisingly good: I have seen it maintain 98% of FP16 quality on extraction tasks.
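The first diagnostic step, separating time-to-first-token from decode throughput, only needs the arrival time of each streamed token. A minimal sketch follows, using synthetic timestamps; in practice you would record `time.perf_counter()` when the request is sent and again as each streamed chunk arrives. The function name and dictionary keys are illustrative.

```python
def latency_breakdown(request_start: float, token_times: list) -> dict:
    """Split one streamed response into time-to-first-token and decode speed."""
    ttft = token_times[0] - request_start
    decode_seconds = token_times[-1] - token_times[0]
    decoded_tokens = len(token_times) - 1  # tokens produced after the first one
    tokens_per_second = decoded_tokens / decode_seconds if decode_seconds > 0 else 0.0
    return {"ttft_s": ttft, "decode_tok_per_s": tokens_per_second}


# Synthetic timestamps: first token after 0.8 s, then one token every 20 ms.
start = 100.0
arrivals = [start + 0.8 + 0.02 * i for i in range(51)]
print(latency_breakdown(start, arrivals))
```

A high `ttft_s` with healthy `decode_tok_per_s` points at prefill or queueing; the reverse points at memory bandwidth or batching.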
Strong Answer:
  • The way I think about it: Ollama is for developers, vLLM is for throughput, and TGI is for the Hugging Face ecosystem. They solve different problems at different scales.
  • Ollama is the fastest path from zero to running a local model. It handles model downloading, quantization, and GPU detection automatically. The OpenAI-compatible API means your existing code works with one line change. But it is single-user optimized — it does not handle concurrent requests efficiently, has no continuous batching, and its memory management is basic. I use Ollama for local development and prototyping, never for production serving.
  • vLLM is purpose-built for production inference. Its killer feature is PagedAttention, which manages the KV cache the way an OS manages virtual memory — in pages that can be allocated, shared, and freed dynamically. This means 2-4x more concurrent requests on the same GPU compared to naive serving. Add continuous batching (new requests join the batch dynamically instead of waiting) and tensor parallelism (split one model across multiple GPUs), and you have a system that can serve hundreds of concurrent users. The trade-off: more complex setup, requires specific GPU hardware, and the model ecosystem is narrower than Ollama.
  • TGI is Hugging Face's offering, and its strength is tight integration with the HF model hub and ecosystem. It supports speculative decoding, quantization via bitsandbytes, and Flash Attention out of the box. Docker-first deployment is nice for teams that already use containers. But it is often slower than vLLM in published benchmarks, and while recent versions add an OpenAI-compatible Messages API (/v1/chat/completions), its native /generate API uses its own request format.
  • My production recommendation: vLLM for the serving engine, with LiteLLM as a proxy for API compatibility and routing. Use Ollama on developer laptops for testing prompts against local models before deploying to the vLLM cluster.
Follow-up: You are running a 70B model on 2xA100-80GB with tensor parallelism. Users report occasional OOM crashes under load. What is happening and how do you fix it?
  • The 70B model in FP16 takes about 140GB of VRAM, which fits in 2xA100-80GB (160GB total) with about 20GB of headroom. The weights are static, but that headroom has to hold everything else: the KV cache (dynamic, grows with concurrent requests and sequence length), activations, and CUDA/NCCL buffers. Under heavy load, these together exhaust the remaining 20GB.
  • The fix has several layers. First, lower --gpu-memory-utilization from the default 0.9 to 0.85. This flag caps the fraction of each GPU that vLLM claims for weights, activations, and its preallocated KV cache, so lowering it leaves more room for allocations outside that budget, at the cost of a smaller KV cache and slightly fewer concurrent requests. Second, set --max-model-len to the actual maximum you need (say 4096 instead of the model's full context window); this caps the per-request KV cache size. Third, consider INT8 quantization, which cuts the model to 70GB and frees 70GB for KV cache, dramatically increasing concurrent capacity. The quality loss from INT8 on a 70B model is negligible for most tasks, which makes it the sweet spot of quantization.
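A back-of-the-envelope calculation shows how quickly 20GB of headroom disappears. The sketch assumes a Llama-3-style 70B shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and FP16 cache values; check these against the config of the model you actually serve. It also treats the full 20GB as available for KV cache, which overstates reality because activations and buffers take a share.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) for every layer and KV head, for one token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, FP16 values.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
headroom_bytes = 20 * 1024**3  # the ~20 GB left after loading FP16 weights
max_cached_tokens = headroom_bytes // per_token

print(per_token // 1024, "KiB per token")                                 # 320
print(max_cached_tokens, "tokens fit in the headroom")                    # 65536
print(max_cached_tokens // 4096, "concurrent requests at 4096 tokens each")  # 16
```

Under these assumptions, about sixteen full-length 4096-token requests saturate the cache, which is why capping `--max-model-len` or freeing memory through quantization has such a large effect.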
Strong Answer:
  • It is technically true that quantization is lossy, but the statement is misleading because it implies the quality loss is always significant. In practice, the relationship between quantization level and quality loss is highly nonlinear and depends on model size, task type, and quantization method.
  • For large models (70B+), INT8 quantization is essentially free — benchmarks consistently show less than 1% degradation on standard evals. INT4 on 70B models shows 2-4% degradation, which is acceptable for most production workloads. The reason: larger models have more redundancy in their weights, so rounding errors average out.
  • For smaller models (7B), the math changes. INT4 on a 7B model can show 5-10% degradation on reasoning tasks, because there is less redundancy and each weight matters more. The quantization method also matters enormously: Q4_K_M (a mixed-precision k-quant) preserves critical attention layers at higher precision and only aggressively quantizes less important layers. It outperforms naive INT4 (Q4_0) by 3-5% on benchmarks.
  • Where I would refuse to quantize: medical diagnosis, legal document analysis, or any task where a 2-3% accuracy drop has real-world consequences AND the task requires nuanced reasoning (not just extraction). For those, I would rather pay for more GPU capacity and run FP16.
  • Where I would always quantize: high-volume classification, extraction, summarization, and any task where the output is validated by downstream code anyway. If your parser checks the JSON schema, a slightly less precise model that produces valid JSON 97% of the time instead of 99% is fine — the retry handles the 2%.
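The "retry handles the 2%" argument depends on having a validation step between the model and the rest of the system. A minimal sketch of that step is below: it parses the output as JSON, checks for required keys, and calls the model again on failure. The `generate` callable stands in for a real model call, and the function name and key names are hypothetical.

```python
import json
from typing import Callable


def generate_valid_json(generate: Callable[[], str], required_keys: set, max_attempts: int = 3) -> dict:
    """Call the model until its output parses as JSON with the required keys."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: try again
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            return parsed
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


# Hypothetical stand-in for a quantized model that fails once, then succeeds.
outputs = iter(['{"name": "Ada"', '{"name": "Ada", "label": "person"}'])
result = generate_valid_json(lambda: next(outputs), {"name", "label"})
print(result)  # {'name': 'Ada', 'label': 'person'}
```

The retry cost is one extra generation on the rare failure, which is usually far cheaper than running every request at higher precision.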
Follow-up: Your team is debating between running Llama 3.3 70B at INT4 on 2xRTX 4090 (consumer GPUs) versus renting A100s in the cloud. What are the hidden costs and gotchas of each approach?
  • The consumer GPU path is cheaper on paper ($2,500 per 4090 vs. $2-3/hour for A100s) but has hidden costs. First, 4090s have only 24GB VRAM each, so 70B at INT4 (35GB) requires tensor parallelism across two cards. Consumer cards communicate over PCIe, not NVLink, so inter-GPU communication is 5-10x slower than on data center GPUs, which hurts throughput for long sequences. Second, 4090s lack ECC memory, so bit errors under sustained load can go undetected. Third, there is no NVIDIA enterprise support and no guaranteed uptime, and your DevOps team now manages hardware.
  • The cloud A100 path is more expensive per month but eliminates hardware management, gives you NVLink for fast tensor parallelism, ECC memory, and the ability to scale up and down with demand. The hidden cost here is egress charges and spot instance interruptions. If you use spot A100s (50-70% cheaper), you need checkpointing and graceful failover for when instances get preempted.
  • My recommendation for most teams: start with cloud A100 spot instances behind a load balancer. Only move to owned hardware once you have predictable, sustained load that justifies the capital expense and the ops burden. The break-even is typically around 70-80% sustained utilization for 12+ months.