Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
December 2025 Update: Covers the latest open-source models including Llama 3.3, Mistral Large 2, Qwen 2.5, and DeepSeek-V3, plus deployment options for every scale.
Why Open Source LLMs?
The decision between open-source and closed-source LLMs is like the decision between renting an apartment and owning a house. Renting (OpenAI, Anthropic) is easier to start, someone else handles maintenance, and you can move quickly — but you are at the mercy of the landlord’s pricing, rules, and availability. Owning (open-source) requires more upfront investment and maintenance, but gives you complete control over your data, costs, and customization. Most production teams end up using both: closed-source for complex reasoning tasks and open-source for high-volume, cost-sensitive workloads. Open-source and local LLMs offer compelling advantages:| Benefit | Description |
|---|---|
| Privacy | Data never leaves your infrastructure |
| Cost | No per-token API fees after hardware costs |
| Customization | Fine-tune on your data |
| Latency | No network round-trip for local models |
| Compliance | Meet data residency requirements |
| Availability | No dependence on external services |
Open Source vs. Closed Source: Decision Framework
| Factor | Choose Open Source When | Choose Closed Source When |
|---|---|---|
| Data sensitivity | PII, medical records, legal docs, classified data | Public-facing content, non-sensitive queries |
| Volume | 100K+ requests/day (cost crossover point) | Under 10K requests/day |
| Quality requirements | ”Good enough” is acceptable (classification, extraction) | Best-possible quality required (complex reasoning, creative) |
| Latency | Sub-100ms needed (local inference) | 500ms-3s acceptable |
| Team size | Have infra/ML engineers to manage deployment | Small team, no ML ops capacity |
| Customization | Need domain-specific fine-tuning | General-purpose tasks |
| Budget | Upfront GPU investment acceptable | Prefer pay-per-use |
| Regulatory | GDPR, HIPAA, SOC2 with strict data residency | Standard compliance sufficient |
2025 Open Source Model Landscape
Inference Server Comparison
Choosing the right inference server matters more than choosing the right model. The wrong server can leave 80% of your GPU idle.| Feature | Ollama | vLLM | TGI (HuggingFace) | llama.cpp |
|---|---|---|---|---|
| Primary use | Development, prototyping | Production serving | Production serving | Edge/CPU inference |
| Setup complexity | 1 command | Moderate | Docker required | Compile from source |
| GPU utilization | Good | Excellent (PagedAttention) | Good | Optional (CPU-first) |
| Concurrent users | Low (1-5) | High (100+) | High (50+) | Low (1-3) |
| Throughput (tok/s) | Good | Best (2-4x others) | Good | Moderate |
| Continuous batching | No | Yes | Yes | No |
| Tensor parallelism | No | Yes (multi-GPU) | Yes (multi-GPU) | No |
| OpenAI-compatible API | Yes | Yes | No (custom API) | Yes (via server mode) |
| Quantization support | GGUF auto | AWQ, GPTQ, FP8 | AWQ, GPTQ, bitsandbytes | GGUF (best in class) |
| Apple Silicon | Excellent | No | No | Excellent |
| Best for | Local dev, Mac users | Production GPU clusters | HuggingFace ecosystem | Raspberry Pi, laptops |
Ollama: Easiest Local LLM Setup
Ollama is the “Docker for LLMs” — it packages models with their runtime so you can pull and run them with a single command. It handles quantization, GPU detection, and memory management automatically. If you have never run a local LLM before, start here. The experience is genuinely as simple asollama run llama3.2, and the OpenAI-compatible API means your existing code works with minimal changes.
Installation
Basic Usage
API Usage
Ollama exposes an OpenAI-compatible API:Custom Modelfiles
Create custom model configurations:Running Multiple Models
vLLM: Production-Scale Inference
If Ollama is for development, vLLM is for production. The difference is like running a single-threaded web server vs. nginx: vLLM is designed from the ground up for high-throughput, concurrent inference. Its key innovation, PagedAttention, manages GPU memory the way an operating system manages RAM with virtual memory pages — which means it can serve 2-4x more concurrent requests than naive implementations on the same hardware.Installation
Starting the Server
Key Features
vLLM Optimizations Explained
Text Generation Inference (TGI)
Hugging Face’s production inference server.Running with Docker
API Usage
LiteLLM: Unified Interface
The problem LiteLLM solves is real: every LLM provider has a slightly different API, different parameter names, different response formats. Switching providers means rewriting integration code. LiteLLM acts as a universal adapter — one interface that translates to any provider. More importantly, its proxy server adds routing, fallbacks, and load balancing, which is exactly what you need when running multiple models in production.LiteLLM Proxy Server
Run a unified proxy for all your models:Hardware Requirements
The most common question in open-source LLMs: “can I run this model on my hardware?” The answer depends on three factors: model size (parameter count), quantization level, and how much context you need. The rule of thumb for VRAM: multiply the parameter count by 2 for FP16 (7B model = 14 GB), or divide by the quantization factor (7B INT4 = ~4 GB). Add 1-2 GB for the KV cache (more for longer contexts).GPU Memory Requirements
| Model Size | Min VRAM (FP16) | Min VRAM (INT8) | Min VRAM (INT4) |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 4 GB |
| 13B | 26 GB | 13 GB | 7 GB |
| 34B | 68 GB | 34 GB | 17 GB |
| 70B | 140 GB | 70 GB | 35 GB |
Recommended Setups
Quantization for Efficiency
Quantization is like compressing a high-resolution photo to JPEG: you lose some detail, but the file becomes dramatically smaller. In LLM terms, you are reducing the precision of model weights from 16-bit floats to 8-bit or 4-bit integers. A 70B parameter model that normally needs 140 GB of VRAM can fit in 35 GB at INT4 — making it runnable on two consumer GPUs instead of two A100s. The quality loss is surprisingly small for most tasks, though it is measurable on benchmarks. Run larger models on smaller GPUs with quantization:Quantization Comparison
| Quantization | Size Reduction | Quality Loss | Perplexity Increase | Use Case |
|---|---|---|---|---|
| FP16 (BF16) | Baseline | None | 0% | Maximum quality, benchmarking |
| INT8 (Q8_0) | 50% | Minimal | ~0.5% | Production inference, best quality-to-size ratio |
| INT5 (Q5_K_M) | 65% | Very small | ~1% | Good balance for most tasks |
| INT4 (Q4_K_M) | 75% | Small | ~2-3% | Sweet spot for resource-constrained deployments |
| INT4 (Q4_0) | 75% | Moderate | ~5% | Maximum compression, simpler tasks only |
| INT3 (Q3_K_S) | 82% | Noticeable | ~8-10% | Extreme constraint — expect degraded output quality |
| INT2 (Q2_K) | 87% | Significant | ~15-20% | Experimental only, not recommended for production |
Cloud Alternatives for Open Models
Run open-source models without managing infrastructure.Cloud Inference Provider Comparison
| Provider | Speed | Price (Llama 3.3 70B) | OpenAI-Compatible API | Unique Strength | Limitation |
|---|---|---|---|---|---|
| Together AI | Fast | ~$0.90/M tokens | Yes | Wide model selection, fine-tuning support | Can be slow during peak hours |
| Groq | Fastest (500+ tok/s) | ~$0.59/M tokens | Yes | Custom LPU hardware, sub-second responses | Limited model selection, lower rate limits |
| Fireworks AI | Fast | ~$0.90/M tokens | Yes | Function calling support, long context | Smaller community |
| Replicate | Moderate | Pay-per-second GPU | Yes (via proxy) | Run any HuggingFace model, custom models | Cold starts can be slow (10-30s) |
| Modal | Fast | Pay-per-second GPU | Build your own | Full Python control, serverless GPUs | Requires more setup, not a hosted API |
| Anyscale | Fast | ~$1.00/M tokens | Yes | Ray-based scaling, production-grade | More enterprise-focused |
Together AI
Groq (Ultra-fast Inference)
Groq runs on custom LPU (Language Processing Unit) hardware that is purpose-built for inference. The result is staggeringly fast: 500+ tokens/second, which means a typical response arrives in under a second. The trade-off is a limited model selection and lower rate limits compared to other providers.Fireworks AI
Production Deployment Patterns
The pattern below is a production-tested stack: Ollama serves the model, LiteLLM provides a unified API with routing and fallbacks, and your application talks to LiteLLM. This gives you model-agnostic code (swap providers by changing config, not code), automatic fallbacks (if Ollama goes down, route to OpenAI), and a single point for logging and rate limiting.Docker Compose Setup
Kubernetes with GPU Support
Key Takeaways
Start with Ollama
For local development and prototyping, Ollama is unbeatable for ease of use.
vLLM for Production
When you need high throughput and concurrent users, vLLM’s optimizations shine.
Quantize for Efficiency
INT4 quantization lets you run 70B models on consumer hardware with minimal quality loss.
Use LiteLLM for Flexibility
A unified interface lets you switch providers and add fallbacks easily.
What’s Next
Tool Calling
Learn how to give LLMs the ability to call functions and interact with external systems
Interview Deep-Dive
Your company is spending $80K/month on OpenAI API calls. Your VP of Engineering asks you to evaluate moving some workloads to open-source models. Walk me through how you would approach this decision and what you would actually recommend.
Your company is spending $80K/month on OpenAI API calls. Your VP of Engineering asks you to evaluate moving some workloads to open-source models. Walk me through how you would approach this decision and what you would actually recommend.
Strong Answer:
- The first thing I would NOT do is try to replace everything at once. The key insight is that not all LLM workloads are equal. I would start by pulling our usage logs and segmenting by task type: classification, extraction, summarization, code generation, complex reasoning, and so on. In my experience, 60-80% of API spend comes from high-volume, simple tasks (classification, extraction, summarization) where a fine-tuned 7B-8B model can match GPT-4o quality. The remaining 20-40% is complex reasoning where you genuinely need a frontier model.
- For the simple workloads, I would benchmark Llama 3.2 8B and Mistral 7B against our current GPT-4o-mini outputs using our actual production prompts — not generic benchmarks. I would build a 200-query evaluation set with human-labeled gold answers and measure accuracy, latency, and cost. The target: within 5% accuracy of the API model at 10x lower cost.
- Deployment-wise, I would use vLLM on 2xA100 instances for the production inference server, with LiteLLM as a unified proxy that routes simple tasks to the local model and complex tasks to OpenAI. This gives us a single API interface for the application layer — the app code does not know or care which model is answering.
- The realistic outcome: we move 60% of traffic to open-source, cut costs to around 80K minus 60%, plus maybe $8-10K in GPU compute), and maintain quality. The remaining 40% stays on OpenAI because the accuracy gap on complex reasoning tasks is still real. This is a 50% cost reduction with no user-visible quality degradation.
- The first question is where the latency is: is it time-to-first-token (TTFT) or token generation throughput? If TTFT is high, the model is spending too long on the prefill phase — likely because the KV cache is not warmed up or the batch size is too large. If throughput is low, the GPU is memory-bandwidth-bound, which is common with large models on insufficient hardware.
- For TTFT, I would enable prefix caching in vLLM. If most of our requests share a common system prompt (which they do), prefix caching avoids recomputing the KV cache for that shared prefix on every request. This alone can cut TTFT by 40-60%.
- For throughput, I would check GPU utilization. If the GPU is at 95%+ utilization but throughput is still low, we need more GPUs or a smaller model. If utilization is low, we have a batching problem — vLLM’s continuous batching should handle this, but I would check that max concurrent requests is set appropriately.
- I would also look at quantization: if we are running FP16 but INT8 gives us within 1% accuracy, switching to INT8 halves memory usage and roughly doubles throughput. The Q4_K_M quantization on GGUF models is surprisingly good — I have seen it maintain 98% of FP16 quality on extraction tasks.
Explain the trade-offs between running Ollama, vLLM, and TGI in production. When would you choose each?
Explain the trade-offs between running Ollama, vLLM, and TGI in production. When would you choose each?
Strong Answer:
- The way I think about it: Ollama is for developers, vLLM is for throughput, and TGI is for the Hugging Face ecosystem. They solve different problems at different scales.
- Ollama is the fastest path from zero to running a local model. It handles model downloading, quantization, and GPU detection automatically. The OpenAI-compatible API means your existing code works with one line change. But it is single-user optimized — it does not handle concurrent requests efficiently, has no continuous batching, and its memory management is basic. I use Ollama for local development and prototyping, never for production serving.
- vLLM is purpose-built for production inference. Its killer feature is PagedAttention, which manages the KV cache the way an OS manages virtual memory — in pages that can be allocated, shared, and freed dynamically. This means 2-4x more concurrent requests on the same GPU compared to naive serving. Add continuous batching (new requests join the batch dynamically instead of waiting) and tensor parallelism (split one model across multiple GPUs), and you have a system that can serve hundreds of concurrent users. The trade-off: more complex setup, requires specific GPU hardware, and the model ecosystem is narrower than Ollama.
- TGI is Hugging Face’s offering, and its strength is tight integration with the HF model hub and ecosystem. It supports speculative decoding, quantization via bitsandbytes, and Flash Attention out of the box. Docker-first deployment is nice for teams that already use containers. But it is slower than vLLM in benchmarks for most models, and the API is not OpenAI-compatible without a wrapper.
- My production recommendation: vLLM for the serving engine, with LiteLLM as a proxy for API compatibility and routing. Use Ollama on developer laptops for testing prompts against local models before deploying to the vLLM cluster.
- The 70B model in FP16 takes about 140GB of VRAM, which fits in 2xA100-80GB (160GB total) with about 20GB headroom. But that headroom is shared between the model weights (static) and the KV cache (dynamic, grows with concurrent requests and sequence length). Under heavy load, the KV cache exhausts the remaining 20GB.
- The fix has several layers. First, set
--gpu-memory-utilization 0.85instead of the default 0.9 — this leaves more headroom for the KV cache at the cost of slightly fewer concurrent requests. Second, set--max-model-lento the actual maximum you need (say 4096 instead of the model’s full 8192) — this caps the per-request KV cache size. Third, consider INT8 quantization, which cuts the model to 70GB and frees 70GB for KV cache, dramatically increasing concurrent capacity. The quality loss from INT8 on a 70B model is negligible for most tasks — it is the sweet spot of quantization.
A colleague says 'Quantization is just lossy compression, you always lose quality.' Is that true? When would you quantize and when would you refuse to?
A colleague says 'Quantization is just lossy compression, you always lose quality.' Is that true? When would you quantize and when would you refuse to?
Strong Answer:
- It is technically true that quantization is lossy, but the statement is misleading because it implies the quality loss is always significant. In practice, the relationship between quantization level and quality loss is highly nonlinear and depends on model size, task type, and quantization method.
- For large models (70B+), INT8 quantization is essentially free — benchmarks consistently show less than 1% degradation on standard evals. INT4 on 70B models shows 2-4% degradation, which is acceptable for most production workloads. The reason: larger models have more redundancy in their weights, so rounding errors average out.
- For smaller models (7B), the math changes. INT4 on a 7B model can show 5-10% degradation on reasoning tasks, because there is less redundancy and each weight matters more. The quantization method also matters enormously: Q4_K_M (a mixed-precision k-quant) preserves critical attention layers at higher precision and only aggressively quantizes less important layers. It outperforms naive INT4 (Q4_0) by 3-5% on benchmarks.
- Where I would refuse to quantize: medical diagnosis, legal document analysis, or any task where a 2-3% accuracy drop has real-world consequences AND the task requires nuanced reasoning (not just extraction). For those, I would rather pay for more GPU capacity and run FP16.
- Where I would always quantize: high-volume classification, extraction, summarization, and any task where the output is validated by downstream code anyway. If your parser checks the JSON schema, a slightly less precise model that produces valid JSON 97% of the time instead of 99% is fine — the retry handles the 2%.
- The consumer GPU path is cheaper on paper (2-3/hour for A100s) but has hidden costs. First, 4090s have only 24GB VRAM each, so 70B at INT4 (35GB) requires tensor parallelism across two cards — but consumer cards use PCIe, not NVLink, so inter-GPU communication is 5-10x slower than on data center GPUs. This kills throughput for long sequences. Second, 4090s lack ECC memory, meaning silent bit errors under sustained load. Third, no NVIDIA enterprise support, no guaranteed uptime, and your DevOps team now manages hardware.
- The cloud A100 path is more expensive per month but eliminates hardware management, gives you NVLink for fast tensor parallelism, ECC memory, and the ability to scale up and down with demand. The hidden cost here is egress charges and spot instance interruptions. If you use spot A100s (50-70% cheaper), you need checkpointing and graceful failover for when instances get preempted.
- My recommendation for most teams: start with cloud A100 spot instances behind a load balancer. Only move to owned hardware once you have predictable, sustained load that justifies the capital expense and the ops burden. The break-even is typically around 70-80% sustained utilization for 12+ months.