December 2025 Update: Covers the latest open-source models including Llama 3.3, Mistral Large 2, Qwen 2.5, and DeepSeek-V3, plus deployment options for every scale.
Why Open Source LLMs?
Open-source and local LLMs offer compelling advantages:
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your infrastructure |
| Cost | No per-token API fees after hardware costs |
| Customization | Fine-tune on your own data |
| Latency | No network round-trip for local models |
| Compliance | Meet data residency requirements |
| Availability | No dependence on external services |
2025 Open Source Model Landscape
Model Capabilities (December 2025)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Frontier Class (70B+ params, GPT-4 competitive):
├── Llama 3.3 70B - Best open weights
├── Mistral Large 2 123B - Strong reasoning
├── Qwen 2.5 72B - Excellent multilingual
└── DeepSeek-V3 671B (MoE) - State-of-the-art open
Mid-tier (7B-70B, GPT-3.5 competitive):
├── Llama 3.1 8B - Fast, efficient
├── Mistral 7B - Great for fine-tuning
├── Qwen 2.5 7B/14B/32B - Balanced performance
└── Gemma 2 9B/27B - Google's open models
Edge/Mobile (<7B, device-friendly):
├── Llama 3.2 1B/3B - On-device inference
├── Phi-3.5 3.8B - Microsoft's small model
├── Gemma 2 2B - Minimal footprint
└── Qwen 2.5 0.5B/1.5B - Ultra-lightweight
Ollama: Easiest Local LLM Setup
Ollama is the fastest way to run LLMs locally. One command to install, one command to run.
Installation
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows - Download from ollama.com
# Or with winget:
winget install Ollama.Ollama
Basic Usage
# Pull and run a model
ollama run llama3.2
# Pull without running
ollama pull mistral
# List installed models
ollama list
# Set runtime parameters from inside the interactive session
ollama run llama3.2
>>> /set parameter num_ctx 4096
API Usage
Ollama exposes an OpenAI-compatible API:
from openai import OpenAI

# Point to the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by the client but not used by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing briefly."}
    ]
)
print(response.choices[0].message.content)
Custom Modelfiles
Create custom model configurations:
# Modelfile
FROM llama3.2
# Set system prompt
SYSTEM """
You are a senior software engineer specializing in Python.
Provide concise, production-ready code with clear explanations.
"""
# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
# Custom stop tokens
PARAMETER stop "<|endoftext|>"
PARAMETER stop "Human:"
# Create and run custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
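Once created, the custom model is addressable through the same OpenAI-compatible endpoint as any other; a minimal sketch, assuming Ollama is running locally and the code-assistant model above has been created:

from openai import OpenAI

# Query the custom model defined by the Modelfile above
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="code-assistant",
    messages=[{"role": "user", "content": "Write a retry decorator with exponential backoff."}]
)
print(response.choices[0].message.content)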
Running Multiple Models
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

async def query_multiple_models(prompt: str):
    """Query multiple models and compare responses"""
    models = ["llama3.2", "mistral", "qwen2.5:7b"]

    async def query_model(model: str):
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return model, response.choices[0].message.content

    results = await asyncio.gather(*[query_model(m) for m in models])
    return dict(results)

# Compare responses
responses = asyncio.run(query_multiple_models("What is machine learning?"))
for model, response in responses.items():
    print(f"\n{model}:\n{response[:200]}...")
vLLM: Production-Scale Inference
vLLM provides high-throughput inference with advanced optimizations.
Installation
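vLLM installs from PyPI (a Linux machine with an NVIDIA GPU and a recent CUDA runtime is assumed):

pip install vllm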
Starting the Server
# Basic server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000

# With optimizations
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --enable-prefix-caching
Key Features
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

# Batched inference (vLLM batches concurrent requests server-side)
prompts = [
    "Explain Python decorators",
    "What is async/await?",
    "How does garbage collection work?"
]

responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}]
    )
    responses.append(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
vLLM Optimizations Explained
┌─────────────────────────────────────────────────────────────┐
│ vLLM Architecture │
├─────────────────────────────────────────────────────────────┤
│ │
│ PagedAttention │
│ ├── Manages KV cache like virtual memory pages │
│ └── Enables serving many concurrent requests │
│ │
│ Continuous Batching │
│ ├── Dynamically adds requests to running batch │
│ └── Maximizes GPU utilization │
│ │
│ Tensor Parallelism │
│ ├── Splits model across multiple GPUs │
│ └── Linear speedup for large models │
│ │
│ Prefix Caching │
│ ├── Reuses KV cache for shared prefixes │
│ └── Great for system prompts and few-shot examples │
│ │
└─────────────────────────────────────────────────────────────┘
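These optimizations pay off when requests arrive concurrently. A minimal sketch, assuming the vLLM server started above is listening on localhost:8000: the shared system prompt lets prefix caching reuse its KV cache across requests, and sending the requests together lets continuous batching keep the GPU busy.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

SYSTEM = "You are a concise technical assistant."  # shared prefix across all requests

async def ask(question: str) -> str:
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

async def main():
    questions = [
        "What is PagedAttention?",
        "What is tensor parallelism?",
        "Why does batching improve GPU utilization?",
    ]
    # Concurrent requests get merged into one running batch by the server
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"\n{question}\n{answer[:200]}...")

asyncio.run(main())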
Text Generation Inference (TGI)
Hugging Face’s production inference server.
Running with Docker
# Pull and run TGI (gated Llama weights need a Hugging Face token)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    -e HUGGING_FACE_HUB_TOKEN=your-hf-token \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192
API Usage
import requests

def query_tgi(prompt: str) -> str:
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "do_sample": True
            }
        }
    )
    return response.json()["generated_text"]

# With streaming
def stream_tgi(prompt: str):
    response = requests.post(
        "http://localhost:8080/generate_stream",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 500}},
        stream=True
    )
    for line in response.iter_lines():
        if line:
            yield line.decode("utf-8")
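A quick usage check for both helpers, assuming the TGI server from the Docker command above is running; note that each streamed line is a server-sent event of the form data: {...}:

print(query_tgi("Explain Python generators in two sentences."))

for event in stream_tgi("Write a haiku about GPUs."):
    print(event)  # raw SSE line, e.g. data: {"token": {...}}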
LiteLLM: Unified Interface
LiteLLM provides a single interface for 100+ LLM providers.
from litellm import completion

# Works with any provider
def chat(prompt: str, provider: str = "ollama"):
    configs = {
        "ollama": {"model": "ollama/llama3.2", "api_base": "http://localhost:11434"},
        "openai": {"model": "gpt-4o"},
        "anthropic": {"model": "claude-3-5-sonnet-20241022"},
        "local_vllm": {"model": "openai/meta-llama/Llama-3.1-8B-Instruct",
                       "api_base": "http://localhost:8000/v1"}
    }
    config = configs[provider]
    response = completion(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        api_base=config.get("api_base")
    )
    return response.choices[0].message.content

# Seamlessly switch providers
print(chat("Hello!", provider="ollama"))
print(chat("Hello!", provider="openai"))
LiteLLM Proxy Server
Run a unified proxy for all your models:
# config.yaml
model_list:
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 2
  fallbacks: [{"gpt-4o": ["claude-sonnet", "llama-local"]}]

# Start proxy
litellm --config config.yaml --port 4000
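The proxy itself speaks the OpenAI API, so existing clients only need a new base URL; a sketch assuming the proxy above is running on port 4000 with no virtual keys configured (pass a real key if you enable them):

from openai import OpenAI

# Any model_name from config.yaml can be requested through the proxy
client = OpenAI(base_url="http://localhost:4000", api_key="anything")
response = client.chat.completions.create(
    model="llama-local",
    messages=[{"role": "user", "content": "Hello from the proxy!"}]
)
print(response.choices[0].message.content)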
Hardware Requirements
GPU Memory Requirements
| Model Size | Min VRAM (FP16) | Min VRAM (INT8) | Min VRAM (INT4) |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 4 GB |
| 13B | 26 GB | 13 GB | 7 GB |
| 34B | 68 GB | 34 GB | 17 GB |
| 70B | 140 GB | 70 GB | 35 GB |
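These figures are just bytes-per-parameter arithmetic on the weights; KV cache and activations add more on top. A back-of-the-envelope helper to reproduce them:

def estimate_vram_gb(params_billion: float, bits: int) -> float:
    """Weights-only VRAM estimate: parameter count times bits per parameter."""
    return params_billion * bits / 8  # 1B params at 8 bits ~= 1 GB

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B at {label}: ~{estimate_vram_gb(70, bits):.0f} GB")  # 140 / 70 / 35 GB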
Recommended Setups
Local Development:
├── MacBook Pro M3 Max (64GB) → Up to 34B models
├── RTX 4090 (24GB) → 7B-13B at FP16, 34B at INT4
└── RTX 3090 (24GB) → Same as 4090, slightly slower
Production (per-node):
├── A100 80GB → 70B at INT8 (FP16 needs ~140 GB)
├── A100 40GB × 2 → 70B at INT8 with tensor parallelism
├── H100 80GB → 70B at INT8 with headroom for long contexts
└── L40S 48GB → Up to 34B at INT8, 13B at FP16
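To check what your own hardware can hold, a quick probe (assumes PyTorch with CUDA support is installed):

import torch

# Print the name and total VRAM of each visible GPU
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB VRAM")
else:
    print("No CUDA GPU detected")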
Quantization for Efficiency
Run larger models on smaller GPUs with quantization:
# Using llama-cpp-python for quantized models
from llama_cpp import Llama

# Load a GGUF quantized model (4-bit)
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to the GPU
    n_threads=8
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
)
print(response["choices"][0]["message"]["content"])
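Streaming works with the same handle; a sketch using llama-cpp-python's OpenAI-style chunk format (the exact delta fields can vary by version, so treat this as illustrative):

# Stream the reply token by token
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three Docker best practices."}],
    stream=True
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)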
Quantization Comparison
| Quantization | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| FP16 | Baseline | None | Maximum quality |
| INT8 | 50% | Minimal | Production inference |
| INT4 (Q4_K_M) | 75% | Small | Resource-constrained |
| INT4 (Q4_0) | 75% | Moderate | Maximum efficiency |
Cloud Alternatives for Open Models
Run open-source models without managing infrastructure:
Together AI
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Groq (Ultra-fast Inference)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key"
)

# Extremely fast inference (500+ tokens/sec)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain recursion."}]
)
Fireworks AI
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-api-key"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
Production Deployment Patterns
Docker Compose Setup
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    command: --config /app/config.yaml --port 4000
    depends_on:
      - ollama

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LLM_BASE_URL=http://litellm-proxy:4000/v1
    depends_on:
      - litellm-proxy

volumes:
  ollama_models:
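Inside the api container, the application only needs the LLM_BASE_URL injected by the compose file; a minimal sketch of that wiring (llama-local comes from the LiteLLM config shown earlier, and LITELLM_API_KEY is a hypothetical variable for a virtual key, if you configure one):

import os
from openai import OpenAI

# Route everything through the LiteLLM proxy defined in docker-compose
client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],  # http://litellm-proxy:4000/v1
    api_key=os.environ.get("LITELLM_API_KEY", "anything"),
)
response = client.chat.completions.create(
    model="llama-local",
    messages=[{"role": "user", "content": "Hello from the api service!"}]
)
print(response.choices[0].message.content)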
Kubernetes with GPU Support
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Key Takeaways
Start with Ollama: For local development and prototyping, Ollama is unbeatable for ease of use.
vLLM for Production: When you need high throughput and many concurrent users, vLLM's optimizations shine.
Quantize for Efficiency: INT4 quantization cuts a 70B model to roughly 35 GB, putting it within reach of high-end consumer hardware with only a small quality loss.
Use LiteLLM for Flexibility: A unified interface lets you switch providers and add fallbacks easily.
What’s Next
Tool Calling: Learn how to give LLMs the ability to call functions and interact with external systems.