December 2025 Update: Covers the latest open-source models including Llama 3.3, Mistral Large 2, Qwen 2.5, and DeepSeek-V3, plus deployment options for every scale.

Why Open Source LLMs?

Open-source and local LLMs offer compelling advantages:
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your infrastructure |
| Cost | No per-token API fees after hardware costs |
| Customization | Fine-tune on your data |
| Latency | No network round-trip for local models |
| Compliance | Meet data residency requirements |
| Availability | No dependence on external services |

2025 Open Source Model Landscape

Model Capabilities (December 2025)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Frontier Class (70B+ params, GPT-4 competitive):
├── Llama 3.3 70B            - Best open weights
├── Mistral Large 2 123B     - Strong reasoning
├── Qwen 2.5 72B             - Excellent multilingual
└── DeepSeek-V3 671B (MoE)   - State-of-the-art open

Mid-tier (7B-70B, GPT-3.5 competitive):
├── Llama 3.1 8B             - Fast, efficient
├── Mistral 7B               - Great for fine-tuning
├── Qwen 2.5 7B/14B/32B      - Balanced performance
└── Gemma 2 9B/27B           - Google's open models

Edge/Mobile (<7B, device-friendly):
├── Llama 3.2 1B/3B          - On-device inference
├── Phi-3.5 3.8B             - Microsoft's small model
├── Gemma 2 2B               - Minimal footprint
└── Qwen 2.5 0.5B/1.5B       - Ultra-lightweight

Ollama: Easiest Local LLM Setup

Ollama is the fastest way to run LLMs locally. One command to install, one command to run.

Installation

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - Download from ollama.com
# Or with winget:
winget install Ollama.Ollama

Basic Usage

# Pull and run a model
ollama run llama3.2

# Pull without running
ollama pull mistral

# List installed models
ollama list

# Set parameters from inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 4096

API Usage

Ollama exposes an OpenAI-compatible API:
from openai import OpenAI

# Point to local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing briefly."}
    ]
)

print(response.choices[0].message.content)
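
Ollama also has a native REST API if you'd rather skip the OpenAI compatibility layer; a minimal sketch with requests (the /api/chat endpoint and response shape follow Ollama's API docs):
import requests

# Non-streaming chat call against Ollama's native API
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Explain quantum computing briefly."}],
        "stream": False  # with True, Ollama returns newline-delimited JSON chunks
    }
)
print(resp.json()["message"]["content"])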

Custom Modelfiles

Create custom model configurations:
# Modelfile
FROM llama3.2

# Set system prompt
SYSTEM """
You are a senior software engineer specializing in Python.
Provide concise, production-ready code with clear explanations.
"""

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

# Custom stop tokens
PARAMETER stop "<|endoftext|>"
PARAMETER stop "Human:"

# Create and run custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
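
The custom model is served under the name you passed to ollama create, so it works through the same OpenAI-compatible endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Reference the custom model by the name used with `ollama create`
response = client.chat.completions.create(
    model="code-assistant",
    messages=[{"role": "user", "content": "Write a retry decorator with exponential backoff."}]
)
print(response.choices[0].message.content)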

Running Multiple Models

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

async def query_multiple_models(prompt: str):
    """Query multiple models and compare responses"""
    models = ["llama3.2", "mistral", "qwen2.5:7b"]
    
    async def query_model(model: str):
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return model, response.choices[0].message.content
    
    results = await asyncio.gather(*[query_model(m) for m in models])
    return dict(results)

# Compare responses
responses = asyncio.run(query_multiple_models("What is machine learning?"))
for model, response in responses.items():
    print(f"\n{model}:\n{response[:200]}...")

vLLM: Production-Scale Inference

vLLM provides high-throughput inference with advanced optimizations.

Installation

pip install vllm

Starting the Server

# Basic server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000

# With optimizations
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --enable-prefix-caching
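
For offline batch jobs you can skip the HTTP server and use vLLM's Python API directly; a minimal sketch (same model as above, chat templating omitted for brevity):
from vllm import LLM, SamplingParams

# Load the model once; vLLM handles batching and KV-cache management internally
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Explain Python decorators.", "What is async/await?"]

# generate() processes all prompts as one batched job
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:200])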

Key Features

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

# Multiple prompts: vLLM batches concurrent requests on the server side;
# this sequential loop is only for illustration (see the concurrent sketch below)
prompts = [
    "Explain Python decorators",
    "What is async/await?",
    "How does garbage collection work?"
]

responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-8B-Instruct",
        messages=[{"role": "user", "content": prompt}]
    )
    responses.append(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
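
To actually benefit from continuous batching, issue requests concurrently instead of one at a time; a sketch using a thread pool against the same server:
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

prompts = ["Explain Python decorators", "What is async/await?", "How does garbage collection work?"]

# In-flight requests are merged into the same forward passes by the server
with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(ask, prompts))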

vLLM Optimizations Explained

┌─────────────────────────────────────────────────────────────┐
│                    vLLM Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PagedAttention                                              │
│  ├── Manages KV cache like virtual memory pages             │
│  └── Enables serving many concurrent requests               │
│                                                             │
│  Continuous Batching                                         │
│  ├── Dynamically adds requests to running batch             │
│  └── Maximizes GPU utilization                              │
│                                                             │
│  Tensor Parallelism                                          │
│  ├── Splits model across multiple GPUs                      │
│  └── Linear speedup for large models                        │
│                                                             │
│  Prefix Caching                                              │
│  ├── Reuses KV cache for shared prefixes                    │
│  └── Great for system prompts and few-shot examples         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
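
Prefix caching is easy to exploit from client code: keep the shared part of the prompt (system prompt, few-shot examples) identical across requests, and with --enable-prefix-caching the server reuses its KV cache rather than recomputing it. A sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

# Identical across requests, so its KV cache is computed once and reused
SYSTEM_PROMPT = "You are a support agent. Answer strictly from company policy."

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

for q in ["How do I reset my password?", "What is the refund window?"]:
    print(answer(q))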

Text Generation Inference (TGI)

Hugging Face’s production inference server.

Running with Docker

# Pull and run TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192

API Usage

import json
import requests

def query_tgi(prompt: str) -> str:
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "do_sample": True
            }
        }
    )
    return response.json()["generated_text"]

# With streaming (TGI sends server-sent events, one "data:{...}" line per token)
def stream_tgi(prompt: str):
    response = requests.post(
        "http://localhost:8080/generate_stream",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 500}},
        stream=True
    )

    for line in response.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            yield event["token"]["text"]
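
If you'd rather not hand-roll HTTP calls, the huggingface_hub client wraps the same endpoints; a sketch, assuming the TGI container above is listening on port 8080:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single-shot generation
print(client.text_generation("Explain Docker in one sentence.", max_new_tokens=100))

# Token-by-token streaming
for token in client.text_generation("Write a haiku about containers.",
                                    max_new_tokens=60, stream=True):
    print(token, end="")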

LiteLLM: Unified Interface

LiteLLM provides a single interface for 100+ LLM providers.
from litellm import completion

# Works with any provider
def chat(prompt: str, provider: str = "ollama"):
    configs = {
        "ollama": {"model": "ollama/llama3.2", "api_base": "http://localhost:11434"},
        "openai": {"model": "gpt-4o"},
        "anthropic": {"model": "claude-3-5-sonnet-20241022"},
        "local_vllm": {"model": "openai/meta-llama/Llama-3.2-8B-Instruct", 
                       "api_base": "http://localhost:8000/v1"}
    }
    
    config = configs[provider]
    
    response = completion(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        api_base=config.get("api_base")
    )
    
    return response.choices[0].message.content

# Seamlessly switch providers
print(chat("Hello!", provider="ollama"))
print(chat("Hello!", provider="openai"))

LiteLLM Proxy Server

Run a unified proxy for all your models:
# config.yaml
model_list:
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
      
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      
  - model_name: claude-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 2
  fallbacks: [{"gpt-4o": ["claude-sonnet", "llama-local"]}]

# Start proxy
litellm --config config.yaml --port 4000
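
Applications then talk to the proxy exactly as they would to OpenAI, selecting models by the model_name values from config.yaml:
from openai import OpenAI

# The LiteLLM proxy speaks the OpenAI protocol on port 4000
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="anything"  # use your LiteLLM master key if one is configured
)

response = client.chat.completions.create(
    model="llama-local",  # model_name from config.yaml; routing and fallbacks apply
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)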

Hardware Requirements

GPU Memory Requirements

| Model Size | Min VRAM (FP16) | Min VRAM (INT8) | Min VRAM (INT4) |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 4 GB |
| 13B | 26 GB | 13 GB | 7 GB |
| 34B | 68 GB | 34 GB | 17 GB |
| 70B | 140 GB | 70 GB | 35 GB |

Local Development:
├── MacBook Pro M3 Max (64GB) → Up to 34B models
├── RTX 4090 (24GB)           → 7B at FP16, 13B at INT8, 34B at INT4
└── RTX 3090 (24GB)           → Same as 4090, slightly slower

Production (per-node):
├── A100 80GB                 → 70B at INT8
├── A100 40GB × 2             → 70B at INT8 with tensor parallelism
├── H100 80GB                 → 70B at FP8/INT8, fastest throughput
└── L40S 48GB                 → 13B at FP16, 34B at INT8
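
The table values follow a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4), plus headroom for the KV cache and activations. A rough estimator as a sketch (the 20% overhead factor is an assumption, not a vendor figure):
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, quant: str = "fp16",
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% for KV cache and activations."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb * overhead

print(estimate_vram_gb(70, "int4"))  # ≈ 42 GB (35 GB weights + headroom)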

Quantization for Efficiency

Run larger models on smaller GPUs with quantization:
# Using llama-cpp-python for quantized models
from llama_cpp import Llama

# Load GGUF quantized model (4-bit)
llm = Llama(
    model_path="./models/llama-3.2-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
    n_threads=8
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
)

print(response["choices"][0]["message"]["content"])

Quantization Comparison

| Quantization | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| FP16 | Baseline | None | Maximum quality |
| INT8 | 50% | Minimal | Production inference |
| INT4 (Q4_K_M) | 75% | Small | Resource-constrained |
| INT4 (Q4_0) | 75% | Moderate | Maximum efficiency |

Cloud Alternatives for Open Models

Run open-source models without managing infrastructure:

Together AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

Groq (Ultra-fast Inference)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key"
)

# Extremely fast inference (500+ tokens/sec)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain recursion."}]
)

Fireworks AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-api-key"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

Production Deployment Patterns

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    command: --config /app/config.yaml --port 4000
    depends_on:
      - ollama

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LLM_BASE_URL=http://litellm-proxy:4000/v1
    depends_on:
      - litellm-proxy

volumes:
  ollama_models:
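
The api service only needs LLM_BASE_URL to locate the proxy; a hypothetical sketch of its client setup (LLM_API_KEY is an illustrative variable, not part of the compose file above):
import os

from openai import OpenAI

# docker-compose injects LLM_BASE_URL=http://litellm-proxy:4000/v1
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:4000/v1"),
    api_key=os.environ.get("LLM_API_KEY", "anything")  # hypothetical; set if the proxy enforces keys
)

def generate(prompt: str, model: str = "llama-local") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content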

Kubernetes with GPU Support

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.2-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
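
Once the pods are running, a quick smoke test over a port-forward confirms the OpenAI-compatible endpoint is serving (in production you would expose it behind a Service or Ingress):
# First: kubectl port-forward deployment/llm-server 8000:8000
import requests

# vLLM's OpenAI-compatible server exposes /v1/models and a /health probe
print(requests.get("http://localhost:8000/v1/models").json())
print(requests.get("http://localhost:8000/health").status_code)  # 200 when ready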

Key Takeaways

Start with Ollama

For local development and prototyping, Ollama is unbeatable for ease of use.

vLLM for Production

When you need high throughput and concurrent users, vLLM’s optimizations shine.

Quantize for Efficiency

INT4 quantization roughly quarters memory needs: 34B models fit on a single 24 GB consumer GPU, and 70B models fit on a 64 GB Mac or multi-GPU workstation, with only a small quality loss.

Use LiteLLM for Flexibility

A unified interface lets you switch providers and add fallbacks easily.

What’s Next

Tool Calling

Learn how to give LLMs the ability to call functions and interact with external systems.