Open Source & Local LLMs

Why Open Source LLMs?
2025 Open Source Model Landscape
Ollama: Easiest Local LLM Setup
Installation
Basic Usage
API Usage
Custom Modelfiles
Running Multiple Models
vLLM: Production-Scale Inference
Installation
Starting the Server
Key Features
vLLM Optimizations Explained
Text Generation Inference (TGI)
Running with Docker
API Usage
LiteLLM: Unified Interface
LiteLLM Proxy Server
Hardware Requirements
GPU Memory Requirements
Recommended Setups
Quantization for Efficiency
Quantization Comparison
Cloud Alternatives for Open Models
Together AI
Groq (Ultra-fast Inference)
Fireworks AI
Production Deployment Patterns
Docker Compose Setup
Kubernetes with GPU Support
Key Takeaways
What’s Next

December 2025 Update: Covers the latest open-source models including Llama 3.3, Mistral Large 2, Qwen 2.5, and DeepSeek-V3, plus deployment options for every scale.

Why Open Source LLMs?

Open-source and local LLMs offer compelling advantages:

Benefit	Description
Privacy	Data never leaves your infrastructure
Cost	No per-token API fees after hardware costs
Customization	Fine-tune on your data
Latency	No network round-trip for local models
Compliance	Meet data residency requirements
Availability	No dependence on external services

2025 Open Source Model Landscape

Model Capabilities (December 2025)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Frontier Class (70B+ params, GPT-4 competitive):
├── Llama 3.3 70B            - Strong general-purpose model
├── Mistral Large 2 123B     - Strong reasoning
├── Qwen 2.5 72B             - Excellent multilingual
└── DeepSeek-V3 671B (MoE)   - State-of-the-art open

Mid-tier (7B-32B, GPT-3.5 competitive):
├── Llama 3.1 8B             - Fast, efficient
├── Mistral 7B               - Great for fine-tuning
├── Qwen 2.5 7B/14B/32B      - Balanced performance
└── Gemma 2 9B/27B           - Google's open models

Edge/Mobile (<7B, device-friendly):
├── Llama 3.2 1B/3B          - On-device inference
├── Phi-3.5 3.8B             - Microsoft's small model
├── Gemma 2 2B               - Minimal footprint
└── Qwen 2.5 0.5B/1.5B       - Ultra-lightweight

Ollama: Easiest Local LLM Setup

Ollama is the quickest way to get an LLM running locally: one command to install, one command to run.

Installation

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS - Download from ollama.com
# Or with Homebrew:
brew install ollama

# Windows - Download from ollama.com
# Or with winget:
winget install Ollama.Ollama

Basic Usage

# Pull and run a model
ollama run llama3.2

# Pull without running
ollama pull mistral

# List installed models
ollama list

# Set parameters from inside an interactive session
# (ollama run has no --num-ctx or --num-gpu flags)
ollama run llama3.2
>>> /set parameter num_ctx 4096

API Usage

Ollama exposes an OpenAI-compatible API:

from openai import OpenAI

# Point to local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing briefly."}
    ]
)

print(response.choices[0].message.content)
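
Generation parameters such as context length can also be set per request. The following is a minimal sketch, assuming Ollama's native `/api/chat` endpoint and its `options` field on the default port; the helper names `build_chat_payload` and `chat_native` are hypothetical and exist only for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_payload(model: str, prompt: str, num_ctx: int = 4096,
                       temperature: float = 0.7) -> dict:
    """Build a non-streaming request body for Ollama's native chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a chunk stream
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def chat_native(model: str, prompt: str, **options) -> str:
    """Send the request and return the reply text (needs a running server)."""
    body = json.dumps(build_chat_payload(model, prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["message"]["content"]
```

With a server running, `chat_native("llama3.2", "Explain quantum computing briefly.", num_ctx=8192)` would request a larger context window for that call only.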

Custom Modelfiles

Create custom model configurations:

# Modelfile
FROM llama3.2

# Set system prompt
SYSTEM """
You are a senior software engineer specializing in Python.
Provide concise, production-ready code with clear explanations.
"""

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

# Custom stop tokens
PARAMETER stop "<|endoftext|>"
PARAMETER stop "Human:"

# Create and run custom model
ollama create code-assistant -f Modelfile
ollama run code-assistant
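
If you maintain several custom models, the Modelfile can be generated instead of written by hand. This is a small sketch that renders the same directives shown above (`FROM`, `SYSTEM`, `PARAMETER`); the function name `render_modelfile` is hypothetical.

```python
def render_modelfile(base: str, system: str, parameters: dict, stops: list) -> str:
    """Render a Modelfile with a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    for stop in stops:
        # Each stop sequence gets its own PARAMETER line
        lines.append(f'PARAMETER stop "{stop}"')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.2",
    system="You are a senior software engineer specializing in Python.",
    parameters={"temperature": 0.7, "num_ctx": 8192, "top_p": 0.9},
    stops=["Human:"],
)
print(text)
```

Write the result to a file named `Modelfile` and pass it to `ollama create` as shown above.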

Running Multiple Models

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

async def query_multiple_models(prompt: str):
    """Query multiple models and compare responses"""
    models = ["llama3.2", "mistral", "qwen2.5:7b"]
    
    async def query_model(model: str):
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return model, response.choices[0].message.content
    
    results = await asyncio.gather(*[query_model(m) for m in models])
    return dict(results)

# Compare responses
responses = asyncio.run(query_multiple_models("What is machine learning?"))
for model, response in responses.items():
    print(f"\n{model}:\n{response[:200]}...")
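
In the comparison above, one missing model or slow response fails the whole `asyncio.gather` call. The sketch below isolates failures per model, so a timeout or error on one does not discard the other answers. The names `query_all` and `ask` are hypothetical; `ask` stands for any coroutine that takes a model name and returns its reply.

```python
import asyncio
from typing import Awaitable, Callable


async def query_all(
    models: list[str],
    ask: Callable[[str], Awaitable[str]],
    timeout_s: float = 60.0,
) -> dict[str, str]:
    """Run ask(model) for every model concurrently; record failures as text."""

    async def guarded(model: str) -> tuple[str, str]:
        try:
            answer = await asyncio.wait_for(ask(model), timeout=timeout_s)
            return model, answer
        except asyncio.TimeoutError:
            return model, f"[timed out after {timeout_s}s]"
        except Exception as exc:  # e.g. model not pulled, server down
            return model, f"[error: {exc}]"

    return dict(await asyncio.gather(*(guarded(m) for m in models)))
```

Here `ask` could wrap the `client.chat.completions.create` call from the example above.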

vLLM: Production-Scale Inference

vLLM provides high-throughput inference with advanced optimizations.

Installation

pip install vllm

Starting the Server

# Basic server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000

# With optimizations
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --enable-prefix-caching

Key Features

from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Concurrent requests: vLLM batches in-flight requests on the server,
# so send them in parallel rather than one at a time
prompts = [
    "Explain Python decorators",
    "What is async/await?",
    "How does garbage collection work?"
]

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    responses = list(pool.map(ask, prompts))

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

vLLM Optimizations Explained

┌─────────────────────────────────────────────────────────────┐
│                    vLLM Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PagedAttention                                              │
│  ├── Manages KV cache like virtual memory pages             │
│  └── Enables serving many concurrent requests               │
│                                                             │
│  Continuous Batching                                         │
│  ├── Dynamically adds requests to running batch             │
│  └── Maximizes GPU utilization                              │
│                                                             │
│  Tensor Parallelism                                          │
│  ├── Splits model across multiple GPUs                      │
│  └── Fits larger models; speedup is sub-linear              │
│                                                             │
│  Prefix Caching                                              │
│  ├── Reuses KV cache for shared prefixes                    │
│  └── Great for system prompts and few-shot examples         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
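
To make the PagedAttention idea concrete, here is a toy sketch of the bookkeeping: the KV cache is split into fixed-size blocks, and each sequence holds a table of block ids instead of one contiguous buffer. This is an illustration of the concept only, not vLLM's implementation; the class `PagedKVCache` and its methods are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks
        self.lengths: dict[str, int] = {}             # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")   # 3 tokens need 2 blocks of size 2
cache.append_token("request-b")       # a second request takes 1 more block
print(cache.block_tables, len(cache.free_blocks))
cache.free("request-a")               # blocks return to the pool immediately
```

Because memory is claimed one block at a time, a sequence wastes at most one partially filled block, which is what lets many requests share the GPU.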

Text Generation Inference (TGI)

Hugging Face’s production inference server.

Running with Docker

# Pull and run TGI
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192

API Usage

import requests

def query_tgi(prompt: str) -> str:
    response = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "do_sample": True
            }
        },
        timeout=60
    )
    response.raise_for_status()  # Fail loudly on HTTP errors
    return response.json()["generated_text"]

# With streaming
def stream_tgi(prompt: str):
    response = requests.post(
        "http://localhost:8080/generate_stream",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 500}},
        stream=True
    )
    
    for line in response.iter_lines():
        if line:
            yield line.decode("utf-8")
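
The streaming endpoint returns server-sent events, so each yielded line still needs parsing. The sketch below assumes each event line looks like `data:{"token": {"text": ..., "special": ...}, ...}`; check the field names against your TGI version. The function name `tokens_from_sse` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def tokens_from_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield token text from server-sent-event lines of the form 'data:{...}'."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if not token.get("special", False):  # drop end-of-sequence markers
            yield token.get("text", "")


# Hand-written sample events in the assumed format
sample = [
    'data:{"token":{"id":1,"text":"Hel","special":false}}',
    "",
    'data:{"token":{"id":2,"text":"lo","special":false}}',
    'data:{"token":{"id":3,"text":"</s>","special":true}}',
]
print("".join(tokens_from_sse(sample)))
```

With a running server, `tokens_from_sse(stream_tgi("Write a haiku"))` would yield the generated text piece by piece.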

LiteLLM: Unified Interface

LiteLLM provides a single interface for 100+ LLM providers.

from litellm import completion

# Works with any provider
def chat(prompt: str, provider: str = "ollama"):
    configs = {
        "ollama": {"model": "ollama/llama3.2", "api_base": "http://localhost:11434"},
        "openai": {"model": "gpt-4o"},
        "anthropic": {"model": "claude-3-5-sonnet-20241022"},
        "local_vllm": {"model": "openai/meta-llama/Llama-3.1-8B-Instruct", 
                       "api_base": "http://localhost:8000/v1"}
    }
    
    config = configs[provider]
    
    response = completion(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        api_base=config.get("api_base")
    )
    
    return response.choices[0].message.content

# Seamlessly switch providers
print(chat("Hello!", provider="ollama"))
print(chat("Hello!", provider="openai"))

LiteLLM Proxy Server

Run a unified proxy for all your models:

# config.yaml
model_list:
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
      
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      
  - model_name: claude-sonnet
    litellm_params:
      model: claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 2
  fallbacks: [{"gpt-4o": ["claude-sonnet", "llama-local"]}]

# Start proxy
litellm --config config.yaml --port 4000
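
The `num_retries` and `fallbacks` settings describe a simple policy: retry the requested model, then move down the fallback list. The sketch below shows that policy conceptually; it is not LiteLLM's implementation, and the names `complete_with_fallbacks` and `call` are hypothetical. `call` stands for any function that sends a prompt to a named model and raises on failure.

```python
from typing import Callable

FALLBACKS = {"gpt-4o": ["claude-sonnet", "llama-local"]}  # mirrors config.yaml


def complete_with_fallbacks(
    model: str,
    prompt: str,
    call: Callable[[str, str], str],
    num_retries: int = 2,
) -> tuple[str, str]:
    """Try `model`, then each fallback in order; return (model_used, reply)."""
    last_error = None
    for candidate in [model, *FALLBACKS.get(model, [])]:
        for _ in range(num_retries + 1):  # first attempt plus retries
            try:
                return candidate, call(candidate, prompt)
            except Exception as exc:
                last_error = exc
    raise RuntimeError(f"all models failed for {model!r}") from last_error
```

With the proxy running, clients do not need this logic themselves: they call `http://localhost:4000` with a single model name and the proxy applies the configured policy.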

Hardware Requirements

GPU Memory Requirements

Model Size	Min VRAM (FP16)	Min VRAM (INT8)	Min VRAM (INT4)
7B	14 GB	7 GB	4 GB
13B	26 GB	13 GB	7 GB
34B	68 GB	34 GB	17 GB
70B	140 GB	70 GB	35 GB
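
The table follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. These figures cover the weights only; the KV cache, activations, and framework overhead come on top. The sketch below reproduces the table (which rounds up) and adds a KV cache estimate. The layer and head counts in the example are assumptions matching a Llama-3.1-8B-style architecture.

```python
def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size: keys and values (x2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


for size in (7, 13, 34, 70):
    row = [weights_gb(size, bits) for bits in (16, 8, 4)]
    print(f"{size}B: FP16 {row[0]:.1f} GB, INT8 {row[1]:.1f} GB, INT4 {row[2]:.1f} GB")

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache
print(f"KV cache for 8192 tokens: {kv_cache_gb(8192, 32, 8, 128):.2f} GB")
```

Under these assumptions a single 8192-token sequence adds about 1 GB, so serving many concurrent long requests can need as much memory as the weights themselves.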

Recommended Setups

Local Development:
├── MacBook Pro M3 Max (64GB) → Up to 34B models (quantized)
├── RTX 4090 (24GB)           → 7B at FP16, 13B at INT8, 34B at INT4
└── RTX 3090 (24GB)           → Same as 4090, slightly slower

Production (per-node):
├── A100 80GB                 → 34B at FP16, 70B at INT8
├── A100 40GB × 2             → 70B at INT8 with tensor parallelism
├── H100 80GB                 → Same as A100 80GB, faster
└── L40S 48GB                 → 13B at FP16, 34B at INT8

Quantization for Efficiency

Run larger models on smaller GPUs with quantization:

# Using llama-cpp-python for quantized models
from llama_cpp import Llama

# Load GGUF quantized model (4-bit)
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
    n_threads=8
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker in one paragraph."}
    ]
)

print(response["choices"][0]["message"]["content"])

Quantization Comparison

Quantization	Size Reduction	Quality Loss	Use Case
FP16	Baseline	None	Maximum quality
INT8	50%	Minimal	Production inference
INT4 (Q4_K_M)	75%	Small	Resource-constrained
INT4 (Q4_0)	75%	Moderate	Maximum efficiency
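
To see where the quality loss comes from, here is a minimal sketch of symmetric INT8 quantization on a handful of weights. It is a deliberate simplification: real formats such as Q4_K_M quantize in small blocks with per-block scales. The function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map [-max, max] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # all-zero input -> scale 1
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f} (error {abs(original - approx):.4f})")
```

Each weight moves by at most half the scale, and values much smaller than the scale collapse to zero. Fewer bits mean a larger scale and therefore larger errors, which is why INT4 loses more quality than INT8.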

Cloud Alternatives for Open Models

Run open-source models without managing infrastructure:

Together AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="your-together-api-key"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

Groq (Ultra-fast Inference)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key"
)

# Very fast inference (hundreds of tokens/sec; varies by model and load)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain recursion."}]
)

Fireworks AI

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-api-key"
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
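
Since all three providers speak the OpenAI protocol, the only differences are the base URL, the API key, and the model name. The sketch below collects them in one table and reads keys from the environment instead of hardcoding them. The environment variable names and the `client_config` helper are assumptions for illustration.

```python
import os

# provider -> (base URL, assumed env var for the key, model name from above)
PROVIDERS = {
    "together": ("https://api.together.xyz/v1", "TOGETHER_API_KEY",
                 "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
    "groq": ("https://api.groq.com/openai/v1", "GROQ_API_KEY",
             "llama-3.3-70b-versatile"),
    "fireworks": ("https://api.fireworks.ai/inference/v1", "FIREWORKS_API_KEY",
                  "accounts/fireworks/models/llama-v3p3-70b-instruct"),
}


def client_config(provider: str) -> dict:
    """Return base_url, api_key, and model for a provider, or fail clearly."""
    base_url, key_var, model = PROVIDERS[provider]
    api_key = os.environ.get(key_var)
    if not api_key:
        raise RuntimeError(f"set {key_var} before using {provider}")
    return {"base_url": base_url, "api_key": api_key, "model": model}
```

A client can then be built with `cfg = client_config("groq")` followed by `OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])`, passing `cfg["model"]` to each request.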

Production Deployment Patterns

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
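    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      # Inside this config, reach Ollama at http://ollama:11434 (the service
      # name); localhost would point at the proxy container itself
      - ./litellm-config.yaml:/app/config.yaml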
    command: --config /app/config.yaml --port 4000
    depends_on:
      - ollama

  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LLM_BASE_URL=http://litellm-proxy:4000/v1
    depends_on:
      - litellm-proxy

volumes:
  ollama_models:
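
Note that `depends_on` in this form only waits for a container to start, not for the server inside it to accept requests. One option is to have the `api` service poll its dependency at startup. This is a minimal sketch; the helper names are hypothetical, and it assumes the Ollama root URL answers HTTP 200 once the server is up.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay_s: float = 2.0) -> bool:
    """Poll `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # no sleep after the final attempt
    return False
```

Inside the `api` container this could be called as `wait_until_ready(lambda: http_ok("http://ollama:11434/"))` before accepting traffic.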

Kubernetes with GPU Support

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--tensor-parallel-size"
          - "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc

Key Takeaways

Start with Ollama

For local development and prototyping, Ollama is unbeatable for ease of use.

vLLM for Production

When you need high throughput and concurrent users, vLLM’s optimizations shine.

Quantize for Efficiency

INT4 quantization cuts weight memory by about 75%: a 34B model fits on a single 24 GB consumer GPU and a 70B model (about 35 GB) fits on two, with small quality loss.

Use LiteLLM for Flexibility

A unified interface lets you switch providers and add fallbacks easily.

What’s Next

Tool Calling

Learn how to give LLMs the ability to call functions and interact with external systems
