Model Deployment


From Research to Production

There is a saying in ML engineering: “Getting to 95% accuracy takes a month; getting that model into production takes a year.” Training is the glamorous part. Deployment is where dreams meet reality — where you discover that your model needs 8 GB of RAM but the server has 4, that inference takes 500ms but the SLA requires 50ms, and that the model works perfectly on your test set but produces nonsense on slightly different real-world inputs. Production requires:
  • Fast inference (users will not wait more than a few hundred milliseconds)
  • Minimal dependencies (your prod server should not need a full PyTorch research installation)
  • Reproducibility (the same input must produce the same output, every time)
  • Monitoring (you need to know when the model starts performing poorly before users complain)
The honest truth: Most ML models never make it to production. The gap between a Jupyter notebook that achieves good metrics and a reliable production service is vast. The techniques in this chapter — export formats, quantization, serving frameworks, monitoring — are what separate ML engineers from ML researchers.

Export Formats

TorchScript (JIT Compilation)

TorchScript converts your Python model into a serialized format that can run without a Python interpreter. This matters for two reasons: (1) the Python interpreter adds overhead to every operation, and the GIL prevents threads from executing Python code in parallel, and (2) deploying a Python environment to edge devices or C++ servers is painful. TorchScript gives you a portable, optimized model file.
import torch

model = MyModel()
model.eval()  # CRITICAL: set to eval mode BEFORE tracing/scripting

# Option 1: Tracing -- runs the model with example input and records
# every operation. Fast and reliable, but cannot capture data-dependent
# control flow (if/else that depends on input values).
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Option 2: Scripting -- analyzes the Python source code and compiles it.
# Handles control flow (if/else, loops, variable-length sequences)
# but has stricter requirements on what Python constructs are supported.
scripted_model = torch.jit.script(model)

# Save -- the resulting file is self-contained, no Python needed to load it
traced_model.save("model_traced.pt")

# Load in Python (or C++) -- no model class definition needed!
loaded = torch.jit.load("model_traced.pt")
output = loaded(example_input)
Use tracing for straightforward models (CNNs, fixed-architecture Transformers). Use scripting for models with data-dependent control flow (if/else, loops over variable-length sequences). When in doubt, try tracing first — it is simpler and produces more optimized code.
Pitfall: forgetting model.eval(). If you trace a model in training mode, BatchNorm and Dropout are baked in with their training behavior (using batch statistics, dropping activations). A scripted model keeps its training flag, so it is saved in whatever mode it was in at export time. Either way the result is a silent bug that produces inconsistent and degraded inference results. Always call model.eval() before export.
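To see why tracing cannot capture data-dependent control flow, here is a toy record-and-replay tracer in plain Python. It only illustrates the idea and is not how torch.jit.trace is implemented; Recorder, toy_model, and toy_trace are hypothetical names made up for this sketch.

```python
class Recorder:
    """Toy stand-in for a tensor: records each arithmetic op applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Recorder(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Recorder(self.value + k, self.tape)

    def __gt__(self, k):
        # The comparison returns a plain bool, so the branch is decided
        # during tracing and never appears on the tape.
        return self.value > k


def toy_model(x):
    if x > 0:  # data-dependent control flow
        return x * 2
    return x + 100


def toy_trace(fn, example):
    """Run fn once on the example and return a function that replays the tape."""
    tape = []
    fn(Recorder(example, tape))

    def replay(value):
        for op, k in tape:
            value = value * k if op == "mul" else value + k
        return value

    return replay


traced = toy_trace(toy_model, 1.0)  # example input takes the x > 0 branch
print(traced(3.0), toy_model(3.0))    # both follow the recorded branch
print(traced(-3.0), toy_model(-3.0))  # the traced version still multiplies by 2
```

The replayed function agrees with the original only for inputs that take the same branch as the example input, which is the failure mode scripting is meant to avoid.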

ONNX Export

import torch.onnx

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
Verify the export:
import onnx
import onnxruntime as ort

# Check model
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Run inference
session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": dummy_input.numpy()})
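The checker validates the graph structure, but it does not tell you whether the exported model computes the same numbers as the original. A minimal sketch of that comparison in plain Python follows; outputs_match is a hypothetical helper, the tolerances are assumptions to tune per model, and the sample values stand in for flattened outputs of the PyTorch model and the ONNX Runtime session.

```python
import math


def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare two flat sequences of floats.

    Returns (ok, worst_abs_diff). An element passes when
    |ref - cand| <= atol + rtol * |ref|.
    """
    if len(reference) != len(candidate):
        return False, math.inf
    ok = True
    worst = 0.0
    for ref, cand in zip(reference, candidate):
        diff = abs(ref - cand)
        worst = max(worst, diff)
        if diff > atol + rtol * abs(ref):
            ok = False
    return ok, worst


# Hypothetical values standing in for the two models' flattened outputs.
torch_out = [0.1200, -1.5000, 3.2500]
onnx_out = [0.1201, -1.5002, 3.2499]

ok, worst = outputs_match(torch_out, onnx_out)
print("match:", ok, "worst abs diff:", worst)
```

Small differences are expected because the two runtimes may fuse or reorder floating-point operations, so an exact equality check would be too strict.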

Model Optimization

Quantization (INT8)

Quantization converts model weights (and optionally activations) from 32-bit floats to 8-bit integers. This is not just about storage: INT8 arithmetic is typically 2-4x faster than FP32 on hardware with INT8 support, and the weights take 4x less memory. The key question is always how much accuracy you lose. For most well-trained models, the answer is surprisingly little (0-1%). The example shows two post-training approaches:
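import torch
import torch.ao.quantization as quant  # torch.quantization is the older alias

# Post-training quantization
model.eval()
model_fp32 = model

# Dynamic quantization (weights quantized ahead of time,
# activations quantized on the fly at inference)
model_int8 = quant.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},  # Layers to quantize
    dtype=torch.qint8
)

# Static quantization (weights + activations)
# Eager-mode static quantization expects the model's forward pass to be
# wrapped in QuantStub/DeQuantStub so tensors are converted at the boundaries.
model.qconfig = quant.get_default_qconfig('fbgemm')  # x86 server backend
model_prepared = quant.prepare(model)  # Inserts observers

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_quantized = quant.convert(model_prepared)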
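To make the arithmetic behind INT8 quantization concrete, here is a small sketch of affine quantization on a list of floats. It is a simplified illustration under stated assumptions (per-tensor scale, signed 8-bit range, made-up weights), not the code PyTorch runs internally; choose_qparams, quantize, and dequantize are hypothetical helper names.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that map [min, max] onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # the range must contain 0.0 so zero stays exact
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integer codes, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up example weights
scale, zero_point = choose_qparams(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", max_error, "scale:", scale)
```

The round-trip error is bounded by roughly one quantization step (the scale), which is why accuracy usually drops only slightly when the weight range is not dominated by outliers.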

Model Size Comparison

Precision | Model Size | Speed | Accuracy Drop
FP32 | 100 MB | 1x | Baseline
FP16 | 50 MB | 1.5-2x | ~0%
INT8 | 25 MB | 2-4x | 0-1%
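The sizes in the comparison follow directly from bytes per parameter. A quick check, assuming a hypothetical model with 25 million parameters and counting weights only (no file-format overhead):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_mb(num_params, precision):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


num_params = 25_000_000  # assumed parameter count for illustration
sizes = {p: model_size_mb(num_params, p) for p in BYTES_PER_PARAM}
print(sizes)
```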

Pruning

Most neural networks are overparameterized: a large fraction of weights are near zero and contribute very little to the output. Pruning removes these unimportant weights, creating a sparse model that is smaller and (on hardware that supports sparse computation) faster. The “Lottery Ticket Hypothesis” (Frankle and Carbin, 2018) showed that dense networks contain sparse sub-networks that, when trained in isolation from the same initialization, match the full network’s performance. The example zeroes the 30% of weights with the smallest magnitude in each linear layer:
import torch
import torch.nn.utils.prune as prune

# Prune 30% of weights in linear layers
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)

# Make pruning permanent (drops the mask and keeps the zeroed weights;
# the tensor is still stored densely, so the file does not shrink by itself)
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')
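What L1 unstructured pruning does to a single weight tensor can be sketched in a few lines of plain Python. This is a toy version on a flat list with made-up weights, not the library's implementation; l1_unstructured_prune is a hypothetical name.

```python
def l1_unstructured_prune(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest magnitude."""
    n_prune = round(amount * len(weights))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(by_magnitude[:n_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.9, 0.1, -0.2]
pruned = l1_unstructured_prune(weights, amount=0.3)
sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)

print(pruned)
print("sparsity:", sparsity)
```

Measuring sparsity this way after pruning is a cheap sanity check that the requested fraction of weights was actually removed.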

Serving with FastAPI

FastAPI is a popular choice for serving ML models as REST APIs. It is async-native, auto-generates OpenAPI docs, and handles concurrent requests well. The key principle: load the model once at startup (not per request) and keep it in memory.
from fastapi import FastAPI, File, UploadFile
from PIL import Image
import torch
import torchvision.transforms as T
import io

app = FastAPI()

# Load model ONCE at startup -- not inside the request handler.
# Model loading is expensive (seconds); inference is cheap (milliseconds).
model = torch.jit.load("model_traced.pt")
model.eval()

# Use the same preprocessing as during training -- mismatched
# normalization is one of the most common deployment bugs.
transform = T.Compose([
    T.Resize(256),         # Resize shortest side to 256
    T.CenterCrop(224),     # Crop center 224x224 (deterministic, unlike RandomCrop)
    T.ToTensor(),          # Convert to tensor, scales to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet stats
])

@app.post("/predict")
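def predict(file: UploadFile = File(...)):
    # Plain def: FastAPI runs this handler in its thread pool, so the
    # blocking inference call below does not stall the event loop.
    image_bytes = file.file.read()  # Read image bytes from the uploaded file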
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")  # Force RGB
    
    # Preprocess -- same pipeline used during training
    input_tensor = transform(image).unsqueeze(0)  # Add batch dimension
    
    # Inference -- torch.no_grad() disables gradient tracking,
    # reducing memory usage and speeding up computation
    with torch.no_grad():
        output = model(input_tensor)
        probs = torch.softmax(output, dim=1)
        pred_class = probs.argmax(dim=1).item()
        confidence = probs[0, pred_class].item()
    
    return {
        "class": pred_class,
        "confidence": confidence,
    }
Pitfall: thread safety with GPU models. If you serve a GPU model with multiple async workers, concurrent requests can cause CUDA errors. Either (1) use a single worker with async I/O, (2) use a request queue with a dedicated inference thread, or (3) use a proper model serving framework like Triton that handles batching and concurrency correctly. For CPU-only models, keep in mind that inference is a blocking call: run it from a plain def handler (which FastAPI executes in a thread pool) or hand it to a worker thread, because a blocking call inside an async def handler stalls the event loop for every other request.
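Option (2), a request queue with a dedicated inference thread, could look roughly like the sketch below. It uses only the standard library; InferenceWorker and fake_predict are hypothetical names, and fake_predict stands in for the real model call.

```python
import queue
import threading
from concurrent.futures import Future


class InferenceWorker:
    """Runs predict_fn on one dedicated thread; callers get a Future back."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item):
        future = Future()
        self._requests.put((item, future))
        return future

    def close(self):
        self._requests.put(None)  # sentinel tells the worker loop to exit
        self._thread.join()

    def _run(self):
        while True:
            request = self._requests.get()
            if request is None:
                break
            item, future = request
            try:
                future.set_result(self._predict_fn(item))
            except Exception as exc:  # hand the failure back to the caller
                future.set_exception(exc)


seen_threads = set()


def fake_predict(x):
    # Stand-in for model inference; records which thread ran the call.
    seen_threads.add(threading.get_ident())
    return x * 2


worker = InferenceWorker(fake_predict)
futures = [worker.submit(i) for i in range(5)]
results = [f.result(timeout=5) for f in futures]
worker.close()
print(results, "threads used:", len(seen_threads))
```

Inside an async handler, the returned future can be awaited with asyncio.wrap_future(worker.submit(input_tensor)), so the event loop stays free while the single worker thread owns the model.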

GPU Serving with Triton

# config.pbtxt
name: "image_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}
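The idea behind dynamic batching is to hold a request briefly in the hope that others arrive, then run them through the model together. The sketch below shows that collection loop in plain Python. It is a simplified illustration, not Triton's scheduler; collect_batch and its parameters are hypothetical and only loosely mirror max_batch_size and max_queue_delay_microseconds.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_queue_delay_s):
    """Block for one request, then gather more until the batch is full
    or the delay budget is spent."""
    batch = [requests.get()]
    deadline = time.monotonic() + max_queue_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

first = collect_batch(pending, max_batch_size=3, max_queue_delay_s=0.05)
second = collect_batch(pending, max_batch_size=3, max_queue_delay_s=0.05)
print(first, second)
```

A longer delay tends to produce fuller batches and better throughput at the cost of added latency per request, which is the trade-off the delay setting controls.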

Docker Deployment

FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model_traced.pt .
COPY app.py .

# Expose port
EXPOSE 8000

# Run server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Build and run the image:
docker build -t model-server .
docker run -p 8000:8000 model-server

Edge Deployment

ONNX Runtime Mobile

# Export for mobile
torch.onnx.export(
    model,
    dummy_input,
    "model_mobile.onnx",
    opset_version=13,  # Compatible with mobile
    do_constant_folding=True,
)

# Optimize for mobile
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model_mobile.onnx",
    "model_mobile_quantized.onnx",
    weight_type=QuantType.QUInt8,
)

TensorFlow Lite Conversion

import tensorflow as tf

# Convert ONNX to TF SavedModel first (using onnx-tf)
# Then convert to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Monitoring in Production

import time
from fastapi import FastAPI, File, UploadFile
from prometheus_client import Counter, Histogram, start_http_server

app = FastAPI()  # In a real service, reuse the app that defines /predict

# Metrics
PREDICTIONS = Counter('predictions_total', 'Total predictions')
PREDICTION_LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
def predict(file: UploadFile = File(...)):
    start_time = time.perf_counter()  # Monotonic clock, suited to measuring durations
    
    # ... inference code ...
    
    # Record metrics
    PREDICTIONS.inc()
    PREDICTION_LATENCY.observe(time.perf_counter() - start_time)
    
    return result

# Start metrics server (Prometheus scrapes this port)
start_http_server(8001)
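Averages hide slow requests, so latency is usually judged by percentiles. The sketch below summarizes recorded latencies offline and checks them against a budget. The sample data is synthetic, latency_summary is a hypothetical helper, and the 50 ms budget is taken from the SLA example at the start of this chapter.

```python
import statistics


def latency_summary(latencies_ms):
    """Return p50/p95/p99 of the recorded latencies (inclusive method)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic measurements: 1 ms, 2 ms, ..., 101 ms
samples = [float(ms) for ms in range(1, 102)]
summary = latency_summary(samples)

SLA_MS = 50.0  # assumed latency budget
within_sla = summary["p99"] <= SLA_MS
print(summary, "within SLA:", within_sla)
```

In a running service the same percentiles would normally be derived from the Prometheus histogram rather than computed in the application.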

Deployment Checklist

Step | Check
Model export | ✅ TorchScript/ONNX exports correctly
Validation | ✅ Output matches original model
Optimization | ✅ Quantization/pruning applied
Batching | ✅ Dynamic batching configured
Monitoring | ✅ Latency/throughput metrics
Error handling | ✅ Graceful failure modes
Scaling | ✅ Horizontal scaling ready
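For the error-handling item, one graceful failure mode is to reject bad uploads before they reach the model. A minimal sketch, assuming the service accepts only JPEG and PNG and caps uploads at 5 MB; sniff_image_format and the limit are hypothetical choices for illustration.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed upload limit

# Leading bytes that identify each supported file format
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def sniff_image_format(data, max_bytes=MAX_UPLOAD_BYTES):
    """Return 'jpeg' or 'png', or raise ValueError with a client-facing reason."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError("upload exceeds size limit")
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    raise ValueError("unsupported image format")


jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 16
png_like = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
print(sniff_image_format(jpeg_like), sniff_image_format(png_like))

try:
    sniff_image_format(b"not an image")
except ValueError as exc:
    rejection = str(exc)
    print("rejected:", rejection)
```

In the FastAPI handler, the ValueError can be caught and turned into an HTTPException with a 4xx status, so clients get a clear error instead of a 500 from a failed image decode.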

Exercises

Export a ResNet model to both TorchScript and ONNX. Compare inference speeds.
Apply INT8 quantization to a model. Measure size reduction and accuracy change.
Build a complete image classification API with proper error handling and documentation.
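For the speed comparisons in the exercises, a small timing harness helps keep measurements fair. The sketch below is one possible shape, using only the standard library; benchmark and fake_inference are hypothetical names, and fake_inference stands in for a call to the exported model.

```python
import statistics
import time


def benchmark(fn, warmup=5, iters=50):
    """Time fn() after a few warmup calls and return latency stats in ms."""
    for _ in range(warmup):
        fn()  # warmup runs are not timed (caches, lazy initialization)
    timings_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


counter = {"calls": 0}


def fake_inference():
    # Stand-in workload; replace with a call to the model under test.
    counter["calls"] += 1
    return sum(i * i for i in range(10_000))


stats = benchmark(fake_inference, warmup=3, iters=20)
print(stats)
```

When timing GPU models, call torch.cuda.synchronize() before reading the clock, since CUDA kernels run asynchronously and the timer would otherwise stop before the work finishes.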

What’s Next

Module 22: Debugging Deep Learning

Tools and techniques for diagnosing training issues and model failures.