There is a saying in ML engineering: “Getting to 95% accuracy takes a month; getting that model into production takes a year.” Training is the glamorous part. Deployment is where dreams meet reality: where you discover that your model needs 8 GB of RAM but the server has 4, that inference takes 500 ms but the SLA requires 50 ms, and that the model works perfectly on your test set but produces nonsense on slightly different real-world inputs.
Production requires:
Fast inference (users will not wait more than a few hundred milliseconds)
Minimal dependencies (your prod server should not need a full PyTorch research installation)
Reproducibility (the same input must produce the same output, every time)
Monitoring (you need to know when the model starts performing poorly before users complain)
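To make the latency requirement concrete, here is a minimal sketch of checking an inference function against a latency budget. The names `measure_latency_ms`, `summarize`, and `fake_predict` are hypothetical, and `fake_predict` is only a stand-in for a real model call; the 50 ms budget is the illustrative SLA figure from the opening paragraph.

```python
import statistics
import time


def measure_latency_ms(predict, payload, warmup=5, runs=50):
    """Call predict(payload) repeatedly and return per-call latencies in ms."""
    for _ in range(warmup):  # warm caches and lazy initialisation first
        predict(payload)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return the median and the 95th-percentile latency."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": statistics.median(latencies_ms), "p95": cuts[94]}


def fake_predict(x):
    # Hypothetical stand-in for model inference
    return sum(v * v for v in x)


latencies = measure_latency_ms(fake_predict, list(range(1000)))
stats = summarize(latencies)
meets_sla = stats["p95"] <= 50.0  # assumed 50 ms budget
```

Tail percentiles such as p95 are usually more informative than the mean here, because an SLA is broken by the slow requests rather than by the typical one.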
The honest truth: Most ML models never make it to production. The gap between a Jupyter notebook that achieves good metrics and a reliable production service is vast. The techniques in this chapter — export formats, quantization, serving frameworks, monitoring — are what separate ML engineers from ML researchers.
TorchScript converts your Python model into a serialized format that can run without a Python interpreter. This matters for two reasons: (1) the Python interpreter adds overhead to every operation, and the GIL prevents threads from running Python code in parallel, and (2) deploying a Python environment to edge devices or C++ servers is painful. TorchScript gives you a portable, optimized model file.
    import torch

    # MyModel stands for your own nn.Module subclass, defined elsewhere
    model = MyModel()
    model.eval()  # CRITICAL: set to eval mode BEFORE tracing/scripting

    # Option 1: Tracing -- runs the model with example input and records
    # every operation. Fast and reliable, but cannot capture data-dependent
    # control flow (if/else that depends on input values).
    example_input = torch.randn(1, 3, 224, 224)
    traced_model = torch.jit.trace(model, example_input)

    # Option 2: Scripting -- analyzes the Python source code and compiles it.
    # Handles control flow (if/else, loops, variable-length sequences)
    # but has stricter requirements on what Python constructs are supported.
    scripted_model = torch.jit.script(model)

    # Save -- the resulting file is self-contained, no Python needed to load it
    traced_model.save("model_traced.pt")

    # Load in Python (or C++) -- no model class definition needed!
    loaded = torch.jit.load("model_traced.pt")
    output = loaded(example_input)
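Use tracing for straightforward models (CNNs, fixed-architecture Transformers). Use scripting for models with data-dependent control flow (if/else, loops over variable-length sequences). When in doubt, try tracing first, since it places fewer restrictions on your Python code, but compare the traced model's outputs against the original on several different inputs. Tracing records only the path taken for the example input, so a missed branch shows up as a TracerWarning at best and as silently wrong outputs at worst.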
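The difference between the two export modes is easiest to see on a toy module whose branch depends on the input values. The sketch below is illustrative only: `Gate` is a hypothetical module, not part of any library, and the warning filter is there because tracing is expected to emit a TracerWarning for this model.

```python
import warnings

import torch
import torch.nn as nn


class Gate(nn.Module):
    """Hypothetical module whose branch depends on the input values."""

    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return -x


model = Gate().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected TracerWarning
    traced = torch.jit.trace(model, positive)  # records only the `x * 2` path
scripted = torch.jit.script(model)  # compiles both branches

# Compare each exported model with eager execution on an input
# that takes the branch the trace never saw.
traced_ok = torch.equal(traced(negative), model(negative))
scripted_ok = torch.equal(scripted(negative), model(negative))
```

Under these assumptions the traced model disagrees with the eager model on `negative`, while the scripted model agrees. A comparison like this on a handful of varied inputs is a cheap check to run before shipping a traced model.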
Pitfall: forgetting model.eval(). If you trace a model in training mode, BatchNorm and Dropout are recorded with their training behavior (using batch statistics, dropping activations). A scripted model keeps its training flag instead, so it has the same problem unless eval() is called again after loading. Either way, this is a silent bug that produces inconsistent and degraded inference results. Always call model.eval() before export.
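A small experiment shows why the mode matters, followed by one possible guard against the mistake. This is a sketch under stated assumptions: the two-layer `net` is a made-up example, and `export_traced` is a hypothetical helper, not a PyTorch API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(4, 8)

net.train()
# Each call samples a new dropout mask, so two passes disagree
train_same = torch.equal(net(x), net(x))

net.eval()
with torch.no_grad():
    # Dropout is a no-op in eval mode, so two passes agree exactly
    eval_same = torch.equal(net(x), net(x))


def export_traced(model, example_input, path):
    """Hypothetical helper: refuse to export a model left in training mode."""
    if model.training:
        raise RuntimeError("call model.eval() before exporting")
    traced = torch.jit.trace(model, example_input)
    traced.save(path)
    return traced
```

Failing loudly at export time is a design choice: it turns a silent accuracy bug into an error that surfaces during the build.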
Quantization converts model weights (and optionally activations) from 32-bit floats to 8-bit integers. This is not just about storage: on hardware with native INT8 support, INT8 arithmetic is typically 2-4x faster than FP32, and the weights take 4x less memory. The key question is always: how much accuracy do you lose? For most well-trained models, the answer is surprisingly little (often 0-1%), but measure it on your own validation set.
Reduce precision for faster inference:
    import torch
    import torch.quantization as quant

    # Post-training quantization
    model.eval()
    model_fp32 = model

    # Dynamic quantization (weights quantized ahead of time,
    # activations quantized on the fly at inference)
    model_int8 = torch.quantization.quantize_dynamic(
        model_fp32,
        {torch.nn.Linear},  # Layers to quantize
        dtype=torch.qint8,
    )

    # Static quantization (weights + activations)
    # Note: in eager mode the model must wrap its forward pass in
    # QuantStub/DeQuantStub for the converted model to run.
    model.qconfig = quant.get_default_qconfig('fbgemm')  # x86 backend
    model_prepared = quant.prepare(model)

    # Calibrate with representative data
    with torch.no_grad():
        for batch in calibration_loader:
            model_prepared(batch)

    model_quantized = quant.convert(model_prepared)
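The arithmetic behind the 32-bit to 8-bit conversion can be sketched in plain Python. This is an illustration of affine (scale and zero-point) quantization, not PyTorch's actual implementation; the functions `quantize` and `dequantize` and the sample weights are made up for the example, and an unsigned 8-bit range is assumed for simplicity.

```python
def quantize(values, num_bits=8):
    """Affine quantization of a list of floats to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0.0 so that zero maps exactly
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in one byte instead of four, which is where the 4x memory saving comes from. The price is a rounding error of about half a quantization step (`scale`) per value, which is why accuracy should be re-measured after quantizing.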
Most neural networks are overparameterized: a large fraction of weights are near zero and contribute very little to the output. Pruning removes these unimportant weights, creating a sparse model that is smaller and (on hardware that supports sparse computation) faster. The “Lottery Ticket Hypothesis” (Frankle and Carbin, 2018) showed that dense networks contain sparse sub-networks that, when trained in isolation from the same initialization, can match the full network’s performance.
Remove unimportant weights:
    import torch
    import torch.nn.utils.prune as prune

    # Prune 30% of weights in linear layers
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=0.3)

    # Make pruning permanent
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.remove(module, 'weight')
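It is worth verifying what pruning actually did to a layer. The following sketch applies the same two calls to a single, arbitrarily sized `nn.Linear` layer (the 20x10 shape is an assumption for illustration) and measures the resulting sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(20, 10)  # 200 weights
total = layer.weight.numel()

prune.l1_unstructured(layer, name="weight", amount=0.3)
# While pruning is active, the module carries a binary mask
# alongside the original weights.
has_mask = hasattr(layer, "weight_mask")

prune.remove(layer, "weight")  # fold the mask into the weight tensor
mask_removed = not hasattr(layer, "weight_mask")

# Fraction of weights that are now exactly zero
sparsity = (layer.weight == 0).float().mean().item()
still_dense = layer.weight.numel() == total
```

Note that the pruned weights are stored as explicit zeros in a dense tensor, so the saved file does not shrink on its own. The size and speed benefits appear only with sparse storage, compression, or a runtime that exploits the zeros.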
FastAPI is a popular choice for serving ML models as REST APIs. It is async-native, auto-generates OpenAPI docs, and handles concurrent requests well. The key principle: load the model once at startup (not per request) and keep it in memory.
    from fastapi import FastAPI, File, UploadFile
    from PIL import Image
    import torch
    import torchvision.transforms as T
    import io

    app = FastAPI()

    # Load model ONCE at startup -- not inside the request handler.
    # Model loading is expensive (seconds); inference is cheap (milliseconds).
    model = torch.jit.load("model_traced.pt")
    model.eval()

    # Use the same preprocessing as during training -- mismatched
    # normalization is one of the most common deployment bugs.
    transform = T.Compose([
        T.Resize(256),      # Resize shortest side to 256
        T.CenterCrop(224),  # Crop center 224x224 (deterministic, unlike RandomCrop)
        T.ToTensor(),       # Convert to tensor, scales to [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),  # ImageNet stats
    ])

    @app.post("/predict")
    async def predict(file: UploadFile = File(...)):
        # Read image bytes from the uploaded file
        image_bytes = await file.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")  # Force RGB

        # Preprocess -- same pipeline used during training
        input_tensor = transform(image).unsqueeze(0)  # Add batch dimension

        # Inference -- torch.no_grad() disables gradient tracking,
        # reducing memory usage and speeding up computation
        with torch.no_grad():
            output = model(input_tensor)
            probs = torch.softmax(output, dim=1)
            pred_class = probs.argmax(dim=1).item()
            confidence = probs[0, pred_class].item()

        return {
            "class": pred_class,
            "confidence": confidence,
        }
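Pitfall: thread safety with GPU models. If you serve a GPU model with multiple workers, each worker process loads its own copy of the model onto the GPU, and concurrent requests can exhaust GPU memory or cause CUDA errors. Either (1) use a single worker with async I/O, (2) use a request queue with a dedicated inference thread, or (3) use a proper model serving framework like Triton that handles batching and concurrency correctly. For CPU-only models, FastAPI’s default async handling works fine at low traffic, but keep in mind that the model call itself is blocking, so a slow inference stalls the event loop for every other request.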
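Option (2), a request queue with a dedicated inference thread, can be sketched with the standard library alone. Everything here is illustrative: `InferenceWorker` is a hypothetical class, and `fake_model` stands in for the real `model(input_tensor)` call.

```python
import queue
import threading
from concurrent.futures import Future


class InferenceWorker:
    """Runs all model calls on one dedicated thread, fed by a queue."""

    def __init__(self, model_fn):
        self._model_fn = model_fn
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def submit(self, inputs):
        """Enqueue a request; the Future resolves to the model output."""
        future = Future()
        self._requests.put((inputs, future))
        return future

    def _loop(self):
        while True:
            item = self._requests.get()
            if item is None:  # sentinel: shut down
                break
            inputs, future = item
            try:
                future.set_result(self._model_fn(inputs))
            except Exception as exc:  # surface errors to the caller
                future.set_exception(exc)

    def close(self):
        self._requests.put(None)
        self._thread.join()


def fake_model(x):
    # Hypothetical stand-in for model(input_tensor)
    return [v * 2 for v in x]


worker = InferenceWorker(fake_model)
futures = [worker.submit([i, i + 1]) for i in range(3)]
results = [f.result(timeout=5) for f in futures]
worker.close()
```

Because only one thread ever touches the model, requests are serialized no matter how many handlers submit work. Inside an async handler, the returned future could be awaited with `asyncio.wrap_future(worker.submit(x))` so the event loop stays free while inference runs.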
    import tensorflow as tf

    # Convert ONNX to TF SavedModel first (using onnx-tf)
    # Then convert to TFLite
    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)