Build a complete deep learning system that solves a real-world problem. This capstone is where theory meets reality. The gap between “I can train a model on CIFAR-10” and “I can build and deploy a production ML system” is enormous, and this project is designed to bridge it. You will encounter messy data, unclear requirements, training failures, and deployment headaches, and that is exactly the point.

The phases mirror what you would do on an actual engineering team:

1. Problem definition and data collection: the most underrated phase; get this wrong and nothing else matters
2. Model architecture design: choose and customize based on your constraints
3. Training pipeline: reproducible, logged, and robust to failures
4. Evaluation and iteration: go beyond accuracy; understand where and why your model fails
5. Deployment: a model in a notebook helps no one; ship it
## Project: [Your Project Name]

### Problem Statement

What problem are you solving? Who benefits?

### Success Criteria

- Metric 1: Accuracy > 90%
- Metric 2: Inference time < 100ms
- Metric 3: Model size < 100MB

### Constraints

- Hardware: Single GPU
- Data: Public dataset + custom samples
- Timeline: 2 weeks
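One way to keep the success criteria honest is to encode them as an automated check that runs after each evaluation. The sketch below is only an illustration: `SuccessCriteria` and `check_criteria` are hypothetical names, and the default thresholds simply mirror the example values in the template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Thresholds copied from the example template (assumed values)."""
    min_accuracy: float = 0.90        # Metric 1: accuracy > 90%
    max_latency_ms: float = 100.0     # Metric 2: inference time < 100 ms
    max_model_size_mb: float = 100.0  # Metric 3: model size < 100 MB


def check_criteria(criteria, accuracy, latency_ms, model_size_mb):
    """Return a pass/fail flag per criterion for one set of measurements."""
    return {
        "accuracy": accuracy > criteria.min_accuracy,
        "latency": latency_ms < criteria.max_latency_ms,
        "model_size": model_size_mb < criteria.max_model_size_mb,
    }


# Example: accurate and small enough, but too slow
result = check_criteria(SuccessCriteria(), accuracy=0.93, latency_ms=120.0, model_size_mb=45.0)
print(result)
```

Running a check like this in the evaluation script makes it obvious when a change improves one metric at the cost of another.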
A solid data pipeline is the foundation of any successful ML project. The most common source of bugs and wasted training time is not the model — it is incorrect data loading, wrong augmentations, or label mismatches. Invest time here to save yourself days of debugging later.
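A cheap way to catch label mismatches before any training run is to validate the annotation file itself. The following is a minimal sketch that assumes a CSV with `filename` and `label` columns; `load_rows` and `check_annotations` are hypothetical helper names, not part of any library.

```python
import csv
from collections import Counter


def load_rows(csv_file):
    """Read the annotation CSV into a list of dicts (one per image)."""
    with open(csv_file, newline="") as f:
        return list(csv.DictReader(f))


def check_annotations(rows, available_files, expected_labels=None):
    """Return a report of common annotation problems.

    rows: dicts with "filename" and "label" keys (assumed column names).
    available_files: set of filenames that actually exist on disk.
    expected_labels: optional set of valid label values.
    """
    filenames = [row["filename"] for row in rows]
    counts = Counter(filenames)
    report = {
        # Rows that point at images that are not on disk
        "missing_files": sorted(set(filenames) - set(available_files)),
        # The same image listed more than once (possible conflicting labels)
        "duplicate_files": sorted(name for name, n in counts.items() if n > 1),
        # Class balance at a glance
        "class_counts": dict(Counter(row["label"] for row in rows)),
        "unknown_labels": [],
    }
    if expected_labels is not None:
        # Typos and stale class names show up here
        report["unknown_labels"] = sorted(
            {row["label"] for row in rows} - set(expected_labels)
        )
    return report
```

In a project this could be called as `check_annotations(load_rows("train.csv"), {p.name for p in Path(train_dir).iterdir()})` and the report printed or asserted on before the first epoch.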
    from pathlib import Path

    import pandas as pd
    import torch
    from PIL import Image
    from torch.utils.data import Dataset, DataLoader
    from torchvision import transforms


    class CustomDataset(Dataset):
        """Custom dataset for your project.

        Best practices embedded in this implementation:
        - CSV-based annotation for easy inspection and versioning
        - Lazy loading (images loaded on-demand, not all into memory)
        - Configurable transforms for train vs. validation
        """

        def __init__(self, data_dir, csv_file, transform=None):
            self.data_dir = Path(data_dir)
            self.annotations = pd.read_csv(csv_file)
            # Fall back to a plain tensor conversion when no transform is given
            self.transform = transform or transforms.ToTensor()

        def __len__(self):
            return len(self.annotations)

        def __getitem__(self, idx):
            row = self.annotations.iloc[idx]

            # Load image lazily, only when this sample is requested
            image_path = self.data_dir / row["filename"]
            image = Image.open(image_path).convert("RGB")

            # self.transform is always set (see __init__), so apply it directly
            image = self.transform(image)

            label = row["label"]
            return image, label


    def create_dataloaders(train_dir, val_dir, batch_size=32):
        """Create train and validation dataloaders.

        IMPORTANT: Training and validation use DIFFERENT transforms.
        - Training: random augmentation to improve generalization
        - Validation: deterministic center crop for consistent evaluation

        Never apply random augmentation to validation data -- your metrics
        would fluctuate between runs, making comparison impossible.
        """
        train_transform = transforms.Compose([
            transforms.RandomResizedCrop(224),       # Random crop forces spatial invariance
            transforms.RandomHorizontalFlip(),       # Most natural images are flip-invariant
            transforms.ColorJitter(0.2, 0.2, 0.2),   # Robustness to lighting changes
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # ImageNet stats
        ])

        val_transform = transforms.Compose([
            transforms.Resize(256),       # Resize shorter edge to 256
            transforms.CenterCrop(224),   # Deterministic center crop
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

        # Annotation CSVs are read from the current working directory
        train_dataset = CustomDataset(train_dir, "train.csv", train_transform)
        val_dataset = CustomDataset(val_dir, "val.csv", val_transform)

        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

        return train_loader, val_loader
Start with transfer learning — it is almost always the right first move. Training from scratch requires massive datasets and compute budgets that most projects do not have. A pretrained EfficientNet backbone already understands edges, textures, shapes, and objects from ImageNet. Your job is to add a task-specific head and fine-tune.
    import torch
    import torch.nn as nn
    import timm


    class ProjectModel(nn.Module):
        """Your project model.

        Strategy: Pretrained backbone + custom classifier head.
        Phase 1: Freeze backbone, train only the head (fast convergence)
        Phase 2: Unfreeze backbone, fine-tune end-to-end (squeeze out last few %)
        """

        def __init__(self, num_classes, pretrained=True):
            super().__init__()

            # Use pretrained backbone -- timm gives you 700+ architectures
            # EfficientNet-B0 is a great starting point: strong accuracy, reasonable size
            self.backbone = timm.create_model(
                "efficientnet_b0",
                pretrained=pretrained,
                num_classes=0,  # Remove classifier to use as feature extractor
            )

            # Custom classifier -- keep it simple to start
            # Dropout prevents overfitting when fine-tuning with limited data
            self.classifier = nn.Sequential(
                nn.Dropout(0.2),
                nn.Linear(self.backbone.num_features, 256),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(256, num_classes),
            )

        def forward(self, x):
            features = self.backbone(x)
            return self.classifier(features)

        def freeze_backbone(self):
            for param in self.backbone.parameters():
                param.requires_grad = False

        def unfreeze_backbone(self):
            for param in self.backbone.parameters():
                param.requires_grad = True
Deployment is where many projects stall. The key insight: keep it simple. A FastAPI endpoint with a Docker container gets you 90% of the way to production. You can optimize later with batching, model servers (Triton, TorchServe), or serverless deployment — but ship the simple version first.
Common deployment pitfall: loading the model inside the request handler. Every request then re-loads the model weights from disk, which adds seconds of latency. Always load the model once at startup and reuse it across requests. Also remember to call model.eval(); forgetting this leaves dropout and batch norm in training mode, causing inconsistent predictions.
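The difference between the two loading patterns can be shown without any web framework. The sketch below is framework-agnostic and uses a stand-in `FakeModel` (a hypothetical class that only simulates slow weight loading); `InferenceService` and `handle_request_slow` are likewise illustrative names, not a FastAPI API.

```python
import time


class FakeModel:
    """Stand-in for a real network; counts how often it is constructed."""
    load_count = 0

    def __init__(self):
        time.sleep(0.01)          # pretend to read weights from disk
        FakeModel.load_count += 1
        self.training = True      # mimics the default training mode

    def eval(self):
        self.training = False     # switch to inference behaviour
        return self

    def predict(self, x):
        return sum(x) / len(x)    # placeholder "prediction"


def handle_request_slow(x):
    """Anti-pattern: the model is re-loaded on every request."""
    model = FakeModel().eval()
    return model.predict(x)


class InferenceService:
    """Preferred: load once at startup, reuse for every request."""

    def __init__(self):
        self.model = FakeModel().eval()

    def handle_request(self, x):
        return self.model.predict(x)


service = InferenceService()  # startup: one load
for _ in range(3):
    service.handle_request([1.0, 2.0, 3.0])
print("loads after 3 requests:", FakeModel.load_count)
```

In a FastAPI application the same idea is usually expressed by loading the model in a lifespan (startup) handler and storing it on the application state, so request handlers only read it.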
You have trained a model that achieves 95% validation accuracy but performs poorly in production. Walk me through how you would diagnose and fix this.
Strong Answer:

This is almost always a distribution mismatch between your validation set and real production data, and the debugging process follows a systematic funnel.

Step one: confirm the gap is real, not a serving bug. Take a labeled sample of production inputs and run it through the same offline evaluation pipeline, with the same model checkpoint and preprocessing code, that achieved 95% validation accuracy. If accuracy on that sample is still high, the problem is in the serving infrastructure: different preprocessing, wrong model version, batching bugs, or numerical precision differences between training (FP32) and serving (FP16/INT8).

Step two: if the gap is real, characterize it. Sample 500-1000 production examples the model gets wrong and manually categorize the failure modes. Common patterns:

- (a) domain shift: production images are taken with different cameras, lighting, or angles than training data
- (b) class distribution shift: rare classes in training are common in production or vice versa
- (c) novel inputs: production data contains categories or edge cases not present in training at all
- (d) adversarial inputs: user-generated content that is deliberately unusual

Step three: fix the root cause. For domain shift, add representative production data to the training set (even a few hundred examples can help enormously with fine-tuning). For class imbalance, adjust the decision threshold per-class rather than retraining. For novel inputs, add an out-of-distribution detection layer that flags inputs the model is uncertain about, routing them to human review instead of making a bad prediction.

The deeper lesson: a validation set that is an IID split of your training data tells you almost nothing about production performance. Always maintain a separate “production-representative” evaluation set drawn from actual serving traffic.

Follow-up: How do you set up monitoring so you catch this kind of degradation before users notice?

I would implement three monitoring layers.

First, input distribution monitoring: compute statistics on incoming data (pixel mean/std for images, token distribution for text) and alert when they drift away from the training-time baseline, measured with KL divergence or Kolmogorov-Smirnov tests.

Second, prediction distribution monitoring: if your model suddenly starts predicting one class 80% of the time when historically it was 30%, that is a strong signal.

Third, feedback loop monitoring: track downstream metrics (user clicks, corrections, complaints) and correlate them with model confidence. A spike in low-confidence predictions combined with a drop in user engagement is a reliable early warning signal.

I would set up a dashboard with daily automated checks and weekly human review of the worst-performing segments.
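The input-drift check from the first monitoring layer can be sketched with a two-sample Kolmogorov-Smirnov statistic on a single scalar feature such as per-image mean brightness. This is an illustrative pure-Python version; the function names and the alert threshold of 0.2 are assumptions, and in practice a library routine such as `scipy.stats.ks_2samp` (which also returns a p-value) would be the more common choice.

```python
import bisect


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The maximum gap between two step functions occurs at a sample point
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def drift_detected(reference, current, threshold=0.2):
    """Flag drift when the KS statistic exceeds a hand-picked threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical per-image brightness values
training_brightness = [0.42, 0.45, 0.47, 0.50, 0.52, 0.55]
production_brightness = [0.70, 0.72, 0.75, 0.78, 0.80, 0.83]
print(drift_detected(training_brightness, production_brightness))
```

The threshold should be tuned on historical data so that normal day-to-day variation does not trigger alerts.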
Compare transfer learning with fine-tuning the entire backbone versus freezing the backbone and only training the classification head. When would you choose each approach?
Strong Answer:

The choice depends on three factors: dataset size, domain similarity to the pretrained data, and compute budget.

Frozen backbone (only train the head): Best when you have a small dataset (under 5K samples) and your domain is similar to ImageNet. The pretrained features are already good representations, and fine-tuning the full backbone on so few examples would overfit. This is also the fastest approach: you can precompute all backbone features once and then train a simple linear classifier, which takes minutes instead of hours. Think of it as using a pre-built engine and just attaching a custom bumper.

Full fine-tuning: Best when you have a large dataset (50K+ samples) or your domain is significantly different from ImageNet (medical images, satellite imagery, microscopy). The pretrained features provide a good initialization, but the domain gap means the backbone's feature detectors need to adapt. Use a learning rate schedule that applies a lower rate to earlier layers (layer-wise learning rate decay): earlier layers need less adjustment because edges and textures are universal, while later layers need more adjustment because semantic features are domain-specific.

The hybrid approach (freeze then unfreeze) is what I use most often in practice: freeze the backbone, train the head for 5-10 epochs until it converges, then unfreeze the backbone and fine-tune end-to-end with a 10x lower learning rate. This gives the head a reasonable initialization before the backbone weights start moving, preventing the randomly-initialized head from sending noisy gradients through the backbone and destroying pretrained features. This is a simple two-stage form of what is sometimes called “gradual unfreezing.”

One often-overlooked consideration: batch normalization layers. If your frozen backbone has BN layers, you must keep them in eval mode during training: call model.train() for the head, then model.backbone.eval() for the backbone. Setting requires_grad = False alone is not enough, because BN updates its running statistics whenever it is in training mode, and every call to model.train() switches the backbone back. Otherwise, BN will update its running statistics with your domain data, corrupting the pretrained normalization and causing a significant accuracy drop that is very hard to debug.

Follow-up: You are fine-tuning a model on 10K medical images and you notice validation loss decreasing but validation accuracy plateauing. What is happening?

This is a classic calibration problem. The model is becoming more confident on examples it already gets right (pushing logits further from the decision boundary, which decreases cross-entropy loss) but not actually learning to classify more examples correctly. This often happens during fine-tuning when the model overfits to the easy examples in your dataset.

Solutions: increase regularization (higher dropout, stronger weight decay, more aggressive augmentation), use label smoothing (0.1 is a good starting point) which prevents the model from becoming overconfident, or switch to a focal loss that down-weights easy examples and forces the model to focus on the hard cases. I would also check for class imbalance: if 90% of your medical images are “normal,” the model can achieve 90% accuracy by predicting normal for everything, and further training mostly makes it more confident on the majority class rather than better on the rare disease cases.
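The follow-up answer above can be checked with a few lines of arithmetic. In this sketch the probabilities are made-up numbers for a four-example binary validation set: between the two "epochs" the model only becomes more confident on examples it already classified correctly, so cross-entropy falls while accuracy does not move.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels):
    """Fraction of examples whose highest-probability class is the label."""
    correct = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda i: p[i])
        correct += predicted == y
    return correct / len(labels)


labels = [0, 1, 0, 1]

# Earlier epoch: three correct with moderate confidence, last one wrong
epoch_a = [[0.70, 0.30], [0.30, 0.70], [0.70, 0.30], [0.60, 0.40]]
# Later epoch: same three are now very confident, the wrong one is unchanged
epoch_b = [[0.95, 0.05], [0.05, 0.95], [0.95, 0.05], [0.60, 0.40]]

for name, probs in [("epoch A", epoch_a), ("epoch B", epoch_b)]:
    print(name, "loss:", round(cross_entropy(probs, labels), 3),
          "accuracy:", accuracy(probs, labels))
```

Both epochs score 0.75 accuracy, but the later one has a lower loss. This is why loss and accuracy should be read together, ideally alongside per-class metrics.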
You are deploying a deep learning model as a FastAPI service. What are the critical production considerations beyond just wrapping the model in an endpoint?
Strong Answer:

The model endpoint is maybe 20% of the production engineering. Here are the critical pieces most people miss.

Model loading and lifecycle: Load the model once at startup, not per-request. Call model.eval() to disable dropout and set batch norm to inference mode. Pin the model to a specific GPU with explicit device placement. Implement graceful shutdown that finishes in-flight requests before terminating. If using multiple workers (Gunicorn with Uvicorn), each worker loads its own copy of the model, so plan your GPU memory accordingly.

Input validation and preprocessing: Never trust client input. Validate image dimensions, file formats, and size limits before touching the model. Apply the exact same preprocessing pipeline used during training: same normalization constants, same resize interpolation method. A mismatch here (bilinear vs bicubic resize, or RGB vs BGR channel order) will silently degrade accuracy without raising any errors.

Batching for throughput: If you are handling many concurrent requests, dynamic batching (collecting multiple requests into a single batch for GPU inference) can increase throughput 5-10x. Tools like NVIDIA Triton or TorchServe handle this, but you can also implement a simple request queue with a timeout.

Latency monitoring: Track P50, P95, and P99 latency separately. A model that averages 50ms but occasionally spikes to 2 seconds (due to GPU memory defragmentation, thermal throttling, or garbage collection) will create a terrible user experience. Set up alerts on P99 latency, not just average.

Model versioning: Always include a model version identifier in the response. When you deploy a new model, run both old and new versions simultaneously (A/B test or shadow mode) and compare predictions before fully cutting over. Store the model artifact with a hash of its weights so you can verify exactly which checkpoint is serving.

Error handling: Return structured responses that include confidence scores, and structured errors when something fails. If the model’s confidence is below a threshold, return a “low confidence” flag rather than a potentially wrong prediction. Implement circuit breakers that route traffic to a fallback (cached response, simpler model, or human review queue) if the main model is overloaded or failing.

Follow-up: Your model serving latency suddenly doubled after a routine deployment. What do you check?

In order of likelihood:

1. Did the Docker base image change, introducing a different CUDA/cuDNN version that is slower on your GPU architecture? Compare torch.version.cuda and torch.backends.cudnn.version() between the old and new images; nvidia-smi reports the driver version, not the CUDA toolkit or cuDNN inside the container.
2. Did the preprocessing code change, perhaps a new image resize method that is CPU-bound? Profile CPU vs GPU time separately.
3. Is the model running in training mode instead of eval mode? Check that model.training is False, and look for dropout or batch norm issues.
4. Did the model size change (different checkpoint with more parameters)? Check torch.cuda.memory_allocated().
5. Is there GPU contention: did another service start sharing the same GPU? Check GPU utilization in nvidia-smi.
6. Thermal throttling: is the GPU overheating under sustained load? Check GPU temperature.

Most of these take under 5 minutes to check if you have proper observability set up.
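The point about tracking P50, P95, and P99 separately is easy to see numerically. The sketch below uses the nearest-rank percentile definition and a made-up sample of 100 request latencies; `percentile` and `latency_report` are illustrative helper names, and a real service would normally read these values from its metrics system instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Summarize request latencies with the mean and tail percentiles."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical traffic: 98 fast requests and 2 slow outliers
latencies = [50] * 98 + [2000] * 2
report = latency_report(latencies)
print(report)
```

Here the mean (89 ms) and P95 (50 ms) both look healthy, while P99 exposes the 2-second outliers, which is why alerts belong on the tail.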
What is mixed precision training, and why is it practically essential for modern deep learning projects?
Strong Answer:

Mixed precision training uses lower-precision floating point (FP16 or BF16) for most computations while keeping a master copy of weights in FP32. The “mixed” part is crucial: you are not just casting everything to FP16, which would fail because small weight updates and accumulated gradients lose precision.

The mechanism: forward pass and most of the backward pass run in FP16/BF16, which is roughly 2x faster on modern GPUs (Tensor Cores are optimized for half-precision) and uses roughly half the memory for activations. The weight update step (optimizer.step()) is applied to the FP32 master weights, because an update that is tiny relative to the weight would be rounded away in FP16. Separately, a GradScaler multiplies the loss by a large factor before the backward pass to shift small gradients into the representable FP16 range, then unscales them before the optimizer step.

Why it is practically essential: a 7B parameter model requires 28 GB in FP32 just for weights (7 billion parameters x 4 bytes). With mixed precision, activations and gradients are in FP16 (half size), and the FP32 master weights add overhead, but the net memory savings are typically 30-40%. The speed gain is even more compelling: on A100 GPUs, BF16 Tensor Core operations run at 312 TFLOPS versus 19.5 TFLOPS for standard FP32, a 16x theoretical speedup. In practice, the end-to-end training speedup is 1.5-2x because memory bandwidth and non-matmul operations are also bottlenecks.

BF16 versus FP16: BF16 has the same exponent range as FP32 (8 exponent bits) but reduced mantissa precision. In practice this removes the need for loss scaling, because gradients that fit in FP32's range also fit in BF16's. FP16 has a smaller exponent range and requires careful loss scaling. If your hardware supports BF16 (A100, H100, TPUs), prefer it by default. It is simpler and more numerically stable.

Follow-up: You enable mixed precision training and your loss becomes NaN after a few hundred steps. What is going wrong?

The most common cause is FP16 overflow or underflow in the gradient computation. The GradScaler is supposed to handle this, but it can fail if:

1. The initial loss scale is too high, causing the scaled loss or gradients to overflow FP16's range. The fix is lowering the initial scale factor.
2. The model has layers with very different gradient magnitudes (common in transformers, where attention-related values can be much larger than those in feedforward layers). Gradient clipping helps, but it must be applied after scaler.unscale_(optimizer) and before scaler.step(optimizer); clipping earlier compares the threshold against scaled gradients and has no useful effect.
3. The learning rate is too high for the reduced precision, causing weight updates that are too large.

I would also check for inf/nan in the loss itself before the scaler, which would indicate a bug in the model (division by zero, log of zero) unrelated to mixed precision. If switching to BF16 fixes the NaN, the problem is FP16 dynamic range, not a model bug.
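The underflow that loss scaling protects against can be reproduced with the standard library alone, since Python's `struct` module can round a value to IEEE 754 half precision. This is a minimal numeric illustration, not how a training framework implements scaling; the gradient value is made up, and the scale of 65536 (2**16) is chosen to match the default initial scale of PyTorch's `GradScaler`.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8   # hypothetical gradient value, below FP16's smallest subnormal
scale = 65536.0    # loss scale factor (2**16)

# Without scaling, the gradient is rounded to zero when stored in FP16
unscaled = to_fp16(tiny_grad)

# With scaling, the value lands in FP16's normal range and survives
scaled = to_fp16(tiny_grad * scale)

# Unscaling happens in higher precision before the optimizer step
recovered = scaled / scale

print("stored without scaling:", unscaled)
print("recovered after scaling:", recovered)
```

The recovered value matches the original to within FP16's rounding error, while the unscaled value is lost entirely. BF16 avoids this particular failure because its exponent range matches FP32's.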