Model Deployment: From Notebook to Production
Your Model Is Useless in a Notebook
You’ve trained a great model. It achieves 95% accuracy! But it’s sitting in a Jupyter notebook on your laptop. To be useful, models need to be:
- Saved and loaded
- Served via an API
- Monitored in production
- Updated when needed
Step 1: Saving Models
Using Joblib (Recommended for scikit-learn)
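Here is a minimal sketch of saving a fitted scikit-learn model with joblib. The iris dataset, the random forest, and the file name `model.joblib` are illustrative assumptions, not choices from this guide.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def train_and_save(path="model.joblib"):
    """Fit a small classifier and write it to disk with joblib."""
    # Illustrative data and estimator; swap in your own.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X, y)
    # compress=3 trades a little CPU time for a smaller file.
    joblib.dump(model, path, compress=3)
    return model
```

Calling `train_and_save()` leaves a `model.joblib` file in the working directory that can be shipped alongside your serving code.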
Loading the Model
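Loading is the mirror image of saving. The sketch below assumes a file written earlier by `joblib.dump`; the helper name `load_model` and the default path are hypothetical.

```python
from pathlib import Path

import joblib


def load_model(path="model.joblib"):
    """Load a joblib-serialized object, failing loudly if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path}")
    # joblib uses pickle under the hood: only load files you trust.
    return joblib.load(path)
```

It is usually safest to load a model with the same library versions that were used to save it, since serialized estimators are not guaranteed to be portable across releases.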
Using Pickle (Built-in Python)
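If you would rather avoid an extra dependency, the standard library's `pickle` module works as well. This is a rough sketch; the helper names `save_pickle` and `load_pickle` are made up for illustration.

```python
import pickle


def save_pickle(obj, path):
    """Serialize any picklable object (for example a fitted model) to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Deserialize an object from a file. Never unpickle untrusted data."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Unpickling can execute arbitrary code, so treat model files like executables and only load ones you produced yourself.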
Step 2: Save the Full Pipeline
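One way to do this is to wrap preprocessing and the estimator in a scikit-learn `Pipeline` and serialize the whole object. The scaler, classifier, dataset, and file name below are assumptions chosen to keep the sketch small.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_and_save_pipeline(path="pipeline.joblib"):
    """Fit preprocessing and model together, then save them as one artifact."""
    X, y = load_iris(return_X_y=True)  # illustrative data
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(X, y)
    # The saved file contains the fitted scaler as well as the classifier,
    # so raw inputs can be passed straight to predict() after loading.
    joblib.dump(pipeline, path)
    return pipeline
```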
Don’t just save the model - save the entire preprocessing pipeline!
Step 3: Create an API with FastAPI
Run the API
Test the API
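You can exercise a running service with nothing but the standard library. This sketch assumes the API exposes `POST /predict` and accepts a JSON body of the form `{"features": [...]}`; both are hypothetical and should be adjusted to match your own endpoint.

```python
import json
import urllib.request


def build_predict_request(base_url, features):
    """Build a JSON POST request for the (assumed) /predict endpoint."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(base_url, features, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(base_url, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service listening locally, `call_predict("http://127.0.0.1:8000", [5.1, 3.5, 1.4, 0.2])` would return the parsed response body.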
Step 4: Containerize with Docker
Build and Run
Step 5: Model Versioning
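A file-based registry is enough to get started. The sketch below writes one JSON record per version; it assumes `MAJOR.MINOR.PATCH` version strings, and the function names and record fields are invented for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def register_model(model_path, registry_dir, version, metrics):
    """Record a model file's hash, metrics, and timestamp under a version."""
    model_path = Path(model_path)
    registry_dir = Path(registry_dir)
    registry_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "version": version,
        "file": model_path.name,
        # The hash lets you verify later that the artifact was not swapped.
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
        "metrics": metrics,
        "registered_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (registry_dir / f"{version}.json").write_text(json.dumps(record, indent=2))
    return record


def latest_version(registry_dir):
    """Return the highest registered version, or None if there are none."""
    versions = [p.stem for p in Path(registry_dir).glob("*.json")]
    if not versions:
        return None
    # Compare numerically so that 1.10.0 sorts after 1.2.0.
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))
```

Dedicated tools such as MLflow cover the same ground with more features; this only shows the core idea of tying an artifact to a version and its metrics.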
Track your models like you track code.
Step 6: Model Monitoring
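As a starting point, you can keep a rolling window of recent predictions and latencies in memory and summarize it on demand. The class name, window size, and summary fields below are assumptions; a production setup would more likely export these numbers to a metrics system.

```python
import statistics
from collections import Counter, deque


class PredictionMonitor:
    """Keep the most recent predictions and latencies in a fixed-size window."""

    def __init__(self, window=1000):
        self.latencies_ms = deque(maxlen=window)
        self.predictions = deque(maxlen=window)

    def record(self, prediction, latency_ms):
        self.predictions.append(prediction)
        self.latencies_ms.append(latency_ms)

    def summary(self):
        """Return request count, latency stats, and the share of each prediction."""
        total = len(self.predictions)
        if total == 0:
            return {"count": 0}
        counts = Counter(self.predictions)
        return {
            "count": total,
            "mean_latency_ms": statistics.fmean(self.latencies_ms),
            "max_latency_ms": max(self.latencies_ms),
            "prediction_share": {label: n / total for label, n in counts.items()},
        }
```

A sudden change in `prediction_share` is often the first visible sign that inputs have shifted, even before labels arrive to confirm an accuracy drop.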
Track model performance in production.
Step 7: A/B Testing Models
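One common approach is to route a fixed share of users to the candidate model, using a hash of the user ID so each user always sees the same variant. The function names, the 10% default share, and the variant labels below are illustrative assumptions.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.1):
    """Deterministically map a user to 'candidate' or 'production'."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "candidate" if bucket < treatment_share else "production"


def route(user_id, features, production_model, candidate_model, treatment_share=0.1):
    """Pick a model for this user and return (variant, prediction)."""
    variant = assign_variant(user_id, treatment_share)
    model = candidate_model if variant == "candidate" else production_model
    return variant, model.predict([features])[0]
```

Logging the returned variant next to each prediction is what lets you compare the two models' outcomes afterwards.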
Compare new models against production.
Deployment Checklist
Before Deployment
- Model tested on holdout data
- Pipeline includes preprocessing
- Model serialized (joblib/pickle)
- API endpoints documented
- Error handling added
- Input validation in place
After Deployment
- Health checks working
- Logging configured
- Latency monitored
- Prediction distribution tracked
- Rollback plan ready
- Model version tracked
Cloud Deployment Options
| Platform | Complexity | Best For |
|---|---|---|
| Heroku | Low | Quick prototypes |
| Railway | Low | Simple apps |
| AWS Lambda | Medium | Serverless, pay-per-use |
| Google Cloud Run | Medium | Container-based |
| AWS SageMaker | High | Enterprise ML |
| Azure ML | High | Enterprise ML |
🚀 Mini Projects
Project 1: Model Serialization Pipeline
Save and load models with preprocessing
Project 2: Simple REST API
Build a prediction API with Flask
Project 3: Model Versioning System
Track different model versions
Project 4: Monitoring Dashboard
Monitor model performance in production
Project 1: Model Serialization Pipeline
Create a complete pipeline that saves models with their preprocessing steps.
Project 2: Simple REST API
Build a prediction API using Flask.
Project 3: Model Versioning System
Create a simple model versioning and registry system.
Project 4: Monitoring Dashboard
Create a simple model monitoring system.
Key Takeaways
Save the Pipeline
Include all preprocessing with the model
API = Interface
FastAPI makes serving models easy
Docker = Portability
Same environment everywhere
Monitor = Trust
Know when your model degrades
What’s Next?
We have more advanced topics to explore! Let’s learn about time series forecasting.
Continue to Time Series
Predict the future from sequential data
Interview Deep-Dive
Walk me through how you would deploy an ML model that needs to serve 10,000 predictions per second with P99 latency under 50ms.
This is fundamentally an engineering problem, not an ML problem. The model accuracy is already decided at training time — deployment is about serving reliably at scale.
- Model optimization first. Before touching infrastructure, I would profile the model. A Random Forest with 500 trees at depth 20 can be on the order of 10x slower than one with 50 trees at depth 10, because inference cost grows with both the number of trees and their depth. I would benchmark the accuracy drop from model simplification and see if the trade-off is acceptable. Often, you can cut model size by 80% with less than 1% accuracy loss, but that has to be measured on your own data rather than assumed.
- Batch predictions where possible. If predictions are not real-time (e.g., daily churn scoring), precompute them in a batch job and serve from a cache. This is the cheapest and most reliable path. Only build real-time serving if the use case demands it.
- For real-time: containerize with a lightweight framework. I would use FastAPI or a gRPC service inside a Docker container, deployed on Kubernetes with horizontal pod autoscaling. The model loads into memory once at startup, and each request is just a numpy operation.
- Connection pooling and async I/O. If features come from a database or feature store, the network call is usually the bottleneck, not the model. Use connection pooling and async requests to parallelize feature fetching.
- Load testing before launch. Use Locust or k6 to simulate 10K RPS and measure P50, P95, and P99 latencies. If P99 exceeds 50ms, the usual culprits are garbage collection pauses, cold starts, or feature computation latency — not the model inference itself.
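To make the percentile targets concrete, here is a rough sketch that times a prediction callable in-process and reports P50, P95, and P99. It is not a substitute for a Locust or k6 run at 10K RPS, since it issues requests one at a time from a single process; the function names are hypothetical.

```python
import statistics
import time


def measure_latencies_ms(predict, payload, n_requests=1000):
    """Call predict(payload) repeatedly and return each latency in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_percentiles(latencies_ms):
    """Return P50, P95, and P99 from a list of latencies."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Comparing `p99` against the 50ms budget on realistic payloads gives an early read on whether the model itself is the bottleneck before any infrastructure is involved.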
Your deployed model's accuracy has dropped 8% over the past month but no code changes were made. What is your debugging process?
No code changes means the model itself has not changed, so the issue is almost certainly in the data. My debugging process follows a systematic funnel:
- Step 1: Confirm the drop is real. Check if the evaluation data itself is reliable. Has the labeling process changed? Is there a new data source contributing noisier labels? A perceived accuracy drop could be a labeling quality issue, not a model issue.
- Step 2: Check input feature distributions. Compare each feature’s distribution in the recent prediction window against the training data. Use Population Stability Index (PSI) or Kolmogorov-Smirnov tests. If a feature like “average_transaction_amount” shifted because of inflation or a pricing change, the model is seeing inputs outside its training domain.
- Step 3: Check for upstream data pipeline issues. Missing values that were previously imputed, schema changes in upstream tables, timestamp timezone bugs, or a feature that silently started returning nulls — these are the most common silent killers. I have seen a model degrade because an upstream team changed a column from integer to float, which caused a downstream join to silently drop rows.
- Step 4: Segment the performance drop. Is accuracy down across all segments, or just one? If it dropped only for a specific customer cohort or geographic region, you have a targeted drift problem. If it dropped everywhere uniformly, it is more likely a global data issue.
- Step 5: Retrain on recent data and compare. If a freshly trained model on the last 3 months of data recovers the lost accuracy, the diagnosis is confirmed: data drift. Set up a regular retraining cadence and monitoring alerts.
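The distribution check in Step 2 can be sketched with a small PSI function. This version bins by quantiles of the training sample and computes the sum over bins of (actual share - expected share) * ln(actual share / expected share). The bin count and the `eps` floor for empty bins are assumptions, not fixed conventions.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training-time sample (expected) and a recent sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the training sample.
    interior = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    expected_counts = np.bincount(
        np.searchsorted(interior, expected, side="right"), minlength=n_bins
    )
    actual_counts = np.bincount(
        np.searchsorted(interior, actual, side="right"), minlength=n_bins
    )
    # Floor the shares so empty bins do not produce log(0) or division by zero.
    expected_share = np.clip(expected_counts / expected.size, eps, None)
    actual_share = np.clip(actual_counts / actual.size, eps, None)
    return float(
        np.sum((actual_share - expected_share) * np.log(actual_share / expected_share))
    )
```

Running this per feature over a recent prediction window, and alerting when the value climbs, turns the manual comparison into a routine check.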
How do you handle the training-serving skew problem -- where features computed during training differ subtly from features computed at serving time?
Training-serving skew is one of the top reasons ML models fail silently in production, and it is notoriously hard to debug because the model does not throw errors — it just makes slightly worse predictions.
- Root cause: dual code paths. During training, features are computed in batch with pandas on a data warehouse. During serving, the same features need to be computed in real-time, often in a different language or framework. Even minor differences — different null handling, different timestamp rounding, different string encoding — produce skew.
- Solution 1: Feature stores. Use a feature store (Feast, Tecton, Hopsworks) that provides a single feature definition used identically for both training and serving. The feature store computes features once, stores them, and serves the same values to both the training pipeline and the prediction endpoint. This eliminates the dual code path problem entirely.
- Solution 2: Pipeline as the single source of truth. If a feature store is too heavy, package your feature engineering as a shared library that both the training job and the serving endpoint import. One codebase, one behavior.
- Solution 3: Log-and-compare. In the serving path, log the actual feature values fed to the model for a sample of requests. Periodically compare these logged features against what the training pipeline would produce for the same raw inputs. Any discrepancy is skew.
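Solutions 2 and 3 can be combined in a few lines: one shared feature function that both the training job and the serving endpoint import, plus a check that recomputes features from logged raw inputs. The feature names, the raw input fields, and the null-handling rule here are invented for illustration.

```python
import math


def compute_features(raw):
    """Single feature definition imported by both training and serving code."""
    amount = raw.get("amount")
    return {
        # Null handling lives in one place, so both paths agree on it.
        "amount_log": math.log1p(amount) if amount is not None else 0.0,
        "is_weekend": 1.0 if raw.get("day_of_week") in (5, 6) else 0.0,
    }


def find_skew(raw, logged_features, tolerance=1e-9):
    """Compare features logged at serving time with a fresh recomputation.

    Returns {feature_name: (served_value, expected_value)} for mismatches.
    """
    skewed = {}
    for name, expected in compute_features(raw).items():
        served = logged_features.get(name)
        if served is None or abs(served - expected) > tolerance:
            skewed[name] = (served, expected)
    return skewed
```

An empty result for a sample of logged requests suggests the two paths agree; any entry points directly at the feature that drifted apart.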
You need to deploy a model in an environment where the prediction latency budget is only 5ms. What trade-offs would you consider?
Five milliseconds is tight. That is roughly the time for a single database round-trip, so every architectural decision matters.
- Model choice is constrained. Deep ensembles with hundreds of trees are out. I would look at logistic regression, small gradient boosting models (under 50 trees, shallow depth), or even a precomputed lookup table if the feature space is discrete and bounded. A single decision tree with depth 6 can serve in microseconds.
- Feature computation budget. If features require any I/O (database lookups, external API calls), that will blow the 5ms budget immediately. All features must either be precomputed and cached in memory, or computable from the request payload alone.
- Model distillation. Train a complex model offline (XGBoost with 1000 trees), then use it to label a large dataset. Train a simple model (logistic regression) on those labels. The simple model learns to mimic the complex model and can serve in under 1ms. You lose some accuracy but gain massive latency improvement.
- ONNX or compiled models. Convert the sklearn model to ONNX format (for example with sklearn-onnx) and serve it with ONNX Runtime. This removes most of the Python-level overhead from the prediction path and can give a 2-5x speedup, depending on the model. For extreme cases, compile tree ensembles to native code using a tool like treelite.
- Avoid Python’s GIL bottleneck. If you are serving many concurrent requests in Python, the Global Interpreter Lock serializes CPU-bound work. Consider a Rust or C++ serving layer, or use multiprocess workers instead of multithreaded ones.
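The distillation idea above can be sketched with scikit-learn alone. Note the substitutions: `GradientBoostingClassifier` stands in for the XGBoost teacher, and the synthetic dataset and hyperparameters are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def distill(random_state=0):
    """Train a large teacher, then fit a small student on the teacher's labels."""
    X, y = make_classification(
        n_samples=2000, n_features=10, n_informative=5, random_state=random_state
    )
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    # Teacher: the slower, more accurate model trained on the true labels.
    teacher = GradientBoostingClassifier(n_estimators=100, random_state=random_state)
    teacher.fit(X_train, y_train)
    # Student: a fast linear model trained to mimic the teacher's predictions.
    student = LogisticRegression(max_iter=1000)
    student.fit(X_train, teacher.predict(X_train))
    # Fraction of held-out rows where student and teacher agree.
    agreement = float((student.predict(X_test) == teacher.predict(X_test)).mean())
    return teacher, student, agreement
```

The agreement rate on held-out data is a quick proxy for how much of the teacher's behavior the student kept; the real decision should still rest on the student's accuracy against true labels and its measured latency.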