ML at Scale
From Laptop to Production
Your model works on 10,000 samples. What happens with 10 million? What about 10 billion? The skills that make you a good data scientist on a laptop (careful feature engineering, thoughtful model selection, proper evaluation) are necessary but not sufficient at scale. At scale, you need to think about data that does not fit in memory, training that takes days rather than seconds, and prediction latency that is measured in milliseconds.
Estimated Time: 3-4 hours
Difficulty: Advanced
Prerequisites: All previous ML chapters, basic understanding of distributed systems
Focus: Real production patterns used at tech companies
Challenge 1: Data Does Not Fit in Memory
When your dataset exceeds available RAM, you cannot simply call pd.read_csv() and move on. You need to process data in chunks, either through incremental learning (the model sees data one batch at a time) or through distributed computing (spread the data across multiple machines).
Solution: Batch Processing (Incremental Learning)
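A minimal sketch of incremental learning, assuming scikit-learn and NumPy are available. The `iter_chunks` generator and the synthetic data are hypothetical stand-ins for a real chunked reader such as `pd.read_csv(..., chunksize=...)`; only one chunk is held in memory at a time.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical ground-truth weights used only to generate synthetic labels.
TRUE_W = np.random.default_rng(0).normal(size=10)

def iter_chunks(n_chunks, chunk_size=1_000, seed=1):
    """Yield (X, y) chunks; stands in for reading a large file piece by piece."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, TRUE_W.size))
        yield X, (X @ TRUE_W > 0).astype(int)

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit must be told every class up front

for X_chunk, y_chunk in iter_chunks(n_chunks=20):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a chunk the model has never seen (different seed).
X_test, y_test = next(iter_chunks(n_chunks=1, seed=2))
accuracy = model.score(X_test, y_test)
```

The same loop works with any estimator that implements `partial_fit`; estimators without it need one of the other strategies in this chapter.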
Solution: Dask for Out-of-Core Computing
Challenge 2: Training Takes Too Long
When a single training run takes hours or days, iteration slows to a crawl. You cannot experiment with features, try different models, or tune hyperparameters if each attempt takes a week. There are three main strategies: parallelize across CPU cores (n_jobs=-1), distribute across multiple machines (Ray, Spark), or accelerate with GPUs (RAPIDS cuML).
Solution: Distributed Training
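Distributing across machines with Ray or Spark needs a cluster, so this sketch illustrates only the first strategy, parallelizing across CPU cores with `n_jobs=-1`. It assumes scikit-learn is installed; the dataset is synthetic. Because each tree's seed is derived from `random_state` before the work is dispatched, the serial and parallel fits should yield the same forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# n_jobs=1 builds trees one after another; n_jobs=-1 uses every available core.
serial = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=0).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# Parallelism changes wall-clock time, not the fitted model.
same_predictions = bool((serial.predict(X) == parallel.predict(X)).all())
```

The speed-up is bounded by the number of cores on one machine, which is the point at which the multi-machine tools become worth their overhead.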
Solution: Model Selection at Scale
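A minimal sketch of successive-halving search, assuming a scikit-learn version that ships `HalvingRandomSearchCV` (it is experimental and needs the explicit enabling import). The parameter grid, dataset, and budget settings are illustrative choices, not recommendations.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# 4 hyperparameters with 5 values each.
param_distributions = {
    "max_depth": [3, 5, 8, 12, None],
    "min_samples_split": [2, 5, 10, 20, 50],
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "max_features": [0.2, 0.4, 0.6, 0.8, 1.0],
}
grid_size = 1
for values in param_distributions.values():
    grid_size *= len(values)  # what an exhaustive grid search would have to try

search = HalvingRandomSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions,
    n_candidates=27,    # sample 27 configurations instead of the full grid
    factor=3,           # keep roughly the best third each round
    min_resources=200,  # first round trains on only 200 samples
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Candidates and training-set size per elimination round.
rounds = list(zip(search.n_candidates_, search.n_resources_))
```

Inspecting `rounds` shows the elimination schedule: many candidates on little data first, then fewer candidates on more data.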
Grid search is exhaustive: it tries every combination. With 4 hyperparameters, each with 5 values, that is 5^4 = 625 combinations, each requiring full cross-validation (3,125 model fits with 5-fold CV). HalvingRandomSearchCV is usually far cheaper: it starts by trying many randomly sampled configurations on a small subset of the data, then progressively gives more data to the most promising configurations, eliminating poor performers early. Think of it like a talent competition with elimination rounds instead of auditioning everyone equally.
Challenge 3: Serving Predictions at Scale
Training happens once (or periodically). Serving happens millions of times per day. A model that takes 100ms to predict instead of 10ms needs roughly 10x the compute to sustain the same throughput, and the added latency is directly visible to users. At scale, prediction latency and throughput matter as much as accuracy.
Solution: Model Optimization
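One possible optimization, sketched under assumptions: for a linear model, export the learned weights as float32 and score with a plain dot product, skipping the framework's per-call validation. This is an illustration of the idea rather than the method this chapter prescribes; `fast_predict_proba` is a hypothetical helper, and the accuracy check against the original model is the part that matters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1_000).fit(X, y)

# Export the model as two small float32 arrays (half the memory of float64).
weights = clf.coef_.ravel().astype(np.float32)
bias = np.float32(clf.intercept_[0])

def fast_predict_proba(features):
    """Positive-class probability from a dot product and a sigmoid."""
    z = features.astype(np.float32) @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))

# Always verify that the optimized path agrees with the reference model.
reference = clf.predict_proba(X)[:, 1]
max_abs_diff = float(np.max(np.abs(fast_predict_proba(X) - reference)))
```

The same check (optimized output versus reference output on held-out inputs) applies to heavier techniques such as quantization or distillation.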
Solution: Batching for Throughput
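A simplified, synchronous sketch of micro-batching, assuming NumPy. `MicroBatcher` and `toy_model` are hypothetical names; a production version would also flush on a timer so that a half-full batch does not wait indefinitely.

```python
import numpy as np

class MicroBatcher:
    """Collect single requests and score them with one vectorized call."""

    def __init__(self, predict_fn, max_batch_size=32):
        self.predict_fn = predict_fn
        self.max_batch_size = max_batch_size
        self._pending = []    # (request_id, feature_vector) waiting to be scored
        self.results = {}     # request_id -> prediction
        self.model_calls = 0  # how many times the model was actually invoked
        self._next_id = 0

    def submit(self, features):
        request_id = self._next_id
        self._next_id += 1
        self._pending.append((request_id, features))
        if len(self._pending) >= self.max_batch_size:
            self.flush()
        return request_id

    def flush(self):
        if not self._pending:
            return
        ids, rows = zip(*self._pending)
        predictions = self.predict_fn(np.vstack(rows))  # one call for the whole batch
        self.model_calls += 1
        self.results.update(zip(ids, predictions))
        self._pending.clear()

def toy_model(batch):
    # Stand-in for model.predict: any function that accepts a 2D array.
    return batch.sum(axis=1)

batcher = MicroBatcher(toy_model, max_batch_size=32)
requests = np.arange(300, dtype=float).reshape(100, 3)
ids = [batcher.submit(row) for row in requests]
batcher.flush()  # score the final partial batch
```

Here 100 requests cost 4 model invocations instead of 100, at the price of each request waiting for its batch to fill.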
Solution: Caching Predictions
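A minimal in-process caching sketch using only the standard library. `expensive_model` is a hypothetical stand-in for a real predictor, and rounding the inputs to build the cache key is an assumption that only suits models tolerant of that loss of precision.

```python
from functools import lru_cache

def expensive_model(features):
    # Stand-in for a slow model call.
    return sum(w * x for w, x in zip((0.5, -1.0, 2.0), features))

@lru_cache(maxsize=10_000)
def cached_predict(features):
    # lru_cache needs hashable arguments, so features arrive as a tuple.
    return expensive_model(features)

def predict(raw_features, precision=2):
    # Rounding makes near-identical requests share one cache entry.
    key = tuple(round(x, precision) for x in raw_features)
    return cached_predict(key)

predict([1.001, 2.0, 3.0])  # miss: model is called
predict([1.002, 2.0, 3.0])  # hit: rounds to the same key
predict([5.0, 2.0, 3.0])    # miss: new key
stats = cached_predict.cache_info()
```

A shared cache such as Redis follows the same key-then-lookup pattern; it also needs an expiry policy so entries do not outlive the model version that produced them.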
Challenge 4: Model Updates Without Downtime
In production, you cannot simply stop the old model and start the new one. During that gap, predictions fail and users are affected. Blue-green deployment maintains two model versions simultaneously: the current “blue” production model and the new “green” candidate. You gradually shift traffic from blue to green, monitoring for problems. If something goes wrong, you route all traffic back to blue, which is still running. The result is zero downtime and a fast rollback path.
Solution: Blue-Green Deployment
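A toy in-process sketch of the routing logic, standard library only. `BlueGreenRouter` is a hypothetical class; in a real system the traffic split normally lives in the load balancer or service mesh rather than in application code.

```python
import random

class BlueGreenRouter:
    """Route each request to blue or green according to a traffic share."""

    def __init__(self, blue, green, green_share=0.0, seed=None):
        self.models = {"blue": blue, "green": green}
        self.green_share = green_share
        self._rng = random.Random(seed)

    def shift_traffic(self, green_share):
        if not 0.0 <= green_share <= 1.0:
            raise ValueError("green_share must be between 0 and 1")
        self.green_share = green_share

    def rollback(self):
        # Blue never stopped running, so rollback is a single assignment.
        self.green_share = 0.0

    def predict(self, features):
        name = "green" if self._rng.random() < self.green_share else "blue"
        return name, self.models[name](features)

router = BlueGreenRouter(blue=lambda x: x * 2, green=lambda x: x * 2 + 1, seed=0)
router.shift_traffic(0.1)  # send about 10% of requests to green
served = [router.predict(1)[0] for _ in range(10_000)]
green_fraction = served.count("green") / len(served)

router.rollback()
after_rollback = {router.predict(1)[0] for _ in range(1_000)}
```

Both models stay loaded the whole time, which is what makes the rollback immediate.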
Solution: Shadow Mode Testing
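A minimal sketch of shadow mode, standard library only. `ShadowDeployment` and the tolerance value are hypothetical; a real deployment would call the shadow model asynchronously so it cannot add latency to the user-facing response.

```python
class ShadowDeployment:
    """Serve the primary model; run the candidate silently and compare."""

    def __init__(self, primary, shadow, tolerance=0.05):
        self.primary = primary
        self.shadow = shadow
        self.tolerance = tolerance
        self.total = 0
        self.disagreements = 0
        self.shadow_errors = 0

    def predict(self, features):
        response = self.primary(features)
        self.total += 1
        try:
            candidate = self.shadow(features)
            if abs(candidate - response) > self.tolerance:
                self.disagreements += 1
        except Exception:
            # A failing shadow model must never affect the user.
            self.shadow_errors += 1
        return response  # only the primary's answer is ever returned

    @property
    def disagreement_rate(self):
        return self.disagreements / self.total if self.total else 0.0

# Hypothetical models: the candidate diverges only for large inputs.
deployment = ShadowDeployment(
    primary=lambda x: 0.1 * x,
    shadow=lambda x: 0.1 * x + (0.2 if x >= 8 else 0.0),
)
responses = [deployment.predict(x) for x in range(10)]
```

A disagreement rate well above what the offline evaluation predicted is the signal to investigate before shifting any live traffic.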
Challenge 5: Feature Engineering at Scale
The “training-serving skew” problem: during training, you compute features using pandas on a full historical dataset. During serving, you need the exact same features computed in real-time from a single incoming request. If these computations differ even slightly (different rounding, different null handling, different aggregation windows), your model’s accuracy silently degrades. Feature stores solve this by providing a single source of truth for feature definitions, used identically in both training and serving.
Solution: Feature Store Pattern
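A toy sketch of the core idea, standard library only: each feature is defined once and both the training path and the serving path call the same definition. The registry, decorator, and feature names are hypothetical; real feature stores such as Feast add storage, versioning, and low-latency lookup on top of this.

```python
FEATURE_REGISTRY = {}

def feature(name):
    """Register a feature definition under a name (hypothetical helper)."""
    def register(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return register

@feature("amount_rounded")
def amount_rounded(record):
    # Null handling and rounding are decided once, here.
    return round(record.get("amount") or 0.0, 2)

@feature("is_weekend")
def is_weekend(record):
    return int(record["day_of_week"] in (5, 6))

def compute_features(record):
    return {name: fn(record) for name, fn in FEATURE_REGISTRY.items()}

def build_training_rows(history):
    # Batch path: applied to every historical record.
    return [compute_features(record) for record in history]

def serve_features(request):
    # Online path: applied to one incoming request.
    return compute_features(request)

history = [
    {"amount": 12.345, "day_of_week": 5},
    {"amount": None, "day_of_week": 2},
]
training_rows = build_training_rows(history)
online_rows = [serve_features(record) for record in history]
```

Because both paths go through `compute_features`, a change to rounding or null handling cannot reach one path without the other.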
Summary: Production Checklist
Before Deployment
- Model fits in memory on target hardware
- Latency meets SLA requirements
- Throughput handles peak traffic
- Fallback strategy defined
- Monitoring and alerting set up
During Deployment
- Canary release (gradual rollout)
- Shadow mode testing
- A/B testing infrastructure
- Feature flag controls
After Deployment
- Monitor prediction latency (p50, p99)
- Track prediction distribution drift
- Alert on model staleness
- Regular retraining pipeline
Key Insight: At scale, ML is 10% modeling and 90% engineering. The best model is worthless if it can’t serve predictions reliably.
Tools for Scale
| Challenge | Tools |
|---|---|
| Data Processing | Spark, Dask, Ray |
| Distributed Training | Ray Train, Horovod, TF Distributed |
| GPU Acceleration | RAPIDS cuML, NVIDIA Triton |
| Feature Store | Feast, Tecton, Hopsworks |
| Model Serving | TF Serving, Triton, BentoML |
| Orchestration | Kubeflow, MLflow, Airflow |
| Monitoring | Evidently, Grafana, Prometheus |
Interview Deep-Dive
You have a model that trains fine on 1 million rows but the data team just delivered 500 million rows. Walk me through your scaling strategy.
This is a common real-world inflection point, and the right answer depends on whether the problem is memory, compute time, or both.
- First question: does the model actually need 500M rows? Sample 10M rows randomly, train the model, and compare to the model trained on 1M rows. If accuracy improvement is minimal (diminishing returns on the learning curve), you might not need all 500M rows. Many tabular ML problems saturate well before 10M rows. Do not scale unless the data actually helps.
- If you need all the data — memory is the first bottleneck. 500M rows with 100 float64 features is 400GB. That will not fit in a single machine’s RAM. Options: (a) Use out-of-core learning with models that support partial_fit() — SGDClassifier, MiniBatchKMeans. Feed data in chunks. (b) Use Dask or Vaex to process data lazily without loading it all into memory. (c) Downsample intelligently — stratified sampling preserving class distribution and edge cases.
- If the model supports it, use incremental learning. SGDClassifier with partial_fit can process arbitrarily large datasets. The downside: not all models support this. Random Forest and gradient boosting do not support partial_fit in sklearn. For gradient boosting at scale, XGBoost offers an external memory mode that streams data from disk, and LightGBM's histogram binning keeps its in-memory footprint well below the raw data size.
- If you need a non-incremental model on all data, distribute. Use Spark MLlib for distributed Random Forest or Gradient Boosting. Use Ray for distributed hyperparameter search. The overhead of distributed systems is significant — only worth it if you have confirmed the data helps.
- Optimize data formats. Read from Parquet instead of CSV: it is columnar and compressed, so files are typically several times smaller and faster to read (the exact factor depends on the data). Use appropriate dtypes (float32 instead of float64 halves memory). Feature hashing for high-cardinality categoricals reduces dimensionality before training.
- Training time optimization. Use subsampling within the algorithm (subsample parameter in XGBoost), feature subsampling (colsample_bytree), and early stopping. A model with 200 trees trained on a 10% subsample of 500M rows may outperform 1000 trees on 1M rows.
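The memory estimate in the list above is simple arithmetic, and the same formula gives a chunk size for out-of-core training. The sketch below uses decimal gigabytes (1 GB = 10^9 bytes) and counts only the raw feature values; the 2 GB budget is an arbitrary example.

```python
def dataset_size_gb(n_rows, n_features, bytes_per_value):
    """Raw size of a dense numeric matrix in decimal gigabytes."""
    return n_rows * n_features * bytes_per_value / 1e9

def rows_per_chunk(memory_budget_gb, n_features, bytes_per_value):
    """How many rows fit in a given memory budget."""
    return int(memory_budget_gb * 1e9 // (n_features * bytes_per_value))

float64_gb = dataset_size_gb(500_000_000, 100, 8)  # float64 is 8 bytes per value
float32_gb = dataset_size_gb(500_000_000, 100, 4)  # float32 halves it
chunk_rows = rows_per_chunk(2, 100, 8)             # rows per 2 GB chunk of float64
```

Real memory use is higher once indexes, copies made during preprocessing, and the model itself are included, so treat these numbers as a lower bound.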
Describe the training-serving skew problem at scale and how feature stores solve it.
Training-serving skew is the silent killer of ML at scale, and feature stores are the industry-standard solution. The way I explain this:
- The problem: two code paths for the same features. During training, you compute features in batch using pandas or Spark on a data warehouse. Features are computed over historical data with full access to aggregations and joins. During serving, the same features must be computed in real-time from a single incoming request, often in a different language (Python training vs Java/Go serving) or framework.
- How skew manifests. Subtle differences between training and serving computations produce different feature values for the same input. A rolling average computed with pandas uses one interpolation method; the real-time version uses another. Null handling differs. Timestamp rounding differs. Category encoding differs. The model was trained on one version of the features and is served another version. No errors occur — the predictions are just slightly wrong, consistently.
- At scale, this compounds. With 500 features, even 1% skew per feature can accumulate into a significant distributional shift. The model’s accuracy might degrade by 3-5%, and nobody can pinpoint the cause because no single feature is obviously wrong.
- How feature stores solve this. A feature store provides a single definition for each feature, used identically in training and serving. The definition is code (e.g., a SQL query or a transformation function) that is registered once. During training, the feature store materializes historical features with point-in-time correctness (no future data leakage). During serving, the feature store serves the latest precomputed values from a low-latency cache (Redis, DynamoDB).
- Point-in-time correctness. This is the subtle but critical capability. When computing features for a historical training example from January 15, the feature store ensures that only data available on January 15 is used — not data from January 16 that was backfilled later. Without this, you get temporal leakage at scale.
- Feature sharing across teams. At a company with 20 ML models, many models use overlapping features (customer tenure, average transaction amount). Without a feature store, each team recomputes these independently — introducing inconsistency. With a feature store, features are computed once and shared.
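Point-in-time correctness from the list above can be sketched as an "as of" lookup, standard library only. `FeatureHistory` is a hypothetical structure holding one feature for one entity; the dates and values are invented to mirror the January 15 example.

```python
from bisect import bisect_right
from datetime import date

class FeatureHistory:
    """Values of one feature for one entity, keyed by the date each became known."""

    def __init__(self):
        self._dates = []
        self._values = []

    def write(self, known_on, value):
        idx = bisect_right(self._dates, known_on)
        self._dates.insert(idx, known_on)
        self._values.insert(idx, value)

    def as_of(self, when):
        # Training path: latest value known on or before `when`, never later.
        idx = bisect_right(self._dates, when)
        return self._values[idx - 1] if idx else None

    def latest(self):
        # Serving path: the most recent value.
        return self._values[-1] if self._values else None

avg_transaction = FeatureHistory()
avg_transaction.write(date(2024, 1, 10), 120.0)
avg_transaction.write(date(2024, 1, 16), 95.0)  # arrives after the training example

training_value = avg_transaction.as_of(date(2024, 1, 15))
serving_value = avg_transaction.latest()
before_any_data = avg_transaction.as_of(date(2024, 1, 5))
```

Using `latest()` when building the January 15 training row would leak the January 16 value into the past, which is exactly the temporal leakage described above.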
You need to update a model in production serving 50,000 requests per second without any downtime or degraded predictions. How do you do it?
Zero-downtime model updates are a core requirement for any serious ML system. The approach combines infrastructure patterns from software engineering with ML-specific considerations.
- Blue-green deployment for the model. Maintain two identical serving environments. The “blue” environment runs the current model. Deploy the new model to the “green” environment. Run smoke tests and shadow-mode validation on green. Once validated, switch the load balancer from blue to green atomically. Blue stays alive as an instant rollback target.
- Shadow mode testing before switching traffic. Before any live traffic goes to the new model, run it in shadow mode: every request gets sent to both the old and new model, but only the old model’s response is returned to the user. Compare predictions at scale. If the new model’s predictions diverge more than expected from the old model (or from ground truth), investigate before switching.
- Canary deployment for gradual rollout. Instead of switching 100% of traffic at once, route 1% to the new model. Monitor latency, error rates, and prediction distribution. If everything looks good, increase to 5%, then 10%, then 25%, then 100%. At each step, compare key metrics against the control (old model). Automated rollback if any metric exceeds a threshold.
- Handle preprocessing version changes. The trickiest part is not the model swap — it is ensuring that the preprocessing pipeline matches the model version. If Model V2 expects standardized features and Model V1 expected raw features, you need to deploy the preprocessing change atomically with the model change. Bundle the model and its preprocessing into a single artifact.
- Feature store versioning. If the new model uses different features than the old model, both feature sets must be available simultaneously during the canary period. The feature store should serve features keyed by model version.
- Rollback criteria and automation. Define explicit rollback triggers: P99 latency exceeds 100ms (was 50ms), prediction distribution mean shifts by more than 10%, error rate exceeds 0.1%. Automate the rollback — do not rely on a human watching a dashboard at 3 AM.
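The rollback triggers and canary steps from the list above can be expressed as a small automated check. This is a sketch with hypothetical function names and metric keys; the default thresholds are the ones quoted in the text, and the baseline prediction mean is assumed to be non-zero.

```python
CANARY_STEPS = [0.01, 0.05, 0.10, 0.25, 1.0]  # traffic share at each stage

def rollback_reasons(metrics, baseline_mean,
                     max_p99_ms=100.0, max_mean_shift=0.10, max_error_rate=0.001):
    """Return the list of triggers that fired; an empty list means healthy."""
    reasons = []
    if metrics["p99_latency_ms"] > max_p99_ms:
        reasons.append("p99_latency")
    shift = abs(metrics["prediction_mean"] - baseline_mean) / abs(baseline_mean)
    if shift > max_mean_shift:
        reasons.append("prediction_mean_shift")
    if metrics["error_rate"] > max_error_rate:  # 0.001 == 0.1% of requests
        reasons.append("error_rate")
    return reasons

def run_canary(metrics_per_step, baseline_mean):
    """Advance through the canary steps; drop to 0% as soon as a trigger fires."""
    share = 0.0
    for step, metrics in zip(CANARY_STEPS, metrics_per_step):
        reasons = rollback_reasons(metrics, baseline_mean)
        if reasons:
            return 0.0, reasons  # automated rollback, no human in the loop
        share = step
    return share, []

healthy = {"p99_latency_ms": 60.0, "prediction_mean": 0.31, "error_rate": 0.0005}
slow = {"p99_latency_ms": 140.0, "prediction_mean": 0.30, "error_rate": 0.0005}

full_rollout = run_canary([healthy] * 5, baseline_mean=0.30)
aborted = run_canary([healthy, healthy, slow], baseline_mean=0.30)
```

Returning the reasons along with the decision gives the on-call engineer something concrete to read in the morning.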
What are the key differences between scaling ML for batch predictions versus real-time predictions? How does the architecture differ?
Batch and real-time serving have fundamentally different constraints, and conflating them is one of the most common architectural mistakes.
- Batch: optimize for throughput, not latency. Batch predictions run on a schedule (hourly, daily). You process millions of records and write results to a database. The key metric is total processing time, not per-prediction latency. A batch job that takes 2 hours to score 10 million customers is fine: that is 7,200 s / 10,000,000 = 0.72ms per prediction on average. Architecture: scheduled jobs (Airflow, cron) running on large compute instances (Spark cluster, large VM), reading from and writing to data warehouses. The model can be large and complex because inference time is not user-facing.
- Real-time: optimize for latency, concurrency, and reliability. Real-time predictions serve individual requests with strict latency SLAs (typically 10-100ms P99). The key metrics are P99 latency, throughput (requests per second), and availability (five 9s = 5.26 minutes downtime per year). Architecture: model served behind a load balancer on multiple replicated instances, features fetched from a low-latency cache (Redis, in-memory), model loaded in memory at startup, autoscaling based on request volume.
- Feature engineering diverges. Batch features can be arbitrarily complex — join 10 tables, compute 90-day aggregates, run subqueries. Real-time features must be available in milliseconds, so they are either precomputed and cached, or computed from the request payload alone. The feature store bridges this gap by precomputing batch features and serving them at real-time speed.
- Model complexity trade-off. Batch predictions can use XGBoost with 5000 trees. Real-time predictions may need a distilled model with 50 trees or a logistic regression. Alternatively, use a hybrid architecture: run the complex model in batch to precompute predictions for all known entities, then serve from cache. For new or unseen entities, fall back to a simpler real-time model.
- Failure handling differs. If a batch job fails, you re-run it. If a real-time endpoint fails, users see errors. Real-time systems need graceful degradation: if the model fails, return a sensible default (e.g., the most popular recommendation, the average risk score). Batch systems need idempotency: re-running should produce the same results without duplicating side effects.
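The hybrid architecture and graceful degradation described above can be combined into one lookup chain. A minimal sketch, standard library only; `HybridPredictor`, the entity IDs, and the scores are hypothetical.

```python
class HybridPredictor:
    """Batch cache first, simple real-time model second, default last."""

    def __init__(self, precomputed, realtime_model, default):
        self.precomputed = precomputed        # entity_id -> score from the batch job
        self.realtime_model = realtime_model  # small, fast fallback model
        self.default = default                # sensible constant if all else fails

    def predict(self, entity_id, features):
        if entity_id in self.precomputed:
            return self.precomputed[entity_id], "batch_cache"
        try:
            return self.realtime_model(features), "realtime_model"
        except Exception:
            # Graceful degradation: never surface a model failure to the user.
            return self.default, "default"

def simple_model(features):
    # Stand-in for a light model; raises KeyError if a feature is missing.
    return 0.2 + 0.1 * features["recent_logins"]

predictor = HybridPredictor(
    precomputed={"user_1": 0.87},  # written by the nightly batch job
    realtime_model=simple_model,
    default=0.5,                   # e.g. the average risk score
)

known = predictor.predict("user_1", {})
unseen = predictor.predict("user_2", {"recent_logins": 3})
degraded = predictor.predict("user_3", {})  # features missing, model raises
```

Returning the source alongside the score makes it possible to monitor how often each tier is used, which is an early warning when the batch job falls behind.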