ML Pipelines
The Problem with Notebook Code
This is what most ML code looks like in tutorials and Jupyter notebooks. Can you spot the problems? (Hint: there are at least four.)
- Data leakage: Preprocessing uses test data statistics, inflating your evaluation scores
- Not reproducible: If you change one step, you need to manually re-run everything downstream
- Not deployable: In production, you need to apply the exact same transformations in the exact same order. Good luck remembering which scaler was fit on which data six months later.
- Fragile: Easy to make a mistake when manually chaining steps — and these bugs are silent (no error, just wrong results)
The Pipeline Solution
- No data leakage: each step is fit only on training data during fit(), then applied to test data during predict()
- Reproducible: the exact same sequence of transformations, every time, no manual steps
- Deployable: save one object, load it in production, call predict() on raw data
- Clean: one object replaces dozens of lines of manual transformation code
Building Pipelines Step by Step
Basic Pipeline
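A basic pipeline could look like the following minimal sketch. The synthetic dataset, the step names ("scaler", "model"), and the choice of LogisticRegression are assumptions made for illustration, not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler statistics from X_train only, then fits the model.
pipe.fit(X_train, y_train)

# score() only transforms X_test with the already-fitted scaler.
test_accuracy = pipe.score(X_test, y_test)

# The fitted scaler's mean matches the training data, not the full dataset.
scaler_mean = pipe.named_steps["scaler"].mean_
```

Checking that `scaler_mean` equals `X_train.mean(axis=0)` is a quick way to confirm that no test rows contributed to the preprocessing statistics.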
Using make_pipeline (Auto-naming)
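As a short sketch of the auto-naming behavior, make_pipeline derives each step name from the lowercased class name. The particular estimators chosen here are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No explicit names: make_pipeline generates them from the class names.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```

The generated names are what you would use in nested parameter syntax, for example `logisticregression__C`.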
Column Transformer: Different Preprocessing for Different Features
Real data is messy. You have numeric columns (age, income) that need scaling, categorical columns (city, education) that need encoding, and maybe binary columns that need nothing at all. ColumnTransformer lets you define different preprocessing recipes for different columns and bundles them into a single step.
Complete Real-World Pipeline
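One way to combine ColumnTransformer with a model is sketched below. The toy DataFrame, its column names, and the step names ("num", "cat", "bin", "preprocessor", "classifier") are hypothetical and chosen only to illustrate the routing of columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [40_000, 52_000, 61_000, np.nan, 83_000, 58_000],
    "city": ["NYC", "LA", "NYC", np.nan, "SF", "LA"],
    "is_member": [0, 1, 1, 0, 1, 0],
})
y = np.array([0, 1, 0, 1, 1, 0])

# Numeric columns: impute, then scale.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: impute, then one-hot encode.
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
    ("bin", "passthrough", ["is_member"]),  # binary flag needs no processing
])

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(df, y)

# 2 scaled numeric + 3 one-hot city columns + 1 passthrough column.
Xt = model.named_steps["preprocessor"].transform(df)
predictions = model.predict(df)
```

Passing the raw DataFrame to `model.fit()` and `model.predict()` keeps every transformation inside the pipeline object.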
Hyperparameter Tuning with Pipelines
One of the most powerful features of pipelines: you can tune hyperparameters across ALL steps (preprocessing and model) in a single grid search. Access nested parameters with step__param syntax (double underscore). This means you can search over “should I use mean or median imputation?” alongside “what learning rate works best?”, and the grid search will find the best combination automatically.
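The step__param syntax can be sketched as follows. The synthetic data, the step names, and the use of LogisticRegression's C (standing in for whatever model hyperparameter you care about) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])  # 2 strategies x 3 values of C
```

Because the whole pipeline is refit inside each fold, the preprocessing choices are evaluated under the same leak-free conditions as the model parameters.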
Cross-Validation with Pipelines
Pipelines ensure proper CV without data leakage:
- Each fold: Fit preprocessor on train fold → transform test fold
- No information from test fold leaks into preprocessing
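In code, this amounts to passing the whole pipeline, rather than pre-transformed data, to the cross-validation helper. The sketch below assumes synthetic data and a scaler plus logistic regression as placeholder steps.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For each of the 5 folds, a fresh copy of the pipeline is fit on the
# training part only and then scored on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
mean_score = scores.mean()
```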
Custom Transformers
When sklearn’s built-in transformers are not enough, you can create your own. The key is inheriting from BaseEstimator and TransformerMixin, then implementing fit() (learn from training data) and transform() (apply to any data). This pattern ensures your custom logic works seamlessly inside pipelines, cross-validation, and grid search, with no special cases needed.
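A minimal sketch of this pattern is shown below. QuantileClipper is a hypothetical transformer invented for this example: it learns per-column clipping bounds in fit() and applies them in transform().

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip columns to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore by sklearn convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Only the bounds learned in fit() are used; nothing is re-estimated here.
        return np.clip(X, self.low_, self.high_)


X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_new = np.array([[-50.0], [2.5], [500.0]])

clipper = QuantileClipper(lower=0.0, upper=0.75).fit(X_train)
clipped = clipper.transform(X_new)
print(clipped.ravel())  # values: 1.0, 2.5, 4.0
```

The new data is clipped to the bounds of the training data (1.0 and 4.0), which is the behavior that keeps the transformer leak-free inside cross-validation.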
Saving and Loading Pipelines
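One common way to persist a fitted pipeline is joblib, sketched below. The file name pipeline.joblib and the temporary directory are assumptions for illustration; in practice you would write to a versioned artifact location.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)        # save preprocessing + model as one object
    restored = joblib.load(path)   # load it back, e.g. in a serving process

# The restored pipeline should reproduce the original predictions.
same_predictions = bool(np.array_equal(restored.predict(X), pipe.predict(X)))
```

Serialized pipelines are generally only safe to load with matching library versions, and only from sources you trust, since loading a pickle can execute arbitrary code.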
Pipeline Visualization
Debugging Pipelines
Inspect Intermediate Results
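To look at what the model actually receives, one option is to slice the fitted pipeline and inspect individual steps, as in this sketch. The data and step names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[::15, 0] = np.nan  # inject some missing values so the imputer has work to do

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Slicing off the final estimator gives a pipeline of the fitted transformers.
X_before_model = pipe[:-1].transform(X)

# Individual fitted steps are available by name.
imputer_values = pipe.named_steps["imputer"].statistics_
column_means = X_before_model.mean(axis=0)  # should be close to zero after scaling
```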
Memory and Caching
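Caching fitted transformers can be sketched with Pipeline's memory parameter. The PCA step stands in for an expensive transformation, and the temporary cache directory is an assumption for illustration.

```python
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

with tempfile.TemporaryDirectory() as cache_dir:
    pipe = Pipeline(
        [
            ("pca", PCA(n_components=5)),  # stand-in for an expensive step
            ("classifier", LogisticRegression(max_iter=1000)),
        ],
        memory=cache_dir,  # fitted transformers are cached on disk here
    )
    # Only classifier parameters vary, so the PCA fit for each training
    # fold can be served from the cache instead of being recomputed.
    search = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=3)
    search.fit(X, y)
    best_C = search.best_params_["classifier__C"]
    n_components = search.best_estimator_.named_steps["pca"].n_components_
```

When memory is set, the pipeline clones its transformers before fitting, so inspect fitted steps through named_steps on the fitted pipeline rather than through the original transformer instances.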
Production Pipeline Template
Key Takeaways
Pipelines Prevent Leakage
Preprocessing is fit only on training data
One Object Does All
Fit, transform, predict in one call
Easy Deployment
Save/load entire workflow
Clean Hyperparameter Tuning
Search across all pipeline parameters
What’s Next?
Let’s wrap up with a checklist of common ML mistakes to avoid!
Continue to ML Mistakes Checklist
Avoid the pitfalls that trip up even experienced practitioners
Interview Deep-Dive
You need to build an ML pipeline that handles mixed data types, custom feature engineering, and hyperparameter tuning -- all while preventing data leakage. Walk me through your design.
The key design principle is that every data transformation must live inside the pipeline so that cross-validation and deployment are both correct by construction.
- ColumnTransformer for mixed data types. Define separate preprocessing paths for numeric features (imputation then scaling), categorical features (imputation then one-hot encoding), and any passthrough features (binary flags that need no transformation). ColumnTransformer routes each feature to its appropriate path and concatenates the results.
- Custom transformers for domain-specific feature engineering. Inherit from BaseEstimator and TransformerMixin to create custom steps. The critical design rule: the fit() method learns parameters from training data, and transform() applies those parameters to any data. For example, a “TargetEncoder” custom transformer would compute mean target values per category in fit() and apply those mappings in transform(). This prevents leakage because the target means are computed only on training folds.
- Pipeline composes everything sequentially. The full pipeline chains: FunctionTransformer (for stateless feature engineering like ratios), then ColumnTransformer (for type-specific preprocessing), then a feature selection step (SelectFromModel), then the estimator. Each step’s fit() is called only on training data during cross-validation.
- Hyperparameter tuning across the entire pipeline. GridSearchCV or RandomizedSearchCV with the double-underscore notation (preprocessor__num__imputer__strategy, classifier__max_depth) searches across preprocessing and model parameters simultaneously. This is powerful because the optimal imputation strategy might depend on the model — a tree model might prefer median imputation while a linear model might prefer mean.
- Caching for expensive transformations. If the preprocessor is expensive (e.g., large text vectorization), use Pipeline’s memory parameter with joblib caching. During grid search, each fitted transformer is cached per training fold and preprocessing setting, so it is reused across classifier parameter combinations instead of being refit every time.
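The design above can be sketched end to end as follows. Everything specific here is a hypothetical illustration: the synthetic DataFrame, the add_ratio helper, the column names, and the choice of random forests for both feature selection and classification.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def add_ratio(df):
    """Stateless feature engineering: add a debt-to-income ratio column."""
    out = df.copy()
    out["debt_ratio"] = out["debt"] / out["income"]
    return out


# Hypothetical synthetic data; the target depends on the engineered ratio.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.uniform(20_000, 120_000, n),
    "debt": rng.uniform(0, 50_000, n),
    "city": rng.choice(["NYC", "LA", "SF"], n),
    "is_member": rng.integers(0, 2, n),
})
y = (df["debt"] / df["income"] > 0.4).astype(int).to_numpy()

numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocessor = ColumnTransformer([
    ("num", numeric, ["income", "debt", "debt_ratio"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("bin", "passthrough", ["is_member"]),
])

pipe = Pipeline([
    ("features", FunctionTransformer(add_ratio)),
    ("preprocessor", preprocessor),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Double-underscore names reach into nested steps.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(df, y)

best_params = search.best_params_
best_score = search.best_score_
```

Because feature engineering, preprocessing, selection, and the classifier all live in one object, each CV fold refits every stateful step on its own training rows.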
A junior data scientist shows you their pipeline that achieves 96% accuracy. You suspect data leakage through the pipeline. How do you audit it?
Pipeline leakage audits are methodical. I follow a specific checklist that catches the most common leakage patterns.
- Audit 1: Check the order of operations. Is there any preprocessing done BEFORE the data enters the pipeline? If the scaler was fit on the full dataset and then the scaled data was passed to the pipeline, the pipeline is clean but the data is already leaky. Grep the notebook for any fit_transform() calls that happen before the train-test split.
- Audit 2: Check for fit_transform() on full data inside custom transformers. A custom transformer might internally call fit on the input data in its transform() method rather than using parameters learned during fit(). For example, a ZScoreNormalizer that computes mean and std in transform() instead of using self.mean_ and self.std_ from fit(). This is leakage because test data statistics influence the transformation.
- Audit 3: Run the permutation test. Replace the target variable with random noise (y_random = np.random.randint(0, 2, len(y))). Rerun the full workflow, including any target-dependent preprocessing and cross-validation, on the random labels. If accuracy is significantly above chance (about 50% for balanced binary labels), the workflow is leaking label information. A leak-free pipeline should score at chance on random labels.
- Audit 4: Compare CV score to fresh holdout score. If the pipeline reports 96% CV accuracy but a completely fresh holdout (data set aside before any analysis) scores 82%, there is leakage somewhere in the pipeline or the CV procedure. A gap of more than 3-5 percentage points warrants investigation.
- Audit 5: Check for target encoding leakage. If any feature is derived from the target variable (e.g., mean target per category), verify that this encoding is done inside the pipeline (computed fresh in each CV fold) and not precomputed on the full dataset.
- Audit 6: Feature importance sanity check. If a feature that should not logically be predictive (like row_id, timestamp, or file_name) shows up as the most important feature, that is almost certainly leakage. The feature is acting as a key to look up the answer rather than learning a pattern.
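The random-label check from Audit 3 can be sketched like this. The synthetic data and the simple scaler plus logistic regression pipeline are assumptions; in a real audit you would substitute the pipeline under review.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Random labels carry no relationship to the features.
rng = np.random.default_rng(0)
y_random = rng.integers(0, 2, size=len(y))

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

real_score = cross_val_score(pipe, X, y, cv=5).mean()
random_score = cross_val_score(pipe, X, y_random, cv=5).mean()

# A leak-free workflow should land near chance on the random labels.
print(f"real labels: {real_score:.3f}, random labels: {random_score:.3f}")
```

If the score on random labels stays well above chance across several seeds, some step outside the cross-validated pipeline has probably seen the labels.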
How would you migrate a messy notebook-based ML workflow to a production pipeline without breaking the existing model's behavior?
This is one of the most common real-world tasks for ML engineers, and it requires disciplined engineering rather than ML cleverness.
- Step 1: Reproduce the notebook results exactly. Before changing anything, run the notebook end-to-end and record every metric: accuracy, precision, recall, feature importance rankings, prediction distribution. These become your regression tests. If you cannot reproduce the notebook results, stop — there is a hidden dependency (random seed, data version, library version) that must be identified first.
- Step 2: Extract preprocessing into pipeline steps. Go through the notebook cell by cell. Every data transformation (scaling, imputation, encoding, feature engineering) becomes either a built-in sklearn transformer or a custom transformer class. The critical rule: the order and logic of transformations must be identical to the notebook. Do not “improve” anything yet — the goal is exact behavioral equivalence.
- Step 3: Verify equivalence numerically. After building the pipeline, pass the same training data through both the notebook code and the pipeline. Compare intermediate outputs (feature matrices after preprocessing) using np.allclose(). Any difference beyond floating-point tolerance must be investigated, because small discrepancies from a changed order of operations can compound in later steps.
- Step 4: Run the same cross-validation. The pipeline’s CV score should match the notebook’s CV score within floating-point tolerance. If they differ by more than 0.1%, something in the pipeline differs from the notebook logic.
- Step 5: Only then start improving. Now that you have a clean, tested pipeline that exactly reproduces the notebook, you can safely refactor. Add proper train-test splits, fix any leakage the notebook had, add monitoring hooks. Each change is validated against the regression tests.
- Step 6: Version everything. Pin the library versions (requirements.txt or conda environment), version the pipeline code (git), and version the trained model artifact. If anything breaks in production, you can trace exactly which change caused it.
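The equivalence check in Step 3 can be sketched as follows. The "notebook" code here is a stand-in written for this example (manual imputation followed by scaling); a real migration would compare against the actual notebook outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=5, random_state=0)
X[::10, 0] = np.nan  # some missing values to exercise the imputer

# Notebook-style version: manual, step-by-step transformations.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_notebook = scaler.fit_transform(imputer.fit_transform(X))

# Pipeline version: same steps, same order, same settings.
preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_pipeline = preprocess.fit_transform(X)

# Regression test: the two feature matrices should agree.
matches = bool(np.allclose(X_notebook, X_pipeline))
max_abs_diff = float(np.max(np.abs(X_notebook - X_pipeline)))
```

Recording `max_abs_diff` alongside the boolean check makes it easier to see how far apart the two versions are when the comparison fails.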