End-to-End ML Project

The Complete ML Workflow

This module brings everything together in a real project. In practice, this workflow is never linear — you’ll jump back to EDA when your model fails, revisit feature engineering when evaluation reveals blind spots, and retune when new data arrives. Think of it as a spiral, not a waterfall.
  1. Problem Definition: What are we solving? What metric defines “success”?
  2. Data Collection: Get the data (often the hardest step)
  3. EDA: Understand the data before touching any model
  4. Feature Engineering: Transform raw data into model-ready features
  5. Model Selection: Choose 2-3 candidate algorithms
  6. Training: Fit models with cross-validation
  7. Evaluation: Measure performance on held-out data
  8. Tuning: Optimize hyperparameters for the best candidate
  9. Deployment: Make it usable in production

Project: Predicting Customer Churn

Business Problem: A telecom company wants to predict which customers will leave (churn) so they can offer them incentives to stay.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

Step 1: Load and Explore Data

# The columns below mirror a telco customer-churn dataset.
# In practice, you'd load real data with: pd.read_csv('customer_churn.csv')

# Here we simulate a dataset with the same structure
n_samples = 5000

data = {
    'customer_id': range(1, n_samples + 1),
    'tenure': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 7000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'senior_citizen': np.random.choice([0, 1], n_samples),
    'partner': np.random.choice(['Yes', 'No'], n_samples),
    'dependents': np.random.choice(['Yes', 'No'], n_samples),
}

# Create churn with realistic patterns
churn_prob = (
    0.1 +  # Base rate
    0.3 * (data['contract'] == 'Month-to-month').astype(int) +
    0.2 * (data['monthly_charges'] > 70) +
    -0.15 * (data['tenure'] > 36) +
    0.1 * (data['payment_method'] == 'Electronic check').astype(int)
)
churn_prob = np.clip(churn_prob, 0.05, 0.8)
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nChurn Distribution:")
print(df['churn'].value_counts(normalize=True))
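
Once real data replaces the simulation, a few cheap sanity checks are worth running before any plotting. The sketch below is only illustrative: `basic_quality_report` is a hypothetical helper name (not a pandas function) and the four-row frame is made up for the example.

```python
import numpy as np
import pandas as pd


def basic_quality_report(frame: pd.DataFrame, id_col: str) -> dict:
    """Summarize common data problems (hypothetical helper, not a pandas API)."""
    return {
        "n_rows": len(frame),
        "n_duplicate_ids": int(frame[id_col].duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
        # A column with a single distinct value carries no signal for a model.
        "constant_columns": [
            c for c in frame.columns if frame[c].nunique(dropna=False) <= 1
        ],
    }


# Tiny made-up frame: one duplicated id, one missing tenure, one constant column.
toy = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "tenure": [5, np.nan, 12, 40],
    "contract": ["One year"] * 4,
})

report = basic_quality_report(toy, id_col="customer_id")
print(report)
```

On the tutorial's frame the equivalent call would be `basic_quality_report(df, id_col='customer_id')`.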

Step 2: Exploratory Data Analysis (EDA)

# Churn rate by contract type
print("Churn Rate by Contract Type:")
print(df.groupby('contract')['churn'].mean().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

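# 1. Churn distribution
# sort_index() keeps the order 0 = Stayed, 1 = Churned, so the labels stay
# correct even if churners ever become the majority class.
df['churn'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=axes[0, 0],
                                                  labels=['Stayed', 'Churned'])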
axes[0, 0].set_title('Churn Distribution')

# 2. Tenure distribution by churn
axes[0, 1].hist([df[df['churn']==0]['tenure'], df[df['churn']==1]['tenure']], 
                label=['Stayed', 'Churned'], bins=20, alpha=0.7)
axes[0, 1].set_xlabel('Tenure (months)')
axes[0, 1].set_title('Tenure by Churn Status')
axes[0, 1].legend()

# 3. Monthly charges by churn
df.boxplot(column='monthly_charges', by='churn', ax=axes[1, 0])
axes[1, 0].set_title('Monthly Charges by Churn')

# 4. Contract type vs churn
# groupby sorts alphabetically: Month-to-month, One year, Two year
churn_by_contract = df.groupby('contract')['churn'].mean()
churn_by_contract.plot.bar(ax=axes[1, 1], color=['red', 'orange', 'green'])
axes[1, 1].set_title('Churn Rate by Contract')
axes[1, 1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()
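
A churn rate on its own can mislead when a group is tiny, so it helps to print the group size next to it. The following is a minimal sketch on a made-up six-row frame; the column names match the tutorial, the values do not come from it.

```python
import pandas as pd

# Made-up rows standing in for the tutorial's df.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "One year", "Month-to-month"],
    "churn": [1, 1, 0, 0, 1, 0],
})

# Named aggregation gives the rate and the number of customers behind it.
summary = (
    toy.groupby("contract")["churn"]
       .agg(churn_rate="mean", n_customers="size")
       .sort_values("churn_rate", ascending=False)
)
print(summary)
```

The same aggregation can be looped over every entry of a list of categorical columns to get one table per feature.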

Step 3: Feature Engineering

# Separate features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

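# Identify column types
# senior_citizen is already a 0/1 flag. It is listed here because the
# ColumnTransformer drops any column that is not explicitly routed.
numeric_cols = ['tenure', 'monthly_charges', 'total_charges', 'senior_citizen']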
categorical_cols = ['contract', 'internet_service', 'online_security', 
                    'tech_support', 'payment_method', 'paperless_billing',
                    'partner', 'dependents']

# Create new features -- each one encodes a business hypothesis.
# "charges_per_month" normalizes total spend by tenure length.
# Without this, a 5-year customer who paid $5000 total looks the same
# as a 1-year customer who paid $5000 total, even though the second
# one is paying 5x more per month.
X['charges_per_month'] = X['total_charges'] / (X['tenure'] + 1)  # +1 avoids division by zero

# Binary flags are powerful for tree-based models -- they create
# easy split points that capture domain knowledge.
X['is_long_term'] = (X['tenure'] > 24).astype(int)  # 2+ years = loyal
X['high_charges'] = (X['monthly_charges'] > X['monthly_charges'].median()).astype(int)

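# Register every engineered column. The ColumnTransformer drops columns that
# are not listed in numeric_cols or categorical_cols, so an unregistered
# feature would silently never reach the model.
numeric_cols.extend(['charges_per_month', 'is_long_term', 'high_charges'])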

print("Feature Engineering Complete!")
print(f"Numeric features: {len(numeric_cols)}")
print(f"Categorical features: {len(categorical_cols)}")
# Tip: Feature engineering is iterative. After your first model,
# look at misclassified examples to spot patterns that suggest
# new features. Did the model miss all customers with high charges
# AND month-to-month contracts? Create that interaction feature.
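
The interaction feature mentioned in the tip can be sketched as follows. The column name `high_charge_monthly` and the threshold of 70 are illustrative assumptions, and the four rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "monthly_charges": [95.0, 99.0, 30.0, 80.0],
})

# Illustrative cutoff, not a tuned value.
HIGH_CHARGE_THRESHOLD = 70

# 1 only when BOTH conditions hold: month-to-month contract AND high charges.
toy["high_charge_monthly"] = (
    (toy["contract"] == "Month-to-month")
    & (toy["monthly_charges"] > HIGH_CHARGE_THRESHOLD)
).astype(int)
print(toy)
```

A new column like this only reaches the model if it is also added to the column list that feeds the preprocessor.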

Step 4: Preprocessing Pipeline

Using sklearn Pipelines is not optional in production ML — it’s the difference between “works on my laptop” and “works reliably in production.” Pipelines prevent data leakage (fitting the scaler on test data), ensure reproducibility, and make deployment a single pipeline.predict() call instead of a fragile sequence of manual transforms.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Build preprocessor -- one pipeline per data type.
# The ColumnTransformer routes each column to the right pipeline.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Median is robust to outliers
    ('scaler', StandardScaler())                     # Required for LR and SVM, harmless for trees
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),  # Explicit "missing" category
    # drop='first' avoids the "dummy variable trap" (multicollinearity).
    # handle_unknown='ignore' prevents crashes when test data has unseen categories;
    # such rows are encoded as all zeros, the same as the dropped first category.
    # sparse_output needs scikit-learn >= 1.2 (older versions call it `sparse`).
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
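
To make the leakage point concrete, here is a small sketch on synthetic numbers (not the churn data). It compares a scaler fitted on all rows with one fitted on the training rows only; the variable names are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=(200, 1))
train, test = train_test_split(values, test_size=0.25, random_state=0)

# Leaky: the statistics are computed from rows that include the test set.
leaky_scaler = StandardScaler().fit(values)
# Clean: the statistics come from training rows only, which is what a
# Pipeline does automatically inside every cross-validation fold.
clean_scaler = StandardScaler().fit(train)

print("mean learned with leakage:   ", leaky_scaler.mean_[0])
print("mean learned from train only:", clean_scaler.mean_[0])
```

The two means differ only slightly here, but the leaky version has used information from rows that are supposed to be unseen, so any score computed on those rows is optimistic.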

Step 5: Model Selection and Comparison

A senior engineer’s approach to model selection: never bet on one model. Train 3-4 candidates with default hyperparameters, compare on the same cross-validation folds, then invest tuning effort only on the top 1-2. It’s like auditioning actors — you don’t give everyone a costume fitting before the first read-through.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Models to compare -- intentionally using default hyperparameters.
# The goal here is to find which FAMILY of models works best,
# not to squeeze out every last % of performance.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}
# Model selection tip: For tabular data, Gradient Boosting or Random Forest
# is usually a strong starting point. Logistic Regression is your interpretability
# baseline. SVM is worth trying for small datasets (<10K rows).

# Create pipelines and evaluate
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std(),
        'pipeline': pipeline
    }
    
    print(f"{name:22s}: AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Select best model based on cross-validation AUC.
# Important: don't just pick the highest number -- also consider:
# 1. Standard deviation (lower = more reliable estimate)
# 2. Training time (matters for retraining in production)
# 3. Interpretability (can you explain predictions to stakeholders?)
# A model that's 0.5% worse in AUC but 10x faster to train and
# easy to explain might be the better business choice.
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
print(f"\nBest Model: {best_model_name}")
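
The comments above list training time as a selection criterion. One way to collect it alongside AUC is `cross_validate`, which reports fit times per fold. This is a sketch on synthetic data from `make_classification`, with only two candidates to keep it quick; it is not the tutorial's comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

comparison = {}
for name, model in candidates.items():
    cv = cross_validate(model, X_toy, y_toy, cv=5, scoring="roc_auc")
    comparison[name] = {
        "mean_auc": cv["test_score"].mean(),
        "std_auc": cv["test_score"].std(),
        "mean_fit_seconds": cv["fit_time"].mean(),  # wall-clock time per fold
    }
    print(name, comparison[name])
```

Fit times depend on the machine and on `n_jobs`, so they are best read as relative numbers between candidates.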

Step 6: Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

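# Tune the best model (assuming Gradient Boosting won the comparison in Step 5)
# randint(low, high) excludes `high`; uniform(loc, scale) samples from [loc, loc + scale].
param_distributions = {
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(3, 10),
    'classifier__learning_rate': uniform(0.01, 0.3),  # i.e. between 0.01 and 0.31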
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10)
}

best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    best_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.4f}")
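
Looking only at `best_params_` hides how close the runners-up were. A fitted search object also exposes `cv_results_`, which can be ranked as a table. The sketch below runs a deliberately tiny search on synthetic data so it finishes quickly; the parameter ranges are assumptions for the illustration.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": randint(10, 40), "max_depth": randint(2, 5)},
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_toy, y_toy)

# One row per sampled configuration, best first.
ranked = (
    pd.DataFrame(search.cv_results_)
      .sort_values("rank_test_score")
      [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
)
print(ranked.head(3))
```

If the top few rows are within one standard deviation of each other, the simpler configuration (fewer trees, shallower depth) is often a reasonable pick.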

Step 7: Final Evaluation

from sklearn.metrics import roc_curve, precision_recall_curve

# Get best model
final_model = random_search.best_estimator_

# Predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# ROC AUC
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")

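# Plot the confusion matrix and ROC curve side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Stayed', 'Churned'], yticklabels=['Stayed', 'Churned'])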
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_prob):.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()
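
`predict()` uses a 0.5 cutoff, which is rarely the right one for churn. A possible way to pick a different cutoff is `precision_recall_curve`: require a minimum recall and take the most precise threshold that still meets it. The labels, scores, and the 0.75 recall target below are made up for the sketch.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores standing in for y_test and y_prob.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds,
# so the final point is dropped before indexing.
target_recall = 0.75  # illustrative business requirement (assumption)
eligible = recall[:-1] >= target_recall
best_idx = np.argmax(np.where(eligible, precision[:-1], -1.0))
chosen_threshold = thresholds[best_idx]

print(f"threshold={chosen_threshold:.2f}, "
      f"precision={precision[best_idx]:.2f}, recall={recall[best_idx]:.2f}")

# Apply the custom cutoff instead of the default 0.5.
y_pred_custom = (y_score >= chosen_threshold).astype(int)
```

The threshold should be chosen on validation data rather than on the final test set, otherwise the test metrics are no longer an unbiased estimate.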

Step 8: Feature Importance Analysis

# Get feature names after preprocessing
feature_names = (
    numeric_cols + 
    list(final_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_cols))
)

# Get feature importances
importances = final_model.named_steps['classifier'].feature_importances_

# Sort and plot
indices = np.argsort(importances)[::-1][:15]

plt.figure(figsize=(12, 6))
plt.bar(range(15), importances[indices])
plt.xticks(range(15), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Top 15 Most Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

# Print top features
print("\nTop 10 Important Features:")
for i in indices[:10]:
    print(f"  {feature_names[i]:30s}: {importances[i]:.4f}")
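
Impurity-based `feature_importances_` are computed from the training data, so a useful cross-check is permutation importance on held-out rows. The sketch below uses synthetic data where, by construction (`shuffle=False`), the first two columns are the informative ones; the model settings are assumptions chosen for speed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the informative features are columns 0 and 1;
# columns 2-4 are pure noise.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the AUC drop.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Passing a full pipeline and the raw test frame to `permutation_importance` should report importances per original column rather than per one-hot column, which is usually easier to explain to stakeholders.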

Step 9: Business Insights

# Identify high-risk customers
X_test_copy = X_test.copy()
X_test_copy['churn_probability'] = y_prob
X_test_copy['predicted_churn'] = y_pred
X_test_copy['actual_churn'] = y_test.values

# High-risk customers (probability > 0.7)
high_risk = X_test_copy[X_test_copy['churn_probability'] > 0.7]
print(f"High-Risk Customers (>70% churn probability): {len(high_risk)}")
print("\nProfile of High-Risk Customers:")
print(high_risk[['contract', 'tenure', 'monthly_charges', 'churn_probability']].describe())

# Recommendations
print("\n" + "="*50)
print("BUSINESS RECOMMENDATIONS")
print("="*50)
print("""
1. TARGET MONTH-TO-MONTH CUSTOMERS
   - Highest churn rate
   - Offer discounts for annual contracts
   
2. FOCUS ON NEW CUSTOMERS (tenure < 12 months)
   - Most vulnerable period
   - Implement onboarding program
   
3. REVIEW HIGH-CHARGE CUSTOMERS
   - Consider loyalty discounts
   - Ensure they're getting value
   
4. ELECTRONIC CHECK USERS
   - Higher churn rate
   - Encourage automatic payment methods
""")

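A fixed 0.7 cutoff can return very few customers or far too many. When the retention team has a fixed capacity, ranking by probability and taking the top N is a simple alternative. The customer ids, probabilities, and the capacity of 3 are made up for this sketch.

```python
import pandas as pd

# Made-up scores standing in for X_test_copy['churn_probability'].
scored = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106],
    "churn_probability": [0.82, 0.15, 0.64, 0.91, 0.30, 0.72],
})

# Hypothetical capacity: the retention team can contact 3 customers.
CAMPAIGN_CAPACITY = 3

# nlargest returns the rows already sorted from highest to lowest score.
call_list = scored.nlargest(CAMPAIGN_CAPACITY, "churn_probability")
print(call_list)
```
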
Step 10: Save the Model

import joblib

# Save the ENTIRE pipeline (preprocessor + model) -- not just the model!
# This is critical: if you save only the model and forget the scaler,
# production predictions will be wrong because raw features won't match
# what the model was trained on.
joblib.dump(final_model, 'churn_model.pkl')
print("Model saved to 'churn_model.pkl'")

# How to load and use in production:
# loaded_model = joblib.load('churn_model.pkl')
# predictions = loaded_model.predict(new_data)  # raw data goes in, predictions come out
# The pipeline handles all preprocessing internally.

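# Also save metadata for future debugging:
import json
from datetime import date

metadata = {
    'training_date': date.today().isoformat(),
    'n_training_samples': len(X_train),
    'features': list(X_train.columns),
    # Sampled values can be NumPy scalars, which json cannot serialize directly.
    'best_params': {k: (v.item() if hasattr(v, 'item') else v)
                    for k, v in random_search.best_params_.items()},
    'cv_auc': float(random_search.best_score_),
}
with open('churn_model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)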

Production Considerations

Model Monitoring

  • Track prediction drift over time
  • Monitor for data quality issues
  • Set up alerts for performance degradation
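
One common way to quantify drift is the Population Stability Index (PSI), which compares the binned distribution of a feature (or of the predicted probabilities) at training time with a more recent sample. The function below is a minimal sketch, not a library API; the name `population_stability_index`, the 10 quantile bins, and the simulated charges are all assumptions.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10):
    """PSI of one numeric feature between a reference sample and a newer one."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from reference quantiles, so each bin holds
    # roughly 1/n_bins of the reference data.
    interior_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(
        np.searchsorted(interior_edges, reference, side="right"), minlength=n_bins
    )
    cur_counts = np.bincount(
        np.searchsorted(interior_edges, current, side="right"), minlength=n_bins
    )
    eps = 1e-6  # keeps log() finite when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(42)
training_charges = rng.normal(65, 20, 5000)      # simulated training-time feature
same_distribution = rng.normal(65, 20, 5000)     # new data, no drift
shifted_distribution = rng.normal(80, 20, 5000)  # new data, mean shifted upward

psi_stable = population_stability_index(training_charges, same_distribution)
psi_shifted = population_stability_index(training_charges, shifted_distribution)
print(f"PSI without drift: {psi_stable:.4f}")
print(f"PSI with drift:    {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees and should be calibrated per feature.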

A/B Testing

  • Test model in production with a subset
  • Compare with baseline
  • Gradually roll out
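
Comparing against the baseline usually comes down to asking whether the churn rate in the model-driven group really differs from the control group. A two-proportion z-test is one simple option; the sketch below uses only the standard library, and the group sizes and churn counts are invented for the example.

```python
from math import erf, sqrt


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF written with erf, then doubled for a two-sided test.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Invented numbers: 260 of 1000 control customers churned,
# 210 of 1000 customers targeted with model-driven offers churned.
z, p_value = two_proportion_z_test(260, 1000, 210, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The test assumes customers were assigned to the two groups at random and that the sample size was fixed before looking at the results.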

Retraining Schedule

  • Retrain periodically (weekly/monthly)
  • Automate the pipeline
  • Version your models

Documentation

  • Document feature definitions
  • Record model decisions
  • Maintain changelog

🚀 Mini Projects

Project 1: Loan Default Predictor

Build a complete loan default prediction system with EDA, feature engineering, and model selection.

Project 2: Employee Attrition Analyzer

Predict which employees might leave and understand why.

Project 3: Product Recommendation Engine

Build a simple collaborative filtering recommendation system.

Project 4: ML Pipeline with Logging

Create a production-ready ML pipeline with proper logging and experiment tracking.

Key Takeaways

Start with Business

Understand the problem before touching data

EDA is Critical

Visualize and understand your data first

Iterate Quickly

Start simple, then improve

Evaluate Properly

Use appropriate metrics for your problem

What’s Next?

Great job completing the end-to-end project! Now let’s explore unsupervised learning with clustering.

Continue to Module 11: Clustering

Learn to find patterns when you don’t have labels - K-Means, DBSCAN, and more