Capstone Project: Complete ML System

Real-World Capstone - Telecom Customer Churn Prediction

Project Overview

You’ll build a Customer Churn Prediction System - predicting which customers will leave a subscription service. This project synthesizes everything you’ve learned:

Data exploration and cleaning
Feature engineering
Model selection and evaluation
Deployment considerations

Part 1: Problem Definition

Business Context

A telecom company loses 15-20% of its customers each year. Each lost customer costs:

Revenue loss: $500-2000/year
Acquisition cost for replacement: $300-500

Your Goal: Identify at-risk customers BEFORE they leave, so the retention team can intervene.

Success Metrics

Metric	Target	Why
Recall	> 70%	Catch most churners
Precision	> 50%	Don’t waste retention budget
AUC-ROC	> 0.80	Overall discriminative power
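
To make the targets concrete, here is a minimal sketch that computes precision and recall from confusion-matrix counts and checks them against the table above. The counts and the AUC value are made up for illustration, and `precision_recall` and `meets_targets` are hypothetical helpers, not part of the project code.

```python
# Hypothetical helpers: check model metrics against the capstone targets.
TARGETS = {"recall": 0.70, "precision": 0.50, "auc_roc": 0.80}

def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def meets_targets(metrics, targets=TARGETS):
    """Map each metric name to True if it strictly exceeds its target."""
    return {name: metrics[name] > target for name, target in targets.items()}

# Illustrative counts (assumed, not real results): 300 actual churners,
# 225 caught, 75 missed, and 150 loyal customers flagged by mistake.
precision, recall = precision_recall(tp=225, fp=150, fn=75)
metrics = {"recall": recall, "precision": precision, "auc_roc": 0.83}
status = meets_targets(metrics)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
print(status)
```

Recall is listed first in the table because a missed churner costs far more than a wasted retention offer, so the precision target is deliberately the looser of the two.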

Part 2: Data Exploration

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Generate realistic churn dataset
np.random.seed(42)
n_customers = 5000

data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure_months': np.random.exponential(24, n_customers).astype(int).clip(1, 72),
    'monthly_charges': np.random.uniform(20, 100, n_customers).round(2),
    'total_charges': np.zeros(n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Bank transfer', 'Credit card', 'Electronic check', 'Mailed check'], n_customers),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_customers, p=[0.35, 0.45, 0.2]),
    'online_security': np.random.choice(['Yes', 'No', 'No internet service'], n_customers, p=[0.3, 0.5, 0.2]),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet service'], n_customers, p=[0.3, 0.5, 0.2]),
    'num_support_tickets': np.random.poisson(2, n_customers),
    'num_referrals': np.random.poisson(1, n_customers),
    'senior_citizen': np.random.choice([0, 1], n_customers, p=[0.84, 0.16]),
})

# Calculate total charges
data['total_charges'] = data['tenure_months'] * data['monthly_charges'] * np.random.uniform(0.9, 1.1, n_customers)

# Generate churn based on realistic patterns
churn_prob = np.zeros(n_customers)
# Short tenure increases churn
churn_prob += (72 - data['tenure_months']) / 72 * 0.2
# Month-to-month contracts churn more
churn_prob += (data['contract_type'] == 'Month-to-month') * 0.15
# High charges without value-add services
churn_prob += (data['monthly_charges'] > 70) * (data['online_security'] == 'No') * 0.1
# Many support tickets = frustrated
churn_prob += (data['num_support_tickets'] > 3) * 0.15
# Electronic check = often forgetful payments
churn_prob += (data['payment_method'] == 'Electronic check') * 0.1

# Add noise and clip
churn_prob = churn_prob + np.random.uniform(-0.1, 0.1, n_customers)
churn_prob = np.clip(churn_prob, 0.05, 0.95)
data['churn'] = (np.random.random(n_customers) < churn_prob).astype(int)

print(f"Dataset shape: {data.shape}")
print(f"Churn rate: {data['churn'].mean():.1%}")
print("\nFirst few rows:")
data.head()

Exploratory Analysis

# Churn distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Churn by contract type
pd.crosstab(data['contract_type'], data['churn'], normalize='index').plot(
    kind='bar', ax=axes[0, 0], color=['green', 'red']
)
axes[0, 0].set_title('Churn Rate by Contract Type')
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].legend(['Stay', 'Churn'])

# 2. Tenure distribution by churn
for label, color in [(0, 'green'), (1, 'red')]:
    data[data['churn'] == label]['tenure_months'].hist(
        ax=axes[0, 1], alpha=0.5, label=f'Churn={label}', color=color, bins=20
    )
axes[0, 1].set_title('Tenure Distribution')
axes[0, 1].legend()

# 3. Monthly charges by churn
data.boxplot(column='monthly_charges', by='churn', ax=axes[0, 2])
axes[0, 2].set_title('Monthly Charges by Churn')

# 4. Support tickets by churn
pd.crosstab(data['num_support_tickets'].clip(upper=5), data['churn'], normalize='index').plot(
    kind='bar', ax=axes[1, 0], color=['green', 'red']
)
axes[1, 0].set_title('Churn Rate by Support Tickets')

# 5. Payment method by churn
pd.crosstab(data['payment_method'], data['churn'], normalize='index').plot(
    kind='bar', ax=axes[1, 1], color=['green', 'red']
)
axes[1, 1].set_title('Churn Rate by Payment Method')
axes[1, 1].tick_params(axis='x', rotation=45)

# 6. Correlation heatmap for numeric features
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges', 
                'num_support_tickets', 'num_referrals', 'churn']
sns.heatmap(data[numeric_cols].corr(), annot=True, cmap='RdYlGn', ax=axes[1, 2])
axes[1, 2].set_title('Feature Correlations')

plt.tight_layout()
plt.show()
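
The plots are easier to act on when paired with the underlying numbers. The sketch below tabulates churn rate and customer count per category. It uses a tiny made-up frame so it runs on its own; `churn_rate_by` is a hypothetical helper, and in the project you would pass the `data` frame instead.

```python
import pandas as pd

# Tiny made-up sample; in the project you would pass the `data` frame instead.
toy = pd.DataFrame({
    "contract_type": ["Month-to-month", "Month-to-month", "One year",
                      "Two year", "Month-to-month", "One year"],
    "churn": [1, 0, 0, 0, 1, 1],
})

def churn_rate_by(df, column, target="churn"):
    """Return churn rate and customer count per category, highest rate first."""
    summary = df.groupby(column)[target].agg(churn_rate="mean", customers="count")
    return summary.sort_values("churn_rate", ascending=False)

segment_summary = churn_rate_by(toy, "contract_type")
print(segment_summary)
```

Keeping the customer count next to the rate helps avoid over-reading a high churn rate that comes from a handful of rows.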

Part 3: Feature Engineering

# Feature engineering
def engineer_features(df):
    """Create meaningful features from raw data."""
    df = df.copy()
    
    # Derived features
    df['avg_monthly_spend'] = df['total_charges'] / df['tenure_months']
    df['charge_variability'] = df['avg_monthly_spend'] / df['monthly_charges']
    
    # Tenure buckets
    df['tenure_group'] = pd.cut(df['tenure_months'], 
                                 bins=[0, 6, 12, 24, 48, 100],
                                 labels=['0-6m', '6-12m', '1-2y', '2-4y', '4y+'])
    
    # Value indicators
    df['has_security'] = (df['online_security'] == 'Yes').astype(int)
    df['has_support'] = (df['tech_support'] == 'Yes').astype(int)
    df['total_services'] = df['has_security'] + df['has_support']
    
    # Risk indicators
    df['high_charges'] = (df['monthly_charges'] > df['monthly_charges'].median()).astype(int)
    df['many_tickets'] = (df['num_support_tickets'] > 3).astype(int)
    df['new_customer'] = (df['tenure_months'] <= 6).astype(int)
    
    # Interaction features
    df['new_high_charges'] = df['new_customer'] * df['high_charges']
    df['tickets_per_month'] = df['num_support_tickets'] / df['tenure_months']
    
    # Encode contract as ordinal (commitment level)
    contract_map = {'Month-to-month': 0, 'One year': 1, 'Two year': 2}
    df['contract_commitment'] = df['contract_type'].map(contract_map)
    
    return df

data_featured = engineer_features(data)

print("New features:")
print(data_featured[['avg_monthly_spend', 'tickets_per_month', 
                      'total_services', 'new_high_charges']].describe())
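
One caveat about `engineer_features`: `high_charges` compares each row with the median of whatever frame is passed in. On a single-row frame the median equals that row's own value, so the flag is always 0. A possible remedy, sketched here with hypothetical helpers `fit_thresholds` and `add_high_charges`, is to learn the cut-off from the training rows once and reuse it everywhere.

```python
import pandas as pd

def fit_thresholds(train_df):
    """Learn data-dependent cut-offs from the training rows only."""
    return {"monthly_charges_median": float(train_df["monthly_charges"].median())}

def add_high_charges(df, thresholds):
    """Flag customers whose monthly charge exceeds the stored training median."""
    df = df.copy()
    df["high_charges"] = (
        df["monthly_charges"] > thresholds["monthly_charges_median"]
    ).astype(int)
    return df

# Made-up training charges; their median is 60.0
train = pd.DataFrame({"monthly_charges": [20.0, 40.0, 60.0, 80.0, 100.0]})
thresholds = fit_thresholds(train)

# A single new customer: the flag is meaningful even for one row
new_row = add_high_charges(pd.DataFrame({"monthly_charges": [85.0]}), thresholds)
print(thresholds, new_row["high_charges"].iloc[0])
```

Storing the threshold alongside the model also keeps test-set statistics from influencing a feature the model trains on.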

Prepare for Modeling

from sklearn.preprocessing import LabelEncoder, StandardScaler

def prepare_data(df, target='churn'):
    """Prepare data for modeling."""
    df = df.copy()
    
    # Drop non-feature columns
    drop_cols = ['customer_id', target]
    
    # Identify categorical columns (strings and the binned tenure_group)
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Encode categoricals as integer codes. LabelEncoder assigns codes in
    # alphabetical order, which is usually workable for tree models; for
    # linear models or KNN, one-hot encoding is generally the safer choice.
    for col in categorical_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    
    # Get features and target
    feature_cols = [c for c in df.columns if c not in drop_cols]
    X = df[feature_cols]
    y = df[target]
    
    return X, y, feature_cols

X, y, feature_names = prepare_data(data_featured)

# Train-test split (stratified to maintain churn ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")
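
`prepare_data` refits its encoders on every call, so the encoding is not stored anywhere for later reuse. One common alternative, sketched below under the assumption that one-hot encoding suits the model, is to put preprocessing inside a scikit-learn `Pipeline` so the fitted encoder travels with the classifier. The rows are made up and only two of the project's columns are used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using two of the project's column names
train = pd.DataFrame({
    "contract_type": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 5,
    "monthly_charges": [85.0, 50.0, 30.0, 95.0] * 5,
    "churn": [1, 0, 0, 1] * 5,
})
features = ["contract_type", "monthly_charges"]

# Encoders and scalers are fitted once, inside the pipeline
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
    ("num", StandardScaler(), ["monthly_charges"]),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train[features], train["churn"])

# A single new row is transformed with the encoders learned during fit
new_customer = pd.DataFrame({"contract_type": ["Month-to-month"],
                             "monthly_charges": [90.0]})
prob = pipeline.predict_proba(new_customer)[0, 1]
print(f"Churn probability: {prob:.2f}")
```

With `handle_unknown="ignore"`, a category that never appeared in training is encoded as all zeros instead of raising an error at prediction time.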

Part 4: Model Development

Baseline Models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, classification_report)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Define models to compare
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=42))
    ]),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'KNN': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
    'Naive Bayes': GaussianNB(),
}

# Evaluate all models
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    result = {
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC-ROC': roc_auc_score(y_test, y_prob)
    }
    results.append(result)

results_df = pd.DataFrame(results).sort_values('AUC-ROC', ascending=False)
print("Model Comparison:")
print(results_df.to_string(index=False))
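
Scores in a comparison table are easier to judge next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with a 30% churn rate (an assumed class balance, chosen for illustration) to show what a model that ignores the features achieves.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Made-up labels: 30 churners and 70 loyal customers
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# strategy="prior" always predicts the majority class and returns the
# class frequencies as probabilities
dummy = DummyClassifier(strategy="prior").fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo)
dummy_prob = dummy.predict_proba(X_demo)[:, 1]

baseline = {
    "accuracy": accuracy_score(y_demo, dummy_pred),
    "recall": recall_score(y_demo, dummy_pred),
    "auc_roc": roc_auc_score(y_demo, dummy_prob),
}
print(baseline)
```

The baseline reaches 70% accuracy while catching no churners at all, which is why accuracy alone is a poor yardstick for this problem.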

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, cross_val_score

# Tune the best model (let's say Gradient Boosting)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_samples_split': [2, 5, 10]
}

gb = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(
    gb, param_grid, 
    cv=5, 
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV AUC-ROC: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
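
Before launching a grid search it helps to know how many models it will train. This short sketch counts the combinations in the grid above with scikit-learn's `ParameterGrid` and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 5

# Every combination is trained once per fold
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * cv_folds
print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set after the search. If that is too slow, `RandomizedSearchCV` samples a fixed number of combinations from the same grid.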

Part 5: Model Evaluation

Detailed Metrics

from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve

# Predictions
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stay', 'Churn']))

# Visualizations
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')
axes[0].set_xticklabels(['Stay', 'Churn'])
axes[0].set_yticklabels(['Stay', 'Churn'])

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {auc:.3f}')
axes[1].plot([0, 1], [0, 1], 'k--')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

# 3. Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
axes[2].plot(recall, precision)
axes[2].set_xlabel('Recall')
axes[2].set_ylabel('Precision')
axes[2].set_title('Precision-Recall Curve')

plt.tight_layout()
plt.show()
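
The precision-recall curve can also be read programmatically. The sketch below finds the highest threshold that still meets a recall target and reports the precision at that point. `threshold_for_recall` is a hypothetical helper and the labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_prob, min_recall=0.70):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is still at least `min_recall`."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision and recall have one more entry than thresholds; drop the
    # final (precision=1, recall=0) point so the arrays line up.
    precision, recall = precision[:-1], recall[:-1]
    # Thresholds are sorted ascending and recall never increases with the
    # threshold, so the last qualifying index is the highest threshold.
    best = np.where(recall >= min_recall)[0][-1]
    return thresholds[best], precision[best], recall[best]

# Made-up example: 4 churners among 10 customers
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

thr, prec, rec = threshold_for_recall(y_true_demo, y_prob_demo, min_recall=0.70)
print(f"threshold={thr:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

Choosing the highest qualifying threshold keeps precision as high as the recall target allows, which matches the ordering of the success metrics.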

Feature Importance

# Feature importance
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 8))
plt.barh(importances['feature'], importances['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(importances.tail(10).to_string(index=False))
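
Impurity-based importances from tree ensembles are computed on the training data and can favor features with many distinct values. As a cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. This is a minimal sketch on synthetic data where only the first feature carries signal; it is not the project's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends only on `signal`; `noise` is irrelevant
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
noise = rng.normal(size=400)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

# Train on the first 300 rows, evaluate importance on the held-out 100
model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(X_demo[:300], y_demo[:300])

result = permutation_importance(
    model, X_demo[300:], y_demo[300:],
    scoring="roc_auc", n_repeats=5, random_state=0,
)
for name, mean in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

In the project, the same call would take `best_model`, `X_test`, and `y_test`, and the result could be compared with `feature_importances_`.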

Part 6: Business Impact Analysis

def calculate_business_impact(y_true, y_prob, threshold=0.5,
                              cost_false_negative=500,  # Revenue lost per missed churner
                              cost_false_positive=50,   # Retention offer wasted on a loyal customer
                              cost_true_positive=100,   # Retention offer for a correctly flagged churner
                              revenue_retained=400):    # Revenue kept per retained churner
    """Calculate business impact of model at given threshold.
    
    Simplifying assumption: every flagged churner accepts the offer and stays.
    """
    
    y_pred = (y_prob >= threshold).astype(int)
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    # Costs
    cost_fn = fn * cost_false_negative  # Missed churners
    cost_fp = fp * cost_false_positive  # Unnecessary retention efforts
    cost_tp = tp * cost_true_positive   # Retention cost
    
    # Revenue
    revenue = tp * revenue_retained  # Revenue saved from successful retention
    
    # Net impact
    net_impact = revenue - cost_fn - cost_fp - cost_tp
    
    return {
        'threshold': threshold,
        'true_positives': tp,
        'false_positives': fp,
        'false_negatives': fn,
        'cost_missed_churners': cost_fn,
        'cost_wasted_effort': cost_fp,
        'cost_retention': cost_tp,
        'revenue_saved': revenue,
        'net_impact': net_impact
    }

# Analyze different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
impacts = [calculate_business_impact(y_test, y_prob, t) for t in thresholds]
impact_df = pd.DataFrame(impacts)

# Find optimal threshold
optimal_idx = impact_df['net_impact'].idxmax()  # index label, safe to use with .loc
optimal_threshold = impact_df.loc[optimal_idx, 'threshold']

print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"Net business impact: ${impact_df.loc[optimal_idx, 'net_impact']:,.0f}")
print(f"Churners caught: {impact_df.loc[optimal_idx, 'true_positives']}")
print(f"Churners missed: {impact_df.loc[optimal_idx, 'false_negatives']}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(impact_df['threshold'], impact_df['net_impact'], 'b-o')
plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Net Business Impact ($)')
plt.title('Business Impact by Threshold')
plt.legend()
plt.grid(True)
plt.show()
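
The calculation above treats every flagged churner as retained. A variant that relaxes this is sketched below: `expected_net_impact` is a hypothetical function that adds a retention success rate, and the 40% figure is an assumption for illustration only. The dollar amounts reuse the defaults from `calculate_business_impact`.

```python
def expected_net_impact(tp, fp, fn, success_rate=0.4,
                        revenue_retained=400, offer_cost=100,
                        wasted_offer_cost=50, lost_revenue=500):
    """Net impact when only a fraction of contacted churners is retained."""
    saved = tp * success_rate * revenue_retained        # churners who stay
    lost_anyway = tp * (1 - success_rate) * lost_revenue  # churners who still leave
    offers = tp * offer_cost + fp * wasted_offer_cost   # cost of all offers sent
    missed = fn * lost_revenue                          # churners never contacted
    return saved - lost_anyway - offers - missed

# Made-up confusion counts
optimistic = expected_net_impact(tp=100, fp=50, fn=20, success_rate=1.0)
realistic = expected_net_impact(tp=100, fp=50, fn=20, success_rate=0.4)
print(f"All offers succeed: ${optimistic:,.0f}")
print(f"40% of offers succeed: ${realistic:,.0f}")
```

With `success_rate=1.0` the function reduces to the original formula. Lower success rates shrink the net impact quickly, so the optimal threshold is worth re-deriving once a measured success rate is available.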

Part 7: Production Considerations

Model Serialization

import joblib

# Save model and preprocessing artifacts
artifacts = {
    'model': best_model,
    'feature_names': feature_names,
    'optimal_threshold': optimal_threshold
}

joblib.dump(artifacts, 'churn_model.pkl')
print("Model saved!")

# Load model
loaded = joblib.load('churn_model.pkl')
loaded_model = loaded['model']
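
A quick round-trip check confirms that a saved model behaves the same after loading. The sketch below uses a toy model and a temporary directory so it leaves no files behind. Recording the scikit-learn version in the artifact is a suggested addition, not something the project code does.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

# Toy model standing in for `best_model`
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

artifacts = {
    "model": model,
    "feature_names": ["x"],
    "sklearn_version": sklearn.__version__,  # helps diagnose load-time mismatches
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_model_demo.pkl")
    joblib.dump(artifacts, path)
    restored = joblib.load(path)

# The restored model should reproduce the original probabilities
same_predictions = np.allclose(
    model.predict_proba(X_demo), restored["model"].predict_proba(X_demo)
)
print(f"Round trip OK: {same_predictions}")
```

Pickled models are only safe to load from trusted sources, and they may fail or behave differently under a different scikit-learn version, which is why the version is stored with the model.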

Inference Pipeline

def predict_churn(customer_data, model, feature_names, reference_df, threshold=0.5):
    """
    Predict churn probability for a customer.
    
    Args:
        customer_data: dict with raw customer features
        model: trained model
        feature_names: list of feature names in training order
        reference_df: raw training data (before feature engineering)
        threshold: classification threshold
    
    Returns:
        dict with prediction and probability
    """
    # Create a one-row DataFrame. prepare_data() expects a target column,
    # so add a placeholder value that is never used for prediction.
    new_row = pd.DataFrame([customer_data])
    new_row['churn'] = 0
    
    # Process the new row together with the reference data so the label
    # encodings and the median behind `high_charges` stay consistent with
    # training. A real service would persist fitted encoders instead of
    # refitting them on every call.
    combined = pd.concat([reference_df, new_row], ignore_index=True)
    combined = engineer_features(combined)
    
    # Ensure same features and order, then keep only the new customer's row
    X_all = prepare_data(combined)[0]
    X = X_all[feature_names].tail(1)
    
    # Predict
    prob = model.predict_proba(X)[0, 1]
    prediction = int(prob >= threshold)
    
    return {
        'churn_probability': float(prob),
        'will_churn': bool(prediction),
        'risk_level': 'High' if prob > 0.7 else 'Medium' if prob > 0.4 else 'Low'
    }

# Example prediction
new_customer = {
    'tenure_months': 3,
    'monthly_charges': 85.0,
    'total_charges': 255.0,
    'contract_type': 'Month-to-month',
    'payment_method': 'Electronic check',
    'internet_service': 'Fiber optic',
    'online_security': 'No',
    'tech_support': 'No',
    'num_support_tickets': 5,
    'num_referrals': 0,
    'senior_citizen': 0
}

result = predict_churn(new_customer, best_model, feature_names, data, optimal_threshold)
print("Customer Churn Prediction:")
print(f"  Probability: {result['churn_probability']:.1%}")
print(f"  Will Churn: {result['will_churn']}")
print(f"  Risk Level: {result['risk_level']}")
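
An inference function usually benefits from validating its input before any feature engineering runs. The sketch below checks a few of the fields used in the example customer. `validate_customer`, `REQUIRED_FIELDS`, and `ALLOWED_CONTRACTS` are hypothetical names, and the rules are assumptions derived from how the features are computed (for instance, `tenure_months` is used as a divisor, so it must be at least 1).

```python
# Hypothetical schema: field name -> accepted Python types
REQUIRED_FIELDS = {
    "tenure_months": (int,),
    "monthly_charges": (int, float),
    "total_charges": (int, float),
    "contract_type": (str,),
    "num_support_tickets": (int,),
}
ALLOWED_CONTRACTS = {"Month-to-month", "One year", "Two year"}

def validate_customer(customer):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, types in REQUIRED_FIELDS.items():
        if field not in customer:
            problems.append(f"missing field: {field}")
        elif not isinstance(customer[field], types):
            problems.append(f"wrong type for {field}: {type(customer[field]).__name__}")

    # tenure_months is a divisor in the engineered features, so it must be >= 1
    tenure = customer.get("tenure_months")
    if isinstance(tenure, int) and tenure < 1:
        problems.append("tenure_months must be at least 1")

    # Unknown categories would not match any code seen during training
    contract = customer.get("contract_type")
    if isinstance(contract, str) and contract not in ALLOWED_CONTRACTS:
        problems.append(f"unknown contract_type: {contract}")
    return problems

good = {"tenure_months": 3, "monthly_charges": 85.0, "total_charges": 255.0,
        "contract_type": "Month-to-month", "num_support_tickets": 5}
bad = {"tenure_months": 0, "monthly_charges": "85", "total_charges": 255.0,
       "contract_type": "Weekly"}

print(validate_customer(good))
print(validate_customer(bad))
```

Returning all problems at once, instead of raising on the first one, makes the function easier to use behind an API that reports validation errors to the caller.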

Monitoring Dashboard

def create_monitoring_report(predictions, actuals, timestamp):
    """Generate model monitoring report."""
    predictions = np.asarray(predictions)
    
    # Calculate metrics
    accuracy = accuracy_score(actuals, predictions > 0.5)
    auc = roc_auc_score(actuals, predictions)
    
    # Distribution of predictions across risk bands. pd.cut returns a
    # Categorical for array input, so wrap it in a Series to get
    # normalized value counts.
    risk_bands = pd.cut(predictions, bins=[0, 0.3, 0.7, 1.0],
                        labels=['Low', 'Medium', 'High'], include_lowest=True)
    pred_dist = pd.Series(risk_bands).value_counts(normalize=True)
    
    report = {
        'timestamp': timestamp,
        'n_predictions': len(predictions),
        'accuracy': accuracy,
        'auc_roc': auc,
        'mean_probability': predictions.mean(),
        'high_risk_pct': (predictions > 0.7).mean(),
        'risk_distribution': pred_dist.round(3).to_dict(),
        'alert': auc < 0.75  # Trigger alert if AUC drops
    }
    
    return report

# Simulate monitoring
monitoring_report = create_monitoring_report(y_prob, y_test, '2024-01-15')
print("\nModel Monitoring Report:")
for key, value in monitoring_report.items():
    print(f"  {key}: {value}")
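
AUC can only be computed once true outcomes arrive, which for churn may take weeks. In the meantime, a shift in the distribution of predicted probabilities can serve as an early warning. The sketch below computes the Population Stability Index (PSI) between a reference sample and a recent sample. `population_stability_index` is a hypothetical helper, and the two samples are synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of predicted probabilities in [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and division by zero for empty bins
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Synthetic probabilities: a reference period and a strongly shifted period
rng = np.random.default_rng(42)
reference = rng.beta(2, 5, size=5000)   # mostly low-risk predictions
shifted = rng.beta(5, 2, size=5000)     # mostly high-risk predictions

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI vs itself: {psi_same:.3f}")
print(f"PSI vs shifted sample: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be tuned to the application.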

Project Checklist

Problem Definition

✅ Clear business objective and success metrics

Data Exploration

✅ Understand data quality, distributions, and patterns

Feature Engineering

✅ Create meaningful features from domain knowledge

Model Development

✅ Compare multiple algorithms, tune hyperparameters

Evaluation

✅ Use appropriate metrics, analyze errors

Business Impact

✅ Translate ML metrics to business value

Production

✅ Plan for deployment, monitoring, and maintenance

🏆 Congratulations!

You've Completed the Capstone!

You’ve built a complete ML system from scratch, from problem definition to a first monitoring report. You now have:

Technical Skills: Data exploration, feature engineering, model training, evaluation, and deployment
Business Acumen: Translating ML metrics to business impact
Production Mindset: Monitoring, maintenance, and continuous improvement

This project alone is portfolio-worthy for ML engineering roles!

📝 Portfolio Documentation Template

How to Present This Project to Employers

Project Summary (for your portfolio/resume)

Title: Customer Churn Prediction System

Business Impact:

Identifies 70%+ of at-risk customers 2 weeks before churn
Enables targeted retention campaigns
Estimated $X00K annual savings in customer lifetime value

Technical Highlights:

End-to-end ML pipeline from raw data to production API
Comparison of 5+ algorithms (Logistic Regression, Random Forest, Gradient Boosting, etc.)
Feature engineering creating 10+ derived features
Threshold optimization for business-aligned precision-recall tradeoff
Monitoring and alerting system for model drift

Technologies Used:

Python, scikit-learn, pandas, NumPy, matplotlib, seaborn, joblib
FastAPI for model serving
MLflow for experiment tracking
Docker for containerization

GitHub README Structure

# Customer Churn Prediction System

## Overview
End-to-end ML system predicting customer churn with 0.85 AUC-ROC.

## Results
| Metric | Value |
|--------|-------|
| AUC-ROC | 0.85 |
| Precision | 0.72 |
| Recall | 0.68 |
| Business Value | $200K saved/year |

## Quick Start
```bash
pip install -r requirements.txt
python train.py
python serve.py
```

## Project Structure
├── data/           # Raw and processed data
├── notebooks/      # EDA and experimentation
├── src/            # Production code
├── models/         # Saved models
├── tests/          # Unit tests
└── docker/         # Containerization

Interview Talking Points

“Walk me through this project”
- Start with business problem (churn costs $X)
- Explain data exploration findings
- Discuss feature engineering decisions
- Compare model approaches
- Show business impact calculation

“What was the biggest challenge?”
- Class imbalance (70/30 split)
- Feature engineering from raw transaction data
- Choosing the right threshold for business needs

“How would you improve it?”
- Real-time predictions with streaming
- A/B testing different interventions
- Incorporating more data sources
- Automated retraining pipeline

🔗 Complete ML Mastery Checklist

Skills You’ve Mastered Across This Course:

Category	Skills	Modules
Fundamentals	Linear models, loss functions, gradient descent	1-3
Classification	Logistic regression, metrics, thresholds	4-4b
Algorithms	Trees, ensembles, SVM, Naive Bayes	5-6
Evaluation	Cross-validation, precision/recall, ROC	7
Data Skills	Feature engineering, handling messy data	8
Optimization	Hyperparameter tuning	9
End-to-End	Complete pipelines	10, 19
Unsupervised	Clustering, dimensionality reduction	11, 18
Deep Learning	Neural networks basics	12
Production	Regularization, deployment, monitoring	13, 14
Time Series	Forecasting techniques	15
Theory	Bias-variance, data leakage	16-17
Real-World	Imbalanced data, explainability	20-23

You’re now ready for:

ML Engineer roles (junior to mid-level)
Data Scientist positions
AI/ML-focused software engineering
Further study in deep learning, NLP, or computer vision

What’s Next?

You’ve completed the capstone, but there’s more to learn! Let’s tackle real-world challenges.

Continue Learning

Handle datasets where 99% of data is one class

Deep Learning

Move on to neural networks, transformers, and LLMs
