Capstone Project: Complete ML System

Capstone Project Lifecycle
Real-World Capstone: Telecom Customer Churn Prediction

Project Overview

You’ll build a Customer Churn Prediction System - predicting which customers will leave a subscription service. This project synthesizes everything you’ve learned:
  • Data exploration and cleaning
  • Feature engineering
  • Model selection and evaluation
  • Deployment considerations

Part 1: Problem Definition

Business Context

A telecom company loses 15-20% of its customers each year. Each lost customer costs:
  • Revenue loss: $500-2000/year
  • Acquisition cost for replacement: $300-500
Your Goal: Identify at-risk customers BEFORE they leave, so the retention team can intervene.
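
To make these numbers concrete, here is a quick back-of-envelope estimate of what churn costs. The customer count and churn rate below are illustrative assumptions, and the per-customer figures are midpoints of the ranges above:

# Back-of-envelope churn cost; every input here is an illustrative assumption.
customers = 50_000                      # assumed subscriber base
churn_rate = 0.15                       # low end of the range quoted above
cost_per_lost_customer = 1_250 + 400    # midpoint revenue loss + midpoint acquisition cost

lost_customers = customers * churn_rate
print(f"Customers lost: {lost_customers:,.0f}")
print(f"Estimated cost of churn: ${lost_customers * cost_per_lost_customer:,.0f}")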

Success Metrics

| Metric    | Target | Why                           |
|-----------|--------|-------------------------------|
| Recall    | > 70%  | Catch most churners           |
| Precision | > 50%  | Don’t waste retention budget  |
| AUC-ROC   | > 0.80 | Overall discriminative power  |
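
It can help to keep these targets in front of you as an explicit pass/fail check that any candidate model must satisfy. A minimal sketch; the metric values passed in below are placeholders for whatever your evaluation later produces:

# Success-metric gate; the numbers passed in are placeholders for real evaluation results.
TARGETS = {'recall': 0.70, 'precision': 0.50, 'auc_roc': 0.80}

def meets_targets(metrics, targets=TARGETS):
    """Return a pass/fail flag for each success metric."""
    return {name: metrics.get(name, 0.0) >= minimum for name, minimum in targets.items()}

print(meets_targets({'recall': 0.74, 'precision': 0.55, 'auc_roc': 0.83}))
# -> {'recall': True, 'precision': True, 'auc_roc': True}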

Part 2: Data Exploration

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Generate realistic churn dataset
np.random.seed(42)
n_customers = 5000

data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure_months': np.random.exponential(24, n_customers).astype(int).clip(1, 72),
    'monthly_charges': np.random.uniform(20, 100, n_customers).round(2),
    'total_charges': np.zeros(n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Bank transfer', 'Credit card', 'Electronic check', 'Mailed check'], n_customers),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_customers, p=[0.35, 0.45, 0.2]),
    'online_security': np.random.choice(['Yes', 'No', 'No internet service'], n_customers, p=[0.3, 0.5, 0.2]),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet service'], n_customers, p=[0.3, 0.5, 0.2]),
    'num_support_tickets': np.random.poisson(2, n_customers),
    'num_referrals': np.random.poisson(1, n_customers),
    'senior_citizen': np.random.choice([0, 1], n_customers, p=[0.84, 0.16]),
})

# Calculate total charges
data['total_charges'] = data['tenure_months'] * data['monthly_charges'] * np.random.uniform(0.9, 1.1, n_customers)

# Generate churn based on realistic patterns
churn_prob = np.zeros(n_customers)
# Short tenure increases churn
churn_prob += (72 - data['tenure_months']) / 72 * 0.2
# Month-to-month contracts churn more
churn_prob += (data['contract_type'] == 'Month-to-month') * 0.15
# High charges without value-add services
churn_prob += (data['monthly_charges'] > 70) * (data['online_security'] == 'No') * 0.1
# Many support tickets = frustrated
churn_prob += (data['num_support_tickets'] > 3) * 0.15
# Electronic check payment is associated with higher churn
churn_prob += (data['payment_method'] == 'Electronic check') * 0.1

# Add noise and clip
churn_prob = churn_prob + np.random.uniform(-0.1, 0.1, n_customers)
churn_prob = np.clip(churn_prob, 0.05, 0.95)
data['churn'] = (np.random.random(n_customers) < churn_prob).astype(int)

print(f"Dataset shape: {data.shape}")
print(f"Churn rate: {data['churn'].mean():.1%}")
print("\nFirst few rows:")
data.head()

Exploratory Analysis

# Churn distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Churn by contract type
pd.crosstab(data['contract_type'], data['churn'], normalize='index').plot(
    kind='bar', ax=axes[0, 0], color=['green', 'red']
)
axes[0, 0].set_title('Churn Rate by Contract Type')
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].legend(['Stay', 'Churn'])

# 2. Tenure distribution by churn
for label, color in [(0, 'green'), (1, 'red')]:
    data[data['churn'] == label]['tenure_months'].hist(
        ax=axes[0, 1], alpha=0.5, label=f'Churn={label}', color=color, bins=20
    )
axes[0, 1].set_title('Tenure Distribution')
axes[0, 1].legend()

# 3. Monthly charges by churn
data.boxplot(column='monthly_charges', by='churn', ax=axes[0, 2])
axes[0, 2].set_title('Monthly Charges by Churn')

# 4. Support tickets by churn
pd.crosstab(data['num_support_tickets'].clip(upper=5), data['churn'], normalize='index').plot(
    kind='bar', ax=axes[1, 0], color=['green', 'red']
)
axes[1, 0].set_title('Churn Rate by Support Tickets')

# 5. Payment method by churn
pd.crosstab(data['payment_method'], data['churn'], normalize='index').plot(
    kind='bar', ax=axes[1, 1], color=['green', 'red']
)
axes[1, 1].set_title('Churn Rate by Payment Method')
axes[1, 1].tick_params(axis='x', rotation=45)

# 6. Correlation heatmap for numeric features
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges', 
                'num_support_tickets', 'num_referrals', 'churn']
sns.heatmap(data[numeric_cols].corr(), annot=True, cmap='RdYlGn', ax=axes[1, 2])
axes[1, 2].set_title('Feature Correlations')

plt.tight_layout()
plt.show()

Part 3: Feature Engineering

# Feature engineering
def engineer_features(df):
    """Create meaningful features from raw data."""
    df = df.copy()
    
    # Derived features
    df['avg_monthly_spend'] = df['total_charges'] / df['tenure_months']
    df['charge_variability'] = df['avg_monthly_spend'] / df['monthly_charges']
    
    # Tenure buckets
    df['tenure_group'] = pd.cut(df['tenure_months'], 
                                 bins=[0, 6, 12, 24, 48, 100],
                                 labels=['0-6m', '6-12m', '1-2y', '2-4y', '4y+'])
    
    # Value indicators
    df['has_security'] = (df['online_security'] == 'Yes').astype(int)
    df['has_support'] = (df['tech_support'] == 'Yes').astype(int)
    df['total_services'] = df['has_security'] + df['has_support']
    
    # Risk indicators
    df['high_charges'] = (df['monthly_charges'] > df['monthly_charges'].median()).astype(int)
    df['many_tickets'] = (df['num_support_tickets'] > 3).astype(int)
    df['new_customer'] = (df['tenure_months'] <= 6).astype(int)
    
    # Interaction features
    df['new_high_charges'] = df['new_customer'] * df['high_charges']
    df['tickets_per_month'] = df['num_support_tickets'] / df['tenure_months']
    
    # Encode contract as ordinal (commitment level)
    contract_map = {'Month-to-month': 0, 'One year': 1, 'Two year': 2}
    df['contract_commitment'] = df['contract_type'].map(contract_map)
    
    return df

data_featured = engineer_features(data)

print("New features:")
print(data_featured[['avg_monthly_spend', 'tickets_per_month', 
                      'total_services', 'new_high_charges']].describe())

Prepare for Modeling

from sklearn.preprocessing import LabelEncoder, StandardScaler

def prepare_data(df, target='churn'):
    """Prepare data for modeling."""
    df = df.copy()
    
    # Drop non-feature columns
    drop_cols = ['customer_id', target]
    
    # Identify column types
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    numeric_cols = [c for c in numeric_cols if c not in drop_cols]
    
    # Encode categoricals (in production, persist the fitted encoders so inference
    # uses the same category-to-integer mapping as training)
    for col in categorical_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
    
    # Get features and target (the target column may be absent at inference time)
    feature_cols = [c for c in df.columns if c not in drop_cols]
    X = df[feature_cols]
    y = df[target] if target in df.columns else None
    
    return X, y, feature_cols

X, y, feature_names = prepare_data(data_featured)

# Train-test split (stratified to maintain churn ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")

Part 4: Model Development

Baseline Models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, classification_report)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Define models to compare
models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, random_state=42))
    ]),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'KNN': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
    'Naive Bayes': GaussianNB(),
}

# Evaluate all models
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    result = {
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC-ROC': roc_auc_score(y_test, y_prob)
    }
    results.append(result)

results_df = pd.DataFrame(results).sort_values('AUC-ROC', ascending=False)
print("Model Comparison:")
print(results_df.to_string(index=False))

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, cross_val_score

# Tune the best model (let's say Gradient Boosting)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_samples_split': [2, 5, 10]
}

gb = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(
    gb, param_grid, 
    cv=5, 
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV AUC-ROC: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
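
The grid above evaluates 81 parameter combinations with 5-fold CV (405 fits), which can be slow. If you need a quicker pass, a randomized search over the same space is a common alternative; a sketch, reusing the same training data and scoring:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Randomized alternative to the exhaustive grid: sample 20 candidates from the same space.
param_dist = {
    'n_estimators': randint(50, 250),
    'max_depth': randint(3, 8),
    'learning_rate': uniform(0.01, 0.19),   # uniform over [0.01, 0.20)
    'min_samples_split': randint(2, 11),
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=5, scoring='roc_auc', n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best CV AUC-ROC (randomized search): {random_search.best_score_:.3f}")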

Part 5: Model Evaluation

Detailed Metrics

from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve

# Predictions
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

# Confusion Matrix
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stay', 'Churn']))

# Visualizations
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')
axes[0].set_xticklabels(['Stay', 'Churn'])
axes[0].set_yticklabels(['Stay', 'Churn'])

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {auc:.3f}')
axes[1].plot([0, 1], [0, 1], 'k--')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

# 3. Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
axes[2].plot(recall, precision)
axes[2].set_xlabel('Recall')
axes[2].set_ylabel('Precision')
axes[2].set_title('Precision-Recall Curve')

plt.tight_layout()
plt.show()

Feature Importance

# Feature importance
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 8))
plt.barh(importances['feature'], importances['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(importances.tail(10).to_string(index=False))

Part 6: Business Impact Analysis

def calculate_business_impact(y_true, y_prob, threshold=0.5,
                              cost_false_negative=500,  # Revenue lost per missed churner
                              cost_false_positive=50,   # Retention effort wasted on a non-churner
                              cost_true_positive=100,   # Retention offer cost per caught churner
                              revenue_retained=400):    # Revenue saved per successful intervention
    """Calculate business impact of model at given threshold."""
    
    y_pred = (y_prob >= threshold).astype(int)
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    # Costs
    cost_fn = fn * cost_false_negative  # Missed churners
    cost_fp = fp * cost_false_positive  # Unnecessary retention efforts
    cost_tp = tp * cost_true_positive   # Retention cost
    
    # Revenue
    revenue = tp * revenue_retained  # Revenue saved from successful retention
    
    # Net impact
    net_impact = revenue - cost_fn - cost_fp - cost_tp
    
    return {
        'threshold': threshold,
        'true_positives': tp,
        'false_positives': fp,
        'false_negatives': fn,
        'cost_missed_churners': cost_fn,
        'cost_wasted_effort': cost_fp,
        'cost_retention': cost_tp,
        'revenue_saved': revenue,
        'net_impact': net_impact
    }

# Analyze different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
impacts = [calculate_business_impact(y_test, y_prob, t) for t in thresholds]
impact_df = pd.DataFrame(impacts)

# Find optimal threshold
optimal_idx = impact_df['net_impact'].argmax()
optimal_threshold = impact_df.loc[optimal_idx, 'threshold']

print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"Net business impact: ${impact_df.loc[optimal_idx, 'net_impact']:,.0f}")
print(f"Churners caught: {impact_df.loc[optimal_idx, 'true_positives']}")
print(f"Churners missed: {impact_df.loc[optimal_idx, 'false_negatives']}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(impact_df['threshold'], impact_df['net_impact'], 'b-o')
plt.axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Net Business Impact ($)')
plt.title('Business Impact by Threshold')
plt.legend()
plt.grid(True)
plt.show()

Part 7: Production Considerations

Model Serialization

import joblib

# Save model and preprocessing artifacts
artifacts = {
    'model': best_model,
    'feature_names': feature_names,
    'optimal_threshold': optimal_threshold
}

joblib.dump(artifacts, 'churn_model.pkl')
print("Model saved!")

# Load model
loaded = joblib.load('churn_model.pkl')
loaded_model = loaded['model']

Inference Pipeline

def predict_churn(customer_data, model, feature_names, threshold=0.5):
    """
    Predict churn probability for a customer.
    
    Args:
        customer_data: dict with customer features
        model: trained model
        feature_names: list of feature names in order
        threshold: classification threshold
    
    Returns:
        dict with prediction and probability
    """
    # Create DataFrame from input
    df = pd.DataFrame([customer_data])
    
    # Apply same feature engineering
    df = engineer_features(df)
    
    # Ensure the same features in the same order as training
    X = prepare_data(df)[0]
    X = X.reindex(columns=feature_names, fill_value=0)
    
    # Predict
    prob = model.predict_proba(X)[0, 1]
    prediction = int(prob >= threshold)
    
    return {
        'churn_probability': float(prob),
        'will_churn': bool(prediction),
        'risk_level': 'High' if prob > 0.7 else 'Medium' if prob > 0.4 else 'Low'
    }

# Example prediction
new_customer = {
    'tenure_months': 3,
    'monthly_charges': 85.0,
    'total_charges': 255.0,
    'contract_type': 'Month-to-month',
    'payment_method': 'Electronic check',
    'internet_service': 'Fiber optic',
    'online_security': 'No',
    'tech_support': 'No',
    'num_support_tickets': 5,
    'num_referrals': 0,
    'senior_citizen': 0
}

result = predict_churn(new_customer, best_model, feature_names, optimal_threshold)
print(f"Customer Churn Prediction:")
print(f"  Probability: {result['churn_probability']:.1%}")
print(f"  Will Churn: {result['will_churn']}")
print(f"  Risk Level: {result['risk_level']}")

Monitoring Dashboard

def create_monitoring_report(predictions, actuals, timestamp):
    """Generate model monitoring report."""
    
    # Calculate metrics
    accuracy = accuracy_score(actuals, predictions > 0.5)
    auc = roc_auc_score(actuals, predictions)
    
    # Distribution of predictions (wrap in a Series so value_counts(normalize=True) works on arrays)
    pred_dist = pd.Series(pd.cut(predictions, bins=[0, 0.3, 0.7, 1.0],
                                 labels=['Low', 'Medium', 'High'])).value_counts(normalize=True)
    
    report = {
        'timestamp': timestamp,
        'n_predictions': len(predictions),
        'accuracy': accuracy,
        'auc_roc': auc,
        'mean_probability': predictions.mean(),
        'high_risk_pct': (predictions > 0.7).mean(),
        'risk_distribution': pred_dist.round(2).to_dict(),
        'alert': auc < 0.75  # Trigger alert if AUC drops
    }
    
    return report

# Simulate monitoring
monitoring_report = create_monitoring_report(y_prob, y_test, '2024-01-15')
print("\nModel Monitoring Report:")
for key, value in monitoring_report.items():
    print(f"  {key}: {value}")

Project Checklist

1. ✅ Problem Definition: clear business objective and success metrics
2. ✅ Data Exploration: understand data quality, distributions, and patterns
3. ✅ Feature Engineering: create meaningful features from domain knowledge
4. ✅ Model Development: compare multiple algorithms, tune hyperparameters
5. ✅ Evaluation: use appropriate metrics, analyze errors
6. ✅ Business Impact: translate ML metrics to business value
7. ✅ Production: plan for deployment, monitoring, and maintenance

🏆 Congratulations!

You've Completed the Capstone!

You’ve built a complete, production-ready ML system from scratch. You now have:
  • Technical Skills: Data exploration, feature engineering, model training, evaluation, and deployment
  • Business Acumen: Translating ML metrics to business impact
  • Production Mindset: Monitoring, maintenance, and continuous improvement
This project alone is portfolio-worthy for ML engineering roles!

📝 Portfolio Documentation Template

Project Summary (for your portfolio/resume)

Title: Customer Churn Prediction System
Business Impact:
  • Identifies 70%+ of at-risk customers 2 weeks before churn
  • Enables targeted retention campaigns
  • Estimated $X00K annual savings in customer lifetime value
Technical Highlights:
  • End-to-end ML pipeline from raw data to production API
  • Comparison of 5+ algorithms (Logistic Regression, Random Forest, XGBoost, etc.)
  • Feature engineering creating 20+ derived features
  • Threshold optimization for business-aligned precision-recall tradeoff
  • Monitoring and alerting system for model drift
Technologies Used:
  • Python, scikit-learn, XGBoost, pandas, numpy
  • FastAPI for model serving
  • MLflow for experiment tracking
  • Docker for containerization

GitHub README Structure

# Customer Churn Prediction System

## Overview
End-to-end ML system predicting customer churn with 85% AUC-ROC.

## Results
| Metric | Value |
|--------|-------|
| AUC-ROC | 0.85 |
| Precision | 0.72 |
| Recall | 0.68 |
| Business Value | $200K saved/year |

## Quick Start
```bash
pip install -r requirements.txt
python train.py
python serve.py
```

## Project Structure
```
├── data/           # Raw and processed data
├── notebooks/      # EDA and experimentation
├── src/            # Production code
├── models/         # Saved models
├── tests/          # Unit tests
└── docker/         # Containerization
```

Interview Talking Points

  1. “Walk me through this project”
    • Start with business problem (churn costs $X)
    • Explain data exploration findings
    • Discuss feature engineering decisions
    • Compare model approaches
    • Show business impact calculation
  2. “What was the biggest challenge?”
    • Class imbalance (70/30 split)
    • Feature engineering from raw transaction data
    • Choosing the right threshold for business needs
  3. “How would you improve it?”
    • Real-time predictions with streaming
    • A/B testing different interventions
    • Incorporating more data sources
    • Automated retraining pipeline

🔗 Complete ML Mastery Checklist

Skills You’ve Mastered Across This Course:

| Category | Skills | Modules |
|----------|--------|---------|
| Fundamentals | Linear models, loss functions, gradient descent | 1-3 |
| Classification | Logistic regression, metrics, thresholds | 4-4b |
| Algorithms | Trees, ensembles, SVM, Naive Bayes | 5-6 |
| Evaluation | Cross-validation, precision/recall, ROC | 7 |
| Data Skills | Feature engineering, handling messy data | 8 |
| Optimization | Hyperparameter tuning | 9 |
| End-to-End | Complete pipelines | 10, 19 |
| Unsupervised | Clustering, dimensionality reduction | 11, 18 |
| Deep Learning | Neural networks basics | 12 |
| Production | Regularization, deployment, monitoring | 13, 14 |
| Time Series | Forecasting techniques | 15 |
| Theory | Bias-variance, data leakage | 16-17 |
| Real-World | Imbalanced data, explainability | 20-23 |
You’re now ready for:
  • ML Engineer roles (junior to mid-level)
  • Data Scientist positions
  • AI/ML-focused software engineering
  • Further study in deep learning, NLP, or computer vision

What’s Next?

You’ve completed the capstone, but there’s more to learn! Let’s tackle real-world challenges.