
End-to-End ML Project

End-to-End ML Project Pipeline
[Figure: Real World ML Pipeline - Uber Surge Pricing]

The Complete ML Workflow

This module brings everything together in a real project:
  1. Problem Definition: What are we solving?
  2. Data Collection: Get the data
  3. EDA: Understand the data
  4. Feature Engineering: Prepare features
  5. Model Selection: Choose algorithms
  6. Training: Fit models
  7. Evaluation: Measure performance
  8. Tuning: Optimize hyperparameters
  9. Deployment: Make it usable

Project: Predicting Customer Churn

Business Problem: A telecom company wants to predict which customers will leave (churn) so it can offer them incentives to stay.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

Step 1: Load and Explore Data

# Using Telco Customer Churn dataset
# In practice, you'd load from: pd.read_csv('customer_churn.csv')

# Simulate the dataset structure
n_samples = 5000

data = {
    'customer_id': range(1, n_samples + 1),
    'tenure': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 7000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'senior_citizen': np.random.choice([0, 1], n_samples),
    'partner': np.random.choice(['Yes', 'No'], n_samples),
    'dependents': np.random.choice(['Yes', 'No'], n_samples),
}

# Create churn with realistic patterns
churn_prob = (
    0.1 +  # Base rate
    0.3 * (data['contract'] == 'Month-to-month').astype(int) +
    0.2 * (data['monthly_charges'] > 70) +
    -0.15 * (data['tenure'] > 36) +
    0.1 * (data['payment_method'] == 'Electronic check').astype(int)
)
churn_prob = np.clip(churn_prob, 0.05, 0.8)
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nChurn Distribution:")
print(df['churn'].value_counts(normalize=True))

Step 2: Exploratory Data Analysis (EDA)

# Churn rate by contract type
print("Churn Rate by Contract Type:")
print(df.groupby('contract')['churn'].mean().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Churn distribution
df['churn'].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[0, 0], 
                                     labels=['Stayed', 'Churned'])
axes[0, 0].set_title('Churn Distribution')

# 2. Tenure distribution by churn
axes[0, 1].hist([df[df['churn']==0]['tenure'], df[df['churn']==1]['tenure']], 
                label=['Stayed', 'Churned'], bins=20, alpha=0.7)
axes[0, 1].set_xlabel('Tenure (months)')
axes[0, 1].set_title('Tenure by Churn Status')
axes[0, 1].legend()

# 3. Monthly charges by churn
df.boxplot(column='monthly_charges', by='churn', ax=axes[1, 0])
axes[1, 0].set_title('Monthly Charges by Churn')

# 4. Contract type vs churn
churn_by_contract = df.groupby('contract')['churn'].mean()
churn_by_contract.plot.bar(ax=axes[1, 1], color=['green', 'orange', 'red'])
axes[1, 1].set_title('Churn Rate by Contract')
axes[1, 1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()

Step 3: Feature Engineering

# Separate features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

# Identify column types
numeric_cols = ['tenure', 'monthly_charges', 'total_charges']
categorical_cols = ['contract', 'internet_service', 'online_security', 
                    'tech_support', 'payment_method', 'paperless_billing',
                    'partner', 'dependents']

# Create new features
X['charges_per_month'] = X['total_charges'] / (X['tenure'] + 1)
X['is_long_term'] = (X['tenure'] > 24).astype(int)
X['high_charges'] = (X['monthly_charges'] > X['monthly_charges'].median()).astype(int)

# Register all engineered features so the ColumnTransformer doesn't drop them
numeric_cols.extend(['charges_per_month', 'is_long_term', 'high_charges'])

print("Feature Engineering Complete!")
print(f"Numeric features: {len(numeric_cols)}")
print(f"Categorical features: {len(categorical_cols)}")

Step 4: Preprocessing Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Build preprocessor
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

Step 5: Model Selection and Comparison

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Create pipelines and evaluate
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std(),
        'pipeline': pipeline
    }
    
    print(f"{name:22s}: AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Select best model
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
print(f"\nBest Model: {best_model_name}")

Step 6: Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Tune the best model (assuming Gradient Boosting)
param_distributions = {
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(3, 10),
    'classifier__learning_rate': uniform(0.01, 0.3),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10)
}

best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    best_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.4f}")

Step 7: Final Evaluation

from sklearn.metrics import roc_curve, precision_recall_curve

# Get best model
final_model = random_search.best_estimator_

# Predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# ROC AUC
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Confusion matrix and ROC curve side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_prob):.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()

Step 8: Feature Importance Analysis

# Get feature names after preprocessing
feature_names = (
    numeric_cols + 
    list(final_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_cols))
)

# Get feature importances
importances = final_model.named_steps['classifier'].feature_importances_

# Sort and plot
indices = np.argsort(importances)[::-1][:15]

plt.figure(figsize=(12, 6))
plt.bar(range(15), importances[indices])
plt.xticks(range(15), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Top 15 Most Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

# Print top features
print("\nTop 10 Important Features:")
for i in indices[:10]:
    print(f"  {feature_names[i]:30s}: {importances[i]:.4f}")

Step 9: Business Insights

# Identify high-risk customers
X_test_copy = X_test.copy()
X_test_copy['churn_probability'] = y_prob
X_test_copy['predicted_churn'] = y_pred
X_test_copy['actual_churn'] = y_test.values

# High-risk customers (probability > 0.7)
high_risk = X_test_copy[X_test_copy['churn_probability'] > 0.7]
print(f"High-Risk Customers (>70% churn probability): {len(high_risk)}")
print("\nProfile of High-Risk Customers:")
print(high_risk[['contract', 'tenure', 'monthly_charges', 'churn_probability']].describe())

# Recommendations
print("\n" + "="*50)
print("BUSINESS RECOMMENDATIONS")
print("="*50)
print("""
1. TARGET MONTH-TO-MONTH CUSTOMERS
   - Highest churn rate
   - Offer discounts for annual contracts
   
2. FOCUS ON NEW CUSTOMERS (tenure < 12 months)
   - Most vulnerable period
   - Implement onboarding program
   
3. REVIEW HIGH-CHARGE CUSTOMERS
   - Consider loyalty discounts
   - Ensure they're getting value
   
4. ELECTRONIC CHECK USERS
   - Higher churn rate
   - Encourage automatic payment methods
""")

Step 10: Save the Model

import joblib

# Save the model
joblib.dump(final_model, 'churn_model.pkl')
print("Model saved to 'churn_model.pkl'")

# How to load and use
# loaded_model = joblib.load('churn_model.pkl')
# predictions = loaded_model.predict(new_data)
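
Note that the engineered features from Step 3 were created outside the pipeline, so they must be rebuilt before scoring new data. Below is a minimal scoring sketch for one new customer; the field values are illustrative, and the 60.0 threshold stands in for the training-set median of monthly_charges, which in practice you would persist alongside the model.

# Load the saved pipeline and score a single new customer (illustrative values)
loaded_model = joblib.load('churn_model.pkl')

new_customer = pd.DataFrame([{
    'tenure': 5, 'monthly_charges': 85.0, 'total_charges': 425.0,
    'contract': 'Month-to-month', 'internet_service': 'Fiber optic',
    'online_security': 'No', 'tech_support': 'No',
    'payment_method': 'Electronic check', 'paperless_billing': 'Yes',
    'senior_citizen': 0, 'partner': 'No', 'dependents': 'No',
}])

# Recreate the Step 3 features exactly as during training
new_customer['charges_per_month'] = new_customer['total_charges'] / (new_customer['tenure'] + 1)
new_customer['is_long_term'] = (new_customer['tenure'] > 24).astype(int)
new_customer['high_charges'] = (new_customer['monthly_charges'] > 60.0).astype(int)  # 60.0 = assumed training median

print(f"Churn probability: {loaded_model.predict_proba(new_customer)[0, 1]:.3f}")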

Production Considerations

Model Monitoring

  • Track prediction drift over time (a minimal drift-check sketch follows this list)
  • Monitor for data quality issues
  • Set up alerts for performance degradation
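
One way to make the drift bullet concrete is a two-sample Kolmogorov-Smirnov test on logged churn probabilities. The sketch below is illustrative: the simulated score batches and the alpha threshold are placeholders, not part of the project above.

from scipy.stats import ks_2samp

def check_prediction_drift(reference_scores, current_scores, alpha=0.01):
    """Flag drift if the current score distribution differs from the reference batch."""
    statistic, p_value = ks_2samp(reference_scores, current_scores)
    return {'ks_statistic': statistic, 'p_value': p_value, 'drift_detected': p_value < alpha}

# Stand-ins for predicted probabilities logged at deployment time vs. the latest week
reference = np.random.beta(2, 5, size=2000)
current = np.random.beta(2, 4, size=2000)
print(check_prediction_drift(reference, current))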

A/B Testing

  • Test model in production with a subset (see the traffic-split sketch after this list)
  • Compare with baseline
  • Gradually roll out
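
A simple way to route only a subset of customers to the new model is a deterministic hash-based split on customer_id; the 10% rollout fraction here is just an example.

import hashlib

def assign_variant(customer_id, treatment_fraction=0.10):
    """Deterministically assign a customer to the new model ('treatment') or the baseline ('control')."""
    bucket = int(hashlib.md5(str(customer_id).encode()).hexdigest(), 16) % 100
    return 'treatment' if bucket < treatment_fraction * 100 else 'control'

for cid in [101, 102, 103]:
    print(cid, assign_variant(cid))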

Retraining Schedule

  • Retrain periodically (weekly/monthly)
  • Automate the pipeline (a retraining sketch follows this list)
  • Version your models
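
A minimal retraining-and-versioning sketch; how you fetch fresh data depends on your own pipeline, so the example call at the bottom is left commented out.

from datetime import datetime

def retrain_and_version(pipeline, X, y, prefix='churn_model'):
    """Refit the full pipeline on fresh data and save a timestamped artifact."""
    pipeline.fit(X, y)
    version = datetime.now().strftime('%Y%m%d_%H%M%S')
    path = f'{prefix}_{version}.pkl'
    joblib.dump(pipeline, path)
    return path

# Example: re-fit the tuned pipeline from Step 6 on the latest training data
# print(retrain_and_version(random_search.best_estimator_, X_train, y_train))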

Documentation

  • Document feature definitions (see the model-card sketch after this list)
  • Record model decisions
  • Maintain changelog
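
A lightweight way to keep feature definitions and model decisions next to the artifact is a small "model card" JSON. The field values below are illustrative and would be filled in from the actual run.

import json

model_card = {
    'model': 'GradientBoostingClassifier tuned with RandomizedSearchCV',
    'target': 'churn (1 = customer left)',
    'features': {
        'charges_per_month': 'total_charges / (tenure + 1), engineered in Step 3',
        'is_long_term': '1 if tenure > 24 months',
        'high_charges': '1 if monthly_charges above the training median',
    },
    'test_roc_auc': round(float(roc_auc_score(y_test, y_prob)), 4),
    'changelog': ['v1: initial model'],
}

with open('churn_model_card.json', 'w') as f:
    json.dump(model_card, f, indent=2)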

🚀 Mini Projects

Project 1: Loan Default Predictor

Build a complete loan default prediction system with EDA, feature engineering, and model selection.

Project 2: Employee Attrition Analyzer

Predict which employees might leave and understand why.

Project 3: Product Recommendation Engine

Build a simple collaborative filtering recommendation system.

Project 4: ML Pipeline with Logging

Create a production-ready ML pipeline with proper logging and experiment tracking.

Key Takeaways

Start with Business

Understand the problem before touching data

EDA is Critical

Visualize and understand your data first

Iterate Quickly

Start simple, then improve

Evaluate Properly

Use appropriate metrics for your problem

What’s Next?

Great job completing the end-to-end project! Now let’s explore unsupervised learning with clustering.

Continue to Module 11: Clustering

Learn to find patterns when you don’t have labels - K-Means, DBSCAN, and more