End-to-End ML Project

The Complete ML Workflow

This module brings the earlier topics together in a single project that moves through these stages:

Problem Definition: What are we solving?
Data Collection: Get the data
EDA: Understand the data
Feature Engineering: Prepare features
Model Selection: Choose algorithms
Training: Fit models
Evaluation: Measure performance
Tuning: Optimize hyperparameters
Deployment: Make it usable

Project: Predicting Customer Churn

Business Problem: A telecom company wants to predict which customers are likely to leave (churn) so that it can offer them incentives to stay.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

Step 1: Load and Explore Data

# This walkthrough simulates data shaped like the Telco Customer Churn dataset
# With a real file you would load it instead: pd.read_csv('customer_churn.csv')

# Simulate the dataset structure
n_samples = 5000

data = {
    'customer_id': range(1, n_samples + 1),
    'tenure': np.random.randint(1, 73, n_samples),  # 1-72 months (upper bound is exclusive)
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 7000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'senior_citizen': np.random.choice([0, 1], n_samples),
    'partner': np.random.choice(['Yes', 'No'], n_samples),
    'dependents': np.random.choice(['Yes', 'No'], n_samples),
}

# Create churn with realistic patterns
churn_prob = (
    0.1 +  # Base rate
    0.3 * (data['contract'] == 'Month-to-month').astype(int) +
    0.2 * (data['monthly_charges'] > 70) +
    -0.15 * (data['tenure'] > 36) +
    0.1 * (data['payment_method'] == 'Electronic check').astype(int)
)
churn_prob = np.clip(churn_prob, 0.05, 0.8)
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nChurn Distribution:")
print(df['churn'].value_counts(normalize=True))
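
A quick data-quality pass often follows this first look at the data. The sketch below is one possible way to do it, assuming a pandas DataFrame; `summarize_columns` and the toy frame are hypothetical, made up for this illustration rather than taken from the walkthrough's dataset.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and number of unique values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
        'n_unique': frame.nunique(dropna=True),
    })


# Tiny made-up frame; in the walkthrough you would pass `df` instead
toy = pd.DataFrame({
    'tenure': [1, 12, np.nan, 40],
    'contract': ['Month-to-month', 'One year', 'One year', None],
})

summary = summarize_columns(toy)
n_duplicates = int(toy.duplicated().sum())

print(summary)
print(f"Duplicate rows: {n_duplicates}")
```

The simulated data in this step has no missing values, so a check like this matters mostly when you switch to a real file.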

Step 2: Exploratory Data Analysis (EDA)

# Churn rate by contract type
print("Churn Rate by Contract Type:")
print(df.groupby('contract')['churn'].mean().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Churn distribution
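# Sort by class so the labels always line up: 0 (stayed) first, then 1 (churned)
churn_counts = df['churn'].value_counts().sort_index()
churn_counts.plot.pie(autopct='%1.1f%%', ax=axes[0, 0],
                      labels=['Stayed', 'Churned'])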
axes[0, 0].set_title('Churn Distribution')

# 2. Tenure distribution by churn
axes[0, 1].hist([df[df['churn']==0]['tenure'], df[df['churn']==1]['tenure']], 
                label=['Stayed', 'Churned'], bins=20, alpha=0.7)
axes[0, 1].set_xlabel('Tenure (months)')
axes[0, 1].set_title('Tenure by Churn Status')
axes[0, 1].legend()

# 3. Monthly charges by churn
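df.boxplot(column='monthly_charges', by='churn', ax=axes[1, 0])
axes[1, 0].set_title('Monthly Charges by Churn')
axes[1, 0].set_xlabel('Churn (0 = stayed, 1 = churned)')
fig.suptitle('')  # Remove the "Boxplot grouped by churn" title pandas adds to the whole figure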

# 4. Contract type vs churn
churn_by_contract = df.groupby('contract')['churn'].mean()
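# groupby sorts alphabetically (Month-to-month, One year, Two year); red marks the riskiest contract
churn_by_contract.plot.bar(ax=axes[1, 1], color=['red', 'orange', 'green'])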
axes[1, 1].set_title('Churn Rate by Contract')
axes[1, 1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()
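
The contract breakdown above can be repeated for every categorical column. The following is a minimal sketch under the assumption that the target is a 0/1 column; `churn_rate_table` is a hypothetical helper and the toy rows are invented for illustration.

```python
import pandas as pd


def churn_rate_table(frame, columns, target='churn'):
    """Stack per-category churn rates and group sizes for several columns."""
    parts = []
    for col in columns:
        stats = (
            frame.groupby(col)[target]
            .agg(churn_rate='mean', n_customers='count')
            .reset_index()
            .rename(columns={col: 'value'})
        )
        stats.insert(0, 'feature', col)
        parts.append(stats)
    return pd.concat(parts, ignore_index=True)


# Invented rows; in the walkthrough you would pass `df` and `categorical_cols`
toy = pd.DataFrame({
    'contract': ['Month-to-month', 'Month-to-month', 'One year',
                 'Two year', 'Two year', 'Month-to-month'],
    'paperless_billing': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
    'churn': [1, 0, 0, 0, 0, 1],
})

table = churn_rate_table(toy, ['contract', 'paperless_billing'])
print(table)
```

Keeping the group size next to each rate helps avoid reading too much into categories with only a handful of customers.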

Step 3: Feature Engineering

# Separate features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

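# Identify column types
# senior_citizen is already 0/1, so it is grouped with the numeric columns
numeric_cols = ['tenure', 'monthly_charges', 'total_charges', 'senior_citizen']
categorical_cols = ['contract', 'internet_service', 'online_security',
                    'tech_support', 'payment_method', 'paperless_billing',
                    'partner', 'dependents']

# Create new features
X['charges_per_month'] = X['total_charges'] / (X['tenure'] + 1)  # +1 avoids division by zero
X['is_long_term'] = (X['tenure'] > 24).astype(int)
# Note: this median is computed on all rows, including rows that later land in the test set
X['high_charges'] = (X['monthly_charges'] > X['monthly_charges'].median()).astype(int)

# Register every new feature: columns missing from both lists are
# silently dropped by the ColumnTransformer in Step 4
numeric_cols.extend(['charges_per_month', 'is_long_term', 'high_charges'])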

print("Feature Engineering Complete!")
print(f"Numeric features: {len(numeric_cols)}")
print(f"Categorical features: {len(categorical_cols)}")
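
The `high_charges` flag above takes its median from the full dataset, before the train/test split. One way to keep that statistic inside the training data is a small custom transformer that learns the median in `fit`. This is a sketch, not the walkthrough's method; `MedianFlagger` and its parameters are hypothetical names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianFlagger(BaseEstimator, TransformerMixin):
    """Add a 0/1 column flagging values above the median learned during fit."""

    def __init__(self, column='monthly_charges', new_column='high_charges'):
        self.column = column
        self.new_column = new_column

    def fit(self, X, y=None):
        # The threshold comes only from the rows passed to fit (the training set)
        self.median_ = X[self.column].median()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.new_column] = (X[self.column] > self.median_).astype(int)
        return X


# Invented values for illustration
train = pd.DataFrame({'monthly_charges': [20.0, 40.0, 60.0, 80.0, 100.0]})
test = pd.DataFrame({'monthly_charges': [55.0, 65.0]})

flagger = MedianFlagger().fit(train)
test_flagged = flagger.transform(test)

print(flagger.median_)
print(test_flagged)
```

Placed as the first step of a `Pipeline`, a transformer like this is refit on each cross-validation training fold, so the threshold never sees held-out rows.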

Step 4: Preprocessing Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Build preprocessor
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

Step 5: Model Selection and Comparison

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Create pipelines and evaluate
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std(),
        'pipeline': pipeline
    }
    
    print(f"{name:22s}: AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Select best model
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
print(f"\nBest Model: {best_model_name}")
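
AUC values are easier to judge next to a trivial baseline. The sketch below compares scikit-learn's `DummyClassifier` with logistic regression on synthetic stand-in data; the dataset and its roughly 20% positive rate are assumptions chosen for illustration, not the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with roughly 20% positives (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42
)

# 'prior' always predicts the class frequencies, ignoring the features
baseline = DummyClassifier(strategy='prior')
logreg = LogisticRegression(max_iter=1000)

baseline_auc = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
logreg_auc = cross_val_score(logreg, X_demo, y_demo, cv=5, scoring='roc_auc').mean()

print(f"Baseline AUC:            {baseline_auc:.4f}")
print(f"Logistic Regression AUC: {logreg_auc:.4f}")
```

A constant-score baseline lands at an AUC of 0.5, so any model in the comparison should clear that mark by a comfortable margin before its ranking means much.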

Step 6: Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Tune one model. Gradient Boosting is hard-coded here; if Step 5 selected a
# different model, swap in that estimator and a matching parameter space
param_distributions = {
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(3, 10),
    'classifier__learning_rate': uniform(0.01, 0.3),  # uniform(loc, scale): samples from [0.01, 0.31]
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10)
}

best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    best_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.4f}")
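
`best_params_` reports only the winning candidate. Looking at the full `cv_results_` table shows how close the runners-up were. The following is a small sketch on synthetic data with a deliberately tiny search so that it runs quickly; the data and parameter ranges are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    {
        'n_estimators': randint(20, 60),
        'learning_rate': uniform(0.01, 0.3),  # samples from [0.01, 0.31]
    },
    n_iter=5, cv=3, scoring='roc_auc', random_state=0
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, best rank first
cv_table = (
    pd.DataFrame(search.cv_results_)
    [['param_n_estimators', 'param_learning_rate',
      'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(cv_table)
```

If the top few rows differ by less than their `std_test_score`, the search has probably not found a meaningfully better configuration, and a simpler candidate may be the safer pick.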

Step 7: Final Evaluation

from sklearn.metrics import roc_curve, precision_recall_curve

# Get best model
final_model = random_search.best_estimator_

# Predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# ROC AUC
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")

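# Plot the confusion matrix and ROC curve side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Stayed', 'Churned'], yticklabels=['Stayed', 'Churned'])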
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_prob):.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()
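
The classification report above relies on the default 0.5 cutoff used by `predict`. For churn, a different cutoff may fit the business goal better. The sketch below uses `precision_recall_curve` to pick the threshold with the best precision among those that reach a target recall; the labels, scores, and `target_recall` value are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; drop that final point
target_recall = 0.75  # hypothetical business requirement
meets_target = recall[:-1] >= target_recall

# Among thresholds that reach the target recall, take the one with the best precision
best_idx = int(np.argmax(np.where(meets_target, precision[:-1], -1.0)))
chosen_threshold = thresholds[best_idx]

# Apply the custom cutoff instead of calling predict()
y_pred_custom = (y_score >= chosen_threshold).astype(int)

print(f"Threshold: {chosen_threshold:.2f}")
print(f"Precision: {precision[best_idx]:.2f}, Recall: {recall[best_idx]:.2f}")
```

In the walkthrough, `y_test` and `y_prob` would take the place of the made-up arrays. Choosing the threshold on a validation split rather than the test set keeps the final test score honest.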

Step 8: Feature Importance Analysis

# Get feature names after preprocessing
feature_names = (
    numeric_cols + 
    list(final_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_cols))
)

# Get feature importances
importances = final_model.named_steps['classifier'].feature_importances_

# Sort and plot (capped at the number of available features)
top_n = min(15, len(importances))
indices = np.argsort(importances)[::-1][:top_n]

plt.figure(figsize=(12, 6))
plt.bar(range(top_n), importances[indices])
plt.xticks(range(top_n), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title(f'Top {top_n} Most Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

# Print top features
print("\nTop 10 Important Features:")
for i in indices[:10]:
    print(f"  {feature_names[i]:30s}: {importances[i]:.4f}")
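
The `feature_importances_` values above are impurity-based and computed from the training data. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data in which, by construction, only the first two columns carry signal; the dataset and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the informative columns come first (columns 0 and 1 here)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

demo_model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in AUC
perm = permutation_importance(
    demo_model, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=0
)

ranking = np.argsort(perm.importances_mean)[::-1]
for i in ranking:
    print(f"feature_{i}: {perm.importances_mean[i]:.4f} +/- {perm.importances_std[i]:.4f}")
```

Passing the whole fitted pipeline (for example `final_model`) with the raw `X_test` would report importance per original column rather than per one-hot column, which is often easier to explain to stakeholders.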

Step 9: Business Insights

# Identify high-risk customers
X_test_copy = X_test.copy()
X_test_copy['churn_probability'] = y_prob
X_test_copy['predicted_churn'] = y_pred
X_test_copy['actual_churn'] = y_test.values

# High-risk customers (probability > 0.7)
high_risk = X_test_copy[X_test_copy['churn_probability'] > 0.7]
print(f"High-Risk Customers (>70% churn probability): {len(high_risk)}")
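print("\nProfile of High-Risk Customers:")
# describe() skips text columns by default, so contract is summarized separately
print(high_risk[['tenure', 'monthly_charges', 'churn_probability']].describe())
print(high_risk['contract'].value_counts())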

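# Recommendations (illustrative text: confirm each point against the EDA
# and feature importances above before acting on it)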
print("\n" + "="*50)
print("BUSINESS RECOMMENDATIONS")
print("="*50)
print("""
1. TARGET MONTH-TO-MONTH CUSTOMERS
   - Highest churn rate
   - Offer discounts for annual contracts
   
2. FOCUS ON NEW CUSTOMERS (tenure < 12 months)
   - Most vulnerable period
   - Implement onboarding program
   
3. REVIEW HIGH-CHARGE CUSTOMERS
   - Consider loyalty discounts
   - Ensure they're getting value
   
4. ELECTRONIC CHECK USERS
   - Higher churn rate
   - Encourage automatic payment methods
""")

Step 10: Save the Model

import joblib

# Save the model
joblib.dump(final_model, 'churn_model.pkl')
print("Model saved to 'churn_model.pkl'")

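# How to load and use (load with the same scikit-learn version used for training;
# new_data needs the same columns as X)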
# loaded_model = joblib.load('churn_model.pkl')
# predictions = loaded_model.predict(new_data)
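
Before shipping a saved model, a round-trip check can confirm that the reloaded object predicts exactly like the original. This is a minimal sketch using a small stand-in pipeline and a temporary directory; `demo_model` and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on synthetic data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, model_path)
    reloaded = joblib.load(model_path)

same_predictions = np.array_equal(
    demo_model.predict_proba(X_demo), reloaded.predict_proba(X_demo)
)
print(f"Round-trip predictions identical: {same_predictions}")
```

Because the whole pipeline is saved, preprocessing travels with the classifier, and the loaded model accepts raw feature rows.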

Production Considerations

Model Monitoring

Track prediction drift over time
Monitor for data quality issues
Set up alerts for performance degradation
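
One common way to quantify drift in a feature or in predicted probabilities is the Population Stability Index (PSI), which compares binned distributions between a reference sample and a recent one. The sketch below is one possible implementation, not a prescribed method; the function name, bin count, and simulated samples are assumptions for illustration.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a new sample, using reference quantile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only, so values outside the reference range land in the end bins
    interior = edges[1:-1]
    expected_counts = np.bincount(
        np.searchsorted(interior, expected, side='right'), minlength=n_bins
    )
    actual_counts = np.bincount(
        np.searchsorted(interior, actual, side='right'), minlength=n_bins
    )
    eps = 1e-6  # avoids log(0) when a bin is empty
    expected_frac = np.clip(expected_counts / len(expected), eps, None)
    actual_frac = np.clip(actual_counts / len(actual), eps, None)
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))


# Simulated monthly charges: training-time reference, a stable month, a shifted month
rng = np.random.default_rng(0)
reference = rng.normal(70, 10, 5000)
stable_month = rng.normal(70, 10, 5000)
shifted_month = rng.normal(80, 10, 5000)

psi_stable = population_stability_index(reference, stable_month)
psi_shifted = population_stability_index(reference, shifted_month)

print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}")
```

A value near zero means the two distributions match closely. Alert thresholds are a judgment call and are best calibrated on your own historical data.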

A/B Testing

Test the model in production on a subset of customers
Compare results against the current baseline
Roll out gradually

Retraining Schedule

Retrain periodically (weekly/monthly)
Automate the pipeline
Version your models

Documentation

Document feature definitions
Record model decisions
Maintain changelog

🚀 Mini Projects

Project 1: Loan Default Predictor

Build a complete loan default prediction system with EDA, feature engineering, and model selection.

Project 2: Employee Attrition Analyzer

Predict which employees might leave and understand why.

Project 3: Product Recommendation Engine

Build a simple collaborative filtering recommendation system.

Project 4: ML Pipeline with Logging

Create a production-ready ML pipeline with proper logging and experiment tracking.

Key Takeaways

Start with Business

Understand the problem before touching data

EDA is Critical

Visualize and understand your data first

Iterate Quickly

Start simple, then improve

Evaluate Properly

Use appropriate metrics for your problem

What’s Next?

Great job completing the end-to-end project! Now let’s explore unsupervised learning with clustering.

Continue to Module 11: Clustering

Learn to find patterns when you don’t have labels - K-Means, DBSCAN, and more
