# End-to-End ML Project

> Build a complete machine learning project from scratch

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/end-to-end-pipeline-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=50b443b359c6d05f565d0d1a1db9512e" alt="End-to-End ML Project Pipeline" width="1080" height="1080" data-path="images/courses/ml-mastery/end-to-end-pipeline-concept.svg" />
</Frame>

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/end-to-end-pipeline-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=3b6aff9d87ee16c1b9d7b218ac509191" alt="Real World ML Pipeline - Uber Surge Pricing" width="1080" height="1080" data-path="images/courses/ml-mastery/end-to-end-pipeline-real-world.svg" />
</Frame>

## The Complete ML Workflow

This module brings everything together in a real project. In practice, this workflow is never linear -- you'll jump back to EDA when your model fails, revisit feature engineering when evaluation reveals blind spots, and retune when new data arrives. Think of it as a spiral, not a waterfall.

1. **Problem Definition**: What are we solving? What metric defines "success"?
2. **Data Collection**: Get the data (often the hardest step)
3. **EDA**: Understand the data before touching any model
4. **Feature Engineering**: Transform raw data into model-ready features
5. **Model Selection**: Choose 2-3 candidate algorithms
6. **Training**: Fit models with cross-validation
7. **Evaluation**: Measure performance on held-out data
8. **Tuning**: Optimize hyperparameters for the best candidate
9. **Deployment**: Make it usable in production

***

## Project: Predicting Customer Churn

**Business Problem**: A telecom company wants to predict which customers will leave (churn) so they can offer them incentives to stay.

```python theme={null}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
```

***

## Step 1: Load and Explore Data

```python theme={null}
# The columns below mirror the public Telco Customer Churn dataset.
# In practice, you'd load real data with: pd.read_csv('customer_churn.csv')

# Here we simulate the dataset structure so the example is self-contained
n_samples = 5000

data = {
    'customer_id': range(1, n_samples + 1),
    'tenure': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 7000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'senior_citizen': np.random.choice([0, 1], n_samples),
    'partner': np.random.choice(['Yes', 'No'], n_samples),
    'dependents': np.random.choice(['Yes', 'No'], n_samples),
}

# Create churn with realistic patterns
churn_prob = (
    0.1 +  # Base rate
    0.3 * (data['contract'] == 'Month-to-month').astype(int) +
    0.2 * (data['monthly_charges'] > 70) +
    -0.15 * (data['tenure'] > 36) +
    0.1 * (data['payment_method'] == 'Electronic check').astype(int)
)
churn_prob = np.clip(churn_prob, 0.05, 0.8)
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nChurn Distribution:")
print(df['churn'].value_counts(normalize=True))
```

***

## Step 2: Exploratory Data Analysis (EDA)

```python theme={null}
# Churn rate by contract type
print("Churn Rate by Contract Type:")
print(df.groupby('contract')['churn'].mean().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

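# 1. Churn distribution
# sort_index() keeps the slices in 0/1 order so they match the labels,
# even if churners ever outnumber non-churners.
df['churn'].value_counts().sort_index().plot.pie(autopct='%1.1f%%', ax=axes[0, 0],
                                                  labels=['Stayed', 'Churned'])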
axes[0, 0].set_title('Churn Distribution')

# 2. Tenure distribution by churn
axes[0, 1].hist([df[df['churn']==0]['tenure'], df[df['churn']==1]['tenure']], 
                label=['Stayed', 'Churned'], bins=20, alpha=0.7)
axes[0, 1].set_xlabel('Tenure (months)')
axes[0, 1].set_title('Tenure by Churn Status')
axes[0, 1].legend()

# 3. Monthly charges by churn
df.boxplot(column='monthly_charges', by='churn', ax=axes[1, 0])
axes[1, 0].set_title('Monthly Charges by Churn')

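# 4. Contract type vs churn
# Bars follow groupby's alphabetical order: Month-to-month, One year, Two year,
# so red marks the least committed contract type.
churn_by_contract = df.groupby('contract')['churn'].mean()
churn_by_contract.plot.bar(ax=axes[1, 1], color=['red', 'orange', 'green'])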
axes[1, 1].set_title('Churn Rate by Contract')
axes[1, 1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()
```

***

## Step 3: Feature Engineering

```python theme={null}
# Separate features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

# Identify column types
numeric_cols = ['tenure', 'monthly_charges', 'total_charges']
categorical_cols = ['contract', 'internet_service', 'online_security', 
                    'tech_support', 'payment_method', 'paperless_billing',
                    'partner', 'dependents']

# Create new features -- each one encodes a business hypothesis.
# "charges_per_month" normalizes total spend by tenure length.
# Without this, a 5-year customer who paid $5000 total looks the same
# as a 1-year customer who paid $5000 total, even though the second
# one is paying 5x more per month.
X['charges_per_month'] = X['total_charges'] / (X['tenure'] + 1)  # +1 avoids division by zero

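# Binary flags are useful for tree-based models -- they create
# easy split points that capture domain knowledge.
X['is_long_term'] = (X['tenure'] > 24).astype(int)  # 2+ years = loyal
# Note: this median uses all rows. In a strict setup, compute it on the training split only.
X['high_charges'] = (X['monthly_charges'] > X['monthly_charges'].median()).astype(int)

# Register every column the model should see. ColumnTransformer drops
# unlisted columns by default, so the new flags (and senior_citizen)
# would otherwise be discarded silently in Step 4.
numeric_cols.extend(['charges_per_month', 'is_long_term', 'high_charges', 'senior_citizen'])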

print("Feature Engineering Complete!")
print(f"Numeric features: {len(numeric_cols)}")
print(f"Categorical features: {len(categorical_cols)}")
# Tip: Feature engineering is iterative. After your first model,
# look at misclassified examples to spot patterns that suggest
# new features. Did the model miss all customers with high charges
# AND month-to-month contracts? Create that interaction feature.
```
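
The interaction feature mentioned in the tip can be sketched as follows. This is a minimal illustration on a tiny hand-made frame: the column name `high_charge_monthly` and the 70 cutoff are assumptions chosen for the example, not part of the project code.

```python
import pandas as pd

# Tiny illustrative frame using the same column names as the churn data.
demo = pd.DataFrame({
    'contract': ['Month-to-month', 'Two year', 'Month-to-month', 'One year'],
    'monthly_charges': [85.0, 90.0, 40.0, 75.0],
})

# Interaction flag: 1 only when BOTH conditions hold.
# The 70 cutoff is an assumed threshold for illustration.
demo['high_charge_monthly'] = (
    (demo['contract'] == 'Month-to-month') & (demo['monthly_charges'] > 70)
).astype(int)

print(demo)
```

A linear model cannot express "both conditions at once" from the two separate columns, so an explicit flag like this mostly helps Logistic Regression. Tree models can learn the combination on their own, but the flag turns it into a single split.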

***

## Step 4: Preprocessing Pipeline

Using sklearn Pipelines is not optional in production ML -- it's the difference between "works on my laptop" and "works reliably in production." Pipelines prevent data leakage (for example, a scaler fitted on rows that later serve as validation or test data), guarantee that training and inference apply identical transforms, and turn deployment into a single `pipeline.predict()` call instead of a fragile sequence of manual transforms.

```python theme={null}
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Build preprocessor -- one pipeline per data type.
# The ColumnTransformer routes each column to the right pipeline.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Median is robust to outliers
    ('scaler', StandardScaler())                     # Important for LR and SVM, harmless for trees
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),  # Explicit "missing" category
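    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
    # drop='first' avoids the "dummy variable trap" (multicollinearity).
    # handle_unknown='ignore' prevents crashes when test data has unseen categories;
    # they are encoded as all zeros, the same as the dropped first category.
    # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False).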
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
```
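
To make the leakage point concrete, here is a small sketch on synthetic data (the arrays and step names are made up for illustration). It checks that a scaler inside a `Pipeline` learns its statistics from the training rows only, and that the same statistics are reused at prediction time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 2))
y_demo = (X_demo[:, 0] > 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)  # the scaler only ever sees X_tr

# The fitted scaler's mean matches the training mean, not the full-data mean.
scaler_mean = pipe.named_steps['scaler'].mean_
print("Scaler mean:   ", scaler_mean)
print("Train mean:    ", X_tr.mean(axis=0))
print("Full-data mean:", X_demo.mean(axis=0))

# Scoring reuses the training statistics on the held-out rows.
test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

The same mechanism applies inside `cross_val_score`: the whole pipeline is refitted on each training fold, so the validation fold never influences the scaler or imputer.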

***

## Step 5: Model Selection and Comparison

A senior engineer's approach to model selection: never bet on one model. Train 3-4 candidates with default hyperparameters, compare on the same cross-validation folds, then invest tuning effort only on the top 1-2. It's like auditioning actors -- you don't give everyone a costume fitting before the first read-through.

```python theme={null}
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Models to compare -- intentionally using default hyperparameters.
# The goal here is to find which FAMILY of models works best,
# not to squeeze out every last % of performance.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}
# Model selection tip: For tabular data, Gradient Boosting or Random Forest
# is often the strongest performer, so start there. Logistic Regression is your
# interpretability baseline. SVM is worth trying for small datasets (<10K rows).

# Create pipelines and evaluate
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std(),
        'pipeline': pipeline
    }
    
    print(f"{name:22s}: AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Select best model based on cross-validation AUC.
# Important: don't just pick the highest number -- also consider:
# 1. Standard deviation (lower = more reliable estimate)
# 2. Training time (matters for retraining in production)
# 3. Interpretability (can you explain predictions to stakeholders?)
# A model that's 0.5% worse in AUC but 10x faster to train and
# easy to explain might be the better business choice.
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
print(f"\nBest Model: {best_model_name}")
```
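
The selection criteria listed in the comments (spread across folds, training time) can be measured rather than guessed. The sketch below uses `cross_validate` with one shared `StratifiedKFold` splitter so every candidate sees identical folds, and it reports fit time alongside AUC. The synthetic dataset and the `summary` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the churn features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# One splitter with a fixed seed, so each model is scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, model in candidates.items():
    out = cross_validate(model, X_demo, y_demo, cv=cv, scoring='roc_auc')
    summary[name] = {
        'mean_auc': out['test_score'].mean(),
        'std_auc': out['test_score'].std(),
        'mean_fit_seconds': out['fit_time'].mean(),
    }
    print(f"{name:22s}: AUC = {summary[name]['mean_auc']:.4f} "
          f"(+/- {summary[name]['std_auc']:.4f}), "
          f"fit = {summary[name]['mean_fit_seconds']:.3f}s")
```

With the numbers side by side, a small AUC gap can be weighed against a large difference in fit time before committing tuning effort to one model.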

***

## Step 6: Hyperparameter Tuning

```python theme={null}
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

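# Tune the best model. This search space assumes Gradient Boosting won in Step 5;
# swap the estimator and parameter names if a different model came out on top.
param_distributions = {
    'classifier__n_estimators': randint(50, 300),       # integers in [50, 300)
    'classifier__max_depth': randint(3, 10),
    'classifier__learning_rate': uniform(0.01, 0.3),    # uniform(loc, scale): samples from [0.01, 0.31]
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10)
}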

best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    best_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.4f}")
```
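
The SciPy distributions passed to `RandomizedSearchCV` are easy to misread, because `uniform(loc, scale)` takes a start and a width rather than a start and an end. The sketch below samples from `uniform(0.01, 0.3)` to show its actual range, and compares it with `loguniform`, a possible alternative for learning rates. The comparison is an illustration, not a claim that either choice is better for this project.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) covers [loc, loc + scale], i.e. [0.01, 0.31] here.
samples_u = uniform(0.01, 0.3).rvs(size=1000, random_state=42)

# loguniform(a, b) covers [a, b] and spreads samples evenly across
# orders of magnitude, so small values are drawn more often.
samples_l = loguniform(0.01, 0.3).rvs(size=1000, random_state=42)

share_small_u = (samples_u < 0.05).mean()
share_small_l = (samples_l < 0.05).mean()

print(f"uniform:    min={samples_u.min():.3f}, max={samples_u.max():.3f}, "
      f"share below 0.05 = {share_small_u:.2f}")
print(f"loguniform: min={samples_l.min():.3f}, max={samples_l.max():.3f}, "
      f"share below 0.05 = {share_small_l:.2f}")
```

If small learning rates deserve as much exploration as large ones, a log-uniform distribution spends more of the `n_iter` budget in that region.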

***

## Step 7: Final Evaluation

```python theme={null}
from sklearn.metrics import roc_curve, precision_recall_curve

# Get best model
final_model = random_search.best_estimator_

# Predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# ROC AUC
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Plot the confusion matrix and ROC curve side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_prob):.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()
```
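
`predict()` applies a fixed 0.5 cutoff, which is rarely the right trade-off for churn, where missing a churner usually costs more than a wasted offer. The sketch below uses `precision_recall_curve` to pick a threshold that reaches a target recall. The 0.80 target, the synthetic data, and the validation split are assumptions for illustration; in the project you would choose the threshold on validation data, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for churn (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_prob = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the last
# point so all three arrays line up.
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)
precision, recall = precision[:-1], recall[:-1]

target_recall = 0.80  # assumed business requirement
meets_target = recall >= target_recall

# Among thresholds that reach the recall target, keep the most precise one.
best_idx = np.argmax(np.where(meets_target, precision, -1.0))
chosen_threshold = thresholds[best_idx]

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Precision: {precision[best_idx]:.3f}, Recall: {recall[best_idx]:.3f}")

# Apply the custom threshold instead of calling predict().
y_val_pred = (val_prob >= chosen_threshold).astype(int)
```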

***

## Step 8: Feature Importance Analysis

```python theme={null}
# Get feature names after preprocessing
feature_names = (
    numeric_cols + 
    list(final_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_cols))
)

# Get feature importances
importances = final_model.named_steps['classifier'].feature_importances_

# Sort and plot
indices = np.argsort(importances)[::-1][:15]

plt.figure(figsize=(12, 6))
plt.bar(range(15), importances[indices])
plt.xticks(range(15), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Top 15 Most Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

# Print top features
print("\nTop 10 Important Features:")
for i in indices[:10]:
    print(f"  {feature_names[i]:30s}: {importances[i]:.4f}")
```
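
Impurity-based `feature_importances_` are computed on the training data and can favor features with many distinct values. A common cross-check is permutation importance on held-out data, sketched below with scikit-learn's `permutation_importance`. The synthetic dataset is an assumption for illustration: with `shuffle=False`, `make_classification` places the three informative features in columns 0 to 2 and fills the rest with noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0-2 are informative, columns 3-5 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the AUC drop.
result = permutation_importance(
    model, X_te, y_te, scoring='roc_auc', n_repeats=10, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

`permutation_importance` also accepts a full pipeline and the raw feature frame, in which case the scores refer to the original columns rather than the one-hot encoded ones.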

***

## Step 9: Business Insights

```python theme={null}
# Identify high-risk customers
X_test_copy = X_test.copy()
X_test_copy['churn_probability'] = y_prob
X_test_copy['predicted_churn'] = y_pred
X_test_copy['actual_churn'] = y_test.values

# High-risk customers (probability > 0.7)
high_risk = X_test_copy[X_test_copy['churn_probability'] > 0.7]
print(f"High-Risk Customers (>70% churn probability): {len(high_risk)}")
print("\nProfile of High-Risk Customers:")
print(high_risk[['contract', 'tenure', 'monthly_charges', 'churn_probability']].describe())

# Recommendations (illustrative; confirm each against your own EDA and feature importances)
print("\n" + "="*50)
print("BUSINESS RECOMMENDATIONS")
print("="*50)
print("""
1. TARGET MONTH-TO-MONTH CUSTOMERS
   - Highest churn rate
   - Offer discounts for annual contracts
   
2. FOCUS ON NEW CUSTOMERS (tenure < 12 months)
   - Most vulnerable period
   - Implement onboarding program
   
3. REVIEW HIGH-CHARGE CUSTOMERS
   - Consider loyalty discounts
   - Ensure they're getting value
   
4. ELECTRONIC CHECK USERS
   - Higher churn rate
   - Encourage automatic payment methods
""")
```

***

## Step 10: Save the Model

```python theme={null}
import joblib

# Save the ENTIRE pipeline (preprocessor + model) -- not just the model!
# This is critical: if you save only the model and forget the scaler,
# production predictions will be wrong because raw features won't match
# what the model was trained on.
joblib.dump(final_model, 'churn_model.pkl')
print("Model saved to 'churn_model.pkl'")

# How to load and use in production:
# loaded_model = joblib.load('churn_model.pkl')
# predictions = loaded_model.predict(new_data)  # raw data goes in, predictions come out
# The pipeline handles all preprocessing internally.

# Also save metadata for future debugging:
import json
metadata = {
    'training_date': '2025-01-15',
    'n_training_samples': len(X_train),
    'features': list(X_train.columns),
    'best_params': random_search.best_params_,
    'cv_auc': random_search.best_score_,
}
# with open('churn_model_metadata.json', 'w') as f:
#     json.dump(metadata, f, indent=2)
```
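
Before shipping a saved pipeline, it is worth confirming that the reloaded object reproduces the original predictions. The sketch below does a save/load round trip in a temporary directory. The small pipeline, the file name `demo_pipeline.pkl`, and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline: preprocessing and model in one object.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should give the same probabilities on raw inputs.
original_prob = pipe.predict_proba(X_demo)[:, 1]
reloaded_prob = reloaded.predict_proba(X_demo)[:, 1]
round_trip_ok = bool(np.allclose(original_prob, reloaded_prob))
print("Round trip OK:", round_trip_ok)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, so recording `sklearn.__version__` in the metadata file is a cheap safeguard.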

***

## Production Considerations

<CardGroup cols={2}>
  <Card title="Model Monitoring" icon="chart-line">
    * Track prediction drift over time
    * Monitor for data quality issues
    * Set up alerts for performance degradation
  </Card>

  <Card title="A/B Testing" icon="flask">
    * Test model in production with a subset
    * Compare with baseline
    * Gradually roll out
  </Card>

  <Card title="Retraining Schedule" icon="rotate">
    * Retrain periodically (weekly/monthly)
    * Automate the pipeline
    * Version your models
  </Card>

  <Card title="Documentation" icon="book">
    * Document feature definitions
    * Record model decisions
    * Maintain changelog
  </Card>
</CardGroup>
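
One way to track prediction drift is the Population Stability Index (PSI), which compares the distribution of model scores at training time with the scores from a recent batch. The sketch below is a minimal version under stated assumptions: the function name `population_stability_index`, the ten quantile bins, and the simulated score samples are all illustrative choices, not part of the project code.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between two score samples, binned on the quantiles of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Inner bin edges come from the baseline; scores outside the baseline
    # range fall into the first or last bin.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    expected_share = np.bincount(
        np.digitize(expected, inner_edges), minlength=n_bins) / len(expected)
    actual_share = np.bincount(
        np.digitize(actual, inner_edges), minlength=n_bins) / len(actual)
    # Clip to avoid log(0) when a bin is empty.
    expected_share = np.clip(expected_share, 1e-6, None)
    actual_share = np.clip(actual_share, 1e-6, None)
    return float(np.sum(
        (actual_share - expected_share) * np.log(actual_share / expected_share)))

rng = np.random.default_rng(42)
baseline_scores = rng.beta(2, 5, size=5000)  # simulated scores at training time
same_scores = rng.beta(2, 5, size=5000)      # new batch, same distribution
shifted_scores = rng.beta(3, 4, size=5000)   # new batch, scores drifted upward

psi_same = population_stability_index(baseline_scores, same_scores)
psi_shifted = population_stability_index(baseline_scores, shifted_scores)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (drifted):  {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions, so calibrate alert levels against your own history.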

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: Loan Default Predictor" icon="building-columns">
    Build a complete loan approval system
  </Card>

  <Card title="Project 2: Employee Attrition Analyzer" icon="user-minus">
    Predict which employees might leave
  </Card>

  <Card title="Project 3: Product Recommendation Engine" icon="cart-plus">
    Build a simple recommendation system
  </Card>

  <Card title="Project 4: ML Pipeline with Logging" icon="clipboard-list">
    Create a production-ready ML pipeline
  </Card>
</CardGroup>

### Project 1: Loan Default Predictor

Build a complete loan default prediction system with EDA, feature engineering, and model selection.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.preprocessing import StandardScaler, LabelEncoder
  from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
  import warnings
  warnings.filterwarnings('ignore')

  # Step 1: Generate synthetic loan data
  np.random.seed(42)
  n_samples = 2000

  data = {
      'age': np.random.randint(21, 70, n_samples),
      'income': np.random.lognormal(10.5, 0.5, n_samples),
      'loan_amount': np.random.lognormal(10, 0.8, n_samples),
      'credit_score': np.random.normal(680, 80, n_samples).clip(300, 850),
      'employment_years': np.random.exponential(5, n_samples).clip(0, 40),
      'num_credit_cards': np.random.poisson(3, n_samples),
      'num_loans': np.random.poisson(2, n_samples),
      'home_ownership': np.random.choice(['rent', 'own', 'mortgage'], n_samples, p=[0.4, 0.3, 0.3]),
      'loan_purpose': np.random.choice(['debt', 'home', 'car', 'education', 'business'], n_samples),
  }
  df = pd.DataFrame(data)

  # Create target (default probability based on features)
  default_prob = (
      0.3 * (df['credit_score'] < 600).astype(int) +
      0.2 * (df['loan_amount'] / df['income'] > 5).astype(int) +
      0.15 * (df['num_loans'] > 3).astype(int) +
      0.1 * (df['employment_years'] < 1).astype(int)
  )
  df['default'] = (np.random.random(n_samples) < default_prob * 0.8).astype(int)

  print("="*60)
  print("📊 LOAN DEFAULT PREDICTION PROJECT")
  print("="*60)

  # Step 2: Exploratory Data Analysis
  print("\n1️⃣ EXPLORATORY DATA ANALYSIS")
  print("-"*40)
  print(f"Dataset shape: {df.shape}")
  print(f"Default rate: {df['default'].mean()*100:.1f}%")
  print("\nMissing values:", df.isnull().sum().sum())

  print("\nNumerical features summary:")
  print(df.describe().round(2))

  # Step 3: Feature Engineering
  print("\n2️⃣ FEATURE ENGINEERING")
  print("-"*40)

  # Create new features
  df['loan_to_income'] = df['loan_amount'] / df['income']
  df['credit_utilization'] = df['num_credit_cards'] * 2000 / df['income']
  df['total_debt_items'] = df['num_credit_cards'] + df['num_loans']
  df['income_per_year_employed'] = df['income'] / (df['employment_years'] + 1)
  df['is_young'] = (df['age'] < 30).astype(int)
  df['has_many_loans'] = (df['num_loans'] > 3).astype(int)
  df['low_credit'] = (df['credit_score'] < 600).astype(int)

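  # Encode categorical variables as integer codes. This is fine for tree models;
  # for Logistic Regression, one-hot encoding avoids implying an order.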
  le_home = LabelEncoder()
  le_purpose = LabelEncoder()
  df['home_ownership_encoded'] = le_home.fit_transform(df['home_ownership'])
  df['loan_purpose_encoded'] = le_purpose.fit_transform(df['loan_purpose'])

  print(f"New features created: loan_to_income, credit_utilization, etc.")
  print(f"Total features: {len(df.columns) - 3}")  # Exclude original categorical and target

  # Step 4: Prepare data for modeling
  feature_cols = ['age', 'income', 'loan_amount', 'credit_score', 'employment_years',
                  'num_credit_cards', 'num_loans', 'loan_to_income', 'credit_utilization',
                  'total_debt_items', 'income_per_year_employed', 'is_young', 
                  'has_many_loans', 'low_credit', 'home_ownership_encoded', 'loan_purpose_encoded']

  X = df[feature_cols]
  y = df['default']

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

  # Scale features
  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  # Step 5: Model Selection
  print("\n3️⃣ MODEL SELECTION & TRAINING")
  print("-"*40)

  models = {
      'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
      'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
      'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
  }

  results = []
  for name, model in models.items():
      # Use scaled data for logistic regression
      if name == 'Logistic Regression':
          model.fit(X_train_scaled, y_train)
          cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
          y_pred = model.predict(X_test_scaled)
          y_proba = model.predict_proba(X_test_scaled)[:, 1]
      else:
          model.fit(X_train, y_train)
          cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
          y_pred = model.predict(X_test)
          y_proba = model.predict_proba(X_test)[:, 1]
      
      roc_auc = roc_auc_score(y_test, y_proba)
      results.append({
          'model': name,
          'cv_auc': cv_scores.mean(),
          'cv_std': cv_scores.std(),
          'test_auc': roc_auc
      })
      print(f"{name}: CV AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f}), Test AUC = {roc_auc:.4f}")

  # Step 6: Best Model Evaluation
  print("\n4️⃣ BEST MODEL EVALUATION")
  print("-"*40)

  # Select on cross-validation AUC; the test set is reserved for the final estimate.
  best_result = max(results, key=lambda x: x['cv_auc'])
  print(f"Best model: {best_result['model']}")

  best_model = models[best_result['model']]
  if best_result['model'] == 'Logistic Regression':
      y_pred = best_model.predict(X_test_scaled)
  else:
      y_pred = best_model.predict(X_test)

  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  # Step 7: Feature Importance (for RF/GB)
  if hasattr(best_model, 'feature_importances_'):
      importance = pd.DataFrame({
          'feature': feature_cols,
          'importance': best_model.feature_importances_
      }).sort_values('importance', ascending=False)
      
      print("\n5️⃣ TOP FEATURES")
      print("-"*40)
      for i, row in importance.head(10).iterrows():
          print(f"  {row['feature']:30s}: {row['importance']:.4f}")

  # Step 8: Business Recommendations
  print("\n6️⃣ BUSINESS RECOMMENDATIONS")
  print("-"*40)
  print("Based on feature importance:")
  print("  1. Credit score is the strongest predictor of default")
  print("  2. Loan-to-income ratio indicates overextension")
  print("  3. Employment stability matters significantly")
  print("\nRecommended actions:")
  print("  - Flag applicants with credit score < 600 for manual review")
  print("  - Cap loan-to-income ratio at 5x for automatic approval")
  print("  - Require employment verification for < 1 year employed")
  ```

  **What you learned:**

  * Complete ML workflow from EDA to business recommendations
  * Feature engineering specific to financial data
  * Translating model insights into business actions
</details>

### Project 2: Employee Attrition Analyzer

Predict which employees might leave and understand why.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.preprocessing import StandardScaler
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report, roc_auc_score
  import warnings
  warnings.filterwarnings('ignore')

  # Step 1: Generate synthetic employee data
  np.random.seed(42)
  n_employees = 1500

  data = {
      'age': np.random.randint(22, 60, n_employees),
      'years_at_company': np.random.exponential(4, n_employees).clip(0, 30),
      'salary': np.random.lognormal(11, 0.4, n_employees),
      'distance_from_home': np.random.exponential(10, n_employees).clip(1, 50),
      'num_projects': np.random.poisson(5, n_employees).clip(1, 15),
      'avg_hours_per_week': np.random.normal(45, 8, n_employees).clip(35, 70),
      'last_promotion_years': np.random.exponential(2, n_employees).clip(0, 10),
      'satisfaction_score': np.random.beta(7, 3, n_employees),  # Mostly satisfied
      'performance_rating': np.random.choice([1, 2, 3, 4, 5], n_employees, p=[0.05, 0.1, 0.5, 0.25, 0.1]),
      'department': np.random.choice(['Sales', 'Engineering', 'HR', 'Marketing'], n_employees),
      'job_level': np.random.choice([1, 2, 3, 4, 5], n_employees, p=[0.3, 0.3, 0.2, 0.15, 0.05]),
  }
  df = pd.DataFrame(data)

  # Create attrition (based on realistic factors)
  attrition_prob = (
      0.2 * (df['satisfaction_score'] < 0.5).astype(int) +
      0.15 * (df['last_promotion_years'] > 4).astype(int) +
      0.1 * (df['avg_hours_per_week'] > 55).astype(int) +
      0.1 * (df['years_at_company'] < 2).astype(int) +
      0.05 * (df['distance_from_home'] > 25).astype(int)
  )
  df['attrition'] = (np.random.random(n_employees) < attrition_prob).astype(int)

  print("="*60)
  print("👥 EMPLOYEE ATTRITION ANALYSIS")
  print("="*60)

  # Step 2: EDA
  print("\n1️⃣ EXPLORATORY DATA ANALYSIS")
  print("-"*40)
  print(f"Total employees: {len(df)}")
  print(f"Attrition rate: {df['attrition'].mean()*100:.1f}%")

  # Attrition by department
  print("\nAttrition by Department:")
  for dept in df['department'].unique():
      rate = df[df['department'] == dept]['attrition'].mean() * 100
      print(f"  {dept}: {rate:.1f}%")

  # Step 3: Feature Engineering
  print("\n2️⃣ FEATURE ENGINEERING")
  print("-"*40)

  df['salary_per_year'] = df['salary'] / (df['years_at_company'] + 1)
  df['workload'] = df['num_projects'] * df['avg_hours_per_week']
  df['is_overworked'] = (df['avg_hours_per_week'] > 50).astype(int)
  df['is_underpaid'] = (df['salary'] < df['salary'].quantile(0.25)).astype(int)
  df['stuck_no_promotion'] = ((df['last_promotion_years'] > 3) & (df['performance_rating'] >= 3)).astype(int)
  df['at_risk'] = ((df['satisfaction_score'] < 0.5) | (df['last_promotion_years'] > 4)).astype(int)

  # Encode categorical
  df['dept_encoded'] = pd.factorize(df['department'])[0]

  # Create department dummies
  dept_dummies = pd.get_dummies(df['department'], prefix='dept')
  df = pd.concat([df, dept_dummies], axis=1)

  print("Created risk indicators: is_overworked, is_underpaid, stuck_no_promotion")

  # Step 4: Prepare data
  feature_cols = ['age', 'years_at_company', 'salary', 'distance_from_home', 'num_projects',
                  'avg_hours_per_week', 'last_promotion_years', 'satisfaction_score', 
                  'performance_rating', 'job_level', 'salary_per_year', 'workload',
                  'is_overworked', 'is_underpaid', 'stuck_no_promotion',
                  'dept_Engineering', 'dept_HR', 'dept_Marketing', 'dept_Sales']

  X = df[feature_cols]
  y = df['attrition']

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

  # Step 5: Train models
  print("\n3️⃣ MODEL TRAINING")
  print("-"*40)

  rf = RandomForestClassifier(n_estimators=100, random_state=42)
  rf.fit(X_train, y_train)

  cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='roc_auc')
  print(f"Random Forest CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

  y_pred = rf.predict(X_test)
  y_proba = rf.predict_proba(X_test)[:, 1]
  print(f"Test AUC: {roc_auc_score(y_test, y_proba):.4f}")

  print("\nClassification Report:")
  print(classification_report(y_test, y_pred))

  # Step 6: Feature Importance
  print("\n4️⃣ KEY ATTRITION FACTORS")
  print("-"*40)

  importance = pd.DataFrame({
      'feature': feature_cols,
      'importance': rf.feature_importances_
  }).sort_values('importance', ascending=False)

  for i, row in importance.head(10).iterrows():
      print(f"  {row['feature']:25s}: {row['importance']:.4f}")

  # Step 7: Risk Scoring
  print("\n5️⃣ EMPLOYEE RISK SCORING")
  print("-"*40)

  # Add risk scores to test data (.loc selects by index label, matching X_test.index)
  test_df = df.loc[X_test.index].copy()
  test_df['risk_score'] = y_proba

  # Find high-risk employees
  high_risk = test_df[test_df['risk_score'] > 0.5].sort_values('risk_score', ascending=False)
  print(f"High-risk employees: {len(high_risk)} ({len(high_risk)/len(test_df)*100:.1f}%)")

  print("\nTop 5 At-Risk Employees:")
  for i, row in high_risk.head().iterrows():
      print(f"  Risk: {row['risk_score']:.2%} | Satisfaction: {row['satisfaction_score']:.2f} | "
            f"Years since promotion: {row['last_promotion_years']:.1f}")

  # Step 8: Retention Recommendations
  print("\n6️⃣ RETENTION STRATEGY")
  print("-"*40)

  print("Based on analysis, recommend:")
  print("\n🎯 Immediate Actions for High-Risk Employees:")
  print("  - Schedule 1-on-1 meetings")
  print("  - Review compensation vs market rate")
  print("  - Discuss career development opportunities")

  print("\n📊 Systemic Changes:")
  if importance[importance['feature'] == 'satisfaction_score']['importance'].values[0] > 0.1:
      print("  - Satisfaction is key: Launch engagement surveys")
  if importance[importance['feature'] == 'last_promotion_years']['importance'].values[0] > 0.1:
      print("  - Promotion stagnation matters: Review promotion criteria")
  if importance[importance['feature'] == 'avg_hours_per_week']['importance'].values[0] > 0.05:
      print("  - Overwork is an issue: Enforce work-life balance policies")
  ```

  **What you learned:**

  * HR analytics application of ML
  * Creating actionable risk scores
  * Translating predictions into retention strategies
</details>

### Project 3: Product Recommendation Engine

Build a simple collaborative filtering recommendation system.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  from sklearn.metrics.pairwise import cosine_similarity
  from sklearn.model_selection import train_test_split
  from collections import defaultdict

  # Step 1: Generate synthetic e-commerce data
  np.random.seed(42)
  n_users = 500
  n_products = 100
  n_interactions = 5000

  # Create user-product interactions
  users = np.random.randint(0, n_users, n_interactions)
  products = np.random.randint(0, n_products, n_interactions)
  ratings = np.random.randint(1, 6, n_interactions)

  # Skew the data: move about half of each even-numbered product's interactions
  # to the next odd-numbered product, so product popularity is uneven
  for i in range(0, n_products - 1, 2):
      mask = products == i
      products[mask & (np.random.random(n_interactions) > 0.5)] = i + 1

  df = pd.DataFrame({
      'user_id': users,
      'product_id': products,
      'rating': ratings
  }).drop_duplicates(['user_id', 'product_id'])

  # Add product metadata
  product_names = [f"Product_{i}" for i in range(n_products)]
  product_categories = np.random.choice(['Electronics', 'Books', 'Clothing', 'Home'], n_products)

  print("="*60)
  print("🛒 PRODUCT RECOMMENDATION ENGINE")
  print("="*60)
  print(f"Users: {n_users}")
  print(f"Products: {n_products}")
  print(f"Interactions: {len(df)}")

  # Step 2: Create user-item matrix
  print("\n1️⃣ BUILDING USER-ITEM MATRIX")
  print("-"*40)

  user_item_matrix = df.pivot_table(
      index='user_id', 
      columns='product_id', 
      values='rating', 
      fill_value=0
  )
  print(f"Matrix shape: {user_item_matrix.shape}")
  print(f"Sparsity: {(user_item_matrix == 0).sum().sum() / user_item_matrix.size * 100:.1f}%")

  # Step 3: User-based Collaborative Filtering
  print("\n2️⃣ USER-BASED COLLABORATIVE FILTERING")
  print("-"*40)

  # Calculate user similarity
  user_similarity = cosine_similarity(user_item_matrix)
  user_sim_df = pd.DataFrame(
      user_similarity, 
      index=user_item_matrix.index, 
      columns=user_item_matrix.index
  )

  def recommend_user_based(user_id, n_recommendations=5, n_neighbors=10):
      """Recommend products based on similar users"""
      # Get the most similar users, excluding the user themselves
      similar_users = user_sim_df[user_id].drop(user_id).sort_values(ascending=False).iloc[:n_neighbors]
      
      # Products the user has already rated (excluded from recommendations)
      user_products = set(df[df['user_id'] == user_id]['product_id'])
      
      # Weighted average of similar users' ratings
      scores = defaultdict(float)
      weights = defaultdict(float)
      
      for similar_user, similarity in similar_users.items():
          similar_user_ratings = df[df['user_id'] == similar_user]
          for _, row in similar_user_ratings.iterrows():
              if row['product_id'] not in user_products:
                  scores[row['product_id']] += similarity * row['rating']
                  weights[row['product_id']] += similarity
      
      # Calculate weighted average
      recommendations = []
      for product_id in scores:
          if weights[product_id] > 0:
              avg_score = scores[product_id] / weights[product_id]
              recommendations.append((product_id, avg_score))
      
      recommendations.sort(key=lambda x: x[1], reverse=True)
      return recommendations[:n_recommendations]

  # Test for a specific user
  test_user = 0
  print(f"\nRecommendations for User {test_user}:")
  recs = recommend_user_based(test_user)
  for product_id, score in recs:
      print(f"  Product_{product_id} (predicted rating: {score:.2f})")

  # Step 4: Item-based Collaborative Filtering
  print("\n3️⃣ ITEM-BASED COLLABORATIVE FILTERING")
  print("-"*40)

  # Calculate item similarity
  item_similarity = cosine_similarity(user_item_matrix.T)
  item_sim_df = pd.DataFrame(
      item_similarity, 
      index=user_item_matrix.columns, 
      columns=user_item_matrix.columns
  )

  def recommend_item_based(user_id, n_recommendations=5, n_similar_items=5):
      """Recommend products based on items the user liked"""
      # Get user's highly rated products (rating >= 4)
      user_ratings = df[(df['user_id'] == user_id) & (df['rating'] >= 4)]
      liked_products = user_ratings['product_id'].tolist()
      
      user_all_products = set(df[df['user_id'] == user_id]['product_id'])
      
      scores = defaultdict(float)
      for liked_product in liked_products:
          # Most similar products, excluding the liked product itself
          similar_products = item_sim_df[liked_product].drop(liked_product).sort_values(ascending=False).iloc[:n_similar_items]
          for product_id, similarity in similar_products.items():
              if product_id not in user_all_products:
                  scores[product_id] += similarity
      
      recommendations = sorted(scores.items(), key=lambda x: x[1], reverse=True)
      return recommendations[:n_recommendations]

  print(f"Item-based recommendations for User {test_user}:")
  recs = recommend_item_based(test_user)
  for product_id, score in recs:
      print(f"  Product_{product_id} (similarity score: {score:.2f})")

  # Step 5: Hybrid Recommendation
  print("\n4️⃣ HYBRID RECOMMENDATION")
  print("-"*40)

  def hybrid_recommend(user_id, n_recommendations=5, alpha=0.5):
      """Combine user-based and item-based recommendations"""
      user_recs = {p: s for p, s in recommend_user_based(user_id, n_recommendations=20)}
      item_recs = {p: s for p, s in recommend_item_based(user_id, n_recommendations=20)}
      
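      # Normalize scores to [0, 1]; skip when the maximum is 0 to avoid dividing by zero
      if user_recs and max(user_recs.values()) > 0:
          max_user = max(user_recs.values())
          user_recs = {p: s/max_user for p, s in user_recs.items()}
      if item_recs and max(item_recs.values()) > 0:
          max_item = max(item_recs.values())
          item_recs = {p: s/max_item for p, s in item_recs.items()}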
      
      # Combine
      all_products = set(user_recs.keys()) | set(item_recs.keys())
      combined = []
      for product_id in all_products:
          score = alpha * user_recs.get(product_id, 0) + (1-alpha) * item_recs.get(product_id, 0)
          combined.append((product_id, score))
      
      combined.sort(key=lambda x: x[1], reverse=True)
      return combined[:n_recommendations]

  print(f"Hybrid recommendations for User {test_user}:")
  recs = hybrid_recommend(test_user)
  for product_id, score in recs:
      print(f"  Product_{product_id} (hybrid score: {score:.2f})")

  # Step 6: Evaluation
  print("\n5️⃣ EVALUATION")
  print("-"*40)

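  def evaluate_recommendations(n_eval_users=50, k=10):
      """Recall@k of the hybrid recommender on held-out liked products"""
      # The recommenders read the module-level df and similarity tables and skip
      # products a user has already rated, so held-out items must be removed
      # from those globals first; otherwise every hit count would be 0.
      global df, user_sim_df, item_sim_df
      full_state = (df, user_sim_df, item_sim_df)
      
      # Hold out the second half of each evaluated user's liked products (rating >= 4)
      liked = df[(df['user_id'] < min(n_eval_users, n_users)) & (df['rating'] >= 4)]
      test_idx = []
      for _, group in liked.groupby('user_id'):
          if len(group) >= 2:
              test_idx.extend(group.index[len(group)//2:])
      test_df = df.loc[test_idx]
      
      hits, total = 0, 0
      try:
          # Rebuild the similarity tables from the training interactions only
          df = df.drop(index=test_idx)
          train_matrix = df.pivot_table(index='user_id', columns='product_id',
                                        values='rating', fill_value=0)
          user_sim_df = pd.DataFrame(cosine_similarity(train_matrix),
                                     index=train_matrix.index, columns=train_matrix.index)
          item_sim_df = pd.DataFrame(cosine_similarity(train_matrix.T),
                                     index=train_matrix.columns, columns=train_matrix.columns)
          
          for user_id, group in test_df.groupby('user_id'):
              actual = set(group['product_id'])
              recs = {p for p, _ in hybrid_recommend(user_id, n_recommendations=k)}
              hits += len(recs & actual)
              total += len(actual)
      finally:
          # Restore the full data so later code sees the original state
          df, user_sim_df, item_sim_df = full_state
      
      # Share of held-out liked products that appear in the top-k recommendations
      return hits / total if total > 0 else 0

  recall = evaluate_recommendations()
  print(f"Recall@10: {recall:.4f}")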

  print("\n6️⃣ SYSTEM READY FOR DEPLOYMENT")
  print("-"*40)
  print("✅ User-based filtering: Find users like you")
  print("✅ Item-based filtering: Find products like ones you liked")
  print("✅ Hybrid approach: Best of both worlds")
  ```

  **What you learned:**

  * Collaborative filtering techniques
  * User-based vs item-based recommendations
  * Combining multiple recommendation strategies
</details>
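
To make the user-based scoring in Project 3 concrete, here is a minimal sketch on a hand-made rating matrix. The matrix values and the helper name `predict_rating` are illustrative assumptions, not part of the solution above. Similarity is computed with plain NumPy; for rows that are not all zeros this is the same quantity that `cosine_similarity` returns.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = products); 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5, 4, 0, 0],  # user 0: the target user
    [5, 5, 4, 0],  # user 1: similar taste to user 0
    [0, 0, 2, 5],  # user 2: no rated products in common with user 0
], dtype=float)

# Cosine similarity between users: normalize each row, then take dot products
unit_rows = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
similarity = unit_rows @ unit_rows.T

def predict_rating(user, product):
    """Similarity-weighted average of other users' ratings for one product."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, product] == 0:
            continue  # skip the user themselves and users who did not rate it
        weighted_sum += similarity[user, other] * ratings[other, product]
        weight_total += similarity[user, other]
    return weighted_sum / weight_total if weight_total > 0 else None

pred = predict_rating(user=0, product=2)
no_signal = predict_rating(user=0, product=3)

print(f"sim(0,1) = {similarity[0, 1]:.3f}, sim(0,2) = {similarity[0, 2]:.3f}")
print(f"Predicted rating of user 0 for product 2: {pred:.2f}")
print(f"Predicted rating of user 0 for product 3: {no_signal}")
```

User 2 shares no rated products with user 0, so its similarity is 0 and it contributes nothing. The prediction for product 2 therefore comes entirely from user 1's rating of 4, and product 3 gets no prediction at all. This is the situation the `weights[product_id] > 0` check in the solution guards against.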

### Project 4: ML Pipeline with Logging

Create a production-ready ML pipeline with proper logging and experiment tracking.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split, cross_val_score
  from sklearn.preprocessing import StandardScaler
  from sklearn.pipeline import Pipeline
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report, roc_auc_score
  import json
  import hashlib
  from datetime import datetime
  import pickle
  import os

  # Step 1: Create experiment tracker
  class ExperimentTracker:
      """Simple experiment tracking system"""
      
      def __init__(self, experiment_name):
          self.experiment_name = experiment_name
          self.runs = []
          self.current_run = None
          
      def start_run(self, run_name=None):
          """Start a new experiment run"""
          self.current_run = {
              'run_id': hashlib.md5(str(datetime.now()).encode()).hexdigest()[:8],
              'run_name': run_name or f"run_{len(self.runs)+1}",
              'start_time': datetime.now().isoformat(),
              'params': {},
              'metrics': {},
              'artifacts': [],
              'tags': {}
          }
          return self
      
      def log_param(self, key, value):
          """Log a parameter"""
          if self.current_run:
              self.current_run['params'][key] = value
          return self
      
      def log_params(self, params):
          """Log multiple parameters"""
          for key, value in params.items():
              self.log_param(key, value)
          return self
      
      def log_metric(self, key, value):
          """Log a metric"""
          if self.current_run:
              self.current_run['metrics'][key] = value
          return self
      
      def log_metrics(self, metrics):
          """Log multiple metrics"""
          for key, value in metrics.items():
              self.log_metric(key, value)
          return self
      
      def set_tag(self, key, value):
          """Set a tag"""
          if self.current_run:
              self.current_run['tags'][key] = value
          return self
      
      def log_artifact(self, artifact_path):
          """Log an artifact path"""
          if self.current_run:
              self.current_run['artifacts'].append(artifact_path)
          return self
      
      def end_run(self):
          """End current run"""
          if self.current_run:
              self.current_run['end_time'] = datetime.now().isoformat()
              self.runs.append(self.current_run)
              print(f"\n✅ Run {self.current_run['run_id']} completed")
              self.current_run = None
      
      def get_best_run(self, metric, higher_is_better=True):
          """Get the run with best metric"""
          if not self.runs:
              return None
          if higher_is_better:
              return max(self.runs, key=lambda r: r['metrics'].get(metric, float('-inf')))
          return min(self.runs, key=lambda r: r['metrics'].get(metric, float('inf')))
      
      def summary(self):
          """Print summary of all runs"""
          print(f"\n📊 Experiment: {self.experiment_name}")
          print("="*60)
          for run in self.runs:
              print(f"\nRun: {run['run_name']} ({run['run_id']})")
              print(f"  Parameters:")
              for k, v in run['params'].items():
                  print(f"    {k}: {v}")
              print(f"  Metrics:")
              for k, v in run['metrics'].items():
                  print(f"    {k}: {v:.4f}" if isinstance(v, float) else f"    {k}: {v}")

  # Step 2: Create ML Pipeline class
  class MLPipeline:
      """Production-ready ML Pipeline"""
      
      def __init__(self, tracker):
          self.tracker = tracker
          self.model = None
          self.scaler = None
          self.feature_names = None
          
      def preprocess(self, X, fit=True):
          """Preprocess features"""
          if fit:
              self.scaler = StandardScaler()
              X_scaled = self.scaler.fit_transform(X)
          else:
              X_scaled = self.scaler.transform(X)
          return X_scaled
      
      def train(self, X, y, params):
          """Train model with given parameters"""
          self.tracker.start_run()
          self.tracker.log_params(params)
          self.tracker.set_tag('pipeline_version', '1.0')
          
          # Preprocess
          X_scaled = self.preprocess(X, fit=True)
          self.feature_names = list(range(X.shape[1]))
          
          # Create and train model
          self.model = RandomForestClassifier(**params)
          self.model.fit(X_scaled, y)
          
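          # Log cross-validated metrics. The scaler is refit inside each fold via a
          # Pipeline so that validation folds do not leak into preprocessing.
          cv_pipeline = Pipeline([
              ('scaler', StandardScaler()),
              ('model', RandomForestClassifier(**params))
          ])
          cv_scores = cross_val_score(cv_pipeline, X, y, cv=5, scoring='accuracy')
          self.tracker.log_metric('cv_accuracy_mean', cv_scores.mean())
          self.tracker.log_metric('cv_accuracy_std', cv_scores.std())
          
          cv_auc = cross_val_score(cv_pipeline, X, y, cv=5, scoring='roc_auc')
          self.tracker.log_metric('cv_auc_mean', cv_auc.mean())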
          
          return self
      
      def evaluate(self, X_test, y_test):
          """Evaluate model on test set"""
          X_scaled = self.preprocess(X_test, fit=False)
          
          y_pred = self.model.predict(X_scaled)
          y_proba = self.model.predict_proba(X_scaled)[:, 1]
          
          # Calculate metrics
          accuracy = (y_pred == y_test).mean()
          roc_auc = roc_auc_score(y_test, y_proba)
          
          self.tracker.log_metric('test_accuracy', accuracy)
          self.tracker.log_metric('test_roc_auc', roc_auc)
          
          print(f"  Test Accuracy: {accuracy:.4f}")
          print(f"  Test ROC-AUC: {roc_auc:.4f}")
          
          return {'accuracy': accuracy, 'roc_auc': roc_auc}
      
      def save(self, path):
          """Save pipeline"""
          artifacts = {
              'model': self.model,
              'scaler': self.scaler,
              'feature_names': self.feature_names
          }
          with open(path, 'wb') as f:
              pickle.dump(artifacts, f)
          self.tracker.log_artifact(path)
          print(f"  Model saved to {path}")
      
      def finalize(self):
          """Finalize the run"""
          self.tracker.end_run()

  # Step 3: Run experiments
  print("="*60)
  print("🔬 ML EXPERIMENT PIPELINE")
  print("="*60)

  # Load data
  cancer = load_breast_cancer()
  X, y = cancer.data, cancer.target
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Create tracker
  tracker = ExperimentTracker("Breast Cancer Classification")

  # Experiment 1: Baseline
  print("\n📦 Experiment 1: Baseline Model")
  pipeline1 = MLPipeline(tracker)
  pipeline1.train(X_train, y_train, {
      'n_estimators': 50,
      'max_depth': 5,
      'random_state': 42
  })
  pipeline1.evaluate(X_test, y_test)
  pipeline1.finalize()

  # Experiment 2: More trees
  print("\n📦 Experiment 2: More Trees")
  pipeline2 = MLPipeline(tracker)
  pipeline2.train(X_train, y_train, {
      'n_estimators': 200,
      'max_depth': 5,
      'random_state': 42
  })
  pipeline2.evaluate(X_test, y_test)
  pipeline2.finalize()

  # Experiment 3: Deeper trees
  print("\n📦 Experiment 3: Deeper Trees")
  pipeline3 = MLPipeline(tracker)
  pipeline3.train(X_train, y_train, {
      'n_estimators': 100,
      'max_depth': 15,
      'random_state': 42
  })
  pipeline3.evaluate(X_test, y_test)
  pipeline3.finalize()

  # Experiment 4: Best combination
  print("\n📦 Experiment 4: Optimized")
  pipeline4 = MLPipeline(tracker)
  pipeline4.train(X_train, y_train, {
      'n_estimators': 200,
      'max_depth': 10,
      'min_samples_split': 5,
      'random_state': 42
  })
  pipeline4.evaluate(X_test, y_test)
  pipeline4.save('best_model.pkl')
  pipeline4.finalize()

  # Step 4: Show experiment summary
  tracker.summary()

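  # Step 5: Get best run. Select on the cross-validated metric rather than the
  # test metric, so the test score remains an honest estimate for the chosen run.
  best_run = tracker.get_best_run('cv_auc_mean')
  print("\n🏆 BEST RUN")
  print("-"*40)
  print(f"Run: {best_run['run_name']}")
  print(f"CV ROC-AUC: {best_run['metrics']['cv_auc_mean']:.4f}")
  print(f"Test ROC-AUC: {best_run['metrics']['test_roc_auc']:.4f}")
  print(f"Parameters: {best_run['params']}")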
  ```

  **What you learned:**

  * Organizing ML code for production
  * Experiment tracking and comparison
  * Pipeline patterns for reproducibility
</details>
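
The tracker in Project 4 keeps its runs in memory, so they are lost when the script exits. One possible way to persist them is to write the run dictionaries to a JSON file. The sketch below is an assumption about how that could look, not part of the solution: `save_runs`, `load_runs`, the file name, and the example run records (with made-up values) are all hypothetical.

```python
import json
import os
import tempfile

def save_runs(runs, path):
    """Write a list of run dictionaries to a JSON file."""
    # default=float is a fallback for numeric types that json cannot serialize
    # natively (for example NumPy integer scalars); they are stored as floats.
    with open(path, 'w') as f:
        json.dump(runs, f, indent=2, default=float)

def load_runs(path):
    """Read the run dictionaries back from disk."""
    with open(path) as f:
        return json.load(f)

# Hypothetical records shaped like the entries of ExperimentTracker.runs
# (metric values are placeholders, not real results)
example_runs = [
    {'run_id': 'aaaa0001', 'run_name': 'run_1',
     'params': {'n_estimators': 50, 'max_depth': 5},
     'metrics': {'cv_auc_mean': 0.5}},
    {'run_id': 'aaaa0002', 'run_name': 'run_2',
     'params': {'n_estimators': 200, 'max_depth': 5},
     'metrics': {'cv_auc_mean': 0.75}},
]

path = os.path.join(tempfile.gettempdir(), 'experiment_runs.json')
save_runs(example_runs, path)
restored = load_runs(path)

# Runs can be compared after a restart, e.g. by picking the highest metric
best = max(restored, key=lambda r: r['metrics']['cv_auc_mean'])
print(f"Restored {len(restored)} runs; best is {best['run_name']}")
```

With the tracker from the solution, the equivalent call would be `save_runs(tracker.runs, path)` after the last `end_run()`. For anything beyond a small project, a dedicated experiment tracking tool is usually a better fit than a hand-rolled file format.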

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Start with Business" icon="briefcase">
    Understand the problem before touching data
  </Card>

  <Card title="EDA is Critical" icon="magnifying-glass">
    Visualize and understand your data first
  </Card>

  <Card title="Iterate Quickly" icon="arrows-spin">
    Start simple, then improve
  </Card>

  <Card title="Evaluate Properly" icon="check-double">
    Use appropriate metrics for your problem
  </Card>
</CardGroup>
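
The "Iterate Quickly" and "Evaluate Properly" takeaways can be illustrated together with a trivial baseline. The sketch below uses made-up churn labels (an assumed 10% churn rate) and a baseline that always predicts "stays"; it is an illustration, not output from the projects above.

```python
import numpy as np

# Hypothetical labels: 1 = churned, 0 = stayed (10 churners out of 100 customers)
y_true = np.array([1] * 10 + [0] * 90)

# Simplest possible baseline: predict that nobody churns
y_pred = np.zeros_like(y_true)

# Accuracy: share of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# Recall: share of actual churners that the model caught
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

The baseline reaches 90% accuracy while catching none of the churners, which is why accuracy alone is a poor metric for imbalanced problems like churn. Any real model should be compared against a baseline like this on a metric that reflects the business goal.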

***

## What's Next?

Great job completing the end-to-end project! Now let's explore unsupervised learning with clustering.

<Card title="Continue to Module 11: Clustering" icon="arrow-right" href="/courses/ml-mastery/11-clustering">
  Learn to find patterns when you don't have labels - K-Means, DBSCAN, and more
</Card>
