# End-to-End ML Project

> Build a complete machine learning project from scratch


## The Complete ML Workflow

This module brings everything together in a real project. In practice, this workflow is never linear -- you'll jump back to EDA when your model fails, revisit feature engineering when evaluation reveals blind spots, and retune when new data arrives. Think of it as a spiral, not a waterfall.

1. **Problem Definition**: What are we solving? What metric defines "success"?
2. **Data Collection**: Get the data (often the hardest step)
3. **EDA**: Understand the data before touching any model
4. **Feature Engineering**: Transform raw data into model-ready features
5. **Model Selection**: Choose 2-3 candidate algorithms
6. **Training**: Fit models with cross-validation
7. **Evaluation**: Measure performance on held-out data
8. **Tuning**: Optimize hyperparameters for the best candidate
9. **Deployment**: Make it usable in production

***

## Project: Predicting Customer Churn

**Business Problem**: A telecom company wants to predict which customers will leave (churn) so they can offer them incentives to stay.
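One way to make "what metric defines success" concrete for this problem is to attach rough money values to the outcomes of a retention campaign. The sketch below is illustrative only: `INCENTIVE_COST`, `RETAINED_VALUE`, and `SAVE_RATE` are made-up assumptions rather than figures from the telecom scenario, and `campaign_value` is a hypothetical helper, not part of the project code.

```python
# Hypothetical business assumptions (not taken from the dataset):
INCENTIVE_COST = 20.0    # cost of sending one retention offer
RETAINED_VALUE = 200.0   # value of keeping a customer who would have churned
SAVE_RATE = 0.3          # fraction of targeted churners the offer actually retains


def campaign_value(y_true, y_prob, threshold):
    """Net value of sending an offer to every customer scored at or above `threshold`."""
    value = 0.0
    for actual, prob in zip(y_true, y_prob):
        if prob >= threshold:
            value -= INCENTIVE_COST                  # every targeted customer costs one offer
            if actual == 1:
                value += SAVE_RATE * RETAINED_VALUE  # expected gain from a real churner
    return value


# Toy labels and scores, for illustration only
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.75):
    net = campaign_value(y_true, y_prob, threshold)
    print(f"threshold={threshold:.2f} -> net value {net:.1f}")
```

Under these assumed numbers the middle threshold gives the highest net value, which is the kind of trade-off a single accuracy figure hides. With real costs the best threshold may differ.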
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
```

***

## Step 1: Load and Explore Data

```python
# Modeled on the Telco Customer Churn dataset
# In practice, you'd load from: pd.read_csv('customer_churn.csv')

# Simulate the dataset structure
n_samples = 5000

data = {
    'customer_id': range(1, n_samples + 1),
    'tenure': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 7000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'senior_citizen': np.random.choice([0, 1], n_samples),
    'partner': np.random.choice(['Yes', 'No'], n_samples),
    'dependents': np.random.choice(['Yes', 'No'], n_samples),
}

# Create churn with realistic patterns
churn_prob = (
    0.1 +  # Base rate
    0.3 * (data['contract'] == 'Month-to-month').astype(int) +
    0.2 * (data['monthly_charges'] > 70) +
    -0.15 * (data['tenure'] > 36) +
    0.1 * (data['payment_method'] == 'Electronic check').astype(int)
)
churn_prob = np.clip(churn_prob, 0.05, 0.8)
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types:")
print(df.dtypes)
print("\nChurn Distribution:")
print(df['churn'].value_counts(normalize=True))
```

***

## Step 2: Exploratory Data Analysis (EDA)

```python
# Churn rate by contract type
print("Churn Rate by Contract Type:")
print(df.groupby('contract')['churn'].mean().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Churn distribution
# Map codes to names first so each slice keeps the right label
# regardless of which class is more frequent.
df['churn'].map({0: 'Stayed', 1: 'Churned'}).value_counts().plot.pie(
    autopct='%1.1f%%', ax=axes[0, 0])
axes[0, 0].set_title('Churn Distribution')

# 2. Tenure distribution by churn
axes[0, 1].hist([df[df['churn']==0]['tenure'], df[df['churn']==1]['tenure']],
                label=['Stayed', 'Churned'], bins=20, alpha=0.7)
axes[0, 1].set_xlabel('Tenure (months)')
axes[0, 1].set_title('Tenure by Churn Status')
axes[0, 1].legend()

# 3. Monthly charges by churn
df.boxplot(column='monthly_charges', by='churn', ax=axes[1, 0])
axes[1, 0].set_title('Monthly Charges by Churn')

# 4. Contract type vs churn (sorted so green = lowest churn, red = highest)
churn_by_contract = df.groupby('contract')['churn'].mean().sort_values()
churn_by_contract.plot.bar(ax=axes[1, 1], color=['green', 'orange', 'red'])
axes[1, 1].set_title('Churn Rate by Contract')
axes[1, 1].set_ylabel('Churn Rate')

plt.tight_layout()
plt.show()
```

***

## Step 3: Feature Engineering

```python
# Separate features and target
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

# Identify column types
numeric_cols = ['tenure', 'monthly_charges', 'total_charges']
categorical_cols = ['contract', 'internet_service', 'online_security',
                    'tech_support', 'payment_method', 'paperless_billing',
                    'partner', 'dependents']

# Create new features -- each one encodes a business hypothesis.
# "charges_per_month" normalizes total spend by tenure length.
# Without this, a 5-year customer who paid $5000 total looks the same
# as a 1-year customer who paid $5000 total, even though the second
# one is paying 5x more per month.
X['charges_per_month'] = X['total_charges'] / (X['tenure'] + 1)  # +1 avoids division by zero

# Binary flags are powerful for tree-based models -- they create
# easy split points that capture domain knowledge.
X['is_long_term'] = (X['tenure'] > 24).astype(int)  # 2+ years = loyal
X['high_charges'] = (X['monthly_charges'] > X['monthly_charges'].median()).astype(int)

# Register every engineered column, plus the existing 0/1 senior_citizen flag.
# The ColumnTransformer in Step 4 drops any column that is not listed,
# so unlisted features would never reach the model.
numeric_cols.extend(['charges_per_month', 'is_long_term', 'high_charges', 'senior_citizen'])

print("Feature Engineering Complete!")
print(f"Numeric features: {len(numeric_cols)}")
print(f"Categorical features: {len(categorical_cols)}")

# Tip: Feature engineering is iterative. After your first model,
# look at misclassified examples to spot patterns that suggest
# new features. Did the model miss all customers with high charges
# AND month-to-month contracts? Create that interaction feature.
```

***

## Step 4: Preprocessing Pipeline

Using sklearn Pipelines is not optional in production ML -- it's the difference between "works on my laptop" and "works reliably in production." Pipelines prevent data leakage (for example, fitting the scaler on test data), ensure reproducibility, and make deployment a single `pipeline.predict()` call instead of a fragile sequence of manual transforms.

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Build preprocessor -- one pipeline per data type.
# The ColumnTransformer routes each column to the right pipeline.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Median is robust to outliers
    ('scaler', StandardScaler())  # Required for LR and SVM, harmless for trees
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),  # Explicit "missing" category
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
    # drop='first' avoids the "dummy variable trap" (multicollinearity).
    # handle_unknown='ignore' prevents crashes when test data has unseen categories.
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
```

***

## Step 5: Model Selection and Comparison

A senior engineer's approach to model selection: never bet on one model. Train 3-4 candidates with default hyperparameters, compare on the same cross-validation folds, then invest tuning effort only on the top 1-2. It's like auditioning actors -- you don't give everyone a costume fitting before the first read-through.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Models to compare -- intentionally using default hyperparameters.
# The goal here is to find which FAMILY of models works best,
# not to squeeze out every last % of performance.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Model selection tip: For tabular data, Gradient Boosting or Random Forest
# is often the strongest choice. Start there. Logistic Regression is your
# interpretability baseline. SVM is worth trying for small datasets (<10K rows).

# Create pipelines and evaluate
results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')

    results[name] = {
        'mean_auc': cv_scores.mean(),
        'std_auc': cv_scores.std(),
        'pipeline': pipeline
    }

    print(f"{name:22s}: AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Select best model based on cross-validation AUC.
# Important: don't just pick the highest number -- also consider:
# 1. Standard deviation (lower = more reliable estimate)
# 2. Training time (matters for retraining in production)
# 3. Interpretability (can you explain predictions to stakeholders?)
# A model that's 0.5% worse in AUC but 10x faster to train and
# easy to explain might be the better business choice.
best_model_name = max(results, key=lambda x: results[x]['mean_auc'])
print(f"\nBest Model: {best_model_name}")
```

***

## Step 6: Hyperparameter Tuning

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Tune the best model (assuming Gradient Boosting)
param_distributions = {
    'classifier__n_estimators': randint(50, 300),
    'classifier__max_depth': randint(3, 10),
    # scipy's uniform(loc, scale) samples from [loc, loc + scale], i.e. [0.01, 0.31]
    'classifier__learning_rate': uniform(0.01, 0.3),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10)
}

best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    best_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.4f}")
```

***

## Step 7: Final Evaluation

```python
from sklearn.metrics import roc_curve, precision_recall_curve

# Get best model
final_model = random_search.best_estimator_

# Predictions
y_pred = final_model.predict(X_test)
y_prob = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# ROC AUC
print(f"\nTest ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Plots: confusion matrix (left) and ROC curve (right)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_prob):.3f}')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.show()
```

***

## Step 8: Feature Importance Analysis

```python
# Get feature names after preprocessing
feature_names = (
    numeric_cols +
    list(final_model.named_steps['preprocessor']
         .named_transformers_['cat']
         .named_steps['encoder']
         .get_feature_names_out(categorical_cols))
)

# Get feature importances
importances = final_model.named_steps['classifier'].feature_importances_

# Sort and plot
indices = np.argsort(importances)[::-1][:15]

plt.figure(figsize=(12, 6))
plt.bar(range(15), importances[indices])
plt.xticks(range(15), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Top 15 Most Important Features for Churn Prediction')
plt.tight_layout()
plt.show()

# Print top features
print("\nTop 10 Important Features:")
for i in indices[:10]:
    print(f"  {feature_names[i]:30s}: {importances[i]:.4f}")
```

***

## Step 9: Business Insights

```python
# Identify high-risk customers
X_test_copy = X_test.copy()
X_test_copy['churn_probability'] = y_prob
X_test_copy['predicted_churn'] = y_pred
X_test_copy['actual_churn'] = y_test.values

# High-risk customers (probability > 0.7)
high_risk = X_test_copy[X_test_copy['churn_probability'] > 0.7]

print(f"High-Risk Customers (>70% churn probability): {len(high_risk)}")
print("\nProfile of High-Risk Customers:")
print(high_risk[['contract', 'tenure', 'monthly_charges', 'churn_probability']].describe())

# Recommendations
print("\n" + "="*50)
print("BUSINESS RECOMMENDATIONS")
print("="*50)
print("""
1. TARGET MONTH-TO-MONTH CUSTOMERS
   - Highest churn rate
   - Offer discounts for annual contracts

2. FOCUS ON NEW CUSTOMERS (tenure < 12 months)
   - Most vulnerable period
   - Implement onboarding program

3. REVIEW HIGH-CHARGE CUSTOMERS
   - Consider loyalty discounts
   - Ensure they're getting value

4. ELECTRONIC CHECK USERS
   - Higher churn rate
   - Encourage automatic payment methods
""")
```

***

## Step 10: Save the Model

```python
import joblib

# Save the ENTIRE pipeline (preprocessor + model) -- not just the model!
# This is critical: if you save only the model and forget the scaler,
# production predictions will be wrong because raw features won't match
# what the model was trained on.
joblib.dump(final_model, 'churn_model.pkl')
print("Model saved to 'churn_model.pkl'")

# How to load and use in production:
# loaded_model = joblib.load('churn_model.pkl')
# predictions = loaded_model.predict(new_data)  # raw data goes in, predictions come out
# The pipeline handles all preprocessing internally.

# Also save metadata for future debugging:
import json
metadata = {
    'training_date': '2025-01-15',
    'n_training_samples': len(X_train),
    'features': list(X_train.columns),
    'best_params': random_search.best_params_,
    'cv_auc': random_search.best_score_,
}
# with open('churn_model_metadata.json', 'w') as f:  # "with" closes the file for you
#     json.dump(metadata, f, indent=2)
```

***

## Production Considerations

* Track prediction drift over time
* Monitor for data quality issues
* Set up alerts for performance degradation
* Test model in production with a subset
* Compare with baseline
* Gradually roll out
* Retrain periodically (weekly/monthly)
* Automate the pipeline
* Version your models
* Document feature definitions
* Record model decisions
* Maintain changelog

***

## 🚀 Mini Projects

* Build a complete loan approval system
* Predict which employees might leave
* Build a simple recommendation system
* Create a production-ready ML pipeline

### Project 1: Loan Default Predictor

Build a complete loan default prediction system with EDA, feature engineering, and model selection.

View Complete Solution

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Step 1: Generate synthetic loan data
np.random.seed(42)
n_samples = 2000

data = {
    'age': np.random.randint(21, 70, n_samples),
    'income': np.random.lognormal(10.5, 0.5, n_samples),
    'loan_amount': np.random.lognormal(10, 0.8, n_samples),
    'credit_score': np.random.normal(680, 80, n_samples).clip(300, 850),
    'employment_years': np.random.exponential(5, n_samples).clip(0, 40),
    'num_credit_cards': np.random.poisson(3, n_samples),
    'num_loans': np.random.poisson(2, n_samples),
    'home_ownership': np.random.choice(['rent', 'own', 'mortgage'], n_samples, p=[0.4, 0.3, 0.3]),
    'loan_purpose': np.random.choice(['debt', 'home', 'car', 'education', 'business'], n_samples),
}

df = pd.DataFrame(data)

# Create target (default probability based on features)
default_prob = (
    0.3 * (df['credit_score'] < 600).astype(int) +
    0.2 * (df['loan_amount'] / df['income'] > 5).astype(int) +
    0.15 * (df['num_loans'] > 3).astype(int) +
    0.1 * (df['employment_years'] < 1).astype(int)
)
df['default'] = (np.random.random(n_samples) < default_prob * 0.8).astype(int)

print("="*60)
print("📊 LOAN DEFAULT PREDICTION PROJECT")
print("="*60)

# Step 2: Exploratory Data Analysis
print("\n1️⃣ EXPLORATORY DATA ANALYSIS")
print("-"*40)
print(f"Dataset shape: {df.shape}")
print(f"Default rate: {df['default'].mean()*100:.1f}%")
print("\nMissing values:", df.isnull().sum().sum())
print("\nNumerical features summary:")
print(df.describe().round(2))

# Step 3: Feature Engineering
print("\n2️⃣ FEATURE ENGINEERING")
print("-"*40)

# Create new features
df['loan_to_income'] = df['loan_amount'] / df['income']
df['credit_utilization'] = df['num_credit_cards'] * 2000 / df['income']
df['total_debt_items'] = df['num_credit_cards'] + df['num_loans']
df['income_per_year_employed'] = df['income'] / (df['employment_years'] + 1)
df['is_young'] = (df['age'] < 30).astype(int)
df['has_many_loans'] = (df['num_loans'] > 3).astype(int)
df['low_credit'] = (df['credit_score'] < 600).astype(int)

# Encode categorical variables
le_home = LabelEncoder()
le_purpose = LabelEncoder()
df['home_ownership_encoded'] = le_home.fit_transform(df['home_ownership'])
df['loan_purpose_encoded'] = le_purpose.fit_transform(df['loan_purpose'])

print("New features created: loan_to_income, credit_utilization, etc.")
print(f"Total features: {len(df.columns) - 3}")  # Exclude original categorical and target

# Step 4: Prepare data for modeling
feature_cols = ['age', 'income', 'loan_amount', 'credit_score', 'employment_years',
                'num_credit_cards', 'num_loans', 'loan_to_income', 'credit_utilization',
                'total_debt_items', 'income_per_year_employed', 'is_young',
                'has_many_loans', 'low_credit', 'home_ownership_encoded', 'loan_purpose_encoded']

X = df[feature_cols]
y = df['default']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 5: Model Selection
print("\n3️⃣ MODEL SELECTION & TRAINING")
print("-"*40)

models = {
    # Scaling lives inside the pipeline, so the scaler is refit on the
    # training part of every CV fold and never sees validation rows.
    'Logistic Regression': make_pipeline(
        StandardScaler(), LogisticRegression(random_state=42, max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results = []

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]

    roc_auc = roc_auc_score(y_test, y_proba)
    results.append({
        'model': name,
        'cv_auc': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_auc': roc_auc
    })
    print(f"{name}: CV AUC = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f}), Test AUC = {roc_auc:.4f}")

# Step 6: Best Model Evaluation
print("\n4️⃣ BEST MODEL EVALUATION")
print("-"*40)

# Select on CV AUC; choosing by test AUC would turn the test set into a tuning set.
best_result = max(results, key=lambda x: x['cv_auc'])
print(f"Best model: {best_result['model']}")

best_model = models[best_result['model']]
y_pred = best_model.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 7: Feature Importance (for RF/GB)
if hasattr(best_model, 'feature_importances_'):
    importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\n5️⃣ TOP FEATURES")
    print("-"*40)
    for i, row in importance.head(10).iterrows():
        print(f"  {row['feature']:30s}: {row['importance']:.4f}")

# Step 8: Business Recommendations
print("\n6️⃣ BUSINESS RECOMMENDATIONS")
print("-"*40)
print("Based on feature importance:")
print("  1. Credit score is the strongest predictor of default")
print("  2. Loan-to-income ratio indicates overextension")
print("  3. Employment stability matters significantly")
print("\nRecommended actions:")
print("  - Flag applicants with credit score < 600 for manual review")
print("  - Cap loan-to-income ratio at 5x for automatic approval")
print("  - Require employment verification for < 1 year employed")
```

**What you learned:**

* Complete ML workflow from EDA to deployment recommendations
* Feature engineering specific to financial data
* Translating model insights into business actions

### Project 2: Employee Attrition Analyzer

Predict which employees might leave and understand why.

View Complete Solution

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Step 1: Generate synthetic employee data
np.random.seed(42)
n_employees = 1500

data = {
    'age': np.random.randint(22, 60, n_employees),
    'years_at_company': np.random.exponential(4, n_employees).clip(0, 30),
    'salary': np.random.lognormal(11, 0.4, n_employees),
    'distance_from_home': np.random.exponential(10, n_employees).clip(1, 50),
    'num_projects': np.random.poisson(5, n_employees).clip(1, 15),
    'avg_hours_per_week': np.random.normal(45, 8, n_employees).clip(35, 70),
    'last_promotion_years': np.random.exponential(2, n_employees).clip(0, 10),
    'satisfaction_score': np.random.beta(7, 3, n_employees),  # Mostly satisfied
    'performance_rating': np.random.choice([1, 2, 3, 4, 5], n_employees, p=[0.05, 0.1, 0.5, 0.25, 0.1]),
    'department': np.random.choice(['Sales', 'Engineering', 'HR', 'Marketing'], n_employees),
    'job_level': np.random.choice([1, 2, 3, 4, 5], n_employees, p=[0.3, 0.3, 0.2, 0.15, 0.05]),
}

df = pd.DataFrame(data)

# Create attrition (based on realistic factors)
attrition_prob = (
    0.2 * (df['satisfaction_score'] < 0.5).astype(int) +
    0.15 * (df['last_promotion_years'] > 4).astype(int) +
    0.1 * (df['avg_hours_per_week'] > 55).astype(int) +
    0.1 * (df['years_at_company'] < 2).astype(int) +
    0.05 * (df['distance_from_home'] > 25).astype(int)
)
df['attrition'] = (np.random.random(n_employees) < attrition_prob).astype(int)

print("="*60)
print("👥 EMPLOYEE ATTRITION ANALYSIS")
print("="*60)

# Step 2: EDA
print("\n1️⃣ EXPLORATORY DATA ANALYSIS")
print("-"*40)
print(f"Total employees: {len(df)}")
print(f"Attrition rate: {df['attrition'].mean()*100:.1f}%")

# Attrition by department
print("\nAttrition by Department:")
for dept in df['department'].unique():
    rate = df[df['department'] == dept]['attrition'].mean() * 100
    print(f"  {dept}: {rate:.1f}%")

# Step 3: Feature Engineering
print("\n2️⃣ FEATURE ENGINEERING")
print("-"*40)

df['salary_per_year'] = df['salary'] / (df['years_at_company'] + 1)
df['workload'] = df['num_projects'] * df['avg_hours_per_week']
df['is_overworked'] = (df['avg_hours_per_week'] > 50).astype(int)
df['is_underpaid'] = (df['salary'] < df['salary'].quantile(0.25)).astype(int)
df['stuck_no_promotion'] = ((df['last_promotion_years'] > 3) & (df['performance_rating'] >= 3)).astype(int)
df['at_risk'] = ((df['satisfaction_score'] < 0.5) | (df['last_promotion_years'] > 4)).astype(int)

# Encode categorical
df['dept_encoded'] = pd.factorize(df['department'])[0]

# Create department dummies
dept_dummies = pd.get_dummies(df['department'], prefix='dept')
df = pd.concat([df, dept_dummies], axis=1)

print("Created risk indicators: is_overworked, is_underpaid, stuck_no_promotion")

# Step 4: Prepare data
feature_cols = ['age', 'years_at_company', 'salary', 'distance_from_home',
                'num_projects', 'avg_hours_per_week', 'last_promotion_years',
                'satisfaction_score', 'performance_rating', 'job_level',
                'salary_per_year', 'workload', 'is_overworked', 'is_underpaid',
                'stuck_no_promotion', 'dept_Engineering', 'dept_HR',
                'dept_Marketing', 'dept_Sales']

X = df[feature_cols]
y = df['attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 5: Train models
print("\n3️⃣ MODEL TRAINING")
print("-"*40)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Random Forest CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, y_proba):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 6: Feature Importance
print("\n4️⃣ KEY ATTRITION FACTORS")
print("-"*40)

importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

for i, row in importance.head(10).iterrows():
    print(f"  {row['feature']:25s}: {row['importance']:.4f}")

# Step 7: Risk Scoring
print("\n5️⃣ EMPLOYEE RISK SCORING")
print("-"*40)

# Add risk scores to test data.
# X_test.index holds index labels, so select with .loc rather than .iloc.
test_df = df.loc[X_test.index].copy()
test_df['risk_score'] = y_proba

# Find high-risk employees
high_risk = test_df[test_df['risk_score'] > 0.5].sort_values('risk_score', ascending=False)
print(f"High-risk employees: {len(high_risk)} ({len(high_risk)/len(test_df)*100:.1f}%)")

print("\nTop 5 At-Risk Employees:")
for i, row in high_risk.head().iterrows():
    print(f"  Risk: {row['risk_score']:.2%} | Satisfaction: {row['satisfaction_score']:.2f} | "
          f"Years since promotion: {row['last_promotion_years']:.1f}")

# Step 8: Retention Recommendations
print("\n6️⃣ RETENTION STRATEGY")
print("-"*40)
print("Based on analysis, recommend:")
print("\n🎯 Immediate Actions for High-Risk Employees:")
print("  - Schedule 1-on-1 meetings")
print("  - Review compensation vs market rate")
print("  - Discuss career development opportunities")

print("\n📊 Systemic Changes:")
if importance[importance['feature'] == 'satisfaction_score']['importance'].values[0] > 0.1:
    print("  - Satisfaction is key: Launch engagement surveys")
if importance[importance['feature'] == 'last_promotion_years']['importance'].values[0] > 0.1:
    print("  - Promotion stagnation matters: Review promotion criteria")
if importance[importance['feature'] == 'avg_hours_per_week']['importance'].values[0] > 0.05:
    print("  - Overwork is an issue: Enforce work-life balance policies")
```

**What you learned:**

* HR analytics application of ML
* Creating actionable risk scores
* Translating predictions into retention strategies
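The "why" in this project rests on `feature_importances_`, which a random forest computes from its training data. A complementary check is permutation importance on held-out rows: shuffle one column at a time and measure how much the score drops. The sketch below uses a tiny synthetic dataset with made-up column names (`signal`, `noise`) rather than the employee data, so treat it as an illustration of the call, not as a result about attrition.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)   # drives the label
noise = rng.normal(size=n)    # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each column of the held-out set 10 times and record the AUC drop
result = permutation_importance(
    model, X_test, y_test, scoring='roc_auc', n_repeats=10, random_state=0)

for name, mean, std in zip(['signal', 'noise'], result.importances_mean, result.importances_std):
    print(f"{name:8s}: {mean:.4f} (+/- {std:.4f})")
```

A feature whose permutation importance is close to zero on held-out data is a weak basis for a retention policy, even if its impurity-based importance looks sizeable.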

### Project 3: Product Recommendation Engine

Build a simple collaborative filtering recommendation system.

View Complete Solution

```python theme={null} import numpy as np import pandas as pd from sklearn.metrics.pairwise import cosine_similarity from sklearn.model_selection import train_test_split from collections import defaultdict # Step 1: Generate synthetic e-commerce data np.random.seed(42) n_users = 500 n_products = 100 n_interactions = 5000 # Create user-product interactions users = np.random.randint(0, n_users, n_interactions) products = np.random.randint(0, n_products, n_interactions) ratings = np.random.randint(1, 6, n_interactions) # Create some patterns (users who like product X also like product Y) for i in range(0, n_products - 1, 2): mask = products == i products[mask & (np.random.random(n_interactions) > 0.5)] = i + 1 df = pd.DataFrame({ 'user_id': users, 'product_id': products, 'rating': ratings }).drop_duplicates(['user_id', 'product_id']) # Add product metadata product_names = [f"Product_{i}" for i in range(n_products)] product_categories = np.random.choice(['Electronics', 'Books', 'Clothing', 'Home'], n_products) print("="*60) print("🛒 PRODUCT RECOMMENDATION ENGINE") print("="*60) print(f"Users: {n_users}") print(f"Products: {n_products}") print(f"Interactions: {len(df)}") # Step 2: Create user-item matrix print("\n1️⃣ BUILDING USER-ITEM MATRIX") print("-"*40) user_item_matrix = df.pivot_table( index='user_id', columns='product_id', values='rating', fill_value=0 ) print(f"Matrix shape: {user_item_matrix.shape}") print(f"Sparsity: {(user_item_matrix == 0).sum().sum() / user_item_matrix.size * 100:.1f}%") # Step 3: User-based Collaborative Filtering print("\n2️⃣ USER-BASED COLLABORATIVE FILTERING") print("-"*40) # Calculate user similarity user_similarity = cosine_similarity(user_item_matrix) user_sim_df = pd.DataFrame( user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index ) def recommend_user_based(user_id, n_recommendations=5, n_neighbors=10): """Recommend products based on similar users""" # Get similar users similar_users = 
user_sim_df[user_id].sort_values(ascending=False)[1:n_neighbors+1] # Get products the user hasn't rated user_products = set(df[df['user_id'] == user_id]['product_id']) # Weighted average of similar users' ratings scores = defaultdict(float) weights = defaultdict(float) for similar_user, similarity in similar_users.items(): similar_user_ratings = df[df['user_id'] == similar_user] for _, row in similar_user_ratings.iterrows(): if row['product_id'] not in user_products: scores[row['product_id']] += similarity * row['rating'] weights[row['product_id']] += similarity # Calculate weighted average recommendations = [] for product_id in scores: if weights[product_id] > 0: avg_score = scores[product_id] / weights[product_id] recommendations.append((product_id, avg_score)) recommendations.sort(key=lambda x: x[1], reverse=True) return recommendations[:n_recommendations] # Test for a specific user test_user = 0 print(f"\nRecommendations for User {test_user}:") recs = recommend_user_based(test_user) for product_id, score in recs: print(f" Product_{product_id} (predicted rating: {score:.2f})") # Step 4: Item-based Collaborative Filtering print("\n3️⃣ ITEM-BASED COLLABORATIVE FILTERING") print("-"*40) # Calculate item similarity item_similarity = cosine_similarity(user_item_matrix.T) item_sim_df = pd.DataFrame( item_similarity, index=user_item_matrix.columns, columns=user_item_matrix.columns ) def recommend_item_based(user_id, n_recommendations=5, n_similar_items=5): """Recommend products based on items the user liked""" # Get user's highly rated products (rating >= 4) user_ratings = df[(df['user_id'] == user_id) & (df['rating'] >= 4)] liked_products = user_ratings['product_id'].tolist() user_all_products = set(df[df['user_id'] == user_id]['product_id']) scores = defaultdict(float) for liked_product in liked_products: similar_products = item_sim_df[liked_product].sort_values(ascending=False)[1:n_similar_items+1] for product_id, similarity in similar_products.items(): if 
product_id not in user_all_products:
                scores[product_id] += similarity

    recommendations = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return recommendations[:n_recommendations]

print(f"Item-based recommendations for User {test_user}:")
recs = recommend_item_based(test_user)
for product_id, score in recs:
    print(f"  Product_{product_id} (similarity score: {score:.2f})")

# Step 5: Hybrid Recommendation
print("\n4️⃣ HYBRID RECOMMENDATION")
print("-"*40)

def hybrid_recommend(user_id, n_recommendations=5, alpha=0.5):
    """Combine user-based and item-based recommendations"""
    user_recs = {p: s for p, s in recommend_user_based(user_id, n_recommendations=20)}
    item_recs = {p: s for p, s in recommend_item_based(user_id, n_recommendations=20)}

    # Normalize scores to [0, 1]; skip when the max is not positive to avoid dividing by zero
    if user_recs and max(user_recs.values()) > 0:
        max_user = max(user_recs.values())
        user_recs = {p: s/max_user for p, s in user_recs.items()}
    if item_recs and max(item_recs.values()) > 0:
        max_item = max(item_recs.values())
        item_recs = {p: s/max_item for p, s in item_recs.items()}

    # Combine: alpha weights the user-based score, (1 - alpha) the item-based score
    all_products = set(user_recs.keys()) | set(item_recs.keys())
    combined = []
    for product_id in all_products:
        score = alpha * user_recs.get(product_id, 0) + (1-alpha) * item_recs.get(product_id, 0)
        combined.append((product_id, score))

    combined.sort(key=lambda x: x[1], reverse=True)
    return combined[:n_recommendations]

print(f"Hybrid recommendations for User {test_user}:")
recs = hybrid_recommend(test_user)
for product_id, score in recs:
    print(f"  Product_{product_id} (hybrid score: {score:.2f})")

# Step 6: Evaluation
print("\n5️⃣ EVALUATION")
print("-"*40)

def evaluate_recommendations(n_users=50):
    """Evaluate recommendations with precision@10"""
    hits = 0
    total = 0

    # Never iterate past the number of users actually present in the data
    for user_id in range(min(n_users, df['user_id'].nunique())):
        # Get user's actual highly rated products
        user_ratings = df[(df['user_id'] == user_id) & (df['rating'] >= 4)]
        if len(user_ratings) < 2:
            continue

        # Use first half as training, second half as test.
        # Caveat: this is only meaningful if the test-half ratings were hidden
        # when the similarity matrices were built; otherwise already-rated
        # products may be filtered out and the hit count stays near zero.
        actual = set(user_ratings['product_id'].tolist()[len(user_ratings)//2:])

        # Get recommendations
        recs = [p for p, s in hybrid_recommend(user_id, n_recommendations=10)]

        # Count hits; precision divides by the number of recommendations made
        hits += len(set(recs) & actual)
        total += len(recs)

    precision = hits / total if total > 0 else 0
    return precision

precision = evaluate_recommendations()
print(f"Recommendation Precision: {precision:.4f}")

print("\n6️⃣ SYSTEM READY FOR DEPLOYMENT")
print("-"*40)
print("✅ User-based filtering: Find users like you")
print("✅ Item-based filtering: Find products like ones you liked")
print("✅ Hybrid approach: Best of both worlds")
```

**What you learned:**

* Collaborative filtering techniques
* User-based vs item-based recommendations
* Combining multiple recommendation strategies
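A holdout evaluation only works when the test ratings are hidden before the recommender is built. The following is a minimal sketch of that idea, not the project's own evaluation: `split_ratings` and `precision_recall_at_k` are hypothetical helper names, and the ratings and the recommended list are made-up toy values standing in for a recommender trained on the training split only.

```python
from collections import defaultdict

def split_ratings(ratings, holdout_every=2):
    """Split (user, item, rating) tuples into train and test lists.

    Every `holdout_every`-th rating of each user goes to the test list.
    """
    train, test = [], []
    seen = defaultdict(int)
    for user, item, rating in ratings:
        seen[user] += 1
        target = test if seen[user] % holdout_every == 0 else train
        target.append((user, item, rating))
    return train, test

def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for a single user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy ratings (illustrative only)
ratings = [
    (0, "A", 5), (0, "B", 4), (0, "C", 5), (0, "D", 4),
    (1, "A", 4), (1, "C", 5),
]
train, test = split_ratings(ratings)

# Relevant items for user 0 are the highly rated items held out in `test`
relevant_user0 = [item for user, item, rating in test if user == 0 and rating >= 4]

# Stand-in for the output of a recommender fitted on `train` only
recommended_user0 = ["B", "E", "D"]

p, r = precision_recall_at_k(recommended_user0, relevant_user0, k=3)
print(f"precision@3={p:.2f}, recall@3={r:.2f}")  # precision@3=0.67, recall@3=1.00
```

Precision divides the hits by the number of items recommended, while recall divides them by the number of relevant held-out items, so reporting both avoids mislabeling one as the other.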

### Project 4: ML Pipeline with Logging

Create a production-ready ML pipeline with proper logging and experiment tracking.
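The solution for this project reports progress with `print` calls. As one possible way to cover the logging half of the brief, here is a minimal sketch using Python's standard `logging` module. The logger name `ml_pipeline` and the helper `timed_step` are assumptions made for illustration and are not part of the solution code.

```python
import logging
import time

# Hypothetical logger name; configure once at program start
logger = logging.getLogger("ml_pipeline")
logger.setLevel(logging.INFO)
if not logger.handlers:  # avoid duplicate handlers when re-run in a notebook
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)

def timed_step(name, func, *args, **kwargs):
    """Run one pipeline step, logging its start, duration, and any failure."""
    start = time.perf_counter()
    logger.info("step=%s status=started", name)
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback along with the message
        logger.exception("step=%s status=failed", name)
        raise
    elapsed = time.perf_counter() - start
    logger.info("step=%s status=finished seconds=%.3f", name, elapsed)
    return result

# Usage with a stand-in step; a real pipeline would pass its train or evaluate function
total = timed_step("sum_features", sum, [1, 2, 3])
```

Unlike `print`, log records carry a timestamp and a level, and the same calls can later be routed to a file or a log aggregator by swapping the handler.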

View Complete Solution

```python theme={null}
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import json
import hashlib
from datetime import datetime
import pickle
import os

# Step 1: Create experiment tracker
class ExperimentTracker:
    """Simple experiment tracking system"""

    def __init__(self, experiment_name):
        self.experiment_name = experiment_name
        self.runs = []
        self.current_run = None

    def start_run(self, run_name=None):
        """Start a new experiment run"""
        self.current_run = {
            # Short ID derived from the start timestamp
            'run_id': hashlib.md5(str(datetime.now()).encode()).hexdigest()[:8],
            'run_name': run_name or f"run_{len(self.runs)+1}",
            'start_time': datetime.now().isoformat(),
            'params': {},
            'metrics': {},
            'artifacts': [],
            'tags': {}
        }
        return self

    def log_param(self, key, value):
        """Log a parameter"""
        if self.current_run:
            self.current_run['params'][key] = value
        return self

    def log_params(self, params):
        """Log multiple parameters"""
        for key, value in params.items():
            self.log_param(key, value)
        return self

    def log_metric(self, key, value):
        """Log a metric"""
        if self.current_run:
            self.current_run['metrics'][key] = value
        return self

    def log_metrics(self, metrics):
        """Log multiple metrics"""
        for key, value in metrics.items():
            self.log_metric(key, value)
        return self

    def set_tag(self, key, value):
        """Set a tag"""
        if self.current_run:
            self.current_run['tags'][key] = value
        return self

    def log_artifact(self, artifact_path):
        """Log an artifact path"""
        if self.current_run:
            self.current_run['artifacts'].append(artifact_path)
        return self

    def end_run(self):
        """End current run"""
        if self.current_run:
            self.current_run['end_time'] = datetime.now().isoformat()
            self.runs.append(self.current_run)
            print(f"\n✅ Run {self.current_run['run_id']} completed")
            self.current_run = None

    def get_best_run(self, metric, higher_is_better=True):
        """Get the run with the best value of `metric`"""
        if not self.runs:
            return None
        if higher_is_better:
            # Runs missing the metric rank last
            return max(self.runs, key=lambda r: r['metrics'].get(metric, float('-inf')))
        # Lower is better (e.g. a loss): take the minimum, missing metrics rank last
        return min(self.runs, key=lambda r: r['metrics'].get(metric, float('inf')))

    def summary(self):
        """Print summary of all runs"""
        print(f"\n📊 Experiment: {self.experiment_name}")
        print("="*60)

        for run in self.runs:
            print(f"\nRun: {run['run_name']} ({run['run_id']})")
            print("  Parameters:")
            for k, v in run['params'].items():
                print(f"    {k}: {v}")
            print("  Metrics:")
            for k, v in run['metrics'].items():
                print(f"    {k}: {v:.4f}" if isinstance(v, float) else f"    {k}: {v}")

# Step 2: Create ML Pipeline class
class MLPipeline:
    """Production-ready ML Pipeline"""

    def __init__(self, tracker):
        self.tracker = tracker
        self.model = None
        self.scaler = None
        self.feature_names = None

    def preprocess(self, X, fit=True):
        """Preprocess features: fit the scaler on training data, reuse it afterwards"""
        if fit:
            self.scaler = StandardScaler()
            X_scaled = self.scaler.fit_transform(X)
        else:
            X_scaled = self.scaler.transform(X)
        return X_scaled

    def train(self, X, y, params):
        """Train model with given parameters"""
        self.tracker.start_run()
        self.tracker.log_params(params)
        self.tracker.set_tag('pipeline_version', '1.0')

        # Preprocess
        X_scaled = self.preprocess(X, fit=True)
        self.feature_names = list(range(X.shape[1]))

        # Create and train model
        self.model = RandomForestClassifier(**params)
        self.model.fit(X_scaled, y)

        # Log cross-validation metrics (cross_val_score refits clones of the model)
        cv_scores = cross_val_score(self.model, X_scaled, y, cv=5, scoring='accuracy')
        self.tracker.log_metric('cv_accuracy_mean', cv_scores.mean())
        self.tracker.log_metric('cv_accuracy_std', cv_scores.std())

        cv_auc = cross_val_score(self.model, X_scaled, y, cv=5, scoring='roc_auc')
        self.tracker.log_metric('cv_auc_mean', cv_auc.mean())

        return self

    def evaluate(self, X_test, y_test):
        """Evaluate model on test set"""
        X_scaled = self.preprocess(X_test, fit=False)

        y_pred = self.model.predict(X_scaled)
        y_proba = self.model.predict_proba(X_scaled)[:, 1]

        # Calculate metrics
        accuracy = (y_pred == y_test).mean()
        roc_auc = roc_auc_score(y_test, y_proba)

        self.tracker.log_metric('test_accuracy', accuracy)
        self.tracker.log_metric('test_roc_auc', roc_auc)

        print(f"  Test Accuracy: {accuracy:.4f}")
        print(f"  Test ROC-AUC: {roc_auc:.4f}")

        return {'accuracy': accuracy, 'roc_auc': roc_auc}

    def save(self, path):
        """Save pipeline"""
        artifacts = {
            'model': self.model,
            'scaler': self.scaler,
            'feature_names': self.feature_names
        }
        with open(path, 'wb') as f:
            pickle.dump(artifacts, f)
        self.tracker.log_artifact(path)
        print(f"  Model saved to {path}")

    def finalize(self):
        """Finalize the run"""
        self.tracker.end_run()

# Step 3: Run experiments
print("="*60)
print("🔬 ML EXPERIMENT PIPELINE")
print("="*60)

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create tracker
tracker = ExperimentTracker("Breast Cancer Classification")

# Experiment 1: Baseline
print("\n📦 Experiment 1: Baseline Model")
pipeline1 = MLPipeline(tracker)
pipeline1.train(X_train, y_train, {
    'n_estimators': 50,
    'max_depth': 5,
    'random_state': 42
})
pipeline1.evaluate(X_test, y_test)
pipeline1.finalize()

# Experiment 2: More trees
print("\n📦 Experiment 2: More Trees")
pipeline2 = MLPipeline(tracker)
pipeline2.train(X_train, y_train, {
    'n_estimators': 200,
    'max_depth': 5,
    'random_state': 42
})
pipeline2.evaluate(X_test, y_test)
pipeline2.finalize()

# Experiment 3: Deeper trees
print("\n📦 Experiment 3: Deeper Trees")
pipeline3 = MLPipeline(tracker)
pipeline3.train(X_train, y_train, {
    'n_estimators': 100,
    'max_depth': 15,
    'random_state': 42
})
pipeline3.evaluate(X_test, y_test)
pipeline3.finalize()

# Experiment 4: Best combination
print("\n📦 Experiment 4: Optimized")
pipeline4 = MLPipeline(tracker)
pipeline4.train(X_train, y_train, {
    'n_estimators': 200,
    'max_depth': 10,
    'min_samples_split': 5,
    'random_state': 42
})
pipeline4.evaluate(X_test, y_test)
pipeline4.save('best_model.pkl')
pipeline4.finalize()

# Step 4: Show experiment summary
tracker.summary()

# Step 5: Get best run
# Note: ranking runs by a test metric makes the reported test score optimistic.
# Ranking by 'cv_auc_mean' and keeping the test set for a final check avoids that.
best_run = tracker.get_best_run('test_roc_auc')
print("\n🏆 BEST RUN")
print("-"*40)
print(f"Run: {best_run['run_name']}")
print(f"Test ROC-AUC: {best_run['metrics']['test_roc_auc']:.4f}")
print(f"Parameters: {best_run['params']}")
```

**What you learned:**

* Organizing ML code for production
* Experiment tracking and comparison
* Pipeline patterns for reproducibility
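The `ExperimentTracker` above keeps its runs in memory, so the comparison is lost when the process exits. A minimal sketch of persisting and reloading run records with the standard `json` module follows. The helper names `save_runs`, `load_runs`, and `best_run` are hypothetical, and the metric values are made-up placeholders rather than results from the experiments above.

```python
import json
import os
import tempfile

def save_runs(runs, path):
    """Write a list of run dicts (as kept in tracker.runs) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # default=float converts NumPy scalar types that json cannot serialize
        json.dump(runs, f, indent=2, default=float)

def load_runs(path):
    """Read run dicts back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def best_run(runs, metric, higher_is_better=True):
    """Return the run with the best `metric`, ignoring runs that lack it."""
    scored = [r for r in runs if metric in r.get("metrics", {})]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r["metrics"][metric])

# Placeholder run records shaped like the tracker's entries (values are illustrative)
runs = [
    {"run_name": "run_1", "params": {"n_estimators": 50}, "metrics": {"cv_auc_mean": 0.91}},
    {"run_name": "run_2", "params": {"n_estimators": 200}, "metrics": {"cv_auc_mean": 0.93}},
]

path = os.path.join(tempfile.mkdtemp(), "runs.json")
save_runs(runs, path)
restored = load_runs(path)
print(best_run(restored, "cv_auc_mean")["run_name"])  # prints run_2
```

Comparing runs on a cross-validation metric such as `cv_auc_mean` leaves the test set untouched for a single final check of the chosen model.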

***

## Key Takeaways

* Understand the problem before touching data
* Visualize and understand your data first
* Start simple, then improve
* Use appropriate metrics for your problem

***

## What's Next?

Great job completing the end-to-end project! Now let's explore unsupervised learning with clustering. Learn to find patterns when you don't have labels, using K-Means, DBSCAN, and more.