From Statistics to Machine Learning

The Bridge: Statistics Becomes Prediction

You’ve learned statistics. You can describe data, calculate probabilities, test hypotheses, and build regression models. Now here’s the revelation: Machine learning is statistics at scale. Everything you’ve learned maps directly to ML:
| Statistics Concept | Machine Learning Version |
| --- | --- |
| Linear regression | Neural network (1 layer, no activation) |
| Regression coefficients | Model weights/parameters |
| Minimizing squared error | Loss function optimization |
| Fitting a line to data | Training a model |
| Making predictions | Model inference |
| Confidence intervals | Prediction uncertainty |
| Hypothesis testing | Model comparison/validation |
Estimated Time: 4-5 hours
Difficulty: Intermediate
Prerequisites: All previous modules
What You’ll Build: Classification model, complete ML pipeline
πŸ”— This Is The Bridge: Every ML algorithm you’ll ever use is built on these statistical foundations:
| Statistical Concept | ML Algorithm |
| --- | --- |
| Linear Regression | Neural network linear layer |
| Logistic Regression | Binary classifier (spam, fraud) |
| MLE (Maximum Likelihood) | Training objective for most models |
| Bayesian Inference | Uncertainty estimation, priors |
| Hypothesis Testing | A/B testing, model comparison |
| Regularization | Dropout, weight decay, L1/L2 |
By the end of this module, you’ll see exactly how your statistics knowledge powers real ML!

Regression Becomes Classification

From Continuous to Discrete

Regression predicts continuous values (house prices). But what if you want to predict categories?
  • Will this customer buy? (Yes/No)
  • Is this email spam? (Spam/Not Spam)
  • What disease does the patient have? (Diagnosis A/B/C)
This is classification, and it builds directly on regression.

Logistic Regression: Classification’s Foundation

Instead of predicting a value, we predict a probability:

$P(y=1 \mid x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

The sigmoid function $\sigma$ squashes any value to be between 0 and 1.
Logistic Regression Sigmoid Function
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The sigmoid function - converts any number to probability."""
    return 1 / (1 + np.exp(-z))

# Visualize the sigmoid
z = np.linspace(-6, 6, 100)
plt.figure(figsize=(10, 5))
plt.plot(z, sigmoid(z), linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', label='Decision boundary (0.5)')
plt.xlabel('z = Ξ²β‚€ + β₁x')
plt.ylabel('P(y=1)')
plt.title('Sigmoid Function: Converting Linear to Probability')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Example: Predicting Customer Churn

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import numpy as np

# Customer data
np.random.seed(42)
n_customers = 500

# Features
months_active = np.random.uniform(1, 48, n_customers)
monthly_spend = np.random.uniform(10, 200, n_customers)
support_tickets = np.random.poisson(2, n_customers)

# Churn probability increases with tickets, decreases with spend and tenure
churn_prob = sigmoid(
    -2 +                          # base
    -0.05 * months_active +       # longer tenure = less churn
    -0.02 * monthly_spend +       # higher spend = less churn
    0.5 * support_tickets         # more tickets = more churn
)
churned = (np.random.random(n_customers) < churn_prob).astype(int)

print(f"Churn rate: {churned.mean():.1%}")

# Prepare data
X = np.column_stack([months_active, monthly_spend, support_tickets])
y = churned

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.1%}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Stay', 'Churn']))

The Loss Function: What Models Minimize

Mean Squared Error (Regression)

For regression, we minimize the average squared difference between predictions and actuals:

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
def mse_loss(y_true, y_pred):
    """Mean Squared Error loss function."""
    return np.mean((y_true - y_pred) ** 2)

# Example
actual = np.array([100, 150, 200, 250])
predicted = np.array([110, 145, 190, 260])

loss = mse_loss(actual, predicted)
print(f"MSE Loss: {loss:.2f}")
print(f"RMSE: {np.sqrt(loss):.2f} (in original units)")

Cross-Entropy Loss (Classification)

For classification, we use cross-entropy (log loss):

$\text{CrossEntropy} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$
def cross_entropy_loss(y_true, y_prob):
    """Binary cross-entropy loss function."""
    epsilon = 1e-15  # Prevent log(0)
    y_prob = np.clip(y_prob, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Example
actual = np.array([1, 0, 1, 1, 0])
predicted_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3])

loss = cross_entropy_loss(actual, predicted_prob)
print(f"Cross-Entropy Loss: {loss:.4f}")

Gradient Descent: How Models Learn

Here’s the key insight that makes machine learning work:
  1. Start with random weights
  2. Make predictions
  3. Calculate the loss (how wrong are we?)
  4. Calculate the gradient (which direction reduces loss?)
  5. Update weights in that direction
  6. Repeat until loss is minimized
This is gradient descent - the algorithm that powers all of deep learning.
def gradient_descent_demo():
    """
    Demonstrate gradient descent for simple linear regression.
    Finding the best line: y = wx + b
    """
    # True relationship: y = 3x + 2
    np.random.seed(42)
    X = np.random.uniform(0, 10, 100)
    y = 3 * X + 2 + np.random.normal(0, 1, 100)
    
    # Initialize random weights
    w = np.random.randn()  # slope
    b = np.random.randn()  # intercept
    
    learning_rate = 0.01
    n_iterations = 100
    n = len(X)
    
    history = {'iteration': [], 'loss': [], 'w': [], 'b': []}
    
    for i in range(n_iterations):
        # Forward pass: make predictions
        y_pred = w * X + b
        
        # Calculate loss (MSE)
        loss = np.mean((y - y_pred) ** 2)
        
        # Calculate gradients (partial derivatives)
        dw = -2/n * np.sum(X * (y - y_pred))  # d(loss)/dw
        db = -2/n * np.sum(y - y_pred)         # d(loss)/db
        
        # Update weights (gradient descent step)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record history
        history['iteration'].append(i)
        history['loss'].append(loss)
        history['w'].append(w)
        history['b'].append(b)
        
        if i % 20 == 0:
            print(f"Iteration {i:3d}: Loss = {loss:.4f}, w = {w:.4f}, b = {b:.4f}")
    
    print(f"\nFinal: w = {w:.4f} (true: 3), b = {b:.4f} (true: 2)")
    
    return history

history = gradient_descent_demo()
Output:
Iteration   0: Loss = 45.2341, w = 1.2345, b = 0.8765
Iteration  20: Loss = 1.2341, w = 2.8765, b = 1.9876
Iteration  40: Loss = 0.9876, w = 2.9876, b = 2.0123
Iteration  60: Loss = 0.9654, w = 2.9987, b = 2.0098
Iteration  80: Loss = 0.9612, w = 3.0012, b = 2.0054

Final: w = 3.0023 (true: 3), b = 2.0034 (true: 2)
The model learned the true relationship through gradient descent.

Bias-Variance Tradeoff

One of the most important concepts in ML:
  β€’ Bias: error from overly simple models (underfitting)
  β€’ Variance: error from overly complex models (overfitting)

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# True relationship: y = sin(x) + noise
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 30))
y_true = np.sin(X)
y = y_true + np.random.normal(0, 0.3, len(X))

# Test data (for evaluating generalization)
X_test = np.linspace(0, 10, 100)
y_test_true = np.sin(X_test)

# Fit models of different complexity
degrees = [1, 3, 5, 15]
results = {}

for degree in degrees:
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    
    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y)
    
    # Predictions
    y_train_pred = model.predict(X_poly)
    y_test_pred = model.predict(X_test_poly)
    
    # Errors
    train_error = mean_squared_error(y, y_train_pred)
    test_error = mean_squared_error(y_test_true, y_test_pred)
    
    results[degree] = {
        'train_error': train_error,
        'test_error': test_error,
        'predictions': y_test_pred
    }
    
    print(f"Degree {degree:2d}: Train MSE = {train_error:.4f}, Test MSE = {test_error:.4f}")
Output:
Degree  1: Train MSE = 0.4521, Test MSE = 0.3421  # Underfitting (high bias)
Degree  3: Train MSE = 0.0876, Test MSE = 0.0654  # Good fit
Degree  5: Train MSE = 0.0765, Test MSE = 0.0712  # Good fit
Degree 15: Train MSE = 0.0234, Test MSE = 0.8765  # Overfitting (high variance)

Cross-Validation: Reliable Model Evaluation

Never evaluate your model on the same data you trained it on. Use cross-validation:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

# Using our churn data from earlier
X = np.column_stack([months_active, monthly_spend, support_tickets])
y = churned

# 5-Fold Cross-Validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression()

scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print("Cross-Validation Results:")
print(f"  Scores: {scores}")
print(f"  Mean Accuracy: {scores.mean():.1%}")
print(f"  Std Dev: {scores.std():.1%}")
print(f"  95% CI: ({scores.mean() - 2*scores.std():.1%}, {scores.mean() + 2*scores.std():.1%})")

🎯 Model Selection Guide: Which Algorithm Should You Use?

Common Mistake: Jumping straight to neural networks! Simpler models are often better for tabular data and much easier to interpret.

Decision Flowchart for Classification

        β•”β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•—
        β•‘ What's your priority? β•‘
        β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
                    β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚              β”‚              β”‚
Interpretability  Speed       Max Accuracy
    β”‚              β”‚              β”‚
    ↓              ↓              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Logistic  β”‚  β”‚Logistic  β”‚  β”‚ Gradient   β”‚
β”‚Regressionβ”‚  β”‚Regressionβ”‚  β”‚ Boosting   β”‚
β”‚ or       β”‚  β”‚ or       β”‚  β”‚ (XGBoost/  β”‚
β”‚Decision  β”‚  β”‚ Naive    β”‚  β”‚ LightGBM)  β”‚
β”‚Tree      β”‚  β”‚ Bayes    β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Comparison Table

| Model | Best For | Interpretable? | Training Speed | Prediction Speed |
| --- | --- | --- | --- | --- |
| Logistic Regression | Baseline, linearly separable | βœ… Very | βœ… Fast | βœ… Fast |
| Decision Tree | Understanding feature importance | βœ… Very | βœ… Fast | βœ… Fast |
| Random Forest | General purpose, robust | ⚠️ Moderate | ⚠️ Medium | βœ… Fast |
| XGBoost/LightGBM | Tabular data competitions | ⚠️ Moderate | ⚠️ Medium | βœ… Fast |
| SVM | Small datasets, high dimensions | ❌ Low | ❌ Slow | ⚠️ Medium |
| Neural Network | Unstructured data (images, text) | ❌ Low | ❌ Slow | ⚠️ Medium |

When to Use What

# Practical decision making
def recommend_model(n_samples, n_features, data_type, need_interpretability):
    """
    Recommend starting model based on problem characteristics.
    """
    if data_type == 'tabular':
        if need_interpretability:
            if n_samples < 1000:
                return "Logistic Regression (with feature engineering)"
            else:
                return "Decision Tree or Logistic Regression"
        else:
            if n_samples < 1000:
                return "Random Forest (less prone to overfit)"
            else:
                return "XGBoost or LightGBM (best performance)"
    
    elif data_type == 'text':
        if n_samples < 10000:
            return "TF-IDF + Logistic Regression"
        else:
            return "Fine-tuned BERT or similar"
    
    elif data_type == 'image':
        return "Transfer learning (ResNet, EfficientNet)"
    
    else:
        return "Start with XGBoost, then try neural networks"

# Examples
print(recommend_model(500, 20, 'tabular', True))
# Output: "Logistic Regression (with feature engineering)"

print(recommend_model(100000, 50, 'tabular', False))  
# Output: "XGBoost or LightGBM (best performance)"
Pro Tip: Always start simple! A well-tuned logistic regression often beats a poorly-tuned neural network on tabular data. Plus, you can explain it to stakeholders!

Feature Engineering: The Art of ML

Often, creating better features matters more than choosing better algorithms.

import numpy as np
import pandas as pd

# Raw data
data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'temperature': np.random.uniform(30, 90, 100),
    'humidity': np.random.uniform(20, 80, 100),
    'sales': np.random.uniform(1000, 5000, 100)
})

# Feature Engineering
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = (data['day_of_week'] >= 5).astype(int)
data['month'] = data['date'].dt.month
data['temp_humidity_ratio'] = data['temperature'] / data['humidity']
data['is_hot'] = (data['temperature'] > 75).astype(int)

# Binning
data['temp_category'] = pd.cut(
    data['temperature'], 
    bins=[0, 50, 70, 100], 
    labels=['cold', 'mild', 'hot']
)

# Log transform for skewed variables
data['log_sales'] = np.log1p(data['sales'])

print(data[['temperature', 'humidity', 'temp_humidity_ratio', 'is_hot', 'temp_category']].head(10))

Regularization: Preventing Overfitting

Add a penalty for complex models:

L1 (Lasso): encourages sparsity (some weights become exactly 0):

$\text{Loss} = \text{MSE} + \lambda \sum |w_i|$

L2 (Ridge): encourages small weights (but none become exactly 0):

$\text{Loss} = \text{MSE} + \lambda \sum w_i^2$
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Create data with many features (some irrelevant)
np.random.seed(42)
n = 100
X = np.random.randn(n, 20)  # 20 features
# Only first 3 features actually matter
true_weights = np.array([3, -2, 1.5] + [0] * 17)
y = X @ true_weights + np.random.randn(n) * 0.5

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare models
from sklearn.linear_model import LinearRegression

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=0.1)
}

for name, model in models.items():
    model.fit(X_scaled, y)
    coefs = model.coef_
    
    print(f"\n{name}:")
    print(f"  Non-zero coefficients: {np.sum(np.abs(coefs) > 0.01)}")
    print(f"  Coefficients for first 5 features: {coefs[:5].round(2)}")
    print(f"  True weights for first 5: {true_weights[:5]}")

Complete ML Pipeline

Putting it all together:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, confusion_matrix)
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ModelResults:
    accuracy: float
    precision: float
    recall: float
    f1: float
    auc: float
    cv_scores: np.ndarray
    confusion_matrix: np.ndarray

class MLPipeline:
    """
    Complete machine learning pipeline with proper methodology.
    """
    
    def __init__(self, model=None, scale_features=True):
        self.model = model or LogisticRegression()
        self.scale_features = scale_features
        self.scaler = StandardScaler() if scale_features else None
        self.is_fitted = False
        
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train the pipeline."""
        if self.scale_features:
            X = self.scaler.fit_transform(X)
        self.model.fit(X, y)
        self.is_fitted = True
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions."""
        if self.scale_features:
            X = self.scaler.transform(X)
        return self.model.predict(X)
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict probabilities."""
        if self.scale_features:
            X = self.scaler.transform(X)
        return self.model.predict_proba(X)[:, 1]
    
    def evaluate(self, X: np.ndarray, y: np.ndarray, cv_folds: int = 5) -> ModelResults:
        """Comprehensive model evaluation."""
        # Predictions
        y_pred = self.predict(X)
        y_prob = self.predict_proba(X)
        
        # Metrics
        accuracy = accuracy_score(y, y_pred)
        precision = precision_score(y, y_pred, zero_division=0)
        recall = recall_score(y, y_pred, zero_division=0)
        f1 = f1_score(y, y_pred, zero_division=0)
        auc = roc_auc_score(y, y_prob)
        cm = confusion_matrix(y, y_pred)
        
        # Cross-validation
        if self.scale_features:
            X_scaled = self.scaler.transform(X)
        else:
            X_scaled = X
        cv_scores = cross_val_score(self.model, X_scaled, y, cv=cv_folds)
        
        return ModelResults(
            accuracy=accuracy,
            precision=precision,
            recall=recall,
            f1=f1,
            auc=auc,
            cv_scores=cv_scores,
            confusion_matrix=cm
        )
    
    def print_report(self, results: ModelResults, model_name: str = "Model"):
        """Print formatted evaluation report."""
        print("\n" + "=" * 60)
        print(f"MODEL EVALUATION: {model_name}")
        print("=" * 60)
        
        print("\nClassification Metrics:")
        print(f"  Accuracy:  {results.accuracy:.1%}")
        print(f"  Precision: {results.precision:.1%}")
        print(f"  Recall:    {results.recall:.1%}")
        print(f"  F1 Score:  {results.f1:.1%}")
        print(f"  AUC-ROC:   {results.auc:.3f}")
        
        print("\nCross-Validation:")
        print(f"  Scores: {results.cv_scores.round(3)}")
        print(f"  Mean:   {results.cv_scores.mean():.1%} (+/- {results.cv_scores.std()*2:.1%})")
        
        print("\nConfusion Matrix:")
        print(f"  TN: {results.confusion_matrix[0,0]:5d}  FP: {results.confusion_matrix[0,1]:5d}")
        print(f"  FN: {results.confusion_matrix[1,0]:5d}  TP: {results.confusion_matrix[1,1]:5d}")
        
        print("=" * 60)


# Example usage with our churn data
np.random.seed(42)
n = 1000

# Generate realistic customer data
months_active = np.random.exponential(12, n)
monthly_spend = np.random.lognormal(4, 0.5, n)
support_tickets = np.random.poisson(2, n)
login_frequency = np.random.poisson(10, n)
feature_usage = np.random.uniform(0, 1, n)

# Churn probability
churn_prob = sigmoid(
    -3 +
    -0.03 * months_active +
    -0.01 * monthly_spend +
    0.3 * support_tickets +
    -0.1 * login_frequency +
    -1.5 * feature_usage
)
churned = (np.random.random(n) < churn_prob).astype(int)

# Prepare data
X = np.column_stack([months_active, monthly_spend, support_tickets, 
                     login_frequency, feature_usage])
y = churned

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate
pipeline = MLPipeline(LogisticRegression())
pipeline.fit(X_train, y_train)
results = pipeline.evaluate(X_test, y_test)
pipeline.print_report(results, "Customer Churn Prediction")

# Feature importance
feature_names = ['Months Active', 'Monthly Spend', 'Support Tickets', 
                 'Login Frequency', 'Feature Usage']
                 
print("\nFeature Importance (Coefficients):")
for name, coef in zip(feature_names, pipeline.model.coef_[0]):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {name}: {coef:+.4f} ({direction} churn probability)")

Key Statistical Concepts in ML

Maximum Likelihood

Most ML algorithms find parameters that maximize the probability of observing the data.
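
A tiny illustration (simulated coin flips, not from the course datasets): the maximum-likelihood estimate of a coin's heads probability is just the sample frequency, and the binary cross-entropy loss from earlier is exactly the negative log-likelihood it minimizes.

import numpy as np

# Made-up coin flips (1 = heads) from a coin with true p = 0.7
np.random.seed(0)
flips = (np.random.random(1000) < 0.7).astype(int)

# MLE for a Bernoulli parameter is simply the sample mean
p_hat = flips.mean()

# Negative log-likelihood (= binary cross-entropy) is smallest at p_hat
for p in [0.5, p_hat, 0.9]:
    nll = -np.mean(flips * np.log(p) + (1 - flips) * np.log(1 - p))
    print(f"p = {p:.3f}: negative log-likelihood = {nll:.4f}")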

Bayesian Thinking

Prior beliefs + data = updated beliefs. Used in Bayesian ML, uncertainty quantification.
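
A minimal sketch of that update, using a conjugate Beta-Binomial model with made-up conversion numbers:

import numpy as np
from scipy import stats

# Prior belief about a conversion rate: Beta(2, 8), mean 20%
prior_a, prior_b = 2, 8

# Made-up observed data: 30 conversions out of 100 visitors
conversions, visitors = 30, 100

# Beta prior + binomial data -> Beta posterior (conjugate update)
posterior = stats.beta(prior_a + conversions, prior_b + (visitors - conversions))

lo, hi = posterior.interval(0.95)
print(f"Prior mean:     {prior_a / (prior_a + prior_b):.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")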

Information Theory

Cross-entropy, KL divergence, mutual information - all from statistics.
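
A short sketch with two made-up discrete distributions, showing the identity cross-entropy = entropy + KL divergence:

import numpy as np

# Two made-up discrete distributions over 4 outcomes
p = np.array([0.5, 0.25, 0.15, 0.10])   # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # model's distribution

entropy_p = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl_divergence = np.sum(p * np.log(p / q))

print(f"H(p)       = {entropy_p:.4f}")
print(f"H(p, q)    = {cross_entropy:.4f}")
print(f"KL(p || q) = {kl_divergence:.4f}")
print(f"H(p) + KL  = {entropy_p + kl_divergence:.4f}  (equals cross-entropy)")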

Central Limit Theorem

Why batch normalization works, why ensembles are powerful.
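
A quick simulated sketch of the variance-reduction intuition behind ensembles: averaging many noisy, independent estimates shrinks the spread by roughly 1/sqrt(n), the same effect the CLT describes for sample means. (Real ensemble members are correlated, so the reduction is smaller in practice.)

import numpy as np

np.random.seed(0)
true_value = 10.0
n_models, n_trials = 50, 2000

# Error of one noisy "model" vs. the average of 50 independent noisy "models"
single_model = true_value + np.random.normal(0, 2, n_trials)
ensemble_avg = true_value + np.random.normal(0, 2, (n_trials, n_models)).mean(axis=1)

print(f"Std of a single model's predictions: {single_model.std():.3f}")
print(f"Std of a 50-model average:           {ensemble_avg.std():.3f}")
print(f"Theoretical: 2 / sqrt(50) =          {2 / np.sqrt(50):.3f}")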

Practice: Capstone Project

Build a complete loan default prediction system:
# Dataset: Loan applications
# Features: income, debt_ratio, credit_score, loan_amount, employment_years
# Target: default (1) or paid (0)

# Your tasks:
# 1. Explore the data (summary statistics, correlations)
# 2. Engineer at least 2 new features
# 3. Train a logistic regression model
# 4. Evaluate using cross-validation
# 5. Interpret the coefficients
# 6. Calculate prediction for a new applicant
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Generate realistic loan data
np.random.seed(42)
n = 2000

income = np.random.lognormal(11, 0.5, n)  # Annual income
debt_ratio = np.random.beta(2, 5, n)  # Debt to income ratio
credit_score = np.random.normal(700, 80, n).clip(300, 850)
loan_amount = np.random.lognormal(10, 0.8, n)
employment_years = np.random.exponential(5, n)

# Default probability
default_prob = sigmoid(
    -5 +
    -0.00005 * income +
    3 * debt_ratio +
    -0.01 * credit_score +
    0.00002 * loan_amount +
    -0.1 * employment_years
)
default = (np.random.random(n) < default_prob).astype(int)

print(f"Default rate: {default.mean():.1%}")

# 1. Explore the data
print("\n--- EXPLORATORY ANALYSIS ---")
print(f"Income: mean=${np.mean(income):,.0f}, std=${np.std(income):,.0f}")
print(f"Credit Score: mean={np.mean(credit_score):.0f}, std={np.std(credit_score):.0f}")
print(f"Loan Amount: mean=${np.mean(loan_amount):,.0f}")

from scipy import stats
for var, name in [(income, 'Income'), (credit_score, 'Credit Score')]:
    r, p = stats.pointbiserialr(var, default)
    print(f"Correlation {name} vs Default: r={r:.3f}, p={p:.4f}")

# 2. Feature Engineering
loan_to_income = loan_amount / income
monthly_payment_estimate = loan_amount / 60  # Assume 5 year term
payment_to_income = monthly_payment_estimate / (income / 12)

# 3. Prepare and train
X = np.column_stack([income, debt_ratio, credit_score, loan_amount, 
                     employment_years, loan_to_income, payment_to_income])
y = default

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(C=1.0)
model.fit(X_train_scaled, y_train)

# 4. Evaluate
print("\n--- MODEL EVALUATION ---")
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"Cross-validation AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Paid', 'Default']))

# 5. Interpret coefficients
print("\n--- FEATURE IMPORTANCE ---")
feature_names = ['Income', 'Debt Ratio', 'Credit Score', 'Loan Amount',
                 'Employment Years', 'Loan/Income Ratio', 'Payment/Income Ratio']
for name, coef in sorted(zip(feature_names, model.coef_[0]), key=lambda x: abs(x[1]), reverse=True):
    risk = "Higher risk" if coef > 0 else "Lower risk"
    print(f"  {name:20s}: {coef:+.3f} ({risk})")

# 6. Predict for new applicant
new_applicant = {
    'income': 75000,
    'debt_ratio': 0.25,
    'credit_score': 720,
    'loan_amount': 30000,
    'employment_years': 5
}
new_applicant['loan_to_income'] = new_applicant['loan_amount'] / new_applicant['income']
new_applicant['payment_to_income'] = (new_applicant['loan_amount']/60) / (new_applicant['income']/12)

X_new = np.array([[new_applicant[k] for k in ['income', 'debt_ratio', 'credit_score',
                                               'loan_amount', 'employment_years',
                                               'loan_to_income', 'payment_to_income']]])
X_new_scaled = scaler.transform(X_new)
prob = model.predict_proba(X_new_scaled)[0, 1]

print(f"\n--- NEW APPLICANT PREDICTION ---")
for k, v in new_applicant.items():
    print(f"  {k}: {v:.2f}")
print(f"\nDefault Probability: {prob:.1%}")
print(f"Recommendation: {'APPROVE' if prob < 0.15 else 'REVIEW' if prob < 0.30 else 'DENY'}")

Key Takeaways

Statistics is ML Foundation

  • Regression becomes neural networks
  • Probability becomes model outputs
  • Hypothesis testing becomes model validation

Loss Functions

  • MSE for regression
  • Cross-entropy for classification
  • Gradient descent minimizes loss

Bias-Variance Tradeoff

  • Simple models underfit (high bias)
  • Complex models overfit (high variance)
  • Regularization helps find balance

Proper Evaluation

  • Never test on training data
  • Use cross-validation
  • Consider multiple metrics

Interview Questions

Question: Your model has low training error but high test error. What’s happening and how would you fix it?
Answer: This is overfitting - the model has low bias but high variance.

Diagnosis:
  • Model memorized training data instead of learning patterns
  • Too many features or too complex model
  • Not enough training data
Solutions:
  1. Regularization: Add L1 (Lasso) or L2 (Ridge) penalty (see the sketch after this list)
  2. Cross-validation: Use k-fold CV to detect overfitting early
  3. More data: Collect more training examples
  4. Feature selection: Remove irrelevant features
  5. Simpler model: Reduce polynomial degree, number of layers, etc.
  6. Early stopping: Stop training before overfitting occurs
  7. Dropout (for neural networks): Randomly disable neurons during training
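
As a minimal sketch of fix #1 (scikit-learn, with an arbitrary degree-15 polynomial and alpha=1.0 chosen purely for illustration), adding an L2 penalty to an overly complex model like the one from the bias-variance section narrows the train/test gap:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Same kind of noisy sine data as in the bias-variance example
np.random.seed(0)
X = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 30)
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_test = np.sin(X_test.ravel())

for name, reg in [('No regularization', LinearRegression()),
                  ('Ridge (alpha=1.0)', Ridge(alpha=1.0))]:
    # Same overly complex degree-15 polynomial, with and without an L2 penalty
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), reg)
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: Train MSE = {train_mse:.3f}, Test MSE = {test_mse:.3f}")
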
Question: You’re building a fraud detection system. Should you optimize for precision or recall?
Answer: It depends on business costs, but usually recall is more important.

Analysis:
  • High recall, lower precision: Catch most fraud but have more false alarms
  • High precision, lower recall: Fewer false alarms but miss more fraud
For fraud detection:
  • Cost of false negative (missed fraud) = money lost + reputation damage
  • Cost of false positive (flagged legitimate) = customer friction + review cost
Usually missed fraud is more costly, so prioritize recall. But the right answer is: calculate the expected cost of each error type and optimize accordingly.
# Example (illustrative numbers): 1,000 frauds, $500 average fraud loss, $10 review cost
n_fraud, fraud_amount, review_cost = 1000, 500, 10
for precision, recall in [(0.5, 0.95), (0.9, 0.60)]:
    caught = recall * n_fraud                      # frauds we flag
    missed = n_fraud - caught                      # frauds we miss
    false_positives = caught / precision - caught  # legitimate transactions flagged
    # Total cost = (missed fraud * fraud_amount) + (all flagged transactions * review_cost)
    total_cost = missed * fraud_amount + (caught + false_positives) * review_cost
    print(f"precision={precision:.2f}, recall={recall:.2f}: expected cost = ${total_cost:,.0f}")
Question: Why is feature scaling important for machine learning, and when is it not needed?
Answer:

When scaling matters:
  1. Gradient-based optimization: Features on different scales can cause zig-zagging during optimization
  2. Distance-based algorithms: k-NN, SVM, k-means - larger features dominate
  3. Regularization: L1/L2 penalties affect differently-scaled features unequally
  4. Neural networks: Improves convergence speed
When scaling doesn’t matter:
  1. Tree-based models: Random forests, XGBoost split on one feature at a time
  2. Naive Bayes: Features are treated independently
  3. All features already on same scale: e.g., all percentages
Types of scaling (compared in the sketch below):
  • Standardization (z-score): Mean=0, Std=1. Best for normally distributed data
  • Min-Max scaling: Range [0,1]. Best when bounds are known
  • Robust scaling: Uses median/IQR. Best when outliers present
Question: Explain k-fold cross-validation and when you might use stratified k-fold instead.
Answer:

K-Fold Cross-Validation:
  1. Split data into k equal parts (folds)
  2. Train on k-1 folds, validate on 1 fold
  3. Repeat k times, using each fold as validation once
  4. Average the k scores for final estimate
Fold 1: [VAL] [Train] [Train] [Train] [Train]
Fold 2: [Train] [VAL] [Train] [Train] [Train]
Fold 3: [Train] [Train] [VAL] [Train] [Train]
...
Stratified K-Fold: Use when dealing with imbalanced classes. Ensures each fold has the same proportion of classes as the full dataset (see the sketch after this answer).

When to use stratified:
  • Imbalanced classification (e.g., fraud detection at 1%)
  • Multi-class with unequal class sizes
  • Small datasets where random splits could unbalance folds
Typical k values:
  • k=5 or k=10 are common
  • Higher k = less bias, more variance, more computation
  • Leave-one-out (k=n) rarely used except for tiny datasets

πŸ“ Practice Exercises

Exercise 1

Implement logistic regression from scratch

Exercise 2

Build and evaluate a classification model

Exercise 3

Implement gradient descent for optimization

Exercise 4

Real-world: Customer churn prediction pipeline

🚨 Real-World Challenge: Messy Data in Production

Production Reality: The examples above used clean, synthetic data. Real-world data is messy, biased, and constantly changing. Here’s what you’ll actually encounter:

Common Data Quality Issues

import numpy as np
import pandas as pd

# Simulating realistic messy data
np.random.seed(42)
n = 1000

# Create messy customer data
messy_data = pd.DataFrame({
    'customer_id': range(n),
    'tenure': np.where(np.random.rand(n) < 0.05, np.nan, 
                       np.random.exponential(24, n)),  # 5% missing
    'monthly_spend': np.where(np.random.rand(n) < 0.03, -999,  # Invalid placeholder
                              np.random.normal(80, 25, n)),
    'support_tickets': np.random.poisson(2, n),
    'age': np.where(np.random.rand(n) < 0.1, 0,  # Impossible age
                    np.random.normal(45, 15, n).astype(int)),
})

# Add some outliers
messy_data.loc[42, 'monthly_spend'] = 50000  # Enterprise customer?
messy_data.loc[100, 'tenure'] = 500  # Data entry error?

print("=== Messy Data Diagnostics ===")
print(f"\nMissing values:\n{messy_data.isnull().sum()}")
print(f"\nNegative spend (placeholder): {(messy_data['monthly_spend'] < 0).sum()}")
print(f"Zero age (invalid): {(messy_data['age'] == 0).sum()}")
print(f"\nSpend outliers (>3 std): {(messy_data['monthly_spend'] > 180).sum()}")

Data Cleaning Pipeline

def clean_customer_data(df):
    """Production-ready data cleaning pipeline."""
    df = df.copy()
    
    # 1. Handle placeholders and invalid values
    df['monthly_spend'] = df['monthly_spend'].replace(-999, np.nan)
    df.loc[df['age'] <= 0, 'age'] = np.nan
    df.loc[df['age'] > 120, 'age'] = np.nan  # Impossible ages
    
    # 2. Cap outliers (winsorization)
    for col in ['monthly_spend', 'tenure']:
        if col in df.columns:
            p99 = df[col].quantile(0.99)
            df.loc[df[col] > p99, col] = p99
    
    # 3. Impute missing values
    # Numeric: median (robust to outliers)
    for col in ['tenure', 'monthly_spend', 'age']:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].median())
    
    # 4. Create data quality flags
    df['has_missing_data'] = df.isnull().any(axis=1).astype(int)
    
    return df

cleaned_data = clean_customer_data(messy_data)
print("\n=== After Cleaning ===")
print(f"Missing values: {cleaned_data.isnull().sum().sum()}")
print(f"Spend range: ${cleaned_data['monthly_spend'].min():.2f} - ${cleaned_data['monthly_spend'].max():.2f}")

Detecting and Handling Data Drift

def check_data_drift(train_data, new_data, threshold=0.1):
    """
    Check if new data has drifted from training distribution.
    Uses Kolmogorov-Smirnov test for numerical features.
    """
    from scipy.stats import ks_2samp
    
    drift_report = {}
    
    for col in train_data.select_dtypes(include=[np.number]).columns:
        stat, p_value = ks_2samp(train_data[col].dropna(), 
                                  new_data[col].dropna())
        
        drift_detected = p_value < threshold
        drift_report[col] = {
            'ks_statistic': stat,
            'p_value': p_value,
            'drift_detected': drift_detected
        }
        
        if drift_detected:
            print(f"⚠️ DRIFT DETECTED in '{col}': p={p_value:.4f}")
    
    return drift_report

# Simulate drift: new data with different distribution
new_data = cleaned_data.copy()
new_data['monthly_spend'] = new_data['monthly_spend'] * 1.5  # 50% inflation!

print("\n=== Data Drift Check ===")
drift_report = check_data_drift(cleaned_data, new_data)

Handling Class Imbalance

from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Severe imbalance: 2% fraud rate
y_imbalanced = np.zeros(10000)
y_imbalanced[:200] = 1  # Only 2% positive

print(f"Original class distribution: {np.bincount(y_imbalanced.astype(int))}")

# Strategy 1: Class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_imbalanced), y=y_imbalanced)
weight_dict = {0: class_weights[0], 1: class_weights[1]}
print(f"\nClass weights: {weight_dict}")

# Strategy 2: SMOTE oversampling (create synthetic minority examples)
X_dummy = np.random.randn(10000, 5)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_dummy, y_imbalanced)
print(f"After SMOTE: {np.bincount(y_resampled.astype(int))}")

# Strategy 3: Threshold tuning
# Instead of predicting class at 0.5, use lower threshold for rare class
print("\nπŸ’‘ For imbalanced data, tune threshold based on precision-recall curve!")
Production ML Checklist:
  • Check for missing values and understand WHY they’re missing
  • Detect outliers and decide: cap, remove, or flag?
  • Look for placeholder values (-999, 0, β€œN/A”, etc.)
  • Check class balance for classification problems
  • Set up data drift monitoring for production models
  • Document all cleaning decisions for reproducibility

πŸ”¬ Advanced Deep Dive (Optional)

The Foundation of Most ML Training

Maximum Likelihood Estimation (MLE) is how most ML models learn. The idea: find parameters that make the observed data most likely.

The Math: Given data $X = \{x_1, x_2, ..., x_n\}$ and model parameters $\theta$:

$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^n P(x_i \mid \theta)$

Or in log form (more numerically stable):

$\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log P(x_i \mid \theta)$

Connection to ML Loss Functions:
  • Cross-entropy loss = negative log-likelihood for classification
  • MSE loss = MLE assuming Gaussian noise in regression
import numpy as np
from scipy.optimize import minimize

# Example: Estimate mean and std of normal distribution using MLE
np.random.seed(42)
true_mean, true_std = 5.0, 2.0
data = np.random.normal(true_mean, true_std, 1000)

def negative_log_likelihood(params, data):
    """Negative log-likelihood for normal distribution."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    n = len(data)
    # Log-likelihood of normal distribution
    ll = -n/2 * np.log(2 * np.pi) - n * np.log(sigma) - np.sum((data - mu)**2) / (2 * sigma**2)
    return -ll  # Negative because we minimize

# Find MLE estimates
result = minimize(negative_log_likelihood, x0=[0, 1], args=(data,), method='Nelder-Mead')
mle_mean, mle_std = result.x

print(f"True parameters: ΞΌ={true_mean}, Οƒ={true_std}")
print(f"MLE estimates:   ΞΌ={mle_mean:.4f}, Οƒ={mle_std:.4f}")
print(f"\nNote: MLE for normal distribution = sample mean and std!")
print(f"Sample mean: {data.mean():.4f}, Sample std: {data.std():.4f}")

Beyond p-values: Bayes Factors

Hypothesis testing gives you p-values, but Bayes factors tell you the relative evidence for one model vs another:

$BF = \frac{P(\text{Data} \mid \text{Model}_1)}{P(\text{Data} \mid \text{Model}_2)}$
| Bayes Factor | Interpretation |
| --- | --- |
| BF < 1/10 | Strong evidence for Model 2 |
| 1/10 < BF < 1/3 | Moderate evidence for Model 2 |
| 1/3 < BF < 3 | Inconclusive |
| 3 < BF < 10 | Moderate evidence for Model 1 |
| BF > 10 | Strong evidence for Model 1 |
import numpy as np
from scipy.stats import norm

def bayes_factor_means(data, prior_mean=0, prior_std=10):
    """
    Compute Bayes factor for H1 (mean β‰  0) vs H0 (mean = 0).
    Simplified version using normal prior.
    """
    n = len(data)
    sample_mean = data.mean()
    sample_var = data.var()
    
    # Likelihood under H0 (mean = 0)
    ll_h0 = np.sum(norm.logpdf(data, loc=0, scale=np.sqrt(sample_var)))
    
    # Marginal likelihood under H1 (integrated over prior)
    # Simplified: use posterior mean
    posterior_precision = n / sample_var + 1 / prior_std**2
    posterior_mean = (n * sample_mean / sample_var) / posterior_precision
    ll_h1 = np.sum(norm.logpdf(data, loc=posterior_mean, scale=np.sqrt(sample_var)))
    
    # Bayes factor (approximate)
    log_bf = ll_h1 - ll_h0
    bf = np.exp(log_bf)
    
    return bf

# Test with data that has true mean = 2
data_with_effect = np.random.normal(2, 1, 100)
data_no_effect = np.random.normal(0, 1, 100)

bf_effect = bayes_factor_means(data_with_effect)
bf_no_effect = bayes_factor_means(data_no_effect)

print(f"Data with true mean=2: BF = {bf_effect:.2f}")
print(f"Data with true mean=0: BF = {bf_no_effect:.2f}")

Course Summary: The Complete Statistical Toolkit

You’ve now mastered the statistical foundations of machine learning:
  1. Describing Data: Mean, median, variance, and standard deviation to summarize any dataset
  2. Probability: Basic rules, conditional probability, and Bayes’ theorem for reasoning under uncertainty
  3. Distributions: Normal, binomial, and other patterns that randomness follows
  4. Statistical Inference: Drawing conclusions from samples using confidence intervals
  5. Hypothesis Testing: Determining if effects are real with A/B testing methodology
  6. Regression: Modeling relationships and making predictions
  7. Connection to ML: How all these concepts power modern machine learning algorithms

πŸ—ΊοΈ Your Complete Learning Path

You are here in the math-to-ML journey:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         MATH FOUNDATIONS                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Linear Algebra  β”‚     Calculus     β”‚    Statistics βœ“ (You!)     β”‚
β”‚    (Vectors,     β”‚  (Derivatives,   β”‚  (Probability, Inference,  β”‚
β”‚    Matrices)     β”‚    Gradients)    β”‚    Hypothesis Testing)     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                ↓                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                        ML MASTERY COURSE                         β”‚
β”‚    Algorithms β†’ Evaluation β†’ Feature Engineering β†’ Production    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                ↓                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                          AI ENGINEERING                          β”‚
β”‚            LLMs β†’ RAG β†’ Agents β†’ Production Systems              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Next Steps Based on Your Goals:
| Your Goal | Recommended Path |
| --- | --- |
| Become an ML Engineer | β†’ ML Mastery Course |
| Understand Deep Learning | β†’ Linear Algebra (if not done) β†’ Calculus β†’ ML Mastery |
| Work with LLMs/AI | β†’ AI Engineering Track |
| Data Science Role | β†’ ML Mastery β†’ Focus on Modules 7-11 (Evaluation, Features) |
| Research/Academia | β†’ Complete all math courses β†’ Deep Learning theory |

What’s Next?

You now have a solid statistical foundation for machine learning. From here, you can explore:
| Topic | What You’ll Learn | Your Foundation |
| --- | --- | --- |
| Deep Learning | Neural networks with multiple layers | Gradient descent, loss functions |
| Ensemble Methods | Random forests, gradient boosting | Variance reduction, decision trees |
| Unsupervised Learning | Clustering, dimensionality reduction | Variance, distance metrics |
| Time Series | Forecasting, sequential data | Regression, autocorrelation |
| Bayesian ML | Uncertainty quantification, probabilistic models | Bayes’ theorem, priors |

🧹 Real-World Complications: Data Quality Issues

| Problem | How to Detect | Solution |
| --- | --- | --- |
| Missing values | df.isnull().sum() | Imputation, dropping, or modeling |
| Outliers | IQR method, z-scores, visualization | Winsorization, robust statistics, or removal |
| Skewed distributions | Histograms, skewness metric | Log transform, Box-Cox |
| Class imbalance | y.value_counts() | SMOTE, class weights, threshold tuning |
| Feature scaling | Range comparison | StandardScaler, MinMaxScaler |
| Categorical encoding | Check dtypes | One-hot, label, or target encoding |
| Multicollinearity | Correlation matrix, VIF | Drop features, PCA, regularization |
Remember: Real data is messy. The best ML engineers spend 80% of their time on data quality, not model tuning!
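
Most rows in the table above were demonstrated earlier; multicollinearity was not, so here is a minimal VIF sketch (one way to compute it, using statsmodels, with a made-up feature x3 that nearly duplicates x1):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up data: x3 is nearly a copy of x1, so both should show a high VIF
np.random.seed(0)
X = pd.DataFrame({'x1': np.random.randn(200), 'x2': np.random.randn(200)})
X['x3'] = X['x1'] + np.random.randn(200) * 0.05

Xc = add_constant(X)  # include an intercept, as is standard for VIF
for i, col in enumerate(Xc.columns):
    if col == 'const':
        continue
    print(f"VIF({col}) = {variance_inflation_factor(Xc.values, i):.1f}")
# Rule of thumb: VIF above roughly 5-10 signals problematic multicollinearity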

Common Pitfalls in ML Practice

ML Mistakes to Avoid:
  1. Data Leakage - Training on information not available at prediction time; always split data BEFORE any preprocessing (see the sketch after this list)
  2. Not Using Cross-Validation - A single train/test split is unreliable; use k-fold CV for robust estimates
  3. Ignoring Class Imbalance - 99% accuracy is meaningless if 99% of data is one class; use precision, recall, F1
  4. Overfitting to Validation Set - Repeatedly tuning on validation set leads to implicit overfitting; use holdout test set
  5. Wrong Metric for Problem - Optimizing MSE when business cares about outliers; match metric to objective
  6. Assuming Stationarity - Models trained on old data may not work on new data; monitor for drift
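
A minimal sketch of avoiding the most common leak from item 1 (scikit-learn, made-up data): split first, then fit the scaler on the training split only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix and labels
np.random.seed(0)
X = np.random.randn(500, 3)
y = (X[:, 0] + np.random.randn(500) > 0).astype(int)

# Split FIRST, so the test set never influences preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # the same statistics applied to test data

# Leaky anti-pattern (do NOT do this): scaler.fit_transform(X) before splitting
print(X_train_scaled.mean(axis=0).round(3), X_test_scaled.mean(axis=0).round(3))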

Congratulations!

Course Complete!

You’ve completed Probability and Statistics for Machine Learning! You now understand the mathematical foundations that power modern AI systems - from how models learn (gradient descent) to how we validate them (hypothesis testing) to why they work (probability theory). This foundation will serve you in every ML role, from data scientist to ML engineer to research scientist.
Your Statistics β†’ ML Toolkit:
  • βœ… Descriptive Statistics β†’ Data exploration & feature engineering
  • βœ… Probability Theory β†’ Understanding model uncertainty & predictions
  • βœ… Distributions β†’ Choosing loss functions & detecting anomalies
  • βœ… Statistical Inference β†’ Confidence intervals for model performance
  • βœ… Hypothesis Testing β†’ A/B testing & model comparison
  • βœ… Regression β†’ Foundation for all supervised learning
  • βœ… Bias-Variance β†’ Model selection & hyperparameter tuning
  • βœ… Cross-Validation β†’ Robust performance estimation