# Classification

> Predict categories - spam or not spam, cat or dog, buy or don't buy

## A Different Kind of Prediction

In regression, we predict numbers: *"This house costs \$450,000"*

In classification, we predict categories: *"This email is SPAM"*

**Real-world classification problems**:

* Is this transaction fraudulent? (Yes/No)
* What digit is in this image? (0-9)
* Will this customer buy? (Yes/No)
* What disease does this patient have? (A, B, C, D)
* Is this review positive or negative? (Positive/Negative)

***

## The Email Spam Problem

Let's build a spam detector from scratch.

### The Data

Imagine each email is represented by features:

* Number of exclamation marks
* Contains word "FREE"
* Contains word "WINNER"
* Sender in contacts
* Length of email

```python theme={null}
import numpy as np

# Email features: [exclamation_count, has_free, has_winner, in_contacts, length_bucket]
# Labels: 0 = not spam, 1 = spam
emails = np.array([
    [5, 1, 1, 0, 1],   # Short, has FREE and WINNER, lots of !!! -> likely spam
    [0, 0, 0, 1, 3],   # Long, from contact, no sketchy words -> not spam
    [3, 1, 0, 0, 1],   # Has FREE, some !!! -> maybe spam
    [0, 0, 0, 1, 2],   # From contact -> not spam
    [10, 1, 1, 0, 1],  # Very spammy
    [1, 0, 0, 1, 3],   # Normal email from contact
    [8, 1, 1, 0, 1],   # Spammy
    [0, 0, 0, 0, 2],   # Normal email
])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1=spam, 0=not spam
```

### Why Not Just Use Linear Regression?

Let's try:

```python theme={null}
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(emails, labels)

# Predict
predictions = model.predict(emails)
print("Predictions:", predictions)
# Example output (illustrative): [0.89, 0.12, 0.67, 0.15, 1.12, 0.18, 0.95, 0.22]
```

**Problems**:

1. Predictions can be > 1 or \< 0 (what does 1.12 "spam" mean?)
2. We want probabilities (0 to 1), not arbitrary numbers
3. We want a clear decision: spam or not spam

***

## The Sigmoid Function: Squashing to Probabilities

We need a function that:

* Takes any number (from -∞ to +∞)
* Outputs a value between 0 and 1
* Acts like a probability

Enter the **sigmoid function** -- nature's favorite dimmer switch:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Think of it like a confidence meter. The linear model produces a raw score (could be -47 or +312), and sigmoid translates it into "how confident are we?" on a 0-to-1 scale. Very negative scores become "almost certainly not spam" (near 0), and very positive scores become "almost certainly spam" (near 1). Zero is the tipping point -- 50/50.

```python theme={null}
def sigmoid(z):
    """Squash any number to range (0, 1)"""
    return 1 / (1 + np.exp(-z))

# Test it
for z in [-10, -2, 0, 2, 10]:
    print(f"sigmoid({z:3d}) = {sigmoid(z):.4f}")
```

**Output**:

```
sigmoid(-10) = 0.0000  # Very negative -> close to 0
sigmoid( -2) = 0.1192  # Negative -> small
sigmoid(  0) = 0.5000  # Zero -> 0.5 (uncertain)
sigmoid(  2) = 0.8808  # Positive -> close to 1
sigmoid( 10) = 1.0000  # Very positive -> close to 1
```

***

## Logistic Regression

Combine linear regression with sigmoid:

$$
P(\text{spam}) = \sigma(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n)
$$

1. Compute a weighted sum (like linear regression)
2. Pass through sigmoid to get a probability
3. If the probability is at least 0.5 (the default threshold), predict "spam"

```python theme={null}
def logistic_regression_predict_proba(X, w):
    """
    Predict probability of class 1.

    X must already include a leading column of ones so that w[0] acts as
    the bias term w_0.
    """
    z = X @ w          # Linear combination
    return sigmoid(z)  # Squash to probability

def logistic_regression_predict(X, w, threshold=0.5):
    """
    Predict class labels (0 or 1).
    """
    probabilities = logistic_regression_predict_proba(X, w)
    return (probabilities >= threshold).astype(int)
```

***

## Training Logistic Regression

### The Loss Function

For classification, we use **Binary Cross-Entropy** (log loss):

$$
L = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]
$$

**Why not use MSE like in regression?** Because MSE combined with a sigmoid gives a non-convex loss surface, and its gradient shrinks toward zero when the sigmoid saturates, even if the prediction is confidently wrong. Gradient descent can then become painfully slow. Cross-entropy is convex for logistic regression and has steep slopes that push the model to fix its confident-but-wrong predictions aggressively.

**Intuition** -- think of it as a "surprise" score:

* If actual is 1 and we predict 0.9 -- small loss (not surprised, good prediction!)
* If actual is 1 and we predict 0.1 -- large loss (very surprised, terrible prediction!)
* If actual is 1 and we predict 0.001 -- *enormous* loss (the log function explodes as predictions approach 0, heavily penalizing confident wrong answers)

```python theme={null}
def binary_cross_entropy(y_true, y_pred):
    """
    Compute binary cross-entropy loss.
    """
    # Clip predictions to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    loss = -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )
    return loss
```

### Gradient Descent for Logistic Regression

```python theme={null}
def train_logistic_regression(X, y, learning_rate=0.1, num_epochs=1000):
    """
    Train logistic regression using gradient descent.
    """
    # Add bias column
    X_bias = np.column_stack([np.ones(len(X)), X])

    # Initialize weights
    w = np.zeros(X_bias.shape[1])

    for epoch in range(num_epochs):
        # Forward pass
        z = X_bias @ w
        predictions = sigmoid(z)

        # Compute loss
        loss = binary_cross_entropy(y, predictions)

        # Compute gradient
        errors = predictions - y
        gradient = X_bias.T @ errors / len(y)

        # Update weights
        w = w - learning_rate * gradient

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")

    return w

# Train on our email data
weights = train_logistic_regression(emails, labels)

# Make predictions
X_bias = np.column_stack([np.ones(len(emails)), emails])
probs = sigmoid(X_bias @ weights)
preds = (probs >= 0.5).astype(int)

print("\nPredictions vs Actual:")
for i in range(len(emails)):
    print(f"Email {i}: P(spam)={probs[i]:.2f}, Predicted={preds[i]}, Actual={labels[i]}")
```

***

## Using scikit-learn

```python theme={null}
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create and train model.
# Despite its name, LogisticRegression is a CLASSIFIER, not a regressor.
# The "regression" in the name refers to the mathematical technique
# (fitting a logistic function), not the type of problem.
model = LogisticRegression()
model.fit(emails, labels)

# Predict hard labels (0 or 1)
predictions = model.predict(emails)

# Predict probabilities -- often more useful than hard labels.
# [:, 1] selects the probability of class 1 (spam).
# Use these for ranking, threshold tuning, or when downstream
# decisions depend on confidence level.
probabilities = model.predict_proba(emails)[:, 1]  # P(spam)

print("Predictions:", predictions)
print("Probabilities:", probabilities)
print(f"Accuracy: {accuracy_score(labels, predictions):.2%}")
```

***

## Real Example: Breast Cancer Detection

```python theme={null}
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print("Features:", cancer.feature_names[:5], "...")
print("Classes:", cancer.target_names)  # ['malignant' 'benign']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train -- max_iter=5000 gives the optimizer enough iterations to converge.
# Logistic regression uses an iterative solver internally, and the default
# 100 iterations isn't always enough for high-dimensional data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train_scaled, y_train)

# Evaluate on data the model has never seen
y_pred = model.predict(X_test_scaled)

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# In medical contexts, pay special attention to False Negatives (FN):
# a patient with cancer classified as benign. This is more dangerous
# than a False Positive (healthy person flagged for further testing).
# Note: this dataset encodes 0 = malignant and 1 = benign, and scikit-learn
# puts actual classes in rows and predicted classes in columns, so these
# dangerous misses are counted in the top-right cell of the printed matrix.
```

***

## Understanding the Confusion Matrix

```
                Predicted
              Neg     Pos
Actual  Neg [ TN      FP ]
        Pos [ FN      TP ]
```

* **True Positive (TP)**: Predicted spam, was spam
* **True Negative (TN)**: Predicted not spam, was not spam
* **False Positive (FP)**: Predicted spam, was not spam (annoying!)
* **False Negative (FN)**: Predicted not spam, was spam (dangerous!)

***

## Key Metrics

```python theme={null}
from sklearn.metrics import precision_score, recall_score, f1_score

# y_test and y_pred come from the breast cancer example above.
# scikit-learn treats label 1 as the positive class by default, and in that
# dataset label 1 is 'benign'. Pass pos_label=0 to score 'malignant' instead.

# Precision: Of all positive predictions (e.g. "spam"), how many were correct?
# "When we say spam, how often are we right?"
precision = precision_score(y_test, y_pred)

# Recall: Of all actual positives (e.g. actual spam), how many did we catch?
# "What % of spam did we catch?"
recall = recall_score(y_test, y_pred)

# F1: Harmonic mean of precision and recall
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")
```

**When to prioritize which metric?** Think of it as a cost-of-mistakes analysis:

* **High Precision needed**: Spam filter -- if you mark a real email as spam, your user misses an important message. The cost of a false positive is high.
* **High Recall needed**: Disease detection -- if you miss a sick patient and send them home, the consequences could be fatal. The cost of a false negative is high.
* **F1 Score**: When you need balance between both, or when you're not sure which type of mistake is worse. F1 is the harmonic mean, which means it punishes you if *either* precision or recall is low.

**A senior engineer's shortcut**: Ask the business stakeholder "What's worse -- a false alarm or a missed catch?" Their answer tells you which metric to optimize.

***

## Multi-Class Classification

What if there are more than 2 classes?
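One common way to generalize the sigmoid to several classes is the softmax function, which turns one raw score per class into probabilities that sum to 1. The sketch below is illustrative only: the `softmax` helper and the raw scores are assumptions made up for this example, not part of scikit-learn's API.

```python
import numpy as np

def softmax(scores):
    """Convert each row of raw scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 samples and 3 classes
raw_scores = np.array([
    [2.0, 1.0, 0.1],  # class 0 has the highest score
    [0.5, 0.5, 0.5],  # all classes tie
])
softmax_probs = softmax(raw_scores)
predicted_classes = softmax_probs.argmax(axis=1)  # pick the most probable class

print(softmax_probs.round(3))
print("Predicted classes:", predicted_classes)
```

With two classes this reduces to the sigmoid applied to the difference of the two scores. In practice you rarely write this yourself, since scikit-learn's `LogisticRegression` accepts multi-class labels directly: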
```python theme={null}
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load iris data (3 classes of flowers)
iris = load_iris()
X, y = iris.data, iris.target

print("Classes:", iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train (scikit-learn handles multi-class automatically!)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Get probabilities for each class
probs = model.predict_proba(X_test[:3])
print("\nProbabilities for first 3 samples:")
for i, p in enumerate(probs):
    print(f"Sample {i}: {dict(zip(iris.target_names, p.round(3)))}")
```

***

## The Decision Boundary

Logistic regression creates a linear decision boundary:

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

# Reuses `iris` and `LogisticRegression` from the multi-class example above.
# Use just 2 features for visualization
X_2d = iris.data[:, :2]  # sepal length and width
y = iris.target

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_2d, y)

# Create a mesh grid for decision boundary
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis', edgecolors='black')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Logistic Regression Decision Boundary')
plt.show()
```

***

## Key Takeaways

* Predict discrete labels, not numbers
* Squash outputs to 0-1 range
* P > 0.5 means positive class
* Accuracy isn't always enough

***

## 🚀 Mini Projects

* Build a spam detector from scratch
* Medical diagnosis classifier with metrics analysis
* Customer churn prediction system

**Project 1: Spam Email Detector** - Text classification basics

**Objective**: Build a simple spam classifier using word features.

```python theme={null}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Simulated email data (word counts)
# Features: [contains_free, contains_winner, contains_meeting, contains_urgent, word_count]
emails = [
    # [free, winner, meeting, urgent, word_count], is_spam
    ([1, 1, 0, 1, 50], 1),   # "You're a FREE WINNER! URGENT!"
    ([1, 0, 0, 1, 30], 1),   # "FREE offer URGENT!"
    ([0, 1, 0, 0, 45], 1),   # "You are the WINNER!"
    ([0, 0, 1, 0, 120], 0),  # "Meeting scheduled for Monday"
    ([0, 0, 1, 0, 85], 0),   # "Team meeting notes"
    ([0, 0, 0, 1, 200], 0),  # "Urgent: Project deadline"
    ([1, 1, 0, 1, 25], 1),   # "FREE WINNER URGENT!"
    ([0, 0, 1, 0, 150], 0),  # "Meeting agenda attached"
    ([1, 0, 0, 0, 100], 0),  # "Free trial of software"
    ([0, 0, 0, 0, 300], 0),  # Normal work email
]

# More data for training
np.random.seed(42)
n_spam = 100
n_ham = 150

# Generate spam emails (high free, winner, urgent, low word count)
spam_data = np.column_stack([
    np.random.binomial(1, 0.7, n_spam),  # free
    np.random.binomial(1, 0.5, n_spam),  # winner
    np.random.binomial(1, 0.1, n_spam),  # meeting
    np.random.binomial(1, 0.6, n_spam),  # urgent
    np.random.normal(40, 15, n_spam),    # word count (short)
])
spam_labels = np.ones(n_spam)

# Generate ham emails (low free, winner, high meeting, longer)
ham_data = np.column_stack([
    np.random.binomial(1, 0.1, n_ham),   # free
    np.random.binomial(1, 0.05, n_ham),  # winner
    np.random.binomial(1, 0.4, n_ham),   # meeting
    np.random.binomial(1, 0.2, n_ham),   # urgent
    np.random.normal(150, 50, n_ham),    # word count (longer)
])
ham_labels = np.zeros(n_ham)

# Combine
X = np.vstack([spam_data, ham_data])
y = np.concatenate([spam_labels, ham_labels])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("=== Spam Classifier Results ===")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Feature importance
feature_names = ['contains_free', 'contains_winner', 'contains_meeting', 'contains_urgent', 'word_count']
print("=== Feature Importance ===")
for name, coef in zip(feature_names, model.coef_[0]):
    indicator = "→ SPAM" if coef > 0.5 else "→ HAM" if coef < -0.5 else ""
    print(f"{name}: {coef:+.3f} {indicator}")

# Test new emails
new_emails = [
    [1, 1, 0, 1, 30],   # Looks spammy
    [0, 0, 1, 0, 200],  # Looks legitimate
    [1, 0, 1, 0, 100],  # Mixed signals
]

print("\n=== New Email Predictions ===")
for email in new_emails:
    prob = model.predict_proba([email])[0]
    pred = "SPAM" if prob[1] > 0.5 else "HAM"
    print(f"Features {email}: {pred} (P(spam)={prob[1]:.2f})")
```
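The feature-importance printout above shows raw coefficients, which live on the log-odds scale. One way to make them easier to read is to exponentiate them into odds ratios: a one-unit increase in a feature multiplies the odds of spam by `exp(coef)`. The sketch below is a minimal illustration on made-up data; the single `has_free` feature and the 0.8 / 0.2 spam rates are assumptions chosen for the example, not values from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (hypothetical): one binary feature, one label
rng = np.random.default_rng(0)
has_free = rng.integers(0, 2, size=500)

# Spam is more likely when has_free == 1 (rates chosen for illustration)
p_spam = np.where(has_free == 1, 0.8, 0.2)
is_spam = (rng.random(500) < p_spam).astype(int)

odds_model = LogisticRegression()
odds_model.fit(has_free.reshape(-1, 1), is_spam)

coef = odds_model.coef_[0, 0]
odds_ratio = np.exp(coef)  # multiplicative change in odds per unit increase

print(f"Coefficient (log-odds): {coef:+.3f}")
print(f"Odds ratio:             {odds_ratio:.2f}")
```

An odds ratio above 1 means the feature pushes toward spam, and below 1 toward ham. Because `LogisticRegression` applies L2 regularization by default, the estimate is shrunk somewhat toward 1. Coefficients are also only comparable across features that share a scale, which is why `word_count` in Project 1 gets a small coefficient despite being informative.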

**Project 2: Medical Diagnosis Classifier** - Handle class imbalance

**Objective**: Classify patients as healthy or having a disease with proper metrics.

```python theme={null}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, precision_recall_curve)
from sklearn.preprocessing import StandardScaler

# Simulate medical data (imbalanced: 5% positive)
np.random.seed(42)
n_healthy = 950
n_disease = 50

# Features: [age, blood_pressure, cholesterol, glucose, bmi]
healthy = np.column_stack([
    np.random.normal(40, 15, n_healthy),   # age
    np.random.normal(120, 10, n_healthy),  # bp
    np.random.normal(180, 30, n_healthy),  # cholesterol
    np.random.normal(90, 10, n_healthy),   # glucose
    np.random.normal(24, 3, n_healthy),    # bmi
])

disease = np.column_stack([
    np.random.normal(55, 12, n_disease),   # older
    np.random.normal(145, 15, n_disease),  # higher bp
    np.random.normal(240, 40, n_disease),  # higher cholesterol
    np.random.normal(130, 25, n_disease),  # higher glucose
    np.random.normal(30, 4, n_disease),    # higher bmi
])

X = np.vstack([healthy, disease])
y = np.array([0]*n_healthy + [1]*n_disease)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train with class weights (handle imbalance)
model = LogisticRegression(class_weight='balanced')
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("=== Medical Diagnosis Results ===")
print(f"Disease prevalence: {y.mean():.1%}")
print(classification_report(y_test, y_pred, target_names=['Healthy', 'Disease']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"              Pred Healthy  Pred Disease")
print(f"True Healthy      {cm[0,0]:4d}          {cm[0,1]:4d}")
print(f"True Disease      {cm[1,0]:4d}          {cm[1,1]:4d}")

# Calculate key metrics for medical context
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)  # Recall for disease
specificity = tn / (tn + fp)  # Recall for healthy
ppv = tp / (tp + fp)          # Precision for disease
npv = tn / (tn + fn)          # Precision for healthy

print(f"\n=== Medical Metrics ===")
print(f"Sensitivity (True Positive Rate): {sensitivity:.1%}")
print(f"Specificity (True Negative Rate): {specificity:.1%}")
print(f"PPV (Positive Predictive Value): {ppv:.1%}")
print(f"NPV (Negative Predictive Value): {npv:.1%}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")

# Medical interpretation
print("\n=== Interpretation ===")
print(f"Of patients WE predict have disease, {ppv:.0%} actually do (PPV)")
print(f"Of patients WHO HAVE disease, we catch {sensitivity:.0%} (Sensitivity)")
if fn > 0:
    print(f"Missing {fn} out of {tp+fn} disease cases is concerning!")
else:
    print(f"No disease cases missed out of {tp+fn} in this test set.")
```
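Project 2 imports `precision_recall_curve` but never calls it. One plausible use, sketched below, is to choose the decision threshold from a target sensitivity instead of accepting the default 0.5. Everything here is an assumption made for illustration: the one-feature synthetic data, the names `X_demo`, `y_demo`, and `clf`, and the 90% recall target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: one feature, 10% positives
rng = np.random.default_rng(42)
X_neg = rng.normal(0.0, 1.0, size=(900, 1))
X_pos = rng.normal(2.0, 1.0, size=(100, 1))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
clf = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# precision and recall each have one more entry than thresholds
precision, recall, thresholds = precision_recall_curve(y_te, scores)

target_recall = 0.90  # assumed clinical requirement, for illustration only
# recall[:-1] lines up with thresholds; recall falls as the threshold rises,
# so take the highest threshold that still meets the target.
meets_target = recall[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

y_hat = (scores >= chosen_threshold).astype(int)
achieved_recall = (y_hat[y_te == 1] == 1).mean()

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall at that threshold: {achieved_recall:.1%}")
```

Taking the highest threshold that still meets the recall target keeps false positives as low as that target allows. In a real project the threshold would be chosen on a validation split, not on the test set used for final reporting.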

**Project 3: Customer Churn Prediction** - Business ML application

**Objective**: Predict which customers will cancel their subscription.

```python theme={null}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score

# Simulate customer data
np.random.seed(42)
n_customers = 1000

# Features
tenure_months = np.random.exponential(24, n_customers)
monthly_charges = np.random.normal(70, 20, n_customers)
support_tickets = np.random.poisson(2, n_customers)
contract_type = np.random.choice([0, 1, 2], n_customers, p=[0.4, 0.35, 0.25])  # monthly, 1yr, 2yr
num_services = np.random.randint(1, 6, n_customers)

# Churn probability model
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

churn_prob = sigmoid(
    -1
    - 0.03 * tenure_months
    + 0.01 * monthly_charges
    + 0.4 * support_tickets
    - 0.8 * contract_type
    - 0.1 * num_services
)
churned = (np.random.random(n_customers) < churn_prob).astype(int)

print(f"Overall churn rate: {churned.mean():.1%}")

# Create features
X = np.column_stack([tenure_months, monthly_charges, support_tickets, contract_type, num_services])
y = churned

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("\n=== Churn Prediction Model ===")
print(classification_report(y_test, y_pred, target_names=['Stay', 'Churn']))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")

# Feature importance
features = ['tenure', 'monthly_charges', 'support_tickets', 'contract_type', 'num_services']
print("\n=== Churn Risk Factors ===")
for name, coef in zip(features, model.coef_[0]):
    effect = "↑ churn risk" if coef > 0 else "↓ churn risk"
    print(f"{name}: {coef:+.3f} ({effect})")

# Business simulation
print("\n=== Business Impact Simulation ===")
avg_customer_value = 1000  # Annual value
intervention_cost = 50
intervention_success_rate = 0.3

# Identify high-risk customers
high_risk = y_prob > 0.5
n_high_risk = high_risk.sum()

print(f"High-risk customers identified: {n_high_risk}")
print(f"Intervention cost: ${n_high_risk * intervention_cost:,}")

# Expected saves
expected_churners_in_high_risk = (y_test[high_risk] == 1).sum()
expected_saves = expected_churners_in_high_risk * intervention_success_rate
expected_value_saved = expected_saves * avg_customer_value

print(f"Expected saves: {expected_saves:.0f} customers")
print(f"Expected value saved: ${expected_value_saved:,.0f}")
print(f"Net benefit: ${expected_value_saved - n_high_risk * intervention_cost:,.0f}")
```

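The business simulation above fixes the high-risk cutoff at 0.5, but the cutoff itself can be treated as a tunable parameter. The sketch below sweeps several thresholds and reports the net benefit at each. It reuses the same three business numbers as the project, while `demo_prob` and `demo_churned` are hypothetical stand-ins for `y_prob` and `y_test`, generated here so the block runs on its own.

```python
import numpy as np

# Business assumptions, mirroring the simulation above
avg_customer_value = 1000
intervention_cost = 50
intervention_success_rate = 0.3

# Hypothetical stand-ins for y_prob and y_test
rng = np.random.default_rng(42)
demo_prob = rng.random(200)                               # predicted churn probabilities
demo_churned = (rng.random(200) < demo_prob).astype(int)  # outcomes drawn from those probabilities

def net_benefit(threshold):
    """Expected value saved minus intervention cost at a given threshold."""
    targeted = demo_prob > threshold
    churners_targeted = demo_churned[targeted].sum()
    value_saved = churners_targeted * intervention_success_rate * avg_customer_value
    return value_saved - targeted.sum() * intervention_cost

# linspace avoids the floating-point endpoint surprises of np.arange
thresholds = np.linspace(0.1, 0.9, 9)
benefits = np.array([net_benefit(t) for t in thresholds])
best = thresholds[benefits.argmax()]

for t, b in zip(thresholds, benefits):
    print(f"threshold={t:.1f}  net benefit=${b:,.0f}")
print(f"Best threshold in this sweep: {best:.1f}")
```

Under these assumptions, contacting a customer with churn probability p pays off on average when p × 0.3 × 1000 exceeds the \$50 cost, that is, when p is above roughly 0.17. A 0.5 cutoff is therefore likely to be more conservative than the economics require, provided the predicted probabilities are reasonably well calibrated.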
***

## What's Next?

Before moving to more complex algorithms, let's learn K-Nearest Neighbors - an even more intuitive approach to classification!

Classify by finding similar examples - the simplest ML algorithm