Classification
A Different Kind of Prediction
In regression, we predict numbers: “This house costs $450,000.” In classification, we predict categories: “This email is SPAM.”
Real-world classification problems:
- Is this transaction fraudulent? (Yes/No)
- What digit is in this image? (0-9)
- Will this customer buy? (Yes/No)
- What disease does this patient have? (A, B, C, D)
- Is this review positive or negative? (Positive/Negative)
The Email Spam Problem
Let’s build a spam detector from scratch.
The Data
Imagine each email is represented by features:
- Number of exclamation marks
- Contains word “FREE”
- Contains word “WINNER”
- Sender in contacts
- Length of email
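To make this concrete, here is a minimal sketch of one email as a feature vector. The feature names and values are purely illustrative, not from a real dataset:

```python
# Hypothetical feature values for a single email (illustrative only)
email = {
    "num_exclamations": 7,
    "contains_free": 1,        # 1 = the word "FREE" appears
    "contains_winner": 1,      # 1 = the word "WINNER" appears
    "sender_in_contacts": 0,   # 0 = unknown sender
    "email_length": 120,       # characters
}

# Models operate on numeric arrays, so flatten in a fixed feature order
feature_order = ["num_exclamations", "contains_free", "contains_winner",
                 "sender_in_contacts", "email_length"]
x = [email[name] for name in feature_order]
```

Every email in the dataset gets converted the same way, so each one becomes a row of numbers with a known spam/not-spam label.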
Why Not Just Use Linear Regression?
Let’s try:
- Predictions can be > 1 or < 0 (what does 1.12 “spam” mean?)
- We want probabilities (0 to 1), not arbitrary numbers
- We want a clear decision: spam or not spam
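You can see the first problem directly by fitting a straight line to 0/1 labels. This sketch uses a tiny made-up 1-D dataset (exclamation-mark count vs. spam label):

```python
import numpy as np

# Toy data: feature = number of exclamation marks, label = 1 for spam
X = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Ordinary least-squares line y = m*x + b
m, b = np.polyfit(X, y, deg=1)

pred_high = m * 10 + b   # an extreme input pushes the "probability" past 1
pred_low = m * 0 + b     # and a mild one can dip below 0
```

Here `pred_high` comes out well above 1 and `pred_low` below 0, so the raw line simply isn’t a probability.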
The Sigmoid Function: Squashing to Probabilities
We need a function that:
- Takes any number (from -∞ to +∞)
- Outputs a value between 0 and 1
- Acts like a probability
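The sigmoid function, σ(z) = 1 / (1 + e^(−z)), satisfies all three properties. A small sketch (the sign-split is an implementation choice to keep `exp` from overflowing for large negative inputs):

```python
import math

def sigmoid(z):
    """Map any real number into (0, 1): sigma(z) = 1 / (1 + e^(-z))."""
    # Split by sign so exp() never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

At z = 0 the output is exactly 0.5; large positive z approaches 1, large negative z approaches 0.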
Logistic Regression
Combine linear regression with sigmoid:
- Compute a weighted sum (like linear regression)
- Pass through sigmoid to get a probability
- If probability > 0.5, predict “spam”
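The three steps above fit in a few lines. The weights below are invented for illustration, not trained values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam(x, weights, bias, threshold=0.5):
    """Weighted sum -> sigmoid -> probability -> thresholded class label."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical weights and feature vector (illustrative only)
weights = [0.4, 1.2, 1.0, -2.0, 0.001]
bias = -1.0
x = [7, 1, 1, 0, 120]

p, label = predict_spam(x, weights, bias)  # p ~ 0.98 -> label 1 (spam)
```

Note that the weights have readable signs: words like “FREE” push the score up, a sender in your contacts pushes it down.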
Training Logistic Regression
The Loss Function
For classification, we use Binary Cross-Entropy (log loss).
Intuition:
- If actual is 1 and we predict 0.9 → small loss (good!)
- If actual is 1 and we predict 0.1 → large loss (bad!)
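For a single example with true label y and predicted probability p, the loss is L = −[y·log(p) + (1−y)·log(1−p)]. A minimal sketch (the `eps` clipping is a standard guard against log(0)):

```python
import math

def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for one prediction; eps avoids log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
```

Confident correct predictions cost almost nothing; confident wrong ones cost a lot, which is exactly the behavior described above.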
Gradient Descent for Logistic Regression
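A convenient fact: with the sigmoid and cross-entropy combined, the gradient with respect to the linear score is simply (p − y), so each weight’s gradient is (p − y)·xⱼ. Here is a from-scratch sketch of batch gradient descent under that rule (plain Python lists, no libraries):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on binary cross-entropy."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xij for wj, xij in zip(w, xi)) + b)
            err = p - yi                    # dL/dz for sigmoid + cross-entropy
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy 1-D data: spam-ness grows with exclamation marks
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train(X, y, lr=0.5, epochs=2000)
```

After training, the learned weight is positive (more exclamation marks means more spam-like) and the probability crosses 0.5 between the two groups.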
Using scikit-learn
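In practice you rarely write the training loop yourself; scikit-learn’s `LogisticRegression` handles it. A sketch on the same toy data as above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset (feature: count of exclamation marks)
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

probs = model.predict_proba([[5]])   # [[P(not spam), P(spam)]]
label = model.predict([[5]])[0]      # hard 0/1 decision at the 0.5 threshold
```

`predict_proba` gives the probabilities; `predict` applies the threshold for you.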
Real Example: Breast Cancer Detection
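scikit-learn ships the Wisconsin breast cancer dataset (569 tumors, 30 numeric features, benign vs. malignant), which makes a realistic end-to-end sketch. The `random_state` and `max_iter` values here are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Raised max_iter because the unscaled features make the solver converge slowly
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

Even this plain model typically lands above 90% accuracy on the held-out split, but as the next section shows, accuracy alone is not the full story for medical data.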
Understanding the Confusion Matrix
- True Positive (TP): Predicted spam, was spam
- True Negative (TN): Predicted not spam, was not spam
- False Positive (FP): Predicted spam, was not spam (a real email gets buried: costly!)
- False Negative (FN): Predicted not spam, was spam (spam slips through: annoying)
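scikit-learn computes these four counts for you. A sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# 1 = spam, 0 = not spam (illustrative labels and predictions)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class; ravel() flattens
# the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

For these eight emails the detector got 3 true positives and 3 true negatives, with one miss in each direction.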
Key Metrics
When to prioritize which metric?
- High Precision needed: Spam filter (don’t want real emails flagged as spam)
- High Recall needed: Disease detection (don’t want to miss sick patients)
- F1 Score: When you need balance between both
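The three metrics are simple ratios over the confusion-matrix counts. A sketch, using illustrative counts:

```python
def precision(tp, fp):
    """Of everything we flagged positive, how much really was positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all the actual positives, how many did we catch?"""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean: punishes a large gap between precision and recall."""
    return 2 * p * r / (p + r)

# Illustrative counts: TP=3, FP=1, FN=1
p = precision(3, 1)   # 0.75
r = recall(3, 1)      # 0.75
score = f1(p, r)      # 0.75
```

The harmonic mean matters: a model with precision 1.0 but recall 0.1 gets an F1 of only about 0.18, so it can’t hide a terrible recall behind a great precision.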
Multi-Class Classification
What if there are more than 2 classes? Logistic regression extends naturally: either train one binary classifier per class (one-vs-rest) or use the softmax generalization (multinomial logistic regression), which outputs a probability for every class at once.
The Decision Boundary
Logistic regression creates a linear decision boundary: the set of points where the predicted probability is exactly 0.5 forms a straight line (or hyperplane), with one class predicted on each side.
Key Takeaways
Classification = Categories
Predict discrete labels, not numbers
Sigmoid = Probability
Squash outputs to 0-1 range
Threshold = Decision
P > 0.5 means positive class
Metrics Matter
Accuracy isn’t always enough
🚀 Mini Projects
Project 1
Build a spam detector from scratch
Project 2
Medical diagnosis classifier with metrics analysis
Project 3
Customer churn prediction system
What’s Next?
Before moving to more complex algorithms, let’s learn K-Nearest Neighbors - an even more intuitive approach to classification!
Continue to Module 4a: K-Nearest Neighbors
Classify by finding similar examples - the simplest ML algorithm