Support Vector Machines (SVM)
The Widest Street Problem
Imagine you’re a city planner. You need to draw a road between two neighborhoods: one residential (blue), one commercial (red). Many routes are possible; an SVM picks the widest street that still keeps the two neighborhoods apart.
The Maximum Margin Classifier
Support Vectors: The Important Points
Not all training points matter equally! Support vectors are the points closest to the decision boundary. They’re the only ones that determine where the boundary goes.
- SVM is memory efficient (stores only support vectors)
- Robust to outliers far from the boundary
- Decision is based on “hardest” examples to classify
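To make this concrete, here is a minimal sketch using scikit-learn's SVC on a made-up six-point dataset (the points and the large C value are assumptions chosen for illustration). It fits a linear SVM, lists the support vectors, then refits using only those points to check that the boundary barely moves.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data, invented for illustration: two well-separated clusters.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 2.0],    # class 0
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)
print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)

# Keep only the support vectors and refit.
X_sv, y_sv = X[clf.support_], y[clf.support_]
clf_sv = SVC(kernel="linear", C=1000.0).fit(X_sv, y_sv)

# The weight vector and intercept should be essentially unchanged.
same_boundary = (np.allclose(clf.coef_, clf_sv.coef_, atol=1e-2)
                 and np.allclose(clf.intercept_, clf_sv.intercept_, atol=1e-2))
print("same boundary after dropping the other points:", same_boundary)
```

The points that are not support vectors could be moved around (as long as they stay outside the margin) without changing the fitted model.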
The Kernel Trick: Handling Non-Linear Data
What if data isn’t linearly separable?
Visualizing Kernels
How The Kernel Trick Works (Intuition)
The kernel trick maps data to a higher dimension where it can become linearly separable. Think of it like this: you have a table covered with red and blue coins. The red coins form a ring around the blue ones. No straight ruler can separate them on the flat table. But what if you could push the table’s center downward, forming a bowl? Now the blue coins sit in the dip and the red ones sit on the rim, and a flat sheet of paper can separate them in 3D.
2D circles to 3D cone:
- Inner circle: low on the cone
- Outer circle: high on the cone
- Now a flat plane can separate them!
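The coin picture can be reproduced in a few lines. This is a sketch, assuming scikit-learn's make_circles as a stand-in for the coins and a hand-picked lift z = x1² + x2² (squared distance from the center) as the "bowl"; neither is taken from the original lesson.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner circle (label 1) surrounded by an outer ring (label 0).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# On the flat table: a straight line cannot split a ring from its center.
flat_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Lift into 3D: the new height is the squared distance from the center,
# so inner points end up low and outer points end up high.
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])

# In 3D, a flat plane (a linear SVM) can now separate the classes.
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print(f"linear SVM in 2D: {flat_acc:.2f}")
print(f"linear SVM in 3D: {lifted_acc:.2f}")
```

Here the lift is done by hand. A kernel SVM gets a similar effect without ever building the extra column.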
Math Connection: The kernel function computes dot products in high-dimensional space without actually transforming the data. This is computationally efficient! See Vectors for dot product intuition.
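A small numeric check may help here. The sketch below uses the degree-2 polynomial kernel K(a, b) = (a · b)² in two dimensions, whose explicit feature map is known to be phi(v) = (v1², √2·v1·v2, v2²). The function names phi and poly_kernel are made up for this example.

```python
import numpy as np

def phi(v):
    """Explicit map from 2D to 3D for the kernel K(a, b) = (a . b)^2."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Same quantity, computed without leaving the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))   # transform first, then take the dot product
implicit = poly_kernel(a, b)        # dot product first, then square

print(explicit, implicit)           # both are 121 (up to floating-point rounding)
```

Both routes give the same number, but the kernel route never constructs the 3D vectors. For kernels such as RBF the implicit space is infinite-dimensional, so the shortcut is the only practical option.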
Common Kernels
| Kernel | Best For | Key Parameter |
|---|---|---|
| Linear | High-dimensional data, text | - |
| RBF (Gaussian) | Most non-linear problems | gamma (width) |
| Polynomial | Feature interactions | degree, coef0 |
| Sigmoid | Neural network-like | gamma, coef0 |
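One rough way to choose among these is to cross-validate each kernel on your own data. The sketch below assumes scikit-learn's make_moons as a stand-in for a non-linear problem and leaves every kernel at its default parameters, so the ranking it prints is specific to this toy dataset.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

scores = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    # Scale inside the pipeline so each CV fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, score in scores.items():
    print(f"{kernel:>8}: mean CV accuracy = {score:.3f}")
```

On curved data like this, RBF would be expected to beat the linear kernel. On high-dimensional text data the linear kernel is often competitive and much faster.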
Key Hyperparameters
C: Regularization Parameter
- Small C (e.g., 0.1): Wide margin that tolerates some misclassified training points (a very soft margin). More tolerant of noise, less likely to overfit. Think of it as a relaxed bouncer who lets a few wrong people in.
- Large C (e.g., 100): Narrow margin that tries to classify every training point correctly (approaching a hard margin). It obsesses over every training point and is more likely to overfit. Think of it as a strict bouncer who won’t let anyone slip through, even at the cost of making the entrance uncomfortably narrow.
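The effect of C on the margin can be measured directly for a linear SVM, where the margin width is 2 / ||w||. The sketch below assumes two overlapping blobs from scikit-learn's make_blobs; the centers, spread, and C values are picked for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no boundary classifies everything correctly.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

margin_width = {}
n_support = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    margin_width[C] = 2.0 / np.linalg.norm(clf.coef_)
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} margin width={margin_width[C]:.2f} "
          f"support vectors={n_support[C]}")
```

The margin shrinks as C grows. In practice C is usually tuned by cross-validation rather than set by hand.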
Gamma: RBF Kernel Width
- Small gamma (e.g., 0.1): Each training point has a wide sphere of influence — the boundary is smooth and gentle. Like looking at a city from an airplane: you see the overall shape but not individual buildings.
- Large gamma (e.g., 10): Each training point only influences its immediate neighbors, so the boundary gets complex and wiggly. Like looking at a city from street level: you see every detail but lose the big picture. A gamma that is too high for the scale of your features is a common cause of SVM overfitting.
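A quick way to see this is to compare training and test accuracy across gamma values. The sketch below assumes noisy make_moons data; the value 1000 is deliberately extreme to make the overfitting obvious, and what counts as "large" depends on how your features are scaled.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}
for gamma in [0.1, 10.0, 1000.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each gamma.
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"gamma={gamma:<7} train={results[gamma][0]:.2f} "
          f"test={results[gamma][1]:.2f}")
```

A training score that climbs while the test score stalls or drops is the usual sign that gamma is too high.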
Real Example: Digit Recognition
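As one possible version of this example, here is a sketch that assumes scikit-learn's built-in load_digits dataset (8x8 grayscale images of handwritten digits) and an RBF-kernel SVC. The split ratio, random seed, and C value are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()  # 1797 images, each flattened to 64 pixel features

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.25, random_state=42, stratify=digits.target,
)

# Scale first: SVMs are sensitive to feature scales.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
print("predicted:", model.predict(X_test[:10]))
print("actual:   ", y_test[:10])
```

Putting the scaler in the pipeline means the scaling statistics come from the training split only, which avoids leaking test data into the model.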
SVM for Regression: SVR
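Support Vector Regression applies the same ideas to continuous targets: it fits a function and ignores errors smaller than a tolerance epsilon, so only points on or outside that tube become support vectors. The sketch below assumes a noisy sine curve as toy data; the values of C and epsilon are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression data: a sine wave with Gaussian noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# epsilon sets the width of the tube in which errors are not penalized.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

r2 = svr.score(X, y)  # R^2 on the training data
print(f"R^2: {r2:.3f}")
print(f"support vectors used: {len(svr.support_)} of {len(X)}")
print("prediction at pi/2:", svr.predict([[np.pi / 2]])[0])
```

A wider epsilon gives a simpler model with fewer support vectors, at the cost of a looser fit.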
When to Use SVM
Good For
- High-dimensional data (text, genomics)
- Clear margin of separation
- When you need probability estimates (use probability=True)
- Small to medium datasets
Not Great For
- Very large datasets (slow training)
- Lots of noise and overlapping classes
- When interpretability is crucial
- When you need feature importance
Key Takeaways
Maximum Margin
Find the widest possible boundary between classes
Support Vectors
Only boundary points matter for the decision
Kernel Trick
Handle non-linear data by mapping to higher dimensions
Scale Your Data
SVM is sensitive to feature scales!
What’s Next?
Let’s learn about Naive Bayes - a completely different approach based on probability!
Continue to Module 5b: Naive Bayes
Simple probabilistic classification that works surprisingly well