Support Vector Machines (SVM)
The Widest Street Problem
Imagine you’re a city planner. You need to draw a road between two neighborhoods - one residential (blue), one commercial (red). You have many possible routes.
The Maximum Margin Classifier
SVM picks the route that leaves the widest possible gap - the maximum margin - between the two neighborhoods (classes).
Support Vectors: The Important Points
Not all training points matter equally! Support vectors are the points closest to the decision boundary. They’re the only ones that determine where the boundary goes.
- SVM is memory efficient (stores only support vectors)
- Robust to outliers far from the boundary
- Decision is based on “hardest” examples to classify
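As a minimal sketch of this idea (assuming scikit-learn and made-up toy data), you can fit a linear SVM and inspect which points ended up as support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Linear SVM: the fitted boundary depends only on the support vectors
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Number of support vectors:", clf.support_vectors_.shape[0])  # usually far fewer than 100
print("Indices of support vectors:", clf.support_)
```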
The Kernel Trick: Handling Non-Linear Data
What if data isn’t linearly separable?
Visualizing Kernels
How The Kernel Trick Works (Intuition)
The kernel trick maps data to a higher dimension where it becomes linearly separable. 2D circles → 3D cone:
- Inner circle: low on the cone
- Outer circle: high on the cone
- Now a flat plane can separate them!
Math Connection: The kernel function computes dot products in high-dimensional space without actually transforming the data. This is computationally efficient! See Vectors for dot product intuition.
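Here is a small sketch of the circles example, assuming scikit-learn’s make_circles for the concentric-ring data: a linear SVM cannot separate the rings in 2D, while an RBF kernel handles them without any explicit 3D mapping.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles on this data...
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# ...but an RBF kernel implicitly separates it in a higher-dimensional space
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X, y).score(X, y)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```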
Common Kernels
| Kernel | Best For | Key Parameter |
|---|---|---|
| Linear | High-dimensional data, text | - |
| RBF (Gaussian) | Most non-linear problems | gamma (width) |
| Polynomial | Feature interactions | degree, coef0 |
| Sigmoid | Neural network-like | gamma, coef0 |
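To compare kernels on your own data, one hedged approach (shown here with scikit-learn’s built-in breast cancer dataset as a stand-in) is to cross-validate each kernel inside a scaling pipeline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Try each kernel with default parameters; scaling first because SVM is scale-sensitive
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:8s} mean CV accuracy: {scores.mean():.3f}")
```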
Key Hyperparameters
C: Regularization Parameter
- Small C: Wide margin, allows some misclassification (soft margin)
- Large C: Narrow margin, tries to classify all correctly (hard margin)
Gamma: RBF Kernel Width
- Small gamma: Smooth decision boundary (underfitting risk)
- Large gamma: Complex, wiggly boundary (overfitting risk)
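In practice, C and gamma are usually tuned together. A minimal sketch, assuming a toy make_moons dataset and an illustrative grid of values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Small C / small gamma -> smoother boundaries; large values -> more complex fits
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```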
Real Example: Digit Recognition
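A sketch of digit recognition with an RBF-kernel SVM, using scikit-learn’s small built-in 8x8 digits dataset (the split and parameters here are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 grayscale digit images, flattened to 64 features
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

print("Test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```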
SVM for Regression: SVR
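The same margin idea carries over to regression: SVR fits a function and ignores errors that fall inside a tube of width epsilon around it. A minimal sketch on made-up noisy sine data:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine wave (illustrative toy data)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# epsilon defines a "tube" around the prediction; errors inside it are not penalized
svr = SVR(kernel="rbf", C=10, epsilon=0.1)
svr.fit(X, y)

print("Predictions at x = 1, 2, 3:", svr.predict([[1], [2], [3]]).round(2))
```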
When to Use SVM
Good For
- High-dimensional data (text, genomics)
- Clear margin of separation
- When you need probability estimates (use probability=True; see the sketch after this list)
- Small to medium datasets
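A short sketch of the probability option, on illustrative synthetic data: enabling probability=True makes predict_proba available, at the cost of slower training (the calibration is fitted with internal cross-validation).

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability=True enables predict_proba, but training becomes slower
clf = SVC(kernel="rbf", probability=True, random_state=0)
clf.fit(X, y)

print(clf.predict_proba(X[:3]).round(3))  # class probabilities for the first 3 samples
```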
Not Great For
- Very large datasets (slow training)
- Lots of noise and overlapping classes
- When interpretability is crucial
- When you need feature importance
Key Takeaways
Maximum Margin
Find the widest possible boundary between classes
Support Vectors
Only boundary points matter for the decision
Kernel Trick
Handle non-linear data by mapping to higher dimensions
Scale Your Data
SVM is sensitive to feature scales!
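To see why scaling matters, one illustrative comparison (using scikit-learn’s wine dataset as a stand-in) is to cross-validate the same SVM with and without a StandardScaler in front:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Same model, with and without feature scaling
raw = cross_val_score(SVC(), X, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()

print(f"Without scaling: {raw:.3f}")
print(f"With scaling:    {scaled:.3f}")
```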
What’s Next?
Let’s learn about Naive Bayes - a completely different approach based on probability!
Continue to Module 5b: Naive Bayes
Simple probabilistic classification that works surprisingly well