> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Support Vector Machines

> Find the best boundary between classes with maximum margin

# Support Vector Machines (SVM)

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/svm-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=c6834ec2cd91326b9c1d31af1e614ef1" alt="SVM Maximum Margin Hyperplane" width="1080" height="1080" data-path="images/courses/ml-mastery/svm-concept.svg" />
</Frame>

## The Widest Street Problem

Imagine you're a city planner. You need to draw a road between two neighborhoods - one residential (blue), one commercial (red).

You have many possible routes:

```
Option A: Barely squeezes through, buildings on both sides
Option B: Wide boulevard with buffer zones on both sides
```

**Which is better?** The wide boulevard! If a building is slightly misplaced, Option A fails. Option B has room for error.

**SVM finds the widest possible street (boundary) between classes.**

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/svm-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=820ce7d103708bf793ce2ea46b73416e" alt="Image Classification with SVM" width="1080" height="1080" data-path="images/courses/ml-mastery/svm-real-world.svg" />
</Frame>

***

## The Maximum Margin Classifier

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Two clearly separated groups
np.random.seed(42)
X_class0 = np.random.randn(20, 2) + np.array([-2, -2])
X_class1 = np.random.randn(20, 2) + np.array([2, 2])
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 20 + [1] * 20)

# Train SVM
svm = SVC(kernel='linear')
svm.fit(X, y)

# Visualize
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=50)

# Plot decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                     np.linspace(ylim[0], ylim[1], 50))
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Decision boundary and margins
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], 
            linestyles=['--', '-', '--'])

# Highlight support vectors
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], 
            s=200, linewidth=1, facecolors='none', edgecolors='k',
            label='Support Vectors')

plt.title('SVM: Maximum Margin Classifier')
plt.legend()
plt.show()

print(f"Number of support vectors: {len(svm.support_vectors_)}")
```

***

## Support Vectors: The Important Points

Not all training points matter equally!

**Support vectors** are the points closest to the decision boundary. They're the only ones that determine where the boundary goes.

```python theme={null}
# The model only remembers these points
print("Support vectors per class:", svm.n_support_)
print("Support vector indices:", svm.support_)
```

**Why this matters:**

* SVM is memory efficient (stores only support vectors)
* Robust to outliers far from the boundary
* Decision is based on "hardest" examples to classify

***

## The Kernel Trick: Handling Non-Linear Data

What if data isn't linearly separable?

```python theme={null}
from sklearn.datasets import make_circles

# Create circular data (can't separate with a line!)
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.3)

# Linear SVM fails
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_circles, y_circles)
print(f"Linear SVM accuracy: {svm_linear.score(X_circles, y_circles):.2%}")

# RBF kernel SVM succeeds!
svm_rbf = SVC(kernel='rbf', gamma='auto')
svm_rbf.fit(X_circles, y_circles)
print(f"RBF SVM accuracy: {svm_rbf.score(X_circles, y_circles):.2%}")
```

### Visualizing Kernels

```python theme={null}
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

kernels = ['linear', 'poly', 'rbf']

for ax, kernel in zip(axes, kernels):
    svm = SVC(kernel=kernel, gamma='auto', degree=3)
    svm.fit(X_circles, y_circles)
    
    # Create mesh for decision boundary
    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100),
                         np.linspace(-1.5, 1.5, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm')
    ax.set_title(f'{kernel.upper()} Kernel (acc: {svm.score(X_circles, y_circles):.0%})')

plt.tight_layout()
plt.show()
```

***

## How The Kernel Trick Works (Intuition)

The kernel trick maps data to a higher dimension where it becomes linearly separable.

Think of it like this: you have a table covered with red and blue coins. The red coins form a ring around the blue ones. No straight ruler can separate them on the flat table. But what if you could lift the table's center upward, forming a bowl? Now the blue coins sit in the dip and the red ones sit on the rim -- and a flat sheet of paper *can* separate them in 3D.

**2D circles to 3D cone:**

* Inner circle: low on the cone
* Outer circle: high on the cone
* Now a flat plane can separate them!

The mathematical beauty of the "trick" part: the kernel function computes distances *as if* the data were lifted into high-dimensional space, without actually performing the expensive transformation. For RBF kernels, this implicit space is technically infinite-dimensional.

```python theme={null}
# What the RBF kernel "sees"
from mpl_toolkits.mplot3d import Axes3D

# Map to 3D using a simple transformation
r = np.sqrt(X_circles[:, 0]**2 + X_circles[:, 1]**2)
X_3d = np.column_stack([X_circles, r])

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y_circles, cmap='coolwarm')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('R (distance from origin)')
ax.set_title('Data Lifted to 3D - Now Linearly Separable!')
plt.show()
```

<Note>
  **Math Connection**: The kernel function computes dot products in high-dimensional space without actually transforming the data. This is computationally efficient! See [Vectors](/courses/math-for-ml-linear-algebra/02-vectors) for dot product intuition.
</Note>

***

## Common Kernels

| Kernel             | Best For                    | Key Parameter     |
| ------------------ | --------------------------- | ----------------- |
| **Linear**         | High-dimensional data, text | -                 |
| **RBF (Gaussian)** | Most non-linear problems    | `gamma` (width)   |
| **Polynomial**     | Feature interactions        | `degree`, `coef0` |
| **Sigmoid**        | Neural network-like         | `gamma`, `coef0`  |

***

## Key Hyperparameters

### C: Regularization Parameter

```python theme={null}
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, C in zip(axes, [0.1, 1, 100]):
    svm = SVC(kernel='rbf', C=C, gamma='auto')
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100),
                         np.linspace(-1.5, 1.5, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm')
    ax.set_title(f'C = {C}')

plt.tight_layout()
plt.show()
```

* **Small C** (e.g., 0.1): Wide margin, allows some misclassification (soft margin) -- more tolerant of noise, less likely to overfit. Think of it as a relaxed bouncer who lets a few wrong people in.
* **Large C** (e.g., 100): Narrow margin, tries to classify all correctly (hard margin) -- obsesses over every training point, more likely to overfit. Think of it as a strict bouncer who won't let anyone slip through, even at the cost of making the entrance uncomfortably narrow.

### Gamma: RBF Kernel Width

```python theme={null}
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, gamma in zip(axes, [0.1, 1, 10]):
    svm = SVC(kernel='rbf', gamma=gamma)
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100),
                         np.linspace(-1.5, 1.5, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm')
    ax.set_title(f'gamma = {gamma}')

plt.tight_layout()
plt.show()
```

* **Small gamma** (e.g., 0.1): Each training point has a wide sphere of influence -- the boundary is smooth and gentle. Like looking at a city from an airplane: you see the overall shape but not individual buildings.
* **Large gamma** (e.g., 10): Each training point only influences its immediate neighbors -- the boundary gets complex and wiggly. Like looking at a city from street level: you see every detail but lose the big picture. High gamma is the most common cause of SVM overfitting.

***

## Real Example: Digit Recognition

```python theme={null}
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load digits
digits = load_digits()
X, y = digits.data, digits.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Grid search for best parameters.
# C and gamma are the two knobs that matter most for RBF-kernel SVM.
# C: tolerance for misclassification (higher = stricter boundary)
# gamma: how far each training point's influence reaches (higher = more local)
# These two interact -- high C + high gamma = very complex model (overfitting risk)
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.01, 0.1]
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# SVM achieves excellent accuracy on digit recognition because
# handwritten digits have clear separation in the feature space
# once you project them with the right kernel.

# Evaluate
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```

***

## SVM for Regression: SVR

```python theme={null}
from sklearn.svm import SVR
from sklearn.datasets import make_regression

# Create regression data
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Fit SVR
svr = SVR(kernel='rbf', C=100, gamma='auto', epsilon=0.1)
svr.fit(X_reg, y_reg)

# Plot
X_test_reg = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
y_pred_reg = svr.predict(X_test_reg)

plt.figure(figsize=(10, 6))
plt.scatter(X_reg, y_reg, alpha=0.5, label='Data')
plt.plot(X_test_reg, y_pred_reg, 'r-', linewidth=2, label='SVR')
plt.fill_between(X_test_reg.ravel(), 
                 y_pred_reg - svr.epsilon, 
                 y_pred_reg + svr.epsilon, 
                 alpha=0.2, color='red', label='Epsilon tube')
plt.legend()
plt.title('Support Vector Regression')
plt.show()
```

***

## When to Use SVM

<CardGroup cols={2}>
  <Card title="Good For" icon="circle-check">
    * High-dimensional data (text, genomics)
    * Clear margin of separation
    * When you need probability estimates (use `probability=True`)
    * Small to medium datasets
  </Card>

  <Card title="Not Great For" icon="circle-xmark">
    * Very large datasets (slow training)
    * Lots of noise and overlapping classes
    * When interpretability is crucial
    * When you need feature importance
  </Card>
</CardGroup>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Maximum Margin" icon="ruler-combined">
    Find the widest possible boundary between classes
  </Card>

  <Card title="Support Vectors" icon="thumbtack">
    Only boundary points matter for the decision
  </Card>

  <Card title="Kernel Trick" icon="magic">
    Handle non-linear data by mapping to higher dimensions
  </Card>

  <Card title="Scale Your Data" icon="arrows-left-right-to-line">
    SVM is sensitive to feature scales!
  </Card>
</CardGroup>

***

## What's Next?

Let's learn about Naive Bayes - a completely different approach based on probability!

<Card title="Continue to Module 5b: Naive Bayes" icon="arrow-right" href="/courses/ml-mastery/05b-naive-bayes">
  Simple probabilistic classification that works surprisingly well
</Card>