
# Clustering

> Group similar things together - unsupervised learning fundamentals

# Clustering: Unsupervised Learning

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/clustering-concept.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=54cbe15af0b9adf3fdca8937e7b75703" alt="K-Means Clustering Visualization" width="1080" height="1080" data-path="images/courses/ml-mastery/clustering-concept.svg" />
</Frame>

## A Different Kind of Problem

So far, all our problems had **labels**:

* House prices (we knew the correct price)
* Spam/not spam (we knew which emails were spam)
* Customer churn (we knew who churned)

But what if you have data **without labels**?

**Real scenarios:**

* Group customers into segments (but you don't know the segments beforehand)
* Find patterns in gene expression data
* Detect anomalies in network traffic
* Organize documents by topic

This is **unsupervised learning**. The algorithm finds structure on its own.

If supervised learning is like a student taking an exam with an answer key, unsupervised learning is like that same student being handed a box of unlabeled rocks and told to "sort them into groups that make sense." There's no right answer -- just patterns waiting to be discovered.

<Frame>
  <img src="https://mintcdn.com/devweeekends/1cs3K7TO-w20cKuc/images/courses/ml-mastery/clustering-real-world.svg?fit=max&auto=format&n=1cs3K7TO-w20cKuc&q=85&s=18deef79354166537cb360fbfd86998a" alt="Customer Segmentation with Clustering" width="1080" height="1080" data-path="images/courses/ml-mastery/clustering-real-world.svg" />
</Frame>

***

## The Customer Segmentation Problem

Your marketing team wants to send different campaigns to different customer types. But what types exist?

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt

# Customer data: [annual_spending ($k), store_visits_per_month]
np.random.seed(42)

# Generate 3 natural clusters (but we pretend we don't know this!)
# Budget shoppers: low spending, low visits
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]

# Regular customers: moderate spending, moderate visits
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]

# Premium customers: high spending, high visits
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]

# Combine (in real life, you wouldn't know which cluster each point belongs to)
customers = np.vstack([budget, regular, premium])

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], alpha=0.6)
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('Customer Data - Can You See the Groups?')
plt.grid(True)
plt.show()
```

Looking at the scatter plot, you can probably see 3 groups. But how do we find them automatically?

***

## K-Means Clustering

The most popular clustering algorithm. It's like a game of "hot potato" between cluster centers and data points, where each round brings the centers closer to where they "belong."

### The Algorithm (Simple Version)

1. **Pick K random points** as initial cluster centers (like randomly placing K flags on a map)
2. **Assign each point** to the nearest center (each person walks to the closest flag)
3. **Update centers** to the mean of assigned points (move each flag to the center of its crowd)
4. **Repeat** steps 2-3 until nothing changes (the flags stop moving -- we've converged)

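The key insight: this alternating "assign, then update" process is guaranteed to converge, because neither step can increase the total within-cluster sum of squared distances. However, it may converge to a *local* optimum, not the global one. That is why K-Means is usually run several times with different random starting positions, keeping the best result. In scikit-learn this is controlled by `n_init`. The examples here set `n_init=10` explicitly, because the default changed from `10` to `'auto'` in version 1.4 (a single run when the default `k-means++` initialization is used).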

```python theme={null}
def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)
    
    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()
    
    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)
        
        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty
        
        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break
        
        centers = new_centers
    
    return labels, centers

# Run our simple K-Means
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()
```
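
The convergence claim can be checked numerically. The sketch below is a vectorized variant of the loop above that also records the objective (sum of squared distances to the assigned center) after every assignment step. The function name `kmeans_with_history` and the demo data are assumptions made for illustration, not part of the original example.

```python
import numpy as np

def kmeans_with_history(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm that records the objective after every assignment step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(max_iters):
        # Assignment step: squared distance from every point to every center
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        history.append(sq_dists[np.arange(len(X)), labels].sum())
        # Update step: move each center to the mean of its points
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers, history

# Demo data shaped like the customer example (hypothetical values)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal([20, 3], [5, 2], size=(50, 2)),
    rng.normal([50, 8], [8, 3], size=(60, 2)),
    rng.normal([100, 15], [10, 2], size=(40, 2)),
])

histories = []
for seed in range(5):
    _, _, history = kmeans_with_history(X_demo, k=3, seed=seed)
    histories.append(history)
    print(f"seed={seed}: {len(history)} iterations, "
          f"objective {history[0]:.0f} -> {history[-1]:.0f}")
```

Within each run the recorded objective never goes up. Different seeds can still finish at different final values, which is the local-optimum behavior that multiple restarts guard against.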

***

## Using scikit-learn

```python theme={null}
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for clustering!
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
# n_init=10 means: run K-Means 10 times with different random starts,
# keep the best result. This guards against bad initialization.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)
# Inertia = total within-cluster sum of squared distances to center.
# Lower is better, but it always decreases with more clusters
# (K=n_samples gives inertia=0 but is useless).
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

# IMPORTANT: To interpret cluster centers in original units,
# reverse the scaling:
# centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
# This tells you "Cluster 0 = customers who spend ~$20K/year and visit 3x/month"
```
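
The commented-out `inverse_transform` step is worth seeing end to end. This is a minimal sketch on hypothetical spending/visits data (the variable names and generated values are assumptions); it shows that the centers learned in scaled space map back to the original units.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [annual_spending ($k), store_visits_per_month]
rng = np.random.default_rng(42)
spend_visits = np.vstack([
    rng.normal([20, 3], [5, 2], size=(50, 2)),
    rng.normal([50, 8], [8, 3], size=(60, 2)),
    rng.normal([100, 15], [10, 2], size=(40, 2)),
])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(spend_visits)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(scaled)

# Centers live in scaled space; undo the scaling to read them in real units
centers_original = scaler_demo.inverse_transform(km.cluster_centers_)
for idx, (spend, visits) in enumerate(centers_original):
    print(f"Cluster {idx}: ~${spend:.0f}k/year, ~{visits:.1f} visits/month")
```

Cluster numbering is arbitrary, so always read the centers before attaching names such as "budget" or "premium" to a cluster index.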

***

## Choosing K: The Elbow Method

How many clusters should we use?

```python theme={null}
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()
```

***

## Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster vs other clusters. Think of it as asking each data point: "Are you happy in your cluster, or would you rather switch?"

$$
s = \frac{b - a}{\max(a, b)}
$$

Where:

* $a$ = average distance to the other points in the same cluster (cohesion -- "how close am I to my own group?")
* $b$ = average distance to the points in the nearest other cluster (separation -- "how far am I from the next group?")

Range: -1 (bad) to +1 (good)

* **+1**: The point is far from other clusters and close to its own -- perfect clustering
* **0**: The point is on the border between two clusters -- ambiguous assignment
* **-1**: The point is closer to another cluster than its own -- likely misassigned
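
To make the formula concrete, the sketch below computes $a$, $b$, and $s$ by hand for a tiny dataset and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_by_hand` and the six points are made up for illustration. It assumes every cluster has at least two points (scikit-learn assigns a score of 0 to single-point clusters, which this helper does not handle).

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two small, well-separated groups (hypothetical points)
X_tiny = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
tiny_labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette of point i, following s = (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                      # exclude the point itself from a
    a = dists[same].mean()
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X_tiny, tiny_labels, i)
                   for i in range(len(X_tiny))])
reference = silhouette_samples(X_tiny, tiny_labels)

print("By hand:     ", manual.round(3))
print("scikit-learn:", reference.round(3))
```

The overall `silhouette_score` is simply the mean of these per-point values.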

```python theme={null}
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()
```

***

## DBSCAN: Density-Based Clustering

K-Means has problems:

* You must specify K upfront (what if you guess wrong?)
* Assumes spherical, roughly equal-sized clusters (fails on elongated or ring-shaped groups)
* Sensitive to outliers (one extreme point can drag a cluster center far from where it should be)

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) addresses all three:

* Automatically finds the number of clusters based on data density
* Finds clusters of any shape -- rings, crescents, irregular blobs
* Identifies outliers/noise as points that don't belong to any cluster

The trade-off: DBSCAN requires you to choose `eps` (neighborhood radius) and `min_samples` (minimum density), which can be tricky. And unlike K-Means, DBSCAN struggles when clusters have very different densities -- a tight cluster and a sparse cluster might need different `eps` values.

### How DBSCAN Works

1. For each point, count neighbors within radius `eps`
2. If count >= `min_samples`, it's a **core point**
3. Connect core points that are neighbors
4. Non-core points near core points are **border points**
5. Everything else is **noise**
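
The steps above can be traced with scikit-learn's own attributes. In this sketch, the neighbor counting of steps 1 and 2 is done by hand with `NearestNeighbors` and compared against `core_sample_indices_`; border and noise points are then read off from `labels_`. The `eps` and `min_samples` values are the same illustrative choices used elsewhere on this page.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

eps, min_samples = 0.2, 5
X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)

# Steps 1-2 by hand: count neighbors within eps (the point itself is included)
neighborhoods = NearestNeighbors(radius=eps).fit(X_demo).radius_neighbors(
    X_demo, return_distance=False
)
neighbor_counts = np.array([len(idx) for idx in neighborhoods])
core_by_hand = neighbor_counts >= min_samples

# What scikit-learn reports
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1           # noise points get the label -1
is_border = ~is_core & ~is_noise      # in a cluster, but not dense enough to be core

print(f"core={is_core.sum()}, border={is_border.sum()}, noise={is_noise.sum()}")
print("Hand-counted core points match:", np.array_equal(core_by_hand, is_core))
```

Note that in scikit-learn the `min_samples` count includes the point itself, so `min_samples=5` means "the point plus at least four neighbors within `eps`".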

```python theme={null}
from sklearn.cluster import DBSCAN

# Create more interesting data with different shaped clusters
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# K-Means fails on non-spherical clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN handles them well
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (Fails on Moons)')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (Works!)')

plt.tight_layout()
plt.show()
```

***

## Choosing DBSCAN Parameters

```python theme={null}
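# eps: radius of the neighborhood around each point
# min_samples: number of points (counting the point itself) required
#              within eps for a point to be a core point

# Heuristic: min_samples = 2 * n_features
# Heuristic: eps from k-distance graph

from sklearn.neighbors import NearestNeighbors

# Estimate eps from the k-distance graph
k = 5  # same as min_samples: each point is returned as its own first neighbor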
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_moons)
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph - Look for the "Elbow"')
plt.grid(True)
plt.show()
```

***

## Hierarchical Clustering

Builds a tree (dendrogram) of clusters:

```python theme={null}
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree at a certain height to get clusters
from scipy.cluster.hierarchy import fcluster

# Get 3 clusters
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)
```
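
The block above imports `AgglomerativeClustering` without using it. As a sketch of how the two APIs relate, the example below clusters the same hypothetical, well-separated blobs with SciPy's `linkage` + `fcluster` and with scikit-learn's `AgglomerativeClustering`, then checks that the two partitions agree. The blob data is an assumption chosen so that the grouping is unambiguous.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Three clearly separated blobs of 10 points each (hypothetical data)
rng = np.random.default_rng(42)
blob_centers = np.array([[0, 0], [10, 10], [0, 10]])
true_groups = np.repeat(np.arange(3), 10)
X_blobs = rng.normal(size=(30, 2)) + blob_centers[true_groups]

# SciPy: build the full tree, then cut it into 3 clusters (labels start at 1)
scipy_labels = fcluster(linkage(X_blobs, method='ward'), t=3, criterion='maxclust')

# scikit-learn: same Ward linkage, cluster count given up front (labels start at 0)
sklearn_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_blobs)

# The label numbers differ, so compare partitions with a permutation-invariant score
agreement = adjusted_rand_score(scipy_labels, sklearn_labels)
print(f"Adjusted Rand index between the two results: {agreement:.2f}")
```

SciPy is convenient when you want the dendrogram and the freedom to cut it at several heights; the scikit-learn estimator fits more naturally into pipelines.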

***

## Practical Example: Customer Segmentation

```python theme={null}
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create realistic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})

print("Customer data sample:")
print(data.head())
print("\nStatistics:")
print(data.describe())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# k is fixed at 4 here for illustration; in practice, choose it with the
# silhouette sweep shown earlier
best_k = 4
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Analyze clusters
print("\nCluster Profiles:")
cluster_profiles = data.groupby('cluster').mean()
print(cluster_profiles.round(2))

# Name the segments (illustrative names; in practice, assign them after
# inspecting cluster_profiles, since cluster numbering is arbitrary)
segment_names = {
    0: 'Occasional Shoppers',
    1: 'High-Value Regulars',
    2: 'Bargain Hunters',
    3: 'VIP Customers'
}
```

***

## Comparison: When to Use What

| Algorithm        | Best For                    | Pros                                    | Cons                              |
| ---------------- | --------------------------- | --------------------------------------- | --------------------------------- |
| K-Means          | Spherical clusters, known K | Fast, simple                            | Must specify K, assumes spherical |
| DBSCAN           | Arbitrary shapes, noise     | Finds K automatically, handles outliers | Sensitive to parameters           |
| Hierarchical     | Small data, need hierarchy  | Visual dendrogram, no K needed          | Slow for large data               |
| Gaussian Mixture | Soft clustering, elliptical | Probability outputs                     | Can overfit                       |
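
The "probability outputs" entry for Gaussian Mixture models is the one row of the table not demonstrated elsewhere on this page. A minimal sketch, using two hypothetical overlapping elliptical groups, shows what soft clustering looks like with scikit-learn's `GaussianMixture`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping, elongated groups (hypothetical data)
rng = np.random.default_rng(0)
X_gmm = np.vstack([
    rng.normal([0, 0], [1.0, 0.3], size=(100, 2)),
    rng.normal([3, 0], [1.0, 0.3], size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_gmm)

# Soft assignment: one probability per component for every point
probs = gmm.predict_proba(X_gmm)
# Hard assignment: the most probable component
hard_labels = gmm.predict(X_gmm)

uncertain = (probs.max(axis=1) < 0.9).sum()
print("First 3 rows of membership probabilities:\n", probs[:3].round(3))
print(f"Points with no component above 90% probability: {uncertain}")
```

Points in the overlap region get split probabilities instead of being forced into one cluster, which is the information K-Means throws away.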

***

## Connection to Supervised Learning

Clustering can help supervised learning:

```python theme={null}
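# Use clusters as features
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (replace with your own X and y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create cluster features (fit on the training set only)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)

# At prediction time, assign clusters with the already-fitted KMeans
X_test_enhanced = np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)])
print("Test accuracy:", clf.score(X_test_enhanced, y_test))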
```

<Note>
  **Math Connection**: Clustering uses distance metrics extensively. Understanding [vector similarity](/courses/math-for-ml-linear-algebra/02-vectors) helps you choose the right metric.
</Note>
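
As a small illustration of why the metric matters, the sketch below compares Euclidean and cosine distance on three hand-picked vectors (the vectors are assumptions chosen to make the contrast obvious). It uses scikit-learn's `pairwise_distances`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

vectors = np.array([
    [1.0, 0.0],    # a
    [10.0, 0.0],   # b: same direction as a, much larger magnitude
    [0.0, 1.0],    # c: same magnitude as a, different direction
])

euclidean = pairwise_distances(vectors, metric='euclidean')
cosine = pairwise_distances(vectors, metric='cosine')

print("Euclidean a-b:", euclidean[0, 1], " a-c:", round(euclidean[0, 2], 3))
print("Cosine    a-b:", round(cosine[0, 1], 3), " a-c:", round(cosine[0, 2], 3))
```

Under Euclidean distance, `a` is closer to `c` than to `b`; under cosine distance the ranking flips, because `a` and `b` point in the same direction. Which behavior is "right" depends on whether magnitude carries meaning in your data.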

***

## 🚀 Mini Projects

<CardGroup cols={2}>
  <Card title="Project 1: Customer Segmentation" icon="users">
    Segment e-commerce customers for targeted marketing
  </Card>

  <Card title="Project 2: Image Color Quantization" icon="palette">
    Compress images using K-Means clustering
  </Card>

  <Card title="Project 3: Anomaly Detection System" icon="triangle-exclamation">
    Detect outliers using DBSCAN
  </Card>

  <Card title="Project 4: Document Clustering" icon="folder-tree">
    Organize documents by topic automatically
  </Card>
</CardGroup>

### Project 1: Customer Segmentation

Segment customers based on purchasing behavior for targeted marketing campaigns.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.cluster import KMeans
  from sklearn.preprocessing import StandardScaler
  from sklearn.metrics import silhouette_score

  # Step 1: Generate synthetic customer data (RFM Analysis)
  np.random.seed(42)
  n_customers = 1000

  data = {
      'customer_id': range(1, n_customers + 1),
      'recency': np.random.exponential(30, n_customers).clip(1, 365),
      'frequency': np.random.poisson(5, n_customers).clip(1, 50),
      'monetary': np.random.lognormal(5, 1, n_customers).clip(10, 10000),
      'avg_basket_size': np.random.lognormal(3.5, 0.5, n_customers),
      'tenure_days': np.random.randint(30, 1000, n_customers),
  }
  df = pd.DataFrame(data)

  print("="*60)
  print("👥 CUSTOMER SEGMENTATION")
  print("="*60)
  print(f"Total customers: {len(df)}")

  # Step 2: Feature Engineering for RFM
  df['frequency_rate'] = df['frequency'] / (df['tenure_days'] / 30)
  df['avg_order_value'] = df['monetary'] / df['frequency']
  df['customer_value'] = df['monetary'] / (df['recency'] + 1)

  # Select features for clustering
  features = ['recency', 'frequency', 'monetary', 'avg_basket_size']
  X = df[features]

  # Scale features
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  print("\nFeatures used for clustering:")
  for f in features:
      print(f"  - {f}")

  # Step 3: Find optimal number of clusters
  print("\n1️⃣ FINDING OPTIMAL CLUSTERS")
  print("-"*40)

  inertias = []
  silhouettes = []
  K_range = range(2, 11)

  for k in K_range:
      kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
      kmeans.fit(X_scaled)
      inertias.append(kmeans.inertia_)
      silhouettes.append(silhouette_score(X_scaled, kmeans.labels_))
      print(f"K={k}: Inertia={kmeans.inertia_:.0f}, Silhouette={silhouettes[-1]:.3f}")

  optimal_k = K_range[np.argmax(silhouettes)]
  print(f"\nOptimal K (by silhouette): {optimal_k}")

  # Step 4: Final clustering
  print("\n2️⃣ CLUSTERING CUSTOMERS")
  print("-"*40)

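  # K is fixed at 4 so the naming and plotting code below (which assumes
  # four segments) stays simple; compare with optimal_k printed above
  kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)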
  df['segment'] = kmeans.fit_predict(X_scaled)

  # Step 5: Analyze segments
  print("\n3️⃣ SEGMENT ANALYSIS")
  print("-"*40)

  segment_summary = df.groupby('segment')[features + ['customer_id']].agg({
      'customer_id': 'count',
      'recency': 'mean',
      'frequency': 'mean',
      'monetary': 'mean',
      'avg_basket_size': 'mean'
  }).round(2)

  segment_summary.columns = ['count', 'avg_recency', 'avg_frequency', 'avg_monetary', 'avg_basket']
  print(segment_summary)

  # Step 6: Name segments based on characteristics
  segment_names = {}
  for seg in range(4):
      seg_data = segment_summary.loc[seg]
      
      # High value: high monetary, high frequency
      if seg_data['avg_monetary'] > segment_summary['avg_monetary'].median() and \
         seg_data['avg_frequency'] > segment_summary['avg_frequency'].median():
          segment_names[seg] = "💎 Champions"
      # At risk: high recency (inactive), high monetary
      elif seg_data['avg_recency'] > segment_summary['avg_recency'].median() and \
           seg_data['avg_monetary'] > segment_summary['avg_monetary'].median():
          segment_names[seg] = "⚠️ At Risk"
      # New customers: low frequency, low recency
      elif seg_data['avg_frequency'] < segment_summary['avg_frequency'].median() and \
           seg_data['avg_recency'] < segment_summary['avg_recency'].median():
          segment_names[seg] = "🌱 New Customers"
      else:
          segment_names[seg] = "📊 Regular"

  df['segment_name'] = df['segment'].map(segment_names)

  print("\n4️⃣ SEGMENT NAMES")
  print("-"*40)
  for seg, name in segment_names.items():
      count = len(df[df['segment'] == seg])
      pct = count / len(df) * 100
      print(f"Segment {seg}: {name} ({count} customers, {pct:.1f}%)")

  # Step 7: Marketing recommendations
  print("\n5️⃣ MARKETING RECOMMENDATIONS")
  print("-"*40)

  recommendations = {
      "💎 Champions": "VIP treatment, exclusive offers, loyalty rewards",
      "⚠️ At Risk": "Win-back campaigns, special discounts, personal outreach",
      "🌱 New Customers": "Onboarding emails, first-purchase discounts, welcome series",
      "📊 Regular": "Cross-sell/up-sell, engagement programs, referral incentives"
  }

  for name, rec in recommendations.items():
      if name in segment_names.values():
          print(f"\n{name}:")
          print(f"  → {rec}")

  # Step 8: Visualization
  fig, axes = plt.subplots(2, 2, figsize=(12, 10))

  # Segment sizes
  ax1 = axes[0, 0]
  segment_counts = df['segment_name'].value_counts()
  ax1.pie(segment_counts, labels=segment_counts.index, autopct='%1.1f%%')
  ax1.set_title('Customer Segment Distribution')

  # RFM plot
  ax2 = axes[0, 1]
  colors = ['blue', 'red', 'green', 'orange']
  for seg in range(4):
      mask = df['segment'] == seg
      ax2.scatter(df[mask]['recency'], df[mask]['monetary'], 
                 c=colors[seg], label=segment_names[seg], alpha=0.5)
  ax2.set_xlabel('Recency (days)')
  ax2.set_ylabel('Monetary ($)')
  ax2.set_title('Recency vs Monetary by Segment')
  ax2.legend()

  # Frequency vs Monetary
  ax3 = axes[1, 0]
  for seg in range(4):
      mask = df['segment'] == seg
      ax3.scatter(df[mask]['frequency'], df[mask]['monetary'], 
                 c=colors[seg], label=segment_names[seg], alpha=0.5)
  ax3.set_xlabel('Frequency')
  ax3.set_ylabel('Monetary ($)')
  ax3.set_title('Frequency vs Monetary by Segment')

  # Segment averages
  ax4 = axes[1, 1]
  segment_summary[['avg_recency', 'avg_frequency', 'avg_monetary']].plot(
      kind='bar', ax=ax4
  )
  ax4.set_xlabel('Segment')
  ax4.set_title('Segment Characteristics')
  ax4.legend(loc='upper right')

  plt.tight_layout()
  plt.savefig('customer_segments.png', dpi=150)
  print("\n✅ Visualization saved!")
  ```

  **What you learned:**

  * RFM analysis for customer segmentation
  * Interpreting clusters in business context
  * Translating segments into actionable marketing strategies
</details>

### Project 2: Image Color Quantization

Use K-Means to reduce the number of colors in an image (compression).

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.cluster import KMeans

  # Step 1: Create a synthetic "image" (or use real image with PIL)
  np.random.seed(42)

  # Create a simple gradient image with some colors
  height, width = 200, 300

  # Create an image with distinct color regions
  image = np.zeros((height, width, 3), dtype=np.uint8)

  # Add color regions
  image[:100, :150] = [255, 100, 100]  # Red region
  image[:100, 150:] = [100, 255, 100]  # Green region
  image[100:, :150] = [100, 100, 255]  # Blue region
  image[100:, 150:] = [255, 255, 100]  # Yellow region

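  # Add random noise. Do the arithmetic in a wider integer type first:
  # writing the sum straight back into a uint8 array wraps around
  # (e.g. 255 + 20 -> 19) before the clip ever runs.
  noise = np.random.randint(-30, 30, size=image.shape)
  image = np.clip(image.astype(np.int16) + noise, 0, 255).astype(np.uint8)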

  print("="*60)
  print("🎨 IMAGE COLOR QUANTIZATION")
  print("="*60)
  print(f"Image shape: {image.shape}")
  print(f"Original unique colors: {len(np.unique(image.reshape(-1, 3), axis=0))}")

  # Step 2: Reshape image for clustering
  # From (height, width, 3) to (height*width, 3)
  pixels = image.reshape(-1, 3).astype(float)
  print(f"Pixels to cluster: {len(pixels)}")

  # Step 3: Apply K-Means for different numbers of colors
  n_colors_list = [2, 4, 8, 16]
  results = {}

  print("\nQuantizing colors...")
  for n_colors in n_colors_list:
      print(f"\nK = {n_colors} colors:")
      
      kmeans = KMeans(n_clusters=n_colors, random_state=42, n_init=10)
      kmeans.fit(pixels)
      
      # Replace each pixel with its cluster center
      new_colors = kmeans.cluster_centers_[kmeans.labels_]
      quantized = new_colors.reshape(image.shape).astype(np.uint8)
      
      # Calculate compression ratio (simplified)
      original_bits = height * width * 3 * 8  # 8 bits per channel
      quantized_bits = height * width * np.log2(n_colors) + n_colors * 3 * 8
      compression = original_bits / quantized_bits
      
      results[n_colors] = {
          'image': quantized,
          'centers': kmeans.cluster_centers_.astype(np.uint8),
          'compression': compression,
          'inertia': kmeans.inertia_
      }
      
      print(f"  Compression ratio: {compression:.1f}x")
      print(f"  Distortion (inertia): {kmeans.inertia_:.0f}")

  # Step 4: Visualize results
  fig, axes = plt.subplots(2, 3, figsize=(15, 10))

  # Original
  axes[0, 0].imshow(image)
  axes[0, 0].set_title('Original Image')
  axes[0, 0].axis('off')

  # Quantized versions
  for i, n_colors in enumerate(n_colors_list[:4]):
      row = (i + 1) // 3
      col = (i + 1) % 3
      axes[row, col].imshow(results[n_colors]['image'])
      axes[row, col].set_title(f'{n_colors} Colors ({results[n_colors]["compression"]:.1f}x compression)')
      axes[row, col].axis('off')

  # Color palette for one example
  ax_palette = axes[1, 2]
  palette = results[8]['centers']
  for i, color in enumerate(palette):
      ax_palette.add_patch(plt.Rectangle((i, 0), 1, 1, color=color/255))
  ax_palette.set_xlim(0, len(palette))
  ax_palette.set_ylim(0, 1)
  ax_palette.set_title('8-Color Palette')
  ax_palette.axis('off')

  plt.tight_layout()
  plt.savefig('color_quantization.png', dpi=150)

  # Step 5: Quality vs Compression tradeoff
  print("\n📊 QUALITY VS COMPRESSION TRADEOFF")
  print("-"*40)
  for n_colors, res in results.items():
      print(f"K={n_colors:2d}: Compression={res['compression']:5.1f}x, Distortion={res['inertia']:,.0f}")

  # Step 6: Find optimal K using elbow method
  print("\n🔍 Finding Optimal K...")
  inertias = []
  for k in range(2, 33):
      kmeans = KMeans(n_clusters=k, random_state=42, n_init=5)
      kmeans.fit(pixels)
      inertias.append(kmeans.inertia_)

  # Plot elbow
  fig, ax = plt.subplots(figsize=(10, 5))
  ax.plot(range(2, 33), inertias, 'b-', marker='o')
  ax.set_xlabel('Number of Colors (K)')
  ax.set_ylabel('Distortion (Inertia)')
  ax.set_title('Elbow Method for Optimal K')
  ax.axvline(x=8, color='r', linestyle='--', label='Suggested K=8')
  ax.legend()
  plt.savefig('elbow_colors.png', dpi=150)

  print("\n✅ Color quantization complete!")
  print("💡 K=8 often provides good balance between quality and compression")
  ```

  **What you learned:**

  * K-Means works great for color quantization
  * Practical trade-off between quality and compression
  * Cluster centers become the color palette
</details>

### Project 3: Anomaly Detection System

Use DBSCAN to detect anomalies in network traffic or transaction data.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  from sklearn.cluster import DBSCAN
  from sklearn.preprocessing import StandardScaler
  from sklearn.neighbors import NearestNeighbors

  # Step 1: Generate synthetic network traffic data
  np.random.seed(42)
  n_normal = 1000
  n_anomalies = 50

  # Normal traffic patterns
  normal_data = {
      'bytes_sent': np.random.lognormal(8, 0.5, n_normal),
      'bytes_received': np.random.lognormal(9, 0.5, n_normal),
      'duration': np.random.exponential(30, n_normal),
      'packets': np.random.poisson(100, n_normal),
      'connections': np.random.poisson(5, n_normal),
  }

  # Anomaly patterns (suspicious behavior)
  anomaly_data = {
      'bytes_sent': np.random.lognormal(12, 0.5, n_anomalies),  # Data exfiltration
      'bytes_received': np.random.lognormal(6, 0.5, n_anomalies),  # Low receive
      'duration': np.random.exponential(5, n_anomalies),  # Quick bursts
      'packets': np.random.poisson(1000, n_anomalies),  # High packet count
      'connections': np.random.poisson(50, n_anomalies),  # Many connections
  }

  # Combine
  df_normal = pd.DataFrame(normal_data)
  df_normal['is_anomaly'] = 0

  df_anomaly = pd.DataFrame(anomaly_data)
  df_anomaly['is_anomaly'] = 1

  df = pd.concat([df_normal, df_anomaly], ignore_index=True)
  df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle

  print("="*60)
  print("🔍 ANOMALY DETECTION SYSTEM")
  print("="*60)
  print(f"Total records: {len(df)}")
  print(f"Actual anomalies: {df['is_anomaly'].sum()} ({df['is_anomaly'].mean()*100:.1f}%)")

  # Step 2: Feature preparation
  features = ['bytes_sent', 'bytes_received', 'duration', 'packets', 'connections']
  X = df[features].values

  # Add derived features
  df['bytes_ratio'] = df['bytes_sent'] / (df['bytes_received'] + 1)
  df['packets_per_second'] = df['packets'] / (df['duration'] + 1)
  df['bytes_per_connection'] = (df['bytes_sent'] + df['bytes_received']) / (df['connections'] + 1)

  features_extended = features + ['bytes_ratio', 'packets_per_second', 'bytes_per_connection']
  X = df[features_extended].values

  # Scale
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  # Step 3: Find optimal eps using k-distance graph
  print("\n1️⃣ FINDING OPTIMAL EPSILON")
  print("-"*40)

  k = 5  # min_samples
  nbrs = NearestNeighbors(n_neighbors=k+1).fit(X_scaled)
  distances, _ = nbrs.kneighbors(X_scaled)
  k_distances = np.sort(distances[:, k])

  # Plot k-distance graph
  plt.figure(figsize=(10, 5))
  plt.plot(k_distances)
  plt.xlabel('Points sorted by distance')
  plt.ylabel(f'{k}-distance')
  plt.title('K-Distance Graph for Epsilon Selection')
  plt.axhline(y=2.5, color='r', linestyle='--', label='Suggested eps=2.5')
  plt.legend()
  plt.savefig('k_distance.png', dpi=150)

  # Step 4: Apply DBSCAN
  print("\n2️⃣ APPLYING DBSCAN")
  print("-"*40)

  eps_values = [1.5, 2.0, 2.5, 3.0]
  results = []

  for eps in eps_values:
      dbscan = DBSCAN(eps=eps, min_samples=5)
      labels = dbscan.fit_predict(X_scaled)
      
      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      n_noise = (labels == -1).sum()
      
      # Evaluate against known anomalies
      detected_anomalies = labels == -1
      true_anomalies = df['is_anomaly'] == 1
      
      true_positives = (detected_anomalies & true_anomalies).sum()
      false_positives = (detected_anomalies & ~true_anomalies).sum()
      false_negatives = (~detected_anomalies & true_anomalies).sum()
      
      precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
      recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
      f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
      
      results.append({
          'eps': eps,
          'n_clusters': n_clusters,
          'n_noise': n_noise,
          'precision': precision,
          'recall': recall,
          'f1': f1
      })
      
      print(f"eps={eps}: Clusters={n_clusters}, Noise={n_noise}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")

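  # Step 5: Use best eps
  # (Selecting eps by F1 is only possible here because the synthetic data has
  # ground-truth labels; on real traffic, rely on the k-distance graph and review)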
  best_result = max(results, key=lambda x: x['f1'])
  print(f"\nBest eps: {best_result['eps']} (F1={best_result['f1']:.2f})")

  dbscan = DBSCAN(eps=best_result['eps'], min_samples=5)
  df['dbscan_label'] = dbscan.fit_predict(X_scaled)
  df['detected_anomaly'] = (df['dbscan_label'] == -1).astype(int)

  # Step 6: Analyze results
  print("\n3️⃣ DETECTION RESULTS")
  print("-"*40)

  confusion = pd.crosstab(df['is_anomaly'], df['detected_anomaly'], 
                          rownames=['Actual'], colnames=['Detected'])
  print("Confusion Matrix:")
  print(confusion)

  # Step 7: Examine detected anomalies
  print("\n4️⃣ ANOMALY CHARACTERISTICS")
  print("-"*40)

  detected = df[df['detected_anomaly'] == 1]
  normal = df[df['detected_anomaly'] == 0]

  print("\nNormal Traffic (average):")
  for f in features[:5]:
      print(f"  {f}: {normal[f].mean():.2f}")

  print("\nDetected Anomalies (average):")
  for f in features[:5]:
      print(f"  {f}: {detected[f].mean():.2f}")

  # Step 8: Visualization
  fig, axes = plt.subplots(1, 3, figsize=(15, 4))

  # Bytes sent vs received
  ax1 = axes[0]
  normal_mask = df['detected_anomaly'] == 0
  ax1.scatter(df[normal_mask]['bytes_sent'], df[normal_mask]['bytes_received'], 
             c='blue', alpha=0.5, label='Normal')
  ax1.scatter(df[~normal_mask]['bytes_sent'], df[~normal_mask]['bytes_received'], 
             c='red', alpha=0.7, label='Anomaly')
  ax1.set_xlabel('Bytes Sent')
  ax1.set_ylabel('Bytes Received')
  ax1.set_title('Traffic Pattern')
  ax1.legend()
  ax1.set_xscale('log')
  ax1.set_yscale('log')

  # Packets vs connections
  ax2 = axes[1]
  ax2.scatter(df[normal_mask]['packets'], df[normal_mask]['connections'], 
             c='blue', alpha=0.5, label='Normal')
  ax2.scatter(df[~normal_mask]['packets'], df[~normal_mask]['connections'], 
             c='red', alpha=0.7, label='Anomaly')
  ax2.set_xlabel('Packets')
  ax2.set_ylabel('Connections')
  ax2.set_title('Network Behavior')
  ax2.legend()

  # Detection metrics (precision, recall, F1) for the best eps
  ax3 = axes[2]
  metrics = ['Precision', 'Recall', 'F1']
  values = [best_result['precision'], best_result['recall'], best_result['f1']]
  ax3.bar(metrics, values, color=['steelblue', 'coral', 'green'])
  ax3.set_ylabel('Score')
  ax3.set_title('Detection Performance')
  ax3.set_ylim(0, 1)
  for i, v in enumerate(values):
      ax3.text(i, v + 0.02, f'{v:.2f}', ha='center')

  plt.tight_layout()
  plt.savefig('anomaly_detection.png', dpi=150)

  print("\n✅ Anomaly detection complete!")
  ```

  **What you learned:**

  * DBSCAN labels outliers as noise (label -1)
  * The `eps` parameter is crucial: use a k-distance graph to pick a starting value, then tune it (here, by F1 against the known anomalies)
  * Real-world anomaly detection requires domain knowledge for threshold tuning
</details>
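
The k-distance graph mentioned in the takeaways can be sketched roughly as follows. This is a minimal illustration on synthetic stand-in data (`X_demo` is an assumption, not the project's traffic dataset): for each point, measure the distance to its farthest neighbor among the `min_samples` closest (counting the point itself), sort those distances, and look for the bend where they start rising sharply.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): one dense blob plus a few isolated points
rng = np.random.default_rng(42)
dense = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
isolated = rng.uniform(low=8.0, high=12.0, size=(10, 2))
X_demo = StandardScaler().fit_transform(np.vstack([dense, isolated]))

min_samples = 5  # same value passed to DBSCAN in the project

# Because we query the same points we fitted on, column 0 is each point's
# distance to itself (0). The last column is therefore the distance to the
# farthest of the min_samples nearest points, counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Points in the dense region form the flat part of the sorted curve;
# isolated points produce the steep rise at the end.
print(f"median k-distance:  {np.median(k_distances):.3f}")
print(f"largest k-distance: {k_distances[-1]:.3f}")
```

Plotting `k_distances` (for example with `plt.plot(k_distances)`) gives the graph itself. A value of `eps` near the bend is a reasonable starting point, which you can then refine with a sweep like the one in Project 3.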

### Project 4: Document Clustering

Automatically organize documents by topic using clustering.

<details>
  <summary>View Complete Solution</summary>

  ```python theme={null}
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score
  from sklearn.decomposition import PCA

  # Step 1: Create sample documents
  documents = [
      # Technology/AI
      "Machine learning algorithms are transforming data analysis",
      "Neural networks can recognize patterns in images",
      "Artificial intelligence is revolutionizing healthcare",
      "Deep learning models require large datasets",
      "Natural language processing enables chatbots",
      
      # Sports
      "The basketball team won the championship game",
      "Football players train hard for the season",
      "Tennis requires excellent hand-eye coordination",
      "Soccer is the most popular sport worldwide",
      "Athletes follow strict diet and exercise routines",
      
      # Finance
      "Stock market shows signs of recovery",
      "Investment strategies for long-term growth",
      "Cryptocurrency trading has increased volatility",
      "Interest rates affect mortgage payments",
      "Portfolio diversification reduces risk",
      
      # Health
      "Regular exercise improves cardiovascular health",
      "Balanced diet is essential for wellness",
      "Mental health awareness is growing",
      "Vaccines protect against infectious diseases",
      "Sleep quality affects overall wellbeing",
      
      # Travel
      "Beaches in Thailand attract many tourists",
      "European cities offer rich cultural experiences",
      "Adventure travel is gaining popularity",
      "Airlines offering more domestic routes",
      "Sustainable tourism practices are important",
  ]

  # True labels for evaluation
  true_labels = [0]*5 + [1]*5 + [2]*5 + [3]*5 + [4]*5
  topic_names = ['Technology', 'Sports', 'Finance', 'Health', 'Travel']

  print("="*60)
  print("📄 DOCUMENT CLUSTERING")
  print("="*60)
  print(f"Total documents: {len(documents)}")
  print(f"Topics: {topic_names}")

  # Step 2: Convert text to numerical features using TF-IDF
  print("\n1️⃣ TEXT VECTORIZATION")
  print("-"*40)

  vectorizer = TfidfVectorizer(
      max_features=100,
      stop_words='english',
      ngram_range=(1, 2)
  )
  X = vectorizer.fit_transform(documents)

  print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
  print(f"Document-term matrix shape: {X.shape}")

  # Show the first few vocabulary terms (not ranked by importance)
  feature_names = vectorizer.get_feature_names_out()
  print(f"\nSample features: {list(feature_names[:10])}")

  # Step 3: Find optimal number of clusters
  print("\n2️⃣ FINDING OPTIMAL K")
  print("-"*40)

  silhouette_scores = []
  for k in range(2, 10):
      kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
      labels = kmeans.fit_predict(X)
      score = silhouette_score(X, labels)
      silhouette_scores.append(score)
      print(f"K={k}: Silhouette Score = {score:.3f}")

  optimal_k = range(2, 10)[np.argmax(silhouette_scores)]
  print(f"\nOptimal K: {optimal_k}")

  # Step 4: Cluster documents
  print("\n3️⃣ CLUSTERING DOCUMENTS")
  print("-"*40)

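  # The silhouette search above is a diagnostic; on a corpus this small it may
  # not select 5. We use k=5 here because the sample was built with five topics.
  kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)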
  predicted_labels = kmeans.fit_predict(X)

  # Step 5: Analyze clusters
  print("\n4️⃣ CLUSTER ANALYSIS")
  print("-"*40)

  # Get top terms for each cluster
  def get_top_terms(cluster_idx, n_terms=5):
      """Get top terms for a cluster"""
      cluster_center = kmeans.cluster_centers_[cluster_idx]
      top_indices = cluster_center.argsort()[-n_terms:][::-1]
      return [feature_names[i] for i in top_indices]

  for i in range(5):
      cluster_docs = [documents[j] for j in range(len(documents)) if predicted_labels[j] == i]
      top_terms = get_top_terms(i)
      print(f"\nCluster {i} ({len(cluster_docs)} documents):")
      print(f"  Top terms: {', '.join(top_terms)}")
      print(f"  Sample: {cluster_docs[0][:50]}...")

  # Step 6: Evaluate clustering (compare with true labels)
  print("\n5️⃣ CLUSTERING EVALUATION")
  print("-"*40)

  from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

  ari = adjusted_rand_score(true_labels, predicted_labels)
  nmi = normalized_mutual_info_score(true_labels, predicted_labels)

  print(f"Adjusted Rand Index: {ari:.3f}")
  print(f"Normalized Mutual Information: {nmi:.3f}")
  print(f"Silhouette Score: {silhouette_score(X, predicted_labels):.3f}")

  # Step 7: Visualization using PCA
  print("\n6️⃣ VISUALIZING CLUSTERS")
  print("-"*40)

  # Reduce to 2D for visualization
  pca = PCA(n_components=2)
  X_2d = pca.fit_transform(X.toarray())

  fig, axes = plt.subplots(1, 2, figsize=(14, 5))

  # Plot predicted clusters
  ax1 = axes[0]
  colors = ['red', 'blue', 'green', 'orange', 'purple']
  for i in range(5):
      mask = predicted_labels == i
      ax1.scatter(X_2d[mask, 0], X_2d[mask, 1], c=colors[i], label=f'Cluster {i}', alpha=0.7)
  ax1.set_title('Predicted Clusters')
  ax1.legend()
  ax1.set_xlabel('PC1')
  ax1.set_ylabel('PC2')

  # Plot true labels
  ax2 = axes[1]
  for i, topic in enumerate(topic_names):
      mask = np.array(true_labels) == i
      ax2.scatter(X_2d[mask, 0], X_2d[mask, 1], c=colors[i], label=topic, alpha=0.7)
  ax2.set_title('True Topics')
  ax2.legend()
  ax2.set_xlabel('PC1')
  ax2.set_ylabel('PC2')

  plt.tight_layout()
  plt.savefig('document_clusters.png', dpi=150)

  # Step 8: Assign topic names to clusters
  print("\n7️⃣ AUTOMATIC TOPIC LABELING")
  print("-"*40)

  cluster_topics = {}
  for i in range(5):
      top_terms = get_top_terms(i, n_terms=3)
      cluster_topics[i] = ' / '.join(top_terms)
      print(f"Cluster {i}: {cluster_topics[i]}")

  print("\n✅ Document clustering complete!")
  ```

  **What you learned:**

  * TF-IDF converts text to numerical features for clustering
  * Cluster centers reveal the main terms for each topic
  * Dimensionality reduction helps visualize document clusters
</details>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="No Labels Needed" icon="question">
    Clustering finds groups without knowing the answer
  </Card>

  <Card title="K-Means = Centers" icon="bullseye">
    Assign to nearest center, update centers, repeat
  </Card>

  <Card title="DBSCAN = Density" icon="circle-nodes">
    Finds arbitrary shapes and identifies noise
  </Card>

  <Card title="Scale Your Data" icon="arrows-left-right-to-line">
    Distance-based algorithms need scaled features
  </Card>
</CardGroup>
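
The "assign to nearest center, update centers, repeat" loop can be sketched in a few lines of NumPy. This is a simplified illustration on synthetic data, not how scikit-learn's `KMeans` is implemented. In particular, the initial centers are hand-picked as one point from each blob (an assumption made to keep the demo deterministic), whereas real implementations use smarter initialization and multiple restarts.

```python
import numpy as np

# Two well-separated synthetic blobs (illustrative data)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

k = 2
# Simplified initialization (assumption): first and last data points
centers = X_toy[[0, -1]].copy()

for iteration in range(100):
    # Assign: distance from every point to every center, shape (n_points, k)
    distances = np.linalg.norm(X_toy[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update: move each center to the mean of its assigned points
    # (keep the old center if a cluster happens to be empty)
    new_centers = np.array([
        X_toy[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

    # Repeat until the centers stop moving
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(f"Stopped after {iteration + 1} iterations")
print(centers.round(2))
```

On data this clean, the centers should settle near the two blob means within a few iterations. With overlapping clusters or an unlucky start the loop can stop at a poor solution, which is why the projects above pass `n_init=10` to `KMeans`.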

***

## What's Next?

You've now covered both supervised and unsupervised learning! Let's dive into the basics of neural networks.

<Card title="Continue to Module 12: Neural Networks" icon="arrow-right" href="/courses/ml-mastery/12-neural-networks">
  Learn how artificial neurons work and build your first neural network
</Card>
