
Clustering: Unsupervised Learning

[Figure: K-Means clustering visualization]

A Different Kind of Problem

So far, all our problems had labels:
  • House prices (we knew the correct price)
  • Spam/not spam (we knew which emails were spam)
  • Customer churn (we knew who churned)
But what if you have data without labels? Real scenarios:
  • Group customers into segments (but you don’t know the segments beforehand)
  • Find patterns in gene expression data
  • Detect anomalies in network traffic
  • Organize documents by topic
This is unsupervised learning. The algorithm finds structure on its own.
[Figure: Customer segmentation with clustering]

The Customer Segmentation Problem

Your marketing team wants to send different campaigns to different customer types. But what types exist?
import numpy as np
import matplotlib.pyplot as plt

# Customer data: [annual_spending ($k), store_visits_per_month]
np.random.seed(42)

# Generate 3 natural clusters (but we pretend we don't know this!)
# Budget shoppers: low spending, low visits
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]

# Regular customers: moderate spending, moderate visits
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]

# Premium customers: high spending, high visits
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]

# Combine (in real life, you wouldn't know which cluster each point belongs to)
customers = np.vstack([budget, regular, premium])

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], alpha=0.6)
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('Customer Data - Can You See the Groups?')
plt.grid(True)
plt.show()
Looking at the scatter plot, you can probably see 3 groups. But how do we find them automatically?

K-Means Clustering

The most popular clustering algorithm.

The Algorithm (Simple Version)

  1. Pick K random points as initial cluster centers
  2. Assign each point to the nearest center
  3. Update centers to the mean of assigned points
  4. Repeat steps 2-3 until nothing changes
def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)
    
    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()
    
    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)
        
        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty
        
        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break
        
        centers = new_centers
    
    return labels, centers

# Run our simple K-Means
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()

Using scikit-learn

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for clustering!
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

Choosing K: The Elbow Method

How many clusters should we use?
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()

Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster compared to other clusters:

s = (b - a) / max(a, b)

where:
  • a = average distance to points in the same cluster
  • b = average distance to points in the nearest other cluster
Range: -1 (bad) to +1 (good)
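To make the formula concrete, here is a tiny hand-worked sketch for a single point (the values are made up for illustration):

import numpy as np

point = np.array([1.0])
same_cluster = np.array([[0.5], [1.5]])      # other members of the point's own cluster
nearest_cluster = np.array([[4.0], [5.0]])   # members of the nearest other cluster

a = np.mean([np.linalg.norm(point - p) for p in same_cluster])     # 0.5
b = np.mean([np.linalg.norm(point - p) for p in nearest_cluster])  # 3.5
s = (b - a) / max(a, b)
print(f"a = {a}, b = {b}, silhouette = {s:.2f}")  # (3.5 - 0.5) / 3.5 ≈ 0.86

In practice, scikit-learn computes this for every point and averages the result: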
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()

DBSCAN: Density-Based Clustering

K-Means has problems:
  • You must specify K
  • Assumes spherical clusters
  • Sensitive to outliers
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) solves these:
  • Automatically finds the number of clusters
  • Finds clusters of any shape
  • Identifies outliers/noise

How DBSCAN Works

  1. For each point, count neighbors within radius eps
  2. If count >= min_samples, it’s a core point
  3. Connect core points that are neighbors
  4. Non-core points near core points are border points
  5. Everything else is noise
from sklearn.cluster import DBSCAN

# Create more interesting data with different shaped clusters
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# K-Means fails on non-spherical clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN handles them well
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (Fails on Moons)')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (Works!)')

plt.tight_layout()
plt.show()
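One handy detail from the DBSCAN fit above: noise points get the label -1, and the fitted model records which points were core points, so you can count clusters and outliers directly:

n_clusters_found = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print("Clusters found:", n_clusters_found)
print("Noise points:", np.sum(dbscan_labels == -1))
print("Core points:", len(dbscan.core_sample_indices_))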

Choosing DBSCAN Parameters

# eps: radius of neighborhood
# min_samples: minimum points to form a cluster

# Heuristic: min_samples = 2 * n_features
# Heuristic: eps from k-distance graph

from sklearn.neighbors import NearestNeighbors

# Find a good eps by plotting each point's distance to its k-th nearest neighbor
k = 5  # match the min_samples used above
neighbors = NearestNeighbors(n_neighbors=k + 1)  # +1 because each point is its own nearest neighbor
neighbors.fit(X_moons)
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, k])  # distance to the k-th actual neighbor

plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph - Look for the "Elbow"')
plt.grid(True)
plt.show()

Hierarchical Clustering

Builds a tree (dendrogram) of clusters:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree at a certain height to get clusters
from scipy.cluster.hierarchy import fcluster

# Get 3 clusters
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)

Practical Example: Customer Segmentation

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create realistic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})

print("Customer data sample:")
print(data.head())
print("\nStatistics:")
print(data.describe())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Suppose the silhouette analysis (as shown earlier) pointed to 4 clusters
best_k = 4
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Analyze clusters
print("\nCluster Profiles:")
cluster_profiles = data.groupby('cluster').mean()
print(cluster_profiles.round(2))

# Name the segments based on characteristics
segment_names = {
    0: 'Occasional Shoppers',
    1: 'High-Value Regulars',
    2: 'Bargain Hunters',
    3: 'VIP Customers'
}
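Once you have matched names to cluster indices (inspect the profiles above — the mapping here is illustrative), attaching them to the table is one line:

data['segment'] = data['cluster'].map(segment_names)
print(data[['cluster', 'segment']].head())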

Comparison: When to Use What

Algorithm        | Best For                     | Pros                                     | Cons
---------------- | ---------------------------- | ---------------------------------------- | ----------------------------------
K-Means          | Spherical clusters, known K  | Fast, simple                             | Must specify K, assumes spherical
DBSCAN           | Arbitrary shapes, noise      | Finds K automatically, handles outliers  | Sensitive to parameters
Hierarchical     | Small data, need hierarchy   | Visual dendrogram, no K needed           | Slow for large data
Gaussian Mixture | Soft clustering, elliptical  | Probability outputs                      | Can overfit
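Gaussian Mixture appears in the table but not in the code above. A minimal sketch of its soft clustering on the scaled customer data — unlike K-Means, every point gets a probability for each cluster:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(customers_scaled)
probabilities = gmm.predict_proba(customers_scaled)
print("First customer's cluster probabilities:", probabilities[0].round(3))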

Connection to Supervised Learning

Clustering can help supervised learning:
# Use clusters as features
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Create cluster features (X_train and y_train are assumed to come from an
# existing labeled dataset, e.g. a train/test split)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)
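A usage note: for new or test data (assuming you have a matching X_test), reuse the already-fitted KMeans instead of refitting, so train and test share the same cluster definitions:

test_cluster_features = kmeans.predict(X_test).reshape(-1, 1)
X_test_enhanced = np.hstack([X_test, test_cluster_features])
predictions = clf.predict(X_test_enhanced)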
Math Connection: Clustering uses distance metrics extensively. Understanding vector similarity helps you choose the right metric.
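For example, Euclidean distance cares about magnitude while cosine distance only cares about direction — a small illustration with made-up vectors:

import numpy as np
from scipy.spatial.distance import euclidean, cosine

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print("Euclidean distance:", euclidean(v1, v2))  # ~3.74 -- magnitudes differ
print("Cosine distance:", cosine(v1, v2))        # ~0.0 -- directions match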

🚀 Mini Projects

Project 1: Customer Segmentation

Segment customers based on purchasing behavior for targeted marketing campaigns.

Project 2: Image Color Quantization

Use K-Means to reduce the number of colors in an image (compression).

Project 3: Anomaly Detection System

Use DBSCAN to detect anomalies in network traffic or transaction data.

Project 4: Document Clustering

Automatically organize documents by topic using clustering.

Key Takeaways

No Labels Needed

Clustering finds groups without knowing the answer

K-Means = Centers

Assign to nearest center, update centers, repeat

DBSCAN = Density

Finds arbitrary shapes and identifies noise

Scale Your Data

Distance-based algorithms need scaled features

What’s Next?

You’ve now covered both supervised and unsupervised learning! Let’s dive into the basics of neural networks.

Continue to Module 12: Neural Networks

Learn how artificial neurons work and build your first neural network