Clustering: Unsupervised Learning

[Figure: K-Means clustering visualization]

A Different Kind of Problem

So far, all our problems had labels:
  • House prices (we knew the correct price)
  • Spam/not spam (we knew which emails were spam)
  • Customer churn (we knew who churned)
But what if you have data without labels? Real scenarios:
  • Group customers into segments (but you don’t know the segments beforehand)
  • Find patterns in gene expression data
  • Detect anomalies in network traffic
  • Organize documents by topic
This is unsupervised learning. The algorithm finds structure on its own. If supervised learning is like a student taking an exam with an answer key, unsupervised learning is like that same student being handed a box of unlabeled rocks and told to “sort them into groups that make sense.” There’s no right answer — just patterns waiting to be discovered.
[Figure: Customer segmentation with clustering]

The Customer Segmentation Problem

Your marketing team wants to send different campaigns to different customer types. But what types exist?
import numpy as np
import matplotlib.pyplot as plt

# Customer data: [annual_spending ($k), store_visits_per_month]
np.random.seed(42)

# Generate 3 natural clusters (but we pretend we don't know this!)
# Budget shoppers: low spending, low visits
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]

# Regular customers: moderate spending, moderate visits
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]

# Premium customers: high spending, high visits
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]

# Combine (in real life, you wouldn't know which cluster each point belongs to)
customers = np.vstack([budget, regular, premium])

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], alpha=0.6)
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('Customer Data - Can You See the Groups?')
plt.grid(True)
plt.show()
Looking at the scatter plot, you can probably see 3 groups. But how do we find them automatically?

K-Means Clustering

K-Means is one of the most popular clustering algorithms. It’s like a game of “hot potato” between cluster centers and data points, where each round brings the centers closer to where they “belong.”

The Algorithm (Simple Version)

  1. Pick K random points as initial cluster centers (like randomly placing K flags on a map)
  2. Assign each point to the nearest center (each person walks to the closest flag)
  3. Update centers to the mean of assigned points (move each flag to the center of its crowd)
  4. Repeat steps 2-3 until nothing changes (the flags stop moving — we’ve converged)
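The key insight: this alternating “assign, then update” process is guaranteed to converge. Neither step can increase the total within-cluster sum of squared distances, and there are only finitely many ways to assign points to clusters, so the assignments must eventually stop changing. However, it may converge to a local optimum rather than the global one. That is why K-Means is usually run several times from different random starting positions, keeping the best result. In scikit-learn this is controlled by n_init: older releases defaulted to n_init=10, while recent releases default to n_init='auto', which performs a single run with the default k-means++ initialization, so set it explicitly if you want restarts.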
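To make the convergence argument concrete, here is a minimal sketch (not scikit-learn's implementation) that records the objective after every assign step and every update step, then compares single runs started from different seeds. The synthetic blobs, the helper names `objective` and `assign`, and the choice of 10 seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three Gaussian blobs (illustrative assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[6, 0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[3, 5], scale=1.0, size=(50, 2)),
])

def objective(X, labels, centers):
    """Total within-cluster sum of squared distances (the K-Means objective)."""
    return float(((X - centers[labels]) ** 2).sum())

def assign(X, centers):
    """Label each point with the index of its nearest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

# Record the objective after each assign step and each update step
centers = X[rng.choice(len(X), size=3, replace=False)].copy()
history = []
for _ in range(20):
    labels = assign(X, centers)
    history.append(objective(X, labels, centers))   # after assign
    for j in range(3):
        if np.any(labels == j):                     # leave empty clusters in place
            centers[j] = X[labels == j].mean(axis=0)
    history.append(objective(X, labels, centers))   # after update

never_increases = all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print("Objective never increases:", never_increases)

# Single runs from different random starts can end at different local optima
inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X).inertia_
    for s in range(10)
]
print(f"Best inertia: {min(inertias):.2f}, worst inertia: {max(inertias):.2f}")
```

If the best and worst inertia differ, some starts ended in a poorer local optimum, which is the situation that restarts are meant to guard against.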
def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)
    
    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()
    
    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)
        
        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty
        
        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break
        
        centers = new_centers
    
    return labels, centers

# Run our simple K-Means
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()

Using scikit-learn

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for clustering!
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
# n_init=10 means: run K-Means 10 times with different random starts,
# keep the best result. This guards against bad initialization.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)
# Inertia = total within-cluster sum of squared distances to center.
# Lower is better, but it always decreases with more clusters
# (K=n_samples gives inertia=0 but is useless).
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

# IMPORTANT: To interpret cluster centers in original units,
# reverse the scaling:
# centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
# This tells you "Cluster 0 = customers who spend ~$20K/year and visit 3x/month"
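The commented-out step above can be run as follows. This is a self-contained sketch that rebuilds the same synthetic customers, maps the centers back to original units, and assigns a new customer. The `new_customer` values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic customers so this block runs on its own
np.random.seed(42)
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]
customers = np.vstack([budget, regular, premium])

scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(customers_scaled)

# Map the centers from standardized units back to ($k, visits per month)
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
for i, (spend, visits) in enumerate(centers_original):
    print(f"Cluster {i}: ~${spend:.0f}k/year, ~{visits:.1f} visits/month")

# A new customer must pass through the same fitted scaler before predict()
new_customer = np.array([[95.0, 14.0]])  # hypothetical customer
assigned = kmeans.predict(scaler.transform(new_customer))[0]
print("Assigned cluster:", assigned)
```

Cluster numbering is arbitrary, so read the printed centers to see which ID corresponds to which kind of customer.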

Choosing K: The Elbow Method

How many clusters should we use?
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()
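Reading an elbow off a plot is subjective, so it can help to look at the numbers too. The sketch below prints how much each extra cluster lowers the inertia. It uses `make_blobs` data as a stand-in (an assumption, not the customer data above), and the percentage drop is just one informal way to read the curve, not a standard criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

k_values = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in k_values
]

# Relative improvement when going from K-1 to K clusters
for k, prev, cur in zip(k_values[1:], inertias, inertias[1:]):
    drop = 100 * (prev - cur) / prev
    print(f"K={k}: inertia {cur:8.2f} ({drop:5.1f}% lower than K={k - 1})")
```

The elbow is roughly where the percentage drop falls off sharply. With K=1 on standardized data, the inertia equals n_samples * n_features, which gives a handy sanity check for the first value.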

Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster compared with other clusters. Think of it as asking each data point: “Are you happy in your cluster, or would you rather switch?”

$$s = \frac{b - a}{\max(a, b)}$$

Where:
  • $a$ = average distance to the other points in the same cluster (cohesion: “how close am I to my own group?”)
  • $b$ = average distance to points in the nearest other cluster (separation: “how far am I from the next group?”)
Range: -1 (bad) to +1 (good)
  • +1: The point is far from other clusters and close to its own — perfect clustering
  • 0: The point is on the border between two clusters — ambiguous assignment
  • -1: The point is closer to another cluster than its own — likely misassigned
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()
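To see what the silhouette formula does for a single point, the sketch below computes a and b by hand on a tiny made-up dataset and checks the result against scikit-learn's `silhouette_samples`. The six points and the helper name `silhouette_of` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset: two obvious groups on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """Silhouette of point i computed directly from the definition."""
    d = np.abs(X - X[i]).ravel()          # distances from point i to every point
    same = labels == labels[i]
    same[i] = False                        # a excludes the point itself
    a = d[same].mean()                     # cohesion
    b = min(d[labels == c].mean()          # separation: nearest other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(X))])
print(np.round(manual, 3))                 # [0.864 0.9   0.833 0.833 0.9   0.864]
print(np.allclose(manual, silhouette_samples(X, labels)))  # True
```

For the point at 1.0, a = 1 (mean distance to 0 and 2) and b = 10 (mean distance to 10, 11, 12), so s = (10 - 1) / 10 = 0.9. The value returned by `silhouette_score` is the mean of these per-point values.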

DBSCAN: Density-Based Clustering

K-Means has problems:
  • You must specify K upfront (what if you guess wrong?)
  • Assumes spherical, roughly equal-sized clusters (fails on elongated or ring-shaped groups)
  • Sensitive to outliers (one extreme point can drag a cluster center far from where it should be)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) addresses all three:
  • Automatically finds the number of clusters based on data density
  • Finds clusters of any shape — rings, crescents, irregular blobs
  • Identifies outliers/noise as points that don’t belong to any cluster
The trade-off: DBSCAN requires you to choose eps (neighborhood radius) and min_samples (minimum density), which can be tricky. And unlike K-Means, DBSCAN struggles when clusters have very different densities — a tight cluster and a sparse cluster might need different eps values.

How DBSCAN Works

  1. For each point, count neighbors within radius eps
  2. If count >= min_samples, it’s a core point
  3. Connect core points that are neighbors
  4. Non-core points near core points are border points
  5. Everything else is noise
from sklearn.cluster import DBSCAN

# Create more interesting data with different shaped clusters
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# K-Means fails on non-spherical clusters
# (n_init is set explicitly so the behavior is the same across scikit-learn versions)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN handles them well
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (Fails on Moons)')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (Works!)')

plt.tight_layout()
plt.show()
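The five steps listed under "How DBSCAN Works" can be traced by hand. The following sketch classifies every point as core, border, or noise from raw pairwise distances and compares the result with scikit-learn's DBSCAN. It is a brute-force illustration under the assumption that a neighborhood includes the point itself (as scikit-learn counts it), not how the library computes neighborhoods internally.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
eps, min_samples = 0.2, 5

# Steps 1-2: count neighbors within eps (the point itself is included)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
neighbor_counts = (D <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

# Steps 4-5: a non-core point is a border point if a core point lies within eps,
# otherwise it is noise
near_core = (D[:, is_core] <= eps).any(axis=1)
is_border = ~is_core & near_core
is_noise = ~is_core & ~near_core

# Compare with scikit-learn (step 3, linking core points, is left to the library)
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

print("core / border / noise:", is_core.sum(), is_border.sum(), is_noise.sum())
print("Core points match scikit-learn:", np.array_equal(is_core, core_mask))
print("Noise points match scikit-learn:", np.array_equal(is_noise, db.labels_ == -1))
```

scikit-learn marks noise with the label -1, so `db.labels_ == -1` is the usual way to pull out the outliers.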

Choosing DBSCAN Parameters

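# eps: radius of the neighborhood around each point
# min_samples: minimum number of points within eps (counting the point itself)
#              for a point to be treated as a core point

# Heuristic: min_samples = 2 * n_features is a common starting point
# Heuristic: read eps off the k-distance graph

from sklearn.neighbors import NearestNeighbors

# Find a candidate eps using k-distance
k = 5  # same as min_samples: querying the training points returns each point
       # as its own nearest neighbor (distance 0), which matches DBSCAN's count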
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_moons)
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph - Look for the "Elbow"')
plt.grid(True)
plt.show()
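Once the k-distance graph suggests a range for eps, a quick sweep shows how sensitive the result is. This sketch tries a handful of eps values on the same moons data. The candidate values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

results = []
for eps in [0.05, 0.1, 0.2, 0.3, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Label -1 means noise, so it does not count as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results.append((eps, n_clusters, n_noise))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A very small eps tends to label most points as noise, while a very large eps tends to merge everything into one cluster. With min_samples fixed, the number of noise points can only shrink as eps grows.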

Hierarchical Clustering

Builds a tree (dendrogram) of clusters:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree to get flat cluster labels
from scipy.cluster.hierarchy import fcluster

# criterion='maxclust' asks for at most 3 clusters;
# criterion='distance' would cut at a given height instead
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)
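The same Ward clustering is also available through scikit-learn's `AgglomerativeClustering`. The sketch below runs both on the small dataset and compares the partitions, then cuts the SciPy tree at a height instead of asking for a cluster count. The threshold `height = 6.0` is an arbitrary value for illustration; in practice you would read it off the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# SciPy: build the tree, then ask for at most 3 clusters (labels start at 1)
linkage_matrix = linkage(X_small, method='ward')
scipy_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')

# scikit-learn: same linkage, labels start at 0
sk_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_small)

# Label numbering differs, so compare the partitions rather than the raw labels
ari = adjusted_rand_score(scipy_labels, sk_labels)
print("Adjusted Rand index between the two:", ari)

# Alternative: cut the tree at a fixed height
height = 6.0  # illustrative threshold
height_labels = fcluster(linkage_matrix, t=height, criterion='distance')
print(f"Clusters when cutting at height {height}:", len(set(height_labels)))
```

An adjusted Rand index of 1.0 means the two partitions are identical up to renaming of the labels.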

Practical Example: Customer Segmentation

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create realistic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})

print("Customer data sample:")
print(data.head())
print("\nStatistics:")
print(data.describe())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

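# best_k is fixed at 4 here to keep the example short; in practice, choose it
# by comparing silhouette scores across several values of k, as in the
# silhouette section
best_k = 4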
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Analyze clusters
print("\nCluster Profiles:")
cluster_profiles = data.groupby('cluster').mean()
print(cluster_profiles.round(2))

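# Name the segments after inspecting cluster_profiles.
# Cluster IDs are arbitrary, so these names are placeholders: check that
# each one matches the profile of its cluster before using it.
segment_names = {
    0: 'Occasional Shoppers',
    1: 'High-Value Regulars',
    2: 'Bargain Hunters',
    3: 'VIP Customers'
}
data['segment'] = data['cluster'].map(segment_names)
print(data['segment'].value_counts())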
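Rather than fixing the number of segments by hand, it can be selected with the silhouette score. This sketch regenerates the same synthetic customers and picks the K with the highest score in a small range. The range 2 to 8 is an assumption. Because the features here are drawn independently, expect low scores and no strongly preferred K.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n_customers = 500
data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})
features = data.columns.tolist()
data_scaled = StandardScaler().fit_transform(data[features])

# Score each candidate K and keep the best one
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(data_scaled)
    scores[k] = silhouette_score(data_scaled, labels)
best_k = max(scores, key=scores.get)
print("Silhouette by K:", {k: round(v, 3) for k, v in scores.items()})
print("Best K by silhouette:", best_k)

# Refit with the chosen K and profile each segment, including its size
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)
profiles = data.groupby('cluster')[features].mean()
sizes = data['cluster'].value_counts().sort_index()
print(profiles.round(2).assign(n_customers=sizes))
```

Segment sizes are worth checking alongside the means: a "segment" with a handful of customers is rarely useful for a campaign.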

Comparison: When to Use What

| Algorithm | Best For | Pros | Cons |
| --- | --- | --- | --- |
| K-Means | Spherical clusters, known K | Fast, simple | Must specify K, assumes spherical clusters |
| DBSCAN | Arbitrary shapes, noise | Finds K automatically, handles outliers | Sensitive to parameters |
| Hierarchical | Small data, need hierarchy | Visual dendrogram, no K needed | Slow for large data |
| Gaussian Mixture | Soft clustering, elliptical clusters | Probability outputs | Can overfit |
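Gaussian Mixture appears in the comparison but not in the examples so far, so here is a minimal sketch of what "soft clustering" and "probability outputs" mean in practice. It uses `make_blobs` stand-in data, and the 90% confidence cutoff is an arbitrary threshold for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

hard_labels = gmm.predict(X)     # most likely component for each point
probs = gmm.predict_proba(X)     # shape (300, 3); each row sums to 1

print("Membership probabilities of the first point:", np.round(probs[0], 3))
uncertain = int((probs.max(axis=1) < 0.9).sum())
print("Points assigned with less than 90% confidence:", uncertain)
```

Unlike K-Means, each point gets a probability for every cluster, which is useful when points near a boundary should be treated as ambiguous rather than forced into one group.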

Connection to Supervised Learning

Clustering can help supervised learning:
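# Use clusters as features
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data so the snippet runs; substitute your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create cluster features (fit K-Means on the training set only)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)

# At prediction time, reuse the fitted K-Means to label the test points
X_test_enhanced = np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)])
print("Test accuracy:", clf.score(X_test_enhanced, y_test))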
Math Connection: Clustering uses distance metrics extensively. Understanding vector similarity helps you choose the right metric.
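As a small illustration of why the metric matters, the sketch below compares Euclidean and cosine distance on three made-up customers. The values are chosen so that the two metrics disagree about which customer is most similar to the first.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Three illustrative customers: [annual_spending ($k), store_visits_per_month]
X = np.array([
    [20.0, 2.0],     # small spender
    [200.0, 20.0],   # same spending-to-visits ratio, ten times the scale
    [20.0, 20.0],    # same spending as the first, far more visits
])

euclid = pairwise_distances(X, metric='euclidean')
cosine = pairwise_distances(X, metric='cosine')

print("Euclidean distances from customer 0:", np.round(euclid[0], 2))
print("Cosine distances from customer 0:   ", np.round(cosine[0], 3))
```

Euclidean distance says customer 0 is closest to customer 2 (similar magnitude), while cosine distance says customer 0 is closest to customer 1 (same direction, so the same behavior pattern at a different scale). Which answer is "right" depends on what similarity should mean for your problem.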

🚀 Mini Projects

Project 1: Customer Segmentation

Segment customers based on purchasing behavior for targeted marketing campaigns.

Project 2: Image Color Quantization

Use K-Means to reduce the number of colors in an image (compression).
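A possible starting point for this project is sketched below. It treats every pixel as a point in RGB space, clusters the pixels, and replaces each one with its cluster center. To stay runnable without an image file it uses a random synthetic image, and the palette size of 8 is an arbitrary choice. In a real project you would load pixel data from an actual image.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 32x32 RGB "image" so the sketch runs without a file
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

pixels = image.reshape(-1, 3).astype(float)   # one row per pixel
n_colors = 8                                   # palette size (illustrative)
kmeans = KMeans(n_clusters=n_colors, random_state=42, n_init=10).fit(pixels)

# Each cluster center becomes one palette color
palette = kmeans.cluster_centers_.round().astype(np.uint8)
quantized = palette[kmeans.labels_].reshape(image.shape)

colors_before = len(np.unique(image.reshape(-1, 3), axis=0))
colors_after = len(np.unique(quantized.reshape(-1, 3), axis=0))
print("Unique colors before:", colors_before)
print("Unique colors after: ", colors_after)
```

The compression comes from storing the small palette plus one palette index per pixel instead of three color channels per pixel.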

Project 3: Anomaly Detection System

Use DBSCAN to detect anomalies in network traffic or transaction data.
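One way to begin is to treat DBSCAN's noise label as the anomaly flag. The sketch below plants three obvious outliers in synthetic two-feature data and checks that they are labeled as noise. The data, the feature meaning, and the parameters `eps=0.5` and `min_samples=5` are all assumptions for illustration and would need tuning on real traffic or transaction data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic "normal" behavior plus three planted outliers
rng = np.random.default_rng(42)
normal = rng.normal(loc=[50, 5], scale=[5, 1], size=(200, 2))
outliers = np.array([[120.0, 5.0], [50.0, 20.0], [0.0, 0.0]])
X = np.vstack([normal, outliers])            # outliers sit at indices 200-202

X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Noise points (label -1) are the anomaly candidates
anomaly_idx = np.where(labels == -1)[0]
print("Flagged indices:", anomaly_idx)
print("Planted outliers flagged:", {200, 201, 202}.issubset(set(anomaly_idx.tolist())))
```

Points in the sparse tails of the normal data may be flagged as well, so the flagged set is best treated as a list of candidates to review rather than confirmed anomalies.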

Project 4: Document Clustering

Automatically organize documents by topic using clustering.

Key Takeaways

No Labels Needed

Clustering finds groups without knowing the answer

K-Means = Centers

Assign to nearest center, update centers, repeat

DBSCAN = Density

Finds arbitrary shapes and identifies noise

Scale Your Data

Distance-based algorithms need scaled features

What’s Next?

You’ve now covered both supervised and unsupervised learning! Let’s dive into the basics of neural networks.

Continue to Module 12: Neural Networks

Learn how artificial neurons work and build your first neural network