
Clustering: Unsupervised Learning

[Figure: K-Means clustering visualization]

A Different Kind of Problem

So far, all our problems had labels:
  • House prices (we knew the correct price)
  • Spam/not spam (we knew which emails were spam)
  • Customer churn (we knew who churned)
But what if you have data without labels? Real scenarios:
  • Group customers into segments (but you don’t know the segments beforehand)
  • Find patterns in gene expression data
  • Detect anomalies in network traffic
  • Organize documents by topic
This is unsupervised learning. The algorithm finds structure on its own.
[Figure: Customer segmentation with clustering]

The Customer Segmentation Problem

Your marketing team wants to send different campaigns to different customer types. But what types exist?
import numpy as np
import matplotlib.pyplot as plt

# Customer data: [annual_spending ($k), store_visits_per_month]
np.random.seed(42)

# Generate 3 natural clusters (but we pretend we don't know this!)
# Budget shoppers: low spending, low visits
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]

# Regular customers: moderate spending, moderate visits
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]

# Premium customers: high spending, high visits
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]

# Combine (in real life, you wouldn't know which cluster each point belongs to)
customers = np.vstack([budget, regular, premium])

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], alpha=0.6)
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('Customer Data - Can You See the Groups?')
plt.grid(True)
plt.show()
Looking at the scatter plot, you can probably see 3 groups. But how do we find them automatically?

K-Means Clustering

The most popular clustering algorithm.

The Algorithm (Simple Version)

  1. Pick K random points as initial cluster centers
  2. Assign each point to the nearest center
  3. Update centers to the mean of assigned points
  4. Repeat steps 2-3 until nothing changes
def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)
    
    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()
    
    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)
        
        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty
        
        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break
        
        centers = new_centers
    
    return labels, centers

# Run our simple K-Means
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()

Using scikit-learn

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for clustering!
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

Choosing K: The Elbow Method

How many clusters should we use?
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()

Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster compared to other clusters:

s = (b - a) / max(a, b)

where:
  • a = average distance to points in the same cluster
  • b = average distance to points in the nearest other cluster
Range: -1 (bad) to +1 (good)
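To make the formula concrete, here is a tiny hand-worked sketch for a single point (the values are made up for illustration):

import numpy as np

point = np.array([1.0])
same_cluster = np.array([[0.5], [1.5]])      # other members of the point's own cluster
nearest_cluster = np.array([[4.0], [5.0]])   # members of the nearest other cluster

a = np.mean([np.linalg.norm(point - p) for p in same_cluster])     # 0.5
b = np.mean([np.linalg.norm(point - p) for p in nearest_cluster])  # 3.5
s = (b - a) / max(a, b)
print(f"a = {a}, b = {b}, silhouette = {s:.2f}")  # (3.5 - 0.5) / 3.5 ≈ 0.86

In practice, scikit-learn computes this for every point and averages the result: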
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()

DBSCAN: Density-Based Clustering

K-Means has problems:
  • You must specify K
  • Assumes spherical clusters
  • Sensitive to outliers
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) solves these:
  • Automatically finds the number of clusters
  • Finds clusters of any shape
  • Identifies outliers/noise

How DBSCAN Works

  1. For each point, count neighbors within radius eps
  2. If count >= min_samples, it’s a core point
  3. Connect core points that are neighbors
  4. Non-core points near core points are border points
  5. Everything else is noise
from sklearn.cluster import DBSCAN

# Create more interesting data with different shaped clusters
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# K-Means fails on non-spherical clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN handles them well
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (Fails on Moons)')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (Works!)')

plt.tight_layout()
plt.show()
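One handy detail from the DBSCAN fit above: noise points get the label -1, and the fitted model records which points were core points, so you can count clusters and outliers directly:

n_clusters_found = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print("Clusters found:", n_clusters_found)
print("Noise points:", np.sum(dbscan_labels == -1))
print("Core points:", len(dbscan.core_sample_indices_))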

Choosing DBSCAN Parameters

# eps: radius of neighborhood
# min_samples: minimum points to form a cluster

# Heuristic: min_samples = 2 * n_features
# Heuristic: eps from k-distance graph

from sklearn.neighbors import NearestNeighbors

# Find a good eps by plotting each point's distance to its k-th nearest neighbor
k = 5  # match the min_samples used above
neighbors = NearestNeighbors(n_neighbors=k + 1)  # +1 because each point is its own nearest neighbor
neighbors.fit(X_moons)
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, k])  # distance to the k-th actual neighbor

plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph - Look for the "Elbow"')
plt.grid(True)
plt.show()

Hierarchical Clustering

Builds a tree (dendrogram) of clusters:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree at a certain height to get clusters
from scipy.cluster.hierarchy import fcluster

# Get 3 clusters
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)

Practical Example: Customer Segmentation

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create realistic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})

print("Customer data sample:")
print(data.head())
print("\nStatistics:")
print(data.describe())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Suppose the silhouette analysis (as shown earlier) pointed to 4 clusters
best_k = 4
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Analyze clusters
print("\nCluster Profiles:")
cluster_profiles = data.groupby('cluster').mean()
print(cluster_profiles.round(2))

# Name the segments based on characteristics
segment_names = {
    0: 'Occasional Shoppers',
    1: 'High-Value Regulars',
    2: 'Bargain Hunters',
    3: 'VIP Customers'
}
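Once you have matched names to cluster indices (inspect the profiles above — the mapping here is illustrative), attaching them to the table is one line:

data['segment'] = data['cluster'].map(segment_names)
print(data[['cluster', 'segment']].head())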

Comparison: When to Use What

Algorithm        | Best For                     | Pros                                     | Cons
---------------- | ---------------------------- | ---------------------------------------- | ----------------------------------
K-Means          | Spherical clusters, known K  | Fast, simple                             | Must specify K, assumes spherical
DBSCAN           | Arbitrary shapes, noise      | Finds K automatically, handles outliers  | Sensitive to parameters
Hierarchical     | Small data, need hierarchy   | Visual dendrogram, no K needed           | Slow for large data
Gaussian Mixture | Soft clustering, elliptical  | Probability outputs                      | Can overfit
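Gaussian Mixture appears in the table but not in the code above. A minimal sketch of its soft clustering on the scaled customer data — unlike K-Means, every point gets a probability for each cluster:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(customers_scaled)
probabilities = gmm.predict_proba(customers_scaled)
print("First customer's cluster probabilities:", probabilities[0].round(3))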

Connection to Supervised Learning

Clustering can help supervised learning:
# Use clusters as features
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Create cluster features (X_train and y_train are assumed to come from an
# existing labeled dataset, e.g. a train/test split)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)
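A usage note: for new or test data (assuming you have a matching X_test), reuse the already-fitted KMeans instead of refitting, so train and test share the same cluster definitions:

test_cluster_features = kmeans.predict(X_test).reshape(-1, 1)
X_test_enhanced = np.hstack([X_test, test_cluster_features])
predictions = clf.predict(X_test_enhanced)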
Math Connection: Clustering uses distance metrics extensively. Understanding vector similarity helps you choose the right metric.
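For example, Euclidean distance cares about magnitude while cosine distance only cares about direction — a small illustration with made-up vectors:

import numpy as np
from scipy.spatial.distance import euclidean, cosine

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print("Euclidean distance:", euclidean(v1, v2))  # ~3.74 -- magnitudes differ
print("Cosine distance:", cosine(v1, v2))        # ~0.0 -- directions match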

🚀 Mini Projects

Project 1: Customer Segmentation

Segment customers based on purchasing behavior for targeted marketing campaigns.

Project 2: Image Color Quantization

Use K-Means to reduce the number of colors in an image (compression).

Project 3: Anomaly Detection System

Use DBSCAN to detect anomalies in network traffic or transaction data.

Project 4: Document Clustering

Automatically organize documents by topic using clustering.

Key Takeaways

No Labels Needed

Clustering finds groups without knowing the answer

K-Means = Centers

Assign to nearest center, update centers, repeat

DBSCAN = Density

Finds arbitrary shapes and identifies noise

Scale Your Data

Distance-based algorithms need scaled features

What’s Next?

You’ve now covered both supervised and unsupervised learning! Let’s dive into the basics of neural networks.

Continue to Module 12: Neural Networks

Learn how artificial neurons work and build your first neural network