Clustering: Unsupervised Learning
A Different Kind of Problem
So far, all our problems had labels:
- House prices (we knew the correct price)
- Spam/not spam (we knew which emails were spam)
- Customer churn (we knew who churned)

Unsupervised learning is different: there are no labels to learn from. The algorithm has to find structure in the data on its own. Typical tasks include:
- Group customers into segments (but you don't know the segments beforehand)
- Find patterns in gene expression data
- Detect anomalies in network traffic
- Organize documents by topic
The Customer Segmentation Problem
Your marketing team wants to send different campaigns to different customer types. But what types exist?

K-Means Clustering
The most popular clustering algorithm.

The Algorithm (Simple Version)
- Pick K random points as initial cluster centers
- Assign each point to the nearest center
- Update centers to the mean of assigned points
- Repeat steps 2-3 until nothing changes
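To make these four steps concrete, here is a minimal NumPy sketch of the algorithm. It is illustrative, not production code (for instance, it ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update centers to the mean of assigned points
        # (assumes no cluster ends up empty, for brevity)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when nothing changes
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```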
Using scikit-learn
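In practice you would use scikit-learn's `KMeans`. A minimal sketch on synthetic data, where `make_blobs` stands in for real features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn from 4 blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignment per point
print(kmeans.cluster_centers_)   # coordinates of the 4 centers
print(kmeans.inertia_)           # within-cluster sum of squared distances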
Choosing K: The Elbow Method
How many clusters should we use? The elbow method runs K-Means for a range of K values and plots the inertia (within-cluster sum of squared distances) for each. Inertia always drops as K grows, so look for the "elbow" where adding more clusters stops helping much.
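A sketch of the elbow plot, reusing the `X` array from the scikit-learn example above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow method: look for the bend")
plt.show()
```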
Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster versus other clusters. For a single point:

s = (b - a) / max(a, b)

where:
- a = average distance to points in the same cluster
- b = average distance to points in the nearest other cluster

Scores range from -1 to 1; values near 1 mean tight, well-separated clusters.
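Scikit-learn computes the mean silhouette across all points, so you can compare K values directly. A sketch, again reusing `X` from above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```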
DBSCAN: Density-Based Clustering
K-Means has problems:
- You must specify K
- It assumes spherical clusters
- It's sensitive to outliers

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) addresses all three:
- Automatically finds the number of clusters
- Finds clusters of any shape
- Identifies outliers/noise
How DBSCAN Works
- For each point, count the neighbors within radius `eps`
- If the count >= `min_samples`, the point is a core point
- Connect core points that are within `eps` of each other
- Non-core points within `eps` of a core point are border points
- Everything else is noise
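A sketch with scikit-learn's `DBSCAN` on two interleaved half-moons, a shape K-Means handles badly (the `eps` value here is just a plausible starting point for this data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons with a little noise
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(labels == -1))
```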
Choosing DBSCAN Parameters
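A common heuristic: fix `min_samples` first (the default of 5, or roughly twice the number of features, is a reasonable start), then plot each point's distance to its `min_samples`-th nearest neighbor in sorted order and pick `eps` near the knee of the curve. A sketch of that plot, reusing `X` from the DBSCAN example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 5
# Distance to each point's min_samples-th nearest neighbor
# (kneighbors counts the point itself as its own first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("Points sorted by neighbor distance")
plt.ylabel(f"Distance to {min_samples}th nearest neighbor")
plt.title("Pick eps near the knee of this curve")
plt.show()
```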
Hierarchical Clustering
Builds a tree (dendrogram) of clusters. The usual agglomerative approach starts with every point as its own cluster and repeatedly merges the two closest clusters; cutting the tree at a chosen height yields a flat clustering, so no K is needed up front. A sketch follows.
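A minimal sketch using SciPy's `linkage` and `dendrogram` on the same `X` as above:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance
Z = linkage(X, method="ward")
dendrogram(Z)
plt.title("Dendrogram: cut at a height to choose clusters")
plt.show()
```

Practical Example: Customer Segmentation

Putting it together on a hypothetical customer table. The column names and synthetic values here are illustrative stand-ins for your real data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data; in practice, load your own CSV
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "annual_income": rng.normal(60_000, 20_000, 500).clip(10_000, None),
    "spending_score": rng.uniform(1, 100, 500),
})

# Scale first: K-Means is distance-based, so features must be comparable
X_scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X_scaled)

# Profile each segment so the marketing team can name and target it
print(customers.groupby("segment").mean().round(1))
print(customers["segment"].value_counts())
```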
Comparison: When to Use What
| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| K-Means | Spherical clusters, known K | Fast, simple | Must specify K, assumes spherical |
| DBSCAN | Arbitrary shapes, noise | Finds K automatically, handles outliers | Sensitive to parameters |
| Hierarchical | Small data, need hierarchy | Visual dendrogram, no K needed | Slow for large data |
| Gaussian Mixture | Soft clustering, elliptical | Probability outputs | Can overfit |
Connection to Supervised Learning
Clustering can help supervised learning: for example, cluster assignments can be added as an engineered feature, and clusters can reveal subpopulations worth modeling separately.

Math Connection: Clustering uses distance metrics extensively. Understanding vector similarity helps you choose the right metric.
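A sketch of the feature-engineering idea. The feature matrix here is hypothetical; fit the clustering on training data only, then apply it to test data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix; replace with your real data
X_all = np.random.default_rng(0).normal(size=(1000, 8))
X_train, X_test = train_test_split(X_all, random_state=42)

# Fit K-Means on the training set, then append the cluster
# assignment as one extra column for a supervised model
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_train)
X_train_aug = np.column_stack([X_train, km.labels_])
X_test_aug = np.column_stack([X_test, km.predict(X_test)])
print(X_train_aug.shape, X_test_aug.shape)
```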
🚀 Mini Projects
Project 1: Customer Segmentation
Segment e-commerce customers for targeted marketing
Project 2: Image Color Quantization
Compress images using K-Means clustering
Project 3: Anomaly Detection System
Detect outliers using DBSCAN
Project 4: Document Clustering
Organize documents by topic automatically
Key Takeaways
- No Labels Needed: clustering finds groups without knowing the answer.
- K-Means = Centers: assign each point to the nearest center, update the centers, repeat.
- DBSCAN = Density: finds arbitrarily shaped clusters and identifies noise.
- Scale Your Data: distance-based algorithms need scaled features.
What’s Next?
You've now covered both supervised and unsupervised learning! Let's dive into the basics of neural networks.

Continue to Module 12: Neural Networks
Learn how artificial neurons work and build your first neural network