But what if you have data without labels?
Real scenarios:
Group customers into segments (but you don’t know the segments beforehand)
Find patterns in gene expression data
Detect anomalies in network traffic
Organize documents by topic
This is unsupervised learning. The algorithm finds structure on its own.
If supervised learning is like a student taking an exam with an answer key, unsupervised learning is like that same student being handed a box of unlabeled rocks and told to “sort them into groups that make sense.” There’s no right answer, just patterns waiting to be discovered.
K-Means is the most popular clustering algorithm. It’s like a game of “hot potato” between cluster centers and data points, where each round brings the centers closer to where they “belong.”
1. Pick K random points as initial cluster centers (like randomly placing K flags on a map)
2. Assign each point to the nearest center (each person walks to the closest flag)
3. Update centers to the mean of assigned points (move each flag to the center of its crowd)
4. Repeat steps 2-3 until nothing changes (the flags stop moving, so we’ve converged)
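The key insight: this alternating process of “assign then update” is guaranteed to converge, because neither step can increase the total within-cluster sum of squared distances, and there are only finitely many ways to assign points to clusters. However, it may converge to a local optimum rather than the global one, which is why K-Means is usually run several times from different random starting positions, keeping the best result. In scikit-learn this is controlled by n_init: older releases defaulted to n_init=10, while since version 1.4 the default is n_init='auto', which does a single run with the default k-means++ initialization and 10 runs with init='random'.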
import numpy as np
import matplotlib.pyplot as plt

def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)

    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()

    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)

        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty

        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break
        centers = new_centers

    return labels, centers

# Run our simple K-Means
# (`customers` is assumed to be a NumPy array defined earlier, with columns
# annual spending in $k and store visits per month)
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()
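The convergence argument can be checked numerically. The sketch below is a vectorized variant of the same assign/update loop that records the within-cluster sum of squares after every assignment step and every update step. The helper name `kmeans_objective_trace` and the synthetic three-blob data are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

def kmeans_objective_trace(X, k, max_iters=50, seed=0):
    """Run the assign/update loop and record the objective after each half-step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    trace = []
    for _ in range(max_iters):
        # Assignment step: squared distance from every point to every center
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        trace.append(sq_dists[np.arange(len(X)), labels].sum())

        # Update step: move each non-empty cluster's center to the mean of its points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        trace.append(((X - new_centers[labels]) ** 2).sum())

        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return np.array(trace)

# Synthetic data: three Gaussian blobs (illustrative only)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in [(0, 0), (6, 6), (0, 6)]
])

trace = kmeans_objective_trace(X_demo, k=3)
print(np.round(trace, 1))  # values should never go up from one entry to the next
```

Each entry in `trace` is less than or equal to the previous one (up to floating-point error), which is the property the convergence argument relies on. It says nothing about reaching the global optimum: a different `seed` can end at a different final value.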
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Always scale for clustering!
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
# n_init=10 means: run K-Means 10 times with different random starts,
# keep the best result. This guards against bad initialization.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)

# Inertia = total within-cluster sum of squared distances to center.
# Lower is better, but it always decreases with more clusters
# (K=n_samples gives inertia=0 but is useless).
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

# IMPORTANT: To interpret cluster centers in original units,
# reverse the scaling:
# centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
# This tells you "Cluster 0 = customers who spend ~$20K/year and visit 3x/month"
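The inverse-scaling step mentioned in the last comment can be sketched end to end. Because the real `customers` array is not shown here, the example below uses a hypothetical stand-in, `customers_demo`, with two well-separated synthetic groups; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for `customers`:
# column 0 = annual spending ($k), column 1 = store visits per month
customers_demo = np.vstack([
    rng.normal(loc=(20, 3), scale=(2, 0.5), size=(100, 2)),
    rng.normal(loc=(60, 10), scale=(5, 1.0), size=(100, 2)),
])

scaler = StandardScaler()
scaled = scaler.fit_transform(customers_demo)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(scaled)

# Centers live in scaled space; map them back to the original units
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
for idx, (spend, visits) in enumerate(centers_original):
    print(f"Cluster {idx}: ~${spend:.0f}k/year, ~{visits:.1f} visits/month")
```

The cluster numbering is arbitrary, so which group ends up as "Cluster 0" can differ between runs or library versions; only the center values themselves carry meaning.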
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()
The silhouette score measures how similar a point is to its own cluster vs other clusters. Think of it as asking each data point: “Are you happy in your cluster, or would you rather switch?”
$$s = \frac{b - a}{\max(a, b)}$$
Where:
a = average distance to points in same cluster (cohesion — “how close am I to my own group?”)
b = average distance to points in nearest other cluster (separation — “how far am I from the next group?”)
Range: -1 (bad) to +1 (good)
+1: The point is far from other clusters and close to its own — perfect clustering
0: The point is on the border between two clusters — ambiguous assignment
-1: The point is closer to another cluster than its own — likely misassigned
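To make the definition concrete, the sketch below computes the silhouette value of individual points by hand on a tiny made-up dataset and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_for_point` is a hypothetical name, and it assumes Euclidean distance and at least two points per cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset (illustrative values): two clusters on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

def silhouette_for_point(X, labels, i):
    """s = (b - a) / max(a, b) for point i, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)

    # a: mean distance to the other points in the same cluster
    own = labels == labels[i]
    own[i] = False  # exclude the point itself
    a = dists[own].mean()

    # b: mean distance to the points of the nearest other cluster
    b = min(
        dists[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

# Point 0: a = (1 + 2) / 2 = 1.5, b = (10 + 11 + 12) / 3 = 11
s0 = silhouette_for_point(X_toy, labels_toy, 0)
print(f"{s0:.3f}")  # 0.864, i.e. (11 - 1.5) / 11

manual = np.array([silhouette_for_point(X_toy, labels_toy, i) for i in range(len(X_toy))])
print(np.allclose(manual, silhouette_samples(X_toy, labels_toy)))  # True
```

The overall `silhouette_score` is simply the mean of these per-point values.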
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()
K-Means has three main limitations:
You must specify K upfront (what if you guess wrong?)
Assumes spherical, roughly equal-sized clusters (fails on elongated or ring-shaped groups)
Sensitive to outliers (one extreme point can drag a cluster center far from where it should be)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) addresses all three:
Automatically finds the number of clusters based on data density
Finds clusters of any shape — rings, crescents, irregular blobs
Identifies outliers/noise as points that don’t belong to any cluster
The trade-off: DBSCAN requires you to choose eps (neighborhood radius) and min_samples (minimum density), which can be tricky. And unlike K-Means, DBSCAN struggles when clusters have very different densities — a tight cluster and a sparse cluster might need different eps values.
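A minimal sketch of DBSCAN on a non-spherical dataset, compared against K-Means, might look like the following. It uses scikit-learn's `make_moons` generator rather than the customer data, and the `eps=0.3` and `min_samples=5` values are illustrative guesses that happen to suit this dataset, not general-purpose defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved crescents: a shape K-Means is known to handle poorly
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

# eps and min_samples are assumptions tuned by eye for this dataset
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))

# Adjusted Rand index: 1.0 means perfect agreement with the true crescents
db_ari = adjusted_rand_score(y_true, db_labels)
km_ari = adjusted_rand_score(y_true, km_labels)

print(f"DBSCAN found {n_clusters} clusters and {n_noise} noise points")
print(f"Agreement with true crescents: DBSCAN={db_ari:.2f}, K-Means={km_ari:.2f}")
```

Note that K was never passed to DBSCAN: the number of clusters falls out of the density parameters. Changing `eps` noticeably in either direction can merge the crescents or fragment them, which is the tuning difficulty described above.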
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree at a certain height to get clusters
# Get 3 clusters
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)
# Use clusters as features
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Create cluster features
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)

# At prediction time, build the same feature with kmeans.predict(X_test)
# (not fit_predict), so test points are assigned to the clusters
# learned from the training data.
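Since `X_train` and `y_train` are not defined in the snippet above, here is a self-contained sketch of the same idea that also covers the test set. The data come from scikit-learn's `make_classification` and the helper `add_cluster_feature` is a hypothetical name; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for X_train / y_train
X, y = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the clustering on the training split only
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X_train)

def add_cluster_feature(X_part):
    """Append the cluster ID as one extra column, reusing the fitted centers."""
    cluster_ids = kmeans.predict(X_part).reshape(-1, 1)
    return np.hstack([X_part, cluster_ids])

X_train_enhanced = add_cluster_feature(X_train)
X_test_enhanced = add_cluster_feature(X_test)

clf = RandomForestClassifier(random_state=42).fit(X_train_enhanced, y_train)
accuracy = clf.score(X_test_enhanced, y_test)
print(f"Test accuracy with cluster feature: {accuracy:.3f}")
```

Fitting K-Means on the training split and calling `predict` on the test split keeps information from the test set out of the features. Whether the extra column actually helps is dataset-dependent, so it is worth comparing against a model trained without it.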
Math Connection: Clustering uses distance metrics extensively. Understanding vector similarity helps you choose the right metric.
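As a small illustration of why the metric matters, the sketch below compares Euclidean and cosine distance on three hand-picked vectors. The vectors are made up for the example; the two functions are from `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

vectors = np.array([
    [1.0, 1.0],    # a
    [10.0, 10.0],  # b: same direction as a, much larger magnitude
    [1.0, 0.0],    # c: close to a in space, but pointing in a different direction
])

eu = euclidean_distances(vectors)  # straight-line distance
co = cosine_distances(vectors)     # 1 - cosine similarity (ignores magnitude)

print("Euclidean a-b:", round(eu[0, 1], 3), " a-c:", round(eu[0, 2], 3))
print("Cosine    a-b:", round(co[0, 1], 3), " a-c:", round(co[0, 2], 3))
```

Under Euclidean distance, a is nearest to c; under cosine distance, a is nearest to b. Which answer is "right" depends on whether magnitude carries meaning in your data (for example, total spending) or only direction does (for example, word proportions in a document).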