# Clustering

> Group similar things together - unsupervised learning fundamentals

# Clustering: Unsupervised Learning

## A Different Kind of Problem

So far, all our problems had **labels**:

* House prices (we knew the correct price)
* Spam/not spam (we knew which emails were spam)
* Customer churn (we knew who churned)

But what if you have data **without labels**?

**Real scenarios:**

* Group customers into segments (but you don't know the segments beforehand)
* Find patterns in gene expression data
* Detect anomalies in network traffic
* Organize documents by topic

This is **unsupervised learning**. The algorithm finds structure on its own.

If supervised learning is like a student taking an exam with an answer key, unsupervised learning is like that same student being handed a box of unlabeled rocks and told to "sort them into groups that make sense." There's no single right answer -- just patterns waiting to be discovered.

***

## The Customer Segmentation Problem

Your marketing team wants to send different campaigns to different customer types. But what types exist?

```python
import numpy as np
import matplotlib.pyplot as plt

# Customer data: [annual_spending ($k), store_visits_per_month]
np.random.seed(42)

# Generate 3 natural clusters (but we pretend we don't know this!)
# Budget shoppers: low spending, low visits
budget = np.random.randn(50, 2) * [5, 2] + [20, 3]

# Regular customers: moderate spending, moderate visits
regular = np.random.randn(60, 2) * [8, 3] + [50, 8]

# Premium customers: high spending, high visits
premium = np.random.randn(40, 2) * [10, 2] + [100, 15]

# Combine (in real life, you wouldn't know which cluster each point belongs to)
customers = np.vstack([budget, regular, premium])

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], alpha=0.6)
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('Customer Data - Can You See the Groups?')
plt.grid(True)
plt.show()
```

Looking at the scatter plot, you can probably see 3 groups. But how do we find them automatically?

***

## K-Means Clustering

K-Means is the most popular clustering algorithm. It's like a game of "hot potato" between cluster centers and data points, where each round brings the centers closer to where they "belong."

### The Algorithm (Simple Version)

1. **Pick K random points** as initial cluster centers (like randomly placing K flags on a map)
2. **Assign each point** to the nearest center (each person walks to the closest flag)
3. **Update centers** to the mean of assigned points (move each flag to the center of its crowd)
4. **Repeat** steps 2-3 until nothing changes (the flags stop moving -- we've converged)

The key insight: this alternating process of "assign then update" is guaranteed to converge, because neither step can increase the total within-cluster sum of squared distances, and there are only finitely many ways to assign points to clusters.
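The convergence argument can be checked numerically. The sketch below is a minimal NumPy version of the assign/update loop; the helper name `kmeans_inertia_trace` and the synthetic three-blob data are illustrative assumptions, not part of the lesson's code. It records the within-cluster sum of squares after every assignment step, so you can confirm that the value never goes up.

```python
import numpy as np

def kmeans_inertia_trace(X, k, n_iters=20, seed=0):
    """Run the assign/update loop and return the inertia after each assignment step."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    trace = []
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        trace.append(d2[np.arange(len(X)), labels].sum())
        # Update step: move each center to the mean of its points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return np.array(trace)

# Hypothetical data: three blobs in 2D
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal([0, 0], 1.0, size=(50, 2)),
    rng.normal([6, 6], 1.0, size=(50, 2)),
    rng.normal([0, 6], 1.0, size=(50, 2)),
])

trace = kmeans_inertia_trace(X_demo, k=3)
print(np.round(trace, 1))
```

The printed sequence should be non-increasing and then flat once the assignments stop changing. Different `seed` values can flatten out at different final values, which is the local-optimum behaviour.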
However, it may converge to a *local* optimum, not the global one -- which is why scikit-learn can run K-Means several times from different starting positions (the `n_init` parameter) and keep the best result. Older releases defaulted to `n_init=10`; since scikit-learn 1.4 the default is `n_init='auto'`, so the examples here set `n_init=10` explicitly.

```python
def simple_kmeans(X, k, max_iters=100):
    """
    Simple K-Means implementation.
    """
    n_samples = len(X)

    # Step 1: Random initialization
    random_indices = np.random.choice(n_samples, k, replace=False)
    centers = X[random_indices].copy()

    for iteration in range(max_iters):
        # Step 2: Assign points to nearest center
        labels = np.zeros(n_samples, dtype=int)
        for i, point in enumerate(X):
            distances = [np.linalg.norm(point - center) for center in centers]
            labels[i] = np.argmin(distances)

        # Step 3: Update centers to mean of assigned points
        new_centers = np.zeros_like(centers)
        for j in range(k):
            cluster_points = X[labels == j]
            if len(cluster_points) > 0:
                new_centers[j] = cluster_points.mean(axis=0)
            else:
                new_centers[j] = centers[j]  # Keep old center if cluster is empty

        # Check convergence
        if np.allclose(centers, new_centers):
            print(f"Converged at iteration {iteration}")
            break

        centers = new_centers

    return labels, centers

# Run our simple K-Means
labels, centers = simple_kmeans(customers, k=3)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(customers[:, 0], customers[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.xlabel('Annual Spending ($k)')
plt.ylabel('Store Visits per Month')
plt.title('K-Means Clustering Result')
plt.legend()
plt.grid(True)
plt.show()
```

***

## Using scikit-learn

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale before clustering: K-Means uses Euclidean distance,
# so features with large ranges would otherwise dominate.
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# K-Means with 3 clusters
# n_init=10 means: run K-Means 10 times with different random starts,
# keep the best result. This guards against bad initialization.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(customers_scaled)

print("Cluster assignments:", labels[:10])
print("Cluster centers (scaled):\n", kmeans.cluster_centers_)

# Inertia = total within-cluster sum of squared distances to center.
# Lower is better, but it always decreases with more clusters
# (K=n_samples gives inertia=0 but is useless).
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

# IMPORTANT: To interpret cluster centers in original units,
# reverse the scaling:
# centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
# This tells you e.g. "Cluster 0 = customers who spend ~$20K/year and visit 3x/month"
```

***

## Choosing K: The Elbow Method

How many clusters should we use? Fit K-Means for a range of K values, plot the inertia, and look for the point where the curve bends.

```python
from sklearn.cluster import KMeans

# Try different values of k
k_values = range(1, 11)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customers_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'bo-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)

# Look for the "elbow" - where adding more clusters doesn't help much
# (the line at K=3 is placed by eye for this dataset)
plt.axvline(x=3, color='r', linestyle='--', label='Elbow at K=3')
plt.legend()
plt.show()
```

***

## Silhouette Score: Better Cluster Evaluation

The silhouette score measures how similar a point is to its own cluster vs other clusters. Think of it as asking each data point: "Are you happy in your cluster, or would you rather switch?"
$$
s = \frac{b - a}{\max(a, b)}
$$

Where:

* $a$ = average distance to the other points in the same cluster (cohesion -- "how close am I to my own group?")
* $b$ = average distance to points in the nearest other cluster (separation -- "how far am I from the next group?")

Range: -1 (bad) to +1 (good)

* **+1**: The point is far from other clusters and close to its own -- well clustered
* **0**: The point is on the border between two clusters -- ambiguous assignment
* **-1**: The point is closer to another cluster than its own -- likely misassigned

`silhouette_score` reports the mean of $s$ over all points.

```python
from sklearn.metrics import silhouette_score, silhouette_samples

# Calculate silhouette score for different k values
k_values = range(2, 11)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(customers_scaled)
    score = silhouette_score(customers_scaled, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, 'go-', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()
```

***

## DBSCAN: Density-Based Clustering

K-Means has problems:

* You must specify K upfront (what if you guess wrong?)
* Assumes spherical, roughly equal-sized clusters (fails on elongated or ring-shaped groups)
* Sensitive to outliers (one extreme point can drag a cluster center far from where it should be)

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) addresses all three:

* Automatically finds the number of clusters based on data density
* Finds clusters of any shape -- rings, crescents, irregular blobs
* Identifies outliers/noise as points that don't belong to any cluster

The trade-off: DBSCAN requires you to choose `eps` (neighborhood radius) and `min_samples` (minimum density), which can be tricky.
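To get a feel for how sensitive DBSCAN is to `eps`, the following sketch runs it on a two-moons dataset with a far-too-small, a moderate, and a far-too-large radius. The helper `summarize` and the specific `eps` values are illustrative assumptions chosen to show the extremes, not recommended settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles with a little noise
X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

def summarize(eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one eps value."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps in [0.001, 0.2, 10.0]:
    n_clusters, n_noise = summarize(eps)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With a tiny radius no point has enough neighbors, so everything is labeled noise; with a huge radius every point is a neighbor of every other and the data collapses into one cluster. Useful values live in between, which is why `eps` needs tuning.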
Unlike K-Means, DBSCAN also struggles when clusters have very different densities -- a tight cluster and a sparse cluster might need different `eps` values.

### How DBSCAN Works

1. For each point, count neighbors within radius `eps` (scikit-learn counts the point itself)
2. If count >= `min_samples`, it's a **core point**
3. Connect core points that are neighbors
4. Non-core points near core points are **border points**
5. Everything else is **noise**

```python
from sklearn.cluster import DBSCAN

# Create more interesting data with different shaped clusters
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# K-Means fails on non-spherical clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_moons)

# DBSCAN handles them well
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_moons)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=kmeans_labels, cmap='viridis')
axes[0].set_title('K-Means (Fails on Moons)')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=dbscan_labels, cmap='viridis')
axes[1].set_title('DBSCAN (Works!)')

plt.tight_layout()
plt.show()
```

***

## Choosing DBSCAN Parameters

```python
# eps: radius of neighborhood
# min_samples: minimum points (including the point itself) within eps for a core point
# Heuristic: min_samples = 2 * n_features
# Heuristic: eps from k-distance graph

from sklearn.neighbors import NearestNeighbors

# Find a reasonable eps using the k-distance graph
k = 5  # same value as min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X_moons)

# Querying the training data returns each point as its own first neighbor
# (distance 0), so column k-1 is the distance to the k-th point counting itself.
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, k-1])

plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph - Look for the "Elbow"')
plt.grid(True)
plt.show()
```

***

## Hierarchical Clustering

Builds a tree (dendrogram) of clusters:

```python
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using a smaller dataset for visualization
np.random.seed(42)
X_small = np.random.randn(15, 2) * 2 + np.array([[0, 0], [5, 5], [0, 5]]).repeat(5, axis=0)

# Create dendrogram
linkage_matrix = linkage(X_small, method='ward')

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Cut the tree to get flat clusters
from scipy.cluster.hierarchy import fcluster

# Get 3 clusters
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print("Cluster labels:", labels)
```

***

## Practical Example: Customer Segmentation

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Create realistic customer data
np.random.seed(42)
n_customers = 500

data = pd.DataFrame({
    'annual_spending': np.random.exponential(50, n_customers) + 10,
    'frequency': np.random.poisson(5, n_customers) + 1,
    'avg_basket': np.random.gamma(2, 25, n_customers),
    'days_since_purchase': np.random.exponential(30, n_customers),
    'items_per_visit': np.random.poisson(3, n_customers) + 1,
})

print("Customer data sample:")
print(data.head())
print("\nStatistics:")
print(data.describe())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# K is fixed at 4 here; in practice, choose it with the silhouette loop shown earlier
best_k = 4
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Analyze clusters
print("\nCluster Profiles:")
cluster_profiles = data.groupby('cluster').mean()
print(cluster_profiles.round(2))

# Name the segments based on characteristics
# (illustrative names; assign them after inspecting cluster_profiles)
segment_names = {
    0: 'Occasional Shoppers',
    1: 'High-Value Regulars',
    2: 'Bargain Hunters',
    3: 'VIP Customers'
}
```

***

## Comparison: When to Use What

| Algorithm        | Best For                    | Pros                                    | Cons                              |
| ---------------- | --------------------------- | --------------------------------------- | --------------------------------- |
| K-Means          | Spherical clusters, known K | Fast, simple                            | Must specify K, assumes spherical |
| DBSCAN           | Arbitrary shapes, noise     | Finds K automatically, handles outliers | Sensitive to parameters           |
| Hierarchical     | Small data, need hierarchy  | Visual dendrogram, no K needed          | Slow for large data               |
| Gaussian Mixture | Soft clustering, elliptical | Probability outputs                     | Can overfit                       |

***

## Connection to Supervised Learning

Clustering can help supervised learning:

```python
# Use clusters as features
# (X_train and y_train are assumed to be an existing training split)
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Create cluster features
kmeans = KMeans(n_clusters=5, random_state=42)
cluster_features = kmeans.fit_predict(X_train).reshape(-1, 1)

# Add to original features
X_train_enhanced = np.hstack([X_train, cluster_features])

# Now use in a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_enhanced, y_train)

# At prediction time, build the same feature with kmeans.predict(X_test)
```

**Math Connection**: Clustering uses distance metrics extensively. Understanding [vector similarity](/courses/math-for-ml-linear-algebra/02-vectors) helps you choose the right metric.

***

## 🚀 Mini Projects

* Segment e-commerce customers for targeted marketing
* Compress images using K-Means clustering
* Detect outliers using DBSCAN
* Organize documents by topic automatically

### Project 1: Customer Segmentation

Segment customers based on purchasing behavior for targeted marketing campaigns.

View Complete Solution

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Step 1: Generate synthetic customer data (RFM Analysis)
np.random.seed(42)
n_customers = 1000

data = {
    'customer_id': range(1, n_customers + 1),
    'recency': np.random.exponential(30, n_customers).clip(1, 365),
    'frequency': np.random.poisson(5, n_customers).clip(1, 50),
    'monetary': np.random.lognormal(5, 1, n_customers).clip(10, 10000),
    'avg_basket_size': np.random.lognormal(3.5, 0.5, n_customers),
    'tenure_days': np.random.randint(30, 1000, n_customers),
}

df = pd.DataFrame(data)

print("="*60)
print("👥 CUSTOMER SEGMENTATION")
print("="*60)
print(f"Total customers: {len(df)}")

# Step 2: Feature Engineering for RFM
df['frequency_rate'] = df['frequency'] / (df['tenure_days'] / 30)
df['avg_order_value'] = df['monetary'] / df['frequency']
df['customer_value'] = df['monetary'] / (df['recency'] + 1)

# Select features for clustering
features = ['recency', 'frequency', 'monetary', 'avg_basket_size']
X = df[features]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\nFeatures used for clustering:")
for f in features:
    print(f"  - {f}")

# Step 3: Find optimal number of clusters
print("\n1️⃣ FINDING OPTIMAL CLUSTERS")
print("-"*40)

inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, kmeans.labels_))
    print(f"K={k}: Inertia={kmeans.inertia_:.0f}, Silhouette={silhouettes[-1]:.3f}")

optimal_k = K_range[np.argmax(silhouettes)]
print(f"\nOptimal K (by silhouette): {optimal_k}")

# Step 4: Final clustering
# (K is fixed at 4 to match the four named segments below; compare with optimal_k)
print("\n2️⃣ CLUSTERING CUSTOMERS")
print("-"*40)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(X_scaled)

# Step 5: Analyze segments
print("\n3️⃣ SEGMENT ANALYSIS")
print("-"*40)

segment_summary = df.groupby('segment')[features + ['customer_id']].agg({
    'customer_id': 'count',
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean',
    'avg_basket_size': 'mean'
}).round(2)
segment_summary.columns = ['count', 'avg_recency', 'avg_frequency', 'avg_monetary', 'avg_basket']
print(segment_summary)

# Step 6: Name segments based on characteristics
segment_names = {}
for seg in range(4):
    seg_data = segment_summary.loc[seg]

    # High value: high monetary, high frequency
    if seg_data['avg_monetary'] > segment_summary['avg_monetary'].median() and \
       seg_data['avg_frequency'] > segment_summary['avg_frequency'].median():
        segment_names[seg] = "💎 Champions"
    # At risk: high recency (inactive), high monetary
    elif seg_data['avg_recency'] > segment_summary['avg_recency'].median() and \
         seg_data['avg_monetary'] > segment_summary['avg_monetary'].median():
        segment_names[seg] = "⚠️ At Risk"
    # New customers: low frequency, low recency
    elif seg_data['avg_frequency'] < segment_summary['avg_frequency'].median() and \
         seg_data['avg_recency'] < segment_summary['avg_recency'].median():
        segment_names[seg] = "🌱 New Customers"
    else:
        segment_names[seg] = "📊 Regular"

df['segment_name'] = df['segment'].map(segment_names)

print("\n4️⃣ SEGMENT NAMES")
print("-"*40)
for seg, name in segment_names.items():
    count = len(df[df['segment'] == seg])
    pct = count / len(df) * 100
    print(f"Segment {seg}: {name} ({count} customers, {pct:.1f}%)")

# Step 7: Marketing recommendations
print("\n5️⃣ MARKETING RECOMMENDATIONS")
print("-"*40)

recommendations = {
    "💎 Champions": "VIP treatment, exclusive offers, loyalty rewards",
    "⚠️ At Risk": "Win-back campaigns, special discounts, personal outreach",
    "🌱 New Customers": "Onboarding emails, first-purchase discounts, welcome series",
    "📊 Regular": "Cross-sell/up-sell, engagement programs, referral incentives"
}

for name, rec in recommendations.items():
    if name in segment_names.values():
        print(f"\n{name}:")
        print(f"  → {rec}")

# Step 8: Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Segment sizes
ax1 = axes[0, 0]
segment_counts = df['segment_name'].value_counts()
ax1.pie(segment_counts, labels=segment_counts.index, autopct='%1.1f%%')
ax1.set_title('Customer Segment Distribution')

# RFM plot
ax2 = axes[0, 1]
colors = ['blue', 'red', 'green', 'orange']
for seg in range(4):
    mask = df['segment'] == seg
    ax2.scatter(df[mask]['recency'], df[mask]['monetary'],
                c=colors[seg], label=segment_names[seg], alpha=0.5)
ax2.set_xlabel('Recency (days)')
ax2.set_ylabel('Monetary ($)')
ax2.set_title('Recency vs Monetary by Segment')
ax2.legend()

# Frequency vs Monetary
ax3 = axes[1, 0]
for seg in range(4):
    mask = df['segment'] == seg
    ax3.scatter(df[mask]['frequency'], df[mask]['monetary'],
                c=colors[seg], label=segment_names[seg], alpha=0.5)
ax3.set_xlabel('Frequency')
ax3.set_ylabel('Monetary ($)')
ax3.set_title('Frequency vs Monetary by Segment')

# Segment averages
ax4 = axes[1, 1]
segment_summary[['avg_recency', 'avg_frequency', 'avg_monetary']].plot(
    kind='bar', ax=ax4
)
ax4.set_xlabel('Segment')
ax4.set_title('Segment Characteristics')
ax4.legend(loc='upper right')

plt.tight_layout()
plt.savefig('customer_segments.png', dpi=150)
print("\n✅ Visualization saved!")
```

**What you learned:**

* RFM analysis for customer segmentation
* Interpreting clusters in business context
* Translating segments into actionable marketing strategies
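Interpreting clusters in a business context is easier when the centers are expressed in the original units rather than in standardized ones. The sketch below shows one way to do that with `StandardScaler.inverse_transform`; the two-feature data (spending in $k and visits per month) is a made-up assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual_spending ($k), visits_per_month]
rng = np.random.default_rng(0)
low = rng.normal([20, 3], [2, 0.5], size=(50, 2))
high = rng.normal([100, 15], [2, 0.5], size=(50, 2))
X = np.vstack([low, high])

# Cluster on standardized features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_scaled)

# Map the centers back to dollars and visits so they can be read directly
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)

# Sort by spending so the output order does not depend on arbitrary cluster IDs
centers_sorted = centers_original[np.argsort(centers_original[:, 0])]
for spending, visits in centers_sorted:
    print(f"~${spending:.0f}k per year, ~{visits:.1f} visits per month")
```

Cluster IDs are arbitrary, so sorting (or naming) segments by a meaningful feature keeps reports stable between runs.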

### Project 2: Image Color Quantization Use K-Means to reduce the number of colors in an image (compression).

View Complete Solution

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 1: Create a synthetic "image" (or use real image with PIL)
np.random.seed(42)

# Create a simple image with distinct color regions
height, width = 200, 300

# Use a signed integer type while building the image so that adding
# negative noise cannot wrap around before we clip to [0, 255]
image = np.zeros((height, width, 3), dtype=np.int16)

# Add color regions
image[:100, :150] = [255, 100, 100]   # Red region
image[:100, 150:] = [100, 255, 100]   # Green region
image[100:, :150] = [100, 100, 255]   # Blue region
image[100:, 150:] = [255, 255, 100]   # Yellow region

# Add per-pixel noise, then clip and convert to 8-bit
noise = np.random.randint(-30, 30, size=image.shape)
image = np.clip(image + noise, 0, 255).astype(np.uint8)

print("="*60)
print("🎨 IMAGE COLOR QUANTIZATION")
print("="*60)
print(f"Image shape: {image.shape}")
print(f"Original unique colors: {len(np.unique(image.reshape(-1, 3), axis=0))}")

# Step 2: Reshape image for clustering
# From (height, width, 3) to (height*width, 3)
pixels = image.reshape(-1, 3).astype(float)
print(f"Pixels to cluster: {len(pixels)}")

# Step 3: Apply K-Means for different numbers of colors
n_colors_list = [2, 4, 8, 16]
results = {}

print("\nQuantizing colors...")
for n_colors in n_colors_list:
    print(f"\nK = {n_colors} colors:")

    kmeans = KMeans(n_clusters=n_colors, random_state=42, n_init=10)
    kmeans.fit(pixels)

    # Replace each pixel with its cluster center
    new_colors = kmeans.cluster_centers_[kmeans.labels_]
    quantized = new_colors.reshape(image.shape).astype(np.uint8)

    # Calculate compression ratio (simplified)
    original_bits = height * width * 3 * 8  # 8 bits per channel
    quantized_bits = height * width * np.log2(n_colors) + n_colors * 3 * 8
    compression = original_bits / quantized_bits

    results[n_colors] = {
        'image': quantized,
        'centers': kmeans.cluster_centers_.astype(np.uint8),
        'compression': compression,
        'inertia': kmeans.inertia_
    }

    print(f"  Compression ratio: {compression:.1f}x")
    print(f"  Distortion (inertia): {kmeans.inertia_:.0f}")

# Step 4: Visualize results
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Original
axes[0, 0].imshow(image)
axes[0, 0].set_title('Original Image')
axes[0, 0].axis('off')

# Quantized versions
for i, n_colors in enumerate(n_colors_list[:4]):
    row = (i + 1) // 3
    col = (i + 1) % 3
    axes[row, col].imshow(results[n_colors]['image'])
    axes[row, col].set_title(f'{n_colors} Colors ({results[n_colors]["compression"]:.1f}x compression)')
    axes[row, col].axis('off')

# Color palette for one example
ax_palette = axes[1, 2]
palette = results[8]['centers']
for i, color in enumerate(palette):
    ax_palette.add_patch(plt.Rectangle((i, 0), 1, 1, color=color/255))
ax_palette.set_xlim(0, len(palette))
ax_palette.set_ylim(0, 1)
ax_palette.set_title('8-Color Palette')
ax_palette.axis('off')

plt.tight_layout()
plt.savefig('color_quantization.png', dpi=150)

# Step 5: Quality vs Compression tradeoff
print("\n📊 QUALITY VS COMPRESSION TRADEOFF")
print("-"*40)
for n_colors, res in results.items():
    print(f"K={n_colors:2d}: Compression={res['compression']:5.1f}x, Distortion={res['inertia']:,.0f}")

# Step 6: Find optimal K using elbow method
print("\n🔍 Finding Optimal K...")
inertias = []
for k in range(2, 33):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=5)
    kmeans.fit(pixels)
    inertias.append(kmeans.inertia_)

# Plot elbow
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(range(2, 33), inertias, 'b-', marker='o')
ax.set_xlabel('Number of Colors (K)')
ax.set_ylabel('Distortion (Inertia)')
ax.set_title('Elbow Method for Optimal K')
ax.axvline(x=8, color='r', linestyle='--', label='Suggested K=8')
ax.legend()
plt.savefig('elbow_colors.png', dpi=150)

print("\n✅ Color quantization complete!")
print("💡 K=8 often provides good balance between quality and compression")
```

**What you learned:**

* K-Means works well for color quantization
* Practical trade-off between quality and compression
* Cluster centers become the color palette
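The simplified compression ratio used in the solution is plain arithmetic and can be worked through without running K-Means. The helper below, `compression_ratio`, is a hypothetical name; unlike the solution, it rounds the per-pixel index size up to a whole number of bits, which is an assumption that only matters when K is not a power of two.

```python
import math

def compression_ratio(height, width, n_colors, channels=3, bits_per_channel=8):
    """Ratio of raw RGB size to (per-pixel palette index + palette table) size."""
    original_bits = height * width * channels * bits_per_channel
    # Each pixel stores an index into the palette
    index_bits = height * width * math.ceil(math.log2(n_colors))
    # The palette itself stores one full color per cluster center
    palette_bits = n_colors * channels * bits_per_channel
    return original_bits / (index_bits + palette_bits)

for k in [2, 4, 8, 16]:
    print(f"K={k:2d}: {compression_ratio(200, 300, k):.2f}x")
```

For the 200x300 image and K=8: the raw image needs 200 * 300 * 24 = 1,440,000 bits, the indices need 60,000 * 3 = 180,000 bits, and the palette needs 8 * 24 = 192 bits, giving roughly 7.99x. The palette cost is negligible here, so the ratio is close to 24 / log2(K).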

### Project 3: Anomaly Detection System Use DBSCAN to detect anomalies in network traffic or transaction data.

View Complete Solution

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Step 1: Generate synthetic network traffic data
np.random.seed(42)
n_normal = 1000
n_anomalies = 50

# Normal traffic patterns
normal_data = {
    'bytes_sent': np.random.lognormal(8, 0.5, n_normal),
    'bytes_received': np.random.lognormal(9, 0.5, n_normal),
    'duration': np.random.exponential(30, n_normal),
    'packets': np.random.poisson(100, n_normal),
    'connections': np.random.poisson(5, n_normal),
}

# Anomaly patterns (suspicious behavior)
anomaly_data = {
    'bytes_sent': np.random.lognormal(12, 0.5, n_anomalies),      # Data exfiltration
    'bytes_received': np.random.lognormal(6, 0.5, n_anomalies),   # Low receive
    'duration': np.random.exponential(5, n_anomalies),            # Quick bursts
    'packets': np.random.poisson(1000, n_anomalies),              # High packet count
    'connections': np.random.poisson(50, n_anomalies),            # Many connections
}

# Combine
df_normal = pd.DataFrame(normal_data)
df_normal['is_anomaly'] = 0
df_anomaly = pd.DataFrame(anomaly_data)
df_anomaly['is_anomaly'] = 1

df = pd.concat([df_normal, df_anomaly], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle

print("="*60)
print("🔍 ANOMALY DETECTION SYSTEM")
print("="*60)
print(f"Total records: {len(df)}")
print(f"Actual anomalies: {df['is_anomaly'].sum()} ({df['is_anomaly'].mean()*100:.1f}%)")

# Step 2: Feature preparation
features = ['bytes_sent', 'bytes_received', 'duration', 'packets', 'connections']
X = df[features].values

# Add derived features
df['bytes_ratio'] = df['bytes_sent'] / (df['bytes_received'] + 1)
df['packets_per_second'] = df['packets'] / (df['duration'] + 1)
df['bytes_per_connection'] = (df['bytes_sent'] + df['bytes_received']) / (df['connections'] + 1)

features_extended = features + ['bytes_ratio', 'packets_per_second', 'bytes_per_connection']
X = df[features_extended].values

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Find optimal eps using k-distance graph
print("\n1️⃣ FINDING OPTIMAL EPSILON")
print("-"*40)

k = 5  # min_samples
# Ask for k+1 neighbors because each point is returned as its own nearest neighbor
nbrs = NearestNeighbors(n_neighbors=k+1).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])

# Plot k-distance graph
plt.figure(figsize=(10, 5))
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{k}-distance')
plt.title('K-Distance Graph for Epsilon Selection')
plt.axhline(y=2.5, color='r', linestyle='--', label='Suggested eps=2.5')
plt.legend()
plt.savefig('k_distance.png', dpi=150)

# Step 4: Apply DBSCAN
print("\n2️⃣ APPLYING DBSCAN")
print("-"*40)

eps_values = [1.5, 2.0, 2.5, 3.0]
results = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X_scaled)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()

    # Evaluate against known anomalies
    detected_anomalies = labels == -1
    true_anomalies = df['is_anomaly'] == 1

    true_positives = (detected_anomalies & true_anomalies).sum()
    false_positives = (detected_anomalies & ~true_anomalies).sum()
    false_negatives = (~detected_anomalies & true_anomalies).sum()

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    results.append({
        'eps': eps,
        'n_clusters': n_clusters,
        'n_noise': n_noise,
        'precision': precision,
        'recall': recall,
        'f1': f1
    })

    print(f"eps={eps}: Clusters={n_clusters}, Noise={n_noise}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")

# Step 5: Use best eps
best_result = max(results, key=lambda x: x['f1'])
print(f"\nBest eps: {best_result['eps']} (F1={best_result['f1']:.2f})")

dbscan = DBSCAN(eps=best_result['eps'], min_samples=5)
df['dbscan_label'] = dbscan.fit_predict(X_scaled)
df['detected_anomaly'] = (df['dbscan_label'] == -1).astype(int)

# Step 6: Analyze results
print("\n3️⃣ DETECTION RESULTS")
print("-"*40)

confusion = pd.crosstab(df['is_anomaly'], df['detected_anomaly'],
                        rownames=['Actual'], colnames=['Detected'])
print("Confusion Matrix:")
print(confusion)

# Step 7: Examine detected anomalies
print("\n4️⃣ ANOMALY CHARACTERISTICS")
print("-"*40)

detected = df[df['detected_anomaly'] == 1]
normal = df[df['detected_anomaly'] == 0]

print("\nNormal Traffic (average):")
for f in features[:5]:
    print(f"  {f}: {normal[f].mean():.2f}")

print("\nDetected Anomalies (average):")
for f in features[:5]:
    print(f"  {f}: {detected[f].mean():.2f}")

# Step 8: Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Bytes sent vs received
ax1 = axes[0]
normal_mask = df['detected_anomaly'] == 0
ax1.scatter(df[normal_mask]['bytes_sent'], df[normal_mask]['bytes_received'],
            c='blue', alpha=0.5, label='Normal')
ax1.scatter(df[~normal_mask]['bytes_sent'], df[~normal_mask]['bytes_received'],
            c='red', alpha=0.7, label='Anomaly')
ax1.set_xlabel('Bytes Sent')
ax1.set_ylabel('Bytes Received')
ax1.set_title('Traffic Pattern')
ax1.legend()
ax1.set_xscale('log')
ax1.set_yscale('log')

# Packets vs connections
ax2 = axes[1]
ax2.scatter(df[normal_mask]['packets'], df[normal_mask]['connections'],
            c='blue', alpha=0.5, label='Normal')
ax2.scatter(df[~normal_mask]['packets'], df[~normal_mask]['connections'],
            c='red', alpha=0.7, label='Anomaly')
ax2.set_xlabel('Packets')
ax2.set_ylabel('Connections')
ax2.set_title('Network Behavior')
ax2.legend()

# Detection metrics for the best eps
ax3 = axes[2]
metrics = ['Precision', 'Recall', 'F1']
values = [best_result['precision'], best_result['recall'], best_result['f1']]
ax3.bar(metrics, values, color=['steelblue', 'coral', 'green'])
ax3.set_ylabel('Score')
ax3.set_title('Detection Performance')
ax3.set_ylim(0, 1)
for i, v in enumerate(values):
    ax3.text(i, v + 0.02, f'{v:.2f}', ha='center')

plt.tight_layout()
plt.savefig('anomaly_detection.png', dpi=150)

print("\n✅ Anomaly detection complete!")
```

**What you learned:**

* DBSCAN labels outliers as noise (label -1)
* The eps parameter is crucial - use k-distance graph to find it
* Real-world anomaly detection requires domain knowledge for threshold tuning
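The noise label is easiest to see on a dataset small enough to check by hand. In this sketch the nine points, `eps=0.5`, and `min_samples=3` are all made-up values for illustration: two tight groups of four points each, plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one far-away point
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],      # group near (0, 0)
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],      # group near (5, 5)
    [10.0, -10.0],                                       # isolated point
])

# Every grouped point has 4 points (itself included) within eps, so it is a core point.
# The isolated point has only itself within eps, so it cannot join any cluster.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)  # [ 0  0  0  0  1  1  1  1 -1]

# Boolean mask of anomalies, as used in the project solution
is_anomaly = labels == -1
print("Anomalous rows:", np.flatnonzero(is_anomaly))
```

Because noise is just another label value, the cluster count has to exclude it, which is why the solution subtracts one when `-1` is present.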

### Project 4: Document Clustering Automatically organize documents by topic using clustering.

View Complete Solution

```python theme={null}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from collections import Counter

# Step 1: Create sample documents
documents = [
    # Technology/AI
    "Machine learning algorithms are transforming data analysis",
    "Neural networks can recognize patterns in images",
    "Artificial intelligence is revolutionizing healthcare",
    "Deep learning models require large datasets",
    "Natural language processing enables chatbots",

    # Sports
    "The basketball team won the championship game",
    "Football players train hard for the season",
    "Tennis requires excellent hand-eye coordination",
    "Soccer is the most popular sport worldwide",
    "Athletes follow strict diet and exercise routines",

    # Finance
    "Stock market shows signs of recovery",
    "Investment strategies for long-term growth",
    "Cryptocurrency trading has increased volatility",
    "Interest rates affect mortgage payments",
    "Portfolio diversification reduces risk",

    # Health
    "Regular exercise improves cardiovascular health",
    "Balanced diet is essential for wellness",
    "Mental health awareness is growing",
    "Vaccines protect against infectious diseases",
    "Sleep quality affects overall wellbeing",

    # Travel
    "Beaches in Thailand attract many tourists",
    "European cities offer rich cultural experiences",
    "Adventure travel is gaining popularity",
    "Airlines offering more domestic routes",
    "Sustainable tourism practices are important",
]

# True labels for evaluation
true_labels = [0]*5 + [1]*5 + [2]*5 + [3]*5 + [4]*5
topic_names = ['Technology', 'Sports', 'Finance', 'Health', 'Travel']

print("="*60)
print("📄 DOCUMENT CLUSTERING")
print("="*60)
print(f"Total documents: {len(documents)}")
print(f"Topics: {topic_names}")

# Step 2: Convert text to numerical features using TF-IDF
print("\n1️⃣ TEXT VECTORIZATION")
print("-"*40)

vectorizer = TfidfVectorizer(
    max_features=100,
    stop_words='english',
    ngram_range=(1, 2)
)
X = vectorizer.fit_transform(documents)

print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Document-term matrix shape: {X.shape}")

# Show top terms
feature_names = vectorizer.get_feature_names_out()
print(f"\nSample features: {list(feature_names[:10])}")

# Step 3: Find optimal number of clusters
print("\n2️⃣ FINDING OPTIMAL K")
print("-"*40)

silhouette_scores = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score = {score:.3f}")

optimal_k = range(2, 10)[np.argmax(silhouette_scores)]
print(f"\nOptimal K: {optimal_k}")

# Step 4: Cluster documents
print("\n3️⃣ CLUSTERING DOCUMENTS")
print("-"*40)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
predicted_labels = kmeans.fit_predict(X)

# Step 5: Analyze clusters
print("\n4️⃣ CLUSTER ANALYSIS")
print("-"*40)

# Get top terms for each cluster
def get_top_terms(cluster_idx, n_terms=5):
    """Get top terms for a cluster"""
    cluster_center = kmeans.cluster_centers_[cluster_idx]
    top_indices = cluster_center.argsort()[-n_terms:][::-1]
    return [feature_names[i] for i in top_indices]

for i in range(5):
    cluster_docs = [documents[j] for j in range(len(documents)) if predicted_labels[j] == i]
    top_terms = get_top_terms(i)

    print(f"\nCluster {i} ({len(cluster_docs)} documents):")
    print(f"  Top terms: {', '.join(top_terms)}")
    print(f"  Sample: {cluster_docs[0][:50]}...")

# Step 6: Evaluate clustering (compare with true labels)
print("\n5️⃣ CLUSTERING EVALUATION")
print("-"*40)

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

ari = adjusted_rand_score(true_labels, predicted_labels)
nmi = normalized_mutual_info_score(true_labels, predicted_labels)

print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")
print(f"Silhouette Score: {silhouette_score(X, predicted_labels):.3f}")

# Step 7: Visualization using PCA
print("\n6️⃣ VISUALIZING CLUSTERS")
print("-"*40)

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X.toarray())

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot predicted clusters
ax1 = axes[0]
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i in range(5):
    mask = predicted_labels == i
    ax1.scatter(X_2d[mask, 0], X_2d[mask, 1], c=colors[i], label=f'Cluster {i}', alpha=0.7)
ax1.set_title('Predicted Clusters')
ax1.legend()
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')

# Plot true labels
ax2 = axes[1]
for i, topic in enumerate(topic_names):
    mask = np.array(true_labels) == i
    ax2.scatter(X_2d[mask, 0], X_2d[mask, 1], c=colors[i], label=topic, alpha=0.7)
ax2.set_title('True Topics')
ax2.legend()
ax2.set_xlabel('PC1')
ax2.set_ylabel('PC2')

plt.tight_layout()
plt.savefig('document_clusters.png', dpi=150)

# Step 8: Assign topic names to clusters
print("\n7️⃣ AUTOMATIC TOPIC LABELING")
print("-"*40)

cluster_topics = {}
for i in range(5):
    top_terms = get_top_terms(i, n_terms=3)
    cluster_topics[i] = ' / '.join(top_terms)
    print(f"Cluster {i}: {cluster_topics[i]}")

print("\n✅ Document clustering complete!")
```

**What you learned:**

* TF-IDF converts text to numerical features for clustering
* Cluster centers reveal the main terms for each topic
* Dimensionality reduction helps visualize document clusters
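To make the first point above concrete, here is a minimal sketch of what TF-IDF does to raw text. The three toy documents are hypothetical and are not part of the exercise data; the sketch only assumes scikit-learn's `TfidfVectorizer` and `cosine_similarity`. It shows that documents sharing vocabulary end up close together in the vector space, which is what lets a distance-based algorithm group them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents (hypothetical, chosen only for illustration):
# two about finance that share several words, one about sports.
docs = [
    "stock market investment growth",
    "investment strategies for stock growth",
    "basketball team won the game",
]

# Each document becomes one row of a sparse matrix; each column is a term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all documents, shape (3, 3).
sim = cosine_similarity(X)
finance_pair = sim[0, 1]   # the two finance documents
cross_topic = sim[0, 2]    # finance vs. sports

print(f"Matrix shape: {X.shape}")
print(f"Finance vs. finance similarity: {finance_pair:.3f}")
print(f"Finance vs. sports similarity:  {cross_topic:.3f}")
```

The two finance documents share terms, so their similarity is positive, while the finance and sports documents share no terms and their similarity is zero. Because `TfidfVectorizer` L2-normalizes each row by default, ranking pairs by Euclidean distance (which K-Means uses) agrees with ranking them by cosine similarity.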

***

## Key Takeaways

* Clustering finds groups without knowing the answer
* K-Means: assign to nearest center, update centers, repeat
* Density-based clustering (e.g., DBSCAN) finds arbitrary shapes and identifies noise
* Distance-based algorithms need scaled features

***

## What's Next?

You've now covered both supervised and unsupervised learning! Let's dive into the basics of neural networks.

Learn how artificial neurons work and build your first neural network.