Imagine you move to a new city and want to know if a neighborhood is safe.What do you do? You look at nearby neighborhoods.If 4 out of 5 nearby neighborhoods are safe → probably safe.
If 4 out of 5 nearby neighborhoods are unsafe → probably unsafe.That’s KNN. To predict something, find the K most similar examples and use their labels.KNN is often called a “lazy learner” — not because it’s poorly designed, but because it does zero work during training. It just memorizes all the data and waits. All the computation happens at prediction time, when it searches for neighbors. This is the opposite of models like linear regression, which do all their work upfront during training and then predict instantly.
def euclidean_distance(a, b): """Calculate distance between two points.""" return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))# Calculate distance from The Matrix to all other moviesdistances = {}for name, features in movies.items(): if name != 'The Matrix': distances[name] = euclidean_distance(my_movie, features)# Sort by distance (closest first)sorted_movies = sorted(distances.items(), key=lambda x: x[1])print("Movies similar to 'The Matrix':")for movie, dist in sorted_movies[:5]: print(f" {movie}: distance = {dist:.2f}")
Output:
Movies similar to 'The Matrix': Terminator: distance = 1.41 Inception: distance = 2.00 Blade Runner: distance = 3.16 Avatar: distance = 3.74 Mad Max: distance = 4.12
Recommendation: Watch Terminator or Inception next!
from collections import Counterdef knn_classify(training_data, training_labels, new_point, k=3): """ Classify a new point based on its k nearest neighbors. """ # Step 1: Calculate distances to all training points distances = [] for i, point in enumerate(training_data): dist = euclidean_distance(new_point, point) distances.append((dist, training_labels[i])) # Step 2: Sort by distance and take k nearest distances.sort(key=lambda x: x[0]) k_nearest = distances[:k] # Step 3: Vote - majority wins labels = [label for dist, label in k_nearest] most_common = Counter(labels).most_common(1)[0][0] return most_common
def knn_regression(training_data, training_values, new_point, k=3): """ Predict a continuous value based on k nearest neighbors. """ # Step 1: Calculate distances distances = [] for i, point in enumerate(training_data): dist = euclidean_distance(new_point, point) distances.append((dist, training_values[i])) # Step 2: Sort and take k nearest distances.sort(key=lambda x: x[0]) k_nearest = distances[:k] # Step 3: Average their values values = [val for dist, val in k_nearest] return np.mean(values)
Rule of thumb: Start with k = sqrt(n) where n is training size. Also, use an odd number for K in binary classification to avoid ties (a 2-2 split has no winner, but 3-2 always does).
from sklearn.model_selection import cross_val_scoreimport matplotlib.pyplot as plt# Test different values of kk_values = range(1, 31)cv_scores = []for k in k_values: knn = KNeighborsClassifier(n_neighbors=k) scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy') cv_scores.append(scores.mean())# Plotplt.figure(figsize=(10, 6))plt.plot(k_values, cv_scores, 'bo-')plt.xlabel('Number of Neighbors (K)')plt.ylabel('Cross-Validation Accuracy')plt.title('KNN: Choosing the Right K')plt.axhline(y=max(cv_scores), color='r', linestyle='--', label=f'Best: {max(cv_scores):.3f}')plt.legend()plt.grid(True)plt.show()print(f"Best K: {k_values[cv_scores.index(max(cv_scores))]}")
from sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipeline# Create a pipeline that scales, then applies KNN.# The pipeline ensures scaling is fit ONLY on training data,# preventing data leakage -- a subtle but critical detail.pipeline = Pipeline([ ('scaler', StandardScaler()), # Transforms each feature to mean=0, std=1 ('knn', KNeighborsClassifier(n_neighbors=5))])# Train and evaluatepipeline.fit(X_train, y_train)print(f"Accuracy: {pipeline.score(X_test, y_test):.2%}")
Always scale your data before using KNN! Distance-based algorithms are sensitive to feature scales. Without scaling, a feature measured in thousands (like salary) will completely drown out a feature measured in single digits (like years of experience). The distance calculation will act as if only salary exists.This applies to KNN, SVM, and any algorithm that uses distances. Tree-based models (Decision Trees, Random Forest, XGBoost) are immune to this problem because they split on one feature at a time.See Feature Engineering for more on scaling strategies.