
K-Nearest Neighbors (KNN)

Figure: K-nearest neighbors voting

The Most Intuitive Algorithm

Imagine you move to a new city and want to know if a neighborhood is safe. What do you do? You look at nearby neighborhoods. If 4 out of 5 nearby neighborhoods are safe → probably safe. If 4 out of 5 nearby neighborhoods are unsafe → probably unsafe. That’s KNN. To predict something, find the K most similar examples and use their labels.
Movie Recommendation with KNN

The Movie Recommendation Problem

You just watched “The Matrix” and loved it. What should you watch next?
import numpy as np

# Movies described by: [action_level, scifi_level, romance_level, comedy_level]
movies = {
    'The Matrix':     [9, 10, 2, 1],
    'Inception':      [8, 9, 3, 2],
    'John Wick':      [10, 2, 1, 1],
    'Blade Runner':   [6, 10, 4, 1],
    'The Notebook':   [1, 0, 10, 3],
    'Terminator':     [9, 9, 3, 2],
    'Avatar':         [8, 10, 5, 2],
    'Die Hard':       [10, 1, 2, 4],
    'Interstellar':   [5, 10, 6, 1],
    'Mad Max':        [10, 7, 2, 2],
}

# You loved The Matrix
my_movie = movies['The Matrix']  # [9, 10, 2, 1]

Finding Similar Movies

def euclidean_distance(a, b):
    """Calculate distance between two points."""
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

# Calculate distance from The Matrix to all other movies
distances = {}
for name, features in movies.items():
    if name != 'The Matrix':
        distances[name] = euclidean_distance(my_movie, features)

# Sort by distance (closest first)
sorted_movies = sorted(distances.items(), key=lambda x: x[1])

print("Movies similar to 'The Matrix':")
for movie, dist in sorted_movies[:5]:
    print(f"  {movie}: distance = {dist:.2f}")
Output:
Movies similar to 'The Matrix':
  Terminator: distance = 1.73
  Inception: distance = 2.00
  Avatar: distance = 3.32
  Mad Max: distance = 3.32
  Blade Runner: distance = 3.61
Recommendation: Watch Terminator or Inception next!
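The same lookup works with scikit-learn's NearestNeighbors. Here's a minimal sketch, reusing the movies dictionary and my_movie from above:
from sklearn.neighbors import NearestNeighbors

# Keep names and feature rows in the same order
names = [name for name in movies if name != 'The Matrix']
features = np.array([movies[name] for name in names])

# Index the other movies and ask for the 3 closest to The Matrix
nn = NearestNeighbors(n_neighbors=3, metric='euclidean')
nn.fit(features)
dists, idxs = nn.kneighbors([my_movie])

for dist, idx in zip(dists[0], idxs[0]):
    print(f"  {names[idx]}: distance = {dist:.2f}")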

The KNN Algorithm

For Classification

from collections import Counter

def knn_classify(training_data, training_labels, new_point, k=3):
    """
    Classify a new point based on its k nearest neighbors.
    """
    # Step 1: Calculate distances to all training points
    distances = []
    for i, point in enumerate(training_data):
        dist = euclidean_distance(new_point, point)
        distances.append((dist, training_labels[i]))
    
    # Step 2: Sort by distance and take k nearest
    distances.sort(key=lambda x: x[0])
    k_nearest = distances[:k]
    
    # Step 3: Vote - majority wins
    labels = [label for dist, label in k_nearest]
    most_common = Counter(labels).most_common(1)[0][0]
    
    return most_common
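
A quick sanity check on made-up 2-D points (the coordinates and labels are purely illustrative):
# Two small clusters, labeled 'A' and 'B'
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_toy = ['A', 'A', 'A', 'B', 'B', 'B']

print(knn_classify(X_toy, y_toy, [2, 2], k=3))  # 'A' - all 3 nearest neighbors are 'A'
print(knn_classify(X_toy, y_toy, [7, 9], k=3))  # 'B' - all 3 nearest neighbors are 'B'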

For Regression

Instead of voting, average the values:
def knn_regression(training_data, training_values, new_point, k=3):
    """
    Predict a continuous value based on k nearest neighbors.
    """
    # Step 1: Calculate distances
    distances = []
    for i, point in enumerate(training_data):
        dist = euclidean_distance(new_point, point)
        distances.append((dist, training_values[i]))
    
    # Step 2: Sort and take k nearest
    distances.sort(key=lambda x: x[0])
    k_nearest = distances[:k]
    
    # Step 3: Average their values
    values = [val for dist, val in k_nearest]
    return np.mean(values)
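
And a quick usage example with made-up sizes and prices (values are illustrative only):
# House size in hundreds of sqft -> price in $1000s
sizes = [[10], [12], [15], [20], [25]]
prices = [150, 180, 220, 300, 380]

# Averages the prices of the 3 closest sizes (15, 12, 10) -> about 183.3
print(knn_regression(sizes, prices, [14], k=3))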

Real Example: Iris Classification

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and train KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Choosing K: The Magic Number

K=1: Use only the closest neighbor
  • Very sensitive to noise
  • Can overfit
K=large: Use many neighbors
  • Smoother predictions
  • Can underfit
Rule of thumb: Start with k = sqrt(n) where n is training size
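For the Iris training split from the example above (105 samples), that rule suggests starting around k = 10; a minimal sketch:
# sqrt(n) heuristic as a starting point, using X_train from the Iris example
n = len(X_train)                    # 105 samples in the 70/30 split
k_start = int(round(np.sqrt(n)))    # sqrt(105) is about 10.2 -> start near k = 10
print(f"Suggested starting k: {k_start}")
Cross-validation is the more reliable way to settle on K, so sweep a range of values and compare: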
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# Test different values of k
k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, 'bo-')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('KNN: Choosing the Right K')
plt.axhline(y=max(cv_scores), color='r', linestyle='--', label=f'Best: {max(cv_scores):.3f}')
plt.legend()
plt.grid(True)
plt.show()

print(f"Best K: {k_values[cv_scores.index(max(cv_scores))]}")

The Scaling Problem

KNN uses distance. If features have different scales, large-scale features dominate:
# House example (unscaled)
house1 = [3, 2000, 500000]     # [bedrooms, sqft, price]
house2 = [4, 2100, 520000]

# Distance calculation
diff = np.array(house1) - np.array(house2)
# diff = [-1, -100, -20000]

# Price difference (20000) completely dominates!
# Bedroom difference (1) is ignored
Solution: Scale your features!
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline that scales, then applies KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2%}")
Always scale your data before using KNN! Distance-based algorithms are sensitive to feature scales. See Feature Engineering for more on scaling.

Distance Metrics

Euclidean isn’t the only option:
from sklearn.neighbors import KNeighborsClassifier

# Different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']

for metric in metrics:
    if metric == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    print(f"{metric:12s}: {acc:.2%}")
| Metric | Formula | Best For |
| --- | --- | --- |
| Euclidean | $\sqrt{\sum (x_i - y_i)^2}$ | Most cases, continuous features |
| Manhattan | $\sum \|x_i - y_i\|$ | Grid-like movement, high dimensions |
| Chebyshev | $\max \|x_i - y_i\|$ | When max difference matters |
| Cosine | $1 - \cos(\theta)$ | Text, when magnitude doesn't matter |
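As a minimal sketch, here's how these formulas compare on two movies from earlier, using SciPy's distance helpers (assuming SciPy is installed):
import numpy as np
from scipy.spatial import distance

a = np.array([9, 10, 2, 1])   # The Matrix
b = np.array([9, 9, 3, 2])    # Terminator

print(f"Euclidean: {distance.euclidean(a, b):.2f}")   # sqrt of the sum of squared differences
print(f"Manhattan: {distance.cityblock(a, b):.2f}")   # sum of absolute differences
print(f"Chebyshev: {distance.chebyshev(a, b):.2f}")   # largest single-feature difference
print(f"Cosine:    {distance.cosine(a, b):.2f}")      # 1 - cos(angle between the vectors)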
Math Connection: Distance metrics come from linear algebra concepts. See Vectors for more on measuring similarity.

Weighted KNN

Not all neighbors are equal! Closer neighbors should have more influence:
# Weight by distance (closer = more weight)
knn_weighted = KNeighborsClassifier(
    n_neighbors=5,
    weights='distance'  # or 'uniform' (default)
)

knn_weighted.fit(X_train, y_train)
print(f"Weighted KNN Accuracy: {knn_weighted.score(X_test, y_test):.2%}")

Pros and Cons

Advantages

  • Simple and intuitive
  • No training phase (lazy learner)
  • Works with any number of classes
  • Naturally handles multi-label
  • Non-parametric (no assumptions about data)

Disadvantages

  • Slow prediction (checks all training data)
  • Sensitive to irrelevant features
  • Sensitive to feature scaling
  • Struggles in high dimensions (curse of dimensionality)
  • Memory intensive (stores all data)

When to Use KNN

Good for:
  • Small to medium datasets
  • When you need interpretability (“these are the similar cases”)
  • Recommendation systems
  • Baseline model to beat
Avoid for:
  • Large datasets (slow prediction)
  • High-dimensional data (100+ features)
  • When fast prediction is critical

Key Takeaways

Find Neighbors

Predict based on the K most similar examples

Vote or Average

Classification = majority vote, Regression = average

Scale Features

Distance-based algorithms need scaled data

Choose K Wisely

Use cross-validation to find the best K

What’s Next?

Now that you understand classification with both logistic regression and KNN, let’s learn about decision trees - a completely different approach!

Continue to Module 5: Decision Trees

Learn how trees make decisions - just like you do