K-Nearest Neighbors (KNN)
The Most Intuitive Algorithm
The Movie Recommendation Problem
Finding Similar Movies
The KNN Algorithm
For Classification
For Regression
Real Example: Iris Classification
Choosing K: The Magic Number
The Scaling Problem
Distance Metrics
Weighted KNN
Pros and Cons
When to Use KNN
Key Takeaways
What’s Next?

K-Nearest Neighbors (KNN)

The Most Intuitive Algorithm

Imagine you move to a new city and want to know if a neighborhood is safe. What do you do? You look at nearby neighborhoods. If 4 out of 5 nearby neighborhoods are safe → probably safe. If 4 out of 5 nearby neighborhoods are unsafe → probably unsafe. That’s KNN. To predict something, find the K most similar examples and use their labels.

The Movie Recommendation Problem

You just watched “The Matrix” and loved it. What should you watch next?

import numpy as np

# Movies described by: [action_level, scifi_level, romance_level, comedy_level]
movies = {
    'The Matrix':     [9, 10, 2, 1],
    'Inception':      [8, 9, 3, 2],
    'John Wick':      [10, 2, 1, 1],
    'Blade Runner':   [6, 10, 4, 1],
    'The Notebook':   [1, 0, 10, 3],
    'Terminator':     [9, 9, 3, 2],
    'Avatar':         [8, 10, 5, 2],
    'Die Hard':       [10, 1, 2, 4],
    'Interstellar':   [5, 10, 6, 1],
    'Mad Max':        [10, 7, 2, 2],
}

# You loved The Matrix
my_movie = movies['The Matrix']  # [9, 10, 2, 1]

Finding Similar Movies

def euclidean_distance(a, b):
    """Calculate distance between two points."""
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

# Calculate distance from The Matrix to all other movies
distances = {}
for name, features in movies.items():
    if name != 'The Matrix':
        distances[name] = euclidean_distance(my_movie, features)

# Sort by distance (closest first)
sorted_movies = sorted(distances.items(), key=lambda x: x[1])

print("Movies similar to 'The Matrix':")
for movie, dist in sorted_movies[:5]:
    print(f"  {movie}: distance = {dist:.2f}")

Output:

Movies similar to 'The Matrix':
  Terminator: distance = 1.41
  Inception: distance = 2.00
  Blade Runner: distance = 3.16
  Avatar: distance = 3.74
  Mad Max: distance = 4.12

Recommendation: Watch Terminator or Inception next!

The KNN Algorithm

For Classification

from collections import Counter

def knn_classify(training_data, training_labels, new_point, k=3):
    """
    Classify a new point based on its k nearest neighbors.
    """
    # Step 1: Calculate distances to all training points
    distances = []
    for i, point in enumerate(training_data):
        dist = euclidean_distance(new_point, point)
        distances.append((dist, training_labels[i]))
    
    # Step 2: Sort by distance and take k nearest
    distances.sort(key=lambda x: x[0])
    k_nearest = distances[:k]
    
    # Step 3: Vote - majority wins
    labels = [label for dist, label in k_nearest]
    most_common = Counter(labels).most_common(1)[0][0]
    
    return most_common

For Regression

Instead of voting, average the values:

def knn_regression(training_data, training_values, new_point, k=3):
    """
    Predict a continuous value based on k nearest neighbors.
    """
    # Step 1: Calculate distances
    distances = []
    for i, point in enumerate(training_data):
        dist = euclidean_distance(new_point, point)
        distances.append((dist, training_values[i]))
    
    # Step 2: Sort and take k nearest
    distances.sort(key=lambda x: x[0])
    k_nearest = distances[:k]
    
    # Step 3: Average their values
    values = [val for dist, val in k_nearest]
    return np.mean(values)

Real Example: Iris Classification

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and train KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Choosing K: The Magic Number

K=1: Use only the closest neighbor

Very sensitive to noise
Can overfit

K=large: Use many neighbors

Smoother predictions
Can underfit

Rule of thumb: Start with k = sqrt(n) where n is training size

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# Test different values of k
k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, 'bo-')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('KNN: Choosing the Right K')
plt.axhline(y=max(cv_scores), color='r', linestyle='--', label=f'Best: {max(cv_scores):.3f}')
plt.legend()
plt.grid(True)
plt.show()

print(f"Best K: {k_values[cv_scores.index(max(cv_scores))]}")

The Scaling Problem

KNN uses distance. If features have different scales, large-scale features dominate:

# House example (unscaled)
house1 = [3, 2000, 500000]     # [bedrooms, sqft, price]
house2 = [4, 2100, 520000]

# Distance calculation
diff = np.array(house1) - np.array(house2)
# diff = [-1, -100, -20000]

# Price difference (20000) completely dominates!
# Bedroom difference (1) is ignored

Solution: Scale your features!

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline that scales, then applies KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2%}")

Always scale your data before using KNN! Distance-based algorithms are sensitive to feature scales.See Feature Engineering for more on scaling.

Distance Metrics

Euclidean isn’t the only option:

from sklearn.neighbors import KNeighborsClassifier

# Different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']

for metric in metrics:
    if metric == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
    else:
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    print(f"{metric:12s}: {acc:.2%}")

Metric	Formula	Best For
Euclidean	$\sqrt{\sum(x_i - y_i)^2}$	Most cases, continuous features
Manhattan	$\sum\\|x_i - y_i\\|$	Grid-like movement, high dimensions
Chebyshev	$\max\\|x_i - y_i\\|$	When max difference matters
Cosine	$1 - \cos(\theta)$	Text, when magnitude doesn’t matter

Math Connection: Distance metrics come from linear algebra concepts. See Vectors for more on measuring similarity.

Weighted KNN

Not all neighbors are equal! Closer neighbors should have more influence:

# Weight by distance (closer = more weight)
knn_weighted = KNeighborsClassifier(
    n_neighbors=5,
    weights='distance'  # or 'uniform' (default)
)

knn_weighted.fit(X_train, y_train)
print(f"Weighted KNN Accuracy: {knn_weighted.score(X_test, y_test):.2%}")

Pros and Cons

Advantages

Simple and intuitive
No training phase (lazy learner)
Works with any number of classes
Naturally handles multi-label
Non-parametric (no assumptions about data)

Disadvantages

Slow prediction (checks all training data)
Sensitive to irrelevant features
Sensitive to feature scaling
Struggles in high dimensions (curse of dimensionality)
Memory intensive (stores all data)

When to Use KNN

Good for:

Small to medium datasets
When you need interpretability (“these are the similar cases”)
Recommendation systems
Baseline model to beat

Avoid for:

Large datasets (slow prediction)
High-dimensional data (100+ features)
When fast prediction is critical

Key Takeaways

Find Neighbors

Predict based on the K most similar examples

Vote or Average

Classification = majority vote, Regression = average

Scale Features

Distance-based algorithms need scaled data

Choose K Wisely

Use cross-validation to find the best K

What’s Next?

Now that you understand classification with both logistic regression and KNN, let’s learn about decision trees - a completely different approach!

Continue to Module 5: Decision Trees

Learn how trees make decisions - just like you do

Classification Decision Trees

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​K-Nearest Neighbors (KNN)

​The Most Intuitive Algorithm

​The Movie Recommendation Problem

​Finding Similar Movies

​The KNN Algorithm

​For Classification

​For Regression

​Real Example: Iris Classification

​Choosing K: The Magic Number

​The Scaling Problem

​Distance Metrics

​Weighted KNN

​Pros and Cons

Advantages

Disadvantages

​When to Use KNN

​Key Takeaways

Find Neighbors

Vote or Average

Scale Features

Choose K Wisely

​What’s Next?

Continue to Module 5: Decision Trees

K-Nearest Neighbors (KNN)

The Most Intuitive Algorithm

The Movie Recommendation Problem

Finding Similar Movies

The KNN Algorithm

For Classification

For Regression

Real Example: Iris Classification

Choosing K: The Magic Number

The Scaling Problem

Distance Metrics

Weighted KNN

Pros and Cons

When to Use KNN

Key Takeaways

What’s Next?