You’re looking for a new apartment. You visit Zillow and find one you love:
2 bedrooms
1,200 square feet
$2,400/month rent
15 minutes from work
Now you want to find similar apartments. Not identical — just similar enough that you’d consider them. Zillow shows you a “Similar Homes” section. But how did they decide which apartments are similar?

Think about it: What makes two apartments “similar”?
Same number of bedrooms?
Similar size?
Similar rent?
Similar commute?
All of the above, in some combination. And that combination is exactly what vectors and similarity measures capture.
Estimated Time: 3-4 hours
Difficulty: Beginner
Prerequisites: Basic Python
What You’ll Build: A “Find Similar Items” system that works for apartments, songs, or anything
🔗 ML Connection: Vectors are THE foundation of modern ML. Here’s where you’ll see them:
| ML System | Vector Representation |
| --- | --- |
| Word2Vec/GPT | Every word → 300-768 dimensional vector |
| Face Recognition | Every face → 128-512 dimensional embedding |
| Recommendation Systems | Users & items in shared vector space |
| Image Classification | CNN features as vectors |
After this module, you’ll understand exactly how these systems find “similar” items!
When you add vectors, you add corresponding components:

$$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ a_3 + b_3 \end{bmatrix}$$

Real Example: Combining two shopping carts:
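Here’s a minimal sketch of that idea (the item types and counts below are made up for illustration):

```python
import numpy as np

# Each position counts one item type: [apples, bread, milk, eggs]
cart_alice = np.array([3, 1, 2, 0])
cart_bob   = np.array([1, 2, 0, 6])

# Vector addition = combining the carts item by item
combined_cart = cart_alice + cart_bob
print(combined_cart)  # [4 3 2 6]
```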
Attempt 2: Euclidean Distance (Works, But Has Issues)
We could calculate the “distance” between apartments in 4D space:
```python
import numpy as np

def distance(a, b):
    """Euclidean distance between two vectors."""
    return np.sqrt(sum((a[i] - b[i])**2 for i in range(len(a))))

apartment_A = np.array([2, 1200, 2400, 15])
apartment_B = np.array([2, 1100, 2300, 18])
apartment_C = np.array([4, 2500, 4500, 45])

print(f"A vs B: {distance(apartment_A, apartment_B):.0f}")  # 141
print(f"A vs C: {distance(apartment_A, apartment_C):.0f}")  # 2470
```
B is much closer to A than C is. Good!

But there’s a problem: the sqft and rent numbers are huge (1000s) while bedrooms and commute are small (single digits). The big numbers dominate everything.
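One common fix is to rescale every feature to a comparable range before measuring distance. A rough sketch using min-max scaling (scaling options are covered more carefully in the production section later):

```python
import numpy as np

apartments = np.array([
    [2, 1200, 2400, 15],  # A
    [2, 1100, 2300, 18],  # B
    [4, 2500, 4500, 45],  # C
])

# Min-max scale each column to [0, 1] so no feature dominates
mins, maxs = apartments.min(axis=0), apartments.max(axis=0)
scaled = (apartments - mins) / (maxs - mins + 1e-8)

def distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(f"A vs B (scaled): {distance(scaled[0], scaled[1]):.2f}")
print(f"A vs C (scaled): {distance(scaled[0], scaled[2]):.2f}")
```

After scaling, no single feature dominates just because its raw units happen to be large.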
The dot product (also called inner product or scalar product) of two vectors:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

What it does: Multiply corresponding numbers and add them up.
```python
def dot_product(a, b):
    """Multiply corresponding elements and sum."""
    return sum(a[i] * b[i] for i in range(len(a)))

# Or simply: np.dot(a, b)
```
The dot product has one problem: bigger vectors give bigger numbers regardless of similarity. Cosine similarity fixes this by normalizing:

$$\text{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

This gives a number between -1 and 1:
1.0 = identical direction (very similar)
0.0 = perpendicular (unrelated)
-1.0 = opposite direction (opposites)
```python
def cosine_similarity(a, b):
    """Similarity based on angle, not magnitude."""
    dot = np.dot(a, b)
    magnitude_a = np.sqrt(np.dot(a, a))  # length of a
    magnitude_b = np.sqrt(np.dot(b, b))  # length of b
    return dot / (magnitude_a * magnitude_b)
```
Let’s test it on our apartments:
```python
import numpy as np

apartments = {
    'A (my favorite)': np.array([2, 1200, 2400, 15]),
    'B (similar)': np.array([2, 1100, 2300, 18]),
    'C (luxury)': np.array([4, 2500, 4500, 45]),
    'D (studio)': np.array([1, 800, 1900, 10]),
}

my_apt = apartments['A (my favorite)']

print("Similarity to my apartment:")
for name, apt in apartments.items():
    sim = cosine_similarity(my_apt, apt)
    print(f"  {name}: {sim:.3f}")
```
Output:
```
Similarity to my apartment:
  A (my favorite): 1.000  ← identical to itself
  B (similar): 0.999      ← very similar!
  C (luxury): 0.997       ← surprisingly similar (same "shape", just bigger)
  D (studio): 0.998       ← also similar (same "shape", just smaller)
```
Wait, why is C so similar? Because cosine similarity measures direction, not magnitude. C is a “scaled up” version of A — same proportions, just bigger numbers.

This is actually useful! It finds apartments with the same profile (ratio of bedrooms to sqft to rent), regardless of absolute size.
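You can check this directly: scaling a vector by any positive constant leaves its cosine similarity unchanged. A quick sanity check (redefining cosine_similarity so the snippet runs on its own):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([2, 1200, 2400, 15])
scaled_A = 3 * A  # same proportions, three times larger

print(cosine_similarity(A, scaled_A))  # 1.0 (up to floating-point rounding)
```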
Input: Convert your data to a vector (image → pixels, text → numbers)
Layers: Transform the vector through matrix multiplications (we’ll learn this next!)
Output: Compare the final vector to known categories using… similarity
```python
# Simplified: How image classification works
image_vector = [0.1, 0.8, 0.3, ...]  # 1000s of numbers from pixels

# The network transforms this to a "meaning" vector
meaning_vector = neural_network(image_vector)  # Let's say [0.9, 0.1, 0.05]

# Compare to category vectors
cat_vector = [1.0, 0.0, 0.0]  # What a "cat" looks like in meaning-space
dog_vector = [0.0, 1.0, 0.0]  # What a "dog" looks like

# Which is more similar?
cat_similarity = cosine_similarity(meaning_vector, cat_vector)  # 0.95
dog_similarity = cosine_similarity(meaning_vector, dog_vector)  # 0.10

# Prediction: It's a cat! (higher similarity)
```
The core operation — vector similarity — is exactly what you learned with apartments.
```python
def cosine_similarity(a, b):
    """
    Returns a value between -1 and 1:
      1.0 = identical direction (very similar)
      0.0 = perpendicular (unrelated)
     -1.0 = opposite direction (very different)
    """
    dot = np.dot(a, b)
    magnitude_a = np.linalg.norm(a)  # length of a
    magnitude_b = np.linalg.norm(b)  # length of b
    return dot / (magnitude_a * magnitude_b)
```
Now let’s use it on our songs:
```python
# Using our song vectors from earlier
sim_blinding_levitating = cosine_similarity(blinding_lights, levitating)
sim_blinding_adele = cosine_similarity(blinding_lights, someone_like_you)

print(f"Blinding Lights vs Levitating: {sim_blinding_levitating:.3f}")
print(f"Blinding Lights vs Someone Like You: {sim_blinding_adele:.3f}")

# Output:
# Blinding Lights vs Levitating: 0.891 (very similar - both upbeat pop)
# Blinding Lights vs Someone Like You: 0.412 (less similar - different vibes)
```
That’s the Spotify algorithm in a nutshell! Find songs with the highest cosine similarity to what you just played.
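Here is a rough sketch of that idea (the library and its feature values below are invented for illustration, not real Spotify data):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical library: song name -> feature vector
library = {
    "Track 1": np.array([0.9, 0.6, 0.1, 0.8]),
    "Track 2": np.array([0.2, 0.3, 0.9, 0.2]),
    "Track 3": np.array([0.7, 0.8, 0.2, 0.6]),
}
just_played = np.array([0.8, 0.7, 0.2, 0.7])

# Rank the library by similarity to the song just played
ranked = sorted(
    library.items(),
    key=lambda item: cosine_similarity(just_played, item[1]),
    reverse=True,
)

print("Recommendations, most similar first:")
for name, vec in ranked:
    print(f"  {name}: {cosine_similarity(just_played, vec):.3f}")
```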
The Question: What if we want to combine two house profiles?

Geometric Intuition: Place vectors tip-to-tail. The result is the diagonal.

Algebraic Definition: Add corresponding components.
```python
# Two house feature vectors
house_1 = np.array([3, 2000, 15, 5])
house_2 = np.array([2, 1500, 10, 3])

# Average house in the neighborhood
average_house = (house_1 + house_2) / 2
print(average_house)  # [2.5, 1750, 12.5, 4]
```
Why This Matters:
Feature engineering: Combine features to create new ones
Gradient descent: Update model parameters by adding gradients
Ensemble methods: Average predictions from multiple models (see the sketch after this list)
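For example, the ensemble bullet boils down to vector addition plus scalar multiplication. A minimal sketch with made-up model outputs:

```python
import numpy as np

# Hypothetical: price predictions (in $k) from three different models
model_a = np.array([280, 355, 225])
model_b = np.array([290, 350, 215])
model_c = np.array([275, 365, 220])

# A simple ensemble: average the prediction vectors
ensemble = (model_a + model_b + model_c) / 3
print(ensemble)  # roughly [281.7 356.7 220.0]
```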
The Question: What if all house prices in a neighborhood increase by 20%?

Geometric Intuition: Stretch or shrink the vector. Direction stays the same.

Algebraic Definition: Multiply each component by a number (scalar).
```python
house = np.array([3, 2000, 15, 5])

# Scale by 1.2 (20% increase)
scaled_house = 1.2 * house
print(scaled_house)  # [3.6, 2400, 18, 6]
```
Why This Matters:
Normalization: Scale features to same range
Learning rate: Control how much to update parameters
Feature weighting: Emphasize important features
ML Application: Gradient descent
```python
# Current model parameters
weights = np.array([50000, 120, -5000, -8000])

# Gradient (direction to improve)
gradient = np.array([100, 0.5, -20, -30])

# Learning rate (how far to move)
learning_rate = 0.01

# Update parameters
weights = weights - learning_rate * gradient
#                   ↑ scalar multiplication!
```
Key Insight: The learning rate controls the step size. Too large → overshoot. Too small → slow learning.
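Here is a toy illustration of that trade-off (a made-up example, separate from the housing data): minimizing f(w) = w², whose gradient is 2w, with three different learning rates.

```python
def gradient_descent(learning_rate, steps=10, w=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w

print(gradient_descent(0.1))    # ≈ 0.54 - steadily approaches the minimum at 0
print(gradient_descent(1.1))    # ≈ 31   - too large: overshoots and diverges
print(gradient_descent(0.001))  # ≈ 4.9  - too small: barely moves in 10 steps
```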
The Big Question: How do we measure if two things are similar?

This is THE most important operation in machine learning! Let’s see why through three examples.

Algebraic Definition: Multiply corresponding components and sum.

Mathematical Formula:

$$\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{n} v_i w_i = v_1 w_1 + v_2 w_2 + \ldots + v_n w_n$$

Alternative Formula (geometric):

$$\mathbf{v} \cdot \mathbf{w} = \|\mathbf{v}\| \|\mathbf{w}\| \cos(\theta)$$

Where θ is the angle between the vectors.
```python
# Find products similar to what user just bought
user_purchase = np.array([1, 0, 1, 0, 1])  # Product features

all_products = np.array([
    [1, 0, 1, 1, 0],  # Product A
    [1, 0, 1, 0, 1],  # Product B (identical!)
    [0, 1, 0, 1, 0],  # Product C (different)
])

similarities = [np.dot(user_purchase, product) for product in all_products]
print(similarities)  # [2, 3, 0] → Recommend Product B!
```
3. Attention Mechanisms: How transformers (GPT, BERT) work
```python
# Simplified: How much should we "attend" to each word?
query = np.array([0.8, 0.2, 0.5])  # Current word
key1 = np.array([0.9, 0.1, 0.4])   # Word 1
key2 = np.array([0.2, 0.8, 0.1])   # Word 2

attention_1 = np.dot(query, key1)  # 0.94 (high attention!)
attention_2 = np.dot(query, key2)  # 0.37 (low attention)
```
The Question: How “big” is a house (in feature space)?

Geometric Intuition: The length of the arrow.

Algebraic Definition: Square root of the dot product with itself.

$$\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$
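In code, that definition looks like this (reusing the house vector from the earlier examples):

```python
import numpy as np

house = np.array([3, 2000, 15, 5])

# Norm = square root of the dot product of the vector with itself
manual_norm = np.sqrt(np.dot(house, house))
numpy_norm = np.linalg.norm(house)

print(manual_norm)  # ≈ 2000.06
print(numpy_norm)   # same value
```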
The Problem with Dot Product: It’s affected by magnitude!
```python
# Two houses with same type, different size
small_house = np.array([2, 1000, 10, 3])
large_house = np.array([4, 2000, 20, 6])  # 2× small_house

# Dot product is very different
print(np.dot(small_house, small_house))  # 1,000,113
print(np.dot(large_house, large_house))  # 4,000,452 (4× larger!)
```
The Solution: Cosine similarity ignores magnitude, only cares about direction (type).

Formula:

$$\text{similarity}(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|} = \cos(\theta)$$

Range: -1 (opposite) to +1 (identical direction)
```python
def cosine_similarity(v, w):
    """Compute cosine similarity between two vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```
```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "machine learning is awesome",
    "deep learning is a subset of machine learning",
    "neural networks are powerful",
    "python is great for machine learning"
]

# Convert to vectors
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents).toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nDocument vectors:")
print(doc_vectors)

# Find similar documents to "machine learning"
query = "machine learning"
query_vector = vectorizer.transform([query]).toarray()[0]

for i, doc_vec in enumerate(doc_vectors):
    sim = cosine_similarity(query_vector, doc_vec)
    print(f"Doc {i}: {sim:.3f} - {documents[i]}")
```
```python
# User-movie rating matrix
ratings = np.array([
    [5, 4, 0, 0, 1],  # User 0: likes action/comedy
    [4, 5, 0, 0, 2],  # User 1: similar to User 0
    [0, 0, 5, 4, 5],  # User 2: likes drama/romance
    [5, 4, 0, 1, 1],  # User 3: similar to User 0
])

# Find users similar to User 0
user_0 = ratings[0]
for i in range(1, len(ratings)):
    sim = cosine_similarity(user_0, ratings[i])
    print(f"User {i}: similarity = {sim:.3f}")

# Output:
# User 1: similarity = 0.966 (recommend same movies!)
# User 2: similarity = 0.095 (different taste)
# User 3: similarity = 0.988 (very similar)
```
```python
# Given these houses and prices
houses = np.array([
    [3, 1800, 20, 5],   # $280k
    [4, 2400, 10, 3],   # $360k
    [2, 1200, 30, 8],   # $220k
])
prices = np.array([280, 360, 220])

# Predict price for this house
new_house = np.array([3, 2000, 15, 4])

# TODO: Find 2 most similar houses and average their prices
```
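One possible way to fill in that TODO is a tiny nearest-neighbor predictor using cosine similarity (a real pipeline would scale the features first, as discussed in the production section later):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

houses = np.array([
    [3, 1800, 20, 5],   # $280k
    [4, 2400, 10, 3],   # $360k
    [2, 1200, 30, 8],   # $220k
])
prices = np.array([280, 360, 220])
new_house = np.array([3, 2000, 15, 4])

# Rank the known houses by similarity to the new one
sims = np.array([cosine_similarity(new_house, h) for h in houses])
top_2 = np.argsort(sims)[-2:]           # indices of the 2 most similar houses
predicted_price = prices[top_2].mean()   # average their prices

# With these numbers, the two nearest are the $280k and $360k houses → about $320k
print(f"Predicted price: ${predicted_price:.0f}k")
```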
Spotify represents songs as vectors based on audio features. Given these song vectors:
| Song | Energy | Danceability | Acousticness | Tempo (normalized) |
| --- | --- | --- | --- | --- |
| Your Favorite | 0.8 | 0.7 | 0.2 | 0.6 |
| Song A | 0.9 | 0.8 | 0.1 | 0.7 |
| Song B | 0.3 | 0.4 | 0.9 | 0.3 |
| Song C | 0.7 | 0.6 | 0.3 | 0.5 |
Task: Find which song is most similar to “Your Favorite” using cosine similarity.
```python
import numpy as np

# Define the song vectors
your_favorite = np.array([0.8, 0.7, 0.2, 0.6])
song_A = np.array([0.9, 0.8, 0.1, 0.7])
song_B = np.array([0.3, 0.4, 0.9, 0.3])
song_C = np.array([0.7, 0.6, 0.3, 0.5])

# TODO: Calculate cosine similarity with each song
# TODO: Which song should Spotify recommend?
```
💡 Solution
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

your_favorite = np.array([0.8, 0.7, 0.2, 0.6])
song_A = np.array([0.9, 0.8, 0.1, 0.7])
song_B = np.array([0.3, 0.4, 0.9, 0.3])
song_C = np.array([0.7, 0.6, 0.3, 0.5])

songs = {'Song A': song_A, 'Song B': song_B, 'Song C': song_C}

print("Similarity scores:")
for name, song in songs.items():
    sim = cosine_similarity(your_favorite, song)
    print(f"  {name}: {sim:.4f}")

# Output:
#   Song A: 0.9958  ← Most similar (upbeat, danceable)
#   Song B: 0.6634  (very different - acoustic, slow)
#   Song C: 0.9931  (also quite similar)

print("\n✅ Recommendation: Song A (0.9958 similarity)")
```
Real-World Insight: This is exactly how Spotify’s “Discover Weekly” works! Songs are represented as 12+ dimensional vectors including tempo, key, loudness, and more.
Amazon wants to show “Similar Products” when a customer views an item. Products are represented as vectors:

Features: [price_tier, avg_rating, num_reviews (log), category_score, brand_popularity]
The next exercise applies the same idea to people instead of products: a dating app represents you and four potential matches (Alex, Jordan, Casey, Morgan) as 5-dimensional personality vectors, with traits such as adventurousness, introversion, and humor style. Your tasks:
Calculate a “compatibility score” using dot product
Normalize and use cosine similarity - does the ranking change?
Which match is best and why?
💡 Solution
```python
import numpy as np

you = np.array([0.8, 0.3, 0.7, 0.6, 0.9])

matches = {
    "Alex":   np.array([0.7, 0.4, 0.8, 0.5, 0.85]),
    "Jordan": np.array([0.2, 0.9, 0.3, 0.8, 0.4]),
    "Casey":  np.array([0.9, 0.2, 0.6, 0.7, 0.95]),
    "Morgan": np.array([0.5, 0.5, 0.5, 0.5, 0.5]),
}

print("Compatibility Analysis:")
print("-" * 50)
print(f"{'Match':<10} {'Dot Product':<14} {'Cosine Sim':<12}")
print("-" * 50)

for name, profile in matches.items():
    dot = np.dot(you, profile)
    cos = np.dot(you, profile) / (np.linalg.norm(you) * np.linalg.norm(profile))
    print(f"{name:<10} {dot:<14.4f} {cos:<12.4f}")

# Output:
# Alex       2.3050         0.9912
# Jordan     1.4800         0.7258
# Casey      2.4750         0.9924   ← Best match!
# Morgan     1.6500         0.9546

print("\n💕 Best Match: Casey!")
print("   • High adventure (0.9 vs your 0.8)")
print("   • Similar introversion level (0.2 vs 0.3)")
print("   • Compatible humor style (0.95 vs 0.9)")

print("\n⚠️ Jordan is least compatible:")
print("   • Opposite on adventure (0.2 vs 0.8)")
print("   • Opposite on introversion (0.9 vs 0.3)")
```
Real-World Insight: Dating apps like Hinge and OkCupid use similar vector-based matching, but with 50+ dimensions including behavioral data from swipes and messages!
The next exercise is a miniature search engine: the query “machine learning python” and four documents are represented as vectors over a small keyword vocabulary (the exact vectors are in the solution below). Rank the documents by cosine similarity, then consider: Why might “Data Science” rank higher than “Python Basics” even though the query contains “python”?
💡 Solution
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

documents = {
    "ML Tutorial":   np.array([0.5, 0.8, 0.9, 0.7, 0.1, 0.2]),
    "Web Dev Guide": np.array([0.2, 0.1, 0.0, 0.3, 0.9, 0.8]),
    "Data Science":  np.array([0.6, 0.5, 0.7, 0.9, 0.2, 0.3]),
    "Python Basics": np.array([0.9, 0.2, 0.3, 0.4, 0.3, 0.4]),
}

query = np.array([0.7, 0.9, 0.8, 0.3, 0.0, 0.0])

print("🔍 Search Results for 'machine learning python':")
print("-" * 45)

results = []
for name, doc in documents.items():
    sim = cosine_similarity(query, doc)
    results.append((name, sim))

# Sort by similarity (descending)
results.sort(key=lambda x: x[1], reverse=True)

for rank, (name, sim) in enumerate(results, 1):
    print(f"{rank}. {name:<18} (relevance: {sim:.4f})")

# Output:
# 1. ML Tutorial        (relevance: 0.9379)  ← Top result!
# 2. Data Science       (relevance: 0.8354)
# 3. Python Basics      (relevance: 0.7068)
# 4. Web Dev Guide      (relevance: 0.1781)

print("\n📊 Why 'Data Science' > 'Python Basics'?")
print("   Query emphasizes 'machine' (0.9) and 'learning' (0.8)")
print("   Data Science has machine=0.5, learning=0.7")
print("   Python Basics has machine=0.2, learning=0.3")
print("   Even though Python Basics has higher 'python' score,")
print("   the overall direction is less aligned with the query!")
```
Real-World Insight: This is how Google Search worked in its early days! Modern search engines add hundreds more signals (links, freshness, user behavior).
In textbooks, data is clean. In production, data is messy. Here’s how to handle real-world vector problems:
Production Reality: Real data has missing values, outliers, inconsistent scales, and noise. Your similarity system will fail if you don’t handle these!
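For example, missing values have to be filled before any distance or similarity can be computed. A minimal sketch using mean imputation (in practice you might reach for scikit-learn's SimpleImputer or domain-specific rules):

```python
import numpy as np

# NaN marks a missing value (e.g., a listing with no reported square footage)
data = np.array([
    [2.0, 1200.0, 2400.0, 15.0],
    [2.0, np.nan, 2300.0, 18.0],
    [4.0, 2500.0, np.nan, 45.0],
])

# Replace each NaN with that column's mean over the observed values
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)
print(filled)
```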
```python
# Different scaling methods for different situations

# Min-Max: Scale to [0, 1] - use when you need bounded values
def minmax_scale(data):
    mins = data.min(axis=0)
    maxs = data.max(axis=0)
    return (data - mins) / (maxs - mins + 1e-8)

# Z-Score: Center and scale - use when comparing distributions
def zscore_scale(data):
    means = data.mean(axis=0)
    stds = data.std(axis=0)
    return (data - means) / (stds + 1e-8)

# Robust: Use median/IQR - use when outliers are present
def robust_scale(data):
    medians = np.median(data, axis=0)
    q75 = np.percentile(data, 75, axis=0)
    q25 = np.percentile(data, 25, axis=0)
    iqr = q75 - q25
    return (data - medians) / (iqr + 1e-8)

print("Choose your scaler based on your data characteristics!")
```
Rule of Thumb:
Min-Max: Neural networks, bounded features
Z-Score: Most ML algorithms, normally distributed data
Robust: Data with outliers
The curse of dimensionality: in high-dimensional spaces, geometry stops matching our intuition. For example, almost all of a unit ball’s volume sits in a thin shell near its surface:

```python
# In high-D, almost all volume is at the surface of a sphere!
def shell_volume_ratio(dim, thickness=0.01):
    """What fraction of unit ball is within `thickness` of surface?"""
    inner_radius = 1 - thickness
    # V(r) ∝ r^d
    inner_volume_ratio = inner_radius ** dim
    shell_ratio = 1 - inner_volume_ratio
    return shell_ratio

print("Fraction of volume near surface (within 1%):")
for dim in [2, 10, 50, 100, 500]:
    ratio = shell_volume_ratio(dim)
    print(f"  {dim:3d}D: {ratio:.4%}")

# Output:
#    2D: 1.99%
#   10D: 9.56%
#   50D: 39.50%
#  100D: 63.40%
#  500D: 99.34%  ← Almost everything is on the edge!
```
Locality-Sensitive Hashing groups similar vectors into the same “bucket”:
```python
class SimpleLSH:
    """Simplified LSH for cosine similarity."""

    def __init__(self, dim, n_hyperplanes=16):
        # Random hyperplanes divide space into 2^n regions
        self.hyperplanes = np.random.randn(n_hyperplanes, dim)

    def hash(self, vector):
        """Convert vector to binary hash."""
        # Which side of each hyperplane?
        projections = self.hyperplanes @ vector
        bits = (projections > 0).astype(int)
        return tuple(bits)

    def build_index(self, vectors):
        """Group vectors by hash."""
        self.buckets = {}
        for i, vec in enumerate(vectors):
            h = self.hash(vec)
            if h not in self.buckets:
                self.buckets[h] = []
            self.buckets[h].append(i)
        return self

    def search(self, query, vectors, k=10):
        """Search only in same bucket - much faster!"""
        h = self.hash(query)
        candidates = self.buckets.get(h, [])
        # Compare only with candidates
        similarities = []
        for i in candidates:
            sim = np.dot(query, vectors[i]) / (np.linalg.norm(query) * np.linalg.norm(vectors[i]))
            similarities.append((i, sim))
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

# Usage
dim = 768
n_vectors = 100000
vectors = np.random.randn(n_vectors, dim)
query = np.random.randn(dim)

lsh = SimpleLSH(dim, n_hyperplanes=12)
lsh.build_index(vectors)

# Instead of searching 100,000 vectors, we search ~100!
results = lsh.search(query, vectors, k=5)
print(f"Found {len(results)} approximate neighbors")
```
Trade-off: Speed vs accuracy. LSH might miss some true neighbors, but it’s 100-1000x faster!

Production systems (Pinecone, Milvus, Faiss) use sophisticated variants of LSH and graph-based methods.
✅ Vectors represent data - Houses, images, text all become vectors
✅ Dot product measures similarity - Foundation of neural networks
✅ Cosine similarity - Direction-based (ignores magnitude)
✅ Euclidean distance - Position-based (includes magnitude)
✅ Normalization matters - Prevent one feature from dominating
✅ Same math, different domains - Vectors work everywhere!
✅ Handle messy data - Missing values, outliers, and scaling are production realities
✅ High dimensions are weird - Curse of dimensionality affects all similarity search
A vector space is a set of objects (vectors) with two operations (addition and scalar multiplication) that satisfy certain axioms. This abstraction lets us apply vector math to surprising domains:
A set of vectors is linearly independent if no vector can be written as a combination of others:

If $c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n = \mathbf{0}$, then all $c_i = 0$.

A basis is a minimal set of linearly independent vectors that span the space.

ML Application: In neural networks, we’re essentially finding a good basis to represent data. Autoencoders find compressed bases; attention mechanisms dynamically select relevant basis directions.
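One quick numerical way to check linear independence is to compare the rank of the stacked vectors with how many vectors you have; a small sketch:

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = v1 + 2 * v2  # a linear combination of v1 and v2

vectors = np.stack([v1, v2, v3])
rank = np.linalg.matrix_rank(vectors)

print(f"rank = {rank}, number of vectors = {len(vectors)}")
print("Linearly independent?", rank == len(vectors))  # False - v3 is redundant
```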
Our dot product is a specific inner product. More generally, an inner product ⟨·,·⟩ satisfies:
⟨u, v⟩ = ⟨v, u⟩ (symmetry)
⟨au + bv, w⟩ = a⟨u, w⟩ + b⟨v, w⟩ (linearity)
⟨v, v⟩ ≥ 0, with equality iff v = 0 (positive definiteness)
Why this matters: Different inner products define different notions of similarity! Kernel methods in ML use custom inner products to find nonlinear patterns.
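As a small illustration (an assumed example): the polynomial kernel k(u, v) = (u·v + 1)² behaves like an inner product in a higher-dimensional feature space, so it defines a different notion of similarity without ever constructing that space explicitly:

```python
import numpy as np

def polynomial_kernel(u, v, degree=2):
    """An inner product in an implicit higher-dimensional feature space."""
    return (np.dot(u, v) + 1) ** degree

u = np.array([1.0, 2.0])
v = np.array([2.0, 1.0])

print(np.dot(u, v))             # 4.0  - ordinary dot product
print(polynomial_kernel(u, v))  # 25.0 - a different notion of similarity
```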
Mind-blowing application: Words are vectors, and vector math works on meaning!
```python
# Word2Vec / GloVe represent words as ~300-dimensional vectors
# Famous example: King - Man + Woman ≈ Queen

king  = np.array([0.5, 0.3, 0.8, ...])  # 300 dimensions
man   = np.array([0.4, 0.2, 0.1, ...])
woman = np.array([0.4, 0.3, 0.2, ...])

# Vector arithmetic on meaning!
result = king - man + woman
# result is closest to the "queen" vector!

# This works because:
#   king - man captures "royalty without gender"
#   Adding woman reintroduces gender → queen
```
Modern AI (GPT-4, Claude) uses this same principle with transformer embeddings of 12,000+ dimensions!
You now understand how to represent houses as vectors and measure similarity. But how do we actually predict the price?

That’s where matrices come in. A matrix is a function that transforms input (house features) into output (price prediction). This is exactly how neural networks work!
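As a tiny preview (the weights below are invented, not a trained model), “matrix as a function from features to prediction” looks like this:

```python
import numpy as np

house = np.array([3, 2000, 15, 5])  # hypothetical feature labels: [beds, sqft, age, distance]

# Hypothetical learned weights: one row per output (here, just price in $k)
W = np.array([[20.0, 0.15, -1.0, -2.0]])

predicted_price_k = W @ house  # matrix-vector multiplication
print(predicted_price_k)       # [335.]
```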