# Vectors: The Language of Similarity

> From finding similar houses to understanding how recommendations work

*[Figure: Vectors - The Language of Similarity]*

## A Problem You Already Understand

You're looking for a new apartment. You visit Zillow and find one you love:

* **2 bedrooms**
* **1,200 square feet**
* **\$2,400/month rent**
* **15 minutes from work**

Now you want to find **similar apartments**. Not identical, just similar enough that you'd consider them.

Zillow shows you a "Similar Homes" section. But how did they decide which apartments are similar?

**Think about it**: What makes two apartments "similar"?

* Same number of bedrooms?
* Similar size?
* Similar rent?
* Similar commute?

**All of the above**, in some combination. And that combination is exactly what vectors and similarity measures capture.

**Estimated Time**: 3-4 hours\
**Difficulty**: Beginner\
**Prerequisites**: Basic Python\
**What You'll Build**: A "Find Similar Items" system that works for apartments, songs, or anything

**🔗 ML Connection**: Vectors are THE foundation of modern ML. Here's where you'll see them:

| ML System                  | Vector Representation                      |
| -------------------------- | ------------------------------------------ |
| **Word2Vec/GPT**           | Every word → 300-768 dimensional vector    |
| **Face Recognition**       | Every face → 128-512 dimensional embedding |
| **Recommendation Systems** | Users & items in shared vector space       |
| **Image Classification**   | CNN features as vectors                    |

After this module, you'll understand exactly how these systems find "similar" items!
***

## Step 1: Describe Things with Numbers

The first insight is simple: **we can describe any apartment as a list of numbers.**

*[Figure: Apartment as Vector]*

```python theme={null}
# My favorite apartment
my_apartment = [2, 1200, 2400, 15]
#               ↑   ↑     ↑    ↑
#             beds sqft  rent commute(min)
```

Now every apartment is just 4 numbers:

```python theme={null}
apartment_A = [2, 1200, 2400, 15]   # My favorite
apartment_B = [2, 1100, 2300, 18]   # Very similar!
apartment_C = [4, 2500, 4500, 45]   # Very different
apartment_D = [1, 800, 1900, 10]    # Smaller, cheaper, closer
```

**This list of numbers is called a vector.**

That's it. A vector is just an ordered list of numbers that describes something.

**Key Insight**: Once something is described as numbers, we can use math to compare things automatically. No human judgment needed.

*[Figure: Vector Math Concept]*

***

## Mathematical Foundations: Vector Operations

Before we measure similarity, let's master the fundamental operations. These are the building blocks of ALL machine learning.

### Vector Addition: Combine Two Vectors

When you add vectors, you add corresponding components:

$$
\mathbf{a} + \mathbf{b} = \begin{bmatrix}a_1\\a_2\\a_3\end{bmatrix} + \begin{bmatrix}b_1\\b_2\\b_3\end{bmatrix} = \begin{bmatrix}a_1 + b_1\\a_2 + b_2\\a_3 + b_3\end{bmatrix}
$$

**Real Example**: Combining two shopping carts:

```python theme={null}
import numpy as np

# Shopping cart contents: [apples, bananas, oranges]
cart_monday = np.array([3, 2, 5])
cart_tuesday = np.array([1, 4, 2])

# Total purchases
total = cart_monday + cart_tuesday
print(f"Total: {total}")  # [4, 6, 7]
```

**Geometric Interpretation**: Place vectors tip-to-tail; the sum goes from the first tail to the last tip.
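One practical caveat is worth a quick sketch. The apartments in Step 1 were written as plain Python lists, and `+` on lists joins them end to end instead of adding matching positions. The variable names `joined` and `total_py` below are illustrative only.

```python
import numpy as np

cart_monday = [3, 2, 5]
cart_tuesday = [1, 4, 2]

# Plain lists: + concatenates the two lists
joined = cart_monday + cart_tuesday
print(joined)      # [3, 2, 5, 1, 4, 2]

# NumPy arrays: + adds matching positions
total = np.array(cart_monday) + np.array(cart_tuesday)
print(total)       # [4 6 7]

# Pure-Python equivalent of element-wise addition
total_py = [m + t for m, t in zip(cart_monday, cart_tuesday)]
print(total_py)    # [4, 6, 7]
```

Converting to `np.array` before doing vector math avoids this surprise.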
### Scalar Multiplication: Scale a Vector

Multiply every component by the same number (scalar):

$$
c \cdot \mathbf{v} = c \cdot \begin{bmatrix}v_1\\v_2\\v_3\end{bmatrix} = \begin{bmatrix}c \cdot v_1\\c \cdot v_2\\c \cdot v_3\end{bmatrix}
$$

**Real Example**: Double a recipe:

```python theme={null}
# Recipe: [flour_cups, sugar_cups, eggs]
recipe = np.array([2, 0.5, 3])

# Double the recipe
doubled = 2 * recipe
print(f"Doubled: {doubled}")  # [4, 1, 6]

# Half the recipe
halved = 0.5 * recipe
print(f"Halved: {halved}")  # [1, 0.25, 1.5]
```

**Geometric Interpretation**: Scalar > 1 stretches the vector; 0 \< scalar \< 1 shrinks it; negative flips direction.

### Vector Magnitude (Length)

The magnitude (or norm) measures how "big" a vector is:

$$
\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\sum_{i=1}^{n} v_i^2}
$$

**Real Example**: Distance from origin:

```python theme={null}
# Your position: [x, y] = [3, 4]
position = np.array([3, 4])

# Distance from origin (Pythagorean theorem!)
magnitude = np.sqrt(3**2 + 4**2)  # = 5

# Or use NumPy:
magnitude = np.linalg.norm(position)  # = 5.0
print(f"Distance from origin: {magnitude}")
```

**Fun fact**: The 3-4-5 triangle is the most famous Pythagorean triple! Ancient Egyptian builders are often said to have used it to lay out right angles.

### Unit Vectors: Direction Without Magnitude

A **unit vector** has length 1 and only represents direction:

$$
\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}
$$

**Real Example**: Normalize for comparison:

```python theme={null}
# Two vectors with different magnitudes
review_1 = np.array([5, 4, 5, 3, 4])  # Enthusiastic reviewer
review_2 = np.array([2, 1, 2, 1, 1])  # Reserved reviewer

# Convert to unit vectors (direction only)
unit_1 = review_1 / np.linalg.norm(review_1)
unit_2 = review_2 / np.linalg.norm(review_2)

print(f"Unit 1: {unit_1.round(3)}")
print(f"Unit 2: {unit_2.round(3)}")
print(f"Lengths: {np.linalg.norm(unit_1):.3f}, {np.linalg.norm(unit_2):.3f}")
# Both have length 1.0!
```

**Key insight**: Normalization removes the "enthusiasm" factor and compares only the *pattern* of ratings.

### Vector Subtraction: Finding the Difference

$$
\mathbf{a} - \mathbf{b} = \begin{bmatrix}a_1 - b_1\\a_2 - b_2\\a_3 - b_3\end{bmatrix}
$$

**Real Example**: What changed between two time periods?

```python theme={null}
# Monthly sales: [Product A, Product B, Product C]
january = np.array([1000, 500, 750])
february = np.array([1200, 450, 800])

# Change from January to February
change = february - january
print(f"Change: {change}")  # [200, -50, 50]
# Product A: +200, Product B: -50, Product C: +50
```

### Practice: Vector Arithmetic

Let's combine these operations:

```python theme={null}
import numpy as np

# Portfolio weights: [stocks, bonds, real_estate]
portfolio = np.array([0.6, 0.3, 0.1])

# Expected returns for each asset class
returns = np.array([0.10, 0.04, 0.06])  # 10%, 4%, 6%

# Weighted average return (dot product preview!)
# 0.6*0.10 + 0.3*0.04 + 0.1*0.06 = 0.078
portfolio_return = np.sum(portfolio * returns)
print(f"Expected portfolio return: {portfolio_return:.2%}")  # 7.80%

# Rebalance: shift 10% from stocks to bonds
shift = np.array([-0.1, 0.1, 0])
new_portfolio = portfolio + shift
print(f"Rebalanced: {new_portfolio}")  # [0.5, 0.4, 0.1]
```

***

## Step 2: Measure How Similar Two Apartments Are

Now the real question: **Given two apartments as vectors, how do we measure their similarity?**

*[Figure: Apartment Similarity Space]*

### Attempt 1: Just Subtract (Doesn't Work Well)

Your first instinct might be to subtract the numbers:

```python theme={null}
apartment_A = [2, 1200, 2400, 15]
apartment_B = [2, 1100, 2300, 18]

# Subtract element by element
difference = [a - b for a, b in zip(apartment_A, apartment_B)]
print(difference)  # [0, 100, 100, -3]
```

But what does `[0, 100, 100, -3]` mean? The numbers have different units (bedrooms vs sqft vs dollars vs minutes). We can't just add them.
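To see why mixed units are a problem, here is a small sketch that adds up the raw differences anyway. The helper names `naive_gap` and `commute_in_seconds`, and the apartment `apt_e`, are hypothetical and exist only for this illustration. Changing nothing but the unit of the commute column flips which apartment looks closer.

```python
def naive_gap(a, b):
    """Sum of absolute differences, ignoring units (deliberately naive)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def commute_in_seconds(apt):
    """Same apartment, with the commute expressed in seconds instead of minutes."""
    return apt[:3] + [apt[3] * 60]

# [beds, sqft, rent, commute in minutes]
apt_a = [2, 1200, 2400, 15]
apt_b = [2, 1100, 2300, 18]
apt_e = [2, 1250, 2450, 60]   # hypothetical: close in size and rent, 45 min farther away

gap_b = naive_gap(apt_a, apt_b)
gap_e = naive_gap(apt_a, apt_e)
print(gap_b, gap_e)           # 203 145 -> E looks "closer" than B

gap_b_sec = naive_gap(commute_in_seconds(apt_a), commute_in_seconds(apt_b))
gap_e_sec = naive_gap(commute_in_seconds(apt_a), commute_in_seconds(apt_e))
print(gap_b_sec, gap_e_sec)   # 380 2800 -> now B looks closer
```

A measure whose ranking depends on the choice of units is not one we want to rely on.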
### Attempt 2: Euclidean Distance (Works, But Has Issues)

We could calculate the "distance" between apartments in 4D space:

```python theme={null}
import numpy as np

def distance(a, b):
    """Euclidean distance between two vectors."""
    return np.sqrt(sum((a[i] - b[i])**2 for i in range(len(a))))

apartment_A = np.array([2, 1200, 2400, 15])
apartment_B = np.array([2, 1100, 2300, 18])
apartment_C = np.array([4, 2500, 4500, 45])

print(f"A vs B: {distance(apartment_A, apartment_B):.0f}")  # 141
print(f"A vs C: {distance(apartment_A, apartment_C):.0f}")  # 2470
```

B is much closer to A than C is. Good!

**But there's a problem**: the sqft and rent numbers are huge (1000s) while bedrooms and commute are small (single digits). The big numbers dominate everything.

### Attempt 3: Normalize First, Then Compare

The fix: **scale all features to the same range** (usually 0 to 1):

```python theme={null}
def normalize(apartments):
    """Scale each feature to 0-1 range."""
    apartments = np.array(apartments)
    mins = apartments.min(axis=0)
    maxs = apartments.max(axis=0)
    return (apartments - mins) / (maxs - mins)

# Original apartments
apartments = [
    [2, 1200, 2400, 15],  # A
    [2, 1100, 2300, 18],  # B
    [4, 2500, 4500, 45],  # C
    [1, 800, 1900, 10],   # D
]

# After normalization (all values between 0 and 1)
normalized = normalize(apartments)
print(normalized)
# A: [0.33, 0.24, 0.19, 0.14]
# B: [0.33, 0.18, 0.15, 0.23]
# C: [1.00, 1.00, 1.00, 1.00]
# D: [0.00, 0.00, 0.00, 0.00]
```

Now all features are on equal footing. A difference of 0.1 in bedrooms matters as much as 0.1 in rent.

***

## Step 3: The Dot Product - Measuring Alignment

There's another way to measure similarity: the **dot product**.

### Mathematical Definition

The dot product (also called inner product or scalar product) of two vectors:

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n
$$

**What it does**: Multiply corresponding numbers and add them up.
```python theme={null}
def dot_product(a, b):
    """Multiply corresponding elements and sum."""
    return sum(a[i] * b[i] for i in range(len(a)))

# Or simply: np.dot(a, b)
```

**Example:**

```python theme={null}
a = [1, 2, 3]
b = [4, 5, 6]

dot = 1*4 + 2*5 + 3*6   # = 4 + 10 + 18 = 32
```

### Geometric Interpretation

The dot product has a beautiful geometric meaning:

$$
\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta)
$$

Where $\theta$ is the angle between the vectors!

*[Figure: Dot Product and Cosine Similarity Geometric Intuition]*

**🎮 Interactive Visualization**: Try the code below to see how the dot product changes as you rotate vectors!

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

def visualize_dot_product(angle_degrees=45):
    # Fixed vector a
    a = np.array([1, 0])

    # Vector b at specified angle
    angle_rad = np.radians(angle_degrees)
    b = np.array([np.cos(angle_rad), np.sin(angle_rad)])

    # Calculate dot product
    dot = np.dot(a, b)

    # Plot
    plt.figure(figsize=(8, 6))
    plt.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label='Vector a')
    plt.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label='Vector b')
    plt.xlim(-1.5, 1.5)
    plt.ylim(-1.5, 1.5)
    plt.grid(True, alpha=0.3)
    plt.axhline(y=0, color='k', linewidth=0.5)
    plt.axvline(x=0, color='k', linewidth=0.5)
    plt.title(f'Angle: {angle_degrees}° | Dot Product: {dot:.3f} | cos({angle_degrees}°) = {np.cos(angle_rad):.3f}')
    plt.legend()
    plt.axis('equal')
    plt.show()

# Interactive slider - run in Jupyter!
# interact(visualize_dot_product, angle_degrees=FloatSlider(min=0, max=360, step=5, value=45))
```

**What this tells us:**

* $\theta = 0°$ (same direction): $\cos(0°) = 1$ → Maximum positive dot product
* $\theta = 90°$ (perpendicular): $\cos(90°) = 0$ → Dot product is zero
* $\theta = 180°$ (opposite): $\cos(180°) = -1$ → Most negative dot product

```python theme={null}
import numpy as np

# Same direction
a = np.array([3, 0])
b = np.array([5, 0])
print(f"Same direction: {np.dot(a, b)}")  # 15 (positive)

# Perpendicular (90°)
a = np.array([3, 0])
b = np.array([0, 4])
print(f"Perpendicular: {np.dot(a, b)}")  # 0

# Opposite direction
a = np.array([3, 0])
b = np.array([-2, 0])
print(f"Opposite: {np.dot(a, b)}")  # -6 (negative)

# Verify the angle formula
a = np.array([1, 0])
b = np.array([1, 1])  # 45 degrees
angle = np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Angle between: {np.degrees(angle):.1f}°")  # 45.0°
```

### The Dot Product in Action

**Why does this measure similarity?**

Think about it intuitively:

* If both apartments are **high in the same features** (both large, both expensive), the products are large → high dot product
* If one is high where the other is low, products are small → low dot product
* Apartments that are "aligned" (similar profile) have high dot products

***

## Step 4: Cosine Similarity - The Industry Standard

The dot product has one problem: **bigger vectors give bigger numbers** regardless of similarity.
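A quick sketch of that problem, using two made-up vectors `a` and `b` and a hypothetical helper `cos_angle` based on the angle formula above: scaling one vector by 10 multiplies the dot product by 10, even though the angle between the vectors has not changed.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

def cos_angle(u, v):
    """cos(theta) from the geometric formula a·b = |a||b|cos(theta)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dot_small = np.dot(a, b)
dot_big = np.dot(10 * a, b)
print(dot_small)   # 5.5
print(dot_big)     # 55.0 -> ten times larger

# The angle, and therefore its cosine, is unchanged by the scaling
same_angle = np.isclose(cos_angle(a, b), cos_angle(10 * a, b))
print(same_angle)  # True
```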
**Cosine similarity** fixes this by normalizing:

$$
\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
$$

This gives a number between -1 and 1:

* **1.0** = identical direction (very similar)
* **0.0** = perpendicular (unrelated)
* **-1.0** = opposite direction (opposites)

```python theme={null}
def cosine_similarity(a, b):
    """Similarity based on angle, not magnitude."""
    dot = np.dot(a, b)
    magnitude_a = np.sqrt(np.dot(a, a))  # length of a
    magnitude_b = np.sqrt(np.dot(b, b))  # length of b
    return dot / (magnitude_a * magnitude_b)
```

**Let's test it on our apartments:**

```python theme={null}
import numpy as np

apartments = {
    'A (my favorite)': np.array([2, 1200, 2400, 15]),
    'B (similar)': np.array([2, 1100, 2300, 18]),
    'C (luxury)': np.array([4, 2500, 4500, 45]),
    'D (studio)': np.array([1, 800, 1900, 10]),
}

my_apt = apartments['A (my favorite)']

print("Similarity to my apartment:")
for name, apt in apartments.items():
    sim = cosine_similarity(my_apt, apt)
    print(f"  {name}: {sim:.3f}")
```

**Output:**

```
Similarity to my apartment:
  A (my favorite): 1.000  ← identical to itself
  B (similar): 1.000      ← very similar!
  C (luxury): 0.999       ← surprisingly similar (same "shape", just bigger)
  D (studio): 0.998       ← also similar (same "shape", just smaller)
```

**Wait, why is C so similar?**

Because cosine similarity measures **direction**, not **magnitude**. C is roughly a "scaled up" version of A: nearly the same proportions, just bigger numbers.

This is actually useful! It finds apartments with the **same profile** (ratio of bedrooms to sqft to rent), regardless of absolute size. Notice, though, that on raw features every score lands close to 1, because sqft and rent dominate each vector. That is one more reason to normalize the features first.
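One edge case the formula does not cover is a vector of all zeros, whose length is 0, so the division is undefined. The sketch below is one possible guard, not a standard API: the name `safe_cosine_similarity`, the `eps` threshold, and the choice to return `0.0` are all assumptions made for illustration.

```python
import numpy as np

def safe_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near) zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < eps or norm_b < eps:
        return 0.0  # convention chosen for this sketch; raising an error is another option
    return float(np.dot(a, b) / (norm_a * norm_b))

zero_case = safe_cosine_similarity([1, 0], [0, 0])
parallel_case = safe_cosine_similarity([1, 2], [2, 4])
print(zero_case)                 # 0.0
print(round(parallel_case, 6))   # 1.0
```

Adding a tiny constant to the denominator is a more compact alternative that gives almost the same result.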
***

## Real-World Application: Build a "Similar Apartments" Finder

Let's build a working system:

```python theme={null}
import numpy as np

class ApartmentFinder:
    def __init__(self, apartments):
        """
        apartments: dict of {name: [beds, sqft, rent, commute]}
        """
        self.names = list(apartments.keys())
        self.vectors = np.array(list(apartments.values()))

        # Normalize for fair comparison
        self.normalized = self._normalize(self.vectors)

    def _normalize(self, data):
        mins = data.min(axis=0)
        maxs = data.max(axis=0)
        return (data - mins) / (maxs - mins + 1e-8)  # avoid division by zero

    def find_similar(self, query, top_k=3):
        """Find top_k most similar apartments to query."""
        # Normalize the query
        query_norm = (np.array(query) - self.vectors.min(axis=0)) / \
                     (self.vectors.max(axis=0) - self.vectors.min(axis=0) + 1e-8)

        # Calculate similarity to all apartments
        similarities = []
        for i, apt in enumerate(self.normalized):
            sim = np.dot(query_norm, apt) / \
                  (np.linalg.norm(query_norm) * np.linalg.norm(apt) + 1e-8)
            similarities.append((self.names[i], sim, self.vectors[i]))

        # Sort by similarity (highest first)
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]

# Database of apartments
listings = {
    'Downtown Loft': [1, 900, 2800, 5],
    'Suburban House': [4, 2200, 3200, 35],
    'Cozy Studio': [0, 500, 1500, 20],
    'Modern 2BR': [2, 1100, 2400, 15],
    'Family Home': [3, 1800, 2900, 25],
    'Luxury Penthouse': [2, 1500, 5500, 10],
    'Budget 1BR': [1, 700, 1800, 30],
    'Midtown 2BR': [2, 1050, 2350, 12],
}

finder = ApartmentFinder(listings)

# What I'm looking for
my_ideal = [2, 1200, 2400, 15]

print("Your search: 2BR, 1200sqft, $2400, 15min commute")
print("\nMost similar apartments:")
for name, similarity, features in finder.find_similar(my_ideal):
    print(f"  {similarity:.2f} - {name}: {int(features[0])}BR, "
          f"{int(features[1])}sqft, ${int(features[2])}, {int(features[3])}min")
```

**Output:**

```
Your search: 2BR, 1200sqft, $2400, 15min commute

Most similar apartments:
  1.00 - Modern 2BR: 2BR, 1100sqft, $2400, 15min
  0.99 - Family Home: 3BR, 1800sqft, $2900, 25min
  0.99 - Midtown 2BR: 2BR, 1050sqft, $2350, 12min
```

Family Home narrowly outranks Midtown 2BR because cosine similarity compares the *proportions* of the scaled features, not their absolute size.

**You just built a simplified version of a "Similar Homes" feature!**

***

## Now Let's Connect This to Machine Learning

Everything we just learned about apartments applies directly to ML. The concepts are identical; only the application changes.

### Pattern: Real World → Vector → Similarity

| Real World | Vector Representation                | What Similarity Finds |
| ---------- | ------------------------------------ | --------------------- |
| Apartments | \[beds, sqft, rent, commute]         | Similar listings      |
| Songs      | \[energy, tempo, danceability, mood] | Songs you'll like     |
| Movies     | \[action, romance, comedy, rating]   | Movies to recommend   |
| Customers  | \[age, income, purchases, visits]    | Customer segments     |
| Images     | \[pixel1, pixel2, ..., pixel1000000] | Similar images        |
| Words      | \[dimension1, ..., dimension300]     | Related words         |

**The math is identical.** Once something is a vector, you can find similar items using dot products and cosine similarity.

### Example: How Spotify Actually Works

Remember our apartment finder? A music service can apply the same idea to songs:

```python theme={null}
# Audio features in the style of Spotify's (simplified, illustrative values)
# [energy, dance, tempo, acoustic, happy]
songs = {
    'Blinding Lights': [0.73, 0.51, 135, 0.00, 0.32],
    'Levitating': [0.69, 0.70, 103, 0.03, 0.91],
    'Someone Like You': [0.34, 0.50, 67, 0.75, 0.14],
    'Uptown Funk': [0.93, 0.89, 115, 0.00, 0.97],
    'Hello': [0.40, 0.48, 79, 0.73, 0.25],
}

# Reusing our same logic: ApartmentFinder does not care what the numbers mean
finder = ApartmentFinder(songs)

# You just listened to Blinding Lights (skip the first hit, which is the song itself)
for name, similarity, _ in finder.find_similar(songs['Blinding Lights'], top_k=3)[1:]:
    print(name)
# → Uptown Funk, Levitating (similar energy/tempo profiles, far ahead of the two ballads)
```

**Real recommendation engines are far more elaborate, but the vector similarity concept you just learned with apartments is one of their core building blocks.**

***

## How This Applies to Neural Networks

Now let's take the final step.
In neural networks, **everything is vectors**, and **everything is similarity and transformation**.

### What a Neural Network Does (Simplified)

1. **Input**: Convert your data to a vector (image → pixels, text → numbers)
2. **Layers**: Transform the vector through matrix multiplications (we'll learn this next!)
3. **Output**: Compare the final vector to known categories using... similarity

```python theme={null}
# Simplified: How image classification works
# image_vector = [0.1, 0.8, 0.3, ...]            # 1000s of numbers from pixels
# The network transforms this to a "meaning" vector:
# meaning_vector = neural_network(image_vector)
meaning_vector = np.array([0.9, 0.1, 0.05])      # Let's say this is the result

# Compare to category vectors
cat_vector = np.array([1.0, 0.0, 0.0])  # What a "cat" looks like in meaning-space
dog_vector = np.array([0.0, 1.0, 0.0])  # What a "dog" looks like

# Which is more similar?
cat_similarity = cosine_similarity(meaning_vector, cat_vector)  # 0.99
dog_similarity = cosine_similarity(meaning_vector, dog_vector)  # 0.11

# Prediction: It's a cat! (higher similarity)
```

**The core operation, vector similarity, is exactly what you learned with apartments.**

```python theme={null}
def cosine_similarity(a, b):
    """
    Returns a value between -1 and 1:
    -  1.0 = identical direction (very similar)
    -  0.0 = perpendicular (unrelated)
    - -1.0 = opposite direction (very different)
    """
    dot = np.dot(a, b)
    magnitude_a = np.linalg.norm(a)  # length of a
    magnitude_b = np.linalg.norm(b)  # length of b
    return dot / (magnitude_a * magnitude_b)
```

**Now let's use it on our songs:**

```python theme={null}
# Using our song vectors from earlier: [energy, dance, tempo, acoustic, happy]
# Tempo is divided by 200 (an assumed upper bound) so it does not swamp the 0-1 features
scale = np.array([1, 1, 200, 1, 1])
blinding_lights = np.array(songs['Blinding Lights']) / scale
levitating = np.array(songs['Levitating']) / scale
someone_like_you = np.array(songs['Someone Like You']) / scale

sim_blinding_levitating = cosine_similarity(blinding_lights, levitating)
sim_blinding_adele = cosine_similarity(blinding_lights, someone_like_you)

print(f"Blinding Lights vs Levitating: {sim_blinding_levitating:.3f}")
print(f"Blinding Lights vs Someone Like You: {sim_blinding_adele:.3f}")

# Output:
# Blinding Lights vs Levitating: 0.899 (similar - both upbeat pop)
# Blinding Lights vs Someone Like You: 0.647 (less similar - different vibes)
```

**That's the core idea behind similarity-based recommendation in a nutshell!** Find songs with the highest cosine similarity to what you just played.

***

## Vector Operations: The Building Blocks

Now that we can represent houses as vectors, what can we do with them?

### 1. Vector Addition: Combining Features

**The Question**: What if we want to combine two house profiles?

*[Figure: Vector Addition]*

**Geometric Intuition**: Place vectors tip-to-tail. The result is the diagonal.

**Algebraic Definition**: Add corresponding components.
```python theme={null}
# Two house feature vectors
house_1 = np.array([3, 2000, 15, 5])
house_2 = np.array([2, 1500, 10, 3])

# Average house in the neighborhood
average_house = (house_1 + house_2) / 2
print(average_house)  # [2.5, 1750, 12.5, 4]
```

**Why This Matters**:

* **Feature engineering**: Combine features to create new ones
* **Gradient descent**: Update model parameters by adding gradients
* **Ensemble methods**: Average predictions from multiple models

**Real-World Example**: User preferences

```python theme={null}
# User's historical preferences
past_prefs = np.array([0.8, 0.2, 0.5])  # [action, comedy, drama]

# Recent viewing behavior
recent = np.array([0.1, 0.3, 0.2])

# Updated preferences (weighted sum)
new_prefs = 0.7 * past_prefs + 0.3 * recent
print(new_prefs)  # [0.59, 0.23, 0.41]
```

***

### 2. Scalar Multiplication: Scaling Features

**The Question**: What happens when every feature of a house grows by 20%?

*[Figure: Scalar Multiplication]*

**Geometric Intuition**: Stretch or shrink the vector. Direction stays the same.

**Algebraic Definition**: Multiply each component by a number (scalar).

```python theme={null}
house = np.array([3, 2000, 15, 5])

# Scale by 1.2 (20% increase)
scaled_house = 1.2 * house
print(scaled_house)  # [3.6, 2400, 18, 6]
```

**Why This Matters**:

* **Normalization**: Scale features to same range
* **Learning rate**: Control how much to update parameters
* **Feature weighting**: Emphasize important features

**ML Application**: Gradient descent

```python theme={null}
# Current model parameters
weights = np.array([50000, 120, -5000, -8000])

# Gradient (direction of steepest increase in the loss)
gradient = np.array([100, 0.5, -20, -30])

# Learning rate (how far to move)
learning_rate = 0.01

# Update parameters: step against the gradient
weights = weights - learning_rate * gradient
#                   ↑ scalar multiplication!
```

**Key Insight**: The learning rate controls the step size. Too large → overshoot. Too small → slow learning.

***

### 3. Dot Product: Measuring Similarity

**The Big Question**: How do we measure if two things are similar?

This is one of the most important operations in machine learning! Let's see why through three examples.

*[Figure: Dot Product with Houses]*

**Algebraic Definition**: Multiply corresponding components and sum.

**Mathematical Formula**:

$$
\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{n} v_i w_i = v_1w_1 + v_2w_2 + \ldots + v_nw_n
$$

**Alternative Formula** (geometric):

$$
\mathbf{v} \cdot \mathbf{w} = \|\mathbf{v}\| \|\mathbf{w}\| \cos(\theta)
$$

Where $\theta$ is the angle between vectors.

***

### Example 1: Comparing Houses

```python theme={null}
house_1 = np.array([3, 2000, 10, 3])   # Suburban family home
house_2 = np.array([4, 2200, 8, 2])    # Similar house
house_3 = np.array([1, 800, 50, 20])   # Old studio apartment

# Compute dot products
sim_1_2 = np.dot(house_1, house_2)
sim_1_3 = np.dot(house_1, house_3)

print(f"House 1 · House 2 = {sim_1_2}")  # 4,400,098 (large, positive)
print(f"House 1 · House 3 = {sim_1_3}")  # 1,600,563 (much smaller)
```

**Interpretation**:

* Here the larger dot product goes with the more similar house
* The smaller dot product goes with the more different house
* **Why?** The products are large when both houses are high in the same features
* **Caveat**: the raw dot product also grows with magnitude, and sqft contributes almost all of it here. Cosine similarity, covered later in this module, removes the magnitude effect

**Real application**: Real-estate sites can use this kind of comparison to suggest "similar homes" while you're browsing.

***

### Example 2: Matching Students for Study Groups

```python theme={null}
# Student profiles: [math_score, reading_score, science_score, study_hours]
alice = np.array([85, 92, 78, 12])    # Strong in reading
bob = np.array([95, 75, 88, 15])      # Strong in math
charlie = np.array([87, 90, 80, 13])  # Similar to Alice

# Who should Alice study with?
alice_bob = np.dot(alice, bob)
alice_charlie = np.dot(alice, charlie)

print(f"Alice · Bob = {alice_bob}")          # 22,019
print(f"Alice · Charlie = {alice_charlie}")  # 22,071 (higher!)
```

**Interpretation**: Alice and Charlie have more similar learning patterns!
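To see where the gap between the two dot products comes from, it can help to look at the per-subject products before they are summed. The sketch below repeats the three student vectors so it runs on its own; the names `subjects`, `with_bob`, `with_charlie`, and `gap` are illustrative.

```python
import numpy as np

subjects = ["math", "reading", "science", "study_hours"]
alice = np.array([85, 92, 78, 12])
bob = np.array([95, 75, 88, 15])
charlie = np.array([87, 90, 80, 13])

# One product per feature; summing each array gives the dot product
with_bob = alice * bob
with_charlie = alice * charlie
gap = with_charlie - with_bob

for name, b_part, c_part, diff in zip(subjects, with_bob, with_charlie, gap):
    print(f"{name:>12}: Bob {int(b_part):>5}  Charlie {int(c_part):>5}  difference {int(diff):+}")
```

Reading is where Charlie pulls ahead, while math and science favor Bob. The overall margin is small because every score is large, so every product is large whether or not the patterns match.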
**Why this matters**:

* Form effective study groups (similar students help each other)
* Pair struggling students with successful ones who had similar challenges
* Predict who will benefit from group work

**Real application**: Educational platforms could use this for peer matching.

***

### Example 3: Movie Recommendations

```python theme={null}
# Movie features: [rating, runtime, year, action, romance, comedy]
inception = np.array([8.8, 148, 2010, 0.9, 0.1, 0.3])
interstellar = np.array([8.6, 169, 2014, 0.7, 0.2, 0.2])
titanic = np.array([7.9, 195, 1997, 0.3, 0.9, 0.2])

# You just watched Inception. What should Netflix recommend?
inception_interstellar = np.dot(inception, interstellar)
inception_titanic = np.dot(inception, titanic)

print(f"Inception · Interstellar = {inception_interstellar:,.2f}")  # 4,073,228.39
print(f"Inception · Titanic = {inception_titanic:,.2f}")            # 4,042,899.94
```

**Recommendation**: Watch Interstellar! (Higher similarity)

**Why it works**: Both are:

* High-rated sci-fi films
* Similar runtime
* Recent releases
* Action-heavy with minimal romance

Be careful with the raw numbers, though: almost all of each dot product is the year × year term (about 4 million), so the genre features barely register. Rescale the features before relying on the score.

**Real application**: Recommendation systems at services like Netflix, Spotify, and YouTube build on this kind of vector comparison.

***

### Understanding the Dot Product Geometrically

**Key Insights**:

```python theme={null}
# Parallel vectors (same direction) → large positive dot product
v1 = np.array([2, 0])
v2 = np.array([3, 0])
print(np.dot(v1, v2))  # 6 (positive, large)

# Perpendicular vectors (90°) → dot product = 0
v3 = np.array([1, 0])
v4 = np.array([0, 1])
print(np.dot(v3, v4))  # 0 (orthogonal = independent!)

# Opposite vectors (180°) → negative dot product
v5 = np.array([1, 1])
v6 = np.array([-1, -1])
print(np.dot(v5, v6))  # -2 (opposite)
```

**What this means**:

* **Positive dot product**: Vectors point in similar directions (similar items)
* **Zero dot product**: Vectors are perpendicular (unrelated items)
* **Negative dot product**: Vectors point in opposite directions (opposite items)

***

### Why Dot Product is Everywhere in ML

**1. Neural Networks**: Every layer computes dot products!

```python theme={null}
# A neuron computes: output = weights · inputs + bias
weights = np.array([0.5, -0.3, 0.8])
inputs = np.array([1.0, 2.0, 3.0])

output = np.dot(weights, inputs) + 0.1
# = (0.5×1.0) + (-0.3×2.0) + (0.8×3.0) + 0.1
# = 0.5 - 0.6 + 2.4 + 0.1 = 2.4
```

**2. Similarity Search**: Find similar items

```python theme={null}
# Find products similar to what user just bought
user_purchase = np.array([1, 0, 1, 0, 1])  # Product features

all_products = np.array([
    [1, 0, 1, 1, 0],  # Product A
    [1, 0, 1, 0, 1],  # Product B (identical!)
    [0, 1, 0, 1, 0],  # Product C (different)
])

similarities = [np.dot(user_purchase, product) for product in all_products]
print(similarities)  # [2, 3, 0] → Recommend Product B!
```

**3. Attention Mechanisms**: How transformers (GPT, BERT) work

```python theme={null}
# Simplified: How much should we "attend" to each word?
query = np.array([0.8, 0.2, 0.5])  # Current word
key1 = np.array([0.9, 0.1, 0.4])   # Word 1
key2 = np.array([0.2, 0.8, 0.1])   # Word 2

attention_1 = np.dot(query, key1)  # 0.94 (high attention!)
attention_2 = np.dot(query, key2)  # 0.37 (low attention)
```

***

### 4. Vector Magnitude: Measuring "Size"

**The Question**: How "big" is a house (in feature space)?

**Geometric Intuition**: The length of the arrow.

**Algebraic Definition**: Square root of dot product with itself.

```python theme={null}
house = np.array([3, 2000, 15, 5])

magnitude = np.linalg.norm(house)
# = sqrt(3² + 2000² + 15² + 5²)
# = sqrt(9 + 4,000,000 + 225 + 25)
# = sqrt(4,000,259)
# ≈ 2000.06
```

**Mathematical Formula**:

$$
\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}
$$

**Why This Matters**: Normalization!
```python theme={null}
# Problem: sqft dominates (2000 vs 3 bedrooms)
house = np.array([3, 2000, 15, 5])

# Normalize to unit length
normalized = house / np.linalg.norm(house)
print(normalized)  # ≈ [0.0015, 1.0000, 0.0075, 0.0025]
print(np.linalg.norm(normalized))  # 1.0 (unit vector)
```

**Key Insight**: After unit-length normalization the vector has magnitude 1, so overall size no longer matters. Note that sqft still dominates the *direction* (its component is almost 1). To put features on equal footing, scale each feature separately first, as in the 0-1 normalization from Step 2.

***

## Similarity Measures: Finding Similar Items

### Cosine Similarity: Direction-Based

**The Problem with Dot Product**: It's affected by magnitude!

```python theme={null}
# Two houses with same type, different size
small_house = np.array([2, 1000, 10, 3])
large_house = np.array([4, 2000, 20, 6])  # 2× small_house

# Dot product is very different
print(np.dot(small_house, small_house))  # 1,000,113
print(np.dot(large_house, large_house))  # 4,000,452 (4× larger!)
```

**The Solution**: Cosine similarity ignores magnitude, only cares about direction (type).

*[Figure: Cosine Similarity]*

**Formula**:

$$
\text{similarity}(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|} = \cos(\theta)
$$

**Range**: -1 (opposite) to +1 (identical direction)

```python theme={null}
def cosine_similarity(v, w):
    """Compute cosine similarity between two vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```

***

### Example 1: House Type Matching (Ignoring Size)

```python theme={null}
# Find houses of similar TYPE, regardless of size
small_suburban = np.array([2, 1000, 10, 5])    # Small suburban
large_suburban = np.array([4, 2000, 20, 10])   # Large suburban (2× size)
urban_apartment = np.array([1, 800, 5, 1])     # Urban apartment

# Cosine similarity
print(f"Small vs Large suburban: {cosine_similarity(small_suburban, large_suburban):.3f}")  # 1.000!
print(f"Small suburban vs Urban: {cosine_similarity(small_suburban, urban_apartment):.3f}")  # 1.000
```

**Key Insight**: The two suburban houses are identical in TYPE (cosine = 1.0), even though one is twice the size!

The urban apartment also scores about 1.000 on these raw features (0.99999 before rounding), because sqft dominates both vectors. Scale each feature first if you want cosine similarity to tell the two types apart.

**Why this matters**:

* A family looking for a suburban house may care less about whether it's 2000 or 4000 sqft
* They care about the TYPE: suburban, family-friendly, good schools
* Cosine similarity on well-scaled features captures this!

**Real application**: A "similar homes" feature can use cosine similarity to find homes of similar style, not just similar size.

***

### Example 2: Student Learning Style (Not Just Scores)

```python theme={null}
# Student profiles: [math, reading, science, study_hours]
alice = np.array([85, 92, 78, 12])        # Strong reader, moderate study
alice_2x = np.array([170, 184, 156, 24])  # Alice with 2× scores (impossible, but illustrative)
bob = np.array([95, 75, 88, 15])          # Strong in math

# Cosine similarity
print(f"Alice vs Alice_2x: {cosine_similarity(alice, alice_2x):.3f}")  # 1.000 (same learning style!)
print(f"Alice vs Bob: {cosine_similarity(alice, bob):.3f}")  # 0.989 (different style)
```

**Interpretation**:

* Alice and Alice\_2x have IDENTICAL learning patterns (cosine = 1.0)
* The magnitude doesn't matter; it's the PATTERN that counts
* Alice is strong in reading, Bob is strong in math (different patterns)

**Why this matters**:

* Match students with similar learning STYLES, not just similar scores
* A student who scores 60/70/65 has roughly the same pattern as one who scores 80/93/87
* Recommend study materials based on learning style, not absolute performance

**Real application**: A learning platform could match students with similar learning patterns to suggest effective study paths.
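When every score is high, raw cosine similarities all sit close to 1, so differences in pattern are hard to see. One option, sketched here as a swapped-in technique that the text above does not use, is to subtract each student's own mean before taking the cosine. This is the same quantity as the Pearson correlation between the two score lists. The helper name `centered_cosine` is hypothetical, `charlie` reuses the scores from the study-group example, and `study_hours` is left out because it is measured in different units.

```python
import numpy as np

def centered_cosine(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Test scores only: [math, reading, science]
alice = [85, 92, 78]
bob = [95, 75, 88]
charlie = [87, 90, 80]

print(f"Alice vs Bob:     {centered_cosine(alice, bob):.3f}")      # -0.640
print(f"Alice vs Charlie: {centered_cosine(alice, charlie):.3f}")  # 0.974
```

After centering, Alice and Charlie (both strongest in reading) stay close, while Alice and Bob come out negatively related. Note that a student with identical scores in every subject would center to the zero vector, which this sketch does not handle.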
***

### Example 3: Movie Taste (Not Just Ratings)

```python theme={null}
# Movie preferences: [action, romance, comedy, horror, sci-fi]
user_A = np.array([5, 1, 3, 0, 4])        # Loves action & sci-fi
user_A_harsh = np.array([3, 0, 2, 0, 2])  # Same taste, harsher ratings
user_B = np.array([1, 5, 2, 4, 0])        # Loves romance & horror

# Cosine similarity
print(f"User A vs A_harsh: {cosine_similarity(user_A, user_A_harsh):.3f}")  # 0.985 (same taste!)
print(f"User A vs B: {cosine_similarity(user_A, user_B):.3f}")  # 0.330 (different taste)
```

**Key Insight**: User A and User A\_harsh have the SAME TASTE, just different rating scales!

* User A rates generously (5, 4, 3)
* User A\_harsh rates strictly (3, 2, 2)
* But they like the SAME TYPES of movies!

**Why this matters**:

* Some users rate everything 5 stars, others are harsh critics
* Cosine similarity finds users with similar TASTE, not similar rating scales
* Recommend movies based on taste, not rating magnitude

**Real application**: Cosine similarity suits recommendation systems because users have different rating behaviors, but similar tastes should get similar recommendations.

***

### When to Use Cosine vs. Euclidean Distance

**Use Cosine Similarity when**:

* ✅ Direction matters more than magnitude
* ✅ Different scales (harsh vs. generous raters)
* ✅ Text similarity (document length doesn't matter)
* ✅ Recommendation systems (taste, not intensity)

**Use Euclidean Distance when**:

* ✅ Absolute position matters
* ✅ Same scale for all features
* ✅ Clustering (K-means)
* ✅ Anomaly detection (how far from normal?)

```python theme={null}
# Example: Anomaly detection
def euclidean_distance(v, w):
    return np.linalg.norm(v - w)

normal_house = np.array([3, 2000, 15, 5])
similar_house = np.array([3, 2100, 14, 4])
anomaly = np.array([10, 8000, 2, 50])  # Weird house!

# Euclidean distance (absolute difference)
print(f"Normal vs Similar: {euclidean_distance(normal_house, similar_house):.1f}")  # 100.0
print(f"Normal vs Anomaly: {euclidean_distance(normal_house, anomaly):.1f}")  # 6000.2 (huge!)

# Cosine similarity (direction)
print(f"Normal vs Similar: {cosine_similarity(normal_house, similar_house):.3f}")  # 1.000
print(f"Normal vs Anomaly: {cosine_similarity(normal_house, anomaly):.3f}")  # 1.000 (still high!)
```

**Interpretation**: Euclidean distance catches the anomaly better because it cares about MAGNITUDE! Cosine similarity cannot separate the two cases at all here.

***

## Real-World Application: Finding Similar Houses

Let's build a simple house recommendation system!

```python theme={null}
import numpy as np

# Database of houses (bedrooms, sqft, age, distance)
houses = np.array([
    [3, 2000, 15, 5],   # House 0
    [4, 2200, 8, 2],    # House 1
    [2, 1200, 25, 3],   # House 2
    [3, 1900, 12, 6],   # House 3
    [5, 3500, 5, 15],   # House 4
])

# Prices (in thousands)
prices = np.array([320, 380, 250, 310, 550])

# Query: User likes this house
query_house = np.array([3, 2000, 10, 4])

# Find 3 most similar houses
similarities = []
for i, house in enumerate(houses):
    sim = cosine_similarity(query_house, house)
    similarities.append((i, sim, prices[i]))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)

print("Top 3 similar houses:")
for i, (idx, sim, price) in enumerate(similarities[:3], 1):
    print(f"{i}. House {idx}: similarity={sim:.3f}, price=${price}k")
```

**Output**:

```
Top 3 similar houses:
1. House 3: similarity=1.000, price=$310k
2. House 1: similarity=1.000, price=$380k
3. House 0: similarity=1.000, price=$320k
```

All three scores round to 1.000 because sqft dominates the raw vectors; the ranking rests on differences in the sixth decimal place. Scaling each feature first, as in the apartment finder, gives a more meaningful spread.

**Prediction**: Based on similar houses, estimated price ≈ \$337k (average of top 3)

***

## Supporting Example 1: Document Similarity

The same vector concepts apply to text!
```python theme={null}
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "machine learning is awesome",
    "deep learning is a subset of machine learning",
    "neural networks are powerful",
    "python is great for machine learning"
]

# Convert to vectors
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents).toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nDocument vectors:")
print(doc_vectors)

# Find similar documents to "machine learning"
query = "machine learning"
query_vector = vectorizer.transform([query]).toarray()[0]

for i, doc_vec in enumerate(doc_vectors):
    sim = cosine_similarity(query_vector, doc_vec)
    print(f"Doc {i}: {sim:.3f} - {documents[i]}")
```

**Key Insight**: Same math, different domain!

***

## Supporting Example 2: User Recommendations

```python theme={null}
# User-movie rating matrix
ratings = np.array([
    [5, 4, 0, 0, 1],  # User 0: likes action/comedy
    [4, 5, 0, 0, 2],  # User 1: similar to User 0
    [0, 0, 5, 4, 5],  # User 2: likes drama/romance
    [5, 4, 0, 1, 1],  # User 3: similar to User 0
])

# Find users similar to User 0
user_0 = ratings[0]
for i in range(1, len(ratings)):
    sim = cosine_similarity(user_0, ratings[i])
    print(f"User {i}: similarity = {sim:.3f}")

# Output:
# User 1: similarity = 0.966 (recommend same movies!)
# User 2: similarity = 0.095 (different taste)
# User 3: similarity = 0.988 (very similar)
```

***

## Practice Exercises

### Exercise 1: House Price Estimation

```python theme={null}
# Given these houses and prices
houses = np.array([
    [3, 1800, 20, 5],  # $280k
    [4, 2400, 10, 3],  # $360k
    [2, 1200, 30, 8],  # $220k
])
prices = np.array([280, 360, 220])

# Predict price for this house
new_house = np.array([3, 2000, 15, 4])

# TODO: Find 2 most similar houses and average their prices
```

Solution

```python theme={null}
similarities = []
for i, house in enumerate(houses):
    sim = cosine_similarity(new_house, house)
    similarities.append((i, sim, prices[i]))

similarities.sort(key=lambda x: x[1], reverse=True)

# Top 2 similar houses
top_2_prices = [similarities[0][2], similarities[1][2]]
predicted_price = np.mean(top_2_prices)
print(f"Predicted price: ${predicted_price:.0f}k")
# Output: Predicted price: $320k
```
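Because square footage is so much larger than the other features, the cosine scores in this solution are all extremely close to 1. One possible refinement, sketched below, is to z-score the features and then pick neighbors by Euclidean distance. The helper name `knn_price_estimate` is hypothetical and this is only an illustration under those assumptions, not the exercise's official solution.

```python
import numpy as np

def knn_price_estimate(query, houses, prices, k=2):
    """Average the prices of the k nearest houses after z-score scaling.

    Hypothetical helper for illustration only.
    """
    means = houses.mean(axis=0)
    stds = houses.std(axis=0) + 1e-8           # avoid division by zero
    houses_scaled = (houses - means) / stds
    query_scaled = (query - means) / stds      # scale the query the same way
    distances = np.linalg.norm(houses_scaled - query_scaled, axis=1)
    nearest = np.argsort(distances)[:k]        # indices of the k closest houses
    return prices[nearest].mean(), nearest

houses = np.array([
    [3, 1800, 20, 5],
    [4, 2400, 10, 3],
    [2, 1200, 30, 8],
], dtype=float)
prices = np.array([280, 360, 220], dtype=float)
new_house = np.array([3, 2000, 15, 4], dtype=float)

estimate, neighbors = knn_price_estimate(new_house, houses, prices, k=2)
print(f"Nearest houses: {neighbors}, estimated price: ${estimate:.0f}k")
```

On this tiny dataset the scaled version selects the same two neighbors, so the estimate is unchanged. With more houses and more varied features, the two approaches can disagree.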

***

## 🎯 Practice Exercises & Real-World Applications

**Challenge yourself!** These exercises blend mathematical concepts with real-world scenarios. Try to solve them before peeking at the solutions.

### Exercise 1: Music Streaming Recommendations 🎵

Spotify represents songs as vectors based on audio features. Given these song vectors:

| Song          | Energy | Danceability | Acousticness | Tempo (normalized) |
| ------------- | ------ | ------------ | ------------ | ------------------ |
| Your Favorite | 0.8    | 0.7          | 0.2          | 0.6                |
| Song A        | 0.9    | 0.8          | 0.1          | 0.7                |
| Song B        | 0.3    | 0.4          | 0.9          | 0.3                |
| Song C        | 0.7    | 0.6          | 0.3          | 0.5                |

**Task**: Find which song is most similar to "Your Favorite" using cosine similarity.

```python theme={null}
import numpy as np

# Define the song vectors
your_favorite = np.array([0.8, 0.7, 0.2, 0.6])
song_A = np.array([0.9, 0.8, 0.1, 0.7])
song_B = np.array([0.3, 0.4, 0.9, 0.3])
song_C = np.array([0.7, 0.6, 0.3, 0.5])

# TODO: Calculate cosine similarity with each song
# TODO: Which song should Spotify recommend?
```

Solution

```python theme={null}
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

your_favorite = np.array([0.8, 0.7, 0.2, 0.6])
song_A = np.array([0.9, 0.8, 0.1, 0.7])
song_B = np.array([0.3, 0.4, 0.9, 0.3])
song_C = np.array([0.7, 0.6, 0.3, 0.5])

songs = {'Song A': song_A, 'Song B': song_B, 'Song C': song_C}

print("Similarity scores:")
for name, song in songs.items():
    sim = cosine_similarity(your_favorite, song)
    print(f"  {name}: {sim:.4f}")

# Output:
#   Song A: 0.9958  ← Most similar (upbeat, danceable)
#   Song B: 0.6634  (very different - acoustic, slow)
#   Song C: 0.9931  (also quite similar)

print("\n✅ Recommendation: Song A (0.9958 similarity)")
```

**Real-World Insight**: Audio-feature vectors like these are one ingredient in playlist features such as Spotify's "Discover Weekly". Production systems use many more dimensions (tempo, key, loudness, and more) alongside listening-history signals.
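With many candidate songs, looping in Python gets slow. A common alternative is to stack the candidates into a matrix, normalize each row, and compute every score with one matrix-vector product. The sketch below assumes the song vectors from this exercise; the function name `cosine_similarity_matrix` is made up for illustration.

```python
import numpy as np

def cosine_similarity_matrix(query, candidates):
    """Cosine similarity between one query vector and every row of `candidates`."""
    query_unit = query / np.linalg.norm(query)
    # keepdims=True keeps shape (n, 1) so each row is divided by its own norm
    candidate_units = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidate_units @ query_unit

song_names = ["Song A", "Song B", "Song C"]
songs = np.array([
    [0.9, 0.8, 0.1, 0.7],
    [0.3, 0.4, 0.9, 0.3],
    [0.7, 0.6, 0.3, 0.5],
])
your_favorite = np.array([0.8, 0.7, 0.2, 0.6])

scores = cosine_similarity_matrix(your_favorite, songs)
ranking = np.argsort(scores)[::-1]   # indices sorted from best to worst match
for idx in ranking:
    print(f"{song_names[idx]}: {scores[idx]:.4f}")
```

Once the rows are unit length, cosine similarity is just a dot product, which is why many vector search systems store normalized embeddings.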
***

### Exercise 2: E-commerce Product Matching 🛒

Amazon wants to show "Similar Products" when a customer views an item. Products are represented as vectors:

**Features**: \[price\_tier, avg\_rating, num\_reviews (log), category\_score, brand\_popularity]

```python theme={null}
# Customer is viewing this laptop
current_product = np.array([4, 4.5, 3.2, 0.9, 0.7])

# Candidate products to recommend
products = {
    "Budget Laptop": np.array([2, 4.2, 2.8, 0.9, 0.4]),
    "Gaming Laptop": np.array([5, 4.6, 3.5, 0.8, 0.9]),
    "Similar Laptop": np.array([4, 4.4, 3.0, 0.9, 0.65]),
    "Tablet": np.array([3, 4.3, 3.1, 0.3, 0.6]),
}
```

**Tasks**:

1. Calculate both Euclidean distance AND cosine similarity for each product
2. Which metric gives better recommendations and why?
3. Should we normalize the data first?

Solution

```python theme={null}
import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

current_product = np.array([4, 4.5, 3.2, 0.9, 0.7])

products = {
    "Budget Laptop": np.array([2, 4.2, 2.8, 0.9, 0.4]),
    "Gaming Laptop": np.array([5, 4.6, 3.5, 0.8, 0.9]),
    "Similar Laptop": np.array([4, 4.4, 3.0, 0.9, 0.65]),
    "Tablet": np.array([3, 4.3, 3.1, 0.3, 0.6]),
}

print("Product Comparison:")
print("-" * 55)
print(f"{'Product':<18} {'Euclidean':<12} {'Cosine Sim':<12}")
print("-" * 55)

for name, vec in products.items():
    dist = euclidean_distance(current_product, vec)
    sim = cosine_similarity(current_product, vec)
    print(f"{name:<18} {dist:<12.4f} {sim:<12.4f}")

# Output:
# Budget Laptop      2.0833       0.9686
# Gaming Laptop      1.0724       0.9956
# Similar Laptop     0.2291       0.9997  ← Best on both metrics
# Tablet             1.1916       0.9905

print("\n📊 Analysis:")
print("• Euclidean: Similar Laptop wins (closest in absolute values)")
print("• Cosine: Similar Laptop also wins (most similar direction)")
print("\n✅ Both agree! But Euclidean is better here because")
print("   price_tier matters in absolute terms, not just ratio.")
```

**Key Insight**:

* Use **Euclidean** when magnitude matters (price, ratings)
* Use **Cosine** when only direction matters (document topics, user preferences)
* **Always normalize** features that are on different scales!

***

### Exercise 3: Dating App Compatibility 💕

A dating app represents users as compatibility vectors:

**Features**: \[adventure\_score, introversion, career\_focus, family\_values, humor\_style]

```python theme={null}
# Your profile
you = np.array([0.8, 0.3, 0.7, 0.6, 0.9])

# Potential matches
matches = {
    "Alex": np.array([0.7, 0.4, 0.8, 0.5, 0.85]),
    "Jordan": np.array([0.2, 0.9, 0.3, 0.8, 0.4]),
    "Casey": np.array([0.9, 0.2, 0.6, 0.7, 0.95]),
    "Morgan": np.array([0.5, 0.5, 0.5, 0.5, 0.5]),
}
```

**Tasks**:

1. Calculate a "compatibility score" using dot product
2. Normalize and use cosine similarity - does the ranking change?
3. Which match is best and why?

Solution

```python theme={null}
import numpy as np

you = np.array([0.8, 0.3, 0.7, 0.6, 0.9])

matches = {
    "Alex": np.array([0.7, 0.4, 0.8, 0.5, 0.85]),
    "Jordan": np.array([0.2, 0.9, 0.3, 0.8, 0.4]),
    "Casey": np.array([0.9, 0.2, 0.6, 0.7, 0.95]),
    "Morgan": np.array([0.5, 0.5, 0.5, 0.5, 0.5]),
}

print("Compatibility Analysis:")
print("-" * 50)
print(f"{'Match':<10} {'Dot Product':<14} {'Cosine Sim':<12}")
print("-" * 50)

for name, profile in matches.items():
    dot = np.dot(you, profile)
    cos = np.dot(you, profile) / (np.linalg.norm(you) * np.linalg.norm(profile))
    print(f"{name:<10} {dot:<14.4f} {cos:<12.4f}")

# Output:
# Alex       2.3050         0.9912
# Jordan     1.4800         0.7258
# Casey      2.4750         0.9924  ← Best match!
# Morgan     1.6500         0.9546

print("\n💕 Best Match: Casey!")
print("   • High adventure (0.9 vs your 0.8)")
print("   • Similar introversion level (0.2 vs 0.3)")
print("   • Compatible humor style (0.95 vs 0.9)")
print("\n⚠️ Jordan is least compatible:")
print("   • Opposite on adventure (0.2 vs 0.8)")
print("   • Opposite on introversion (0.9 vs 0.3)")
```

Here the ranking is the same under both scores: Casey, Alex, Morgan, Jordan.

**Real-World Insight**: Dating apps such as Hinge and OkCupid are often described as using vector-style matching with far more dimensions, including behavioral data from swipes and messages.

***

### Exercise 4: Document Search Engine 📄

Build a simple search engine using TF-IDF vectors:

```python theme={null}
# Documents (already converted to TF-IDF vectors)
# Dimensions: [python, machine, learning, data, web, api]
documents = {
    "ML Tutorial": np.array([0.5, 0.8, 0.9, 0.7, 0.1, 0.2]),
    "Web Dev Guide": np.array([0.2, 0.1, 0.0, 0.3, 0.9, 0.8]),
    "Data Science": np.array([0.6, 0.5, 0.7, 0.9, 0.2, 0.3]),
    "Python Basics": np.array([0.9, 0.2, 0.3, 0.4, 0.3, 0.4]),
}

# User searches for "machine learning python"
query = np.array([0.7, 0.9, 0.8, 0.3, 0.0, 0.0])
```

**Tasks**:

1. Rank documents by relevance to the query
2. What's the top result?
3. Why might "Data Science" rank higher than "Python Basics" even though the query has "python"?
```python theme={null}
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

documents = {
    "ML Tutorial": np.array([0.5, 0.8, 0.9, 0.7, 0.1, 0.2]),
    "Web Dev Guide": np.array([0.2, 0.1, 0.0, 0.3, 0.9, 0.8]),
    "Data Science": np.array([0.6, 0.5, 0.7, 0.9, 0.2, 0.3]),
    "Python Basics": np.array([0.9, 0.2, 0.3, 0.4, 0.3, 0.4]),
}

query = np.array([0.7, 0.9, 0.8, 0.3, 0.0, 0.0])

print("🔍 Search Results for 'machine learning python':")
print("-" * 45)

results = []
for name, doc in documents.items():
    sim = cosine_similarity(query, doc)
    results.append((name, sim))

# Sort by similarity (descending)
results.sort(key=lambda x: x[1], reverse=True)

for rank, (name, sim) in enumerate(results, 1):
    print(f"{rank}. {name:<18} (relevance: {sim:.4f})")

# Output:
# 1. ML Tutorial        (relevance: 0.9379)  ← Top result!
# 2. Data Science       (relevance: 0.8354)
# 3. Python Basics      (relevance: 0.7068)
# 4. Web Dev Guide      (relevance: 0.1781)

print("\n📊 Why 'Data Science' > 'Python Basics'?")
print("   Query emphasizes 'machine' (0.9) and 'learning' (0.8)")
print("   Data Science has machine=0.5, learning=0.7")
print("   Python Basics has machine=0.2, learning=0.3")
print("   Even though Python Basics has higher 'python' score,")
print("   the overall direction is less aligned with the query!")
```

**Real-World Insight**: Ranking TF-IDF vectors by cosine similarity is the classic vector-space model of information retrieval. Modern search engines add hundreds more signals (links, freshness, user behavior).

***

## 🚨 Real-World Challenge: Handling Messy Data

In textbooks, data is clean. In production, data is messy. Here's how to handle real-world vector problems:

**Production Reality**: Real data has missing values, outliers, inconsistent scales, and noise. Your similarity system will fail if you don't handle these!
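To see how such failures show up in practice: the plain `cosine_similarity` used in this module returns `nan` as soon as either vector contains a missing value or is all zeros. The sketch below shows one defensive wrapper. The name `safe_cosine_similarity` and the choice to return `None` are assumptions made for illustration, not a standard API.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns None instead of NaN for unusable inputs.

    Hypothetical helper: returning None is one possible policy among several.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.isnan(a).any() or np.isnan(b).any():
        return None                      # missing values: impute before comparing
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return None                      # an all-zero vector has no direction
    return float(np.dot(a, b) / norm_product)

complete = np.array([2, 1200, 2400, 15])
missing_sqft = np.array([2, np.nan, 2300, 18])
empty_profile = np.zeros(4)

print(safe_cosine_similarity(complete, complete))
print(safe_cosine_similarity(complete, missing_sqft))   # None
print(safe_cosine_similarity(complete, empty_profile))  # None
```

Returning an explicit sentinel forces the caller to decide what to do (impute, skip, or flag) instead of letting `nan` quietly spread through a ranking.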
### Missing Values

```python theme={null}
import numpy as np

# Real apartment data with missing values (NaN)
apartments = np.array([
    [2, 1200, 2400, 15],       # Complete
    [2, np.nan, 2300, 18],     # Missing sqft
    [np.nan, 2500, 4500, 45],  # Missing bedrooms
    [1, 800, np.nan, 10],      # Missing rent
])

# Strategy 1: Impute with column mean
def impute_mean(data):
    result = data.copy()
    for col in range(data.shape[1]):
        col_mean = np.nanmean(data[:, col])
        mask = np.isnan(result[:, col])
        result[mask, col] = col_mean
    return result

# Strategy 2: Impute with median (robust to outliers)
def impute_median(data):
    result = data.copy()
    for col in range(data.shape[1]):
        col_median = np.nanmedian(data[:, col])
        mask = np.isnan(result[:, col])
        result[mask, col] = col_median
    return result

cleaned = impute_mean(apartments)
print("Cleaned data:\n", cleaned)
```

### Outlier Detection

```python theme={null}
# Detect outliers using z-score
def detect_outliers(data, threshold=3):
    """Flag values more than `threshold` std devs from mean."""
    means = np.nanmean(data, axis=0)
    stds = np.nanstd(data, axis=0)
    z_scores = np.abs((data - means) / (stds + 1e-8))
    return z_scores > threshold

# Example: Luxury penthouse is an outlier
apartments = np.array([
    [2, 1200, 2400, 15],
    [2, 1100, 2300, 18],
    [2, 1150, 2500, 16],
    [2, 50000, 100000, 15],  # Outlier! Mansion accidentally in apartment data
])

# With only 4 rows, a z-score can never exceed sqrt(3) ≈ 1.73,
# so the default threshold of 3 would flag nothing. Use a lower one here.
outliers = detect_outliers(apartments, threshold=1.7)
print("Outlier locations:\n", outliers)
# Handle: Remove, cap, or flag for review
```

### Feature Scaling Choices

```python theme={null}
# Different scaling methods for different situations

# Min-Max: Scale to [0, 1] - use when you need bounded values
def minmax_scale(data):
    mins = data.min(axis=0)
    maxs = data.max(axis=0)
    return (data - mins) / (maxs - mins + 1e-8)

# Z-Score: Center and scale - use when comparing distributions
def zscore_scale(data):
    means = data.mean(axis=0)
    stds = data.std(axis=0)
    return (data - means) / (stds + 1e-8)

# Robust: Use median/IQR - use when outliers are present
def robust_scale(data):
    medians = np.median(data, axis=0)
    q75 = np.percentile(data, 75, axis=0)
    q25 = np.percentile(data, 25, axis=0)
    iqr = q75 - q25
    return (data - medians) / (iqr + 1e-8)

print("Choose your scaler based on your data characteristics!")
```

**Rule of Thumb**:

* **Min-Max**: Neural networks, bounded features
* **Z-Score**: Most ML algorithms, normally distributed data
* **Robust**: Data with outliers, skewed distributions

***

## 🔬 Advanced Deep Dive (Optional)

### Why High Dimensions Are Weird

In high dimensions, our intuition breaks down completely:

```python theme={null}
import numpy as np

def random_vector_similarity(dim, n_pairs=1000):
    """Average cosine similarity between random unit vectors."""
    similarities = []
    for _ in range(n_pairs):
        a = np.random.randn(dim)
        b = np.random.randn(dim)
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        similarities.append(np.dot(a, b))
    return np.mean(similarities), np.std(similarities)

print("Random vector similarity by dimension:")
for dim in [2, 10, 100, 1000, 10000]:
    mean, std = random_vector_similarity(dim)
    print(f"  {dim:5d}D: mean={mean:+.4f}, std={std:.4f}")

# Example output (exact values vary from run to run; std ≈ 1/sqrt(dim)):
#       2D: mean=+0.0012, std=0.7071  ← High variance
#      10D: mean=-0.0008, std=0.3162
#     100D: mean=+0.0002, std=0.1000
#    1000D: mean=-0.0001, std=0.0316  ← Nearly orthogonal!
#   10000D: mean=+0.0000, std=0.0100  ← All vectors ~90° apart
```

**Key Insight**: In 10,000 dimensions, random vectors are almost perfectly orthogonal! This is why:

* Random embeddings don't work (everything is equally dissimilar)
* Trained embeddings are necessary (learn meaningful directions)
* Dimension reduction (PCA, t-SNE) helps visualization

### Volume Concentration

```python theme={null}
# In high-D, almost all volume is at the surface of a sphere!

def shell_volume_ratio(dim, thickness=0.01):
    """What fraction of unit ball is within `thickness` of surface?"""
    inner_radius = 1 - thickness
    # V(r) ∝ r^d
    inner_volume_ratio = inner_radius ** dim
    shell_ratio = 1 - inner_volume_ratio
    return shell_ratio

print("Fraction of volume near surface (within 1%):")
for dim in [2, 10, 50, 100, 500]:
    ratio = shell_volume_ratio(dim)
    print(f"  {dim:3d}D: {ratio:.2%}")

# Output:
#     2D: 1.99%
#    10D: 9.56%
#    50D: 39.50%
#   100D: 63.40%
#   500D: 99.34%  ← Almost everything is on the edge!
```

### Implications for ML

1. **Nearest Neighbors degrades**: All points become equidistant
2. **More data needed**: Exponentially more samples to cover space
3. **Regularization essential**: Prevents overfitting in sparse spaces
4. **Feature selection matters**: Irrelevant features hurt more in high-D

### The Problem: Brute Force Doesn't Scale

Finding similar vectors in a billion-vector database takes forever with brute force:

```python theme={null}
# Brute force: O(n × d) per query
def brute_force_search(query, database, k=10):
    """Find k nearest neighbors - SLOW for large databases!"""
    similarities = []
    for i, vec in enumerate(database):
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        similarities.append((i, sim))
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

# 1 billion vectors × 768 dimensions ≈ 768 billion multiply-adds per query,
# and that is for the dot products alone!
```

### LSH: Approximate but Fast

Locality-Sensitive Hashing groups similar vectors into the same "bucket":

```python theme={null}
class SimpleLSH:
    """Simplified LSH for cosine similarity."""

    def __init__(self, dim, n_hyperplanes=16):
        # Random hyperplanes divide space into 2^n regions
        self.hyperplanes = np.random.randn(n_hyperplanes, dim)

    def hash(self, vector):
        """Convert vector to binary hash."""
        # Which side of each hyperplane?
        projections = self.hyperplanes @ vector
        bits = (projections > 0).astype(int)
        return tuple(bits)

    def build_index(self, vectors):
        """Group vectors by hash."""
        self.buckets = {}
        for i, vec in enumerate(vectors):
            h = self.hash(vec)
            if h not in self.buckets:
                self.buckets[h] = []
            self.buckets[h].append(i)
        return self

    def search(self, query, vectors, k=10):
        """Search only in same bucket - much faster!"""
        h = self.hash(query)
        candidates = self.buckets.get(h, [])

        # Compare only with candidates
        similarities = []
        for i in candidates:
            sim = np.dot(query, vectors[i]) / (np.linalg.norm(query) * np.linalg.norm(vectors[i]))
            similarities.append((i, sim))

        return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

# Usage
dim = 768
n_vectors = 100000
vectors = np.random.randn(n_vectors, dim)
query = np.random.randn(dim)

lsh = SimpleLSH(dim, n_hyperplanes=12)
lsh.build_index(vectors)

# Instead of searching 100,000 vectors, we search one bucket:
# about 24 vectors on average (100,000 / 2^12 buckets)
results = lsh.search(query, vectors, k=5)
print(f"Found {len(results)} approximate neighbors")
```

**Trade-off**: Speed vs accuracy. LSH can miss true neighbors that land in a different bucket, but it compares the query against a tiny fraction of the database, which can make it orders of magnitude faster.

**Production systems** (Pinecone, Milvus, Faiss) use more sophisticated approximate-nearest-neighbor indexes, including LSH variants, quantization, and graph-based methods.
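Why do random hyperplanes tend to put similar vectors in the same bucket? For a single random hyperplane, the probability that two vectors fall on the same side is 1 − θ/π, where θ is the angle between them. The sketch below checks that relationship empirically. The dimension, number of hyperplanes, noise level, and seed are arbitrary demo values, and the helper names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
dim, n_hyperplanes = 64, 4000            # demo sizes, not tuned values
hyperplanes = rng.standard_normal((n_hyperplanes, dim))

def sign_bits(vector):
    """One bit per hyperplane: which side of the hyperplane the vector is on."""
    return hyperplanes @ vector > 0

def angle_between(a, b):
    """Angle in radians between two vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def bit_agreement(a, b):
    """Fraction of hyperplanes for which a and b land on the same side."""
    return np.mean(sign_bits(a) == sign_bits(b))

base = rng.standard_normal(dim)
near = base + 0.2 * rng.standard_normal(dim)   # small perturbation of base
far = rng.standard_normal(dim)                 # unrelated random vector

for name, other in [("near", near), ("far", far)]:
    observed = bit_agreement(base, other)
    predicted = 1 - angle_between(base, other) / np.pi
    print(f"{name}: observed={observed:.3f}, predicted={predicted:.3f}")
```

If a pair agrees on each bit with probability p, all 12 bits of a hash match with probability p^12. Even for p = 0.94 that is just under one half, which is why practical LSH indexes typically query several independent hash tables instead of one.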
***

## Key Takeaways

✅ **Vectors represent data** - Houses, images, text all become vectors\
✅ **Dot product measures similarity** - Foundation of neural networks\
✅ **Cosine similarity** - Direction-based (ignores magnitude)\
✅ **Euclidean distance** - Position-based (includes magnitude)\
✅ **Normalization matters** - Prevent one feature from dominating\
✅ **Same math, different domains** - Vectors work everywhere!\
✅ **Handle messy data** - Missing values, outliers, and scaling are production realities\
✅ **High dimensions are weird** - Curse of dimensionality affects all similarity search

***

## 🔗 Math → ML Connection Summary

**What you learned in this module powers these ML systems:**

| Vector Concept                   | ML Application                       | Real-World Example                                        |
| -------------------------------- | ------------------------------------ | --------------------------------------------------------- |
| **Representing data as vectors** | Feature vectors in any ML model      | Every scikit-learn model takes feature vectors            |
| **Dot product**                  | Neural network layers, attention     | `y = W·x + b` is the core of deep learning                |
| **Cosine similarity**            | Semantic search, recommendations     | Embedding-based search, music recommendations             |
| **Euclidean distance**           | KNN classification, clustering       | Customer segmentation, image retrieval                    |
| **Normalization**                | Batch normalization, feature scaling | Required preprocessing for most models                    |
| **High-dimensional vectors**     | Word embeddings, image features      | GPT-3's largest model uses 12,288-dimensional hidden states |

**Next time you use any ML model, remember: it's operating on vectors using these exact operations!**

***

**For learners who want the mathematical foundations:**

### Vector Spaces: The Abstract View

A **vector space** is a set of objects (vectors) with two operations (addition and scalar multiplication) that satisfy certain axioms.
This abstraction lets us apply vector math to surprising domains: | Domain | "Vectors" | Addition | Scalar Multiplication | | --------------- | -------------- | ---------------------- | --------------------- | | **Functions** | f(x), g(x) | (f+g)(x) = f(x) + g(x) | (cf)(x) = c·f(x) | | **Polynomials** | 1, x, x², ... | Combine coefficients | Scale coefficients | | **Matrices** | Any m×n matrix | Element-wise addition | Element-wise scaling | | **Signals** | Time series | Add signals | Amplify/attenuate | ### Linear Independence & Basis A set of vectors is **linearly independent** if no vector can be written as a combination of others: $\text{If } c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n = \mathbf{0} \text{, then all } c_i = 0$ A **basis** is a minimal set of linearly independent vectors that span the space. **ML Application**: In neural networks, we're essentially finding a good basis to represent data. Autoencoders find compressed bases; attention mechanisms dynamically select relevant basis directions. ### Inner Product Spaces Our dot product is a specific **inner product**. More generally, an inner product ⟨·,·⟩ satisfies: 1. ⟨u, v⟩ = ⟨v, u⟩ (symmetry) 2. ⟨au + bv, w⟩ = a⟨u, w⟩ + b⟨v, w⟩ (linearity) 3. ⟨v, v⟩ ≥ 0, with equality iff v = 0 (positive definiteness) **Why this matters**: Different inner products define different notions of similarity! Kernel methods in ML use custom inner products to find nonlinear patterns. ### Recommended Deep-Dive Resources * **Gilbert Strang's Linear Algebra** (MIT OpenCourseWare) - Rigorous but intuitive * **3Blue1Brown: Essence of Linear Algebra** - Visual understanding * **Mathematics for Machine Learning** book, Ch. 2-3 - ML-focused treatment *** ## Word Embeddings: Vectors in NLP **Mind-blowing application**: Words are vectors, and vector math works on meaning! 
```python theme={null}
import numpy as np

# Word2Vec / GloVe represent words as ~300-dimensional vectors.
# The 3-D values below are made-up toy numbers: [royalty, male, female]
words = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

# Famous example: King - Man + Woman ≈ Queen
# Vector arithmetic on meaning!
result = words["king"] - words["man"] + words["woman"]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# result is closest to the "queen" vector!
closest = max(words, key=lambda w: cosine_similarity(result, words[w]))
print(closest)  # queen

# This works because:
# king - man captures "royalty without gender"
# Adding woman reintroduces gender → queen
```

**Modern language models** such as GPT-4 and Claude build on the same principle, using learned transformer embeddings with thousands of dimensions (GPT-3's largest model uses 12,288).

***

## Interview Questions: Vectors

**Question**: What is the dot product, and why does it matter in ML?

**Answer**: The dot product $\mathbf{a} \cdot \mathbf{b} = \sum a_i b_i$ measures alignment between vectors. In ML:

* **Neural networks**: Every neuron computes a dot product (weights · inputs)
* **Attention mechanisms**: Query-key dot products determine what to focus on
* **Similarity search**: Cosine similarity uses normalized dot products
* **Loss functions**: Many involve dot products (cross-entropy, hinge loss)

**Question**: When would you use cosine similarity versus Euclidean distance?

**Answer**:

* **Cosine**: When magnitude doesn't matter (text similarity, user preferences, normalized data)
* **Euclidean**: When absolute values matter (physical distance, raw measurements)
* **Example**: Two documents about ML with different lengths should be similar (cosine), but two GPS coordinates need actual distance (Euclidean)

**Question**: What happens to distances and similarities in high dimensions?

**Answer**: In high dimensions:

* All points become roughly equidistant ("curse of dimensionality")
* Random vectors are almost orthogonal (cosine ≈ 0)
* This is why PCA/dimension reduction is important
* Modern embeddings (512-4096 dim) are trained to preserve meaningful similarity

***

## What's Next?

You now understand how to represent houses as vectors and measure similarity. But how do we actually **predict** the price?

That's where **matrices** come in. A matrix is a function that transforms input (house features) into output (price prediction). This is exactly how neural networks work!
Learn how matrices transform house features into price predictions