Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Image Similarity Search Engine

Project 1: Image Similarity Search Engine

Build a working image search engine that finds similar images using vector representations and cosine similarity. This project integrates everything you’ve learned about vectors, dot products, and similarity measures.
Estimated Time: 3-4 hours (take your time to understand each concept!)
Difficulty: Beginner-Intermediate
Concepts Used: Vectors, dot product, cosine similarity, normalization
Dataset: Product images (we’ll simulate these)
Don’t Rush! This project connects multiple concepts. If something doesn’t click, go back to the Vectors module. Understanding why this works is more important than just making it work.

🎯 The Big Picture: What Are We Really Doing?

Before diving into code, let’s understand the core insight that makes image search possible:
Image Search Pipeline

The Key Insight

Every image can be represented as a vector (list of numbers). Once images are vectors, finding “similar” images becomes finding “nearby” vectors. It is geometry, pure and simple. The same math that measures the angle between two arrows on a whiteboard can measure the visual similarity between two photographs. This is the central miracle of vectorization: it converts a fuzzy human judgment (“these photos look alike”) into a precise mathematical operation.
Image ComparisonVector Comparison
”These shoes look alike”Vectors point in similar directions
”Cat ≠ Car”Vectors point in very different directions
”Same product, different angle”Vectors are close together

Why Does This Work?

Think about it: If two images are similar, they probably have:
  • Similar colors in similar places
  • Similar shapes and patterns
  • Similar brightness distributions
When we convert these properties to numbers, similar images produce similar numbers. Similar numbers = nearby vectors!

Project Overview

What You’ll Build

An image search engine that:
  1. Converts images to vector representations (the magic transformation)
  2. Computes similarity between images (using cosine similarity)
  3. Returns the top-K most similar images to a query
  4. Visualizes results (so you can verify it works!)

Why This Matters

Real-World Applications (This is exactly what these companies do):
CompanyApplicationHow It Works
PinterestVisual search”Find pins like this image”
Google ImagesReverse search”Where else does this image appear?”
AmazonSimilar products”Customers who viewed this also viewed…”
SpotifyCover art discoveryAudio → spectrogram → vector similarity
Face IDAuthenticationFace → embedding → compare to stored
ML Concepts You’ll Master:
  • Feature extraction (images → vectors)
  • Similarity metrics (cosine similarity)
  • Nearest neighbor search
  • High-dimensional vector spaces
  • Vectorized operations (matrix multiplication for speed!)

Part 1: Understanding the Problem (Deeply)

How Do We Represent Images as Vectors?

This is the crucial question. Let’s explore multiple methods and understand the tradeoffs.
Understanding Vector Similarity
Method 1: Raw Pixels (Simple but effective for similar images)
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

def image_to_vector_simple(image_path, size=(64, 64)):
    """
    Convert image to a vector using raw pixel values.
    
    THE IDEA:
    - A 64×64 grayscale image has 64×64 = 4,096 pixels
    - Each pixel is a number from 0 (black) to 255 (white)
    - We "flatten" the 2D grid into a 1D list of 4,096 numbers
    - This list IS our vector!
    
    Args:
        image_path: Path to image file
        size: Resize dimensions (smaller = faster, but less detail)
    
    Returns:
        1D numpy array of pixel values, normalized to [0, 1]
    """
    # Load image and convert to grayscale (removes color complexity)
    img = Image.open(image_path).convert('L')
    
    # Resize to standard size (all vectors must be same length!)
    # This is crucial: you can't compare vectors of different sizes
    img = img.resize(size)
    
    # Convert to numpy array
    img_array = np.array(img)
    print(f"Image shape: {img_array.shape}")  # (64, 64)
    
    # Flatten: 2D grid → 1D vector
    # Think of it like reading a book: left-to-right, top-to-bottom
    vector = img_array.flatten()
    print(f"Vector shape: {vector.shape}")  # (4096,)
    
    # Normalize to [0, 1]
    # Why? Cosine similarity cares about direction, not magnitude
    # But normalization helps numerical stability
    vector = vector / 255.0
    
    return vector

# Let's visualize what's happening
def show_flatten_process():
    """Visualize how an image becomes a vector."""
    # Create a tiny 4×4 example
    tiny_image = np.array([
        [10, 20, 30, 40],
        [50, 60, 70, 80],
        [90, 100, 110, 120],
        [130, 140, 150, 160]
    ], dtype=np.uint8)
    
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    
    # Original image
    axes[0].imshow(tiny_image, cmap='gray')
    axes[0].set_title('Original 4×4 Image')
    for i in range(4):
        for j in range(4):
            axes[0].text(j, i, str(tiny_image[i,j]), ha='center', va='center', color='red')
    
    # Flattened
    vector = tiny_image.flatten()
    axes[1].bar(range(16), vector, color='steelblue')
    axes[1].set_title('Flattened Vector (16 numbers)')
    axes[1].set_xlabel('Index')
    axes[1].set_ylabel('Pixel Value')
    
    # The vector
    axes[2].text(0.1, 0.5, f"Vector = [{', '.join(map(str, vector[:8]))}...\n"
                          f"          {', '.join(map(str, vector[8:]))}]",
                fontfamily='monospace', fontsize=10, va='center')
    axes[2].axis('off')
    axes[2].set_title('Our Image Vector!')
    
    plt.tight_layout()
    plt.show()

show_flatten_process()
Why normalize? So all images are on the same scale (0-1 instead of 0-255). This prevents one bright image from dominating comparisons. Without normalization, a photo taken in sunlight and the same photo taken in shade would appear very different numerically even though they show the same scene. Normalization removes the “brightness bias” and lets the comparison focus on the actual pattern of light and dark.
Numerical Stability Note: Always add a small epsilon (like 1e-8) when dividing to avoid division-by-zero errors on completely black images. The + 1e-8 you will see in production code is not paranoia — it is a standard defensive practice.

Method 2: Color Histogram (Better for Different Viewpoints)

The pixel method has a problem: if you rotate or shift an image slightly, the vector changes dramatically! Histograms solve this. The Idea: Instead of “where are the pixels?”, ask “what colors are present?”
def image_to_histogram(image_path, bins=32):
    """
    Convert image to color histogram vector.
    
    THE IDEA:
    - Count how many pixels of each color exist
    - A red car has lots of red pixels → high red bin values
    - A blue sky has lots of blue pixels → high blue bin values
    - Two red cars → similar histograms (even if rotated!)
    
    Args:
        image_path: Path to image file
        bins: Number of "buckets" per color channel (more = more detail)
              32 bins means we group 256 colors into 32 buckets
    
    Returns:
        1D numpy array of histogram values (normalized)
    """
    # Load image in RGB (Red, Green, Blue channels)
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img)
    
    # Compute histogram for each color channel
    # "How many pixels have red values 0-7? 8-15? etc."
    hist_r = np.histogram(img_array[:,:,0], bins=bins, range=(0, 256))[0]
    hist_g = np.histogram(img_array[:,:,1], bins=bins, range=(0, 256))[0]
    hist_b = np.histogram(img_array[:,:,2], bins=bins, range=(0, 256))[0]
    
    # Concatenate into one vector: [red_hist, green_hist, blue_hist]
    vector = np.concatenate([hist_r, hist_g, hist_b])
    
    # Normalize so values sum to 1 (makes comparison fair)
    vector = vector / vector.sum()
    
    return vector

# Visualize what a histogram looks like
def show_histogram_process():
    """Show how histogram captures color distribution."""
    # Create sample images
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    
    # Red-ish image
    red_img = np.zeros((64, 64, 3), dtype=np.uint8)
    red_img[:, :, 0] = np.random.randint(180, 255, (64, 64))  # High red
    red_img[:, :, 1] = np.random.randint(0, 50, (64, 64))     # Low green
    red_img[:, :, 2] = np.random.randint(0, 50, (64, 64))     # Low blue
    
    # Blue-ish image
    blue_img = np.zeros((64, 64, 3), dtype=np.uint8)
    blue_img[:, :, 0] = np.random.randint(0, 50, (64, 64))
    blue_img[:, :, 1] = np.random.randint(0, 50, (64, 64))
    blue_img[:, :, 2] = np.random.randint(180, 255, (64, 64))
    
    # Show images
    axes[0, 0].imshow(red_img)
    axes[0, 0].set_title('Red-dominant Image')
    axes[0, 0].axis('off')
    
    axes[0, 2].imshow(blue_img)
    axes[0, 2].set_title('Blue-dominant Image')
    axes[0, 2].axis('off')
    
    # Show histograms
    for i, (img, color, title) in enumerate([
        (red_img, ['red', 'green', 'blue'], 'Red Image Histogram'),
        (blue_img, ['red', 'green', 'blue'], 'Blue Image Histogram')
    ]):
        ax = axes[1, i*2]
        for c, channel in enumerate(['Red', 'Green', 'Blue']):
            hist = np.histogram(img[:,:,c], bins=32, range=(0, 256))[0]
            ax.bar(np.arange(32) + c*0.25, hist, width=0.25, 
                   label=channel, alpha=0.7)
        ax.set_title(title)
        ax.legend()
        ax.set_xlabel('Color Bin')
        ax.set_ylabel('Count')
    
    axes[0, 1].axis('off')
    axes[1, 1].axis('off')
    plt.tight_layout()
    plt.show()

show_histogram_process()
Histogram vs Pixels: When to Use Which?
MethodProsConsBest For
Raw PixelsSimple, captures spatial infoSensitive to rotation/shiftExact duplicates, aligned images
HistogramRotation invariant, compactLoses spatial infoDifferent viewpoints, color-based
Deep Learning (CNN)Best of both worldsRequires trainingProduction systems
For learning, we’ll use both! In production, companies use CNNs (neural networks) to extract “deep features.”
Why histograms work for rotated images: If you rotate a red car, it’s still red! The histogram captures “what colors exist” not “where colors are.”

Part 2: Computing Similarity

This is where the linear algebra magic happens. We’ve turned images into vectors. Now we need to measure “how similar” two vectors are.

The Core Insight: Similarity = Angle Between Vectors

Imagine two arrows (vectors) starting from the same point:
  • If they point in the same direction → very similar
  • If they point in different directions → not similar
  • If they’re perpendicular (90°) → completely unrelated
Cosine similarity measures this by computing the cosine of the angle between vectors.

Cosine Similarity: The Math

cosine similarity=cos(θ)=ABA×B\text{cosine similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||} Where:
  • AB\mathbf{A} \cdot \mathbf{B} is the dot product (sum of element-wise products)
  • A||\mathbf{A}|| is the magnitude (length) of vector A
  • The result is between -1 and 1
Cosine ValueAngleMeaning
1.0Identical (same direction)
0.7~45°Similar
0.090°Unrelated (perpendicular)
-1.0180°Opposite

Implementation with Deep Understanding

def cosine_similarity(v1, v2):
    """
    Compute cosine similarity between two vectors.
    
    This is THE core operation of similarity search!
    
    Formula: cos(θ) = (v1 · v2) / (||v1|| × ||v2||)
    
    STEP BY STEP:
    1. Dot product: Multiply corresponding elements and sum
       - Measures "how much do these vectors agree?"
       - If both have high values in same positions → high dot product
    
    2. Normalize by magnitudes
       - Removes the effect of "brightness" (larger numbers)
       - Focuses purely on "direction" (pattern of values)
    
    Args:
        v1, v2: Vectors (numpy arrays) of same length
    
    Returns:
        Similarity score in [-1, 1]
        1 = identical direction (most similar)
        0 = orthogonal (unrelated)
        -1 = opposite direction (least similar)
    """
    # Step 1: Compute dot product
    # v1 = [a, b, c], v2 = [x, y, z]
    # dot = a*x + b*y + c*z
    dot_product = np.dot(v1, v2)
    
    # Step 2: Compute magnitudes (lengths)
    # ||v|| = sqrt(a² + b² + c²)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    
    # Step 3: Avoid division by zero
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    
    # Step 4: Compute and return
    return dot_product / (norm_v1 * norm_v2)

# Let's build intuition with examples
print("=== Understanding Cosine Similarity ===\n")

# Example 1: Same direction, different magnitude
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # v2 = 2 * v1
print("v1 = [1, 2, 3]")
print("v2 = [2, 4, 6] (same direction, 2x magnitude)")
print(f"Cosine similarity: {cosine_similarity(v1, v2):.3f}")
print("→ 1.0 because they point the same way!\n")

# Example 2: Perpendicular vectors
v3 = np.array([1, 0, 0])
v4 = np.array([0, 1, 0])
print("v3 = [1, 0, 0] (pointing along x-axis)")
print("v4 = [0, 1, 0] (pointing along y-axis)")
print(f"Cosine similarity: {cosine_similarity(v3, v4):.3f}")
print("→ 0.0 because they're perpendicular!\n")

# Example 3: Image-like vectors (patterns matter)
bright_sunset = np.array([0.9, 0.5, 0.1])  # High red, medium green, low blue
dark_sunset = np.array([0.45, 0.25, 0.05])  # Same pattern, darker
blue_sky = np.array([0.1, 0.3, 0.9])  # Opposite pattern

print("bright_sunset = [0.9, 0.5, 0.1] (red-orange colors)")
print("dark_sunset = [0.45, 0.25, 0.05] (same colors, darker)")
print(f"Similarity: {cosine_similarity(bright_sunset, dark_sunset):.3f}")
print("→ ~1.0 because same COLOR PATTERN (direction)!\n")

print("blue_sky = [0.1, 0.3, 0.9] (blue colors)")
print(f"Similarity to sunset: {cosine_similarity(bright_sunset, blue_sky):.3f}")
print("→ Low because completely different color pattern!")
Why Cosine Beats Euclidean for ImagesTwo images of the same sunset—one taken with a bright setting, one dark—have the same pattern of colors but different absolute values.
  • Cosine similarity: “Same pattern!” (1.0)
  • Euclidean distance: “Very different!” (large distance)
Cosine captures what matters for visual similarity.

Euclidean Distance: The Alternative

For completeness, here’s Euclidean distance. It measures “how far apart” vectors are in space:
def euclidean_distance(v1, v2):
    """
    Compute Euclidean distance between two vectors.
    
    This is the "straight line" distance formula you know from geometry!
    
    Formula: ||v1 - v2|| = sqrt(Σ(v1_i - v2_i)²)
    
    Returns:
        Distance (0 = identical, larger = more different)
    """
    return np.linalg.norm(v1 - v2)

# Compare the two metrics
print("=== Cosine vs Euclidean ===\n")

v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # Same direction, different magnitude

print(f"Cosine similarity: {cosine_similarity(v1, v2):.3f}")  # 1.0
print(f"Euclidean distance: {euclidean_distance(v1, v2):.3f}")  # 3.74

print("\n→ Cosine says 'identical direction'")
print("→ Euclidean says 'they're far apart'")
print("→ For images, direction (pattern) matters more!")

Which to Use?

  • Cosine: When magnitude doesn’t matter (text, images with different brightness)
  • Euclidean: When absolute values matter (coordinates, measurements)
For images, cosine similarity is usually better!

Part 3: Building the Search Engine

Step 1: Create Image Database

import os
from glob import glob
from tqdm import tqdm

class ImageSearchEngine:
    def __init__(self, image_dir, method='histogram'):
        """
        Initialize search engine.
        
        Args:
            image_dir: Directory containing images
            method: 'pixels' or 'histogram'
        """
        self.image_dir = image_dir
        self.method = method
        self.image_paths = []
        self.image_vectors = []
        
    def index_images(self):
        """
        Convert all images to vectors and store them.
        """
        # Find all images
        extensions = ['*.jpg', '*.jpeg', '*.png']
        for ext in extensions:
            self.image_paths.extend(glob(os.path.join(self.image_dir, ext)))
        
        print(f"Found {len(self.image_paths)} images")
        
        # Convert to vectors
        print("Converting images to vectors...")
        for img_path in tqdm(self.image_paths):
            try:
                if self.method == 'pixels':
                    vector = image_to_vector_simple(img_path)
                else:
                    vector = image_to_histogram(img_path)
                
                self.image_vectors.append(vector)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        
        # Convert to numpy array for faster computation
        self.image_vectors = np.array(self.image_vectors)
        
        print(f"Indexed {len(self.image_vectors)} images")
        print(f"Vector shape: {self.image_vectors.shape}")
    
    def search(self, query_path, top_k=5):
        """
        Find top-K most similar images to query.
        
        Args:
            query_path: Path to query image
            top_k: Number of results to return
        
        Returns:
            List of (image_path, similarity_score) tuples
        """
        # Convert query to vector
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        # Compute similarities with all images
        similarities = []
        for i, img_vector in enumerate(self.image_vectors):
            sim = cosine_similarity(query_vector, img_vector)
            similarities.append((self.image_paths[i], sim))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top-K
        return similarities[:top_k]
    
    def search_fast(self, query_path, top_k=5):
        """
        Faster version using vectorized operations.
        """
        # Convert query to vector
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        # Vectorized cosine similarity
        # cos(θ) = (A · B) / (||A|| ||B||)
        
        # Normalize query vector
        query_norm = query_vector / np.linalg.norm(query_vector)
        
        # Normalize all database vectors
        db_norms = np.linalg.norm(self.image_vectors, axis=1, keepdims=True)
        db_normalized = self.image_vectors / db_norms
        
        # Compute all similarities at once (matrix-vector multiplication!)
        similarities = db_normalized @ query_norm
        
        # Get top-K indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return results
        results = [(self.image_paths[i], similarities[i]) for i in top_indices]
        return results

Step 2: Use the Search Engine

# Initialize
engine = ImageSearchEngine('product_images/', method='histogram')

# Index all images
engine.index_images()

# Search
query_image = 'query.jpg'
results = engine.search_fast(query_image, top_k=5)

# Display results
print(f"\nTop 5 similar images to {query_image}:")
for i, (img_path, score) in enumerate(results, 1):
    print(f"{i}. {os.path.basename(img_path)}: {score:.3f}")
Output:
Found 1000 images
Converting images to vectors...
100%|████████████████████| 1000/1000 [00:15<00:00, 65.2it/s]
Indexed 1000 images
Vector shape: (1000, 96)

Top 5 similar images to query.jpg:
1. product_042.jpg: 0.987
2. product_156.jpg: 0.923
3. product_089.jpg: 0.891
4. product_234.jpg: 0.876
5. product_567.jpg: 0.854

Part 4: Visualizing Results

import matplotlib.pyplot as plt

def visualize_results(query_path, results, num_results=5):
    """
    Display query image and top results.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    # Query image
    query_img = Image.open(query_path)
    axes[0, 0].imshow(query_img)
    axes[0, 0].set_title('Query Image', fontsize=14, fontweight='bold')
    axes[0, 0].axis('off')
    
    # Hide extra subplot
    axes[0, 1].axis('off')
    axes[0, 2].axis('off')
    
    # Top results
    for i, (img_path, score) in enumerate(results[:num_results]):
        row = 1 + i // 3
        col = i % 3
        
        if row < 2:  # Only show first 3 results
            img = Image.open(img_path)
            axes[row, col].imshow(img)
            axes[row, col].set_title(f'#{i+1}: {score:.3f}', fontsize=12)
            axes[row, col].axis('off')
    
    plt.tight_layout()
    plt.savefig('search_results.png', dpi=150, bbox_inches='tight')
    plt.show()

# Use it
visualize_results(query_image, results)

Part 5: Performance Analysis

Timing Comparison

import time

# Slow version (loop)
start = time.time()
results_slow = engine.search(query_image, top_k=5)
time_slow = time.time() - start

# Fast version (vectorized)
start = time.time()
results_fast = engine.search_fast(query_image, top_k=5)
time_fast = time.time() - start

print(f"Slow version: {time_slow:.3f}s")
print(f"Fast version: {time_fast:.3f}s")
print(f"Speedup: {time_slow/time_fast:.1f}x")
Output:
Slow version: 0.245s
Fast version: 0.003s
Speedup: 81.7x
Why so much faster? Vectorized operations use optimized C/Fortran code and can leverage CPU SIMD instructions!

Memory Usage

import sys

# Size of database
num_images = len(engine.image_vectors)
vector_dim = engine.image_vectors.shape[1]
total_bytes = engine.image_vectors.nbytes

print(f"Number of images: {num_images}")
print(f"Vector dimension: {vector_dim}")
print(f"Total memory: {total_bytes / 1024 / 1024:.2f} MB")
print(f"Per image: {total_bytes / num_images / 1024:.2f} KB")

Part 6: Improvements & Extensions

1. Better Feature Extraction

# Use pre-trained neural network (ResNet, VGG, etc.)
from torchvision import models, transforms
import torch

def image_to_vector_deep(image_path):
    """
    Extract features using pre-trained ResNet.
    """
    # Load pre-trained model
    model = models.resnet18(pretrained=True)
    model.eval()
    
    # Remove final classification layer
    feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
    
    # Preprocess image
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    ])
    
    img = Image.open(image_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0)
    
    # Extract features
    with torch.no_grad():
        features = feature_extractor(img_tensor)
    
    # Flatten to vector
    vector = features.squeeze().numpy()
    
    return vector

# This gives 512-dimensional vectors with much better semantic understanding!

2. Faster Search with Approximate Nearest Neighbors

# For large databases (millions of images), use FAISS or Annoy
import faiss

def build_faiss_index(vectors):
    """
    Build FAISS index for fast approximate search.
    """
    dimension = vectors.shape[1]
    
    # Create index
    index = faiss.IndexFlatL2(dimension)  # L2 distance
    
    # Add vectors
    index.add(vectors.astype('float32'))
    
    return index

def search_faiss(index, query_vector, top_k=5):
    """
    Search using FAISS index.
    """
    query = query_vector.astype('float32').reshape(1, -1)
    distances, indices = index.search(query, top_k)
    return indices[0], distances[0]

# Usage
index = build_faiss_index(engine.image_vectors)
indices, distances = search_faiss(index, query_vector, top_k=5)
def search_by_text_and_image(text_query, image_query, alpha=0.5):
    """
    Combine text and image search.
    
    Args:
        text_query: Text description
        image_query: Image path
        alpha: Weight for image similarity (1-alpha for text)
    """
    # Get image similarity
    img_similarities = engine.search_fast(image_query, top_k=100)
    
    # Get text similarity (using image captions)
    # text_similarities = search_by_text(text_query, top_k=100)
    
    # Combine scores
    # combined = alpha * img_sim + (1-alpha) * text_sim
    
    pass  # Implementation left as exercise

Challenges & Exercises

Modify the search engine to handle multiple queries at once:
def search_batch(self, query_paths, top_k=5):
    """
    Search for multiple queries simultaneously.
    
    Args:
        query_paths: List of query image paths
        top_k: Number of results per query
    
    Returns:
        List of result lists
    """
    # TODO: Implement this!
    # Hint: Convert all queries to vectors, then use matrix multiplication
    pass

Challenge 2: Evaluation Metrics

Implement precision@K and recall@K:
def evaluate_search(ground_truth, predictions, k=5):
    """
    Evaluate search quality.
    
    Args:
        ground_truth: Set of relevant image IDs
        predictions: List of predicted image IDs (ordered by relevance)
        k: Number of top results to consider
    
    Returns:
        precision, recall
    """
    # TODO: Implement this!
    pass

Challenge 3: Diversity

Modify search to return diverse results (not all similar to each other):
def search_diverse(self, query_path, top_k=5, diversity_weight=0.3):
    """
    Return diverse results.
    
    Strategy: Penalize results similar to already selected results.
    """
    # TODO: Implement this!
    pass

Complete Code

Here’s the full working implementation:
# image_search.py

import numpy as np
from PIL import Image
import os
from glob import glob
from tqdm import tqdm
import matplotlib.pyplot as plt

def image_to_vector_simple(image_path, size=(64, 64)):
    img = Image.open(image_path).convert('L')
    img = img.resize(size)
    img_array = np.array(img)
    vector = img_array.flatten()
    vector = vector / 255.0
    return vector

def image_to_histogram(image_path, bins=32):
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img)
    
    hist_r = np.histogram(img_array[:,:,0], bins=bins, range=(0, 256))[0]
    hist_g = np.histogram(img_array[:,:,1], bins=bins, range=(0, 256))[0]
    hist_b = np.histogram(img_array[:,:,2], bins=bins, range=(0, 256))[0]
    
    vector = np.concatenate([hist_r, hist_g, hist_b])
    vector = vector / vector.sum()
    
    return vector

def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    
    return dot_product / (norm_v1 * norm_v2)

class ImageSearchEngine:
    def __init__(self, image_dir, method='histogram'):
        self.image_dir = image_dir
        self.method = method
        self.image_paths = []
        self.image_vectors = []
        
    def index_images(self):
        extensions = ['*.jpg', '*.jpeg', '*.png']
        for ext in extensions:
            self.image_paths.extend(glob(os.path.join(self.image_dir, ext)))
        
        print(f"Found {len(self.image_paths)} images")
        
        print("Converting images to vectors...")
        for img_path in tqdm(self.image_paths):
            try:
                if self.method == 'pixels':
                    vector = image_to_vector_simple(img_path)
                else:
                    vector = image_to_histogram(img_path)
                
                self.image_vectors.append(vector)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        
        self.image_vectors = np.array(self.image_vectors)
        
        print(f"Indexed {len(self.image_vectors)} images")
        print(f"Vector shape: {self.image_vectors.shape}")
    
    def search_fast(self, query_path, top_k=5):
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        query_norm = query_vector / np.linalg.norm(query_vector)
        
        db_norms = np.linalg.norm(self.image_vectors, axis=1, keepdims=True)
        db_normalized = self.image_vectors / db_norms
        
        similarities = db_normalized @ query_norm
        
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = [(self.image_paths[i], similarities[i]) for i in top_indices]
        return results

def visualize_results(query_path, results, num_results=5):
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    query_img = Image.open(query_path)
    axes[0, 0].imshow(query_img)
    axes[0, 0].set_title('Query Image', fontsize=14, fontweight='bold')
    axes[0, 0].axis('off')
    
    axes[0, 1].axis('off')
    axes[0, 2].axis('off')
    
    for i, (img_path, score) in enumerate(results[:3]):
        img = Image.open(img_path)
        axes[1, i].imshow(img)
        axes[1, i].set_title(f'#{i+1}: {score:.3f}', fontsize=12)
        axes[1, i].axis('off')
    
    plt.tight_layout()
    plt.savefig('search_results.png', dpi=150, bbox_inches='tight')
    plt.show()

if __name__ == '__main__':
    # Initialize and run
    engine = ImageSearchEngine('images/', method='histogram')
    engine.index_images()
    
    query = 'query.jpg'
    results = engine.search_fast(query, top_k=5)
    
    print(f"\nTop 5 similar images:")
    for i, (path, score) in enumerate(results, 1):
        print(f"{i}. {os.path.basename(path)}: {score:.3f}")
    
    visualize_results(query, results)

🚨 Real-World Challenge: Handling Messy Images

Production Reality: Real image databases have corrupted files, varying formats, extreme sizes, and other issues. Here’s how to handle them:
import numpy as np
from PIL import Image
import os
from glob import glob

class RobustImageProcessor:
    """Production-ready image processing with error handling."""
    
    def __init__(self, target_size=(64, 64)):
        self.target_size = target_size
        self.failed_images = []
        
    def process_image(self, image_path):
        """
        Robust image processing with comprehensive error handling.
        """
        try:
            # Check file exists
            if not os.path.exists(image_path):
                raise FileNotFoundError(f"Image not found: {image_path}")
            
            # Check file size (skip tiny/corrupted files)
            file_size = os.path.getsize(image_path)
            if file_size < 100:  # Less than 100 bytes
                raise ValueError(f"File too small, likely corrupted: {file_size} bytes")
            
            # Attempt to open image
            img = Image.open(image_path)
            
            # Handle various formats and modes
            if img.mode == 'RGBA':
                # Handle transparency by compositing on white background
                background = Image.new('RGB', img.size, (255, 255, 255))
                background.paste(img, mask=img.split()[3])  # Use alpha channel as mask
                img = background
            elif img.mode == 'P':
                # Palette mode
                img = img.convert('RGB')
            elif img.mode == 'L':
                # Already grayscale
                pass
            elif img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Convert to grayscale for consistency
            img = img.convert('L')
            
            # Handle extreme aspect ratios (crop to square)
            width, height = img.size
            if width / height > 3 or height / width > 3:
                # Extreme aspect ratio - center crop
                min_dim = min(width, height)
                left = (width - min_dim) // 2
                top = (height - min_dim) // 2
                img = img.crop((left, top, left + min_dim, top + min_dim))
            
            # Resize
            img = img.resize(self.target_size, Image.Resampling.LANCZOS)
            
            # Convert to vector
            vector = np.array(img).flatten() / 255.0
            
            return vector
            
        except Exception as e:
            self.failed_images.append((image_path, str(e)))
            return None
    
    def process_directory(self, image_dir):
        """Process all images in directory with progress tracking."""
        extensions = ['*.jpg', '*.jpeg', '*.png', '*.gif', '*.bmp', '*.webp']
        all_paths = []
        for ext in extensions:
            all_paths.extend(glob(os.path.join(image_dir, '**', ext), recursive=True))
        
        print(f"Found {len(all_paths)} images")
        
        vectors = []
        valid_paths = []
        
        for i, path in enumerate(all_paths):
            if i % 100 == 0:
                print(f"Processing {i}/{len(all_paths)}...")
            
            vector = self.process_image(path)
            if vector is not None:
                vectors.append(vector)
                valid_paths.append(path)
        
        # Report failures
        if self.failed_images:
            print(f"\n⚠️ Failed to process {len(self.failed_images)} images:")
            for path, error in self.failed_images[:5]:  # Show first 5
                print(f"   - {os.path.basename(path)}: {error}")
            if len(self.failed_images) > 5:
                print(f"   ... and {len(self.failed_images) - 5} more")
        
        return np.array(vectors), valid_paths

# Usage
processor = RobustImageProcessor()
vectors, paths = processor.process_directory('images/')
print(f"\nSuccessfully processed {len(vectors)} images")
Production Checklist:
  • Handle missing files gracefully
  • Skip corrupted/truncated images
  • Normalize different color modes (RGB, RGBA, palette, grayscale)
  • Handle extreme aspect ratios
  • Log failures for debugging
  • Set memory limits for very large images
  • Add timeout for slow processing

Feature Extraction Methods: From Simple to Production

MethodVector DimensionsRotation Invariant?Semantic Understanding?SpeedQuality
Raw Pixelswidth x height (e.g., 4096)NoNoneFastestLow — fails with any spatial shift
Color Histogrambins x channels (e.g., 96)YesColor onlyFastModerate — good for color-based matching
HOG (Histogram of Oriented Gradients)~3780PartialShape/edgesModerateGood for object shape matching
SIFT/ORB FeaturesVariable (keypoints)YesLocal featuresModerateGood for specific object matching
CNN Features (ResNet, VGG)512-2048YesHigh-level semanticSlow (requires GPU)Best — understands “cat” vs “dog”
CLIP Embeddings512YesText + image semanticSlowBest for cross-modal search
Edge Case — The Semantic Gap: Raw pixel similarity fails in cases that seem obvious to humans. A photo of a black cat and a photo of a white cat on a black background can have similar pixel histograms (both are mostly black), but semantically they are very different. Conversely, two photos of the same cat in different lighting have very different pixel values but are semantically identical. This gap between pixel-level similarity and semantic similarity is why production systems use deep learning features. The linear algebra is the same (cosine similarity on vectors); the quality of the vectors is what changes everything.

Key Takeaways

What You Built and Learned:
  • Images as Vectors - Pixels, histograms, and deep features all work
  • Cosine Similarity - Measure angular distance between high-dimensional vectors
  • Vectorized Operations - Replace loops with matrix operations for 100x speedup
  • Batch Processing - Matrix multiplication enables searching thousands of images instantly
  • Real ML Pipeline - From raw data to vectors to similarity to results
  • Production Robustness - Handle corrupted, malformed, and edge-case images
Industry Insight: Companies like Pinterest, Airbnb, and Spotify use exactly these techniques—just with more sophisticated embeddings (from neural networks) and approximate nearest neighbor search for speed at scale.

What’s Next?

You’ve built a working image search engine using linear algebra! Next, we’ll explore eigenvalues and eigenvectors to build even more powerful applications like face recognition and dimensionality reduction.

Next: Eigenvalues & Eigenvectors

Learn the math behind PCA and face recognition

Interview Deep-Dive

Strong Answer:
  • Raw pixel vectors are not invariant to geometric transformations. Rotating an image by even 5 degrees shifts every pixel to a different location in the flattened vector, creating a large cosine distance between what are visually identical images. Similarly, changes in lighting alter every pixel value, and cropping changes the vector length entirely (or which pixels are included after resizing).
  • The diagnosis is straightforward: compute the cosine similarity between the same product photographed from two angles. If similarity is below 0.5 for clearly-identical products, the representation is the problem, not the similarity metric. You can also visualize the vectors using t-SNE or UMAP — if images of the same product scatter randomly rather than clustering, the features are not capturing semantic similarity.
  • The fix is to move from handcrafted features (raw pixels, histograms) to learned embeddings. A CNN trained for image classification (ResNet, EfficientNet) or specifically for metric learning (using triplet loss or contrastive loss) produces feature vectors that are invariant to rotation, scale, lighting, and background changes. You extract features from the penultimate layer (before the classification head) — typically a 2048-dim vector for ResNet-50 — and use those as your search vectors.
  • In production at Pinterest, their visual search system uses embeddings from a model trained with multi-task learning: classification (is this a shoe?) plus metric learning (are these the same shoe?). The embeddings are 256-dimensional, normalized to unit length, and indexed with HNSW. This handles angle, lighting, and even partial occlusion robustly.
  • The trade-off: learned embeddings require a trained model (GPU inference per image at indexing time), while pixel features are free to compute. For a prototype with controlled imagery (like product catalog shots on white backgrounds), pixel histograms can work surprisingly well. For user-uploaded photos with arbitrary conditions, learned embeddings are mandatory.
Follow-up: How would you evaluate whether your new embedding-based search is actually better than the pixel-based approach? What metrics would you use?Use a held-out test set of known-similar image pairs (same product, different angles) and known-dissimilar pairs. Compute precision@k (what fraction of the top-k results are actually relevant) and recall@k (what fraction of all relevant images appear in the top-k). Also measure Mean Reciprocal Rank (MRR) — the average of 1/rank1/\text{rank} of the first relevant result. Plot precision-recall curves for both methods at various similarity thresholds. In practice, also measure “semantic precision” via human evaluation: show annotators the top-5 results for 100 random queries and have them label relevance. The pixel-based approach might achieve 30% precision@5 while the embedding approach achieves 85%. Additionally, measure latency: if the embedding model takes 50ms per query for inference, that must be factored into the end-to-end search time.
Strong Answer:
  • Cosine similarity measures the angle between vectors, ignoring their magnitudes. For image features, this is desirable because two images of the same scene at different exposures (bright vs. dark) will have pixel vectors pointing in the same direction but with different magnitudes. Cosine similarity treats them as identical; Euclidean distance penalizes the brightness difference.
  • More fundamentally, in high-dimensional spaces (4096+ dimensions for image features), Euclidean distance suffers from the concentration of measure phenomenon — all pairwise distances converge toward the same value, making discrimination difficult. Cosine similarity, by normalizing vectors to the unit sphere, focuses on the angular structure which tends to be more discriminative.
  • Euclidean distance is preferred when absolute feature values carry information. In image segmentation, if you are comparing pixel regions where intensity level matters (differentiating a dark object from a bright one in the same scene), Euclidean distance on raw pixel patches is appropriate. In medical imaging, absolute Hounsfield unit values in CT scans encode tissue type — cosine similarity would incorrectly treat bone and soft tissue as similar if their spatial patterns happen to align.
  • There is a mathematical equivalence worth knowing: for L2-normalized vectors (unit length), Euclidean distance and cosine similarity are monotonically related: ab2=2(1cosθ)\|a - b\|^2 = 2(1 - \cos\theta). So if you pre-normalize all vectors (which most production systems do), the two metrics produce identical rankings. The choice then becomes a matter of API conventions and numerical convenience rather than a true algorithmic difference.
Follow-up: You notice that your image search returns very different results depending on whether you normalize vectors before computing Euclidean distance. Why, and which approach is correct?Without normalization, Euclidean distance conflates two signals: directional similarity (the pattern of features) and magnitude similarity (the overall intensity/activation level). A very “bright” image with features [100, 200, 300] and a dim image with features [1, 2, 3] have high cosine similarity (identical direction) but enormous Euclidean distance (299 units apart). Normalizing first removes the magnitude signal, making Euclidean distance equivalent to angular distance. Whether you should normalize depends on whether magnitude is signal or noise. For CNN embeddings, magnitude typically correlates with image “confidence” or “prototypicality” and is usually not useful for search — normalize. For raw pixel features in medical imaging, magnitude encodes diagnostically relevant intensity — do not normalize.
Strong Answer:
  • The brute-force approach computes similarity between the query and all NN database vectors: NN dot products, each of dimension dd. For N=10MN = 10M and d=2048d = 2048, that is 20 billion multiply-adds per query. At best, a single CPU core does this in about 10 seconds. Unacceptable for interactive search.
  • Optimization 1 (most impactful): Approximate Nearest Neighbors. Use FAISS with IVF+PQ indexing. Build a quantized index offline in about 30 minutes for 10M vectors. Query time drops to 1-5ms at 95%+ recall. This is 2000x faster and is what every production system uses. The linear algebra trick: Product Quantization replaces each full dot product with a table lookup, and IVF limits the search to a small subset of vectors.
  • Optimization 2: Dimensionality reduction. Apply PCA from 2048 to 256 dimensions before indexing. Each dot product is 8x cheaper, memory usage drops 8x, and (as discussed) recall often improves because you cut noise dimensions. This compounds with ANN: the PQ codebooks are cheaper to compute in lower dimensions.
  • Optimization 3: GPU-accelerated brute force. FAISS supports GPU search — an A100 can brute-force 10M vectors of dimension 256 in about 10ms. For moderate-scale systems (under 50M vectors), GPU brute force with PCA-reduced vectors can be simpler than maintaining an ANN index, with perfect recall.
  • Optimization 4: Pre-filtering. If your UI allows category filters (e.g., “shoes only”), partition the index by category so you only search the relevant subset. This reduces NN by 10-100x with zero approximation cost.
  • Optimization 5: Caching. If certain queries are repeated (e.g., “similar to this bestseller”), cache the results. Simple but often overlooked.
Follow-up: How would you handle the case where new images are added to the database continuously (say, 100K new images per day)? Rebuilding the entire ANN index is expensive.Most production ANN indexes support incremental insertion. FAISS IVF allows adding vectors to existing Voronoi cells without retraining the cell centroids. The trade-off: over time, the cell boundaries become suboptimal as the data distribution shifts, degrading recall. The standard practice is to add incrementally during the day and rebuild the full index nightly during a maintenance window. HNSW (used by Pinecone, Weaviate, Qdrant) is inherently incremental — new vectors are inserted into the graph on-the-fly with no rebuild needed, though query performance degrades slightly as the graph becomes less optimal. Monitor recall@10 weekly against a ground-truth set; when it drops below your threshold, trigger a full rebuild.