
Project 1: Image Similarity Search Engine

Build a working image search engine that finds similar images using vector representations and cosine similarity. This project integrates everything you’ve learned about vectors, dot products, and similarity measures.
Estimated Time: 3-4 hours (take your time to understand each concept!)
Difficulty: Beginner-Intermediate
Concepts Used: Vectors, dot product, cosine similarity, normalization
Dataset: Product images (we’ll simulate these)
Don’t Rush! This project connects multiple concepts. If something doesn’t click, go back to the Vectors module. Understanding why this works is more important than just making it work.

🎯 The Big Picture: What Are We Really Doing?

Before diving into code, let’s understand the core insight that makes image search possible:
[Figure: Image Search Pipeline]

The Key Insight

Every image can be represented as a vector (list of numbers). Once images are vectors, finding “similar” images becomes finding “nearby” vectors. It’s just geometry!
| Image Comparison | Vector Comparison |
| --- | --- |
| "These shoes look alike" | Vectors point in similar directions |
| "Cat ≠ Car" | Vectors point in very different directions |
| "Same product, different angle" | Vectors are close together |

Why Does This Work?

Think about it: If two images are similar, they probably have:
  • Similar colors in similar places
  • Similar shapes and patterns
  • Similar brightness distributions
When we convert these properties to numbers, similar images produce similar numbers. Similar numbers = nearby vectors!
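Here is a tiny numeric sketch of that idea, using made-up three-number "images" (think of each as [red, green, blue] averages) and the cosine similarity you'll build in Part 2:

import numpy as np

# Hypothetical toy "images": each is just [avg_red, avg_green, avg_blue]
red_shoe_a = np.array([0.8, 0.2, 0.1])
red_shoe_b = np.array([0.7, 0.3, 0.1])   # similar scene → similar numbers
blue_car   = np.array([0.1, 0.2, 0.9])   # different scene → different numbers

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"shoe vs shoe: {cosine(red_shoe_a, red_shoe_b):.3f}")  # near 1.0 (nearby vectors)
print(f"shoe vs car:  {cosine(red_shoe_a, blue_car):.3f}")    # much lower (far apart)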

Project Overview

What You’ll Build

An image search engine that:
  1. Converts images to vector representations (the magic transformation)
  2. Computes similarity between images (using cosine similarity)
  3. Returns the top-K most similar images to a query
  4. Visualizes results (so you can verify it works!)

Why This Matters

Real-World Applications (This is exactly what these companies do):
| Company | Application | How It Works |
| --- | --- | --- |
| Pinterest | Visual search | "Find pins like this image" |
| Google Images | Reverse search | "Where else does this image appear?" |
| Amazon | Similar products | "Customers who viewed this also viewed…" |
| Spotify | Cover art discovery | Audio → spectrogram → vector similarity |
| Face ID | Authentication | Face → embedding → compare to a stored embedding |
ML Concepts You’ll Master:
  • Feature extraction (images → vectors)
  • Similarity metrics (cosine similarity)
  • Nearest neighbor search
  • High-dimensional vector spaces
  • Vectorized operations (matrix multiplication for speed!)

Part 1: Understanding the Problem (Deeply)

How Do We Represent Images as Vectors?

This is the crucial question. Let’s explore multiple methods and understand the tradeoffs.
[Figure: Understanding Vector Similarity]
Method 1: Raw Pixels (Simple but effective for similar images)
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

def image_to_vector_simple(image_path, size=(64, 64)):
    """
    Convert image to a vector using raw pixel values.
    
    THE IDEA:
    - A 64×64 grayscale image has 64×64 = 4,096 pixels
    - Each pixel is a number from 0 (black) to 255 (white)
    - We "flatten" the 2D grid into a 1D list of 4,096 numbers
    - This list IS our vector!
    
    Args:
        image_path: Path to image file
        size: Resize dimensions (smaller = faster, but less detail)
    
    Returns:
        1D numpy array of pixel values, normalized to [0, 1]
    """
    # Load image and convert to grayscale (removes color complexity)
    img = Image.open(image_path).convert('L')
    
    # Resize to standard size (all vectors must be same length!)
    # This is crucial: you can't compare vectors of different sizes
    img = img.resize(size)
    
    # Convert to numpy array
    img_array = np.array(img)
    print(f"Image shape: {img_array.shape}")  # (64, 64)
    
    # Flatten: 2D grid → 1D vector
    # Think of it like reading a book: left-to-right, top-to-bottom
    vector = img_array.flatten()
    print(f"Vector shape: {vector.shape}")  # (4096,)
    
    # Normalize to [0, 1]
    # Why? Cosine similarity cares about direction, not magnitude
    # But normalization helps numerical stability
    vector = vector / 255.0
    
    return vector

# Let's visualize what's happening
def show_flatten_process():
    """Visualize how an image becomes a vector."""
    # Create a tiny 4×4 example
    tiny_image = np.array([
        [10, 20, 30, 40],
        [50, 60, 70, 80],
        [90, 100, 110, 120],
        [130, 140, 150, 160]
    ], dtype=np.uint8)
    
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    
    # Original image
    axes[0].imshow(tiny_image, cmap='gray')
    axes[0].set_title('Original 4×4 Image')
    for i in range(4):
        for j in range(4):
            axes[0].text(j, i, str(tiny_image[i,j]), ha='center', va='center', color='red')
    
    # Flattened
    vector = tiny_image.flatten()
    axes[1].bar(range(16), vector, color='steelblue')
    axes[1].set_title('Flattened Vector (16 numbers)')
    axes[1].set_xlabel('Index')
    axes[1].set_ylabel('Pixel Value')
    
    # The vector
    axes[2].text(0.1, 0.5, f"Vector = [{', '.join(map(str, vector[:8]))}...\n"
                          f"          {', '.join(map(str, vector[8:]))}]",
                fontfamily='monospace', fontsize=10, va='center')
    axes[2].axis('off')
    axes[2].set_title('Our Image Vector!')
    
    plt.tight_layout()
    plt.show()

show_flatten_process()
Why normalize? It puts pixel values on a standard 0-1 scale, which keeps the numbers well-behaved and matches what most ML tooling expects. (Cosine similarity ignores uniform scaling anyway, so this step doesn't change the rankings here; it matters more if you later switch metrics or feed these vectors into other models.)

Method 2: Color Histogram (Better for Different Viewpoints)

The pixel method has a problem: if you rotate or shift an image slightly, the vector changes dramatically! Histograms solve this.

The Idea: Instead of asking "where are the pixels?", ask "what colors are present?"
def image_to_histogram(image_path, bins=32):
    """
    Convert image to color histogram vector.
    
    THE IDEA:
    - Count how many pixels of each color exist
    - A red car has lots of red pixels → high red bin values
    - A blue sky has lots of blue pixels → high blue bin values
    - Two red cars → similar histograms (even if rotated!)
    
    Args:
        image_path: Path to image file
        bins: Number of "buckets" per color channel (more = more detail)
              32 bins means we group 256 colors into 32 buckets
    
    Returns:
        1D numpy array of histogram values (normalized)
    """
    # Load image in RGB (Red, Green, Blue channels)
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img)
    
    # Compute histogram for each color channel
    # "How many pixels have red values 0-7? 8-15? etc."
    hist_r = np.histogram(img_array[:,:,0], bins=bins, range=(0, 256))[0]
    hist_g = np.histogram(img_array[:,:,1], bins=bins, range=(0, 256))[0]
    hist_b = np.histogram(img_array[:,:,2], bins=bins, range=(0, 256))[0]
    
    # Concatenate into one vector: [red_hist, green_hist, blue_hist]
    vector = np.concatenate([hist_r, hist_g, hist_b])
    
    # Normalize so values sum to 1 (makes comparison fair)
    vector = vector / vector.sum()
    
    return vector

# Visualize what a histogram looks like
def show_histogram_process():
    """Show how histogram captures color distribution."""
    # Create sample images
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    
    # Red-ish image
    red_img = np.zeros((64, 64, 3), dtype=np.uint8)
    red_img[:, :, 0] = np.random.randint(180, 255, (64, 64))  # High red
    red_img[:, :, 1] = np.random.randint(0, 50, (64, 64))     # Low green
    red_img[:, :, 2] = np.random.randint(0, 50, (64, 64))     # Low blue
    
    # Blue-ish image
    blue_img = np.zeros((64, 64, 3), dtype=np.uint8)
    blue_img[:, :, 0] = np.random.randint(0, 50, (64, 64))
    blue_img[:, :, 1] = np.random.randint(0, 50, (64, 64))
    blue_img[:, :, 2] = np.random.randint(180, 255, (64, 64))
    
    # Show images
    axes[0, 0].imshow(red_img)
    axes[0, 0].set_title('Red-dominant Image')
    axes[0, 0].axis('off')
    
    axes[0, 2].imshow(blue_img)
    axes[0, 2].set_title('Blue-dominant Image')
    axes[0, 2].axis('off')
    
    # Show histograms
    for i, (img, color, title) in enumerate([
        (red_img, ['red', 'green', 'blue'], 'Red Image Histogram'),
        (blue_img, ['red', 'green', 'blue'], 'Blue Image Histogram')
    ]):
        ax = axes[1, i*2]
        for c, channel in enumerate(['Red', 'Green', 'Blue']):
            hist = np.histogram(img[:,:,c], bins=32, range=(0, 256))[0]
            ax.bar(np.arange(32) + c*0.25, hist, width=0.25, 
                   label=channel, alpha=0.7)
        ax.set_title(title)
        ax.legend()
        ax.set_xlabel('Color Bin')
        ax.set_ylabel('Count')
    
    axes[0, 1].axis('off')
    axes[1, 1].axis('off')
    plt.tight_layout()
    plt.show()

show_histogram_process()
Histogram vs Pixels: When to Use Which?
| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Raw Pixels | Simple, captures spatial info | Sensitive to rotation/shift | Exact duplicates, aligned images |
| Histogram | Rotation invariant, compact | Loses spatial info | Different viewpoints, color-based search |
| Deep Learning (CNN) | Best of both worlds | Requires training | Production systems |
For learning, we’ll use both! In production, companies use CNNs (neural networks) to extract “deep features.”
Why histograms work for rotated images: If you rotate a red car, it’s still red! The histogram captures “what colors exist” not “where colors are.”
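This is easy to verify. The following sketch uses a synthetic random image, with np.rot90 standing in for a real rotated photo:

import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # synthetic grayscale image
rotated = np.rot90(img)                                    # rotate 90 degrees

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw pixels: rotation moves values to new positions, so the vectors disagree
pixels_orig = img.flatten() / 255.0
pixels_rot = rotated.flatten() / 255.0
print(f"pixel similarity:     {cosine(pixels_orig, pixels_rot):.3f}")  # well below 1

# Histograms: same pixels, same counts, so the vectors are identical
hist_orig = np.histogram(img, bins=32, range=(0, 256))[0]
hist_rot = np.histogram(rotated, bins=32, range=(0, 256))[0]
print(f"histogram similarity: {cosine(hist_orig, hist_rot):.3f}")      # exactly 1.000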

Part 2: Computing Similarity

This is where the linear algebra magic happens. We’ve turned images into vectors. Now we need to measure “how similar” two vectors are.

The Core Insight: Similarity = Angle Between Vectors

Imagine two arrows (vectors) starting from the same point:
  • If they point in the same direction → very similar
  • If they point in different directions → not similar
  • If they’re perpendicular (90°) → completely unrelated
Cosine similarity measures this by computing the cosine of the angle between vectors.

Cosine Similarity: The Math

$$\text{cosine similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \times \|\mathbf{B}\|}$$

Where:
  • $\mathbf{A} \cdot \mathbf{B}$ is the dot product (sum of element-wise products)
  • $\|\mathbf{A}\|$ is the magnitude (length) of vector A
  • The result is between -1 and 1
| Cosine Value | Angle | Meaning |
| --- | --- | --- |
| 1.0 | 0° | Identical (same direction) |
| 0.7 | ~45° | Similar |
| 0.0 | 90° | Unrelated (perpendicular) |
| -1.0 | 180° | Opposite |
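You can sanity-check the 45° row with two 2D vectors:

import numpy as np

a = np.array([1.0, 0.0])  # along the x-axis
b = np.array([1.0, 1.0])  # 45° above the x-axis
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"{cos_theta:.3f}")  # 0.707, i.e. cos(45°)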

Implementation with Deep Understanding

def cosine_similarity(v1, v2):
    """
    Compute cosine similarity between two vectors.
    
    This is THE core operation of similarity search!
    
    Formula: cos(θ) = (v1 · v2) / (||v1|| × ||v2||)
    
    STEP BY STEP:
    1. Dot product: Multiply corresponding elements and sum
       - Measures "how much do these vectors agree?"
       - If both have high values in same positions → high dot product
    
    2. Normalize by magnitudes
       - Removes the effect of "brightness" (larger numbers)
       - Focuses purely on "direction" (pattern of values)
    
    Args:
        v1, v2: Vectors (numpy arrays) of same length
    
    Returns:
        Similarity score in [-1, 1]
        1 = identical direction (most similar)
        0 = orthogonal (unrelated)
        -1 = opposite direction (least similar)
    """
    # Step 1: Compute dot product
    # v1 = [a, b, c], v2 = [x, y, z]
    # dot = a*x + b*y + c*z
    dot_product = np.dot(v1, v2)
    
    # Step 2: Compute magnitudes (lengths)
    # ||v|| = sqrt(a² + b² + c²)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    
    # Step 3: Avoid division by zero
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    
    # Step 4: Compute and return
    return dot_product / (norm_v1 * norm_v2)

# Let's build intuition with examples
print("=== Understanding Cosine Similarity ===\n")

# Example 1: Same direction, different magnitude
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # v2 = 2 * v1
print("v1 = [1, 2, 3]")
print("v2 = [2, 4, 6] (same direction, 2x magnitude)")
print(f"Cosine similarity: {cosine_similarity(v1, v2):.3f}")
print("→ 1.0 because they point the same way!\n")

# Example 2: Perpendicular vectors
v3 = np.array([1, 0, 0])
v4 = np.array([0, 1, 0])
print("v3 = [1, 0, 0] (pointing along x-axis)")
print("v4 = [0, 1, 0] (pointing along y-axis)")
print(f"Cosine similarity: {cosine_similarity(v3, v4):.3f}")
print("→ 0.0 because they're perpendicular!\n")

# Example 3: Image-like vectors (patterns matter)
bright_sunset = np.array([0.9, 0.5, 0.1])  # High red, medium green, low blue
dark_sunset = np.array([0.45, 0.25, 0.05])  # Same pattern, darker
blue_sky = np.array([0.1, 0.3, 0.9])  # Opposite pattern

print("bright_sunset = [0.9, 0.5, 0.1] (red-orange colors)")
print("dark_sunset = [0.45, 0.25, 0.05] (same colors, darker)")
print(f"Similarity: {cosine_similarity(bright_sunset, dark_sunset):.3f}")
print("→ ~1.0 because same COLOR PATTERN (direction)!\n")

print("blue_sky = [0.1, 0.3, 0.9] (blue colors)")
print(f"Similarity to sunset: {cosine_similarity(bright_sunset, blue_sky):.3f}")
print("→ Low because completely different color pattern!")
Why Cosine Beats Euclidean for Images

Two images of the same sunset, one taken with a bright exposure setting and one dark, have the same pattern of colors but different absolute values.
  • Cosine similarity: “Same pattern!” (1.0)
  • Euclidean distance: “Very different!” (large distance)
Cosine captures what matters for visual similarity.

Euclidean Distance: The Alternative

For completeness, here’s Euclidean distance. It measures “how far apart” vectors are in space:
def euclidean_distance(v1, v2):
    """
    Compute Euclidean distance between two vectors.
    
    This is the "straight line" distance formula you know from geometry!
    
    Formula: ||v1 - v2|| = sqrt(Σ(v1_i - v2_i)²)
    
    Returns:
        Distance (0 = identical, larger = more different)
    """
    return np.linalg.norm(v1 - v2)

# Compare the two metrics
print("=== Cosine vs Euclidean ===\n")

v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # Same direction, different magnitude

print(f"Cosine similarity: {cosine_similarity(v1, v2):.3f}")  # 1.0
print(f"Euclidean distance: {euclidean_distance(v1, v2):.3f}")  # 3.74

print("\n→ Cosine says 'identical direction'")
print("→ Euclidean says 'they're far apart'")
print("→ For images, direction (pattern) matters more!")

Which to Use?

  • Cosine: When magnitude doesn’t matter (text, images with different brightness)
  • Euclidean: When absolute values matter (coordinates, measurements)
For images, cosine similarity is usually better!
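One useful connection: once vectors are L2-normalized to unit length, the two metrics agree on ranking, because for unit vectors $\|a - b\|^2 = 2(1 - \cos\theta)$. A quick numeric check:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 1.0, 2.0])
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_theta = a_unit @ b_unit
squared_dist = np.linalg.norm(a_unit - b_unit) ** 2

print(f"2 * (1 - cos): {2 * (1 - cos_theta):.6f}")
print(f"||a - b||^2:   {squared_dist:.6f}")  # same value

This is why many libraries implement cosine search as Euclidean or inner-product search over normalized vectors.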

Part 3: Building the Search Engine

Step 1: Create Image Database

import os
from glob import glob
from tqdm import tqdm

class ImageSearchEngine:
    def __init__(self, image_dir, method='histogram'):
        """
        Initialize search engine.
        
        Args:
            image_dir: Directory containing images
            method: 'pixels' or 'histogram'
        """
        self.image_dir = image_dir
        self.method = method
        self.image_paths = []
        self.image_vectors = []
        
    def index_images(self):
        """
        Convert all images to vectors and store them.
        """
        # Find all images
        extensions = ['*.jpg', '*.jpeg', '*.png']
        for ext in extensions:
            self.image_paths.extend(glob(os.path.join(self.image_dir, ext)))
        
        print(f"Found {len(self.image_paths)} images")
        
        # Convert to vectors
        print("Converting images to vectors...")
        for img_path in tqdm(self.image_paths):
            try:
                if self.method == 'pixels':
                    vector = image_to_vector_simple(img_path)
                else:
                    vector = image_to_histogram(img_path)
                
                self.image_vectors.append(vector)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        
        # Convert to numpy array for faster computation
        self.image_vectors = np.array(self.image_vectors)
        
        print(f"Indexed {len(self.image_vectors)} images")
        print(f"Vector shape: {self.image_vectors.shape}")
    
    def search(self, query_path, top_k=5):
        """
        Find top-K most similar images to query.
        
        Args:
            query_path: Path to query image
            top_k: Number of results to return
        
        Returns:
            List of (image_path, similarity_score) tuples
        """
        # Convert query to vector
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        # Compute similarities with all images
        similarities = []
        for i, img_vector in enumerate(self.image_vectors):
            sim = cosine_similarity(query_vector, img_vector)
            similarities.append((self.image_paths[i], sim))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top-K
        return similarities[:top_k]
    
    def search_fast(self, query_path, top_k=5):
        """
        Faster version using vectorized operations.
        """
        # Convert query to vector
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        # Vectorized cosine similarity
        # cos(θ) = (A · B) / (||A|| ||B||)
        
        # Normalize query vector
        query_norm = query_vector / np.linalg.norm(query_vector)
        
        # Normalize all database vectors
        db_norms = np.linalg.norm(self.image_vectors, axis=1, keepdims=True)
        db_normalized = self.image_vectors / db_norms
        
        # Compute all similarities at once (matrix-vector multiplication!)
        similarities = db_normalized @ query_norm
        
        # Get top-K indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return results
        results = [(self.image_paths[i], similarities[i]) for i in top_indices]
        return results

Step 2: Use the Search Engine

# Initialize
engine = ImageSearchEngine('product_images/', method='histogram')

# Index all images
engine.index_images()

# Search
query_image = 'query.jpg'
results = engine.search_fast(query_image, top_k=5)

# Display results
print(f"\nTop 5 similar images to {query_image}:")
for i, (img_path, score) in enumerate(results, 1):
    print(f"{i}. {os.path.basename(img_path)}: {score:.3f}")
Output:
Found 1000 images
Converting images to vectors...
100%|████████████████████| 1000/1000 [00:15<00:00, 65.2it/s]
Indexed 1000 images
Vector shape: (1000, 96)

Top 5 similar images to query.jpg:
1. product_042.jpg: 0.987
2. product_156.jpg: 0.923
3. product_089.jpg: 0.891
4. product_234.jpg: 0.876
5. product_567.jpg: 0.854

Part 4: Visualizing Results

import matplotlib.pyplot as plt

def visualize_results(query_path, results, num_results=5):
    """
    Display query image and top results.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    # Query image
    query_img = Image.open(query_path)
    axes[0, 0].imshow(query_img)
    axes[0, 0].set_title('Query Image', fontsize=14, fontweight='bold')
    axes[0, 0].axis('off')
    
    # Hide extra subplot
    axes[0, 1].axis('off')
    axes[0, 2].axis('off')
    
    # Top results
    for i, (img_path, score) in enumerate(results[:num_results]):
        row = 1 + i // 3
        col = i % 3
        
        if row < 2:  # Only show first 3 results
            img = Image.open(img_path)
            axes[row, col].imshow(img)
            axes[row, col].set_title(f'#{i+1}: {score:.3f}', fontsize=12)
            axes[row, col].axis('off')
    
    plt.tight_layout()
    plt.savefig('search_results.png', dpi=150, bbox_inches='tight')
    plt.show()

# Use it
visualize_results(query_image, results)

Part 5: Performance Analysis

Timing Comparison

import time

# Slow version (loop)
start = time.time()
results_slow = engine.search(query_image, top_k=5)
time_slow = time.time() - start

# Fast version (vectorized)
start = time.time()
results_fast = engine.search_fast(query_image, top_k=5)
time_fast = time.time() - start

print(f"Slow version: {time_slow:.3f}s")
print(f"Fast version: {time_fast:.3f}s")
print(f"Speedup: {time_slow/time_fast:.1f}x")
Output:
Slow version: 0.245s
Fast version: 0.003s
Speedup: 81.7x
Why so much faster? Vectorized operations use optimized C/Fortran code and can leverage CPU SIMD instructions!

Memory Usage

import sys

# Size of database
num_images = len(engine.image_vectors)
vector_dim = engine.image_vectors.shape[1]
total_bytes = engine.image_vectors.nbytes

print(f"Number of images: {num_images}")
print(f"Vector dimension: {vector_dim}")
print(f"Total memory: {total_bytes / 1024 / 1024:.2f} MB")
print(f"Per image: {total_bytes / num_images / 1024:.2f} KB")

Part 6: Improvements & Extensions

1. Better Feature Extraction

# Use pre-trained neural network (ResNet, VGG, etc.)
from torchvision import models, transforms
import torch

def image_to_vector_deep(image_path):
    """
    Extract features using pre-trained ResNet.
    """
    # Load pre-trained model. (Loaded per call here for simplicity; in
    # practice, load it once and reuse it. Note: pretrained=True is
    # deprecated in newer torchvision; weights= is the current API.)
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.eval()
    
    # Remove final classification layer
    feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
    
    # Preprocess image
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    ])
    
    img = Image.open(image_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0)
    
    # Extract features
    with torch.no_grad():
        features = feature_extractor(img_tensor)
    
    # Flatten to vector
    vector = features.squeeze().numpy()
    
    return vector

# This gives 512-dimensional vectors with much better semantic understanding!
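Here's a sketch of how those deep vectors could plug into the engine from Part 3 (this reuses the engine object and query.jpg from earlier and assumes torchvision is installed; for speed, load the model once rather than per image):

# Hypothetical integration with the engine from Part 3
deep_vectors = np.array([image_to_vector_deep(p) for p in engine.image_paths])

query_vec = image_to_vector_deep('query.jpg')
query_unit = query_vec / np.linalg.norm(query_vec)

# Same normalized matrix-vector trick as search_fast()
db_norms = np.linalg.norm(deep_vectors, axis=1, keepdims=True)
similarities = (deep_vectors / db_norms) @ query_unit

for i in np.argsort(similarities)[::-1][:5]:
    print(engine.image_paths[i], f"{similarities[i]:.3f}")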

2. Faster Search with Approximate Nearest Neighbors

# For large databases (millions of images), use FAISS or Annoy
import faiss

def build_faiss_index(vectors):
    """
    Build FAISS index for fast approximate search.
    """
    dimension = vectors.shape[1]
    
    # Create index
    index = faiss.IndexFlatL2(dimension)  # L2 distance
    
    # Add vectors
    index.add(vectors.astype('float32'))
    
    return index

def search_faiss(index, query_vector, top_k=5):
    """
    Search using FAISS index.
    """
    query = query_vector.astype('float32').reshape(1, -1)
    distances, indices = index.search(query, top_k)
    return indices[0], distances[0]

# Usage
index = build_faiss_index(engine.image_vectors)
indices, distances = search_faiss(index, query_vector, top_k=5)
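One caveat: IndexFlatL2 ranks by Euclidean distance, not the cosine similarity used in the rest of this project. A minimal cosine-compatible index, assuming the standard FAISS API (faiss.normalize_L2 normalizes rows in place; inner product on unit vectors equals cosine):

import faiss
import numpy as np

def build_cosine_index(vectors):
    """FAISS index that ranks by cosine similarity."""
    x = np.ascontiguousarray(vectors.astype('float32'))
    faiss.normalize_L2(x)                  # in-place row-wise L2 normalization
    index = faiss.IndexFlatIP(x.shape[1])  # IP = inner product
    index.add(x)
    return index

# Remember to normalize the query the same way before calling index.search()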
3. Multi-Modal Search (Text + Image)

def search_by_text_and_image(text_query, image_query, alpha=0.5):
    """
    Combine text and image search.
    
    Args:
        text_query: Text description
        image_query: Image path
        alpha: Weight for image similarity (1-alpha for text)
    """
    # Get image similarity
    img_similarities = engine.search_fast(image_query, top_k=100)
    
    # Get text similarity (using image captions)
    # text_similarities = search_by_text(text_query, top_k=100)
    
    # Combine scores
    # combined = alpha * img_sim + (1-alpha) * text_sim
    
    pass  # Implementation left as exercise

Challenges & Exercises

Challenge 1: Batch Search

Modify the search engine to handle multiple queries at once:
def search_batch(self, query_paths, top_k=5):
    """
    Search for multiple queries simultaneously.
    
    Args:
        query_paths: List of query image paths
        top_k: Number of results per query
    
    Returns:
        List of result lists
    """
    # TODO: Implement this!
    # Hint: Convert all queries to vectors, then use matrix multiplication
    pass

Challenge 2: Evaluation Metrics

Implement precision@K and recall@K:
def evaluate_search(ground_truth, predictions, k=5):
    """
    Evaluate search quality.
    
    Args:
        ground_truth: Set of relevant image IDs
        predictions: List of predicted image IDs (ordered by relevance)
        k: Number of top results to consider
    
    Returns:
        precision, recall
    """
    # TODO: Implement this!
    pass

Challenge 3: Diversity

Modify search to return diverse results (not all similar to each other):
def search_diverse(self, query_path, top_k=5, diversity_weight=0.3):
    """
    Return diverse results.
    
    Strategy: Penalize results similar to already selected results.
    """
    # TODO: Implement this!
    pass

Complete Code

Here’s the full working implementation:
# image_search.py

import numpy as np
from PIL import Image
import os
from glob import glob
from tqdm import tqdm
import matplotlib.pyplot as plt

def image_to_vector_simple(image_path, size=(64, 64)):
    img = Image.open(image_path).convert('L')
    img = img.resize(size)
    img_array = np.array(img)
    vector = img_array.flatten()
    vector = vector / 255.0
    return vector

def image_to_histogram(image_path, bins=32):
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img)
    
    hist_r = np.histogram(img_array[:,:,0], bins=bins, range=(0, 256))[0]
    hist_g = np.histogram(img_array[:,:,1], bins=bins, range=(0, 256))[0]
    hist_b = np.histogram(img_array[:,:,2], bins=bins, range=(0, 256))[0]
    
    vector = np.concatenate([hist_r, hist_g, hist_b])
    vector = vector / vector.sum()
    
    return vector

def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0
    
    return dot_product / (norm_v1 * norm_v2)

class ImageSearchEngine:
    def __init__(self, image_dir, method='histogram'):
        self.image_dir = image_dir
        self.method = method
        self.image_paths = []
        self.image_vectors = []
        
    def index_images(self):
        extensions = ['*.jpg', '*.jpeg', '*.png']
        for ext in extensions:
            self.image_paths.extend(glob(os.path.join(self.image_dir, ext)))
        
        print(f"Found {len(self.image_paths)} images")
        
        print("Converting images to vectors...")
        for img_path in tqdm(self.image_paths):
            try:
                if self.method == 'pixels':
                    vector = image_to_vector_simple(img_path)
                else:
                    vector = image_to_histogram(img_path)
                
                self.image_vectors.append(vector)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        
        self.image_vectors = np.array(self.image_vectors)
        
        print(f"Indexed {len(self.image_vectors)} images")
        print(f"Vector shape: {self.image_vectors.shape}")
    
    def search_fast(self, query_path, top_k=5):
        if self.method == 'pixels':
            query_vector = image_to_vector_simple(query_path)
        else:
            query_vector = image_to_histogram(query_path)
        
        query_norm = query_vector / np.linalg.norm(query_vector)
        
        db_norms = np.linalg.norm(self.image_vectors, axis=1, keepdims=True)
        db_normalized = self.image_vectors / db_norms
        
        similarities = db_normalized @ query_norm
        
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = [(self.image_paths[i], similarities[i]) for i in top_indices]
        return results

def visualize_results(query_path, results, num_results=5):
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    query_img = Image.open(query_path)
    axes[0, 0].imshow(query_img)
    axes[0, 0].set_title('Query Image', fontsize=14, fontweight='bold')
    axes[0, 0].axis('off')
    
    axes[0, 1].axis('off')
    axes[0, 2].axis('off')
    
    for i, (img_path, score) in enumerate(results[:3]):
        img = Image.open(img_path)
        axes[1, i].imshow(img)
        axes[1, i].set_title(f'#{i+1}: {score:.3f}', fontsize=12)
        axes[1, i].axis('off')
    
    plt.tight_layout()
    plt.savefig('search_results.png', dpi=150, bbox_inches='tight')
    plt.show()

if __name__ == '__main__':
    # Initialize and run
    engine = ImageSearchEngine('images/', method='histogram')
    engine.index_images()
    
    query = 'query.jpg'
    results = engine.search_fast(query, top_k=5)
    
    print(f"\nTop 5 similar images:")
    for i, (path, score) in enumerate(results, 1):
        print(f"{i}. {os.path.basename(path)}: {score:.3f}")
    
    visualize_results(query, results)

🚨 Real-World Challenge: Handling Messy Images

Production Reality: Real image databases have corrupted files, varying formats, extreme sizes, and other issues. Here’s how to handle them:
import numpy as np
from PIL import Image
import os
from glob import glob

class RobustImageProcessor:
    """Production-ready image processing with error handling."""
    
    def __init__(self, target_size=(64, 64)):
        self.target_size = target_size
        self.failed_images = []
        
    def process_image(self, image_path):
        """
        Robust image processing with comprehensive error handling.
        """
        try:
            # Check file exists
            if not os.path.exists(image_path):
                raise FileNotFoundError(f"Image not found: {image_path}")
            
            # Check file size (skip tiny/corrupted files)
            file_size = os.path.getsize(image_path)
            if file_size < 100:  # Less than 100 bytes
                raise ValueError(f"File too small, likely corrupted: {file_size} bytes")
            
            # Attempt to open image
            img = Image.open(image_path)
            
            # Handle various formats and modes
            if img.mode == 'RGBA':
                # Handle transparency by compositing on white background
                background = Image.new('RGB', img.size, (255, 255, 255))
                background.paste(img, mask=img.split()[3])  # Use alpha channel as mask
                img = background
            elif img.mode == 'P':
                # Palette mode
                img = img.convert('RGB')
            elif img.mode == 'L':
                # Already grayscale
                pass
            elif img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Convert to grayscale for consistency
            img = img.convert('L')
            
            # Handle extreme aspect ratios (crop to square)
            width, height = img.size
            if width / height > 3 or height / width > 3:
                # Extreme aspect ratio - center crop
                min_dim = min(width, height)
                left = (width - min_dim) // 2
                top = (height - min_dim) // 2
                img = img.crop((left, top, left + min_dim, top + min_dim))
            
            # Resize
            img = img.resize(self.target_size, Image.Resampling.LANCZOS)
            
            # Convert to vector
            vector = np.array(img).flatten() / 255.0
            
            return vector
            
        except Exception as e:
            self.failed_images.append((image_path, str(e)))
            return None
    
    def process_directory(self, image_dir):
        """Process all images in directory with progress tracking."""
        extensions = ['*.jpg', '*.jpeg', '*.png', '*.gif', '*.bmp', '*.webp']
        all_paths = []
        for ext in extensions:
            all_paths.extend(glob(os.path.join(image_dir, '**', ext), recursive=True))
        
        print(f"Found {len(all_paths)} images")
        
        vectors = []
        valid_paths = []
        
        for i, path in enumerate(all_paths):
            if i % 100 == 0:
                print(f"Processing {i}/{len(all_paths)}...")
            
            vector = self.process_image(path)
            if vector is not None:
                vectors.append(vector)
                valid_paths.append(path)
        
        # Report failures
        if self.failed_images:
            print(f"\n⚠️ Failed to process {len(self.failed_images)} images:")
            for path, error in self.failed_images[:5]:  # Show first 5
                print(f"   - {os.path.basename(path)}: {error}")
            if len(self.failed_images) > 5:
                print(f"   ... and {len(self.failed_images) - 5} more")
        
        return np.array(vectors), valid_paths

# Usage
processor = RobustImageProcessor()
vectors, paths = processor.process_directory('images/')
print(f"\nSuccessfully processed {len(vectors)} images")
Production Checklist:
  • Handle missing files gracefully
  • Skip corrupted/truncated images
  • Normalize different color modes (RGB, RGBA, palette, grayscale)
  • Handle extreme aspect ratios
  • Log failures for debugging
  • Set memory limits for very large images (see the sketch below)
  • Add timeout for slow processing
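For the memory-limit item, a minimal sketch using Pillow's built-in decompression-bomb guard (Image.MAX_IMAGE_PIXELS is a real Pillow setting; the 50-megapixel cap is an arbitrary choice for illustration):

from PIL import Image

# Pillow warns with DecompressionBombWarning above this pixel count and
# raises DecompressionBombError at twice this count (default is ~178 MP).
Image.MAX_IMAGE_PIXELS = 50_000_000

# Image.open() only reads headers, so you can also check dimensions
# cheaply before decoding the full image:
with Image.open('huge_scan.png') as img:  # hypothetical file
    if img.size[0] * img.size[1] > 50_000_000:
        raise ValueError(f"Image too large: {img.size}")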

Key Takeaways

What You Built & Learned:
  • Images as Vectors - Pixels, histograms, and deep features all work
  • Cosine Similarity - Measure angular distance between high-dimensional vectors
  • Vectorized Operations - Replace loops with matrix operations for 100x speedup
  • Batch Processing - Matrix multiplication enables searching thousands of images instantly
  • Real ML Pipeline - From raw data → vectors → similarity → results
  • Production Robustness - Handle corrupted, malformed, and edge-case images
Industry Insight: Companies like Pinterest, Airbnb, and Spotify use exactly these techniques—just with more sophisticated embeddings (from neural networks) and approximate nearest neighbor search for speed at scale.

What’s Next?

You’ve built a working image search engine using linear algebra! Next, we’ll explore eigenvalues and eigenvectors to build even more powerful applications like face recognition and dimensionality reduction.
