Project 2: Music Recommendation Engine

What You’re Building

A collaborative filtering recommendation system that:
  1. Takes a user’s listening history
  2. Finds hidden patterns using SVD
  3. Predicts what songs they’ll like
  4. Recommends songs they haven’t heard yet
This is the same core idea behind features like Spotify’s “Discover Weekly”, though production systems combine it with many other signals and models.
Estimated Time: 4-5 hours
Difficulty: Intermediate
Concepts Used: SVD, matrix factorization, similarity measures
What You’ll Learn: The core idea behind Netflix- and Spotify-style recommendations
Take Your Time: This project synthesizes everything from the course. If you’re confused, go back to the SVD and Eigenvalues modules.

The Big Picture

Why Matrix Factorization?

Imagine you have millions of users and thousands of songs. Most users have only listened to a tiny fraction of songs.
         Song1  Song2  Song3  Song4  Song5  Song6  ...  Song10000
User1    [  5     ?     3      ?      ?     4    ...     ?      ]
User2    [  ?     4     ?      5      ?     ?    ...     2      ]
User3    [  4     ?     ?      ?      3     5    ...     ?      ]
...
User1M   [  ?     ?     5      ?      4     ?    ...     ?      ]
The Problem: We have a sparse matrix with mostly missing values. How do we predict the ? values?
The Solution: Find hidden “taste factors” that explain the patterns!

The Key Insight

Users and songs can be described by hidden factors:
Factor     User Interpretation        Song Interpretation
Factor 1   “Likes energetic music”    “High energy song”
Factor 2   “Prefers acoustic”         “Acoustic instrumentation”
Factor 3   “Enjoys complex lyrics”    “Lyrical complexity”
If a user scores high on “likes energy” and a song scores high on “high energy,” they’ll probably like it!
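
To make that concrete, here is a minimal sketch of the scoring idea. The three factor values for the user and the two songs are made-up numbers for illustration, not learned from any data; the point is only that the predicted affinity is a dot product of the two factor vectors.

```python
import numpy as np

# Hypothetical 3-factor profiles: [energy, acoustic, lyrical complexity]
user_taste = np.array([0.9, -0.2, 0.1])      # strongly prefers energetic music
energetic_song = np.array([0.8, -0.5, 0.0])  # high energy, not acoustic
acoustic_song = np.array([-0.6, 0.9, 0.3])   # low energy, very acoustic

# Affinity = dot product of the user vector and the song vector
score_energetic = user_taste @ energetic_song
score_acoustic = user_taste @ acoustic_song

print(f"Energetic song score: {score_energetic:.2f}")
print(f"Acoustic song score:  {score_acoustic:.2f}")
```

The energetic song scores 0.82 and the acoustic one scores -0.69, so this user would be shown the energetic song first.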

Part 1: Create the Dataset

import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

np.random.seed(42)

# Simulate a music streaming dataset
n_users = 500
n_songs = 200

# Create user and song names
users = [f"user_{i}" for i in range(n_users)]
songs = [f"song_{i}" for i in range(n_songs)]

# Create some "ground truth" hidden factors
# These are the TRUE taste dimensions (we'll try to recover them)
n_factors = 5

# User taste profiles (how much each user likes each hidden factor)
user_factors = np.random.randn(n_users, n_factors) * 0.5

# Song characteristics (how much each song has each hidden factor)
song_factors = np.random.randn(n_factors, n_songs) * 0.5

# True ratings = user_factors @ song_factors (noise is added to the observed ratings below)
true_ratings = user_factors @ song_factors

# Convert to 1-5 scale
true_ratings = (true_ratings - true_ratings.min()) / (true_ratings.max() - true_ratings.min()) * 4 + 1

# Create sparse observations (users only rate some songs)
# Each user-song pair is observed with probability 0.1, so each user rates about 10% of songs
observed_mask = np.random.random((n_users, n_songs)) < 0.1

# Add some noise to observed ratings
noise = np.random.randn(n_users, n_songs) * 0.3
observed_ratings = true_ratings + noise
observed_ratings = np.clip(observed_ratings, 1, 5)

# Replace unobserved with NaN
ratings_matrix = np.where(observed_mask, observed_ratings, np.nan)

print(f"Dataset size: {n_users} users × {n_songs} songs")
print(f"Observed ratings: {observed_mask.sum():,} ({observed_mask.mean()*100:.1f}%)")
print(f"Missing ratings: {(~observed_mask).sum():,} ({(~observed_mask).mean()*100:.1f}%)")

# Create a DataFrame for easier manipulation
ratings_df = pd.DataFrame(ratings_matrix, index=users, columns=songs)
print("\nSample of ratings matrix:")
print(ratings_df.iloc[:5, :5].round(1))

Part 2: Exploratory Data Analysis

# Analyze the ratings distribution
observed_values = ratings_matrix[~np.isnan(ratings_matrix)]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of ratings
axes[0].hist(observed_values, bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')
axes[0].set_title('Distribution of Ratings')
axes[0].axvline(observed_values.mean(), color='red', linestyle='--', label=f'Mean: {observed_values.mean():.2f}')
axes[0].legend()

# Ratings per user
ratings_per_user = (~np.isnan(ratings_matrix)).sum(axis=1)
axes[1].hist(ratings_per_user, bins=30, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of Rated Songs')
axes[1].set_ylabel('Number of Users')
axes[1].set_title('Ratings per User')

# Ratings per song
ratings_per_song = (~np.isnan(ratings_matrix)).sum(axis=0)
axes[2].hist(ratings_per_song, bins=30, edgecolor='black', alpha=0.7)
axes[2].set_xlabel('Number of Ratings')
axes[2].set_ylabel('Number of Songs')
axes[2].set_title('Ratings per Song')

plt.tight_layout()
plt.show()

print(f"\nAverage rating: {observed_values.mean():.2f}")
print(f"Std of ratings: {observed_values.std():.2f}")
print(f"Avg ratings per user: {ratings_per_user.mean():.1f}")
print(f"Avg ratings per song: {ratings_per_song.mean():.1f}")

Part 3: Build the Recommendation Engine

Step 1: Handle Missing Values

For SVD, we need a complete matrix. We’ll fill missing values with the user’s average rating.
def fill_missing_values(ratings_matrix):
    """
    Fill missing ratings with user averages.
    This is a simple baseline - more sophisticated methods exist!
    """
    filled_matrix = ratings_matrix.copy()
    
    for i in range(ratings_matrix.shape[0]):
        user_ratings = ratings_matrix[i, :]
        user_mean = np.nanmean(user_ratings)
        
        # If user has no ratings, use global mean
        if np.isnan(user_mean):
            user_mean = np.nanmean(ratings_matrix)
        
        # Fill missing values with user mean
        filled_matrix[i, np.isnan(user_ratings)] = user_mean
    
    return filled_matrix

# Fill missing values
filled_ratings = fill_missing_values(ratings_matrix)

print("Before filling:")
print(ratings_matrix[:3, :5].round(1))
print("\nAfter filling:")
print(filled_ratings[:3, :5].round(1))
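
The row-by-row loop in `fill_missing_values` is easy to read but slow on large matrices. As an optional alternative, here is a vectorized sketch with the same intent. The name `fill_missing_vectorized` is ours, and it assumes the matrix contains at least one rating overall. It also avoids the RuntimeWarning that `np.nanmean` emits for a user with no ratings.

```python
import numpy as np

def fill_missing_vectorized(ratings):
    """Fill NaNs with each row's mean; rows with no ratings get the global mean."""
    missing = np.isnan(ratings)
    counts = (~missing).sum(axis=1)                     # ratings per user
    sums = np.where(missing, 0.0, ratings).sum(axis=1)  # sum of observed ratings per user
    global_mean = sums.sum() / counts.sum()             # assumes at least one rating exists
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return np.where(missing, row_means[:, None], ratings)

demo = np.array([[5.0, np.nan, 3.0],
                 [np.nan, np.nan, np.nan],   # user with no ratings at all
                 [np.nan, 2.0, np.nan]])
filled_demo = fill_missing_vectorized(demo)
print(filled_demo.round(2))
```

In the demo, the first user's gap is filled with 4.0 (their own mean), while the user with no ratings falls back to the global mean of 10/3.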

Step 2: Center the Ratings

SVD works better when data is centered (mean = 0).
def center_ratings(ratings_matrix):
    """
    Center ratings by subtracting user means.
    Returns: centered matrix and user means (for later reconstruction)
    """
    user_means = np.mean(ratings_matrix, axis=1, keepdims=True)
    centered = ratings_matrix - user_means
    return centered, user_means.flatten()

centered_ratings, user_means = center_ratings(filled_ratings)

print("Centered ratings (first 3 users, first 5 songs):")
print(centered_ratings[:3, :5].round(2))
print(f"\nUser means (first 5): {user_means[:5].round(2)}")
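
One consequence of filling with the user mean and then centering is worth seeing directly: every originally missing entry becomes exactly 0, so the SVD treats "unknown" as "neither above nor below this user's average". A tiny check on a single made-up row:

```python
import numpy as np

row = np.array([5.0, np.nan, 3.0, np.nan])  # one user's ratings (toy values)

row_mean = np.nanmean(row)                        # mean of the observed ratings
filled_row = np.where(np.isnan(row), row_mean, row)
centered_row = filled_row - filled_row.mean()     # filling does not change the mean

print(centered_row)
```

The observed ratings end up at +1 and -1 around the user's mean of 4, and both missing entries end up at 0.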

Step 3: Apply SVD

Now the magic happens!
def learn_latent_factors(centered_matrix, n_factors=10):
    """
    Learn hidden taste factors using SVD.
    
    Returns:
        U: User latent factors (n_users × n_factors)
        sigma: Singular values (importance of each factor)
        Vt: Song latent factors (n_factors × n_songs)
    """
    # Truncated SVD: computes only the top-k singular triplets
    # (svds requires k < min(centered_matrix.shape))
    U, sigma, Vt = svds(centered_matrix, k=n_factors)
    
    # Sort by singular value (importance) in descending order
    idx = np.argsort(sigma)[::-1]
    U = U[:, idx]
    sigma = sigma[idx]
    Vt = Vt[idx, :]
    
    return U, sigma, Vt

# Try different numbers of factors
for k in [3, 5, 10, 20]:
    U, sigma, Vt = learn_latent_factors(centered_ratings, n_factors=k)
    
    # Reconstruct and measure error on observed ratings
    reconstructed = U @ np.diag(sigma) @ Vt + user_means.reshape(-1, 1)
    
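    # Calculate RMSE on observed ratings only (this is training error, since
    # these same ratings were used to fit the model; Part 6 uses held-out data)
    observed_pred = reconstructed[observed_mask]
    observed_true = ratings_matrix[observed_mask]
    rmse = np.sqrt(mean_squared_error(observed_true, observed_pred))
    
    # Fraction of the centered matrix's total variance captured by these k factors
    var_captured = np.sum(sigma**2) / np.sum(centered_ratings**2)
    
    print(f"Factors: {k:2d} | RMSE: {rmse:.4f} | Variance captured: {var_captured*100:.1f}% | Top 3 singular values: {sigma[:3].round(2)}")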

# Use 10 factors as our final model
n_factors = 10
U, sigma, Vt = learn_latent_factors(centered_ratings, n_factors=n_factors)

print(f"\nFinal model: {n_factors} latent factors")
print(f"   U shape: {U.shape} (user factors)")
print(f"   Vt shape: {Vt.shape} (song factors)")
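
If you want to convince yourself that the truncated SVD really keeps the most important factors, the sketch below compares `svds` against a full `np.linalg.svd` on a small random matrix (toy data, unrelated to the ratings). It checks two things: the top-k singular values agree, and the squared error of the rank-k reconstruction equals the sum of the discarded squared singular values.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))  # small toy matrix
k = 5

# Full SVD: singular values arrive in descending order
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

# Truncated SVD: sort explicitly rather than relying on svds' output order
_, s_trunc, _ = svds(A, k=k)
s_trunc = np.sort(s_trunc)[::-1]

# Rank-k reconstruction from the top-k factors
A_k = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]

# Squared Frobenius error vs. the discarded squared singular values
err = np.linalg.norm(A - A_k, "fro") ** 2
print("Top-k singular values match:", np.allclose(s_trunc, s_full[:k]))
print("Error equals discarded energy:", np.isclose(err, np.sum(s_full[k:] ** 2)))
```

This is why more factors always lower the error on the matrix being decomposed, and why a low training error alone does not tell you the model generalizes.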

Step 4: Interpret the Latent Factors

# Visualize the importance of each factor
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Singular values (importance)
axes[0].bar(range(1, n_factors + 1), sigma)
axes[0].set_xlabel('Factor')
axes[0].set_ylabel('Singular Value')
axes[0].set_title('Importance of Each Latent Factor')

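# Explained variance ratio, relative to the total variance of the centered matrix
# (dividing by np.sum(sigma**2) alone would force the retained factors to sum to 100%)
total_variance = np.sum(centered_ratings**2)
variance_explained = sigma**2 / total_variance
cumulative_var = np.cumsum(variance_explained)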

axes[1].bar(range(1, n_factors + 1), variance_explained, alpha=0.7, label='Individual')
axes[1].plot(range(1, n_factors + 1), cumulative_var, 'ro-', label='Cumulative')
axes[1].axhline(y=0.9, color='green', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Factor')
axes[1].set_ylabel('Variance Explained')
axes[1].set_title('Variance Explained by Each Factor')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nVariance explained by top 5 factors: {cumulative_var[4]*100:.1f}%")
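
The 90% line in the plot suggests a simple rule for picking the number of factors: keep the smallest k whose cumulative variance reaches the threshold. Here is a minimal sketch of that rule on a made-up set of singular values. It assumes you have the full spectrum (for example from `np.linalg.svd`), since a truncated SVD only knows about the factors it kept.

```python
import numpy as np

# Hypothetical full set of singular values, in descending order
singular_values = np.array([10.0, 5.0, 2.0, 1.0])

energy = singular_values ** 2                  # variance carried by each factor
cumulative = np.cumsum(energy) / energy.sum()  # cumulative fraction of variance

# Smallest k whose cumulative variance reaches the 90% threshold
threshold = 0.9
k_needed = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative variance:", cumulative.round(3))
print("Factors needed for 90%:", k_needed)
```

Here the first factor alone explains about 77% and the first two about 96%, so two factors are enough.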

Part 4: Make Recommendations

def predict_rating(user_idx, song_idx, U, sigma, Vt, user_means):
    """Predict a single user-song rating."""
    # Get user and song latent factors
    user_factor = U[user_idx, :] * sigma  # same as U[user_idx, :] @ np.diag(sigma)
    song_factor = Vt[:, song_idx]
    
    # Predicted rating = dot product + user mean
    pred = np.dot(user_factor, song_factor) + user_means[user_idx]
    
    # Clip to valid range
    return np.clip(pred, 1, 5)

def get_recommendations(user_idx, ratings_matrix, U, sigma, Vt, user_means, n_recs=10):
    """
    Get top N song recommendations for a user.
    Only recommends songs the user hasn't rated.
    """
    # Get all unrated songs for this user
    rated_songs = ~np.isnan(ratings_matrix[user_idx, :])
    
    # Predict ratings for all unrated songs
    predictions = []
    for song_idx in range(ratings_matrix.shape[1]):
        if not rated_songs[song_idx]:  # Only unrated songs
            pred = predict_rating(user_idx, song_idx, U, sigma, Vt, user_means)
            predictions.append((song_idx, pred))
    
    # Sort by predicted rating (descending) and return top N
    predictions.sort(key=lambda x: x[1], reverse=True)
    
    return predictions[:n_recs]

# Get recommendations for a sample user
sample_user = 42
print(f"Getting recommendations for User {sample_user}...")
print(f"User has rated {(~np.isnan(ratings_matrix[sample_user, :])).sum()} songs\n")

# Show their existing ratings
existing_ratings = []
for song_idx in range(n_songs):
    if not np.isnan(ratings_matrix[sample_user, song_idx]):
        existing_ratings.append((song_idx, ratings_matrix[sample_user, song_idx]))

existing_ratings.sort(key=lambda x: x[1], reverse=True)
print("User's top-rated songs:")
for song_idx, rating in existing_ratings[:5]:
    print(f"  {songs[song_idx]}: {rating:.1f}/5.0")

# Get recommendations
recommendations = get_recommendations(sample_user, ratings_matrix, U, sigma, Vt, user_means, n_recs=10)

print(f"\nTop 10 Recommended Songs:")
for rank, (song_idx, pred_rating) in enumerate(recommendations, 1):
    true_rating = true_ratings[sample_user, song_idx]  # Our ground truth
    print(f"  {rank}. {songs[song_idx]}: Predicted {pred_rating:.2f}/5.0 (True: {true_rating:.2f})")
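
Calling `predict_rating` once per song is fine for 200 songs, but all of a user's predictions can be computed in one matrix product. The sketch below is an optional vectorized variant; `recommend_vectorized` and the tiny demo factors are ours, chosen so the result is easy to verify by hand.

```python
import numpy as np

def recommend_vectorized(user_idx, ratings_matrix, U, sigma, Vt, user_means, n_recs=10):
    """Score every song for one user at once, then keep the top unrated ones."""
    scores = (U[user_idx] * sigma) @ Vt + user_means[user_idx]
    scores = np.clip(scores, 1, 5)
    scores[~np.isnan(ratings_matrix[user_idx])] = -np.inf  # mask already-rated songs
    top = np.argsort(scores)[::-1][:n_recs]
    return [(int(j), float(scores[j])) for j in top if np.isfinite(scores[j])]

# Toy factors: 2 users, 3 songs, 2 latent factors
U_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
sigma_demo = np.array([2.0, 1.0])
Vt_demo = np.array([[0.5, -0.5, 0.25], [0.0, 1.0, -1.0]])
means_demo = np.array([3.0, 3.0])
ratings_demo = np.array([[4.0, np.nan, np.nan], [np.nan, np.nan, 2.0]])

recs_demo = recommend_vectorized(0, ratings_demo, U_demo, sigma_demo, Vt_demo, means_demo, n_recs=2)
print(recs_demo)
```

For user 0 the raw scores are 4.0, 2.0, and 3.5. Song 0 is already rated, so the recommendations are song 2 (3.5) followed by song 1 (2.0).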

Part 5: Find Similar Users and Songs

from sklearn.metrics.pairwise import cosine_similarity

def find_similar_users(user_idx, U, n_similar=5):
    """Find users with similar taste profiles."""
    # User factors encode taste preferences
    user_vector = U[user_idx, :].reshape(1, -1)
    
    # Calculate similarity with all other users
    similarities = cosine_similarity(user_vector, U)[0]
    
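    # Get top similar users, explicitly excluding the query user
    # (relying on position 0 being "self" can fail when similarities tie)
    order = np.argsort(similarities)[::-1]
    similar_indices = order[order != user_idx][:n_similar]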
    
    return [(idx, similarities[idx]) for idx in similar_indices]

def find_similar_songs(song_idx, Vt, n_similar=5):
    """Find songs with similar characteristics."""
    # Song factors encode song characteristics
    song_vector = Vt[:, song_idx].reshape(1, -1)
    
    # Calculate similarity with all other songs
    similarities = cosine_similarity(song_vector, Vt.T)[0]
    
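    # Get top similar songs, explicitly excluding the query song
    # (relying on position 0 being "self" can fail when similarities tie)
    order = np.argsort(similarities)[::-1]
    similar_indices = order[order != song_idx][:n_similar]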
    
    return [(idx, similarities[idx]) for idx in similar_indices]

# Find similar users
sample_user = 42
similar_users = find_similar_users(sample_user, U, n_similar=5)

print(f"Users similar to User {sample_user}:")
for user_idx, similarity in similar_users:
    print(f"  User {user_idx}: {similarity:.3f} similarity")

# Find similar songs
sample_song = 15
similar_songs = find_similar_songs(sample_song, Vt, n_similar=5)

print(f"\nSongs similar to Song {sample_song}:")
for song_idx, similarity in similar_songs:
    print(f"  Song {song_idx}: {similarity:.3f} similarity")
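
If cosine similarity feels like a black box, it helps to compute it by hand once. The sketch below uses three made-up 2-factor vectors to show that cosine similarity compares direction only: a vector and a scaled copy of it are maximally similar, while perpendicular vectors score 0.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # same direction as a, twice the length
c = np.array([-2.0, 1.0])  # perpendicular to a

print(f"cosine(a, b) = {cosine(a, b):.3f}")
print(f"cosine(a, c) = {cosine(a, c):.3f}")
```

In recommender terms, two users with the same pattern of tastes count as similar even if one of them has much stronger preferences overall.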

Part 6: Evaluate the Model

from sklearn.model_selection import train_test_split

def evaluate_recommendations(ratings_matrix, n_factors=10, test_fraction=0.2):
    """
    Evaluate recommendation quality using held-out ratings.
    """
    # Get indices of all observed ratings
    observed_indices = np.argwhere(~np.isnan(ratings_matrix))
    
    # Split into train/test
    train_idx, test_idx = train_test_split(
        range(len(observed_indices)), 
        test_size=test_fraction, 
        random_state=42
    )
    
    # Create train/test masks
    train_mask = np.zeros_like(ratings_matrix, dtype=bool)
    test_mask = np.zeros_like(ratings_matrix, dtype=bool)
    
    for i in train_idx:
        r, c = observed_indices[i]
        train_mask[r, c] = True
    
    for i in test_idx:
        r, c = observed_indices[i]
        test_mask[r, c] = True
    
    # Create training matrix (hide test ratings)
    train_matrix = ratings_matrix.copy()
    train_matrix[test_mask] = np.nan
    
    # Train model on training data
    filled = fill_missing_values(train_matrix)
    centered, means = center_ratings(filled)
    U, sigma, Vt = learn_latent_factors(centered, n_factors)
    
    # Predict on test set
    predictions = []
    actuals = []
    
    for i in range(len(observed_indices)):
        r, c = observed_indices[i]
        if test_mask[r, c]:
            pred = predict_rating(r, c, U, sigma, Vt, means)
            predictions.append(pred)
            actuals.append(ratings_matrix[r, c])
    
    predictions = np.array(predictions)
    actuals = np.array(actuals)
    
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    mae = np.mean(np.abs(predictions - actuals))
    
    return rmse, mae, predictions, actuals

# Evaluate
rmse, mae, preds, acts = evaluate_recommendations(ratings_matrix, n_factors=10)

print(f"Evaluation Results:")
print(f"   RMSE: {rmse:.4f}")
print(f"   MAE:  {mae:.4f}")

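# Compare with a baseline that predicts one constant for every pair
# (here the mean of the held-out ratings, which slightly favors the baseline)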
baseline_pred = np.mean(acts)
baseline_rmse = np.sqrt(mean_squared_error(acts, np.full_like(acts, baseline_pred)))
print(f"   Baseline RMSE (predict mean): {baseline_rmse:.4f}")
print(f"   Improvement: {(baseline_rmse - rmse) / baseline_rmse * 100:.1f}%")

# Visualize predictions vs actuals
plt.figure(figsize=(10, 5))
plt.scatter(acts, preds, alpha=0.3, s=10)
plt.plot([1, 5], [1, 5], 'r--', label='Perfect prediction')
plt.xlabel('Actual Rating')
plt.ylabel('Predicted Rating')
plt.title('Predicted vs Actual Ratings')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
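
RMSE and MAE often move together, but they are not interchangeable. The small example below (four made-up predictions, one of them badly wrong) shows how a single large miss pulls RMSE up more than MAE.

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.5, 4.5, 4.5])  # last prediction is off by 2.5

errors = predicted - actual
demo_mae = np.mean(np.abs(errors))          # average absolute miss
demo_rmse = np.sqrt(np.mean(errors ** 2))   # squares errors before averaging

print(f"MAE:  {demo_mae:.4f}")
print(f"RMSE: {demo_rmse:.4f}")
```

MAE is 1.0 while RMSE is about 1.32. If occasional large misses matter more to you than many small ones, RMSE is the metric that reflects that.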

Part 7: Putting It All Together

class MusicRecommender:
    """
    A complete music recommendation system using SVD.
    """
    
    def __init__(self, n_factors=10):
        self.n_factors = n_factors
        self.U = None
        self.sigma = None
        self.Vt = None
        self.user_means = None
        self.ratings_matrix = None
        self.users = None
        self.songs = None
    
    def fit(self, ratings_df):
        """Train the recommender on a ratings DataFrame."""
        self.users = ratings_df.index.tolist()
        self.songs = ratings_df.columns.tolist()
        self.ratings_matrix = ratings_df.values
        
        # Fill, center, and decompose
        filled = fill_missing_values(self.ratings_matrix)
        centered, self.user_means = center_ratings(filled)
        self.U, self.sigma, self.Vt = learn_latent_factors(centered, self.n_factors)
        
        print(f"Trained on {len(self.users)} users x {len(self.songs)} songs")
        print(f"   Using {self.n_factors} latent factors")
        
        return self
    
    def predict(self, user, song):
        """Predict a user's rating for a song."""
        user_idx = self.users.index(user)
        song_idx = self.songs.index(song)
        return predict_rating(user_idx, song_idx, self.U, self.sigma, self.Vt, self.user_means)
    
    def recommend(self, user, n=10):
        """Get top N recommendations for a user."""
        user_idx = self.users.index(user)
        recs = get_recommendations(user_idx, self.ratings_matrix, 
                                   self.U, self.sigma, self.Vt, self.user_means, n)
        return [(self.songs[idx], rating) for idx, rating in recs]
    
    def similar_users(self, user, n=5):
        """Find similar users."""
        user_idx = self.users.index(user)
        similar = find_similar_users(user_idx, self.U, n)
        return [(self.users[idx], sim) for idx, sim in similar]
    
    def similar_songs(self, song, n=5):
        """Find similar songs."""
        song_idx = self.songs.index(song)
        similar = find_similar_songs(song_idx, self.Vt, n)
        return [(self.songs[idx], sim) for idx, sim in similar]

# Usage example
recommender = MusicRecommender(n_factors=10)
recommender.fit(ratings_df)

print("\n" + "="*50)
print("MUSIC RECOMMENDATION ENGINE")
print("="*50)

# Demo
user = "user_42"
print(f"\nRecommendations for {user}:")
for song, rating in recommender.recommend(user, n=5):
    print(f"  {song}: {rating:.2f}/5.0")

print(f"\nSimilar users to {user}:")
for other_user, similarity in recommender.similar_users(user, n=3):
    print(f"  {other_user}: {similarity:.3f} similarity")

song = "song_15"
print(f"\nSimilar songs to {song}:")
for other_song, similarity in recommender.similar_songs(song, n=3):
    print(f"  {other_song}: {similarity:.3f} similarity")

Challenges

Problem: A new user with no ratings joins. How do you recommend songs?
Hints:
  • Use content-based features (genre, tempo, etc.)
  • Ask for initial preferences
  • Recommend popular songs initially
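
As a starting point for the last hint, here is a minimal sketch of a popularity fallback. The function name `popular_songs` and the `min_ratings` cutoff are our own choices; the cutoff keeps a song with a single 5-star rating from topping the list.

```python
import numpy as np

def popular_songs(ratings, n=3, min_ratings=2):
    """Rank songs by mean rating, ignoring songs with too few ratings."""
    counts = (~np.isnan(ratings)).sum(axis=0)  # ratings per song
    sums = np.nansum(ratings, axis=0)          # sum of observed ratings per song
    means = np.where(counts >= min_ratings, sums / np.maximum(counts, 1), -np.inf)
    order = np.argsort(means)[::-1]
    return [(int(j), float(means[j])) for j in order[:n] if np.isfinite(means[j])]

toy_ratings = np.array([[5.0, np.nan, 3.0, np.nan],
                        [4.0, 2.0, np.nan, 5.0],
                        [np.nan, 3.0, 4.0, np.nan]])
top_popular = popular_songs(toy_ratings, n=2)
print(top_popular)
```

Song 3 has the highest single rating but only one vote, so it is skipped; the fallback list is song 0 (mean 4.5) and then song 2 (mean 3.5).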
Problem: When a user rates a new song, how do you update recommendations without retraining the entire model?
Hints:
  • Incremental SVD updates
  • Online matrix factorization
  • Approximate nearest neighbor methods
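
One simple approach in this family is "folding in": keep the song factors fixed and project the user's centered ratings onto them. The sketch below is an approximation, not a full incremental SVD. It assumes the rows of `Vt` are orthonormal (true for SVD output) and that the user has at least one rating; `fold_in_user` and the toy `Vt_toy` are ours.

```python
import numpy as np

def fold_in_user(new_ratings, Vt):
    """Project a user's ratings onto existing song factors (no retraining)."""
    observed = ~np.isnan(new_ratings)
    user_mean = new_ratings[observed].mean()
    # Center observed ratings; unrated songs become 0, as in the fill + center steps
    centered = np.where(observed, new_ratings - user_mean, 0.0)
    coords = centered @ Vt.T         # position in latent space (plays the role of U * sigma)
    preds = coords @ Vt + user_mean  # map back to rating space
    return np.clip(preds, 1, 5), coords

# Toy song factors: songs 0 and 1 load on the same factor, song 2 on its own
Vt_toy = np.array([[1 / np.sqrt(2), 1 / np.sqrt(2), 0.0],
                   [0.0, 0.0, 1.0]])
new_user = np.array([5.0, np.nan, 3.0])  # has not rated song 1

preds, coords = fold_in_user(new_user, Vt_toy)
print(preds.round(2))
```

Because songs 0 and 1 share a factor, the user's high rating for song 0 lifts the prediction for the unrated song 1 to 4.5. Since the song factors never change, you would still retrain periodically.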
Problem: Most users don’t rate songs explicitly. They just play or skip. How do you handle this?
Hints:
  • Play count as implicit rating
  • Session-based recommendations
  • Weighted matrix factorization for implicit feedback
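
A common way to set up weighted matrix factorization for implicit feedback is to split play counts into two matrices: a binary preference (did the user play it at all?) and a confidence weight that grows with the play count. The sketch below only builds those two inputs and does not run the factorization. The helper name `implicit_targets` and the value of `alpha` are hypothetical choices for illustration.

```python
import numpy as np

def implicit_targets(play_counts, alpha=2.0):
    """Turn play counts into (preference, confidence) matrices."""
    preference = (play_counts > 0).astype(float)  # 1 if played at least once
    confidence = 1.0 + alpha * play_counts        # more plays, more trust in the signal
    return preference, confidence

plays = np.array([[0, 3],
                  [10, 0]])
preference, confidence = implicit_targets(plays)

print(preference)
print(confidence)
```

Unplayed songs keep a small confidence of 1 rather than being treated as missing, which encodes "probably not interested, but we are not sure". A weighted factorization would then fit the preference matrix while penalizing errors in proportion to the confidence.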

Summary

Concept                What You Learned
Matrix Factorization   Decompose ratings into user×song latent factors
SVD                    The mathematical engine behind recommendations
Latent Factors         Hidden dimensions that explain preferences
Cold Start             Handling new users/items without history
Evaluation             RMSE, MAE, and train/test splits
Congratulations! You’ve built a working recommendation engine using the same fundamental math that underlies recommendation systems at companies like Netflix, Spotify, and Amazon. The concepts you’ve learned (matrix factorization, similarity measures, and latent factors) are a foundation of modern personalization systems.