Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Matrices & Linear Transformations

Matrices & Linear Transformations

A Problem You Already Understand: Calculating Final Grades

You’re a teacher with a spreadsheet of student data:
StudentHomework (40%)Midterm (25%)Final (35%)
Alice928885
Bob788290
Carol959092
Question: What’s each student’s final grade? You already know how to do this in Excel:
=0.40*B2 + 0.25*C2 + 0.35*D2
For Alice: 0.40×92 + 0.25×88 + 0.35×85 = 36.8 + 22 + 29.75 = 88.55 Congratulations — you just did matrix multiplication! Grade Calculation as Matrix Math
Estimated Time: 4-5 hours
Difficulty: Beginner to Intermediate
Prerequisites: Vectors module
What You’ll Build: Grade calculator, photo filter app, and a simple prediction model

What You Just Did (Mathematically)

Let’s write that Excel formula as math: Matrix Transformation Math Concept
import numpy as np

# Student scores as a vector
alice = np.array([92, 88, 85])  # [homework, midterm, final]

# Weights as a vector  
weights = np.array([0.40, 0.25, 0.35])

# Final grade = weighted sum
final_grade = np.dot(weights, alice)
# = 0.40×92 + 0.25×88 + 0.35×85
# = 88.55

print(f"Alice's grade: {final_grade:.1f}")
The dot product IS the weighted sum. You’ve been doing “matrix math” in Excel for years!

Scaling Up: All Students at Once

What if you have 100 students? You don’t want to calculate one by one.
# All students in a matrix (each row = one student)
scores = np.array([
    [92, 88, 85],   # Alice
    [78, 82, 90],   # Bob  
    [95, 90, 92],   # Carol
])

# Weights
weights = np.array([0.40, 0.25, 0.35])

# Calculate ALL final grades at once!
final_grades = scores @ weights  # Matrix multiplication

print("Final grades:")
print(f"  Alice: {final_grades[0]:.1f}")
print(f"  Bob: {final_grades[1]:.1f}")
print(f"  Carol: {final_grades[2]:.1f}")
Output:
Final grades:
  Alice: 88.6
  Bob: 83.2
  Carol: 92.2
That’s matrix multiplication: applying the same weighted sum to every row, all at once.

What Is a Matrix?

Now let’s get formal.

A Matrix is a Table of Numbers

# A 3×3 matrix (3 rows, 3 columns)
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(A.shape)  # (3, 3)
Mathematical notation: An m×nm \times n matrix has mm rows and nn columns: A=[a11a12a1na21a22a2nam1am2amn]A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} Where aija_{ij} is the element at row ii, column jj.

Matrix Operations: The Complete Toolkit

Matrix Addition

Add matrices of the same size element-by-element: A+B=[a11a12a21a22]+[b11b12b21b22]=[a11+b11a12+b12a21+b21a22+b22]A + B = \begin{bmatrix}a_{11} & a_{12}\\a_{21} & a_{22}\end{bmatrix} + \begin{bmatrix}b_{11} & b_{12}\\b_{21} & b_{22}\end{bmatrix} = \begin{bmatrix}a_{11}+b_{11} & a_{12}+b_{12}\\a_{21}+b_{21} & a_{22}+b_{22}\end{bmatrix}
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A + B
print(C)
# [[6, 8], [10, 12]]

Scalar Multiplication

Multiply every element by a number: cA=c[a11a12a21a22]=[ca11ca12ca21ca22]cA = c \begin{bmatrix}a_{11} & a_{12}\\a_{21} & a_{22}\end{bmatrix} = \begin{bmatrix}c \cdot a_{11} & c \cdot a_{12}\\c \cdot a_{21} & c \cdot a_{22}\end{bmatrix}
A = np.array([[1, 2], [3, 4]])
print(3 * A)
# [[3, 6], [9, 12]]

Matrix Transpose

Flip rows and columns (swap aija_{ij} with ajia_{ji}): AT=[123456]T=[142536]A^T = \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6\end{bmatrix}^T = \begin{bmatrix}1 & 4\\2 & 5\\3 & 6\end{bmatrix}
A = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Original shape: {A.shape}")    # (2, 3)
print(f"Transposed shape: {A.T.shape}") # (3, 2)
print(A.T)
# [[1, 4], [2, 5], [3, 6]]
Key properties:
  • (AT)T=A(A^T)^T = A (transpose twice = original)
  • (A+B)T=AT+BT(A + B)^T = A^T + B^T
  • (AB)T=BTAT(AB)^T = B^T A^T (note the reversed order!)

Matrix Multiplication

This is the most important operation! For C=ABC = AB: Cij=k=1nAikBkj=(row i of A)(column j of B)C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} = \text{(row } i \text{ of } A) \cdot \text{(column } j \text{ of } B) The rule: (row) × (column) = one number, repeated for every position.
A = np.array([[1, 2], [3, 4]])  # 2×2
B = np.array([[5, 6], [7, 8]])  # 2×2

C = A @ B  # or np.matmul(A, B) or np.dot(A, B)
print(C)
# [[19, 22], [43, 50]]

# Let's verify C[0,0]:
# Row 0 of A: [1, 2]
# Col 0 of B: [5, 7]
# Dot product: 1×5 + 2×7 = 5 + 14 = 19 ✓
Worked example — step by step: [1234]×[5678]=[(1)(5)+(2)(7)(1)(6)+(2)(8)(3)(5)+(4)(7)(3)(6)+(4)(8)]=[19224350]\begin{bmatrix}1 & 2\\3 & 4\end{bmatrix} \times \begin{bmatrix}5 & 6\\7 & 8\end{bmatrix} = \begin{bmatrix}(1)(5)+(2)(7) & (1)(6)+(2)(8)\\(3)(5)+(4)(7) & (3)(6)+(4)(8)\end{bmatrix} = \begin{bmatrix}19 & 22\\43 & 50\end{bmatrix}
Matrix multiplication is NOT commutative! ABBAAB \neq BA in general.
print(A @ B)  # [[19, 22], [43, 50]]
print(B @ A)  # [[23, 34], [31, 46]] - Different!
This makes intuitive sense if you think of matrices as transformations: rotating then scaling is different from scaling then rotating. The order of operations matters. In neural networks, this is why the order of layers matters — swapping two layers changes the network’s behavior completely.
Dimension rule: For ABAB to work, columns of AA must equal rows of BB:
  • (m×n)×(n×p)=(m×p)(m \times n) \times (n \times p) = (m \times p)
  • The inner dimensions must match!
A = np.array([[1, 2, 3], [4, 5, 6]])  # 2×3
B = np.array([[1], [2], [3]])          # 3×1

C = A @ B  # (2×3) × (3×1) = (2×1)
print(C.shape)  # (2, 1)

The Identity Matrix

The identity matrix II is like “1” for matrices — multiplying by it changes nothing: I=[100010001],AI=IA=AI = \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0\\0 & 0 & 1\end{bmatrix}, \quad AI = IA = A
I = np.eye(3)  # 3×3 identity matrix
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.allclose(A @ I, A))  # True
print(np.allclose(I @ A, A))  # True

Matrix Inverse

The inverse A1A^{-1} “undoes” multiplication by AA: AA1=A1A=IAA^{-1} = A^{-1}A = I If AA represents a transformation (like rotating by 30 degrees), then A1A^{-1} is the reverse transformation (rotating by -30 degrees). If AA blurs an image, A1A^{-1} would “unblur” it (in theory — in practice, noise makes this extremely difficult, which is why image deblurring is a hard problem). For a 2x2 matrix: A=[abcd],A1=1adbc[dbca]A = \begin{bmatrix}a & b\\c & d\end{bmatrix}, \quad A^{-1} = \frac{1}{ad-bc}\begin{bmatrix}d & -b\\-c & a\end{bmatrix} The term adbcad - bc is called the determinant. If it is zero, no inverse exists! Geometrically, this means the transformation squashed 2D space into a line (or a point) — you cannot unsquash it because information was destroyed.
Practical Tip: In ML code, almost never compute a matrix inverse explicitly with np.linalg.inv(A). It is numerically unstable and slow. Instead, solve the system directly: replace np.linalg.inv(A) @ b with np.linalg.solve(A, b). This is faster, more accurate, and handles near-singular cases more gracefully.
A = np.array([[4, 7], [2, 6]])

# Determinant
det = 4*6 - 7*2  # = 10

# Inverse
A_inv = np.linalg.inv(A)
print(A_inv)
# [[ 0.6, -0.7], [-0.2,  0.4]]

# Verify: A @ A_inv = I
print(np.allclose(A @ A_inv, np.eye(2)))  # True

Determinant

The determinant measures how much a matrix “scales” area (or volume). This is one of those concepts that becomes deeply intuitive once you see it geometrically. Imagine you have a unit square (1x1) on graph paper. When you apply a matrix transformation to every corner of that square, it becomes a parallelogram. The absolute value of the determinant tells you the area of that parallelogram. If the determinant is 6, the area grew 6x. If it is 0.5, the area shrank by half. If it is 0, the square collapsed into a line or a point — all the information in one direction was destroyed, which is why you cannot invert the transformation. 2x2 determinant: det[abcd]=adbc\det\begin{bmatrix}a & b\\c & d\end{bmatrix} = ad - bc Properties:
  • det(I)=1\det(I) = 1 — the identity does not change area
  • det(AB)=det(A)det(B)\det(AB) = \det(A) \cdot \det(B) — composing transformations multiplies their scaling factors
  • If det(A)=0\det(A) = 0, the matrix is “singular” (no inverse) — it squashes space into a lower dimension
  • det(A)|\det(A)| = factor by which area is scaled
  • If det(A)<0\det(A) < 0, the transformation flips orientation (like looking in a mirror)
A = np.array([[2, 0], [0, 3]])
print(np.linalg.det(A))  # 6.0

# This matrix scales area by 6x (2× in x-direction, 3× in y-direction)

Common 2D Transformations: A Visual Catalog

Every 2x2 matrix represents a geometric transformation of the plane. Here is a reference of the most important ones and what they look like when applied to a unit square with corners at (0,0), (1,0), (0,1), and (1,1).
TransformationMatrixWhat It Does GeometricallyML Connection
Identity[1001]\begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}Nothing — the “do nothing” transformSkip connections in ResNets
Scaling[2003]\begin{bmatrix}2 & 0\\0 & 3\end{bmatrix}Stretches x by 2, y by 3 — unit square becomes 2x3 rectangleFeature scaling, batch normalization
Rotation (45 degrees)[0.710.710.710.71]\begin{bmatrix}0.71 & -0.71\\0.71 & 0.71\end{bmatrix}Rotates everything counterclockwise — square tilts 45 degreesData augmentation in image models
Reflection (y-axis)[1001]\begin{bmatrix}-1 & 0\\0 & 1\end{bmatrix}Mirrors across y-axis — square flips left-to-rightImage augmentation (horizontal flip)
Shear[10.501]\begin{bmatrix}1 & 0.5\\0 & 1\end{bmatrix}Tilts verticals — square becomes parallelogramItalic text rendering, data augmentation
Projection[1000]\begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}Collapses onto x-axis — square becomes line segmentDimensionality reduction, PCA
Singular (rank 0)[0000]\begin{bmatrix}0 & 0\\0 & 0\end{bmatrix}Crushes everything to the originDead neurons (all-zero weights)
Geometric picture of key transformations on a unit square:

Original:      Rotation:       Shear:          Projection:
  +--+           /\            +---+            +--+
  |  |          /  \          /   /             (line on x-axis)
  +--+         /    \        +---+
              (tilted)      (slanted)
Edge Case — Singular Transformations: When a matrix has determinant zero (like the projection matrix above), it irreversibly destroys information. The unit square collapses to a line or a point. No inverse exists because you cannot “un-collapse” a line back into a square — you have lost the information about where each point was in the collapsed dimension. In neural networks, a weight matrix that becomes near-singular during training effectively kills an entire dimension of representation, which is one cause of “dying neurons.”

A Matrix is a Transformation Machine

Here’s the key insight that separates people who use matrices from people who understand them: A matrix is a function. It takes a vector in and spits a different vector out. Think of a matrix like a machine in a factory. You feed in raw material (the input vector), the machine processes it according to its fixed internal rules (the matrix entries), and out comes a transformed product (the output vector). Different machines (matrices) produce different outputs from the same input.
Input Vector  →  [Matrix]  →  Output Vector
   (3 numbers)               (could be different size!)
Every matrix multiplication is a transformation — it rotates, stretches, squishes, reflects, or projects your data. Understanding which transformation a matrix performs is the key to debugging ML models and understanding what neural network layers actually do. Example: Our grade calculation
[92, 88, 85]  →  [0.40, 0.25, 0.35]  →  88.55
   3 numbers         weights           1 number
The matrix (our weights) transformed 3 scores into 1 grade.

Real-World Example: Photo Filters

Ever wonder how Instagram filters work? They’re matrix operations! Photo Filter as Matrix Math

Brightness: Multiply Every Pixel

from PIL import Image
import numpy as np

# Load an image (each pixel is [R, G, B] from 0-255)
image = np.array(Image.open('photo.jpg'))
print(image.shape)  # e.g., (1080, 1920, 3) - height × width × RGB

# Increase brightness by 20%
brighter = image * 1.2
brighter = np.clip(brighter, 0, 255)  # Keep values in valid range

# That's scalar multiplication - a type of matrix operation!

Color Transformation: Matrix Multiplication

What if you want to make a photo look “warmer” (more red/yellow) or “cooler” (more blue)?
# A color transformation matrix
# Each row says how to calculate new [R, G, B] from old [R, G, B]
warm_filter = np.array([
    [1.2, 0.1, 0.0],   # New R = 1.2×R + 0.1×G + 0×B (boost red)
    [0.0, 1.1, 0.0],   # New G = 0×R + 1.1×G + 0×B (slight boost)
    [0.0, 0.0, 0.9],   # New B = 0×R + 0×G + 0.9×B (reduce blue)
])

# Apply to one pixel
original_pixel = np.array([100, 150, 200])  # Some blue-ish color
new_pixel = warm_filter @ original_pixel
print(f"Original: {original_pixel}")
print(f"Warmer: {new_pixel.astype(int)}")

# Output:
# Original: [100 150 200]
# Warmer: [135 165 180]  ← More red, less blue!

Grayscale: Average the Colors

# Grayscale transformation
# Human eyes are more sensitive to green, so we weight it higher
grayscale_weights = np.array([0.299, 0.587, 0.114])

pixel = np.array([100, 150, 200])
gray_value = np.dot(grayscale_weights, pixel)
print(f"Gray value: {gray_value:.0f}")  # 143
Every Instagram filter is a matrix transformation on your pixels! This is not a simplification — it is literally what happens. The “warmth” slider adjusts entries in a 3x3 color transformation matrix. The “contrast” slider scales the matrix. The “vintage” filter chains several matrix multiplications together. When you swipe through 20 filters in a second, your phone is performing 20 different matrix multiplications on millions of pixels. The reason it is fast? Matrix multiplication is embarrassingly parallelizable, and your phone’s GPU is designed specifically for it.

Real-World Example: Predicting House Prices

Now let’s apply this to prediction — the core of machine learning.

The Setup

You have data about houses:
HouseBedroomsSqftAge (years)Price
A3200010$350k
B425005$450k
C2120020$250k
Question: Given a NEW house with 3 bedrooms, 1800 sqft, 15 years old — what’s the predicted price?

The Approach: Weighted Sum of Features

Just like grades! Each feature contributes to the price:
# New house features
new_house = np.array([3, 1800, 15])

# Weights (how much each feature matters)
# These would be "learned" from data in real ML
weights = np.array([
    50000,   # Each bedroom adds $50k
    150,     # Each sqft adds $150
    -3000,   # Each year of age subtracts $3k
])

# Prediction
predicted_price = np.dot(weights, new_house)
# = 50000×3 + 150×1800 + (-3000)×15
# = 150000 + 270000 - 45000
# = $375,000

print(f"Predicted price: ${predicted_price:,}")

Predicting Many Houses at Once

# Multiple houses
houses = np.array([
    [3, 2000, 10],   # House 1
    [4, 2500, 5],    # House 2
    [2, 1200, 20],   # House 3
    [3, 1800, 15],   # House 4 (our query)
])

# Predict ALL prices at once
prices = houses @ weights

print("Predicted prices:")
for i, price in enumerate(prices, 1):
    print(f"  House {i}: ${price:,}")
Output:
Predicted prices:
  House 1: $420,000
  House 2: $560,000
  House 3: $280,000
  House 4: $375,000
This is linear regression — and it’s just matrix multiplication!

The Connection to Machine Learning

Now you understand the core operation of ML. Let’s make the connection explicit.

What a Neural Network Layer Does

Every layer in a neural network does exactly what we just did:
# Simplified neural network layer
def neural_network_layer(input_vector, weights, bias):
    """
    This is literally what PyTorch/TensorFlow do!
    """
    output = np.dot(weights, input_vector) + bias
    return output

# Example: 4 input features → 2 output neurons
input_features = np.array([3, 2000, 15, 5])  # House features

# Weight matrix: 2 outputs × 4 inputs
weights = np.array([
    [50000, 150, -3000, -5000],    # Weights for output 1
    [30000, 100, -2000, -3000],    # Weights for output 2
])

biases = np.array([10000, 5000])

# Forward pass
output = neural_network_layer(input_features, weights, biases)
print(f"Layer output: {output}")
# Two output values, computed by two different weighted sums

The Pattern

ApplicationInputWeightsOutput
Grade calculation[homework, midterm, final][0.4, 0.25, 0.35]final_grade
House pricing[beds, sqft, age][50k, 150, -3k]price
Photo filter[R, G, B]3×3 color matrix[R’, G’, B’]
Neural networkfeatureslearned weightspredictions
The math is identical. Only the application changes.
final_grade = np.dot(weights, student)
# = (2.5×12) + (0.3×85) + (1.5×8) + (0.2×90)
# = 30 + 25.5 + 12 + 18
# = 85.5

print(f"Predicted final grade: {final_grade:.1f}")

Batch Prediction for Entire Class

# All students in class
students = np.array([
    [12, 85, 8, 90],   # Student A
    [8, 70, 6, 75],    # Student B
    [15, 95, 10, 95],  # Student C
])

# Predict all grades at once
grades = students @ weights
print(grades)  # [85.5, 63.5, 102.5]
Real application: Learning management systems use this to predict which students need help!

Example 3: Movie Rating Prediction

The Model

# Movie features
movie = np.array([
    0.9,    # action score (0-1)
    0.1,    # romance score
    0.3,    # comedy score
    8.5,    # IMDb rating
    2020    # release year
])

# User preferences (weights)
user_weights = np.array([
    5,      # Loves action (+5 points)
    -2,     # Dislikes romance (-2 points)
    3,      # Likes comedy (+3 points)
    0.5,    # Considers IMDb rating
    -0.001  # Slight preference for newer movies
])

# Predict user's rating
predicted_rating = np.dot(user_weights, movie)
# = (5×0.9) + (-2×0.1) + (3×0.3) + (0.5×8.5) + (-0.001×2020)
# = 4.5 - 0.2 + 0.9 + 4.25 - 2.02
# = 7.43

print(f"Predicted rating: {predicted_rating:.1f}/10")

Recommend Movies to User

# Movie database
movies = np.array([
    [0.9, 0.1, 0.3, 8.5, 2020],  # Action movie
    [0.2, 0.9, 0.1, 7.5, 2019],  # Romance movie
    [0.7, 0.2, 0.8, 8.0, 2021],  # Action-comedy
])

# Predict ratings for all movies
ratings = movies @ user_weights
print(ratings)  # [7.43, 2.73, 8.88]

# Recommend highest-rated movie
best_movie = np.argmax(ratings)
print(f"Recommended: Movie {best_movie}")  # Movie 2 (action-comedy)
Real application: Netflix uses matrices to predict your ratings for millions of movies!

Matrix Operations

1. Matrix Addition

When to use: Combining multiple models, updating parameters
# Two prediction models
model_1 = np.array([[1, 2], [3, 4]])
model_2 = np.array([[5, 6], [7, 8]])

# Ensemble: average the models
ensemble = (model_1 + model_2) / 2
print(ensemble)
# [[3, 4],
#  [5, 6]]

2. Scalar Multiplication

When to use: Scaling predictions, learning rates
# Original weights
weights = np.array([[50000, 120, -5000, -8000]])

# Scale down for regularization
scaled_weights = 0.9 * weights

3. Matrix Multiplication

The Most Important Operation! Rule: (m×n) matrix × (n×p) matrix = (m×p) matrix The inner dimensions must match!
# Example: Neural network layer
# Input: 3 features → Output: 2 neurons

# Weight matrix (2 neurons × 3 features)
W = np.array([
    [0.5, -0.3, 0.8],  # Weights for neuron 1
    [0.2, 0.6, -0.4]   # Weights for neuron 2
])

# Input vector (3 features)
x = np.array([1.0, 2.0, 3.0])

# Output (2 neurons)
y = W @ x
# Neuron 1: (0.5×1.0) + (-0.3×2.0) + (0.8×3.0) = 2.3
# Neuron 2: (0.2×1.0) + (0.6×2.0) + (-0.4×3.0) = 0.2

print(y)  # [2.3, 0.2]
Neural Network Layer Key Insight: Every neural network layer is just matrix multiplication!

Matrix Multiplication: Three Examples

Example 1: Multi-Output House Prediction

Predict multiple outputs from house features:
# House features
house = np.array([3, 2000, 15, 5])

# Weight matrix (3 outputs × 4 features)
W = np.array([
    [50000, 120, -5000, -8000],   # Price prediction
    [0.8, 0.0002, -0.01, -0.02],  # Desirability score (0-1)
    [2, 0.001, -0.5, -0.3]        # Market time (months)
])

# Predict all outputs
outputs = W @ house
print(f"Price: ${outputs[0]:,.0f}")
print(f"Desirability: {outputs[1]:.2f}")
print(f"Time to sell: {outputs[2]:.1f} months")
Output:
Price: $275,000
Desirability: 0.73
Time to sell: 2.8 months

Example 2: Student Performance Across Subjects

Predict grades in multiple subjects:
# Student features
student = np.array([12, 85, 8, 90])  # [study_hours, prev_score, assignments, attendance]

# Weight matrix (3 subjects × 4 features)
W = np.array([
    [2.5, 0.3, 1.5, 0.2],   # Math
    [2.0, 0.4, 1.0, 0.3],   # English
    [3.0, 0.2, 2.0, 0.1]    # Science
])

# Predict all subject grades
grades = W @ student
print(f"Math: {grades[0]:.1f}")
print(f"English: {grades[1]:.1f}")
print(f"Science: {grades[2]:.1f}")

Example 3: Multi-User Movie Recommendations

Predict ratings for multiple users:
# Movie features
movie = np.array([0.9, 0.1, 0.3, 8.5, 2020])

# User preference matrix (3 users × 5 features)
U = np.array([
    [5, -2, 3, 0.5, -0.001],    # User 1: loves action
    [-1, 5, 2, 0.3, -0.002],    # User 2: loves romance
    [3, 1, 4, 0.4, 0]           # User 3: balanced
])

# Predict ratings for all users
ratings = U @ movie
print(f"User 1 rating: {ratings[0]:.1f}")
print(f"User 2 rating: {ratings[1]:.1f}")
print(f"User 3 rating: {ratings[2]:.1f}")

Matrix Transpose

Definition: Flip rows and columns
A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

A_T = A.T
print(A_T)
# [[1, 4],
#  [2, 5],
#  [3, 6]]
Why it matters: Used in backpropagation!
# Forward pass
X = np.array([[1, 2, 3]])  # (1×3) input
W = np.array([[0.5], [0.3], [0.8]])  # (3×1) weights
y = X @ W  # (1×1) output

# Backward pass (gradient)
dy = np.array([[1.0]])  # Gradient from loss
dW = X.T @ dy  # (3×1) gradient for weights

Identity Matrix

Definition: Matrix that doesn’t change vectors Iv=vI\mathbf{v} = \mathbf{v}
I = np.eye(3)  # 3×3 identity
print(I)
# [[1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]

v = np.array([5, 10, 15])
result = I @ v
print(result)  # [5, 10, 15] (unchanged!)
Why it matters: Used in regularization, initialization, and matrix inversion.

Matrix Inverse

Definition: Matrix that “undoes” another matrix AA1=IA A^{-1} = I
A = np.array([
    [4, 7],
    [2, 6]
])

A_inv = np.linalg.inv(A)
print(A_inv)

# Verify
print(A @ A_inv)  # ≈ Identity matrix
Application: Solving linear equations
# Solve: Ax = b for x
A = np.array([[2, 1], [1, 3]])
b = np.array([5, 7])

# Solution: x = A^(-1) b
x = np.linalg.inv(A) @ b
print(x)  # [1, 3]

# Verify
print(A @ x)  # [5, 7] ✓

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises connect matrix operations to real applications you use every day.

Exercise 1: Instagram Color Filters 📸

Create custom photo filters using matrix transformations:
import numpy as np

# A pixel is represented as [R, G, B] from 0-255
sample_pixel = np.array([180, 120, 90])  # A warm skin tone

# TODO: Create these filter matrices and apply them
# 1. Sepia filter (vintage brown look)
# 2. Negative filter (invert colors)
# 3. Custom "sunset" filter (boost reds/oranges)
Tasks:
  1. Design a 3×3 sepia transformation matrix
  2. Apply it to the sample pixel
  3. Create your own artistic filter
import numpy as np

sample_pixel = np.array([180, 120, 90])

# 1. Sepia Filter Matrix (classic vintage look)
sepia = np.array([
    [0.393, 0.769, 0.189],  # New R
    [0.349, 0.686, 0.168],  # New G
    [0.272, 0.534, 0.131],  # New B
])

# 2. Negative Filter (invert)
negative = np.array([
    [-1, 0, 0],
    [0, -1, 0],
    [0, 0, -1],
])

# 3. Sunset Filter (warm, golden hour look)
sunset = np.array([
    [1.2, 0.1, 0.0],   # Boost red
    [0.1, 1.0, 0.0],   # Slight red bleed to green
    [0.0, 0.0, 0.7],   # Reduce blue
])

print("Original pixel:", sample_pixel)
print("-" * 40)

# Apply Sepia
sepia_result = np.clip(sepia @ sample_pixel, 0, 255).astype(int)
print(f"Sepia filter:   {sepia_result}")

# Apply Negative (add 255 to invert)
negative_result = np.clip(negative @ sample_pixel + 255, 0, 255).astype(int)
print(f"Negative:       {negative_result}")

# Apply Sunset
sunset_result = np.clip(sunset @ sample_pixel, 0, 255).astype(int)
print(f"Sunset filter:  {sunset_result}")

# Output:
# Original pixel: [180 120  90]
# Sepia filter:   [196 174 136]  ← Warm vintage brown
# Negative:       [ 75 135 165]  ← Inverted colors
# Sunset filter:  [228 138  63]  ← Golden warm tones
Real-World Insight: Instagram’s filters are exactly this - matrix multiplications applied to every pixel! The “Clarendon” filter boosts contrast, “Gingham” adds vintage fade.

Exercise 2: Multi-Store Inventory Management 🏪

A retail company has 3 stores and 4 products. Calculate total revenue using matrix multiplication:
# Inventory: rows = stores, cols = products
inventory = np.array([
    [50, 30, 100, 25],   # Store A
    [80, 45, 60, 40],    # Store B
    [35, 60, 80, 55],    # Store C
])

# Prices per product
prices = np.array([29.99, 49.99, 9.99, 79.99])

# Units sold (percentage of inventory)
sales_rate = np.array([
    [0.8, 0.6, 0.9, 0.5],   # Store A
    [0.7, 0.8, 0.7, 0.6],   # Store B
    [0.9, 0.5, 0.8, 0.7],   # Store C
])
Tasks:
  1. Calculate units sold at each store (element-wise multiply inventory × sales_rate)
  2. Calculate total revenue per store (matrix × price vector)
  3. Which store had the highest revenue?
import numpy as np

inventory = np.array([
    [50, 30, 100, 25],   # Store A
    [80, 45, 60, 40],    # Store B
    [35, 60, 80, 55],    # Store C
])

prices = np.array([29.99, 49.99, 9.99, 79.99])

sales_rate = np.array([
    [0.8, 0.6, 0.9, 0.5],
    [0.7, 0.8, 0.7, 0.6],
    [0.9, 0.5, 0.8, 0.7],
])

# 1. Units sold at each store (element-wise)
units_sold = inventory * sales_rate
print("Units Sold per Store:")
print(units_sold)

# 2. Revenue per store (matrix-vector multiplication)
revenue_per_store = units_sold @ prices
print("\n💰 Revenue per Store:")
stores = ['Store A', 'Store B', 'Store C']
for store, rev in zip(stores, revenue_per_store):
    print(f"  {store}: ${rev:,.2f}")

# 3. Best performing store
best_store = stores[np.argmax(revenue_per_store)]
print(f"\n🏆 Highest Revenue: {best_store} (${max(revenue_per_store):,.2f})")

# Output:
# Revenue per Store:
#   Store A: $3,288.35
#   Store B: $4,275.18  ← Winner!
#   Store C: $4,270.63

# Bonus: Revenue breakdown by product
print("\n📊 Revenue by Product (all stores):")
product_revenue = units_sold.sum(axis=0) * prices
products = ['Product 1', 'Product 2', 'Product 3', 'Product 4']
for prod, rev in zip(products, product_revenue):
    print(f"  {prod}: ${rev:,.2f}")
Real-World Insight: This is how Walmart, Target, and Amazon calculate daily revenue across thousands of stores and millions of products - all matrix operations!

Exercise 3: Simple Neural Network Forward Pass 🧠

Implement a tiny neural network using only matrix operations:
import numpy as np

# Input: 4 features (e.g., house: beds, baths, sqft/1000, age/10)
X = np.array([[3, 2, 1.5, 1.5]])  # 1 house

# Layer 1: 4 inputs → 3 neurons
W1 = np.array([
    [0.5, 0.3, 0.8, -0.2],
    [0.2, 0.6, 0.4, -0.1],
    [0.7, 0.1, 0.5, -0.3],
])
b1 = np.array([0.1, 0.2, 0.1])

# Layer 2: 3 neurons → 1 output (price)
W2 = np.array([[0.8, 0.5, 0.6]])
b2 = np.array([0.1])
Tasks:
  1. Compute the output of Layer 1: h = X @ W1.T + b1
  2. Apply ReLU activation: h = max(0, h)
  3. Compute final output: y = h @ W2.T + b2
  4. What’s the predicted price?
import numpy as np

# Input
X = np.array([[3, 2, 1.5, 1.5]])

# Layer 1 weights and bias
W1 = np.array([
    [0.5, 0.3, 0.8, -0.2],
    [0.2, 0.6, 0.4, -0.1],
    [0.7, 0.1, 0.5, -0.3],
])
b1 = np.array([0.1, 0.2, 0.1])

# Layer 2 weights and bias
W2 = np.array([[0.8, 0.5, 0.6]])
b2 = np.array([0.1])

def relu(x):
    return np.maximum(0, x)

# Forward pass
print("🧠 Neural Network Forward Pass")
print("=" * 40)

# Layer 1
z1 = X @ W1.T + b1
print(f"Layer 1 pre-activation (z1): {z1}")

h1 = relu(z1)
print(f"Layer 1 post-ReLU (h1):      {h1}")

# Layer 2 (output)
z2 = h1 @ W2.T + b2
print(f"\nFinal output (z2): {z2}")

# Interpret as price (scale back)
predicted_price = z2[0, 0] * 100000  # Scale factor
print(f"\n🏠 Predicted House Price: ${predicted_price:,.0f}")

# Output:
# Layer 1 pre-activation: [[3.4  2.55 3.3 ]]
# Layer 1 post-ReLU:      [[3.4  2.55 3.3 ]]
# Final output: [[5.095]]
# Predicted House Price: $509,500
Real-World Insight: This is EXACTLY how PyTorch and TensorFlow work under the hood! Every deep learning model is just chains of matrix multiplications with non-linear activations.

Exercise 4: Cryptography - Hill Cipher 🔐

The Hill Cipher uses matrix multiplication to encrypt messages:
import numpy as np

# Encryption key (2x2 matrix)
key = np.array([
    [3, 3],
    [2, 5]
])

# Message: "HI" → Convert to numbers (A=0, B=1, ... Z=25)
# H=7, I=8
message = np.array([[7], [8]])

# Encrypt: ciphertext = key @ message (mod 26)
Tasks:
  1. Encrypt the message “HI”
  2. Find the decryption key (matrix inverse mod 26)
  3. Decrypt back to the original message
import numpy as np

def mod_inverse(a, m):
    """Find modular multiplicative inverse of a mod m"""
    for x in range(1, m):
        if (a * x) % m == 1:
            return x
    return None

# Encryption key
key = np.array([
    [3, 3],
    [2, 5]
])

# Message: "HI"
message = np.array([[7], [8]])  # H=7, I=8
print(f"Original message: HI → {message.flatten()}")

# 1. ENCRYPT
ciphertext = (key @ message) % 26
print(f"Encrypted: {ciphertext.flatten()}")

# Convert back to letters
cipher_letters = ''.join([chr(c + ord('A')) for c in ciphertext.flatten()])
print(f"Ciphertext letters: {cipher_letters}")

# 2. FIND DECRYPTION KEY
# Key inverse (mod 26) = (1/det) * adjugate (mod 26)
det = int(np.linalg.det(key)) % 26  # det = 3*5 - 3*2 = 9
det_inv = mod_inverse(det, 26)  # 9^(-1) mod 26 = 3

# Adjugate matrix
adjugate = np.array([
    [5, -3],
    [-2, 3]
])

# Decryption key
key_inv = (det_inv * adjugate) % 26
print(f"\nDecryption key:\n{key_inv}")

# 3. DECRYPT
decrypted = (key_inv @ ciphertext) % 26
print(f"\nDecrypted: {decrypted.flatten()}")

# Convert back to letters
original = ''.join([chr(int(c) + ord('A')) for c in decrypted.flatten()])
print(f"Original message: {original}")

# Output:
# Original message: HI → [7 8]
# Encrypted: [19 2]
# Ciphertext letters: TC
# Decryption key: [[15 17] [20  9]]
# Decrypted: [[7] [8]]
# Original message: HI ✓
Real-World Insight: While Hill Cipher is breakable, modern encryption (RSA, AES) uses similar matrix operations in much larger spaces. Your HTTPS connection uses these principles!

Key Takeaways

Matrices are functions that transform vectors
Matrix multiplication = applying transformations
Neural networks = stacked matrix multiplications
Batch processing = process many inputs at once
Transpose = used in backpropagation
Inverse = solving equations, undoing transformations

The Transformer Architecture: Matrix Multiplication All the Way Down

The Transformer — the architecture behind GPT, BERT, and every modern language model — is built entirely from the matrix operations you have learned. Here is the anatomy:
Transformer ComponentMatrix OperationWhat It Computes
Token EmbeddingLookup (matrix row selection)Word to vector (RvocabRd\mathbb{R}^{vocab} \to \mathbb{R}^{d})
Query/Key/Value ProjectionsMatrix multiply: Q=XWQQ = XW_QProject tokens into Q, K, V spaces
Attention ScoresMatrix multiply: QKT/dkQK^T / \sqrt{d_k}Pairwise similarity between all tokens
SoftmaxElement-wise (not a matrix op)Normalize scores to probabilities
Attention OutputMatrix multiply: softmax(...)V\text{softmax}(...)VWeighted combination of value vectors
Feed-Forward NetworkTwo matrix multiplies with ReLUReLU(xW1+b1)W2+b2\text{ReLU}(xW_1 + b_1)W_2 + b_2
Layer NormElement-wise scalingNormalize activations
Output ProjectionMatrix multiplyMap back to vocabulary size
A single forward pass through GPT-4 involves roughly 100+ trillion matrix multiply-accumulate operations. Every one of those operations is the same matrix multiplication you learned in this module. The matrices are just bigger (thousands of rows and columns) and there are more of them (hundreds of layers). But the math is identical to W @ x + b.
# Simplified self-attention -- just three matrix multiplies and a softmax
def self_attention(X, W_Q, W_K, W_V):
    """X is (seq_len, d_model). This is the core of every Transformer."""
    Q = X @ W_Q   # (seq_len, d_k)  -- matrix multiply
    K = X @ W_K   # (seq_len, d_k)  -- matrix multiply
    V = X @ W_V   # (seq_len, d_v)  -- matrix multiply
    
    scores = Q @ K.T / np.sqrt(Q.shape[1])  # (seq_len, seq_len) -- dot products
    weights = softmax(scores, axis=-1)        # normalize
    output = weights @ V                      # (seq_len, d_v) -- weighted sum
    
    return output

GPU Computing: Why Matrices Matter for Speed

Why GPUs are 100x faster: GPUs have thousands of cores designed for parallel matrix operations.
import numpy as np
import time

# CPU: Sequential operations
n = 5000
A = np.random.randn(n, n)
B = np.random.randn(n, n)

start = time.time()
C = A @ B  # Matrix multiplication
cpu_time = time.time() - start

print(f"CPU: {cpu_time:.2f}s for {n}x{n} matrix multiply")
# CPU: ~2-5 seconds

# On GPU (with PyTorch/CUDA):
# import torch
# A_gpu = torch.randn(n, n, device='cuda')
# B_gpu = torch.randn(n, n, device='cuda')
# C_gpu = A_gpu @ B_gpu  # ~0.02 seconds!
Why this matters:
  • GPT-4 training: 25,000 GPUs × months
  • Each forward pass: trillions of matrix operations
  • Without GPU optimization: would take centuries!

Interview Questions: Matrices

Answer: A neural network layer computes: h=σ(Wx+b)\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})Where:
  • WW is the weight matrix (learned parameters)
  • x\mathbf{x} is the input vector
  • b\mathbf{b} is the bias vector
  • σ\sigma is the activation function (ReLU, sigmoid, etc.)
Matrix multiplication WxW\mathbf{x} computes weighted sums of inputs. The bias shifts the result. The activation adds non-linearity, enabling the network to learn complex patterns.
Answer: Batch processing (processing multiple samples simultaneously) is crucial because:
  1. GPU efficiency: GPUs parallelize matrix operations; single samples waste capacity
  2. Gradient stability: Averaging gradients over a batch reduces noise
  3. Memory efficiency: One matrix multiply instead of N vector operations
  4. Modern batch sizes: 32-8192 depending on task and memory
Answer: For two n×nn \times n matrices:
  • Naive algorithm: O(n3)O(n^3) - each of n2n^2 outputs requires nn multiplications
  • Strassen’s algorithm: O(n2.807)O(n^{2.807}) - faster but less numerically stable
  • Best known: O(n2.373)O(n^{2.373}) - theoretical, not practical
In practice, optimized libraries (BLAS, cuBLAS) use cache-aware algorithms that approach theoretical limits.

What’s Next?

You now understand how matrices transform data. But which transformations are most important? Which directions in your data carry the most information? That’s where eigenvalues and eigenvectors come in - they reveal the “natural axes” of your data!

Next: Eigenvalues & Eigenvectors

Discover which house features matter most for price prediction

Interview Deep-Dive

Strong Answer:
  • The rank of a matrix is the number of linearly independent rows (or equivalently, columns). It tells you the “true dimensionality” of the information the matrix encodes. A 1000x500 matrix with rank 3 looks like it contains 500 features, but really all the information lives in a 3-dimensional subspace.
  • In ML, rank directly affects model capacity and numerical stability. If your feature matrix XX has rank less than the number of features, the normal equations (XTX)1XTy(X^TX)^{-1}X^Ty blow up because XTXX^TX is singular. This happens when you have perfectly correlated features, one-hot encoded categories with the dummy variable trap, or more features than samples (p>np > n).
  • A concrete production example: at a fintech company, a fraud detection model started producing unstable predictions after a feature pipeline change. Two new features were exact linear combinations of existing features (a junior engineer added “total_amount” which was the sum of “subtotal” + “tax” + “shipping,” all already in the feature set). The feature matrix went from full rank to rank-deficient, the condition number exploded from 100 to 101510^{15}, and the regression coefficients became meaningless. The fix was removing the redundant feature or adding L2 regularization.
  • Low rank is also exploited intentionally. SVD-based recommendations work because user-item rating matrices are approximately low-rank. LoRA (Low-Rank Adaptation) fine-tunes large language models by adding low-rank update matrices to frozen weights, reducing trainable parameters by 1000x while preserving performance.
Follow-up: How does regularization (L1 or L2) interact with the rank of the feature matrix?L2 regularization adds λI\lambda I to XTXX^TX, making it (XTX+λI)(X^TX + \lambda I). Since λI\lambda I is positive definite, the sum is guaranteed invertible regardless of the rank of XX. Small eigenvalues (near-singular directions) get proportionally the most “help,” shrinking their corresponding coefficients toward zero. L1 regularization does not directly fix rank deficiency, but it drives some coefficients to exactly zero, performing feature selection and reducing effective dimensionality. If your matrix is rank-deficient, L2 is the safer default; L1 is better when you believe many features are irrelevant.
Strong Answer:
  • The determinant measures the signed volume scaling factor of the linear transformation. If AA is a 2x2 matrix, det(A)\det(A) tells you how much the area of any shape changes when you apply AA. For 3x3, it is volume. The sign indicates whether the transformation flips orientation (like a mirror reflection).
  • det(A)=0\det(A) = 0 means the transformation collapses at least one dimension — it squishes a 2D shape into a line. This means AA is singular: it destroys information and cannot be inverted.
  • In ML, the determinant appears directly in Gaussian mixture models: det(Σ)\det(\Sigma) for each covariance matrix Σ\Sigma is in the normalization constant of the multivariate Gaussian density. If the determinant is near zero, the cluster is degenerate and the density produces numerical infinities. This is the “singularity problem” that EM-based GMMs are prone to.
  • The determinant of the Jacobian matrix appears in normalizing flows (generative models). These models transform a simple distribution through invertible transformations, and the log-determinant tracks how probability density changes. The entire architecture is designed around making this determinant cheap to compute — using triangular Jacobians (determinant = product of diagonal elements).
  • In practice, you rarely compute determinants directly for large matrices. Instead, you use the log-determinant via Cholesky decomposition or check invertibility through condition numbers and singular values.
Follow-up: A Gaussian Mixture Model’s EM algorithm diverges because one component’s covariance matrix becomes nearly singular. What is happening geometrically and how do you fix it?One cluster has collapsed onto a lower-dimensional subspace — all assigned points lie nearly on a plane in 3D space. The covariance matrix has at least one eigenvalue approaching zero. The Gaussian density divides by det(Σ)\sqrt{\det(\Sigma)}, which approaches zero, making density spike to infinity. EM then assigns all nearby points to this cluster, reinforcing the collapse. Fixes: (1) Add regularization ϵI\epsilon I to all covariance matrices after each M-step. (2) Use tied or diagonal covariances. (3) Merge or reinitialize collapsed components. Scikit-learn’s GaussianMixture has a reg_covar parameter (default 10610^{-6}) for exactly this.
Strong Answer:
  • GPUs are designed around massive parallelism for data-parallel operations. An NVIDIA A100 has 6,912 CUDA cores versus a CPU’s 8-64 cores. Matrix multiplication is the perfect GPU workload: computing output element (i,j)(i,j) requires a dot product independent of every other output element, so all n2n^2 outputs can be computed in parallel.
  • An A100 achieves 312 TFLOPS for FP16 matrix multiplication; a high-end CPU achieves roughly 5 TFLOPS. That is a 60x difference for the operation that dominates deep learning. Tensor cores (specialized hardware units) perform 4x4 matrix multiplies in a single clock cycle, exploiting the fact that neural networks tolerate reduced precision (FP16, BF16, INT8).
  • The practical implication for model design: operations expressible as matrix multiplications are fast; others are bottlenecks. This is why attention (matrix multiply) beat recurrence (sequential dependency). It is why depthwise separable convolutions are slower on GPU than regular convolutions despite having fewer FLOPs — they have lower arithmetic intensity (FLOPs per byte of memory access). It is why FFN layers in transformers use wide matrices (high arithmetic intensity) and why batch size tuning matters.
  • An ML engineer who understands matrix multiplication hardware can make architectural decisions that yield 2-10x training speedups without changing accuracy.
Follow-up: You have a model that uses a 50,000 x 50,000 matrix multiply in its forward pass and it exceeds GPU memory. What are your options?(1) Model parallelism — split the matrix across multiple GPUs using tensor parallelism (Megatron-LM style column-parallel and row-parallel splits). (2) Gradient checkpointing — recompute intermediate activations during backward pass instead of storing them. (3) Mixed precision — FP16 uses half the memory. (4) Sparse matrices — if the matrix is sparse, use CSR or block-sparse formats. (5) Low-rank factorization — replace 50K x 50K with two matrices of 50K x r and r x 50K where r50Kr \ll 50K, reducing memory from O(n2)O(n^2) to O(nr)O(nr). This is the core idea behind LoRA and factorized embeddings.
Strong Answer:
  • The matrix inverse A1A^{-1} exists only for square, full-rank matrices. It satisfies AA1=A1A=IAA^{-1} = A^{-1}A = I and gives the exact solution to Ax=bAx = b as x=A1bx = A^{-1}b.
  • The Moore-Penrose pseudo-inverse A+A^+ exists for any matrix. For an overdetermined system (m>nm > n), A+=(ATA)1ATA^+ = (A^TA)^{-1}A^T gives the least-squares solution. For an underdetermined system (m<nm < n), A+=AT(AAT)1A^+ = A^T(AA^T)^{-1} gives the minimum-norm solution. It handles all cases where the regular inverse fails.
  • In ML, you almost always want the pseudo-inverse. Linear regression uses X+y=(XTX)1XTyX^+y = (X^TX)^{-1}X^Ty. If XX is rank-deficient, the true (XTX)1(X^TX)^{-1} does not exist, but np.linalg.lstsq computes the pseudo-inverse via SVD and returns a valid solution.
  • The SVD-based computation: given A=UΣVTA = U\Sigma V^T, then A+=VΣ+UTA^+ = V\Sigma^+ U^T where Σ+\Sigma^+ inverts non-zero singular values and leaves zeros. Directions with zero singular values are ignored rather than causing division by zero. This is why np.linalg.lstsq is numerically superior to computing (XTX)1XT(X^TX)^{-1}X^T explicitly.
Follow-up: In what practical scenario would computing the explicit matrix inverse be preferable to a factorization-based solver?Almost never for solving linear systems — factorization (LU, QR, Cholesky) is always preferred. However, sometimes you need the inverse matrix itself as an object. In Bayesian linear regression, the posterior covariance of the weights is (XTX+λI)1σ2(X^TX + \lambda I)^{-1}\sigma^2, and you need the full matrix for uncertainty intervals. In the Kalman filter, you propagate the covariance matrix inverse. Even then, you compute it via Cholesky factorization rather than calling np.linalg.inv directly.