Neural Networks: The Foundation of Deep Learning

Neural Network Architecture

From Brains to Math

Your brain has about 86 billion neurons, each connected to thousands of others. A single neuron:
  1. Receives inputs from other neurons
  2. Weighs how important each input is
  3. Sums them up
  4. Activates if the sum exceeds a threshold
  5. Sends output to other neurons
That's essentially what an artificial neuron does!
[Image: Tesla Autopilot neural network]

The Perceptron: One Artificial Neuron

How It Works

Input 1 ─── weight 1 ───┐

Input 2 ─── weight 2 ───┼──► [Sum] ──► [Activation] ──► Output

Input 3 ─── weight 3 ───┘
Math version:

$$\text{output} = \text{activation}\left(\sum_{i=1}^{n} w_i x_i + b\right) = \text{activation}(w \cdot x + b)$$

Where:
  • $x_i$ = inputs
  • $w_i$ = weights (learnable)
  • $b$ = bias (also learnable)
  • activation = a function that decides whether to “fire” or not

Building a Perceptron from Scratch

import numpy as np

class Perceptron:
    """A single artificial neuron."""
    
    def __init__(self, n_inputs):
        # Random small weights
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
    
    def forward(self, x):
        """Compute output for given inputs."""
        # Weighted sum + bias
        z = np.dot(x, self.weights) + self.bias
        # Step activation: output 1 if z > 0, else 0
        return 1 if z > 0 else 0
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                
                # Update rule: move weights in direction of error
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error
                
                errors += abs(error)
            
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: {errors} errors")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break

# Test on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y_and)

print("\nAND Gate Results:")
for xi in X:
    print(f"  {xi} -> {perceptron.forward(xi)}")

The XOR Problem: Why We Need More Layers

# XOR: outputs 1 if inputs are different
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X, y_xor, epochs=1000)

print("\nXOR Gate Results (FAILS!):")
for xi in X:
    print(f"  {xi} -> {perceptron_xor.forward(xi)}")
A single perceptron can only learn linearly separable patterns. XOR is not linearly separable: you can’t draw a single straight line that separates the 0s from the 1s.
Solution: stack multiple layers of neurons into a Multi-Layer Perceptron (MLP).

Activation Functions

The step function (0 or 1) has a problem: its gradient is zero everywhere (and undefined at the threshold), so gradient descent can’t update the weights. We need smooth, differentiable activation functions:
import matplotlib.pyplot as plt

def sigmoid(x):
    """S-curve from 0 to 1"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """S-curve from -1 to 1"""
    return np.tanh(x)

def relu(x):
    """0 if negative, x if positive"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize
x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
activations = [
    (sigmoid, 'Sigmoid', 'Output between 0 and 1'),
    (tanh, 'Tanh', 'Output between -1 and 1'),
    (relu, 'ReLU', 'Most popular, fast to compute'),
    (leaky_relu, 'Leaky ReLU', 'Fixes "dying ReLU" problem')
]

for ax, (func, name, desc) in zip(axes.flat, activations):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(f'{name}: {desc}')
    ax.grid(True)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
| Activation | Range | Use Case |
|---|---|---|
| Sigmoid | (0, 1) | Output layer for binary classification |
| Tanh | (-1, 1) | Hidden layers (centered at 0) |
| ReLU | [0, ∞) | Hidden layers (most common) |
| Softmax | (0, 1), sums to 1 | Output layer for multi-class classification |
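Softmax appears in the table above but not in the plotting code, so here is a minimal NumPy sketch of it (assuming np is already imported as in the earlier snippets):

def softmax(x):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))        # ~[0.659, 0.242, 0.099]
print(softmax(logits).sum())  # 1.0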

Multi-Layer Perceptron: The Universal Approximator

By stacking layers with non-linear activations, a network can approximate any continuous function to arbitrary accuracy (the universal approximation theorem)!
class NeuralNetwork:
    """Simple 2-layer neural network."""
    
    def __init__(self, input_size, hidden_size, output_size):
        # Layer 1: input -> hidden
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)
        
        # Layer 2: hidden -> output
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, X):
        """Forward pass through the network."""
        # Layer 1
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        
        return self.a2
    
    def backward(self, X, y, learning_rate=0.5):
        """Backward pass - compute gradients and update weights."""
        m = len(X)
        
        # Output layer error
        dz2 = self.a2 - y.reshape(-1, 1)
        dW2 = (self.a1.T @ dz2) / m
        db2 = np.mean(dz2, axis=0)
        
        # Hidden layer error (using chain rule!)
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)
        dW1 = (X.T @ dz1) / m
        db1 = np.mean(dz1, axis=0)
        
        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def train(self, X, y, epochs=1000, learning_rate=0.5):
        losses = []
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
            # Loss (binary cross-entropy)
            # Reshape y to a column vector so it broadcasts element-wise with the (n, 1) output
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + 1e-8) + (1 - y_col) * np.log(1 - output + 1e-8))
            losses.append(loss)
            
            # Backward
            self.backward(X, y, learning_rate)
            
            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")
        
        return losses

# NOW we can learn XOR!
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = nn.train(X, y_xor, epochs=2000)

print("\nXOR Results (SUCCESS!):")
predictions = nn.forward(X)
for xi, pred in zip(X, predictions):
    print(f"  {xi} -> {pred[0]:.3f} (rounded: {int(pred[0] > 0.5)})")

Backpropagation: How Networks Learn

Backpropagation uses the chain rule from calculus to compute gradients efficiently.
Math Connection: Backpropagation is just repeated application of the chain rule. See Chain Rule for the mathematical foundation.
The key insight:
  1. Compute error at output
  2. Propagate error backward through layers
  3. Update each weight proportionally to how much it contributed to the error
$$\frac{\partial \text{Loss}}{\partial w} = \frac{\partial \text{Loss}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \text{hidden}} \cdot \frac{\partial \text{hidden}}{\partial w}$$
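To make the chain rule concrete, here is a minimal, self-contained sketch (not part of the NeuralNetwork class above): one input, one weight, a sigmoid activation, and a squared-error loss, with the analytic chain-rule gradient checked against a finite-difference estimate.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny "network": one input, one weight, sigmoid activation, squared-error loss
x, y_true = 1.5, 1.0
w = 0.3

# Forward pass
z = w * x
a = sigmoid(z)
loss = 0.5 * (a - y_true) ** 2

# Backward pass: chain rule, dLoss/dw = dLoss/da * da/dz * dz/dw
dloss_da = a - y_true
da_dz = a * (1 - a)          # derivative of the sigmoid
dz_dw = x
grad_analytic = dloss_da * da_dz * dz_dw

# Numerical check with a central finite difference
eps = 1e-6
loss_plus = 0.5 * (sigmoid((w + eps) * x) - y_true) ** 2
loss_minus = 0.5 * (sigmoid((w - eps) * x) - y_true) ** 2
grad_numeric = (loss_plus - loss_minus) / (2 * eps)

print(grad_analytic, grad_numeric)  # the two values should agree closely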

Using PyTorch (The Professional Way)

import torch
import torch.nn as nn
import torch.optim as optim

# Define the network
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()
    
    def forward(self, x):
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        return x

# Create network
model = XORNet()

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Data
X_tensor = torch.FloatTensor(X)
y_tensor = torch.FloatTensor(y_xor).reshape(-1, 1)

# Training loop
for epoch in range(1000):
    # Forward pass
    output = model(X_tensor)
    loss = criterion(output, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# Test
with torch.no_grad():
    predictions = model(X_tensor)
    print("\nPyTorch XOR Results:")
    for xi, pred in zip(X, predictions):
        print(f"  {xi} -> {pred.item():.3f}")

Using scikit-learn

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load digit recognition dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',
    max_iter=500,
    random_state=42
)

# Train
mlp.fit(X_train, y_train)

# Evaluate
y_pred = mlp.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Visualize some predictions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    ax.set_title(f"Pred: {y_pred[i]}, True: {y_test[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()

Network Architectures

| Architecture | Layers | Use Case |
|---|---|---|
| Shallow | 1-2 hidden | Simple patterns, tabular data |
| Deep | 3+ hidden | Complex patterns |
| Wide | Many neurons per layer | More capacity per layer |
| Deep & Narrow | Many layers, fewer neurons | Hierarchical features |
Rule of thumb for tabular data (sketched in code after this list):
  • Start with 2 hidden layers
  • Hidden size: between input and output size
  • Use ReLU activation
  • Use dropout for regularization
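As a rough sketch of that rule of thumb (the feature and class counts here are hypothetical, chosen only for illustration), a tabular binary-classification problem with 20 input features might look like this in PyTorch:

import torch.nn as nn

n_inputs, n_outputs = 20, 1   # hypothetical: 20 features, binary target
h1, h2 = 16, 8                # hidden sizes between the input and output sizes

model = nn.Sequential(
    nn.Linear(n_inputs, h1),
    nn.ReLU(),
    nn.Dropout(0.2),          # light dropout for regularization
    nn.Linear(h1, h2),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(h2, n_outputs),
    nn.Sigmoid(),             # probability output for binary classification
)
print(model)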

Regularization for Neural Networks

Dropout

Randomly “turn off” neurons during training:
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # 30% of neurons randomly zeroed
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 10)
)

Early Stopping

Stop training when validation loss stops improving:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Use 10% for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=1000
)

Key Hyperparameters

| Hyperparameter | Effect |
|---|---|
| Learning rate | Too high = unstable, too low = slow |
| Hidden layers | More = more complex patterns, more overfitting risk |
| Neurons per layer | More = more capacity |
| Batch size | Smaller = more noise, larger = more stable |
| Activation | ReLU most common, sigmoid for output |
| Dropout rate | 0.1-0.5 typical |
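One way to see these effects is to sweep a single hyperparameter while holding the rest fixed. A minimal sketch using scikit-learn's MLPClassifier on the digits data from earlier (the specific learning-rate values are arbitrary):

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Vary only the learning rate; everything else stays fixed
for lr in [0.0001, 0.001, 0.01, 0.1]:
    mlp = MLPClassifier(hidden_layer_sizes=(100,), learning_rate_init=lr,
                        max_iter=300, random_state=42)
    mlp.fit(X_tr, y_tr)
    print(f"learning_rate_init={lr}: test accuracy = {mlp.score(X_te, y_te):.2%}")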

When to Use Neural Networks

Good for:
  • Image data (use CNNs)
  • Text data (use Transformers)
  • Sequential data (use RNNs/LSTMs)
  • Very large datasets
  • Complex non-linear patterns
Not great for:
  • Small datasets (overfits easily)
  • When interpretability matters
  • Tabular data with < 10,000 rows (tree models often better)

🚀 Mini Projects

Project 1: Digit Recognizer

Build a neural network to recognize handwritten digits from the MNIST dataset.

Project 2: Neural Network from Scratch

Implement a simple neural network using only NumPy.

Project 3: Activation Function Explorer

Compare different activation functions and their effects on learning.

Project 4: Hyperparameter Tuner

Systematically find the best neural network architecture.

Key Takeaways

Neurons = Weighted Sums

Input × weights + bias → activation → output

Layers = Power

More layers = learn more complex patterns

Backprop = Chain Rule

Gradients flow backward to update weights

Regularize!

Dropout and early stopping prevent overfitting

What’s Next?

Now that you understand neural networks, let’s learn about regularization in more depth - the key to preventing overfitting in any model!

Continue to Module 13: Regularization

Learn L1, L2 regularization and other techniques to prevent overfitting