Neural Networks: The Foundation of Deep Learning

From Brains to Math

Your brain has about 86 billion neurons, each connected to thousands of others. A single neuron:
  1. Receives inputs from other neurons
  2. Weighs how important each input is
  3. Sums them up
  4. Activates if the sum exceeds a threshold
  5. Sends output to other neurons
That is, in simplified form, what an artificial neuron does!

The Perceptron: One Artificial Neuron

How It Works

Input 1 ─── weight 1 ───┐
Input 2 ─── weight 2 ───┼──► [Sum] ──► [Activation] ──► Output
Input 3 ─── weight 3 ───┘
Math version:

$$output = activation\left(\sum_{i=1}^{n} w_i x_i + b\right) = activation(w \cdot x + b)$$

Where:
  • $x_i$ = inputs
  • $w_i$ = weights (learnable)
  • $b$ = bias (also learnable)
  • $activation$ = a function that decides to “fire” or not
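To make the formula concrete, here is a minimal worked example of one neuron's computation with a step activation. The input, weight, and bias values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
x = np.array([1.0, 0.0, 1.0])    # inputs x_i
w = np.array([0.5, -0.6, 0.2])   # weights w_i
b = -0.4                         # bias

# Weighted sum plus bias: 0.5*1 + (-0.6)*0 + 0.2*1 - 0.4 = 0.3
z = np.dot(w, x) + b

# Step activation: fire (1) if the sum is positive, otherwise 0
output = 1 if z > 0 else 0

print(f"z = {z:.2f}, output = {output}")
```

Because the weighted sum is positive, the neuron fires. Lowering the bias below -0.7 would keep it silent for the same inputs.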

Building a Perceptron from Scratch

import numpy as np

class Perceptron:
    """
    A single artificial neuron -- the simplest possible neural network.
    
    Think of it as a tiny decision-maker: it looks at evidence (inputs),
    weighs how important each piece is (weights), adds it up, and makes
    a yes/no call (activation). Like a hiring manager who scores candidates
    on different criteria and hires if the total score exceeds a threshold.
    """
    
    def __init__(self, n_inputs):
        # Initialize weights to small random values.
        # (A single perceptron also trains fine from zeros; random init matters in
        # multi-layer networks, where identical weights make neurons learn the same thing.)
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0  # The "default tendency" -- like a judge's prior disposition
    
    def forward(self, x):
        """Compute output for given inputs."""
        # Weighted sum + bias: each input contributes proportionally to its weight
        z = np.dot(x, self.weights) + self.bias
        # Step activation: fire (1) if evidence exceeds threshold, stay silent (0) otherwise
        return 1 if z > 0 else 0
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """
        Train using the perceptron learning rule.
        
        The learning rule is beautifully simple: if the prediction is correct,
        do nothing. If wrong, nudge each weight toward the correct answer.
        The learning_rate controls how big each nudge is -- too large and the
        model oscillates, too small and it takes forever.
        """
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction  # +1 if we should have fired, -1 if we shouldn't have
                
                # Update rule: move weights in the direction of the error
                # If error > 0 (should have fired), increase weights for active inputs
                # If error < 0 (should not have fired), decrease weights for active inputs
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error
                
                errors += abs(error)
            
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: {errors} errors")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break

# Test on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y_and)

print("\nAND Gate Results:")
for xi in X:
    print(f"  {xi} -> {perceptron.forward(xi)}")

The XOR Problem: Why We Need More Layers

# XOR: outputs 1 if inputs are different
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X, y_xor, epochs=1000)

print("\nXOR Gate Results (FAILS!):")
for xi in X:
    print(f"  {xi} -> {perceptron_xor.forward(xi)}")
A single perceptron can only learn linearly separable patterns! XOR is not linearly separable: you cannot draw a single straight line to separate the 0s from the 1s.

Think of it like a bouncer at a club who can only apply one rule: “everyone taller than 6 feet gets in” works fine, but “people get in if they have an ID or they are on the list, but not both” requires understanding two conditions simultaneously. A single perceptron is that one-rule bouncer.

Solution: Stack multiple layers of neurons to form a Multi-Layer Perceptron (MLP). The first layer learns simple patterns, the second layer combines those patterns into more complex ones, just like how the visual cortex processes edges first, then shapes, then objects.
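One way to see why a second layer helps is to wire one by hand. The sketch below is not trained: the weights are hand-picked (an assumption for illustration) so that one hidden unit computes OR, the other computes NAND, and the output unit ANDs them together, which is exactly XOR.

```python
import numpy as np

def step(z):
    """Step activation applied element-wise: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hand-picked weights, not learned):
#   column 0 -> OR   (fires if x1 + x2 > 0.5)
#   column 1 -> NAND (fires unless both inputs are 1)
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output unit: AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step(X_demo @ W_hidden + b_hidden)
xor_out = step(hidden @ w_out + b_out)

for xi, out in zip(X_demo, xor_out):
    print(f"  {xi} -> {out}")
```

Each hidden unit draws its own straight line, and the output unit combines the two half-planes. Training an MLP amounts to finding weights like these automatically.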

Activation Functions

The step function (0 or 1) has a problem: its gradient is 0 everywhere except at the threshold, where it is undefined. This means gradient descent has no signal to work with. It is like trying to roll a ball downhill on a perfectly flat surface with a single cliff edge. We need activation functions with a useful gradient over most of their input range, a slope the optimization can follow. Sigmoid and tanh are smooth everywhere; ReLU and Leaky ReLU have a single kink at 0 but work well in practice:
import matplotlib.pyplot as plt

def sigmoid(x):
    """S-curve from 0 to 1"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """S-curve from -1 to 1"""
    return np.tanh(x)

def relu(x):
    """0 if negative, x if positive"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize
x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
activations = [
    (sigmoid, 'Sigmoid', 'Output between 0 and 1'),
    (tanh, 'Tanh', 'Output between -1 and 1'),
    (relu, 'ReLU', 'Most popular, fast to compute'),
    (leaky_relu, 'Leaky ReLU', 'Fixes "dying ReLU" problem')
]

for ax, (func, name, desc) in zip(axes.flat, activations):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(f'{name}: {desc}')
    ax.grid(True)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
| Activation | Range | Use Case | Gotcha |
|---|---|---|---|
| Sigmoid | (0, 1) | Output layer for binary classification | Vanishing gradient in deep networks: gradients shrink toward zero in early layers |
| Tanh | (-1, 1) | Hidden layers (centered at 0) | Same vanishing gradient problem as sigmoid, but centered output helps convergence |
| ReLU | [0, ∞) | Hidden layers (most common, fast) | “Dying ReLU”: if a neuron’s pre-activation is negative for every input, its gradient is 0 and it stops learning |
| Softmax | (0, 1), sums to 1 | Output for multi-class classification | Only for the output layer: it normalizes across all outputs to produce probabilities |
Practical default: Use ReLU for hidden layers and sigmoid/softmax for the output layer. This covers most common use cases. Only switch to Leaky ReLU or GELU if you observe dying neurons (training loss plateaus while many neurons output zero).
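Two of the gotchas above can be checked numerically. The sketch below is illustrative only: it adds a numerically stable softmax (subtracting the maximum before exponentiating is a common convention, assumed here) and compares the sigmoid and ReLU derivatives at a few points. The helper names `sigmoid_grad` and `relu_grad` are not part of the code above.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (np.asarray(x) > 0).astype(float)

probs = softmax([2.0, 1.0, 0.1])
print("softmax:", probs, "sum =", probs.sum())

points = np.array([-10.0, 0.0, 10.0])
print("sigmoid gradient:", sigmoid_grad(points))  # tiny at both extremes
print("relu gradient:   ", relu_grad(points))     # exactly 0 for negative inputs
```

The sigmoid gradient peaks at 0.25 and collapses toward zero for large positive or negative inputs, which is the source of the vanishing gradient. ReLU keeps a gradient of 1 for positive inputs but passes nothing back for negative ones.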

Multi-Layer Perceptron: The Universal Approximator

By stacking layers, a network can approximate a very broad class of functions. This is not hand-waving: the Universal Approximation Theorem (Cybenko, 1989) proves that a neural network with just one hidden layer of sigmoid units and enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy. The catch: “enough neurons” might mean millions, and the theorem says nothing about how to find the right weights, which is the hard part. In practice, deeper networks with fewer neurons per layer often learn hierarchical features more efficiently than one massive wide layer.
class NeuralNetwork:
    """
    Simple 2-layer neural network built from scratch.
    
    Architecture: Input -> Hidden Layer (sigmoid) -> Output Layer (sigmoid)
    This is the minimum viable network that can solve non-linear problems like XOR.
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        # Layer 1: input -> hidden (learns basic patterns/features)
        # Weight initialization with * 0.5 keeps initial values moderate;
        # too large and gradients explode, too small and learning is glacially slow
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)
        
        # Layer 2: hidden -> output (combines hidden features into final prediction)
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, X):
        """
        Forward pass: data flows input -> hidden -> output.
        We store intermediate values (z1, a1, z2, a2) because
        backpropagation needs them to compute gradients.
        """
        # Layer 1: compute weighted sum, then apply activation
        self.z1 = X @ self.W1 + self.b1       # Linear transformation
        self.a1 = self.sigmoid(self.z1)         # Non-linear activation (this is what lets us learn curves, not just lines)
        
        # Layer 2: hidden layer output becomes input to the output layer
        self.z2 = self.a1 @ self.W2 + self.b2  # Linear transformation
        self.a2 = self.sigmoid(self.z2)          # Final prediction (0-1 for binary classification)
        
        return self.a2
    
    def backward(self, X, y, learning_rate=0.5):
        """
        Backward pass: compute how much each weight contributed to the error,
        then nudge weights in the direction that reduces error.
        
        This is backpropagation -- the chain rule applied layer by layer,
        working backwards from output to input. Think of it like tracing
        blame: "The output was wrong because the hidden layer sent the wrong
        signal, which happened because the input weights were off."
        """
        m = len(X)  # Number of samples (for averaging gradients)
        
        # Output layer error: how far off were our predictions?
        dz2 = self.a2 - y.reshape(-1, 1)       # Derivative of loss w.r.t. z2
        dW2 = (self.a1.T @ dz2) / m            # How much each W2 weight contributed to error
        db2 = np.mean(dz2, axis=0)              # How much the bias contributed
        
        # Hidden layer error: chain rule propagates error backward through W2
        # The sigmoid derivative a*(1-a) is what makes this differentiable
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)
        dW1 = (X.T @ dz1) / m
        db1 = np.mean(dz1, axis=0)
        
        # Update weights: step in the direction that reduces error
        # learning_rate controls step size -- the fundamental tradeoff of optimization
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def train(self, X, y, epochs=1000, learning_rate=0.5):
        losses = []
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
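            # Loss (binary cross-entropy); reshape y to a column so it lines up
            # with the (n_samples, 1) output instead of broadcasting to (n, n)
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + 1e-8) + (1 - y_col) * np.log(1 - output + 1e-8))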
            losses.append(loss)
            
            # Backward
            self.backward(X, y, learning_rate)
            
            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")
        
        return losses

# NOW we can learn XOR!
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = nn.train(X, y_xor, epochs=2000)

print("\nXOR Results (SUCCESS!):")
predictions = nn.forward(X)
for xi, pred in zip(X, predictions):
    print(f"  {xi} -> {pred[0]:.3f} (rounded: {int(pred[0] > 0.5)})")

Backpropagation: How Networks Learn

Backpropagation uses the chain rule from calculus to compute gradients efficiently.
Math Connection: Backpropagation is just repeated application of the chain rule. See Chain Rule for the mathematical foundation.
The key insight:
  1. Compute error at output
  2. Propagate error backward through layers
  3. Update each weight proportionally to how much it contributed to the error
$$\frac{\partial Loss}{\partial w} = \frac{\partial Loss}{\partial output} \cdot \frac{\partial output}{\partial hidden} \cdot \frac{\partial hidden}{\partial w}$$
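A common sanity check for a backpropagation implementation is to compare the chain-rule gradient against a numerical estimate. The sketch below does this for a single sigmoid neuron with binary cross-entropy loss; the sample values and the step size `eps` are assumptions picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for one sample passed through one sigmoid neuron."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical sample and parameters
x = np.array([0.5, -1.2])
y = 1.0
w = np.array([0.3, 0.8])
b = 0.1

# Chain rule, factor by factor
a = sigmoid(np.dot(w, x) + b)
dloss_da = (a - y) / (a * (1 - a))   # dLoss/d(output)
da_dz = a * (1 - a)                  # d(output)/d(weighted sum)
dz_dw = x                            # d(weighted sum)/d(weights)
analytic_grad = dloss_da * da_dz * dz_dw   # simplifies to (a - y) * x

# Numerical estimate: nudge each weight and measure the change in loss
eps = 1e-6
numeric_grad = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric_grad[i] = (loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * eps)

print("analytic:", analytic_grad)
print("numeric: ", numeric_grad)
```

The two gradients should agree to several decimal places. A mismatch in a check like this usually points to a bug in the backward pass.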

Using PyTorch (The Professional Way)

import torch
import torch.nn as nn
import torch.optim as optim

# Define the network
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()
    
    def forward(self, x):
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        return x

# Create network
model = XORNet()

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Data
X_tensor = torch.tensor(X, dtype=torch.float32)  # explicit dtype: X is an integer NumPy array
y_tensor = torch.tensor(y_xor, dtype=torch.float32).reshape(-1, 1)

# Training loop
for epoch in range(1000):
    # Forward pass
    output = model(X_tensor)
    loss = criterion(output, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# Test
with torch.no_grad():
    predictions = model(X_tensor)
    print("\nPyTorch XOR Results:")
    for xi, pred in zip(X, predictions):
        print(f"  {xi} -> {pred.item():.3f}")

Using scikit-learn

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load digit recognition dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',
    max_iter=500,
    random_state=42
)

# Train
mlp.fit(X_train, y_train)

# Evaluate
y_pred = mlp.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Visualize some predictions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    ax.set_title(f"Pred: {y_pred[i]}, True: {y_test[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()

Network Architectures

| Architecture | Layers | Use Case |
|---|---|---|
| Shallow | 1-2 hidden | Simple patterns, tabular data |
| Deep | 3+ hidden | Complex patterns |
| Wide | Many neurons | More capacity per layer |
| Deep & Narrow | Many layers, fewer neurons | Hierarchical features |
Rule of thumb for tabular data:
  • Start with 2 hidden layers
  • Hidden size: between input and output size
  • Use ReLU activation
  • Use dropout for regularization
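To get a feel for how architecture choices affect model size, it can help to count learnable parameters. The helper below, `count_parameters`, is a hypothetical utility for fully connected networks, and the two layer layouts are assumed examples for a 64-feature, 10-class problem.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input first, output last.
    Each pair of adjacent layers contributes n_in * n_out weights plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# Hypothetical architectures for 64 inputs and 10 outputs
wide = [64, 512, 10]              # one wide hidden layer
deep_narrow = [64, 48, 32, 10]    # two hidden layers sized between input and output

wide_params = count_parameters(wide)
deep_params = count_parameters(deep_narrow)

print(f"Wide:          {wide_params} parameters")
print(f"Deep & narrow: {deep_params} parameters")
```

Fewer parameters generally means less capacity to overfit on small tabular datasets, which is one reason the rule of thumb starts with modest hidden sizes.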

Regularization for Neural Networks

Dropout

Randomly “turn off” neurons during training. Think of it like a team where you randomly bench different players in each practice session. No single player can carry the team alone, so every player has to be competent. This forces the network to build redundant representations rather than relying on a few “star” neurons — which means it generalizes better to new data.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # 30% of neurons randomly zeroed each forward pass
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),  # Dropout is only active during training, NOT during inference
    nn.Linear(64, 10)  # No dropout on the output layer
)
Practical tip: Start with dropout rate of 0.2-0.3 for hidden layers. If the model still overfits, increase toward 0.5. Never apply dropout to the output layer. Remember to call model.eval() during inference — dropout must be disabled for predictions.
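To see what happens under the hood, here is a minimal NumPy sketch of "inverted" dropout, the variant that rescales the surviving activations during training so nothing needs to change at inference time. The function name `dropout` and its arguments are assumptions for illustration, not a library API.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) so the expected value of each
    activation stays the same. At inference the input passes through unchanged.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))   # stand-in for a layer's activations

train_out = dropout(a, rate=0.3, training=True, rng=rng)
eval_out = dropout(a, rate=0.3, training=False, rng=rng)

print(f"Fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"Mean activation in training: {train_out.mean():.3f}")
print(f"Unchanged at inference:      {np.array_equal(eval_out, a)}")
```

Roughly 30% of values are zeroed in training, yet the mean stays close to 1 because of the rescaling. That is why the same network can be used at inference simply by switching dropout off.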

Early Stopping

Stop training when validation loss stops improving. It is one of the simplest and most effective regularization techniques. Training too long is like studying for an exam past the point of understanding into the territory of memorizing typos in the textbook.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Use 10% for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=1000
)
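The "patience" logic behind early stopping is simple enough to sketch by hand. The function below is a simplified illustration, not scikit-learn's internal implementation (it ignores any improvement tolerance, for example), and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss. In practice you would also restore the weights
    saved at best_epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch

# Made-up validation curve: improves, then starts to overfit
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)

print(f"Stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With a patience of 3, training halts three epochs after the best validation loss, before the later epochs where the loss climbs.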

Key Hyperparameters

| Hyperparameter | Effect | Practical Starting Point |
|---|---|---|
| Learning rate | Too high = unstable/diverges, too low = painfully slow convergence | 0.001 for Adam, 0.01 for SGD |
| Hidden layers | More = more complex patterns, but harder to train and more overfitting risk | 2 layers for tabular, 3+ for images/text |
| Neurons per layer | More = more capacity per layer | Start between input size and output size |
| Batch size | Smaller = noisier gradients (can help escape local minima), larger = more stable/faster | 32-128 for most tasks |
| Activation | Determines what non-linearities the network can learn | ReLU for hidden layers, sigmoid/softmax for output |
| Dropout rate | Higher = more regularization, lower = more capacity | 0.2-0.3 as starting point |
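The learning rate trade-off is easiest to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. The three learning rates are chosen to exaggerate the effect on this toy function and are not recommendations for real networks.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w                  # derivative of w**2
        w = w - learning_rate * grad  # step against the gradient
    return w

too_small = gradient_descent(0.01)   # each step shrinks w by only 2%
reasonable = gradient_descent(0.4)   # each step shrinks w by 80%
too_large = gradient_descent(1.1)    # each step overshoots and grows |w|

print(f"lr=0.01 -> w = {too_small:.3f}  (still far from 0)")
print(f"lr=0.4  -> w = {reasonable:.2e}  (converged)")
print(f"lr=1.1  -> w = {too_large:.1f}  (diverged)")
```

After the same 20 steps, the small rate has barely moved, the moderate rate has converged, and the large rate has blown up. Real loss surfaces are far messier, but the same three regimes appear.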

When to Use Neural Networks

Good for:
  • Image data (use CNNs)
  • Text data (use Transformers)
  • Sequential data (use RNNs/LSTMs)
  • Very large datasets
  • Complex non-linear patterns
Not great for:
  • Small datasets (overfits easily — neural nets are data-hungry by nature)
  • When interpretability matters (explaining why a 10-layer network made a decision is much harder than explaining a decision tree)
  • Tabular data with fewer than 10,000 rows (tree-based models like XGBoost or Random Forest are often the stronger choice here)
Industry reality: For tabular data, gradient boosted trees (XGBoost, LightGBM) are frequently the top performers in Kaggle competitions and real-world deployments. Neural networks shine on unstructured data: images, text, audio, and video. If someone suggests a neural network for a 5,000-row CSV, push back.

🚀 Mini Projects

Project 1: Digit Recognizer

Build a neural network to recognize handwritten digits from the MNIST dataset.

Project 2: Neural Network from Scratch

Implement a simple neural network using only NumPy.

Project 3: Activation Function Explorer

Compare different activation functions and their effects on learning.

Project 4: Hyperparameter Tuner

Systematically find the best neural network architecture.

Key Takeaways

Neurons = Weighted Sums

Input × weights + bias → activation → output

Layers = Power

More layers = learn more complex patterns

Backprop = Chain Rule

Gradients flow backward to update weights

Regularize!

Dropout and early stopping prevent overfitting

What’s Next?

Now that you understand neural networks, let’s learn about regularization in more depth - the key to preventing overfitting in any model!

Continue to Module 13: Regularization

Learn L1, L2 regularization and other techniques to prevent overfitting