Perceptrons & Multi-Layer Networks

The Biological Inspiration

Your brain contains approximately 86 billion neurons, each connected to thousands of others. A single neuron:
  1. Receives signals from other neurons through dendrites
  2. Processes those signals in the cell body
  3. Fires (or not) based on whether the combined signal exceeds a threshold
  4. Transmits that signal to other neurons through its axon
In 1958, Frank Rosenblatt created the Perceptron — a mathematical model of a single neuron. It’s remarkably simple, yet it laid the foundation for all modern deep learning. A useful analogy: a single neuron is like a voter. It listens to multiple arguments (inputs), weighs each one by how convincing it finds that source (weights), adds up its overall impression (weighted sum), and then makes a binary decision — yes or no (activation). A neural network is a parliament of these voters, organized into committees (layers), where each committee’s collective decision feeds into the next.
[Figure: Biological vs Artificial Neuron]

The Perceptron: One Artificial Neuron

Mathematical Formulation

A perceptron computes:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$$
Where:
  • $\mathbf{x} = [x_1, x_2, ..., x_n]$ = input features
  • $\mathbf{w} = [w_1, w_2, ..., w_n]$ = weights (learnable)
  • $b$ = bias (learnable)
  • $f$ = activation function
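
As a quick worked instance of the formula above, the sketch below evaluates one perceptron by hand. The weights, bias, and input values are made up purely for illustration, and the step function is just one possible choice of $f$.

```python
import numpy as np

# Illustrative values only (not taken from any trained model)
w = np.array([0.5, -0.4, 0.2])   # weights
x = np.array([1.0, 2.0, 3.0])    # input features
b = 0.1                          # bias

def step(z):
    """Step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

z = np.dot(w, x) + b   # 0.5*1.0 - 0.4*2.0 + 0.2*3.0 + 0.1
y = step(z)
print(f"z = {z:.2f}, y = {y}")
```

The weighted sum comes out positive, so this neuron fires. Changing the bias to a sufficiently negative value would flip the decision without touching the weights.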

Visual Representation

     x₁ ──── w₁ ────┐

     x₂ ──── w₂ ────┼──► [Σ + b] ──► [f] ──► y

     x₃ ──── w₃ ────┘

Building from Scratch

import numpy as np

class Perceptron:
    """A single artificial neuron."""
    
    def __init__(self, n_inputs, activation='step'):
        """Initialize with random weights."""
        # Small random weights for symmetry breaking.
        # Why random? If all weights start identical, all neurons learn the same thing.
        # Why small? Large initial weights can saturate activations, killing gradients.
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0  # Bias shifts the decision boundary -- like a threshold
        self.activation = activation
    
    def _activate(self, z):
        """Apply activation function."""
        if self.activation == 'step':
            return 1 if z > 0 else 0
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def forward(self, x):
        """Compute output for given input."""
        # Weighted sum
        z = np.dot(x, self.weights) + self.bias
        # Apply activation
        return self._activate(z)
    
    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self.forward(x) for x in X])
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        history = []
        
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                # Forward pass
                prediction = self.forward(xi)
                
                # Compute error
                error = yi - prediction
                
                # Update weights if prediction was wrong
                if error != 0:
                    self.weights += learning_rate * error * xi
                    self.bias += learning_rate * error
                    errors += 1
            
            accuracy = 1 - errors / len(y)
            history.append(accuracy)
            
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Accuracy = {accuracy:.2%}")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}!")
                break
        
        return history


# Test on AND gate
print("="*50)
print("Training Perceptron on AND Gate")
print("="*50)

X_and = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X_and, y_and)

print("\nResults:")
for x, y_true in zip(X_and, y_and):
    y_pred = perceptron.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")

The Perceptron Learning Rule

The training algorithm is beautifully simple:
FOR each training example (x, y):
    1. Compute prediction: ŷ = step(w·x + b), i.e. 1 if w·x + b > 0, else 0
    2. If ŷ ≠ y (wrong prediction):
        w = w + η(y - ŷ)x
        b = b + η(y - ŷ)
    3. If ŷ = y (correct): do nothing

Why This Works

The update rule has an elegant geometric interpretation:
  • If we predict 0 but should predict 1: increase weights in direction of x (pull the decision boundary toward this point)
  • If we predict 1 but should predict 0: decrease weights in direction of x (push the decision boundary away from this point)
  • The learning rate $\eta$ controls how big each update is: too large and the boundary oscillates wildly, too small and learning takes forever
Think of the weights as defining a dividing line in space. Each mistake nudges that line in the right direction. Given enough nudges, the line settles into the right place.
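
A single update can be traced by hand. The sketch below applies the rule once to a point that is wrongly predicted as 0. The starting weights, the input, and the learning rate are illustrative values, not ones prescribed by the text.

```python
import numpy as np

eta = 0.1                       # learning rate (illustrative value)
w = np.array([0.0, 0.0])        # current weights
b = 0.0                         # current bias
x = np.array([1.0, 1.0])        # training input
y_true = 1                      # desired label

z_before = np.dot(w, x) + b
y_pred = 1 if z_before > 0 else 0      # predicts 0, which is wrong

error = y_true - y_pred                # +1: the neuron under-predicted
w = w + eta * error * x                # move the weights in the direction of x
b = b + eta * error

z_after = np.dot(w, x) + b
print(f"z before = {z_before:.2f}, z after = {z_after:.2f}")
```

After the update the weighted sum for this input is larger, so the same point now lands on the positive side of the boundary.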

Convergence Theorem

Perceptron Convergence Theorem: If the data is linearly separable, the perceptron algorithm converges to a separating solution after a finite number of updates. The number of updates is bounded by $(R / \gamma)^2$, where $R$ is the maximum norm of any data point and $\gamma$ is the margin: the distance from the separating hyperplane to the closest data point. Wider margins give a tighter bound and, typically, faster convergence.
Historical Note: Minsky & Papert’s 1969 book Perceptrons showed that single perceptrons can’t solve non-linearly-separable problems (like XOR). The book is widely credited with contributing to the first “AI Winter”. Networks with multiple layers can represent such functions, but a practical way to train them (backpropagation) only became widely known in the 1980s.
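
The bound in the convergence theorem above can be checked numerically on the AND gate. This is a sketch under a few assumptions that are not in the text: labels are recoded to {-1, +1}, a constant 1 is appended to each input so the bias becomes an ordinary weight, and `w_star` is one hand-picked separating hyperplane used only to compute a margin.

```python
import numpy as np

# AND gate with labels in {-1, +1}; the trailing 1 in each row folds the bias
# into the weight vector.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

# A hand-picked separating hyperplane, normalized to unit length
w_star = np.array([1.0, 1.0, -1.5])
w_star /= np.linalg.norm(w_star)

R = np.max(np.linalg.norm(X, axis=1))   # largest input norm
gamma = np.min(y * (X @ w_star))        # margin achieved by w_star
bound = (R / gamma) ** 2

# Run the perceptron from zero weights and count the updates
w = np.zeros(3)
updates = 0
converged = False
for _ in range(1000):                   # generous cap on epochs
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:     # misclassified (or on the boundary)
            w += yi * xi
            updates += 1
            mistakes += 1
    if mistakes == 0:
        converged = True
        break

print(f"updates = {updates}, bound = {bound:.1f}")
```

The observed update count should sit at or below the bound. A different choice of `w_star` with a larger margin would give a tighter bound for the same data.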

The XOR Problem: Why We Need Depth

# XOR: output is 1 if inputs are DIFFERENT
print("="*50)
print("Training Perceptron on XOR Gate")
print("="*50)

y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X_and, y_xor, epochs=100)

print("\nResults (FAILS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = perceptron_xor.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")
The perceptron fails on XOR! Why?
[Figure: XOR Problem]
XOR is not linearly separable — you cannot draw a single straight line to separate the 0s from the 1s. Solution: Stack multiple layers of neurons!
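
To see why no single line works, a short derivation helps. Assume, as in the Perceptron class above, a step activation that outputs 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$. The four XOR cases then impose the following constraints:

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0 \\[4pt]
\text{Adding the two middle lines} &: \quad w_1 + w_2 + 2b > 0
  \;\Rightarrow\; w_1 + w_2 + b > -b \ge 0
\end{aligned}
```

The last implication contradicts the fourth constraint, so no choice of $w_1$, $w_2$, and $b$ satisfies all four cases at once.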

Multi-Layer Perceptron (MLP)

The Universal Approximation Theorem

A neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, given a suitable non-linear activation.
In other words: even a network with one hidden layer can represent essentially any continuous function, given enough neurons. But here is the catch most people miss: the theorem says such a network exists. It does not say how many neurons are needed or that you can find the weights efficiently. In practice, deeper networks with fewer neurons per layer are often far easier to train than enormously wide shallow networks. The theorem is an existence proof, not a training recipe. It is the difference between “a key to this lock exists somewhere in the universe” and “here is the key.”
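
A small experiment can make the approximation claim tangible, although it proves nothing. The sketch below is a simplification and not the theorem's construction: the hidden weights are random and left untrained, only the output weights are fitted (by least squares), and the error is measured on the same grid used for fitting. The target function, the widths, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
target = np.sin(x).ravel()

def fit_mse(n_hidden, W, b):
    """Fit output weights by least squares using the first n_hidden tanh units."""
    H = np.tanh(x @ W[:, :n_hidden] + b[:n_hidden])   # hidden activations
    H = np.hstack([H, np.ones((len(x), 1))])          # column for the output bias
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.mean((H @ coef - target) ** 2)

# Random, untrained hidden weights (an assumption made to keep the sketch short)
W = rng.normal(size=(1, 50))
b = rng.normal(size=50)

mse_narrow = fit_mse(2, W, b)
mse_wide = fit_mse(50, W, b)
print(f"2 hidden units:  MSE = {mse_narrow:.6f}")
print(f"50 hidden units: MSE = {mse_wide:.6f}")
```

Widening the single hidden layer drives the fitting error down, which is the behaviour the theorem guarantees is possible. It says nothing about how gradient descent would fare on the same network.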

Architecture

Input Layer      Hidden Layer      Output Layer
    x₁ ────────┐
               ├────► h₁ ────┐
    x₂ ────────┤             ├────► y
               ├────► h₂ ────┤
    x₃ ────────┘             │
               ├────► h₃ ────┘

Each connection has its own weight. Each hidden neuron has its own bias.
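
One way to make "each connection has its own weight" concrete is to count the learnable parameters. The helper below is a hypothetical utility, not part of the tutorial's classes. It assumes fully connected layers and, like the MLP class in the next section, a bias on every non-input neuron, including the output.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection
        total += n_out          # one bias per neuron in the receiving layer
    return total

print(count_parameters([3, 3, 1]))   # the 3-3-1 network drawn above -> 16
print(count_parameters([2, 4, 1]))   # 2 inputs, 4 hidden, 1 output -> 17
```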

Building an MLP from Scratch

class MLP:
    """Multi-Layer Perceptron from scratch."""
    
    def __init__(self, layer_sizes, activation='sigmoid'):
        """
        Initialize network with given layer sizes.
        
        Args:
            layer_sizes: List like [input_size, hidden1, hidden2, ..., output_size]
        """
        self.n_layers = len(layer_sizes) - 1
        self.activation = activation
        
        # Initialize weights and biases for each layer
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            # He initialization (2 / fan_in) for ReLU; a simplified fan-in-only
            # variant of Xavier scaling (1 / fan_in) for sigmoid/tanh
            scale = np.sqrt(2.0 / layer_sizes[i]) if activation == 'relu' else \
                    np.sqrt(1.0 / layer_sizes[i])
            
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros(layer_sizes[i+1])
            
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z, derivative=False):
        """Apply activation function (or its derivative)."""
        if self.activation == 'sigmoid':
            sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))
            if derivative:
                return sig * (1 - sig)
            return sig
        elif self.activation == 'relu':
            if derivative:
                return (z > 0).astype(float)
            return np.maximum(0, z)
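        elif self.activation == 'tanh':
            if derivative:
                return 1 - np.tanh(z)**2
            return np.tanh(z)
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown activation: {self.activation}")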
    
    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []       # Pre-activation values
        
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            
            # Hidden layers use the chosen activation; the output layer always uses sigmoid
            if i == self.n_layers - 1:  # Output layer
                current = self._sigmoid(z)  # For binary classification
            else:
                current = self._activate(z)
            
            self.activations.append(current)
        
        return current
    
    def _sigmoid(self, z):
        """Sigmoid for output layer."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def backward(self, X, y, learning_rate=0.01):
        """Backward pass (backpropagation)."""
        m = len(X)
        
        # Output layer error
        output = self.activations[-1]
        delta = output - y.reshape(-1, 1)  # Derivative of BCE loss with sigmoid
        
        # Backpropagate through layers
        for i in range(self.n_layers - 1, -1, -1):
            # Gradient for weights and biases
            dW = self.activations[i].T @ delta / m
            db = np.mean(delta, axis=0)
            
            # Propagate error to previous layer
            if i > 0:
                delta = (delta @ self.weights[i].T) * self._activate(
                    self.z_values[i-1], derivative=True
                )
            
            # Update weights and biases
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db
    
    def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):
        """Train the network."""
        history = {'loss': [], 'accuracy': []}
        
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            
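            # Compute loss (binary cross-entropy). Reshape y to a column so it
            # lines up with output (m, 1) instead of broadcasting to (m, m).
            eps = 1e-8
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + eps) + (1 - y_col) * np.log(1 - output + eps))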
            
            # Compute accuracy
            predictions = (output > 0.5).astype(int).flatten()
            accuracy = np.mean(predictions == y)
            
            history['loss'].append(loss)
            history['accuracy'].append(accuracy)
            
            # Backward pass
            self.backward(X, y, learning_rate)
            
            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.2%}")
        
        return history
    
    def predict(self, X):
        """Make predictions."""
        return (self.forward(X) > 0.5).astype(int).flatten()


# NOW we can solve XOR!
print("="*50)
print("Training MLP on XOR Gate")
print("="*50)

mlp = MLP([2, 4, 1], activation='sigmoid')  # 2 inputs, 4 hidden, 1 output
history = mlp.train(X_and, y_xor, epochs=2000, learning_rate=1.0)

print("\nResults (SUCCESS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = mlp.predict(x.reshape(1, -1))[0]
    print(f"  {x} -> {y_pred} (true: {y_true})")
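
The backward pass above starts from `delta = output - y`, which relies on the fact that the derivative of binary cross-entropy with respect to the pre-sigmoid value simplifies to `sigmoid(z) - y`. A finite-difference check is one way to confirm this. The logit, label, and step size below are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy of sigmoid(z) against label y."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0          # arbitrary logit and label for the check
h = 1e-6                 # finite-difference step

# Central difference approximates d(loss)/dz
numerical = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y

print(f"numerical = {numerical:.6f}, analytic = {analytic:.6f}")
```

The same kind of check, applied to every weight of a small network, is a common way to debug a hand-written backpropagation routine.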

How MLPs Solve XOR

The hidden layer creates a new representation where the problem becomes linearly separable:
Original Space         Hidden Space
    (0,1) ●─────● (1,1)         h₁
         │     │               ↗
         │ XOR │          • (0,1)   • (1,0)   → output 1
         │     │               ↓
    (0,0) ●─────● (1,0)   • (0,0)   • (1,1)   → output 0
                                    h₂
                            Now linearly separable!
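
The remapping can also be built by hand, without any training. The weights below are one hand-picked solution and an assumption of this sketch, not the ones the trained MLP above will find: the first hidden neuron behaves like OR, the second like AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Element-wise step activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 acts like OR, h2 acts like AND (hand-picked weights)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
H = step(X @ W_hidden + b_hidden)

# Output layer: a single line in hidden space, h1 - h2 > 0.5
w_out = np.array([1.0, -1.0])
b_out = -0.5
y = step(H @ w_out + b_out)

for x, h, out in zip(X, H, y):
    print(f"x = {x}  hidden = {h}  output = {out}")
```

In hidden space the inputs (0,1) and (1,0) collapse onto the same point, which is what lets one straight line separate the two classes.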

Visualizing the Decision Boundary

import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title):
    """Plot the decision boundary of a 2D classifier."""
    # Create mesh grid
    h = 0.01
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 50), cmap='RdBu_r', alpha=0.8)
    plt.colorbar(label='Prediction')
    
    # Plot training points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', 
                         edgecolors='black', s=200, linewidths=2)
    
    plt.xlabel('x₁')
    plt.ylabel('x₂')
    plt.title(title)
    plt.show()

# Visualize XOR solution
plot_decision_boundary(mlp, X_and, y_xor, "MLP Solving XOR")

Deeper Networks

Why Go Deep?

| Depth | Advantages | Challenges |
|---|---|---|
| Shallow (1-2 layers) | Easy to train, interpretable | Limited expressivity |
| Medium (3-5 layers) | Good balance | Standard training works |
| Deep (10+ layers) | Hierarchical features | Vanishing gradients |
| Very Deep (100+) | State-of-the-art | Requires special techniques |

The Depth vs Width Tradeoff

Depth separation (informal): there exist functions that a deep network of width $n$ can represent, but that a 2-layer network can only approximate with width on the order of $2^n$. In practice:
  • Deep narrow networks learn hierarchical features (more efficient) — they compose simple patterns into complex ones
  • Wide shallow networks have more brute-force capacity — they memorize rather than generalize
  • Modern architectures are both deep AND wide (but depth usually helps more)
The intuition: depth enables composition. A 3-layer network can represent “if (has_eyes AND has_fur) AND is_small, then cat” — where each layer handles one level of the logical hierarchy. A 1-layer network would need to memorize every possible pixel pattern that constitutes a cat, which requires exponentially more neurons.

A Deeper Network

# Deeper network for a more complex problem
from sklearn.datasets import make_moons

# Create non-linear dataset
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

# Normalize
X_moons = (X_moons - X_moons.mean(axis=0)) / X_moons.std(axis=0)

# Train deeper network
deep_mlp = MLP([2, 32, 32, 16, 1], activation='relu')
history = deep_mlp.train(X_moons, y_moons, epochs=2000, learning_rate=0.01)

# Visualize
plot_decision_boundary(deep_mlp, X_moons, y_moons, "Deep MLP on Moons Dataset")

PyTorch Implementation

Now let’s see how to build the same networks using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class PyTorchMLP(nn.Module):
    """MLP using PyTorch -- compare how much cleaner this is than our from-scratch version."""
    
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        
        layers = []
        prev_size = input_size
        
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))  # Linear transform: y = Wx + b
            layers.append(nn.ReLU())                          # Non-linearity after each layer
            prev_size = hidden_size
        
        # No activation on the output -- BCEWithLogitsLoss applies sigmoid internally
        # for numerical stability (avoids log(0) disasters)
        layers.append(nn.Linear(prev_size, output_size))
        
        # nn.Sequential chains layers so forward() just pipes data through
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)


# Create model
model = PyTorchMLP(input_size=2, hidden_sizes=[32, 32, 16], output_size=1)
print(model)

# Setup training
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data
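# torch.tensor with an explicit dtype replaces the legacy FloatTensor constructor
X_tensor = torch.tensor(X_moons, dtype=torch.float32)
y_tensor = torch.tensor(y_moons, dtype=torch.float32).reshape(-1, 1)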

# Train
for epoch in range(1000):
    # Forward
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        with torch.no_grad():
            preds = (torch.sigmoid(outputs) > 0.5).float()
            acc = (preds == y_tensor).float().mean()
            print(f"Epoch {epoch}: Loss = {loss.item():.4f}, Acc = {acc.item():.2%}")

Key Concepts Summary

| Concept | What It Means | Why It Matters |
|---|---|---|
| Perceptron | Single neuron with weighted inputs | Basic building block |
| Weights | How much each input matters | What the network learns |
| Bias | Threshold for activation | Shifts decision boundary |
| Activation | Non-linear function | Enables complex patterns |
| Hidden Layer | Intermediate processing | Creates useful representations |
| Backpropagation | Computing gradients layer by layer | How we train the network |
[Figure: MLP Forward and Backward Pass]

Exercises

Implement perceptrons for:
  1. OR gate
  2. NAND gate
  3. Can you create XOR using only NAND gates? (Hint: NAND is universal)
Create an animation showing how the decision boundary evolves during training:
# Store model states every N epochs
# Replay decision boundaries as animation
from matplotlib.animation import FuncAnimation
Compare networks of different depths on the moons dataset:
  • [2, 8, 1]
  • [2, 8, 8, 1]
  • [2, 8, 8, 8, 1]
  • [2, 8, 8, 8, 8, 1]
Plot training curves. At what depth does training become difficult?
Extend our MLP to classify MNIST digits:
  1. Load MNIST data
  2. Flatten images to 784-dimensional vectors
  3. Train a [784, 256, 128, 10] network
  4. Compare to our PyTorch version

What’s Next

Now that you understand how neurons compute and connect, let’s dive deep into how they learn:

Module 3: Backpropagation Deep Dive

The algorithm that makes learning possible — chain rule, computational graphs, and gradient flow.

Interview Deep-Dive

Strong Answer:
  • The Universal Approximation Theorem is an existence proof, not a training recipe. It says a sufficiently wide single-hidden-layer network can represent any continuous function, but it says nothing about whether gradient descent can find those weights efficiently, or how many neurons would be required.
  • In practice, the width required can grow exponentially with the complexity of the target function. A function expressible by a 10-layer network with 100 neurons per layer might require a single-layer network with $2^{10} = 1024$ or more neurons. Deep networks achieve exponential efficiency through composition: they build complex functions by composing simple ones, just like how a program with nested subroutines is more efficient than one giant function.
  • Depth enables hierarchical feature learning, which matches the structure of real-world data. Images have edges composed into textures composed into parts composed into objects. A deep network mirrors this hierarchy naturally; a shallow network must discover flat representations that implicitly encode all levels simultaneously.
  • From an optimization perspective, deep narrow networks are often argued to have smoother loss landscapes with better-connected low-loss regions than shallow wide networks. This can make them easier to train with gradient descent, despite the added challenge of vanishing gradients (which skip connections and normalization largely mitigate).
Follow-up: If depth is about compositionality, can you give a concrete example of a function that is exponentially cheaper to represent with depth?
Consider the parity function: given $n$ binary inputs, output 1 if an even number are 1. A single-hidden-layer network built from pattern detectors needs $O(2^n)$ neurons, because each relevant input pattern requires its own “detector.” A deep network can compute parity with $O(n)$ neurons: first XOR pairs, then XOR the results, cascading upward like a tournament bracket. Each layer composes the previous layer’s partial results, achieving exponential compression. This is the power of depth: it enables re-use of intermediate computations.
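
The tournament-bracket idea can be sketched with step neurons. Everything here is illustrative: `xor_unit` is a hand-wired three-neuron XOR, `parity_tournament` is a hypothetical helper, and the result is 1 for an odd number of ones (the even-parity output described above is simply its complement).

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

def xor_unit(a, b):
    """XOR of two bits from three step neurons (OR, AND, then OR-and-not-AND)."""
    h_or = step(a + b - 0.5)
    h_and = step(a + b - 1.5)
    return step(h_or - h_and - 0.5)

def parity_tournament(bits):
    """Reduce bits pairwise, bracket style; returns (result, neurons_used)."""
    level = list(bits)
    neurons = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(xor_unit(level[i], level[i + 1]))
            neurons += 3
        if len(level) % 2 == 1:      # odd one out advances unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0], neurons

n = 8
for bits in product([0, 1], repeat=n):
    result, neurons = parity_tournament(bits)
    assert result == sum(bits) % 2   # 1 when an odd number of inputs are 1

print(f"n = {n}: {neurons} neurons, versus {2 ** n} possible input patterns")
```

The neuron count grows as 3(n - 1), linear in the number of inputs, while the number of input patterns a flat detector bank would have to cover doubles with every extra bit.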
Strong Answer:
  • Each hidden layer performs two operations: an affine transformation (rotation, scaling, shearing, translation via $Wx + b$) followed by a non-linear warping (the activation function). Together, these fold, stretch, and warp the input space to make the data more linearly separable.
  • Consider the XOR problem in 2D: the four points (0,0), (0,1), (1,0), (1,1) labeled 0, 1, 1, 0 are not linearly separable. The hidden layer maps these points into a new space where they become linearly separable. Specifically, two hidden neurons can project the 2D input into a new 2D coordinate system where the classes fall on opposite sides of a line.
  • Mathematically, an MLP with ReLU activations partitions the input space into convex polytopes (flat-sided regions), with each region having its own linear function. As you add more neurons and layers, the number of regions grows combinatorially, allowing the network to approximate arbitrarily complex decision boundaries.
  • This is why we call it “representation learning” — the hidden layers learn to re-represent the data in a form where the final linear layer can trivially solve the task. The quality of a neural network is fundamentally the quality of its learned representations.
Follow-up: Why does the bias term matter? What happens if you remove all biases from an MLP?
Without biases, every hyperplane defined by a neuron must pass through the origin. This means the decision boundaries are constrained to radiate from a single point. For many problems, the optimal decision boundary does not pass through the origin, so the network would need to “waste” neurons creating an indirect path to the right boundary. Removing biases reduces representational capacity and can make certain simple functions (like the constant function $f(x) = 1$) impossible to represent. In practice, removing biases from hidden layers in deep networks has minimal impact (the next layer compensates), but removing them from the output layer or from batch normalization layers can cause real problems.
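
The "radiate from a single point" behaviour is easy to observe for a bias-free ReLU network. The sketch below uses random, untrained weights and a hypothetical 2-8-8-1 architecture with a linear output. It shows that the origin always maps to 0 and that scaling an input by a positive constant only scales the output.

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_free_mlp(x, weights):
    """Forward pass through ReLU layers with no bias terms anywhere."""
    a = x
    for W in weights[:-1]:
        a = np.maximum(0, a @ W)
    return a @ weights[-1]            # linear output, also without bias

# Random weights for a hypothetical 2-8-8-1 network
weights = [rng.normal(size=s) for s in [(2, 8), (8, 8), (8, 1)]]

origin = np.zeros((1, 2))
print(bias_free_mlp(origin, weights))   # the origin always maps to 0

# Scaling the input by c > 0 only scales the output, so the sign
# (and hence the predicted class) is constant along every ray.
x = rng.normal(size=(1, 2))
print(bias_free_mlp(3.0 * x, weights), 3.0 * bias_free_mlp(x, weights))
```

Because the sign of the output cannot change along a ray from the origin, such a network cannot separate two points that lie on the same ray, no matter how the weights are set.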
Strong Answer:
  • Zero initialization fails to break symmetry: if all weights start at zero, every neuron in a layer computes the same output, receives the same gradient, and makes the same update. They remain identical throughout training; effectively, you have one neuron replicated $n$ times, wasting all capacity. Random initialization ensures each neuron starts on a different trajectory and learns a different feature.
  • Small initialization prevents saturation: for sigmoid and tanh activations, large inputs push the activation into the flat (saturated) regions where the derivative is near zero. If weights are large, the pre-activation values $Wx + b$ will be large, gradients will vanish, and learning will stall from the very first step. For ReLU, very large weights can cause some neurons to produce extremely large activations in early layers, leading to numerical instability.
  • The specific scale matters and depends on the activation function. Xavier/Glorot initialization ($\text{Var}(w) = 2 / (n_{in} + n_{out})$) is designed for sigmoid/tanh: it preserves the variance of activations and gradients across layers. He initialization ($\text{Var}(w) = 2 / n_{in}$) is designed for ReLU: it accounts for the fact that ReLU zeroes out half the activations, so the surviving activations need twice the variance to maintain signal strength.
  • The intuition: initialization sets the starting point of optimization. A bad starting point (too large, too uniform) can place you in a region of the loss landscape where gradients are uninformative, making training either impossible or painfully slow.
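
The symmetry problem can be reproduced in a few lines. This is a minimal sketch with assumptions of its own: a 2-3-1 sigmoid network on XOR, every weight initialized to the same constant 0.5, plain full-batch gradient descent, and an arbitrary learning rate and step count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Every weight starts at the same constant, so the three hidden neurons are clones
W1 = np.full((2, 3), 0.5)
b1 = np.zeros(3)
W2 = np.full((3, 1), 0.5)
b2 = np.zeros(1)

lr = 1.0
for _ in range(50):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (BCE loss with sigmoid output)
    d_out = (out - y) / len(X)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(W1)   # the three columns (one per hidden neuron) are still identical
```

Replacing the constant with small random values is what lets the hidden neurons diverge and specialize.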
Follow-up: What is the Lottery Ticket Hypothesis, and how does it relate to initialization?
The Lottery Ticket Hypothesis (Frankle and Carbin, 2019) states that a randomly initialized dense network contains a sparse subnetwork (a “winning ticket”) that, when trained in isolation from the same initialization, reaches comparable accuracy. This suggests that the role of overparameterization at initialization is to ensure that at least one good subnetwork exists by chance. It implies that initialization is even more important than previously thought: the specific random seed determines which subnetworks are present, and the training process is essentially a search for the winning ticket within the initialized network.
Strong Answer:
  • Depth provides compositional expressiveness: each layer can build on the representations of the previous layer, enabling hierarchical feature learning. Width provides per-layer capacity: more neurons in a single layer can represent more diverse features at the same level of abstraction.
  • Prefer depth when the data has hierarchical structure (images, language, audio) because the compositional structure of deep networks naturally matches the compositional structure of the data. A 10-layer network with 256 neurons per layer will learn edge-to-texture-to-part-to-object hierarchies that a 2-layer network with 1280 neurons per layer cannot.
  • Prefer width when the data lacks hierarchical structure (some tabular problems), when training stability is a concern (shallow wide networks are easier to optimize), or when latency matters (wide shallow networks can be parallelized more effectively on hardware, while depth creates sequential dependencies).
  • In modern practice, the best architectures are both deep AND wide, with techniques like skip connections and normalization making deep training feasible. The trend in large language models is to scale both depth and width together, following scaling laws that predict optimal ratios given a compute budget.
Follow-up: EfficientNet introduced compound scaling, that is, scaling depth, width, and resolution together. Why does this work better than scaling any single dimension?
Scaling only depth eventually hits vanishing gradients and diminishing returns. Scaling only width has diminishing returns as additional neurons become redundant. Scaling only resolution increases input detail but the network lacks capacity to process it. EfficientNet’s insight is that these dimensions are interdependent: higher resolution inputs need deeper networks (more layers to process fine details) and wider networks (more channels to capture the additional information). The compound scaling coefficient ensures all three dimensions grow in balance, avoiding bottlenecks in any single dimension. This is analogous to scaling a factory: you need more workers (width), more assembly stages (depth), AND higher-quality raw materials (resolution) to increase output.