Perceptrons & Multi-Layer Networks
The Biological Inspiration
The Perceptron: One Artificial Neuron
Mathematical Formulation
Visual Representation
Building from Scratch
The Perceptron Learning Rule
Why This Works
Convergence Theorem
The XOR Problem: Why We Need Depth
Multi-Layer Perceptron (MLP)
The Universal Approximation Theorem
Architecture
Building an MLP from Scratch
How MLPs Solve XOR
Visualizing the Decision Boundary
Deeper Networks
Why Go Deep?
The Depth vs Width Tradeoff
A Deeper Network
PyTorch Implementation
Key Concepts Summary
Exercises
What’s Next

Perceptrons & Multi-Layer Networks

The Biological Inspiration

Your brain contains approximately 86 billion neurons, each connected to thousands of others. A single neuron:

Receives signals from other neurons through dendrites
Processes those signals in the cell body
Fires (or not) based on whether the combined signal exceeds a threshold
Transmits that signal to other neurons through its axon

In 1958, Frank Rosenblatt introduced the Perceptron, a simplified mathematical model of a single neuron. It’s remarkably simple, yet it laid much of the groundwork for modern deep learning.

The Perceptron: One Artificial Neuron

Mathematical Formulation

A perceptron computes:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$$

Where:

$\mathbf{x} = [x_1, x_2, ..., x_n]$ = input features
$\mathbf{w} = [w_1, w_2, ..., w_n]$ = weights (learnable)
$b$ = bias (learnable)
$f$ = activation function
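
As a quick worked example of the formula above, the sketch below evaluates one perceptron by hand. The weights, bias, and input are made-up values chosen for illustration, and `step` is a hypothetical helper standing in for the activation $f$.

```python
import numpy as np

def step(z):
    """Step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

# Illustrative values (assumptions, not taken from any dataset)
w = np.array([0.5, -0.25, 1.0])   # weights
x = np.array([1.0, 2.0, 0.5])     # input features
b = -0.25                         # bias

# Weighted sum: 0.5*1.0 + (-0.25)*2.0 + 1.0*0.5 - 0.25 = 0.25
z = np.dot(w, x) + b
y = step(z)

print(f"z = {z}, y = {y}")
```

Because the weighted sum is positive, the neuron fires. Lowering the bias below -0.5 would push the sum to zero or below and switch the output to 0, which is why the bias acts like a threshold.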

Visual Representation

     x₁ ──── w₁ ────┐
                    │
     x₂ ──── w₂ ────┼──► [Σ + b] ──► [f] ──► y
                    │
     x₃ ──── w₃ ────┘

Building from Scratch

import numpy as np

class Perceptron:
    """A single artificial neuron."""
    
    def __init__(self, n_inputs, activation='step'):
        """Initialize with random weights."""
        # Small random initial weights (zeros would also work for a single perceptron)
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.activation = activation
    
    def _activate(self, z):
        """Apply activation function."""
        if self.activation == 'step':
            return 1 if z > 0 else 0
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def forward(self, x):
        """Compute output for given input."""
        # Weighted sum
        z = np.dot(x, self.weights) + self.bias
        # Apply activation
        return self._activate(z)
    
    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self.forward(x) for x in X])
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        history = []
        
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                # Forward pass
                prediction = self.forward(xi)
                
                # Compute error
                error = yi - prediction
                
                # Update weights if prediction was wrong
                if error != 0:
                    self.weights += learning_rate * error * xi
                    self.bias += learning_rate * error
                    errors += 1
            
            accuracy = 1 - errors / len(y)
            history.append(accuracy)
            
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Accuracy = {accuracy:.2%}")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}!")
                break
        
        return history


# Test on AND gate
print("="*50)
print("Training Perceptron on AND Gate")
print("="*50)

X_and = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X_and, y_and)

print("\nResults:")
for x, y_true in zip(X_and, y_and):
    y_pred = perceptron.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")

The Perceptron Learning Rule

The training algorithm is beautifully simple:

FOR each training example (x, y):
    1. Compute prediction: ŷ = step(w·x + b)   (1 if w·x + b > 0, else 0)
    2. If ŷ ≠ y (wrong prediction):
        w = w + η(y - ŷ)x
        b = b + η(y - ŷ)
    3. If ŷ = y (correct): do nothing

Why This Works

If we predict 0 but should predict 1: increase weights in direction of x
If we predict 1 but should predict 0: decrease weights in direction of x
The learning rate $\eta$ controls how big each update is
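
To make the first case concrete, here is a minimal sketch of a single update, assuming zero initial weights and an illustrative learning rate of 0.1. The input is the (1, 1) row of the AND gate, which should produce 1.

```python
import numpy as np

eta = 0.1                      # learning rate (illustrative)
w = np.array([0.0, 0.0])       # start from zero weights for clarity
b = 0.0

x = np.array([1.0, 1.0])       # input (1, 1) from the AND gate
y_true = 1

z_before = np.dot(w, x) + b            # 0.0
y_pred = 1 if z_before > 0 else 0      # step(0) = 0, so the prediction is wrong
error = y_true - y_pred                # +1: we predicted 0 but wanted 1

# Perceptron update: move w toward x and raise the bias
w = w + eta * error * x
b = b + eta * error

z_after = np.dot(w, x) + b             # larger than z_before
print(z_before, z_after)
```

For the same input, the update changes the weighted sum by $\eta (y - \hat{y})(\|\mathbf{x}\|^2 + 1)$, so each correction pushes the sum toward the correct side of the threshold.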

Convergence Theorem

Perceptron Convergence Theorem: If the data is linearly separable, the perceptron algorithm makes only a finite number of weight updates before it finds weights that classify every training example correctly.
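
The classical form of this result (usually attributed to Novikoff) bounds the number of updates by $(R/\gamma)^2$, where $R$ is the largest input norm and $\gamma$ is the margin of some unit-length separating vector. The sketch below checks that bound on the AND gate. It makes several assumptions that differ from the class above: labels are -1/+1 instead of 0/1, the bias is folded into the weights by appending a constant 1 to each input, weights start at zero, and `w_star` is a hand-picked separator.

```python
import numpy as np

# AND gate with labels in {-1, +1}; a constant 1 is appended so the bias
# becomes part of the weight vector.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])

# A known separating direction for AND (x1 + x2 - 1.5 > 0), normalised.
w_star = np.array([1.0, 1.0, -1.5])
w_star /= np.linalg.norm(w_star)

R = np.max(np.linalg.norm(X, axis=1))   # largest input norm
gamma = np.min(y * (X @ w_star))        # margin of w_star on the data
bound = (R / gamma) ** 2

w = np.zeros(3)
mistakes = 0
for epoch in range(1000):
    errors_this_epoch = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:     # misclassified (or on the boundary)
            w += yi * xi
            mistakes += 1
            errors_this_epoch += 1
    if errors_this_epoch == 0:          # a full pass with no updates: converged
        break

print(f"updates made: {mistakes}, bound (R/gamma)^2: {bound:.1f}")
```

The bound depends only on the geometry of the data, not on the order in which the examples are presented.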

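Historical Note: Minsky & Papert’s 1969 book Perceptrons showed that a single perceptron can’t solve problems that are not linearly separable (like XOR). The book is often credited with contributing to the first “AI Winter”. Multi-layer networks can represent XOR, but a practical way to train them did not become widely used until backpropagation was popularized in the 1980s.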

The XOR Problem: Why We Need Depth

# XOR: output is 1 if inputs are DIFFERENT
print("="*50)
print("Training Perceptron on XOR Gate")
print("="*50)

y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X_and, y_xor, epochs=100)

print("\nResults (FAILS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = perceptron_xor.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")

The perceptron fails on XOR! Why?

XOR is not linearly separable — you cannot draw a single straight line to separate the 0s from the 1s. Solution: Stack multiple layers of neurons!
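
One way to see this numerically is to try many candidate lines and record the best accuracy any of them reaches. The sketch below is a brute-force illustration over an arbitrary grid of weights and biases, not a proof.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

# Candidate values for w1, w2, and b (an arbitrary illustrative grid)
grid = np.linspace(-2, 2, 41)
W1, W2, B = np.meshgrid(grid, grid, grid, indexing="ij")
params = np.stack([W1.ravel(), W2.ravel(), B.ravel()], axis=1)  # (41**3, 3)

# Pre-activations for every candidate on all four inputs: shape (41**3, 4)
Z = params[:, :2] @ X.T + params[:, 2:3]
preds = (Z > 0).astype(int)
accuracy = (preds == y_xor).mean(axis=1)

print(f"best accuracy over {len(params)} linear classifiers: {accuracy.max():.2f}")
```

No candidate classifies all four points. The best ones get three of four right, for example by computing OR and missing only the (1, 1) input.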

Multi-Layer Perceptron (MLP)

The Universal Approximation Theorem

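A feedforward network with a single hidden layer, a suitable non-linear activation (such as sigmoid), and enough hidden neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to any desired accuracy.

In other words: even one hidden layer is expressive enough in principle. The theorem guarantees that good weights exist; it does not say how many neurons are needed or that training will actually find those weights.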
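
As a rough illustration of the approximation idea (not a demonstration of the theorem itself), the sketch below fits $\sin(x)$ on $[-\pi, \pi]$ with one hidden layer of tanh units. It takes a shortcut that is purely an assumption of this example: the hidden weights are random and fixed, and only the output weights are solved with least squares. The helper name `fit_one_hidden_layer` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
target = np.sin(x).ravel()

def fit_one_hidden_layer(n_hidden):
    """Random tanh hidden layer; output weights fitted by least squares."""
    W = rng.normal(scale=2.0, size=(1, n_hidden))   # hidden weights (fixed)
    b = rng.normal(scale=2.0, size=n_hidden)        # hidden biases (fixed)
    H = np.tanh(x @ W + b)                          # hidden activations
    H = np.hstack([H, np.ones((len(x), 1))])        # output bias column
    v, *_ = np.linalg.lstsq(H, target, rcond=None)  # output weights
    return np.max(np.abs(H @ v - target))           # worst-case error on the grid

for n_hidden in (2, 10, 50):
    print(f"{n_hidden:>3} hidden units: max error = {fit_one_hidden_layer(n_hidden):.2e}")
```

The error is measured on the same grid used for fitting, so this only shows that a wider hidden layer can represent the curve more closely, not that it generalizes.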

Architecture

Input Layer      Hidden Layer      Output Layer

    x₁ ────────┬────► h₁ ────┐
               │             │
    x₂ ────────┼────► h₂ ────┼────► y
               │             │
    x₃ ────────┴────► h₃ ────┘

Each connection has its own weight. Each hidden and output neuron has its own bias.
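
To make the bookkeeping concrete, the sketch below builds the weight matrices and bias vectors for the 3-3-1 network in the diagram and counts its parameters. The random values, the batch size of 5, and the tanh hidden activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_sizes = [3, 3, 1]          # 3 inputs, 3 hidden neurons, 1 output

# One weight matrix and one bias vector per layer of connections
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# 3*3 + 3 parameters into the hidden layer, 3*1 + 1 into the output
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)

x = rng.normal(size=(5, 3))               # a batch of 5 samples, 3 features each
h = np.tanh(x @ weights[0] + biases[0])   # hidden activations, shape (5, 3)
out = h @ weights[1] + biases[1]          # output pre-activation, shape (5, 1)

print([W.shape for W in weights], n_params, h.shape, out.shape)
```

Storing each layer as an `(n_in, n_out)` matrix lets a whole batch pass through the layer with a single matrix product.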

Building an MLP from Scratch

class MLP:
    """Multi-Layer Perceptron from scratch."""
    
    def __init__(self, layer_sizes, activation='sigmoid'):
        """
        Initialize network with given layer sizes.
        
        Args:
            layer_sizes: List like [input_size, hidden1, hidden2, ..., output_size]
        """
        self.n_layers = len(layer_sizes) - 1
        self.activation = activation
        
        # Initialize weights and biases for each layer
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            # He initialization for ReLU; 1/fan_in scaling (a Xavier-style variant) for sigmoid/tanh
            scale = np.sqrt(2.0 / layer_sizes[i]) if activation == 'relu' else \
                    np.sqrt(1.0 / layer_sizes[i])
            
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros(layer_sizes[i+1])
            
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z, derivative=False):
        """Apply activation function (or its derivative)."""
        if self.activation == 'sigmoid':
            sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))
            if derivative:
                return sig * (1 - sig)
            return sig
        elif self.activation == 'relu':
            if derivative:
                return (z > 0).astype(float)
            return np.maximum(0, z)
        elif self.activation == 'tanh':
            if derivative:
                return 1 - np.tanh(z)**2
            return np.tanh(z)
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []       # Pre-activation values
        
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            
            # Hidden layers use the configured activation; the output layer always uses sigmoid
            if i == self.n_layers - 1:  # Output layer
                current = self._sigmoid(z)  # For binary classification
            else:
                current = self._activate(z)
            
            self.activations.append(current)
        
        return current
    
    def _sigmoid(self, z):
        """Sigmoid for output layer."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def backward(self, X, y, learning_rate=0.01):
        """Backward pass (backpropagation)."""
        m = len(X)
        
        # Output layer error
        output = self.activations[-1]
        delta = output - y.reshape(-1, 1)  # Derivative of BCE loss with sigmoid
        
        # Backpropagate through layers
        for i in range(self.n_layers - 1, -1, -1):
            # Gradient for weights and biases
            dW = self.activations[i].T @ delta / m
            db = np.mean(delta, axis=0)
            
            # Propagate error to previous layer
            if i > 0:
                delta = (delta @ self.weights[i].T) * self._activate(
                    self.z_values[i-1], derivative=True
                )
            
            # Update weights and biases
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db
    
    def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):
        """Train the network."""
        history = {'loss': [], 'accuracy': []}
        
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            
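            # Compute loss (binary cross-entropy). y is reshaped to a column so it
            # lines up with output (m, 1) instead of broadcasting to (m, m).
            eps = 1e-8
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + eps) + (1 - y_col) * np.log(1 - output + eps))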
            
            # Compute accuracy
            predictions = (output > 0.5).astype(int).flatten()
            accuracy = np.mean(predictions == y)
            
            history['loss'].append(loss)
            history['accuracy'].append(accuracy)
            
            # Backward pass
            self.backward(X, y, learning_rate)
            
            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.2%}")
        
        return history
    
    def predict(self, X):
        """Make predictions."""
        return (self.forward(X) > 0.5).astype(int).flatten()


# NOW we can solve XOR!
print("="*50)
print("Training MLP on XOR Gate")
print("="*50)

mlp = MLP([2, 4, 1], activation='sigmoid')  # 2 inputs, 4 hidden, 1 output
history = mlp.train(X_and, y_xor, epochs=2000, learning_rate=1.0)

print("\nResults (SUCCESS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = mlp.predict(x.reshape(1, -1))[0]
    print(f"  {x} -> {y_pred} (true: {y_true})")
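
The `backward` method above starts from `delta = output - y`, relying on the fact that the gradient of binary cross-entropy with respect to the pre-sigmoid value simplifies to that difference. The sketch below checks this identity numerically with a central finite difference. The helper names and the sample values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy of sigmoid(z) against label y."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([-2.0, -0.5, 0.3, 1.7])   # illustrative pre-activations
y = np.array([0.0, 1.0, 1.0, 0.0])     # illustrative labels

analytic = sigmoid(z) - y              # the shortcut used for delta

h = 1e-6                               # finite-difference step
numeric = (bce(z + h, y) - bce(z - h, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```

A check like this is a cheap way to catch sign or shape mistakes when writing backpropagation by hand.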

How MLPs Solve XOR

The hidden layer creates a new representation where the problem becomes linearly separable:

Original Space         Hidden Space
    (0,1) ●─────● (1,1)         h₁
         │     │               ↗
         │ XOR │          • (0,1)   • (1,0)   → output 1
         │     │               ↓
    (0,0) ●─────● (1,0)   • (0,0)   • (1,1)   → output 0
                                    h₂
                            Now linearly separable!
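
A trained network finds its own hidden representation, but the idea can be shown with hand-picked weights. The sketch below is one possible construction, not what the MLP above necessarily learns: one hidden unit computes OR, the other computes NAND, and the output unit ANDs them together. Step activations are used to keep the arithmetic exact.

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 computes OR, column 1 computes NAND (hand-picked)
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output neuron: AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

H = step(X @ W_hidden + b_hidden)   # hidden representation of each input
y = step(H @ w_out + b_out)

for x, h, out in zip(X, H, y):
    print(f"{x} -> hidden {h} -> output {out}")
```

In the hidden space both class-1 inputs land on the same point (1, 1), while (0, 0) and (1, 1) map to (0, 1) and (1, 0). The single line h₁ + h₂ = 1.5 then separates the two classes.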

Visualizing the Decision Boundary

import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title):
    """Plot the decision boundary of a 2D classifier."""
    # Create mesh grid
    h = 0.01
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 50), cmap='RdBu_r', alpha=0.8)
    plt.colorbar(label='Prediction')
    
    # Plot training points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', 
                         edgecolors='black', s=200, linewidths=2)
    
    plt.xlabel('x₁')
    plt.ylabel('x₂')
    plt.title(title)
    plt.show()

# Visualize XOR solution
plot_decision_boundary(mlp, X_and, y_xor, "MLP Solving XOR")

Deeper Networks

Why Go Deep?

Depth	Advantages	Challenges
Shallow (1-2 layers)	Easy to train, interpretable	Limited expressivity
Medium (3-5 layers)	Good balance	Standard training works
Deep (10+ layers)	Hierarchical features	Vanishing gradients
Very Deep (100+)	State-of-the-art	Requires special techniques
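
The "vanishing gradients" entry can be made concrete for sigmoid activations. During backpropagation each layer multiplies the error signal by the activation's derivative, and the sigmoid derivative never exceeds 0.25. The sketch below computes that maximum and the resulting upper bound across several depths. It deliberately ignores the weight matrices, which also enter the product, so treat it as a simplified picture.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.linspace(-10, 10, 2001)
max_slope = sigmoid_derivative(z).max()     # attained at z = 0

for depth in (2, 5, 10, 20):
    # Upper bound on the product of activation derivatives across `depth` layers
    print(f"{depth:>2} layers: factor <= {max_slope ** depth:.2e}")
```

ReLU avoids this particular shrinkage because its derivative is exactly 1 wherever the unit is active.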

The Depth vs Width Tradeoff

Theorem (informal): There are functions that a deep network of width $n$ can represent, but that a 2-layer network can only approximate with width on the order of $2^n$. In practice:

Deep narrow networks learn hierarchical features (more efficient)
Wide shallow networks have more brute-force capacity
Modern architectures are both deep AND wide (but depth usually helps more)
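
A standard illustration of why depth can be exponentially more efficient is a sawtooth built from a "tent" function. Each layer below uses only two ReLU units, yet stacking layers doubles the number of peaks each time. This is a sketch of that construction with hand-set weights, not a trained network.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """One layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

x = np.linspace(0.0, 1.0, 2**12 + 1)   # dyadic grid, so peaks land on grid points

peak_counts = {}
f = x.copy()
for depth in range(1, 6):
    f = tent(f)                                    # stack one more 2-unit layer
    peak_counts[depth] = int(np.isclose(f, 1.0).sum())
    print(f"depth {depth}: {peak_counts[depth]} peaks")
```

After k layers (2k ReLU units in total) the function has 2^(k-1) peaks and 2^k linear pieces. A single hidden layer of m ReLU units on a scalar input can produce at most m + 1 linear pieces, so matching the depth-k sawtooth exactly would take on the order of 2^k units.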

A Deeper Network

# Deeper network for a more complex problem
from sklearn.datasets import make_moons

# Create non-linear dataset
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

# Normalize
X_moons = (X_moons - X_moons.mean(axis=0)) / X_moons.std(axis=0)

# Train deeper network
deep_mlp = MLP([2, 32, 32, 16, 1], activation='relu')
history = deep_mlp.train(X_moons, y_moons, epochs=2000, learning_rate=0.01)

# Visualize
plot_decision_boundary(deep_mlp, X_moons, y_moons, "Deep MLP on Moons Dataset")

PyTorch Implementation

Now let’s see how to build the same networks using PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

class PyTorchMLP(nn.Module):
    """MLP using PyTorch."""
    
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        
        layers = []
        prev_size = input_size
        
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        
        layers.append(nn.Linear(prev_size, output_size))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)


# Create model
model = PyTorchMLP(input_size=2, hidden_sizes=[32, 32, 16], output_size=1)
print(model)

# Setup training
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data
X_tensor = torch.tensor(X_moons, dtype=torch.float32)
y_tensor = torch.tensor(y_moons, dtype=torch.float32).reshape(-1, 1)

# Train
for epoch in range(1000):
    # Forward
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        with torch.no_grad():
            preds = (torch.sigmoid(outputs) > 0.5).float()
            acc = (preds == y_tensor).float().mean()
            print(f"Epoch {epoch}: Loss = {loss.item():.4f}, Acc = {acc.item():.2%}")

Key Concepts Summary

Concept	What It Means	Why It Matters
Perceptron	Single neuron with weighted inputs	Basic building block
Weights	How much each input matters	What the network learns
Bias	Threshold for activation	Shifts decision boundary
Activation	Non-linear function	Enables complex patterns
Hidden Layer	Intermediate processing	Creates useful representations
Backpropagation	Computing gradients layer by layer	How we train the network

Exercises

Exercise 1: Logic Gates

Implement perceptrons for:

OR gate
NAND gate
Can you create XOR using only NAND gates? (Hint: NAND is universal)

Exercise 2: Visualization

Create an animation showing how the decision boundary evolves during training:

# Store model states every N epochs
# Replay decision boundaries as animation
from matplotlib.animation import FuncAnimation

Exercise 3: Depth Experiments

Compare networks of different depths on the moons dataset:

[2, 8, 1]
[2, 8, 8, 1]
[2, 8, 8, 8, 1]
[2, 8, 8, 8, 8, 1]

Plot training curves. At what depth does training become difficult?

Exercise 4: MNIST from Scratch

Extend our MLP to classify MNIST digits:

Load MNIST data
Flatten images to 784-dimensional vectors
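Train a [784, 256, 128, 10] network (with 10 classes, the output layer needs softmax and cross-entropy loss instead of a single sigmoid)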
Compare to our PyTorch version

What’s Next

Now that you understand how neurons compute and connect, let’s dive deep into how they learn:

Module 3: Backpropagation Deep Dive

The algorithm that makes learning possible — chain rule, computational graphs, and gradient flow.
