Perceptrons & Multi-Layer Networks
The Biological Inspiration
Your brain contains approximately 86 billion neurons, each connected to thousands of others. A single neuron:
Receives signals from other neurons through dendrites
Processes those signals in the cell body
Fires (or not) based on whether the combined signal exceeds a threshold
Transmits that signal to other neurons through its axon
In 1958, Frank Rosenblatt created the Perceptron — a mathematical model of a single neuron. It’s remarkably simple, yet it laid the foundation for all modern deep learning.
The Perceptron: One Artificial Neuron
A perceptron computes:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$$
Where:
$\mathbf{x} = [x_1, x_2, \ldots, x_n]$ = input features
$\mathbf{w} = [w_1, w_2, \ldots, w_n]$ = weights (learnable)
$b$ = bias (learnable)
$f$ = activation function
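As a quick check of the formula, here is a minimal sketch that evaluates one perceptron by hand. The input, weight, and bias values are made up for illustration, and a step function stands in for the activation f.

```python
import numpy as np

# Illustrative values only (not taken from any trained model)
x = np.array([1.0, 0.0, 1.0])    # input features
w = np.array([0.5, -0.3, 0.2])   # weights
b = -0.6                         # bias

z = np.dot(w, x) + b             # weighted sum: 0.5 + 0.0 + 0.2 - 0.6
y = 1 if z > 0 else 0            # step activation

print(f"z = {z:.2f}, y = {y}")
```

The weighted sum is slightly positive (about 0.1), so the step activation fires and the output is 1.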
Visual Representation
x₁ ──── w₁ ────┐
               │
x₂ ──── w₂ ────┼──► [Σ + b] ──► [f] ──► y
               │
x₃ ──── w₃ ────┘
Building from Scratch
import numpy as np


class Perceptron:
    """A single artificial neuron."""

    def __init__(self, n_inputs, activation='step'):
        """Initialize with random weights."""
        # Small random weights for symmetry breaking
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.activation = activation

    def _activate(self, z):
        """Apply activation function."""
        if self.activation == 'step':
            return 1 if z > 0 else 0
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, x):
        """Compute output for given input."""
        # Weighted sum
        z = np.dot(x, self.weights) + self.bias
        # Apply activation
        return self._activate(z)

    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self.forward(x) for x in X])

    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        history = []
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                # Forward pass
                prediction = self.forward(xi)
                # Compute error
                error = yi - prediction
                # Update weights if prediction was wrong
                if error != 0:
                    self.weights += learning_rate * error * xi
                    self.bias += learning_rate * error
                    errors += 1
            accuracy = 1 - errors / len(y)
            history.append(accuracy)
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Accuracy = {accuracy:.2%}")
            if errors == 0:
                print(f"Converged at epoch {epoch}!")
                break
        return history
# Test on AND gate
print("=" * 50)
print("Training Perceptron on AND Gate")
print("=" * 50)

X_and = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X_and, y_and)

print("\nResults:")
for x, y_true in zip(X_and, y_and):
    y_pred = perceptron.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")
The Perceptron Learning Rule
The training algorithm is beautifully simple:
FOR each training example (x, y):
    1. Compute prediction: ŷ = step(w·x + b), where step(z) = 1 if z > 0, else 0
    2. If ŷ ≠ y (wrong prediction):
           w = w + η(y - ŷ)x
           b = b + η(y - ŷ)
    3. If ŷ = y (correct): do nothing
Why This Works
If we predict 0 but should predict 1: increase weights in direction of x
If we predict 1 but should predict 0: decrease weights in direction of x
The learning rate $\eta$ controls how big each update is
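To make the first case concrete, the sketch below applies one update by hand. All values are illustrative; the point is only that after the update the weighted sum for the same input moves toward the correct side of the threshold.

```python
import numpy as np

eta = 0.1                        # learning rate
w = np.array([0.0, 0.0])         # current weights (illustrative)
b = 0.0                          # current bias
x = np.array([1.0, 1.0])         # training input
y_true = 1                       # desired output

z_before = np.dot(w, x) + b
y_hat = 1 if z_before > 0 else 0     # z is 0, so the prediction is 0 (wrong)
error = y_true - y_hat               # +1: we predicted 0 but wanted 1

w_new = w + eta * error * x          # moves w in the direction of x
b_new = b + eta * error

z_after = np.dot(w_new, x) + b_new
print(f"z before: {z_before:.2f}, z after: {z_after:.2f}")
```

The weighted sum rises by η(‖x‖² + 1), here about 0.3, so the same input is now classified as 1. When the error is -1, the same formula lowers the sum instead.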
Convergence Theorem
Perceptron Convergence Theorem: If the training data is linearly separable, the perceptron algorithm converges to a separating solution after a finite number of weight updates.
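The classical form of this result (due to Novikoff) also bounds how many updates are needed. The statement below assumes labels recoded as $y_i \in \{-1, +1\}$, the bias folded into the weight vector, and weights starting at zero, which differs from the 0/1 convention used in the code here. The symbols $R$, $\gamma$, and $\mathbf{w}^*$ are introduced only for this statement.

```latex
\text{If } \|\mathbf{x}_i\| \le R \text{ for all } i, \text{ and some unit vector } \mathbf{w}^*
\text{ satisfies } y_i \,(\mathbf{w}^* \cdot \mathbf{x}_i) \ge \gamma > 0 \text{ for all } i, \text{ then}

\text{number of updates} \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```

A larger margin $\gamma$ between the classes means fewer mistakes before convergence; the bound does not depend on the number of training examples.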
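Historical Note: Minsky & Papert’s 1969 book Perceptrons showed that a single perceptron cannot solve problems that are not linearly separable (like XOR). The book is widely credited with contributing to the first “AI Winter”. Networks with multiple layers can represent XOR, but a practical way to train them only became widely known when backpropagation was popularized in the 1980s.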
The XOR Problem: Why We Need Depth
# XOR: output is 1 if inputs are DIFFERENT
print("=" * 50)
print("Training Perceptron on XOR Gate")
print("=" * 50)

y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X_and, y_xor, epochs=100)

print("\nResults (FAILS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = perceptron_xor.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")
The perceptron fails on XOR! Why?
XOR is not linearly separable — you cannot draw a single straight line to separate the 0s from the 1s.
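A short argument shows why no choice of weights can work. Assuming the step activation used earlier (output 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$), the four XOR examples impose these constraints:

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}

\text{Adding the two middle lines: } w_1 + w_2 + 2b > 0
\;\Rightarrow\; w_1 + w_2 + b > -b \ge 0,
\text{ which contradicts the last line.}
```

Since the constraints cannot all hold at once, the failure above is not a matter of too few epochs or a poor learning rate.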
Solution: Stack multiple layers of neurons!
Multi-Layer Perceptron (MLP)
The Universal Approximation Theorem
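A neural network with a single hidden layer containing a finite number of neurons and a suitable non-linear activation (such as sigmoid) can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy.
In other words: a large enough network can represent almost any input-output mapping. The theorem only guarantees that suitable weights exist. It does not say how many neurons are needed, and it does not promise that training will find those weights.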
Architecture
Input Layer      Hidden Layer      Output Layer

x₁ ────────┐
           ├────► h₁ ────┐
x₂ ────────┤             │
           ├────► h₂ ────┼────► y
x₃ ────────┤             │
           └────► h₃ ────┘
Each connection has its own weight. Each hidden neuron has its own bias.
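One way to see what this means in practice is to count the learnable parameters. The helper below is a small sketch (`count_parameters` is a name made up for this example); it assumes fully connected layers, with one bias per neuron in every layer after the input, including the output neuron.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes is a list like [input_size, hidden1, ..., output_size].
    """
    # One weight per connection between consecutive layers
    n_weights = sum(n_in * n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    # One bias per neuron in every layer after the input
    n_biases = sum(layer_sizes[1:])
    return n_weights, n_biases


# The 3-3-1 network from the diagram
weights, biases = count_parameters([3, 3, 1])
print(f"{weights} weights, {biases} biases")
```

For the 3-3-1 network this gives 3×3 + 3×1 = 12 weights and 3 + 1 = 4 biases.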
Building an MLP from Scratch
class MLP:
    """Multi-Layer Perceptron from scratch."""

    def __init__(self, layer_sizes, activation='sigmoid'):
        """
        Initialize network with given layer sizes.

        Args:
            layer_sizes: List like [input_size, hidden1, hidden2, ..., output_size]
        """
        self.n_layers = len(layer_sizes) - 1
        self.activation = activation
        # Initialize weights and biases for each layer
        self.weights = []
        self.biases = []
        for i in range(self.n_layers):
            # He initialization for ReLU, Xavier for sigmoid/tanh
            scale = np.sqrt(2.0 / layer_sizes[i]) if activation == 'relu' else \
                np.sqrt(1.0 / layer_sizes[i])
            W = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * scale
            b = np.zeros(layer_sizes[i + 1])
            self.weights.append(W)
            self.biases.append(b)

    def _activate(self, z, derivative=False):
        """Apply activation function (or its derivative)."""
        if self.activation == 'sigmoid':
            sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))
            if derivative:
                return sig * (1 - sig)
            return sig
        elif self.activation == 'relu':
            if derivative:
                return (z > 0).astype(float)
            return np.maximum(0, z)
        elif self.activation == 'tanh':
            if derivative:
                return 1 - np.tanh(z) ** 2
            return np.tanh(z)
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown activation: {self.activation}")

    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []      # Pre-activation values
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            # Hidden layers use the configured activation;
            # the output layer always uses sigmoid (binary classification)
            if i == self.n_layers - 1:  # Output layer
                current = self._sigmoid(z)
            else:
                current = self._activate(z)
            self.activations.append(current)
        return current

    def _sigmoid(self, z):
        """Sigmoid for output layer."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def backward(self, X, y, learning_rate=0.01):
        """Backward pass (backpropagation)."""
        m = len(X)
        # Output layer error
        output = self.activations[-1]
        delta = output - y.reshape(-1, 1)  # Derivative of BCE loss with sigmoid
        # Backpropagate through layers
        for i in range(self.n_layers - 1, -1, -1):
            # Gradient for weights and biases
            dW = self.activations[i].T @ delta / m
            db = np.mean(delta, axis=0)
            # Propagate error to previous layer
            if i > 0:
                delta = (delta @ self.weights[i].T) * self._activate(
                    self.z_values[i - 1], derivative=True
                )
            # Update weights and biases
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db

    def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):
        """Train the network."""
        history = {'loss': [], 'accuracy': []}
        # Column vector, so the loss is computed element-wise against the
        # (m, 1) output instead of broadcasting to an (m, m) matrix
        y_col = y.reshape(-1, 1)
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Compute loss (binary cross-entropy)
            eps = 1e-8
            loss = -np.mean(y_col * np.log(output + eps)
                            + (1 - y_col) * np.log(1 - output + eps))
            # Compute accuracy
            predictions = (output > 0.5).astype(int).flatten()
            accuracy = np.mean(predictions == y)
            history['loss'].append(loss)
            history['accuracy'].append(accuracy)
            # Backward pass
            self.backward(X, y, learning_rate)
            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.2%}")
        return history

    def predict(self, X):
        """Make predictions."""
        return (self.forward(X) > 0.5).astype(int).flatten()

# NOW we can solve XOR!
print("=" * 50)
print("Training MLP on XOR Gate")
print("=" * 50)

mlp = MLP([2, 4, 1], activation='sigmoid')  # 2 inputs, 4 hidden, 1 output
history = mlp.train(X_and, y_xor, epochs=2000, learning_rate=1.0)

print("\nResults (SUCCESS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = mlp.predict(x.reshape(1, -1))[0]
    print(f"  {x} -> {y_pred} (true: {y_true})")
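The comment "Derivative of BCE loss with sigmoid" in `backward` compresses a short derivation. For a single example with label $y$, pre-activation $z$ at the output, and prediction $\hat{y} = \sigma(z)$, the chain rule gives:

```latex
L = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big],
\qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

\frac{\partial L}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y})

\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \hat{y} - y
```

The $\hat{y}(1 - \hat{y})$ factors cancel, which is why the output-layer `delta` is simply `output - y` with no explicit sigmoid derivative.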
How MLPs Solve XOR
The hidden layer creates a new representation where the problem becomes linearly separable:
Original Space              Hidden Space

(0,1) ●─────● (1,1)         h₁
      │     │               ↗
      │ XOR │       • (0,1)   • (1,0)   → output 1
      │     │               ↓
(0,0) ●─────● (1,0) • (0,0)   • (1,1)   → output 0
                            h₂
                    Now linearly separable!
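The same idea can be checked with a tiny network whose weights are picked by hand rather than learned. This is a sketch under that assumption: the first hidden unit behaves like OR, the second like NAND, and the output unit is an AND of the two. A trained MLP will generally find a different, but equally valid, representation.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not learned) hidden layer: column 0 acts as OR, column 1 as NAND
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Hand-picked output neuron: AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

H = step(X @ W_hidden + b_hidden)   # hidden representation of each input
y = step(H @ w_out + b_out)         # final prediction

for x, h, out in zip(X, H, y):
    print(f"{x} -> hidden {h} -> output {out}")
```

In the hidden space, both inputs that should output 1 land on the point (1, 1), while the other two land on (0, 1) and (1, 0). The single line h₁ + h₂ = 1.5 now separates the classes.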
Visualizing the Decision Boundary
import matplotlib.pyplot as plt


def plot_decision_boundary(model, X, y, title):
    """Plot the decision boundary of a 2D classifier."""
    # Create mesh grid
    h = 0.01
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict on mesh
    Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 50), cmap='RdBu_r', alpha=0.8)
    plt.colorbar(label='Prediction')
    # Plot training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r',
                edgecolors='black', s=200, linewidths=2)
    plt.xlabel('x₁')
    plt.ylabel('x₂')
    plt.title(title)
    plt.show()


# Visualize XOR solution
plot_decision_boundary(mlp, X_and, y_xor, "MLP Solving XOR")
Deeper Networks
Why Go Deep?
| Depth | Advantages | Challenges |
|---|---|---|
| Shallow (1-2 layers) | Easy to train, interpretable | Limited expressivity |
| Medium (3-5 layers) | Good balance | Standard training works |
| Deep (10+ layers) | Hierarchical features | Vanishing gradients |
| Very Deep (100+) | State-of-the-art | Requires special techniques |
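The "vanishing gradients" entry can be made concrete with a rough calculation. During backpropagation each sigmoid layer multiplies the error signal by the sigmoid's derivative, which is never larger than 0.25. The sketch below looks only at that factor and ignores the weight matrices, so it is an illustration of the trend and not a precise gradient estimate.

```python
import numpy as np

def sigmoid_derivative(z):
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

max_slope = sigmoid_derivative(0.0)   # the sigmoid is steepest at z = 0

for depth in [2, 5, 10, 100]:
    # Best case for the product of `depth` sigmoid derivatives
    print(f"{depth:>3} layers: activation factor at most {max_slope ** depth:.3e}")
```

Even in the best case the factor shrinks geometrically with depth, which is one reason deep sigmoid networks are hard to train and why ReLU and other techniques are common in deeper models.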
The Depth vs Width Tradeoff
Theorem (informal): For some families of functions, a deep network of width about $n$ can represent the function, while a 2-layer network needs a width on the order of $2^n$ to approximate it.
In practice:
Deep narrow networks learn hierarchical features (more efficient)
Wide shallow networks have more brute-force capacity
Modern architectures are both deep AND wide (but depth usually helps more)
A Deeper Network
# Deeper network for a more complex problem
from sklearn.datasets import make_moons

# Create non-linear dataset
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)
# Normalize
X_moons = (X_moons - X_moons.mean(axis=0)) / X_moons.std(axis=0)
# Train deeper network
deep_mlp = MLP([2, 32, 32, 16, 1], activation='relu')
history = deep_mlp.train(X_moons, y_moons, epochs=2000, learning_rate=0.01)
# Visualize
plot_decision_boundary(deep_mlp, X_moons, y_moons, "Deep MLP on Moons Dataset")
PyTorch Implementation
Now let’s see how to build the same networks using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim


class PyTorchMLP(nn.Module):
    """MLP using PyTorch."""

    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        layers = []
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        # No sigmoid on the last layer: BCEWithLogitsLoss expects raw logits
        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create model
model = PyTorchMLP(input_size=2, hidden_sizes=[32, 32, 16], output_size=1)
print(model)

# Setup training
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data
X_tensor = torch.FloatTensor(X_moons)
y_tensor = torch.FloatTensor(y_moons).reshape(-1, 1)

# Train
for epoch in range(1000):
    # Forward
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 200 == 0:
        with torch.no_grad():
            # Apply sigmoid here to turn logits into probabilities
            preds = (torch.sigmoid(outputs) > 0.5).float()
            acc = (preds == y_tensor).float().mean()
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}, Acc = {acc.item():.2%}")
Key Concepts Summary
| Concept | What It Means | Why It Matters |
|---|---|---|
| Perceptron | Single neuron with weighted inputs | Basic building block |
| Weights | How much each input matters | What the network learns |
| Bias | Threshold for activation | Shifts decision boundary |
| Activation | Non-linear function | Enables complex patterns |
| Hidden Layer | Intermediate processing | Creates useful representations |
| Backpropagation | Computing gradients layer by layer | How we train the network |
Exercises
Exercise 1: Logic Gates
Implement perceptrons for:
OR gate
NAND gate
Can you create XOR using only NAND gates? (Hint: NAND is universal)
Exercise 2: Visualization
Create an animation showing how the decision boundary evolves during training:
# Store model states every N epochs
# Replay decision boundaries as animation
from matplotlib.animation import FuncAnimation
Exercise 3: Depth Experiments
Compare networks of different depths on the moons dataset:
[2, 8, 1]
[2, 8, 8, 1]
[2, 8, 8, 8, 1]
[2, 8, 8, 8, 8, 1]
Plot training curves. At what depth does training become difficult?
Exercise 4: MNIST from Scratch
Extend our MLP to classify MNIST digits:
Load MNIST data
Flatten images to 784-dimensional vectors
Train a [784, 256, 128, 10] network
Compare to our PyTorch version
What’s Next
Now that you understand how neurons compute and connect, let’s dive deep into how they learn.