[Figure: Perceptron concept]

Perceptrons & Multi-Layer Networks

The Biological Inspiration

Your brain contains approximately 86 billion neurons, each connected to thousands of others. A single neuron:
  1. Receives signals from other neurons through dendrites
  2. Processes those signals in the cell body
  3. Fires (or not) based on whether the combined signal exceeds a threshold
  4. Transmits that signal to other neurons through its axon
In 1958, Frank Rosenblatt created the Perceptron — a mathematical model of a single neuron. It’s remarkably simple, yet it laid the foundation for all modern deep learning.
[Figure: Biological vs artificial neuron]

The Perceptron: One Artificial Neuron

Mathematical Formulation

A perceptron computes:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$$

Where:
  • $\mathbf{x} = [x_1, x_2, \dots, x_n]$ = input features
  • $\mathbf{w} = [w_1, w_2, \dots, w_n]$ = weights (learnable)
  • $b$ = bias (learnable)
  • $f$ = activation function
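To make the formula concrete, here is a small worked computation. The weights, bias, and input values are made up for illustration, and the activation is assumed to be a step function that fires when the weighted sum is positive.

```python
import numpy as np

# Illustrative values only (not taken from any trained model)
w = np.array([0.5, -0.2, 0.1])   # weights
x = np.array([1.0, 2.0, 3.0])    # input features
b = -0.3                         # bias

# Weighted sum: z = w . x + b
z = float(np.dot(w, x) + b)

# Step activation: output 1 only if z is positive
y = 1 if z > 0 else 0

print(f"z = {z:.2f}, y = {y}")
```

Here the weighted sum is 0.5 - 0.4 + 0.3 - 0.3 = 0.1, which is positive, so the neuron fires.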

Visual Representation

     x₁ ──── w₁ ────┐

     x₂ ──── w₂ ────┼──► [Σ + b] ──► [f] ──► y

     x₃ ──── w₃ ────┘

Building from Scratch

import numpy as np

class Perceptron:
    """A single artificial neuron."""
    
    def __init__(self, n_inputs, activation='step'):
        """Initialize with random weights."""
        # Small random weights for symmetry breaking
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.activation = activation
    
    def _activate(self, z):
        """Apply activation function."""
        if self.activation == 'step':
            return 1 if z > 0 else 0
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def forward(self, x):
        """Compute output for given input."""
        # Weighted sum
        z = np.dot(x, self.weights) + self.bias
        # Apply activation
        return self._activate(z)
    
    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self.forward(x) for x in X])
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        history = []
        
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                # Forward pass
                prediction = self.forward(xi)
                
                # Compute error
                error = yi - prediction
                
                # Update weights if prediction was wrong
                if error != 0:
                    self.weights += learning_rate * error * xi
                    self.bias += learning_rate * error
                    errors += 1
            
            accuracy = 1 - errors / len(y)
            history.append(accuracy)
            
            if epoch % 10 == 0:
                print(f"Epoch {epoch}: Accuracy = {accuracy:.2%}")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}!")
                break
        
        return history


# Test on AND gate
print("="*50)
print("Training Perceptron on AND Gate")
print("="*50)

X_and = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X_and, y_and)

print("\nResults:")
for x, y_true in zip(X_and, y_and):
    y_pred = perceptron.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")

The Perceptron Learning Rule

The training algorithm is beautifully simple:
FOR each training example (x, y):
    1. Compute prediction: ŷ = step(w·x + b), where step(z) = 1 if z > 0 else 0
    2. If ŷ ≠ y (wrong prediction):
        w = w + η(y - ŷ)x
        b = b + η(y - ŷ)
    3. If ŷ = y (correct): do nothing

Why This Works

  • If we predict 0 but should predict 1: increase weights in direction of x
  • If we predict 1 but should predict 0: decrease weights in direction of x
  • The learning rate $\eta$ controls how big each update is
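A single update can be traced by hand. The sketch below assumes zero starting weights and a learning rate of 0.1 (both chosen only for illustration) and applies the rule once to the (1, 1) row of the AND gate.

```python
import numpy as np

def step(z):
    """Step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

# Illustrative starting point
w = np.array([0.0, 0.0])
b = 0.0
eta = 0.1

x = np.array([1.0, 1.0])   # the (1, 1) input of the AND gate
y = 1                      # its target output

y_hat = step(np.dot(w, x) + b)   # the neuron does not fire yet
error = y - y_hat                # positive: the prediction was too low

# Perceptron rule: move the weights in the direction of x
w_new = w + eta * error * x
b_new = b + eta * error

print("weighted sum before:", np.dot(w, x) + b)
print("weighted sum after: ", np.dot(w_new, x) + b_new)
```

After the update the weighted sum for this input rises from 0 to about 0.3, so the same input now produces the correct output of 1.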

Convergence Theorem

Perceptron Convergence Theorem: If the data is linearly separable, the perceptron algorithm converges to a separating solution after a finite number of updates.
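The theorem can be made quantitative. A standard form (Novikoff's bound) is sketched here under assumptions that differ slightly from the code in this lesson: labels are recoded as $y_i \in \{-1, +1\}$, the bias is folded into the weight vector by appending a constant 1 to every input, and training starts from zero weights.

$$
\text{If } \|\mathbf{x}_i\| \le R \text{ for all } i \text{ and some unit vector } \mathbf{w}^* \text{ satisfies } y_i\,(\mathbf{w}^* \cdot \mathbf{x}_i) \ge \gamma > 0 \text{ for all } i,
\text{ then the number of updates is at most } \left(\frac{R}{\gamma}\right)^2 .
$$

The margin $\gamma$ measures how cleanly the classes can be separated: the wider the gap, the fewer corrections are needed.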
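Historical Note: Minsky & Papert’s 1969 book Perceptrons showed that single-layer perceptrons can’t solve non-linearly-separable problems (like XOR). The book is often credited with contributing to the first “AI Winter”. Networks with multiple layers can represent XOR, but a practical way to train them did not become widely known until backpropagation was popularized in the 1980s.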

The XOR Problem: Why We Need Depth

# XOR: output is 1 if inputs are DIFFERENT
print("="*50)
print("Training Perceptron on XOR Gate")
print("="*50)

y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X_and, y_xor, epochs=100)

print("\nResults (FAILS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = perceptron_xor.forward(x)
    print(f"  {x} -> {y_pred} (true: {y_true})")
The perceptron fails on XOR! Why?
[Figure: XOR problem]
XOR is not linearly separable — you cannot draw a single straight line to separate the 0s from the 1s. Solution: Stack multiple layers of neurons!
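The claim can be checked with a few inequalities. Assuming the step activation used earlier (output 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$), the four XOR rows require:

$$
\begin{aligned}
(0,0) \to 0 &: \quad b \le 0 \\
(0,1) \to 1 &: \quad w_2 + b > 0 \\
(1,0) \to 1 &: \quad w_1 + b > 0 \\
(1,1) \to 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
$$

Adding the second and third conditions gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b \ge 0$. That contradicts the fourth condition, so no choice of $w_1, w_2, b$ satisfies all four at once.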

Multi-Layer Perceptron (MLP)

The Universal Approximation Theorem

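A neural network with a single hidden layer containing a finite number of neurons and a suitable non-linear activation (for example, sigmoid) can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy.
In other words: even one hidden layer is expressive enough in principle, given enough neurons. The theorem only says that suitable weights exist; it does not say how many neurons are needed or that training will find them.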

Architecture

Input Layer      Hidden Layer      Output Layer
    x₁ ────────┐
               ├────► h₁ ────┐
    x₂ ────────┤             │
               ├────► h₂ ────┼────► y
    x₃ ────────┤             │
               └────► h₃ ────┘

Each connection has its own weight. Each hidden neuron and each output neuron has its own bias.
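One way to see what this means for model size is to count the learnable numbers. The helper below is a hypothetical illustration (the name `count_parameters` is not part of this lesson's classes); it assumes fully connected layers and one bias per neuron in every layer after the input.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out   # one weight per connection
        biases = n_out           # one bias per neuron in the receiving layer
        total += weights + biases
    return total

# The 3-3-1 network drawn above: 3*3 + 3 + 3*1 + 1
print(count_parameters([3, 3, 1]))
# A 2-4-1 network: 2*4 + 4 + 4*1 + 1
print(count_parameters([2, 4, 1]))
```

The counts are 16 and 17. Parameter count grows with the product of adjacent layer widths, which is why wide layers get expensive quickly.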

Building an MLP from Scratch

class MLP:
    """Multi-Layer Perceptron from scratch."""
    
    def __init__(self, layer_sizes, activation='sigmoid'):
        """
        Initialize network with given layer sizes.
        
        Args:
            layer_sizes: List like [input_size, hidden1, hidden2, ..., output_size]
        """
        self.n_layers = len(layer_sizes) - 1
        self.activation = activation
        
        # Initialize weights and biases for each layer
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            # He initialization for ReLU; 1/fan_in variance scaling
            # (a common Xavier-style simplification) for sigmoid/tanh
            scale = np.sqrt(2.0 / layer_sizes[i]) if activation == 'relu' else \
                    np.sqrt(1.0 / layer_sizes[i])
            
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros(layer_sizes[i+1])
            
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z, derivative=False):
        """Apply activation function (or its derivative)."""
        if self.activation == 'sigmoid':
            sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))
            if derivative:
                return sig * (1 - sig)
            return sig
        elif self.activation == 'relu':
            if derivative:
                return (z > 0).astype(float)
            return np.maximum(0, z)
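        elif self.activation == 'tanh':
            if derivative:
                return 1 - np.tanh(z)**2
            return np.tanh(z)
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown activation: {self.activation}")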
    
    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]  # Store for backprop
        self.z_values = []       # Pre-activation values
        
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            
            # Apply activation (except for last layer in classification)
            if i == self.n_layers - 1:  # Output layer
                current = self._sigmoid(z)  # For binary classification
            else:
                current = self._activate(z)
            
            self.activations.append(current)
        
        return current
    
    def _sigmoid(self, z):
        """Sigmoid for output layer."""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def backward(self, X, y, learning_rate=0.01):
        """Backward pass (backpropagation)."""
        m = len(X)
        
        # Output layer error
        output = self.activations[-1]
        delta = output - y.reshape(-1, 1)  # Derivative of BCE loss with sigmoid
        
        # Backpropagate through layers
        for i in range(self.n_layers - 1, -1, -1):
            # Gradient for weights and biases
            dW = self.activations[i].T @ delta / m
            db = np.mean(delta, axis=0)
            
            # Propagate error to previous layer
            if i > 0:
                delta = (delta @ self.weights[i].T) * self._activate(
                    self.z_values[i-1], derivative=True
                )
            
            # Update weights and biases
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db
    
    def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):
        """Train the network."""
        history = {'loss': [], 'accuracy': []}
        
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            
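            # Compute loss (binary cross-entropy)
            # y has shape (m,) but output has shape (m, 1); reshape y to a
            # column so the product is elementwise, not an (m, m) broadcast
            eps = 1e-8
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + eps) + (1 - y_col) * np.log(1 - output + eps))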
            
            # Compute accuracy
            predictions = (output > 0.5).astype(int).flatten()
            accuracy = np.mean(predictions == y)
            
            history['loss'].append(loss)
            history['accuracy'].append(accuracy)
            
            # Backward pass
            self.backward(X, y, learning_rate)
            
            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.2%}")
        
        return history
    
    def predict(self, X):
        """Make predictions."""
        return (self.forward(X) > 0.5).astype(int).flatten()


# NOW we can solve XOR!
print("="*50)
print("Training MLP on XOR Gate")
print("="*50)

mlp = MLP([2, 4, 1], activation='sigmoid')  # 2 inputs, 4 hidden, 1 output
history = mlp.train(X_and, y_xor, epochs=2000, learning_rate=1.0)

print("\nResults (SUCCESS!):")
for x, y_true in zip(X_and, y_xor):
    y_pred = mlp.predict(x.reshape(1, -1))[0]
    print(f"  {x} -> {y_pred} (true: {y_true})")
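The line `delta = output - y.reshape(-1, 1)` in `backward` depends on a simplification that is worth writing out. For one sample with pre-activation $z$, prediction $\hat{y} = \sigma(z)$, and binary cross-entropy loss $L$, the chain rule gives:

$$
\begin{aligned}
L &= -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right] \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}\,(1 - \hat{y}) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y
\end{aligned}
$$

The $\hat{y}(1 - \hat{y})$ factors cancel, which is why the output-layer error needs no explicit sigmoid derivative. This holds only for the sigmoid output paired with binary cross-entropy; a different loss or output activation would need its own derivation.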

How MLPs Solve XOR

The hidden layer creates a new representation where the problem becomes linearly separable:
Original Space         Hidden Space
    (0,1) ●─────● (1,1)         h₁
         │     │               ↗
         │ XOR │          • (0,1)   • (1,0)   → output 1
         │     │               ↓
    (0,0) ●─────● (1,0)   • (0,0)   • (1,1)   → output 0
                                    h₂
                            Now linearly separable!
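The same idea can be checked with a network whose weights are picked by hand rather than learned. This is a sketch under assumptions: step activations, one hidden unit acting as OR and one as AND. A trained network will usually find a different but equivalent representation.

```python
import numpy as np

def step(z):
    """Elementwise step activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hand-picked weights, not learned):
#   h1 fires when at least one input is on  (OR)
#   h2 fires when both inputs are on        (AND)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output unit: fire when h1 is on but h2 is off
w_out = np.array([1.0, -1.0])
b_out = -0.5

H = step(X @ W_hidden + b_hidden)   # hidden representation of each input
y = step(H @ w_out + b_out)         # final prediction

for x_row, h_row, y_row in zip(X, H, y):
    print(f"{x_row} -> hidden {h_row} -> output {y_row}")
```

In hidden space the inputs (0, 1) and (1, 0) collapse onto the same point (1, 0), while (0, 0) and (1, 1) land at (0, 0) and (1, 1). A single line, h₁ - h₂ = 0.5, now separates the two classes.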

Visualizing the Decision Boundary

import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title):
    """Plot the decision boundary of a 2D classifier."""
    # Create mesh grid
    h = 0.01
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    Z = model.forward(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 50), cmap='RdBu_r', alpha=0.8)
    plt.colorbar(label='Prediction')
    
    # Plot training points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', 
                         edgecolors='black', s=200, linewidths=2)
    
    plt.xlabel('x₁')
    plt.ylabel('x₂')
    plt.title(title)
    plt.show()

# Visualize XOR solution
plot_decision_boundary(mlp, X_and, y_xor, "MLP Solving XOR")

Deeper Networks

Why Go Deep?

| Depth | Advantages | Challenges |
|---|---|---|
| Shallow (1-2 layers) | Easy to train, interpretable | Limited expressivity |
| Medium (3-5 layers) | Good balance | Standard training works |
| Deep (10+ layers) | Hierarchical features | Vanishing gradients |
| Very Deep (100+) | State-of-the-art | Requires special techniques |
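The "vanishing gradients" entry can be illustrated with a rough calculation. Backpropagation multiplies one activation derivative per layer, and the sigmoid derivative never exceeds 0.25. The sketch below looks only at that factor and ignores the weight matrices, so it is an upper-bound intuition rather than a full analysis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0
max_grad = sigmoid_grad(0.0)

# Largest possible product of n sigmoid derivatives along one path
for n_layers in [2, 5, 10, 100]:
    print(n_layers, max_grad ** n_layers)
```

With ten sigmoid layers the activation-derivative factor alone is already below one in a million, which is one reason deeper networks tend to use ReLU and other special techniques.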

The Depth vs Width Tradeoff

Depth separation results: for certain families of functions, a deep network of width $n$ can represent what a 2-layer network can only approximate with exponentially many units (on the order of $2^n$). In practice:
  • Deep narrow networks learn hierarchical features (more efficient)
  • Wide shallow networks have more brute-force capacity
  • Modern architectures are both deep AND wide (but depth usually helps more)

A Deeper Network

# Deeper network for a more complex problem
from sklearn.datasets import make_moons

# Create non-linear dataset
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

# Normalize
X_moons = (X_moons - X_moons.mean(axis=0)) / X_moons.std(axis=0)

# Train deeper network
deep_mlp = MLP([2, 32, 32, 16, 1], activation='relu')
history = deep_mlp.train(X_moons, y_moons, epochs=2000, learning_rate=0.01)

# Visualize
plot_decision_boundary(deep_mlp, X_moons, y_moons, "Deep MLP on Moons Dataset")

PyTorch Implementation

Now let’s see how to build the same networks using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class PyTorchMLP(nn.Module):
    """MLP using PyTorch."""
    
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        
        layers = []
        prev_size = input_size
        
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        
        layers.append(nn.Linear(prev_size, output_size))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)


# Create model
model = PyTorchMLP(input_size=2, hidden_sizes=[32, 32, 16], output_size=1)
print(model)

# Setup training
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Convert data
X_tensor = torch.tensor(X_moons, dtype=torch.float32)
y_tensor = torch.tensor(y_moons, dtype=torch.float32).reshape(-1, 1)

# Train
for epoch in range(1000):
    # Forward
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        with torch.no_grad():
            preds = (torch.sigmoid(outputs) > 0.5).float()
            acc = (preds == y_tensor).float().mean()
            print(f"Epoch {epoch}: Loss = {loss.item():.4f}, Acc = {acc.item():.2%}")
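Note that the PyTorch model ends in a plain `nn.Linear`, so it outputs raw scores (logits), and `nn.BCEWithLogitsLoss` applies the sigmoid internally. That is why `torch.sigmoid` appears only when computing predictions. The NumPy sketch below shows the same quantity computed two ways; the logits and labels are made up, and the function names are hypothetical helpers rather than PyTorch APIs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_on_probabilities(p, y):
    """Binary cross-entropy computed from probabilities."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_with_logits(z, y):
    """Same loss computed from raw scores, in a numerically stable form."""
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

z = np.array([2.0, -1.0, 0.5, -3.0])   # made-up logits
y = np.array([1.0, 0.0, 1.0, 0.0])     # made-up labels

print(bce_on_probabilities(sigmoid(z), y))
print(bce_with_logits(z, y))

# The logit form stays finite even for extreme scores
print(bce_with_logits(np.array([1000.0]), np.array([1.0])))
```

The two losses agree on ordinary inputs. Working from logits avoids taking the log of a probability that has rounded to exactly 0 or 1, which is the reason for the `eps` term in the from-scratch version.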

Key Concepts Summary

| Concept | What It Means | Why It Matters |
|---|---|---|
| Perceptron | Single neuron with weighted inputs | Basic building block |
| Weights | How much each input matters | What the network learns |
| Bias | Threshold for activation | Shifts decision boundary |
| Activation | Non-linear function | Enables complex patterns |
| Hidden Layer | Intermediate processing | Creates useful representations |
| Backpropagation | Computing gradients layer by layer | How we train the network |
[Figure: MLP forward and backward pass]

Exercises

Implement perceptrons for:
  1. OR gate
  2. NAND gate
  3. Can you create XOR using only NAND gates? (Hint: NAND is universal)
Create an animation showing how the decision boundary evolves during training:
# Store model states every N epochs
# Replay decision boundaries as animation
from matplotlib.animation import FuncAnimation
Compare networks of different depths on the moons dataset:
  • [2, 8, 1]
  • [2, 8, 8, 1]
  • [2, 8, 8, 8, 1]
  • [2, 8, 8, 8, 8, 1]
Plot training curves. At what depth does training become difficult?
Extend our MLP to classify MNIST digits:
  1. Load MNIST data
  2. Flatten images to 784-dimensional vectors
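  3. Train a [784, 256, 128, 10] network (this needs a softmax output and categorical cross-entropy, since our MLP above handles only binary classification)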
  4. Compare to our PyTorch version

What’s Next

Now that you understand how neurons compute and connect, let’s dive deep into how they learn.