Neural Networks: The Foundation of Deep Learning

From Brains to Math

Your brain has about 86 billion neurons, each connected to thousands of others. A single neuron:

Receives inputs from other neurons
Weighs how important each input is
Sums them up
Activates if the sum exceeds a threshold
Sends output to other neurons

An artificial neuron is a simplified mathematical model of these same steps.

The Perceptron: One Artificial Neuron

How It Works

Input 1 ─── weight 1 ───┐
                        │
Input 2 ─── weight 2 ───┼──► [Sum] ──► [Activation] ──► Output
                        │
Input 3 ─── weight 3 ───┘

Math version:

$$\text{output} = \text{activation}\left(\sum_{i=1}^{n} w_i x_i + b\right) = \text{activation}(w \cdot x + b)$$

Where:

$x_i$ = inputs
$w_i$ = weights (learnable)
$b$ = bias (also learnable)
$\text{activation}$ = a function that decides whether the neuron “fires”
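
To make the formula concrete, here is a minimal sketch of one neuron's forward pass in plain Python. The inputs, weights, and bias are made-up numbers chosen for illustration, and a step function stands in for the activation.

```python
def step(z):
    """Step activation: fire (1) if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def neuron_output(inputs, weights, bias, activation=step):
    """Compute activation(sum(w_i * x_i) + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values (assumptions, not taken from any dataset)
example_inputs = [1.0, 0.0, 1.0]
example_weights = [0.5, -0.25, 0.25]
example_bias = -0.5

# Weighted sum: 0.5*1.0 + (-0.25)*0.0 + 0.25*1.0 - 0.5 = 0.25
weighted_sum = sum(w * x for w, x in zip(example_weights, example_inputs)) + example_bias
print(weighted_sum)                                                  # 0.25
print(neuron_output(example_inputs, example_weights, example_bias))  # 1
```

Lowering the bias to -1.0 pushes the weighted sum below zero, so the same inputs would no longer make the neuron fire.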

Building a Perceptron from Scratch

import numpy as np

class Perceptron:
    """A single artificial neuron."""
    
    def __init__(self, n_inputs):
        # Random small weights
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
    
    def forward(self, x):
        """Compute output for given inputs."""
        # Weighted sum + bias
        z = np.dot(x, self.weights) + self.bias
        # Step activation: output 1 if z > 0, else 0
        return 1 if z > 0 else 0
    
    def train(self, X, y, learning_rate=0.1, epochs=100):
        """Train using the perceptron learning rule."""
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                
                # Update rule: move weights in direction of error
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error
                
                errors += abs(error)
            
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: {errors} errors")
            
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break

# Test on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y_and)

print("\nAND Gate Results:")
for xi in X:
    print(f"  {xi} -> {perceptron.forward(xi)}")

The XOR Problem: Why We Need More Layers

# XOR: outputs 1 if inputs are different
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X, y_xor, epochs=1000)

print("\nXOR Gate Results (FAILS!):")
for xi in X:
    print(f"  {xi} -> {perceptron_xor.forward(xi)}")

A single perceptron can only learn linearly separable patterns. XOR is not linearly separable: no single straight line separates the inputs that map to 0 from those that map to 1. The solution is to stack multiple layers of neurons, which gives a Multi-Layer Perceptron (MLP).
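
One way to see why a second layer helps: XOR can be written as AND(OR(a, b), NAND(a, b)), and each of those three gates is linearly separable on its own. The sketch below wires three step-activation neurons together by hand. The weights are hand-picked for illustration (one of many choices that work), not learned, and the names `W_hidden`, `w_out`, and `xor_net` are introduced here only for this example.

```python
import numpy as np

def step(z):
    """Step activation applied element-wise."""
    return (z > 0).astype(int)

# Hidden layer: column 0 behaves like OR, column 1 behaves like NAND
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output neuron: AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

def xor_net(inputs):
    """Two-layer network with fixed, hand-picked weights."""
    hidden = step(inputs @ W_hidden + b_hidden)
    return step(hidden @ w_out + b_out)

X_gate = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_net(X_gate))  # [0 1 1 0]
```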

Activation Functions

The step function (0 or 1) has a problem: its gradient is 0 everywhere except at the threshold, where it is undefined, so gradient descent gets no signal about how to adjust the weights. We need activation functions with useful gradients:

import matplotlib.pyplot as plt

def sigmoid(x):
    """S-curve from 0 to 1"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """S-curve from -1 to 1"""
    return np.tanh(x)

def relu(x):
    """0 if negative, x if positive"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize
x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
activations = [
    (sigmoid, 'Sigmoid', 'Output between 0 and 1'),
    (tanh, 'Tanh', 'Output between -1 and 1'),
    (relu, 'ReLU', 'Most popular, fast to compute'),
    (leaky_relu, 'Leaky ReLU', 'Fixes "dying ReLU" problem')
]

for ax, (func, name, desc) in zip(axes.flat, activations):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(f'{name}: {desc}')
    ax.grid(True)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
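
Since gradient descent depends on derivatives rather than on the activation values themselves, it can help to look at those too. This is a small sketch; the function names are introduced here for illustration, and treating the ReLU derivative as 0 at exactly x = 0 is a common convention rather than a mathematical fact.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), at most 0.25."""
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, at most 1."""
    return 1 - np.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at x = 0)."""
    return (x > 0).astype(float)

points = np.array([-5.0, 0.0, 5.0])
print("sigmoid'", sigmoid_grad(points))
print("tanh'   ", tanh_grad(points))
print("relu'   ", relu_grad(points))
```

The sigmoid and tanh gradients shrink toward zero for large positive or negative inputs, while ReLU keeps a gradient of 1 for any positive input. That difference is one reason ReLU is a popular default for hidden layers.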

Activation	Range	Use Case
Sigmoid	(0, 1)	Output layer for binary classification
Tanh	(-1, 1)	Hidden layers (centered at 0)
ReLU	[0, ∞)	Hidden layers (most common)
Softmax	(0, 1), sums to 1	Output for multi-class classification
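
Softmax appears in the table but not in the code above. A minimal NumPy sketch follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result, and the example scores are made up.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # illustrative raw scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())

# Very large scores would overflow a naive exp(), but not this version
print(softmax(np.array([1000.0, 1000.0])))
```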

Multi-Layer Perceptron: The Universal Approximator

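By stacking layers, a network can approximate a very wide class of functions. The universal approximation theorem says that even one hidden layer with enough neurons and a non-linear activation can approximate any continuous function on a bounded input range to arbitrary accuracy. It does not guarantee that training will actually find those weights.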

class NeuralNetwork:
    """Simple 2-layer neural network."""
    
    def __init__(self, input_size, hidden_size, output_size):
        # Layer 1: input -> hidden
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)
        
        # Layer 2: hidden -> output
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, X):
        """Forward pass through the network."""
        # Layer 1
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        
        # Layer 2
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        
        return self.a2
    
    def backward(self, X, y, learning_rate=0.5):
        """Backward pass - compute gradients and update weights."""
        m = len(X)
        
        # Output layer error
        dz2 = self.a2 - y.reshape(-1, 1)
        dW2 = (self.a1.T @ dz2) / m
        db2 = np.mean(dz2, axis=0)
        
        # Hidden layer error (using chain rule!)
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)
        dW1 = (X.T @ dz1) / m
        db1 = np.mean(dz1, axis=0)
        
        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
    
    def train(self, X, y, epochs=1000, learning_rate=0.5):
        losses = []
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
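            # Loss (binary cross-entropy); reshape y to a column so it lines up
            # with output's (m, 1) shape instead of broadcasting to (m, m)
            y_col = y.reshape(-1, 1)
            loss = -np.mean(y_col * np.log(output + 1e-8) + (1 - y_col) * np.log(1 - output + 1e-8))
            losses.append(loss)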
            
            # Backward
            self.backward(X, y, learning_rate)
            
            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")
        
        return losses

# NOW we can learn XOR!
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = nn.train(X, y_xor, epochs=2000)

print("\nXOR Results (SUCCESS!):")
predictions = nn.forward(X)
for xi, pred in zip(X, predictions):
    print(f"  {xi} -> {pred[0]:.3f} (rounded: {int(pred[0] > 0.5)})")

Backpropagation: How Networks Learn

Backpropagation uses the chain rule from calculus to compute gradients efficiently.

Math Connection: Backpropagation is just repeated application of the chain rule. See Chain Rule for the mathematical foundation.

The key insight:

Compute error at output
Propagate error backward through layers
Update each weight proportionally to how much it contributed to the error

$$\frac{\partial \text{Loss}}{\partial w} = \frac{\partial \text{Loss}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \text{hidden}} \cdot \frac{\partial \text{hidden}}{\partial w}$$
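
A common way to check a chain-rule derivation is to compare it with a numerical estimate. The sketch below does this for a single sigmoid neuron with binary cross-entropy loss, rather than a full network, so the chain has the three factors dLoss/da, da/dz, and dz/dw. All values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, y):
    """Binary cross-entropy for one sigmoid neuron on one example."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Illustrative values (assumptions)
x = np.array([0.5, -1.0])
y = 1.0
w = np.array([0.3, 0.8])
b = 0.1

# Chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw
a = sigmoid(np.dot(w, x) + b)
dL_da = -(y / a) + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dz_dw = x
grad_analytic = dL_da * da_dz * dz_dw

# Numerical estimate using central differences
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    grad_numeric[i] = (loss_fn(w_plus, b, x, y) - loss_fn(w_minus, b, x, y)) / (2 * eps)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

The two gradients should agree to several decimal places. For this loss and activation the product of the three factors simplifies to (a - y) * x, which is the same form as the `dz2 = self.a2 - y` term in the network above.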

Using PyTorch (The Professional Way)

import torch
import torch.nn as nn
import torch.optim as optim

# Define the network
class XORNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)
        self.activation = nn.Sigmoid()
    
    def forward(self, x):
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        return x

# Create network
model = XORNet()

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Data
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y_xor, dtype=torch.float32).reshape(-1, 1)

# Training loop
for epoch in range(1000):
    # Forward pass
    output = model(X_tensor)
    loss = criterion(output, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# Test
with torch.no_grad():
    predictions = model(X_tensor)
    print("\nPyTorch XOR Results:")
    for xi, pred in zip(X, predictions):
        print(f"  {xi} -> {pred.item():.3f}")

Using scikit-learn

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load digit recognition dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',
    max_iter=500,
    random_state=42
)

# Train
mlp.fit(X_train, y_train)

# Evaluate
y_pred = mlp.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")

# Visualize some predictions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
    ax.set_title(f"Pred: {y_pred[i]}, True: {y_test[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()

Network Architectures

Architecture	Layers	Use Case
Shallow	1-2 hidden	Simple patterns, tabular data
Deep	3+ hidden	Complex patterns
Wide	Many neurons	More capacity per layer
Deep & Narrow	Many layers, fewer neurons	Hierarchical features

Rule of thumb for tabular data:

Start with 2 hidden layers
Hidden size: between input and output size
Use ReLU activation
Use dropout for regularization
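
When choosing layer sizes, it helps to know how many learnable parameters an architecture has. The helper below (`count_parameters` is a name introduced here for illustration) counts weights and biases for a fully connected network. The example sizes correspond to 64 input features, hidden layers of 100 and 50 units, and 10 output classes, as in the digits example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

# 64 inputs -> 100 -> 50 -> 10 outputs
print(count_parameters([64, 100, 50, 10]))  # 12060

# The small XOR network: 2 inputs -> 4 hidden -> 1 output
print(count_parameters([2, 4, 1]))          # 17
```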

Regularization for Neural Networks

Dropout

Randomly “turn off” neurons during training:

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # 30% of neurons randomly zeroed
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 10)
)
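
In PyTorch, `nn.Dropout` only drops units in training mode; after `model.eval()` it passes values through unchanged. To show what happens underneath, here is a NumPy sketch of inverted dropout, where surviving units are scaled by 1 / (1 - p) during training so the expected activation stays the same. The `dropout` function and its parameters are written for this illustration and are not PyTorch's implementation.

```python
import numpy as np

def dropout(activations, p=0.3, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return activations               # pass-through at inference time
    rng = rng if rng is not None else np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 128))                 # stand-in for a layer's activations
dropped = dropout(a, p=0.3, training=True, rng=rng)

print((dropped == 0).mean())             # fraction zeroed, close to 0.3
print(dropped.mean())                    # close to 1.0 because of the rescaling
print(dropout(a, training=False).mean()) # unchanged at inference
```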

Early Stopping

Stop training when validation loss stops improving:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Use 10% for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=1000
)
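
The idea behind the `n_iter_no_change` setting can be sketched as a "patience" counter. The function below is a simplified illustration in plain Python, not scikit-learn's internal logic. The name `early_stopping_epoch`, the `tol` threshold, and the validation curve are all assumptions made for this example.

```python
def early_stopping_epoch(val_losses, patience=10, tol=1e-4):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by more than tol for `patience` epochs in a row.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - tol:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(val_losses) - 1   # never triggered: all epochs were used

# Made-up validation curve: improves for 5 epochs, then plateaus
val_losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 20
print(early_stopping_epoch(val_losses, patience=10))  # 14
```

In practice you would also keep a copy of the weights from the best epoch, since the final epochs before stopping are by definition not the best ones.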

Key Hyperparameters

Hyperparameter	Effect
Learning rate	Too high = unstable, too low = slow
Hidden layers	More = more complex patterns, more overfitting risk
Neurons per layer	More = more capacity
Batch size	Smaller = more noise, larger = more stable
Activation	ReLU most common in hidden layers; sigmoid or softmax for classification outputs
Dropout rate	0.1-0.5 typical
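
The learning-rate row is easy to see on a toy problem. This sketch runs plain gradient descent on f(w) = w², whose gradient is 2w and whose minimum is at w = 0. The function name, starting point, and step count are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2w) and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")
```

With these settings, the smallest rate is still far from the minimum after 20 steps, the middle rate gets close to 0, and the largest rate overshoots on every step so that w grows instead of shrinking.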

When to Use Neural Networks

Good for:

Image data (use CNNs)
Text data (use Transformers)
Sequential data (use RNNs/LSTMs)
Very large datasets
Complex non-linear patterns

Not great for:

Small datasets (overfits easily)
When interpretability matters
Tabular data with < 10,000 rows (tree models often better)

🚀 Mini Projects

Project 1: Digit Recognizer

Build a neural network to recognize handwritten digits from the MNIST dataset.

Project 2: Neural Network from Scratch

Implement a simple neural network using only NumPy.

Project 3: Activation Function Explorer

Compare different activation functions and their effects on learning.

Project 4: Hyperparameter Tuner

Systematically find the best neural network architecture.

Key Takeaways

Neurons = Weighted Sums

Input × weights + bias → activation → output

Layers = Power

More layers = learn more complex patterns

Backprop = Chain Rule

Gradients flow backward to update weights

Regularize!

Dropout and early stopping prevent overfitting

What’s Next?

Now that you understand neural networks, let’s learn about regularization in more depth - the key to preventing overfitting in any model!

Continue to Module 13: Regularization

Learn L1, L2 regularization and other techniques to prevent overfitting
