```python
import numpy as np


class Perceptron:
    """
    A single artificial neuron -- the simplest possible neural network.

    Think of it as a tiny decision-maker: it looks at evidence (inputs),
    weighs how important each piece is (weights), adds it up, and makes
    a yes/no call (activation). Like a hiring manager who scores candidates
    on different criteria and hires if the total score exceeds a threshold.
    """

    def __init__(self, n_inputs):
        # Initialize weights to small random values.
        # Symmetry breaking matters once several neurons share a layer: if they all
        # start at zero, they all learn the same thing. A lone perceptron would
        # also train from zeros.
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0  # The "default tendency" -- like a judge's prior disposition

    def forward(self, x):
        """Compute output for given inputs."""
        # Weighted sum + bias: each input contributes proportionally to its weight
        z = np.dot(x, self.weights) + self.bias
        # Step activation: fire (1) if evidence exceeds threshold, stay silent (0) otherwise
        return 1 if z > 0 else 0

    def train(self, X, y, learning_rate=0.1, epochs=100):
        """
        Train using the perceptron learning rule.

        The learning rule is beautifully simple: if the prediction is correct,
        do nothing. If wrong, nudge each weight toward the correct answer.
        The learning_rate controls how big each nudge is -- too large and the
        model oscillates, too small and it takes forever.
        """
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction  # +1 if we should have fired, -1 if we shouldn't have
                # Update rule: move weights in the direction of the error
                # If error > 0 (should have fired), increase weights for active inputs
                # If error < 0 (should not have fired), decrease weights for active inputs
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error
                errors += abs(error)
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: {errors} errors")
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break


# Test on AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

perceptron = Perceptron(n_inputs=2)
perceptron.train(X, y_and)

print("\nAND Gate Results:")
for xi in X:
    print(f"  {xi} -> {perceptron.forward(xi)}")
```
```python
# XOR: outputs 1 if inputs are different
y_xor = np.array([0, 1, 1, 0])

perceptron_xor = Perceptron(n_inputs=2)
perceptron_xor.train(X, y_xor, epochs=1000)

print("\nXOR Gate Results (FAILS!):")
for xi in X:
    print(f"  {xi} -> {perceptron_xor.forward(xi)}")
```
A single perceptron can only learn linearly separable patterns!

XOR is not linearly separable: you cannot draw a single straight line to separate the 0s from the 1s. Think of it like a bouncer at a club who can only apply one rule: “everyone taller than 6 feet gets in” works fine, but “people get in if they have an ID or they are on the list, but not both” requires understanding two conditions simultaneously. A single perceptron is that one-rule bouncer.

Solution: Stack multiple layers of neurons = Multi-Layer Perceptron (MLP). The first layer learns simple patterns, and the second layer combines those patterns into more complex ones, much as the visual cortex processes edges first, then shapes, then objects.
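To make the "stack layers" idea concrete, here is a minimal sketch of a two-layer network of step units that computes XOR. The weights are picked by hand for illustration (they are an assumption, not something learned or taken from the text above): one hidden neuron acts as OR, the other as NAND, and the output neuron ANDs them together.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

# Hidden layer: two hand-picked neurons (illustrative weights, not learned).
# Column 0 computes OR(x1, x2); column 1 computes NAND(x1, x2).
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output neuron: AND of the two hidden outputs.
w_out = np.array([1.0, 1.0])
b_out = -1.5

def xor_two_layer(X):
    hidden = step(X @ W_hidden + b_hidden)   # each row becomes [OR, NAND]
    return step(hidden @ w_out + b_out)      # AND(OR, NAND) == XOR

X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_outputs = xor_two_layer(X_demo)
print(xor_outputs)
```

Each hidden neuron draws one straight line, and the output neuron combines the two half-planes. This is the "two conditions at once" that a single perceptron cannot express.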
The step function (0 or 1) has a problem: its gradient is 0 everywhere except at the threshold, where it is undefined. This means gradient descent has no signal to work with. It is like trying to roll a ball downhill on a perfectly flat surface with a single cliff edge.

We need activation functions that provide a usable gradient over most of their input range, a gentle slope the optimization can follow:
```python
import matplotlib.pyplot as plt

def sigmoid(x):
    """S-curve from 0 to 1"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """S-curve from -1 to 1"""
    return np.tanh(x)

def relu(x):
    """0 if negative, x if positive"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize
x = np.linspace(-5, 5, 100)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

activations = [
    (sigmoid, 'Sigmoid', 'Output between 0 and 1'),
    (tanh, 'Tanh', 'Output between -1 and 1'),
    (relu, 'ReLU', 'Most popular, fast to compute'),
    (leaky_relu, 'Leaky ReLU', 'Fixes "dying ReLU" problem')
]

for ax, (func, name, desc) in zip(axes.flat, activations):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(f'{name}: {desc}')
    ax.grid(True)
    ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
```
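The "flat surface" problem with the step function can be checked numerically. The sketch below estimates slopes with a central difference at a few points away from the threshold. The helper names (`step_fn`, `sigmoid_fn`, `numerical_slope`) are illustrative and not part of the code above.

```python
import numpy as np

def step_fn(z):
    return np.where(z > 0, 1.0, 0.0)

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_slope(f, z, h=1e-4):
    """Central-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Sample points on both sides of the threshold, but not on it.
z_points = np.array([-2.0, -0.5, 0.5, 2.0])

step_slopes = numerical_slope(step_fn, z_points)
sigmoid_slopes = numerical_slope(sigmoid_fn, z_points)

print("step slopes:   ", step_slopes)
print("sigmoid slopes:", np.round(sigmoid_slopes, 4))
```

The step function reports a slope of zero at every sampled point, so an update proportional to the gradient would never move. The sigmoid slopes are small but nonzero, and they match the closed form `s * (1 - s)`.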
| Activation | Range | Use Case | Gotcha |
|---|---|---|---|
| Sigmoid | (0, 1) | Output layer for binary classification | Vanishing gradient in deep networks: gradients shrink toward zero in early layers |
| Tanh | (-1, 1) | Hidden layers (centered at 0) | Same vanishing gradient problem as sigmoid, but centered output helps convergence |
| ReLU | [0, ∞) | Hidden layers (most common, fast) | “Dying ReLU”: if a neuron’s pre-activation is negative for every input, its gradient is 0 and it may never recover |
| Softmax | (0, 1), sums to 1 | Output for multi-class classification | Typically used only in the output layer: it normalizes across all outputs to produce probabilities |
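Softmax appears in the table but not in the plotting code, so here is a minimal NumPy sketch of it. Subtracting the row maximum before exponentiating is a common stability trick. The function and variable names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax. Subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

example_logits = np.array([
    [2.0, 1.0, 0.1],
    [1000.0, 1000.0, 1000.0],  # a naive exp() would overflow on this row
])

probs = softmax(example_logits)
print(probs)
print("row sums:", probs.sum(axis=1))
```

Every output lies strictly between 0 and 1 and each row sums to 1, which is why the result can be read as class probabilities.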
Practical default: Use ReLU for hidden layers and sigmoid/softmax for the output layer. This covers most common use cases. Only switch to Leaky ReLU or GELU if you observe dying neurons (training loss plateaus while many neurons output zero).
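One rough way to check for dying neurons is to count the hidden units that output zero for every sample in a batch. The sketch below assumes you can get the post-ReLU activations as an array of shape `(n_samples, n_units)`. The helper name and the demo values are hypothetical.

```python
import numpy as np

def dead_relu_fraction(hidden_activations):
    """
    Fraction of hidden units whose ReLU output is zero for every sample.
    hidden_activations: array of shape (n_samples, n_units), post-ReLU.
    """
    dead_mask = np.all(hidden_activations == 0, axis=0)
    return float(dead_mask.mean())

# Hypothetical activations for 4 samples and 3 hidden units.
# The middle unit never fires, so it counts as dead.
demo_activations = np.array([
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 1.2],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.4],
])

dead_fraction = dead_relu_fraction(demo_activations)
print(f"Dead units: {dead_fraction:.2f}")
```

A unit that is silent on one batch is not necessarily dead, so it is safer to run this over a large, representative sample before changing the activation function.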
Multi-Layer Perceptron: The Universal Approximator
By stacking layers, we can approximate a remarkably broad class of functions. This is not hand-waving: the Universal Approximation Theorem (Cybenko, 1989) proves that a neural network with just one hidden layer of sigmoid units and enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy. The catch: “enough neurons” might mean millions, and the theorem says nothing about how to find the right weights, which is the hard part. In practice, deeper networks with fewer neurons per layer often learn hierarchical features more efficiently than one massive wide layer.
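As a rough illustration of "one hidden layer, enough neurons", the sketch below fits `sin(2x)` with a single hidden layer of tanh units and compares a few widths. To keep it short, it makes a simplifying assumption that is not part of the theorem: the hidden weights are random and fixed, and only the output weights are fit, by least squares. The target function, widths, and scales are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a bounded interval.
x_grid = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
target = np.sin(2 * x_grid).ravel()

# One hidden layer of tanh units with random, fixed weights (an assumption
# made for brevity; a trained network would learn these too).
max_width = 100
W_rand = rng.normal(scale=2.0, size=(1, max_width))
b_rand = rng.uniform(-np.pi, np.pi, size=max_width)
hidden_all = np.tanh(x_grid @ W_rand + b_rand)  # shape (200, max_width)

def fit_rmse(width):
    """Fit output weights by least squares using the first `width` hidden units."""
    features = np.hstack([hidden_all[:, :width], np.ones((len(x_grid), 1))])
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    residual = features @ coef - target
    return float(np.sqrt(np.mean(residual ** 2)))

widths = [2, 10, 100]
rmse_by_width = {w: fit_rmse(w) for w in widths}
for w, err in rmse_by_width.items():
    print(f"hidden units = {w:3d}  RMSE = {err:.2e}")
```

Because each wider model reuses the narrower model's hidden units, its best least-squares fit on these points cannot be worse. This only measures fit on the training grid, so it illustrates the theorem's claim about capacity, not generalization.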
```python
class NeuralNetwork:
    """
    Simple 2-layer neural network built from scratch.

    Architecture: Input -> Hidden Layer (sigmoid) -> Output Layer (sigmoid)
    This is the minimum viable network that can solve non-linear problems like XOR.
    """

    def __init__(self, input_size, hidden_size, output_size):
        # Layer 1: input -> hidden (learns basic patterns/features)
        # Weight initialization with * 0.5 keeps initial values moderate;
        # too large and the sigmoids saturate, too small and learning is glacially slow
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros(hidden_size)
        # Layer 2: hidden -> output (combines hidden features into final prediction)
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros(output_size)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        """
        Forward pass: data flows input -> hidden -> output.

        We store intermediate values (z1, a1, z2, a2) because
        backpropagation needs them to compute gradients.
        """
        # Layer 1: compute weighted sum, then apply activation
        self.z1 = X @ self.W1 + self.b1       # Linear transformation
        self.a1 = self.sigmoid(self.z1)       # Non-linear activation (this is what lets us learn curves, not just lines)
        # Layer 2: hidden layer output becomes input to the output layer
        self.z2 = self.a1 @ self.W2 + self.b2  # Linear transformation
        self.a2 = self.sigmoid(self.z2)        # Final prediction (0-1 for binary classification)
        return self.a2

    def backward(self, X, y, learning_rate=0.5):
        """
        Backward pass: compute how much each weight contributed to the error,
        then nudge weights in the direction that reduces error.

        This is backpropagation -- the chain rule applied layer by layer,
        working backwards from output to input. Think of it like tracing blame:
        "The output was wrong because the hidden layer sent the wrong signal,
        which happened because the input weights were off."
        """
        m = len(X)  # Number of samples (for averaging gradients)

        # Output layer error: how far off were our predictions?
        dz2 = self.a2 - y.reshape(-1, 1)  # Derivative of loss w.r.t. z2
        dW2 = (self.a1.T @ dz2) / m       # How much each W2 weight contributed to error
        db2 = np.mean(dz2, axis=0)        # How much the bias contributed

        # Hidden layer error: chain rule propagates error backward through W2
        # The sigmoid derivative a*(1-a) is what makes this differentiable
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)
        dW1 = (X.T @ dz1) / m
        db1 = np.mean(dz1, axis=0)

        # Update weights: step in the direction that reduces error
        # learning_rate controls step size -- the fundamental tradeoff of optimization
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

    def train(self, X, y, epochs=1000, learning_rate=0.5):
        losses = []
        # Column vector, so y lines up with the (n_samples, 1) output.
        # A 1-D y would broadcast to (n, n) and give a wrong loss.
        y_col = y.reshape(-1, 1)
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            # Loss (binary cross-entropy)
            loss = -np.mean(y_col * np.log(output + 1e-8)
                            + (1 - y_col) * np.log(1 - output + 1e-8))
            losses.append(loss)
            # Backward
            self.backward(X, y, learning_rate)
            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")
        return losses


# NOW we can learn XOR!
# (The result depends on the random initialization; if the outputs stay
# near 0.5, train for more epochs.)
xor_net = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = xor_net.train(X, y_xor, epochs=2000)

print("\nXOR Results (SUCCESS!):")
predictions = xor_net.forward(X)
for xi, pred in zip(X, predictions):
    print(f"  {xi} -> {pred[0]:.3f} (rounded: {int(pred[0] > 0.5)})")
```
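A common sanity check for hand-written backpropagation is to compare the analytic gradient against a finite-difference estimate. The sketch below does this for `W1` in a standalone two-layer sigmoid network with binary cross-entropy, using the same formulas as `backward()` above. It does not reuse the class, and all function names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(W1, b1, W2, b2, X, y):
    """Binary cross-entropy of a 2-layer sigmoid network (y is a column vector)."""
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return float(-np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2)))

def analytic_grad_W1(W1, b1, W2, b2, X, y):
    """Backprop gradient of the loss w.r.t. W1."""
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    dz2 = a2 - y                               # sigmoid + cross-entropy simplifies to this
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)         # chain rule through the hidden sigmoid
    return (X.T @ dz1) / len(X)

def numeric_grad_W1(W1, b1, W2, b2, X, y, eps=1e-5):
    """Central-difference estimate, one weight at a time."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (bce_loss(W_plus, b1, W2, b2, X, y)
                     - bce_loss(W_minus, b1, W2, b2, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(42)
X_chk = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_chk = np.array([[0.0], [1.0], [1.0], [0.0]])
W1_chk = rng.normal(scale=0.5, size=(2, 3))
b1_chk = np.zeros(3)
W2_chk = rng.normal(scale=0.5, size=(3, 1))
b2_chk = np.zeros(1)

analytic = analytic_grad_W1(W1_chk, b1_chk, W2_chk, b2_chk, X_chk, y_chk)
numeric = numeric_grad_W1(W1_chk, b1_chk, W2_chk, b2_chk, X_chk, y_chk)
max_diff = float(np.max(np.abs(analytic - numeric)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients disagree by more than a tiny amount, the usual culprits are a missing factor of `1/m`, a transposed matrix, or a shape mismatch that NumPy silently broadcasts.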
Randomly “turn off” neurons during training. Think of it like a team where you randomly bench different players in each practice session. No single player can carry the team alone, so every player has to be competent. This forces the network to build redundant representations rather than relying on a few “star” neurons — which means it generalizes better to new data.
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(0.3),   # 30% of neurons randomly zeroed each forward pass
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(0.3),   # Dropout is only active during training, NOT during inference
    nn.Linear(64, 10)  # No dropout on the output layer
)
```
Practical tip: Start with a dropout rate of 0.2-0.3 for hidden layers. If the model still overfits, increase toward 0.5. Never apply dropout to the output layer. Remember to call model.eval() before inference, since dropout must be disabled for predictions, and model.train() before resuming training.
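To show what "turning off neurons" and "disabled at inference" mean mechanically, here is a minimal NumPy sketch of inverted dropout, the variant in which surviving activations are scaled up during training so that nothing needs rescaling at inference. The function name and arguments are illustrative, not a library API.

```python
import numpy as np

def dropout_forward(activations, drop_prob, training, rng):
    """
    Inverted dropout: during training, zero each unit with probability
    drop_prob and scale the survivors by 1 / (1 - drop_prob) so the expected
    activation is unchanged. At inference the input passes through untouched.
    """
    if not training or drop_prob == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))  # dummy hidden-layer activations

train_out = dropout_forward(hidden, 0.3, training=True, rng=rng)
eval_out = dropout_forward(hidden, 0.3, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0))
print("mean activation in training:", train_out.mean())
print("unchanged at inference:     ", np.array_equal(eval_out, hidden))
```

The training-time mean stays close to the original activation even though roughly 30% of units are zeroed, which is what lets the same weights be used unchanged at inference.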
Stop training when validation loss stops improving. This is one of the simplest and most widely used regularization techniques. Training too long is like studying for an exam past the point of understanding into the territory of memorizing typos in the textbook.
```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # Use 10% for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=1000
)
```
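The stopping rule itself is easy to sketch by hand. The function below is a simplified, framework-independent illustration of patience-based early stopping. It is not how scikit-learn implements it, and `patience` and `min_delta` are names chosen for this sketch.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """
    Return (stop_epoch, best_epoch): the epoch at which training would stop
    and the epoch with the best validation loss seen up to that point.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves, then starts to overfit.
demo_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(demo_val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop you would also save a copy of the weights whenever a new best loss is seen, so that the model from `best_epoch` can be restored after stopping.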
Small datasets (overfits easily — neural nets are data-hungry by nature)
When interpretability matters (explaining why a 10-layer network made a decision is much harder than explaining a decision tree)
Tabular data with fewer than 10,000 rows (tree-based models like XGBoost or Random Forest are usually the stronger choice here and are much cheaper to tune)
Industry reality: For tabular data in production, gradient boosted trees (XGBoost, LightGBM) frequently beat neural networks in Kaggle competitions and are a common default in real-world deployments. Neural networks shine on unstructured data: images, text, audio, and video. If someone suggests a neural network for a 5,000-row CSV, push back.