
Decision Trees

Decision Tree Structure

You Already Think in Trees

Every day, you make decisions using “if-then” logic: Should I bring an umbrella?
  1. Is it raining? → Yes → Bring umbrella
  2. Is it cloudy? → Yes → Bring umbrella
  3. Is forecast rain > 50%? → Yes → Bring umbrella
  4. Otherwise → Don’t bring umbrella
That’s a decision tree!
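Written as plain if-then code, that same chain of questions looks like this (a quick sketch; the function name and inputs are just for illustration):

def bring_umbrella(is_raining, is_cloudy, rain_chance):
    """Each question is one branch in a tiny decision tree."""
    if is_raining:
        return True          # 1. raining -> bring umbrella
    if is_cloudy:
        return True          # 2. cloudy -> bring umbrella
    if rain_chance > 0.5:
        return True          # 3. forecast rain > 50% -> bring umbrella
    return False             # 4. otherwise -> leave it at home

print(bring_umbrella(is_raining=False, is_cloudy=False, rain_chance=0.7))  # True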
[Figure: Loan approval decision tree]

The Loan Approval Problem

Imagine you’re a bank deciding whether to approve loans:
import numpy as np
import pandas as pd

# Loan applicant data
data = {
    'age': [25, 35, 45, 20, 35, 52, 23, 40, 60, 48],
    'income': [50000, 120000, 80000, 20000, 60000, 150000, 30000, 95000, 70000, 85000],
    'credit_score': [650, 750, 680, 500, 720, 780, 580, 710, 690, 740],
    'has_job': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
    'approved': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # Target
}

df = pd.DataFrame(data)
print(df)

How Would a Human Decide?

A loan officer might think:
Is credit score > 700?
├── Yes: Probably approve
│   └── Is income > 50000?
│       ├── Yes: APPROVE
│       └── No: Check job status...
└── No: Risky
    └── Is income > 100000?
        ├── Yes: Maybe approve
        └── No: REJECT
A decision tree learns these rules from data!
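Purely for illustration, here is what those hand-written rules might look like as code (the thresholds are guessed by the officer, not learned from data):

def loan_officer_rules(credit_score, income, has_job):
    """Hand-written rules mirroring the sketch above; thresholds are illustrative."""
    if credit_score > 700:
        if income > 50000:
            return "APPROVE"
        return "APPROVE" if has_job else "REJECT"  # "check job status..."
    if income > 100000:
        return "APPROVE"  # "maybe approve"
    return "REJECT"

print(loan_officer_rules(credit_score=750, income=120000, has_job=1))  # APPROVE

A decision tree's job is to discover thresholds like 700 and 50000 automatically, by examining the training data.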

Building a Decision Tree from Scratch

The Key Question: How Do We Choose Splits?

At each step, we need to decide:
  1. Which feature to split on?
  2. What value to split at?
We want splits that separate classes well.
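As a rough check (assuming the df DataFrame from the loan example is still in scope), here is how one candidate split, credit_score <= 700, separates the two classes:

# Assumes `df` from the loan example above
left = df[df['credit_score'] <= 700]['approved'].tolist()
right = df[df['credit_score'] > 700]['approved'].tolist()

print("credit_score <= 700:", left)   # [0, 1, 0, 0, 0] - mostly rejected
print("credit_score  > 700:", right)  # [1, 1, 1, 1, 1] - all approved

The next two ideas, Gini impurity and information gain, simply put a number on how clean a separation like this is.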

Measuring “Purity” with Gini Impurity

Gini impurity measures how mixed a group is:

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the proportion of class $i$ and $C$ is the number of classes.
def gini_impurity(y):
    """
    Calculate Gini impurity of a set.
    """
    if len(y) == 0:
        return 0
    
    # Count each class
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    
    # Gini = 1 - sum(p^2)
    return 1 - np.sum(probabilities ** 2)

# Examples
pure = [1, 1, 1, 1]           # All same class
print(f"Pure:  {gini_impurity(pure):.3f}")    # 0.0

mixed = [1, 1, 0, 0]          # 50-50 split
print(f"Mixed: {gini_impurity(mixed):.3f}")   # 0.5

mostly_one = [1, 1, 1, 0]     # 75-25 split
print(f"Mostly one: {gini_impurity(mostly_one):.3f}")  # 0.375

Information Gain

We want the split that reduces impurity the most:

$$Gain = Gini_{parent} - \frac{n_{left}}{n_{total}}\,Gini_{left} - \frac{n_{right}}{n_{total}}\,Gini_{right}$$
def information_gain(y, left_indices, right_indices):
    """
    Calculate information gain from a split.
    """
    parent_gini = gini_impurity(y)
    
    n = len(y)
    n_left = len(left_indices)
    n_right = len(right_indices)
    
    if n_left == 0 or n_right == 0:
        return 0
    
    # Weighted average of child impurities
    child_gini = (
        (n_left / n) * gini_impurity(y[left_indices]) +
        (n_right / n) * gini_impurity(y[right_indices])
    )
    
    return parent_gini - child_gini

Building the Tree

class DecisionNode:
    """A node in the decision tree."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature        # Feature index to split on
        self.threshold = threshold    # Value to split at
        self.left = left              # Left subtree
        self.right = right            # Right subtree
        self.value = value            # Predicted class (for leaf nodes)

class SimpleDecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
    
    def fit(self, X, y):
        self.root = self._build_tree(X, y, depth=0)
    
    def _build_tree(self, X, y, depth):
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
        # Stopping conditions
        if (depth >= self.max_depth or 
            n_samples < self.min_samples_split or 
            n_classes == 1):
            return DecisionNode(value=self._most_common_class(y))
        
        # Find best split
        best_gain = 0
        best_feature = None
        best_threshold = None
        
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            
            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                
                if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                    continue
                
                gain = information_gain(y, np.where(left_mask)[0], np.where(right_mask)[0])
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        # No good split found
        if best_gain == 0:
            return DecisionNode(value=self._most_common_class(y))
        
        # Create child nodes
        left_mask = X[:, best_feature] <= best_threshold
        left = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right = self._build_tree(X[~left_mask], y[~left_mask], depth + 1)
        
        return DecisionNode(
            feature=best_feature,
            threshold=best_threshold,
            left=left,
            right=right
        )
    
    def _most_common_class(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]
    
    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])
    
    def _traverse(self, x, node):
        if node.value is not None:
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._traverse(x, node.left)
        return self._traverse(x, node.right)

# Test our tree
X = df[['age', 'income', 'credit_score', 'has_job']].values
y = df['approved'].values

tree = SimpleDecisionTree(max_depth=3)
tree.fit(X, y)
predictions = tree.predict(X)
print("Predictions:", predictions)
print("Actual:     ", y)
print(f"Accuracy:    {np.mean(predictions == y):.2%}")

Using scikit-learn

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Prepare data
X = df[['age', 'income', 'credit_score', 'has_job']].values
y = df['approved'].values
feature_names = ['age', 'income', 'credit_score', 'has_job']

# Train
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X, y)

# Visualize the tree!
plt.figure(figsize=(20, 10))
plot_tree(model, 
          feature_names=feature_names,
          class_names=['Rejected', 'Approved'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title("Loan Approval Decision Tree")
plt.tight_layout()
plt.show()

Real Example: Iris Classification

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Feature importance
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"  {name}: {importance:.3f}")

The Problem: Overfitting

Decision trees can get too specific:
# Overfit tree - memorizes training data
overfit_model = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1)
overfit_model.fit(X_train, y_train)

print(f"Training accuracy: {overfit_model.score(X_train, y_train):.2%}")  # 100%!
print(f"Test accuracy:     {overfit_model.score(X_test, y_test):.2%}")    # Lower

# Well-tuned tree - generalizes better
good_model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)
good_model.fit(X_train, y_train)

print(f"\nTuned Training accuracy: {good_model.score(X_train, y_train):.2%}")
print(f"Tuned Test accuracy:     {good_model.score(X_test, y_test):.2%}")
Overfitting Signs:
  • Training accuracy much higher than test accuracy
  • Very deep trees (many levels)
  • Leaves with very few samples
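A quick way to see the first sign in practice is to sweep max_depth and compare training vs. test accuracy; a minimal sketch, assuming the iris split from above is still in scope:

# Assumes X_train, X_test, y_train, y_test from the iris example above
for depth in [1, 2, 3, 5, 10, None]:
    m = DecisionTreeClassifier(max_depth=depth, random_state=42)
    m.fit(X_train, y_train)
    print(f"max_depth={str(depth):>4}  "
          f"train={m.score(X_train, y_train):.2%}  "
          f"test={m.score(X_test, y_test):.2%}")

As depth grows, training accuracy climbs toward 100% while test accuracy levels off or drops, which is the overfitting gap described above.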

Controlling Tree Complexity

from sklearn.tree import DecisionTreeClassifier

# Hyperparameters to control overfitting
model = DecisionTreeClassifier(
    max_depth=5,           # Maximum depth of tree
    min_samples_split=10,  # Min samples to split a node
    min_samples_leaf=5,    # Min samples in a leaf
    max_features='sqrt',   # Max features to consider per split
    random_state=42
)
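Rather than guessing these values, you would typically tune them with cross-validation. Here is a minimal sketch using GridSearchCV; the grid values are just examples, and it assumes X_train, y_train from the iris section:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [2, 3, 4, 5, None],     # example values, not a recommendation
    'min_samples_leaf': [1, 2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2%}")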

Regression Trees

Trees can also predict numbers!
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np

# Create data
np.random.seed(42)
X = np.random.uniform(0, 10, (100, 1))
y = np.sin(X).ravel() + np.random.normal(0, 0.2, 100)

# Train regression tree
model = DecisionTreeRegressor(max_depth=4)
model.fit(X, y)

# Predict on smooth range
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred = model.predict(X_test)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X_test, y_pred, 'r-', linewidth=2, label='Tree Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Decision Tree Regression')
plt.legend()
plt.show()
Notice the “staircase” pattern - trees predict constant values in each region!
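One way to confirm the piecewise-constant behavior: a tree with max_depth=4 has at most 2**4 = 16 leaves, so the predictions above can take at most 16 distinct values (continuing the regression example):

# Continuing the regression example above
print(f"Distinct predicted values: {len(np.unique(y_pred))} (at most 2**4 = 16)")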

Advantages and Disadvantages

Advantages

  • Easy to understand and visualize
  • No feature scaling needed
  • Handles both numeric and categorical
  • Feature importance built-in
  • Fast predictions

Disadvantages

  • Prone to overfitting
  • Unstable (small data changes = different tree)
  • Axis-aligned splits only
  • Not as accurate as ensemble methods
  • Can be biased toward features with many levels

🚀 Mini Projects

Project 1

Build a loan approval classifier

Project 2

Titanic survival prediction

Project 3

Visualize and interpret decision rules

Key Takeaways

If-Then Rules

Trees learn rules from data automatically

Gini Impurity

Measures how mixed a group is (lower = purer)

Information Gain

Choose splits that reduce impurity most

Depth Control

Limit depth to prevent overfitting

What’s Next?

Before moving to ensemble methods, let’s explore two more powerful classifiers: