> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Decision Trees > Make decisions like a flowchart - if this, then that # Decision Trees Decision Tree Structure

## You Already Think in Trees Every day, you make decisions using "if-then" logic: **Should I bring an umbrella?** 1. Is it raining? → Yes → Bring umbrella 2. Is it cloudy? → Yes → Bring umbrella 3. Is forecast rain > 50%? → Yes → Bring umbrella 4. Otherwise → Don't bring umbrella **That's a decision tree!**

*** ## The Loan Approval Problem Imagine you're a bank deciding whether to approve loans: ```python theme={null} import numpy as np import pandas as pd # Loan applicant data data = { 'age': [25, 35, 45, 20, 35, 52, 23, 40, 60, 48], 'income': [50000, 120000, 80000, 20000, 60000, 150000, 30000, 95000, 70000, 85000], 'credit_score': [650, 750, 680, 500, 720, 780, 580, 710, 690, 740], 'has_job': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1], 'approved': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1] # Target } df = pd.DataFrame(data) print(df) ``` ### How Would a Human Decide? A loan officer might think: ``` Is credit score > 700? ├── Yes: Probably approve │ └── Is income > 50000? │ ├── Yes: APPROVE │ └── No: Check job status... └── No: Risky └── Is income > 100000? ├── Yes: Maybe approve └── No: REJECT ``` **A decision tree learns these rules from data!** *** ## Building a Decision Tree from Scratch ### The Key Question: How Do We Choose Splits? At each step, we need to decide: 1. Which feature to split on? 2. What value to split at? We want splits that **separate classes well**. ### Measuring "Purity" with Gini Impurity Imagine reaching into a bag of colored balls. **Gini Impurity** measures: "If I pick two balls at random, how likely are they to be different colors?" $$ Gini = 1 - \sum_{i=1}^{C} p_i^2 $$ Where $p_i$ is the proportion of class $i$. * **Gini = 0**: Perfect purity -- every ball is the same color. The split created a "pure" group. * **Gini = 0.5**: Maximum impurity for 2 classes -- a 50/50 mix. Picking two balls at random gives you different colors half the time. * **The goal**: Each split should create child nodes with *lower* Gini than the parent -- we're sorting the balls into separate bags. ```python theme={null} def gini_impurity(y): """ Calculate Gini impurity of a set. """ if len(y) == 0: return 0 # Count each class _, counts = np.unique(y, return_counts=True) probabilities = counts / len(y) # Gini = 1 - sum(p^2) return 1 - np.sum(probabilities ** 2) # Examples pure = [1, 1, 1, 1] # All same class print(f"Pure: {gini_impurity(pure):.3f}") # 0.0 mixed = [1, 1, 0, 0] # 50-50 split print(f"Mixed: {gini_impurity(mixed):.3f}") # 0.5 mostly_one = [1, 1, 1, 0] # 75-25 split print(f"Mostly one: {gini_impurity(mostly_one):.3f}") # 0.375 ``` ### Information Gain We want the split that reduces impurity the most: $$ Gain = Gini_{parent} - \frac{n_{left}}{n_{total}} Gini_{left} - \frac{n_{right}}{n_{total}} Gini_{right} $$ ```python theme={null} def information_gain(y, left_indices, right_indices): """ Calculate information gain from a split. """ parent_gini = gini_impurity(y) n = len(y) n_left = len(left_indices) n_right = len(right_indices) if n_left == 0 or n_right == 0: return 0 # Weighted average of child impurities child_gini = ( (n_left / n) * gini_impurity(y[left_indices]) + (n_right / n) * gini_impurity(y[right_indices]) ) return parent_gini - child_gini ``` *** ## Building the Tree ```python theme={null} class DecisionNode: """A node in the decision tree.""" def __init__(self, feature=None, threshold=None, left=None, right=None, value=None): self.feature = feature # Feature index to split on self.threshold = threshold # Value to split at self.left = left # Left subtree self.right = right # Right subtree self.value = value # Predicted class (for leaf nodes) class SimpleDecisionTree: def __init__(self, max_depth=5, min_samples_split=2): self.max_depth = max_depth self.min_samples_split = min_samples_split self.root = None def fit(self, X, y): self.root = self._build_tree(X, y, depth=0) def _build_tree(self, X, y, depth): n_samples, n_features = X.shape n_classes = len(np.unique(y)) # Stopping conditions if (depth >= self.max_depth or n_samples < self.min_samples_split or n_classes == 1): return DecisionNode(value=self._most_common_class(y)) # Find best split best_gain = 0 best_feature = None best_threshold = None for feature in range(n_features): thresholds = np.unique(X[:, feature]) for threshold in thresholds: left_mask = X[:, feature] <= threshold right_mask = ~left_mask if np.sum(left_mask) == 0 or np.sum(right_mask) == 0: continue gain = information_gain(y, np.where(left_mask)[0], np.where(right_mask)[0]) if gain > best_gain: best_gain = gain best_feature = feature best_threshold = threshold # No good split found if best_gain == 0: return DecisionNode(value=self._most_common_class(y)) # Create child nodes left_mask = X[:, best_feature] <= best_threshold left = self._build_tree(X[left_mask], y[left_mask], depth + 1) right = self._build_tree(X[~left_mask], y[~left_mask], depth + 1) return DecisionNode( feature=best_feature, threshold=best_threshold, left=left, right=right ) def _most_common_class(self, y): values, counts = np.unique(y, return_counts=True) return values[np.argmax(counts)] def predict(self, X): return np.array([self._traverse(x, self.root) for x in X]) def _traverse(self, x, node): if node.value is not None: return node.value if x[node.feature] <= node.threshold: return self._traverse(x, node.left) return self._traverse(x, node.right) # Test our tree X = df[['age', 'income', 'credit_score', 'has_job']].values y = df['approved'].values tree = SimpleDecisionTree(max_depth=3) tree.fit(X, y) predictions = tree.predict(X) print("Predictions:", predictions) print("Actual: ", y) print(f"Accuracy: {np.mean(predictions == y):.2%}") ``` *** ## Using scikit-learn ```python theme={null} from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt # Prepare data X = df[['age', 'income', 'credit_score', 'has_job']].values y = df['approved'].values feature_names = ['age', 'income', 'credit_score', 'has_job'] # Train model = DecisionTreeClassifier(max_depth=3, random_state=42) model.fit(X, y) # Visualize the tree! plt.figure(figsize=(20, 10)) plot_tree(model, feature_names=feature_names, class_names=['Rejected', 'Approved'], filled=True, rounded=True, fontsize=10) plt.title("Loan Approval Decision Tree") plt.tight_layout() plt.show() ``` *** ## Real Example: Iris Classification ```python theme={null} from sklearn.datasets import load_iris from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report # Load data iris = load_iris() X_train, X_test, y_train, y_test = train_test_split( iris.data, iris.target, test_size=0.2, random_state=42 ) # Train model = DecisionTreeClassifier(max_depth=4, random_state=42) model.fit(X_train, y_train) # Evaluate y_pred = model.predict(X_test) print(classification_report(y_test, y_pred, target_names=iris.target_names)) # Feature importance print("\nFeature Importances:") for name, importance in zip(iris.feature_names, model.feature_importances_): print(f" {name}: {importance:.3f}") ``` *** ## The Problem: Overfitting Decision trees can get too specific: ```python theme={null} # Overfit tree - memorizes training data overfit_model = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1) overfit_model.fit(X_train, y_train) print(f"Training accuracy: {overfit_model.score(X_train, y_train):.2%}") # 100%! print(f"Test accuracy: {overfit_model.score(X_test, y_test):.2%}") # Lower # Well-tuned tree - generalizes better good_model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5) good_model.fit(X_train, y_train) print(f"\nTuned Training accuracy: {good_model.score(X_train, y_train):.2%}") print(f"Tuned Test accuracy: {good_model.score(X_test, y_test):.2%}") ``` **Overfitting Signs:** * Training accuracy much higher than test accuracy (e.g., 100% train vs 70% test) * Very deep trees (many levels) -- a tree with depth=50 has likely memorized individual data points * Leaves with very few samples -- if a leaf has 1-2 samples, the tree is "remembering" specific examples instead of learning general patterns **Why trees overfit so easily**: An unrestricted tree can always achieve 100% training accuracy by creating one leaf per training sample. It's like answering a quiz by memorizing every question-answer pair -- you ace the practice test but fail any new question. This is why we always limit `max_depth` or `min_samples_leaf`. *** ## Controlling Tree Complexity ```python theme={null} from sklearn.tree import DecisionTreeClassifier # Hyperparameters to control overfitting -- think of these as # "guardrails" that prevent the tree from getting too specific. model = DecisionTreeClassifier( max_depth=5, # Maximum depth of tree -- the single most important control min_samples_split=10, # Min samples to split a node -- prevents splitting tiny groups min_samples_leaf=5, # Min samples in a leaf -- ensures each prediction has enough evidence max_features='sqrt', # Max features per split -- adds randomness, reduces correlation between trees (useful for ensembles) random_state=42 ) # Practical tip: Start with max_depth=5 and increase gradually # while monitoring test accuracy. When test accuracy stops improving # (or starts dropping), you've found your sweet spot. ``` *** ## Regression Trees Trees can also predict numbers! ```python theme={null} from sklearn.tree import DecisionTreeRegressor from sklearn.datasets import make_regression import matplotlib.pyplot as plt import numpy as np # Create data np.random.seed(42) X = np.random.uniform(0, 10, (100, 1)) y = np.sin(X).ravel() + np.random.normal(0, 0.2, 100) # Train regression tree model = DecisionTreeRegressor(max_depth=4) model.fit(X, y) # Predict on smooth range X_test = np.linspace(0, 10, 200).reshape(-1, 1) y_pred = model.predict(X_test) # Plot plt.figure(figsize=(10, 6)) plt.scatter(X, y, alpha=0.5, label='Data') plt.plot(X_test, y_pred, 'r-', linewidth=2, label='Tree Prediction') plt.xlabel('X') plt.ylabel('y') plt.title('Decision Tree Regression') plt.legend() plt.show() ``` Notice the "staircase" pattern -- trees predict constant values in each region! This is a fundamental limitation of decision trees for regression: they can only output discrete values (the mean of training points in each leaf), creating a piecewise-constant approximation. For smooth, continuous relationships, you'll need many leaves (risking overfitting) or an ensemble of trees (like Gradient Boosting, Module 6) that sums many small staircases into a smooth curve. *** ## Advantages and Disadvantages * Easy to understand and visualize * No feature scaling needed * Handles both numeric and categorical * Feature importance built-in * Fast predictions * Prone to overfitting * Unstable (small data changes = different tree) * Axis-aligned splits only * Not as accurate as ensemble methods * Can be biased toward features with many levels *** ## 🚀 Mini Projects Build a loan approval classifier Titanic survival prediction Visualize and interpret decision rules

**Project 1: Loan Approval Classifier** - Build a business decision system

**Objective**: Create a decision tree to automatically approve/reject loan applications. ```python theme={null} import numpy as np import pandas as pd from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report, confusion_matrix import matplotlib.pyplot as plt # Generate loan application data np.random.seed(42) n = 500 age = np.random.normal(35, 10, n).clip(18, 70) income = np.random.exponential(50000, n) + 20000 credit_score = np.random.normal(680, 80, n).clip(300, 850) debt_ratio = np.random.uniform(0.1, 0.6, n) employment_years = np.random.exponential(5, n).clip(0, 30) # Approval logic (what tree should learn) approved = ( (credit_score > 650) & (income > 40000) & (debt_ratio < 0.4) | (credit_score > 750) & (income > 30000) | (employment_years > 5) & (credit_score > 600) ).astype(int) # Add some noise noise_idx = np.random.choice(n, size=int(n*0.05), replace=False) approved[noise_idx] = 1 - approved[noise_idx] df = pd.DataFrame({ 'age': age, 'income': income, 'credit_score': credit_score, 'debt_ratio': debt_ratio, 'employment_years': employment_years, 'approved': approved }) print(f"Approval rate: {approved.mean():.1%}") # Split data X = df.drop('approved', axis=1) y = df['approved'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Train decision tree tree = DecisionTreeClassifier(max_depth=4, random_state=42) tree.fit(X_train, y_train) # Evaluate y_pred = tree.predict(X_test) print("\n=== Classification Report ===") print(classification_report(y_test, y_pred, target_names=['Rejected', 'Approved'])) # Feature importance print("\n=== Feature Importance ===") for name, imp in sorted(zip(X.columns, tree.feature_importances_), key=lambda x: -x[1]): print(f"{name}: {imp:.3f}") # Visualize tree plt.figure(figsize=(20, 10)) plot_tree(tree, feature_names=X.columns.tolist(), class_names=['Reject', 'Approve'], filled=True, rounded=True, fontsize=10) plt.title('Loan Approval Decision Tree') plt.tight_layout() plt.savefig('loan_decision_tree.png', dpi=150) print("\nSaved loan_decision_tree.png") # Interpret rules print("\n=== Key Decision Rules ===") print("The tree learned these patterns:") print("1. Credit score is the most important factor") print("2. Income thresholds vary by credit score") print("3. Low debt ratio can compensate for moderate credit") ```

**Project 2: Titanic Survival Prediction** - Classic ML challenge

**Objective**: Predict passenger survival on the Titanic. ```python theme={null} import numpy as np import pandas as pd from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import accuracy_score # Simulated Titanic-like data np.random.seed(42) n = 800 pclass = np.random.choice([1, 2, 3], n, p=[0.25, 0.25, 0.5]) sex = np.random.choice([0, 1], n, p=[0.35, 0.65]) # 0=female, 1=male age = np.random.normal(30, 15, n).clip(1, 80) sibsp = np.random.poisson(0.5, n) fare = np.where(pclass == 1, np.random.exponential(80, n), np.where(pclass == 2, np.random.exponential(20, n), np.random.exponential(10, n))) # Survival probability (women & children first, first class priority) survival_prob = ( 0.1 + 0.5 * (sex == 0) + # Female 0.2 * (age < 15) + # Child 0.15 * (pclass == 1) + 0.05 * (pclass == 2) - 0.1 * (age > 50) ).clip(0, 1) survived = (np.random.random(n) < survival_prob).astype(int) df = pd.DataFrame({ 'pclass': pclass, 'sex': sex, 'age': age, 'sibsp': sibsp, 'fare': fare, 'survived': survived }) print(f"Survival rate: {survived.mean():.1%}") # Prepare data X = df.drop('survived', axis=1) y = df['survived'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Compare different tree depths print("\n=== Depth Comparison ===") for depth in [2, 4, 6, 8, None]: tree = DecisionTreeClassifier(max_depth=depth, random_state=42) cv_scores = cross_val_score(tree, X_train, y_train, cv=5) tree.fit(X_train, y_train) test_acc = accuracy_score(y_test, tree.predict(X_test)) print(f"Depth {str(depth):>4}: CV={cv_scores.mean():.3f}±{cv_scores.std():.3f}, Test={test_acc:.3f}") # Best model best_tree = DecisionTreeClassifier(max_depth=4, random_state=42) best_tree.fit(X_train, y_train) print("\n=== Feature Importance ===") for name, imp in sorted(zip(X.columns, best_tree.feature_importances_), key=lambda x: -x[1]): if imp > 0.01: print(f"{name}: {imp:.3f}") # Predict for specific passengers print("\n=== Survival Predictions ===") passengers = [ {"pclass": 1, "sex": 0, "age": 25, "sibsp": 1, "fare": 100}, # 1st class woman {"pclass": 3, "sex": 1, "age": 30, "sibsp": 0, "fare": 8}, # 3rd class man {"pclass": 2, "sex": 0, "age": 5, "sibsp": 2, "fare": 20}, # 2nd class child ] for p in passengers: X_new = pd.DataFrame([p]) prob = best_tree.predict_proba(X_new)[0][1] pred = "Survived" if prob > 0.5 else "Died" print(f"Class {p['pclass']}, {'Female' if p['sex']==0 else 'Male'}, Age {p['age']:.0f}: {pred} ({prob:.0%})") ```

**Project 3: Rule Extraction & Interpretation** - Understand model decisions

**Objective**: Extract and visualize decision rules from a trained tree. ```python theme={null} import numpy as np from sklearn.tree import DecisionTreeClassifier, export_text from sklearn.datasets import load_iris # Load iris for demonstration iris = load_iris() X, y = iris.data, iris.target feature_names = iris.feature_names class_names = iris.target_names # Train a shallow, interpretable tree tree = DecisionTreeClassifier(max_depth=3, random_state=42) tree.fit(X, y) # Export text rules print("=== Decision Rules (Text) ===") rules = export_text(tree, feature_names=feature_names) print(rules) # Extract rules programmatically def extract_rules(tree, feature_names, class_names): """Extract decision rules from tree.""" tree_ = tree.tree_ feature_name = [ feature_names[i] if i != -2 else "undefined" for i in tree_.feature ] rules = [] def recurse(node, path): if tree_.feature[node] != -2: # Not a leaf name = feature_name[node] threshold = tree_.threshold[node] # Left branch (<=) recurse(tree_.children_left[node], path + [f"{name} <= {threshold:.2f}"]) # Right branch (>) recurse(tree_.children_right[node], path + [f"{name} > {threshold:.2f}"]) else: # Leaf class_idx = np.argmax(tree_.value[node]) class_name = class_names[class_idx] samples = int(tree_.n_node_samples[node]) rules.append({ 'conditions': path, 'class': class_name, 'samples': samples, 'confidence': tree_.value[node].max() / tree_.value[node].sum() }) recurse(0, []) return rules rules = extract_rules(tree, feature_names, list(class_names)) print("\n=== Human-Readable Rules ===") for i, rule in enumerate(rules, 1): conditions = " AND ".join(rule['conditions']) print(f"Rule {i}: IF {conditions}") print(f" THEN {rule['class']} (samples={rule['samples']}, confidence={rule['confidence']:.0%})\n") # Simplify to actionable insights print("=== Actionable Insights ===") print("1. Petal length ≤ 2.45 → Always Setosa (100% accuracy)") print("2. Petal length > 2.45 AND petal width ≤ 1.75 → Likely Versicolor") print("3. Petal length > 2.45 AND petal width > 1.75 → Likely Virginica") print("\nKey discriminating features: petal_length, petal_width") ```

*** ## Key Takeaways Trees learn rules from data automatically Measures how mixed a group is (lower = purer) Choose splits that reduce impurity most Limit depth to prevent overfitting *** ## What's Next? Before moving to ensemble methods, let's explore two more powerful classifiers: Support Vector Machines - find the optimal boundary between classes Or jump ahead to Random Forests and Gradient Boosting