You choose hyperparameters. The model learns parameters.

Think of it like baking a cake. The parameters are the exact mixing ratios the recipe ends up with (flour, sugar, eggs). The hyperparameters are the oven temperature and baking time: you set them before baking starts, and they control how the cake turns out. You can’t learn the oven temperature from the batter; you have to experiment.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,       # How many trees?
    max_depth=10,           # How deep?
    min_samples_split=2,    # Min samples to split?
    min_samples_leaf=1,     # Min samples in leaf?
    max_features='sqrt',    # Features per split?
    bootstrap=True,         # Sample with replacement?
    random_state=42
)
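To make the distinction concrete, here is a minimal sketch in plain Python rather than scikit-learn. The function `fit_slope` and the toy data are illustrative assumptions: the slope `w` is a parameter learned from the data, while `learning_rate` and `n_epochs` are hyperparameters that must be chosen before training starts.

```python
def fit_slope(xs, ys, learning_rate, n_epochs):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: learned from the data
    n = len(xs)
    for _ in range(n_epochs):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


# Toy data (hypothetical): the true slope is 3
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

# Same data, same model, different hyperparameters
good_w = fit_slope(xs, ys, learning_rate=0.01, n_epochs=100)
bad_w = fit_slope(xs, ys, learning_rate=0.2, n_epochs=20)

print(f"learning_rate=0.01 -> learned w = {good_w:.4f}")
print(f"learning_rate=0.2  -> learned w = {bad_w:.1f}")
```

With the smaller learning rate the learned slope settles near 3. With the larger one the updates overshoot and the slope diverges. Nothing in the data tells the model which learning rate to use, which is why hyperparameters have to be found by experiment.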
Grid search is the brute-force approach: define a grid of values and try every single combination. It’s like testing every possible combination of oven temperature and baking time — thorough but expensive. For 3 hyperparameters with 4 values each, that’s 4 x 4 x 4 = 64 combinations, each requiring a full cross-validation run.
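The arithmetic above can be checked with a short standard-library sketch. The helper names `count_fits` and `iter_combinations`, and the values in `toy_grid`, are assumptions made for illustration; they are not part of scikit-learn.

```python
from itertools import product
from math import prod


def count_fits(param_grid, cv_folds):
    """Return (number of combinations, number of model fits) for a grid."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations, n_combinations * cv_folds


def iter_combinations(param_grid):
    """Yield every combination in the grid as a dict, one per trial."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid: 3 hyperparameters with 4 values each
toy_grid = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [3, 5, 10, 20],
    'min_samples_split': [2, 5, 10, 20],
}

n_combinations, n_fits = count_fits(toy_grid, cv_folds=5)
print(f"{n_combinations} combinations, {n_fits} fits with 5-fold CV")
print("First combination:", next(iter_combinations(toy_grid)))
```

Adding a fourth hyperparameter with 4 values would multiply both numbers by 4, which is why grids become expensive quickly.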
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# Define parameter grid -- start coarse, refine later.
# Common mistake: putting too many values in the grid.
# 3 values per parameter is usually enough for a first pass.
param_grid = {
    'n_estimators': [50, 100, 200],     # More trees = better but slower
    'max_depth': [5, 10, 15, None],     # None = unlimited (risk of overfitting)
    'min_samples_split': [2, 5, 10]     # Higher = more regularization
}
# Total combinations: 3 x 4 x 3 = 36, each evaluated with 5-fold CV
# That's 36 x 5 = 180 model fits!

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring='accuracy',   # Metric to optimize -- use 'f1' for imbalanced data
    n_jobs=-1,            # Use all CPU cores -- essential for grid search speed
    verbose=1             # Show progress (set to 2 for more detail)
)
grid_search.fit(X_train, y_train)

# Results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
import pandas as pd
import matplotlib.pyplot as plt

# Get results as DataFrame
results = pd.DataFrame(grid_search.cv_results_)
results = results[['param_n_estimators', 'param_max_depth',
                   'mean_test_score', 'std_test_score']].copy()
print(results.sort_values('mean_test_score', ascending=False).head(10))

# max_depth=None counts as a missing value, and pivot_table silently drops
# missing index labels, so convert the depths to string labels first.
results['param_max_depth'] = results['param_max_depth'].map(
    lambda d: 'None' if pd.isna(d) else str(int(d))
)
depth_order = list(dict.fromkeys(results['param_max_depth']))  # order of appearance

# Heatmap for 2 parameters (scores are averaged over min_samples_split)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_n_estimators'
).reindex(depth_order)

plt.figure(figsize=(10, 6))
plt.imshow(pivot, cmap='viridis', aspect='auto')
plt.colorbar(label='Mean CV Score')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.title('Grid Search Results')

# Annotate each cell with its score
for i in range(len(pivot.index)):
    for j in range(len(pivot.columns)):
        plt.text(j, i, f'{pivot.iloc[i, j]:.3f}',
                 ha='center', va='center', color='white')
plt.tight_layout()
plt.show()
Here’s the key insight from Bergstra and Bengio’s 2012 paper: in most ML problems, only a few hyperparameters (often one or two) actually matter. A grid reuses the same handful of values for each important hyperparameter, so most of its budget goes to exhaustively varying the ones that don’t matter. Random search draws a fresh value for every hyperparameter on every trial, so it samples each important dimension more thoroughly. Think of it like searching for a lost key in a field: grid search mows the lawn in neat rows, while random search drops random probes. If the key is somewhere along a specific line, random search is more likely to hit that line.

Random search samples randomly from parameter distributions:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define parameter distributions
# Note: randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    'n_estimators': randint(50, 300),         # Integer from 50 to 299
    'max_depth': randint(3, 20),              # Integer from 3 to 19
    'min_samples_split': randint(2, 15),      # Integer from 2 to 14
    'min_samples_leaf': randint(1, 10),       # Integer from 1 to 9
    'max_features': ['sqrt', 'log2', None]    # Categorical
}

# Random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,            # Try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
Research (Bergstra & Bengio, 2012) shows that random search often finds good hyperparameters faster than grid search. The intuition is simple: if only 2 of your 5 hyperparameters actually matter (which is common), grid search wastes most of its budget exploring irrelevant dimensions. Random search draws new values for every hyperparameter on each trial, so you explore more unique values of the important parameters with the same compute budget.

Practical rule: Use grid search when you have 2-3 hyperparameters with known good ranges. Use random search when you have 4+ hyperparameters or wide, uncertain ranges.
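The "more unique values for the same budget" argument can be seen in a small standard-library sketch. The helpers `grid_trials`, `random_trials`, and `unique_values_per_param` are hypothetical names, and both hyperparameters are assumed to be scaled to the range 0 to 1.

```python
import random
from itertools import product


def grid_trials(values_per_param, n_params):
    """All combinations of evenly spaced values in [0, 1] per hyperparameter."""
    axis = [i / (values_per_param - 1) for i in range(values_per_param)]
    return list(product(axis, repeat=n_params))


def random_trials(n_trials, n_params, seed=0):
    """Independent uniform draws in [0, 1) for every hyperparameter."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_params))
            for _ in range(n_trials)]


def unique_values_per_param(trials):
    """How many distinct values each hyperparameter was tried at."""
    n_params = len(trials[0])
    return [len({trial[d] for trial in trials}) for d in range(n_params)]


grid = grid_trials(values_per_param=3, n_params=2)      # 3 x 3 = 9 trials
rand = random_trials(n_trials=len(grid), n_params=2)    # same budget: 9 trials

print("Trials per method:", len(grid))
print("Grid, unique values per hyperparameter:  ", unique_values_per_param(grid))
print("Random, unique values per hyperparameter:", unique_values_per_param(rand))
```

If only the first hyperparameter matters, the grid has effectively tested it at 3 settings, while random search has tested it at 9 for the same number of model fits.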
Grid search and random search both ignore past results. Bayesian optimization is smarter: it builds a model of “which hyperparameters lead to good scores” and uses that model to decide where to look next. Think of it like a gold prospector who, after finding gold in one spot, digs mostly nearby rather than randomly across the entire mountain, while still sinking the occasional test hole in unexplored ground.
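The loop described above can be sketched in a few dozen lines of NumPy. This is a toy illustration under stated assumptions, not how any particular library implements it: the surrogate is a small Gaussian process with a fixed RBF kernel, the next point is chosen with an upper-confidence-bound rule, and `toy_score` is a hypothetical stand-in for a cross-validation score over one hyperparameter scaled to the range 0 to 1.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_cand."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_cand)
    mean = k_cross.T @ np.linalg.solve(k_obs, y_obs)
    var = 1.0 - np.sum(k_cross * np.linalg.solve(k_obs, k_cross), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_score(x):
    """Hypothetical validation score, best at x = 0.7."""
    return -(x - 0.7) ** 2


def bayes_opt_1d(score_fn, n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Maximize score_fn on [0, 1] with a GP surrogate and a UCB rule."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)
    # Start with a few random evaluations
    x_obs = rng.uniform(0.0, 1.0, size=n_init)
    y_obs = np.array([score_fn(x) for x in x_obs])
    for _ in range(n_iter):
        # Standardize scores so they match the GP's zero-mean, unit-variance prior
        y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-9)
        mean, std = gp_posterior(x_obs, y_scaled, candidates)
        # High predicted score (exploit) or high uncertainty (explore)
        x_next = candidates[np.argmax(mean + kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, score_fn(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])


best_x, best_y = bayes_opt_1d(toy_score)
print(f"Best x found: {best_x:.3f} (score {best_y:.5f})")
```

The `kappa` value controls the trade-off: larger values favor unexplored regions, smaller values favor regions the surrogate already believes are good. In practice a dedicated library is the usual choice for real tuning jobs.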
Not all hyperparameters are created equal. Tune the ones that move the needle most, and leave the rest at sensible defaults.
from sklearn.ensemble import GradientBoostingClassifier

# For Gradient Boosting, these three interact heavily and matter most:
# - learning_rate: controls step size (low = more trees needed but better generalization)
# - n_estimators: number of boosting rounds (tied to learning_rate)
# - max_depth: complexity of each tree (usually 3-7 for boosting)
# Rule of thumb: lower learning_rate + more n_estimators = better results, more compute.
param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7]
}
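from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Assumed here: any estimator works as long as its parameters match param_grid
# (this one matches the gradient boosting grid).
model = GradientBoostingClassifier(random_state=42)

# Classification
scoring_classification = ['accuracy', 'f1', 'roc_auc', 'precision', 'recall']

# With several metrics, refit names the one used to choose the final model
grid = GridSearchCV(
    model,
    param_grid,
    cv=5,
    scoring=scoring_classification,
    refit='f1'    # Final model optimizes for F1
)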
Nested cross-validation gives a (nearly) unbiased evaluation of the tuning process. This is subtle but important: regular cross-validation with hyperparameter tuning gives you an optimistically biased estimate of performance, because you picked the best hyperparameters on the same folds you’re reporting results for. Nested CV fixes this by using separate inner folds for tuning and outer folds for evaluation.
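from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

# Use the full dataset: the outer loop creates its own held-out folds
X, y = load_breast_cancer(return_X_y=True)

# Random forest grid, kept small because nested CV multiplies the number of fits
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

# Inner loop: tune hyperparameters (picks the best settings)
inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_param_grid,
    cv=5
)

# Outer loop: evaluate the ENTIRE tuning process on held-out data
# This gives you an honest estimate of "if I gave this tuning pipeline
# new data, how well would it perform?"
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(f"Nested CV Score: {outer_scores.mean():.4f} (+/- {outer_scores.std():.4f})")

# This score is typically a little lower than the non-nested best CV score --
# the difference is the "optimism" that regular CV introduces.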
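Where the optimism comes from can be shown with a small simulation that uses only the standard library. The setup is a deliberate simplification and an assumption of this sketch: every configuration has the same true accuracy, and each cross-validation estimate is that accuracy plus Gaussian noise. The function name `simulate_selection_optimism` and its default values are hypothetical.

```python
import random
from statistics import mean


def simulate_selection_optimism(n_configs=20, true_score=0.80,
                                noise_sd=0.02, n_repeats=2000, seed=0):
    """Compare the winner's tuning score with a fresh estimate of the winner."""
    rng = random.Random(seed)
    reported, fresh = [], []
    for _ in range(n_repeats):
        # Noisy CV estimate for each configuration (all equally good in truth)
        tuning_scores = [true_score + rng.gauss(0.0, noise_sd)
                         for _ in range(n_configs)]
        # Reporting the best tuning score means reporting the luckiest draw
        reported.append(max(tuning_scores))
        # An independent estimate of the chosen configuration, as outer folds give
        fresh.append(true_score + rng.gauss(0.0, noise_sd))
    return mean(reported), mean(fresh)


reported_score, fresh_score = simulate_selection_optimism()
print(f"Average best tuning score:      {reported_score:.4f}")
print(f"Average score on fresh folds:   {fresh_score:.4f}")
print(f"Optimism from picking the best: {reported_score - fresh_score:.4f}")
```

Even though no configuration is actually better than another, the best tuning score sits above the true accuracy, while the fresh estimate does not. The gap grows with the number of configurations tried and with the noise in each estimate.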
Common ML mistake: tuning before feature engineering. As a rough rule of thumb, hyperparameter tuning yields an improvement of 1-3%, while good feature engineering can yield 5-20%. Get your features right first, then tune. A perfectly tuned model on bad features will usually lose to a default model on great features.
{
    'C': [0.1, 1, 10, 100],                # Inverse regularization strength (larger = less regularization)
    'gamma': ['scale', 'auto', 0.1, 1],    # Kernel coefficient (inverse width for RBF) -- large values are a common overfitting cause
    'kernel': ['rbf', 'poly']              # RBF is the default starting point
}
# WARNING: kernel SVM training time grows at least quadratically with the
# number of samples. For >50K samples, consider LinearSVC with just C,
# or switch to tree-based models.