Hyperparameter Tuning in Decision Tree

Hyperparameter Tuning in Decision Tree #

Hyperparameters are settings that control how the tree grows, rather than learned from the data.
Tuning means selecting the combination that gives the best performance on unseen data, usually validated with cross-validation.

Key Hyperparameters of Decision Tree#

Hyperparameter	Description	Effect
`max_depth`	Maximum depth of the tree	Lower → underfitting, Higher → overfitting
`min_samples_split`	Min samples required to split a node	Higher → simpler tree, Lower → complex tree
`min_samples_leaf`	Min samples required at a leaf	Larger → smoother predictions
`max_features`	Max features to consider for a split	Reduces variance, increases bias
`max_leaf_nodes`	Maximum number of leaves	Restricts tree growth
`criterion`	Split quality measure (`squared_error`)	Determines how splits are chosen

Goal: Balance between bias and variance to improve generalization.

2. Overfitting vs Underfitting#

Issue	Description	Symptoms	How to Handle
Overfitting	Tree is too complex, memorizes training data	Very high train score, low test score	1. Reduce `max_depth` 2. Increase `min_samples_leaf` 3. Increase `min_samples_split` 4. Limit `max_leaf_nodes` 5. Prune tree / use ensemble (Random Forest)
Underfitting	Tree is too simple, cannot capture patterns	Both train & test scores are low	1. Increase `max_depth` 2. Reduce `min_samples_leaf` 3. Reduce `min_samples_split` 4. Add more features

3. Practical Steps for Hyperparameter Tuning#

Step 1: Identify the problem#

Check train vs test performance using metrics like R², RMSE, MAE.

Step 2: Tune hyperparameters#

Use GridSearchCV or RandomizedSearchCV to find optimal hyperparameters.

Example grid:

param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3]
}

Step 3: Validate using cross-validation#

Ensures hyperparameters generalize to unseen data.

Step 4: Evaluate#

Check train/test R², RMSE and ensure the gap is not too large (prevents overfitting).

4. Tips to Handle Overfitting/Underfitting#

Overfitting#

Limit tree depth: max_depth
Set minimum samples per leaf: min_samples_leaf
Set minimum samples to split: min_samples_split
Limit number of leaf nodes: max_leaf_nodes
Use ensemble methods like Random Forest or Gradient Boosting

Underfitting#

Increase tree depth
Reduce min_samples_leaf / min_samples_split
Add more features or interaction terms
Ensure data preprocessing is correct

5. Visual intuition#

Shallow tree → Underfitting (misses patterns)
Deep tree → Overfitting (memorizes noise)
Properly tuned tree → Captures patterns, generalizes well

# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Create Sample Dataset
np.random.seed(42)
X = np.sort(np.random.rand(50, 1) * 10, axis=0)  # Features
y = np.sin(X).ravel() + np.random.normal(0, 0.2, X.shape[0])  # Target with noise

# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to train and evaluate a model
def evaluate_dtr(model, X_train, X_test, y_train, y_test, label="Model"):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    print(f"\n{label} Performance:")
    print(f"Train R²: {r2_score(y_train, y_train_pred):.2f}")
    print(f"Test R²: {r2_score(y_test, y_test_pred):.2f}")
    
    # Plot
    plt.scatter(X_train, y_train, c='blue', label='Train Data')
    plt.scatter(X_test, y_test, c='green', label='Test Data')
    X_plot = np.linspace(0, 10, 100).reshape(-1,1)
    plt.plot(X_plot, model.predict(X_plot), c='red', label='Prediction')
    plt.title(label)
    plt.legend()
    plt.show()

# ----------------------
# Underfitting Example
# ----------------------
dtr_underfit = DecisionTreeRegressor(max_depth=1, random_state=42)
evaluate_dtr(dtr_underfit, X_train, X_test, y_train, y_test, label="Underfitting Tree")

# ----------------------
# Overfitting Example
# ----------------------
dtr_overfit = DecisionTreeRegressor(max_depth=None, min_samples_leaf=1, random_state=42)
evaluate_dtr(dtr_overfit, X_train, X_test, y_train, y_test, label="Overfitting Tree")

# ----------------------
# Properly Tuned Tree using GridSearchCV
# ----------------------
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3]
}

grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                           param_grid, cv=3, scoring='r2')
grid_search.fit(X_train, y_train)
best_dtr = grid_search.best_estimator_
evaluate_dtr(best_dtr, X_train, X_test, y_train, y_test, label="Properly Tuned Tree")

Underfitting Tree Performance:
Train R²: 0.32
Test R²: 0.47

../../../_images/1f48bce7c8b89894418b3286ddd92ba1be0606d40114a51bb1752db50172559c.png

Overfitting Tree Performance:
Train R²: 1.00
Test R²: 0.87

../../../_images/fedc1d98af1785c6fefdd21b8e91d49b6ff20a84cfcd4f00c0f715664c2ed83a.png

Properly Tuned Tree Performance:
Train R²: 0.94
Test R²: 0.82

../../../_images/a4c6620143e8e1dd7cd486d1566213f9d9e5e9c1e0bf6c21f0a2214dc715773f.png

The Kernel crashed while executing code in the current cell or a previous cell. 

Please review the code in the cell(s) to identify a possible cause of the failure. 

Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. 

View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details.

Interpretation#

Dataset:
- \(y = \sin(X) + \text{noise}\) → nonlinear pattern with randomness.
Underfitting Tree:
- Very shallow (max_depth=1) → misses the sine pattern.
- Low train and test R².
Overfitting Tree:
- Grows fully (max_depth=None, min_samples_leaf=1) → memorizes training data.
- High train R², low test R².
Properly Tuned Tree:
- GridSearchCV finds optimal max_depth, min_samples_split, min_samples_leaf.
- Balances bias and variance → good train/test R².

Hyperparameter Tuning in Decision Tree

Contents