Hyperparameter tuning#

  • Hyperparameters are settings that control how the model learns.

  • For a Decision Tree Regressor (DTR), hyperparameters influence tree depth, splits, and overfitting/underfitting.

  • Tuning means finding the best combination of hyperparameters to improve model performance on unseen data.


Key Hyperparameters in Decision Tree Regressor#

| Hyperparameter | Description | Effect |
|---|---|---|
| max_depth | Maximum depth of the tree | Too deep → overfitting; too shallow → underfitting |
| min_samples_split | Minimum samples required to split a node | Larger value → fewer splits, simpler tree |
| min_samples_leaf | Minimum samples required at a leaf node | Prevents leaves with very few samples → reduces overfitting |
| max_features | Maximum number of features considered per split | Helps reduce variance |
| max_leaf_nodes | Maximum number of leaf nodes | Restricts tree growth |
| criterion | Function to measure split quality (squared_error, friedman_mse) | Determines how splits are chosen |
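These hyperparameters can be made concrete with a short sketch: fit one unconstrained tree and one tree restricted by max_depth, min_samples_split, min_samples_leaf, and max_leaf_nodes, then compare their size. The synthetic data and the particular values chosen here are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)  # noisy sine curve

# Unconstrained tree: grows until every leaf is pure
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained tree: each hyperparameter limits how far the tree can grow
pruned = DecisionTreeRegressor(
    max_depth=4,            # cap on tree depth
    min_samples_split=10,   # a node needs >= 10 samples to split
    min_samples_leaf=5,     # every leaf keeps >= 5 samples
    max_leaf_nodes=12,      # hard cap on the number of leaves
    random_state=0,
).fit(X, y)

print("Unconstrained depth / leaves:", full.get_depth(), full.get_n_leaves())
print("Constrained depth / leaves:  ", pruned.get_depth(), pruned.get_n_leaves())
```

The constrained tree comes out much smaller, which is exactly the mechanism by which these settings fight overfitting.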


Why is tuning important?#

  • Overfitting: Tree fits training data perfectly but fails on test data.

  • Underfitting: Tree is too simple and misses patterns.

  • Goal: Balance bias and variance for best predictive performance.
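The bias–variance trade-off above can be seen directly by comparing train and test R² for an unconstrained tree, a very shallow tree, and a moderately limited one. The synthetic data and the depth values are illustrative assumptions, not part of the example that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

deep = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)           # overfits: memorizes noise
shallow = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_tr, y_tr)  # underfits: one split only
balanced = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("shallow", shallow), ("balanced", balanced)]:
    print(f"{name:9s} train R2 = {model.score(X_tr, y_tr):.2f}   "
          f"test R2 = {model.score(X_te, y_te):.2f}")
```

The deep tree scores nearly perfectly on training data but drops on the test set, while the shallow tree is weak on both; the middle setting is the balance tuning aims for.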


How to tune hyperparameters?#

  • Grid Search: exhaustively evaluate every combination in a predefined grid (GridSearchCV).

  • Random Search: sample a fixed number of combinations from specified ranges (RandomizedSearchCV).

  • Both are usually combined with cross-validation so each candidate is scored on held-out data.


Tips for Tuning DTR#

  1. Start with max_depth → prevent overfitting.

  2. Adjust min_samples_split & min_samples_leaf → smooth predictions.

  3. Use cross-validation → ensures results are generalizable.

  4. Evaluate using R², RMSE, or MAE → choose the hyperparameters giving best performance.

# Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Step 2: Create sample dataset
data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
    'Target': [10, 12, 15, 14, 18, 20, 19, 22, 21, 25]
}
df = pd.DataFrame(data)

X = df[['Feature1', 'Feature2']]
y = df['Target']

# Step 3: Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2]
}

# Step 5: Grid Search
grid_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    scoring='r2',
    cv=3
)

grid_search.fit(X_train, y_train)

# Step 6: Best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Step 7: Train DTR with best hyperparameters
best_dtr = DecisionTreeRegressor(**best_params, random_state=42)
best_dtr.fit(X_train, y_train)

# Step 8: Predictions and performance metrics
y_pred = best_dtr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\nPerformance on Test Set:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
Best Hyperparameters: {'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2}

Performance on Test Set:
MAE: 1.67
MSE: 3.17
RMSE: 1.78
R² Score: 0.80
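As an alternative to the exhaustive grid search above, RandomizedSearchCV samples a fixed number of candidates, which scales better when the parameter space is large. A minimal sketch on the same toy dataset; the parameter ranges and n_iter value are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
    'Target':   [10, 12, 15, 14, 18, 20, 19, 22, 21, 25],
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature1', 'Feature2']], df['Target'], test_size=0.3, random_state=42)

# Distributions to sample from instead of a fixed grid
param_dist = {
    'max_depth': randint(2, 6),          # integers 2..5
    'min_samples_split': randint(2, 5),  # integers 2..4
    'min_samples_leaf': randint(1, 3),   # integers 1..2
}

rand_search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,          # number of sampled combinations
    scoring='r2',
    cv=3,
    random_state=42,
)
rand_search.fit(X_train, y_train)
print("Best (random search):", rand_search.best_params_)
```

With only a handful of candidates the grid search is fine; random search pays off once the grid would contain hundreds of combinations.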