Hyperparameter tuning#
Hyperparameters are settings that control how the model learns.
For a Decision Tree Regressor (DTR), hyperparameters control tree depth, splitting behavior, and the balance between overfitting and underfitting.
Tuning means finding the best combination of hyperparameters to improve model performance on unseen data.
Key Hyperparameters in Decision Tree Regressor#
| Hyperparameter | Description | Effect |
|---|---|---|
| `max_depth` | Maximum depth of the tree | Too deep → overfitting; too shallow → underfitting |
| `min_samples_split` | Minimum samples required to split a node | Larger value → fewer splits, simpler tree |
| `min_samples_leaf` | Minimum samples required at a leaf node | Prevents leaves with very few samples → reduces overfitting |
| `max_features` | Max features to consider for a split | Helps reduce variance |
| `max_leaf_nodes` | Maximum number of leaf nodes | Restricts tree growth |
| `criterion` | Function to measure split quality (e.g. `squared_error`, `absolute_error`) | Determines how splits are chosen |
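As a quick illustration (a sketch, not part of the original tutorial), each hyperparameter in the table maps directly to a `DecisionTreeRegressor` constructor argument; the specific values below are arbitrary examples:

```python
from sklearn.tree import DecisionTreeRegressor

# Each hyperparameter from the table is a constructor argument.
dtr = DecisionTreeRegressor(
    max_depth=4,               # cap tree depth
    min_samples_split=10,      # need >= 10 samples in a node to split it
    min_samples_leaf=4,        # every leaf keeps >= 4 samples
    max_features='sqrt',       # consider sqrt(n_features) candidates per split
    max_leaf_nodes=20,         # grow at most 20 leaves
    criterion='squared_error'  # split quality measured by MSE reduction
)
print(dtr.get_params()['max_depth'])
```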
Why is tuning important?#
Overfitting: Tree fits training data perfectly but fails on test data.
Underfitting: Tree is too simple and misses patterns.
Goal: Balance bias and variance for best predictive performance.
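The trade-off can be seen directly on a small example. This sketch (with hypothetical noisy quadratic data, not from the tutorial) compares train and test R² at three depths: a stump underfits, an unlimited tree memorizes the training set, and a moderate depth sits in between:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data, just to illustrate bias vs variance.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 2.0, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # shallow, moderate, unlimited
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train R2={tree.score(X_tr, y_tr):.2f}, "
          f"test R2={tree.score(X_te, y_te):.2f}")
```

The unlimited tree scores (near-)perfectly on the training set while its test score lags behind, which is the overfitting gap that tuning tries to close.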
How to tune hyperparameters?#
A. Grid Search#
Tries all possible combinations from a given set of hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
# Define hyperparameter grid
param_grid = {
'max_depth': [2, 3, 4, 5, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# GridSearchCV
grid_search = GridSearchCV(
estimator=DecisionTreeRegressor(random_state=42),
param_grid=param_grid,
scoring='r2',
cv=5
)
grid_search.fit(X_train, y_train)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best R² Score:", grid_search.best_score_)
B. Randomized Search#
Samples a fixed number of random hyperparameter combinations, which scales much better than grid search when the search space is large.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
'max_depth': [2, 3, 4, 5, None],
'min_samples_split': randint(2, 10),
'min_samples_leaf': randint(1, 5)
}
random_search = RandomizedSearchCV(
estimator=DecisionTreeRegressor(random_state=42),
param_distributions=param_dist,
n_iter=10,
scoring='r2',
cv=5,
random_state=42
)
random_search.fit(X_train, y_train)
print("Best Hyperparameters:", random_search.best_params_)
print("Best R² Score:", random_search.best_score_)
Tips for Tuning DTR#
Start with max_depth → prevent overfitting.
Adjust min_samples_split & min_samples_leaf → smooth predictions.
Use cross-validation → ensures results are generalizable.
Evaluate using R², RMSE, or MAE → choose the hyperparameters giving the best performance.
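The cross-validation tip can be applied on its own, without a full grid search. This sketch (using hypothetical toy data, not the tutorial's dataset) scores a few candidate `max_depth` values with `cross_val_score`:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, just to show the cross-validation call.
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, (60, 2))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(0, 1, 60)

# Compare candidate max_depth values on cross-validated R².
for depth in (2, 4, None):
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=42),
        X, y, cv=5, scoring='r2'
    )
    print(f"max_depth={depth}: mean CV R2 = {scores.mean():.3f}")
```

To evaluate with the other metrics mentioned above, swap the scoring string: `scoring='neg_mean_absolute_error'` for MAE or `scoring='neg_root_mean_squared_error'` for RMSE (scikit-learn negates error metrics so that higher is always better).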
# Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Step 2: Create sample dataset
data = {
'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
'Target': [10, 12, 15, 14, 18, 20, 19, 22, 21, 25]
}
df = pd.DataFrame(data)
X = df[['Feature1', 'Feature2']]
y = df['Target']
# Step 3: Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Step 4: Define hyperparameter grid
param_grid = {
'max_depth': [2, 3, 4, None],
'min_samples_split': [2, 3, 4],
'min_samples_leaf': [1, 2]
}
# Step 5: Grid Search
grid_search = GridSearchCV(
estimator=DecisionTreeRegressor(random_state=42),
param_grid=param_grid,
scoring='r2',
cv=3
)
grid_search.fit(X_train, y_train)
# Step 6: Best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Step 7: Train DTR with best hyperparameters
best_dtr = DecisionTreeRegressor(**best_params, random_state=42)
best_dtr.fit(X_train, y_train)
# Step 8: Predictions and performance metrics
y_pred = best_dtr.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("\nPerformance on Test Set:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
Best Hyperparameters: {'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2}
Performance on Test Set:
MAE: 1.67
MSE: 3.17
RMSE: 1.78
R² Score: 0.80
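As an optional follow-up (not one of the original steps), the structure of a tuned tree can be inspected with `sklearn.tree.export_text`. This sketch rebuilds the same toy dataset and refits on all of it with the best hyperparameters reported above, so it runs on its own:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Same toy dataset as the example above.
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
    'Target':   [10, 12, 15, 14, 18, 20, 19, 22, 21, 25],
})

# Refit with the reported best hyperparameters.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=2,
                             min_samples_split=2, random_state=42)
tree.fit(df[['Feature1', 'Feature2']], df['Target'])

# Text dump of the fitted tree: one line per split or leaf.
report = export_text(tree, feature_names=['Feature1', 'Feature2'])
print(report)
```

With `max_depth=2` the dump stays short, which makes it easy to verify that the depth cap is actually being respected.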