Hyperparameter tuning#

  • Hyperparameters are settings that control how the model learns.

  • For a Decision Tree Regressor (DTR), hyperparameters influence tree depth, splits, and overfitting/underfitting.

  • Tuning means finding the best combination of hyperparameters to improve model performance on unseen data.


Key Hyperparameters in Decision Tree Regressor#

| Hyperparameter | Description | Effect |
|---|---|---|
| max_depth | Maximum depth of the tree | Too deep → overfitting; too shallow → underfitting |
| min_samples_split | Minimum samples required to split a node | Larger value → fewer splits, simpler tree |
| min_samples_leaf | Minimum samples required at a leaf node | Prevents leaves with very few samples → reduces overfitting |
| max_features | Maximum number of features considered per split | Helps reduce variance |
| max_leaf_nodes | Maximum number of leaf nodes | Restricts tree growth |
| criterion | Function to measure split quality (squared_error, friedman_mse) | Determines how splits are chosen |
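These hyperparameters can be made concrete with a short sketch: fit one unconstrained tree and one tree restricted by max_depth, min_samples_split, min_samples_leaf, and max_leaf_nodes, then compare their size. The synthetic data and the particular values chosen here are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)  # noisy sine curve

# Unconstrained tree: grows until every leaf is pure
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained tree: each hyperparameter limits how far the tree can grow
pruned = DecisionTreeRegressor(
    max_depth=4,            # cap on tree depth
    min_samples_split=10,   # a node needs >= 10 samples to split
    min_samples_leaf=5,     # every leaf keeps >= 5 samples
    max_leaf_nodes=12,      # hard cap on the number of leaves
    random_state=0,
).fit(X, y)

print("Unconstrained depth / leaves:", full.get_depth(), full.get_n_leaves())
print("Constrained depth / leaves:  ", pruned.get_depth(), pruned.get_n_leaves())
```

The constrained tree comes out much smaller, which is exactly the mechanism by which these settings fight overfitting.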


Why is tuning important?#

  • Overfitting: Tree fits training data perfectly but fails on test data.

  • Underfitting: Tree is too simple and misses patterns.

  • Goal: Balance bias and variance for best predictive performance.
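The bias–variance trade-off above can be seen directly by comparing train and test R² for an unconstrained tree, a very shallow tree, and a moderately limited one. The synthetic data and the depth values are illustrative assumptions, not part of the example that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

deep = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)           # overfits: memorizes noise
shallow = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_tr, y_tr)  # underfits: one split only
balanced = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("shallow", shallow), ("balanced", balanced)]:
    print(f"{name:9s} train R2 = {model.score(X_tr, y_tr):.2f}   "
          f"test R2 = {model.score(X_te, y_te):.2f}")
```

The deep tree scores nearly perfectly on training data but drops on the test set, while the shallow tree is weak on both; the middle setting is the balance tuning aims for.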


How to tune hyperparameters?#

  • Grid Search: exhaustively evaluate every combination in a predefined grid (GridSearchCV).

  • Random Search: sample a fixed number of combinations from specified ranges (RandomizedSearchCV).

  • Both are usually combined with cross-validation so each candidate is scored on held-out data.


Tips for Tuning DTR#

  1. Start with max_depth → prevent overfitting.

  2. Adjust min_samples_split & min_samples_leaf → smooth predictions.

  3. Use cross-validation → ensures results are generalizable.

  4. Evaluate using R², RMSE, or MAE → choose the hyperparameters giving best performance.

# Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Step 2: Create sample dataset
data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
    'Target': [10, 12, 15, 14, 18, 20, 19, 22, 21, 25]
}
df = pd.DataFrame(data)

X = df[['Feature1', 'Feature2']]
y = df['Target']

# Step 3: Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2]
}

# Step 5: Grid Search
grid_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    scoring='r2',
    cv=3
)

grid_search.fit(X_train, y_train)

# Step 6: Best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Step 7: Train DTR with best hyperparameters
best_dtr = DecisionTreeRegressor(**best_params, random_state=42)
best_dtr.fit(X_train, y_train)

# Step 8: Predictions and performance metrics
y_pred = best_dtr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\nPerformance on Test Set:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
Best Hyperparameters: {'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2}

Performance on Test Set:
MAE: 1.67
MSE: 3.17
RMSE: 1.78
R² Score: 0.80
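As an alternative to the exhaustive grid search above, RandomizedSearchCV samples a fixed number of candidates, which scales better when the parameter space is large. A minimal sketch on the same toy dataset; the parameter ranges and n_iter value are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [5, 3, 6, 2, 7, 8, 5, 9, 4, 10],
    'Target':   [10, 12, 15, 14, 18, 20, 19, 22, 21, 25],
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['Feature1', 'Feature2']], df['Target'], test_size=0.3, random_state=42)

# Distributions to sample from instead of a fixed grid
param_dist = {
    'max_depth': randint(2, 6),          # integers 2..5
    'min_samples_split': randint(2, 5),  # integers 2..4
    'min_samples_leaf': randint(1, 3),   # integers 1..2
}

rand_search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,          # number of sampled combinations
    scoring='r2',
    cv=3,
    random_state=42,
)
rand_search.fit(X_train, y_train)
print("Best (random search):", rand_search.best_params_)
```

With only a handful of candidates the grid search is fine; random search pays off once the grid would contain hundreds of combinations.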