Handle Underfitting/Overfitting

Handle Underfitting/Overfitting #

Handling Overfitting in DTC #

Pre-pruning (Early Stopping)
- Limit tree complexity:
  - max_depth → limit depth of the tree
  - min_samples_split → require more samples to split
  - min_samples_leaf → require more samples per leaf
  - max_leaf_nodes → restrict number of leaves
```
DecisionTreeClassifier(max_depth=5, min_samples_leaf=4)
```
Post-pruning (Cost Complexity Pruning)
- Use ccp_alpha to prune weak branches.
```
DecisionTreeClassifier(ccp_alpha=0.01)
```
Reduce Noise in Data
- Clean data, remove irrelevant features.
Use Ensemble Methods
- Random Forests, Gradient Boosting (combine multiple trees).

Handling Underfitting in DTC #

Allow More Complexity

Increase depth & allow smaller splits:

DecisionTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1)

Reduce Pruning Strength
- Use smaller ccp_alpha (closer to 0).
Feature Engineering
- Add better features, encode categorical variables, transformations.
Check Data Quality
- Underfitting often means the features don’t explain the target well → improve dataset.

Key Takeaway

Overfitting → model too complex → simplify the tree.
Underfitting → model too simple → allow more complexity.
Best practice: Use cross-validation + hyperparameter tuning (GridSearchCV) to find the sweet spot.

Overfitting vs Underfitting in DTC

Case	What Happens	Symptoms	Model Behavior
Overfitting 🌳	Tree is too complex, memorizes training data	Very high training accuracy, much lower test accuracy	Poor generalization
Underfitting 🌱	Tree is too simple, can’t capture patterns	Both training & test accuracy are low	Model is too weak

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Case 1: Underfitting (shallow tree)
dt_underfit = DecisionTreeClassifier(max_depth=1, random_state=42)
dt_underfit.fit(X_train, y_train)
train_acc_underfit = accuracy_score(y_train, dt_underfit.predict(X_train))
test_acc_underfit = accuracy_score(y_test, dt_underfit.predict(X_test))

# Case 2: Overfitting (deep tree, no restriction)
dt_overfit = DecisionTreeClassifier(max_depth=None, random_state=42)
dt_overfit.fit(X_train, y_train)
train_acc_overfit = accuracy_score(y_train, dt_overfit.predict(X_train))
test_acc_overfit = accuracy_score(y_test, dt_overfit.predict(X_test))

# Case 3: Balanced tree (tuned)
dt_balanced = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_balanced.fit(X_train, y_train)
train_acc_balanced = accuracy_score(y_train, dt_balanced.predict(X_train))
test_acc_balanced = accuracy_score(y_test, dt_balanced.predict(X_test))

# Print results
print("=== Decision Tree Performance ===")
print(f"Underfitting (max_depth=1) → Train: {train_acc_underfit:.3f}, Test: {test_acc_underfit:.3f}")
print(f"Overfitting (max_depth=None) → Train: {train_acc_overfit:.3f}, Test: {test_acc_overfit:.3f}")
print(f"Balanced (max_depth=3) → Train: {train_acc_balanced:.3f}, Test: {test_acc_balanced:.3f}")

# Plot results
labels = ["Underfitting", "Overfitting", "Balanced"]
train_scores = [train_acc_underfit, train_acc_overfit, train_acc_balanced]
test_scores = [test_acc_underfit, test_acc_overfit, test_acc_balanced]

x = range(len(labels))
plt.figure(figsize=(7,5))
plt.bar(x, train_scores, width=0.4, label="Train Accuracy", align="center")
plt.bar([i + 0.4 for i in x], test_scores, width=0.4, label="Test Accuracy", align="center")
plt.xticks([i + 0.2 for i in x], labels)
plt.ylabel("Accuracy")
plt.title("Underfitting vs Overfitting vs Balanced Decision Tree")
plt.legend()
plt.ylim(0, 1.05)
plt.show()

=== Decision Tree Performance ===
Underfitting (max_depth=1) → Train: 0.648, Test: 0.711
Overfitting (max_depth=None) → Train: 1.000, Test: 1.000
Balanced (max_depth=3) → Train: 0.952, Test: 1.000

../../../../_images/5c437529cb96099c5c646a1f9d0de57ad3442597cc40b37045fb083ed8c63a1b.png

The Kernel crashed while executing code in the current cell or a previous cell. 

Please review the code in the cell(s) to identify a possible cause of the failure. 

Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. 

View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details.

Handle Underfitting/Overfitting

Contents

Handle Underfitting/Overfitting#

Handling Overfitting in DTC#

Handling Underfitting in DTC#

Handle Underfitting/Overfitting #

Handling Overfitting in DTC #

Handling Underfitting in DTC #