Decision Tree#
A decision tree can solve both classification and regression problems. To understand how a tree chooses its splits, we'll cover three key metrics:

- Entropy
- Gini Index
- Information Gain
We’ll also briefly differentiate between the two major types of decision tree algorithms:
- ID3 (Iterative Dichotomiser 3): Allows multiple splits at a node.
- CART (Classification and Regression Tree): Restricts splits to binary decisions (used in libraries like sklearn).
Intuition: Decision Trees as Conditional Statements#
Think of decision trees as an extension of simple conditional statements. For example, in Python:
```python
age = 14

if age <= 14:
    print("The person is in school.")
elif 15 <= age <= 21:
    print("The person may be in college.")
else:
    print("The person has passed college.")
```
This logic resembles how a decision tree splits data:
- Start with a root node (e.g., `age <= 14`).
- Branch into outcomes (e.g., Yes or No).
- Continue splitting based on new conditions until reaching a pure classification (e.g., "School," "College").
Constructing a Decision Tree#
Let’s explore how a decision tree is built using a real-world dataset. Imagine a dataset with these features:
- Outlook (Sunny, Overcast, Rainy)
- Temperature
- Humidity
- Wind

Our goal: Predict whether someone will Play Tennis.
Step 1: Selecting the Root Node#
Start by evaluating each feature. For instance, take Outlook.
Count the occurrences:
- Sunny: 2 Yes, 3 No
- Overcast: 4 Yes, 0 No (pure split)
- Rainy: 3 Yes, 2 No
Notice that Sunny and Rainy are impure splits (contain both Yes and No), while Overcast is a pure split.
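As a quick sketch of how you might verify these counts in code (the rows below are illustrative, reconstructed only to match the tallies above; pandas is assumed to be available):

```python
import pandas as pd

# Illustrative toy data matching the counts above:
# Sunny -> 2 Yes / 3 No, Overcast -> 4 Yes / 0 No, Rainy -> 3 Yes / 2 No
df = pd.DataFrame({
    "Outlook": ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5,
    "PlayTennis": (["Yes"] * 2 + ["No"] * 3      # Sunny
                   + ["Yes"] * 4                 # Overcast
                   + ["Yes"] * 3 + ["No"] * 2),  # Rainy
})

# Tally Yes/No for each Outlook value
print(pd.crosstab(df["Outlook"], df["PlayTennis"]))
```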
Key Metrics for Splitting#
When splitting nodes, we use metrics like:
- Entropy: Measures the impurity in the dataset.
- Information Gain: Determines how much uncertainty is reduced after a split.
- Gini Index: Another measure of impurity, often used in CART.
Example: Calculating Entropy#
For the root node: \( \text{Entropy} = - \sum_{i=1}^{n} p_i \cdot \log_2(p_i) \), where \( p_i \) is the proportion of each class (Yes/No).
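Using the counts from Step 1 (9 Yes and 5 No across all 14 examples), the root entropy works out to \( -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94 \).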
Example: Information Gain#
\( \text{Information Gain} = \text{Entropy (Parent)} - \text{Weighted Average Entropy (Child Nodes)} \)
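As a minimal sketch of that calculation for the Outlook split (plain Python, no libraries; counts taken from Step 1):

```python
from math import log2

def entropy(yes, no):
    """Entropy of a node from its Yes/No counts."""
    total = yes + no
    e = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            e -= p * log2(p)
    return e

# Parent node: 9 Yes, 5 No -> entropy ~0.940
parent = entropy(9, 5)

# Child nodes after splitting on Outlook: Sunny, Overcast, Rainy
children = [(2, 3), (4, 0), (3, 2)]
n = 14
weighted = sum((y + no) / n * entropy(y, no) for y, no in children)

info_gain = parent - weighted
print(f"IG(Outlook) = {parent:.3f} - {weighted:.3f} = {info_gain:.3f}")  # ~0.247
```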
Splitting Impure Nodes#
For impure splits like Sunny (2 Yes, 3 No):
1. Consider another feature (e.g., Temperature).
2. Split further until achieving pure nodes or an acceptable level of impurity.
Conclusion and Next Steps#
This process continues until:
- The tree reaches pure splits, or
- Further splits don't improve the model significantly.
Decision Trees at a Glance#
A Decision Tree is a supervised learning algorithm used for both classification (predicting categories) and regression (predicting continuous values). It works like a flowchart:
- Each internal node represents a condition (feature test).
- Each branch represents the outcome of the condition.
- Each leaf node gives the final prediction (class label or numeric value).
Intuition#
Imagine you’re deciding whether to play outside:
- If it's sunny → Yes
- If it's rainy → Check if you have an umbrella → Yes or No
This is exactly how decision trees work: split data into smaller subsets based on conditions until the subsets are “pure” (contain mostly one label).
Structure of a Decision Tree#
- Root Node → the first question/condition (best split).
- Decision Nodes → intermediate questions.
- Leaf Nodes → final output (prediction).
Example (Classification):
```text
                Weather?
               /        \
           Sunny         Rainy
          /     \             \
        Hot     Cold       Umbrella?
         |        |         /     \
       Play  Don't Play   Play   Don't Play
```
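For a programmatic equivalent, scikit-learn can print a fitted tree in a similar flowchart form (a minimal sketch, assuming scikit-learn is installed; the iris dataset stands in for the weather example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Print the learned tree as an indented flowchart of conditions
print(export_text(clf, feature_names=list(iris.feature_names)))
```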
How Does the Tree Decide Where to Split?#
At each step, the algorithm chooses the best feature and threshold to split the data. It uses impurity measures to decide:
For Classification:#
- Gini Impurity (most common in CART)
- Entropy / Information Gain (used in ID3, C4.5)
👉 Lower impurity = better split.
For Regression:#
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
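In scikit-learn, the impurity measure is selected via the `criterion` parameter (a minimal sketch; the names assume scikit-learn ≥ 1.0, where the regression criteria are called "squared_error" and "absolute_error"):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: Gini impurity (the default) or entropy-based information gain
clf_gini = DecisionTreeClassifier(criterion="gini")
clf_entropy = DecisionTreeClassifier(criterion="entropy")

# Regression: mean squared error or mean absolute error
reg_mse = DecisionTreeRegressor(criterion="squared_error")
reg_mae = DecisionTreeRegressor(criterion="absolute_error")
```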
Advantages#
✅ Easy to interpret and visualize
✅ Handles both classification & regression
✅ Works with both numerical and categorical data
✅ No need for feature scaling
Disadvantages#
❌ Can overfit (very deep trees memorize training data)
❌ Sensitive to small changes in data
❌ Greedy splitting (locally optimal, not always globally optimal)
👉 To fix these, we use Ensembles like Random Forests and Gradient Boosted Trees.
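For instance (a minimal sketch, assuming scikit-learn is available; both ensembles are drop-in replacements for a single tree):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Averages many decorrelated trees to reduce variance/overfitting
forest = RandomForestClassifier(n_estimators=100)

# Builds trees sequentially, each correcting the previous ones' errors
boosted = GradientBoostingClassifier(n_estimators=100)
```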
Workflow of a Decision Tree#
1. Start with the root node (entire dataset).
2. Choose the best feature to split on (using an impurity measure).
3. Split the dataset into subsets.
4. Repeat recursively for each subset.
5. Stop when:
   - All points belong to the same class (pure leaf).
   - The maximum depth is reached.
   - The minimum number of samples per leaf is reached.
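In scikit-learn, these stopping rules map directly to constructor parameters (a minimal sketch, assuming scikit-learn is installed; the parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stopping rules: cap the depth and require a minimum leaf size
clf = DecisionTreeClassifier(
    max_depth=3,          # stop when the tree reaches this depth
    min_samples_leaf=5,   # don't create leaves with fewer samples than this
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```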
Example Use Cases#
- Classification: Spam detection, disease diagnosis, loan approval.
- Regression: Predicting house prices, sales forecasting.
In short:
A Decision Tree is like repeatedly asking “the best question” until you reach a clear decision.