Purity Measures#

  • A node is said to be pure if all samples inside it belong to the same class.

  • A node is impure if it contains mixed classes.

  • Decision trees aim to split data so that each branch leads to nodes with higher purity.

To measure purity, we use Entropy and Gini Impurity.


Entropy#

  • Formula (for binary classification):

\[ H(S) = - p_+ \log_2(p_+) - p_- \log_2(p_-) \]
  • Where:

    • \(p_+\) = probability of positive class (say “Yes”)

    • \(p_-\) = probability of negative class (say “No”)

  • Multi-class extension:

\[ H(S) = - \sum_{i=1}^k p_i \log_2(p_i) \]
  • Interpretation:

    • \(H(S) = 0\): node is pure (all one class).

    • \(H(S) = 1\): maximum impurity for binary classification (a 50/50 split); with \(k\) classes the maximum is \(\log_2 k\).

  • Entropy graph:

    • Ranges between 0 and 1 for binary classification (up to \(\log_2 k\) with \(k\) classes).

    • Maximum impurity occurs when classes are equally likely.
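
To make the entropy formula concrete, here is a minimal NumPy sketch (the function name and example counts are illustrative, not from the original notes):

import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a node, given per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                                # convention: 0 * log2(0) contributes 0
    return float(np.sum(p * np.log2(1.0 / p)))  # same as -sum(p * log2(p))

print(entropy([5, 5]))     # 1.0  -> maximum impurity for a 50/50 binary node
print(entropy([10, 0]))    # 0.0  -> pure node
print(entropy([4, 4, 4]))  # log2(3) ≈ 1.585 -> the multi-class maximum can exceed 1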


Gini Impurity#

  • Formula:

\[ Gini(S) = 1 - \sum_{i=1}^k p_i^2 \]
  • For binary case:

\[ Gini(S) = 1 - (p_+^2 + p_-^2) \]
  • Interpretation:

    • \(Gini = 0\): pure node.

    • Maximum impurity in binary case: \(Gini = 0.5\) (when \(p_+ = p_- = 0.5\)).

  • Range: 0 → 0.5 for binary classification; with \(k\) classes the maximum is \(1 - 1/k\).
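
A matching sketch for Gini impurity (again, the function name and example counts are only illustrative):

import numpy as np

def gini(counts):
    """Gini impurity of a node, given per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini([5, 5]))     # 0.5  -> binary maximum at a 50/50 split
print(gini([10, 0]))    # 0.0  -> pure node
print(gini([4, 4, 4]))  # 2/3  -> the k-class maximum is 1 - 1/k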


Example:#

Suppose 6 samples → 3 “Yes”, 3 “No”.

  • Entropy:

\[ H = - \frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} = 1 \]

(maximum impurity)

  • Gini:

\[ 1 - \Big(\frac{3}{6}\Big)^2 - \Big(\frac{3}{6}\Big)^2 = 1 - 0.25 - 0.25 = 0.5 \]

Comparison: Entropy vs Gini#

| Aspect | Entropy | Gini Impurity |
| --- | --- | --- |
| Formula | \(-\sum p_i \log_2 p_i\) | \(1 - \sum p_i^2\) |
| Range (binary) | 0 → 1 | 0 → 0.5 |
| Complexity | Slightly more expensive (log terms) | Faster to compute (squares only) |
| Behavior | Both give similar results in practice | Sometimes Gini prefers larger splits |
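
The similarity in behavior is easy to see by plotting both measures against the positive-class probability of a binary node (a small matplotlib sketch, not part of the original notes):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                     # P(positive class)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy
gini = 1 - (p ** 2 + (1 - p) ** 2)                     # binary Gini impurity

plt.plot(p, entropy, label="Entropy (peaks at 1 when p = 0.5)")
plt.plot(p, gini, label="Gini impurity (peaks at 0.5 when p = 0.5)")
plt.xlabel("p (probability of the positive class)")
plt.ylabel("impurity")
plt.legend()
plt.show()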


Information Gain (IG) — Feature Selection Criterion#

Entropy & Gini tell us purity of a node. But to decide which feature to split on, we use Information Gain:

\[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \]
  • \(H(S)\): entropy before split

  • The weighted average entropy after the split is subtracted from it.

  • The feature with highest IG is chosen to split.


Key Takeaways

  1. Entropy and Gini Impurity are purity measures.

  2. Entropy → range 0–1, max impurity = 1.

  3. Gini → range 0–0.5 in binary classification.

  4. Both are used in practice; Gini is slightly faster.

  5. Information Gain helps choose which feature gives the best split.

Information Gain#

We already know:

  • Entropy and Gini impurity → measure how pure or impure a node is.

  • But a decision tree needs to choose the best feature to split at each step. 👉 This is where Information Gain (IG) comes in.


Definition#

Information Gain tells us how much “uncertainty” is reduced in the target after splitting on a feature.

\[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \; H(S_v) \]

Where:

  • \(H(S)\) = entropy of the root node (before split)

  • \(S_v\) = subset of data after splitting on value \(v\) of feature \(A\)

  • \(H(S_v)\) = entropy of that subset

  • \(\frac{|S_v|}{|S|}\) = weighted contribution of that subset


Step-by-Step Process#

  1. Compute entropy of the root node (before any split).

  2. Choose a candidate feature (say F1).

  3. Split dataset into subsets based on values of F1.

  4. Compute entropy of each subset.

  5. Take a weighted average of these entropies.

  6. Subtract this weighted entropy from the root entropy → Information Gain for F1.

  7. Repeat for all features (F1, F2, F3, …).

  8. Pick the feature with highest Information Gain → best split.
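
A compact sketch of this procedure on a tiny made-up dataset (the column names and values are invented purely for illustration):

import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(df, feature, target="Play"):
    parent = entropy(df[target])
    weighted = sum(len(g) / len(df) * entropy(g[target])
                   for _, g in df.groupby(feature))
    return parent - weighted

# Toy dataset: which feature best predicts "Play"?
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["No", "Yes", "No", "No", "Yes", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})

gains = {f: information_gain(df, f) for f in ["Outlook", "Windy"]}
print(gains)                       # IG of each candidate feature
print(max(gains, key=gains.get))   # the feature with the highest IG wins the split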


Worked Example#

Suppose a dataset has 14 samples: 9 Yes, 5 No.

  • Root Entropy:

\[ H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94 \]
  • Split on F1:

    • Subset C1: 6 Yes, 2 No → \(H(C1) \approx 0.81\)

    • Subset C2: 3 Yes, 3 No → \(H(C2) = 1\)

    Weighted Entropy = \(\frac{8}{14}\times0.81 + \frac{6}{14}\times1 \approx 0.89\)

    So:

\[ IG(S, F1) = 0.94 - 0.89 \approx 0.05 \]
  • Split on F2: Suppose IG(S, F2) > 0.05 👉 Then F2 is chosen for the root split.
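
The arithmetic above can be reproduced in a few lines (a quick check, not part of the original notes):

from math import log2

def H(pos, neg):
    """Binary entropy of a node with the given class counts."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

root = H(9, 5)                               # ≈ 0.940
weighted = 8/14 * H(6, 2) + 6/14 * H(3, 3)   # ≈ 0.892
print(root - weighted)                       # ≈ 0.048, i.e. about 0.05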


Key Insights

  • Higher Information Gain = Better Feature for Split.

  • Decision tree keeps splitting until:

    • IG ≈ 0 (no improvement possible), OR

    • a stopping criterion (max depth, min samples, etc.) is reached.
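
In scikit-learn these stopping rules correspond to constructor arguments of DecisionTreeClassifier; a minimal sketch with arbitrary parameter values:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="entropy",         # split quality scored via entropy / information gain
    max_depth=4,                 # stop growing beyond this depth
    min_samples_split=10,        # do not split nodes smaller than this
    min_impurity_decrease=0.01,  # require at least this much impurity reduction per split
    random_state=42,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())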


When to Use Entropy vs Gini#

  • Entropy:

    • More theoretical, based on information theory.

    • Can be slightly slower (logarithm calculation).

    • Preferred if you want a “purer” measure of uncertainty.

  • Gini Impurity:

    • Computationally faster (uses squares, no logs).

    • Works very similarly in practice.

    • Often the default choice (e.g., in Scikit-learn).

👉 In practice, both give similar results. Gini is often preferred for speed, Entropy for interpretability.


Takeaways

  1. Entropy / Gini = measure of node impurity.

  2. Information Gain = reduction in impurity → helps decide which feature to split on.

  3. Decision Trees always pick the feature with highest IG at each step.

  4. Gini and Entropy are very similar in behavior → choice depends on speed vs interpretability.


Entropy vs Gini Impurity#

  • Entropy

    • Formula:

      \[ H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i \]

      where \(k\) = number of classes, \(p_i\) = probability of class \(i\).

    • Measures disorder/uncertainty in the dataset.

    • More theoretical foundation (from information theory).

  • Gini Impurity

    • Formula:

      \[ G(S) = 1 - \sum_{i=1}^{k} p_i^2 \]
    • Measures how often a randomly chosen element would be misclassified if labeled according to class distribution.

    • More computationally efficient (no logarithms).


Behavior When Classes Increase#

  • If the number of output categories increases:

    • Entropy expands into more terms (one for each class).

    • Gini Impurity also expands, but only with squared probabilities (no logs).

  • Both handle multi-class problems, but entropy involves heavier computation due to logarithms.
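
A quick numerical illustration of how the two maxima grow with the number of classes k (assuming a uniform class distribution):

import numpy as np

for k in (2, 3, 4, 8, 16):
    p = np.full(k, 1.0 / k)                             # uniform distribution over k classes
    max_entropy = float(np.sum(p * np.log2(1.0 / p)))   # equals log2(k)
    max_gini = float(1.0 - np.sum(p ** 2))              # equals 1 - 1/k
    print(f"k={k:2d}  max entropy={max_entropy:.3f}  max gini={max_gini:.3f}")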


When to Use Which?#

Use Entropy when:#

  • You want a pure information-theoretic interpretation.

  • Dataset is small or moderately sized (e.g., ≤ 10,000 records).

  • You want to emphasize purity more strongly — entropy tends to be more sensitive to class probabilities close to 0 or 1.

Use Gini Impurity when:#

  • You’re working with large datasets (millions of records).

  • You care about speed and efficiency (fewer computations since no log).

  • You’re using scikit-learn’s DecisionTreeClassifier — by default it uses Gini.


Practical Notes#

  • Performance Difference: In most real-world problems, Gini and Entropy give very similar results in terms of accuracy and splits.

  • Default Choice:

    • Gini is used more often (fast, default in scikit-learn).

    • Entropy can be used for interpretability or teaching purposes.


Summary (One-Liner)

  • Entropy = more theoretical, uses log, slower, suitable for small datasets.

  • Gini = faster, simpler, default in most implementations, better for large datasets.

  • In practice, both lead to similar decision trees, so the choice is often pragmatic (speed vs. interpretability).
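
The notebook cell below puts this comparison into practice: it trains two depth-3 trees on the Iris dataset, one with the Gini criterion and one with entropy, and exports the learned rules. With these settings both criteria happen to produce the same splits, as the output shows.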


import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train two decision trees: one with Gini, one with Entropy
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)

tree_gini.fit(X, y)
tree_entropy.fit(X, y)

# Extract decision rules for comparison
rules_gini = export_text(tree_gini, feature_names=iris.feature_names)
rules_entropy = export_text(tree_entropy, feature_names=iris.feature_names)

rules_gini, rules_entropy
('|--- petal length (cm) <= 2.45\n|   |--- class: 0\n|--- petal length (cm) >  2.45\n|   |--- petal width (cm) <= 1.75\n|   |   |--- petal length (cm) <= 4.95\n|   |   |   |--- class: 1\n|   |   |--- petal length (cm) >  4.95\n|   |   |   |--- class: 2\n|   |--- petal width (cm) >  1.75\n|   |   |--- petal length (cm) <= 4.85\n|   |   |   |--- class: 2\n|   |   |--- petal length (cm) >  4.85\n|   |   |   |--- class: 2\n',
 '|--- petal length (cm) <= 2.45\n|   |--- class: 0\n|--- petal length (cm) >  2.45\n|   |--- petal width (cm) <= 1.75\n|   |   |--- petal length (cm) <= 4.95\n|   |   |   |--- class: 1\n|   |   |--- petal length (cm) >  4.95\n|   |   |   |--- class: 2\n|   |--- petal width (cm) >  1.75\n|   |   |--- petal length (cm) <= 4.85\n|   |   |   |--- class: 2\n|   |   |--- petal length (cm) >  4.85\n|   |   |   |--- class: 2\n')