Purity Measures#

  • A node is said to be pure if all samples inside it belong to the same class.

  • A node is impure if it contains mixed classes.

  • Decision trees aim to split data so that each branch leads to nodes with higher purity.

To measure purity, we use Entropy and Gini Impurity.


Entropy#

  • Formula (for binary classification):

\[ H(S) = - p_+ \log_2(p_+) - p_- \log_2(p_-) \]
  • Where:

    • \(p_+\) = probability of positive class (say “Yes”)

    • \(p_-\) = probability of negative class (say “No”)

  • Multi-class extension:

\[ H(S) = - \sum_{i=1}^k p_i \log_2(p_i) \]
  • Interpretation:

    • \(H(S) = 0\): node is pure (all one class).

    • \(H(S) = 1\): maximum impurity for binary classification (a 50/50 split); with \(k\) classes the maximum is \(\log_2 k\).

  • Entropy graph:

    • Ranges between 0 and 1 for binary classification (up to \(\log_2 k\) with \(k\) classes).

    • Maximum impurity occurs when classes are equally likely.
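
To make the entropy formula concrete, here is a minimal NumPy sketch (the function name and example counts are illustrative, not from the original notes):

import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a node, given per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                                # convention: 0 * log2(0) contributes 0
    return float(np.sum(p * np.log2(1.0 / p)))  # same as -sum(p * log2(p))

print(entropy([5, 5]))     # 1.0  -> maximum impurity for a 50/50 binary node
print(entropy([10, 0]))    # 0.0  -> pure node
print(entropy([4, 4, 4]))  # log2(3) ≈ 1.585 -> the multi-class maximum can exceed 1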


Gini Impurity#

  • Formula:

\[ Gini(S) = 1 - \sum_{i=1}^k p_i^2 \]
  • For binary case:

\[ Gini(S) = 1 - (p_+^2 + p_-^2) \]
  • Interpretation:

    • \(Gini = 0\): pure node.

    • Maximum impurity in binary case: \(Gini = 0.5\) (when \(p_+ = p_- = 0.5\)).

  • Range: 0 → 0.5 for binary classification; with \(k\) classes the maximum is \(1 - 1/k\).
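
A matching sketch for Gini impurity (again, the function name and example counts are only illustrative):

import numpy as np

def gini(counts):
    """Gini impurity of a node, given per-class sample counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini([5, 5]))     # 0.5  -> binary maximum at a 50/50 split
print(gini([10, 0]))    # 0.0  -> pure node
print(gini([4, 4, 4]))  # 2/3  -> the k-class maximum is 1 - 1/k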


Example:#

Suppose 6 samples → 3 “Yes”, 3 “No”.

  • Entropy:

\[ H = - \frac{3}{6} \log_2 \frac{3}{6} - \frac{3}{6} \log_2 \frac{3}{6} = 1 \]

(maximum impurity)

  • Gini:

\[ 1 - \Big(\frac{3}{6}\Big)^2 - \Big(\frac{3}{6}\Big)^2 = 1 - 0.25 - 0.25 = 0.5 \]

Comparison: Entropy vs Gini#

| Aspect | Entropy | Gini Impurity |
| --- | --- | --- |
| Formula | \(-\sum p_i \log_2 p_i\) | \(1 - \sum p_i^2\) |
| Range (binary) | 0 → 1 | 0 → 0.5 |
| Complexity | Slightly more expensive (log terms) | Faster to compute (squares only) |
| Behavior | Both give similar results in practice | Sometimes Gini prefers larger splits |
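
The similarity in behavior is easy to see by plotting both measures against the positive-class probability of a binary node (a small matplotlib sketch, not part of the original notes):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                     # P(positive class)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy
gini = 1 - (p ** 2 + (1 - p) ** 2)                     # binary Gini impurity

plt.plot(p, entropy, label="Entropy (peaks at 1 when p = 0.5)")
plt.plot(p, gini, label="Gini impurity (peaks at 0.5 when p = 0.5)")
plt.xlabel("p (probability of the positive class)")
plt.ylabel("impurity")
plt.legend()
plt.show()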


Information Gain (IG) — Feature Selection Criterion#

Entropy & Gini tell us purity of a node. But to decide which feature to split on, we use Information Gain:

\[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \]
  • \(H(S)\): entropy before split

  • The weighted average entropy after the split is subtracted from it.

  • The feature with highest IG is chosen to split.


Key Takeaways

  1. Entropy and Gini Impurity are purity measures.

  2. Entropy → range 0–1, max impurity = 1.

  3. Gini → range 0–0.5 in binary classification.

  4. Both are used in practice; Gini is slightly faster.

  5. Information Gain helps choose which feature gives the best split.

Information Gain#

We already know:

  • Entropy and Gini impurity → measure how pure or impure a node is.

  • But a decision tree needs to choose the best feature to split at each step. 👉 This is where Information Gain (IG) comes in.


Definition#

Information Gain tells us how much “uncertainty” is reduced in the target after splitting on a feature.

\[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \; H(S_v) \]

Where:

  • \(H(S)\) = entropy of the root node (before split)

  • \(S_v\) = subset of data after splitting on value \(v\) of feature \(A\)

  • \(H(S_v)\) = entropy of that subset

  • \(\frac{|S_v|}{|S|}\) = weighted contribution of that subset


Step-by-Step Process#

  1. Compute entropy of the root node (before any split).

  2. Choose a candidate feature (say F1).

  3. Split dataset into subsets based on values of F1.

  4. Compute entropy of each subset.

  5. Take a weighted average of these entropies.

  6. Subtract this weighted entropy from the root entropy → Information Gain for F1.

  7. Repeat for all features (F1, F2, F3, …).

  8. Pick the feature with highest Information Gain → best split.
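
A compact sketch of this procedure on a tiny made-up dataset (the column names and values are invented purely for illustration):

import numpy as np
import pandas as pd

def entropy(labels):
    p = labels.value_counts(normalize=True).to_numpy()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(df, feature, target="Play"):
    parent = entropy(df[target])
    weighted = sum(len(g) / len(df) * entropy(g[target])
                   for _, g in df.groupby(feature))
    return parent - weighted

# Toy dataset: which feature best predicts "Play"?
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["No", "Yes", "No", "No", "Yes", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})

gains = {f: information_gain(df, f) for f in ["Outlook", "Windy"]}
print(gains)                       # IG of each candidate feature
print(max(gains, key=gains.get))   # the feature with the highest IG wins the split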


Worked Example#

Suppose a dataset has 14 samples: 9 Yes, 5 No.

  • Root Entropy:

\[ H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94 \]
  • Split on F1:

    • Subset C1: 6 Yes, 2 No → \(H(C1) \approx 0.81\)

    • Subset C2: 3 Yes, 3 No → \(H(C2) = 1\)

    Weighted Entropy = \(\frac{8}{14}\times0.81 + \frac{6}{14}\times1 \approx 0.89\)

    So:

\[ IG(S, F1) = 0.94 - 0.89 \approx 0.05 \]
  • Split on F2: Suppose IG(S, F2) > 0.05 👉 Then F2 is chosen for the root split.
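
The arithmetic above can be reproduced in a few lines (a quick check, not part of the original notes):

from math import log2

def H(pos, neg):
    """Binary entropy of a node with the given class counts."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

root = H(9, 5)                               # ≈ 0.940
weighted = 8/14 * H(6, 2) + 6/14 * H(3, 3)   # ≈ 0.892
print(root - weighted)                       # ≈ 0.048, i.e. about 0.05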


Key Insights

  • Higher Information Gain = Better Feature for Split.

  • Decision tree keeps splitting until:

    • IG ≈ 0 (no improvement possible), OR

    • a stopping criterion (max depth, min samples, etc.) is reached.
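
In scikit-learn these stopping rules correspond to constructor arguments of DecisionTreeClassifier; a minimal sketch with arbitrary parameter values:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="entropy",         # split quality scored via entropy / information gain
    max_depth=4,                 # stop growing beyond this depth
    min_samples_split=10,        # do not split nodes smaller than this
    min_impurity_decrease=0.01,  # require at least this much impurity reduction per split
    random_state=42,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())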


When to Use Entropy vs Gini#

  • Entropy:

    • More theoretical, based on information theory.

    • Can be slightly slower (logarithm calculation).

    • Preferred if you want a “purer” measure of uncertainty.

  • Gini Impurity:

    • Computationally faster (uses squares, no logs).

    • Works very similarly in practice.

    • Often the default choice (e.g., in Scikit-learn).

👉 In practice, both give similar results. Gini is often preferred for speed, Entropy for interpretability.


Takeaways

  1. Entropy / Gini = measure of node impurity.

  2. Information Gain = reduction in impurity → helps decide which feature to split on.

  3. Decision Trees always pick the feature with highest IG at each step.

  4. Gini and Entropy are very similar in behavior → choice depends on speed vs interpretability.


Entropy vs Gini Impurity#

  • Entropy

    • Formula:

      \[ H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i \]

      where \(k\) = number of classes, \(p_i\) = probability of class \(i\).

    • Measures disorder/uncertainty in the dataset.

    • More theoretical foundation (from information theory).

  • Gini Impurity

    • Formula:

      \[ G(S) = 1 - \sum_{i=1}^{k} p_i^2 \]
    • Measures how often a randomly chosen element would be misclassified if labeled according to class distribution.

    • More computationally efficient (no logarithms).


Behavior When Classes Increase#

  • If the number of output categories increases:

    • Entropy expands into more terms (one for each class).

    • Gini Impurity also expands, but only with squared probabilities (no logs).

  • Both handle multi-class problems, but entropy involves heavier computation due to logarithms.
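
A quick numerical illustration of how the two maxima grow with the number of classes k (assuming a uniform class distribution):

import numpy as np

for k in (2, 3, 4, 8, 16):
    p = np.full(k, 1.0 / k)                             # uniform distribution over k classes
    max_entropy = float(np.sum(p * np.log2(1.0 / p)))   # equals log2(k)
    max_gini = float(1.0 - np.sum(p ** 2))              # equals 1 - 1/k
    print(f"k={k:2d}  max entropy={max_entropy:.3f}  max gini={max_gini:.3f}")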


When to Use Which?#

Use Entropy when:#

  • You want a pure information-theoretic interpretation.

  • Dataset is small or moderately sized (e.g., ≤ 10,000 records).

  • You want to emphasize purity more strongly — entropy tends to be more sensitive to class probabilities close to 0 or 1.

Use Gini Impurity when:#

  • You’re working with large datasets (millions of records).

  • You care about speed and efficiency (fewer computations since no log).

  • You’re using scikit-learn’s DecisionTreeClassifier — by default it uses Gini.


Practical Notes#

  • Performance Difference: In most real-world problems, Gini and Entropy give very similar results in terms of accuracy and splits.

  • Default Choice:

    • Gini is used more often (fast, default in scikit-learn).

    • Entropy can be used for interpretability or teaching purposes.


Summary (One-Liner)

  • Entropy = more theoretical, uses log, slower, suitable for small datasets.

  • Gini = faster, simpler, default in most implementations, better for large datasets.

  • In practice, both lead to similar decision trees, so the choice is often pragmatic (speed vs. interpretability).
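
The notebook cell below puts this comparison into practice: it trains two depth-3 trees on the Iris dataset, one with the Gini criterion and one with entropy, and exports the learned rules. With these settings both criteria happen to produce the same splits, as the output shows.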


import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train two decision trees: one with Gini, one with Entropy
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)

tree_gini.fit(X, y)
tree_entropy.fit(X, y)

# Extract decision rules for comparison
rules_gini = export_text(tree_gini, feature_names=iris.feature_names)
rules_entropy = export_text(tree_entropy, feature_names=iris.feature_names)

rules_gini, rules_entropy
('|--- petal length (cm) <= 2.45\n|   |--- class: 0\n|--- petal length (cm) >  2.45\n|   |--- petal width (cm) <= 1.75\n|   |   |--- petal length (cm) <= 4.95\n|   |   |   |--- class: 1\n|   |   |--- petal length (cm) >  4.95\n|   |   |   |--- class: 2\n|   |--- petal width (cm) >  1.75\n|   |   |--- petal length (cm) <= 4.85\n|   |   |   |--- class: 2\n|   |   |--- petal length (cm) >  4.85\n|   |   |   |--- class: 2\n',
 '|--- petal length (cm) <= 2.45\n|   |--- class: 0\n|--- petal length (cm) >  2.45\n|   |--- petal width (cm) <= 1.75\n|   |   |--- petal length (cm) <= 4.95\n|   |   |   |--- class: 1\n|   |   |--- petal length (cm) >  4.95\n|   |   |   |--- class: 2\n|   |--- petal width (cm) >  1.75\n|   |   |--- petal length (cm) <= 4.85\n|   |   |   |--- class: 2\n|   |   |--- petal length (cm) >  4.85\n|   |   |   |--- class: 2\n')