Cost Function#

🌳 What is the “Cost” in Decision Trees?#

When training a decision tree, the model’s goal is to find splits that minimize impurity (how mixed the classes are) at each node. The cost function is simply the quantity the tree tries to minimize while splitting the dataset.


Common Cost Functions in Decision Tree Classifiers#

  1. Gini Impurity (default in scikit-learn)

    • Measures how often a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node’s class distribution.

    • Formula:

      \[ Gini = 1 - \sum_{i=1}^{C} p_i^2 \]

      where \(p_i\) = probability of class \(i\), \(C\) = number of classes.

    • Gini = 0 → perfectly pure node (all samples same class).

    • Gini is faster to compute (no log).


  2. Entropy (Information Gain)

    • Comes from information theory.

    • Formula:

      \[ Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i) \]

    • Pure nodes → Entropy = 0.

    • The decision tree chooses the split that maximizes information gain (i.e., reduces entropy the most).


  3. Misclassification Error (less common, mostly theoretical)

    • Simply measures fraction of incorrectly classified samples at a node.

    • Formula:

      \[ Error = 1 - \max_i \, p_i \]

    • It is less sensitive to changes in the class probabilities than Gini or entropy, so it is rarely used to grow trees (it is more often used for pruning). A short sketch computing all three measures follows this list.

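To make the three measures concrete, here is a minimal NumPy sketch (the helper names gini, entropy, and misclassification_error are illustrative, not library functions):

import numpy as np

def gini(p):
    # Gini impurity: 1 minus the sum of squared class probabilities
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy in bits; zero-probability classes are skipped to avoid log(0)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def misclassification_error(p):
    # Fraction misclassified if the node predicts its majority class
    return 1.0 - np.max(p)

p = [6 / 8, 2 / 8]  # the 6-"Yes" / 2-"No" node from the example below
print(gini(p), entropy(p), misclassification_error(p))
# ≈ 0.375, 0.811, 0.25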

The Cost Function for a Split#

For a split at node \(t\):

\[ Cost(t) = \sum_{k=1}^{K} \frac{N_k}{N} \cdot Impurity(t_k) \]

  • \(N\) = total samples in the parent node.

  • \(N_k\) = samples in child node \(k\).

  • \(Impurity\) = Gini or Entropy.

👉 The best split is the one that minimizes this weighted cost.

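In code, the weighted cost of a candidate split can be computed like this (split_cost is an illustrative helper, not a scikit-learn function):

def split_cost(children):
    # children: list of (n_samples, impurity) pairs, one per child node
    n_total = sum(n for n, _ in children)
    return sum(n / n_total * imp for n, imp in children)

# A parent of 8 samples split into children of 3 and 5 samples:
print(split_cost([(3, 0.0), (5, 0.48)]))  # 0.3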

Example#

If a node has:

  • 8 samples → 6 “Yes”, 2 “No”

  • Gini = \(1 - [(6/8)^2 + (2/8)^2] = 0.375\)

If a split reduces this Gini from 0.375 to, say, a weighted child impurity of 0.1, it is a good split.


In short

  • The cost function = impurity measure (Gini, Entropy, or Error).

  • The decision tree chooses splits that minimize cost (i.e., maximize purity/information gain).

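The snippet below trains two depth-1 trees on a small toy dataset, one per criterion, and reads the impurity values directly off the fitted trees (scikit-learn stores one impurity value per node in tree_.impurity, with the root at index 0).
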
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Small toy dataset
data = {
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
}
df = pd.DataFrame(data)

# One-hot encode the categorical feature (Outlook) for scikit-learn
df_encoded = pd.get_dummies(df[["Outlook"]])
y = df["Play"].map({"No":0, "Yes":1})

# Train two trees: one with Gini, one with Entropy
clf_gini = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=42)
clf_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=42)

clf_gini.fit(df_encoded, y)
clf_entropy.fit(df_encoded, y)

# Extract impurity values: node 0 is the root; in a depth-1 tree, nodes 1 and 2 are its children
root_gini = clf_gini.tree_.impurity[0]
child_ginis = clf_gini.tree_.impurity[1:3]

root_entropy = clf_entropy.tree_.impurity[0]
child_entropies = clf_entropy.tree_.impurity[1:3]

results = pd.DataFrame({
    "Criterion": ["Gini", "Entropy"],
    "Root Impurity": [root_gini, root_entropy],
    "Child Impurities": [child_ginis, child_entropies]
})

results

Results on the toy dataset (Outlook → Play)#

| Criterion | Root Impurity | Child Impurities |
|-----------|---------------|------------------|
| Gini      | 0.469         | [0.320, 0.444]   |
| Entropy   | 0.954         | [1.0, 0.0]       |


Interpretation#

  • Root Node (before splitting):

    • Gini impurity = 0.469

    • Entropy = 0.954 (higher simply because entropy uses a different scale: with two classes it peaks at 1.0, while Gini peaks at 0.5).

  • After Split (Child Nodes):

    • Using Gini, the chosen split leaves both children somewhat impure (0.32 and 0.44).

    • Using Entropy, the chosen split makes one child perfectly pure (0.0) while the other is maximally impure (1.0, a 50/50 mix). Note that the two criteria picked different splits on this data.

👉 The tree compares candidate splits by computing this weighted average impurity for each one.

  • The split that reduces impurity the most (maximizes information gain, or equivalently minimizes the weighted Gini cost) is chosen.
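
To check this against the fitted tree, the weighted child impurity and the impurity decrease can be read off the tree_ arrays (a minimal sketch, assuming clf_gini from the training cell above is still in scope):

t = clf_gini.tree_
left, right = t.children_left[0], t.children_right[0]  # the root's two children
n_root = t.n_node_samples[0]

# Weighted average impurity of the children, as in Cost(t) above
weighted_child = sum(
    t.n_node_samples[c] / n_root * t.impurity[c] for c in (left, right)
)
gain = t.impurity[0] - weighted_child  # impurity decrease achieved by the split

print(f"Weighted child Gini: {weighted_child:.3f}")  # ≈ 0.367
print(f"Impurity decrease:   {gain:.3f}")            # ≈ 0.102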