Cost Functions#

1. Maximum Likelihood Estimation (MLE) – Training Objective#

Naïve Bayes learns the class priors \(P(y)\) and the per-feature likelihoods \(P(x_i \mid y)\) from data by maximizing the joint likelihood of the training set:

\[ L(\theta) = \prod_{j=1}^m P(x^{(j)}, y^{(j)}; \theta) = \prod_{j=1}^m P(y^{(j)}; \theta) \prod_{i=1}^n P(x_i^{(j)} \mid y^{(j)}; \theta) \]

where:

  • \(m\) = number of training samples

  • \(y^{(j)}\) = class label of sample \(j\)

  • \(x^{(j)} = (x_1^{(j)}, \dots, x_n^{(j)})\) = features of sample \(j\)

  • \(n\) = number of features

  • \(\theta\) = parameters (priors + likelihoods).

⚡ In practice, we maximize the log-likelihood, which avoids numerical underflow and turns the product into a sum:

\[ \ell(\theta) = \sum_{j=1}^m \log P(x^{(j)}, y^{(j)}; \theta) \]

👉 So the implicit cost function is:

\[ J(\theta) = -\ell(\theta) = - \sum_{j=1}^m \log P(x^{(j)}, y^{(j)}; \theta) \]

This is the negative log-likelihood (NLL). Because \(\ell(\theta)\) decomposes over the parameters, maximizing it has a closed-form solution: the optimal priors and likelihoods are simply (optionally smoothed) relative-frequency counts, so no iterative optimization is needed. Its conditional analogue, built from \(P(y \mid x)\), is the cross-entropy (log loss) used for evaluation below.
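
Here is a minimal sketch of that closed-form estimation for a categorical Naïve Bayes model, assuming integer-encoded features and labels; the function names (`fit_naive_bayes`, `joint_nll`), the `n_values` parameter, and the Laplace-smoothing constant `alpha` are illustrative choices, not fixed by the text above (setting `alpha = 0` recovers the pure MLE counts):

```python
import numpy as np

def fit_naive_bayes(X, y, n_values, alpha=1.0):
    """MLE (with optional Laplace smoothing) for a categorical Naive Bayes.

    X : (m, n) int array, feature values in {0, ..., n_values - 1}
    y : (m,)   int array, class labels in {0, ..., k - 1}
    Returns the class priors P(y) and likelihoods P(x_i = v | y).
    """
    m, n = X.shape
    k = y.max() + 1
    # Priors: (smoothed) relative frequency of each class.
    priors = (np.bincount(y, minlength=k) + alpha) / (m + k * alpha)
    # Likelihoods: (smoothed) per-class counts of each feature value.
    likelihoods = np.zeros((k, n, n_values))
    for c in range(k):
        Xc = X[y == c]
        for i in range(n):
            counts = np.bincount(Xc[:, i], minlength=n_values)
            likelihoods[c, i] = (counts + alpha) / (len(Xc) + n_values * alpha)
    return priors, likelihoods

def joint_nll(X, y, priors, likelihoods):
    """Cost J(theta): negative joint log-likelihood over the data set."""
    m, n = X.shape
    log_joint = np.log(priors[y])                        # log P(y^(j))
    for i in range(n):
        log_joint += np.log(likelihoods[y, i, X[:, i]])  # log P(x_i^(j) | y^(j))
    return -np.sum(log_joint)
```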


2. Cross-Entropy / Log Loss – Evaluation#

When evaluating probabilistic classifiers like Naïve Bayes, we often use log loss:

\[ \text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \sum_{c=1}^k \mathbf{1}\{y^{(j)} = c\} \log P(y=c \mid x^{(j)}) \]

where:

  • \(k\) = number of classes

  • \(\mathbf{1}\{\cdot\}\) = indicator function (1 if the true class of sample \(j\) is \(c\), else 0).
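
Since the indicator picks out only the true class of each sample, this is equivalent to:

\[ \text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \log P\left(y = y^{(j)} \mid x^{(j)}\right) \]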

👉 This penalizes wrong predictions more when the model is confident but incorrect.
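
A small sketch of this metric, assuming `proba` holds one row of predicted class probabilities per sample and `y_true` holds integer labels; the clipping constant `eps` is an assumption to guard against \(\log 0\):

```python
import numpy as np

def log_loss(y_true, proba, eps=1e-15):
    """Multiclass log loss (cross-entropy).

    y_true : (m,)   int array of true class labels
    proba  : (m, k) array, proba[j, c] = P(y = c | x^(j))
    """
    m = len(y_true)
    # Probability the model assigned to each sample's true class.
    p_true = proba[np.arange(m), y_true]
    # Clip to avoid log(0) for overconfident wrong predictions.
    p_true = np.clip(p_true, eps, 1.0)
    return -np.log(p_true).mean()
```

scikit-learn ships an equivalent `sklearn.metrics.log_loss`, which is usually preferable to a hand-rolled version.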


3. Zero-One Loss – Simpler Alternative#

For classification, we sometimes also look at the 0-1 loss, which ignores the predicted probabilities and only checks whether the predicted label is correct:

\[ \text{0-1 Loss} = \frac{1}{m} \sum_{j=1}^m \mathbf{1}\{\hat{y}^{(j)} \neq y^{(j)}\} \]

This is simply the misclassification rate, i.e. one minus the accuracy.
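
And a one-line sketch of this metric, assuming `y_pred` holds the predicted labels \(\hat{y}^{(j)}\):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Misclassification rate: fraction of samples where prediction != label."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```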


Summary#

  • Training → Naïve Bayes parameters are estimated via maximum likelihood (of the joint \(P(x, y)\)), which implicitly minimizes the negative log-likelihood (NLL).

  • Evaluation → Common cost functions:

    • Log Loss (cross-entropy) → best for assessing the quality of the predicted probabilities.

    • 0-1 Loss (error rate) → best for accuracy comparison.