Cost Functions#

1. Maximum Likelihood Estimation (MLE) – Training Objective#

Naïve Bayes learns the class priors \(P(y)\) and the per-feature likelihoods \(P(x_i \mid y)\) from data by maximizing the joint likelihood of the training set:

\[ L(\theta) = \prod_{j=1}^m P(x^{(j)}, y^{(j)}; \theta) = \prod_{j=1}^m P(y^{(j)}; \theta) \prod_{i=1}^n P(x_i^{(j)} \mid y^{(j)}; \theta) \]

where:

  • \(m\) = number of training samples

  • \(y^{(j)}\) = class label of sample \(j\)

  • \(x^{(j)} = (x_1^{(j)}, \dots, x_n^{(j)})\) = features of sample \(j\)

  • \(n\) = number of features

  • \(\theta\) = parameters (priors + likelihoods).

⚡ In practice, we maximize the log-likelihood, which avoids numerical underflow and turns the product into a sum:

\[ \ell(\theta) = \sum_{j=1}^m \log P(x^{(j)}, y^{(j)}; \theta) \]

👉 So the implicit cost function is:

\[ J(\theta) = -\ell(\theta) = - \sum_{j=1}^m \log P(x^{(j)}, y^{(j)}; \theta) \]

This is the negative log-likelihood (NLL). Because \(\ell(\theta)\) decomposes over the parameters, maximizing it has a closed-form solution: the optimal priors and likelihoods are simply (optionally smoothed) relative-frequency counts, so no iterative optimization is needed. Its conditional analogue, built from \(P(y \mid x)\), is the cross-entropy (log loss) used for evaluation below.
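
Here is a minimal sketch of that closed-form estimation for a categorical Naïve Bayes model, assuming integer-encoded features and labels; the function names (`fit_naive_bayes`, `joint_nll`), the `n_values` parameter, and the Laplace-smoothing constant `alpha` are illustrative choices, not fixed by the text above (setting `alpha = 0` recovers the pure MLE counts):

```python
import numpy as np

def fit_naive_bayes(X, y, n_values, alpha=1.0):
    """MLE (with optional Laplace smoothing) for a categorical Naive Bayes.

    X : (m, n) int array, feature values in {0, ..., n_values - 1}
    y : (m,)   int array, class labels in {0, ..., k - 1}
    Returns the class priors P(y) and likelihoods P(x_i = v | y).
    """
    m, n = X.shape
    k = y.max() + 1
    # Priors: (smoothed) relative frequency of each class.
    priors = (np.bincount(y, minlength=k) + alpha) / (m + k * alpha)
    # Likelihoods: (smoothed) per-class counts of each feature value.
    likelihoods = np.zeros((k, n, n_values))
    for c in range(k):
        Xc = X[y == c]
        for i in range(n):
            counts = np.bincount(Xc[:, i], minlength=n_values)
            likelihoods[c, i] = (counts + alpha) / (len(Xc) + n_values * alpha)
    return priors, likelihoods

def joint_nll(X, y, priors, likelihoods):
    """Cost J(theta): negative joint log-likelihood over the data set."""
    m, n = X.shape
    log_joint = np.log(priors[y])                        # log P(y^(j))
    for i in range(n):
        log_joint += np.log(likelihoods[y, i, X[:, i]])  # log P(x_i^(j) | y^(j))
    return -np.sum(log_joint)
```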


2. Cross-Entropy / Log Loss – Evaluation#

When evaluating probabilistic classifiers like Naïve Bayes, we often use log loss:

\[ \text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \sum_{c=1}^k \mathbf{1}\{y^{(j)} = c\} \log P(y=c \mid x^{(j)}) \]

where:

  • \(k\) = number of classes

  • \(\mathbf{1}\{\cdot\}\) = indicator function (1 if the true class of sample \(j\) is \(c\), else 0).
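
Since the indicator picks out only the true class of each sample, this is equivalent to:

\[ \text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \log P\left(y = y^{(j)} \mid x^{(j)}\right) \]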

👉 This penalizes wrong predictions more when the model is confident but incorrect.
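
A small sketch of this metric, assuming `proba` holds one row of predicted class probabilities per sample and `y_true` holds integer labels; the clipping constant `eps` is an assumption to guard against \(\log 0\):

```python
import numpy as np

def log_loss(y_true, proba, eps=1e-15):
    """Multiclass log loss (cross-entropy).

    y_true : (m,)   int array of true class labels
    proba  : (m, k) array, proba[j, c] = P(y = c | x^(j))
    """
    m = len(y_true)
    # Probability the model assigned to each sample's true class.
    p_true = proba[np.arange(m), y_true]
    # Clip to avoid log(0) for overconfident wrong predictions.
    p_true = np.clip(p_true, eps, 1.0)
    return -np.log(p_true).mean()
```

scikit-learn ships an equivalent `sklearn.metrics.log_loss`, which is usually preferable to a hand-rolled version.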


3. Zero-One Loss – Simpler Alternative#

For classification, we sometimes also look at the 0-1 loss, which ignores the predicted probabilities and only checks whether the predicted label is correct:

\[ \text{0-1 Loss} = \frac{1}{m} \sum_{j=1}^m \mathbf{1}\{\hat{y}^{(j)} \neq y^{(j)}\} \]

This is simply the misclassification rate, i.e. one minus the accuracy.
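
And a one-line sketch of this metric, assuming `y_pred` holds the predicted labels \(\hat{y}^{(j)}\):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Misclassification rate: fraction of samples where prediction != label."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```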


Summary#

  • Training → Naïve Bayes parameters are estimated via maximum likelihood (of the joint \(P(x, y)\)), which implicitly minimizes the negative log-likelihood (NLL).

  • Evaluation → Common cost functions:

    • Log Loss (cross-entropy) → best for assessing the quality of the predicted probabilities.

    • 0-1 Loss (error rate) → best for accuracy comparison.