Performance Metrics#

1. Accuracy#

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}} \]
  • Measures overall correctness.

  • Works well when classes are balanced.

  • Misleading for imbalanced datasets.

    • Example: if 95% of emails are “ham”, a model that always predicts “ham” scores 95% accuracy yet is useless (sketched in the code below).

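A minimal sketch of the ham/spam example above (the labels are hypothetical, chosen only to reproduce the 95% figure):

    from sklearn.metrics import accuracy_score

    # Hypothetical imbalanced labels: 95 "ham" (0) and 5 "spam" (1)
    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100  # a degenerate model that always predicts "ham"

    print(accuracy_score(y_true, y_pred))  # 0.95 -- high accuracy, yet no spam is ever caught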

2. Confusion Matrix#

A table comparing predicted vs actual classes.

For binary classification:

|                     | Predicted Positive  | Predicted Negative  |
|---------------------|---------------------|---------------------|
| **Actual Positive** | True Positive (TP)  | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN)  |

From this, we compute other metrics.

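A small sketch with toy labels (invented for illustration). Note that for labels [0, 1], scikit-learn returns the matrix as [[TN, FP], [FN, TP]], i.e., actual class 0 in the first row:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy ground-truth labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # toy predictions

    # ravel() flattens [[TN, FP], [FN, TP]] into four counts
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tn, fp, fn, tp)  # 3 1 1 3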

3. Precision#

\[ \text{Precision} = \frac{TP}{TP + FP} \]
  • Of all items predicted positive, how many are truly positive?

  • Good when false positives are costly (e.g., classifying ham as spam).


4. Recall (Sensitivity, True Positive Rate)#

\[ \text{Recall} = \frac{TP}{TP + FN} \]
  • Of all actual positive items, how many did we correctly find?

  • Good when false negatives are costly (e.g., missing a cancer diagnosis).


5. F1 Score#

\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
  • Harmonic mean of precision and recall.

  • Useful for imbalanced data.

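A short sketch tying the three formulas together, reusing the toy labels from the confusion-matrix example (TP = 3, FP = 1, FN = 1):

    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
    print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
    print(f1_score(y_true, y_pred))         # harmonic mean of 0.75 and 0.75 = 0.75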

6. ROC Curve & AUC#

  • ROC Curve → plots True Positive Rate (Recall) vs False Positive Rate (FP / (FP+TN)) for different probability thresholds.

  • AUC (Area Under Curve) → measures how well the model separates classes.

    • AUC = 1 → perfect.

    • AUC = 0.5 → random guessing.

Naïve Bayes outputs probabilities (\(P(y|x)\)), so you can directly use ROC-AUC.

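A minimal sketch with hypothetical scores (any classifier exposing predict_proba can be evaluated the same way):

    from sklearn.metrics import roc_auc_score, roc_curve

    y_true  = [0, 0, 1, 1, 0, 1]               # toy labels
    y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # hypothetical P(y=1|x) scores

    # One (FPR, TPR) point per threshold swept over the scores
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    print(roc_auc_score(y_true, y_proba))  # ~0.89 for these toy scores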

7. Log Loss (Cross-Entropy Loss)#

\[ \text{LogLoss} = -\frac{1}{m} \sum_{j=1}^m \log P(y^{(j)} | x^{(j)}) \]
  • Evaluates the probabilistic predictions, not just labels.

  • Penalizes confident but wrong predictions.

  • Useful when probability calibration matters (e.g., medical risk prediction).

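A sketch contrasting a hedged prediction with a confidently wrong one (probabilities invented for illustration):

    from sklearn.metrics import log_loss

    y_true = [1, 0, 1]

    # Each row is [P(y=0), P(y=1)] for one sample
    hedged        = [[0.20, 0.80], [0.70, 0.30], [0.40, 0.60]]
    overconfident = [[0.99, 0.01], [0.70, 0.30], [0.40, 0.60]]  # first sample: confident and wrong

    print(log_loss(y_true, hedged))         # ~0.36
    print(log_loss(y_true, overconfident))  # ~1.82, dominated by -log(0.01)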

8. Calibration Metrics#

Naïve Bayes often produces poorly calibrated probabilities: the conditional-independence assumption multiplies together evidence that is in fact correlated, pushing posteriors toward the extremes (close to 0 or 1).

  • Tools like calibration curves or Brier score check if predicted probabilities match actual outcomes.

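A quick sketch of both tools on hypothetical probabilities (note the extreme values, typical of Naïve Bayes):

    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss

    y_true  = [0, 0, 1, 0, 1, 1, 0, 1]                          # toy labels
    y_proba = [0.02, 0.10, 0.95, 0.30, 0.99, 0.85, 0.05, 0.60]  # hypothetical predictions

    # Bin predictions, then compare observed frequency to mean predicted probability per bin
    frac_positive, mean_predicted = calibration_curve(y_true, y_proba, n_bins=4)
    print(frac_positive, mean_predicted)

    # Brier score: mean squared error between probabilities and outcomes (lower is better)
    print(brier_score_loss(y_true, y_proba))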

Summary#

For Naïve Bayes classification, use:

  • Accuracy → if classes balanced.

  • Precision, Recall, F1 → if data imbalanced.

  • ROC-AUC → for probability-based evaluation.

  • Log Loss → if probability quality matters.

  • Calibration → if decision thresholds rely on well-calibrated probabilities.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, log_loss
)
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=2,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predictions
y_pred = nb.predict(X_test)
y_proba = nb.predict_proba(X_test)[:, 1]

# Metrics
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=["Class 0", "Class 1"])
logloss_val = log_loss(y_test, y_proba)

# ROC-AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

(acc, cm, report, logloss_val, roc_auc)
(0.8866666666666667,
 array([[100,   4],
        [ 13,  33]]),
 '              precision    recall  f1-score   support\n\n     Class 0       0.88      0.96      0.92       104\n     Class 1       0.89      0.72      0.80        46\n\n    accuracy                           0.89       150\n   macro avg       0.89      0.84      0.86       150\nweighted avg       0.89      0.89      0.88       150\n',
 0.4033059439714829,
 0.8760451505016723)

Results#

  • Accuracy: 0.887 (~89%)

  • Confusion Matrix:

    [[100   4]
     [ 13  33]]
    
    • True Negatives = 100

    • False Positives = 4

    • False Negatives = 13

    • True Positives = 33

  • Classification Report:

                  precision    recall  f1-score   support
    
         Class 0       0.88      0.96      0.92       104
         Class 1       0.89      0.72      0.80        46
    
        accuracy                           0.89       150
       macro avg       0.89      0.84      0.86       150
    weighted avg       0.89      0.89      0.88       150


  • Log Loss: 0.403 (lower is better; penalizes confident but wrong predictions)

  • ROC-AUC: 0.876 (good separation; 1.0 = perfect, 0.5 = random)

These metrics show:

  • The model is strong overall (~89% accuracy).

  • Class 1 (the minority class) has lower recall (0.72), meaning some positives are missed.

  • ROC-AUC confirms good probability separation.

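As a follow-up, a minimal sketch that plots the ROC curve computed above (it reuses fpr, tpr, roc_auc, and the plt import from the code cell):

    # Plot the ROC curve against the random-guessing diagonal
    plt.figure(figsize=(5, 4))
    plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing (AUC = 0.5)")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.legend()
    plt.show()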

Macro Average (macro avg)#

  • Definition: Takes the arithmetic mean of the metric across all classes without considering class imbalance.

  • Formula for precision (example):

    \[ \text{Precision}_{macro} = \frac{1}{C} \sum_{i=1}^{C} \text{Precision}_i \]

    where \(C\) = number of classes.

  • Effect:

    • Treats all classes equally.

    • Useful when you want to evaluate performance per class fairly, even if one class has fewer samples.

👉 In the Naïve Bayes example above:

  • macro avg precision = 0.89

  • macro avg recall = 0.84

  • Shows average performance across Class 0 and Class 1, equally weighted.
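
A quick check against the report above: the macro-averaged recall is the plain mean of the two per-class recalls.

\[ \text{Recall}_{macro} = \frac{0.96 + 0.72}{2} = 0.84 \]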


Weighted Average (weighted avg)#

  • Definition: Takes the support (number of true samples per class) into account while averaging.

  • Formula for precision (example):

    \[ \text{Precision}_{weighted} = \frac{\sum_{i=1}^{C} ( \text{Support}_i \times \text{Precision}_i )}{\sum_{i=1}^{C} \text{Support}_i} \]
  • Effect:

    • Gives more importance to larger classes.

    • If the dataset is imbalanced, the metric is skewed toward the majority class.

👉 In the Naïve Bayes example above:

  • weighted avg precision = 0.89

  • weighted avg recall = 0.89

  • Since Class 0 has 104 samples vs Class 1 has 46, Class 0 has more influence on the weighted averages.
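
Checking against the support counts in the report:

\[ \text{Recall}_{weighted} = \frac{104 \times 0.96 + 46 \times 0.72}{150} = \frac{132.96}{150} \approx 0.89 \]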


Summary:

  • Macro Avg → Equal weight to each class (good for imbalanced dataset evaluation).

  • Weighted Avg → Weighted by class size (good for overall performance reflection).