Performance Metrics#
1. Accuracy#
Measures overall correctness.
Works well when classes are balanced.
Misleading for imbalanced datasets.
Example: If 95% of emails are “ham”, predicting “ham” always gives 95% accuracy but is useless.
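This pitfall is easy to demonstrate with scikit-learn; the 95/5 "ham"/"spam" split below is just the illustrative figure from above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95% "ham" (0), 5% "spam" (1)
y_true = np.array([0] * 95 + [1] * 5)

# A useless classifier that always predicts "ham"
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95, despite catching zero spam
```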
2. Confusion Matrix#
A table comparing predicted vs actual classes.
For binary classification:
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
From this, we compute other metrics.
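As a small illustration (with made-up labels), scikit-learn's `confusion_matrix` can be unpacked into the four cells above; note that its rows and columns are ordered [negative, positive] by default, so `ravel()` yields TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 3 1 1 3
```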
3. Precision#
Of all items predicted positive, how many are truly positive?
Good when false positives are costly (e.g., classifying ham as spam).
4. Recall (Sensitivity, True Positive Rate)#
Of all actual positive items, how many did we correctly identify?
Good when false negatives are costly (e.g., missing a cancer diagnosis).
5. F1 Score#
Harmonic mean of precision and recall.
Useful for imbalanced data.
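All three metrics can be computed with scikit-learn. The toy labels below are invented for illustration; they give TP = 3, FP = 1, FN = 1, so precision and recall (and hence F1) all come out to 0.75:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP) = 3 / 4
# Recall    = TP / (TP + FN) = 3 / 4
# F1        = 2 * P * R / (P + R)
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```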
6. ROC Curve & AUC#
ROC Curve → plots True Positive Rate (Recall) vs False Positive Rate (FP / (FP+TN)) for different probability thresholds.
AUC (Area Under Curve) → measures how well the model separates classes.
AUC = 1 → perfect.
AUC = 0.5 → random guessing.
Naïve Bayes outputs probabilities (\(P(y|x)\)), so you can directly use ROC-AUC.
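A minimal sketch with invented scores, passing the predicted positive-class probabilities straight to scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # e.g. P(y=1|x) from predict_proba

# AUC = fraction of (positive, negative) pairs ranked correctly: 3 of 4
print(roc_auc_score(y_true, y_score))  # 0.75
```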
7. Log Loss (Cross-Entropy Loss)#
Evaluates the probabilistic predictions, not just labels.
Penalizes confident but wrong predictions.
Useful when probability calibration matters (e.g., medical risk prediction).
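A toy example (made-up probabilities) showing how a single confident wrong prediction inflates log loss:

```python
from sklearn.metrics import log_loss

y_true = [1, 1, 0]
# Confident and correct vs. one confident miss on the second sample
p_good = [0.9, 0.9, 0.1]
p_bad  = [0.9, 0.1, 0.1]

print(log_loss(y_true, p_good))  # ~0.105
print(log_loss(y_true, p_bad))   # ~0.838 -- the confident miss dominates
```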
8. Calibration Metrics#
Naïve Bayes often produces poorly calibrated probabilities (too extreme, close to 0 or 1).
Tools like calibration curves or Brier score check if predicted probabilities match actual outcomes.
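A minimal Brier-score sketch with invented probabilities; the Brier score is simply the mean squared error between predicted probabilities and the 0/1 outcomes, so lower is better:

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]

# mean((p - y)^2) = (0.01 + 0.01 + 0.04 + 0.09 + 0.16) / 5
print(brier_score_loss(y_true, y_prob))  # 0.062
```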
Summary#
For Naïve Bayes classification, use:
Accuracy → if classes balanced.
Precision, Recall, F1 → if data imbalanced.
ROC-AUC → for probability-based evaluation.
Log Loss → if probability quality matters.
Calibration → if decision thresholds rely on well-calibrated probabilities.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, auc, log_loss
)
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=2,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predictions
y_pred = nb.predict(X_test)
y_proba = nb.predict_proba(X_test)[:, 1]

# Metrics
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=["Class 0", "Class 1"])
logloss_val = log_loss(y_test, y_proba)

# ROC-AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

(acc, cm, report, logloss_val, roc_auc)
```
```
(0.8866666666666667,
 array([[100,   4],
        [ 13,  33]]),
 ' precision recall f1-score support\n\n Class 0 0.88 0.96 0.92 104\n Class 1 0.89 0.72 0.80 46\n\n accuracy 0.89 150\n macro avg 0.89 0.84 0.86 150\nweighted avg 0.89 0.89 0.88 150\n',
 0.4033059439714829,
 0.8760451505016723)
```
Results#
Accuracy: 0.887 (~89%)

Confusion Matrix:

[[100   4]
 [ 13  33]]

True Negatives = 100
False Positives = 4
False Negatives = 13
True Positives = 33
Classification Report:

```
              precision    recall  f1-score   support

     Class 0       0.88      0.96      0.92       104
     Class 1       0.89      0.72      0.80        46

    accuracy                           0.89       150
   macro avg       0.89      0.84      0.86       150
weighted avg       0.89      0.89      0.88       150
```
- **Log Loss**: `0.403` (lower is better; penalizes wrong confident predictions)
- **ROC-AUC**: `0.876` (good separation; 1.0 = perfect, 0.5 = random)
---
These metrics show:
- Model is strong overall (~89% accuracy).
- Slight imbalance in recall → Class 1 (minority) has lower recall (0.72), meaning some positives are missed.
- ROC-AUC confirms good probability separation.
---
Macro Average (macro_avg)#
Definition: Takes the arithmetic mean of the metric across all classes without considering class imbalance.
Formula for precision (example):
\[ \text{Precision}_{macro} = \frac{1}{C} \sum_{i=1}^{C} \text{Precision}_i \]
where \(C\) = number of classes.
Effect:
Treats all classes equally.
Useful when you want to evaluate performance per class fairly, even if one class has fewer samples.
In the Naïve Bayes example above:
macro avg precision = 0.89
macro avg recall = 0.84
This shows the average performance across Class 0 and Class 1, equally weighted.
Weighted Average (weighted_avg)#
Definition: Takes the support (number of true samples per class) into account while averaging.
Formula for precision (example):
\[ \text{Precision}_{weighted} = \frac{\sum_{i=1}^{C} ( \text{Support}_i \times \text{Precision}_i )}{\sum_{i=1}^{C} \text{Support}_i} \]
Effect:
Gives more importance to larger classes.
If dataset is imbalanced, the metric will be skewed toward majority class.
In the Naïve Bayes example above:
weighted avg precision = 0.89
weighted avg recall = 0.89
Since Class 0 has 104 samples and Class 1 has 46, Class 0 has more influence on the weighted averages.
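Both averages can be verified by hand from the per-class recall values and supports in the classification report above:

```python
# Per-class recall and support, taken from the report above
recalls  = [0.96, 0.72]   # Class 0, Class 1
supports = [104, 46]

# Macro: plain arithmetic mean over classes
macro = sum(recalls) / len(recalls)

# Weighted: mean weighted by each class's support
weighted = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(round(macro, 2))     # 0.84
print(round(weighted, 2))  # 0.89
```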
Summary:
Macro Avg → Equal weight to each class (good for imbalanced dataset evaluation).
Weighted Avg → Weighted by class size (good for overall performance reflection).