Assumptions#
1. Feature Independence#
Assumption: Features are independent given the class.
Reality: Features are often correlated.
Example: In spam emails, the words “lottery” and “win” appear together frequently, so the independence assumption is clearly violated.
Why it still works: The product of per-feature probabilities often still yields a reasonable ranking of the classes, even when the absolute probabilities are wrong; classification only needs the highest posterior, not exact values.
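Concretely, Naive Bayes scores each class with \(P(C \mid x_1,\dots,x_n) \propto P(C)\prod_i P(x_i \mid C)\), as if every \(x_i\) were conditionally independent. The sketch below uses made-up spam/ham probabilities for the two correlated words; the numbers are invented purely for illustration.

```python
# Toy, invented conditional probabilities P(word | class) for two
# correlated words; not estimated from any real dataset.
p_word_given_spam = {"lottery": 0.40, "win": 0.35}
p_word_given_ham = {"lottery": 0.01, "win": 0.05}
p_spam, p_ham = 0.4, 0.6

def naive_score(words, p_word_given_class, prior):
    # Naive Bayes multiplies P(word | class) as if the words were
    # independent, which double-counts the correlated evidence.
    score = prior
    for w in words:
        score *= p_word_given_class[w]
    return score

email = ["lottery", "win"]
spam_score = naive_score(email, p_word_given_spam, p_spam)
ham_score = naive_score(email, p_word_given_ham, p_ham)

# The scores are not calibrated probabilities, but the ranking still
# points to the right class.
print(spam_score > ham_score)  # True
```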
2. Equal Contribution of Features#
Assumption: Each feature contributes equally to the prediction.
Reality: Some features carry far more evidence than others.
Example: In medical diagnosis, “tumor detected in MRI” is far stronger evidence than “slight fever”.
Why it still works: In high-dimensional settings (like text classification), many weak but roughly independent signals combine to give strong results.
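A small synthetic sketch of this effect, assuming an artificial dataset in which every feature is only slightly informative (the 55% vs. 45% rates, feature count, and sample size below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 500 weak binary features: each is 1 with probability 0.55 in class 1
# and 0.45 in class 0, so any single feature is barely predictive.
n_samples, n_features = 2000, 500
y = rng.integers(0, 2, size=n_samples)
p = np.where(y[:, None] == 1, 0.55, 0.45)
X = (rng.random((n_samples, n_features)) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = BernoulliNB().fit(X_tr, y_tr)

# Hundreds of weak, roughly independent signals combine into a strong
# classifier; accuracy here is typically well above 0.9.
print(clf.score(X_te, y_te))
```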
3. Distribution of Features#
Assumption:
- Gaussian NB → features are normally distributed within each class.
- Multinomial NB → word counts follow a multinomial distribution.
- Bernoulli NB → features are binary presence/absence indicators.
Reality: Data distributions often deviate.
Example: Continuous features may be skewed, not Gaussian.
Why it still works: As long as the assumed distribution is a rough approximation, the decision boundary can still separate classes effectively.
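As a rough illustration of which scikit-learn variant matches which feature type, here is a sketch on synthetic data; the generating distributions below are assumptions made purely for demonstration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Continuous features -> GaussianNB (assumes per-class normality).
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
GaussianNB().fit(X_cont, y)

# Nonnegative counts (e.g. word counts) -> MultinomialNB.
X_counts = rng.poisson(lam=2 + y[:, None], size=(200, 3))
MultinomialNB().fit(X_counts, y)

# Binary presence/absence indicators -> BernoulliNB.
X_bin = (X_counts > 0).astype(int)
BernoulliNB().fit(X_bin, y)
```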
4. No Zero Probability#
Assumption: Every feature-class combination has a nonzero probability.
Reality: Some words or feature values may never appear with a given class in the training data.
Example: If “Bitcoin” never appeared in spam training data, then \(P(\text{Bitcoin}|\text{spam}) = 0\).
Why it still works: With Laplace (add-one) smoothing, we avoid zeros and keep predictions stable.
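A minimal sketch of add-one smoothing, using hypothetical word counts invented for illustration:

```python
# Hypothetical counts: "Bitcoin" never appears in the spam training emails.
word_counts_spam = {"lottery": 30, "win": 25, "Bitcoin": 0}
total_spam_words = sum(word_counts_spam.values())
vocab_size = len(word_counts_spam)

def p_word_given_spam(word, alpha=1.0):
    # Laplace smoothing adds alpha to every count, so no estimate is exactly zero.
    return (word_counts_spam[word] + alpha) / (total_spam_words + alpha * vocab_size)

print(p_word_given_spam("Bitcoin", alpha=0.0))  # 0.0 -> would zero out the whole posterior
print(p_word_given_spam("Bitcoin", alpha=1.0))  # small but nonzero
```

In scikit-learn, the same idea is exposed through the `alpha` parameter of `MultinomialNB` and `BernoulliNB`, which defaults to 1.0 (add-one smoothing).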
Key Insight: Even though the independence and distribution assumptions are false in practice, Naive Bayes still works well when:
- Features provide enough weak evidence.
- The goal is classification, not perfect probability estimation.
- Data is high-dimensional and sparse (like text).
❌ It fails when:
- Strong feature correlations matter.
- Precise probability estimates are required (not just classification).