Assumptions


1. Feature Independence

  • Assumption: Features are independent given the class.

  • Reality: Features are often correlated.

    • Example: In spam emails, the words “lottery” and “win” frequently appear together, so the independence assumption is violated.

  • Why it still works: The product of probabilities often still yields a reasonable ranking of classes, even if absolute probabilities are wrong. Classification only needs the highest posterior, not exact values.
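A minimal sketch of this point, using made-up word likelihoods (the numbers are illustrative, not from real data): even though “lottery” and “win” co-occur and their evidence gets double-counted, the class with the highest posterior score is unchanged.

```python
import math

# Hypothetical per-class word likelihoods (illustrative values only).
likelihood = {
    "spam": {"lottery": 0.20, "win": 0.18},
    "ham":  {"lottery": 0.001, "win": 0.02},
}
prior = {"spam": 0.4, "ham": 0.6}

def log_posterior(words, cls):
    # Naive Bayes scores a class with log P(class) + sum of log P(word|class);
    # the product of probabilities becomes a sum of logs for numerical stability.
    return math.log(prior[cls]) + sum(math.log(likelihood[cls][w]) for w in words)

words = ["lottery", "win"]          # correlated words: their evidence is over-counted
scores = {c: log_posterior(words, c) for c in prior}
prediction = max(scores, key=scores.get)
# The absolute posteriors are distorted by the false independence assumption,
# but the argmax (the ranking of classes) still selects "spam".
```

The key observation is that classification only compares scores across classes, so a monotone distortion of the per-class products does not change the winner.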


2. Equal Contribution of Features

  • Assumption: Each feature contributes equally to the prediction.

  • Reality: Some features dominate.

    • Example: In medical diagnosis, “tumor detected in MRI” is far stronger than “slight fever”.

  • Why it still works: In high-dimensional settings (like text classification), many weak but roughly independent signals combine to give strong results.
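A small sketch of how weak signals accumulate, with assumed numbers: 50 features, each only slightly more likely under class A than class B (0.55 vs. 0.45). No single feature is decisive, but their log-likelihood ratios add up.

```python
import math

# Assumed setup: 50 weak binary indicators, each slightly favoring class A.
n_features = 50
p_a, p_b = 0.55, 0.45  # P(feature fires | A) vs. P(feature fires | B)

# If all 50 indicators fire, the evidence for A over B is the sum of
# 50 small per-feature log-likelihood ratios.
llr = sum(math.log(p_a / p_b) for _ in range(n_features))
odds_ratio = math.exp(llr)  # combined evidence grows multiplicatively
```

Each feature contributes only about 0.2 nats, yet the combined odds ratio runs into the tens of thousands, which is why no individual feature needs to dominate.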


3. Distribution of Features

  • Assumption:

    • Gaussian NB → features are normally distributed.

    • Multinomial NB → word counts follow multinomial distribution.

    • Bernoulli NB → features are binary indicators.

  • Reality: Data distributions often deviate.

    • Example: Continuous features may be skewed, not Gaussian.

  • Why it still works: As long as the assumed distribution is a rough approximation, the decision boundary can still separate classes effectively.
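To make the Gaussian NB case concrete, here is a minimal sketch with hypothetical per-class parameters for one continuous feature: each class gets its own fitted normal density, and prediction picks the class whose density assigns the observation the higher likelihood (equal priors assumed).

```python
import math

def gaussian_pdf(x, mean, std):
    # Gaussian NB models P(feature|class) with a normal density per class.
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical (mean, std) per class for a single continuous feature.
params = {"class_a": (160.0, 6.0), "class_b": (178.0, 7.0)}

def classify(x):
    # Equal priors assumed: pick the class whose fitted Gaussian
    # assigns x the higher likelihood.
    return max(params, key=lambda c: gaussian_pdf(x, *params[c]))
```

Even if the true feature distribution is skewed rather than Gaussian, the decision boundary (where the two densities cross) can still land between the classes well enough to separate them.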


4. No Zero Probability

  • Assumption: Every feature-class combination has a nonzero probability.

  • Reality: Some words/values may not appear in training.

    • Example: If “Bitcoin” never appeared in spam training data, then \(P(\text{Bitcoin}|\text{spam}) = 0\).

  • Why it still works: With Laplace (add-one) smoothing, we avoid zeros and keep predictions stable.
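A sketch of add-one smoothing on the “Bitcoin” example above, using a made-up spam vocabulary: without smoothing the unseen word gets probability zero and wipes out the whole product; with smoothing it gets a small nonzero estimate.

```python
from collections import Counter

# Hypothetical spam training tokens; note "Bitcoin" never appears.
spam_tokens = ["lottery", "win", "win", "prize", "free"]
vocab = ["lottery", "win", "prize", "free", "meeting", "Bitcoin"]
counts = Counter(spam_tokens)

def p_word_given_spam(word, alpha=1):
    # Laplace (add-one) smoothing: add alpha to every count, and alpha * |vocab|
    # to the denominator, so no feature-class combination is ever zero.
    return (counts[word] + alpha) / (len(spam_tokens) + alpha * len(vocab))

unsmoothed = counts["Bitcoin"] / len(spam_tokens)  # 0.0 -- zeroes the whole product
smoothed = p_word_given_spam("Bitcoin")            # small but nonzero
```

Here the smoothed estimate is 1/11 ≈ 0.09, so an email mentioning “Bitcoin” can still be scored against the spam class rather than being ruled out outright.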


Key Insight: Even though the independence and distribution assumptions are false in practice, Naive Bayes still works well when:

  • Features provide enough weak evidence.

  • The goal is classification, not perfect probability estimation.

  • Data is high-dimensional and sparse (like text).

❌ It fails when:

  • Strong feature correlations matter.

  • Precise probability estimates are required (not just classification).