Intuition#

Naive Bayes is a “vote counting” machine powered by probability.

  • You look at the features of a new sample (like words in an email).

  • For each possible class (e.g., spam vs ham), you ask: “If this sample belonged to this class, how likely would I see these features?”

  • Multiply those likelihoods by how common that class is overall (prior).

  • Whichever class gives the highest probability wins.

👉 Even if features are correlated, it often still works, because it only needs to compare relative evidence between classes, not get the exact probabilities right.
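
To make the vote-counting picture concrete, here is a rough sketch (the tiny corpus and the word sets in it are made up for illustration) of how the two ingredients, the prior and the per-word likelihoods, can be estimated by simple counting:

```python
from collections import Counter

# Hypothetical toy corpus: (set of words in the email, label)
emails = [
    ({"win", "lottery", "now"}, "spam"),
    ({"win", "prize"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"lunch", "tomorrow", "win"}, "ham"),
]

# Prior P(y): how common each class is overall
class_counts = Counter(label for _, label in emails)
priors = {c: n / len(emails) for c, n in class_counts.items()}

# Likelihood P(word | y): fraction of emails of class y that contain the word
def likelihood(word, cls):
    in_class = [words for words, label in emails if label == cls]
    return sum(word in words for words in in_class) / len(in_class)

print(priors)                      # {'spam': 0.5, 'ham': 0.5}
print(likelihood("win", "spam"))   # 1.0  (both spam emails contain "win")
print(likelihood("win", "ham"))    # 0.5  (one of the two ham emails contains "win")
```

In practice these counts are usually smoothed (e.g. Laplace smoothing) so that a word never seen in a class does not get a probability of exactly zero.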


Mathematical Intuition#

Bayes’ theorem:

\[ P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)} \]

We only care about which class has the maximum posterior probability, and the denominator \(P(X)\) is the same for every class, so it can be ignored:

\[ \hat{y} = \arg\max_y P(X|y) \cdot P(y) \]

Now apply the naive assumption (the features are conditionally independent given the class):

\[ P(X|y) = P(x_1, x_2, \ldots, x_n | y) \approx \prod_{i=1}^n P(x_i | y) \]

So the classifier becomes:

\[ \hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y) \]

where:

  • \(P(y)\) → how frequent the class is (prior).

  • \(P(x_i|y)\) → how often feature \(x_i\) appears in class \(y\).
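
A minimal sketch of this scoring rule (the probability tables below are placeholders that happen to match the worked example in the next section, not learned values); it works in log space, the usual trick for keeping a product of many small probabilities from underflowing:

```python
import math

# Hypothetical parameters: priors P(y) and per-feature likelihoods P(x_i | y)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"win": 0.8, "lottery": 0.7},
    "ham":  {"win": 0.1, "lottery": 0.05},
}

def predict(features):
    # Score each class with log P(y) + sum_i log P(x_i | y),
    # which ranks classes the same way as P(y) * prod_i P(x_i | y)
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihoods[y][x]) for x in features)
        for y in priors
    }
    return max(scores, key=scores.get)

print(predict(["win", "lottery"]))  # spam
```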


Example Intuition with Math#

Suppose we want to classify an email with the words: “win lottery”.

From training data:

  • \(P(\text{spam}) = 0.4\), \(P(\text{ham}) = 0.6\).

  • \(P(\text{win}|\text{spam}) = 0.8\), \(P(\text{win}|\text{ham}) = 0.1\).

  • \(P(\text{lottery}|\text{spam}) = 0.7\), \(P(\text{lottery}|\text{ham}) = 0.05\).

Compute:

\[ P(\text{spam}|\text{“win lottery”}) \propto 0.4 \cdot 0.8 \cdot 0.7 = 0.224 \]
\[ P(\text{ham}|\text{“win lottery”}) \propto 0.6 \cdot 0.1 \cdot 0.05 = 0.003 \]

📌 Prediction: Spam, because 0.224 > 0.003.
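
The same arithmetic as a quick sanity check in code (numbers copied from the training-data figures above):

```python
# Unnormalized scores: prior * likelihoods of the observed words
p_spam = 0.4 * 0.8 * 0.7    # P(spam) * P(win|spam) * P(lottery|spam)
p_ham  = 0.6 * 0.1 * 0.05   # P(ham)  * P(win|ham)  * P(lottery|ham)

print(round(p_spam, 3), round(p_ham, 3))     # 0.224 0.003
print("spam" if p_spam > p_ham else "ham")   # spam
```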


Why it works despite being naive#

  • Even if “win” and “lottery” are correlated, multiplying their probabilities usually still gives the correct class a higher score than the wrong one.

  • The absolute probability values may be badly calibrated, but the relative comparison between classes is usually good enough for classification.


Summary:

  • Intuition: Pick the class that best explains the observed features.

  • Math intuition: Bayes’ theorem + independence assumption → product of simple probabilities.

  • Outcome: Fast, effective classifier, especially for text/NLP.
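
For a more realistic text workflow, here is a minimal end-to-end sketch with scikit-learn (the four training emails are made up for illustration); `MultinomialNB` applies the same prior-times-likelihood scoring to word-count features, with Laplace smoothing by default:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set
texts = [
    "win lottery now",
    "claim your prize win",
    "meeting at noon tomorrow",
    "lunch plans for tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each text into word counts;
# MultinomialNB fits the priors and per-word likelihoods from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["win lottery"]))  # expected: ['spam']
```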
