Intuition#

Naive Bayes is a “vote counting” machine powered by probability.

  • You look at the features of a new sample (like words in an email).

  • For each possible class (e.g., spam vs ham), you ask: “If this sample belonged to this class, how likely would I see these features?”

  • Multiply those likelihoods by how common that class is overall (prior).

  • Whichever class gives the highest probability wins.

👉 Even if features are correlated, it often still works, because it only needs to compare relative evidence between classes, not get the exact probabilities right.
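
To make the vote-counting picture concrete, here is a rough sketch (the tiny corpus and the word sets in it are made up for illustration) of how the two ingredients, the prior and the per-word likelihoods, can be estimated by simple counting:

```python
from collections import Counter

# Hypothetical toy corpus: (set of words in the email, label)
emails = [
    ({"win", "lottery", "now"}, "spam"),
    ({"win", "prize"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"lunch", "tomorrow", "win"}, "ham"),
]

# Prior P(y): how common each class is overall
class_counts = Counter(label for _, label in emails)
priors = {c: n / len(emails) for c, n in class_counts.items()}

# Likelihood P(word | y): fraction of emails of class y that contain the word
def likelihood(word, cls):
    in_class = [words for words, label in emails if label == cls]
    return sum(word in words for words in in_class) / len(in_class)

print(priors)                      # {'spam': 0.5, 'ham': 0.5}
print(likelihood("win", "spam"))   # 1.0  (both spam emails contain "win")
print(likelihood("win", "ham"))    # 0.5  (one of the two ham emails contains "win")
```

In practice these counts are usually smoothed (e.g. Laplace smoothing) so that a word never seen in a class does not get a probability of exactly zero.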


Mathematical Intuition#

Bayes’ theorem:

\[ P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)} \]

We only care about which class has the maximum posterior probability, and the denominator \(P(X)\) is the same for every class, so it can be ignored:

\[ \hat{y} = \arg\max_y P(X|y) \cdot P(y) \]

Now apply the naive assumption (the features are conditionally independent given the class):

\[ P(X|y) = P(x_1, x_2, \ldots, x_n | y) \approx \prod_{i=1}^n P(x_i | y) \]

So the classifier becomes:

\[ \hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y) \]

where:

  • \(P(y)\) → how frequent the class is (prior).

  • \(P(x_i|y)\) → how often feature \(x_i\) appears in class \(y\).
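
A minimal sketch of this scoring rule (the probability tables below are placeholders that happen to match the worked example in the next section, not learned values); it works in log space, the usual trick for keeping a product of many small probabilities from underflowing:

```python
import math

# Hypothetical parameters: priors P(y) and per-feature likelihoods P(x_i | y)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"win": 0.8, "lottery": 0.7},
    "ham":  {"win": 0.1, "lottery": 0.05},
}

def predict(features):
    # Score each class with log P(y) + sum_i log P(x_i | y),
    # which ranks classes the same way as P(y) * prod_i P(x_i | y)
    scores = {
        y: math.log(priors[y]) + sum(math.log(likelihoods[y][x]) for x in features)
        for y in priors
    }
    return max(scores, key=scores.get)

print(predict(["win", "lottery"]))  # spam
```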


Example Intuition with Math#

Suppose we want to classify an email with the words: “win lottery”.

From training data:

  • \(P(\text{spam}) = 0.4\), \(P(\text{ham}) = 0.6\).

  • \(P(\text{win}|\text{spam}) = 0.8\), \(P(\text{win}|\text{ham}) = 0.1\).

  • \(P(\text{lottery}|\text{spam}) = 0.7\), \(P(\text{lottery}|\text{ham}) = 0.05\).

Compute:

\[ P(\text{spam}|\text{“win lottery”}) \propto 0.4 \cdot 0.8 \cdot 0.7 = 0.224 \]
\[ P(\text{ham}|\text{“win lottery”}) \propto 0.6 \cdot 0.1 \cdot 0.05 = 0.003 \]

📌 Prediction: Spam, because 0.224 > 0.003.
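
The same arithmetic as a quick sanity check in code (numbers copied from the training-data figures above):

```python
# Unnormalized scores: prior * likelihoods of the observed words
p_spam = 0.4 * 0.8 * 0.7    # P(spam) * P(win|spam) * P(lottery|spam)
p_ham  = 0.6 * 0.1 * 0.05   # P(ham)  * P(win|ham)  * P(lottery|ham)

print(round(p_spam, 3), round(p_ham, 3))     # 0.224 0.003
print("spam" if p_spam > p_ham else "ham")   # spam
```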


Why it works despite being naive#

  • Even if “win” and “lottery” are correlated, multiplying their probabilities usually still gives the correct class a higher score than the wrong one.

  • The absolute probability values may be badly calibrated, but the relative comparison between classes is usually good enough for classification.


Summary:

  • Intuition: Pick the class that best explains the observed features.

  • Math intuition: Bayes’ theorem + independence assumption → product of simple probabilities.

  • Outcome: Fast, effective classifier, especially for text/NLP.
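
For a more realistic text workflow, here is a minimal end-to-end sketch with scikit-learn (the four training emails are made up for illustration); `MultinomialNB` applies the same prior-times-likelihood scoring to word-count features, with Laplace smoothing by default:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set
texts = [
    "win lottery now",
    "claim your prize win",
    "meeting at noon tomorrow",
    "lunch plans for tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each text into word counts;
# MultinomialNB fits the priors and per-word likelihoods from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["win lottery"]))  # expected: ['spam']
```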
