Background#

Why do we need probability in ML?#

Machine learning is about making predictions under uncertainty. For example:

  • Email classification: “Is this spam or not?”

  • Medical diagnosis: “Does the patient have disease X?”

  • Sentiment analysis: “Is the review positive, neutral, or negative?”

We want to estimate the probability of each class given the observed features.

Formally:

\[ P(y \mid X) \quad \text{where } X = (x_1, x_2, \dots, x_n) \]

Independent vs Dependent events#

  • Independent events: rolling a die. The outcome of the first roll does not affect the second.

    \[ P(A \cap B) = P(A) \cdot P(B) \]
  • Dependent events: drawing marbles without replacement. The probability of the second draw changes depending on the first.

    \[ P(A \cap B) = P(A) \cdot P(B \mid A) \]
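A quick numeric check of both product rules, as a minimal sketch (the fair die and the 3-red/2-blue marble bag are invented example numbers):

```python
from fractions import Fraction

# Independent events: two rolls of a fair six-sided die.
# P(six on roll 1 AND six on roll 2) = P(six) * P(six)
p_six = Fraction(1, 6)
p_both_sixes = p_six * p_six                          # 1/36

# Dependent events: two draws without replacement from a bag
# with 3 red and 2 blue marbles.
# P(red first AND red second) = P(red) * P(red | first was red)
p_red_first = Fraction(3, 5)
p_red_second_given_first = Fraction(2, 4)             # one red already removed
p_two_reds = p_red_first * p_red_second_given_first   # 3/10

print(p_both_sixes, p_two_reds)  # 1/36 3/10
```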

👉 This dependency idea is the foundation for conditional probability.


Conditional Probability#

Definition:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Intuition: “What is the probability of \(A\) happening if I already know that \(B\) happened?”
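As a minimal sketch (the email counts below are invented for illustration), the conditional probability is just the joint count restricted to the event we condition on:

```python
from fractions import Fraction

# Hypothetical counts over 100 emails:
#   40 contain the word "free" (event B)
#   25 are spam AND contain "free" (event A ∩ B)
n_total, n_B, n_A_and_B = 100, 40, 25

p_B = Fraction(n_B, n_total)              # P(B)
p_A_and_B = Fraction(n_A_and_B, n_total)  # P(A ∩ B)

# P(A | B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 5/8
```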


Bayes' Theorem#

Conditional probability can be written in both directions: \(P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)\). Dividing both sides by \(P(B)\) gives Bayes' theorem:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Where:

  • \(P(A)\) = prior (belief about \(A\) before seeing data)

  • \(P(B \mid A)\) = likelihood (how compatible evidence \(B\) is with \(A\))

  • \(P(B)\) = marginal probability (normalizing constant)

  • \(P(A \mid B)\) = posterior (updated belief after seeing evidence)

This is the Bayesian update rule.
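A standard illustration of the update is a diagnostic test; the numbers below (1% prevalence, 95% sensitivity, 10% false-positive rate) are assumed purely for this sketch:

```python
# Bayesian update: P(disease | positive test)
p_disease = 0.01                 # prior P(A)
p_pos_given_disease = 0.95       # likelihood P(B | A)
p_pos_given_healthy = 0.10       # false-positive rate P(B | not A)

# Marginal P(B) by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.088
```

Even with a fairly accurate test, the posterior stays small because the prior is small: the evidence updates the belief rather than replacing it.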


Applying Bayes to ML#

We want:

\[ P(y \mid x_1, x_2, \dots, x_n) \]

By Bayes' theorem:

\[ P(y \mid X) = \frac{P(y) \cdot P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \]

The denominator is the same for all classes, so we only care about the numerator.
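To see why, note that dividing every class score by the same \(P(X)\) rescales the scores but cannot change which class scores highest. A tiny sketch with made-up numbers:

```python
# Unnormalized class scores P(y) * P(X | y) (made-up numbers)
scores = {"spam": 0.012, "ham": 0.003}

# Dividing by the shared evidence P(X) changes the values...
p_X = sum(scores.values())
posteriors = {y: s / p_X for y, s in scores.items()}

# ...but not the winning class.
print(max(scores, key=scores.get))          # spam
print(max(posteriors, key=posteriors.get))  # spam
print(posteriors)                           # {'spam': 0.8, 'ham': 0.2}
```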


The Naïve Assumption#

Problem: estimating the joint likelihood \(P(x_1, x_2, \dots, x_n \mid y)\) directly is hard because the features may depend on each other, and the number of parameters needed grows exponentially with the number of features.

Naïve Bayes assumes conditional independence:

\[ P(x_1, x_2, \dots, x_n \mid y) \approx \prod_{i=1}^n P(x_i \mid y) \]

This gives:

\[ \hat{y} = \arg\max_y \; P(y) \prod_{i=1}^n P(x_i \mid y) \]

That’s the Naïve Bayes classifier.
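A minimal sketch of the decision rule (not a full implementation: it assumes categorical features and already-estimated probability tables, and works with log probabilities to avoid numerical underflow from multiplying many small numbers):

```python
import math

def predict(x, priors, likelihoods):
    """Naive Bayes decision rule: argmax_y  log P(y) + sum_i log P(x_i | y).

    priors:      dict {class: P(y)}
    likelihoods: dict {class: list of {feature_value: P(x_i = value | y)}}
    """
    best_class, best_score = None, -math.inf
    for y, p_y in priors.items():
        score = math.log(p_y)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```

Working in log space turns the product into a sum, which is the usual trick in practice.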


Interpreting the Formula#

  • \(P(y)\): Prior probability of the class

  • \(P(x_i \mid y)\): Likelihood of feature \(x_i\) under class \(y\)

  • \(\prod_i\): Combine all feature evidence

  • \(\arg\max_y\): Choose the class with the highest posterior probability


Worked Example (Play Tennis 🌤️🎾)#

Dataset (simplified):

  • Features: Outlook, Temperature

  • Target: Play = Yes/No

Say the test instance is: Outlook = Sunny, Temperature = Hot

We compute:

\[ P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto P(\text{Yes}) \cdot P(\text{Sunny} \mid \text{Yes}) \cdot P(\text{Hot} \mid \text{Yes}) \]
\[ P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto P(\text{No}) \cdot P(\text{Sunny} \mid \text{No}) \cdot P(\text{Hot} \mid \text{No}) \]
  • From the counts in the dataset:

    • \(P(\text{Yes}) = \tfrac{9}{14}\), \(P(\text{No}) = \tfrac{5}{14}\)

    • \(P(\text{Sunny} \mid \text{Yes}) = \tfrac{2}{9}\), \(P(\text{Hot} \mid \text{Yes}) = \tfrac{2}{9}\)

    • \(P(\text{Sunny} \mid \text{No}) = \tfrac{3}{5}\), \(P(\text{Hot} \mid \text{No}) = \tfrac{2}{5}\)

Plug in:

\[ P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \approx 0.032 \]
\[ P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} \approx 0.086 \]

Normalize:

  • \(P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \approx 0.27\)

  • \(P(\text{No} \mid \text{Sunny}, \text{Hot}) \approx 0.73\)

✅ Prediction: No (won’t play tennis)
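The same arithmetic, reproduced in a few lines (the fractions are taken directly from the counts above):

```python
from fractions import Fraction

# Priors and likelihoods from the counts above
p_yes, p_no = Fraction(9, 14), Fraction(5, 14)
p_sunny_yes, p_hot_yes = Fraction(2, 9), Fraction(2, 9)
p_sunny_no, p_hot_no = Fraction(3, 5), Fraction(2, 5)

# Unnormalized scores
score_yes = p_yes * p_sunny_yes * p_hot_yes   # 2/63 ≈ 0.032
score_no = p_no * p_sunny_no * p_hot_no       # 3/35 ≈ 0.086

# Normalize so the two posteriors sum to 1
total = score_yes + score_no
print(float(score_yes / total))  # ≈ 0.27
print(float(score_no / total))   # ≈ 0.73  -> predict "No"
```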


Key Takeaways#

  • Naïve Bayes = Bayes' theorem + a conditional independence assumption

  • Works well when features are weakly correlated

  • Very fast, good for text classification (spam filtering, sentiment analysis)

  • Outputs probabilities, not just class labels
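For text classification, the usual recipe is bag-of-words counts fed to a multinomial Naïve Bayes model. A minimal scikit-learn sketch (the four-document corpus is invented; CountVectorizer and MultinomialNB are the standard classes for this workflow):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (invented examples)
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "project update attached"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts -> multinomial Naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Naive Bayes returns class probabilities, not just a label
new = vectorizer.transform(["free prize tomorrow"])
print(model.predict(new))        # e.g. ['spam']
print(model.predict_proba(new))  # posterior over ['ham', 'spam']
```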