Background#

Why do we need probability in ML?#

Machine learning is about making predictions under uncertainty. For example:

  • Email classification: “Is this spam or not?”

  • Medical diagnosis: “Does the patient have disease X?”

  • Sentiment analysis: “Is the review positive, neutral, or negative?”

We want to estimate the probability of each class given the observed features.

Formally:

\[ P(y \mid X) \quad \text{where } X = (x_1, x_2, \dots, x_n) \]

Independent vs Dependent events#

  • Independent events: rolling a die. The outcome of the first roll does not affect the second.

    \[ P(A \cap B) = P(A) \cdot P(B) \]
  • Dependent events: drawing marbles without replacement. The probability of the second draw changes depending on the first.

    \[ P(A \cap B) = P(A) \cdot P(B \mid A) \]
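A quick numeric check of both product rules, as a minimal sketch (the fair die and the 3-red/2-blue marble bag are invented example numbers):

```python
from fractions import Fraction

# Independent events: two rolls of a fair six-sided die.
# P(six on roll 1 AND six on roll 2) = P(six) * P(six)
p_six = Fraction(1, 6)
p_both_sixes = p_six * p_six                          # 1/36

# Dependent events: two draws without replacement from a bag
# with 3 red and 2 blue marbles.
# P(red first AND red second) = P(red) * P(red | first was red)
p_red_first = Fraction(3, 5)
p_red_second_given_first = Fraction(2, 4)             # one red already removed
p_two_reds = p_red_first * p_red_second_given_first   # 3/10

print(p_both_sixes, p_two_reds)  # 1/36 3/10
```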

👉 This dependency idea is the foundation for conditional probability.


Conditional Probability#

Definition:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Intuition: “What is the probability of \(A\) happening if I already know that \(B\) happened?”
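As a minimal sketch (the email counts below are invented for illustration), the conditional probability is just the joint count restricted to the event we condition on:

```python
from fractions import Fraction

# Hypothetical counts over 100 emails:
#   40 contain the word "free" (event B)
#   25 are spam AND contain "free" (event A ∩ B)
n_total, n_B, n_A_and_B = 100, 40, 25

p_B = Fraction(n_B, n_total)              # P(B)
p_A_and_B = Fraction(n_A_and_B, n_total)  # P(A ∩ B)

# P(A | B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 5/8
```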


Bayes' Theorem#

Conditional probability can be written in both directions: \(P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)\). Dividing both sides by \(P(B)\) gives Bayes' theorem:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Where:

  • \(P(A)\) = prior (belief about \(A\) before seeing data)

  • \(P(B \mid A)\) = likelihood (how compatible evidence \(B\) is with \(A\))

  • \(P(B)\) = marginal probability (normalizing constant)

  • \(P(A \mid B)\) = posterior (updated belief after seeing evidence)

This is the Bayesian update rule.
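A standard illustration of the update is a diagnostic test; the numbers below (1% prevalence, 95% sensitivity, 10% false-positive rate) are assumed purely for this sketch:

```python
# Bayesian update: P(disease | positive test)
p_disease = 0.01                 # prior P(A)
p_pos_given_disease = 0.95       # likelihood P(B | A)
p_pos_given_healthy = 0.10       # false-positive rate P(B | not A)

# Marginal P(B) by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.088
```

Even with a fairly accurate test, the posterior stays small because the prior is small: the evidence updates the belief rather than replacing it.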


Applying Bayes to ML#

We want:

\[ P(y \mid x_1, x_2, \dots, x_n) \]

By Bayes' theorem:

\[ P(y \mid X) = \frac{P(y) \cdot P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \]

The denominator is the same for all classes, so we only care about the numerator.
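To see why, note that dividing every class score by the same \(P(X)\) rescales the scores but cannot change which class scores highest. A tiny sketch with made-up numbers:

```python
# Unnormalized class scores P(y) * P(X | y) (made-up numbers)
scores = {"spam": 0.012, "ham": 0.003}

# Dividing by the shared evidence P(X) changes the values...
p_X = sum(scores.values())
posteriors = {y: s / p_X for y, s in scores.items()}

# ...but not the winning class.
print(max(scores, key=scores.get))          # spam
print(max(posteriors, key=posteriors.get))  # spam
print(posteriors)                           # {'spam': 0.8, 'ham': 0.2}
```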


The Naïve Assumption#

Problem: estimating the joint likelihood \(P(x_1, x_2, \dots, x_n \mid y)\) directly is hard because the features may depend on each other, and the number of parameters needed grows exponentially with the number of features.

Naïve Bayes assumes conditional independence:

\[ P(x_1, x_2, \dots, x_n \mid y) \approx \prod_{i=1}^n P(x_i \mid y) \]

This gives:

\[ \hat{y} = \arg\max_y \; P(y) \prod_{i=1}^n P(x_i \mid y) \]

That’s the Naïve Bayes classifier.
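A minimal sketch of the decision rule (not a full implementation: it assumes categorical features and already-estimated probability tables, and works with log probabilities to avoid numerical underflow from multiplying many small numbers):

```python
import math

def predict(x, priors, likelihoods):
    """Naive Bayes decision rule: argmax_y  log P(y) + sum_i log P(x_i | y).

    priors:      dict {class: P(y)}
    likelihoods: dict {class: list of {feature_value: P(x_i = value | y)}}
    """
    best_class, best_score = None, -math.inf
    for y, p_y in priors.items():
        score = math.log(p_y)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```

Working in log space turns the product into a sum, which is the usual trick in practice.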


Interpreting the Formula#

  • \(P(y)\): Prior probability of the class

  • \(P(x_i \mid y)\): Likelihood of feature \(x_i\) under class \(y\)

  • \(\prod_i\): Combine all feature evidence

  • \(\arg\max_y\): Choose the class with the highest posterior probability


Worked Example (Play Tennis 🌤️🎾)#

Dataset (simplified):

  • Features: Outlook, Temperature

  • Target: Play = Yes/No

Say the test instance is: Outlook = Sunny, Temperature = Hot

We compute:

\[ P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto P(\text{Yes}) \cdot P(\text{Sunny} \mid \text{Yes}) \cdot P(\text{Hot} \mid \text{Yes}) \]
\[ P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto P(\text{No}) \cdot P(\text{Sunny} \mid \text{No}) \cdot P(\text{Hot} \mid \text{No}) \]
  • From the counts in the dataset:

    • \(P(\text{Yes}) = \tfrac{9}{14}\), \(P(\text{No}) = \tfrac{5}{14}\)

    • \(P(\text{Sunny} \mid \text{Yes}) = \tfrac{2}{9}\), \(P(\text{Hot} \mid \text{Yes}) = \tfrac{2}{9}\)

    • \(P(\text{Sunny} \mid \text{No}) = \tfrac{3}{5}\), \(P(\text{Hot} \mid \text{No}) = \tfrac{2}{5}\)

Plug in:

\[ P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} \approx 0.032 \]
\[ P(\text{No} \mid \text{Sunny}, \text{Hot}) \propto \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} \approx 0.086 \]

Normalize:

  • \(P(\text{Yes} \mid \text{Sunny}, \text{Hot}) \approx 0.27\)

  • \(P(\text{No} \mid \text{Sunny}, \text{Hot}) \approx 0.73\)

✅ Prediction: No (won’t play tennis)
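The same arithmetic, reproduced in a few lines (the fractions are taken directly from the counts above):

```python
from fractions import Fraction

# Priors and likelihoods from the counts above
p_yes, p_no = Fraction(9, 14), Fraction(5, 14)
p_sunny_yes, p_hot_yes = Fraction(2, 9), Fraction(2, 9)
p_sunny_no, p_hot_no = Fraction(3, 5), Fraction(2, 5)

# Unnormalized scores
score_yes = p_yes * p_sunny_yes * p_hot_yes   # 2/63 ≈ 0.032
score_no = p_no * p_sunny_no * p_hot_no       # 3/35 ≈ 0.086

# Normalize so the two posteriors sum to 1
total = score_yes + score_no
print(float(score_yes / total))  # ≈ 0.27
print(float(score_no / total))   # ≈ 0.73  -> predict "No"
```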


Key Takeaways#

  • Naïve Bayes = Bayes' theorem + a conditional independence assumption

  • Works well when features are weakly correlated

  • Very fast, good for text classification (spam filtering, sentiment analysis)

  • Outputs probabilities, not just class labels
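For text classification, the usual recipe is bag-of-words counts fed to a multinomial Naïve Bayes model. A minimal scikit-learn sketch (the four-document corpus is invented; CountVectorizer and MultinomialNB are the standard classes for this workflow):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (invented examples)
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "project update attached"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts -> multinomial Naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Naive Bayes returns class probabilities, not just a label
new = vectorizer.transform(["free prize tomorrow"])
print(model.predict(new))        # e.g. ['spam']
print(model.predict_proba(new))  # posterior over ['ham', 'spam']
```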