Background#
Why do we need probability in ML?#
Machine learning is about making predictions under uncertainty. For example:
Email classification: “Is this spam or not?”
Medical diagnosis: “Does the patient have disease X?”
Sentiment analysis: “Is the review positive, neutral, or negative?”
We want to estimate the probability of each class given the observed features.
Formally, we want to estimate
\[ P(y \mid x_1, x_2, \dots, x_n) \]
the probability of class \(y\) given the observed features \(x_1, \dots, x_n\).
Independent vs Dependent events#
Independent events: rolling a die — the outcome of roll 1 does not affect roll 2.
\[ P(A \cap B) = P(A) \cdot P(B) \]
Dependent events: drawing marbles without replacement — the probability of the second draw depends on the outcome of the first.
\[ P(A \cap B) = P(A) \cdot P(B \mid A) \]
👉 This dependency idea is the foundation for conditional probability.
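The dependent case can be checked with a few lines of Python. The marble counts here are made up for illustration (two red, three blue, drawing twice without replacement):

```python
from fractions import Fraction

# Hypothetical bag: 2 red and 3 blue marbles, drawn twice without replacement.
# The first draw changes what is left in the bag for the second draw.
p_red_first = Fraction(2, 5)             # P(A): red on draw 1
p_red_second_given_red = Fraction(1, 4)  # P(B | A): one red left among four marbles
p_both_red = p_red_first * p_red_second_given_red  # P(A ∩ B) = P(A) · P(B | A)
print(p_both_red)  # 1/10
```

Using `Fraction` keeps the arithmetic exact, so the chain rule \(P(A \cap B) = P(A) \cdot P(B \mid A)\) is visible without rounding noise.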
Conditional Probability#
Definition:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0 \]
Intuition: “What is the probability of \(A\) happening if I already know that \(B\) happened?”
Bayes' Theorem#
Using conditional probability both ways, \(P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)\), and solving for \(P(A \mid B)\):
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Where:
\(P(A)\) = prior (belief about \(A\) before seeing data)
\(P(B \mid A)\) = likelihood (how compatible evidence \(B\) is with \(A\))
\(P(B)\) = marginal probability (normalizing constant)
\(P(A \mid B)\) = posterior (updated belief after seeing evidence)
This is the Bayesian update rule.
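A small numeric sketch of the update rule, using hypothetical spam-filter numbers (the prior and likelihoods below are invented for illustration, not taken from any dataset):

```python
# Hypothetical numbers: update the prior P(spam) after seeing the word "free".
p_spam = 0.2             # prior P(A)
p_free_given_spam = 0.6  # likelihood P(B | A)
p_free_given_ham = 0.1   # P(B | not A)

# Marginal P(B) via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(A | B) = P(B | A) · P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # 0.6
```

Seeing the word "free" triples the belief that the email is spam, from 0.2 to 0.6 — exactly the prior-to-posterior update the formula describes.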
Applying Bayes to ML#
We want:
\[ P(y \mid x_1, x_2, \dots, x_n) \]
By Bayes' theorem:
\[ P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y) \cdot P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \]
The denominator is the same for all classes, so we only care about the numerator.
The Naïve Assumption#
Problem: computing \(P(x_1, x_2, \dots, x_n \mid y)\) is complex because features may be dependent.
Naïve Bayes assumes the features are conditionally independent given the class:
\[ P(x_1, x_2, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \]
This gives the decision rule:
\[ \hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
That’s the Naïve Bayes classifier.
Interpreting the Formula#
\(P(y)\): Prior probability of the class
\(P(x_i \mid y)\): Likelihood of feature \(x_i\) under class \(y\)
\(\prod_i\): Combine all feature evidence
\(\arg\max_y\): Choose the class with the highest posterior probability
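The four ingredients above can be sketched as a minimal categorical Naïve Bayes in plain Python. This is an illustrative implementation, not a library API; it adds Laplace smoothing (not discussed above) so that unseen feature values don't zero out a product, and works in log space to avoid underflow:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (feature_tuple, label) pairs.
    Returns class counts, per-(feature, class) value counts, and per-feature vocabularies."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (feature_index, label) -> Counter of values
    vocab = defaultdict(set)             # feature_index -> set of seen values
    for features, label in samples:
        for i, v in enumerate(features):
            value_counts[(i, label)][v] += 1
            vocab[i].add(v)
    return class_counts, value_counts, vocab

def predict_nb(features, class_counts, value_counts, vocab):
    """argmax_y of log P(y) + sum_i log P(x_i | y), with Laplace smoothing."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_y in class_counts.items():
        score = math.log(n_y / total)  # log prior P(y)
        for i, v in enumerate(features):
            count = value_counts[(i, label)][v]
            # Smoothed likelihood P(x_i | y): add 1 to every value count.
            score += math.log((count + 1) / (n_y + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with made-up rows (not the full dataset from the article):
samples = [(("Sunny", "Hot"), "No"), (("Sunny", "Hot"), "No"),
           (("Overcast", "Hot"), "Yes"), (("Rain", "Mild"), "Yes"),
           (("Rain", "Cool"), "Yes")]
model = train_nb(samples)
print(predict_nb(("Sunny", "Hot"), *model))  # No
```

Summing logs instead of multiplying raw probabilities is the standard trick: with many features the product \(\prod_i P(x_i \mid y)\) underflows to zero in floating point.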
Worked Example (Play Tennis 🌤️🎾)#
Dataset (simplified): the classic 14-example Play Tennis data, reduced to two features.
Features: Outlook, Temperature
Target: Play = Yes/No
Say the test instance is:
Outlook = Sunny, Temperature = Hot
We compute the unnormalized posterior for each class:
\[ P(\text{Yes} \mid Sunny, Hot) \propto P(\text{Yes}) \cdot P(Sunny \mid \text{Yes}) \cdot P(Hot \mid \text{Yes}) \]
\[ P(\text{No} \mid Sunny, Hot) \propto P(\text{No}) \cdot P(Sunny \mid \text{No}) \cdot P(Hot \mid \text{No}) \]
From counts in dataset:
\(P(\text{Yes}) = \tfrac{9}{14}\), \(P(\text{No}) = \tfrac{5}{14}\)
\(P(Sunny \mid Yes) = \tfrac{2}{9}\), \(P(Hot \mid Yes) = \tfrac{2}{9}\)
\(P(Sunny \mid No) = \tfrac{3}{5}\), \(P(Hot \mid No) = \tfrac{2}{5}\)
Plug in:
\[ P(\text{Yes}) \cdot P(Sunny \mid \text{Yes}) \cdot P(Hot \mid \text{Yes}) = \tfrac{9}{14} \cdot \tfrac{2}{9} \cdot \tfrac{2}{9} = \tfrac{2}{63} \approx 0.032 \]
\[ P(\text{No}) \cdot P(Sunny \mid \text{No}) \cdot P(Hot \mid \text{No}) = \tfrac{5}{14} \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} = \tfrac{3}{35} \approx 0.086 \]
Normalize by dividing each score by their sum (\(\approx 0.117\)):
\(P(\text{Yes} \mid Sunny, Hot) \approx 0.27\)
\(P(\text{No} \mid Sunny, Hot) \approx 0.73\)
✅ Prediction: No (won’t play tennis)
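The arithmetic of the worked example can be verified directly, using exact fractions for the counts quoted above:

```python
from fractions import Fraction

# Counts from the Play Tennis example above.
p_yes, p_no = Fraction(9, 14), Fraction(5, 14)
p_sunny_yes, p_hot_yes = Fraction(2, 9), Fraction(2, 9)
p_sunny_no, p_hot_no = Fraction(3, 5), Fraction(2, 5)

score_yes = p_yes * p_sunny_yes * p_hot_yes  # unnormalized posterior for Yes
score_no = p_no * p_sunny_no * p_hot_no      # unnormalized posterior for No

total = score_yes + score_no
print(round(float(score_yes / total), 2), round(float(score_no / total), 2))  # 0.27 0.73
```

The exact posteriors come out to \(\tfrac{10}{37}\) and \(\tfrac{27}{37}\), which round to the 0.27 and 0.73 quoted above, confirming the prediction of No.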
Key Takeaways#
Naïve Bayes = Bayes theorem + independence assumption
Works well when features are weakly correlated
Very fast, good for text classification (spam filtering, sentiment analysis)
Outputs probabilities, not just class labels