Workflows#
Problem Definition#
Define the task (e.g., spam detection, sentiment analysis, medical diagnosis).
Decide what the input features are (words, pixel values, categorical attributes).
Decide what the target labels are (spam/ham, positive/negative, disease/healthy).
Data Preparation#
Collect labeled data.
Preprocess features:
For text data → tokenization, stopword removal, vectorization (Bag of Words, TF-IDF).
For categorical data → estimate per-class category frequencies from counts.
For continuous data → assume Gaussian distribution (Gaussian Naïve Bayes).
Split dataset into train/test (or validation) sets.
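The text-preprocessing step above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stopword list is a hypothetical stand-in for a real one, and the output is raw word counts (Bag of Words) rather than TF-IDF.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "is", "to"}

def tokenize(text):
    """Lowercase, split on non-letter characters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(text):
    """Map a document to its word -> count dictionary (Bag of Words)."""
    return Counter(tokenize(text))

counts = bag_of_words("The offer is a FREE offer")
```

In practice a library vectorizer would also build a shared vocabulary across the corpus and optionally apply TF-IDF weighting, but the core idea is the same: turn each document into feature counts.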
Training (Fit)#
Naïve Bayes learns probabilities from data:
Compute prior probabilities for each class:
\[ P(y=c) = \frac{\text{count of class } c}{\text{total samples}} \]
Compute likelihoods for each feature given class:
For categorical:
\[ P(x_i \mid y=c) = \frac{\text{count}(x_i, y=c)}{\text{count}(y=c)} \]
For text: word frequencies (with Laplace smoothing).
For continuous: fit a class-conditional Gaussian with mean \(\mu_c\) and variance \(\sigma_c^2\) estimated per class:
\[ P(x_i \mid y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}} \]
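The training step for the text/categorical case can be sketched as follows: count class frequencies for the priors, then count word occurrences per class and apply Laplace smoothing with strength `alpha`. The function name and input format (pre-tokenized documents) are assumptions for this sketch.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(docs, labels, alpha=1.0):
    """Estimate priors P(y=c) and Laplace-smoothed likelihoods P(word | y=c).

    docs: list of token lists; labels: list of class labels, same length.
    """
    class_counts = Counter(labels)
    total = len(labels)
    priors = {c: n / total for c, n in class_counts.items()}

    word_counts = defaultdict(Counter)          # class -> word -> count
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    vocab = {w for counter in word_counts.values() for w in counter}
    V = len(vocab)
    likelihoods = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # Laplace smoothing: every word gets a pseudo-count of alpha,
        # so unseen (word, class) pairs never get probability zero.
        likelihoods[c] = {
            w: (word_counts[c][w] + alpha) / (total_words + alpha * V)
            for w in vocab
        }
    return priors, likelihoods, vocab

priors, likelihoods, vocab = fit_naive_bayes(
    [["free", "money"], ["hi", "friend"]], ["spam", "ham"]
)
```

Note that the smoothed likelihoods for each class sum to 1 over the vocabulary, which is what makes them valid conditional distributions.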
Prediction (Inference)#
Given a new instance \(X = (x_1, x_2, \dots, x_n)\):
Apply Bayes theorem:
\[ P(y \mid X) \propto P(y) \cdot \prod_{i=1}^n P(x_i \mid y) \]
Select the class with maximum posterior probability:
\[ \hat{y} = \arg\max_y P(y) \prod_i P(x_i \mid y) \]
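The argmax above is usually computed in log space, since multiplying many small probabilities underflows floating point. A minimal sketch, using a hand-built toy model (the numbers below are hypothetical, standing in for a trained model):

```python
import math

# Toy model for illustration; in practice these come from the training step.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 0.6, "hello": 0.1},
    "ham":  {"free": 0.1, "hello": 0.6},
}

def predict(doc, priors, likelihoods):
    """Return argmax_y [ log P(y) + sum_i log P(x_i | y) ].

    Summing logs is equivalent to multiplying probabilities but
    numerically stable. Out-of-vocabulary words are skipped here.
    """
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc:
            if w in likelihoods[c]:
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

label = predict(["free", "free"], priors, likelihoods)
```

Skipping unseen words is one common convention; with Laplace smoothing a trained model would instead assign them a small nonzero probability.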
Evaluation#
Use metrics depending on task:
Classification: Accuracy, Precision, Recall, F1-score.
Probabilistic predictions: Log-loss, ROC-AUC.
Cross-validation for robustness.
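The classification metrics listed above reduce to counting true/false positives and negatives. A small sketch of accuracy, precision, recall, and F1 for a chosen positive class (function names are my own):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 with respect to one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For spam filtering, precision (how many flagged emails are really spam) and recall (how much spam is caught) are usually more informative than raw accuracy, especially with imbalanced classes.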
Deployment#
Save the trained model (priors + likelihoods).
For new unseen data, run through preprocessing → prediction pipeline.
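Because a trained Naïve Bayes model is just its probability tables, persisting it can be as simple as serializing the priors and likelihoods. A minimal sketch using JSON (the file name and toy numbers are illustrative):

```python
import json
import os
import tempfile

def save_model(priors, likelihoods, path):
    """Persist the learned tables; the model is nothing but these numbers."""
    with open(path, "w") as f:
        json.dump({"priors": priors, "likelihoods": likelihoods}, f)

def load_model(path):
    """Restore priors and likelihoods for the prediction pipeline."""
    with open(path) as f:
        m = json.load(f)
    return m["priors"], m["likelihoods"]

# Round-trip check with a toy model.
path = os.path.join(tempfile.mkdtemp(), "nb_model.json")
save_model({"spam": 0.5, "ham": 0.5}, {"spam": {"free": 0.2}}, path)
priors, likelihoods = load_model(path)
```

The same preprocessing used at training time (tokenization, vectorization) must be saved or reproduced exactly at inference time, or the features will not line up with the stored likelihoods.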
Example: Spam Classification Workflow#
Data: Emails labeled as spam/ham.
Preprocessing: Tokenize words → convert to TF-IDF features.
Training:
\(P(\text{spam})\), \(P(\text{ham})\) (priors).
\(P(\text{word} \mid \text{spam})\), \(P(\text{word} \mid \text{ham})\).
Prediction: For a new email, combine the prior with the word likelihoods (in practice by summing their logarithms) and choose the class with the higher posterior.
Evaluation: Check accuracy, precision, recall on test emails.
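The whole spam workflow above fits in a short end-to-end sketch. The corpus is a hypothetical toy dataset, and raw word counts stand in for TF-IDF to keep the example self-contained:

```python
import math
from collections import Counter, defaultdict

# Toy labeled emails (hypothetical data).
emails = [
    ("free money now", "spam"), ("win free prize", "spam"),
    ("meeting at noon", "ham"), ("lunch with a friend", "ham"),
]

def train(emails, alpha=1.0):
    """Estimate priors and per-class word counts with smoothing strength alpha."""
    labels = [y for _, y in emails]
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    counts = defaultdict(Counter)
    for text, c in emails:
        counts[c].update(text.lower().split())
    vocab = {w for ctr in counts.values() for w in ctr}
    return priors, counts, vocab, alpha

def classify(text, priors, counts, vocab, alpha):
    """Sum log prior and smoothed log likelihoods; return the argmax class."""
    scores = {}
    for c in priors:
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (counts[c][w] + alpha) / (total + alpha * len(vocab))
                )
        scores[c] = score
    return max(scores, key=scores.get)

model = train(emails)
verdict = classify("free prize inside", *model)
```

Even with four training emails the smoothed model separates the two classes, which reflects how little machinery Naïve Bayes actually needs.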
Summary Workflow 👉 Define Problem → Preprocess Data → Train (estimate probabilities) → Predict (posterior) → Evaluate → Deploy