Workflows#

Problem Definition#

  • Define the task (e.g., spam detection, sentiment analysis, medical diagnosis).

  • Decide what the input features are (words, pixel values, categorical attributes).

  • Decide what the target labels are (spam/ham, positive/negative, disease/healthy).


Data Preparation#

  • Collect labeled data.

  • Preprocess features:

    • For text data → tokenization, stopword removal, vectorization (Bag of Words, TF-IDF).

    • For categorical data → encode each category as a discrete value so per-class category frequencies can be counted.

    • For continuous data → assume each feature is Gaussian-distributed within each class (Gaussian Naïve Bayes).

  • Split dataset into train/test (or validation) sets.
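
The text-preprocessing bullet above can be sketched in plain Python. This is a minimal Bag of Words vectorizer, not a production pipeline; the stopword list and helper names (`tokenize`, `bag_of_words`) are illustrative assumptions:

```python
from collections import Counter

def tokenize(text, stopwords=frozenset({"the", "a", "is"})):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]

def bag_of_words(docs):
    """Build a shared vocabulary and per-document word-count vectors."""
    vocab = sorted({w for doc in docs for w in tokenize(doc)})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for w in tokenize(doc):
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, X = bag_of_words(["the cat sat", "a cat ran"])
```

In practice a library vectorizer (e.g. TF-IDF) would replace this, but the idea is the same: each document becomes a fixed-length vector over a shared vocabulary.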


Training (Fit)#

Naïve Bayes learns probabilities from data:

  1. Compute prior probabilities for each class:

    \[ P(y=c) = \frac{\text{count of class } c}{\text{total samples}} \]
  2. Compute likelihoods for each feature given class:

    • For categorical:

      \[ P(x_i \mid y=c) = \frac{\text{count}(x_i, y=c)}{\text{count}(y=c)} \]
    • For text: word frequencies (with Laplace smoothing).

    • For continuous: use Gaussian distribution:

      \[ P(x_i \mid y=c) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \]
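
The two estimation steps above can be sketched directly from the formulas. This is a minimal fit for categorical features with Laplace smoothing, plus the Gaussian density for continuous features; function names and the smoothing parameter `alpha` are illustrative assumptions:

```python
import math
from collections import Counter

def fit_categorical_nb(X, y, alpha=1.0):
    """Estimate priors P(y=c) and Laplace-smoothed likelihoods P(x_i = v | y = c)."""
    class_counts = Counter(y)
    priors = {c: class_counts[c] / len(y) for c in class_counts}
    n_features = len(X[0])
    # All observed values per feature (needed for the smoothing denominator).
    values = [sorted({row[i] for row in X}) for i in range(n_features)]
    likelihoods = {}
    for c in class_counts:
        rows = [row for row, label in zip(X, y) if label == c]
        likelihoods[c] = []
        for i in range(n_features):
            counts = Counter(row[i] for row in rows)
            k = len(values[i])  # number of possible values for feature i
            likelihoods[c].append(
                {v: (counts[v] + alpha) / (len(rows) + alpha * k) for v in values[i]}
            )
    return priors, likelihoods

def gaussian_pdf(x, mu, var):
    """Gaussian likelihood P(x_i | y=c) for continuous features."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```

With `alpha = 1` each likelihood is `(count + 1) / (class count + number of values)`, which keeps unseen feature values from getting zero probability.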

Prediction (Inference)#

Given a new instance \(X = (x_1, x_2, \dots, x_n)\):

  1. Apply Bayes' theorem with the naïve conditional-independence assumption:

    \[ P(y \mid X) \propto P(y) \cdot \prod_{i=1}^n P(x_i \mid y) \]
  2. Select the class with maximum posterior probability:

    \[ \hat{y} = \arg\max_y P(y) \prod_i P(x_i \mid y) \]
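
The argmax above can be sketched as follows. One practical detail not visible in the formula: multiplying many small probabilities underflows, so implementations sum logarithms instead. The `priors`/`likelihoods` structure assumed here (per-class, per-feature probability tables) is illustrative:

```python
import math

def predict(x, priors, likelihoods):
    """Return argmax_c [log P(c) + sum_i log P(x_i | c)].

    Working in log space avoids numerical underflow when many
    per-feature likelihoods are multiplied together.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for i, v in enumerate(x):
            score += math.log(likelihoods[c][i][v])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Since `log` is monotonic, the class maximizing the log-posterior is the same one maximizing the product in the formula.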

Evaluation#

  • Use metrics depending on task:

    • Classification: Accuracy, Precision, Recall, F1-score.

    • Probabilistic predictions: Log-loss, ROC-AUC.

  • Cross-validation for robustness.
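
The classification metrics listed above can be computed from the confusion-matrix counts. A minimal from-scratch sketch (a library such as scikit-learn would normally provide these):

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 for a chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```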


Deployment#

  • Save the trained model (priors + likelihoods).

  • For new unseen data, run through preprocessing → prediction pipeline.
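
Since the trained model is just the estimated probability tables, saving it can be as simple as serializing them. A minimal sketch with `pickle`; all numbers and the file name are illustrative:

```python
import os
import pickle
import tempfile

# Illustrative probability tables; in practice these come from the training step.
model = {
    "priors": {"spam": 0.4, "ham": 0.6},
    "likelihoods": {
        "spam": [{"win": 0.9, "hi": 0.1}],
        "ham": [{"win": 0.2, "hi": 0.8}],
    },
}

path = os.path.join(tempfile.gettempdir(), "nb_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# At serving time, load the tables back and reuse the prediction code.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The same preprocessing used at training time must be applied to new data before prediction, so in practice the vectorizer (vocabulary, stopwords) is saved alongside the probability tables.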


Example: Spam Classification Workflow#

  1. Data: Emails labeled as spam/ham.

  2. Preprocessing: Tokenize words → convert to TF-IDF features.

  3. Training:

    • \(P(\text{spam})\), \(P(\text{ham})\) (priors).

    • \(P(\text{word} \mid \text{spam})\), \(P(\text{word} \mid \text{ham})\).

  4. Prediction: For a new email, multiply the prior by the word likelihoods (in practice, sum their logarithms) and choose the class with the higher posterior.

  5. Evaluation: Check accuracy, precision, recall on test emails.
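
The spam workflow above, end to end, can be sketched as a small multinomial Naïve Bayes text classifier (word counts with Laplace smoothing rather than TF-IDF, for simplicity; the training sentences are made up):

```python
import math
from collections import Counter

def train_text_nb(docs, labels, alpha=1.0):
    """Fit multinomial NB on word counts and return a predict function."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d.lower().split())
    totals = {c: sum(word_counts[c].values()) for c in classes}

    def log_likelihood(w, c):
        # Laplace-smoothed P(word | class).
        return math.log((word_counts[c][w] + alpha) / (totals[c] + alpha * len(vocab)))

    def predict(doc):
        scores = {
            c: math.log(priors[c])
            + sum(log_likelihood(w, c) for w in doc.lower().split() if w in vocab)
            for c in classes
        }
        return max(scores, key=scores.get)

    return predict

predict = train_text_nb(
    ["win money now", "free prize win", "meeting at noon", "lunch at noon"],
    ["spam", "spam", "ham", "ham"],
)
```

Words outside the training vocabulary are simply skipped here; a library implementation (e.g. scikit-learn's `MultinomialNB` with a TF-IDF vectorizer) handles this and much more, but the estimated quantities are the same priors and per-word likelihoods listed in step 3.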


Summary Workflow 👉 Define Problem → Preprocess Data → Train (estimate probabilities) → Predict (posterior) → Evaluate → Deploy