Workflows#

  1. Data Preparation

    • Collect labeled data \((x_i, y_i)\) with \(y_i \in \{-1, +1\}\) for binary classification.

    • Scale/normalize features since SVM relies on distance measures.

    • Handle imbalance (e.g., stratified sampling, class weights).
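
The preparation steps above can be sketched with scikit-learn; the dataset sizes, imbalance ratio, and use of `class_weight="balanced"` here are illustrative choices, not part of the workflow itself:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy imbalanced binary data (80/20 class split, chosen for illustration)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)

# Stratified split preserves the class ratio in both train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Scale features: SVM margins depend on distances, so unscaled features dominate
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# class_weight="balanced" reweights C inversely to class frequency
clf = SVC(kernel="rbf", class_weight="balanced").fit(X_tr, y_tr)
```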

  2. Model Selection

    • Choose linear SVM if data is (almost) linearly separable.

    • Use kernel SVM (RBF, polynomial) for non-linear patterns.

    • Decide between a (near) hard margin (large C, little tolerance for violations) and a soft margin (small C), depending on how noisy the data is.
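
The effect of C can be seen by counting support vectors on nearly separable data: a small C widens the margin and pulls many points into it, while a large C keeps only a few. A minimal sketch (the blob dataset and the two C values are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs: an almost linearly separable problem
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ counts support vectors per class; a small C (soft margin)
    # tolerates violations and keeps many points inside the margin
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```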

  3. Training (Optimization)

    • Solve convex optimization:

      \[ \min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \]

      subject to margin constraints.

    • Identify the support vectors: the training points that lie on the margin or violate it.
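
In practice the problem is solved through its dual, which is where the coefficients \(\alpha_i\) in the prediction formula come from. The standard soft-margin dual, consistent with the primal above, is

\[ \max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0 \]

The support vectors are exactly the points with \(\alpha_i > 0\); all other training points drop out of the prediction sum.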

  4. Prediction

    • Compute decision function:

      \[ f(x) = \text{sign}\Big(\sum_{i} \alpha_i y_i K(x_i, x) + b\Big) \]

    • Predict the class from the sign of the decision value:

\[\begin{split} \text{sign}(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \end{cases} \end{split}\]
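
In scikit-learn, a fitted `SVC` exposes the pieces of this formula: `dual_coef_` stores \(\alpha_i y_i\) for the support vectors, and `intercept_` is \(b\), so \(f(x)\) can be recomputed by hand. A minimal sketch (the dataset and `gamma=0.5` are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

def decision(x):
    # RBF kernel K(x_i, x) = exp(-gamma * ||x_i - x||^2) for each support vector
    k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    # dual_coef_[0] holds alpha_i * y_i for the support vectors
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x0 = X[0]
# Agrees with sklearn's own decision_function up to floating-point error
assert np.isclose(decision(x0), clf.decision_function([x0])[0])
```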
  5. Model Evaluation

    • Use cross-validation to estimate performance.

    • Metrics: accuracy, precision, recall, F1, ROC-AUC depending on task.
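
Cross-validated estimates for several metrics can be obtained in a few lines; wrapping the scaler and SVM in a pipeline keeps scaling inside each fold, which avoids leaking test-fold statistics into training (the dataset here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Pipeline = scaler + SVM, refit from scratch inside every CV fold
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for metric in ("accuracy", "f1", "roc_auc"):
    scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For `roc_auc`, scikit-learn falls back to `decision_function` when the SVC is trained without probability estimates.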

  6. Hyperparameter Tuning

    • Parameters:

      • C: margin vs. misclassification trade-off.

      • γ (gamma): how far the influence of a single training point reaches in the RBF kernel (large γ = short reach, wigglier boundary).

      • degree: for polynomial kernels.

    • Tune via GridSearchCV or RandomizedSearchCV.
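
The demonstration below uses GridSearchCV; the randomized alternative samples C and γ from continuous log-uniform distributions instead of a fixed grid. A sketch (distribution bounds and `n_iter` are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Sample C and gamma log-uniformly rather than enumerating grid points
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_dist, n_iter=20,
                            cv=5, scoring="accuracy", random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```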

  7. Deployment

    • Use trained model to classify unseen data.

    • Re-train periodically if data distribution shifts.
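
For deployment, one common pattern is to persist the fitted pipeline with joblib so the scaler travels with the model; the filename here is an illustrative choice:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Persist scaler + SVM together so inference applies the same preprocessing
model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
joblib.dump(model, "svm_model.joblib")

# Later (or in a serving process): reload and classify unseen rows
loaded = joblib.load("svm_model.joblib")
preds = loaded.predict(X[:5])
```

Pickled estimators are version-sensitive, so re-training (and re-saving) after library upgrades or data drift fits naturally into the same step.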

Demonstration#

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd

# 1. Data Preparation
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, n_redundant=0,
                           n_classes=2, weights=[0.6, 0.4], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

# Feature scaling (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Baseline Model (Linear kernel)
svm_linear = SVC(kernel="linear", C=1, random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_baseline = svm_linear.predict(X_test_scaled)
baseline_acc = accuracy_score(y_test, y_pred_baseline)

# 3. Hyperparameter Tuning with RBF kernel
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
grid.fit(X_train_scaled, y_train)

best_svm = grid.best_estimator_
y_pred_best = best_svm.predict(X_test_scaled)
best_acc = accuracy_score(y_test, y_pred_best)

# 4. Evaluation results
baseline_report = classification_report(y_test, y_pred_baseline)
best_report = classification_report(y_test, y_pred_best)

cm_baseline = confusion_matrix(y_test, y_pred_baseline)
cm_best = confusion_matrix(y_test, y_pred_best)

results_df = pd.DataFrame({
    "Model": ["Linear SVM", "Best RBF SVM"],
    "Accuracy": [baseline_acc, best_acc],
    "Best Params": [None, grid.best_params_]
})

print("Baseline Linear SVM Classification Report:\n", baseline_report)
print("Best RBF SVM Classification Report:\n", best_report)
print("Confusion Matrix (Baseline):\n", cm_baseline)
print("Confusion Matrix (Best RBF):\n", cm_best)

results_df.head()
Baseline Linear SVM Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.94      0.90        89
           1       0.91      0.79      0.84        61

    accuracy                           0.88       150
   macro avg       0.89      0.87      0.87       150
weighted avg       0.88      0.88      0.88       150

Best RBF SVM Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.99      0.95        89
           1       0.98      0.87      0.92        61

    accuracy                           0.94       150
   macro avg       0.95      0.93      0.94       150
weighted avg       0.94      0.94      0.94       150

Confusion Matrix (Baseline):
 [[84  5]
 [13 48]]
Confusion Matrix (Best RBF):
 [[88  1]
 [ 8 53]]
          Model  Accuracy              Best Params
0    Linear SVM      0.88                     None
1  Best RBF SVM      0.94  {'C': 10, 'gamma': 0.1}