Hyperparameter Tuning Intuition#

Let’s go deeper into the mathematical intuition behind the key hyperparameters of SVC.

We’ll focus on the three most important ones: C, γ (gamma), and kernel.


Objective Function of SVC#

The primal optimization problem of SVM is:

\[ \min_{w, b, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i \]

subject to:

\[ y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]

Where:

  • \(w\): weight vector

  • \(b\): bias term

  • \(\xi_i\): slack variables (allow misclassifications)

  • \(C\): regularization parameter (controls penalty for misclassifications)

  • \(\phi(x)\): feature mapping (depends on kernel)
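
To make this concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original derivation) that evaluates the primal objective for the linear case \(\phi(x) = x\), recovering the slack as \(\xi_i = \max(0, 1 - y_i(w^T x_i + b))\):

import numpy as np

def primal_objective(w, b, X, y, C):
    # Slack variables: xi_i = max(0, 1 - y_i * (w^T x_i + b))
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    # (1/2) * ||w||^2  +  C * sum(xi)
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Tiny example with labels in {-1, +1}
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1, -1])
print(primal_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))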


Role of C (Regularization)#

From the objective function:

  • The term \(\frac{1}{2} \|w\|^2\) → tries to maximize margin.

  • The term \(C \sum \xi_i\) → penalizes misclassifications.

👉 Intuition:

  • Small C → margin maximization dominates (tolerates some errors).

    • Simpler decision boundary.

    • Prevents overfitting.

  • Large C → error penalty dominates (forces correct classification of training data).

    • Narrow margin.

    • Risk of overfitting.
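
A quick way to see this trade-off in code (an illustrative sketch on synthetic data): train a linear SVC at a small and a large C and compare the support-vector counts — smaller C tolerates more margin violations, so more points end up as support vectors.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 2-D binary classification problem
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")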


Role of γ (Gamma) in RBF Kernel#

The RBF kernel is:

\[ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) \]

👉 Interpretation:

  • If \(\|x_i - x_j\|\) is small → similarity close to 1.

  • If \(\|x_i - x_j\|\) is large → similarity close to 0.

  • \(\gamma\) controls the decay rate of similarity.

  • Small γ:

    • Kernel decays slowly, so points far apart are still considered similar.

    • Leads to a smooth, less complex decision boundary.

  • Large γ:

    • Kernel decays sharply, so only very close neighbors are considered similar.

    • Leads to highly complex decision boundary (can overfit).
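
The decay rate is easy to verify numerically. A small sketch (the points and γ values are chosen purely for illustration) evaluating \(K\) for a fixed pair of points:

import numpy as np

def rbf_kernel(x_i, x_j, gamma):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x_i, x_j = np.array([0.0, 0.0]), np.array([1.0, 1.0])  # squared distance = 2
for gamma in (0.01, 0.1, 1.0, 10.0):
    # K slides from ~1 (everything similar) toward 0 (only identical points similar)
    print(f"gamma={gamma}: K = {rbf_kernel(x_i, x_j, gamma):.4f}")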


Role of Kernel Choice#

The kernel defines \(\phi(x)\), the transformation of data:

  • Linear kernel:

    \[ K(x_i, x_j) = x_i^T x_j \]

    → Works well if data is linearly separable.

  • Polynomial kernel:

    \[ K(x_i, x_j) = (x_i^T x_j + c)^d \]

    → Captures polynomial relationships; degree \(d\) is a hyperparameter.

  • RBF kernel:

    \[ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) \]

    → Very flexible; maps data to an infinite-dimensional feature space.
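
As a rough comparison (a sketch using the same Iris dataset as the tuning example below; results depend on the data), each kernel can be cross-validated with otherwise default settings:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for kernel in ("linear", "poly", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel}: mean CV accuracy = {scores.mean():.3f}")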


How C and γ Interact#

  • High C + High γ:

    • Very complex model that tries to classify every training point correctly.

    • Risk of overfitting.

  • Low C + Low γ:

    • Very smooth decision boundary, high bias.

    • Risk of underfitting.

  • Balanced values:

    • Trade-off between margin size, misclassification, and flexibility.
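
These regimes can be demonstrated on a noisy synthetic dataset (an illustrative sketch; exact scores will vary): extreme low values tend to underfit both splits, while extreme high values tend to memorize the training set.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C, gamma in [(0.01, 0.01), (1, 1), (1000, 100)]:
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    print(f"C={C}, gamma={gamma}: "
          f"train={clf.score(X_tr, y_tr):.2f}, test={clf.score(X_te, y_te):.2f}")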


Decision Function#

The final decision function of SVC is:

\[ f(x) = \text{sign}\Big(\sum_{i=1}^N \alpha_i y_i K(x_i, x) + b\Big) \]

  • \(\alpha_i\): dual coefficients (nonzero only for support vectors).

  • \(K(x_i, x)\): similarity function (depends on γ and kernel).

  • \(C\): controls how many support vectors exist (smaller C → wider margin → more support vectors).
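
After fitting, scikit-learn exposes the pieces of this function: dual_coef_ holds \(\alpha_i y_i\), support_vectors_ holds the \(x_i\), and intercept_ holds \(b\). For a binary problem the RBF decision value can therefore be rebuilt by hand as a sanity check (a sketch on synthetic data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors
x = X[0]
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]
print(manual, clf.decision_function(x.reshape(1, -1))[0])  # the two values agree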


Summary (Mathematical Intuition)#

  • C → controls penalty on misclassified points (\(\xi_i\)).

  • γ → controls how similarity decays in RBF kernel.

  • Kernel → defines feature space transformation.

  • Together, they shape the decision boundary: wide vs narrow, smooth vs complex.
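
Putting it all together: the code below searches over C, gamma, and kernel with GridSearchCV on the Iris dataset, then evaluates the refit best model on a held-out test set.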

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load dataset
X, y = datasets.load_iris(return_X_y=True)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Base SVC model
svc = SVC()

# Hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],   # ignored by the linear kernel
    'kernel': ['linear', 'rbf', 'poly']
}

# GridSearchCV
grid = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    refit=True,        # keep best model
    cv=5,              # 5-fold cross-validation
    verbose=2,
    n_jobs=-1          # use all CPUs
)

# Fit
grid.fit(X_train, y_train)



# Best hyperparameters
print("Best Parameters:", grid.best_params_)

# Use best model for predictions
y_pred = grid.predict(X_test)

# Evaluation
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}

Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
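
With the tuned hyperparameters (C=100, γ=0.01, RBF kernel), the best model classifies all 45 held-out samples correctly. Iris is a small, easy dataset, so perfect test accuracy is unsurprising; on harder data the C–γ trade-offs above matter far more.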