Hyper-parameter Tuning Intuition#
Let’s go deeper into the mathematical intuition behind the key hyperparameters of SVC.
We’ll focus on the three most important ones: C, γ (gamma), and kernel.
Objective Function of SVC#
The primal optimization problem of SVM is:
\[ \min_{w,\, b,\, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \]
subject to:
\[ y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \dots, n \]
Where:
\(w\): weight vector
\(b\): bias term
\(\xi_i\): slack variables (allow misclassifications)
\(C\): regularization parameter (controls penalty for misclassifications)
\(\phi(x)\): feature mapping (depends on kernel)
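At the optimum, each slack variable equals the hinge loss, \(\xi_i = \max(0,\, 1 - y_i(w^T \phi(x_i) + b))\), so the constrained problem can be written in the equivalent unconstrained form:
\[ \min_{w,\, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max\left(0,\; 1 - y_i \left( w^T \phi(x_i) + b \right)\right) \]
This form makes the role of \(C\) explicit: it weights the data-fitting (hinge-loss) term against the margin (regularization) term.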
Role of C (Regularization)#
From the objective function:
The term \(\frac{1}{2} \|w\|^2\) → minimizing it maximizes the margin.
The term \(C \sum_i \xi_i\) → penalizes margin violations (misclassifications).
👉 Intuition:
Small C → margin maximization dominates (tolerates some errors).
Simpler decision boundary.
Prevents overfitting.
Large C → error penalty dominates (forces correct classification of training data).
Narrow margin.
Risk of overfitting (see the sketch below).
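A minimal sketch of this effect, assuming a noisy two-class toy dataset (make_moons) rather than the iris data used later; the exact numbers will vary, but a larger C should push training accuracy up while the margin narrows:

```python
# Sketch: how C trades margin width against training error (illustrative dataset).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100]:
    model = SVC(kernel='rbf', C=C, gamma='scale').fit(X_tr, y_tr)
    print(f"C={C:>6}: support vectors={model.n_support_.sum():>3}, "
          f"train acc={model.score(X_tr, y_tr):.2f}, "
          f"test acc={model.score(X_te, y_te):.2f}")
```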
Role of γ (Gamma) in RBF Kernel#
The RBF kernel is:
\[ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) \]
👉 Interpretation:
If \(\|x_i - x_j\|\) is small → similarity close to 1.
If \(\|x_i - x_j\|\) is large → similarity close to 0.
\(\gamma\) controls the decay rate of similarity.
Small γ:
Kernel is smoother, points far apart are still considered similar.
Leads to a smooth, less complex decision boundary.
Large γ:
Kernel is sharper, only very close neighbors are considered similar.
Leads to a highly complex decision boundary (can overfit); see the sketch below.
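A small sketch of this decay, using sklearn.metrics.pairwise.rbf_kernel to evaluate the similarity of two fixed (arbitrarily chosen) points at different values of γ:

```python
# Sketch: how gamma controls the decay of RBF similarity between two points.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x_i = np.array([[0.0, 0.0]])
x_j = np.array([[1.0, 1.0]])  # squared distance ||x_i - x_j||^2 = 2

for gamma in [0.01, 0.1, 1, 10]:
    similarity = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]
    print(f"gamma={gamma:>5}: K(x_i, x_j) = {similarity:.6f}")
```

With γ = 0.01 the similarity stays near 1, while with γ = 10 it is essentially 0, matching the smooth-versus-sharp behaviour described above.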
Role of Kernel Choice#
The kernel defines \(\phi(x)\), the transformation of data:
Linear kernel:
\[ K(x_i, x_j) = x_i^T x_j \]
→ Works well if data is linearly separable.
Polynomial kernel:
\[ K(x_i, x_j) = (x_i^T x_j + c)^d \]
→ Captures polynomial relationships; degree \(d\) is a hyperparameter.
RBF kernel:
\[ K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) \]
→ Very flexible, maps data to infinite-dimensional feature space.
How C and γ Interact#
High C + High γ:
Very complex model, tries to classify everything correctly.
Risk of overfitting.
Low C + Low γ:
Very smooth decision boundary, high bias.
Risk of underfitting.
Balanced values:
Trade-off between margin size, misclassification, and flexibility (see the sketch below).
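A quick sketch of these regimes on a held-out split (same iris data as below; the specific values of C and γ are illustrative extremes, not recommendations):

```python
# Sketch: train vs. test accuracy at extreme and moderate (C, gamma) settings.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for C, gamma in [(0.01, 0.001), (100, 10), (1, 0.1)]:
    model = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
    print(f"C={C:>6}, gamma={gamma:>6}: "
          f"train acc={model.score(X_tr, y_tr):.2f}, "
          f"test acc={model.score(X_te, y_te):.2f}")
```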
Decision Function#
The final decision function of SVC is:
\[ f(x) = \text{sign}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right) \]
Where:
\(\alpha_i\): learned weights (nonzero only for support vectors).
\(K(x_i, x)\): similarity function (depends on γ and kernel).
\(C\): influences how many support vectors remain (a smaller C widens the margin, so more points violate it and become support vectors); see the sketch below.
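To connect this to scikit-learn's API: for a fitted binary SVC, dual_coef_ stores \(\alpha_i y_i\) for the support vectors and intercept_ stores \(b\), so the sum above can be recomputed by hand (a sketch on an illustrative binary dataset; the multi-class iris case uses one-vs-one pairs and is less direct):

```python
# Sketch: recomputing f(x) = sum_i alpha_i * y_i * K(x_i, x) + b from a fitted binary SVC.
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
model = SVC(kernel='rbf', C=1.0, gamma=0.1).fit(X, y)

x_new = X[:1]                                               # one query point
K = rbf_kernel(model.support_vectors_, x_new, gamma=0.1)    # K(x_i, x) for each support vector
manual = (model.dual_coef_ @ K).ravel() + model.intercept_  # sum_i alpha_i y_i K(x_i, x) + b
print(manual, model.decision_function(x_new))               # the two values should agree
```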
Summary (Mathematical Intuition):
C → controls penalty on misclassified points (\(\xi_i\)).
γ → controls how similarity decays in RBF kernel.
Kernel → defines feature space transformation.
Together, they shape the decision boundary: wide vs narrow, smooth vs complex.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Load dataset
X, y = datasets.load_iris(return_X_y=True)
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Base SVC model
svc = SVC()
# Hyperparameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01, 0.001],
'kernel': ['linear', 'rbf', 'poly']
}
# GridSearchCV
grid = GridSearchCV(
estimator=svc,
param_grid=param_grid,
refit=True, # keep best model
cv=5, # 5-fold cross-validation
verbose=2,
n_jobs=-1 # use all CPUs
)
# Fit
grid.fit(X_train, y_train)
# Best hyperparameters
print("Best Parameters:", grid.best_params_)
# Use best model for predictions
y_pred = grid.predict(X_test)
# Evaluation
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45