Cross Validation#
Cross-validation is a resampling method to estimate how well a model generalizes to unseen data.
Mechanism
Split the dataset into multiple subsets (folds).
Train the model on all but one fold, and test on the held-out fold.
Repeat until each fold has served as the test set once.
Average the results across folds → performance estimate (a minimal sketch follows).
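A minimal sketch of this loop with scikit-learn's cross_val_score, using the breast-cancer dataset that appears later in this section (the estimator and fold count here are illustrative choices, not the only options):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5 folds: train on 4, score on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores)         # one accuracy value per fold
print(scores.mean())  # the averaged performance estimate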
GridSearchCV#
What it does: Tries all possible combinations of hyperparameters in the grid you provide.
Example: If you test
penalty = ['l1', 'l2']
C = [0.1, 1, 10]
solver = ['liblinear', 'saga']
→ that's 2 × 3 × 2 = 12 candidate models, each trained and evaluated on every CV fold.
Pros: Exhaustive; guaranteed to find the best combination within the grid you specified.
Cons: Computationally expensive if the parameter space is large.
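To sanity-check the combination count before launching a search, sklearn's ParameterGrid can enumerate the grid (a small illustrative check, not part of the tuning itself):

from sklearn.model_selection import ParameterGrid

grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga'],
}
# 2 × 3 × 2 = 12 candidates (each still refit once per CV fold)
print(len(ParameterGrid(grid)))  # 12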
RandomizedSearchCV#
What it does: Instead of testing all combinations, it randomly samples a fixed number of parameter combinations.
Example: With the same parameter grid as above, instead of all 12 combinations you can ask for only 5 random samples (n_iter=5).
Pros: Much faster; works well with large search spaces.
Cons: May miss the absolute best combination, but usually finds a good enough one.
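A related advantage not used in the demonstration below: param_distributions may contain continuous distributions, so a value like C is sampled rather than fixed to a list. A minimal sketch with scipy's loguniform (the bounds are illustrative assumptions):

from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'penalty': ['l1', 'l2'],
    'C': loguniform(1e-2, 1e2),  # sample C on a log scale instead of a fixed list
    'solver': ['liblinear', 'saga'],
}
# Draw 5 candidates, the same sampling RandomizedSearchCV(n_iter=5) performs internally
for params in ParameterSampler(param_distributions, n_iter=5, random_state=42):
    print(params)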
Demonstration#
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings("ignore")  # silence convergence and ignored-parameter warnings during the search
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Base model
model = LogisticRegression(max_iter=5000)
# Define parameter grid
param_grid = {
'penalty': ['l1', 'l2'],
'C': [0.01, 0.1, 1, 10, 100],
'solver': ['liblinear', 'saga'],
'l1_ratio': [0, 0.5, 1] # only used when penalty='elasticnet', which is not in this grid, so it is ignored (it still multiplies the number of candidates)
}
# Stratified K-Fold: each fold preserves the class proportions of y
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# -------------------------
# 🔹 GridSearchCV
# -------------------------
grid_search = GridSearchCV(
estimator=model,
param_grid=param_grid,
scoring='accuracy',
cv=cv,
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print("Best Parameters (GridSearchCV):", grid_search.best_params_)
print("Best Accuracy (GridSearchCV):", grid_search.best_score_)
# Evaluate on test set
y_pred_grid = grid_search.predict(X_test)
print("\nClassification Report (GridSearchCV):\n", classification_report(y_test, y_pred_grid))
# -------------------------
# 🔹 RandomizedSearchCV
# -------------------------
random_search = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_iter=10, # number of random combinations to try
scoring='accuracy',
cv=cv,
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print("\nBest Parameters (RandomizedSearchCV):", random_search.best_params_)
print("Best Accuracy (RandomizedSearchCV):", random_search.best_score_)
# Evaluate on test set
y_pred_random = random_search.predict(X_test)
print("\nClassification Report (RandomizedSearchCV):\n", classification_report(y_test, y_pred_random))
Best Parameters (GridSearchCV): {'C': 100, 'l1_ratio': 0, 'penalty': 'l1', 'solver': 'liblinear'}
Best Accuracy (GridSearchCV): 0.9648351648351647
Classification Report (GridSearchCV):
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        42
           1       0.97      1.00      0.99        72

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
Best Parameters (RandomizedSearchCV): {'solver': 'liblinear', 'penalty': 'l1', 'l1_ratio': 0, 'C': 100}
Best Accuracy (RandomizedSearchCV): 0.9648351648351647
Classification Report (RandomizedSearchCV):
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        42
           1       0.97      1.00      0.99        72

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
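Beyond best_params_, both fitted searchers expose cv_results_, a per-candidate record of the search that is convenient to inspect as a DataFrame (the param_* column names below are the ones sklearn generates for this grid):

results = pd.DataFrame(grid_search.cv_results_)
# Mean and spread of the fold scores for each candidate, best-ranked first
cols = ['param_C', 'param_penalty', 'param_solver', 'mean_test_score', 'std_test_score']
print(results.sort_values('rank_test_score')[cols].head())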
Key Takeaways
GridSearchCV: Best when the search space is small and you want the guaranteed best parameters within the grid.
RandomizedSearchCV: Best when the search space is large and you want a fast, good-enough solution.
Both use cross-validation, so scores are averaged over folds rather than depending on a single split.
Always check metrics beyond accuracy (precision, recall, F1); a sketch follows.
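A minimal sketch of that last point, reusing grid_search, X_train, y_train, and cv from the demonstration above: cross_validate can score several metrics in a single pass.

from sklearn.model_selection import cross_validate

# Cross-validate the tuned model on multiple metrics at once
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(grid_search.best_estimator_, X_train, y_train, cv=cv, scoring=scoring)
for metric in scoring:
    print(metric, cv_results[f'test_{metric}'].mean())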