Hyperparameter Tuning#

What are Hyperparameters?#

  • In machine learning, hyperparameters are settings you define before training a model.

  • They are not learned from data (unlike model weights/coefficients).

  • Proper tuning of hyperparameters can improve model performance and prevent overfitting/underfitting.


Key Hyperparameters in Logistic Regression#

  1. Regularization Parameter (C)

  • Definition: Controls the strength of regularization (penalty for large coefficients).

  • In scikit-learn, C is the inverse of regularization strength:

\[ C = \frac{1}{\lambda} \]
  • Smaller C → stronger regularization → reduces overfitting.

  • Larger C → weaker regularization → may overfit on training data.

  • Regularization helps prevent the model from giving too much weight to a single feature.
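As a quick sketch of this effect (using an assumed synthetic dataset from `make_classification`, not data from this lesson), the norm of the learned coefficients shrinks as C gets smaller:

```python
# Sketch: smaller C (stronger regularization) shrinks the coefficients
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed synthetic dataset for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # L2 norm of the learned coefficients grows as regularization weakens
    print(f"C={C:>6}: ||coef|| = {np.linalg.norm(model.coef_):.3f}")
```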


  2. Penalty Type (penalty)

  • Determines the type of regularization used. Common options:

    • 'l2' → Ridge regularization (squared magnitude of coefficients)

    • 'l1' → Lasso regularization (absolute value of coefficients, encourages sparsity)

    • 'elasticnet' → Combination of L1 and L2 (requires solver='saga')
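A small sketch of the sparsity difference, again on an assumed synthetic dataset: with only 2 of 10 features informative, an L1 penalty tends to drive the uninformative coefficients exactly to zero, while L2 merely shrinks them:

```python
# Sketch: L1 zeroes out weak features, L2 only shrinks them
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed dataset: only 2 of 10 features carry signal
X, y = make_classification(n_samples=300, n_features=10, n_informative=2,
                           n_redundant=0, random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print("L1 zero coefficients:", int(np.sum(l1.coef_ == 0)))
print("L2 zero coefficients:", int(np.sum(l2.coef_ == 0)))
```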


  3. Solver (solver)

  • Optimization algorithm used to fit the model. Common options:

    • 'lbfgs' → Good default for small datasets, supports L2

    • 'liblinear' → Good for small datasets, supports L1

    • 'saga' → Supports L1, L2, and elasticnet, scalable for large datasets

  • Choice of solver may depend on dataset size and penalty type.
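The solver/penalty pairing can be checked directly: scikit-learn raises an error at fit time for unsupported combinations. A small sketch on assumed synthetic data:

```python
# Sketch: solver and penalty must be compatible; 'saga' covers elasticnet
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed synthetic dataset for illustration
X, y = make_classification(n_samples=200, random_state=0)

# elasticnet requires solver='saga' plus an l1_ratio (0 = pure L2, 1 = pure L1)
enet = LogisticRegression(penalty='elasticnet', solver='saga',
                          l1_ratio=0.5, max_iter=5000).fit(X, y)
print("elasticnet fit OK, train accuracy:", round(enet.score(X, y), 3))

# 'lbfgs' supports only L2 (or no penalty); asking for L1 raises an error
try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)
except ValueError as e:
    print("lbfgs + l1 ->", type(e).__name__)
```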


  4. Maximum Iterations (max_iter)

  • Maximum number of iterations for the solver to converge.

  • Default is usually 100; if the model does not converge, you can increase this number.
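A sketch of what non-convergence looks like in practice (on an assumed synthetic dataset): with too few iterations scikit-learn emits a ConvergenceWarning, and raising max_iter lets the solver finish cleanly:

```python
# Sketch: too few iterations triggers a ConvergenceWarning; raising max_iter fixes it
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Assumed synthetic dataset for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    LogisticRegression(max_iter=2).fit(X, y)  # deliberately too few iterations
print("warned:", any(issubclass(w.category, ConvergenceWarning) for w in caught))

# With a larger budget the solver converges and reports the iterations it used
model = LogisticRegression(max_iter=1000).fit(X, y)
print("converged in", model.n_iter_[0], "iterations")
```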


  5. Class Weight (class_weight)

  • Used for imbalanced datasets.

  • Options:

    • None → all classes are treated equally

    • 'balanced' → weights inversely proportional to class frequency
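Under the hood, 'balanced' assigns each class c the weight n_samples / (n_classes × count_c). A quick check with a hypothetical 90/10 label vector (assumed here purely for illustration):

```python
# Sketch: 'balanced' weight for class c is n_samples / (n_classes * count_c)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # hypothetical imbalanced labels (90/10)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print("weights:", weights)  # the minority class receives the larger weight
# Manual check: 100 / (2 * 90) for class 0, 100 / (2 * 10) for class 1
```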


How to Tune Hyperparameters#

  1. Grid Search

    • Try all possible combinations of hyperparameters in a predefined grid.

    • Example: tune C = [0.01, 0.1, 1, 10] and penalty = ['l1','l2'].

  2. Randomized Search

    • Randomly select hyperparameter combinations, useful when the grid is large.
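A sketch of RandomizedSearchCV on the Iris data (assumed here for illustration), sampling C from a continuous log-uniform distribution instead of a fixed grid:

```python
# Sketch: randomized search samples a fixed number of combinations
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_dist = {
    'C': loguniform(1e-3, 1e2),   # continuous distribution over C
    'penalty': ['l1', 'l2'],
}
search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear'),
    param_distributions=param_dist,
    n_iter=20,                    # only 20 random combinations, not a full grid
    cv=5, random_state=42,
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```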

  3. Cross-Validation

    • Split training data into multiple folds.

    • Evaluate hyperparameter combinations using average validation performance.

    • Prevents overfitting to a single train-test split.


Example in Python#

# Import libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

# Load Iris dataset (multi-class example)
data = load_iris()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create logistic regression model ('liblinear' fits multi-class problems
# one-vs-rest and supports both L1 and L2 penalties)
logreg = LogisticRegression(solver='liblinear', max_iter=500)

# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10],           # Regularization strength
    'penalty': ['l1', 'l2'],           # L1 or L2 regularization
    'class_weight': [None, 'balanced'] # Handle imbalanced data
}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print best hyperparameters and score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Accuracy:", grid_search.best_score_)

# Test set evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))

Output:

Best Hyperparameters: {'C': 1, 'class_weight': 'balanced', 'penalty': 'l2'}
Best Cross-Validated Accuracy: 0.9523809523809523
Test Set Accuracy: 0.9777777777777777

# Demonstration: Logistic Regression on an imbalanced dataset using StratifiedKFold

# Import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score

# Step 1: Create imbalanced dataset (90% negative, 10% positive)
X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, n_redundant=0, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)

# Step 2: Create Logistic Regression model with class_weight='balanced'
model = LogisticRegression(solver='liblinear', class_weight='balanced')

# Step 3: Define StratifiedKFold (preserves the class ratio in every fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 4: Evaluate using cross-validation with F1-score, Precision, Recall
f1_scores = cross_val_score(model, X, y, cv=skf, scoring=make_scorer(f1_score))
precision_scores = cross_val_score(model, X, y, cv=skf, scoring=make_scorer(precision_score))
recall_scores = cross_val_score(model, X, y, cv=skf, scoring=make_scorer(recall_score))

# Step 5: Print results
print("F1-scores for each fold:", f1_scores)
print("Mean F1-score:", f1_scores.mean())
print("\nPrecision for each fold:", precision_scores)
print("Mean Precision:", precision_scores.mean())
print("\nRecall for each fold:", recall_scores)
print("Mean Recall:", recall_scores.mean())

Output:

F1-scores for each fold: [0.61016949 0.64864865 0.7037037  0.74509804 0.6557377 ]
Mean F1-score: 0.672671517602299

Precision for each fold: [0.47368421 0.75       0.57575758 0.63333333 0.51282051]
Mean Precision: 0.5891191264875475

Recall for each fold: [0.85714286 0.57142857 0.9047619  0.9047619  0.90909091]
Mean Recall: 0.8294372294372293

Summary#

  • Hyperparameter tuning improves model performance.

  • Important hyperparameters in Logistic Regression:

    • C → regularization strength

    • penalty → L1/L2 (or elasticnet)

    • solver → optimization algorithm

    • max_iter → iteration budget for convergence

    • class_weight → handle imbalanced datasets

  • Use GridSearchCV or RandomizedSearchCV with cross-validation to find the best combination.