Workflow of Regularized Regression#

Define the Linear Regression Model#

We start with the standard linear regression equation:

\[ y = X\beta + \epsilon \]

where:

  • \(y\) = target values

  • \(X\) = matrix of input features

  • \(\beta\) = vector of coefficients

  • \(\epsilon\) = error (noise) term
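As a quick sketch, the model above can be simulated in NumPy (the coefficient values here are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))               # input features
beta = np.array([1.5, -2.0, 0.5])         # hypothetical true coefficients
epsilon = rng.normal(scale=0.1, size=n)   # error term
y = X @ beta + epsilon                    # observed targets

print(y.shape)  # (100,)
```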


Define the Cost Function#

  • Standard regression uses Mean Squared Error (MSE):

\[ J(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
  • Regularization adds a penalty term to control complexity:

  1. Ridge (L2 Regularization):

    \[ J(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]
    • Shrinks coefficients but never makes them exactly zero.

    • Helps with multicollinearity.

  2. Lasso (L1 Regularization):

    \[ J(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| \]
    • Can shrink some coefficients exactly to zero → feature selection.

  3. Elastic Net (Combination of L1 & L2):

    \[ J(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \Big(\alpha \sum_{j=1}^p |\beta_j| + (1-\alpha) \sum_{j=1}^p \beta_j^2 \Big) \]
    • Balances Ridge and Lasso.

    • Good for high-dimensional datasets (\(p \gg n\)).
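The three penalized cost functions translate directly into NumPy. A minimal sketch, with our own helper names:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/n) * sum of squared residuals."""
    return np.mean((y - y_hat) ** 2)

def ridge_cost(y, y_hat, beta, lam):
    """MSE plus the L2 penalty: lam * sum(beta_j^2)."""
    return mse(y, y_hat) + lam * np.sum(beta ** 2)

def lasso_cost(y, y_hat, beta, lam):
    """MSE plus the L1 penalty: lam * sum(|beta_j|)."""
    return mse(y, y_hat) + lam * np.sum(np.abs(beta))

def elastic_net_cost(y, y_hat, beta, lam, alpha):
    """MSE plus a weighted mix of the L1 and L2 penalties."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return mse(y, y_hat) + lam * (alpha * l1 + (1 - alpha) * l2)

# With a perfect fit, only the penalty term remains:
beta = np.array([1.0, -2.0])
y = y_hat = np.array([1.0, 2.0])
print(ridge_cost(y, y_hat, beta, lam=0.1))        # 0.1 * (1 + 4)
print(lasso_cost(y, y_hat, beta, lam=0.1))        # 0.1 * (1 + 2)
```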


Choose Hyperparameters#

  • \(\lambda\) (regularization strength): Controls how heavily large coefficients are penalized; \(\lambda = 0\) recovers ordinary least squares.

  • \(\alpha\) (Elastic Net only): Balances the L1 vs. L2 penalty (\(\alpha = 1\) is pure Lasso, \(\alpha = 0\) is pure Ridge).
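A note on naming if you use scikit-learn: its estimators call the penalty strength `alpha` (our \(\lambda\)), and Elastic Net's mixing weight `l1_ratio` (our \(\alpha\)); up to constant scaling factors, its objectives match the costs above:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# scikit-learn's `alpha` is the lambda above (penalty strength);
# ElasticNet's `l1_ratio` plays the role of alpha in our notation.
ridge = Ridge(alpha=1.0)                    # pure L2
lasso = Lasso(alpha=0.1)                    # pure L1
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # 50/50 L1-L2 mix
```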


Optimization#

  • Use Gradient Descent (or specialized solvers like Coordinate Descent for Lasso).

  • Iteratively update coefficients:

\[ \beta_j \leftarrow \beta_j - \eta \cdot \frac{\partial J}{\partial \beta_j} \]

where \(\eta\) is the learning rate.
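For Ridge, the gradient of the cost above is \(-\frac{2}{n}X^\top(y - X\beta) + 2\lambda\beta\), so a bare-bones gradient descent loop looks like this (a sketch that omits the intercept and feature scaling):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, eta=0.01, n_iters=5000):
    """Minimize (1/n)||y - X beta||^2 + lam * ||beta||^2 by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = -(2 / n) * X.T @ (y - X @ beta) + 2 * lam * beta
        beta -= eta * grad           # beta_j <- beta_j - eta * dJ/dbeta_j
    return beta

# Sanity check against the closed-form Ridge solution
# (setting the gradient to zero gives (X'X + n*lam*I) beta = X'y):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
beta_gd = ridge_gradient_descent(X, y, lam=0.1)
beta_cf = np.linalg.solve(X.T @ X + 100 * 0.1 * np.eye(3), X.T @ y)
print(np.allclose(beta_gd, beta_cf, atol=1e-6))  # True
```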


Model Training#

  • Fit model on training data.

  • Coefficients shrink depending on the regularization type:

    • Ridge → small but nonzero.

    • Lasso → some zero.

    • Elastic Net → mix.
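This shrinkage behavior is easy to see empirically. In the hypothetical setup below, only 3 of 10 features carry signal; Lasso zeroes out most of the noise features, while Ridge keeps every coefficient nonzero:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 10 features, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.sum(ridge.coef_ == 0))  # Ridge: no exact zeros
print(np.sum(lasso.coef_ == 0))  # Lasso: most noise features zeroed out
```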


Model Validation (Cross-Validation)#

  • Use k-Fold CV to tune \(\lambda\) (and \(\alpha\) for Elastic Net).

  • Select the value that minimizes validation error.
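scikit-learn's `ElasticNetCV` performs exactly this k-fold search over \(\lambda\) (its `alphas`) and the L1/L2 mix (its `l1_ratio`). A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)

# 5-fold CV over a grid of penalty strengths and L1/L2 mixes
model = ElasticNetCV(alphas=np.logspace(-3, 1, 20),
                     l1_ratio=[0.2, 0.5, 0.8, 1.0],
                     cv=5).fit(X, y)
print(model.alpha_, model.l1_ratio_)  # best (lambda, mix) found by CV
```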


Prediction#

  • Use the final model to make predictions:

\[ \hat{y}_{\text{test}} = X_{\text{test}}\hat{\beta} \]
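Prediction is then a single matrix product with the learned coefficients; with scikit-learn this is just `predict` (synthetic data, for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical training data
rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 4))
y_train = X_train @ np.array([1.0, 0.5, -1.0, 2.0])

model = Ridge(alpha=0.1).fit(X_train, y_train)

X_test = rng.normal(size=(10, 4))
y_pred = model.predict(X_test)  # computes X_test @ coef_ + intercept_
print(y_pred.shape)  # (10,)
```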

Summary Workflow#

  1. Define regression model.

  2. Add regularization term (L1, L2, or both).

  3. Choose hyperparameters (\(\lambda\), \(\alpha\)).

  4. Optimize using gradient descent/coordinate descent.

  5. Train model → shrink/zero coefficients.

  6. Validate via CV and tune parameters.

  7. Predict on new data.