Optimization Techniques#

Optimization techniques are mathematical methods used to minimize the loss/cost function and improve model performance. They are fundamental to training models such as linear regression, logistic regression, SVMs, and neural networks.

The sections below give a structured overview of the main families of techniques.


Gradient-Based Optimization#

Gradient Descent (GD)#

Updates parameters in the direction of the negative gradient of the cost function.

Parameter update:

\[ \theta = \theta - \alpha \nabla_\theta J(\theta) \]

Where:

  • \(\theta\) = model parameters

  • \(\alpha\) = learning rate

  • \(J(\theta)\) = cost function

Types of Gradient Descent#

Batch Gradient Descent#

  • Uses the entire dataset for every update

  • Stable but slow

  • Practical mainly for small datasets and convex problems (e.g., classical regression)

Stochastic Gradient Descent (SGD)#

  • Updates using a single data sample

  • Fast, noisy updates

  • Helps escape local minima

Mini-Batch Gradient Descent#

  • Uses a small batch (e.g., 32, 64)

  • Balanced: faster than batch, smoother than SGD

  • Most common technique in deep learning
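The three variants differ only in how many samples feed each parameter update. Below is a minimal NumPy sketch of mini-batch gradient descent for linear regression with MSE loss; the model and loss are illustrative assumptions, not fixed by the text above:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # gradient of the mean squared error on this batch
            grad = 2.0 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad           # theta = theta - alpha * grad J
    return theta
```

Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` gives SGD.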


Advanced Gradient Optimization (First-Order Methods)#

These improve the speed and stability of gradient descent, especially in deep neural networks.

Momentum#

Adds a memory of previous updates to smooth the gradient direction:

\[ v_t = \beta v_{t-1} + (1-\beta)\nabla J(\theta) \]
\[ \theta = \theta - \alpha v_t \]

Reduces oscillations and speeds convergence.
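The two update equations translate directly to code. A minimal sketch; the quadratic objective in the usage note is an illustrative assumption:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One momentum update in the EMA form shown above."""
    v = beta * v + (1 - beta) * grad   # v_t = beta * v_{t-1} + (1 - beta) * grad
    theta = theta - alpha * v          # theta = theta - alpha * v_t
    return theta, v
```

On a simple quadratic \(J(\theta)=\tfrac12\theta^2\) (gradient \(\theta\)), repeated calls drive \(\theta\) toward zero with damped oscillations.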


Nesterov Accelerated Gradient (NAG)#

Evaluates the gradient at a look-ahead point, where momentum is about to carry the parameters, before taking the step. Typically converges faster than standard momentum.


AdaGrad#

Adapts learning rate for each parameter individually. Works well with sparse data (NLP).

\[ \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_{ii} + \epsilon}} \nabla_{\theta_i} J(\theta) \]

where \(G_{ii}\) is the accumulated sum of squared gradients for parameter \(i\).

But because \(G_{ii}\) only grows, the effective learning rate keeps shrinking → stagnation.
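A sketch of the accumulation: `G` stores the running sum of squared gradients, so each parameter's step size shrinks as its gradients accumulate. The generic `grad_fn` interface and the quadratic test objective in the note are assumptions for illustration:

```python
import numpy as np

def adagrad(grad_fn, theta, alpha=0.5, steps=100, eps=1e-8):
    """AdaGrad: per-parameter step size alpha / sqrt(G_ii + eps)."""
    G = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g                               # running sum of squared gradients
        theta = theta - alpha * g / np.sqrt(G + eps)
    return theta
```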


RMSProp#

Improves AdaGrad by using an exponential moving average of squared gradients.

Great for:

  • non-stationary data

  • deep networks

  • recurrent networks
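A minimal sketch, again assuming a generic `grad_fn`. Only the `s` update differs from AdaGrad: an exponential moving average replaces the ever-growing sum, so the effective learning rate no longer shrinks to zero:

```python
import numpy as np

def rmsprop(grad_fn, theta, alpha=0.01, rho=0.9, steps=500, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients."""
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        s = rho * s + (1 - rho) * g * g          # EMA, not a running sum
        theta = theta - alpha * g / (np.sqrt(s) + eps)
    return theta
```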


Adam (Adaptive Moment Estimation)#

Most widely used optimizer.

Computes:

  • an exponential moving average of gradients (the first moment, \(\hat{m}\))

  • an exponential moving average of squared gradients (the second moment, \(\hat{v}\))

\[ \theta = \theta - \alpha \frac{\hat{m}}{\sqrt{\hat{v}}+\epsilon} \]

Both averages are bias-corrected before the update.

Combines momentum + RMSProp. Works well for almost all deep learning tasks.
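A minimal sketch of the update with bias correction; the hyperparameter defaults follow common Adam settings, and the quadratic test objective in the note is an assumption:

```python
import numpy as np

def adam(grad_fn, theta, alpha=0.02, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    """Adam: bias-corrected EMAs of the gradient and the squared gradient."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g                # first moment (momentum part)
        v = b2 * v + (1 - b2) * g * g            # second moment (RMSProp part)
        m_hat = m / (1 - b1 ** t)                # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```

On a quadratic with gradient \(\theta\), the iterates approach zero, though with a constant learning rate Adam typically settles into small oscillations rather than converging exactly.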


Second-Order Optimization#

Uses curvature (Hessian matrix).

Newton’s Method#

Fast convergence but expensive (requires Hessian). Used in logistic regression solvers.
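On a quadratic objective the Hessian is exact, so Newton's method reaches the minimum in a single step. A small NumPy illustration; the matrix \(A\) and vector \(b\) are arbitrary example values:

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 theta^T A theta - b^T theta,
# so the gradient is A @ theta - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive-definite Hessian
b = np.array([1.0, 1.0])
theta = np.zeros(2)
grad = A @ theta - b
theta = theta - np.linalg.solve(A, grad)  # Newton step: theta -= H^{-1} grad
```

After one step the gradient is zero, i.e. `theta` solves `A @ theta = b` exactly.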

Quasi-Newton Methods (e.g., BFGS, L-BFGS)#

Approximate Hessian instead of computing it explicitly.

Used in:

  • classical ML models

  • scikit-learn optimization

  • small-to-medium datasets
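Assuming SciPy is available, its `minimize` interface exposes L-BFGS directly; the sketch below minimizes the built-in Rosenbrock function with its analytic gradient:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS builds a low-memory Hessian approximation from recent
# gradient differences; here it minimizes the 5-D Rosenbrock function.
res = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="L-BFGS-B")
```

The Rosenbrock minimum is at the all-ones vector, which `res.x` should recover.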


Convex Optimization Techniques#

Used in models where the cost surface is convex (one global minimum).

Examples:

  • Linear Regression (closed-form solution)

  • Logistic Regression (convex optimization)

  • SVM (Quadratic Programming)

  • Lasso/Ridge Regression (regularized convex problems)

Convex optimization guarantees global optimum.
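For ordinary least squares, the global optimum can be written in closed form via the normal equation. A NumPy sketch; the data here is synthetic and noise-free for clarity:

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y. Solving the linear
# system is preferred over forming the inverse explicitly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta                        # noise-free synthetic targets
theta = np.linalg.solve(X.T @ X, X.T @ y)
```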


Evolutionary Optimization (Gradient-Free)#

Used when gradients are unavailable or the surface is not differentiable.

Genetic Algorithms#

Inspired by natural evolution. Used for feature selection or hyperparameter tuning.

Particle Swarm Optimization#

Simulates a swarm of particles that move toward the best solutions found so far (their own and the swarm's).

Useful for:

  • global optimization

  • hyperparameter tuning

Simulated Annealing#

Random search that occasionally accepts worse solutions, with an acceptance probability controlled by a temperature that decreases over time. Helps escape local minima.
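A self-contained sketch for a 1-D objective; the Gaussian proposal width and the `t0 / k` cooling schedule are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, steps=5000, t0=1.0, seed=0):
    """Minimize f by random moves, accepting worse moves with
    probability exp(-delta / T); the temperature T cools over time."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        T = t0 / k                         # cooling schedule (assumption)
        cand = x + rng.gauss(0, 0.5)       # random neighbor proposal
        fc = f(cand)
        # always accept improvements; accept worse moves with prob exp(-delta/T)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / T):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest
```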


Bayesian Optimization#

Models the objective function using a Gaussian Process and chooses next evaluations via acquisition functions.

Used when:

  • evaluations are expensive

  • the objective is a black-box function

  • tuning deep learning hyperparameters


Coordinate Descent#

Optimizes one parameter at a time.

Used in:

  • Lasso regression

  • Matrix factorization

  • SVM dual formulation
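A sketch on a smooth quadratic, where each 1-D subproblem has an exact solution (the Lasso version adds a soft-thresholding step per coordinate). `A` is assumed positive definite:

```python
import numpy as np

def coordinate_descent(A, b, sweeps=100):
    """Minimize 0.5 x^T A x - b^T x one coordinate at a time
    (equivalent to Gauss-Seidel iteration on A x = b)."""
    x = np.zeros(len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            # exact 1-D minimization over x_i with all others held fixed
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```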



Optimization in Deep Learning#

Learning Rate Scheduling#

Adjust learning rate dynamically.

Examples:

  • Step decay

  • Exponential decay

  • ReduceLROnPlateau

  • Cyclical learning rates

Improves convergence stability.
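Step decay, the first scheme in the list, can be sketched in a few lines; the drop factor and interval below are example values, not standard constants:

```python
def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Step decay: multiply the base rate by `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)
```

For example, with `alpha0=0.1` the rate stays 0.1 for epochs 0-9, drops to 0.05 at epoch 10, 0.025 at epoch 20, and so on.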


Regularization-Based Optimization#

Used to reduce overfitting.

Methods:

  • L1/L2 regularization

  • Dropout

  • Early stopping

  • Batch normalization


Summary Table#

| Category | Techniques | Use Cases |
|---|---|---|
| First-order | GD, SGD, Momentum, RMSProp, Adam | Deep learning |
| Second-order | Newton, BFGS | Classical ML, small data |
| Gradient-free | GA, PSO, Annealing | Black-box optimization |
| Bayesian | GP + EI/UCB | Hyperparameter tuning |
| Convex | QP, L-BFGS | Linear/Logistic regression, SVM |
| Regularization | L1/L2, dropout | Reduce overfitting |