Optimization Techniques#

Optimization techniques are mathematical methods used to minimize the loss/cost function and improve model performance. They are fundamental to training models such as linear regression, logistic regression, SVMs, and neural networks.

The sections below give a structured overview of the main families of techniques.


Gradient-Based Optimization#

Gradient Descent (GD)#

Updates parameters in the direction of the negative gradient of the cost function.

Parameter update:

\[ \theta = \theta - \alpha \nabla_\theta J(\theta) \]

Where:

  • \(\theta\) = model parameters

  • \(\alpha\) = learning rate

  • \(J(\theta)\) = cost function

Types of Gradient Descent#

Batch Gradient Descent#

  • Uses the entire dataset for every update

  • Stable but slow

  • Practical mainly for small datasets and convex problems (e.g., classical regression)

Stochastic Gradient Descent (SGD)#

  • Updates using a single data sample

  • Fast, noisy updates

  • Helps escape local minima

Mini-Batch Gradient Descent#

  • Uses a small batch (e.g., 32, 64)

  • Balanced: faster than batch, smoother than SGD

  • Most common technique in deep learning
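The three variants differ only in how many samples feed each parameter update. Below is a minimal NumPy sketch of mini-batch gradient descent for linear regression with MSE loss; the model and loss are illustrative assumptions, not fixed by the text above:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # gradient of the mean squared error on this batch
            grad = 2.0 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad           # theta = theta - alpha * grad J
    return theta
```

Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` gives SGD.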


Advanced Gradient Optimization (First-Order Methods)#

These improve the speed and stability of gradient descent, especially in deep neural networks.

Momentum#

Adds a memory of previous updates to smooth the gradient direction:

\[ v_t = \beta v_{t-1} + (1-\beta)\nabla J(\theta) \]
\[ \theta = \theta - \alpha v_t \]

Reduces oscillations and speeds convergence.
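The two update equations translate directly to code. A minimal sketch; the quadratic objective in the usage note is an illustrative assumption:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One momentum update in the EMA form shown above."""
    v = beta * v + (1 - beta) * grad   # v_t = beta * v_{t-1} + (1 - beta) * grad
    theta = theta - alpha * v          # theta = theta - alpha * v_t
    return theta, v
```

On a simple quadratic \(J(\theta)=\tfrac12\theta^2\) (gradient \(\theta\)), repeated calls drive \(\theta\) toward zero with damped oscillations.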


Nesterov Accelerated Gradient (NAG)#

Evaluates the gradient at a look-ahead point, where momentum is about to carry the parameters, before taking the step. Typically converges faster than standard momentum.


AdaGrad#

Adapts learning rate for each parameter individually. Works well with sparse data (NLP).

\[ \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_{ii} + \epsilon}} \nabla_{\theta_i} J(\theta) \]

where \(G_{ii}\) is the accumulated sum of squared gradients for parameter \(i\).

But because \(G_{ii}\) only grows, the effective learning rate keeps shrinking → stagnation.
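A sketch of the accumulation: `G` stores the running sum of squared gradients, so each parameter's step size shrinks as its gradients accumulate. The generic `grad_fn` interface and the quadratic test objective in the note are assumptions for illustration:

```python
import numpy as np

def adagrad(grad_fn, theta, alpha=0.5, steps=100, eps=1e-8):
    """AdaGrad: per-parameter step size alpha / sqrt(G_ii + eps)."""
    G = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        G += g * g                               # running sum of squared gradients
        theta = theta - alpha * g / np.sqrt(G + eps)
    return theta
```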


RMSProp#

Improves AdaGrad by using an exponential moving average of squared gradients.

Great for:

  • non-stationary data

  • deep networks

  • recurrent networks
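A minimal sketch, again assuming a generic `grad_fn`. Only the `s` update differs from AdaGrad: an exponential moving average replaces the ever-growing sum, so the effective learning rate no longer shrinks to zero:

```python
import numpy as np

def rmsprop(grad_fn, theta, alpha=0.01, rho=0.9, steps=500, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients."""
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        s = rho * s + (1 - rho) * g * g          # EMA, not a running sum
        theta = theta - alpha * g / (np.sqrt(s) + eps)
    return theta
```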


Adam (Adaptive Moment Estimation)#

Most widely used optimizer.

Computes:

  • an exponential moving average of gradients (the first moment, \(\hat{m}\))

  • an exponential moving average of squared gradients (the second moment, \(\hat{v}\))

\[ \theta = \theta - \alpha \frac{\hat{m}}{\sqrt{\hat{v}}+\epsilon} \]

Both averages are bias-corrected before the update.

Combines momentum + RMSProp. Works well for almost all deep learning tasks.
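A minimal sketch of the update with bias correction; the hyperparameter defaults follow common Adam settings, and the quadratic test objective in the note is an assumption:

```python
import numpy as np

def adam(grad_fn, theta, alpha=0.02, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    """Adam: bias-corrected EMAs of the gradient and the squared gradient."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g                # first moment (momentum part)
        v = b2 * v + (1 - b2) * g * g            # second moment (RMSProp part)
        m_hat = m / (1 - b1 ** t)                # bias correction
        v_hat = v / (1 - b2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```

On a quadratic with gradient \(\theta\), the iterates approach zero, though with a constant learning rate Adam typically settles into small oscillations rather than converging exactly.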


Second-Order Optimization#

Uses curvature (Hessian matrix).

Newton’s Method#

Fast convergence but expensive (requires Hessian). Used in logistic regression solvers.
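On a quadratic objective the Hessian is exact, so Newton's method reaches the minimum in a single step. A small NumPy illustration; the matrix \(A\) and vector \(b\) are arbitrary example values:

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 theta^T A theta - b^T theta,
# so the gradient is A @ theta - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive-definite Hessian
b = np.array([1.0, 1.0])
theta = np.zeros(2)
grad = A @ theta - b
theta = theta - np.linalg.solve(A, grad)  # Newton step: theta -= H^{-1} grad
```

After one step the gradient is zero, i.e. `theta` solves `A @ theta = b` exactly.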

Quasi-Newton Methods (e.g., BFGS, L-BFGS)#

Approximate Hessian instead of computing it explicitly.

Used in:

  • classical ML models

  • scikit-learn optimization

  • small-to-medium datasets
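Assuming SciPy is available, its `minimize` interface exposes L-BFGS directly; the sketch below minimizes the built-in Rosenbrock function with its analytic gradient:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS builds a low-memory Hessian approximation from recent
# gradient differences; here it minimizes the 5-D Rosenbrock function.
res = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="L-BFGS-B")
```

The Rosenbrock minimum is at the all-ones vector, which `res.x` should recover.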


Convex Optimization Techniques#

Used in models where the cost surface is convex (one global minimum).

Examples:

  • Linear Regression (closed-form solution)

  • Logistic Regression (convex optimization)

  • SVM (Quadratic Programming)

  • Lasso/Ridge Regression (regularized convex problems)

Convex optimization guarantees global optimum.
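For ordinary least squares, the global optimum can be written in closed form via the normal equation. A NumPy sketch; the data here is synthetic and noise-free for clarity:

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y. Solving the linear
# system is preferred over forming the inverse explicitly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta                        # noise-free synthetic targets
theta = np.linalg.solve(X.T @ X, X.T @ y)
```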


Evolutionary Optimization (Gradient-Free)#

Used when gradients are unavailable or the surface is not differentiable.

Genetic Algorithms#

Inspired by natural evolution. Used for feature selection or hyperparameter tuning.

Particle Swarm Optimization#

Simulates a swarm of particles that move toward the best solutions found so far (their own and the swarm's).

Useful for:

  • global optimization

  • hyperparameter tuning

Simulated Annealing#

Random search that occasionally accepts worse solutions, with an acceptance probability controlled by a temperature that decreases over time. Helps escape local minima.
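A self-contained sketch for a 1-D objective; the Gaussian proposal width and the `t0 / k` cooling schedule are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, steps=5000, t0=1.0, seed=0):
    """Minimize f by random moves, accepting worse moves with
    probability exp(-delta / T); the temperature T cools over time."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        T = t0 / k                         # cooling schedule (assumption)
        cand = x + rng.gauss(0, 0.5)       # random neighbor proposal
        fc = f(cand)
        # always accept improvements; accept worse moves with prob exp(-delta/T)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / T):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest
```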


Bayesian Optimization#

Models the objective function using a Gaussian Process and chooses next evaluations via acquisition functions.

Used when:

  • evaluations are expensive

  • the objective is a black-box function

  • tuning deep learning hyperparameters


Coordinate Descent#

Optimizes one parameter at a time.

Used in:

  • Lasso regression

  • Matrix factorization

  • SVM dual formulation
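A sketch on a smooth quadratic, where each 1-D subproblem has an exact solution (the Lasso version adds a soft-thresholding step per coordinate). `A` is assumed positive definite:

```python
import numpy as np

def coordinate_descent(A, b, sweeps=100):
    """Minimize 0.5 x^T A x - b^T x one coordinate at a time
    (equivalent to Gauss-Seidel iteration on A x = b)."""
    x = np.zeros(len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            # exact 1-D minimization over x_i with all others held fixed
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```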



Optimization in Deep Learning#

Learning Rate Scheduling#

Adjust learning rate dynamically.

Examples:

  • Step decay

  • Exponential decay

  • ReduceLROnPlateau

  • Cyclical learning rates

Improves convergence stability.
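Step decay, the first scheme in the list, can be sketched in a few lines; the drop factor and interval below are example values, not standard constants:

```python
def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Step decay: multiply the base rate by `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)
```

For example, with `alpha0=0.1` the rate stays 0.1 for epochs 0-9, drops to 0.05 at epoch 10, 0.025 at epoch 20, and so on.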


Regularization-Based Optimization#

Used to reduce overfitting.

Methods:

  • L1/L2 regularization

  • Dropout

  • Early stopping

  • Batch normalization


Summary Table#

| Category | Techniques | Use Cases |
|---|---|---|
| First-order | GD, SGD, Momentum, RMSProp, Adam | Deep learning |
| Second-order | Newton, BFGS | Classical ML, small data |
| Gradient-free | GA, PSO, Annealing | Black-box optimization |
| Bayesian | GP + EI/UCB | Hyperparameter tuning |
| Convex | QP, L-BFGS | Linear/Logistic regression, SVM |
| Regularization | L1/L2, dropout | Reduce overfitting |