Optimization Techniques#
Optimization techniques are mathematical methods used to minimize the loss/cost function and improve model performance. They are fundamental to training models such as linear regression, logistic regression, SVMs, and neural networks.
Gradient-Based Optimization#
Gradient Descent (GD)#
Updates parameters in the direction of the negative gradient of the cost function.
Parameter update:

\[
\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)
\]

Where:

\(\theta\) = model parameters
\(\alpha\) = learning rate
\(J(\theta)\) = cost function
Types of Gradient Descent:#
Batch Gradient Descent#
Uses the entire dataset for every update
Stable but slow
Used in classical regression problems on small datasets
Stochastic Gradient Descent (SGD)#
Updates using a single data sample
Fast, noisy updates
Helps escape local minima
Mini-Batch Gradient Descent#
Uses a small batch (e.g., 32, 64)
Balanced: faster than batch, smoother than SGD
Most common technique in deep learning
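The three variants differ only in how many samples feed each update. A minimal NumPy sketch of mini-batch gradient descent for linear regression with MSE loss (the toy data, learning rate, and batch size are illustrative choices):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, batch_size=32, epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ theta - yb)  # dJ/dtheta for MSE
            theta -= alpha * grad                  # step against the gradient
    return theta

# Toy noiseless data: y = 3*x0 - 2*x1, so GD should recover [3, -2].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])
theta = minibatch_gd(X, y)
```

Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` recovers SGD.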
Advanced Gradient Optimization (First-Order Methods)#
These improve the speed and stability of gradient descent, especially in deep neural networks.
Momentum#
Adds memory of previous updates to smooth the gradient direction:

\[
v \leftarrow \beta v + \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - \alpha v
\]

Reduces oscillations and speeds up convergence.
Nesterov Accelerated Gradient (NAG)#
Evaluates the gradient at a look-ahead point (the current parameters plus the momentum step) before updating. Typically converges faster than standard momentum.
AdaGrad#
Adapts learning rate for each parameter individually. Works well with sparse data (NLP).
But the accumulated squared gradients only grow, so the effective learning rate keeps shrinking → stagnation.
RMSProp#
Improves AdaGrad by using an exponential moving average of squared gradients.
Great for:
non-stationary data
deep networks
recurrent networks
Adam (Adaptive Moment Estimation)#
Most widely used optimizer.
Computes:
an exponential moving average of gradients (first moment)
an exponential moving average of squared gradients (second moment)
Combines momentum + RMSProp. Works well for almost all deep learning tasks.
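A minimal sketch of a single Adam update with the standard default hyperparameters (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\)); the toy quadratic objective is illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient plus RMSProp-style scaling."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 5.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.05)
```

The bias-correction terms matter early in training, when the moving averages are still near their zero initialization.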
Second-Order Optimization#
Uses curvature (Hessian matrix).
Newton’s Method#
Fast convergence but expensive (requires Hessian). Used in logistic regression solvers.
Quasi-Newton Methods (e.g., BFGS, L-BFGS)#
Approximate Hessian instead of computing it explicitly.
Used in:
classical ML models
scikit-learn optimization
small-to-medium datasets
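For instance, SciPy's `minimize` exposes L-BFGS-B as a ready-made quasi-Newton solver; the Rosenbrock test function and starting point below are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic curved-valley test problem, minimum at (1, 1).
def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

# L-BFGS-B builds a low-memory Hessian approximation from recent gradients.
result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]), method="L-BFGS-B")
```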
Convex Optimization Techniques#
Used in models where the cost surface is convex (one global minimum).
Examples:
Linear Regression (closed-form solution)
Logistic Regression (convex optimization)
SVM (Quadratic Programming)
Lasso/Ridge Regression (regularized convex problems)
Convex optimization guarantees global optimum.
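As a sketch of the closed-form case, ordinary least squares can be solved directly from the normal equations \(X^\top X \theta = X^\top y\) (toy noiseless data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                      # noiseless targets

# Normal equations: solve (X^T X) theta = X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```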
Evolutionary Optimization (Gradient-Free)#
Used when gradients are unavailable or the surface is not differentiable.
Genetic Algorithms#
Inspired by natural evolution. Used for feature selection or hyperparameter tuning.
Particle Swarm Optimization#
Simulates groups of particles moving toward the best-known solution.
Useful for:
global optimization
hyperparameter tuning
Simulated Annealing#
Randomized local search that occasionally accepts worse solutions, with an acceptance probability controlled by a gradually decreasing temperature. Helps escape local minima.
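A minimal sketch of simulated annealing on a one-dimensional multimodal function (the objective, step size, and cooling schedule are illustrative choices):

```python
import math
import random

def simulated_annealing(f, x0, temp=10.0, cooling=0.99, steps=5000, seed=0):
    """Random local search that accepts uphill moves with probability exp(-dE/T)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for _ in range(steps):
        cand = x + rng.gauss(0, 1)                 # random neighbour
        fc = f(cand)
        # Always accept improvements; sometimes accept worse moves while hot.
        if fc < fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling                            # cool the temperature
    return best_x, best_f

# Multimodal objective: global minimum 0 at x = 0, with bumps elsewhere.
f = lambda x: x ** 2 + 10 * (1 - math.cos(x))
x, fx = simulated_annealing(f, x0=8.0)
```

While the temperature is high, uphill moves are accepted often, which is exactly what lets the search climb out of local minima before the schedule freezes it.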
Bayesian Optimization#
Models the objective function using a Gaussian Process and chooses next evaluations via acquisition functions.
Used when:
evaluations are expensive
the objective is a black box
tuning deep learning hyperparameters
Coordinate Descent#
Optimizes one parameter at a time.
Used in:
Lasso regression
Matrix factorization
SVM dual formulation
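A sketch of coordinate descent for the Lasso objective \(\tfrac{1}{2}\lVert y - X\theta \rVert^2 + \lambda \lVert \theta \rVert_1\): each coordinate has a closed-form soft-thresholded solution, so the algorithm simply cycles through them (toy sparse data and \(\lambda\) for illustration):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Closed-form solution of the one-dimensional Lasso subproblem."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, sweeps=100):
    """Cycle through coordinates, solving each exactly while the others stay fixed."""
    theta = np.zeros(X.shape[1])
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            # Residual with coordinate j's own contribution removed.
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r
            theta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0])   # sparse true coefficients
theta = lasso_cd(X, y, lam=5.0)
```

The soft-thresholding step is what drives small coefficients exactly to zero, which is why coordinate descent suits L1-regularized problems so well.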
Gradient-Free Local Search#
Nelder–Mead#
Works without gradients, using simplex updates.
Used in:
small parameter spaces
noisy functions
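SciPy also exposes Nelder–Mead through the same `minimize` interface; note that only function values are used, never gradients (the quadratic objective is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# A smooth bowl with its minimum at (1, -2); Nelder-Mead never asks for gradients.
def f(x):
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

result = minimize(f, x0=np.zeros(2), method="Nelder-Mead")
```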
Optimization in Deep Learning#
Learning Rate Scheduling#
Adjust learning rate dynamically.
Examples:
Step decay
Exponential decay
ReduceLROnPlateau
Cyclical learning rates
Improves convergence stability.
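Step decay, for example, is a one-line rule; the initial rate and drop schedule below are illustrative:

```python
def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the initial rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

# Rate at epochs 0, 10 and 20 with alpha0 = 0.1: it halves each time.
rates = [step_decay(0.1, e) for e in (0, 10, 20)]
```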
Regularization-Based Optimization#
Used to reduce overfitting.
Methods:
L1/L2 regularization
Dropout
Early stopping
Batch normalization
Summary Table#

| Category | Techniques | Use Cases |
|---|---|---|
| First-order | GD, SGD, Momentum, RMSProp, Adam | Deep learning |
| Second-order | Newton, BFGS | Classical ML, small data |
| Gradient-free | GA, PSO, Annealing | Black-box optimization |
| Bayesian | GP + EI/UCB | Hyperparameter tuning |
| Convex | QP, L-BFGS | Linear/Logistic regression, SVM |
| Regularization | L1/L2, dropout | Reduce overfitting |