Gradient Descent#

Gradient Descent is an optimization algorithm that minimizes a cost function by iteratively updating model parameters (θ) in the direction of the negative gradient. Its variants differ mainly in how much data they use at each update step.


Types of Gradient Descent#

Batch Gradient Descent#

  • Definition: Uses the entire training dataset to compute the gradient of the cost function in each iteration.

  • Formula:

    \[ \theta := \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) \]

    where \(m\) = number of training examples.

  • Pros:

    • Converges to the global minimum for convex functions (like linear regression).

    • Stable updates.

  • Cons:

    • Very slow for large datasets.

    • Requires huge memory since it must load all data at once.

  • ✅ Best suited for small to medium datasets.
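The batch update rule above can be sketched with NumPy for a simple linear-regression cost. This is a minimal illustration, not a reference implementation: the synthetic data, the learning rate `alpha`, and the iteration count are arbitrary choices for demonstration.

```python
import numpy as np

# Batch gradient descent on a least-squares cost (illustrative sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=100)  # true params: bias 2, slope 3

Xb = np.c_[np.ones(len(X)), X]   # prepend a bias column
theta = np.zeros(2)
alpha = 0.1                      # learning rate (arbitrary)
m = len(y)                       # number of training examples

for _ in range(500):
    # Average gradient over ALL m examples -- the defining feature of Batch GD.
    grad = Xb.T @ (Xb @ theta - y) / m
    theta -= alpha * grad

print(theta)  # converges to roughly [2.0, 3.0] (bias, slope)
```

Note that every iteration touches the full dataset, which is exactly why this variant is stable but slow on large data.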


Stochastic Gradient Descent (SGD)#

  • Definition: Updates parameters for each training example one at a time.

  • Formula:

    \[ \theta := \theta - \alpha \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) \]

  • Pros:

    • Much faster (frequent updates).

    • Can escape local minima due to noisy updates.

  • Cons:

    • Updates are noisy → cost function fluctuates rather than smoothly converging.

    • Harder to reach exact global minimum (oscillates around it).

  • ✅ Best for very large datasets or online learning.
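The single-example update can be sketched the same way. Again, the data, learning rate, and epoch count are illustrative assumptions; the key difference from the batch version is that each parameter update uses exactly one training example.

```python
import numpy as np

# Stochastic gradient descent: one example per update (illustrative sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)  # true params: bias 2, slope 3

Xb = np.c_[np.ones(len(X)), X]   # prepend a bias column
theta = np.zeros(2)
alpha = 0.01                     # smaller step to tame the noisy updates

for epoch in range(20):
    for i in rng.permutation(len(y)):          # shuffle the data each epoch
        # Gradient from a SINGLE example -- fast but noisy.
        grad = (Xb[i] @ theta - y[i]) * Xb[i]
        theta -= alpha * grad

print(theta)  # oscillates near [2.0, 3.0] rather than settling exactly
```

Shuffling each epoch is standard practice: visiting examples in a fixed order can bias the trajectory of the updates.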


Mini-Batch Gradient Descent#

  • Definition: A compromise between Batch and SGD. Uses small random subsets (mini-batches) of the data to update parameters.

  • Formula:

    \[ \theta := \theta - \alpha \cdot \frac{1}{b} \sum_{i=1}^{b} \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) \]

    where \(b\) = mini-batch size (e.g., 32, 64, 128).

  • Pros:

    • Faster and more efficient than pure Batch.

    • Less noisy than SGD.

    • Can leverage vectorization (parallel processing on GPUs).

  • Cons:

    • Choosing batch size is tricky (too small → noisy, too large → slow).

  • ✅ Best for deep learning and neural networks.
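The mini-batch variant follows the same pattern, averaging the gradient over `b` examples at a time. As before, this is a sketch with arbitrary choices for the data, `alpha`, and batch size `b`.

```python
import numpy as np

# Mini-batch gradient descent: average the gradient over b examples per update.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=256)  # true params: bias 2, slope 3

Xb = np.c_[np.ones(len(X)), X]   # prepend a bias column
theta = np.zeros(2)
alpha, b = 0.05, 32              # learning rate and mini-batch size (arbitrary)

for epoch in range(50):
    idx = rng.permutation(len(y))            # shuffle, then slice into batches
    for start in range(0, len(y), b):
        batch = idx[start:start + b]
        # Vectorized gradient over the mini-batch -- one matrix product per update,
        # which is what lets GPUs process this efficiently.
        grad = Xb[batch].T @ (Xb[batch] @ theta - y[batch]) / len(batch)
        theta -= alpha * grad

print(theta)  # converges to roughly [2.0, 3.0] (bias, slope)
```

The inner gradient is a single matrix product over the batch, which is the vectorization advantage the pros list refers to.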

Comparison Summary#

| Type | Update Frequency | Speed | Stability | Use Case |
| --- | --- | --- | --- | --- |
| Batch GD | After full dataset | Slow | Very stable | Small datasets |
| Stochastic GD | After each data point | Fast per step | Noisy | Large datasets, online learning |
| Mini-Batch GD | After subset (batch) | Fast + efficient | Balanced | Deep learning |