Convergence Algorithm#
The convergence algorithm is an iterative process used in gradient descent to update model parameters (\(\theta_0, \theta_1, \dots\)) until the cost function (error) reaches its minimum value.
In linear regression, convergence means reaching the global minimum of the cost function (Mean Squared Error), which gives us the best-fit line.
The Update Rule (Core Formula)#
\[\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\]
Where:
\(\theta_j\) → parameter (intercept, slope, etc.)
\(\alpha\) → learning rate (step size)
\(\frac{\partial J(\theta)}{\partial \theta_j}\) → derivative of cost function w.r.t parameter (i.e., slope of the curve)
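As a minimal sketch, one application of this update rule for simple linear regression might look like the following (the data points and learning rate here are made up purely for illustration):

```python
import numpy as np

# Toy 1-D data (illustrative values only)
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(X)

theta0, theta1 = 0.0, 0.0   # current parameters
alpha = 0.1                 # learning rate (step size)

predictions = theta0 + theta1 * X
errors = predictions - y

# Partial derivatives of J(theta) w.r.t. each parameter
d_theta0 = (1 / m) * np.sum(errors)       # dJ/d(theta0)
d_theta1 = (1 / m) * np.sum(errors * X)   # dJ/d(theta1)

# The update rule: theta_j := theta_j - alpha * dJ/d(theta_j)
theta0 -= alpha * d_theta0
theta1 -= alpha * d_theta1
print(theta0, theta1)
```

Repeating this step many times is what the full gradient-descent loop later in this section does.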
Intuition Behind It#
Cost Function Curve → looks like a bowl (U-shape).
Derivative (Slope) tells direction:
If slope is positive → decrease parameter.
If slope is negative → increase parameter.
By repeatedly adjusting parameters in the opposite direction of the slope, the algorithm slides down the bowl toward the global minimum.
Role of Learning Rate (\(\alpha\))#
Small \(\alpha\) → tiny steps → slow convergence.
Large \(\alpha\) → big jumps → may overshoot the minimum and never converge.
Balanced \(\alpha\) → smooth, efficient convergence.
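These three regimes can be sketched on the toy cost \(J(\theta) = \theta^2\), whose gradient is \(2\theta\) (the specific \(\alpha\) values are illustrative):

```python
# Effect of the learning rate on J(theta) = theta**2, starting at theta = 1.
# The gradient is 2*theta, so each step does: theta := theta - alpha * 2 * theta
def run(alpha, steps=20, theta=1.0):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(run(0.01))   # small alpha: after 20 steps still far from the minimum at 0
print(run(0.4))    # balanced alpha: essentially at the minimum
print(run(1.1))    # too large: each step overshoots, |theta| grows (diverges)
```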
When to Stop (Convergence Criteria)#
We say the algorithm has converged when:
The change in cost function between iterations is negligible (e.g., < 0.0001).
Or, after a fixed number of iterations.
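The two stopping rules can be sketched together like this (the tolerance, iteration cap, and variable names are illustrative; the toy data matches the full example in this section):

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 1.9, 3.0, 3.9, 5.1])
m = len(X)
theta0, theta1, alpha, tol = 0.0, 0.0, 0.01, 1e-4

prev_cost = np.inf
for i in range(10_000):                 # rule 2: fixed cap on iterations
    errors = theta0 + theta1 * X - y
    cost = (1 / (2 * m)) * np.sum(errors ** 2)
    if prev_cost - cost < tol:          # rule 1: negligible improvement -> stop
        break
    prev_cost = cost
    theta0 -= alpha * (1 / m) * np.sum(errors)
    theta1 -= alpha * (1 / m) * np.sum(errors * X)

print(f"stopped after {i} iterations, cost = {cost:.5f}")
```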
Visual Intuition#
Imagine standing on a hill (cost function surface).
Each step, you check the slope of the ground.
Walk in the downhill direction with small, controlled steps.
Eventually, you reach the lowest valley (global minimum).
✅ In short: The convergence algorithm in gradient descent optimizes parameters by repeatedly moving them in the direction that reduces error, until the cost function cannot be minimized further.
import numpy as np
import matplotlib.pyplot as plt
# Create synthetic data
X = np.array([1, 2, 3, 4, 5])
y = np.array([1.2, 1.9, 3.0, 3.9, 5.1])
m = len(X)
# Initialize theta values
theta0, theta1 = 0.0, 0.0
alpha = 0.01
iterations = 100
# Store history
theta0_hist = []
theta1_hist = []
cost_hist = []
# Hypothesis function
def hypothesis(X, theta0, theta1):
    return theta0 + theta1 * X
# Cost function (MSE)
def compute_cost(X, y, theta0, theta1):
    predictions = hypothesis(X, theta0, theta1)
    errors = predictions - y
    return (1 / (2 * m)) * np.sum(errors ** 2)
# Gradient Descent
for _ in range(iterations):
    predictions = hypothesis(X, theta0, theta1)
    errors = predictions - y
    # Compute gradients
    d_theta0 = (1 / m) * np.sum(errors)
    d_theta1 = (1 / m) * np.sum(errors * X)
    # Update parameters
    theta0 -= alpha * d_theta0
    theta1 -= alpha * d_theta1
    # Record history
    theta0_hist.append(theta0)
    theta1_hist.append(theta1)
    cost_hist.append(compute_cost(X, y, theta0, theta1))
# Create figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
# Plot cost over iterations (convergence)
axs[0].plot(range(iterations), cost_hist, color='blue')
axs[0].set_title("Convergence of Cost Function")
axs[0].set_xlabel("Iterations")
axs[0].set_ylabel("Cost (MSE)")
# Plot gradient descent path in parameter space
axs[1].plot(theta0_hist, theta1_hist, marker='o', color='red')
axs[1].set_title("Gradient Descent Path in Parameter Space")
axs[1].set_xlabel("theta0")
axs[1].set_ylabel("theta1")
plt.tight_layout()
plt.show()
Global Minima vs Local Minima#
Global Minimum#
The lowest possible point of a cost function across its entire domain.
In optimization (like regression), reaching the global minimum means we’ve found the best solution:
The cost (error) is the smallest possible.
Our model is optimally fitted to the data.
👉 Example: In linear regression, the cost function (MSE) is convex (a U-shaped parabola in one parameter, a bowl in two).
A convex function has exactly one minimum, which is therefore the global minimum.
That’s why gradient descent in linear regression always converges to the global minimum (given a suitable learning rate).
Local Minima#
Points where the function value is lower than nearby points, but not the lowest overall.
Think of a “valley” in a mountain range — it’s lower than surrounding hills but not the deepest valley overall.
👉 Example: In more complex models (like neural networks), the cost surface may have multiple local minima.
Gradient descent might get “stuck” in one of these valleys instead of reaching the global minimum.
That’s why techniques like momentum, the Adam optimizer, and random restarts are used in deep learning.
Visual Intuition#
Imagine a ball rolling downhill on a curve (cost function).
If the curve is convex (a parabola), it always reaches the global minimum.
If the curve has multiple dips, the ball may stop in a local minimum instead of reaching the lowest (global) point.
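This start-dependence can be sketched with plain gradient descent on the same quartic loss plotted below (the step size, starting points, and iteration count are illustrative); keeping the best result over several starting points is the random-restart idea mentioned above:

```python
def loss(theta):
    return theta**4 - 4 * theta**2 + 0.5 * theta + 5

def grad(theta):  # derivative of the loss above
    return 4 * theta**3 - 8 * theta + 0.5

def descend(theta, alpha=0.01, steps=2000):
    for _ in range(steps):
        theta -= alpha * grad(theta)
    return theta

print(descend(-2.0))   # starts left  -> rolls into the global minimum (theta ≈ -1.44)
print(descend(+2.0))   # starts right -> stuck in the local minimum  (theta ≈ +1.38)

# Random-restart idea: run from several starts, keep the one with lowest loss
best = min((descend(t) for t in [-2.0, 0.0, 2.0]), key=loss)
```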
import numpy as np
import matplotlib.pyplot as plt
# Define the loss function: Loss(θ) = θ^4 - 4θ^2 + 0.5θ + 5
def loss_function(theta):
    return theta**4 - 4*theta**2 + 0.5*theta + 5
# Generate theta values
theta = np.arange(-2.5, 2.51, 0.01)
loss = loss_function(theta)
# Find approximate minima for annotation
global_min_theta = -1.44  # shifted slightly from -sqrt(2) by the 0.5*theta term
global_min_loss = loss_function(global_min_theta)  # ≈ 0.29
local_min_theta = 1.38
local_min_loss = loss_function(local_min_theta)  # ≈ 1.70
# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(theta, loss, label='Loss Function', color='indigo', linewidth=2)
# Mark global minimum
plt.scatter([global_min_theta], [global_min_loss], color='green', s=100, label='Global Minimum')
plt.annotate('Global Minimum', xy=(global_min_theta, global_min_loss),
             xytext=(global_min_theta - 1, global_min_loss + 5),
             arrowprops=dict(facecolor='green', shrink=0.05))
# Mark local minimum
plt.scatter([local_min_theta], [local_min_loss], color='orange', s=100, label='Local Minimum')
plt.annotate('Local Minimum', xy=(local_min_theta, local_min_loss),
             xytext=(local_min_theta + 0.5, local_min_loss + 5),
             arrowprops=dict(facecolor='orange', shrink=0.05))
# Customize the plot
plt.title('Loss Function with Global and Local Minima', fontsize=14)
plt.xlabel('Parameter (θ)', fontsize=12)
plt.ylabel('Loss (Error)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
# Show the plot
plt.show()