Intuition#
1. Start with a naive model#
Begin with a constant prediction (e.g., mean of \(y\) in regression, log-odds in classification).
This is a crude first guess.
2. Look at the errors#
Compute the residuals:
In regression: the actual value minus the predicted value (\(y - \hat{y}\)).
In classification: the negative gradient of the loss (pseudo-residuals).
These residuals show what the model has not yet explained.
3. Fit a weak learner to the residuals#
Train a small tree on the residuals.
The tree asks: “What patterns in the features can explain the mistakes so far?”
Each learner focuses only on what is left unexplained.
4. Take a small corrective step#
Instead of fully trusting the new learner, we add it in with a small weight (learning rate).
This is analogous to taking a small step in the negative gradient direction when minimizing a loss function.
5. Repeat#
Again compute residuals, fit another tree, update predictions.
Each iteration nudges the model closer to the target.
Over many iterations, small corrections accumulate into a strong predictor.
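Below is a minimal from-scratch sketch of steps 1–5 for squared-error regression, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the toy data, tree depth, learning rate, and iteration count are illustrative assumptions, not tuned values.

```python
# A rough from-scratch sketch of steps 1-5 for squared-error regression.
# The toy data, tree depth, and learning rate are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
F = np.full_like(y, y.mean())                  # Step 1: constant prediction (mean of y)
trees = []

for m in range(100):
    residuals = y - F                          # Step 2: what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)  # Step 3: weak learner fit to residuals
    tree.fit(X, residuals)
    F += learning_rate * tree.predict(X)       # Step 4: small corrective step
    trees.append(tree)                         # Step 5: repeat

    if (m + 1) % 25 == 0:
        print(f"iteration {m + 1}: mean |residual| = {np.abs(y - F).mean():.4f}")
```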
Why “gradient”?#
The loss function defines an error surface.
Instead of computing residuals directly, gradient boosting fits learners to the negative gradient of the loss at each step.
This is identical to doing gradient descent, except instead of updating a parameter vector, we update a function (the predictor).
Geometric intuition#
Imagine trying to descend a mountain blindfolded.
You feel the slope (gradient) under your feet, then take a step in the downhill direction.
In boosting, each weak learner is like a tiny step downhill.
The sequence of learners traces a path toward the minimum of the loss function.
Example intuition (squared error regression)#
Step 1: Predict mean of \(y\). Residuals = \(y - \hat{y}\).
Step 2: Fit tree to residuals. It learns structure in what was missed.
Step 3: Add this tree’s prediction (scaled by learning rate).
Step 4: Now residuals are smaller, new tree corrects again.
Step 5: Repeat until residuals are close to random noise.
Key mental model#
Gradient boosting is “sequential residual correction guided by gradients.”
Each learner = a correction.
Learning rate = trust in each correction.
Many small corrections = strong final model.
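In library terms, these knobs map directly onto the usual hyperparameters. The sketch below uses scikit-learn's GradientBoostingRegressor; the synthetic data and parameter values are illustrative assumptions only.

```python
# A hedged sketch of the mental model's knobs in scikit-learn's
# GradientBoostingRegressor. Data and hyperparameter values are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(
    n_estimators=300,    # many small corrections
    learning_rate=0.05,  # trust placed in each correction
    max_depth=2,         # each correction is a weak learner (shallow tree)
).fit(X, y)

# staged_predict exposes the model after each correction, so we can watch
# the accumulated corrections drive the training error down.
for m, y_pred in enumerate(model.staged_predict(X), start=1):
    if m % 100 == 0:
        print(f"{m} trees: train MSE = {np.mean((y - y_pred) ** 2):.4f}")
```

As a rule of thumb, lowering the learning rate usually means more estimators are needed to reach the same training error.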
Mathematical intuition of Gradient Boosting: it is gradient descent in function space. Instead of minimizing loss by adjusting numeric parameters, we minimize it by iteratively adding functions.
Mathematical intuition#
1. Problem setup#
We want to approximate a function \(F(x)\) that predicts \(y\) from input \(x\). We define a differentiable loss function:
\[ L(y, F(x)) \]
and aim to minimize the total loss over the training data:
\[ J(F) = \sum_{i=1}^{n} L\big(y_i, F(x_i)\big) \]
2. Function space gradient descent#
In standard gradient descent, you update parameters:
\[ \theta_m = \theta_{m-1} - \nu \, \nabla_\theta J(\theta_{m-1}) \]
Gradient Boosting does the same, but in function space:
\[ F_m(x) = F_{m-1}(x) - \nu \, \nabla_F J(F_{m-1})(x) \]
Here:
\(\nabla_F J(F)(x)\) = derivative of the loss with respect to the prediction \(F(x)\).
The negative of this derivative, evaluated at each data point, gives the pseudo-residuals.
3. Pseudo-residuals#
For each sample \(i\), at stage \(m\):
\[ r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}} \]
These \(r_{im}\) are the directions in which the loss decreases most steeply.
4. Weak learner approximation#
We cannot directly set \(F(x) = F_{m-1}(x) + r_{im}\), because the pseudo-residuals are only defined at the training points. Instead, we fit a weak learner \(h_m(x)\) (a small decision tree) to approximate these residuals across the whole input space.
5. Line search#
To find how much of this weak learner to add, solve the one-dimensional line search:
\[ \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big) \]
6. Update rule#
Finally, update the model:
\[ F_m(x) = F_{m-1}(x) + \nu\, \gamma_m\, h_m(x) \]
\(\nu\) = learning rate (step size).
\(\gamma_m\) = optimal multiplier for the learner.
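Putting steps 3–6 together, here is a rough sketch of a single boosting stage for squared error; the helper names (`loss`, `negative_gradient`, `boosting_stage`) and the use of scipy.optimize.minimize_scalar for the line search are illustrative assumptions, not a canonical implementation.

```python
# Sketch of one boosting stage m: pseudo-residuals, weak learner,
# line search for gamma_m, then the shrunken update. Helper names and the
# choice of squared error loss are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def loss(y, F):
    return 0.5 * (y - F) ** 2           # L(y, F) = 1/2 (y - F)^2

def negative_gradient(y, F):
    return y - F                        # -dL/dF, the pseudo-residuals

def boosting_stage(X, y, F_prev, nu=0.1):
    r = negative_gradient(y, F_prev)                      # pseudo-residuals r_im
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)      # weak learner h_m
    h_pred = h.predict(X)

    # Line search: gamma_m = argmin_gamma sum_i L(y_i, F_prev(x_i) + gamma * h_m(x_i))
    gamma_m = minimize_scalar(
        lambda g: loss(y, F_prev + g * h_pred).sum()
    ).x

    F_new = F_prev + nu * gamma_m * h_pred                # update rule
    return F_new, h, gamma_m
```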
7. Iterative effect#
Each \(h_m(x)\) is a step in function space in the negative gradient direction.
The sum of many such steps produces a strong predictor:
\[ F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu\, \gamma_m\, h_m(x) \]
Concrete examples of gradients#
Regression (squared error loss, scaled by \(\tfrac{1}{2}\) so the negative gradient is exactly the residual):
\[ L(y, F(x)) = \tfrac{1}{2}\big(y - F(x)\big)^2 \]
Negative gradient:
\[ r_{im} = y_i - F_{m-1}(x_i) \quad \text{(ordinary residuals)} \]
Classification (logistic loss):
\[ L(y, F(x)) = \log\big(1 + e^{-yF(x)}\big), \quad y \in \{-1, +1\} \]
Negative gradient:
\[ r_{im} = \frac{y_i}{1 + e^{y_i F_{m-1}(x_i)}} \]
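As a sanity check, the sketch below computes both pseudo-residuals and compares them against a central finite-difference approximation of the loss; the function names and sample values are arbitrary illustrations.

```python
# Pseudo-residuals for the two losses above, with a finite-difference
# sanity check of the analytic gradients. Sample values are arbitrary.
import numpy as np

def pseudo_residuals_mse(y, F):
    # L = 1/2 (y - F)^2  ->  -dL/dF = y - F
    return y - F

def pseudo_residuals_logistic(y, F):
    # L = log(1 + exp(-y F)), y in {-1, +1}  ->  -dL/dF = y / (1 + exp(y F))
    return y / (1.0 + np.exp(y * F))

mse = lambda y, F: 0.5 * (y - F) ** 2
logistic = lambda y, F: np.log(1.0 + np.exp(-y * F))

y_reg, F_reg = np.array([1.5, -0.3]), np.array([1.0, 0.2])
y_clf, F_clf = np.array([1.0, -1.0]), np.array([0.4, 1.2])
eps = 1e-6

# Central finite difference of -L around F should match the analytic values.
print(pseudo_residuals_mse(y_reg, F_reg),
      -(mse(y_reg, F_reg + eps) - mse(y_reg, F_reg - eps)) / (2 * eps))
print(pseudo_residuals_logistic(y_clf, F_clf),
      -(logistic(y_clf, F_clf + eps) - logistic(y_clf, F_clf - eps)) / (2 * eps))
```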
Intuition summary#
At each stage, compute the negative gradient of the loss with respect to the current predictions.
Approximate this negative gradient with a weak learner.
Add it in with small weight.
Repeat → function gradually descends toward minimum of the loss.