Gradient Boosting#
Gradient Boosting is an ensemble method from the boosting family (as opposed to bagging, which trains models in parallel): models (usually decision trees) are trained sequentially, and each new model tries to correct the errors of the previous ones.
But instead of re-weighting misclassified points (as AdaBoost does), Gradient Boosting uses the idea of Gradient Descent on a loss function.
1. Intuition: “Boosting as Gradient Descent”#
Think of your loss function \(J(y, \hat{y})\), e.g.:
MSE for regression:
\[ J = \frac{1}{2m} \sum_i (y_i - \hat{y}_i)^2 \]
Log Loss for classification.
Gradient Descent updates parameters by moving opposite to the gradient of the loss.
In Gradient Boosting, instead of directly updating parameters, we fit a new weak learner (tree) to predict the negative gradient of the loss (the “residuals”).
👉 So Gradient Boosting = Fitting trees to the gradient of errors step by step.
2. The Workflow#
1. **Initialize.** Start with a simple model, e.g. predict the mean of \(y\):

\[ F_0(x) = \arg\min_c \sum_i L(y_i, c) \]

2. **Compute pseudo-residuals.** For iteration \(m\), compute the negative gradient of the loss wrt the predictions:

\[ r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} \]

These \(r_{im}\) are the “errors” we want to fix.

3. **Fit a weak learner.** Train a weak model \(h_m(x)\) (usually a shallow decision tree) to predict these residuals.

4. **Update the model.** Update the ensemble prediction:

\[ F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x) \]

where \(\nu\) is the learning rate (step size, as in gradient descent).

5. **Repeat.** Continue until convergence (or until a maximum number of trees is reached).
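This loop is short enough to sketch from scratch. Below is a minimal illustration for regression with squared error, using scikit-learn's `DecisionTreeRegressor` as the weak learner; the names (`F`, `nu`, `n_trees`, `predict`) are our own illustrative choices, not a library API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

nu = 0.1        # learning rate
n_trees = 100   # number of boosting rounds
trees = []

# Step 1: initialize with the loss minimizer (the mean, for squared error)
f0 = y.mean()
F = np.full_like(y, f0)

for m in range(n_trees):
    # Step 2: pseudo-residuals = negative gradient of 1/2 (y - F)^2 wrt F
    residuals = y - F
    # Step 3: fit a shallow tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)
    trees.append(tree)
    # Step 4: move the ensemble a step of size nu toward the new tree
    F += nu * tree.predict(X)

def predict(X_new):
    """Ensemble prediction: initial constant plus the shrunken trees."""
    return f0 + nu * sum(t.predict(X_new) for t in trees)

print("final training MSE:", np.mean((y - F) ** 2))
```

Watching the training MSE shrink round by round is a good way to see the "gradient descent" behavior directly.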
3. Math Example (Regression with MSE)#
Loss:
\[ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 \]

Gradient wrt prediction:

\[ \frac{\partial L}{\partial \hat{y}} = \hat{y} - y \]

Negative gradient (pseudo-residual):

\[ r = y - \hat{y} \]
👉 So each new tree is simply trained to predict the residuals.
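A quick numerical sanity check of this identity (the values below are made up for illustration):

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0])      # true targets (made-up values)
y_hat = np.array([2.5, 0.0, 2.0])   # current ensemble predictions

# Gradient of L = 1/2 (y - y_hat)^2 wrt y_hat is (y_hat - y),
# so the negative gradient is exactly the residual y - y_hat.
grad = y_hat - y
print(-grad)        # [ 0.5 -1.   0. ]
print(y - y_hat)    # identical
```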
4. Connection to Gradient Descent#
In classical gradient descent, we directly adjust weights:
\[ \theta := \theta - \alpha \nabla_\theta J(\theta) \]

In gradient boosting, we adjust the function (the model) itself by adding trees:

\[ F_m(x) = F_{m-1}(x) - \alpha \, \nabla J(F_{m-1}(x)) \]

(but instead of taking this step directly, we fit a new weak learner to approximate the negative gradient).
So: 👉 Gradient Boosting = Gradient Descent in Function Space.
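To make this concrete with the MSE case from section 3: there the negative functional gradient at each training point is the residual, so the ideal function-space step would be

\[ F_m(x_i) = F_{m-1}(x_i) + \nu \left( y_i - F_{m-1}(x_i) \right) \]

This rule is only defined at the training points, which is exactly why we fit a tree \(h_m\) to the residuals: it generalizes the step to unseen \(x\).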
5. Key Hyperparameters in Gradient Boosting#
Number of Trees: more trees → better training fit (but a growing risk of overfitting).
Learning Rate (\(\nu\)): smaller values shrink each tree's contribution, so more trees are needed, but generalization usually improves.
Tree Depth: controls the complexity of each weak learner.
Subsampling: train each tree on a random fraction of the data (this variant is called Stochastic Gradient Boosting).
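All four knobs map directly onto scikit-learn's `GradientBoostingRegressor`; the specific values below are illustrative, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=300,    # number of trees
    learning_rate=0.05,  # nu: shrinks each tree's contribution
    max_depth=3,         # depth of each weak learner
    subsample=0.8,       # <1.0 gives stochastic gradient boosting
    random_state=0,
)
model.fit(X_tr, y_tr)
print("R^2 on held-out data:", model.score(X_te, y_te))
```

In practice, learning rate and number of trees are tuned together: halving \(\nu\) roughly doubles the number of trees needed.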
6. Famous Implementations#
GBM (Gradient Boosting Machine)
XGBoost (eXtreme Gradient Boosting; optimized for speed and regularization)
LightGBM (uses leaf-wise tree growth; very fast on large data)
CatBoost (handles categorical features efficiently)
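As a taste of these APIs, here is a minimal XGBoost sketch (assuming the `xgboost` package is installed); LightGBM and CatBoost expose very similar scikit-learn-style estimators.

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# XGBRegressor follows the scikit-learn estimator interface
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
)
model.fit(X, y)
print(model.predict(X[:5]))
```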
Summary#
Gradient Boosting combines boosting + gradient descent.
Instead of updating weights directly, it adds weak learners trained on gradients (residuals) of the loss.
Think of it as “building a model by taking gradient descent steps in function space.”