Workflows#
1. Input Preparation#
Data Formatting
Input data is converted into DMatrix (optimized internal structure).
Handles sparse data efficiently (compressed column storage).
Labels & Weights
The target variable \(y\), instance weights, and an optional missing-value marker are stored alongside the features.
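As a concrete illustration, here is a minimal sketch of building a DMatrix from a dense NumPy array and from a SciPy sparse matrix (the arrays and values are made up for the example):

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# Dense features with a missing entry, binary labels, per-row weights
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 1])
w = np.array([1.0, 2.0, 1.0])

# DMatrix converts the input into XGBoost's optimized internal format;
# `missing` tells it which value marks absent entries
dtrain = xgb.DMatrix(X, label=y, weight=w, missing=np.nan)

# Sparse input (here CSR) is accepted natively and stored compactly
X_sparse = sp.random(100, 20, density=0.1, format="csr", random_state=0)
d_sparse = xgb.DMatrix(X_sparse, label=np.random.randint(0, 2, 100))
```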
2. Initialization#
Start with a base prediction:
For regression → mean of targets.
For classification → log-odds of positive class.
This becomes the initial model \(f_0(x)\).
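A quick NumPy sketch of both base predictions (the toy arrays are my own; in the library the baseline corresponds to the `base_score` parameter):

```python
import numpy as np

# Regression: baseline is the mean of the targets
y_reg = np.array([2.0, 3.5, 4.0, 1.5])
f0_reg = y_reg.mean()                 # 2.75

# Binary classification: baseline is the log-odds of the positive class
y_clf = np.array([0, 1, 1, 1])
p = y_clf.mean()                      # 0.75
f0_clf = np.log(p / (1 - p))          # log(3) ≈ 1.0986
```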
3. Iterative Boosting Rounds#
For each boosting round \(t = 1, 2, \dots, T\):
(a) Compute Gradients & Hessians#
For each data point \(i\), compute:
\[ g_i = \frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}, \quad h_i = \frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2} \]
\(g_i\) = gradient (slope of the loss; the model steps in its negative direction).
\(h_i\) = Hessian (curvature, helps with step size).
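As a concrete instance, here is a NumPy sketch of \(g_i\) and \(h_i\) for squared-error loss \(L = \tfrac{1}{2}(y - \hat{y})^2\) and for logistic loss on raw scores:

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    # L = 0.5 * (y - y_hat)^2  ->  g = y_hat - y,  h = 1
    return y_hat - y, np.ones_like(y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_hess_logistic(y, y_hat_margin):
    # y_hat_margin is the raw score (log-odds); p = sigmoid(margin)
    # logistic loss  ->  g = p - y,  h = p * (1 - p)
    p = sigmoid(y_hat_margin)
    return p - y, p * (1.0 - p)
```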
(b) Build a Decision Tree#
At each node, evaluate possible splits using Gain:
\[ \text{Gain} = \frac{1}{2}\left(\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right) - \gamma \]
where:
\(G, H\) = sums of gradients & Hessians over the instances in each child (subscripts \(L, R\) for left and right).
\(\lambda\) = L2 regularization.
\(\gamma\) = minimum loss reduction required.
If Gain > 0, the split is made (the \(\gamma\) term already charges the cost of adding a leaf).
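A direct transcription of the Gain formula into a helper function (the names are my own, not the library's internals):

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one node into left/right children."""
    def score(G, H):
        # quality of a leaf holding gradient sum G and Hessian sum H
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```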
(c) Pruning / Stopping Splits#
Stop splitting if:
Depth > max_depth.
Gain < gamma.
A child's Hessian sum falls below min_child_weight.
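These stopping criteria map one-to-one onto training parameters; a configuration sketch (the values are arbitrary):

```python
params = {
    "max_depth": 6,         # hard cap on tree depth
    "gamma": 1.0,           # minimum Gain required to keep a split
    "min_child_weight": 5,  # minimum Hessian sum allowed in a child
}
```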
(d) Assign Leaf Weights#
For each leaf, compute optimal weight:
\[ w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda} \]
This weight minimizes the second-order approximation of the loss within that leaf.
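The weight follows from minimizing the per-leaf second-order objective \(G w + \tfrac{1}{2}(H + \lambda) w^2\) over \(w\); as a sketch:

```python
import numpy as np

def leaf_weight(g, h, lam):
    # w* = -sum(g) / (sum(h) + lambda), the minimizer of G*w + 0.5*(H + lam)*w^2
    return -np.sum(g) / (np.sum(h) + lam)
```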
(e) Update Model#
Update predictions:
\[ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i) \]
where:
\(f_t(x)\) = new tree’s output.
\(\eta\) = learning rate (shrinkage factor).
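A one-line sketch of the shrinkage update, with made-up arrays standing in for the state after round \(t-1\) and the new tree's outputs (in the library, \(\eta\) is the `eta` / `learning_rate` parameter):

```python
import numpy as np

eta = 0.3                                # learning rate (shrinkage)
y_hat_prev = np.array([0.0, 0.0, 0.0])   # predictions after round t-1
f_t = np.array([0.4, -0.1, 0.2])         # hypothetical outputs of the new tree f_t

y_hat = y_hat_prev + eta * f_t           # y_hat^(t) = y_hat^(t-1) + eta * f_t(x)
```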
4. Regularization#
Built-in penalties:
L1 (α) → sparsity, feature selection.
L2 (λ) → weight shrinkage.
Prevents overfitting by discouraging overly complex trees.
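In the library these penalties are exposed as the `alpha` and `lambda` parameters (aliased `reg_alpha` / `reg_lambda` in the scikit-learn wrapper); a sketch:

```python
params = {
    "alpha": 0.1,    # L1 penalty on leaf weights -> drives some weights to zero
    "lambda": 1.0,   # L2 penalty on leaf weights -> shrinks them (default is 1)
}
```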
5. Prediction#
After T boosting rounds:
\[ \hat{y}_i = f_0(x_i) + \sum_{t=1}^T \eta\, f_t(x_i) \]
Apply a sigmoid for classification, or use the raw score directly for regression.
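Assuming a Booster `bst` trained with objective `binary:logistic` and a test DMatrix `dtest`, the raw additive score and the final probability relate as follows:

```python
import numpy as np

margin = bst.predict(dtest, output_margin=True)  # f_0 + sum_t eta * f_t(x), before the link
prob = 1.0 / (1.0 + np.exp(-margin))             # sigmoid; matches bst.predict(dtest)
```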
6. Model Evaluation#
Use evaluation metrics (logloss, rmse, auc, etc.) on train and validation sets.
Early stopping can be applied if the validation loss stops improving.
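A sketch of wiring metrics and early stopping into training, assuming `dtrain` and `dval` DMatrix objects already exist:

```python
import xgboost as xgb

params = {"objective": "binary:logistic", "eval_metric": "logloss", "eta": 0.1}

bst = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, "train"), (dval, "val")],  # both sets are scored each round
    early_stopping_rounds=20,                  # halt if val logloss stalls for 20 rounds
)
```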
Workflow Summary (Simplified Pipeline)
Input data → DMatrix
Initialize model (baseline prediction)
For each boosting round:
Compute gradients & Hessians
Build tree using Gain formula
Stop splitting if conditions met
Assign leaf weights
Update model with learning rate
Apply regularization
Repeat until max_rounds or early stopping
Final model = sum of all trees
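To tie the whole pipeline together, here is a self-contained toy version using depth-1 trees (stumps) and logistic loss. It is a didactic sketch of the workflow above, not the library's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(X, g, h, lam=1.0, gamma=0.0):
    """Depth-1 tree: pick the best (feature, threshold) split by the Gain formula."""
    G, H = g.sum(), h.sum()
    best = None  # (gain, feature, threshold, w_left, w_right)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # candidate split points
            left = X[:, j] <= thr
            G_L, H_L = g[left].sum(), h[left].sum()
            G_R, H_R = G - G_L, H - H_L
            gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                          - G**2 / (H + lam)) - gamma
            if gain > 0 and (best is None or gain > best[0]):
                # optimal leaf weights: w* = -G / (H + lambda)
                best = (gain, j, thr, -G_L / (H_L + lam), -G_R / (H_R + lam))
    return best

def predict_stump(stump, X):
    _, j, thr, w_L, w_R = stump
    return np.where(X[:, j] <= thr, w_L, w_R)

def boost(X, y, rounds=50, eta=0.3):
    p = y.mean()
    y_hat = np.full(len(y), np.log(p / (1 - p)))     # f_0: log-odds baseline
    trees = []
    for _ in range(rounds):
        prob = sigmoid(y_hat)
        g, h = prob - y, prob * (1 - prob)           # logistic gradients / Hessians
        stump = fit_stump(X, g, h)
        if stump is None:                            # no split with positive Gain: stop
            break
        y_hat += eta * predict_stump(stump, X)       # shrinkage update
        trees.append(stump)
    return trees, sigmoid(y_hat)
```

Each returned stump plays the role of one \(f_t\); adding \(\eta\) times its output onto the \(f_0\) baseline reproduces the prediction formula from step 5.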
Key Intuition
XGBoost is gradient descent on trees: it learns residual patterns iteratively, but with second-order accuracy (gradients + Hessians), making it faster and more precise than AdaBoost or classic gradient boosting.