Workflows


1. Input Preparation

  • Data Formatting

    • Input data is converted into a DMatrix (XGBoost's optimized internal data structure).

    • Handles sparse data efficiently (compressed column storage).

  • Labels & Weights

    • The target variable \(y\), per-instance weights, and an optional missing-value marker are stored alongside the features; a minimal sketch follows.
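
A minimal sketch of this step with the xgboost Python package (the synthetic data and shapes are purely illustrative):

```python
import numpy as np
import xgboost as xgb

# Illustrative data: 100 rows, 5 features, binary labels, unit weights
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
w = np.ones(100)

# DMatrix stores features, labels, weights, and a missing-value marker
dtrain = xgb.DMatrix(X, label=y, weight=w, missing=np.nan)
```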


2. Initialization

  • Start with a base prediction:

    • For regression → the mean of the targets.

    • For classification → the log-odds of the positive class.

  • This becomes the initial model \(f_0(x)\); a quick numeric check follows.
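
A quick numeric sketch of both baselines (the arrays are made-up examples):

```python
import numpy as np

# Regression: baseline is the mean target
y_reg = np.array([3.0, 5.0, 7.0])
f0_reg = y_reg.mean()              # 5.0

# Binary classification: baseline is the log-odds of the positive class
y_cls = np.array([0, 1, 1, 1])
p = y_cls.mean()                   # 0.75
f0_cls = np.log(p / (1.0 - p))     # log(0.75 / 0.25) ≈ 1.0986
```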


3. Iterative Boosting Rounds

For each boosting round \(t = 1, 2, \dots, T\):

(a) Compute Gradients & Hessians

  • For each data point \(i\), compute:

    \[ g_i = \frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}, \quad h_i = \frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2} \]

  • \(g_i\) = gradient of the loss (its negative points in the direction of steepest loss decrease).

  • \(h_i\) = Hessian (curvature, used to scale the step size).
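
For concreteness, here is a sketch of these derivatives for the logistic loss, the same \((g, h)\) pair a custom XGBoost objective would return (the function name is illustrative):

```python
import numpy as np

def logloss_grad_hess(y_true, raw_score):
    """Gradient and Hessian of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # sigmoid: raw score -> probability
    g = p - y_true                        # g_i: first derivative
    h = p * (1.0 - p)                     # h_i: second derivative (curvature)
    return g, h
```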


(b) Build a Decision Tree

  • At each node, evaluate possible splits using Gain:

    \[ \text{Gain} = \frac{1}{2}\left(\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right) - \gamma \]

    where:

    • \(G_L, H_L\) and \(G_R, H_R\) = sums of gradients & Hessians over the left and right children.

    • \(\lambda\) = L2 regularization.

    • \(\gamma\) = minimum loss reduction required.

  • If Gain > 0 (equivalently, the unregularized improvement exceeds \(\gamma\)), the split is made; a direct transcription of the formula is sketched below.
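
A direct transcription of the Gain formula (the helper name and default values are illustrative):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting a node into (L, R); lam = lambda (L2), gamma = split penalty."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```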


(c) Pruning / Stopping Splits

  • Stop splitting a node if any of the following hold (a simplified check is sketched after this list):

    • The tree has reached max_depth.

    • The best split's Gain ≤ 0 (i.e., its raw improvement is below \(\gamma\)).

    • A child's Hessian sum would fall below min_child_weight.
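
These rules can be mirrored in a simplified check (a sketch of the logic, not XGBoost's actual internals):

```python
def should_stop(depth, best_gain, child_hess_sum,
                max_depth=6, min_child_weight=1.0):
    """Return True if the node should not be split further (simplified sketch)."""
    if depth >= max_depth:
        return True                       # depth limit reached
    if best_gain <= 0:                    # gamma is already subtracted inside Gain
        return True
    if child_hess_sum < min_child_weight:
        return True                       # a child's Hessian sum would be too small
    return False
```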


(d) Assign Leaf Weights

  • For each leaf, compute optimal weight:

    \[ w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda} \]

  • This weight minimizes the regularized loss within that leaf; in code it is a one-liner (see below).
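
A one-line implementation of the formula above (g and h are the gradient/Hessian arrays of the points in the leaf):

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -np.sum(g) / (np.sum(h) + lam)
```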


(e) Update Model

  • Update predictions:

    \[ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i) \]

    where:

    • \(f_t(x)\) = new tree’s output.

    • \(\eta\) = learning rate (shrinkage factor).
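
The update is a single vectorized step (illustrative arrays; in practice \(f_t(x_i)\) comes from the tree just built):

```python
import numpy as np

eta = 0.1                               # learning rate (shrinkage factor)
y_pred_prev = np.zeros(4)               # \hat{y}^{(t-1)} for 4 rows
f_t = np.array([0.3, -0.2, 0.5, 0.1])   # new tree's output per row
y_pred = y_pred_prev + eta * f_t        # \hat{y}^{(t)}
```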


4. Regularization

  • Built-in penalties:

    • L1 (α) → sparsity, feature selection.

    • L2 (λ) → weight shrinkage.

  • Prevents overfitting by discouraging overly complex trees; typical parameter settings are sketched below.
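
In the Python API these penalties are ordinary training parameters (the values here are arbitrary examples):

```python
params = {
    "alpha": 0.1,    # L1 penalty: encourages sparse leaf weights
    "lambda": 1.0,   # L2 penalty: shrinks leaf weights
    "gamma": 0.1,    # minimum loss reduction required to make a split
}
```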


5. Prediction

  • After T boosting rounds:

    \[ \hat{y}_i = f_0(x_i) + \sum_{t=1}^{T} \eta\, f_t(x_i) \]

  • Apply the sigmoid to get probabilities (classification) or use the raw score as-is (regression); a sketch follows.
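
A sketch of the final transformation (the raw scores are made up):

```python
import numpy as np

raw = np.array([-1.2, 0.4, 2.0])     # base prediction + eta-scaled tree outputs
prob = 1.0 / (1.0 + np.exp(-raw))    # sigmoid for binary classification
# For regression, `raw` is used directly as the prediction.
```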


6. Model Evaluation

  • Use evaluation metrics (logloss, rmse, auc, etc.) on train and validation sets.

  • Apply early stopping when the validation metric stops improving for a set number of rounds; an end-to-end example follows.
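
An end-to-end sketch tying steps 1–6 together with early stopping (synthetic data; parameter values are illustrative):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)

dtrain = xgb.DMatrix(X[:150], label=y[:150])
dvalid = xgb.DMatrix(X[150:], label=y[150:])

params = {
    "objective": "binary:logistic",
    "eta": 0.1,
    "max_depth": 4,
    "gamma": 0.1,
    "lambda": 1.0,
    "eval_metric": "logloss",
}
booster = xgb.train(
    params, dtrain,
    num_boost_round=200,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=20,    # stop if valid logloss stalls for 20 rounds
)
preds = booster.predict(dvalid)  # probabilities (sigmoid already applied)
```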


Workflow Summary (Simplified Pipeline)

  1. Input data → DMatrix

  2. Initialize model (baseline prediction)

  3. For each boosting round:

    • Compute gradients & Hessians

    • Build tree using Gain formula

    • Stop splitting if conditions met

    • Assign leaf weights

    • Update model with learning rate

  4. Regularization applied throughout (\(\lambda\), \(\gamma\), \(\alpha\) inside the Gain and leaf-weight formulas)

  5. Repeat until max_rounds or early stopping

  6. Final model = sum of all trees


Key Intuition

XGBoost is gradient descent on trees: it learns residual patterns iteratively, but with second-order accuracy (gradients plus Hessians), which makes it faster and more precise than AdaBoost or classical first-order gradient boosting.