Cost Functions#

XGBoost is not a single model but a framework that supports different cost functions (a.k.a. loss functions) depending on whether you’re solving regression or classification.


Cost Functions in XGBRegressor#

In regression, the task is to predict a continuous value. XGBRegressor uses differentiable loss functions that measure prediction error.

Common loss functions:#

  1. Squared Error (default)

    \[ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 \]
    • Penalizes larger errors more heavily.

    • Smooth and differentiable.

    • Works well when errors are normally distributed.

  2. Absolute Error (MAE)

    \[ L(y, \hat{y}) = |y - \hat{y}| \]
    • Robust to outliers (less penalty for extreme values).

    • Slower to optimize: the gradient is a constant ±1 and the second derivative is zero, so it fits the second-order boosting framework less naturally.

  3. Huber Loss (mix between MSE & MAE)

    \[\begin{split} L(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & |y - \hat{y}| > \delta \end{cases} \end{split}\]
    • Balances robustness to outliers with sensitivity to small errors; XGBoost implements a smooth variant (pseudo-Huber) via objective="reg:pseudohubererror".

  4. Quantile Loss

    \[ L(y, \hat{y}) = \max(\alpha(y - \hat{y}), (1-\alpha)(\hat{y} - y)) \]
    • Useful for predicting quantiles (and hence prediction intervals), not just the conditional mean.

🔑 In practice, XGBRegressor defaults to squared error loss unless specified (objective="reg:squarederror").
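
Below is a minimal sketch of how these losses map onto XGBRegressor objective strings. The objective and parameter names follow the XGBoost documentation, but note that reg:absoluteerror and reg:quantileerror require fairly recent releases (roughly 1.7+ and 2.0+ respectively), and XGBoost implements Huber as the smooth pseudo-Huber variant; the data here is synthetic and purely illustrative.

```python
import numpy as np
import xgboost as xgb

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.normal(size=500)

# 1. Squared error (the default)
reg_mse = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100)

# 2. Absolute error (MAE), robust to outliers
reg_mae = xgb.XGBRegressor(objective="reg:absoluteerror", n_estimators=100)

# 3. Pseudo-Huber: XGBoost's smooth approximation of Huber loss;
#    huber_slope plays the role of delta in the formula above
reg_huber = xgb.XGBRegressor(objective="reg:pseudohubererror", huber_slope=1.0)

# 4. Quantile (pinball) loss, e.g. the 90th percentile
reg_q90 = xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=0.9)

for model in (reg_mse, reg_mae, reg_huber, reg_q90):
    model.fit(X, y)
```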


Cost Functions in XGBClassifier#

For classification, the task is to predict class probabilities (and then assign class labels), as the short sketch below illustrates. The cost functions measure how well those predicted probabilities match the true labels: confident, wrong predictions are punished most.
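
A quick illustration of the probabilities-then-labels workflow (synthetic data, purely illustrative):

```python
import numpy as np
import xgboost as xgb

# Synthetic binary data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = xgb.XGBClassifier(objective="binary:logistic", n_estimators=50)
clf.fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # P(class = 1) for each sample
labels = clf.predict(X)              # hard labels (0/1) using a 0.5 threshold
```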

Common loss functions:#

  1. Logistic Loss (Binary Classification)

    \[ L(y, \hat{p}) = - \big( y \log(\hat{p}) + (1-y) \log(1 - \hat{p}) \big) \]

    where \(\hat{p} = \sigma(\hat{y}) = \frac{1}{1+e^{-\hat{y}}}\).

    • Penalizes confident but wrong predictions heavily.

    • Optimized with first- and second-order gradients (Newton-style boosting).

    • The default objective for binary classification (objective="binary:logistic"); see the configuration sketch after this list.

  2. Softmax Loss (Multiclass Classification)

    For \(K\) classes:

    \[ L(y, \hat{p}) = - \sum_{k=1}^{K} \mathbf{1}_{y=k} \log(\hat{p}_k) \]

    where

    \[ \hat{p}_k = \frac{e^{\hat{y}_k}}{\sum_{j=1}^{K} e^{\hat{y}_j}} \]
    • Standard cross-entropy loss.

    • Used when objective="multi:softprob" (returns per-class probabilities) or "multi:softmax" (returns only the predicted class).

  3. Hinge Loss (SVM-style, optional)

    \[ L(y, \hat{y}) = \max(0, 1 - y\hat{y}) \]

    with labels encoded as \(y \in \{-1, +1\}\).

    • Focuses on the margin between classes.

    • Less probabilistic, more decision-boundary focused: objective="binary:hinge" outputs hard 0/1 decisions rather than probabilities.
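
A minimal sketch of how these losses are selected through the objective parameter (with the scikit-learn wrapper, num_class is inferred from the training labels, so it is omitted here):

```python
import xgboost as xgb

# Binary classification with logistic loss (the default)
clf_logistic = xgb.XGBClassifier(objective="binary:logistic")

# Binary classification with hinge loss: hard 0/1 decisions, no probabilities
clf_hinge = xgb.XGBClassifier(objective="binary:hinge")

# Multiclass: softprob returns one probability per class,
# softmax returns only the index of the predicted class
clf_softprob = xgb.XGBClassifier(objective="multi:softprob")
clf_softmax = xgb.XGBClassifier(objective="multi:softmax")
```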


Intuition: Why these losses?#

  • Regression losses → measure distance between prediction & actual values.

  • Classification losses → measure how well predicted probabilities match the true classes (heavily penalizing confident mistakes).

  • Most are smooth and twice differentiable, so each boosting round can use both 1st- and 2nd-order derivatives (MAE and hinge need special handling because their second derivative is zero).
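
To make the derivative point concrete, here is a minimal sketch of the binary logistic loss written as a custom objective for the native xgb.train API; the built-in binary:logistic objective computes the same gradient and Hessian internally (the data and hyperparameters here are made up):

```python
import numpy as np
import xgboost as xgb

# Hand-coded logistic loss as a custom objective.
# XGBoost asks each objective for the 1st derivative (gradient) and
# 2nd derivative (Hessian) of the loss w.r.t. the raw prediction y_hat.
def logistic_obj(preds, dtrain):
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigma(y_hat)
    grad = p - y                       # dL/dy_hat
    hess = p * (1.0 - p)               # d^2L/dy_hat^2
    return grad, hess

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=20, obj=logistic_obj)
```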


Summary:

  • XGBRegressor → squared error, MAE, Huber, quantile.

  • XGBClassifier → logistic loss (binary), softmax loss (multiclass), hinge loss (SVM-like).

  • Loss choice depends on whether you want robustness to outliers, probability calibration, or hard-margin separation.