Assumptions#
XGBoost does not impose strict statistical assumptions the way linear regression does, but it does inherit implicit assumptions from decision trees and the boosting framework. These assumptions explain when and why XGBoost works well.
1. Additive Model Assumption#
XGBoost assumes the true function can be approximated as a sum of weak learners (trees):
\[ \hat{y}_i = \sum_{m=1}^M f_m(x_i) \]
Each new tree corrects the residual errors of the previous ensemble.
Works best if the relationship between features and target is non-linear and can be captured by recursive splits.
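As a quick illustration (a minimal sketch with synthetic data; the model settings are arbitrary), the prediction really is a running sum of tree outputs. Truncating the ensemble with `iteration_range` shows the fit improving as trees are added:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)  # non-linear target

model = xgb.XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.3)
model.fit(X, y)

# Predict with only the first m trees: each added tree corrects residual error.
for m in (1, 10, 50):
    pred = model.predict(X, iteration_range=(0, m))
    print(f"{m:>2} trees -> train MSE: {np.mean((pred - y) ** 2):.4f}")
```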
2. Differentiable Loss Function#
XGBoost assumes the loss function is differentiable, so it can compute gradients and Hessians for optimization.
Examples:
Regression → squared error, Huber/pseudo-Huber (plain MAE is not differentiable at zero, so smooth approximations are typically used).
Classification → logistic loss, cross-entropy.
If the loss cannot be differentiated (twice, since both gradients and Hessians are needed), XGBoost cannot optimize it directly.
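This requirement is visible in the Python API: a custom objective passed to `xgb.train` must return the gradient and Hessian per example. Below is a sketch that re-implements squared error this way (the data is synthetic and purely illustrative):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: gradient and Hessian of 0.5 * (pred - label)^2."""
    labels = dtrain.get_label()
    grad = preds - labels        # first derivative of the loss
    hess = np.ones_like(preds)   # second derivative of the loss
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=50,
                    obj=squared_error_obj)
```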
3. Independent and Identically Distributed (i.i.d.) Data#
Assumes training examples are independent and drawn from the same distribution as test data.
Violations, such as time series modeled without respecting temporal order, or a distribution shift between training and deployment, can degrade performance.
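One common mitigation for temporal data is to validate on expanding windows so training examples always precede validation examples. A sketch using scikit-learn's `TimeSeriesSplit` (data and settings are placeholders):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # rows assumed to be in time order
y = rng.normal(size=1000)

# Expanding-window splits: train on the past, validate on the future,
# so evaluation does not leak information across time.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
    model.fit(X[train_idx], y[train_idx],
              eval_set=[(X[val_idx], y[val_idx])], verbose=False)
```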
4. Weak Learner Assumption#
Each individual tree is a weak learner (shallow, high-bias).
XGBoost assumes that combining many weak learners through boosting produces a strong learner: each round reduces bias, while shrinkage and regularization keep variance in check.
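In practice this assumption translates directly into hyperparameters: trees are kept shallow on purpose, and shrinkage damps each one so it contributes only a small correction (the values below are illustrative, not recommendations):

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    max_depth=3,        # shallow trees: each learner stays weak (high bias)
    learning_rate=0.1,  # shrinkage: damp each tree's contribution
    n_estimators=500,   # many weak learners combined into a strong ensemble
)
```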
5. No Multicollinearity Requirement#
Unlike linear regression, XGBoost does not assume features are independent.
But highly correlated features can reduce interpretability (e.g., split-based feature importance gets diluted across the correlated group).
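A small simulation makes the dilution concrete (synthetic data; column subsampling is enabled here so the model is sometimes forced onto the redundant copy, which makes the effect visible):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.01, size=1000)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=1000)

# colsample_bytree < 1 means some trees only see x2 and use it instead of x1.
model = xgb.XGBRegressor(n_estimators=100, max_depth=3, colsample_bytree=0.5)
model.fit(X, y)

# Importance is shared between the two correlated columns, even though
# x2 carries no extra information about y.
print(model.feature_importances_)
```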
6. Handling Missing Data#
Assumes missing values can be assigned to a default split direction in trees.
XGBoost automatically learns the best default branch during training, so missing values do not need to be imputed beforehand.
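In the Python API this means NaNs can be passed straight in. A minimal sketch (the tiny dataset only shows the mechanics):

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0]])
y = np.array([1.0, 0.0, 1.0])

# NaNs are routed to a learned default branch at each split,
# so no imputation step is required before training.
model = xgb.XGBRegressor(n_estimators=10, max_depth=2)
model.fit(X, y)
```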
7. Complexity vs. Generalization#
Assumes a trade-off between tree complexity and generalization.
That’s why regularization terms (\(\lambda, \alpha, \gamma\)) are included in the objective function.
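For reference, the regularized objective that encodes this trade-off, following the standard XGBoost formulation, is:
\[ \mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 + \alpha \lVert w \rVert_1 \]
where \(T\) is the number of leaves in a tree and \(w\) its leaf weights. In the Python API these penalties correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters.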
Summary#
Target can be modeled as additive trees.
Loss function must be differentiable.
Data is i.i.d.
Weak learners combined form a strong model.
No strict assumptions on feature independence.
Missing values can be handled by default splits.
Model complexity must be controlled to prevent overfitting.