Assumptions#

1. Additive Model Assumption#

  • The true function can be approximated by a sum of weak learners:

    \[ f(x) \approx \sum_{m=1}^M \nu \cdot \gamma_m h_m(x) \]

    where \( h_m \) is the \( m \)-th weak learner, \( \gamma_m \) its fitted step size, and \( \nu \) the learning rate (shrinkage).
  • Each learner contributes a small incremental improvement rather than fitting the target in one step.
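
The additive form above can be sketched directly in code. This is a minimal illustration, not a library implementation; the names (`additive_predict`, `f0`, `learners`, `gammas`, `nu`) are assumed for the example, and the "weak learners" are simple threshold functions standing in for shallow trees.

```python
import numpy as np

def additive_predict(x, f0, learners, gammas, nu):
    """Sketch of f(x) = f0 + sum_m nu * gamma_m * h_m(x)."""
    pred = np.full_like(x, f0, dtype=float)
    for gamma_m, h_m in zip(gammas, learners):
        pred += nu * gamma_m * h_m(x)   # each learner adds a small correction
    return pred

# Toy usage: two threshold "learners" standing in for depth-1 trees.
x = np.array([0.2, 0.6, 0.9])
learners = [lambda x: (x > 0.5).astype(float),
            lambda x: (x > 0.8).astype(float)]
print(additive_predict(x, f0=0.0, learners=learners,
                       gammas=[1.0, 1.0], nu=0.1))
```

Each learner's output is scaled by `nu * gamma_m` before being added, which is why no single learner fits the target fully.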


2. Weak Learner Assumption#

  • Base learners (usually shallow decision trees) must be slightly better than random guessing.

  • They should capture local structure but not overfit.
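
The "better than random" condition can be checked concretely. A rough sketch on a toy binary problem, using a hand-built decision stump (the threshold `0.4` and the dataset are illustrative assumptions):

```python
import numpy as np

# A depth-1 "stump" should beat random guessing (accuracy > 0.5)
# for boosting to make progress. Toy data, illustrative only.
rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = (x > 0.5).astype(int)           # separable toy labels

def stump_predict(x, threshold):
    return (x > threshold).astype(int)

# Deliberately imperfect threshold: still clearly better than random.
acc = np.mean(stump_predict(x, 0.4) == y)
print(acc > 0.5)   # → True
```

A learner at exactly 50% accuracy contributes nothing; boosting theory (e.g., the original AdaBoost analysis) only requires this small edge over chance.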


3. Gradient Descent Validity#

  • Loss function must be differentiable (or sub-differentiable).

  • Pseudo-residuals are the negative gradients of the loss, evaluated at the current model \( F_{m-1} \):

    \[ r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} \]
  • Assumes the negative gradient points in a useful descent direction for minimizing the loss.
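
For squared-error loss \( L = \tfrac{1}{2}(y - F)^2 \), the negative gradient works out to the ordinary residual \( y - F \), which makes the formula concrete. A small sketch (function name is illustrative):

```python
import numpy as np

def pseudo_residuals_squared_loss(y, F):
    # For L = 0.5 * (y - F)^2, the negative gradient -dL/dF is y - F,
    # so the pseudo-residual is just the ordinary residual.
    return y - F

y = np.array([1.0, 2.0, 3.0])
F = np.array([0.5, 2.5, 3.0])   # current ensemble predictions F_{m-1}(x_i)
print(pseudo_residuals_squared_loss(y, F))   # → [ 0.5 -0.5  0. ]
```

Other losses give different pseudo-residuals (e.g., the sign of the residual for absolute-error loss), but the same negative-gradient recipe applies.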


4. Independence of Errors#

  • Training samples are assumed to be i.i.d. (independent and identically distributed).

  • Residuals (errors) left by previous learners are assumed to reflect real signal, not structured dependence between samples (e.g., temporal autocorrelation).


5. Consistency of Residuals#

  • Errors left after each iteration are assumed to contain useful structure that the next learner can capture.

  • If residuals are pure noise, boosting stops being effective.
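
This can be seen in a toy boosting loop: as long as the residuals of a smooth target still contain structure, each round reduces training error. The setup below (a fixed split at 0.5, `nu = 0.5`, a sine target) is an illustrative sketch, not a real boosting implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y = np.sin(2 * np.pi * x)      # smooth signal: residuals keep structure

F = np.zeros_like(y)           # start from the zero model
nu = 0.5
for m in range(50):
    r = y - F                  # residuals from the current model
    left = x < 0.5
    # "Weak learner": predict the mean residual on each side of the split.
    h = np.where(left, r[left].mean(), r[~left].mean())
    F += nu * h                # add the shrunken correction
print(np.mean((y - F) ** 2))   # training MSE, well below the initial value
```

With pure-noise targets the same loop would drive training error down while learning nothing generalizable, which is exactly the failure mode this assumption rules out.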


6. Bias–Variance Tradeoff#

  • Assumes weak learners have high bias, low variance.

  • Boosting reduces bias by combining many such learners.


7. Shrinkage and Regularization Assumption#

  • Assumes that small step sizes (a low learning rate) combined with many iterations converge to a better solution than a few large steps.

  • Prevents overshooting in function space.
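
The shrinkage idea can be sketched in one dimension: fitting a single constant by repeatedly taking a fraction `nu` of the mean residual. The error shrinks geometrically, roughly like \( (1-\nu)^M \), so a small `nu` needs more rounds but approaches the target smoothly instead of jumping straight to it (names and numbers here are illustrative):

```python
import numpy as np

def boost_constant(y, nu, n_rounds):
    """Repeatedly add nu times the mean residual (a 'constant' weak learner)."""
    F = 0.0
    for _ in range(n_rounds):
        F += nu * np.mean(y - F)   # shrunken step toward the mean of y
    return F

y = np.array([1.0, 2.0, 3.0, 4.0])
print(boost_constant(y, nu=0.1, n_rounds=100))   # ≈ 2.5, the mean of y
```

With `nu = 1.0` a single round lands exactly on the mean; real boosting uses small `nu` because in function space the analogous full step tends to overfit each learner's errors.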


8. Sufficient Data Size#

  • Boosting assumes enough data to prevent overfitting.

  • Small datasets can cause boosted learners to memorize noise.


9. No Strong Multicollinearity in Features#

  • Strongly correlated features can lead to redundant splits and instability.

  • Not a strict assumption, but reduces interpretability and efficiency.


10. Stationary Data Distribution#

  • Assumes training and test data come from the same distribution.

  • Boosting optimizes for the training loss; distributional shift reduces performance.


Summary of Key Assumptions#

  1. Function can be approximated additively.

  2. Weak learners are better than random.

  3. Loss is differentiable → gradients exist.

  4. Training samples are i.i.d.

  5. Residuals contain signal, not pure noise.

  6. Weak learners have high bias, low variance.

  7. Learning rate must be small for stability.

  8. Dataset large enough to generalize.

  9. Features not overly collinear.

  10. Train/test distribution is stable.