AdaBoost (Adaptive Boosting)
AdaBoost is a popular ensemble learning algorithm used primarily for classification, though it can also be adapted for regression. Its core idea is to combine multiple “weak learners” into a strong learner.
Weak Learner: A model slightly better than random guessing (e.g., decision stump – a single-level decision tree).
Strong Learner: A combination of weak learners that achieves high accuracy.
AdaBoost adapts by focusing more on the misclassified instances from previous learners in each iteration.
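As a concrete illustration of a weak learner, here is a decision stump on a tiny hypothetical 1-D dataset (the feature values and labels below are invented for illustration). It is clearly better than random guessing, but far from perfect:

```python
import numpy as np

# Hypothetical toy dataset: one feature, labels in {-1, +1}
X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([1, 1, 1, -1, -1, 1])  # the last sample is deliberately "hard"

def stump_predict(x, threshold):
    """Decision stump: predict +1 below the threshold, -1 at or above it."""
    return np.where(x < threshold, 1, -1)

pred = stump_predict(X, 0.5)
accuracy = np.mean(pred == y)  # 5 of 6 correct: better than chance, not perfect
```

The one sample this stump cannot get right is exactly the kind of "hard" instance AdaBoost will up-weight in later rounds.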
Core Steps of AdaBoost
Initialize Sample Weights
Start with equal weights for all training samples:
\[ w_i = \frac{1}{N}, \quad i = 1, 2, \dots, N \]
where \(N\) is the total number of training samples.
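In NumPy this initialization is a one-liner (the sample count here is arbitrary):

```python
import numpy as np

N = 6                      # number of training samples (arbitrary here)
w = np.full(N, 1.0 / N)    # every sample starts with the same weight, 1/N
```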
Train Weak Learner
Fit a weak learner (e.g., a decision stump) using the weighted dataset.
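One simple way to fit a stump under sample weights is a brute-force search over candidate thresholds and polarities; the toy data and helper name below are illustrative, not part of any library:

```python
import numpy as np

X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])   # hypothetical 1-D features
y = np.array([1, 1, 1, -1, -1, 1])              # labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))               # uniform weights, as in step 1

def fit_stump(X, y, w):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best_err, best_threshold, best_polarity = np.inf, None, None
    for threshold in np.unique(X):
        for polarity in (1, -1):
            pred = np.where(X < threshold, polarity, -polarity)
            err = np.sum(w[pred != y])          # weighted misclassification
            if err < best_err:
                best_err, best_threshold, best_polarity = err, threshold, polarity
    return best_err, best_threshold, best_polarity

err, threshold, polarity = fit_stump(X, y, w)   # err = 1/6 on this toy data
```

Because the error is computed against the weights `w`, the same search returns a *different* stump once the weights change in later rounds.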
Compute Learner Error
Calculate the weighted error of the weak learner:
\[ \text{error}_t = \frac{\sum_{i=1}^{N} w_i \cdot \mathbf{1}(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} w_i} \]
where \(h_t(x_i)\) is the prediction of the \(t\)-th weak learner, and \(\mathbf{1}\) is the indicator function (1 if wrong, 0 if correct).
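The formula translates directly into code; the labels and predictions below are toy values chosen for illustration:

```python
import numpy as np

y    = np.array([1, 1, 1, -1, -1, 1])       # true labels (toy values)
pred = np.array([1, 1, 1, -1, -1, -1])      # stump predictions; last one is wrong
w    = np.full(6, 1.0 / 6)                  # current sample weights

# Weight of the misclassified samples divided by the total weight
error_t = np.sum(w * (pred != y)) / np.sum(w)   # = 1/6 here
```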
Compute Learner Weight (Alpha)
Assign a weight to the weak learner based on its accuracy:
\[ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \text{error}_t}{\text{error}_t}\right) \]
Better learners get higher weights.
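Numerically, \(\alpha_t\) grows as the error shrinks, and is exactly zero at error 0.5 (random guessing). A small sketch with a hypothetical helper name:

```python
import numpy as np

def learner_weight(error_t):
    """Alpha from the weighted error (assumes 0 < error_t < 1)."""
    return 0.5 * np.log((1 - error_t) / error_t)

alpha_strong = learner_weight(1 / 6)   # ~0.805: accurate learner, large vote
alpha_random = learner_weight(0.5)     # 0.0: random guessing earns no vote
```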
Update Sample Weights
Increase weights of misclassified samples so the next learner focuses on them:
\[ w_i \leftarrow w_i \cdot e^{\alpha_t \cdot \mathbf{1}(y_i \neq h_t(x_i))} \]
Then normalize the weights so they sum to 1.
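In code, misclassified samples are scaled up by \(e^{\alpha_t}\) and the whole vector is renormalized (toy values again, with an alpha corresponding to a weighted error of 1/6):

```python
import numpy as np

y    = np.array([1, 1, 1, -1, -1, 1])     # toy labels
pred = np.array([1, 1, 1, -1, -1, -1])    # the last sample was misclassified
w    = np.full(6, 1.0 / 6)
alpha_t = 0.5 * np.log(5)                 # alpha for a weighted error of 1/6

# Multiply misclassified weights by e^alpha, then renormalize to sum to 1
w = w * np.exp(alpha_t * (pred != y))
w = w / w.sum()
# The misclassified sample's weight jumps from ~0.167 to ~0.309
```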
Repeat
Steps 2–5 are repeated for \(T\) weak learners.
Final Prediction
The strong classifier is a weighted majority vote of all weak learners:
\[ H(x) = \text{sign} \Bigg( \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \Bigg) \]
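Putting all the steps together, here is a minimal from-scratch sketch using decision stumps on a hypothetical 1-D dataset that no single stump can separate; the helper names are my own, not a library API:

```python
import numpy as np

def fit_stump(X, y, w):
    """Brute-force the threshold/polarity stump with lowest weighted error."""
    best = (np.inf, None, None)
    for threshold in np.unique(X):
        for polarity in (1, -1):
            pred = np.where(X < threshold, polarity, -polarity)
            err = np.sum(w[pred != y])
            if err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(X, y, T):
    w = np.full(len(X), 1.0 / len(X))                     # step 1: uniform weights
    learners = []
    for _ in range(T):
        err, threshold, polarity = fit_stump(X, y, w)     # steps 2-3
        err = np.clip(err, 1e-10, 1 - 1e-10)              # guard the log
        alpha = 0.5 * np.log((1 - err) / err)             # step 4: learner weight
        learners.append((alpha, threshold, polarity))
        pred = np.where(X < threshold, polarity, -polarity)
        w = w * np.exp(alpha * (pred != y))               # step 5: up-weight mistakes
        w = w / w.sum()
    return learners

def adaboost_predict(learners, X):
    """Final prediction: sign of the alpha-weighted sum of stump votes."""
    scores = sum(a * np.where(X < t, p, -p) for a, t, p in learners)
    return np.sign(scores)

# Hypothetical data no single stump can separate (the last point breaks it)
X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([1, 1, 1, -1, -1, 1])
learners = adaboost_fit(X, y, T=5)
pred = adaboost_predict(learners, X)   # the ensemble classifies all six points
```

After five rounds the weighted vote gets every sample right, including the point that defeats any individual stump.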
Key Intuition
AdaBoost focuses on hard-to-classify samples by increasing their weights.
Weak learners are combined using weighted voting; more accurate learners have more influence.
Despite using weak learners individually, the ensemble often achieves high accuracy.
Advantages
Works well with simple weak learners such as decision stumps.
Less prone to overfitting compared to single deep trees.
Adaptive: focuses on hard samples.
Disadvantages
Sensitive to noisy data and outliers.
Can be computationally expensive if many iterations are used.
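In practice you would rarely hand-roll the loop above; scikit-learn ships an implementation. A minimal usage sketch, assuming scikit-learn is installed (the dataset below is invented and trivially separable):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Tiny illustrative dataset: two well-separated groups
X = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Decision stumps are scikit-learn's default base learner for AdaBoost
clf = AdaBoostClassifier(n_estimators=10).fit(X, y)
print(clf.score(X, y))   # 1.0 on this trivially separable data
```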