Gradient Boosting Classifier#

Gradient Boosting Classifier (GBC) is the classification version of Gradient Boosting. It builds an ensemble of weak learners (usually shallow decision trees) in a stage-wise fashion, where each new learner is trained to correct the errors of the ensemble built so far.


1. Objective#

  • Given training data \((x_i, y_i)\) with \(y_i \in \{0,1\}\) or \(\{-1,+1\}\), the goal is to minimize a classification loss.

  • Common choice: Logistic loss

\[ L(y, F(x)) = \log\big(1 + e^{-y F(x)}\big), \quad y \in \{-1,+1\} \]

where \(F(x)\) is the additive model.
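
The loss is easy to evaluate numerically. A minimal sketch in NumPy (toy scores and labels chosen here purely for illustration):

import numpy as np

# Raw model scores F(x_i) and labels encoded in {-1, +1}
F = np.array([2.0, -0.5, 0.1])
y = np.array([+1, +1, -1])

# Logistic loss: log(1 + exp(-y * F)); small when sign(F) agrees with y
loss = np.log1p(np.exp(-y * F))
print(loss)          # per-sample losses
print(loss.mean())   # average loss over the sample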


2. Initialization#

  • Start with a constant model:

\[ F_0(x) = \ln \frac{p}{1-p} \]

where \(p\) is the proportion of positive samples in the training data.

  • This is the log-odds of the positive class, the constant that minimizes the logistic loss above.
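
A minimal sketch of this initialization (plain NumPy, labels assumed to be in {0, 1}):

import numpy as np

y = np.array([1, 1, 0, 1, 0])    # toy binary labels
p = y.mean()                     # proportion of positive samples, here 0.6
F0 = np.log(p / (1 - p))         # log-odds of the positive class
print(F0)                        # ~0.405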


3. Iterative boosting process#

At each iteration \(m\):

a) Compute pseudo-residuals#

  • Pseudo-residuals are the negative gradient of the loss with respect to the current predictions. With labels encoded as \(y_i \in \{0,1\}\), this simplifies to:

\[ r_{im} = y_i - p_{i}^{(m-1)} \]

where \(p_i^{(m-1)} = \frac{1}{1+e^{-F_{m-1}(x_i)}}\) is the currently predicted probability of the positive class.

  • Intuition: residual = true label − predicted probability.
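
In code, the pseudo-residuals are just the labels minus the sigmoid-transformed scores; a sketch continuing the toy example above:

import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

y = np.array([1, 1, 0, 1, 0])        # labels in {0, 1}
F_prev = np.full(len(y), 0.405)      # F_{m-1}(x_i), here the constant F_0
residuals = y - sigmoid(F_prev)      # r_im = y_i - p_i^(m-1)
print(residuals)                     # positive for class 1, negative for class 0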


b) Fit weak learner#

  • Train a small regression tree \(h_m(x)\) on the pseudo-residuals (a least-squares fit).

  • The tree learns where the current ensemble's predictions are most wrong, i.e., the regions with the largest residuals.
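
In scikit-learn terms this step is an ordinary regression-tree fit; a sketch with made-up data (shapes and values are illustrative only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # toy feature matrix
residuals = rng.uniform(-0.5, 0.5, 100)   # stand-in for the pseudo-residuals r_im

# A shallow tree fit by least squares to the residuals
h_m = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
print(h_m.predict(X[:3]))                 # the tree's correction for 3 samples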


c) Compute multiplier#

  • Find best \(\gamma_m\) (step size) via line search:

\[ \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big) \]
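
A simple way to approximate this line search is a coarse grid over candidate values of \(\gamma\). (Implementations such as scikit-learn actually fit a separate value per tree leaf, Friedman's "TreeBoost" refinement; the single-multiplier sketch below just illustrates the formula.)

import numpy as np

def logistic_loss(y, F):                  # y in {-1, +1}
    return np.log1p(np.exp(-y * F)).sum()

def line_search(y, F_prev, h_pred):
    gammas = np.linspace(0.0, 4.0, 81)    # candidate step sizes
    losses = [logistic_loss(y, F_prev + g * h_pred) for g in gammas]
    return gammas[int(np.argmin(losses))]

y = np.array([+1, -1, +1, +1])
F_prev = np.full(4, 0.4)                  # current ensemble scores F_{m-1}(x_i)
h_pred = np.array([0.3, -0.2, 0.1, 0.5])  # weak learner outputs h_m(x_i)
print(line_search(y, F_prev, h_pred))     # gamma_m found on the grid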

d) Update model#

\[ F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x) \]
  • \(\nu\) is the learning rate (shrinkage); smaller values require more boosting rounds but usually generalize better.
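
Putting steps a) through d) together, a compact from-scratch training loop might look like this (illustrative only: \(\gamma_m\) is folded into the shrunken step, and refinements such as per-leaf values are omitted):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

def fit_gbc(X, y, M=50, nu=0.1, max_depth=3):
    """y in {0, 1}; returns the initial score F_0 and the fitted trees."""
    p = y.mean()
    F0 = np.log(p / (1 - p))               # step 2: log-odds initialization
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):
        r = y - sigmoid(F)                 # a) pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # b) weak learner
        F = F + nu * tree.predict(X)       # c)+d) shrunken update
        trees.append(tree)
    return F0, trees

def predict_proba(X, F0, trees, nu=0.1):
    F = F0 + nu * sum(t.predict(X) for t in trees)
    return sigmoid(F)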


4. Final prediction#

  • After \(M\) rounds, we have:

\[ F_M(x) = F_0(x) + \nu \sum_{m=1}^M \gamma_m h_m(x) \]
  • Convert to probability with sigmoid:

\[ p(x) = \frac{1}{1 + e^{-F_M(x)}} \]
  • Predict class:

\[\begin{split} \hat{y} = \begin{cases}1 & p(x) \geq 0.5 \\ 0 & p(x) < 0.5\end{cases} \end{split}\]
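
In code, this conversion is a sigmoid followed by a threshold:

import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

F_M = np.array([2.1, -0.7, 0.05])    # final additive scores for 3 samples
p = sigmoid(F_M)                     # probabilities of the positive class
y_hat = (p >= 0.5).astype(int)       # threshold at 0.5
print(p)                             # [0.891 0.332 0.512]
print(y_hat)                         # [1 0 1]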

5. Intuition#

  • Each tree is trained on the errors of the previous ensemble.

  • Predictions are updated in small steps (learning rate).

  • Over many iterations, the model improves classification boundaries.


6. Key features#

  • Handles binary and multiclass classification (one-vs-rest or multinomial loss).

  • Can use different loss functions: binomial/multinomial deviance (i.e., log-loss, the default) or exponential loss, which recovers AdaBoost.

  • Sensitive to the learning rate and the number of trees; the two interact and should be tuned jointly (see the tuning sketch after this list).

  • Often more robust to outliers than AdaBoost, because the logistic loss penalizes badly misclassified points roughly linearly rather than exponentially.
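
Because the learning rate and the number of trees interact (a smaller \(\nu\) usually needs more boosting rounds), a reasonable approach is to search them jointly. A sketch with GridSearchCV (grid values chosen here purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [50, 100, 200],
}
search = GridSearchCV(GradientBoostingClassifier(max_depth=3, random_state=42),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

A complete end-to-end example with scikit-learn: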


from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(X_train, y_train)

# Predictions
y_pred = gbc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.8666666666666667
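
The fitted model also exposes the raw additive score: for the binary log-loss, predict_proba should be the sigmoid of decision_function, mirroring the formulas in section 4, and staged_predict lets you watch accuracy evolve over boosting rounds:

import numpy as np
from scipy.special import expit  # logistic sigmoid

# predict_proba[:, 1] equals sigmoid(F_M(x)) for the binary log-loss
print(np.allclose(gbc.predict_proba(X_test)[:, 1],
                  expit(gbc.decision_function(X_test))))

# Accuracy after every 25th boosting round
for m, y_stage in enumerate(gbc.staged_predict(X_test), start=1):
    if m % 25 == 0:
        print(m, accuracy_score(y_test, y_stage))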