1. Random Forest#

  • Random Forest (RF) is an ensemble learning method for regression and classification.

  • It builds multiple Decision Trees and combines their predictions.

  • Essentially, it’s “a forest of decision trees”: for classification the trees vote, and for regression their predictions are averaged.

Key Idea: Combining many decorrelated trees averages out their individual errors, which reduces overfitting and improves generalization.
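
A minimal sketch of this idea, assuming scikit-learn (the synthetic dataset and model settings here are illustrative only):

```python
# Minimal sketch: fit a Random Forest regressor and compare it to a single
# Decision Tree on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree R^2:  ", tree.score(X_test, y_test))
print("Random forest R^2:", forest.score(X_test, y_test))
```

On data like this, the forest typically scores higher than the single tree, because averaging cancels out individual trees’ mistakes.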


2. How Random Forest Works (Intuition)#

  1. Bootstrap Sampling (Bagging):

    • Randomly sample data with replacement for each tree.

    • Each tree gets a slightly different dataset → introduces diversity.

  2. Random Feature Selection:

    • At each split in a tree, only a random subset of features is considered.

    • Prevents trees from being too correlated (e.g., one strong feature dominating all trees).

  3. Tree Training:

    • Each tree is trained independently using its bootstrapped dataset and random features.

  4. Prediction Aggregation (all four steps are sketched in code after this list):

    • Regression: Average the predictions of all trees.

    • Classification: Majority vote among trees.
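
As referenced above, here is a from-scratch sketch of all four steps, assuming scikit-learn’s DecisionTreeRegressor as the base learner (the dataset and constants are illustrative):

```python
# From-scratch sketch of the four steps: bootstrap sampling, random feature
# selection, independent tree training, and prediction averaging.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

n_trees = 50
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample -- draw n rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: each tree considers a random subset of features at every
    # split (max_features="sqrt") and is trained independently on its sample.
    tree = DecisionTreeRegressor(max_features="sqrt",
                                 random_state=int(rng.integers(1 << 31)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: aggregate -- average the per-tree predictions (regression).
preds = np.mean([t.predict(X) for t in trees], axis=0)
print("Ensemble training MSE:", np.mean((preds - y) ** 2))
```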


3. Why Random Forest Works#

  • Reduces overfitting: Individual trees may overfit, but averaging their predictions smooths out errors (see the variance note after this list).

  • Handles high-dimensional data: Random feature selection prevents a single feature from dominating splits.

  • Robust to noise: Noise in the training data affects individual trees, but its effect is largely averaged out in the ensemble.
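
The “averaging smooths errors” point can be made precise with a standard bagging result (the symbols $B$, $\sigma^2$, and $\rho$ are notation introduced here, not from the text above): for $B$ identically distributed trees $T_b(x)$, each with prediction variance $\sigma^2$ and pairwise correlation $\rho$,

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

As $B$ grows, the second term vanishes; random feature selection lowers $\rho$, shrinking the first term. This is why decorrelating the trees matters.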


4. Key Hyperparameters in Random Forest#

| Hyperparameter | Description | Effect |
|---|---|---|
| n_estimators | Number of trees in the forest | More trees → better performance, slower training |
| max_depth | Maximum depth of each tree | Controls overfitting |
| min_samples_split | Minimum samples required to split a node | Higher → simpler trees |
| min_samples_leaf | Minimum samples required at a leaf | Prevents leaves with very few samples |
| max_features | Number of features considered at each split | Lower → more diversity, higher bias |
| bootstrap | Whether to use bootstrap sampling | Usually True (bagging) |
| criterion | Measure of split quality (e.g., squared_error for regression, gini for classification) | How splits are chosen |
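
A sketch of how these map onto scikit-learn’s RandomForestRegressor (the specific values below are illustrative starting points, not recommendations):

```python
# Illustrative configuration of the hyperparameters above.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=200,           # number of trees in the forest
    max_depth=10,               # cap tree depth to limit overfitting
    min_samples_split=4,        # require at least 4 samples to split a node
    min_samples_leaf=2,         # require at least 2 samples at each leaf
    max_features="sqrt",        # random subset of features per split
    bootstrap=True,             # sample rows with replacement (bagging)
    criterion="squared_error",  # split quality measure for regression
    random_state=42,
)
```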


5. Advantages of Random Forest#

  1. Handles both regression and classification.

  2. Works well on nonlinear data without much feature engineering.

  3. Less prone to overfitting than a single Decision Tree.

  4. Can compute feature importance (see the sketch after this list).

  5. Robust to outliers and noise.
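
For instance, impurity-based feature importances are available after fitting (a minimal sketch with illustrative synthetic data):

```python
# Sketch: inspect impurity-based feature importances after fitting.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```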


6. Disadvantages / Considerations#

  1. Slower than a single Decision Tree (more trees → more computation).

  2. Less interpretable than a single Decision Tree (ensemble is a “black box”).

  3. Needs careful tuning of n_estimators, max_depth, and max_features for optimal performance; a tuning sketch follows this list.
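
One common way to tune them is cross-validated grid search (a sketch assuming scikit-learn’s GridSearchCV; the grid values and data are illustrative):

```python
# Sketch: grid-search the key hyperparameters with cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", 1.0],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("Best params:", search.best_params_)
```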


7. Random Forest Intuition (Visual)#

Imagine predicting house prices:

  • Each tree learns different patterns from random subsets of houses and features.

  • One tree might overpredict in some areas while another underpredicts in others.

  • Averaging across all trees gives a more accurate and stable prediction, as the sketch below demonstrates.
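
A sketch of this effect using scikit-learn’s estimators_ attribute, which holds the individual fitted trees (the data stands in for the houses and is illustrative):

```python
# Sketch: individual trees disagree; their average is the forest's prediction.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x0 = X[:1]  # one "house"
per_tree = np.array([t.predict(x0)[0] for t in rf.estimators_])
print("Per-tree spread: min %.1f, max %.1f" % (per_tree.min(), per_tree.max()))
print("Ensemble (mean of trees): %.1f" % per_tree.mean())
print("rf.predict agrees:        %.1f" % rf.predict(x0)[0])
```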
