Random Forest#
1. What is Random Forest?#
Random Forest (RF) is an ensemble learning method for regression and classification.
It builds multiple Decision Trees and combines their predictions.
Essentially, it’s “a forest of decision trees” where each tree votes (for classification) or averages (for regression).
Key Idea: Combining many diverse trees (each prone to overfitting on its own) reduces variance, which curbs overfitting and improves generalization.
2. How Random Forest Works (Intuition)#
1. Bootstrap Sampling (Bagging):
   - Randomly sample the data with replacement for each tree.
   - Each tree gets a slightly different dataset → introduces diversity.
2. Random Feature Selection:
   - At each split in a tree, only a random subset of features is considered.
   - Prevents trees from being too correlated (e.g., one strong feature dominating every tree).
3. Tree Training:
   - Each tree is trained independently on its bootstrapped dataset and random feature subsets.
4. Prediction Aggregation:
   - Regression: average the predictions of all trees.
   - Classification: take a majority vote among trees.
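The four steps above can be sketched directly with scikit-learn's `DecisionTreeClassifier` and NumPy. This is a minimal illustration, not a production implementation; the dataset, tree count, and seeds are my own assumptions, not from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset (not from the original text)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Step 1: bootstrap sample — draw n indices with replacement
    idx = rng.integers(0, len(X), len(X))
    # Step 2: random feature subset at each split (max_features="sqrt")
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    # Step 3: train each tree independently on its bootstrap sample
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: aggregate — majority vote across the 25 trees
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```

In practice `RandomForestClassifier` wraps all four steps, but spelling them out shows where the diversity comes from.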
3. Why Random Forest Works#
- Reduces overfitting: individual trees may overfit, but averaging their predictions smooths out errors.
- Handles high-dimensional data: random feature selection prevents a single feature from dominating splits.
- Robust to noise: noise in the training data affects individual trees, but not the ensemble as a whole.
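The variance-reduction argument can be made precise with a standard result (not stated in the original text): if each of $B$ trees has prediction variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Growing more trees (larger $B$) shrinks the second term, and random feature selection lowers $\rho$, shrinking the first — which is exactly why bagging alone is not enough and feature randomness matters.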
4. Key Hyperparameters in Random Forest#
| Hyperparameter | Description | Effect |
|---|---|---|
| `n_estimators` | Number of trees in the forest | More trees → better performance, slower training |
| `max_depth` | Maximum depth of each tree | Controls overfitting |
| `min_samples_split` | Min samples required to split a node | Higher → simpler trees |
| `min_samples_leaf` | Min samples required at a leaf | Prevents leaves with very few samples |
| `max_features` | Number of features to consider at each split | Lower → more diversity, higher bias |
| `bootstrap` | Whether to use bootstrap sampling | Usually `True` (bagging) |
| `criterion` | Split quality measure (e.g., `gini`, `entropy`) | How splits are chosen |
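These hyperparameters map directly onto scikit-learn's `RandomForestClassifier`. A minimal sketch with illustrative values (the dataset and the specific settings are assumptions for demonstration, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=6,           # cap tree depth to control overfitting
    min_samples_split=4,   # min samples to split an internal node
    min_samples_leaf=2,    # min samples required at a leaf
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # bagging on (the default)
    criterion="gini",      # split quality measure
    random_state=42,
)
rf.fit(X, y)
train_acc = rf.score(X, y)  # accuracy on the training data
```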
5. Advantages of Random Forest#
- Handles both regression and classification.
- Works well on nonlinear data without much feature engineering.
- Less prone to overfitting than a single Decision Tree.
- Can compute feature importance.
- Robust to outliers and noise.
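The feature-importance point above is exposed in scikit-learn as the fitted `feature_importances_` attribute. A short sketch on a synthetic dataset (the data setup is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# 5 features, only 2 of which actually drive the target (synthetic, for illustration)
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_  # one score per feature; scores sum to 1
```

Higher scores indicate features the trees relied on more heavily when splitting.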
6. Disadvantages / Considerations#
- Slower than a single Decision Tree (more trees → more computation).
- Less interpretable than a single Decision Tree (the ensemble is a “black box”).
- Needs careful tuning of `n_estimators`, `max_depth`, and `max_features` for optimal performance.
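One common way to tune those three hyperparameters is a cross-validated grid search. A minimal sketch — the grid values and dataset are illustrative assumptions, and real grids are usually wider:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Try every combination of these values with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [3, None],
        "max_features": ["sqrt", None],
    },
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_  # the combination with the best CV score
```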
7. Random Forest Intuition (Visual)#
Imagine predicting house prices:
- Each tree learns different patterns from random subsets of houses and features.
- One tree might overpredict in some areas while another underpredicts in others.
- Averaging all the trees gives a more accurate and stable prediction.
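This intuition is easy to check on noisy synthetic data standing in for house prices (no real prices here, and the dataset parameters are illustrative): an averaged forest usually generalizes better than one deep tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression data (a stand-in for house prices)
X, y = make_regression(n_samples=400, n_features=6, n_informative=6,
                       noise=20.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)          # one deep tree
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

tree_r2 = tree.score(X_te, y_te)      # R^2 of the single tree on held-out data
forest_r2 = forest.score(X_te, y_te)  # R^2 of the averaged forest
```

The single tree fits the training noise; averaging 100 trees smooths it out, so `forest_r2` should come out higher.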