Workflows#
1. Problem Understanding & Data Preparation#
Define the task:
Classification → predict class labels
Regression → predict continuous values
Collect & clean data:
Handle missing values (some Random Forest implementations can handle missing values internally; check your library before relying on this)
Encode categorical features if needed
Split dataset:
Usually into train and test sets
Optional: cross-validation for hyperparameter tuning
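As a concrete illustration, here is a minimal preparation sketch using pandas and scikit-learn; the file name data.csv and the column name target are placeholders for your own data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")              # placeholder file name
df = df.dropna()                          # simplest missing-value strategy
df = pd.get_dummies(df, drop_first=True)  # one-hot encode categorical columns

X = df.drop(columns=["target"])           # "target" is a placeholder label column
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```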
2. Feature Selection (Optional)#
Random Forest is robust to irrelevant features because it selects random subsets of features at each split.
Still, removing clearly useless features can reduce training time and memory; a selection sketch follows.
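If you do prune features, a forest itself can serve as the selector. A minimal sketch with scikit-learn's SelectFromModel, assuming the X_train and y_train from the previous step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a small forest and keep only features above the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
selector.fit(X_train, y_train)
X_train_reduced = selector.transform(X_train)
```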
3. Bootstrap Sampling (Bagging)#
For each tree in the forest:
Randomly sample N observations with replacement from the training set, where N is the size of the training set.
This sample is called a bootstrapped dataset.
Effect: Each tree sees slightly different data → introduces diversity among trees.
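To make the mechanics concrete, here is a sketch of drawing one bootstrapped dataset with NumPy, assuming the X_train / y_train from step 1:

```python
import numpy as np

rng = np.random.default_rng(42)
n = len(X_train)
idx = rng.integers(0, n, size=n)        # N draws with replacement
X_boot, y_boot = X_train.iloc[idx], y_train.iloc[idx]

# Each sample contains roughly 63% of the unique original rows.
print(len(np.unique(idx)) / n)          # ~0.632
```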
4. Tree Building (Individual Decision Trees)#
For each tree:
Start at the root node.
At each split, select a random subset of features.
Choose the best split based on:
Regression: Variance reduction / MSE
Classification: Gini impurity or entropy
Repeat recursively until stopping criteria:
Maximum depth reached (max_depth)
Minimum samples per leaf reached (min_samples_leaf)
All leaves are pure
Note: Trees are usually grown deep → individual trees may overfit.
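One member tree can be sketched directly with scikit-learn's DecisionTreeClassifier; setting max_features="sqrt" is what enables the random feature subset at each split. X_boot / y_boot are the bootstrapped sample from step 3:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",     # or "entropy"; regression trees use MSE / variance
    max_features="sqrt",  # random subset of features evaluated per split
    random_state=42,      # no max_depth set: trees are typically grown deep
)
tree.fit(X_boot, y_boot)
```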
5. Repeat for All Trees#
Build n_estimators trees independently, each on its own bootstrapped sample.
Each tree is slightly different due to data sampling + random feature selection.
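In practice the bootstrap-and-build loop is handled for you; a minimal scikit-learn sketch:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)  # bootstraps and trains all 200 trees internally
```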
6. Aggregation of Predictions (Ensembling)#
After all trees are trained, make predictions on new/unseen data:
Regression: Average predictions from all trees.
Classification: Majority vote across all trees.
Effect:
Reduces variance → predictions are more stable than a single tree.
Robust to noise → errors from individual trees are averaged out.
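The aggregation can be reproduced by hand from the fitted forest above. One nuance: scikit-learn classifiers average the per-tree class probabilities (soft voting) rather than taking a hard majority vote, so this sketch mirrors what rf.predict does:

```python
import numpy as np

# Shape: (n_trees, n_samples, n_classes); .to_numpy() matches how sub-trees were fit.
per_tree = np.stack([t.predict_proba(X_test.to_numpy()) for t in rf.estimators_])
avg_proba = per_tree.mean(axis=0)                 # average over trees
pred = rf.classes_[np.argmax(avg_proba, axis=1)]  # map back to class labels
# Regression instead averages raw predictions: per-tree predict(), then mean(axis=0).
```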
7. Model Evaluation#
Evaluate performance on test/validation set:
Regression: R², RMSE, MAE
Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC
Optional: Out-of-Bag (OOB) Error
Since each tree sees only ~63% of the training data (a consequence of bootstrap sampling), the remaining ~37% can act as a validation set for that tree → gives a nearly unbiased error estimate without a separate test set. A sketch follows.
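A sketch of both evaluations, refitting with oob_score=True so the OOB estimate is computed during training:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print("OOB score:", rf.oob_score_)  # validation-style estimate, no test set used
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print(classification_report(y_test, rf.predict(X_test)))
```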
8. Hyperparameter Tuning#
Common parameters to tune:
n_estimators → number of trees
max_depth → maximum depth of trees
min_samples_split / min_samples_leaf → control tree complexity
max_features → number of features considered at each split
Use GridSearchCV or RandomizedSearchCV to search these parameters systematically, as in the sketch below.
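A tuning sketch; the grid values below are illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```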
9. Feature Importance (Optional)#
Random Forest can compute feature importance scores:
Measures how much each feature reduces impurity across all trees
Helps in interpreting the model
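Impurity-based importances come for free after fitting; a sketch, assuming X_train is a DataFrame so column names are available:

```python
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
# Note: impurity-based importances can favor high-cardinality features;
# sklearn.inspection.permutation_importance is a less biased alternative.
```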
10. Deployment#
Once trained and validated, the Random Forest model can be deployed to predict unseen data.
Predictions are aggregated outputs of all the trees.
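A minimal persistence sketch with joblib, the usual choice for scikit-learn estimators; the file name is a placeholder:

```python
import joblib

joblib.dump(rf, "random_forest.joblib")      # save the trained ensemble
model = joblib.load("random_forest.joblib")  # reload in the serving process
predictions = model.predict(X_test)          # aggregated output of all trees
```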
Workflow Summary Diagram (Mental Picture)#
Input Data → clean & split →
Bootstrap Samples → multiple trees trained independently →
Random Feature Selection at Each Split → tree grows deep →
Aggregate Predictions → final ensemble output →
Evaluate & Tune Hyperparameters → deploy model