Workflows#

1. Problem Understanding & Data Preparation#

  1. Define the task:

    • Classification → predict class labels

    • Regression → predict continuous values

  2. Collect & clean data:

    • Handle missing values (some Random Forest implementations can handle them natively; otherwise impute them first)

    • Encode categorical features if needed

  3. Split dataset:

    • Usually into train and test sets (see the sketch below)

    • Optional: cross-validation for hyperparameter tuning
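
A minimal sketch of these steps with pandas and scikit-learn; the file name data.csv and the target column are placeholders, and median imputation is just one simple choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical file name

# Simple median imputation for missing numeric values.
df = df.fillna(df.median(numeric_only=True))

# One-hot encode categorical features.
X = pd.get_dummies(df.drop(columns=["target"]))  # "target" is a placeholder
y = df["target"]

# Hold out a test set for step 7.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```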


2. Feature Selection (Optional)#

  • Random Forest is robust to irrelevant features because it considers only a random subset of candidate features at each split.

  • Still, removing clearly useless features can cut training time and memory use (see the sketch below).
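
If you do prune, one common approach (a sketch, not the only way) is to fit a quick forest and keep only features above an importance threshold using scikit-learn's SelectFromModel; the threshold is a judgment call:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose impurity-based importance exceeds the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean",
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```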


3. Bootstrap Sampling (Bagging)#

  • For each tree in the forest:

    1. Randomly sample N observations with replacement from the training set, where N is the number of training rows.

    2. This sample is called a bootstrapped dataset.

Effect: Each tree sees slightly different data → introduces diversity among trees.
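
In NumPy, drawing one bootstrapped dataset is a single call; a sketch, reusing X_train/y_train from step 1:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = len(X_train)

# Draw N row indices with replacement: some rows repeat, others never appear.
boot_idx = rng.integers(0, n, size=n)
X_boot, y_boot = X_train.iloc[boot_idx], y_train.iloc[boot_idx]

# Rows never drawn are "out-of-bag" for this tree (used in step 7).
oob_mask = ~np.isin(np.arange(n), boot_idx)
```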


4. Tree Building (Individual Decision Trees)#

  • For each tree:

    1. Start at the root node.

    2. At each split, select a random subset of features.

    3. Choose the best split based on:

      • Regression: Variance reduction / MSE

      • Classification: Gini impurity or entropy

    4. Repeat recursively until stopping criteria:

      • Maximum depth reached (max_depth)

      • Minimum samples per leaf (min_samples_leaf)

      • All leaves are pure

Note: Trees are usually grown deep → individual trees may overfit on their own; the averaging in step 6 compensates.
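
The split criteria from step 3 are simple to state in code; a sketch of all three:

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy in bits; 0 for a pure node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def variance_reduction(y, y_left, y_right):
    """Regression criterion: drop in weighted variance after a split."""
    w_left, w_right = len(y_left) / len(y), len(y_right) / len(y)
    return np.var(y) - (w_left * np.var(y_left) + w_right * np.var(y_right))
```

In scikit-learn these correspond to criterion="gini" / "entropy" (classification) and criterion="squared_error" (regression), and the stopping criteria map directly to the max_depth and min_samples_leaf constructor arguments.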


5. Repeat for All Trees#

  • Build n_estimators trees independently, each on its own bootstrapped sample (see the sketch below).

  • Each tree is slightly different due to data sampling + random feature selection.
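
Putting steps 3–5 together, a from-scratch sketch of the training loop, reusing rng and n from step 3 (scikit-learn does essentially this internally, in parallel via n_jobs):

```python
from sklearn.tree import DecisionTreeClassifier

n_estimators = 100
trees, boot_indices = [], []

for _ in range(n_estimators):
    # A fresh bootstrap sample per tree (data diversity).
    idx = rng.integers(0, n, size=n)
    # max_features="sqrt" adds per-split feature randomness.
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    trees.append(tree)
    boot_indices.append(idx)
```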


6. Aggregation of Predictions (Ensembling)#

  • After all trees are trained, make predictions on new/unseen data:

    • Regression: Average predictions from all trees.

    • Classification: Majority vote across all trees.

Effect:

  • Reduces variance → predictions are more stable than a single tree.

  • Robust to noise → errors from individual trees are averaged out.
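
Aggregating the hand-built trees from step 5; a sketch, assuming integer-encoded class labels:

```python
import numpy as np

# One row of predictions per tree: shape (n_estimators, n_samples).
all_preds = np.array([t.predict(X_test) for t in trees])

# Classification: majority vote, i.e. the most frequent label per column.
y_pred = np.array([np.bincount(col.astype(int)).argmax() for col in all_preds.T])

# Regression (with regressor trees) would instead be:
# y_pred = all_preds.mean(axis=0)
```

Note that scikit-learn's RandomForestClassifier actually soft-votes: it averages the trees' predicted class probabilities rather than counting hard votes.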


7. Model Evaluation#

  • Evaluate performance on test/validation set:

    • Regression: R², RMSE, MAE

    • Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC

  • Optional: Out-of-Bag (OOB) Error

    • Since each bootstrap sample contains only ~63.2% of the unique training rows (the chance a row is never drawn in N draws with replacement is (1 − 1/N)^N ≈ e⁻¹ ≈ 36.8%), the remaining ~36.8% act as a per-tree validation set → an almost unbiased error estimate without a separate test set.
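
Both the held-out metrics and the OOB estimate in scikit-learn; a sketch (oob_score=True requires the default bootstrap=True):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
clf.fit(X_train, y_train)

# OOB estimate: computed from the ~36.8% of rows each tree never saw.
print("OOB score:", clf.oob_score_)

# Held-out test metrics (precision, recall, F1 per class).
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```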


8. Hyperparameter Tuning#

  • Common parameters to tune:

    • n_estimators → number of trees

    • max_depth → maximum depth of trees

    • min_samples_split / min_samples_leaf → control complexity

    • max_features → number of features considered at each split

  • Use GridSearchCV or RandomizedSearchCV to search the parameter space (sketched below).
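
A RandomizedSearchCV sketch over the parameters listed above; the grid values are illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```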


9. Feature Importance (Optional)#

  • Random Forest can compute feature importance scores:

    • Measures how much each feature reduces impurity, averaged across all trees (mean decrease in impurity)

    • Helps in interpreting the model
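
Reading the scores off a fitted forest (reusing clf from step 7); impurity-based importances are known to favor high-cardinality features, so permutation importance is a common cross-check:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Impurity-based importances (mean decrease in impurity, MDI).
mdi = pd.Series(clf.feature_importances_, index=X_train.columns)
print(mdi.sort_values(ascending=False).head(10))

# Permutation importance on held-out data: less biased, more expensive.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
perm_scores = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_scores.sort_values(ascending=False).head(10))
```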


10. Deployment#

  • Once trained and validated, the Random Forest model can be deployed to make predictions on unseen data.

  • Predictions are aggregated outputs of all the trees.
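
A minimal persistence sketch with joblib; the file name and X_new are placeholders:

```python
import joblib

# Persist the fitted model to disk...
joblib.dump(clf, "rf_model.joblib")  # hypothetical path

# ...and reload it in the serving environment.
model = joblib.load("rf_model.joblib")
predictions = model.predict(X_new)  # X_new: incoming rows, same columns as X_train
```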


Workflow Summary Diagram (Mental Picture)#

  1. Input Data → clean & split →

  2. Bootstrap Samples → multiple trees trained independently →

  3. Random Feature Selection at Each Split → trees grow deep →

  4. Aggregate Predictions → final ensemble output →

  5. Evaluate & Tune Hyperparameters → deploy model