Workflows#

1. Problem Understanding & Data Preparation#

  1. Define the task:

    • Classification → predict class labels

    • Regression → predict continuous values

  2. Collect & clean data:

    • Handle missing values (some Random Forest implementations can handle them natively; otherwise impute them first)

    • Encode categorical features if needed

  3. Split dataset:

    • Usually into train and test sets (see the sketch below)

    • Optional: cross-validation for hyperparameter tuning
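
A minimal sketch of these steps with pandas and scikit-learn; the file name data.csv and the target column are placeholders, and median imputation is just one simple choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical file name

# Simple median imputation for missing numeric values.
df = df.fillna(df.median(numeric_only=True))

# One-hot encode categorical features.
X = pd.get_dummies(df.drop(columns=["target"]))  # "target" is a placeholder
y = df["target"]

# Hold out a test set for step 7.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```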


2. Feature Selection (Optional)#

  • Random Forest is robust to irrelevant features because it considers only a random subset of candidate features at each split.

  • Still, removing clearly useless features can cut training time and memory use (see the sketch below).
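
If you do prune, one common approach (a sketch, not the only way) is to fit a quick forest and keep only features above an importance threshold using scikit-learn's SelectFromModel; the threshold is a judgment call:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose impurity-based importance exceeds the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean",
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
```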


3. Bootstrap Sampling (Bagging)#

  • For each tree in the forest:

    1. Randomly sample N observations with replacement from the training set, where N is the number of training rows.

    2. This sample is called a bootstrapped dataset.

Effect: Each tree sees slightly different data → introduces diversity among trees.
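
In NumPy, drawing one bootstrapped dataset is a single call; a sketch, reusing X_train/y_train from step 1:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = len(X_train)

# Draw N row indices with replacement: some rows repeat, others never appear.
boot_idx = rng.integers(0, n, size=n)
X_boot, y_boot = X_train.iloc[boot_idx], y_train.iloc[boot_idx]

# Rows never drawn are "out-of-bag" for this tree (used in step 7).
oob_mask = ~np.isin(np.arange(n), boot_idx)
```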


4. Tree Building (Individual Decision Trees)#

  • For each tree:

    1. Start at the root node.

    2. At each split, select a random subset of features.

    3. Choose the best split based on:

      • Regression: Variance reduction / MSE

      • Classification: Gini impurity or entropy

    4. Repeat recursively until stopping criteria:

      • Maximum depth reached (max_depth)

      • Minimum samples per leaf (min_samples_leaf)

      • All leaves are pure

Note: Trees are usually grown deep → individual trees may overfit on their own; the averaging in step 6 compensates.
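
The split criteria from step 3 are simple to state in code; a sketch of all three:

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy in bits; 0 for a pure node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def variance_reduction(y, y_left, y_right):
    """Regression criterion: drop in weighted variance after a split."""
    w_left, w_right = len(y_left) / len(y), len(y_right) / len(y)
    return np.var(y) - (w_left * np.var(y_left) + w_right * np.var(y_right))
```

In scikit-learn these correspond to criterion="gini" / "entropy" (classification) and criterion="squared_error" (regression), and the stopping criteria map directly to the max_depth and min_samples_leaf constructor arguments.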


5. Repeat for All Trees#

  • Build n_estimators trees independently, each on its own bootstrapped sample (see the sketch below).

  • Each tree is slightly different due to data sampling + random feature selection.
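
Putting steps 3–5 together, a from-scratch sketch of the training loop, reusing rng and n from step 3 (scikit-learn does essentially this internally, in parallel via n_jobs):

```python
from sklearn.tree import DecisionTreeClassifier

n_estimators = 100
trees, boot_indices = [], []

for _ in range(n_estimators):
    # A fresh bootstrap sample per tree (data diversity).
    idx = rng.integers(0, n, size=n)
    # max_features="sqrt" adds per-split feature randomness.
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    trees.append(tree)
    boot_indices.append(idx)
```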


6. Aggregation of Predictions (Ensembling)#

  • After all trees are trained, make predictions on new/unseen data:

    • Regression: Average predictions from all trees.

    • Classification: Majority vote across all trees.

Effect:

  • Reduces variance → predictions are more stable than a single tree.

  • Robust to noise → errors from individual trees are averaged out.
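
Aggregating the hand-built trees from step 5; a sketch, assuming integer-encoded class labels:

```python
import numpy as np

# One row of predictions per tree: shape (n_estimators, n_samples).
all_preds = np.array([t.predict(X_test) for t in trees])

# Classification: majority vote, i.e. the most frequent label per column.
y_pred = np.array([np.bincount(col.astype(int)).argmax() for col in all_preds.T])

# Regression (with regressor trees) would instead be:
# y_pred = all_preds.mean(axis=0)
```

Note that scikit-learn's RandomForestClassifier actually soft-votes: it averages the trees' predicted class probabilities rather than counting hard votes.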


7. Model Evaluation#

  • Evaluate performance on test/validation set:

    • Regression: R², RMSE, MAE

    • Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC

  • Optional: Out-of-Bag (OOB) Error

    • Since each bootstrap sample contains only ~63.2% of the unique training rows (the chance a row is never drawn in N draws with replacement is (1 − 1/N)^N ≈ e⁻¹ ≈ 36.8%), the remaining ~36.8% act as a per-tree validation set → an almost unbiased error estimate without a separate test set.
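
Both the held-out metrics and the OOB estimate in scikit-learn; a sketch (oob_score=True requires the default bootstrap=True):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
clf.fit(X_train, y_train)

# OOB estimate: computed from the ~36.8% of rows each tree never saw.
print("OOB score:", clf.oob_score_)

# Held-out test metrics (precision, recall, F1 per class).
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```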


8. Hyperparameter Tuning#

  • Common parameters to tune:

    • n_estimators → number of trees

    • max_depth → maximum depth of trees

    • min_samples_split / min_samples_leaf → control complexity

    • max_features → number of features considered at each split

  • Use GridSearchCV or RandomizedSearchCV to search the parameter space (sketched below).
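
A RandomizedSearchCV sketch over the parameters listed above; the grid values are illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```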


9. Feature Importance (Optional)#

  • Random Forest can compute feature importance scores:

    • Measures how much each feature reduces impurity, averaged across all trees (mean decrease in impurity)

    • Helps in interpreting the model
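
Reading the scores off a fitted forest (reusing clf from step 7); impurity-based importances are known to favor high-cardinality features, so permutation importance is a common cross-check:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Impurity-based importances (mean decrease in impurity, MDI).
mdi = pd.Series(clf.feature_importances_, index=X_train.columns)
print(mdi.sort_values(ascending=False).head(10))

# Permutation importance on held-out data: less biased, more expensive.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
perm_scores = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_scores.sort_values(ascending=False).head(10))
```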


10. Deployment#

  • Once trained and validated, the Random Forest model can be deployed to make predictions on unseen data.

  • Predictions are aggregated outputs of all the trees.
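
A minimal persistence sketch with joblib; the file name and X_new are placeholders:

```python
import joblib

# Persist the fitted model to disk...
joblib.dump(clf, "rf_model.joblib")  # hypothetical path

# ...and reload it in the serving environment.
model = joblib.load("rf_model.joblib")
predictions = model.predict(X_new)  # X_new: incoming rows, same columns as X_train
```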


Workflow Summary Diagram (Mental Picture)#

  1. Input Data → clean & split →

  2. Bootstrap Samples → multiple trees trained independently →

  3. Random Feature Selection at Each Split → trees grow deep →

  4. Aggregate Predictions → final ensemble output →

  5. Evaluate & Tune Hyperparameters → deploy model