Random Forest
Random forest is an ensemble learning that combines several decision trees to mitigate overfitting and reduce the variance of the model.
Table of contents
Ensemble Methods
As described above, decision trees tend to overfit and are unstable upon slightly varying data.
To mitigate these issues, random forest utilizes two ensemble methods:
We know that an ensemble reduces the variance of the estimator by averaging over multiple samples. However, unlike the idealistic case where the samples are independent, bagging is an ensemble over highly-correlated bootstrap samples, which increases variance when averaging.
So we combine random subspace method to reduce the correlation between ensembled trees.
Hyperparameters would be the number of trees $B$ and the number of features to use for each tree $m < p$.
Procedure
Bootstrap Sampling
- Generate $B$ bootstrap samples from the original data set: $Z^{*b}$.
Tree Building
- Train a decision tree on each of the $B$ bootstrap samples
When training each tree, for every split, randomly select $m < p$ candidate features from the total $p$ features.
Usually $m \approx \sqrt{p}$.
These candidates are newly selected for each split. This is why there is not much of a accuracy drop compared to when we select from the full $p$.
- Select the (greedy) best split from the $m$ features.
Pruning is unnecessary in random forests, because bagging and random subspacing already mitigate overfitting.
Aggregating
Prediction is averaged over the $B$ trees.
Measuring the Importance of Predictors
In regular decision trees, we can measure the importance of predictors by the order in which they are selected for splitting.
But in random forests or bagged trees (one with no random subspacing), we don’t have an explicit tree, we just know the average prediction.
When training each tree, we greedily minimize either RSS for regression or Gini impurity (entropy, etc.) for classification on each split.
For each predictor, when we branch on it, we keep a sum of the decrease in RSS or Gini impurity resulting from that split.
Then we average over all $B$ trees to get the importance of each predictor.
FYI, you cannot do this for boosted trees.
Drawbacks
- We lose interpretability, which is a strength of decision trees.
- Obviously, it’s computationally expensive.
- But not really a problem these days.