Boosted Trees

Boosting is another ensemble method, but unlike bagging and random forests it does not rely on bootstrap sampling.

One drawback is that we lose interpretability, even more so than with random forests.

Table of contents
  1. Difference from Bagging
  2. Process
    1. Initial Model and Training Set
    2. Number of Trees $B$
    3. Number of Splits $d$
    4. Training Each Tree
    5. Shrinkage Parameter $\lambda$
    6. Update the Model
    7. Final Model

Difference from Bagging

  • We do not use bootstrap sampling.
  • We fit each tree to the entire dataset, but in a modified form: each tree is trained on the current residuals rather than on the original responses.

Fitting the Data Hard

When we fit a single large tree to the data in one go, we say we are fitting the data hard.

This often leads to issues like overfitting and high variance.

In boosting, we still learn from the entire dataset, but we do so slowly.


Process

The following is explained in the context of regression. Classification is a bit trickier, but it is still a sequential approach in which each small tree is fit to the errors of the model built so far.

Initial Model and Training Set

With boosting, we start from a model that predicts zero everywhere:

\[\hat{f}(x) = 0\]

and sequentially fit each new tree to the residuals left by the current model.

The initial residuals are just the response values themselves, since the model just outputs zero:

\[\forall i,\; r_i = y_i\]

So our initial training set is:

$$ \left\{ x_i, r_i \right\}_{i=1}^{n} $$

This is identical to the original dataset, since $r_i = y_i$.
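
As a quick illustration in code (a minimal NumPy sketch; the toy data here is a placeholder for the real training set), the initialization is simply:

```python
import numpy as np

# toy placeholder data: n = 100 observations, p = 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

f_hat = np.zeros(len(y))   # the initial model predicts 0 for every observation ...
r = y.copy()               # ... so the initial residuals are just the responses

# the initial training set {(x_i, r_i)} is therefore the original dataset
assert np.allclose(r, y)
```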

Number of Trees $B$

We are given a fixed number of trees $B$ to fit.

This $B$ has nothing to do with bootstrap sampling.

Unlike in bagging, a large $B$ can lead to overfitting (though only gradually, since we learn slowly), so we use cross-validation to find the optimal $B$.
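
As a hedged sketch of how this might look in practice, here is scikit-learn's GradientBoostingRegressor cross-validated over the number of trees (the synthetic data and the candidate values of $B$ are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# synthetic placeholder data standing in for the real training set
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# cross-validate over the number of trees B (n_estimators), holding the
# tree size and shrinkage fixed at small "slow learning" values
grid = GridSearchCV(
    GradientBoostingRegressor(max_depth=1, learning_rate=0.01),
    param_grid={"n_estimators": [100, 500, 1000, 2000]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print("best B:", grid.best_params_["n_estimators"])
```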

Number of Splits $d$

We select $d$, a small number which represents how many splits each intermediate tree can have ($d$ splits = $d+1$ terminal nodes).

This is to ensure that we’re never fitting the data too hard in one go.

Choosing $d$

We often choose $d=1$.

In this case, each tree is called a stump. Using a stump leads to an additive model:

\[\hat{Y} = \beta_0 + \sum_{j=1}^p f_j(X_j)\]
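
To see why: a stump makes a single split, so each $\hat{f}^b$ depends on only one predictor. Grouping the shrunken stumps by the predictor they split on gives

$$ \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x) = \sum_{j=1}^{p} \underbrace{\sum_{b \,:\, \text{stump } b \text{ splits on } X_j} \lambda \hat{f}^b(x)}_{f_j(X_j)} $$

which, after centering each $f_j$ and collecting the constants into $\beta_0$, is exactly the additive form above.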

Training Each Tree

Each tree is fit to the current residuals rather than to the original responses. That is, the $b$-th tree

\[\hat{f}^b(x)\]

is a tree with $d$ splits, trained on the current training set $\left\{ (x_i, r_i) \right\}_{i=1}^{n}$.

Shrinkage Parameter $\lambda$

We select $\lambda > 0$, a small shrinkage parameter (such as $0.01$ or $0.001$) that controls the rate of learning. Again, this is to ensure slow learning.

A smaller $\lambda$ requires more trees $B$ to achieve the same fit.
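
A small illustration of this trade-off (again with scikit-learn on synthetic placeholder data): with the smaller shrinkage, the training error after the same number of trees is noticeably higher, so more trees are needed to reach a comparable fit.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# synthetic placeholder data standing in for the real training set
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for lam in (0.1, 0.01):
    model = GradientBoostingRegressor(n_estimators=1000, learning_rate=lam, max_depth=1)
    model.fit(X, y)
    # training error after each of the 1000 boosting stages
    mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
    print(f"lambda={lam}: MSE after 100 trees = {mse[99]:.1f}, after 1000 trees = {mse[999]:.1f}")
```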

Update the Model

Once we have the $b$-th tree, we update our model:

$$ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x) $$

We also update each residual:

$$ r_i \leftarrow r_i - \lambda \hat{f}^b(x_i) $$
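
These two updates stay consistent with each other: after every step the residual is exactly the part of the response that the current model has not yet explained,

$$ r_i = y_i - \hat{f}(x_i) $$

so the next tree is always fit to whatever the model so far has left over.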

We repeat training and updating $B$ times.

Final Model

Our final model is the shrunken sum of all the trees:

$$ \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x) $$
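
Putting the steps together, here is a minimal from-scratch sketch of the procedure above (the data and the values of $B$, $d$, and $\lambda$ are illustrative assumptions; scikit-learn's DecisionTreeRegressor stands in for the individual trees):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic placeholder data standing in for the real training set
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

B, d, lam = 1000, 1, 0.01           # number of trees, splits per tree, shrinkage

trees = []
r = y.copy()                        # initial residuals: r_i = y_i (the model starts at 0)

for b in range(B):
    # fit a small tree with d splits (d + 1 terminal nodes) to the current residuals
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=0)
    tree.fit(X, r)
    trees.append(tree)

    # r_i <- r_i - lambda * f^b(x_i): remove this tree's shrunken contribution
    r -= lam * tree.predict(X)

def f_hat(X_new):
    """Final model: the shrunken sum of all B trees."""
    return lam * sum(tree.predict(X_new) for tree in trees)

print("training MSE:", np.mean((y - f_hat(X)) ** 2))
```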