Test Error Estimation / Model Selection

Table of contents
  1. What is Test Error Estimation?
    1. Test Set
    2. Test Error
    3. Model Selection
  2. Validation Set Approach
  3. Cross-Validation
  4. Bootstrap Underestimates Test Error
  5. Adjusted Training Error
    1. Mallows’s $C_p$
    2. AIC and BIC
    3. Adjusted $R^2$
    4. Comparison to Cross-Validation
  6. One-Standard-Error Rule

What is Test Error Estimation?

Test Set

A test set is the set of data specifically designated for testing only.

Except in special cases, the data we have is usually limited and barely enough to train the model, let alone test it.

Test Error

Because we rarely have enough data left over for testing, the test error is technically unknown.

That is why we need a method to estimate the test error.

There are two common approaches:

  • Directly estimate the test error by holding out a subset of the data
    • Validation Set Approach, Cross-Validation
  • Indirectly estimate the test error with adjusted training error

    Adjustment is necessary since training errors can grossly underestimate the test error.

    • $C_p$, AIC, BIC, Adjusted $R^2$

Model Selection

Estimates of the test error allow us to compare different models, and are thus crucial for model selection.

The validation set approach, cross-validation, and adjusted training error methods all ultimately aim to help us choose the best model.
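
As a minimal sketch of this idea (simulated data, NumPy and scikit-learn; polynomial degree stands in for model complexity, and none of the specifics come from the linked notes), we can hold out part of the data, estimate each model's test error on it, and pick the model with the lowest estimate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data: a cubic signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**3 - 2 * x + rng.normal(scale=2.0, size=x.shape)
X = x.reshape(-1, 1)

# Hold out half the data as a validation set (the validation set approach).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Compare models of different complexity by their estimated test error.
val_mse = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse[degree] = np.mean((model.predict(X_val) - y_val) ** 2)

# Model selection: pick the degree with the lowest validation error.
best_degree = min(val_mse, key=val_mse.get)
print(val_mse)
print("selected degree:", best_degree)
```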


Validation Set Approach

See here


Cross-Validation

See here


Bootstrap Underestimates Test Error

You will see that the bootstrap is also a resampling method, like cross-validation.

Bootstrap is more commonly used to estimate the variance of an estimator, while cross-validation is used to estimate the test error of a model.
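
As context, a minimal sketch of that more typical use of the bootstrap (estimating the variance of an estimator, here the sample median; the data and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)  # the original dataset, n = 100

# Bootstrap: resample the data with replacement, recompute the estimator on each
# resample, and use the spread of those replicates to estimate its variance.
n_boot = 2000
medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

print("bootstrap estimate of Var(median):", medians.var(ddof=1))
print("bootstrap standard error of the median:", medians.std(ddof=1))
```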

  • Why not use bootstrap to estimate the test error?

If we were to use bootstrap, what can we use as a validation set? One idea is to use the original dataset as the validation set.

Each bootstrap sample contains about $2/3$ of the original data.

Why?

Let $n$ be the number of samples in the original dataset.

Since each of the $n$ draws is made with replacement, the probability that sample $i$ is never selected in a bootstrap sample (also of size $n$) is $\left(\frac{n-1}{n}\right)^n$, and

\[\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n = \frac{1}{e} \approx 0.368\]

So the probability of sample $i$ appearing in a given bootstrap sample is about $1 - 0.368 \approx 2/3$.
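
A quick simulation (not from the notes, just a sanity check) confirms the roughly two-thirds figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # size of the original dataset

# Draw many bootstrap samples and record what fraction of the original
# observations appears in each one.
fractions = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)           # n indices drawn with replacement
    fractions.append(np.unique(idx).size / n)  # fraction of distinct originals included

print(np.mean(fractions))  # about 0.632
print(1 - np.exp(-1))      # theoretical value 1 - 1/e, roughly 0.632
```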

Cross-validation works because it makes sure there is no overlap between training and validation sets.

However, with this choice of validation set, roughly two-thirds of the validation observations also appear in each bootstrap training set, so the model is evaluated largely on data it has already seen, which leads to a severe underestimate of the test error.

So, for bootstrapping to work, we would have to use, for each bootstrap sample, only the observations that happened not to be drawn into it as the validation set (this is known as the out-of-bag error).
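
A rough sketch of that out-of-bag procedure (plain NumPy plus scikit-learn, purely illustrative and not the notes' own implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

oob_mses = []
for _ in range(100):
    # Bootstrap training set: n indices drawn with replacement.
    train_idx = rng.integers(0, n, size=n)
    # Out-of-bag validation set: the observations NOT drawn this round.
    oob_mask = ~np.isin(np.arange(n), train_idx)
    if not oob_mask.any():
        continue
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oob_mses.append(np.mean((model.predict(X[oob_mask]) - y[oob_mask]) ** 2))

print("out-of-bag estimate of test MSE:", np.mean(oob_mses))
```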

But this gets messy, so we stick to cross-validation for test error estimation.


Adjusted Training Error

The validation set approach and cross-validation try to estimate the performance of a model by directly estimating its test error.

There are also ways to estimate performance indirectly, by adjusting the training error.

Since training error can be minimized by overfitting complex models, these methods usually include penalty terms for complexity.
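
For example, one common form of Mallows's $C_p$ for a least-squares fit on $d$ predictors (notation follows ISLR and may differ from the linked notes) adds a penalty that grows with $d$ to the training RSS:

\[C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)\]

where $\hat{\sigma}^2$ is an estimate of the variance of the error term. AIC and BIC have a similar structure with different penalty weights, and adjusted $R^2$ penalizes complexity through a degrees-of-freedom correction.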

Mallows’s $C_p$

See here

AIC and BIC

See here

Adjusted $R^2$

See here

Comparison to Cross-Validation

  • CV makes fewer assumptions about the true model.
  • Adjusted training error estimates require an estimate of $\mathrm{Var}(\epsilon)$, the variance of the error term (see the sketch after this list).
    • CV is better when $\mathrm{Var}(\epsilon)$ is hard to estimate.
  • CV is computationally more expensive.
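
To make the second and third points concrete, here is a minimal sketch (simulated data, scikit-learn, and the ISLR-style $C_p$ from above; none of it is from the linked notes) that computes both criteria for a sequence of nested models. Note that $C_p$ needs $\hat{\sigma}^2$ while CV does not, and that CV refits the model once per fold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=n)  # only 2 predictors matter

# C_p needs an estimate of Var(eps); a common choice is the residual variance
# of the full model that uses all p predictors.
full = LinearRegression().fit(X, y)
rss_full = np.sum((y - full.predict(X)) ** 2)
sigma2_hat = rss_full / (n - p - 1)

def cp(X_sub, d):
    """Mallows's C_p (ISLR form) for a least-squares fit on d predictors."""
    rss = np.sum((y - LinearRegression().fit(X_sub, y).predict(X_sub)) ** 2)
    return (rss + 2 * d * sigma2_hat) / n

def cv_mse(X_sub):
    """10-fold cross-validated MSE; needs no estimate of Var(eps)."""
    scores = cross_val_score(LinearRegression(), X_sub, y,
                             scoring="neg_mean_squared_error", cv=10)
    return -scores.mean()

# Evaluate nested models that use the first d predictors.
for d in range(1, p + 1):
    X_sub = X[:, :d]
    print(f"d={d}  C_p={cp(X_sub, d):.3f}  CV_MSE={cv_mse(X_sub):.3f}")
```

Both criteria should bottom out around $d = 2$ here; the difference lies in what they need in order to be computed.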

One-Standard-Error Rule

See here