Bootstrap

When we don’t know the distribution of the data, we often use non-parametric inference and try to estimate important statistics with the empirical distribution function.

Calculating the measures of uncertainty (e.g. standard error, confidence interval) of these estimated statistics is often difficult.

The bootstrap allows us to estimate these measures of uncertainty of estimated statistics using simple resampling techniques and Monte Carlo simulation.

Table of contents
  1. Layers of Approximation
  2. Monte Carlo Simulation
  3. Bootstrap Variance
  4. Bootstrap Percentile Confidence Interval
    1. Normal Interval
    2. Percentile Interval
    3. Pivot Interval
  5. In Practice

Layers of Approximation

Let’s say we have some samples $X_1, \dots, X_n$ from an unknown distribution $F$.

$T_n = g(X_1, \dots, X_n)$ is a sample statistic. For example: the sample mean, median, or variance.

We want to estimate the uncertainty in the sampling distribution of $T_n$, for example $\Var(T_n)$.

But the calculation of $\Var(T_n)$ is often difficult because we don’t know $F$.

Instead, we use the empirical distribution function $\hat{F}_n$.

We draw samples from $\hat{F}_n$, calculate the statistic of interest on each resample, and take the sample variance of the results (the bootstrap samples).

Then we argue that this sample variance of the bootstrap samples is a good estimate of the variance of the sampling distribution of $T_n$ (see below).

Note that there are two layers of approximation happening in the bootstrap.

The first comes from the approximation of the true distribution with the empirical distribution function:

$$ \Var(T(F)) \approx \Var(T(\hat{F}_n)) $$

Using slightly different notation:

$T(F)$ is a statistical functional of the CDF $F$, and $T(\hat{F}_n)$ is the plug-in estimator using the eCDF.

The second comes from approximating $\Var(T(\hat{F}_n))$ with Monte Carlo simulation:

$$ \Var(T(\hat{F}_n)) \approx v_{boot}(T(\hat{F}_n)) $$

Increasing the number of simulations will make the second approximation more accurate.
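
To make the two layers concrete, here is a minimal NumPy sketch (a toy setup of our own, with $F = N(0, 1)$ and $T_n$ the sample median): the first estimate approximates $\Var(T(F))$ by simulating fresh datasets from the known $F$; the second is the bootstrap estimate computed from a single observed dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 5000

# "Ground truth" for Var(T(F)): draw many fresh datasets from F itself.
true_var = np.var([np.median(rng.normal(size=n)) for _ in range(B)], ddof=1)

# Both layers: a single observed dataset, then resampling from F_hat_n.
x = rng.normal(size=n)
boot_var = np.var([np.median(rng.choice(x, size=n, replace=True))
                   for _ in range(B)], ddof=1)

print(true_var, boot_var)  # close but not equal: layer one never vanishes
```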


Monte Carlo Simulation

By the law of large numbers, as $B \to \infty$,

\[\frac{1}{B} \sum_{b=1}^B g(X_b) \xrightarrow{P} E[g(X)] = \int g(x) dF(x)\]

and the sample variance follows:

\[\frac{1}{B-1} \sum_{b=1}^B (g(X_b) - \overline{g})^2 \xrightarrow{P} \Var(g(X)) = \int g^2(x) dF(x) - \left(\int g(x) dF(x)\right)^2\]

The same applies to a multivariate function $g(X_1, \dots, X_n) = T_n$, as long as the IID assumption holds:

\[\frac{1}{B} \sum_{b=1}^B T_{n, b} \xrightarrow{P} E[T_n]\]

\[\frac{1}{B-1} \sum_{b=1}^B (T_{n, b} - \overline{T}_n)^2 \xrightarrow{P} \Var(T_n)\]

This is the idea behind the bootstrap.

The only difference is that we draw the samples from $\hat{F}_n$ instead of $F$.
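
Here is a minimal sketch of plain Monte Carlo estimation, assuming $F$ is Uniform(0, 1) and $g(x) = x^2$, so the targets $E[g(X)] = 1/3$ and $\Var(g(X)) = 1/5 - 1/9 = 4/45 \approx 0.0889$ are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

for B in (100, 10_000, 1_000_000):
    g = rng.uniform(0, 1, size=B) ** 2  # g(X_b) = X_b^2 for B draws from F
    print(f"B={B:>9}: mean={g.mean():.4f}, var={g.var(ddof=1):.4f}")
```

Both estimates settle near their closed-form targets as $B$ grows, which is the convergence above in action.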


Bootstrap Variance

Let $T_n = g(X_1, \dots, X_n)$ be a statistic of interest where $X_1, \dots, X_n$ are IID samples from an unknown distribution $F$.

We want to estimate $\Var(T_n) \approx v_{boot}(T_n)$.

  • Resampling: Draw $n$ samples with replacement from the eCDF $\hat{F}_n$.

    \[X_1^*, \dots, X_n^* \sim \hat{F}_n\]
  • Calculation: Calculate the statistic of interest using the resampled data.

    \[T_{n, b}^* = g(X_1^*, \dots, X_n^*)\]
  • Repeat: Repeat the above two steps $B$ times to obtain $B$ bootstrap samples.

    \[T_{n, 1}^*, \dots, T_{n, B}^*\]
  • Estimation: Calculate the sample variance of the bootstrap samples.

    \[v_{boot}(T_n) = \frac{1}{B-1} \sum_{b=1}^B \left(T_{n, b}^* - \frac{1}{B} \sum_{r=1}^B T_{n, r}^*\right)^2\]
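
Putting the four steps together, a minimal sketch assuming NumPy and the sample median as $T_n$ (the name `bootstrap_variance` is ours):

```python
import numpy as np

def bootstrap_variance(x, stat=np.median, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Resampling + Calculation, repeated B times: draw n points with
    # replacement from x (i.e. from F_hat_n) and evaluate the statistic.
    t_star = np.array([stat(rng.choice(x, size=n, replace=True))
                       for _ in range(B)])
    # Estimation: sample variance of the bootstrap samples.
    return t_star.var(ddof=1)

x = np.random.default_rng(1).normal(size=100)  # stand-in for unknown F
print(bootstrap_variance(x))                   # v_boot(T_n)
```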

Bootstrap Percentile Confidence Interval

Normal Interval

When $T_n$ is approximately normally distributed, the normal interval can be used:

\[T_n \pm z_{\alpha/2} \sqrt{v_{boot}(T_n)}\]
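
A minimal sketch of this interval for the sample median, assuming NumPy; $z_{\alpha/2} \approx 1.96$ for $\alpha = 0.05$ is hard-coded to avoid a SciPy dependency:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                 # observed data
t_n = np.median(x)                       # T_n on the original sample
t_star = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                   for _ in range(2000)])
se = t_star.std(ddof=1)                  # sqrt(v_boot(T_n))
z = 1.96                                 # z_{alpha/2} for alpha = 0.05
print(t_n - z * se, t_n + z * se)        # 95% normal interval
```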

Percentile Interval

Let $T_{n, (1/B)}^* \leq T_{n, (2/B)}^* \leq \dots \leq T_{n, (1)}^*$ be the ordered bootstrap samples, so that $T_{n, (q)}^*$ denotes the $q$ sample quantile of the bootstrap samples.

The percentile interval is:

\[C_n = (T_{n, (\alpha/2)}^*, T_{n, (1-\alpha/2)}^*)\]
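
Given the bootstrap samples `t_star` from the previous sketch, the percentile interval is a direct quantile lookup (again with $\alpha = 0.05$):

```python
import numpy as np

alpha = 0.05
# t_star: array of B bootstrap samples T*_{n,b}, as in the previous sketch.
lo, hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
print(lo, hi)  # (T*_{n,(alpha/2)}, T*_{n,(1-alpha/2)})
```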

Pivot Interval

Let $\theta = T(F)$ be the true value of the statistic and $\hat{\theta}_n = T(\hat{F}_n)$ be its plug-in estimate from the empirical distribution function.

Define a pivot:

$$ R_n = \hat{\theta}_n - \theta $$

If we knew the distribution of $R_n$, we could compute its quantiles (the inverse CDF) and construct the CI just as we do with the standard normal distribution.

However, we don’t know the distribution of $R_n$:

\[H(r) = \Pr(R_n \leq r)\]

Instead we bootstrap the pivot, with $\hat{\theta}_n$ playing the role of $\theta$ and the bootstrap replicate $\hat{\theta}_{n, b}^*$ playing the role of $\hat{\theta}_n$:

\[R_{n, b}^* = \hat{\theta}_{n, b}^* - \hat{\theta}_n\]

Then the bootstrap estimate of the distribution is:

\[\hat{H}_{boot}(r) = \frac{1}{B} \sum_{b=1}^B I(R_{n, b}^* \leq r)\]

From $\hat{H}_{boot}(r)$ we can then read off its inverse, i.e. the quantiles of $R_n$.

To get the $1 - \alpha$ CI, we need both the $\alpha/2$ and the $1 - \alpha/2$ quantiles of $\hat{H}_{boot}(r)$, because, unlike the standard normal distribution, this distribution need not be symmetric.

Denote:

\[\begin{align*} r_{\alpha/2}^* &= \hat{H}_{boot}^{-1}(\alpha/2) \\[0.5em] r_{1 - \alpha/2}^* &= \hat{H}_{boot}^{-1}(1 - \alpha/2) \end{align*}\]

Starting from $\Pr(r_{\alpha/2}^* \leq R_n \leq r_{1 - \alpha/2}^*) \approx 1 - \alpha$ and substituting $R_n = \hat{\theta}_n - \theta$, solving for $\theta$ flips the quantiles, so the $1 - \alpha$ CI is:

\[\hat{\theta}_n - r_{1 - \alpha/2}^* \leq \theta \leq \hat{\theta}_n - r_{\alpha/2}^*\]
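
A minimal sketch of the pivot interval, under the same assumptions as the sketches above (NumPy, sample median, $\alpha = 0.05$); note how the upper pivot quantile produces the lower endpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
theta_hat = np.median(x)                       # theta_hat_n = T(F_hat_n)
theta_star = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                       for _ in range(2000)])  # theta*_{n,b}
alpha = 0.05
# Quantiles of the bootstrapped pivot R*_{n,b} = theta*_{n,b} - theta_hat_n.
r_lo, r_hi = np.quantile(theta_star - theta_hat, [alpha / 2, 1 - alpha / 2])
print(theta_hat - r_hi, theta_hat - r_lo)      # note the flipped quantiles
```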

In Practice

Same idea, just explained in a different way.

Instead of framing it as drawing from the empirical distribution function $\hat{F}_n$, we start from a given dataset $Z$.

Bootstrap samples are drawn from $Z$ with replacement.

The cardinality of a bootstrap sample is the same as that of the original dataset, which means a bootstrap sample can contain the same data point multiple times.
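
A minimal sketch of this view, assuming NumPy: resample row indices of a given dataset `Z` with replacement, so repeated points are expected:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.exponential(size=50)                # the given dataset
idx = rng.integers(0, len(Z), size=len(Z))  # indices drawn with replacement
Z_star = Z[idx]                             # one bootstrap sample, |Z*| = |Z|
# Repetition is expected: roughly 1 - 1/e ~ 63% of points appear at least once.
print(np.unique(idx).size, "distinct points out of", len(Z))
```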