Inferential Statistics
Recap: Why Inferential Statistics?
Our goal is to understand the population.
We have two main options:
- Complete enumeration
- Sample survey
Because complete enumeration is unrealistic, we need to figure out how to use samples to understand the population.
Inferential statistics is used to analyze samples and infer properties of the population.
Overview of the idea
Assume the population is represented by a probability distribution, called the population distribution.
The parameters define the shape of the population distribution.
The samples are values drawn from the population distribution.
From the samples, we infer the parameters of the population distribution.
Modeling
In reality, the population distribution will rarely match a mathematical probability distribution exactly.
Instead, we are merely modeling the population distribution with a certain probability distribution.
Sounds obvious, but it is an important concept to keep in mind.
Random Sampling
Random sampling is essential if we want an unbiased inference.
Simple Random Sampling
This is equivalent to drawing papers from a jar.
The results are ideal, but the method is often impractical because it can be time-consuming and costly.
Stratified Sampling
This is equivalent to drawing papers from jars of different colors.
We divide the population into strata, and draw random samples from each stratum.
Other Sampling Methods
To be added
- Systematic sampling
- Cluster sampling
Sampling Distribution
Sampling distribution is the probability distribution of a statistic obtained from a random sample of size $n$.
It is the probability distribution you observe when you repeatedly draw samples of size $n$ from the population and calculate each sample’s statistic.
Sample Statistic as a Random Variable
Take the example of the sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.
Each $x_i$ is also a random variable since it is randomly drawn from the population.
Then $\bar{x}$ is also a random variable since it is a function of $x_i$.
We generally use lower case letters to denote the realization of a random variable, and use upper case letters to denote the random variable itself.
To make it more clear, let’s use capital letters to denote the random variables.
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i $$
Then we know that $\bar{X}$ must have an associated probability distribution.
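As a rough sketch of this idea (the population below is a made-up NumPy array with arbitrary parameters), we can approximate the sampling distribution of $\bar{X}$ by repeatedly drawing samples of size $n$ and recording each sample's mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of one million values (arbitrary parameters).
population = rng.normal(loc=50, scale=10, size=1_000_000)

n = 30              # sample size
n_samples = 10_000  # number of repeated samples

# Each entry is one realization of the random variable X-bar.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_samples)
])

# The empirical distribution of these means approximates the
# sampling distribution of the sample mean.
print(sample_means.mean(), sample_means.std())
```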
Sampling Distribution of the Sample Mean
Below are some useful concepts for inferring the population mean from the sample mean.
Law of Large Numbers
The law of large numbers states that the sample mean $\bar{X}$ converges to the population mean $\mu$ as the sample size $n$ increases.
$$ \bar{X} \to \mu \text{ as } n \to \infty $$
Note that the law of large numbers is a statement about the sample mean; it does not automatically extend to other statistics.
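A minimal simulation sketch of the law of large numbers, assuming an exponential population with mean $\mu = 2$ (the choice of distribution is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0  # population mean of an Exponential(scale=2) distribution

# The sample mean gets closer to mu as the sample size grows.
for n in (10, 100, 10_000, 1_000_000):
    x_bar = rng.exponential(scale=mu, size=n).mean()
    print(f"n = {n:>9,}: sample mean = {x_bar:.4f} (population mean = {mu})")
```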
Central Limit Theorem
The central limit theorem (CLT) states that, regardless of whether the population distribution is normal or not, when the sample size $n$ is large enough, the sampling distribution of $\bar{X}$ is approximately normal with parameters $\mu$ and $\frac{\sigma^2}{n}$.
An important assumption of the CLT is that the $X_i$ are i.i.d. (independent and identically distributed).
$$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
So as $n$ increases, the sampling distribution of $\bar{X}$ becomes a normal distribution where most of the values are close to $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{n}}$.
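A small sketch of the CLT, again with an arbitrary skewed (exponential) population, checking that the sample means are centered at $\mu$ with standard deviation close to $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed, clearly non-normal population: Exponential with mean 2 and sd 2.
mu, sigma = 2.0, 2.0
n, n_samples = 50, 100_000

# Draw 100,000 samples of size 50 and compute each sample's mean.
means = rng.exponential(scale=mu, size=(n_samples, n)).mean(axis=1)

# CLT prediction: approximately N(mu, sigma^2 / n).
print("empirical mean:", means.mean(), "  expected:", mu)
print("empirical sd:  ", means.std(), "  expected:", sigma / np.sqrt(n))
```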
Estimators
An estimator is a statistic used to estimate a population parameter.
Because it is a function of random variables, it is also a random variable.
Realizations of estimators are estimates of the parameter.
Let’s say we have a population parameter $\theta$ that we want to estimate.
Let $X_1, X_2, \dots, X_n$ be the sample random variables, and suppose we calculate a statistic from them.
We will denote the statistic as $T_n$, where $n$ is the sample size.
Then $T_n$ is an estimator of $\theta$.
$$ T_n = u(X_1, X_2, \dots, X_n) $$
Consistent Estimator
An estimator $T_n$ is consistent if $T_n$ converges in probability to $\theta$ as $n \to \infty$.
$$ \operatorname*{plim}_{n \to \infty} T_n = \theta $$
Convergence in probability
$$ \forall \epsilon > 0,\; \lim_{n \to \infty} Pr[|T_n - \theta| > \epsilon] = 0 $$
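To see what this definition looks like empirically, here is a rough sketch where $T_n$ is the sample mean of a Uniform(0, 1) population, so $\theta = 0.5$ (the population and the choice of $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, eps, trials = 0.5, 0.01, 5_000

# Estimate Pr[|T_n - theta| > eps] by simulation for growing n.
for n in (10, 100, 1_000, 10_000):
    t_n = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    prob = np.mean(np.abs(t_n - theta) > eps)
    print(f"n = {n:>6}: Pr[|T_n - theta| > {eps}] ~ {prob:.3f}")
```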
Unbiased Estimator
An estimator $T_n$ is unbiased if the expected value of $T_n$ is equal to $\theta$.
$$ E[T_n] = \theta $$
Therefore the sampling distribution of $T_n$ is centered at $\theta$.
This means that the estimator is not systematically over- or under-estimating.
If the expected value of $T_n$ is not equal to $\theta$, then the estimator is considered biased.
One example of an unbiased estimator is the sample mean $\bar{X}$.
Proof that sample mean is an unbiased estimator of population mean
$$ \begin{align*} E[\bar{X}] &= E[\frac{1}{n} \sum_{i=1}^n X_i] \\ &= \frac{1}{n} \sum_{i=1}^n E[X_i] \tag{linearity of expectation} \\ &= \frac{1}{n} \sum_{i=1}^n \mu \tag{by definition} \\ &= \frac{1}{n} \cdot n \cdot \mu \\ &= \mu \end{align*} $$
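A quick simulation sketch of this result (the population and its parameters below are arbitrary): averaging many independent realizations of $\bar{X}$ should recover $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 7.0, 20, 200_000

# 200,000 independent realizations of X-bar, each from a sample of size 20.
means = rng.normal(loc=mu, scale=3.0, size=(trials, n)).mean(axis=1)

print(means.mean())  # close to mu = 7.0, since E[X-bar] = mu
```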
Sampling Error
Sampling error is the difference between a sample statistic and the corresponding population parameter.
An example of sampling error would be $\bar{X} - \mu$.
Sampling Error as a Random Variable
Let’s take the example of $\bar{X} - \mu$.
$\bar{X} - \mu$ is also a random variable, since it is the random variable $\bar{X}$ shifted by the constant $\mu$.
This means that $\bar{X} - \mu$ has an associated probability distribution.
Sampling Distribution of the Sampling Error
By the Central Limit Theorem, we know that the sampling distribution of $\bar{X}$ is approximately normal with parameters $\mu$ and $\frac{\sigma^2}{n}$.
Since we’re merely shifting the distribution horizontally by $\mu$, we know that the distribution of $\bar{X} - \mu$ is also approximately normal with parameters $0$ and $\frac{\sigma^2}{n}$:
$$ \bar{X} - \mu \sim N(0, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
Standard Error of the Mean
The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the sample mean $\bar{X}$.
$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} $$
However, we usually don’t know the population standard deviation $\sigma$.
So we use the sample standard deviation $s$ (or the unbiased estimator of the population standard deviation) instead:
$$ \text{SEM} \approx \frac{s}{\sqrt{n}} $$
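A small sketch using a made-up sample; the SEM can be computed by hand or with `scipy.stats.sem`:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 9.8, 11.4, 10.9, 12.7, 9.5, 11.0, 10.2])

n = len(sample)
s = sample.std(ddof=1)      # sample standard deviation (n - 1 in the denominator)
sem_manual = s / np.sqrt(n)

print(sem_manual)
print(stats.sem(sample))    # same value
```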
Confidence Interval
A confidence interval is a range of values that we are fairly confident contains the population parameter.
It expresses how confident we can be that the estimate obtained from the sample is close to the true population parameter.
The narrower the confidence interval, the more precise the estimate.
Confidence Interval for the Mean
Let’s say we want to estimate the population mean $\mu$.
We know that the sampling distribution of $\bar{X} - \mu$ is approximately
$$ \bar{X} - \mu \sim N(0, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
Because the sampling distribution is normal, approximately 95% of the values fall within approximately $2 \cdot \text{SEM}$ of the mean.
More precisely, 95% of the values fall within $1.96 \cdot \frac{s}{\sqrt{n}}$ of the mean. This 1.96 is the z-score that leaves an area of $0.025$ (2.5%) in the right tail of the standard normal distribution.
$$ \begin{gather*} 0 - 1.96 \cdot \frac{s}{\sqrt{n}} \leq \bar{X} - \mu \leq 0 + 1.96 \cdot \frac{s}{\sqrt{n}} \\ \downarrow \\ \bar{X} - 1.96 \cdot \frac{s}{\sqrt{n}} \leq \mu \leq \bar{X} + 1.96 \cdot \frac{s}{\sqrt{n}} \end{gather*} $$
Then we can say that we are 95% confident that the population mean $\mu$ falls within the interval
$$ \bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}} $$
In other words, if we repeated the interval calculation for 100 independent samples, we would expect the true population mean to fall inside the computed interval about 95 out of the 100 times.
This is the confidence interval at the 95% confidence level.
95% is the most common confidence level.
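A minimal simulation sketch of this interpretation, assuming a hypothetical normal population with known $\mu$ and using the 1.96 normal approximation (all parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 100.0, 15.0, 50, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    x_bar, s = sample.mean(), sample.std(ddof=1)
    margin = 1.96 * s / np.sqrt(n)
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

# About 95% of the intervals should contain the true mean.
print(f"coverage: {covered / trials:.3f}")
```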
t-Distribution
The Central Limit Theorem only applies when the sample size $n$ is large enough.
In reality, we don’t always have a large sample size during data analysis.
Also, since we don’t know the population standard deviation $\sigma$, we use the sample standard deviation $s$ as an estimator of $\sigma$ instead.
The t-distribution is a probability distribution used when the sample size is small and the population standard deviation is unknown.
The t-distribution assumes that the population from which the sample is drawn follows a normal distribution.
t-Score
The t-score is calculated by the following formula:
$$ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$
Just like the z-score, we are standardizing the sampling distribution of the sample mean $\bar{X}$:
- Shifting the mean to 0 and standardizing variance to 1.
The only difference is that we don’t know the population standard deviation $\sigma$, so we estimate it with the sample standard deviation $s$.
Because of this estimation, the t-score is not exactly a standard normal distribution.
t-Distribution vs Standard Normal Distribution
The t-distribution is similar to the normal distribution, but with heavier tails.
The tails of a distribution are the regions far from the center, e.g., more than about 2 standard deviations from the mean.
As the sample size $n$ increases, the t-distribution approaches $N(0, 1)$.
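A quick comparison of tail probabilities (a sketch using `scipy.stats`; the cutoff of 2 is arbitrary) shows the heavier tails and the convergence toward $N(0, 1)$ as the degrees of freedom grow:

```python
from scipy import stats

x = 2.0  # probability of falling more than 2 "standard units" from 0

print("N(0, 1):   ", 2 * stats.norm.sf(x))
for df in (3, 10, 30, 100):
    print(f"t (df={df:>3}):", 2 * stats.t.sf(x, df))
```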
Degrees of Freedom
The degrees of freedom (typically denoted by $\nu$ or d.f.) is the number of independent observations in a sample that are used to calculate an estimate of a population parameter.
Typically, the degrees of freedom equal the sample size $n$ minus the number of parameters being estimated.
When we’re estimating the mean, $\nu = n - 1$.
Degrees of freedom $\nu$ is a parameter that determines the shape of the t-distribution.
Confidence Interval Using t-Distribution
The 95% confidence interval above assumes that $n$ is large enough and hence the sampling distribution of $\bar{X}$ is approximately normal.
When $n$ is small, we use the t-distribution instead.
So the realistic confidence interval is calculated by the following formula:
$$ \begin{equation} \label{eq:ci-t} \bar{X} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}} \end{equation} $$
where $t_{\alpha/2, \nu}$ is the t-score such that
$$ Pr[-t_{\alpha/2, \nu} \leq T \leq t_{\alpha/2, \nu}] = 1 - \alpha $$
with $T$ following a t-distribution with $\nu = n - 1$ degrees of freedom.
Setting $\alpha = 0.05$ gives us the 95% confidence interval.
We will discuss $\alpha$ in more detail in the following sections.
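A sketch of computing this interval with `scipy` for a made-up sample and $\alpha = 0.05$:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 4.9, 5.3, 5.0, 4.7, 5.4])
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{alpha/2, n-1}
margin = t_crit * s / np.sqrt(n)

print(f"{x_bar:.3f} +/- {margin:.3f}")

# Equivalent result using scipy's interval helper.
print(stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=stats.sem(sample)))
```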
Narrowing the Confidence Interval
If we want to make our estimate more precise, we need to narrow the confidence interval.
If we take a look at equation $\eqref{eq:ci-t}$, we notice that we can either:
- Increase the sample size $n$
- Decrease the sample standard deviation $s$
Increasing the sample size $n$ is not always realistic. Since $\sqrt{n}$ appears in the denominator, in order to narrow the confidence interval to $1/k$ of its width, we need to increase the sample size by a factor of $k^2$.
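For example, to halve the width ($k = 2$) we need roughly four times as many samples, ignoring the small change in the t critical value as $n$ grows:
$$ t_{\alpha/2, \nu} \cdot \frac{s}{\sqrt{4n}} = \frac{1}{2} \cdot t_{\alpha/2, \nu} \cdot \frac{s}{\sqrt{n}} $$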
Decreasing the sample standard deviation $s$ is also not always possible, since the spread reflects the nature of the population. However, we may be able to reduce $s$ by measuring our samples more precisely.