Inferential Statistics
Recap: Why Inferential Statistics?
Our goal is to understand the population.
We have two main options:
- Complete enumeration
- Sample survey
Because complete enumeration is unrealistic, we need to figure out how to use samples to understand the population.
Inferential statistics is used to analyze samples and infer properties of the population.
Overview of the idea
Assume the population is represented by a probability distribution, called the population distribution.
The parameters define the shape of the population distribution.
The samples are values drawn from the population distribution.
From the samples, we infer the parameters of the population distribution.
Modeling
In reality, the population distribution will rarely match a mathematical probability distribution exactly.
Instead, we are merely modeling the population distribution with a certain probability distribution.
Sounds obvious, but it is an important concept to keep in mind.
Random Sampling
Random sampling is essential if we want an unbiased inference.
Simple Random Sampling
This is equivalent to drawing papers from a jar.
The results are ideal, but the method is often impractical because it can be time-consuming and costly.
Stratified Sampling
This is equivalent to drawing papers from jars of different colors.
We divide the population into strata, and draw random samples from each stratum.
Other Sampling Methods
To be added
- Systematic sampling
- Cluster sampling
Sampling Distribution
Sampling distribution is the probability distribution of a statistic obtained from a random sample of size $n$.
It is the probability distribution you observe when you repeatedly draw samples of size $n$ from the population and calculate each sample’s statistic.
Sample Statistic as a Random Variable
Take the example of the sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.
Each $x_i$ is also a random variable since it is randomly drawn from the population.
Then $\bar{x}$ is also a random variable since it is a function of $x_i$.
We generally use lower case letters to denote the realization of a random variable, and use upper case letters to denote the random variable itself.
To make it more clear, let’s use capital letters to denote the random variables.
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i $$
Then we know that $\bar{X}$ must have an associated probability distribution.
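As a rough sketch of this idea (the population below is a made-up NumPy array with arbitrary parameters), we can approximate the sampling distribution of $\bar{X}$ by repeatedly drawing samples of size $n$ and recording each sample's mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of one million values (arbitrary parameters).
population = rng.normal(loc=50, scale=10, size=1_000_000)

n = 30              # sample size
n_samples = 10_000  # number of repeated samples

# Each entry is one realization of the random variable X-bar.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_samples)
])

# The empirical distribution of these means approximates the
# sampling distribution of the sample mean.
print(sample_means.mean(), sample_means.std())
```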
Sampling Distribution of the Sample Mean
Below are some useful concepts for inferring the population mean from the sample mean.
Law of Large Numbers
The law of large numbers states that the sample mean $\bar{X}$ converges to the population mean $\mu$ as the sample size $n$ increases.
$$ \bar{X} \to \mu \text{ as } n \to \infty $$
Note that the law of large numbers is a statement about the sample mean; it does not automatically extend to other statistics.
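A minimal simulation sketch of the law of large numbers, assuming an exponential population with mean $\mu = 2$ (the choice of distribution is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0  # population mean of an Exponential(scale=2) distribution

# The sample mean gets closer to mu as the sample size grows.
for n in (10, 100, 10_000, 1_000_000):
    x_bar = rng.exponential(scale=mu, size=n).mean()
    print(f"n = {n:>9,}: sample mean = {x_bar:.4f} (population mean = {mu})")
```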
Central Limit Theorem
The central limit theorem (CLT) states that, regardless of whether the population distribution is normal or not, when the sample size $n$ is large enough, the sampling distribution of $\bar{X}$ is approximately normal with parameters $\mu$ and $\frac{\sigma^2}{n}$.
An important assumption of the CLT is that the $X_i$ are i.i.d. (independent and identically distributed).
$$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
So as $n$ increases, the sampling distribution of $\bar{X}$ becomes a normal distribution where most of the values are close to $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{n}}$.
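A small sketch of the CLT, again with an arbitrary skewed (exponential) population, checking that the sample means are centered at $\mu$ with standard deviation close to $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed, clearly non-normal population: Exponential with mean 2 and sd 2.
mu, sigma = 2.0, 2.0
n, n_samples = 50, 100_000

# Draw 100,000 samples of size 50 and compute each sample's mean.
means = rng.exponential(scale=mu, size=(n_samples, n)).mean(axis=1)

# CLT prediction: approximately N(mu, sigma^2 / n).
print("empirical mean:", means.mean(), "  expected:", mu)
print("empirical sd:  ", means.std(), "  expected:", sigma / np.sqrt(n))
```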
Estimators
An estimator is a statistic used to estimate a population parameter.
Because it is a function of random variables, it is also a random variable.
Realizations of estimators are estimates of the parameter.
Let’s say we have a population parameter $\theta$ that we want to estimate.
Let $X_1, X_2, \dots, X_n$ be the sample random variables, and suppose we calculate a statistic from them.
We will denote the statistic as $T_n$, where $n$ is the sample size.
Then $T_n$ is an estimator of $\theta$.
$$ T_n = u(X_1, X_2, \dots, X_n) $$
Consistent Estimator
An estimator $T_n$ is consistent if $T_n$ converges in probability to $\theta$ as $n \to \infty$.
$$ \operatorname*{plim}_{n \to \infty} T_n = \theta $$
Convergence in probability
$$ \forall \epsilon > 0,\; \lim_{n \to \infty} Pr[|T_n - \theta| > \epsilon] = 0 $$
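To see what this definition looks like empirically, here is a rough sketch where $T_n$ is the sample mean of a Uniform(0, 1) population, so $\theta = 0.5$ (the population and the choice of $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, eps, trials = 0.5, 0.01, 5_000

# Estimate Pr[|T_n - theta| > eps] by simulation for growing n.
for n in (10, 100, 1_000, 10_000):
    t_n = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    prob = np.mean(np.abs(t_n - theta) > eps)
    print(f"n = {n:>6}: Pr[|T_n - theta| > {eps}] ~ {prob:.3f}")
```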
Unbiased Estimator
An estimator $T_n$ is unbiased if the expected value of $T_n$ is equal to $\theta$.
$$ E[T_n] = \theta $$
Therefore the sampling distribution of $T_n$ is centered at $\theta$.
This means that the estimator is not systematically over- or under-estimating.
If the expected value of $T_n$ is not equal to $\theta$, then the estimator is considered biased.
One example of an unbiased estimator is the sample mean $\bar{X}$.
Proof that sample mean is an unbiased estimator of population mean
$$ \begin{align*} E[\bar{X}] &= E[\frac{1}{n} \sum_{i=1}^n X_i] \\ &= \frac{1}{n} \sum_{i=1}^n E[X_i] \tag{linearity of expectation} \\ &= \frac{1}{n} \sum_{i=1}^n \mu \tag{by definition} \\ &= \frac{1}{n} \cdot n \cdot \mu \\ &= \mu \end{align*} $$
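A quick simulation sketch of this result (the population and its parameters below are arbitrary): averaging many independent realizations of $\bar{X}$ should recover $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 7.0, 20, 200_000

# 200,000 independent realizations of X-bar, each from a sample of size 20.
means = rng.normal(loc=mu, scale=3.0, size=(trials, n)).mean(axis=1)

print(means.mean())  # close to mu = 7.0, since E[X-bar] = mu
```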
Sampling Error
Sampling error is the difference between a sample statistic and the corresponding population parameter.
An example of sampling error would be $\bar{X} - \mu$.
Sampling Error as a Random Variable
Let’s take the example of $\bar{X} - \mu$.
$\bar{X} - \mu$ is also a random variable, since it is the random variable $\bar{X}$ shifted by the constant $\mu$.
This means that $\bar{X} - \mu$ has an associated probability distribution.
Sampling Distribution of the Sampling Error
By the Central Limit Theorem, we know that the sampling distribution of $\bar{X}$ is approximately normal with parameters $\mu$ and $\frac{\sigma^2}{n}$.
Since we’re merely shifting the distribution horizontally by $\mu$, we know that the distribution of $\bar{X} - \mu$ is also approximately normal with parameters $0$ and $\frac{\sigma^2}{n}$:
$$ \bar{X} - \mu \sim N(0, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
Standard Error of the Mean
The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the sample mean $\bar{X}$.
$$ \text{SEM} = \frac{\sigma}{\sqrt{n}} $$
However, we usually don’t know the population standard deviation $\sigma$.
So we use the sample standard deviation $s$ (or the unbiased estimator of the population standard deviation) instead:
$$ \text{SEM} \approx \frac{s}{\sqrt{n}} $$
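A small sketch using a made-up sample; the SEM can be computed by hand or with `scipy.stats.sem`:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 9.8, 11.4, 10.9, 12.7, 9.5, 11.0, 10.2])

n = len(sample)
s = sample.std(ddof=1)      # sample standard deviation (n - 1 in the denominator)
sem_manual = s / np.sqrt(n)

print(sem_manual)
print(stats.sem(sample))    # same value
```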
Confidence Interval
A confidence interval is a range of values that we are fairly confident contains the population parameter.
It expresses how confident we can be that the estimate obtained from the sample is close to the true population parameter.
The narrower the confidence interval, the more precise the estimate.
Confidence Interval for the Mean
Let’s say we want to estimate the population mean $\mu$.
We know that the sampling distribution of $\bar{X} - \mu$ is approximately
$$ \bar{X} - \mu \sim N(0, \frac{\sigma^2}{n}) \text{ as } n \to \infty $$
Because the sampling distribution is normal, approximately 95% of the values fall within approximately $2 \cdot \text{SEM}$ of the mean.
More precisely, 95% of the values fall within $1.96 \cdot \frac{s}{\sqrt{n}}$ of the mean. This 1.96 is the z-score that leaves an area of $0.025$ (2.5%) in the right tail of the standard normal distribution.
$$ \begin{gather*} 0 - 1.96 \cdot \frac{s}{\sqrt{n}} \leq \bar{X} - \mu \leq 0 + 1.96 \cdot \frac{s}{\sqrt{n}} \\ \downarrow \\ \bar{X} - 1.96 \cdot \frac{s}{\sqrt{n}} \leq \mu \leq \bar{X} + 1.96 \cdot \frac{s}{\sqrt{n}} \end{gather*} $$
Then we can say that we are 95% confident that the population mean $\mu$ falls within the interval
$$ \bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}} $$
In other words, if we repeated the interval calculation for 100 independent samples, we would expect the true population mean to fall inside the computed interval about 95 out of the 100 times.
This is the confidence interval at the 95% confidence level.
95% is the most common confidence level.
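A minimal simulation sketch of this interpretation, assuming a hypothetical normal population with known $\mu$ and using the 1.96 normal approximation (all parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 100.0, 15.0, 50, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    x_bar, s = sample.mean(), sample.std(ddof=1)
    margin = 1.96 * s / np.sqrt(n)
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

# About 95% of the intervals should contain the true mean.
print(f"coverage: {covered / trials:.3f}")
```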
t-Distribution
The Central Limit Theorem only applies when the sample size $n$ is large enough.
In reality, we don’t always have a large sample size during data analysis.
Also, since we don’t know the population standard deviation $\sigma$, we use the sample standard deviation $s$ as an estimator of $\sigma$ instead.
The t-distribution is a probability distribution used when the sample size is small and the population standard deviation is unknown.
The t-distribution assumes that the population from which the sample is drawn follows a normal distribution.
t-Score
The t-score is calculated by the following formula:
$$ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$
Just like the z-score, we are standardizing the sampling distribution of the sample mean $\bar{X}$:
- Shifting the mean to 0 and standardizing variance to 1.
The only difference is that we don’t know the population standard deviation $\sigma$, so we estimate it with the sample standard deviation $s$.
Because of this estimation, the t-score is not exactly a standard normal distribution.
t-Distribution vs Standard Normal Distribution
The t-distribution is similar to the normal distribution, but with heavier tails.
The tails of a distribution are the regions far from the center, e.g., more than about 2 standard deviations from the mean.
As the sample size $n$ increases, the t-distribution approaches $N(0, 1)$.
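A quick comparison of tail probabilities (a sketch using `scipy.stats`; the cutoff of 2 is arbitrary) shows the heavier tails and the convergence toward $N(0, 1)$ as the degrees of freedom grow:

```python
from scipy import stats

x = 2.0  # probability of falling more than 2 "standard units" from 0

print("N(0, 1):   ", 2 * stats.norm.sf(x))
for df in (3, 10, 30, 100):
    print(f"t (df={df:>3}):", 2 * stats.t.sf(x, df))
```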
Degrees of Freedom
The degrees of freedom (typically denoted by $\nu$ or d.f.) is the number of independent observations in a sample that are used to calculate an estimate of a population parameter.
Typically, the degrees of freedom equal the sample size $n$ minus the number of parameters being estimated.
When we’re estimating the mean, $\nu = n - 1$.
Degrees of freedom $\nu$ is a parameter that determines the shape of the t-distribution.
Confidence Interval Using t-Distribution
The 95% confidence interval above assumes that $n$ is large enough and hence the sampling distribution of $\bar{X}$ is approximately normal.
When $n$ is small, we use the t-distribution instead.
So the realistic confidence interval is calculated by the following formula:
$$ \begin{equation} \label{eq:ci-t} \bar{X} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}} \end{equation} $$
where $t_{\alpha/2, \nu}$ is the t-score such that
$$ Pr[-t_{\alpha/2, \nu} \leq T \leq t_{\alpha/2, \nu}] = 1 - \alpha $$
with $T$ following a t-distribution with $\nu = n - 1$ degrees of freedom.
Setting $\alpha = 0.05$ gives us the 95% confidence interval.
We will discuss $\alpha$ in more detail in the following sections.
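A sketch of computing this interval with `scipy` for a made-up sample and $\alpha = 0.05$:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 4.9, 5.3, 5.0, 4.7, 5.4])
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{alpha/2, n-1}
margin = t_crit * s / np.sqrt(n)

print(f"{x_bar:.3f} +/- {margin:.3f}")

# Equivalent result using scipy's interval helper.
print(stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=stats.sem(sample)))
```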
Narrowing the Confidence Interval
If we want to make our estimate more precise, we need to narrow the confidence interval.
If we take a look at equation $\eqref{eq:ci-t}$, we notice that we can either:
- Increase the sample size $n$
- Decrease the sample standard deviation $s$
Increasing the sample size $n$ is not always realistic. Since $\sqrt{n}$ appears in the denominator, in order to narrow the confidence interval to $1/k$ of its width, we need to increase the sample size by a factor of $k^2$.
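For example, to halve the width ($k = 2$) we need roughly four times as many samples, ignoring the small change in the t critical value as $n$ grows:
$$ t_{\alpha/2, \nu} \cdot \frac{s}{\sqrt{4n}} = \frac{1}{2} \cdot t_{\alpha/2, \nu} \cdot \frac{s}{\sqrt{n}} $$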
Decreasing the sample standard deviation $s$ is also not always possible, since the spread reflects the nature of the population. However, we may be able to reduce $s$ by measuring our samples more precisely.