Inferential Statistics
Recap: Why Inferential Statistics?
Our goal is to understand the population.
We have two main options:
- Complete enumeration
- Sample survey
Because complete enumeration is usually unrealistic, we need to figure out how to use samples to understand the population.
Inferential statistics analyzes samples in order to draw inferences about the population.
Overview of the idea
Assume the population is represented by a probability distribution, called the population distribution.
The parameters define the shape of the population distribution.
The samples are values drawn from the population distribution.
From the samples, we infer the parameters of the population distribution.
Modeling
In reality, the population distribution never exactly matches a particular mathematical probability distribution.
Instead, we are merely modeling the population distribution with a chosen probability distribution.
This sounds obvious, but it is an important point to keep in mind.
Random Sampling
Random sampling is essential if we want an unbiased inference.
Simple Random Sampling
This is equivalent to drawing papers from a jar.
The results are ideal, but the method is often unrealistic because it can be time-consuming and costly.
Stratified Sampling
This is equivalent to drawing papers from jars of different colors.
We divide the population into strata, and draw random samples from each stratum.
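A minimal sketch of stratified sampling, assuming proportional allocation (the same sampling fraction in every stratum). The function name `stratified_sample`, the strata labels, and the population values are hypothetical and only serve the illustration.

```python
import random

def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(members, k)       # sampling without replacement
    return sample

# Hypothetical population split into two strata ("jars" of different colors).
strata = {
    "red": list(range(0, 80)),
    "blue": list(range(80, 100)),
}
picked = stratified_sample(strata, fraction=0.1)
print({name: len(units) for name, units in picked.items()})
```

With a 10% fraction, the larger stratum contributes 8 units and the smaller one 2, so each stratum is represented in proportion to its size.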
Other Sampling Methods
To be added
- Systematic sampling
- Cluster sampling
Sampling Distribution
A sampling distribution is the probability distribution of a statistic obtained from a random sample of size $n$.
It is the probability distribution you observe when you repeatedly draw samples of size $n$ and compute the statistic for each sample.
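The repeated-sampling idea can be sketched with a small simulation. The population here is assumed to be a fair six-sided die (mean 3.5), and the sample size and number of repetitions are arbitrary choices for the illustration.

```python
import random
import statistics

rng = random.Random(42)
n = 10            # sample size
repeats = 5000    # number of repeated samples

# Population: a fair six-sided die (mean 3.5).
# Each entry is the sample mean of one sample of size n.
sample_means = [
    statistics.fmean([rng.randint(1, 6) for _ in range(n)])
    for _ in range(repeats)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
print(f"mean of sample means = {center:.3f}")
print(f"std of sample means  = {spread:.3f}")
```

The list `sample_means` is an empirical approximation of the sampling distribution of the sample mean: its center should sit near the population mean, and its spread is noticeably smaller than the spread of a single die roll.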
Sample Statistic as a Random Variable
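Take the example of the sample mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
Each $x_i$ is a value drawn at random from the population distribution, so it is a realization of a random variable.
Then $\bar{x}$, as a function of those values, is also a realization of a random variable.
We generally use lowercase letters to denote a realization of a random variable, and uppercase letters to denote the random variable itself.
To make this clearer, let's use capital letters to denote the random variables.
Then we know that $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is itself a random variable with its own probability distribution.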
Sampling Distribution of the Sample Mean
Below are some useful concepts for inferring the population mean from the sample mean.
Law of Large Numbers
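The law of large numbers states that the sample mean $\bar{X}$ converges to the population mean $\mu$ as the sample size $n$ increases.
The law itself is a statement about the sample mean; convergence of other statistics has to be established separately.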
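A quick simulation can make the law of large numbers concrete. This sketch assumes a Uniform(0, 1) population, whose mean is 0.5, and tracks how far the running sample mean is from that value at a few sample sizes.

```python
import random

rng = random.Random(0)
true_mean = 0.5  # mean of the Uniform(0, 1) population used here

total = 0.0
errors = {}
for i in range(1, 100_001):
    total += rng.random()                        # draw one more observation
    if i in (10, 1_000, 100_000):
        errors[i] = abs(total / i - true_mean)   # distance of the running mean from mu
        print(f"n = {i:>6}: |sample mean - mu| = {errors[i]:.5f}")
```

The error is not guaranteed to shrink at every step, but for large $n$ it becomes small with high probability.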
Central Limit Theorem
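The central limit theorem (CLT) states that, regardless of whether the population distribution is normal or not, when the sample size $n$ is sufficiently large, the sampling distribution of the sample mean $\bar{X}$ is approximately normal with mean $\mu$ and variance $\sigma^2 / n$.
An important assumption of the CLT is that $X_1, \dots, X_n$ are independent and identically distributed (i.i.d.) with finite variance $\sigma^2$.
So as $n$ grows, the variance $\sigma^2 / n$ shrinks and the sampling distribution concentrates around $\mu$.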
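One way to see the CLT at work is to start from a clearly non-normal population and watch the shape of the sampling distribution become more symmetric. The sketch below assumes an Exponential(1) population (strongly right-skewed) and uses sample skewness as a rough measure of asymmetry. The helper names `skewness` and `sample_means` are made up for this example.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

def sample_means(n, repeats, rng):
    """Means of `repeats` i.i.d. samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(1)
skew_small = skewness(sample_means(2, 5000, rng))
skew_large = skewness(sample_means(50, 5000, rng))
print(f"skewness of sample means, n = 2:  {skew_small:.2f}")
print(f"skewness of sample means, n = 50: {skew_large:.2f}")
```

The skewness for $n = 50$ should be much closer to 0 than for $n = 2$, consistent with the sampling distribution approaching a normal shape.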
Estimators
An estimator is a statistic used to estimate a population parameter.
Because it is a function of random variables, it is also a random variable.
Realizations of an estimator are estimates of the parameter.
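Let's say we have a population parameter $\theta$.
Let $X_1, \dots, X_n$ be a random sample from the population.
We will denote the statistic as $\hat{\theta} = g(X_1, \dots, X_n)$.
Then $\hat{\theta}$ is an estimator of $\theta$.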
Consistent Estimator
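An estimator $\hat{\theta}_n$ (based on a sample of size $n$) is consistent if it converges in probability to $\theta$ as $n \to \infty$.
Convergence in probability means that for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \varepsilon) = 0$.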
Unbiased Estimator
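An estimator $\hat{\theta}$ is unbiased if $E[\hat{\theta}] = \theta$.
Therefore the sampling distribution of $\hat{\theta}$ is centered at $\theta$.
This means that the estimator is not systematically over- or under-estimating.
If the expected value of $\hat{\theta}$ is not equal to $\theta$, the estimator is biased, and the difference $E[\hat{\theta}] - \theta$ is called the bias.
One example of an unbiased estimator is the sample mean $\bar{X}$, which is an unbiased estimator of the population mean $\mu$.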
Proof that sample mean is an unbiased estimator of population mean
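The proof follows from linearity of expectation, assuming each $X_i$ is drawn from the population and therefore has $E[X_i] = \mu$:

```latex
E[\bar{X}]
  = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right]
  = \frac{1}{n} \sum_{i=1}^{n} E[X_i]
  = \frac{1}{n} \cdot n \mu
  = \mu
```

Note that the argument does not require independence of the $X_i$, only that each has expectation $\mu$.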
Sampling Error
Sampling error is the difference between a sample statistic and the population parameter it estimates.
An example of sampling error would be $\bar{x} - \mu$, the difference between the sample mean and the population mean.
Sampling Error as a Random Variable
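Let's take the example of $\bar{X} - \mu$.
Here $\bar{X}$ is a random variable and $\mu$ is a constant. This means that the sampling error $\bar{X} - \mu$ is also a random variable.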
Sampling Distribution
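By the Central Limit Theorem, we know that the sampling distribution of $\bar{X}$ is approximately normal with mean $\mu$ and variance $\sigma^2 / n$.
Since we're merely shifting the distribution horizontally by $\mu$, the sampling error $\bar{X} - \mu$ is approximately normal with mean $0$ and the same variance $\sigma^2 / n$.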
Standard Error of the Mean
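The standard error of the mean (SEM) is the standard deviation of the sampling distribution of the sample mean $\bar{X}$, which equals $\sigma / \sqrt{n}$.
However, we usually don't know the population standard deviation $\sigma$.
So we use the sample standard deviation $s$ in its place and estimate the standard error as $s / \sqrt{n}$.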
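As a small worked example, the estimated standard error $s / \sqrt{n}$ can be computed as follows. The measurements are made up for the illustration.

```python
import math
import statistics

# Hypothetical sample of n = 8 measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.2]

n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
sem = s / math.sqrt(n)         # estimated standard error of the mean

print(f"mean = {statistics.fmean(sample):.3f}, s = {s:.3f}, SEM = {sem:.3f}")
```

`statistics.stdev` uses the $n - 1$ denominator, which matches the usual definition of the sample standard deviation; `statistics.pstdev` would divide by $n$ instead.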
Confidence Interval
A confidence interval is a range of values, computed from the sample, that contains the population parameter with a stated level of confidence.
It expresses how confident we can be that the estimate obtained from the sample is close to the true population parameter.
The narrower the confidence interval, the more precise the estimate.
Confidence Interval for the Mean
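Let's say we want to estimate the population mean $\mu$.
We know that the sampling distribution of $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.
Because the sampling distribution is normal, we know that approx. 95% of the values fall within approx. 2 standard deviations of the mean.
Technically, 95% of the values fall within $\mu \pm 1.96 \, \sigma / \sqrt{n}$.
Then we can say that we are 95% confident that the population mean $\mu$ lies within $\bar{x} \pm 1.96 \, \sigma / \sqrt{n}$.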
Basically, if we repeat the interval calculation for 100 independent samples, we expect about 95 of the 100 resulting intervals to contain the true population mean.
This is the confidence interval at the 95% confidence level.
95% is the most common confidence level.
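The "95 out of 100" interpretation can be checked with a simulation. This sketch assumes a normal population with known parameters (the values of `mu`, `sigma`, and `n` are arbitrary) and builds the interval $\bar{x} \pm 1.96 \, \sigma / \sqrt{n}$ for each repeated sample.

```python
import math
import random

rng = random.Random(7)
mu, sigma = 50.0, 10.0   # hypothetical "true" population parameters
n = 40                   # sample size
trials = 1000            # number of repeated samples

covered = 0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    half_width = 1.96 * sigma / math.sqrt(n)   # sigma is treated as known here
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1

coverage = covered / trials
print(f"{covered} of {trials} intervals contain mu (coverage = {coverage:.3f})")
```

The observed coverage should land close to 0.95. Note that it is the interval that varies from sample to sample, while $\mu$ stays fixed.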
t-Distribution
The Central Limit Theorem only applies when the sample size $n$ is sufficiently large.
In reality, we don't always have a large sample size during data analysis.
Also, since we don't know the population standard deviation $\sigma$, we have to estimate it with the sample standard deviation $s$.
The t-distribution is a probability distribution that is used when the sample size is small and the population standard deviation is unknown.
The t-distribution assumes that the population from which the sample is drawn follows a normal distribution.
t-Score
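The t-score is calculated by the following formula:
$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$
Just like the z-score, we are standardizing the sampling distribution of the sample mean $\bar{X}$:
- Shifting the mean to 0 and scaling by the (estimated) standard error so that the variance is approximately 1.
The only difference is that we don't know the population standard deviation $\sigma$, so we estimate it with the sample standard deviation $s$.
Because of this estimation, the t-score does not exactly follow a standard normal distribution. It follows a t-distribution with $n - 1$ degrees of freedom.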
t-Distribution vs Standard Normal Distribution
The t-distribution is similar to the normal distribution, but with heavier tails.
Here, the tails refer loosely to the regions far from the mean, for example beyond about 2 standard deviations.
As the sample size $n$ increases, the t-distribution approaches the standard normal distribution.
Degrees of Freedom
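The degrees of freedom (typically denoted by $\nu$ or $df$) is the number of values in a calculation that are free to vary.
Typically, the degrees of freedom equals the sample size $n$ minus the number of parameters estimated from the sample.
When we're estimating the mean, the sample mean is used up in computing $s$, so $\nu = n - 1$.
Degrees of freedom $\nu$ is the single parameter that determines the shape of the t-distribution.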
Confidence Interval Using t-Distribution
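The 95% confidence interval above assumes that $n$ is large and that $\sigma$ is known.
When $n$ is small or $\sigma$ is unknown, we use the t-distribution instead.
So the realistic confidence interval is calculated by the following formula:
$$ \bar{x} - t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}} $$
where $t_{\alpha/2,\, n-1}$ is the critical value of the t-distribution with $n - 1$ degrees of freedom that leaves probability $\alpha / 2$ in the upper tail.
Setting $\alpha = 0.05$ gives the 95% confidence interval.
We will discuss $\alpha$ in more detail later.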
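A small numeric sketch of a t-based 95% interval, $\bar{x} \pm t \cdot s / \sqrt{n}$, using only the standard library. The sample values are made up, and because the standard library has no t-distribution, the critical value 2.262 for 9 degrees of freedom is copied from a standard t-table rather than computed.

```python
import math
import statistics

# Hypothetical sample of n = 10 measurements.
sample = [12.1, 11.8, 12.6, 12.0, 11.5, 12.9, 12.3, 11.9, 12.4, 12.2]

n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)

# Two-sided 95% critical value for n - 1 = 9 degrees of freedom,
# copied from a standard t-table (not computed here).
t_crit = 2.262
z_crit = 1.96   # normal-based value, shown only for comparison

half_width_t = t_crit * s / math.sqrt(n)
half_width_z = z_crit * s / math.sqrt(n)

ci = (x_bar - half_width_t, x_bar + half_width_t)
print(f"95% CI (t): ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"half-width: t = {half_width_t:.3f}, z = {half_width_z:.3f}")
```

The t-based interval is wider than the normal-based one, which reflects the extra uncertainty from estimating $\sigma$ with $s$ in a small sample. If SciPy is available, `scipy.stats.t.ppf(0.975, df=9)` gives the same critical value.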
Narrowing the Confidence Interval
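If we want to make our estimate more precise, we need to narrow the confidence interval.
If we take a look at the half-width of the interval, $t_{\alpha/2,\, n-1} \cdot s / \sqrt{n}$, we can see that there are two ways to narrow it at a fixed confidence level:
- Increase the sample size $n$
- Decrease the sample standard deviation $s$
Increasing the sample size $n$ shrinks the standard error $s / \sqrt{n}$ and also lowers the critical value $t_{\alpha/2,\, n-1}$.
Decreasing the sample standard deviation $s$ is usually harder, because $s$ reflects the variability of the population itself.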
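As a rough illustration of the effect of sample size, consider only the standard error $s / \sqrt{n}$ and hold $s$ fixed (an assumption, since $s$ varies from sample to sample). Quadrupling $n$ halves the standard error:

```latex
\frac{s}{\sqrt{4n}} = \frac{1}{2} \cdot \frac{s}{\sqrt{n}}
\qquad \text{e.g. } s = 10:\quad
\frac{10}{\sqrt{25}} = 2, \qquad \frac{10}{\sqrt{100}} = 1
```

Because precision improves only with $\sqrt{n}$, each further halving of the interval width requires four times as many observations.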