Post-Hoc Test

As the name suggests, a post-hoc test is performed after another test (such as ANOVA) to correct error rates and assess the significance of the results.

Table of contents
  1. Error Rates in Multiple Comparison
    1. Family-Wise Error Rate
    2. False Discovery Rate
    3. Positive False Discovery Rate
  2. Bonferroni Correction
  3. Holm’s Method
  4. q-Value
  5. Benjamini-Hochberg Procedure
    1. Comparison to controlling FWER
  6. Tukey’s HSD Test
    1. Studentized Range Distribution
    2. HSD
  7. Dunnett’s Test
  8. Williams Test

Error Rates in Multiple Comparison

Suppose we are testing a family of $m$ hypotheses.

Let’s denote:

| | $H_0$ is true | $H_A$ is true | Total |
| --- | --- | --- | --- |
| Test declared insignificant | $U$ | $T$ | $m - R$ |
| Test declared significant | $V$ | $S$ | $R$ |
| Total | $m_0$ | $m - m_0$ | $m$ |

where each cell represents the number of hypotheses:

  • $m_0$: the number of true null hypotheses
  • $R$: the number of rejected null hypotheses
  • $U$: the number of true negatives
  • $T$: the number of false negatives
  • $V$: the number of false positives
  • $S$: the number of true positives

Family-Wise Error Rate

Family-wise error rate (FWER) is the probability of making at least one Type I error in a family of tests.

$$ \text{FWER} = P(\text{at least one Type I error}) = P(V \geq 1) $$
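
If all $m$ null hypotheses are true and the tests are independent, each run at level $\alpha$, then $\text{FWER} = 1 - (1 - \alpha)^m$, which grows quickly with $m$. A minimal sketch in plain Python:

```python
# FWER for m independent tests, each at significance level alpha:
# FWER = 1 - P(no Type I error in any test) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:2d}: FWER = {fwer:.3f}")
# m =  1: FWER = 0.050
# m =  5: FWER = 0.226
# m = 10: FWER = 0.401
# m = 20: FWER = 0.642
```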

False Discovery Rate

False discovery rate (FDR) is the expected proportion of false positives among all rejected hypotheses.

Simply:

$$ \text{FDR} = E\left[\frac{V}{R}\right] $$

What if $R = 0$? Then $V/R$ is undefined, so formally we condition on making at least one rejection:

$$ \text{FDR} = E\left[\frac{V}{R} \mid R > 0\right] \cdot P(R > 0) $$
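
A small Monte Carlo sketch can make this concrete. All the numbers here are my own illustrative assumptions: $m = 20$ tests with $m_0 = 15$ true nulls, Uniform(0, 1) p-values under $H_0$, and Beta(0.1, 1) p-values under $H_A$ as an arbitrary stand-in for "small p-values":

```python
import numpy as np

rng = np.random.default_rng(0)
m, m0, alpha = 20, 15, 0.05               # 15 true nulls, 5 true effects

fdp = []                                  # false discovery proportion V/R per run
for _ in range(10_000):
    p_null = rng.uniform(size=m0)         # Uniform(0, 1) under H0
    p_alt = rng.beta(0.1, 1.0, size=m - m0)   # concentrated near zero under H_A
    V = np.sum(p_null < alpha)            # false positives
    R = V + np.sum(p_alt < alpha)         # total rejections
    fdp.append(V / R if R > 0 else 0.0)   # convention: V/R := 0 when R = 0

# the mean of fdp estimates E[V/R | R > 0] * P(R > 0)
print("estimated FDR =", np.mean(fdp))
```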

Positive False Discovery Rate

To be added


Bonferroni Correction

As we’ve seen in the multiple comparisons problem, repeating a pairwise test multiple times increases the FWER.

The Bonferroni correction is a method to control the FWER at $\alpha$ by rejecting the null hypothesis less frequently in each individual test.

The idea is simple, which is to correct each significance level to:

$$ \alpha' = \frac{\alpha}{m} $$

where $m$ is the number of pairwise tests.
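
A minimal sketch of the correction, using made-up p-values:

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0_i whenever p_i < alpha / m."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]    # made-up p-values
print(bonferroni_reject(pvals))               # [ True  True False False False]
```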

Bonferroni correction has pros:

  • It is simple
  • It can be used in any test that produces a p-value

Issues with Bonferroni Correction

The problem with the Bonferroni correction is that it is too conservative.

Although it keeps the FWER at or below $\alpha$, the power of each test is reduced significantly as the number of tests increases.

Suppose we have $\alpha = 0.05$ and we perform 10 pairwise tests.

Then our Bonferroni-corrected $\alpha'$ for each test is 0.005, and this gets worse as the number of tests increases.

It gets harder and harder to call any result significant.


Holm’s Method

Just like Bonferroni, Holm’s method tries to correct the Type I error rate by controlling the FWER.

Procedure

Let’s say we have $m$ hypotheses to test.

Notice the difference in notation between $p_i$ and $p_{(j)}$.
$p_i$ is the p-value of $H_i$, while $p_{(j)}$ is the $j$-th smallest p-value among $p_i$.

  1. Calculate the p-value $p_i$ for each test ($1 \leq i \leq m$)
  2. Sort the p-values in ascending order: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
  3. Calculate the lower-bound rank $L$:

    $$ L = \min\left\{j \in \{1, \dots, m\} \mid p_{(j)} > \frac{\alpha}{m + 1 - j}\right\} $$

  4. Reject all null hypotheses with $p_i < p_{(L)}$ (if no $j$ satisfies the inequality in step 3, reject all $m$ hypotheses)
Alternatively
  1. Calculate the p-value $p_i$ for each test ($1 \leq i \leq m$)
  2. Sort the p-values in ascending order: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
  3. Reject all $p_{(j)}$ such that

    $$ p_{(j)} < \frac{\alpha}{m + 1 - j} $$
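
A minimal sketch of the step-down procedure; the made-up p-values are chosen so that Holm rejects a hypothesis that Bonferroni would miss:

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Step-down Holm: compare p_(j) against alpha / (m + 1 - j)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)              # indices of p_(1), ..., p_(m)
    reject = np.zeros(m, dtype=bool)
    for j, i in enumerate(order, start=1):
        if pvals[i] < alpha / (m + 1 - j):
            reject[i] = True
        else:
            break                          # first failure is rank L; stop here
    return reject

pvals = [0.001, 0.011, 0.039, 0.041, 0.20]    # made-up p-values
print(holm_reject(pvals))                     # [ True  True False False False]
# Bonferroni (alpha/m = 0.01) would reject only the first hypothesis here.
```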

Comparison to Bonferroni

Both Bonferroni and Holm’s method control the FWER to $\alpha$.

However, Holm’s method is at least as powerful as the Bonferroni correction, and thus less conservative.

Anything Bonferroni rejects, Holm’s method will also reject.

Proof

Suppose Bonferroni rejects $H_{0i}$ for some $i \in [1, m]$. Then $p_i < \frac{\alpha}{m}$.

Because

\[\forall j \in [1, m],\; \frac{\alpha}{m} \leq \frac{\alpha}{m + 1 - j}\]

In particular, $p_i < \frac{\alpha}{m + 1 - L}$. And by the definition of $L$, $p_{(L)} > \frac{\alpha}{m + 1 - L}$, so $p_i < p_{(L)}$.

So Holm’s method will also reject $H_{0i}$.


q-Value

Remember that running multiple tests inflates the overall Type I error rate. So we use correction methods on the p-values (or significance levels) to control the FWER.

However, controlling the FWER to correct Type I errors is often too conservative.

So we look for other methods to correct Type I errors.

Instead of trying to control the FWER, we can try to control the false discovery rate (FDR), and use the q-value, which is the FDR analogue of the p-value.

For a recap, the FDR is the expected proportion of false positives among all rejected null hypotheses.
So compared to the FWER, which measures the chance of making any false positive at all, the FDR is more lenient.

The q-value of a test is the minimum FDR at which the test may be called significant.
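
Under the Benjamini-Hochberg procedure (next section), the q-value of the test with rank $j$ can be estimated as the BH-adjusted p-value $\min_{l \geq j} p_{(l)} \cdot m / l$. A minimal sketch of that estimate (Storey’s q-value additionally estimates the proportion of true nulls, which this omits):

```python
import numpy as np

def bh_qvalues(pvals):
    """BH-adjusted p-values: q_(j) = min_{l >= j} p_(l) * m / l."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = pvals[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(adjusted, 0.0, 1.0)
    return q

pvals = [0.001, 0.011, 0.039, 0.041, 0.20]    # made-up p-values
print(bh_qvalues(pvals))
# [0.005   0.0275  0.05125 0.05125 0.2]
# e.g. the first test can be called significant at any FDR threshold q >= 0.005
```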


Benjamini-Hochberg Procedure

The Benjamini-Hochberg procedure is a method to control the FDR.

If we set a threshold $q$ and follow the procedure, $\text{FDR} \leq q$.

Procedure

  1. Calculate the p-value $p_i$ for each test ($1 \leq i \leq m$)
  2. Sort the p-values in ascending order: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
  3. Calculate the rank $L$

    $$ \begin{equation*} \label{eq:bh_rank} \tag{Maximum Rank} L = \max\left\{j \in \{1, \dots, m\} \mid p_{(j)} < \frac{q}{m}\cdot j\right\} \end{equation*} $$

  4. Reject all null hypotheses with $p_i \leq p_{(L)}$, i.e. the $L$ hypotheses with the smallest p-values.
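
A minimal sketch of the step-up procedure; the made-up p-values are chosen so that ranks 1-3 fail their own thresholds yet are still rejected, because rank 4 passes:

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg: reject ranks 1..L, L = max{ j : p_(j) < q * j / m }."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] < q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        L = np.nonzero(below)[0].max()     # 0-based index of the maximum rank L
        reject[order[: L + 1]] = True      # reject everything up to rank L
    return reject

pvals = [0.012, 0.025, 0.035, 0.038, 0.20]    # made-up p-values
print(bh_reject(pvals))                       # [ True  True  True  True False]
# thresholds are 0.01, 0.02, 0.03, 0.04, 0.05: ranks 1-3 fail individually,
# but 0.038 < 0.04 at rank 4, so ranks 1-4 are all rejected
```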

Comparison to controlling FWER

(Figure: FDR vs FWER rejection thresholds over the ranked p-values)

The slope of the Benjamini-Hochberg threshold line is $q/m$ (see \ref{eq:bh_rank}).

The p-values below each line are the rejected ones (discoveries).

From the graph, we can see that the Benjamini-Hochberg procedure makes more discoveries than the well-known methods that control the FWER.


Tukey’s HSD Test

HSD stands for honestly significant difference, which is the minimum difference between two means that is considered (actually/honestly) statistically significant.

So Tukey’s HSD test is a method that calculates the HSD and checks which pairs have a significant difference (i.e. a difference greater than the HSD).

It works better than the Bonferroni correction in that it maintains the FWER at $\alpha$ without losing as much power.

Typically, Tukey’s HSD test is used after ANOVA. However, it can be used standalone as well.

The null hypothesis of Tukey’s HSD test is:

$$ H_0: \forall i, j \in [1, k],\; \mu_i = \mu_j $$

Assumptions

Tukey’s test assumes the following:

  • The observations are independent (within and among)
  • Each group follows a normal distribution
  • Homogeneity of variance across the groups
  • The sample sizes are equal ($n_k = n$)

    If the sample sizes are unequal, Tukey's test becomes more conservative.

Studentized Range Distribution

Suppose we have $k$ normally distributed populations, and we take a sample of size $n$ from each.

Let $\bar{x}_{min}$ and $\bar{x}_{max}$ be the smallest and largest sample means respectively, and $s_p$ the pooled standard deviation (with equal sample sizes).

The following random variable has a Studentized range distribution:

$$ Q = \frac{\bar{x}_{max} - \bar{x}_{min}}{s_p / \sqrt{n}} $$

That is, $Q$ is the largest difference between two sample means, measured in units of the standard error.
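
A quick simulation can sanity-check this; the population count $k = 4$ and sample size $n = 10$ are arbitrary toy choices:

```python
import numpy as np
from scipy.stats import studentized_range

rng = np.random.default_rng(0)
k, n = 4, 10                     # 4 identical normal populations, n = 10 each

qs = []
for _ in range(20_000):
    samples = rng.normal(size=(k, n))                  # H0: all means equal
    means = samples.mean(axis=1)
    sp = np.sqrt(samples.var(axis=1, ddof=1).mean())   # pooled SD (equal n)
    qs.append((means.max() - means.min()) / (sp / np.sqrt(n)))

# the empirical 95th percentile should match the distribution's quantile
print(np.quantile(qs, 0.95))                        # ~ 3.8
print(studentized_range.ppf(0.95, k, k * (n - 1)))  # ~ 3.8
```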

HSD

Test Statistic

We use the following statistic to measure the difference in means:

$$ \frac{\mid\bar{x}_i - \bar{x}_j\mid}{\sqrt{MSE / n}} $$

where $\bar{x}_i$ and $\bar{x}_j$ are the sample means of group $i$ and $j$ respectively.

When the null hypothesis is true, this statistic follows the Studentized range distribution.

Critical Value of Studentized Range Distribution

We denote the critical value as

$$ q_{\alpha, \nu, k} $$

where:

  • $\alpha$ is the significance level
  • $\nu$ is the degrees of freedom of the distribution
  • $k$ is the number of groups

We then compare our statistic to a critical value of the distribution. If the statistic is greater than the critical value, we reject the null hypothesis.

The Studentized range distribution is built from the largest difference between means, so it makes sense that any measured difference greater than the critical value should be considered significant.

Just like z-scores and t-scores, the values can be obtained from a table.

Defining HSD

For our differences to be significant, we need the following to be true:

\[\frac{\mid\bar{x}_i - \bar{x}_j\mid}{\sqrt{MSE / n}} \geq q_{\alpha, \nu, k}\]

We can rearrange the inequality to get the following:

\[\mid\bar{x}_i - \bar{x}_j\mid \geq q_{\alpha, \nu, k} \cdot \sqrt{\frac{MSE}{n}}\]

The right-hand side of the inequality is then defined as the HSD:

$$ HSD = q_{\alpha, \nu, k} \cdot \sqrt{\frac{MSE}{n}} $$

where:

  • $q_{\alpha, \nu, k}$ is the critical value of the Studentized range distribution at significance level $\alpha$, with $\nu$ degrees of freedom and $k$ groups

  • $\sqrt{MSE / n}$ is the standard error of the group mean, where:

    • $MSE$ is the mean square error (mean square within), which you can get from the ANOVA summary table
    • $n$ is the sample size of each group (assumed equal for all groups)
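
A minimal sketch putting the pieces together, using SciPy’s studentized_range for the critical value and made-up data for three groups:

```python
import numpy as np
from scipy.stats import studentized_range

# made-up data: k = 3 groups of equal size n = 5
groups = [np.array([24., 28., 31., 26., 27.]),
          np.array([30., 33., 35., 32., 34.]),
          np.array([25., 27., 29., 28., 26.])]
k, n = len(groups), len(groups[0])
nu = k * (n - 1)                           # error degrees of freedom, as in ANOVA

# MSE: pooled within-group variance (equal sample sizes)
mse = np.mean([g.var(ddof=1) for g in groups])

alpha = 0.05
q_crit = studentized_range.ppf(1 - alpha, k, nu)   # critical value q_{alpha, nu, k}
hsd = q_crit * np.sqrt(mse / n)
print(f"HSD = {hsd:.3f}")

# any pair of group means further apart than the HSD differs significantly
means = [g.mean() for g in groups]
for i in range(k):
    for j in range(i + 1, k):
        diff = abs(means[i] - means[j])
        print(f"group {i} vs {j}: |diff| = {diff:.2f}, significant: {diff >= hsd}")
```

Newer SciPy releases also ship scipy.stats.tukey_hsd, which runs the whole test (p-values and confidence intervals included) in one call.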

Dunnett’s Test

To be added


Williams Test

To be added