Maximum Likelihood

Most common method for estimating parameters of a statistical model.

Table of contents
  1. Likelihood
  2. Maximum Likelihood Estimation (MLE)
    1. Log-Likelihood
    2. Properties of MLE
      1. Consistent
      2. Asymptotically Normal
      3. Asymptotically Optimal (Efficient)
      4. Equivariant
  3. Asymptotic Confidence Interval of MLE

Likelihood

Likelihood or likelihood function is a function of the parameters of a statistical model.

It is essentially a joint PDF of the data given the parameters, but viewed as a function of the parameters.

Why the name likelihood?

We have a bunch of events, each with some probability of happening under the model.

Assuming they are all independent, if all of these events actually happened, then the observed data is likely, and a good choice of parameters should give the joint probability of the data a high value.

Finding the parameters that make this joint probability as high as possible is maximum likelihood estimation.

For RVs $X_1, \dots, X_n$ where each $X_i$ is IID and has the PDF $f(x_i; \theta)$, their joint PDF is:

\[f(x_1, \dots, x_n; \theta) \xlongequal{iid} \prod_{i=1}^n f(x_i; \theta)\]

Usually, this joint PDF is viewed as a function of the data $x_1, \dots, x_n$ with the parameters $\theta$ fixed; the likelihood instead views it as a function of the parameters $\theta$ with the data $x_1, \dots, x_n$ fixed, and we denote:

$$ \mathcal{L}(\theta) = \prod_{i=1}^n f(x_i; \theta) $$

The likelihood is not a probability, and $\theta$ is not a random variable.
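As a minimal sketch of this definition, assuming (purely for illustration) a $N(\theta, 1)$ model and a small made-up sample, the likelihood is just the product of the individual densities, evaluated at different values of $\theta$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data, assumed to be IID draws from N(theta, 1) with theta unknown.
x = np.array([1.2, 0.8, 1.9, 1.4, 0.6])

def likelihood(theta, x):
    """L(theta) = prod_i f(x_i; theta): the joint PDF viewed as a function of theta."""
    return np.prod(norm.pdf(x, loc=theta, scale=1.0))

# Same data, different parameters -> different likelihoods (largest near the sample mean).
for theta in (0.0, 1.0, 2.0):
    print(theta, likelihood(theta, x))
```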


Maximum Likelihood Estimation (MLE)

Our goal is to find the parameter $\theta$ that maximizes the likelihood function $\mathcal{L}(\theta)$.

So given the data $x_1, \dots, x_n$, we’re trying to find the parameter that makes the generation of this data most likely.

Log-Likelihood

Maximizing the likelihood is equivalent to maximizing the log-likelihood:

$$ \ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^n \log f(x_i; \theta) $$

because the log function is monotonically increasing.
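Besides the algebraic convenience, the log also matters numerically: a product of many densities underflows to zero in floating point, while the sum of log-densities stays finite. A small sketch with simulated $N(1, 1)$ data (the model and sample size are just illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=10_000)  # simulated data, n = 10,000

theta = 1.0
L = np.prod(norm.pdf(x, loc=theta))      # raw likelihood: underflows to 0.0 for n this large
ell = np.sum(norm.logpdf(x, loc=theta))  # log-likelihood: stays finite
print(L, ell)
```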

Example with Bernoulli Distribution

Let $X_1, \dots, X_n \sim \text{Bernoulli}(p)$.

The likelihood function is:

\[\begin{align*} \mathcal{L}(p) &= \prod_{i=1}^n \Pr(X_i = x_i; p) \\[1em] &= \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} \\[1em] &= p^{\sum_{i=1}^n x_i} (1-p)^{\sum_{i=1}^n (1-x_i)} \end{align*}\]

The log-likelihood function is:

\[\begin{align*} \ell(p) &= \log \mathcal{L}(p) \\[1em] &= \sum_{i=1}^n x_i \log p + \sum_{i=1}^n (1-x_i) \log (1-p) \end{align*}\]

Using calculus, we can find the value of $p$ that maximizes $\ell(p)$:

\[\frac{d\ell(p)}{dp} = 0\]

Take the derivative:

\[\begin{align*} \frac{d\ell(p)}{dp} &= \frac{\sum_{i=1}^n x_i}{p} - \frac{\sum_{i=1}^n (1-x_i)}{1-p} \\[1em] &= \frac{\sum_{i=1}^n x_i - np}{p(1-p)} = 0 \end{align*}\]

We see that the derivative is zero when:

\[\sum_{i=1}^n x_i = np\]

Rearranging, we get the MLE for $p$:

\[\hat{p} = \frac{1}{n} \sum_{i=1}^n x_i = \overline{x}_n\]

The MLE is unaffected by multiplicative factors in the likelihood that do not depend on $\theta$, so such constants from the joint PDF can be dropped.
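As a sanity check, here is a small sketch (using simulated Bernoulli data and SciPy's bounded scalar minimizer, both of which are just illustrative choices) that maximizes $\ell(p)$ numerically and compares the result to the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
x = rng.binomial(n=1, p=0.3, size=500)  # simulated Bernoulli(0.3) data

def neg_log_likelihood(p):
    # -ell(p) = -(sum x_i * log p + sum (1 - x_i) * log(1 - p))
    return -(np.sum(x) * np.log(p) + np.sum(1 - x) * np.log(1 - p))

# Numerically maximizing ell(p) over (0, 1) ...
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
# ... recovers the closed-form MLE, the sample mean.
print(res.x, x.mean())
```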

Properties of MLE

Let $\hat{\theta}$ be the MLE of $\theta$.

Consistent

\[\hat{\theta} \xrightarrow{P} \theta\]
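A quick simulation (reusing the Bernoulli example above; the seed and sample sizes are arbitrary) shows the estimate settling down around the true parameter as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = 0.3

# The Bernoulli MLE is the sample mean; it converges in probability to p as n grows.
for sample_size in (10, 100, 10_000, 1_000_000):
    x = rng.binomial(n=1, p=p_true, size=sample_size)
    print(sample_size, x.mean())
```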

Asymptotically Normal

\[\frac{\hat{\theta} - \theta}{\hat{\text{se}}} \leadsto N(0, 1)\]

where $\hat{\text{se}}$ is the estimated standard error of $\hat{\theta}$.

We can use the Fisher Information to estimate the standard error asymptotically:

\[\hat{\text{se}} \approx \frac{1}{\sqrt{I_n(\hat{\theta})}}\]

Asymptotically, the variance of the MLE is the inverse of the Fisher Information.
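For example, for the Bernoulli model above, the Fisher Information is $I_n(p) = \frac{n}{p(1-p)}$, so the estimated standard error of $\hat{p} = \overline{x}_n$ is:

\[\hat{\text{se}} \approx \frac{1}{\sqrt{I_n(\hat{p})}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]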

Asymptotically Optimal (Efficient)

For large samples, the MLE has the smallest possible variance among all well-behaved estimators.

If $\hat{\theta}_{MLE}$ is the MLE of $\theta$ and $\tilde{\theta}$ is any other estimator, the asymptotic relative efficiency satisfies:

\[\text{ARE}(\tilde{\theta}, \hat{\theta}_{MLE}) \leq 1\]

See Asymptotically Optimal for more details.

Equivariant

If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$.

Since the MLE is asymptotically normal, we can use the Delta method to find the standard error of $g(\hat{\theta})$:

\[\hat{\text{se}}(g(\hat{\theta})) \approx |g'(\hat{\theta})| \hat{\text{se}}(\hat{\theta})\]

Then we can construct a confidence interval for $g(\theta)$:

\[g(\hat{\theta}) \pm z_{\alpha/2} \hat{\text{se}}(g(\hat{\theta}))\]
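For example, for the Bernoulli model with the log-odds $g(p) = \log \frac{p}{1-p}$, we have $g'(p) = \frac{1}{p(1-p)}$, so:

\[\hat{\text{se}}(g(\hat{p})) \approx \frac{1}{\hat{p}(1-\hat{p})} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{1}{\sqrt{n \hat{p}(1-\hat{p})}}\]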

Asymptotic Confidence Interval of MLE

Since the MLE is asymptotically normal, we can use the normal interval to construct an approximate $1-\alpha$ confidence interval for $\theta$:

$$ \hat{\theta} \pm z_{\alpha/2} \hat{\text{se}} $$

Again, the standard error can be estimated using the Fisher Information:

\[\hat{\text{se}} \approx \frac{1}{\sqrt{I_n(\hat{\theta})}}\]
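As a hypothetical worked example, suppose $n = 100$ Bernoulli trials give $\sum_{i=1}^n x_i = 60$, so $\hat{p} = 0.6$ and $\hat{\text{se}} \approx \sqrt{0.6 \times 0.4 / 100} \approx 0.049$. The approximate 95% confidence interval is:

\[0.6 \pm 1.96 \times 0.049 \approx (0.504, 0.696)\]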