Score Function and Fisher Information

Table of contents
  1. Score Function
  2. Fisher Information
    1. Interpretation of Fisher Information
    2. Inverse of Fisher Information

Score Function

The score function is the first derivative (gradient) of the log-likelihood function with respect to the parameter $\theta$.

The way we find the MLE is by taking the derivative of $\ell(\theta)$ with respect to $\theta$ and finding the value of $\theta$ that makes the derivative zero. So it makes sense to give the gradient of $\ell(\theta)$ a name of its own.

The log-likelihood function is:

\[\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)\]

The score function is:

$$ s(\theta) = \nabla_\theta \ell(\theta) = \sum_{i=1}^n \nabla_\theta \log f(x_i; \theta) $$
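
For example, for $n$ independent $\text{Bernoulli}(p)$ observations, $\log f(x_i; p) = x_i \log p + (1 - x_i) \log(1 - p)$, so

\[s(p) = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p}\]

Setting $s(p) = 0$ and solving gives the familiar MLE $\hat{p} = \bar{x}$.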

Other notation

Some people like to define the score function for a single observation $X_i$:

\[s(\theta; x_i) = \frac{\partial \log f(x_i; \theta)}{\partial \theta}\]

And say that the score function for the entire sample is:

\[s(\theta) = \sum_{i=1}^n s(\theta; x_i)\]
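
Here is what that looks like in code, a minimal sketch in Python/NumPy using the Bernoulli example from above (the function names are just for illustration):

```python
import numpy as np

# Per-observation score for Bernoulli(p): s(p; x) = x/p - (1-x)/(1-p)
def score_single(p, x):
    return x / p - (1 - x) / (1 - p)

# Score for the whole sample = sum of per-observation scores
def score(p, x):
    return np.sum(score_single(p, np.asarray(x)))

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1_000)  # simulate data with true p = 0.3
print(score(x.mean(), x))             # ~0: the score vanishes at the MLE p_hat = x_bar
```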

Expected Value of the Score Function

When $\theta^*$ is the true parameter of $X_i$,

\[E[s(\theta^*; x_i)] = 0\]

This makes sense, because we want all the partial derivatives to be zero (at a maximum) when we have the true parameter.
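
To see why, note that the expectation is taken with respect to $f(x; \theta^*)$ itself. Under the usual regularity conditions (so we can swap differentiation and integration):

\[E[s(\theta^*; x_i)] = \int \frac{\partial \log f(x; \theta^*)}{\partial \theta} f(x; \theta^*) \, dx = \int \frac{\partial f(x; \theta^*)}{\partial \theta} \, dx = \frac{\partial}{\partial \theta} \int f(x; \theta^*) \, dx = \frac{\partial}{\partial \theta} 1 = 0\]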


Fisher Information

The Fisher information is the variance of the score function, evaluated at the true parameter.

$$ I_n(\theta) = \Var(s(\theta)) = \sum_{i=1}^n \Var(s(\theta; x_i)) $$

Since the observations are independent, the variance of the sum is the sum of the variances. For IID samples, it then suffices to calculate the variance of the score function for a single observation:

\[I(\theta) = \Var(s(\theta; x_i))\]

And then $I_n(\theta) = n I(\theta)$.
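
Continuing the Bernoulli example, the per-observation score simplifies to $s(p; x_i) = \frac{x_i - p}{p(1-p)}$, so

\[I(p) = \Var(s(p; x_i)) = \frac{\Var(x_i)}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}, \qquad I_n(p) = \frac{n}{p(1-p)}\]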

There are several other ways to define the Fisher information.

The first is to use the fact that the score function has expected value zero at the true parameter:

\[I(\theta) = E[s^2(\theta; x_i)] - E[s(\theta; x_i)]^2 = E[s^2(\theta; x_i)] = E \left[ \left( \frac{\partial \log f(x_i; \theta)}{\partial \theta} \right)^2 \right]\]

Furthermore, under the usual regularity conditions, you can also show that:

\[E \left[ \frac{\partial^2 \log f(x_i; \theta)}{\partial \theta^2} \right] = - E \left[ \left( \frac{\partial \log f(x_i; \theta)}{\partial \theta} \right)^2 \right]\]

And then you could define the Fisher information for a single observation as:

$$ I(\theta) = -E \left[ \frac{\partial^2 \log f(x_i; \theta)}{\partial \theta^2} \right] $$

Don’t forget $I_n(\theta) = n I(\theta)$ (for IID samples).
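
All three forms are easy to check numerically. A minimal Monte Carlo sketch in Python/NumPy for the Bernoulli example (the printed values are estimates, so expect small deviations from $1/(p(1-p)) \approx 4.76$ for $p = 0.3$):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=200_000)

# Per-observation score and second derivative of log f for Bernoulli(p)
s = x / p - (1 - x) / (1 - p)
d2 = -x / p**2 - (1 - x) / (1 - p) ** 2

print(np.var(s))      # Var of the score        -> ~ 1/(p(1-p))
print(np.mean(s**2))  # E[s^2]                  -> same
print(-np.mean(d2))   # -E[second derivative]   -> same
print(1 / (p * (1 - p)))
```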

Interpretation of Fisher Information

We know that the second derivative of a function is a measure of curvature.

So the Fisher information measures the expected curvature of the log-likelihood function around $\theta$.

If the curve is shallow around the maximum, then the Fisher information is small: many nearby values of $\theta$ give almost the same log-likelihood, so we cannot be very confident in our MLE.

It turns out that $\theta$ is easier to estimate with MLE when the Fisher information is large.
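
In the Bernoulli example, $I(p) = \frac{1}{p(1-p)}$ is smallest at $p = 1/2$ and grows as $p$ approaches 0 or 1, so a single observation carries the least information about $p$ when $p$ is near $1/2$.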

Inverse of Fisher Information

For large samples, the inverse of the Fisher information approximates the variance of the MLE:

$$ \Var(\hat{\theta}) \approx \frac{1}{I_n(\hat{\theta})} $$

This matches the interpretation above: larger Fisher information means a smaller variance of the MLE, i.e. a more precise estimate.
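
As a sanity check, here is a minimal simulation sketch in Python/NumPy for the Bernoulli example, comparing the empirical variance of $\hat{p} = \bar{x}$ across many simulated samples to $1/I_n(p) = p(1-p)/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_reps = 0.3, 500, 20_000

# Draw many samples of size n and compute the MLE (the sample mean) for each
x = rng.binomial(1, p, size=(n_reps, n))
p_hat = x.mean(axis=1)

print(np.var(p_hat))     # empirical variance of the MLE
print(p * (1 - p) / n)   # 1 / I_n(p) = p(1-p)/n = 0.00042
```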