Score Function and Fisher Information
Score Function
The score function is the first derivative (gradient) of the log-likelihood function with respect to the parameter $\theta$.
The way we find the MLE is by taking the derivative of $\ell(\theta)$ with respect to $\theta$ and finding the value of $\theta$ that makes the derivative zero. So it makes sense to want the gradient of $\ell(\theta)$.
The log-likelihood function is:
\[\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)\]
The score function is:
$$ s(\theta) = \nabla_\theta \ell(\theta) = \sum_{i=1}^n \nabla_\theta \log f(x_i; \theta) $$
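As a concrete (purely illustrative) example, take a Bernoulli($p$) model, so $f(x_i; p) = p^{x_i} (1 - p)^{1 - x_i}$:
\[
\log f(x_i; p) = x_i \log p + (1 - x_i) \log(1 - p),
\qquad
s(p) = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right),
\]
and setting $s(p) = 0$ gives the familiar MLE $\hat{p} = \bar{x}$.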
Other Notation
Some people like to define the score function for a single observation $X_i$:
\[s(\theta; x_i) = \frac{\partial \log f(x_i; \theta)}{\partial \theta}\]
And say that the score function for the entire sample is:
\[s(\theta) = \sum_{i=1}^n s(\theta; x_i)\]
Expected Value of the Score Function
When $\theta^*$ is the true parameter of the distribution of $X_i$,
\[E[s(\theta^*; x_i)] = 0\]
This makes sense, because we want all the partial derivatives to be zero (at a maximum) when we have the true parameter.
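As a quick numerical sanity check (not part of the original derivation), here is a minimal Monte Carlo sketch using the Bernoulli example from above, where the per-observation score is $x/p - (1-x)/(1-p)$; the value of `p_true` and the sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3                                  # illustrative "true" parameter
x = rng.binomial(1, p_true, size=1_000_000)   # draws from Bernoulli(p_true)

# Score of a single Bernoulli observation, evaluated at the true parameter
score = x / p_true - (1 - x) / (1 - p_true)

print(score.mean())  # close to 0, since E[s(theta*; X_i)] = 0
```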
Fisher Information
The Fisher information is the variance of the score function.
$$ I_n(\theta) = \Var(s(\theta)) = \sum_{i=1}^n \Var(s(\theta; x_i)) $$
For IID samples, it suffices to calculate the variance of the score function for a single observation:
\[I(\theta) = \Var(s(\theta; x_i))\]
And then $I_n(\theta) = n I(\theta)$.
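For the Bernoulli example used earlier (again, just an illustration), a short derivation gives:
\[
s(p; X_i) = \frac{X_i}{p} - \frac{1 - X_i}{1 - p} = \frac{X_i - p}{p(1 - p)},
\qquad
I(p) = \Var(s(p; X_i)) = \frac{\Var(X_i)}{p^2 (1 - p)^2} = \frac{p(1 - p)}{p^2 (1 - p)^2} = \frac{1}{p(1 - p)},
\]
so $I_n(p) = \dfrac{n}{p(1 - p)}$.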
There are several other ways to define the Fisher information.
The first is to use the fact that the score function has expected value 0 at the true parameter:
\[I(\theta) = E[s^2(\theta; x_i)] - E[s(\theta; x_i)]^2 = E[s^2(\theta; x_i)] = E \left[ \left( \frac{\partial \log f(x_i; \theta)}{\partial \theta} \right)^2 \right]\]
Furthermore, under the usual regularity conditions you can also derive the fact that:
\[E\left[ \frac{\partial^2 \log f(x_i; \theta)}{\partial \theta^2} \right] = - E\left[ \left( \frac{\partial \log f(x_i; \theta)}{\partial \theta} \right)^2 \right]\]
And then you could define the Fisher information for a single observation as:
$$ I(\theta) = -E \left[ \frac{\partial^2 \log f(x_i; \theta)}{\partial \theta^2} \right] $$
Don’t forget $I_n(\theta) = n I(\theta)$ (for IID samples).
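Here is a small numerical sketch (an illustration, not from the original notes) checking that the three expressions agree for the Bernoulli example; the parameter value and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3                                        # illustrative parameter value
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)              # d/dp     log f(x; p)
second = -x / p**2 - (1 - x) / (1 - p)**2      # d^2/dp^2 log f(x; p)

print(score.var())        # Var(s)         ~= 1 / (p (1 - p))
print((score**2).mean())  # E[s^2]         ~= same
print(-second.mean())     # -E[d^2 log f]  ~= same
print(1 / (p * (1 - p)))  # analytic Fisher information, about 4.76
```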
Interpretation of Fisher Information
We know that the second derivative of a function is a measure of curvature.
So the Fisher information is a measure of the curvature of the log-likelihood function around $\theta$.
If the curve is shallow around the maximum, the Fisher information is small, meaning the log-likelihood is nearly flat there: many nearby values of $\theta$ explain the data almost equally well, so we cannot be very confident in our MLE.
It turns out that $\theta$ is easier to estimate with MLE when the Fisher information is large.
Inverse of Fisher Information
The inverse of the Fisher information is (approximately) the variance of the MLE:
$$ \Var(\hat{\theta}) \approx \frac{1}{I_n(\hat{\theta})} $$
This matches up with the interpretation above: larger Fisher information means smaller variance of the MLE, meaning our estimate is more precise.
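To see this relationship numerically (a sketch with arbitrary settings, again using the Bernoulli example where $\hat{p} = \bar{x}$ and $I_n(p) = n / (p(1-p))$):

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n, reps = 0.3, 200, 20_000

# Simulate many datasets and compute the MLE (the sample mean) for each
samples = rng.binomial(1, p_true, size=(reps, n))
p_hat = samples.mean(axis=1)

print(p_hat.var())                # empirical variance of the MLE
print(p_true * (1 - p_true) / n)  # 1 / I_n(p) = p(1 - p) / n
```

Both numbers should come out to roughly 0.00105.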