Entropy / KL Divergence / Cross Entropy
Informational value (surprise)
If a random variable carries more “surprise”, it needs more information to describe because it is more uncertain.
A surprising event is one that has a low probability of occurring.
Then it is clear that probability $p$ has an inverse relationship with surprise:
\[\text{surprise} \propto \log\left(\frac{1}{p}\right)\]
Why the log?
Intuitively, it makes sense that surprise should be infinitely large when $p = 0$.
We should be infinitely surprised when something impossible happens.
On the other hand, surprise should be $0$ when $p = 1$.
However, with just $\frac{1}{p}$, we get a value of $1$ when $p = 1$, not the $0$ we want.
We put a $\log$ in front so that:
\[\text{surprise} = \log\left(\frac{1}{1}\right) = 0\]
We can do this because $\log$ is monotonically increasing, so it preserves the ordering of surprise: less likely events are still more surprising.
The $\log$ also makes the math easier: it turns products into sums, so the surprise of independent events adds up.
Define $p_X$ as the probability mass function of random variable $X$.
Therefore we define the surprise of $x \in X$ as:
$$ -\log(p_X(x)) $$
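As a quick illustration (a minimal sketch, not part of the original notes; the probabilities are made up), computing the surprise of a few outcomes in nats:

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in nats."""
    return -math.log(p)  # math.log(0) would raise, matching "infinite surprise" at p = 0

# Rarer events carry more surprise.
for p in (0.99, 0.5, 0.1, 0.001):
    print(f"p = {p:>5}: surprise = {surprise(p):.3f} nats")
```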
Entropy
Entropy is a measure of surprise/uncertainty in a random variable.
It is actually the expected value of surprise:
$$ H(X) = E[-\log(p_X)] = -\sum_{x \in X} p_X(x) \log(p_X(x)) $$
For continuous random variables, replace the sum with an integral over the density (this is called differential entropy).
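A minimal sketch (base-2 logs, so the results are in bits; the distributions are illustrative) comparing the entropy of a fair coin, a biased coin, and a certain outcome:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, maximum uncertainty for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits, less surprise on average
print(entropy([1.0]))       # certain outcome: 0.0 bits, no surprise at all
```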
KL Divergence
Kullback-Leibler (KL) divergence is a measure of the statistical difference between two distributions $P$ and $Q$.
$$ D_{KL}(P\mathbin{||}Q) = \sum_{x \in X} p(x) \log\left(\frac{p(x)}{q(x)}\right) $$
where $p$ and $q$ are the probability mass functions of $P$ and $Q$.
Also known as the relative entropy of $P$ with respect to $Q$: how much extra surprise we get, on average, if we use $Q$ instead of $P$.
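A minimal sketch of the definition on two small made-up discrete distributions:

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) for discrete distributions given as probability lists, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.4, 0.5]  # empirical distribution
q = [0.3, 0.3, 0.4]  # model distribution
print(kl_divergence(p, q))  # ≈ 0.117 nats of extra surprise from using q instead of p
print(kl_divergence(p, p))  # 0.0: a distribution has no divergence from itself
```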
The most common scenario is comparing the empirical distribution of the observations with a theoretical model distribution.
Let $P$ be the empirical distribution and $Q$ be the theoretical model distribution.
An intuitive way to remember this is to first take the ratio of the probabilities of observing $x$ under $P$ and $Q$:
\[\frac{p(x)}{q(x)}\]
Then we take the log of this ratio:
\[\log\left(\frac{p(x)}{q(x)}\right)\]
This aligns with our intuition that we want to measure the statistical difference between the two distributions, because it is equal to the logarithmic difference $\log(p(x)) - \log(q(x))$.
Then KL divergence is the expected value of this logarithmic difference under $P$:
\[E_p\left[\log\left(\frac{p(X)}{q(X)}\right)\right]\]
Some properties
Another way to write KL divergence
By logarithmic properties,
\[D_{KL}(P\mathbin{||}Q) = \sum_{x \in X} p(x) \log\left(\frac{p(x)}{q(x)}\right) = - \sum_{x \in X} p(x) \log\left(\frac{q(x)}{p(x)}\right)\]
KL divergence is not symmetric
Algebraically, it is easy to see that in general $D_{KL}(P\mathbin{||}Q) \neq D_{KL}(Q\mathbin{||}P)$.
Intuitive way to remember
Remember that we can think of $P$ as the observed distribution and $Q$ as the theoretical model distribution.
We usually want to estimate model parameters that minimize the difference from the observed distribution. A model cannot always fit the observed distribution perfectly, but it will try its best. However, the two distributions may have different ideas about what counts as “best”.
Maybe $P$ has a lot of bumps or spikes, which may be important because they are caused by relevant events.
However, a model distribution may tend to flatten out those bumps and spikes because it wants to generalize.
For $D_{KL}(P\mathbin{||}Q)$, we are looking at the difference from the empirical distribution perspective. When $P$ looks at $Q$, it may think the difference is big because $Q$ did not capture the important information, like the bumps.
For $D_{KL}(Q\mathbin{||}P)$, we are looking at the difference from the model perspective. When $Q$ looks at $P$, it may think the difference is small because the mean is similar and the bumps are not important.
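A tiny numerical sketch of this asymmetry (the distributions are made up for illustration): $P$ has a bump at the third outcome that the model $Q$ essentially smooths away.

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.40, 0.40, 0.20]        # empirical distribution with a bump at the third outcome
q = [0.4999, 0.4999, 0.0002]  # model that flattens the bump away

print(kl_divergence(p, q))  # ≈ 1.20: from P's perspective, the missed bump matters a lot
print(kl_divergence(q, p))  # ≈ 0.22: from Q's perspective, the fit looks fine
```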
Cross Entropy
Cross entropy is a measure of the average surprise we get if we use the estimated distribution $Q$ instead of the true distribution $P$.
Defining cross entropy through KL divergence is an intuitive way to remember it:
$$ H(P, Q) = H(P) + D_{KL}(P\mathbin{||}Q) $$
The information we need to describe $P$ when we use $Q$ to explain it equals the information needed for $P$ itself, plus the extra information incurred by using $Q$ instead.
It is easy to show that this leads to the usual definition:
$$ H(P, Q) = -\sum_{x \in X} p(x) \log(q(x)) $$
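To see why, expand $H(P)$ and $D_{KL}(P\mathbin{||}Q)$ and combine the logarithms:
$$ H(P) + D_{KL}(P\mathbin{||}Q) = -\sum_{x \in X} p(x) \log(p(x)) + \sum_{x \in X} p(x) \log\left(\frac{p(x)}{q(x)}\right) = -\sum_{x \in X} p(x) \log(q(x)) $$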
Conditional entropy
Conditional entropy of random variable $X$ given $Y$ is the amount of information still needed to describe $X$ once $Y$ is known.
$$ H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p_{XY}(x, y) \log(p_{X \mid Y}(x \mid y)) $$
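A small sketch with a made-up joint distribution of two binary variables, just to make the double sum concrete:

```python
import math

# Hypothetical joint distribution p(x, y) for binary X and Y.
joint = {(0, 0): 0.3, (0, 1): 0.2,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(y), obtained by summing the joint over x.
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

# H(X | Y) = -sum_{x,y} p(x, y) * log p(x | y), where p(x | y) = p(x, y) / p(y)
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items())
print(h_x_given_y)  # ≈ 0.875 bits still needed to describe X once Y is known
```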
Mutual information
Mutual information is the amount of information that random variables $X$ and $Y$ share.
$$ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) $$
Also called the “information gain about $X$” given $Y$.
Intuition
We need $H(X)$ units of information to describe $X$.
$H(X \mid Y)$ is the amount of work still left, after knowing $Y$, in order to describe $X$.
So $H(X) - H(X \mid Y)$ is the amount of free information that $Y$ already gives us about $X$.
So $I(X; Y)$ is the amount of information that $X$ and $Y$ have in common, hence the name mutual information.
It is also the “information gained about $X$” by knowing $Y$.
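Continuing the same made-up joint distribution as in the conditional entropy sketch, here is a check that both ways of writing $I(X; Y)$ agree numerically:

```python
import math

# Hypothetical joint distribution p(x, y) for binary X and Y (same numbers as above).
joint = {(0, 0): 0.3, (0, 1): 0.2,
         (1, 0): 0.1, (1, 1): 0.4}

def marginal(axis):
    """Marginal distribution of X (axis=0) or Y (axis=1)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy of a marginal distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def cond_entropy(given_axis):
    """Conditional entropy of the other variable given this axis: 1 -> H(X|Y), 0 -> H(Y|X)."""
    marg = marginal(given_axis)
    return -sum(p * math.log2(p / marg[key[given_axis]]) for key, p in joint.items())

print(entropy(marginal(0)) - cond_entropy(1))  # H(X) - H(X|Y) ≈ 0.125 bits
print(entropy(marginal(1)) - cond_entropy(0))  # H(Y) - H(Y|X) ≈ 0.125 bits
```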