Entropy / KL Divergence / Cross Entropy

Table of contents
  1. Informational value (surprise)
  2. Entropy
  3. KL Divergence
    1. Some properties
      1. Another way to write KL divergence
      2. KL divergence is not symmetric
  4. Cross Entropy
  5. Conditional entropy
    1. Mutual information

Informational value (surprise)

If a variable has higher “surprise”, we say that it needs more information to describe it, because it is more uncertain.

A surprising event is one that has a low probability of occurring.

Then it is clear that probability $p$ has an inverse relationship with surprise:

\[\text{surprise} \propto \log\left(\frac{1}{p}\right)\]
Why the log?

Intuitively, it makes sense that surprise should be infinitely large when $p = 0$.

Infinitely surprised when something impossible happens.

On the other hand, surprise should be $0$ when $p = 1$.

However, with just $\frac{1}{p}$, we would get a value of $1$ when $p = 1$.

We put a $\log$ in front so that:

\[\text{surprise} = \log\left(\frac{1}{1}\right) = 0\]

We can do this because $\log$ is a monotonic function.

The $\log$ also makes the math easier: because $\log(p_1 p_2) = \log(p_1) + \log(p_2)$, the surprise of independent events simply adds up.

Define $p_X$ as the probability function of random variable $X$.

Therefore we define surprise of $x \in X$ as:

$$ -\log(p_X(x)) $$
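
As a quick sanity check, here is a minimal Python sketch of this surprise function (the name `surprise` and the choice of natural log are my own; any log base works, it just changes the unit):

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in nats."""
    # p == 0 would divide by zero; conceptually the surprise of an impossible event is infinite.
    return math.log(1 / p)

print(surprise(1.0))   # 0.0   certain event, zero surprise
print(surprise(0.5))   # ~0.69
print(surprise(0.01))  # ~4.61 rare event, large surprise
```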


Entropy

Entropy is a measure of surprise/uncertainty in a random variable.

It is actually the expected value of surprise:

$$ H(X) = E[-\log(p_X)] = -\sum_{x \in X} p_X(x) \log(p_X(x)) $$

For continuous random variables, replace the sum with an integral over the density (giving the differential entropy).
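
A small Python sketch for the discrete case, assuming the distribution is given as a plain list of probabilities (the function name is illustrative):

```python
import math

def entropy(p: list[float]) -> float:
    """H(X) = -sum_x p(x) log(p(x)), in nats; terms with p(x) == 0 contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))    # ~0.693  fair coin: maximum uncertainty over two outcomes
print(entropy([0.99, 0.01]))  # ~0.056  heavily biased coin: very little uncertainty
```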


KL Divergence

Kullback-Leibler (KL) divergence is a measure of the statistical difference between two distributions.

$$ D_{KL}(P\mathbin{||}Q) = \sum_{x \in X} p(x) \log\left(\frac{p(x)}{q(x)}\right) $$

Also known as relative entropy of $P$ with respect to $Q$, or how much extra surprise we get if we use $Q$ instead of $P$.

The most common scenario is comparing the empirical distribution of the observations with the theoretical model distribution.

Let $P$ be the empirical distribution and $Q$ be the theoretical model distribution.

An intuitive way to remember this is to first take the ratio of the probabilities of observing a value $x$ under $P$ and $Q$:

\[\frac{p(x)}{q(x)}\]

Then we take the log of this ratio:

\[\log\left(\frac{p(x)}{q(x)}\right)\]

This aligns with our intuition that we want to measure the statistical difference between the two distributions, because it is equal to the logarithmic difference $\log(p(x)) - \log(q(x))$.

Then KL divergence is the expected value of this logarithmic difference under $P$:

\[E_p\left[\log\left(\frac{p(X)}{q(X)}\right)\right]\]
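
Putting the pieces together, a minimal Python sketch of discrete KL divergence, under the assumption that $q(x) > 0$ wherever $p(x) > 0$ (otherwise the divergence is infinite):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) = sum_x p(x) log(p(x) / q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) == 0 contribute nothing.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.4, 0.4, 0.2]         # stand-in for the empirical distribution P
q = [1/3, 1/3, 1/3]         # stand-in for the model distribution Q
print(kl_divergence(p, q))  # > 0: the extra surprise from using Q instead of P
print(kl_divergence(p, p))  # 0.0: a distribution has no divergence from itself
```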

Some properties

Another way to write KL divergence

By logarithmic properties,

\[D_{KL}(P\mathbin{||}Q) = \sum_{x \in X} p(x) \log\left(\frac{p(x)}{q(x)}\right) = - \sum_{x \in X} p(x) \log\left(\frac{q(x)}{p(x)}\right)\]

KL divergence is not symmetric

Algebraically, it is easy to see that $D_{KL}(P\mathbin{||}Q) \neq D_{KL}(Q\mathbin{||}P)$.

Intuitive way to remember

Remember that we can think of $P$ as the observed distribution and $Q$ as the theoretical model distribution.

We usually want to estimate model parameters that minimize the difference from the observed distribution. A model cannot always fit the observed distribution perfectly, so it does its best, but the two distributions may have different ideas about what counts as “best”.

Maybe $P$ has a lot of bumps or spikes, which may be important because they are caused by relevant events.

However, a model distribution might tend to flatten out those bumps and spikes because it wants to generalize.

For $D_{KL}(P\mathbin{||}Q)$, we are looking at the difference from the empirical distribution’s perspective. When $P$ looks at $Q$, it may think the difference is big, because $Q$ did not capture the important information, like the bumps.

For $D_{KL}(Q\mathbin{||}P)$, we are looking at the difference from the model’s perspective. When $Q$ looks at $P$, it may think the difference is small, because the means are similar and the bumps do not matter to it.
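
A small numerical illustration of the asymmetry, with a made-up “bumpy” $P$ and a flat $Q$. In this particular example the forward direction $D_{KL}(P\mathbin{||}Q)$ comes out larger, matching the intuition above, though which direction is larger in general depends on the distributions:

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Made-up example: P has a "bump" on the first outcome, Q flattens everything out.
p = [0.5, 0.125, 0.125, 0.125, 0.125]
q = [0.2, 0.2, 0.2, 0.2, 0.2]

print(kl(p, q))  # ~0.223  forward KL: P penalizes Q for underweighting the bump
print(kl(q, p))  # ~0.193  reverse KL: smaller here, Q cares less about the mismatch
```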


Cross Entropy

Cross entropy is a measure of the surprise we have if we use the estimated distribution $Q$ instead of the true distribution $P$.

Defining cross entropy in terms of KL divergence is an intuitive way to remember it:

$$ H(P, Q) = H(P) + D_{KL}(P\mathbin{||}Q) $$

The information we need to describe $P$ when we use $Q$ to explain it equals the information needed for $P$ itself, plus the extra information incurred by using $Q$ instead.

Expanding $H(P)$ and $D_{KL}(P\mathbin{||}Q)$ and combining the logarithms leads to the usual definition:

$$ H(P, Q) = -\sum_{x \in X} p(x) \log(q(x)) $$
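
A quick numerical check of the identity, with made-up distributions (helper names are illustrative):

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log(q(x)), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(cross_entropy(p, q))    # direct definition
print(entropy(p) + kl(p, q))  # H(P) + D_KL(P || Q): same value up to float rounding
```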


Conditional entropy

(Figure: entropy relation diagram.)

The conditional entropy of random variable $X$ given $Y$ is the amount of extra information needed to describe $X$ once $Y$ is known.

$$ H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p_{XY}(x, y) \log(p_{X \mid Y}(x \mid y)) $$
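
A minimal sketch that computes this from a joint probability table, assuming the joint is given as a nested list with rows indexed by $x$ and columns by $y$ (the table values are made up):

```python
import math

def conditional_entropy(joint: list[list[float]]) -> float:
    """H(X | Y) in nats, from a joint table where joint[x][y] = p(x, y)."""
    # Marginal p(y): sum each column over x.
    p_y = [sum(joint[x][y] for x in range(len(joint))) for y in range(len(joint[0]))]
    h = 0.0
    for x in range(len(joint)):
        for y in range(len(joint[0])):
            pxy = joint[x][y]
            if pxy > 0:
                # p(x | y) = p(x, y) / p(y)
                h -= pxy * math.log(pxy / p_y[y])
    return h

# Made-up joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(conditional_entropy(joint))  # ~0.500, less than H(X) = log(2) ~ 0.693
```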

Mutual information

Mutual information is the amount of information that random variables $X$ and $Y$ share.

$$ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) $$

Also called the “information gain about $X$” given $Y$.

Intuition

We need $H(X)$ units of information to describe $X$.

$H(X \mid Y)$ is the amount of extra work we still have to do, once we know $Y$, in order to describe $X$.

So $H(X) - H(X \mid Y)$ is the amount of information about $X$ that $Y$ already gives us for free.

So $I(X; Y)$ is the amount of information that $X$ and $Y$ already had in common, thus mutual information.

It is also the “information gained about $X$” by knowing $Y$.
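
To tie it together, a small sketch that computes $I(X; Y) = H(X) - H(X \mid Y)$ from a made-up joint table (function names and data are illustrative):

```python
import math

def entropy(p):
    """H in nats for a 1-D list of probabilities."""
    return -sum(v * math.log(v) for v in p if v > 0)

def conditional_entropy(joint):
    """H(X | Y) in nats, from a joint table where joint[x][y] = p(x, y)."""
    p_y = [sum(joint[x][y] for x in range(len(joint))) for y in range(len(joint[0]))]
    return -sum(
        joint[x][y] * math.log(joint[x][y] / p_y[y])
        for x in range(len(joint))
        for y in range(len(joint[0]))
        if joint[x][y] > 0
    )

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X | Y)."""
    p_x = [sum(row) for row in joint]
    return entropy(p_x) - conditional_entropy(joint)

# Correlated binary variables: knowing Y removes some uncertainty about X.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))  # ~0.193

# Independent variables share no information.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(mutual_information(independent))  # ~0.0
```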