Loss Functions

Table of contents
  1. Loss / Cost Function
  2. Constructing Loss Function From Maximum Likelihood
    1. Summary
  3. Domain Specific Loss Construction
    1. Least Squares Loss
    2. Binary Cross-Entropy Loss
    3. Multi-Class Cross-Entropy Loss

Loss / Cost Function

Given a dataset $\{ x_i, y_i \}_{i=1}^N$ and a model $f[x, \phi]$, the loss function $L[\phi]$ quantifies the discrepancy between the model’s predictions and the true labels $y_i$ across the dataset.

The loss function depends on both the model (and its parameters) and the dataset.

The goal of training is to estimate the parameters $\phi$ that minimize the loss across the dataset:

$$ \hat{\phi} = \arg\min_{\phi} L[\phi] $$
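As a minimal numerical sketch of this $\arg\min$, consider a hypothetical one-parameter model $f[x, \phi] = \phi x$ with a squared-error loss (derived formally later in these notes), minimized by brute-force search over a grid of candidates:

```python
import numpy as np

# Toy dataset: labels generated by y = 2x, so the best phi should be 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def loss(phi):
    """Squared-error loss L[phi] for the toy model f[x, phi] = phi * x."""
    return np.sum((y - phi * x) ** 2)

# Brute-force grid search approximates argmin_phi L[phi].
candidates = np.linspace(0.0, 4.0, 401)
phi_hat = candidates[np.argmin([loss(p) for p in candidates])]
```

In practice the minimization is done by gradient descent rather than grid search, but the objective is the same.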


Constructing Loss Function From Maximum Likelihood

Instead of thinking about the model’s prediction as a point estimate, we can think of it as a conditional probability distribution:

\[\Pr(y \mid x)\]

We want to maximize the likelihood of $y_i$ given each $x_i$:

\[\hat{\phi} = \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid x_i)\]

So our model $f[x, \phi]$ now needs to represent a distribution, and it will do so indirectly by learning the parameters of a chosen distribution.

We assume some parametric distribution for $y$ with parameter set $\theta$. This distribution should be chosen based on the nature and domain of the problem.

What is the difference between $\phi$ and $\theta$?

$\phi$ is the parameter (weights and biases) of the model $f$.

$\theta$ is the parameter of the chosen distribution for $y$.

The output of our model is the parameter $\theta$ of the distribution:

\[\theta = f[x_i, \phi]\]

Then our likelihood boils down to:

\[\begin{align*} \hat{\phi} &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid x_i) \\ &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid \theta_i) \\ &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid f[x_i, \phi]) \end{align*}\]

Notice that factoring the likelihood into a product assumes our data pairs are i.i.d.

Since log is a monotonically increasing function, we can equivalently maximize the log likelihood:

\[\hat{\phi} = \arg\max_{\phi} \sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right]\]

Since loss is something we usually minimize, we turn this maximization problem into a minimization problem by negating to get the negative log likelihood:

\[\hat{\phi} = \arg\min_{\phi} \left[ -\sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right] \right]\]

We define the loss function as the negative log likelihood:

$$ L[\phi] = -\sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right] $$
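A minimal numerical sketch of this loss, assuming (hypothetically) that $y$ follows a normal distribution whose mean $\theta_i = f[x_i, \phi]$ is the model output, with fixed unit variance:

```python
import numpy as np

def nll_loss(y, theta, sigma=1.0):
    """Negative log likelihood under y_i ~ Normal(theta_i, sigma^2),
    where theta_i = f[x_i, phi] is the model output for example i."""
    log_prob = (-0.5 * np.log(2 * np.pi * sigma**2)
                - (y - theta) ** 2 / (2 * sigma**2))
    return -np.sum(log_prob)

y = np.array([1.0, 2.0])
theta_good = np.array([1.0, 2.0])   # predictions match the labels
theta_bad = np.array([3.0, 0.0])    # predictions far from the labels
```

Predictions that assign higher probability to the observed labels yield a lower loss, which is exactly what minimization will exploit.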

Summary

  1. Assume a domain-specific distribution for $y$ parameterized by $\theta$.
  2. Have the model $f$ learn the parameter $\theta$ given $x$.
  3. Find the negative log likelihood of $y$ given model output $f[x, \phi]$.

So the key takeaway is that, instead of the actual label $y$, our network learns the parameter $\theta$ of the distribution of $y$.
This learning is driven by minimizing the negative log likelihood, i.e., the loss function.

Finally, since our model $f$ no longer directly predicts $y$, we perform inference by finding the most probable label:

\[\hat{y} = \arg\max_y \Pr(y \mid f[x, \hat{\phi}])\]
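For a discrete label, this $\arg\max$ amounts to scoring every candidate class; a sketch assuming the model output has already been converted into a vector of class probabilities:

```python
import numpy as np

def predict(probs):
    """Inference: pick the label y maximizing Pr(y | f[x, phi_hat]),
    given probs[k] = Pr(y = k | f[x, phi_hat]) for each candidate class k."""
    return int(np.argmax(probs))

y_hat = predict([0.1, 0.7, 0.2])  # class 1 is most probable
```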

Domain Specific Loss Construction

We will explore some domain-specific loss functions.

Everything else follows the recipe above; the real difference lies in selecting the right distribution for $y$ and which of its parameters to learn.

Least Squares Loss

For linear regression, we assume a normal distribution for the error term $\epsilon$ to derive the OLS loss function.

The loss function is the negative log likelihood of:

\[\begin{align*} \Pr(y \mid \mu, \sigma^2) &= \Pr(y \mid f[x, \phi], \sigma^2) \tag{Homoscedastic} \\[1em] \Pr(y \mid \mu, \sigma^2) &= \Pr(y \mid f_1[x, \phi], f_2[x, \phi]^2) \tag{Heteroscedastic} \end{align*}\]

For homoscedastic models, our loss function is the usual OLS (after dropping the additive constant and the $1/2\sigma^2$ scale, which do not affect the minimizer):

\[L[\phi] = \sum_{i=1}^N \left( y_i - f[x_i, \phi] \right)^2\]
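A sketch checking numerically that, for fixed $\sigma^2$, the homoscedastic Gaussian negative log likelihood and the least squares loss are minimized by the same parameter (hypothetical toy data, one-parameter linear model):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 2.1])

def ols_loss(phi):
    """Least squares loss for the linear model f[x, phi] = phi * x."""
    return np.sum((y - phi * x) ** 2)

def gaussian_nll(phi, sigma=1.0):
    """Homoscedastic Gaussian NLL; differs from OLS only by an
    additive constant and a positive scale."""
    return (0.5 * len(y) * np.log(2 * np.pi * sigma**2)
            + ols_loss(phi) / (2 * sigma**2))

# Both losses pick the same phi from a shared candidate grid.
phis = np.linspace(0.0, 2.0, 201)
best_ols = phis[np.argmin([ols_loss(p) for p in phis])]
best_nll = phis[np.argmin([gaussian_nll(p) for p in phis])]
```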

Binary Cross-Entropy Loss

For binary classification, we assume a Bernoulli distribution for the label $y$:

\[\Pr(y \mid \lambda) = \lambda^y (1 - \lambda)^{1-y}\]

One catch: the parameter $\lambda$ of the Bernoulli distribution is a probability, but there is no guarantee that our network $f$ will output a value in $[0, 1]$.

To constrain the output to $[0, 1]$, we wrap the network output in a sigmoid function:

\[\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}\]

Then, the loss function is the negative log likelihood of:

\[\Pr(y \mid \lambda) = \Pr(y \mid \sigma[f[x, \phi]])\]

Let us denote the probability output of the network $\hat{p} = \sigma[f[x, \phi]]$.

The binary cross-entropy loss is:

$$ L[\phi] = -\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right] $$

Derivation:

\[\begin{align*} L[\phi] &= -\sum_{i=1}^N \log \left[ \Pr(y_i \mid \sigma[f[x_i, \phi]]) \right] \\[0.5em] &= -\sum_{i=1}^N \log \left[ \Pr(y_i \mid \hat{p}_i) \right] \\[0.5em] &= -\sum_{i=1}^N \log \left[ \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1-y_i} \right] \\[0.5em] &= -\sum_{i=1}^N [ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) ] \\[0.5em] \end{align*}\]
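The derivation can be checked numerically; a sketch with hypothetical raw network outputs (logits):

```python
import numpy as np

def sigmoid(z):
    """Map raw network output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, logits):
    """Binary cross-entropy: NLL of a Bernoulli with p_hat = sigmoid(logit)."""
    p = sigmoid(np.asarray(logits, dtype=float))
    y = np.asarray(y, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
good_logits = np.array([4.0, -4.0, 4.0])   # confident and correct
bad_logits = np.array([-4.0, 4.0, -4.0])   # confident but wrong
```

Confidently correct predictions incur a small loss; confidently wrong ones are penalized heavily.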

Multi-Class Cross-Entropy Loss

For multi-class classification, we generalize the binary case to a categorical distribution over $K$ classes:

\[\Pr(y = k) = \lambda_k \land \sum_{k=1}^K \lambda_k = 1\]

Binary classification is the special case $K = 2$; since $\lambda_1 + \lambda_2 = 1$, a single parameter $\lambda$ suffices, with the two class probabilities $\lambda$ and $1 - \lambda$.

Instead of the sigmoid function, we wrap our model output in a softmax function:

$$ \text{softmax}_k(z) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} $$

The loss function is the negative log likelihood of the training data:

\[\begin{align*} L[\phi] &= - \sum_{i=1}^N \log \biggl[ \text{softmax}_{y_i}\left[f[x_i, \phi]\right] \biggr] \\[0.5em] &= - \sum_{i=1}^N \biggl( f_{y_i}[x_i, \phi] - \log \left[\sum_{j=1}^K \text{exp}(f_j[x_i, \phi])\right] \biggr) \end{align*}\]

In practice, it is common to average over the dataset and write the loss using one-hot labels:

$$ L[\phi] = \frac{1}{N}\sum_{i=1}^N l_i = - \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^K y_{ik} \log \hat{p}_{ik} $$

where $y_{ik}$ is the $k$-th entry of the one-hot encoding of the true label, and $\hat{p}_{ik}$ is the $k$-th softmax output of the model.
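A sketch verifying that the index form (computed with a numerically stable log-sum-exp) and the one-hot form agree, using hypothetical logits and averaging both over $N$ for comparison:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def ce_index_form(logits, labels):
    """-(1/N) sum_i (f_{y_i} - log sum_j exp(f_j)), via stable log-sum-exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_z = np.log(np.sum(np.exp(shifted), axis=1))
    picked = shifted[np.arange(len(labels)), labels]
    return -np.mean(picked - log_z)

def ce_one_hot_form(logits, labels):
    """-(1/N) sum_i sum_k y_ik log p_ik with one-hot labels y."""
    p = softmax(logits)
    one_hot = np.eye(logits.shape[1])[labels]
    return -np.mean(np.sum(one_hot * np.log(p), axis=1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
```

Subtracting the row-wise max before exponentiating leaves the result unchanged algebraically but avoids overflow for large logits, which is why library implementations work directly on logits rather than on softmax outputs.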