Loss Functions
Loss / Cost Function
Given dataset $\{ x_i, y_i \}_{i=1}^N$ and model $f[x, \phi]$, the loss function $L[\phi]$ returns the discrepancy between the model’s prediction and the true label $y_i$.
The loss function depends on both the model (and its parameters) and the dataset.
The goal of training is to find the parameter estimate $\hat{\phi}$ that minimizes the loss across the dataset:
$$ \hat{\phi} = \arg\min_{\phi} L[\phi] $$
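To make the minimization concrete, here is a minimal sketch with an assumed toy setup: a one-parameter model $f[x, \phi] = \phi x$, a squared-error loss, and a brute-force grid search over candidate $\phi$ (real training would use gradient descent, but the objective is the same).

```python
import numpy as np

# Toy dataset, roughly following y = 2x plus noise (illustrative values).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

def loss(phi):
    """Squared-error loss L[phi] for the one-parameter model f[x, phi] = phi * x."""
    return np.sum((y - phi * x) ** 2)

# Evaluate L[phi] on a grid of candidates and take the argmin.
grid = np.linspace(-5.0, 5.0, 1001)
phi_hat = grid[np.argmin([loss(p) for p in grid])]
```

The recovered `phi_hat` lands near the slope that generated the data, which is exactly what $\arg\min_{\phi} L[\phi]$ asks for.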
Constructing Loss Function From Maximum Likelihood
Instead of thinking about the model’s prediction as a point estimate, we can think of it as a conditional probability distribution:
\[\Pr(y \mid x)\]

We want to maximize the likelihood of $y_i$ given each $x_i$:
\[\hat{\phi} = \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid x_i)\]

So now our model $f[x, \phi]$ needs to model a distribution, and we will do that indirectly by learning a parameter for some distribution.
We assume some parametric distribution for $y$ with parameter set $\theta$. This distribution should be chosen based on the nature and domain of the problem.
$\phi$ and $\theta$?
$\phi$ is the parameter (weights and biases) of the model $f$.
$\theta$ is the parameter of the chosen distribution for $y$.
The output of our model is the parameter $\theta$ of the distribution:
\[\theta_i = f[x_i, \phi]\]

Then our likelihood boils down to:
\[\begin{align*} \hat{\phi} &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid x_i) \\ &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid \theta_i) \\ &= \arg\max_{\phi} \prod_{i=1}^N \Pr(y_i \mid f[x_i, \phi]) \end{align*}\]

Notice that we assume our input data pairs are i.i.d.
We know that log is a monotonic function, so we can maximize the log likelihood:
\[\hat{\phi} = \arg\max_{\phi} \sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right]\]

Since loss is something we usually minimize, we turn this maximization problem into a minimization problem by negating to get the negative log likelihood:
\[\hat{\phi} = \arg\min_{\phi} \left[ -\sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right] \right]\]

We define the loss function as the negative log likelihood:
$$ L[\phi] = -\sum_{i=1}^N \log \left[ \Pr(y_i \mid f[x_i, \phi]) \right] $$
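As a minimal sketch of this recipe, assume $y$ is Gaussian with mean $\theta_i = f[x_i, \phi]$ and a fixed $\sigma = 1$ (an assumption for illustration). The negative log likelihood then differs from the sum of squared errors only by a constant, so both are minimized by the same $\phi$:

```python
import numpy as np

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log likelihood of y under N(mu, sigma^2), summed over the dataset."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))

y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.8, 3.2])   # model outputs theta_i = f[x_i, phi]

nll = gaussian_nll(y, mu)
sse = np.sum((y - mu) ** 2)
# With sigma fixed, NLL = 0.5 * SSE + constant: same minimizer as least squares.
```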
Summary
- Assume a domain-specific distribution for $y$ parameterized by $\theta$.
- Have the model $f$ learn the parameter $\theta$ given $x$.
- Find the negative log likelihood of $y$ given model output $f[x, \phi]$.
So the key takeaway is, instead of the actual label $y$, our network learns the parameter $\theta$ of the distribution of $y$.
This learning is driven by the negative log likelihood, which serves as the loss function.
Finally, since our model $f$ no longer directly predicts $y$, we need to do inference by:
\[\hat{y} = \arg\max_y \Pr(y \mid f[x, \hat{\phi}])\]

Domain Specific Loss Construction
We will explore some domain-specific loss functions.
Everything else follows the recipe above; the real difference lies in selecting the right distribution for $y$ and which parameters to learn.
Least Squares Loss
For linear regression, we assume a normal distribution for the error term $\epsilon$ to derive the OLS loss function.
Loss function is the negative log likelihood of:
\[\begin{align*} \Pr(y \mid \mu, \sigma^2) &= \Pr(y \mid f[x, \phi], \sigma^2) \tag{Homoscedastic} \\[1em] \Pr(y \mid \mu, \sigma^2) &= \Pr(y \mid f_1[x, \phi], f_2[x, \phi]^2) \tag{Heteroscedastic} \end{align*}\]

For homoscedastic models, our loss function is the usual OLS:
\[L[\phi] = \sum_{i=1}^N \left( y_i - f[x_i, \phi] \right)^2\]

Binary Cross-Entropy Loss
For binary classification, we assume a Bernoulli distribution for the label $y$:
\[\Pr(y \mid \lambda) = \lambda^y (1 - \lambda)^{1-y}\]

The catch is that the parameter $\lambda$ of the Bernoulli distribution is a probability, so it must lie in $[0, 1]$, but there is no guarantee that our network $f$ will output a value in that range.
To enforce this, we wrap the network output in a sigmoid function:
\[\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}\]

Then, the loss function is the negative log likelihood of:
\[\Pr(y \mid \lambda) = \Pr(y \mid \sigma[f[x, \phi]])\]

Let us denote the probability output of the network by $\hat{p} = \sigma[f[x, \phi]]$.
The binary cross-entropy loss is:
$$ L[\phi] = -\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right] $$
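A minimal numpy sketch of this loss, taking raw network outputs (logits) $z = f[x, \phi]$ and assumed illustrative values for $y$ and $z$; it also checks the textbook formula against a numerically stable rewrite using $-\log \sigma(z) = \log(1 + e^{-z})$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, z):
    """Binary cross-entropy: negative Bernoulli log likelihood of labels y,
    where p_hat = sigmoid(z) and z = f[x, phi] is the raw network output."""
    p = sigmoid(z)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])    # binary labels
z = np.array([2.0, -1.0, 0.5])   # logits from the network

loss = bce_loss(y, z)

# Numerically stable equivalent: -log sigmoid(z) = logaddexp(0, -z),
# which avoids computing log(p) for p very close to 0 or 1.
stable = np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))
```

In practice the stable form is preferred, since extreme logits push `sigmoid(z)` to exactly 0 or 1 in floating point and `log` then overflows.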
Derivation
\[\begin{align*} L[\phi] &= -\sum_{i=1}^N \log \left[ \Pr(y_i \mid \sigma[f[x_i, \phi]]) \right] \\[0.5em] &= -\sum_{i=1}^N \log \left[ \Pr(y_i \mid \hat{p}_i) \right] \\[0.5em] &= -\sum_{i=1}^N \log \left[ \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1-y_i} \right] \\[0.5em] &= -\sum_{i=1}^N [ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) ] \\[0.5em] \end{align*}\]

Multi-Class Cross-Entropy Loss
For multi-class classification, this is just a generalization of the binary classification:
\[\Pr(y = k) = \lambda_k, \qquad \sum_{k=1}^K \lambda_k = 1\]

Binary classification is the special case $K = 2$; since $\lambda_1 + \lambda_2 = 1$, we need only a single parameter $\lambda$, with $1 - \lambda$ for the other class.
Instead of the sigmoid function, we wrap our model output in a softmax function:
$$ \text{softmax}_k(z) = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} $$
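A short sketch of the softmax in numpy. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick: it leaves the result unchanged (numerator and denominator are scaled by the same factor) but prevents overflow for large logits.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, with the max-subtraction stability trick."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs sums to 1 and every entry lies in (0, 1), as required of lambda_k
```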
Loss function is the negative log likelihood of training data:
\[\begin{align*} L[\phi] &= - \sum_{i=1}^N \log \biggl[ \text{softmax}_{y_i}\left[f[x_i, \phi]\right] \biggr] \\[0.5em] &= - \sum_{i=1}^N \biggl( f_{y_i}[x_i, \phi] - \log \left[\sum_{j=1}^K \exp\left(f_j[x_i, \phi]\right)\right] \biggr) \end{align*}\]

In practice, it is common to use the following averaged form:
$$ L[\phi] = \frac{1}{N}\sum_{i=1}^N l_i = - \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^K y_{ik} \log \hat{p}_{ik} $$
where $y_{i}$ is the one-hot encoded vector of the true label, and $\hat{p}_i$ is the softmax output of the model.
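To close the loop, a sketch (with assumed toy logits and labels) showing that the class-index form $-\frac{1}{N}\sum_i \log \hat{p}_{i, y_i}$ and the one-hot form above compute the same value, since $y_{ik}$ simply selects the true-class term:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean multi-class cross-entropy: -1/N * sum_i log p_hat[i, y_i]."""
    p = softmax(logits)
    n = len(labels)
    return -np.mean(np.log(p[np.arange(n), labels]))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3]])   # f[x_i, phi] for N = 2, K = 3
labels = np.array([0, 1])              # true class indices y_i

loss_index = cross_entropy(logits, labels)

# One-hot form: y_ik zeroes out every term except the true class,
# so it yields exactly the same loss.
one_hot = np.eye(3)[labels]
loss_onehot = -np.mean(np.sum(one_hot * np.log(softmax(logits)), axis=1))
```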