Regularization in Neural Networks

Regularization is a technique used to prevent overfitting in models.

It generally involves adding a penalty to an objective function to reduce variance and improve generalization.

Another way to think about it is imposing some constraints on the model, effectively smoothing the model’s loss landscape.

Table of contents
  1. Explicit Regularization
    1. L2 Regularization
    2. Probabilistic Interpretation
  2. Implicit Regularization
  3. Alternative Heuristics
    1. Early Stopping
    2. Ensemble Methods
    3. Dropout
    4. Adding Noise / Data Augmentation
    5. Bayesian Approaches
    6. Transfer Learning / Multi-Task Learning / Self-Supervised Learning

Explicit Regularization

Explicit regularization involves adding a penalty term to the loss function.

L2 Regularization

$$ \hat{\phi} = \underset{\phi}{\text{argmin}} \left( \sum_{i=1}^N \ell_i(\phi) + \lambda \|\phi\|_2^2 \right) $$

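L2 regularization penalizes the squared L2 norm of the parameters, which discourages the weights from growing large. With plain gradient descent it is equivalent to weight decay, the name it usually goes by in deep learning.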

Unlike in classical statistical models, we usually do not search for the optimal $\lambda$. The reason is explained in the Implicit Regularization section below.
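
As a rough illustration of the objective above, the following sketch assumes a linear model with squared-error loss (the names `l2_regularized_loss`, `l2_regularized_grad` and `gradient_step` are made up for this example). It shows that a plain gradient step on the penalized loss is the same as shrinking the weights by a constant factor and then taking the usual data-gradient step, which is where the name weight decay comes from.

```python
import numpy as np

def l2_regularized_loss(phi, X, y, lam):
    """Sum of squared errors plus lam * ||phi||_2^2 (linear model assumed)."""
    residual = X @ phi - y
    return np.sum(residual ** 2) + lam * np.sum(phi ** 2)

def l2_regularized_grad(phi, X, y, lam):
    """Gradient of the penalized loss with respect to phi."""
    return 2.0 * X.T @ (X @ phi - y) + 2.0 * lam * phi

def gradient_step(phi, X, y, lam, lr):
    """Same update written as weight decay: shrink phi, then follow the data gradient."""
    data_grad = 2.0 * X.T @ (X @ phi - y)
    return (1.0 - 2.0 * lr * lam) * phi - lr * data_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
phi = rng.normal(size=3)

# Two ways of writing one gradient step on the penalized loss.
step_a = phi - 0.01 * l2_regularized_grad(phi, X, y, lam=0.1)
step_b = gradient_step(phi, X, y, lam=0.1, lr=0.01)
print(np.allclose(step_a, step_b))
```

The equivalence holds for plain gradient descent; adaptive optimizers treat the two forms differently.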

Probabilistic Interpretation

Adding a regularization term is equivalent to placing a prior distribution $\Pr(\phi)$ on the parameters, which turns maximum likelihood estimation into maximum a posteriori (MAP) estimation:

$$ \hat{\phi} = \underset{\phi}{\text{argmax}} \left( \left[ \prod_{i=1}^N \Pr(y_i | x_i, \phi) \right] \Pr(\phi) \right) $$

This is because the loss function is typically the negative log-likelihood, $\ell_i(\phi) = -\log \Pr(y_i | x_i, \phi)$.

Taking the negative logarithm of the above objective turns the product into a sum and the prior into an additive term $-\log \Pr(\phi)$, which plays the role of the regularization term.
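
As a worked example, assume a zero-mean Gaussian prior with variance $\sigma^2$ on every parameter (the choice of prior and the symbol $\sigma$ are assumptions made here, not part of the text above). Taking the negative log of the objective and dropping constants that do not depend on $\phi$ gives:

$$
\begin{aligned}
\hat{\phi} &= \underset{\phi}{\text{argmin}} \left( -\sum_{i=1}^N \log \Pr(y_i | x_i, \phi) - \log \Pr(\phi) \right) \\
&= \underset{\phi}{\text{argmin}} \left( \sum_{i=1}^N \ell_i(\phi) + \frac{1}{2\sigma^2} \|\phi\|_2^2 \right)
\end{aligned}
$$

Under this assumption the L2 objective is recovered with $\lambda = 1 / (2\sigma^2)$: a narrower prior corresponds to a stronger penalty.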


Implicit Regularization

We mentioned above that we do not search for the optimal $\lambda$.

This is because in deep learning, we usually do not use explicit regularization.

In fact, most deep learning models generalize well without explicit regularization.

This is due to a combination of factors:

  1. Deep learning models are overparameterized, which leads to many global minima (a valley of lowest points).
  2. Optimizers (gradient descent) have an implicit preference among those minima (they have a preferred path and endpoint within the valley).

This built-in preference of the optimizer acts as an implicit regularizer. It reduces the variance of the model because the algorithm already constrains itself to a certain path.

  • Full-batch gradient descent: with a finite step size, it disfavors regions where the loss surface is steep, a preference similar to that of L2 regularization.
  • Minibatch SGD: in addition to the above, it prefers solutions where the gradients of all batches point in similar directions.
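
A simple and well-known special case of an optimizer "choosing" among many global minima can be sketched as follows. This is not the finite-step-size effect described in the bullets above, but a related illustration under stated assumptions: an overparameterized linear least-squares problem, gradient descent started from zero, and made-up variable names. In this setting gradient descent ends up at the zero-error solution with the smallest L2 norm, even though no penalty term was added.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 examples, 20 parameters: overparameterized
y = rng.normal(size=5)

# Step size 1 / (largest eigenvalue of X^T X) keeps gradient descent stable.
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2

w = np.zeros(20)               # starting at zero matters for this result
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)   # gradient of 0.5 * ||Xw - y||^2

# Minimum-L2-norm solution with zero training error, computed in closed form.
w_min_norm = np.linalg.pinv(X) @ y

# Another zero-error solution: add a direction that X maps to (almost) zero.
null_direction = np.linalg.svd(X)[2][-1]
w_other = w_min_norm + 3.0 * null_direction

print(np.allclose(w, w_min_norm, atol=1e-6))
print(np.linalg.norm(w), np.linalg.norm(w_other))
```

Both `w` and `w_other` fit the training data exactly; gradient descent simply never moves in the null-space direction, so it lands on the smaller-norm one.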

Alternative Heuristics

Instead of explicit regularization, neural networks utilize many heuristics to prevent overfitting.

Early Stopping

The idea is simple: we stop before the model starts overfitting.

Although the name can be confusing, we do not have to guess in advance when to stop training.

Instead, we save the model at each epoch and roll back to the best saved model once performance on the validation set starts to degrade.

This way we do not have to retrain the model from scratch.
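
The checkpoint-and-roll-back idea can be sketched in a few lines. The function `fit_with_checkpointing` and its callables `train_one_epoch` and `validation_loss` are hypothetical placeholders for whatever training code is in use, and the optional `patience` argument is an addition for convenience, not something the description above requires.

```python
import copy

def fit_with_checkpointing(params, train_one_epoch, validation_loss,
                           max_epochs, patience=None):
    """Train for up to max_epochs and return the parameters with the
    lowest validation loss seen so far (hypothetical helper)."""
    best_params = copy.deepcopy(params)
    best_loss = validation_loss(params)
    epochs_since_best = 0

    for _ in range(max_epochs):
        params = train_one_epoch(params)
        loss = validation_loss(params)
        if loss < best_loss:
            best_loss = loss
            best_params = copy.deepcopy(params)   # save a checkpoint
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            # Optional: give up once there has been no improvement for a while.
            if patience is not None and epochs_since_best >= patience:
                break

    return best_params, best_loss

# Toy usage: "params" is just an epoch counter, and the validation loss is a
# made-up curve that falls until epoch 4 and then rises again.
toy_curve = [1.0, 0.8, 0.6, 0.5, 0.45, 0.5, 0.6, 0.7, 0.9, 1.2]
best_epoch, best_val = fit_with_checkpointing(
    params=0,
    train_one_epoch=lambda epoch: epoch + 1,
    validation_loss=lambda epoch: toy_curve[epoch],
    max_epochs=9,
)
print(best_epoch, best_val)
```

Keeping only the best checkpoint, as here, uses less storage than saving every epoch and gives the same final model.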

Ensemble Methods

We saw previously that taking the average or median of multiple models reduces variance and improves generalization.

We train models with different initializations and/or architectures, and ensemble their results.

Another ensemble method is to reuse some layers of a trained model and train only the last few layers.

We could use bagging on the training data (have each model train on a different bootstrap sample), but this is often unnecessary in deep learning. Training with different random initializations often suffices.
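
Combining the members of an ensemble is mechanically simple. The sketch below assumes each model is a callable that returns a NumPy array of predictions; `ensemble_predict` and `make_noisy_model` are made-up names, and the "models" are toy stand-ins (the true function plus a model-specific error) rather than trained networks.

```python
import numpy as np

def ensemble_predict(models, x, reduce="mean"):
    """Combine the predictions of several models by mean or median."""
    predictions = np.stack([model(x) for model in models])   # (n_models, ...)
    if reduce == "mean":
        return predictions.mean(axis=0)
    if reduce == "median":
        return np.median(predictions, axis=0)
    raise ValueError(f"unknown reduce: {reduce!r}")

def make_noisy_model(seed):
    """Toy stand-in for a network trained from a given random initialization."""
    def model(x):
        error = np.random.default_rng(seed).normal(scale=0.3, size=x.shape)
        return np.sin(x) + error
    return model

x = np.linspace(0.0, 2.0 * np.pi, 50)
models = [make_noisy_model(seed) for seed in range(10)]

ensemble_mse = np.mean((ensemble_predict(models, x) - np.sin(x)) ** 2)
member_mse = np.mean([np.mean((m(x) - np.sin(x)) ** 2) for m in models])
print(ensemble_mse, member_mse)
```

For squared error, the error of the averaged prediction is never larger than the average error of the individual members, and the gain grows as the members' errors become less correlated.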

Dropout

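Dropout is a technique where we randomly set the outputs of some neurons to zero during training.

At each training step we essentially turn off a random fraction of the neurons in each layer, so the network cannot rely too heavily on any single neuron. This has a regularizing effect. At test time all neurons are kept active.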
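
A minimal sketch of dropout on a layer's activations is shown below. It uses the common "inverted dropout" convention, in which the surviving activations are rescaled by `1 / (1 - drop_prob)` during training so that nothing needs to change at test time; that rescaling, and the function name `dropout`, are assumptions of this example rather than something stated above.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return activations                     # test time: all neurons active
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescale the kept activations so the expected value is unchanged.
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))

train_out = dropout(hidden, drop_prob=0.5, rng=rng, training=True)
test_out = dropout(hidden, drop_prob=0.5, rng=rng, training=False)
print(np.unique(train_out))   # only zeros and rescaled values remain
print(train_out.mean())       # close to the original mean of 1.0
```

A fresh mask is drawn on every call, so each training step effectively trains a different thinned network.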

Adding Noise / Data Augmentation

We can add noise to the input’s features or labels, or even to the model’s weights.

Another way is to augment the data by applying random transformations to the inputs and adding the transformed copies to the training set.

Both act as a form of regularization.
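
Both ideas can be sketched with NumPy alone. The example assumes images stored as a `(batch, height, width)` array; the helper names, the horizontal flip as the transformation, and the noise level are illustrative choices, not recommendations.

```python
import numpy as np

def add_gaussian_noise(x, std, rng):
    """Return a copy of x with zero-mean Gaussian noise added to every feature."""
    return x + rng.normal(scale=std, size=x.shape)

def random_horizontal_flip(images, rng, flip_prob=0.5):
    """Flip each image in a (batch, height, width) array left-right
    with probability flip_prob."""
    flip = rng.random(images.shape[0]) < flip_prob
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1]
    return out

def augment_dataset(images, labels, rng, noise_std=0.05):
    """Append one randomly transformed copy of every example to the dataset."""
    transformed = add_gaussian_noise(random_horizontal_flip(images, rng), noise_std, rng)
    return np.concatenate([images, transformed]), np.concatenate([labels, labels])

rng = np.random.default_rng(0)
images = rng.random((6, 4, 4))
labels = np.arange(6)

aug_images, aug_labels = augment_dataset(images, labels, rng)
print(aug_images.shape, aug_labels.shape)
```

The transformations should be ones that leave the label unchanged; a left-right flip is fine for many natural images but not, for example, for handwritten digits where orientation carries meaning.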

Bayesian Approaches

Instead of searching for a single best set of parameters, we try to find the distribution of the model’s parameters given the data.

So instead of maximizing the likelihood $\Pr(y | x, \phi)$, we work with the posterior $\Pr(\phi | x, y)$.

However, since the number of parameters in deep learning models is huge, computing this posterior exactly is not computationally feasible. Approximations are used instead.
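
Written out with Bayes' rule, and reusing the prior $\Pr(\phi)$ from the probabilistic interpretation above, the posterior over the parameters is:

$$ \Pr(\phi | \{x_i, y_i\}) = \frac{\prod_{i=1}^N \Pr(y_i | x_i, \phi) \Pr(\phi)}{\int \prod_{i=1}^N \Pr(y_i | x_i, \phi) \Pr(\phi) \, d\phi} $$

A prediction for a new input (written here as $x^*$ with output $y^*$, symbols introduced only for this example) then averages over all parameter values, weighted by the posterior:

$$ \Pr(y^* | x^*, \{x_i, y_i\}) = \int \Pr(y^* | x^*, \phi) \Pr(\phi | \{x_i, y_i\}) \, d\phi $$

Both integrals run over the full parameter space, which is why they become intractable for large networks.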

Transfer Learning / Multi-Task Learning / Self-Supervised Learning

Reusing a model trained on one task and fine-tuning it for another, or training a single model on multiple tasks at once, pushes the model toward representations that generalize better.