Gradient Descent

To minimize a loss, we update the parameters in the direction opposite to the gradient, i.e. the rate of change of the loss with respect to the parameters.

The gradient points in the direction of steepest ascent; since we move in the opposite direction, the method is called gradient descent.

Table of contents
  1. (Full-Batch) Gradient Descent
    1. Useful Even for Linear Regression
  2. Convex Problems
  3. Stochastic Gradient Descent (SGD)
    1. Minibatch-SGD
    2. Epoch
    3. Convergence in SGD
  4. Momentum
    1. Nesterov Accelerated Momentum
  5. Normalized Gradient
  6. Adaptive Moment Estimation (Adam)

(Full-Batch) Gradient Descent

We calculate the sum of the gradients of the per-example losses with respect to the parameters:

\[\sum_{i=1}^N \frac{\partial \ell_i(\phi_t)}{\partial \phi}\]

And update the parameters with some learning rate $\alpha$:

\[\phi_{t+1} \leftarrow \phi_t - \alpha \sum_{i=1}^N \frac{\partial \ell_i(\phi_t)}{\partial \phi}\]
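As a minimal sketch (the quadratic toy loss, function names, and hyperparameter values here are illustrative assumptions, not from the text), a full-batch gradient descent loop might look like this:

```python
import numpy as np

def full_batch_gd(grad_fn, phi0, alpha=0.1, num_steps=100):
    """Full-batch gradient descent.

    grad_fn(phi) is assumed to return the gradient of the total loss
    (the sum over all N examples) with respect to the parameters phi.
    """
    phi = phi0.copy()
    for _ in range(num_steps):
        phi = phi - alpha * grad_fn(phi)  # step opposite to the gradient
    return phi

# Toy example: minimize the convex loss ||phi - target||^2.
target = np.array([3.0, -1.0])
grad_fn = lambda phi: 2.0 * (phi - target)
print(full_batch_gd(grad_fn, np.zeros(2)))  # approaches [3., -1.]
```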

Useful Even for Linear Regression

Some linear regression problems have closed-form solutions.

However, there are some cases where no inverse of $X^TX$ exists. In such cases, we can use gradient descent.

Even when a closed-form solution exists, computing the inverse of $X^TX$ becomes computationally expensive when the number of features $p$ of $X$ is very large.

In such cases, gradient descent is more efficient.
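As a hedged illustration with synthetic data (the data, variable names, and step size are assumptions), gradient descent on the squared-error loss should recover roughly the same weights as the normal equation when $X^TX$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

# Closed-form solution (requires X^T X to be invertible).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the sum of squared errors.
w = np.zeros(3)
alpha = 0.001
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y)  # gradient of sum_i (x_i^T w - y_i)^2
    w = w - alpha * grad

print(w_closed, w)  # the two estimates should be close
```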


Convex Problems

A convex problem is one where the objective function, i.e. the loss function, is convex.

In such cases, gradient descent is guaranteed to find the global minimum.

If the objective is non-convex, gradient descent may get stuck in a local minimum.

For non-convex problems, gradient descent is sensitive to the initialization of the parameters.


Stochastic Gradient Descent (SGD)

To get out of local minima, we add some randomness or noise to the gradient steps.

The easiest way to achieve this is to update the parameters using the loss of a single, randomly chosen data point. This is called stochastic gradient descent.

Since each step is no longer the locally optimal one, we take many more, noisier steps to reach the minimum, and this noise is what allows us to escape local minima.

  • Each update is cheaper to compute than in full-batch GD, but there is no guarantee that overall convergence is faster.

Minibatch-SGD

Minibatch-SGD is a compromise between full-batch GD and SGD.

We calculate the gradient on a random subset of size $B$ of the data:

\[\phi_{t+1} \leftarrow \phi_t - \alpha \sum_{i=1}^B \frac{\partial \ell_i(\phi_t)}{\partial \phi}\]
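A rough sketch of minibatch SGD (the function and variable names, toy data, and hyperparameters are assumptions); setting batch_size=1 recovers plain SGD, and batch_size=N recovers full-batch GD:

```python
import numpy as np

def minibatch_sgd(grad_i, phi0, num_examples, batch_size=32,
                  alpha=0.01, num_epochs=10, seed=0):
    """Minibatch SGD. grad_i(phi, idx) is assumed to return the summed
    gradient of the per-example losses for the examples indexed by idx."""
    rng = np.random.default_rng(seed)
    phi = phi0.copy()
    for _ in range(num_epochs):
        order = rng.permutation(num_examples)      # shuffle once per epoch
        for start in range(0, num_examples, batch_size):
            idx = order[start:start + batch_size]  # random subset of size B
            phi = phi - alpha * grad_i(phi, idx)
    return phi

# Toy usage: per-example loss ||phi - x_i||^2 for data points x_i.
data = np.random.default_rng(1).normal(loc=2.0, size=(500, 3))
grad_i = lambda phi, idx: 2 * (len(idx) * phi - data[idx].sum(axis=0))
print(minibatch_sgd(grad_i, np.zeros(3), num_examples=500))  # hovers near the data mean (~2.0)
```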

Epoch

An epoch refers to one pass through the entire dataset.

  • GD: 1 epoch = 1 update
  • SGD: 1 epoch = $N$ updates
  • Minibatch SGD: 1 epoch = $N/B$ updates
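For example, with $N = 10{,}000$ examples and $B = 100$, one epoch corresponds to a single full-batch update, 10,000 SGD updates, or 100 minibatch updates.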

Convergence in SGD

Stochastic gradient descent may not converge with a fixed learning rate, as it may keep oscillating around the minimum.

Therefore, it is common to decay the learning rate over time, unlike in full-batch GD.

For convex problems, SGD can still converge to the minimum even with a large, non-decaying step size, but convergence will be rather slow.

The learning rate $r_t$ should satisfy the Robbins-Monro conditions:

  • $\sum_{t=1}^\infty r_t = \infty$
  • $\sum_{t=1}^\infty r_t^2 < \infty$

One decaying learning-rate schedule that satisfies these conditions is:

\[r_t = \frac{C}{t}\]

for some constant $C$.
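This schedule satisfies both conditions because the harmonic series diverges while the sum of inverse squares converges:

\[\sum_{t=1}^\infty \frac{C}{t} = \infty, \qquad \sum_{t=1}^\infty \frac{C^2}{t^2} = \frac{C^2 \pi^2}{6} < \infty\]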


Momentum

Gradient descent considers the direction at the current time step.

Each update is based on the current optimal direction.

Momentum smooths the trajectory of the walk towards the minimum by keeping track of previous gradients as a weighted sum.

\[\begin{gather*} m_{t+1} \leftarrow \beta \cdot m_t + (1 - \beta) \sum_{i=1}^B \frac{\partial \ell_i(\phi_t)}{\partial \phi} \\[1em] \phi_{t+1} \leftarrow \phi_t - \alpha \cdot m_{t+1} \end{gather*}\]

$\beta \in [0, 1)$ controls the weight between the current and previous gradients. If $\beta$ is high, the path will be smoother because we place more weight on how we have been moving so far, so the updates are less likely to make sharp turns.
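A minimal sketch of a single momentum update (the function signature is an assumption; batch_grad stands in for the summed minibatch gradient in the formula above):

```python
def momentum_step(phi, m, batch_grad, alpha=0.01, beta=0.9):
    """One momentum update. m is the running weighted sum of past
    gradients; batch_grad is the summed minibatch gradient at phi."""
    m = beta * m + (1 - beta) * batch_grad  # blend history with the new gradient
    phi = phi - alpha * m                   # step along the smoothed direction
    return phi, m
```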

Nesterov Accelerated Momentum

[Figure: comparison of the update paths of standard momentum and Nesterov accelerated momentum.]

The original momentum still respects the current gradient direction when updating the parameters; its effect is just diluted a bit by the previous history.

Nesterov accelerated momentum places more focus on the predicted direction based on the previous momentum.

It first takes a step along the previous momentum, then calculates the gradient at that look-ahead point.

\[\begin{gather*} m_{t+1} \leftarrow \beta \cdot m_t + (1 - \beta) \sum_{i=1}^B \frac{\partial \ell_i(\phi_t - \alpha \cdot \beta \cdot m_t)}{\partial \phi} \\[1em] \phi_{t+1} \leftarrow \phi_t - \alpha \cdot m_{t+1} \end{gather*}\]

In the figure above, notice how Nesterov momentum follows the momentum direction until it reaches point 1, and only then adjusts the direction based on the gradient evaluated there.

Whereas the original momentum first updates the direction using the gradient at the current point, and only then moves along the direction blended with the previous momentum.
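A minimal sketch of one Nesterov update, mirroring the formula above (the function signature is an assumption; grad_fn is assumed to return the summed minibatch gradient at a given point):

```python
def nesterov_step(phi, m, grad_fn, alpha=0.01, beta=0.9):
    """One Nesterov accelerated momentum update."""
    lookahead = phi - alpha * beta * m               # move along the previous momentum first
    m = beta * m + (1 - beta) * grad_fn(lookahead)   # gradient at the look-ahead point
    phi = phi - alpha * m
    return phi, m
```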


Normalized Gradient

Gradients have different magnitudes for different parameters: along some parameter directions the loss is very steep, along others it is nearly flat.

This makes it difficult to choose a single learning rate that produces sensible updates for all parameters.

What we can do is normalize the gradients so that each step is of the same magnitude. This gives us just the direction of the gradient, and each step size will be consistently controlled by the learning rate.

We do so by dividing each component of the gradient by its magnitude:

\[\begin{gather*} m_{t+1} \leftarrow \sum_{i=1}^B \frac{\partial \ell_i(\phi_t)}{\partial \phi} \\[1em] v_{t+1} \leftarrow \sum_{i=1}^B \left(\frac{\partial \ell_i(\phi_t)}{\partial \phi}\right)^2 \\[1em] \phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon} \end{gather*}\]

The $\epsilon$ is added to prevent division by zero.

Normalization gives a good per-parameter direction, but because every step has the same magnitude, it fails to converge unless a step lands exactly on the minimum.
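A minimal sketch of one normalized-gradient update following the formula above (the shape convention for the per-example gradients is an assumption):

```python
import numpy as np

def normalized_grad_step(phi, per_example_grads, alpha=0.01, eps=1e-8):
    """per_example_grads has shape (B, D): one gradient row per example."""
    m = per_example_grads.sum(axis=0)             # summed gradient
    v = (per_example_grads ** 2).sum(axis=0)      # summed squared gradient
    return phi - alpha * m / (np.sqrt(v) + eps)   # per-parameter step of roughly fixed size
```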


Adaptive Moment Estimation (Adam)

Adam combines the ideas of momentum and normalized gradient.

We have two momentum coefficients $\beta$ and $\gamma$, one for the gradient and one for the squared gradient.

\[\begin{gather*} m_{t+1} \leftarrow \beta \cdot m_t + (1 - \beta) \sum_{i=1}^B \frac{\partial \ell_i(\phi_t)}{\partial \phi} \\[1em] v_{t+1} \leftarrow \gamma \cdot v_t + (1 - \gamma) \sum_{i=1}^B \left(\frac{\partial \ell_i(\phi_t)}{\partial \phi}\right)^2 \end{gather*}\]

Because $m$ and $v$ are initialized to zero, the estimates are biased towards zero early in training; we correct this bias:

\[\begin{gather*} \tilde{m}_{t+1} \leftarrow \frac{m_{t+1}}{1 - \beta^{t+1}} \\[1em] \tilde{v}_{t+1} \leftarrow \frac{v_{t+1}}{1 - \gamma^{t+1}} \end{gather*}\]

The effects of this correction are prominent at the beginning of training and diminish as $t$ increases.

Finally, we update the parameters:

\[\phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{\tilde{m}_{t+1}}{\sqrt{\tilde{v}_{t+1}} + \epsilon}\]
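A minimal sketch of a single Adam update (the signature and default hyperparameters are assumptions, and the second moment here uses the squared minibatch gradient rather than the per-example sum, as a simplification):

```python
import numpy as np

def adam_step(phi, m, v, t, batch_grad, alpha=0.001,
              beta=0.9, gamma=0.999, eps=1e-8):
    """One Adam update at step t (counting from t = 0, with m = v = 0)."""
    m = beta * m + (1 - beta) * batch_grad          # first-moment estimate
    v = gamma * v + (1 - gamma) * batch_grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta ** (t + 1))               # bias correction
    v_hat = v / (1 - gamma ** (t + 1))
    phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return phi, m, v
```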