Softmax

Table of contents
  1. To Probabilities
    1. Computational Complexity

To Probabilities

Converts a vector of logits into a vector of probabilities.

It exponentiates each entry of the vector, then normalizes the results so that they sum to 1.
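Concretely, for a logit vector $z \in \mathbb{R}^K$:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$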

  • Max: because the largest logit is amplified the most
  • Soft: because the smaller logits are suppressed but never driven exactly to zero.

    Think of it as a softened version of the max function; the sketch below makes this concrete.
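A quick numerical sketch (minimal NumPy; subtracting the max before exponentiating is the standard overflow guard, included here as an implementation detail rather than something from the note):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtract the max so the largest exponent is e^0 = 1 (avoids overflow);
    # the shift cancels out in the normalization.
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))
# [0.09003057 0.24472847 0.66524096]
# The largest logit (3) takes most of the mass, yet the smaller
# logits keep nonzero probability -- hence "soft" max.
```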

Computational Complexity

Let $K$ be the dimension of the input vector.

If $K$ is large, computing the normalizing sum over all $K$ entries at every iteration can be expensive: each softmax evaluation costs $O(K)$ exponentials and additions.
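To see the linear cost directly, here is a rough timing sketch (it assumes SciPy is available for `scipy.special.softmax`; absolute times will vary by machine):

```python
import time
import numpy as np
from scipy.special import softmax

# One softmax call does K exponentials plus a K-term sum: O(K) work.
for K in (10_000, 100_000, 1_000_000):
    z = np.random.randn(K)
    start = time.perf_counter()
    softmax(z)
    print(f"K = {K:>9,}: {time.perf_counter() - start:.5f} s")
```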

