RNN

See Language Model & n-gram for basics of language modeling and shortcomings of non-neural language models.

Table of contents
  1. Fixed-Window Neural Language Model
  2. Recurrent Neural Networks
    1. Word and Word Embedding
    2. Hidden State
    3. Output
  3. Advantages and Disadvantages
    1. Advantages
    2. Disadvantages
  4. Cross-Entropy Loss

Fixed-Window Neural Language Model

Recall that a language modeling task takes in a sequence of words and outputs a probability distribution over the next word.

With the same Markov assumption as in n-gram models, we can take a fixed-size window of words (their concatenated embeddings) as input, pass it through a hidden layer, and output a probability distribution over the next word with a softmax.

The setup is similar; the key difference is that an n-gram model must have seen the exact n-gram in the corpus, whereas a neural model generalizes better to unseen n-grams.

No sparsity problem as in n-gram models.

It is also more compact than n-gram models because we do not have to store the counts of n-grams.

However, the fixed-size window is still a limitation.
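
A minimal NumPy sketch of this forward pass; the vocabulary size, window size, dimensions, and weight names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
V, d_emb, d_hidden, window = 10_000, 64, 128, 4

E  = rng.normal(size=(d_emb, V), scale=0.1)                   # word embedding matrix
W  = rng.normal(size=(d_hidden, window * d_emb), scale=0.1)   # hidden-layer weights
b1 = np.zeros(d_hidden)
U  = rng.normal(size=(V, d_hidden), scale=0.1)                # output-layer weights
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

def fixed_window_lm(word_ids):
    """P(next word | the last `window` words), given their vocabulary ids."""
    e = np.concatenate([E[:, i] for i in word_ids])   # concatenated embeddings
    h = np.tanh(W @ e + b1)                           # hidden layer
    return softmax(U @ h + b2)                        # distribution over the vocabulary

probs = fixed_window_lm([12, 7, 345, 9])              # shape (V,), sums to 1
```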


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) can handle sequences of arbitrary length.

[Figure: RNN unrolled over timesteps]

At each timestep $t$, the RNN takes the word embedding at timestep $t$ and the hidden state from timestep $t-1$ as input.

The same weight matrices ($\boldsymbol{W}_h$, $\boldsymbol{W}_e$) are reused at every step.

Word and Word Embedding

$\boldsymbol{x}^{(t)}$ is the one-hot encoding of the word at timestep $t$.

We will obtain the word embedding $\boldsymbol{e}^{(t)}$ by multiplying $\boldsymbol{x}^{(t)}$ with a word embedding matrix $\boldsymbol{E}$:

$$ \boldsymbol{e}^{(t)} = \boldsymbol{E} \boldsymbol{x}^{(t)} $$
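
Since $\boldsymbol{x}^{(t)}$ is one-hot, multiplying by $\boldsymbol{E}$ simply selects one column of the embedding matrix. A tiny NumPy check (sizes are arbitrary):

```python
import numpy as np

V, d = 5, 3
E = np.arange(V * d, dtype=float).reshape(d, V)   # embedding matrix, one column per word

x = np.zeros(V); x[2] = 1.0                        # one-hot vector for word index 2
e = E @ x                                          # e^(t) = E x^(t)

assert np.allclose(e, E[:, 2])                     # same as directly selecting column 2
```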

Hidden State

$\boldsymbol{h}^{(0)}$ is the initial hidden state. It is usually initialized to zero.

And at each timestep $t$:

$$ \boldsymbol{h}^{(t)} = \tanh\left( \boldsymbol{W}_h \boldsymbol{h}^{(t-1)} + \boldsymbol{W}_e \boldsymbol{e}^{(t)} + \boldsymbol{b}_h \right) $$
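
A single recurrence step as a NumPy sketch, with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hidden = 64, 128                         # assumed sizes

W_h = rng.normal(size=(d_hidden, d_hidden), scale=0.1)
W_e = rng.normal(size=(d_hidden, d_emb), scale=0.1)
b_h = np.zeros(d_hidden)

def rnn_step(h_prev, e_t):
    """h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b_h)"""
    return np.tanh(W_h @ h_prev + W_e @ e_t + b_h)

h0 = np.zeros(d_hidden)                           # initial hidden state, all zeros
e1 = rng.normal(size=d_emb)                       # embedding of the first word
h1 = rnn_step(h0, e1)                             # hidden state after one step
```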

Output

The output at each step is a vector of probabilities over the vocabulary, in $\mathbb{R}^{|V|}$:

$$ \hat{\boldsymbol{y}}^{(t)} = \text{softmax}\left( \boldsymbol{W}_y \boldsymbol{h}^{(t)} + \boldsymbol{b}_y \right) $$
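
And the output layer as a self-contained NumPy sketch (vocabulary and hidden sizes are again assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_hidden = 10_000, 128                 # assumed vocabulary and hidden sizes

W_y = rng.normal(size=(V, d_hidden), scale=0.1)
b_y = np.zeros(V)

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

h_t = rng.normal(size=d_hidden)           # hidden state from the recurrence above
y_hat = softmax(W_y @ h_t + b_y)          # \hat{y}^(t), a vector in R^{|V|}

assert np.isclose(y_hat.sum(), 1.0)       # a valid probability distribution
```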


Advantages and Disadvantages

Advantages

  • Handle sequences of arbitrary length.
  • Model size doesn’t grow with input length.

Disadvantages

  • Computation is sequential, so it is slow and hard to parallelize across timesteps.
  • Information from many steps back is hard to preserve, so performance degrades on long sequences.

Cross-Entropy Loss

At each step $t$, the model outputs a distribution over the next word, $\hat{\boldsymbol{y}}^{(t)}$.

The loss function at step $t$ is the cross-entropy loss. The true distribution $\boldsymbol{y}^{(t)}$ is the one-hot encoding of the next word $\boldsymbol{x}^{(t+1)}$:

$$ \begin{align*} J^{(t)}(\theta) &= - \sum_{w \in V} y_w^{(t)} \log \hat{y}_w^{(t)} \\[1em] &= - \log \hat{y}_{x^{(t+1)}}^{(t)} \tag{$\boldsymbol{y}^{(t)}$ is one-hot} \end{align*} $$

Total loss is the average of all step losses:

$$ J(\theta) = \frac{1}{T} \sum_{t=1}^T J^{(t)}(\theta) $$
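
A minimal NumPy sketch of the per-step and averaged loss, using random stand-in predictions and next-word ids rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 10_000                               # assumed sequence length and vocabulary size

# Stand-in predictions: one probability distribution over the vocabulary per step.
logits = rng.normal(size=(T, V))
y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

next_word_ids = rng.integers(0, V, size=T)     # ids of x^(t+1) for t = 1..T

# J^(t) = -log \hat{y}^{(t)}_{x^{(t+1)}}, since y^(t) is one-hot.
step_losses = -np.log(y_hat[np.arange(T), next_word_ids])

J = step_losses.mean()                         # J(theta) = (1/T) * sum_t J^(t)
```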