Language Model & n-gram
Language Modeling
Language modeling is the task of predicting which word comes next.
We are modeling the probability distribution of:
\[P(w^{(t+1)} | w^{(1)}, w^{(2)}, \ldots, w^{(t)})\]
where each $w^{(i)}$ belongs to a vocabulary of size $V$.
A language model performs this task.
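To make this concrete, here is a minimal sketch (not from the original notes) of the interface a language model provides: given the words seen so far, it returns $P(w^{(t+1)} | w^{(1)}, \ldots, w^{(t)})$ for every word in the vocabulary. The function name, the toy vocabulary, and the placeholder uniform distribution are illustrative assumptions.

```python
from typing import Dict, List

# Sketch of the language-modeling interface: map a context (the words so far)
# to a probability distribution over the whole vocabulary for the next word.
# The uniform distribution below is only a placeholder model.
def next_word_distribution(context: List[str], vocab: List[str]) -> Dict[str, float]:
    return {w: 1.0 / len(vocab) for w in vocab}

# Example: a (placeholder) distribution for P(w | "the", "cat", "sat").
print(next_word_distribution(["the", "cat", "sat"], ["on", "mat", "the", "slept"]))
```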
n-gram Language Model
Before neural networks, the most common language model was the n-gram model.
n-gram
An $n$-gram is a consecutive sequence of $n$ words. Depending on the value of $n$, we have unigrams, bigrams, trigrams, 4-grams, etc.
We collect different n-grams from a corpus.
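As an illustration, here is a minimal sketch of collecting $n$-gram counts from a whitespace-tokenized toy corpus; the function name `collect_ngrams` and the example sentence are assumptions made for the example.

```python
from collections import Counter

# Count every consecutive window of n tokens in the corpus.
def collect_ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the cat sat on the mat and the cat slept".split()
bigrams = collect_ngrams(corpus, 2)   # e.g. ('the', 'cat') -> 2
trigrams = collect_ngrams(corpus, 3)  # e.g. ('the', 'cat', 'sat') -> 1
print(bigrams.most_common(3))
```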
Markov Assumption
Recall that the Markov assumption states that only the current state matters for predicting the next state; earlier states are irrelevant.
When word $w^{(t+1)}$ is the last word of some $n$-gram, the Markov assumption lets us make the simplification that its probability depends only on the previous $n-1$ words:
\[P(w^{(t+1)} | w^{(1)}, w^{(2)}, \ldots, w^{(t)}) \approx P(w^{(t+1)} | w^{(t-n+2)}, \ldots, w^{(t)})\]
By the definition of conditional probability, the right-hand side equals the probability of the $n$-gram divided by the probability of the $(n-1)$-gram:
\[P(w^{(t+1)} | w^{(t-n+2)}, \ldots, w^{(t)}) = \frac{P(w^{(t-n+2)}, \ldots, w^{(t+1)})}{P(w^{(t-n+2)}, \ldots, w^{(t)})}\]
Each probability on the right-hand side is estimated by counting how often the corresponding $n$-gram and $(n-1)$-gram occur in the corpus.
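Continuing the earlier sketch (reusing the hypothetical `collect_ngrams` and toy `corpus`), the count-based estimate can be written directly as the ratio of an $n$-gram count to the count of its $(n-1)$-word prefix:

```python
# Estimate P(next_word | context) as count(prefix + next_word) / count(prefix),
# where prefix is the last n-1 words of the context.
def ngram_prob(next_word, context, tokens, n):
    ngram_counts = collect_ngrams(tokens, n)
    prefix_counts = collect_ngrams(tokens, n - 1)
    prefix = tuple(context[-(n - 1):])
    if prefix_counts[prefix] == 0:
        return 0.0                      # sparsity: the (n-1)-gram was never seen
    return ngram_counts[prefix + (next_word,)] / prefix_counts[prefix]

# P(sat | the, cat) with a trigram model: count(the cat sat) / count(the cat) = 1/2.
print(ngram_prob("sat", ["the", "cat"], corpus, 3))  # 0.5
```

In practice the counts would be collected once and reused rather than recomputed per query; recomputing them here just keeps the sketch short.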
Sparsity Problem
A problem occurs when the required $n$-gram or $(n-1)$-gram never appears in the training corpus.
As $n$ grows larger, the probability of finding such $n$-grams decreases.
When the numerator is zero, we can apply smoothing: add a small $\delta$ to every count in the vocabulary so that no probability is exactly zero.
When the denominator is zero, we can use backoff: condition on a shorter $(n-k)$-gram context, which has a better chance of appearing in the training corpus. Both fixes are sketched below.
Larger $n$ requires exponentially more training data.
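The sketch below combines both fixes under the same assumptions as before (`collect_ngrams`, the toy `corpus`, and $\delta = 0.1$ are illustrative choices, not fixed standards): add-$\delta$ smoothing handles a zero numerator, and backing off to a shorter context handles a zero denominator.

```python
def smoothed_prob(next_word, context, tokens, n, vocab, delta=0.1):
    ngram_counts = collect_ngrams(tokens, n)
    prefix_counts = collect_ngrams(tokens, n - 1)
    prefix = tuple(context[-(n - 1):])            # last n-1 words of the context
    denom = prefix_counts[prefix]
    if denom == 0:                                # prefix never seen: back off
        if n == 2:
            return 1.0 / len(vocab)               # no shorter context left to try
        return smoothed_prob(next_word, context, tokens, n - 1, vocab, delta)
    # add delta to every count so unseen n-grams keep a small nonzero probability
    count = ngram_counts[prefix + (next_word,)]
    return (count + delta) / (denom + delta * len(vocab))

# P(slept | the, cat) with a trigram model, smoothing (and backing off if needed).
print(smoothed_prob("slept", ["the", "cat"], corpus, 3, set(corpus)))
```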
Storage Problem
Counting all $n$-grams and storing their counts requires a lot of storage, since the number of distinct $n$-grams (and thus the model size) grows with both $n$ and the corpus size.