Language Model & n-gram
Language Modeling
Language modeling is the task of predicting which word comes next.
We are modeling the probability distribution of:
\[P(w^{(t+1)} | w^{(1)}, w^{(2)}, \ldots, w^{(t)})\]
where each $w^{(i)}$ belongs to a vocabulary of size $V$.
A language model performs this task.
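To make this concrete, here is a minimal sketch (not from the original notes) of the interface a language model provides: given the words seen so far, it returns $P(w^{(t+1)} | w^{(1)}, \ldots, w^{(t)})$ for every word in the vocabulary. The function name, the toy vocabulary, and the placeholder uniform distribution are illustrative assumptions.

```python
from typing import Dict, List

# Sketch of the language-modeling interface: map a context (the words so far)
# to a probability distribution over the whole vocabulary for the next word.
# The uniform distribution below is only a placeholder model.
def next_word_distribution(context: List[str], vocab: List[str]) -> Dict[str, float]:
    return {w: 1.0 / len(vocab) for w in vocab}

# Example: a (placeholder) distribution for P(w | "the", "cat", "sat").
print(next_word_distribution(["the", "cat", "sat"], ["on", "mat", "the", "slept"]))
```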
n-gram Language Model
Before neural networks, the most common language model was the n-gram model.
n-gram
An $n$-gram is a consecutive sequence of $n$ words. Depending on the value of $n$, we have unigrams, bigrams, trigrams, 4-grams, etc.
We collect different n-grams from a corpus.
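As an illustration, here is a minimal sketch of collecting $n$-gram counts from a whitespace-tokenized toy corpus; the function name `collect_ngrams` and the example sentence are assumptions made for the example.

```python
from collections import Counter

# Count every consecutive window of n tokens in the corpus.
def collect_ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the cat sat on the mat and the cat slept".split()
bigrams = collect_ngrams(corpus, 2)   # e.g. ('the', 'cat') -> 2
trigrams = collect_ngrams(corpus, 3)  # e.g. ('the', 'cat', 'sat') -> 1
print(bigrams.most_common(3))
```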
Markov Assumption
Recall that the Markov assumption states that only the current state matters for predicting the next state; earlier states are irrelevant.
When word $w^{(t+1)}$ is the last word of some $n$-gram, the Markov assumption lets us make the simplification that its probability depends only on the previous $n-1$ words:
\[P(w^{(t+1)} | w^{(1)}, w^{(2)}, \ldots, w^{(t)}) \approx P(w^{(t+1)} | w^{(t-n+2)}, \ldots, w^{(t)})\]
By the definition of conditional probability, the right-hand side equals the probability of the $n$-gram divided by the probability of the $(n-1)$-gram:
\[P(w^{(t+1)} | w^{(t-n+2)}, \ldots, w^{(t)}) = \frac{P(w^{(t-n+2)}, \ldots, w^{(t+1)})}{P(w^{(t-n+2)}, \ldots, w^{(t)})}\]
Each probability on the right-hand side is estimated by counting how often the corresponding $n$-gram and $(n-1)$-gram occur in the corpus.
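Continuing the earlier sketch (reusing the hypothetical `collect_ngrams` and toy `corpus`), the count-based estimate can be written directly as the ratio of an $n$-gram count to the count of its $(n-1)$-word prefix:

```python
# Estimate P(next_word | context) as count(prefix + next_word) / count(prefix),
# where prefix is the last n-1 words of the context.
def ngram_prob(next_word, context, tokens, n):
    ngram_counts = collect_ngrams(tokens, n)
    prefix_counts = collect_ngrams(tokens, n - 1)
    prefix = tuple(context[-(n - 1):])
    if prefix_counts[prefix] == 0:
        return 0.0                      # sparsity: the (n-1)-gram was never seen
    return ngram_counts[prefix + (next_word,)] / prefix_counts[prefix]

# P(sat | the, cat) with a trigram model: count(the cat sat) / count(the cat) = 1/2.
print(ngram_prob("sat", ["the", "cat"], corpus, 3))  # 0.5
```

In practice the counts would be collected once and reused rather than recomputed per query; recomputing them here just keeps the sketch short.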
Sparsity Problem
A problem occurs when the required $n$-gram or $(n-1)$-gram never appears in the training corpus.
As $n$ grows larger, the probability of finding such $n$-grams decreases.
When the numerator is zero, we can apply smoothing: add a small $\delta$ to every count in the vocabulary so that no probability is exactly zero.
When the denominator is zero, we can use backoff: condition on a shorter $(n-k)$-gram context, which has a better chance of appearing in the training corpus. Both fixes are sketched below.
Larger $n$ requires exponentially more training data.
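The sketch below combines both fixes under the same assumptions as before (`collect_ngrams`, the toy `corpus`, and $\delta = 0.1$ are illustrative choices, not fixed standards): add-$\delta$ smoothing handles a zero numerator, and backing off to a shorter context handles a zero denominator.

```python
def smoothed_prob(next_word, context, tokens, n, vocab, delta=0.1):
    ngram_counts = collect_ngrams(tokens, n)
    prefix_counts = collect_ngrams(tokens, n - 1)
    prefix = tuple(context[-(n - 1):])            # last n-1 words of the context
    denom = prefix_counts[prefix]
    if denom == 0:                                # prefix never seen: back off
        if n == 2:
            return 1.0 / len(vocab)               # no shorter context left to try
        return smoothed_prob(next_word, context, tokens, n - 1, vocab, delta)
    # add delta to every count so unseen n-grams keep a small nonzero probability
    count = ngram_counts[prefix + (next_word,)]
    return (count + delta) / (denom + delta * len(vocab))

# P(slept | the, cat) with a trigram model, smoothing (and backing off if needed).
print(smoothed_prob("slept", ["the", "cat"], corpus, 3, set(corpus)))
```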
Storage Problem
Counting all $n$-grams and storing their counts requires a lot of storage, since the number of distinct $n$-grams (and thus the model size) grows with both $n$ and the corpus size.