Language Model & n-gram

Table of contents
  1. Language Modeling
  2. n-gram Language Model
    1. n-gram
    2. Markov Assumption
    3. Sparsity Problem
    4. Storage Problem

Language Modeling

Language modeling is the task of predicting what word comes next.

We are modeling the probability distribution of:

$$P(w^{(t+1)} \mid w^{(1)}, w^{(2)}, \ldots, w^{(t)})$$

where each $w^{(i)}$ belongs to a vocabulary of size $V$.

A language model performs this task.


n-gram Language Model

Before neural networks, the most common language model was the n-gram model.

n-gram

An n-gram is a consecutive sequence of n words. Depending on the value of n, we have unigrams, bigrams, trigrams, 4-grams, etc.

We collect different n-grams from a corpus.
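As a minimal sketch of collecting n-grams (the toy corpus, whitespace tokenization, and the helper name `extract_ngrams` are illustrative assumptions, not part of the original):

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Count every consecutive sequence of n tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; in practice the counts come from a large tokenized corpus.
tokens = "the students opened their books and the students opened their minds".split()

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(bigrams[("students", "opened")])  # 2
```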

Markov Assumption

Recall that the Markov assumption states that only the current state matters for predicting the next state; earlier states are irrelevant.

With the Markov assumption, we make the simplifying assumption that the probability of word $w^{(t+1)}$ depends only on the previous $n-1$ words:

$$P(w^{(t+1)} \mid w^{(1)}, w^{(2)}, \ldots, w^{(t)}) \approx P(w^{(t+1)} \mid w^{(t-n+2)}, \ldots, w^{(t)})$$

By the definition of conditional probability, this is the probability of an n-gram divided by the probability of the corresponding $(n-1)$-gram:

$$P(w^{(t+1)} \mid w^{(t-n+2)}, \ldots, w^{(t)}) = \frac{P(w^{(t-n+2)}, \ldots, w^{(t+1)})}{P(w^{(t-n+2)}, \ldots, w^{(t)})}$$

Each probability on the right-hand side is estimated by counting n-grams in the corpus, so the estimate reduces to the count of the n-gram divided by the count of the $(n-1)$-gram.
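As a sketch of this count-based estimate (the function name `ngram_prob` and the toy corpus are illustrative assumptions):

```python
from collections import Counter

def extract_ngrams(tokens, n):
    # Same helper as in the sketch above.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(next_word, context, ngram_counts, context_counts):
    """Estimate P(next_word | context) as count(context + next_word) / count(context)."""
    denominator = context_counts[context]
    if denominator == 0:
        return 0.0  # the (n-1)-gram was never seen: the sparsity problem below
    return ngram_counts[context + (next_word,)] / denominator

tokens = "the students opened their books and the students opened their minds".split()
trigrams, bigrams = extract_ngrams(tokens, 3), extract_ngrams(tokens, 2)

# Trigram model (n = 3): condition on the previous n - 1 = 2 words.
print(ngram_prob("their", ("students", "opened"), trigrams, bigrams))  # 1.0
```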

Sparsity Problem

A problem occurs when the n-gram or $(n-1)$-gram never appears in the training corpus.

As n grows larger, the probability of finding such n-grams decreases.

When the numerator is zero, we can use smoothing: add a small $\delta$ to the count of every word in the vocabulary so that no probability is exactly zero.
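For example, add-$\delta$ smoothing (with $V$ the vocabulary size, as above) turns the count-based estimate into:

$$P(w^{(t+1)} \mid w^{(t-n+2)}, \ldots, w^{(t)}) \approx \frac{\mathrm{count}(w^{(t-n+2)}, \ldots, w^{(t)}, w^{(t+1)}) + \delta}{\mathrm{count}(w^{(t-n+2)}, \ldots, w^{(t)}) + \delta V}$$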

When the denominator is zero, we can use backoff: condition on even shorter $(n-k)$-grams, which have a better chance of appearing in the training corpus.
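A simplified sketch of backoff on the same toy corpus (this fallback scheme simply retries with shorter contexts; established variants such as Katz backoff also redistribute probability mass, which is omitted here):

```python
from collections import Counter

def extract_ngrams(tokens, n):
    # Same helper as in the sketches above.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_prob(next_word, context, counts_by_order):
    """Back off to shorter contexts until one has actually been seen in the corpus."""
    while context:
        denominator = counts_by_order[len(context)][context]
        if denominator > 0:
            return counts_by_order[len(context) + 1][context + (next_word,)] / denominator
        context = context[1:]  # drop the oldest word: use an (n-k)-gram instead
    unigrams = counts_by_order[1]
    return unigrams[(next_word,)] / sum(unigrams.values())  # last resort: unigram frequency

tokens = "the students opened their books and the students opened their minds".split()
counts_by_order = {n: extract_ngrams(tokens, n) for n in (1, 2, 3)}

# The bigram context ("closed", "their") never occurs, so the trigram model backs off
# to the context ("their",), which it has seen, and uses the bigram counts instead.
print(backoff_prob("books", ("closed", "their"), counts_by_order))  # 0.5
```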

Larger n requires exponentially more training data.

Storage Problem

Counting all n-grams and storing their counts requires a lot of storage: the model size grows with the number of distinct n-grams observed, which in turn grows with both the corpus size and n.
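As a rough, purely hypothetical illustration: with a vocabulary of $V = 10^5$ word types, there are $V^2 = 10^{10}$ possible bigrams and $V^3 = 10^{15}$ possible trigrams. Only a small fraction of these ever occur in a real corpus, but the table of observed counts still grows quickly as either the corpus or $n$ gets larger.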