FastText

Paper

Model based on Skip-Gram with negative sampling.

  • Fast training
  • Allows representation of out-of-vocabulary words

Co-occurrence methods, Word2Vec, etc. represent each word as a distinct vector with no parameter sharing, ignoring the internal structure of words.

For morphologically rich languages (e.g. with many verb conjugations), this is inefficient.

So we want to use character-level information and build it up to represent words.

Table of contents
  1. Bag of Character n-grams
  2. Score Function
  3. Loss Function

Bag of Character n-grams

Refer back to the n-gram model.

The key points are:

  • We want to model rare words better
  • We should incorporate morphological, character-level information

Each word $w$ is represented as a bag of character n-grams.

< and > are special boundary symbols marking the beginning and end of a word. They let us distinguish whether a sequence is a prefix, a suffix, or an interior substring.

The bag for a word contains the following:

  • (In practice) all $n$-grams for $n \in [3, 6]$
    • There may be other variations
  • The word itself (with special boundary symbols)

For example, for the word where with $n = 3$ only:

  • <wh, whe, her, ere, re>, and <where>
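A minimal sketch of this extraction, assuming the $n \in [3, 6]$ range mentioned above as the default (the function name and signature are illustrative, not from the paper):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the bag of character n-grams of `word`, wrapped in the
    boundary symbols < and >, plus the wrapped word itself."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # the word itself, with boundary symbols
    return grams

# n = 3 only, as in the example above:
print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```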

Score Function

Skip-Gram recap

Remember that in skip-gram, we associate each word with two vector representations:

  • Outside word embedding $\boldsymbol{u}_w$ (when $w$ is used as a context word)
  • Center word embedding $\boldsymbol{v}_w$ (when $w$ is used as a target word)

The score (input to softmax) for the skip-gram model is:

\[\boldsymbol{u}_c^\top \boldsymbol{v}_w\]

Where $\boldsymbol{u}_c$ is the outside word embedding for context word $c$, and $\boldsymbol{v}_w$ is the center word embedding for target word $w$.
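As a quick sanity check, here is a toy sketch of this score with two randomly initialized embedding matrices (the vocabulary, dimension, and values are purely illustrative):

```python
import numpy as np

# Toy sketch of the skip-gram score u_c^T v_w.
rng = np.random.default_rng(0)
vocab = {"where": 0, "are": 1, "you": 2}
dim = 4

U = rng.normal(size=(len(vocab), dim))  # outside (context) embeddings u_w
V = rng.normal(size=(len(vocab), dim))  # center (target) embeddings v_w

w, c = vocab["where"], vocab["you"]     # center word w, context word c
score = U[c] @ V[w]                     # u_c^T v_w
print(score)
```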

Let $\mathcal{G}_w$ be the set of all character $n$-grams for word $w$.

In FastText, we associate each $n$-gram $g \in \mathcal{G}_w$ with a vector representation $\boldsymbol{z}_g$.

The score is:

\[s(w, c) = \sum_{g \in \mathcal{G}_w} \boldsymbol{z}_g^\top \boldsymbol{u}_c\]
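A sketch of this score for the where example, using the $n = 3$ bag from the earlier section; the n-gram vectors and context vector are randomly initialized and purely illustrative:

```python
import numpy as np

# FastText score s(w, c) = sum over g in G_w of z_g^T u_c.
rng = np.random.default_rng(0)
dim = 4

G_w = ["<wh", "whe", "her", "ere", "re>", "<where>"]  # bag of n-grams for w
z = {g: rng.normal(size=dim) for g in G_w}            # one vector z_g per n-gram
u_c = rng.normal(size=dim)                            # context embedding u_c

score = sum(z[g] @ u_c for g in G_w)
print(score)
```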

Loss Function

For a positive pair of center word $w$ and context word $c$, the loss function (with negative sampling) is:

\[J(\theta) = - \log \sigma\left(s(w, c)\right) - \sum_{j=1}^k \log \sigma\left(-s(w, n_j)\right)\]

where $\sigma$ is the sigmoid function and $n_1, \dots, n_k$ are the negative samples drawn from the noise distribution.

Recall that skip-gram with negative sampling uses exactly the same loss function; only the score function differs.
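A minimal sketch of this loss for one positive pair, assuming the positive score $s(w, c)$ and the negative scores $s(w, n_j)$ have already been computed with the score function above (the numeric values are illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(s_pos, s_negs):
    # -log sigma(s(w, c)) - sum_j log sigma(-s(w, n_j))
    s_negs = np.asarray(s_negs)
    return -np.log(sigmoid(s_pos)) - np.sum(np.log(sigmoid(-s_negs)))

print(negative_sampling_loss(2.1, [-0.3, 0.5, -1.2]))  # k = 3 negative samples
```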