FastText
Model based on Skip-Gram with negative sampling.
- Fast training
- Allows representation of out-of-vocabulary words
Co-occurrence methods, Word2Vec, etc. represent each word as a distinct vector without parameter sharing, and ignore the internal structure of words.
For morphologically rich languages (e.g., with many verb conjugations), this is inefficient.
So we want to utilize character-level information and build it up to represent words.
Bag of Character n-grams
Refer back to the n-gram model.
The key points are:
- We want to model rare words better
- We should incorporate morphological, character-level information
Each word $w$ is represented as a bag of character n-grams.
< and > are special boundary symbols for the beginning and end of words. They allow us to distinguish whether a sequence is a prefix or a suffix.
The bag for a word contains the following:
- (In practice) all $n$-grams for $n \in [3, 6]$
- There may be other variations
- The word itself (with special boundary symbols)
For example, for the word where and $n=3$ only: <wh, whe, her, ere, re>, and the special sequence <where>.
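As a small illustration (not the official fastText implementation; the function name and the $n$ range are just the values described above), the sketch below builds the bag of character n-grams for a word, including the word itself with boundary symbols:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams for `word`, plus the whole word,
    using < and > as boundary symbols."""
    bounded = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(bounded) - n + 1):
            grams.add(bounded[i:i + n])
    grams.add(bounded)  # the word itself, with boundary symbols
    return grams

print(sorted(char_ngrams("where", n_min=3, n_max=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```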
Score Function
Skip-Gram recap
Remember that in skip-gram, we associated each word with two vector representations:
- Outside word embedding $\boldsymbol{u}_w$ (when $w$ is used as a context word)
- Center word embedding $\boldsymbol{v}_w$ (when $w$ is used as a target word)
The score (input to softmax) for the skip-gram model is:
\[\boldsymbol{u}_c^\top \boldsymbol{v}_w\]
where $\boldsymbol{u}_c$ is the outside word embedding for context word $c$, and $\boldsymbol{v}_w$ is the center word embedding for target word $w$.
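For concreteness, here is a minimal sketch of that score for one (center, context) pair, assuming the two embedding tables are plain NumPy matrices indexed by word id (all names and sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
U = rng.normal(size=(vocab_size, dim))  # outside embeddings u_w (one row per word)
V = rng.normal(size=(vocab_size, dim))  # center embeddings v_w (one row per word)

w, c = 3, 7          # ids of the center word w and the context word c
score = U[c] @ V[w]  # u_c^T v_w, the input to the softmax
print(score)
```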
Let $\mathcal{G}_w$ be the set of all character $n$-grams for word $w$.
In FastText, we associate each $n$-gram $g \in \mathcal{G}_w$ with a vector representation $\boldsymbol{z}_g$.
The score is:
\[s(w, c) = \sum_{g \in \mathcal{G}_w} \boldsymbol{z}_g^\top \boldsymbol{u}_c\]
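A minimal sketch of this score, with a toy explicit bag of n-grams for the word where (a real implementation hashes n-grams into a fixed number of buckets; the names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Toy bag G_w for "where" with n = 3, plus the whole word with boundaries.
G_w = ["<wh", "whe", "her", "ere", "re>", "<where>"]
ngram_id = {g: i for i, g in enumerate(G_w)}

Z = rng.normal(size=(len(ngram_id), dim))  # one vector z_g per n-gram
u_c = rng.normal(size=dim)                 # outside embedding u_c of the context word

# s(w, c) = sum_{g in G_w} z_g^T u_c = (sum of the n-gram vectors)^T u_c
s_wc = sum(Z[ngram_id[g]] for g in G_w) @ u_c
print(s_wc)
```

The sum of the n-gram vectors plays the role that the single center vector $\boldsymbol{v}_w$ played in skip-gram, which is exactly what lets FastText build representations for out-of-vocabulary words from their n-grams.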
Loss Function
For a positive pair of center word $w$ and context word $c$, the loss function (with negative sampling) is:
\[J(\theta) = - \log \sigma\left(s(w, c)\right) - \sum_{j=1}^k \log \sigma\left(-s(w, n_j)\right)\]
where $n_1, \dots, n_k$ are $k$ negative samples drawn from a noise distribution.
Remember that skip-gram (with negative sampling) had exactly the same loss function; only the score function is different.
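As a sketch of this objective for a single positive pair (the helper names are hypothetical; the score callable could be either the skip-gram dot product or the FastText n-gram sum above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(score, w, c, negatives):
    """J = -log sigma(s(w, c)) - sum_j log sigma(-s(w, n_j))."""
    loss = -np.log(sigmoid(score(w, c)))
    for n_j in negatives:
        loss -= np.log(sigmoid(-score(w, n_j)))
    return loss

# Toy usage: a random score table standing in for s(w, c).
rng = np.random.default_rng(0)
S = rng.normal(size=(10, 10))
print(neg_sampling_loss(lambda w, c: S[w, c], w=3, c=7, negatives=[1, 4, 9]))
```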