Distributional representations, i.e., representations obtained from statistical, count-based methods like TF-IDF and PPMI, are also based on the distributional hypothesis, just like Word2Vec.
However, statistical methods have some limitations:
Large matrix sizes due to large vocabularies
We could use techniques like SVD to reduce the dimensionality. But an SVD of an $n \times n$ matrix has a time complexity of $O(n^3)$, which is unrealistic for large vocabularies.
Also, SVD is a full-batch algorithm, while Word2Vec can be trained on mini-batches, allowing it to handle large datasets.
Word2Vec produces a distributed representation of words.
In a distributed representation, each word is represented by a $d$-dimensional dense vector.
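To make the contrast concrete, here is a minimal sketch of what a distributed representation looks like in memory. It assumes NumPy, and the names `vocab`, `d`, and `E` are just illustrative, not from any particular library:

```python
import numpy as np

# Toy vocabulary and embedding dimension (both made up for illustration).
vocab = {"i": 0, "believe": 1, "can": 2, "fly": 3}
d = 5

# Distributed representation: one dense d-dimensional row per word,
# instead of a |V|-wide sparse count vector per word.
E = np.random.randn(len(vocab), d) * 0.01

print(E[vocab["can"]])  # the dense vector for "can"
```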
Bag of Words Model
Word2Vec is a bag of words model.
This type of model ignores the position of words in a sentence. Whether a word appears at the beginning or the end of a sentence, the resulting word embedding will be the same.
Two Different Approaches
Word2Vec is a group of models used to produce word embeddings. It comes in two flavors:
Continuous Bag of Words (CBOW): predict the target word given the context words
Skip-Gram: predict the context words given the target word
They are both shallow neural networks with a single hidden layer.
Skip-Gram
The Skip-Gram model creates word embeddings by predicting the context words based on the target word.
It learns by guessing the surrounding words.
Slower to train than CBOW
Works well with infrequent words
Works well with smaller datasets
Let $t$ be the position of the target word, and let $m$ be the window size.
Then $w_t$ is the target word, and the context words are $w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}$.
We want to maximize the likelihood of the context words given the target words, over all positions $t$ in the corpus:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
Suppose we’ve preprocessed the corpus to obtain $D$, a set of (target, context) word pairs $(w, c)$.
Then the likelihood can be rewritten as:

$$L(\theta) = \prod_{(w, c) \in D} P(c \mid w; \theta)$$
The loss function (objective) is the negative log likelihood.
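As a sketch of what that preprocessing might look like (the helper `make_pairs` and its arguments are hypothetical names introduced here, not part of Word2Vec itself):

```python
def make_pairs(tokens, m):
    """Collect the set D of (target, context) pairs with window size m."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

D = make_pairs("i believe i can fly".split(), m=1)
# [('i', 'believe'), ('believe', 'i'), ('believe', 'i'), ('i', 'believe'), ('i', 'can'), ...]
```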
Probability of Context Given Target Word
We still need a way to model $P(c \mid w)$.
We will introduce two embedding vectors for each word $w$:
$v_w$ when $w$ is the target word
$u_w$ when $w$ is a context word
Implication
What this implies is that each word has two different representations depending on whether it is used as a target (center) word or a context word.
If we used a single vector for both roles, the probability of a word being a context word of itself would end up fairly high, because the dot product of a vector with itself, $v_w \cdot v_w = \|v_w\|^2$, is large.
But in reality, $P(\text{dog} \mid \text{dog})$, the probability of seeing "dog dog" in a sentence, should be very low.
The probability of a context word $c$ given a target word $w$ is modeled as a softmax:

$$P(c \mid w) = \frac{\exp(u_c \cdot v_w)}{\sum_{c' \in V} \exp(u_{c'} \cdot v_w)}$$

where $V$ is the entire vocabulary.
We are measuring the similarity between a context word and a target word with a dot product, and normalizing it over the entire vocabulary.
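A NumPy sketch of this softmax. Here `V_emb` holds the target vectors $v_w$ as rows and `U_emb` holds the context vectors $u_w$ as rows; both names and the word-to-index mapping are assumptions made for the sketch:

```python
import numpy as np

def p_context_given_target(w_idx, c_idx, V_emb, U_emb):
    """P(c | w) = exp(u_c . v_w) / sum over c' of exp(u_c' . v_w)."""
    scores = U_emb @ V_emb[w_idx]        # dot product of v_w with every context vector
    scores -= scores.max()               # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_idx]

# Usage: random toy embeddings for a 4-word vocabulary, d = 5.
rng = np.random.default_rng(0)
V_emb, U_emb = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
print(p_context_given_target(2, 3, V_emb, U_emb))  # e.g. P(fly | can)
```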
Gradient Descent
Our loss function is the negative log likelihood:

$$J(\theta) = -\sum_{(w, c) \in D} \log P(c \mid w) = -\sum_{(w, c) \in D} \log \frac{\exp(u_c \cdot v_w)}{\sum_{c' \in V} \exp(u_{c'} \cdot v_w)}$$

For a single pair $(w, c)$, the partial derivative of the loss function with respect to the target vector $v_w$ is:

$$\frac{\partial J_{w,c}}{\partial v_w} = -u_c + \sum_{c' \in V} P(c' \mid w)\, u_{c'}$$
Softmax Loss as Cross-Entropy Loss
Let $\hat{y}$ be the model prediction (the vector of softmax probabilities) for target word $w$. Then $\hat{y}_c$ is the entry corresponding to word $c$.
The single-pair loss function is:

$$J_{w,c} = -\log \hat{y}_c$$

The cross-entropy loss function is:

$$H(y, \hat{y}) = -\sum_{i \in V} y_i \log \hat{y}_i$$

where $y$ is the one-hot encoded vector for the context word $c$.
Every $y_i$ is 0 except for $y_c$, which is 1:

$$H(y, \hat{y}) = -\log \hat{y}_c$$
Therefore, the softmax loss function is the same as the cross-entropy loss function between the true label and the model prediction.
Let $\hat{y}_{c'} = P(c' \mid w)$ be the model prediction entry for word $c'$ given word $w$.
Let $y$ be the one-hot encoded vector for the context word $c$ (the true label). Since $y_{c'} = 1$ only when $c' = c$, we can rewrite the gradient above as:

$$\frac{\partial J_{w,c}}{\partial v_w} = \sum_{c' \in V} (\hat{y}_{c'} - y_{c'})\, u_{c'}$$

Since this is a linear combination of the columns of the matrix $U$ (the matrix whose columns are the context vectors $u_{c'}$), we can write the gradient as:

$$\frac{\partial J_{w,c}}{\partial v_w} = U(\hat{y} - y)$$

The implication of this gradient is that upon gradient descent, we update the target word vector $v_w$ by the difference between the predicted context word vector $U\hat{y}$ and the true context word vector $u_c = Uy$.
Minimizing this difference has the effect of bringing $v_w$ closer to the context word vector $u_c$.
We will skip the derivation of the gradient with respect to $u_c$. But the end result is similar: gradient descent will move the context word vector $u_c$ closer to the target word vector $v_w$.
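Putting the gradients together, here is a sketch of one full-softmax gradient-descent step for a single pair, under the same assumed layout as before (`V_emb` / `U_emb` rows are $v_w$ / $u_w$, so `U_emb.T` plays the role of the matrix $U$; the learning rate is arbitrary):

```python
import numpy as np

def skipgram_step(w_idx, c_idx, V_emb, U_emb, lr=0.025):
    """One gradient step on the full-softmax loss for a single (target, context) pair."""
    scores = U_emb @ V_emb[w_idx]
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                        # softmax prediction over the vocabulary
    y = np.zeros_like(y_hat)
    y[c_idx] = 1.0                              # one-hot true label

    grad_v = U_emb.T @ (y_hat - y)              # dJ/dv_w = U (y_hat - y)
    grad_U = np.outer(y_hat - y, V_emb[w_idx])  # dJ/du_c' = (y_hat_c' - y_c') v_w
    V_emb[w_idx] -= lr * grad_v                 # pulls v_w toward u_c
    U_emb -= lr * grad_U                        # pulls u_c toward v_w
```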
CBOW
```mermaid
block-beta
columns 10
block:group1
i1["i"] b["believe"] i2["?"] c["can"] f["fly"]
end
style group1 fill:#fee,stroke:#333,stroke-width:2px
style i2 fill:#fff,stroke:#333,stroke-width:1px,stroke-dasharray: 5, 5
style b fill:#f8b
style c fill:#f8b
```
The Continuous Bag of Words (CBOW) model creates word embeddings by predicting the target word based on the context words.
Essentially, the model learns by punching a hole and guessing the missing word.
It is faster to train than Skip-Gram
It works well with frequent words
Let $t$ be the position of the target word, and let $m$ be the window size.
Then $w_t$ is the target word, and the context words are $w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}$.
We want to maximize the likelihood of the target words given the context words:

$$L(\theta) = \prod_{t=1}^{T} P(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m})$$
The loss function (objective) is the negative log likelihood.
Probability of Target Given Context Words
Similar to Skip-Gram, we introduce two vectors for each word $w$:
$v_w$ when $w$ is the target word
$u_w$ when $w$ is a context word
to model the probability.
However, unlike Skip-Gram, we will now calculate the average of the context word vectors:

$$\bar{u} = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} u_{w_{t+j}}$$

Remember $m$ is the window size, so there are $2m$ context words.
The probability of a target word given the context words is:

$$P(w_t \mid w_{t-m}, \dots, w_{t+m}) = \frac{\exp(v_{w_t} \cdot \bar{u})}{\sum_{w' \in V} \exp(v_{w'} \cdot \bar{u})}$$
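A sketch of the CBOW forward pass under the same assumed layout as the earlier sketches (`U_emb` rows are context vectors, `V_emb` rows are target vectors):

```python
import numpy as np

def cbow_probs(context_idxs, V_emb, U_emb):
    """Average the 2m context vectors, then softmax against every candidate target word."""
    u_bar = U_emb[context_idxs].mean(axis=0)      # average context vector
    scores = V_emb @ u_bar                        # v_w' . u_bar for every word w'
    scores -= scores.max()                        # shift for numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # P(w_t | context) for every w_t
```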
Issue with Word2Vec
Word2Vec requires a softmax over the entire vocabulary, i.e., an output layer as wide as the vocabulary size.
Remember, the softmax normalizes over the entire vocabulary, so every prediction touches all $|V|$ scores.
Multi-class to Binary Classification
So we try to change the prediction from a multi-class classification problem to a binary classification problem.
In the context of Skip-Gram, our original task was to predict the context words given the target word:
We have the word can, and the model is expected to come up with a high probability for words like fly, etc. among a bunch of words.
Now we will change it to a binary classification problem that takes a target word and a context word pair and tries to predict the probability that the context word is indeed a context word of the target word.
So instead of selecting from a bunch of words, we are now just answering a yes/no question:
What is the probability that fly is a context word of can?
Basically, we’re cutting down a softmax to a sigmoid.
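In code terms, the softmax sketch from earlier collapses to a single dot product and a sigmoid; no sum over the vocabulary is needed. The names here are the same assumed ones as before:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_is_context(w_idx, c_idx, V_emb, U_emb):
    """Probability that word c is a real context word of word w: sigma(u_c . v_w)."""
    return sigmoid(U_emb[c_idx] @ V_emb[w_idx])   # one dot product, no |V|-wide normalization
```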
Negative Sampling
Shifting from a multi-class classification to a binary classification requires a new training set.
This is because our original training set only had positive examples (context words that are indeed context words).
For example, given part of a corpus i believe i can fly, the target word can, and window size of 1, the samples would be:
| Target | Context | Neighbors? |
| ------ | ------- | ---------- |
| can    | i       | yes (1)    |
| can    | fly     | yes (1)    |
We’re obviously not giving the model enough examples to say no. We need to generate negative examples as well.
Generating Negative Examples
We keep the target word.
For each positive sample, we generate $k$ negative samples. To achieve that, we randomly sample $k$ non-context words from the vocabulary (see the sketch after the table below).
| Target | Context | Neighbors? |
| ------ | ------- | ---------- |
| can    | i       | yes (1)    |
| can    | sky     | no (0)     |
| can    | fly     | yes (1)    |
| can    | day     | no (0)     |
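A sketch of generating the negative rows above. For simplicity it samples uniformly from the non-context words; the actual Word2Vec implementation draws negatives from a smoothed unigram distribution, and the vocabulary list here is made up:

```python
import random

def negative_samples(vocab, true_context, k):
    """Draw k words that are not in the true context window (uniform for simplicity)."""
    candidates = [w for w in vocab if w not in true_context]
    return random.sample(candidates, k)

vocab = ["i", "believe", "can", "fly", "sky", "day"]
print(negative_samples(vocab, true_context={"i", "fly", "can"}, k=2))
# e.g. ['sky', 'day'] -> labeled no (0)
```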
New Loss Function
To incorporate the negative samples, we need to make some changes to our loss function.
Still sticking to the single-pair loss function, the first thing we’re going to do is change the softmax to a sigmoid:

$$P(y = 1 \mid w, c) = \sigma(u_c \cdot v_w) = \frac{1}{1 + \exp(-u_c \cdot v_w)}$$
Cross-Entropy Again
Let $\hat{y} = \sigma(u_c \cdot v_w)$. Remember the cross-entropy loss function for binary classification:

$$J = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$

Our pair $(w, c)$ is a positive example, therefore $y = 1$:

$$J_{w,c} = -\log \hat{y} = -\log \sigma(u_c \cdot v_w)$$

Only the first term remains.
This is the loss for the positive example pair $(w, c)$. We need to incorporate the negative examples as well:

$$J_{w,c} = -\log \sigma(u_c \cdot v_w) - \sum_{i=1}^{k} \log \sigma(-u_{n_i} \cdot v_w)$$

where $n_1, \dots, n_k$ are the negative examples, and $k$ is the number of negative samples we generated for each positive sample.
Why?
The negation on each term is just to make the loss a minimization objective.
The sigmoid output is a probability. If you think about the curve of the sigmoid, you’ll remember that the output is near 1 when the input is large and positive, and near 0 when the input is large and negative.
So ideally, we’d want the dot product $u_c \cdot v_w$ to be large, since $(w, c)$ is a positive example.
On the other hand, for a negative example $(w, n_i)$, a large dot product $u_{n_i} \cdot v_w$ needs to act as a penalty.
Using the fact that the sigmoid is symmetric around 0, i.e. $\sigma(-x) = 1 - \sigma(x)$, we negate the dot product for the negative examples.
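Tying the pieces together, here is a sketch of the negative-sampling loss for one positive pair and its $k$ negatives, under the same assumed `V_emb` / `U_emb` layout as the earlier sketches (`neg_idxs` is a list or array of the negative words' indices):

```python
import numpy as np

def neg_sampling_loss(w_idx, c_idx, neg_idxs, V_emb, U_emb):
    """-log sigma(u_c . v_w) - sum_i log sigma(-u_{n_i} . v_w)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v_w = V_emb[w_idx]
    pos = -np.log(sigmoid(U_emb[c_idx] @ v_w))               # reward a high score for the true pair
    neg = -np.sum(np.log(sigmoid(-U_emb[neg_idxs] @ v_w)))   # penalize high scores for the negatives
    return pos + neg
```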