Word2Vec

Predictive models for learning word embeddings.

Our objective is to put relevant words close to each other in the vector space.

Table of contents
  1. Problem with Distributional Representation
  2. Bag of Words Model
  3. Two Different Approaches
  4. Skip-Gram
    1. Probability of Context Given Target Word
    2. Gradient Descent
  5. CBOW
    1. Probability of Target Given Context Words
  6. Issue with Word2Vec
    1. Multi-class to Binary Classification
  7. Negative Sampling
    1. Generating Negative Examples
    2. New Loss Function

Problem with Distributional Representation

Distributional representations, i.e. representations obtained from statistical, count-based methods like TF-IDF and PPMI, are based on the distributional hypothesis, just like Word2Vec.

However, statistical methods have some limitations:

  • Large matrix sizes due to large vocabularies

We could use techniques like SVD to reduce the dimensionality. But an SVD on an $n \times n$ matrix has a time complexity of $O(n^3)$, which is unrealistic for large vocabularies.

Also, SVD is a full-batch algorithm, while Word2Vec can be trained on mini-batches, allowing it to handle large datasets.

Word2Vec produces a distributed representation of words.

Distributional vs Distributed Representation


In distributed representation, each word is represented by some $d$-dimensional dense vector.


Bag of Words Model

Word2Vec is a bag of words model.

This type of model ignores the position of words in a sentence. Whether a word appears at the beginning or the end of a sentence, the resulting word embedding will be the same.


Two Different Approaches

Word2Vec is a group of models used to produce word embeddings. It comes in two flavors:

  • Continuous Bag of Words (CBOW): predict the target word given the context words
  • Skip-Gram: predict the context words given the target word

(Figure: the two Word2Vec model architectures.)

They are both shallow neural networks with a single hidden layer.
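
To make the shapes concrete, here is a rough numpy sketch of the parameters involved; all names (`W_in`, `W_out`, `vocab_size`, `embed_dim`) are made up for illustration, not taken from any reference implementation:

```python
import numpy as np

# A minimal sketch of the Word2Vec parameter shapes (illustrative names).
vocab_size = 10_000   # |V|
embed_dim = 300       # d

rng = np.random.default_rng(0)

# Input ("target") embeddings: one d-dimensional vector per word.
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
# Output ("context") embeddings: also one d-dimensional vector per word.
W_out = rng.normal(scale=0.01, size=(vocab_size, embed_dim))

# The "hidden layer" is just a lookup: the row of W_in for the input word.
word_id = 42
hidden = W_in[word_id]      # shape: (embed_dim,)
scores = W_out @ hidden     # one raw score per vocabulary word, shape: (vocab_size,)
```

The two matrices are the two embedding tables; which one is looked up and which one is scored against depends on whether we are doing Skip-Gram or CBOW.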


Skip-Gram

```mermaid
block-beta
  columns 10
  block:group1
  i1["i"] b["believe"] i2["?"] c["can"] f["?"]
  end
  style group1 fill:#fee,stroke:#333,stroke-width:2px
  style i2 fill:#fff,stroke:#333,stroke-width:1px,stroke-dasharray: 5, 5
  style f fill:#fff,stroke:#333,stroke-width:1px,stroke-dasharray: 5, 5
  style c fill:#f8b
```

The Skip-Gram model creates word embeddings by predicting the context words from the target word.

It learns by guessing the surrounding words.

  • Slower to train than CBOW
  • Works well with infrequent words
  • Works well with smaller datasets

Let $t$ be the position of the target word. Let $m$ be the window size.

Then $w_t$ is the target word, and context words are $w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}$.

We want to maximize the likelihood of the context words given the target words:

\[L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)\]

Suppose we’ve preprocessed the corpus to obtain $D$, a set of word-context pairs.

Then the likelihood can be rewritten as:

\[L(\theta) = \prod_{(w,c) \in D} P(c \mid w; \theta)\]

The loss function (objective) is the negative log likelihood.
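
Here is a rough sketch of how $D$ might be built by sliding a window of size $m$ over the corpus; the tokenization and the helper name `make_pairs` are hypothetical, just for illustration:

```python
def make_pairs(tokens, m=2):
    """Collect (target, context) pairs D with window size m (illustrative sketch)."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0:
                continue
            if 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

D = make_pairs("i believe i can fly".split(), m=1)
# e.g. [('i', 'believe'), ('believe', 'i'), ('believe', 'i'), ('i', 'believe'), ('i', 'can'), ...]
```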

Probability of Context Given Target Word

We still need a way to model $P(c \mid w; \theta)$.

We will introduce two embedding vectors for each word $w$:

  • $v_w$ when $w$ is the target word
  • $u_w$ when $w$ is a context word
Implication

What this implies is that each word has different representations depending on whether it is used as a target (center) word or a context word.

Otherwise, the probability of a word being a context word of itself would end up fairly high, because we score pairs with a dot product, and a vector's dot product with itself is large.

But in reality, $P(\text{dog} \mid \text{dog})$, the probability of seeing "dog dog" in a sentence, should be very low.

The probability of a context word $c$ given a target word $w$ is modeled as a softmax:

$$ P(c \mid w; \theta) = \frac{\exp(u_c^T v_w)}{\sum_{x \in V} \exp(u_x^T v_w)} $$

where $V$ is the entire vocabulary.

We are measuring the similarity between a context and a target word with a dot product, and normalizing it over the entire vocabulary.
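
As a sketch (toy sizes, made-up names), this is just a softmax over dot products:

```python
import numpy as np

def p_context_given_target(U, v_w):
    """P(. | w): softmax over the dot products u_x . v_w for every word x in V (sketch)."""
    scores = U @ v_w                      # shape: (|V|,)
    scores -= scores.max()                # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # shape: (|V|,), sums to 1

# Toy example with a 5-word vocabulary and 3-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # context vectors u_x as rows
v_w = rng.normal(size=3)      # target vector v_w
probs = p_context_given_target(U, v_w)
```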

Gradient Descent

Our loss function is the negative log likelihood:

\[J(\theta) = - \sum_{(w,c) \in D} \log P(c \mid w; \theta)\]

For a single pair $(w,c)$, the partial derivative of the loss function with respect to $v_w$ is:

\[\begin{align*} \frac{\partial J(\theta)}{\partial v_w} &= -\frac{\partial}{\partial v_w} \log P(c \mid w; \theta) \\ &= - \frac{\partial}{\partial v_w} \log \frac{\exp(u_c^T v_w)}{\sum_{x \in V} \exp(u_x^T v_w)} \\ &= - u_c + \sum_{x \in V} P(x \mid w; \theta) u_x \end{align*}\]

What?

\[\begin{align*} \frac{\partial}{\partial v_w} \log \frac{\exp(u_c^T v_w)}{\sum_{x \in V} \exp(u_x^T v_w)} &= \frac{\partial}{\partial v_w} \left( u_c^T v_w - \log \sum_{x \in V} \exp(u_x^T v_w) \right) \\ &= u_c - \frac{1}{\sum_{x \in V} \exp(u_x^T v_w)} \cdot \sum_{x' \in V} \exp(u_{x'}^T v_w) u_{x'} \\ &= u_c - \sum_{x \in V} P(x \mid w; \theta) u_x \end{align*}\]

As Cross-Entropy Loss

Let $\hat{y}$ be the model's prediction (the softmax output) for the target word $w$. Then $\hat{y}_x = P(x \mid w; \theta)$ is the entry corresponding to word $x$.

The single-pair loss function is:

\[J(\theta) = - \log P(c \mid w; \theta) = - \log \hat{y}_c\]

Cross-entropy loss function is:

\[- \sum_{x \in V} y_x \log \hat{y}_x\]

where $y$ is the one-hot encoded vector for the context word $c$.

Every $y_x$ is 0 except for $y_c$ which is 1:

\[- \sum_{x \in V} y_x \log \hat{y}_x = - \log \hat{y}_c\]

Therefore, the softmax loss function is the same as the cross-entropy loss function between the true label and the model prediction.
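
A tiny numeric check of that equivalence (the numbers are made up):

```python
import numpy as np

# Full cross-entropy with a one-hot label equals -log(y_hat_c).
y_hat = np.array([0.1, 0.7, 0.2])   # model's softmax output over a 3-word vocabulary
c = 1                               # index of the true context word
y = np.zeros(3)
y[c] = 1.0                          # one-hot label

full_ce = -(y * np.log(y_hat)).sum()
assert np.isclose(full_ce, -np.log(y_hat[c]))
```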

Recall the gradient for a single pair, where $\hat{y}_x$ is the model's predicted probability of word $x$ given the target word $w$:

\[- u_c + \sum_{x \in V} \hat{y}_x u_x\]

Let $y$ be the one-hot encoded vector for the context word $c$ (the true label). Since $y_x = 1$ only when $x = c$, we can rewrite above as:

\[\begin{align*} - \sum_{x \in V} y_x u_x + \sum_{x \in V} \hat{y}_x u_x &= \sum_{x \in V} (\hat{y}_x - y_x) u_x \end{align*}\]

Since this is a linear combination of the columns of the matrix $U$ (the matrix whose columns are the context vectors $u_x$), we can write the gradient as:

\[\frac{\partial J(\theta)}{\partial v_w} = U (\hat{y} - y)\]

The implication of this gradient is that gradient descent updates the target word vector $v_w$ using the difference between the predicted distribution $\hat{y}$ and the true one-hot label $y$.

Minimizing this difference has the effect of bringing $v_w$ closer to the context word vector $u_c$.

We will skip the derivation of the gradient with respect to $u_c$. But the end result is similar: gradient descent will update the context word vector $u_c$ closer to the target word vector $v_w$.
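
Putting the pieces together, here is a hedged numpy sketch of a single gradient-descent step for one $(w, c)$ pair, using the $U(\hat{y} - y)$ form derived above; the function and variable names are made up, and the update for the context matrix uses the standard result that $\partial J / \partial u_x = (\hat{y}_x - y_x) v_w$:

```python
import numpy as np

def sgd_step(V_target, U_context, w_idx, c_idx, lr=0.05):
    """One SGD step for a single (target w, context c) pair (illustrative sketch).

    V_target, U_context: arrays of shape (|V|, d) whose rows are v_w and u_x.
    """
    v_w = V_target[w_idx]
    scores = U_context @ v_w
    scores -= scores.max()                          # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax prediction over V

    y = np.zeros_like(y_hat)
    y[c_idx] = 1.0                                  # one-hot true context word

    grad_v = U_context.T @ (y_hat - y)              # = U (y_hat - y), since rows of U_context are u_x
    grad_U = np.outer(y_hat - y, v_w)               # row x is (y_hat_x - y_x) v_w

    V_target[w_idx] -= lr * grad_v
    U_context -= lr * grad_U                        # in-place update of all context vectors

rng = np.random.default_rng(0)
V_target = rng.normal(size=(5, 3))
U_context = rng.normal(size=(5, 3))
sgd_step(V_target, U_context, w_idx=3, c_idx=4)
```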


CBOW

```mermaid
block-beta
  columns 10
  block:group1
  i1["i"] b["believe"] i2["?"] c["can"] f["fly"]
  end
  style group1 fill:#fee,stroke:#333,stroke-width:2px
  style i2 fill:#fff,stroke:#333,stroke-width:1px,stroke-dasharray: 5, 5
  style b fill:#f8b
  style c fill:#f8b
```

Continuous Bag of Words (CBOW) model creates word embeddings by predicting the target word based on the context words.

Essentially, the model learns by punching a hole and guessing the missing word.

  • It is faster to train than Skip-Gram
  • It works well with frequent words

Let $t$ be the position of the target word. Let $m$ be the window size.

Then $w_t$ is the target word, and context words are $w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}$.

We want to maximize the likelihood of the target words given the context words:

\[L(\theta) = \prod_{t=1}^{T} P(w_t \mid w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}; \theta)\]

The loss function (objective) is the negative log likelihood.

Probability of Target Given Context Words

Similar to Skip-Gram, we introduce two vectors for each word $w$:

  • $v_w$ when $w$ is the target word
  • $u_w$ when $w$ is a context word

to model the probability.

However, unlike Skip-Gram, we will now calculate the average of the context word vectors:

\[\bar{u}_{context} = \frac{1}{2m} \sum_{-m \leq j \leq m, j \neq 0} u_{w_{t+j}}\]

Remember $m$ is the window size.

The probability of a target word $w$ given the context words is

$$ P(w \mid \text{context}; \theta) = \frac{\exp(v_w^T \bar{u}_{context})}{\sum_{x \in V} \exp(v_x^T \bar{u}_{context})} $$
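
A quick numpy sketch of that computation (toy sizes, made-up names):

```python
import numpy as np

def p_target_given_context(V_target, U_context, context_ids):
    """CBOW sketch: softmax over v_x . u_bar, where u_bar averages the context vectors."""
    u_bar = U_context[context_ids].mean(axis=0)   # average of the 2m context vectors
    scores = V_target @ u_bar
    scores -= scores.max()                        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()          # P(w | context) for every w in V

rng = np.random.default_rng(0)
V_target = rng.normal(size=(5, 3))    # target vectors v_w as rows
U_context = rng.normal(size=(5, 3))   # context vectors u_w as rows
probs = p_target_given_context(V_target, U_context, context_ids=[0, 1, 3, 4])
```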


Issue with Word2Vec

Word2Vec requires a softmax over an output layer whose size equals the vocabulary size.

If our vocabulary is huge (which is often the case), calculating the softmax becomes too expensive.

Remember, softmax requires a normalization over the entire vocabulary.

Multi-class to Binary Classification

So we try to change the prediction from a multi-class classification problem to a binary classification problem.

In the context of Skip-Gram, our original task was to predict the context words given the target word:

We have the word "can", and the model is expected to assign a high probability to context words like "fly", choosing among the entire vocabulary of words.

Now we will change it to a binary classification problem: given a (target word, context word) pair, predict the probability that the second word really is a context word of the first.

So instead of selecting from a bunch of words, we are now just answering a yes/no question:

What is the probability that fly is a context word of can?

Basically, we’re cutting down a softmax to a sigmoid.
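
In code, the change is small; a sketch with made-up vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Binary view: score a single (target, context) pair with one sigmoid,
# instead of normalizing over the entire vocabulary with a softmax.
rng = np.random.default_rng(0)
v_can = rng.normal(size=3)              # target vector for "can"
u_fly = rng.normal(size=3)              # context vector for "fly"
p_is_context = sigmoid(u_fly @ v_can)   # probability that "fly" is a context word of "can"
```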


Negative Sampling

Shifting from a multi-class classification to a binary classification requires a new training set.

This is because our original training set only had positive examples (context words that are indeed context words).

For example, given the corpus snippet "i believe i can fly", the target word "can", and a window size of 1, the samples would be:

| Target | Context | Neighbors? |
|--------|---------|------------|
| can    | i       | yes (1)    |
| can    | fly     | yes (1)    |

We’re obviously not giving the model enough examples to learn to say no. We need to generate negative examples as well.

Generating Negative Examples

We keep the target word.

For each positive sample, we generate $k$ negative samples. To achieve that, we randomly sample non-context words from the vocabulary.

| Target | Context | Neighbors? |
|--------|---------|------------|
| can    | i       | yes (1)    |
| can    | sky     | no (0)     |
| can    | fly     | yes (1)    |
| can    | day     | no (0)     |
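
A rough sketch of building such a dataset; the vocabulary, the helper `sample_negatives`, and the choice to sample uniformly are all just for illustration (this toy version only excludes the target and the current positive context word):

```python
import random

rng = random.Random(0)

def sample_negatives(target, positive_context, vocab, k=2):
    """Draw k random non-context words for one positive (target, context) pair (toy sketch)."""
    candidates = [w for w in vocab if w not in (target, positive_context)]
    return [(target, rng.choice(candidates), 0) for _ in range(k)]

vocab = ["i", "believe", "can", "fly", "sky", "day"]
positives = [("can", "i", 1), ("can", "fly", 1)]

dataset = []
for target, context, label in positives:
    dataset.append((target, context, label))                       # positive example
    dataset.extend(sample_negatives(target, context, vocab, k=2))  # k negatives per positive
```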

New Loss Function

To incorporate the negative samples, we need to make some changes to our loss function.

Still sticking to the single-pair loss function, the first thing we’re going to do is replace the softmax with a sigmoid:

\[J(\theta) = - \log \sigma(u_c^T v_w)\]
Cross-Entropy Again

Let $\hat{y} = \sigma(u_c^T v_w)$. Remember the cross-entropy loss function for binary classification:

\[- y \log \hat{y} - (1 - y) \log (1 - \hat{y})\]

Our pair $(w, c)$ is a positive example, therefore $y = 1$ and only

\[- \log \hat{y}\]

remains.

This is for the positive example pair $(w, c)$. We need to incorporate the negative examples as well:

$$ J(\theta) = - \log \sigma(u_c^T v_w) - \sum_{i=1}^{k} \log \sigma(-u_i^T v_w) $$

where the pairs $(w, i)$ are the negative examples and $u_i$ is the context vector of the $i$-th sampled word.

$k$ is the number of negative samples we generated for each positive sample.

Why?

The negation on each term is just to make the loss a minimization objective.

The sigmoid outputs a probability. If you think about the curve of the sigmoid, you’ll remember that the output is near 1 when the input is large and positive, and near 0 when the input is large and negative.

So ideally, we’d want the dot product $u_c^T v_w$ to be large since it’s a positive example.

On the other hand, when the dot product $u_i^T v_w$ is large, it needs to act as a penalty, since it’s a negative example.

Using the fact that $\sigma(-x) = 1 - \sigma(x)$ (the sigmoid is symmetric around the point $(0, 0.5)$), we negate the dot product for the negative examples, so a large dot product with a negative sample produces a large loss.
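
Finally, a small numpy sketch of this loss for one positive pair and its $k$ negative samples (made-up vectors and names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_w, u_c, U_neg):
    """J = -log sigma(u_c . v_w) - sum_i log sigma(-u_i . v_w)   (illustrative sketch)."""
    positive_term = -np.log(sigmoid(u_c @ v_w))
    negative_term = -np.log(sigmoid(-(U_neg @ v_w))).sum()
    return positive_term + negative_term

rng = np.random.default_rng(0)
d, k = 3, 2
v_w = rng.normal(size=d)          # target vector
u_c = rng.normal(size=d)          # positive context vector
U_neg = rng.normal(size=(k, d))   # k negative-sample context vectors
loss = neg_sampling_loss(v_w, u_c, U_neg)
```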

