WORD2VEC : CBOW AND SKIPGRAM
WORD2VEC
Word embeddings can be generated using various methods, such as neural networks, co-occurrence matrices, and probabilistic models. Word2Vec is a family of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer.
Bag of words model : The position and order of words are not important; the model makes the same predictions at each position. The bag of words technique is a method used in Natural Language Processing for text modeling: a representation of text that describes the occurrence of words within a document. It is called a "bag" of words because it disregards grammar and word order, treating the text as an unordered collection of words. It focuses solely on whether (and how often) known words appear in the document, not on where they appear.
There are two problems with this approach:
- The vector is very sparse (i.e. most values are 0)
- We lose contextual information (i.e. the order of the words in the sentence)
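A minimal sketch of the bag of words representation makes both problems concrete (the toy corpus below is illustrative, not from the article):

```python
# Each document becomes a count vector over the whole vocabulary.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    # Count how often each vocabulary word appears; word order is lost.
    words = doc.split()
    return [words.count(w) for w in vocab]

vectors = [bow_vector(d) for d in docs]
# For a realistic vocabulary of tens of thousands of words,
# almost every entry in these vectors would be 0 (sparse).
```

Note that "the cat sat" and "sat the cat" would map to the same vector, which is exactly the loss of word-order information described above.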
We want a model that gives a reasonably high probability estimate to all words that occur in the context.
How do we learn good vectors?
We have a cost function J(θ) that we want to minimize. Gradient Descent is an algorithm that minimizes the cost function by iteratively changing θ.
The idea is simple: from the current value of θ, calculate the gradient of J(θ) and take a small step in the direction of the negative gradient. The important part is selecting the step size: if it is too small, convergence takes too long; if it is too big, we may overshoot the minimum.

The algorithm of the gradient descent:
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)  # gradient over the whole corpus
    theta = theta - alpha * theta_grad                # step against the gradient
alpha is the step size.
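The loop above can be made runnable on a toy one-dimensional cost, J(theta) = (theta - 3)², whose gradient is 2(theta - 3). The function and the fixed iteration count are illustrative choices, not from the article:

```python
def evaluate_gradient(theta):
    # Gradient of the toy cost J(theta) = (theta - 3) ** 2
    return 2 * (theta - 3)

theta = 0.0
alpha = 0.1           # step size
for _ in range(100):  # fixed number of steps instead of `while True`
    theta = theta - alpha * evaluate_gradient(theta)
# theta is now very close to the minimizer, 3
```

Each step shrinks the error by a constant factor (here 0.8), which is why a too-small alpha converges slowly and a too-large one can overshoot.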
The problem is that J(θ) is a function of all windows in the corpus, so a single gradient update is very expensive. The solution is Stochastic Gradient Descent (SGD), which samples windows (or a small batch of them) and updates the parameters after each sample.
The algorithm of the stochastic gradient descent:
while True:
    window = sample_window(corpus)                    # sample one window (or a small batch)
    theta_grad = evaluate_gradient(J, window, theta)  # gradient on the sample only
    theta = theta - alpha * theta_grad
alpha is the step size.
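A runnable sketch of the same idea on a toy objective: the "corpus" is a small list of targets, each sampled "window" contributes its own cheap gradient, and the noisy updates still settle near the true minimizer (the mean). All names and values are illustrative:

```python
import random

random.seed(0)
corpus = [2.0, 3.0, 4.0]  # J(theta) = average of (theta - x) ** 2, minimized at the mean, 3
theta, alpha = 0.0, 0.05

for _ in range(2000):
    x = random.choice(corpus)   # sample one "window" instead of the whole corpus
    grad = 2 * (theta - x)      # gradient of (theta - x) ** 2 for that sample alone
    theta = theta - alpha * grad
# theta ends up hovering near 3.0, with some noise from the sampling
```

Each update here is a constant-time operation, whereas full gradient descent would touch every element of the corpus per step; that is exactly the trade-off motivating SGD for Word2Vec.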
In Word2Vec we basically have two vectors for each word: the center vector and the outside (context) vector. What about having only one vector per word? In practice, we average the two at the end to obtain a single vector for each word. There are two variants of Word2Vec: skip-gram and CBOW.
- Skip-gram predicts context words from a target word.
- CBOW predicts a target word from a set of context words.
SKIPGRAM
By leveraging the context in which words appear within a corpus, skip-gram enables us to create dense and meaningful vector representations that capture the nuances of word semantics. We model the skip-gram architecture as a deep learning classification model that takes the target word as input and tries to predict the context words.
This is slightly complex, since we have multiple words in our context. We simplify it by breaking down each (target, context_words) pair into (target, context) pairs such that each context consists of only one word. We then feed the skip-gram model pairs (X, Y), where X is the input and Y is the label:
- [(target, context), 1] pairs are positive samples, where target is our word of interest and context is a word occurring near it; the positive label 1 indicates a contextually relevant pair.
- [(target, random), 0] pairs are negative samples, where random is a randomly selected word from the vocabulary that has no association with the target; the negative label 0 indicates a contextually irrelevant pair.
This way the model learns which pairs of words are contextually relevant and which are not, and generates similar embeddings for semantically similar words.
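A sketch of how such labeled (target, context, label) pairs could be generated. The sentence comes from the CBOW example later in the article; the window size is an illustrative choice, and a real implementation would also avoid sampling true context words as negatives:

```python
import random

random.seed(42)
tokens = "word2vec has a deep learning model working in the backend".split()
vocab = sorted(set(tokens))
window = 2  # context words within 2 positions of the target

pairs = []
for i, target in enumerate(tokens):
    # Positive pairs: words within `window` positions of the target, label 1.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j], 1))
    # One negative pair: a random vocabulary word, label 0.
    # (Simplification: this may occasionally pick a real context word.)
    pairs.append((target, random.choice(vocab), 0))
```

The resulting list of triples is exactly the (X, Y) supervision described above: X is the (target, context) pair and Y is the 0/1 relevance label.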
The following schema shows how the algorithm works.
- The target and context words are passed to individual embedding layers, from which we get a dense word embedding for each of the two words.
- We then use a ‘merge layer’ to compute the dot product of these two embeddings and get the dot product value.
- This dot product value is then sent to a dense sigmoid layer that outputs a value between 0 and 1, the predicted probability that the pair is contextually relevant.
- The output is compared with the actual label and the loss is computed followed by backpropagation with each epoch to update the embedding layer in the process.
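The steps above can be sketched in numpy: two embedding tables, a dot product, a sigmoid score, and a manual gradient step for the binary cross-entropy loss. The vocabulary size, dimension, word indices, and iteration count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
W_target = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word embedding layer
W_context = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word embedding layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(t, c):
    # Dot product of the two dense embeddings, squashed into (0, 1).
    return sigmoid(W_target[t] @ W_context[c])

def sgd_step(t, c, label, alpha=0.1):
    # Gradient of binary cross-entropy w.r.t. both embeddings.
    g = score(t, c) - label
    grad_t = g * W_context[c]
    grad_c = g * W_target[t]
    W_target[t] -= alpha * grad_t
    W_context[c] -= alpha * grad_c

for _ in range(500):
    sgd_step(2, 5, 1)  # positive pair: push its score toward 1
    sgd_step(2, 7, 0)  # negative pair: push its score toward 0
```

After training, word 2's target embedding has been pulled toward word 5's context embedding and pushed away from word 7's, which is how semantically related words end up with similar vectors.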

CBOW
The CBOW architecture comprises a deep learning classification model in which we take in context words as input, X, and try to predict our target word, Y.
For example, consider the sentence "Word2Vec has a deep learning model working in the backend." We can form pairs of context words and target (center) words. With a context window of two words (one on each side of the target), we get pairs like ([deep, model], learning), ([model, in], working), ([a, learning], deep), etc. The deep learning model then tries to predict each target word from its context words.
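Extracting those ([context words], target) pairs can be sketched as follows (lowercased tokens, punctuation dropped, one context word on each side, matching the example pairs above):

```python
tokens = "word2vec has a deep learning model working in the backend".split()
window = 1  # one context word on each side of the target

pairs = []
for i, target in enumerate(tokens):
    # Words immediately before and after the target form its context.
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
    pairs.append((context, target))
```

Each pair is one training example for CBOW: the context list is the input X and the target word is the label Y.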
The following steps describe how the model works:
- The context words are first passed as an input to an embedding layer (initialized with some random weights) as shown in the Figure below.
- The word embeddings are then passed to a lambda layer where we average out the word embeddings.
- We then pass these averaged embeddings to a dense softmax layer that predicts our target word. We compare the prediction with the actual target word, compute the loss, and then perform backpropagation each epoch to update the embedding layer in the process.
Once training is complete, we can extract the embeddings of the words we need from the embedding layer.
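The CBOW forward pass described in the steps above can be sketched in numpy: look up the context embeddings, average them (the "lambda layer"), and apply a dense softmax layer over the vocabulary. The sizes, random initialization, and context indices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8
embeddings = rng.normal(scale=0.1, size=(vocab_size, dim))  # embedding layer
W_out = rng.normal(scale=0.1, size=(dim, vocab_size))       # dense output layer

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def predict(context_ids):
    avg = embeddings[context_ids].mean(axis=0)  # "lambda layer": average the context embeddings
    return softmax(avg @ W_out)                 # probability of each vocab word being the target

probs = predict([1, 4, 6, 7])  # hypothetical context word indices
```

Training would compare `probs` against the one-hot target word, compute the cross-entropy loss, and backpropagate into both `W_out` and `embeddings`; the rows of `embeddings` are the word vectors extracted at the end.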

References:
https://www.analyticsvidhya.com/blog/2021/07/word2vec-for-word-embeddings-a-beginners-guide/

