CONTEXTUALIZED EMBEDDING: BERT

INTRODUCTION TO BERT

CONTEXTUALIZED EMBEDDING

Let’s start with what ‘contextualized meaning’ actually means. I will explain it with an elementary example. The following two sentences contain the same word, ‘apple’.

I like apples.

I like Apple Macbooks.

Word embedding methods represent words as vectors. The word is the same in both sentences, but its meaning is different: a fruit in the first and a company in the second. Assigning the same vector to both occurrences ignores that difference in meaning. Contextualized language models instead let us represent the same word with different vectors when it carries different meanings.
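To make the difference concrete, here is a minimal toy sketch. The vectors, the lowercase tokens, and the mixing rule in `toy_contextual_embedding` are all made up for illustration; this is not how BERT computes its representations. It only shows that a fixed lookup returns the same vector for ‘apple’ in both sentences, while anything that mixes in the surrounding words does not.

```python
# Toy illustration only: the vectors and the mixing rule are invented,
# and BERT does not compute its representations this way.
STATIC_VECTORS = {
    "i": [0.1, 0.0],
    "like": [0.0, 0.2],
    "apple": [0.5, 0.5],
    "macbooks": [0.9, -0.3],
}


def static_embedding(tokens, index):
    """Look up a fixed vector; the surrounding words are ignored."""
    return list(STATIC_VECTORS[tokens[index]])


def toy_contextual_embedding(tokens, index):
    """Average the word's own vector with the mean vector of its sentence."""
    own = STATIC_VECTORS[tokens[index]]
    dims = len(own)
    mean = [
        sum(STATIC_VECTORS[t][d] for t in tokens) / len(tokens)
        for d in range(dims)
    ]
    return [(own[d] + mean[d]) / 2 for d in range(dims)]


# Simplified, lowercased versions of the two example sentences.
fruit_sentence = ["i", "like", "apple"]
brand_sentence = ["i", "like", "apple", "macbooks"]

static_fruit = static_embedding(fruit_sentence, 2)
static_brand = static_embedding(brand_sentence, 2)
contextual_fruit = toy_contextual_embedding(fruit_sentence, 2)
contextual_brand = toy_contextual_embedding(brand_sentence, 2)

print("static vectors equal:    ", static_fruit == static_brand)          # True
print("contextual vectors equal:", contextual_fruit == contextual_brand)  # False
```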

Bidirectional Encoder Representations from Transformers (BERT)… Okay, but what does it mean?

BIDIRECTIONAL

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional. This characteristic allows the model to learn the context of a word from all of its surroundings (both to the left and to the right of the word).
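One common way to picture this difference is as a visibility mask over positions. The sketch below is only an illustration; the helper `visibility_mask` is a hypothetical name, not part of any library. A 1 at row i, column j means the word at position i is allowed to look at the word at position j.

```python
def visibility_mask(num_tokens, bidirectional):
    """Return a matrix where mask[i][j] is 1 if position i may look at position j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


tokens = ["I", "like", "Apple", "Macbooks"]
left_to_right = visibility_mask(len(tokens), bidirectional=False)
both_directions = visibility_mask(len(tokens), bidirectional=True)

for name, mask in [("left-to-right", left_to_right), ("bidirectional", both_directions)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:10s} {row}")
```

In the left-to-right mask the row for ‘Apple’ is [1, 1, 1, 0], so ‘Macbooks’ is invisible to it. In the bidirectional mask the same row is [1, 1, 1, 1], so the word to the right can also inform its meaning.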

TRANSFORMER AND ENCODER

Transformers use an attention mechanism to observe relationships between words. Machine learning models need to learn to pay attention only to the things that matter and not waste computational resources processing irrelevant information. Transformers compute a different weight for each word, signaling which words in a sentence matter most for further processing.
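As a rough sketch of how such weights can be computed, the snippet below assumes the scaled dot-product form of attention: scores are dot products between a query vector and each key vector, divided by the square root of the vector size, then passed through a softmax. The query, key, and value vectors here are hand-picked for illustration; in a trained model they would come from learned projections of the word representations.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product scores between one query and every key, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dims = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]


# Hand-picked toy vectors, one key and one value per word.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights = attention_weights(query, keys)
output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The first key points in the same direction as the query, so it receives the largest weight, and the output is pulled most strongly toward the first value vector.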

The Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.

MASKED LANGUAGE MODELING AND NEXT SENTENCE PREDICTION

Many models predict the next word in a sequence, which is a directional approach. BERT instead uses two training strategies:

1-Masked Language Modeling (MLM)

MLM enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to use the words on both sides of the masked word to predict it. Humans actually do the same thing. Imagine you are listening to a lecture and you miss a word said by your professor. How would you predict the missed word?

‘blablablabla …… blablablabla.’

The correct answer is by checking both sides of the missed word. This is the same thing that BERT does.

During training, a random 15% of the tokens are hidden, and BERT’s job is to correctly predict the hidden tokens.
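The masking step can be sketched in a few lines. This is a simplified illustration, not BERT’s actual preprocessing: the function `mask_tokens` and its parameters are hypothetical, tokens are plain whitespace-separated words, and a fixed count of positions is masked so the result is easy to check. The original BERT recipe also sometimes replaces a selected token with a random token or leaves it unchanged, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens and remember what was hidden."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))

    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


sentence = (
    "the quick brown fox jumps over the lazy dog while the curious "
    "cat watches quietly from the old wooden fence"
)
tokens = sentence.split()  # 20 tokens, so 15% is 3 masked positions

masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The model sees only `masked` as input, and `labels` holds the answers it is trained to recover from the words on both sides of each `[MASK]`.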

2-Next Sentence Prediction

NSP (Next Sentence Prediction) is used to help BERT learn about relationships between sentences by predicting if a given sentence follows the previous sentence or not.

In training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT increase next sentence prediction accuracy.
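Building such a training set can be sketched as follows. The function `make_nsp_pairs` is a hypothetical helper written for this illustration. It alternates between correct and random pairs so that the split is exactly 50/50 on a small example, and it draws the random sentence from the same short text; a real pipeline would typically choose at random and draw from a much larger corpus.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (first, second, is_next) examples, half correct and half random."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    pairs = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if i % 2 == 0:
            # Correct pair: the sentence that really follows.
            pairs.append((first, sentences[i + 1], True))
        else:
            # Random pair: any sentence except the current one and the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((first, rng.choice(candidates), False))
    return pairs


sentences = [
    "I went to the store.",
    "I bought some apples.",
    "Then I walked home.",
    "At home I baked a pie.",
    "The pie tasted great.",
]

pairs = make_nsp_pairs(sentences)
for first, second, is_next in pairs:
    print(f"{is_next!s:5}  {first}  ->  {second}")
```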

References:

https://huggingface.co/blog/bert-101#2-how-does-bert-work

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270