A guide to building document embeddings - Part 2

By Stijn Goossens
March 26th, 2021

In Part 1 of this two-part blog post, we designed several document embedding approaches. In this part, we will define an evaluation metric and put all these approaches to the test. We will also introduce and evaluate some more advanced alternatives.

Designing the evaluation metric

Comparison with human-annotated data

A small recap: in Part 1 we built document embeddings with the aim of capturing the semantic relationships between professions and personal interests for VDAB’s career orientation test. In this part, we want to quantitatively evaluate the four approaches we have proposed in order to confirm that the improvements we made actually make sense.

In order to do a quantitative evaluation, we need labelled data. Luckily, we do have some labelled data at our disposal. For the very first version of VDAB’s career orientation test, the relationships between personal interests and professions were rated manually. This human-annotated dataset was built by VDAB’s domain experts and scores the relationships between professions and interests on a discrete scale of 0-3 (where 0 = no relationship and 3 = highly related).

The heatmap below shows a subset of these weights for the same personal interests and professions as shown in Part 1. As can be seen from this subset, the human-annotated dataset is not perfect. For example, “nurse” and “dietician” are rated as completely unrelated to “healthcare”, which is obviously not true. Nevertheless, most of the relationships do make sense and we will use them to evaluate our embeddings.

A subset of the human-annotated dataset of profession-interest relationships that will be used to evaluate our embeddings.

Using Spearman correlation as the evaluation metric

As the evaluation metric, we will use the Spearman correlation between the human-annotated scores and the cosine similarities between the document embeddings. In contrast to Pearson correlation, Spearman correlation does not measure the linear relationship between two variables but assesses whether the relationship is monotonic. In other words, it only compares the rankings of the two variables.

The graph below illustrates this with a small example. Consider the profession “Craftsman metalworker” and the eight personal interests that are part of the heatmap above. The x-axis shows the ranking of the eight interests according to humans. The y-axis shows the ranking according to the cosine similarities of the “spaCy + mean” (left) or the “FT_NB_SW + SIF” (right) method. The graph shows that the “FT_NB_SW + SIF” method ranks the interests in a similar way to our ground-truth data. Both rank “Working with my hands” and “Working with metal” on top, followed by “Repairing things” and the other interests. This results in a high Spearman correlation. In contrast, the ranking of the “spaCy + mean” method does not correspond very well to the ground-truth ranking and thus results in a low Spearman correlation. The evaluation metric reported in the next sections is the Spearman correlation computed over all profession-interest relationships.

Illustration of Spearman correlation.
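
To make the metric concrete, here is a minimal sketch of Spearman correlation in plain Python: rank both variables (tied values share the average of their positions, which matters for the discrete 0-3 human scores) and compute the Pearson correlation of the ranks. The scores and similarities below are made up for illustration and are not taken from the VDAB dataset; the function names are our own.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: expert scores (0-3) for eight interests of one profession,
# and cosine similarities produced by two imaginary embedding methods.
human_scores = [3, 3, 2, 1, 0, 0, 0, 0]
good_similarities = [0.81, 0.77, 0.64, 0.40, 0.22, 0.18, 0.25, 0.10]
poor_similarities = [0.30, 0.52, 0.41, 0.55, 0.48, 0.35, 0.60, 0.44]

print(round(spearman(human_scores, good_similarities), 3))
print(round(spearman(human_scores, poor_similarities), 3))
```

Because only the ranks matter, any monotonic rescaling of the cosine similarities leaves the score unchanged, which is convenient when comparing embedding methods whose similarities live on different scales.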

Performance of the document embedding approaches

As a refresher, the table below shows the approaches proposed in Part 1:

[Table: overview of the four document embedding approaches proposed in Part 1]

The graph below shows the performance of these four approaches. It shows that every improvement we made actually paid off. With an improvement of 98%, the difference between the final approach (FT_NB_SW + SIF) and the baseline (spaCy + mean) is substantial. Note that the seemingly small Spearman correlation of 0.27 can be considered a high value given that the human-annotated data isn’t of perfect quality.

Performance of the document embedding approaches presented in Part 1 of this blogpost.

Introducing some more advanced techniques

All of the former approaches start from word embeddings and pool these in a specific way to form document embeddings. This is a simple (yet powerful) way to embed documents. An alternative approach would be to use transformer-based models to embed complete sentences at once and average the resulting sentence embeddings. The transformer architecture was introduced in 2017 [1] and has been part of almost all leading NLP models since then, such as Google’s BERT and OpenAI’s GPT models. Unlike LSTM-based models (the previous standard in NLP), transformers can read a complete sentence at once. This allows them to learn powerful contextualised word embeddings. For example, a transformer-based model will give a different embedding to the word "bark" when it occurs in the context of trees than when it occurs in the context of dogs. With the approaches described in Part 1, both occurrences would result in the exact same word embedding. Another difference is that these transformer-based models take word order into account. The pooling strategies from Part 1 simply dump all the word embeddings together, thereby losing any information that is contained in the ordering of the words.
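
The loss of word order under pooling can be shown with a tiny sketch. The three-dimensional word vectors below are invented for illustration (real word embeddings have hundreds of dimensions), and plain mean pooling stands in for the pooling strategies of Part 1.

```python
# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man": [0.1, 0.2, 0.9],
}


def mean_pool(tokens, vectors):
    """Average the word vectors of the tokens, dimension by dimension."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


emb_a = mean_pool(["dog", "bites", "man"], toy_vectors)
emb_b = mean_pool(["man", "bites", "dog"], toy_vectors)

# Same bag of words, different meaning, yet the embeddings coincide
# (up to floating-point rounding), because averaging ignores word order.
same = all(abs(a - b) < 1e-12 for a, b in zip(emb_a, emb_b))
print(same)
```

A model that reads the sentence as a sequence can, at least in principle, tell these two sentences apart.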

In the sections below we introduce and evaluate two transformer-based models: Multilingual USE and RobBERT. We also add one model that is not transformer-based, but is nevertheless quite popular and was released in 2019: LASER. Since the training of these big models is computationally expensive, we will only use pre-trained models.

Multilingual Universal Sentence Encoder

Google’s Universal Sentence Encoder (USE) is a transformer-based sentence encoding model that is designed to be as generally applicable as possible [2]. Via multi-task learning, the model is supposed to perform well in various domains and can be used to encode sentences for several downstream tasks. Besides the transformer-based model, USE also offers a model based on a Deep Averaging Network (DAN). This DAN model trades slightly worse performance for a faster inference time. Multilingual USE for Semantic Retrieval (USE-M) is an extension of USE where the model is trained on multiple tasks across 16 languages, including Dutch [3]. Like USE, USE-M provides a transformer and a DAN implementation. For our tests, we used the transformer-based model [4].

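# Requirements:
# pip install "tensorflow_text>=2.0.0rc0"
# pip install --upgrade tensorflow_hub
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # Not used directly, but importing it registers the ops the model needs.

my_document = [
    "This is my document.",
    "It consists of multiple sentences.",
    "This is the third and final sentence.",
]

# Load the pre-trained transformer-based USE-M model [4].
use_m = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
# Embed every sentence, then average the sentence embeddings into one document embedding.
sentence_embs = use_m(my_document).numpy()
doc_emb = np.mean(sentence_embs, axis=0)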


RobBERT

Besides USE, Google released yet another transformer-based model in 2018, called BERT [5]. The BERT model is first pre-trained on two unsupervised tasks (Masked Language Modeling and Next Sentence Prediction). Afterwards, a task-specific head can be added on top of the architecture and, with this head, the model can be fine-tuned on a relatively small dataset. In 2019, Facebook released RoBERTa, which improves upon BERT’s training algorithm [6].

RobBERT is a RoBERTa model trained on a Dutch corpus and uses a Dutch-specific word tokenizer [7]. This language-specific model outperforms multilingual BERT models on various Dutch language tasks. For our purposes, we do not need to add a task-specific head on top of the RobBERT model. Instead, we feed individual sentences through the pre-trained model and average the resulting sentence embeddings per document. There are several possibilities for extracting the sentence embeddings. The BERT paper [5] mentions several options for doing so, like summing up all twelve layers or concatenating the last four layers of the model. For our sentence embeddings, we chose to take the last hidden state of the [CLS] token (written <s> in RoBERTa's vocabulary). This token is often used as a summary of the entire sentence for downstream classification tasks.

# Requirements:
# pip install transformers torch
import numpy as np
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer

model_path = "pdelobelle/robbert-v2-dutch-base"
tokenizer = RobertaTokenizer.from_pretrained(model_path)
# Load the RoBERTa config and ask the model to return the hidden states of all layers.
config = RobertaConfig.from_pretrained(model_path, output_hidden_states=True)

robbert = RobertaForMaskedLM.from_pretrained(model_path, config=config)

# my_document is the list of sentences defined in the USE-M example.
inputs = tokenizer(my_document, return_tensors="pt", padding=True)

# For every sentence, take the embedding of the first token of the final layer. Explanation of the indices:
# 1 - The first element of the model output contains the prediction logits, the second the hidden states.
# -1 - The hidden states of the last layer.
# : - All elements in the batch (the sentences).
# 0 - First token of every sentence -> the start token <s>, RoBERTa's counterpart of [CLS].
sentence_embs = robbert(**inputs)[1][-1][:, 0].detach().numpy()
doc_emb = np.mean(sentence_embs, axis=0)
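
For comparison, the alternative extraction strategies mentioned above can be sketched on a stand-in for the model's hidden states. The random arrays below only mimic the shapes we assume for a base-sized model with output_hidden_states=True (one embedding output plus twelve layers, hidden size 768); they are not real RobBERT activations, and we did not evaluate these variants.

```python
import numpy as np

# Stand-in for the hidden states: 13 arrays (embedding output + 12 layers),
# each of shape (batch, tokens, hidden). Random values, for shapes only.
rng = np.random.default_rng(0)
num_states, batch, tokens, hidden = 13, 3, 7, 768
hidden_states = [rng.normal(size=(batch, tokens, hidden)) for _ in range(num_states)]

# Option 1 (used in this post): first token of the last layer.
first_token_last_layer = hidden_states[-1][:, 0]

# Option 2: sum the twelve transformer layers (skipping the embedding output),
# then take the first token.
first_token_layer_sum = np.sum(hidden_states[1:], axis=0)[:, 0]

# Option 3: concatenate the last four layers along the hidden dimension,
# then take the first token.
first_token_last_four = np.concatenate(hidden_states[-4:], axis=-1)[:, 0]

print(first_token_last_layer.shape)
print(first_token_layer_sum.shape)
print(first_token_last_four.shape)
```

Whichever option is chosen, the result is one vector per sentence, so the averaging step that produces the document embedding stays the same. Note that concatenation makes the vectors four times as long.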


LASER

The last model we are going to evaluate is called LASER and was presented by Facebook in 2019 [8]. Unlike most of the other models released in 2018/2019, this one does not use transformers but uses the older BiLSTM architecture instead. The model, trained on parallel corpora, can be used to embed sentences in 93 (!) languages.

# Requirements:
# pip install laserembeddings
# python -m laserembeddings download-models
import numpy as np
from laserembeddings import Laser

laser = Laser()
sentence_embs = laser.embed_sentences(my_document, "nl")
doc_emb = np.mean(sentence_embs, axis=0)

Performance of the more advanced techniques

The graph below shows the performance of the USE-M, RobBERT and LASER approaches. Because inference with these three models is quite slow, we did not compute the document embeddings on all the available data. First, the data fed into the models was limited to 10 sentences per profession. As can be seen from the graph, this already showed some clear differences between the three models. LASER, the non-transformer model, performed worse than the transformer-based models, and there was a considerable gap between USE-M and RobBERT in favour of USE-M. Therefore, we decided to continue with USE-M only. Subsequently, we repeated the analysis by giving 100 and 1000 sentences to the USE-M model. As expected, this improved the performance, but the graph suggests that adding even more data will likely not be sufficient to reach the same level of performance as the “FT_NB_SW + SIF” approach.

Performance of more advanced document embedding techniques.
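
The sentence budget used in this experiment can be sketched as a small helper. This is a simplified illustration, not our exact pipeline: taking the first N sentences is an assumption made for this sketch, and fake_encoder is a hypothetical stand-in for a slow sentence encoder such as USE-M, RobBERT or LASER.

```python
import numpy as np


def embed_document(sentences, embed_sentences, max_sentences=10):
    """Embed at most `max_sentences` sentences and average them into one vector."""
    selected = sentences[:max_sentences]  # assumption: simply keep the first N
    sentence_embs = np.asarray(embed_sentences(selected), dtype=float)
    return sentence_embs.mean(axis=0)


def fake_encoder(sentences):
    """Hypothetical stand-in encoder: one 4-dimensional vector per sentence."""
    return [[len(s), s.count(" "), 1.0, 0.0] for s in sentences]


profession_sentences = [f"Sentence number {i} about a profession." for i in range(1000)]
for budget in (10, 100, 1000):
    doc_emb = embed_document(profession_sentences, fake_encoder, max_sentences=budget)
    print(budget, doc_emb.shape)
```

Keeping the encoder as a parameter makes it easy to swap models while holding the sentence budget fixed, so the comparison between models stays fair.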

Now, why do we see such a difference in performance between these more advanced models and the approaches presented in Part 1? The fact that we see the same issue with LASER as with the transformer-based models shows that the performance issue is not purely related to the transformer architecture, but rather to the sentence-level focus that these three models share. Because these models embed whole sentences at once, they tend to focus on other aspects of the text, like sentence length, writing style or sentiment. The semantics of the individual words blend into the whole. Clearly, this is not what we need. In our use case, we are very interested in word-level semantics and not so much in the information contained in the sentence structure.

Disclaimer: Note that we didn’t spend as much time on the transformer-based models as on the approaches described in Part 1. Therefore, the models described in this part are probably not performing to the fullest of their potential. For example, the performance of the RobBERT model could be improved by experimenting with different ways of extracting the sentence embeddings or by fine-tuning it on a specific task.


Conclusion

In Part 1 of this blog post we illustrated the following important elements in constructing document embeddings:

  • Use a domain-specific vocabulary (fastText)
  • Solve the out-of-vocabulary problem with subword embeddings (fastText)
  • Combine word embeddings in a smart way (SIF)
  • Add common-sense knowledge into the word embeddings (ConceptNet Numberbatch + a custom subword embedding model)

In the second part of this blog post, we saw that the strategies of Part 1 paid off and we drastically improved upon the performance of the baseline. Furthermore, we compared our approach with some more recent methods based on the transformer and LSTM architectures. The fact that these methods were outperformed by our earlier methods shows that sentence-level focus is not what we need. Thus, an important lesson is to not always go for the latest methods straight away. Instead, starting with a simple baseline and improving from there can give you better insights and control over what you need and how to get there.


References

[1] Attention Is All You Need

[2] Universal Sentence Encoder

[3] Multilingual Universal Sentence Encoder for Semantic Retrieval

[4] https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3

[5] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[6] RoBERTa: A Robustly Optimized BERT Pretraining Approach

[7] RobBERT: a Dutch RoBERTa-based Language Model

[8] Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
