A guide to building document embeddings - Part 1

By Stijn Goossens

March 12th, 2021



How we built document embeddings for VDAB’s new career orientation test

In this article, you will learn how to build meaningful document embeddings. We start with a simple baseline model and improve it over a few iterations, using one of our recent projects as a running example.

The context

VDAB’s new career orientation test

Recently we joined forces with VDAB’s AI team to rework their career orientation test. VDAB is the Flemish public employment service, and the goal of their orientation test is to suggest professions to jobseekers based solely on questions about their personal interests (e.g. being outdoors, interacting with children, programming, ...). More information about this use case can be found in our case study.

The algorithm behind the orientation test is responsible for the following two tasks:

1. Learn the relationships between professions and personal interests, in a data-driven way.
2. Use the relationships learned in task (1) to ask the most informative questions and suggest relevant professions based on the answers of the jobseekers.

In this blog post, we’ll focus on the first task. In the original orientation test, the relationships between the 600+ professions and 100+ personal interests were manually annotated by a team of domain experts. This process was very labour-intensive and hard to maintain over time as the job market changes. A data-driven approach removes the manual labeling and, at the same time, allows the relationships to evolve automatically over time, thereby following the job market. For example, if 3D printing became common in the construction of houses, we would want the relationship between "bricklayer" and "working with computers" to strengthen automatically, because operating a 3D printer requires some interest in computers.

What we need the document embeddings for

Our task is to learn meaningful relationships between the concepts of professions and personal interests. We will do this by first representing each concept by some text. The professions will be represented by the large sets of vacancies that VDAB collects. For the personal interests, there is much less data available and thus they will be represented by a small set of keywords only.

Next, these textual representations can be turned into a numerical representation called embeddings. The remainder of this post will focus on exactly this step. It is a crucial step because it turns the information that is understandable by humans (text) into a representation that is interpretable by computers (numbers). The resulting embeddings are vectors of typically 300 or more dimensions and they enable various calculations or can be used as such in downstream applications (e.g. as input in a classification model).

In our case, the embeddings will be used to compute the semantic similarity between any pair of concepts. By taking the dot product of the normalized embeddings we obtain a number that reflects this similarity and is called the cosine similarity. Ideally, we want the semantic similarity between any pair of concepts to be reflected in the cosine similarity of their respective embeddings. The similarity score needs to be high if the profession and interest are related (e.g. ‘gardener’ and ‘being outdoors’) and low otherwise (e.g. ‘accountant’ and ‘taking care of animals’). Afterwards, the similarities between the embeddings will be used further downstream in the career orientation test.

Orient document embeddings overview — The textual representations of professions and interests are embedded in document embeddings. The cosine similarities between the embeddings are used as a measure of semantic similarity.
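As a minimal sketch of the similarity computation described above, the snippet below normalizes two sets of embeddings and takes their dot products. The function name and the toy 3-dimensional vectors are hypothetical and only serve as an illustration; real embeddings would have 300 or more dimensions.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    # Guard against division by zero for all-zero embeddings.
    a_norms = np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_norms = np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    # The dot product of normalized vectors is the cosine similarity.
    return (a / a_norms) @ (b / b_norms).T


# Hypothetical toy embeddings: one row per concept.
profession_embs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
interest_embs = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

# Entry (i, j) is the similarity between profession i and interest j.
similarities = cosine_similarity_matrix(profession_embs, interest_embs)
print(similarities.round(3))
```

Because the vectors are normalized first, the length of an embedding does not influence the score; only its direction does.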

How we will embed documents

Creating meaningful word embeddings is a well-established technique in NLP (word2vec, GloVe, fastText and the more recent contextualised embeddings such as ELMo and BERT) and the resulting embeddings often carry a lot of meaning: similar words have similar embeddings. One way to illustrate this is that you can do some math with the embeddings, such as the classic example of “king” - “man” + “woman” ≈ “queen”.

In contrast to individual words, embedding larger pieces of text at once is a more complicated task. As there is no single best approach to create document embeddings, one has to experiment to find the best approach for the task. In this blog post, we’ll start from a simple baseline and iteratively improve the model from there.

One approach to building document embeddings consists of the following two steps:

1. Embed the individual words of the document.
2. Combine the word embeddings into a single document embedding (i.e. pooling).

We will follow this approach in the remainder of this post and iteratively improve one of these two components at a time. In the second part of this two-part article, we will explore alternative document embedding approaches.

Building document embeddings

1. Start with a baseline

In machine learning, it is always a good idea to start from a simple baseline, learn from the outcome and make incremental improvements. That’s exactly what we’ll do here. For the word embeddings component, we take spaCy’s pre-trained Dutch word embeddings [1]. The 500 000 word embeddings in the ‘nl_core_news_lg’ model were trained on a corpus of Dutch news and media documents. To pool the individual word embeddings into a document embedding, we simply average all the word embeddings of the document. This is called mean-pooling.

Summary of this approach:

Word embeddings: pre-trained Dutch spaCy embeddings
Pooling operation: mean-pooling

Spacy+mean document embeddings illustration — First, all words in the document are embedded separately. Afterwards, the word embeddings are combined into one document embedding.

The code below illustrates how, in a few operations, you can embed a document using spaCy word embeddings and mean-pooling.

# Requirements:
# pip install numpy
# pip install spacy
# python -m spacy download nl_core_news_lg

import numpy as np
import spacy

model = spacy.load("nl_core_news_lg")

document = "the cat sat on a mat"

# Step 1 - Embed all the words.
word_embs = np.array([model(word).vector for word in document.split()])
# Step 2 - Average the word embeddings into a single document embedding.
doc_emb = word_embs.mean(axis=0)

The heatmap below shows the cosine similarities for a subset of the professions and personal interests. The cosine similarities in this graph were computed using the document embeddings from the “spaCy + mean” approach. Ideally, we would like to see three clusters here because the graph includes three sets of related professions and interests (construction, IT and healthcare-related). There are high similarities between “working in healthcare”, “nurse” and “dietician”. We can also see a small block of high similarities for the IT-related concepts. Furthermore, the IT professions are not related to “working with metal” and “working with my hands”. All other similarities, however, are roughly equal, so our embeddings do not yet reflect the semantics of the concepts very well. We will replot this graph for our final document embedding approach. Hopefully, we will see clearer patterns there.

Spacy document embeddings — Cosine similarities between a subset of the professions and interests for the “spaCy + mean” approach.

2. Solve the out-of-vocabulary problem

There is one big flaw with the spaCy embeddings: there is no out-of-vocabulary (OOV) strategy. This means that spaCy cannot provide an embedding for every word. This is especially an issue in Dutch, because new words can be formed freely by joining existing words into compounds. For example, “contactvaardig”, “zaterdagwerk” and “softwareprogrammering” are all words that appear in our corpus, but for which spaCy cannot provide an embedding (actually, it does provide one, but it’s simply a zero vector). Furthermore, having an OOV strategy also helps to deal with spelling mistakes. It would be a shame to throw away all the information in misspelled words like "treinbestuurdern", "kooken" and "scrhijven", because they clearly do have a meaning.

So, let’s update the word embedding component of our document embedding approach to solve the OOV problem. We will replace the pre-trained spaCy embeddings with fastText [2] embeddings that are trained on our own corpus. FastText is an open-source library that allows you to easily train word embeddings from a given corpus. The code snippet below shows how you can do that yourself. For simplicity, we keep the default fastText settings, but know that you can change various parameters, such as the model type, the dimension of the embeddings and the learning rate, to tune your results.

# Requirements:
# git clone https://github.com/facebookresearch/fastText.git
# cd fastText
# sudo pip install .

import fasttext
import numpy as np

# Train fastText word embeddings and save them to disk.
model_path = "result/trained_fasttext_embeddings.bin"
model = fasttext.train_unsupervised("data/my_corpus.txt")
model.save_model(model_path)

# Use the learned fastText embeddings to embed a document.
# The model is loaded from the same path it was saved to.
my_document = "the cat sat on a mat"
model = fasttext.load_model(model_path)

# Step 1 - Embed all the words.
word_embs = np.array([
    model.get_word_vector(word)
    for word in my_document.split()
])
# Step 2 - Average the word embeddings into one document embedding.
doc_emb = word_embs.mean(axis=0)

Under the hood, fastText is an extension of word2vec. Word2vec [3] is a set of techniques (skip-gram and CBOW) that learn word embeddings based on the context in which the words occur. FastText extends word2vec by learning embeddings not only for the words themselves, but also for the subwords that make up the words. For example, while word2vec would only learn one embedding for “snow”, fastText would learn embeddings for the 3-grams “<sn”, “sno”, “now”, “ow>”, the 4-grams “<sno”, “snow”, “now>”, the 5-grams “<snow”, “snow>” and the 6-gram “<snow>”, where “<” and “>” are the characters representing the beginning and end of a word. With the learned n-grams, fastText can embed OOV words like “snowing”, “snowball” or the misspelled word “snoow”.
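The subword decomposition described above can be sketched in a few lines. This is a simplified illustration, not fastText's actual implementation (which additionally hashes the n-grams into a fixed number of buckets). The function name `char_ngrams` is our own; the n-gram lengths 3 to 6 match fastText's default settings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, including boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


snow_ngrams = char_ngrams("snow")
print(snow_ngrams)

# An unseen word can still be embedded through the n-grams it shares
# with words that were seen during training.
shared = set(snow_ngrams) & set(char_ngrams("snowing"))
print(sorted(shared))
```

The more n-grams an OOV word shares with known words, the more information its embedding can borrow from them.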

Another advantage of training our own fastText embeddings is that we now have a domain-specific vocabulary. For example, we now have embeddings for words like “stikster”, “tuinarbeider” and “flexijob”, which do not exist in the pre-trained spaCy model.

Summary of this approach:

Word embeddings: fastText embeddings trained on our own corpus
Pooling operation: mean-pooling

3. Combine the word embeddings in a smarter way

We have replaced our standard pre-trained word embeddings with domain-specific fastText embeddings that solve the OOV problem. Let’s now have a look at how we can improve the second component of our document embedding approach, the pooling component. What we have been doing so far was pretty naive. By averaging all the word embeddings together we implicitly give all the words the same importance. General words like “working”, “experience” and “knowledge” are very common words in our corpus (after removal of the stop words), but they are actually not informative about the content of a profession. Less frequent words like “plants”, “computers”, “hospital” are more informative in that sense. Thus, when pooling the word embeddings into a document embedding, we want to give less weight to the more general, frequent words.

Additionally, when we reflect on the meaning contained in the document embeddings we can see that they all have something in common: they are all vacancies and related to the job market. In that sense, all documents are quite similar, but that’s not the information we want the embeddings to capture. We would rather have embeddings that point out the differences between the vacancies.

Both of these elements are addressed by the smooth inverse frequency (SIF) pooling method [4]. The method consists of two steps. In the first step, each document embedding is computed as a weighted average of its word embeddings, where a word w with relative frequency p(w) receives the weight a / (a + p(w)). More frequent words thus receive a smaller weight. The word frequencies can be estimated from the domain-specific corpus. In the second step, once all the document embeddings have been computed, they are stacked into a matrix with the embeddings along the columns. The first singular vector of this matrix, which can be found with singular value decomposition (SVD), corresponds to the strongest component that all of the documents have in common. By subtracting from each document embedding its projection onto this singular vector, the differences between the documents become more apparent. SIF takes one parameter, a. In general, the default value of a=1e-3 works well, but if you have a proper evaluation metric you can tune this parameter to improve results.
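A minimal sketch of SIF pooling with NumPy is given below. It is not the exact code we used: the function names, the toy documents and the random word vectors are hypothetical and only serve as an illustration. Note that the sketch stores the document embeddings along the rows, so the common component is the first right singular vector.

```python
from collections import Counter

import numpy as np


def sif_weighted_average(documents, word_vectors, a=1e-3):
    """Step 1: frequency-weighted average of the word embeddings per document."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    # Frequent words get a small weight: a / (a + p(w)).
    weights = {word: a / (a + count / total) for word, count in counts.items()}
    return np.array([
        np.mean([weights[word] * word_vectors[word] for word in doc], axis=0)
        for doc in documents
    ])


def remove_common_component(doc_embs):
    """Step 2: subtract each embedding's projection on the first singular vector."""
    _, _, vt = np.linalg.svd(doc_embs, full_matrices=False)
    common = vt[0]
    return doc_embs - np.outer(doc_embs @ common, common)


# Hypothetical tokenized documents and random 5-dimensional word vectors.
documents = [
    ["working", "with", "plants"],
    ["working", "with", "computers"],
    ["working", "in", "hospital"],
]
rng = np.random.default_rng(0)
vocabulary = sorted({word for doc in documents for word in doc})
word_vectors = {word: rng.normal(size=5) for word in vocabulary}

weighted = sif_weighted_average(documents, word_vectors)
doc_embs = remove_common_component(weighted)
print(doc_embs.shape)
```

In practice, the word frequencies would be estimated on the full corpus rather than on the documents being embedded, and every document needs at least one word with a known vector.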

Summary of this approach:

Word embeddings: fastText embeddings trained on our own corpus
Pooling operation: SIF

4. Add common sense knowledge

Finally, we will improve the word embedding component one more time. The techniques we show in this section are based on our own research at Radix and consist of two steps. In the first step, we use transfer learning to enhance the fastText word embeddings with information from ConceptNet. Secondly, we learn subword embeddings from these new word embeddings.

STEP 1 - Aligning our word embeddings to ConceptNet

ConceptNet [5] is a knowledge graph that links together words and concepts via relationships. For example, “a house has a roof” where “has a” is the relationship between the two concepts “house” and “roof”. In addition to the extensive range of concepts and relationships, the knowledge graph is multilingual. It contains concepts from ten core and hundreds of other languages. ConceptNet Numberbatch is a set of word embeddings that originate from the ConceptNet graph [6, 7]. These embeddings are of high quality because they contain the common sense knowledge that is included in ConceptNet.

We can exploit the information contained in ConceptNet Numberbatch to improve our own domain-specific fastText embeddings. We will do this by learning a projection of the fastText embeddings (A) into the Numberbatch space (B). Thus, we want to learn a function f for which f(A) ~= B. We represent this function by a multilayer perceptron (MLP) with two hidden layers and train it with two loss functions at the same time: mean squared error and mean cosine distance. This way, both the Euclidean and the cosine distance between each projected fastText embedding and its Numberbatch counterpart are minimized.
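The combined training objective can be sketched as follows. This snippet only evaluates the loss with NumPy; it does not train an MLP. The function name and the equal weighting of the two terms are assumptions made for the illustration.

```python
import numpy as np


def alignment_loss(projected, target):
    """Mean squared error plus mean cosine distance between paired rows."""
    mse = np.mean((projected - target) ** 2)
    cosine_similarity = np.sum(projected * target, axis=1) / (
        np.linalg.norm(projected, axis=1) * np.linalg.norm(target, axis=1)
    )
    mean_cosine_distance = np.mean(1.0 - cosine_similarity)
    # Assumption: both terms are weighted equally.
    return mse + mean_cosine_distance


# Hypothetical toy targets in "Numberbatch space", one row per word.
target = np.array([[1.0, 0.0], [0.0, 1.0]])

print(alignment_loss(target, target))   # perfect projection
print(alignment_loss(-target, target))  # projection pointing the wrong way
```

The mean squared error pulls the projected vectors towards the right position, while the cosine term pulls them towards the right direction, which is what matters for the cosine similarities used downstream.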

The MLP is trained on the intersection of words that both the fastText embeddings and Numberbatch have in common (the blue words in the illustration below). After training the MLP, we can project the remaining fastText word embeddings into the Numberbatch space (the red words in the illustration below). By doing so we obtain high-quality embeddings for our domain-specific vocabulary.
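The split between the words used for training and the words that are projected afterwards can be sketched as below. To keep the example short and dependency-free, a linear least-squares projection is swapped in for the MLP, and all embeddings are hypothetical toy vectors: the "Numberbatch" vectors are generated from the "fastText" vectors through a hidden linear map, purely so that the example has something to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy embeddings: 4-dimensional "fastText", 3-dimensional "Numberbatch".
fasttext_embs = {
    word: rng.normal(size=4)
    for word in ["huis", "dak", "tuin", "boom", "muur", "flexijob"]
}
hidden_map = rng.normal(size=(4, 3))
numberbatch_embs = {
    word: fasttext_embs[word] @ hidden_map
    for word in ["huis", "dak", "tuin", "boom", "muur"]
}

# Train on the words both vocabularies have in common.
shared = sorted(fasttext_embs.keys() & numberbatch_embs.keys())
remaining = sorted(fasttext_embs.keys() - numberbatch_embs.keys())

A = np.array([fasttext_embs[word] for word in shared])
B = np.array([numberbatch_embs[word] for word in shared])

# Stand-in for the MLP: a linear map fitted with least squares, so that A @ projection ~= B.
projection, *_ = np.linalg.lstsq(A, B, rcond=None)

# Project the domain-specific words that Numberbatch does not know.
projected = {word: fasttext_embs[word] @ projection for word in remaining}
print(remaining, projected["flexijob"].shape)
```

A linear map is much less expressive than the two-layer MLP described above; it is only used here to show the data flow.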

STEP 2 - Learning new subword embeddings

Having obtained the new set of word embeddings, we have lost our ability to embed out-of-vocabulary words. Thus, once more we need to learn subword embeddings. Previously, the fastText algorithm started from a given corpus and learned both word and subword embeddings at the same time. Now, however, we already have the word embeddings. At Radix, we built a tool called Subwarp that takes word embeddings as input and subsequently learns the subword embeddings.

The main idea is that a word should be equal to the sum of its parts. This means that the sum of the embeddings of all the subwords that make up the word should be equal to the embedding of the word itself. More information about this can be found in an earlier blog post of ours.
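To make the "sum of its parts" idea concrete, here is a small sketch that is explicitly not Subwarp's actual algorithm. It assumes the subword embeddings are found by solving a least-squares problem in which the n-gram embeddings of each word must sum to that word's embedding. The word vectors are hypothetical random vectors and all names are our own.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, including the boundary markers < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Hypothetical word embeddings that we want to decompose into subwords.
rng = np.random.default_rng(0)
word_embs = {word: rng.normal(size=4) for word in ["snow", "ball", "rain"]}

words = sorted(word_embs)
ngrams = sorted({gram for word in words for gram in char_ngrams(word)})
ngram_index = {gram: j for j, gram in enumerate(ngrams)}

# membership[i, j] counts how often n-gram j occurs in word i.
membership = np.zeros((len(words), len(ngrams)))
for i, word in enumerate(words):
    for gram in char_ngrams(word):
        membership[i, ngram_index[gram]] += 1

# Solve membership @ subword_embs ~= word embeddings (minimum-norm solution).
targets = np.array([word_embs[word] for word in words])
subword_embs, *_ = np.linalg.lstsq(membership, targets, rcond=None)


def embed(word):
    """Embed any word as the sum of its known subword embeddings."""
    rows = [ngram_index[gram] for gram in char_ngrams(word) if gram in ngram_index]
    return subword_embs[rows].sum(axis=0)


print(np.allclose(embed("snow"), word_embs["snow"]))
print(embed("snowball"))  # an out-of-vocabulary word
```

With a realistic vocabulary, many words share n-grams, so the system can no longer be solved exactly and the subword embeddings become a compromise between all the words they occur in.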

Summary of this approach:

Word embeddings: fastText embeddings mapped to ConceptNet and subwords learned with Subwarp
Pooling operation: SIF

Let’s now return to the heatmap of cosine similarities. The difference with the heatmap of the “spaCy + mean” approach is clear. We can now clearly distinguish the three clusters of construction, IT and healthcare-related concepts. The areas outside these clusters are coloured very dark, which means that there is almost no connection between the concepts. This is an indication that our document embeddings capture the semantics of the concepts quite well.

Subwarper — Cosine similarities between a subset of the professions and interests for our final approach.

Stay tuned for more

In the second part of this blog post, we will put our proposed methods to the test. We will also share some negative results. These are other things we have tried, but which did not give the expected results. Spoiler: transformers are involved!