The vast majority of available data is unstructured: documents that contain information in many different forms and don’t follow any conventional data model.
This makes most of the data hard to use, prompting us to find new ways to manage and analyze it. Keyword extraction is one of those tools that enable organizations to more readily take advantage of unstructured data for business intelligence (BI) and analytics applications.
Let’s learn more about it together!
What is keyword extraction?
Keyword extraction is a text-processing task that automatically extracts the representative and characteristic words from a document, the words that express the key aspects of its content. As a result, these keywords provide a summary of the document. The technique uses natural language processing (NLP), a subfield of artificial intelligence (AI), to break down human language so that it can be understood and analyzed by machines.
Where would you use it?
Summarizing a text is an important problem in Text Mining (TM), Information Retrieval (IR), and NLP tasks like automatic semantic indexing, text and document clustering or classification, automatic summarization, document management, cross-category retrieval, constructing domain-specific dictionaries, named-entity recognition, topic detection, tracking, etc.
Imagine you want to analyze thousands of online documents about a specific topic. Keyword extraction allows you to sift through the entire set of data and get the words that best describe each document in seconds. That way, you can easily and automatically see what your documents are about, saving your teams hours and hours of manual processing.
In this article, we will go over the most commonly used methods that automate keyword extraction.
Here’s how you approach keyword extraction
There are both supervised and unsupervised keyword extraction methods. Unsupervised methods are the most popular because they do not need labeled training data and are domain-independent. Supervised methods, on the other hand, tend to achieve higher accuracy than most unsupervised ones when enough labeled data is available. In most use cases, however, no labeled data is available, so we will go over unsupervised and domain-independent methods.
The unsupervised keyword extraction methods can be split into three smaller groups: statistical, graph-based, and embedding-based methods.
One of the simplest ways to identify the main keywords and key phrases within a text is to use statistical models.
There are different types of statistical approaches, two of which are TF-IDF and YAKE. These approaches don’t require any labeled training data to extract the most important keywords in a text. However, because these models rely only on statistics, they may miss keywords or key phrases that are mentioned just once but should still be considered relevant.
TF-IDF (term frequency-inverse document frequency) is an example of a statistical model. TF-IDF evaluates how relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics: the number of times a word appears in a text (term frequency) and the inverse document frequency (how rare or common that word is in the entire data set). As a result, stopwords that occur throughout the entire data set receive a low TF-IDF score, while words that are frequent in one text but rare in the rest of the collection receive a high score. Because the score depends on frequency alone, however, TF-IDF can miss words that appear only once in a text yet are very important for understanding its content.
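To make the multiplication concrete, here is a minimal sketch of TF-IDF in plain Python. The whitespace tokenizer, the relative term frequency, and the unsmoothed logarithmic IDF are assumptions made for illustration; libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(tokens)                 # term frequency
            idf = math.log(n_docs / doc_freq[term])  # inverse document frequency
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
]
scores = tf_idf(docs)
# The three highest-scoring terms of the first document.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_terms)
```

In this toy collection, "the" occurs in every document, so its IDF and therefore its score is zero, while words that occur in only one document end up at the top.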
YAKE (Yet Another Keyword Extractor) is a keyword extraction method that uses multiple statistical features from a single document to extract keywords. In contrast to TF-IDF, this allows you to extract keywords without the need for a large dataset. It extracts keywords in five steps:
- Preprocessing and candidate term identification: the text is split into sentences, chunks (parts of a sentence separated by punctuation), and tokens. Afterwards, the text is cleaned and tagged, and stopwords are identified.
- Feature extraction: the algorithm computes the following five statistical features for terms (words) in the document:
  - Casing: counts how often a term appears capitalized or as an acronym. A significant term is usually capitalized more often.
  - Term position: the median position of the term’s sentences in the text. Terms used at the beginning of a text are usually more significant than terms used in the middle.
  - Term frequency normalization: measures the term’s frequency relative to the mean and spread of term frequencies in the document, so that high frequency alone does not dominate.
  - Term relatedness to context: measures the number of different terms that appear next to the candidate term. More significant terms co-occur with fewer different terms.
  - Term different sentence: measures how many different sentences a term appears in. The more sentences a term is used in, the more likely it is that the term is significant.
- Computing term score: features from the previous step are combined into a single value.
- Generating n-grams and computing keyword scores: the algorithm identifies all valid n-grams. An n-gram is a sequence of n consecutive words from a sentence. Each n-gram is then given a score by multiplying the scores of the terms inside it and normalizing that value to reduce the impact of the n-gram’s length.
- Data deduplication and ranking: the algorithm removes similar keywords. It keeps the one that is more relevant (one with a lower score). Two terms are considered duplicate keywords if they have a low Levenshtein distance. This metric measures the similarity between words. Given two words, the distance measures the number of edits needed to transform one word into another. Afterwards, the keywords are sorted based on their scores.
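The deduplication step can be sketched in a few lines. The code below is a simplified illustration, not YAKE's actual implementation: the relative-distance threshold `max_relative_distance` and the example scores are hypothetical, chosen only to show how a lower-scoring (more relevant) keyword wins over its near-duplicates.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def deduplicate(scored_keywords, max_relative_distance=0.2):
    """Keep the best-scoring keyword of each group of near-duplicates.

    scored_keywords maps keyword -> score, where lower means more relevant.
    max_relative_distance is a hypothetical threshold, not YAKE's own setting.
    """
    kept = []
    # Visit candidates from most to least relevant (lowest score first).
    for keyword, score in sorted(scored_keywords.items(), key=lambda kv: kv[1]):
        is_duplicate = any(
            levenshtein(keyword, other) / max(len(keyword), len(other), 1)
            <= max_relative_distance
            for other, _ in kept
        )
        if not is_duplicate:
            kept.append((keyword, score))
    return kept


# Made-up scores for illustration only.
candidates = {
    "keyword extraction": 0.02,
    "keywords extraction": 0.03,
    "graph": 0.10,
}
print(deduplicate(candidates))
```

Here "keywords extraction" is one edit away from "keyword extraction", so only the lower-scoring variant survives, while "graph" is kept as a distinct keyword.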
A more popular approach is to use graphs to identify the main keywords and key phrases within a text. A graph can be defined as a set of vertices with connections (edges) between them. A text can be represented as a graph by, for example, turning the terms into vertices and adding an edge between two terms if they follow one another. The edge gets a higher weight the more often these two terms occur next to each other in the text.
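As a small illustration of this representation, the sketch below builds a weighted, undirected co-occurrence graph from a sentence. The whitespace tokenizer and the choice to store the graph as a counter of term pairs are assumptions made to keep the example short.

```python
from collections import Counter


def build_graph(text):
    """Map each pair of adjacent terms to the number of times they co-occur."""
    terms = text.lower().split()  # naive tokenizer, an assumption
    edges = Counter()
    for left, right in zip(terms, terms[1:]):
        if left != right:
            # Sort the pair so that (a, b) and (b, a) are the same undirected edge.
            edges[tuple(sorted((left, right)))] += 1
    return edges


graph = build_graph("keyword extraction finds keywords and keyword extraction is fast")
# Show the heaviest edges first.
for (a, b), weight in graph.most_common(3):
    print(a, b, weight)
```

The pair "keyword" and "extraction" occurs twice next to each other, so its edge carries a weight of 2, while every other edge has a weight of 1.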
After creating the graph, the vertices can be ranked by importance. Two of the most popular graph-based methods for keyword extraction are TextRank and TopicRank. Neither approach requires labeled training data to extract the most important keywords in a text.
TextRank is a graph-based ranking method that is used for extracting relevant sentences or finding keywords. It extracts keywords in five steps:
- Text tokenization and annotation with part-of-speech (PoS) tags such as noun or adjective.
- Graph construction: vertices in the graph are words with selected PoS tags. Two vertices are connected with an edge if they appear within the window of N words in the text. The graph is undirected and unweighted.
- Graph ranking: the score of each vertex is initialized to 1, and the ranking algorithm is run on the graph. The vertices are ranked with Google’s PageRank algorithm, which was originally designed to rank graphs of websites. The score of a vertex is computed from the scores of the vertices connected to it and the number of edges those vertices have. The algorithm is run over all vertices for several iterations until the scores converge.
- Top-scoring words selection: words (vertices) are sorted from the highest to lowest scoring words. Afterwards, the algorithm selects the top 33% of the words.
- Keyword extraction: words selected in the previous phase are joined into multi-word keywords if they appear next to each other in the text. The score of a newly constructed keyword is the sum of the scores of its words.
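The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions rather than a reference implementation: a tiny hypothetical stopword list stands in for PoS filtering, the window is applied to the filtered words, punctuation is stripped so phrases may cross sentence boundaries, and the damping factor of 0.85 is the value commonly used with PageRank.

```python
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); it stands in for PoS filtering.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def build_graph(words, window=2):
    """Connect two words if they occur within `window` words of each other."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if word != other:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def pagerank(neighbors, damping=0.85, tolerance=1e-6, max_iterations=100):
    """Iterate PageRank on an undirected, unweighted graph until convergence."""
    scores = {word: 1.0 for word in neighbors}
    for _ in range(max_iterations):
        new_scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in linked)
            for word, linked in neighbors.items()
        }
        change = max((abs(new_scores[w] - scores[w]) for w in scores), default=0.0)
        scores = new_scores
        if change < tolerance:
            break
    return scores


def textrank(text, window=2):
    """Return {keyword: score} for the top third of the ranked words."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    words = [t for t in tokens if t and t not in STOPWORDS]
    scores = pagerank(build_graph(words, window))

    # Keep the top third of the words (at least one).
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_words = set(ranked[:max(1, len(ranked) // 3)])

    # Join selected words that are adjacent in the text into multi-word keywords.
    keywords, phrase = {}, []
    for token in tokens + [None]:  # the None sentinel flushes the last phrase
        if token in top_words:
            phrase.append(token)
        elif phrase:
            keywords[" ".join(phrase)] = sum(scores[w] for w in phrase)
            phrase = []
    return keywords


text = "Keyword extraction finds the keywords of a text. Graph methods rank the words of a text."
print(textrank(text))
```

A production setup would replace the stopword filter with a real PoS tagger and would usually rely on a graph library for the ranking step.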
TopicRank takes a slightly different approach from TextRank. Instead of ranking individual words, it ranks the topics present in the document and extracts keywords from the most important ones. The algorithm treats a topic as a cluster of similar single-word and multi-word keyword candidates. It then applies the same graph-ranking steps as TextRank, with topics as vertices, to identify the most representative keywords.
In this blog post, we showed several keyword extraction methods from the groups of statistical and graph-based methods. There is no single method that works best on every dataset. However, every method shown works without labeled training data and needs little more than language-specific resources such as a stopword list or a PoS tagger, and YAKE, TextRank, and TopicRank even function on a single document.
If you found this blog post useful, let us know or drop the author a LinkedIn message to chat about it!