Topic detection is a commonly sought-after Natural Language Processing (NLP) technique. It’s especially useful for getting high-level views of your conversations, emails, or documents. In this example, we’re going to take a look at BERT, a large language model, and how you can use a BERT-derived library to do topic detection.

What is BERT?

BERT (“Bidirectional Encoder Representations from Transformers”) is a popular large language model created by researchers at Google and published in 2018. BERT is widely used in research and production settings; Google even implements BERT in its search engine.

By 2020, BERT had become a ubiquitous baseline for NLP applications, with over 150 research publications analyzing and extending the model. At its core, it is built like many transformer models. The main difference between transformers and Recurrent Neural Networks (RNNs), another classic in the NLP toolkit, is that transformers process the entire input at once rather than one token at a time.

RNNs have been around for decades, starting with the Hopfield network in the 1980s, and evolved to include Long Short-Term Memory (LSTM) models by the end of the 1990s. More recent NLP architectures arrived in the 2010s: in 2014, the Gated Recurrent Unit (GRU) was introduced as a simpler, faster alternative to the LSTM.

In 2017, transformer models were introduced. They not only allow predictions to be run on the entire input at once but also allow much more parallelization at training time. In the years since, transformers have increasingly become the architecture of choice for both NLP and image processing tasks.

The original BERT language model was trained on over 800 million words from BooksCorpus and over 2.5 billion words from English Wikipedia. It was trained on two tasks: masked language modeling and next-sentence prediction.

Since its inception, BERT has inspired many other models and use cases. One example is in topic detection with BERTopic, which we’ll cover below.

Introduction to BERTopic

BERTopic is an open-source library that uses a BERT model to do topic detection with a class-based TF-IDF procedure. TF-IDF stands for “Term Frequency - Inverse Document Frequency”. Exactly as the name implies, TF-IDF is an algorithm that weights the importance of words in a corpus: the more frequently a word appears in a document, the more important it is to that document, but the more documents that word appears in across the corpus, the less important it becomes.

An example of this is the word “the”. You’ll see “the” many times within a single document, but it also appears in virtually every document, so its weight drops. On the other hand, a word like “BERT” appears in far fewer documents but may show up many times in a document about NLP. In that case, a TF-IDF model would treat “BERT” as an important word that defines the topic of a small set of documents.
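As a quick illustration (separate from BERTopic itself), here is a minimal sketch using scikit-learn’s TfidfVectorizer on a made-up three-document corpus. It prints the learned inverse-document-frequency weights, which are lowest for words that appear in every document:

from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: "the" appears in every document, "bert" in only one
corpus = [
    "the cat sat on the mat",
    "the dog ate the food",
    "the paper describes BERT, a transformer model",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# compare learned IDF weights (TfidfVectorizer lowercases tokens,
# so "BERT" is stored as "bert")
for term in ["the", "bert", "cat"]:
    idx = vectorizer.vocabulary_[term]
    print(term, round(vectorizer.idf_[idx], 3))

Here “the” receives the lowest IDF because it appears in every document, while “bert” and “cat” each appear in only one document and are weighted more heavily. Roughly speaking, BERTopic’s class-based variant applies the same idea to clusters of documents rather than individual ones.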

Using BERTopic for Topic Modeling in Python

Now that we’ve covered the basic history and ideas behind the BERT model and the BERTopic library, let’s take a look at how to use them. We’re not only going to run the library, but also explore the example dataset, discuss the modeled topics, and visualize the resulting document clusters.

Before we get started, we’ll need to install a few libraries: BERTopic, scikit-learn, NumPy, pandas, and Matplotlib. Install these using the package manager of your choice. In this case, I’ll be using pip (note that scikit-learn’s package name on PyPI is scikit-learn, not sklearn):
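pip install bertopic scikit-learn numpy pandas matplotlib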

Exploring a Topic Modeling Dataset

The first thing we’re going to do is import our BERTopic model and get an example dataset from sklearn.

In this example, we use the set of documents stored in fetch_20newsgroups from the datasets in sklearn. All we need to do to get these documents is to call the fetch_20newsgroups function and extract the data element.

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# fetch an example dataset from sklearn
docs = fetch_20newsgroups(subset='all')['data']

Let’s take a brief look at the data we’re working with, starting with how much of it there is. Since we already extracted the data field from the object that fetch_20newsgroups returns, we can simply call len on the list of documents.

print(len(docs))

The dataset contains 18,846 documents. Now that we know how long the dataset is, let’s look into it more. Let’s pull out two elements from the dataset to see what each document looks like. For the next two examples, I’ve picked documents 1 and 45 (indexed at 0 and 44).

print(docs[0])

Then, if we look at another email:
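# print the second example document (index 44)
print(docs[44])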