Insight Series: What is a Topic Model?

Misia Tramp

17 October 2023

In the first post of this series, we described how and why Metia uses large data sets to understand the audiences that are important to our clients. Gathering online data at scale provides valuable insights, but it can be challenging to make sense of the volume gathered; we are often working with millions of items of data. That is where advanced techniques such as topic modelling come in.

Topic modelling is a technique used to extract a list of meaningful topics that appear within a large dataset. In a typical client project, topic modelling helps identify audience priorities and language so that our clients can most effectively understand and address the needs and interests of their customers. 

A topic model treats each document as a mixture of topics, and each topic as a mixture of terms. The topics are learned from which words tend to appear together in documents across the dataset. The topic modelling process includes the following steps, described here in simplified form: 

  1. Pre-processing: The raw data is cleaned by removing punctuation, numbers, and low-meaning terms called stop words (e.g. “the”, “and”, “feel”, “go”). We also identify and encode two- and three-word phrases called bigrams and trigrams. 
  2. Vectorization: Each document is represented as a bag of words, that is, a count of how often each term appears in the document, ignoring word order. 
  3. Building the Model: The topic modelling algorithm (called Latent Dirichlet Allocation or LDA) analyzes the frequency and distribution of terms used across documents to identify recurring topics of conversation. 
  4. Model Selection: The data science team selects the optimal number of topics based on a coherence metric and the interpretability of the results. 
  5. Topic Naming: Analysts interpret the identified topics by investigating the top terms associated with each topic. They label each topic with a name that captures the essence of that topic. 
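
Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of Metia's actual pipeline: the stop-word list, the sample texts, the phrase list, and the settings (alpha, beta, number of iterations) are all assumptions made for the example. The model is fitted with collapsed Gibbs sampling, which is one standard way to estimate LDA; the steps above do not say which inference method is used in practice.

```python
import random
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are far longer.
STOP_WORDS = {"the", "and", "feel", "go", "a", "we", "our", "to", "of"}

def preprocess(text, phrases=frozenset()):
    """Step 1: lowercase, drop punctuation, numbers and stop words, merge known bigrams."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    terms, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in phrases:
            terms.append(pair)   # keep the two-word phrase as a single term
            i += 2
        else:
            terms.append(words[i])
            i += 1
    return terms

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Step 3: collapsed Gibbs sampling for LDA over lists of terms."""
    rng = random.Random(seed)
    vocab_size = len({t for doc in docs for t in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per document
    topic_term = [Counter() for _ in range(n_topics)]   # term counts per topic
    topic_total = [0] * n_topics                        # total terms per topic
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]

    def update(d, term, k, delta):
        doc_topic[d][k] += delta
        topic_term[k][term] += delta
        topic_total[k] += delta

    for d, doc in enumerate(docs):
        for term, k in zip(doc, assign[d]):
            update(d, term, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                update(d, term, assign[d][i], -1)   # remove the current assignment
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][term] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, term, assign[d][i], +1)

    # Smoothed share of each topic in each document (each row sums to 1).
    doc_mix = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
               for row, doc in zip(doc_topic, docs)]
    return doc_mix, topic_term

texts = [
    "We automate the process to cut cost and time.",
    "Automate each process and feel the cost go down over time.",
    "Secure the hybrid cloud infrastructure.",
    "Our hybrid cloud infrastructure is secure.",
]
docs = [preprocess(t, phrases={"hybrid cloud"}) for t in texts]
bags = [Counter(doc) for doc in docs]   # step 2: bag of words (term counts)
doc_mix, topic_term = fit_lda(docs, n_topics=2)
```

Here doc_mix holds the estimated topic shares for each document, and topic_term[k].most_common(5) lists the top terms an analyst would read when naming topic k. Steps 4 and 5 remain a matter of comparing models and human judgement.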

For example, in Metia’s B2B Directions data set, the following three topics are among those generated: 

  1. Increasing Productivity and Efficiency: “process”, “efficiency”, “cost”, “time”, “application”, “automate” 
  2. Building a Secure Hybrid Cloud: “infrastructure”, “application”, “hybrid cloud”, “secure” 
  3. Creating a Sustainable Business: “sustainability”, “climate”, “industry”, “energy efficiency”, “commitment” 

These topics illustrate two important facts about topic models that allow them to capture the complexity of how we speak online: 

  • A single term can show up in multiple topics. In the example above, “application” appears in topics 1 and 2. This reflects the fact that a word can take on different meanings in different contexts and can be relevant to more than one domain. 
  • A single document can contain multiple topics. For example, if a CEO creates a post titled “How Sustainability Drives Efficiency”, it might match topics 1 and 3. This allows the model to handle overlapping topics and longer documents. 
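
Both points can be illustrated with the three example topics listed earlier. The sketch below is a deliberately crude stand-in: it matches documents to topics by counting overlapping terms, whereas a fitted LDA model assigns probabilistic weights. The function names are hypothetical, and the term lists are only the few top terms quoted in this post.

```python
# Top terms for the three example topics quoted in this post.
TOPICS = {
    "Increasing Productivity and Efficiency": {
        "process", "efficiency", "cost", "time", "application", "automate"},
    "Building a Secure Hybrid Cloud": {
        "infrastructure", "application", "hybrid cloud", "secure"},
    "Creating a Sustainable Business": {
        "sustainability", "climate", "industry", "energy efficiency", "commitment"},
}

def topics_for_term(term):
    """Return every topic whose top terms include the given term."""
    return [name for name, terms in TOPICS.items() if term in terms]

def topic_mixture(doc_terms):
    """Share of a document's matched terms per topic (a rough proxy, not LDA inference)."""
    hits = {name: sum(1 for t in doc_terms if t in terms)
            for name, terms in TOPICS.items()}
    total = sum(hits.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in hits.items() if n > 0}

shared = topics_for_term("application")
# Terms from the post title "How Sustainability Drives Efficiency" after pre-processing.
mixture = topic_mixture(["sustainability", "drives", "efficiency"])
```

With these inputs, shared contains topics 1 and 2, and mixture splits the CEO's post evenly between topics 1 and 3.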

Topic modelling also calculates the prevalence of each topic, or what percentage of the dataset is described by that topic. This allows us to quantify the importance of each topic across the whole dataset as well as within different subsets of the data. For example, we may investigate differences in conversation by audience, vertical, year, or thematic area. 
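
As a rough sketch of how prevalence can be computed, the example below averages each topic's weight across documents, first for the whole dataset and then per year. Averaging document-level weights is one common definition and is an assumption here, as are the sample weights, the short topic labels, and the function names.

```python
from collections import defaultdict

# Hypothetical per-document topic weights (each set sums to 1) plus one metadata field.
documents = [
    {"year": 2022, "weights": {"productivity": 0.7, "cloud": 0.3, "sustainability": 0.0}},
    {"year": 2022, "weights": {"productivity": 0.5, "cloud": 0.5, "sustainability": 0.0}},
    {"year": 2023, "weights": {"productivity": 0.2, "cloud": 0.2, "sustainability": 0.6}},
    {"year": 2023, "weights": {"productivity": 0.4, "cloud": 0.0, "sustainability": 0.6}},
]

def prevalence(docs):
    """Average weight of each topic across docs, as a percentage (docs must be non-empty)."""
    totals = defaultdict(float)
    for doc in docs:
        for topic, weight in doc["weights"].items():
            totals[topic] += weight
    return {topic: round(100 * total / len(docs), 1) for topic, total in totals.items()}

def prevalence_by(docs, key):
    """Prevalence computed separately for each value of a metadata field."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return {value: prevalence(group) for value, group in groups.items()}

overall = prevalence(documents)
by_year = prevalence_by(documents, "year")
```

In this made-up data, sustainability accounts for 30% of the conversation overall but 60% of it in 2023, which is the kind of shift that a subset comparison is meant to surface.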

Metia uses topic models so that our clients can align their content, messaging, and commercial strategies to the topics and language that matter most to their target audiences, improving marketing performance and providing a better, more satisfying experience for customers.

Insight Series 

This post is one of a series in which the Metia Insight team explain the various tools, systems, research techniques and methods we use to help answer the challenges set by our clients.  

Read the first blog in the series: What is Digital Data