Topic Modeling is a method from the field of Natural Language Processing (NLP). Using unsupervised machine learning, text documents are statistically analyzed for word patterns in order to compile words into groups (the so-called "topics").
Topic Modeling in MAXQDA
Topic Modeling in MAXQDA is primarily used for the exploration of data. Topic Modeling helps you to identify topics in your documents or survey responses and to include the identified topics in your analysis:
- The identified topics can be saved as dictionary categories containing the corresponding words. You can then use the dictionary for autocoding and dictionary-based content analysis.
- The dominant topic in each analyzed document can be recorded in a document variable.
- The documents can be assigned to document sets according to the dominant topics.
Recommended procedure for Topic Modeling in MAXQDA
In general, you should follow these steps when using Topic Modeling in MAXQDA:
1. Preparation: Create a stop word list for the data
Get an overview of the words that occur in the texts to be analyzed using the MAXDictio > Word Frequencies function. Transfer all words that do not carry meaning to a stop word list so that they are ignored in the later analysis.
2. Create the first model
Consider in advance whether you expect many different topics or only a few. The more the documents differ in their vocabulary, the larger the number of topics you should choose. MAXQDA defaults to 6 topics, which is a rather small number.
Create the first topic model as described below.
3. Check the model and create alternative models if necessary
Check the coherence of the words of the individual topics and, if necessary, try out alternative models with more or fewer topics. To evaluate the model, you can also use the Topic Document Matrix, which shows the probability that a topic occurs in a document. If there are many different dominant topics in the documents, even if they are not very diverse thematically, then a model with more topics is probably more appropriate.
If necessary, exclude more words from the analysis by adding them to the stop word list.
In general, it is helpful for the interpretation to use MAXDictio > Keyword in Context or MAXDictio > Word Frequencies in parallel to the Topic Modeling window to check in which contexts words are used and how often they occur.
4. Name the topics
Name the topics with an abstract term that summarizes the words they contain.
5. Use and save the topics
Save the dominant topic per document automatically as a document variable or create corresponding document sets with the respective documents. Create a dictionary with the words per topic.
Save the topics in the Questions - Themes - Theories (QTT) workspace or export them to Excel.
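Step 1 of the procedure can also be approximated outside MAXQDA. The following is a minimal Python sketch, not MAXQDA's implementation: the tokenization rule, the sample documents, and the stop word list are all assumptions made for illustration. It counts word frequencies with and without a stop word list, so that you can see which frequent words carry no meaning for the analysis.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract runs of letters (a deliberately simple rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def word_frequencies(documents, stop_words=frozenset()):
    """Count how often each word occurs across all documents, skipping stop words."""
    counts = Counter()
    for text in documents:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts

# Hypothetical sample data and stop word list.
documents = [
    "The interview was about the housing market.",
    "The housing costs and the rent were discussed.",
]
stop_words = {"the", "was", "about", "and", "were"}

print(word_frequencies(documents).most_common(3))
print(word_frequencies(documents, stop_words).most_common(3))
```

Without the stop word list, "the" dominates the counts; with it, content words such as "housing" move to the top.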
Prerequisite for performing Topic Modeling in MAXQDA
To obtain meaningful results with Topic Modeling, the analysis must cover more than just a few documents, and these documents should not all share very similar most frequent words. With fewer than 30 documents, meaningful results can hardly be expected; with survey data of 100 cases or more, the chances of getting meaningful results are greater.
How to start Topic Modeling
- Activate all documents that you want to include in the analysis.
- Start the MAXDictio > Topic Modeling function.
- Choose the desired settings in the dialog:
- Number of topics
- Restriction to certain documents or segments
- Ignoring certain content and stop words
- Lemmatization (reduction of words to their base form)
After you click OK, the calculation starts. Topic Modeling is computationally intensive and can therefore take several minutes, even with smaller amounts of data.
The results window
In the results window, the identified words per topic are presented.
The more important a word is for a topic, the larger and more strongly colored it is displayed in the word cloud and the higher up it appears in the list view.
The following options are available in the results window:
- Using the icons above the topics, you can switch the view from list to word cloud.
- To name or rename the individual topics, click on the current name.
- To ignore a topic for further analysis, click the crossed-out eye icon. Disabled topics are ignored in the following functions: “Topic Document Matrix” and “Save Topics as Dictionaries/Variables/Document Set”.
- At the top left of the ribbon, you can change the number of topics and the number of words displayed per topic. If you change the number of topics, the calculation will be restarted, and any topic names you have assigned will be reset.
- On the top right you can save the current view in QTT, copy it to the clipboard, or export it.
Topic Document Matrix
Via Start > Topic Document Matrix in the results window you can call up a visualization that shows which topics dominate in which document.
The calculated probabilities that a topic occurs in a document serve as the basis for the visualization. The probability value between 0 and 1 is multiplied by 100 for the display.
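The conversion from probabilities to display values can be sketched in a few lines of Python. This is only an illustration: the document names, the probabilities, and the rounding to one decimal place are assumptions, not MAXQDA's documented behavior.

```python
def to_display_matrix(doc_topic_probs, decimals=1):
    """Convert per-document topic probabilities (0..1) into display values (0..100)."""
    return {
        doc: [round(p * 100, decimals) for p in probs]
        for doc, probs in doc_topic_probs.items()
    }

# Hypothetical probabilities for two documents and three topics.
doc_topic_probs = {
    "Interview 1": [0.75, 0.25, 0.0],
    "Interview 2": [0.125, 0.5, 0.375],
}

for doc, row in to_display_matrix(doc_topic_probs).items():
    print(doc, row)
```

Each row still sums to 100, so a cell can be read as the percentage share of a topic within that document.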
Using the icons above the display, you can adjust the view. For example, you can switch to a heatmap view as shown in the image above. The functions in the toolbar are described in detail in the Code Matrix Browser section of the manual.
Saving results
The assignments of words to topics and the probabilities of topics per document can be saved as follows:
Save Topics as Dictionaries – A new dictionary is created in MAXDictio. The topic names are used as category names and the top words per topic are entered as search items. You should check the dictionary for duplicate words, because it is possible that the same word is significant for different topics (usually with different weighting).
Save Topics as Document Variable – A new document variable is created and for each analyzed document the topic name with the highest probability is entered. If several topics are equally likely, “not defined” is entered.
Save Topics as Document Sets – One document set is created per topic in the “Document System” window. Each document is assigned to the topic set with the highest membership probability.
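The logic behind these three save options can be illustrated with a short Python sketch. It is not MAXQDA's code: the function names, the sample data, and the exact-equality tie check are assumptions. Leaving tied documents out of the document sets is also an assumption, because the description above only states the tie rule for the document variable.

```python
from collections import defaultdict

NOT_DEFINED = "not defined"

def dominant_topic(probs, topic_names):
    """Return the most probable topic name, or NOT_DEFINED on a tie for first place."""
    best = max(probs)
    winners = [name for name, p in zip(topic_names, probs) if p == best]
    return winners[0] if len(winners) == 1 else NOT_DEFINED

def document_sets(doc_topic_probs, topic_names):
    """Group documents by dominant topic; tied documents are left out (an assumption)."""
    sets = defaultdict(list)
    for doc, probs in doc_topic_probs.items():
        topic = dominant_topic(probs, topic_names)
        if topic != NOT_DEFINED:
            sets[topic].append(doc)
    return dict(sets)

def duplicate_words(topic_words):
    """Return words listed under more than one topic, with the topics they appear in."""
    seen = defaultdict(list)
    for topic, words in topic_words.items():
        for word in words:
            seen[word].append(topic)
    return {word: topics for word, topics in seen.items() if len(topics) > 1}

# Hypothetical topic names, probabilities, and top words.
topic_names = ["Housing", "School"]
doc_topic_probs = {"Doc A": [0.8, 0.2], "Doc B": [0.3, 0.7], "Doc C": [0.5, 0.5]}
topic_words = {"Housing": ["rent", "market", "costs"], "School": ["teacher", "costs"]}

print({doc: dominant_topic(p, topic_names) for doc, p in doc_topic_probs.items()})
print(document_sets(doc_topic_probs, topic_names))
print(duplicate_words(topic_words))
```

The last function mirrors the recommended check of the dictionary for words that are significant for more than one topic.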
Topic Modeling for Survey Responses
If you use the Analysis > Categorize Survey Responses feature to code your survey responses, you can invoke Topic Modeling directly from the analysis window in the Start menu.
Only the currently displayed responses are considered in the analysis.
Concluding remarks
- Topic Modeling is a statistical modeling technique that does not consider the meaning of words. Accordingly, results may differ from subjective expectations.
- MAXQDA uses Gensim with the Latent Dirichlet Allocation (LDA) algorithm to determine the topics. To reproduce the results in Gensim: MAXQDA uses 50 iterations and sets 1 as the random state to always obtain the same results for the same input.