It’s been a while since I’ve posted something related to topic modeling, and I decided to do so after stumbling upon a conspiracy theory document set that, albeit small, seemed an interesting starting point for building a topic model on the subject of conspiracies.
Unfortunately, these documents seem to be a bit dated, as I couldn’t easily find references to more recent conspiracy theories like chemtrails or “vaccines are a government/big pharma plot to poison people”. At any rate, I’m sure they provide clues to the major themes behind them. As a preliminary exercise, I used JATE’s implementation of GlossEx to extract keywords from the documents. Some of the most “salient” ones (i.e., among the top 50 or so) are represented in the bubble chart below. The size of each bubble represents the score of the keyword, and among them we can see mentions of the CIA, Jews, Tesla, the Illuminati and JFK. Yep, seems like a conspiracy data set, alright!
And now let’s explore the data set in more depth by building a topic model. ‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. To summarize: topic modeling is a process that decomposes the documents into sets of probabilities. The final outcome of this process is a mathematical model that represents a piece of text as a mixture of different topics, where each topic has a weight (that is, a probability) associated with it. The higher the weight of a topic, the more important it is for characterizing the text. Another aspect of this mathematical model is that the terms that compose a topic also have different weights: the higher their value, the more important they are for characterizing the topic.
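To make the "mixture" idea concrete, here is a minimal sketch in base R. The two documents, two topics, four terms and all the weights are made up for illustration; they are not taken from the conspiracy data set.

```r
# Toy example: 2 documents, 2 topics, 4 terms (all values are invented).
terms <- c("ufo", "alien", "cia", "jfk")

# Document-topic weights: each row is a document and sums to 1.
doc_topics <- matrix(c(0.9, 0.1,
                       0.3, 0.7),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))

# Topic-term weights: each row is a topic and sums to 1.
topic_terms <- matrix(c(0.50, 0.40, 0.05, 0.05,
                        0.05, 0.05, 0.50, 0.40),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("topic1", "topic2"), terms))

# Mixing the topics by their per-document weights gives
# one term distribution per document.
doc_terms <- doc_topics %*% topic_terms
print(round(doc_terms, 3))
```

In this toy setup doc1 is dominated by topic1, so its term distribution leans towards "ufo" and "alien", while doc2 leans towards "cia" and "jfk". Each row of the result still sums to 1.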
I’ve used Mallet (Java) and STMT (Scala) before for topic modeling, so I chose something different this time, the R topicmodels package, to build a model for these 432 documents. Here’s a sample of the code. Note that the LDA function accepts the corpus as a DocumentTermMatrix object of unigrams, bigrams and trigrams. Note also that the topic model has 35 topics. This number was chosen after inspecting the log-likelihood of multiple LDA models, each with a different number of topics. I think 35 topics is excessive for such a small data set, but I’ll use this criterion just for the sake of having a method that determines this parameter.
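library(tm)           # corpus handling and DocumentTermMatrix
library(RWeka)        # NGramTokenizer and Weka_control
library(topicmodels)  # LDA

# Read the documents. VCorpus is used because recent tm versions ignore
# custom tokenizers on the SimpleCorpus that Corpus() may return.
corpus <- VCorpus(DirSource("texts"))
corpus <- tm_map(corpus, stripWhitespace)

# Tokenize into unigrams, bigrams and trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <- DocumentTermMatrix(corpus, control = list(
  tokenize = TrigramTokenizer,
  stemming = FALSE,
  stopwords = TRUE,
  removePunctuation = TRUE))

# Build topic model with 35 topics, previously determined with the logLik function
lda <- LDA(dtm, 35, method = "Gibbs", control = list(seed = 123, verbose = 1, iter = 1500))

# Inspect word distribution per topic (stored as log-probabilities)
beta <- lda@beta
# Inspect document composition as mixtures of topics
gamma <- lda@gamma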
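The log-likelihood comparison used to settle on the number of topics is not part of the sample above. A rough sketch of how it could be done with topicmodels follows. The helper names loglik_by_k and best_k, the candidate values and the reduced iteration count are my own assumptions for illustration, not the code actually used for this post.

```r
library(topicmodels)  # provides LDA() and a logLik() method for fitted models

# Fit one Gibbs-sampled LDA model per candidate number of topics and
# collect the log-likelihood of each fit.
# (Hypothetical helper; iter is kept low here to make the scan cheaper.)
loglik_by_k <- function(dtm, candidate_k, iter = 500, seed = 123) {
  sapply(candidate_k, function(k) {
    model <- LDA(dtm, k, method = "Gibbs",
                 control = list(seed = seed, iter = iter))
    as.numeric(logLik(model))
  })
}

# Return the candidate with the highest log-likelihood.
best_k <- function(candidate_k, log_liks) {
  candidate_k[which.max(log_liks)]
}
```

With the dtm built above, something like candidate_k <- seq(5, 50, by = 5) followed by best_k(candidate_k, loglik_by_k(dtm, candidate_k)) would return the preferred number of topics under this criterion. Plotting the log-likelihoods against candidate_k is usually more informative than taking the maximum blindly.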
The following multiple d3 word cloud was built after inspecting the beta object of the model (which tells us the n-grams that compose each topic and also the weight of each n-gram within the topic) and choosing 9 of the 35 topics (some topics were redundant or composed of non-informative terms). The size and the opacity of a term in the visualization reflect its weight. There are topics for all tastes: UFOs, freemasonry, the new world order, Tesla and, strangely, one that mixes Nazis, George Bush and oil (topic 3). By the way, the code used for the multiple word cloud comes from this blog post by Elijah Meeks, and it’s a very nice and easy way of representing topics.
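Inspecting beta by hand boils down to ranking the n-grams of each topic by weight. Here is a small base R sketch of that step, with a made-up vocabulary and weights standing in for the fitted model. The helper name top_terms is hypothetical, and the sketch assumes beta holds log-probabilities, as topicmodels stores them.

```r
# Return the n highest-weighted terms for each topic.
# log_beta: topics x terms matrix of log-probabilities (as in lda@beta)
# vocab:    character vector of term names (as in lda@terms)
top_terms <- function(log_beta, vocab, n = 10) {
  lapply(seq_len(nrow(log_beta)), function(k) {
    weights <- exp(log_beta[k, ])  # back to probabilities
    idx <- order(weights, decreasing = TRUE)[seq_len(min(n, length(vocab)))]
    data.frame(term = vocab[idx], weight = weights[idx],
               stringsAsFactors = FALSE)
  })
}

# Toy input (invented values) standing in for a fitted model's slots.
vocab <- c("ufo", "new world order", "tesla", "oil")
log_beta <- log(matrix(c(0.6, 0.2, 0.1, 0.1,
                         0.1, 0.1, 0.2, 0.6),
                       nrow = 2, byrow = TRUE))

print(top_terms(log_beta, vocab, n = 2))
```

On the real model the call would be top_terms(lda@beta, lda@terms). The topicmodels package also offers terms(lda, 10) as a shortcut when only the term names are needed, without the weights.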
- Conspiracy documents harvested from beyondweird.com
- R topicmodels package for topic modeling with LDA
- JATE toolkit for automatic keyword extraction with GlossEx
- D3 multi-word cloud for topic model representation