Conspiracy Theories – Topic Modeling & Keyword Extraction

It’s been a while since I’ve posted something related to topic modeling, and I decided to do so after stumbling upon a conspiracy theory document set that, albeit small, seemed an interesting starting point for building a topic model on the subject of conspiracies.

Unfortunately, these documents seem to be a bit dated, as I couldn’t easily find references to more recent conspiracy theories like chemtrails or “vaccines are a government/big pharma plot to poison people”. At any rate, I’m sure they provide clues to the major themes behind conspiracy theories in general. As a preliminary exercise, I used JATE’s implementation of GlossEx to extract keywords from the documents. Some of the most “salient” ones (i.e., among the top 50 or so) are represented in the bubble chart below. The size of each bubble represents the score of the keyword, and among them we can see mentions of the CIA, Jews, Tesla, the Illuminati and JFK. Yep, seems like a conspiracy data set, alright!

And now let’s explore the data set in more depth by building a topic model. ‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. To summarize: topic modeling is a process that decomposes each document into a set of probabilities. The final outcome of this process is a mathematical model that represents a piece of text as a mixture of different topics, with each topic having a weight (that is, a probability) associated with it. The higher the weight of a topic, the more important it is for characterizing the text. Another aspect of this mathematical model is that the terms that compose a topic also have different weights: the higher their value, the more important they are for characterizing the topic.
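To make that mixture idea concrete, here is a sketch of the underlying equation (notation mine, chosen to match the beta and gamma objects in the code below): for a model with K topics, the probability of a word w appearing in document d is

p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)

where the p(z = k | d) terms are the per-document topic weights (the gamma values) and the p(w | z = k) terms are the per-topic term weights (the beta values).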

I’ve used Mallet (Java) and STMT (Scala) before for topic modeling, so I chose something different this time, the R topicmodels package, to build a model for these 432 documents. Here’s a sample of the code. Note that the LDA function accepts the corpus as a DocumentTermMatrix object of unigrams, bigrams and trigrams, and that the topic model has 35 topics. This number was chosen after inspecting the log-likelihood of multiple LDA models, each with a different number of topics (a sketch of that screening follows the code below). I think 35 topics is excessive for such a small data set, but I’ll use this criterion just for the sake of having a method that determines this parameter.

library(tm)          #text mining framework: Corpus, DocumentTermMatrix
library(RWeka)       #provides NGramTokenizer and Weka_control
library(topicmodels) #provides the LDA function

corpus <- Corpus(DirSource("texts"))
corpus <- tm_map(corpus, stripWhitespace)

#tokenizer that emits unigrams, bigrams and trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = TrigramTokenizer,
                                         stemming = FALSE,
                                         stopwords = TRUE,
                                         removePunctuation = TRUE))

#Build topic model with 35 topics, previously determined with the logLik function
lda <- LDA(dtm, k = 35, method = "Gibbs", control = list(seed = 123, verbose = 1, iter = 1500))

#inspect word distribution per topic (note: @beta stores log-probabilities)
beta <- lda@beta

#inspect documents composition as mixtures of topics
gamma <- lda@gamma
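Since the selection of the number of topics is only mentioned in passing above, here’s a minimal sketch of how that screening could look. The grid of candidate k values is my assumption (the post doesn’t list the ones actually tried), and refitting a Gibbs model for each value is slow:

#screen candidate numbers of topics by log-likelihood (illustrative grid of k values)
ks <- seq(5, 50, by = 5)
models <- lapply(ks, function(k) LDA(dtm, k = k, method = "Gibbs", control = list(seed = 123, iter = 1500)))
ll <- sapply(models, function(m) as.numeric(logLik(m)))
plot(ks, ll, type = "b", xlab = "number of topics", ylab = "log-likelihood")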

The following multiple d3 word cloud was built after inspecting the beta object of the model (which tells us the n-grams that compose each topic, as well as the weight of each n-gram within the topic) and choosing 9 of the 35 topics (some topics were redundant or composed of non-informative terms). The size and the opacity of a term in the visualization reflect its weight. There are topics for all tastes: UFOs, freemasonry, the New World Order, Tesla and, strangely, one that mixes Nazis, George Bush and oil (topic 3). By the way, the code used for the multiple word cloud comes from this blog post by Elijah Meeks, and it’s a very nice and easy way of representing topics.
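If you want to reproduce the term lists behind the clouds, here’s a minimal sketch, assuming the lda object from the code above. The terms() helper returns the top terms per topic, and posterior() recovers the actual probabilities (equivalently, exp(lda@beta)):

#top 10 n-grams for each topic
top_terms <- terms(lda, 10)

#per-topic term probabilities (rows: topics, columns: terms)
term_weights <- posterior(lda)$terms

#weights of topic 3's ten heaviest terms, e.g. for sizing words in a cloud
sort(term_weights[3, ], decreasing = TRUE)[1:10]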


Part IV – Record Labels and Lyrical Content

In this fourth (and final) part of topic discovery in black metal lyrics, we’ll address the issue of assigning topics to record labels based on the lyrical content of their black metal releases. In other words, we want to find out if a given label has a tendency to release bands that write about a particular theme. We’ll also investigate the temporal evolution of these topics, that is, what changes have happened through the years in the usage of topics in black metal lyrics. This aims to shed some light on whether lyrical content has remained the same throughout the years.

In order to address these questions, I turned once more to topic modeling. This machine learning technique was mentioned in parts I, II and III of this post, so knock yourself out reading those. If that does not appeal to you, let’s sum things up by saying that topic modeling aims to automatically infer (i.e., with minimum human intervention) the topics underlying a collection of texts. “Topic” in this context is defined as a set of words that (co-)occur in the same context and are, somehow, semantically related.

Instead of using the topic model built for parts II and III, I generated a new one after some (sensible, I hope) cleaning of the data set. This pre-processing involved, among other things, removal of lyrics that were not fully translated to English and of lyrics with fewer than 5 words (a rough sketch of these filters follows below). In the end, I reduced the data set to 72666 lyrics (how ominous!) and generated a topic model of 30 topics with the Stanford Topic Modeling Toolbox (STMT).
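For the curious, here’s a minimal sketch of those two filters in R. The cld2 language detector is my stand-in (the post doesn’t say how non-English lyrics were spotted), and lyrics is assumed to be a character vector with one lyric per element:

library(cld2) #compact language detector; my choice, not necessarily the author's

#keep lyrics with at least 5 words
word_counts <- sapply(strsplit(lyrics, "\\s+"), length)
lyrics <- lyrics[word_counts >= 5]

#keep lyrics detected as English (detect_language returns NA when unsure)
lyrics <- lyrics[detect_language(lyrics) %in% "en"]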

As in previous attempts, 2 or 3 of these 30 topics seemed quite generic (they were composed of words that could occur in any context) or just plain noisy garbage, but for the most part the topics are quite coherent. I’m listing those I found the most interesting/intriguing. For each of them I added a title (in parentheses) that tentatively describes the overall tone of the topic:

  • Topic 28 (Cult, Rituals & Symbolism): “sacrifice”, “ritual”, “altar”, “unholy”, “goat”, “rites”, “blasphemy”, “chalice”, “temple”, “cult”
  • Topic 23 (Chaos, Universe & Cosmos): “chaos”, “stars”, “universe”, “cosmic”, “light”, “space”, “serpent”, “void”, “abyss”, “creation”
  • Topic 3 (The Divine): “lord”, “behold”, “praise”, “divine”, “god”, “blessed”, “man”, “glory”, “throne”, “perdition”
  • Topic 2 (Mind & Reality): “mind”, “existence”, “reality”, “thoughts”, “sense”, “moment”, “vision”, “mental”, “consciousness”
  • Topic 21 (Flesh & Decay): “flesh”, “dead”, “skin”, “body”, “bones”, “corpse”, “grave”
  • Topic 18 (The End): “end”, “day”, “path”, “leave”, “final”, “stand”, “fate”, “left”

And so on, and so forth. Click here for the full list; it will come in handy for deciphering the plots below.

One nice piece of functionality that STMT offers is the ability to “slice” the data with respect to the topics. This means that when slicing the data by date, one can infer what percentage of the lyrics in a given year falls into each topic.
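Conceptually, the slicing boils down to averaging per-document topic proportions within each slice. A minimal R sketch, assuming doc_topics is a data frame with one row per lyric and one column per topic (STMT can export these distributions as CSV) and year is a vector with one entry per lyric:

#average topic proportions per year: one row per year, one column per topic
topics_by_year <- aggregate(doc_topics, by = list(year = year), FUN = mean)

#express each year's topic shares as percentages
topics_by_year[-1] <- topics_by_year[-1] * 100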

In order to observe the temporal evolution of some of these 30 topics between 1980 and 2014, I chose to use an NVD3 stacked area chart instead of just plotting twenty-something lines (which would be impossible to read given the inevitable overlapping). The final result looks very neat and tidy, but it can also be misleading and give the impression that all the topics rise and fall at the same points in time. This is not true: when inspecting the stacked area chart below, remember that what represents a topic in a given year is the area of its band at that point, not the height of its upper boundary. You can also deselect all topics (in the legend, top-right corner) except the one you want to examine, or simply click its area in the graph.

It seems that “Pain, Sorrow & Suffering” is consistently the most prevalent topic, peaking at 10.3% somewhere around 2006. “Fucking” has a peak in 1992, and “Warriors & Battles” represents more than 20% of the topic assignments in 1986. For the most part, the topic assignment percentages seem to stabilize after 92/93 (after the Norwegian boom, or second wave, or whatever it’s called).

And finally, when slicing the data set by record label, the output can be interpreted as the percentage of black metal releases by a given label that falls into each topic. After doing precisely that for record labels with a minimum of 10 black metal releases, I selected a few labels and plotted, for each, the percentage of releases that were assigned to the topics with some degree of confidence. The resulting plot is huge, so I removed a few generic topics for the sake of clarity. By hovering the mouse over a topic title, a set of words that represent it will pop up. Similarly, by hovering the mouse over a record label name, the circles will turn into percentages. The larger a circle’s radius, the higher the percentage of releases from that label that were assigned to that circle’s corresponding topic.
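The underlying computation is again an aggregation, this time with a confidence cut-off. A rough sketch, where doc_topics is the documents-by-topics proportion matrix, label holds each release’s record label, and the 0.5 threshold is an illustrative stand-in for “assigned with some degree of confidence”:

#assign each release to its dominant topic, if dominant enough
top_topic <- max.col(doc_topics)  #index of each release's strongest topic
top_prob  <- doc_topics[cbind(seq_len(nrow(doc_topics)), top_topic)]
assigned  <- top_prob > 0.5       #illustrative confidence threshold

#percentage of each label's confidently-assigned releases falling into each topic
tab <- table(label[assigned], top_topic[assigned])
label_topic_pct <- prop.table(tab, margin = 1) * 100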

Some observations of results that stand out: it seems that more than 20% of Depressive Illusions’ releases were assigned to “Pain, Sorrow & Suffering”. End All Life’s top three topics (the label has released albums by Abigor, Blacklodge and Mütiilation, to name a few) are “Mind & Reality”, “Pain, Sorrow & Suffering” and “Chaos, Universe & Cosmos”. Also, almost 1/4 of all Norma Evangelium Diaboli’s releases (which include Deathspell Omega, Funeral Mist and Katharsis) seem to pertain to “The Divine” topic.

Edit: WordPress does not allow for huge iframes, so click here to view the Labels vs. Topics plot in all of its glory.

And that’s it for now; I’m done with topic modeling until I have the time and patience to fine-tune the overall representation of the data and the algorithm’s parameters. In the next few weeks I’ll turn to other unsupervised machine learning techniques, such as clustering, to discover hidden relationships between bands.


Credits & Useful Resources:

– D3 ToolTip: D3-tip by Caged

– Stacked Area Chart: NVD3 re-usable charts for d3.js

– Labels per Topic: taken from Asif Rahman Journals


Part III – Topic and Lyrical Content Correlation

In part II of this post, we explored a topic model built for the whole black metal lyrics data set (if you don’t know what a topic model is, read this as well, but to sum things up let’s just say topic modeling is a process that enables discovery of the “meaning” underlying a document, with minimum human intervention). In said post we analyzed 1) the relationship between topics, and 2) the importance of individual words in characterizing them, by means of a force directed graph, which (let’s face it) is a bit of a bubbly mess.

In order to better understand the second point above, I decided to build a zoomable treemap. In it, each large box (distinguished from the surrounding boxes by a label and a distinct color) represents a topic, i.e. a set of words that are somehow related and occur in the same context(s). By clicking on a label, the map zooms into it and presents the ten most relevant words within that topic. For example, by clicking on “Coldness”, you’ll see the top 10 terms that compose it (“ice”, “frost”, “snow” and so on). The area of each word is proportional to its importance in characterizing the topic: in our “Coldness” example, “cold” occupies a larger area than the rest, being the most relevant word in this context.

Similarly, the total area of each topic is proportional to its incidence in the black metal lyrics data set. For example, “Fire & Flames” has a larger area than “Mind & Reality” or “Universe & Cosmos”, making it more likely to occur when inferring the topics that characterize a song.
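The areas themselves fall out of simple averaging. A minimal sketch, assuming once more that doc_topics is the documents-by-topics matrix of proportions:

#corpus-wide incidence of each topic: the mean of its proportion over all lyrics
topic_incidence <- colMeans(doc_topics)

#normalize to fractions of the total treemap area
treemap_areas <- topic_incidence / sum(topic_incidence)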

By the way, these topic labels were chosen manually. Unfortunately, I couldn’t devise an automated process that would do that for me (if anyone has an inkling of how to do this, let me know), so I had to pick meaningful and reasonable (I hope) representative titles for each set of words. In most cases, like the aforementioned “Coldness”, the concept behind the topic is evident. There are, however, a few cases where I had to be a bit more creative because the meaning of the topic is not so obvious (“Urban Horror” comes to mind).

There are also two topics which are quite generic, with terms that could occur in almost any context, so they’re simply labeled “Non-descriptive”.

As mentioned in part II of this post, one goal of this whole mess is to find out which lyrics “embody” a specific topic. Given that the lyrical content of a song is seen by the topic model as a mixture of topics, we’re interested in discovering lyrics that are composed solely (or almost entirely, let’s say more than 90%) of a single topic. Using the topic inferencing capabilities of the Stanford Topic Modeling Toolbox I did just that, selecting at least 3 representative lyrics for 14 of the topics above. They’re displayed in the collapsible tree below.
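The filter itself is a couple of lines of matrix work. A rough sketch, assuming doc_topics is a matrix holding the inferred per-lyric topic proportions:

#lyrics whose dominant topic accounts for more than 90% of their mixture
max_prop <- apply(doc_topics, 1, max)
single_topic_lyrics <- which(max_prop > 0.9)

#which topic each of those lyrics embodies
embodied_topic <- max.col(doc_topics)[single_topic_lyrics]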

For the most part the lyrics seem to have a high degree of correlation with the topic assigned to them: for instance, Immortal’s “Mountains of Might” fits the “Coldness” topic fairly well (surprise, surprise…) and Vondur’s cover of an Elvis Presley song obviously falls into the heart stuff category. But there is one intriguing result: after reading Woods of Infinity’s “A Love Story”, I was expecting it to have the “Dreams & Stuff from the Heart” topic assigned to it. It falls in the “Fucking” topic instead, so maybe the algorithm detected something (creepy) between the lines.



The zoomable treemap was built from Bill White’s Treemap with Title Headers.

The collapsible tree was inspired by this tree and this other tree.