Conspiracy Theories – Topic Modeling & Keyword Extraction

It’s been a while since I’ve posted something related to topic modeling, and I decided to do so after stumbling upon a conspiracy theory document set that, albeit small, seemed an interesting starting point for building a topic model on the subject of conspiracies.

Unfortunately, these documents seem to be a bit dated, as I couldn’t easily find references to more recent conspiracy theories like chemtrails or “vaccines are a government/big pharma plot to poison people”. At any rate, I’m sure they provide clues to the major themes behind them. As a preliminary exercise, I used JATE’s implementation of GlossEx to extract keywords from the documents. Some of the most “salient” ones (i.e., among the top 50 or so) are represented in the bubble chart below. The size of each bubble represents the score of the keyword, and among them we can see mentions of the CIA, Jews, Tesla, the Illuminati and JFK. Yep, seems like a conspiracy data set, alright!

And now let’s explore the data set in more depth by building a topic model. ‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. To summarize: topic modeling is a process that decomposes the documents into sets of probabilities. The final outcome of this process is a mathematical model that represents a piece of text as a mixture of different topics, where each topic has a weight (that is, a probability) associated with it. The higher the weight of a topic, the more important it is for characterizing the text. Another aspect of this mathematical model is that the terms that compose a topic also have different weights: the higher their value, the more important they are for characterizing the topic.
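To make the "mixture of topics" idea concrete, here is a minimal sketch in Python (not the R used for the actual model). The topic names and all the numbers are invented for illustration: `theta` stands for the topic weights of a single document and `beta` for the word weights inside each topic.

```python
# Toy numbers, invented for illustration only.
# theta: topic weights for one document (they sum to 1).
theta = {"ufo": 0.7, "freemasonry": 0.3}

# beta: word weights within each topic (each row sums to 1).
beta = {
    "ufo": {"saucer": 0.5, "alien": 0.4, "lodge": 0.1},
    "freemasonry": {"saucer": 0.0, "alien": 0.1, "lodge": 0.9},
}


def word_probability(word, theta, beta):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * beta[topic].get(word, 0.0) for topic, weight in theta.items())


def dominant_topic(theta):
    """The topic with the highest weight characterizes the document best."""
    return max(theta, key=theta.get)


print(dominant_topic(theta))
print(round(word_probability("lodge", theta, beta), 2))
```

The point of the sketch is only that both levels are weighted: a document leans on some topics more than others, and a topic leans on some terms more than others.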

I’ve used Mallet (Java) and STMT (Scala) before for topic modeling, so I chose something different this time, the R topicmodels package, to build a model for these 432 documents. Here’s a sample of the code. Note that the LDA function accepts the corpus as a DocumentTermMatrix object of unigrams, bigrams and trigrams. Note also that the topic model has 35 topics. This number was chosen after inspecting the log-likelihood of multiple LDA models, each with a different number of topics. I think 35 topics is excessive for such a small data set, but I’ll use this criterion just for the sake of having a method that determines this parameter.

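library(tm)           # corpus handling and DocumentTermMatrix
library(RWeka)        # NGramTokenizer and Weka_control
library(topicmodels)  # LDA

# VCorpus rather than Corpus, so that the custom tokenizer below is honoured
corpus <- VCorpus(DirSource("texts"))
corpus <- tm_map(corpus, stripWhitespace)

# Despite the name, this emits unigrams, bigrams and trigrams (min = 1, max = 3)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

dtm <- DocumentTermMatrix(corpus,
                          control = list(
                                         tokenize = TrigramTokenizer,
                                         stemming = FALSE,
                                         stopwords = TRUE,
                                         removePunctuation = TRUE))

# Build topic model with 35 topics, previously determined with the logLik function
lda <- LDA(dtm, 35, method = "Gibbs", control = list(seed = 123, verbose = 1, iter = 1500))

# Inspect word distribution per topic (stored as log probabilities)
beta <- lda@beta

# Inspect document composition as mixtures of topics
gamma <- lda@gamma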

The following multiple d3 word cloud was built after inspecting the beta object of the model (which tells us the n-grams that compose each topic and also the weight of each n-gram within the topic) and choosing 9 of the 35 topics (some topics were redundant or composed of non-informative terms). The size and the opacity of a term in the visualization reflect its weight. There are topics for all tastes: UFOs, freemasonry, the new world order, Tesla and, strangely, one that mixes Nazis, George Bush and oil (topic 3). By the way, the code used for the multiple word cloud comes from this blog post by Elijah Meeks and it's a very nice and easy way of representing topics.
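Picking the terms to show for a topic boils down to ranking one row of the beta matrix. The sketch below does that in Python on made-up values; it assumes the row holds log probabilities (which is how topicmodels stores beta) and that `vocabulary` lists the n-grams in the same column order. Both the vocabulary and the numbers are invented.

```python
import math


def top_terms(log_beta_row, vocabulary, n=3):
    """Return the n highest-weighted (term, probability) pairs for one topic."""
    # Convert log probabilities back to plain probabilities before ranking.
    probs = [math.exp(value) for value in log_beta_row]
    ranked = sorted(zip(vocabulary, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical four-term vocabulary and one topic row.
vocabulary = ["new world order", "ufo", "tesla", "oil"]
log_beta_row = [math.log(0.5), math.log(0.1), math.log(0.3), math.log(0.1)]

result = top_terms(log_beta_row, vocabulary, n=2)
print(result)
```

The resulting probabilities are the kind of weights that could drive font size and opacity in a word cloud.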


Classification of Political Statements by Orator with DSX

After all that keyword extraction from political speeches, it occurred to me that it would be interesting to find out whether it’s possible to build a model that predicts the political orator to whom a statement, or even a complete speech, belongs. By statement, I mean a sentence with more than a couple of words drawn from a speech (not including interviews or political debates, for example). I took the 327 speeches by 12 US presidents used in the previous post as the basis of a document data set and added to it a few dozen speeches by other, non-American, dictatorial political leaders so as to create a set appropriate for a classification task.

I intended to explore two different routes:

  1. Build a predictive model from a collection of sentences previously classified as either being uttered by a US President or by some other politician (all non-American political leaders of the 20th century, all dictators) during an official speech. This can be defined then as a binary classification problem: either a statement is assigned to a politician of the class “USPresident” or it isn’t. All the sentences (33500 in total) were drawn from the speeches mentioned above.
  2. Build another model from a collection of speeches by a diverse group of orators, such that the model can correctly assign to a previously unseen speech the person associated with it. This can be defined as a multi-class classification problem.

This all sounds reasonable and potentially interesting (and, who knows, even useful), but building predictive models from text-based data is a very cumbersome task because there’s always a multitude of things to decide beforehand, which includes:

  • How to represent the text? Learning algorithms can’t deal with text in its original, “as-is” format, so there’s a number of preprocessing steps to take in order to transform it into a set of numerical/categorical/ordinal/etc. features that make sense. There are numerous feature types and transformations I could explore here, like representing the text as a weighted vector space model, using word based features, character-based features, using part-of-speech tags or entities as additional features, build topic models and use the topic probabilities for each document, and so on. The problem is that I do not have enough time (nor patience) to decide efficiently the most appropriate feature representation for my speech/sentences data set.
  • Dimensionality curse: Assuming I’ve managed to find some good text representation, it’s almost certain the final dimensions of the data set to be presented to the learning algorithm will be prohibitive. Again, there are numerous feature selection methods that can be employed to help me ascertain which features are more informative and discard the rest. I don’t really care about trying them all.
  • What learning algorithm is appropriate? Finally, which algorithms to use for these two classification tasks. Again, there are hundreds of them out there, not to mention countless parameters to tune, cross-validation techniques to test, different evaluation measures to optimize, and so on.

To avoid losing too much time with all of this stuff just for the sake of a blog post, I decided to use DSX for two simple reasons: 1) it accepts text in its original format and does all the feature transformation/selection/extraction steps by itself, so I don’t need to worry about that stage, and 2) it tests hundreds of different algorithms and combinations of algorithms to find the best model for the data.

The only pre-processing done to the data sets prior to uploading them as csv files to DSX was:

  1. Subsetting each data set into one training portion, from which to build a predictive model, and a testing portion used to evaluate the model on data unknown to it (and make sure there was no overfitting).
  2. To make things more challenging, I replaced all entities mentioned by their entity type. This is because a sentence or speech mentioning specific dates, people and locations can be easily assigned to the correct orator using those entities alone. For example, “It is nearly five months since we were attacked at Pearl Harbor” is obviously something that only FDR could have said. “Pearl Harbor” is a clear hint of the true class of the sentence, and to make things more difficult for DSX, it gets replaced with the placeholder “LOCATION”. A similar replacement is used for entities like organizations, dates or persons with the help of the Stanford CoreNLP toolkit.
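The entity replacement in step 2 can be sketched as follows. This is a Python illustration only: the function name is hypothetical, and the list of (text, type) pairs is assumed to come from an external NER tool, since the function does no recognition itself.

```python
def mask_entities(sentence, entities):
    """Replace each annotated entity span with its type placeholder.

    entities: list of (surface_text, entity_type) pairs, assumed to be the
    output of a separate NER step (hypothetical format).
    """
    # Replace longer spans first so a multi-word entity is not split
    # by a shorter entity that happens to be one of its parts.
    for text, entity_type in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        sentence = sentence.replace(text, entity_type)
    return sentence


masked = mask_entities(
    "It is nearly five months since we were attacked at Pearl Harbor",
    [("Pearl Harbor", "LOCATION")],
)
print(masked)
```

Plain string replacement is the simplest option; a character-offset based replacement would be more robust if the NER tool provides offsets.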

The first model built was the one for the binary version of the data set (i.e., a sentence either belongs to a US president or to a non-American political leader), using a total of 26792 sentences. Of a total of 8500 examined models, DSX found one generated with the Iterative OLS algorithm to be the best, estimating accuracy (that is, the percentage of sentences correctly assigned to their respective class) to fall between 76% and 88%, and average recall (that is, the average of the per-class percentages of correct assignments) to fall in the range of 78% to 88%. Given that the “NON US PRESIDENT” class is about two thirds the size of the “US PRESIDENT” class, average recall is a better evaluation measure than regular accuracy for this particular data set.
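To see why average recall is the safer number when classes are uneven, here is a toy Python sketch. The labels and predictions are invented: a lazy classifier that always answers “US” on a 6-to-4 split still gets 60% accuracy, while its average recall drops to 50%.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of examples assigned to their correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Invented toy data: the smaller class is two thirds the size of the larger one.
y_true = ["US"] * 6 + ["OTHER"] * 4
y_pred = ["US"] * 10  # a classifier that ignores the minority class

print(accuracy(y_true, y_pred))
print(average_recall(y_true, y_pred))
```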


ForecastThis DSX estimated qualities of the best predictive model for the binary political sentences data set.

To make sure the model is not overfitting the training data, and that the estimates above are correct, I sent DSX a test set of sentences with no labels assigned, and compared the returned predictions with the ground truth. Turns out accuracy is around 82% and average recall approximately 80%. This is a great result overall and it means we’ve managed to build a model that could be useful, for example, for automatic annotation of political statements.

And just for the record, here are a few examples of sentences that the model did not get right:

  • Sentences by US Presidents marked as belonging to non-US political leaders (dictators):
    • We have no territory there, nor do we seek any.
    • That is why we have answered this aggression with action.
    • Freedom’s fight is not finished.
  • Sentences by non-US political leaders (dictators) marked as belonging to US presidents
    • The period of war in [LOCATION] is over.
    • The least that could be said is that there is no tranquillity, that there is no security, that we are on the threshold of an uncontrollable arms race and that the danger of a world war is growing; the danger is growing and it is real.

I doubt a person could do much better just by reading the text, with no additional information.

The second model was built from a training set of 232 speeches, each labeled with the respective orator (11 in total). The classes are very unbalanced (that is, the number of examples for each label varies greatly), and some of them are quite small, which makes average recall the best measure to pay attention to when assessing the quality of the predictions made by the model. The best model DSX found was built with Multiquadric Kernel Regression, and although it has a hard time learning three of the eleven classes (see figure below), it’s actually a lot better than what I expected given the skewness of the data and the fact that all entities were removed from the text.


ForecastThis DSX predictive model for political speeches by 11 orators. The best model was built with Multiquadric Kernel Regression.

And what about the model’s performance in the test set? It more or less follows the estimated performance of the trained model: it fails to classify correctly speeches by Hitler (classifying them as belonging to FDR instead), and by Nixon (which are assigned to Lyndon B. Johnson). On the other hand, it does classify correctly all the instances of Reagan, FDR, Stalin, and most of Bill Clinton’s speeches. I’m sure if I provided a few more examples for each class, the results would greatly improve.

To conclude: this model, alongside the very good model obtained for the first data set, illustrates how it is possible to quickly obtain predictive models useful for text annotation of political speeches. And all this with minimal effort, given that DSX can evaluate hundreds of different models very quickly, and also handle the feature engineering side of things, prior to the supervised learning step.

*Disclaimer: I work for ForecastThis, so shameless self-promotion trigger warning goes here.


  • ForecastThis DSX
  • US Presidential speeches harvested from the Miller Center Speech Archive



Lexical Diversity: Black Metal vs. Pop Queens

Lexical diversity is typically defined as a measure of the uniqueness of the words used in a text, that is, the proportion of distinct words across the text. This type of measure is indicative of the vocabulary richness present in the text and general writing quality.
There are several ways of measuring this, but I’ll focus on MTLD (Measure of Textual Lexical Diversity) as it’s very sensitive and less prone to be affected by the text length, unlike more traditional metrics. MTLD, by the way, is defined as the mean length of sequential word strings in a text that maintain a given type to token ratio – in other words, sequences that have a high proportion of unique words.
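For the curious, here is a minimal Python sketch of the classic forward/backward MTLD calculation. It is not the MTLD-MA variant from koRpus that produced the numbers in this post; the 0.72 threshold is the commonly used default, and the tokenization (a plain list of lowercase words) is an assumption.

```python
def mtld_one_pass(tokens, threshold=0.72):
    """Count how many 'factors' (segments whose TTR falls below the threshold) fit in the text."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            # The segment is exhausted: count one full factor and start over.
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Credit the unfinished last segment proportionally.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # With no factor at all (every word unique) the value is unbounded.
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average of a forward and a backward pass over the tokens."""
    forward = mtld_one_pass(tokens, threshold)
    backward = mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2


print(mtld(["la"] * 10))
print(mtld(["a", "b", "a", "b"]))
```

A chorus-like string of identical words burns through factors quickly and scores low, which is the behaviour the measure is designed to capture.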
To get an idea of how diverse the vocabulary used in black metal is, I’ve measured MTLD for the song lyrics of 18 bands, which were selected (more or less) randomly.
The text of each band consists of the entirety of their lyrics after removal of text portions in languages other than English. To make things slightly more interesting, I also computed the MTLD values for the three queens of pop (or maybe they’ve been dethroned since I last checked the current pop pulse, it’s been a while).
There’s a handful of caveats to the experiment I’m describing here, the most evident being 1) I’m assuming the traditional parameter values of MTLD are suitable for song lyrics, 2) I’ve removed all lyrics totally or partially written in languages other than English, and 3) there’s a lot of intentional line repetition in songs (choruses and the like), something that is more prevalent in pop music than in black metal. With that in mind I removed such duplicates from both data sets, which actually improved (albeit not significantly) the pop artists’ MTLD values.
That said, it’s not at all surprising (to me, admitting my bias here) to see Dodheimsgard at the top, or that MTLD values for the three pop singers are a lot lower than for all of the selected black-metal bands. However, keep in mind that Lexical Diversity measures don’t explicitly take into account sentence structure or grammar, so we can’t really infer the degree of quality (for lack of a better expression) of how the words are used.
Band/Artist         MTLD-MA   Lexical Density (%)
Lady Gaga           56        53.39
Beyonce             56        51.30
Rihanna             58        52.89
Abigor              112       57.35
Blacklodge          91        59.7
Clandestine Blaze   86        63.15
Corpus Christii     94        57.48
Craft               73        51.15
Cultes des Ghoules  114       59.1
Darkthrone          104       57.65
Deathspell Omega    101       51.39
Dodheimsgard        132       58.17
Emperor             81        54.29
Immortal            73        58.94
Inquisition         72        58.02
Mayhem              83        59.7
Mutiilation         103       56.08
Ride for Revenge    102       58.7
Satanic Warmaster   70        55.47
Satyricon           79        54.04
Solefald            84        56.98

The rightmost column of the table above displays values of lexical density, which should not be confused with lexical diversity. Lexical density is defined as the proportion of content words (such as nouns, adjectives and verbs) present in the text. Other categories of words are said to be functional (such as determiners). I’ll follow here a rough interpretation of Halliday’s definition of lexical density and consider adverbs as content words.
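As a rough illustration of the definition above, lexical density can be computed from part-of-speech tagged tokens as in the Python sketch below. The tag names are an assumption (a coarse, universal-style tag set, not TreeTagger's own labels), and the tagged sentence is invented.

```python
# Tags treated as content words; adverbs are included, as in the text above.
# The tag names themselves are an assumed, simplified tag set.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Percentage of tokens whose part-of-speech tag marks a content word."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return 100.0 * content / len(tagged_tokens)


# Invented example: 4 content words out of 6 tokens.
tagged = [
    ("the", "DET"), ("moon", "NOUN"), ("rises", "VERB"),
    ("over", "ADP"), ("frozen", "ADJ"), ("woods", "NOUN"),
]
print(round(lexical_density(tagged), 2))
```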
As far as word classes go, adjectives, nouns and prepositions (e.g., “than”, “beyond”, “under”, “into”) are less common in the pop lyrics than in the black metal lyrics analysed here. Usage of pronouns (e.g., “me”, “you”, “we”), however, is a lot more evident in the pop lyrics.
Focusing solely on the black metal lyrics, the same type of distribution is observed for each band, with the exception of nouns, which for some reason are a lot more prevalent in Clandestine Blaze and Blacklodge lyrics, amounting to more than one third of all the words used. Another aspect where Blacklodge, along with Craft and Deathspell Omega, deviates considerably from the rest of the bands is the “Other” category. This word class is actually an aggregation of the smaller classes, such as digits, punctuation and symbols. Craft, in particular, use a good amount of punctuation. Another interesting thing is seeing Deathspell Omega near the bottom of the table with regard to lexical density (i.e., the share of actual content words), despite scoring high in the lexical diversity department.


  • MTLD was first suggested, and subsequently developed, by Philip McCarthy while at the University of Memphis. I strongly suggest reading his and Scott Jarvis’s article “MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment” as it’s a lot more comprehensive than the very brief, limited and ad-hoc assessment I make here (and I’ve probably misinterpreted some aspects of its correct usage).
  • All the MTLD values and part-of-speech tagging were computed in R, using the koRpus package implementation of MTLD-MA and TreeTagger.
  • D3 stacked bar chart source.

Using NLP to build a Black Metal Vocabulary

Black metal is typically linked, since its inception, to Satanic or anti-Christian themes. With the proliferation of bands in the 90s (after the Norwegian boom) and subsequent emergence of sub-genres, other topics such as paganism, metaphysics, depression and even nationalism came to the fore.

In order to discover the terminology used to explore these lyrical themes, I’ve devised a couple of term extraction experiments using the black metal data set. The goal here is to build a black metal vocabulary by discovering salient words and expressions, that is, terms that carry more information when used in BM lyrics than when used in a “normal” setting. For instance, the terms “Nazarene” or “Wotan” have a much higher weight in the black metal domain than in the general purpose corpus used for comparison. Once again, note that this does not necessarily mean that these two words occur very frequently in BM lyrics (I’d bet that “Satan” or “death” have a higher number of occurrences), but it indicates that, when they do, they carry more information within the BM context.
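The intuition of "more weight in the domain than in a general corpus" can be sketched in Python with a simple smoothed log-ratio of relative frequencies. This is a simplified stand-in, not the GlossEx or C-value algorithms themselves, and the two tiny corpora are invented.

```python
import math
from collections import Counter


def domain_specificity(domain_tokens, general_tokens):
    """Score each domain term by log(P(term | domain) / P(term | general)).

    Add-one smoothing keeps terms that are absent from the general corpus
    from causing a division by zero.
    """
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    vocab_size = len(set(domain_counts) | set(general_counts))
    scores = {}
    for term, count in domain_counts.items():
        p_domain = (count + 1) / (len(domain_tokens) + vocab_size)
        p_general = (general_counts[term] + 1) / (len(general_tokens) + vocab_size)
        scores[term] = math.log(p_domain / p_general)
    return scores


# Invented toy corpora: "death" is frequent in both, "wotan" only in the domain.
domain = "death death death wotan the the".split()
general = "the the the death death day".split()
scores = domain_specificity(domain, general)
print(sorted(scores, key=scores.get, reverse=True))
```

Here the rarer “wotan” outranks the more frequent “death”, which mirrors the point made above: salience is about contrast with the reference corpus, not raw counts.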

This task was carried out with JATE’s implementations of the GlossEx and C-value algorithms. The part-of-speech of each term (that is, the “type” of term) was discovered with the Stanford CoreNLP toolkit. The top 50 of each type (with the exception of adverbs) are listed in the table below. For the sake of visualization, I make a distinction between named entities/locations and the other nouns; the former are depicted in the word maps at the end of this post.

I’ve also included, in the last column of the table, the top term combinations. It’s noteworthy how many of these combinations are either negations of something (“no hope”, “no god”, “no life” and so on) or concerned with time (“eternal darkness”, “ancient times”). Such preoccupation with large extensions of “time” is also evident in the top adverbs (“eternally”, “forever”, “evermore”), adjectives (“endless”, “eternal”) and even nouns (“aeon” or “eon”).

Adjectives    Adverbs       Verbs       Nouns        Term combinations
Endless       Nevermore     Desecrate   Forefather   Life and death
Unhallowed    Eternally     Smolder     Armor        Human race
Luciferian    Tomorrow      Travel      Aeon         No light
Infernal      Infernally    Fuel        Splendor     No hope
Necromantic   Forever       Spiral      Pentagram    Eternal night
Paralyzed     Anymore       Dethrone    Perdition    No god
Pestilent     Mighty        Throne      Specter      Full moon
Unholy        Skyward       Envenom     Misanthrope  No life
Illusive      Evermore      Lay         Cross        Black metal
Untrodden     Earthward     Resound     Magick       Cold wind
Astral        Someday       Mesmerize   Nihil        No place
Misanthropic  Astray        Abominate   Ragnarok     No escape
Unmerciful    Onward        Paralyze    Blasphemer   No return
Cruelest      Verily        Blaspheme   Profanation  Eternal life
Blackest      Deathly       Impale      Misanthropy  No fear
Eternal       Forth         Cremate     Malediction  Flesh and blood
Wintry        Unceasingly   Bleed       Revenant     No matter
Bestial       Weightlessly  Procreate   Damnation    Fallen angel
Reborn        Anew          Enslave     Conjuration  Eternal darkness
Putrid        Demonically   Awake       Undead       No man
Darkest       -             Behold      Nothingness  Dark night
Unblessed     -             Intoxicate  Armageddon   Lost soul
Colorless     -             Devour      Lacerate     No end
Diabolic      -             Bury        Wormhole     Ancient time
Demonic       -             Demonize    Eon          No remorse
Wrathful      -             Forsake     Devourer     No reason
Nebular       -             Enshroud    Impaler      No longer
Vampiric      -             Writhe      Sulfur       Black cloud
Unchained     -             Destroy     Betrayer     Dark forest
Armored       -             Entomb      Deceiver     Human flesh
Immortal      -             Raze        Bloodlust    Endless night
Hellish       -             Flagellate  Reaper       Ancient god
Hellbound     -             Unleash     Horde        Mother earth
Unnamable     -             Convoke     Blasphemy    Black wing
Prideful      -             Crucify     Eternity     Night sky
Colorful      -             Fornicate   Defiler      Dark side
Unbaptized    -             Torment     Immolation   Eternal sleep
Unforgotten   -             Venerate    Soul         Black hole
Satanic       -             Beckon      Abomination  Black heart
Morbid        -             Defile      Flame        Flesh and bone
Sempiternal   -             Distill     Hail         No chance
Mortal        -             Immolate    Malignancy   Dark cloud
Honorable     -             Welter      Wrath        Final battle
Glooming      -             Run         Pestilence   Eternal fire
Willful       -             Sanctify    Gallow       No peace
Lustful       -             Eviscerate  Disbeliever  No future
Everlasting   -             Unchain     Witchery     Black soul
Impure        -             Ravage      Satanist     Final breath
Promethean    -             Mutilate    Lust         Black night

Most salient entities: many are drawn from the Sumerian and Nordic mythologies. I’ve also included in this bunch groups of animals (“Beasts”, “Locusts”).

Most salient locations. I’ve also included in this bunch nondescript places (“Northland”). Notice how most are concerned with the afterlife (surprisingly, “hell” is not one of them).

It occurred to me that these results could be the starting point of an automatic lyric generator (like the now defunct Scandinavian Black Metal Lyric Generator). Could be a fun project, if time allows (probably not).


IBM GlossEx

Jason Davies’ D3 Word Cloud

JATE – Java Automatic Text Extraction

Stanford CoreNLP

Middle-Earth Entity Recognition in Black Metal Lyrics

The influence of JRR Tolkien on black metal has been pervasive almost since its beginning. One of BM’s most (in)famous outfits, Burzum, took its name from a word invented by the Middle-Earth creator that signifies “darkness” in the Black Speech of Mordor. Other Norwegian acts such as Gorgoroth or Isengard adopted their names from notable Middle-Earth locations. Perhaps the best example is the Austrian duo Summoning, who have incorporated in their releases innumerable lyrical references (well, not innumerable, about 70 actually) to Tolkien’s works.

The references to Middle-Earth mythology abound in both lyrics and band monikers. Using a list of notable characters’ names and geographic locations as the basis for a named entity recognition task, I set out to find which are the most cited in the black metal data set.

With this list and a small Java NER script implemented for this task, I found 149 bands which have chosen a Middle-Earth location or entity for their name. Angmar is the most popular (6), closely followed by Mordor (5) and Sauron (5). With 4 occurrences each, there’s also Orthanc, Moria, Nargothrond, Carcharoth, Gorthaur and Morgoth.

As for actual lyrical references to these entities, I found a grand total of 736 of them. The ones that have at least two occurrences are depicted in the bubble chart below. It’s not surprising at all to find that the most common references (Mordor, Morgoth, Sauron, Moria, Saruman and Carcharoth) belong to malevolent characters, or dark and dangerous places, of the Tolkienesque lore. The “Black Gate” is also mentioned a lot, but it could have a meaning outside of the Middle-Earth mythology.
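The list-based recognition described above amounts to gazetteer matching. Here is a minimal Python sketch of the idea (the original script was written in Java and is not reproduced here); the function name, the three-entry gazetteer and the sample lyrics are all made up for illustration.

```python
import re
from collections import Counter


def count_gazetteer_mentions(texts, gazetteer):
    """Count case-insensitive, whole-word mentions of each gazetteer entry."""
    if not gazetteer:
        return Counter()
    # Try longer names first so multi-word entries are matched as a whole.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    canonical = {name.lower(): name for name in gazetteer}
    counts = Counter()
    for text in texts:
        for match in pattern.findall(text):
            counts[canonical[match.lower()]] += 1
    return counts


# Invented sample data.
gazetteer = ["Mordor", "Morgoth", "Black Gate"]
lyrics = ["From Mordor to the Black Gate", "mordor burns, Morgoth waits"]
mentions = count_gazetteer_mentions(lyrics, gazetteer)
print(mentions.most_common())
```

A plain string match like this cannot tell whether “Black Gate” refers to Tolkien’s gate or something else, which is exactly the ambiguity noted above.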


Bubble Chart built from Mike Bostock’s example

Usage of Specific Terms through Time

What about the usage of specific words in black metal lyrics across time? Have common terms – like life or death – been mentioned in a constant fashion through the years, or has their frequency changed dramatically?

The figure below plots the frequencies of a few selected words against time (in years). I’ve chosen death, life and time because they are among the most frequent terms in the whole lyrics data set. As for god and satan, well, if you don’t know why I picked them then that probably means you’re not acquainted with black metal at all, so I’ll refer you to Google or the nearest (decent) record shop to sort that out.

I’ve bundled a few synonyms and hyponyms with each term, taken from WordNet. This means that, for example, the occurrence count for satan also includes the counts of similar terms such as lucifer and devil.
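Counting a term together with its bundled synonyms, year by year, could look like the Python sketch below. The bundles are illustrative (only the satan bundle echoes the example above; nothing here queries WordNet), the songs are invented, and dividing by the number of words per year is an assumption about how frequencies might be normalised.

```python
from collections import Counter, defaultdict

# Illustrative bundles: each head term maps to the words counted under it.
BUNDLES = {
    "satan": {"satan", "lucifer", "devil"},
    "god": {"god"},
}


def yearly_relative_frequency(songs, bundles):
    """songs: iterable of (year, list_of_tokens). Returns {head: {year: share}}."""
    # Reverse lookup from any bundled word to its head term.
    lookup = {word: head for head, words in bundles.items() for word in words}
    hits = defaultdict(Counter)
    totals = Counter()
    for year, tokens in songs:
        totals[year] += len(tokens)
        for token in tokens:
            head = lookup.get(token.lower())
            if head is not None:
                hits[head][year] += 1
    return {
        head: {year: hits[head][year] / totals[year] for year in sorted(totals) if totals[year]}
        for head in bundles
    }


# Invented sample data.
songs = [
    (1998, ["Satan", "rides", "lucifer", "falls"]),
    (2006, ["devil", "and", "god", "and", "life"]),
]
frequencies = yearly_relative_frequency(songs, BUNDLES)
print(frequencies)
```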

Looking at the plot we can see that death was at its highest point around 1998 and has been decreasing since then (being surpassed by life in 2006/07), up until 2012. And notice how satan closely follows god across the years. This probably means that most lyrics that mention one of these entities also mention the other.

Part II – Frequent Words in Black Metal Lyrics

In the last post we tried to discover the most common terms used in black metal lyrics. One of the first questions that popped up was whether there are differences between countries regarding the most frequent words. To answer this (on a very small scale) I’ve subsetted the original data set into two smaller sets: one for lyrics penned by Norwegian bands and the other for Iraqi bands. The following bar plot shows the top 15 most frequent words found in the lyrics of Norwegian bands. It does not seem to differ much from the global top 15, presented in our previous post.

Below you’ll find the most frequent words in the lyrics of Iraqi bands. Not only does it look much different from the Norwegian bar plot, it also differs significantly from the global results. I find it very interesting that lies corresponds to 0.9% of the total occurrences. This and the presence of both truth and blasphemy seem to point to some sort of deeper meaning here. Or maybe it’s all just a coincidence because, again, with no contextual analysis we can’t really infer much. At any rate, it’s very likely that the lyrical concerns of Norwegian and Iraqi bands are distinct.

Part I – Frequent Words in Black Metal Lyrics

Ever wondered what, if any, patterns are there to be discovered in black metal lyrics? Well, I did, and started by simply finding out which words occur the most in this data set. After some cleaning and pre-processing, I’ve ended up using lyrics of 76039 songs by 24086 bands, from 116 different countries. Stop words (which can be roughly defined as very common and very uninformative words like the or or) were removed in this pre-processing stage. In the end, a total of 258610 distinct words occur, with the number of occurrences summing up to 5304046.

The following bar plot shows the top 15 most used words across the whole lyrics data set.

The most common term, death (not at all unexpected), represents 0.7% of the total number of occurrences of all distinct words. Other more or less expected results such as blood or darkness also make an appearance, but it is somewhat intriguing to find time in the top 5. So, what does this all mean? Well, not much (yet): simply counting the number of occurrences of individual words is not a good indicator of “meaning” because it discards the context in which the words appear, as well as the relationships between them, though it does provide very helpful hints.
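The counting itself is simple enough to sketch in a few lines of Python. The stop word list below is a tiny illustrative one, the tokenizer is a naive regular expression, and the lyrics are invented; none of this is the actual pre-processing pipeline used for the data set.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "or", "and", "of", "a", "in", "to"}


def word_shares(lyrics, stop_words=STOP_WORDS):
    """Return (word, count, percentage of all kept occurrences), most frequent first."""
    counts = Counter()
    for text in lyrics:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    total = sum(counts.values())
    return [(word, count, 100.0 * count / total) for word, count in counts.most_common()]


# Invented sample data.
sample = ["Death and the blood of death", "Darkness or death"]
shares = word_shares(sample)
print(shares[0])
```

The percentage is computed over the occurrences that survive stop word removal, which matches how the shares are described above.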