Conspiracy Theories – Topic Modeling & Keyword Extraction

It’s been a while since I’ve posted something related to topic modeling, and I decided to do so after stumbling upon a conspiracy theory document set that, albeit small, seemed an interesting starting point for building a topic model on the subject of conspiracies.

Unfortunately, these documents seem to be a bit dated, as I couldn’t easily find references to more recent conspiracy theories like chemtrails or “vaccines are a government/big pharma plot to poison people”. At any rate, I’m sure they provide clues to the major themes behind conspiracy theories. As a preliminary exercise, I used JATE’s implementation of GlossEx to extract keywords from the documents. Some of the most “salient” ones (i.e., among the top 50 or so) are represented in the bubble chart below. The size of a bubble represents the score of the keyword, and among them we can see mentions of the CIA, Jews, Tesla, the Illuminati and JFK. Yep, seems like a conspiracy data set, alright!

And now let’s explore the data set in more depth by building a topic model. ‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. To summarize: topic modeling is a process that decomposes the documents into sets of probabilities. The final outcome of this process is a mathematical model that represents a piece of text as a mixture of different topics, with each topic having a weight (that is, a probability) associated with it. The higher the weight of a topic, the more important it is in characterizing the text. Another aspect of this mathematical model is that the terms that compose a topic also have different weights: the higher their value, the more important they are for characterizing the topic.
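To make the “mixture of topics” idea concrete, here’s a tiny sketch (in Python rather than the R used below, and with topic names and weights invented purely for illustration):

```python
# Toy "document as a mixture of topics": the topic weights are
# probabilities summing to 1, and each topic is itself a weighted set of
# terms. All names and numbers here are made up for illustration.
doc_topic_weights = {"ufo": 0.55, "freemasonry": 0.30, "jfk": 0.15}

topic_terms = {
    "ufo": {"saucer": 0.12, "roswell": 0.09, "alien": 0.07},
    "freemasonry": {"lodge": 0.11, "mason": 0.10, "ritual": 0.06},
    "jfk": {"dallas": 0.13, "oswald": 0.08, "motorcade": 0.05},
}

# The higher the weight of a topic, the more it characterizes the
# document; the dominant topic is simply the highest-weighted one.
dominant = max(doc_topic_weights, key=doc_topic_weights.get)
print(dominant)  # ufo
```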

I’ve used Mallet (Java) and STMT (Scala) before for topic modeling, so I chose something different this time, the R topicmodels package, to build a model for these 432 documents. Here’s a sample of the code; note that the LDA function accepts the corpus as a DocumentTermMatrix object of unigrams, bigrams and trigrams. Note also that the topic model has 35 topics. This number was chosen after inspecting the log-likelihood of multiple LDA models, each with a different number of topics. I think 35 topics is excessive for such a small data set, but I’ll use this criterion just for the sake of having a method that determines this parameter.

library(tm)          # corpus handling and the DocumentTermMatrix
library(RWeka)       # NGramTokenizer
library(topicmodels) # LDA

corpus <- Corpus(DirSource("texts"))
corpus <- tm_map(corpus, stripWhitespace)

# Tokenizer producing unigrams, bigrams and trigrams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

dtm <- DocumentTermMatrix(corpus,
                          control = list(tokenize = TrigramTokenizer,
                                         stemming = FALSE,
                                         stopwords = TRUE,
                                         removePunctuation = TRUE))

# Build topic model with 35 topics, previously determined with the logLik function
lda <- LDA(dtm, 35, method = "Gibbs", control = list(seed = 123, verbose = 1, iter = 1500))

#inspect word distribution per topic
beta <- lda@beta

#inspect documents composition as mixtures of topics
gamma <- lda@gamma

The following multiple D3 word cloud was built after inspecting the beta object of the model (which tells us the n-grams that compose each topic, and also the weight of each n-gram within the topic) and choosing 9 of the 35 topics (some topics were redundant or composed of non-informative terms). The size and the opacity of a term in the visualization reflect its weight. There are topics for all tastes: UFOs, freemasonry, the New World Order, Tesla and, strangely, one that mixes Nazis, George Bush and oil (topic 3). By the way, the code used for the multiple word cloud comes from this blog post by Elijah Meeks, and it’s a very nice and easy way of representing topics.
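Picking the top terms of a topic from a beta-like matrix boils down to sorting one row. A quick sketch (Python instead of R, with a made-up two-topic matrix; like lda@beta, the values are log-probabilities):

```python
import math

# Hypothetical miniature "beta": rows are topics, columns are terms,
# values are log-probabilities (invented numbers for illustration).
terms = ["tesla", "coil", "energy", "cia", "agent"]
beta = [
    [math.log(0.40), math.log(0.30), math.log(0.20), math.log(0.06), math.log(0.04)],
    [math.log(0.05), math.log(0.05), math.log(0.10), math.log(0.45), math.log(0.35)],
]

def top_terms(topic_row, terms, k=3):
    """Return the k highest-weighted terms of one topic."""
    ranked = sorted(zip(terms, topic_row), key=lambda p: p[1], reverse=True)
    return [t for t, _ in ranked[:k]]

print(top_terms(beta[0], terms))  # ['tesla', 'coil', 'energy']
print(top_terms(beta[1], terms))  # ['cia', 'agent', 'energy']
```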


Classification of Political Statements by Orator with DSX

After all that keyword extraction from political speeches, it occurred to me that it would be interesting to find out whether it’s possible to build a model that predicts the political orator to whom a statement, or even a complete speech, belongs. By statement, I mean a sentence with more than a couple of words drawn from a speech (not including interviews or political debates, for example). I took the 327 speeches by 12 US presidents used in the previous post as the basis of a document data set, and added to it a few dozen speeches by other, non-American, dictatorial political leaders to create a set appropriate for a classification task.

I intended to explore two different routes:

  1. Build a predictive model from a collection of sentences previously classified as being uttered either by a US President or by some other politician (all non-American political leaders of the 20th century, all dictators) during an official speech. This can then be defined as a binary classification problem: either a statement is assigned to a politician of the class “USPresident” or it isn’t. All the sentences (33500 in total) were drawn from the speeches mentioned above.
  2. Build another model from a collection of speeches by a diverse group of orators, such that the model can correctly assign to a previously unseen speech the person associated with it. This can be defined as a multi-class classification problem.

This all sounds reasonable and potentially interesting (and, who knows, even useful), but building predictive models from text-based data is a very cumbersome task because there’s always a multitude of things to decide beforehand, which includes:

  • How to represent the text? Learning algorithms can’t deal with text in its original, “as-is” format, so there’s a number of preprocessing steps to take in order to transform it into a set of numerical/categorical/ordinal/etc. features that make sense. There are numerous feature types and transformations I could explore here, like representing the text as a weighted vector space model, using word-based or character-based features, using part-of-speech tags or entities as additional features, building topic models and using the topic probabilities for each document, and so on. The problem is that I do not have enough time (nor the patience) to decide efficiently on the most appropriate feature representation for my speech/sentences data set.
  • Dimensionality curse: Assuming I’ve managed to find some good text representation, it’s almost certain the final dimensions of the data set to be presented to the learning algorithm will be prohibitive. Again, there are numerous feature selection methods that can be employed to help me ascertain which features are more informative and discard the rest. I don’t really care about trying them all.
  • What learning algorithm is appropriate? Finally, which algorithms to use for these two classification tasks. Again, there are hundreds of them out there, not to mention countless parameters to tune, cross-validation techniques to test, different evaluation measures to optimize, and so on.

To avoid losing too much time with all of this stuff just for the sake of a blog post, I decided to use DSX, for two simple reasons: 1) it accepts text in its original format and does all the feature transformation/selection/extraction steps all by itself, so I don’t need to worry about that stage, and 2) it tests hundreds of different algorithms and combinations of algorithms to find the best model for the data.

The only pre-processing done to the data sets prior to uploading them as csv files to DSX was:

  1. Subsetting each data set into one training portion, from which to build a predictive model, and a testing portion used to evaluate the model on data unknown to it (and make sure there was no overfitting).
  2. To make things more challenging, I replaced all entities mentioned with their entity type. This is because a sentence or speech mentioning specific dates, people and locations can be easily assigned to the correct orator using those entities alone. For example, “It is nearly five months since we were attacked at Pearl Harbor” is obviously something that only FDR could have said. “Pearl Harbor” is a clear hint of the true class of the sentence, so to make things more difficult for DSX, it gets replaced with the placeholder “LOCATION”. A similar replacement is applied to entities like organizations, dates or persons, with the help of the Stanford CoreNLP toolkit.
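The replacement step itself is trivial once the entity spans are known; here’s a sketch in Python where a hand-made dictionary stands in for the output of the Stanford toolkit:

```python
# A stand-in for the CoreNLP step described above: the entity mentions and
# their types are hand-made here rather than produced by a real NER model.
entity_types = {
    "Pearl Harbor": "LOCATION",
    "December 7, 1941": "DATE",
}

def mask_entities(sentence, entity_types):
    """Replace each known entity mention with its type placeholder."""
    for mention, etype in entity_types.items():
        sentence = sentence.replace(mention, etype)
    return sentence

s = "It is nearly five months since we were attacked at Pearl Harbor"
print(mask_entities(s, entity_types))
# It is nearly five months since we were attacked at LOCATION
```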

The first model built was the one for the binary version of the data set (i.e., a sentence either belongs to a US president or to a non-American political leader), using a total of 26792 sentences. Of a total of 8500 examined models, DSX found one generated with the Iterative OLS algorithm to be the best, estimating accuracy (that is, the percentage of sentences correctly assigned to their respective class) to fall between 76% and 88%, and average recall (that is, the averaged percentages of correct assignments for each class) to fall in the range of 78% to 88%. Given that the “NON US PRESIDENT” class is about two thirds the size of the “US PRESIDENT” class, average recall is a better evaluation measure than regular accuracy for this particular data set.
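Both measures are easy to compute by hand. A small Python sketch, with an imbalanced toy label set, showing why macro-averaged recall is more informative than plain accuracy here:

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def average_recall(y_true, y_pred):
    """Macro-averaged recall: per-class recall, averaged over classes."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Imbalanced toy labels: 4 "US" examples vs 2 "OTHER".
y_true = ["US", "US", "US", "US", "OTHER", "OTHER"]
y_pred = ["US", "US", "US", "US", "US", "OTHER"]
print(accuracy(y_true, y_pred))        # 0.8333...
print(average_recall(y_true, y_pred))  # (1.0 + 0.5) / 2 = 0.75
```

Accuracy looks flattering because the majority class dominates it; the averaged per-class recalls expose the weaker minority class.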


ForecastThis DSX estimated qualities of the best predictive model for the binary political sentences data set.

To make sure the model is not overfitting the training data, and that the estimates above are correct, I sent DSX a test set of sentences with no labels assigned, and compared the returned predictions with the ground truth. Turns out accuracy is around 82% and average recall approximately 80%. This is a great result overall and it means we’ve managed to build a model that could be useful, for example, for automatic annotation of political statements.

And just for the record, here are a few examples of sentences that the model did not get right:

  • Sentences by US Presidents marked as belonging to non-US political leaders (dictators):
    • We have no territory there, nor do we seek any.
    • That is why we have answered this aggression with action.
    • Freedom’s fight is not finished.
  • Sentences by non-US political leaders (dictators) marked as belonging to US presidents:
    • The period of war in [LOCATION] is over.
    • The least that could be said is that there is no tranquillity, that there is no security, that we are on the threshold of an uncontrollable arms race and that the danger of a world war is growing; the danger is growing and it is real.

I doubt a person could do much better just by reading the text, with no additional information.

The second model was built from a train set of 232 speeches, each labeled with the respective orator (11 in total). The classes are very unbalanced (that is, the number of examples for each label varies greatly), and some of them are quite small, which makes average recall the best measure to pay attention to when assessing the quality of the predictions made by the model. The best model DSX found was built with Multiquadric Kernel Regression, and although it has a hard time learning three of the eleven classes (see figure below), it’s actually a lot better than what I expected, given the skewness of the data and the fact that all entities were removed from the text.


ForecastThis DSX predictive model for political speeches by 11 orators. The best model was built with Multiquadric Kernel Regression.

And what about the model’s performance in the test set? It more or less follows the estimated performance of the trained model: it fails to classify correctly speeches by Hitler (classifying them as belonging to FDR instead), and by Nixon (which are assigned to Lyndon B. Johnson). On the other hand, it does classify correctly all the instances of Reagan, FDR, Stalin, and most of Bill Clinton’s speeches. I’m sure if I provided a few more examples for each class, the results would greatly improve.

To conclude: this model, alongside the very good model obtained for the first data set, illustrates how it is possible to quickly obtain predictive models useful for text annotation of political speeches. And all this with minimal effort, given that DSX can evaluate hundreds of different models very quickly, and also handle the feature engineering side of things, prior to the supervised learning step.

*Disclaimer: I work for ForecastThis, so shameless self-promotion trigger warning goes here.


  • ForecastThis DSX
  • US Presidential speeches harvested from the Miller Center Speech Archive



Automatic Keyword Extraction from Presidential Speeches

Keyword extraction can be defined as the automatic extraction of terms from a set of documents that are in some way more relevant than others in characterizing the corpus’ domain. It’s a task widely used for bio-medical document characterization, for example, and in general it is very useful as a basis for text classification or summarization.

There are several methods out there in the world that perform this task, some of which use a reference corpus as a means to determine which terms in the test corpus are more unusual relatively to what would be expected, and others that only look at the content of the test corpus itself and search for the most meaningful terms within it.

The toolkit JATE offers a number of these algorithms implemented in Java. I chose C-value to extract keywords from a set of speech transcripts by 12 presidents of the United States (from FDR to George W. Bush), which were harvested from the Miller Center Speech Archive. I harvested a total of 327 of these speeches, and my goal is to get a set of keywords that characterizes the set of speeches of each orator (that is, get a set of extracted keywords per president).

The reason why I chose C-value (which in recent years became C/NC-value) is that it doesn’t need a reference corpus and can extract multi-word keywords: it’s a hybrid method that combines a term-frequency based approach (“termhood”) with an inspection of the frequencies of a term used as part of a larger term (“unithood”)[1][2].
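As a rough illustration of that termhood/unithood trade-off, here’s a simplified Python sketch of the C-value score (toy frequencies; this is my reading of the published formula, not JATE’s implementation):

```python
import math

def c_value(term, freq, candidates):
    """Simplified C-value: a term nested inside longer candidate terms has
    its frequency discounted by the average frequency of its containers.

    `candidates` maps each candidate term to its raw corpus frequency.
    """
    length = len(term.split())
    containers = [f for t, f in candidates.items() if t != term and term in t]
    if not containers:
        # non-nested: score grows with length and frequency
        return math.log2(length) * freq if length > 1 else freq
    # max(length, 2) avoids a zero factor for nested unigrams
    return math.log2(max(length, 2)) * (freq - sum(containers) / len(containers))

candidates = {
    "social security": 40,
    "social security reform": 15,
    "health care": 60,
}
# "social security" is nested inside "social security reform", so its
# frequency 40 is discounted by its container's frequency 15.
print(c_value("social security", 40, candidates))  # 25.0
print(c_value("health care", 60, candidates))      # 60.0
```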

Here’s a collapsible tree of keywords among the top 20 for each president (click a node to expand). The size of the keyword node reflects its score as determined by the C-value algorithm. “Health care”, for example, has a very large weight in the speeches by Clinton. Overall, the Middle East, social security, energy and world conflicts seem to be the basis of the keywords found by C-value.

For visualization purposes, I’ve manually selected 10 of the top 20 key terms, because quite a few showed up for all presidents (stuff like “american people”, “american citizens”, “americans”), so those were discarded.

Another, more recent, keyword extraction algorithm that doesn’t need a reference corpus is RAKE which, to quote its authors[3], is

 based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words ‘and’, ‘the’, and ‘of’


RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords […] Co-occurrences of words within these candidate keywords are meaningful and allow to identify word cooccurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.
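The candidate-extraction step quoted above is simple to reproduce. A minimal Python sketch, with a tiny ad-hoc stop word list and without RAKE’s subsequent degree/frequency scoring:

```python
import re

# Tiny illustrative stop word list; real RAKE uses a full one.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "that"}

def rake_candidates(text):
    """Split text into candidate keywords at stop words and punctuation,
    as RAKE does before scoring the candidates."""
    words = re.split(r"[^a-zA-Z']+", text.lower())
    candidates, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:  # a stop word/delimiter closes the candidate
                candidates.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        candidates.append(" ".join(current))
    return candidates

print(rake_candidates("liberation in commie language means conquest"))
# ['liberation', 'commie language means conquest']
```

Note how the stop word “in” truncates the candidate, which is exactly the loss of impact discussed below for Truman’s statement.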

The top results are not as concise as the ones obtained with C-value, but they still provide some clues to the topics addressed by each president. I think FDR’s “invisible thing called ‘conscience'” is my favourite. It also seems to me that splitting the text by stopwords might cause keywords to lose some of their original impact: take for example Truman’s statement ‘liberation in commie language means conquest’, which gets truncated to ‘commie language means conquest’.




Death Row Inmates: Last Words

Word clouds aren’t the best of data visualizations. They’re often too simplistic, representing a small sample of words out of context. I felt, however, that a word cloud would be appropriate to convey the most frequent terms present in the last statements of death row inmates.

That’s because the majority of these statements is typically composed of a few sentences, in which the inmates say goodbye to their families. Many apologise to their victims; some protest their innocence until the end. A few simply state they’re now ready to die. There’s not much variety here, so representing the top terms proportionally to their frequency will not, I think, be an inaccurate representation.

The following word cloud was generated after stop word removal of 518 last statements of Texas death row inmates, executed between 1982 and 2014. These statements were harvested from the Texas Department of Criminal Justice website. 

Here’s the top 15 words and their counts:

  • love: 634
  • family: 290
  • god: 203
  • life: 149
  • hope: 131
  • lord: 130
  • forgive: 127
  • people: 125
  • peace: 96
  • jesus: 96
  • give: 95
  • death: 92
  • pain: 81
  • strong: 81
  • warden: 77

It’s not at all surprising that “love”, “god” and “family” are at the very top. Here’s a sample of the most common bigrams and trigrams (i.e., sequences of two and three words):

  • “i love you”
  • “i would like”
  • “i am sorry”
  • “i am ready”
  • “i am going”
  • “thank you”
  • “my family”
  • “forgive me”
  • “stay strong”


Lexical Diversity: Black Metal vs. Pop Queens

Lexical diversity is typically defined as a measure of the uniqueness of the words used in a text, that is, the proportion of distinct words across the text. This type of measure is indicative of the vocabulary richness of the text and of general writing quality.

There are several ways of measuring this, but I’ll focus on MTLD (Measure of Textual Lexical Diversity), as it’s very sensitive and less prone to be affected by text length, unlike more traditional metrics. MTLD, by the way, is defined as the mean length of sequential word strings in a text that maintain a given type-token ratio; in other words, sequences that have a high proportion of unique words.

To get an idea of how diverse the vocabulary used in black metal is, I’ve measured MTLD for the song lyrics of 18 bands, which were selected (more or less) randomly.

The text of each band consists of the entirety of their lyrics, after removal of text portions in languages other than English. To make things slightly more interesting, I also computed the MTLD values for the three queens of pop (or maybe they’ve been dethroned since I last checked the current pop pulse; it’s been a while).

There’s a handful of caveats to the experiment I’m describing here, the most evident being that 1) I’m assuming the traditional parameter values of MTLD are suitable for song lyrics, 2) I’ve removed all lyrics totally or partially written in languages other than English, and 3) there’s a lot of intentional line repetition in songs (choruses and the like), something that is more prevalent in pop music than in black metal. With that in mind, I removed such duplicates from both data sets, which actually improved (albeit not significantly) the pop artists’ MTLD values.

That said, it’s not at all surprising (to me; admitting my bias here) to see Dodheimsgard at the top, or that MTLD values for the three pop singers are a lot lower than for all of the selected black metal bands. However, keep in mind that lexical diversity measures don’t explicitly take into account sentence structure or grammar, so we can’t really infer the degree of quality (for lack of a better expression) of how the words are used.
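For the curious, the core of MTLD can be sketched in a few lines of Python. The published measure averages a forward and a backward pass and uses the koRpus moving-average variant below (MTLD-MA), so treat this one-directional pass as an approximation:

```python
def mtld_forward(tokens, ttr_threshold=0.72):
    """One-directional MTLD sketch: walk through the tokens and, each time
    the running type-token ratio falls to the threshold, close a "factor"
    and reset; the score is total tokens divided by the factor count."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= ttr_threshold:
            factors += 1
            types, count = set(), 0
    if count:  # remaining partial factor, counted proportionally
        ttr = len(types) / count
        if ttr < 1.0:
            factors += (1 - ttr) / (1 - ttr_threshold)
    return len(tokens) / factors if factors else float("inf")

# Chorus-like repetition closes factors quickly -> low diversity score.
repetitive = "night eternal night eternal night eternal night eternal".split()
diverse = "frost moon serpent throne abyss winter raven fire ash oath".split()
print(mtld_forward(repetitive) < mtld_forward(diverse))  # True
```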
Band/Artist           MTLD-MA   Lexical Density (%)
Lady Gaga                  56   53.39
Beyonce                    56   51.30
Rihanna                    58   52.89
Abigor                    112   57.35
Blacklodge                 91   59.70
Clandestine Blaze          86   63.15
Corpus Christii            94   57.48
Craft                      73   51.15
Cultes des Ghoules        114   59.10
Darkthrone                104   57.65
Deathspell Omega          101   51.39
Dodheimsgard              132   58.17
Emperor                    81   54.29
Immortal                   73   58.94
Inquisition                72   58.02
Mayhem                     83   59.70
Mutiilation               103   56.08
Ride for Revenge          102   58.70
Satanic Warmaster          70   55.47
Satyricon                  79   54.04
Solefald                   84   56.98

The rightmost column of the table above displays values of lexical density, which should not be confused with lexical diversity. The former is defined as the proportion of content words (such as nouns, adjectives and verbs) present in the text; other categories of words are said to be functional (such as determiners). I’ll follow here a rough interpretation of Halliday’s definition of lexical density and consider adverbs as content words.
As far as content word classes go, adjectives, nouns and prepositions (e.g., “than”, “beyond”, “under”, “into”) are less common in the pop lyrics than in the black metal lyrics analysed here. Usage of pronouns, however (e.g., “me”, “you”, “we”), is a lot more evident in the pop lyrics.
Focusing solely on the black metal lyrics, the same type of distribution is observed for each band, with the exception of nouns, which for some reason are a lot more prevalent in Clandestine Blaze and Blacklodge lyrics, amounting to more than one third of all the words used. Another aspect where Blacklodge, along with Craft and Deathspell Omega, deviate considerably from the rest of the bands is the “Other” category. This word class is actually the aggregation of smaller classes, such as digits, punctuation and symbols. Craft, in particular, use a good amount of punctuation. Another interesting thing is seeing Deathspell Omega at the bottom of the table with regard to lexical density (i.e., actual content words), albeit scoring high in the lexical diversity department.
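Lexical density itself is the simpler of the two measures: count content-word tags and divide. A Python sketch with hand-written tags standing in for a real POS tagger’s output:

```python
# Counting adverbs as content words, per the rough Halliday reading above.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def lexical_density(tagged_tokens):
    """Percentage of content words among all tokens; tags would normally
    come from a POS tagger, but are hand-written here for illustration."""
    content = sum(tag in CONTENT_TAGS for _, tag in tagged_tokens)
    return 100 * content / len(tagged_tokens)

tagged = [("the", "DET"), ("frost", "NOUN"), ("covers", "VERB"),
          ("my", "PRON"), ("cold", "ADJ"), ("throne", "NOUN")]
print(round(lexical_density(tagged), 2))  # 66.67
```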


  • MTLD was first suggested, and subsequently developed, by Philip McCarthy while at the University of Memphis. I strongly suggest reading his and Scott Jarvis’s article “MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment”, as it’s a lot more comprehensive than the very brief, limited and ad-hoc assessment I make here (and I’ve probably misinterpreted some aspects of its correct usage).
  • All the MTLD values and part-of-speech tagging were computed in R, using the koRpus package implementation of MTLD-MA and TreeTagger.
  • D3 stacked bar chart source.

Around the World with Satan

The following map displays, for each country, the rank of the word “Satan” in black metal lyrics written between 1980 and 2013. This ranking is calculated as the ratio of the total number of times “Satan” occurs to the maximum raw frequency of any term in the country’s lyrics, after stop-word removal. The darker the shade of blue, the higher up in the term-ranking is “Satan” for that country. Filipino bands throw the S-word around a lot more than the rest of the world, at least in comparison with other frequent terms in their lexicon, followed by a number of countries in Latin America.
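The ranking behind the map’s shading is just a ratio of term frequencies. A minimal Python sketch (toy tokens, stop words assumed already removed):

```python
from collections import Counter

def satan_ratio(tokens, word="satan"):
    """Frequency of `word` divided by the frequency of the most common
    term in the token list: 1.0 means it IS the top term."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return counts[word] / max(counts.values())

lyrics = "satan hail satan darkness night satan night".split()
print(satan_ratio(lyrics))  # 1.0 - "satan" is the most frequent term
```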

Click here for a larger version.

As for the number one (as in, “first”) most frequent word for each country, here’s a selected sample with some amusing entries:

  • Brunei: human
  • Kazakhstan: rape
  • Mongolia: soul
  • Costa Rica: lord
  • Honduras: cold
  • Barbados: ocean
  • Jamaica: pussy
  • Japan: hell

Note that the actual size of each country’s corpus (that is, the total number of terms) has some influence over the computed ratios. Since some countries are a lot more prolific, black metal-wise, than others, take this analysis with a grain of salt.


Using NLP to build a Black Metal Vocabulary

Black metal has been linked, since its inception, to Satanic or anti-Christian themes. With the proliferation of bands in the 90s (after the Norwegian boom) and the subsequent emergence of sub-genres, other topics such as paganism, metaphysics, depression and even nationalism came to the fore.

In order to discover the terminology used to explore these lyrical themes, I’ve devised a couple of term extraction experiments using the black metal data set. The goal here is to build a black metal vocabulary by discovering salient words and expressions, that is, terms that when used in BM lyrics carry more information than when used in a “normal” setting. For instance, the terms “Nazarene” or “Wotan” have a much higher weight in the black metal domain than in the general purpose corpus used for comparison. Once again, note that this does not necessarily mean that these two words occur very frequently in BM lyrics (I’d bet that “Satan” or “death” have a higher number of occurrences), but it indicates that, when they do, they carry more information within the BM context.

This task was carried out with JATE‘s implementations of the GlossEx and C-value algorithms. The part-of-speech of each term (that is, the “type” of term) was discovered with the StanfordNLP toolkit. The top 50 of each type (with the exception of adverbs) are listed in the table below. For the sake of visualization, I make a distinction between named entities/locations and the other nouns, with the former depicted in the word maps at the end of this post.

I’ve also included, in the last column of the table, the top term combinations. It’s noteworthy how many of these combinations are either negations of something (“no hope”, “no god”, “no life” and so on), or concerned with time (“eternal darkness”, “ancient times”). Such preoccupation with large extensions of “time” is also evident in the top adverbs (“eternally”, “forever”, “evermore”), adjectives (“endless”, “eternal”) and even nouns (“aeon” or “eon”).

Endless Nevermore Desecrate Forefather Life and death
Unhallowed Eternally Smolder Armor Human race
Luciferian Tomorrow Travel Aeon No light
Infernal Infernally Fuel Splendor No hope
Necromantic Forever Spiral Pentagram Eternal night
Paralyzed Anymore Dethrone Perdition No god
Pestilent Mighty Throne Specter Full moon
Unholy Skyward Envenom Misanthrope No life
Illusive Evermore Lay Cross Black metal
Untrodden Earthward Resound Magick Cold wind
Astral Someday Mesmerize Nihil No place
Misanthropic Astray Abominate Ragnarok No escape
Unmerciful Onward Paralyze Blasphemer No return
Cruelest Verily Blaspheme Profanation Eternal life
Blackest Deathly Impale Misanthropy No fear
Eternal Forth Cremate Malediction Flesh and blood
Wintry Unceasingly Bleed Revenant No matter
Bestial Weightlessly Procreate Damnation Fallen angel
Reborn Anew Enslave Conjuration Eternal darkness
Putrid Demonically Awake Undead No man
Darkest Behold Nothingness Dark night
Unblessed Intoxicate Armageddon Lost soul
Colorless Devour Lacerate No end
Diabolic Bury Wormhole Ancient time
Demonic Demonize Eon No remorse
Wrathful Forsake Devourer No reason
Nebular Enshroud Impaler No longer
Vampiric Writhe Sulfur Black cloud
Unchained Destroy Betrayer Dark forest
Armored Entomb Deceiver Human flesh
Immortal Raze Bloodlust Endless night
Hellish Flagellate Reaper Ancient god
Hellbound Unleash Horde Mother earth
Unnamable Convoke Blasphemy Black wing
Prideful Crucify Eternity Night sky
Colorful Fornicate Defiler Dark side
Unbaptized Torment Immolation Eternal sleep
Unforgotten Venerate Soul Black hole
Satanic Beckon Abomination Black heart
Morbid Defile Flame Flesh and bone
Sempiternal Distill Hail No chance
Mortal Immolate Malignancy Dark cloud
Honorable Welter Wrath Final battle
Glooming Run Pestilence Eternal fire
Willful Sanctify Gallow No peace
Lustful Eviscerate Disbeliever No future
Everlasting Unchain Witchery Black soul
Impure Ravage Satanist Final breath
Promethean Mutilate Lust Black night

Most salient entities: many are drawn from the Sumerian and Nordic mythologies. I’ve also included in this bunch groups of animals (“Beasts”, “Locusts”).

Most salient locations. I’ve also included in this bunch non-descript places (“Northland”). Notice how most are concerned with the afterlife (surprisingly, “hell” is not one of them).

It occurred to me that these results could be the starting point of an automatic lyric generator (like the now defunct Scandinavian Black Metal Lyric Generator). Could be a fun project, if time allows (probably not).


IBM GlossEx

Jason Davies’ D3 Word Cloud

JATE – Java Automatic Text Extraction

StanfordNLP Core

Middle-Earth Entity Recognition in Black Metal Lyrics

The influence of JRR Tolkien on black metal has been pervasive almost since its beginning. One of BM’s most (in)famous outfits, Burzum, took its name from a word invented by the Middle-Earth creator that signifies “darkness” in the Black Speech of Mordor. Other Norwegian acts such as Gorgoroth or Isengard adopted their names from notable Middle-Earth locations. Perhaps the best example is the Austrian duo Summoning, who have incorporated in their releases innumerable lyrical references (well, not innumerable, about 70 actually) to Tolkien’s works.

The references to Middle-Earth mythology abound in both lyrics and band monikers. Using a list of notable characters’ names and geographic locations as the basis for a named entity recognition task, I set out to find which are the most cited in the black metal data set.
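Since the entity list is closed and known, the recognition step reduces to dictionary matching. A hedged Python sketch with a tiny subset of such a list (multi-word names like “Black Gate” would additionally need phrase matching):

```python
from collections import Counter

# A tiny illustrative subset of the full Middle-Earth name list.
MIDDLE_EARTH = {"mordor", "sauron", "angmar", "moria", "morgoth", "gorgoroth"}

def find_entities(text):
    """Count Middle-Earth names mentioned in a text (single-word matches
    only; real lyrics would also need tokenization beyond whitespace)."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in MIDDLE_EARTH)

lyric = "through the gates of Mordor where Sauron waits in Mordor"
print(find_entities(lyric))  # Counter({'mordor': 2, 'sauron': 1})
```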

With this list and a small Java NER script implemented for this task, I found 149 bands that have chosen a Middle-Earth location or entity for their name. Angmar is the most popular (6), closely followed by Mordor (5) and Sauron (5). With 4 occurrences each, there’s also Orthanc, Moria, Nargothrond, Carcharoth, Gorthaur and Morgoth.

As for actual lyrical references to these entities, I found a grand total of 736 of them. The ones that have at least two occurrences are depicted in the bubble chart below. It’s not surprising at all to find that the most common references (Mordor, Morgoth, Sauron, Moria, Saruman and Carcharoth) belong to malevolent characters, or dark and dangerous places, of the Tolkienesque lore. The “Black Gate” is also mentioned a lot, but it could have a meaning outside of the Middle-Earth mythology.


Bubble Chart built from Mike Bostock’s example

Part IV – Record Labels and Lyrical Content

In this fourth and final part of topic discovery in black metal lyrics, we’ll address the issue of assigning topics to record labels based on the lyrical content of their black metal releases. In other words, we want to find out if a given label has a tendency to release bands that write about a particular theme. We’ll also investigate the temporal evolution of these topics, that is, what changes happened through the years regarding the usage of topics in black metal lyrics. This aims to shed some light on the issue of whether lyrical content has remained the same throughout the years.

In order to address these questions, I turned once more to topic modeling. This machine learning technique was mentioned in parts I, II and III of this post, so knock yourself out reading those. If that does not appeal to you, let’s sum things up by saying that topic modeling aims to infer automatically (i.e., with minimum human intervention) topics underlying a collection of texts. “Topic” in this context is defined as a set of words that (co-)occur in the same context and are semantically related, somehow.

Instead of using the topic model built for parts II and III, I generated a new one after some (sensible, I hope) cleaning of the data set. This pre-processing involved, among other things, removal of lyrics that were not fully translated to English and lyrics with fewer than 5 words. In the end, I reduced the data set to 72666 lyrics (how ominous!) and generated a topic model of 30 topics with the Stanford Topic Modeling Toolbox (STMT).

Like in previous attempts, 2 or 3 of these 30 topics seemed quite generic (they were composed of words that could occur in any context) or just plain noisy garbage, but for the most part the topics are quite coherent. I’m listing those I found the most interesting/intriguing. For each of them I added a title (in parentheses) that tentatively describes the overall tone of the topic:

  • Topic 28 (Cult, Rituals & Symbolism): “sacrifice”, “ritual”, “altar”, “unholy”, “goat”, “rites”, “blasphemy”, “chalice”, “temple”, “cult”
  • Topic 23 (Chaos, Universe & Cosmos): “chaos”, “stars”, “universe”, “cosmic”, “light”, “space”, “serpent”, “void”, “abyss”, “creation”
  • Topic 3 (The Divine): “lord”, “behold”, “praise”, “divine”, “god”, “blessed”, “man”, “glory”, “throne”, “perdition”
  • Topic 2 (Mind & Reality): “mind”, “existence”, “reality”, “thoughts”, “sense”, “moment”, “vision”, “mental”, “consciousness”
  • Topic 21 (Flesh & Decay): “flesh”, “dead”, “skin”, “body”, “bones”, “corpse”, “grave”
  • Topic 18 (The End): “end”, “day”, “path”, “leave”, “final”, “stand”, “fate”, “left”

And so on, and so forth. Click here for the full list; it will come in handy for deciphering the plots below.

One nice functionality that the STMT offers is the ability to “slice” the data with respect to the topics. When slicing the data by date, for instance, one can infer what percentage of lyrics in a given year falls into each topic.
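Mechanically, this slicing amounts to averaging the per-document topic weights within each year. Here is a hedged Python sketch of the idea; the data layout (a list of (year, topic-weights) pairs) is an assumption, not STMT’s actual output format:

```python
from collections import defaultdict

def slice_by_year(docs):
    """docs: iterable of (year, weights) pairs, where `weights` is a list
    of per-topic proportions summing to 1 for each lyric.
    Returns {year: [average weight per topic]}, i.e. the share of that
    year's lyrics falling into each topic."""
    sums = {}
    counts = defaultdict(int)
    for year, weights in docs:
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        for i, w in enumerate(weights):
            sums[year][i] += w
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

# Toy example with 2 topics and made-up weights:
docs = [
    (1992, [0.8, 0.2]),
    (1992, [0.4, 0.6]),
    (2006, [0.1, 0.9]),
]
shares = slice_by_year(docs)
```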

In order to observe the temporal evolution of some of these 30 topics between 1980 and 2014, I chose to use an NVD3 stacked area chart instead of just plotting twenty-something lines (which would be impossible to read given the inevitable overlapping). The final result looks very neat and tidy, but can also be misleading and give the impression that all the topics rise and fall at the same points in time. This is not true: when inspecting the stacked area chart below, remember that what represents a topic’s share in a given year is the vertical thickness of its band at that point, not the height of its upper edge. You can also deselect all topics (in the legend, top-right corner) except the one you want to examine, or simply click its area in the graph.

It seems that “Pain, Sorrow & Suffering” is consistently the most prevalent topic, peaking at 10.3% somewhere around 2006. “Fucking” has a peak in 1992, and “Warriors & Battles” represents more than 20% of the topic assignment in 1986. For the most part, the topic assignment percentages seem to stabilize after 92/93 (after the Norwegian boom or second wave or whatever it’s called).

And finally, when slicing the data set by the record labels, the output can be interpreted as the percentage of black metal releases by a given label that falls into each topic. After doing precisely that for record labels with a minimum of 10 black metal releases, I selected a few labels and plotted for each the percentage of releases that were assigned to the topics with some degree of confidence. The resulting plot is huge, so I removed a few generic topics for the sake of clarity. By hovering the mouse over the topic titles, a set of words that represent the topic will pop up. Similarly, by hovering the mouse over a record label name, the circles will turn into percentages. The larger the circle’s radius, the higher the percentage of releases from that label that were assigned to that circle’s corresponding topic.
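The per-label percentages behind those circles can be computed along these lines: a release counts toward a topic only when the topic’s weight passes a confidence threshold, and labels with too few releases are dropped. This is an illustrative sketch, not the actual computation; the data layout, the threshold value and the label names are all assumptions.

```python
from collections import defaultdict

def label_topic_percentages(releases, n_topics, min_releases=10, conf=0.5):
    """releases: iterable of (label, weights) pairs.
    A release counts toward topic i only if weights[i] >= conf.
    Labels with fewer than min_releases releases are dropped.
    Returns {label: [percentage of releases per topic]}."""
    hits = defaultdict(lambda: [0] * n_topics)
    totals = defaultdict(int)
    for label, weights in releases:
        totals[label] += 1
        for i, w in enumerate(weights):
            if w >= conf:
                hits[label][i] += 1
    return {lab: [100.0 * h / totals[lab] for h in hits[lab]]
            for lab in totals if totals[lab] >= min_releases}

# Toy example: 2 topics, hypothetical labels, min_releases lowered to 2.
releases = [
    ("LabelA", [0.7, 0.2]),
    ("LabelA", [0.6, 0.1]),
    ("LabelA", [0.1, 0.8]),
    ("LabelB", [0.9, 0.0]),  # only 1 release: dropped
]
pcts = label_topic_percentages(releases, n_topics=2, min_releases=2)
```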

Some observations of results that stand out: it seems that more than 20% of Depressive Illusions’ releases were assigned to “Pain, Sorrow & Suffering”. The top three topics of End All Life (which has released albums by Abigor, Blacklodge and Mütiilation, to name a few) are “Mind & Reality”, “Pain, Sorrow & Suffering” and “Chaos, Universe & Cosmos”. Also, almost 1/4 of all Norma Evangelium Diaboli’s releases (which include Deathspell Omega, Funeral Mist and Katharsis) seem to pertain to “The Divine” topic.

Edit: WordPress does not allow for huge iframes, so click here to view the Labels vs. Topics plot in all of its glory.

And that’s it for now, I’m done with topic modeling for the time being, until I have the time and patience to fine-tune the overall representation of the data and the algorithm’s parameters. In the next few weeks I’ll turn to other unsupervised machine learning techniques, such as clustering, to discover hidden relationships between bands.


Credits & Useful Resources:

– D3 ToolTip: D3-tip by Caged

– Stacked Area Chart: NVD3 re-usable charts for d3.js

– Labels per Topic: taken from Asif Rahman’s Journals


Part III – Topic and Lyrical Content Correlation

In part II of this post, we explored a topic model built for the whole black metal lyrics data set (if you don’t know what a topic model is, read this as well, but to sum things up let’s just say topic modeling is a process that enables discovery of the “meaning” underlying a document, with minimum human intervention). In said post we analyzed 1) the relationship between topics, and 2) the importance of individual words in characterizing them, by means of a force-directed graph, which (let’s face it) is a bit of a bubbly mess.
In order to better understand the second point stated above, I decided to build a zoomable treemap. In it, each large box (distinguished from the surrounding boxes by a label and a distinct color) represents a topic, i.e. a set of words that are somehow related and occur in the same context(s). Clicking on a label zooms the map into that topic and presents the ten most relevant words within it. For example, by clicking on “Coldness”, you’ll see the top 10 terms that compose it (“ice”, “frost”, “snow” and so on). The area of each word is proportional to its importance in characterizing the topic: in our “Coldness” example, “cold” occupies a larger area than the rest, being the most relevant word in this context.
Similarly, the total area of each topic is proportional to its incidence in the black metal lyrics data set. For example, “Fire & Flames” has a larger area than “Mind & Reality” or “Universe & Cosmos”, making it more likely to occur when inferring the topics that characterize a song.

By the way, these topic labels were chosen manually. Unfortunately I couldn’t devise an automated process that would do that for me (if anyone has an idea of how to do this, let me know), so I had to pick meaningful and reasonably (I hope) representative titles for each set of words. In most cases, like the aforementioned “Coldness”, the concept behind the topic is evident. There are, however, a few cases where I had to be a bit more creative because the meaning of the topic is not so obvious (“Urban Horror” comes to mind).

There are also two topics which are quite generic, with terms that could occur in almost any context, so they’re simply labeled “Non-descriptive”.

As mentioned in part II of this post, one goal of this whole mess is to find out which lyrics “embody” a specific topic. Given that the lyrical content of a song is seen by the topic model as a mixture of topics, we’re interested in discovering lyrics that are composed solely (or almost entirely, let’s say more than 90%) of a single topic. Using the topic inferencing capabilities of the Stanford Topic Modeling Toolbox I did just that, selecting at least 3 representative lyrics for 14 of the topics above. They’re displayed in the collapsible tree below.
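The “almost entirely one topic” filter boils down to keeping lyrics whose dominant topic carries more than 90% of the inferred weight. A minimal Python sketch of that selection, under the same assumed (title, weights) layout and with made-up weights:

```python
def single_topic_lyrics(docs, threshold=0.9):
    """docs: iterable of (title, weights) pairs from topic inference.
    Returns [(title, dominant_topic_index)] for lyrics whose dominant
    topic accounts for more than `threshold` of the total weight."""
    out = []
    for title, weights in docs:
        top = max(range(len(weights)), key=lambda i: weights[i])
        if weights[top] > threshold:
            out.append((title, top))
    return out

# Illustrative weights only, not the model's actual output:
docs = [
    ("Mountains of Might", [0.95, 0.03, 0.02]),  # almost pure topic 0
    ("Mixed Song",         [0.50, 0.30, 0.20]),  # no dominant topic
]
pure = single_topic_lyrics(docs)
```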

For the most part the lyrics seem to have a high degree of correlation with the topic assigned to them: for instance, Immortal’s “Mountains of Might” fits the “Coldness” topic fairly well (surprise, surprise…) and Vondur’s cover of an Elvis Presley song obviously falls into the heart stuff category. But there is one intriguing result: after reading Woods of Infinity’s “A Love Story”, I was expecting it to have the “Dreams & Stuff from the Heart” topic assigned to it. It falls in the “Fucking” topic instead, so maybe the algorithm detected something (creepy) between the lines.



The zoomable treemap was built from Bill White’s Treemap with Title Headers.

The collapsible tree was inspired by this tree and this other tree.