Lexical Diversity: Black Metal vs. Pop Queens

Lexical diversity is typically defined as a measure of the uniqueness of the words used in a text, that is, the proportion of distinct words across the text. This type of measure is indicative of the vocabulary richness present in the text and general writing quality.
There are several ways of measuring this, but I’ll focus on MTLD (Measure of Textual Lexical Diversity) as it’s very sensitive and less prone to be affected by the text length, unlike more traditional metrics. MTLD, by the way, is defined as the mean length of sequential word strings in a text that maintain a given type to token ratio – in other words, sequences that have a high proportion of unique words.
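For the curious, here’s a minimal Python sketch of the classic bidirectional MTLD calculation described above. It assumes the conventional TTR threshold of 0.72 and a pre-tokenized text. It is not the koRpus MTLD-MA code used for the numbers in this post, and the function names are made up for illustration.

```python
def _mtld_pass(tokens, threshold=0.72):
    """One directional pass: number of tokens divided by number of factors."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:
            # TTR dropped below the threshold: close this factor and start over.
            factors += 1
            types.clear()
            count = 0
    if count > 0:
        # Leftover tokens count as a partial factor.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    if factors == 0:
        # No repetition at all, so no factor ever closed (assumption: report infinity).
        return float("inf")
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward passes over the token list."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2
```

A highly repetitive text closes factors quickly and gets a low score, while a text that keeps introducing new words produces long factors and a high score.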
To get an idea of how diverse the vocabulary used in black metal is, I’ve measured MTLD for the song lyrics of 18 bands, which were selected (more or less) randomly.
The text of each band consists of the entirety of their lyrics after removal of text portions in languages other than English. To make things slightly more interesting, I also computed the MTLD values for the three queens of pop (or maybe they’ve been dethroned since I last checked the current pop pulse, it’s been a while).
There’s a handful of caveats to the experiment I’m describing here, the most evident being 1) I’m assuming the traditional parameter values of MTLD are suitable for song lyrics, 2) I’ve removed all lyrics totally or partially written in languages other than English, and 3) there’s a lot of intentional line repetition in songs (choruses and the like), something that is more prevalent in pop music than black metal. With that in mind I removed such duplicates from both data sets, which actually improved (albeit not significantly) the pop artists’ MTLD values.
That said, it’s not at all surprising (to me, admitting my bias here) to see Dodheimsgard at the top, or that MTLD values for the three pop singers are a lot lower than for all of the selected black-metal bands. However, keep in mind that Lexical Diversity measures don’t explicitly take into account sentence structure or grammar, so we can’t really infer the degree of quality (for lack of a better expression) of how the words are used.
Band/Artist         MTLD-MA   Lexical Density (%)
Lady Gaga           56        53.39
Beyonce             56        51.30
Rihanna             58        52.89
Abigor              112       57.35
Blacklodge          91        59.7
Clandestine Blaze   86        63.15
Corpus Christii     94        57.48
Craft               73        51.15
Cultes des Ghoules  114       59.1
Darkthrone          104       57.65
Deathspell Omega    101       51.39
Dodheimsgard        132       58.17
Emperor             81        54.29
Immortal            73        58.94
Inquisition         72        58.02
Mayhem              83        59.7
Mutiilation         103       56.08
Ride for Revenge    102       58.7
Satanic Warmaster   70        55.47
Satyricon           79        54.04
Solefald            84        56.98

The rightmost column of the table above displays values of lexical density, which should not be confused with lexical diversity. The former is defined as the proportion of content words – such as nouns, adjectives and verbs – present in the text. Other categories of words are said to be functional (such as determiners). I’ll follow here a rough interpretation of Halliday’s definition of lexical density and consider adverbs as content words.
As far as word classes go, adjectives, nouns and prepositions (eg, “than”, “beyond”, “under”, “into”) are less common in the pop lyrics than in the BM lyrics analysed here. Usage of pronouns, however (eg, “me”, “you”, “we”), is a lot more evident in the pop lyrics.
Focusing solely on the black metal lyrics, the same type of distribution is observed for each band, with the exception of nouns, which for some reason are a lot more prevalent in Clandestine Blaze and Blacklodge lyrics, amounting to more than one third of all the words used. Another aspect where Blacklodge, along with Craft and Deathspell Omega, deviate considerably from the rest of the bands is the “Other” category. This word class is actually the aggregation of the smaller classes, such as digits, punctuation and symbols. Craft, in particular, use a good amount of punctuation. Another interesting thing is seeing Deathspell Omega near the bottom of the table with regards to lexical density (ie, actual content words), despite scoring high in the lexical diversity department.
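As a rough illustration of that definition, here’s a small Python sketch that computes lexical density from tokens that have already been part-of-speech tagged. The tag names are an assumption (coarse, Universal-POS-style labels), not the TreeTagger tag set used for the table, and the function name is hypothetical.

```python
# Assumed coarse tag set: nouns, verbs, adjectives and adverbs count as content words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Percentage of tokens whose tag marks them as content words.

    tagged_tokens: list of (word, tag) pairs.
    """
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return 100.0 * content / len(tagged_tokens)


example = [("the", "DET"), ("cold", "ADJ"), ("wind", "NOUN"), ("howls", "VERB")]
print(lexical_density(example))
```

Whether punctuation and digits are counted in the denominator changes the result, so that choice is left to whoever prepares the tagged tokens.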


  • MTLD was first suggested, and subsequently developed, by Philip McCarthy while at the University of Memphis. I strongly suggest reading his and Scott Jarvis’s article “MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment” as it’s a lot more comprehensive than the very brief, limited and ad-hoc assessment I make here (and I’ve probably misinterpreted some aspects of its correct usage).
  • All the MTLD values and part-of-speech tagging were computed in R, using the koRpus package implementation of MTLD-MA and TreeTagger.
  • D3 stacked bar chart source.

Around the World with Satan

The following map displays, for each country, the rank of the word “Satan” in black metal lyrics written between 1980 and 2013. This ranking is calculated as the ratio of the total number of times “Satan” occurs to the maximum raw frequency of any term in the country’s lyrics, after stopword removal. The darker the shade of blue, the higher “Satan” sits in that country’s term ranking. Filipino bands throw the S-word around a lot more than the rest of the world, at least in comparison with other frequent terms in their lexicon, followed by a number of countries in Latin America.
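The ratio described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace-style tokens, case-insensitive matching, a caller-supplied stopword list); the function name is made up.

```python
from collections import Counter


def term_ratio(tokens, term, stopwords=frozenset()):
    """Occurrences of `term` divided by the count of the most frequent term."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    if not counts:
        return 0.0
    return counts[term.lower()] / max(counts.values())


tokens = ["Satan", "the", "the", "the", "night", "night", "satan", "satan", "Satan"]
print(term_ratio(tokens, "satan", stopwords={"the"}))
print(term_ratio(tokens, "night", stopwords={"the"}))
```

A ratio of 1.0 means the term is itself the most frequent one in that country’s lyrics.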

Click here for a larger version.

As for the single most frequent word for each country, here’s a selected sample with some amusing entries:

  • Brunei: human
  • Kazakhstan: rape
  • Mongolia: soul
  • Costa Rica: lord
  • Honduras: cold
  • Barbados: ocean
  • Jamaica: pussy
  • Japan: hell

Note that the actual size of each country’s corpus, that is, the total number of terms, has some influence over the computed ratios. Since some countries are a lot more prolific, blackmetal-wise, than others, take this analysis with a grain of salt.


Using NLP to build a Black Metal Vocabulary

Black metal is typically linked, since its inception, to Satanic or anti-Christian themes. With the proliferation of bands in the 90s (after the Norwegian boom) and subsequent emergence of sub-genres, other topics such as paganism, metaphysics, depression and even nationalism came to the fore.

In order to discover the terminology used to explore these lyrical themes, I’ve devised a couple of term extraction experiments using the black metal data set. The goal here is to build a black metal vocabulary by discovering salient words and expressions, that is, terms that when used in BM lyrics carry more information than when used in a “normal” setting. For instance, the terms “Nazarene” or “Wotan” have a much higher weight in the black metal domain than in the general purpose corpus used for comparison. Once again note that this does not necessarily mean that these two words occur very frequently in BM lyrics (I’d bet that “Satan” or “death” have a higher number of occurrences), but it indicates that, when they do, they carry more information within the BM context.
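To make the idea of salience a bit more concrete, here’s a toy Python sketch that scores each term by how much more likely it is in a domain corpus than in a reference corpus. This is a deliberately simple stand-in (a smoothed relative-frequency ratio) and not the GlossEx or C-value algorithms; the function name and the add-one smoothing are assumptions.

```python
from collections import Counter


def domain_salience(domain_tokens, reference_tokens):
    """Ratio of smoothed relative frequencies: domain corpus vs. reference corpus."""
    dom = Counter(domain_tokens)
    ref = Counter(reference_tokens)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    vocab_size = len(set(dom) | set(ref))
    scores = {}
    for term, n in dom.items():
        # Add-one smoothing so terms missing from the reference corpus don't divide by zero.
        p_dom = (n + 1) / (dom_total + vocab_size)
        p_ref = (ref[term] + 1) / (ref_total + vocab_size)
        scores[term] = p_dom / p_ref
    return scores


scores = domain_salience(
    ["death", "death", "death", "wotan"],
    ["death", "death", "death", "the"],
)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy example “wotan” outranks “death” even though it occurs less often, because it is rare in the reference corpus.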

This task was carried out with JATE’s implementations of the GlossEx and C-value algorithms. The part-of-speech of each term (that is, the “type” of term) was discovered with the StanfordNLP toolkit. The top 50 of each type (with the exception of adverbs) are listed in the table below. For the sake of visualization, I make a distinction between named entities/locations and the other nouns, with the former depicted in the word maps at the end of this post.

I’ve also included, in the last column of the table, the top term combinations. It’s noteworthy how many of these combinations are either negations of something (“no hope”, “no god”, “no life” and so on), or concerned with time (“eternal darkness”, “ancient times”). Such preoccupation with large extensions of “time” is also evident in the top adverbs (“eternally”, “forever”, “evermore”), adjectives (“endless”, “eternal”) and even nouns (“aeon” or “eon”).

Adjectives    Adverbs       Verbs       Nouns        Term combinations
Endless       Nevermore     Desecrate   Forefather   Life and death
Unhallowed    Eternally     Smolder     Armor        Human race
Luciferian    Tomorrow      Travel      Aeon         No light
Infernal      Infernally    Fuel        Splendor     No hope
Necromantic   Forever       Spiral      Pentagram    Eternal night
Paralyzed     Anymore       Dethrone    Perdition    No god
Pestilent     Mighty        Throne      Specter      Full moon
Unholy        Skyward       Envenom     Misanthrope  No life
Illusive      Evermore      Lay         Cross        Black metal
Untrodden     Earthward     Resound     Magick       Cold wind
Astral        Someday       Mesmerize   Nihil        No place
Misanthropic  Astray        Abominate   Ragnarok     No escape
Unmerciful    Onward        Paralyze    Blasphemer   No return
Cruelest      Verily        Blaspheme   Profanation  Eternal life
Blackest      Deathly       Impale      Misanthropy  No fear
Eternal       Forth         Cremate     Malediction  Flesh and blood
Wintry        Unceasingly   Bleed       Revenant     No matter
Bestial       Weightlessly  Procreate   Damnation    Fallen angel
Reborn        Anew          Enslave     Conjuration  Eternal darkness
Putrid        Demonically   Awake       Undead       No man
Darkest       -             Behold      Nothingness  Dark night
Unblessed     -             Intoxicate  Armageddon   Lost soul
Colorless     -             Devour      Lacerate     No end
Diabolic      -             Bury        Wormhole     Ancient time
Demonic       -             Demonize    Eon          No remorse
Wrathful      -             Forsake     Devourer     No reason
Nebular       -             Enshroud    Impaler      No longer
Vampiric      -             Writhe      Sulfur       Black cloud
Unchained     -             Destroy     Betrayer     Dark forest
Armored       -             Entomb      Deceiver     Human flesh
Immortal      -             Raze        Bloodlust    Endless night
Hellish       -             Flagellate  Reaper       Ancient god
Hellbound     -             Unleash     Horde        Mother earth
Unnamable     -             Convoke     Blasphemy    Black wing
Prideful      -             Crucify     Eternity     Night sky
Colorful      -             Fornicate   Defiler      Dark side
Unbaptized    -             Torment     Immolation   Eternal sleep
Unforgotten   -             Venerate    Soul         Black hole
Satanic       -             Beckon      Abomination  Black heart
Morbid        -             Defile      Flame        Flesh and bone
Sempiternal   -             Distill     Hail         No chance
Mortal        -             Immolate    Malignancy   Dark cloud
Honorable     -             Welter      Wrath        Final battle
Glooming      -             Run         Pestilence   Eternal fire
Willful       -             Sanctify    Gallow       No peace
Lustful       -             Eviscerate  Disbeliever  No future
Everlasting   -             Unchain     Witchery     Black soul
Impure        -             Ravage      Satanist     Final breath
Promethean    -             Mutilate    Lust         Black night

Most salient entities: many are drawn from the Sumerian and Nordic mythologies. I’ve also included in this bunch groups of animals (“Beasts”, “Locusts”).

Most salient locations. I’ve also included in this bunch non-descript places (“Northland”). Notice how most are concerned with the afterlife (surprisingly, “hell” is not one of them).

It occurred to me that these results could be the starting point of an automatic lyric generator (like the now defunct Scandinavian Black Metal Lyric Generator). Could be a fun project, if time allows (probably not).


IBM GlossEx

Jason Davies’ D3 Word Cloud

JATE – Java Automatic Text Extraction

StanfordNLP Core

Middle-Earth Entity Recognition in Black Metal Lyrics

The influence of JRR Tolkien on black metal has been pervasive almost since its beginning. One of BM’s most (in)famous outfits, Burzum, took its name from a word invented by the Middle-Earth creator that signifies “darkness” in the Black Speech of Mordor. Other Norwegian acts such as Gorgoroth or Isengard adopted their names from notable Middle-Earth locations. Perhaps the best example is the Austrian duo Summoning, who have incorporated in their releases innumerable lyrical references (well, not innumerable, about 70 actually) to Tolkien’s works.

The references to Middle-Earth mythology abound in both lyrics and band monikers. Using a list of notable characters’ names and geographic locations as the basis for a named entity recognition task, I set out to find which are the most cited in the black metal data set.
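A list-based entity recognizer of this kind boils down to matching known names against the text. Here’s a minimal Python sketch of the idea, assuming case-insensitive whole-word matching; it is not the Java script used for the results below, and the function name and sample gazetteer are purely illustrative.

```python
import re


def find_entities(text, gazetteer):
    """Count whole-word, case-insensitive matches of each gazetteer entry in text."""
    counts = {}
    for name in gazetteer:
        pattern = r"\b" + re.escape(name) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[name] = hits
    return counts


lyric = "From Mordor to Moria, the Black Gate of Mordor opens."
print(find_entities(lyric, ["Mordor", "Moria", "Black Gate", "Angmar"]))
```

Plain matching like this cannot tell whether a phrase such as “Black Gate” is really meant in the Tolkien sense, so ambiguous entries need a manual look.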

With this list and a small Java NER script implemented for this task, I found 149 bands which have chosen a Middle-Earth location or entity for their name. Angmar is the most popular (6), closely followed by Mordor (5) and Sauron (5). With 4 occurrences each, there’s also Orthanc, Moria, Nargothrond, Carcharoth, Gorthaur and Morgoth.

As for actual lyrical references to these entities, I found a grand total of 736 of them. The ones that have at least two occurrences are depicted in the bubble chart below. It’s not surprising at all to find that the most common references (Mordor, Morgoth, Sauron, Moria, Saruman and Carcharoth) belong to malevolent characters, or dark and dangerous places, of the Tolkienesque lore. The “Black Gate” is also mentioned a lot, but it could have a meaning outside of the Middle-Earth mythology.


Bubble Chart built from Mike Bostock’s example

Part IV – Record Labels and Lyrical Content

In the fourth and final part of topic discovery in black metal lyrics, we’ll address the issue of assigning topics to record labels based on the lyrical content of their black metal releases. In other words, we want to find out if a given label has a tendency to release bands that write about a particular theme. We’ll also investigate the temporal evolution of these topics, that is, what changes happened through the years regarding the usage of topics in black metal lyrics. This aims to shed some light on the issue of whether lyrical content has remained the same throughout the years.

In order to address these questions, I turned once more to topic modeling. This machine learning technique was mentioned in parts I, II and III of this post, so knock yourself out reading those. If that does not appeal to you, let’s sum things up by saying that topic modeling aims to infer automatically (i.e., with minimum human intervention) topics underlying a collection of texts. “Topic” in this context is defined as a set of words that (co-)occur in the same context and are semantically related, somehow.

Instead of using the topic model built for parts II and III, I generated a new one after some (sensible, I hope) cleaning of the data set. This pre-processing involved, among other things, removal of lyrics that were not fully translated to English and lyrics with fewer than 5 words. In the end, I reduced the data set to 72666 lyrics (how ominous!) and generated a topic model of 30 topics with the Stanford Topic Modeling Toolbox (STMT).

Like in previous attempts, of these 30 topics, 2 or 3 seemed quite generic (they were composed of words that could occur in any context) or just plain noisy garbage, but for the most part the topics are quite concise.  I’m listing those I found the most interesting/intriguing. For each of them I added a title (between parentheses) that tentatively describes the overall tone of the topic:

  • Topic 28 (Cult, Rituals & Symbolism): “sacrifice”, “ritual”, “altar”, “unholy”, “goat”, “rites”, “blasphemy”, “chalice”, “temple”, “cult”
  • Topic 23 (Chaos, Universe & Cosmos): “chaos”, “stars”, “universe”, “cosmic”, “light”, “space”, “serpent”, “void”, “abyss”, “creation”
  • Topic 3 (The Divine): “lord”, “behold”, “praise”, “divine”, “god”, “blessed”, “man”, “glory”, “throne”, “perdition”
  • Topic 2 (Mind & Reality): “mind”, “existence”, “reality”, “thoughts”, “sense”, “moment”, “vision”, “mental”, “consciousness”
  • Topic 21 (Flesh & Decay): “flesh”, “dead”, “skin”, “body”, “bones”, “corpse”, “grave”
  • Topic 18 (The End): “end”, “day”, “path”, “leave”, “final”, “stand”, “fate”, “left”

And so on, and so forth. Click here for the full list, it will be handy for deciphering the plots below.

One nice functionality that the STMT offers is the ability to “slice” the data with respect to the topics. This means that when slicing the data by date, one is able to infer what percentage of lyrics in a given year falls into each topic.
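Conceptually, slicing by year amounts to grouping the per-lyric topic weights by year and normalizing. The Python sketch below shows one way to do that, assuming each lyric comes with a year and a list of topic weights; it is only an approximation of the idea and not how STMT computes its slices internally. The function name is hypothetical.

```python
def topic_share_by_year(docs):
    """docs: iterable of (year, [weight_topic_0, weight_topic_1, ...]).

    Returns {year: [percentage per topic]}, with each year's percentages summing to 100.
    """
    sums = {}
    for year, weights in docs:
        acc = sums.setdefault(year, [0.0] * len(weights))
        for i, w in enumerate(weights):
            acc[i] += w
    shares = {}
    for year, acc in sums.items():
        total = sum(acc)
        if total > 0:
            shares[year] = [100.0 * w / total for w in acc]
    return shares


docs = [(1992, [1.0, 0.0]), (1992, [0.5, 0.5]), (1993, [0.0, 1.0])]
print(topic_share_by_year(docs))
```

The same grouping works for any other attribute, such as record label, by swapping the year for that attribute.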

In order to observe the temporal evolution of some of these 30 topics between 1980 and 2014, I chose to use an NVD3 stacked area chart instead of just plotting twenty-something lines (which would be impossible to understand given the inevitable overlapping). The final result looks very neat and tidy, but can also be misleading and give the impression that all the topics are rising and diminishing at the same points in time. This is not true: when inspecting the stacked area chart below, remember that what represents a topic for a given year is the thickness of its area at that point, not the height at which that area sits. There’s also the possibility of deselecting all topics (in the legend, top-right corner) except the one you want to examine, or simply clicking its area in the graph.

It seems that “Pain, Sorrow & Suffering” is consistently the most prevalent topic, peaking at 10.3% somewhere around 2006. “Fucking” has a peak in 1992, and “Warriors & Battles”  represents more than 20% of the topic assignment in 1986. For the most part, the topic assignment percentages seem to stabilize  after 92/93 (after the Norwegian boom or second wave or whatever it’s called).

And finally, when slicing the data set by record label, the output can be interpreted as the percentage of black metal releases by a given label that falls into each topic. After doing precisely that for record labels that have a minimum of 10 black metal releases, I selected a few labels and plotted for each the percentage of releases that were assigned to the topics with some degree of confidence. The resulting plot is huge, so I removed a few generic topics for the sake of clarity. By hovering the mouse over a topic title, a set of words that represent it will pop up. Similarly, by hovering the mouse over a record label name, the circles will turn into percentages. The larger the circle’s radius, the higher the percentage of that label’s releases assigned to the circle’s corresponding topic.

Some observations of results that stand out: it seems that more than 20% of Depressive Illusions’ releases were assigned to “Pain, Sorrow & Suffering”. The top three topics for End All Life (which has released albums by Abigor, Blacklodge and Mütiilation, to name a few) are “Mind & Reality”, “Pain, Sorrow & Suffering” and “Chaos, Universe & Cosmos”. Also, almost 1/4 of all Norma Evangelium Diaboli’s releases (which include Deathspell Omega, Funeral Mist and Katharsis) seem to pertain to “The Divine” topic.

Edit: WordPress does not allow for huge iframes, so click here to view the Labels vs. Topics plot  in all of its glory.

And that’s it for now, I’m done with topic modeling for the time being until I have the time and patience to fine-tune the overall representation of the data and the algorithm’s parameters. In the next few weeks I’ll turn to unsupervised machine learning techniques, such as clustering, to discover hidden relationships between bands.


Credits & Useful Resources:

– D3 ToolTip: D3-tip by Caged

– Stacked Area Chart: NVD3 re-usable charts for d3.js

– Labels per Topic: taken from Asif Rahman Journals


Part III – Topic and Lyrical Content Correlation

In part II of this post, we explored a topic model built for the whole black metal lyrics data set (if you don’t know what a topic model is, read this as well, but to sum things up let’s just say topic modeling is a process that enables discovery of the “meaning” underlying a document, with minimum human intervention). In said post we analyzed 1) the relationship between topics, and 2) the importance of individual words in their characterization by means of a force directed graph, which (let’s face it) is a bit of a bubbly mess.
In order to better understand the second point stated above, I decided to build a zoomable treemap. In it, each large box (distinguished from the surrounding boxes by a label and a distinct color) represents a topic, i.e. a set of words that are somehow related and occur in the same context(s). By clicking on a label, the map zooms into it and presents the ten most relevant words within that topic. For example, by clicking on “Coldness”, you’ll see the top 10 terms that compose it (“ice”, “frost”, “snow” and so on). The area of each word is proportional to its importance in characterizing the topic: in our “Coldness” example, “cold” occupies a larger area than the rest, being the most relevant word in this context.
Similarly, the total area of each topic is proportional to its incidence in the black metal lyrics data set. For example, “Fire & Flames” has a larger area than “Mind & Reality” or “Universe & Cosmos”, making it more likely to occur when inferring the topics that characterize a song.

By the way, these topic labels were chosen manually. Unfortunately I couldn’t devise an automated process that would do that for me (if anyone has an inkling on how to do this, let me know) so I had to pick meaningful and reasonable (I hope) representative titles for each set of words. In most cases, like the aforementioned “Coldness”, the concept behind the topic is evident. There are, however, a few cases where I had to be a bit more creative because the meaning of the topic is not so obvious (“Urban Horror” comes to mind).

There are also two topics which are quite generic, with terms that could occur in almost any context, so they’re simply labeled “Non-descriptive”.

As mentioned in part II of this post, one goal of this whole mess is to find out which lyrics “embody” a specific topic. Given that the lyrical content of a song is seen by the topic model as a mixture of topics, we’re interested in discovering lyrics that are composed solely (or almost in their entirety, let’s say more than 90%) of a single topic. Using the topic inferencing capabilities of the Stanford Topic Modeling Toolbox I did just that, selecting at least 3 representative lyrics for 14 of the topics above. They’re displayed in the collapsible tree below.
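Once each lyric has a list of topic weights, picking the ones dominated by a single topic is a simple filter. Here’s a hedged Python sketch of that step, assuming the weights are available as plain lists keyed by song title; the function name and data layout are made up, and the 90% cut-off is the one mentioned above.

```python
def single_topic_lyrics(doc_topics, threshold=0.9):
    """doc_topics: {title: [weight per topic]}.

    Returns {topic_index: [titles]} for lyrics whose top topic exceeds the threshold.
    """
    selected = {}
    for title, weights in doc_topics.items():
        total = sum(weights)
        if total == 0:
            continue
        best = max(range(len(weights)), key=weights.__getitem__)
        if weights[best] / total > threshold:
            selected.setdefault(best, []).append(title)
    return selected


print(single_topic_lyrics({"A": [0.95, 0.05], "B": [0.5, 0.5], "C": [0.02, 0.98]}))
```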

For the most part the lyrics seem to have a high degree of correlation with the topic assigned to them: for instance Immortal’s “Mountains of Might” fits the “Coldness” topic fairly well (surprise, surprise…) and Vondur’s cover of an Elvis Presley song obviously falls into the heart stuff category. But there is one intriguing result: after reading Woods of Infinity’s “A Love Story”, I was expecting it to have the “Dreams & Stuff from the Heart” topic assigned to it. It falls in the “Fucking” topic instead, so maybe the algorithm detected something (creepy) between the lines.



The zoomable treemap was built from Bill White’s Treemap with Title Headers.

The collapsible tree was inspired by this tree and this other tree.

Part II – Topic Discovery in Black Metal Lyrics (All Bands)

In Part I of this post, we examined a topic model built from a subset of the black metal lyrics data set. It was a preliminary experiment with regards to topic discovery, and we only explored the 10 most frequent topics underlying lyrics authored by French bands.

In this second part we will examine the relationship, i.e. similarity/dissimilarity, between topics of a topic model. The model now presented was built using the original data set in its entirety, with Stanford’s Topic Modeling Toolbox (STMT). I chose this tool over Mallet for a number of reasons, the most important being that it allows one to “slice” the topics with respect to particular aspects of the data set, such as time or band (a functionality I’ll explore in the next post).

The pre-processing stage of the data set included removal of typical stop words (such as the, or and and), and not so typical ones: STMT allows removal of the most frequent words in the data set (if they’re too common, it’s very likely they’re not that informative). I’ve also removed lyrics with fewer than 5 words and kept only those lyrics that have an 85% probability of being detected as written in English (this allows the best translations of lyrics not originally written in English to be included in this analysis as well). The outcome of this pre-processing step is a data set comprising approximately 52000 distinct lyrics.
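Part of this clean-up can be sketched in plain Python. The snippet below covers stop word removal, dropping the N most frequent words and discarding very short lyrics; language detection is left out. The order of the steps, the whitespace tokenization and the function name are all assumptions, and this is not what STMT does under the hood.

```python
from collections import Counter


def preprocess(lyrics, stopwords, drop_top_n=0, min_words=5):
    """Return tokenized lyrics after stop word, frequent-word and length filtering."""
    tokenized = [
        [w for w in text.lower().split() if w not in stopwords]
        for text in lyrics
    ]
    # Words that are too common across the whole data set are unlikely to be informative.
    freq = Counter(w for doc in tokenized for w in doc)
    too_common = {w for w, _ in freq.most_common(drop_top_n)}
    cleaned = [[w for w in doc if w not in too_common] for doc in tokenized]
    # Discard lyrics that end up with fewer than `min_words` words.
    return [doc for doc in cleaned if len(doc) >= min_words]


sample = ["the cold wind of the north howls tonight", "hail satan"]
print(preprocess(sample, stopwords={"the", "of"}))
```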

Once again, I had to manually set a value for the number of topics (in subsequent experiments I’ll explore the possibility of determining the ideal number automatically) so I picked 30, a nice round number (my guess is as good as yours). Below you’ll find a list of some of them. Remember that a topic is a set of related words (the first and fifth listed here are my personal favourites):

  • space, universe, stars, chaos, void, cosmic, light, infinite
  • hell, evil, satan, demons, souls, god, unholy, infernal
  • fucking, metal, fuck, kill, lust, whore, shit, rape, cunt, bitch
  • cold, winter, wind, winds, ice, snow, frozen, land, mountains, frost
  • human, earth, race, humanity, destruction, mankind, war, plague, destroy
  • fire, burning, burn, flames, flame, burns, ashes, soul, fires
  • dark, ancient, power, soul, shadows, evil, spirit, eternal
  • pain, soul, hatred, mind, suffering, hate, veins, anger, thoughts, madness

Click here for the whole 30-topic word distribution list (note that the numbers inside parentheses represent the “weight” of the word in the topic: the higher the weight, the greater its importance in characterizing the topic).

It’s a pretty diverse list of topics. Some are quite generic, and others occur in a very small percentage of lyrics, but for the most part they seem concise and informative. There are also a number of them that could be related to each other: a topic about sea and waves is probably closer to another comprising words such as wind or sky than to a topic about pain and hatred.

One question that arises is how to determine this hypothetical relationship between the topics. One of the outputs of the STMT is a document-by-topic matrix, that is, a matrix where each row corresponds to a lyric and each column to a topic: their intersection gives us the “weight” of a topic in a particular lyric.

What’s important to retain here is that each topic can be represented by a column of values (its weights in the data set). If we want to determine the relationship between two topics in the corpus, we can use their representations as vectors of numbers and apply some sort of measure to them, such as the Jensen-Shannon divergence. This metric actually gives us the dissimilarity between two vectors: the higher its value, the more unrelated they are.
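For reference, here’s a small self-contained Python sketch of the Jensen-Shannon divergence between two topic columns. It assumes the columns are first normalized so that each sums to 1, and it uses base-2 logarithms so the result falls between 0 (identical) and 1 (no overlap). The function names are illustrative and not part of any toolkit mentioned in this post.

```python
import math


def _normalize(values):
    total = sum(values)
    return [v / total for v in values]


def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two non-negative weight vectors."""
    p, q = _normalize(p), _normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


print(jensen_shannon([1, 0], [0, 1]))        # topics that never share a lyric
print(jensen_shannon([1, 2, 3], [1, 2, 3]))  # identical columns
```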

In the force directed graph below, each topic is represented as a cluster of its top 7 words, and words with a higher weight in the topic are drawn as circles with larger radii than the rest. The higher the divergence between two topics, the further apart they are in the graph (or should be, it’s not perfectly rendered because d3 is really not my strong point, but it will give you an idea of these distances for a few pairs of topics – refresh and reshuffle as much as you like!). In addition, if you mouseover a particular topic, the graph will highlight its links to other topics: the thinner a link is, the more unrelated the topics are.

Another question that comes to mind when visualizing these topics is what lyrics “embody” them best. Remember that lyrics can be seen as a mixture of topics, so one that is composed in its majority of a single topic will probably represent it better than another that has a small percentage (say, 10%) of that same topic. This will be addressed in the next post, so stay tuned.

Part I – Topic Discovery in Black Metal Lyrics (French Bands)

Counting occurrences of single words is not the most informative way of discovering the meaning (or a possible meaning) of a text. This is mainly because both the relationship between words and the context in which they occur are ignored. A more significant result would be discovering sets of correlated terms that express ideas or concepts underlying the text. Topic modeling addresses this issue of topic discovery, and more importantly, does so with (almost) no human supervision.

‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. This is especially important in a data set like the black metal lyrics given that there are a number of words (such as death, life and blood) that appear in different contexts.

So, how does a topic modeling tool work? According to this excellent introduction to this subject, its main advantage is that it doesn’t need to know anything about the text except the words that are in it.  It assumes that any piece of text is composed of words taken from “baskets” of words, where each basket corresponds to a topic. Then it becomes possible to mathematically decompose a text into the probable baskets from whence the words first came. The tool goes through this process repeatedly until it settles on the most likely distribution of words into topics.

What results from this approach is that a piece of text is seen as a mixture of different topics, each with an associated weight. The higher the weight of a topic, the more important it is in characterizing the text. For the sake of a practical example, let’s say that we have the following topics:

  1. cosmic, space, universe, millions, stars …
  2. dna, genetic, evolution, millions, years …
  3. business, financial, millions, funding, budget …

Notice how the word “millions” shows up in different contexts: you can have a business text talking about millions of dollars or a science text mentioning evolution over millions of years. Taking the following text as a test case for our simple example topic model…

“The Hubble Space Telescope (HST) is a space telescope that was carried into orbit by a Space Shuttle in 1990. Hubble’s Deep Field has recorded some of the most detailed visible-light images ever, allowing a deep view into space and time. Many Hubble observations have led to breakthroughs in astrophysics, such as accurately determining the rate of expansion of the universe […] ESA agreed to provide funding and supply one of the first generation instruments for the telescope […] From its original total cost estimate of about US$400 million, the telescope had by now cost over $2.5 billion to construct.”

…it does seem reasonable that it can be seen as a mixture of topics 1 and 3 (with topic 1 having a higher weight than topic 3).
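To get a feel for the “baskets” intuition, here’s a deliberately crude Python sketch that counts how many words of a text fall into each of the three example topics above. It is plain keyword counting, not actual topic model inference (a real tool learns the baskets and the weights from the data), and the function name and shortened test sentence are just for illustration.

```python
import re

# The three toy topics from the example above.
TOPICS = {
    1: {"cosmic", "space", "universe", "millions", "stars"},
    2: {"dna", "genetic", "evolution", "millions", "years"},
    3: {"business", "financial", "millions", "funding", "budget"},
}


def topic_mixture(text, topics=TOPICS):
    """Share of topic-word hits per topic, based on simple word matching."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = {tid: sum(1 for w in words if w in vocab) for tid, vocab in topics.items()}
    total = sum(hits.values())
    return {tid: (n / total if total else 0.0) for tid, n in hits.items()}


snippet = ("a space telescope allowing a deep view into space and time, "
           "the expansion of the universe, ESA agreed to provide funding")
print(topic_mixture(snippet))
```

On this shortened snippet topic 1 gets most of the weight and topic 3 a smaller share, which matches the intuition described above.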

What would a black metal topic model look like? To find out, I’ve made a couple of preliminary experiments using lyrics from French black metal bands (future experiments will explore other subsets of the lyrics corpus, and hopefully build a topic model for the entire data set, if time allows). The model described in this post was generated with Mallet, setting the number of topics to look for to 20, and using its most basic processing techniques: stop word removal, non-alphanumeric characters removal, feature sequences with bigrams, and little else.

For reasons of economy (and also not to bore you to tears) I’ll just list the top 10, that is, the 10 topics that have a higher “weight” in characterizing the French lyrics subset (the remaining 10 have very small weights). Each is represented by 9 terms:

  1. life, time, death, eyes, pain, soul, feel, mind, world
  2. night, dark, light, cold, black, darkness, moon, sky, eternal
  3. world, life, human, death, earth, humanity, end, hatred, chaos
  4. blood, body, black, tears, flesh, heart, eyes, love, wind
  5. satan, blood, black, god, hell, evil, lord, christ, master
  6. war, blood, death, fight, fire, black, kill, rise, hell
  7. land, gods, blood, people, proud, men, great, king, ancestors
  8. god, time, void, light, death, reality, stars, matter, infinite
  9. god, lord, fire, divine, holy, light, flesh, man, great
  10. fucking, shit, fuck, make, time, trust, love, suck, dirty

The first one seems to be all over the place: life, time and death can be applied to a ton of subjects, and indeed they seem to characterize to some extent about half of the data set. Also, some terms appear quite often in different contexts (blood, black, death and even god). But there are a couple of interesting ones, such as topics 2, 7 and 10. And because looking at lists of words is tedious, here’s a word cloud that represents them using a horrid sexy color palette. Each topic has a different color, and the larger the font, the more preponderant it is in the subset.

One practical application of a topic model is using it to describe a text. Let’s take, for example, the lyrics for Blut Aus Nord’s “Fathers of the Icy Age” and “ask” our topic model what’s the composition of this particular piece of text. The outcome is:

  • Topic 7 (54.25%): land, gods, blood, people, proud, men, great, king, ancestors
  • Topic 2 (42.34%): night, dark, light, cold, black, darkness, moon, sky, eternal
  • Other topics – less than 3.41%

We can interpret this song as a mixture of two topics, and in my opinion, the first one (let’s call it “ancient pagan stuff of yore”) seems to be pretty accurate. What about more personal lyrics such as T.A.o.S. “For Psychiatry”? Here’s what we get:

  • Topic 10 (40.95%): fucking, shit, fuck, make, time, trust, love, suck, dirty
  • Topic 1 (26.13%): life, time, death, eyes, pain, soul, feel, mind, world
  • Topic 3 (12.05%): world, life, human, death, earth, humanity, end, hatred, chaos

It’s a bit too generic for my liking, but we’re not that far off the mark. All in all, topic modeling appears to be quite useful for discovering concepts in our data set. There are, however, a few drawbacks to this approach. One of them is that the number of topics has to be set manually; ideally, the algorithm would figure out the appropriate number by itself. The other is the simplicity of the features: future experiments should focus on improving the lyrics representation with richer features. At any rate, these are promising results that can be further improved.

Usage of Specific Terms through Time

What about the usage of specific words in black metal lyrics across time? Have common terms – like life or death – been mentioned in a constant fashion through the years, or has their frequency changed dramatically?

The figure below plots the frequencies of a few selected words against time (in years). I’ve chosen death, life and time because they are among the most frequent terms in the whole lyrics data set. As for god and satan, well, if you don’t know why I picked them then that probably means you’re not acquainted with black metal at all, so I’ll refer you to Google or the nearest (decent) record shop to sort that out.

I’ve bundled a few synonyms and hyponyms with each term, taken from WordNet. This means that, for example, the occurrence count for satan also includes the counts of similar terms such as lucifer and devil.
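
For the curious, the bundling and counting can be sketched roughly like this in Python. Treat it as an illustration under assumptions rather than the code behind the plot: the term groups are hard-coded here instead of being pulled from WordNet, the sample lyrics are made up, and normalising by the number of tokens per year is my own choice.

```python
from collections import Counter, defaultdict
import re

# Hypothetical term groups; the post takes its synonyms and hyponyms from WordNet.
TERM_GROUPS = {
    "satan": {"satan", "lucifer", "devil"},
    "death": {"death"},
}

def yearly_frequencies(lyrics_by_year, term_groups):
    """Map year -> {term: share of that year's tokens that fall in the term's group}."""
    result = defaultdict(dict)
    for year, texts in lyrics_by_year.items():
        tokens = [t for text in texts for t in re.findall(r"[a-z]+", text.lower())]
        counts = Counter(tokens)
        total = len(tokens)
        for term, group in term_groups.items():
            hits = sum(counts[w] for w in group)  # bundle the group's counts
            result[year][term] = hits / total if total else 0.0
    return dict(result)

# Made-up sample data, keyed by release year.
sample = {
    1998: ["Death rides with the devil", "Lucifer brings death"],
    2007: ["Life after death"],
}
freqs = yearly_frequencies(sample, TERM_GROUPS)
for year in sorted(freqs):
    print(year, freqs[year])
```

Plotting one line per term over the sorted years gives a figure of the same kind as the one discussed here.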

Looking at the plot we can see that death was at its highest point around 1998 and has been decreasing since then (being surpassed by life in 2006/07), up until 2012. And notice how satan closely follows god across the years. This probably means that most lyrics that mention one of these entities also mention the other.

Part II – Frequent Words in Black Metal Lyrics

In the last post we tried to discover the most common terms used in black metal lyrics. One of the first questions that popped up was whether the most frequent words differ between countries. To answer this (on a very small scale) I’ve split the original data set into two smaller subsets: one for lyrics penned by Norwegian bands and the other for Iraqi bands. The following bar plot shows the top 15 most frequent words found in the lyrics of Norwegian bands. It does not seem to differ much from the global top 15, presented in our previous post.

Below you’ll find the most frequent words in the lyrics of Iraqi bands. Not only does it look much different from the Norwegian bar plot, it also differs significantly from the global results. I find it very interesting that lies corresponds to 0.9% of the total occurrences. This, together with the presence of both truth and blasphemy, seems to point to some sort of deeper meaning here. Or maybe it’s all just a coincidence because, again, with no contextual analysis we can’t really infer much. At any rate, it’s very likely that the lyrical concerns of Norwegian and Iraqi bands are distinct.
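
If you want to try this kind of per-country comparison yourself, a minimal Python sketch could look like the one below. The function name, the sample lines and the stop word handling are assumptions of mine, and so is the decision to compute each percentage over the tokens left after stop word removal; the numbers in the plots may have been computed differently.

```python
from collections import Counter
import re

def top_words(lyrics, n=15, stop_words=frozenset()):
    """Return the n most frequent words as (word, percentage of kept tokens)."""
    tokens = [
        t
        for text in lyrics
        for t in re.findall(r"[a-z]+", text.lower())
        if t not in stop_words
    ]
    total = len(tokens)
    return [(word, 100.0 * count / total) for word, count in Counter(tokens).most_common(n)]

# Made-up lines standing in for one country's lyrics subset.
subset = ["lies and truth", "lies upon lies"]
ranking = top_words(subset, n=15, stop_words={"and", "upon"})
for word, share in ranking:
    print(f"{word}: {share:.1f}%")
```

Running the same function on each country’s subset and on the full data set gives rankings that can be compared side by side.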