Counting occurrences of single words is not the most informative way of discovering the meaning (or a possible meaning) of a text, mainly because it ignores both the relationships between words and the contexts in which they occur. A more significant result would be discovering sets of correlated terms that express the ideas or concepts underlying the text. Topic modeling addresses this problem of topic discovery and, more importantly, does so with (almost) no human supervision.
‘Topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. This is especially important in a data set like the black metal lyrics given that there are a number of words (such as death, life and blood) that appear in different contexts.
So, how does a topic modeling tool work? According to this excellent introduction to the subject, its main advantage is that it doesn’t need to know anything about the text except the words in it. It assumes that any piece of text is composed of words taken from “baskets” of words, where each basket corresponds to a topic. It then becomes possible to mathematically decompose a text into the baskets from which its words probably came. The tool repeats this process until it settles on the most likely distribution of words into topics.
The result of this approach is that a piece of text is seen as a mixture of different topics, each with an associated weight. The higher a topic’s weight, the more important it is in characterizing the text. For the sake of a practical example, let’s say we have the following topics:
1. cosmic, space, universe, millions, stars …
2. dna, genetic, evolution, millions, years …
3. business, financial, millions, funding, budget …
Notice how the word “millions” shows up in different contexts: you can have a business text talking about millions of dollars or a science text mentioning evolution over millions of years. Taking the following text as a test case for our simple example topic model…
“The Hubble Space Telescope (HST) is a space telescope that was carried into orbit by a Space Shuttle in 1990. Hubble’s Deep Field has recorded some of the most detailed visible-light images ever, allowing a deep view into space and time. Many Hubble observations have led to breakthroughs in astrophysics, such as accurately determining the rate of expansion of the universe […] ESA agreed to provide funding and supply one of the first generation instruments for the telescope […] From its original total cost estimate of about US$400 million, the telescope had by now cost over $2.5 billion to construct.”
…it does seem reasonable to see it as a mixture of topics 1 and 3, with topic 1 having a higher weight than topic 3.
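To make the idea of a weighted mixture concrete, here is a deliberately naive scoring sketch: it counts how many words from each example basket occur in a condensed paraphrase of the text above and normalizes the counts. A real topic model infers these weights probabilistically; nothing below is the actual algorithm.

```python
# Hypothetical toy scoring, only to illustrate "mixture of topics with weights".
topics = {
    1: {"cosmic", "space", "universe", "millions", "stars"},
    2: {"dna", "genetic", "evolution", "millions", "years"},
    3: {"business", "financial", "millions", "funding", "budget"},
}

text = ("the hubble space telescope is a space telescope carried into orbit "
        "allowing a deep view into space and time and helped determine the "
        "rate of expansion of the universe esa agreed to provide funding "
        "at a cost of four hundred millions of dollars")

words = text.split()
# Count occurrences of each basket's words in the text, then normalize.
raw = {k: sum(w in basket for w in words) for k, basket in topics.items()}
total = sum(raw.values())
weights = {k: raw[k] / total for k in raw}

# Topic 1 (space) gets the largest share, topic 3 (funding) some,
# and topic 2 (genetics) almost none, mirroring the discussion above.
print(weights)
```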
What would a black metal topic model look like? To find out, I’ve made a couple of preliminary experiments using lyrics from French black metal bands (future experiments will explore other subsets of the lyrics corpus, and hopefully build a topic model for the entire data set, if time allows). The model described in this post was generated with Mallet, setting the number of topics to look for to 20 and using its most basic processing techniques: stop word removal, removal of non-alphanumeric characters, feature sequences with bigrams, and little else.
For reasons of economy (and also not to bore you to tears) I’ll just list the top 10, that is, the 10 topics with the highest “weight” in characterizing the French lyrics subset (the remaining 10 have very small weights). Each is represented by 9 terms:
1. life, time, death, eyes, pain, soul, feel, mind, world
2. night, dark, light, cold, black, darkness, moon, sky, eternal
3. world, life, human, death, earth, humanity, end, hatred, chaos
4. blood, body, black, tears, flesh, heart, eyes, love, wind
5. satan, blood, black, god, hell, evil, lord, christ, master
6. war, blood, death, fight, fire, black, kill, rise, hell
7. land, gods, blood, people, proud, men, great, king, ancestors
8. god, time, void, light, death, reality, stars, matter, infinite
9. god, lord, fire, divine, holy, light, flesh, man, great
10. fucking, shit, fuck, make, time, trust, love, suck, dirty
The first one seems to be all over the place: life, time and death can be applied to a ton of subjects, and indeed they seem to characterize about half of the data set to some extent. Also, some terms appear quite often in different contexts (blood, black, death and even god). But there are a couple of interesting ones, such as topics 2, 7 and 10. And because looking at lists of words is tedious, here’s a word cloud that represents them using a horrid sexy color palette. Each topic has a different color, and the larger the font, the more preponderant the topic is in the subset.
One practical application of a topic model is using it to describe a text. Let’s take, for example, the lyrics for Blut Aus Nord’s “Fathers of the Icy Age” and “ask” our topic model what’s the composition of this particular piece of text. The outcome is:
- Topic 7 (54.25%): land, gods, blood, people, proud, men, great, king, ancestors
- Topic 2 (42.34%): night, dark, light, cold, black, darkness, moon, sky, eternal
- Other topics – less than 3.41%
We can interpret this song as a mixture of two topics, and in my opinion, the first one (let’s call it “ancient pagan stuff of yore”) seems to be pretty accurate. What about more personal lyrics such as T.A.o.S. “For Psychiatry”? Here’s what we get:
- Topic 10 (40.95%): fucking, shit, fuck, make, time, trust, love, suck, dirty
- Topic 1 (26.13%): life, time, death, eyes, pain, soul, feel, mind, world
- Topic 3 (12.05%): world, life, human, death, earth, humanity, end, hatred, chaos
It’s a bit too generic for my liking, but we’re not that far off the mark. All in all, topic modeling appears to be quite useful for discovering concepts in our data set. There are, however, a few drawbacks to this approach. One is that the number of topics has to be set manually; ideally, the algorithm would figure out the appropriate number by itself. Another is the simplicity of the features: future experiments should focus on improving the lyrics representation with richer features. At any rate, these are promising results that can be improved further.
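One common workaround for the manual topic count is to fit models for several candidate values of k and compare a quality score such as perplexity (ideally on held-out text), keeping the best-scoring one. Here is a generic sketch with scikit-learn rather than Mallet, on an invented toy corpus:

```python
# Sketch of a topic-count scan, assuming scikit-learn; the corpus is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "night dark moon cold sky",
    "blood war fire death kill",
    "god lord divine holy light",
    "night darkness eternal black moon",
    "war fight blood rise hell",
    "lord god holy fire divine",
]

counts = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(counts)
    # Lower perplexity means the model explains the word counts better.
    # (Scored on the training counts here for brevity; use held-out text
    # in practice to avoid favoring overly large k.)
    scores[k] = lda.perplexity(counts)

best_k = min(scores, key=scores.get)
print(best_k, scores)
```

More principled alternatives exist (topic coherence measures, non-parametric models such as hierarchical Dirichlet processes), but a simple scan like this is often the first thing to try.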