In Part I of this post, we examined a topic model built from a subset of the black metal lyrics data set. It was a preliminary experiment with regard to topic discovery, and we only explored the 10 most frequent topics underlying lyrics authored by French bands.
In this second part we will examine the relationship, i.e. the similarity or dissimilarity, between the topics of a topic model. The model presented here was built from the entire original data set with Stanford’s Topic Modeling Toolbox (STMT). I chose this tool over Mallet for a number of reasons, the most important being that it allows one to “slice” the topics with respect to particular aspects of the data set, such as time or band (a functionality I’ll explore in the next post).
The pre-processing stage included removal of typical stop words (such as the, or and and) and of not-so-typical ones: STMT allows removal of the most frequent words in the data set (if they’re too common, it’s very likely they’re not that informative). I also removed lyrics with fewer than 5 words and kept only those with at least an 85% probability of being detected as written in English (this allows the best translations of lyrics not originally written in English to be included in the analysis as well). The outcome of this pre-processing step is a data set of approximately 52000 distinct lyrics.
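To make these filtering steps concrete, here is a minimal sketch in plain Python. It is not how STMT does it internally: the function names, the tiny stop word set and the `n_most_frequent` parameter are all assumptions made for illustration, and the language-detection step is left out because it needs an external library.

```python
import re
from collections import Counter

# Illustrative stop word set only; a real list would be much longer.
STOP_WORDS = {"the", "or", "and"}


def tokenize(text):
    """Lowercase a lyric and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(lyrics, min_words=5, n_most_frequent=0):
    """Return one token list per lyric that survives the filters.

    Assumption: the minimum length is checked before stop words are removed.
    """
    tokenized = [tokenize(text) for text in lyrics]
    # Drop lyrics with fewer than `min_words` words.
    tokenized = [tokens for tokens in tokenized if len(tokens) >= min_words]
    # Remove the typical stop words.
    tokenized = [[w for w in tokens if w not in STOP_WORDS] for tokens in tokenized]
    # Remove the most frequent remaining words across the whole data set.
    counts = Counter(w for tokens in tokenized for w in tokens)
    too_common = {w for w, _ in counts.most_common(n_most_frequent)}
    return [[w for w in tokens if w not in too_common] for tokens in tokenized]


if __name__ == "__main__":
    sample = ["the cold winter wind and the frozen land", "hail"]
    print(preprocess(sample))
```

The second sample lyric is dropped because it has fewer than 5 words, and the first one loses its stop words.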
Once again, I had to manually set a value for the number of topics (in subsequent experiments I’ll explore the possibility of determining the ideal number automatically) so I picked 30, a nice round number (my guess is as good as yours). Below you’ll find a list of some of them. Remember that a topic is a set of related words (the first and fifth listed here are my personal favourites):
- space, universe, stars, chaos, void, cosmic, light, infinite
- hell, evil, satan, demons, souls, god, unholy, infernal
- fucking, metal, fuck, kill, lust, whore, shit, rape, cunt, bitch
- cold, winter, wind, winds, ice, snow, frozen, land, mountains, frost
- human, earth, race, humanity, destruction, mankind, war, plague, destroy
- fire, burning, burn, flames, flame, burns, ashes, soul, fires
- dark, ancient, power, soul, shadows, evil, spirit, eternal
- pain, soul, hatred, mind, suffering, hate, veins, anger, thoughts, madness
[Click here for the whole 30 topic word distribution list] (note that the numbers inside parentheses represent the “weight” of the word in the topic: the higher the weight, the greater its importance in characterizing the topic)
It’s a pretty diverse list of topics. Some are quite generic, and others occur in a very small percentage of lyrics, but for the most part they seem concise and informative. There are also a number of them that could be related to each other: a topic about sea and waves is probably closer to another composed of words such as wind or sky than to a topic about pain and hatred.
One question that arises is how to determine this hypothetical relationship between the topics. One of the outputs of STMT is a document-by-topic matrix, that is, a matrix where each row corresponds to a lyric and each column to a topic: their intersection gives us the “weight” of a topic in a particular lyric.
What’s important to retain here is that each topic can be represented by a column of values (its weights across the lyrics in the data set). If we want to determine the relationship between two topics in the corpus, we can treat these columns as vectors of numbers and apply a measure to them, such as the Jensen-Shannon divergence. This measure actually gives us the dissimilarity between two vectors: the higher its value, the more unrelated the two topics are.
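The comparison described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the exact code behind the post: the function names are made up, each topic column is first rescaled to sum to 1 so it can be treated as a probability distribution, and the logarithm is taken in base 2 so the result stays between 0 (identical) and 1 (nothing in common).

```python
import math


def normalize(column):
    """Rescale a column of non-negative weights so it sums to 1."""
    total = sum(column)
    if total <= 0:
        raise ValueError("column must contain at least one positive weight")
    return [value / total for value in column]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(column_a, column_b):
    """Jensen-Shannon divergence between two topic columns."""
    p = normalize(column_a)
    q = normalize(column_b)
    # Midpoint distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical weights of three topics over four lyrics.
    sea = [0.7, 0.1, 0.6, 0.0]
    wind = [0.6, 0.2, 0.5, 0.1]
    pain = [0.0, 0.8, 0.1, 0.9]
    print(js_divergence(sea, wind))
    print(js_divergence(sea, pain))
```

With these made-up weights, the first pair of topics shows up in the same lyrics and gets a lower divergence than the second pair, which mostly appears in different ones.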
In the force-directed graph below, each topic is represented as a cluster of its top 7 words, and words with a higher weight in the topic are drawn as circles with larger radii. The higher the divergence between two topics, the further apart they are in the graph (or should be: it’s not perfectly rendered because d3 is really not my strong point, but it will give you an idea of these distances for a few pairs of topics, so refresh and reshuffle as much as you like!). In addition, if you mouse over a particular topic, the graph will highlight its links to other topics: the thinner a link is, the more unrelated the topics are.
Another question that comes to mind when visualizing these topics is which lyrics “embody” them best. Remember that lyrics can be seen as a mixture of topics, so a lyric composed mostly of a single topic will probably represent it better than one that has only a small percentage (say, 10%) of that same topic. This will be addressed in the next post, so stay tuned.
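In the meantime, the basic idea can be sketched as a simple ranking over the document-by-topic matrix. This is only an assumption about how one might do it, not necessarily what the next post will use: the function name and the toy matrix are hypothetical, and each row is assumed to hold the topic proportions of one lyric.

```python
def top_lyrics_for_topic(doc_topic, topic, k=3):
    """Return the row indices of the k lyrics with the highest weight for `topic`."""
    ranked = sorted(
        range(len(doc_topic)),
        key=lambda row: doc_topic[row][topic],
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy matrix: 3 lyrics (rows) by 2 topics (columns).
    doc_topic = [
        [0.9, 0.1],
        [0.1, 0.9],
        [0.6, 0.4],
    ]
    print(top_lyrics_for_topic(doc_topic, topic=0, k=2))
```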