Classification of Political Statements by Orator with DSX

After all that keyword extraction from political speeches, it occurred to me that it would be interesting to find out whether it’s possible to build a model that predicts the political orator to whom a statement, or even a complete speech, belongs. By statement, I mean a sentence with more than a couple of words drawn from a speech (not including interviews or political debates, for example). I took the 327 speeches by 12 US presidents used in the previous post as the basis of a document data set and added a few dozen speeches by other, non-American, dictatorial political leaders to create a set appropriate for a classification task.

I intended to explore two different routes:

  1. Build a predictive model from a collection of sentences previously classified as either being uttered by a US President or by some other politician (all non-American political leaders of the 20th century, all dictators) during an official speech. This can then be defined as a binary classification problem: either a statement is assigned to the class “US PRESIDENT” or it isn’t. All the sentences (33500 in total) were drawn from the speeches mentioned above.
  2. Build another model from a collection of speeches by a diverse group of orators, such that the model can correctly assign a previously unseen speech to the person who delivered it. This can be defined as a multi-class classification problem.

This all sounds reasonable and potentially interesting (and, who knows, even useful), but building predictive models from text-based data is a cumbersome task because there is always a multitude of things to decide beforehand, including:

  • How to represent the text? Learning algorithms can’t deal with text in its original, “as-is” format, so a number of preprocessing steps are needed to transform it into a set of numerical/categorical/ordinal/etc. features that make sense. There are numerous feature types and transformations I could explore here, such as representing the text as a weighted vector space model, using word-based features, character-based features, part-of-speech tags or entities as additional features, building topic models and using the topic probabilities for each document, and so on. The problem is that I don’t have enough time (nor patience) to efficiently decide on the most appropriate feature representation for my speech/sentence data sets.
  • The curse of dimensionality: Assuming I’ve managed to find a good text representation, it’s almost certain that the final dimensionality of the data set presented to the learning algorithm will be prohibitive. Again, there are numerous feature selection methods that can be employed to help ascertain which features are the most informative and discard the rest. I don’t really care about trying them all.
  • Which learning algorithm is appropriate? Finally, which algorithms to use for these two classification tasks. There are hundreds of them out there, not to mention countless parameters to tune, cross-validation schemes to test, different evaluation measures to optimize, and so on. (A rough sketch of one such representation/selection/learning combination follows this list.)
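Just for context, here is a minimal scikit-learn sketch of what a single point in that search space might look like: a TF-IDF vector space representation, chi-squared feature selection, and a linear classifier evaluated by cross-validation. The file name and column names are hypothetical placeholders, and this is only one of the hundreds of combinations a tool like DSX would try for you.

```python
# A minimal sketch of one point in the search space DSX explores automatically:
# TF-IDF representation -> chi-squared feature selection -> linear classifier.
# "sentences.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

data = pd.read_csv("sentences.csv")  # columns: "text", "label" (hypothetical)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),  # weighted vector space model
    ("select", SelectKBest(chi2, k=5000)),                     # tame the dimensionality
    ("clf", LogisticRegression(max_iter=1000)),                # one of many possible learners
])

scores = cross_val_score(pipeline, data["text"], data["label"], cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Every choice in that pipeline (n-gram range, number of selected features, the classifier itself) is exactly the kind of decision I didn’t want to sweep over by hand.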

To avoid losing too much time with all of this just for the sake of a blog post, I decided to use DSX for two simple reasons: 1) it accepts text in its original format and performs all the feature transformation/selection/extraction steps by itself, so I don’t need to worry about that stage, and 2) it tests hundreds of different algorithms and combinations of algorithms to find the best model for the data.

The only pre-processing done to the data sets prior to uploading them as CSV files to DSX was:

  1. Splitting each data set into a training portion, from which to build a predictive model, and a testing portion used to evaluate the model on data unknown to it (and make sure there was no overfitting).
  2. To make things more challenging, I replaced all entities mentioned with their entity type. This is because a sentence or speech mentioning specific dates, people and locations can easily be assigned to the correct orator using those entities alone. For example, “It is nearly five months since we were attacked at Pearl Harbor” is obviously something that only FDR could have said. “Pearl Harbor” is a clear hint of the true class of the sentence, so, to make things more difficult for DSX, it gets replaced with the placeholder “LOCATION”. A similar replacement is applied to entities like organizations, dates or persons, with the help of the Stanford CoreNLP toolkit. (A rough sketch of this step follows this list.)
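For illustration, here is roughly what that entity-masking step could look like in Python, using NLTK’s wrapper around the Stanford NER models (the actual pre-processing used Stanford CoreNLP directly; the model and jar paths below are placeholders):

```python
# Sketch of the entity-masking step: replace named-entity spans with their type.
# Uses NLTK's StanfordNERTagger; the model and jar paths are hypothetical placeholders.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # placeholder path to the NER model
    "stanford-ner.jar",                       # placeholder path to the Stanford NER jar
)

def mask_entities(sentence):
    """Replace each named-entity span with a single placeholder (LOCATION, PERSON, ...)."""
    tagged = tagger.tag(word_tokenize(sentence))
    out, prev_tag = [], "O"
    for token, tag in tagged:
        if tag == "O":
            out.append(token)           # keep non-entity tokens as they are
        elif tag != prev_tag:
            out.append(tag)             # emit one placeholder per entity span
        prev_tag = tag
    return " ".join(out)

print(mask_entities("It is nearly five months since we were attacked at Pearl Harbor"))
# -> "It is nearly five months since we were attacked at LOCATION"
```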

The first model built was the one for the binary version of the data set (i.e., a sentence either belongs to a US president or to a non-American political leader), using a total of 26792 sentences. Of a total of 8500 examined models, DSX found one generated with the Iterative OLS algorithm to be the best, estimating accuracy (that is, the percentage of sentences correctly assigned to their respective class) to fall between 76% and 88%, and average recall (that is, the averaged percentage of correct assignments for each class) to fall in the range of 78% to 88%. Given that the “NON US PRESIDENT” class is about two thirds the size of the “US PRESIDENT” class, average recall is a better evaluation measure than plain accuracy for this particular data set.

[Figure 1: ForecastThis DSX estimated qualities of the best predictive model for the binary political sentences data set.]

To make sure the model was not overfitting the training data, and that the estimates above were correct, I sent DSX a test set of sentences with no labels assigned, and compared the returned predictions with the ground truth. It turns out accuracy is around 82% and average recall approximately 80%. This is a great result overall, and it means we’ve managed to build a model that could be useful, for example, for automatic annotation of political statements.
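For the record, the comparison itself is straightforward once the predictions come back; something along these lines, where the two tiny label lists are placeholders standing in for the real test set and for DSX’s returned predictions:

```python
# Sketch of checking the test-set predictions against the ground truth.
# y_true / y_pred are placeholders for the held-out labels and DSX's predictions.
from sklearn.metrics import accuracy_score, recall_score, classification_report

y_true = ["US PRESIDENT", "NON US PRESIDENT", "US PRESIDENT"]
y_pred = ["US PRESIDENT", "NON US PRESIDENT", "NON US PRESIDENT"]

print("accuracy:      ", accuracy_score(y_true, y_pred))
# "Average recall" here means recall computed per class and then averaged (macro recall),
# which is less forgiving than accuracy when the classes are imbalanced.
print("average recall:", recall_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))
```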

And just for the record, here are a few examples of sentences that the model did not get right:

  • Sentences by US Presidents marked as belonging to non-US political leaders (dictators):
    • We have no territory there, nor do we seek any.
    • That is why we have answered this aggression with action.
    • Freedom’s fight is not finished.
  • Sentences by non-US political leaders (dictators) marked as belonging to US Presidents:
    • The period of war in [LOCATION] is over.
    • The least that could be said is that there is no tranquillity, that there is no security, that we are on the threshold of an uncontrollable arms race and that the danger of a world war is growing; the danger is growing and it is real.

I doubt a person could do much better just by reading the text, with no additional information.


The second model was built from a training set of 232 speeches, each labeled with the respective orator (11 in total). The classes are very unbalanced (that is, the number of examples for each label varies greatly), and some of them are quite small, which makes average recall the best measure to pay attention to when assessing the quality of the predictions made by the model. The best model DSX found was built with Multiquadric Kernel Regression, and although it has a hard time learning three of the eleven classes (see figure below), it’s actually a lot better than I expected given the skew of the data and the fact that all entities were removed from the text.

[Figure 2: ForecastThis DSX predictive model for political speeches by 11 orators. The best model was built with Multiquadric Kernel Regression.]

And what about the model’s performance on the test set? It more or less follows the estimated performance of the trained model: it fails to correctly classify speeches by Hitler (classifying them as belonging to FDR instead) and by Nixon (which are assigned to Lyndon B. Johnson). On the other hand, it correctly classifies all the instances of Reagan, FDR and Stalin, and most of Bill Clinton’s speeches. I’m sure that if I provided a few more examples for each class, the results would greatly improve.
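A confusion matrix is the easiest way to see these per-orator mix-ups at a glance; a small sketch of that check, with placeholder labels standing in for the real test set and for DSX’s returned predictions:

```python
# Sketch of the per-orator test-set check: a confusion matrix shows which speakers
# get confused with which (e.g. Hitler -> FDR, Nixon -> Lyndon B. Johnson).
# y_true / y_pred are placeholders for the held-out labels and DSX's predictions.
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["Reagan", "FDR", "Nixon", "Hitler", "Clinton"]
y_pred = ["Reagan", "FDR", "Johnson", "FDR", "Clinton"]

orators = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=orators)
print(pd.DataFrame(cm, index=orators, columns=orators))  # rows: truth, columns: prediction
```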

To conclude: this model, alongside the very good model obtained for the first data set, illustrates how it is possible to quickly obtain predictive models useful for text annotation of political speeches. And all this with minimal effort, given that DSX can evaluate hundreds of different models very quickly, and also handle the feature engineering side of things, prior to the supervised learning step.

*Disclaimer: I work for ForecastThis, so shameless self-promotion trigger warning goes here.

Sources/Tools

  • ForecastThis DSX
  • US Presidential speeches harvested from the Miller Center Speech Archive

 

 

Part IV – Record Labels and Lyrical Content

In part IV (the final one) of topic discovery in black metal lyrics, we’ll address the issue of assigning topics to record labels based on the lyrical content of their black metal releases. In other words, we want to find out whether a given label has a tendency to release bands that write about a particular theme. We’ll also investigate the temporal evolution of these topics, that is, what changes have happened through the years in the usage of topics in black metal lyrics. This aims to shed some light on whether lyrical content has remained the same throughout the years.

In order to address these questions, I turned once more to topic modeling. This machine learning technique was mentioned in parts I, II and III of this post, so knock yourself out reading those. If that does not appeal to you, let’s sum things up by saying that topic modeling aims to automatically infer (i.e., with minimal human intervention) the topics underlying a collection of texts. A “topic” in this context is defined as a set of words that (co-)occur in the same contexts and are, somehow, semantically related.

Instead of using the topic model built for parts II and III, I generated a new one after some (sensible, I hope) cleaning of the data set. This pre-processing involved, among other things, removal of lyrics that were not fully translated to English and lyrics with fewer than 5 words. In the end, I reduced the data set to 72666 lyrics (how ominous!) and generated a topic model of 30 topics with the Stanford Topic Modeling Toolbox (STMT).
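The model itself was trained with STMT, but for anyone who prefers Python, a roughly equivalent sketch using gensim’s LDA implementation is below (the `lyrics` list and parameter values are placeholders, and gensim’s LDA is not exactly the same estimator STMT uses):

```python
# Rough Python equivalent of the topic-model training step (the post used STMT;
# this sketch swaps in gensim's LDA, and the `lyrics` list is a placeholder).
from gensim import corpora, models
from gensim.utils import simple_preprocess

lyrics = [
    "sacrifice upon the unholy altar of the goat ...",
    "cosmic void beyond the stars of creation ...",
]

texts = [simple_preprocess(doc) for doc in lyrics]      # tokenise and lowercase
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)    # drop very rare and ubiquitous terms
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=30, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=5, num_words=10, formatted=False):
    print(topic_id, [w for w, _ in words])
```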

As in previous attempts, 2 or 3 of these 30 topics seemed quite generic (they were composed of words that could occur in any context) or just plain noisy garbage, but for the most part the topics are quite coherent. I’m listing those I found the most interesting/intriguing. For each of them I added a title (in parentheses) that tentatively describes the overall tone of the topic:

  • Topic 28 (Cult, Rituals & Symbolism): “sacrifice”, “ritual”, “altar”, “unholy”, “goat”, “rites”, “blasphemy”, “chalice”, “temple”, “cult”
  • Topic 23 (Chaos, Universe & Cosmos): “chaos”, “stars”, “universe”, “cosmic”, “light”, “space”, “serpent”, “void”, “abyss”, “creation”
  • Topic 3 (The Divine): “lord”, “behold”, “praise”, “divine”, “god”, “blessed”, “man”, “glory”, “throne”, “perdition”
  • Topic 2 (Mind & Reality): “mind”, “existence”, “reality”, “thoughts”, “sense”, “moment”, “vision”, “mental”, “consciousness”
  • Topic 21 (Flesh & Decay): “flesh”, “dead”, “skin”, “body”, “bones”, “corpse”, “grave”
  • Topic 18 (The End): “end”, “day”, “path”, “leave”, “final”, “stand”, “fate”, “left”

And so on, and so forth. Click here for the full list; it will come in handy for deciphering the plots below.

One nice piece of functionality that STMT offers is the ability to “slice” the data with respect to the topics. This means that, when slicing the data by date, one is able to infer what percentage of lyrics in a given year falls into each topic.
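Conceptually, this slicing boils down to averaging the per-document topic distributions within each group; a small pandas sketch of the idea, with made-up numbers standing in for STMT’s per-lyric output:

```python
# Sketch of the "slicing" idea: average each lyric's topic distribution within a group
# (here, by year). `doc_topics` and `years` are placeholders for STMT's per-document output.
import pandas as pd

doc_topics = pd.DataFrame(
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    columns=["Topic 28", "Topic 23", "Topic 3"],
)
years = pd.Series([1992, 1992, 2006], name="year")

topic_share_by_year = doc_topics.groupby(years).mean()  # rows: year, columns: topic share
print(topic_share_by_year)
# The same groupby over record labels gives the label-vs-topic percentages used further down.
```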

In order to observe the temporal evolution of some of these 30 topics between 1980 and 2014, I chose to use an NVD3 stacked area chart instead of just plotting twenty-something lines (which would be impossible to read given the inevitable overlapping). The final result looks very neat and tidy, but it can also be misleading and give the impression that all the topics rise and fall at the same points in time. This is not true: when inspecting the stacked area chart below, remember that a topic’s share in a given year is represented by the height of its band at that point, not by its top line. You can also deselect all topics (in the legend, top-right corner) except the one you want to examine, or simply click its area in the graph.

It seems that “Pain, Sorrow & Suffering” is consistently the most prevalent topic, peaking at 10.3% somewhere around 2006. “Fucking” has a peak in 1992, and “Warriors & Battles” represents more than 20% of the topic assignment in 1986. For the most part, the topic assignment percentages seem to stabilize after 92/93 (after the Norwegian boom, or second wave, or whatever it’s called).

And finally, when slicing the data set by record label, the output can be interpreted as the percentage of black metal releases by a given label that falls into each topic. After doing precisely that for record labels with a minimum of 10 black metal releases, I selected a few labels and plotted, for each, the percentage of releases that were assigned to the topics with some degree of confidence. The resulting plot is huge, so I removed a few generic topics for the sake of clarity. By hovering the mouse over the topic titles, a set of words that represent the topic will pop up. Similarly, by hovering the mouse over a record label name, the circles will turn into percentages. The larger a circle’s radius, the higher the percentage of that label’s releases that were assigned to the circle’s corresponding topic.

Some results that stand out: it seems that more than 20% of Depressive Illusions‘ releases were assigned to “Pain, Sorrow & Suffering”. The top three topics of End All Life (which has released albums by Abigor, Blacklodge and Mütiilation, to name a few) are “Mind & Reality”, “Pain, Sorrow & Suffering” and “Chaos, Universe & Cosmos”. Also, almost a quarter of all Norma Evangelium Diaboli‘s releases (which include Deathspell Omega, Funeral Mist and Katharsis) seem to pertain to “The Divine” topic.

Edit: WordPress does not allow for huge iframes, so click here to view the Labels vs. Topics plot  in all of its glory.

And that’s it for now. I’m done with topic modeling for the time being, until I have the time and patience to fine-tune the overall representation of the data and the algorithm’s parameters. In the next few weeks I’ll turn to other unsupervised machine learning techniques, such as clustering, to discover hidden relationships between bands.

 

Credits & Useful Resources:

– D3 ToolTip: D3-tip by Caged

– Stacked Area Chart: NVD3 re-usable charts for d3.js

– Labels per Topic: taken from Asif Rahman Journals