After all that keyword extraction from political speeches, it occurred to me that it would be interesting to find out whether it’s possible to build a model that predicts which political orator a statement, or even a complete speech, belongs to. By statement, I mean a sentence of more than a couple of words drawn from a speech (not including interviews or political debates, for example). I took the 327 speeches by 12 US presidents used in the previous post as the basis of a document data set and added a few dozen speeches by other, non-American, dictatorial political leaders, so as to create a set appropriate for a classification task.
I intended to explore two different routes:
- Build a predictive model from a collection of sentences previously classified as being uttered either by a US President or by some other politician (all non-American political leaders of the 20th century, all dictators) during an official speech. This can then be defined as a binary classification problem: either a statement is assigned to a politician of the class “USPresident” or it isn’t. All the sentences (33500 in total) were drawn from the speeches mentioned above.
- Build another model from a collection of speeches by a diverse group of orators, such that the model can correctly assign a previously unseen speech to the person who gave it. This can be defined as a multi-class classification problem.
This all sounds reasonable and potentially interesting (and, who knows, even useful), but building predictive models from text-based data is a very cumbersome task because there’s always a multitude of things to decide beforehand, including:
- How to represent the text? Learning algorithms can’t deal with text in its original, “as-is” format, so there are a number of preprocessing steps to take in order to transform it into a set of numerical/categorical/ordinal/etc. features that make sense. There are numerous feature types and transformations I could explore here, like representing the text as a weighted vector space model, using word-based or character-based features, using part-of-speech tags or entities as additional features, building topic models and using the topic probabilities for each document, and so on. The problem is that I have neither the time nor the patience to work out the most appropriate feature representation for my speech/sentences data set.
- Curse of dimensionality: Assuming I’ve managed to find some good text representation, it’s almost certain that the final number of dimensions of the data set presented to the learning algorithm will be prohibitively large. Again, there are numerous feature selection methods that can help me ascertain which features are more informative and discard the rest. I don’t really care about trying them all.
- What learning algorithm is appropriate? Finally, I have to decide which algorithms to use for these two classification tasks. Again, there are hundreds of them out there, not to mention countless parameters to tune, cross-validation techniques to test, different evaluation measures to optimize, and so on.
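Just to make the first item above concrete, here is a minimal sketch of one of the representations mentioned, a weighted vector space model using TF-IDF weights. It is plain Python written only for illustration: the function names are made up, the smoothed IDF formula is one common variant among several, and none of this is a description of what DSX does internally.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors), with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


sentences = [
    "Freedom's fight is not finished.",
    "The period of war in LOCATION is over.",
]
vocabulary, vectors = tfidf_vectors(sentences)
print(len(vocabulary), "features per sentence")
```

Even with two short sentences every distinct word becomes a feature, which hints at how quickly the dimensionality grows with tens of thousands of sentences.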
To avoid losing too much time with all of this stuff just for the sake of a blog post, I decided to use DSX* for two simple reasons: 1) it accepts text in its original format and does the feature transformation/selection/extraction steps all by itself, so I don’t need to worry about that stage, and 2) it tests hundreds of different algorithms and combinations of algorithms to find the best model for the data.
The only pre-processing done to the data sets prior to uploading them as csv files to DSX was:
- Splitting each data set into a training portion, from which to build a predictive model, and a testing portion used to evaluate the model on data unknown to it (and make sure there was no overfitting).
- To make things more challenging, I replaced all entities mentioned by their entity type. This is because a sentence or speech mentioning specific dates, people and locations can be easily assigned to the correct orator using those entities alone. For example, “It is nearly five months since we were attacked at Pearl Harbor” is obviously something that only FDR could have said. “Pearl Harbor” is a clear hint of the true class of the sentence, and to make things more difficult for DSX, it gets replaced with the placeholder “LOCATION”. A similar replacement is used for entities like organizations, dates or persons with the help of the Stanford CoreNLP toolkit.
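The two preprocessing steps above can be sketched roughly as follows. This is an illustrative stand-in, not the code actually used: `stratified_split` and `mask_entities` are hypothetical helpers, the 20% test fraction is an assumption, and the entity list is supplied by hand here, whereas in practice it would come from a named entity recognizer such as Stanford CoreNLP.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps a similar share in both parts."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Keep at least one test example per label (assumed policy).
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def mask_entities(text, entities):
    """Replace each (surface form, entity type) pair found in the text by its type."""
    # Longest surface forms first, so "Pearl Harbor" is handled before "Pearl".
    for surface, entity_type in sorted(entities, key=lambda e: -len(e[0])):
        text = text.replace(surface, entity_type)
    return text


sentence = "It is nearly five months since we were attacked at Pearl Harbor"
masked = mask_entities(sentence, [("Pearl Harbor", "LOCATION")])
print(masked)  # It is nearly five months since we were attacked at LOCATION
```

Splitting per label matters here because the classes are unbalanced: a purely random split could leave a small class with almost no test examples.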
The first model built was the one for the binary version of the data set (i.e., a sentence either belongs to a US president or to a non-American political leader), using a total of 26792 sentences. Out of 8500 examined models, DSX found one generated with the Iterative OLS algorithm to be the best, estimating accuracy (that is, the percentage of sentences correctly assigned to their respective class) to fall between 76% and 88%, and average recall (that is, the average of the per-class percentages of correct assignments) to fall in the range of 78% to 88%. Given that the “NON US PRESIDENT” class is about two thirds the size of the “US PRESIDENT” class, average recall is a better evaluation measure than plain accuracy for this particular data set.
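To see why the two measures can disagree on unbalanced data, here is a small sketch of both, following the definitions given above. The labels are made up for illustration (a 6-to-4 split mimicking the two-thirds ratio, and a useless model that always predicts the majority class); they are not taken from the actual data set or from DSX's output.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples assigned to their correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_recall(y_true, y_pred):
    """Mean over classes of the fraction of each class that was correctly assigned."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Made-up example: 6 "US" sentences, 4 "NON_US", and a model that always says "US".
y_true = ["US"] * 6 + ["NON_US"] * 4
y_pred = ["US"] * 10

print(accuracy(y_true, y_pred))        # 0.6
print(average_recall(y_true, y_pred))  # 0.5
```

The always-majority model gets 60% accuracy without learning anything, while its average recall stays at the 50% you would expect from guessing between two classes.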
ForecastThis DSX estimated qualities of the best predictive model for the binary political sentences data set.
To make sure the model is not overfitting the training data, and that the estimates above are correct, I sent DSX a test set of sentences with no labels assigned, and compared the returned predictions with the ground truth. Turns out accuracy is around 82% and average recall approximately 80%. This is a great result overall and it means we’ve managed to build a model that could be useful, for example, for automatic annotation of political statements.
And just for the record, here are a few examples of sentences that the model did not get right:
- Sentences by US Presidents marked as belonging to non-US political leaders (dictators):
  - We have no territory there, nor do we seek any.
  - That is why we have answered this aggression with action.
  - Freedom’s fight is not finished.
- Sentences by non-US political leaders (dictators) marked as belonging to US presidents:
  - The period of war in [LOCATION] is over.
  - The least that could be said is that there is no tranquillity, that there is no security, that we are on the threshold of an uncontrollable arms race and that the danger of a world war is growing; the danger is growing and it is real.
I doubt a person could do much better just by reading the text, with no additional information.
The second model was built from a training set of 232 speeches, each labeled with the respective orator (11 in total). The classes are very unbalanced (that is, the number of examples for each label varies greatly), and some of them are quite small, which makes average recall the best measure to pay attention to when assessing the quality of the predictions made by the model. The best model DSX found was built with Multiquadric Kernel Regression, and although it has a hard time learning three of the eleven classes (see figure below), it’s actually a lot better than what I expected given the skewness of the data, and the fact that all entities in the text were replaced with placeholders.
ForecastThis DSX predictive model for political speeches by 11 orators. The best model was built with Multiquadric Kernel Regression.
And what about the model’s performance in the test set? It more or less follows the estimated performance of the trained model: it fails to classify correctly speeches by Hitler (classifying them as belonging to FDR instead), and by Nixon (which are assigned to Lyndon B. Johnson). On the other hand, it does classify correctly all the instances of Reagan, FDR, Stalin, and most of Bill Clinton’s speeches. I’m sure if I provided a few more examples for each class, the results would greatly improve.
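Findings like "speeches by one orator get assigned to another" can be read off a table of confusion counts. Here is a minimal sketch of how such a table could be tallied from test-set predictions. The orator labels and predictions below are invented placeholders, not the actual results, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_counts(y_true, y_pred):
    """Map each true label to a Counter of the labels predicted for it."""
    table = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        table[t][p] += 1
    return table


def most_common_error(table, label):
    """Return the wrong label most often predicted for `label`, or None."""
    errors = {p: n for p, n in table[label].items() if p != label}
    return max(errors, key=errors.get) if errors else None


# Invented toy predictions for three placeholder orators.
y_true = ["orator_a", "orator_a", "orator_b", "orator_b", "orator_c"]
y_pred = ["orator_a", "orator_a", "orator_a", "orator_a", "orator_c"]

table = confusion_counts(y_true, y_pred)
for label in sorted(set(y_true)):
    print(label, dict(table[label]), "most common error:", most_common_error(table, label))
```

In this toy case every speech by `orator_b` is absorbed by `orator_a`, the kind of pattern that small classes tend to produce.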
To conclude: this model, alongside the very good model obtained for the first data set, illustrates how it is possible to quickly obtain predictive models useful for text annotation of political speeches. And all this with minimal effort, given that DSX can evaluate hundreds of different models very quickly, and also handle the feature engineering side of things, prior to the supervised learning step.
*Disclaimer: I work for ForecastThis, so shameless self-promotion trigger warning goes here.
- ForecastThis DSX
- US Presidential speeches harvested from the Miller Center Speech Archive