Conspiracy Theories – Topic Modeling & Keyword Extraction

It’s been a while since I’ve posted something related to topic modeling, and I decided to do so after stumbling upon a conspiracy theory document set that, albeit small, seemed an interesting starting point for building a topic model on the subject of conspiracies.

Unfortunately, these documents seem to be a bit dated, as I couldn’t easily find references to more recent conspiracy theories like chemtrails or “vaccines are a government/big pharma plot to poison people”. At any rate, I’m sure they provide clues to the major themes behind them. As a preliminary exercise, I used JATE’s implementation of GlossEx to extract keywords from the documents. Some of the most “salient” ones (i.e., among the top 50 or so) are represented in the bubble chart below. The size of each bubble represents the score of the keyword, and among them we can see mentions of the CIA, Jews, Tesla, the Illuminati and JFK. Yep, seems like a conspiracy data set, alright!

And now let’s explore the data set in more depth by building a topic model. A ‘topic’ is defined here as a set of words that frequently occur together. Quoting from Mallet: “using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings”. To summarize: topic modeling is a process that decomposes the documents into sets of probabilities. The final outcome of this process is a mathematical model that represents a piece of text as a mixture of different topics, where each topic has a weight (that is, a probability) associated with it. The higher the weight of a topic, the more important it is for characterizing the text. The terms that compose a topic also have different weights: the higher their value, the more important they are for characterizing the topic.
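To make the "mixture of topics" idea concrete, here is a minimal Python sketch. It is not the model built in this post: the two topics, their terms and all the weights are made up for illustration. It only shows how the document-topic weights and the topic-term weights combine into the probability of seeing a word in a document.

```python
# Hypothetical two-topic model over a tiny vocabulary (all numbers are made up).
# Each topic is a probability distribution over terms.
topic_term = {
    "ufo": {"saucer": 0.5, "alien": 0.4, "bank": 0.1},
    "finance": {"saucer": 0.0, "alien": 0.1, "bank": 0.9},
}

# A document is a probability distribution over topics.
doc_topic = {"ufo": 0.75, "finance": 0.25}


def word_probability(word, doc_topic, topic_term):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topic_term[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


def top_terms(topic, topic_term, n=2):
    """Return the n highest-weighted terms of a topic."""
    ranked = sorted(topic_term[topic].items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


bank_probability = word_probability("bank", doc_topic, topic_term)
ufo_top_terms = top_terms("ufo", topic_term)
```

Ranking the terms of a topic by weight, as `top_terms` does, is essentially what one does when reading a topic off a fitted model.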

I’ve used Mallet (Java) and STMT (Scala) before for topic modeling, so I chose something different this time, the R topicmodels package, to build a model for these 432 documents. Here’s a sample of the code. Note that the LDA function accepts the corpus as a DocumentTermMatrix object of unigrams, bigrams and trigrams, and that the topic model has 35 topics. This number was chosen after inspecting the log-likelihood of multiple LDA models, each with a different number of topics. I think 35 topics is excessive for such a small data set, but I will use this criterion just for the sake of having a method that determines this parameter.

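library(tm)          # corpus handling and DocumentTermMatrix
library(RWeka)       # NGramTokenizer and Weka_control
library(topicmodels) # LDA

# VCorpus is used here because recent versions of tm ignore custom tokenizers on a SimpleCorpus
corpus <- VCorpus(DirSource("texts"))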
corpus <- tm_map(corpus, stripWhitespace)

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

dtm <- DocumentTermMatrix(corpus, 
                          control = list(
                                         tokenize = TrigramTokenizer,
                                         stemming = FALSE, 
                                         stopwords = TRUE,                                          
                                         removePunctuation = TRUE))

#Build topic model with 35 topics, previously determined with the logLik function
lda <- LDA(dtm, 35, method="Gibbs", control = list(seed = 123, verbose = 1, iter=1500))

#inspect word distribution per topic (stored as log-probabilities)
beta <- lda@beta

#inspect the composition of documents as mixtures of topics
gamma <- lda@gamma

The following multiple d3 word cloud was built after inspecting the beta object of the model (which tells us the n-grams that compose each topic and also the weight of each n-gram within the topic) and choosing 9 of the 35 topics (some topics were redundant or composed of non-informative terms). The size and the opacity of a term in the visualization reflect its weight. There are topics for all tastes: UFOs, freemasonry, the new world order, Tesla and, strangely, one that mixes Nazis, George Bush and oil (topic 3). By the way, the code used for the multiple word cloud comes from this blog post by Elijah Meeks, and it's a very nice and easy way of representing topics.


Classification of Political Statements by Orator with DSX

After all that keyword extraction from political speeches, it occurred to me that it would be interesting to find out whether it’s possible to build a model that predicts the political orator to whom a statement, or even a complete speech, belongs. By statement, I mean a sentence with more than a couple of words drawn from a speech (not including interviews or political debates, for example). I took the 327 speeches by 12 US presidents used in the previous post as the basis of a document data set and added to it a few dozen speeches by other, non-American, dictatorial political leaders, so as to create a set appropriate for a classification task.

I intended to explore two different routes:

  1. Build a predictive model from a collection of sentences previously classified as either being uttered by a US President or by some other politician (all non-American political leaders of the 20th century, all dictators) during an official speech. This can be defined then as a binary classification problem: either a statement is assigned to a politician of the class “USPresident” or it isn’t. All the sentences (33500 in total) were drawn from the speeches mentioned above.
  2. Build another model from a collection of speeches by a diverse group of orators, such that the model can correctly assign to a previously unseen speech the person associated with it. This can be defined as a multi-class classification problem.

This all sounds reasonable and potentially interesting (and, who knows, even useful), but building predictive models from text-based data is a very cumbersome task because there’s always a multitude of things to decide beforehand, including:

  • How to represent the text? Learning algorithms can’t deal with text in its original, “as-is” format, so there are a number of preprocessing steps to take in order to transform it into a set of numerical/categorical/ordinal/etc. features that make sense. There are numerous feature types and transformations I could explore here, like representing the text as a weighted vector space model, using word-based features, using character-based features, using part-of-speech tags or entities as additional features, building topic models and using the topic probabilities for each document, and so on. The problem is that I have neither the time nor the patience to efficiently decide on the most appropriate feature representation for my speech/sentences data set.
  • Curse of dimensionality: Assuming I’ve managed to find some good text representation, it’s almost certain that the final dimensions of the data set to be presented to the learning algorithm will be prohibitive. Again, there are numerous feature selection methods that can be employed to help me ascertain which features are more informative and discard the rest. I don’t really care about trying them all.
  • What learning algorithm is appropriate? Finally, I have to decide which algorithms to use for these two classification tasks. Again, there are hundreds of them out there, not to mention countless parameters to tune, cross-validation techniques to test, different evaluation measures to optimize, and so on.

To avoid losing too much time with all of this stuff just for the sake of a blog post, I decided to use DSX for two simple reasons: 1) it accepts text in its original format and does all the feature transformation/selection/extraction steps by itself, so I don’t need to worry about that stage, and 2) it tests hundreds of different algorithms and combinations of algorithms to find the best model for the data.

The only pre-processing done to the data sets prior to uploading them as csv files to DSX was:

  1. Subsetting each data set into one training portion, from which to build a predictive model, and a testing portion used to evaluate the model on data unknown to it (and make sure there was no overfitting).
  2. To make things more challenging, I replaced all mentioned entities with their entity type. This is because a sentence or speech mentioning specific dates, people and locations can be easily assigned to the correct orator using those entities alone. For example, “It is nearly five months since we were attacked at Pearl Harbor” is obviously something that only FDR could have said. “Pearl Harbor” is a clear hint of the true class of the sentence, and to make things more difficult for DSX, it gets replaced with the placeholder “LOCATION”. A similar replacement is applied to entities like organizations, dates or persons, with the help of the Stanford CoreNLP toolkit.
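The entity replacement step can be sketched in a few lines of Python. This is only an illustration and not the pipeline used for the post: the entity span below is written by hand, whereas in practice the spans and their types would come from a named entity recognizer. The function name `mask_entities` and the `(start, end, label)` span format are assumptions made for this example, and spans are assumed not to overlap.

```python
def mask_entities(text, entities):
    """Replace each (start, end, label) character span in `text` with its label.

    Assumes the spans do not overlap; they are processed from left to right.
    """
    masked = []
    cursor = 0
    for start, end, label in sorted(entities):
        masked.append(text[cursor:start])  # text before the entity, unchanged
        masked.append(label)               # the entity type replaces the mention
        cursor = end
    masked.append(text[cursor:])           # whatever follows the last entity
    return "".join(masked)


sentence = "It is nearly five months since we were attacked at Pearl Harbor"

# Hand-written span standing in for the output of an entity recognizer.
start = sentence.index("Pearl Harbor")
spans = [(start, start + len("Pearl Harbor"), "LOCATION")]

masked_sentence = mask_entities(sentence, spans)
```

Working with character offsets rather than string replacement avoids accidentally masking the same words when they appear elsewhere in a non-entity role.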

The first model built was the one for the binary version of the data set (i.e., a sentence either belongs to a US president or to a non-American political leader), using a total of 26792 sentences. Of a total of 8500 examined models, DSX found one generated with the Iterative OLS algorithm to be the best, estimating accuracy (that is, the percentage of sentences correctly assigned to their respective class) to fall between 76% and 88%, and average recall (that is, the averaged percentages of correct assignments for each class) to fall in the range of 78% to 88%. Given that the “NON US PRESIDENT” class is about two thirds the size of the “US PRESIDENT” class, average recall is a better evaluation measure than regular accuracy for this particular data set.
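The difference between the two measures is easy to see on a toy example. The Python sketch below uses made-up labels and predictions (not the actual DSX output), with a minority class about two thirds the size of the majority class, and computes both accuracy and average recall as defined above.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Fraction of examples assigned to their correct class."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


def average_recall(truth, predicted):
    """Mean, over classes, of the fraction of each class that was recovered."""
    totals = Counter(truth)
    hits = Counter(t for t, p in zip(truth, predicted) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Made-up example: 6 sentences by US presidents, 4 by other leaders.
truth = ["US"] * 6 + ["NON_US"] * 4
# A classifier that favours the larger class and misses half of the smaller one.
predicted = ["US"] * 6 + ["US", "US", "NON_US", "NON_US"]

toy_accuracy = accuracy(truth, predicted)
toy_average_recall = average_recall(truth, predicted)
```

Here accuracy is 0.8 while average recall is 0.75, because the errors are concentrated in the smaller class. The more unbalanced the classes, the more accuracy can hide this kind of failure.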


ForecastThis DSX estimated qualities of the best predictive model for the binary political sentences data set.

To make sure the model is not overfitting the training data, and that the estimates above are correct, I sent DSX a test set of sentences with no labels assigned, and compared the returned predictions with the ground truth. Turns out accuracy is around 82% and average recall approximately 80%. This is a great result overall and it means we’ve managed to build a model that could be useful, for example, for automatic annotation of political statements.

And just for the record, here are a few examples of sentences that the model did not get right:

  • Sentences by US Presidents marked as belonging to non-US political leaders (dictators):
    • We have no territory there, nor do we seek any.
    • That is why we have answered this aggression with action.
    • Freedom’s fight is not finished.
  • Sentences by non-US political leaders (dictators) marked as belonging to US presidents:
    • The period of war in [LOCATION] is over.
    • The least that could be said is that there is no tranquillity, that there is no security, that we are on the threshold of an uncontrollable arms race and that the danger of a world war is growing; the danger is growing and it is real.

I doubt a person could do much better just by reading the text, with no additional information.

The second model was built from a training set of 232 speeches, each labeled with the respective orator (11 in total). The classes are very unbalanced (that is, the number of examples for each label varies greatly), and some of them are quite small, which makes average recall the best measure to pay attention to when assessing the quality of the predictions made by the model. The best model DSX found was built with Multiquadric Kernel Regression, and although it has a hard time learning three of the eleven classes (see figure below), it’s actually a lot better than what I expected given the skewness of the data and the fact that all entities were removed from the text.


ForecastThis DSX predictive model for political speeches by 11 orators. The best model was built with Multiquadric Kernel Regression.

And what about the model’s performance in the test set? It more or less follows the estimated performance of the trained model: it fails to classify correctly speeches by Hitler (classifying them as belonging to FDR instead), and by Nixon (which are assigned to Lyndon B. Johnson). On the other hand, it does classify correctly all the instances of Reagan, FDR, Stalin, and most of Bill Clinton’s speeches. I’m sure if I provided a few more examples for each class, the results would greatly improve.

To conclude: this model, alongside the very good model obtained for the first data set, illustrates how it is possible to quickly obtain predictive models useful for text annotation of political speeches. And all this with minimal effort, given that DSX can evaluate hundreds of different models very quickly, and also handle the feature engineering side of things, prior to the supervised learning step.

*Disclaimer: I work for ForecastThis, so shameless self-promotion trigger warning goes here.


  • ForecastThis DSX
  • US Presidential speeches harvested from the Miller Center Speech Archive



Automatic Keyword Extraction from Presidential Speeches

Keyword extraction can be defined as the automatic extraction of terms from a set of documents that are in some way more relevant than others in characterizing the corpus’ domain. It’s a task widely used for characterizing biomedical documents, for example, and in general it is very useful as a basis for text classification or summarization.

There are several methods out there that perform this task, some of which use a reference corpus as a means to determine which terms in the test corpus are more unusual relative to what would be expected, and others that only look at the content of the test corpus itself and search for the most meaningful terms within it.

The toolkit JATE offers a number of these algorithms implemented in Java. I chose C-value to extract keywords from a set of speech transcripts by 12 presidents of the United States (from FDR to George W. Bush), which were harvested from the Miller Center Speech Archive. I harvested a total of 327 of these speeches, and my goal is to get a set of keywords that characterizes the set of speeches of each orator (that is, get a set of extracted keywords per president).

I chose C-value (which in recent years became C/NC-value) because it doesn’t need a reference corpus and can extract multi-word keywords: it’s a hybrid method that combines a term-frequency based approach (“termhood”) with an inspection of the frequencies of a term used as part of a larger term (“unithood”)[1][2].
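For the curious, the core of the C-value score can be sketched in Python as follows. This is a simplified reading of the published formula and not JATE's implementation: a candidate's frequency is discounted by the average frequency of the longer candidates that contain it, and the result is multiplied by the base-2 logarithm of its length in words. The candidate terms and their counts are made up for illustration.

```python
import math


def contains(longer, shorter):
    """True if the word sequence `shorter` occurs inside the longer sequence."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(candidates):
    """Score candidate terms; `candidates` maps a tuple of words to its frequency."""
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of the longer candidate terms in which this term is nested.
        nesting = [f for other, f in candidates.items() if contains(other, term)]
        if nesting:
            # Discount occurrences that are explained by the longer terms.
            freq = freq - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * freq
    return scores


# Made-up candidate counts.
counts = {
    ("health", "care"): 10,
    ("health", "care", "reform"): 4,
    ("national", "health", "care"): 2,
}
scores = c_value(counts)
```

With these counts, “health care” scores 7.0: its frequency of 10 is reduced by 3, the average frequency of the two longer terms that contain it, and log2(2) is 1. Note that under this formula a single-word candidate always scores zero, since log2(1) is 0, which is consistent with the method's focus on multi-word terms.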

Here’s a collapsible tree of keywords among the top 20 for each president (click a node to expand). The size of the keyword node reflects its score as determined by the C-value algorithm. “Health care”, for example, has a very large weight in the speeches by Clinton. Overall, the Middle East, social security, energy and world conflicts seem to be the basis of the keywords found by C-value.

For visualization purposes, I’ve manually selected 10 of the top 20 keyterms because there were quite a few that showed up for all presidents (stuff like “american people”, “american citizens”, “americans”), so those were discarded.

Another, more recent, keyword extraction algorithm that doesn’t need a reference corpus is RAKE which, to quote its authors[3], is

 based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words ‘and’, ‘the’, and ‘of’


RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords […] Co-occurrences of words within these candidate keywords are meaningful and allow to identify word cooccurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.

The top results are not as concise as the ones obtained with C-value, but they still provide some clues to the topics addressed by each president. I think FDR’s “invisible thing called ‘conscience’” is my favourite. It also seems to me that splitting the text by stop words might cause keywords to lose some of their original impact: take for example Truman’s statement ‘liberation in commie language means conquest’, which gets truncated to ‘commie language means conquest’.
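The truncation of Truman's statement follows directly from how RAKE builds its candidates. Here is a minimal Python sketch of the idea, not the implementation used for the results above: the stop word list is a tiny hypothetical one, and the scoring follows the degree/frequency ratio described in the RAKE paper, with a phrase scored as the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny hypothetical stop word list; a real run would use a much longer one.
STOP_WORDS = {"a", "and", "in", "is", "of", "that", "the"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text at punctuation and stop words into candidate keyword phrases."""
    phrases = []
    for fragment in re.split(r"[.,;:!?]", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    word_score = {word: degree[word] / freq[word] for word in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


phrases = candidate_phrases("Liberation in commie language means conquest")
scores = rake_scores(phrases)
```

The stop word “in” acts as a delimiter, so the statement yields two candidates, “liberation” and “commie language means conquest”, and the longer one gets by far the higher score. This also shows why RAKE tends to favour long phrases.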