Automatic Keyword Extraction from Presidential Speeches

Keyword extraction can be defined as the automatic extraction of the terms from a set of documents that best characterize the corpus’ domain. It’s widely used for biomedical document characterization, for example, and in general it is very useful as a basis for text classification or summarization.

Several methods perform this task: some use a reference corpus to determine which terms in the test corpus are unusual relative to what would be expected, while others look only at the content of the test corpus itself and search for the most meaningful terms within it.

The JATE toolkit offers a number of these algorithms implemented in Java. I chose C-value to extract keywords from a set of speech transcripts by 12 presidents of the United States (from FDR to George W. Bush), harvested from the Miller Center Speech Archive. I harvested a total of 327 of these speeches, and my goal is to obtain a set of keywords that characterizes each orator’s speeches (that is, one set of extracted keywords per president).

I chose C-value (which in recent years became C/NC-value) because it doesn’t need a reference corpus and can extract multi-word keywords: it’s a hybrid method that combines a term-frequency-based approach (“termhood”) with an inspection of the frequencies of a term used as part of a larger term (“unithood”)[1][2].
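To make the “termhood”/“unithood” combination concrete, here’s a minimal sketch of the core C-value scoring formula from Frantzi and Ananiadou’s work cited above. The function name and data structures are my own illustrative choices, not JATE’s actual API:

```python
from math import log2

def c_value(term, freq, nested_in):
    """Score a candidate multi-word term with the C-value formula.

    freq: maps each candidate term to its corpus frequency.
    nested_in: maps a term to the set of longer candidate terms
    that contain it. Candidates are assumed to be multi-word
    (a single-word term would get log2(1) = 0 here).
    """
    length_factor = log2(len(term.split()))
    containers = nested_in.get(term, set())
    if not containers:
        # Term never appears nested inside a longer candidate.
        return length_factor * freq[term]
    # Nested term: subtract the average frequency of the longer
    # candidates it appears in, so a term that mostly occurs as
    # part of a bigger term is penalized.
    avg_container_freq = sum(freq[t] for t in containers) / len(containers)
    return length_factor * (freq[term] - avg_container_freq)
```

For example, a term like “cell carcinoma” that occurs 7 times but 5 of those inside “basal cell carcinoma” would score log2(2) × (7 − 5) = 2, while the longer, non-nested term keeps its full frequency weight.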

Here’s a collapsible tree of keywords among the top 20 for each president (click a node to expand). The size of the keyword node reflects its score as determined by the C-value algorithm. “Health care”, for example, has a very large weight in the speeches by Clinton. Overall, the Middle East, social security, energy and world conflicts seem to be the basis of the keywords found by C-value.

For visualization purposes, I manually selected 10 of the top 20 key terms: quite a few showed up for all presidents (stuff like “american people”, “american citizens”, “americans”), so those were discarded.

Another, more recent, keyword extraction algorithm that doesn’t need a reference corpus is RAKE which, to quote its authors[3], is

 based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words ‘and’, ‘the’, and ‘of’


RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords […] Co-occurrences of words within these candidate keywords are meaningful and allow to identify word cooccurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.
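The mechanics described in the quote above can be sketched in a few lines. This is my own toy reimplementation, not the authors’ code, and the stop-word list is a tiny illustrative subset:

```python
import re
from collections import defaultdict

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"and", "the", "of", "a", "an", "in", "to", "is", "that", "it"}

def rake(text):
    # 1. Split on punctuation, then on stop words, to get candidate phrases.
    candidates = []
    for fragment in re.split(r"[^a-z\s]+", text.lower()):
        phrase = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if phrase:
                    candidates.append(phrase)
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(phrase)
    # 2. Word scores: degree(w) / frequency(w), where degree counts
    #    co-occurrences within candidate phrases (including w itself).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # 3. A candidate's score is the sum of its word scores.
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p)
              for p in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note how the candidate phrases fall out of the stop-word splitting alone, with no sliding window: longer phrases of co-occurring content words naturally accumulate higher scores.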

The top results are not as concise as the ones obtained with C-value, but they still provide some clues to the topics addressed by each president. I think FDR’s “invisible thing called ‘conscience'” is my favourite. It also seems to me that splitting the text on stop words can cause keywords to lose some of their original impact: Truman’s statement ‘liberation in commie language means conquest’, for example, gets truncated to ‘commie language means conquest’.




Part III – Topic and Lyrical Content Correlation

In part II of this post, we explored a topic model built for the whole black metal lyrics data set (if you don’t know what a topic model is, read this as well; to sum things up, topic modeling is a process that enables discovery of the “meaning” underlying a document, with minimal human intervention). In that post we analyzed 1) the relationship between topics, and 2) the importance of individual words in characterizing them, by means of a force-directed graph which (let’s face it) is a bit of a bubbly mess.
To better illustrate the second point, I decided to build a zoomable treemap. In it, each large box (distinguished from the surrounding boxes by a label and a distinct color) represents a topic, i.e. a set of words that are somehow related and occur in the same context(s). Clicking a label zooms into that topic and presents its ten most relevant words. For example, by clicking on “Coldness”, you’ll see the top 10 terms that compose it (“ice”, “frost”, “snow” and so on). The area of each word is proportional to its importance in characterizing the topic: in our “Coldness” example, “cold” occupies a larger area than the rest, being the most relevant word in this context.
Similarly, the total area of each topic is proportional to its incidence in the black metal lyrics data set. For example, “Fire & Flames” has a larger area than “Mind & Reality” or “Universe & Cosmos”, making it more likely to occur when inferring the topics that characterize a song.

By the way, these topic labels were chosen manually. Unfortunately I couldn’t devise an automated process to do that for me (if anyone has an inkling of how to do this, let me know), so I had to pick meaningful and (I hope) reasonably representative titles for each set of words. In most cases, like the aforementioned “Coldness”, the concept behind the topic is evident. There are, however, a few cases where I had to be a bit more creative because the meaning of the topic is not so obvious (“Urban Horror” comes to mind).

There are also two topics which are quite generic, with terms that could occur in almost any context, so they’re simply labeled “Non-descriptive”.

As mentioned in part II of this post, one goal of this whole mess is to find out which lyrics “embody” a specific topic. Since the topic model sees the lyrical content of a song as a mixture of topics, we’re interested in discovering lyrics that are composed solely of a single topic (or almost entirely, let’s say more than 90%). Using the topic inference capabilities of the Stanford Topic Modeling Tool I did just that, selecting at least 3 representative lyrics for 14 of the topics above. They’re displayed in the collapsible tree below.
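The “more than 90% of a single topic” filter is easy to express once you have per-document topic proportions, which is the kind of output most topic-modeling tools produce. A hypothetical sketch (the function name and data layout are my own, not the Stanford tool’s format):

```python
from collections import defaultdict

def dominant_docs(doc_topics, threshold=0.9):
    """doc_topics: {doc_id: [p_topic0, p_topic1, ...]}, proportions
    summing to 1 per document. Returns {topic_index: [doc_ids]} for
    documents whose top topic exceeds `threshold`."""
    picks = defaultdict(list)
    for doc_id, props in doc_topics.items():
        top = max(range(len(props)), key=props.__getitem__)
        if props[top] > threshold:
            picks[top].append(doc_id)
    return dict(picks)
```

A song split 95%/5% between two topics would be kept and filed under its dominant topic, while a 50/50 mixture would be discarded as not representative of any single topic.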

For the most part the lyrics seem to have a high degree of correlation with the topic assigned to them: for instance, Immortal’s “Mountains of Might” fits the “Coldness” topic fairly well (surprise, surprise…) and Vondur’s cover of an Elvis Presley song obviously falls into the heart stuff category. But there is one intriguing result: after reading Woods of Infinity’s “A Love Story”, I was expecting it to have the “Dreams & Stuff from the Heart” topic assigned to it. It falls in the “Fucking” topic instead, so maybe the algorithm detected something (creepy) between the lines.



The zoomable treemap was built from Bill White’s Treemap with Title Headers.

The collapsible tree was inspired by this tree and this other tree.