Automatic Keyword Extraction from Presidential Speeches

Keyword extraction can be defined as the automatic extraction of terms from a set of documents that are in some way more relevant than others in characterizing the corpus’s domain. It’s widely used to characterize biomedical documents, for example, and in general it is very useful as a basis for text classification or summarization.

There are several methods that perform this task. Some use a reference corpus to determine which terms in the test corpus are more unusual relative to what would be expected; others look only at the content of the test corpus itself and search for the most meaningful terms within it.

The JATE toolkit offers Java implementations of a number of these algorithms. I chose C-value to extract keywords from a set of speech transcripts by 12 presidents of the United States (from FDR to George W. Bush), harvested from the Miller Center Speech Archive. There are 327 speeches in total, and my goal is to get a set of keywords that characterizes the speeches of each orator (that is, one set of extracted keywords per president).

I chose C-value (which in recent years became C/NC-value) because it doesn’t need a reference corpus and can extract multi-word keywords: it’s a hybrid method that combines a term-frequency based approach (“termhood”) with an inspection of the frequencies of a term used as part of a larger term (“unithood”)[1][2].
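To make the scoring idea concrete, here is a minimal Python sketch of the C-value formula in its commonly published form: a candidate's frequency is weighted by the log of its length, and a candidate that appears nested inside longer candidates is penalized by the average frequency of those longer candidates. This is not JATE's code, and JATE's implementation may differ in its details. The function names (`c_value`, `is_nested`) and the toy frequencies are invented for illustration.

```python
import math
from collections import defaultdict


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Score candidate terms.

    freq maps each candidate term (a tuple of words) to its corpus frequency.
    """
    # For every candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for a in freq:
        for b in freq:
            if len(b) > len(a) and is_nested(a, b):
                containers[a].append(b)

    scores = {}
    for a, f_a in freq.items():
        weight = math.log2(len(a))  # note: single-word candidates get weight 0
        longer = containers[a]
        if longer:
            # Subtract the average frequency of the longer terms containing `a`.
            penalty = sum(freq[b] for b in longer) / len(longer)
            scores[a] = weight * (f_a - penalty)
        else:
            scores[a] = weight * f_a
    return scores


# Toy frequencies, made up for this example (not taken from the speeches).
toy_freq = {
    ("social", "security"): 10,
    ("social", "security", "system"): 4,
    ("health", "care"): 7,
}
for term, score in sorted(c_value(toy_freq).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

In this toy input, “social security” loses part of its score because some of its occurrences belong to the longer candidate “social security system”, while “health care” keeps its full frequency.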

Here’s a collapsible tree of keywords among the top 20 for each president (click a node to expand). The size of the keyword node reflects its score as determined by the C-value algorithm. “Health care”, for example, has a very large weight in the speeches by Clinton. Overall, the Middle East, social security, energy and world conflicts seem to be the basis of the keywords found by C-value.

For visualization purposes, I’ve manually selected 10 of the top 20 keywords, because quite a few showed up for all presidents (stuff like “american people”, “american citizens”, “americans”) and were discarded.

Another, more recent, keyword extraction algorithm that doesn’t need a reference corpus is RAKE which, to quote its authors[3], is

 based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words ‘and’, ‘the’, and ‘of’


RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords […] Co-occurrences of words within these candidate keywords are meaningful and allow to identify word cooccurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.

The top results are not as concise as the ones obtained with C-value, but they still provide some clues to the topics addressed by each president. I think FDR’s “invisible thing called ‘conscience’” is my favourite. It also seems to me that splitting the text at stop words might cause keywords to lose some of their original impact: take, for example, Truman’s statement ‘liberation in commie language means conquest’, which gets truncated to ‘commie language means conquest’.
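The partitioning and scoring steps in the quote can be sketched in a few lines of Python. This is a simplified illustration, not the authors' reference implementation: the stop word list is a tiny made-up one, the tokenizer only handles plain ASCII letters, and the function names are invented. Each word is scored as degree divided by frequency, where the degree counts the words it co-occurs with inside candidate phrases (itself included), and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "that", "for", "by"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()"]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within this phrase
    return {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}


scores = rake_scores("Liberation in commie language means conquest")
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(phrase), score)
```

With this toy stop word list, the stop word “in” splits Truman’s statement into “liberation” and “commie language means conquest”, which is the truncation described above; the longer phrase scores higher because each of its words co-occurs with three others.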




Using NLP to build a Black Metal Vocabulary

Black metal has, since its inception, typically been linked to Satanic or anti-Christian themes. With the proliferation of bands in the 90s (after the Norwegian boom) and the subsequent emergence of sub-genres, other topics such as paganism, metaphysics, depression and even nationalism came to the fore.

In order to discover the terminology used to explore these lyrical themes, I’ve devised a couple of term extraction experiments using the black metal data set. The goal here is to build a black metal vocabulary by discovering salient words and expressions, that is, terms that carry more information when used in BM lyrics than when used in a “normal” setting. For instance, the terms “Nazarene” or “Wotan” have a much higher weight in the black metal domain than in the general-purpose corpus used for comparison. Once again, note that this does not necessarily mean that these two words occur very frequently in BM lyrics (I’d bet that “Satan” or “death” have a higher number of occurrences), but it indicates that, when they do, they carry more information within the BM context.
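The difference between “frequent” and “salient” can be illustrated with a small Python sketch that compares a word’s relative frequency in the domain corpus with its relative frequency in a reference corpus. This is a simplified log-ratio measure written for illustration, not the GlossEx algorithm as implemented in JATE; the function name, the add-one smoothing and the toy token lists are all assumptions.

```python
import math
from collections import Counter


def domain_specificity(domain_tokens, reference_tokens):
    """Log-ratio of smoothed relative frequencies: domain vs. reference."""
    domain_counts = Counter(domain_tokens)
    reference_counts = Counter(reference_tokens)
    vocab = set(domain_counts) | set(reference_counts)
    # Add-one smoothing so words unseen in the reference corpus
    # do not cause a division by zero.
    domain_total = len(domain_tokens) + len(vocab)
    reference_total = len(reference_tokens) + len(vocab)
    scores = {}
    for word, count in domain_counts.items():
        p_domain = (count + 1) / domain_total
        p_reference = (reference_counts[word] + 1) / reference_total
        scores[word] = math.log(p_domain / p_reference)
    return scores


# Toy corpora, invented for this example.
lyrics = "satan death wotan death satan satan".split()
general = "death death death satan satan life life work work work".split()

for word, score in sorted(domain_specificity(lyrics, general).items(),
                          key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

In this toy data “wotan” occurs only once in the lyrics, fewer times than “satan” or “death”, yet it ranks first because it never appears in the reference tokens.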

This task was carried out with JATE’s implementations of the GlossEx and C-value algorithms. The part of speech of each term (that is, the “type” of term) was determined with the StanfordNLP toolkit. The top 50 of each type (with the exception of adverbs) are listed in the table below. For the sake of visualization, I make a distinction between named entities/locations and the other nouns; the former are depicted in the word maps at the end of this post.

I’ve also included, in the last column of the table, the top term combinations. It’s noteworthy how many of these combinations are either negations of something (“no hope”, “no god”, “no life” and so on) or concerned with time (“eternal darkness”, “ancient times”). Such preoccupation with large extensions of “time” is also evident in the top adverbs (“eternally”, “forever”, “evermore”), adjectives (“endless”, “eternal”) and even nouns (“aeon” or “eon”).

Adjectives    Adverbs       Verbs       Nouns        Term combinations
------------  ------------  ----------  -----------  -----------------
Endless       Nevermore     Desecrate   Forefather   Life and death
Unhallowed    Eternally     Smolder     Armor        Human race
Luciferian    Tomorrow      Travel      Aeon         No light
Infernal      Infernally    Fuel        Splendor     No hope
Necromantic   Forever       Spiral      Pentagram    Eternal night
Paralyzed     Anymore       Dethrone    Perdition    No god
Pestilent     Mighty        Throne      Specter      Full moon
Unholy        Skyward       Envenom     Misanthrope  No life
Illusive      Evermore      Lay         Cross        Black metal
Untrodden     Earthward     Resound     Magick       Cold wind
Astral        Someday       Mesmerize   Nihil        No place
Misanthropic  Astray        Abominate   Ragnarok     No escape
Unmerciful    Onward        Paralyze    Blasphemer   No return
Cruelest      Verily        Blaspheme   Profanation  Eternal life
Blackest      Deathly       Impale      Misanthropy  No fear
Eternal       Forth         Cremate     Malediction  Flesh and blood
Wintry        Unceasingly   Bleed       Revenant     No matter
Bestial       Weightlessly  Procreate   Damnation    Fallen angel
Reborn        Anew          Enslave     Conjuration  Eternal darkness
Putrid        Demonically   Awake       Undead       No man
Darkest                     Behold      Nothingness  Dark night
Unblessed                   Intoxicate  Armageddon   Lost soul
Colorless                   Devour      Lacerate     No end
Diabolic                    Bury        Wormhole     Ancient time
Demonic                     Demonize    Eon          No remorse
Wrathful                    Forsake     Devourer     No reason
Nebular                     Enshroud    Impaler      No longer
Vampiric                    Writhe      Sulfur       Black cloud
Unchained                   Destroy     Betrayer     Dark forest
Armored                     Entomb      Deceiver     Human flesh
Immortal                    Raze        Bloodlust    Endless night
Hellish                     Flagellate  Reaper       Ancient god
Hellbound                   Unleash     Horde        Mother earth
Unnamable                   Convoke     Blasphemy    Black wing
Prideful                    Crucify     Eternity     Night sky
Colorful                    Fornicate   Defiler      Dark side
Unbaptized                  Torment     Immolation   Eternal sleep
Unforgotten                 Venerate    Soul         Black hole
Satanic                     Beckon      Abomination  Black heart
Morbid                      Defile      Flame        Flesh and bone
Sempiternal                 Distill     Hail         No chance
Mortal                      Immolate    Malignancy   Dark cloud
Honorable                   Welter      Wrath        Final battle
Glooming                    Run         Pestilence   Eternal fire
Willful                     Sanctify    Gallow       No peace
Lustful                     Eviscerate  Disbeliever  No future
Everlasting                 Unchain     Witchery     Black soul
Impure                      Ravage      Satanist     Final breath
Promethean                  Mutilate    Lust         Black night

Most salient entities: many are drawn from the Sumerian and Nordic mythologies. I’ve also included in this bunch groups of animals (“Beasts”, “Locusts”).

Most salient locations. I’ve also included in this bunch nondescript places (“Northland”). Notice how most are concerned with the afterlife (surprisingly, “hell” is not one of them).

It occurred to me that these results could be the starting point of an automatic lyric generator (like the now defunct Scandinavian Black Metal Lyric Generator). Could be a fun project, if time allows (probably not).


IBM GlossEx

Jason Davies’ D3 Word Cloud

JATE – Java Automatic Text Extraction

StanfordNLP Core