Keyword extraction can be defined as the automatic extraction of terms from a set of documents that are in some way more relevant than others in characterizing the corpus’ domain. It is widely used to characterize biomedical documents, for example, and more generally it is a useful basis for text classification or summarization.
There are several methods for this task. Some use a reference corpus to determine which terms in the test corpus are more unusual relative to what would be expected; others look only at the content of the test corpus itself and search for the most meaningful terms within it.
The toolkit JATE offers a number of these algorithms implemented in Java. I chose C-value to extract keywords from a set of speech transcripts by 12 presidents of the United States (from FDR to George W. Bush), which were harvested from the Miller Center Speech Archive. I harvested a total of 327 of these speeches, and my goal is to get a set of keywords that characterizes the set of speeches of each orator (that is, get a set of extracted keywords per president).
I chose C-value (which in recent years became C/NC-value) because it doesn’t need a reference corpus and can extract multi-word keywords: it’s a hybrid method that combines a term-frequency-based approach (“termhood”) with an inspection of the frequencies of a term used as part of a larger term (“unithood”).
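To make the scoring idea concrete, here is a minimal sketch of the commonly cited C-value formula. This is not the JATE implementation: the function names `contains` and `c_value` are my own, the candidate terms and their frequencies are made up for illustration, and the linguistic filtering that normally selects candidates is left out. A term that never appears inside a longer candidate scores log2(length) × frequency; a nested term has the average frequency of the longer terms containing it subtracted first.

```python
import math


def contains(longer, shorter):
    """Return True if the word tuple `shorter` occurs contiguously in `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_value(freqs):
    """Score candidate terms.

    `freqs` maps a tuple of words (the candidate term) to its frequency.
    Simplified: uses raw frequencies of the longer terms, and log2 of the
    term length, so single-word terms always score 0.
    """
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of longer candidates that contain this term.
        nesting = [f for other, f in freqs.items()
                   if len(other) > len(term) and contains(other, term)]
        weight = math.log2(len(term))
        if nesting:
            scores[term] = weight * (freq - sum(nesting) / len(nesting))
        else:
            scores[term] = weight * freq
    return scores


# Hypothetical counts, for illustration only.
example = {
    ("health", "care"): 10,
    ("health", "care", "reform"): 4,
    ("social", "security"): 5,
}
for term, score in sorted(c_value(example).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

With these made-up counts, “health care” is penalized for the occurrences that really belong to “health care reform”, while the longer term gets a boost from its length.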
Here’s a collapsible tree of keywords among the top 20 for each president (click a node to expand). The size of the keyword node reflects its score as determined by the C-value algorithm. “Health care”, for example, has a very large weight in the speeches by Clinton. Overall, the Middle East, social security, energy and world conflicts seem to be the basis of the keywords found by C-value.
For visualization purposes, I’ve manually selected 10 of the top 20 keyterms because there were quite a few that showed up for all presidents (stuff like “american people”, “american citizens”, “americans”), so those were discarded.
Another, more recent, keyword extraction algorithm that doesn’t need a reference corpus is RAKE which, to quote its authors, is
based on the observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words ‘and’, ‘the’, and ‘of’
RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords […] Co-occurrences of words within these candidate keywords are meaningful and allow to identify word cooccurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.
The top results are not as concise as the ones obtained with C-value, but they still provide some clues to the topics addressed by each president. I think FDR’s “invisible thing called ‘conscience’” is my favourite. It also seems to me that splitting the text by stopwords might cause keywords to lose some of their original impact: take for example Truman’s statement ‘liberation in commie language means conquest’, which gets truncated to ‘commie language means conquest’.
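The truncation described above is easy to reproduce with a minimal sketch of the RAKE idea. This is not the Java implementation I used: the stop word list is a tiny illustrative one, and the names `candidate_phrases` and `rake_scores` are my own. Text is split at punctuation and stop words, each word is scored as degree / frequency (where degree counts the words it co-occurs with inside candidate phrases, itself included), and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list; a real run would use a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()"]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stop_words:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(phrases):
    """Score each distinct phrase as the sum of degree/frequency of its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, itself included
    return {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}


phrases = candidate_phrases("Liberation in commie language means conquest.")
for phrase, score in sorted(rake_scores(phrases).items(), key=lambda kv: -kv[1]):
    print(" ".join(phrase), score)
```

Because “in” is a stop word, “liberation” ends up as a separate one-word candidate, and the long remainder wins on score simply because its words co-occur with more neighbours.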
- Combining C-value and Keyword Extraction Methods for Biomedical Terms Extraction, Ventura et al., 2013
- A Comparative Evaluation of Term Recognition Algorithms, Zhang et al., LREC, 2008
- Automatic keyword extraction from individual documents, Rose et al., 2010
- RAKE Java implementation
- Miller Center Speech Archive