Lexical Analysis of 2008 US Presidential and Vice-Presidential Debates
home | Martin Krzywinski : projects contact
Lexical Analysis of Barack Obama vs John McCain (1st debate)
Debate Word Count
Summary Word Count
The summary word count reports the total number of words and the
number of unique, non-stop words
used by each candidate. Word counts are expressed as both absolute and relative values.
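As a rough illustration of how such a summary can be produced, the sketch below counts total words, share of the debate, and unique non-stop words per speaker. The tokenizer, the tiny `STOP_WORDS` set and the sample sentences are all assumptions made for the example; the stop list and parsing rules actually used for the tables are not given in this section.

```python
import re

# Hypothetical, deliberately tiny stop list (the real one is not given here).
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
              "we", "i", "that", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(text_by_speaker):
    """Return total words, share of the debate and unique non-stop words per speaker."""
    tokens = {speaker: tokenize(text) for speaker, text in text_by_speaker.items()}
    debate_total = sum(len(words) for words in tokens.values())
    summary = {}
    for speaker, words in tokens.items():
        unique_non_stop = {w for w in words if w not in STOP_WORDS}
        summary[speaker] = {
            "total": len(words),                      # column a
            "share": len(words) / debate_total,       # column b
            "unique_non_stop": len(unique_non_stop),  # column c
        }
    return summary

# Made-up sample text, not taken from the transcript.
sample = {
    "A": "We need to invest in energy and we need to invest in people",
    "B": "The fundamentals of the economy",
}
result = summarize(sample)
for speaker, row in result.items():
    print(speaker, row)
```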
Table 1 Analysis
The candidates' time allowance was equal and given the fact
that both candidates used approximately the same number of words, it
can be concluded that the global cadence of speech is similar.
Although I am not surprised that the total number of words used
is similar (Obama delivered 7,529 words, 7% more than
McCain's 7,043), the fact that the total number of unique words was nearly identical for both candidates (1,376 vs 1,380) was a
shock. Though both Obama and McCain can be considered
articulate, Obama presents as verbally sharper than McCain and his
delivery has a greater nimbleness to it, which is reflected in his
slightly higher volume of word delivery. During his unscripted
deliveries, Obama's manner hints at a significant command of the
English language and suggests that his verbal abilities are not
stretched. For this reason, I was expecting his unique word count to be higher.
The fact that the unique word count is nearly identical suggests a high
degree of rehearsal and preparation. It may well be that both
candidates spent a significant amount of time being coached to effect
the best delivery that would reach the greatest number of people. It may
also be that through the process of political selection, both
candidates epitomize an archetype of spoken word delivery.
It also came as a surprise that the total number of unique words
used by both candidates was only 2,115. Initially, I felt this to be
low - surely the matters of state require more than two thousand
words. For both candidates, I suspect a significant amount of coaching
towards conformity to the average American's comprehension.
Table 1 Legend
a :: total number of words
b :: proportion of words in the debate
c :: unique words in (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c
Stop Word Contribution
In the table below, the candidates' delivery is partitioned into stop and non-stop words. Stop words are frequently-used bridging words (e.g. pronouns and conjunctions) and do not carry inherent meaning. The fraction of words that are stop words is one measure of the complexity of speech.
Table 2 Analysis
Obama's absolute stop word count is higher than McCain's
(4,263 vs 3,922), but Obama's total word count is also
higher. Relative to each candidate's total word count, Obama's and McCain's
stop word fractions are similar, at 56.6% and 55.7%, respectively.
Stop word counts do not reveal a significant difference between the two candidates.
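The stop word percentages can be reproduced from the counts quoted in this section. The sketch below also shows one way to partition a token list into stop and non-stop words; the `partition` helper, its three-word stop set and the sample sentence are hypothetical, and only the four counts come from the analysis.

```python
def partition(tokens, stop_words):
    """Split a token list into (stop, non_stop) lists."""
    stop = [t for t in tokens if t in stop_words]
    non_stop = [t for t in tokens if t not in stop_words]
    return stop, non_stop

def stop_fraction(stop_count, total_count):
    """Percentage of a speaker's words that are stop words."""
    return 100.0 * stop_count / total_count

# Toy demonstration with a made-up sentence and stop set.
stop, non_stop = partition("we need to fix the economy".split(), {"we", "to", "the"})

# Counts quoted in the text: (stop words, total words).
obama_pct = stop_fraction(4263, 7529)
mccain_pct = stop_fraction(3922, 7043)
print(f"Obama: {obama_pct:.1f}%  McCain: {mccain_pct:.1f}%")
```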
Table 2 Legend
a :: total number of words, for a given category (all, stop, non-stop)
b :: (a) relative to words in the debate if category=all, otherwise relative to words by the candidate
c :: number of unique words within set (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c
All further analysis uses debate content that has been filtered for stop words.
The word frequency table summarizes the frequency with which words were used. Specifically, it reports the average word frequency and the weighted cumulative frequencies at the 50th and 90th percentiles. The average word frequency indicates how many times, on average, a word is used. For a given fraction of the entire delivery, the weighted cumulative frequency indicates the largest word frequency within this fraction (details about weighted cumulative distribution).
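One possible reading of the weighted cumulative frequency is sketched below: words are sorted from least to most frequent, each word is weighted by its own count, and the value reported is the frequency of the word at which the running total first reaches the requested fraction of all words delivered. This interpretation, the function names and the sample tokens are assumptions; the linked details page is the authoritative definition.

```python
from collections import Counter

def average_frequency(counts):
    """Average number of times each distinct word is used."""
    return sum(counts.values()) / len(counts)

def weighted_cumulative_frequency(counts, fraction):
    """Largest word frequency found within the given fraction of the delivery.

    Words are visited from least to most frequent; each contributes its own
    count to the running total of delivered words.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts.values())
    running = 0
    for freq in sorted(counts.values()):
        running += freq
        if running >= fraction * total:
            return freq
    return max(counts.values())

# Made-up tokens: tax x4, energy x2, and four words used once each.
counts = Counter("tax tax tax tax energy energy oil wind debt war".split())
avg = average_frequency(counts)
f50 = weighted_cumulative_frequency(counts, 0.5)
f90 = weighted_cumulative_frequency(counts, 0.9)
print(round(avg, 2), f50, f90)
```

In this toy sample, half of the delivery is made up of words used twice or less, while reaching 90% of the delivery requires including the word used four times.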
Table 3. Average, 50%, and 90% weighted cumulative word frequencies (content filtered for stop words).
Part of Speech Analysis
In this section, words are broken down by part of speech (POS). The four POS groups examined are nouns, verbs, adjectives and adverbs. Conjunctions and prepositions are not considered. The first category (n+v+adj+adv) is composed of all four POS groups.
Part of Speech Count
Table 5 Analysis
The composition of the candidates' speech by part of speech is remarkably similar. The relative breakdown of nouns, verbs, adjectives and adverbs for Obama is 53:25:15:7 and 54:26:14:5 for McCain. I am more than mildly surprised at such an incredible uniformity in the speech of the candidates. The ratio of noun:verb:adjective:adverb reduces to about 8:4:2:1.
Within each POS category, the number of unique words is nearly identical for both candidates, with Obama (McCain) having 39% (41%), 46% (45%), 45% (45%) and 34% (39%) of their nouns, verbs, adjectives and adverbs unique. The largest difference is in the use of adverbs, with McCain having 39.2% of all his adverbs unique, whereas Obama's adverbs have a unique component of 34.3%.
Note that Obama uses adverbs more than McCain (6.8% vs 5.5%) - his speech included 213 adverbs (73 unique) whereas McCain used 166 adverbs (65 unique).
Table 5 Legend
a :: total number of words for a given POS (all, noun, verb, adjective, adverb)
b :: (a) relative to all words by candidate
c :: unique words in (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c
Part of Speech Frequency
Table 5 Analysis
This table hints at a significant difference in verb and adverb use.
As indicated in the previous table, McCain used fewer total adverbs
than Obama (166 vs 213), but his unique adverb fraction was higher
(39.2% vs 34.3%). It looks like Obama really likes adverbs, and really
likes repeating them too. Obama's average adverb frequency was
2.92, compared to McCain's 2.55. Moreover, 90% of Obama's
adverbs were used with a frequency of 36 times or less, whereas 90% of
McCain's adverbs were used with a frequency of 13 or less.
Obama, however, is significantly less repetitive with verbs, with 90% of his verbs
used 16 times or less, compared to 90% of McCain's verbs which
were used 25 times or less. Thus, although the candidates' total
and unique verb counts were similar (see previous table), Obama's
distribution in verb frequency was skewed towards less repetition.
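A quick check suggests that the average frequency reported for a part of speech is the total count divided by the unique count, which makes it the reciprocal of the unique fraction. The sketch below verifies this against the adverb counts quoted in this section; the function names are illustrative only.

```python
def unique_fraction(total, unique):
    """Percentage of a speaker's words (for one POS) that are distinct."""
    return 100.0 * unique / total

def average_frequency(total, unique):
    """Average number of times each distinct word was used."""
    return total / unique

# Adverb counts quoted in the text: (total, unique).
adverbs = {"Obama": (213, 73), "McCain": (166, 65)}
report = {
    speaker: (round(unique_fraction(t, u), 1), round(average_frequency(t, u), 2))
    for speaker, (t, u) in adverbs.items()
}
for speaker, (pct, avg) in report.items():
    print(f"{speaker}: {pct}% unique, average frequency {avg}")
```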
Table 5 Legend
a :: average word frequency
b :: largest word frequency in 50% of content
c :: largest word frequency in 90% of content
bar :: proportion of a:b:c
Part of Speech Pairing
Through word pairing, I attempt to capture the contextual use of parts of speech within a sentence and extract concepts from the text. Specifically, unique pairs of words indicate complexity and inter-relatedness between concepts in a sentence.
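The pairs quoted in this analysis (for example "care health" and "north korea") are written in alphabetical order, which suggests unordered pairs of words that occur in the same sentence. The sketch below builds such pairs for a chosen combination of parts of speech. The pairing rule, the counting of each pair once per sentence, and the hand-tagged sample sentences are assumptions, not the method used for the tables.

```python
from collections import Counter
from itertools import combinations

def pos_pairs(tagged_sentences, pos_a, pos_b):
    """Count unordered word pairs whose POS tags match (pos_a, pos_b).

    Each sentence is a list of (word, pos) tuples. A pair is stored in
    alphabetical order and counted at most once per sentence.
    """
    wanted = sorted((pos_a, pos_b))
    counts = Counter()
    for sentence in tagged_sentences:
        seen = set()
        for (w1, p1), (w2, p2) in combinations(sentence, 2):
            if w1 != w2 and sorted((p1, p2)) == wanted:
                seen.add(tuple(sorted((w1, w2))))
        counts.update(seen)
    return counts

# Hand-tagged, made-up sentences.
sentences = [
    [("health", "noun"), ("care", "noun"), ("costs", "noun"), ("rise", "verb")],
    [("wind", "noun"), ("energy", "noun"), ("creates", "verb"), ("jobs", "noun")],
]
noun_noun = pos_pairs(sentences, "noun", "noun")
noun_verb = pos_pairs(sentences, "noun", "verb")
print(sorted(noun_noun))
```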
This section enumerates words that were unique to a candidate
(i.e. used by one candidate but not the other). For a given part of
speech, the table breaks down the number of words that were spoken by
only one of the candidates or both candidates (intersection). The last
row includes all words (union).
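The breakdown described above maps directly onto set operations. A minimal sketch, using made-up word lists rather than the parsed transcript:

```python
def exclusivity(words_a, words_b):
    """Split two vocabularies into exclusive, shared and combined sets."""
    a, b = set(words_a), set(words_b)
    return {
        "only_a": a - b,  # used by the first candidate only
        "only_b": b - a,  # used by the second candidate only
        "both": a & b,    # intersection
        "all": a | b,     # union
    }

# Hypothetical vocabularies for two speakers.
groups = exclusivity(
    ["invest", "solve", "nuclear", "tax"],
    ["nuclear", "tax", "maverick"],
)
for name, words in groups.items():
    print(name, len(words), sorted(words))
```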
Noun Phrase Usage
Noun phrases were extracted from the text and analyzed for frequency, word count, unique word count and richness.
Top-level noun phrases are those without a parent noun phrase (a parent phrase is a similar, longer phrase). Derived noun phrases are those with a parent (more details about noun phrase analysis).
The top-level noun phrases can be interpreted as independent concepts. Derived noun phrases can be interpreted as variants on concepts embodied by the top-level phrases.
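To make the top-level versus derived distinction concrete, the sketch below treats a phrase as derived when some longer phrase contains all of its words as a contiguous run. This containment test is only an assumed stand-in for "similar, longer phrase"; the actual similarity rule is described on the linked details page and may differ.

```python
def contains(longer, shorter):
    """True if the words of `shorter` appear contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def classify_phrases(phrases):
    """Return (top_level, derived) phrase lists, each sorted alphabetically."""
    split = {tuple(p.lower().split()) for p in phrases}
    top_level, derived = [], []
    for phrase in split:
        has_parent = any(contains(other, phrase) for other in split)
        (derived if has_parent else top_level).append(" ".join(phrase))
    return sorted(top_level), sorted(derived)

# Made-up noun phrases.
top_level, derived = classify_phrases(
    ["nuclear weapons", "iranian nuclear weapons", "health care", "tax"]
)
print("top-level:", top_level)
print("derived:  ", derived)
```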
Noun Phrase Count
This table reports the absolute number of noun phrases, which is related to the number of total words (specifically, nouns) delivered. The next table presents the number of phrases relative to the number of nouns.
Table 8 Analysis
Obama has 4.0% more noun phrases than McCain (855 vs 851). The difference in the fraction of unique noun phrases, however, is smaller: Obama's and McCain's noun phrase uniqueness is 84.3% and 83.3%, respectively. Relative to the number of noun phrases, the number of top-level phrases is similar between them, as is the top-level uniqueness ratio.
Table 8c Legend
a :: number of noun phrases
b :: (a) relative to number of all noun phrases
c :: number of unique phrases
d :: (c) relative to (a)
bar :: normalized ratio of (a-c):c
Noun Phrase Richness
The previous table presented the total number of noun phrases, which can be equated to individual concepts. In this table, this value is shown relative to the number of nouns used. This ratio can be interpreted as richness: it measures how many noun phrases were constructed per noun.
Table 9 Analysis
The ratios here are very similar. Extremely similar, in fact, with the exception of the ratio of unique noun phrases to unique nouns, which is 1.16 for Obama and 1.07 for McCain. The interpretation is that Obama constructed a greater diversity of distinct concepts with his nouns.
Table 9c Legend
a :: ratio of the number of noun phrases to number of nouns
b :: ratio of the number of unique noun phrases to number of unique nouns
bar :: ratio of a:b
Noun Phrase Frequency and Size
The Windbag Index is a compound measure that characterizes the complexity of speech. A low index is indicative of succinct speech with a low degree of repetition and a large number of independent concepts.
Table 11 Analysis
Obama's Windbag Index is 14.7% higher than McCain's, at 422 vs 368.
The index is a compound score, with contributions from nine terms. Individually, Obama does better at the verb, adjective and noun phrase components. McCain, on the other hand, has superior contributions from word counts, nouns and adverbs.
Table 11c Legend
The Windbag Index is 1/(t1*t2*...*t9) where t1,t2,...,t9 are the individual terms. These terms are
t1 :: fraction of words which are non-stop
t2 :: fraction of non-stop words which are unique
t3 :: fraction of nouns which are unique
t4 :: fraction of verbs which are unique
t5 :: fraction of adjectives which are unique
t6 :: fraction of adverbs which are unique
t7 :: fraction of noun phrases which are unique
t8 :: fraction of noun phrases which have no parent
t9 :: ratio of unique noun phrases to unique nouns
Note that large individual terms t1...t9 contribute to a smaller index.
The percentage values below the index and each term are relative differences to the other speaker's corresponding term (i.e. 100*(x-x0)/x0 where x is the value for the present speaker and x0 for the other speaker).
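The two formulas in this legend can be written out directly. In the sketch below, the term values passed to `windbag_index` are placeholders chosen for easy arithmetic, not the candidates' actual terms; only the index values 422 and 368 come from the analysis.

```python
from math import prod

def windbag_index(terms):
    """Windbag Index: 1 / (t1 * t2 * ... * t9)."""
    if len(terms) != 9:
        raise ValueError("expected nine terms t1..t9")
    return 1.0 / prod(terms)

def relative_difference(x, x0):
    """Relative difference of x to the other speaker's value x0, in percent."""
    return 100.0 * (x - x0) / x0

# Placeholder terms: larger terms give a smaller index.
low_terms = windbag_index([0.5] * 9)
high_terms = windbag_index([0.6] * 9)
print(low_terms, high_terms)

# Index values quoted in the text.
diff = relative_difference(422, 368)
print(f"{diff:+.1f}%")
```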
In the tag clouds below, the size of the word is proportional to
the number of times it was used by a candidate (tag cloud details).
Not all words from a group used to draw the cloud fit in the
image. Specifically, less frequently used words for large word groups
fall outside the image.
Debate Tag Clouds for Each Candidate - All Words
Each candidate's debate portion was extracted and frequencies were
compiled for each part of speech (noun, verb, adjective, adverb), with
words colored by their part of speech category. The words in these
tag clouds include words unique to one candidate as well as words used by
both candidates. For other tag clouds below, only words unique to a
candidate are used.
Keep in mind that the word sizes between tag clouds cannot be
directly compared, since the minimum and maximum size of the words in
each tag cloud is the same. However, the distribution of sizes within
a tag cloud reflects the frequency distribution of words (tag cloud details).
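The reason sizes cannot be compared across clouds can be seen with a simple scaling rule. The sketch below assumes linear scaling between a fixed minimum and maximum font size within each cloud; the sizes (10 and 60), the linear rule and the word counts are all assumptions for illustration, and the scaling described in the tag cloud details may differ.

```python
def font_sizes(counts, min_size=10.0, max_size=60.0):
    """Map word counts to font sizes, scaled within a single cloud."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All words equally frequent: draw them all at the same size.
        return {word: max_size for word in counts}
    span = max_size - min_size
    return {
        word: min_size + (count - lo) / (hi - lo) * span
        for word, count in counts.items()
    }

# Made-up counts for two separate clouds.
cloud_a = font_sizes({"nuclear": 20, "tax": 10, "wind": 2})
cloud_b = font_sizes({"john": 5, "energy": 3, "oil": 1})
print(cloud_a)
print(cloud_b)
```

Under this rule the most frequent word in each cloud is drawn at the same maximum size, even though one was used 20 times and the other only 5 times.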
Debate Tag Cloud for Barack Obama - all words
Debate Tag Cloud for John McCain - all words
Debate Tag Cloud Analysis
The tag clouds for all words used by each candidate powerfully show
the difference in word frequency distribution between Obama and
McCain. In a few tables, I indicated the average and 50%/90% weighted
cumulative values for frequencies, but did not explicitly show a
distribution. Well, these tag clouds show that.
McCain's cloud has significantly more large
words than Obama's, indicating that McCain repeated
a larger subset of words throughout the debate. For example,
McCain's use of the word "nuclear" was nearly as frequent as his
use of the word "Obama". On the other hand, Obama's use of
"nuclear" was less frequent than his use of the word "McCain".
It is also interesting to see that Obama very frequently used
"John", calling his opponent by his first name, whereas McCain
never used Obama's first name, Barack.
Debate Tag Clouds for Each Candidate - Unique Words
The tag clouds below show only words used exclusively by a candidate. For
example, if candidate A used the word "invest" (any number of times),
but the other candidate B did not, then the word will appear in the
unique word tag cloud for candidate A.
Debate Tag Cloud for Barack Obama - words unique to Barack Obama
Debate Tag Cloud for John McCain - words unique to John McCain
Unique Word Tag Cloud Analysis
The tag cloud composed of words used exclusively by McCain
indicates a high degree of relative repetition of a small subset of
the words. The center of McCain's tag cloud is bloated with large
text, indicating high relative usage of words like "afraid",
"serious", "fragile", and "badly". Remember, these are words unique to McCain - Obama did not use these words.
Obama's tag cloud shows relatively less repetition among the
words used only by Obama. In general, the words used by Obama that
were not used by McCain are more uniformly distributed in
frequency. It is surprising that words like "recognize", "strategic",
"solve", "invest", and "agree" are unique to Obama (they
were not used by McCain).
Part of Speech Tag Clouds
In these tag clouds, words by both candidates were categorized on the
basis of exclusivity to a candidate. Words unique to each candidate
are drawn with a different color. Words used by both candidates are
shown in grey.
The size of the word is relative to the frequency for the candidate
- word sizes between candidates should not be used to indicate
difference in absolute frequency.
Words were further categorized by part of speech (noun, verb,
adjective, adverb) and individual tag clouds were prepared for each
part of speech. The last tag cloud in this section uses all (noun + verb +
adjective + adverb) parts of speech.
Tag Cloud of noun words, by speaker
Noun Tag Cloud Analysis
Not surprisingly, the candidates' most frequent nouns were
Obama (for McCain) and John (for Obama). As I mentioned previously, it
is curious to find that McCain never referred to Obama by his first name.
The cloud of green words around the central core of the tag cloud
indicates that nouns unique to Obama appeared at a higher relative
frequency than those unique to McCain.
Some interesting nouns for Obama are "alternative", "fundamentals",
"medicare", and "diplomacy". On the other hand, words like
"restraint", "failure", "corruption" and "maverick" are unique to McCain.
Tag Cloud of verb words, by speaker
Verb Tag Cloud Analysis
The top verb unique to McCain was "control", closely followed by
"fought" and "succeed", followed by verbs like "defeat", "win", and
"legitimize". For Obama, the top unique verb was "getting", followed
by "invest", "funded", "recognize", "agree" and "rebuild", but those
were of relatively lower frequencies than for words at the same rank
in McCain's list. McCain repeats strong verbs.
If the verbs are an indication of action planned for and supported
by the candidates, then McCain is someone who wishes to "legitimize"
and "succeed [at] control", whereas Obama is more conciliatory and
positive with "invest", "focused", "solve" and "recognize".
Tag Cloud of adjective words, by speaker
Adjective Tag Cloud Analysis
McCain's exclusive use of "afraid", "serious" and "fragile" is interesting and hints at fear mongering.
Tag Cloud of adverb words, by speaker
Adverb Tag Cloud Analysis
Adverbs are the least frequent of the four parts of speech, so the
tag cloud here is less complex. Both candidates use strong and certain
action modifiers like "completely" (McCain) and "absolutely"
(Obama). As for other parts of speech, McCain had a high relative
frequency of terms unique to him, as is evident from the greater number of large
blue words in this tag cloud.
It is interesting to see Obama exclusively use words like "responsibly" and "structurally".
Tag Cloud of all words, by speaker
All Tag Cloud Analysis
When all parts of speech are combined into one tag cloud,
Obama's unique words swamp out those of McCain,
suggesting that when parts of speech are combined, Obama repeated
terms exclusive to him more frequently.
Word Pair Vignette Tag Clouds for Each Candidate
Tag Cloud of word pairs by Barack Obama
adjective/adjective by Barack Obama
adjective/adverb by Barack Obama
adjective/noun by Barack Obama
adjective/verb by Barack Obama
adverb/adverb by Barack Obama
adverb/noun by Barack Obama
adverb/verb by Barack Obama
noun/noun by Barack Obama
noun/verb by Barack Obama
verb/verb by Barack Obama
Word Pair Tag Cloud Analysis for Barack Obama.
The major contributors to Obama's word pair tag clouds are
open-ended word pairs such as "last several" (adjective/adjective),
"correct just" (adjective/adverb) and "mccain senator" (noun/noun). A
couple of concepts such as "al qaeda" (qaeda was tagged as a verb by
the Brill tagger) and "north korea" were prominent, but these are proper
nouns and reflect the topic under discussion.
Obama touched on many concepts, as indicated by the relatively flat
distribution of sizes in the noun/noun tag cloud. Some of these were
"care health", "biodiesel energy", "oil world", "energy wind". Some
curious ones were "john spending", "crisis day", "afghanistan iraq",
"deal russia". These should be contrasted to noun/noun pairs for McCain (below), which focused on threats and the military.
Tag Cloud of word pairs by John McCain
adjective/adjective by John McCain
adjective/adverb by John McCain
adjective/noun by John McCain
adjective/verb by John McCain
adverb/adverb by John McCain
adverb/noun by John McCain
adverb/verb by John McCain
noun/noun by John McCain
noun/verb by John McCain
verb/verb by John McCain
Word Pair Tag Cloud Analysis for John McCain.
McCain's word pair tag clouds have a significantly different
morphology than those of Obama. Primarily, due to McCain's
repetitive use of certain words, the tag clouds are overwhelmed with
these frequent (therefore large) pairs.
His adjective/noun tag cloud has an apocalyptic theme: "nuclear
threat", "important thing", "long way", and (this is fascinating) "old
russian" and "next states".
The noun/noun tag cloud size distribution is relatively flat, like
that of Obama, and indicates topics such as "threat weapons",
"business tax", "ahmadinejad extermination" and "aggression
georgia". The majority of McCain's noun/noun concepts were
threat- and military-related (contrast this to Obama, who was focused
more on energy and economy). Environment? What environment?
debate transcript (courtesy of CNN).
parsed word lists (analyzed transcript, including words by speaker, by POS, and all POS pairings).
tag cloud images
Please see the methods section for details about these files.