Lexical Analysis of 2008 US Presidential and Vice-Presidential Debates | Martin Krzywinski


# Lexical Analysis of Barack Obama vs John McCain (combined debates)

There be dragons here. The results here are based on a combined transcript from all debates between these candidates. When interpreting metric values from this analysis (e.g. fraction of unique words), keep in mind that you are looking at results based on speech of multiple debates, not one. In the limit of an infinite number of debates, most metrics will converge to a value that is characteristic of the speaker (e.g. total vocabulary size).

# Word Statistics

## Debate Word Count

### Summary Word Count

The summary word count reports the total number of words and the number of unique, non-stop words used by each candidate. Word counts are expressed as both absolute and relative values.

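Table 1. Number of all words and unique words used by each speaker.

| speaker | total words (a) | unique words (c) | share of debate words (b) | (c) relative to (a) (d) |
|---|---|---|---|---|
| Barack Obama | 21,818 | 2,523 | 52.1% | 10.9% |
| John McCain | 20,020 | 2,446 | 47.9% | 11.4% |
| all | 41,838 | 3,656 | 100.0% | 8.3% |

Table 1 Analysis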

Across all the debates, the candidates delivered just short of 42,000 words. With each debate approximately 1.5 hours in length, the number of unique words delivered by both candidates corresponds to a delivery rate of one unique word every 4.4 seconds (3 debates x 1.5 hours x 3600 s = 16,200 seconds). The average speech rate was 2.6 words per second.

Obama delivered +9.0% more words than McCain and had a larger overall vocabulary, by +3.1%.
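The arithmetic behind these figures can be retraced in a few lines. This is only a sketch: the counts are copied from Table 1, the 1.5 hour debate length is the approximation stated above, and the variable and function names are illustrative rather than taken from the analysis code.

```python
# Counts copied from Table 1; names are illustrative, not from the original analysis.
total_words = 41_838
unique_words = 3_656
debates = 3
seconds_per_debate = 1.5 * 3600  # assumes each debate ran about 1.5 hours

total_seconds = debates * seconds_per_debate
seconds_per_unique_word = total_seconds / unique_words
words_per_second = total_words / total_seconds


def relative_increase(x, x0):
    """Percent by which x exceeds x0."""
    return 100 * (x - x0) / x0


# Obama relative to McCain: total words and unique words.
more_words = relative_increase(21_818, 20_020)
more_vocabulary = relative_increase(2_523, 2_446)

print(f"{total_seconds:.0f} s, one unique word every {seconds_per_unique_word:.1f} s")
print(f"{words_per_second:.1f} words/s, +{more_words:.1f}% words, +{more_vocabulary:.1f}% vocabulary")
```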

Table 1 Legend
 a c b d
a :: total number of words
b :: proportion of words in the debate
c :: unique words in (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c

### Stop Word Contribution

In the table below, the candidates' delivery is partitioned into stop and non-stop words. Stop words are frequently-used bridging words (e.g. pronouns and conjunctions) and do not carry inherent meaning. The fraction of words that are stop words is one measure of the complexity of speech.
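As a rough illustration of this partition, the sketch below splits a sentence into stop and non-stop words and reports total and unique counts for each group. The stop list here is a tiny hypothetical one and the tokenizer is a simple regular expression; neither is the one used for the analysis, which is not specified on this page.

```python
import re

# Hypothetical, deliberately tiny stop list; the list used in the analysis is not given.
STOP_WORDS = {"the", "a", "and", "i", "we", "to", "of", "that", "is", "in"}


def partition(text):
    """Return (all words, stop words, non-stop words) for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    stop = [w for w in words if w in STOP_WORDS]
    non_stop = [w for w in words if w not in STOP_WORDS]
    return words, stop, non_stop


def summarize(words):
    """Return (total, unique, unique/total) for a list of words."""
    total = len(words)
    unique = len(set(words))
    return total, unique, (unique / total if total else 0.0)


sample = "We need to fix the economy and we need to create jobs"
words, stop, non_stop = partition(sample)
stop_fraction = len(stop) / len(words)

print("all     ", summarize(words))
print("stop    ", summarize(stop), f"{stop_fraction:.1%} of all words")
print("non-stop", summarize(non_stop))
```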

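Table 2. Expanded analysis of total, stop and non-stop word count.

| speaker | word category | words (a) | unique words (c) | share (b) | (c) relative to (a) (d) |
|---|---|---|---|---|---|
| Barack Obama | all | 21,818 | 2,523 | 52.1% | 11.6% |
| | stop | 12,235 | 154 | 56.1% | 1.3% |
| | non-stop | 9,583 | 2,369 | 43.9% | 24.7% |
| John McCain | all | 20,020 | 2,446 | 47.9% | 12.2% |
| | stop | 11,063 | 156 | 55.3% | 1.4% |
| | non-stop | 8,957 | 2,290 | 44.7% | 25.6% |
| all | all | 41,838 | 3,656 | 100.0% | 8.7% |
| | stop | 23,298 | 166 | 55.7% | 0.7% |
| | non-stop | 18,540 | 3,490 | 44.3% | 18.8% |

Table 2 Analysis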

Overall, Obama's stop word fraction was slightly higher than McCain's. However, Obama delivered more words throughout the debates and displayed greater range of vocabulary, with 2,369 unique non-stop words, +3.4% more than McCain.

Table 2 Legend
 a c b d
a :: total number of words, for a given category (all, stop, non-stop)
b :: (a) relative to words in the debate if category=all, otherwise relative to words by the candidate
c :: number of unique words with set (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c

All further analysis uses debate content that has been filtered for stop words.

## Word frequency

The word frequency table summarizes the frequency with which words were used, specifically the average word frequency and the weighted cumulative frequencies at the 50th and 90th percentiles. The average word frequency indicates how many times, on average, a word is used. For a given fraction of the entire delivery, the weighted cumulative frequency indicates the largest word frequency within this fraction (details about weighted cumulative distribution).
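One way to read the weighted cumulative frequency is sketched below. It assumes that words are ordered from least to most frequent and that each word is weighted by its number of occurrences, so that the value at 50% is the frequency of the word at which half of all delivered words have been accounted for. This interpretation and the function name `weighted_cumulative` are assumptions made for illustration; the linked details page is the authority on the exact definition.

```python
from collections import Counter


def weighted_cumulative(counts, fraction):
    """Smallest frequency f such that words used at most f times
    account for at least `fraction` of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for freq in sorted(counts.values()):
        running += freq  # each word is weighted by how often it was used
        if running >= fraction * total:
            return freq
    raise ValueError("counts must be non-empty")


# Toy word list, not taken from the debate transcripts.
words = "tax tax tax tax jobs jobs energy health care war".split()
counts = Counter(words)

average_frequency = sum(counts.values()) / len(counts)
at_50 = weighted_cumulative(counts, 0.5)
at_90 = weighted_cumulative(counts, 0.9)
print(round(average_frequency, 2), at_50, at_90)
```

In this toy list half of the delivery is made up of words used at most twice, even though the average frequency is below two, which is why the 50% and 90% values in the table sit well above the average.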

Table 3. Average, 50%, and 90% weighted cumulative word frequencies (content filtered for stop words).

| speaker | average (a) | 50% (b) | 90% (c) |
|---|---|---|---|
| Barack Obama | 4.04 | 10 | 63 |
| John McCain | 3.91 | 8 | 53 |
| all | 5.31 | 16 | 113 |

Table 3 Analysis

Absolute values of word frequency statistics for a combined debate transcript are not useful because they are directly proportional to the length of the concatenated transcript. In the limit of a large number of debates, total vocabulary size approaches a limit, and as word count goes up so does word frequency.

However, a comparison between the candidates can still be made. Obama's word frequency is slightly higher than McCain's, but not by much (+3.3%).

Table 3 Legend
 a b c
a :: average word frequency
b :: largest word frequency in 50% of content
c :: largest word frequency in 90% of content
bar :: proportion of a:b:c

## Sentence Size

Table 4. Number of words in a sentence, as measured by average number of words, 50% and 90% weighted cumulative values for three word groups (all words, stop words and non-stop words).

| speaker | word type | average (a) | 50% (b) | 90% (c) |
|---|---|---|---|---|
| Barack Obama | all | 18.3 | 26 | 56 |
| | stop | 10.4 | 15 | 32 |
| | non-stop | 8.2 | 11 | 26 |
| John McCain | all | 15.5 | 20 | 46 |
| | stop | 8.7 | 12 | 26 |
| | non-stop | 7 | 10 | 22 |
| all | all | 16.8 | 23 | 51 |
| | stop | 9.5 | 13 | 29 |
| | non-stop | 7.6 | 11 | 24 |

Table 4 Analysis

Obama consistently delivers longer sentences, averaging 8.2 non-stop words per sentence compared to McCain's 7.0. Obama's sentence size distribution has a greater component of long sentences: 90% of his speech is in sentences of ≤26 non-stop words, whereas McCain fits 90% of his speech in sentences of ≤22 non-stop words.

Table 4 Legend
 a b c

a :: average sentence size
b :: largest sentence size for 50% of content
c :: largest sentence size for 90% of content
bar :: proportion of a:b:c

## Part of Speech Analysis

In this section, words are broken down by their part of speech (POS). The four POS groups examined are nouns, verbs, adjectives and adverbs. Conjunctions and prepositions are not considered. The first category (n+v+adj+adv) is composed of all four POS groups.

### Part of Speech Count

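Table 5. Count of words (total and unique) categorized by part of speech (POS).

| speaker | part of speech | words (a) | unique words (c) | share of candidate's words (b) | (c) relative to (a) (d) |
|---|---|---|---|---|---|
| Barack Obama | n+v+adj+adv | 9,189 | 2,324 | 100.0% | 25.3% |
| | noun | 4,809 | 1,235 | 52.3% | 25.7% |
| | verb | 2,348 | 719 | 25.6% | 30.6% |
| | adjective | 1,432 | 406 | 15.6% | 28.4% |
| | adverb | 600 | 137 | 6.5% | 22.8% |
| John McCain | n+v+adj+adv | 8,633 | 2,238 | 100.0% | 25.9% |
| | noun | 4,687 | 1,253 | 54.3% | 26.7% |
| | verb | 2,270 | 686 | 26.3% | 30.2% |
| | adjective | 1,195 | 371 | 13.8% | 31.0% |
| | adverb | 481 | 115 | 5.6% | 23.9% |
| all | n+v+adj+adv | 17,822 | 3,431 | 100.0% | 19.3% |
| | noun | 9,496 | 1,859 | 53.3% | 19.6% |
| | verb | 4,618 | 1,096 | 25.9% | 23.7% |
| | adjective | 2,627 | 586 | 14.7% | 22.3% |
| | adverb | 1,081 | 187 | 6.1% | 17.3% |

Table 5 Analysis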

This is a great table for the combined debate analysis because it shows the part of speech breakdown across three independent samples of speech and is therefore a more robust measure of the candidates' natural style than a sampling from a single event.

McCain uses more nouns than Obama, with 54.3% of his parts of speech being nouns (remember, in this analysis I consider only nouns, verbs, adjectives and adverbs, to the exclusion of all other parts of speech), whereas Obama's fraction is 52.3%. McCain's +3.8% increase suggests speech with a greater emphasis on concrete concepts.

Verb usage is also greater by McCain, at 26.3% vs Obama's 25.6%. The difference is +2.7%, smaller than for nouns.

Once we get into adjectives and adverbs, however, it's a different story. Obama's use of adjectives and adverbs is significantly higher than McCain's. Obama's adjective fraction is +13.0% larger than McCain's and his adverb fraction is +16.1% larger than McCain's. This suggests that Obama's speech is more nuanced and that he captures and delivers more texture in his nouns and verbs than McCain.

Table 5 Legend
 a c b d
a :: total number of words for a given POS (all, noun, verb, adjective, adverb)
b :: (a) relative to all words by candidate
c :: unique words in (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c

### Part of Speech Frequency

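Table 5. Frequency of words by part of speech (POS).

| speaker | part of speech | average (a) | 50% (b) | 90% (c) |
|---|---|---|---|---|
| Barack Obama | n+v+adj+adv | 3.95 | 10 | 63 |
| | noun | 3.89 | 9 | 52 |
| | verb | 3.08 | 6 | 57 |
| | adjective | 3.53 | 7 | 50 |
| | adverb | 4.38 | 11 | 92 |
| John McCain | n+v+adj+adv | 3.86 | 8 | 53 |
| | noun | 3.74 | 7 | 42 |
| | verb | 3.02 | 5 | 48 |
| | adjective | 3.22 | 6 | 21 |
| | adverb | 4.18 | 11 | 42 |
| all | n+v+adj+adv | 5.19 | 15 | 113 |
| | noun | 5.11 | 15 | 80 |
| | verb | 3.9 | 10 | 107 |
| | adjective | 4.48 | 11 | 58 |
| | adverb | 5.78 | 19 | 130 |

Table 5 Analysis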

Obama's overall part of speech frequency is slightly higher than McCain's, but not by much (+2.3%). He consistently has slightly greater repetition of nouns and verbs, at +4.0% and +2.0% more than McCain, respectively.

Obama's adjective and adverb use frequency is much higher than McCain's, however, at +9.6% and +4.8%, respectively. This increase reflects the greater proportion of adjectives and adverbs in Obama's speech.

Table 5 Legend
 a b c
a :: average word frequency
b :: largest word frequency in 50% of content
c :: largest word frequency in 90% of content
bar :: proportion of a:b:c

### Part of Speech Pairing

Through word pairing, I attempt to capture the contextual use of parts of speech within a sentence and extract concepts from the text. Specifically, unique pairs of words indicate complexity and inter-relatedness between concepts in a sentence.
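The sketch below shows one plausible way to count such pairs. It assumes that a pair is any two tagged words that occur in the same sentence, regardless of order or distance, and that the input has already been tagged with parts of speech. Both are assumptions made for illustration, since the pairing rule is not spelled out here, and `pos_pairs` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def pos_pairs(tagged_sentences):
    """Return {pairing: (total pairs, unique pairs)}, e.g. for 'noun/verb'.

    Assumption: every unordered pair of tagged words within one sentence counts.
    """
    totals = Counter()
    unique = {}
    for sentence in tagged_sentences:
        for first, second in combinations(sentence, 2):
            # Sort by (POS, word) so the pairing does not depend on word order.
            (p1, w1), (p2, w2) = sorted([(first[1], first[0]), (second[1], second[0])])
            category = f"{p1}/{p2}"
            totals[category] += 1
            unique.setdefault(category, set()).add((w1, w2))
    return {category: (totals[category], len(unique[category])) for category in totals}


# Toy, hand-tagged sentences; not taken from the debate transcripts.
sentences = [
    [("people", "noun"), ("need", "verb"), ("care", "noun")],
    [("people", "noun"), ("need", "verb")],
]
pairs = pos_pairs(sentences)
print(pairs)
```

The ratio of unique to total pairs in each pairing plays the same role as the unique word fraction in the earlier tables.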

Table 6a (Barack Obama). Word pairs (total and unique) categorized by part of speech (POS) for Barack Obama.
parts of speech pairings - Barack Obama
noun
 14,989 10,955 25.0% 73.1%
verb
 14,266 11,157 23.8% 78.2%
 2,880 2,383 4.8% 82.7%
 9,281 7,026 15.5% 75.7%
 3,997 3,243 6.7% 81.1%
 1,193 969 2.0% 81.2%
 3,430 2,645 5.7% 77.1%
 1,736 1,390 2.9% 80.1%
 1,033 820 1.7% 79.4%
 245 169 0.4% 69.0%
Table 6b (John McCain). Word pairs (total and unique) categorized by part of speech (POS) for John McCain.
parts of speech pairings - John McCain
noun
 12,749 9,276 27.4% 72.8%
verb
 11,354 8,877 24.4% 78.2%
 2,215 1,904 4.8% 86.0%
 6,652 5,088 14.3% 76.5%
 2,602 2,193 5.6% 84.3%
 729 614 1.6% 84.2%
 2,445 1,955 5.3% 80.0%
 1,159 953 2.5% 82.2%
 683 559 1.5% 81.8%
 121 96 0.3% 79.3%
Table 6c (Barack Obama vs John McCain). Word Pairs (total and unique) categorized by part of speech (POS) for both candidates.
parts of speech pairings
noun
 14,989 12,749 85.1% 73.1% 72.8%

verb
 14,266 11,354 79.6% 78.2% 78.2%

 2,880 2,215 76.9% 82.7% 86.0%

 9,281 6,652 71.7% 75.7% 76.5%

 3,997 2,602 65.1% 81.1% 84.3%

 1,193 729 61.1% 81.2% 84.2%

 3,430 2,445 71.3% 77.1% 80.0%

 1,736 1,159 66.8% 80.1% 82.2%

 1,033 683 66.1% 79.4% 81.8%

 245 121 49.4% 69.0% 79.3%

Table 6 Analysis

Obama delivered more of every pairing. The largest difference is in adverb/adverb pairings, with Obama having twice as many as McCain.

When compared to Obama, McCain has significantly fewer pairings that include adjectives and adverbs. While for combinations of nouns and verbs McCain is at 76-85% of Obama, when adjectives and adverbs are brought into the mix McCain is at 50-72%.

These numbers starkly illustrate Obama's greater penchant for precision and modification.

Table 6a,b Legend
 a c b d
a :: total number of pairs, for a given category (e.g. verb/noun)
b :: (a) relative to all pairs
c :: number of unique pairs within set (a)
d :: (c) relative to (a)
bar :: proportion of (a-c):c
Table 6c Legend
 a c d b e

a :: total number of pairs for Barack Obama
b :: relative unique pairs for Barack Obama
c :: total pairs for John McCain
d :: (c) relative to (a) (i.e. John McCain relative to Barack Obama)
e :: relative unique pairs for John McCain
bars :: values of (a), (b), (c) and (e)

# Word usage

This section enumerates words that were unique to a candidate (i.e. used by one candidate but not the other). For a given part of speech, the table breaks down the number of words that were spoken by only one of the candidates or both candidates (intersection). The last row includes all words (union).
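The exclusive, shared and combined groups map directly onto set operations. The following is a minimal sketch with made-up word lists; `exclusivity` is a hypothetical helper, and words are compared as exact strings, without stemming, which is how the analysis treats them.

```python
def exclusivity(words_a, words_b):
    """Split two speakers' vocabularies into exclusive, shared and combined sets."""
    set_a, set_b = set(words_a), set(words_b)
    return {
        "only_a": set_a - set_b,  # spoken by A but never by B
        "only_b": set_b - set_a,  # spoken by B but never by A
        "both": set_a & set_b,    # intersection
        "all": set_a | set_b,     # union
    }


# Toy word lists, not taken from the debate transcripts.
speaker_a = ["agree", "invest", "tax", "tax", "energy"]
speaker_b = ["agreed", "tax", "nuclear", "energy", "energy"]

groups = exclusivity(speaker_a, speaker_b)
# Total occurrences of speaker A's exclusive words (repetitions included).
exclusive_a_total = sum(1 for w in speaker_a if w in groups["only_a"])

for name, members in groups.items():
    print(name, sorted(members))
```

Because there is no stemming, "agree" and "agreed" land in different exclusive sets in this example.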

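Table 7. Total and unique words used exclusively by a candidate or by both candidates.

| spoken by | part of speech | words (a) | unique words (d) | share of group (b) | (d) relative to (a) (e) | share of all words (c) | share of all unique words (f) |
|---|---|---|---|---|---|---|---|
| Barack Obama | n+v+adj+adv | 1,890 | 1,193 | 100.0% | 63.1% | 10.6% | 34.8% |
| | noun | 923 | 576 | 48.8% | 62.4% | 9.7% | 31.0% |
| | verb | 550 | 375 | 29.1% | 68.2% | 11.9% | 34.2% |
| | adjective | 326 | 208 | 17.2% | 63.8% | 12.4% | 35.5% |
| | adverb | 91 | 64 | 4.8% | 70.3% | 8.4% | 34.2% |
| John McCain | n+v+adj+adv | 1,889 | 1,107 | 100.0% | 58.6% | 10.6% | 32.3% |
| | noun | 1,079 | 600 | 57.1% | 55.6% | 11.4% | 32.3% |
| | verb | 463 | 328 | 24.5% | 70.8% | 10.0% | 29.9% |
| | adjective | 271 | 170 | 14.3% | 62.7% | 10.3% | 29.0% |
| | adverb | 76 | 45 | 4.0% | 59.2% | 7.0% | 24.1% |
| both | n+v+adj+adv | 14,043 | 1,131 | 100.0% | 8.1% | 78.8% | 33.0% |
| | noun | 7,390 | 629 | 52.6% | 8.5% | 77.8% | 33.8% |
| | verb | 3,473 | 309 | 24.7% | 8.9% | 75.2% | 28.2% |
| | adjective | 1,992 | 191 | 14.2% | 9.6% | 75.8% | 32.6% |
| | adverb | 890 | 65 | 6.3% | 7.3% | 82.3% | 34.8% |
| all | n+v+adj+adv | 17,822 | 3,431 | 100.0% | 19.3% | 100.0% | 100.0% |
| | noun | 9,496 | 1,859 | 53.3% | 19.6% | 100.0% | 100.0% |
| | verb | 4,618 | 1,096 | 25.9% | 23.7% | 100.0% | 100.0% |
| | adjective | 2,627 | 586 | 14.7% | 22.3% | 100.0% | 100.0% |
| | adverb | 1,081 | 187 | 6.1% | 17.3% | 100.0% | 100.0% |

Table 7 Analysis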

This is another table that benefits from a combined debate treatment. Here we can see the number of words, by part of speech, spoken exclusively by one candidate, or by both. Presumably, as the number of debates increases, the number of words spoken by one candidate but not the other steadily decreases, until it reaches some core value that represents words truly unique to that candidate (e.g. the other candidate does not know the word, or consciously avoids using it).

The key values to draw your attention to are the number of exclusive unique words (first two rows, second column for each part of speech). This number corresponds to the exclusive contribution by each candidate to the vocabulary of the speech.

For example, of the 1,859 unique nouns used in the debate, 629 (33.8%) were spoken by both candidates, 600 (32.3%) by McCain only and 576 (31.0%) by Obama only. McCain thus contributed more nouns to the debate, but he repeated his exclusive nouns more than Obama did (the unique fraction of his exclusive nouns was 55.6% vs Obama's 62.4%).

When it comes to verbs, however, Obama's contribution is higher, at 34.2% of all debate verbs vs 29.9% for McCain. Note that verbs were the parts of speech that had the lowest shared fraction - only 28.2% of verbs in the debate were spoken by both candidates.

Obama also contributed a greater variety of adjectives and adverbs to the debate. In particular, Obama's contribution to adverbs was 34.2% compared to 24.1% for McCain. In other words, for every 3 adverbs used by Obama not spoken by McCain, McCain had only 2 not spoken by Obama.

The profile presented in this table closely matches the result of previous work by Pennebaker, in which McCain is concluded to be a categorical thinker (heavy noun use), while Obama is fluid and contextual (verb and modifier use).

Table 7c Legend
 a d b e c f

a :: total number of words unique to a candidate, for a given POS group
b :: (a) relative to all unique words to the candidate
c :: (a) relative to all words
d :: unique words in (a)
e :: (d) relative to (a)
f :: (d) relative to all unique words
bar1 :: normalized ratio of (a-d):d
bar2 :: absolute ratio of (a-d):d for all POS groups (first column) or POS group (other columns)

# Noun Phrase Usage

Noun phrases were extracted from the text and analyzed for frequency, word count, unique word count and richness.

Top-level noun phrases are those without a parent noun phrase (a parent phrase is a similar, longer phrase). Derived noun phrases are those with a parent (more details about noun phrase analysis).

The top-level noun phrases can be interpreted as independent concepts. Derived noun phrases can be interpreted as variants on concepts embodied by the top-level phrases.
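To make the top-level/derived distinction concrete, here is a small sketch. It assumes that "similar, longer phrase" means a longer phrase that contains the shorter one as a contiguous run of words. That containment rule is an assumption for illustration only; the actual similarity test is described on the linked noun phrase details page, and the function names are hypothetical.

```python
def contains(longer, shorter):
    """True if the words of `shorter` appear contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def split_phrases(phrases):
    """Return (top-level phrases, derived phrases), each sorted alphabetically."""
    tokenized = {tuple(p.lower().split()) for p in phrases}
    top, derived = [], []
    for phrase in tokenized:
        # A phrase is derived if some longer phrase in the set contains it.
        has_parent = any(contains(other, phrase) for other in tokenized)
        (derived if has_parent else top).append(" ".join(phrase))
    return sorted(top), sorted(derived)


# Toy phrases, not extracted from the debate transcripts.
phrases = [
    "health care",
    "health care system",
    "nuclear power",
    "tax cut",
    "middle class tax cut",
]
top_level, derived = split_phrases(phrases)
print("top-level:", top_level)
print("derived:  ", derived)
```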

## Noun Phrase Count

This table reports the absolute number of noun phrases, which is related to the number of total words (specifically, nouns) delivered. The next table presents the number of phrases relative to the number of nouns.

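Table 8. Number of noun phrases.

| speaker | noun phrase | phrases (a) | unique phrases (c) | share of all phrases (b) | (c) relative to (a) (d) |
|---|---|---|---|---|---|
| Barack Obama | all | 2,494 | 1,934 | 100.0% | 77.5% |
| | top-level | 807 | 770 | 32.4% | 95.4% |
| | derived | 1,687 | 1,164 | 67.6% | 69.0% |
| John McCain | all | 2,386 | 1,863 | 100.0% | 78.1% |
| | top-level | 761 | 723 | 31.9% | 95.0% |
| | derived | 1,625 | 1,140 | 68.1% | 70.2% |

Table 8 Analysis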

Obama delivered +3.8% more unique noun phrases than McCain. He had +6.5% more unique top-level noun phrases and +2.1% more unique derived noun phrases. The increase of top-level noun phrases is greater than the increase of derived noun phrases, suggesting greater variation in concept usage.

Table 8c Legend
 a c b d
a :: number of noun phrases
b :: (a) relative to number of all noun phrases
c :: number of unique phrases
d :: (c) relative to (a)
bar :: normalized ratio of (a-c):c

## Noun Phrase Richness

The previous table presented the total number of noun phrases, which can be equated to individual concepts. In this table, this value is shown relative to the number of nouns used. The interpretation of this ratio is that of richness: how many noun phrases were constructed per noun.

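Table 9. Number of noun phrases relative to the number of nouns.

| speaker | noun phrase | phrases per noun (a) | unique phrases per unique noun (b) |
|---|---|---|---|
| Barack Obama | all | 0.52 | 1.57 |
| | top-level | 0.17 | 0.62 |
| | derived | 0.35 | 0.94 |
| John McCain | all | 0.51 | 1.49 |
| | top-level | 0.16 | 0.58 |
| | derived | 0.35 | 0.91 |

Table 9 Analysis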

The number of noun phrases relative to the number of nouns is nearly the same for both candidates (0.52 for Obama vs 0.51 for McCain overall).

Table 9c Legend
 a b
a :: ratio of the number of noun phrases to number of nouns
b :: ratio of the number of unique noun phrases to number of unique nouns
bar :: ratio of a:b

## Noun Phrase Frequency and Size

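Table 10. Noun phrase frequency, word count and unique word count.

| speaker | metric | average (a) | 50% (b) | 90% (c) |
|---|---|---|---|---|
| Barack Obama | frequency | 1.29 | 1 | 6 |
| | word count | 2.95 | 4 | 8 |
| | unique word count | 2.9 | 3 | 7 |
| John McCain | frequency | 1.28 | 1 | 6 |
| | word count | 2.95 | 4 | 7 |
| | unique word count | 2.89 | 4 | 7 |

Table 10 Analysis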

Noun phrase frequency and size are nearly the same for both candidates.

Table 10c Legend
 a b c
a :: average
b :: 50% weighted cumulative value
c :: 90% weighted cumulative value
bar1 :: normalized ratio of a:b:c

## Windbag Index

The Windbag Index is a compound measure that characterizes the complexity of speech. A low index is indicative of succinct speech with low degree of repetition and large number of independent concepts.

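Table 11. Windbag Index for each speaker. The higher the value, the greater the degree of repetition in the speech.

| speaker | | index | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Barack Obama | value | 3,741 | 0.439 | 0.247 | 0.257 | 0.306 | 0.284 | 0.228 | 0.775 | 0.398 | 1.566 |
| | relative difference | +15.6% | -1.8% | -3.3% | -3.9% | +1.3% | -8.7% | -4.5% | -0.7% | +2.6% | +5.3% |
| John McCain | value | 3,235 | 0.447 | 0.256 | 0.267 | 0.302 | 0.310 | 0.239 | 0.781 | 0.388 | 1.487 |
| | relative difference | -13.5% | +1.9% | +3.4% | +4.1% | -1.3% | +9.5% | +4.7% | +0.7% | -2.5% | -5.1% |

Table 11 Analysis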

This index is not particularly well suited for a combined analysis, because it is expected that the candidates repeat themselves across three debates. The same points will be brought up, the same questions asked, and so on. Naturally, the more words are said the more words are repeated, since the pool of unique words is fixed.

The Windbag Index is +15.6% greater for Obama. Although he does better for verbs, and 2/3 of the noun phrase metrics, his uniqueness scores in other categories are lower.

Table 11c Legend
The Windbag Index is 1/(t1*t2*...*t9), where t1, t2, ..., t9 are the individual terms. These terms are
t1 :: fraction of words which are non-stop
t2 :: fraction of non-stop words which are unique
t3 :: fraction of nouns which are unique
t4 :: fraction of verbs which are unique
t5 :: fraction of adjectives which are unique
t6 :: fraction of adverbs which are unique
t7 :: fraction of noun phrases which are unique
t8 :: fraction of noun phrases which have no parent
t9 :: ratio of unique noun phrases to unique nouns

Note that large individual terms t1...t9 contribute to a smaller index. The percentage values below the index and each term are relative differences to the other speaker's corresponding term (i.e. 100*(x-x0)/x0, where x is the value for the present speaker and x0 for the other speaker).
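The index can be recomputed from the published terms with the formula in the legend. The sketch below uses the rounded term values from Table 11, so the results land within about 1% of the published index values rather than matching them exactly; the published values were presumably computed from unrounded terms. The function names are illustrative.

```python
from math import prod

# Terms t1..t9 as printed in Table 11 (rounded to three decimals).
OBAMA = [0.439, 0.247, 0.257, 0.306, 0.284, 0.228, 0.775, 0.398, 1.566]
MCCAIN = [0.447, 0.256, 0.267, 0.302, 0.310, 0.239, 0.781, 0.388, 1.487]


def windbag_index(terms):
    """1/(t1*t2*...*t9): larger terms give a smaller index."""
    return 1 / prod(terms)


def relative_difference(x, x0):
    """100*(x-x0)/x0, the percentage shown under each value in Table 11."""
    return 100 * (x - x0) / x0


obama_index = windbag_index(OBAMA)
mccain_index = windbag_index(MCCAIN)
index_difference = relative_difference(obama_index, mccain_index)
t1_difference = relative_difference(OBAMA[0], MCCAIN[0])

print(f"Obama  {obama_index:,.0f}")
print(f"McCain {mccain_index:,.0f}")
print(f"Obama vs McCain: {index_difference:+.1f}% (index), {t1_difference:+.1f}% (t1)")
```

Because the index is a product of nine terms, small differences in each term compound: no single term differs by more than about 10% between the candidates, yet the index differs by roughly 15%.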

# Tag Clouds

In the tag clouds below, the size of the word is proportional to the number of times it was used by a candidate (tag cloud details).

Not all words from a group used to draw the cloud fit in the image. Specifically, less frequently used words for large word groups fall outside the image.

## Debate Tag Clouds for Each Candidate - All Words

Each candidate's debate portion was extracted and frequencies were compiled for each part of speech (noun, verb, adjective, adverb), with words colored by their part of speech category. The words in these tag clouds include words unique to one candidate as well as words used by both candidates. For other tag clouds below, only words unique to a candidate are used.

Keep in mind that the word sizes between tag clouds cannot be directly compared, since the minimum and maximum size of the words in each tag cloud is the same. However, the distribution of sizes within a tag cloud reflects the frequency distribution of words (tag cloud details).
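The sketch below illustrates why sizes cannot be compared across clouds. It assumes a linear mapping from word count to font size, with every cloud stretched to the same minimum and maximum size. The linear mapping, the size limits and the counts are all made up for illustration; the actual scaling is described on the tag cloud details page.

```python
def font_sizes(counts, min_size=10.0, max_size=48.0):
    """Map word counts linearly onto [min_size, max_size] within one cloud."""
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for word, count in counts.items():
        # If every word has the same count, draw them all at the largest size.
        scale = (count - lowest) / span if span else 1.0
        sizes[word] = min_size + scale * (max_size - min_size)
    return sizes


# Made-up counts for two hypothetical clouds.
cloud_a = {"important": 40, "energy": 20, "tax": 10}
cloud_b = {"nuclear": 8, "obama": 4, "spending": 2}

sizes_a = font_sizes(cloud_a)
sizes_b = font_sizes(cloud_b)
print(sizes_a)
print(sizes_b)
```

In this example the most frequent word in each cloud is drawn at the same size even though one was used five times as often as the other, while the spread of sizes inside each cloud still follows that cloud's own frequency distribution.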

### Debate Tag Cloud for John McCain - all words

Debate Tag Cloud Analysis

Across all the debates, Obama maintains "important" as his most important (ha ha) word. Note "energy", "health", "economic", "care", "tax" and "people" are central concepts.

In stark contrast, McCain truly feels that "nuclear" is an important topic and as relatively important as "Obama".

## Debate Tag Clouds for Each Candidate - Unique Words

The tag clouds below show only words used exclusively by a candidate. For example, if candidate A used the word "invest" (any number of times), but the other candidate B did not, then the word will appear in the unique word tag cloud for candidate A.

### Debate Tag Cloud for John McCain - words unique to John McCain

Unique Word Tag Cloud Analysis

The unique word clouds are particularly informative in a combined debate analysis. The more words said, the fewer words are attributed to only one candidate and these gain importance with increased number of debate samples. Remember, these are words spoken by one candidate, but not the other, across all debates.

Obama's unique words have a large noun component, with words such as "notion", "fundamentals", "consequence", and "wages". His most prominent unique word was the verb "agree", which McCain did not use (note: there is no stemming done in the analysis - McCain did use "agreed"). Obama's use of "potentially" suggests openness to complications and the unforeseen.

McCain's unique words on the other hand focus nearly exclusively on verbs. He uses strong action words such as "opposes" and "legitimize" which suggest a confrontational and unilateral view. His top unique adverb was "badly", which suggests an attack stance (presumably the word is used in context of his opponent).

## Part of Speech Tag Clouds

In these tag clouds, words by both candidates were categorized on the basis of exclusivity to a candidate. Words unique to each candidate are drawn with a different color. Words used by both candidates are shown in grey.

The size of the word is relative to the frequency for the candidate - word sizes between candidates should not be used to indicate difference in absolute frequency.

Words were further categorized by part of speech (noun, verb, adjective, adverb) and individual tag clouds were prepared for each category.

The last tag cloud in this section uses all (noun + verb + adjective + adverb) parts of speech.

### Tag Cloud of noun words, by speaker

Noun Tag Cloud Analysis

Do you see many blue words? Those are nouns exclusive to McCain and there is hardly a blue word in sight. It is shocking how overwhelmingly Obama's delivery drowns out McCain's contribution in the realm of nouns across all the debates.

The third debate saw a cloud like this, but McCain at least managed to get a few words into the cloud.

### Tag Cloud of verb words, by speaker

Verb Tag Cloud Analysis

For verbs, McCain's contribution was overwhelming - a situation opposite to that of nouns. Take a look, however, at what Obama brings to the cloud: words like "agree", "invest", "recognize", "focused" and "thinking". Obama's contribution is that of conciliation and careful consideration.

### Tag Cloud of adjective words, by speaker

The split in adjective contribution is more even between the debaters. McCain's curious repetition of "angry", "excess" and "afraid" contrasts with Obama's central use of "enormous" as well as "strategic", "easy" and "local".

### Tag Cloud of adverb words, by speaker

McCain, though delivering fewer adverbs than Obama, repeats them quite a bit. Here, his relative usage contribution outweighs Obama's. Contrast McCain's "badly" to Obama's "potentially". McCain comes across as a hard-liner whereas Obama comes across as moderate.

### Tag Cloud of all words, by speaker

All Tag Cloud Analysis

When all parts of speech are compared, Obama is easily the greater verbal force. McCain's contribution is absolutely swamped out by Obama's unique words.

## Word Pair Vignette Tag Clouds for Each Candidate

### Tag Cloud of word pairs by Barack Obama

noun/noun by Barack Obama

noun/verb by Barack Obama

verb/verb by Barack Obama

Word Pair Tag Cloud Analysis for Barack Obama.

An interesting adjective/adverb pairing frequent for Obama is "military never", as well as "correct quickly". Across all debates, the top pairings suggest focus on "care health" (large noun/noun component), and "think understand" (large verb/verb component).

### Tag Cloud of word pairs by John McCain

noun/noun by John McCain

noun/verb by John McCain

verb/verb by John McCain

Word Pair Tag Cloud Analysis for John McCain.

McCain's repetition of "nuclear power" and "national security" drowns out any mention of economy or domestic policy. His largest verb/verb pairing is "america united" (compare this to "think understand" for Obama), and a large component to adverb/verb is "completely control". McCain's stance is one of nationalism and certainty.