Lexical Analysis of 2008 US Presidential and Vice-Presidential Debates | Martin Krzywinski

Lexical Analysis of 2008 US Presidential and Vice-Presidential Debates: who's the Windbag?

1 minute summary

Metrics of speech structure of candidates fall within narrow tolerances, suggesting a high degree of wordsmithing and rehearsal. For example, the spread in noun/verb/adjective/adverb ratios is very small, with candidates' values within 2%. Relatively small differences are seen in unique word count and noun phrase profile. The Obama/McCain debates began with balanced performance from both candidates but ended with Obama verbally overpowering McCain and delivering speech with more concepts and higher complexity. When words exclusive to a candidate are considered, Obama's more frequent use of verbs and much more frequent use of adjectives and adverbs, compared to McCain, suggests that he is more of a fluid and contextual thinker who does not seek to fit issues into pre-existing categories, unlike McCain, whose language metrics suggest a categorical approach. Obama's greater use of modifiers suggests an outlook that is more open to nuance and to the inter-relatedness of events and issues.

Analysis of the Biden/Palin debate suggests that speech of Vice-Presidential candidates is less complex and more repetitive than that of their Presidential counterparts, with Biden being the most repetitive speaker and Palin having the longest sentences, of all four debates.

Thematic Profile of 2008 US Presidential Debates between Barack Obama and John McCain - identifying topics: nuclear, fear, military, economy, geography, health and environment
^Thematic profiles of all three Presidential Debates (large) (PDF) show parts of the debate in which specific topics were discussed: nuclear issues, fear mongering, military matters, economic crisis, health concerns, energy and the environment.

Introduction

Others' work

He counts your words (even those pronouns), an article in the NYT about Pennebaker's approach to analysis of debates and Al Qaeda communication

Presidential word use in State of the Union addresses by Jonathan Corum.

Naming Names, a NYT article about candidates' reference to each other during debates (uses Circos)

Lexical Analysis of Obama's and McCain's Speeches by Jacques Savoy


The analysis presented here explores word usage in the 2008 US Presidential and Vice-Presidential debates. The purpose is to explore the structure of speech, as characterized by the use of nouns, verbs, adjectives and adverbs, and noun phrases. The speech patterns of opposing candidates are compared in an effort to identify characteristic value and personality traits.

Specifically, I examine the debate for the following

The analysis reveals extremely surprising results, or at least what I believe to be surprising results. With three debates behind them, the verbal contest between Obama and McCain, while starting out relatively even, can be seen to tip very strongly in Obama's favour with respect to speech complexity and articulation.

A formal debate — and in this case three for a pair of candidates — serves as a great text for this kind of analysis. The debate format is highly controlled: each speaker is subjected to the same stimulus (question) and is given the same amount of time to respond. Debates therefore eliminate some of the variation that would appear in analysis of interviews and other unscripted speech, in which questions and topics may vary across samples.

For some cases, where the analysis focuses on proportions of parts of speech, collecting a variety of inputs from a speaker is helpful. However, when debates are analysed, we can extend the investigation to include examining words used exclusively by one speaker and not the other — these are extremely informative because they reflect how a speaker chose to address the issue.


Third Debate is a structural disaster for McCain, verbally speaking.

First Debate. Obama makes a greater contribution to unique words in the debate. He is informal with McCain, using "John" extremely frequently.
Second Debate. The debate is relatively more populated with words spoken by both candidates. Obama changes stance from "John" to "McCain".
Third Debate. Obama overwhelms McCain with his contribution of words to the debate. Obama never says "nuclear", a word that McCain continues to repeat.

Methods

The transcript for each debate is parsed to (a) identify the speaker, (b) remove stop words (words such as "do", "and" and "it"), (c) label non-stop words with their part of speech (this is called tagging), and (d) identify noun phrases (this is called chunking).
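As a rough illustration of steps (a) and (b), here is a minimal Python sketch. It is not the code used for this analysis. The transcript layout (an upper-case speaker label followed by a colon), the tiny stop word list and the function names are all assumptions made for the example. Tagging and chunking are not shown.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (assumption); a real list is much longer.
STOP_WORDS = {"do", "and", "it", "the", "a", "to", "of", "is", "i", "we", "in", "will"}

# Assumed transcript layout: a turn starts with "SPEAKER: text".
SPEAKER_LINE = re.compile(r"^([A-Z]+):\s*(.*)$")


def split_by_speaker(transcript):
    """Step (a): collect lower-cased words under the speaker who said them."""
    words = defaultdict(list)
    speaker = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            speaker, text = match.group(1), match.group(2)
        else:
            text = line  # continuation of the current speaker's turn
        if speaker is not None:
            words[speaker].extend(re.findall(r"[a-z']+", text.lower()))
    return dict(words)


def remove_stop_words(words):
    """Step (b): keep only non-stop words."""
    return [w for w in words if w not in STOP_WORDS]


# Invented sample text, not a quote from any debate.
sample = """OBAMA: We do need to invest in science and education.
It is the right approach.
MCCAIN: I oppose it, and I will secure the border."""

by_speaker = {s: remove_stop_words(w) for s, w in split_by_speaker(sample).items()}
```

The result maps each speaker to their non-stop words in spoken order, which is the form the later counting steps need.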

The tagged and chunked transcripts are analyzed to determine

I attempt to quantify the overall complexity of speech by a novel metric called the Windbag Index. This value is the product of 9 factors, each measuring uniqueness in a different aspect of speech (more about Windbag Index).

A full description of each of the steps in the analysis is available in the detailed methods section. I encourage you to read this section - it's not very technical - to become familiar with the approach and to gain greater versatility in interpreting the results. This work is also not without its share of limitations.

Vignette of unique words used by US Presidential and Vice-Presidential Candidates - John McCain, Barack Obama, Joe Biden and Sarah Palin - drawn by Wordle
^images created with Wordle - make your own with word feeds from this analysis.

Results and Analysis

Detailed results and comments are available for each debate. The first Obama vs McCain debate has more in-depth analysis, since it is the first debate that I analyzed.

Results for Barack Obama vs John McCain (1st debate)

Results for Barack Obama vs John McCain (2nd debate)

Results for Barack Obama vs John McCain (3rd debate)

Results for Barack Obama vs John McCain (combined debates)

Results for Joe Biden vs Sarah Palin

Each debate analysis report contains a great deal of data. To start, you may find these elements the most interesting

And, yes, Biden is a windbag. His speech has a Windbag Index of 606, highest of all candidates!

Visualizing the Debates

Word usage tables describe the structural characteristics of speech by frequency of words, sentence size, proportion of unique and exclusive words and breakdown of words by part of speech (see example).
Tag clouds for words used by a candidate, categorized by parts of speech. Obama is frequently informal with "John", but McCain never calls Obama by his first name (see example).
Tag clouds for words used in the debate, categorized by ownership. McCain's favourite word is "afraid", whereas Obama prefers "true" (see example).
Concept frequency based on part-of-speech pairs. Obama is about energy and economy, but McCain's focus is on threats and the military (see example).

Word Clouds with Flair - Wordles

Using the Atom feeds below, you can create your own visualizations of the word lists using Wordle.

Wordles of noun/adjective pairs for Barack Obama and John McCain

Nouns unique to a speaker are those that were mentioned by one speaker, but not the other. Nouns, and other parts of speech, were identified in the transcripts using the Brill tagger (example tagged text - Obama portion, 1st debate). For example, Obama said biodiesel, children, education, medicare, perspective and science. McCain did not. McCain, on the other hand, said aggression, greed, failure, invasion, maverick and stubbornness. Obama did not.
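The idea of words unique to a speaker can be sketched in a few lines of Python. This is only an illustration of the definition above, not the code behind the analysis. The function name `exclusive_counts` and the toy noun lists are assumptions, and the counts are invented rather than taken from the transcripts.

```python
from collections import Counter


def exclusive_counts(own, other):
    """Count the words in `own` that never appear in `other`."""
    other_vocabulary = set(other)
    return Counter(w for w in own if w not in other_vocabulary)


# Toy noun lists (invented for illustration, not real transcript counts).
obama_nouns = ["science", "economy", "education", "science", "tax", "notion"]
mccain_nouns = ["economy", "greed", "tax", "aggression", "greed"]

obama_only = exclusive_counts(obama_nouns, mccain_nouns)
mccain_only = exclusive_counts(mccain_nouns, obama_nouns)
```

Words spoken by both candidates ("economy" and "tax" in the toy lists) drop out of both results. What remains is each speaker's exclusive vocabulary with its frequency, which is the kind of word list a tag cloud or Wordle takes as input.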

Wordles of unique nouns for Barack Obama and John McCain

Download high-resolution wordles: Obama (noun/adjective pairs, unique nouns), McCain (noun/adjective pairs, unique nouns)

Candidates' Lexical Profiles

Table 1. Variety of word usage statistics for each speaker. Vocabulary size is based on non-stop content.
speaker         vocabulary size   non-stop word fraction   unique word fraction   avg word frequency   avg sentence size
Barack Obama    1,243             43.4%                    38.1%                  2.63                 7.74
John McCain     1,243             44.3%                    39.8%                  2.51                 7.08
Joe Biden       1,174             47.1%                    33.8%                  2.96                 7.59
Sarah Palin     1,201             43.9%                    35.9%                  2.78                 8.46
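The statistics in Table 1 can be sketched as follows. The definitions used here are assumptions: the table does not spell them out, but its values are consistent with unique word fraction being vocabulary size divided by the number of non-stop words, and average word frequency being its reciprocal (for Obama, 1/2.628 is about 0.381). The function name and sample words are hypothetical.

```python
from collections import Counter


def lexical_profile(all_words, stop_words):
    """Compute Table 1 style statistics under the assumed definitions."""
    content = [w for w in all_words if w not in stop_words]
    vocabulary = Counter(content)
    return {
        # number of distinct non-stop words
        "vocabulary_size": len(vocabulary),
        # share of all words that are not stop words
        "non_stop_fraction": len(content) / len(all_words),
        # distinct non-stop words per non-stop word spoken
        "unique_word_fraction": len(vocabulary) / len(content),
        # average number of times each distinct non-stop word is used
        "avg_word_frequency": len(content) / len(vocabulary),
    }


# Invented sample, not transcript text.
words = "we need to invest and we need to agree".split()
profile = lexical_profile(words, stop_words={"we", "to", "and"})
```

Under these definitions a more repetitive speaker has a lower unique word fraction and a higher average word frequency, which matches the direction of Biden's values in the table.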
Table 2. Part of speech statistics for each speaker. Within each cell, values are (nouns|verbs|adjectives|adverbs). For POS ratio, first row of values is the fraction of speech tagged with a given POS, the second is the fraction relative to adverbs (least frequent group), and third row is the number of unique words for this POS (i.e. POS vocabulary size).
speaker         statistic                        nouns    verbs    adjectives   adverbs
Barack Obama    POS ratio: fraction of speech    52.9%    25.2%    15.1%        6.8%
                POS ratio: relative to adverbs   7.7      3.7      2.2          1.0
                POS ratio: unique words          645      361      213          73
                POS unique fraction              39%      46%      45%          34%
                average POS frequency            2.56     2.17     2.22         2.92
John McCain     POS ratio: fraction of speech    53.8%    26.5%    14.2%        5.5%
                POS ratio: relative to adverbs   9.7      4.8      2.6          1.0
                POS ratio: unique words          663      363      192          65
                POS unique fraction              41%      46%      45%          39%
                average POS frequency            2.44     2.20     2.22         2.55
Joe Biden       POS ratio: fraction of speech    57.5%    25.2%    12.1%        5.3%
                POS ratio: relative to adverbs   10.9     4.8      2.3          1.0
                POS ratio: unique words          640      359      163          65
                POS unique fraction              34%      43%      41%          38%
                average POS frequency            2.96     2.31     2.45         2.66
Sarah Palin     POS ratio: fraction of speech    55.0%    26.5%    12.9%        5.6%
                POS ratio: relative to adverbs   9.8      4.7      2.3          1.0
                POS ratio: unique words          651      377      182          61
                POS unique fraction              37%      44%      44%          34%
                average POS frequency            2.71     2.25     2.28         2.97
Table 3. Windbag Index measures the extent of repetition in speech. The index for every debate for a candidate is shown.
speaker         Windbag Index
Barack Obama    422, 405, 457 (avg 428)
John McCain     368, 352, 505 (avg 408)
Joe Biden       606
Sarah Palin     535

Words in the tag clouds below are colored by part of speech:   noun   verb   adjective   adverb  

^ Words unique to Barack Obama (not spoken by McCain) in all debates, colored by part of speech. Note the preponderance of nouns and central role of "agree", "invest" and "potentially".
^ Words unique to John McCain (not spoken by Obama) in all debates, colored by part of speech. McCain's contribution was largely verbs and adverbs. Note the large role of "badly", "excess", and "legitimize".
^ All nouns in debates, colored by contributing speaker (green = Obama, blue = McCain, grey = spoken by both). Relative use of Obama's contribution to nouns is more balanced than McCain's, largely because McCain actually used "John" and "McCain" in his speech (if he didn't those words would heavily unbalance Obama's unique nouns). Note "science", "notion", "consequence", "approach", "regulations" - all words used with relatively high frequency by Obama but never by McCain.
^ All verbs in debates, colored by contributing speaker (green = Obama, blue = McCain, grey = spoken by both). McCain had a greater contribution of frequently used verbs (words that only he used). Note "oppose", "legitimize" and "winning". The verbs that were exclusive to Obama were less frequently used, except for "agree", "invest" and "recognize".

Discussion

The analysis presents a great deal of data, but from it two central themes arise.

Lexical Structure: Speech Conformity through Rehearsal and Audience Profiling

The first theme quickly became evident after analyzing the first debate. The speech pattern of Obama and McCain conformed to nearly identical word usage patterns. For example, vocabulary size for Obama and McCain (number of unique non-stop words used) is identical at 1,243. Their non-stop word fraction is also nearly identical at 43.4% and 44.3% for Obama and McCain, respectively. Likewise, the difference in their unique word fraction and average word frequency is only +4.3% and +4.8%, respectively.

The reason for such conformity is anyone's guess, but several factors come to mind. First, the word usage profile could be a direct product of political selection. The fact that these debaters were drawn from a political background, and therefore have had to function within a verbally demanding environment where nimbleness is perhaps rewarded over precision, may explain the similarity in their delivery. Their political contemporaries and the public may both have a finely tuned ear, though perhaps not the same one, to what is considered effective speech by a successful candidate.

Another factor, and one that I suspect is in play at all times, is the degree of premeditated wordsmithing in the preparation for these debates. I do not doubt that each of the candidates' preparation went well beyond casual enumeration of talking points. Certainly, to consistently win the hearts and minds (and ears) of their audience, each debater must have given significant consideration to not just what was said, but how. I would not be surprised if candidates memorized what they considered to be particularly effective phrases for delivering content and trenchant retorts for countering their opponents. It is also likely that somewhere in the bowels of the political arena are linguistic specialists who have profiled precise (or as precise as can be measured) comprehension and literacy levels of the population, broken down by region and demographic.

The Vice-Presidentials: Lowered Expectations and in the Shadow of their Running Mates

Frequency analysis of the speech of Biden and Palin indicates a lower overall complexity - smaller vocabulary size and higher degree of repetition - than in the speech of Obama and McCain. Presumably these Vice-Presidential hopefuls want to come across as sufficiently articulate and effective to be compelling, but not so much as to steal the limelight from their running mates.

The largest difference in complexity is between Biden and McCain, whose average word frequency was 2.96 and 2.51, respectively. This, and other metrics that measure Biden's speech, earn him many of his nicknames that suggest him to be verbose but not articulate. And, although McCain's complexity drops significantly in the third debate, in which his verb/noun pairings suggest that he spends more time attacking than expounding his own plans, McCain is the least repetitive of all candidates.

Palin had the longest sentences, with an average length of 8.46 non-stop words. At about 1.4 words more than McCain's average of 7.08, her sentences were the only ones that broke the 8 non-stop word barrier. This is a significant finding, especially in light of the fact that she had a significantly smaller vocabulary than McCain and Obama.

Part of Speech Usage: Adverb Signature Distinguishes Candidates

The relative ratio of each part of speech is extremely similar across all candidates: nouns compose 53-57% of speech, verbs 25-26%, adjectives 12-15% and adverbs 5-7%. The greatest fluctuation in usage, and in the unique component, was in adverbs.

Given that adverbs are the least used part of speech of the four examined, they serve as a natural unit. When compared to adverb use, adjectives are consistently only 2.2-2.6 times as frequent as adverbs. This strongly suggests the speakers' desire to qualify things much more than actions. Verb use is about 4.8 times as frequent as adverbs, which suggests that only 1 verb in 5 gets a modifier. This brings to mind the notion that politicians make promises by saying what they will do, but fail to deliver clarity that would explain how it will be achieved. Obama, however, has a lower verb-to-adverb frequency, 3.7, suggesting that he might be one to more frequently characterize actions, by either defining limits or strengthening the verb. Obama had the lowest noun-to-adverb ratio, 7.7, compared to 9.7 for McCain, 9.8 for Palin and 10.9 for Biden. This suggests that Obama's delivery was focused more on action and movement rather than static concepts.
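The adverb-relative ratios quoted above can be reproduced from the part-of-speech fractions in Table 2. The sketch below is illustrative only. The function name is hypothetical, and the input values are the unrounded fractions behind Table 2, rounded here to three decimals.

```python
POS = ("nouns", "verbs", "adjectives", "adverbs")

# Percentage of tagged speech per part of speech, in the order given by POS.
pos_fraction = {
    "Barack Obama": (52.854, 25.176, 15.138, 6.831),
    "John McCain": (53.762, 26.531, 14.181, 5.526),
    "Joe Biden": (57.468, 25.167, 12.113, 5.252),
    "Sarah Palin": (54.995, 26.455, 12.916, 5.633),
}


def relative_to_adverbs(fractions):
    """Express each POS fraction as a multiple of the adverb fraction."""
    adverbs = fractions[-1]
    return tuple(round(f / adverbs, 1) for f in fractions)


ratios = {speaker: relative_to_adverbs(f) for speaker, f in pos_fraction.items()}
# ratios["Barack Obama"] is (7.7, 3.7, 2.2, 1.0), matching Table 2
```

Using the least frequent group as the unit makes the four speakers directly comparable: a smaller noun or verb multiple means adverbs take up relatively more of that speaker's delivery.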

McCain is Categorical and Obama is Contextual — certainty vs nuance

Adverbs are not Obama's only strength. Verbs, adjectives and adverbs consistently make up a greater part of his delivery than of McCain's (see table). If we look at words that are specific to a candidate, Obama's ratio of nouns:verbs:adjectives:adverbs in this word group is 48:30:21:7, whereas the values for McCain are 59:27:16:3.

Previous work by Pennebaker drew firm conclusions that Obama is a contextual thinker, one who uses more verbs and modifiers and who sees the world as having loosely defined boundaries between concepts. Contextual thinkers like to use adjectives and adverbs to loosen otherwise narrowly defined words (or those perceived as narrow) in an effort to express exceptions and nuance. McCain, on the other hand, has been characterized as a categorical thinker, one who heavily uses nouns.

Even when all words by a candidate are considered, not just the ones attributable only to them, the discrepancy in part-of-speech proportions is strong. Obama's ratio is 51:27:16:7 and McCain's is 56:26:13:5. Adjectives and adverbs make up 22.4% of Obama's parts of speech whereas for McCain this fraction is only 18.3%. The difference in verb use is relatively small, but adjective and adverb usage is significantly different. Presumably when McCain uses a verb he sees a narrow definition. Obama, on the other hand, is more likely to add dimension to its meaning with an adverb.

What Was and Wasn't Said

The analysis of all three debates by Obama and McCain reveals distinguishing elements of their speech. It is not a large stretch to infer that these relate directly to their personality, outlook and policies.

Extremely informative are the word clouds of nouns and verbs, by speaker.

Unique contribution to presidential debate noun use by John McCain and Barack Obama
Unique contribution to presidential debate verb use by John McCain and Barack Obama

When frequencies of nouns unique to McCain are tallied, the top word is "Obama", which he used 111 times, with other top nouns being Iranians (8), greed (7), marines (6), institutions (6) and aggression (6). Since McCain actually used both "John" and "McCain", these words do not overwhelm nouns unique to Obama, who used words like science (5), consequence (6), notion (9), approach (6), and focus (7). I am more partial to Obama's nouns than to McCain's: they indicate openness to nuance (e.g. "notion") and recognition of the complexity in issues (e.g. "approach" and "consequence").

Relative use of McCain's unique verbs is greater than for Obama's set, and thus McCain's verbs outweigh Obama's in the unique verb word cloud. McCain uses oppose/opposes (8), secure (4), legitimize (4), realize (5) and watch (5). Obama has agree (15), invest (14), recognize (8), focused (6) and thinking (6). McCain stands out with strong and aggressive verbs, and repeats them to a similar extent. The frequencies of Obama's unique contributions are more skewed, with verbs like "agree" and "invest" being used nearly twice as frequently as his other unique verbs.

Unique word frequency statistics are extremely revealing about the manner in which the candidates chose to distinguish themselves. McCain's distinguishing nouns and verbs are more balanced, with a focus on threats, military and unilateral action. He chooses to spread his unique contribution evenly across these topics. Obama, on the other hand, concentrates his use on his top verb contributions.

Vocabulary Size, Repetition and the Windbag Index

The vocabulary size for each part of speech is remarkably similar for every candidate. The number of unique nouns, verbs, adjectives and adverbs ranged within 640-663, 359-377, 163-213 and 61-73, respectively. The largest difference was for adjectives, with Obama having the largest adjective vocabulary (213) and Biden the smallest (163).

In an effort to provide a single number that quantifies repetition in speech, I created the Windbag Index (details), which is a composite of measures of repetition in various aspects of speech.

Windbag Index for US Presidential and Vice-Presidential Debates. Yes, Biden is a windbag.

^These presidential candidates are literally a breed apart from their vice-presidential counterparts.

The Windbag Index successfully captures the essence of the large number of individual metrics presented by this analysis. It can be seen that Obama and McCain cluster together, with a Windbag Index difference of 4.9%. Likewise Biden and Palin cluster together, with the difference between them being 13.3%. More importantly, the vice-presidential candidates are separated from the presidential candidates by a very large relative margin.
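The percentages above follow from the Table 3 values with a little arithmetic, sketched below. One assumption is labeled in the code: the difference is taken relative to the smaller of the two values, which is the choice that reproduces the quoted 4.9% and 13.3%. The helper name is hypothetical.

```python
from statistics import mean

# Windbag Index per debate, from Table 3.
windbag = {
    "Barack Obama": [422, 405, 457],
    "John McCain": [368, 352, 505],
    "Joe Biden": [606],
    "Sarah Palin": [535],
}

# Per-candidate average, rounded to the nearest integer as in the table.
average = {speaker: round(mean(values)) for speaker, values in windbag.items()}


def relative_difference(a, b):
    """Difference in percent, relative to the smaller value (assumption)."""
    low, high = sorted((a, b))
    return 100 * (high - low) / low


presidential = relative_difference(average["Barack Obama"], average["John McCain"])
vice_presidential = relative_difference(average["Joe Biden"], average["Sarah Palin"])
# round(presidential, 1) is 4.9 and round(vice_presidential, 1) is 13.3
```

Taking the larger value as the baseline instead would give smaller percentages, so the baseline matters when comparing these figures with other sources.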

McCain's loss of verbal agility in the last debate stands out clearly in this graphic - his Windbag Index rose from a previous high of 368 to 505. Obama's index was highest in the third debate as well but his values had much lower variation suggesting better pacing and more consistent performance.

Downloads

The content of the word list archive and the data structure syntax are described in the methods section.

Barack Obama vs John McCain (1st debate): transcript, word lists, tag clouds, data structure

Barack Obama vs John McCain (2nd debate): transcript, word lists, tag clouds, data structure

Barack Obama vs John McCain (3rd debate): transcript, word lists, tag clouds, data structure

Barack Obama vs John McCain (combined debates): transcript, word lists, tag clouds, data structure

Joe Biden vs Sarah Palin: transcript, word lists, tag clouds, data structure

Atom Feeds

Each word list is available as an Atom feed. It can be used as input to create Wordles. See methods for explanation about the part of speech codes.

Barack Obama vs John McCain (1st debate)

Barack Obama vs John McCain (2nd debate)

Barack Obama vs John McCain (3rd debate)

Barack Obama vs John McCain (combined debates)

Joe Biden vs Sarah Palin