by Martin Krzywinski

## Methods

### Transcript Preparation

Any markup by the transcriptionist (e.g. (sic), [mispronunciation], (laughter)) was removed. All m-dashes and double dashes were replaced by single dashes.
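The clean-up step can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the pattern `NOTE_PATTERN` and the function `clean_transcript` are hypothetical names, and the set of notes removed (the three examples above, plus any bracketed text) is an assumption.

```python
import re

# Assumption: transcriptionist notes are "(sic)", "(laughter)" or any text in
# square brackets. The real transcripts may contain other kinds of notes.
NOTE_PATTERN = re.compile(r"\(sic\)|\(laughter\)|\[[^\]]*\]", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Remove transcriptionist notes and normalize dashes to single dashes."""
    text = NOTE_PATTERN.sub("", text)
    # An m-dash (U+2014) or a run of two or more hyphens becomes one hyphen.
    text = re.sub(r"\u2014|-{2,}", "-", text)
    # Collapse the extra spaces left behind by removed notes.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```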

### Speaker Identification

The start of a speaker's section is indicated in the transcript by a leading $SPEAKER_NAME:$ tag (e.g. $TRUMP:$, $BIDEN:$, etc).

For each debate, the section of the transcript for each of the participants was extracted using the speaker tag. The moderator's contribution was discarded.

Raw transcript portions for a given speaker are found in SPEAKER.txt (e.g. trump.txt, biden.txt, etc.). Sections of uninterrupted speech are reported on a single line, overriding the paragraph structure of the transcript.
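A minimal sketch of the extraction, assuming that a speaker tag is an upper-case name followed by a colon at the start of a line and that untagged paragraphs continue the current speaker's section. The names `TAG` and `split_by_speaker` are hypothetical and this is not the code used in the analysis.

```python
import re
from collections import defaultdict

# Assumption: a tag such as "TRUMP:" opens a line and is written in upper case.
TAG = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.*)$")


def split_by_speaker(transcript, moderators=()):
    """Return {speaker: [section, ...]}, one string per uninterrupted section."""
    turns = []  # [speaker, [text pieces]] in transcript order
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG.match(line)
        if match:
            turns.append([match.group(1), [match.group(2)]])
        elif turns:
            # Untagged paragraph: continuation of the current section.
            turns[-1][1].append(line)

    sections = defaultdict(list)
    for speaker, pieces in turns:
        if speaker not in moderators:  # the moderator's contribution is discarded
            # Join paragraphs so that each section ends up on a single line.
            sections[speaker].append(" ".join(p for p in pieces if p))
    return dict(sections)
```

Writing each speaker's list to a file, one section per line, would give the layout described for SPEAKER.txt.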

### Sentence Identification

For each speaker, their transcript portion was divided into individual sentences using the Natural Language Toolkit. Sentence files (one sentence per line) are found in sentence.SPEAKER.txt.

### Flesch-Kincaid Readability

The Flesch-Kincaid reading ease and grade level metrics are designed to indicate how difficult a passage in English is to understand.

Both reading ease and grade level require determining the number of sentences, words and syllables in text.

Sentences were determined as described above.

Words were determined by first tokenizing text with the Natural Language Toolkit. In this process, contractions and possessives are tokenized into a pair of strings (e.g. "they're" becomes "they" and "'re").

Any m-dashes and pause hyphens are added to the word count but their syllable contribution is zero — the idea here is that these represent pauses and should contribute to making the speech more understandable. Any other tokens with a zero syllable count were not considered. For example, "'60s" is reported to have zero syllables, so it is not counted. For this reason, the number of words reported in this section can be slightly lower than in other sections where the syllable filter is not applied.

Syllables were counted using syllapy. The number of syllables in a possessive or contraction token (e.g. "'re") is 1. The number of syllables in an m-dash or pause hyphen is zero.
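The counting rules above can be sketched as follows. The syllable counter is passed in as a function; the default `toy_syllables` simply counts vowel groups and is a hypothetical stand-in for illustration only, not syllapy's algorithm. `PAUSE_TOKENS` and `count_words_and_syllables` are likewise assumed names.

```python
import re

# Pause tokens: a single hyphen or an m-dash (U+2014).
PAUSE_TOKENS = {"-", "\u2014"}


def toy_syllables(token: str) -> int:
    """Stand-in counter: number of vowel groups. NOT syllapy's algorithm."""
    return len(re.findall(r"[aeiouy]+", token.lower()))


def count_words_and_syllables(tokens, count_syllables=toy_syllables):
    """Apply the word and syllable counting rules to a list of tokens."""
    words = syllables = 0
    for token in tokens:
        if token in PAUSE_TOKENS:
            words += 1  # pauses count as words but add zero syllables
            continue
        n = count_syllables(token)
        if n == 0:
            continue  # e.g. "'60s" or punctuation: not counted as a word
        words += 1
        syllables += n
    return words, syllables
```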

Metrics for all of a candidate's combined speech, as well as for each section of uninterrupted text, are available in transcript.SPEAKER.readability.txt.

    # all text
    # sections sentences words syllables reading_ease grade_level
    81 579 7788 11050 73.15 6.40
    # individual sections, first number is section index
    1 30 410 584 72.46 6.55 Thank you, Jim. It's an honor to be here with you, ...

Reading ease ranges from 100 (easiest) down to 0 (hardest) and is calculated using $$206.835 - 1.015 \left( \frac{\textrm{words}}{\textrm{sentences}} \right) - 84.6 \left( \frac{\textrm{syllables}}{\textrm{words}} \right)$$

This metric can be interpreted as follows

The grade level corresponds roughly to a U.S. grade level. It has a minimum value of –3.4 and no upper bound. It is calculated using $$0.39 \left( \frac{\textrm{words}}{\textrm{sentences}} \right) + 11.8 \left( \frac{\textrm{syllables}}{\textrm{words}} \right) - 15.59$$
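As a check on the two formulas, here is a minimal sketch that computes both metrics from the sentence, word and syllable counts. The function names are hypothetical. The constant 15.59 is subtracted outside the parentheses, as in the standard Flesch-Kincaid grade level; this form gives the stated minimum of –3.4 for one word per sentence and one syllable per word.

```python
def reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch reading ease from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def grade_level(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

With the counts from the sample output above (579 sentences, 7788 words, 11050 syllables), these functions give 73.15 and 6.40 after rounding to two decimals, matching the reported values.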

### Stop Word Removal

Where indicated, text was stripped of stop words. This was done to evaluate non-stop word sentence length and to create lists of non-stop words unique to each candidate.

### Part of Speech Tagging

All words were tagged with their part of speech by the Natural Language Toolkit, using the Penn Treebank tags. The tagger is not 100% accurate. Some words are miscategorized and no attempt was made to fix these errors.

Tagged sentences are in sentence.SPEAKER.tag.*.txt and word lists for each tag are in tag.SPEAKER.TAG.txt.

Words were grouped by their tag into nouns ($<N*>$), verbs ($<V*>$), adjectives ($<JJ*>$) and adverbs ($<RB*>$).

These lists are available in pos.SPEAKER.WORDTYPE.txt.

### Pronouns

This year I've added a detailed analysis of pronoun use based on category, person, gender and count. I've also added vignettes of interesting contrasts (e.g. "me" vs "we").

Pronouns were identified using a list of 119 pronouns classified into the following categories: demonstrative, indefinite, interrogative, object, personal, possessive, reflexive/intensive and relative. An additional "other" category contains prenominal adjectives.

Many pronouns belong to more than one category. Each of these was assigned to the first matching category in the following order: personal, object, possessive, reflexive/intensive, indefinite, demonstrative, interrogative, relative, other. For example, "it" can be both personal and object, so it is assigned to the personal category.
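The priority rule amounts to a first-match lookup, sketched below. `CATEGORY_ORDER` and `primary_category` are hypothetical names, and the category memberships passed in are supplied by the caller rather than taken from the actual 119-pronoun list.

```python
# Priority order used to resolve pronouns that belong to several categories.
CATEGORY_ORDER = [
    "personal", "object", "possessive", "reflexive/intensive",
    "indefinite", "demonstrative", "interrogative", "relative", "other",
]


def primary_category(categories):
    """Return the highest-priority category among those a pronoun belongs to."""
    for name in CATEGORY_ORDER:
        if name in categories:
            return name
    raise ValueError(f"no known category in {sorted(categories)}")
```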

Each pronoun was also classified based on person (1st, 2nd, 3rd), gender (masculine, feminine, neuter) and number (singular, plural) using the following associations. Note that only 3rd person singular pronouns have gender. Some attributes of certain 2nd person pronouns (e.g. "you") and of the pronouns in the "other" category cannot be unambiguously determined without context; these are marked with "-".

    # person gender number
    # 1      m      s
    # 2      f      p
    # 3      n
    --s = this,that,anybody,anything,nothing,one,other,somebody,something,another,any,no,anyone,each,none,either,neither
    --p = those,everybody,everything,many,several,these,all,both,others,most
    1-s = i,me,my,mine,myself
    2-- = you,your,yours
    2-s = yourself
    3ms = he,him,his,himself
    3fs = she,her,hers,herself
    3ns = it,its,itself
    1-p = we,us,our,ours,ourselves
    2-p = yourselves
    3-p = they,them,their,theirs,themselves

Pronoun lists are available in pos.SPEAKER.pronoun.TYPE.txt.

### Word Pairs List Creation

For each pair combination of noun, verb, adjective and adverb (e.g. noun noun), adjacent pairs of words of this type were extracted. These are in pospair.SPEAKER.POS1.POS2.txt.

For example, for the tagged sentence

$running/VBG president/NN hope/NN talking/VBG tonight/RB$

the word categories were

$running/VERB president/NOUN hope/NOUN talking/VERB tonight/ADVERB$

and the pairs contributed to these sets

    running president : VERB/NOUN
    president hope    : NOUN/NOUN
    hope talking      : NOUN/VERB
    talking tonight   : VERB/ADVERB
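The pairing step in this example can be sketched as follows. The sketch assumes the input is a line of `word/TAG` tokens and that a token of any other type breaks adjacency; `word_type` and `adjacent_pairs` are hypothetical names, not the code used in the analysis.

```python
from collections import defaultdict


def word_type(tag: str):
    """Map a Penn Treebank tag to one of the four word types, or None."""
    if tag.startswith("N"):
        return "NOUN"
    if tag.startswith("V"):
        return "VERB"
    if tag.startswith("JJ"):
        return "ADJECTIVE"
    if tag.startswith("RB"):
        return "ADVERB"
    return None


def adjacent_pairs(tagged_sentence: str):
    """Group adjacent word pairs by their (type, type) combination."""
    tokens = [item.rsplit("/", 1) for item in tagged_sentence.split()]
    typed = [(word, word_type(tag)) for word, tag in tokens]
    pairs = defaultdict(list)
    for (w1, t1), (w2, t2) in zip(typed, typed[1:]):
        if t1 and t2:  # both words must be a noun, verb, adjective or adverb
            pairs[(t1, t2)].append((w1, w2))
    return dict(pairs)
```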

### Unique and Shared Words

Words unique to the candidate were those used by the candidate but not the other.

In the results section, unique words are sometimes referred to as "exclusive". Care must be taken not to confuse this meaning of unique (i.e. unique to Trump) with the meaning of "distinct". It should be clear from the context which meaning is used. When potential ambiguity arises, I use "exclusive".

Another definition for "exclusive" in the context of words is used to describe words like "but", "except" and "without". When I use exclusive here, I do not mean this.

Words (or other entities, such as parts of speech or noun phrases) categorized as exclusive are in *.set.SPEAKER.* and those that are shared are in *.set.intersection.txt.
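In set terms, exclusive words are set differences and shared words are the intersection. A minimal sketch, with the hypothetical function name `exclusive_and_shared`:

```python
def exclusive_and_shared(words_a, words_b):
    """Return (exclusive to A, exclusive to B, shared) as sets of distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    return set_a - set_b, set_b - set_a, set_a & set_b
```

The same function applies unchanged to other entities, such as noun phrases, as long as they are represented by hashable values such as strings.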

### Extracting Noun Phrases

Noun phrases were extracted using the Natural Language Toolkit with the grammar $NP: {<JJ>*<NN.*|JJ.*>+}$. Chunked files are sentence.SPEAKER.chunk.*.txt.

Noun phrase frequency and length (both total words and unique words) are used in the analysis. Similarity between noun phrases (see below) is used to derive a list of unique concepts.

### Noun Phrase Similarity

Two noun phrases are compared using the union and intersection of their words. The union is the number of distinct words found in either phrase (a word that appears in both phrases is counted once). The intersection is the number of words that appear in both phrases. Similarity is defined as the ratio of intersection to union.

Given a noun phrase, its child is a shorter noun phrase which has the highest similarity ratio. It is possible for a noun phrase to have more than one child, in which case all children have the same similarity ratio with the phrase.

Given a noun phrase, its parent is a longer noun phrase which has the highest similarity ratio. It is possible for a noun phrase to have more than one parent, in which case all parents have the same similarity ratio with the phrase.

In order for a noun phrase pair to be considered for the child/parent relationship, the similarity ratio must be ≥0.2. This cutoff is arbitrary, but a non-zero value is useful to impose a minimum amount of similarity before an association between two noun phrases is made.

For example, the two noun phrases "big green dog" and "small green cat" have a union of 5 (unique words are big, green, dog, small, cat) and an intersection of 1 (one word, green, is found in both), yielding a similarity ratio of 1/5=0.2.
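A minimal sketch of the similarity ratio and of child selection, counting the union as distinct words, as in the example above. It assumes that phrases are split on whitespace and that "shorter" means fewer words; `similarity` and `children` are hypothetical names.

```python
def similarity(phrase_a: str, phrase_b: str) -> float:
    """Ratio of shared distinct words to all distinct words in two phrases."""
    words_a, words_b = set(phrase_a.split()), set(phrase_b.split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def children(phrase: str, candidates, cutoff: float = 0.2):
    """Shorter phrases with the highest similarity that meets the cutoff."""
    length = len(phrase.split())
    scored = [(similarity(phrase, c), c) for c in candidates
              if len(c.split()) < length]
    scored = [(s, c) for s, c in scored if s >= cutoff]
    if not scored:
        return []
    best = max(s for s, _ in scored)
    # Several children are possible when they tie for the best ratio.
    return [c for s, c in scored if s == best]
```

Parents can be found the same way by keeping only candidates with more words than the phrase.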

### Noun Phrase Trees

For a given speaker, two hierarchical trees of noun phrases are constructed: a top-down tree and a bottom-up tree. These noun phrase trees associate a noun phrase with its most similar child or parent noun phrase, where a child noun phrase is one that is shorter, and a parent noun phrase is one that is longer. This association is created using the method described above.

The top-down list is in the nphrasetree.SPEAKER.*.txt file. Each noun phrase spoken by the candidate appears in this file, along with all its child phrases. Here are two example branches of the noun phrase tree, and their format in the file

    a b c d e ---------f------------ ---g----------------------
    ...
    0 2 1 3 3 tBCNRM6b+qr0eSYyvs5UQg low-level diplomatic talks
    1 1 1 2 2 8b1/ZNuvgruxhbZ6L9X3uA   low-level talks
    2 0 1 1 1 /laZKvGiVT1I3jpS5xlQJA     talks
    ...
    0 2 1 3 3 dpQjBR6buAp2v6jmMSMpDA one last point
    1 1 1 2 2 7OFOeAZ9BNM4NdysHsBshw   one point
    2 0 5 1 1 eO5Uqo+BOIX+L+INIyUYuQ     point
    ...

with the fields being

• a - depth of the entry in the tree
• b - number of children associated with the noun phrase in this line
• c - number of times the noun phrase appears
• d - number of words in the noun phrase
• e - number of unique words in the noun phrase
• f - a digest of the noun phrase that uniquely identifies it
• g - noun phrase (padded with spaces for higher depths)
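A line in this format can be read with a short parser such as the sketch below. It assumes that the fields are separated by whitespace and that the digest contains no spaces; `TreeEntry` and `parse_tree_line` are hypothetical names.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    depth: int         # a
    children: int      # b
    count: int         # c
    words: int         # d
    unique_words: int  # e
    digest: str        # f
    phrase: str        # g, with the depth padding removed


def parse_tree_line(line: str) -> TreeEntry:
    """Split one tree line into its seven fields."""
    # maxsplit=6 keeps the spaces inside the noun phrase intact.
    a, b, c, d, e, digest, phrase = line.split(None, 6)
    return TreeEntry(int(a), int(b), int(c), int(d), int(e), digest, phrase.strip())
```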

For example, consider the noun phrase "low-level diplomatic talks". This noun phrase has two children. Its first child "low-level talks" is at depth=1 (depth of a child is always one more than its parent) and itself has a child "talks". This child phrase is at depth = 2, because it is a child of a phrase at depth=1.

Note that a noun phrase may have one or more children. Child phrases are always shorter than their parent.

The bottom-up noun phrase tree can be considered to be the upside-down version of the top-down tree. In the bottom-up tree, a noun phrase's parents are shown. Below is an example of an entry for the "talks" noun phrase.

    ...
    0 3 1 1 1 /laZKvGiVT1I3jpS5xlQJA talks
    1 0 1 2 2 BhErVPRfSrokiOLhxNaXuA   talks europeans
    1 1 1 2 2 8b1/ZNuvgruxhbZ6L9X3uA   low-level talks
    2 0 1 3 3 tBCNRM6b+qr0eSYyvs5UQg     low-level diplomatic talks

This phrase has two parents, each at the same level: "talks europeans" and "low-level talks". These two parents are at the same level because they are of the same length. The phrase "low-level talks" in turn has the parent phrase "low-level diplomatic talks".

Note that a noun phrase may have one or more parents. Parent phrases are always longer than their children.

Every noun phrase has an entry at depth=0 in the tree lists. Then, depending on whether it has children (or parents), it is followed by its children (or parents), which appear at depth ≥ 1. Noun phrases that have no children (i.e. there are no shorter similar noun phrases in the text) or no parents (i.e. there are no longer similar noun phrases in the text) can be identified by lines that start with $0 0$.

Noun phrases without children are listed in nphrase.SPEAKER.nochild.txt. Noun phrases without parents are listed in nphrase.SPEAKER.noparent.txt.

### Windbag Index

The purpose of the Windbag Index, created for this analysis, is to measure the complexity of speech. For the present purpose, complex speech is considered to be speech with a large number of concepts and low repetition. The index is low for complex speech and high for verbose and repetitive speech. You do not want to score highly on the Windbag Index, lest you be called a windbag.

The Windbag Index (WI) is defined as the reciprocal of a product of terms $$\textrm{WI} = \frac{1}{\prod_{i=1}^{8} t_i}$$

where

• $t_1$ = fraction of words that are non-stop (measures non-filler content)
• $t_2$ = fraction of non-stop words that are unique
• $t_3$ = fraction of nouns that are unique
• $t_4$ = fraction of verbs that are unique
• $t_5$ = fraction of adjectives that are unique
• $t_6$ = fraction of adverbs that are unique
• $t_7$ = fraction of noun phrases that are unique
• $t_8$ = fraction of noun phrases that are top-level
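Given the eight fractions, the index is a one-line computation. The sketch below uses the hypothetical function name `windbag_index` and assumes every term lies in the interval (0, 1]; a zero term would make the index undefined.

```python
import math


def windbag_index(terms):
    """Reciprocal of the product of the eight fractions t_1 .. t_8."""
    if len(terms) != 8:
        raise ValueError("expected exactly eight terms")
    if any(not 0 < t <= 1 for t in terms):
        raise ValueError("each term must be a fraction in (0, 1]")
    return 1 / math.prod(terms)
```

Because each term is at most 1, the index is at least 1, and it grows quickly as the fractions shrink: eight terms of 0.5 give an index of 256.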

### Tag Cloud Generation

Tag clouds are images of words, with the size of the word in the image proportional to the frequency of occurrence in a text.

The size $s(w)$ of a word $w$ found at a frequency $f(w)$ is computed based on

$$s(w) = s_\text{min} + (s_\textrm{max} - s_\textrm{min}) \left( \frac{f(w) - f_\textrm{min}}{f_\textrm{max} - f_\textrm{min}} \right)$$

where $f_\textrm{min}$ and $f_\textrm{max}$ are the minimum (almost always 1) and maximum word frequencies, and $s_\textrm{min}$ and $s_\textrm{max}$ are the sizes of the least and most frequent words.
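The size formula is a linear interpolation between the smallest and largest sizes. A minimal sketch, with the hypothetical function name `word_size`; the handling of a list in which every word has the same frequency is an assumption, since the formula is undefined in that case.

```python
def word_size(freq, f_min, f_max, s_min, s_max):
    """Linearly scale a word frequency to a size between s_min and s_max."""
    if f_max == f_min:
        # Assumption: when all frequencies are equal, use the largest size.
        return s_max
    return s_min + (s_max - s_min) * (freq - f_min) / (f_max - f_min)
```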

Any tag cloud in the analysis that uses multiple colors is drawn from multiple word lists (each color corresponds to a different list). Word sizes are computed relative to usage within each list: every list has the same $s_\textrm{min}$ and $s_\textrm{max}$, but a different $f_\textrm{max}$. Therefore, absolute word sizes from two different lists cannot be directly compared.

For example, for a tag cloud that shows both nouns and verbs, the most frequent noun and most frequent verb will be of the same size, regardless of their actual frequencies.

Larger words are preferentially placed in the center of the tag cloud, by virtue of the way the tag cloud is formed.

### Data Structure

The data structure is in XML format. This file is generated automatically from a Perl memory structure, so the format is generic. The innermost block of $<item>$ tags reports individual statistics. Depending on the value of the outer blocks, the statistics are either for word frequencies or sentence lengths.

• $n$ - number of samples (e.g. number of unique words)
• $sum$ - sum of samples (e.g. number of total words)
• $min, max, mean, stdev, median, mode$ - distribution characteristics for the samples
• $pNN$ - NNth percentile
• $cNN$ - NNth cumulative percentile
• $cwNN$ - NNth weighted cumulative percentile

The attributes $np$, $nps$ and $npns$ refer to all words, stop words and non-stop words, respectively. The $np$ prefix indicates that there is no punctuation.

When the item tag attribute is $sentence$, the statistics are for sentence length. For example, $sentence/npns/palin/mean$ is the average sentence length for non-stop words.

When the item tag attribute is $word$, the statistics are for word frequencies. For example, $word/nps/trump/n$ is the number of word frequencies for Trump's stop word list. Since there is one frequency value per word, this is also the number of unique words. On the other hand, $word/nps/trump/sum$ is the sum of frequencies, or the total number of stop words.

The attribute $pospair$ indicates that the statistics are for frequencies of pairs of parts of speech.

The attribute $nphrase$ indicates statistics for noun phrases. A variety of noun phrase statistics are available: $noparent_X$, $withparent_X$ and $all_X$ for noun phrases which have no parent phrase, for those that have a parent phrase, and for all noun phrases. $X$ is one of freq, len or ulen and indicates noun phrase frequency, length (number of words) and unique length (number of unique words).

Finally, the $pos$ attribute indicates statistics for individual parts of speech. Within this section of the file, the tag $both$ is used for lists of words spoken by both candidates and the tag $all$ for words spoken by either candidate. In hindsight, I realize that I should have used $either$ rather than $all$.

### Weighted Cumulative Distribution

Consider the word list $a b c c d e e e e f f$. Here each word is just a letter, for simplicity. The frequency list for these words is $1,1,2,1,4,2$ since $a$ was seen once, $b$ also once, $c$ twice, $d$ once, and so on. The input to the statistical analysis is the list $1,1,2,1,4,2$.

The average of this list is $11/6 \approx 1.8$ and this is the average word frequency. In other words, on average, each word is used 1.8 times.

The percentile is the value below which a certain percentage of all values fall in the sorted list $1,1,1,2,2,4$. For example, 50% of the values are smaller than 2, so the 50th percentile is 2. Similarly, 83% (5/6) of the values are smaller than 4, so the 83rd percentile is 4. For small sets of numbers, as in this example, the percentile is open to ambiguity (e.g. the 83rd percentile could also be 3.5, since 83% of values are smaller than 3.5). A variety of methods exist to determine percentiles and they converge when the number of values is large. The 50th percentile is also called the median. For word frequencies, the percentile gives the percentage of frequencies smaller than a given frequency. For example, if the 90th percentile of word frequencies is 7.5, then 90% of the words in the vocabulary of the speaker are found with a frequency of less than 7.5.

Percentile values are identified by the $key=pNN$ attribute in the data structure.

The cumulative distribution values, identified by $key=cNN$, will correspond to the same values as the percentiles for large lists, but may vary slightly for small lists. The reason for this is the definition of how the percentile is reported by the statistics module I use in my code (Perl's Statistics::Descriptive). It's generally safe to treat $cNN$ and $pNN$ as the same.

The weighted cumulative distribution, on the other hand, reports what fraction of speech is composed of words at or below a given word frequency. This is very different from the cumulative distribution (or percentile), which reports the fraction relative to vocabulary words. Consider the example sentence "no no no no no no no no yes yes". 20% (2/10 words) of this sentence is composed of words with a frequency $\le 2$ ($cw20 = 2$), but 50% of the unique words in this sentence (there are only two: "no" and "yes") have a frequency $\le 2$ ($c50 = 2$). Thus, think of the cumulative distribution as characterizing the set of unique words and the weighted cumulative distribution as characterizing the actual speech.

Another way to look at the weighted cumulative distribution is as follows. If the 90% weighted cumulative value is 15, for example, then if I take a transcript of a candidate's speech and pick a random word out of it, 90% of the time it will be a word used with a frequency of $\le 15$.
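The difference between the two distributions can be sketched in a few lines. The function below is a hypothetical illustration and not the Perl code used in the analysis. For a frequency threshold, it returns the fraction of distinct words (cumulative) and the fraction of all spoken words (weighted cumulative) with a frequency at or below that threshold, and it assumes a non-empty word list.

```python
from collections import Counter


def cumulative_fractions(words, threshold):
    """Return (fraction of vocabulary, fraction of speech) with freq <= threshold."""
    freq = Counter(words)
    selected = [n for n in freq.values() if n <= threshold]
    vocabulary_fraction = len(selected) / len(freq)       # each distinct word counts once
    speech_fraction = sum(selected) / sum(freq.values())  # each word weighted by its use
    return vocabulary_fraction, speech_fraction
```

For the sentence "no no no no no no no no yes yes" and a threshold of 2, this returns 0.5 and 0.2, matching the 50% and 20% in the example above.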

### Limitations of Analysis

The analysis presented here is fully automated: apart from the removal of transcriptionists' notes, there is no manual intervention in transcript processing. As a result, what may appear to be trivial errors can propagate due to limitations in parsing.

The accuracy of the part of speech tagger is good but it does categorize some words incorrectly.

In an interactive debate, where speakers cut each other off, some sentences are fragments and these are not characteristic of the type of sentences that would appear in typical speech.