I have always been interested in visualizing data in various ways. I recently had the idea to visualize language. There has been much effort in finding rules in language and pictures in words - such pictorial or quantitative representations help to understand the concepts in a field like linguistics.
Today I had an idea to try to take a fingerprint of a piece of text. The fingerprint would be unique for a given text, up to some arbitrary point which would depend on how the fingerprint was defined. I was initially hoping that different languages would have different fingerprints. For example, one could identify the language of a piece of text by looking at the fingerprint image. While it would probably be easier to figure out the language by looking at the text, pictures are a lot more fun and in the process give me an excuse to write more Perl.
It turned out that the idea I had of a fingerprint can be used broadly to distinguish languages, but also specifically to distinguish styles of writing within a language. Or at least, at this point, this is my idea!
There are probably lots of different ways to turn words into pictures. Hollywood does a pretty good job sometimes - often not. I wanted my fingerprint to be as unique as possible, without too much complexity. Something that would appeal to the eye and would be easy to assess. A two-dimensional fingerprint seemed like a good place to start - with natural extensions to N dimensions.
The fingerprint is based on the concept of letter-letter enumeration. English uses 26 letters and therefore there are 26*26 = 676 different letter-letter combinations, such as the "th" of "the" or the "no" of "genome".
If we have a piece of text with W words in it, and a total of C characters (only counting a-z without sensitivity to case), then there are C-W letter-letter neighbour pairs in the text. For example,
my dog is lazy (text)
[my] dog is lazy (pair 1)
my [do]g is lazy (pair 2)
my d[og] is lazy (pair 3)
my dog [is] lazy (pair 4)
my dog is [la]zy (pair 5)
my dog is l[az]y (pair 6)
my dog is la[zy] (pair 7)
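The pair counting can be sketched in a few lines. This is a minimal illustration in Python, not the original Perl program; the function name `letter_pairs` is hypothetical, and it assumes that any character outside a-z (spaces, punctuation, digits) ends a word.

```python
import re

def letter_pairs(text):
    """Return the adjacent letter pairs found inside each word of text.

    Assumption: case is ignored and any non a-z character separates words.
    """
    pairs = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # A word of n letters contributes n - 1 neighbour pairs
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs

text = "my dog is lazy"
pairs = letter_pairs(text)
words = re.findall(r"[a-z]+", text.lower())
characters = sum(len(word) for word in words)

print(pairs)                               # ['my', 'do', 'og', 'is', 'la', 'az', 'zy']
print(characters - len(words), len(pairs)) # 7 7
```

With C = 11 letters and W = 4 words, the sketch gives C-W = 7 pairs, matching the count above.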
In the case above there were only 7 pairs but you can imagine how a big book will give lots and lots of pairs. For example, Alice in Wonderland provides 80,000 such pairs with the most common being
first letter | second letter | frequency
h | e | 3779
t | h | 3484
i | n | 2026
e | r | 1822
a | n | 1607
o | u | 1556
i | t | 1325
n | d | 1270
a | t | 1167
r | e | 1150
h | a | 1148
n | g | 1140
On the other hand, some combinations never occur, such as "yk", "mg", or "rq". A 2D plot can be made with the first and second letters on the axes and the frequency of a particular combination encoded by colour. The figure here shows what this looks like, with examples of words that would contribute to the highlighted bin for the "in" combination.
The analysis did not treat a space as a character. The acceptable character set can be expanded from a-z to a-z + spaces, or a-z + spaces + punctuation, etc. Colour coding was done logarithmically to bring out structure in parts of the graph where letter-letter combinations are not commonly used. The top 5 combinations are coded in green and the next 5 are shown in yellow.
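The colour coding can be sketched as follows. This is an illustration in Python, not the original Perl/GD.pm code: the function name `fingerprint_colours`, the RGB values, the grey scale for the remaining bins, and the use of log(1 + n) as the logarithmic scale are all assumptions, since the page does not specify the exact palette or formula.

```python
import math
import re
from collections import Counter
from string import ascii_lowercase

GREEN = (0, 255, 0)     # top 5 pairs (assumed RGB value)
YELLOW = (255, 255, 0)  # next 5 pairs (assumed RGB value)

def fingerprint_colours(text):
    """Map each of the 26*26 letter pairs to an RGB colour."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))

    ranked = [pair for pair, _ in counts.most_common(10)]
    top5, next5 = set(ranked[:5]), set(ranked[5:10])
    max_log = math.log1p(max(counts.values(), default=0))

    grid = {}
    for first in ascii_lowercase:
        for second in ascii_lowercase:
            pair = first + second
            if pair in top5:
                grid[pair] = GREEN
            elif pair in next5:
                grid[pair] = YELLOW
            else:
                # log(1 + n) keeps rare pairs visible next to very common ones;
                # pairs that never occur map to black
                level = math.log1p(counts[pair]) / max_log if max_log else 0.0
                shade = round(255 * level)
                grid[pair] = (shade, shade, shade)
    return grid

grid = fingerprint_colours("the then them")
print(len(grid), grid["th"], grid["yk"])  # 676 (0, 255, 0) (0, 0, 0)
```

Each entry of `grid` corresponds to one bin of the 2D plot, with the first letter on one axis and the second on the other. Drawing the bins is left out to keep the sketch free of image libraries.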
Some of the characteristics of such a fingerprint are
Below are the results of analyzing six works by three different authors. The texts were downloaded from Project Gutenberg, a repository of etexts. An additional source, a volume of Edgar Allan Poe, is also examined. I think Poe is closest to Doyle's Sherlock Holmes, judging by the top right of the graph.
| Lewis Carroll | William Shakespeare | Arthur Conan Doyle |
| Alice in Wonderland | Hamlet | Adventures of Sherlock Holmes |
| Through the Looking Glass | Macbeth | The Hound of the Baskervilles |

| Edgar Allan Poe |
| Works of Edgar Allan Poe, Vol 2 |
The analysis program is written in Perl and uses calls to GD.pm to make the plots. The source is available for download.
The code was written quickly, not cleanly, and is about 200 lines.
You can use the graph script on any text file. For now, make sure your file has a .txt extension. Don't worry about formatting; the script cleans up the file. Just type
graph filename.txt
and you'll get some stats output and a filename.gif file created.