Clustering with ZIP
Introduction | Method | Vizualization | Results | Graph
IntroductionImpatient? Go to interactive graph now. There is also a VRML version, without the interactive links. On the other hand, if you have time read the paper Language Trees and Zipping (Benedetto D. et al, Phys Rev Let, 2002 (88)). You can also read about common words and word pairs in the abstracts. Archivers, such as pkzip, can be used to compute information content in text. A derived method is applied to illustrate how the abstracts from the conference cluster based on information content. The method rests on the fact that modern archivers are very good at implicitly calculating information content in text. Typically, text with a lot of content is said to compress poorly and text with very little content is said to compress very well. For example, "aaaaaa" has less content than "abababab". When an archiver starts to compress text, it begins to learn patterns in the text. This pattern recognition makes the archiver more efficient at compressing text further along in the file. If the text is uniform in information, then the patterns the archiver has memorized can be applied to other parts of the text. However, if the information content is not uniform, or somehow radically changes, then the archiver must relearn its rules.
Consider the following text A="hello hello hello" and B="hello hello goodbye". When the archiver sees A, it memorizes "hello" and can apply this pattern to all the words. Therefore, the third "hello" is stored very efficiently. However, in the case of B, because the third word is unique, it cannot be stored as efficiently as the third word of A. This can be applied to try to pair text together. Consider A="hello hello" and B1="hello" and B2="goodbye". Here is how to find which of B1 or B2 are more similar to A, in the sense of the same information content. Take zip(A+B1) and zip(A) and then take the difference zip(A+B1)-zip(A). This difference will measure how efficiently the archiver has stored B1 using rules of A. Similarly, you can construct the same difference using A and B2. Because B1 contains information that is already found in A, it will be compressed more efficiently than B2. For this to be robust, typically the len(A) >> len(B1,B2). This method is applied to abstracts from the conference. It is not robust because the length of the string used to teach the archiver (first abstract) and the text string (second abstract) are similar. Ideally, you want the archiver to learn maximally from the first string and minimally from the second. MethodFor each abstract, the title and body are concatenated and all stop words are removed. L_diff(A,B) = L_AB-L_A where L_AB = ziplength(A+B)and L_A = ziplength(B)The function ziplength() gives the length of the zipped text. Abstracts are clustered in such a way that (X,Y) are chosen as pairs if min [ L_diff ] = L_diff(X,Y) In other words, the information content in B is most similar to the information content in A. Vizualization![]() The large graph below shows the pair-wise relationships between the abstracts. There were four posters presented by the GSC and the abstracts to these are GREEN (1 on diagram). Abstracts which are self-refering using this method (e.g. A pairs with B and B pairs with A) are in LIGHT BLUE . All other abstracts are in DARK BLUE . The edges are coloured by the strength of the match, which is calculated as -log(P) where P is the p-value of the compression efficiency of the second abstract. ResultsFor strong pairs, the results are very convincing. Keep in mind that this method of associating one text with another does not depend on any language or grammatical definitions. As long as all the texts are in the same representation (language, encoding, whatever) then the method will be applicable. Lonely nodes![]() Sometimes the self-referrers clump into lone pairs, like this one. Nothing else points to these nodes and they only point to themselves. Interestingly both are pretty much about the same thing: DEVELOPMENTS OF THE ENSEMBL PROJECT and THE AUTOMATIC ANNOTATION PIPELINE IN ENSEMBL. BellweathersIt is interesting to look at the diagram and find all the nodes with a high number of neighbours. Such abstracts have many others that pair with them and can be interpreted as subject-setters. Here are some abstracts which appear in a large number of abstract pairs. ![]() #66 HUMAN CHROMOSOME 14 : COMPLETE SEQUENCE AND ANNOTATION, #28 USING MOUSE-HUMAN COMPARISON TO IMPROVE GENE PREDICTION ![]() #20 TOWARDS A COMPLETE HUMAN GENE SET ![]() #170 A REGIONAL ASSESSMENT OF HUMAN GENOMIC COMPLEXITY: NOVEL PRIMATE-SPECIFIC TRANSCRIPTIONAL UNITS, ENDOGENOUS ANTISENSE, BIDIRECTIONAL PROMOTERS, AND EVOLVING GENE STRUCTURES AT 5q31 ![]() #258: UNIFIED APPROACH TO FINISHING MICROBIOL AND EUKARYOTIC GENOMES ![]() #24 IMPLEMENTATION OF HIGH THROUGHPUT SNP GENOTYPING AND CROSS-PLATFORM TECHNOLOGY COMPARISONS ![]() #168 PRODUCTION SEQUENCING MODEL AT UWGC IS EFFICIENT AND ADAPTABLE FOR SMALL TO MEDIUM SIZE GENOME CENTERS Probably more than anything, these abstracts contain a wide scope of vocabulary usage which is particularly suitable for teaching the archiver how to compress more text. GSC PostersDuane's poster TRANSPOSON-MEDIATED CDNA SEQUENCING AT THE BC CANCER AGENCY GENOME SEQUENCE CENTRE clusters with UNIFIED APPROACH TO FINISHING MICROBIOL AND EUKARYOTIC GENOMES, but not very well (-log(P) = 3). Steve's poster IDENTIFICATION OF GENES EXPRESSED IN EARLY-STAGE LUNG CANCERS clusters with USING THE DROSOPHILA GENE COLLECTION, WHOLE MOUNT IN SITU HYBRIDIZATION AND MICROARRAYS TO ASSAY GENE EXPRESSION DURING DROSOPHILA EMBRYOGENESIS with -log(P)=3. Ian's poster FINGERPRINTED BAC CLONE PHYSICAL MAPS clusters with my poster A SET OF REARRAYED BAC CLONES SPANNING THE HUMAN GENOME with -log(P)=5. None of these, however, illustrate some of the best clustering. Here are some great pairs that were fished out. Great PairsSELECTION OF SINGLE NUCLEOTIDE POLYMORPHISMS FOR A WHOLE-GENOME LINKAGE DISEQULIBRIUM MAPPING SET
THE C. ELEGANS ORFEOME CLONING PROJECT : VERSION 1.0 READY TO USE
RGD - MAPPING DISEASE ONTO THE GENOME
THE GENOME CHANNEL; ANNOTATION AND DISPLAY FROM MULTIPLE GENE MODELERS
LARGE SCALE VARIATION BETWEEN PRIMATE GENOMES DETERMINED BY ARRAY COMPARATIVE GENOMIC HYBRIDIZATION
|