Clustering with ZIP


Introduction

Impatient? Go to the interactive graph now. There is also a VRML version, without the interactive links. If you have more time, read the paper Language Trees and Zipping (Benedetto D. et al., Phys Rev Lett, 2002 (88)). You can also read about common words and word pairs in the abstracts.

Archivers, such as pkzip, can be used to estimate the information content of text. A method derived from this idea is applied here to show how the abstracts from the conference cluster by information content.

The method rests on the fact that modern archivers are very good at implicitly calculating information content in text. Typically, text with a lot of content is said to compress poorly and text with very little content is said to compress very well. For example, "aaaaaa" has less content than "abababab".
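As a rough illustration of this claim, the sketch below compares compressed sizes of three strings of equal length. It assumes Python's zlib module (DEFLATE, the algorithm family used by pkzip) as a stand-in for the archiver, and the helper name ziplength() is borrowed from the Method section. Strings as short as "aaaaaa" are dominated by the fixed header and checksum bytes, so longer strings are used here.

```python
import random
import zlib


def ziplength(text: str) -> int:
    """Length in bytes of the DEFLATE-compressed text (zlib stands in for pkzip)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# Three strings of equal length but very different amounts of repetition.
rng = random.Random(0)  # fixed seed so the run is repeatable
repetitive = "a" * 1000
alternating = "ab" * 500
scrambled = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(1000))

for name, text in [("repetitive", repetitive),
                   ("alternating", alternating),
                   ("scrambled", scrambled)]:
    print(f"{name:12s} raw={len(text):5d} zipped={ziplength(text):5d}")
```

The two repetitive strings should shrink to a few dozen bytes, while the scrambled letters stay at several hundred bytes.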

When an archiver starts to compress text, it begins to learn patterns in the text. This pattern recognition makes the archiver more efficient at compressing text further along in the file. If the text is uniform in information, then the patterns the archiver has memorized can be applied to other parts of the text. However, if the information content is not uniform, or somehow radically changes, then the archiver must relearn its rules.

Consider the following text A="hello hello hello" and B="hello hello goodbye". When the archiver sees A, it memorizes "hello" and can apply this pattern to all the words. Therefore, the third "hello" is stored very efficiently. However, in the case of B, because the third word is unique, it cannot be stored as efficiently as the third word of A. This effect can be used to pair texts together.

Consider A="hello hello" and B1="hello" and B2="goodbye". Here is how to find which of B1 or B2 is more similar to A, in the sense of sharing the same information content. Compute zip(A+B1) and zip(A), where zip() stands for the length of the zipped text, and then take the difference zip(A+B1)-zip(A). This difference measures how efficiently the archiver has stored B1 using the rules learned from A. Similarly, you can construct the same difference using A and B2. Because B1 contains information that is already found in A, it will be compressed more efficiently than B2. For this to be robust, typically len(A) >> len(B1), len(B2).
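The comparison can be tried directly. This is a minimal sketch, again assuming zlib as the archiver. The names ziplength() and zip_diff() and the space used as a separator when concatenating are illustrative choices, not taken from the original implementation.

```python
import zlib


def ziplength(text: str) -> int:
    """Length in bytes of the zlib-compressed text (assumed stand-in for zip)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def zip_diff(a: str, b: str, sep: str = " ") -> int:
    """Extra compressed bytes needed to store b once the archiver has seen a."""
    return ziplength(a + sep + b) - ziplength(a)


A = "hello hello"
B1 = "hello"
B2 = "goodbye"

print("cost of B1 after A:", zip_diff(A, B1))
print("cost of B2 after A:", zip_diff(A, B2))
```

B1 should cost fewer extra bytes than B2. With strings this short the differences are only a handful of bytes, which is why the length condition above matters.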

This method is applied to abstracts from the conference. It is not robust here because the string used to teach the archiver (first abstract) and the text string (second abstract) are of similar length. Ideally, you want the archiver to learn maximally from the first string and minimally from the second.

Method

For each abstract, the title and body are concatenated and all stop words are removed.
Subsequently, for each abstract pair (A,B) the following quantity is calculated

L_diff(A,B) = L_AB-L_A

where

L_AB = ziplength(A+B)
and
L_A = ziplength(A)
The function ziplength() gives the length of the zipped text. Abstracts are clustered in such a way that (X,Y) are chosen as pairs if
min [ L_diff ] = L_diff(X,Y)

In other words, X and Y are paired when the information content in Y is most similar to the information content in X.
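The procedure above might be sketched as follows. Several things here are assumptions: zlib stands in for the archiver, the stop-word list is a tiny hypothetical one (the list actually used is not given), texts are joined with a space, and the minimum is taken separately for each abstract X over all other abstracts Y, which is one reading of the pairing rule. The four toy texts are made up for illustration.

```python
import zlib

# Hypothetical, deliberately tiny stop-word list; the real list is not given.
STOP_WORDS = {"a", "an", "and", "at", "for", "from", "in", "of", "on", "onto", "the", "to"}


def ziplength(text: str) -> int:
    """Length in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def prepare(title: str, body: str) -> str:
    """Concatenate title and body, lowercase, and drop stop words."""
    words = (title + " " + body).lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def l_diff(a: str, b: str) -> int:
    """L_diff(A,B) = ziplength(A+B) - ziplength(A)."""
    return ziplength(a + " " + b) - ziplength(a)


def best_partners(texts: dict[int, str]) -> dict[int, int]:
    """For each abstract X, the Y != X that minimises L_diff(X, Y) (assumed rule)."""
    return {
        x: min((y for y in texts if y != x), key=lambda y: l_diff(texts[x], texts[y]))
        for x in texts
    }


# Made-up toy "abstracts" keyed by made-up numbers.
toy = {
    1: prepare("High throughput SNP genotyping", "and cross platform technology comparisons"),
    2: prepare("Implementation of high throughput SNP genotyping", "and cross platform technology"),
    3: prepare("Rat genome database", "mapping disease onto the genome sequence of the rat"),
    4: prepare("The rat genome database", "from disease to qtl to the genome sequence of the rat"),
}
best = best_partners(toy)
print(best)
```

On these toy texts the two SNP abstracts should pick each other, as should the two rat abstracts. Note that a short Y always costs few bytes, so real abstracts of very different lengths would bias the minimum.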

Visualization

The large graph below shows the pair-wise relationships between the abstracts. There were four posters presented by the GSC and the abstracts for these are GREEN (1 on diagram). Abstracts which are self-referring using this method (e.g. A pairs with B and B pairs with A) are in LIGHT BLUE. All other abstracts are in DARK BLUE. The edges are coloured by the strength of the match, which is calculated as -log(P) where P is the p-value of the compression efficiency of the second abstract.
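The node colouring can be expressed compactly. The sketch below assumes the pairing is available as a dictionary mapping each abstract number to the number of its chosen partner, and that the green GSC colour takes precedence over the self-referring colour. Both the function name and that precedence are assumptions. Edge strength is left out because the text does not say how P is computed.

```python
def node_colours(best: dict[int, int], gsc: set[int]) -> dict[int, str]:
    """Assign a colour to every abstract from its pairing (illustrative only)."""
    colours = {}
    for x, y in best.items():
        if x in gsc:
            colours[x] = "green"        # poster presented by the GSC
        elif best.get(y) == x:
            colours[x] = "light blue"   # self-referring: X -> Y and Y -> X
        else:
            colours[x] = "dark blue"    # everything else
    return colours


# Made-up pairing: 1 and 2 choose each other, 3 chooses 2, 4 chooses 3.
example = node_colours({1: 2, 2: 1, 3: 2, 4: 3}, gsc={4})
print(example)
```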

Results

For strong pairs, the results are very convincing. Keep in mind that this method of associating one text with another does not depend on any language or grammatical definitions. As long as all the texts are in the same representation (language, encoding, whatever) then the method will be applicable.

Lonely nodes

Sometimes the self-referrers clump into lone pairs, like this one. Nothing else points to these nodes and they only point to themselves. Interestingly both are pretty much about the same thing: DEVELOPMENTS OF THE ENSEMBL PROJECT and THE AUTOMATIC ANNOTATION PIPELINE IN ENSEMBL.
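Lone pairs of this kind can be picked out mechanically. A minimal sketch, assuming the pairing is held as a dictionary from each abstract number to its chosen partner (the data layout and function name are hypothetical):

```python
from collections import Counter


def lone_pairs(best: dict[int, int]) -> list[tuple[int, int]]:
    """Mutual pairs whose members are chosen by nobody except each other."""
    indegree = Counter(best.values())  # how many abstracts point at each node
    pairs = set()
    for x, y in best.items():
        if best.get(y) == x and indegree[x] == 1 and indegree[y] == 1:
            pairs.add((min(x, y), max(x, y)))
    return sorted(pairs)


# Made-up pairing: (1, 2) is a lone pair; (3, 4) is mutual but 5 also points at 4.
print(lone_pairs({1: 2, 2: 1, 3: 4, 4: 3, 5: 4}))
```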

Bellwethers

It is interesting to look at the diagram and find all the nodes with a high number of neighbours. Such abstracts have many others that pair with them and can be interpreted as subject-setters. Here are some abstracts which appear in a large number of abstract pairs.


#66 HUMAN CHROMOSOME 14 : COMPLETE SEQUENCE AND ANNOTATION
#28 USING MOUSE-HUMAN COMPARISON TO IMPROVE GENE PREDICTION
#20 TOWARDS A COMPLETE HUMAN GENE SET
#170 A REGIONAL ASSESSMENT OF HUMAN GENOMIC COMPLEXITY: NOVEL PRIMATE-SPECIFIC TRANSCRIPTIONAL UNITS, ENDOGENOUS ANTISENSE, BIDIRECTIONAL PROMOTERS, AND EVOLVING GENE STRUCTURES AT 5q31
#258 UNIFIED APPROACH TO FINISHING MICROBIOL AND EUKARYOTIC GENOMES
#24 IMPLEMENTATION OF HIGH THROUGHPUT SNP GENOTYPING AND CROSS-PLATFORM TECHNOLOGY COMPARISONS
#168 PRODUCTION SEQUENCING MODEL AT UWGC IS EFFICIENT AND ADAPTABLE FOR SMALL TO MEDIUM SIZE GENOME CENTERS

Probably more than anything, these abstracts contain a wide scope of vocabulary usage which is particularly suitable for teaching the archiver how to compress more text.
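Finding such highly connected nodes amounts to counting how often each abstract is chosen as a partner. A small sketch, assuming the pairing is stored as a dictionary from each abstract number to its chosen partner (the function name and the toy numbers are made up):

```python
from collections import Counter


def bellwethers(best: dict[int, int], top: int = 3) -> list[tuple[int, int]]:
    """Abstracts chosen most often as a partner, as (abstract, count) tuples."""
    return Counter(best.values()).most_common(top)


# Made-up pairing in which abstract 3 is chosen three times and abstract 1 twice.
pairing = {1: 3, 2: 3, 4: 3, 3: 1, 5: 1, 6: 5}
print(bellwethers(pairing, top=2))
```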

GSC Posters

Duane's poster TRANSPOSON-MEDIATED CDNA SEQUENCING AT THE BC CANCER AGENCY GENOME SEQUENCE CENTRE clusters with UNIFIED APPROACH TO FINISHING MICROBIOL AND EUKARYOTIC GENOMES, but not very well (-log(P) = 3).

Steve's poster IDENTIFICATION OF GENES EXPRESSED IN EARLY-STAGE LUNG CANCERS clusters with USING THE DROSOPHILA GENE COLLECTION, WHOLE MOUNT IN SITU HYBRIDIZATION AND MICROARRAYS TO ASSAY GENE EXPRESSION DURING DROSOPHILA EMBRYOGENESIS with -log(P)=3.

Ian's poster FINGERPRINTED BAC CLONE PHYSICAL MAPS clusters with my poster A SET OF REARRAYED BAC CLONES SPANNING THE HUMAN GENOME with -log(P)=5.

None of these, however, shows the best clustering. Here are some great pairs that were fished out.

Great Pairs

SELECTION OF SINGLE NUCLEOTIDE POLYMORPHISMS FOR A WHOLE-GENOME LINKAGE DISEQULIBRIUM MAPPING SET
IMPLEMENTATION OF HIGH THROUGHPUT SNP GENOTYPING AND CROSS-PLATFORM TECHNOLOGY COMPARISONS

THE C. ELEGANS ORFEOME CLONING PROJECT : VERSION 1.0 READY TO USE
THE C. ELEGANS INTERACTOME PROJECT

RGD - MAPPING DISEASE ONTO THE GENOME
RAT GENOME DATABASE - DISEASE TO QTL TO SEQUENCE

THE GENOME CHANNEL; ANNOTATION AND DISPLAY FROM MULTIPLE GENE MODELERS
THE ORNL GRAILEXP GENE FINDER AND GENOME ANALYSIS TOOLKIT

LARGE SCALE VARIATION BETWEEN PRIMATE GENOMES DETERMINED BY ARRAY COMPARATIVE GENOMIC HYBRIDIZATION
COORDINATED GENE DUPLICATION, ADAPTIVE SELECTION AND GENOMIC REARRANGEMENTS IN HUMANS AND GREAT APES.

Graph

This is a large graph (2830 x 2886 pixels). Scroll around and have fun. Click on a node to go to the abstract.
[Graph image: one node per abstract, each labelled with the abstract number and its title words.]