Martin Krzywinski / Genome Sciences Center

kmer: bite-sized

See you at Shonan Meeting 167 — Formalizing Biomedical Visualization

genomes + non-existence

Dark Matter of the Genome—sequences that are not there

Failing to fetch me at first keep encouraged,
Missing me one place search another,
I stop somewhere waiting for you.
—Walt Whitman, Song of Myself

One of the challenges in visualization is cueing what data is not being displayed, which is a common requirement when transitioning from overview to details, such as by zooming or filtering.

An interesting take on this is the enumeration and analysis of data that isn't merely not displayed but actually not there.

You can download the frequencies of all `k`-mers in the human genome for `k=1...12`.

I had an idea to try this and Michael Schatz pointed out an efficient algorithm [1] to find such missing sequences, or "nullomers" or, more generally, "unwords".

[1] Julia Herold, Stefan Kurtz and Robert Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) 9:167

I wanted to create some phylogenetic trees based on these nullomers but discovered that this has also been done [2]. It turns out that the nullomers tell us something about the sequence without being visible. The genome's dark matter?

[2] Davide Vergni, Daniele Santoni. Nullomers and High Order Nullomers in Genomic Sequences. (2016) PLoS ONE 11(12): e0164540.

At some point I think I'll make some art around this concept.

sequences that don't exist

Because the human genome sequence is thankfully finite, there are some sequences that do not appear in it. These are (vaguely creepily) coined nullomers.

For example, the hg38 human reference assembly (Dec 2013) has 3,049,315,783 bases across its 455 assembled pieces, which include the 24 chromosomes (chr1–chr22, chrX, chrY), the mitochondrial chromosome (chrM), and alternative and unanchored scaffolds (e.g. chr1_GL383520v2_alt and chr1_KI270706v1_random.fa). This number excludes all the padding in the sequence, which is indicated by the base "n".

Not surprisingly, all possible `k`-mers (sequences of length `k`) for `k=2` exist. There are 3,018,334,391 of them and they break down in frequency as follows.

gc  130065644 4.31%
ac  153830681 5.10%
gt  154194068 5.11%
cc  158048073 5.24%
gg  159302235 5.28%
tc  181782675 6.02%
ga  182772932 6.06%
ta  197567087 6.55%
ct  213517855 7.07%
ag  213785914 7.08%
ca  221181041 7.33%
tg  222266728 7.36%
at  233904527 7.75%
aa  296763858 9.83%
tt  299351073 9.92%
   ---------- -----
   3018334391  100%

The least common 2-mer is "gc", which isn't surprising since the GC content of the human genome is about 41%, and both "g" and "c" appear in essentially equal measure.

c  623727342 20.45%
g  626335137 20.54%
a  898285419 29.46%
t  900967885 29.55%
  ---------- ------
  3049315783   100%
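
The 2-mer and single-base tallies above can be reproduced with a sliding window over the sequence. The sketch below is a minimal illustration and not necessarily how the counts on this page were generated: the function names are made up for the example, the input is assumed to contain only a, c, g, t and n, and any window that touches the padding base "n" is skipped. A full genome would need to be processed one assembled piece at a time, and likely with a faster tool.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer in seq, skipping windows that contain padding ('n')."""
    seq = seq.lower()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "n" not in kmer:
            counts[kmer] += 1
    return counts

def frequency_table(counts):
    """Return (kmer, count, percent) rows sorted from rarest to most common."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return [(kmer, n, 100.0 * n / total) for kmer, n in ordered]

# Toy sequence (hypothetical), with a short run of padding in the middle.
toy = "acgtnnacgtacgg"
for kmer, n, pct in frequency_table(count_kmers(toy, 2)):
    print(f"{kmer} {n:10d} {pct:5.2f}%")
```

Calling `count_kmers(toy, 1)` gives the single-base tally in the same way.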

You can now ask "What is the smallest value of `k` for which we cannot find a `k`-mer in the human reference sequence?"

all `k`-mers up to `k=10` exist

It turns out that all possible sequences up to length 10 exist in the reference.

Depending on your familiarity with the genome and depth of intuition about statistics, this might be not at all surprising, mildly surprising or just plain confusing.
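
One way to build that intuition is a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the four bases are equally likely and independent, which the real genome is not. It also ignores the breaks between the 455 assembled pieces. The only number taken from this page is the base count of hg38.

```python
# Non-"n" bases in hg38, as quoted in the text above.
GENOME_BASES = 3_049_315_783

def expected_count(k, n_bases=GENOME_BASES):
    """Expected occurrences of one specific k-mer under a uniform, independent model."""
    windows = n_bases - k + 1  # treats the genome as one contiguous sequence
    return windows / 4 ** k

for k in (8, 10, 11, 12):
    print(f"k={k:2d}  possible k-mers={4 ** k:10d}  expected count={expected_count(k):10.1f}")
```

Under this simplified model each 10-mer would be expected roughly 2,900 times, so finding all of them is plausible. How far individual counts stray from that expectation is a measure of how non-uniform the genome is.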

For example, if we consider `k=8`, there are `4^8 = 65536` such sequences. Alphabetically, they run from "aaaaaaaa" to "tttttttt".

All these exist. The 8-mer "cgtacgcg" is the rarest and "tttttttt" the most common.

cgtacgcg 304 0.00000997%
cgcgtacg 315 0.00001033%
cgtcgacg 322 0.00001056%
tcgacgcg 333 0.00001092%
cgatcgac 356 0.00001167%
tgtgtgtg 899632 0.02950286%
tatatata 925963 0.03036637%
atatatat 967263 0.03172078%
aaaaaaaa 4743237 0.15555144%
tttttttt 4773492 0.15654364%

If we repeat this all the way up to `k=10`, we find that every possible sequence exists, including all `4^{10}=1,048,576` 10-mers.

cgacgatcga 2 0.00000007%
tcgcgacgta 2 0.00000007%
cgtaacgcgc 2 0.00000007%
tcgacgtacg 2 0.00000007%
tattcgcgcg 2 0.00000007%
cacacacaca 602118 0.01974610%
gtgtgtgtgt 604553 0.01982595%
tgtgtgtgtg 608965 0.01997064%
aaaaaaaaaa 3085421 0.10118453%
tttttttttt 3105565 0.10184514%

The least common 10-mer is "cgacgatcga", which appears only twice. In fact, the five 10-mers shown above are the only ones that appear only twice. No 10-mer appears once.
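
Repeating the check for increasing `k`, as described above, amounts to a simple loop: collect the distinct `k`-mers and compare their number with `4^k`. The sketch below is a naive illustration on a toy sequence, with hypothetical function names. It holds every distinct `k`-mer in memory and is not the efficient algorithm of [1].

```python
BASES = set("acgt")

def distinct_kmers(seq, k):
    """Set of distinct k-mers in seq that consist only of a, c, g and t."""
    seq = seq.lower()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {w for w in windows if set(w) <= BASES}

def smallest_incomplete_k(seq, max_k=12):
    """Smallest k for which at least one k-mer is absent, or None if none up to max_k."""
    for k in range(1, max_k + 1):
        if len(distinct_kmers(seq, k)) < 4 ** k:
            return k
    return None

# Toy sequence (hypothetical) that contains all 16 possible 2-mers.
toy = "aacagatccgctggtta"
print(smallest_incomplete_k(toy))
```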

Once we look at 11-mers, we finally see a sequence that doesn't appear. There are `4^{11} = 4,194,304` 11-mers of which 2,728 (0.07%) appear once and 4,705 (0.11%) appear twice. The most common ones are

ctgggattaca 409261 0.01342148%
tgtaatcccag 409332 0.01342380%
ctgtaatccca 420484 0.01378953%
tgggattacag 420548 0.01379163%
cacacacacac 512570 0.01680943%
gtgtgtgtgtg 517147 0.01695953%
acacacacaca 549346 0.01801548%
tgtgtgtgtgt 555966 0.01823258%
aaaaaaaaaaa 2563770 0.08407734%
ttttttttttt 2580640 0.08463058%

once you take out what's not there, you're left with what's there

Interestingly (maybe), 941 11-mers don't appear in the genome. I've listed them alphabetically below. The first one, aaccgacgcgt, is the most interesting, as it is the alphabetically first sequence that does not appear in the genome.
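
Listing the missing sequences in alphabetical order is a set difference between all `4^k` possible words and the words actually seen. A minimal sketch on a toy sequence follows, with a hypothetical function name. For a real genome the set of observed `k`-mers would have to be built by streaming over the sequence rather than from a single in-memory string.

```python
from itertools import product

def nullomers(seq, k):
    """All k-mers over a, c, g, t that are absent from seq, in alphabetical order."""
    seq = seq.lower()
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # product() over a sorted alphabet yields the words in alphabetical order.
    candidates = ("".join(p) for p in product("acgt", repeat=k))
    return [w for w in candidates if w not in present]

# Toy sequence (hypothetical); the real input would be the reference genome.
missing = nullomers("aacagatccgctggtta", 3)
print(len(missing), missing[:5])
```

Windows that contain "n" end up in `present` but can never match a candidate word, so they do not affect the result.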

If you look at the statistics of all the bases of these missing 11-mers, you find that they're the inverse of what is there.

  # 11-mer nullomers      # bases in hg38
c 3166 30.59%             623727342 20.45%
g 3182 30.74%             626335137 20.54%
a 1986 19.19%             898285419 29.46%
t 2017 19.49%             900967885 29.55%

For example, since 41% of the bases in the genome are "g" or "c", each of "g" and "c" makes up about `0.305 \approx (1-0.41)/2` of the bases in the missing 11-mers, and similarly each of "a" and "t" makes up about `0.19 \approx (1-0.59)/2`.
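
The arithmetic behind this comparison can be checked directly from the two columns of the table. In the sketch below the counts are copied from the table above, while the helper `base_composition` is a hypothetical name showing how the same tally could be made from a list of nullomers.

```python
from collections import Counter

def base_composition(words):
    """Count and percentage of each base across a list of sequences."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {b: (counts[b], 100.0 * counts[b] / total) for b in "acgt"}

# Base counts copied from the table above.
nullomer_bases = {"c": 3166, "g": 3182, "a": 1986, "t": 2017}
genome_bases = {"c": 623727342, "g": 626335137, "a": 898285419, "t": 900967885}

def gc_fraction(counts):
    """Fraction of bases that are g or c."""
    return (counts["g"] + counts["c"]) / sum(counts.values())

# 941 missing 11-mers contribute 941 * 11 bases in total.
assert sum(nullomer_bases.values()) == 941 * 11

print(f"GC fraction of nullomer bases: {gc_fraction(nullomer_bases):.3f}")
print(f"AT fraction of genome bases:   {1 - gc_fraction(genome_bases):.3f}")
```

The two printed fractions are close to each other, which is the "inverse" relationship described above.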

# all 11-mers sorted alphabetically that do not appear 
# in the human genome reference sequence (hg38 Dec 2013)
aaccgacgcgt aaccgcgcgta aaccgcgtacg aaccggttcgc aacgcgatacg aacgcgatcgg aacgcgcgaat aacgcggtcga aacgcgtagcg aacgcgtatcg aacggtcgtcg aacgttgcgcg aagcgtacgcg aagtcgcgcga aatacgcgcga aatacgtcgcg aatcgcgtacg aatcgcgtcga aatcgcgttcg aatcggttgcg aatcgtcgacg aatgtcgcgcg aattcgcgacg aattcgcgcga aattgtcgcga acacgcgtcga acccgtcgcga accgatacgcg accgatcgacg accgatcgtcg accgccgatcg accgcgacgaa accgcgcgatt accgcgtacgt accggtcggac accgtcgatcg accgtcgcgat accgtcgcgta accgttcgtcg acgaaacgtcg acgaatcgacg acgaccgcgta acgaccgttcg acgacgacgta acgacgccgta acgacgcgcta acgacgcgtat acgacggtacg acgatacgcga acgatacgcgg acgatacggcg acgatcggcga acgatcgtcgg acgatcgtgcg acgattcggcg acgcaacgtcg acgccggtcga acgcgacgatt acgcgacgcta acgcgacggta acgcgacgtaa acgcgataacg acgcgatacgt acgcgcgatac acgcgcgatat acgcgcgatta acgcggcgtaa acgcggtacga acgcggtacgt acgcgtaacga acgcgtaacgt acgcgtaccgg acgcgtacgat acgcgtcgtac acgctagtcga acgctcgaacg acggcgactaa acgggtcgacg acggtacgccg acggtacgtcg acggtcgagcg acggtcgcgta acgtaacgcgc acgtaccgtcg acgtacgaacg acgtacgatcg acgtagcgcgt acgtatcgccg acgtcgaccga acgtcgacgct acgtcgacgta acgtcgagtcg acgtcgcgata acgtcgcgcgt acgtcgcgtac acgtcggtcga acgtcgtacgc acgttaacgcg actaatcgcga actacgcgtcg acttcgcgcgt agcgtcgtacg agcgttcgacg agtacgtcgcg agtatcgcgac agtcgatcgcg agtcggtcgat agttacgcgcg ataccgcgtcg atacgaacgcg atacgacgcga atacgcgatcg atacgcgccga atacgcgcgac atacgcgcgag atacgcgttcg atacggcgcgt atacttcgccg atacttcgcgc atagactcgcg atatcgacgcg atatcgcgcgg atatcgcgcgt atatcgtcgcg atccgtcgcga atcgacgtgcg atcgcgaaacg atcgcgacgta atcgcgattcg atcgcgcgaag atcgcgcgtat atcgcgcgtcg atcgcggtcgt atcgcgtaacg atcgcgtaccg atcgcgtacgt atcgcgttccg atcggcgcgta atcgggtcgac atcggtacgcg atcggttgcga atcgtaccgcg atcgtccgacg atcgtcgaacg atcgtcgacga atcgtcgacgc atcgtcgatcg atcgtcgcgta atcgtctcgcg atgtatcgcgc atgtcgcgcga atgtcgtcgcg attcgacggcg attcgcgatcg attcgcgcgat attcgcggcgc attgtacgcgg atttcgacgcg atttcgcgacg atttcgcgcga atttcgtcgcg caacgcgctac cacgtcgcgaa cagtccgatcg catacgcgtcg catatcgcgcg 
ccgaatacgcg ccgaccgtacg ccgacgatcga ccgacgatcgt ccgacgcgata ccgacgcgatt ccgacgtaacg ccgacgtacga ccgatacgcga ccgatacgtcg ccgatagtcgg ccgatcgtccg ccgattacgcg ccgcgaatcga ccgcgacgtaa ccgcgataacg ccgcgatacga ccgcgcgatat ccgcgtaatcg ccgcgtcgaaa ccgcttaagcg ccggcgtaacg ccgtataacgg ccgtcgaaacg ccgtcgaacgc ccgtcgaccgg ccgtcgatcga ccgtcgccgaa ccgtcgcgtag ccgtcgcgtat ccgttacgtcg ccgttcgcacg ccgttgcgtcg cgaacggtcgc cgaacggtcgt cgaacggttcg cgaacgtacga cgaactaacgg cgaatacgcgc cgaataggcgg cgaatcgacga cgaatcgacgg cgaatcgagcg cgaatcgcgta cgaatcgtcga cgacagtacgg cgacatacgcg cgaccgactat cgaccgatacg cgaccgctata cgaccgtcgac cgaccgtcgat cgaccgttaac cgaccgttacg cgacgaacgag cgacgaacggt cgacgaccgta cgacgacgtat cgacgatacgg cgacgatcgaa cgacgatcgat cgacgatcggc cgacgattcgc cgacgcaacga cgacgcgacaa cgacgcgacat cgacgcgataa cgacgcgatac cgacgcgtaaa cgacgcgtata cgacgcgtcaa cgacgcgttac cgacgcgttta cgacgctatcg cgacggacgta cgacggatacg cgacggcgtat cgacgtaacgc cgacgtaacgg cgacgtaccga cgacgtaccgt cgacgtacgac cgacgtacgat cgacgtacggg cgacgtactcg cgacgtatcgg cgactacgcga cgactagtccg cgactatcgcg cgagcgtatcg cgagtcgaccg cgataccgcgt cgataccgtcg cgatacgaccg cgatacgcgag cgatacgcgat cgatacgcgga cgatacggcgt cgatacgttcg cgatagacggt cgatagcgcgt cgatagtcgga cgatagtcggt cgatatacgcg cgatatgcgcg cgatccgatac cgatccgtacg cgatcgaccga cgatcgaccgt cgatcgacgtc cgatcgcgaac cgatcggcgta cgatcggtcgt cgatcgtacgc cgatcgtacgg cgatcgtcggg cgatcgtcgta cgatcgtgcga cgatgcgcgac cgattacgcga cgattacgcgc cgattacgcgt cgattatcgcg cgattcgacgt cgattcggcga cgattgaccgc cgcaacggtac cgcaatagtcg cgcatatcgcg cgcattcgtcg cgccgaaacga cgccgatacga cgccgatcgaa cgccgatcgta cgccgtaatcg cgccgtacgta cgcgaaacgat cgcgaacgatt cgcgaacgtta cgcgaatcgaa cgcgaattcgt cgcgaccgaat cgcgaccgtaa cgcgacgaact cgcgacgatta cgcgacgcata cgcgacgctaa cgcgacgctat cgcgacggtta cgcgacgtaac cgcgacgttaa cgcgacttacg cgcgagcgata cgcgagcgtag cgcgataacga cgcgataacgc cgcgataattg cgcgatacacg cgcgatcaacg cgcgatcggta cgcgattcgat cgcgattgtcg cgcgcacgata cgcgcataaga cgcgcataata cgcgcgatatg 
cgcgcgatatt cgcgcgtatta cgcgcgttata cgcgctatacg cgcgctatccg cgcggtacgaa cgcggtacgta cgcggtatacg cgcggtcgatt cgcggttcgtt cgcgtaacgcg cgcgtaatacg cgcgtaatcga cgcgtaatcgt cgcgtaccgga cgcgtaccgtt cgcgtatcgcg cgcgtatcggt cgcgtatcgtt cgcgtattcgg cgcgtcaaacg cgcgtcgagac cgcgtcgcaat cgcgtcggatg cgcgtcgtact cgcgtcgttat cgcgttacgcg cgcgttatacg cgcgttattcg cgcgtttcgta cgctaacgtcg cgctcgacgaa cgctcgacgta cgctcgcgtat cgctcgtaacg cgctcgtacga cgctcgtatcg cgcttacgcga cggaccatacg cggactatcga cggagcgtacg cggatcgacga cggcgaacgta cggcgatctaa cggcgcgatag cggcgtaccga cggcgttaacg cggcttacgcg cggtaatcgcg cggtacgatcg cggtacgccga cggtatacggg cggtccgcata cggtcgaccgt cggtcgataac cggtcgattcg cggtcgcacga cggtcgcgtaa cggtcggtacg cggtcgtacga cggttacgtcg cggttcgatcg cggttcgtacg cgtaacgagcg cgtaacgccgt cgtaacgcgca cgtaacgcgct cgtaacgcgtt cgtaacgctcg cgtaacgtccg cgtaacgttcg cgtaccggtct cgtacgaacgc cgtacgaatcg cgtacgaccgt cgtacgacgaa cgtacgacgct cgtacgatacg cgtacgatcgg cgtacgatcgt cgtacgatgcg cgtacgccgtt cgtacgcgaat cgtacgcgact cgtacgcgata cgtacgcgatc cgtacggacgc cgtacggcggt cgtacggtcga cgtacggtcgt cgtacgtcgcg cgtacgtcggc cgtacgtgacg cgtatacgacg cgtatacgcga cgtatagcgcg cgtatatcggc cgtatcgcgtc cgtatcggtcg cgtatgtcgcg cgtattacgcg cgtattcgacg cgtattcgcgc cgtattgcgcg cgtcaatcgcg cgtcacgcgta cgtccgatcgt cgtccgtcgaa cgtcgaacgat cgtcgaattcg cgtcgaccggt cgtcgaccgtc cgtcgacgatc cgtcgacgcgt cgtcgacgctt cgtcgacgtac cgtcgactacg cgtcgactatc cgtcgagacgt cgtcgagcatc cgtcgataggc cgtcgattcga cgtcgcgaagc cgtcgcgacta cgtcgcgatac cgtcgcgatag cgtcgcgtagg cgtcgcgtatg cgtcgcgtatt cgtcgctcgaa cgtcggaatcg cgtcggtacga cgtcggtcgac cgtcggtcgat cgtcggttacg cgtcgtaccgg cgtcgttatac cgtcgttcgac cgtctcgtacg cgtgcgaacta cgtgtcgaacg cgtgtcgacga cgttaaacgcg cgttacgacga cgttacgcgtc cgttacggcgt cgttacggtcg cgttacgtcgt cgttagtcgcg cgttccgtcga cgttcgacgaa cgttcgacgcg cgttcgacggc cgttcgacggg cgttcgatcgt cgttcgcgaaa cgttcgcggaa cgttcgcggta cgttcgcgtat cgttcgctagg cgttgcgatcg cgttgcgtcgt cgtttacgcga cgtttcgaccg cgtttcgcgaa cgtttcgtcga 
ctaccgccgta ctacgaacggt ctacgcgcgaa ctacgcgcgac ctacgcgcgta ctacgcgtcga ctacgtcgacg ctactcgatcg ctagacgtacg ctataccgcgc ctatacgtccg ctatcgcgtcg ctatcgtcgcg ctattcgcgcg ctcgaccgtcg ctcgacgcgta ctcgacgtacg ctcgatcgacg ctcgatcgtcg ctcgcgacgta ctcgcgcggta ctcgtcgagta ctcgttcgtcg cttcgcgaacg gaccgatacgc gacgacgatcg gacgcgataat gacgcgattcg gacgcgcatag gacgcgcgtat gacgcgtaacg gacgtaacgcg gacgtacggcg gacgtacgtcg gacgtcgaacg gatacgtcgcg gatacttcgcg gatagcgcgtt gatagtcgacg gatcgcgaacg gatcgcgcgta gatcggtccga gatcggtcgta gatcgtcggtc gcccgttcgta gccgatcgttg gcgaacgcgta gcgaattacgc gcgacgatacg gcgacgatcga gcgacgcgata gcgacgcgtta gcgacgtaacg gcgattgacga gcgcatatcgt gcgccggtata gcgcgaatcga gcgcgacgata gcgcgacgtta gcgcgtaccga gcgcgtacgat gcgcgttcgat gcggtacgcgt gcggtcgacga gcggtcgtacg gcgtaacgcgc gcgtacaacga gcgtacgtcgc gcgtcgacaat gcgtcgattcg gcgtcgattgt gcgtcgtacgc gcgttacgcgt gcgttcgacgg gcgttcgcgta gctacgaaccg gctatacgcgt gctatcgcgcg gctcgtatcgt ggcgcgctata ggcgtcgatcg ggcgtcgcaat ggtacgcgacc ggtacgcgtaa ggtataccgcg ggtcgacgcga ggtcgattcgc ggtcgattgcg ggttccgtacg gtaatcgacga gtaccggcgta gtacgaggtcg gtacgccggtt gtacgcgaccg gtacgcgagta gtacgcgcgta gtacgcgcgtt gtacgctcgac gtacggcgatc gtacggcgtac gtacggtcgcg gtacgttcgcg gtagcggtacg gtataacgcgg gtatagcgacg gtatccgatcg gtatcgaaccg gtatcgacgcg gtatcgcgcga gtatcgcgtcg gtatcgtcgcg gtatcgttgcg gtatgccgcga gtatgcgaacg gtattatcgcg gtattcgacgc gtccgacgcga gtccgacgtcg gtccgagcgta gtccgatcgcg gtccgttacgc gtcgaaccgac gtcgaacgacg gtcgaacgcga gtcgacgacta gtcgacgatcg gtcgacgcgaa gtcgacgcgac gtcgacgtacg gtcgatcggta gtcgattcgag gtcgcaattcg gtcgcacgcga gtcgcccgata gtcgcgacaat gtcgcgacgta gtcgcgcataa gtcgcgccgta gtcgcgcgtaa gtcgcggataa gtcgcgtcgca gtcgcgttacg gtcgctatcgt gtcgtaacgcg gtcgtacgcga gtcgttacgcg gttatgcgcga gttattcgcgc gttcgaacgcg gttcgatcgga gttcgtacgcg gttcgtagcga taaccgtcgac taacgcgatcg taacgcgccga taacgcgcgat taacgcgtcga taacgtcgcga taacgtcgcgc taagtcgcgcg taatcgcgtcg taatcgtcgac taattcgacgc taattcgcgcg tacccgcggtt taccgcgcgat 
taccggtcgta taccgtacgcg taccgttcgcg tacgacgcgat tacgacgtacg tacgagctcgt tacgatcgacg tacgatcgtcg tacgcacgcga tacgccgacgc tacgcgacacg tacgcgaccga tacgcgacgag tacgcgacgca tacgcgatcga tacgcgatcgc tacgcgatgcg tacgcgattcg tacgcgcgaac tacgcgcgaca tacgcgcgagc tacgcgcgata tacgcgcgatt tacgcgcggtc tacgcgcgtaa tacgcggtacg tacgcgtaacc tacgcgtacga tacgcgtatcg tacgcgtcacg tacgcgtcgac tacgctacggc tacgctcggac tacggcgtacg tacgggcgacg tacgggcgtcg tacggtcgcga tacgtacgcga tacgtccgacg tacgtccgtcg tacgtcgagcg tacgtcgatcg tacgtcgcgca tacgtcgcgct tacgtcgcgta tacgtcgctcg tacgtcggtcg tacgtcgtacg tacgtcgtcgc tacgttacgcg tacgttcgacg tactacgcgcg tactatcgacg tactcgcgacg tagaccgacgc tagccggtacg tagccgttcga tagcgcgaatc tagcgcgacgt tagcgcgtcga tagcgtaccga tagctcgacga taggcgaaccg taggcgcgtaa tagtcgacgca tagttacgcgc tatacgcgaac tatacgcgcga tatacgcgtcg tatatgtcgcg tatccgatcgc tatccgcgcgt tatccggatcg tatccgtcgca tatcgccgact tatcgcgcgca tatcgcgcgct tatcgcggtcg tatcgcgtcga tatcgcgtcgt tatcgcgttcg tatcgctcgac tatcggcgatc tatcgtcgacg tatcgtcgccg tatcgttcgcg tatgcgaccgc tatgcgcgacg tatgcgcgcga tatgcgtcgcg tatggcgcgcc tatgtcgacgc tatgtcgcgat tatgtttcgcg tattacgcgcg tattatgcgcg tattcgacgcg tattcgcgacg tattcgcgcga tattcgcgcgg tattcgcggcg tattcgcgtcg tattcggcgcg tcatatcgcgc tccgatagtcg tccgcgactta tccgcggtacg tccgtaacgcg tccgtacgcga tccgtacgcgg tccgtcgaccg tccgtcgatcg tcgaacgatcg tcgaatacgcg tcgaccgcgta tcgaccgtcga tcgaccgtcgg tcgaccgttcg tcgacgaacga tcgacgaccga tcgacgaccgt tcgacgagtcg tcgacgatcgc tcgacgcgata tcgacgcggtt tcgacgcgtag tcgacggtacg tcgacggtatg tcgacgtaccg tcgacgtacga tcgacgtacgg tcgacgttacg tcgactaagcg tcgactagcgg tcgactcgacg tcgagccgacg tcgagcgcgta tcgagcgtacg tcgatacgccg tcgatacgcgg tcgatcgacgt tcgatcggata tcgatcgtacg tcgatcgtcgg tcgatgcgtcg tcgattacgcg tcgattagcgc tcgatttcgcg tcgcaatcggc tcgcacgatcg tcgcacgcgat tcgccgaatcg tcgccgtacgg tcgcgaaacgt tcgcgaaccga tcgcgaacgtt tcgcgaatacg tcgcgacccgt tcgcgaccgta tcgcgacgcaa tcgcgacgcga tcgcgacgtaa tcgcgacgtag tcgcgactcgt tcgcgagcgta tcgcgattcga 
tcgcgccgata tcgcgcgaacg tcgcgcgaata tcgcgcgacat tcgcgcgacta tcgcgcgagta tcgcgcggcta tcgcgcgtcaa tcgcgcgttga tcgcggatacg tcgcgtaatcg tcgcgtatacg tcgcgtatcgc tcgcgtcaacg tcgcgtccgat tcgcgtcgaac tcgcgtcggta tcgcgtcgtta tcgcgttaacg tcgcgtttcga tcgctcgcgat tcgctcgtcga tcggacgtacg tcggacgtagc tcggcgatacg tcggcgatata tcgggctatcg tcggtacgcgc tcggtacgcta tcggtcgacga tcggtcgtacg tcgtaacgcgc tcgtaatacgg tcgtacgaccg tcgtacgccga tcgtacggccg tcgtatcgccg tcgtatcgtcg tcgtccgatcg tcgtcgaaccg tcgtcgaatcg tcgtcgacgat tcgtcgattcg tcgtcgcgata tcgtcgcgcaa tcgtcgcgtac tcgtcggtacg tcgtcggtcga tcgtctcgcgt tcgttacgcgg tcgttcgaccg tcgttcgagcg tcgttcgcgta tcgttcgtacg tcgtttcgacg tctacgcgtcg tgacgaacgcg tgatcgcgacg tgatcgcgtag tgcgaacgacg tgcgcggcgta tgcgtaacgcg tgcgtcgtacg tgtacgcgacg tgtcgacgcga tgtcgacgcgt tgtcgcgcgat tgtcgcgcgta tgtcgcgtcgt ttaaccgtcga ttaacgcgcga ttaacgtcgcg ttaccgcgacg ttacgacgcgt ttacgcgacgc ttacgcgatcg ttacgcgcgcc ttacgcgcgga ttacgcggaca ttacgcgtacc ttacgtcgcga ttagcgcgtca ttaggtcgcgc ttagtacgcgc ttagtcgctcg ttatcgcgccg ttatcgcgcgc ttattcgcgcg ttcgacgaacg ttcgacgcgac ttcgacgcgag ttcgacgcgta ttcgacggacg ttcgagcgacg ttcgatccgtt ttcgccgatcg ttcgcgagacg ttcgcgattcg ttcgcgcatag ttcgcgcgaat ttcgcgcgata ttcgcgtacgg ttgattacgcg ttgcgcgatag ttgtcgacgcg tttagtcgcgc tttcgacgcaa tttcgacgcgt tttcgcgtacg tttcgtcgcgc tttcgtcgcgg

news + thoughts

Using Circos in Galaxy Australia Workshop

Thu 20-02-2020

A workshop in using the Circos Galaxy wrapper by Rasche and Hiltemann. Event organized by Australian Biocommons.

Using Circos in Galaxy Australia workshop.

Download workshop slides.

Galaxy wrapper training materials, Saskia Hiltemann, Helena Rasche, 2020 Visualisation with Circos (Galaxy Training Materials).

Essence of Data Visualization in Bioinformatics Webinar

Thu 20-02-2020

My webinar on fundamental concepts in data visualization and visual communication of scientific data and concepts. Event organized by Australian Biocommons.

Essence of Data Visualization in Bioinformatics webinar.

Download webinar slides.

Markov models — training and evaluation of hidden Markov models

Thu 20-02-2020

With one eye you are looking at the outside world, while with the other you are looking within yourself.
—Amedeo Modigliani

Following up on our Markov Chain column and Hidden Markov model column, this month we look at how Markov models are trained using the example of a biased coin.

We introduce the concepts of forward and backward probabilities and explicitly show how they are calculated in the training process using the Baum-Welch algorithm. We also discuss the value of ensemble models and the use of pseudocounts for cases where rare observations are expected but not necessarily seen.

Nature Methods Points of Significance column: Markov models — training and evaluation of hidden Markov models.

Grewal, J., Krzywinski, M. & Altman, N. (2020) Points of significance: Markov models — training and evaluation of hidden Markov models. Nature Methods 17:121–122.

Background reading

Altman, N. & Krzywinski, M. (2019) Points of significance: Hidden Markov models. Nature Methods 16:795–796.

Altman, N. & Krzywinski, M. (2019) Points of significance: Markov Chains. Nature Methods 16:663–664.

Genome Sciences Center 20th Anniversary Clothing, Music, Drinks and Art

Tue 28-01-2020

Science. Timeliness. Respect.

Read about the design of the clothing, music, drinks and art for the Genome Sciences Center 20th Anniversary Celebration, held on 15 November 2019.

Luke and Mayia wearing limited edition volunteer t-shirts. The pattern reproduces the human genome with chromosomes as spirals.

As part of the celebration and with the help of our engineering team, we framed 48 flow cells from the lab.

Precisely engineered frame mounts of flow cells used to sequence genomes in our laboratory.

Each flow cell was accompanied by an interpretive plaque explaining the technology behind the flow cell and the sample information and sequence content.

The plaque at the back of one of the framed Illumina flow cells. This one has sequence from a patient's lymph node diagnosed with Burkitt's lymphoma.

Scientific data visualization: Aesthetic for diagrammatic clarity

Mon 13-01-2020

The scientific process works because all its output is empirically constrained.

My chapter from The Aesthetics of Scientific Data Representation, More than Pretty Pictures, in which I discuss the principles of data visualization and connect them to the concept of "quality" introduced by Robert Pirsig in Zen and the Art of Motorcycle Maintenance.

Yearning for the Infinite — Aleph 2

Mon 18-11-2019

Discover Cantor's transfinite numbers through my music video for the Aleph 2 track of Max Cooper's Yearning for the Infinite (album page, event page).

Yearning for the Infinite, Max Cooper at the Barbican Hall, London. Track Aleph 2. Video by Martin Krzywinski. Photo by Michal Augustini.

I discuss the math behind the video and the system I built to create the video.