Martin Krzywinski / Genome Sciences Center

EMBO Practical Course: Bioinformatics and Genome Analysis, 5–17 June 2017.

science + genomics

What if we were to print what we sequence?

Expressing the amount of sequence in the human genome in terms of the number of printed pages has been done before. At the Broad Institute, all of the human reference genome is printed in bound volumes.

At our sequencing facility, we sequence about 1 terabase per day. This is equivalent to 167 diploid human genomes (167 × 6 gigabases). The sequencing is done using a pool of 13 Illumina HiSeq 2500 sequencers, of which about 50% are sequencing at any given time.

Martin Krzywinski @MKrzywinski
A single letter-size page (8.5" × 11") of 6pt Courier using 0.25 inch margins accommodates 18,126 bases on 114 lines. Shown here is a portion of sequence from human chromosome 1.

This sequencing is extremely fast.

To understand just how fast this is, consider printing this amount of sequence using a modern office laser printer. Let's pick the HP P3015n which costs about $400—a cheap and fast network printer. It can print at about 40 pages per minute.

If we print the sequence at 6pt Courier using 0.25" margins, each 8.5" × 11" page will accommodate 18,126 bases. I chose this font size because it's reasonably legible. To print 1 terabase we need `10^12 / 18126 = 55.2` million pages.

If we print continuously at 40 pages per minute, we need `10^12 / (18126*40*1440) = 957.8` days.

If we had 958 printers working around the clock, we could print everything we sequence and not fall behind. This does not account for time required to replenish toner or paper.
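The printing arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in the text. The variable names are illustrative, and the 159 bases per line is simply 18,126 divided by 114, not a number stated in the article.

```python
import math

BASES_PER_DAY = 10**12            # about 1 terabase sequenced per day
BASES_PER_DIPLOID_GENOME = 6 * 10**9
BASES_PER_PAGE = 18_126           # 6pt Courier, 0.25" margins, letter-size page
LINES_PER_PAGE = 114
PAGES_PER_MINUTE = 40             # quoted speed of the office laser printer
MINUTES_PER_DAY = 24 * 60         # 1440

genomes_per_day = BASES_PER_DAY / BASES_PER_DIPLOID_GENOME
bases_per_line = BASES_PER_PAGE // LINES_PER_PAGE
pages = BASES_PER_DAY / BASES_PER_PAGE

# One printer running nonstop needs this many days to print one day of sequence,
# which is also the number of printers needed to keep pace.
printer_days = pages / (PAGES_PER_MINUTE * MINUTES_PER_DAY)
printers_needed = math.ceil(printer_days)

print(f"{genomes_per_day:.0f} diploid genomes per day")
print(f"{bases_per_line} bases per line")
print(f"{pages / 1e6:.1f} million pages")
print(f"{printer_days:.1f} printer-days, so {printers_needed} printers")
```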

what's cheaper, sequencing or printing?

It costs us about $12,000 in reagents to sequence a terabase. If we do it on a cost-recovery basis, it is about twice that, to include labor and storage. Let's say $25,000 per terabase.

Coincidentally, this is about $150 per 1× coverage of a diploid human genome. The cost of sequencing a single genome would be significantly higher because of overhead. To overcome gaps in coverage and to be sensitive to alleles in heterogeneous samples, sequencing should be done to 30× or more. For example, we sequence cancer genomes at over 100×. For theory and review see Aspects of coverage in medical DNA sequencing by Wendl et al. and Sequencing depth and coverage: key considerations in genomic analyses by Sims et al. (Thanks to Nicolas Robine for pointing out that redundant coverage should be mentioned here.)
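As a rough illustration, the per-coverage figure can be scaled to the depths mentioned above. This sketch assumes cost grows linearly with depth at the bulk rate of $25,000 per terabase. As noted, a single genome would cost significantly more because of overhead, so these are lower bounds rather than quotes.

```python
COST_PER_TERABASE = 25_000            # dollars, cost-recovery rate from the text
BASES_PER_TERABASE = 10**12
BASES_PER_DIPLOID_GENOME = 6 * 10**9  # 6 gigabases

cost_per_1x = COST_PER_TERABASE * BASES_PER_DIPLOID_GENOME / BASES_PER_TERABASE

def bulk_cost(depth):
    """Naive linear scaling of the 1x cost to a given coverage depth."""
    return depth * cost_per_1x

for depth in (1, 30, 100):
    print(f"{depth:>3}x coverage: ${bulk_cost(depth):,.0f}")
```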

Printing is 44× more expensive than sequencing, per base: 25 n$ vs 1.1 μ$.

I should mention that the cost of analyzing the sequenced genome should be considered—this step is always the much more expensive one. In The $1,000 genome, the $100,000 analysis? Mardis asks "If our efforts to improve the human reference sequence quality, variation, and annotation are successful, how do we avoid the pitfall of having cheap human genome resequencing but complex and expensive manual analysis to make clinical sense out of the data?"

The cost of a single printed page (toner, power, etc) is about $0.02–0.05, depending on the printer. Let's be generous and say it's $0.02. To print 55.2 million pages would cost us $1.1M. This is about 44 times as expensive as sequencing.

To print what we sequence we would require 958 office laser printers (shown here as HP P3015n) at a cost of $1,100,000 per day.

Think about this. It's 44× more expensive to merely print a letter on a page than it is to determine it from the DNA of a cell. In other words, to go from the physical molecule to a bit state on a disk is much cheaper than from a bit state on a disk to a representation of the letter on a page.

Per base, our sequencing costs `$25000/10^12 = $25*10^-9`, or 25 nanodollars. At $0.02 and 18,126 bp per page, printing costs `0.02/18126 = $1.1*10^-6` or 1.1 microdollars.
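The per-base comparison is easy to check. The sketch below uses only numbers given in the text ($25,000 per terabase, $0.02 per page, 18,126 bases per page). The variable names are illustrative.

```python
SEQUENCING_COST_PER_TERABASE = 25_000   # dollars
BASES_PER_TERABASE = 10**12
COST_PER_PAGE = 0.02                    # dollars, the generous end of $0.02-0.05
BASES_PER_PAGE = 18_126

sequencing_per_base = SEQUENCING_COST_PER_TERABASE / BASES_PER_TERABASE
printing_per_base = COST_PER_PAGE / BASES_PER_PAGE
ratio = printing_per_base / sequencing_per_base

# Cost of printing one day of sequencing output (1 terabase).
daily_printing_cost = printing_per_base * BASES_PER_TERABASE

print(f"sequencing: {sequencing_per_base * 1e9:.0f} nanodollars per base")
print(f"printing:   {printing_per_base * 1e6:.1f} microdollars per base")
print(f"printing is {ratio:.0f}x more expensive")
print(f"printing one terabase costs ${daily_printing_cost:,.0f}")
```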

If at this point you're thinking that printing isn't practical, the fact that the pages would weigh 248,000 kg and stack to 5.5 km should clinch the argument.
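The weight and stack height can be approximated as follows. The paper properties are assumptions and are not given in the text: 75 g/m² stock (typical 20 lb office paper) and 0.1 mm per sheet. With these values the result lands close to the quoted figures, though not exactly on them.

```python
BASES_PER_DAY = 10**12
BASES_PER_PAGE = 18_126

# Assumed paper properties (not from the article).
GRAMS_PER_SQUARE_METRE = 75      # typical 20 lb office paper
SHEET_THICKNESS_MM = 0.1

INCH_IN_METRES = 0.0254
sheet_area_m2 = (8.5 * INCH_IN_METRES) * (11 * INCH_IN_METRES)

pages = BASES_PER_DAY / BASES_PER_PAGE   # single-sided, one page per sheet
weight_kg = pages * sheet_area_m2 * GRAMS_PER_SQUARE_METRE / 1000
stack_km = pages * SHEET_THICKNESS_MM / 1e6   # mm -> km

print(f"weight: about {weight_kg:,.0f} kg")
print(f"stack:  about {stack_km:.1f} km")
```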

The capital cost of sequencing is, of course, much higher. The printers themselves would cost about $400,000 to purchase. The 6 sequencers, on the other hand, cost about $3,600,000.

sequencing is as fast as downloading

We sequence at a rate close to the average internet bandwidth available to the public.

At 3.86 Mb/s, we could download a terabase of compressed sequence in a day, assuming the sequence can be compressed by a factor of 3. This level of compression is reasonable: the current human assembly is 938 Mb zipped.

In other words, you would have to be downloading essentially continuously to keep up with our sequencing.
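The download rate follows from the daily output and the compression factor. The sketch below assumes each base is stored as one byte (one text character) before compression, and it reads the 3.86 figure as megabytes per second. Both are assumptions about the original calculation.

```python
BASES_PER_DAY = 10**12
BYTES_PER_BASE = 1           # assumption: one ASCII character per base
COMPRESSION_FACTOR = 3       # from the text
SECONDS_PER_DAY = 24 * 60 * 60

compressed_bytes = BASES_PER_DAY * BYTES_PER_BASE / COMPRESSION_FACTOR
rate_megabytes_per_s = compressed_bytes / SECONDS_PER_DAY / 1e6

print(f"required rate: {rate_megabytes_per_s:.2f} megabytes per second")
```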


news + thoughts

`k` index: a weightlifting and CrossFit performance measure

Wed 07-06-2017

Similar to the `h` index in publishing, the `k` index is a measure of fitness performance.

To achieve a `k` index for a movement you must perform `k` unbroken reps at `k`% 1RM.

The expected value for the `k` index is probably somewhere in the range of `k = 26` to `k=35`, with higher values progressively more difficult to achieve.
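One way to compute a `k` index from a training log is sketched below. The log format is hypothetical. The sketch also assumes that a set of `r` unbroken reps at `p`% 1RM counts toward every smaller `k`, so that it supports `k = min(p, r)`, by analogy with how the `h` index counts papers with at least `h` citations. The definition above does not state this, so treat it as one possible reading.

```python
def k_index(sets):
    """Return the largest k supported by any logged set.

    `sets` is an iterable of (percent_of_1rm, unbroken_reps) pairs.
    Assumption: a set of r reps at p% 1RM also counts for every
    smaller k, so it supports k = min(p, r).
    """
    return max((min(int(percent), reps) for percent, reps in sets), default=0)

# Hypothetical training log for one movement.
training_log = [(30, 30), (40, 25), (50, 12)]
best_k = k_index(training_log)
print(f"k index: {best_k}")
```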

In my `k` index introduction article I provide a detailed explanation, a rep scheme table and a WOD example.

Dark Matter of the English Language—the unwords

Wed 07-06-2017

I've applied the char-rnn recurrent neural network to generate new words, names of drugs and countries.

The effect is intriguing and facetious—yes, those are real words.

But these are not: necronology, abobionalism, gabdologist, and nonerify.

These places only exist in the mind: Conchar and Pobacia, Hzuuland, New Kain, Rabibus and Megee Islands, Sentip and Sitina, Sinistan and Urzenia.

And these are the imaginary afflictions of the imagination: ictophobia, myconomascophobia, and talmatomania.

And these, of the body: ophalosis, icabulosis, mediatopathy and bellotalgia.

Want to name your baby? Or someone else's baby? Try Ginavietta Xilly Anganelel or Ferandulde Hommanloco Kictortick.

When taking new therapeutics, never mix salivac and labromine. And don't forget that abadarone is best taken on an empty stomach.

And nothing increases the chance of getting that grant funded more than proposing the study of a new –ome! We really need someone to look into the femome and manome.

Dark Matter of the Genome—the nullomers

Wed 31-05-2017

An exploration of things that are missing in the human genome. The nullomers.

Julia Herold, Stefan Kurtz and Robert Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) 9:167
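To make the idea of an absent word concrete, here is a brute-force sketch that lists every length-`k` word over A, C, G, T that does not occur in a sequence. It is not the algorithm from the paper cited above, which is designed to be efficient on whole genomes. It considers a single strand only, and the function name is illustrative.

```python
from itertools import product

def absent_words(sequence, k):
    """Return all length-k words over ACGT that never occur in `sequence`."""
    seen = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    candidates = ("".join(letters) for letters in product("ACGT", repeat=k))
    return [word for word in candidates if word not in seen]

toy_sequence = "ACGTACGGTCA"
missing = absent_words(toy_sequence, 2)
print(f"{len(missing)} of {4**2} possible 2-mers are absent: {missing}")
```

Memory use grows as 4^k, so this approach is only practical for small `k`.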


Wed 31-05-2017
Clustering finds patterns in data—whether they are there or not.

We've already seen how data can be grouped into classes in our series on classifiers. In this column, we look at how data can be grouped by similarity in an unsupervised way.

Nature Methods Points of Significance column: Clustering.

We look at two common clustering approaches: `k`-means and hierarchical clustering. All clustering methods share the same approach: they first calculate similarity and then use it to group objects into clusters. The details of the methods, and outputs, vary widely.
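The two-step pattern, calculating similarity and then grouping, can be seen in a bare-bones `k`-means sketch. This is a minimal pure-Python illustration, not the column's code. It uses squared Euclidean distance and, as a simplifying assumption, takes the first `k` points as starting centroids, so the outcome depends on the order of the data.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, max_iterations=50):
    centroids = [tuple(p) for p in points[:k]]   # naive, order-dependent start
    for _ in range(max_iterations):
        # Step 1: use similarity (distance) to assign each point to a cluster.
        labels = [nearest(p, centroids) for p in points]
        # Step 2: move each centroid to the mean of its assigned points.
        updated = []
        for index, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(old)              # keep an empty cluster in place
        if updated == centroids:                 # assignments have stabilised
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

points = [(1, 1), (5, 5), (1, 2), (5, 6), (2, 1), (6, 5)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

A different starting choice can give a different grouping on less clearly separated data, which is one reason clustering output should be interpreted with care.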

Altman, N. & Krzywinski, M. (2017) Points of Significance: Clustering. Nature Methods 14:545–546.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.


What's wrong with pie charts?

Thu 25-05-2017

In this redesign of a pie chart figure from a Nature Medicine article [1], I look at how to organize and present a large number of categories.

I first discuss some of the benefits of a pie chart—there are few and specific—and its shortcomings—there are few but fundamental.

I then walk through the redesign process by showing how the tumor categories can be shown more clearly if they are first aggregated into a small number of groups.

(bottom left) Figure 2b from Zehir et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. (2017) Nature Medicine doi:10.1038/nm.4333