Martin Krzywinski / Genome Sciences Center / Martin Krzywinski / Genome Sciences Center / - contact me Martin Krzywinski / Genome Sciences Center / on Twitter Martin Krzywinski / Genome Sciences Center / - Lumondo Photography Martin Krzywinski / Genome Sciences Center / - Pi Art Martin Krzywinski / Genome Sciences Center / - Hilbertonians - Creatures on the Hilbert Curve
Love itself became the object of her love.Jonathan Safran Foercount sadnessesmore quotes

science: exciting

DNA on 10th — street art, wayfinding and font

science + genomics

What if we were to print what we sequence?

Expressing the amount of sequence in the human genome in terms of the number of printed pages has been done before. At the Broad Institute, all of the human reference genome is printed in bound volumes.

At our sequencing facility, we sequence about 1 terabases per day. This is equivalent to 167 diploid human genomes (167 × 6 gigabases). The sequencing is done using a pool of 13 Illumina HiSeq 2500 sequencers, of which about 50% are sequencing at any given time.

Martin Krzywinski @MKrzywinski
A single letter-size page (8.5" × 11") of 6pt Courier using 0.25 inch margins accomodates 18,126 bases on 114 lines. Shown here is a portion of sequence from human chromosome 1. (PDF)

This sequencing is extremely fast.

To understand just how fast this is, consider printing this amount of sequence using a modern office laser printer. Let's pick the HP P3015n which costs about $400—a cheap and fast network printer. It can print at about 40 pages per minute.

If we print the sequence at 6pt Courier using 0.25" margins, each 8.5" × 11" page will accomodate 18,126 bases. I chose this font size because it's reasonably legible. To print 1 terabases we need `10^12 / 18126 = 55.2` million pages.

If we print continuously at 40 pages per minute, we need `10^12 / (18126*40*1440) = 957.8` days.

If we had 958 printers working around the clock, we could print everything we sequence and not fall behind. This does not account for time required to replenish toner or paper.

what's cheaper, sequencing or printing?

It costs us about $12,000 to sequence a terabase in reagents. If we do it on a cost-recovery basis, it is about twice that, to include labor and storage. Let's say $25,000 per terabase.

Coincidentally, this is about $150 per 1× coverage of a diploid human genome. The cost of sequencing a single genome would be significantly higher because of overhead. To overcome gaps in coverage and to be sensitive to alleles in heterogenous samples, sequencing should be done to 30× or more. For example, we sequence cancer genomes at over 100×. For theory and review see Aspects of coverage in medical DNA sequencing by Wendl et al. and Sequencing depth and coverage: key considerations in genomic analyses by Sims et al.. (Thanks to Nicolas Robine for pointing out that redundant coverage should be mentioned here).

Printing is 44× more expensive than sequencing, per base: 25 n$ vs 1.1 μ$.

I should mention that the cost of analyzing the sequenced genome should be considered—this step is always the much more expensive one. In The $1,000 genome, the $100,000 analysis? Mardis asks "If our efforts to improve the human reference sequence quality, variation, and annotation are successful, how do we avoid the pitfall of having cheap human genome resequencing but complex and expensive manual analysis to make clinical sense out of the data?"

The cost of a single printed page (toner, power, etc) is about $0.02–0.05, depending on the printer. Let's be generous and say it's $0.02. To print 55.2 million pages would cost us $1.1M. This is about 44 times as expensive as sequencing.

Martin Krzywinski @MKrzywinski
To print what we sequence we would require 958 office laser printers (shown here as HP3015n) at a cost of $1,100,000 per day. (PDF)

Think about this. It's 44× more expensive to merely print a letter on a page than it is to determine it from the DNA of a cell. In other words, to go from the physical molecule to a bit state on a disk is much cheaper than from a bit state on a disk to a representation of the letter on a page.

Per base, our sequencing costs `$25000/10^12 = $25*10^-9`, or 25 nanodollars. At $0.02 and 18,126 bp per page, printing costs `0.02/18126 = $1.1*10^-6` or 1.1 microdollars.

If at this point you're thinking that printing isn't practical, the fact that the pages would weigh 248,000 kg and stack to 5.5 km should cinch the argument.

The capital cost of sequencing is, of course, much higher. The printers themselves would cost about $400,000 to purchase. The 6 sequencers, on the other hand, cost about $3,600,000.

sequencing is as fast as downloading

We sequence at a rate close to the average internet bandwidth available to the public.

At 3.86 Mb/s, we could download a terabase of compressed sequence in a day, assuming the sequence can be compressed by a factor of 3. This level of compression is reasonable—the current human assembly is 938 Mb zipped).

In other words, you would have to be downloading essentially continuously to keep up with our sequencing.


news + thoughts

Analyzing outliers: Robust methods to the rescue

Sat 30-03-2019
Robust regression generates more reliable estimates by detecting and downweighting outliers.

Outliers can degrade the fit of linear regression models when the estimation is performed using the ordinary least squares. The impact of outliers can be mitigated with methods that provide robust inference and greater reliability in the presence of anomalous values.

Martin Krzywinski @MKrzywinski
Nature Methods Points of Significance column: Analyzing outliers: Robust methods to the rescue. (read)

We discuss MM-estimation and show how it can be used to keep your fitting sane and reliable.

Greco, L., Luta, G., Krzywinski, M. & Altman, N. (2019) Points of significance: Analyzing outliers: Robust methods to the rescue. Nature Methods 16:275–276.

Background reading

Altman, N. & Krzywinski, M. (2016) Points of significance: Analyzing outliers: Influential or nuisance. Nature Methods 13:281–282.

Two-level factorial experiments

Fri 22-03-2019
To find which experimental factors have an effect, simultaneously examine the difference between the high and low levels of each.

Two-level factorial experiments, in which all combinations of multiple factor levels are used, efficiently estimate factor effects and detect interactions—desirable statistical qualities that can provide deep insight into a system.

They offer two benefits over the widely used one-factor-at-a-time (OFAT) experiments: efficiency and ability to detect interactions.

Martin Krzywinski @MKrzywinski
Nature Methods Points of Significance column: Two-level factorial experiments. (read)

Since the number of factor combinations can quickly increase, one approach is to model only some of the factorial effects using empirically-validated assumptions of effect sparsity and effect hierarchy. Effect sparsity tells us that in factorial experiments most of the factorial terms are likely to be unimportant. Effect hierarchy tells us that low-order terms (e.g. main effects) tend to be larger than higher-order terms (e.g. two-factor or three-factor interactions).

Smucker, B., Krzywinski, M. & Altman, N. (2019) Points of significance: Two-level factorial experiments Nature Methods 16:211–212.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of significance: Designing comparative experiments.. Nature Methods 11:597–598.

Happy 2019 `\pi` Day—
Digits, internationally

Tue 12-03-2019

Celebrate `\pi` Day (March 14th) and set out on an exploration explore accents unknown (to you)!

This year is purely typographical, with something for everyone. Hundreds of digits and hundreds of languages.

A special kids' edition merges math with color and fat fonts.

Martin Krzywinski @MKrzywinski
116 digits in 64 languages. (details)
Martin Krzywinski @MKrzywinski
223 digits in 102 languages. (details)

Check out art from previous years: 2013 `\pi` Day and 2014 `\pi` Day, 2015 `\pi` Day, 2016 `\pi` Day, 2017 `\pi` Day and 2018 `\pi` Day.

Tree of Emotional Life

Sun 17-02-2019

One moment you're :) and the next you're :-.

Make sense of it all with my Tree of Emotional life—a hierarchical account of how we feel.

Martin Krzywinski @MKrzywinski
A section of the Tree of Emotional Life.

Find and snap to colors in an image

Sat 29-12-2018

One of my color tools, the colorsnap application snaps colors in an image to a set of reference colors and reports their proportion.

Below is Times Square rendered using the colors of the MTA subway lines.

Colors used by the New York MTA subway lines.

Martin Krzywinski @MKrzywinski
Times Square in New York City.
Martin Krzywinski @MKrzywinski
Times Square in New York City rendered using colors of the MTA subway lines.
Martin Krzywinski @MKrzywinski
Granger rainbow snapped to subway lines colors from four cities. (zoom)