Expressing the amount of sequence in the human genome in terms of the number of printed pages has been done before. At the Broad Institute, all of the human reference genome is printed in bound volumes.
At our sequencing facility, we sequence about 1 terabase per day. This is equivalent to 167 diploid human genomes (167 × 6 gigabases). The sequencing is done using a pool of 13 Illumina HiSeq 2500 sequencers, of which about 50% are sequencing at any given time.
This sequencing is extremely fast.
To understand just how fast this is, consider printing this amount of sequence using a modern office laser printer. Let's pick the HP P3015n which costs about $400—a cheap and fast network printer. It can print at about 40 pages per minute.
If we print the sequence at 6pt Courier using 0.25" margins, each 8.5" × 11" page will accommodate 18,126 bases. I chose this font size because it's reasonably legible. To print 1 terabase we need `10^12 / 18126 = 55.2` million pages.
If we print continuously at 40 pages per minute, we need `10^12 / (18126*40*1440) = 957.8` days.
If we had 958 printers working around the clock, we could print everything we sequence and not fall behind. This does not account for time required to replenish toner or paper.
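The printing arithmetic above can be checked with a few lines of Python. This is only a sketch that uses the figures quoted in the text; the variable names are mine.

```python
import math

# Figures quoted in the text
BASES_PER_DAY = 10**12       # about 1 terabase sequenced per day
BASES_PER_PAGE = 18_126      # 6pt Courier, 0.25" margins, 8.5" x 11" page
PAGES_PER_MINUTE = 40        # printer speed
MINUTES_PER_DAY = 24 * 60

# Pages needed to print one day of sequencing
pages = BASES_PER_DAY / BASES_PER_PAGE

# Days a single printer needs, which is also the number of printers
# needed to keep pace with one day of sequencing
printer_days = pages / (PAGES_PER_MINUTE * MINUTES_PER_DAY)
printers_needed = math.ceil(printer_days)

print(f"{pages / 1e6:.1f} million pages")    # 55.2 million pages
print(f"{printer_days:.1f} printer-days")    # 957.8 printer-days
print(f"{printers_needed} printers")         # 958 printers
```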
It costs us about $12,000 to sequence a terabase in reagents. If we do it on a cost-recovery basis, it is about twice that, to include labor and storage. Let's say $25,000 per terabase.
Coincidentally, this is about $150 per 1× coverage of a diploid human genome. The cost of sequencing a single genome would be significantly higher because of overhead. To overcome gaps in coverage and to be sensitive to alleles in heterogeneous samples, sequencing should be done to 30× or more. For example, we sequence cancer genomes at over 100×. For theory and review see "Aspects of coverage in medical DNA sequencing" by Wendl et al. and "Sequencing depth and coverage: key considerations in genomic analyses" by Sims et al. (Thanks to Nicolas Robine for pointing out that redundant coverage should be mentioned here.)
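As a rough sketch, here is how the $150 figure scales with coverage. It assumes cost grows linearly with depth and ignores the per-genome overhead mentioned above, so treat the outputs as lower bounds; `coverage_cost` is a name I made up for illustration.

```python
COST_PER_TERABASE = 25_000           # dollars, cost-recovery figure from the text
DIPLOID_GENOME_BASES = 6 * 10**9     # 6 gigabases
TERABASE = 10**12

# Cost of 1x coverage of one diploid genome
cost_per_1x = COST_PER_TERABASE * DIPLOID_GENOME_BASES / TERABASE

def coverage_cost(depth):
    """Cost in dollars at the given depth, assuming linear scaling (an assumption)."""
    return depth * cost_per_1x

print(cost_per_1x)           # 150.0
print(coverage_cost(30))     # 4500.0
print(coverage_cost(100))    # 15000.0
```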
Printing is 44× more expensive than sequencing, per base: 1.1 μ$ vs 25 n$.
I should mention that the cost of analyzing the sequenced genome should be considered, since this step is always the much more expensive one. In "The $1,000 genome, the $100,000 analysis?", Mardis asks: "If our efforts to improve the human reference sequence quality, variation, and annotation are successful, how do we avoid the pitfall of having cheap human genome resequencing but complex and expensive manual analysis to make clinical sense out of the data?"
The cost of a single printed page (toner, power, etc.) is about $0.02–0.05, depending on the printer. Let's be generous and say it's $0.02. To print 55.2 million pages would cost us $1.1M. This is about 44 times as expensive as sequencing.
Think about this. It's 44× more expensive to merely print a letter on a page than it is to determine it from the DNA of a cell. In other words, to go from the physical molecule to a bit state on a disk is much cheaper than from a bit state on a disk to a representation of the letter on a page.
Per base, our sequencing costs `$25000/10^12 = $25*10^-9`, or 25 nanodollars. At $0.02 and 18,126 bp per page, printing costs `0.02/18126 = $1.1*10^-6` or 1.1 microdollars.
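The per-base comparison is easy to reproduce. A minimal sketch using only the numbers quoted in the text:

```python
SEQUENCING_COST_PER_TERABASE = 25_000   # dollars
COST_PER_PAGE = 0.02                    # dollars, the generous end of $0.02-0.05
BASES_PER_PAGE = 18_126

sequencing_per_base = SEQUENCING_COST_PER_TERABASE / 10**12   # dollars per base
printing_per_base = COST_PER_PAGE / BASES_PER_PAGE            # dollars per base
ratio = printing_per_base / sequencing_per_base

print(f"sequencing: {sequencing_per_base * 1e9:.0f} nanodollars per base")   # 25
print(f"printing:   {printing_per_base * 1e6:.1f} microdollars per base")    # 1.1
print(f"printing is about {ratio:.0f}x more expensive")                      # 44
```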
If at this point you're thinking that printing isn't practical, the fact that the pages would weigh 248,000 kg and stack to 5.5 km should clinch the argument.
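The weight and height figures depend on the paper, which the text doesn't specify. The sketch below assumes 4.5 g and 0.1 mm per letter-size sheet (typical office paper); both values are my assumptions, chosen because they reproduce the numbers above.

```python
BASES_PER_DAY = 10**12
BASES_PER_PAGE = 18_126

# Assumptions, not from the text: typical office paper
GRAMS_PER_SHEET = 4.5
SHEET_THICKNESS_MM = 0.1

pages = BASES_PER_DAY / BASES_PER_PAGE

weight_kg = pages * GRAMS_PER_SHEET / 1000      # grams to kilograms
stack_km = pages * SHEET_THICKNESS_MM / 1e6     # millimetres to kilometres

print(f"{weight_kg:,.0f} kg")    # about 248,000 kg
print(f"{stack_km:.1f} km")      # about 5.5 km
```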
The capital cost of sequencing is, of course, much higher. The printers themselves would cost about $400,000 to purchase. The 6 sequencers, on the other hand, cost about $3,600,000.
We sequence at a rate close to the average internet bandwidth available to the public.
At 3.86 MB/s, we could download a terabase of compressed sequence in a day, assuming the sequence can be compressed by a factor of 3. This level of compression is reasonable: the current human assembly is 938 MB zipped.
In other words, you would have to be downloading essentially continuously to keep up with our sequencing.
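Here is the bandwidth arithmetic as a sketch. It assumes each base is stored as one byte before compression (my assumption), which puts the required rate at roughly 3.86 megabytes per second, or about 31 megabits per second.

```python
BASES_PER_DAY = 10**12
BYTES_PER_BASE = 1               # assumption: one ASCII character per base
COMPRESSION_FACTOR = 3           # from the text
SECONDS_PER_DAY = 24 * 60 * 60

compressed_bytes = BASES_PER_DAY * BYTES_PER_BASE / COMPRESSION_FACTOR
bytes_per_second = compressed_bytes / SECONDS_PER_DAY

print(f"{bytes_per_second / 1e6:.2f} MB/s")         # 3.86 MB/s
print(f"{bytes_per_second * 8 / 1e6:.1f} Mbit/s")   # 30.9 Mbit/s
```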
To achieve a `k` index for a movement you must perform `k` unbroken reps at `k`% 1RM.
The expected value for the `k` index is probably somewhere in the range of `k = 26` to `k = 35`, with higher values progressively more difficult to achieve.
In my `k` index introduction article I provide a detailed explanation, a rep scheme table and a WOD example.
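To make the definition concrete, here is a small sketch that turns a 1RM into the reps and load needed for a given `k`. The function name and the 100 kg example are hypothetical; this is not the rep scheme table from the introduction article.

```python
def k_index_target(one_rep_max, k):
    """Return (reps, load) needed to claim index k: k unbroken reps at k% of 1RM."""
    return k, one_rep_max * k / 100

# Hypothetical lifter with a 100 kg 1RM, over the range mentioned above
for k in (26, 30, 35):
    reps, load = k_index_target(100, k)
    print(f"k = {k}: {reps} unbroken reps at {load:.1f} kg")
```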
The effect is intriguing and facetious—yes, those are real words.
But these are not: necronology, abobionalism, gabdologist, and nonerify.
These places only exist in the mind: Conchar and Pobacia, Hzuuland, New Kain, Rabibus and Megee Islands, Sentip and Sitina, Sinistan and Urzenia.
And these are the imaginary afflictions of the imagination: ictophobia, myconomascophobia, and talmatomania.
And these, of the body: ophalosis, icabulosis, mediatopathy and bellotalgia.
Want to name your baby? Or someone else's baby? Try Ginavietta Xilly Anganelel or Ferandulde Hommanloco Kictortick.
When taking new therapeutics, never mix salivac and labromine. And don't forget that abadarone is best taken on an empty stomach.
And nothing increases the chance of getting that grant funded more than proposing the study of a new –ome! We really need someone to look into the femome and manome.
An exploration of things that are missing in the human genome. The nullomers.
Julia Herold, Stefan Kurtz and Robert Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics (2008) 9:167
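For small word lengths, absent words can be found by brute force: list every possible `k`-mer and remove the ones that occur. The sketch below does exactly that. It is not the efficient algorithm of Herold et al., it ignores reverse complements, and the function name and toy sequence are made up for illustration.

```python
from itertools import product

def absent_words(sequence, k, alphabet="ACGT"):
    """Return the k-mers over `alphabet`, in alphabet order, that never occur in `sequence`."""
    sequence = sequence.upper()
    # Every k-mer that occurs, collected with a sliding window
    present = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    # Every k-mer that could occur
    all_words = ("".join(p) for p in product(alphabet, repeat=k))
    return [w for w in all_words if w not in present]

toy = "ACGTACGGTCA"
print(absent_words(toy, 2))
# ['AA', 'AG', 'AT', 'CC', 'CT', 'GA', 'GC', 'TG', 'TT']
```

The number of candidate words grows as `4^k`, so this approach is only practical for short words.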
We've already seen how data can be grouped into classes in our series on classifiers. In this column, we look at how data can be grouped by similarity in an unsupervised way.
We look at two common clustering approaches: `k`-means and hierarchical clustering. All clustering methods share the same approach: they first calculate similarity and then use it to group objects into clusters. The details of the methods, and outputs, vary widely.
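A minimal sketch of `k`-means (Lloyd's algorithm) shows the two steps in action: measure similarity, then group. It assumes Euclidean distance as the similarity measure and hand-picked starting centroids; the toy points are made up, and real analyses would normally use a library implementation with better initialization.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its points, and repeat until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 1: similarity is Euclidean distance to each centroid
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids, which is why `k`-means is usually run several times with different starts.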
Altman, N. & Krzywinski, M. (2017) Points of Significance: Clustering. Nature Methods 14:545–546.
Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541–542.
Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603–604.
In this redesign of a pie chart figure from a Nature Medicine article, I look at how to organize and present a large number of categories.
I first discuss some of the benefits of a pie chart—there are few and specific—and its shortcomings—there are few but fundamental.
I then walk through the redesign process by showing how the tumor categories can be shown more clearly if they are first aggregated into a small number of groups.
(bottom left) Figure 2b from Zehir et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. (2017) Nature Medicine doi:10.1038/nm.4333