data visualization + art
The information graphic showing the history of the human genome assembly is part of my series of designs created for the Scientific American Graphic Science page. Together with Senior Graphics Editor Jen Christiansen, I've looked at everything from the evolution of the genomes of SARS-CoV-2 strains to how pets contribute to the bacterial flora in your home.
Most of the art is available for purchase as framed prints and, yes, even pillows. Sleep's never been more important — I take custom requests.

History of the Human Genome Assembly

22 years, 3,117,275,501 bases and 0 gaps later

Round numbers are always false.
— Samuel Johnson

1 · The first gapless assembly of a human genome

In March 2022, a flurry of publications announced the first ever complete assembly of a human genome. This is a big deal because, in genomics, we don't throw around words like “complete” very often. Actually, never — until now.

Because the human genome — a human genome — is complete. All hail the CHM13v2 (Mar 2022) telomere-to-telomere (T2T) assembly.

1.1 · Finished... again?

You've probably already heard (and have been hearing for the last 15 years) that the human genome has been sequenced. After all, how else can there be a publication with the title Finishing the euchromatic sequence of the human genome (2004 Nature 431:931–945)?

In 2004, the only thing that stood between “finished” and “complete” was that pesky word “euchromatic”.

“Euchromatin is a lightly packed form of chromatin (DNA, RNA, and protein) that is enriched in genes, and is often (but not always) under active transcription. Euchromatin stands in contrast to heterochromatin, which is tightly packed and less accessible for transcription. 92% of the human genome is euchromatic.” — Wikipedia

For modeling and analysis — such as in cancer research, for example, which is what we do here — by far the most important parts of the human genome assembly are the parts that code for protein (transcribed regions and their ORFs), along with their adjacent regulatory sequences.

These regions in total occupy less than half of the genome. The parts that are ultimately translated into protein (exons) account for just 2.58% of the genome. For the vast majority of genes, the sequence in these regions was indeed finished.

History of the Human Genome Assembly (22 years, 3,117,275,501 bases and 0 gaps later) -- science + art + data visualization / Martin Krzywinski / Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
IT'S IN THE GENES | The relationship between transcribed region (TX), open reading frame (ORF) and protein coding region (CDS) in a gene. Size of elements is illustrative. Modified from Wikipedia.

In hg38 (Dec 2013), the most recent assembly prior to the complete telomere-to-telomere CHM13v2 assembly, exons cover only 2.58% of the total sequence (excluding gaps) of the assembly. Notice how much of the open reading frame (ORF) is in introns, which are spliced out during post-transcriptional modification into mature mRNA.

HG38 ASSEMBLY                  bases
total size             3,088,269,832
total sequence         2,937,639,113   100.00%

REFSEQ GENE SET
transcripts                   59,657   with complete coding region annotation
tx region                     33,246   unique transcript regions

GENE REGIONS                   bases
transcription          1,195,741,784    40.70%
ORF                    1,011,159,704    34.42%
exons                     75,884,758     2.58%
introns                  979,628,382    33.35%

The values in the table above are generated from the size of the union of assembly intervals over the set of genes, because some genes overlap. Percentages are relative to the total sequence (excluding gaps). The list of genes includes protein coding, non-coding and pseudogenes.
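
To make the "union of intervals" idea concrete, here is a minimal Python sketch. It is not the code used to build the table: the function name `union_size` and the toy gene coordinates are hypothetical, and intervals are assumed to be half-open and on a single chromosome. The last lines recompute the exon percentage from the two table values.

```python
from typing import Iterable, Tuple

def union_size(intervals: Iterable[Tuple[int, int]]) -> int:
    """Bases covered by half-open intervals [start, end), counting overlaps once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: bank it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered

# Hypothetical transcript regions on one chromosome (not real annotations).
genes = [(100, 500), (400, 900), (1_200, 1_500)]
naive = sum(end - start for start, end in genes)  # double-counts the overlap
merged = union_size(genes)                        # counts each base once

# Values taken from the table: exons as a share of total sequence (no gaps).
total_sequence = 2_937_639_113
exon_bases = 75_884_758
exon_pct = 100 * exon_bases / total_sequence

print(naive, merged, round(exon_pct, 2))
```

With the toy genes, the naive sum overstates coverage by the 100 bases shared by the first two regions, which is why the table reports union sizes. In a real analysis the union would be taken per chromosome and then summed.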

But with the CHM13v2 assembly being more than 200 million bases larger (hello heterochromatin!), we expect to find more genes. And indeed, 3,604 genes, of which 140 are protein coding, are exclusive to CHM13v2 (Table 1 in Nurk et al.).

2 · Creating the Scientific American graphic

Our goal for the August 2022 Scientific American Graphic Science page was to present the CHM13v2 assembly in the context of the efforts of the past 22 years, as the final stage in the sequencing effort. See my other Graphic Science illustrations.

HOW DID THE ASSEMBLY IMPROVE OVER THE YEARS? | The color of each 250,000-base region of each chromosome indicates when the region reached 50, 90 or 99%+ completion. Completion from previous years is carried over in grey. Scientific American (2022) 327(2):92.
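
The coloring rule in the caption can be sketched in a few lines of Python. This is only an illustration, not the pipeline behind the graphic: the function name `first_year_reached` and the per-year completion fractions are hypothetical, and I assume "completion" means the fraction of a bin's bases present in that year's assembly.

```python
from typing import Dict, Optional

# Completion thresholds named in the caption: 50, 90 and 99%+.
THRESHOLDS = (0.50, 0.90, 0.99)

def first_year_reached(completion_by_year: Dict[int, float]) -> Dict[float, Optional[int]]:
    """For one bin, return the first year each threshold was met (None if never)."""
    reached: Dict[float, Optional[int]] = {t: None for t in THRESHOLDS}
    for year in sorted(completion_by_year):
        for t in THRESHOLDS:
            if reached[t] is None and completion_by_year[year] >= t:
                reached[t] = year
    return reached

# Hypothetical completion history of a single 250,000-base bin.
bin_history = {2000: 0.35, 2001: 0.62, 2002: 0.93, 2003: 0.995}
print(first_year_reached(bin_history))
```

A bin that is already 99%+ complete in its first year gets that year for all three thresholds, and in later years it would simply be drawn in grey.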

3 · The human genome — pieces of different genomes

The reference human sequence assembly was created using DNA samples from multiple individuals.

3.1 · Genomic libraries

The bulk of the genome was sequenced from the RPCI-11 genomic library, which is a collection of about 10,000 randomly sampled pieces (each on average 180,000 bases long) from the genome of an individual male. Other parts of the genome assembly were reconstructed by sequencing RPCI-13 (female) and Caltech-D (Table 2 in Zhao (2000)) libraries.

The human genome is a variable quantity — two individuals vary, on average, by 1 base in 1,000 (or, roughly, in about 3 million positions) — and much of human genomics deals with understanding and handling this variability.

As such, the term “the human genome” is a little misleading and it's important to acknowledge that, given this natural variation between individuals (and even between cells within an individual), there is really no such thing.

When we say “the genome” we invariably mean “the reference genome”. Even then, we need to specify which version of the reference we mean. At the moment (July 2022), the hg38 (Dec 2013) assembly is considered to be the canonical reference and this may not change anytime soon. It's not uncommon (especially for several months after the release of a new assembly) to find genomic resources that use older references — updating an analysis or data pipeline to a new reference assembly is not a trivial task.

For example, well after the release of hg38 (Dec 2013), the previous reference hg19 (Feb 2009) continued to be widely used for several years. To facilitate standardizing results to the same assembly, so-called liftOver annotations allow conversion of one assembly's coordinates to another.
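
The idea behind coordinate conversion can be sketched as a lookup in a list of aligned blocks. This is a simplified illustration, not the liftOver tool itself, which works from UCSC chain files and also handles strand and split regions. The function name `lift_position`, the block layout and the coordinates below are all hypothetical.

```python
from typing import List, Optional, Tuple

# Each block is (old_start, old_end, new_start): a half-open, ungapped stretch
# of the old assembly and where that stretch begins in the new assembly.
Block = Tuple[int, int, int]

def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a position from old to new coordinates, or None if it is unmapped."""
    for old_start, old_end, new_start in blocks:
        if old_start <= pos < old_end:
            return new_start + (pos - old_start)
    return None  # the position falls where the assemblies do not align

# Hypothetical blocks: 250 bases were inserted at old position 1,000, and
# old positions 5,000-5,999 have no counterpart in the new assembly.
blocks = [(0, 1_000, 0), (1_000, 5_000, 1_250), (6_000, 9_000, 6_250)]

for pos in (500, 1_200, 5_500, 6_000):
    print(pos, lift_position(pos, blocks))
```

Positions that land between blocks cannot be converted, which is one reason moving a pipeline to a new reference takes more than a simple coordinate shift.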

3.2 · Focused efforts

As the sequence assembly matured, regions were successively filled in by targeted efforts focusing on specific regions (e.g. DNA sequence analysis of human chromosome 9) or by specific institutions (e.g. A Japanese history of the Human Genome Project).

This kind of focused effort to improve the assembly quality of a region was always part of the process. In fact, already in 2000 you can see that significant portions of chromosomes 14, 20, 21 and 22 were completed to 99%+. In 2001, chromosomes 6 and 13 received a lot of attention.

In contrast to focused sequencing, other parts of the assembly were sequenced in a shotgun fashion — by randomly sampling as many regions in the genome as possible in an effort to spread out the coverage. Early on, the shotgun effort was accelerated so that the public assembly would provide more coverage (with commensurately higher quality) than the private sector assembly created by Celera.

4 · 2000 to 2013 — zero to genome hero

Over the years 2000 to 2013, the human genome assembly progressively improved from a so-called “draft” assembly to one that, practically, may be called “finished” (finishish?). These terms can be defined in various ways: for example, based on how much of the coding sequence is represented, the contiguity of the assembly, the total coverage of the genome, or some combination of these metrics.

Individual assemblies were indexed by hgN from hg1 (May 2000) to hg38 (Dec 2013). Differences in the index value do not necessarily represent how much of the assembly changed and some indexes were skipped. And I'm still trying to locate the sequence files of hg3 (July 2000).

The hg prefix was used by the team at the UCSC Genome Browser while NCBI had their own assembly build indexes (e.g. hg10 was NCBI build 28). In fact, as of July 2022, there are 1,268 assemblies of the human genome indexed by NCBI.

5 · 2022 — CHM13 fills in the gaps

The hg38 (Dec 2013) assembly has been the canonical “reference sequence” since 2013 (and will likely remain so beyond 2022). An excellent assembly, it nevertheless contains 1,001 gaps totalling about 150 Mb across chromosomes 1–22, X and Y, as well as 430 unanchored and alternate pieces that total 121 Mb and themselves include 240 gaps of about 9 Mb.

FILLING IN THE GAPS | Before the March 2022 telomere-to-telomere assembly, the most recent human genome assembly was from 2013 (hg38). This assembly had 1,001 gaps in chromosomes 1–22, X and Y. Shown here is the size, location and distribution of these gaps. To make small regions visible, those smaller than 230 kb are shown at a fixed size.

These gaps were mostly in heterochromatic regions, which are notoriously difficult to sequence. When the size of the gap was not known, following tradition, it was set to 100 bases. Other sizes, such as 1 kb, 10 kb and 50 kb, signalled more information about the nature of the gap, such as a contig gap, short arm gap, telomere gap, and so on.
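
If you want to tally gaps yourself, here is a minimal Python sketch. It assumes gaps appear as runs of N in the sequence, as is common in assembly FASTA files. The function name `find_gaps` and the toy sequence are hypothetical, and the sketch does not try to guess which gap size corresponds to which gap type.

```python
import re
from collections import Counter
from typing import List, Tuple

def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of runs of N (or n) in a sequence."""
    return [m.span() for m in re.finditer(r"[Nn]+", sequence)]

# Toy sequence with two placeholder 100-base gaps and one 1,000-base gap.
toy = "ACGT" + "N" * 100 + "GGCC" + "N" * 1000 + "TTAA" + "N" * 100

gaps = find_gaps(toy)
size_counts = Counter(end - start for start, end in gaps)
total_gap_bases = sum(end - start for start, end in gaps)

print(gaps)
print(size_counts, total_gap_bases)
```

Run over a gapless assembly, the same scan should return an empty list.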

The CHM13v2 (Mar 2022) assembly filled in all these gaps. It is a completely gapless assembly.

news + thoughts

Convolutional neural networks

Thu 17-08-2023

Nature uses only the longest threads to weave her patterns, so that each small piece of her fabric reveals the organization of the entire tapestry. – Richard Feynman

Following up on our Neural network primer column, this month we explore a different kind of network architecture: a convolutional network.

The convolutional network replaces the hidden layer of a fully connected network (FCN) with one or more filters (a kind of neuron that looks at the input within a narrow window).

Nature Methods Points of Significance column: Convolutional neural networks. (read)

Even though convolutional networks have far fewer neurons than an FCN, they can perform substantially better for certain kinds of problems, such as sequence motif detection.

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Convolutional neural networks. Nature Methods 20:.

Background reading

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Neural network primer. Nature Methods 20:165–167.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nature Methods 13:541–542.

Neural network primer

Tue 10-01-2023

Nature is often hidden, sometimes overcome, seldom extinguished. —Francis Bacon

In the first of a series of columns about neural networks, we introduce them with an intuitive approach that draws from our discussion about logistic regression.

Nature Methods Points of Significance column: Neural network primer. (read)

Simple neural networks are just a chain of linear regressions. And, although neural network models can get very complicated, their essence can be understood in terms of relatively basic principles.

We show how neural network components (neurons) can be arranged in the network and discuss the ideas of hidden layers. Using a simple data set we show how even a 3-neuron neural network can already model relatively complicated data patterns.

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Neural network primer. Nature Methods 20:165–167.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nature Methods 13:541–542.

Cell Genomics cover

Mon 16-01-2023

Our cover on the 11 January 2023 Cell Genomics issue depicts the process of determining the parent-of-origin using differential methylation of alleles at imprinted regions (iDMRs), imagined as a circuit.

Designed in collaboration with Carlos Urzua.

Our Cell Genomics cover depicts parent-of-origin assignment as a circuit (volume 3, issue 1, 11 January 2023). (more)

Akbari, V. et al. Parent-of-origin detection and chromosome-scale haplotyping using long-read DNA methylation sequencing and Strand-seq (2023) Cell Genomics 3(1).

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Science Advances cover

Thu 05-01-2023

My cover design on the 6 January 2023 Science Advances issue depicts DNA sequencing read translation in high-dimensional space. The image shows 672 bases of sequencing barcodes, generated by three different single-cell RNA sequencing platforms, encoded as oriented triangles on the faces of three 7-dimensional cubes.

More details about the design.

My Science Advances cover that encodes sequence onto hypercubes (volume 9, issue 1, 6 January 2023). (more)

Kijima, Y. et al. A universal sequencing read interpreter (2023) Science Advances 9.

Regression modeling of time-to-event data with censoring

Thu 17-08-2023

If you sit on the sofa for your entire life, you’re running a higher risk of getting heart disease and cancer. —Alex Honnold, American rock climber

In a follow-up to our Survival analysis — time-to-event data and censoring article, we look at how regression can be used to account for additional risk factors in survival analysis.

We explore accelerated failure time regression (AFTR) and the Cox Proportional Hazards model (Cox PH).

Nature Methods Points of Significance column: Regression modeling of time-to-event data with censoring. (read)

Dey, T., Lipsitz, S.R., Cooper, Z., Trinh, Q., Krzywinski, M. & Altman, N. (2022) Points of significance: Regression modeling of time-to-event data with censoring. Nature Methods 19:1513–1515.


© 1999–2023 Martin Krzywinski | contact | Canada's Michael Smith Genome Sciences Centre | BC Cancer Research Center | BC Cancer | PHSA