The information graphic showing the history of the human genome assembly is part of my series of designs created for the Scientific American Graphic Science page. Senior Graphics Editor Jen Christiansen and I have looked at everything from the evolution of the genomes of SARS-CoV-2 strains to how pets contribute to the bacterial flora in your home.
Most of the art is available for purchase as framed prints and, yes, even pillows. Sleep's never been more important — I take custom requests.

History of the Human Genome Assembly

22 years, 3,117,275,501 bases and 0 gaps later

Round numbers are always false.
— Samuel Johnson

Here, you can expect to learn about the history of the human genome assembly (and a little bit about its structure) and see a map of its completion over the last 22 years, which was published in Scientific American Graphic Science.

History of the Human Genome Assembly (22 years, 3,117,275,501 bases and 0 gaps later) -- science + art + data visualization / Martin Krzywinski / Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
THE HUMAN GENOME ASSEMBLY: YEAR BY YEAR | The history of the human genome assembly over the last 22 years. Most of the activity happened during 2000–2003 and 2022, with relatively minor changes in the intervening years. Colors of each region indicate that the region reached 50, 90 or 99%+ completion in that year.

1 · The first gapless assembly of a human genome

In March 2022, a flurry of publications announced the first ever complete assembly of a human genome. This is a big deal because, in genomics, we don't throw around words like “complete” very often. Actually, never — until now.

Because the human genome — a human genome — is complete. All hail the CHM13v2 (Mar 2022) telomere-to-telomere (T2T) assembly.

1.1 · Finished... again?

You've probably already heard — and have been hearing for the last 15 years — that the human genome has been sequenced. After all, how else could there be a publication with the title “Finishing the euchromatic sequence of the human genome” (2004 Nature 431:931–945)?

In 2004, the only thing that stood between “finished” and “complete” was that pesky word “euchromatic”.

“Euchromatin is a lightly packed form of chromatin (DNA, RNA, and protein) that is enriched in genes, and is often (but not always) under active transcription. Euchromatin stands in contrast to heterochromatin, which is tightly packed and less accessible for transcription. 92% of the human genome is euchromatic.” — Wikipedia

For modeling and analysis — in cancer research, for example, which is what we do here — by far the most important parts of the human genome assembly are the parts that code for protein (transcribed regions and their ORFs), along with their adjacent regulatory sequences.

These regions in total occupy less than half of the genome. The parts that are ultimately translated into protein (exons) account for just 2.58% of the genome. For the vast majority of genes, the sequence in these regions was indeed finished.

IT'S IN THE GENES | The relationship between transcribed region (TX), open reading frame (ORF) and protein coding region (CDS) in a gene. Size of elements is illustrative. Modified from Wikipedia.

In hg38 (Dec 2013), the most recent assembly prior to the complete telomere-to-telomere CHM13v2 assembly, exons cover only 2.58% of the total sequence (excluding gaps). Notice how much of the open reading frame (ORF) is in introns, which are spliced out during post-transcriptional modification into mature mRNA.

HG38 ASSEMBLY              bases
  total size       3,088,269,832
  total sequence   2,937,639,113   100.00%

REFSEQ GENE SET
  transcripts             59,657   with complete coding region annotation
  tx region               33,246   unique transcript regions

GENE REGIONS               bases
  transcription    1,195,741,784    40.70%
  ORF              1,011,159,704    34.42%
  exons               75,884,758     2.58%
  introns            979,628,382    33.35%

The values in the table above are generated from the size of the union of assembly intervals over the set of genes, because some genes overlap. The list of genes includes protein coding, non-coding and pseudogenes.
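Taking the size of a union of intervals, rather than summing gene lengths, can be sketched as follows. This is a minimal illustration and not the code used to build the table: the function names and the three gene intervals are hypothetical, intervals are assumed to be half-open and on a single chromosome, and a real calculation would presumably repeat this per chromosome and add the results.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def union_size(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical transcribed regions of three genes; the first two overlap,
# so the bases they share are counted only once in the union.
genes = [(100, 500), (400, 900), (2_000, 2_300)]
naive_total = sum(end - start for start, end in genes)
covered = union_size(genes)
print(naive_total, covered)
```

Summing the gene lengths directly would count the overlapping bases twice, which is why the union is the more meaningful quantity when reporting a fraction of the assembly.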

But with the CHM13v2 assembly being more than 200 million bases larger (hello heterochromatin!), we expect to find more genes. And indeed, 3,604 genes, of which 140 are protein coding, are exclusive to CHM13v2 (Table 1 in Nurk et al.).

2 · Creating the Scientific American graphic

Our goal for the August 2022 Scientific American Graphic Science page was to present the CHM13v2 assembly in the context of the efforts of the past 22 years, as the final stage of the sequencing effort. See my other Graphic Science illustrations.

HOW DID THE ASSEMBLY IMPROVE OVER THE YEARS? | The color of each 250,000-base region of each chromosome indicates when the region reached 50, 90 or 99%+ completion. Completion from previous years is carried over in grey. Scientific American (2022) 327(2):92.
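The coloring rule in the caption can be sketched as a small calculation: for each fixed-size region, find the first year in which its completion reached a given threshold. This is only an illustration under stated assumptions. The input format (a per-year list of completion fractions, one per region), the function name and the numbers are all hypothetical and are not taken from the data behind the graphic.

```python
from typing import Dict, List, Optional

THRESHOLDS = (0.50, 0.90, 0.99)  # the 50, 90 and 99%+ levels from the caption


def first_year_reaching(completion_by_year: Dict[int, List[float]],
                        threshold: float) -> List[Optional[int]]:
    """For each region, return the first year its completion met the threshold.

    completion_by_year maps a year to the fraction of each region that was
    sequenced in that year's assembly (hypothetical input format).
    """
    years = sorted(completion_by_year)
    n_regions = len(completion_by_year[years[0]])
    result: List[Optional[int]] = [None] * n_regions
    for year in years:
        for i, fraction in enumerate(completion_by_year[year]):
            # Record only the first year; later years are "carried over".
            if result[i] is None and fraction >= threshold:
                result[i] = year
    return result


# Made-up completion fractions for four regions across four assembly years.
completion = {
    2000: [0.60, 0.10, 0.995, 0.00],
    2001: [0.92, 0.55, 0.995, 0.00],
    2003: [0.99, 0.95, 1.00, 0.00],
    2022: [1.00, 1.00, 1.00, 1.00],
}
milestones = {t: first_year_reaching(completion, t) for t in THRESHOLDS}
for threshold, years in milestones.items():
    print(threshold, years)
```

In this toy input the last region stays empty until 2022, which mirrors the kind of region that only the gapless assembly filled in.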
Year-by-year history of the human genome assembly (1Mb bins) by Martin Krzywinski
MY DATA + SCIENCE ART ON A WALL | Most of my artwork is available for sale as prints and other items. (buy artwork / see all my art)

3 · The human genome — pieces of different genomes

The reference human sequence assembly was created using DNA samples from multiple individuals.

3.1 · Genomic libraries

The bulk of the genome was sequenced from the RPCI-11 genomic library, which is a collection of about 10,000 randomly sampled pieces (each on average 180,000 bases long) from the genome of an individual male. Other parts of the genome assembly were reconstructed by sequencing RPCI-13 (female) and Caltech-D (Table 2 in Zhao (2000)) libraries.

The human genome is a variable quantity — two individuals vary, on average, by 1 base in 1,000 (roughly 3 million positions) — and much of human genomics deals with understanding and handling this variability.

As such, the term “the human genome” is a little misleading and it's important to acknowledge that, given this natural variation between individuals (and even between cells within an individual), there is really no such thing.

When we say “the genome” we invariably mean “the reference genome”. Even then, we need to specify which version of the reference we mean. At the moment (July 2022), the hg38 (Dec 2013) assembly is considered to be the canonical reference and this may not change anytime soon. It's not uncommon (especially for several months after the release of a new assembly) to find genomic resources that use older references — updating an analysis or data pipeline to a new reference assembly is not a trivial task.

For example, well after the release of hg38 (Dec 2013), the previous reference hg19 (Feb 2009) continued to be widely used for several years. To facilitate standardizing results to the same assembly, so-called liftOver annotations allow conversion of one assembly's coordinates to another.
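The idea behind converting coordinates between assemblies can be illustrated with a toy example. The sketch below is not the liftOver tool and does not read its chain files. It assumes a hypothetical list of ungapped aligned blocks on the same chromosome and strand, and maps a position by finding the block that contains it. The `Block` type, the `lift` function and all coordinates are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned block between two assemblies (same strand)."""
    old_start: int  # 0-based start in the old assembly
    new_start: int  # 0-based start in the new assembly
    size: int       # number of aligned bases


def lift(position: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    Returns None when the position falls outside every aligned block,
    i.e. it has no counterpart in the new assembly.
    """
    for block in blocks:
        offset = position - block.old_start
        if 0 <= offset < block.size:
            return block.new_start + offset
    return None


# Hypothetical alignment: 250 bases were inserted in the new assembly after
# position 1,000, and old positions 1,500 to 1,699 have no counterpart.
blocks = [
    Block(old_start=0, new_start=0, size=1_000),
    Block(old_start=1_000, new_start=1_250, size=500),
    Block(old_start=1_700, new_start=1_750, size=300),
]

for old in (10, 1_200, 1_600, 1_800):
    print(old, "->", lift(old, blocks))
```

Positions that return `None` are the ones that make updating a pipeline non-trivial: some features simply cannot be carried over to the other assembly.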

3.2 · Focused efforts

As the sequence assembly matured, regions were successively filled in by targeted efforts focusing on specific regions (e.g. DNA sequence analysis of human chromosome 9) or by specific institutions (e.g. A Japanese history of the Human Genome Project).

This kind of focused effort to improve the assembly quality of a region was always part of the process. In fact, already in 2000 you can see that significant portions of chromosomes 14, 20, 21 and 22 were completed to 99%+. In 2001, chromosomes 6 and 13 received a lot of attention.

In contrast to focused sequencing, other parts of the assembly were sequenced in a shotgun fashion — by randomly sampling as many regions in the genome as possible in an effort to spread out the coverage. Early on, the shotgun effort was accelerated so that the public assembly would provide more coverage (with commensurately higher quality) than the private sector assembly created by Celera.

4 · 2000 to 2013 — zero to genome hero

Over the years 2000 to 2013, the human genome assembly progressively improved from a so-called “draft” assembly to one that, practically, may be called “finished” (finishish?). These terms can be defined in various ways — for example, based on how much of the coding sequence is represented, the contiguity of the assembly, the total coverage of the genome, or some combination of these metrics.

Individual assemblies were indexed by hgN from hg1 (May 2000) to hg38 (Dec 2013). Differences in the index value do not necessarily represent how much of the assembly changed and some indexes were skipped. And I'm still trying to locate the sequence files of hg3 (July 2000).

The hg prefix was used by the team at the UCSC Genome Browser while NCBI had their own assembly build indexes (e.g. hg10 was NCBI build 28). In fact, as of July 2022, there are 1,268 assemblies of the human genome indexed by NCBI.

5 · 2022 — CHM13 fills in the gaps

The hg38 (Dec 2013) assembly has been the canonical “reference sequence” since 2013 (and will likely remain so beyond 2022). An excellent assembly, it nevertheless contains 1,001 gaps totalling about 150 Mb across chromosomes 1–22, X and Y, as well as 430 unanchored and alternate pieces that total 121 Mb and themselves include 240 gaps of about 9 Mb.

FILLING IN THE GAPS | Before the March 2022 telomere-to-telomere assembly, the most recent human genome assembly was from 2013 (hg38). This assembly had 1,001 gaps in chromosomes 1–22, X and Y. Shown here is the size, location and distribution of these gaps. To make small regions visible, those smaller than 230 kb are shown at a fixed size.

These gaps were mostly in heterochromatic regions, which are notoriously difficult to sequence. When the size of the gap was not known, following tradition, it was set to 100 bases. Other sizes, such as 1 kb, 10 kb and 50 kb, signalled more information about the nature of the gap, such as a contig gap, short arm gap, telomere gap, and so on.
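A tally of gap sizes like the one described above can be sketched by scanning a sequence for gap runs. The sketch assumes that gaps appear in the sequence as runs of the letter N, which the text does not state, and it uses a short made-up sequence rather than a real chromosome. The `find_gaps` function is a hypothetical helper.

```python
import re
from collections import Counter
from typing import List, Tuple


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return (start, length) for every run of N in the sequence.

    Starts are 0-based; matching is case-insensitive. Treating runs of N
    as gaps is an assumption of this sketch.
    """
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]


# Toy sequence with two 100-base gaps (unknown size, by convention)
# and one 1,000-base gap.
toy = "ACGT" + "N" * 100 + "GGCC" + "N" * 1_000 + "TTAA" + "N" * 100

gaps = find_gaps(toy)
size_counts = Counter(length for _, length in gaps)
total_gap_bases = sum(length for _, length in gaps)

print(gaps)
print(size_counts)
print(total_gap_bases)
```

Grouping gaps by size in this way is what would reveal the conventional sizes, since round values such as 100 bases occur far more often than a measured gap length would.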

The CHM13v2 (Mar 2022) assembly filled in all these gaps. It is a completely gapless assembly.

news + thoughts

Annals of Oncology cover

Wed 14-09-2022

My cover design on the 1 September 2022 Annals of Oncology issue shows 570 individual cases of difficult-to-treat cancers. Each case shows the number and type of actionable genomic alterations that were detected and the length of therapies that resulted from the analysis.

An organic arrangement of 570 individual cases of difficult-to-treat cancers showing genomic changes and therapies. Appears on Annals of Oncology cover (volume 33, issue 9, 1 September 2022).

Pleasance E et al. Whole-genome and transcriptome analysis enhances precision cancer treatment options (2022) Annals of Oncology 33:939–949.

My Annals of Oncology 570 cancer cohort cover (volume 33, issue 9, 1 September 2022).

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs.

Survival analysis—time-to-event data and censoring

Fri 05-08-2022

Love's the only engine of survival. —L. Cohen

We begin a series on survival analysis in the context of its two key complications: skew (which calls for the use of probability distributions, such as the Weibull, that can accommodate skew) and censoring (required because we almost always fail to observe the event in question for all subjects).

We discuss right, left and interval censoring and how mishandling censoring can lead to bias and loss of sensitivity in tests that probe for differences in survival times.

Nature Methods Points of Significance column: Survival analysis—time-to-event data and censoring.

Dey, T., Lipsitz, S.R., Cooper, Z., Trinh, Q., Krzywinski, M. & Altman, N. (2022) Points of significance: Survival analysis—time-to-event data and censoring. Nature Methods 19:906–908.

3,117,275,501 Bases, 0 Gaps

Sun 21-08-2022

See How Scientists Put Together the Complete Human Genome.

My graphic in Scientific American's Graphic Science section in the August 2022 issue shows the full history of the human genome assembly — from its humble shotgun beginnings to the gapless telomere-to-telomere assembly.

Read about the process and methods behind the creation of the graphic.

3,117,275,501 Bases, 0 Gaps. Text by Clara Moskowitz (Senior Editor), art direction by Jen Christiansen (Senior Graphics Editor), source: UCSC Genome Browser.

See all my Scientific American Graphic Science visualizations.

Anatomy of SARS-CoV-2

Tue 31-05-2022

My poster showing the genome structure and position of mutations on all SARS-CoV-2 variants appears in the March/April 2022 issue of American Scientist.

Deadly Genomes: Genome Structure and Size of Harmful Bacteria and Viruses

An accompanying piece breaks down the anatomy of each genome — by gene and ORF, oriented to emphasize relative differences that are caused by mutations.


Cancer Cell cover

Sat 23-04-2022

My cover design on the 11 April 2022 Cancer Cell issue depicts cellular heterogeneity as a kaleidoscope generated from immunofluorescence staining of the glial and neuronal markers MBP and NeuN (respectively) in a GBM patient-derived explant.

LeBlanc VG et al. Single-cell landscapes of primary glioblastomas and matched explants and cell lines show variable retention of inter- and intratumor heterogeneity (2022) Cancer Cell 40:379–392.E9.

My Cancer Cell kaleidoscope cover (volume 40, issue 4, 11 April 2022).


© 1999–2022 Martin Krzywinski | contact | Canada's Michael Smith Genome Sciences Centre | BC Cancer Research Center | BC Cancer | PHSA