Martin Krzywinski / Genome Sciences Center

triceratops: cool

EMBO Practical Course: Bioinformatics and Genome Analysis, 5–17 June 2017.

science + popular culture

Dinosaurs of the Corn—Fixing Jurassic World Science Visualization

TL;DR Get science visualization less wrong. And, if you make movies and want genome visualizations, call me.

Martin Krzywinski @MKrzywinski
Dino corn cob holder by Lana Filippone.

dinosaurs of the corn

When your science art can be enjoyed only by someone who doesn't know better, you're doing it wrong.

Do you know the difference between corn and a dinosaur? It might appear that the makers of Jurassic World don't care whether you do. Or don't know whether you care.

Which one is it? I don't know but I do care.

The Jurassic World Creation Lab is one of the web accoutrements of the Jurassic World brand and shows you how one might create a dinosaur from a sample of DNA. First extract, sequence, assemble and fill in the gaps in the DNA and then incubate in an egg and wait.

With enough time, you'll grow your own brand new dinosaur. Or a stalk of corn ... with more teeth.

What went wrong?

We can't get dinosaur genomics right, but we can get it less wrong. The assembly step uses an image from a paper about the corn genome made with my Circos software. More dinos, less corn, please. (Creation Lab, Circos)

With the exception of this note, I have practiced restraint and nowhere on this page do I describe the design practices of Jurassic World as corny.

jurassic world creation lab—very wrong genome looking awfully good

Alpha stalker.

In step #3 (assembly) in the Creation Lab, the photo is of a corn genome visualization. The image was taken from the Science publication The B73 Maize Genome: Complexity, Diversity, and Dynamics, which described the state of the reference corn genome sequenced at the McDonnell Genome Institute at Washington University. The figure was generated by the authors of the paper using my Circos software.

I should mention that the Creation Lab website does not include any attribution (e.g., Schnable et al., Science 2009) for the corn visualization image. Bad image researcher, bad.

Not only has the genome in the image crossed the hilarity boundary, its perspective skew in the composite doesn't seem to be compatible with the plane of the paper, at least to my eye. Warped humor.

The image might be an actual photo of the image printed on a piece of paper, but it feels more like a composite in which the image was superimposed on a photo of blank paper. In the random and lizard genome images I show below, I use a proper perspective projection of the image onto the paper, and do the same for the blue dino icon in the bottom right.

Tangentially, there's little reason for the researcher to be holding up what looks like a computer CPU in the photo for step #2. A flow cell would have been a good choice here and plenty such images exist.

Let's see how a more authentic image could be generated.

wrong, fake and real —lazy science in movies and why we should care

$150m, the budget of Jurassic World, is not enough to buy you a correct genome visualization. Genomics is expensive—but not that expensive. In fact, it's 44× less expensive than printing with a laser jet printer, but I digress.

Should we care that an image derived from corn genome data is being used to represent a visualization of a Triceratops horridus genome assembly? Yes, we should, unconditionally, and not just because of the inadvertent disappointment of having one's spiky dinosaur secretly replaced with a harmless plant. Cue Folger's crystals commercial (yuck).

We can't get dinosaur genomics right, but we can get it less wrong. (a) Corn genome used in Jurassic World Creation Lab website. Image is from the Science publication B73 Maize Genome: Complexity, Diversity, and Dynamics. Photo and composite by Universal Studios and Amblin Entertainment. (b) Random data on 8 chromosomes from chicken genome resized to triceratops genome size (3.2 Gb). Image by Martin Krzywinski. (c) Actual genome data for lizard genome, UCSC anoCar2.0, May 2010. Image by Martin Krzywinski. Triceratops outline in (b,c) from wikipedia. (Jurassic World Creation Lab, Triceratops outline, B73 Maize Genome: Complexity, Diversity, and Dynamics)

When your science art can be enjoyed only by someone who doesn't know better, you're doing it wrong. No meaningful conversation about the subject can continue once your audience has the answer to "What does this image show?" I can't take the corn image into a grade 5 classroom and talk about dinosaur genomics. We know so much about the world that it's trivial to get the obvious things less wrong in science fiction.


The science must always be respected. A lot of people worked very hard for us to know what most of us don't realize; let's honor that effort. Science-based art, at every opportunity, should get as many things right as possible. At the very least, it should get as few things wrong as possible.

The dinosaurs must be turning in their sedimentary beds. First, they suffer a high-throughput extinction. Now, they've been asked to trade their alpha predator status for a starchy, though extant, vegetable. Can you say extinct clade action suit?

It's true that people respond to strong visuals. But they'll respond even better to strong visuals based on relevant science. It's not just about eye-marvel. Let's see some thought-marvel—beauty that connects us to the world, informs us about it and reveals its intricacies.

dino-chickens—they're here

Dinosaur genomics isn't pure fiction. Although we can't yet grow a full dinosaur, we can create chickens with dinosaur-like snouts. Don't worry, you're unlikely to be pecked to death by one of these creations. This is a great example of the fact that characteristics of extinct animals can be found today in their evolutionary descendants.

In fact, the characteristics of evolutionary ancestors may be latent in their descendants. Jack Horner certainly hopes so: his goal is to turn a chicken into a dinosaur by reactivating its ancient DNA. Watch his TED talk.

We have a pretty good sense of the size and aspects of the structure of dinosaur genomes. For example, the paper Origin of avian genome size and structure in non-avian dinosaurs estimates the size of the triceratops genome to be about 3.2 gigabases.

The list of steps to grow your own dinosaur in the Creation Lab is quite reasonable. For additional authenticity, a synthesis step should be added. The assembly step determines the sequence (contiguous if you're lucky, gapped otherwise). It does not actually synthesize the DNA, a step that would be required for us to be able to package and implant the designer genome into the egg.

improving the Jurassic World Creation Lab genome assembly image

There are three options for the image: wrong, fake and realistic.

Let's look at each in turn.

wrong—using the corn genome

Jurassic World Creation Lab image showing corn genome image. (Jurassic World Creation Lab)
(A) Googly sequence cartoons from Jurassic World Creation Lab (B) Maize B73 reference genome with annotations. Shown are chromosome structure, genetic map, Mu insertions, methyl-filtration reads, repeats, genes, sorghum synteny, rice synteny and homoeology map. (Fig 1 from The B73 Maize Genome: Complexity, Diversity, and Dynamics)

The Jurassic World Creation Lab picks the worst of the three options. Its image (above) shows a visualization of corn genome data (figure on right, B) taken from the Science paper The B73 Maize Genome: Complexity, Diversity, and Dynamics and presents it as having something to do with dinosaurs.

Even though both have been known to stalk and neither of them is a mineral, corn and dinosaurs don't have a lot in common.

The corn image is one of the more visually striking published Circos images. It's colorful and fits with the cartoon DNA (figure on right, A) used in the Jurassic World Creation Lab website.

The corn genome is about 2.3 Gb in size and composed of 10 pairs of chromosomes. The image focuses on the similarity between corn and rice and sorghum (a kind of grass) and the corn chromosomes are shown out of order to make this similarity more clear.

fake—using random data on resized chicken chromosomes

Modified Jurassic World Creation Lab image showing random data on rescaled chicken genome.

The look and feel of the corn genome image (colors, ink density, proportions) can be reproduced in an image that uses randomly generated data. Random data is less interesting than real genome data, which I'll talk about below, but arguably more appropriate than data from a completely unrelated genome.

Since chickens are a kind of modern dinosaur, we could start with the chicken genome. My colleague Cath Ennis pointed out that a Komodo dragon genome might be more suitable to represent a triceratops. Unfortunately, we don't have a Komodo assembly yet, so that's not possible, but Cath's suggestion did lead me to generate an image based on the lizard assembly (see below).

Randomly generated data shown in the same format as tracks in corn genome image. 8 chromosomes are shown whose total size is 3.2 Gb, the approximate size of the triceratops genome. The relative size of chromosomes was modeled after the chicken genome.

I took the first 8 chromosomes of chicken (8 is the number of large chromosomes in the Varanus subgenus of lizards, to which the Komodo dragon belongs) and resized them so that their lengths totaled 3.2 Gb, the estimated size of the triceratops genome. The actual size doesn't matter at first glance, but it does add an extra touch because the tick labels on the chromosomes reflect the correct total genome size.
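The resizing step amounts to multiplying every chromosome length by the same factor. Here is a minimal Python sketch of that idea. The chromosome lengths are placeholder values chosen for illustration, not the actual chicken assembly sizes, and the function name `rescale_chromosomes` is hypothetical.

```python
# Placeholder lengths in Mb for eight chromosomes. These are illustrative
# values only, NOT the real chicken assembly sizes.
PLACEHOLDER_LENGTHS_MB = {
    "chr1": 200.0, "chr2": 150.0, "chr3": 110.0, "chr4": 90.0,
    "chr5": 60.0, "chr6": 35.0, "chr7": 36.0, "chr8": 30.0,
}

TARGET_TOTAL_MB = 3200.0  # 3.2 Gb, the estimated triceratops genome size


def rescale_chromosomes(lengths_mb, target_total_mb):
    """Scale all lengths by one factor so that they sum to target_total_mb."""
    factor = target_total_mb / sum(lengths_mb.values())
    return {name: length * factor for name, length in lengths_mb.items()}


scaled = rescale_chromosomes(PLACEHOLDER_LENGTHS_MB, TARGET_TOTAL_MB)
for name, length in scaled.items():
    print(f"{name}\t{length:.1f} Mb")
```

Because a single factor is used, the relative sizes of the chromosomes are preserved while the tick labels reflect the new total.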

I mimicked each track in the corn genome with random data, keeping the same colors.

The white, grey and black bars within the chromosome ideograms were uniformly randomly sized, up to 20 Mb. The red bar represents the centromere which was placed somewhere within 20% of the center of the chromosome and sized between 2.5% and 5% of the chromosome length, or 5 Mb, whichever was longer.
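A rough Python sketch of how such bands and a centromere could be drawn is below. It makes a few assumptions that go beyond the description above: band colors are picked at random, a 1 Mb minimum band size is used to avoid vanishing bands, and "within 20% of the center" is read as the midpoint plus or minus 20% of the chromosome length. The function names are hypothetical.

```python
import random

BAND_COLORS = ("white", "grey", "black")


def random_bands(chrom_len_mb, rng, min_band_mb=1.0, max_band_mb=20.0):
    """Tile a chromosome with bands of uniformly random size up to max_band_mb.

    min_band_mb is an assumption added here to avoid near-zero bands.
    """
    bands = []
    start = 0.0
    while start < chrom_len_mb:
        end = min(start + rng.uniform(min_band_mb, max_band_mb), chrom_len_mb)
        bands.append((start, end, rng.choice(BAND_COLORS)))
        start = end
    return bands


def random_centromere(chrom_len_mb, rng):
    """Return (start, end) of a centromere near the middle of the chromosome."""
    # Assumption: the centre lies within +/- 20% of the length of the midpoint.
    center = chrom_len_mb * rng.uniform(0.3, 0.7)
    # Size is 2.5% to 5% of the chromosome length, or 5 Mb, whichever is longer.
    size = max(chrom_len_mb * rng.uniform(0.025, 0.05), 5.0)
    start = max(center - size / 2, 0.0)
    end = min(center + size / 2, chrom_len_mb)
    return start, end


rng = random.Random(42)
bands = random_bands(400.0, rng)
centromere = random_centromere(400.0, rng)
print(len(bands), "bands; centromere at", centromere)
```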

The smooth blue heatmap, which corresponds to the recombination rate in the corn image, was generated using the function `x(1-x)^(0.75k)` where `x` is the relative position along the chromosome and `k` is the relative position of the centromere.
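To make the shape of this profile concrete, here is a small Python sketch that evaluates it at the center of each bin along a chromosome. It reads the expression as `x` multiplied by `(1-x)` raised to the power `0.75k`, which is my interpretation of the notation; the function name and the number of bins are hypothetical.

```python
def heatmap_profile(k, n_bins=100):
    """Evaluate x * (1 - x) ** (0.75 * k) at the centre of each bin.

    x is the relative position along the chromosome (0 to 1) and
    k is the relative position of the centromere (0 to 1).
    """
    values = []
    for i in range(n_bins):
        x = (i + 0.5) / n_bins
        values.append(x * (1.0 - x) ** (0.75 * k))
    return values


profile = heatmap_profile(k=0.5)
peak_bin = max(range(len(profile)), key=profile.__getitem__)
print(f"peak near x = {(peak_bin + 0.5) / len(profile):.3f}")
```

Under this reading the profile is zero at both chromosome ends and peaks at `x = 1/(1 + 0.75k)`, so the centromere position shifts where the smooth maximum falls.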

The tracks C-F (Mu insertions, MF enrichment, repeats, genes) were faked using a random coverage process as shown below.

Random heat map tracks can be generated by simulating a coverage process. The genome is covered randomly and uniformly with four different tiling elements, each with a different value. The value of the track at a given point is the sum of the height of the tiles at that position. Depending on the length, fold-coverage and height of the tiles, tracks can be made to look sparse (left) or dense (right).
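The coverage process can be sketched in a few lines of Python. The tile lengths, fold-coverages and heights below are hypothetical values chosen only to contrast a sparse and a dense track, and the genome is discretized into bins for simplicity.

```python
import random


def coverage_track(n_bins, tile_types, rng):
    """Simulate a coverage track.

    tile_types is a list of (length_in_bins, fold_coverage, height).
    Each tile is dropped at a uniformly random start position and the
    track value at a bin is the sum of heights of the tiles covering it.
    """
    track = [0.0] * n_bins
    for length, fold, height in tile_types:
        n_tiles = round(fold * n_bins / length)
        for _ in range(n_tiles):
            start = rng.randrange(n_bins)
            # Tiles that run past the end of the genome are truncated.
            for pos in range(start, min(start + length, n_bins)):
                track[pos] += height
    return track


# Hypothetical parameters: four tile types per track.
SPARSE = [(5, 0.05, 1.0), (10, 0.05, 2.0), (20, 0.04, 3.0), (40, 0.04, 4.0)]
DENSE = [(5, 1.0, 1.0), (10, 1.0, 2.0), (20, 0.5, 3.0), (40, 0.5, 4.0)]

rng = random.Random(7)
sparse_track = coverage_track(1000, SPARSE, rng)
dense_track = coverage_track(1000, DENSE, rng)
print("mean sparse:", sum(sparse_track) / 1000)
print("mean dense: ", sum(dense_track) / 1000)
```

Low fold-coverage leaves most bins at zero, which gives the sparse look, while high fold-coverage stacks tiles everywhere and produces a dense track.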

The original corn genome image showed the synteny between corn and rice and sorghum (a kind of grass). Synteny is the mapping between positions on one genome and those with the same sequence in another genome—it can tell us how much a genome was "mixed up" during evolution.

To generate the synteny tracks, I started with the 12 chromosomes of corn, using the color coding from the original image. First, I cut each chromosome once at a random position and shuffled the cut pieces. This gave me a good chance that each chromosome would be split at least once. Then I progressively added more cross-overs by selecting two pieces from the list, cutting each at a random position and swapping the second halves of the two cut pieces. The process is illustrated below for two independent simulations up to 35 cross-overs.

Generating a random synteny track is done by cutting chromosomes and shuffling the pieces.

The outer synteny track in the image is the result of 30 cross-overs and the inner of 35 cross-overs.
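The cut-and-shuffle procedure can be sketched roughly as follows in Python. Pieces are stored as `(chromosome, start, end)` with positions expressed as fractions of the source chromosome. This is one reading of the steps described above, not the exact code used for the image, and all function names are hypothetical.

```python
import random


def cut(piece, rng):
    """Cut a (chrom, start, end) piece at a random position into two pieces."""
    chrom, start, end = piece
    mid = rng.uniform(start, end)
    return (chrom, start, mid), (chrom, mid, end)


def initial_pieces(n_chroms, rng):
    """Cut every chromosome once and shuffle the resulting pieces."""
    pieces = []
    for chrom in range(1, n_chroms + 1):
        pieces.extend(cut((chrom, 0.0, 1.0), rng))
    rng.shuffle(pieces)
    return pieces


def add_crossover(pieces, rng):
    """Pick two pieces, cut each, and swap the second halves."""
    i, j = sorted(rng.sample(range(len(pieces)), 2))
    a1, a2 = cut(pieces[i], rng)
    b1, b2 = cut(pieces[j], rng)
    return pieces[:i] + [a1, b2] + pieces[i + 1:j] + [b1, a2] + pieces[j + 1:]


def random_synteny(n_chroms, n_crossovers, seed=None):
    rng = random.Random(seed)
    pieces = initial_pieces(n_chroms, rng)
    for _ in range(n_crossovers):
        pieces = add_crossover(pieces, rng)
    return pieces


outer = random_synteny(12, 30, seed=1)
inner = random_synteny(12, 35, seed=2)
print(len(outer), "pieces in outer track;", len(inner), "in inner track")
```

Each cross-over adds two pieces to the list, so the track becomes progressively more fragmented while every source chromosome remains fully represented.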

realistic—using the real assembly of a similar animal

Modified Jurassic World Creation Lab image showing the lizard genome.

The third option is to base the image on the real assembly of a reasonably closely related genome. This would mean picking one of the vertebrates for which genome annotations are available. I chose the lizard.

Lizard genome showing actual assembly and annotation data. (anoCar 2.0, May 2010)

The lizard genome assembly has 14 chromosomes (chr1..chr6, chrLGa..chrLGh and chrM) which total 1.08 Gb and 6,443 unanchored pieces which total 717 Mb. I decided to create the image based only on chromosomes 1–6. The LG chromosomes were much smaller (LGa, the largest, is more than 10 times smaller than the next larger, chr6).

I used the UCSC Genome Browser table viewer to download a variety of annotations for the lizard genome (assembly, gaps, quality, GC content, CpG islands, gene models, and alignments to human genome and genes).

I parsed each annotation file and calculated statistics for each segment in the genome of size `w`, which was either `g/250` or `g/500` depending on the annotation. Here, `g` is the total size of the chromosomes shown in the image (1.06 Gb). The two outer-most tracks, the GC content and gene models, used `g/500` to provide two resolution scales in the image for visual interest.
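As an illustration of this kind of windowed summary, here is a minimal Python sketch that computes the window size from `g` and then the fraction of each window covered by a set of features. The fraction-covered statistic is just an example of a per-window value, since the actual statistic differs by annotation, and the function names and toy intervals are hypothetical.

```python
G_SHOWN = 1_060_000_000  # total size of chromosomes shown in the image (1.06 Gb)


def window_size(g, n_windows):
    """Window size w for a genome of size g split into n_windows windows."""
    return g // n_windows


def fraction_covered(intervals, chrom_len, w):
    """Fraction of each window of size w covered by features.

    intervals are non-overlapping half-open (start, end) positions.
    The last window may be shorter than w.
    """
    n = -(-chrom_len // w)  # ceiling division
    covered = [0] * n
    for start, end in intervals:
        for b in range(start // w, (end - 1) // w + 1):
            lo = max(start, b * w)
            hi = min(end, (b + 1) * w)
            covered[b] += hi - lo
    return [c / min(w, chrom_len - b * w) for b, c in enumerate(covered)]


print(window_size(G_SHOWN, 250), window_size(G_SHOWN, 500))
# Toy example: a 300 bp "chromosome" with 100 bp windows.
print(fraction_covered([(0, 50), (150, 250)], 300, 100))
```

Halving the window size for the two outer-most tracks doubles the number of segments, which is what gives those tracks their finer texture.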

I've used colors vaguely similar to those used in the corn image. The actual colors for the lizard genome image are drawn from the Brewer palettes as well as from a luminance-normalized UCSC human chromosome color palette.

The image below is a linearized version of the Circos image and describes what each track shows.

Lizard genome showing actual assembly and annotation data. (anoCar 2.0, May 2010)

With more work, you could perturb the lizard data so that the data weren't exactly that of the lizard. Or use phylogenetic information to model the entire triceratops genome!

So that's it. Jurassic World science visualization fixed, or at least improved.


news + thoughts

Tabular Data

Tue 11-04-2017
Tabulating the number of objects in categories of interest dates back to the earliest records of commerce and population censuses.

After 30 columns, this is our first one without a single figure. Sometimes a table is all you need.

In this column, we discuss nominal categorical data, in which data points are assigned to categories in which there is no implied order. We introduce one-way and two-way tables and the `\chi^2` and Fisher's exact tests.

Altman, N. & Krzywinski, M. (2017) Points of Significance: Tabular data. Nature Methods 14:329–330.

...more about the Points of Significance column

Happy 2017 `\pi` Day—Star Charts, Creatures Once Living and a Poem

Tue 14-03-2017

on a brim of echo,

capsized chamber
drawn into our constellation, and cooling.
—Paolo Marcazzan

Celebrate `\pi` Day (March 14th) with a star chart of the digits. The charts draw 40,000 stars generated from the first 12 million digits.

12,000,000 digits of `\pi` interpreted as a star catalogue.

The 80 constellations are extinct animals and plants. Here you'll find old friends and new stories. Read about how Desmodus is always trying to escape or how Megalodon terrorizes the poor Tecopa! Most constellations have a story.

Find friends and stories among the 80 constellations of extinct animals and plants. Oh look, a Dodo guarding his eggs!

This year I collaborate with Paolo Marcazzan, a Canadian poet, who contributes a poem, Of Black Body, about space and things we might find and lose there.

Check out art from previous years: 2013 `\pi` Day, 2014 `\pi` Day, 2015 `\pi` Day and 2016 `\pi` Day.

Data in New Dimensions: convergence of art, genomics and bioinformatics

Tue 07-03-2017

Art is science in love.
— E.F. Weisslitz

A behind-the-scenes look at the making of our stereoscopic images, which were on display at the AGBT 2017 Conference in February. The art is a creative collaboration with Becton Dickinson and The Linus Group.

Its creation began with the concept of differences, and my writeup of the creative and design process focuses on storytelling and how the concept of differences is incorporated into the art.

Oh, and this might be a good time to pick up some red-blue 3D glasses.

BD Genomics 3D art exhibit - AGBT 2017 / Martin Krzywinski @MKrzywinski
A stereoscopic image and its interpretive panel of single-cell transcriptomes of blood cells: diseased versus healthy control.

Interpreting P values

Thu 02-03-2017
A P value measures a sample’s compatibility with a hypothesis, not the truth of the hypothesis.

This month we continue our discussion about `P` values and focus on the fact that a `P` value is a probability statement about the observed sample in the context of a hypothesis, not about the hypothesis being tested.

Nature Methods Points of Significance column: Interpreting P values.

Given that we are always interested in making inferences about hypotheses, we discuss how `P` values can be used to do this by way of the Benjamin-Berger bound, `\bar{B}`, on the Bayes factor, `B`.

Heuristics such as these are valuable in helping to interpret `P` values, though we stress that `P` values vary from sample to sample and hence many sources of evidence need to be examined before drawing scientific conclusions.

Altman, N. & Krzywinski, M. (2017) Points of Significance: Interpreting P values. Nature Methods 14:213–214.

Background reading

Krzywinski, M. & Altman, N. (2017) Points of significance: P values and the search for significance. Nature Methods 14:3–4.

Krzywinski, M. & Altman, N. (2013) Points of significance: Significance, P values and t–tests. Nature Methods 10:1041–1042.

...more about the Points of Significance column

Snellen Charts—Typography to Really Look at

Sat 18-02-2017

Another collection of typographical posters. These ones really ask you to look.

Snellen charts designed using physical constants, Braille and elemental abundances in the universe and human body.

The charts show a variety of interesting symbols and operators found in science and math. The design is in the style of a Snellen chart and typeset with the Rockwell font.