The Earth BioGenome Project (EBP) seeks to create a new foundation for biology to drive solutions for preserving biodiversity and sustaining human societies.

PNAS cover — Earth BioGenome Project

An enlightened perspective on our planet

Lewin HA et al., The Earth BioGenome Project 2020: Starting the clock. (2022) PNAS 119(4) e2115635118

1 · A world of sequence

The design represents a genomics perspective on the biodiversity of the Earth.

The basis of the design is the shapes of the continents. These were taken from the award-winning Authagraph Map, which is a projection that strongly preserves shapes and areas. Wired calls it “not perfect, but close”.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
Authagraph map by Hajime Narukawa. Shapes and areas are strongly preserved.

I rearranged the Authagraph map to fit the aspect ratio of the cover image (nearly 1:1) and leave enough room for text on the cover. To make things interesting, I put Antarctica in the center: north is everywhere. Given how the image is constructed (see below), I had to remove some islands. Sorry, Madagascar: you're a star.

Layout of authagraph continents.

Australia has the job of representing all of Oceania, such as Inaccessible Island.

2 · Journey of fewest steps

For each continent and ocean, a travelling salesman problem (TSP) was solved to find the shortest (or close to it) path that covers the area. The path joins points that were sampled at a level of detail that I thought would strike a good balance between detail and legibility.

Some continents do not have full coverage by the path: some small islands and peninsulas were excluded (or extended). This needed to be done to give room for the ocean path between continents and to avoid situations where the ocean path crossed a landmass. The shortest solution isn't always the most appropriate.
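
To give a feel for the path-finding step, here is a minimal sketch of a common TSP heuristic: a nearest-neighbour tour that is then improved with 2-opt moves. This is an illustration under assumptions, not necessarily the solver used for the cover. The function names are hypothetical and the four-point input is a toy example.

```python
import math

def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
               for i in range(n))

def nearest_neighbour_tour(points):
    """Greedy start: always walk to the closest unvisited point."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order

def two_opt(points, order):
    """Repeatedly reverse a segment whenever that shortens the tour."""
    order = order[:]
    n = len(order)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a vertex in a closed tour
                a, b = points[order[i]], points[order[i + 1]]
                c, d = points[order[j]], points[order[(j + 1) % n]]
                old = math.dist(a, b) + math.dist(c, d)
                new = math.dist(a, c) + math.dist(b, d)
                if new < old - 1e-12:
                    order[i + 1:j + 1] = order[i + 1:j + 1][::-1]
                    improved = True
    return order

points = [(0, 0), (1, 1), (1, 0), (0, 1)]   # corners of a unit square
crossing = [0, 1, 2, 3]                     # deliberately self-crossing tour
untangled = two_opt(points, crossing)
print(tour_length(points, crossing), tour_length(points, untangled))
```

A heuristic like this gives a short path, not a guaranteed shortest one, which matches the "shortest (or close to it)" goal described above.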

A TSP path for each land mass and ocean.

The number of points visited for each landmass and the ocean was

  1,575 africa
    848 antarctica
  2,818 asia
    475 australia
    713 europe
  1,674 namerica
 17,354 ocean
    872 samerica
 26,329 ALL

I then smoothed the path using a J-spline curve.

A smoothed J-spline of the TSP path for each land mass and ocean. The original TSP path is shown as a thin line.
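
J-splines are usually described as a subdivision family with a single parameter s: s = 0 reproduces the interpolating four-point scheme and s = 1 the cubic B-spline. The sketch below applies one such subdivision step to a closed polygon. The value of s used for the cover is not stated, so s here is a placeholder, and the function name and the closed-path treatment are assumptions.

```python
def j_spline_step(points, s=1.0):
    """One J-spline subdivision step on a closed 2D polygon.

    Returns twice as many points: for each input vertex, a repositioned
    vertex followed by a new point on the edge to the next vertex.
    """
    n = len(points)
    refined = []
    for j in range(n):
        # Neighbours p[j-1], p[j], p[j+1], p[j+2], wrapping around the polygon
        a, b, c, d = (points[(j + k) % n] for k in (-1, 0, 1, 2))
        vertex = tuple((s * a[i] + (8 - 2 * s) * b[i] + s * c[i]) / 8
                       for i in range(2))
        edge = tuple(((s - 1) * a[i] + (9 - s) * b[i]
                      + (9 - s) * c[i] + (s - 1) * d[i]) / 16
                     for i in range(2))
        refined.extend([vertex, edge])
    return refined

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
smooth = square
for _ in range(3):          # each step doubles the number of points
    smooth = j_spline_step(smooth, s=1.0)
print(len(smooth))
```

With s = 0 the original points stay where they are and the curve passes through them; larger s trades that interpolation for a smoother curve.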

3 · Adding the double helix

The DNA double helix is a trope — let's embrace it.

The smoothed J-spline was used to create a double helix. The parameters for the helix were selected to balance detail with clarity.

And before you start pixel peeping: the helix doesn't have handedness. It's composed of three independent stacked layers: one layer per strand and one layer with bonds.
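
One way to picture the three layers is as two sine-offset strands on either side of the path plus a set of bonds joining them. The following is a minimal sketch under that assumption. The function name, the `amplitude` and `wavelength` parameters and the sine offset are illustrative choices, not the parameters used for the cover.

```python
import math

def double_helix(path, amplitude=1.0, wavelength=10.0):
    """Build two strands and their bonds along a densely sampled 2D path.

    Each strand is the path shifted along its local normal by a sine of
    the distance travelled; the two strands use opposite signs.
    """
    strand_a, strand_b, bonds = [], [], []
    travelled = 0.0
    last = len(path) - 1
    for k, (x, y) in enumerate(path):
        if k > 0:
            travelled += math.dist(path[k - 1], (x, y))
        # Tangent estimated from the neighbouring samples
        x0, y0 = path[max(k - 1, 0)]
        x1, y1 = path[min(k + 1, last)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty) or 1.0
        nx, ny = -ty / norm, tx / norm      # unit normal to the path
        offset = amplitude * math.sin(2 * math.pi * travelled / wavelength)
        a = (x + offset * nx, y + offset * ny)
        b = (x - offset * nx, y - offset * ny)
        strand_a.append(a)
        strand_b.append(b)
        bonds.append((a, b))                # one bond per path sample
    return strand_a, strand_b, bonds

path = [(float(i), 0.0) for i in range(21)]   # toy straight path
strand_a, strand_b, bonds = double_helix(path, amplitude=2.0, wavelength=20.0)
print(strand_a[5], strand_b[5])
```

Because the three lists are drawn as flat layers, neither strand passes in front of the other, which is why a helix built this way has no handedness.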

Once the helix was drawn, I added bases as circles and resized them to increase visibility while avoiding overlap.

Bases are added to the double helix.

The number of base pairs per area is

              nbp
 australia    596 
    europe    872 
antarctica  1,024
  samerica  1,064
    africa  1,911
  namerica  2,033 
      asia  3,403
     ocean 21,208
           ------
           32,111

4 · Adding sequence

I mapped species sequenced as part of the Earth BioGenome Project to the path that corresponded to their habitat location. This was done using a combination of automated searches through the Global Biodiversity Information Facility (GBIF) and manual corrections.

For a given path, the length of the sequence of each species was fixed and depended on the length of the area's double helix and number of species assigned to it.

For example, in Africa we have 1,911 bases to distribute across 87 species, giving us `floor(1911/87)-1=20` bases per species. The quotient is rounded down to the nearest integer and the `-1` term allows for a gap between consecutive sequences.

The ocean area is very large and has relatively few species, so we can have long sequences — 396 bases per species.

              nbp   species  bp_per_species 
 australia    596        71               7  
    europe    872       113               6  
antarctica  1,024         2             511  
  samerica  1,064        93              10  
    africa  1,911        87              20  
  namerica  2,033       182              10  
      asia  3,403       206              15  
     ocean 21,208        52             396  
           ------       ---                  
           32,111       806                  
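
The per-species length can be reproduced from the table with integer division. This is a small sketch using the land rows copied from the table; the function and variable names are made up for the example.

```python
# (bases on the area's helix, species assigned to it), copied from the table
land_areas = {
    "australia": (596, 71),
    "europe": (872, 113),
    "antarctica": (1024, 2),
    "samerica": (1064, 93),
    "africa": (1911, 87),
    "namerica": (2033, 182),
    "asia": (3403, 206),
}
# The ocean row lists 396, which this simple formula does not reproduce
# (21208 // 52 - 1 is 406), so it is left out of the sketch.

def bases_per_species(nbp, species):
    """Round the quotient down, then reserve one base for a gap."""
    return nbp // species - 1

for name, (nbp, species) in land_areas.items():
    print(f"{name:>10} {bases_per_species(nbp, species):>4}")
```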

Any bases left over on the helix were assigned to gaps. Initially, I thought of concatenating the sequences without any gaps. While this would allow me to represent (negligibly) longer sequences, I was persuaded against this idea (thank you Emma!).

First, the Earth BioGenome sequencing isn't complete (nor will it ever be), so gaps in the sequence are a nice metaphor for this. Second, the gaps offer some hope of actually reading the sequences off for each species. Below is a view of one of the strands with the backbone and base pair locations within gaps shown in magenta.

Distribution of gaps (magenta) in the sequence path. Gaps add up to about 5.5% of the total bases in the design.
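
The gap share can be checked from the table of bases per species, assuming every base not covered by a species sequence counts as a gap. The rows below are copied from that table; the variable names are made up for the example.

```python
# (area, bases on helix, species, bases per species), copied from the table
rows = [
    ("australia", 596, 71, 7),
    ("europe", 872, 113, 6),
    ("antarctica", 1024, 2, 511),
    ("samerica", 1064, 93, 10),
    ("africa", 1911, 87, 20),
    ("namerica", 2033, 182, 10),
    ("asia", 3403, 206, 15),
    ("ocean", 21208, 52, 396),
]

total_bases = sum(nbp for _, nbp, _, _ in rows)
sequence_bases = sum(species * bp for _, _, species, bp in rows)
gap_bases = total_bases - sequence_bases

print(f"{gap_bases} gap bases, {100 * gap_bases / total_bases:.1f}% of {total_bases}")
```

With these numbers the gaps come to roughly 5.4% of the 32,111 bases, in line with the "about 5.5%" quoted in the caption.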

In the final figure, the helix backbone in gaps is faded.

The double helix is faded in gaps.

In the final figure, bases are colored by nucleotide (A, T, C, G) and region (land vs ocean).

Bases of the sequence are encoded by colored circles. The first base of the first sequence on a path is indicated by a (hard-to-find) black triangle.

5 · Download sequences

You can download each species' sequence and its location.

Sequences were sampled from the GenBank records starting at the first non-repeated base. This was a quick way to avoid spamming the design with homopolymer runs such as poly-A tails.

For example, the sequence for the Hanuman langur

>PVII010000001.1:1-1000 Semnopithecus entellus isolate BS30 SemEnt_scaffold_0, WGS
AAAAA AAAAA AAAAA AAAAA AAAAA AAATT ATTTT GCTAA GGTCT GAGAA CTCCA GAAGG TGGTG TCGTG
----- ----- ----- ----- ----- ---++ +++++ +++++ +++++ +++++ +++++ +++++ +++++ +++++
                                 ** ***** ***** ***

was sampled by skipping the first 28 bases (A's) and uses the sequence indicated by * above.
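
The sampling rule can be sketched as "skip the leading run of the first base, then take the next n bases". This reading of "first non-repeated base" is an assumption, as are the function name and the choice of n = 15 (the number of bases marked with * above).

```python
def sample_sequence(record, n):
    """Skip the leading homopolymer run and return (bases skipped, next n bases)."""
    seq = record.replace(" ", "").upper()
    if not seq:
        return 0, ""
    # Count how many copies of the first base open the record
    skipped = len(seq) - len(seq.lstrip(seq[0]))
    return skipped, seq[skipped:skipped + n]

langur = ("AAAAA AAAAA AAAAA AAAAA AAAAA AAATT ATTTT GCTAA GGTCT "
          "GAGAA CTCCA GAAGG TGGTG TCGTG")
skipped, sample = sample_sequence(langur, 15)
print(skipped, sample)   # prints: 28 TTATTTTGCTAAGGT
```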

6 · What's where

The position and sequence for each species are shown below. The first base of the sequence is capitalized and the remaining bases are lowercase. The numeric code below the species name is the taxon, a unique taxonomy ID used by GenBank.

Discover what's where. Shown are the species, taxon (unique taxonomy ID) and sequence for each of the 806 species in the design. Each species and taxon label is placed at the average (x,y) position of that species' bases. It'll do.
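
The label placement described in the caption is a plain centroid. Here is a minimal sketch, with a hypothetical function name and toy coordinates.

```python
def label_position(base_positions):
    """Average (x, y) of the circles that make up one species' sequence."""
    xs, ys = zip(*base_positions)
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Four bases on the corners of a small square
print(label_position([(0, 0), (2, 0), (2, 2), (0, 2)]))   # prints: (1.0, 1.0)
```

Where a sequence follows a tight curve, the average can land beside the path instead of on it, which is presumably why the caption settles for "It'll do".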

A version of the above without the sequence is shown below.

Discover what's where. Shown are the species name and taxon (unique taxonomy ID), without the sequence, for each of the 806 species in the design. Each species and taxon label is placed at the average (x,y) position of that species' bases. It'll do.
news + thoughts

NASA to send our human genome discs to the Moon

Sat 23-03-2024

We'd like to say a ‘cosmic hello’: mathematics, culture, palaeontology, art and science, and ... human genomes.

SANCTUARY PROJECT | A cosmic hello of art, science, and genomes.
SANCTUARY PROJECT | Benoit Faiveley, founder of the Sanctuary project, gives the Sanctuary disc a visual check at CEA LeQ Grenoble (image: Vincent Thomas).
SANCTUARY PROJECT | Sanctuary team examines the Life disc at INRIA Paris Saclay (image: Benedict Redgrove).

Comparing classifier performance with baselines

Sat 23-03-2024

All animals are equal, but some animals are more equal than others. —George Orwell

This month, we will illustrate the importance of establishing a baseline performance level.

Baselines are typically generated independently for each dataset using very simple models. Their role is to set the minimum level of acceptable performance and help with comparing relative improvements in performance of other models.

Nature Methods Points of Significance column: Comparing classifier performance with baselines.

Unfortunately, baselines are often overlooked and, in the presence of a class imbalance, must be established with care.

Megahed, F.M., Chen, Y-J., Jones-Farmer, A., Rigdon, S.E., Krzywinski, M. & Altman, N. (2024) Points of significance: Comparing classifier performance with baselines. Nat. Methods 20.

Happy 2024 π Day—
sunflowers ho!

Sat 09-03-2024

Celebrate π Day (March 14th) and dig into the digit garden. Let's grow something.

2024 π DAY | A garden of 1,000 digits of π.

How Analyzing Cosmic Nothing Might Explain Everything

Thu 18-01-2024

Huge empty areas of the universe called voids could help solve the greatest mysteries in the cosmos.

My graphic accompanying How Analyzing Cosmic Nothing Might Explain Everything in the January 2024 issue of Scientific American depicts the entire Universe in a two-page spread — full of nothing.

How Analyzing Cosmic Nothing Might Explain Everything. Text by Michael Lemonick (editor), art direction by Jen Christiansen (Senior Graphics Editor), source: SDSS

The graphic uses the latest data from SDSS 12 and is an update to my Superclusters and Voids poster.

Michael Lemonick (editor) explains on the graphic:

“Regions of relatively empty space called cosmic voids are everywhere in the universe, and scientists believe studying their size, shape and spread across the cosmos could help them understand dark matter, dark energy and other big mysteries.

To use voids in this way, astronomers must map these regions in detail—a project that is just beginning.

Shown here are voids discovered by the Sloan Digital Sky Survey (SDSS), along with a selection of 16 previously named voids. Scientists expect voids to be evenly distributed throughout space—the lack of voids in some regions on the globe simply reflects SDSS’s sky coverage.”

voids

Sofia Contarini, Alice Pisani, Nico Hamaus, Federico Marulli, Lauro Moscardini & Marco Baldi (2023) Cosmological Constraints from the BOSS DR12 Void Size Function. Astrophysical Journal 953:46.

Nico Hamaus, Alice Pisani, Jin-Ah Choi, Guilhem Lavaux, Benjamin D. Wandelt & Jochen Weller (2020) Journal of Cosmology and Astroparticle Physics 2020:023.

Sloan Digital Sky Survey Data Release 12

constellation figures

Alan MacRobert (Sky & Telescope), Paulina Rowicka/Martin Krzywinski (revisions & Microscopium)

stars

Hoffleit & Warren Jr. (1991) The Bright Star Catalog, 5th Revised Edition (Preliminary Version).

cosmology

H0 = 67.4 km/(Mpc·s), Ωm = 0.315, Ωv = 0.685. Planck collaboration Planck 2018 results. VI. Cosmological parameters (2018).

Error in predictor variables

Tue 02-01-2024

It is the mark of an educated mind to rest satisfied with the degree of precision that the nature of the subject admits and not to seek exactness where only an approximation is possible. —Aristotle

In regression, the predictors are (typically) assumed to have known values that are measured without error.

Practically, however, predictors are often measured with error. This has a profound (but predictable) effect on the estimates of relationships among variables – the so-called “error in variables” problem.

Nature Methods Points of Significance column: Error in predictor variables.

Error in measuring the predictors is often ignored. In this column, we discuss when ignoring this error is harmless and when it can lead to large bias that can lead us to miss important effects.

Altman, N. & Krzywinski, M. (2024) Points of significance: Error in predictor variables. Nat. Methods 20.

Background reading

Altman, N. & Krzywinski, M. (2015) Points of significance: Simple linear regression. Nat. Methods 12:999–1000.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nat. Methods 13:541–542.

Das, K., Krzywinski, M. & Altman, N. (2019) Points of significance: Quantile regression. Nat. Methods 16:451–452.

Martin Krzywinski | Canada's Michael Smith Genome Sciences Centre | BC Cancer Research Center | BC Cancer | PHSA