The Earth BioGenome Project (EBP) seeks to create a new foundation for biology to drive solutions for preserving biodiversity and sustaining human societies.

# PNAS cover — Earth BioGenome Project

## An enlightened perspective on our planet

Lewin HA et al., The Earth BioGenome Project 2020: Starting the clock. (2022) PNAS 119(4) e2115635118

## 1 · A world of sequence

The design represents a genomics perspective on the biodiversity of the Earth.

The basis of the design is the shape of the continents. These were taken from the award-winning Authagraph Map, a projection that strongly preserves shapes and areas. Wired calls it “not perfect, but close”.

Authagraph map by Hajime Narukawa. Shapes and areas are strongly preserved.

I rearranged the authagraph map to fit the aspect ratio of the cover image (nearly 1:1) and allow for enough room for text on the cover. To make things interesting, I put Antarctica in the center — north is everywhere. Given how the image is constructed (read below), I had to remove some islands. Sorry, Madagascar — you're a star.

Layout of authagraph continents.

Australia has the job of representing all of Oceania, such as Inaccessible Island.

## 2 · Journey of fewest steps

For each continent and for the ocean, a travelling salesman problem (TSP) was solved to find the shortest (or close to it) path that covers the area. The path joins points sampled at a density that I thought would strike a good balance between detail and legibility.

Some continents do not have full coverage by the path: some small islands and peninsulas were excluded (or extended). This needed to be done to give room for the ocean path between continents and to avoid situations where the ocean path crossed a landmass. The shortest solution isn't always the most appropriate.

A TSP path for each land mass and ocean.
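The post doesn't say which solver was used. As a minimal sketch of how a "shortest (or close to it)" path can be found, the code below builds a tour with the nearest-neighbour heuristic and then improves it with 2-opt segment reversals. The function names and the random test points are illustrative assumptions, not the code behind the cover.

```python
import math
import random

def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
               for i in range(n))

def nearest_neighbour(points, start=0):
    """Greedy construction: always move to the closest unvisited point."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order

def two_opt(points, order):
    """Reverse segments of the tour for as long as doing so shortens it."""
    n = len(order)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a point
                a, b = points[order[i]], points[order[i + 1]]
                c, d = points[order[j]], points[order[(j + 1) % n]]
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    order[i + 1:j + 1] = reversed(order[i + 1:j + 1])
                    improved = True
    return order

# Hypothetical sample: 40 random points standing in for one landmass.
random.seed(1)
sample = [(random.random(), random.random()) for _ in range(40)]
greedy = nearest_neighbour(sample)
better = two_opt(sample, list(greedy))
print(tour_length(sample, greedy), tour_length(sample, better))
```

A plain 2-opt loop like this is quadratic per pass, so for the 17,354 ocean points a dedicated solver would be the practical choice.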

The number of points visited for each area was

| area | points |
|---|---|
| africa | 1,575 |
| antarctica | 848 |
| asia | 2,818 |
| australia | 475 |
| europe | 713 |
| namerica | 1,674 |
| ocean | 17,354 |
| samerica | 872 |
| ALL | 26,329 |

I then smoothed the path using a J-spline curve.

A smoothed J-spline of the TSP path for each land mass and ocean. The original TSP path is shown as a thin line.
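The smoothing parameters aren't given in the post. As a rough sketch, J-splines are usually described as a one-parameter family of subdivision curves: each refinement step repositions every old vertex and inserts a new point on every edge. The rule below is one common formulation for a closed polygon. With `s = 0` it reduces to the interpolating four-point scheme and with `s = 1` to cubic B-spline subdivision. The parameter name `s`, the closed-curve assumption, and the default of three levels are assumptions made for illustration.

```python
def j_spline_refine(points, s):
    """One J-spline subdivision step on a closed polygon of (x, y) points."""
    n = len(points)
    refined = []
    for j in range(n):
        # Four consecutive vertices; indices wrap because the curve is closed.
        a, b, c, d = (points[(j + k) % n] for k in (-1, 0, 1, 2))
        # Repositioned copy of the old vertex b.
        vertex = tuple((s * a[i] + (8 - 2 * s) * b[i] + s * c[i]) / 8
                       for i in range(2))
        # New point inserted on the edge between b and c.
        edge = tuple(((s - 1) * a[i] + (9 - s) * (b[i] + c[i])
                      + (s - 1) * d[i]) / 16
                     for i in range(2))
        refined += [vertex, edge]
    return refined

def j_spline(points, s=1.0, levels=3):
    """Apply several refinement steps; each step doubles the point count."""
    for _ in range(levels):
        points = j_spline_refine(points, s)
    return points

# A unit square stands in for a (very short) TSP path.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
smooth = j_spline(square, s=1.0, levels=3)
print(len(smooth))
```

Values of `s` between 0 and 1 trade fidelity to the original path against smoothness, which is the kind of balance the thin-line overlay in the figure makes visible.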

## 3 · Adding the double helix

The DNA double helix is a trope — let's embrace it.

The smoothed J-spline was used to create a double helix. The parameters for the helix were selected to balance detail with clarity.

And before you start pixel peeping: the helix doesn't have handedness. It's composed of three independent stacked layers: one layer per strand and one layer with bonds.
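One way to picture the three layers: walk along the smoothed path by arc length and offset two strands to either side of it along the path normal, using a sine wave for the offset. Because each strand is a complete curve in its own layer, nothing encodes which strand passes over the other, hence no handedness. The sketch below is an assumption about how such a helix could be drawn. The `amplitude`, `wavelength` and `step` parameters are hypothetical and the post does not state the values used.

```python
import math
from bisect import bisect_right

def helix_layers(path, amplitude, wavelength, step):
    """Return (strand_a, strand_b, bonds) for a polyline of (x, y) points.

    Assumes consecutive path points are distinct.
    """
    # Cumulative arc length at every vertex of the path.
    arc = [0.0]
    for p, q in zip(path, path[1:]):
        arc.append(arc[-1] + math.dist(p, q))

    strand_a, strand_b, bonds = [], [], []
    for k in range(int(arc[-1] / step) + 1):
        t = k * step
        # Index of the path segment that contains arc length t.
        i = min(bisect_right(arc, t) - 1, len(path) - 2)
        (x0, y0), (x1, y1) = path[i], path[i + 1]
        seg = arc[i + 1] - arc[i]
        u = (t - arc[i]) / seg
        x, y = x0 + u * (x1 - x0), y0 + u * (y1 - y0)
        # Unit normal of the segment (tangent rotated by 90 degrees).
        nx, ny = -(y1 - y0) / seg, (x1 - x0) / seg
        offset = amplitude * math.sin(2 * math.pi * t / wavelength)
        a = (x + offset * nx, y + offset * ny)
        b = (x - offset * nx, y - offset * ny)
        strand_a.append(a)
        strand_b.append(b)
        bonds.append((a, b))  # a bond joins the two strands at this step
    return strand_a, strand_b, bonds

a, b, bonds = helix_layers([(0.0, 0.0), (20.0, 0.0)],
                           amplitude=2.0, wavelength=10.0, step=2.5)
print(a[1], b[1])
```

Drawing `strand_a`, `strand_b` and `bonds` as three separate layers gives the stacked structure described above.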

Once the helix was drawn, I added bases as circles and resized to increase visibility while avoiding overlap.

Bases are added to the double helix.

The number of base pairs per area is

| area | base pairs |
|---|---|
| australia | 596 |
| europe | 872 |
| antarctica | 1,024 |
| samerica | 1,064 |
| africa | 1,911 |
| namerica | 2,033 |
| asia | 3,403 |
| ocean | 21,208 |
| total | 32,111 |

I mapped species sequenced as part of the Earth BioGenome Project to the path that corresponded to their habitat location. This was done using a combination of automated searches through the Global Biodiversity Information Facility (GBIF) and manual corrections.

For a given path, the length of the sequence of each species was fixed and depended on the length of the area's double helix and number of species assigned to it.

For example, in Africa we have 1,911 bases to distribute across 87 species, giving us $\lfloor 1911/87 \rfloor - 1 = 20$ bases per species. The quotient is rounded down to the nearest integer and the $-1$ term allows for a gap between each sequence.

The ocean area is very large and has relatively few species, so we can have long sequences — 396 bases per species.

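| area | base pairs | species | bases per species |
|---|---|---|---|
| australia | 596 | 71 | 7 |
| europe | 872 | 113 | 6 |
| antarctica | 1,024 | 2 | 511 |
| samerica | 1,064 | 93 | 10 |
| africa | 1,911 | 87 | 20 |
| namerica | 2,033 | 182 | 10 |
| asia | 3,403 | 206 | 15 |
| ocean | 21,208 | 52 | 396 |
| total | 32,111 | 806 | |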
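The Africa calculation can be written as a one-line rule and checked against the land rows of the table. This is a small sketch with a hypothetical function name. The ocean row is left out because the listed value (396) is not what this simple rule gives for the listed totals (it gives 406), so that row may have been handled differently.

```python
def bases_per_species(n_bases, n_species, gap=1):
    """Even share of the helix, rounded down, minus `gap` bases between sequences."""
    return n_bases // n_species - gap

# (base pairs on the helix, species assigned) for the land areas in the table.
land = {
    "australia": (596, 71),
    "europe": (872, 113),
    "antarctica": (1024, 2),
    "samerica": (1064, 93),
    "africa": (1911, 87),
    "namerica": (2033, 182),
    "asia": (3403, 206),
}

for area, (nbp, species) in land.items():
    print(area, bases_per_species(nbp, species))
```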

Any bases left over on the helix were assigned to gaps. Initially, I thought of concatenating the sequences without any gaps. While this would allow me to represent (negligibly) longer sequences, I was persuaded against this idea (thank you Emma!).

First, the Earth BioGenome Project sequencing isn't complete (nor will it ever be), so gaps in the sequence are a nice metaphor for this. Second, the gaps offer some hope of actually reading the sequences off for each species. Below is a view of one of the strands with the backbone and base pair locations within gaps shown in magenta.

Distribution of gaps (magenta) in the sequence path. Gaps add up to about 5.5% of the total bases in the design.

In the final figure, the helix backbone in gaps is faded.

The double helix is faded in gaps.

In the final figure, bases are colored by nucleotide (A, T, C, G) and region (land vs ocean).

Bases of the sequence are encoded by colored circles. The first base of the first sequence on a path is indicated by a (hard-to-find) black triangle.

Sequences were sampled from the GenBank records starting at the first non-repeated base. This was a quick way to avoid spamming the design with polytails.

For example, the sequence for the Hanuman langur

    >PVII010000001.1:1-1000 Semnopithecus entellus isolate BS30 SemEnt_scaffold_0, WGS
    AAAAA AAAAA AAAAA AAAAA AAAAA AAATT ATTTT GCTAA GGTCT GAGAA CTCCA GAAGG TGGTG TCGTG
    ----- ----- ----- ----- ----- ---++ +++++ +++++ +++++ +++++ +++++ +++++ +++++ +++++
                                     ** ***** ***** ***

was sampled by skipping the first 28 bases (A's) and uses the sequence indicated by `*` above.
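The sampling rule can be sketched as: strip the leading run of identical bases, then take the next few bases. The helper below is a hypothetical illustration of that rule applied to the langur record. The length of 15 is simply the number of `*` marks in the listing.

```python
def sample_sequence(seq, n_bases):
    """Skip the leading run of identical bases, then take the next n_bases.

    Returns (number of bases skipped, sampled sequence).
    """
    seq = seq.replace(" ", "").upper()
    if not seq:
        return 0, ""
    skipped = len(seq) - len(seq.lstrip(seq[0]))
    return skipped, seq[skipped:skipped + n_bases]

langur = ("AAAAA AAAAA AAAAA AAAAA AAAAA AAATT ATTTT "
          "GCTAA GGTCT GAGAA CTCCA GAAGG TGGTG TCGTG")
print(sample_sequence(langur, 15))  # (28, 'TTATTTTGCTAAGGT')
```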

## 6 · What's where

The position and sequence for each species are shown below. The first base of the sequence is capitalized and the remaining bases are lowercase. The numeric code below the species name is the taxon, a unique taxonomy ID used by GenBank.

Discover what's where. Shown are the species, taxon (unique taxonomy ID) and sequence for each of the 806 species in the design. Species and taxon labels are placed at the average (x,y) position of its bases. It'll do.
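Both labelling conventions are simple enough to sketch. The helpers below are hypothetical (not the code used for the figure): one places a label at the mean position of a species' bases and the other formats a sequence with only its first base capitalized.

```python
def label_position(bases):
    """Average (x, y) of the base positions belonging to one species."""
    xs, ys = zip(*bases)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def display_sequence(seq):
    """First base uppercase, remaining bases lowercase."""
    return seq[:1].upper() + seq[1:].lower()

print(label_position([(0, 0), (2, 0), (2, 2), (0, 2)]))
print(display_sequence("TTATTTTGCTAAGGT"))
```

Because the helix winds back and forth, the average position can land off the sequence itself, which is presumably why "it'll do" is the right standard here.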

A version of the above without the sequence is shown below.

Discover what's where. Shown are the species name and taxon (unique taxonomy ID) for each of the 806 species in the design. Species and taxon labels are placed at the average (x,y) position of its bases. It'll do.
news + thoughts

# Neural network primer

Mon 06-02-2023

Nature is often hidden, sometimes overcome, seldom extinguished. —Francis Bacon

In the first of a series of columns about neural networks, we introduce them with an intuitive approach that draws from our discussion about logistic regression.

Nature Methods Points of Significance column: Neural network primer. (read)

Simple neural networks are just a chain of linear regressions. And, although neural network models can get very complicated, their essence can be understood in terms of relatively basic principles.

We show how neural network components (neurons) can be arranged in the network and discuss the ideas of hidden layers. Using a simple data set we show how even a 3-neuron neural network can already model relatively complicated data patterns.

Derry, A., Krzywinski, M. & Altman, N. (2023) Points of significance: Neural network primer. Nature Methods 20.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of significance: Logistic regression. Nature Methods 13:541–542.

# Cell Genomics cover

Mon 16-01-2023

Our cover on the 11 January 2023 Cell Genomics issue imagines the process of determining the parent-of-origin using differential methylation of alleles at imprinted regions (iDMRs) as a circuit.

Designed in collaboration with Carlos Urzua.

Our Cell Genomics cover depicts parent-of-origin assignment as a circuit (volume 3, issue 1, 11 January 2023). (more)

Akbari, V. et al. Parent-of-origin detection and chromosome-scale haplotyping using long-read DNA methylation sequencing and Strand-seq (2023) Cell Genomics 3(1).

Browse my gallery of cover designs.

A catalogue of my journal and magazine cover designs. (more)

Thu 05-01-2023

My cover design on the 6 January 2023 Science Advances issue depicts DNA sequencing read translation in high-dimensional space. The image shows 672 bases of sequencing barcodes, generated by three different single-cell RNA sequencing platforms, encoded as oriented triangles on the faces of three 7-dimensional cubes.

My Science Advances cover that encodes sequence onto hypercubes (volume 9, issue 1, 6 January 2023). (more)

Kijima, Y. et al. A universal sequencing read interpreter (2023) Science Advances 9.

# Regression modeling of time-to-event data with censoring

Mon 21-11-2022

If you sit on the sofa for your entire life, you’re running a higher risk of getting heart disease and cancer. —Alex Honnold, American rock climber

In a follow-up to our Survival analysis — time-to-event data and censoring article, we look at how regression can be used to account for additional risk factors in survival analysis.

We explore accelerated failure time regression (AFTR) and the Cox Proportional Hazards model (Cox PH).

Nature Methods Points of Significance column: Regression modeling of time-to-event data with censoring. (read)

Dey, T., Lipsitz, S.R., Cooper, Z., Trinh, Q., Krzywinski, M. & Altman, N. (2022) Points of significance: Regression modeling of time-to-event data with censoring. Nature Methods 19.

# Music video for Max Cooper's Ascent

Tue 25-10-2022

My 5-dimensional animation sets the visual stage for Max Cooper's Ascent from the album Unspoken Words. I have previously collaborated with Max on telling a story about infinity for his Yearning for the Infinite album.

I provide a walkthrough of the video, describe the animation system I created to generate the frames, and show you all the keyframes.

Frame 4897 from the music video of Max Cooper's Ascent.

The video recently premiered on YouTube.

Renders of the full scene are available as NFTs.