

# data visualization + art

To view the art you'll need a pair of red-blue 3D glasses.
The data will stand out—and you will too.

# BD Genomics stereoscopic art exhibit — AGBT 2017

> Art is science in love.
> — E.F. Weisslitz

Our art exhibit at AGBT 2017 asked new school questions in old school ways.

## the art of storytelling in science

> Instead of 'explain, not merely show,' seek to 'narrate, not merely explain.'
> — Krzywinski M & Cairo A (2013) Points of View: Storytelling. Nat. Methods 10:687.

Science cannot move forward without storytelling. While we learn about the world and its patterns through science, it is through stories that we can organize and sort through the observations and conclusions that drive the generation of scientific hypotheses.

With Alberto Cairo, I've written about the importance of storytelling as a tool to explain and narrate in Storytelling (2013) Nat. Methods 10:687. There we suggest that instead of "explain, not merely show," you should seek to "narrate, not merely explain."

Our account received support (Should scientists tell stories? (2013) Nat. Methods 10:1037) but not from everyone (Against storytelling of scientific results. (2013) Nat. Methods 10:1045).

A good science story must present facts and conclusions within a hierarchy—a bag of unsorted observations isn't likely to engage your readers. But while a story must always inform, it should also delight (as much as possible), and inspire. It should make the complexity of the problem accessible—or, at least, approachable—without simplifications that preclude insight into how concepts connect (they always do).

## the story of making science stories

Just like science, explaining science is a process—one that can be more vexing than the science itself!

> In science one tries to tell people, in such a way as to be understood by everyone, something that no one ever knew before. But in poetry, it’s the exact opposite.
> — Paul Dirac, quoted in Mathematical Circles Adieu by H. Eves

I have previously written about the process of taking a scientific statement (Creating Scientific American Graphic Science graphics) and turning it into a data visualization or, more broadly, visual story.

- December 2015: composition of bacteria in household dust.
- June 2015: relationship between genes and traits.
- September 2014: similarity of human, Denisovan, chimp, bonobo, and gorilla genomes.

The process of creating one of these visual stories is itself a story: how the genome is not a blueprint, how the Hilbertonians (creatures that live on the Hilbert curve) were discovered, how algorithms for protein folding can be used to generate art based on the digits of $\pi$, or how humans with genomes can make human genome art. I've also written about my design process for the covers of Genome Research and PNAS. As always, not everything works out all the time—read about the EMBO Journal covers that never made it.

Cover image accompanying our article on mouse vasculature development. Biology turns astrophysical. PNAS 1 May 2012; 109 (18)
Cover image accompanying Spark: A navigational paradigm for genomic data exploration. Genome Research 22 (11).
Pi Day 2014 poster | 132 paths with E=-23 of 64 digits of Pi, sorted by aspect ratio.

Here, I'd like to walk you through the process and sketches behind a story built on the idea of differences in data, and show how that story can be used to understand the function of cells and disease.

## the difference is in the differences

The visual story is a creative collaboration with Becton Dickinson and The Linus Group, and its creation began with the concept of differences. The art was on display at the AGBT 2017 conference, accompanying BD's launch of the Resolve platform and "Difference of One in Genomics".

Starting with the idea of the "difference of one", our goal was to create artistic representations of single-cell transcriptome data sets generated with the BD Resolve platform, capturing a variety of differences relevant to genomics research.

The data art pieces were installed in a gallery style, with data visualization and artistic expression in equal parts.

The art itself is an old school take on virtual reality. Unlike modern VR, which isolates the participants from one another, we chose a low-tech route that not only brings the audience closer to the data but also to each other.

## data in the art

The data were generated using the BD Resolve single-cell transcriptomics platform. For each of the three art pieces, we identified a data set that captured a variety of differences.

1. disease onset—how does gene expression in tumor cells differ from normal cells?
2. disease progression—as a tumor grows and spreads, how does expression change?
3. background variation—how does gene expression change between normal cells that perform different functions?

The real surprise and insight lie in the differences that ultimately advance our thinking (Data visualization: ambiguity as a fellow traveller. (2013) Nat. Methods 10:613–615).

Figuring out which differences are of this kind requires that instead of "What's new?" we ask "What's different?"


# Snowflake simulation

Tue 14-11-2017
Symmetric, beautiful and unique.

Just in time for the season, I've simulated a snow-pile of snowflakes based on the Gravner-Griffeath model.

A few of the beautiful snowflakes generated by the Gravner-Griffeath model. (explore)

Gravner, J. & Griffeath, D. (2007) Modeling Snow Crystal Growth II: A mesoscopic lattice map with plausible dynamics.

# Genes that make us sick

Thu 02-11-2017
Where disease hides in the genome.

My illustration of the location of genes in the human genome that are implicated in disease appears in The Objects that Power the Global Economy, a book by Quartz.

The location of genes implicated in disease in the human genome, shown here as a spiral. (more...)

# Ensemble methods: Bagging and random forests

Mon 16-10-2017
Many heads are better than one.

We introduce two common ensemble methods: bagging and random forests. Both of these methods repeat a statistical analysis on bootstrap samples to improve the accuracy of the predictor. Our column shows these methods as applied to Classification and Regression Trees.

Nature Methods Points of Significance column: Ensemble methods: Bagging and random forests. (read)

For example, we can sample the space of values more finely when using bagging with regression trees because each sample has potentially different boundaries at which the tree splits.

Random forests generate a large number of trees by not only generating bootstrap samples but also randomly choosing which predictor variables are considered at each split in the tree.
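The bootstrap-and-vote idea can be sketched in a few lines. This is an illustrative toy, not code from the column: a one-dimensional decision stump stands in for a full tree, each stump is trained on a bootstrap sample (drawn with replacement), and the ensemble predicts by majority vote. All function names here are mine.

```python
import random
from collections import Counter

def fit_stump(X, y):
    # exhaustively pick the threshold and sign that minimize
    # misclassification error on one-dimensional data with labels -1/+1
    best = None
    for t in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= t else -sign for x in X]
            err = sum(p != yi for p, yi in zip(pred, y))
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign

def bagged_predict(X, y, x_new, n_trees=25, seed=0):
    # train each stump on a bootstrap sample of the data,
    # then combine the individual predictions by majority vote
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        votes.append(stump(x_new))
    return Counter(votes).most_common(1)[0][0]

X = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [-1, -1, -1, 1, 1, 1]
print(bagged_predict(X, y, 2.5))  # -1: the left cluster
```

A random forest would add one more step: at each split, only a random subset of the predictor variables is considered, which decorrelates the trees.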

Krzywinski, M. & Altman, N. (2017) Points of Significance: Ensemble methods: bagging and random forests. Nature Methods 14:933–934.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. Nature Methods 14:757–758.

# Classification and regression trees

Mon 16-10-2017
Decision trees are a powerful but simple prediction method.

Decision trees classify data by splitting it along the predictor axes into partitions with homogeneous values of the dependent variable. Unlike logistic or linear regression, CART does not develop a prediction equation. Instead, data are predicted by a series of binary decisions based on the boundaries of the splits. Decision trees are very effective and the resulting rules are readily interpreted.

Trees can be built using different metrics that measure how well the splits divide up the data classes: Gini index, entropy or misclassification error.
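All three metrics are simple functions of the class proportions in a node, and a split is scored by the size-weighted average impurity of its children. A minimal sketch (function names are mine, not from the column):

```python
import math

def impurity(labels, metric="gini"):
    # node impurity computed from the class proportions
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    if metric == "gini":
        return 1 - sum(p * p for p in probs)
    if metric == "entropy":
        return -sum(p * math.log2(p) for p in probs)
    if metric == "misclassification":
        return 1 - max(probs)
    raise ValueError(metric)

def split_quality(left, right, metric="gini"):
    # impurity of a split: size-weighted average of the child impurities
    n = len(left) + len(right)
    return (len(left) * impurity(left, metric)
            + len(right) * impurity(right, metric)) / n

# a perfectly homogeneous split has zero impurity under every metric
print(split_quality(["a", "a"], ["b", "b", "b"]))  # 0.0
```

Growing a tree greedily picks, at each node, the split that most reduces this score.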

Nature Methods Points of Significance column: Classification and regression trees. (read)

When the outcome variable is quantitative rather than categorical, regression trees are used. Here, the data are still split, but now the outcome is estimated by its average within the split boundaries. Tree growth can be controlled using the complexity parameter, a measure of the relative improvement of each new split.
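A minimal illustration of the regression-tree split criterion (my own sketch, assuming each side of a split is predicted by its mean): the best boundary is the one that minimizes the total squared error of those two means.

```python
def best_split(x, y):
    # choose the boundary that minimizes total squared error when each
    # side of the split is predicted by its own mean
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for t in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t)
    return best  # (total squared error, threshold)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(best_split(x, y))  # the best threshold is 4, between the two plateaus
```

A full regression tree applies this search recursively to each side, stopping when a new split no longer improves the fit by more than the complexity parameter.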

Individual trees can be very sensitive to minor changes in the data and even better prediction can be achieved by exploiting this variability. Using ensemble methods, we can grow multiple trees from the same data.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. Nature Methods 14:757–758.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541-542.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression. Nature Methods 12:1103-1104.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603-604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model Selection and Overfitting. Nature Methods 13:703-704.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Regularization. Nature Methods 13:803-804.

# Personalized Oncogenomics Program 5 Year Anniversary Art

Wed 26-07-2017

The artwork was created in collaboration with my colleagues at the Genome Sciences Centre to celebrate the 5-year anniversary of the Personalized Oncogenomics Program (POG).

5 Years of Personalized Oncogenomics Program at Canada's Michael Smith Genome Sciences Centre. The poster shows 545 cancer cases. (left) Cases ordered chronologically by case number. (right) Cases grouped by diagnosis (tissue type) and then by similarity within group.

The Personalized Oncogenomics Program (POG) is a collaborative research study that includes many BC Cancer Agency oncologists, pathologists and other clinicians, along with Canada's Michael Smith Genome Sciences Centre, with support from the BC Cancer Foundation.

The aim of the program is to sequence, analyze and compare the genome of each patient's cancer—the entire DNA and RNA inside tumor cells—in order to understand what is driving it and to identify less toxic and more effective treatment options.

# Principal component analysis

Thu 06-07-2017
PCA helps you interpret your data, but it will not always find the important patterns.

Principal component analysis (PCA) simplifies the complexity in high-dimensional data by reducing its number of dimensions.

Nature Methods Points of Significance column: Principal component analysis. (read)

To retain trends and patterns in the reduced representation, PCA finds linear combinations of the original dimensions that maximize the variance of the projection of the data.

PCA is helpful for visualizing high-dimensional data; scatter plots based on the first two principal components can reveal clusters.
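To make the variance-maximizing projection concrete, here is a hand-rolled 2-D sketch (mine, not the column's code): center the data, form the covariance matrix, and take its leading eigenvector as the first principal component.

```python
import math

def pca_2d(points):
    # center the data, then build the 2x2 sample covariance matrix
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    sxx = sum(v * v for v in xs) / (n - 1)
    syy = sum(v * v for v in ys) / (n - 1)
    sxy = sum(a * b for a, b in zip(xs, ys)) / (n - 1)
    # leading eigenvalue of [[sxx, sxy], [sxy, syy]] via the trace/determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # corresponding eigenvector (sxy, lam - sxx); assumes sxy != 0,
    # i.e. the coordinates are correlated, so the vector is nonzero
    vx, vy = sxy, lam - sxx
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

# points along the line y = x: PC1 is the diagonal direction
axis, var = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(axis)  # approximately (0.707, 0.707)
```

The eigenvalue `lam` is the variance captured along the component; in practice one would use a library eigen-solver, but the geometry is the same.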

Altman, N. & Krzywinski, M. (2017) Points of Significance: Principal component analysis. Nature Methods 14:641–642.