Martin Krzywinski / Genome Sciences Center

visualization + design

Getting into Visualization of Large Biological Data Sets

The 20 imperatives of information design

Martin Krzywinski, Inanc Birol, Steven Jones, Marco Marra

Presented at Biovis 2012 (Visweek 2012). Content is drawn from my book chapter Visualization Principles for Scientific Communication (Martin Krzywinski & Jonathan Corum) in the upcoming open access Cambridge Press book Visualizing biological data - a practical guide (Seán I. O'Donoghue, James B. Procter, Kate Patterson, eds.), a survey of best practices and unsolved problems in biological visualization. This book project was conceptualized and initiated at the Vizbi 2011 conference.

If you are interested in guidelines for data encoding and visualization in biology, see our Visualization Principles Vizbi 2012 Tutorial and Nature Methods Points of View column by Bang Wong.

Martin Krzywinski @MKrzywinski
Getting into Visualization of Large Biological Data Sets. M Krzywinski, I Birol, S Jones, M Marra (poster presentation)


Create legible visualizations with a strong message. Make elements large enough to be resolved comfortably. Bin dense data to avoid sacrificing clarity.

Distinguish between exploration and communication.

Use exploratory tools (e.g. genome browsers) to discover patterns and validate hypotheses. Avoid using screenshots from these applications for communication – they are typically too complex and cluttered with navigational elements to be an effective static figure.

Do not exceed resolution of visual acuity.

Our acuity is ~50 cycles/degree or about 1/200 inch (0.3 pt) at 10 inches. Ensure the reader can comfortably see detail by limiting resolution to no more than 50% of acuity. Where possible, elements that require visual separation should be at least 1 pt apart.
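The acuity figure can be sanity-checked with a little trigonometry. The sketch below is only an illustration: it assumes one cycle is one light/dark pair, a 72 pt inch, and a hypothetical helper name `cycle_size_pt`. It lands near a quarter of a point per cycle, the same order as the rounded 0.3 pt quoted above.

```python
import math

POINTS_PER_INCH = 72.0  # assumption: PostScript points


def cycle_size_pt(cycles_per_degree, distance_in):
    """Width of one cycle, in points, seen from distance_in inches."""
    # Length on the page subtended by one degree of visual angle.
    inches_per_degree = 2 * distance_in * math.tan(math.radians(0.5))
    return inches_per_degree / cycles_per_degree * POINTS_PER_INCH


size = cycle_size_pt(50, 10)
print(f"one cycle at 10 in: {size:.2f} pt")
print(f"at 50% of acuity:   {2 * size:.2f} pt")
```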

Use no more than ~500 scale intervals.

Ensure data elements are at least 1 pt on a two-column Nature figure (6.22 in), 4 pixels on a 1920 horizontal resolution display, or 2 pixels on a typical LCD projector. These restrictions become challenges for large genomes.
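The ~500 figure follows from dividing the width of each medium by its minimum element size. A minimal sketch of that arithmetic, where the 1024-pixel projector width and the 3 Gb genome size are my assumptions for illustration, not values from the poster:

```python
def scale_intervals(width, min_element):
    """Number of separately resolvable intervals across a medium."""
    return int(width // min_element)


figure_pt = 6.22 * 72  # two-column figure width, in points

print(scale_intervals(figure_pt, 1))  # 447 intervals in print
print(scale_intervals(1920, 4))       # 480 intervals on a display
print(scale_intervals(1024, 2))       # 512, assuming a 1024 px projector

# What one interval has to represent on a large genome.
genome_bp = 3_000_000_000  # assumed genome size, roughly human
print(genome_bp / 500)     # 6000000.0 bases per interval
```

At six million bases per interval, anything smaller than a few megabases cannot be drawn to scale, which is why binning and summary statistics become necessary.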

Show variation with statistics.

Data on large genomes must be downsampled. Depict variation with min/max plots and consider hiding it when it is within noise levels. Help the reader notice significant outliers.
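One way to downsample without hiding variation is to keep the minimum, mean and maximum of each bin. The following is a rough sketch with hypothetical function names (`minmax_bins`, `flag_noisy`), not the code behind any particular figure:

```python
def minmax_bins(values, n_bins):
    """Downsample values into n_bins, keeping (min, mean, max) per bin."""
    n = len(values)
    bins = []
    for i in range(n_bins):
        chunk = values[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than data points
        bins.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return bins


def flag_noisy(bins, noise):
    """Mark bins whose min/max spread exceeds the noise level."""
    return [(lo, mean, hi, (hi - lo) > noise) for lo, mean, hi in bins]


signal = list(range(10))
print(minmax_bins(signal, 2))  # [(0, 2.0, 4), (5, 7.0, 9)]
```

Bins flagged `False` by `flag_noisy` are candidates for drawing the mean alone, leaving the min/max envelope for regions where the spread is real.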

Do not draw small elements to scale.

Map size of elements onto clearly legible symbols. Legibility and clarity are more important than precise positioning and sizing. Discretize sizes and positions to facilitate making meaningful comparisons.

Aggregate data for focused theme.

A strong visual message has no uncertainty in its interpretation. Focus on a single theme by aggregating unnecessary detail.

Show density maps and outliers.

Establishing context is helpful when emergent patterns in the data provide a useful perspective on the message. When data sets are large, it is difficult to maintain detail in the context layer because the density of points can visually overwhelm the area of interest. In this case, consider showing only the outliers in the data set.

Consider whether showing the full data set is useful.

The reader’s attention can be focused by increasing the salience of interesting patterns. Other complex data sets, such as networks, are shown more effectively when context is carefully edited or even removed.


Match the visual encoding to the hypothesis. Use encodings specific and sensitive to important patterns. Dense annotations should be independent of the core data in distinct visual layers.

Use the simplest encoding.

Choose concise encodings over elaborate ones.

Help the reader judge accurately.

Accuracy and speed in detecting differences in visual forms depend on how information is presented. We judge relative lengths more accurately than areas, particularly when elements are aligned and adjacent. Our judgment of area is poor because we use length as a proxy, which causes us to systematically underestimate.

Use encodings that are robust and comparable.

In addition to being transparent and predictable, visualizations must be robust with respect to the data. Changes in the data set should be reflected by proportionate changes in the visualization. Be wary of force-directed network layouts, which have low spatial autocorrelation. In general, these are neither sensitive nor specific to patterns of interest.

Crop scale to reveal fine structure in data.

Biological data sets are typically high-resolution (changes at base pair level can be meaningful), sparse (distances between changes are orders of magnitude greater than the affected areas) and connect distant regions by adjacency relationships (gene fusions and other rearrangements). It is difficult to take these properties into account on a fixed linear scale, the kind used by traditional genome browsers. To mitigate this, crop and order axis segments arbitrarily and apply a scale adjustment to a segment or portion thereof.
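A cropped axis can be thought of as a list of segments, each with its own scale, laid end to end in whatever order suits the figure. The sketch below is one possible way to express that mapping. The names `make_axis` and `to_axis`, the `gap` parameter and the example coordinates are all hypothetical.

```python
def make_axis(segments, gap=10.0):
    """Lay out (start, end, scale) segments in the order given.

    scale is figure units per base; gap separates adjacent segments.
    """
    layout = []
    offset = 0.0
    for start, end, scale in segments:
        layout.append((start, end, scale, offset))
        offset += (end - start) * scale + gap
    return layout


def to_axis(layout, pos):
    """Map a genomic position to a figure coordinate, or None if cropped."""
    for start, end, scale, offset in layout:
        if start <= pos <= end:
            return offset + (pos - start) * scale
    return None


# A 1 kb region drawn compressed, then a distant 100 bp region magnified.
axis = make_axis([(1000, 2000, 0.1), (500000, 500100, 1.0)])
print(to_axis(axis, 1500))    # inside the first segment
print(to_axis(axis, 500050))  # inside the second, magnified segment
print(to_axis(axis, 3000))    # None: this position was cropped out
```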

Use perceptual palettes.

Selecting perceptually favorable colors is difficult because most software does not support the required color spaces. Brewer palettes exist for the full range of colors to help us make useful choices. Qualitative palettes have no perceived order of importance. Sequential palettes are suitable for heat maps because they have a natural order and the perceived difference between adjacent colors is constant. Twin hue diverging palettes are useful for two-sided quantitative encodings, such as immunofluorescence and copy number.

Never use hue to encode magnitude.

Hue does not communicate relative change in values because we perceive hue categorically (blue, green, yellow, etc). Changes within one category have less perceptual impact than transitions between categories. For example, variations across the green/yellow boundary are perceived to be larger than variations across the same sized hue interval in other parts of the spectrum.


Well-designed figures illustrate complex concepts and patterns that may be difficult to express concisely in words. Figures that are clear, concise and attractive are effective – they form a strong connection with the reader and communicate with immediacy. These qualities can be achieved with methods of graphic design, which are based on theories of how we perceive, interpret and organize visual information.

Reduce unnecessary variation.

The reader does not know what is important in a figure and will assume that any spatial or color variation is meaningful. The figure’s variation should come solely from data or act to organize information.

Encapsulate details.

Including details not relevant to the core message of the figure can create confusion. Encapsulation should be done to the same level of detail and to the simplest visual form. Duplication in labels should be avoided.

Use consistent alignment. Center on theme.

Establish equivalence using consistent alignment. Awkward callouts can be avoided if elements are logically placed.

Respect natural hierarchies.

When the data set embodies a natural hierarchy, use an encoding that emphasizes it clearly and memorably. The use of hierarchy in layout (e.g. tabular form) and encoding can significantly improve a muddled figure.

Palette for color blindness
This 15-color palette provides good discrimination for common color blindness types. Individuals with tritanopia cannot distinguish colors marked with ● and ◥.

Be aware of the luminance effect.

Color is a useful encoding – the eye can distinguish about 450 levels of gray, 150 hues, and 10-60 levels of saturation, depending on the color – but our ability to perceive differences varies with context. Adjacent tones with different luminance values can interfere with discrimination, an interaction known as the luminance effect.

Be aware of color blindness.

In an audience of 8 men and 8 women, chances are 50% that at least one has some degree of color blindness. Use a palette that is color-blind safe. In the palette below the 15 colors appear as 5-color tone progressions to those with color blindness. Additional encodings can be achieved with symbols or line thickness.
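The 50% figure is easy to reproduce. The sketch below assumes prevalences of about 8% in men and 0.5% in women, which are commonly quoted values and not numbers taken from the poster; treat them, and the helper name `p_at_least_one`, as assumptions.

```python
def p_at_least_one(n_men, n_women, p_men=0.08, p_women=0.005):
    """Chance that at least one audience member has a color vision deficiency.

    p_men and p_women are assumed prevalences, not measured values.
    """
    p_none = (1 - p_men) ** n_men * (1 - p_women) ** n_women
    return 1 - p_none


print(f"{p_at_least_one(8, 8):.2f}")  # close to one half
```

Under these assumptions, nearly all of the risk comes from the men in the audience, so audience composition matters more than audience size.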

news + thoughts

Two Factor Designs

Tue 09-12-2014

We've previously written about how to analyze the impact of one variable in our ANOVA column. Complex biological systems are rarely so obliging: multiple experimental factors interact and produce effects.

ANOVA is a natural way to analyze multiple factors. It can incorporate the possibility that the factors interact—the effect of one factor depends on the level of another factor. For example, the potency of a drug may depend on the subject's diet.
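To make "interaction" concrete, consider a 2 × 2 layout of drug and diet. If the factors do not interact, the drug effect is the same on both diets and the difference of differences is zero. The sketch below only illustrates that contrast. It is not a full two-factor ANOVA, and the cell means are invented for the example.

```python
def interaction_contrast(cell_means):
    """Difference between the drug effect on diet 1 and on diet 0.

    cell_means maps (drug, diet) -> mean response, with levels 0 and 1.
    """
    effect_on_diet0 = cell_means[(1, 0)] - cell_means[(0, 0)]
    effect_on_diet1 = cell_means[(1, 1)] - cell_means[(0, 1)]
    return effect_on_diet1 - effect_on_diet0


# Hypothetical cell means, made up for illustration.
additive = {(0, 0): 10, (1, 0): 15, (0, 1): 12, (1, 1): 17}
interacting = {(0, 0): 10, (1, 0): 15, (0, 1): 12, (1, 1): 25}

print(interaction_contrast(additive))     # 0: drug adds 5 on either diet
print(interaction_contrast(interacting))  # 8: drug potency depends on diet
```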

Nature Methods Points of Significance column: Two Factor Designs.

We can increase the power of the analysis by allowing for interaction, as well as by blocking.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Two Factor Designs Nature Methods 11:1187-1188.

Background reading

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

...more about the Points of Significance column

Nested Designs—Assessing Sources of Noise

Mon 29-09-2014

Sources of noise in experiments can be mitigated and assessed by nested designs. This kind of experimental design naturally models replication, which was the topic of last month's column.

Nature Methods Points of Significance column: Nested designs.

Nested designs are appropriate when we want to use the data derived from experimental subjects to make general statements about populations. In this case, the subjects are random factors in the experiment, in contrast to fixed factors, such as we've seen previously.

In ANOVA, random factors provide information about the amount of noise contributed by each factor. This is different from inferences made about fixed factors, which typically deal with a change in mean. Using the F-test, we can determine whether each layer of replication (e.g. animal, tissue, cell) contributes additional variation to the overall measurement.

Krzywinski, M., Altman, N. & Blainey, P. (2014) Points of Significance: Nested designs Nature Methods 11:977-978.

Background reading

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

Replication—Quality over Quantity

Tue 02-09-2014

It's fitting that the column published just before Labor Day weekend is all about how to best allocate labor.

Replication is used to decrease the impact of variability from parts of the experiment that contribute noise. For example, we might measure data from more than one mouse to attempt to generalize over all mice.

Nature Methods Points of Significance column: Replication.

It's important to distinguish technical replicates, which attempt to capture the noise in our measuring apparatus, from biological replicates, which capture biological variation. The former give us no information about biological variation and cannot be used to directly make biological inferences. To do so is to commit pseudoreplication. Technical replicates are useful to reduce the noise so that we have a better chance to detect a biologically meaningful signal.

Blainey, P., Krzywinski, M. & Altman, N. (2014) Points of Significance: Replication Nature Methods 11:879-880.

Background reading

Krzywinski, M. & Altman, N. (2014) Points of Significance: Analysis of variance (ANOVA) and blocking Nature Methods 11:699-700.

Krzywinski, M. & Altman, N. (2014) Points of Significance: Designing Comparative Experiments Nature Methods 11:597-598.

Monkeys on a Hilbert Curve—Scientific American Graphic

Tue 19-08-2014

I was commissioned by Scientific American to create an information graphic that showed how our genomes are more similar to those of the chimp and bonobo than to the gorilla.

I had about 5 x 5 inches of print space to work with. For 4 genomes? No problem. Bring out the Hilbert curve!
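The Hilbert curve works here because it folds a long one-dimensional sequence into a square while keeping neighbors on the line close together on the page. Below is a sketch of the standard index-to-coordinate conversion. How a genome position is assigned to a cell (`position_to_cell`) is my own simplification and not necessarily how the published graphic was made.

```python
def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve to (x, y) on a 2^order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so the sub-curves join up.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(pos, length, order):
    """Map a 0-based sequence position to a grid cell (assumed binning)."""
    cells = 4 ** order
    return hilbert_d2xy(order, pos * cells // length)


print([hilbert_d2xy(1, d) for d in range(4)])
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```

With `order=8` the grid has 65,536 cells, so a 3 Gb sequence would be binned at roughly 46 kb per cell before any coloring is applied.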

Our genomes are much more similar to the chimp and bonobo than to the gorilla. And, we're practically still Denisovans.

To accompany the piece, I will be posting to the Scientific American blog about the process of creating the figure. And to emphasize that the genome is not a blueprint!

As part of this project, I created some Hilbert curve art pieces. And while exploring, found thousands of Hilbertonians!

Happy Pi Approximation Day— π, roughly speaking 10,000 times

Wed 13-08-2014

Celebrate Pi Approximation Day (July 22nd) with the art of arm waving. This year I take the first 10,000 most accurate approximations (m/n, m=1..10,000) and look at their accuracy.

Accuracy of the first 10,000 m/n approximations of Pi.
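For anyone who wants to play along, here is a rough sketch of how such a list could be generated. It assumes that "m/n" means pairing each numerator m with the denominator n that brings m/n closest to π, which is my reading of the description and not necessarily how the figure was built.

```python
import math


def best_denominator(m):
    """Denominator n >= 1 that makes m/n closest to pi."""
    lo = max(1, math.floor(m / math.pi))
    # The best n is one of the two integers bracketing m/pi.
    return min((lo, lo + 1), key=lambda n: abs(m / n - math.pi))


def approximations(max_m):
    """(m, n, absolute error) for every numerator m = 1..max_m."""
    out = []
    for m in range(1, max_m + 1):
        n = best_denominator(m)
        out.append((m, n, abs(m / n - math.pi)))
    return out


table = approximations(10_000)
m, n, err = min(table[:400], key=lambda row: row[2])
print(m, n)  # 355 113, the best approximation with m up to 400
```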

I turned to the spiral again after applying it to stacked ring plots of frequency distributions in Pi for the 2014 Pi Day.

Frequency distribution of digits of Pi in groups of 4 up to digit 4,988.