Martin Krzywinski / Genome Sciences Center / mkweb.bcgsc.ca


EMBO Journal 2011 Cover Contest

scientific image entry - a hive panel

For the EMBO Journal 2011 Cover Contest, I prepared two entries, one for the scientific category and one for the non-scientific category.

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca
The non-scientific entry was an abstract photo of fiber optics. The scientific entry was an information graphic showing a hive panel of genomic annotations in the human, mouse and dog genomes. The hive panel is built from the newly introduced hive plot.

About the EMBO Journal Cover Contest

The EMBO Journal non-scientific cover prize is awarded for the most interesting and beautiful image made outside the lab. Contestants may submit, for example, photos or artistic impressions of wildlife animals, plants or landscapes. Particularly welcome will also be hand or computer-generated paintings or drawings (or photographs of other works of art) related to a biological or molecular biological topic.

The EMBO Journal scientific cover prize is awarded for the most captivating and thought-provoking contribution depicting a piece of molecular biology research. Entries can include light or electron micrographs, 3D reconstructions or models of biological specimen or molecules, spectacular artefacts collected in the lab, original new views of lab equipment (but not of colleagues!), or other research-based images to be of interest to molecular biologists.

Examples of scientific cover image winners from previous years. My Circos image (top left) won the 2010 scientific image cover category.

2011 Contest and Image Status

The 2011 winners have been announced. The scientific image winner was Heiti Paves, who submitted a confocal image of an Arabidopsis thaliana anther filled with pollen grains. The non-scientific winner was Dieter Lampl, with his "Blue Ice" photo — a glacier in Los Glaciares National Park in Patagonia.

My non-scientific entry (photo of fiber optics) received honourable mention and was included in the Favourites of the Jury gallery.


scientific image entry - a hive panel

Four genomes — The illustration, originally part of a poster, shows syntenic relationships between human, chimpanzee, mouse and zebrafish genomes. Curved links encode sequence similarity and outer data tracks represent consensus similarity statistics and orthologous genes. The cover image shows a detail of a visualization prepared with the free genome comparison tool, Circos. (EMBO Journal - Best Scientific Cover - 2010)

In 2010 EMBO selected my submission of a large Circos figure for its cover (see right). Front page exposure of this sort has made Circos a very popular tool for visualization in genomics, and in particular, in cancer research where there is a need to illustrate differences between genomes.

It was now time to try something else — the hive panel (learn about hive plots and hive panels).

My other entry for the 2011 cover contest, in the non-scientific category, was an abstract photo of fiber optics.

Current State of Network Visualization

Many layout algorithms already exist for visualizing networks. To create attractive layouts, they position nodes and edges to minimize some cost function, such as overlap or force (if edges are treated as springs). Unfortunately, as a result it is generally not possible to relate the position of a node (or the distance between any two nodes in the layout) to its connected neighbourhood in the network. This holds particularly for large networks, where node and edge overlap in the layout is unavoidable.

Hairballs are irrational network visualizations. Shown here are 8 different layouts of the same network — it is impossible to identify that these images correspond to the same network. More importantly, it is very difficult to extract meaningful and quantitative information from these layouts. (Hive plots solve this problem.)

The Hive Plot

Hive Plots for Networks

The hive plot is a rational approach to visualizing networks. It is designed to complement (at times, replace) the network hairball.

In a hive plot, network nodes are assigned to and placed on axes using rational rules. These rules typically are a function of local network structure around the node (connectivity, density, centrality, etc). The resulting plot is interpretable.

In a hive plot, nodes in a network are assigned and placed on axes using properties of the node and its relationship to its neighbours. The resulting layout is rational and easily interpreted, because the rules are based on meaningful quantities. (Hive plots rationalize network visualization.)
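As a minimal sketch of what "rational rules" could look like in practice, the Python below assigns each node of a directed network to one of three axes and to a position along that axis. The specific rules are assumptions chosen for illustration (axis by role as source, sink or intermediate; position by total degree), and the function name `hive_layout` is hypothetical. Any other meaningful node property could be substituted.

```python
import math
from collections import defaultdict

def hive_layout(edges, inner=0.1, outer=1.0):
    """Place nodes of a directed network on three axes using fixed rules.

    edges: list of (source, target) pairs.
    Returns {node: (axis, radius, x, y)}.
    """
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(out_deg) | set(in_deg)
    max_deg = max(out_deg[n] + in_deg[n] for n in nodes)

    layout = {}
    for n in nodes:
        # Rule 1 (assumed): the axis reflects the node's role in the network.
        if in_deg[n] == 0:
            axis = 0  # source: outgoing edges only
        elif out_deg[n] == 0:
            axis = 1  # sink: incoming edges only
        else:
            axis = 2  # intermediate: both
        # Rule 2 (assumed): distance along the axis grows with total degree.
        degree = out_deg[n] + in_deg[n]
        radius = inner + (outer - inner) * degree / max_deg
        # Three axes, 120 degrees apart.
        angle = 2 * math.pi * axis / 3
        layout[n] = (axis, radius, radius * math.cos(angle), radius * math.sin(angle))
    return layout

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
for node, (axis, radius, x, y) in sorted(hive_layout(edges).items()):
    print(node, axis, round(radius, 2))
```

Because the position of every node follows from its own properties, two drawings of the same network made with the same rules are identical, which is not true of a layout produced by iterative optimization.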

Hive Plots for Ratios

The hive plot can be applied to visualize a large number of ratios between three or more scales.

Instead of network edges, the lines in a hive plot now correspond to an (x,y) data pair, which can be interpreted as a ratio (x/y). This approach is particularly effective when lines are drawn as ribbons, which are then stacked. This is shown in the figure below.

A hive plot can be used to visualize ratios by rendering individual ratios as stacked ribbons. The result is the circular equivalent of a stacked bar plot. (Hive plots are useful for visualizing ratios.)

The resulting visualization resembles a stacked bar plot. The circular layout has the advantage that all pairwise comparisons between the axes can be made at a glance (when three axes are used). This layout also gives the image a more compact feel and is particularly suitable for tiling.

In the examples below, a 3-axis hive plot is shown with 8 ratios between each axis. The ratios are independent, in the sense that corresponding ribbons (e.g. blue) may have different thickness on either side of an axis. For example, if x:z = 2:3 and x:y = 1:3 then the ribbon on the left of the x axis will be twice as thick as on the right (see black arrow in figure below).

In a dual scale hive plot, each axis supports two groups of independent ribbons. Axes can be hidden (A), shown (B), or split by various amounts (C 20deg, D 30deg, E 40deg, F 60deg) to explicitly show the transition between ribbons on either side of the axis. (Hive plots are useful for visualizing ratios.)
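The ribbon stacking can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used for the figures: each ratio is taken to be a ribbon whose width at each end equals the corresponding value, ribbons are stacked outward along the axis in list order, and the name `stack_ribbons` is hypothetical.

```python
def stack_ribbons(ratios, start=0.0):
    """Stack ribbons between two axes.

    ratios: list of (label, a, b). The ribbon is a units wide on the first
    axis and b units wide on the second.
    Returns {label: ((a_from, a_to), (b_from, b_to))}, the span each ribbon
    occupies along each axis.
    """
    spans = {}
    pos_a = pos_b = start
    for label, a, b in ratios:
        spans[label] = ((pos_a, pos_a + a), (pos_b, pos_b + b))
        pos_a += a
        pos_b += b
    return spans

# The example from the text: the two sides of the x axis are independent.
x_to_z = stack_ribbons([("blue", 2, 3)])  # x:z = 2:3
x_to_y = stack_ribbons([("blue", 1, 3)])  # x:y = 1:3

width_on_x_left = x_to_z["blue"][0][1] - x_to_z["blue"][0][0]
width_on_x_right = x_to_y["blue"][0][1] - x_to_y["blue"][0][0]
print(width_on_x_left / width_on_x_right)  # the x:z end is twice as thick
```

Each side of an axis gets its own stack, which is why ribbons of the same colour can have different thickness on the two sides.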

The axes in a hive plot can be arranged arbitrarily. In the figure above, panels A and B show 24 ratios: 8 each between the x/y, x/z and y/z axis pairs. In panels C-F each axis is split to create a single 6-axis plot from a dual 3-axis plot. The split axes reveal the transition between ribbons from the left and right sides.
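The geometry of the axis split can be sketched as follows. This assumes that the quoted split is the total angular separation between the two halves of an axis, placed symmetrically about the original axis direction; the function name `axis_angles` is hypothetical.

```python
def axis_angles(n_axes=3, split_deg=0.0):
    """Return the angle(s), in degrees, of each axis of a hive plot.

    With split_deg == 0 each axis is a single line. Otherwise each axis is
    drawn as two lines separated by split_deg, one for each ribbon group.
    """
    angles = []
    for i in range(n_axes):
        centre = 360.0 * i / n_axes
        if split_deg == 0:
            angles.append((centre,))
        else:
            half = split_deg / 2
            angles.append(((centre - half) % 360, (centre + half) % 360))
    return angles

print(axis_angles(3))      # dual 3-axis plot: one line per axis
print(axis_angles(3, 20))  # single 6-axis plot: each axis split by 20 degrees
```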

The dual 3-axis plot appears more stylized and mathematical, whereas the single 6-axis plot is softer and organic. As the axis split distance is increased, the plots begin to look like surface density maps, which to some degree occludes the relationships between the ratio ribbons.

Comparing Genome Annotation

For each of the human (hg18), mouse (mm8) and dog (canfam2) genome assemblies, UCSC annotations, available for each genome from the table browser, were used to hierarchically classify each base in the assembly as gene, repeat, gene+repeat or neither. Within each of these classes, bases were further categorized as conserved or not.

Each base in the genome assembly was assigned to one of eight disjoint categories. (More about hive plots.)

By exhaustively intersecting each of the annotation regions, the assembly was divided into disjoint segments, each with its annotation states. For example, below are a few adjacent regions from hg18 chr1. The columns are genome, chromosome, start, end, size in bases and annotation states (a assembly, r repeat, c-cf conserved with dog, c-mm conserved with mouse).

...
hg 1 120,942,663 120,945,658 2,996 a r
hg 1 120,945,659 120,945,665     7 a
hg 1 120,945,666 120,947,239 1,574 a c-cf c-mm
hg 1 120,947,240 120,947,243     4 a c-cf c-mm r
hg 1 120,947,244 120,947,268    25 a c-mm r
hg 1 120,947,269 120,950,367 3,099 a r
hg 1 120,950,368 120,950,386    19 a
...
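One way to carry out this kind of exhaustive intersection is a sweep over interval breakpoints. The sketch below is an assumption about how it could be done, not the code used for the analysis. It uses 1-based inclusive coordinates, as in the listing above, and the function name `disjoint_segments` and the toy coordinates are hypothetical.

```python
from collections import Counter

def disjoint_segments(tracks):
    """Split overlapping annotation tracks into disjoint segments.

    tracks: {label: [(start, end), ...]}, 1-based inclusive coordinates.
    Returns a list of (start, end, size, labels) in genomic order.
    """
    events = []
    for label, intervals in tracks.items():
        for start, end in intervals:
            events.append((start, 0, label))    # interval opens at start
            events.append((end + 1, 1, label))  # and closes just after end
    events.sort()

    segments = []
    active = Counter()  # how many open intervals each label currently has
    prev = None
    for pos, kind, label in events:
        # Emit the stretch since the previous breakpoint, if anything covers it.
        if prev is not None and pos > prev and +active:
            labels = tuple(sorted(+active))
            segments.append((prev, pos - 1, pos - prev, labels))
        active[label] += 1 if kind == 0 else -1
        prev = pos
    return segments

# Toy coordinates, not taken from hg18.
tracks = {
    "a": [(1, 100)],
    "r": [(1, 30), (60, 100)],
    "c-mm": [(40, 70)],
}
for start, end, size, labels in disjoint_segments(tracks):
    print(start, end, size, " ".join(labels))
```

Neighbouring segments that happen to carry the same states are not merged in this sketch, and the sizes of all segments add up to the size of the assembly track.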

Next, the total size of regions for each combination of annotation was calculated for each pairwise combination of genomes. The second genome in the pair dictates which conservation is used. For example, for the human-mouse pair, the relative fractions of the human genome that fall into each of the categories are

hg mm a        1,839,255,050 0.643542044483869
hg mm a,c-mm     757,027,260 0.264878365091574
hg mm a,r        206,719,589 0.0723296896425132
hg mm a,c-mm,r    42,358,464 0.0148209203088807
hg mm a,g          8,139,587 0.00284798264342638
hg mm a,c-mm,g     4,435,658 0.0015520046651231
hg mm a,g,r           48,994 1.71426463814481e-05
hg mm a,c-mm,g,r      33,869 1.18505182327074e-05

thus categorizing all the 2.86 Gb of the assembled human genome. The corresponding ratios for the mouse genome are

mm hg a          1,388,193,028 0.544355712823795
mm hg a,c-hg       892,892,218 0.350132128602082
mm hg a,r          196,173,508 0.0769260237089193
mm hg a,c-hg,r      62,305,053 0.0244318411447455
mm hg a,g            6,377,904 0.00250098394691097
mm hg a,c-hg,g       4,076,727 0.00159861747416369
mm hg a,g,r             81,889 3.21113447973805e-05
mm hg a,c-hg,g,r        57,585 2.2580954586784e-05
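The fractions in these lists are each category's base count divided by the total over all eight categories. As a quick check, the sketch below recomputes them for the human-mouse pair from the counts in the first list; the variable and function names are illustrative only.

```python
# Base counts for the human-mouse pair, copied from the first list above.
human_vs_mouse = {
    "a": 1_839_255_050,
    "a,c-mm": 757_027_260,
    "a,r": 206_719_589,
    "a,c-mm,r": 42_358_464,
    "a,g": 8_139_587,
    "a,c-mm,g": 4_435_658,
    "a,g,r": 48_994,
    "a,c-mm,g,r": 33_869,
}

def category_fractions(counts):
    """Return the total base count and the fraction falling in each category."""
    total = sum(counts.values())
    return total, {category: n / total for category, n in counts.items()}

total, fractions = category_fractions(human_vs_mouse)
print(f"total bases: {total:,}")
for category, fraction in fractions.items():
    print(f"{category:12s} {fraction:.6f}")
```

The same function applied to the mouse counts gives the second list.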

Using these two lists, all the ratios between the human and mouse axes can be determined. For example, for the conserved/gene/non-repeat regions (the a,c-mm,g and a,c-hg,g lines above) the human:mouse ratio is 0.00155:0.00160. The corresponding ribbon for this ratio is shown below.

The ratio of conserved gene regions not in repeats between human and mouse genomes. (More about hive plots.)

Category assignment into gene, repeat and conserved region was parametrized into three ranges for each criterion. These values were selected heuristically to obtain a reasonable sample for each combination.

  • gene g1 <4kb, g2 4kb-22kb, g3 >22kb
  • repeat r1 simple, r2 LTR, r3 LINE/SINE
  • conservation c1 <45%, c2 45%-58%, c3 >58%

Given three ranges for each of the three criteria, the full comparison is represented by 3 × 3 × 3 = 27 hive plots. These plots are arranged on the cover as follows.

Arrangement of the 27 hive plots on the cover. (More about hive plots.)
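The 27 panels are every combination of one gene range, one repeat class and one conservation range. A short sketch that enumerates them is below; the range labels come from the list above, while the enumeration order is arbitrary and says nothing about where each panel sits on the cover.

```python
from itertools import product

gene = {"g1": "<4kb", "g2": "4kb-22kb", "g3": ">22kb"}
repeat = {"r1": "simple", "r2": "LTR", "r3": "LINE/SINE"}
conservation = {"c1": "<45%", "c2": "45%-58%", "c3": ">58%"}

# One hive plot per (gene, repeat, conservation) combination.
panels = list(product(gene, repeat, conservation))

print(len(panels))
for g, r, c in panels[:3]:
    print(g, r, c, "|", gene[g], repeat[r], conservation[c])
```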

The scale of the axes was logarithmic to maintain visibility of all categories.
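The exact mapping used for the cover is not given here, so the following is only a sketch of why a logarithmic scale helps. It assumes a log10 mapping between a floor value (`min_fraction`, chosen arbitrarily as 1e-5) and 1, and compares it with a linear mapping for the largest and smallest human fractions in the human-mouse list.

```python
import math

def log_position(fraction, min_fraction=1e-5, inner=0.1, outer=1.0):
    """Map a fraction in [min_fraction, 1] to a distance along an axis (log10 scale)."""
    f = min(max(fraction, min_fraction), 1.0)
    t = (math.log10(f) - math.log10(min_fraction)) / -math.log10(min_fraction)
    return inner + t * (outer - inner)

def linear_position(fraction, inner=0.1, outer=1.0):
    """Map a fraction in [0, 1] to a distance along an axis (linear scale)."""
    return inner + fraction * (outer - inner)

largest = 0.643542044483869       # hg mm a
smallest = 1.18505182327074e-05   # hg mm a,c-mm,g,r

for name, f in (("largest", largest), ("smallest", smallest)):
    print(name, round(linear_position(f), 4), round(log_position(f), 4))
```

On the linear scale the smallest category is indistinguishable from the start of the axis, whereas on the log scale it remains visibly separated from it.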

The final cover designs for the cluster of 27 hive plots are shown below.

Final EMBO Journal cover submissions. (More about hive plots.)


news + thoughts

Classification and regression trees

Fri 28-07-2017
Decision trees are a powerful but simple prediction method.

Decision trees classify data by splitting it along the predictor axes into partitions with homogeneous values of the dependent variable. Unlike logistic or linear regression, CART does not develop a prediction equation. Instead, data are predicted by a series of binary decisions based on the boundaries of the splits. Decision trees are very effective and the resulting rules are readily interpreted.

Trees can be built using different metrics that measure how well the splits divide up the data classes: Gini index, entropy or misclassification error.

Nature Methods Points of Significance column: Classification and decision trees. (read)
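The three split metrics mentioned above have standard definitions in terms of the class proportions within a partition. A minimal sketch, with illustrative function names:

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each class among the labels in one partition."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def misclassification(labels):
    """Misclassification error: 1 minus the largest class proportion."""
    return 1.0 - max(class_proportions(labels))

mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
for name, metric in (("gini", gini), ("entropy", entropy), ("error", misclassification)):
    print(name, metric(mixed), metric(pure))
```

All three are zero for a pure partition and largest when the classes are evenly mixed, so a good split is one that lowers the size-weighted average of the chosen metric across the resulting partitions.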

When the dependent variable is quantitative rather than categorical, regression trees are used. Here, the data are still split but now the dependent variable is estimated by the average within the split boundaries. Tree growth can be controlled using the complexity parameter, a measure of the relative improvement of each new split.
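A single step of regression tree growth can be sketched as a search for the threshold that minimizes the total squared error when each side of the split is predicted by its mean. This is a generic illustration with one predictor and made-up data, not code from the column; the name `best_split` is hypothetical.

```python
def best_split(x, y):
    """Find the threshold on x that minimizes squared error of mean predictions.

    Returns (cost, threshold, left_mean, right_mean), or None if no split exists.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_split(x, y))
```

Growing a full tree repeats this search within each resulting partition until a stopping rule, such as a minimum improvement per split, is met.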

Individual trees can be very sensitive to minor changes in the data and even better prediction can be achieved by exploiting this variability. Using ensemble methods, we can grow multiple trees from the same data.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. Nature Methods 14:757–758.

Background reading

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. Nature Methods 13:541–542.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple linear regression. Nature Methods 12:1103–1104.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. Nature Methods 13:603–604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model selection and overfitting. Nature Methods 13:703–704.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Regularization. Nature Methods 13:803–804.


Personalized Oncogenomics Program 5 Year Anniversary Art

Wed 26-07-2017

The artwork was created in collaboration with my colleagues at the Genome Sciences Center to celebrate the 5 year anniversary of the Personalized Oncogenomics Program (POG).

5 Years of Personalized Oncogenomics Program at Canada's Michael Smith Genome Sciences Centre. The poster shows 545 cancer cases. (left) Cases ordered chronologically by case number. (right) Cases grouped by diagnosis (tissue type) and then by similarity within group.

The Personalized Oncogenomics Program (POG) is a collaborative research study including many BC Cancer Agency oncologists, pathologists and other clinicians along with Canada's Michael Smith Genome Sciences Centre, with support from the BC Cancer Foundation.

The aim of the program is to sequence, analyze and compare the genome of each patient's cancer (the entire DNA and RNA inside tumor cells) in order to understand what is enabling it, and to identify less toxic and more effective treatment options.


