latest news

Distractions and amusements, with a sandwich and coffee.

Here we are now at the middle of the fourth large part of this talk.

Nature uses only the longest threads to weave her patterns, so that each small piece of her fabric reveals the organization of the entire tapestry.

— Richard Feynman

Art is Science in Love

— E.F. Weisslitz

The legend can be printed at 4" × 6". The bitmap resolution is 600 dpi.

For every case, we sequence the DNA to study the genome structure and the RNA to discover which genes are expressed and to what extent. The analysis is quite complex and brings together many steps: sequence alignment, structural variation detection, expression profiling, pathway analysis and so on. Every case is "summarized" by a lengthy report, such as the one below, which can run to over 40 pages.

One of the goals of the 5-year anniversary art was to represent the cases in a way that clearly shows their number, classification and diversity. Many metrics could be used, and I chose each case's correlation to other cancer types.

For every POG case, the gene expression of 1,744 key genes is compared to that of thousands of cases in the TCGA database of cancer samples. For a given cancer type in the TCGA database (e.g. BRCA), we visualize the correlations using box plots. The box plot is ideal for showing the distribution of values in a sample.

The 10 largest Spearman correlation coefficients for the case shown above are

| case | corr | type | tissue |
|---|---|---|---|
| POG661 | 0.436 | BRCA | Breast |
| POG661 | 0.371 | PRAD | Urologic |
| POG661 | 0.295 | OV | Gynecologic |
| POG661 | 0.257 | UCEC | Gynecologic |
| POG661 | 0.244 | LUAD | Thoracic |
| POG661 | 0.235 | CESC_CAD | Gynecologic |
| POG661 | 0.225 | MB_Adult | Central Nervous System |
| POG661 | 0.222 | KICH | Urologic |
| POG661 | 0.219 | THCA | Endocrine |
| POG661 | 0.208 | UCS | Gynecologic |
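The coefficients above are Spearman correlations, that is, Pearson correlations computed on ranks rather than raw values. As a minimal sketch of that calculation (not the pipeline actually used for the POG reports), the function names and the expression values below are hypothetical and chosen only for illustration.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values for five genes in a case and a reference sample.
case_expression = [5.1, 0.3, 2.2, 9.8, 1.0]
reference_expression = [4.0, 0.1, 3.5, 20.0, 0.9]
print(spearman(case_expression, reference_expression))
```

Because only the ordering of the genes matters, the two hypothetical samples correlate perfectly even though their raw values differ considerably.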

In the figure below I show how the final encoding of the correlations is done. First, the top three correlations are taken—using more generates a busy look and diminishes visual impact. The correlations are encoded as concentric rings.

Because the differences among the top 3 correlations are relatively small in most cases, they are emphasized by scaling the encoding non-linearly (the correlations are first scaled as `r^3`).
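To see how much the `r^3` scaling spreads the values, here is a small sketch that applies it to the top correlations of POG661 listed earlier. The function name `top_scaled` is hypothetical, and how the scaled value is mapped to ring geometry is not specified here, so the sketch stops at the scaled numbers.

```python
def top_scaled(correlations, n=3, power=3):
    """Keep the n largest correlations and raise each to the given power."""
    top = sorted(correlations, key=lambda item: item[1], reverse=True)[:n]
    return [(label, r, r ** power) for label, r in top]


# Five largest correlations for POG661, taken from the list of coefficients.
pog661 = [
    ("BRCA", 0.436),
    ("PRAD", 0.371),
    ("OV", 0.295),
    ("UCEC", 0.257),
    ("LUAD", 0.244),
]

scaled = top_scaled(pog661)
largest_r = scaled[0][1]
largest_cubed = scaled[0][2]
for label, r, r_cubed in scaled:
    # Compare each value to the largest one, before and after scaling.
    print(f"{label:5s} linear={r / largest_r:.2f} cubed={r_cubed / largest_cubed:.2f}")
```

On a linear scale the second correlation is about 0.85 of the first, whereas after cubing it drops to about 0.62, which makes the rings easier to tell apart.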

The typeface is Proxima Nova. The colors (RGB) for each tissue source are

| tissue | RGB |
|---|---|
| Gastrointestinal | 234,62,144 |
| Breast | 237,75,51 |
| Thoracic | 242,130,56 |
| Gynecologic | 253,188,61 |
| Soft tissue | 244,217,59 |
| Skin | 193,216,51 |
| Urologic | 114,197,49 |
| Hematologic | 29,166,68 |
| Head and neck | 43,168,224 |
| Endocrine | 71,82,178 |
| Central nervous system | 127,65,146 |
| Other | 150,150,150 |

We discuss the many ways in which analysis can be confounded when data has a large number of dimensions (variables). Collectively, these are called the "curses of dimensionality".

Some of these are unintuitive, such as the fact that the volume of the hypersphere increases and then shrinks beyond about 7 dimensions, while the volume of the hypercube always increases. This means that high-dimensional space is "mostly corners" and the distance between points increases greatly with dimension. This has consequences on correlation and classification.
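The rise and fall of the hypersphere volume can be checked numerically. The sketch below uses the standard formula for the volume of a d-dimensional ball and assumes a unit radius with an enclosing hypercube of side 2. Both choices are assumptions made for illustration, and the exact dimension at which the volume peaks depends on the radius chosen.

```python
import math


def ball_volume(d, radius=1.0):
    """Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


def cube_volume(d, side=2.0):
    """Volume of a d-dimensional cube with the given side length."""
    return side ** d


dims = range(1, 21)
peak = max(dims, key=ball_volume)
print(f"unit ball volume peaks at d={peak}")

for d in (2, 3, 5, 7, 10, 20):
    fraction = ball_volume(d) / cube_volume(d)
    # The fraction of the cube occupied by the inscribed ball shrinks rapidly.
    print(f"d={d:2d} ball={ball_volume(d):8.4f} cube={cube_volume(d):10.0f} fraction={fraction:.2e}")
```

The shrinking fraction is one way to read the "mostly corners" remark: as the dimension grows, almost all of the cube's volume lies outside the inscribed ball.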

Altman, N. & Krzywinski, M. (2018) Points of Significance: Curse(s) of dimensionality. Nature Methods 15:399–400.

Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction aims at forecasting unobserved outcomes or future behavior. Typically we want to do both: to know how biological processes work and what will happen next. Inference and ML are complementary in pointing us to biologically meaningful conclusions.

Statistics asks us to choose a model that incorporates our knowledge of the system, and ML requires us to choose a predictive algorithm by relying on its empirical capabilities. Justification for an inference model typically rests on whether we feel it adequately captures the essence of the system. The choice of pattern-learning algorithms often depends on measures of past performance in similar scenarios.

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Statistics vs machine learning. Nature Methods 15:233–234.

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Machine learning: supervised methods. Nature Methods 15:5–6.

Celebrate `\pi` Day (March 14th) and go to brand new places. Together with Jake Lever, this year we shrink the world and play with road maps.

Streets from across the world are seamlessly joined. Finally, a halva shop on the same block!

Intriguing and personal patterns of urban development for each city appear in the Boonies, Burbs and Boutiques series.

No color—just lines. Lines from Marrakesh, Prague, Istanbul, Nice and other destinations for the mind and the heart.

The art is featured in Pi City on Scientific American's SA Visual blog.

Check out art from previous years: 2013 `\pi` Day, 2014 `\pi` Day, 2015 `\pi` Day, 2016 `\pi` Day and 2017 `\pi` Day.

We examine two very common supervised machine learning methods: linear support vector machines (SVM) and k-nearest neighbors (kNN).

SVM is often less computationally demanding than kNN and is easier to interpret, but it can identify only a limited set of patterns. On the other hand, kNN can find very complex patterns, but its output is more challenging to interpret.

We illustrate SVM using a data set in which points fall into two categories, which are separated in SVM by a straight line "margin". SVM can be tuned using a parameter that influences the width and location of the margin, permitting points to fall within the margin or on the wrong side of the margin. We then show how kNN relaxes explicit boundary definitions, such as the straight line in SVM, and how kNN too can be tuned to create more robust classification.
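As a rough illustration of how kNN classifies without an explicit boundary, and how its tuning parameter k affects robustness, here is a minimal sketch. The function name `knn_predict` and the toy points are hypothetical and are not the data set used in the column.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two hypothetical clusters, plus one stray "B" point sitting inside cluster "A".
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
    ((5.5, 6.5), "B"),
    ((2.2, 1.8), "B"),  # stray point
]

query = (2.0, 1.9)
print(knn_predict(train, query, k=1))  # follows the single nearest (stray) point
print(knn_predict(train, query, k=3))  # majority vote among three neighbors
```

With k=1 the query takes the label of the stray point, while k=3 lets the two surrounding "A" points outvote it, which is the sense in which a larger k can make the classification more robust.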

Bzdok, D., Krzywinski, M. & Altman, N. (2018) Points of Significance: Machine learning: supervised methods. Nature Methods 15:5–6.

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.

In a Nature graphics blog article, I present my process behind designing the stark black-and-white Nature 10 cover.

Nature 10, 18 December 2017

In this primer, we focus on essential ML principles: a modeling strategy that lets the data speak for themselves, to the extent possible.

The benefits of ML arise from its use of a large number of tuning parameters or weights, which control the algorithm’s complexity and are estimated from the data using numerical optimization. Often ML algorithms are motivated by heuristics such as models of interacting neurons or natural evolution—even if the underlying mechanism of the biological system being studied is substantially different. The utility of ML algorithms is typically assessed empirically by how well extracted patterns generalize to new observations.

We present a data scenario in which we fit to a model with 5 predictors using polynomials and show what to expect from ML when noise and sample size vary. We also demonstrate the consequences of excluding an important predictor or including a spurious one.

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. Nature Methods 14:1119–1120.