latest news

Distractions and amusements, with a sandwich and coffee.


Numbers are a lot of fun. They can start conversations—the interesting number paradox is a party favourite: every number must be interesting because the first number that wasn't would be very interesting! Of course, in the wrong company they can just as easily end conversations.

The art here is my attempt at transforming famous numbers in mathematics into pretty visual forms, to start some of these conversations and to awaken emotions for mathematics other than dislike and confusion.

Numerology is bogus, but art based on numbers can be beautiful. Proclus got it right when he said (as quoted by M. Kline in *Mathematical Thought from Ancient to Modern Times*):

Wherever there is number, there is beauty.

—Proclus Diadochus

The consequence of the interesting number paradox is that all numbers are interesting. But some are more interesting than others—how Orwellian!

All animals are equal, but some animals are more equal than others.

—George Orwell (Animal Farm)

Numbers such as `\pi` (or `\tau` if you're a revolutionary), `\phi`, `e`, `i = \sqrt{-1}`, and `0` have captivated imagination. Chances are at least one of them appears in the next physics equation you come across.

`\pi` = 3.14159 26535 89793 23846 26433 83279 50288 41971 69399 ...

`\phi` = 1.61803 39887 49894 84820 45868 34365 63811 77203 09179 ...

`e` = 2.71828 18284 59045 23536 02874 71352 66249 77572 47093 ...

Of these three irrational numbers, `\pi` (3.14159265...) is the best known. It is the ratio of a circle's circumference to its diameter (`c = \pi d`) and appears in the formula for the area of the circle (`a = \pi r^2`).

The Golden Ratio (`\phi`, 1.61803398...) is the attractive proportion of values `a > b` that satisfy `{a+b}/a = a/b`, which solves to `a/b = {1 + \sqrt{5}}/2`.
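As a quick numerical check, the Python sketch below sets `b = 1` and `a = \phi` and compares the two ratios in the defining proportion `{a+b}/a = a/b`. The variable names are my own and chosen only for illustration.

```python
import math

# Golden ratio from the closed form a/b = (1 + sqrt(5))/2
phi = (1 + math.sqrt(5)) / 2

# Take b = 1 and a = phi, then compare the two ratios in the definition
a, b = phi, 1.0
ratio_whole_to_large = (a + b) / a
ratio_large_to_small = a / b

print(f"phi         = {phi:.11f}")
print(f"(a + b) / a = {ratio_whole_to_large:.11f}")
print(f"a / b       = {ratio_large_to_small:.11f}")
```

Both ratios should agree to within floating-point rounding.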

The last of the three numbers, `e` (2.71828182...) is Euler's number and also known as the base of the natural logarithm. It, too, can be defined geometrically—it is the unique real number, `e`, for which the function `f(x) = e^x` has a tangent of slope 1 at `x=0`. Like `\pi`, `e` appears throughout mathematics. For example, `e` is central in the expression for the normal distribution as well as the definition of entropy. And if you've ever heard of someone talking about log plots ... well, there's `e` again!
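The slope property can be probed numerically. The sketch below estimates the slope of `b^x` at `x=0` with a central difference for a few bases; the helper name `slope_at_zero` and the step size `h` are assumptions made for this illustration.

```python
import math

def slope_at_zero(base, h=1e-6):
    """Central-difference estimate of the slope of base**x at x = 0."""
    return (base ** h - base ** (-h)) / (2 * h)

# Bases below e give a slope under 1, bases above e give a slope over 1
for base in (2.0, math.e, 3.0):
    print(f"base {base:.5f}: slope at x=0 is about {slope_at_zero(base):.6f}")
```

Only the base `e` should give a slope of (approximately) 1; for any other base the estimate approaches the natural logarithm of that base.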

Two of these numbers can be seen together in mathematics' most beautiful equation, the Euler identity: `e^{i\pi} = -1`. The tau-oists would argue that this is even prettier: `e^{i\tau} = 1`.
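Both forms of the identity can be checked with Python's standard `cmath` module. This is only a floating-point sanity check, so expect a tiny leftover imaginary part rather than an exact result.

```python
import cmath
import math

tau = 2 * math.pi  # one full turn; math.tau holds the same value

euler_pi = cmath.exp(1j * math.pi)  # should be close to -1
euler_tau = cmath.exp(1j * tau)     # should be close to +1

print(euler_pi)
print(euler_tau)
```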

Did you notice how the 13th digit of all three numbers is the same (9)? This accidental similarity generates its own number—the Accidental Similarity Number (ASN).
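If you'd rather not count by hand, here is a small Python sketch that pulls the 13th digit from each number. It uses the leading digits as printed above and counts the integer part as the first digit; the helper `nth_digit` is a name I made up for this example.

```python
# Leading digits as printed in the text above (integer part included)
digits = {
    "pi": "3.14159 26535 89793",
    "phi": "1.61803 39887 49894",
    "e": "2.71828 18284 59045",
}

def nth_digit(value, n):
    """Return the n-th digit (1-based), counting the integer part as digit 1."""
    only_digits = [ch for ch in value if ch.isdigit()]
    return int(only_digits[n - 1])

thirteenth = {name: nth_digit(value, 13) for name, value in digits.items()}
print(thirteenth)  # {'pi': 9, 'phi': 9, 'e': 9}
```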

In this primer, we focus on essential ML principles: a modeling strategy that lets the data speak for themselves, to the extent possible.

The benefits of ML arise from its use of a large number of tuning parameters or weights, which control the algorithm’s complexity and are estimated from the data using numerical optimization. Often ML algorithms are motivated by heuristics such as models of interacting neurons or natural evolution—even if the underlying mechanism of the biological system being studied is substantially different. The utility of ML algorithms is typically assessed empirically by how well extracted patterns generalize to new observations.

We present a data scenario in which we fit to a model with 5 predictors using polynomials and show what to expect from ML when noise and sample size vary. We also demonstrate the consequences of excluding an important predictor or including a spurious one.

Bzdok, D., Krzywinski, M. & Altman, N. (2017) Points of Significance: Machine learning: a primer. *Nature Methods* **14**:1119–1120.

Just in time for the season, I've simulated a snow-pile of snowflakes based on the Gravner-Griffeath model.

Gravner, J. & Griffeath, D. (2007) Modeling Snow Crystal Growth II: A mesoscopic lattice map with plausible dynamics.

My illustration of the location of genes in the human genome that are implicated in disease appears in The Objects that Power the Global Economy, a book by Quartz.

We introduce two common ensemble methods: bagging and random forests. Both methods repeat a statistical analysis on many bootstrap samples and combine the results to improve the accuracy of the predictor. Our column shows these methods as applied to Classification and Regression Trees.

For example, we can sample the space of values more finely when using bagging with regression trees because each sample has potentially different boundaries at which the tree splits.

Random forests generate a large number of trees by not only generating bootstrap samples but also randomly choosing which predictor variables are considered at each split in the tree.
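To make the bagging idea concrete, here is a minimal Python sketch under simplifying assumptions: each "tree" is a one-split regression stump on a single predictor, the data are a simulated noisy step, and all function names are hypothetical. It is not the code used in the column. A random forest would additionally pick a random subset of predictors at each split, which this one-predictor toy cannot show.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold with the lowest
    sum of squared errors and predict the mean of y on each side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        m_left, m_right = mean(left), mean(right)
        sse = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, m_left, m_right)
    if best is None:  # all x identical: fall back to a constant prediction
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    threshold, m_left, m_right = stump
    return m_left if x < threshold else m_right

def bagged_stumps(xs, ys, n_trees, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return mean(predict_stump(s, x) for s in stumps)

# Simulated data: a step from 0 to 1 at x = 0.5 plus Gaussian noise
rng = random.Random(0)
xs = [i / 40 for i in range(40)]
ys = [(0.0 if x < 0.5 else 1.0) + rng.gauss(0, 0.1) for x in xs]

stumps = bagged_stumps(xs, ys, 50, rng)
print("distinct split boundaries:", sorted({round(s[0], 4) for s in stumps}))
print("bagged prediction at 0.2:", round(predict_bagged(stumps, 0.2), 3))
print("bagged prediction at 0.8:", round(predict_bagged(stumps, 0.8), 3))
```

Because each bootstrap sample can place its split at a slightly different boundary, the averaged prediction samples the space of values more finely than any single stump does.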

Krzywinski, M. & Altman, N. (2017) Points of Significance: Ensemble methods: bagging and random forests. *Nature Methods* **14**:933–934.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. *Nature Methods* **14**:757–758.

Decision trees classify data by splitting it along the predictor axes into partitions with homogeneous values of the dependent variable. Unlike logistic or linear regression, CART does not develop a prediction equation. Instead, data are predicted by a series of binary decisions based on the boundaries of the splits. Decision trees are very effective and the resulting rules are readily interpreted.

Trees can be built using different metrics that measure how well the splits divide up the data classes: Gini index, entropy or misclassification error.
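The three metrics are easy to compute from the class counts in a node. The Python sketch below uses their standard textbook definitions; the function names and the example counts are made up for illustration, and the code assumes every node holds at least one observation.

```python
import math

def proportions(counts):
    """Class proportions in a node, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    return 1.0 - sum(p * p for p in proportions(counts))

def entropy(counts):
    return sum(-p * math.log2(p) for p in proportions(counts))

def misclassification(counts):
    return 1.0 - max(proportions(counts))

def split_impurity(metric, left, right):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n_left, n_right = sum(left), sum(right)
    return (n_left * metric(left) + n_right * metric(right)) / (n_left + n_right)

# A parent node with 10 observations of each class, and one candidate split
parent = [10, 10]
left, right = [8, 2], [2, 8]

for metric in (gini, entropy, misclassification):
    before = metric(parent)
    after = split_impurity(metric, left, right)
    print(f"{metric.__name__:>17}: parent {before:.3f} -> children {after:.3f}")
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent; a perfectly homogeneous node scores 0 on all three metrics.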

When the dependent variable is quantitative rather than categorical, regression trees are used. Here, the data are still split, but now the dependent variable is estimated by the average within the split boundaries. Tree growth can be controlled using the complexity parameter, a measure of the relative improvement of each new split.

Individual trees can be very sensitive to minor changes in the data and even better prediction can be achieved by exploiting this variability. Using ensemble methods, we can grow multiple trees from the same data.

Krzywinski, M. & Altman, N. (2017) Points of Significance: Classification and regression trees. *Nature Methods* **14**:757–758.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Logistic regression. *Nature Methods* **13**:541–542.

Altman, N. & Krzywinski, M. (2015) Points of Significance: Multiple Linear Regression. *Nature Methods* **12**:1103–1104.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Classifier evaluation. *Nature Methods* **13**:603–604.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Model Selection and Overfitting. *Nature Methods* **13**:703–704.

Lever, J., Krzywinski, M. & Altman, N. (2016) Points of Significance: Regularization. *Nature Methods* **13**:803–804.

The artwork was created in collaboration with my colleagues at the Genome Sciences Centre to celebrate the 5-year anniversary of the Personalized Oncogenomics Program (POG).

The Personalized Oncogenomics Program (POG) is a collaborative research study including many BC Cancer Agency oncologists, pathologists and other clinicians along with Canada's Michael Smith Genome Sciences Centre, with support from the BC Cancer Foundation.

The aim of the program is to sequence, analyze and compare the genome of each patient's cancer (the entire DNA and RNA inside tumor cells) in order to understand what is enabling it and to identify less toxic and more effective treatment options.