data visualization + databases

Schemaball

Circular visualization of database schemas

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca

▲ SCHEMABALL CAN ACCOMMODATE VARIOUS SCHEMA SIZES | (left) A Medline citation database with 12 tables. (middle) A ugene database with 35 tables. (right) A massive sequencing LIMS database with 135 tables.

Schemaball was published in SysAdmin Magazine (Krzywinski, M. Schemaball: A New Spin on Database Visualization (2004) Sysadmin Magazine Vol 13 Issue 08).

overview

Schemaball is a Perl script which uses GD to generate static, circularly formatted SQL database schema views. Schemaball is well-suited for use in publications, online or print, presentations or schema development. Schemaball is suitable for visualizing schemas of all sizes.

features

To illustrate the features of Schemaball, I'll use our ugene MySQL database (middle schema in the above figure). In the schema view below, the tables are arranged around a circle in alphabetical order. Tables which are linked by foreign key relationships (I'll get to how this is done in MySQL in a moment) are connected using Bezier curves. In the view below, neighbours and next-neighbours of the Clone table are highlighted.

▲ A view of the ugene schema. The ugene database has 35 tables.
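The alphabetical, circular arrangement described above can be sketched in a few lines of Python. This is only an illustration, not Schemaball's own Perl code: the function name `circular_layout`, the start angle, the counter-clockwise direction and the four-table example are all assumptions.

```python
import math


def circular_layout(tables, radius=1.0):
    """Place table names evenly around a circle in alphabetical order.

    Returns {name: (x, y)}. The first table sits at angle 0 and the rest
    follow counter-clockwise; both choices are arbitrary assumptions.
    """
    ordered = sorted(tables, key=str.lower)
    if not ordered:
        return {}
    step = 2 * math.pi / len(ordered)
    return {
        name: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, name in enumerate(ordered)
    }


# Hypothetical subset of table names, for illustration only.
positions = circular_layout(["Gene", "Clone", "Sequence", "Build"])
```

With four tables, each glyph is a quarter turn from its alphabetical neighbours.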

The simplest schema ball is one in which the table glyphs are shown without links. I find this view to be useful as a design template. I can print a few copies out and draw possible table relationships as I work on a schema.

▲ A blank schema ball of the ugene database. Tables without relationships are shown.

Tables can be hidden from the schema ball using regular expressions that are applied to the table names. In this case I've removed all tables which have the letter "a" in them.

▲ The blank schema ball without any tables which have 'a' in their name.

One of the main features of schemaball is the ability to visualize links between tables. Schemaball can parse these links from the schema structure itself (CONSTRAINT table options), using field names (if you've named your foreign keys using some convention), or from a file listing the table pairs.
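As a rough illustration of the first two link sources, the Python sketch below pulls (source, target) pairs out of `CREATE TABLE` text containing `CONSTRAINT ... FOREIGN KEY ... REFERENCES` clauses, and separately infers links from a field-naming convention. The function names, the regular expressions, the `<Table>_id` convention and the two-table DDL are assumptions made for this example; Schemaball's own parser may work differently.

```python
import re

# Assumed clause shape: FOREIGN KEY (`col`) REFERENCES `Other` (`col`)
FK_RE = re.compile(
    r"FOREIGN\s+KEY\s*\([^)]*\)\s*REFERENCES\s+`?(\w+)`?", re.IGNORECASE
)
TABLE_RE = re.compile(r"CREATE\s+TABLE\s+`?(\w+)`?", re.IGNORECASE)


def links_from_ddl(ddl):
    """Return sorted (source, target) pairs found in CREATE TABLE statements."""
    links = set()
    for statement in ddl.split(";"):
        table = TABLE_RE.search(statement)
        if not table:
            continue
        for target in FK_RE.findall(statement):
            links.add((table.group(1), target))
    return sorted(links)


def links_from_field_names(columns_by_table, suffix="_id"):
    """Infer links from a convention: column '<Table><suffix>' points to Table."""
    names = set(columns_by_table)
    links = set()
    for table, columns in columns_by_table.items():
        for column in columns:
            if column.endswith(suffix):
                target = column[: -len(suffix)]
                if target in names and target != table:
                    links.add((table, target))
    return sorted(links)


# Hypothetical two-table schema.
ddl = """
CREATE TABLE `Clone` (
  `Clone_id` int NOT NULL,
  `Gene_id` int,
  PRIMARY KEY (`Clone_id`),
  CONSTRAINT `clone_gene` FOREIGN KEY (`Gene_id`) REFERENCES `Gene` (`Gene_id`)
);
CREATE TABLE `Gene` (
  `Gene_id` int NOT NULL,
  PRIMARY KEY (`Gene_id`)
);
"""
constraint_links = links_from_ddl(ddl)
convention_links = links_from_field_names(
    {"Clone": ["Clone_id", "Gene_id"], "Gene": ["Gene_id"]}
)
```

Both approaches recover the same single link here, from Clone to Gene. A regular expression is a shortcut that suits simple DDL; it is not a full SQL parser.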

▲ Table relationships can be visualized using straight line connectors.

▲ The straight lines can be replaced by Bezier curves to make the schema ball more attractive.

The curvature of the Bezier lines can be adjusted using a parameter which controls the distance from the center of the schema ball of the middle point on the Bezier curve.

▲ The curvature of the Bezier curves can be adjusted.
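One way to picture the curvature parameter is as the distance from the centre of the ball to the control point of a quadratic Bezier curve. The Python sketch below assumes exactly that interpretation; the function name `bezier_link`, the parameter `control_distance` and the choice to place the control point along the direction of the chord midpoint are assumptions, not a description of Schemaball's internals.

```python
import math


def bezier_link(angle_a, angle_b, radius=1.0, control_distance=0.0, steps=20):
    """Sample a quadratic Bezier curve between two points on a circle.

    The control point lies in the direction of the chord midpoint, at
    control_distance from the centre. A distance of 0 pulls the curve
    toward the centre; a distance equal to the chord midpoint's own
    distance from the centre gives a straight line.
    """
    p0 = (radius * math.cos(angle_a), radius * math.sin(angle_a))
    p2 = (radius * math.cos(angle_b), radius * math.sin(angle_b))
    mx, my = (p0[0] + p2[0]) / 2, (p0[1] + p2[1]) / 2
    norm = math.hypot(mx, my)
    if norm > 0:
        p1 = (mx / norm * control_distance, my / norm * control_distance)
    else:
        p1 = (0.0, 0.0)  # diametrically opposite tables: use the centre
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points


# Link between tables at 0 and 90 degrees, control point at the centre.
curve = bezier_link(0.0, math.pi / 2, control_distance=0.0, steps=2)
```

Increasing `control_distance` toward the chord flattens the curve, which is the effect the curvature setting is described as having.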

Schemaball is very flexible. You can turn off the table glyphs and labels and have your schema turned into Bezier art.

▲ Only Bezier links are shown.

If you are paranoid about your intellectual property but would like to show the complexity of your schema to impress your competitors, you can anonymize the labels.

▲ Table names become 'Table IDX' to thwart the reverse-engineers while you patent your masterpiece.

You can specify the colour characteristics of the features in the schema ball. Let's switch to a soothing blue theme. The table glyphs can be stroked and the link lines made thicker.

▲ Rainy Monday colour theme with thicker link lines and stroked table glyphs.

Tables can be highlighted using regular expressions, which are applied to the table names. Below, I show how three different table groups can be highlighted.

▲ Regular expressions are used as filters to determine which tables are highlighted. (left) The regular expression "^Clone" is used to highlight all tables whose names start with the string "Clone". (middle) Three regular expressions, "^Clone", "^Gene" and "^Sequence", are used to highlight specific tables. (right) Tables with names longer than 11 characters are highlighted using "^(.){12,}". The colour and stroke of the highlighted table can be adjusted.
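The three filters in the caption above can be tried out with Python's `re` module. The helper `matching_tables` and the list of table names are hypothetical; only the regular expressions themselves come from the text.

```python
import re


def matching_tables(tables, patterns):
    """Return the tables whose name matches at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [t for t in tables if any(rx.search(t) for rx in compiled)]


# Hypothetical table names, for illustration only.
tables = ["Build", "Clone", "Clone_Source", "Gene", "Sequence_Annotation"]

starts_with_clone = matching_tables(tables, [r"^Clone"])
three_groups = matching_tables(tables, [r"^Clone", r"^Gene", r"^Sequence"])
long_names = matching_tables(tables, [r"^(.){12,}"])  # 12 or more characters
```

The last pattern selects names of 12 characters or more, which is the same as "longer than 11".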

Tables can be hidden from the schema ball. This feature is useful if you have a lot of tables and would like to focus on the relationships between a subset of tables. Hiding is implemented in two ways: making tables in the schema ball invisible, but retaining a gap in the schema ball where the invisible table glyph is, or removing the table from the schema ball altogether and rearranging the other tables to fill the ball. Hiding is controlled using regular expressions, in the same way as highlighting.

▲ The tables are highlighted using "^Clone". All tables are shown (left) and tables matching the regular expression "_" are invisible (middle). Tables matching the regular expression "_" are removed from the schema ball (right).
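The difference between the two hiding modes comes down to when the filter is applied relative to slot assignment. Here is a minimal Python sketch of that idea; the function `layout_with_hiding`, its `remove` flag and the table names are assumptions made for illustration.

```python
import re


def layout_with_hiding(tables, pattern, remove=False):
    """Return {name: angle in degrees} for the tables that are drawn.

    remove=False: matching tables keep their slot, so a gap remains.
    remove=True: matching tables are dropped before slots are assigned,
    so the remaining tables spread out to fill the circle.
    """
    rx = re.compile(pattern)
    ordered = sorted(tables)
    if remove:
        ordered = [t for t in ordered if not rx.search(t)]
    step = 360.0 / len(ordered) if ordered else 0.0
    return {t: i * step for i, t in enumerate(ordered) if not rx.search(t)}


# Hypothetical table names; "_" hides Clone_Source.
tables = ["Build", "Clone", "Clone_Source", "Gene"]
invisible = layout_with_hiding(tables, "_")
removed = layout_with_hiding(tables, "_", remove=True)
```

In the invisible case Gene stays at 270 degrees with an empty slot at 180; in the removed case the three remaining tables sit 120 degrees apart.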

If you hide tables by making them invisible, you can choose to still have links to these tables kept in the schema ball. I don't know why you'd want to do this, but you can.

▲ All the tables with "_" in their name are invisible (left). By default links do not connect to invisible tables, unless you set the "link_to_invisible" option (right).

In addition to hiding and highlighting tables, you can also hide and highlight links. The hiding and highlighting process is controlled by regular expressions, as it is for tables. A link that joins two tables TABLE1 and TABLE2 together is named TABLE1___TABLE2 and the regular expressions controlling link visibility are applied to this compound name. For example, "^TABLE1___" selects all links which point from TABLE1 and "___TABLE2$" selects all links which point to TABLE2.

▲ Hiding and highlighting links is accomplished using regular expressions. Links matching "Clone", "Build" and "Sequence" are hidden. These are all links coming to/from tables which have these strings in their name (left). The same links as hidden in the left panel are highlighted (middle). Links matching "Build" are hidden, while links matching "Clone" and "Sequence" are highlighted (right).

▲ The fact that the schema is a directed graph can be illustrated by adjusting the link highlight regular expression to match the source or sink table in the link. All links coming from the table "Build" are highlighted using the regular expression "^Build___" (left). All links arriving at the "Build" table are highlighted with "___Build$" (right). The highlight colour can be adjusted.
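The compound link name makes direction-sensitive selection a matter of anchoring the pattern. A small Python sketch, with a hypothetical helper `select_links` and a made-up list of links:

```python
import re


def link_name(source, target):
    """Compound name used for matching: SOURCE___TARGET (three underscores)."""
    return f"{source}___{target}"


def select_links(links, pattern):
    """Return the (source, target) pairs whose compound name matches."""
    rx = re.compile(pattern)
    return [(s, t) for s, t in links if rx.search(link_name(s, t))]


# Hypothetical links, for illustration only.
links = [("Build", "Clone"), ("Clone", "Gene"), ("Sequence", "Build")]

from_build = select_links(links, r"^Build___")  # Build is the source
to_build = select_links(links, r"___Build$")    # Build is the target
any_build = select_links(links, r"Build")       # either end
```

An unanchored pattern such as "Build" matches links in both directions, which is why the anchored forms are needed to show the graph's direction.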

Tracing table dependencies through foreign key relationships can be tedious in a large schema. Schemaball supports a chain highlighting scheme which follows links from highlighted tables and highlights connected tables. You can adjust the number of iterations of this scheme to highlight linked neighbours of a table to varying degrees of separation. A number of different parameters control how the highlighting is inherited.

The "highlight_by_link" property is used to highlight tables which connect to highlighted links. This is useful if you would like to highlight all tables which, for example, are referenced by a specific table.

▲ First I highlight all links pointing from the "Build" table using the regular expression "^Build___". I also highlight the Build table itself, but this is not strictly necessary for the second step (left). By setting "highlight_by_link", schemaball automatically highlights any tables participating in the highlighted links (right). Thus I answer the question: what tables are referenced by the Build table?

The highlight_by_iterations parameter specifies the number of cycles of highlight inheritance that Schemaball should follow. You can follow links in the forward or reverse direction, or both.

▲ All links from the Contig table are highlighted using "^Contig___" (left). By using highlight_by_link, all tables associated with the highlighted links are highlighted (middle). Setting highlight_by_iterations to 1 and highlight_by_table_forward causes Schemaball to travel from all highlighted tables along links in the forward direction and highlight new links and new tables which are pointed to by highlighted tables. In this case, 4 new tables are highlighted (right). Thus I answer the question: which tables are 1 or 2 lookups away from the Contig table?

▲ In the same manner as above, links can be followed backwards. In this process, tables which refer to highlighted tables are highlighted. I use the "Build" table, and highlight all links (1) and then tables (2) which point to it, using "___Build$" to highlight the links. By setting highlight_by_table_reverse, turning off highlight_by_table_forward and incrementally increasing highlight_by_iterations from 1 to 4, I get panels (3) to (6).
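The iterative inheritance described above behaves like a bounded breadth-first walk over the directed link graph. The Python sketch below is one way to model it; the function `propagate_highlight`, its arguments and the example links are assumptions. It counts depth from the seed tables and does not try to reproduce exactly how Schemaball counts the initial link-highlighting step.

```python
def propagate_highlight(links, seeds, iterations=1, forward=True, reverse=False):
    """Return {table: depth} for tables reached from the seed tables.

    links is a list of (source, target) pairs. forward follows links from
    source to target; reverse follows them from target to source.
    """
    depth = {table: 0 for table in seeds}
    frontier = set(seeds)
    for i in range(1, iterations + 1):
        reached = set()
        for source, target in links:
            if forward and source in frontier and target not in depth:
                reached.add(target)
            if reverse and target in frontier and source not in depth:
                reached.add(source)
        if not reached:
            break  # nothing new to highlight
        for table in reached:
            depth[table] = i
        frontier = reached
    return depth


# Hypothetical links, for illustration only.
links = [("Contig", "Clone"), ("Clone", "Gene"),
         ("Build", "Contig"), ("Gene", "Sequence")]

downstream = propagate_highlight(links, {"Contig"}, iterations=2)
upstream = propagate_highlight(links, {"Gene"}, iterations=3,
                               forward=False, reverse=True)
```

Recording the depth at which each table was first reached is what later makes it possible to fade the highlight colour by iteration.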

If a large value for highlight_by_iterations is used and the schema is large, you can wind up with many highlighted elements. In order to retain information about the inheritance depth of a highlighted element, the "fade_factor_table" and "fade_factor_link" parameters are used. When these parameters are used, the highlight colour is progressively diluted with each iteration.

▲ This is one of our latest schemas, the sequencing LIMS system. Tables with names longer than 14 characters are hidden and all links and tables upstream from the Equipment table are highlighted. With each iteration, the colour of the table and link highlight is closer to the colour of the corresponding unhighlighted element. Bezier curves are used (left) and straight lines (right). I find the Bezier curves guide the eye to the associated table much better than lines, which tend to diverge immediately from the table.
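The text does not give the exact fading formula, so the following Python sketch simply assumes a geometric blend: at each iteration the highlight colour moves a fixed fraction of the way toward the unhighlighted colour. The function `faded_colour`, the treatment of the fade factor as a number between 0 and 1, and the RGB values are all assumptions.

```python
def faded_colour(highlight, plain, fade_factor, depth):
    """Blend an RGB highlight colour toward the plain colour.

    depth is the iteration at which the element was highlighted.
    With fade_factor in [0, 1], the highlight's weight is
    fade_factor ** depth: 1 at depth 0, shrinking with each iteration.
    """
    weight = fade_factor ** depth
    return tuple(round(p + (h - p) * weight) for h, p in zip(highlight, plain))


# Hypothetical colours: a red highlight over light grey.
highlight, plain = (240, 0, 0), (200, 200, 200)
shades = [faded_colour(highlight, plain, 0.5, depth) for depth in range(3)]
```

Elements reached in later iterations end up closer to the plain colour, so the depth of inheritance remains visible even when many elements are highlighted.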
