
Learning Circos

Bioinformatics and Genome Analysis

Institut Pasteur Tunis, Tunisia, December 10 – December 11, 2018

Download course materials
v2.00 6 Dec 2018
Download PDF slides
v2.00 6 Dec 2018

A 2- or 4-day practical mini-course in Circos, command-line parsing and scripting. This material is part of the Bioinformatics and Genome Analysis course held at the Institut Pasteur Tunis.

sessions / day.4

Visualizing and parsing .maf alignments

Additional material — Day 4

9h00 - 10h30 | Lecture 1 — Perl data structure and regular expression refresher

11h00 - 12h30 | Lecture (practical) 2 — Parsing .maf Ebola strain alignments on the command line

14h00 - 15h30 | Lecture (practical) 3 — Parsing .maf Ebola strain alignments with Perl

16h00 - 18h00 | Lecture (practical) 4 — Visualizing Ebola strain alignments

Concepts covered

.maf multiple-alignment data format, intermediate scripting, regular expressions, flexible parsing with Perl, creating Circos input data files, visualizing alignments between a large number of sequences

sessions / day.4 / lecture.4

Visualizing Ebola strain alignments

sessions / day.4 / lecture.4 / README

We're now going to use the links and karyotype files we created in the last lecture to generate Circos images.
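
Each example directory below contains an etc/circos.conf. To draw an image, run Circos on that file. The command below is a sketch; it assumes the circos binary is in your PATH and that you run it from inside the example directory (e.g. 1/).

# circos -conf etc/circos.conf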

sessions / day.4 / lecture.4 / 1 / etc / circos.conf
karyotype = ../../lecture.3/2/karyotype.txt

chromosomes_display_default = no

We have alignment sets for Zaire_1995 vs all strains, such as Nakisamata_2011, Marburg_KitumCave_Kenya_1987, Bundibugyo_2007, Reston_1996, Maleo_1979, Cote_dIvoire_1994 and others.

Use a regular expression that uniquely matches only two of the strains. Remember that you can get a list of the strains by running parsemaf without any options.

# cat ebola.multiz.txt | ./parsemaf

Here, the regular expression matches Zaire_1995 and Maleo_1979:

chromosomes         = /Zaire|Maleo/

What would happen if the regular expression matched more than two strains? Try it.
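
For example, a pattern like the one below also matches Reston_1996 (an arbitrary third choice here, for illustration):

chromosomes         = /Zaire|Maleo|Reston/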

There are some parts of the genome without alignments. You can crop the axis to focus on regions of interest. For example, the crop below is applied to both chromosomes and shows the axis from the start of the chromosome to 12 kb and then from 15 to 19 kb.

#chromosomes         = /Zaire|Maleo/:(-12000,15000-19000

In the karyotype file, the color of each genome is black. You can apply color to the ideograms, and reverse their orientation, like this:

chromosomes_color   = /Zaire/:red

chromosomes_reverse = /Zaire/

<links>
<link>
file = ../../lecture.3/2/z.vs.all.txt
radius = 0.98r
bezier_radius = 0r
color = black
thickness = 1
</link>
</links>

<<include ideogram.conf>>
<<include ticks.conf>>
sessions/day.4/lecture.4/1/circos.png
sessions / day.4 / lecture.4 / 2 / etc / circos.conf
karyotype = ../../lecture.3/2/karyotype.txt

chromosomes_display_default = no

chromosomes = /Zaire|Maleo/:(-12000,15000-19000
chromosomes_color = /Zaire/:red
chromosomes_reverse = /Zaire/

<links>
<link>
file = ../../lecture.3/2/z.vs.all.txt
radius = 0.98r
bezier_radius = 0r
color = black
thickness = 1

<rules>
<rule>
condition = var(offset) > -45
color = lgrey
</rule>

Now let's color the other offsets based on their size.

<rule>
use = no
condition = 1
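# this rule is disabled by use = no above; set it to yes to apply it
# the color line below remaps |offset| from [45,200] onto the palette
# index range [1,11] and picks that spectral-11-div Brewer color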
color = eval(sprintf("spectral-11-div-%d",remap_int(abs(var(offset)),45,200,1,11)))
</rule>
</rules>

</link>
</links>

It looks like Zaire vs Maleo don't have very extreme offsets. Is this true?

Generate a histogram of offsets for this pair. Hint: grep out the offsets from the link file.
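
One way, sketched with the same sed idiom as in showhistogram.sh (it assumes Maleo appears only in the lines for this pair, which holds here because the comparison strain is named in column 4 of the link file):

# grep Maleo ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | ./histogram -bins 15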

What strain has the largest positive offset? The largest negative offset? Plot the offsets for these strains' alignments to Zaire. Can you figure out how to plot both these strains and Zaire on the same plot? Hint: change the regular expression for the 'chromosomes' parameter.
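
A sketch for hunting down the extremes; stripping the offset= prefix makes column 7 sortable numerically, and column 4 of the reported lines names the strain:

# sed 's/offset=//' ../../lecture.3/2/z.vs.all.txt | sort -n -k 7 | head -1
# sed 's/offset=//' ../../lecture.3/2/z.vs.all.txt | sort -n -k 7 | tail -1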

Try different color palettes, such as blues-9-seq or piyg-9-div.

Make sure to change the range of remapped values from 1-11 to 1-9. For more about Brewer palettes, see Day 1 Lecture 3.
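
For example, with blues-9-seq the color line of the rule above would become

color = eval(sprintf("blues-9-seq-%d",remap_int(abs(var(offset)),45,200,1,9)))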

As a challenge, think of how you would draw the largest 250 offsets across all alignments. Hint: hide all alignments except those with the right size cutoff and show ideograms for all strains.
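
One sketch of a starting point: find the |offset| of the 250th largest offset with a Perl one-liner, then hide everything below it with a rule like condition = abs(var(offset)) < CUTOFF together with hide = yes, where CUTOFF is the value you find.

# sed 's/.*=//' ../../lecture.3/2/z.vs.all.txt | perl -lne 'print abs($_)' | sort -rn | head -250 | tail -1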

sessions/day.4/lecture.4/2/circos.png
sessions / day.4 / lecture.4 / 2 / README

Each alignment has an offset stored in the optional 7th column in the link file.

#z.vs.all.txt
Zaire_1995 436 572 M-M_2007 388 524 offset=-48
Zaire_1995 436 572 G3758_2014 417 553 offset=-19
Zaire_1995 436 572 G3826_2014 412 548 offset=-24
Zaire_1995 436 572 EM095B_2014 391 527 offset=-45
...

The offset is the difference between the position on the comparison genome (end of link) and the reference genome (start of link). For example, in the first line above, offset = 388 - 436 = -48.

It's worthwhile to look at how the offsets are distributed; for this, take a look at the histogram script in this directory. It generates a histogram of values at the command line using ASCII characters and provides some basic stats.

histogram.10.90.txt
histogram.-300.p20.txt
showhistogram.sh
histogram

Look at the histogram script and understand how it works. If you have time, edit the code so that the histogram bin symbol can be defined on the command line, with usage like

histogram ... -char X

to draw bins using X's.
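
A minimal sketch of the change, assuming the script's option parser is extended to accept -char into $CONF{char} (the parser is not part of the excerpt below, so this name is hypothetical):

# use the -char symbol if one was given, otherwise keep the default *
my $bin_char = defined $CONF{char} ? $CONF{char} : "*";
# ... and in the printinfo() call that draws each bin, replace
# "*"x$binwidth with
$bin_char x $binwidth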

You can use this script to hunt around in an interesting part of the distribution of offsets. Which part is interesting? Well, that's for you to decide! Let's look at some of the largest offsets (in absolute value) by making all alignments with offsets larger than -45 light grey.

Keep in mind that this histogram is for all alignments across all strains.

This kind of analysis, of distributions, aggregate statistics, and much much more, can be done in R. However, for simple questions, like what's the sum or average of these numbers, I encourage you to have a script handy.

For example, I have a large number of command-line helper scripts that I use every day to help me manipulate files quickly: swapping and rolling columns, testing columns for value ranges, sampling a given fraction of lines from a file, and so on.
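
In that spirit, a throwaway sum-and-average one-liner (a sketch; it reuses the sed idiom from showhistogram.sh to strip everything up to the =):

# cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | awk '{s+=$1; n++} END {print "sum", s, "avg", s/n}'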

sessions / day.4 / lecture.4 / 2 / showhistogram.sh
#!/bin/bash

# show the histogram of 1000 uniformly distributed random numbers
seq 1 1000 | awk 'BEGIN { srand() } {print rand()}' | ./histogram | tee histogram.uniform.txt

# show the histogram of 1000 normally(ish) (guaranteed by the Central
# Limit Theorem) random numbers generated from average of n = 10 samples
# of uniform random numbers
seq 1 1000 | \
awk 'BEGIN { srand();N=10 } { samplesum=0; for(i=1;i<=N;++i) samplesum+=rand(); print samplesum/N}' | \
./histogram | tee histogram.normal.txt

# show the histogram of the 10-90% of offsets
cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | \
./histogram -bins 15 -pmin 10 -pmax 90 | tee histogram.10.90.txt

# show the histogram of offsets between -300 and the 20th percentile (-45)
cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | \
./histogram -bins 15 -min -300 -pmax 20 | tee histogram.-300.p20.txt
sessions / day.4 / lecture.4 / 2 / histogram.uniform.txt
               0.0001>    0  0.000
    0.0001     0.0501   39  0.039  0.039 ***************
    0.0501     0.1001   59  0.059  0.098 ***********************
    0.1001     0.1501   45  0.045  0.143 *****************
    0.1501     0.2001   63  0.063  0.206 *************************
    0.2001     0.2501   41  0.041  0.247 ****************
    0.2501     0.3001   53  0.053  0.300 *********************
    0.3001     0.3501   47  0.047  0.347 ******************
    0.3501     0.4000   55  0.055  0.402 *********************
    0.4000     0.4500   51  0.051  0.453 ********************
    0.4500     0.5000   48  0.048  0.502 *******************
    0.5000     0.5500   56  0.056  0.558 **********************
    0.5500     0.6000   42  0.042  0.600 ****************
    0.6000     0.6500   53  0.053  0.653 *********************
    0.6500     0.7000   35  0.035  0.688 *************
    0.7000     0.7500   55  0.055  0.743 *********************
    0.7500     0.8000   45  0.045  0.788 *****************
    0.8000     0.8500   55  0.055  0.843 *********************
    0.8500     0.9000   54  0.054  0.897 *********************
    0.9000     0.9500   42  0.042  0.939 ****************
    0.9500     0.9999   61  0.061  1.000 ************************
               0.9999<    0  0.000
       n         1000
 average      0.50246
      sd      0.29102
     min      0.00011
     max      0.99995
     sum    502.45584
sessions / day.4 / lecture.4 / 2 / histogram.normal.txt
               0.1875>    0  0.000
    0.1875     0.2178     2  0.002  0.002
    0.2178     0.2480     1  0.001  0.003
    0.2480     0.2783     3  0.003  0.006
    0.2783     0.3086    10  0.010  0.016 *
    0.3086     0.3388    22  0.022  0.038 ***
    0.3388     0.3691    31  0.031  0.069 *****
    0.3691     0.3993    67  0.067  0.136 ***********
    0.3993     0.4296    91  0.091  0.227 ***************
    0.4296     0.4598   101  0.101  0.328 *****************
    0.4598     0.4901   145  0.145  0.473 *************************
    0.4901     0.5203   130  0.130  0.603 **********************
    0.5203     0.5506   138  0.138  0.741 ***********************
    0.5506     0.5808    94  0.094  0.835 ****************
    0.5808     0.6111    60  0.060  0.895 **********
    0.6111     0.6413    51  0.051  0.946 ********
    0.6413     0.6716    26  0.026  0.972 ****
    0.6716     0.7018    13  0.013  0.985 **
    0.7018     0.7321    12  0.012  0.997 **
    0.7321     0.7624     2  0.002  0.999
    0.7624     0.7926     1  0.001  1.000
               0.7926<    0  0.000
       n         1000
 average      0.49702
      sd      0.08856
     min      0.18754
     max      0.79261
     sum    497.02431
sessions / day.4 / lecture.4 / 2 / histogram.10.90.txt
             -54.0000>  2383  0.099
  -54.0000   -39.0000   4195  0.196  0.196 *********
  -39.0000   -24.0000   3737  0.174  0.370 ********
  -24.0000    -9.0000   2807  0.131  0.501 ******
   -9.0000     6.0000  10687  0.499  1.000 *************************
               0.0000<   275  0.011
       n        23954
 average    -25.02906
      sd     32.74197
     min   -996.00000
     max    392.00000
     sum -599546.00000
sessions / day.4 / lecture.4 / 2 / histogram.-300.p20.txt
            -300.0000>     4  0.000
 -300.0000  -283.0000      0  0.000  0.000
 -283.0000  -266.0000      0  0.000  0.000
 -266.0000  -249.0000      0  0.000  0.000
 -249.0000  -232.0000      2  0.000  0.000
 -232.0000  -215.0000      7  0.001  0.002
 -215.0000  -198.0000      1  0.000  0.002
 -198.0000  -181.0000    115  0.021  0.023
 -181.0000  -164.0000      0  0.000  0.023
 -164.0000  -147.0000      3  0.001  0.024
 -147.0000  -130.0000    114  0.021  0.045
 -130.0000  -113.0000    466  0.087  0.132 ***
 -113.0000   -96.0000    148  0.028  0.159 *
  -96.0000   -79.0000    636  0.118  0.277 ****
  -79.0000   -62.0000    385  0.072  0.349 **
  -62.0000   -45.0000   3501  0.651  1.000 *************************
             -45.0000< 18572  0.775
       n        23954
 average    -25.02906
      sd     32.74197
     min   -996.00000
     max    392.00000
     sum -599546.00000
sessions / day.4 / lecture.4 / 2 / histogram
# read the values from STDIN and use the -col COLUMN, by default 0
my @values;
while(my $line = <>) {
  chomp $line;
  next if $line =~ /^\#/;
  my @tok = split(" ",$line);
  if(defined $CONF{col} && ! defined $tok[ $CONF{col} ]) {
    die "The line [$line] has no number in column [$CONF{col}]";
  }
  push @values, $tok[ $CONF{col} ];
}

# get min and max of values
my ($min,$max) = minmax(@values);

# change min/max range if redefined via -min and/or -max
$min = $CONF{min} if defined $CONF{min};# && $CONF{min} > $min;
$max = $CONF{max} if defined $CONF{max};# && $CONF{max} < $max;

# *very* quick and dirty percentile if -pmin or -pmax used
if($CONF{pmin} || $CONF{pmax}) {
  my @sorted = sort {$a <=> $b} @values;
  $min = $sorted[ $CONF{pmin}/100 * (@sorted-1) ] if $CONF{pmin};
  $max = $sorted[ $CONF{pmax}/100 * (@sorted-1) ] if $CONF{pmax};
}

# determine bin size from either the -binsize command-line parameter
# or from -nbins. At least one of these is defined — we made sure of
# that in validateconfiguration()
my $bin_size;
if($CONF{binsize}) {
  $bin_size = $CONF{binsize};
  $CONF{nbins} = round(($max-$min)/$CONF{binsize});
} else {
  $bin_size = ($max-$min)/$CONF{nbins};
}

# populate bins
my $bins;
for my $i (0..$CONF{nbins}-1) {
  my $start = $min + $i * $bin_size;
  my $end   = $min + ($i+1) * $bin_size;
  my $n     = grep($_ >= $start && ($_ < $end || ( $i == $CONF{nbins}-1 && $_ <= $end)), @values);
  push @$bins, { start => $start,
                 i     => $i,
                 end   => $end,
                 size  => $bin_size,
                 n     => $n };
}

my $histogram_width = $CONF{height};
my $maxn   = max(map { $_->{n} } @$bins);
my $sumn   = sum(map { $_->{n} } @$bins);
my $cumuln = 0;

my $bin_count_width = max(map { length($_->{n}) } @$bins);
my $bin_count_fmt   = sprintf("%%%dd",2+$bin_count_width);
my $bin_width_f_max = max(map { $_->{n}/$maxn } @$bins);

# report number and fraction of values smaller than requested minimum
printinfo(sprintf("%10s %10.4f>$bin_count_fmt %6.3f","",
                  $min,
                  int(grep($_ < $min, @values)),
                  int(grep($_ < $min, @values))/@values));

# report each bin: min, max, count, fractional count, cumulative count
for my $bin (@$bins) {
  $cumuln += $bin->{n};
  my $binwidth = $bin->{n}*$histogram_width/$maxn;
  if(defined $CONF{sqrt}) {
    my $power = $CONF{sqrt} || 0.5;
    my $f     = $bin->{n}/$maxn;
    $binwidth = $f**$power/$bin_width_f_max**$power * $histogram_width;
  }
  printinfo(sprintf("%10.4f %10.4f $bin_count_fmt %6.3f %6.3f %-${histogram_width}s",
                    $bin->{start},
                    $bin->{end},
                    $bin->{n},
                    $bin->{n}/$sumn,
                    $cumuln/$sumn,
                    "*"x$binwidth));
}

# report number and fraction of values larger than requested maximum
printinfo(sprintf("%10s %10.4f<%6d %6.3f","",
                  $max,
                  int(grep($_ > $max, @values)),
                  int(grep($_ > $max, @values))/@values));

# aggregate stats for full data set
printinfo(sprintf("%8s %12d","n",int(@values)));
printinfo(sprintf("%8s %12.5f","average",average(@values)));
printinfo(sprintf("%8s %12.5f","sd",stddev(@values)));
printinfo(sprintf("%8s %12.5f","min",min(@values)));
printinfo(sprintf("%8s %12.5f","max",max(@values)));
printinfo(sprintf("%8s %12.5f","sum",sum(@values)));