
Learning Circos

Bioinformatics and Genome Analysis

Institut Pasteur Tunis, Tunisia, December 10 – December 11, 2018

Download course materials
v2.00 6 Dec 2018
Download PDF slides
v2.00 6 Dec 2018

A 2- or 4-day practical mini-course in Circos, command-line parsing and scripting. This material is part of the Bioinformatics and Genome Analysis course held at the Institut Pasteur Tunis.

sessions / day.4

Visualizing and parsing .maf alignments

Additional material — Day 4

9h00 - 10h30 | Lecture 1 — Perl data structure and regular expression refresher

11h00 - 12h30 | Lecture (practical) 2 — Parsing .maf Ebola strain alignments on the command line

14h00 - 15h30 | Lecture (practical) 3 — Parsing .maf Ebola strain alignments with Perl

16h00 - 18h00 | Lecture (practical) 4 — Visualizing Ebola strain alignments

Concepts covered

.maf multiple-alignment data format, intermediate scripting, regular expressions, flexible parsing with Perl, creating Circos input data files, visualizing alignments between a large number of sequences

sessions / day.4 / lecture.4

Visualizing Ebola strain alignments

sessions / day.4 / lecture.4 / README

We're now going to use the links and karyotype files we created in the last lecture to generate Circos images.
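
Each example directory below contains an etc/circos.conf. To draw an image, run Circos on that file. The command below is a sketch; it assumes the circos binary is in your PATH and that you run it from inside the example directory (e.g. 1/).

# circos -conf etc/circos.conf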

sessions / day.4 / lecture.4 / 1 / etc / circos.conf
karyotype = ../../lecture.3/2/karyotype.txt

chromosomes_display_default = no

We have alignment sets for Zaire_1995 vs all strains, such as Nakisamata_2011, Marburg_KitumCave_Kenya_1987, Bundibugyo_2007, Reston_1996, Maleo_1979, Cote_dIvoire_1994 and others.

Use a regular expression that uniquely matches only two of the strains. Remember that you can get a list of the strains by running parsemaf without any options.

# cat ebola.multiz.txt | ./parsemaf

Here, the regular expression matches Zaire_1995 and Maleo_1979:

chromosomes         = /Zaire|Maleo/

What would happen if the regular expression matched more than two strains? Try it.
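
For example, a pattern like the one below also matches Reston_1996 (an arbitrary third choice here, for illustration):

chromosomes         = /Zaire|Maleo|Reston/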

There are some parts of the genome without alignments. You can crop the axis to focus on regions of interest. For example, the crop below is applied to both chromosomes and shows the axis from the start of the chromosome to 12 kb and then from 15 to 19 kb.

#chromosomes         = /Zaire|Maleo/:(-12000,15000-19000

In the karyotype file, the color of each genome is black. You can apply color to the ideograms, and reverse their orientation, like this:

chromosomes_color   = /Zaire/:red

chromosomes_reverse = /Zaire/

<links>
<link>
file = ../../lecture.3/2/z.vs.all.txt
radius = 0.98r
bezier_radius = 0r
color = black
thickness = 1
</link>
</links>

<<include ideogram.conf>>
<<include ticks.conf>>
sessions/day.4/lecture.4/1/circos.png
sessions / day.4 / lecture.4 / 2 / etc / circos.conf
karyotype = ../../lecture.3/2/karyotype.txt

chromosomes_display_default = no

chromosomes = /Zaire|Maleo/:(-12000,15000-19000
chromosomes_color = /Zaire/:red
chromosomes_reverse = /Zaire/

<links>
<link>
file = ../../lecture.3/2/z.vs.all.txt
radius = 0.98r
bezier_radius = 0r
color = black
thickness = 1

<rules>
<rule>
condition = var(offset) > -45
color = lgrey
</rule>

Now let's color the other offsets based on their size.

<rule>
use = no
condition = 1
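# this rule is disabled by use = no above; set it to yes to apply it
# the color line below remaps |offset| from [45,200] onto the palette
# index range [1,11] and picks that spectral-11-div Brewer color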
color = eval(sprintf("spectral-11-div-%d",remap_int(abs(var(offset)),45,200,1,11)))
</rule>
</rules>

</link>
</links>

It looks like Zaire vs Maleo don't have very extreme offsets. Is this true?

Generate a histogram of offsets for this pair. Hint: grep out the offsets from the link file.
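
One way, sketched with the same sed idiom as in showhistogram.sh (it assumes Maleo appears only in the lines for this pair, which holds here because the comparison strain is named in column 4 of the link file):

# grep Maleo ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | ./histogram -bins 15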

What strain has the largest positive offset? The largest negative offset? Plot the offsets for these strains' alignments to Zaire. Can you figure out how to plot both these strains and Zaire on the same plot? Hint: change the regular expression for the 'chromosomes' parameter.
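
A sketch for hunting down the extremes; stripping the offset= prefix makes column 7 sortable numerically, and column 4 of the reported lines names the strain:

# sed 's/offset=//' ../../lecture.3/2/z.vs.all.txt | sort -n -k 7 | head -1
# sed 's/offset=//' ../../lecture.3/2/z.vs.all.txt | sort -n -k 7 | tail -1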

Try different color palettes, such as blues-9-seq or piyg-9-div.

Make sure to change the range of remapped values from 1-11 to 1-9. For more about Brewer palettes, see Day 1 Lecture 3.
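
For example, with blues-9-seq the color line of the rule above would become

color = eval(sprintf("blues-9-seq-%d",remap_int(abs(var(offset)),45,200,1,9)))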

As a challenge, think of how you would draw the largest 250 offsets across all alignments. Hint: hide all alignments except those with the right size cutoff and show ideograms for all strains.
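
One sketch of a starting point: find the |offset| of the 250th largest offset with a Perl one-liner, then hide everything below it with a rule like condition = abs(var(offset)) < CUTOFF together with hide = yes, where CUTOFF is the value you find.

# sed 's/.*=//' ../../lecture.3/2/z.vs.all.txt | perl -lne 'print abs($_)' | sort -rn | head -250 | tail -1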

sessions/day.4/lecture.4/2/circos.png
sessions / day.4 / lecture.4 / 2 / README

Each alignment has an offset stored in the optional 7th column in the link file.

#z.vs.all.txt
Zaire_1995 436 572 M-M_2007 388 524 offset=-48
Zaire_1995 436 572 G3758_2014 417 553 offset=-19
Zaire_1995 436 572 G3826_2014 412 548 offset=-24
Zaire_1995 436 572 EM095B_2014 391 527 offset=-45
...

The offset is the difference between the position on the comparison genome (end of link) and the reference genome (start of link). For example, in the first line above, offset = 388 - 436 = -48.

It's worthwhile to look at how the offsets are distributed; for this, take a look at the histogram script in this directory. It generates a histogram of values at the command line using ASCII characters and provides some basic stats.

histogram.10.90.txt
histogram.-300.p20.txt
showhistogram.sh
histogram

Look at the histogram script and understand how it works. If you have time, edit the code so that the histogram bin symbol can be defined on the command line, with usage like

histogram ... -char X

to draw bins using X's.
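
A minimal sketch of the change, assuming the script's option parser is extended to accept -char into $CONF{char} (the parser is not part of the excerpt below, so this name is hypothetical):

# use the -char symbol if one was given, otherwise keep the default *
my $bin_char = defined $CONF{char} ? $CONF{char} : "*";
# ... and in the printinfo() call that draws each bin, replace
# "*"x$binwidth with
$bin_char x $binwidth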

You can use this script to hunt around in an interesting part of the distribution of offsets. Which part is interesting? Well, that's for you to decide! Let's look at some of the largest offsets (in absolute value) by making all alignments with offsets larger than -45 light grey.

Keep in mind that this histogram is for all alignments across all strains.

This kind of analysis, of distributions, aggregate statistics, and much much more, can be done in R. However, for simple questions, like what's the sum or average of these numbers, I encourage you to have a script handy.

For example, I have a large number of command-line helper scripts that I use every day to help me manipulate files quickly: swapping and rolling columns, testing columns for value ranges, sampling a given fraction of lines from a file, and so on.
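
In that spirit, a throwaway sum-and-average one-liner (a sketch; it reuses the sed idiom from showhistogram.sh to strip everything up to the =):

# cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | awk '{s+=$1; n++} END {print "sum", s, "avg", s/n}'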

sessions / day.4 / lecture.4 / 2 / showhistogram.sh
#!/bin/bash

# show the histogram of 1000 uniformly distributed random numbers
seq 1 1000 | awk 'BEGIN { srand() } {print rand()}' | ./histogram | tee histogram.uniform.txt

# show the histogram of 1000 normally(ish) (guaranteed by the Central
# Limit Theorem) random numbers generated from average of n = 10 samples
# of uniform random numbers
seq 1 1000 | \
awk 'BEGIN { srand();N=10 } { samplesum=0; for(i=1;i<=N;++i) samplesum+=rand(); print samplesum/N}' | \
./histogram | tee histogram.normal.txt

# show the histogram of the 10-90% of offsets
cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | \
./histogram -bins 15 -pmin 10 -pmax 90 | tee histogram.10.90.txt

# show the histogram of offsets between -300 and the 20th percentile (-45)
cat ../../lecture.3/2/z.vs.all.txt | sed 's/.*=//' | \
./histogram -bins 15 -min -300 -pmax 20 | tee histogram.-300.p20.txt
sessions / day.4 / lecture.4 / 2 / histogram.uniform.txt
               0.0001>    0  0.000
    0.0001     0.0501   39  0.039  0.039 ***************
    0.0501     0.1001   59  0.059  0.098 ***********************
    0.1001     0.1501   45  0.045  0.143 *****************
    0.1501     0.2001   63  0.063  0.206 *************************
    0.2001     0.2501   41  0.041  0.247 ****************
    0.2501     0.3001   53  0.053  0.300 *********************
    0.3001     0.3501   47  0.047  0.347 ******************
    0.3501     0.4000   55  0.055  0.402 *********************
    0.4000     0.4500   51  0.051  0.453 ********************
    0.4500     0.5000   48  0.048  0.502 *******************
    0.5000     0.5500   56  0.056  0.558 **********************
    0.5500     0.6000   42  0.042  0.600 ****************
    0.6000     0.6500   53  0.053  0.653 *********************
    0.6500     0.7000   35  0.035  0.688 *************
    0.7000     0.7500   55  0.055  0.743 *********************
    0.7500     0.8000   45  0.045  0.788 *****************
    0.8000     0.8500   55  0.055  0.843 *********************
    0.8500     0.9000   54  0.054  0.897 *********************
    0.9000     0.9500   42  0.042  0.939 ****************
    0.9500     0.9999   61  0.061  1.000 ************************
               0.9999<    0  0.000
       n         1000
 average      0.50246
      sd      0.29102
     min      0.00011
     max      0.99995
     sum    502.45584
sessions / day.4 / lecture.4 / 2 / histogram.normal.txt
               0.1875>    0  0.000
    0.1875     0.2178     2  0.002  0.002
    0.2178     0.2480     1  0.001  0.003
    0.2480     0.2783     3  0.003  0.006
    0.2783     0.3086    10  0.010  0.016 *
    0.3086     0.3388    22  0.022  0.038 ***
    0.3388     0.3691    31  0.031  0.069 *****
    0.3691     0.3993    67  0.067  0.136 ***********
    0.3993     0.4296    91  0.091  0.227 ***************
    0.4296     0.4598   101  0.101  0.328 *****************
    0.4598     0.4901   145  0.145  0.473 *************************
    0.4901     0.5203   130  0.130  0.603 **********************
    0.5203     0.5506   138  0.138  0.741 ***********************
    0.5506     0.5808    94  0.094  0.835 ****************
    0.5808     0.6111    60  0.060  0.895 **********
    0.6111     0.6413    51  0.051  0.946 ********
    0.6413     0.6716    26  0.026  0.972 ****
    0.6716     0.7018    13  0.013  0.985 **
    0.7018     0.7321    12  0.012  0.997 **
    0.7321     0.7624     2  0.002  0.999
    0.7624     0.7926     1  0.001  1.000
               0.7926<    0  0.000
       n         1000
 average      0.49702
      sd      0.08856
     min      0.18754
     max      0.79261
     sum    497.02431
sessions / day.4 / lecture.4 / 2 / histogram.10.90.txt
             -54.0000>  2383  0.099
  -54.0000   -39.0000   4195  0.196  0.196 *********
  -39.0000   -24.0000   3737  0.174  0.370 ********
  -24.0000    -9.0000   2807  0.131  0.501 ******
   -9.0000     6.0000  10687  0.499  1.000 *************************
               0.0000<   275  0.011
       n        23954
 average    -25.02906
      sd     32.74197
     min   -996.00000
     max    392.00000
     sum -599546.00000
sessions / day.4 / lecture.4 / 2 / histogram.-300.p20.txt
            -300.0000>     4  0.000
 -300.0000  -283.0000      0  0.000  0.000
 -283.0000  -266.0000      0  0.000  0.000
 -266.0000  -249.0000      0  0.000  0.000
 -249.0000  -232.0000      2  0.000  0.000
 -232.0000  -215.0000      7  0.001  0.002
 -215.0000  -198.0000      1  0.000  0.002
 -198.0000  -181.0000    115  0.021  0.023
 -181.0000  -164.0000      0  0.000  0.023
 -164.0000  -147.0000      3  0.001  0.024
 -147.0000  -130.0000    114  0.021  0.045
 -130.0000  -113.0000    466  0.087  0.132 ***
 -113.0000   -96.0000    148  0.028  0.159 *
  -96.0000   -79.0000    636  0.118  0.277 ****
  -79.0000   -62.0000    385  0.072  0.349 **
  -62.0000   -45.0000   3501  0.651  1.000 *************************
             -45.0000< 18572  0.775
       n        23954
 average    -25.02906
      sd     32.74197
     min   -996.00000
     max    392.00000
     sum -599546.00000
sessions / day.4 / lecture.4 / 2 / histogram
# read the values from STDIN and use the -col COLUMN, by default 0
my @values;
while(my $line = <>) {
  chomp $line;
  next if $line =~ /^\#/;
  my @tok = split(" ",$line);
  if(defined $CONF{col} && ! defined $tok[ $CONF{col} ]) {
    die "The line [$line] has no number in column [$CONF{col}]";
  }
  push @values, $tok[ $CONF{col} ];
}

# get min and max of values
my ($min,$max) = minmax(@values);

# change min/max range if redefined via -min and/or -max
$min = $CONF{min} if defined $CONF{min};# && $CONF{min} > $min;
$max = $CONF{max} if defined $CONF{max};# && $CONF{max} < $max;

# *very* quick and dirty percentile if -pmin or -pmax used
if($CONF{pmin} || $CONF{pmax}) {
  my @sorted = sort {$a <=> $b} @values;
  $min = $sorted[ $CONF{pmin}/100 * (@sorted-1) ] if $CONF{pmin};
  $max = $sorted[ $CONF{pmax}/100 * (@sorted-1) ] if $CONF{pmax};
}

# determine bin size from either the -binsize command-line parameter
# or from -nbins. At least one of these is defined — we made sure of
# that in validateconfiguration()
my $bin_size;
if($CONF{binsize}) {
  $bin_size = $CONF{binsize};
  $CONF{nbins} = round(($max-$min)/$CONF{binsize});
} else {
  $bin_size = ($max-$min)/$CONF{nbins};
}

# populate bins
my $bins;
for my $i (0..$CONF{nbins}-1) {
  my $start = $min + $i * $bin_size;
  my $end   = $min + ($i+1) * $bin_size;
  my $n     = grep($_ >= $start && ($_ < $end || ( $i == $CONF{nbins}-1 && $_ <= $end)), @values);
  push @$bins, { start => $start,
                 i     => $i,
                 end   => $end,
                 size  => $bin_size,
                 n     => $n };
}

my $histogram_width = $CONF{height};
my $maxn   = max(map { $_->{n} } @$bins);
my $sumn   = sum(map { $_->{n} } @$bins);
my $cumuln = 0;

my $bin_count_width = max(map { length($_->{n}) } @$bins);
my $bin_count_fmt   = sprintf("%%%dd",2+$bin_count_width);
my $bin_width_f_max = max(map { $_->{n}/$maxn } @$bins);

# report number and fraction of values smaller than requested minimum
printinfo(sprintf("%10s %10.4f>$bin_count_fmt %6.3f","",
                  $min,
                  int(grep($_ < $min, @values)),
                  int(grep($_ < $min, @values))/@values));

# report each bin: min, max, count, fractional count, cumulative count
for my $bin (@$bins) {
  $cumuln += $bin->{n};
  my $binwidth = $bin->{n}*$histogram_width/$maxn;
  if(defined $CONF{sqrt}) {
    my $power = $CONF{sqrt} || 0.5;
    my $f     = $bin->{n}/$maxn;
    $binwidth = $f**$power/$bin_width_f_max**$power * $histogram_width;
  }
  printinfo(sprintf("%10.4f %10.4f $bin_count_fmt %6.3f %6.3f %-${histogram_width}s",
                    $bin->{start},
                    $bin->{end},
                    $bin->{n},
                    $bin->{n}/$sumn,
                    $cumuln/$sumn,
                    "*"x$binwidth));
}

# report number and fraction of values larger than requested maximum
printinfo(sprintf("%10s %10.4f<%6d %6.3f","",
                  $max,
                  int(grep($_ > $max, @values)),
                  int(grep($_ > $max, @values))/@values));

# aggregate stats for full data set
printinfo(sprintf("%8s %12d","n",int(@values)));
printinfo(sprintf("%8s %12.5f","average",average(@values)));
printinfo(sprintf("%8s %12.5f","sd",stddev(@values)));
printinfo(sprintf("%8s %12.5f","min",min(@values)));
printinfo(sprintf("%8s %12.5f","max",max(@values)));
printinfo(sprintf("%8s %12.5f","sum",sum(@values)));