A 2- or 4-day practical mini-course in Circos, command-line parsing and scripting. This material is part of the Bioinformatics and Genome Analysis course held at the Institut Pasteur Tunis.
BCGA 2018 | 1-day Circos course | Circos documentation best practices getting started | Brewer palette swatches | Color resources | Nature Methods Points of View Points of Significance
Additional material — Day 4
9h00 - 10h30 | Lecture 1 — Perl data structure and regular expression refresher
11h00 - 12h30 | Lecture (practical) 2 — Parsing .maf Ebola strain alignments on the command line
14h00 - 15h30 | Lecture (practical) 3 — Parsing .maf Ebola strain alignments with Perl
16h00 - 18h00 | Lecture (practical) 4 — Visualizing Ebola strain alignments
.maf multiple-alignment data format, intermediate scripting, regular expressions, flexible parsing with Perl, creating Circos input data files, visualizing alignments between large number of sequences
In the last lecture we saw how you can use awk, grep and other command line tools to pull out all the pairwise alignments between two strains form a .maf file.
This method required us to hack the data a little bit by adding a unique alignment id in front of each line so that we could use join to match up the alignments.
Now let's see how we might do this in Perl, with a few more options.
I have an example script which you should go through and understand. Focus on the script lines between the _START and _END comments.
To get usage instructions,
> ./parsemaf -h
Usage:
# get a list of strains
cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf
# report alignments between a reference strain and all other strains
cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf -g1 Zaire_1995
# report alignments between two strains
cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf -g1 Zaire_1995 -g2 Reston_1996
or to get a man page
> ./parsemaf -man
You see that the script does three things, depending on how it's called.
If you don't provide any genomes, it will list all the genomes mentioned in the alignment file. These are all the Ebola strains.
>cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf
*** These strains are available:
034-KS_2008
13625Kikwit_1995
13709Kikwit_1995
1Eko_1996
1Ikot_Gabon_1996
...
If you provide only one genome via -g1, it will make a link file of all the alignments between this genome and all others.
> cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf -g1 Zaire_1995| head | addcomment
Zaire_1995 0 0 GuineaPig_Mayinga_2007 0 0 offset=0
Zaire_1995 0 0 Mayinga_2002 0 0 offset=0
Zaire_1995 0 0 G3770v2_2014 0 0 offset=0
Zaire_1995 0 0 Reston08-C_2008 0 0 offset=0
...
But if you pass a strain that doesn't exist, it will show all those that match the regular expression.
> cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf -g1 Gab | addcomment
ERROR: There are no alignments for -g1 [Gab]
Strains that match your input Gab as a regular expression are
1Mbie_Gabon_1996
Gabon_1994
1Oba_Gabon_1996
1Ikot_Gabon_1996
If you provide two genomes via -g1 and -g2 then the parsed output is restricted to this pair. And if one or two of the strains are not defined, you get the matching list for both.
> cat ../../lecture.2/1/ebola.multiz.txt | ./parsemaf -g1 Gab -g2 Mar | addcomment
ERROR: There are no alignments for -g1 [Gab]
Strains that match your input Gab as a regular expression are
1Mbie_Gabon_1996
Gabon_1994
1Oba_Gabon_1996
1Ikot_Gabon_1996
ERROR: There are no alignments for -g2 [Mar]
Strains that match your input Mar as a regular expression are
Marburg_MtElgon_Musoke_Kenya_1980
Marburg_KitumCave_Kenya_1987
Reporting the list of strains that match incomplete or misspelled input is type of quality-of-life feature that I encourage you to add to your scripts.
Look through the script file — it provides a nice template for your own scripts — and understand what is being done. You'll learn (or brush up on your knowledge) about parsing multi-line records and list and hash references.
Notice that I don't add extensions to my Perl scripts but rather make them executable so that they can be called directly, thanks to the #! notation in the first line
# make executable and call directly
> chmod +x ./parsemaf
> ./parsemaf
# instead of
> perl parsemaf.py
Once the script is executable, we don't need to know it's a Perl script. It calls the appropriate parser by itself.
One exception that I observe is for bash scripts, which I typically end in .sh.
One benefit of the extensions is that they may inform text editors which syntax mode to pretty-print the code.
# Make sure you understand everything below all the way to the
# housekeeping section.
# Read about the MAF format at http://www.bx.psu.edu/~dcking/man/maf.xhtml
# This will be our alignment list reference. It's declared as
# a scalar because it will be a reference to a list. While it
# doesn't have to be a reference, get used to using references.
#
# Alternatively, we could have done this
#
# my @alignments;
# ...
# push @alignments, ...
# ...
# $alignments[-1]{strain}{$strain} = ...
#
# By using a list variable saves you from having to dereference
# it with @$alignments or using -> to pull out an element.
#
# Last element using @alignments list: $alignments[-1]
# Last element using $alignments as list reference: $alignments->[-1]
#
# We can always make a list reference by
#
# my $alignments = \@alignments;
my $alignments;
# Read from STDIN until end of stream. This way the script expects
# a pipe from another process, like cat
#
# cat .../1/ebola.multiz.txt | ./parsemaf
#
while(my $line = <>) {
# remove the trailing new line
chomp $line;
# if this file matches this regular expression, then it is
# the start of an alignment block
if($line =~ /^a score=(.+)/) {
# push a new alignment hash onto the list
# for now, only the score is defined
push @$alignments, {score=>$1};
} elsif($line =~ /^s /) {
# this is an individual alignment line and belongs
# to the current alignment block, which is the last
# element in the @$alignment list.
# chop up the line into fields using any whitespace as delimiter
my @tok = split(" ",$line);
# assign the first four elements of the tokens to
# individual variables. By putting 'undef' as the
# first element we discard the first element in
# the assignment. Another way to do this would be
#
# my ($strain,$start,$size,$strand,$genome_size) = @tok[1..5];
my (undef,$strain,$start,$size,$strand,$genome_size) = @tok;
# if the strain is in the format STR.STR (repeated string)
# joined by a period, throw out everything after
# the period
if.\1/) {
$strain =~ s/[.].*//;
}
# assign this alignment line to the last (current) alignment
# block. We could have made a list out of these but it's
# even more convenient to key them by the strain
$alignments->[-1]{strain}{$strain} =
{ start=>$start,size=>$size,strain=>$strain,genome_size=>$genome_size };
}
}
printdebug("parsed",int(@$alignments),"alignments");
# The parsing above could have been done by changing the
# input record separator in Perl, which is accessible using
# the special variable $/
#
# The default value of this variable depends on the operating
# system. For more details see https://perldoc.perl.org/perlvar.html
#
# I urge you against using these special variables (with the
# exception of $_ and @_) and get in the habit of doing things yourself.
# Your code will be much more legible. Some of these variables
# actually slow down your code.
# Just dump the data structure to STDOUT and quit. This
# helps understand what's happening under the hood.
#
# The alignments are a list. Each list item is a hash of this form
#
# { score => SCORE,
# strain => { STRAIN1 => { start=>START1,size=>SIZE1,strain=>STRAIN1 },
# STRAIN2 => { start=>START2,size=>SIZE2,strain=>STRAIN2 },
# STRAIN3 => { start=>START3,size=>SIZE3,strain=>STRAIN3 },
# ...
# }
# }
printdumperq($alignments) if $CONF{dump};
# Make a unique list of strains. This is a compound line made up of
# several nested functions. Each is a simple step and this is
# a powerful Perl idiom when the steps are combined together.
#
# the list of alignments
# @$alignments
#
# a list of strains (which are keys in the alignment hash) in a given alignment
# keys %{$a->{strain}}
#
# a list of strains from all alignments
# here we're basically mapping the above step to each alignment
# map { keys %{$->{strain}} } @$alignments
#
# a list of unique strains
# uniq ( map { keys %{$->{strain}} } @$alignments )
my @strains = uniq ( map { keys %{$_->{strain}} } @$alignments );
# If -g1 is not set, we just list the strains and quit.
# I've defined printinfo() below as a wrapper for space-delimited
# print that prints _undef_ for arguments that are not defined.
if(! $CONF{g1}) {
printinfo("*** These strains are available:");
for my $strain (sort @strains) {
printinfo($strain);
}
exit;
}
# If we defined -g1 and/or -g2 make sure that we've seen these in the file
my $quit;
for my $param (qw(g1 g2)) {
# die if we have defined the genome we want but it's not in the list of strains.
if($CONF{$param} && ! grep($_ eq $CONF{$param}, @strains)) {
printinfo("ERROR: There are no alignments for -$param [$CONF{$param}]");
# Even if there are strains that match $CONF{g1} or $CONF{g2} we can
# produce a list of strains that match it as a regular expression to help the user
# find a strain.
my @matching_strains = grep($_ =~ /$CONF{$param}/, @strains);
if(@matching_strains) {
printinfo("Strains that match your input $CONF{$param} as a regular expression are");
for (@matching_strains) {
printinfo(" $_");
}
} else {
printinfo("No strains match your input $CONF{$param} as a regular expression.");
}
$quit = 1;
}
}
die if $quit;
# Now we print out the alignments in Circos' link format
my $seen_strain;
for my $alignment (@$alignments) {
# alignment to our reference
my $a1 = $alignment->{strain}{$CONF{g1}};
# the reference may be absent from this alignment, in which case go to next alignment
next unless $a1;
# for this alignment block, make a list of alignments to match up with
# the reference. if we have defined -g2 STRAIN2 then it's the alignment for
# STRAIN2 (which may be absent). Othewise it's a list of all strains in this alignment.
my @a2 = $CONF{g2} ? $alignment->{strain}{$CONF{g2}} : values %{$alignment->{strain}};
# Loop over all the alignments in this block
for my $a2 (@a2) {
# Move to the next alignment line if this one is not defined. This shouldn't
# happen but it's safe to check.
next if ! $a2;
# Report the karyotype if we used -karyotype. These will be printed
# to STDERR. The reason for this is that we can write to two
# files at the same time by redirecting STDOUT and STDERR to
# different files
#
# ./parsemaf > stdout.txt 2> stderr.txt
#
# Or we can combine the two
#
# ./parsemaf > everything 2>&1
if($CONF{karyotype} && ! $seen_strain->{ $a1->{strain} }++) {
printerr(sprintf("chr - %s %s 0 %d black",
$a1->{strain},$a1->{strain},$a1->{genome_size}));
}
if($CONF{karyotype} && ! $seen_strain->{ $a2->{strain} }++) {
printerr(sprintf("chr - %s %s 0 %d black",
$a2->{strain},$a2->{strain},$a2->{genome_size}));
}
# We're reporting alignment pairs, so skip if we're asking for self-alignment
next if $a1->{strain} eq $a2->{strain};
printinfo(
$a1->{strain},
$a1->{start},
$a1->{start} + $a1->{size} - 1,
$a2->{strain},
$a2->{start},
$a2->{start} + $a2->{size} - 1,
sprintf("offset=%d",$a2->{start}-$a1->{start}),
);
}
}
We are now ready to make our tracks. The make.tracks.sh script does this. It calls parsemaf and writes to a link file and karyotype file.
The link file z.vs.all.txt has alignments of Zaire_1995 vs all other strains. The karyotype file karyotype.txt defines the lengths of each strain.
#!/bin/bash
g1=Zaire_1995
# Notice how we're redirecting STDOUT to the link file and the
# STDERR to the karyotype file. This is a very useful idiom to
# generate two independent files at the same time and
# use redirection for maximum flexibility.
cat ../../lecture.2/1/ebola.multiz.txt | ../1/parsemaf -g1 $g1 -karyotype > z.vs.all.txt 2> karyotype.txt