Learning Circos

Bioinformatics and Genome Analysis

Institut Pasteur Tunis Tunisia, December 10 – December 11, 2018

Download course materials
v2.00 6 Dec 2018 Download PDF slides
v2.00 6 Dec 2018

A 2- or 4-day practical mini-course in Circos, command-line parsing and scripting. This material is part of the Bioinformatics and Genome Analysis course held at the Institut Pasteur Tunis.

Quick links

menu day 1 day 2 day 3 day 4 scripts references updates

sessions / day.1

Getting started with Circos: Yeast and Ebola

Monday 10 December 2018 — Day 1

9h00 - 10h30 | Lecture 1 — Introduction to Circos

11h00 - 12h30 | Lecture (practical) 2 — Visualizing gene distribution and size in Yeast: the histogram data track

14h00 - 15h30 | Lecture (practical) 3 — Conservation in Yeast: the link data track

16h00 - 18h00 | Lecture (practical) 4 — Visualizing an Ebola strain

Concepts covered today

Circos configuration, common Circos errors, Circos debugging, ideograms, selecting ideograms with regular expressions, data tracks, histograms, links, downloading files from UCSC genome browser, essential command-line tools and basic scripting, using bash to create data files for Ebola genome strains, color definitions, using transparency, Brewer palettes, runtime formatting rules, accessing data track statistics, input data formats

menu lecture 1 lecture 2 lecture 3 lecture 4

sessions / day.1 / lecture.4

Visualizing an Ebola strain

sessions / day.1 / lecture.4 / README

Let's look how we can parse data files for Circos. Knowing how to preparing data files efficiently is a very important skill that you must continue to develop.

sessions / day.1 / lecture.4 / 1 / parsing.data.README

Downloading and Parsing Ebola Data

Sometimes your data is already almost nearly in the right format and can be quickly parsed to the format that Circos expects.

Here command-line tools like grep, sed and awk shine and you should become as familiar with them as possible. I cannot stress this enough: a lot of scripting can be entirely avoided by piping calls to these tools.

Let's start with Ebola! I downloaded three tracks for the Sierra Leone Ebola virus assembly, ebola.{assembly,genes,variation}.txt, which you can access at

https://genome.ucsc.edu/cgi-bin/hgTables

by selecting "Clade: Viruses" and "Genome: Ebola virus".

 Mapping and sequencing tracks: assembly
 day.1/lecture.4/1/ebola.assembly.txt
 
 Genes and gene prediction tracks: NCBI genes
 day.1/lecture.4/1/ebola.genes.txt
 
 Variation and repeats: 2014 specific
 day.1/lecture.4/1/ebola.variation.txt

If we're going to draw any data on the Ebola genome, we need to create the karyotype file that defines the Ebola genome's chromosome. In this case there's only one, so it's trivial to make. You can look at the last line of the data file to see the size of the chromosome

 tail -1 ebola.assembly.txt
 585 KM034562v1 0 18957...

You can make the file by hand by simply entering

 # day.1/lecture.4/1/ebola.karyotype.txt
 chr - ebola ebola 0 18957 black

or alternatively if you're feeling frisky you can automate this step if you don't want to risk making a mistake entering the length.

 # data.1/lecture.4/1/make.karyotype.sh
 tail -1 ebola.assembly.txt | cut -f 4 | awk '{print "chr - ebola ebola 0",$1,"black"}' > ebola.karyotype.txt

Great! You're ready to draw the genome (no data yet).

sessions / day.1 / lecture.4 / 1 / make.karyotype.sh

↓

#!/bin/bash

echo "parsing UCSC assembly table and making karyotype file"
tail -1 ebola.assembly.txt | cut -f 4 | awk '{print "chr - ebola ebola 0",$1,"black"}' > ebola.karyotype.txt

sessions / day.1 / lecture.4 / 2 / etc / circos.conf

We reference the karyotype file we made in the previous part of this lecture.

chr - ebola ebola 0 18895 black

karyotype = ../1/ebola.karyotype.txt

I have adjusted the thickness of the ideogram to 5 pixels and also the ticks. If you like you can peek into these files. The ideogram.conf file defines the position of the ideograms, their thickness, spacing and so on. All these parameters are required.

<<include ideogram.conf>>

The ticks.conf file defines the position and size of the ticks and their labels. This file is optional — if it is missing and the parameters it defines aren't defined anywhere else, the image will have no ticks.

<<include ticks.conf>>

Martin Krzywinski @MKrzywinski mkweb.bcgsc.ca

▲ sessions/day.1/lecture.4/2/circos.png (zoom)

sessions / day.1 / lecture.4 / 2 / etc / ideogram.conf

The ideogram.conf file doesn't have to be named this way — the only important thing is that you at some point define the <ideogram> block with the right parameters. Conventionally, this is done in this file, but I've seen people include this definition in their circos.conf file.

I wouldn't recommend this, though. I suggest that you keep your configuration modular, which makes it possible to share files between projects.

<ideogram>

Most of the required parameters are imported from ../../etc/ideogram.conf. Look through that file to familiarize yourself with how ideogram positions and size are defined.

<<include ../../etc/ideogram.conf>>

To override parameters defined in this block, the name is suffixed with *. If this suffix wasn't added, then thickness would be defined twice in this block (once here and second time from the file included above), and Circos would report an error.

In this case, the thickness is set to 5p (5 pixels). Its original value was 15p.

thickness* = 5p

<spacing>
default = 0.0025r
</spacing>

Similarly, the thickness and color of the stroke around the ideograms is redefined.

stroke_thickness* = 1
stroke_color*     = vdgrey

</ideogram>

sessions / day.1 / lecture.4 / 2 / etc / ticks.conf

Like for ideogram.conf, it's convenient to keep tick and tick label definitions separate.

show_ticks        = yes
show_tick_labels  = yes
show_grid         = no

<ticks>
tick_label_font = default
radius          = dims(ideogram,radius_outer) 
label_offset    = 2p
label_size      = 6p
multiplier      = 1
color           = black

The u suffix corresponds to the chromosomes_size defined in circos.conf, which is set to 1000. Thus, 500u corresponds to 500kb.

If you want to change the spacing and size of the ticks, feel free to try different values.

The show = yes parameter is the default behaviour and thus not needed here. I've added its definition to remind you that if you want to turn off a block, you can use show = no.

<tick>
show           = yes
spacing        = 500u
size           = 8p
thickness      = 1p
show_label     = yes
format         = %d
</tick>

<tick>
show           = yes
spacing        = 100u
size           = 4p
thickness      = 1p
show_label     = no
</tick>

<tick>
show           = yes
spacing        = 25u
size           = 2p
thickness      = 1p
show_label     = no
</tick>

</ticks>

sessions / day.1 / lecture.4 / 3 / parsing.data.README

Creating Circos data files for Ebola genes and SNPs

Now that we have the ideogram drawn, let's pull out the gene regions from the gene table file. This file has a pretty simple format. The columns are

bin        585         Indexing field to speed chromosome range queries.
name       NP          Name of gene
chrom      KM034562v1  Reference sequence chromosome or scaffold
strand     +         ) + or - for strand
txStart    55          Transcription start position (or end position for minus strand item)
txEnd      3026        Transcription end position (or start position for minus strand item)
cdsStart   469         Coding region start (or end position for minus strand item)
cdsEnd     2689        Coding region end (or start position for minus strand item)
exonCount  1           Number of exons
exonStarts 55,         Exon start positions (or end positions for minus strand item)
exonEnds   3026,       Exon end positions (or start positions for minus strand item)

This is going to be quick — there are only 9 entries in the gene file! Here I've added a comment row with a row of 1-indexed column for convenience.

#day.1/lecture.4/1/ebola.genes.txt
#1 2    3          4 5     6     7     8     9 10         11
585 NP   KM034562v1 + 55    3026  469   2689  1 55,        3026,
585 VP35 KM034562v1 + 3031  4407  3128  4151  1 3031,      4407,
585 VP40 KM034562v1 + 4389  5894  4478  5459  1 4389,      5894,
585 GP   KM034562v1 + 5899  8305  6038  8068  2 5899,6922, 6920,8305,
585 sGP  KM034562v1 + 5899  8305  6038  7133  1 5899,      8305,
585 ssGP KM034562v1 + 5899  8305  6038  6933  2 5899,6923, 6922,8305,
585 VP30 KM034562v1 + 8287  9740  8508  9375  1 8287,      9740,
585 VP24 KM034562v1 + 9884  11518 10344 11100 1 9884,      11518,
585 L    KM034562v1 + 11500 18282 11580 18219 1 11500,     18282,

It's up to us what to draw. Let's pull out the gene start and end positions, its name and the number of exons. We'll pull out column 2, 5, 6 and 9 using cut (remember cut starts indexing at 1 not zero) and then rearrange them with awk, since cut doesn't offer the option for changing column order.

#day.1/lecture.4/3/create.tracks.sh
grep -v \../1/ebola.genes.txt | cut -f 2,5,6,9 | awk '{print "ebola",$2,$3,$4,"name="$1}' > ebola.genes.circos.txt

This gives us this file

# day.1/lecture.4/3/ebola.genes.circos.txt
ebola 55 3026 1 name=NP
ebola 3031 4407 1 name=VP35
ebola 4389 5894 1 name=VP40
ebola 5899 8305 2 name=GP
ebola 5899 8305 1 name=sGP
ebola 5899 8305 2 name=ssGP
ebola 8287 9740 1 name=VP30
ebola 9884 11518 1 name=VP24
ebola 11500 18282 1 name=L

Next, let's parse the variations, which are SNPs. The variationfile has the following tab-delimited fields

chrom      KM034562v1 Reference sequence
chromStart 8985       Start position in reference
chromEnd   8986       End position in reference
name       A/C        Variant: ancestral/2014-derived
score      0          Score from 0-1000 (not used)
strand     +          + or - (always + here)
thickStart 8985       Start of where display should be thick (same as chromStart)
thickEnd   8986       End of where display should be thick (same as chromEnd)
reserved   0,220,0    Used as itemRgb
gene       VP30       Gene symbol
type       synonymous noncoding, synonymous or nonsynonymous
hgsv       VP30:Arg160Arg HGSV notation for protein coding changes (or lack thereof)
blosum62   NA             BLOSUM62 substitution score for ancestral and 2014 amino acids
countInNew 81             Number of 2014 samples that contain derived allele (out of 81 total)
freqInNew  1.000000       Derived allele frequency of 2014 samples

and the first few lines are

#day.1/lecture.4/1/ebola.variation.txt
#1       2   3   4   5 6 7   8   9     10 11         12 13 14
KM034562v1 126 127 C/T 0 + 126 127 0,0,0 NP noncoding  NA 81 1.000000
KM034562v1 154 155 A/C 0 + 154 155 0,0,0 NP noncoding  NA 81 1.000000
KM034562v1 181 182 A/G 0 + 181 182 0,0,0 NP noncoding  NA 81 1.000000
KM034562v1 186 187 A/G 0 + 186 187 0,0,0 NP noncoding  NA 81 1.000000
...

I've added a comment row with a row of zero-indexed column for convenience.

#day.1/lecture.4/3/create.tracks.sh
grep -v \../1/ebola.variation.txt | cut -f 2,3,4,10,15 | \
  perl -pe 's/\w+:\w+//' | \
  awk '{print "ebola",$2,$3,$5,"snp="$1",gene="$4}' > ebola.variation.circos.txt

gives us

#day.1/lecture.4/3/ebola.variation.circos.txt
ebola 126 127 1.000000 snp=C/T,gene=NP
ebola 154 155 1.000000 snp=A/C,gene=NP
ebola 181 182 1.000000 snp=A/G,gene=NP
ebola 186 187 1.000000 snp=A/G,gene=NP
...

sessions / day.1 / lecture.4 / 3 / create.tracks.sh

↓

#!/bin/bash

# extract genes

grep -v \# ../1/ebola.genes.txt | cut -f 2,5,6,9 | \
    awk '{print "ebola",$2,$3,$4,"name="$1}' > ebola.genes.circos.txt

# extract SNPs

grep -v \# ../1/ebola.variation.txt | cut -f 2,3,4,10,15 | \
    perl -pe 's/\w+:\w+//' | \
    awk '{print "ebola",$1,$2,$5,"snp="$3",gene="$4}' > ebola.variation.circos.txt

sessions / day.1 / lecture.4 / 4 / etc / circos.conf

We are now ready to draw the genes and SNPs.

karyotype = ../1/ebola.karyotype.txt

<plots>

We'll draw the gene regions as a tile track. This track automatically organizes regions into layers in a way that avoids overlap. The reason that we need this track is because some of the gene regions overlap.

<plot>
type = tile
file = ../3/ebola.genes.circos.txt
r1   = 0.95r
r0   = 0.90r
stroke_thickness = 0
color     = black
margin    = 100u
thickness = 5p
padding   = 5p
</plot>

For the SNPs we can use a highlight track. This track colors the regions defined in the file.

How can we check that all the SNP positions are unique? Make a call to cut, extract the start position, sort it (which is needed for the next step) and report the number of runs of identical values. Finally, grep out any line that doesn't have a "1" applied to matching whole words (i.e. we need the number 1 not just the digit 1 embedded in another number).

cat ../3/ebola.variation.circos.txt | cut -d " " -f 2 | sort | uniq -c | grep -v -w 1

Since this command reports no lines, we know that they're uniq. Another way of doing it is counting all lines and unique start values

> wc ../3/ebola.variation.circos.txt
395
> cat ../3/ebola.variation.circos.txt | cut -d " " -f 2 | sort | uniq | wc
395

Since both commands give us the same count, all start positions are unique.

<plot>
type = highlight
file = ../3/ebola.variation.circos.txt
fill_color = red
stroke_thickness = 0
r1   = 0.89r
r0   = 0.85r

Since the SNP regions are a single base in size, they're small and hard to see. In these cases you can use minsize to expand the size of the data point to be at least this minimum size. Beware that this changes how people might interpret the data — but it's better to be able to see the data clearly than not.

minsize = 10

</plot>

It's worth exploring how the SNPs might look when drawn as tiles. Each tile is a single SNP, enlarged here so that you can see it more clearly. It's now very obvious where SNPs are close together — something that we couldn't see with the highlight track above.

<plot>
type = tile
file = ../3/ebola.variation.circos.txt

r1   = 0.79r
r0   = 0.70r

You can make the size larger (try 25). Below 10 it's impossible to see the SNPs. That is because the genome is about 14 kb in size and the circumference of the circle is about 3,000 pixels. Thus, a single base is only 1/5th of a pixel!

minsize = 10

stroke_thickness = 0
color     = black

This is how close SNPs can be to one another in the same layer.

margin    = 20u
thickness = 4p
padding   = 3p

layers_overflow = grow

<rules>

<rule>
condition = var(snp) =~ /^A/
color     = red
</rule>

<rule>
condition = var(snp) =~ /A$/
color     = blue
</rule>

</rules>

</plot>

</plots>

▲ sessions/day.1/lecture.4/4/circos.png (zoom)

sessions / day.1 / lecture.4 / 5 / README

A supplementary parsing example aimed to show you that you can achieve quite a lot on the command line without having to write a custom script.

Here, we have a hypothetical example in which two sequences are aligned and the alignment is reported as a pair of line.

In each pair, the first line is from sequence 1 and the second line is from sequence 2.

The makesnp perl script creates a random instance of such a file. Arbitrarily, I've set the line length to 15 and report 3 pairs of lines with a SNP probability of 0.1 at each position.

AGACGCGTCGTTGTT
AGACTCGTCGTTGTT

TCAACAATAGCTTTC
TCAACAATAGCTTTC

GACATTCTTACCGCA
GAGATTCTTACCGGA

Next, the testsnp.sh script parses these sequence pairs and assigns to each base a unique tag based on its (a) the index of the line pair (b) position along the line and (c) the base.

00.00.A 1 0
00.01.G 0 1
00.02.A 1 2
00.03.C 0 3
00.04.G 1 4
00.05.C 0 5
00.06.G 1 6
00.07.T 0 7
00.08.C 1 8
00.09.G 0 9
00.10.T 1 10
00.11.T 0 11 # e.g. line pair 0, position 11, T, record 0, position 11 along record
00.12.G 1 12
00.13.T 0 13
00.14.T 1 14
00.00.A 0 0
00.01.G 1 1
00.02.A 0 2
00.03.C 1 3
00.04.T 0 4
00.05.C 1 5
00.06.G 0 6
00.07.T 1 7
00.08.C 0 8
00.09.G 1 9
00.10.T 0 10
00.11.T 1 11
00.12.G 0 12
00.13.T 1 13
00.14.T 0 14
...

Once we have this, it's no enough to just check how many times each tag repeats. When there is no SNP and the bases are the same in both sequences, we expect the tag to appear twice. If there is a SNP, then it will appear only once because the base part of the tag is different.

In this example, we find 3 SNPs.

1 00.04.T 0 4  # position 4 T/G
1 00.04.G 1 4
1 02.02.G 0 32 # position 32 G/C
1 02.02.C 1 32
1 02.13.C 0 43 # position 43 C/G
1 02.13.G 1 43

We can finally extract the base, record index and position to create a list of SNPs based on position.

4 T G
32 G C
43 C G

The first base is from sequence 1 and the second base from sequence 2. This list is made by using join on two files based on the base position.

sessions / day.1 / lecture.4 / 5 / snp.txt

AGACGCGTCGTTGTT
AGACTCGTCGTTGTT

TCAACAATAGCTTTC
TCAACAATAGCTTTC

GACATTCTTACCGCA
GAGATTCTTACCGGA

sessions / day.1 / lecture.4 / 5 / snp.report.txt

1 00.04.T 0 4
1 00.04.G 1 4
1 02.02.G 0 32
1 02.02.C 1 32
1 02.13.C 0 43
1 02.13.G 1 43

sessions / day.1 / lecture.4 / 5 / snp.list.txt

4 T G
32 G C
43 C G

sessions / day.1 / lecture.4 / 5 / makesnp

↓

# -seed N sets a random seed so that we can reproduce
# the random output for debugging
srand($CONF{seed}) if defined $CONF{seed};

# -dict ATGC sets the dictionary from which bases are selected
# for example you can add N's with -dict ATGCN

my @dict = split("",$CONF{dict});

# iterate over lines (pairs of sequence)
for my $i (1..$CONF{lines}) {
  # this is the "reference" sequence, which is a concatenation
  # of $CONF{len} randomly selected characters from the dictionary
  my $str = join("",map { $dict[rand(@dict)] } (1..$CONF{len}));
  print "$str\n";
  # for each position in the reference
  for my $j (1..$CONF{len}) {

    # make a snp but only if we pass the probability requirement
    # the chance of a SNP at given position is $CONF{p}
    next unless rand() < $CONF{p};

    # this is the reference base ast this position
    my $ref = substr($str,$j-1,1);

    # replace the character at this position with a randomly
    # selected character from the dictionary that is not
    # the reference base
    #
    # substr() can be an lvalue too — very useful
    substr($str,$j-1,1) = (grep($_ ne $ref, @dict))[rand(@dict-1)];
  }
  print "$str\n\n";
}

sessions / day.1 / lecture.4 / 5 / testsnp.sh

↓

#!/bin/bash

LEN=15
LINES=3

perl makesnp -p 0.1 -len $LEN -lines $LINES -seed 0 > snp.txt

# 0. filter blank lines
# 1. fold the lines to make one character per line
# 2. calculate
#
#    rec - record number (line number modulo 2), 0|1
#   pair - index of line pair (0,1,2,3...)
#    pos - position along the line
#
# 3. each base gets a unique tag: pair.pos.base
#
# 4. positions at which there are snps will only have one entry of this
#    tag, which we find by count unique tags and then looking for ones
#    that only appear once
#
# The uniq -w7 flag compares only the first 7 characters (length of tag)

grep -v "^$" snp.txt | fold -w1 | \
    awk -v len=$LEN -v lines=$LINES \
    '{ rec=NR%2; pair=int((NR-1)/(2*len)); pos=(NR-1)%len; \
       printf("%02d.%02d.%s %d %d\n",pair,pos,$1,rec,pair*len+pos) }' > snp.mod.txt

sort snp.mod.txt | uniq -w7 -c | sed 's/^ *//' | grep ^1 | sort -n -k 4,5  -n -k 3,4 | tee snp.report.txt

cat snp.report.txt | sed 's/\./ /g'| awk '{print $5,$6,$4}' > tmp.txt

grep ^0 tmp.txt | cut -d " " -f 2- | sort > tmp.1.txt
grep ^1 tmp.txt | cut -d " " -f 2- | sort > tmp.2.txt

join tmp.1.txt tmp.2.txt | sort -n > snp.list.txt

\rm tmp.*.txt