Sequences in Color

M Krzywinski (Genome Sequence Centre)

After reading wombat's entry in our experimentbank.org, I thought it would be an interesting diversion to colour base pairs and turn a sequence into a GIF thumbview. Any excuse to use Perl one more time.

It was clear that I was going to see something - something more than coloured blocks. The human eye looks for patterns and a few emerged: bands and stripes. None of this analysis is really serious. The visualization depends on casting the 1D sequence into a 2D object, and because the second dimension (the row width) is arbitrary in size, patterns can just as easily appear as disappear.
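To make the point about the arbitrary second dimension concrete, here is a minimal sketch in Python (not the Perl used by colortiles). The helper name `tile` and the toy sequence are illustrative assumptions, not part of the original code. It wraps the same period-3 repeat at two different row widths: one width lines the repeat up into vertical stripes, the other shears it into diagonals.

```python
def tile(sequence, width):
    """Split a 1D sequence into rows of `width` bases (last row may be shorter)."""
    if width < 1:
        raise ValueError("width must be positive")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]

# A toy period-3 repeat: which pattern shows up depends only on the row width.
seq = "ACG" * 8

for width in (6, 4):
    print(f"width={width}")
    for row in tile(seq, width):
        print("  " + row)
```

With a width that is a multiple of the repeat length every row is identical, so each column has a single colour; any other width shifts the repeat from row to row.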

It turned out to be a colourful foray into data exploration. In what other ways can the sequence be visually annotated? I picked the GC content, its running average, the modal base pair, and 2- and 4-lagging sums, a variant on the lag plot.

Have a good time looking at the coloured dots. Maybe you'll see something?


DOWNLOAD

The code is written in Perl. You will need the following modules:

  • Math::VecStat
  • Pod::Usage
  • GD
  • bioperl (Bio::SeqIO, Bio::Seq)

> tar xvfz colortiles.tgz
# now edit colortiles.conf
# try with random 20kb sequence (see result)
> /path/to/your/perl colortiles -random 20000 -gc 0.41
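For readers wondering what a random sequence with a target GC content looks like in code, the following is a rough Python sketch, not the generator inside colortiles. It assumes that each base is drawn independently, with G or C chosen with probability `gc` and A or T otherwise; the function name `random_sequence` and the `seed` parameter are hypothetical.

```python
import random

def random_sequence(length, gc=0.5, seed=None):
    """Draw `length` independent bases; each is G or C with probability `gc`."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    rng = random.Random(seed)
    bases = []
    for _ in range(length):
        if rng.random() < gc:
            bases.append(rng.choice("GC"))   # G and C equally likely
        else:
            bases.append(rng.choice("AT"))   # A and T equally likely
    return "".join(bases)

# Assumed equivalent of "-random 20000 -gc 0.41": 20 kb at about 41% GC.
seq = random_sequence(20000, gc=0.41, seed=1)
gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
print(f"{len(seq)} bp, GC = {gc_fraction:.3f}")
```

Because the bases are sampled independently, the realized GC fraction only approximates the requested value, and a sequence like this makes a useful baseline against which to compare patterns seen in real data.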

  • Sequences are coloured according to the following scheme:

    A T G C

  • The position index counts off 1000 bp at a time. The graph to the right of the sequence plot shows the following:
  • the row GC content (grey line) and a running average over 10 rows (red line). The scale ranges from 0 to 100%. The grey vertical dotted grid lines mark GC content in increments of 20%.
  • the blue and orange bars show the value of the lagging sum L(n) for the row

    L(n) = number of times s(i)=s(i+n),     i=1...length(s)

    where s(i) is the base pair at row position i. Compare how the 2- and 4-lag functions look for random data and for elegans sequence data. The lagging sum can be used to judge visually whether there is 2-neighbour or 4-neighbour similarity. For example, if every other base pair is the same then L(2) is approximately length(s).
  • The coloured ticks to the right of the plot show the row's most common base pair.
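The per-row annotations described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the colortiles implementation: the running average is taken over the current row and the rows before it (a trailing window), the lagging sum stops at the end of the row instead of wrapping around, and all function names are hypothetical. With the no-wrap convention, a row in which every other base is the same gives L(2) = length(s) - 2.

```python
from collections import Counter

def gc_percent(row):
    """GC content of one row, as a percentage from 0 to 100."""
    if not row:
        return 0.0
    return 100.0 * sum(base in "GC" for base in row) / len(row)

def running_average(values, window=10):
    """Mean of each value and up to `window - 1` values before it (assumed trailing)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def lagging_sum(row, n):
    """L(n): number of positions i where row[i] == row[i + n].

    Assumption: comparisons stop at the end of the row (no wrap-around).
    """
    return sum(1 for i in range(len(row) - n) if row[i] == row[i + n])

def mode_base(row):
    """Most common base in the row (ties go to the base seen first)."""
    return Counter(row).most_common(1)[0][0]

# Toy rows, 8 bases wide, chosen to give clear-cut lag values.
rows = ["ACACACAC", "GGCCGGCC", "ATGCATGC"]
for row in rows:
    print(row, gc_percent(row), lagging_sum(row, 2), lagging_sum(row, 4), mode_base(row))
print(running_average([gc_percent(r) for r in rows], window=2))
```

In the toy rows, `ACACACAC` scores high at both lags, while `GGCCGGCC` and `ATGCATGC` score zero at lag 2 but match at every position at lag 4, which is the kind of contrast the blue and orange bars are meant to expose.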

C. elegans chromosome I, offset=1e6 bp (download)
C05D11 elegans cosmid, GC = 34% (download)
random sequence, GC = 34% (download)