Reverse Complement of Gene Sequences in Haskell

Posted on March 8, 2015

Performance has not been much of a concern in the projects I worked on over the last couple of months; the goal was rather to get them to run and spit out something useful at all. So it was a welcome change when a friend of mine pointed me to the reverse complement benchmark.

This benchmark aims to compare the performance of different programming language implementations on a sample problem: calculating the reverse complement of a file containing gene sequences (encoded in FASTA). While I'm not too interested in the programming language comparison, the problem makes for nice practice in writing some Haskell with performance in mind.

The Problem

A gene sequence is a word over the power set of {A, C, G, T}, the four bases DNA is composed of (the power set, I guess, to model ambiguity). The elements of the power set are coded as A, C, G, T, U, M, R, W, S, Y, K, V, H, D, B, N. Each of the four bases has a complement base it attaches to in a DNA double helix; this extends to a complement on the whole power set. In total, we get:

code meaning           complement
A    A                 T
C    C                 G
G    G                 C
T/U  T                 A
M    A or C            K
R    A or G            Y
W    A or T            W
S    C or G            S
Y    C or T            R
K    G or T            M
V    A or C or G       B
H    A or C or T       D
D    A or G or T       H
B    C or G or T       V
N    G or A or T or C  N

The problem is, given a gene sequence, to calculate the sequence of the complement strand. That is, each character is replaced with its complement character, and then the sequence is reversed.
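
Stated as naive Haskell, ignoring headers and line layout for the moment, the transformation is simply a complement map followed by a reverse. The names here (revcompSpec, complementChar) are made up just for this post; nothing of this throwaway reference survives into the final program:

-- Naive reference: complement every character, then reverse the sequence.
-- Ignores headers and the 60-character line layout entirely.
revcompSpec :: String -> String
revcompSpec = reverse . map complementChar
  where
    complementChar c = maybe c id (lookup c table)
    table            = zip "ACGTUMRWSYKVHDBN" "TGCAAKYWSRMBDHVN"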

Our input files contain multiple sequences, separated by individual header lines (each starting with >), with line breaks every 60 characters, like so:

>human genome
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>hobbit genome
NBDHNBDHNBSCKVDHNBDHNBDHNBDHAVWHNBDHNBDHNBDHNBDHNBGUMRWUACGT
NBDHNBDHNBDHNBDACGTKNBDHNBDHNBDHNBGTUMDHNBDHNBDHNBDHNBDHNBDH
NBDHNBDH

The headers and the formatting must be preserved; the latter will bite us further down the road.

The Plan

This might best be solved in C, since the problem is algorithmically pretty trivial and thus does not require a high level of abstraction to be approachable. But that would miss the point of this being an exercise in high-performance Haskell.

The encoding of the file is not stated anywhere, so I assume it is ASCII. That should make things faster, as we can use ByteStrings and work on individual bytes directly. That will come in handy.

Then we have IO (reading the input file from stdin and writing the output to stdout), and since I like pipes, we'll use it here (I'm scared of lazy IO). Unfortunately, pipes cannot really shine here, since we cannot reverse a stream in constant space. But at least we can stream the individual sequences in the file, so when there are multiple, we do not have to keep them all in memory at once.

With that out of the way, let us begin:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE LambdaCase #-}
import           Pipes                   as P
import qualified Pipes.Prelude           as PP
import qualified Pipes.Group             as PG
import qualified Data.ByteString         as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Unsafe  as BU
import qualified Data.ByteString.Lazy    as BL
import qualified Pipes.ByteString        as PB
import qualified Data.Vector.Unboxed     as UV
import           Data.Word
import           Data.Char
import           Data.Monoid
import           Foreign.Storable
import           Control.Lens
import           Control.Monad
import           Control.Applicative

Calculating the Complement

Before we get into the messy stuff, let us calculate the complement of a single character. We use ByteStrings, so characters are Word8s. Unfortunately, Word8 does not admit character literals, so we have to convert them:

-- |
-- Converts an ASCII character to its ASCII code. Undefined behaviour on
-- anything else.
{-# INLINE toWord8 #-}
toWord8 :: Char -> Word8
toWord8 = fromIntegral . ord

Now we have to map bytes to their complement bytes. We will run that function on every single character encoding a set of bases in our sequence; since there will be hundreds of millions of them, we had better make it fast. There are only 256 values in a Word8, so we can build a lookup table:

-- |
-- Calculates the DNA complement character. Identity on all other characters.
{-# INLINE comp #-}
comp :: Word8 -> Word8
comp i = compTable `UV.unsafeIndex` fromIntegral i

-- |
-- Lookup table for DNA complement conversion.
-- Not defined inside of `comp` to stay in memory.
compTable :: UV.Vector Word8
compTable = UV.generate 256 $ \i -> toWord8 $ case toUpper $ chr i of
  'A' -> 'T'
  'C' -> 'G'
  'G' -> 'C'
  'T' -> 'A'
  'U' -> 'A'
  'M' -> 'K'
  'R' -> 'Y'
  'W' -> 'W'
  'S' -> 'S'
  'Y' -> 'R'
  'K' -> 'M'
  'V' -> 'B'
  'H' -> 'D'
  'D' -> 'H'
  'B' -> 'V'
  'N' -> 'N'
  x   -> x

The UV.unsafeIndex is a variant of UV.(!) that omits the bounds check. This is safe here, since all values of Word8 are proper indices into the vector.
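
A quick sanity check of the table, for example in GHCi:

-- >>> map (chr . fromIntegral . comp . toWord8) "ACGTN"
-- "TGCAN"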

Splitting the File

We can have multiple sequences in our file and we have to treat them individually (because of the reverse). So we somehow have to split the file as it is streamed in. Fortunately, we can just split at the > characters, so we don’t need to perform any sophisticated (read: slow) parsing.

My first try was to use the splits function from pipes-bytestring. This function will split a producer of ByteStrings into a FreeT of producers. Each layer of that free monad transformer then is a producer of one section, each section delimited by >. Since we need the entire bytestring of the section, we would then proceed to read each of those producers until exhaustion, do some things to the resulting string and yield it further down the pipe.

But this turned out to be slow. Horribly so. The finished program spent 87 percent of its time just in the splits function, which is not where the bulk of the work should be happening. So I scrapped that program and began anew; the first thing I did was implement a combinator for exactly what I needed: given a producer of byte strings (chunked, so representing one big byte string) and a character, split the input at that character and yield the entire sections. In code:

-- |
-- Generates the largest consecutive strings between the given character in the
-- byte string generated by the given producer.
splitDirect :: Monad m => Word8 -> Producer B.ByteString m r -> Producer B.ByteString m r
splitDirect c p = go p mempty where
  go p buf = lift (next p) >>= \case
    Left r         -> yieldBuf buf >> return r
    Right (bs, p') -> if
      | B.null suf -> go p' buf'
      | otherwise  -> yieldBuf buf' >> go (yield (B.tail suf) >> p') mempty
      where 
        (pre, suf) = B.breakByte c bs
        buf'       = buf <> BB.byteString pre
  yieldBuf = yield . BL.toStrict . BB.toLazyByteString

This uses Builders from bytestring to concatenate the chunks in an efficient manner, until the separator character is read. The buffered builder is then transformed into a byte string and yielded downstream.

Note that this is a horrible thing to do most of the time, since we lose chunking and produce potentially very large byte strings that are held in memory in their entirety. Here, we want to reverse the sections, so we have to hold them in memory anyway.
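
To see the combinator in action, here is a toy run (in GHCi, with OverloadedStrings providing the byte string literals) on an input that is chunked in an arbitrary place:

-- >>> PP.toListM (splitDirect (toWord8 '>') (each ["ab>cd", "ef>gh"]))
-- ["ab","cdef","gh"]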

Get the Work Done

We now have a stream of sections, each consisting of a header and a gene sequence. The header should just be reproduced as is; the gene sequence however requires a more complicated treatment:

Just mapping comp over the sequence and then reversing it with the bytestring function reverse would be a perfectly nice (and also fast) thing to do here, if it didn't produce wrong results: we need to keep the layout of our input files, i.e. there has to be a line break after every 60 characters. Thus we would need to strip our newly mapped and reversed string of all newlines and insert one after every 60 characters.
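
For reference, that rejected straight-line version would just be a throwaway one-liner like the following (the name revcompSimple is made up here), which is only correct for a raw sequence without any newlines:

-- Complement every byte, then reverse. Fine for a newline-free sequence,
-- but it moves the line breaks of our formatted files to the wrong places.
revcompSimple :: B.ByteString -> B.ByteString
revcompSimple = B.reverse . B.map comp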

And that is slow. Again we have to do things ourselves, this time going for the nuclear option: we get a pointer to the contents of the byte string and do things the traditional way. With one index starting at each end of the string, we swap (and complement) characters that are not newlines until the two indices meet; a lone character in the middle is complemented in place.

-- |
-- Reverse complement of a DNA sequence. Be careful with how you use the byte string,
-- since this function will update it in place under the hood for performance reasons.
revcomp :: B.ByteString -> IO B.ByteString
revcomp b = BU.unsafeUseAsCStringLen b (uncurry withPtr) where
  withPtr ptr len = loop ptr 0 (len - 1) >> return b
  loop ptr i j
    | i > j  = return ()
    | i == j = do
        -- a lone character left in the middle still has to be complemented in place
        e <- peekByteOff ptr i
        unless (nl == e) $ pokeByteOff ptr i (comp e)
    | otherwise = do
        ie <- peekByteOff ptr i
        je <- peekByteOff ptr j
        if
          | nl == ie  -> loop ptr (i + 1) j
          | nl == je  -> loop ptr i (j - 1)
          | otherwise -> do
              pokeByteOff ptr i $ comp je
              pokeByteOff ptr j $ comp ie
              loop ptr (i + 1) (j - 1)
  nl = toWord8 '\n'

Oh my, this looks just like what one would have written in a C implementation. But it is good that Haskell can fall back to evil things like this when it is time to sacrifice at the altar of performance.

Putting it All Together

We have splitDirect and revcomp, now it is time to put them together. First, we need to process blocks, i.e. sections with a header and some gene data. We could do that as a function of type B.ByteString -> IO B.ByteString, but then we would need to concatenate the header parts and the processed data; even with a Builder (the fast way to concatenate byte strings), this turned out to be slower than producing the output parts in a pipe:

-- |
-- Given a block in the DNA sequence, generates a producer for the reverse complement block.
block :: B.ByteString -> Producer B.ByteString IO ()
block b = do
  let (pre, suf) = B.breakByte (toWord8 '\n') b
  -- suf still starts with the newline that terminates the header line and ends
  -- with the newline preceding the next '>', so no extra newlines are needed here
  yield $ ">" <> pre
  yield =<< liftIO (revcomp suf)

The input of our program comes through stdin. It is then split into blocks; the blocks are processed individually and then sent back to the user through stdout. In code:

main :: IO ()
main = runEffect $ for (splitDirect (toWord8 '>') PB.stdin >-> PP.drop 1) block >-> PB.stdout

The PP.drop 1 drops the part of the file that comes before the first >, since we do not consider that a block (for a well-formed input it is just the empty string anyway).

Performance

Compiling the code with -O2 and running it on a generated example file of 250 MB in size, we get quite a decent result:

./revcomp < testBig.txt > test2.txt  0,35s user 0,37s system 56% cpu 1,283 total

I'm going to omit a performance comparison here, since that was not the point and I'm too lazy to do it just for the sake of it.

Conclusion

In hindsight, the moral of this story ought to be the following: by forcing the result data to be formatted, we had to go out of our way to stay performant, where two simple library functions (map and reverse) would otherwise have done the job quite fine (and just as fast!). Since humans cannot really read gene sequences without the help of a program anyway, a human-readable ASCII file with a line break every 60 characters does not make much sense. A much more sensible option would be a sequence of four-bit bitsets (maybe even padded to eight bits, at double the storage, if one wanted easier processing), coding all the possible combinations of the four bases; a sketch of that idea follows below. But maybe there is some really important aspect I'm missing there, I'm not a domain expert.
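
To make the bitset idea concrete, here is a minimal sketch of what such an encoding could look like (purely hypothetical, not part of the program above): one bit per base, so every code is a 4-bit mask, and complementing a code amounts to swapping the A/T and the C/G bits.

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Hypothetical 4-bit encoding: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
-- For example, M = "A or C" = 0011 and its complement K = "G or T" = 1100.
complement4 :: Word8 -> Word8
complement4 x =
      ((x .&. 0x1) `shiftL` 3)  -- the A bit becomes the T bit
  .|. ((x .&. 0x2) `shiftL` 1)  -- the C bit becomes the G bit
  .|. ((x .&. 0x4) `shiftR` 1)  -- the G bit becomes the C bit
  .|. ((x .&. 0x8) `shiftR` 3)  -- the T bit becomes the A bit

With a representation like this, neither the lookup table nor the layout juggling would be needed; but again, this is idle speculation from a non-expert.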

All in all, the project was quite fun since it was the first non-compiler project in a long time for me. But now I’m out of ideas, so compiler work it is again.