## Reverse Complement of Gene Sequences in Haskell

Posted on March 8, 2015

Typically, performance was not so much a concern in the projects I was working on in the last couple of months; rather getting them to run and spit out something useful at all. So it was a welcome change when a friend of mine pointed me to the reverse complement benchmark.

This benchmark aims to compare the performance of different programming language implementations on a sample problem. This sample problem is calculating the reverse complement of a file containing gene sequences (encoded in FASTA). While I’m not too interested in the programming language comparison, the problem makes a nice practice in writing some Haskell with performance in mind.

### The Problem

A gene sequence is a word in the power set of A, C, G, T; the four bases DNA is composed of (the power set to model ambiguity, I guess). The elements of the power set are coded A, C, G, T, U, M, R, W, S, Y, K, V, H, D, B, N. Each of the four bases has a complement base it attaches to in a DNA double helix; this extends to a complement on the power set. In total, we get:

code meaning           complement
C    C                 G
G    G                 C
T/U  T                 A
M    A or C            K
R    A or G            Y
W    A or T            W
S    C or G            S
Y    C or T            R
K    G or T            M
V    A or C or G       B
H    A or C or T       D
D    A or G or T       H
B    C or G or T       V
N    G or A or T or C  N

The problem is, given a gene sequence, to calculate the sequence of the complement strand. That is, each character is replaced with its complement character, and then the sequence is reversed.

Our input files contain multiple sequences, separated by individual header lines (each starting with >), with line breaks every 60 characters, like so:

>human genome
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>hobbit genome
NBDHNBDHNBSCKVDHNBDHNBDHNBDHAVWHNBDHNBDHNBDHNBDHNBGUMRWUACGT
NBDHNBDHNBDHNBDACGTKNBDHNBDHNBDHNBGTUMDHNBDHNBDHNBDHNBDHNBDH
NBDHNBDH

The headers and the formatting must be preserved; the latter will bite us further down the road.

### The Plan

This might best be solved in C, since this is algorithmically pretty trivial and thus does not require a high level of abstraction to be approachable. But that would miss the point of being an excercise in high performance Haskell.

The encoding of the file was not stated anywhere, so I assume it is ASCII encoded. That probably makes stuff faster, as we can use ByteStrings, and work on individual bytes directly. That will come in handy.

Then we have IO (reading the input file from stdin and writing the output to stdout), and since I like pipes, we’ll use it here (I’m scared of lazy IO). Unfortunately, pipes can not really shine here, since we cannot reverse a stream in constant space. But at least we can stream the individual sequences in the file, so when there are multiple, we don’t have to keep them all in memory.

With that out of the way, let us begin:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE LambdaCase #-}
import           Pipes                   as P
import qualified Pipes.Prelude           as PP
import qualified Pipes.Group             as PG
import qualified Data.ByteString         as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Unsafe  as BU
import qualified Data.ByteString.Lazy    as BL
import qualified Pipes.ByteString        as PB
import qualified Data.Vector.Unboxed     as UV
import           Data.Word
import           Data.Char
import           Data.Monoid
import           Foreign.Storable
import           Control.Lens
import           Control.Applicative

### Calculating the Complement

Before we get all into the messy stuff, let us calculate the complement of a single character. We use ByteStrings, so characters are Word8s. Unfortunately, Word8 does not admit character literals, so we have to convert them:

-- |
-- Converts an ASCII character to its ASCII code. Undefined behaviour on
-- anything else.
{-# INLINE toWord8 #-}
toWord8 :: Char -> Word8
toWord8 = fromIntegral . ord

Now we have to map bytes to their complement bytes. We will run that function on every single character encoding a set of bases in our sequence; since this will be hundreds of millions, we better make it fast. There are only 256 values in a Word8, so we can build a lookup table:

-- |
-- Calculates the DNA complement character. Identity on all other characters.
{-# INLINE comp #-}
comp :: Word8 -> Word8
comp i = compTable UV.unsafeIndex fromIntegral i

-- |
-- Lookup table for DNA complement conversion.
-- Not defined inside of comp to stay in memory.
compTable :: UV.Vector Word8
pokeByteOff ptr j $comp ie loop ptr (i + 1) (j - 1) nl = toWord8 '\n' Oh my, this looks just like what one would have done in a C implementation. But good that Haskell can fall back to do evil things like that when it is time to sacrifice at the altar of performance. ### Putting it All Together We have splitDirect and revcomp, now it is time to put it together. First, we need to process blocks, i.e. sections with a header and some gene data. We could do that as a function of type B.ByteString -> IO B.ByteString, but then we would need to concat the header parts and the processed data; even with a Builder (the fast way to concat byte strings), this turned out to be slower than producing the output parts in a pipe:3 -- | -- Given a block in the DNA sequence, generates a producer for the reverse complement block. block :: B.ByteString -> Producer B.ByteString IO () block b = do let (pre, suf) = B.breakByte (toWord8 '\n') b yield$ ">" <> pre <> "\n"
yield =<< liftIO (revcomp suf)
yield "\n"

The input of our program will come through stdin. It then is split into blocks, the blocks are individually processed and then sent back to the user through stdout. In code:

main :: IO ()
main = runEffect \$ for (splitDirect (toWord8 '>') PB.stdin >-> PP.drop 1) block >-> PB.stdout

The PP.drop 1 is for dropping the part of the file that comes before the first >, since we do not consider that a block.

### Performance

Running the code with O2 on a generated example file 250 MB in size we get quite a decent result:

./revcomp < testBig.txt > test2.txt  0,35s user 0,37s system 56% cpu 1,283 total

I’m going to omit a performance comparison here, since that was not the point and I’m quite too lazy to do it just for the sake of it.

### Conclusion

In hindsight, the morale of this story ought to be the following: By forcing the result data to be formatted, we had to go out of our way to stay performant, where two simple library functions (map and reverse) would have done the job quite fine (and just as fast!) otherwise. Since humans can not really read gene sequences without any help of a program anyways, a human readable ASCII file with a line break every 60 characters does not really make any sense. A much more sensible option would be a sequence of four bit bitsets (maybe even padded to eight bits for easier processing and double the storage if one was so inclined), coding all the possible combinations of the four bases. But maybe there is some really important aspect I miss there, I’m not a domain expert.

All in all, the project was quite fun since it was the first non-compiler project in a long time for me. But now I’m out of ideas, so compiler work it is again.