10 January 2008

Brainfuck in Haskell - speed gives me what I need


Today it's time for the final two optimizations to our interpreter. Initially I planned to use a mutable unboxed array (O(1) access time, in-place updates, unboxed elements), but I was given a nicer, more functional solution - a list zipper.
I must admit I didn't know this technique before, but it looks nice.

First of all, I will change the benchmarking procedure a little: calculating primes up to 10 has gotten too fast (ca. 0.15s) to be measured accurately, so from now on the benchmark will be finding primes up to 30, which gives us a nice timing of ca. 3.5s:

> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m3.626s
user 0m3.480s
sys 0m0.099s
>

with profile:

COST CENTRE   MODULE   %time   %alloc

run           Main      41.2     32.8
step          Main      24.2     37.8
setMem        Main      21.6     21.3
getMem        Main       5.9      0.0
!!!           Main       3.3      2.5
nextPC        Main       2.6      5.5
isEnd         Main       1.3      0.0


A zipper is a structure that "simulates" movement over list elements. It holds three things: the current element (the focus), a list of the elements to the left of the focus, and a list of the elements to its right.
Moving left or right simply means changing the focus and adding/removing a single element from the left/right list.
Here it is wrapped in code:


> data ListZipper a = ListZipper {
>     left  :: [a], -- elements to the left of the focus
>     focus :: a,   -- current element
>     right :: [a]  -- elements to the right
>   }
>
> -- Note: both moves are partial - stepping past either end of the
> -- zipper is a pattern-match failure (the BF memory is finite).
> moveLeft :: ListZipper a -> ListZipper a
> moveLeft (ListZipper (x:xs) y zs) = ListZipper xs x (y:zs)
>
> moveRight :: ListZipper a -> ListZipper a
> moveRight (ListZipper xs y (z:zs)) = ListZipper (y:xs) z zs
>
> setValue :: ListZipper a -> a -> ListZipper a
> setValue (ListZipper xs y zs) v = ListZipper xs v zs
>
> getValue :: ListZipper a -> a
> getValue (ListZipper _ y _) = y
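A quick, self-contained check of how the zipper moves its focus (the definitions are repeated here so the snippet runs on its own):

```haskell
-- Standalone sanity check for the zipper operations above,
-- not part of the interpreter itself.
data ListZipper a = ListZipper { left :: [a], focus :: a, right :: [a] }

moveLeft, moveRight :: ListZipper a -> ListZipper a
moveLeft  (ListZipper (x:xs) y zs) = ListZipper xs x (y:zs)
moveRight (ListZipper xs y (z:zs)) = ListZipper (y:xs) z zs

main :: IO ()
main = do
  let z0 = ListZipper [] (1 :: Int) [2,3,4] -- focus on 1
      z1 = moveRight z0                     -- focus moves to 2
      z2 = moveLeft z1                      -- back to 1
  print (focus z1)  -- 2
  print (focus z2)  -- 1
```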


We also need to change the definition of BFState and the functions operating on memory, namely setMem and getMem:


> data BFState = BFState {
>     program   :: BS.ByteString,     -- program being interpreted
>     memory    :: ListZipper Word16, -- memory
>     pc        :: Int,               -- current program counter
>
>     prog_len  :: Int,               -- cached program length
>     loop_ends :: UArray Int Int     -- cached loop ends
>   }

> initState :: BS.ByteString -> Int -> BFState
> initState program memSize = BFState program mem 0 (BS.length program) loops where
>     loops = listArray (0, BS.length program - 1) $ loopEnds program 0
>     mem   = ListZipper [] 0 (replicate (memSize - 1) 0)

> getMem :: BFState -> Word16
> getMem st = getValue (memory st)


> setMem :: BFState -> Word16 -> ListZipper Word16
> setMem st value = setValue (memory st) value
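For context, here is a sketch of how a single BF instruction might be dispatched onto the memory zipper. The real step function (from the earlier posts in this series) also updates the program counter, handles loops and I/O; execMem is a hypothetical name and this only illustrates the memory-touching cases:

```haskell
import Data.Word (Word16)

data ListZipper a = ListZipper [a] a [a]

moveLeft, moveRight :: ListZipper a -> ListZipper a
moveLeft  (ListZipper (x:xs) y zs) = ListZipper xs x (y:zs)
moveRight (ListZipper xs y (z:zs)) = ListZipper (y:xs) z zs

-- Hypothetical dispatch of one instruction onto the memory zipper;
-- '+'/'-' edit the focus in place, '<'/'>' just re-point the focus.
execMem :: Char -> ListZipper Word16 -> ListZipper Word16
execMem '+' (ListZipper ls v rs) = ListZipper ls (v + 1) rs
execMem '-' (ListZipper ls v rs) = ListZipper ls (v - 1) rs
execMem '<' z = moveLeft z
execMem '>' z = moveRight z
execMem _   z = z
```

Every case is O(1), which is exactly why the zipper competes with a mutable array here.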


The benchmark:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m2.673s
user 0m2.511s
sys 0m0.108s
>

Over 25% faster - nice! (Side note: IOUArray is a little faster still - you can gain another 10% - but that is not much, so let's stick with the purer solution.)

The profile does not give us much of a clue about how to make it faster:

COST CENTRE   MODULE   %time   %alloc

run           Main      52.1     40.6
step          Main      28.2     43.7
nextPC        Main       5.1      6.8
isEnd         Main       3.4      0.0
getValue      Main       3.4      0.0
getMem        Main       2.6      0.0
moveLeft      Main       1.7      2.4
!!!           Main       1.7      3.1
moveRight     Main       0.0      2.4

but there is one more thing to look at. While interpreting a Brainfuck program, all characters that are not valid BF instructions are ignored, i.e. their execution is limited to advancing the program counter. This happens every time such a character is encountered, possibly many times during a program run.
If those characters do nothing, we can simply filter them out before we start interpreting the program. This should give us some speed gain (especially if you look inside prime.bf - a LOT of whitespace and some comments).


> normalize :: BS.ByteString -> BS.ByteString
> normalize program = BS.filter (\c -> elem (chr $ fromEnum c) "+-[]<>.,") program
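As a quick check of what the filter keeps - this standalone version goes through the Char8 interface for convenience, rather than through chr/fromEnum on Word8 as above:

```haskell
import qualified Data.ByteString.Char8 as BS

-- Keep only the eight BF instructions; everything else is comment noise.
normalize :: BS.ByteString -> BS.ByteString
normalize = BS.filter (`elem` "+-[]<>.,")

main :: IO ()
main = BS.putStrLn (normalize (BS.pack "++ add two [ loop ] > move right"))
-- prints "++[]>"
```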

> main :: IO ()
> main = do
>     args <- getArgs
>     if null args
>       then fail "Please provide the name of the program to execute"
>       else do program <- BS.readFile (head args)
>               run (initState (normalize program) 30000)


Results:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m1.734s
user 0m1.658s
sys 0m0.054s
>

Another 1s shaved off.

Some additional notes:

  • the times above include profiling overhead. When compiled without profiling ("-prof -auto-all") and run without "+RTS -p -RTS", the time drops to ca. 0.38s

  • the interpreter can be made about twice as fast with some more hacking - using IOUArray instead of a zipper for the memory, making fields strict (pc, prog_len), passing -funbox-strict-fields to the compiler, etc.
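For reference, a minimal sketch of the strict-fields idea from the last bullet (my own illustration, not code from this post - only the bang patterns and the pragma are the point):

```haskell
{-# OPTIONS_GHC -funbox-strict-fields #-}
-- With the pragma above, the strict Int fields below are stored as
-- raw unboxed machine integers instead of pointers to thunks.
import qualified Data.ByteString as BS
import Data.Array.Unboxed (UArray)
import Data.Word (Word16)

data ListZipper a = ListZipper [a] a [a]

data BFState = BFState {
    program   :: BS.ByteString,
    memory    :: ListZipper Word16,
    pc        :: !Int,              -- strict: always evaluated
    prog_len  :: !Int,              -- strict: always evaluated
    loop_ends :: UArray Int Int
  }
```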

