02 January 2008

Brainfuck in Haskell - time for some optimization


In the previous post we wrote an extremely slow, but working, Brainfuck interpreter in Haskell. Now it's time to make it a little bit faster. (Side note: this post will not be a valid literate Haskell program. It describes changes to the origital + command line operations.)

GHC gives us a nice profiler that makes it easy to spot most of the performance issues, and that's what we will use.
For benchmarking I'll use prime.bf finding prime numbers up to, let's say, 10.


> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m14.244s
user 0m14.010s
sys 0m0.092s
>

Profiling results are in Main.prof. Here's a snippet that shows our hot spots:

COST CENTRE MODULE %time %alloc

findMatchingBackwards Main 32.3 13.0
isEnd Main 30.9 0.0
step Main 21.3 33.8
findMatchingForward Main 14.9 5.6
setMem Main 0.4 19.5
run Main 0.1 19.7
nextPC Main 0.0 8.0

It seems that the easiest thing to get rid of is isEnd function - each time it is called it calculates length of the program (string => list of char => length's complexity is O(n)). Program does not change over time, so we can simply cache program length in BFState and compare pc against this cached value.
Changes are needed in definition of BFState:


> data BFState = BFState {
> program :: String, -- program being interpreted
> memory :: [Word16], -- memory
> pc :: Int, -- current program counter
> pos :: Int, -- current pointer position
>
> prog_len :: Int -- cached program length
> }


in initState function:


> initState :: String -> Int -> BFState
> initState program memSize = BFState program (take memSize $ repeat 0) 0 0 (length program)


and in isEnd itself:


> isEnd :: BFState -> Bool
> isEnd st = (pc st) >= (prog_len st)


Result:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m9.387s
user 0m9.234s
sys 0m0.058s
>

So it worked as expected - ca 30% performance gain. Current performance bottlenecks:

COST CENTRE MODULE %time %alloc

findMatchingBackwards Main 47.5 11.4
step Main 26.8 34.3
findMatchingForward Main 24.4 4.9
setMem Main 0.9 17.0
run Main 0.4 25.1
nextPC Main 0.0 7.0

70% of the time is spent looking for matching '[' and ']' in loops.
Situation here is similar to isEnd - program does not change over time, so it should not be a problem to cache somehow findMatching results.

The simplest way is to create a table of the same length as program, where on a position corresponding to '[' we will have index of matching ']', and on position corresponding to ']' we will have index of matching '[', for example table for program "++[>+<-]" will look like [0, 0, 7, 0, 0, 0, 0, 2] (values for characters other than '[' and ']' do not matter, so I put 0 there).

Function generating such table could look like this:


> loopEnds :: String -> Int -> [Int]
> loopEnds program pos =
> if (pos >= length program) then []
> else end:(loopEnds program (pos+1)) where
> end = case (program !! pos) of
> '[' -> findMatchingForward program pos
> ']' -> findMatchingBackwards program pos
> otherwise -> 0


Of course we need also changes in BFState:


> data BFState = BFState {
> program :: String, -- program being interpreted
> memory :: [Word16], -- memory
> pc :: Int, -- current program counter
> pos :: Int, -- current pointer position
>
> prog_len :: Int, -- cached program length
> loop_ends :: [Int] -- cached loop ends
> }


initState:


> initState :: String -> Int -> BFState
> initState program memSize = BFState program (take memSize $ repeat 0) 0 0 (length program) (loopEnds program 0)


and in implementation of '[' and ']' in step:


> '[' -> return st { pc = pc' } where
> pc' = if getMem st == 0 then ((loop_ends st) !! (pc st)) + 1
> else nextPC st




> ']' -> return st { pc = pc' } where
> pc' = (loop_ends st) !! (pc st)


Results:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m4.090s
user 0m4.003s
sys 0m0.032s
>

With current profile:

COST CENTRE MODULE %time %alloc

step Main 85.6 42.1
findMatchingForward Main 5.0 0.5
findMatchingBackwards Main 5.0 0.5
loopEnds Main 3.0 0.6
run Main 1.5 30.2
setMem Main 0.0 18.3
nextPC Main 0.0 7.4

Not bad. 14s down to 4s. Still way slower than C version, but getting better.

In the next part - some more optimizations, this time mostly on data types used.

No comments: