31 January 2008

Applications menu in ion3

After some time spent using KDE I recently switched back to ion3. One of the first things I noticed after the switch was the lack of the KDE menu with all installed apps - actually, the lack of any menu with all installed apps. I am too old to remember all the exec names of the programs I have installed, and way too lazy to add them manually to the ion menu, so I've written a script that parses all .desktop files from /usr/share/applications and adds (more or less) all the apps to the menu that can be invoked with Meta1-Esc (for me: alt-esc).

You can download the script from here. All you have to do is save it in your ~/.ion3 directory and add dopath("cfg_apps") to cfg_ion.lua.
The menu is created during ion startup, so if you install new apps you have to either log out and back in, or restart ion, for the changes to become visible.

Note: the script is quite lame - it uses a hardcoded list of categories, does not handle some Exec entries properly, etc. I will be cleaning those things up soon.

Enjoy.

15 January 2008

Brainfuck - it's the last one, I promise


OK, I promise - this is the last version. What changed since last time:

  • the program is now stored as a plain list of operations ([Op Int]) instead of a zipper - a zipper is good for the memory, where the pointer moves in both directions, but for a program that is only traversed in one direction it adds unnecessary overhead

  • parser clean-up + optimization of sequences of identical operations ("++++", "---", ">>>", "<<<") built into the parser

  • optimization of the precompiled code based on common loop patterns, such as setting the current cell to 0, adding the current cell to another one, etc.


Result - ca. 5x speed-up on prime.bf. Other benchmarks also gain some speed.

Again, the post is a literate Haskell source. Enjoy.



> module Main where
>
> import qualified Control.Exception as CE
> import Control.Monad
> import Data.Char
> import Data.Word
> import System.Environment
> import System.IO
> import Text.ParserCombinators.Parsec

> import Debug.Trace


This version of the interpreter makes use of "precompiled" and optimized Brainfuck code. The source file gets compiled to the following set of operations:


> data Op a =
>     Add a         -- add a to the current memory cell (a can be negative)
>   | Move a        -- move the pointer a positions to the right (to the left if a is negative)
>   | Input         -- input one character and store it in memory
>   | Output        -- output the current memory cell to stdout
>   | Loop [Op a]   -- loop with a body consisting of the given operation list
>   | Set a         -- OPTIMIZATION: set the current cell to a value
>   | FarAdd a a    -- OPTIMIZATION: add k * value of the current cell to the cell 'a' away
>   | Scan a        -- OPTIMIZATION: search for 0, jumping a cells at a time
>   deriving (Show, Eq)


Source code parsing is done using a simple grammar written with Parsec.

The BNF form of the grammar looks something like this:

program ::= instruction*
instruction ::= loop | simple
loop ::= '[' instruction* ']'
simple ::= '+' | '-' | '<' | '>' | '.' | ','

which can be written in Parsec:


> program :: Parser [Op Int]
> program = many instruction
>
> instruction :: Parser (Op Int)
> instruction = simple <|> loop
>
> loop :: Parser (Op Int)
> loop = between (char '[') (char ']') program >>= \p -> return $ Loop p
>
> simple :: Parser (Op Int)
> simple =     (many1 (char '+') >>= \p -> return $ Add (length p))
>          <|> (many1 (char '-') >>= \p -> return $ Add (negate $ length p))
>          <|> (many1 (char '>') >>= \p -> return $ Move (length p))
>          <|> (many1 (char '<') >>= \p -> return $ Move (negate $ length p))
>          <|> (char '.' >>= \_ -> return Output)
>          <|> (char ',' >>= \_ -> return Input)

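For illustration (my own example, not part of the original post), parsing a small, already normalized piece of source with this grammar gives:

> parseExample :: Either ParseError [Op Int]
> parseExample = parse program "" "+++[->+<]"
> -- = Right [Add 3, Loop [Add (-1), Move 1, Add 1, Move (-1)]]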

Data memory is handled using a zipper:


> data ListZipper a = ListZipper {
>       left  :: ![a],  -- elements left from focus
>       focus :: !a,    -- current element
>       right :: ![a]   -- elements right from focus
>   } deriving Show
>
> move :: Int -> ListZipper a -> ListZipper a
> move (-1) (ListZipper (x:xs) y zz) = ListZipper xs x (y:zz)
> move 1 (ListZipper xx y (z:zs)) = ListZipper (y:xx) z zs
> move 0 lz = lz
> move n lz = if n > 0 then move (n-1) (move 1 lz)
>             else move (n+1) (move (-1) lz)
>
> mkZipper :: [a] -> ListZipper a
> mkZipper x = ListZipper [] (head x) (tail x)
>
> getValue :: ListZipper a -> a
> getValue (ListZipper _ y _) = y
>
> setValue :: ListZipper a -> a -> ListZipper a
> setValue (ListZipper xx _ yy) v = ListZipper xx v yy
>
> {-# INLINE move #-}
> {-# INLINE getValue #-}
> {-# INLINE setValue #-}

> scan :: Int -> ListZipper Word16 -> ListZipper Word16
> scan n lz@(ListZipper _ 0 _) = lz
> scan n lz = scan n (move n lz)
>
> addAt :: Int -> Int -> ListZipper Word16 -> ListZipper Word16
> addAt n k lz@(ListZipper l v r) =
>     if v == 0 then lz
>     else let
>            doAddAt 0 _      _ = error "Invalid use of doAddAt - 0 offset"
>            doAddAt 1 (x:xs) v = (v+x):xs
>            doAddAt n (x:xs) v = x:(doAddAt (n-1) xs v)
>            value = fromIntegral k*v
>          in if n > 0 then ListZipper l v (doAddAt n r value)
>             else ListZipper (doAddAt (-n) l value) v r
>
> {-# INLINE addAt #-}
> {-# INLINE scan #-}

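A small illustration of my own (not in the original post) of what these two helpers do: addAt n k adds k times the focused value to the cell n positions away without touching the focus (the optimizer emits the matching Set 0 separately), and scan jumps n cells at a time until it lands on a zero.

> addAtExample :: ListZipper Word16
> addAtExample = addAt 1 2 (ListZipper [] 3 [10, 0])
> -- left = [], focus = 3, right = [16, 0]
>
> scanExample :: ListZipper Word16
> scanExample = scan 1 (ListZipper [] 5 [7, 0, 9])
> -- left = [7, 5], focus = 0, right = [9]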


Time for the main interpreter.

Executing the program transforms the memory state (ListZipper Word16) into a new state, with possible side effects in I/O.


> runSequence :: ListZipper Word16 -> [Op Int] -> IO (ListZipper Word16)
> runSequence memory = foldM step memory
>
> step :: ListZipper Word16 -> Op Int -> IO (ListZipper Word16)
> step mem op = --trace (show op) $!
>   case op of
>     Move n     -> return $! move n mem
>     Add n      -> return $! setValue mem ((getValue mem) + (fromIntegral n))
>     Loop p     -> doLoop p mem
>     Set n      -> return $! setValue mem (fromIntegral n)
>     FarAdd n k -> return $! addAt n k mem
>     Scan n     -> return $! scan n mem
>     Input      -> CE.try getChar >>= \c ->
>                     case c of
>                       Left err -> return $! setValue mem 0
>                       Right x  -> return $! setValue mem (fromIntegral $ ord x)
>     Output     -> hPutChar stdout (chr . fromEnum $ getValue mem) >> return mem


I have enclosed loop body execution in a separate function to make it clearer (to me at least). The algorithm is simple: if on entry to the loop the current memory cell has a non-zero value, the loop body is executed as if it were a separate program, and the resulting memory state is passed back to doLoop to check whether another iteration should be performed.
If the value of the current cell is zero, the body is skipped and the function simply returns the memory state it received as a parameter.


> doLoop :: [Op Int] -> ListZipper Word16 -> IO (ListZipper Word16)
> doLoop block memory = if (getValue memory) == 0 then return memory
>                       else runSequence memory block >>= doLoop block


Optimization step - the precompiled program is transformed according to the following rules:

  • a sequence of Add operations is replaced with a single Add whose argument is the sum of the input arguments, for example [Add 1, Add 1] is replaced with [Add 2], [Add 1, Add 1, Add (-1)] with [Add 1], etc. Technically sequences of '+' or '-' are already handled by the Brainfuck parser, but it does not optimize intermixed '+' and '-', as in "+-+-+-". That is really a programmer's mistake, but it can happen anyway.

  • a sequence of Move operations is replaced with a single Move in a similar fashion - [Move (-1), Move (-1)] is replaced with [Move (-2)], etc. The situation is similar to the one described above - the parser does not optimize "<><><><>" properly.

  • empty operations are removed, for example Add 0, Move 0 or an empty loop.


The optimizer also replaces common loop patterns with single opcodes:

  • Loop [Add (-1)] (setting the current cell to 0) is replaced with Set 0.

  • Loop [Move n] (scanning for the nearest zero cell, jumping n cells at a time) is replaced with Scan n.

  • Loop [Add (-1), Move n, Add k, Move (-n)] (adding k*current to the cell n away) is replaced with [FarAdd n k, Set 0] (the Set 0 is necessary, since a side effect of the original loop is setting the current cell to 0). The variant with the Add (-1) at the end of the loop is also recognized.

  • Loop [Add (-1), Move n1, Add k1, Move n2, Add k2, Move n3], if n1+n2+n3 == 0 (two additions - k1*current to the cell n1 away, k2*current to the cell n1+n2 away), is replaced with the sequence [FarAdd n1 k1, FarAdd (n1+n2) k2, Set 0]. Again, the variant with the Add (-1) at the end is also recognized.




> optimize :: [Op Int] -> [Op Int]
> optimize [] = []
> optimize ((Add 0) :xs) = optimize xs
> optimize ((Move 0) :xs) = optimize xs
> optimize ((Loop []) :xs) = optimize xs
> optimize ((Add x) :(Add y) :xs) = optimize (Add (x+y) :xs)
> optimize ((Move x):(Move y):xs) = optimize (Move (x+y):xs)
> optimize ((Set x) :(Add y) :xs) = optimize (Set (x+y) :xs)
> optimize ((Set x) :(Set y) :xs) = optimize (Set y :xs)
> optimize ((Loop [Add (-1)]):xs) = optimize (Set 0 :xs)
> optimize ((Loop [Move x]) :xs) = optimize (Scan x :xs)
> optimize ((Loop p) :xs) = let p' = optimize p in loopOptimize (Loop p') ++ (optimize xs)
> optimize (x :xs) = x:(optimize xs)
>
> loopOptimize :: Op Int -> [Op Int]
> loopOptimize x@(Loop [Add (-1), Move n1, Add k, Move n2]) =
>     if n1 == -n2 then [FarAdd n1 k, Set 0]
>     else [x]
> loopOptimize x@(Loop [Move n1, Add k, Move n2, Add (-1)]) =
>     if n1 == -n2 then [FarAdd n1 k, Set 0]
>     else [x]
> loopOptimize x@(Loop [Add (-1), Move n1, Add k1, Move n2, Add k2, Move n3]) =
>     if (n1+n2) == -n3 then [FarAdd n1 k1, FarAdd (n1+n2) k2, Set 0]
>     else [x]
> loopOptimize x@(Loop [Move n1, Add k1, Move n2, Add k2, Move n3, Add (-1)]) =
>     if (n1+n2) == -n3 then [FarAdd n1 k1, FarAdd (n1+n2) k2, Set 0]
>     else [x]
> loopOptimize x = [x]


doOptimize keeps running the optimization over its input for as long as subsequent optimize passes still change the program.


> doOptimize :: [Op Int] -> [Op Int]
> doOptimize p = let p' = optimize p in
>                if p == p' then p else doOptimize p'

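A worked example of my own (not from the original post), tying the parser and the optimizer together: the loop in "++[->+<]" matches the "add the current cell to a neighbour" pattern, so the whole program collapses into three opcodes.

> optimizeExample :: [Op Int]
> optimizeExample = doOptimize [Add 2, Loop [Add (-1), Move 1, Add 1, Move (-1)]]
> -- = [Add 2, FarAdd 1 1, Set 0]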

Program "normalization", i.e. removal of all characters other than valid Brainfuck instructions.


> normalize :: String -> String
> normalize program = filter (`elem` "+-<>[].,") program


And finally the main function.


> main :: IO ()
> main = do
>   args <- getArgs
>   if length args < 1
>     then fail "Please provide name of the program to run"
>     else do
>       prog <- readFile (head args)
>       case (parse program "" . normalize) prog of
>         Left err -> do
>           putStr "Parse error at "
>           print err
>         Right res -> do
>           runSequence (mkZipper (replicate 30000 0)) (doOptimize res)
>           return ()



11 January 2008

The purest is "the bestest" - final attempt to Brainfuck in Haskell


Ladies and gentlemen, it is time for the final attempt at writing a perfect implementation of a Brainfuck interpreter in Haskell.

First of all I must make a statement: all previous attempts were lame and wrong, and the result sucked big time. I shouldn't have published them, if only to save myself the embarrassment.
The code was big, ugly, and imperative rather than functional. Luckily people suggested proper ways of doing things, and I must admit that I like the result. Thanks guys!

This post is again valid literate Haskell code, so you can just paste it into an .lhs file and compile it.
Let's do it.


> module Main where
>
> import qualified Control.Exception as CE
> import Data.Char
> import Data.Word
> import System.Environment
> import System.IO
> import Text.ParserCombinators.Parsec


This version of the interpreter makes use of "precompiled" and optimized Brainfuck code. The source file gets compiled to the following set of operations:


> data Op a =
>     Add a         -- add a to the current memory cell (a can be negative)
>   | Move a        -- move the pointer a positions to the right (to the left if a is negative)
>   | Input         -- input one character and store it in memory
>   | Output        -- output the current memory cell to stdout
>   | Loop [Op a]   -- loop with a body consisting of the given operation list
>   | End           -- guard used to detect the end of the program or of a loop body
>   deriving (Show, Eq)


"End" is used as guard and is being placed at the end of each loop body and of the whole program. It is perfectly possible to get rid of this guard, but I somehow found it more pleasant to use than handling empty lists in zipper.

Source code parsing is done using a simple grammar written with Parsec.

The BNF form of the grammar looks something like this:

program ::= instruction*
instruction ::= loop | simple
loop ::= '[' instruction* ']'
simple ::= '+' | '-' | '<' | '>' | '.' | ','

which can be written in Parsec:


> program :: Parser [Op Int]
> program = do { ins <- instruction       -- instruction
>              ; do { rest <- program     -- followed by more instructions
>                   ; return (ins:rest)
>                   }
>                <|> return [ins]         -- or not - just single
>              }
>           <|> return [End]              -- empty program / end of list
>
> instruction :: Parser (Op Int)
> instruction = simple <|> loop
>
> loop :: Parser (Op Int)
> loop = do { char '['
>           ; body <- program
>           ; char ']'
>           ; return (Loop body)
>           }
>
> simple :: Parser (Op Int)
> simple = do { ins <- oneOf "+-<>.,"
>             ; let op = case ins of
>                          '+' -> Add 1
>                          '-' -> Add (-1)
>                          '<' -> Move (-1)
>                          '>' -> Move 1
>                          '.' -> Output
>                          ',' -> Input
>             ; return op
>             }


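To illustrate the End guards (my own example, not from the original post): parsing a tiny program yields an End after the loop body and another one at the end of the whole program.

> parseExample :: Either ParseError [Op Int]
> parseExample = parse program "" "+[-]"
> -- = Right [Add 1, Loop [Add (-1), End], End]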

To make things as pure and as functional as possible, both program and memory handling will be implemented using a zipper. Really cool technique!
Note: it could also be implemented in the way suggested by lasts here, i.e. without the "focus" field. I don't know whether that would be faster or not, but I like the variant with an explicitly declared focus field.


> data ListZipper a = ListZipper {
>       left  :: ![a],  -- elements left from focus
>       focus :: !a,    -- current element
>       right :: ![a]   -- elements right from focus
>   } deriving Show
>
> move :: Int -> ListZipper a -> ListZipper a
> move (-1) (ListZipper (x:xs) y zz) = ListZipper xs x (y:zz)
> move 1 (ListZipper xx y (z:zs)) = ListZipper (y:xx) z zs
> move 0 lz = lz
> -- note: move for abs(n) > 1 is done this way because it is faster than
> -- interpreting a series of '>'/'<' opcodes one by one and faster than trying
> -- to implement such an operation using take/drop/splitAt etc.
> move n lz = if n > 0 then move (n-1) (move 1 lz)
>             else move (n+1) (move (-1) lz)
>
> mkZipper :: [a] -> ListZipper a
> mkZipper x = ListZipper [] (head x) (tail x)
>
> getValue :: ListZipper a -> a
> getValue (ListZipper _ y _) = y
>
> setValue :: ListZipper a -> a -> ListZipper a
> setValue (ListZipper xx _ yy) v = ListZipper xx v yy



Time for the main interpreter.

"runSequence" executes the program inside a ListZipper until it reaches the "End" marker. Executing the program transforms the memory state (ListZipper Word16) into a new state, with possible side effects in I/O.


> runSequence :: ListZipper (Op Int) -> ListZipper Word16 -> IO (ListZipper Word16)
> runSequence program memory =
>   if (focus program) == End
>     then return memory
>     else do
>       mem' <- step (focus program) memory
>       runSequence (move 1 program) mem'
>
> step :: Op Int -> ListZipper Word16 -> IO (ListZipper Word16)
> step op mem =
>   case op of
>     Move n -> return $ move n mem
>     Add n  -> return $ setValue mem ((getValue mem) + (fromIntegral n))
>     Loop p -> doLoop (mkZipper p) mem
>     Input  -> do
>       c <- CE.try getChar
>       case c of
>         Left err -> return $ setValue mem 0
>         Right x  -> return $ setValue mem (fromIntegral $ fromEnum x)
>     Output -> do
>       hPutChar stdout (chr $ fromEnum $ getValue mem)
>       hFlush stdout
>       return mem


I have enclosed loop body execution in a separate function to make it clearer (to me at least). The algorithm is simple: if on entry to the loop the current memory cell has a non-zero value, the loop body is executed as if it were a separate program, and the resulting memory state is passed back to doLoop to check whether another iteration should be performed.
If the value of the current cell is zero, the body is skipped and the function simply returns the memory state it received as a parameter.


> doLoop :: ListZipper (Op Int) -> ListZipper Word16 -> IO (ListZipper Word16)
> doLoop block memory =
>   if (getValue memory) == 0
>     then return memory
>     else do
>       mem' <- runSequence block memory
>       doLoop block mem'


Optimization step - the precompiled program is transformed according to the following rules:

  • a sequence of Add operations is replaced with a single Add whose argument is the sum of the input arguments, for example [Add 1, Add 1] is replaced with [Add 2], [Add 1, Add 1, Add (-1)] with [Add 1], etc.

  • a sequence of Move operations is replaced with a single Move in a similar fashion - [Move (-1), Move (-1)] is replaced with [Move (-2)], etc.




> optimize :: [Op Int] -> [Op Int]
> optimize [] = []
> optimize (x:[]) = [x]
> optimize ((Add x):(Add y):xs) = optimize ((Add (x+y)):xs)
> optimize ((Move x):(Move y):xs) = optimize ((Move (x+y)):xs)
> optimize ((Loop p):xs) = (Loop (optimize p)):(optimize xs)
> optimize (x:xs) = x:(optimize xs)

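A quick illustration of my own (not in the original post): runs of intermixed increments and pointer moves collapse into single opcodes, while the End guards are left untouched.

> optimizeExample :: [Op Int]
> optimizeExample = optimize [Add 1, Add 1, Add (-1), Move 2, Move (-1), End]
> -- = [Add 1, Move 1, End]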

Program "normalization", i.e. removal of all characters other than valid Brainfuck instructions.


> normalize :: String -> String
> normalize program = filter (\c -> elem c "+-<>[].,") program


And finally the main function.


> main :: IO ()
> main = do
>   args <- getArgs
>   if length args < 1
>     then fail "Please provide name of the program to run"
>     else do
>       prog <- readFile (head args)
>       case (parse program "" (normalize prog)) of
>         Left err -> do
>           putStr "Parse error at "
>           print err
>         Right res -> do
>           runSequence (mkZipper (optimize res)) (mkZipper (take 30000 $ repeat 0))
>           return ()



Result:

time (echo 30 | ./Main ../tests/prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m0.267s
user 0m0.239s
sys 0m0.012s


Yep, that's right! Cleaner, better code that is ca. 30% faster than the semi-imperative code using unboxed types!
Note that it can be sped up a little more - marking the ListZipper fields as strict and passing -funbox-strict-fields to ghc drops the time to 0.2s.

EDIT: the original post was missing one quite important line - optimize ((Loop p):xs) ... . Without it, optimizations are not propagated inside loops, so in any non-trivial program they have basically no effect.

This is corrected now.

10 January 2008

Brainfuck in Haskell - speed gives me what I need


Today it's time for the final two optimizations to our interpreter. Initially I planned to use a mutable unboxed array (O(1) access time, in-place updates, unboxed elements), but I was given a nicer, more functional solution - a list zipper.
I must admit I didn't know this technique before, but it looks nice.

First of all I will change the benchmarking procedure a little bit - calculating primes up to 10 has become too fast (ca. 0.15s) to be measured accurately, so from now on the benchmark will be finding primes up to 30, which gives us a nice timing of ca. 3.5s:

> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m3.626s
user 0m3.480s
sys 0m0.099s
>

with profile:

COST CENTRE MODULE %time %alloc

run Main 41.2 32.8
step Main 24.2 37.8
setMem Main 21.6 21.3
getMem Main 5.9 0.0
!!! Main 3.3 2.5
nextPC Main 2.6 5.5
isEnd Main 1.3 0.0


A zipper is a structure that "simulates" movement over list elements. It holds three things - the current element (the focus), a list of elements to the left of the focus and a list of elements to the right.
Moving left or right simply means changing the focus and adding/removing a single element to/from the left/right lists.
Here it is wrapped in code:


> data ListZipper a = ListZipper {
>       left  :: [a],  -- elements on the left from focus
>       focus :: a,    -- current element
>       right :: [a]   -- elements on the right
>   }
>
> moveLeft :: ListZipper a -> ListZipper a
> moveLeft (ListZipper (x:xs) y zs) = ListZipper xs x (y:zs)
>
> moveRight :: ListZipper a -> ListZipper a
> moveRight (ListZipper xs y (z:zs)) = ListZipper (y:xs) z zs
>
> setValue :: ListZipper a -> a -> ListZipper a
> setValue (ListZipper xs y zs) v = ListZipper xs v zs
>
> getValue :: ListZipper a -> a
> getValue (ListZipper _ y _) = y

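A small illustration of my own (not in the original post): moving right makes the next element the new focus and pushes the old focus onto the left list.

> zipperExample :: ListZipper Int
> zipperExample = moveRight (ListZipper [] 1 [2, 3])
> -- left = [1], focus = 2, right = [3]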

We also need to change the definition of BFState and the functions operating on memory, namely setMem and getMem:


> data BFState = BFState {
>       program :: BS.ByteString,     -- program being interpreted
>       memory :: ListZipper Word16,  -- memory
>       pc :: Int,                    -- current program counter
>
>       prog_len :: Int,              -- cached program length
>       loop_ends :: UArray Int Int   -- cached loop ends
>   }

> initState :: BS.ByteString -> Int -> BFState
> initState program memSize = BFState program mem 0 (BS.length program) loops where
>     loops = (listArray (0, (BS.length program - 1)) $ loopEnds program 0)
>     mem = ListZipper [] 0 (take (memSize - 1) $ repeat 0)

> getMem :: BFState -> Word16
> getMem st = getValue (memory st)


> setMem :: BFState -> Word16 -> ListZipper Word16
> setMem st value = setValue (memory st) value


The benchmark:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m2.673s
user 0m2.511s
sys 0m0.108s
>

Over 25% faster - nice! (Side note: IOUArray is a little bit faster - you can gain another 10% - but that is not much, so let's stick with the purer solution.)

The profile does not give us much of a clue about what to do to make it faster:

COST CENTRE MODULE %time %alloc

run Main 52.1 40.6
step Main 28.2 43.7
nextPC Main 5.1 6.8
isEnd Main 3.4 0.0
getValue Main 3.4 0.0
getMem Main 2.6 0.0
moveLeft Main 1.7 2.4
!!! Main 1.7 3.1
moveRight Main 0.0 2.4

but there is one more thing to look at - while interpreting a Brainfuck program, all characters that are not valid BF instructions are ignored, i.e. their execution is limited to advancing the program counter. This happens every time such a character is encountered, possibly many times during a program run.
If those characters do nothing, we can simply filter them out before we start interpreting the program. That should give us some speed gain (especially if you look inside prime.bf - a LOT of whitespace and some comments).


> normalize :: BS.ByteString -> BS.ByteString
> normalize program = BS.filter (\c -> elem (chr $ fromEnum c) "+-[]<>.,") program

> main :: IO ()
> main = do
>   args <- getArgs
>   if length args == 0
>     then fail "Please provide name of the program to execute"
>     else do
>       program <- BS.readFile (head args)
>       run (initState (normalize program) 30000)


Results:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 30 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7 11 13 17 19 23 29

real 0m1.734s
user 0m1.658s
sys 0m0.054s
>

Another 1s shaved off.

Some additional notes:

  • the times above include profiling overhead. If compiled without profiling ("-prof -auto-all") and run without "+RTS -p -RTS", the time drops to ca. 0.38s

  • the interpreter can be made about twice as fast with some more hacking - using IOUArray instead of a zipper for the memory, putting strictness annotations on fields (pc, prog_len), passing -funbox-strict-fields to the compiler, etc.


06 January 2008

Brainfuck in Haskell - optimizations part 2


In the previous post we brought the execution time of the prime.bf benchmark down from 14s to 4s. The profiling result was:

COST CENTRE MODULE %time %alloc

step Main 85.6 42.1
findMatchingForward Main 5.0 0.5
findMatchingBackwards Main 5.0 0.5
loopEnds Main 3.0 0.6
run Main 1.5 30.2
setMem Main 0.0 18.3
nextPC Main 0.0 7.4


It does not show us directly which operation consumes the most CPU ("step" is quite a complex function), but we can do some guessing. First of all, the program operates a lot on strings, mostly accessing random elements of the string (the !! operator). Strings in Haskell are lists of Char, which makes this operation O(n).
Haskell provides a nice data type that can be used as a replacement - ByteString.

First we need to add an import (it has to be qualified since there are name conflicts with Prelude):


> import qualified Data.ByteString as BS


Now we can replace all occurrences of String (since it is used only for the program) with BS.ByteString, all "length program" with "BS.length program", and "readFile" with "BS.readFile" (which reads the file contents as a ByteString). The only thing left is the !! operator, which is not defined for ByteString.
The function for accessing random elements is called "index", but unfortunately it returns Word8 instead of Char, so we would have to replace the character constants in the code with their numerical equivalents. To avoid this we'll write our own operator that returns a ByteString element as a Char:


> (!!!) :: BS.ByteString -> Int -> Char
> (!!!) bs ind = chr $ fromEnum $ BS.index bs ind
>
> infix 9 !!!


The last line makes !!! an infix operator with the same precedence as the list's !!.
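
A tiny sanity check of my own (not in the original post), assuming the imports already present in the program (Data.Char's chr is used by !!! itself):

> bangExample :: Char
> bangExample = BS.pack [104, 105] !!! 1
> -- = 'i'   (104 and 105 are the byte values of 'h' and 'i')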

Now we can safely replace !! with !!! everywhere we select a program character ("program !! pos" becomes "program !!! pos", "program st !! pc st" becomes "program st !!! pc st") and do some benchmarking.

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m0.694s
user 0m0.517s
sys 0m0.020s
>

Not bad - almost 6 times faster.

There is a similar situation with loop_ends and with memory - we often access random elements of a list. On top of that, list elements are boxed Haskell objects, which adds cost when operating on them.
loop_ends is the easiest to improve: since the array is immutable, we can replace it with Haskell's type for immutable arrays of unboxed elements (UArray), which additionally has O(1) access time for its elements.

Necessary changes:


> import Data.Array.Unboxed

> data BFState = BFState {
>       program :: BS.ByteString,    -- program being interpreted
>       memory :: [Word16],          -- memory
>       pc :: Int,                   -- current program counter
>       pos :: Int,                  -- current pointer position
>
>       prog_len :: Int,             -- cached program length
>       loop_ends :: UArray Int Int  -- cached loop ends
>   }


> initState :: BS.ByteString -> Int -> BFState
> initState program memSize = BFState program (take memSize $ repeat 0) 0 0 (BS.length program)
>                                     (listArray (0, (BS.length program - 1)) $ loopEnds program 0)


Plus changing all "(loop_ends st) !! (pc st)" to "(loop_ends st) ! (pc st)", since UArray uses ! to access its elements.

As simple as that.

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m0.137s
user 0m0.101s
sys 0m0.012s
>

Wow! Another 5 times faster! Profile:

COST CENTRE MODULE %time %alloc

step Main 40.0 41.1
setMem Main 40.0 15.3
getMem Main 20.0 0.0
run Main 0.0 32.4
nextPC Main 0.0 6.2
!!! Main 0.0 3.3

As you can see, the getMem and setMem operations now take ca. 60% of the program's run time, and we know exactly why - accessing random elements of a list.
Unfortunately the memory is mutable, so we cannot replace it as easily as we did loop_ends. Haskell does have a suitable type, namely mutable arrays of unboxed elements (IOUArray), and that's what we will use next time.

02 January 2008

Brainfuck in Haskell - time for some optimization


In the previous post we wrote an extremely slow, but working, Brainfuck interpreter in Haskell. Now it's time to make it a little bit faster. (Side note: this post is not a valid literate Haskell program. It describes changes to the original, plus command line operations.)

GHC gives us a nice profiler that makes it easy to spot most performance issues, and that's what we will use.
For benchmarking I'll use prime.bf, finding prime numbers up to, let's say, 10.


> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m14.244s
user 0m14.010s
sys 0m0.092s
>

Profiling results are in Main.prof. Here's a snippet that shows our hot spots:

COST CENTRE MODULE %time %alloc

findMatchingBackwards Main 32.3 13.0
isEnd Main 30.9 0.0
step Main 21.3 33.8
findMatchingForward Main 14.9 5.6
setMem Main 0.4 19.5
run Main 0.1 19.7
nextPC Main 0.0 8.0

It seems that the easiest thing to get rid of is the isEnd function - each time it is called, it calculates the length of the program (a String is a list of Char, so length is O(n)). The program does not change over time, so we can simply cache the program length in BFState and compare pc against this cached value.
Changes are needed in the definition of BFState:


> data BFState = BFState {
>       program :: String,   -- program being interpreted
>       memory :: [Word16],  -- memory
>       pc :: Int,           -- current program counter
>       pos :: Int,          -- current pointer position
>
>       prog_len :: Int      -- cached program length
>   }


in the initState function:


> initState :: String -> Int -> BFState
> initState program memSize = BFState program (take memSize $ repeat 0) 0 0 (length program)


and in isEnd itself:


> isEnd :: BFState -> Bool
> isEnd st = (pc st) >= (prog_len st)


Result:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m9.387s
user 0m9.234s
sys 0m0.058s
>

So it worked as expected - a ca. 30% performance gain. The current performance bottlenecks:

COST CENTRE MODULE %time %alloc

findMatchingBackwards Main 47.5 11.4
step Main 26.8 34.3
findMatchingForward Main 24.4 4.9
setMem Main 0.9 17.0
run Main 0.4 25.1
nextPC Main 0.0 7.0

70% of the time is spent looking for the matching '[' and ']' in loops.
The situation is similar to isEnd - the program does not change over time, so it should not be a problem to somehow cache the findMatching results.

The simplest way is to create a table of the same length as the program, where the position corresponding to a '[' holds the index of the matching ']', and the position corresponding to a ']' holds the index of the matching '['. For example, the table for the program "++[>+<-]" looks like [0, 0, 7, 0, 0, 0, 0, 2] (values at positions of characters other than '[' and ']' do not matter, so I put 0 there).

A function generating such a table could look like this:


> loopEnds :: String -> Int -> [Int]
> loopEnds program pos =
>     if (pos >= length program) then []
>     else end:(loopEnds program (pos+1)) where
>       end = case (program !! pos) of
>         '[' -> findMatchingForward program pos
>         ']' -> findMatchingBackwards program pos
>         _   -> 0


Of course we also need changes in BFState:


> data BFState = BFState {
>       program :: String,   -- program being interpreted
>       memory :: [Word16],  -- memory
>       pc :: Int,           -- current program counter
>       pos :: Int,          -- current pointer position
>
>       prog_len :: Int,     -- cached program length
>       loop_ends :: [Int]   -- cached loop ends
>   }


initState:


> initState :: String -> Int -> BFState
> initState program memSize = BFState program (take memSize $ repeat 0) 0 0 (length program) (loopEnds program 0)


and in the implementation of '[' and ']' in step:


>   '[' -> return st { pc = pc' } where
>     pc' = if getMem st == 0 then ((loop_ends st) !! (pc st)) + 1
>           else nextPC st




>   ']' -> return st { pc = pc' } where
>     pc' = (loop_ends st) !! (pc st)


Results:

> ghc --make -prof -auto-all -O2 Main.lhs
[1 of 1] Compiling Main ( Main.lhs, Main.o )
Linking Main ...
> time (echo 10 | ./Main +RTS -p -RTS prime.bf)
Primes up to: 2 3 5 7

real 0m4.090s
user 0m4.003s
sys 0m0.032s
>

With current profile:

COST CENTRE MODULE %time %alloc

step Main 85.6 42.1
findMatchingForward Main 5.0 0.5
findMatchingBackwards Main 5.0 0.5
loopEnds Main 3.0 0.6
run Main 1.5 30.2
setMem Main 0.0 18.3
nextPC Main 0.0 7.4

Not bad - from 14s down to 4s. Still way slower than the C version, but getting better.

In the next part - some more optimizations, this time mostly to the data types used.