Optimizing Haskell Inner Loops

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:
iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
  where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d
The profiler tells me that this function takes a third of my implementation's runtime. I can think of no way to optimize it further, other than maybe inlining the temporary variables, but I believe -O2 will do that for me anyway.
Can anyone see a significant optimization that can be further applied?
FYI, the k and f functions are below. They are so simple I don't think there is a way to optimize them further. Unless the Data.Bits module is slow?
f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
  | t <= 19   = (b .&. c) .|. ((complement b) .&. d)
  | t <= 39   = b `xor` c `xor` d
  | t <= 59   = (b .&. c) .|. (b .&. d) .|. (c .&. d)
  | otherwise = b `xor` c `xor` d

k :: Int -> Word32
k t
  | t <= 19   = 0x5A827999
  | t <= 39   = 0x6ED9EBA1
  | t <= 59   = 0x8F1BBCDC
  | otherwise = 0xCA62C1D6

Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation).
You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.
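A minimal sketch of such a hand-rolled rotate (rotateL32 is my name, not from the question; it assumes 0 < n < 32 and that the argument already fits in 32 bits):

import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- Rotate the low 32 bits of a Word left by n, masking back to 32 bits.
-- The additions in a' would likewise need a (.&. 0xFFFFFFFF) before the
-- next rotate, as noted above.
rotateL32 :: Word -> Int -> Word
rotateL32 w n = ((w `shiftL` n) .|. (w `shiftR` (32 - n))) .&. 0xFFFFFFFF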
Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.
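A sketch of what one of the four specialised loops might look like (go2 and friends are my names, and the base case would really hand off to the next loop rather than stop):

import Data.Bits (rotate, xor)
import Data.Word (Word32)

-- Rounds 20-39: no comparison of t remains in the body, and the round
-- constant and the xor-of-three f are baked in.
go2 :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
go2 40 _      a b c d e = [a, b, c, d, e]   -- in the real code: hand off to go3 here
go2 t (w:ws) a b c d e = go2 (t+1) ws a' a (rotate b 30) c d
  where a' = rotate a 5 + (b `xor` c `xor` d) + e + w + 0x6ED9EBA1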
And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.
With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.
(GHC.Prim.narrow32Word#
   (GHC.Prim.plusWord#
      (GHC.Prim.narrow32Word#
         (GHC.Prim.plusWord#
            (GHC.Prim.narrow32Word#
               (GHC.Prim.plusWord#
                  (GHC.Prim.narrow32Word#
                     (GHC.Prim.plusWord#
                        (GHC.Prim.narrow32Word#
                           (GHC.Prim.or#
                              (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                              (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                        y#_aBw))
                  sc6_sEr))
            y#1_XCZ))
      y#2_XD6))
See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, there may be a bit to harvest by hand-coding the steps and using Word.
Then there are the comparisons of t with 19, ...: they appear twice, once to determine the k constant and once for the f transform. The comparisons alone are cheap, but they cause branches, and without them further inlining may be possible. I expect a bit could be gained here too.
And still, the list. That means w can't be unboxed, the core could be simpler if w were unboxable.
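For comparison, a sketch of the loop over an unboxed vector (using Data.Vector.Unboxed; it assumes the f and k from the question are in scope and that the vector holds all 80 schedule words):

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U
import Data.Bits (rotate)
import Data.Word (Word32)

-- Indexing an unboxed vector instead of deconstructing a list lets the
-- Word32 values stay unboxed inside the loop.
iterateBlockV :: U.Vector Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlockV ws = go 0
  where
    go !t !a !b !c !d !e
      | t == 80   = [a, b, c, d, e]
      | otherwise = go (t+1) a' a (rotate b 30) c d
      where
        w  = ws `U.unsafeIndex` t
        a' = rotate a 5 + f t b c d + e + w + k t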


Generate context free grammar for the following language

{a^i b^j c^k d^m | i+j = k+m, i < m}
The grammar should generate the letters in order (abbccd, not cbbcda): first the a's, then the b's, and so on.
I know that you must "count" the number of a's and b's you are adding to make sure there are an equivalent number of c's and d's. I just can't seem to figure out how to make sure there are more c's than a's in the language. I appreciate any help anyone can give. I've been working on this for many hours now.
Edit:
The grammar should be Context Free
I have only got these two currently because all others turned out to be very wrong:
S -> C A D
   | B
B -> C B D
   | ε
C -> a
   | b
D -> c
   | d
and
S -> a S d
   | A
A -> b A c
   | ε
(which is close but doesn't satisfy the i < k part)
EDIT: This is for when i < k, not i < m. OP changed the problem, but I figure this answer may still be useful.
This is not a context-free language, and that can be proven with the pumping lemma, which states that if the language is context-free, there exists an integer p > 0 such that every string in the language of length >= p can be split into substrings uvwxy, where len(vx) >= 1, len(vwx) <= p, and uv^n w x^n y is a member of the language for all n >= 0.
Suppose that a value of p exists. We can create a string such that:
k = i + 1
j = m + 1
j > p
k > p
v and x cannot contain more than one type of character, or be both on the left side or both on the right side, because then pumping them would immediately break the form of the language. They cannot consist of the same character as each other, because then pumping would break the rule that i + j = k + m. v cannot be a's if x is d's, because then w would contain the b's and c's, which makes len(vwx) > p. By the same reasoning, v cannot be a's if x is c's, and v cannot be b's if x is d's. The only remaining option is b's and c's, but setting n to 0 would make i >= k and j >= m, taking the string out of the language.
Therefore, the language is not context-free.
There has to be at least one d that is not matched by an a because i < m, so there has to be a b somewhere to offset it. T and V guarantee this criterion; S then just wraps matching a's and d's around them.
T ::= bd | bTd
U ::= bc | bUc
V ::= bUd | bVd
S ::= T | V | aSd
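As a sanity check (my own worked derivation, not part of the answer): the string abbcdd has i=1, j=2, k=1, m=2, so i+j = k+m = 3 and i < m, and it derives as S => aSd => aVd => abUdd => abbcdd.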

Optimizing partial computation in Haskell

I'm curious how to optimize this code:
fun n = (sum l, f $ f0 l, g $ g0 l)
  where l = map h [1..n]
Assuming that f, f0, g, g0, and h are all costly, but the creation and storage of l is extremely expensive.
As written, l is stored until the returned tuple is fully evaluated or garbage collected. Instead, sum l, f0 l, and g0 l should all be executed whenever any one of them is executed, but f and g should be delayed.
It appears this behavior could be fixed by writing:
fun n = a `seq` b `seq` c `seq` (a, f b, g c)
  where
    l = map h [1..n]
    a = sum l
    b = inline f0 $ l
    c = inline g0 $ l
Or the very similar:
fun n = (a,b,c) `deepseq` (a, f b, g c)
where ...
We could perhaps specify a bunch of internal types to achieve the same effects as well, which looks painful. Are there any other options?
Also, I'm obviously hoping with my inlines that the compiler fuses sum, f0, and g0 into a single loop that constructs and consumes l term by term. I could make this explicit through manual inlining, but that'd suck. Are there ways to explicitly prevent the list l from ever being created and/or compel inlining? Pragmas that produce warnings or errors if inlining or fusion fail during compilation perhaps?
As an aside, I'm curious why seq, inline, lazy, etc. are all defined to be let x = x in x in the Prelude. Is this simply to give them a definition for the compiler to override?
If you want to be sure, the only way is to do it yourself. For any given compiler version, you can try out several source-formulations and check the generated core/assembly/llvm byte-code/whatever whether it does what you want. But that could break with each new compiler version.
If you write
fun n = a `seq` b `seq` c `seq` (a, f b, g c)
  where
    l = map h [1..n]
    a = sum l
    b = inline f0 $ l
    c = inline g0 $ l
or the deepseq version thereof, the compiler might be able to merge the computations of a, b and c to be performed in parallel (not in the concurrency sense) during a single traversal of l, but for the time being, I'm rather convinced that GHC doesn't, and I'd be surprised if JHC or UHC did. And for that the structure of computing b and c needs to be simple enough.
The only way to obtain the desired result portably across compilers and compiler versions is to do it yourself. For the next few years, at least.
Depending on f0 and g0, it might be as simple as doing a strict left fold with an appropriate accumulator type and combining function, like the famous average:
import Data.List (foldl')

data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Double

average :: [Double] -> Double
average = ratio . foldl' count (P 0 0)
  where
    ratio (P n s)   = s / fromIntegral n
    count (P n s) x = P (n+1) (s+x)
but if the structure of f0 and/or g0 doesn't fit, say one's a left fold and the other a right fold, it may be impossible to do the computation in one traversal. In such cases, the choice is between recreating l and storing l. Storing l is easy to achieve with explicit sharing (where l = map h [1..n]), but recreating it may be difficult to achieve if the compiler does some common subexpression elimination (unfortunately, GHC does have a tendency to share lists of that form, even though it does little CSE). For GHC, the flags -fno-cse and -fno-full-laziness can help avoid unwanted sharing.
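To make the single-traversal idea concrete, here is a minimal sketch, assuming f0 and g0 can both be phrased as left folds (the concrete h, and max / sum-of-logs standing in for f0 and g0, are my inventions, not from the question):

import Data.List (foldl')

-- Strict three-field accumulator: running sum, f0 accumulator, g0 accumulator.
data Acc = Acc !Double !Double !Double

fun :: Int -> (Double, Double, Double)
fun n = finish (foldl' step (Acc 0 0 0) [1 .. fromIntegral n])
  where
    h i = i * i                              -- stand-in for the costly h
    step (Acc s m q) i =
      let x = h i
      in Acc (s + x) (max m x) (q + log x)   -- sum, f0 step, g0 step
    finish (Acc s m q) = (s, sqrt m, exp q)  -- sqrt and exp stand in for f and g;
                                             -- tuple fields are lazy, so they stay delayed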

Haskell Heap Issues with Parameter Passing Style

Here's a simple program that blows my heap to Kingdom Come:
intersect n k z s rs c
  | c == 23   = rs
  | x == y    = intersect (n+1) (k+1) (z+1) (z+s) (f : rs) (c+1)
  | x < y     = intersect (n+1) k (z+1) s rs c
  | otherwise = intersect n (k+1) z s rs c
  where x = (2*n*n) + 4 * n
        y = (k * k + k)
        f = (z, (x `div` 2), (z+s))

p = intersect 1 1 1 0 [] 0

main = do
  putStr (show p)
What the program does is calculate the intersection of two infinite series, stopping when it reaches 23 elements. But that's not important to me.
What's interesting is that as far as I can tell, there shouldn't be much here that is sitting on the heap. The function intersect is recursive, with all recursive calls in tail position. State is accumulated in the arguments, and there is not much of it: five integers and a small list of tuples.
If I were a betting person, I would bet that somehow thunks are being built up in the arguments as I do the recursion, particularly on arguments that aren't evaluated on a given recursion. But that's just a wild hunch.
What's the true problem here? And how does one fix it?
If you have a problem with the heap, run the heap profiler, like so:
$ ghc -O2 --make A.hs -prof -auto-all -rtsopts -fforce-recomp
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A.exe ...
Which when run:
$ ./A.exe +RTS -M1G -hy
Produces an A.hp output file:
$ hp2ps -c A.hp
(The resulting graph is not reproduced here; it shows the heap dominated by Integer.)
So your heap is full of Integer, which indicates some problem in the accumulating parameters of your functions -- that's where all the Integers are.
Modifying the function so that it is strict in the lazy Integer arguments (based on the fact you never inspect their value), like so:
{-# LANGUAGE BangPatterns #-}

intersect n k !z !s rs c
  | c == 23   = rs
  | x == y    = intersect (n+1) (k+1) (z+1) (z+s) (f : rs) (c+1)
  | x < y     = intersect (n+1) k (z+1) s rs c
  | otherwise = intersect n (k+1) z s rs c
  where x = (2*n*n) + 4 * n
        y = (k * k + k)
        f = (z, (x `div` 2), (z+s))

p = intersect 1 1 1 0 [] 0

main = do
  putStr (show p)
And your program now runs in constant space with the list of arguments you're producing (though doesn't terminate for c == 23 in any reasonable time).
If it is OK to get the resulting list reversed, you can take advantage of Haskell's laziness and return the list as it is computed, instead of passing it recursively as an accumulating argument. Not only does this let you consume and print the list as it is being computed (thereby eliminating one space leak right there), you can also factor out the decision about how many elements you want from intersect:
{-# LANGUAGE BangPatterns #-}

intersect n k !z s
  | x == y    = f : intersect (n+1) (k+1) (z+1) (z+s)
  | x < y     = intersect (n+1) k (z+1) s
  | otherwise = intersect n (k+1) z s
  where x = (2*n*n) + 4 * n
        y = (k * k + k)
        f = (z, (x `div` 2), (z+s))

p = intersect 1 1 1 0

main = do
  putStrLn (unlines (map show (take 23 p)))
As Don noted, we need to be careful so that accumulating arguments evaluate timely instead of building up big thunks. By making the argument z strict we ensure that all arguments will be demanded.
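For reference, the same strictness can be had without the BangPatterns extension by seq'ing z before the guards are inspected -- a Haskell-98 sketch of the function above, not from the original answer:

intersect n k z s = z `seq` go
  where
    go | x == y    = f : intersect (n+1) (k+1) (z+1) (z+s)
       | x < y     = intersect (n+1) k (z+1) s
       | otherwise = intersect n (k+1) z s
    x = (2*n*n) + 4 * n
    y = k * k + k
    f = (z, x `div` 2, z+s)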
By outputting one element per line, we can watch the result being produced:
$ ghc -O2 intersect.hs && ./intersect
[1 of 1] Compiling Main ( intersect.hs, intersect.o )
Linking intersect ...
(1,3,1)
(3,15,4)
(10,120,14)
(22,528,36)
(63,4095,99)
(133,17955,232)
(372,139128,604)
(780,609960,1384)
...

Finite difference in Haskell, or how to disable potential optimizations

I'd like to implement the following naive (first order) finite differencing function:
finite_difference :: Fractional a => a -> (a -> a) -> a -> a
finite_difference h f x = ((f $ x + h) - (f x)) / h
As you may know, there is a subtle problem: one has to make sure that (x + h) and x differ by an exactly representable number. Otherwise, the result has a huge error, amplified by the fact that (f $ x + h) - (f x) involves catastrophic cancellation (one also has to choose h carefully, but that is not my problem here).
In C or C++, the problem can be solved like this:
volatile double temp = x + h;
h = temp - x;
and the volatile modifier disables any optimization pertaining to the variable temp, so we are assured that a "clever" compiler will not optimize away those two lines.
I don't know enough Haskell yet to know how to solve this. I'm afraid that
let temp = x + h
    hh   = temp - x
in ((f $ x + hh) - (f x)) / h
will get optimized away by Haskell (or the backend it uses). How do I get the equivalent of volatile here (if possible without sacrificing laziness)? I don't mind GHC specific answers.
I have two solutions and a suggestion:
First solution: You can guarantee that this won't be optimized out with two helper functions and the NOINLINE pragma:
norm1 x h = x+h
{-# NOINLINE norm1 #-}
norm2 x tmp = tmp-x
{-# NOINLINE norm2 #-}
normh x h = norm2 x (norm1 x h)
This will work, but will introduce a small cost.
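For instance, the differencing function would then read (a sketch; note that it divides by the adjusted step hh, which is the usual form of this trick, rather than by the original h):

finite_difference :: Double -> (Double -> Double) -> Double -> Double
finite_difference h f x = (f (x + hh) - f x) / hh
  where hh = normh x h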
Second solution: Write the normalization function in C using volatile and call it through the FFI. The performance penalty will be minimal.
Now for the suggestion: Currently the math isn't optimized out, so it will work properly at present. You're afraid it will break in a future compiler. I think this is unlikely, but not so unlikely that I wouldn't want to guard against it also. So write some unit tests that cover the cases in question. Then if it does break in the future (for any reason), you'll know exactly why.
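A minimal sketch of such a guard test (the concrete values are my choice: 1.0e-17 is below half an ulp of 1.0, so x + h must round back to exactly 1.0 in Double; if a future compiler rewrites (x + h) - x to h, the test fails loudly):

main :: IO ()
main = do
  let x    = 1.0 :: Double
      h    = 1.0e-17
      temp = x + h        -- rounds to exactly 1.0
      hh   = temp - x     -- 0.0 if the arithmetic is really performed
  if hh == 0.0
    then putStrLn "ok: rounding happened, nothing was optimised away"
    else error "(x + h) - x was simplified to h"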
One way is to look at the Core.
Specializing to Doubles (which will be the case most likely to trigger some optimization):
finite_difference :: Double -> (Double -> Double) -> Double -> Double
finite_difference h f x = ((f $ x + hh) - (f x)) / h
  where
    temp = x + h
    hh   = temp - x
Is compiled to:
A.$wfinite_difference h f x =
  case f (case x of
            D# x' -> D# (+## x' (-## (+## x' h) x'))) of
    D# x'' -> case f x of D# y -> /## (-## x'' y) h
And similarly (with even less rewriting) for the polymorphic version.
So while the variables are inlined, the math isn't optimized away.
Beyond looking at the Core, I can't think of a way to guarantee the property you want.
I don't think that
temp = unsafePerformIO $ return $ x + h
would get optimised out. Just a guess.

Optimization of Haskell code

I wrote the following Haskell code, which takes a triplet (x,y,z) and a list of triplets [(Int,Int,Int)], and looks for a triplet (a,b,c) in the list such that x == a and y == b. If there is such a triplet, I just need to update c = c + z; if there is no such triplet in the list, I just add the triplet to the list.
-- insertEdge :: (Int,Int,Int) -> [(Int, Int, Int)] -> [(Int, Int, Int)]
insertEdge (x,y,z) cs =
  if length [(a,b,c) | (a,b,c) <- cs, a /= x || b /= y] == length cs
    then (x,y,z) : cs
    else [if a == x && b == y then (a,b,c+z) else (a,b,c) | (a,b,c) <- cs]
After profiling my code, it appears that this function takes 65% of the execution time.
How can I re-write my code to be more efficient?
Other answers are correct, so I want to offer some unasked-for advice instead: how about using Data.Map (Int,Int) Int instead of a list?
Then your function becomes insertWith (+) (a,b) c mymap
The first thing that jumps out at me is the conditional: length examines the entire list, so in the worst-case scenario (updating the last element) your function traverses the list three times: Once for the length of the filtered list, once for the length of cs, and once to find the element to update.
However, even getting rid of the extra traversals, the best you can do with the function as written will usually require a traversal of most of the list. From the name of the function and how much time was being spent in it, I'm guessing you're calling this repeatedly to build up a data structure? If so, you should strongly consider using a more efficient representation.
For instance, a quick and easy improvement would be to use Data.Map, the first two elements of the triplet in a 2-tuple as the key, and the third element as the value. That way you can avoid making so many linear-time lookups/redundant traversals.
As a rule of thumb, lists in Haskell are only an appropriate data structure when all you do is either walk sequentially down the list a few times (ideally, just once) or add/remove from the head of the list (i.e., using it like a stack). If you're searching, filtering, updating elements in the middle, or--worst of all--indexing by position, using lists will only end in tears.
Here's a quick example, if that helps:
import qualified Data.Map as M

incEdge :: M.Map (Int, Int) Int -> ((Int, Int), Int) -> M.Map (Int, Int) Int
incEdge cs (k,v) = M.alter f k cs
  where f (Just n) = Just $ n + v
        f Nothing  = Just v
The alter function is just insert/update/delete all rolled into one. This inserts the key into the map if it's not there, and sums the values if the key does exist. To build up a structure incrementally, you can do something like foldl incEdge M.empty edgeList. Testing this out, for a few thousand random edges your version with a list takes several seconds, whereas the Data.Map version is pretty much immediate.
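For instance (a toy usage sketch of the incEdge above; example is my name), folding three edges where one key repeats:

-- evaluates to fromList [((1,2),7),((2,3),1)]
example :: M.Map (Int, Int) Int
example = foldl incEdge M.empty [((1,2),3), ((1,2),4), ((2,3),1)]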
It's always a good idea to benchmark (and Criterion makes it so easy). Here are the results for the original solution (insertEdgeO), Geoff's foldr (insertEdgeF), and Data.Map (insertEdgeM):
benchmarking insertEdgeO...
mean: 380.5062 ms, lb 379.5357 ms, ub 381.1074 ms, ci 0.950
benchmarking insertEdgeF...
mean: 74.54564 ms, lb 74.40043 ms, ub 74.71190 ms, ci 0.950
benchmarking insertEdgeM...
mean: 18.12264 ms, lb 18.03029 ms, ub 18.21342 ms, ci 0.950
Here's the code (I compiled with -O2):
module Main where

import Criterion.Main
import Data.List (foldl')
import qualified Data.Map as M

insertEdgeO :: (Int, Int, Int) -> [(Int, Int, Int)] -> [(Int, Int, Int)]
insertEdgeO (x, y, z) cs =
  if length [(a, b, c) | (a, b, c) <- cs, a /= x || b /= y] == length cs
    then (x, y, z) : cs
    else [if (a == x && b == y) then (a, b, c + z) else (a, b, c) | (a, b, c) <- cs]

insertEdgeF :: (Int, Int, Int) -> [(Int, Int, Int)] -> [(Int, Int, Int)]
insertEdgeF (x,y,z) cs =
  case foldr f (False, []) cs of
    (False, cs') -> (x, y, z) : cs'
    (True, cs')  -> cs'
  where
    f (a, b, c) (e, cs')
      | (a, b) == (x, y) = (True, (a, b, c + z) : cs')
      | otherwise        = (e, (a, b, c) : cs')

insertEdgeM :: (Int, Int, Int) -> M.Map (Int, Int) Int -> M.Map (Int, Int) Int
insertEdgeM (a, b, c) = M.insertWith (+) (a, b) c

testSet n = [(a, b, c) | a <- [1..n], b <- [1..n], c <- [1..n]]

testO = foldl' (flip insertEdgeO) [] . testSet
testF = foldl' (flip insertEdgeF) [] . testSet
testM = triplify . M.toDescList . foldl' (flip insertEdgeM) M.empty . testSet
  where
    triplify = map (\((a, b), c) -> (a, b, c))

main = let n = 25 in defaultMain
  [ bench "insertEdgeO" $ nf testO n
  , bench "insertEdgeF" $ nf testF n
  , bench "insertEdgeM" $ nf testM n
  ]
You can improve insertEdgeF a bit by using foldl' (55.88634 ms), but Data.Map still wins.
The main reason your function is slow is that it traverses the list at least twice, maybe three times. The function can be rewritten to traverse the list only once using a fold. This transforms the list into a tuple (Bool, [(Int,Int,Int)]), where the Bool indicates whether there was a matching element in the list and the list is the transformed list.
insertEdge (x,y,z) cs = case foldr f (False,[]) cs of
    (False, cs') -> (x,y,z) : cs'
    (True, cs')  -> cs'
  where f (a,b,c) (e,cs') = if (a,b) == (x,y)
                              then (True, (a,b,c+z) : cs')
                              else (e, (a,b,c) : cs')
If you haven't seen foldr before, it has type
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr embodies a pattern of recursive list processing of defining a base case and combining the current list element with the result from the rest of the list. Writing foldr f b xs is the same as writing a function g with definition
g [] = b
g (x:xs) = f x (g xs)
Sticking with your data structure, you might write:
type Edge = (Int,Int,Int)

insertEdge :: Edge -> [Edge] -> [Edge]
insertEdge t@(x,y,z) es =
  case break (abx t) es of
    (_, [])             -> t : es
    (l, (_,_,zold) : r) -> l ++ (x,y,z+zold) : r
  where abx (a1,b1,_) (a2,b2,_) = a1 == a2 && b1 == b2
No matter what language you're using, searching lists is always a red flag. When searching you want sublinear complexity (think: hashes, binary search trees, and so on). In Haskell, an implementation using Data.Map is
import Data.Map

type Edge = (Int,Int,Int)
type EdgeMap = Map (Int,Int) Int

insertEdge :: Edge -> EdgeMap -> EdgeMap
insertEdge (x,y,z) es = alter accumz (x,y) es
  where accumz Nothing     = Just z
        accumz (Just zold) = Just (z + zold)
You may not be familiar with alter:
alter :: Ord k => (Maybe a -> Maybe a) -> k -> Map k a -> Map k a
O(log n). The expression (alter f k map) alters the value x at k, or absence thereof. alter can be used to insert, delete, or update a value in a Map. In short: lookup k (alter f k m) = f (lookup k m).
let f _ = Nothing
alter f 7 (fromList [(5,"a"), (3,"b")]) == fromList [(3, "b"), (5, "a")]
alter f 5 (fromList [(5,"a"), (3,"b")]) == singleton 3 "b"
let f _ = Just "c"
alter f 7 (fromList [(5,"a"), (3,"b")]) == fromList [(3, "b"), (5, "a"), (7, "c")]
alter f 5 (fromList [(5,"a"), (3,"b")]) == fromList [(3, "b"), (5, "c")]
But as ADEpt shows in another answer, this is a bit of overengineering.
In
insertEdgeM :: (Int, Int, Int) -> M.Map (Int, Int) Int -> M.Map (Int, Int) Int
insertEdgeM (a, b, c) = M.insertWith (+) (a, b) c
you want to use the strict version of insertWith, namely insertWith'.
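That is, a one-line change (a sketch; insertWith' is the strict variant shipped with the containers library of that era, with the same argument order as insertWith):

insertEdgeM :: (Int, Int, Int) -> M.Map (Int, Int) Int -> M.Map (Int, Int) Int
insertEdgeM (a, b, c) = M.insertWith' (+) (a, b) c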
Very small optimisation: use an as-pattern to avoid reconstructing the same tuple. Like this:
insertEdge xyz@(x,y,z) cs =
  if length [abc | abc@(a,b,c) <- cs, a /= x || b /= y] == length cs
    then xyz : cs
    else [if a == x && b == y then (a,b,c+z) else abc' | abc'@(a,b,c) <- cs]
You should apply the other optimization hints first, but this may save a very small amount of time, since the tuple doesn't have to be reconstructed again and again -- at least in the last as-pattern (the first two patterns are not important, since the tuple never gets evaluated in the first case, and the as-pattern is only matched once in the second case).