Optimizing partial computation in Haskell - optimization

I'm curious how to optimize this code:
fun n = (sum l, f $ f0 l, g $ g0 l)
  where l = map h [1..n]
Assume that f, f0, g, g0, and h are all costly, but that the creation and storage of l is extremely expensive.
As written, l is stored until the returned tuple is fully evaluated or garbage collected. Instead, sum l, f0 l, and g0 l should all be executed whenever any one of them is executed, but f and g should be delayed.
It appears this behavior could be fixed by writing:
fun n = a `seq` b `seq` c `seq` (a, f b, g c)
  where
    l = map h [1..n]
    a = sum l
    b = inline f0 $ l
    c = inline g0 $ l
Or the very similar:
fun n = (a,b,c) `deepSeq` (a, f b, g c)
  where ...
We could perhaps specify a bunch of internal types to achieve the same effects as well, which looks painful. Are there any other options?
Also, I'm obviously hoping with my inlines that the compiler fuses sum, f0, and g0 into a single loop that constructs and consumes l term by term. I could make this explicit through manual inlining, but that'd suck. Are there ways to explicitly prevent the list l from ever being created and/or compel inlining? Pragmas that produce warnings or errors if inlining or fusion fail during compilation perhaps?
As an aside, I'm curious about why seq, inline, lazy, etc. are all defined to be let x = x in x in the Prelude. Is this simply to give them a definition for the compiler to override?

If you want to be sure, the only way is to do it yourself. For any given compiler version, you can try out several source-formulations and check the generated core/assembly/llvm byte-code/whatever whether it does what you want. But that could break with each new compiler version.
If you write
fun n = a `seq` b `seq` c `seq` (a, f b, g c)
  where
    l = map h [1..n]
    a = sum l
    b = inline f0 $ l
    c = inline g0 $ l
or the deepseq version thereof, the compiler might be able to merge the computations of a, b and c to be performed in parallel (not in the concurrency sense) during a single traversal of l, but for the time being, I'm rather convinced that GHC doesn't, and I'd be surprised if JHC or UHC did. And for that the structure of computing b and c needs to be simple enough.
The only way to obtain the desired result portably across compilers and compiler versions is to do it yourself. For the next few years, at least.
Depending on f0 and g0, it might be as simple as doing a strict left fold with appropriate accumulator type and combining function, like the famous average
import Data.List (foldl')

-- Strict pair of a running count and a running sum.
data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Double

average :: [Double] -> Double
average = ratio . foldl' count (P 0 0)
  where
    ratio (P n s) = s / fromIntegral n
    count (P n s) x = P (n+1) (s+x)
but if the structure of f0 and/or g0 doesn't fit, say one's a left fold and the other a right fold, it may be impossible to do the computation in one traversal. In such cases, the choice is between recreating l and storing l. Storing l is easy to achieve with explicit sharing (where l = map h [1..n]), but recreating it may be difficult to achieve if the compiler does some common subexpression elimination (unfortunately, GHC does have a tendency to share lists of that form, even though it does little CSE). For GHC, the flags -fno-cse and -fno-full-laziness can help avoid unwanted sharing.
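As a rough sketch of the "do it yourself" route for the original fun: assume, purely for illustration, that f0 is maximum and g0 is length, and that h, f and g are the cheap stand-ins defined below (none of these choices come from the question). All three summaries then fit into one strict left fold, so l is consumed cell by cell and never has to be held in memory as a whole:

```haskell
import Data.List (foldl')

-- Strict accumulator: running sum, running maximum, element count.
data Acc = Acc !Int !Int !Int

-- Hypothetical stand-ins for the costly functions of the question.
h, f, g :: Int -> Int
h x = (x * x) `mod` 7
f = (* 2)
g = (+ 1)

-- One traversal of map h [1..n] computing sum, maximum and length together.
summarize :: Int -> (Int, Int, Int)
summarize n = finish (foldl' step (Acc 0 minBound 0) (map h [1 .. n]))
  where
    step (Acc s m c) x = Acc (s + x) (max m x) (c + 1)
    finish (Acc s m c) = (s, m, c)

-- The fold runs once when the result is demanded; f and g stay lazy
-- inside the returned tuple.
fun :: Int -> (Int, Int, Int)
fun n = case summarize n of
  (a, b, c) -> (a, f b, g c)
```

Whether GHC additionally fuses map h [1..n] into the fold is still up to the optimizer, so checking the Core remains the only way to be sure.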


Posing a quadratic optimization problem for CVXOPT correctly

I am trying to minimize the function || Cx - d ||_2^2 with constraints Ax <= b. Some information about their sizes is as such:
* C is a (138, 22) matrix
* d is a (138,) vector
* A is a (138, 22) matrix
* b is a (138, ) vector of zeros
So I have 138 equations and 22 free variables that I'd like to optimize. I am currently coding this in Python and am using the transpose C.T*C to form a square matrix. The entire code looks like this:
C = matrix(np.matmul(w, b).astype('double'))
b = matrix(np.matmul(w, np.log(dwi)).astype('double').reshape(-1))
P = C.T * C
q = -C.T * b
G = matrix(-constraints)
h = matrix(np.zeros(G.size[0]))
dt = np.array(solvers.qp(P, q, G, h, dims)['x']).reshape(-1)
where np.matmul(w, b) is C and np.matmul(w, np.log(dwi)) is d. Variables P and q are C and b multiplied by the transpose C.T to form a square multiplier matrix and constant vector, respectively. This works perfectly and I can find a solution.
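For reference, a sketch of why multiplying by C.T gives the same minimizer, assuming solvers.qp minimizes (1/2) x^T P x + q^T x subject to G x <= h, as the CVXOPT documentation states. Expanding the least-squares objective:

```latex
\|Cx - d\|_2^2 = x^\top C^\top C\, x - 2\, d^\top C\, x + d^\top d .
% With P = C^\top C and q = -C^\top d the QP objective is
\tfrac{1}{2}\, x^\top P x + q^\top x
  = \tfrac{1}{2}\, x^\top C^\top C\, x - d^\top C\, x
  = \tfrac{1}{2}\left( \|Cx - d\|_2^2 - d^\top d \right).
```

The two objectives differ only by the positive factor 1/2 and the constant d^T d, so they share the same minimizer over the feasible set. Here P is 22 x 22 and q has 22 entries, matching the number of free variables.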
I'd like to know whether my approach makes mathematical sense. From my limited knowledge of linear algebra I know that a square matrix produces a unique solution, but is there a way to run the same thing to produce an overdetermined solution? I tried this, but solvers.qp said the input Q needs to be a square matrix.
We can also pass in a dims argument to solvers.qp, which I tried, but received the error:
use of function valued P, G, A requires a user-provided kktsolver.
How do I correctly set up dims?
Thanks a lot for any help. I'll try to clarify any questions as best as I can.

Constructing a linear grammar for the language

I have difficulty constructing a grammar for a language, especially a linear grammar.
Can anyone please give me some basic tips or a methodology for constructing the grammar for any language? Thanks in advance.
I have a doubt about whether the answer to this question, "Construct a linear grammar for the language", is right:
L ={a^n b c^n | n belongs to Natural numbers}
Solution:
Right-Linear Grammar :
S--> aS | bA
A--> cA | ^
Left-Linear Grammar:
S--> Sc | Ab
A--> Aa | ^
As pointed out in the comments, these grammars are wrong since they generate strings not in the language. Here's a derivation of abcc in both grammars:
S -> aS -> abA -> abcA -> abccA -> abcc
S -> Sc -> Scc -> Abcc -> Aabcc -> abcc
Also as pointed out in the comments, there is a simple linear grammar for this language, where a linear grammar is defined as having at most one nonterminal symbol in the RHS of any production:
S -> aSc | b
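To make the difference concrete, here is a small Haskell sketch (the function names are mine, not from the question). inL is a membership test read directly off S -> aSc | b, and acceptsRL simulates the proposed right-linear grammar, with ^ read as the empty string:

```haskell
-- Membership in L = { a^n b c^n | n >= 0 }, following S -> aSc | b:
-- either the string is "b", or it is 'a' ++ middle ++ "c" with middle in L.
inL :: String -> Bool
inL "b" = True
inL ('a' : rest@(_ : _))
  | last rest == 'c' = inL (init rest)
inL _ = False

-- Nonterminals of the proposed right-linear grammar
--   S -> aS | bA,  A -> cA | (empty)
data NT = S | A

-- A right-linear grammar can be run like a finite automaton.
acceptsRL :: String -> Bool
acceptsRL = go S
  where
    go S ('a' : xs) = go S xs
    go S ('b' : xs) = go A xs
    go A ('c' : xs) = go A xs
    go A []         = True
    go _ _          = False
```

With these definitions, acceptsRL "abcc" is True while inL "abcc" is False, which is the counterexample from the derivation above.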
There are some general rules for constructing grammars for languages. These are either obvious simple rules or rules derived from closure properties and the way grammars work. For instance:
1. If L = {a} for an alphabet symbol a, then S -> a is a grammar for L.
2. If L = {e} for the empty string e, then S -> e is a grammar for L.
3. If L = R U T for languages R and T, then S -> S' | S'' along with the grammars for R and T are a grammar for L if S' is the start symbol of the grammar for R and S'' is the start symbol of the grammar for T.
4. If L = RT for languages R and T, then S -> S'S'' along with the grammars for R and T are a grammar for L if S' is the start symbol of the grammar for R and S'' is the start symbol of the grammar for T.
5. If L = R* for language R, then S -> S'S | e along with the grammar for R is a grammar for L if S' is the start symbol of the grammar for R.
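The union, concatenation and star rules can be written down as data and checked for linearity mechanically. This is only a sketch; the Sym and Production types are my own encoding, not a standard library:

```haskell
-- A grammar symbol is either a terminal character or a named nonterminal.
data Sym = T Char | N String
  deriving (Eq, Show)

-- A production: left-hand nonterminal and right-hand side
-- (the empty list stands for the empty string e).
type Production = (String, [Sym])

-- Linear: at most one nonterminal on every right-hand side.
isLinear :: [Production] -> Bool
isLinear = all (\(_, rhs) -> length [n | N n <- rhs] <= 1)

-- The new start productions introduced by each closure rule.
unionRule, concatRule, starRule :: [Production]
unionRule  = [("S", [N "S'"]), ("S", [N "S''"])]
concatRule = [("S", [N "S'", N "S''"])]
starRule   = [("S", [N "S'", N "S"]), ("S", [])]
```

Here isLinear unionRule is True, while isLinear concatRule and isLinear starRule are both False, because each puts two nonterminals on one right-hand side.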
Rules 4 and 5, as written, do not preserve linearity. Linearity can be preserved for left-linear and right-linear grammars (since those grammars describe regular languages, and regular languages are closed under these kinds of operations); but linearity cannot be preserved in general. To prove this, an example suffices:
R -> aRb | ab
T -> cTd | cd
L = RT = a^n b^n c^m d^m, 0 < n, m
L' = R* = (a^n b^n)*, 0 < n
Suppose there were a linear grammar for L. We must have a production for the start symbol S that produces something. To produce something, we require a string of terminal and nonterminal symbols. To be linear, we must have at most one nonterminal symbol. That is, our production must be of the form
S -> xYz
where x is a string of terminals, Y is a single nonterminal, and z is a string of terminals. If x is non-empty, reflection shows the only useful choice is a; anything else fails to derive known strings in the language. Similarly, if z is non-empty, the only useful choice is d. This gives four cases:
1. x empty, z empty. This is useless, since we now have the same problem to solve for nonterminal Y as we had for S.
2. x = a, z empty. Y must now generate exactly a^n' b^n' b c^m d^m where n' = n - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
3. x empty, z = d. Y must now generate exactly a^n b^n c c^m' d^m' where m' = m - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
4. x = a, z = d. Y must now generate exactly a^n' b^n' bc c^m' d^m' where n' and m' are as in 2 and 3. But then the exact same argument applies to the grammar whose start symbol is Y.
None of the possible choices for a useful production for S is actually useful in getting us closer to a string in the language. Therefore, no strings are derived, a contradiction, meaning that the grammar for L cannot be linear.
Suppose there were a linear grammar for L'. Then that grammar has to generate all the strings in (a^n b^n)R(a^m b^m), plus those in e + R. But it can't generate the ones in the former by the argument used above: any production useful for that purpose would get us no closer to a string in the language.

Example of programming language that is left associative?

I was teaching a friend about programming and I had a hard time convincing them that a = b and b = a are two very different things.
I eventually found the correct words to describe it (right associative) which got me thinking.
Are there any programming languages which are left associative? I have never seen a language where:
a = b results in b being set to the value of a.
You misunderstood associativity. An operator op is associative if (a op b) op c is equivalent to a op (b op c). For operators that are not associative it thus becomes relevant whether a op b op c stands for the former or the latter. Thus we distinguish between left-associative operators, where it's (a op b) op c, and right-associative operators, where it's a op (b op c).
Most operators in most languages are left-associative. Take for example -: a - b - c is equivalent to (a - b) - c in most languages, not a - (b - c).
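Haskell lets you declare the associativity of an operator yourself, which makes the difference easy to see. In this sketch subL and subR are made-up names for the same subtraction function, declared left- and right-associative respectively:

```haskell
-- Two names for ordinary subtraction on Int.
subL, subR :: Int -> Int -> Int
subL = (-)
subR = (-)

-- Same precedence, opposite associativity.
infixl 6 `subL`
infixr 6 `subR`

leftGrouping, rightGrouping :: Int
leftGrouping  = 10 `subL` 3 `subL` 2   -- parsed as (10 - 3) - 2 = 5
rightGrouping = 10 `subR` 3 `subR` 2   -- parsed as 10 - (3 - 2) = 9
```

The function is identical in both cases; only the grouping of the unparenthesized expression changes.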
The assignment operator is an exception to that, as (a = b) = c is generally not legal (as you can't assign to the result of an assignment). Thus in most languages a = b = c is equivalent to a = (b = c). A notable exception is Python, where assignment is a statement rather than an expression: a = b = c is a chained assignment that assigns c to both a and b, so it does not associate at all, and a = (b = c) is simply illegal.
None of this has anything to do with the difference between a = b and b = a. Since this involves only a single use of the = operator, associativity does not factor into this at all. Rather the relevant property is commutativity: An operator op is commutative if a op b is equivalent to b op a. I'm not aware of any language where assignment is commutative, nor do I have an idea of how it could be.
Concepts like left-commutativity or right-commutativity do not exist. There is, to the best of my knowledge, no term for the question "Does a = b assign b to a or vice-versa?" - that's just part of the semantics of the assignment operator.
I don't believe so. Though you can always overload the assignment operator and cause complete confusion inside your C++.
R has both <- and -> assignment operators defined.
> b <- 42
> b -> a
> a
[1] 42
If we talk about operator associativity, we should actually consider two types of associativity: side_associative (e.g. =) or non-associative (e.g. ==), as it doesn't make any difference to a machine whether it gets the line from left to right or from right to left.
There are no such programming languages that are originally left associative but some of them allow it via operator overloading and some (e.g. R) will allow both: -> and <-.
I believe no one will like it in Europe, but it might be found lovely in the Middle East. I can imagine that there are IDEs that switch the right side with the left side, but in the end compilers are written in a European way.

Optimizing Haskell Inner Loops

Still working on my SHA1 implementation in Haskell. I've now got a working implementation and this is the inner loop:
iterateBlock' :: Int -> [Word32] -> Word32 -> Word32 -> Word32 -> Word32 -> Word32 -> [Word32]
iterateBlock' 80 ws a b c d e = [a, b, c, d, e]
iterateBlock' t (w:ws) a b c d e = iterateBlock' (t+1) ws a' b' c' d' e'
  where
    a' = rotate a 5 + f t b c d + e + w + k t
    b' = a
    c' = rotate b 30
    d' = c
    e' = d
The profiler tells me that this function takes 1/3 of the runtime of my implementation. I can think of no way to further optimize it other than maybe inlining the temp variables but I believe -O2 will do that for me anyway.
Can anyone see a significant optimization that can be further applied?
FYI the k and f calls are below. They are so simple I don't think there is a way to optimize them further. Unless the Data.Bits module is slow?
f :: Int -> Word32 -> Word32 -> Word32 -> Word32
f t b c d
  | t <= 19 = (b .&. c) .|. ((complement b) .&. d)
  | t <= 39 = b `xor` c `xor` d
  | t <= 59 = (b .&. c) .|. (b .&. d) .|. (c .&. d)
  | otherwise = b `xor` c `xor` d
k :: Int -> Word32
k t
  | t <= 19 = 0x5A827999
  | t <= 39 = 0x6ED9EBA1
  | t <= 59 = 0x8F1BBCDC
  | otherwise = 0xCA62C1D6
Looking at the core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed, to perform the work, and then reboxed for the next iteration. Unboxing and re-boxing can cost a surprisingly large amount of time (and allocation).
You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, but would have to implement it yourself (not hard) to have it work also on 64-bit systems. For a' you would have to manually mask out the high bits.
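A hand-written rotate on Word might look like the following sketch. The names mask32, rotl32 and add32 are mine, and the code assumes the inputs already fit in 32 bits and that 0 < n < 32:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- Keep only the low 32 bits of a machine word.
mask32 :: Word -> Word
mask32 x = x .&. 0xFFFFFFFF

-- Rotate the low 32 bits of x left by n (assumes 0 < n < 32, x < 2^32).
rotl32 :: Word -> Int -> Word
rotl32 x n = mask32 ((x `shiftL` n) .|. (x `shiftR` (32 - n)))

-- 32-bit addition carried out on Word: add, then mask once.
add32 :: Word -> Word -> Word
add32 x y = mask32 (x + y)
```

For a chain of additions such as the one computing a', it should be enough to add everything on Word and apply mask32 once at the end, since the mask discards whatever overflowed past bit 31.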
Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so that the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4, and four functions f1, ..., f4 (without the t parameter) to avoid branches and have smaller code-size for each loop.
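One possible shape for the split, as a sketch rather than a measured improvement: a single helper stage runs a fixed number of rounds with a given round function and constant, and the block function chains four calls. The St type and the names stage and iterateBlockSplit are mine. The INLINE pragma is there in the hope that GHC specializes each call to its f and k, which would have to be confirmed in the Core:

```haskell
import Data.Bits (complement, rotate, xor, (.&.), (.|.))
import Data.Word (Word32)

-- The five working variables, kept in strict fields.
data St = St !Word32 !Word32 !Word32 !Word32 !Word32
  deriving (Eq, Show)

-- Round functions without the t parameter.
f1, f2, f3, f4 :: Word32 -> Word32 -> Word32 -> Word32
f1 b c d = (b .&. c) .|. (complement b .&. d)
f2 b c d = b `xor` c `xor` d
f3 b c d = (b .&. c) .|. (b .&. d) .|. (c .&. d)
f4 = f2

-- Run n rounds with a fixed round function f and constant k, returning
-- the unconsumed schedule words together with the new state.
stage :: (Word32 -> Word32 -> Word32 -> Word32)
      -> Word32 -> Int -> [Word32] -> St -> ([Word32], St)
stage f k = go
  where
    go :: Int -> [Word32] -> St -> ([Word32], St)
    go n (w:ws) (St a b c d e)
      | n > 0 = go (n - 1) ws
                   (St (rotate a 5 + f b c d + e + w + k) a (rotate b 30) c d)
    go _ ws st = (ws, st)  -- rounds done, or schedule exhausted
{-# INLINE stage #-}

-- Rounds 0-19, 20-39, 40-59 and 60-79, each with its own f and k.
iterateBlockSplit :: [Word32] -> St -> St
iterateBlockSplit ws0 st0 = st4
  where
    (ws1, st1) = stage f1 0x5A827999 20 ws0 st0
    (ws2, st2) = stage f2 0x6ED9EBA1 20 ws1 st1
    (ws3, st3) = stage f3 0x8F1BBCDC 20 ws2 st2
    (_,   st4) = stage f4 0xCA62C1D6 20 ws3 st3
```

This keeps the list-based message schedule of the question, so the remarks about lists still apply.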
And, as Thomas said, using a list for the block data isn't optimal, an unboxed Word array/vector would probably help too.
With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.
(GHC.Prim.narrow32Word#
  (GHC.Prim.plusWord#
    (GHC.Prim.narrow32Word#
      (GHC.Prim.plusWord#
        (GHC.Prim.narrow32Word#
          (GHC.Prim.plusWord#
            (GHC.Prim.narrow32Word#
              (GHC.Prim.plusWord#
                (GHC.Prim.narrow32Word#
                  (GHC.Prim.or#
                    (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                    (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                y#_aBw))
            sc6_sEr))
        y#1_XCZ))
    y#2_XD6))
See all these narrow32Word#? They're cheap, but not free. Only the outermost is needed, so there may be a bit to harvest by hand-coding the steps and using Word.
Then there are the comparisons of t with 19, 39 and 59: they appear twice, once to determine the k constant, and once for the f transform. The comparisons alone are cheap, but they cause branches, and without them further inlining may be possible. I expect a bit could be gained here too.
And still, the list. That means w can't be unboxed; the core could be simpler if w were unboxable.

Finite difference in Haskell, or how to disable potential optimizations

I'd like to implement the following naive (first order) finite differencing function:
finite_difference :: Fractional a => a -> (a -> a) -> a -> a
finite_difference h f x = ((f $ x + h) - (f x)) / h
As you may know, there is a subtle problem: one has to make sure that (x + h) and x differ by an exactly representable number. Otherwise, the result has a huge error, leveraged by the fact that (f $ x + h) - (f x) involves catastrophic cancellation (and one has to carefully choose h, but that is not my problem here).
In C or C++, the problem can be solved like this:
volatile double temp = x + h;
h = temp - x;
and the volatile modifier disables any optimization pertaining to the variable temp, so we are assured that a "clever" compiler will not optimize away those two lines.
I don't know enough Haskell yet to know how to solve this. I'm afraid that
let temp = x + h
    hh = temp - x
in ((f $ x + hh) - (f x)) / h
will get optimized away by Haskell (or the backend it uses). How do I get the equivalent of volatile here (if possible without sacrificing laziness)? I don't mind GHC specific answers.
I have two solutions and a suggestion:
First solution: You can guarantee that this won't be optimized out with two helper functions and the NOINLINE pragma:
norm1 x h = x+h
{-# NOINLINE norm1 #-}
norm2 x tmp = tmp-x
{-# NOINLINE norm2 #-}
normh x h = norm2 x (norm1 x h)
This will work, but will introduce a small cost.
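Put together with the differencing function, specialized to Double, this could look like the sketch below. Dividing by the representable step hh rather than by h is my assumption here; it follows the C idiom from the question, where h itself is overwritten:

```haskell
-- x + h, kept out of line so the two steps are not simplified together.
norm1 :: Double -> Double -> Double
norm1 x h = x + h
{-# NOINLINE norm1 #-}

-- tmp - x, likewise kept out of line.
norm2 :: Double -> Double -> Double
norm2 x tmp = tmp - x
{-# NOINLINE norm2 #-}

-- The step actually taken: (x + h) - x as computed in floating point.
normh :: Double -> Double -> Double
normh x h = norm2 x (norm1 x h)

-- Forward difference using the representable step hh in both places.
finiteDifference :: Double -> (Double -> Double) -> Double -> Double
finiteDifference h f x = (f (x + hh) - f x) / hh
  where
    hh = normh x h
```

For example, finiteDifference 1e-8 (\y -> y * y) 1 should come out close to 2.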
Second solution: Write the normalization function in C using volatile and call it through the FFI. The performance penalty will be minimal.
Now for the suggestion: Currently the math isn't optimized out, so it will work properly at present. You're afraid it will break in a future compiler. I think this is unlikely, but not so unlikely that I wouldn't want to guard against it also. So write some unit tests that cover the cases in question. Then if it does break in the future (for any reason), you'll know exactly why.
One way is to look at the Core.
Specializing to Doubles (which will be the case most likely to trigger some optimization):
finite_difference :: Double -> (Double -> Double) -> Double -> Double
finite_difference h f x = ((f $ x + hh) - (f x)) / h
  where
    temp = x + h
    hh = temp - x
Is compiled to:
A.$wfinite_difference h f x =
    case f (case x of
                D# x' -> D# (+## x' (-## (+## x' h) x'))
           ) of
        D# x'' -> case f x of D# y -> /## (-## x'' y) h
And similarly (with even less rewriting) for the polymorphic version.
So while the variables are inlined, the math isn't optimized away.
Beyond looking at the Core, I can't think of a way to guarantee the property you want.
I don't think that
temp = unsafePerformIO $ return $ x + h
would get optimised out. Just a guess.