Haskell - Exposing IO actions in API

I wrote a small library[1] that interfaces with a postgresql DB which contains 600+ Spanish verbs and pulls out conjugations and other useful things.
I have a single function which performs the DB read. It looks like this (I am using the postgresql-simple[2] library):
-- | A postgres query.
queryDB :: (ToRow params, FromRow a) => Query -> params -> IO [a]
queryDB q paramTypes = do
  c <- connection
  query c q paramTypes
Each function that I expose in the library uses this function and returns an IO action of some type. For example, if the user conjugates the verb 'ser' using conjugate, I get back an IO [Conjugation]:
-- | Conjugate the verb 'i' in the tense 't' and mood 'm'.
--
-- > conjugate "ser" "Presente" "Indicativo"
conjugate :: Infinitive -> Tense -> Mood -> IO [Conjugation]
conjugate i t m = queryDB conjugationQuery [i :: Infinitive,
                                            t :: Tense,
                                            m :: Mood]
I am new to writing libraries in Haskell. Is it fine to have functions such as conjugate return IO actions? They do interact with the DB, but that isn't really the point of the function ... the user just wants conjugations. If I wrote code like this in another language, the user would not know an IO action had taken place.
Can I separate IO and expose pure functions?

Since you're hitting a database, no. A huge part of Haskell is specifying to someone using your API that they're performing an IO action. Since IO actions can fail, return different results for the same input, or fire the missiles, we always tell the user when this happens.
What would happen if I used your API but didn't have your database as well? Then I would likely see some sort of error message about not having a connection. Or if I did have your database but modified it to return incorrect conjugations, then you can't guarantee that conjugate will always return the same conjugations given a particular infinitive, tense, and mood. This means that you can't have your conjugate function be pure.

If you want to avoid reconnecting to the database for every query, one thing you can do is make a newtype wrapper over ReaderT Connection IO that you use all over the place, and then provide a separate runDB function:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.Reader (ReaderT, ask, lift, runReaderT)

newtype DB a = MkDB{ unDB :: ReaderT DBConnection IO a }
  deriving (Functor, Applicative, Monad)

queryDB :: (ToRow params, FromRow a) => Query -> params -> DB [a]
queryDB q paramTypes = MkDB $ do
  c <- ask
  lift $ query c q paramTypes
conjugate :: Infinitive -> Tense -> Mood -> DB [Conjugation]
conjugate i t m = queryDB conjugationQuery [i :: Infinitive,
                                            t :: Tense,
                                            m :: Mood]
-- Of course, this still needs to be in IO
runDB :: DB a -> IO a
runDB (MkDB db) = runReaderT db =<< connection
The crucial bit is to not export MkDB and unDB; DB is an opaque type that the user can only use via the exported functions (conjugate etc.) and the monadic combinators. This way, undiluted IO is not spread all over the client code.
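For illustration, client code under this design might look like the following (a hypothetical sketch; it assumes the connection :: IO DBConnection helper from the question, the string-literal arguments from the question's own conjugate example, and a Show instance for Conjugation):

-- The user composes DB actions monadically and crosses into IO
-- exactly once, at the outermost runDB call.
main :: IO ()
main = do
  conjugations <- runDB $ do
    ser <- conjugate "ser" "Presente" "Indicativo"
    ir  <- conjugate "ir"  "Presente" "Indicativo"
    return (ser ++ ir)
  mapM_ print conjugations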

Test.QuickCheck: speed up testing multiple properties for the same type

I am testing a random generator generating instances of my own type. For that I have a custom instance of Arbitrary:
complexGenerator :: (RandomGen g) => g -> (MyType, g)

instance Arbitrary MyType where
  arbitrary = liftM (fst . complexGenerator . mkStdGen) arbitrary
This works well with Test.QuickCheck (actually, Test.Framework) for testing that the generated values hold certain properties. However, there are quite a few properties I want to check, and the more I add, the more time it takes to verify them all.
Is there a way to use the same generated values for testing every property, instead of generating them anew each time? I obviously still want to see, on failures, which property did not hold, so making one giant property with and is not optimal.
You could label each property using printTestCase before making a giant property with conjoin.
E.g., say you were thinking this would be a bad idea:
prop_giant :: MyType -> Bool
prop_giant x = and [prop_one x, prop_two x, prop_three x]
this would be as efficient yet give you better output:
prop_giant :: MyType -> Property
prop_giant x = conjoin [printTestCase "one" $ prop_one x,
                        printTestCase "two" $ prop_two x,
                        printTestCase "three" $ prop_three x]
(Having said that, I've never used this method myself and am only assuming it will work; conjoin is probably marked as experimental in the documentation for a reason.)
In combination with the voted answer, what I've found helpful is using a Reader transformer with the Writer monad:
type Predicate r = ReaderT r (Writer String) Bool
The Reader "shared environment" is the tested input in this case. Then you can compose properties like this:
inv_even :: Predicate Int
inv_even = do
  lift . tell $ "this is the even invariant"
  (==) 0 . flip mod 2 <$> ask
toLabeledProp :: r -> Predicate r -> Property
toLabeledProp cause r =
  let (effect, msg) = runWriter . runReaderT r $ cause in
  printTestCase ("inv: " ++ msg) . property $ effect
and combining:
fromPredicates :: [Predicate r] -> r -> Property
fromPredicates predicates cause =
  conjoin . map (toLabeledProp cause) $ predicates
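For example (a sketch; inv_positive is a hypothetical second invariant, written the same way as inv_even above):

inv_positive :: Predicate Int
inv_positive = do
  lift . tell $ "this is the positive invariant"
  (> 0) <$> ask

-- Check all labeled invariants against the same generated input.
prop_ints :: Int -> Property
prop_ints = fromPredicates [inv_even, inv_positive]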
I suspect there is another approach involving something similar to Either or a WriterT here, which would concisely compose predicates on different types into one result. But at the least, this allows for documenting properties which impose different post-conditions dependent on the value of the input.
Edit: This idea spawned a library:
http://github.com/jfeltz/quickcheck-property-comb

Is this F# function tail-recursive where the recursive function is called several times inside the function?

There are a couple of questions about tail-recursive functions, e.g. this and this, but I could not find anything similar to the following.
My understanding is that a tail-call optimised function should return an accumulated value in its last call without any further evaluation. It's quite easy to understand using the factorial function, for example, which gets optimised into a loop. But it's not always obvious in other cases: e.g. in the following, what is that last call? There are many of them, as the function is called recursively more than once in the body.
Brian suggests a way of finding out, but I am not sure how to make it tail-call optimised. I can pass the --tailcalls flag to the compiler to do it automatically, but does it always succeed?
f and g return the same type.
type T = T of int * T list
let rec myfunc f (T (x,xs)) =
    if (List.isEmpty xs) then f x
    else List.fold g acc (List.map (fun xxs -> myfunc f xxs) xs)
Any help to tail-call optimise the above code would be much appreciated.
As Jon already said, your function is not tail-recursive. The basic problem is that it needs to call itself recursively multiple times (once for every element in the xs list, which is done in the lambda function passed to List.map).
In cases when you actually need to make multiple recursive calls, using the continuation passing style or an explicit (imperative) stack are probably the only options. The idea behind continuations is that every function takes another function (as its last argument) that should be executed when the result is available.
The following example shows a normal version followed by the continuation-based version:

let res = foo a b
printfn "Result: %d" res

fooCont a b (fun res ->
    printfn "Result: %d" res)
To write your function in a continuation passing style, you'll need to use a continuation-based fold function too. You can first avoid using map by moving the operation done in map into the lambda function of fold:
List.fold g acc (List.map (fun xxs -> myfunc f xxs) xs)
Becomes:
List.fold (fun state xxs -> g state (myfunc f xxs)) acc xs
Then you can rewrite the code as follows (note that both f and g, which you did not show in your question, are now continuation-based functions, so they take an additional argument representing the continuation):
// The additional parameter 'cont' is the continuation to be called
// when the final result of myfunc is computed
let rec myfunc' f (T (x,xs)) cont =
    if (List.isEmpty xs) then
        // Call the 'f' function to process the last element and give it the
        // current continuation (it will be called when 'f' calculates the result)
        f x cont
    else
        // Using the continuation-based version of fold - the lambda now takes current
        // element 'xxs', the accumulated state and a continuation to call
        // when it finishes calculating
        List.foldCont (fun xxs state cont ->
            // Call 'myfunc' recursively - note, this is tail-call position now!
            myfunc' f xxs (fun res ->
                // In the continuation, we perform the aggregation using 'g'
                // and the result is reported to the continuation that resumes
                // folding the remaining elements of the list.
                g state res cont)) acc xs cont
The List.foldCont function is a continuation-based version of fold and can be written as follows:
module List =
    let rec foldCont f (state:'TState) (list:'T list) cont =
        match list with
        | [] -> cont state
        | x::xs -> foldCont f state xs (fun state ->
            f x state cont)
Since you did not post a complete working example, I could not really test the code, but I think it should work.
My understanding is that a tail-call optimised function should return an accumulated value in its last call...
Almost. Tail recursion is when recursive calls all appear in tail position. Tail position means the caller returns the result from its callee directly.
in the following, what is that last call?
There are two calls in tail position. First, the call to f. Second, the call to List.fold. The recursive call is not in tail position because its return value is not returned directly by its caller.
if (List.isEmpty xs) then f x
Also, use pattern matching instead of isEmpty and friends.
Any help to tail-call optimise the above code would be much appreciated.
You'll have to post working code or at least a specification before anyone will be able to help you write a tail recursive version. In general, the simplest solutions are either to write in continuation passing style or imperative style.

Can GHC really never inline map, scanl, foldr, etc.?

I've noticed the GHC manual says "for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored."
Doesn't this say every application of common recursive functional constructs like map, zip, scan*, fold*, sum, etc. cannot be inlined?
You could always rewrite all these functions when you employ them, adding appropriate strictness tags, or maybe employ fancy techniques like the "stream fusion" recommended here.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
Indeed, GHC cannot at present inline recursive functions. However:
GHC will still specialise recursive functions. For instance, given
fac :: (Eq a, Num a) => a -> a
fac 0 = 1
fac n = n * fac (n-1)

f :: Int -> Int
f x = 1 + fac x
GHC will spot that fac is used at type Int -> Int and generate a specialised version of fac for that type, which uses fast integer arithmetic.
This specialisation happens automatically within a module (e.g. if fac and f are defined in the same module). For cross-module specialisation (e.g. if f and fac are defined in different modules), mark the to-be-specialised function with an INLINABLE pragma:
{-# INLINABLE fac #-}
fac :: (Eq a, Num a) => a -> a
...
There are manual transformations which make functions nonrecursive. The lowest-power technique is the static argument transformation, which applies to recursive functions with arguments which don't change on recursive calls (e.g. many higher-order functions such as map, filter, and fold*). This transformation turns
map f [] = []
map f (x:xs) = f x : map f xs
into
map f xs0 = go xs0
  where
    go [] = []
    go (x:xs) = f x : go xs
so that a call such as
g :: [Int] -> [Int]
g xs = map (2*) xs
will have map inlined and become
g [] = []
g (x:xs) = 2*x : g xs
This transformation has been applied to Prelude functions such as foldr and foldl.
Fusion techniques also make many functions nonrecursive, and are more powerful than the static argument transformation. The main approach for lists, which is built into the Prelude, is shortcut fusion. The basic approach is to write as many functions as possible as non-recursive functions which use foldr and/or build; then all the recursion is captured in foldr, and there are special RULES for dealing with foldr.
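As a sketch of the idea (hypothetical code, not the actual Prelude definitions), map can be written without self-recursion in terms of foldr and build:

import GHC.Exts (build)

-- All recursion lives in foldr; the foldr/build RULES can then fuse
-- away the intermediate list when this meets a foldr-based consumer.
myMap :: (a -> b) -> [a] -> [b]
myMap f xs = build (\cons nil -> foldr (\x acc -> cons (f x) acc) nil xs)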
Taking advantage of this fusion is in principle easy: avoid manual recursion, preferring library functions such as foldr, map, filter, and any functions in this list. In particular, writing code in this style produces code which is "simultaneously fast and elegant".
Modern libraries such as text and vector use stream fusion behind the scenes. Don Stewart wrote a pair of blog posts (1, 2) demonstrating this in action in the now obsolete library uvector, but the same principles apply to text and vector.
As with shortcut fusion, taking advantage of stream fusion in text and vector is in principle easy: avoid manual recursion, preferring library functions which have been marked as "subject to fusion".
There is ongoing work on improving GHC to support inlining of recursive functions. This falls under the general heading of supercompilation, and recent work on this seems to have been led by Max Bolingbroke and Neil Mitchell.
In short, not as often as you would think. The reason is that the "fancy techniques" such as stream fusion are employed when the libraries are implemented, and library users don't need to worry about them.
Consider Data.List.map. The base package defines map as
map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (x:xs) = f x : map f xs
This map is self-recursive, so GHC won't inline it.
However, base also defines the following rewrite rules:
{-# RULES
"map" [~1] forall f xs. map f xs = build (\c n -> foldr (mapFB c f) n xs)
"mapList" [1] forall f. foldr (mapFB (:) f) [] = map f
"mapFB" forall c f g. mapFB (mapFB c f) g = mapFB c (f.g)
#-}
This replaces uses of map via foldr/build fusion, then, if the function cannot be fused, replaces it with the original map. Because the fusion happens automatically, it doesn't depend on the user being aware of it.
As proof that this all works, you can examine what GHC produces for specific inputs. For this function:
proc1 = sum . take 10 . map (+1) . map (*2)
eval1 = proc1 [1..5]
eval2 = proc1 [1..]
when compiled with -O2, GHC fuses all of proc1 into a single recursive form (as seen in the core output with -ddump-simpl).
Of course there are limits to what these techniques can accomplish. For example, the naive average function, mean xs = sum xs / length xs, is easily transformed into a single fold by hand, and frameworks exist that can do so automatically; however, at present there's no known way to automatically translate between standard functions and the fusion framework. So in this case, the user does need to be aware of the limitations of the compiler-produced code.
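For instance (a sketch, not from the answer), a hand-written single-pass mean might look like this:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Accumulate the sum and the length in one strict pass, so the list
-- is consumed once and can be garbage-collected as it is traversed.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (!acc, !len) x = (acc + x, len + 1)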
So in many cases compilers are sufficiently advanced to create code that's fast and elegant. Knowing when they will do so, and when the compiler is likely to fall down, is IMHO a large part of learning how to write efficient Haskell code.
for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored.
To inline something recursive, you would, in effect, have to know at compile time how many times it recurses. Since that depends on input of variable length, that is not possible in general.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
There are certain techniques, though, that can make recursive calls much, much faster than they would normally be. For example, tail call optimization (see the SO wiki).

GroupBy function from .NET in Haskell

LINQ library in .NET framework does have a very useful function called GroupBy, which I have been using all the time.
Its type in Haskell would look like
Ord b => (a -> b) -> [a] -> [(b, [a])]
Its purpose is to classify items based on the given classification function f into buckets, with each bucket containing similar items, that is (b, l) such that for any item x in l, f x == b.
Its performance in .NET is O(N) because it uses hash-tables, but in Haskell I am OK with O(N*log(N)).
I can't find anything similar in standard Haskell libraries. Also, my implementation in terms of standard functions is somewhat bulky:
import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

myGroupBy :: Ord k => (a -> k) -> [a] -> [(k, [a])]
myGroupBy f = map toFst
            . groupBy ((==) `on` fst)
            . sortBy (comparing fst)
            . map (\a -> (f a, a))
  where
    toFst l@((k,_):_) = (k, map snd l)
This is definitely not something I want to see amongst my problem-specific code.
My question is: how can I implement this function nicely exploiting standard libraries to their maximum?
Also, the seeming absence of such a standard function hints that it may rarely be needed by experienced Haskellers because they may know some better way. Is that true? What can be used to implement similar functionality in a better way?
Also, what would be the good name for it, considering groupBy is already taken? :)
GHC.Exts.groupWith
groupWith :: Ord b => (a -> b) -> [a] -> [[a]]
Introduced as part of generalised list comprehensions: http://www.haskell.org/ghc/docs/7.0.2/html/users_guide/syntax-extns.html#generalised-list-comprehensions
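If you want the (key, group) pairs from the question rather than bare groups, a thin wrapper over groupWith would do (a sketch with a hypothetical name; head is safe here because groupWith only produces non-empty groups):

import GHC.Exts (groupWith)

-- Recover the requested (b, [a]) pairs from groupWith's [[a]].
groupByKey :: Ord b => (a -> b) -> [a] -> [(b, [a])]
groupByKey f = map (\g -> (f (head g), g)) . groupWith f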
Using Data.Map as the intermediate structure:
import Control.Arrow ((&&&))
import qualified Data.Map as M

myGroupBy :: Ord k => (a -> k) -> [a] -> [(k, [a])]
myGroupBy f = M.toList . M.fromListWith (++) . map (f &&& return)
The map operation turns the input list into a list of keys paired with singleton lists containing the elements. M.fromListWith (++) turns this into a Data.Map, concatenating when two items have the same key, and M.toList gets the pairs back out again.
Note that this reverses the lists, so adjust for that if necessary. It is also easy to replace return and (++) with other monoid-like operations if you for example only wanted the sum of the elements in each group.
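For illustration, a hypothetical GHCi session with the Data.Map version, where the reversal within each bucket is visible:

ghci> myGroupBy (`mod` 3) [1..7]
[(0,[6,3]),(1,[7,4,1]),(2,[5,2])]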

Binary Serialization for Lists of Undefined Length in Haskell

I've been using Data.Binary to serialize data to files. In my application I incrementally add items to these files. The two most popular serialization packages, binary and cereal, both serialize lists as a count followed by the list items. Because of this, I can't append to my serialized files. I currently read in the whole file, deserialize the list, append to the list, re-serialize the list, and write it back out to the file. However, my data set is getting large and I'm starting to run out of memory. I could probably go around unboxing my data structures to gain some space, but that approach doesn't scale.
One solution would be to get down and dirty with the file format to change the initial count, then just append my elements. But that's not very satisfying, not to mention being sensitive to future changes in the file format as a result of breaking the abstraction. Iteratees/Enumerators come to mind as an attractive option here. I looked for a library combining them with a binary serialization, but didn't find anything. Anyone know if this has been done already? If not, would a library for this be useful? Or am I missing something?
So I say stick with Data.Binary but write a new instance for growable lists. Here's the current (strict) instance:
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               getMany n

-- | 'getMany n' gets 'n' elements in order, without blowing the stack.
getMany :: Binary a => Int -> Get [a]
getMany n = go [] n
  where
    go xs 0 = return $! reverse xs
    go xs i = do x <- get
                 x `seq` go (x:xs) (i-1)
{-# INLINE getMany #-}
Now, a version that lets you stream (in binary) to append to a file would need to be eager or lazy. The lazy version is the most trivial. Something like:
import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where
    put (Stream [])     = putWord8 0
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x <- get
                    Stream xs <- get
                    return (Stream (x:xs))
Massaged appropriately, this works for streaming. Now, to handle silent appending, we'll need to be able to seek to the end of the file and overwrite the final 0 tag before adding more elements.
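A minimal sketch of that appending step (not from the answer; it assumes the file already contains a well-formed encoded Stream, and that the Stream instance above is in scope):

import qualified Data.ByteString.Lazy as L
import System.IO

-- Overwrite the trailing 0 tag with the encoding of the new elements;
-- encode (Stream xs) ends with its own 0 tag, so the file stays well-formed.
appendStream :: Binary a => FilePath -> [a] -> IO ()
appendStream path xs = withBinaryFile path ReadWriteMode $ \h -> do
    size <- hFileSize h
    hSeek h AbsoluteSeek (size - 1)   -- position of the final 0 tag
    L.hPut h (encode (Stream xs))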
It's four years since this question was answered, but I ran into the same problems as gatoatigrado in the comments to Don Stewart's answer. The put method works as advertised, but get reads the whole input. I believe the problem lies in the pattern match in the case statement, Stream xs <- get, which must determine whether or not the remaining get is a Stream a before returning.
My solution used the example in Data.Binary.Get as a starting point:
import Data.ByteString.Lazy (toChunks, ByteString)
import Data.Binary (Binary(..), getWord8)
import Data.Binary.Get (pushChunk, Decoder(..), runGetIncremental, Get)
import Data.List (unfoldr)

decodes :: Binary a => ByteString -> [a]
decodes = runGets (getWord8 >> get)

runGets :: Get a -> ByteString -> [a]
runGets g = unfoldr (decode1 d) . toChunks
  where
    d = runGetIncremental g
    decode1 _ [] = Nothing
    decode1 d (x:xs) = case d `pushChunk` x of
        Fail _ _ str  -> error str
        Done x' _ a   -> Just (a, x':xs)
        k@(Partial _) -> decode1 k xs
Note the use of getWord8. This is to read the encoded [] and : resulting from the definition of put for the Stream instance. Also note that, since getWord8 ignores the encoded [] and : symbols, this implementation will not detect the end of the list. My encoded file was just a single list, so it works for that, but otherwise you'll need to modify it.
In any case, this decodes ran in constant memory whether accessing the head or the last element.
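For illustration, a hypothetical round trip (assuming the Stream instance and decodes above are in scope, plus encode from Data.Binary):

-- Encode a Stream, then decode its elements lazily; take 3 forces
-- only a prefix, demonstrating the streaming behaviour.
main :: IO ()
main = do
    let bytes = encode (Stream [1 .. 10 :: Int])
    print (take 3 (decodes bytes :: [Int]))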