Using Haskell pipes-bytestring to iterate a file by line - haskell-pipes

I am using the pipes library and need to convert a ByteString stream to a stream of lines (i.e. String), using ASCII encoding. I am aware that there are other libraries (Pipes.Text and Pipes.Prelude) that perhaps let me yield lines from a text file more easily, but because of some other code I need to be able to get lines as String from a Producer of ByteString.
More formally, I need to convert a Producer ByteString IO () to a Producer String IO (), which yields lines.
I am sure this must be a one-liner for an experienced pipes programmer, but so far I have not managed to hack my way through all the FreeT and lens trickery in pipes-bytestring.
Any help is much appreciated!
Stephan

If you need that type signature, then I would suggest this:
import Control.Foldl (mconcat, purely)
import Data.ByteString (ByteString)
import Data.Text (unpack)
import Lens.Family (view)
import Pipes (Producer, (>->))
import Pipes.Group (folds)
import qualified Pipes.Prelude as Pipes
import Pipes.Text (lines)
import Pipes.Text.Encoding (utf8)
import Prelude hiding (lines)
getLines
    :: Producer ByteString IO r -> Producer String IO (Producer ByteString IO r)
getLines p = purely folds mconcat (view (utf8 . lines) p) >-> Pipes.map unpack
This works because the type of purely folds mconcat is:
purely folds mconcat
    :: (Monad m, Monoid t) => FreeT (Producer t m) m r -> Producer t m r
... where t in this case would be Text:
purely folds mconcat
    :: Monad m => FreeT (Producer Text m) m r -> Producer Text m r
Any time you want to reduce each Producer sub-group of a FreeT-delimited stream you probably want to use purely folds. Then it's just a matter of picking the right Fold to reduce the sub-group with. In this case, you just want to concatenate all the Text chunks within a group, so you pass in mconcat. I generally don't recommend doing this since it will break on extremely long lines, but you specified that you needed this behavior.
The reason this is verbose is because the pipes ecosystem promotes Text over String and also tries to encourage handling arbitrarily long lines. If you were not constrained by your other code then the more idiomatic approach would just be:
view (utf8 . lines)
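For completeness, here is a minimal driver sketch (the file name is illustrative, not part of the original answer). It assumes the imports above plus the following; Control.Monad.void discards the leftover Producer that getLines returns:
import Control.Monad (void)
import Pipes (runEffect)
import qualified Pipes.ByteString as PB
import System.IO (IOMode(ReadMode), withFile)

main :: IO ()
main = withFile "input.txt" ReadMode $ \h ->
    runEffect $ void (getLines (PB.fromHandle h)) >-> Pipes.stdoutLn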

After a little bit of hacking and some hints from this blog, I came up with a solution, but it is surprisingly clumsy, and I fear it is a bit inefficient as well, since it uses ByteString.append:
import Pipes
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as PP
import qualified Pipes.Group as PG
import qualified Data.ByteString.Char8 as B
import Lens.Family (view)

getLines :: Producer PB.ByteString IO r -> Producer String IO r
getLines = PG.concats . PG.maps toStringProducer . view PB.lines

-- Concatenate all chunks of one line into a single ByteString,
-- then yield it unpacked as a String.
toStringProducer :: Producer PB.ByteString IO r -> Producer String IO r
toStringProducer producer = go producer B.empty
  where
    go producer bs = do
        x <- lift $ next producer
        case x of
            Left r -> do
                yield $ B.unpack bs
                return r
            -- append the new chunk after the accumulated prefix
            Right (bs', producer') -> go producer' (B.append bs bs')
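A usage sketch (the file name is illustrative), reusing the imports above: since this getLines preserves the return type, it composes directly with a Consumer.
import System.IO (IOMode(ReadMode), withFile)

main :: IO ()
main = withFile "input.txt" ReadMode $ \h ->
    runEffect $ getLines (PB.fromHandle h) >-> PP.stdoutLn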

Related

Haskell - Exposing IO actions in API

I wrote a small library[1] that interfaces with a PostgreSQL database containing 600+ Spanish verbs and pulls out conjugations and other useful things.
I have a single function which performs the DB read. It looks like this (I am using the postgresql-simple[2] library):
-- | A postgres query.
queryDB :: (ToRow params, FromRow a) => Query -> params -> IO [a]
queryDB q paramTypes = do
    c <- connection
    query c q paramTypes
Each function that I expose in the library uses this function and returns an IO action of some type. For example, if the user conjugates the verb 'ser' using conjugate, I get back an IO [Conjugation]:
-- | Conjugate the verb 'i' in the tense 't' and mood 'm'.
--
-- > conjugate "ser" "Presente" "Indicativo"
conjugate :: Infinitive -> Tense -> Mood -> IO [Conjugation]
conjugate i t m = queryDB conjugationQuery [i :: Infinitive,
                                            t :: Tense,
                                            m :: Mood]
I am new to writing libraries in Haskell. Is it fine for exported functions such as conjugate to return IO actions? They do interact with the DB, but that isn't really the point of the function ... the user just wants conjugations. Normally, if I wrote code like this in another language, the user would not know an IO action had taken place.
Can I separate IO and expose pure functions?
Since you're hitting a database, no. A huge part of Haskell is specifying to someone using your API that they're performing an IO action. Since IO actions can fail, return different results for the same input, or fire the missiles, we always tell the user when this happens.
What would happen if I used your API but didn't have your database as well? Then I would likely see some sort of error message about not having a connection. Or if I did have your database but modified it to return incorrect conjugations, then you can't guarantee that conjugate will always return the same conjugations given a particular infinitive, tense, and mood. This means that you can't have your conjugate function be pure.
If you want to avoid reconnecting to the database for every query, one thing you can do is make a newtype wrapper over ReaderT Connection IO that you use all over the place, and then provide a separate runDB function:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

newtype DB a = MkDB { unDB :: ReaderT DBConnection IO a }
    deriving (Functor, Applicative, Monad)

queryDB :: (ToRow params, FromRow a) => Query -> params -> DB [a]
queryDB q paramTypes = MkDB $ do
    c <- ask
    lift $ query c q paramTypes

conjugate :: Infinitive -> Tense -> Mood -> DB [Conjugation]
conjugate i t m = queryDB conjugationQuery [i :: Infinitive,
                                            t :: Tense,
                                            m :: Mood]

-- Of course, this still needs to be in IO
runDB :: DB a -> IO a
runDB db = runReaderT (unDB db) =<< connection
The crucial bit is to not export MkDB and unDB; DB is an opaque type that the user can only use via the exported functions (conjugate etc.) and the monadic combinators. This way, undiluted IO is not spread all over the client code.
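Concretely, the module header would look something like this (the module name is illustrative):
module Verbs.DB
    ( DB          -- exported abstractly: neither MkDB nor unDB escapes
    , runDB
    , conjugate
    ) where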

Can GHC really never inline map, scanl, foldr, etc.?

I've noticed the GHC manual says "for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored."
Doesn't this say every application of common recursive functional constructs like map, zip, scan*, fold*, sum, etc. cannot be inlined?
You could always rewrite all these functions when you employ them, adding appropriate strictness tags, or maybe employ fancy techniques like the "stream fusion" recommended here.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
Indeed, GHC cannot at present inline recursive functions. However:
GHC will still specialise recursive functions. For instance, given
fac :: (Eq a, Num a) => a -> a
fac 0 = 1
fac n = n * fac (n-1)
f :: Int -> Int
f x = 1 + fac x
GHC will spot that fac is used at type Int -> Int and generate a specialised version of fac for that type, which uses fast integer arithmetic.
This specialisation happens automatically within a module (e.g. if fac and f are defined in the same module). For cross-module specialisation (e.g. if f and fac are defined in different modules), mark the to-be-specialised function with an INLINABLE pragma:
{-# INLINABLE fac #-}
fac :: (Eq a, Num a) => a -> a
...
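If you already know which monomorphic types you need, you can also request them explicitly with a SPECIALIZE pragma (a small sketch reusing the same fac):
{-# SPECIALIZE fac :: Int -> Int #-}
fac :: (Eq a, Num a) => a -> a
fac 0 = 1
fac n = n * fac (n-1)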
There are manual transformations which make functions nonrecursive. The lowest-power technique is the static argument transformation, which applies to recursive functions with arguments that don't change on recursive calls (e.g. many higher-order functions such as map, filter, and fold*). This transformation turns
map f [] = []
map f (x:xs) = f x : map f xs
into
map f xs0 = go xs0
  where
    go [] = []
    go (x:xs) = f x : go xs
so that a call such as
g :: [Int] -> [Int]
g xs = map (2*) xs
will have map inlined and become
g [] = []
g (x:xs) = 2*x : g xs
This transformation has been applied to Prelude functions such as foldr and foldl.
Fusion techniques also make many functions nonrecursive, and are more powerful than the static argument transformation. The main approach for lists, which is built into the Prelude, is shortcut fusion. The basic idea is to write as many functions as possible as non-recursive functions which use foldr and/or build; then all the recursion is captured in foldr, and there are special RULES for dealing with foldr.
Taking advantage of this fusion is in principle easy: avoid manual recursion, preferring library functions such as foldr, map, filter, and any functions in this list. In particular, writing code in this style produces code which is "simultaneously fast and elegant".
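For instance, here is a small sketch (mapFR is an illustrative name, not a library function) of how map can be written without manual recursion so that it is eligible for shortcut fusion:
-- all recursion lives in foldr, which the RULES know how to fuse
mapFR :: (a -> b) -> [a] -> [b]
mapFR f = foldr (\x ys -> f x : ys) []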
Modern libraries such as text and vector use stream fusion behind the scenes. Don Stewart wrote a pair of blog posts (1, 2) demonstrating this in action in the now obsolete library uvector, but the same principles apply to text and vector.
As with shortcut fusion, taking advantage of stream fusion in text and vector is in principle easy: avoid manual recursion, preferring library functions which have been marked as "subject to fusion".
There is ongoing work on improving GHC to support inlining of recursive functions. This falls under the general heading of supercompilation, and recent work on this seems to have been led by Max Bolingbroke and Neil Mitchell.
In short, not as often as you would think. The reason is that the "fancy techniques" such as stream fusion are employed when the libraries are implemented, and library users don't need to worry about them.
Consider Data.List.map. The base package defines map as
map :: (a -> b) -> [a] -> [b]
map _ [] = []
map f (x:xs) = f x : map f xs
This map is self-recursive, so GHC won't inline it.
However, base also defines the following rewrite rules:
{-# RULES
"map" [~1] forall f xs. map f xs = build (\c n -> foldr (mapFB c f) n xs)
"mapList" [1] forall f. foldr (mapFB (:) f) [] = map f
"mapFB" forall c f g. mapFB (mapFB c f) g = mapFB c (f.g)
#-}
This replaces uses of map via foldr/build fusion, then, if the function cannot be fused, replaces it with the original map. Because the fusion happens automatically, it doesn't depend on the user being aware of it.
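For example, here is a sketch of (roughly) how those rules fire on a composed map:
-- map f (map g xs)
-- => build (\c n -> foldr (mapFB c f) n
--        (build (\c' n' -> foldr (mapFB c' g) n' xs)))    -- "map", twice
-- => build (\c n -> foldr (mapFB (mapFB c f) g) n xs)     -- foldr/build
-- => build (\c n -> foldr (mapFB c (f . g)) n xs)         -- "mapFB"
The two traversals fuse into one, applying f . g to each element.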
As proof that this all works, you can examine what GHC produces for specific inputs. For this function:
proc1 = sum . take 10 . map (+1) . map (*2)
eval1 = proc1 [1..5]
eval2 = proc1 [1..]
when compiled with -O2, GHC fuses all of proc1 into a single recursive form (as seen in the core output with -ddump-simpl).
Of course there are limits to what these techniques can accomplish. For example, the naive average function, mean xs = sum xs / fromIntegral (length xs), is easily transformed by hand into a single fold, and frameworks exist that can do so automatically; however, at present there's no known way to automatically translate between standard functions and the fusion framework. So in this case, the user does need to be aware of the limitations of the compiler-produced code.
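For reference, a sketch of that manual transformation into a single fold, accumulating the running sum and count strictly:
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- one traversal instead of two; both accumulators are forced each step
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (!acc, !len) x = (acc + x, len + 1)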
So in many cases compilers are sufficiently advanced to create code that's fast and elegant. Knowing when they will do so, and when the compiler is likely to fall down, is IMHO a large part of learning how to write efficient Haskell code.
for a self-recursive function, the loop breaker can only be the function itself, so an INLINE pragma is always ignored.
To inline something recursive, you would have to know at compile time how many times it recurses. Since that generally depends on the size of the input, which is not known at compile time, it is not possible in general.
Yet, doesn't all this dramatically constrain our ability to write code that's simultaneously fast and elegant?
There are certain techniques, though, that can make recursive calls much, much faster than they would otherwise be. For example, tail call optimization (see the SO wiki).
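For instance, a sketch of a tail-recursive sum with a strict accumulator, which GHC compiles to a tight loop:
{-# LANGUAGE BangPatterns #-}

sumTR :: [Int] -> Int
sumTR = go 0
  where
    -- the accumulator is forced on every call, so no thunks pile up
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs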

Serialising and counting a list of values

I need to serialise a large list of values using a custom encoding function (which I have). I've done this and it works, but I'd also like to have it count how many values are being serialised and written to disk whilst still using a relatively constant amount of memory (i.e. it shouldn't need to keep the entire input list around, as it gets very large).
Without the requirement of keeping a count, binary, cereal and blaze-builder all work (using the equivalent of B.writeFile "foo" . runPut . mapM_ encodeValue); but no matter what I try to do with any of these libraries it seems that the resulting ByteString gets kept around in memory until it is finished rather than starting to be written to disk as soon as a chunk is available (even when using toByteStringIO from blaze-builder).
This is a minimal example demonstrating what I've been trying to do:
import Data.Binary
import Data.Binary.Put
import Control.Monad (foldM)
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do
    let ns = [1..10000000] :: [Int]
        (count, b) = runPutM $
            foldM (\c n -> c `seq` (put n >> return (c+1))) (0 :: Int) ns
    B.writeFile "testOut" b
    print count
When compiled and run with +RTS -hy, the result is an almost triangular graph dominated by ByteString values.
The only solution I've found so far (that I'm not a big fan of) is to do the looping (either directly or with foldM) in IO using B.appendFile rather than within Put or directly constructing a Builder value, which to me doesn't seem very elegant. Is there a better way?
I'm a bit surprised that toByteStringIO doesn't work; hopefully someone more familiar with that library will provide an answer.
That being said, whenever I want to intermix stream processing with IO actions, I usually find iteratees to be the most elegant solution. This is because they allow for precise control over how much data is processed and retained, and for combining the streaming aspects with other arbitrary IO actions. There are several iteratee implementations on hackage; this example is with "iteratee" because it's the one I'm most familiar with.
{-# LANGUAGE BangPatterns #-}
import Data.Binary (put)
import Data.Binary.Put
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import Data.Iteratee hiding (foldM)
import qualified Data.Iteratee as I

main :: IO ()
main = do
    let ns = [1..80000000] :: [Int]
    iter <- enumPureNChunk ns (defaultChunkSize `div` 8)
                (joinI $ serializer $ writer "testOut")
    count <- run iter
    print count

-- serialize each [Int] chunk, pairing the element count with the bytes
serializer = mapChunks ((:[]) . runPutM . foldM
    (\ !cnt n -> put n >> return (cnt+1)) 0)

-- write each chunk to disk as it arrives, accumulating the total count
writer fp = I.foldM
    (\ !cnt (len,ck) -> liftIO (B.appendFile fp ck) >> return (cnt+len))
    0
There are three parts to this. writer is the "iteratee", i.e. a data consumer. It writes each chunk of data as it's received and keeps a running count of the length. serializer is a stream transformer a.k.a. "enumeratee". It takes an input chunk of type [Int] and serializes it to a stream with type [(Int, B.ByteString)] (number of elements, bytestring). Finally, enumPureNChunk is the "enumerator", which produces a stream, in this case from the input list. It takes enough elements from the input to fill a single lazy bytestring chunk (I'm on 64-bit; divide by 4 for 32-bit systems) and feeds them downstream, where they are written to disk and can then be GC'd.

Binary Serialization for Lists of Undefined Length in Haskell

I've been using Data.Binary to serialize data to files. In my application I incrementally add items to these files. The two most popular serialization packages, binary and cereal, both serialize lists as a count followed by the list items. Because of this, I can't append to my serialized files. I currently read in the whole file, deserialize the list, append to the list, re-serialize the list, and write it back out to the file. However, my data set is getting large and I'm starting to run out of memory. I could probably go around unboxing my data structures to gain some space, but that approach doesn't scale.
One solution would be to get down and dirty with the file format to change the initial count, then just append my elements. But that's not very satisfying, not to mention being sensitive to future changes in the file format as a result of breaking the abstraction. Iteratees/Enumerators come to mind as an attractive option here. I looked for a library combining them with a binary serialization, but didn't find anything. Anyone know if this has been done already? If not, would a library for this be useful? Or am I missing something?
So I say stick with Data.Binary but write a new instance for growable lists. Here's the current (strict) instance:
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get = do n <- get :: Get Int
             getMany n

-- | 'getMany n' gets 'n' elements in order, without blowing the stack.
getMany :: Binary a => Int -> Get [a]
getMany n = go [] n
  where
    go xs 0 = return $! reverse xs
    go xs i = do x <- get
                 x `seq` go (x:xs) (i-1)
{-# INLINE getMany #-}
Now, a version that lets you stream (in binary) to append to a file would need to be either eager or lazy. The lazy version is the most trivial. Something like:
import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where
    put (Stream [])     = putWord8 0
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x <- get
                    Stream xs <- get
                    return (Stream (x:xs))
            _ -> fail "Stream: unexpected tag"
Massaged appropriately, this works for streaming. Now, to handle appending silently, we'll need to be able to seek to the end of the file and overwrite the final 0 tag before adding more elements.
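A sketch of that append step (appendStream and the file handling are illustrative, not from the original answer): truncate the trailing 0 tag, then append the freshly encoded elements, whose own encoding ends with a new 0 tag:
import System.IO (IOMode(ReadWriteMode), hFileSize, hSetFileSize, withFile)
import qualified Data.ByteString.Lazy as BL
import Data.Binary (Binary, encode)

appendStream :: Binary a => FilePath -> [a] -> IO ()
appendStream fp xs = do
    withFile fp ReadWriteMode $ \h -> do
        size <- hFileSize h
        hSetFileSize h (size - 1)          -- chop off the final 0 tag
    BL.appendFile fp (encode (Stream xs))  -- tagged elements, ending in 0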
It's been four years since this question was answered, but I ran into the same problems as gatoatigrado describes in the comment on Don Stewart's answer. The put method works as advertised, but get reads the whole input. I believe the problem lies in the pattern match in the case statement, Stream xs <- get, which must run the remaining get to determine whether it yields a Stream a before returning.
My solution used the example in Data.Binary.Get as a starting point:
import Data.ByteString.Lazy (toChunks, ByteString)
import Data.Binary (Binary(..), Get, getWord8)
import Data.Binary.Get (pushChunk, Decoder(..), runGetIncremental)
import Data.List (unfoldr)

decodes :: Binary a => ByteString -> [a]
decodes = runGets (getWord8 >> get)

runGets :: Get a -> ByteString -> [a]
runGets g = unfoldr (decode1 d) . toChunks
  where
    d = runGetIncremental g
    decode1 _ []     = Nothing
    decode1 d (x:xs) = case d `pushChunk` x of
        Fail _ _ str  -> error str
        Done x' _ a   -> Just (a, x':xs)
        k@(Partial _) -> decode1 k xs
Note the use of getWord8. This is to read the encoded [] and : tags resulting from the definition of put for the Stream instance. Also note that, since getWord8 ignores the encoded [] and : tags, this implementation will not detect the end of the list. My encoded file was just a single list, so it works for that; otherwise you'll need to modify it.
In any case, this decodes ran in constant memory, whether accessing the head or the last element.
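A usage sketch (the file name is illustrative), assuming the file was written with the Stream instance above:
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
    bs <- BL.readFile "stream.bin"
    let xs = decodes bs :: [Int]
    print (head xs)  -- forces only the first element
    print (last xs)  -- streams through the rest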

Generating a list of lists of Int with QuickCheck

I'm working through Real World Haskell; one of the exercises of chapter 4 is to implement a foldr-based version of concat. I thought this would be a great candidate for testing with QuickCheck, since there is an existing implementation to validate my results. This, however, requires me to define an instance of the Arbitrary typeclass that can generate arbitrary [[Int]]. So far I have been unable to figure out how to do this. My first attempt was:
module FoldExcercises_Test where

import Test.QuickCheck
import Test.QuickCheck.Batch
import FoldExcercises

prop_concat xs =
    concat xs == fconcat xs
  where types = xs :: [[Int]]

options = TestOptions { no_of_tests = 200
                      , length_of_tests = 1
                      , debug_tests = True }

allChecks = [ run prop_concat ]

main = do
    runTests "simple" options allChecks
This results in no tests being performed. Looking at various bits and pieces, I guessed that an Arbitrary instance declaration was needed and added
instance Arbitrary a => Arbitrary [[a]] where
    arbitrary = sized arb'
      where arb' n = vector n (arbitrary :: Gen a)
This resulted in ghci complaining that my instance declaration was invalid and that adding -XFlexibleInstances might solve my problem. Adding the {-# OPTIONS_GHC -XFlexibleInstances #-} directive results in a type mismatch and an overlapping-instances warning.
So my question is: what's needed to make this work? I'm obviously new to Haskell and am not finding any resources that help me out. Any pointers are much appreciated.
Edit
It appears I was misled by QuickCheck's output: working test-first, fconcat was defined as
fconcat = undefined
Actually implementing the function correctly indeed gives the expected result. DOOP!
[[Int]] is already an Arbitrary instance (because Int is an Arbitrary instance, and so is [a] for any a that is itself an instance of Arbitrary). So that is not the problem.
I ran your code myself (replacing import FoldExcercises with fconcat = concat) and it ran 200 tests as I would have expected, so I am mystified as to why it doesn't do it for you. But you do NOT need to add an Arbitrary instance.
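For reference, here is a minimal sketch with a modern QuickCheck (Test.QuickCheck.Batch has long been removed), using a foldr-based fconcat; the [[Int]] instance is supplied automatically:
import Test.QuickCheck

fconcat :: [[a]] -> [a]
fconcat = foldr (++) []

prop_concat :: [[Int]] -> Bool
prop_concat xs = concat xs == fconcat xs

main :: IO ()
main = quickCheck prop_concat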