Binary Serialization for Lists of Undefined Length in Haskell

I've been using Data.Binary to serialize data to files. In my application I incrementally add items to these files. The two most popular serialization packages, binary and cereal, both serialize lists as a count followed by the list items. Because of this, I can't append to my serialized files. I currently read in the whole file, deserialize the list, append to the list, re-serialize the list, and write it back out to the file. However, my data set is getting large and I'm starting to run out of memory. I could probably go around unboxing my data structures to gain some space, but that approach doesn't scale.
One solution would be to get down and dirty with the file format to change the initial count, then just append my elements. But that's not very satisfying, not to mention being sensitive to future changes in the file format as a result of breaking the abstraction. Iteratees/Enumerators come to mind as an attractive option here. I looked for a library combining them with a binary serialization, but didn't find anything. Anyone know if this has been done already? If not, would a library for this be useful? Or am I missing something?

So I say stick with Data.Binary but write a new instance for growable lists. Here's the current (strict) instance:
instance Binary a => Binary [a] where
    put l = put (length l) >> mapM_ put l
    get   = do n <- get :: Get Int
               getMany n

-- | 'getMany n' get 'n' elements in order, without blowing the stack.
getMany :: Binary a => Int -> Get [a]
getMany n = go [] n
  where
    go xs 0 = return $! reverse xs
    go xs i = do x <- get
                 x `seq` go (x:xs) (i-1)
{-# INLINE getMany #-}
Now, a version that lets you stream (in binary) and append to a file would need to be either eager or lazy. The lazy version is the most trivial. Something like:
import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where
    put (Stream [])     = putWord8 0
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x <- get
                    Stream xs <- get
                    return (Stream (x:xs))
Massaged appropriately, this works for streaming. Now, to handle silent appending, we'll need to be able to seek to the end of the file and overwrite the final 0 tag before adding more elements.
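As a rough sketch of that idea (appendElems is a hypothetical helper, and it assumes the file was written with the Stream instance above, so that its last byte is the terminating 0 tag):

import System.IO
import Data.Binary (Binary, put)
import Data.Binary.Put (runPut, putWord8)
import qualified Data.ByteString.Lazy as BL

-- Overwrite the trailing 0 tag with newly tagged elements, then write a
-- fresh 0 tag so the file remains a valid Stream encoding throughout.
appendElems :: Binary a => FilePath -> [a] -> IO ()
appendElems fp xs = withBinaryFile fp ReadWriteMode $ \h -> do
    size <- hFileSize h
    hSeek h AbsoluteSeek (size - 1)            -- position on the final 0 tag
    BL.hPut h . runPut $ do
        mapM_ (\x -> putWord8 1 >> put x) xs   -- 1 tag = "cons", as in put above
        putWord8 0                             -- 0 tag = end of list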

It's been four years since this question was answered, but I ran into the same problems as gatoatigrado in the comment to Don Stewart's answer. The put method works as advertised, but get reads the whole input. I believe the problem lies in the pattern match in the case statement, Stream xs <- get, which must determine whether the remaining bytes decode to a Stream a before returning.
My solution used the example in Data.Binary.Get as a starting point:
import Data.ByteString.Lazy (toChunks, ByteString)
import Data.Binary (Binary(..), getWord8)
import Data.Binary.Get (Get, pushChunk, Decoder(..), runGetIncremental)
import Data.List (unfoldr)

decodes :: Binary a => ByteString -> [a]
decodes = runGets (getWord8 >> get)

runGets :: Get a -> ByteString -> [a]
runGets g = unfoldr (decode1 d) . toChunks
  where
    d = runGetIncremental g
    decode1 _ []     = Nothing
    decode1 d (x:xs) = case d `pushChunk` x of
        Fail _ _ str  -> error str
        Done x' _ a   -> Just (a, x':xs)
        k@(Partial _) -> decode1 k xs
Note the use of getWord8. This is to read the encoded [] and : tags resulting from the definition of put for the Stream instance. Also note that, since getWord8 ignores the encoded [] and : symbols, this implementation will not detect the end of the list. My encoded file was just a single list, so it works for that, but otherwise you'll need to modify it.
In any case, this decodes ran in constant memory in both cases of accessing the head and last elements.
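For reference, here is a hypothetical driver for decodes (the file name and element type are assumptions on my part):

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
    bs <- BL.readFile "stream.bin"        -- assumed: written via the Stream instance
    print (last (decodes bs :: [Int]))    -- streams through the list in constant memory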

Related

How to modify a List element at a given index

What I've done (with some help from a friend) is create a function that takes a List, an Int for the index, and a function to be applied to the element at the specified index. It's similar to map, but instead of applying a function to every element, it applies it to only one element.
So my questions are:
Does this function already exist in the core somewhere? We couldn't find it.
If not, is there a better way of accomplishing this than how we have done it?
Here's the code:
import Html exposing (text)

main =
  let
    m = { arr = [1,5,3], msg = "" }
  in
    text (toString (getDisplay m 4 (\x -> x + 5)))

type alias Model =
  { arr : List Int
  , msg : String
  }

getDisplay : Model -> Int -> (Int -> Int) -> Model
getDisplay model i f =
  let
    m = changeAt model.arr i f
  in
    case m of
      Ok val ->
        { model | arr = val, msg = "" }

      Err err ->
        { model | arr = [], msg = err }

changeAt : List a -> Int -> (a -> a) -> Result String (List a)
changeAt l i func =
  let
    f j x = if j == i then func x else x
  in
    if i < List.length l && i >= 0 then
      Ok (List.indexedMap f l)
    else
      Err "Bad index"
NOTE: Elm discourages indexing Lists, as they are linked lists under the hood: to retrieve the 1001st element, you have to first visit all 1000 previous elements. Nonetheless, if you wanted to do it, this is one way.
List.indexedMap is a good way to do what you're describing.
However, since you mention the downside of having to visit all preceding elements in a list, the reality in your example is actually a little worse, if indeed you are super worried about performance.
Your list is actually traversed fully at least twice, regardless of whether the index exists or not. The simple act of asking for the length of a linked list has to traverse the entire list. Check out the source code: length is implemented in terms of a foldl.
Furthermore, List.indexedMap traverses the entire list at least once. I say at least once because the source of indexedMap also calls the length function in addition to using map. If we're lucky, the length call is memoized (I'm not familiar enough with Elm internals to know whether it is or not, hence the "at least"). The map itself traverses the entire list when called, unlike Haskell, which evaluates things lazily, only as much as necessary.
And if you use indexedMap, the whole list is indexed regardless of the position you are interested in. That is, even if you want to apply the function at index zero, the entire list is indexed.
If you actually want to reduce the number of traversals to a minimum, you're going to (at this time) have to implement your own function and you'll have to do it without relying on length or indexedMap.
Here is an example of a changeAt function which avoids unnecessary traversals and, once it finds the position, stops traversing the list.
changeAt : List a -> Int -> (a -> a) -> Result String (List a)
changeAt l i func =
  if i < 0 then
    Err "Bad Index"
  else
    case l of
      [] ->
        Err "Not found"

      x :: xs ->
        if i == 0 then
          Ok <| func x :: xs
        else
          Result.map ((::) x) <| changeAt xs (i - 1) func
It's not terribly pretty, but if you want to avoid unnecessarily walking through the list multiple times, then you might want to go with something like this.
You're looking for the set function for Arrays. Instead of using a List, which is inefficient as you described, this structure is better suited to your use case.
Here's an efficient implementation of the function you're looking for:
import Array exposing (Array, get, set)

changeAt : Int -> (a -> a) -> Array a -> Array a
changeAt i f array =
  case get i array of
    Just item ->
      set i (f item) array

    Nothing ->
      array
Also note that the data structure is the last argument in this implementation.
Array is mentioned in the link in your question, but nobody on this thread had explicitly mentioned this option yet.

Preventing FsCheck from generating NaN and infinities

I have a deeply nested data structure with floats all over the place.
I'm using FsCheck to check whether the data is unchanged after serializing and then deserializing.
This property fails when a float is either NaN or +/- infinity; however, such a case doesn't interest me, since I don't expect these values to occur in the actual data.
Is there a way to prevent FsCheck from generating NaN and infinities?
I have tried discarding generated data that contains said values, but this makes the test incredibly slow; so slow, in fact, that the test is still running while I'm writing this, and I have my doubts it will actually finish...
For reflectively generated types that contain floats (as I suspect you're using) you can override the default generator for floats by writing a class as follows:

type Overrides() =
    static member Float() =
        Arb.Default.Float()
        |> Arb.filter (fun f -> not <| System.Double.IsNaN(f) &&
                                not <| System.Double.IsInfinity(f))
And then calling:
Arb.register<Overrides>()
before FsCheck tries to generate the types, e.g. in your test setup or before calling Check.Quick.
You can check the result of the register method to see how it merged the default arbitrary instances with the new ones; it should have overridden them.
If you are using the xUnit extension you can avoid calling the Arb.register by using the Arbitraries argument of PropertyAttribute:
[<Property(Arbitraries = [| typeof<Overrides> |])>]
As Mauricio Scheffer said, you can use the NormalFloat type in the test parameter.
Simple example for list of floats:
open FsCheck

let f (x : float list) = x |> List.map id

let propFloat (x : float list) = x = (f x)

let propNormalFloat (xn : NormalFloat list) =
    let x = xn |> List.map NormalFloat.get
    x = f x

Check.Quick propFloat
// Falsifiable, after 18 tests (13 shrinks) (StdGen (761688149,295892075)):
// [nan]

Check.Quick propNormalFloat
// Ok, passed 100 tests.

Test.QuickCheck: speed up testing multiple properties for the same type

I am testing a random generator generating instances of my own type. For that I have a custom instance of Arbitrary:
complexGenerator :: (RandomGen g) => g -> (MyType, g)

instance Arbitrary MyType where
    arbitrary = liftM (fst . complexGenerator . mkStdGen) arbitrary
This works well with Test.QuickCheck (actually, Test.Framework) for testing that the generated values hold certain properties. However, there are quite a few properties I want to check, and the more I add, the more time it takes to verify them all.
Is there a way to use the same generated values for testing every property, instead of generating them anew each time? I obviously still want to see, on failures, which property did not hold, so making one giant property with and is not optimal.
You could label each property using printTestCase before making a giant property with conjoin.
E.g., if you were thinking this would be a bad idea:
prop_giant :: MyType -> Bool
prop_giant x = and [prop_one x, prop_two x, prop_three x]
this would be as efficient yet give you better output:
prop_giant :: MyType -> Property
prop_giant x = conjoin [ printTestCase "one"   $ prop_one x
                       , printTestCase "two"   $ prop_two x
                       , printTestCase "three" $ prop_three x
                       ]
(Having said that, I've never used this method myself and am only assuming it will work; conjoin is probably marked as experimental in the documentation for a reason.)
In combination with the voted answer, what I've found helpful is using a Reader transformer with the Writer monad:
import Control.Monad.Reader
import Control.Monad.Writer
import Test.QuickCheck

type Predicate r = ReaderT r (Writer String) Bool
The Reader "shared environment" is the tested input in this case. Then you can compose properties like this:
inv_even :: Predicate Int
inv_even = do
    lift . tell $ "this is the even invariant"
    (==) 0 . flip mod 2 <$> ask

toLabeledProp :: r -> Predicate r -> Property
toLabeledProp cause r =
    let (effect, msg) = runWriter . runReaderT r $ cause in
    printTestCase ("inv: " ++ msg) . property $ effect
and combining:
fromPredicates :: [Predicate r] -> r -> Property
fromPredicates predicates cause =
    conjoin . map (toLabeledProp cause) $ predicates
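For illustration, a usage sketch (inv_positive is an invented second invariant, just to show how predicates combine):

-- A second, hypothetical invariant on the same input type.
inv_positive :: Predicate Int
inv_positive = do
    lift . tell $ "this is the positive invariant"
    (> 0) <$> ask

-- Both invariants run against the same generated value; each failure
-- is reported with its label.
prop_invariants :: Int -> Property
prop_invariants = fromPredicates [inv_even, inv_positive]

-- quickCheck prop_invariants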
I suspect there is another approach involving something similar to Either or a WriterT here, which would concisely compose predicates on different types into one result. But at the least, this allows for documenting properties which impose different post-conditions depending on the value of the input.
Edit: This idea spawned a library:
http://github.com/jfeltz/quickcheck-property-comb

Serialising and counting a list of values

I need to serialise a large list of values using a custom encoding function (which I have). I've done this and it works, but I'd also like to have it count how many values are being serialised and written to disk whilst still using a relatively constant amount of memory (i.e. it shouldn't need to keep the entire input list around, as it gets very large).
Without the requirement of keeping a count, binary, cereal and blaze-builder all work (using the equivalent of B.writeFile "foo" . runPut . mapM_ encodeValue); but no matter what I try to do with any of these libraries it seems that the resulting ByteString gets kept around in memory until it is finished rather than starting to be written to disk as soon as a chunk is available (even when using toByteStringIO from blaze-builder).
This is a minimal example demonstrating what I've been trying to do:
import Data.Binary
import Data.Binary.Put
import Control.Monad(foldM)
import qualified Data.ByteString.Lazy as B
main :: IO ()
main = do
    let ns = [1..10000000] :: [Int]
        (count, b) = runPutM $
            foldM (\c n -> c `seq` (put n >> return (c+1))) (0 :: Int) ns
    B.writeFile "testOut" b
    print count
When compiled and run with +RTS -hy, the result is an almost triangular graph dominated by ByteString values.
The only solution I've found so far (that I'm not a big fan of) is to do the looping (either directly or with foldM) in IO using B.appendFile rather than within Put or directly constructing a Builder value, which to me doesn't seem very elegant. Is there a better way?
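For concreteness, here is a sketch of that appendFile workaround (it trades elegance, and speed, for constant memory; a real version would batch elements per write rather than reopening the file for every one):

import Data.Binary (put)
import Data.Binary.Put (runPut)
import Control.Monad (foldM)
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do
    let ns = [1..10000000] :: [Int]
    -- Serialize and append one element at a time, counting as we go.
    count <- foldM (\c n -> do B.appendFile "testOut" (runPut (put n))
                               return $! c + 1)
                   (0 :: Int) ns
    print count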
I'm a bit surprised that toByteStringIO doesn't work, hopefully someone more familiar with that library will provide an answer.
That being said, whenever I want to intermix stream processing with IO actions, I usually find iteratees to be the most elegant solution. This is because they allow for precise control over how much data is processed and retained, and for combining the streaming aspects with other arbitrary IO actions. There are several iteratee implementations on hackage; this example is with "iteratee" because it's the one I'm most familiar with.
{-# LANGUAGE BangPatterns #-}

import Data.Binary (put)
import Data.Binary.Put
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import Data.Iteratee hiding (foldM)
import qualified Data.Iteratee as I

main :: IO ()
main = do
    let ns = [1..80000000] :: [Int]
    iter <- enumPureNChunk ns (defaultChunkSize `div` 8)
                           (joinI $ serializer $ writer "testOut")
    count <- run iter
    print count

serializer = mapChunks ((:[]) . runPutM . foldM
    (\ !cnt n -> put n >> return (cnt+1)) 0)

writer fp = I.foldM
    (\ !cnt (len,ck) -> liftIO (B.appendFile fp ck) >> return (cnt+len))
    0
There are three parts to this. writer is the "iteratee", i.e. a data consumer. It writes each chunk of data as it's received and keeps a running count of the length. serializer is a stream transformer, a.k.a. "enumeratee". It takes an input chunk of type [Int] and serializes it to a stream of type [(Int, B.ByteString)] (number of elements, bytestring). Finally, enumPureNChunk is the "enumerator", which produces a stream, in this case from the input list. It takes enough elements from the input to fill a single lazy bytestring chunk (I'm on 64-bit; divide by 4 for 32-bit systems), and then writes them to disk so they can be GC'd.

Variables in Haskell

Why does the following Haskell script not work as expected?
find :: Eq a => a -> [(a,b)] -> [b]
find k t = [v | (k,v) <- t]
Given find 'b' [('a',1),('b',2),('c',3),('b',4)], the interpreter returns [1,2,3,4] instead of [2,4]. The introduction of a new variable, below called u, is necessary to get this to work:
find :: Eq a => a -> [(a,b)] -> [b]
find k t = [v | (u,v) <- t, k == u]
Does anyone know why the first variant does not produce the desired result?
From the Haskell 98 Report:
As usual, bindings in list comprehensions can shadow those in outer scopes; for example:

[ x | x <- x, x <- x ] = [ z | y <- x, z <- y ]
One other point: if you compile with -Wall (or specifically with -fwarn-name-shadowing) you'll get the following warning:
Warning: This binding for `k' shadows the existing binding
bound at Shadowing.hs:4:5
Using -Wall is usually a good idea—it will often highlight what's going on in potentially confusing situations like this.
The pattern match (k,v) <- t in the first example creates two new local variables k and v that are populated with the contents of each tuple drawn from the list t. The pattern match doesn't compare the contents of the tuples against the already existing variable k; it creates a new variable k (which hides the outer one).
Generally there is never any "variable substitution" happening in a pattern, any variable names in a pattern always create new local variables.
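A minimal illustration of that point (hypothetical names):

k :: Int
k = 42

f :: (Int, Int) -> (Int, Int)
f (k, v) = (k, v)   -- this k is a fresh binding; it shadows the top-level k

-- f (1, 2) evaluates to (1, 2); the pattern never consults the outer k.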
You can only pattern match on literals and constructors. You can't match against the value of another variable.
That being said, you may be interested in view patterns.
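As a sketch of what that could look like here (this assumes GHC's ViewPatterns extension), the original find can test the key inside the pattern itself:

{-# LANGUAGE ViewPatterns #-}

find :: Eq a => a -> [(a, b)] -> [b]
find k t = [v | ((== k) -> True, v) <- t]
-- The view pattern applies (== k) to the first tuple component and matches
-- the result against True; non-matching elements are simply skipped, so
-- find 'b' [('a',1),('b',2),('c',3),('b',4)] == [2,4]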