Serialising and counting a list of values - serialization

I need to serialise a large list of values using a custom encoding function (which I have). I've done this and it works, but I'd also like to have it count how many values are being serialised and written to disk whilst still using a relatively constant amount of memory (i.e. it shouldn't need to keep the entire input list around, as it gets very large).
Without the requirement of keeping a count, binary, cereal and blaze-builder all work (using the equivalent of B.writeFile "foo" . runPut . mapM_ encodeValue); but no matter what I try to do with any of these libraries it seems that the resulting ByteString gets kept around in memory until it is finished rather than starting to be written to disk as soon as a chunk is available (even when using toByteStringIO from blaze-builder).
This is a minimal example demonstrating what I've been trying to do:
import Data.Binary
import Data.Binary.Put
import Control.Monad(foldM)
import qualified Data.ByteString.Lazy as B
main :: IO ()
main = do let ns = [1..10000000] :: [Int]
(count,b) = runPutM $ foldM (\ c n -> c `seq` (put n >> return (c+1))) (0 :: Int) ns
B.writeFile "testOut" b
print count
When compiled and run with +RTS -hy, the result is an almost triangular graph dominated by ByteString values.
The only solution I've found so far (that I'm not a big fan of) is to do the looping (either directly or with foldM) in IO using B.appendFile rather than within Put or directly constructing a Builder value, which to me doesn't seem very elegant. Is there a better way?

I'm a bit surprised that toByteStringIO doesn't work, hopefully someone more familiar with that library will provide an answer.
That being said, whenever I want to intermix stream processing with IO actions, I usually find iteratees to be the most elegant solution. This is because they allow for precise control over how much data is processed and retained, and for combining the streaming aspects with other arbitrary IO actions. There are several iteratee implementations on hackage; this example is with "iteratee" because it's the one I'm most familiar with.
import Data.Binary.Put
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import Data.Iteratee hiding (foldM)
import qualified Data.Iteratee as I
main :: IO ()
main = do
let ns = [1..80000000] :: [Int]
iter <- enumPureNChunk ns (defaultChunkSize `div` 8)
(joinI $ serializer $ writer "testOut")
count <- run iter
print count
serializer = mapChunks ((:[]) . runPutM . foldM
(\ !cnt n -> put n >> return (cnt+1)) 0)
writer fp = I.foldM
(\ !cnt (len,ck) -> liftIO (B.appendFile fp ck) >> return (cnt+len))
0
There are three parts to this. writer is the "iteratee", i.e. a data consumer. It writes each chunk of data as its received and keeps a running count of the length. serializer is a stream transformer a.k.a. "enumeratee". It takes an input chunk of type [Int] and serializes it to a stream with type [(Int, B.ByteString)] (number of elements, bytestring). Finally enumPureNChunk is the "enumerator", which produces a stream, in this case from the input list. It takes enough elements from the input to fill a single lazy bytestring chunk (I'm on 64bit, divide by 4 for 32bit systems), and then writes them to disk so they can be GC'd.

Related

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the amount of bytes of an object. It does so by getting the length of the pickled state (name of the variable), but it also does a bit of magic which I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that this is very memory consuming. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
"""
Recurse over the ZODB tree starting from self.aq_parent. For
each object, use zodbbrowser's implementation to get the raw
object state. Put each length into a Counter object and
return a list of the biggest objects, specified by path and
size.
"""
from zodbbrowser.history import ZodbObjectHistory
from collections import Counter
def recurse(obj, results):
# Retrieve state pickle from ZODB, get length
history = ZodbObjectHistory(obj)
pstate = history.loadStatePickle()
length = len(pstate)
# Add length to Counter instance
path = '/'.join(obj.getPhysicalPath())
results[path] = length
# Recursion
for child in obj.contentValues():
# Play around portal tools and other weird objects which
# seem to contain themselves
if child.contentValues() == obj.contentValues():
continue
# Rolling in the deep
try:
recurse(child, results)
except (RuntimeError, AttributeError), err:
import pdb; pdb.set_trace() ## go debug
results = Counter()
recurse(self.aq_parent, results)
return results.most_common()

using haskell pipes-bytestring to iterate a file by line

I am using the pipes library and need to convert a ByteString stream to a stream of lines (i.e. String), using ASCII encoding. I am aware that there are other libraries (Pipes.Text and Pipes.Prelude) that perhaps let me yield lines from a text file more easily, but because of some other code I need to be able to get lines as String from a Producer of ByteString.
More formally, I need to convert a Producer ByteString IO () to a Producer String IO (), which yields lines.
I am sure this must be a one-liner for an experienced Pipes-Programmer, but I so far did not manage to successfully hack through all the FreeT and Lens-trickery in Pipes-ByteString.
Any help is much appreciated!
Stephan
If you need that type signature, then I would suggest this:
import Control.Foldl (mconcat, purely)
import Data.ByteString (ByteString)
import Data.Text (unpack)
import Lens.Family (view)
import Pipes (Producer, (>->))
import Pipes.Group (folds)
import qualified Pipes.Prelude as Pipes
import Pipes.Text (lines)
import Pipes.Text.Encoding (utf8)
import Prelude hiding (lines)
getLines
:: Producer ByteString IO r -> Producer String IO (Producer ByteString IO r)
getLines p = purely folds mconcat (view (utf8 . lines) p) >-> Pipes.map unpack
This works because the type of purely folds mconcat is:
purely folds mconcat
:: (Monad m, Monoid t) => FreeT (Producer t m) r -> Producer t m r
... where t in this case would be Text:
purely folds mconcat
:: Monad m => FreeT (Producer Text m) r -> Producer Text m r
Any time you want to reduce each Producer sub-group of a FreeT-delimited stream you probably want to use purely folds. Then it's just a matter of picking the right Fold to reduce the sub-group with. In this case, you just want to concatenate all the Text chunks within a group, so you pass in mconcat. I generally don't recommend doing this since it will break on extremely long lines, but you specified that you needed this behavior.
The reason this is verbose is because the pipes ecosystem promotes Text over String and also tries to encourage handling arbitrarily long lines. If you were not constrained by your other code then the more idiomatic approach would just be:
view (utf8 . lines)
After a little bit of hacking and some hints from this blog, I came up with a solution, but it is surprisingly clumsy, and I fear a bit inefficient as well, as it uses ByteString.append:
import Pipes
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as PP
import qualified Pipes.Group as PG
import qualified Data.ByteString.Char8 as B
import Lens.Family (view )
import Control.Monad (liftM)
getLines :: Producer PB.ByteString IO r -> Producer String IO r
getLines = PG.concats . PG.maps toStringProducer . view PB.lines
toStringProducer :: Producer PB.ByteString IO r -> Producer String IO r
toStringProducer producer = go producer B.empty
where
go producer bs = do
x <- lift $ next producer
case x of
Left r -> do
yield $ B.unpack bs
return r
Right (bs', producer') -> go producer' (B.append bs' bs)

QuickCheck: Arbitrary instances of nested data structures that generate balanced specimens

tl;dr: how do you write instances of Arbitrary that don't explode if your data type allows for way too much nesting? And how would you guarantee these instances produce truly random specimens of your data structure?
I want to generate random tree structures, then test certain properties of these structures after I've mangled them with my library code. (NB: I'm writing an implementation of a subtyping algorithm, i.e. given a hierarchy of types, is type A a subtype of type B. This can be made arbitrarily complex, by including multiple-inheritance and post-initialization updates to the hierarchy. The classical method that supports neither of these is Schubert Numbering, and the latest result known to me is Alavi et al. 2008.)
Let's take the example of rose-trees, following Data.Tree:
data Tree a = Node a (Forest a)
type Forest a = [Tree a]
A very simple (and don't-try-this-at-home) instance of Arbitray would be:
instance (Arbitrary a) => Arbitrary (Tree a) where
arbitrary = Node <$> arbitrary <$> arbitrary
Since a already has an Arbitrary instance as per the type constraint, and the Forest will have one, because [] is an instance, too, this seems straight-forward. It won't (typically) terminate for very obvious reasons: since the lists it generates are arbitrarily long, the structures become too large, and there's a good chance they won't fit into memory. Even a more conservative approach:
arbitrary = Node <$> arbitrary <*> oneof [arbitrary,return []]
won't work, again, for the same reason. One could tweak the size parameter, to keep the length of the lists down, but even that won't guarantee termination, since it's still multiple consecutive dice-rolls, and it can turn out quite badly (and I want the odd node with 100 children.)
Which means I need to limit the size of the entire tree. That is not so straight-forward. unordered-containers has it easy: just use fromList. This is not so easy here: How do you turn a list into a tree, randomly, and without incurring bias one way or the other (i.e. not favoring left-branches, or trees that are very left-leaning.)
Some sort of breadth-first construction (the functions provided by Data.Tree are all pre-order) from lists would be awesome, and I think I could write one, but it would turn out to be non-trivial. Since I'm using trees now, but will use even more complex stuff later on, I thought I might try to find a more general and less complex solution. Is there one, or will I have to resort to writing my own non-trivial Arbitrary generator? In the latter case, I might actually just resort to unit-tests, since this seems too much work.
Use sized:
instance Arbitrary a => Arbitrary (Tree a) where
arbitrary = sized arbTree
arbTree :: Arbitrary a => Int -> Gen (Tree a)
arbTree 0 = do
a <- arbitrary
return $ Node a []
arbTree n = do
(Positive m) <- arbitrary
let n' = n `div` (m + 1)
f <- replicateM m (arbTree n')
a <- arbitrary
return $ Node a f
(Adapted from the QuickCheck presentation).
P.S. Perhaps this will generate overly balanced trees...
You might want to use the library presented in the paper "Feat: Functional Enumeration of Algebraic Types" at the Haskell Symposium 2012. It is on Hackage as testing-feat, and a video of the talk introducing it is available here: http://www.youtube.com/watch?v=HbX7pxYXsHg
As Janis mentioned, you can use the package testing-feat, which creates enumerations of arbitrary algebraic data types. This is the easiest way to create unbiased uniformly distributed generators
for all trees of up to a given size.
Here is how you would use it for rose trees:
import Test.Feat (Enumerable(..), uniform, consts, funcurry)
import Test.Feat.Class (Constructor)
import Data.Tree (Tree(..))
import qualified Test.QuickCheck as QC
-- We make an enumerable instance by listing all constructors
-- for the type. In this case, we have one binary constructor:
-- Node :: a -> [Tree a] -> Tree a
instance Enumerable a => Enumerable (Tree a) where
enumerate = consts [binary Node]
where
binary :: (a -> b -> c) -> Constructor c
binary = unary . funcurry
-- Now we use the Enumerable instance to create an Arbitrary
-- instance with the help of the function:
-- uniform :: Enumerable a => Int -> QC.Gen a
instance Enumerable a => QC.Arbitrary (Tree a) where
QC.arbitrary = QC.sized uniform
-- QC.shrink = <some implementation>
The Enumerable instance can also be generated automatically with TemplateHaskell:
deriveEnumerable ''Tree

Binary Serialization for Lists of Undefined Length in Haskell

I've been using Data.Binary to serialize data to files. In my application I incrementally add items to these files. The two most popular serialization packages, binary and cereal, both serialize lists as a count followed by the list items. Because of this, I can't append to my serialized files. I currently read in the whole file, deserialize the list, append to the list, re-serialize the list, and write it back out to the file. However, my data set is getting large and I'm starting to run out of memory. I could probably go around unboxing my data structures to gain some space, but that approach doesn't scale.
One solution would be to get down and dirty with the file format to change the initial count, then just append my elements. But that's not very satisfying, not to mention being sensitive to future changes in the file format as a result of breaking the abstraction. Iteratees/Enumerators come to mind as an attractive option here. I looked for a library combining them with a binary serialization, but didn't find anything. Anyone know if this has been done already? If not, would a library for this be useful? Or am I missing something?
So I say stick with Data.Binary but write a new instance for growable lists. Here's the current (strict) instance:
instance Binary a => Binary [a] where
put l = put (length l) >> mapM_ put l
get = do n <- get :: Get Int
getMany n
-- | 'getMany n' get 'n' elements in order, without blowing the stack.
getMany :: Binary a => Int -> Get [a]
getMany n = go [] n
where
go xs 0 = return $! reverse xs
go xs i = do x <- get
x `seq` go (x:xs) (i-1)
{-# INLINE getMany #-}
Now, a version that lets you stream (in binary) to append to a file would need to be eager or lazy. The lazy version is the most trivial. Something like:
import Data.Binary
newtype Stream a = Stream { unstream :: [a] }
instance Binary a => Binary (Stream a) where
put (Stream []) = putWord8 0
put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)
get = do
t <- getWord8
case t of
0 -> return (Stream [])
1 -> do x <- get
Stream xs <- get
return (Stream (x:xs))
Massaged appropriately works for streaming. Now, to handle silently appending, we'll need to be able to seek to the end of the file, and overwrite the final 0 tag, before adding more elements.
It's four years since this question has been answered, but I ran into the same problems as gatoatigrado in the comment to Don Stewart's answer. The put method works as advertised, but get reads the whole input. I believe the problem lies in the pattern match in the case statement, Stream xs <- get, which must determine whether or not the remaining get is a Stream a or not before returning.
My solution used the example in Data.Binary.Get as a starting point:
import Data.ByteString.Lazy(toChunks,ByteString)
import Data.Binary(Binary(..),getWord8)
import Data.Binary.Get(pushChunk,Decoder(..),runGetIncremental)
import Data.List(unfoldr)
decodes :: Binary a => ByteString -> [a]
decodes = runGets (getWord8 >> get)
runGets :: Get a -> ByteString -> [a]
runGets g = unfoldr (decode1 d) . toChunks
where d = runGetIncremental g
decode1 _ [] = Nothing
decode1 d (x:xs) = case d `pushChunk` x of
Fail _ _ str -> error str
Done x' _ a -> Just (a,x':xs)
k#(Partial _) -> decode1 k xs
Note the use of getWord8 This is to read the encoded [] and : resulting from the definition of put for the stream instance. Also note, since getWord8 ignores the encoded [] and : symbols, this implementation will not detect the end of the list. My encoded file was just a single list so it works for that, but otherwise you'll need to modify.
In any case, this decodes ran in constant memory in both cases of accessing the head and last elements.

How to force haskell to don't store whole bytestring?

I writing a small (relatively) application on haskell in academic purpose. I'm implementing a Huffman compression, based on this code http://www.haskell.org/haskellwiki/Toy_compression_implementations .
My variant of this code is here https://github.com/kravitz/har/blob/a5d221f227c27fd1c5587217a29a169a377521a6/huffman.hs and it uses lazy bytestrings. When I implemented RLE compression everything was smooth, because it process the input stream in one step. But Huffman process it twice and as a result I have an evaluated bytestring stored in memory, which is bad for a big files (but for relatively small files it allocates too much space in heap too). That is not only my suspicion, because profiling also shows that most of the heap eaten by bytestring allocation.
Also I seriallizing a stream length in file, and it also may cause the full bytestring loading in memory. Is there any simple way to say ghc be kindly and re-evaluate stream several times?
Instead of passing a bytestring to the encoder, you can pass something that computes a bytestring, then explicitly recompute the value each time you need it.
compress :: ST s ByteString -> ST s ByteString
compress makeInput = do
len <- (return $!) . ByteString.length =<< makeInput
codebook <- (return $!) . makeCodebook =<< makeInput
return . encode len codebook =<< makeInput
compressIO :: IO ByteString -> IO ByteString
compressIO m = stToIO (compress (unsafeIOToST m))
The parameter to compress should actually compute the value. Simply wrapping a value with return won't work. Also, each call to makeInput must actually have its result evaluated, else there will remain a lazy, un-evaluated copy of the input in memory when the input is recomputed.
The usual approach, as barsoap said, is to just compress one block at a time.
The usual approach when (Huffmann-)compressing, as one can't get around processing the input twice, once to collect the probability distribution, and once to do the actual compressing, is to chunk up the input into blocks and compress each separately. While that still eats memory, it only eats, maximally, a constant amount.
That said, you might want to have a look at bytestring-mmap, though that won't work with standard input, sockets, and other file descriptors that aren't backed by a file system which supports mmap.
You can also re-read the bytestring from file (again, provided you're not receiving it from anything pipe-like) after collecting the probability distribution, but that will still make your code bail out on say 1TB files.