I have a large data structure, a tree, that takes up about 2 GB of RAM. It has Clojure sets at the leaves and refs as the branches. The tree is built by reading and parsing a large flat file and inserting the rows into the tree; however, this takes about 30 seconds. Is there a way I can build the tree once, emit it to a .clj file, and then compile the tree into my standalone jar, so I can look up values in the tree without re-reading the large text file? I think this will trim out the 30-second tree build, but it will also let me deploy my standalone jar without needing the text file to come along for the ride.
My first swing at this failed:
(def x (ref {:zebra (ref #{1 2 3 4})}))
#<Ref#6781a7dc: {:zebra #<Ref#709c4f85: #{1 2 3 4}>}>
(def y #<Ref#6781a7dc: {:zebra #<Ref#709c4f85: #{1 2 3 4}>}>)
RuntimeException Unreadable form clojure.lang.Util.runtimeException (Util.java:219)
Embedding data this big in compiled code may not be possible because of size limits imposed by the JVM; in particular, no single method may exceed 64 KiB in length. Embedding data in the way I describe further below also means including tons of stuff in the class file it's going to live in, which doesn't seem like a great idea.
Given that you're using the data structure read-only, you can construct it once, emit it to a .clj / .edn file (edn is the serialization format based on Clojure literal notation), and include that file on your classpath as a "resource", so that it's bundled into the überjar (put it in resources/ with default Leiningen settings; it will then get included in the überjar unless excluded by :uberjar-exclusions in project.clj), and read it from the resource at runtime at the full speed of Clojure's reader:
(ns foo.core
(:require [clojure.java.io :as io]))
(defn get-the-huge-data-structure []
  (let [r   (io/resource "huge.edn")
        rdr (java.io.PushbackReader. (io/reader r))]
    (read rdr)))
;; if you then do something like this:
(def ds (get-the-huge-data-structure))
;; your app will load the data as soon as this namespace is required;
;; for your :main namespace, this means as soon as the app starts;
;; note that if you use AOT compilation, it'll also be loaded at
;; compile time
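For the emitting side, a minimal sketch; it assumes the tree has already been converted to plain data (refs deref'd away), and the output path is only illustrative:
(defn emit-the-huge-data-structure [tree]
  (binding [*print-length* nil
            *print-level*  nil]   ; guard against REPL-set print limits truncating the output
    (spit "resources/huge.edn" (pr-str tree))))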
You could also not add it to the überjar, but rather add it to the classpath when running your app. This way your überjar itself would not have to be huge.
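For example (jar and directory names here are made up, and foo.core is assumed to have a -main), if huge.edn lives in a directory such as /opt/myapp/data instead of inside the jar, you could start the app with something like:
java -cp myapp-standalone.jar:/opt/myapp/data clojure.main -m foo.core
io/resource will then find huge.edn through the extra classpath entry (on Windows the separator is ; instead of :).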
Handling stuff other than persistent Clojure data could be accomplished using print-method (when serializing) and reader tags (when deserializing). Arthur already demonstrated using reader tags; to use print-method, you'd do something like
(defmethod print-method clojure.lang.Ref [x writer]
  (.write writer "#ref ")
  (print-method @x writer))
;; from the REPL, after doing the above:
user=> (pr-str {:foo (ref 1)})
"{:foo #ref 1}"
Of course you only need to have the print-method methods defined when serializing; your deserializing code can leave them alone, but it will need appropriate data readers.
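For instance, here is a small sketch of the reading side using clojure.edn with an explicit :readers map; the tag matches the #ref tag emitted above, and clojure.core/ref itself serves as the reader function (in your own code you'd probably prefer a namespaced tag, since unqualified tags are reserved for Clojure):
(require '[clojure.edn :as edn])

;; read back a string produced with the print-method above; each #ref form
;; is handed to clojure.core/ref, recreating a Ref around the value
(edn/read-string {:readers {'ref ref}} "{:foo #ref 1}")
;; => {:foo <a Ref holding 1>}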
Disregarding the code size issue for a moment, as I find the data embedding issue interesting:
Assuming your data structure only contains immutable data natively handled by Clojure (Clojure persistent collections, arbitrarily nested, plus atomic items such as numbers, strings (atomic for this purpose), keywords, symbols; no Refs etc.), you can indeed include it in your code:
(defmacro embed [x]
x)
The generated bytecode will then recreate x without reading anything, by using constants included in the class file and static methods of the clojure.lang.RT class (e.g. RT.vector and RT.map).
This is, of course, how literals are compiled, since the macro above is a noop. We can make things more interesting though:
(ns embed-test.core
(:require [clojure.java.io :as io])
(:gen-class))
(defmacro embed-resource [r]
  (let [r   (io/resource r)
        rdr (java.io.PushbackReader. (io/reader r))]
    (read rdr)))
(defn -main [& args]
(println (embed-resource "foo.edn")))
This will read foo.edn at compile time and embed the result in the compiled code (in the sense of including appropriate constants and code to reconstruct the data in the class file). At run time, no further reading will be performed.
Is this structure something that doesn't change? If not, consider using Java serialization to persist the structure. Deserializing will be much faster than rebuilding every time.
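A rough sketch of what that could look like; as far as I know, Clojure's persistent collections, keywords and symbols are java.io.Serializable, but Refs are not, so you'd serialize the deref'd, plain-data version of the tree (the file name below is illustrative):
(import '[java.io FileOutputStream FileInputStream
                  ObjectOutputStream ObjectInputStream])

(defn freeze [x file]
  (with-open [out (ObjectOutputStream. (FileOutputStream. file))]
    (.writeObject out x)))

(defn thaw [file]
  (with-open [in (ObjectInputStream. (FileInputStream. file))]
    (.readObject in)))

;; (freeze {:zebra #{1 2 3 4}} "tree.bin")
;; (thaw "tree.bin")  ; => {:zebra #{1 2 3 4}}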
If you can structure the tree to be a single value, instead of a tree of references to many values, then you will be able to print the tree and read it back. Because refs are not readable, you won't be able to treat the entire tree as something readable without doing your own parsing.
It may be worth looking into using the extensible reader to add print and read functions for your tree by making it a type.
Here is a minimal example of using data readers to produce references to sets and maps from a string.
First, define handlers for the contents of each EDN tag/type:
user> (defn parse-map-ref [m] (ref (apply hash-map m)))
#'user/parse-map-ref
user> (defn parse-set-ref [s] (ref (set s)))
#'user/parse-set-ref
Then bind *data-readers* to associate the handlers with the textual tags:
(def y-as-string
"#user/map-ref [:zebra #user/set-ref [1 2 3 4]]")
user> (def y (binding [*data-readers* {'user/set-ref user/parse-set-ref
'user/map-ref user/parse-map-ref}]
(read-string y-as-string)))
user> y
#<Ref#6d130699: {:zebra #<Ref#7c165ec0: #{1 2 3 4}>}>
This also works with more deeply nested trees:
(def z-as-string
"#user/map-ref [:zebra #user/set-ref [1 2 3 4]
   :ox #user/map-ref [:animal #user/set-ref [42]]]")
user> (def z (binding [*data-readers* {'user/set-ref user/parse-set-ref
'user/map-ref user/parse-map-ref}]
(read-string z-as-string)))
#'user/z
user> z
#<Ref#2430c1a0: {:ox #<Ref#7cf801ef: {:animal #<Ref#7e473201: #{42}>}>,
:zebra #<Ref#7424206b: #{1 2 3 4}>}>
Producing strings from trees can be accomplished by extending the print-method multimethod, though it would be a lot easier if you defined types for the ref-map and ref-set cases using deftype, so the printer can know which refs should produce which strings.
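For example, here is a rough sketch that skips deftype and instead decides which tag to emit by looking at what each ref currently holds (define this only while serializing, as noted in the other answer):
;; decide which tag to emit based on the ref's current contents;
;; anything that is neither a map nor a set is printed plainly
(defmethod print-method clojure.lang.Ref [r ^java.io.Writer w]
  (let [v @r]
    (cond
      (map? v) (do (.write w "#user/map-ref ")
                   (print-method (vec (apply concat v)) w))
      (set? v) (do (.write w "#user/set-ref ")
                   (print-method (vec v) w))
      :else    (print-method v w))))
With this in place, (pr-str y) should produce exactly the kind of string that the parse-map-ref / parse-set-ref readers above can turn back into refs, e.g. "#user/map-ref [:zebra #user/set-ref [1 2 3 4]]".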
If reading them back as strings is in general too slow, there are faster binary serialization libraries, such as protocol buffers.
I am new to Functional programming.
The challenge I have is regarding the mental map of how a binary search tree works in Haskell.
In other languages (C, C++) we have something called the root. We store it in a variable, insert elements into it, do balancing, etc.
The program takes a break, does other things (maybe processes user input, creates threads), and then figures out it needs to insert a new element into the already-created tree. It knows the root (stored in a variable) and invokes the insert function with the root and the new value.
So far so good in other languages. But how do I mimic such a thing in Haskell?
I see functions for converting a list to a binary tree, inserting a value, etc. That's all good.
I want this functionality to be part of a bigger program, so I need to know what the root is so that I can use it to insert into the tree again. Is that possible? If so, how?
Note: is it simply not possible because data structures are immutable, so we cannot use the root at all to insert something? In that case, how is the above situation handled in Haskell?
It all happens in the same way, really, except that instead of mutating the existing tree variable we derive a new tree from it and remember that new tree instead of the old one.
For example, a sketch in C++ of the process you describe might look like:
int main(void) {
Tree<string> root;
while (true) {
string next;
cin >> next;
if (next == "quit") exit(0);
root.insert(next);
doSomethingWith(root);
}
}
A variable, a read action, and a loop with a mutate step. In Haskell, we do the same thing, but using recursion for looping and a recursion parameter instead of mutating a local variable.
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
let t' = insert next t
doSomethingWith t'
loop t'
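For reference, here is a self-contained variant of the sketch above; Empty, insert and doSomethingWith are placeholders, so this version uses Data.Set as a stand-in for the BST and print as the "do something" step:
-- a runnable variant of the sketch above
import Control.Monad (when)
import qualified Data.Set as Set

main :: IO ()
main = loop Set.empty
  where
    loop t = do
      next <- getLine
      when (next /= "quit") $ do
        let t' = Set.insert next t   -- derive a new tree; t is left untouched
        print t'                     -- plays the role of doSomethingWith
        loop t'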
If you need doSomethingWith to be able to "mutate" t as well as read it, you can lift your program into State:
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
loop (execState doSomethingWith (insert next t))
Writing an example with a BST would take too much time, so I'll give you an analogous example using lists.
Let's invent an updateListN function which updates the n-th element in a list.
updateListN :: Int -> a -> [a] -> [a]
updateListN i n l = take (i - 1) l ++ n : drop i l
Now for our program:
list = [1,2,3,4,5,6,7,8,9,10] -- The big data structure we might want to use multiple times
main = do
-- only for shows
print $ updateListN 3 30 list -- [1,2,30,4,5,6,7,8,9,10]
print $ updateListN 8 80 list -- [1,2,3,4,5,6,7,80,9,10]
-- now some illustrative complicated processing
let list' = foldr (\i l -> updateListN i (i*10) l) list list
-- list' = [10,20,30,40,50,60,70,80,90,100]
-- Our crazily complicated illustrative algorithm still needs `list`
print $ zipWith (-) list' list
-- [9,18,27,36,45,54,63,72,81,90]
See how we "updated" list but it was still available? Most data structures in Haskell are persistent, so updates are non-destructive. As long as we have a reference of the old data around we can use it.
As for your comment:
My program is trying the following: a) convert a list to a binary search tree, b) do some I/O operation, c) ask for a user input to insert a new value in the created binary search tree, d) insert it into the already created list. This is what the program intends to do. Not sure how to get this done in Haskell, or am I stuck in the old mindset? Any ideas/hints welcome.
We can sketch a program:
data BST
readInt :: IO Int; readInt = undefined
toBST :: [Int] -> BST; toBST = undefined
printBST :: BST -> IO (); printBST = undefined
loop :: [Int] -> IO ()
loop list = do
int <- readInt
let newList = int : list
let bst = toBST newList
printBST bst
loop newList
main = loop []
"do balancing" ... "It knows the root" nope. After re-balancing the root is new. The function balance_bst must return the new root.
Same in Haskell, but also with insert_bst. It too will return the new root, and you will use that new root from that point forward.
Even if the new root's value is the same, in Haskell it's a new root, since one of its children has changed.
See ''How to "think functional"'' here.
Even in C++ (or other imperative languages), it would usually be considered a poor idea to have a single global variable holding the root of the binary search tree.
Instead code that needs access to a tree should normally be parameterised on the particular tree it operates on. That's a fancy way of saying: it should be a function/method/procedure that takes the tree as an argument.
So if you're doing that, then it doesn't take much imagination to figure out how several different sections of code (or one section, on several occasions) could get access to different versions of an immutable tree. Instead of passing the same tree to each of these functions (with modifications in between), you just pass a different tree each time.
It's only a little more work to imagine what your code needs to do to "modify" an immutable tree. Obviously you won't produce a new version of the tree by directly mutating it, you'll instead produce a new value (probably by calling methods on the class implementing the tree for you, but if necessary by manually assembling new nodes yourself), and then you'll return it so your caller can pass it on - by returning it to its own caller, by giving it to another function, or even calling you again.
Putting that all together, you can have your whole program manipulate (successive versions of) this binary tree without ever having it stored in a global variable that is "the" tree. An early function (possibly even main) creates the first version of the tree, passes it to the first thing that uses it, gets back a new version of the tree and passes it to the next user, and so on. And each user of the tree can call other subfunctions as needed, with possibly many of new versions of the tree produced internally before it gets returned to the top level.
Note that I haven't actually described any special features of Haskell here. You can do all of this in just about any programming language, including C++. This is what people mean when they say that learning other styles of programming makes them better programmers even in imperative languages they already knew. You can see that your habits of thought are drastically more limited than they need to be: you could not imagine how to deal with a structure "changing" over the course of your program without a single variable holding a structure that is mutated, when in fact that is just a small part of the tools that even C++ gives you for approaching the problem. If you can only imagine that one way of dealing with it, you'll never notice when other ways would be more helpful.
Haskell also has a variety of tools it can bring to this problem that are less common in imperative languages, such as (but not limited to):
Using the State monad to automate and hide much of the boilerplate of passing around successive versions of the tree.
Function arguments allow a function to be given an unknown "tree-consumer" function, to which it can give a tree, without any one place both having the tree and knowing which function it's passing it to.
Lazy evaluation sometimes negates the need to even have successive versions of the tree; if the modifications are expanding branches of the tree as you discover they are needed (like a move-tree for a game, say), then you could alternatively generate "the whole tree" up front even if it's infinite, and rely on lazy evaluation to limit how much work is done generating the tree to exactly the amount you need to look at.
Haskell does in fact have mutable variables, it just doesn't have functions that can access mutable variables without exposing in their type that they might have side effects. So if you really want to structure your program exactly as you would in C++ you can; it just won't really "feel like" you're writing Haskell, won't help you learn Haskell properly, and won't allow you to benefit from many of the useful features of Haskell's type system.
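For completeness, a tiny sketch of that mutable-variable style, using an IORef (and Data.Set again standing in for a BST); note how the IO type records the fact that side effects happen:
import Data.IORef
import qualified Data.Set as Set

main :: IO ()
main = do
  rootRef <- newIORef (Set.empty :: Set.Set String)
  line <- getLine
  modifyIORef rootRef (Set.insert line)   -- update the shared root in place
  readIORef rootRef >>= print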
In the SWI Prolog manual, I found the following remark:
For example, assume an application that can reason about multiple worlds. It is attractive to store the data of a particular world in a module, so we extract information from a world simply by invoking goals in this world.
This is actually a very good description of what I'm trying to achieve. However I ran into a problem. While I do want to model many different worlds, there are also things that I want to share across all of them. So my idea is to have an allworlds module for things that are true in every world, and one module for every world that I want to reason about, and the latter imports from the former. So I'd do something like this in the REPL:
allworlds:asserta((grandparent(X, Z) :- (parent(X, Y), parent(Y, Z)))).
allworlds:dynamic(parent/2).
add_import_module(greece, allworlds, start).
greece:asserta(parent(kronos, zeus)).
greece:asserta(parent(zeus, ares)).
Now I'd like to query greece:grandparent(kronos, X) and get X = ares, but all I get is false. When allworlds:grandparent calls parent, it doesn't call greece:parent like I want it to, but allworlds:parent. My research seems to indicate that I need to make the grandparent predicate module-transparent. But calling allworlds:module_transparent(grandparent/2). didn't fix the issue, and it's also deprecated. This is where I'm stuck. How can I get this working? Is meta_predicate/1 part of the solution? Unfortunately I can't make heads or tails of its documentation.
Prolog modules don't provide a good solution for the "many worlds" design pattern. Notably, making the predicates meta-predicates (or module-transparent, or multifile) would be a problematic hack. But this pattern is trivial in Logtalk, a language that extends Prolog and can use most Prolog systems as a backend compiler. A minimal (but not unique) solution for your problem is:
:- object(allworlds).
:- public(grandparent/2).
grandparent(X, Z) :-
::parent(X, Y),
::parent(Y, Z).
:- public(parent/2).
:- end_object.
:- object(greece,
extends(allworlds)).
parent(kronos, zeus).
parent(zeus, ares).
:- end_object.
Here, we use inheritance (the individual worlds inherit the common knowledge) and messages to self (the ::/1 control construct) when common predicates need to access world-specific predicate definitions. Self is the object/world that received the original message; in the example, that's greece, which receives the grandparent/2 message.
Assuming the code is saved in a worlds.lgt file and that you're using SWI-Prolog as the backend:
$ swilgt
...
?- {worlds}.
% [ /Users/pmoura/worlds.lgt loaded ]
% (0 warnings)
true.
?- greece::grandparent(kronos, X).
X = ares.
P.S. If running on Windows, use the "Logtalk - SWI-Prolog" shortcut from the Start Menu after installing Logtalk.
I ultimately solved this by passing the module around explicitly and invoking predicates in it with the : operator. It reminds me a bit of doing OOP in C, where you do things like obj->vtable->method(obj, params) (note how obj is mentioned twice, just like the M in my code below).
Similar to the Logtalk solution, I need to explicitly call into the imported module when I want to consider its clauses. As an example, I've added the fact that a father is also a parent to the allworlds module.
allworlds:assertz((grandparent(M, X, Z) :- (M:parent(M, X, Y), M:parent(M, Y, Z)))).
allworlds:assertz((parent(M, X, Y) :- M:father(M, X, Y))).
add_import_module(greece, allworlds, start).
greece:assertz(parent(_, kronos, zeus)).
% need to call into allworlds explicitly
greece:assertz((parent(M, X, Y) :- allworlds:parent(M, X, Y))).
greece:assertz(father(_, zeus, ares)).
After making these assertions, I can call greece:grandparent(greece, kronos, X). and get the expected result X = ares.
We all know that Rich Hickey uses an ideal-hash-tree-based method to implement the persistent data structures in Clojure. This structure lets us manipulate persistent data structures without a lot of copying.
But I can't seem to find the right way to serialize this specific structure. For example, given:
(def foo {:a :b :c :d})
(def bar (assoc foo :e :f))
(def bunny {:foo foo :bar bar})
My question is:
How can I serialize bunny such that the contents of foo, i.e. :a mapping to :b and :c mapping to :d, appear only once in the serialized content? It's like dumping a memory image of the structures, or like serializing the "internal nodes" as well as the "leaf nodes" referenced here.
P.S. In case this is relevant: I am building a big DAG (directed acyclic graph) where we assoc quite a bit to link these nodes to those nodes, and I want to serialize the DAG for later de-serialization. The expanded representation of the graph (i.e., the content one gets when printing the DAG at the REPL) is unacceptably long.
Davyzhu,
A few things first:
Without a tokenization strategy, the printed DAG will be as long as the DAG is: if foo is referenced one or more times, each reference will be fully realized (i.e. displayed) in turn during printing.
How you interchange the information (serialize and deserialize) depends largely on your goals. For example, if you are serializing to send it over the wire, you will either want to do it fully (like the printed representation) or you will need to encode individual data points with some identification/tokenization strategy. The latter, of course, assumes the receiving end can deserialize with an understanding of the tokenization protocol.
A tokenization strategy (which could perhaps use Clojure's meta facilities) would require encoding a unique key for each content block, with your DAG containing nodes whose edges are represented by those keys.
Edit: modified since the original post to clarify, as per the comments, though the example does not reflect the hierarchical nature of the DAG.
A contrived example:
(def node1 {:a :b :c :d})
(def node2 {:e :f})
(def dictionary {:foo node1 :bar node2})
(def DAG [:bunny [:foo :bar]])
(println DAG) ; => [:bunny [:foo :bar]]
(require '[clojure.walk :as w]) ; expand-dag1 uses clojure.walk/postwalk

(defn expand-dag1
  [x]
  (if (keyword? x)
    (get dictionary x x)
    x))
(println (w/postwalk expand-dag1 DAG)) ; => [:bunny [{:a :b, :c :d} {:e :f}]]
Note: Use of vectors, maps, lists, etc. to express your DAG is up to you.
This is one option (that will not work in Clojurescript, in case that matters) and in general may be perceived as a bad idea, but it is worth mentioning anyway.
If I understand your question, you want the foo in (def bunny {:foo foo :bar bar}) to not be "pasted" as a full copy, but rather retain a "reference" to the (def foo..) such that the original foo map is only serialized once.
One technique I would consider, though not necessarily encourage (and only after exhausting other options, such as a reorganization of your data structure as hinted at by Frank C.), is to serialize the code for bunny rather than the structure itself. Then you read the code string back in and eval it. This would only work if the structure of bunny does not change, or, if it does, if you can easily build a string of the bunny map with the relevant symbols included as part of the string, rather than the contents of those symbols.
But a much better idea would be to serialize your "raw" data structures only, like the maps foo and bar, then build your bunny after these are read back in -- by also serializing the structure but not the contents of bunny. I believe this is what Frank's answer is getting at.
Worth noting that if the structure of bunny does change dynamically, and you are able to create a string of symbols as suggested in 1. above, then that means you also have the tools to instead build a representation of bunny as in 2. above, which would be preferable.
Since code is data, option 1. is an example of the type of flexibility available to us as lisp programmers -- but that doesn't mean there are not better options.
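To make the second suggestion concrete, here is a hedged sketch (the names :extra and bunny* are invented for the example): write out the raw pieces plus a content-free skeleton of bunny, then reassemble after reading back in.
(require '[clojure.edn :as edn])

(def serialized
  (pr-str {:foo   {:a :b :c :d}     ; foo's contents, written once
           :extra {:e :f}           ; only what bar adds on top of foo
           :bunny [:foo :bar]}))    ; bunny's shape, by name only

(def bunny*
  (let [{:keys [foo extra bunny]} (edn/read-string serialized)
        bar (merge foo extra)]      ; rebuild bar from foo plus the extras
    (zipmap bunny [foo bar])))
;; bunny* => {:foo {:a :b, :c :d}, :bar {:a :b, :c :d, :e :f}}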
I'm installing the code for the book Artificial Intelligence: A Modern Approach, which I got here: http://aima.cs.berkeley.edu/lisp/doc/install.html (it's the Lisp version I'm installing, by the way).
I'm on Ubuntu Trusty using Emacs, SBCL, and SLIME; I placed the code in ~/.emacs.d,
so, per the instructions at the above link, I run (load "/home/w/.emacs.d/aima/code/aima.lisp"),
which loads fine; I get T as output.
I run (aima-load 'all); that works, and I get T as output.
But when I run (aima-compile) I get the error:
Can't declare constant variable locally special: +NO-BINDINGS+
[Condition of type SIMPLE-ERROR]
I'm not sure I understand the error. I read the HyperSpec entries on declare and special, but that didn't help. I'm not opposed to a hack to make this work, so if someone can help me reword the code, or figure out whether there's an Emacs or SBCL setting I could change to get the aforementioned variable declared, that would be great. The error message refers to this file: /home/w/.emacs.d/aima/code/logic/algorithms/tell-ask.lisp,
specifically this section:
(defmethod ask-each ((kb literal-kb) query fn)
"For each proof of query, call fn on the substitution that
the proof ends up with."
(declare (special +no-bindings+))
(for each s in (literal-kb-sentences kb) do
(when (equal s query) (funcall fn +no-bindings+))))
As a first step, I verified that the permissions of all the files in my aima folder are set to read/write for my username, with my username as owner. But as far as understanding the error goes, I could use help figuring out the next step one would take to debug this. The code is downloadable here: http://aima.cs.berkeley.edu/lisp/code.tar.gz and it's an easy install for a veteran Emacs/Lisp user. Any help is appreciated.
Using SBCL:
I tried loading the AIMA code as well and hit the same problem in SBCL. Just comment out the line in ./logic/algorithms/tell-ask.lisp:
(declare (special +no-bindings+))
like so
;;(declare (special +no-bindings+))
or delete the whole line altogether. Either way, this is what I ended up with:
(defmethod ask-each ((kb literal-kb) query fn)
"For each proof of query, call fn on the substitution that
the proof ends up with."
;;(declare (special +no-bindings+))
(for each s in (literal-kb-sentences kb) do
(when (equal s query) (funcall fn +no-bindings+))))
You will still have an issue running (aima-compile) with SBCL: it will complain about some constants being redefined. Just look at the possible restarts and select, every time:
0: [CONTINUE] Go ahead and change the value.
Do this as many times as it asks (about 6 times) and it will load eventually. This is probably happening because of the AIMA code's non-standard build/compile system. This can be annoying, but an alternative is to trace the code and see why/where some files are being reloaded.
Using Clozure (CCL):
CCL has a different problem, with ./utilities/utilities.lisp. CCL has both true and false predefined as functions, so you have to make sure that both lines:
#-(or MCL Lispworks)
that directly precede both (defun true ...) and (defun false ...) are changed to:
#-(or MCL Lispworks CCL)
Also, in the same source file, modify the error call inside the for-each macro to look like this:
(error "~a is an illegal variable in (for each ~a in ~a ...)"
var var list)
With these modifications, CCL seems to load the AIMA code just fine.
In general:
It's a bad idea to redefine constants or to somehow bypass debugger restarts. The best solution is to have the (defconstant ...) forms evaluated only once, perhaps by placing them in a separate source file and making sure that the build system picks it up only once.
Another solution, found here, entails wrapping calls to defconstant in a macro like so (borrowed from here):
(defmacro define-constant (name value &optional doc)
  (if (boundp name)
      (format t
              "~&already defined ~A~%old value ~s~%attempted value ~s~%"
              name (symbol-value name) value))
  `(defconstant ,name (if (boundp ',name) (symbol-value ',name) ,value)
     ,@(when doc (list doc))))
And then replacing all occurrences of defconstant like so:
(defconstant +no-bindings+ '((nil))
"Indicates unification success, with no variables.")
with:
(define-constant +no-bindings+ '((nil))
"Indicates unification success, with no variables.")
If you opt for the define-constant "solution", make sure that define-constant itself is evaluated first.
tl;dr: how do you write instances of Arbitrary that don't explode if your data type allows for way too much nesting? And how would you guarantee these instances produce truly random specimens of your data structure?
I want to generate random tree structures, then test certain properties of these structures after I've mangled them with my library code. (NB: I'm writing an implementation of a subtyping algorithm, i.e. given a hierarchy of types, is type A a subtype of type B. This can be made arbitrarily complex, by including multiple-inheritance and post-initialization updates to the hierarchy. The classical method that supports neither of these is Schubert Numbering, and the latest result known to me is Alavi et al. 2008.)
Let's take the example of rose-trees, following Data.Tree:
data Tree a = Node a (Forest a)
type Forest a = [Tree a]
A very simple (and don't-try-this-at-home) instance of Arbitrary would be:
instance (Arbitrary a) => Arbitrary (Tree a) where
  arbitrary = Node <$> arbitrary <*> arbitrary
Since a already has an Arbitrary instance as per the type constraint, and the Forest will have one because [] is an instance too, this seems straightforward. It won't (typically) terminate, for very obvious reasons: since the lists it generates are arbitrarily long, the structures become too large, and there's a good chance they won't fit into memory. Even a more conservative approach:
arbitrary = Node <$> arbitrary <*> oneof [arbitrary,return []]
won't work, again for the same reason. One could tweak the size parameter to keep the length of the lists down, but even that won't guarantee termination, since it's still multiple consecutive dice rolls, and it can turn out quite badly (and I do want the odd node with 100 children).
Which means I need to limit the size of the entire tree. That is not so straightforward. unordered-containers has it easy: just use fromList. This is not so easy here: how do you turn a list into a tree, randomly, and without incurring bias one way or the other (i.e. not favoring left branches, or trees that are very left-leaning)?
Some sort of breadth-first construction (the functions provided by Data.Tree are all pre-order) from lists would be awesome, and I think I could write one, but it would turn out to be non-trivial. Since I'm using trees now, but will use even more complex stuff later on, I thought I might try to find a more general and less complex solution. Is there one, or will I have to resort to writing my own non-trivial Arbitrary generator? In the latter case, I might actually just resort to unit-tests, since this seems too much work.
Use sized:
instance Arbitrary a => Arbitrary (Tree a) where
arbitrary = sized arbTree
arbTree :: Arbitrary a => Int -> Gen (Tree a)
arbTree 0 = do
a <- arbitrary
return $ Node a []
arbTree n = do
(Positive m) <- arbitrary
let n' = n `div` (m + 1)
f <- replicateM m (arbTree n')
a <- arbitrary
return $ Node a f
(Adapted from the QuickCheck presentation).
P.S. Perhaps this will generate overly balanced trees...
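If you want to eyeball what this generator produces, something like the following should work, assuming the definitions above are in scope (sample' and sized come from Test.QuickCheck):
-- run the generator at a range of sizes and collect the results
demo :: IO [Tree Int]
demo = sample' (sized arbTree)
-- in GHCi: demo >>= mapM_ (putStrLn . drawTree . fmap show)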
You might want to use the library presented in the paper "Feat: Functional Enumeration of Algebraic Types" at the Haskell Symposium 2012. It is on Hackage as testing-feat, and a video of the talk introducing it is available here: http://www.youtube.com/watch?v=HbX7pxYXsHg
As Janis mentioned, you can use the package testing-feat, which creates enumerations of arbitrary algebraic data types. This is the easiest way to create unbiased uniformly distributed generators
for all trees of up to a given size.
Here is how you would use it for rose trees:
import Test.Feat (Enumerable(..), uniform, consts, unary, funcurry)
import Test.Feat.Class (Constructor)
import Data.Tree (Tree(..))
import qualified Test.QuickCheck as QC
-- We make an enumerable instance by listing all constructors
-- for the type. In this case, we have one binary constructor:
-- Node :: a -> [Tree a] -> Tree a
instance Enumerable a => Enumerable (Tree a) where
enumerate = consts [binary Node]
where
binary :: (a -> b -> c) -> Constructor c
binary = unary . funcurry
-- Now we use the Enumerable instance to create an Arbitrary
-- instance with the help of the function:
-- uniform :: Enumerable a => Int -> QC.Gen a
instance Enumerable a => QC.Arbitrary (Tree a) where
QC.arbitrary = QC.sized uniform
-- QC.shrink = <some implementation>
The Enumerable instance can also be generated automatically with TemplateHaskell:
deriveEnumerable ''Tree