menhir - associate AST nodes with token locations in source file - error-handling

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typecheck and other passes in error reports generated for a user, I would like to refer to source file position where it occurred. These are not parsing errors, and they generated after parsing is completed.
A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessary clumsy. What are the established practices to do that?

I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli
This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.
For example, here is a part of the representation of expressions:
type ...
and exp = {
eid: int; (** unique identifier *)
enode: exp_node; (** the expression itself *)
eloc: location; (** location of the expression. *)
}
and exp_node =
| Const of constant (** Constant *)
| Lval of lval (** Lvalue *)
| UnOp of unop * exp * typ
| BinOp of binop * exp * exp * typ
...
So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.
So simple, "top-level" matches on syntax are very easy:
let rec is_const_expr e =
match e.enode with
| Const _ -> true
| Lval _ -> false
| UnOp (_op, e', _typ) -> is_const_expr e'
| BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r
To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:
let optimize_double_negation e =
match e.enode with
| UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
| _ -> e
For comparison, on a pure AST without metadata, this would be something like:
let optimize_double_negation e =
match e.enode with
| UnOp (Neg, UnOp (Neg, e', _), _) -> e'
| _ -> e
I find that Frama-C's approach works well in practice.

You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,
type node =
| Typedef of typdef
| Typeexp of typeexp
| Literal of string
| Constant of int
| ...
type annotated_node = { node : node; loc : loc}
Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,
match node with
| {node=Typedef t} -> pp_typedef t
| ...
Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by #Isabelle Newbie.
A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you actually externalize node attributes. The drawback is that you can't actually guarantee the totality of an attribute since finite maps are no total. Thus it is harder to preserve an invariant that, for example, all nodes have a location.
Since every heap allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap allocated objects without actually wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.
Of course, by saying an address, I do not literally mean the physical or virtual address of an object. OCaml uses a moving GC so an actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.
Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:
let c1 = Literal "hello"
let c2 = Constant 42
Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)
module Locations = Ephemeron.K1.Make(struct
type t = node
let hash = Hashtbl.hash (* or your own hash if you have one *)
let equal = (=) (* or a specilized equal operator *)
end)
The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,
let locations = Locations.create 1337
(* somewhere in the semantics actions, where c1 and c2 are created *)
Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"
And later, you can extract the location:
# Locations.find locs c1;;
- : string = "hello.ml:12:32"
As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nice and easy, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:
let normalize = function
| Literal str -> Literal (normalize_literal str)
| node -> node
It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.
Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.
Speaking of the "monadic approach" that you mentioned in comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) In order to attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not only unique in a program run, but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). And although it looks like that generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as pure functional, as it doesn't contain observable effects.

Related

How do I remember the root of a binary search tree in Haskell

I am new to Functional programming.
The challenge I have is regarding the mental map of how a binary search tree works in Haskell.
In other programs (C,C++) we have something called root. We store it in a variable. We insert elements into it and do balancing etc..
The program takes a break does other things (may be process user inputs, create threads) and then figures out it needs to insert a new element in the already created tree. It knows the root (stored as a variable) and invokes the insert function with the root and the new value.
So far so good in other languages. But how do I mimic such a thing in Haskell, i.e.
I see functions implementing converting a list to a Binary Tree, inserting a value etc.. That's all good
I want this functionality to be part of a bigger program and so i need to know what the root is so that i can use it to insert it again. Is that possible? If so how?
Note: Is it not possible at all because data structures are immutable and so we cannot use the root at all to insert something. in such a case how is the above situation handled in Haskell?
It all happens in the same way, really, except that instead of mutating the existing tree variable we derive a new tree from it and remember that new tree instead of the old one.
For example, a sketch in C++ of the process you describe might look like:
int main(void) {
Tree<string> root;
while (true) {
string next;
cin >> next;
if (next == "quit") exit(0);
root.insert(next);
doSomethingWith(root);
}
}
A variable, a read action, and loop with a mutate step. In haskell, we do the same thing, but using recursion for looping and a recursion variable instead of mutating a local.
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
let t' = insert next t
doSomethingWith t'
loop t'
If you need doSomethingWith to be able to "mutate" t as well as read it, you can lift your program into State:
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
loop (execState doSomethingWith (insert next t))
Writing an example with a BST would take too much time but I give you an analogous example using lists.
Let's invent a updateListN which updates the n-th element in a list.
updateListN :: Int -> a -> [a] -> [a]
updateListN i n l = take (i - 1) l ++ n : drop i l
Now for our program:
list = [1,2,3,4,5,6,7,8,9,10] -- The big data structure we might want to use multiple times
main = do
-- only for shows
print $ updateListN 3 30 list -- [1,2,30,4,5,6,7,8,9,10]
print $ updateListN 8 80 list -- [1,2,3,4,5,6,7,80,9,10]
-- now some illustrative complicated processing
let list' = foldr (\i l -> updateListN i (i*10) l) list list
-- list' = [10,20,30,40,50,60,70,80,90,100]
-- Our crazily complicated illustrative algorithm still needs `list`
print $ zipWith (-) list' list
-- [9,18,27,36,45,54,63,72,81,90]
See how we "updated" list but it was still available? Most data structures in Haskell are persistent, so updates are non-destructive. As long as we have a reference of the old data around we can use it.
As for your comment:
My program is trying the following a) Convert a list to a Binary Search Tree b) do some I/O operation c) Ask for a user input to insert a new value in the created Binary Search Tree d) Insert it into the already created list. This is what the program intends to do. Not sure how to get this done in Haskell (or) is am i stuck in the old mindset. Any ideas/hints welcome.
We can sketch a program:
data BST
readInt :: IO Int; readInt = undefined
toBST :: [Int] -> BST; toBST = undefined
printBST :: BST -> IO (); printBST = undefined
loop :: [Int] -> IO ()
loop list = do
int <- readInt
let newList = int : list
let bst = toBST newList
printBST bst
loop newList
main = loop []
"do balancing" ... "It knows the root" nope. After re-balancing the root is new. The function balance_bst must return the new root.
Same in Haskell, but also with insert_bst. It too will return the new root, and you will use that new root from that point forward.
Even if the new root's value is the same, in Haskell it's a new root, since one of its children has changed.
See ''How to "think functional"'' here.
Even in C++ (or other imperative languages), it would usually be considered a poor idea to have a single global variable holding the root of the binary search tree.
Instead code that needs access to a tree should normally be parameterised on the particular tree it operates on. That's a fancy way of saying: it should be a function/method/procedure that takes the tree as an argument.
So if you're doing that, then it doesn't take much imagination to figure out how several different sections of code (or one section, on several occasions) could get access to different versions of an immutable tree. Instead of passing the same tree to each of these functions (with modifications in between), you just pass a different tree each time.
It's only a little more work to imagine what your code needs to do to "modify" an immutable tree. Obviously you won't produce a new version of the tree by directly mutating it, you'll instead produce a new value (probably by calling methods on the class implementing the tree for you, but if necessary by manually assembling new nodes yourself), and then you'll return it so your caller can pass it on - by returning it to its own caller, by giving it to another function, or even calling you again.
Putting that all together, you can have your whole program manipulate (successive versions of) this binary tree without ever having it stored in a global variable that is "the" tree. An early function (possibly even main) creates the first version of the tree, passes it to the first thing that uses it, gets back a new version of the tree and passes it to the next user, and so on. And each user of the tree can call other subfunctions as needed, with possibly many of new versions of the tree produced internally before it gets returned to the top level.
Note that I haven't actually described any special features of Haskell here. You can do all of this in just about any programming language, including C++. This is what people mean when they say that learning other types of programming makes them better programmers even in imperative languages they already knew. You can see that your habits of thought are drastically more limited than they need to be; you could not imagine how you could deal with a structure "changing" over the course of your program without having a single variable holding a structure that is mutated, when in fact that is just a small part of the tools that even C++ gives you for approaching the problem. If you can only imagine this one way of dealing with it then you'll never notice when other ways would be more helpful.
Haskell also has a variety of tools it can bring to this problem that are less common in imperative languages, such as (but not limited to):
Using the State monad to automate and hide much of the boilerplate of passing around successive versions of the tree.
Function arguments allow a function to be given an unknown "tree-consumer" function, to which it can give a tree, without any one place both having the tree and knowing which function it's passing it to.
Lazy evaluation sometimes negates the need to even have successive versions of the tree; if the modifications are expanding branches of the tree as you discover they are needed (like a move-tree for a game, say), then you could alternatively generate "the whole tree" up front even if it's infinite, and rely on lazy evaluation to limit how much work is done generating the tree to exactly the amount you need to look at.
Haskell does in fact have mutable variables, it just doesn't have functions that can access mutable variables without exposing in their type that they might have side effects. So if you really want to structure your program exactly as you would in C++ you can; it just won't really "feel like" you're writing Haskell, won't help you learn Haskell properly, and won't allow you to benefit from many of the useful features of Haskell's type system.

How to determine whether finite-state automata is deterministic?

How to write a Java code that will determine whether the given automaton is deterministic. What I have is a class that represents automata and it has the following five variables:
int alphabet_size;
int n_states;
int delta[][];
int initial_state;
int accepting_states[];
Delta represents following e.g. in real automata q1, 1, q2 in program would look like {1, 1, 2}. I know that in a non-deterministic automaton there could be more than one possible state to go to, but I don't know how to compute that.
To solve it I would go this way:
Create a Map (String, String or integer), I will use the key to check what actual state>symbol is and the value to track which state the transaction goes to (null or -1 if the actual state do not have a transaction with that symbol);
Populate the map keys and values, ex: map.put("1>1", null) means from state 1 with symbol 1 goes to null. For now, put null or -1 for every possibility;
Iterate over your transaction possibilities, build the key string, retrieve fom the map the corresponding value. If it is null include the number of the state that goes to, else means there is another transaction coming from the same state with the same symbol that goes to a different state, so it is not deterministic;
If you can iterate over the whole array it is deterministic!
With the member variables that you have now, you can only represent a deterministic FSA. Why? Because a NDFSA is one where there is a state q1 and input a such that both
(q1, a) -> q2
(q1, a) -> q3
are in delta. You implement this (qfrom, input) -> qto mapping as int[][], which means that for each instate-input pair, you can only store one outstate.
One very quick fix would be to allow a list of outstates for all instate-input pair, like so:
ArrayList[][] delta;
However, this is ugly. We can arrive at a nicer solution* if you realize that delta is actually the mapping qfrom -> input -> qto, and it can be bracketed in two ways:
(qfrom -> input) -> qto: this is what you had, with a little different notation
qfrom -> (input -> qto)
Another thing to realize is that int[] is conceptually the same as Map<Integer, Integer>. With that, the second mapping can implemented in Java as:
Map<Integer, Map<Integer, List<Integer>>> delta;
Whichever implementation you choose, the test that decides if the FSA is deterministic or not is very simple, and comes from the definition: A is a DFSA iff the lengths of all lists in delta are <= 1; otherwise, it is a NDFSA.
*Note that real FSA libraries usually use a different approach. For instance, foma uses an array of transition object with fields instate, outstate and the input character. However, for starters, the ones above are probably easier to handle.

How to use SmallCheck in Haskell?

I am trying to use SmallCheck to test a Haskell program, but I cannot understand how to use the library to test my own data types. Apparently, I need to use the Test.SmallCheck.Series. However, I find the documentation for it extremely confusing. I am interested in both cookbook-style solutions and an understandable explanation of the logical (monadic?) structure. Here are some questions I have (all related):
If I have a data type data Person = SnowWhite | Dwarf Integer, how do I explain to smallCheck that the valid values are Dwarf 1 through Dwarf 7 (or SnowWhite)? What if I have a complicated FairyTale data structure and a constructor makeTale :: [Person] -> FairyTale, and I want smallCheck to make FairyTale-s from lists of Person-s using the constructor?
I managed to make quickCheck work like this without getting my hands too dirty by using judicious applications of Control.Monad.liftM to functions like makeTale. I couldn't figure out a way to do this with smallCheck (please explain it to me!).
What is the relationship between the types Serial, Series, etc.?
(optional) What is the point of coSeries? How do I use the Positive type from SmallCheck.Series?
(optional) Any elucidation of what is the logic behind what should be a monadic expression, and what is just a regular function, in the context of smallCheck, would be appreciated.
If there is there any intro/tutorial to using smallCheck, I'd appreciate a link. Thank you very much!
UPDATE: I should add that the most useful and readable documentation I found for smallCheck is this paper (PDF). I could not find the answer to my questions there on the first look; it is more of a persuasive advertisement than a tutorial.
UPDATE 2: I moved my question about the weird Identity that shows up in the type of Test.SmallCheck.list and other places to a separate question.
NOTE: This answer describes pre-1.0 versions of SmallCheck. See this blog post for the important differences between SmallCheck 0.6 and 1.0.
SmallCheck is like QuickCheck in that it tests a property over some part of the space of possible types. The difference is that it tries to exhaustively enumerate a series all of the "small" values instead of an arbitrary subset of smallish values.
As I hinted, SmallCheck's Serial is like QuickCheck's Arbitrary.
Now Serial is pretty simple: a Serial type a has a way (series) to generate a Series type which is just a function from Depth -> [a]. Or, to unpack that, Serial objects are objects we know how to enumerate some "small" values of. We are also given a Depth parameter which controls how many small values we should generate, but let's ignore it for a minute.
instance Serial Bool where series _ = [False, True]
instance Serial Char where series _ = "abcdefghijklmnopqrstuvwxyz"
instance Serial a => Serial (Maybe a) where
series d = Nothing : map Just (series d)
In these cases we're doing nothing more than ignoring the Depth parameter and then enumerating "all" possible values for each type. We can even do this automatically for some types
instance (Enum a, Bounded a) => Serial a where series _ = [minBound .. maxBound]
This is a really simple way of testing properties exhaustively—literally test every single possible input! Obviously there are at least two major pitfalls, though: (1) infinite data types will lead to infinite loops when testing and (2) nested types lead to exponentially larger spaces of examples to look through. In both cases, SmallCheck gets really large really quickly.
So that's the point of the Depth parameter—it lets the system ask us to keep our Series small. From the documentation, Depth is the
Maximum depth of generated test values
For data values, it is the depth of nested constructor applications.
For functional values, it is both the depth of nested case analysis and the depth of results.
so let's rework our examples to keep them Small.
instance Serial Bool where
series 0 = []
series 1 = [False]
series _ = [False, True]
instance Serial Char where
series d = take d "abcdefghijklmnopqrstuvwxyz"
instance Serial a => Serial (Maybe a) where
-- we shrink d by one since we're adding Nothing
series d = Nothing : map Just (series (d-1))
instance (Enum a, Bounded a) => Serial a where series d = take d [minBound .. maxBound]
Much better.
So what's coseries? Like coarbitrary in the Arbitrary typeclass of QuickCheck, it lets us build a series of "small" functions. Note that we're writing the instance over the input type---the result type is handed to us in another Serial argument (that I'm below calling results).
instance Serial Bool where
coseries results d = [\cond -> if cond then r1 else r2 |
r1 <- results d
r2 <- results d]
these take a little more ingenuity to write and I'll actually refer you to use the alts methods which I'll describe briefly below.
So how can we make some Series of Persons? This part is easy
instance Series Person where
series d = SnowWhite : take (d-1) (map Dwarf [1..7])
...
But our coseries function needs to generate every possible function from Persons to something else. This can be done using the altsN series of functions provided by SmallCheck. Here's one way to write it
coseries results d = [\person ->
case person of
SnowWhite -> f 0
Dwarf n -> f n
| f <- alts1 results d ]
The basic idea is that altsN results generates a Series of N-ary function from N values with Serial instances to the Serial instance of Results. So we use it to create a function from [0..7], a previously defined Serial value, to whatever we need, then we map our Persons to numbers and pass 'em in.
So now that we have a Serial instance for Person, we can use it to build more complex nested Serial instances. For "instance", if FairyTale is a list of Persons, we can use the Serial a => Serial [a] instance alongside our Serial Person instance to easily create a Serial FairyTale:
instance Serial FairyTale where
series = map makeFairyTale . series
coseries results = map (makeFairyTale .) . coseries results
(the (makeFairyTale .) composes makeFairyTale with each function coseries generates, which is a little confusing)
If I have a data type data Person = SnowWhite | Dwarf Integer, how do I explain to smallCheck that the valid values are Dwarf 1 through Dwarf 7 (or SnowWhite)?
First of all, you need to decide which values you want to generate for each depth. There's no single right answer here, it depends on how fine-grained you want your search space to be.
Here are just two possible options:
people d = SnowWhite : map Dwarf [1..7] (doesn't depend on the depth)
people d = take d $ SnowWhite : map Dwarf [1..7] (each unit of depth increases the search space by one element)
After you've decided on that, your Serial instance is as simple as
instance Serial m Person where
series = generate people
We left m polymorphic here as we don't require any specific structure of the underlying monad.
What if I have a complicated FairyTale data structure and a constructor makeTale :: [Person] -> FairyTale, and I want smallCheck to make FairyTale-s from lists of Person-s using the constructor?
Use cons1:
instance Serial m FairyTale where
series = cons1 makeTale
What is the relationship between the types Serial, Series, etc.?
Serial is a type class; Series is a type. You can have multiple Series of the same type — they correspond to different ways to enumerate values of that type. However, it may be arduous to specify for each value how it should be generated. The Serial class lets us specify a good default for generating values of a particular type.
The definition of Serial is
class Monad m => Serial m a where
series :: Series m a
So all it does is assigning a particular Series m a to a given combination of m and a.
What is the point of coseries?
It is needed to generate values of functional types.
How do I use the Positive type from SmallCheck.Series?
For example, like this:
> smallCheck 10 $ \n -> n^3 >= (n :: Integer)
Failed test no. 5.
there exists -2 such that
condition is false
> smallCheck 10 $ \(Positive n) -> n^3 >= (n :: Integer)
Completed 10 tests without failure.
Any elucidation of what is the logic behind what should be a monadic expression, and what is just a regular function, in the context of smallCheck, would be appreciated.
When you are writing a Serial instance (or any Series expression), you work in the Series m monad.
When you are writing tests, you work with simple functions that return Bool or Property m.
While I think that #tel's answer is an excellent explanation (and I wish smallCheck actually worked the way he describes), the code he provides does not work for me (with smallCheck version 1). I managed to get the following to work...
UPDATE / WARNING: The code below is wrong for a rather subtle reason. For the corrected version, and details, please see this answer to the question mentioned below. The short version is that instead of instance Serial Identity Person one must write instance (Monad m) => Series m Person.
... but I find the use of Control.Monad.Identity and all the compiler flags bizarre, and I have asked a separate question about that.
Note also that while Series Person (or actually Series Identity Person) is not actually exactly the same as functions Depth -> [Person] (see #tel's answer), the function generate :: Depth -> [a] -> Series m a converts between them.
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses, FlexibleContexts, UndecidableInstances #-}
import Test.SmallCheck
import Test.SmallCheck.Series
import Control.Monad.Identity
data Person = SnowWhite | Dwarf Int
instance Serial Identity Person where
series = generate (\d -> SnowWhite : take (d-1) (map Dwarf [1..7]))

QuickCheck: Arbitrary instances of nested data structures that generate balanced specimens

tl;dr: how do you write instances of Arbitrary that don't explode if your data type allows for way too much nesting? And how would you guarantee these instances produce truly random specimens of your data structure?
I want to generate random tree structures, then test certain properties of these structures after I've mangled them with my library code. (NB: I'm writing an implementation of a subtyping algorithm, i.e. given a hierarchy of types, is type A a subtype of type B. This can be made arbitrarily complex, by including multiple-inheritance and post-initialization updates to the hierarchy. The classical method that supports neither of these is Schubert Numbering, and the latest result known to me is Alavi et al. 2008.)
Let's take the example of rose-trees, following Data.Tree:
data Tree a = Node a (Forest a)
type Forest a = [Tree a]
A very simple (and don't-try-this-at-home) instance of Arbitray would be:
instance (Arbitrary a) => Arbitrary (Tree a) where
arbitrary = Node <$> arbitrary <$> arbitrary
Since a already has an Arbitrary instance as per the type constraint, and the Forest will have one, because [] is an instance, too, this seems straight-forward. It won't (typically) terminate for very obvious reasons: since the lists it generates are arbitrarily long, the structures become too large, and there's a good chance they won't fit into memory. Even a more conservative approach:
arbitrary = Node <$> arbitrary <*> oneof [arbitrary,return []]
won't work, again, for the same reason. One could tweak the size parameter, to keep the length of the lists down, but even that won't guarantee termination, since it's still multiple consecutive dice-rolls, and it can turn out quite badly (and I want the odd node with 100 children.)
Which means I need to limit the size of the entire tree. That is not so straight-forward. unordered-containers has it easy: just use fromList. This is not so easy here: How do you turn a list into a tree, randomly, and without incurring bias one way or the other (i.e. not favoring left-branches, or trees that are very left-leaning.)
Some sort of breadth-first construction (the functions provided by Data.Tree are all pre-order) from lists would be awesome, and I think I could write one, but it would turn out to be non-trivial. Since I'm using trees now, but will use even more complex stuff later on, I thought I might try to find a more general and less complex solution. Is there one, or will I have to resort to writing my own non-trivial Arbitrary generator? In the latter case, I might actually just resort to unit-tests, since this seems too much work.
Use sized:
instance Arbitrary a => Arbitrary (Tree a) where
arbitrary = sized arbTree
arbTree :: Arbitrary a => Int -> Gen (Tree a)
arbTree 0 = do
a <- arbitrary
return $ Node a []
arbTree n = do
(Positive m) <- arbitrary
let n' = n `div` (m + 1)
f <- replicateM m (arbTree n')
a <- arbitrary
return $ Node a f
(Adapted from the QuickCheck presentation).
P.S. Perhaps this will generate overly balanced trees...
You might want to use the library presented in the paper "Feat: Functional Enumeration of Algebraic Types" at the Haskell Symposium 2012. It is on Hackage as testing-feat, and a video of the talk introducing it is available here: http://www.youtube.com/watch?v=HbX7pxYXsHg
As Janis mentioned, you can use the package testing-feat, which creates enumerations of arbitrary algebraic data types. This is the easiest way to create unbiased uniformly distributed generators
for all trees of up to a given size.
Here is how you would use it for rose trees:
import Test.Feat (Enumerable(..), uniform, consts, funcurry)
import Test.Feat.Class (Constructor)
import Data.Tree (Tree(..))
import qualified Test.QuickCheck as QC
-- We make an enumerable instance by listing all constructors
-- for the type. In this case, we have one binary constructor:
-- Node :: a -> [Tree a] -> Tree a
instance Enumerable a => Enumerable (Tree a) where
enumerate = consts [binary Node]
where
binary :: (a -> b -> c) -> Constructor c
binary = unary . funcurry
-- Now we use the Enumerable instance to create an Arbitrary
-- instance with the help of the function:
-- uniform :: Enumerable a => Int -> QC.Gen a
instance Enumerable a => QC.Arbitrary (Tree a) where
QC.arbitrary = QC.sized uniform
-- QC.shrink = <some implementation>
The Enumerable instance can also be generated automatically with TemplateHaskell:
deriveEnumerable ''Tree

First and follow of the non-terminals in two grammars

Given the following grammar:
S -> L=L
s -> L
L -> *L
L -> id
What are the first and follow for the non-terminals?
If the grammar is changed into:
S -> L=R
S -> R
L -> *R
L -> id
R -> L
What will be the first and follow ?
When I took a compiler course in college I didn't understand FIRST and FOLLOWS at all. I implemented the algorithms described in the Dragon book, but I had no clue what was going on. I think I do now.
I assume you have some book that gives a formal definition of these two sets, and the book is completely incomprehensible. I'll try to give an informal description of them, and hopefully that will help you make sense of what's in your book.
The FIRST set is the set of terminals you could possibly see as the first part of the expansion of a non-terminal. The FOLLOWS set is the set of terminals you could possibly see following the expansion of a non-terminal.
In your first grammar, there are only three kinds of terminals: =, *, and id. (You might also consider $, the end-of-input symbol, to be a terminal.) The only non-terminals are S (a statement) and L (an Lvalue -- a "thing" you can assign to).
Think of FIRST(S) as the set of non-terminals that could possibly start a statement. Intuitively, you know you do not start a statement with =. So you wouldn't expect that to show up in FIRST(S).
So how does a statement start? There are two production rules that define what an S looks like, and they both start with L. So to figure out what's in FIRST(S), you really have to look at what's in FIRST(L). There are two production rules that define what an Lvalue looks like: it either starts with a * or with an id. So FIRST(S) = FIRST(L) = { *, id }.
FOLLOWS(S) is easy. Nothing follows S because it is the start symbol. So the only thing in FOLLOWS(S) is $, the end-of-input symbol.
FOLLOWS(L) is a little trickier. You have to look at every production rule where L appears, and see what comes after it. In the first rule, you see that = may follow L. So = is in FOLLOWS(L). But you also notice in that rule that there is another L at the end of the production rule. So another thing that could follow L is anything that could follow that production. We already figured out that the only thing that can follow the S production is the end-of-input. So FOLLOWS(L) = { =, $ }. (If you look at the other production rules, L always appears at the end of them, so you just get $ from those.)
Take a look at this Easy Explanation, and for now ignore all the stuff about ϵ, because you don't have any productions which contain the empty-string. Under "Rules for First Sets", rules #1, #3, and #4.1 should make sense. Under "Rules for Follows Sets", rules #1, #2, and #3 should make sense.
Things get more complicated when you have ϵ in your production rules. Suppose you have something like this:
D -> S C T id = V // Declaration is [Static] [Const] Type id = Value
S -> static | ϵ // The 'static' keyword is optional
C -> const | ϵ // The 'const' keyword is optional
T -> int | float // The Type is mandatory and is either 'int' or 'float'
V -> ... // The Value gets complicated, not important here.
Now if you want to compute FIRST(D) you can't just look at FIRST(S), because S may be "empty". You know intuitively that FIRST(D) is { static, const, int, float }. That intuition is codified in rule #4.2. Think of SCT in this example as Y1Y2Y3 in the "Easy Explanation" rules.
If you want to compute FOLLOWS(S), you can't just look at FIRST(C), because that may be empty, so you also have to look at FIRST(T). So FOLLOWS(S) = { const, int, float }. You get that by applying "Rules for follow sets" #2 and #4 (more or less).
I hope that helps and that you can figure out FIRST and FOLLOWS for the second grammar on your own.
If it helps, R represents an Rvalue -- a "thing" you can't assign to, such as a constant or a literal. An Lvalue can also act as an Rvalue (but not the other way around).
a = 2; // a is an lvalue, 2 is an rvalue
a = b; // a is an lvalue, b is an lvalue, but in this context it's an rvalue
2 = a; // invalid because 2 cannot be an lvalue
2 = 3; // invalid, same reason.
*4 = b; // Valid! You would almost never write code like this, but it is
// grammatically correct: dereferencing an Rvalue gives you an Lvalue.