How to determine whether a finite-state automaton is deterministic? - finite-automata

How can I write Java code that determines whether a given automaton is deterministic? What I have is a class that represents an automaton, and it has the following five variables:
int alphabet_size;
int n_states;
int delta[][];
int initial_state;
int accepting_states[];
delta represents the transitions; e.g. the transition q1, 1, q2 of the real automaton would look like {1, 1, 2} in the program. I know that in a non-deterministic automaton there can be more than one possible state to go to, but I don't know how to check for that.

To solve it I would go this way:
Create a Map (String to String, or String to Integer); I will use the key to encode the current state>symbol pair and the value to track which state the transition goes to (null or -1 if that state has no transition on that symbol);
Populate the map keys and values, e.g. map.put("1>1", null) means that from state 1 with symbol 1 we go nowhere yet. For now, put null or -1 for every possibility;
Iterate over your transitions, build the key string, and retrieve the corresponding value from the map. If it is null, store the number of the state the transition goes to; otherwise there is already another transition leaving the same state on the same symbol towards a different state, so the automaton is not deterministic;
If you can iterate over the whole array without such a clash, it is deterministic! (A sketch of this idea follows below.)
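Here is a minimal sketch of that idea in Java, assuming the delta[][] encoding from the question (an array of {from, symbol, to} triples); the class and method names are mine:
import java.util.HashMap;
import java.util.Map;

public class DfaCheck {
    // returns false as soon as two transitions leave the same state
    // on the same symbol towards different target states
    static boolean isDeterministic(int[][] delta) {
        Map<String, Integer> seen = new HashMap<>();
        for (int[] t : delta) {
            String key = t[0] + ">" + t[1];        // "state>symbol"
            Integer previous = seen.put(key, t[2]);
            if (previous != null && previous != t[2]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // q1 --1--> q2 and q1 --1--> q3: non-deterministic
        int[][] delta = { {1, 1, 2}, {1, 1, 3} };
        System.out.println(isDeterministic(delta)); // prints false
    }
}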

With the member variables that you have now, you can only represent a deterministic FSA. Why? Because an NDFSA is one where there is a state q1 and input a such that both
(q1, a) -> q2
(q1, a) -> q3
are in delta. You implement this (qfrom, input) -> qto mapping as int[][], which means that for each instate-input pair, you can only store one outstate.
One very quick fix would be to allow a list of outstates for each instate-input pair, like so:
ArrayList[][] delta;
However, this is ugly. We can arrive at a nicer solution* if you realize that delta is actually the mapping qfrom -> input -> qto, and it can be bracketed in two ways:
(qfrom -> input) -> qto: this is what you had, with a little different notation
qfrom -> (input -> qto)
Another thing to realize is that int[] is conceptually the same as Map<Integer, Integer>. With that, the second mapping can be implemented in Java as:
Map<Integer, Map<Integer, List<Integer>>> delta;
Whichever implementation you choose, the test that decides if the FSA is deterministic or not is very simple, and comes from the definition: A is a DFSA iff the lengths of all lists in delta are <= 1; otherwise, it is an NDFSA. (A sketch of this test follows the footnote below.)
*Note that real FSA libraries usually use a different approach. For instance, foma uses an array of transition objects with fields instate, outstate and the input character. However, for starters, the ones above are probably easier to handle.
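Here is what the definition-based test could look like in Java for the nested-map representation above (a sketch; the class and method names are mine):
import java.util.List;
import java.util.Map;

public class NestedDeltaCheck {
    static boolean isDeterministic(Map<Integer, Map<Integer, List<Integer>>> delta) {
        for (Map<Integer, List<Integer>> byInput : delta.values()) {
            for (List<Integer> outStates : byInput.values()) {
                if (outStates.size() > 1) {
                    return false; // two targets for one (state, input) pair: NDFSA
                }
            }
        }
        return true; // every list has length <= 1: DFSA
    }
}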

Why does random always return an even number

I want to learn Elm and currently I want to create a random List with tuples containing an index and a random number.
My current approach is to create a list and for each element create a random value:
randomList =
    List.map randomEntry (List.range 0 1000)

randomEntry index =
    let
        seed = Random.initialSeed index
        randomResult = Random.step (Random.int 1 10) seed
    in
        (index, Tuple.first randomResult)
But this only creates even numbers.
Why does it always create even numbers and what is the correct way of doing this?
Strange - using your randomEntry function, the first odd numbers don't start showing up until 53668, and then it's odd numbers for a while. An example from the REPL:
> List.range 0 100000 |> List.map randomEntry |> List.filter (\(a,b) -> b % 2 /= 0) |> List.take 10
[(53668,35),(53669,87),(53670,1),(53671,15),(53672,29),(53673,43),(53674,57),(53675,71),(53676,85),(53677,99)]
: List ( Int, Int )
Now I can't tell you why there is such stickiness in this range (here's the source for the int generator if you're curious), but hopefully I can shed some light on randomness in Elm.
There is really no way to create a truly random number generator using standard computing technology (quantum computers aside), so the typical way of creating randomness is to provide a function that takes a seed value and returns a pseudo-random number along with the next seed value to use.
This makes random number generation predictable, since you will always get the same "random" number for the same seed.
This is why you always get identical results for the input you've given: You are using the same seed values of 0 through 1000. In addition, you are ignoring the "next seed" value passed back from the step function, which is returned as the second value of the tuple.
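To make that seed threading concrete, here is a hypothetical sketch in Java of a pure step function in the same spirit (this is not Elm's actual generator; the constants are just those of a common linear congruential generator):
public class PureRandom {
    // a pseudo-random value in [low, high] plus the next seed;
    // the caller must thread nextSeed into the following call
    record Step(int value, long nextSeed) {}

    static Step step(int low, int high, long seed) {
        long next = seed * 6364136223846793005L + 1442695040888963407L;
        int range = high - low + 1;
        int value = low + (int) ((next >>> 33) % range);
        return new Step(value, next);
    }

    public static void main(String[] args) {
        long seed = 0;
        for (int i = 0; i < 5; i++) {
            Step s = step(1, 10, seed);
            System.out.println(s.value());
            seed = s.nextSeed(); // reuse the returned seed, not a loop index
        }
    }
}
Reusing the returned seed on each call is exactly what the original code skips by seeding every entry with its index.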
Now, when dealing with random number generators, it is a good rule of thumb to avoid dealing with the seed as much as possible. You can write generators without referring to seeds by building on smaller generators like int, list, and so on.
The way you execute a generator is either by returning a Cmd created with Random.generate from your update function, which leaves the responsibility of deciding which seed to use to the Elm Architecture (which probably uses some time-based seed), or by passing in the seed yourself using Random.step, as you've done above.
So, going back to your original example, if you were to write a generator for returning a list of random numbers of a certain size, where each number is within a certain range, it could look something like this:
randomListGenerator : Int -> (Int, Int) -> Random.Generator (List Int)
randomListGenerator size (low, high) =
Random.list size (Random.int low high)
Executing this using step in the REPL shows how it can be used:
> initialSeed 0 |> step (randomListGenerator 20 (1, 10)) |> Tuple.first
[6,6,6,1,3,10,4,4,4,9,6,3,5,3,7,8,3,4,8,5] : List Int
You'll see that this includes some odd numbers, unlike your initial example. It differs from your example because the generator returns the next seed to use on each consecutive step, whereas your example used the integers 0 through 1000 in order. I still have no explanation for your original question of why there is such a big block of evens using your original input, other than to say it is very odd.

menhir - associate AST nodes with token locations in source file

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typechecking and other passes, in error reports generated for the user, I would like to refer to the source-file position where the error occurred. These are not parsing errors; they are generated after parsing is complete.
A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessarily clumsy. What are the established practices for doing this?
I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli
This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.
For example, here is a part of the representation of expressions:
type ...
and exp = {
  eid: int;         (** unique identifier *)
  enode: exp_node;  (** the expression itself *)
  eloc: location;   (** location of the expression. *)
}
and exp_node =
  | Const of constant  (** Constant *)
  | Lval of lval       (** Lvalue *)
  | UnOp of unop * exp * typ
  | BinOp of binop * exp * exp * typ
  ...
So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.
So simple, "top-level" matches on syntax are very easy:
let rec is_const_expr e =
  match e.enode with
  | Const _ -> true
  | Lval _ -> false
  | UnOp (_op, e', _typ) -> is_const_expr e'
  | BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r
To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:
let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
  | _ -> e
For comparison, on a pure AST without metadata, this would be something like:
let optimize_double_negation e =
  match e with
  | UnOp (Neg, UnOp (Neg, e', _), _) -> e'
  | _ -> e
I find that Frama-C's approach works well in practice.
You somehow need to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,
type node =
  | Typedef of typdef
  | Typeexp of typeexp
  | Literal of string
  | Constant of int
  | ...

type annotated_node = { node : node; loc : loc }
Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,
match node with
| {node=Typedef t} -> pp_typedef t
| ...
Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or doing it recursively, as in the Frama-C example by @Isabelle Newbie.
A similar but more general approach is to extend a node not with the location, but just with a unique identifier, and to use a finite map to attach arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data, as you effectively externalize node attributes. The drawback is that you can't guarantee the totality of an attribute, since finite maps are not total. Thus it is harder to preserve an invariant that, for example, all nodes have a location. A sketch of the idea follows.
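For illustration, here is a hypothetical sketch of that shape in Java (the names are mine; the same structure works in OCaml with a map keyed by the identifier):
import java.util.HashMap;
import java.util.Map;

final class Node {
    final int id; // the only extension: a unique identifier
    // ... the actual syntax payload would go here ...
    Node(int id) { this.id = id; }
}

final class Attributes {
    private final Map<Integer, String> locations = new HashMap<>();

    void setLocation(Node n, String loc) { locations.put(n.id, loc); }

    // the map is not total: a node may have no recorded location
    String location(Node n) { return locations.get(n.id); }
}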
Since every heap-allocated object already has an implicit unique identifier, its address, it is possible to attach data to heap-allocated objects without actually wrapping them in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to nodes, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (if it does, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit).
Of course, by saying an address, I do not literally mean the physical or virtual address of an object. OCaml uses a moving GC so an actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.
Fortunately, now that ephemerons have been added to recent versions of OCaml, this is no longer a problem. Moreover, an ephemeron plays nicely with the GC, so that if a node is no longer reachable, its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:
let c1 = Literal "hello"
let c2 = Constant 42
Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)
module Locations = Ephemeron.K1.Make(struct
  type t = node
  let hash = Hashtbl.hash (* or your own hash if you have one *)
  let equal = (=) (* or a specialized equal operator *)
end)
The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,
let locations = Locations.create 1337
(* somewhere in the semantic actions, where c1 and c2 are created *)
Locations.add locations c1 "hello.ml:12:32";
Locations.add locations c2 "hello.ml:13:56"
And later, you can extract the location:
# Locations.find locations c1;;
- : string = "hello.ml:12:32"
As you see, although the solution is nice in the sense that it doesn't touch the node data type, so the rest of your code can pattern match on it nice and easy, it is still a little bit dirty, as it requires global mutable state that is hard to maintain. Also, since we are using an object's address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function that normalizes all literals:
let normalize = function
  | Literal str -> Literal (normalize_literal str)
  | node -> node
It will create a new Literal node from the original node, so all the location information will be lost. That means that you need to update the location information every time you derive one node from another.
Another issue with ephemerons is that they can't survive marshaling or serialization. I.e., if you store your AST somewhere in a file and then restore it, all nodes will lose their identity, and the location table will become empty.
Speaking of the "monadic approach" that you mentioned in the comments: though monads are magic, they still can't magically solve all problems. They are not silver bullets :) In order to attach something to a node, we still need to extend it with an extra attribute - either the location information directly, or an identity through which we can attach properties indirectly. The monad can be useful for the latter, though: instead of having a global reference to the last assigned identifier, we can use a state monad to encapsulate our id generator. And for the sake of completeness: instead of using a state monad or a global reference to generate unique identifiers, you can use UUIDs and get identifiers that are not only unique in a program run, but universally unique, in the sense that no other object in the world has the same identifier, no matter how often you run your program (in a sane world). And although it looks like generating a UUID doesn't use any state, under the hood it still uses an imperative random number generator, so it is sort of cheating; but it can still be seen as purely functional, as it doesn't contain observable effects.

How are records stored in Erlang, and how are they mutated?

I recently came across some code that looked something like the following:
-record(my_rec, {f0, f1, f2...... f711}).
update_field({f0, Val}, R) -> R#my_rec{f0 = Val};
update_field({f1, Val}, R) -> R#my_rec{f1 = Val};
update_field({f2, Val}, R) -> R#my_rec{f2 = Val};
....
update_field({f711, Val}, R) -> R#my_rec{f711 = Val}.
generate_record_from_proplist(Props) ->
    lists:foldl(fun update_field/2, #my_rec{}, Props).
My question is about what actually happens to the record - let's say the record has 711 fields and I'm generating it from a proplist. Since the record is immutable, we are, at least semantically, generating a new full record on every step of the foldl. That turns what looks like a function linear in the length of its argument into one that is actually quadratic, since every insert copies a number of fields proportional to the length of the record. Am I correct in this assumption, or is the compiler intelligent enough to save me?
Records are tuples whose first element contains the name of the record, and the following elements contain the record fields.
The names of the fields are not stored; they are a facility for the compiler, and of course the programmer. I think they were introduced only to avoid errors in the field order when writing programs, and to allow tuple extension when releasing a new version without rewriting all pattern matches.
Your code will make 712 copies of a 713 element tuple.
I am afraid the compiler is not smart enough.
You can read more in this SO answer. If you have such a big number of fields and you want to update them in O(1) time, you should use ETS tables.
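To see why the copying is quadratic, here is the cost model sketched in Java (a hypothetical illustration, not what the BEAM emits): an immutable "record" as an array that is copied whole on each field update.
import java.util.Arrays;

final class ImmutableRecord {
    final Object[] fields;

    ImmutableRecord(int size) { this.fields = new Object[size]; }
    private ImmutableRecord(Object[] fields) { this.fields = fields; }

    // each update copies all the fields: O(n) work per update
    ImmutableRecord with(int index, Object value) {
        Object[] copy = Arrays.copyOf(fields, fields.length);
        copy[index] = value;
        return new ImmutableRecord(copy);
    }

    public static void main(String[] args) {
        ImmutableRecord r = new ImmutableRecord(712);
        // setting every field one at a time makes 712 copies of a
        // 712-slot array: quadratic in the number of fields
        for (int i = 0; i < 712; i++) {
            r = r.with(i, i);
        }
    }
}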

How to use SmallCheck in Haskell?

I am trying to use SmallCheck to test a Haskell program, but I cannot understand how to use the library to test my own data types. Apparently, I need to use the Test.SmallCheck.Series module. However, I find the documentation for it extremely confusing. I am interested in both cookbook-style solutions and an understandable explanation of the logical (monadic?) structure. Here are some questions I have (all related):
If I have a data type data Person = SnowWhite | Dwarf Integer, how do I explain to smallCheck that the valid values are Dwarf 1 through Dwarf 7 (or SnowWhite)? What if I have a complicated FairyTale data structure and a constructor makeTale :: [Person] -> FairyTale, and I want smallCheck to make FairyTale-s from lists of Person-s using the constructor?
I managed to make quickCheck work like this without getting my hands too dirty by using judicious applications of Control.Monad.liftM to functions like makeTale. I couldn't figure out a way to do this with smallCheck (please explain it to me!).
What is the relationship between the types Serial, Series, etc.?
(optional) What is the point of coSeries? How do I use the Positive type from SmallCheck.Series?
(optional) Any elucidation of what is the logic behind what should be a monadic expression, and what is just a regular function, in the context of smallCheck, would be appreciated.
If there is any intro/tutorial to using smallCheck, I'd appreciate a link. Thank you very much!
UPDATE: I should add that the most useful and readable documentation I found for smallCheck is this paper (PDF). I could not find the answer to my questions there on the first look; it is more of a persuasive advertisement than a tutorial.
UPDATE 2: I moved my question about the weird Identity that shows up in the type of Test.SmallCheck.list and other places to a separate question.
NOTE: This answer describes pre-1.0 versions of SmallCheck. See this blog post for the important differences between SmallCheck 0.6 and 1.0.
SmallCheck is like QuickCheck in that it tests a property over some part of the space of possible values. The difference is that it tries to exhaustively enumerate a series of all the "small" values instead of an arbitrary subset of smallish values.
As I hinted, SmallCheck's Serial is like QuickCheck's Arbitrary.
Now Serial is pretty simple: a Serial type a has a way (series) to generate a Series type, which is just a function Depth -> [a]. Or, to unpack that, Serial objects are objects we know how to enumerate some "small" values of. We are also given a Depth parameter which controls how many small values we should generate, but let's ignore it for a minute.
instance Serial Bool where series _ = [False, True]
instance Serial Char where series _ = "abcdefghijklmnopqrstuvwxyz"
instance Serial a => Serial (Maybe a) where
  series d = Nothing : map Just (series d)
In these cases we're doing nothing more than ignoring the Depth parameter and then enumerating "all" possible values for each type. We can even do this automatically for some types
instance (Enum a, Bounded a) => Serial a where series _ = [minBound .. maxBound]
This is a really simple way of testing properties exhaustively—literally test every single possible input! Obviously there are at least two major pitfalls, though: (1) infinite data types will lead to infinite loops when testing and (2) nested types lead to exponentially larger spaces of examples to look through. In both cases, the enumeration gets really large really quickly.
So that's the point of the Depth parameter—it lets the system ask us to keep our Series small. From the documentation, Depth is the
Maximum depth of generated test values
For data values, it is the depth of nested constructor applications.
For functional values, it is both the depth of nested case analysis and the depth of results.
so let's rework our examples to keep them Small.
instance Serial Bool where
  series 0 = []
  series 1 = [False]
  series _ = [False, True]

instance Serial Char where
  series d = take d "abcdefghijklmnopqrstuvwxyz"

instance Serial a => Serial (Maybe a) where
  -- we shrink d by one since we're adding Nothing
  series d = Nothing : map Just (series (d-1))

instance (Enum a, Bounded a) => Serial a where series d = take d [minBound .. maxBound]
Much better.
So what's coseries? Like coarbitrary in the Arbitrary typeclass of QuickCheck, it lets us build a series of "small" functions. Note that we're writing the instance over the input type---the result type's series is handed to us as another argument (which I'm calling results below).
instance Serial Bool where
  coseries results d = [ \cond -> if cond then r1 else r2
                       | r1 <- results d
                       , r2 <- results d ]
These take a little more ingenuity to write, and I'll actually refer you to the alts methods, which I'll describe briefly below.
So how can we make some Series of Persons? This part is easy
instance Serial Person where
  series d = SnowWhite : take (d-1) (map Dwarf [1..7])
...
But our coseries function needs to generate every possible function from Persons to something else. This can be done using the altsN series of functions provided by SmallCheck. Here's one way to write it
coseries results d = [ \person -> case person of
                         SnowWhite -> f 0
                         Dwarf n   -> f n
                     | f <- alts1 results d ]
The basic idea is that altsN results generates a Series of N-ary functions from N values with Serial instances to the Serial instance of results. So we use it to create a function from numbers (a previously defined Serial type) to whatever we need, then we map our Persons to numbers and pass 'em in.
So now that we have a Serial instance for Person, we can use it to build more complex nested Serial instances. For "instance", if FairyTale is a list of Persons, we can use the Serial a => Serial [a] instance alongside our Serial Person instance to easily create a Serial FairyTale:
instance Serial FairyTale where
  series = map makeFairyTale . series
  coseries results = map (makeFairyTale .) . coseries results
(the (makeFairyTale .) composes makeFairyTale with each function coseries generates, which is a little confusing)
If I have a data type data Person = SnowWhite | Dwarf Integer, how do I explain to smallCheck that the valid values are Dwarf 1 through Dwarf 7 (or SnowWhite)?
First of all, you need to decide which values you want to generate for each depth. There's no single right answer here, it depends on how fine-grained you want your search space to be.
Here are just two possible options:
people d = SnowWhite : map Dwarf [1..7] (doesn't depend on the depth)
people d = take d $ SnowWhite : map Dwarf [1..7] (each unit of depth increases the search space by one element)
After you've decided on that, your Serial instance is as simple as
instance Serial m Person where
  series = generate people
We left m polymorphic here as we don't require any specific structure of the underlying monad.
What if I have a complicated FairyTale data structure and a constructor makeTale :: [Person] -> FairyTale, and I want smallCheck to make FairyTale-s from lists of Person-s using the constructor?
Use cons1:
instance Serial m FairyTale where
  series = cons1 makeTale
What is the relationship between the types Serial, Series, etc.?
Serial is a type class; Series is a type. You can have multiple Series of the same type — they correspond to different ways to enumerate values of that type. However, it may be arduous to specify for each value how it should be generated. The Serial class lets us specify a good default for generating values of a particular type.
The definition of Serial is
class Monad m => Serial m a where
  series :: Series m a
So all it does is assign a particular Series m a to a given combination of m and a.
What is the point of coseries?
It is needed to generate values of functional types.
How do I use the Positive type from SmallCheck.Series?
For example, like this:
> smallCheck 10 $ \n -> n^3 >= (n :: Integer)
Failed test no. 5.
there exists -2 such that
condition is false
> smallCheck 10 $ \(Positive n) -> n^3 >= (n :: Integer)
Completed 10 tests without failure.
Any elucidation of what is the logic behind what should be a monadic expression, and what is just a regular function, in the context of smallCheck, would be appreciated.
When you are writing a Serial instance (or any Series expression), you work in the Series m monad.
When you are writing tests, you work with simple functions that return Bool or Property m.
While I think that @tel's answer is an excellent explanation (and I wish smallCheck actually worked the way he describes), the code he provides does not work for me (with smallCheck version 1). I managed to get the following to work...
UPDATE / WARNING: The code below is wrong for a rather subtle reason. For the corrected version, and details, please see this answer to the question mentioned below. The short version is that instead of instance Serial Identity Person one must write instance (Monad m) => Serial m Person.
... but I find the use of Control.Monad.Identity and all the compiler flags bizarre, and I have asked a separate question about that.
Note also that while Series Person (or actually Series Identity Person) is not exactly the same as a function Depth -> [Person] (see @tel's answer), the function generate :: (Depth -> [a]) -> Series m a converts between them.
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses, FlexibleContexts, UndecidableInstances #-}
import Test.SmallCheck
import Test.SmallCheck.Series
import Control.Monad.Identity
data Person = SnowWhite | Dwarf Int
instance Serial Identity Person where
  series = generate (\d -> SnowWhite : take (d-1) (map Dwarf [1..7]))

Trie Implementation Question

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
Class WordTree : Inherits Dictionary(Of Char, WordTree)
Each letter in a word (all upper-cased) is used as a key to a new WordTree. A null character on a leaf indicates the termination of a word. To find words starting with a prefix, I walk the trie as far as my prefix goes and then collect all children words.
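For concreteness, here is a rough Java transliteration of the structure I described (my actual code is VB.NET; the method names here are illustrative):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordTree extends HashMap<Character, WordTree> {
    static final char END = '\0'; // null character marks a complete word

    void add(String word) {
        WordTree node = this;
        for (char c : word.toUpperCase().toCharArray()) {
            node = node.computeIfAbsent(c, k -> new WordTree());
        }
        node.put(END, null);
    }

    // walk the trie as far as the prefix goes, then collect all children
    List<String> wordsWithPrefix(String prefix) {
        WordTree node = this;
        for (char c : prefix.toUpperCase().toCharArray()) {
            node = node.get(c);
            if (node == null) return List.of();
        }
        List<String> words = new ArrayList<>();
        collect(node, prefix.toUpperCase(), words);
        return words;
    }

    private static void collect(WordTree node, String soFar, List<String> out) {
        for (Map.Entry<Character, WordTree> e : node.entrySet()) {
            if (e.getKey() == END) out.add(soFar);
            else collect(e.getValue(), soFar + e.getKey(), out);
        }
    }
}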
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.
If it's just the 26 letters, use a 26-entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket list is longer than 26.
If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26-char limit.
class State // could be a struct or whatever
{
    int valid; // can handle 32 transitions -- each bit set is valid
    vector<State> transitions;

    State getNextState( int ch )
    {
        // count the set bits below this character's bit to find the
        // index of its transition in the packed vector
        int mask = ( 1 << ( toupper( ch ) - 'A' )) - 1;
        int bitsToCount = valid & mask;
        int index;
        for( index = 0; bitsToCount; bitsToCount >>= 1 )
        {
            index += bitsToCount & 1;
        }
        return transitions.at( index );
    }
};
There are other ways to do the bit counting. Here, the index into the vector is the number of set bits in the valid bitset below the character's bit. The other alternative is a directly indexed array of states:
class State
{
    // pointers, since a class can't contain an array of its own
    // (still incomplete) type; use the char offset from 'A' as the index
    State* transitions[ 26 ];

    State* getNextState( int ch )
    {
        return transitions[ toupper( ch ) - 'A' ];
    }
};
A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.
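If it helps, here is a compact sketch of the same idea in Java rather than C (a standard textbook TST insert/lookup, not Kankowski's code):
final class TstNode {
    char ch;
    boolean endOfWord;
    TstNode lo, eq, hi; // less-than, equal, greater-than children

    static TstNode insert(TstNode node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new TstNode();
            node.ch = c;
        }
        if (c < node.ch)                node.lo = insert(node.lo, word, i);
        else if (c > node.ch)           node.hi = insert(node.hi, word, i);
        else if (i < word.length() - 1) node.eq = insert(node.eq, word, i + 1);
        else                            node.endOfWord = true;
        return node;
    }

    static boolean contains(TstNode node, String word, int i) {
        if (node == null) return false;
        char c = word.charAt(i);
        if (c < node.ch) return contains(node.lo, word, i);
        if (c > node.ch) return contains(node.hi, word, i);
        if (i == word.length() - 1) return node.endOfWord;
        return contains(node.eq, word, i + 1);
    }
}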
I have done this (a trie implementation) in C with 8-bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full Unicode support (since a .NET char is Unicode, among other reasons). Assuming you have to support Unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K-entry array in each node won't really work very well.
About the only hack I could think of here is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update jul 2013 ---
If .NET strings have a function like Java's to get the bytes of a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable-size, with first/last bounds indicators in each node, since MANY nodes will have only lower-case ASCII letters anyway, or only upper-case letters or the digits 0-9 in some cases.
I've found burst tries to be very space-efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.