Lex and Yacc for Finite Automata

I want to develop a tool that constructs the transition graph of any finite automaton, given its transition table, start state, and final state, using Lex and Yacc. The tool should also provide a facility to check whether a string is accepted by the automaton or not.
Can anyone tell me how to do it?

This might be a helpful introduction: an implementation of a DFA in lex, with source code and quite detailed illustrations.
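The tool itself would presumably use Yacc to parse the transition table and Lex to tokenize it. As a rough sketch of just the acceptance-checking part, with a hard-coded table standing in for whatever the parser would build, a table-driven simulator in C might look like this (the states, alphabet, and table contents are invented for illustration):

```c
#include <stdio.h>

/* Hypothetical example: a DFA over {a, b} with three states (0 is the
 * start state) that accepts every string containing the substring "ab".
 * In the real tool, this table would be filled in from the transition
 * table parsed by the Yacc grammar instead of being hard-coded. */
#define NSTATES 3
#define NSYMS   2                      /* column 0 = 'a', column 1 = 'b' */

static const int delta[NSTATES][NSYMS] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {2, 2},
};
static const int start_state = 0;
static const int is_final[NSTATES] = {0, 0, 1};

/* Returns 1 if the DFA accepts the string, 0 otherwise. */
static int accepts(const char *s)
{
    int state = start_state;
    for (; *s; ++s) {
        int sym;
        if (*s == 'a')      sym = 0;
        else if (*s == 'b') sym = 1;
        else return 0;                 /* symbol not in the alphabet */
        state = delta[state][sym];
    }
    return is_final[state];
}

int main(void)
{
    const char *tests[] = {"ab", "ba", "aab", "bbb"};
    for (int i = 0; i < 4; ++i)
        printf("%-4s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}
```

A Yacc grammar for whatever transition-table syntax you choose would then just fill in delta, start_state, and is_final before calling accepts().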

Related

Compare NSString with a bunch of regexes

My case is a bit different, I think: I have a word and about 100 regexes. I want to check which of the regexes the word matches. How can I do this in an optimised way?
The most efficient way would be to combine all of those regular expressions into a deterministic finite automaton (a finite state machine). Then run the string through that finite state machine.
Michael Sipser's Introduction to the Theory of Computation explains how to do this. It is fairly complex, thus the reference to the book.
After you have constructed the DFA by hand, you can implement it in code.
There are tools that can do this for you, such as flex. flex takes the regular expressions as input and generates the DFA as a .c file, which you can then use in your project. You can configure flex to return a token indicating which regular expression was matched.
flex is a Unix tool and is included with OS X 10.8.
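As a rough sketch (the token names and patterns here are invented for illustration, not taken from the question's 100 regexes), a flex specification that reports which pattern matched might look like this:

```lex
%{
#include <stdio.h>
/* Hypothetical token codes; in a real project they might come from a
 * shared header or a yacc-generated one. */
enum { TOK_DATE = 1, TOK_NUMBER, TOK_IDENT };
%}

%option noyywrap

%%
[0-9]{4}-[0-9]{2}-[0-9]{2}   { return TOK_DATE;   }
[0-9]+                       { return TOK_NUMBER; }
[A-Za-z_][A-Za-z0-9_]*       { return TOK_IDENT;  }
.|\n                         { /* not matched by any of the patterns */ }
%%

int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)     /* yylex() returns 0 at end of input */
        printf("pattern #%d matched: %s\n", tok, yytext);
    return 0;
}
```

flex combines all the patterns into a single DFA, so scanning the word is one pass over it regardless of how many patterns there are.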

Is there some other way to describe a formal language other than grammars?

I'm looking for the mathematical theory which deals with describing formal languages (sets of strings) in general, and not just grammar hierarchies.
Grammars give you the algorithm that lists all possible strings in the language. You could specify the algorithm any other way, but grammars are a concise and well-accepted format to do so.
Another way is to list every string that belongs to the language -- this will only work if the set of strings in the language is small (and definitely not when the set is infinite).
Regular expressions, for instance, are a formalism for describing languages. Although there are algorithms for converting between regular grammars and regular expressions in both directions, they are still two different theories. Automata (the plural of automaton) can also describe languages: not just DFAs and NFAs, which describe exactly the regular languages, but also two-way DFAs, stack (pushdown) automata, and so on. A two-stack automaton, for example, is as powerful as a Turing machine. Finally, Turing machines themselves are a formalism for languages: for any Turing machine, the set of all strings on which it halts in a finite number of steps is a formally defined language.
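As a small illustration (the particular language is my own choice, not one mentioned above), the same language, all strings of a's followed by a single b, can be described in several of these formalisms:

```
Language:             L = { a^n b : n >= 0 }
Regular expression:   a*b
Right-linear grammar: S -> a S | b
DFA:                  states {q0, q1, dead}, start q0, accepting {q1}
                      delta(q0, a) = q0    delta(q0, b) = q1
                      delta(q1, a) = dead  delta(q1, b) = dead
```

Each of these descriptions determines exactly the same set of strings.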

Chomsky hierarchy in plain English

I'm trying to find a plain (i.e. non-formal) explanation of the 4 levels of formal grammars (unrestricted, context-sensitive, context-free, regular) as set out by Chomsky.
It's been an age since I studied formal grammars, and the various definitions are now confusing for me to visualize. To be clear, I'm not looking for the formal definitions you'll find everywhere (e.g. here and here -- I can google as well as anyone else), or really even formal definitions of any sort. Instead, what I was hoping to find was clean and simple explanations that don't sacrifice clarity for the sake of completeness.
Maybe you will get a better understanding if you remember the automata that recognize these languages.
Regular languages are recognized by finite automata. They have only a finite knowledge of the past (their working memory is bounded), so whenever a language has suffixes that depend on arbitrarily long prefixes (the palindrome language, for example), it cannot be handled by a finite automaton and is not regular.
Context-free languages are recognized by nondeterministic pushdown automata. They have a kind of knowledge of the past (the stack, which, unlike the memory of a finite automaton, is unbounded), but a stack can only be viewed from the top, so you don't have complete knowledge of the past.
Context-sensitive languages are recognized by linear-bounded nondeterministic Turing machines. They know the past and can deal with different contexts because they are nondeterministic and can access all of the past at any time.
Unrestricted languages are recognized by Turing machines. According to the Church-Turing thesis, Turing machines are able to compute anything that can be computed at all.
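To make the difference between finite memory and a stack concrete, here is a small C sketch for the standard textbook language a^n b^n (all a's followed by the same number of b's, my own example rather than one from the answer above); the counter plays the role of the pushdown automaton's stack:

```c
#include <stdio.h>

/* Recognizer for the context-free language { a^n b^n : n >= 0 }.
 * The counter stands in for a pushdown stack holding n symbols;
 * no fixed finite set of states can remember n for all inputs,
 * which is why this language is not regular. */
static int accepts(const char *s)
{
    long depth = 0;
    /* phase 1: count the leading a's ("push") */
    while (*s == 'a') { ++depth; ++s; }
    /* phase 2: match each b against one counted a ("pop") */
    while (*s == 'b' && depth > 0) { --depth; ++s; }
    /* accept only if the input is exhausted and the counter is empty */
    return *s == '\0' && depth == 0;
}

int main(void)
{
    const char *tests[] = {"", "ab", "aabb", "aab", "abab"};
    for (int i = 0; i < 5; ++i)
        printf("%-5s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}
```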
As for regular languages, there are many equivalent characterizations. They give many different ways of looking at regular languages. It is hard to give a "plain English" definition, and if you find it hard to understand any of the characterizations of regular languages, it is unlikely that a "plain English" explanation will help. One thing to note from the definitions and various closure properties is that regular languages embody the notion of "finiteness" somehow. But this is again hard to appreciate without better familiarity with regular languages.
Do you find the notion of a finite automaton to be not simple and clean?
Let me mention some of the many equivalent characterizations (at least for other readers):
Languages accepted by deterministic finite automata
Languages accepted by nondeterministic finite automata
Languages accepted by alternating finite automata
Languages accepted by two-way deterministic finite automata
Languages generated by left-linear grammars
Languages generated by right-linear grammars
Languages generated by regular expressions.
A union of some equivalence classes of a right-congruence of finite index.
A union of some equivalence classes of a congruence of finite index.
The inverse image under a monoid homomorphism of a subset of a finite monoid.
Languages expressible in monadic second order logic over words.
Regular: a finite automaton can answer yes/no for these languages.
Context-free: given an input word, we can always answer yes/no as to whether it is a member of the language (using a state machine plus a stack).
Context-sensitive: as long as no production in the grammar shrinks (α -> β with β at least as long as α), we can answer yes/no (using a state machine plus a chunk of memory that is linear in the size of the input).
Recursively enumerable: the machine can answer yes, but in the case of no it may run forever.
See this video for a full explanation.

ANTLR, steps order

I am trying to design a compiler for a language like C# in ANTLR, but I don't fully understand the proper order of the steps that should be undertaken.
This is how I see it:
First I define Lexer tokens
Then grammar rules (with rewrite rules to build the AST), with actions that gather information about class and method declarations (so that I can resolve method invocations in the next step)
Finally, I create "tree grammar" which traverse AST tree and invokes rules that generate the opcodes of (virtual) machine language.
Is this correct? Is the second step's role to read method declarations and build the AST?
How can I resolve overloaded method declarations without building the AST? (Backpatching?)
Have a look at Language Implementation Patterns; it explains how to create your own languages (both interpreted and byte-code/VM-based). At the moment, your questions are too broad, and I don't think anyone can post a forum answer that explains all the details of how to create your own language from start to finish.
Feel free to ask more specific questions when you have them, of course.
Good luck!

Is it easier to write a recursive-descent parser using an EBNF or a BNF?

I've got a BNF and EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea as far as using the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources to convert an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't go over converting them into a recursive-descent parser. We did go over converting BNF's into a recursive-descent parser.
The reason I'm asking is because the EBNF is more compact.
From looking at EBNFs in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive descent parsers. How they do it is exactly the answer to your question.
(It's pretty straightforward, but good to understand the details.)
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too and the output of both! There's an astonishing moment you come to if you go build one of these, where you realize how the metacompiler compiles itself using its own grammar. This moment got me hooked on compilers back in about 1970, when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system, which was based on these metacompilers) has a great online MetaII tutorial, for those of you who don't want the do-it-from-scratch experience. (I have nothing to do with this except that Neighbors and I were undergraduates together.)
Both ways are a fine way to learn about metacompilers and generating parsers from EBNF.
The key idea is that the left-hand side of a rule creates a function that parses that nonterminal: it returns true and advances the input stream on a match, and returns false without advancing the input stream otherwise.
The contents of the function is determined by the right hand side. Literal tokens are matched directly.
Nonterminals cause calls to other functions generated for the other rules.
The Kleene star maps to while loops, and alternations map to conditional branches. What EBNF doesn't address, and the metacompilers do, is how parsing does anything other than saying "matched" or not. The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
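As a rough sketch of that mapping (the grammar rule and the helper functions are invented here for illustration, not taken from the MetaII paper), an EBNF rule such as expr = term { ("+" | "-") term } might be turned into a C function like this:

```c
#include <stdbool.h>
#include <ctype.h>
#include <stdio.h>

/* Minimal input machinery, invented for this sketch. */
static const char *input;

static bool match(char c)            /* consume c if it is next, else leave input alone */
{
    if (*input == c) { ++input; return true; }
    return false;
}

static bool term(void)               /* toy "term": a single digit */
{
    if (isdigit((unsigned char)*input)) { ++input; return true; }
    return false;
}

/* Generated from:  expr = term { ("+" | "-") term } ;
 * Returns true and advances the input on a match, false otherwise. */
static bool expr(void)
{
    if (!term())                     /* leading nonterminal: call its function   */
        return false;
    for (;;) {                       /* { ... } : Kleene repetition -> a loop    */
        if (match('+') || match('-')) {   /* alternation -> conditional branches */
            if (!term())
                return false;        /* operator without a following term: fail  */
        } else {
            break;                   /* no further repetitions: done             */
        }
    }
    return true;
}

int main(void)
{
    input = "1+2-3";
    printf("%s\n", expr() && *input == '\0' ? "parsed" : "syntax error");
    return 0;
}
```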
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use the English... "zero or more" is a simple while loop as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
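In C, those three cases might translate roughly as follows; X() is a placeholder for whatever parsing function the repeated item expands to, and the stub is only there so the sketch compiles on its own:

```c
#include <stdbool.h>

/* X() stands for the parser function of the repeated item. */
static bool X(void) { return false; }

/* "zero or more":  { X }   -> a plain while loop */
static bool zero_or_more(void)
{
    while (X()) { /* keep matching */ }
    return true;                  /* always succeeds, possibly matching nothing */
}

/* "one or more":   X { X } -> one required match, then "zero or more" */
static bool one_or_more(void)
{
    if (!X()) return false;       /* the first occurrence is mandatory */
    return zero_or_more();
}

/* "zero or one":   [ X ]   -> a simple if; success either way */
static bool zero_or_one(void)
{
    if (X()) { /* optional item was present */ }
    return true;
}

int main(void)
{
    /* With the stub X(), only the "one or more" case can fail. */
    return zero_or_more() && zero_or_one() && !one_or_more() ? 0 : 1;
}
```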
The early metacompilers META II and TREEMETA and their kin are not exactly recursive descent parsers. They were described as using recursive functions, which just meant that the functions could call themselves.
We do not call C a recursive language; a C or C++ function is recursive in the same way the early metacompilers are recursive.
Recursion can be used, since they were programming languages, but it is generally needed only when analyzing nested language constructs, for example parenthesized expressions and nested blocks.
They are more of an LR / recursive descent combination. CWIC, the last documented one, has extensive backtracking and look-ahead features. The '-' (not) operator can match any language construct and inverts its success or failure: -term fails if a term is matched, for example, and the input is never advanced. The '?' operator looks ahead and matches any language construct: ?expr, for example, would try to parse an expr. The construct matched by the look-ahead '?' is not kept, nor is the input advanced.