spotting an Ambiguous BNF - grammar

I have an assignment to correct an ambiguous BNF, but I am completely lost. I know this not a true programming question, and I will gladly delete it if it is not an appropriate question for these boards. Are there any good sites where I could learn more about BNF's? The one I am dealing with seems rather simple, but I can't find any examples or good explanations regarding BNF's. I have had some experience spotting ambiguous parse trees and other sorts of grammars, but I am completely lost on this one.
Since it is a school assignment, I am not sure I should post the BNF in question, but if anyone knows of a good site that I could look over to perhaps gain a better understanding of how to attack my question. I really just don't know where to start.

Some BNF describing a context-free grammar is also describing a state machine (in this case a Pushdown automata). The best way to do this is probably by inspection of the state machine.
As a starting point, you could look into what a conflict is within parsers that make use of such automata.

If there are two or more of the same nonterminals at the right side of a sentence it's ambiguous. For example: < expr > -> < expr > + < expr > | < facto >. The right side expr's can be exported in different ways in the tree, so different trees can be drawn and it's ambiguous.

Related

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract, in a closed-context, certain actions.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such solution, I don't really know if it has such functionality, I mean, if I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are in improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. Somewhere in there I know it would be possible to modify it for use in natural language processing, but to date haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.

What is a good tool for automatically calculating FIRST and FOLLOW sets?

I'm currently in the middle of playing with a BNF grammar that I hope to be able to wrangle into a LL(1) form. However, I've just finished making changes and calculating the new FIRST and FOLLOW sets for the grammar by hand for the third time today and I'm getting tired of it. There has to be a better way!
Can someone suggest a tool that, given a grammar, will automatically calculate the first and follow sets for all of the non terminals?
A year ago, we had a semester project at the university I attend, where our task was to create a programming language. As a group, we decided we wanted to be able to hand-write the parser from scratch, so we had to aim for an LL(1) grammar, since it would have been completely unrealistic to write a parser otherwise.
Of course, our starting point was far from being LL(1), so we too had to wrangle it into place. For that purpose, we used the kfgEdit tool from the AtoCC package. All you do is enter your rules, and then it can check if it's an LL(1) grammar at the click of a button.
A fair word of warning: The tool is a bit finicky about what it accepts. While you'd often use EBNF for the real grammar, so you can write ? and * and + to signal how many times that token must appear there, this is not supported. Grouping is also not supported. You may very well find that it takes a very long time to do this, and you will almost surely want to do some "rearranging" after you've reached LL(1) to make the grammar even close to being readable.
Of course, depending on the type of grammar you're dealing with, this may not be much of a problem for you. We had created a sort of Pascal/C hybrid, with a fairly restricted set of constructs (procedures, functions, only built-in primitive types and arrays of them, ifs, a single loop construct we'd come up with ourselves in place of the standard 3...), and it took me at least a week to wrangle it into an LL(1) grammar - probably 2, actually. Note that this is out of a total of about 4 months, so that was a lot of time spent there.
If you absolutely MUST have an LL(1) grammar, then you obviously will need to press on if you get into a situation like this, but if you're allowed to use parser generators like yacc/bison or SableCC then you will, in the long run, most likely find it a LOT easier to go down that route. That doesn't mean you SHOULD go down that route - I found that actually writing everything by hand provided some insight I probably wouldn't have gained otherwise - but it might be better for you to gain that insight in a different situation than your current.
tl;dr version: Use kfgEdit from the AtoCC package.
For recursive descent parsing, it would be worth looking at ANTLR. However, I'm not sure it provides an exact answer for your question - find the FIRST and FOLLOW sets for a given grammar.
The DMS Software Reengineering Toolkit has a parser generator that computes FIRST and FOLLOW sets; it will also let you inspect the L(AL)R state machine it generates.
However, if you have a legitimate context-free grammar, you don't have to "wrangle it" into LL shape; the DMS parser generator produces GLR parsers from any context-free grammar.

Is it easier to write a recursive-descent parser using an EBNF or a BNF?

I've got a BNF and EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea as far as using the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources to convert an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't go over converting them into a recursive-descent parser. We did go over converting BNF's into a recursive-descent parser.
The reason I'm asking is because the EBNF is more compact.
From looking at the EBNF's in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive descent parsers. How they do it is exactly the answer your question.
(Its pretty straightfoward, but good to understand the details).
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too and the output of both!. There's an astonishing moment that you come too if you go build one of these, where you realized how the meta-compiler compiles itself using its own grammar. This moment got me
hooked on compiler back in about 1970 when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system [based on these metacompilers] has a great online MetaII tutorial, for those of you that don't want the do-it-from-scratch experience. (I have nothing to do with this except that Neighbors and I were undergraduates together).
Both ways are a fine way to learn about metacompilers and generating parsers from EBNF.
The key ideas are that the left hand side of a rule creates a function that parses that nonterminal and returns true if match and advances the input stream; false if no match and the input stream doesn't advance.
The contents of the function is determined by the right hand side. Literal tokens are matched directly.
Nonterminals cause calls to other functions generated for the other rules.
Kleene* maps to while loops, alternations map to conditional branches. What EBNF doesn't address,
and the metacompilers do, is how does parsing do anyting other than saying "matched" or not?
The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use the English... "zero or more" is a simple while loop as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
The early meta compilers META II and TREEMETA and their kin are not exactly recursive decent parser. They were were stated as using recursive functions. That just meant they could call them selves.
We do not call C a recursive language. A C or C++ function is recursive in the same way the early meta compilers are recursive.
Recursion can be used. They were programming languages. Recursion is generally used only when analyzing nexted language constructs. For example parenthesized expression and nexted blocks.
More of an LR recursive decent combination. CWIC the last documented one has extensive backtracking and look ahead features. The '-' not operator can match any language construct. And inverts it success or failure. -term fails if a term is matched for example. The input is never advanced. The '?' looks ahead and matches any language construct ?expr for example would try to parse an expr. The look ahead '?' matched construct is not kept or is the input advanced.

Programming languages that define the problem instead of the solution?

Are there any programming languages designed to define the solution to a given problem instead of defining instructions to solve it? So, one would define what the solution or end result should look like and the language interpreter would determine how to arrive at that result. Looking at the list of programming languages, I'm not sure how to even begin to research this.
The best examples I can currently think of to help illustrate what I'm trying to ask are SQL and MapReduce, although those are both sort of mini-languages designed to retrieve data. But, when writing SQL or MapReduce statements, you're defining the end result, and the DB decides the best course of action to arrive at the end result set.
I could see these types of languages, if they exist, being used in crunching a lot of data or finding solutions to a set of equations. The dream language would be one that could interpret the defined problem, identify which parts are parallelizable, and execute the solution across multiple processes/cores/boxes.
What about Declarative Programming? Excerpt from wikipedia article (emphasis added):
In computer science, declarative
programming is a programming paradigm
that expresses the logic of a
computation without describing its
control flow. Many languages
applying this style attempt to
minimize or eliminate side effects by
describing what the program should
accomplish, rather than describing how
to go about accomplishing it. This
is in contrast with imperative
programming, which requires an
explicitly provided algorithm.
The closest you can get to something like this is with a logic language such as Prolog. In these languages you model the problem's logic but again it's not magic.
This sounds like a description of a declarative language (specifically a logic programming language), the most well-known example of which is Prolog. I have no idea whether Prolog is parallelizable, though.
In my experience, Prolog is great for solving constraint-satisfaction problems (ones where there's a set of conditions that must be satisfied) -- you define your input set, define the constraints (e.g., an ordering that must be imposed on the previously unordered inputs) -- but pathological cases are possible, and sometimes the logical deduction process takes a very long time to complete.
If you can define your problem in terms of a Boolean formula you could throw a SAT solver at it, but note that the 3SAT problem (Boolean variable assignment over three-variable clauses) is NP-complete, and its first-order-logic big brother, the Quantified Boolean formula problem (which uses the existential quantifier as well as the universal quantifier), is PSPACE-complete.
There are some very good theorem provers written in OCaml and other FP languages; here are a whole bunch of them.
And of course there's always linear programming via the simplex method.
These languages are commonly referred to as 5th generation programming languages. There are a few examples on the Wikipedia entry I have linked to.
Let me try to answer ... may be Prolog could answer your needs.
I would say Objective Caml (OCaml) too...
This may seem flippant but in a sense that is what stackoverflow is. You declare a problem and or intended result and the community provides the solution, usually in code.
It seems immensely difficult to model dynamic open systems down to a finite number of solutions. I think there is a reason most programming languages are imperative. Not to mention there are massive P = NP problems lurking in the dark that would make such a system difficult to engineer.
Although what would be interesting is if there was a formal framework that could leverage human input to "crunch the numbers" and provide a solution, perhaps imperative code generation. The internet and google search engines are kind of that tool but very primitive.
Large problems and software are basically just a collection of smaller problems solved in code. So any system that generated code would require fairly delimited problem sets that can be mapped to more or less atomic solutions.
Lisp. There are so many Lisp systems out there defined in terms of rules not imperative commands. Google ahoy...
There are various Java-based rules engines which allow declarative programming - Drools is one that I've played with and it seems pretty interesting.
A lot of languages define more problems than solutions (don't take this one seriously).
On a serious note: one more vote for Prolog and different kinds of DSLs designed to be declarative.
I remember reading something about computation using DNA back when I was in college. You would put segments of DNA in a solution that represented segments of the problem, and define it in such a way that if the DNA fits together, it's a valid solution. Then you let the properties of chemicals solve the problem for you and look for finished strands that represent a solution. It sounds sort of like what you are refering to.
I don't recall if it was theoretical or had been done, though.
LINQ could also be considered another declarative DSL (aschewing the argument that it's too similar to SQL). Again, you declare what your solution looks like, and LINQ decides how to find it.
The beauty of these kinds of languages is that projects like PLINQ (which I just found) can spring up around them. Check out this video with the PLINQ developers (WMV direct link) on how they parallelize solution finding without modifying the LINQ language (much).
While mathematical proofs don't constitute a programming language, they do form a formal language where you simply define solutions (as long as you allow nonconstructive proofs). Of course, it's not algorithmic, so "math" might not be an acceptable answer.
Meta Discussion
What constitutes a problem or a solution is not absolute and depends on the level of abstraction that you are taking as a reference point.
Let's compare the following 3 languages: SQL, C++, and CPU instructions.
C++ vs CPU instructions
If you choose array manipulation as the desired level of abstraction, then C++ allows you to "define the problem" instead of the solution:
array[i * 2 + 3] = 5;
array[t] = array[k - m] - 1;
Note what this C++ snippet does not state: how the memory is laid out, how many bits are used by each array element, which CPU registers hold the data, and even in which order the arithmetic operations will be performed (as long as the result is the same).
The C++ compiler, however, will translate this code to lower-level CPU instructions that will contain all of these details.
At the abstraction level of array manipulation, C++ is declarative, and CPU instructions are imperative.
SQL vs C++
If you choose a sorting algorithm as the desired level of abstraction, then SQL allows you to "define the problem" instead of the solution:
select *
from table
order by key
This snippet of code is declarative with respect to the sorting algorithm's level of abstraction because it declares that the output is sorted without using lower-level concepts (like array manipulation).
If you had to sort an array in C++ (without using a library), the program would be expressed in terms of array manipulation steps of a particular sorting algorithm.
void sort(int *array, int size) {
int key, j;
for(int i = 1; i < size; i++) {
key = array[i];
j = i;
while(j > 0 && array[j-1] > key) {
array[j] = array[j-1];
j--;
}
array[j] = key;
}
}
This snippet is not declarative with respect to the sorting algorithm's level of abstraction because it uses concepts (such as array manipulation) that are constituents of the sorting algorithm.
Summary
To summarize, whether a language defines problems or solutions depends on what problems and solutions you are referring to.
Many answers here have brought up examples: SQL, LINQ, Prolog, Lisp, OCaml. I am sure there are many useful levels of abstractions with respect to which these languages are declarative.
However, do not forget that you can build a language with an even higher level of abstraction on top of them.

Getting started with ANTLR and avoiding common mistakes

I have started to learn ANTLR and have both the 2007 book "The Definitive ANTLR Reference" and ANTLRWorks (an interactive tool for creating grammars). And, being that sort of person, I started at Chapter 3. ("A quick tour for the impatient").
It's a fairly painful process especially as some errors are rather impenetrable (e.g. ANTLR: "missing attribute access on rule scope" problem which just means to me "you got something wrong"). Also I have some very simple grammars (3-4 productions only) and simple input (2 lines) which when run give "OutOfMemory" error.
The ANTLR site is useful but somewhat fragmented and some SO users have commented (https://stackoverflow.com/questions/278480/good-tutorial-for-antlr) that the book and the tutorials expect a high entry level. I've been reluctant to approach the ANTLR discussion list because of this.
LATER We are beginning to get to grips with it. It would be useful to have simple reliable examples that could be gently expanded. It's certainly worth mastering as we have remodelled quite a lot of our thinking based on ANTLR.
One problem is that ANTLR V3 has signifcant changes from V2. One answer on SO (and on the ANTLR pages) refered to a V2 syntax that is no longer available.
Some of the ANTLR questions on SO have helped me a lot, but finding them is a bit ad hoc. So I'd like to know how SO users can help to make the learning process less painful. (If you refer to the reference book it would be useful to point to particular pages).
EDIT. #duffymo and #JamesAnderson have confirmed that ANTLR is hard work - largely because parsers are difficult. (FWIW I have been through LEX/YACC, etc. and there's no doubt that ANTLR is more powerful and easier to work with.) I think it would still be useful to have areas where it's possible to avoid fouling up such as:
ensure correct capitalisation of variable names
add package name to lexer as well as parser
take care over order of rules as it affects precedence
and more of these sort would be useful.
I agree - ANTLR is not for the faint of heart. It does expect a high entry level, because grammars and parsers are not trivial.
With that said, here are a few suggestions:
Forget about v2. Version 3 is the standard; don't even waste time considering the earlier version or its documentation.
OutOfMemoryError is telling you that there's something circular in the grammar you've defined.
IntelliJ has a wonderful IDE for working with ANTLR v3. It'll give you a graphical representation of your grammar, step-through debugging, etc. If you're going to be doing a lot of work with ANTLR it'd be worth a few dollars to buy a license.
ANTLR won't be easy to master. The book is good, but dense. The error messages are cryptic, as you've noted. I'd be surprised if anyone here could make it easy.
Sorry but my experience of ANTLR (indeed javacc, bison or any full function parser) is that most of your learning will be by fixing your own mistakes!
Getting good examples of other peoples code will cut this down somewhat, the best examples look really simple -- but you are missing all the sweat and hair pulling it took to get them looking that easy.
Even if you prefer command line, it is worth using AntlrWorks when you have problems. The diagramatic representation can make it easier to see what i sgoing wrong.
A picture is worth a thousand error messages.