Getting started with ANTLR and avoiding common mistakes

I have started to learn ANTLR and have both the 2007 book "The Definitive ANTLR Reference" and ANTLRWorks (an interactive tool for creating grammars). And, being that sort of person, I started at Chapter 3 ("A quick tour for the impatient").
It's a fairly painful process, especially as some errors are rather impenetrable (e.g. ANTLR's "missing attribute access on rule scope", which to me just means "you got something wrong"). Also, I have some very simple grammars (3-4 productions only) and simple input (2 lines) which, when run, give an OutOfMemoryError.
The ANTLR site is useful but somewhat fragmented, and some SO users have commented (https://stackoverflow.com/questions/278480/good-tutorial-for-antlr) that the book and the tutorials expect a high entry level. I've been reluctant to approach the ANTLR discussion list because of this.
LATER: We are beginning to get to grips with it. It would be useful to have simple, reliable examples that could be gently expanded. It's certainly worth mastering, as we have remodelled quite a lot of our thinking based on ANTLR.
One problem is that ANTLR v3 has significant changes from v2. One answer on SO (and on the ANTLR pages) referred to a v2 syntax that is no longer available.
Some of the ANTLR questions on SO have helped me a lot, but finding them is a bit ad hoc. So I'd like to know how SO users can help make the learning process less painful. (If you refer to the reference book, it would be useful to point to particular pages.)
EDIT: @duffymo and @JamesAnderson have confirmed that ANTLR is hard work - largely because parsers are difficult. (FWIW I have been through LEX/YACC, etc., and there's no doubt that ANTLR is more powerful and easier to work with.) I think it would still be useful to have pointers on how to avoid fouling up, such as:
ensure correct capitalisation of rule names (lexer rules upper-case, parser rules lower-case)
add package name to lexer as well as parser
take care over order of rules as it affects precedence
and more of this sort would be useful (a minimal grammar touching these pitfalls is sketched below).
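Here is a minimal sketch touching all three pitfalls at once (ANTLR v3 syntax; the grammar, package and rule names are invented for illustration):

    grammar Tiny;

    @header        { package com.example.tiny; }   // package for the parser...
    @lexer::header { package com.example.tiny; }   // ...and again for the lexer

    greeting : HELLO NAME ;   // parser rules start lower-case

    // Lexer rules start upper-case, and order matters: if NAME were listed
    // before HELLO, the input "hello" would match as a NAME and the keyword
    // rule could never win.
    HELLO : 'hello' ;
    NAME  : ('a'..'z')+ ;
    WS    : (' '|'\t'|'\r'|'\n')+ { skip(); } ;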

I agree - ANTLR is not for the faint of heart. It does expect a high entry level, because grammars and parsers are not trivial.
With that said, here are a few suggestions:
Forget about v2. Version 3 is the standard; don't even waste time considering the earlier version or its documentation.
OutOfMemoryError is telling you that there's something circular in the grammar you've defined.
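A common way to create such a cycle - a guess at the usual culprit, since grammars vary - is left recursion, which ANTLR v3 cannot handle: code generation rejects it, and the ANTLRWorks interpreter can chase the cycle until memory runs out:

    // Directly left-recursive: expr calls itself before consuming any
    // token, so prediction never makes progress.
    expr : expr '+' term
         | term
         ;
    term : INT ;

Rewriting the recursion as iteration (expr : term ('+' term)* ;) removes the cycle.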
IntelliJ IDEA has wonderful support for working with ANTLR v3. It'll give you a graphical representation of your grammar, step-through debugging, etc. If you're going to be doing a lot of work with ANTLR, it'd be worth a few dollars to buy a license.
ANTLR won't be easy to master. The book is good, but dense. The error messages are cryptic, as you've noted. I'd be surprised if anyone here could make it easy.

Sorry, but my experience of ANTLR (indeed of javacc, bison, or any full-function parser generator) is that most of your learning will be done by fixing your own mistakes!
Getting good examples of other people's code will cut this down somewhat. The best examples look really simple - but you are missing all the sweat and hair-pulling it took to get them looking that easy.

Even if you prefer the command line, it is worth using ANTLRWorks when you have problems. The diagrammatic representation can make it easier to see what is going wrong.
A picture is worth a thousand error messages.


Shallow parsing with ANTLR

I'm trying to develop a solution that can extract certain actions within a closed context.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such a solution, I don't really know whether it has such functionality - I mean, whether I can define several root symbols.
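For what it's worth, ANTLR does allow something close to multiple root symbols: every parser rule becomes its own entry-point method in the generated parser, so you can start parsing from whichever rule fits the fragment at hand. A sketch, with invented rule and token names:

    grammar Booking;   // illustrative only

    // Each parser rule is an independent entry point; the generated parser
    // exposes film(), seat() and showing() as separate methods.
    film    : 'movie' TITLE ;
    seat    : 'row' NUMBER ;
    showing : 'on' DAY ;

    TITLE  : ('A'..'Z') ('a'..'z')* ;
    NUMBER : ('0'..'9')+ ;
    DAY    : 'tonight' | 'tomorrow' | 'monday' ;
    WS     : (' '|'\t')+ { skip(); } ;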
ANTLR has methods of addressing ambiguities in grammars. These methods are improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that it uses an augmented transition network as the basis for its parser. I believe it would be possible to modify it somewhere in there for use in natural language processing, but to date I haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch, but I'll catch up later this summer.
ANTLR isn't well suited to parsing human languages: they're too ambiguous. Try an NLP toolkit instead. Here's a list of natural language processing toolkits.

What tool to use for finding duplicated Ada code due to copy&paste

I'm looking for a tool that finds code duplicated by copy&paste programming, to be run over a large Ada codebase. I suppose Ada support in the tool is important for detecting more than trivial text similarities - that is, for ignoring layout or identifier differences, etc.
The tools that I have found with Ada support are the following:
Clone Doctor, commercial product with support for several languages, including Ada. http://www.semdesigns.com/Products/Clone/index.html
ConQAT: commercially supported open source product that includes a CloneDetection tool with Ada support since September 2011 http://conqat.cs.tum.edu/index.php/CloneDetectionTutorial
Have you tried these tools? Am I missing any other one of interest? Is the language support really significant or a general text tool would be enough? What is your experience with code duplication detection?
Thanks in advance.
I'm the author of CloneDR. Read the following with my bias in mind.
It is important to understand the differences in the detection methods of clone detection tools, and the quality of the results as a consequence.
ConQAT is a representative of what are called "token-based" detectors. They match sequences of language tokens (operators, identifiers, brackets, keywords, etc.). The good news is that they are pretty fast (not that speed is a big issue; you don't run clone detection every 30 seconds - once a week is enough). They will find some clones that are near-misses, in the sense that another identifier or constant is substituted for an identifier in a clone. The bad news is that they don't understand the structure of your code and consequently want to report things like
} void ID ( ID
as clones. This is defeated by making the detectors only hunt for very long sequences of tokens (typically 30 or more), which means token-based detectors cannot find small but interesting clones without also drowning you in false positives like the above.
CloneDR operates by parsing the code (even for Ada) just like a compiler, building abstract syntax trees, and matching the trees up to a point of difference. It cannot propose a clone that crosses structure boundaries in silly ways. It will find near misses of the same kind as the token based detectors, but it goes beyond this. CloneDR will find consistent substitutions ("anti unifiers") which means clones can be explained by a small number of parameters that have been used in many places in the clone, and it will find variations in the code in which the mismatches are larger than a single token, e.g., expressions, statements, declarations, even blocks. So it produces fewer false positives and better answers. Independent research reports that compare types of clone detectors, specifically including CloneDR, agree with this analysis.
There is more detailed discussion at the Clone Doctor link you listed above. You can see examples of detected clones for many languages (but we don't have an Ada report on the web site).
EDIT March 19, 2012:
Now you can download an eval copy of an Ada95 CloneDR.
Ira Baxter has a good description.
Token-based clone detection tools tend to be good enough for our purpose, which is usually to get a quick overview of how bad code duplication is in a body of source code we haven't seen before, and how duplication is distributed across that code.
In particular, we are happy with CCFinderX, because it has a nice visualization frontend.
However, it's buggy, unmaintained, and the code has been released but without any license statement.
It has language specific preprocessors for some languages, but we often just disable them (they are buggy as well).
If you need better accuracy, if you know exactly the language you need to parse (with C or C++, this is not always the case), and if you can find a tool that parses exactly that language (also an issue with C and C++), then a parsing-based approach may be better, as Ira writes.

Tools for generating a grammar from examples?

This answer shows a nice example of using a parser generator to look through text for some patterns of interest. In that example, it's product prices.
Does anyone know of tools to generate the grammars given training examples (document + info I want from it)? I found a couple papers, but no tools. I looked through ANTLR docs a bit, but it deals with grammars; a "recognizer" takes as input a grammar, not training examples.
This is a machine learning problem. You can at best get an approximation. But I don't think anybody has done this well, let alone released a tool. (I actively track what people do to build grammars for computer languages, and this idea has been proposed many times, but I have yet to see a useful implementation).
The problem is that for any fixed set of examples, there's a huge number of possible grammars. It is easy to construct a naive one: for the fixed set of examples, simply propose a grammar that has one rule recognizing each example. That works, but is hardly helpful. Now the question is, how many ways can you generalize this, and which one is the best? In fact you can't know, because your next example may be a total surprise in terms of structure. (Recall the definition: a language is the set of sentences that comprise it.)
We haven't even talked about the simpler problem of learning the lexemes of the language. How would you propose to learn what legal strings for floating point numbers are?
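To make the naive construction concrete (a toy sketch; the two-example training set is invented):

    grammar Naive;          // illustrative only

    // One alternative per training example: recognizes exactly the
    // samples {"ab", "ba"} and generalizes to nothing else.
    start : 'ab' | 'ba' ;

    // The lexeme problem, for contrast: a typical hand-written rule for
    // floating-point literals. No finite set of samples pins this down.
    FLOAT : ('0'..'9')+ '.' ('0'..'9')* (('e'|'E') ('+'|'-')? ('0'..'9')+)? ;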
One tool that does this is NLTK. I highly recommend it, and the O'Reilly book that covers it is available free online. There are tools for parsing, learning grammars, etc. The only downside is that it is mainly a research tool rather than a production tool, so the emphasis isn't on performance.
NLTK is able to construct a grammar from labeled training samples, which is exactly what you are asking for. Have a look at the great docs and the book. (My last experience with it also had it working on the JVM through Jython without any issues.)

What is a good tool for automatically calculating FIRST and FOLLOW sets?

I'm currently in the middle of playing with a BNF grammar that I hope to wrangle into LL(1) form. However, I've just finished making changes and calculating the new FIRST and FOLLOW sets for the grammar by hand for the third time today, and I'm getting tired of it. There has to be a better way!
Can someone suggest a tool that, given a grammar, will automatically calculate the FIRST and FOLLOW sets for all of the nonterminals?
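For concreteness, this is the sort of calculation I keep doing by hand - the classic LL(1) expression grammar, with the sets written out as comments ($ is end of input, eps the empty string):

    e  : t ep ;        // FIRST(e)  = { '(', ID }    FOLLOW(e)  = { ')', $ }
    ep : '+' t ep
       |               // FIRST(ep) = { '+', eps }   FOLLOW(ep) = { ')', $ }
       ;
    t  : f tp ;        // FIRST(t)  = { '(', ID }    FOLLOW(t)  = { '+', ')', $ }
    tp : '*' f tp
       |               // FIRST(tp) = { '*', eps }   FOLLOW(tp) = { '+', ')', $ }
       ;
    f  : '(' e ')'     // FIRST(f)  = { '(', ID }    FOLLOW(f)  = { '*', '+', ')', $ }
       | ID
       ;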
A year ago, we had a semester project at the university I attend, where our task was to create a programming language. As a group, we decided we wanted to be able to hand-write the parser from scratch, so we had to aim for an LL(1) grammar, since it would have been completely unrealistic to write a parser otherwise.
Of course, our starting point was far from being LL(1), so we too had to wrangle it into place. For that purpose, we used the kfgEdit tool from the AtoCC package. All you do is enter your rules, and then it can check if it's an LL(1) grammar at the click of a button.
A fair word of warning: the tool is a bit finicky about what it accepts. While you'd often use EBNF for the real grammar, writing ?, * and + to signal how many times a token may appear, this is not supported, and neither is grouping, so such constructs have to be expanded by hand (see the sketch below). You may very well find that this takes a very long time, and you will almost surely want to do some "rearranging" after you've reached LL(1) to make the grammar even close to being readable.
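A minimal sketch of that hand expansion (the helper rule name is mine):

    // EBNF you would naturally write, which the tool rejects:
    //   args : expr (',' expr)* ;
    // Equivalent plain BNF that it accepts:
    args     : expr argsTail ;
    argsTail : ',' expr argsTail
             |                     // empty alternative replaces the '*'
             ;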
Of course, depending on the type of grammar you're dealing with, this may not be much of a problem for you. We had created a sort of Pascal/C hybrid, with a fairly restricted set of constructs (procedures, functions, only built-in primitive types and arrays of them, ifs, a single loop construct we'd come up with ourselves in place of the standard 3...), and it took me at least a week to wrangle it into an LL(1) grammar - probably 2, actually. Note that this is out of a total of about 4 months, so that was a lot of time spent there.
If you absolutely MUST have an LL(1) grammar, then you obviously will need to press on if you get into a situation like this, but if you're allowed to use parser generators like yacc/bison or SableCC, then you will, in the long run, most likely find it a LOT easier to go down that route. That doesn't mean you SHOULD go down that route - I found that actually writing everything by hand provided some insight I probably wouldn't have gained otherwise - but it might be better for you to gain that insight in a different situation than your current one.
tl;dr version: Use kfgEdit from the AtoCC package.
For recursive-descent parsing, it would be worth looking at ANTLR. However, I'm not sure it provides an exact answer to your question of finding the FIRST and FOLLOW sets for a given grammar.
The DMS Software Reengineering Toolkit has a parser generator that computes FIRST and FOLLOW sets; it will also let you inspect the L(AL)R state machine it generates.
However, if you have a legitimate context-free grammar, you don't have to "wrangle it" into LL shape; the DMS parser generator produces GLR parsers from any context-free grammar.

What are the disadvantages of using ANTLR compared to Flex/Bison?

I worked with Flex and Bison a few years ago during my undergraduate studies, but I don't remember much about them now. Recently, I have heard about ANTLR.
Would you recommend that I learn ANTLR or better to brush up Flex/Bison?
Does ANTLR have more/less features than Flex/Bison?
ANTLRv3 is LL(k), and can be configured to be LL(*). The latter in particular is ridiculously easy to write parsers in, as you can essentially use EBNF as-is.
Also, ANTLR generates code that looks a lot like the recursive-descent parser you'd write from scratch. It's very readable, and it's easy to debug to see why a parse doesn't work or works incorrectly.
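To illustrate the EBNF point, a small made-up v3 grammar; the repetition and alternation operators go straight into the rules:

    grammar Expr;   // illustrative, trimmed to essentials

    prog : stat+ ;                      // '+': one or more, as in EBNF
    stat : ID '=' expr NEWLINE
         | expr NEWLINE
         ;
    expr : mult (('+' | '-') mult)* ;   // '*': zero or more repetitions
    mult : atom (('*' | '/') atom)* ;
    atom : INT | ID | '(' expr ')' ;

    ID      : ('a'..'z'|'A'..'Z')+ ;
    INT     : ('0'..'9')+ ;
    NEWLINE : '\r'? '\n' ;
    WS      : (' '|'\t')+ { skip(); } ; // v3 lexer action to drop whitespace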
The advantage of Flex/Bison (or any other LALR parser generator) is that the generated parser is faster.
ANTLR has a run-time library JAR that you must include in your project.
ANTLR's recursive-descent parsers are easier to debug than the "bottom-up" parsers generated by Flex/Bison, but the grammars you write for each are slightly different (LL grammars cannot be left-recursive, for example).
If you want a Flex/Bison-style (LALR) parser generator for Java, look at JavaCC.
We have decided to use ANTLR for some of our information-processing requirements - parsing legacy files and natural language. The learning curve is steep, but we are getting on top of it, and I feel that it's a more modern and versatile approach for what we need to do. The disadvantages - since you ask - are mainly the learning curve, which seems to be inevitable.