What is a good tool for automatically calculating FIRST and FOLLOW sets? - grammar

I'm currently in the middle of playing with a BNF grammar that I hope to be able to wrangle into a LL(1) form. However, I've just finished making changes and calculating the new FIRST and FOLLOW sets for the grammar by hand for the third time today and I'm getting tired of it. There has to be a better way!
Can someone suggest a tool that, given a grammar, will automatically calculate the first and follow sets for all of the non terminals?

A year ago, we had a semester project at the university I attend, where our task was to create a programming language. As a group, we decided we wanted to be able to hand-write the parser from scratch, so we had to aim for an LL(1) grammar, since it would have been completely unrealistic to write a parser otherwise.
Of course, our starting point was far from being LL(1), so we too had to wrangle it into place. For that purpose, we used the kfgEdit tool from the AtoCC package. All you do is enter your rules, and then it can check if it's an LL(1) grammar at the click of a button.
A fair word of warning: The tool is a bit finicky about what it accepts. While you'd often use EBNF for the real grammar, so you can write ? and * and + to signal how many times that token must appear there, this is not supported. Grouping is also not supported. You may very well find that it takes a very long time to do this, and you will almost surely want to do some "rearranging" after you've reached LL(1) to make the grammar even close to being readable.
Of course, depending on the type of grammar you're dealing with, this may not be much of a problem for you. We had created a sort of Pascal/C hybrid, with a fairly restricted set of constructs (procedures, functions, only built-in primitive types and arrays of them, ifs, a single loop construct we'd come up with ourselves in place of the standard 3...), and it took me at least a week to wrangle it into an LL(1) grammar - probably 2, actually. Note that this is out of a total of about 4 months, so that was a lot of time spent there.
If you absolutely MUST have an LL(1) grammar, then you obviously will need to press on if you get into a situation like this, but if you're allowed to use parser generators like yacc/bison or SableCC then you will, in the long run, most likely find it a LOT easier to go down that route. That doesn't mean you SHOULD go down that route - I found that actually writing everything by hand provided some insight I probably wouldn't have gained otherwise - but it might be better for you to gain that insight in a different situation than your current.
tl;dr version: Use kfgEdit from the AtoCC package.

For recursive descent parsing, it would be worth looking at ANTLR. However, I'm not sure it provides an exact answer for your question - find the FIRST and FOLLOW sets for a given grammar.

The DMS Software Reengineering Toolkit has a parser generator that computes FIRST and FOLLOW sets; it will also let you inspect the L(AL)R state machine it generates.
However, if you have a legitimate context-free grammar, you don't have to "wrangle it" into LL shape; the DMS parser generator produces GLR parsers from any context-free grammar.

Related

Antlr and PL/I grammar

Right now we would like to have the grammar of PL/I, COBOL based on Antlr4. Is there anyone provide these grammars
If not, can you please share your thought/experience on developing these grammars from scratch
Thanks
I assume you mean IBM PL/I and COBOL. (Not many other PL/Is around, but I don't think that really changes the answer much).
The obvious place to look for mature ANTLR grammars is ANTLR3 grammar library; no PL/1 or COBOL grammars there. The Antlr V4 (a very new, radical, backwards incompatible reengineering of ANTLR3) main page talks about Java and C#; no hint of PL/1 or COBOL there; given its newness, no surprise. If you are really lucky, somebody may have one they will give you and speak up.
Developing such grammars is difficult for several reasons (based on personal experience building production-quality parsers for these two specific items, using a very strong parser system different than ANTLR [see my bio for more details]):
The character set and column layout rules (columns 1-5, 6 and 72-80 are special) may be an issue: the languages you describe are typically written in EBCDIC historically in punch-card 80 column format without line break characters between lines. Translation to ASCII sometimes produces nasty glitches; the ASCII end-of-line character is occasionally found in the middle of COBOL literal strings as a binary value, but because it has the same exact code in EBCDIC and ASCII, after translation it will (be and) appear to be an ASCII newline break character. Character strings can also be long but split across multiple lines; but columns 72-80 by definition have to be ignored. Column 6 may contain a "D" character, which affects interpretation of the following source lines as "debug" or "not". This means you need to get 80 column processing right. I don't know what ANTLR has to support processing characters-in-column-areas. You'll also need to worry about DBCS encoding of string literals, and variations of that if the source code is used in non-English speaking countries, such as Japan.
These languages are large and complex; IBM has had 40 years to decorate them with cruft. The IBM COBOL manual is some 600 pages ... then you discover that COBOL also includes a Report Writer, which is another 600 page document. Capturing all the nuances of the lexical tokens and the grammar rules will take effort, and you have to do that from the IBM manuals, which don't contain nice BNF-style descriptions, which means guessing from the textual description and some examples. For COBOL, expect several thousand grammar rules; PL/1 is less complicated in the abstract. Expect a certain amount of "lies"; we've encountered a number of places where the reference documentation clearly says certain things are not legal, and yet the IBM compilers (based on real, running source code) accepts them, and vice versa. The only way you find these is by empirical experiments.
Both languages have constructs that are difficult to parse, e.g., requiring arbitrary lookahead and/or local ambiguity. ANTLR4 is much better than ANTLR3 from my understanding on these, but that doesn't mean these aspects will be easy. PL/1 is particularly nasty in this regard: it has no keywords, but hundreds of keywords-in-context. To resolve these one has to get the lexer and the parser to cooperate, and even then there may be many locally ambiguous parses. ANTLR3 doesn't do these well; ANTLR4 is supposed to be better but I don't know how it handles this, if it does at all.
To verify these parsers are right, you will need to run them on millions of lines of code (which means you have to have access to such code samples), and correct any errors you find. This takes a long time (in our case, several years of more or less continuous work/improvement to get production quality grammars that work on large code bases). You might be miraculously faster than this; good luck.
You need to build a preprocessor for COBOL (COPY ... REPLACING), whose details are poorly documented, and eventually another one for PL/1 (which I understand to be fully Turing capable).
After you build a parser, you need to capture a syntax tree; here ANTLR4 is supposed to be pretty good in that it will capture one for the grammar you give it. That may or may not be the AST you want; with several thousand grammar rules, I'd expect not. ANTLR3 requires you to add, manually, indications of where and how to form the AST.
After you get the AST, you'll want to do something with it. This means you will need to build at least symbol tables (mappings from identifier instances to their declarations and any related type information). ANTLR provides nothing special to support this AFAIK except for support in walking the ASTs. This, too, is hard to get right, COBOL has crazy rules about how an unqualified identifier reference can be interpreted as to a specific data field if there are no other conflicting interpretations. (There's lots more to Life After Parsing if you want to have good semantic information about the program; see my bio for more details; for each of these semantic aspects you have develop them and then for validation go back and run them on large code bases again.).
TL;DR
Building parsers (well, "front ends") for these languages is a lot of work no matter what parsing engine you choose. Likely explains why they aren't already in ANTLR's grammar zoo.
Have a look at the OpenSource Cobol-85 Parser from ProLeap, based on antlr4 and creating ASTs and ASGs as well.
And, best of all, it really works !
https://github.com/uwol/proleap-cobol-parser
I am not aware of a comparable PLI-grammar, but a very good start is the EBNF-definition from Ralf Lämmel (CWI, Amsterdam) & Chris Verhoef (WINS, Universiteit van Amsterdam)
http://www.cs.vu.nl/grammarware/browsable/os-pli-v2r3/

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract, in a closed-context, certain actions.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such solution, I don't really know if it has such functionality, I mean, if I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are in improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. Somewhere in there I know it would be possible to modify it for use in natural language processing, but to date haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.

Can PMD be customized to fully support a new language?

Can PMD be customized to fully support a new language, in a reasonable amount of time. I mean I know that technically almost anything can be done, but im wondering if this can be done in a reasonable amount of time? E.g. < 2 weeks
This page mentions how to write a CPD parser http://pmd.sourceforge.net/cpd-parser-howto.html
But is this just for copy / paste detection? Does writing a CPD parser give me full support of PMD in terms of rile sets?
I would guess not, but I'm not a PMD expert (and I have my own bias, check my bio).
The issues are:
Can you define a syntax for my langauge quickly (maybe, depending on how good you are, how messy the language is, and the strength of the parsing machinery offered by PMD)
Can you define the semantics of my language so that "semantic checks" provided by PMD work. You have to do this, because syntax tells you (and a tool) literally nothing about semantic of the syntax. I would guess that the PMD tool 'semantic checks' are pretty wired into the precise details of Java; if you language matched java perfectly, this would be zero work. But it doesn't, or you wouldn't be asking the question. And langauge semantics differences, even minor ones, cause discontinuous changes to the interpreation of the code. Before you get to doing even "serious" semantics, you're likely to have to build a symbol table mapping identifiers in the code to declarations (and the "semantic" type) for those symbols. Based on tool infrastructure I work with, this step alone takes 1-2 months for a real language.
Lastly, you are likely to have to code special PMD checks that are specific to your langauge. That takes time and energy, too.
I build generic compiler-type machinery (parsers, flow analyzers, style/error checkers) and get asked the equivalent of this question all the time WRT to our machinery. We try to have a lot of machinery available, try to make it easy to integrate new langauges, and we've been working on trying to make this "convenient and fast" for 15+ years. Its still not convenient, and there's no way to do this with our tools in a few weeks. I doubt PMD is better.

Is it easier to write a recursive-descent parser using an EBNF or a BNF?

I've got a BNF and EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea as far as using the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources to convert an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't go over converting them into a recursive-descent parser. We did go over converting BNF's into a recursive-descent parser.
The reason I'm asking is because the EBNF is more compact.
From looking at the EBNF's in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive descent parsers. How they do it is exactly the answer your question.
(Its pretty straightfoward, but good to understand the details).
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too and the output of both!. There's an astonishing moment that you come too if you go build one of these, where you realized how the meta-compiler compiles itself using its own grammar. This moment got me
hooked on compiler back in about 1970 when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system [based on these metacompilers] has a great online MetaII tutorial, for those of you that don't want the do-it-from-scratch experience. (I have nothing to do with this except that Neighbors and I were undergraduates together).
Both ways are a fine way to learn about metacompilers and generating parsers from EBNF.
The key ideas are that the left hand side of a rule creates a function that parses that nonterminal and returns true if match and advances the input stream; false if no match and the input stream doesn't advance.
The contents of the function is determined by the right hand side. Literal tokens are matched directly.
Nonterminals cause calls to other functions generated for the other rules.
Kleene* maps to while loops, alternations map to conditional branches. What EBNF doesn't address,
and the metacompilers do, is how does parsing do anyting other than saying "matched" or not?
The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use the English... "zero or more" is a simple while loop as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
The early meta compilers META II and TREEMETA and their kin are not exactly recursive decent parser. They were were stated as using recursive functions. That just meant they could call them selves.
We do not call C a recursive language. A C or C++ function is recursive in the same way the early meta compilers are recursive.
Recursion can be used. They were programming languages. Recursion is generally used only when analyzing nexted language constructs. For example parenthesized expression and nexted blocks.
More of an LR recursive decent combination. CWIC the last documented one has extensive backtracking and look ahead features. The '-' not operator can match any language construct. And inverts it success or failure. -term fails if a term is matched for example. The input is never advanced. The '?' looks ahead and matches any language construct ?expr for example would try to parse an expr. The look ahead '?' matched construct is not kept or is the input advanced.

Getting started with ANTLR and avoiding common mistakes

I have started to learn ANTLR and have both the 2007 book "The Definitive ANTLR Reference" and ANTLRWorks (an interactive tool for creating grammars). And, being that sort of person, I started at Chapter 3. ("A quick tour for the impatient").
It's a fairly painful process especially as some errors are rather impenetrable (e.g. ANTLR: "missing attribute access on rule scope" problem which just means to me "you got something wrong"). Also I have some very simple grammars (3-4 productions only) and simple input (2 lines) which when run give "OutOfMemory" error.
The ANTLR site is useful but somewhat fragmented and some SO users have commented (https://stackoverflow.com/questions/278480/good-tutorial-for-antlr) that the book and the tutorials expect a high entry level. I've been reluctant to approach the ANTLR discussion list because of this.
LATER We are beginning to get to grips with it. It would be useful to have simple reliable examples that could be gently expanded. It's certainly worth mastering as we have remodelled quite a lot of our thinking based on ANTLR.
One problem is that ANTLR V3 has signifcant changes from V2. One answer on SO (and on the ANTLR pages) refered to a V2 syntax that is no longer available.
Some of the ANTLR questions on SO have helped me a lot, but finding them is a bit ad hoc. So I'd like to know how SO users can help to make the learning process less painful. (If you refer to the reference book it would be useful to point to particular pages).
EDIT. #duffymo and #JamesAnderson have confirmed that ANTLR is hard work - largely because parsers are difficult. (FWIW I have been through LEX/YACC, etc. and there's no doubt that ANTLR is more powerful and easier to work with.) I think it would still be useful to have areas where it's possible to avoid fouling up such as:
ensure correct capitalisation of variable names
add package name to lexer as well as parser
take care over order of rules as it affects precedence
and more of these sort would be useful.
I agree - ANTLR is not for the faint of heart. It does expect a high entry level, because grammars and parsers are not trivial.
With that said, here are a few suggestions:
Forget about v2. Version 3 is the standard; don't even waste time considering the earlier version or its documentation.
OutOfMemoryError is telling you that there's something circular in the grammar you've defined.
IntelliJ has a wonderful IDE for working with ANTLR v3. It'll give you a graphical representation of your grammar, step-through debugging, etc. If you're going to be doing a lot of work with ANTLR it'd be worth a few dollars to buy a license.
ANTLR won't be easy to master. The book is good, but dense. The error messages are cryptic, as you've noted. I'd be surprised if anyone here could make it easy.
Sorry but my experience of ANTLR (indeed javacc, bison or any full function parser) is that most of your learning will be by fixing your own mistakes!
Getting good examples of other peoples code will cut this down somewhat, the best examples look really simple -- but you are missing all the sweat and hair pulling it took to get them looking that easy.
Even if you prefer command line, it is worth using AntlrWorks when you have problems. The diagramatic representation can make it easier to see what i sgoing wrong.
A picture is worth a thousand error messages.