Is there a tool that creates a graphical representation of an ANTLR4 grammar, i.e. of its parser/lexer rules, for example as a finite-state machine?
This should be possible, since the grammar is written in Backus-Naur form.
Example:
plus: INT '+' INT | plus '+' INT ;
INT: [0-9]+ ;
A corresponding finite-state machine would be
start -> INT <-> plus
          |
          v
         exit
Graphical representations other than a finite-state machine would also be fine. The goal is to get a different perspective on the grammar in order to make debugging and understanding it easier.
You probably want something like this: https://github.com/bkiers/rrd-antlr4. These types of graphics are called railroad diagrams.
Another solution is to use ANTLRWorks 2.1. There's a view called "Syntax Diagram" included that can generate railroad diagrams of your parser rules and your lexer rules.
I'm using those images for my master's thesis and the process has worked fine so far.
For example, the input is '(1+2)*3'.
The resulting tree looks like this: '(expr (expr ((expr (expr 1) + (expr 2)) ))*(expr 3))'
I would then like to hide or delete the '(' and ')' in the tree, since they are no longer needed. I tried to do this but did not succeed.
expr : ID LPAREN exprList? RPAREN
| '-' expr
| '!' expr
| expr op=('*'|'/') expr
| expr op=('+'|'-') expr
| ID
| INT
| LPAREN expr RPAREN //### Parens Here ####
;
LPAREN : '(' ;
RPAREN : ')' ;
What I want is NOT the following:
PAREN : ( '(' | ')' ) -> channel(HIDDEN) ;
Standard parser generator schemes separate parsing from tree building.
This allows one to design custom actions to build an AST and to fine-tune its structure to the targeted language (including leaving out concrete syntax such as parentheses).
The price is that one must specify not only the grammar but also the rules for building the AST. That makes defining a "grammar + tree builder" about twice as much work as defining a grammar alone. When your grammars are tiny, this doesn't matter, but tiny grammars usually mean "toy problem". With big, gnarly, real production grammars, this matters a lot: there's usually a bunch of initial churn in getting such grammars right, and the AST-building rules just get in the way during this phase. Clever people delay adding AST-building rules until the churn phase is over, but that only partially works; it turns out that you may want to reshape the grammar based on the AST you want to build, so this delay actually increases the churn somewhat. One also pays a maintenance cost: if your grammar has any scale, you will change it, and then the AST-building part must change, too.
My company builds a tool, the DMS Software Reengineering Toolkit, which contains a parser generator. We decided, the first to do so AFAIK, some 20 years ago, that this extra AST building step was too much work for the benefit for the many big grammars we expected to (and did) build. So we designed DMS to automatically build a concrete syntax tree as it parsed. Voila, write a grammar, get a parser and the tree is free. This decision has turned out to be a really good one.
The price is the final tree retains all the concrete syntax, e.g., the parentheses. While it may not look elegant, it turns out that this does not matter much in practice when manipulating trees (inspecting, traversing, analyzing, modifying, ...). We've traded a bit of inelegance for much easier tree building and grammar maintenance.
The ANTLR guy(!) decided for ANTLR4, unlike his previous ANTLR1/2/3 systems, to follow our lead and switch to building "ASTs" automatically from the grammar, as concrete syntax trees. (I don't know if you can actually write your own AST building rules to override the built-in feature for ANTLR4. Comments on this answer suggest that the way to get an AST from ANTLR4 is to walk the CST and build what you want. I'm not keen on that solution; it strikes me as paying the price for building and managing the AST, and also having the parsing overhead [time and space] of building the CST. If you only build small trees, maybe you don't care. For DMS, we regularly read thousands of files for processing together; space and time matter!)
For some discussion on how to make this a bit more elegant (effectively even more AST-like), see my SO answer on ASTs vs. CSTs.
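To suppress useless tokens in the tree, put the '!' operator after the corresponding tokens. Note that this is ANTLR 3 syntax (used with AST output); ANTLR 4 no longer supports the '!' operator: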
//remove ',' comma from the output
list: LISTNAME LISTMEMBER (','! LISTMEMBER)*;
from http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (e.g. recursive descent, packrat, PEG and probably various others; Coco/R is one reasonable package for this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator), I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require shorter tokens to be preferred (in that case MONTH would never be matched). You need backtracking behaviour that depends on whether the following text can be matched. Right?
ANTLR strictly separates tokenization from parsing. Consequently, every solution to your problem will seem like a hack.
However, other parser generators are specialized for problems like yours. Packrat parsers (PEG) backtrack and allow tokenization on the fly. Try parboiled for this purpose.
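As a rough illustration of the difference (not ANTLR or parboiled code), the sketch below contrasts greedy longest-match lexing with a backtracking search over all possible token splits. The `TOKENS` table is hypothetical and holds only the four tokens quoted in the question; `longest_match` and `all_segmentations` are names made up for this example.

```python
# Hypothetical token table: only the four tokens shown in the question.
TOKENS = {
    "m": "MILLI_OR_METRE",
    "osm": "OSMOLE",
    "mo": "MONTH",
    "s": "SECOND",
}


def longest_match(text):
    """Greedy (maximal munch) lexing: always take the longest lexeme."""
    out, pos = [], 0
    while pos < len(text):
        # Try the longest candidate lexeme first.
        for lexeme in sorted(TOKENS, key=len, reverse=True):
            if text.startswith(lexeme, pos):
                out.append(TOKENS[lexeme])
                pos += len(lexeme)
                break
        else:
            raise ValueError(f"no token matches at offset {pos}")
    return out


def all_segmentations(text, pos=0):
    """Backtracking: yield every way to split text into known tokens."""
    if pos == len(text):
        yield []
        return
    for lexeme, name in TOKENS.items():
        if text.startswith(lexeme, pos):
            for rest in all_segmentations(text, pos + len(lexeme)):
                yield [name] + rest


print(longest_match("mosm"))
print(list(all_segmentations("mosm")))
```

The greedy lexer commits to MONTH and never reconsiders, whereas the backtracking version also finds MILLI_OR_METRE OSMOLE. Choosing between the candidate splits is then left to the parser or to a later validation step.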
It appears that the question is not being framed correctly. You write:
"I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token."
But, according to the UCUM:
"The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units."
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. ANTLR could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.
Is there any means to get ANTLR4 to automatically remove redundant nodes in generated parse trees?
More specifically, I've been experimenting with a grammar for GLSL and you end up with long linear sequences of "expressions" in the parse tree due to the rule forwarding needed to give the automatic handling of operator precedence.
Most of the generated tree nodes are simply "forward to the next level of precedence", so don't provide any useful syntactic information - you only really need the last expression node in each sequence (i.e. the point at which the rule forwarding stopped), or the point where it becomes an actual tree node with more than one child (i.e. an actual expression was encountered in the source) ...
I was hoping there would be an easy way to kill off the dummy intermediate expression nodes - this type of structure must be common in any grammar with operator precedence.
The basic structure of the grammar is a fairly direct clone taken from the Khronos specification for the language:
https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf
ANTLR v4 is able to generate code from a single recursive rule dealing with different precedence levels, if you use a grammar like this (example for basic math):
expr : '(' expr ')'
| '-' expr
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
;
ANTLR v3 was unable to do so and basically required you to write one rule per precedence level. So I'd advise you to rewrite your grammar to avoid these boilerplate rules.
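To see why a single rule is enough, here is a minimal precedence-climbing sketch. It is not ANTLR's generated code; it only illustrates how one recursive routine can handle several precedence levels without one rule per level. The `PRECEDENCE` table and the function names are assumptions made for this example, chosen to mirror the order of the alternatives in the rule above.

```python
import re

# Assumed binding powers mirroring the alternative order in the rule:
# '*' and '/' are listed before '+' and '-', so they bind tighter.
PRECEDENCE = {"*": 2, "/": 2, "+": 1, "-": 1}


def tokenize(text):
    """Split the input into integers, operators and parentheses."""
    return re.findall(r"\d+|[-+*/()]", text)


def parse_atom(tokens):
    """Parse INT, a parenthesized expression, or unary minus."""
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse_expr(tokens)
        tokens.pop(0)  # consume ')'
        return inner   # no node is created for the parentheses
    if tok == "-":
        return ("neg", parse_atom(tokens))
    return int(tok)


def parse_expr(tokens, min_prec=1):
    """Parse binary operators whose precedence is at least min_prec."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Requiring a higher precedence on the right makes op left-associative.
        right = parse_expr(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


print(parse_expr(tokenize("1+2*3")))
print(parse_expr(tokenize("(1+2)*3")))
```

Each result node corresponds to an actual operator in the source, so there are no chains of forwarding nodes.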
Then, I think you're confusing the parse tree (aka concrete syntax tree) with the AST (abstract syntax tree). The AST is like a simplified version of the parse tree, which keeps only what's needed for your purpose. For instance, with the expr rule above, the AST wouldn't contain any node for parentheses, since the precedence is encoded in the tree itself and you usually don't need to know whether a part of a given expression was parenthesized or not.
Your program should build an AST from the parse tree and then go from there. Don't deal with parse trees directly, even if it seems convenient at first sight because the tool generates them for you. It'll quickly become cumbersome. Build your own tree structure (AST), tailored for the task at hand.
Use the Visitor implementation to access each node in sequence. Build your own tree by adding nodes to parents as they are visited. Decide at the time the node is visited whether to add it to your new tree or not. For example:
@Override
public T visitExpression(@NotNull AcParser.ExpressionContext ctx) {
    // Expressionable parent = getParent(Expressionable.class, ctx);
    // Class<? extends AcExpression> expClass = AcExpression.class;
    AcExpression obj = null; // placeholder for the AST node to build
    String text = ctx.getText(); // full text matched by this node
    // Do something with the text or the children, e.g. print each child.
    for (int i = 0; i < ctx.getChildCount(); i++) {
        System.out.println(ctx.getChild(i).getText() + "/");
    }
    // Continue into the children of this node.
    return visitChildren(ctx);
}
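The same idea can be sketched without the ANTLR runtime. In the sketch below, a parse tree is modelled as nested tuples of the form (rule, child, ...) with plain strings as token leaves; this representation and the `to_ast` helper are assumptions for illustration, not part of any ANTLR API. While walking the tree, it drops parenthesis tokens and collapses nodes that merely forward to a single child.

```python
def to_ast(node):
    """Convert a parse tree (nested tuples) into a smaller AST."""
    if isinstance(node, str):  # leaf token: keep as-is
        return node
    rule, *children = node
    # Decide per child whether it belongs in the new tree.
    kept = [to_ast(child) for child in children if child not in ("(", ")")]
    if len(kept) == 1:  # pure forwarding node: collapse it
        return kept[0]
    return (rule, *kept)


# Parse tree for the input (1+2)*3, using a single 'expr' rule.
tree = (
    "expr",
    ("expr", "(", ("expr", ("expr", "1"), "+", ("expr", "2")), ")"),
    "*",
    ("expr", "3"),
)
print(to_ast(tree))
```

The nesting of the result still encodes the grouping, so the parentheses are no longer needed. Collapsing every single-child node is a simplification; a real AST builder would decide per rule which nodes carry meaning.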
I'm starting to study ANTLR.
The aim is to 'translate' Strings into SQL statements.
One simple example of what I want to do:
If I receive the String "name = A and age = B" --- ANTLR ---> "select * from USERS where name = 'A' and age = 'B'"
I've been reading about ANTLR and following some examples, but those just convert the input stream of characters (the source file) into an AST. How can I use ANTLR to translate the input message and then use the translated output?
Can you give me some pointers or tell me where I can find information about that?
I'm using the Eclipse IDE and Maven ANTLR Plugin.
ANTLR is just a parser generator. You can insert actions into the grammar that collect information or directly print output. The most common mechanism is to let ANTLR create an intermediate representation in the form of an AST or, with ANTLR 4, a parse tree. From there, you build a tree walker to either build an internal model or directly generate output. From the internal model, which represents constructs in your output language, you can then generate the output. I typically use StringTemplate for generating structured text.
When the input and output are very similar and, more importantly, the order of output is very similar, you can get away with syntax directed translation: i.e. actions directly in the grammar or actions applied directly to a parse tree.
When the order of output is very different, you have to build some form of intermediate representation. Imagine simply reading in a bunch of integers and printing them back out in reverse order. You can't do that by simply printing out the numbers as you see them; you have to store them first. This is all explained in my [shameless plug] book, Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages http://amzn.com/B00A376HGG
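For the example in the question, the output follows the input order closely, so a syntax-directed translation is enough. The sketch below uses plain regular expressions instead of an ANTLR-generated parser, purely to show the idea of emitting output as each construct is recognized. The `translate` function, its `table` parameter and the restriction to `=` comparisons joined by and/or are assumptions made for this illustration.

```python
import re


def translate(query, table="USERS"):
    """Translate 'name = A and age = B' into a SELECT statement."""
    # Split on the connectives and keep them:
    # 'name = A and age = B' -> ['name = A', 'and', 'age = B']
    parts = re.split(r"\s+(and|or)\s+", query.strip(), flags=re.IGNORECASE)
    out = []
    for part in parts:
        if part.lower() in ("and", "or"):
            out.append(part.lower())
            continue
        match = re.fullmatch(r"(\w+)\s*=\s*(\w+)", part)
        if match is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        column, value = match.groups()
        # Emit the output for this condition as soon as it is recognized.
        out.append(f"{column} = '{value}'")
    return f"select * from {table} where " + " ".join(out)


print(translate("name = A and age = B"))
```

With an ANTLR grammar, the same emission step would live in a listener or visitor method for the condition rule. Production code would normally generate bound parameters rather than quoting values into the SQL string.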
I'm looking for a tool that can validate whether a given text or paragraph conforms to a specific format.
For example, I would like to be able to check whether the text looks like the following:
xxx{
sss:aaa;
}
yyy();
Preferably an open-source tool, with easy rule sets written in XML or something similar.
By text I mean a string that I get from, e.g., fgets() or any other function that reads from a file.
For something like this I'd suggest a parser (see, for instance, What is Parse/parsing?). You can build one from a definition of the language that you want to parse using a parser generator like Yacc or its free GNU equivalent Bison, or any number of other parser generators, many of which are also freely available.
Most parsers are used to transform a text that complies with a grammar into some other form (e.g. an intermediate language or machine code), but that isn't necessary: in your case the parser could simply say (at a minimum) "Yes" if the text conforms to a given grammar.
Parsers for simple grammars can be built by hand but, if you have the tools available, using a parser generator is easier and more robust in my experience.
Further, the text that you've shown is similar to a portion of code written in the C language (something close to a struct declaration followed by a function call), so you would be able to re-use parts of the grammar that you need from an existing Yacc grammar for C like this one.
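As an illustration of a hand-built parser that only answers yes or no, here is a minimal recursive-descent sketch. The grammar it checks is a guess derived from the single example in the question (a named block of `key:value;` entries followed by an empty call); the `tokenize` and `conforms` names are made up for this sketch.

```python
import re

# A token is a word or one of the punctuation characters used in the example.
TOKEN = re.compile(r"\s*(\w+|[{}():;])")


def tokenize(text):
    """Split text into tokens; raise ValueError on an unexpected character."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def conforms(text):
    """Return True if text matches: block call, as guessed from the example."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def accept(expected):
        """Consume one token if it matches; 'IDENT' means any word."""
        nonlocal pos
        if pos >= len(tokens):
            return False
        tok = tokens[pos]
        if expected == "IDENT":
            ok = re.fullmatch(r"\w+", tok) is not None
        else:
            ok = tok == expected
        if ok:
            pos += 1
        return ok

    # block: IDENT '{' (IDENT ':' IDENT ';')* '}'
    if not (accept("IDENT") and accept("{")):
        return False
    while not accept("}"):
        if not all(accept(t) for t in ("IDENT", ":", "IDENT", ";")):
            return False
    # call: IDENT '(' ')' ';'
    return all(accept(t) for t in ("IDENT", "(", ")", ";")) and pos == len(tokens)


print(conforms("xxx{\nsss:aaa;\n}\nyyy();"))
```

A parser generator would produce the equivalent of `conforms` from a grammar file, which is easier to maintain once the format grows beyond a couple of rules.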