Since I've gotten no answer at all to my question "Is there an alternative to MKS Yacc that supports selection preference syntax or something very similar?", I'll ask the more basic question:
Has anyone used the "selection preferences" provided by MKS Yacc?
If you have, what did you use it for? Also, does it make any sense to use it in anything other than the last position in a rule?
I have to look after a grammar which features rules such as:
TOKEN1 LPAREN non_terminal1 [^EQUAL] TOKEN2 non_terminal2 RPAREN
Unless I'm misunderstanding something, the embedded selection preference doesn't provide any value whatsoever in this context.
Background
MKS Yacc supports a notation which their web site calls "selection preference syntax". It isn't illustrated, but it consists of a token or list of tokens in square brackets with a caret (which might be optional), and it indicates that the particular token must not follow this construct, but that token is not counted as part of this rule:
non_terminal1: TOKEN1 non_terminal2 TOKEN2 [^TOKEN3]
So, this rule says that a TOKEN1 followed by a non_terminal2 and a TOKEN2 is a non_terminal1, unless the next token is a TOKEN3, in which case some other rule applies.
(I'm not clear whether the bracketed item can be a non-terminal. The code I've seen using the notation always uses a token or a couple of space separated tokens, and never a non-terminal. I'm also not clear whether the caret is required; again, all the examples I've seen use the caret.)
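The behaviour described above can be modelled as a reduction that is guarded by the lookahead token. The following is only a minimal sketch of that description, not MKS Yacc's actual implementation; the function name `can_reduce` and the token `TOKEN4` are hypothetical and introduced for illustration.

```python
def can_reduce(stack, lookahead, rhs, excluded):
    """Return True if the top of the parse stack matches `rhs` and the
    lookahead token is not in `excluded`.

    The lookahead is only inspected, never consumed, which mirrors the
    statement that the bracketed token "is not counted as part of this rule".
    """
    if len(stack) < len(rhs):
        return False
    return stack[-len(rhs):] == rhs and lookahead not in excluded


# non_terminal1: TOKEN1 non_terminal2 TOKEN2 [^TOKEN3]
RHS = ["TOKEN1", "non_terminal2", "TOKEN2"]
EXCLUDED = {"TOKEN3"}

stack = ["TOKEN1", "non_terminal2", "TOKEN2"]
allowed = can_reduce(stack, "TOKEN4", RHS, EXCLUDED)  # any token other than TOKEN3
blocked = can_reduce(stack, "TOKEN3", RHS, EXCLUDED)  # some other rule must apply
print(allowed, blocked)
```

Under this reading the guard only matters where the parser has to decide whether to reduce, which is why its value in the middle of a rule is questionable.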
Jonathan, at 1:30 AM I'm not prepared to try to do this myself, but whatever those rules do, they can only be shorthand for rules that could be written in something like normal BNF. Looking at this, it appears that what the "selection preference" is doing is allowing you to express what would otherwise be several productions with one grammar rule.
I did a little digging and found this, which confirms my supposition: what the selection preference does is let you explicitly insert a lookahead, so that rules which would otherwise be in conflict can be disambiguated.
What I'd suggest is to think about what one of these rules would look like if re-written into yacc or straight BNF. I suspect it will turn out something like
TOKEN1 LPAREN non_terminal1 MULT TOKEN2 non_terminal2 RPAREN
TOKEN1 LPAREN non_terminal1 DIVIDE TOKEN2 non_terminal2 RPAREN
TOKEN1 LPAREN non_terminal1 ADD TOKEN2 non_terminal2 RPAREN
TOKEN1 LPAREN non_terminal1 SUBTRACT TOKEN2 non_terminal2 RPAREN
TOKEN1 LPAREN non_terminal1 EXP TOKEN2 non_terminal2 RPAREN
TOKEN1 LPAREN non_terminal1 MOD TOKEN2 non_terminal2 RPAREN
...
The overall effect would be one rule for every operator except EQUAL; the [^ notation is common in the various Bell Labs languages for something like the complement of a set.
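The expansion this answer suspects can be sketched mechanically. This illustrates the answer's reading only (a complement over a token set); it differs from the question's description, where the bracketed token is not consumed by the rule. The `OPERATORS` list is an assumption covering just the operators named in the example, and `expand_complement` is a hypothetical helper.

```python
# Assumed operator alphabet: only the tokens named in the example plus EQUAL.
OPERATORS = ["MULT", "DIVIDE", "ADD", "SUBTRACT", "EXP", "MOD", "EQUAL"]


def expand_complement(template, excluded, alphabet):
    """Produce one plain production per token in `alphabet` that is not
    in `excluded`, substituting it for the {op} placeholder."""
    return [template.replace("{op}", tok) for tok in alphabet if tok not in excluded]


rules = expand_complement(
    "TOKEN1 LPAREN non_terminal1 {op} TOKEN2 non_terminal2 RPAREN",
    excluded={"EQUAL"},
    alphabet=OPERATORS,
)
for rule in rules:
    print(rule)
```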
Related
Is it possible in ANTLR 4 to create a parser rule with arguments of type 'token', i.e. a sort of a rule
list[elem Token] : '[' elem (',' elem)* ']';
which should match a list of tokens of the type 'elem'. For example, list[ID] should match a list of identifiers while list[String] should match a list of strings both following the syntax given in the above rule.
No, such semantic checks are generally done after parsing, in a listener or visitor (which ANTLR generates as well).
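The "parse generically, check afterwards" approach can be sketched without ANTLR. The code below is a hand-written stand-in, not ANTLR-generated code: the tokenizer, `parse_list`, and `check_elem_type` are hypothetical names, and the post-parse check plays the role a listener or visitor would play.

```python
import re

# Assumed token shapes: identifiers, double-quoted strings, and punctuation.
TOKEN_RE = re.compile(
    r'\s*(?:(?P<ID>[A-Za-z_]\w*)|(?P<String>"[^"]*")|(?P<PUNCT>[\[\],]))'
)


def tokenize(text):
    """Split text into (kind, value) pairs; punctuation uses itself as kind."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind = m.lastgroup
        value = m.group(kind)
        tokens.append((value, value) if kind == "PUNCT" else (kind, value))
        pos = m.end()
    return tokens


def parse_list(tokens):
    """list : '[' elem (',' elem)* ']' where elem is any ID or String."""
    def expect(i, kind):
        if i >= len(tokens) or tokens[i][0] != kind:
            raise SyntaxError(f"expected {kind!r} at token {i}")
        return i + 1

    i = expect(0, "[")
    elems = []
    while True:
        if i >= len(tokens) or tokens[i][0] not in ("ID", "String"):
            raise SyntaxError(f"expected an element at token {i}")
        elems.append(tokens[i])
        i += 1
        if i < len(tokens) and tokens[i][0] == ",":
            i += 1
            continue
        break
    i = expect(i, "]")
    if i != len(tokens):
        raise SyntaxError("trailing input after list")
    return elems


def check_elem_type(elems, expected):
    """Semantic check done after parsing: return the offending values."""
    return [value for kind, value in elems if kind != expected]


ids = parse_list(tokenize("[a, b, c]"))
mixed = parse_list(tokenize('[a, "s"]'))
print(check_elem_type(ids, "ID"), check_elem_type(mixed, "ID"))
```

The grammar stays simple and accepts both kinds of element; the constraint "all elements are of type ID" is enforced in a separate pass.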
Is there any API in ANTLR4 for obtaining the original productions from the grammar?
For example, if there was a rule:
functionHeader : identifier LPAREN parameterDecl RPAREN
... is there some function on the parse that, given the functionHeader token would return a list ["identifier", "LPAREN", "parameterDecl", "RPAREN"]?
Well, it is rarely as simple as the list of elements you specify there, but you can look at the augmented transition network (ATN) via parser.getATN(), then get the rule start state, etc. See ATNState.
For example, input = '(1+2)*3' .
The tree looks like this: '(expr (expr ((expr (expr 1) + (expr 2)) ))*(expr 3))'
I would then like to hide or delete the '(' and ')' in the tree, since they are no longer needed. I tried to do this, but didn't succeed.
expr : ID LPAREN exprList? RPAREN
| '-' expr
| '!' expr
| expr op=('*'|'/') expr
| expr op=('+'|'-') expr
| ID
| INT
| LPAREN expr RPAREN //### Parens Here ####
;
LPAREN : '(' ;
RPAREN : ')' ;
What I want is NOT the following.
PAREN : ( '(' | ')' ) -> channel(HIDDEN)
Standard parser generator schemes separate parsing from tree building.
This allows one to design custom actions to build an AST, and fine tune its structure to the targeted language (including leaving out concrete syntax such as "parentheses").
The price is that one must not only specify the grammar, but one must also specify the rules for building the AST. And that makes defining a "grammar + tree builder" about twice as much work as just defining a grammar. When your grammars are tiny, this doesn't matter, but usually tiny grammars means "toy problem". With big real production gnarly grammars, this matters a lot; there's usually a bunch of initial churn in trying to get such grammars right and the AST building stuff just gets in the way during this phase. Clever people delay adding AST building rules till the churn phase is over, but that only partially works, and it turns out that you may want to reshape the grammar based on AST you want to build, so this delay actually increases the churn somewhat. One also pays a maintenance cost; if your grammar has any scale, you will change it, and then the AST building part must change, too.
My company builds a tool, the DMS Software Reengineering Toolkit, which contains a parser generator. We decided, the first to do so AFAIK, some 20 years ago, that this extra AST building step was too much work for the benefit for the many big grammars we expected to (and did) build. So we designed DMS to automatically build a concrete syntax tree as it parsed. Voila, write a grammar, get a parser and the tree is free. This decision has turned out to be a really good one.
The price is the final tree retains all the concrete syntax, e.g., the parentheses. While it may not look elegant, it turns out that this does not matter much in practice when manipulating trees (inspecting, traversing, analyzing, modifying, ...). We've traded a bit of inelegance for much easier tree building and grammar maintenance.
The ANTLR guy(!) decided for ANTLR4, unlike his previous ANTLR1/2/3 systems, to follow our lead and switch to building "ASTs" automatically from the grammar, as concrete syntax trees. (I don't know if you can actually write your own AST building rules to override the built-in feature for ANTLR4. Comments on this answer suggest that the way to get an AST from ANTLR4 is to walk the CST and build what you want. I'm not keen on that solution; it strikes me as paying the price for building and managing the AST, and also having the parsing overhead [time and space] of building the CST. If you only build small trees, maybe you don't care. For DMS, we regularly read thousands of files for processing together; space and time matter!)
For some discussion on how to make this a bit more elegant (effectively even more AST like), see my SO answer on ASTs vs. CSTs
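The "walk the CST and build what you want" approach mentioned in the comments can be sketched as follows. This is not ANTLR code: the CST is modelled as nested Python lists (a hypothetical stand-in for a parse tree), and `to_ast` is an illustrative helper that only handles parenthesised, unit, and binary `expr` nodes.

```python
def to_ast(node):
    """Convert a CST node into an AST, dropping parentheses.

    A CST node is either a token (str) or a list [rule_name, child, ...].
    The AST uses tuples of the form (operator, left, right).
    """
    if isinstance(node, str):
        return node
    children = node[1:]
    # '(' expr ')' : keep only the inner expression.
    if len(children) == 3 and children[0] == "(" and children[2] == ")":
        return to_ast(children[1])
    # Single child (e.g. expr -> INT): collapse the wrapper node.
    if len(children) == 1:
        return to_ast(children[0])
    # expr op expr : hoist the operator to the root of the subtree.
    if len(children) == 3:
        left, op, right = children
        return (op, to_ast(left), to_ast(right))
    raise ValueError(f"unhandled node shape: {node!r}")


# Hand-built CST for the input (1+2)*3
cst = ["expr",
       ["expr", "(", ["expr", ["expr", "1"], "+", ["expr", "2"]], ")"],
       "*",
       ["expr", "3"]]

ast = to_ast(cst)
print(ast)  # ('*', ('+', '1', '2'), '3')
```

The grouping that the parentheses expressed survives in the nesting of the tuples, so the tokens themselves can be discarded.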
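To suppress useless tokens in the tree, use the '!' symbol after the corresponding tokens (this is ANTLR 3 syntax for grammars that build an AST; ANTLR 4 no longer has the tree-construction operators):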
//remove ',' comma from the output
list: LISTNAME LISTMEMBER (','! LISTMEMBER)*;
from http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html
I'm just starting to write my first lexer and I've come across this:
RPAREN options { paraphrase = ")"; } : ")";
I'd like to know what paraphrase actually does. Does it mean that in this case RPAREN can also be written simply as ) in the parser?
thanks!
EDIT - just found this online
We can use paraphrases in Rules to make error messages user-friendly
is this correct?
paraphrase is not a valid option in ANTLR 3 or ANTLR 4. Including it would either produce a warning or error, and it would not have any impact on behavior.
In SPARQL a QuadPattern is defined as
QuadPattern ::= '{' Quads '}'
Quads ::= TriplesTemplate? ( QuadsNotTriples '.'? TriplesTemplate? )*
From this I understand that a QuadPattern can be empty, but I cannot understand the reason. What's the purpose of an empty QuadPattern?
As @Antoine Zimmermann points out, just because the syntax allows it doesn't mean it is meaningful.
In this case I believe it was done to keep the grammar within a certain constraint and to simplify it. If you don't allow Quads to be empty then you'd have to redefine the QuadPattern rule like so:
QuadPattern ::= '{' '}' | '{' Quads '}'
This just adds unnecessary complication, particularly when you are using a parser generator.
With an empty quad pattern, you can, for instance, delete the default graph completely:
DELETE WHERE { }
But the fact that something is allowed by the syntax does not necessarily mean that there was a deliberate choice to allow a specific pattern. It may be, in some cases, that it is more convenient to define things in a more generic way.