Optional rule (myRule?) vs rule and empty alternative ((myRule | )) - antlr

In the ANTLRv4 grammar that one can find in the grammars-v4 repository (https://github.com/antlr/grammars-v4/blob/master/antlr4/ANTLRv4Parser.g4) the optional rule ebnfSuffix is:
sometimes matched using ebnfSuffix?, see lexerElement
sometimes matched using (ebnfSuffix | ), see element.
I was indeed asking to myself, and here as well, if the two have slightly different meaning.
The grammars-v4 repository has another example in https://github.com/antlr/grammars-v4/blob/master/cql3/CqlParser.g4 of the same two patterns with respect to beginBatch rule used has optional element or together with an empty alternative.
EDIT: I've added here the part of the grammar I'm referring to as suggested:
lexerElement
: labeledLexerElement ebnfSuffix? <-- case 1: optional rule
| lexerAtom ebnfSuffix?
| lexerBlock ebnfSuffix?
| actionBlock QUESTION?
;
element
: labeledElement (ebnfSuffix |) <-- case 2: block with empty alternative
| atom (ebnfSuffix |)
| ebnf
| actionBlock QUESTION?
;

Both ebnfSuffix? and (ebnfSuffix | ) result in exactly the same behaviour: they (greedily) optionally match ebnfSuffix.
The fact that they're both being used in a grammar could be because it was translated from some spec (or other grammar) that used that notation and that notation didn't have the ? operator, but that's just guessing.
Personally I'd just use ebnfSuffix?.

Related

Did "!", "^" and "$" had a special meaning in Antlr3?

I dont have any prior knowledge about ANTLR(I recently learned a little bit about ANTLR4), but I have to translate an old grammar to a newer version and eclipse is telling me, that their are no viable alternatives for those characters and shows the syntax error " '!' came as a complete surprise to me".
I already deleted those characters and it does not seam to be a problem, but maybe it had a special function in ANTLR3.
Thanks in advance.
global_block:
DATABASE! IDENTIFIER!
| GLOBALS! define_section!+ END! GLOBALS!
| GLOBALS! STRING!
;
main_block: MAIN sequence? END em=MAIN
-> ^(MAIN MAIN '(' ')' sequence? $em)
;
^ and -> are related to tree rewriting: https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687090/Tree+construction
ANTLR4 does not support it (v4 has listeners and visitors for tree traversal, but no rewriting anymore). Just remove all of these ! and -> ... in parser rules (do not remove the -> ... inside lexer rules like -> channel(...), which is still supported in v4).
So in your case, these rules would be valid in ANTLR4:
global_block
: DATABASE IDENTIFIER
| GLOBALS define_section+ END GLOBALS
| GLOBALS STRING
;
main_block
: MAIN sequence? END MAIN
;
The $ can still be used in ANTLR4: they are used to reference sub-rules or tokens:
expression
: lhs=expression operator=(PLUS | MINUS) rhs=expression
| NUMBER
;
so that in embedded code block, you can do: $lhs.someField.someMethod(). In your case, you can also just remove them because they are probably only used in the tree rewrite rules.
EDIT
kaby76 has a Github page with some instructions for converting grammars to ANTLR4: https://github.com/kaby76/AntlrVSIX/blob/master/doc/Import.md#antlr3

ANTLR - Checking for a String's "contruction"

Currently working with ANTLR and found out something interesting that's not working as I intended.
I try to run something along the lines of "test 10 cm" through my grammar and it fails, however "test 10 c m" works as the previous should. The "cm" portion of the code is what I call "wholeunit" in my grammar and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is how would I make it so the grammar would still follow the rule "unitmod siunit" without the need for a space in the middle, I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped)
Probable cause is "cm" being considered another token together (possibly same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in ANTLR lexer, the rule matching the longest input wins.
One solution would possibly be to make the wholeunit a lexer rule rather than parser rule, and make sure it's above the rule that matches any word (like "test") - if same input can be matched by multiple rules, ANTLR selects the first rule in order they're defined in.

Is there an automatic way to simplify EBNF grammars?

I want to handle the Scala grammar in order to create a data structure and I want to create a graph to model it. The problem that I have is that I want to get rid of some syntax sugar of the EBNF grammar in order to parse it in an efficient way.
For example if I have the following rule:
Expr ::= (Bindings | id | '_') '=>' Expr | Expr1
I want the corresponding rules for it:
Expr ::= LeftExpr '=>' Expr | Expr1
LeftExpr ::= Bindings | id | '_'
What I would really appreciate would be a tool which can automatically convert those rules in order to have left-associativity rules with only OR operators in order to avoid doing it manually. I think it's possible a thing like that, maybe without some semantic sense of the new added rules, e.g. it's the same if LeftExpr is Foo or whatever other name.
I looked at web sources but the argument is so confusing that I cannot find the right answer. Any help is appreciated, also just a link (really) related to the problem. Thank you

How do I properly parse Regex in ANTLR

I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and other variations of course of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing in all regex signs and chars in one lexer rule like this
REGEX: ( DIV | ('i') | ('#') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') |('A') |('w') |('a') |('z') |('Z')
//|('w')|('a'));
and then make a parser rule like this:
regex_assignment: (REGEX)+
but there are recognition errors(extraneous input). This is definetly because these signs are ofc used in other rules before.
The thing is I actually don't need to process these regex assignments, I just want it to be recognized correctly without errors. Does anyone have an approach for this in ANTLR? For me a solution would suffice, that just recognzies this as regex and skips it for example.
Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert that to ANTLR. Or alternatively, you can use the BNF grammar to check your own grammar rules to see if they are correctly defined. Simply throwing together all possible input chars in a single rule won't work.

How do I add parenthesis to this rule?

I have a left-recursive rule like the following:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION | UNARYOP EXPRESSION | NUMBER;
I need to add parenthesis to it but I'm not sure how to make a left parenthesis depend on a matching right parenthesis yet still optional. Can someone show me how? (Or am I trying to do entirely too much in lexing, and should I leave some or all of this to the parsing?)
You could add a recursive rule:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION
| UNARYOP EXPRESSION
| NUMBER
| OPENPARENS EXPRESSION CLOSEPARENS
;
Yes, you're trying to do too much in the lexer. Here's how to get around the left-recursive rules:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator (see how the parser rule expr trickles down to the rule atom and then get called recursively from atom again)
HTH