Unexpected parser rule matching order - antlr

With the following (subset of a) grammar for a scripting language:
expr
...
| 'regex(' str=expr ',' re=expr ')' #regexExpr
...
An expression like regex('s', 're') parses to the following tree, which makes sense:
regexExpr
'regex('
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
')'
I'm now trying to add an optional third argument to my regex function, so I've used this modified rule:
'regex(' str=expr ',' re=expr (',' n=expr )? ')'
This causes regex('s', 're', 1) to be parsed in a way that's unexpected to me:
regexExpr
'regex('
expr:listExpression
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
','
expr: integerLiteral(1)
')'
where listExpression is another rule defined below regexExpr:
expr
...
| 'regex(' str=expr ',' re=expr (',' n=expr)? ')' #regexExpr
...
| left=expr ',' right=expr #listExpr
...
I think this listExpr could have been defined better (by defining surrounding tokens), but I've got compatibility concerns with changing it now.
I don't understand the parser rule matching precedence here. Is there a way I can add the optional third arg to regex() without causing the first two args to be parsed as a listExpr?

Try defining them as two separate alternatives with the same label, #regexExpr:
expr
: 'regex' '(' str=expr ',' re=expr ',' n=expr ')' #regexExpr
| 'regex' '(' str=expr ',' re=expr ')' #regexExpr
| left=expr ',' right=expr #listExpr
| ...
;

Related

ANTLR4 Grammar - Issue with "dot" in fields and extended expressions

I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this:
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above:
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the dots in Fields.V1 and Fields.V2 are missing from the text, and there is also a <missing ')'> error node. I believe I should somehow make ANTLR understand that an expression can also contain fields with dot operators.
A further question:
(Var1)(Var2)
ANTLR does not report an error for the input above. An expression should never be (Var1)(Var2); there should always be an operator between the operands, as in (Var1)*(Var2) or (Var1)+(Var2). The parse tree contains no error node for this case. How should the grammar be modified so that this scenario is also reported as an error?
To recognize IDs like Fields.V1, change your lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice that, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow digits in IDs.
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
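As a rough illustration of what the composed ID rule accepts, here is a minimal Python sketch that mirrors the fragment with a regular expression. The names ID_NODE, ID and is_id are only illustrative, and Python's re module is standing in for the ANTLR lexer, so this shows the shape of the rule rather than ANTLR's actual tokenization.

```python
import re

# Mirrors: fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID_NODE = r"[a-zA-Z_][a-zA-Z0-9]*"

# Mirrors: ID: ID_NODE ('.' ID_NODE)*;
ID = re.compile(rf"{ID_NODE}(?:\.{ID_NODE})*")


def is_id(text):
    """Return True if the whole text has the shape of a (dotted) ID."""
    return ID.fullmatch(text) is not None
```

With this sketch, is_id("Fields.V1") and is_id("V1") are true, while a trailing dot as in "Fields." or a leading digit as in "1V" is rejected.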
You also didn't add ID as a valid expr alternative.
To detect the error condition in (Var1)(Var2), you need to follow Mike's advice and add the EOF token to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr, here (Var1). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
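The effect of the trailing EOF can be sketched with a small hand-written recognizer in Python. This is not ANTLR and it deliberately simplifies the grammar (no unary minus, no function calls); tokenize, parse_atom, parse_expr, parse and the require_eof flag are all hypothetical names chosen for this illustration.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_.]*|\d+|[-+*/()])")
OPERATORS = {"+", "-", "*", "/"}
PUNCTUATION = OPERATORS | {"(", ")"}


def tokenize(text):
    """Split the input into identifiers, numbers, operators and parentheses."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_atom(tokens, i):
    """atom: '(' expr ')' | ID | NUM. Returns the index after the atom."""
    if i < len(tokens) and tokens[i] == "(":
        i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("missing ')'")
        return i + 1
    if i < len(tokens) and tokens[i] not in PUNCTUATION:
        return i + 1
    raise SyntaxError("expected operand")


def parse_expr(tokens, i):
    """expr: atom (operator atom)*. Returns the index after the expression."""
    i = parse_atom(tokens, i)
    while i < len(tokens) and tokens[i] in OPERATORS:
        i = parse_atom(tokens, i + 1)
    return i


def parse(text, require_eof):
    """Parse one expr; with require_eof, leftover tokens are an error."""
    tokens = tokenize(text)
    end = parse_expr(tokens, 0)
    if require_eof and end != len(tokens):
        raise SyntaxError(f"extraneous input {tokens[end]!r}")
    return end
```

Calling parse("(Var1)(Var2)", require_eof=False) stops quietly after the first parenthesized operand, which is the behaviour described above. With require_eof=True the same input raises a SyntaxError, because the second ( is left over.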
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)

Mutually left-recursive lexer rules in ANTLR4?

I'm trying to write a syntax highlighter for Swift. In addition to tokens, I would also like to highlight some language constructs. I'm having problems with the following rule:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| (Attributes? Function_type_argument_clause 'throws'? '->' Type | Attributes? Function_type_argument_clause 'rethrows' '->' Type)
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type)
| Tuple_type
| Type '?'
| Type '!'
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type) '&' Protocol_composition_continuation
| (Type '.' 'Type' | Type '.' 'Protocol')
| 'Any'
| 'Self'
| '(' Type ')'
;
Error: The following sets of rules are mutually left-recursive [Type]
I tried leaving only the following cases in the rule:
Type
: Type '?'
| 'Any'
| 'Self'
;
But the problem remained: The following sets of rules are mutually left-recursive [Type]
You defined Type as a lexer rule. Lexer rules cannot be left recursive. Type should be a parser rule.
See: Practical difference between parser rules and lexer rules in ANTLR?
Note that there are existing Swift grammars:
https://github.com/antlr/grammars-v4/blob/master/swift2/Swift2.g4
https://github.com/antlr/grammars-v4/blob/master/swift3/Swift3.g4
Note that these grammars are user-committed, so test them properly!
EDIT
I'm still unable to understand it from the point of view of lexical analysis
Oh, you're only tokenising? Well, then you can't use Type as you're doing it now. You will have to rewrite it so that there is no left recursion any more.
For example, let's say the simplified Type rule looks like this:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| Type '?'
| Type '!'
| 'Any'
| 'Self'
| '(' Type ')'
;
then you should rewrite it like this:
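Type
: TypeStart TypeTrailing* // '*' rather than '?', so repeated suffixes such as Any?! still match, as the left-recursive rule allowed
;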
fragment TypeStart
: '[' Type ']'
| '[' Type ':' Type ']'
| 'Any'
| 'Self'
| '(' Type ')'
;
fragment TypeTrailing: [?!];
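To see why this shape avoids left recursion, here is a minimal Python sketch of a recognizer for the simplified rule: it first matches a "start" form that always consumes a character, and only then looks at trailing ? or ! suffixes. The function names match_type and match_type_start are hypothetical, and the sketch loops over the suffixes so that it accepts any number of them, which is what the left-recursive alternatives Type '?' and Type '!' describe.

```python
def match_type(text, pos=0):
    """Type: TypeStart followed by trailing '?' / '!'. Returns end index or None."""
    end = match_type_start(text, pos)
    if end is None:
        return None
    # Consume suffixes iteratively instead of recursing on the left.
    while end < len(text) and text[end] in "?!":
        end += 1
    return end


def match_type_start(text, pos):
    """TypeStart: 'Any' | 'Self' | '(' Type ')' | '[' Type ']' | '[' Type ':' Type ']'."""
    for keyword in ("Any", "Self"):
        if text.startswith(keyword, pos):
            return pos + len(keyword)
    if text.startswith("(", pos):
        end = match_type(text, pos + 1)
        if end is not None and text.startswith(")", end):
            return end + 1
        return None
    if text.startswith("[", pos):
        end = match_type(text, pos + 1)
        if end is None:
            return None
        if text.startswith(":", end):
            end = match_type(text, end + 1)
            if end is None:
                return None
        if text.startswith("]", end):
            return end + 1
    return None
```

Every recursive call to match_type happens only after a bracket or parenthesis has been consumed, so the recursion always makes progress. That is the property the original left-recursive rule lacked.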

How to parse SQL strings containing escapes in ANTLR4?

There are two kinds of quote escape in SQL strings: \' and ''.
An input may look like this:
SELECT '\'', '''';
I parse the string with this grammar:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )* '\''
;
but ANTLR parses the input incorrectly; the tree looks like this:
[image: incorrect parse tree]
I also tried another version of STRING_LITERAL that uses the non-greedy operator ?:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )*? '\''
;
but it also gives me an incorrect parse result, like this:
[image: incorrect parse tree for the second grammar]
The '''' should be parsed as one string containing a single quote, not as two empty strings.
How should I modify the grammar to fix the problem?
You didn't exclude the \ in the ( ... )*. Try this:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
;
where ~['\\] matches any char except ' and \. You may also want to exclude line break chars: ~[\r\n'\\].
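The corrected rule can be tried out quickly as a Python regular expression. This is only a sketch: Python's re engine is not the ANTLR lexer, and the names STRING_LITERAL and string_literals are chosen here for illustration. It does show that once the backslash is excluded from the "any other char" alternative, both escape styles are recognized as part of one literal.

```python
import re

# Mirrors: '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
#   \\'      a backslash-escaped quote
#   ''       a doubled quote
#   [^'\\]   any char except a quote or a backslash
STRING_LITERAL = re.compile(r"'(?:\\'|''|[^'\\])*'")


def string_literals(sql):
    """Return every string literal found in the SQL text, in order."""
    return STRING_LITERAL.findall(sql)
```

For the input SELECT '\'', ''''; this sketch finds exactly two literals, '\'' and '''', each containing one escaped quote.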

Antlr4 - Implicit Definitions

I am trying to create a simple expression parser that, for now, handles only integer arithmetic. So far I have:
grammar MyExpr;
input: (expr NEWLINE)+;
expr: '(' expr ')'
| '-' expr
| <assoc = right> expr '^' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| ID '(' ExpressionList? ')'
| INT;
ExpressionList : expr (',' expr)*;
ID : [a-zA-Z]+;
INT : DIGIT+;
DIGIT: [0-9];
NEWLINE : '\r'?'\n';
WS : [\t]+ -> skip;
The rule ExpressionList seems to cause some problems. If I remove everything containing ExpressionList, everything compiles and seems to work fine. But with the grammar as shown above, I get errors like:
error(160): MyExpr.g4:14:17: reference to parser rule expr in lexer rule ExpressionList
error(126): MyExpr.g4:7:6: cannot create implicit token for string literal in non-combined grammar: '-'
I am using Eclipse and the ANTLR4 plugin. I am trying to follow the Cymbol grammar given in the ANTLR4 book.
Can someone tell me what's going wrong in my little grammar?
I found it out myself:
Rule names starting with a capital letter are lexer rules. So all I had to do was rename ExpressionList to expressionList (both the definition and the reference to it in expr).
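The naming convention is simple enough to express as a one-line check. The following Python sketch uses a hypothetical helper, rule_kind, that classifies a rule name by the case of its first letter, in the way described above.

```python
def rule_kind(name):
    """Classify an ANTLR4 rule name by the case of its first letter."""
    if not name:
        raise ValueError("rule name must not be empty")
    # Uppercase first letter: lexer (token) rule; lowercase: parser rule.
    return "lexer" if name[0].isupper() else "parser"
```

Here rule_kind("ExpressionList") gives "lexer", which is why the reference to expr inside it was rejected, while rule_kind("expressionList") gives "parser".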
Maybe someone else will find this useful some day ;)

Antlr : mutually left-recursive rule

I have this rule in ANTLR:
anREs : anRE
| ('(' anREs ')') => '(' anREs ')'
| (anREs '|' anREs) => anREs '|' anREs ;
where anRE is a regular expression. When I compile the grammar file, I get this error message, caused by the third alternative of the rule:
error(210): The following sets of rules are mutually left-recursive [anREs]
How can I rewrite this rule?
Thanks
Here is your left recursion:
... | (anREs '|' anREs) => anREs '|' anREs ;
Worse, it's ambiguous. If you have anREs_1 | anREs_2 | anREs_3 as input,
it isn't clear what the subterms of the | operator are.
I'd expect this to solve the problem, and resolve the ambiguity, too:
... | (anRE '|' anREs) => anRE '|' anREs ;
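The right-recursive form can be sketched as a tiny recursive-descent function in Python. This is an illustration under simplifying assumptions: each anRE is represented by a single token, the parenthesized alternative is left out, and parse_alternation is a hypothetical name. It shows that a | b | c now has exactly one grouping, nested to the right.

```python
def parse_alternation(tokens, i=0):
    """anREs: anRE | anRE '|' anREs. Returns (tree, index after the match)."""
    if i >= len(tokens) or tokens[i] == "|":
        raise SyntaxError("expected an operand")
    left = tokens[i]  # stand-in for anRE: one token has been consumed
    i += 1
    if i < len(tokens) and tokens[i] == "|":
        # Recurse only after consuming input, so there is no left recursion.
        right, i = parse_alternation(tokens, i + 1)
        return ("|", left, right), i
    return left, i
```

For the tokens a, |, b, |, c the result is the tree ("|", "a", ("|", "b", "c")), so the first operand is always a single anRE and everything after the first | forms the right subterm.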