how to resolve an ambiguity - antlr

I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.

An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)
I'm not sure. In the ReplaceAll page(*), Possible Issues paragraph, it is said that "Periods bind to numbers more strongly than to slash", so that /.123 will always be interpreted as a division operation by the number .123. Next it is said that to avoid this issue, a space must be inserted in the input between the /. operator and the number, if you want it to be understood as a replacement.
So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement ?).
ANTLR4 lexer and parser are greedy. It means that the lexer (parser) tries to read as much input characters (tokens) that it can while matching a rule. With your OP rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is OP : '/' | '/.' ;). This means there is no ambiguity and you have no chance the input to be interpreted as OP=/ and NUMBER=.123.
Given my small experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
#init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='x',<ID>,1:0]
[#1,1:1='/',<'/'>,1:1]
[#2,2:5='.123',<NUMBER>,1:2]
[#3,7:7='x',<ID>,2:0]
[#4,8:8='/',<'/'>,2:1]
[#5,9:9='.',<'.'>,2:2]
[#6,10:10='x',<ID>,2:3]
[#7,12:13='->',<'->'>,2:5]
[#8,15:15='1',<NUMBER>,2:8]
[#9,17:17='{',<'{'>,3:0]
...
[#29,47:47='}',<'}'>,4:5]
[#30,48:48='/',<'/'>,4:6]
[#31,49:50='.0',<NUMBER>,4:7]
...
[#40,67:67='}',<'}'>,5:5]
[#41,68:68='/',<'/'>,5:6]
[#42,69:69='.',<'.'>,5:7]
[#43,71:71='0',<NUMBER>,5:9]
...
[#48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous until the slash. Then the parser has two choices : / NUMBER in the division rule or / . expr in the replace_all rule. I think that NUMBER absorbs the input and so there is no more ambiguity.
(*) the link was yesterday in a comment that has disappeared, i.e. Wolfram Language & System, ReplaceAll

Related

ANTLR4 grammar: getting mismatched input error

I have defined the following grammar:
grammar Test;
parse: expr EOF;
expr : IF comparator FROM field THEN #comparatorExpr
;
dateTime : DATE_TIME;
number : (INT|DECIMAL);
field : FIELD_IDENTIFIER;
op : (GT | GE | LT | LE | EQ);
comparator : op (number|dateTime);
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
IF : '$IF';
FROM : '$FROM';
THEN : '$THEN';
OR : '$OR';
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '=' ;
INT : DIGIT+;
DECIMAL : INT'.'INT;
DATE_TIME : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS : [ \r\t\u000C\n]+ -> skip;
And I try to parse the following input:
$IF >=15 $FROM AgeInYears $THEN
it gives me the following error:
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
All SO posts I found point out to the same reason for this error - identical LEXER rules. But I cannot see why 15 can be matched to either DECIMAL - it requires . between 2 ints, or to DATE_TIME - it has m|d|y suffix as well.
Any pointers would be appreciated here.
It's always a good idea to run take a look at the token stream that your Lexer produces:
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)
Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. Since that's three input characters long, ANTLR will prefer that Lexer rule to the INT rule that only matches 2 characters.
For this particular input, you can solve this be reworking the FIELD_IDENTIFIER rule to be:
FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT))*;
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:7='15',<INT>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)
That said, I suspect that attempting to allow spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers), is likely to be a continuing source of pain as you work on this. (There's a reason why you don't see this is most languages, and it's not that nobody thought it would be handy to allow for multi-word identifiers. It requires a greedy lexer rule that is likely to take precedence over other rules (as it did here)).

ANTLR: too greedy rule

It looks like I have a problem understanding a too greedy rule match. I'm trying to lex a .g4 file for syntax coloring. Here is a minimum (simplified) extract for making this problem reproducible:
lexer grammar ANTLRv4Lexer;
Range
: '[' RangeChar+ ']'
;
fragment EscapedChar
: '\\' ~[u]
| '\\u' EscapedCharHex EscapedCharHex EscapedCharHex EscapedCharHex
;
fragment EscapedCharHex
: [0-9A-Fa-f]
;
fragment RangeChar
: ~']'
| EscapedChar
;
Punctuation
: [:;()+\->*[\]~|]
;
Identifier
: [a-zA-Z0-9]+
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
LineComment
: '//' ~[\r\n]*
;
The (incomplete) test file is following:
: (~ [\]\\] | EscAny)+ -> more
;
// ------
fragment Id
: NameStartChar NameChar*
;
String2Part
: ( ~['\\]
| EscapeSequence
)+
;
I don't understand why it matches Range so greedy:
[#0,3:3=':',<Punctuation>,1:3]
[#1,5:5='(',<Punctuation>,1:5]
[#2,6:6='~',<Punctuation>,1:6]
[#3,8:135='[\]\\] | EscAny)+ -> more\r\n ;\r\n\r\n // ------\r\n\r\nfragment Id\r\n : NameStartChar NameChar*\r\n ;\r\n\r\n\r\nString2Part\r\n\t: ( ~['\\]',<Range>,1:8]
[#4,141:141='|',<Punctuation>,13:3]
[#5,143:156='EscapeSequence',<Identifier>,13:5]
[#6,162:162=')',<Punctuation>,14:3]
[#7,163:163='+',<Punctuation>,14:4]
[#8,167:167=';',<Punctuation>,15:1]
[#9,170:169='<EOF>',<EOF>,16:0]
I understand why in the first line it matches [, \] and \\, but why it obviously treats ] as RangeChar?
Your lexer matches the first \ in \\] using the ~']' alternative and then matches the remaining \] as an EscapedChar. The reason it does this is that this interpretation leads to a longer match than the one where \\ is the EscapedChar and ] is the end of the range and when there are multiple valid ways to match a lexer rule, ANTLR always chooses the longest one (except when *? is involved).
To fix this, you should change RangeChar, so that backslashes are only allowed as part of escape sequences, i.e. replace ~']' with ~[\]\\].

ANTLR4 Grammar - Issue with "dot" in fields and extended expressions

I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this :-
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above :-
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the "dots" in Fields.V1 and Fields.V2 are missing from the text and also there is a <missing ')' Error node. I believe I should somehow make ANTLR understand that an expression can also have fields with dot operators.
A question on top of this :-
(Var1)(Var2)
ANTLR is not throwing me error for this above scenario , the expressions should not be (Var1)(Var2) -- It should always have the operator (var1)*(var2) or (var1)+(var2) etc. The parser error tree is not generating this error. How should the grammar be modified to make sure even this scenario is taken into consideration.
To recognize IDs like Fields.V1, change you Lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow numbers in IDs
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
You also didn't add ID as a valid expr alternative
To handle detection of the error condition in (Var1)(Var2), you need Mike's advice to add the EOF Lexer rule to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)

Controlling Parameter Slurping

I'm trying to write a grammar that supports functions calls without using parentheses:
f x, y
As in Haskell, I'd like function calls to minimally slurp up their parameters. That is, I want
g 5 + 3
to mean
(g 5) + 3
instead of
g (5 + 3)
Unfortunately, I'm getting the second parse with this grammar:
grammar Parameters;
expr
: '(' expr ')'
| expr MULTIPLICATIVE_OPERATOR expr
| expr ADDITIVE_OPERATOR expr
| ID (expr (',' expr)*?)??
| INT
;
MULTIPLICATIVE_OPERATOR: [*/%];
ADDITIVE_OPERATOR: '+';
ID: [a..z]+;
INT: '-'? [0-9]+;
WHITESPACE: [ \t\n\r]+ -> skip;
The parse tree I'm getting is this:
I had thought that the subrule listed first would get attempted first. In this case, expr ADDITIVE_OPERATOR expr appears before the ID subrule, so why is the ID subrule taking higher precedence?
In this case ANTLR does not the correct rule transformation (to eliminate left recursion and to handle precedences):
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[0] (',' expr_1[0])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5 + (expr_1 3))))
correct would be:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[5] (',' expr_1[5])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5) + (expr_1 3)))
I am not certain if this is a bug in ANTLR4 or a trade-off of the transformation algorithm. Perhaps one should write an issue to the ANTLR4 jira.
To solve your problem you can simply put the correctly transformed grammar into your code and it should work. The explanation of rule transformation is found in "The Definitive ANTLR4 Reference" on pages 249ff (and perhaps somewhere on the web).

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.