ANTLR 4 extraneous input matching non lexer item

I have a grammar like this :
grammar MyGrammar;
field : f1 (STROKE f2 f3)? ;
f1 : FIELDTEXT+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
FIELDTEXT : ~['/'] ;
NUMBER4 : [0-9][0-9][0-9][0-9];
STROKE : '/' ;
This works well enough, and the fields f1, f2 and f3 are all populated correctly.
However, whenever there is an A to the left of the / (regardless of whether the optional part is present), this error is also reported:
extraneous input 'A' expecting {<EOF>, FIELDTEXT, '/'}
Some sample Data:
PHOEN
-> OK.
KLM405/A4046
-> OK.
SAW502A
-> Not OK, 'A' is in f1.
BAW617/A5136
-> Not OK, 'A' is in f1.
I do not understand why 'A' is a problem here (the fields are still populated).

The problem with SAW502A is that 'A' is a separate token, implicitly defined :
f2 : 'A' ;
(it would be the same if it were explicitly defined) :
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<FIELDTEXT>,3:7]
and the rule f1 does not allow anything else than FIELDTEXT.
It works with :
f1 : ( FIELDTEXT | 'A' )+ ;
File Question.g4 :
grammar Question;
question
@init {System.out.println("Question last update 2305");}
: line+ EOF
;
line
: f1 (STROKE f2 f3)? NL
{System.out.println("f1=" + $f1.text + " f2=" + $f2.text + " f3=" + $f3.text);}
;
f1 : ( FIELDTEXT | 'A' )+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
STROKE : '/' ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
FIELDTEXT : ~[/] ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
Execution :
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='P',<FIELDTEXT>,1:0]
[#1,1:1='H',<FIELDTEXT>,1:1]
[#2,2:2='O',<FIELDTEXT>,1:2]
[#3,3:3='E',<FIELDTEXT>,1:3]
[#4,4:4='N',<FIELDTEXT>,1:4]
[#5,5:5='\n',<NL>,1:5]
[#6,6:6='K',<FIELDTEXT>,2:0]
[#7,7:7='L',<FIELDTEXT>,2:1]
[#8,8:8='M',<FIELDTEXT>,2:2]
[#9,9:9='4',<FIELDTEXT>,2:3]
[#10,10:10='0',<FIELDTEXT>,2:4]
[#11,11:11='5',<FIELDTEXT>,2:5]
[#12,12:12='/',<'/'>,2:6]
[#13,13:13='A',<'A'>,2:7]
[#14,14:17='4046',<NUMBER4>,2:8]
[#15,18:18='\n',<NL>,2:12]
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<NL>,3:7]
[#24,27:27='B',<FIELDTEXT>,4:0]
[#25,28:28='A',<'A'>,4:1]
[#26,29:29='W',<FIELDTEXT>,4:2]
[#27,30:30='6',<FIELDTEXT>,4:3]
[#28,31:31='1',<FIELDTEXT>,4:4]
[#29,32:32='7',<FIELDTEXT>,4:5]
[#30,33:33='/',<'/'>,4:6]
[#31,34:34='A',<'A'>,4:7]
[#32,35:38='5136',<NUMBER4>,4:8]
[#33,39:39='\n',<NL>,4:12]
[#34,40:39='<EOF>',<EOF>,5:0]
Question last update 2305
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136

The input SAW502A will be tokenized as a mix of FIELDTEXT tokens (S, W, 5, 0 and 2) and 'A' tokens (one for each A). That's a problem because 'A' tokens aren't allowed at that position - only FIELDTEXT tokens are. Clearly you intended A to be a FIELDTEXT in this context as well (and only be treated differently in the f2 rule), but the tokenizer doesn't know which kind of token is required by the grammar at a certain point - it only knows the token rules and generates whichever token is the best fit. So whenever it sees an A, it generates an 'A' token.
Note that this also means that whenever it sees four consecutive digits, it generates a NUMBER4 token. So if your input were SAW5023, you'd get an error because of an unexpected NUMBER4 token.
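The two effects described above can be reproduced outside ANTLR. The following is a minimal Java sketch, not ANTLR's actual implementation: a hypothetical MiniLexer class tries every rule at the current position, keeps the longest match, and on a tie keeps the rule listed first. The rule order and the regular expressions are assumptions chosen to mirror the grammar in the question; the 'A' literal is listed first because the token dump shows it winning over FIELDTEXT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    // Rules in priority order (assumed): the implicit 'A' literal comes first.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'A'", Pattern.compile("A"));
        RULES.put("FIELDTEXT", Pattern.compile("[^/]"));
        RULES.put("NUMBER4", Pattern.compile("[0-9]{4}"));
        RULES.put("STROKE", Pattern.compile("/"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [FIELDTEXT:S, 'A':A, FIELDTEXT:W, FIELDTEXT:5, FIELDTEXT:0, FIELDTEXT:2, 'A':A]
        System.out.println(tokenize("SAW502A"));
        // [FIELDTEXT:S, 'A':A, FIELDTEXT:W, NUMBER4:5023]
        System.out.println(tokenize("SAW5023"));
    }
}
```

Nothing in tokenize looks at what a parser rule would expect next, which is the point: the choice between 'A', FIELDTEXT and NUMBER4 is made from the characters alone.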
You can fix the issue with the A by introducing an everythingButAStroke non-terminal rule that can be either a FIELDTEXT or an 'A'. To cover the NUMBER4 issue as well, it would also have to accept NUMBER4, and whenever you add a new token rule, you would have to add that one to everythingButAStroke too. But that's not a very good solution. For one, it will get less manageable the more token rules you add. And for another, you clearly intended f1 to be a list of single characters, but now NUMBER4 tokens, which have four characters, would be there as well, which would be weird and inconsistent.
It seems to me that your whole field rule could be a single terminal rule (ideally separated into fragments for readability) instead of using non-terminal rules like this. That way you would have no problems with overlapping terminal rules.
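As a rough illustration of the single-terminal idea, here is the same field shape written as one java.util.regex pattern rather than as an ANTLR lexer rule. The class name FieldPattern and the method parse are made up for this sketch, and the pattern assumes that f2 is always the letter A and f3 always four digits, as in the grammar in the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldPattern {
    // f1: one or more non-slash characters.
    // Optional tail: a slash, the letter A (f2) and four digits (f3).
    static final Pattern FIELD = Pattern.compile("([^/]+)(?:/(A)([0-9]{4}))?");

    // Returns {f1, f2, f3}; f2 and f3 are null when the optional part is absent.
    static String[] parse(String line) {
        Matcher m = FIELD.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a field: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String line : new String[] { "PHOEN", "KLM405/A4046", "SAW502A", "BAW617/A5136" }) {
            System.out.println(Arrays.toString(parse(line)));
        }
    }
}
```

Because the whole field is matched by one pattern, an A or a run of four digits inside f1 is never split off into a token of its own.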

I have often experienced that a negating lexer rule makes it hard to define other lexer rules, so I prefer to avoid them. It seems that a /, if present, is always followed by an A. Therefore I have another solution.
File Question_x.g4 :
grammar Question_x;
question
@init {System.out.println("Question last update 0112");}
: line+ EOF
;
line
: f1 ( f2s='/A' f3 )? NL
{ String f2 = _localctx.f2s != null ? _localctx.f2s.getText().substring(1) : null;
System.out.println("f1=" + $f1.text + " f2=" + f2 + " f3=" + $f3.text);}
;
f1 : ALPHANUM | NUMBER4 ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
ALPHANUM : [a-zA-Z0-9]+ ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
SAW5023
1234/A1234
Execution :
$ grun Question_x question -tokens -diagnostics t.text
[#0,0:4='PHOEN',<ALPHANUM>,1:0]
[#1,5:5='\n',<NL>,1:5]
[#2,6:11='KLM405',<ALPHANUM>,2:0]
[#3,12:13='/A',<'/A'>,2:6]
[#4,14:17='4046',<NUMBER4>,2:8]
[#5,18:18='\n',<NL>,2:12]
[#6,19:25='SAW502A',<ALPHANUM>,3:0]
[#7,26:26='\n',<NL>,3:7]
[#8,27:32='BAW617',<ALPHANUM>,4:0]
[#9,33:34='/A',<'/A'>,4:6]
[#10,35:38='5136',<NUMBER4>,4:8]
[#11,39:39='\n',<NL>,4:12]
[#12,40:46='SAW5023',<ALPHANUM>,5:0]
[#13,47:47='\n',<NL>,5:7]
[#14,48:51='1234',<NUMBER4>,6:0]
[#15,52:53='/A',<'/A'>,6:4]
[#16,54:57='1234',<NUMBER4>,6:6]
[#17,58:58='\n',<NL>,6:10]
[#18,59:58='<EOF>',<EOF>,7:0]
Question last update 0112
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
f1=SAW5023 f2=null f3=null
f1=1234 f2=A f3=1234

Related

ANTLR4 grammar: getting mismatched input error

I have defined the following grammar:
grammar Test;
parse: expr EOF;
expr : IF comparator FROM field THEN #comparatorExpr
;
dateTime : DATE_TIME;
number : (INT|DECIMAL);
field : FIELD_IDENTIFIER;
op : (GT | GE | LT | LE | EQ);
comparator : op (number|dateTime);
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
IF : '$IF';
FROM : '$FROM';
THEN : '$THEN';
OR : '$OR';
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '=' ;
INT : DIGIT+;
DECIMAL : INT'.'INT;
DATE_TIME : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS : [ \r\t\u000C\n]+ -> skip;
And I try to parse the following input:
$IF >=15 $FROM AgeInYears $THEN
it gives me the following error:
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
All SO posts I found point to the same cause for this error: identical lexer rules. But I cannot see why 15 could be matched as DECIMAL, which requires a . between two INTs, or as DATE_TIME, which requires an M, y or d suffix.
Any pointers would be appreciated here.

It's always a good idea to take a look at the token stream that your lexer produces:
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)
Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. Since that's three input characters long, ANTLR will prefer that Lexer rule to the INT rule that only matches 2 characters.
For this particular input, you can solve this by reworking the FIELD_IDENTIFIER rule to be:
FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT)+)*;
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:7='15',<INT>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)
That said, I suspect that attempting to allow spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers) is likely to be a continuing source of pain as you work on this. There's a reason why you don't see this in most languages, and it's not that nobody thought it would be handy to allow multi-word identifiers: it requires a greedy lexer rule that is likely to take precedence over other rules, as it did here.
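The length comparison behind this answer can be checked with a few lines of Java. This is only a sketch under assumptions: the lexer rules are approximated by regular expressions, the helper longest is a made-up name, and the reworked pattern is written here as [a-zA-Z0-9]+( +[a-zA-Z0-9]+)*, i.e. every word after a run of spaces may have several characters.

```java
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Length of the longest stretch of input, beginning at 'start', that the
    // whole regex matches; 0 if there is none. The lexer prefers the rule
    // for which this length is largest.
    static int longest(String regex, String input, int start) {
        Pattern p = Pattern.compile(regex);
        for (int end = input.length(); end > start; end--) {
            if (p.matcher(input).region(start, end).matches()) {
                return end - start;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String input = "$IF >=15 $FROM AgeInYears $THEN";
        int at = input.indexOf("15"); // offset 6, as in the error message

        String intRule = "[0-9]+";                              // INT
        String originalField = "[a-zA-Z0-9][a-zA-Z0-9 ]*";      // original FIELD_IDENTIFIER
        String reworkedField = "[a-zA-Z0-9]+( +[a-zA-Z0-9]+)*"; // reworked (assumed form)

        System.out.println("INT:      " + longest(intRule, input, at));       // 2
        System.out.println("original: " + longest(originalField, input, at)); // 3
        System.out.println("reworked: " + longest(reworkedField, input, at)); // 2
    }
}
```

With the original rule FIELD_IDENTIFIER is one character longer than INT and wins. With the reworked rule both match two characters, and the tie goes to INT because it is defined earlier in the grammar.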

ANTLR: too greedy rule

It looks like I have trouble understanding an overly greedy rule match. I'm trying to lex a .g4 file for syntax coloring. Here is a minimal (simplified) extract that makes the problem reproducible:
lexer grammar ANTLRv4Lexer;
Range
: '[' RangeChar+ ']'
;
fragment EscapedChar
: '\\' ~[u]
| '\\u' EscapedCharHex EscapedCharHex EscapedCharHex EscapedCharHex
;
fragment EscapedCharHex
: [0-9A-Fa-f]
;
fragment RangeChar
: ~']'
| EscapedChar
;
Punctuation
: [:;()+\->*[\]~|]
;
Identifier
: [a-zA-Z0-9]+
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
LineComment
: '//' ~[\r\n]*
;
The (incomplete) test file is the following:
: (~ [\]\\] | EscAny)+ -> more
;
// ------
fragment Id
: NameStartChar NameChar*
;
String2Part
: ( ~['\\]
| EscapeSequence
)+
;
I don't understand why it matches Range so greedily:
[#0,3:3=':',<Punctuation>,1:3]
[#1,5:5='(',<Punctuation>,1:5]
[#2,6:6='~',<Punctuation>,1:6]
[#3,8:135='[\]\\] | EscAny)+ -> more\r\n ;\r\n\r\n // ------\r\n\r\nfragment Id\r\n : NameStartChar NameChar*\r\n ;\r\n\r\n\r\nString2Part\r\n\t: ( ~['\\]',<Range>,1:8]
[#4,141:141='|',<Punctuation>,13:3]
[#5,143:156='EscapeSequence',<Identifier>,13:5]
[#6,162:162=')',<Punctuation>,14:3]
[#7,163:163='+',<Punctuation>,14:4]
[#8,167:167=';',<Punctuation>,15:1]
[#9,170:169='<EOF>',<EOF>,16:0]
I understand why it matches [, \] and \\ in the first line, but why does it apparently treat the following ] as a RangeChar?

Your lexer matches the first \ in \\] using the ~']' alternative and then matches the remaining \] as an EscapedChar. It does this because that interpretation leads to a longer match than the one where \\ is the EscapedChar and ] is the end of the range. When there are multiple valid ways to match a lexer rule, ANTLR always chooses the longest one (except when *? is involved).
To fix this, you should change RangeChar, so that backslashes are only allowed as part of escape sequences, i.e. replace ~']' with ~[\]\\].
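The difference between the two RangeChar definitions can be tried out with regular expressions. The sketch below is an approximation in plain Java, not generated lexer code: RangeDemo, ORIGINAL, FIXED and longest are made-up names, the Range rule is transcribed by hand into java.util.regex syntax, and the input is a shortened version of the test file.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // EscapedChar: a backslash followed by any character except u,
    // or a backslash, the letter u and four hex digits.
    static final String ESCAPED = "\\\\[^u]|\\\\u[0-9A-Fa-f]{4}";

    // Original RangeChar: anything but a closing bracket, or an EscapedChar.
    static final Pattern ORIGINAL =
        Pattern.compile("\\[(?:[^\\]]|" + ESCAPED + ")+\\]");

    // Fixed RangeChar: anything but a closing bracket or a backslash, or an EscapedChar.
    static final Pattern FIXED =
        Pattern.compile("\\[(?:[^\\]\\\\]|" + ESCAPED + ")+\\]");

    // Longest prefix of the input that the whole rule matches, or null.
    static String longest(Pattern rule, String input) {
        for (int end = input.length(); end > 0; end--) {
            if (rule.matcher(input).region(0, end).matches()) {
                return input.substring(0, end);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A bracket expression, some text, then a second bracket expression.
        String input = "[\\]\\\\] | x ['\\\\]";
        System.out.println(longest(ORIGINAL, input));
        System.out.println(longest(FIXED, input));
    }
}
```

With the original definition the longest match is the entire input, up to the closing bracket of the second range. With the fixed definition a backslash can only start an escape sequence, so the match stops after the first six characters, [\]\\].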

What's wrong with this ANTLR grammar?

I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to ANTLR4 and can't even figure out how to parse a single TABLE FIELD=QUERY clause. When I run the grammar below with Go as the target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;

Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have exactly the same definition. When two lexer rules would produce a match of the same length on the current input, ANTLR prefers the one that comes first in the grammar, so it will never produce a FIELDNAME token. To fix that, replace both of those rules with a single ID rule. If you want to keep the names, you can then introduce the parser rules tableName : ID ; and fieldName : ID ;.
The other problem is more straightforward: John simply does not match your rule for a string, since it's not in quotes. If you do want to allow John as a valid search term, you might want to define searchterm : STRING | ID ; instead of only allowing STRINGs.
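To make the "single ID rule, names decided by the parser" idea concrete, here is a small hand-written Java sketch. It is not what ANTLR generates: the class IdByPosition, its TOKEN pattern and the parse method are assumptions made for illustration, and only the simplest clause shape (table, field, equals sign, search term) is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdByPosition {
    // Group 1: ID, group 2: EQ, group 3: STRING. Leading whitespace is skipped.
    static final Pattern TOKEN =
        Pattern.compile("\\s*(?:([A-Za-z_][A-Za-z_0-9]*)|(=)|(\"[^\"\\n]*\"))");

    // Mirrors: tableName fieldName EQ (ID | STRING), where tableName and
    // fieldName are both plain ID tokens told apart only by their position.
    static String parse(String input) {
        List<String> types = new ArrayList<>();
        List<String> texts = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length() && m.region(pos, input.length()).lookingAt()) {
            if (m.group(1) != null) { types.add("ID"); texts.add(m.group(1)); }
            else if (m.group(2) != null) { types.add("EQ"); texts.add(m.group(2)); }
            else { types.add("STRING"); texts.add(m.group(3)); }
            pos = m.end();
        }
        if (!input.substring(pos).trim().isEmpty()) {
            throw new IllegalArgumentException("cannot tokenize: " + input.substring(pos));
        }
        boolean ok = types.size() == 4
            && types.get(0).equals("ID") && types.get(1).equals("ID")
            && types.get(2).equals("EQ") && !types.get(3).equals("EQ");
        if (!ok) {
            throw new IllegalArgumentException("expected: table field = term, got " + types);
        }
        return "table=" + texts.get(0) + " field=" + texts.get(1) + " term=" + texts.get(3);
    }

    public static void main(String[] args) {
        System.out.println(parse("Person Name=John"));
        System.out.println(parse("Person Fullname_3=\"John C. Smith\""));
    }
}
```

The tokenizer never has to guess whether Person is a table or a field. It only reports ID, and the position within the clause supplies the meaning.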

how to resolve an ambiguity

I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.

"An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)"
I'm not sure. In the ReplaceAll page(*), Possible Issues paragraph, it is said that "Periods bind to numbers more strongly than to slash", so that /.123 will always be interpreted as a division operation by the number .123. Next it is said that to avoid this issue, a space must be inserted in the input between the /. operator and the number, if you want it to be understood as a replacement.
So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement ?).
The ANTLR4 lexer and parser are greedy: the lexer (parser) tries to read as many input characters (tokens) as it can while matching a rule. With your OP rule OP : '/.' | '/' ; the lexer will always match the input /. with the /. alternative (even if the rule is written OP : '/' | '/.' ;). This means there is no ambiguity, and there is no chance of the input being interpreted as OP=/ and NUMBER=.123.
Given my limited experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
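The claim that the order of the alternatives does not matter can be pictured with a short Java sketch. It only models the behaviour described above for literal alternatives; OpAlternatives and matchOp are hypothetical names, not part of ANTLR.

```java
import java.util.Arrays;
import java.util.List;

class OpAlternatives {
    // Returns the longest alternative that matches at 'pos', or null if none does.
    // The order of the alternatives has no influence on the result.
    static String matchOp(List<String> alternatives, String input, int pos) {
        String best = null;
        for (String alt : alternatives) {
            if (input.startsWith(alt, pos) && (best == null || alt.length() > best.length())) {
                best = alt;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String input = "x/.123";
        System.out.println(matchOp(Arrays.asList("/.", "/"), input, 1)); // /.
        System.out.println(matchOp(Arrays.asList("/", "/."), input, 1)); // /.
        System.out.println(matchOp(Arrays.asList("/", "/."), "x/2", 1)); // /
    }
}
```

Once the two-character operator has been consumed, the period is no longer available to NUMBER, so the remaining input 123 is lexed without its leading dot.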
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
@init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='x',<ID>,1:0]
[#1,1:1='/',<'/'>,1:1]
[#2,2:5='.123',<NUMBER>,1:2]
[#3,7:7='x',<ID>,2:0]
[#4,8:8='/',<'/'>,2:1]
[#5,9:9='.',<'.'>,2:2]
[#6,10:10='x',<ID>,2:3]
[#7,12:13='->',<'->'>,2:5]
[#8,15:15='1',<NUMBER>,2:8]
[#9,17:17='{',<'{'>,3:0]
...
[#29,47:47='}',<'}'>,4:5]
[#30,48:48='/',<'/'>,4:6]
[#31,49:50='.0',<NUMBER>,4:7]
...
[#40,67:67='}',<'}'>,5:5]
[#41,68:68='/',<'/'>,5:6]
[#42,69:69='.',<'.'>,5:7]
[#43,71:71='0',<NUMBER>,5:9]
...
[#48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous until the slash. Then the parser has two choices: / NUMBER in the division rule or / . replacement in the replace_all rule. I think that NUMBER absorbs the input, so there is no more ambiguity.
(*) The link was in a comment yesterday that has since disappeared: Wolfram Language & System, ReplaceAll.

ANTLR grammar problem with parenthetical expressions

I'm using ANTLRWorks 1.4.2 to create a simple grammar for the purpose of evaluating a user-provided expression as a boolean result. This ultimately will be part of a larger grammar, but I have some questions about this current fragment. I want users to be able to use expressions such as:
2 > 1
2 > 1 and 3 < 1
(2 > 1 or 1 < 3) and 4 > 1
(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, and I am not sure why. So, I seem to be missing out on some insight into the right way to handle parenthetical grouping in a grammar.
How can I change my grammar to properly handle parentheses?
My grammar is below:
grammar conditional_test;
boolean
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
boolean_term
: boolean_factor (AND boolean_factor)*
;
boolean_factor
: (NOT)? boolean_test
;
boolean_test
: predicate
;
predicate
: expression relational_operator expression
| LPAREN boolean_value_expression RPAREN
;
relational_operator
: EQ
| LT
| GT
;
expression
: NUMBER
;
LPAREN : '(';
RPAREN : ')';
NUMBER : '0'..'9'+;
EQ : '=';
GT : '>';
LT : '<';
AND : 'and';
OR : 'or' ;
NOT : 'not';

Chris Farmer wrote:
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. ...
You should remove the EOF token from:
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
You normally only use EOF at the end of the entry point of your grammar (boolean in your case). Be careful: boolean is a reserved word in Java and therefore cannot be used as the name of a parser rule!
So the first two rules should look like:
bool
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
;
And you may also want to ignore literal spaces by adding the following lexer rule:
SPACE : ' ' {$channel=HIDDEN;};
(you can include tabs and line breaks, of course)
Now all of your example input matches properly (tested with ANTLRWorks 1.4.2 as well).
Chris Farmer wrote:
Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, ...
No, ANTLRWorks does produce errors, perhaps not very noticeable ones. The parse tree ANTLRWorks produces has a NoViableAltException as a leaf, and there are some errors on the "Console" tab.
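To see why the EOF inside boolean_value_expression rejects the parenthesized inputs, it can help to write the rules out by hand. The following recursive-descent sketch in Java is an illustration only, not the parser ANTLR generates: BoolParser, its eofInsideExpression flag and the parses helper are made-up names, and the sketch lets EOF be matched more than once, which is an assumption that agrees with the question's report that the first two inputs are accepted by the original grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoolParser {
    private final List<String> tokens = new ArrayList<>();
    private final boolean eofInsideExpression; // true mimics the grammar in the question
    private int pos = 0;

    BoolParser(String input, boolean eofInsideExpression) {
        Matcher m = Pattern.compile("[0-9]+|and|or|not|[()=<>]").matcher(input);
        while (m.find()) tokens.add(m.group()); // spaces are simply skipped
        tokens.add("<EOF>");
        this.eofInsideExpression = eofInsideExpression;
    }

    private String peek() { return tokens.get(pos); }

    private boolean accept(String token) {
        if (!peek().equals(token)) return false;
        pos++;
        return true;
    }

    private void expect(String token) {
        if (!peek().equals(token)) {
            throw new IllegalStateException("expected " + token + " but found " + peek());
        }
        if (pos < tokens.size() - 1) pos++; // never move past EOF
    }

    // bool : boolean_value_expression EOF ;
    void bool() { booleanValueExpression(); expect("<EOF>"); }

    // boolean_value_expression : boolean_term (OR boolean_term)* ; plus EOF when the flag is set
    void booleanValueExpression() {
        booleanTerm();
        while (accept("or")) booleanTerm();
        if (eofInsideExpression) expect("<EOF>");
    }

    void booleanTerm() {
        booleanFactor();
        while (accept("and")) booleanFactor();
    }

    void booleanFactor() {
        accept("not"); // optional
        predicate();
    }

    void predicate() {
        if (accept("(")) {
            booleanValueExpression(); // with the flag set, this demands EOF before the ")"
            expect(")");
            return;
        }
        number();
        if (!(accept("=") || accept("<") || accept(">"))) {
            throw new IllegalStateException("expected a relational operator but found " + peek());
        }
        number();
    }

    void number() {
        if (!peek().matches("[0-9]+")) {
            throw new IllegalStateException("expected a number but found " + peek());
        }
        pos++;
    }

    static boolean parses(String input, boolean eofInsideExpression) {
        try {
            new BoolParser(input, eofInsideExpression).bool();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String grouped = "(2 > 1 or 1 < 3) and 4 > 1";
        System.out.println(parses("2 > 1 and 3 < 1", true)); // true
        System.out.println(parses(grouped, true));           // false
        System.out.println(parses(grouped, false));          // true
        System.out.println(parses("((((1 > 2)", false));     // false
    }
}
```

Inside the parentheses the nested call reaches the point where the original rule demands EOF while the next token is still ")", so the parse fails. With EOF kept only in the entry rule, the grouped input is accepted and the unbalanced one is still rejected.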
