Why doesn't this ANTLR grammar derive the string `baba`?

Using ANTLR v4.9.3, I created the following grammar …
grammar G ;
start : s EOF ;
s : 'ba' a b ;
a : 'b' ;
b : 'a' ;
Given the above grammar, I thought that the following derivation is possible …
start → s → 'ba' a b → 'ba' 'b' b → 'ba' 'b' 'a' = 'baba'
However, my Java test program indicates a syntax error occurs when trying to parse the string baba.
Shouldn't the string baba be in the language generated by grammar G ?

Although the conclusion/answer is already in the comments, here is an answer that explains it in a bit more detail.
When you define literal tokens inside parser rules (the 'ba', 'a' and 'b'), ANTLR implicitly creates the following grammar:
grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;
T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';
Now, when the lexer gets the input "baba", it will create 2 T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently of the parser. The lexer creates tokens following these 2 rules:
1. try to match as many characters as possible for a rule
2. when 2 (or more) lexer rules match the same characters, let the one defined first "win"
Because of rule 1, it is apparent that 2 T__0 tokens are created.
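The two rules above can be sketched in a few lines of Python. This is only an illustration of the matching strategy, not ANTLR's implementation; the `tokenize` helper and the rule lists are names made up for this example.

```python
# Sketch of the two lexer rules: longest match wins, ties go to the first rule.
# `tokenize` and the rule lists are illustrative names, not part of ANTLR.

def tokenize(text, rules):
    """rules: list of (token_name, literal) pairs in definition order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # best (name, literal) candidate found so far
        for name, literal in rules:
            if text.startswith(literal, pos):
                # A strictly longer match replaces the current best;
                # an equal-length match keeps the earlier rule.
                if best is None or len(literal) > len(best[1]):
                    best = (name, literal)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens

# Implicit tokens of the original grammar and of the grammar without 'ba'.
with_ba = [("T__0", "ba"), ("T__1", "b"), ("T__2", "a")]
without_ba = [("T__0", "b"), ("T__1", "a")]

first = [name for name, _ in tokenize("baba", with_ba)]
second = [name for name, _ in tokenize("baba", without_ba)]
print(first)   # ['T__0', 'T__0']
print(second)  # ['T__0', 'T__1', 'T__0', 'T__1']
```

With the 'ba' literal present, the parser only ever sees two T__0 tokens, so the rule `s : T__0 a b` cannot match.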
As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:
grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;
which would really be the grammar:
grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;
T__0 : 'b';
T__1 : 'a';

Related

What's wrong with this ANTLR grammar?

I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to Antlr4 and can't even figure out how to parse one single TABLE FIELD=QUERY clause. When I run the grammar below in Go as target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;
Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have the exact same definition. In cases where two lexer rules would produce a match of the same length on the current input, ANTLR prefers the one that comes first in the grammar. So it will never produce a FIELDNAME token. To fix that, just replace both of those rules with a single ID rule. If you want to keep the names, you can then introduce parser rules tableName : ID ; and fieldName : ID ;.
The other problem is more straightforward: John simply does not match your rules for a string since it's not in quotes. If you do want to allow John as a valid search term, you might want to define it as searchterm : STRING | ID ; instead of only allowing STRINGs.
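If grun is not at hand, the tie-breaking behaviour can be approximated in Python. The sketch below is not ANTLR itself: it models only a subset of the lexer rules (the keywords and parentheses are left out), and the `tokenize` helper and the `ORIGINAL`/`FIXED` rule lists are hypothetical names for this example.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z_0-9]*"

def rules(*pairs):
    return [(name, re.compile(pattern)) for name, pattern in pairs]

# Subset of the question's lexer rules, in definition order.
ORIGINAL = rules(("EQ", "="), ("TABLENAME", IDENT), ("FIELDNAME", IDENT),
                 ("STRING", r'"[^\n"]*"'), ("WS", r"[ \r\n\t]+"))
# Same subset with TABLENAME and FIELDNAME merged into one ID rule.
FIXED = rules(("EQ", "="), ("ID", IDENT),
              ("STRING", r'"[^\n"]*"'), ("WS", r"[ \r\n\t]+"))

def tokenize(text, rule_list):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rule_list:
            m = pattern.match(text, pos)
            # Only a strictly longer match displaces an earlier rule.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

original_tokens = tokenize("Person Name=John", ORIGINAL)
fixed_tokens = tokenize("Person Name=John", FIXED)
print(original_tokens)
print(fixed_tokens)
```

In the first run FIELDNAME never appears, because TABLENAME always matches the same characters and is defined earlier.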

ANTLR 4 extraneous input matching non lexer item

I have a grammar like this :
grammar MyGrammar;
field : f1 (STROKE f2 f3)? ;
f1 : FIELDTEXT+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
FIELDTEXT : ~['/'] ;
NUMBER4 : [0-9][0-9][0-9][0-9];
STROKE : '/' ;
This works well enough, and fields f1 f2 f3 are all populated correctly.
Except when there is an A to the left of the /, (regardless of the presence of the optional part) this additionally causes an error:
extraneous input 'A' expecting {<EOF>, FIELDTEXT, '/'}
Some sample Data:
PHOEN
-> OK.
KLM405/A4046
-> OK.
SAW502A
-> Not OK, 'A' is in f1.
BAW617/A5136
-> Not OK, 'A' is in f1.
I am not understanding why 'A' is a problem here (the fields are still populated).
The problem with SAW502A is that 'A' is a separate token, implicitly defined :
f2 : 'A' ;
(it would be the same if it were explicitly defined) :
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<FIELDTEXT>,3:7]
and the rule f1 does not allow anything else than FIELDTEXT.
It works with :
f1 : ( FIELDTEXT | 'A' )+ ;
File Question.g4 :
grammar Question;
question
@init {System.out.println("Question last update 2305");}
: line+ EOF
;
line
: f1 (STROKE f2 f3)? NL
{System.out.println("f1=" + $f1.text + " f2=" + $f2.text + " f3=" + $f3.text);}
;
f1 : ( FIELDTEXT | 'A' )+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
STROKE : '/' ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
FIELDTEXT : ~[/] ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
Execution :
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='P',<FIELDTEXT>,1:0]
[#1,1:1='H',<FIELDTEXT>,1:1]
[#2,2:2='O',<FIELDTEXT>,1:2]
[#3,3:3='E',<FIELDTEXT>,1:3]
[#4,4:4='N',<FIELDTEXT>,1:4]
[#5,5:5='\n',<NL>,1:5]
[#6,6:6='K',<FIELDTEXT>,2:0]
[#7,7:7='L',<FIELDTEXT>,2:1]
[#8,8:8='M',<FIELDTEXT>,2:2]
[#9,9:9='4',<FIELDTEXT>,2:3]
[#10,10:10='0',<FIELDTEXT>,2:4]
[#11,11:11='5',<FIELDTEXT>,2:5]
[#12,12:12='/',<'/'>,2:6]
[#13,13:13='A',<'A'>,2:7]
[#14,14:17='4046',<NUMBER4>,2:8]
[#15,18:18='\n',<NL>,2:12]
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<NL>,3:7]
[#24,27:27='B',<FIELDTEXT>,4:0]
[#25,28:28='A',<'A'>,4:1]
[#26,29:29='W',<FIELDTEXT>,4:2]
[#27,30:30='6',<FIELDTEXT>,4:3]
[#28,31:31='1',<FIELDTEXT>,4:4]
[#29,32:32='7',<FIELDTEXT>,4:5]
[#30,33:33='/',<'/'>,4:6]
[#31,34:34='A',<'A'>,4:7]
[#32,35:38='5136',<NUMBER4>,4:8]
[#33,39:39='\n',<NL>,4:12]
[#34,40:39='<EOF>',<EOF>,5:0]
Question last update 2305
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
The input SAW502A will be tokenized as a FIELDTEXT for S, an 'A' token, four FIELDTEXTs for W502, and a final 'A' token. That's a problem because 'A' tokens aren't allowed in f1 - only FIELDTEXT tokens are. Clearly you intended A to be a FIELDTEXT in this context as well (and only be treated differently in the f2 rule), but the tokenizer doesn't know which kind of token is required by the grammar at a certain point - it only knows the token rules and generates whichever token is the best fit. So whenever it sees an A, it generates an 'A' token.
Note that this also means that whenever it sees four consecutive digits, it generates a NUMBER4 token. So if your input were SAW5023, you'd get an error because of an unexpected NUMBER4 token.
You can fix the issue with the A by introducing an everythingButAStroke non-terminal rule that can be either a FIELDTEXT, an 'A' or a NUMBER4, but this wouldn't solve the NUMBER4 issue. And whenever you add a new token rule, you add that one to everythingButAStroke as well. But that's not a very good solution. For one, it will get less manageable the more token rules you add. And for another, you clearly intended f1 to be a list of single characters, but now NUMBER4 tokens, which have four characters, would be there as well, which would be weird and inconsistent.
It seems to me that your whole field rule could be a single terminal rule (ideally separated into fragments for readability) instead of using non-terminal rules like this. That way you would have no problems with overlapping terminal rules.
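The tokenization described in this answer can be reproduced with a small Python sketch. It is only an approximation of the lexer for the question's original grammar; `RULES` and `token_types` are illustrative names, and the implicit 'A' token is listed first on the assumption that implicit literal tokens precede the named lexer rules.

```python
import re

# Lexer rules of the original grammar in the order the lexer would try them:
# the implicit token for the literal 'A' first, then the named rules.
RULES = [
    ("'A'", re.compile(r"A")),
    ("FIELDTEXT", re.compile(r"[^/]")),
    ("NUMBER4", re.compile(r"[0-9]{4}")),
    ("STROKE", re.compile(r"/")),
]

def token_types(text):
    """Return the token type names for text (longest match, first rule wins ties)."""
    types, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        types.append(best_name)
        pos += best_len
    return types

saw502a = token_types("SAW502A")
saw5023 = token_types("SAW5023")
print(saw502a)  # every A becomes an 'A' token, never FIELDTEXT
print(saw5023)  # the four digits collapse into one NUMBER4 token
```

Neither token stream fits `f1 : FIELDTEXT+ ;`, which is why the parser reports errors even though the lexer did exactly what its rules say.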
In my experience, negated lexer rules make it hard to define other lexer rules, so I prefer to avoid them. It seems that a /, if present, is always followed by an A. Therefore I have another solution.
File Question_x.g4 :
grammar Question_x;
question
@init {System.out.println("Question last update 0112");}
: line+ EOF
;
line
: f1 ( f2s='/A' f3 )? NL
{ String f2 = _localctx.f2s != null ? _localctx.f2s.getText().substring(1) : null;
System.out.println("f1=" + $f1.text + " f2=" + f2 + " f3=" + $f3.text);}
;
f1 : ALPHANUM | NUMBER4 ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
ALPHANUM : [a-zA-Z0-9]+ ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
SAW5023
1234/A1234
Execution :
$ grun Question_x question -tokens -diagnostics t.text
[#0,0:4='PHOEN',<ALPHANUM>,1:0]
[#1,5:5='\n',<NL>,1:5]
[#2,6:11='KLM405',<ALPHANUM>,2:0]
[#3,12:13='/A',<'/A'>,2:6]
[#4,14:17='4046',<NUMBER4>,2:8]
[#5,18:18='\n',<NL>,2:12]
[#6,19:25='SAW502A',<ALPHANUM>,3:0]
[#7,26:26='\n',<NL>,3:7]
[#8,27:32='BAW617',<ALPHANUM>,4:0]
[#9,33:34='/A',<'/A'>,4:6]
[#10,35:38='5136',<NUMBER4>,4:8]
[#11,39:39='\n',<NL>,4:12]
[#12,40:46='SAW5023',<ALPHANUM>,5:0]
[#13,47:47='\n',<NL>,5:7]
[#14,48:51='1234',<NUMBER4>,6:0]
[#15,52:53='/A',<'/A'>,6:4]
[#16,54:57='1234',<NUMBER4>,6:6]
[#17,58:58='\n',<NL>,6:10]
[#18,59:58='<EOF>',<EOF>,7:0]
Question last update 0112
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
f1=SAW5023 f2=null f3=null
f1=1234 f2=A f3=1234

How best to write this xtext grammar

I am using Xtext, and need suggestions on the following two problems.
Problem #1
Let's say I have three rules a, b and c. And I want to allow any sequence of these rules, except that b and c should appear only once. How best to write such a grammar?
Here is what I came up with:
root:
a+=a*
b=b
a+=a*
c=c
a+=a*
;
a: 'a';
b: 'b';
c: 'c';
Is there a better way to write root grammar? b and c still have to be in a strict order, which is not ideal.
Problem #2
Have a look at this grammar:
root:
any+=any*
x=x
any+=any*
;
any:
name=ID
'{'
any+=any*
'}'
;
x:
name='x' '{' y=y '}'
;
y:
name='y' '{' z=z '}'
;
z:
name='z' '{' any+=any* '}'
;
Using this grammar I expected to be able to write a language like below:
a {
b {
}
c {
y {
}
}
}
x {
y {
z {
the_end {}
}
}
}
However, I get an error due to the node "y" appearing under "c". Why is that? Is it because now that "y" has been used as a terminal in one of the rules, it cannot appear anywhere else in the grammar?
How to fix this grammar?
For problem #1:
root: a+=a* (b=b a+=a* & c=c a+=a*);
For problem #2 you need a datatype rule like this:
IdOrABC: ID | 'a' | 'b' | 'c' ;
and you have to use it in your any rule like name=IdOrABC instead of name=ID.
For problem #1, we can adjust the grammar like below:
root:
a+=a*
(
b=b a+=a* c=c
|
c=c a+=a* b=b
)
a+=a*
;
a: 'a';
b: 'b';
c: 'c';
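As a quick sanity check of the language this root rule describes, it can be approximated by a regular expression in Python. This is only a sketch on plain strings of keywords without whitespace, not an Xtext test; `ROOT` and `accepts` are names invented for the example.

```python
import re

# a* ( b a* c | c a* b ) a*  : any number of a's, with b and c exactly once each,
# in either order.
ROOT = re.compile(r"a*(?:ba*c|ca*b)a*")

def accepts(sequence):
    return ROOT.fullmatch(sequence) is not None

print(accepts("abaca"))  # b before c
print(accepts("cab"))    # c before b
print(accepts("aa"))     # rejected: b and c are missing
print(accepts("abcb"))   # rejected: b appears twice
```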
On the other hand, problem #2 cannot really be solved via the grammar, as the parser can never differentiate between an ID and the special keywords "x", "y" or "z". Perhaps a better strategy would be to keep the grammar simple like below:
root:
any+=any+
;
any:
name=ID
'{'
any+=any*
'}'
;
And enforce the special x/y/z hierarchy via validators.
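The validation idea can be sketched independently of Xtext. The Python below is not Xtext validator code; it works on a hypothetical `Node` tree and assumes the intended constraints are: exactly one top-level x, which contains exactly one y, which contains exactly one z. In a real project the same checks would live in the language's validator class.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical stand-in for the model element produced by the `any` rule.
    name: str
    children: list = field(default_factory=list)

def validate(roots):
    """Return a list of error messages; an empty list means the model is valid."""
    xs = [n for n in roots if n.name == "x"]
    if len(xs) != 1:
        return [f"expected exactly one top-level 'x', found {len(xs)}"]
    node = xs[0]
    for expected in ("y", "z"):
        kids = node.children
        if len(kids) != 1 or kids[0].name != expected:
            return [f"'{node.name}' must contain exactly one '{expected}'"]
        node = kids[0]
    return []

# The sample document from the question: "y" under "c" is now just another node.
sample = [
    Node("a", [Node("b"), Node("c", [Node("y")])]),
    Node("x", [Node("y", [Node("z", [Node("the_end")])])]),
]
broken = [Node("x", [Node("y")])]

print(validate(sample))  # []
print(validate(broken))  # ["'y' must contain exactly one 'z'"]
```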

only char 'a' cannot be recognized in ANTLR grammar

identification for ID :
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
When I parse my rules, only the single character 'a' cannot be recognised, but 'A' or 'aa' or 'a0' or 'b' or 'c' or 'AAAZzzzxx' or ... everything else in the universe except 'a' can be recognized by the lexer. Why not 'a'?
error :
mismatched input 'a' expecting 'u0005'
thanks!
Your rule can match ZERO characters and so the lexer will go haywire. You need:
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+
;
See the '+' instead of '*'?
Jim
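A quick way to check what each form of the rule accepts is to approximate it with a regular expression in Python. This is only an approximation of the ID rule in isolation, and the variable names are made up for the example. In this approximation the `*` form already requires one leading character and accepts a lone a, while the `+` form needs at least two characters. If 'a' is still rejected with the `*` form, one possible cause (an assumption, since the rest of the grammar is not shown) is a literal 'a' in a parser rule, which would define an implicit token that takes precedence over ID.

```python
import re

# Regex approximations of the two versions of the ID rule.
id_star = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # original rule, ends in *
id_plus = re.compile(r"[A-Za-z_][A-Za-z0-9_]+")  # suggested rule, ends in +

star_accepts_a = id_star.fullmatch("a") is not None
plus_accepts_a = id_plus.fullmatch("a") is not None
star_accepts_empty = id_star.fullmatch("") is not None

print(star_accepts_a)      # True
print(plus_accepts_a)      # False
print(star_accepts_empty)  # False
```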

ANTLR grammar problem with parenthetical expressions

I'm using ANTLRWorks 1.4.2 to create a simple grammar for the purpose of evaluating a user-provided expression as a boolean result. This ultimately will be part of a larger grammar, but I have some questions about this current fragment. I want users to be able to use expressions such as:
2 > 1
2 > 1 and 3 < 1
(2 > 1 or 1 < 3) and 4 > 1
(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, and I am not sure why. So, I seem to be missing out on some insight into the right way to handle parenthetical grouping in a grammar.
How can I change my grammar to properly handle parentheses?
My grammar is below:
grammar conditional_test;
boolean
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
boolean_term
: boolean_factor (AND boolean_factor)*
;
boolean_factor
: (NOT)? boolean_test
;
boolean_test
: predicate
;
predicate
: expression relational_operator expression
| LPAREN boolean_value_expression RPAREN
;
relational_operator
: EQ
| LT
| GT
;
expression
: NUMBER
;
LPAREN : '(';
RPAREN : ')';
NUMBER : '0'..'9'+;
EQ : '=';
GT : '>';
LT : '<';
AND : 'and';
OR : 'or' ;
NOT : 'not';
Chris Farmer wrote:
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. ...
You should remove the EOF token from:
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
You normally only use the EOF after the entry point of your grammar (boolean in your case). Be careful: boolean is a reserved word in Java and can therefore not be used as a parser rule!
So the first two rules should look like:
bool
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
;
And you may also want to ignore literal spaces by adding the following lexer rule:
SPACE : ' ' {$channel=HIDDEN;};
(you can include tabs and line breaks, of course)
Now all of your example input matches properly (tested with ANTLRWorks 1.4.2 as well).
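To see why EOF belongs only in the entry rule, here is a hand-written recursive-descent sketch of the corrected grammar in Python. It is not generated by ANTLR; the `Parser` class and `tokenize` helper are illustrative, and evaluating the comparisons is an addition for demonstration. EOF is checked once in `parse`, so `value_expression` can be re-entered inside parentheses.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|(and|or|not)|([()=<>]))")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(m.lastindex))
        pos = m.end()
    tokens.append("<EOF>")
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, *choices):
        tok = self.peek()
        if tok not in choices and not ("NUMBER" in choices and tok.isdigit()):
            raise SyntaxError(f"expected {choices}, got {tok!r}")
        self.pos += 1
        return tok

    def parse(self):  # bool : boolean_value_expression EOF
        value = self.value_expression()
        self.expect("<EOF>")
        return value

    def value_expression(self):  # boolean_term (OR boolean_term)*
        value = self.term()
        while self.peek() == "or":
            self.pos += 1
            rhs = self.term()  # always parse the right side before combining
            value = value or rhs
        return value

    def term(self):  # boolean_factor (AND boolean_factor)*
        value = self.factor()
        while self.peek() == "and":
            self.pos += 1
            rhs = self.factor()
            value = value and rhs
        return value

    def factor(self):  # (NOT)? boolean_test
        if self.peek() == "not":
            self.pos += 1
            return not self.predicate()
        return self.predicate()

    def predicate(self):  # comparison | LPAREN boolean_value_expression RPAREN
        if self.peek() == "(":
            self.pos += 1
            value = self.value_expression()
            self.expect(")")
            return value
        left = int(self.expect("NUMBER"))
        op = self.expect("=", "<", ">")
        right = int(self.expect("NUMBER"))
        return {"=": left == right, "<": left < right, ">": left > right}[op]

print(Parser("(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))").parse())  # True
```

Input with unbalanced parentheses such as ((((1 > 2) raises a SyntaxError in this sketch, because `predicate` expects a closing parenthesis and finds the end of input instead.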
Chris Farmer wrote:
Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, ...
No, ANTLRWorks does produce errors, perhaps not very noticeable ones. The parse tree ANTLRWorks produces has a NoViableAltException as a leaf, and there are some errors on the "Console" tab.