I have a grammar like this :
grammar MyGrammar;
field : f1 (STROKE f2 f3)? ;
f1 : FIELDTEXT+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
FIELDTEXT : ~['/'] ;
NUMBER4 : [0-9][0-9][0-9][0-9];
STROKE : '/' ;
This works well enough, and fields f1 f2 f3 are all populated correctly.
The exception is when there is an A to the left of the / (regardless of whether the optional part is present). That additionally causes an error:
extraneous input 'A' expecting {<EOF>, FIELDTEXT, '/'}
Some sample Data:
PHOEN
-> OK.
KLM405/A4046
-> OK.
SAW502A
-> Not OK, 'A' is in f1.
BAW617/A5136
-> Not OK, 'A' is in f1.
I don't understand why 'A' is a problem here (the fields are still populated).
The problem with SAW502A is that 'A' is a separate token, implicitly defined :
f2 : 'A' ;
(it would be the same if it were explicitly defined) :
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<FIELDTEXT>,3:7]
and the rule f1 does not allow anything else than FIELDTEXT.
It works with :
f1 : ( FIELDTEXT | 'A' )+ ;
File Question.g4 :
grammar Question;
question
@init {System.out.println("Question last update 2305");}
: line+ EOF
;
line
: f1 (STROKE f2 f3)? NL
{System.out.println("f1=" + $f1.text + " f2=" + $f2.text + " f3=" + $f3.text);}
;
f1 : ( FIELDTEXT | 'A' )+ ;
f2 : 'A' ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
STROKE : '/' ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
FIELDTEXT : ~[/] ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
Execution :
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='P',<FIELDTEXT>,1:0]
[#1,1:1='H',<FIELDTEXT>,1:1]
[#2,2:2='O',<FIELDTEXT>,1:2]
[#3,3:3='E',<FIELDTEXT>,1:3]
[#4,4:4='N',<FIELDTEXT>,1:4]
[#5,5:5='\n',<NL>,1:5]
[#6,6:6='K',<FIELDTEXT>,2:0]
[#7,7:7='L',<FIELDTEXT>,2:1]
[#8,8:8='M',<FIELDTEXT>,2:2]
[#9,9:9='4',<FIELDTEXT>,2:3]
[#10,10:10='0',<FIELDTEXT>,2:4]
[#11,11:11='5',<FIELDTEXT>,2:5]
[#12,12:12='/',<'/'>,2:6]
[#13,13:13='A',<'A'>,2:7]
[#14,14:17='4046',<NUMBER4>,2:8]
[#15,18:18='\n',<NL>,2:12]
[#16,19:19='S',<FIELDTEXT>,3:0]
[#17,20:20='A',<'A'>,3:1]
[#18,21:21='W',<FIELDTEXT>,3:2]
[#19,22:22='5',<FIELDTEXT>,3:3]
[#20,23:23='0',<FIELDTEXT>,3:4]
[#21,24:24='2',<FIELDTEXT>,3:5]
[#22,25:25='A',<'A'>,3:6]
[#23,26:26='\n',<NL>,3:7]
[#24,27:27='B',<FIELDTEXT>,4:0]
[#25,28:28='A',<'A'>,4:1]
[#26,29:29='W',<FIELDTEXT>,4:2]
[#27,30:30='6',<FIELDTEXT>,4:3]
[#28,31:31='1',<FIELDTEXT>,4:4]
[#29,32:32='7',<FIELDTEXT>,4:5]
[#30,33:33='/',<'/'>,4:6]
[#31,34:34='A',<'A'>,4:7]
[#32,35:38='5136',<NUMBER4>,4:8]
[#33,39:39='\n',<NL>,4:12]
[#34,40:39='<EOF>',<EOF>,5:0]
Question last update 2305
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
The input SAW502A will be tokenized as five FIELDTEXT tokens and two 'A' tokens, one for each A. That's a problem because 'A' tokens aren't allowed at those positions - only FIELDTEXT tokens are. Clearly you intended A to be a FIELDTEXT in this context as well (and only be treated differently in the f2 rule), but the tokenizer doesn't know which kind of token is required by the grammar at a certain point - it only knows the token rules and generates whichever token is the best fit. So whenever it sees an A, it generates an 'A' token.
Note that this also means that whenever it sees four consecutive digits, it generates a NUMBER4 token. So if your input were SAW5023, you'd get an error because of an unexpected NUMBER4 token.
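To make the tokenizer's behaviour concrete, here is a minimal Java sketch of a context-free, longest-match lexer. It is not ANTLR's implementation: the class name LongestMatchLexerSketch and the regular expressions standing in for the token rules are assumptions made for illustration, and the literal 'A' is listed first so that it wins ties against FIELDTEXT.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchLexerSketch {
    // Token rules in priority order (hypothetical stand-ins for the grammar's rules).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'A'", Pattern.compile("A"));
        RULES.put("FIELDTEXT", Pattern.compile("[^/]"));
        RULES.put("NUMBER4", Pattern.compile("[0-9]{4}"));
        RULES.put("STROKE", Pattern.compile("/"));
    }

    // Splits the input into "TYPE:text" entries without any knowledge of parser rules.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestType = rule.getKey();
                }
            }
            // Every character matches FIELDTEXT or STROKE, so bestType is never null here.
            tokens.add(bestType + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SAW502A"));
        System.out.println(tokenize("SAW5023"));
    }
}
```

In this sketch both As in SAW502A come out as 'A' tokens and the 5023 in SAW5023 comes out as a single NUMBER4 token, no matter which parser rule would be active at that point.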
You can fix the issue with the A by introducing an everythingButAStroke non-terminal rule that can be either a FIELDTEXT, an 'A' or a NUMBER4, but this wouldn't solve the NUMBER4 issue. And whenever you add a new token rule, you would have to add that one to everythingButAStroke as well. But that's not a very good solution. For one, it will get less manageable the more token rules you add. And for another, you clearly intended f1 to be a list of single characters, but now NUMBER4 tokens, which have four characters, would be there as well, which would be weird and inconsistent.
It seems to me that your whole field rule could be a single terminal rule (ideally separated into fragments for readability) instead of using non-terminal rules like this. That way you would have no problems with overlapping terminal rules.
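As a rough illustration of treating the whole field as one pattern, the following Java sketch uses a regular expression in place of a single terminal rule. The pattern, the class name FieldPatternSketch and the parse helper are hypothetical; the sketch assumes that f1 is any run of non-slash characters and that the optional part is always /A followed by four digits.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldPatternSketch {
    // One pattern for the whole field: f1, then optionally '/', f2 ('A') and f3 (four digits).
    private static final Pattern FIELD = Pattern.compile("([^/]+)(?:/(A)([0-9]{4}))?");

    // Returns {f1, f2, f3}, with null for absent parts, or null if the field is malformed.
    static String[] parse(String field) {
        Matcher m = FIELD.matcher(field);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "PHOEN", "KLM405/A4046", "SAW502A", "SAW5023" }) {
            System.out.println(s + " -> " + Arrays.toString(parse(s)));
        }
    }
}
```

Because the pattern is matched as a whole, an A or a run of four digits inside f1 is no longer split off as a separate token.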
I have often experienced that a negating lexer rule makes it hard to define other lexer rules, so I prefer to avoid them. It seems that a /, if present, is always followed by an A. Therefore I have another solution.
File Question_x.g4 :
grammar Question_x;
question
@init {System.out.println("Question last update 0112");}
: line+ EOF
;
line
: f1 ( f2s='/A' f3 )? NL
{ String f2 = _localctx.f2s != null ? _localctx.f2s.getText().substring(1) : null;
System.out.println("f1=" + $f1.text + " f2=" + f2 + " f3=" + $f3.text);}
;
f1 : ALPHANUM | NUMBER4 ;
f3 : NUMBER4 ;
NUMBER4 : [0-9][0-9][0-9][0-9] ;
ALPHANUM : [a-zA-Z0-9]+ ;
NL : [\r\n]+ ; // -> channel(HIDDEN) ;
WS : [ \t]+ -> skip ;
Input file t.text :
PHOEN
KLM405/A4046
SAW502A
BAW617/A5136
SAW5023
1234/A1234
Execution :
$ grun Question_x question -tokens -diagnostics t.text
[#0,0:4='PHOEN',<ALPHANUM>,1:0]
[#1,5:5='\n',<NL>,1:5]
[#2,6:11='KLM405',<ALPHANUM>,2:0]
[#3,12:13='/A',<'/A'>,2:6]
[#4,14:17='4046',<NUMBER4>,2:8]
[#5,18:18='\n',<NL>,2:12]
[#6,19:25='SAW502A',<ALPHANUM>,3:0]
[#7,26:26='\n',<NL>,3:7]
[#8,27:32='BAW617',<ALPHANUM>,4:0]
[#9,33:34='/A',<'/A'>,4:6]
[#10,35:38='5136',<NUMBER4>,4:8]
[#11,39:39='\n',<NL>,4:12]
[#12,40:46='SAW5023',<ALPHANUM>,5:0]
[#13,47:47='\n',<NL>,5:7]
[#14,48:51='1234',<NUMBER4>,6:0]
[#15,52:53='/A',<'/A'>,6:4]
[#16,54:57='1234',<NUMBER4>,6:6]
[#17,58:58='\n',<NL>,6:10]
[#18,59:58='<EOF>',<EOF>,7:0]
Question last update 0112
f1=PHOEN f2=null f3=null
f1=KLM405 f2=A f3=4046
f1=SAW502A f2=null f3=null
f1=BAW617 f2=A f3=5136
f1=SAW5023 f2=null f3=null
f1=1234 f2=A f3=1234
I am using Xtext, and need suggestions on the following two problems.
Problem #1
Let's say I have three rules a, b and c. And I want to allow any sequence of these rules, except that b and c should appear only once. How best to write such a grammar?
Here is what I came up with:
root:
a+=a*
b=b
a+=a*
c=c
a+=a*
;
a: 'a';
b: 'b';
c: 'c';
Is there a better way to write root grammar? b and c still have to be in a strict order, which is not ideal.
Problem #2
Have a look at this grammar:
root:
any+=any*
x=x
any+=any*
;
any:
name=ID
'{'
any+=any*
'}'
;
x:
name='x' '{' y=y '}'
;
y:
name='y' '{' z=z '}'
;
z:
name='z' '{' any+=any* '}'
;
Using this grammar I expected to be able to write a language like below:
a {
b {
}
c {
y {
}
}
}
x {
y {
z {
the_end {}
}
}
}
However, I get an error due to the node "y" appearing under "c". Why is that? Is it because now that "y" has been used as a terminal in one of the rules, it cannot appear anywhere else in the grammar?
How to fix this grammar?
For problem #1:
root: a+=a* (b=b a+=a* & c=c a+=a*);
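For problem #2 you need a datatype rule like this:
IdOrABC: ID | 'a' | 'b' | 'c' ;
and you have to use it in your any rule, i.e. name=IdOrABC instead of name=ID. List in that rule every keyword that should also be allowed as a name (such as 'y' in the sample input).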
For problem #1, we can adjust the grammar like below:
root:
a+=a*
(
b=b a+=a* c=c
|
c=c a+=a* b=b
)
a+=a*
;
a: 'a';
b: 'b';
c: 'c';
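The constraint encoded by this rule can be cross-checked outside Xtext. The Java sketch below is only an illustration: it writes the rule as a regular expression over the characters a, b and c and compares it with a direct count. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class OnceEachSketch {
    // Mirrors the rule: a* ( b a* c | c a* b ) a*
    private static final Pattern ROOT = Pattern.compile("a*(?:ba*c|ca*b)a*");

    // True if the sequence is accepted by the regex form of the rule.
    static boolean matchesGrammar(String s) {
        return ROOT.matcher(s).matches();
    }

    // True if the sequence contains only a, b and c, with exactly one b and one c.
    static boolean onceEach(String s) {
        int b = 0;
        int c = 0;
        for (char ch : s.toCharArray()) {
            if (ch == 'b') {
                b++;
            } else if (ch == 'c') {
                c++;
            } else if (ch != 'a') {
                return false;
            }
        }
        return b == 1 && c == 1;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "aabaca", "acab", "abab", "aa" }) {
            System.out.println(s + " -> grammar=" + matchesGrammar(s) + ", counting=" + onceEach(s));
        }
    }
}
```

Both checks agree on these inputs, which suggests that spelling out the two orders covers "b and c exactly once, in any order" for this small case.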
On the other hand, problem #2 cannot really be solved via the grammar, as the parser can never differentiate between an ID and the special keywords "x", "y" and "z". Perhaps a better strategy would be to keep the grammar simple, like below:
root:
any+=any+
;
any:
name=ID
'{'
any+=any*
'}'
;
And enforce the special x/y/z hierarchy via validators.
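A validator for that hierarchy could look roughly like the plain Java sketch below. In a real Xtext project the check would live in the language's validator class and work on the generated model types; here the Node class is a hypothetical stand-in for the element produced by the any rule, and the rules checked (exactly one top-level x, which contains exactly one y, which contains exactly one z) are an assumption read off the original grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HierarchyValidatorSketch {
    // Hypothetical stand-in for the model element produced by the 'any' rule.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Returns a list of error messages; an empty list means the hierarchy is valid.
    static List<String> validate(List<Node> roots) {
        List<String> errors = new ArrayList<>();
        long xCount = roots.stream().filter(n -> n.name.equals("x")).count();
        if (xCount != 1) {
            errors.add("expected exactly one top-level 'x', found " + xCount);
        }
        for (Node n : roots) {
            if (n.name.equals("x")) {
                Node y = onlyChild(n, "y", errors);
                if (y != null) {
                    onlyChild(y, "z", errors);
                }
            }
        }
        return errors;
    }

    // Returns the single child with the expected name, or records an error and returns null.
    private static Node onlyChild(Node parent, String expected, List<String> errors) {
        if (parent.children.size() == 1 && parent.children.get(0).name.equals(expected)) {
            return parent.children.get(0);
        }
        errors.add("'" + parent.name + "' must contain exactly one '" + expected + "'");
        return null;
    }

    public static void main(String[] args) {
        Node a = new Node("a", new Node("b"), new Node("c", new Node("y")));
        Node x = new Node("x", new Node("y", new Node("z", new Node("the_end"))));
        System.out.println(validate(Arrays.asList(a, x)));
    }
}
```

With this split, a y nested under c is just an ordinary name, and only the x subtree is held to the stricter structure.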
I'm using ANTLRWorks 1.4.2 to create a simple grammar for the purpose of evaluating a user-provided expression as a boolean result. This ultimately will be part of a larger grammar, but I have some questions about this current fragment. I want users to be able to use expressions such as:
2 > 1
2 > 1 and 3 < 1
(2 > 1 or 1 < 3) and 4 > 1
(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, and I am not sure why. So, I seem to be missing out on some insight into the right way to handle parenthetical grouping in a grammar.
How can I change my grammar to properly handle parentheses?
My grammar is below:
grammar conditional_test;
boolean
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
boolean_term
: boolean_factor (AND boolean_factor)*
;
boolean_factor
: (NOT)? boolean_test
;
boolean_test
: predicate
;
predicate
: expression relational_operator expression
| LPAREN boolean_value_expression RPAREN
;
relational_operator
: EQ
| LT
| GT
;
expression
: NUMBER
;
LPAREN : '(';
RPAREN : ')';
NUMBER : '0'..'9'+;
EQ : '=';
GT : '>';
LT : '<';
AND : 'and';
OR : 'or' ;
NOT : 'not';
Chris Farmer wrote:
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. ...
You should remove the EOF token from:
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
You normally only use the EOF after the entry point of your grammar (boolean in your case). Be careful: boolean is a reserved word in Java and can therefore not be used as a parser rule!
So the first two rules should look like:
bool
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
;
And you may also want to ignore literal spaces by adding the following lexer rule:
SPACE : ' ' {$channel=HIDDEN;};
(you can include tabs and line breaks, of course)
Now all of your example input matches properly (tested with ANTLRWorks 1.4.2 as well).
Chris Farmer wrote:
Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, ...
No, ANTLRWorks does produce errors, perhaps not very noticeable ones. The parse tree ANTLRWorks produces has a NoViableAltException as a leaf, and there are some errors on the "Console" tab.
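To show how the corrected rules treat parentheses, here is a small hand-written recursive-descent evaluator in Java that follows the same rule structure. It is only a sketch and not the parser ANTLR generates; the class name BooleanExprSketch, the method names and the error messages are made up for illustration. The end of input is checked once, in the entry point, just as EOF appears only in the bool rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BooleanExprSketch {
    private static final Pattern TOKEN = Pattern.compile("\\s*(\\d+|and|or|not|[()=<>])");
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private BooleanExprSketch(String input) {
        String s = input.trim();
        Matcher m = TOKEN.matcher(s);
        int i = 0;
        while (i < s.length()) {
            m.region(i, s.length());
            if (!m.lookingAt()) throw new IllegalArgumentException("unexpected input at " + i);
            tokens.add(m.group(1));
            i = m.end();
        }
    }

    // Entry point (the 'bool' rule): the only place where end of input is required.
    static boolean evaluate(String input) {
        BooleanExprSketch p = new BooleanExprSketch(input);
        boolean result = p.expression();
        if (p.pos != p.tokens.size()) throw new IllegalArgumentException("unexpected token: " + p.peek());
        return result;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }

    private String next() { String t = peek(); pos++; return t; }

    // boolean_value_expression : boolean_term (OR boolean_term)*
    private boolean expression() {
        boolean v = term();
        while (peek().equals("or")) { pos++; boolean r = term(); v = v || r; }
        return v;
    }

    // boolean_term : boolean_factor (AND boolean_factor)*
    private boolean term() {
        boolean v = factor();
        while (peek().equals("and")) { pos++; boolean r = factor(); v = v && r; }
        return v;
    }

    // boolean_factor : NOT? predicate
    private boolean factor() {
        if (peek().equals("not")) { pos++; return !predicate(); }
        return predicate();
    }

    // predicate : NUMBER relational_operator NUMBER | LPAREN boolean_value_expression RPAREN
    private boolean predicate() {
        if (peek().equals("(")) {
            pos++;
            boolean v = expression();
            if (!peek().equals(")")) throw new IllegalArgumentException("expected ) but found " + peek());
            pos++;
            return v;
        }
        int left = number();
        String op = next();
        int right = number();
        switch (op) {
            case "=": return left == right;
            case "<": return left < right;
            case ">": return left > right;
            default: throw new IllegalArgumentException("expected relational operator but found " + op);
        }
    }

    private int number() {
        String t = next();
        if (!t.matches("\\d+")) throw new IllegalArgumentException("expected number but found " + t);
        return Integer.parseInt(t);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))"));
    }
}
```

In this sketch the four example expressions are accepted, while ((((1 > 2) is rejected with an exception because the closing parentheses are missing.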