I have the following rule:
ASTMin:
MinKeyword '(' expression=ASTSimple ')';
MinKeyword: 'min';
For an expression like min (4) the parser creates the error message:
extraneous input ' ' expecting '('
Where can I disable the whitespace behaviour?
To solve it, add the terminal rule "WS" to the hidden clause at the top of your grammar, as follows:
grammar org.your.Dsl hidden(WS, ML_COMMENT, SL_COMMENT)
If you are using the Xtext Terminals grammar:
grammar org.your.Dsl with org.eclipse.xtext.common.Terminals hidden(WS, ML_COMMENT, SL_COMMENT)
I'm doing some experiments with ANTLR4 with this grammar:
srule
: '(' srule ')'
| srule srule
| '(' ')';
This grammar is for the language of balanced parentheses.
The problem is that when I run ANTLR with this string: (()))(
This string is obviously invalid, but ANTLR simply returns a parse tree without complaint.
It seems to stop when it reaches the unmatched parenthesis, but no error message is reported. I would like to know more about this behavior. Thank you.
The parser recognises (()) and then stops. If you want to force the parser to consume all tokens, "anchor" your test rule with the EOF token:
parse_all
: srule EOF
;
Btw, it's always a good idea to include the EOF token in the entry point (entry rule) of your grammar.
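To make the difference concrete, here is a minimal Python sketch of a hand-written recognizer for the same language. It is not ANTLR's generated code; the function names match_group, match_srule and accepts_all are made up for illustration. match_srule behaves like the unanchored srule rule (it is satisfied by a valid prefix), while accepts_all plays the role of parse_all with EOF.

```python
def match_group(text, pos):
    """Match one '(' ... ')' group at pos; return the index after it, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1
    inner = match_srule(text, pos + 1)
    if inner == -1:
        inner = pos + 1  # nothing nested: this is the '(' ')' alternative
    if inner < len(text) and text[inner] == ")":
        return inner + 1
    return -1


def match_srule(text, pos=0):
    """Match one or more groups; return the end of the matched prefix, or -1."""
    end = match_group(text, pos)
    if end == -1:
        return -1
    while True:
        nxt = match_group(text, end)
        if nxt == -1:
            return end  # stop quietly at the first character that does not fit
        end = nxt


def accepts_all(text):
    """Like 'srule EOF': succeed only if the whole input was consumed."""
    return match_srule(text) == len(text)
```

With the input from the question, match_srule("(()))(") returns 4, meaning only the first four characters were matched, while accepts_all("(()))(") is False because the trailing )( is left over.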
Given the grammar:
grammar Test;
words: (WORD|SPACE|DOT)+;
WORD : (
LD
|DOT {_input.LA(1)!='.'}?
) + ;
DOT: '.';
SPACE: ' ';
fragment LD: ~[.\n\r ];
With the lexer generated by ANTLR 4, for the input:
test. test.test test..test
the token sequence is:
[#0,0:4='test.',<1>,1:0]
[#1,5:5=' ',<3>,1:5]
[#2,6:14='test.test',<1>,1:6]
[#3,15:15=' ',<3>,1:15]
[#4,16:19='test',<1>,1:16]
[#5,20:20='.',<2>,1:20]
[#6,21:25='.test',<1>,1:21]
[#7,26:25='<EOF>',<-1>,1:26]
What puzzles me is why the last piece of text, test..test, is tokenized into test, . and .test, while I expected to see test. and .test
What puzzles me more is that for the input:
test..test test. test.test
the token sequence is:
[#0,0:3='test',<1>,1:0]
[#1,4:4='.',<2>,1:4]
[#2,5:9='.test',<1>,1:5]
[#3,10:10=' ',<3>,1:10]
[#4,11:14='test',<1>,1:11]
[#5,15:15='.',<1>,1:15]
[#6,16:16=' ',<3>,1:16]
[#7,17:20='test',<1>,1:17]
[#8,21:25='.test',<1>,1:21]
[#9,26:25='<EOF>',<-1>,1:26]
Here test.test is split into two tokens, while above it is a single token.
Does calling _input.LA(1) have some side effect that causes this? Can someone explain?
I'm using Antlr4.
A quick fix is to check that the previous character, LA(-1), is not a '.', and to add an optional leading DOT.
The resulting grammar is:
grammar Test;
words: (WORD|SPACE|DOT)+;
WORD : DOT? (
LD
|{_input.LA(-1)!='.'}? DOT
) + ;
DOT: '.';
SPACE: ' ';
fragment LD: ~[.\n\r ];
Have fun and enjoy ANTLR, it is a nice tool.
I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.
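The tie-breaking behaviour can be imitated with a small Python sketch. This is a toy model built on regular expressions, not ANTLR's actual lexer, and the tokenize function is a made-up name: it tries every rule at the current position, keeps the longest match, and on a tie keeps the rule listed first.

```python
import re


def tokenize(rules, text):
    """rules is an ordered list of (name, regex) pairs, in grammar order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule never replaces an equally long match.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


LETTERS = r"[a-zA-Z]+"
word_first = [("WORD", LETTERS), ("TEXT", LETTERS)]
text_first = [("TEXT", LETTERS), ("WORD", LETTERS)]
```

Under this model, tokenize(word_first, "aa") yields a single WORD token and tokenize(text_first, "aa") yields a single TEXT token, which mirrors what happens when the two rules are swapped in Bug.g4.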
I'm using ANTLR to generate a recognizer for a Java-like language, and the following rules are used to recognize generic types:
referenceType
: singleType ('.' singleType)*
;
singleType
: Identifier typeArguments?
;
typeArguments
: '<' typeArgument (',' typeArgument)* '>'
;
typeArgument
: referenceType
;
Now, for the following input statement, ANTLR produces the 'no viable alternative' error.
Iterator<Entry<K,V>> i = entrySet().iterator();
However, if I put a space between the two consecutive '>' characters, no error is produced. It seems that ANTLR cannot distinguish between the above rule and the rule used to recognize shift expressions, but I don't know how to modify the grammar to resolve this ambiguity. Any help would be appreciated.
You probably have a rule like the following in the lexer:
RightShift : '>>';
For ANTLR to recognize >> as either two > characters or one >> operator, depending on context, you'll need to instead place your shift operator in the parser:
rightShift : '>' '>';
If your language includes the >>> or >>= operators, those would need to be moved to the parser as well.
To validate that x > > y isn't allowed, you'll want to make a pass over the resulting parse tree (ANTLR 4) or AST (ANTLR 3) to verify that the two > characters parsed by the rightShift parser rule appear in sequence.
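The adjacency check itself is simple. Below is a rough Python sketch of the idea, assuming each token records the character offsets where it starts and stops (ANTLR tokens carry this information). The Tok class and the helper functions are hypothetical stand-ins, not part of the ANTLR API, and find_gt_tokens replaces a real parse-tree walk.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    start: int  # offset of the token's first character
    stop: int   # offset of the token's last character (inclusive)


def is_adjacent(first, second):
    """True when `second` begins immediately after `first` ends."""
    return second.start == first.stop + 1


def find_gt_tokens(source):
    """Hypothetical stand-in for a tree walk: every '>' in source as a Tok."""
    return [Tok(">", i, i) for i, ch in enumerate(source) if ch == ">"]


def is_valid_right_shift(source):
    """Accept the shift only if its two '>' tokens touch each other."""
    first, second = find_gt_tokens(source)[:2]
    return is_adjacent(first, second)
```

Here is_valid_right_shift("x >> y") is True, while is_valid_right_shift("x > > y") is False because a space separates the two > characters.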
280Z28 is probably right in his diagnosis that you have a rule like
RightShift : '>>';
An alternative solution is to explicitly include the possibility of a trailing >> in your parser. (I have seen this in other grammars, but only in LALR.)
typeArguments
: '<' typeArgument (',' typeArgument)* '>'
| '<' typeArgument ',' referenceType '<' typeArgument RightShift
;
In ANTLR 3, that will need to be left-factored.
Whether this is clearer than a second pass that validates your right-shift operator depends on how often you need to use it.
How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammar: the error message states that the parser expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
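The two lexer rules above can be imitated in a few lines of Python. This is a toy model using regular expressions, not ANTLR itself; next_token is a made-up name, and the ':' literal is modelled as an ordinary rule listed first, which is an assumption about where the implicit token sits.

```python
import re

# Rules in grammar order; the regexes approximate the lexer rules in the question.
RULES = [
    ("COLON", r":"),
    ("IDENT", r"[a-z]+"),
    ("REST_OF_LINE", r"[^\n]*\n"),
]


def next_token(text, pos=0):
    """Return (rule name, matched text) for the token starting at pos."""
    candidates = []
    for index, (name, pattern) in enumerate(RULES):
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos:
            # Longer matches sort higher; on a tie the earlier rule sorts higher.
            candidates.append((m.end() - pos, -index, name))
    if not candidates:
        return None
    length, _, name = max(candidates)
    return name, text[pos:pos + length]
```

For the input "title: hello world\n", next_token returns the whole line as one REST_OF_LINE token, so the parser never sees an IDENT. For "title" with no trailing newline, REST_OF_LINE cannot match and the result is an IDENT token.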
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
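One way to pre-process the header is sketched below in Python. The sketch rests on assumptions that are not in the question: header lines look like name: value with a lowercase name, and the header ends at the first line that does not have that shape. The names HEADER_LINE and split_header are made up for illustration.

```python
import re

# Assumed header shape: lowercase name, colon, optional blanks, then the value.
HEADER_LINE = re.compile(r"([a-z]+):[ \t]*(.*)")


def split_header(source):
    """Return (headers dict, remaining text) for the given file contents."""
    headers = {}
    lines = source.split("\n")
    consumed = 0
    for line in lines:
        m = HEADER_LINE.fullmatch(line)
        if m is None:
            break  # first non-header line: the rest belongs to the body
        headers[m.group(1)] = m.group(2)
        consumed += 1
    body = "\n".join(lines[consumed:])
    return headers, body
```

The returned body could then be handed to the ANTLR-generated lexer and parser, whose grammar no longer needs a REST_OF_LINE rule.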
ANTLR parsing is usually done in 2 steps.
1. define your tokens (lexer rules)
2. define your grammar (parser rules)
pseudo code (been a few years since I played with ANTLR) - lexer rules:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - parser rule:
header : WORD SEPARATOR (SPACE WORD)+ ;
Hope that helps....