What is the wrong with the simple ANTLR grammar?

I am writing an ANTLR grammar to parse log files and have run into a problem.
I have simplified my grammar to reproduce the problem as follows:
stmt1:
'[ ' elapse ': ' stmt2
;
stmt2:
'[xxx'
;
stmt3:
': [yyy'
;
elapse :
FLOAT;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')*
;
When I used the following string to test the grammar:
[ 98.9: [xxx
I got the error:
E:\work\antlr\output\__Test___input.txt line 1:9 mismatched character 'x' expecting 'y'
E:\work\antlr\output\__Test___input.txt line 1:10 no viable alternative at character 'x'
E:\work\antlr\output\__Test___input.txt line 1:11 no viable alternative at character 'x'
E:\work\antlr\output\__Test___input.txt line 1:12 mismatched input '<EOF>' expecting ': '
But if I remove the rule 'stmt3', the same string is accepted.
I am not sure what happened...
Thanks for any advice!
Leon
Thanks to Bart for the help. I have tried to correct the grammar.
I think the bottom line is that I have to disambiguate all tokens.
I also added a WS token to simplify the rules.
stmt1:
'[' elapse ':' stmt2
;
stmt2:
'[' 'xxx'
;
stmt3:
':' '[' 'yyy'
;
elapse :
FLOAT;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')*
;
WS : (' ' |'\t' |'\n' |'\r' )+ {skip();} ;
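To see why the single-character literals help, here is a minimal Python sketch (not ANTLR itself; the `TOKEN_SPEC` table, the token names and the `tokenize` function are made up for this illustration). It assumes the same small tokens as the corrected grammar and drops whitespace the way the WS rule does:

```python
import re

# Hypothetical token table mirroring the corrected grammar: every literal is
# its own small token, and whitespace is matched separately and dropped.
TOKEN_SPEC = [
    ("FLOAT", r"[0-9]+\.[0-9]*"),
    ("LBRACK", r"\["),
    ("COLON", r":"),
    ("XXX", r"xxx"),
    ("YYY", r"yyy"),
    ("WS", r"[ \t\r\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (type, text) pairs, dropping WS tokens like {skip();} does."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("[ 98.9: [xxx"))
```

Because no token starts with the text of another one, ': [xxx' and ': [yyy' both split cleanly into ':' '[' and a keyword, and the choice between stmt2 and stmt3 is left to the parser.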

ANTLR has a strict separation between lexer rules (tokens) and parser rules. Although you defined some literals inside parser rules, they are still tokens. This means the following grammar is equivalent (in practice) to your example grammar:
stmt1 : T1 elapse T2 stmt2 ;
stmt2 : T3 ;
stmt3 : T4 ;
elapse : FLOAT;
T1 : '[ ' ;
T2 : ': ' ;
T3 : '[xxx' ;
T4 : ': [yyy' ;
FLOAT : ('0'..'9')+ '.' ('0'..'9')* ;
Now, when the lexer tries to construct tokens from the input "[ 98.9: [xxx", it successfully creates the tokens T1 and FLOAT, but when it sees ": [", it tries to construct a T4 token. When the next char in the stream turns out to be an "x" instead of a "y", the lexer tries to construct another token that starts with ": [". Since there is no such token, the lexer emits the error:
[...] mismatched character 'x' expecting 'y'
And no, the lexer will not backtrack to "give up" the character "[" from ": [" to match the token T2, nor will it look ahead in the char-stream to see if a T4 token can really be constructed. ANTLR's LL(*) is only applicable to parser rules, not lexer rules!
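The "commit, then fail" behaviour described above can be mimicked with a small Python toy. This is only a rough mental model and an assumption on my part: the lexer ANTLR generates is DFA-based and does not work like this internally, and names such as `next_token`, `WITH_STMT3` and `LexError` are invented for the sketch. The toy picks the literal that matches the input furthest and never falls back to a shorter one:

```python
class LexError(Exception):
    """Raised when the toy lexer has committed to a literal it cannot finish."""


# Hypothetical token tables: the literals from the question, with and without stmt3.
WITH_STMT3 = {"T1": "[ ", "T2": ": ", "T3": "[xxx", "T4": ": [yyy"}
WITHOUT_STMT3 = {"T1": "[ ", "T2": ": ", "T3": "[xxx"}


def shared_prefix(literal, text, pos):
    """Count how many leading characters of `literal` match `text` at `pos`."""
    count = 0
    while (count < len(literal) and pos + count < len(text)
           and literal[count] == text[pos + count]):
        count += 1
    return count


def next_token(text, pos, literals):
    """Commit to the literal that matches furthest, with no fallback afterwards."""
    best_name, best_len = None, 0
    for name, literal in literals.items():
        matched = shared_prefix(literal, text, pos)
        if matched > best_len:
            best_name, best_len = name, matched
    if best_name is None:
        raise LexError(f"line 1:{pos} no viable alternative at character {text[pos]!r}")
    literal = literals[best_name]
    if best_len < len(literal):
        # Committed, but the input diverges: report it instead of retrying T2.
        at = pos + best_len
        seen = text[at] if at < len(text) else "<EOF>"
        raise LexError(
            f"line 1:{at} mismatched character {seen!r} expecting {literal[best_len]!r}"
        )
    return best_name, literal


text = "[ 98.9: [xxx"
print(next_token(text, 6, WITHOUT_STMT3))  # stmt3 removed: ': ' is matched as T2
try:
    next_token(text, 6, WITH_STMT3)
except LexError as error:
    print(error)
```

With T4 in the table the toy commits to ': [yyy' at offset 6 and stops at the first 'x'; without it, the same offset yields T2, which is why removing stmt3 makes the input acceptable.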

Related

how to resolve an ambiguity

I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.
You wrote: "An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)".
I'm not sure about that. The Possible Issues section of the ReplaceAll page(*) says that "Periods bind to numbers more strongly than to slash", so /.123 will always be interpreted as a division by the number .123. It goes on to say that, if you want it to be understood as a replacement, a space must be inserted in the input between the /. operator and the number.
So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement?).
The ANTLR4 lexer and parser are greedy: the lexer (parser) tries to read as many input characters (tokens) as it can while matching a rule. With your rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is written OP : '/' | '/.' ;). This means there is no ambiguity, and the input has no chance of being interpreted as OP=/ and NUMBER=.123.
Given my limited experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
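The effect of greedy matching can be sketched in a few lines of Python. This is an illustration only, not ANTLR code: `longest_match_tokens` and the two rule tables are hypothetical, and the sketch assumes the usual rule that the longest match wins and ties go to the rule listed first.

```python
import re


def longest_match_tokens(rules, text):
    """Tokenize by always taking the longest match; ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (length, rule name, lexeme)
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > pos:
                length = match.end() - pos
                if best is None or length > best[0]:
                    best = (length, name, match.group())
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        length, name, lexeme = best
        if name != "WS":
            tokens.append((name, lexeme))
        pos += length
    return tokens


# Alternatives inside a pattern are written longest-first, because Python's
# `re` takes the first alternative that matches rather than the longest one.
ONE_OP = [("ID", r"[a-z]+"), ("NUMBER", r"\.?[0-9]+"),
          ("OP", r"/\.|/"), ("WS", r"[ \t\r\n]+")]
# Assumed variant with the operator split into '/' and '.' tokens.
SPLIT_OP = [("SLASH", r"/"), ("DOT", r"\."), ("ID", r"[a-z]+"),
            ("NUMBER", r"\.?[0-9]+"), ("WS", r"[ \t\r\n]+")]

print(longest_match_tokens(ONE_OP, "x/.123"))
print(longest_match_tokens(SPLIT_OP, "x/.123"))
print(longest_match_tokens(SPLIT_OP, "x/.x"))
```

With the single OP rule the two-character '/.' always beats '/', so .123 can never become a NUMBER. Once the operator is split, '.123' (four characters) beats '.' (one character), while in x/.x the '.' stands alone and the parser can recognise a replacement.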
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
@init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:0='x',<ID>,1:0]
[@1,1:1='/',<'/'>,1:1]
[@2,2:5='.123',<NUMBER>,1:2]
[@3,7:7='x',<ID>,2:0]
[@4,8:8='/',<'/'>,2:1]
[@5,9:9='.',<'.'>,2:2]
[@6,10:10='x',<ID>,2:3]
[@7,12:13='->',<'->'>,2:5]
[@8,15:15='1',<NUMBER>,2:8]
[@9,17:17='{',<'{'>,3:0]
...
[@29,47:47='}',<'}'>,4:5]
[@30,48:48='/',<'/'>,4:6]
[@31,49:50='.0',<NUMBER>,4:7]
...
[@40,67:67='}',<'}'>,5:5]
[@41,68:68='/',<'/'>,5:6]
[@42,69:69='.',<'.'>,5:7]
[@43,71:71='0',<NUMBER>,5:9]
...
[@48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous up to the slash. After it, the parser has two choices: / NUMBER in the division rule or / . replacement in the replace_all rule. I think that NUMBER absorbs the input, so there is no more ambiguity.
(*) The link was in a comment yesterday that has since disappeared: Wolfram Language & System, ReplaceAll.

Can I use antlr to parse partial data?

I am trying to use ANTLR to parse a log file. Because I am only interested in part of the log, I want to write only a partial parser that processes the important parts.
For example, I want to parse the segment:
[ 123 begin ]
So I wrote the grammar:
log :
'[' INT 'begin' ']'
;
INT : '0'..'9'+
;
NEWLINE
: '\r'? '\n'
;
WS
: (' '|'\t')+ {skip();}
;
But the segment may appear in the middle of a line, e.g.:
111 [ 123 begin ] 222
According to the discussion in
What is the wrong with the simple ANTLR grammar?
I know why my grammar can't process the statement above.
Is there any way to make ANTLR ignore errors and continue processing the remaining text?
Thanks for any advice!
Leon
Since '[' might also be skipped in certain cases outside of [ 123 begin ], there's no way to handle this in the lexer. You'll have to create a parser rule that matches token(s) to be skipped (see the noise rule).
You'll also need to create a fall-through rule that matches any character if none of the other lexer rules matches (see the ANY rule).
A quick demo:
grammar T;
parse
: ( log {System.out.println("log=" + $log.text);}
| noise
)*
EOF
;
log : OBRACK INT BEGIN CBRACK
;
noise
: ~OBRACK // any token except '['
| OBRACK ~INT // a '[' followed by any token except an INT
| OBRACK INT ~BEGIN // a '[', an INT and any token except a BEGIN
| OBRACK INT BEGIN ~CBRACK // a '[', an INT, a BEGIN and any token except ']'
;
BEGIN : 'begin';
OBRACK : '[';
CBRACK : ']';
INT : '0'..'9'+;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {skip();};
ANY : .;
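The idea behind the noise and ANY rules can also be sketched outside ANTLR. The Python below is an illustration under my own assumptions (the names `TOKEN_SPEC`, `tokenize`, `find_logs` and `LOG_SHAPE` are not part of the grammar): it tokenizes with a catch-all ANY token, then walks the token stream, keeping every OBRACK INT BEGIN CBRACK sequence and discarding the rest the same way the four noise alternatives do.

```python
import re

# Hypothetical token table mirroring the lexer rules above; ANY is the fall-through.
TOKEN_SPEC = [
    ("BEGIN", r"begin"),
    ("OBRACK", r"\["),
    ("CBRACK", r"\]"),
    ("INT", r"[0-9]+"),
    ("NEWLINE", r"\r?\n"),
    ("WS", r"[ \t]+"),
    ("ANY", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
LOG_SHAPE = ["OBRACK", "INT", "BEGIN", "CBRACK"]


def tokenize(text):
    """Return (type, text) pairs; WS is skipped, anything unknown becomes ANY."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "WS"]


def find_logs(tokens):
    """Collect every '[' INT 'begin' ']' sequence and treat the rest as noise."""
    logs, i = [], 0
    while i < len(tokens):
        matched = 0
        while (matched < len(LOG_SHAPE) and i + matched < len(tokens)
               and tokens[i + matched][0] == LOG_SHAPE[matched]):
            matched += 1
        if matched == len(LOG_SHAPE):
            logs.append([text for _, text in tokens[i:i + matched]])
            i += matched
        else:
            # Noise: the partial match plus the token that broke it.
            i += matched + 1
    return logs


print(find_logs(tokenize("111 [ 123 begin ] 222")))
```

Here 111 and 222 are dropped as noise and the bracketed segment in the middle is reported, without the lexer having to decide up front which '[' matters.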

ANTLR lexer rule consumes characters even if not matched?

I've got a strange side effect of an antlr lexer rule and I've created an (almost) minimal working example to demonstrate it.
In this example I want to match the String [0..1] for example. But when I debug the grammar the token stream that reaches the parser only contains [..1]. The first integer, no matter how many digits it contains is always consumed and I've got no clue as to how that happens. If I remove the FLOAT rule everything is fine so I guess the mistake lies somewhere in that rule. But since it shouldn't match anything in [0..1] at all I'm quite puzzled.
I'd be happy for any pointers where I might have gone wrong. This is my example:
grammar min;
options{
language = Java;
output = AST;
ASTLabelType=CommonTree;
backtrack = true;
}
tokens {
DECLARATION;
}
declaration : LBRACEVAR a=INTEGER DDOTS b=INTEGER RBRACEVAR -> ^(DECLARATION $a $b);
EXP : 'e' | 'E';
LBRACEVAR: '[';
RBRACEVAR: ']';
DOT: '.';
DDOTS: '..';
FLOAT
: INTEGER DOT POS_INTEGER
| INTEGER DOT POS_INTEGER EXP INTEGER
| INTEGER EXP INTEGER
;
INTEGER : POS_INTEGER | NEG_INTEGER;
fragment NEG_INTEGER : ('-') POS_INTEGER;
fragment POS_INTEGER : NUMBER+;
fragment NUMBER: ('0'..'9');
The '0' is discarded by the lexer and the following errors are produced:
line 1:3 no viable alternative at character '.'
line 1:2 extraneous input '..' expecting INTEGER
This is because when the lexer encounters '0.', it tries to create a FLOAT token, but can't. And since there is no other rule to fall back on to match '0.', it produces the errors, discards '0' and creates a DOT token.
This is simply how ANTLR's lexer works: it will not backtrack to match an INTEGER followed by a DDOTS (note that backtrack=true only applies to parser rules!).
Inside the FLOAT rule, you must make sure that when a double '.' is ahead, you produce an INTEGER token instead. You can do that by adding a syntactic predicate (the ('..')=> part) and by producing FLOAT tokens only when a single '.' is followed by a digit (the ('.' DIGIT)=> part). See the following demo:
declaration
: LBRACEVAR INTEGER DDOTS INTEGER RBRACEVAR
;
LBRACEVAR : '[';
RBRACEVAR : ']';
DOT : '.';
DDOTS : '..';
INTEGER
: DIGIT+
;
FLOAT
: DIGIT+ ( ('.' DIGIT)=> '.' DIGIT+ EXP?
| ('..')=> {$type=INTEGER;} // change the token here
| EXP
)
;
fragment EXP : ('e' | 'E') DIGIT+;
fragment DIGIT : ('0'..'9');
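What the two predicates do can be shown with a hand-written lexer in Python. This is a sketch under my own assumptions, not what ANTLR generates; the helper names `is_digit`, `skip_digits` and `lex` are made up. The point is that the '.' is only consumed after peeking at the character behind it:

```python
def is_digit(text, index):
    """True when `index` is inside `text` and points at an ASCII digit."""
    return index < len(text) and text[index] in "0123456789"


def skip_digits(text, index):
    """Return the index just past the run of digits starting at `index`."""
    while is_digit(text, index):
        index += 1
    return index


SINGLE = {"[": "LBRACEVAR", "]": "RBRACEVAR"}


def lex(text):
    """Tokenize ranges such as [0..1] without swallowing the first integer."""
    tokens, pos = [], 0
    while pos < len(text):
        if is_digit(text, pos):
            end, kind = skip_digits(text, pos), "INTEGER"
            # Like ('.' DIGIT)=> : consume '.' only when a digit follows it.
            if text[end:end + 1] == "." and is_digit(text, end + 1):
                end, kind = skip_digits(text, end + 1), "FLOAT"
            # Exponent, again only after checking that digits follow.
            if text[end:end + 1] in ("e", "E") and is_digit(text, end + 1):
                end, kind = skip_digits(text, end + 1), "FLOAT"
            tokens.append((kind, text[pos:end]))
            pos = end
        elif text.startswith("..", pos):
            tokens.append(("DDOTS", ".."))
            pos += 2
        elif text[pos] == ".":
            tokens.append(("DOT", "."))
            pos += 1
        elif text[pos] in SINGLE:
            tokens.append((SINGLE[text[pos]], text[pos]))
            pos += 1
        else:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
    return tokens


print(lex("[0..1]"))
print(lex("[0.5]"))
```

For [0..1] the character after the first '.' is another '.', so the digits stay an INTEGER and the '..' is left for DDOTS; for [0.5] a digit follows and a FLOAT is produced.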

antlr lexer rule matching a prefix of another rule

I am not sure that the issue is actually the prefixes, but here goes.
I have these two rules in my grammar (among many others)
DOT_T : '.' ;
AND_T : '.AND.' | '.and.' ;
and I need to parse strings like this:
a.eq.b.and.c.ne.d
c.append(b)
this should get lexed as:
ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T
the error I get for the second line is:
line 1:3 mismatched character "p"; expecting "n"
It doesn't lex the . as a DOT_T but instead tries to match .and. because it sees the a after the '.'.
Any idea on what I need to do to make this work?
UPDATE
I added the following rule and thought I'd use the same trick
NUMBER_T
: DIGIT+
( (DECIMAL)=> DECIMAL
| (KIND)=> KIND
)?
;
fragment DECIMAL
: '.' DIGIT+ ;
fragment KIND
: '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;
but when I try parsing this:
lda.eq.3.and.dim.eq.3
it gives me the following error:
line 1:9 no viable alternative at character "a"
while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...
Yes, that is because of the prefixed '.'-s.
Whenever the lexer stumbles upon ".a", it tries to create an AND_T token. If the characters "nd" cannot then be found, the lexer tries to construct another token that starts with ".a", which doesn't exist (so ANTLR produces an error). In other words, the lexer will not give back the character "a" and fall back to creating a DOT_T token (and then an ID token)! This is how ANTLR works.
What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
A demo:
grammar T;
parse
: (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
DOT_T
: '.' ( (AND_T)=> AND_T {$type=AND_T;}
| (EQ_T)=> EQ_T {$type=EQ_T; }
| (NE_T)=> NE_T {$type=NE_T; }
)?
;
ID
: ('a'..'z' | 'A'..'Z')+
;
LPAREN_T
: '('
;
RPAREN_T
: ')'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL)?
;
fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T : ('AND' | 'and') '.' ;
fragment EQ_T : ('EQ' | 'eq' ) '.' ;
fragment NE_T : ('NE' | 'ne' ) '.' ;
fragment DIGIT : '0'..'9';
And if you feed the generated parser the input:
a.eq.b.and.c.ne.d
c.append(b)
the following output will be printed:
ID 'a'
EQ_T '.eq.'
ID 'b'
AND_T '.and.'
ID 'c'
NE_T '.ne.'
ID 'd'
ID 'c'
DOT_T '.'
ID 'append'
LPAREN_T '('
ID 'b'
RPAREN_T ')'
And for the input:
lda.eq.3.and.dim.eq.3
the following is printed:
ID 'lda'
EQ_T '.eq.'
NUMBER_T '3'
AND_T '.and.'
ID 'dim'
EQ_T '.eq.'
NUMBER_T '3'
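The look-ahead that the predicates add can be imitated in plain Python. The sketch below is only an illustration under my own assumptions (the regular expressions and the `lex` function are not part of the grammar): on a '.', it first checks whether a complete .and. / .eq. / .ne. follows and only otherwise emits DOT_T, and a number takes a '.' only when digits follow it.

```python
import re

# Assumed equivalents of the fragments: the whole operator must be present.
OPERATOR = re.compile(r"\.(AND|and|EQ|eq|NE|ne)\.")
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")  # DIGIT+ ((DECIMAL)=> DECIMAL)?
ID = re.compile(r"[a-zA-Z]+")
PUNCT = {"(": "LPAREN_T", ")": "RPAREN_T"}


def lex(text):
    """Tokenize dotted operators, checking the full operator before committing."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \t\r\n":
            pos += 1
            continue
        if ch == ".":
            operator = OPERATOR.match(text, pos)
            if operator:
                tokens.append((operator.group(1).upper() + "_T", operator.group()))
                pos = operator.end()
            else:
                tokens.append(("DOT_T", "."))
                pos += 1
            continue
        if ch in PUNCT:
            tokens.append((PUNCT[ch], ch))
            pos += 1
            continue
        number = NUMBER.match(text, pos)
        if number:
            tokens.append(("NUMBER_T", number.group()))
            pos = number.end()
            continue
        identifier = ID.match(text, pos)
        if identifier is None:
            raise ValueError(f"no viable alternative at character {ch!r}")
        tokens.append(("ID", identifier.group()))
        pos = identifier.end()
    return tokens


print(lex("c.append(b)"))
print(lex("lda.eq.3.and.dim.eq.3"))
```

In c.append(b) the text after the '.' is not a complete operator, so a plain DOT_T comes out; in lda.eq.3.and.dim.eq.3 the 3 stays a bare NUMBER_T because no digit follows its '.'.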
EDIT
The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
;
fragment DECIMAL : '.' DIGIT+;
fragment KIND : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment
Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
;

How to resolve this parsing ambiguity in Antlr3

Hopefully this is just the right amount of information to help me solve this problem.
Given the following ANTLR3 syntax
grammar mygrammar;
program : statement* | function*;
function : ID '(' args ')' '->' statement+ (','statement+) '.' ;
args : arg (',' arg)*;
arg : ID ('->' expression)?;
statement : assignment
| number
| string
;
assignment : ID '->' expression;
string : UNICODE_STRING;
number : HEX_NUMBER | INTEGER ( '.' INTEGER )?;
// ================================================================
HEX_NUMBER : '0x' HEX_DIGIT+;
INTEGER : DIGIT+;
fragment
DIGIT : ('0'..'9');
Here is the line that is causing problems in the parser.
my_function(x, y, z -> 42) -> 10001.
ANTLRWorks highlights the last . after the 10001 in red as being a problem with the following error.
How can I make this stop throwing org.antlr.runtime.EarlyExitException?
I am sure this is because of some ambiguity between my number parser rule and trying to use the . as a EOL delimiter.
There is another ambiguity that also needs fixing. Change:
program : statement* | function*;
into:
program : (statement | function)*;
(although the 2 are not equivalent, I'm guessing you want the latter)
And in your function rule, you have now defined there to be at least two statements:
function : ID '(' args ')' '->' statement (','statement)+ '.' ;
while I'm guessing you really want at least one:
function : ID '(' args ')' '->' statement (','statement)* '.' ;
Now, your real problem: since you're constructing floats in a parser rule, when the parser reaches the end of your input, 10001., it tries to construct a number out of it, while you want it to match an INTEGER and then a ., as you yourself already said in your OP.
To fix this, you need to give the parser a bit of extra look-ahead to "see" beyond this ambiguity. Do that by adding the predicate (INTEGER '.' INTEGER)=> before actually matching said input:
number
: HEX_NUMBER
| (INTEGER '.' INTEGER)=> INTEGER '.' INTEGER
| INTEGER
;
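In hand-written form, the predicate amounts to peeking two tokens ahead before deciding. The Python sketch below is an assumption-laden illustration rather than ANTLR's generated code: the `parse_number` function and the DOT token name are hypothetical, and tokens are plain (type, text) pairs.

```python
def parse_number(tokens, pos):
    """Parse one number starting at `pos`; return (text, position after it)."""
    kind, text = tokens[pos]
    if kind == "HEX_NUMBER":
        return text, pos + 1
    if kind != "INTEGER":
        raise SyntaxError(f"expected a number but found {text!r}")
    # Like (INTEGER '.' INTEGER)=> : take the '.' only if an INTEGER follows it.
    if (pos + 2 < len(tokens)
            and tokens[pos + 1] == ("DOT", ".")
            and tokens[pos + 2][0] == "INTEGER"):
        return text + "." + tokens[pos + 2][1], pos + 3
    # Otherwise it is a plain INTEGER and the '.' is left for the caller.
    return text, pos + 1


# The tail of the problem line: 10001 followed by the terminating '.'.
print(parse_number([("INTEGER", "10001"), ("DOT", ".")], 0))
# A real decimal followed by the terminating '.'.
print(parse_number([("INTEGER", "3"), ("DOT", "."), ("INTEGER", "14"), ("DOT", ".")], 0))
```

In the first call the '.' is not consumed, so the function rule can still match it as the terminator; in the second call only the first '.' becomes part of the number.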
Now your input will generate the following parse tree:
Perhaps unrelated, but I'm curious none-the-less:
function : ID '(' args ')' '->' statement+ (','statement+) '.' ;
Should this instead be:
function : ID '(' args ')' '->' statement (',' statement)* '.' ;
I think the first one would require a single comma in a function definition but the second one would require a comma as a statement separator.
Also, does the rule for args allow z -> 42 correctly?