Why isn't ANTLR 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If I use this grammar with the sample text abc and d, I get a tree with an unexpected structure, as shown below (using IntelliJ and the ANTLR 4 plugin):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.

The input "and" is being tokenized as an ID token. Both ID and AND match "and" with the same length, so ANTLR has to decide which token to choose, and it takes the rule that was defined first, which is ID. With '&&' there is no conflict, because ID cannot match that input.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;
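The tie-breaking behaviour can be illustrated with a small toy lexer. This is only a sketch in Python, not ANTLR itself: the tokenize function and the regex rule tables are made up for illustration, and LETTER and DIGIT are folded into the ID pattern as if they were fragments. It assumes the two rules described above: the longest match wins, and on a tie the rule listed first wins.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; ties go to the rule listed first."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: an equally long later match never displaces an earlier rule.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name != "WS":  # whitespace is dropped, like a hidden channel
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical regex stand-ins for the grammar's lexer rules.
ID = ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*")
AND = ("AND", r"and")
WS = ("WS", r"[ \r\n\t]+")

id_first = tokenize("abc and d", [ID, AND, WS])   # original rule order
and_first = tokenize("abc and d", [AND, ID, WS])  # AND moved before ID
print(id_first)
print(and_first)
print(tokenize("android", [AND, ID, WS]))
```

With ID listed first, "and" comes out as an ID token; with AND first it becomes an AND token. The last call shows that moving AND up does not break longer identifiers such as android, because the longer ID match still wins.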

Related

grammar is accepting too much

I am new to ANTLR and try to get along with some very first and simple examples (using antlr-4.8). This seems to me like a stupid newbie problem but I could not find an appropriate answer (actually I do not even know how to phrase the question other than this lousy title). Sorry for that!
My grammar looks like this.
grammar ExprTest;
expr : compareExpr
| NUMBER
;
compareExpr
: (GT | GE | LT | LE) NUMBER
;
NUMBER : [0-9]+;
GT : '>';
GE : '>=';
LT : '<';
LE : '<=';
It pretty much does the job: it recognizes 17, >15 and <=22, and it correctly complains with a token recognition error for the input #34.
What I do not understand is the input 34>. There is no complaining and it is matched as (expr 34).
Why isn't there a recognition error with the last greater-than character (which is obviously in the wrong position)?
The input 34> does not produce a token recognition error because the lexer can tokenize all of it: it yields a NUMBER token followed by a GT token. The parser has no problem with it either, because the rule:
expr : compareExpr
| NUMBER
;
happily accepts the NUMBER token and then stops, leaving the GT token alone.
If you want to force the parser to consume all tokens in your stream, you should anchor your parser with the built-in EOF token:
expr : (compareExpr | NUMBER) EOF;
after which the input 34> will produce an error.
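The effect of the EOF anchor can be sketched with a hand-written toy parser. This is a Python illustration only, not what ANTLR generates: parse_expr and its anchored flag are hypothetical names, and the input is assumed to be a list of token type names that ends with "EOF".

```python
def parse_expr(tokens, anchored):
    """Toy version of the expr rule; returns how many tokens were consumed.

    tokens is a list of token type names ending in "EOF".
    """
    pos = 0
    if tokens[pos] in ("GT", "GE", "LT", "LE"):  # compareExpr alternative
        pos += 1
        if tokens[pos] != "NUMBER":
            raise SyntaxError(f"expected NUMBER, got {tokens[pos]}")
        pos += 1
    elif tokens[pos] == "NUMBER":                # plain NUMBER alternative
        pos += 1
    else:
        raise SyntaxError(f"unexpected {tokens[pos]}")
    if anchored:
        # Same idea as appending EOF to the rule: nothing may be left over.
        if tokens[pos] != "EOF":
            raise SyntaxError(f"expected EOF, got {tokens[pos]}")
        pos += 1
    return pos

stream = ["NUMBER", "GT", "EOF"]  # token types for the input 34>
print(parse_expr(stream, anchored=False))  # stops after NUMBER, GT is left alone
try:
    parse_expr(stream, anchored=True)
except SyntaxError as err:
    print("error:", err)
```

Without the anchor the function returns after one token and never looks at GT; with the anchor the leftover GT is reported.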

Getting inconsistent results

I'm using ANTLR 4.6 and I was trying to do some clean up on my grammar and ended up breaking it. I found out that it's because I had made the following change that I assumed would have been equivalent. Can someone explain why they are different?
First try
DIGIT : [0-9] ;
LETTER : [a-zA-Z] ;
ident : ('_'|LETTER) ('_'|LETTER|DIGIT)* ;
Second try
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
ident : LETTER (LETTER | DIGIT)* ;
Both produce different results than this
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
IDENT : LETTER (LETTER | DIGIT)* ;
In both of your tries you changed ident from a lexer rule to a parser rule by writing it in lower case. Since that is the only difference between the second try and the last version, I assume that's the problem. Lexer rules define the tokens; parser rules define how those tokens are combined into the parse tree. As a parser rule, ident matches a sequence of separate one-character LETTER and DIGIT tokens instead of a single IDENT token, so a change like that can make a big difference in how the tree is constructed.
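A rough way to see the difference is to compare the token streams. The sketch below is plain Python, not ANTLR: the lex helper and the two regex tables are invented for illustration, it assumes a whitespace-skipping rule (which the question does not show), and it treats LETTER as a fragment in the lexer-rule version.

```python
import re

def lex(text, token_pattern):
    """Return (type, text) pairs for each match of token_pattern, dropping whitespace."""
    return [(m.lastgroup, m.group())
            for m in re.finditer(token_pattern, text)
            if m.lastgroup != "WS"]

# IDENT as a lexer rule: a whole identifier is one token.
LEXER_RULE = r"(?P<IDENT>[a-zA-Z_][a-zA-Z_0-9]*)|(?P<DIGIT>[0-9])|(?P<WS>\s+)"
# ident as a parser rule: the lexer only knows single-character LETTER and DIGIT tokens.
PARSER_RULE = r"(?P<LETTER>[a-zA-Z_])|(?P<DIGIT>[0-9])|(?P<WS>\s+)"

print(lex("ab1", LEXER_RULE))    # one IDENT token
print(lex("ab 1", LEXER_RULE))   # IDENT followed by DIGIT
print(lex("ab1", PARSER_RULE))   # three one-character tokens
print(lex("ab 1", PARSER_RULE))  # the same three tokens: the space is invisible to the parser
```

With the lexer rule, ab1 and ab 1 produce different token streams. With only LETTER and DIGIT tokens they produce the same stream, so a parser rule built on top of them cannot tell the two inputs apart.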

ANTLR4 disambiguation of terminal tokens

This is my grammar in ANTLR4:
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
When I type in something like:
hello buddy
I got the following error message:
line 1 missing WORD at 'hello'
But, if I change the grammar in
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [1-9]+ ;
WS : [ \t\r\n]+ -> skip ;
where now WORD is a number, everything is ok.
I strongly suspect that since the first grammar has two terminals with the same regex, the parser doesn't know which of them a given word corresponds to.
So am I wrong in thinking that? If not, how would you solve this issue while keeping more than one terminal with the same regex?
You cannot have two terminals that match the same pattern.
If your grammar actually needs to match [a-z]+ twice, then use a production like
r : WORD WORD ;
and the discrimination will be done at the parser / tree traversal level.
If either WORD or ID can be restricted to a fixed list, you could declare all the possible words as terminals then use them to define e.g. what a WORD can be.
"where now WORD is a number, everything is ok."
Not really:
$ alias
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Hello r -tokens data.txt
[#0,0:4='hello',<ID>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
line 1:0 missing WORD at 'hello'
When the lexer can match the same input with two rules, there is an ambiguity, and it chooses the rule defined first. With the input hello buddy, the lexer produces two ID tokens in both cases:
with the first grammar, because the match is ambiguous and ID comes first;
with the second grammar, because the input can only be matched as ID WS ID.
You can disambiguate with a predicate in the lexer rule like so :
grammar Question;
/* Ambiguous input */
file
: HELLO ID
;
HELLO
: [a-z]+ {getText().equals("hello")}? ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Execution :
$ grun Question file -tokens data.txt
[#0,0:4='hello',<HELLO>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
More on semantic predicates in The Definitive ANTLR Reference.
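The predicate idea can be modelled in a few lines of Python. This is a simplified sketch rather than ANTLR's actual algorithm: classify and RULES are hypothetical names, the input is split on whitespace up front, and only the case where both rules match the whole word is covered. A rule applies only if its pattern matches and its predicate accepts the text, and the first applicable rule wins.

```python
import re

# (name, pattern, predicate) in definition order; the predicates stand in for {...}? in the grammar.
RULES = [
    ("HELLO", r"[a-z]+", lambda text: text == "hello"),
    ("ID", r"[a-z]+", lambda text: True),
]

def classify(word):
    """Return the first rule whose pattern matches the whole word and whose predicate holds."""
    for name, pattern, predicate in RULES:
        if re.fullmatch(pattern, word) and predicate(word):
            return name
    raise ValueError(f"token recognition error at: {word!r}")

tokens = [(classify(word), word) for word in "hello buddy".split()]
print(tokens)
```

Because the predicate rejects every word except hello, the HELLO rule no longer shadows ID for other words, even though it is defined first.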

ANTLR4 Negative lookahead workaround?

I'm using ANTLR 4 and I'm trying to write a parser for Matlab. One of the main issues is that strings and the transpose operator both use single quotes. The solution I was thinking of is to define the STRING lexer rule roughly as follows:
(if the previous token is not ')', '}', ']' or [a-zA-Z0-9]) then match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note that I do not want to consume the previous token if the condition is true).
Does anyone know a workaround for this problem, given that ANTLR does not support negative lookaheads?
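ANTLR 4 has no dedicated negative-lookahead syntax, but you can inspect the surrounding characters from a semantic predicate. For this case, _input.LA(-1) gives you the previous character (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).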
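The check itself is small. Here is a sketch of the decision in Python, using the condition from the question; quote_starts_string and OPERAND_END are hypothetical names, and in a grammar the same test would sit in a predicate on the STRING rule and read the previous character via _input.LA(-1). It deliberately ignores other Matlab cases such as .' or doubled quotes.

```python
import string

# Characters after which a single quote is taken to be a transpose (from the question).
OPERAND_END = set(")}]" + string.ascii_letters + string.digits)

def quote_starts_string(source, index):
    """True if the single quote at source[index] should open a string literal."""
    if source[index] != "'":
        raise ValueError("index does not point at a single quote")
    if index == 0:
        return True  # nothing before the quote, so it cannot be a transpose
    return source[index - 1] not in OPERAND_END

print(quote_starts_string("a'", 1))         # quote after an identifier character
print(quote_starts_string("[1 2]'", 5))     # quote after a closing bracket
print(quote_starts_string("x = 'abc'", 4))  # quote after a space
```

The first two calls classify the quote as a transpose, the third as the start of a string. Note that the check only looks at the previous character and does not consume it.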
You can also use lexer modes to deal with this kind of thing, but then your lexer has to be defined in its own file as a lexer grammar. The idea is to switch from a state that can match some tokens to another state that can match different ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ---------------------
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;
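To make the mode mechanism concrete, here is a toy mode-stack tokenizer in Python. It is only a sketch of the idea: the MODES table is a hypothetical, heavily reduced rule set loosely modelled on the excerpt above (TEXT, NAME and S are assumptions), and it takes the first matching rule instead of applying ANTLR's longest-match logic.

```python
import re

# mode name -> list of (token name, pattern, action); an action is None, ("push", mode) or ("pop",)
MODES = {
    "DEFAULT": [
        ("OPEN", r"<", ("push", "INSIDE")),
        ("TEXT", r"[^<]+", None),
    ],
    "INSIDE": [
        ("CLOSE", r">", ("pop",)),
        ("SLASH_CLOSE", r"/>", ("pop",)),
        ("NAME", r"[A-Za-z]+", None),
        ("S", r"\s+", None),
    ],
}

def tokenize(text):
    """Tokenize text, consulting only the rules of the mode on top of the stack."""
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((name, m.group()))
        pos = m.end()
        if action == ("pop",):
            stack.pop()              # like -> popMode
        elif action is not None:
            stack.append(action[1])  # like -> pushMode(...)
    return tokens

result = tokenize("a<b>c > d")
print(result)
```

The same '>' character is a CLOSE token inside the tag but part of a TEXT token outside it, which is the kind of context-dependent lexing the quote problem needs.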

ANTLR - identifier with whitespace

I want identifiers that can contain whitespace.
grammar WhitespaceInSymbols;
premise : ( options {greedy=false;} : 'IF' ) id=ID{
System.out.println($id.text);
};
ID : ('a'..'z'|'A'..'Z')+ (' '('a'..'z'|'A'..'Z')+)*
;
WS : ' '+ {skip();}
;
When I test this with "IF statement analyzed" I get a MissingTokenException and the output "IF statement analyzed".
I thought that by using greedy=false I could tell ANTLR to exit after 'IF' and take it as a token. But instead the IF is part of the ID.
Is there a way to achieve my goal? I already tried some variations of the greedy=false option, but without success.
"I thought that by using greedy=false I could tell ANTLR to exit after 'IF' and take it as a token."
No, the parser has nothing to say about the creation of tokens: the input is first tokenized and then the parser rules are applied on these tokens. So setting greedy=false has no effect.
You can do this (creating ID tokens with white spaces), but it will be a horrible solution with many predicates and a few custom methods in the lexer doing manual look-aheads: you really, really don't want this! A much cleaner solution would be to introduce an id rule in your parser and let it match one or more ID tokens.
A demo:
grammar WhitespaceInSymbols;
premise
: IF id THEN EOF
;
id
: ID+
;
IF
: 'IF'
;
THEN
: 'THEN'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WS
: ' '+ {skip();}
;
would parse the input IF statement analyzed THEN into the following tree: