How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?

Whenever two (or more) lexer rules match the same number of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that a single token type that matches every reserved word is rarely useful, because the parser can then no longer tell the keywords apart by token type. Usually each keyword gets its own rule:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false';
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.
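To make the selection rule concrete, here is a minimal Python sketch that imitates it: at each position every rule is tried, the longest match wins, and on a tie the rule defined first is kept. It is only a simulation of the behaviour described above, not ANTLR's actual implementation; the `tokenize` function, the `RULES` table and the `WS` rule are hypothetical names introduced for this illustration.

```python
import re

# Rules in definition order; an earlier rule wins a tie, as described above.
# WS is a hypothetical whitespace rule added so the example can skip blanks.
RULES = [
    ("NULL", r"null"),
    ("TRUE", r"true"),
    ("FALSE", r"false"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("WS", r"[ \t\r\n]+"),
]

def tokenize(text, rules=RULES):
    """Return (rule_name, lexeme) pairs: longest match first, then rule order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best, so on
            # a tie the rule that was defined earlier is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("false falser"))
```

Passing the same rules with ID moved to the front, for example `tokenize("false", [RULES[3]] + RULES[:3])`, yields an ID token instead, which is the situation from the question.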

Related

Dealing with lexer ambiguity: different lexer rules in context

I'm building an ANTLR4 grammar to parse a custom language which looks like:
start rule_set {
/foo/bar {
//some_rules
}
}
Here /foo/bar is a URL-like path, so it may contain escaped characters (e.g. %20) and other symbols. The rule_set part, however, is a normal identifier and must not contain %.
Here is my current grammar:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '#'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
The problem is that foo and bar are lexed as IDENTIFIER, because that is the longest match. I want pathSegment to match them in this context. How can I resolve this ambiguity?
[#0,0:4='start',<'start'>,1:0]
[#1,6:13='rule_set',<IDENTIFIER>,1:6]
[#2,15:15='{',<'{'>,1:15]
[#3,21:21='/',<'/'>,2:4]
[#4,22:24='foo',<IDENTIFIER>,2:5]
[#5,25:25='/',<'/'>,2:8]
[#6,26:28='bar',<IDENTIFIER>,2:9]
[#7,30:30='{',<'{'>,2:13]
[#8,40:51='//some_rules',<'//some_rules'>,3:8]
[#9,57:57='}',<'}'>,4:4]
[#10,59:59='}',<'}'>,5:0]
[#11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR

ANTLR How to differentiate input arguments of the same type

If I have my input message:
name IS (Jon, Ted) IS NOT (Peter);
I want this AST:
       name
        |
   |---------|
   IS      IS NOT
   |         |
   |       Peter
 |----|
Jon  Ted
But I'm receiving:
               name
                 |
       |-------------------|
       IS               IS NOT
       |                   |
       |                   |
  |----|-----|        |----|-----|
 Jon  Ted  Peter     Jon  Ted  Peter
My Grammar file has:
...
expression
| NAME 'IS' OParen Identifier (Comma Identifier)* CParen 'IS NOT' OParen
Identifier (Comma Identifier)* CParen
-> ^(NAME ^('IS' ^(Identifier)*) ^('IS NOT' ^(Identifier)*))
;
...
NAME
: 'name'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)*
;
How can I differentiate what "belongs" to the 'IS' from what belongs to the 'IS NOT'?
Something like this should do it:
expression
: NAME IS left=id_list IS NOT right=id_list -> ^(NAME ^(IS $left) ^(NOT $right))
;
id_list
: '(' ID (',' ID)* ')' -> ID+
;
IS : 'IS';
NOT : 'NOT'; // not a single token that is 'IS NOT'
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)+
// Not `(...)*`: a token should always match at least one char!
;
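The reason the labelled version works is that each id_list is collected separately and attached to its own operator node. A rough hand-written Python equivalent is sketched below; `parse_expression` and its helpers are hypothetical names for this illustration, and nested tuples stand in for the AST that ANTLR would build.

```python
import re

def parse_expression(text):
    """Parse 'name IS (a, b) IS NOT (c);' into a nested-tuple tree."""
    tokens = re.findall(r"[A-Za-z_.0-9]+|[(),;]", text)
    pos = 0

    def expect(value):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != value:
            raise SyntaxError(f"expected {value!r} at token {pos}")
        pos += 1

    def id_list():
        # Mirrors id_list: '(' ID (',' ID)* ')' and keeps only the IDs.
        nonlocal pos
        expect("(")
        ids = [tokens[pos]]
        pos += 1
        while tokens[pos] == ",":
            pos += 1
            ids.append(tokens[pos])
            pos += 1
        expect(")")
        return ids

    expect("name")
    expect("IS")
    left = id_list()    # plays the role of left=id_list
    expect("IS")
    expect("NOT")
    right = id_list()   # plays the role of right=id_list
    return ("name", ("IS", *left), ("NOT", *right))

print(parse_expression("name IS (Jon, Ted) IS NOT (Peter);"))
```

Because `left` and `right` are separate lists, Jon and Ted end up only under IS, and Peter only under NOT.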

How to get the Text of a Lexer Rule

I have an ANTLR lexer rule like this:
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the String Value of Name. How can I do it?
From a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
From the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

Can ANTLR differentiate between lexer rules based on the following character?

For parsing a test file, I'd like to allow identifiers to begin with a number.
My rule is:
ID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
However I also need to match numbers in this file as well. My rule for that is:
INT : '0'..'9'+
;
Obviously ANTLR won't let me do this, as INT will never be matched.
Is there a way to allow this? Specifically, I'd like an integer followed directly by an ID (no spaces) to be matched as a single ID, and an INT token to be created only if the digits are followed by a space.
For example:
3BOB -> [ID with text "3BOB"]
3 BOB -> [INT with text "3"] [ID with text "BOB"]
Just change the order in which ID and INT tokens are defined.
grammar qqq;
// Parser's rules.
root:
(integer|identifier)+
;
integer:
INT {System.out.println("INT with text '"+$INT.text+"'.");}
;
identifier:
ID {System.out.println("ID with text '"+$ID.text+"'.");}
;
// Lexer's tokens.
INT: '0'..'9'+
;
ID: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
WS: ' ' {skip();}
;
UNPREDICTED_TOKEN
:
~(' ') {System.out.println("Unpredicted token.");}
;
The order in which tokens are defined in the grammar is significant: if a string can be attributed to multiple tokens, it is attributed to the one that is defined first. In your case, if you want the integer '123' to be attributed to INT even though it also conforms to ID, put the INT definition first.
ANTLR's token matching is greedy, so it won't stop at '123' in '123BOB'; it continues until none of the tokens match the string any longer and then takes the last token that matched. So your identifiers can now start with digits.
A remark on tokens order can also be found in this article by Mark Volkmann.
The following minor changes in your rules should do the trick:
ID : ('0'..'9')* // optional numbers
('a'..'z' | 'A'..'Z' | '_' | '&' | '/' | '-' | '.') // followed by mandatory character which is not a number
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')* // followed by more stuff (including numbers)
;
INT : '0'..'9'+ // a number
;
You simply allow your identifiers to start with optional digits and make the character that follows them mandatory.
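As a quick sanity check of those two rules, they can be transliterated into Python regular expressions and tried on the examples from the question. This is only a sketch: the `classify` and `lex` helpers are hypothetical, and splitting on whitespace stands in for a real lexer.

```python
import re

# ID: optional digits, one mandatory non-digit character, then anything allowed.
ID = re.compile(r"[0-9]*[a-zA-Z_&/.\-][a-zA-Z0-9_&/.\-]*")
# INT: one or more digits.
INT = re.compile(r"[0-9]+")

def classify(word):
    """Return the name of the rule that matches the whole word."""
    if ID.fullmatch(word):
        return "ID"
    if INT.fullmatch(word):
        return "INT"
    return "UNKNOWN"

def lex(text):
    # Assumption: tokens are separated by whitespace, as in the examples.
    return [(classify(word), word) for word in text.split()]

print(lex("3BOB"))
print(lex("3 BOB"))
```

With this ID pattern a purely numeric word can never be an ID, so the two rules no longer overlap and their order stops mattering.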

Why does my grammar work for operators like *, -, /, but not +?

I'm creating a grammar right now and I had to get rid of left recursion, and it seems to work for everything except the addition operator.
Here is the related part of my grammar:
SUBTRACT: '-';
PLUS: '+';
DIVIDE: '/';
MULTIPLY: '*';
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
)
(
PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr
)*
;
INTEGER: ('0'..'9')*;
IDENTIFIER: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*;
Then when I try to do something like
x*1
It works perfectly. However when I try to do something like
x+1
I get an error saying:
MismatchedTokenException: mismatched input '+' expecting '\u001C'
I've been at this for a while but don't get why it works with *, -, and /, but not +. I have the exact same code for all of them.
Edit: If I reorder it and put SUBTRACT above PLUS, the + symbol now works but the - symbol doesn't. Why would ANTLR care about the order of rules like that?
Avoiding left recursion (in an expression grammar) is usually done like this:
grammar Expr;
parse
: expr EOF
;
expr
: equalityExpr
;
equalityExpr
: relationalExpr (('==' | '!=') relationalExpr)*
;
relationalExpr
: additionExpr (('>=' | '<=' | '>' | '<') additionExpr)*
;
additionExpr
: multiplyExpr (('+'| '-') multiplyExpr)*
;
multiplyExpr
: atom (('*' | '/') atom)*
;
atom
: IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
| '(' expr ')'
;
// ... lexer rules ...
For example, the input A+B+C would be parsed as a single additionExpr with three multiplyExpr operands (A, B and C), separated by two '+' tokens.
Also see this related answer: ANTLR: Is there a simple example?
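The layering in that grammar (one rule per precedence level, each looping over the next tighter level) can also be written by hand. Below is a minimal Python sketch of just the additionExpr, multiplyExpr and atom levels; the `parse` function and the tuple-based tree are assumptions made for this illustration and are not what ANTLR generates.

```python
import re

def parse(text):
    """Parse +, -, *, / and parentheses into nested (op, left, right) tuples."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        # atom : IDENTIFIER | INTEGER | '(' expr ')'
        if peek() == "(":
            advance()
            node = addition_expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if peek() in (None, "+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {peek()!r}")
        return advance()

    def multiply_expr():
        # multiplyExpr : atom (('*' | '/') atom)*
        node = atom()
        while peek() in ("*", "/"):
            node = (advance(), node, atom())
        return node

    def addition_expr():
        # additionExpr : multiplyExpr (('+' | '-') multiplyExpr)*
        node = multiply_expr()
        while peek() in ("+", "-"):
            node = (advance(), node, multiply_expr())
        return node

    tree = addition_expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("A+B+C"))
print(parse("A+B*C"))
```

Each loop consumes operators of one precedence level only, so no rule ever calls itself as its first step, and multiplication binds tighter than addition without any special handling.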
I fixed it by moving the trailing part, which came from removing the left recursion, into a new rule of its own:
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
) lr*
;
lr: PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr;