How to get the Text of a Lexer Rule - antlr

I have an ANTLR lexer rule like this:
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the string value of Name. How can I do it?

From a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or, using a label:
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
From the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

Related

antlr3 always matching the longest possible token

Suppose I have input that matches two tokens. ANTLR always chooses the longest match. How do I configure it to try the shortest match first and fall back to the longest only if that fails?
Example:
rule
: USER PATH
| PATH
;
USER
: '#' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
PATH
: URL_ALLOWED_CHARS+ '.config'
;
fragment URL_ALLOWED_CHARS
: ':' | '/' | '?' | '#' | '['
| ']' | '#' |'!' | '$' | '&'
| '\'' | '(' | ')' | '*'
| '+' | ',' | ';' | '='
| '%' | 'A'..'Z' | 'a'..'z'
| '0'..'9' | '_' | '.'
| '\\' | '-' | '~'
;
For the grammar above, given input such as #random_user/file.config,
the first alternative of rule should match and I should get two tokens: #random_user for USER and /file.config for PATH.
Instead, the grammar matches the second alternative and the complete input is lexed as PATH. How can I avoid this?

Dealing with lexer ambiguity: different lexer rules in context

I'm building an ANTLR4 grammar to parse a custom language which looks like:
start rule_set {
/foo/bar {
//some_rules
}
}
Here /foo/bar is a URL-like path, so it may contain escaped characters (e.g. %20) and other symbols. The rule_set part, however, is a normal identifier, and % shouldn't appear in it.
Here is my current grammar:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '#'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
The problem is that foo and bar are lexed as IDENTIFIER because that is the longest match. I want pathSegment to match them in this scenario. How can I resolve this ambiguity?
[#0,0:4='start',<'start'>,1:0]
[#1,6:13='rule_set',<IDENTIFIER>,1:6]
[#2,15:15='{',<'{'>,1:15]
[#3,21:21='/',<'/'>,2:4]
[#4,22:24='foo',<IDENTIFIER>,2:5]
[#5,25:25='/',<'/'>,2:8]
[#6,26:28='bar',<IDENTIFIER>,2:9]
[#7,30:30='{',<'{'>,2:13]
[#8,40:51='//some_rules',<'//some_rules'>,3:8]
[#9,57:57='}',<'}'>,4:4]
[#10,59:59='}',<'}'>,5:0]
[#11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR

ANTLR How to differentiate input arguments of the same type

If I have my input message:
name IS (Jon, Ted) IS NOT (Peter);
I want this AST:
name
|
|-----|
IS IS NOT
| |
| Peter
|----|
Jon Ted
But I'm receiving:
name
|
|-----------------|
IS IS NOT
| |
| |
|----|-----| |----|-----|
Jon Ted Peter Jon Ted Peter
My Grammar file has:
...
expression
| NAME 'IS' OParen Identifier (Comma Identifier)* CParen 'IS NOT' OParen
Identifier (Comma Identifier)* CParen
-> ^(NAME ^('IS' ^(Identifier)*) ^('IS NOT' ^(Identifier)*))
;
...
NAME
: 'name'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)*
;
How can I differentiate what "belongs" to the 'IS' from what belongs to the 'IS NOT'?
Something like this should do it:
expression
: NAME IS left=id_list IS NOT right=id_list -> ^(NAME ^(IS $left) ^(NOT $right))
;
id_list
: '(' ID (',' ID)* ')' -> ID+
;
IS : 'IS';
NOT : 'NOT'; // not a single token that is 'IS NOT'
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)+
// Not `(...)*`: it should always match at least one char!
;

How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?
Whenever two (or more) rules match the same number of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that it doesn't make a lot of sense to create a single token that matches any reserved word: usually it is done like this:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false';
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.
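The tie-breaking behaviour described above can be imitated outside ANTLR. The following is only a rough sketch using java.util.regex, not ANTLR's actual implementation: the class MaximalMunchSketch and its tokenize method are hypothetical names made up for illustration, and the rule set is cut down to FALSE, ID and whitespace. At each position every rule is tried, the longest match is kept, and on a tie the rule listed first wins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: mimics "longest match wins,
// ties go to the rule defined first" with plain regular expressions.
final class MaximalMunchSketch {

    // Insertion order stands in for the order of the rules in the grammar.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("FALSE", Pattern.compile("false"));
        RULES.put("ID", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // whitespace is skipped
                tokens.add(bestRule + ":" + input.substring(pos, pos + bestLength));
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [FALSE:false, ID:falser]
        System.out.println(tokenize("false falser"));
    }
}
```

If the ID entry were inserted before FALSE, both inputs would come out as ID, which is the situation described in the question.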

Can ANTLR differentiate between lexer rules based on the following character?

For parsing a test file I'd like to allow identifiers to begin with a number.
My rule is:
ID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
However I also need to match numbers in this file as well. My rule for that is:
INT : '0'..'9'+
;
Obviously ANTLR won't let me do this, as INT will never be matched.
Is there a way to allow this? Specifically I'd like to match an INTEGER followed by an ID with no spaces as just an ID and create an INT token only if it's followed by a space.
For example:
3BOB -> [ID with text "3BOB"]
3 BOB -> [INT with text "3"] [ID with text "BOB"]
Just change the order in which ID and INT tokens are defined.
grammar qqq;
// Parser's rules.
root:
(integer|identifier)+
;
integer:
INT {System.out.println("INT with text '"+$INT.text+"'.");}
;
identifier:
ID {System.out.println("ID with text '"+$ID.text+"'.");}
;
// Lexer's tokens.
INT: '0'..'9'+
;
ID: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
WS: ' ' {skip();}
;
UNPREDICTED_TOKEN
:
~(' ') {System.out.println("Unpredicted token.");}
;
The order in which tokens are defined in the grammar is significant: if the same string can be matched by multiple token rules, it is attributed to the one that is defined first. In your case, if you want the integer '123' to be attributed to INT even though it also conforms to ID, put the INT definition first.
ANTLR's token matching is greedy, so it won't stop at '123' in '123BOB', but will continue until none of the tokens match the string and then take the last token matched. So your identifiers can now start with numbers.
A remark on tokens order can also be found in this article by Mark Volkmann.
The following minor changes in your rules should do the trick:
ID : ('0'..'9')* // optional numbers
('a'..'z' | 'A'..'Z' | '_' | '&' | '/' | '-' | '.') // followed by mandatory character which is not a number
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')* // followed by more stuff (including numbers)
;
INT : '0'..'9'+ // a number
;
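You simply allow your identifiers to start with optional digits and make the following non-digit character mandatory. This way a string made up only of digits can never match ID, so it is lexed as INT regardless of the order of the two rules.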
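As a quick sanity check of these two rules, they can be transcribed into java.util.regex patterns. This is only a sketch: the class DisjointRulesSketch and its classify method are hypothetical names, and the patterns are a hand translation of the ID and INT rules above, not something generated by ANTLR.

```java
import java.util.regex.Pattern;

// Hypothetical check, not part of ANTLR: regex transcriptions of the
// ID and INT rules from the answer above.
final class DisjointRulesSketch {

    // Optional digits, one mandatory non-digit, then any allowed characters.
    static final Pattern ID =
            Pattern.compile("[0-9]*[a-zA-Z_&/.\\-][a-zA-Z0-9_&/.\\-]*");

    // One or more digits.
    static final Pattern INT = Pattern.compile("[0-9]+");

    // Reports which rule(s) match the whole text.
    static String classify(String text) {
        boolean isId = ID.matcher(text).matches();
        boolean isInt = INT.matcher(text).matches();
        if (isId && isInt) {
            return "AMBIGUOUS";
        }
        if (isId) {
            return "ID";
        }
        return isInt ? "INT" : "NONE";
    }

    public static void main(String[] args) {
        System.out.println(classify("3BOB")); // prints ID
        System.out.println(classify("3"));    // prints INT
        System.out.println(classify("BOB"));  // prints ID
    }
}
```

Because ID requires at least one non-digit character, no input can satisfy both patterns, which is why the relative order of the two rules stops mattering with this variant.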