ANTLR3 always matching the longest possible token

Suppose I have input that matches two tokens: ANTLR always chooses the longest match. How do I configure it to try the shortest match first and fall back to the longest only if that is not possible?
Example:
rule
: USER PATH
| PATH
;
USER
: '#' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
PATH
: URL_ALLOWED_CHARS+ '.config'
;
fragment URL_ALLOWED_CHARS
: ':' | '/' | '?' | '#' | '['
| ']' | '#' |'!' | '$' | '&'
| '\'' | '(' | ')' | '*'
| '+' | ',' | ';' | '='
| '%' | 'A'..'Z' | 'a'..'z'
| '0'..'9' | '_' | '.'
| '\\' | '-' | '~'
;
For the grammar above, given input such as #random_user/file.config,
the first alternative of rule should match and I should get two tokens: #random_user for USER and /file.config for PATH.
Instead, the grammar matches the second alternative of rule and the complete input is matched as PATH. How can I avoid this?
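To see why the whole input ends up as a single PATH token, it can help to compare how many characters each lexer rule could consume from the start of the input. The sketch below is only a Python model of that comparison, not ANTLR itself (the real ANTLR 3 lexer decides by lookahead prediction). The regular expressions are hand translations of the USER and PATH rules, assuming the digit alternative in USER is meant to be the range '0'..'9', and longest_rule is a hypothetical helper name.

```python
import re

# Hand-written regex approximations of the two lexer rules (assumption:
# the digit alternative in USER is meant to be the range '0'..'9').
RULES = [
    ("USER", re.compile(r"#[A-Za-z0-9_]+")),
    ("PATH", re.compile(r"[:/?#\[\]!$&'()*+,;=%A-Za-z0-9_.\\~-]+\.config")),
]

def longest_rule(text):
    """Return (rule_name, matched_text) for the rule that consumes the most
    characters from the start of text; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_rule("#random_user/file.config"))
```

Because '#' and the letters of the user name are also allowed PATH characters, PATH can consume the entire input while USER stops at the '/', so under this model PATH is the longer candidate.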

Related

Dealing with lexer ambiguity: different lexer rules in context

I'm building an ANTLR4 grammar to parse a custom language which looks like:
start rule_set {
/foo/bar {
//some_rules
}
}
Here /foo/bar is a URL-like path, so it may contain escaped characters (e.g. %20) and other symbols. The rule_set part, however, is a normal identifier and % shouldn't appear in it.
Here is my current grammar:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '#'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
The problem now is that foo and bar are lexed as IDENTIFIER because it's the longest match. I want pathSegment to get the correct result in this scenario. How can I resolve this ambiguity?
[#0,0:4='start',<'start'>,1:0]
[#1,6:13='rule_set',<IDENTIFIER>,1:6]
[#2,15:15='{',<'{'>,1:15]
[#3,21:21='/',<'/'>,2:4]
[#4,22:24='foo',<IDENTIFIER>,2:5]
[#5,25:25='/',<'/'>,2:8]
[#6,26:28='bar',<IDENTIFIER>,2:9]
[#7,30:30='{',<'{'>,2:13]
[#8,40:51='//some_rules',<'//some_rules'>,3:8]
[#9,57:57='}',<'}'>,4:4]
[#10,59:59='}',<'}'>,5:0]
[#11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR

Antlr program won't compile

I am trying to write a grammar to parse JSON.
The link I used to understand the automata for every entry: http://www.json.com
grammar myjson;
prog
: object+ EOF
;
object
: '{'
STRING ':' value
(',' STRING ':' value)*
'}'
| '{' EMPTY '}'
;
array
: '['
value
(',' value)*
']'
| '[' EMPTY ']'
;
value
: object | STRING | NUMBER
| array | BOOL | NULL
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
UNICODE
: ~('\u0022' | '\u005C')
;
SPECIAL
: '\u005C'
(
| '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
NULL: 'null';
BOOL
: 'true'
| 'false'
;
NUMBER : ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
fragment
EXPONENT : ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
DIGIT : '0'..'9'
;
fragment
LETTER
: ('a'..'z' | 'A'..'Z')
;
COMM
: '//' ~('\r'? '\n') {skip();}
| '/*' .* '*/' {skip();}
;
WS
: ' ' | '\t' | '\r' | '\n' | '\u000c' {skip();}
;
EMPTY
: ''
;
I would like to state that I am using ANTLRWorks v1.4.3 because that's what my teacher suggested working with.
My problem is that this grammar won't even compile, because I get the following error:
java.util.NoSuchElementException: can't look backwards more than one token in this stream
at org.antlr.runtime.misc.LookaheadStream.LB(LookaheadStream.java:159)
at org.antlr.runtime.misc.LookaheadStream.LT(LookaheadStream.java:120)
at org.antlr.runtime.RecognitionException.extractInformationFromTreeNodeStream(RecognitionException.java:144)
at org.antlr.runtime.RecognitionException.<init>(RecognitionException.java:111)
at org.antlr.runtime.MismatchedTreeNodeException.<init>(MismatchedTreeNodeException.java:42)
at org.antlr.runtime.tree.TreeParser.recoverFromMismatchedToken(TreeParser.java:135)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.antlr.grammar.v3.TreeToNFAConverter.alternative(TreeToNFAConverter.java:2798)
at org.antlr.grammar.v3.TreeToNFAConverter.block(TreeToNFAConverter.java:2662)
at org.antlr.grammar.v3.TreeToNFAConverter.rule(TreeToNFAConverter.java:1995)
at org.antlr.grammar.v3.TreeToNFAConverter.rules(TreeToNFAConverter.java:1338)
at org.antlr.grammar.v3.TreeToNFAConverter.grammarSpec(TreeToNFAConverter.java:1288)
at org.antlr.grammar.v3.TreeToNFAConverter.grammar_(TreeToNFAConverter.java:319)
at org.antlr.tool.Grammar.buildNFA(Grammar.java:1006)
at org.antlr.tool.CompositeGrammar.createNFAs(CompositeGrammar.java:390)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createLexerGrammarFromCombinedGrammar(ANTLRGrammarEngineImpl.java:219)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createCombinedGrammar(ANTLRGrammarEngineImpl.java:204)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createGrammars(ANTLRGrammarEngineImpl.java:165)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.analyze(ANTLRGrammarEngineImpl.java:272)
at org.antlr.works.grammar.engine.GrammarEngineImpl.analyze(GrammarEngineImpl.java:325)
at org.antlr.works.debugger.local.DBLocal.analyzeGrammar(DBLocal.java:385)
at org.antlr.works.debugger.local.DBLocal.generateAndCompileGrammar(DBLocal.java:365)
at org.antlr.works.debugger.local.DBLocal.run(DBLocal.java:222)
at java.lang.Thread.run(Unknown Source)
I read in a post about the "can't look backwards more than one token in this stream" Java exception that the lexer and parser grammars don't match, but I have no idea what that means or what it refers to. I also apologize for not commenting the code, but I don't know much ANTLR, so I didn't want to write something that could put you off.
Please help, and thank you in advance.
There are a couple of things wrong in your grammar:
Never define tokens that can (potentially) match an empty string: your lexer would go into an infinite loop when it tries to match them. In short: remove the EMPTY token.
' ' | '\t' | '\r' | '\n' | '\u000c' {skip();} is equivalent to ' ' | '\t' | '\r' | '\n' | ('\u000c' {skip();}), so only the form feed would be skipped. You'd want (' ' | '\t' | '\r' | '\n' | '\u000c') {skip();} instead.
Your SPECIAL rule also matches a single backslash, because of the empty first alternative: '\u005C' ( /* NOTHING HERE */ | '"' | ... Remove the first |: '\u005C' ( '"' | ...
A negated set may only contain single characters, not a sequence of two as in ~('\r'? '\n')* (you can't negate \r\n). It should be ~('\r' | '\n')*.
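The first point can be illustrated outside ANTLR. The following is a minimal Python sketch of a generic tokenizer loop, not ANTLR's actual lexer: once a rule is allowed to match zero characters, the loop keeps emitting tokens without advancing. The names tokenize and max_tokens are made up for this illustration, and max_tokens exists only so that the demonstration terminates.

```python
import re

def tokenize(text, rules, max_tokens=5):
    """Naive tokenizer loop: at each position take the first rule that matches.
    max_tokens is a safety cap so a non-advancing rule cannot loop forever."""
    pos, tokens = 0, []
    while pos < len(text) and len(tokens) < max_tokens:
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group(), pos))
                pos = m.end()  # an empty match leaves pos unchanged
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

# EMPTY can match the empty string, like EMPTY : '' in the question.
rules = [("LBRACE", re.compile(r"\{")), ("EMPTY", re.compile(r""))]
print(tokenize("{}", rules))
```

In this model the '{' is consumed normally, but at the '}' only EMPTY matches, with zero characters, so every further token is an empty EMPTY at the same position until the cap is reached.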
Try something like this instead (untested!):
grammar myjson;
prog
: object+ EOF
;
object
: '{' (key_value (',' key_value)*)? '}'
;
array
: '[' (value (',' value)*)? ']'
;
key_value
: STRING ':' value
;
value
: object
| array
| STRING
| NUMBER
| BOOL
| NULL
;
NULL
: 'null'
;
BOOL
: 'true'
| 'false'
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
NUMBER
: ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
COMM
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
SPACE
: (' ' | '\t' | '\r' | '\n' | '\u000c')+ {skip();}
;
fragment
DIGIT
: '0'..'9'
;
fragment
EXPONENT
: ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
UNICODE
: ~('\u0022' | '\u005C')
;
fragment
SPECIAL
: '\u005C' ( '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
Also check the JSON grammar in the ANTLR GitHub repository: https://github.com/antlr/grammars-v4/blob/master/json/Json.g4 Although it is an ANTLR4 grammar, it looks to be ANTLR 3 compatible.

How to get the Text of a Lexer Rule

I have an ANTLR lexer rule like this:
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the string value of Name. How can I do it?
From a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
From the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?
Whenever two (or more) rules match the same number of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that it doesn't make a lot of sense to create a single token that matches any reserved word: usually it is done like this:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false';
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.
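The two rules described above (longest match wins, and on a tie the rule defined first wins) can be sketched in a few lines of Python. This is only a model of the behaviour, not how ANTLR generates its lexer. The regular expressions are hand translations of the rules, and the WS rule and the tokenize function are assumptions added so that the example can run on input containing spaces.

```python
import re

# Rules in definition order; keywords are listed before ID on purpose.
RULES = [
    ("NULL", re.compile(r"null")),
    ("TRUE", re.compile(r"true")),
    ("FALSE", re.compile(r"false")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("WS", re.compile(r"\s+")),  # assumption: whitespace is skipped
]

def tokenize(text):
    """Longest match wins; on equal length the rule defined first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("false falser"))
```

Moving ID to the top of RULES in this sketch would turn every keyword into an ID, which mirrors the problem in the question.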

Can ANTLR differentiate between lexer rules based on the following character?

For parsing a test file I'd like to allow identifiers to begin with a number.
My rule is:
ID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
However I also need to match numbers in this file as well. My rule for that is:
INT : '0'..'9'+
;
Obviously ANTLR won't let me do this, as INT will never be matched.
Is there a way to allow this? Specifically, I'd like to match an integer followed directly by identifier characters (no spaces) as just an ID, and create an INT token only if it's followed by a space.
For example:
3BOB -> [ID with text "3BOB"]
3 BOB -> [INT with text "3"] [ID with text "BOB"]
Just change the order in which the ID and INT tokens are defined:
grammar qqq;
// Parser's rules.
root:
(integer|identifier)+
;
integer:
INT {System.out.println("INT with text '"+$INT.text+"'.");}
;
identifier:
ID {System.out.println("ID with text '"+$ID.text+"'.");}
;
// Lexer's tokens.
INT: '0'..'9'+
;
ID: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
WS: ' ' {skip();}
;
UNPREDICTED_TOKEN
:
~(' ') {System.out.println("Unpredicted token.");}
;
The order in which tokens are defined in the grammar is significant: if a string can be attributed to multiple tokens, it is attributed to the one that is defined first. In your case, if you want the integer '123' to be attributed to INT even though it also conforms to ID, put the INT definition first.
ANTLR's token matching is greedy, so it won't stop at '123' in '123BOB', but will continue until none of the tokens match the string and then take the last token matched. So your identifiers can now start with numbers.
A remark on token order can also be found in this article by Mark Volkmann.
The following minor changes in your rules should do the trick:
ID : ('0'..'9')* // optional numbers
('a'..'z' | 'A'..'Z' | '_' | '&' | '/' | '-' | '.') // followed by mandatory character which is not a number
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')* // followed by more stuff (including numbers)
;
INT : '0'..'9'+ // a number
;
You simply allow your identifiers to start with optional digits and make the following non-digit character mandatory.
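One way to sanity-check this variant is to translate both rules into regular expressions and test a few lexemes. The Python sketch below does that. The patterns are hand translations rather than anything generated by ANTLR, and classify is a hypothetical helper name.

```python
import re

# Regex approximations of the two rules above (hand translation).
ID = re.compile(r"[0-9]*[A-Za-z_&/.-][A-Za-z0-9_&/.-]*")
INT = re.compile(r"[0-9]+")

def classify(lexeme):
    """Return the name of the rule that matches the whole lexeme, or None."""
    if INT.fullmatch(lexeme):
        return "INT"
    if ID.fullmatch(lexeme):
        return "ID"
    return None

for lexeme in ["3BOB", "3", "BOB", "12-a.b"]:
    print(lexeme, "->", classify(lexeme))
```

Since this ID pattern requires at least one non-digit character, a string of digits alone can only be an INT, so the two rules no longer overlap and their order in the grammar stops mattering for such inputs.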