ANTLR: How to differentiate input arguments of the same type

If I have my input message:
name IS (Jon, Ted) IS NOT (Peter);
I want this AST:
        name
         |
    |---------|
    IS      IS NOT
    |         |
    |       Peter
  |----|
 Jon  Ted
But I'm receiving:
                name
                 |
       |-----------------|
       IS              IS NOT
       |                 |
       |                 |
  |----|-----|      |----|-----|
 Jon  Ted  Peter   Jon  Ted  Peter
My Grammar file has:
...
expression
| NAME 'IS' OParen Identifier (Comma Identifier)* CParen 'IS NOT' OParen
Identifier (Comma Identifier)* CParen
-> ^(NAME ^('IS' ^(Identifier)*) ^('IS NOT' ^(Identifier)*))
;
...
NAME
: 'name'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)*
;
How can I differentiate what "belongs" to the 'IS' and what belongs to the 'IS NOT'?

Something like this should do it:
expression
: NAME IS left=id_list IS NOT right=id_list -> ^(NAME ^(IS $left) ^(NOT $right))
;
id_list
: '(' ID (',' ID)* ')' -> ID+
;
IS : 'IS';
NOT : 'NOT'; // not a single token that is 'IS NOT'
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)+
// Not `(...)*`: a token should always match at least one character!
;
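To see the same idea outside ANTLR, here is a minimal Python sketch. It is not part of the grammar: MESSAGE, split_ids and parse are hypothetical helpers written only for this illustration. It captures each parenthesised list separately, the way the left and right labels do, so each list ends up under its own node:

```python
import re

# One capture group per parenthesised list, mirroring the separate
# left=id_list and right=id_list labels in the grammar above.
MESSAGE = re.compile(
    r"\s*(\w+)\s+IS\s*\(([^)]*)\)\s*IS\s+NOT\s*\(([^)]*)\)\s*;\s*"
)

def split_ids(text):
    """Split 'Jon, Ted' into ['Jon', 'Ted'], ignoring surrounding spaces."""
    return [part.strip() for part in text.split(",") if part.strip()]

def parse(message):
    """Return (name, ('IS', [...]), ('NOT', [...])) or raise ValueError."""
    m = MESSAGE.fullmatch(message)
    if m is None:
        raise ValueError("unexpected input: " + message)
    name, left, right = m.groups()
    return (name, ("IS", split_ids(left)), ("NOT", split_ids(right)))

print(parse("name IS (Jon, Ted) IS NOT (Peter);"))
```

For the input message from the question, parse returns the name with Jon and Ted under the IS node and only Peter under the NOT node.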

Related

antlr3 always matching the longest possible token

Suppose I have input that can match two tokens: ANTLR always chooses the longest match. How do I configure it to try the shortest match first, and fall back to the longest only if that fails?
Example:
rule
: USER PATH
| PATH
;
USER
: '#' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
PATH
: URL_ALLOWED_CHARS+ '.config'
;
fragment URL_ALLOWED_CHARS
: ':' | '/' | '?' | '#' | '['
| ']' | '#' |'!' | '$' | '&'
| '\'' | '(' | ')' | '*'
| '+' | ',' | ';' | '='
| '%' | 'A'..'Z' | 'a'..'z'
| '0'..'9' | '_' | '.'
| '\\' | '-' | '~'
;
For the grammar above, given input such as #random_user/file.config,
option 1 of rule should match and I should get two tokens: #random_user for USER and /file.config for PATH.
Instead, the grammar matches option 2 of the rule and the complete input is matched as PATH. How can I avoid this?

Dealing with lexer ambiguity: different lexer rules in context

I'm building an ANTLR4 grammar to parse a custom language which looks like:
start rule_set {
    /foo/bar {
        //some_rules
    }
}
Here /foo/bar is a URL-like path, so it may contain escaped characters (e.g. %20) and other symbols. The rule_set part, however, is a normal identifier, and % shouldn't be allowed in it.
Here is my current grammar:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '#'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
The problem is that foo and bar are lexed as IDENTIFIER because that is the longest match. I want pathSegment to match them in this context. How can I resolve this ambiguity?
[#0,0:4='start',<'start'>,1:0]
[#1,6:13='rule_set',<IDENTIFIER>,1:6]
[#2,15:15='{',<'{'>,1:15]
[#3,21:21='/',<'/'>,2:4]
[#4,22:24='foo',<IDENTIFIER>,2:5]
[#5,25:25='/',<'/'>,2:8]
[#6,26:28='bar',<IDENTIFIER>,2:9]
[#7,30:30='{',<'{'>,2:13]
[#8,40:51='//some_rules',<'//some_rules'>,3:8]
[#9,57:57='}',<'}'>,4:4]
[#10,59:59='}',<'}'>,5:0]
[#11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR
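
The token dump above can be reproduced with a small model of the "longest match wins" rule. This is a rough Python sketch, not ANTLR itself: the two regular expressions are simplified stand-ins for PATH_CHAR and IDENTIFIER, and candidates and longest are hypothetical helper names. It shows that defining PATH_CHAR first does not help, because IDENTIFIER consumes more characters:

```python
import re

# Simplified stand-ins for two lexer rules from the grammar above, listed in
# definition order. PATH_CHAR matches one character or one %XX escape.
RULES = [
    ("PATH_CHAR", re.compile(r"%[0-9a-fA-F]{2}|[A-Za-z0-9\-_.!~*\\'():#&=+$,]")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]

def candidates(text, pos=0):
    """List (rule_name, lexeme) for every rule that matches at pos."""
    found = []
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m:
            found.append((name, m.group()))
    return found

def longest(text, pos=0):
    """Pick the longest candidate; max() keeps the earliest one on a tie."""
    return max(candidates(text, pos), key=lambda c: len(c[1]), default=None)

print(candidates("foo"))  # both rules match, with different lengths
print(longest("foo"))     # the three-character IDENTIFIER match is chosen
print(longest("%20"))     # only PATH_CHAR can start with '%'
```

In this model, a single letter such as "f" is a tie and goes to PATH_CHAR, while anything longer that looks like an identifier is taken by IDENTIFIER, which matches the mismatched input errors reported above.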

Antlr program won't compile

I am trying to write a grammar to parse JSON.
The reference I used to understand the automaton for each entry is http://www.json.com
grammar myjson;
prog
: object+ EOF
;
object
: '{'
STRING ':' value
(',' STRING ':' value)*
'}'
| '{' EMPTY '}'
;
array
: '['
value
(',' value)*
']'
| '[' EMPTY ']'
;
value
: object | STRING | NUMBER
| array | BOOL | NULL
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
UNICODE
: ~('\u0022' | '\u005C')
;
SPECIAL
: '\u005C'
(
| '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
NULL: 'null';
BOOL
: 'true'
| 'false'
;
NUMBER : ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
fragment
EXPONENT : ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
DIGIT : '0'..'9'
;
fragment
LETTER
: ('a'..'z' | 'A'..'Z')
;
COMM
: '//' ~('\r'? '\n') {skip();}
| '/*' .* '*/' {skip();}
;
WS
: ' ' | '\t' | '\r' | '\n' | '\u000c' {skip();}
;
EMPTY
: ''
;
I should mention that I am using ANTLRWorks v1.4.3 because that's what my teacher suggested working with.
My problem is that this grammar won't even compile, because I get the following error:
java.util.NoSuchElementException: can't look backwards more than one token in this stream
at org.antlr.runtime.misc.LookaheadStream.LB(LookaheadStream.java:159)
at org.antlr.runtime.misc.LookaheadStream.LT(LookaheadStream.java:120)
at org.antlr.runtime.RecognitionException.extractInformationFromTreeNodeStream(RecognitionException.java:144)
at org.antlr.runtime.RecognitionException.<init>(RecognitionException.java:111)
at org.antlr.runtime.MismatchedTreeNodeException.<init>(MismatchedTreeNodeException.java:42)
at org.antlr.runtime.tree.TreeParser.recoverFromMismatchedToken(TreeParser.java:135)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.antlr.grammar.v3.TreeToNFAConverter.alternative(TreeToNFAConverter.java:2798)
at org.antlr.grammar.v3.TreeToNFAConverter.block(TreeToNFAConverter.java:2662)
at org.antlr.grammar.v3.TreeToNFAConverter.rule(TreeToNFAConverter.java:1995)
at org.antlr.grammar.v3.TreeToNFAConverter.rules(TreeToNFAConverter.java:1338)
at org.antlr.grammar.v3.TreeToNFAConverter.grammarSpec(TreeToNFAConverter.java:1288)
at org.antlr.grammar.v3.TreeToNFAConverter.grammar_(TreeToNFAConverter.java:319)
at org.antlr.tool.Grammar.buildNFA(Grammar.java:1006)
at org.antlr.tool.CompositeGrammar.createNFAs(CompositeGrammar.java:390)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createLexerGrammarFromCombinedGrammar(ANTLRGrammarEngineImpl.java:219)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createCombinedGrammar(ANTLRGrammarEngineImpl.java:204)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createGrammars(ANTLRGrammarEngineImpl.java:165)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.analyze(ANTLRGrammarEngineImpl.java:272)
at org.antlr.works.grammar.engine.GrammarEngineImpl.analyze(GrammarEngineImpl.java:325)
at org.antlr.works.debugger.local.DBLocal.analyzeGrammar(DBLocal.java:385)
at org.antlr.works.debugger.local.DBLocal.generateAndCompileGrammar(DBLocal.java:365)
at org.antlr.works.debugger.local.DBLocal.run(DBLocal.java:222)
at java.lang.Thread.run(Unknown Source)
I read in a post about the "can't look backwards more than one token in this stream" Java exception that the lexer and parser grammars don't match, but I have no idea what that means or what it refers to. I also apologize for not commenting the code, but I don't know much ANTLR, so I didn't want to write something that could put you off.
Please help, and thank you in advance.
There are a couple of things wrong in your grammar:
Never define tokens that can (potentially) match an empty string: your lexer would go into an infinite loop when it tries to match them. In short: remove the EMPTY token.
' ' | '\t' | '\r' | '\n' | '\u000c' {skip();} is equivalent to ' ' | '\t' | '\r' | '\n' | ('\u000c' {skip();}), so only the last alternative is skipped. You'd want (' ' | '\t' | '\r' | '\n' | '\u000c') {skip();} instead.
Your SPECIAL rule also matches a single backslash, because its group starts with an empty alternative: '\u005C' ( /* NOTHING HERE */ | '"' | ... Remove the first |: '\u005C' ( '"' | ...
A negated set must contain single characters, not a sequence of two as in ~('\r'? '\n') (you can't negate \r\n). It should be: ~('\r' | '\n')*
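The first point can be illustrated with a naive lexer loop. This is a minimal Python sketch under stated assumptions, not how ANTLR's generated lexer is implemented: tokenize, RULES_OK, RULES_BAD and the max_steps guard are all made up for the example. A rule that matches the empty string "succeeds" without consuming input, so the loop never advances:

```python
import re

RULES_OK = [("LBRACE", re.compile(r"\{")), ("RBRACE", re.compile(r"\}"))]
# Same rules plus a stand-in for the EMPTY token, which matches nothing at all.
RULES_BAD = RULES_OK + [("EMPTY", re.compile(r""))]

def tokenize(text, rules, max_steps=1000):
    """Naive lexer loop: at each position, consume the first rule that matches."""
    pos, tokens = 0, []
    for _ in range(max_steps):
        if pos >= len(text):
            return tokens
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()  # an empty match leaves pos unchanged
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    # The guard only exists so the demonstration terminates.
    raise RuntimeError("step limit reached: the lexer is not making progress")

print(tokenize("{}", RULES_OK))
try:
    tokenize("{x}", RULES_BAD)
except RuntimeError as err:
    print(err)
```

Without the step limit, the second call would keep emitting EMPTY tokens at the position of x forever.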
Try something like this instead (untested!):
grammar myjson;
prog
: object+ EOF
;
object
: '{' (key_value (',' key_value)*)? '}'
;
array
: '[' (value (',' value)*)? ']'
;
key_value
: STRING ':' value
;
value
: object
| array
| STRING
| NUMBER
| BOOL
| NULL
;
NULL
: 'null'
;
BOOL
: 'true'
| 'false'
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
NUMBER
: ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
COMM
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
SPACE
: (' ' | '\t' | '\r' | '\n' | '\u000c')+ {skip();}
;
fragment
DIGIT
: '0'..'9'
;
fragment
EXPONENT
: ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
UNICODE
: ~('\u0022' | '\u005C')
;
fragment
SPECIAL
: '\u005C' ( '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
Also check the JSON grammar from the ANTLR GitHub repository: https://github.com/antlr/grammars-v4/blob/master/json/Json.g4 Although it is an ANTLR4 grammar, it looks to be ANTLR 3 compatible.

How to get the Text of a Lexer Rule

I have an ANTLR lexer rule like this:
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the String Value of Name. How can I do it?
From a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
From the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?
Whenever two (or more) rules match the same number of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that it doesn't make a lot of sense to create a single token that matches any reserved word: usually it is done like this:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false';
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.
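
The tie-breaking described above can be modelled in a few lines. The following Python sketch is a simplification, not ANTLR's actual lexer algorithm; RULES and next_token are hypothetical names, and the regular expressions stand in for the token rules. It keeps the longest match and, on equal length, the rule listed first:

```python
import re

# Token rules in definition order: keywords first, then the general ID rule.
RULES = [
    ("NULL", re.compile(r"null")),
    ("TRUE", re.compile(r"true")),
    ("FALSE", re.compile(r"false")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the best match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best, so on a
        # tie the rule defined earlier is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("false"))
print(next_token("falser"))
```

Here "false" ties between FALSE and ID and goes to FALSE because it is listed first, while "falser" goes to ID because that match is one character longer. Moving the ID entry to the top of RULES would turn every keyword into an ID, which is the situation described in the question.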