I am trying to write a grammar to parse JSON.
I used http://www.json.com to understand the automaton for each construct.
grammar myjson;
prog
: object+ EOF
;
object
: '{'
STRING ':' value
(',' STRING ':' value)*
'}'
| '{' EMPTY '}'
;
array
: '['
value
(',' value)*
']'
| '[' EMPTY ']'
;
value
: object | STRING | NUMBER
| array | BOOL | NULL
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
UNICODE
: ~('\u0022' | '\u005C')
;
SPECIAL
: '\u005C'
(
| '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
NULL: 'null';
BOOL
: 'true'
| 'false'
;
NUMBER : ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
fragment
EXPONENT : ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
DIGIT : '0'..'9'
;
fragment
LETTER
: ('a'..'z' | 'A'..'Z')
;
COMM
: '//' ~('\r'? '\n') {skip();}
| '/*' .* '*/' {skip();}
;
WS
: ' ' | '\t' | '\r' | '\n' | '\u000c' {skip();}
;
EMPTY
: ''
;
I should mention that I am using ANTLRWorks v1.4.3 because that is what my teacher suggested.
My problem is that this grammar won't even compile; I get the following error:
java.util.NoSuchElementException: can't look backwards more than one token in this stream
at org.antlr.runtime.misc.LookaheadStream.LB(LookaheadStream.java:159)
at org.antlr.runtime.misc.LookaheadStream.LT(LookaheadStream.java:120)
at org.antlr.runtime.RecognitionException.extractInformationFromTreeNodeStream(RecognitionException.java:144)
at org.antlr.runtime.RecognitionException.<init>(RecognitionException.java:111)
at org.antlr.runtime.MismatchedTreeNodeException.<init>(MismatchedTreeNodeException.java:42)
at org.antlr.runtime.tree.TreeParser.recoverFromMismatchedToken(TreeParser.java:135)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.antlr.grammar.v3.TreeToNFAConverter.alternative(TreeToNFAConverter.java:2798)
at org.antlr.grammar.v3.TreeToNFAConverter.block(TreeToNFAConverter.java:2662)
at org.antlr.grammar.v3.TreeToNFAConverter.rule(TreeToNFAConverter.java:1995)
at org.antlr.grammar.v3.TreeToNFAConverter.rules(TreeToNFAConverter.java:1338)
at org.antlr.grammar.v3.TreeToNFAConverter.grammarSpec(TreeToNFAConverter.java:1288)
at org.antlr.grammar.v3.TreeToNFAConverter.grammar_(TreeToNFAConverter.java:319)
at org.antlr.tool.Grammar.buildNFA(Grammar.java:1006)
at org.antlr.tool.CompositeGrammar.createNFAs(CompositeGrammar.java:390)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createLexerGrammarFromCombinedGrammar(ANTLRGrammarEngineImpl.java:219)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createCombinedGrammar(ANTLRGrammarEngineImpl.java:204)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.createGrammars(ANTLRGrammarEngineImpl.java:165)
at org.antlr.works.grammar.antlr.ANTLRGrammarEngineImpl.analyze(ANTLRGrammarEngineImpl.java:272)
at org.antlr.works.grammar.engine.GrammarEngineImpl.analyze(GrammarEngineImpl.java:325)
at org.antlr.works.debugger.local.DBLocal.analyzeGrammar(DBLocal.java:385)
at org.antlr.works.debugger.local.DBLocal.generateAndCompileGrammar(DBLocal.java:365)
at org.antlr.works.debugger.local.DBLocal.run(DBLocal.java:222)
at java.lang.Thread.run(Unknown Source)
I read in a post about the "can't look backwards more than one token in this stream" exception that it means the lexer and parser grammars don't match, but I have no idea what that means or what it refers to. I also apologize for not commenting the code; I don't know much ANTLR, so I didn't want to write something misleading.
Please help, and thank you in advance.
There are a couple of things wrong with your grammar:
Never define a token that can (potentially) match the empty string: your lexer would go into an infinite loop when it tries to match it. In short: remove the EMPTY token.
' ' | '\t' | '\r' | '\n' | '\u000c' {skip();} is equivalent to ' ' | '\t' | '\r' | '\n' | ('\u000c' {skip();}), so the action applies only to the last alternative. You want (' ' | '\t' | '\r' | '\n' | '\u000c') {skip();} instead.
Your SPECIAL rule can match a lone backslash because its group starts with an empty alternative: '\u005C' ( /* NOTHING HERE */ | '"' | ... Remove the first |: '\u005C' ( '"' | ...
A negated set can only contain single characters, not a sequence of two as in ~('\r'? '\n')* (you can't negate \r\n). It should be: ~('\r' | '\n')*
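To make the first point concrete, here is a minimal Python sketch of a lexer loop. It is a simplified model, not ANTLR's actual algorithm: the `tokenize` function, its `max_steps` guard, and the regex-based rules are all assumptions made for illustration. It shows that once a rule can match zero characters, the input position stops advancing on any character no other rule accepts.

```python
import re

def tokenize(text, rules, max_steps=100):
    """Toy first-match lexer loop (hypothetical helper, not ANTLR code).

    Returns (tokens, finished). `finished` is False when the step budget
    ran out before the end of the input was reached.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    for _ in range(max_steps):
        if pos >= len(text):
            return tokens, True
        for name, rx in compiled:
            m = rx.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()  # a zero-width match leaves pos unchanged
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens, pos >= len(text)

BRACES = [("LBRACE", r"\{"), ("RBRACE", r"\}")]
WITH_EMPTY = BRACES + [("EMPTY", r"")]  # a rule that matches the empty string
```

Without EMPTY, the space in `"{ }"` is reported as an error. With EMPTY, the loop keeps emitting empty tokens at the same position until the step budget runs out, which is the behaviour the `max_steps` guard stands in for.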
Try something like this instead (untested!):
grammar myjson;
prog
: object+ EOF
;
object
: '{' (key_value (',' key_value)*)? '}'
;
array
: '[' (value (',' value)*)? ']'
;
key_value
: STRING ':' value
;
value
: object
| array
| STRING
| NUMBER
| BOOL
| NULL
;
NULL
: 'null'
;
BOOL
: 'true'
| 'false'
;
STRING
: '"' (UNICODE | SPECIAL)* '"'
;
NUMBER
: ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
| ('+'|'-')? '.'? DIGIT+ EXPONENT?
;
COMM
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
SPACE
: (' ' | '\t' | '\r' | '\n' | '\u000c')+ {skip();}
;
fragment
DIGIT
: '0'..'9'
;
fragment
EXPONENT
: ('e' | 'E') ('+' | '-') ? DIGIT+
;
fragment
UNICODE
: ~('\u0022' | '\u005C')
;
fragment
SPECIAL
: '\u005C' ( '"' | '\u005C' | '\u002F'
| 'b' | 'f' | 'n' | 'r'
| 't' | 'u' DIGIT DIGIT DIGIT DIGIT
)
;
Also check the JSON grammar in the ANTLR GitHub repository: https://github.com/antlr/grammars-v4/blob/master/json/Json.g4 Although it is an ANTLR 4 grammar, it looks to be ANTLR 3 compatible.
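As a quick way to see what the NUMBER rule in the grammar above accepts, it can be transcribed into a regular expression. This is only a sketch: the names `NUMBER` and `is_number_token` are made up here, and the regex is a hand translation of the two alternatives (with DIGIT and EXPONENT inlined), not something generated by ANTLR.

```python
import re

# Hand transcription of the two NUMBER alternatives from the grammar:
#   ('+'|'-')? DIGIT+ '.' DIGIT* EXPONENT?
#   ('+'|'-')? '.'? DIGIT+ EXPONENT?
NUMBER = re.compile(
    r"[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?"
    r"|[+-]?\.?[0-9]+(?:[eE][+-]?[0-9]+)?"
)

def is_number_token(text):
    """True if the whole string would be a single NUMBER under this transcription."""
    return NUMBER.fullmatch(text) is not None
```

Note that this rule is more permissive than JSON itself: it accepts a leading `+`, a bare leading `.` (as in `.5`) and a trailing `.` (as in `1.`), none of which the JSON number syntax allows. That may or may not matter for your use case.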
Related
Suppose I have input that matches two tokens. ANTLR always chooses the longest match. How do I configure it to try the shortest match first and fall back to the longest only if that is not possible?
Example:
rule
: USER PATH
| PATH
;
USER
: '#' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
PATH
: URL_ALLOWED_CHARS+ '.config'
;
fragment URL_ALLOWED_CHARS
: ':' | '/' | '?' | '#' | '['
| ']' | '#' |'!' | '$' | '&'
| '\'' | '(' | ')' | '*'
| '+' | ',' | ';' | '='
| '%' | 'A'..'Z' | 'a'..'z'
| '0'..'9' | '_' | '.'
| '\\' | '-' | '~'
;
For the grammar above, given input such as #random_user/file.config,
the first alternative of rule should match and I should get two tokens: #random_user for USER and /file.config for PATH.
Instead, the grammar matches the second alternative and the complete input is matched as a single PATH. How can I avoid this?
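The behaviour described above can be reproduced outside ANTLR by comparing how far each lexer rule can match from the start of the input. The sketch below is an approximation under stated assumptions: `USER` and `PATH` are hand-written regex translations of the two token rules (reading the `'0-9'` in USER as the digit range `0`..`9`), and `candidate_lengths` is a hypothetical helper.

```python
import re

# Regex approximations of the two lexer rules (assumed translations).
USER = re.compile(r"#[A-Za-z0-9_]+")
PATH = re.compile(r"[:/?#\[\]!$&'()*+,;=%A-Za-z0-9_.\\~-]+\.config")

def candidate_lengths(text):
    """Return how many characters each rule can consume from position 0."""
    lengths = {}
    for name, rx in (("USER", USER), ("PATH", PATH)):
        m = rx.match(text)
        lengths[name] = m.end() if m else 0
    return lengths
```

For `#random_user/file.config`, USER can consume 12 characters while PATH can consume all 24, because `#`, letters and `_` are all valid PATH characters. A longest-match lexer therefore never gets the chance to emit USER here; the decision is made before the parser's alternatives are considered.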
I'm building an ANTLR4 grammar to parse a custom language which looks like:
start rule_set {
/foo/bar {
//some_rules
}
}
Here /foo/bar is a URL-like path, so it may contain escaped characters (e.g. %20) and other symbols. The rule_set part, however, is a normal identifier, and % shouldn't be allowed in it.
Here is my current grammar:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '#'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
The problem now is that foo and bar are lexed as IDENTIFIER because it is the longest match. I want pathSegment to get the correct result in this scenario. How can I resolve this ambiguity?
[@0,0:4='start',<'start'>,1:0]
[@1,6:13='rule_set',<IDENTIFIER>,1:6]
[@2,15:15='{',<'{'>,1:15]
[@3,21:21='/',<'/'>,2:4]
[@4,22:24='foo',<IDENTIFIER>,2:5]
[@5,25:25='/',<'/'>,2:8]
[@6,26:28='bar',<IDENTIFIER>,2:9]
[@7,30:30='{',<'{'>,2:13]
[@8,40:51='//some_rules',<'//some_rules'>,3:8]
[@9,57:57='}',<'}'>,4:4]
[@10,59:59='}',<'}'>,5:0]
[@11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR
For parsing a test file I'd like to allow identifiers to begin with a number.
My rule is:
ID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
However I also need to match numbers in this file as well. My rule for that is:
INT : '0'..'9'+
;
Obviously ANTLR won't let me do this, as INT will never be matched.
Is there a way to allow this? Specifically, I'd like an integer immediately followed by an ID (no space) to be matched as a single ID, and an INT token to be created only if the digits are followed by a space.
For example:
3BOB -> [ID with text "3BOB"]
3 BOB -> [INT with text "3"] [ID with text "BOB"]
Just change the order in which ID and INT tokens are defined.
grammar qqq;
// Parser's rules.
root:
(integer|identifier)+
;
integer:
INT {System.out.println("INT with text '"+$INT.text+"'.");}
;
identifier:
ID {System.out.println("ID with text '"+$ID.text+"'.");}
;
// Lexer's tokens.
INT: '0'..'9'+
;
ID: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
WS: ' ' {skip();}
;
UNPREDICTED_TOKEN
:
~(' ') {System.out.println("Unpredicted token.");}
;
The order in which tokens are defined in the grammar is significant: if a string can be attributed to multiple tokens, it is attributed to the one that is defined first. In your case, if you want the integer '123' to be attributed to INT even though it also conforms to ID, put the INT definition first.
ANTLR's token matching is greedy, so it won't stop at '123' in '123BOB'; it continues until none of the tokens match any further and takes the last token matched. So your identifiers can now start with numbers.
A remark on tokens order can also be found in this article by Mark Volkmann.
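The two rules described in this answer (take the longest match, and on a tie prefer the token defined first) can be sketched in a few lines of Python. This is a simplified model for illustration only, not ANTLR's implementation; `tokenize`, `RULES` and the regex translations of INT, ID and WS are assumptions.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first is kept (toy model)."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# INT is listed before ID, mirroring the grammar in this answer.
RULES = [
    ("INT", r"[0-9]+"),
    ("ID", r"[A-Za-z0-9_][A-Za-z0-9_&/.-]*"),
    ("WS", r" +"),
]
```

With this ordering, `tokenize("3BOB", RULES)` yields a single ID and `tokenize("3 BOB", RULES)` yields an INT followed by an ID. Swapping INT and ID in `RULES` makes a lone `123` come out as ID, which is the situation the question started from.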
The following minor changes in your rules should do the trick:
ID : ('0'..'9')* // optional numbers
('a'..'z' | 'A'..'Z' | '_' | '&' | '/' | '-' | '.') // followed by mandatory character which is not a number
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')* // followed by more stuff (including numbers)
;
INT : '0'..'9'+ // a number
;
You simply allow your identifiers to start with optional digits and make the character that follows them, which must not be a digit, mandatory.
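A quick way to check that these two rules no longer overlap is to transcribe them into regular expressions and classify a few strings. This is a sketch under assumptions: `ID`, `INT` and `classify` are hypothetical names, and the patterns are hand translations of the rules above.

```python
import re

# Hand translations of the two lexer rules from this answer.
ID = re.compile(r"[0-9]*[A-Za-z_&/.-][A-Za-z0-9_&/.-]*")
INT = re.compile(r"[0-9]+")

def classify(text):
    """Return the names of all rules that match the whole string."""
    names = []
    if INT.fullmatch(text):
        names.append("INT")
    if ID.fullmatch(text):
        names.append("ID")
    return names
```

Because ID now requires at least one non-digit character, a string of digits can only be an INT, and a string such as `3BOB` can only be an ID, so the order of the two definitions no longer matters for these inputs.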
I'm creating a grammar right now and I had to get rid of left recursion. It seems to work for everything except the addition operator.
Here is the related part of my grammar:
SUBTRACT: '-';
PLUS: '+';
DIVIDE: '/';
MULTIPLY: '*';
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
)
(
PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr
)*
;
INTEGER: ('0'..'9')*;
IDENTIFIER: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*;
Then when I try to do something like
x*1
It works perfectly. However, when I try to do something like
x+1
I get an error saying:
MismatchedTokenException: mismatched input '+' expecting '\u001C'
I've been at this for a while but don't get why it works with *, -, and / but not +. I have exactly the same code for all of them.
Edit: If I reorder the rules and put SUBTRACT above PLUS, the + symbol now works but the - symbol doesn't. Why would ANTLR care about the order of things like that?
Avoiding left recursion (in an expression grammar) is usually done like this:
grammar Expr;
parse
: expr EOF
;
expr
: equalityExpr
;
equalityExpr
: relationalExpr (('==' | '!=') relationalExpr)*
;
relationalExpr
: additionExpr (('>=' | '<=' | '>' | '<') additionExpr)*
;
additionExpr
: multiplyExpr (('+'| '-') multiplyExpr)*
;
multiplyExpr
: atom (('*' | '/') atom)*
;
atom
: IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
| '(' expr ')'
;
// ... lexer rules ...
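For example, the input A+B+C is parsed as a single additionExpr containing three multiplyExpr children (each reducing to an atom: A, B and C) separated by the two '+' tokens.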
Also see this related answer: ANTLR: Is there a simple example?
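To illustrate how the layered rules above encode precedence without left recursion, here is a small hand-written recursive-descent parser in Python with one method per grammar rule. It is a sketch, not ANTLR output: the tokenizer, the `Parser` class, the tuple-shaped tree, and the omission of STRING, TRUE and FALSE atoms are all simplifications made for illustration. Unlike the flat child list ANTLR would produce for `a (op a)*`, this sketch folds the operands into left-nested tuples.

```python
import re

TOKEN = re.compile(r"\s*(==|!=|>=|<=|[-+*/<>()]|[A-Za-z_][A-Za-z0-9_]*|[0-9]+)")
OPERAND = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+")

def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def binary(self, operand, operators):
        # Implements: operand (operator operand)*  folded to the left.
        node = operand()
        while self.peek() in operators:
            op = self.take()
            node = (op, node, operand())
        return node

    def expr(self):
        return self.equality()

    def equality(self):
        return self.binary(self.relational, ("==", "!="))

    def relational(self):
        return self.binary(self.addition, (">=", "<=", ">", "<"))

    def addition(self):
        return self.binary(self.multiply, ("+", "-"))

    def multiply(self):
        return self.binary(self.atom, ("*", "/"))

    def atom(self):
        tok = self.take()
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not OPERAND.fullmatch(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Each level only ever calls the next tighter-binding level for its operands, which is why `*` groups before `+` in `parse("A+B*C")` and why no rule ever calls itself as its first step.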
I fixed it by making a new rule for the part at the end that I made from removing left recursion:
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
) lr*
;
lr: PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr;
The 'expr' rule in the ANTLR grammar below is obviously mutually left-recursive. As an ANTLR newbie I find it difficult to get my head around solving this. I've read "Resolving Non-LL(*) Conflicts" in the ANTLR reference book, but I still don't see the solution. Any pointers?
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
AND : ( 'AND' | '&' | 'EN' ) ;
OR : ( 'OR' | '|' | 'OF' );
NOT : ('-' | 'NOT' | 'NIET' );
WS : ( ' ' | '\t' | '\r' | '\n' ) {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
input : expr EOF;
expr : (andexpr | orexpr | notexpr | atom);
andexpr : expr AND expr;
orexpr : expr OR expr;
notexpr : NOT expr;
phrase : '"' WORD* '"';
atom : (phrase | WORD);
I would suggest having a look at the example grammars on the ANTLR site. The Java grammar does what you want.
Basically, you can do something like this:
expr : andexpr;
andexpr : orexpr (AND andexpr)*;
orexpr : notexpr (OR orexpr)*;
notexpr : atom | NOT expr;
The key is that every expression can ultimately end in an atom.