ANTLR’s predicated-LL(*) parsing mechanism - antlr

I'm building the following grammar:
Letter : 'a'..'z'|'A'..'Z' ;
Number : '0'..'9' ;
Float
: Number+ '.' Number+
;
a5
@init
{
int n = 1;
}
: ({n<=5}?=>(Letter|Number){n++;})+
;
It does not parse the string "CD923IJK" the way I need: a5 should consume "CD923", but what actually gets consumed is "CDIJK".
If the Float rule is commented out, the problem disappears and "CD923" is consumed as I want.
I assumed this needs deeper lookahead, since the grammar is LL(k), so I set the lookahead depth:
options
{
k=5;
}
But that did not solve anything. Any ideas?
UPDATE
In response to the suggestion from 500 - Internal Server Error:
I added the following rule
public test :a5 Float ;
I need to match CD9231.23 where CD923 is an alphanumeric and 1.23 a float. But see parse tree:

The problem seems to be in the Number and Float rules. The two rules are ambiguous, and because both Number and Float are lexer rules, you have to remember that ANTLR implicitly creates a nextToken rule that handles all the tokens. The nextToken rule for this example looks like this:
nextToken: Letter | Number | Float;
When ANTLR finds a digit, it walks through the DFA to decide which rule to jump to, but in this case it can't decide which rule (Number or Float) is the right one. You can avoid this behavior by making Float a parser rule. You can try something like this:
grammar a5;
s : a5 coordinate?
;
a5
@init {
int n = 0;
}
: ({n<5}?=> (LETTER|Number){n++;})+
;
Number : '0'..'9'
;
coordinate : Number+ '.' Number+
;
LETTER
: 'a'..'z'|'A'..'Z'
;

I think there may be a simpler solution: if you know that the items matched by your a5 rule will always be text of length 5 or less, you can write the rule accordingly:
A5
: (Letter|Number)(Letter|Number)(Letter|Number)(Letter|Number)(Letter|Number)
| (Letter|Number)(Letter|Number)(Letter|Number)(Letter|Number)
| (Letter|Number)(Letter|Number)(Letter|Number)
| (Letter|Number)(Letter|Number)
| (Letter|Number)
;
Another solution would be to write the rule without taking the length into account, and then check it in a semantic phase:
AK
: (Letter|Number)+
;
These are some ideas, hope they help...
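The "check it in a semantic phase" idea can be sketched in plain Java. This is only a minimal illustration under assumptions: the class name A5Check and the method isValidA5 are hypothetical and not part of ANTLR, and the limit of 5 is taken from the a5 rule in the question. The check would run on the text of a token matched by AK.

```java
// Hypothetical helper, not part of ANTLR; names are illustrative only.
final class A5Check {
    static final int MAX_LENGTH = 5; // limit taken from the a5 rule in the question

    /** Returns true if text is 1 to MAX_LENGTH ASCII letters or digits. */
    static boolean isValidA5(String text) {
        if (text.isEmpty() || text.length() > MAX_LENGTH) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = c >= '0' && c <= '9';
            if (!letter && !digit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidA5("CD923"));    // true
        System.out.println(isValidA5("CD923IJK")); // false: 8 characters
    }
}
```

The grammar then stays simple, and the length rule lives in one place where it can produce a readable error message.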

Related

ANTLR: empty condition not working

I want to be able to parse int [] or int tokens.
Consider the following grammar:
TYPE : 'int' AFTERINT;
AFTERINT: '[' ']';
Of course it works, but only for int []. To make it work for int too, I changed AFTERINT to this (added an empty condition):
AFTERINT: '[' ']' |
|;
But now I get this warning and error:
[13:34:08] warning(200): MiniJava.g:5:9: Decision can match input such as "" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
[13:34:08] error(201): MiniJava.g:5:9: The following alternatives can never be matched: 3
Why won't empty condition work?
The lexer cannot cope with tokens that match the empty string. If you think about it for a moment, this is not surprising: after all, there is an infinite number of empty strings in your input. The lexer would always produce an empty string as a valid token, resulting in an infinite loop.
The recognition of types does not belong in the lexer, but in the parser:
type
: (INT | DOUBLE | BOOLEAN | ID) (OBR CBR)?
;
OBR : '[';
CBR : ']';
INT : 'int';
DOUBLE : 'double';
BOOLEAN : 'boolean';
ID : ('a'..'z' | 'A'..'Z')+;
Whenever you start combining different type of characters to create a (single) token, it's usually better to create a parser rule for this. Think of lexer rules (tokens) as the smallest building block of your language. From these building blocks, you compose parser rules.
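To illustrate why the optional brackets fit naturally in a parser rule, here is a minimal Java sketch of the type rule above as a hand-written check over a token list. It is not what ANTLR generates: tokens are plain strings, and the names TypeRule and matchType are hypothetical. The point is that the optional part either consumes both brackets or nothing, so no token ever has to match the empty string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: tokens are plain strings, not ANTLR Token objects.
final class TypeRule {
    /**
     * Mirrors: type : (INT | DOUBLE | BOOLEAN | ID) (OBR CBR)? ;
     * Returns how many tokens the rule consumes starting at pos, or -1 if it does not match.
     */
    static int matchType(List<String> tokens, int pos) {
        // INT, DOUBLE, BOOLEAN and ID are all runs of letters in this sketch.
        if (pos >= tokens.size() || !tokens.get(pos).matches("[a-zA-Z]+")) {
            return -1;
        }
        // The optional part either matches both brackets or consumes nothing.
        boolean brackets = pos + 2 < tokens.size()
                && tokens.get(pos + 1).equals("[")
                && tokens.get(pos + 2).equals("]");
        return brackets ? 3 : 1;
    }

    public static void main(String[] args) {
        System.out.println(matchType(Arrays.asList("int"), 0));           // 1
        System.out.println(matchType(Arrays.asList("int", "[", "]"), 0)); // 3
    }
}
```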

Simple Antlr3 Token parsing

While I'm somewhat comforted by the number of questions about ANTLR grammars (it's not just me trying to shave this yak-shaped thing), I haven't found a question/answer that comes close to helping with my issue.
I'm using ANTLR 3.3 with a combined lexer/parser grammar.
I'm using gUnit to help prove the grammar, and some jUnit tests; this is where the fun begins.
I have a simple config file I want to parse:
identifier foobar {
port=8080
stub plusone.google.com {
status-code = 206
header = []
body = []
}
}
I'm having trouble parsing the "identifier" (foobar in this example):
Valid names I want to allow are:
foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3
and so on. A valid name can therefore use the characters 'a'..'z', 'A'..'Z', '0'..'9', '_' and '-'.
The grammar I've arrived at is this (note this isn't the full grammar, just the portion pertinent to this question):
fragment HYPHEN : '-' ;
fragment UNDERSCORE : '_' ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' |'A'..'Z' ;
fragment NUMBER : DIGIT+ ;
fragment WORD : LETTER+ ;
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
and the corresponding gUnit test
IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo#bar" FAIL
// not with whitespace
"foo bar" FAIL
Running the gUnit tests only fails for "5foobar". I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.
Can anyone point me to where I'm going wrong? How can I match without being greedy?
Many thanks in advance.
-- UPDATE --
I changed the grammar as per Bart's answer, to this:
IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;
and this fixed the failing gUnit tests, but broke an unrelated jUnit test that tests the "port" parameter.
The following grammar deals with the "port=8080" element of the config snippet above:
configurationStatement[MiddlemanConfiguration config]
: PORT EQ port=NUMBER {
config.setConfigurationPort(Integer.parseInt(port.getText())); }
| def=proxyDefinition { config.add(def); }
;
The message I get is:
mismatched input '8080' expecting NUMBER
Where NUMBER is defined as NUMBER : ('0'..'9')+ ;
Moving the rule for NUMBER above the IDENTIFIER block fixed this issue.
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;
is equivalent to:
IDENTIFIER
: DIGIT
| LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
So, an IDENTIFIER is either a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.
You probably meant:
IDENTIFIER
: (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
;
However, that also allows 3---3 as a valid IDENTIFIER. Is that what you want?
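A quick way to sanity-check which names the suggested rule accepts is to write it as a regular expression. The sketch below is an assumption-laden stand-in, not gUnit: the class IdentifierCheck is hypothetical, and the pattern is a hand translation of (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a regex translation of the suggested IDENTIFIER rule.
final class IdentifierCheck {
    // First char: digit, letter or underscore; then any of those or a hyphen.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[0-9A-Za-z_][0-9A-Za-z_-]*");

    static boolean isIdentifier(String text) {
        return IDENTIFIER.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("5foobar")); // true
        System.out.println(isIdentifier("3---3"));   // true, as noted above
        System.out.println(isIdentifier("foo#bar")); // false
    }
}
```

Note that a string of digits such as 8080 also matches this pattern, which is why the order of NUMBER and IDENTIFIER in the grammar matters.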

ANTLR lexer rule consumes characters even if not matched?

I've got a strange side effect of an antlr lexer rule and I've created an (almost) minimal working example to demonstrate it.
In this example I want to match the String [0..1] for example. But when I debug the grammar the token stream that reaches the parser only contains [..1]. The first integer, no matter how many digits it contains is always consumed and I've got no clue as to how that happens. If I remove the FLOAT rule everything is fine so I guess the mistake lies somewhere in that rule. But since it shouldn't match anything in [0..1] at all I'm quite puzzled.
I'd be happy for any pointers where I might have gone wrong. This is my example:
grammar min;
options{
language = Java;
output = AST;
ASTLabelType=CommonTree;
backtrack = true;
}
tokens {
DECLARATION;
}
declaration : LBRACEVAR a=INTEGER DDOTS b=INTEGER RBRACEVAR -> ^(DECLARATION $a $b);
EXP : 'e' | 'E';
LBRACEVAR: '[';
RBRACEVAR: ']';
DOT: '.';
DDOTS: '..';
FLOAT
: INTEGER DOT POS_INTEGER
| INTEGER DOT POS_INTEGER EXP INTEGER
| INTEGER EXP INTEGER
;
INTEGER : POS_INTEGER | NEG_INTEGER;
fragment NEG_INTEGER : ('-') POS_INTEGER;
fragment POS_INTEGER : NUMBER+;
fragment NUMBER: ('0'..'9');
The '0' is discarded by the lexer and the following errors are produced:
line 1:3 no viable alternative at character '.'
line 1:2 extraneous input '..' expecting INTEGER
This is because when the lexer encounters '0.', it tries to create a FLOAT token, but can't. And since there is no other rule to fall back on to match '0.', it produces the errors, discards '0' and creates a DOT token.
This is simply how ANTLR's lexer works: it will not backtrack to match an INTEGER followed by a DDOTS (note that backtrack=true only applies to parser rules!).
Inside the FLOAT rule, you must make sure that when a double '.' is ahead, you produce an INTEGER token instead. You can do that by adding a syntactic predicate (the ('..')=> part) and produce FLOAT tokens only when a single '.' is followed by a digit (the ('.' DIGIT)=> part). See the following demo:
declaration
: LBRACEVAR INTEGER DDOTS INTEGER RBRACEVAR
;
LBRACEVAR : '[';
RBRACEVAR : ']';
DOT : '.';
DDOTS : '..';
INTEGER
: DIGIT+
;
FLOAT
: DIGIT+ ( ('.' DIGIT)=> '.' DIGIT+ EXP?
| ('..')=> {$type=INTEGER;} // change the token here
| EXP
)
;
fragment EXP : ('e' | 'E') DIGIT+;
fragment DIGIT : ('0'..'9');
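The lookahead decision made by the two syntactic predicates can be sketched as a small hand-written scanner in Java. This is a simplified model and not the code ANTLR generates: the class RangeLexer is hypothetical, tokens are rendered as strings, and exponents and negative numbers are left out to keep it short. After reading digits, it only commits to FLOAT when a '.' is immediately followed by another digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; exponents and signs are omitted.
final class RangeLexer {
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '[') {
                out.add("LBRACEVAR");
                i++;
            } else if (c == ']') {
                out.add("RBRACEVAR");
                i++;
            } else if (c == '.') {
                if (i + 1 < src.length() && src.charAt(i + 1) == '.') {
                    out.add("DDOTS");
                    i += 2;
                } else {
                    out.add("DOT");
                    i++;
                }
            } else if (isDigit(c)) {
                int start = i;
                while (i < src.length() && isDigit(src.charAt(i))) i++;
                // Look ahead: only '.' followed by a digit makes this a FLOAT.
                if (i + 1 < src.length() && src.charAt(i) == '.' && isDigit(src.charAt(i + 1))) {
                    i++; // consume '.'
                    while (i < src.length() && isDigit(src.charAt(i))) i++;
                    out.add("FLOAT(" + src.substring(start, i) + ")");
                } else {
                    out.add("INTEGER(" + src.substring(start, i) + ")");
                }
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return out;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(tokenize("[0..1]")); // [LBRACEVAR, INTEGER(0), DDOTS, INTEGER(1), RBRACEVAR]
        System.out.println(tokenize("[1.5]"));  // [LBRACEVAR, FLOAT(1.5), RBRACEVAR]
    }
}
```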

Antlr lexer tokens that match similar strings, what if the greedy lexer makes a mistake?

It seems that sometimes the Antlr lexer makes a bad choice on which rule to use when tokenizing a stream of characters... I'm trying to figure out how to help Antlr make the obvious-to-a-human right choice. I want to parse text like this:
d/dt(x)=a
a=d/dt
d=3
dt=4
This is an unfortunate syntax that an existing language uses and I'm trying to write a parser for. The "d/dt(x)" is representing the left hand side of a differential equation. Ignore the lingo if you must, just know that it is not "d" divided by "dt". However, the second occurrence of "d/dt" really is "d" divided by "dt".
Here's my grammar:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : DDT ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
When using this grammar the lexer grabs the first "d/dt(" and turns it to the token DDT. Perfect! Now later the lexer sees the second "d" followed by a "/" and says "hmmm, I can match this as an ID and a '/' or I can be greedy and match DDT". The lexer chooses to be greedy... but little does it know, there is no "(" a few characters later in the input stream. When the lexer looks for the missing "(" it throws a MismatchedTokenException!
The only solution I've found so far, is to move all the rules into the parser with a grammar like:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt id ')' '=' id;
assignment
: id '=' number
| id '=' id '/' id
;
ddt : 'd' '/' 'd' 't' '(';
id : CHAR+;
number : DIGIT+;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
NEWLINE : '\r\n'|'\r'|'\n';
This is a fine solution if I didn't already have thousands of lines of working code that depend on the first grammar working. After spending 2 days researching this problem I have come to the conclusion that a lexer... really ought to be able to distinguish the two cases. At some point the Antlr lexer is deciding between two rules: DDT and ID. It chooses DDT because the lexer is greedy. But when matching DDT fails, I'd like the lexer to go back to using ID.
I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).
Ideally I can modify the lexer rule for DDT with any valid Antlr code... and be done.
My target language is Java.
Thanks!
UPDATE
Thank you guys for some great answers!! I accepted the answer that best fit my question. The actual solution I used is in my own answer (not the accepted answer), and there are more answers that could have worked. Readers, check out all the answers; some of them may suit your case better than mine.
You wrote: "I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.)."
In that case, force the lexer to look ahead in the char-stream to make sure there really is "d/dt(" using a gated syntactic predicate.
A demo:
grammar diffeq_grammar;
@parser::members {
public static void main(String[] args) throws Exception {
String src =
"d/dt(x)=a\n" +
"a=d/dt\n" +
"d=3\n" +
"dt=4\n";
diffeq_grammarLexer lexer = new diffeq_grammarLexer(new ANTLRStringStream(src));
diffeq_grammarParser parser = new diffeq_grammarParser(new CommonTokenStream(lexer));
parser.program();
}
}
@lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
program
: (statement? NEWLINE)* EOF
;
statement
: diffeq {System.out.println("diffeq : " + $text);}
| assignment {System.out.println("assignment : " + $text);}
;
diffeq
: DDT ID ')' '=' ID
;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : {ahead("d/dt(")}?=> 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n' | '\r' | '\n';
If you now run the demo:
java -cp antlr-3.3.jar org.antlr.Tool diffeq_grammar.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar diffeq_grammarParser
(when using Windows, replace the : with ; in the last command)
you will see the following output:
diffeq : d/dt(x)=a
assignment : a=d/dt
assignment : d=3
assignment : dt=4
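The effect of the lookahead check can also be seen without ANTLR. The following Java sketch is a simplified hand-written scanner, not the generated lexer: the class DiffeqLexer is hypothetical, and the token names OVER, EQ and CLOSE are only labels chosen for the output. It emits DDT only when the full text "d/dt(" is ahead, and otherwise falls through to ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner that mimics the gated lookahead.
final class DiffeqLexer {
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (line.startsWith("d/dt(", i)) {
                // Same idea as ahead("d/dt("): commit only if all 5 chars match.
                out.add("DDT");
                i += 5;
            } else if (c >= 'a' && c <= 'z') {
                int start = i;
                while (i < line.length() && line.charAt(i) >= 'a' && line.charAt(i) <= 'z') i++;
                out.add("ID(" + line.substring(start, i) + ")");
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < line.length() && line.charAt(i) >= '0' && line.charAt(i) <= '9') i++;
                out.add("NUMBER(" + line.substring(start, i) + ")");
            } else if (c == '/') {
                out.add("OVER");
                i++;
            } else if (c == '=') {
                out.add("EQ");
                i++;
            } else if (c == ')') {
                out.add("CLOSE");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d/dt(x)=a")); // [DDT, ID(x), CLOSE, EQ, ID(a)]
        System.out.println(tokenize("a=d/dt"));    // [ID(a), EQ, ID(d), OVER, ID(dt)]
    }
}
```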
Although this is not what you are trying to do, considering the large amount of working code that you have in the project, you should still consider separating your parser and lexer more thoroughly. It is best to let the parser and the lexer each do what they do best, rather than "fusing" them together. The most obvious indication that something is wrong is the lack of symmetry between your ( and ) tokens: one is part of a composite token, while the other is a stand-alone token.
If refactoring is at all an option, you could change the parser and lexer like this:
grammar diffeq_grammar;
program : (statement? NEWLINE)* EOF; // <-- You forgot EOF
statement
: diffeq
| assignment;
diffeq : D OVER DT OPEN id CLOSE EQ id; // <-- here, id is a parser rule
assignment
: id EQ NUMBER
| id EQ id OVER id
;
id : ID | D | DT; // <-- Nice trick, isn't it?
D : 'd';
DT : 'dt';
OVER : '/';
EQ : '=';
OPEN : '(';
CLOSE : ')';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
You may need to enable backtracking and memoization for this to work (but try compiling it without backtracking first).
Here's the solution I finally used. I know it violates one of my requirements: to keep lexer rules in the lexer and parser rules in the parser, but as it turns out moving DDT to ddt required no change in my code. Also, dasblinkenlight makes some good points about mismatched parenthesis in his answer and comments.
grammar ddt_problem;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
ddt : ( d=ID ) { $d.getText().equals("d") }? '/' ( dt=ID ) { $dt.getText().equals("dt") }? '(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for the CoffeeScript grammar. It's still at an early stage: I have only ported a subset of the original grammar, and I am stuck on expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it only demonstrates a very simple case that I cannot use in real life. The post ANTLR Grammar Tip: LL(*) and Left Factoring gave me more insight, but I still can't get a handle on it.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the troublemaker is the term rule, so I'd appreciate a local fix to it rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers to tips on how to "debug" this kind of erroneous grammar in your head are also welcome.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first, without a usable AST. (Well, whether this is a feasible approach would deserve a separate question.) So I don't care about tokens at the moment; they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: additiveOp
;
// '*' and '/' nest below '+' and '-', so they bind tighter
additiveOp
: multiOp (('+' | '-') multiOp)*
;
multiOp
: term (('*' | '/') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
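To show how an LL parser walks a layered grammar like this one, here is a minimal hand-written recursive-descent sketch in Java with one method per rule level. It is not generated by ANTLR, and the class FunExpParser is hypothetical. Note one assumption: the sketch gives '*' and '/' tighter binding than '+' and '-', the usual arithmetic convention. It returns a fully parenthesized string so the grouping is visible. The choice between atom and functionCall is made by peeking for '(' after an ID.

```java
// Hypothetical recursive-descent sketch of the toy expression grammar.
final class FunExpParser {
    private final String src;
    private int pos = 0;

    private FunExpParser(String src) { this.src = src; }

    /** Parses a whole expression and returns it fully parenthesized. */
    static String parse(String src) {
        FunExpParser p = new FunExpParser(src);
        String result = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) throw p.error("unexpected trailing input");
        return result;
    }

    // Lowest precedence level: '+' and '-'.
    private String expression() {
        String left = multiplicative();
        while (peekIs('+') || peekIs('-')) {
            char op = src.charAt(pos++);
            left = "(" + left + " " + op + " " + multiplicative() + ")";
        }
        return left;
    }

    // Tighter level: '*' and '/'.
    private String multiplicative() {
        String left = term();
        while (peekIs('*') || peekIs('/')) {
            char op = src.charAt(pos++);
            left = "(" + left + " " + op + " " + term() + ")";
        }
        return left;
    }

    // term : atom | functionCall | '(' expression ')'
    private String term() {
        if (peekIs('(')) {
            pos++;
            String inner = expression();
            expect(')');
            return inner;
        }
        if (pos < src.length() && isDigit(src.charAt(pos))) {
            int start = pos;
            while (pos < src.length() && isDigit(src.charAt(pos))) pos++;
            return src.substring(start, pos);
        }
        String id = identifier();
        if (!peekIs('(')) return id; // plain atom
        pos++; // functionCall: ID '(' expression (',' expression)* ')'
        StringBuilder call = new StringBuilder(id).append('(').append(expression());
        while (peekIs(',')) {
            pos++;
            call.append(", ").append(expression());
        }
        expect(')');
        return call.append(')').toString();
    }

    private String identifier() {
        int start = pos;
        while (pos < src.length()
                && (isIdStart(src.charAt(pos)) || (pos > start && isDigit(src.charAt(pos))))) pos++;
        if (pos == start) throw error("expected NUMBER, ID or '('");
        return src.substring(start, pos);
    }

    private boolean peekIs(char c) {
        skipSpaces();
        return pos < src.length() && src.charAt(pos) == c;
    }

    private void expect(char c) {
        if (!peekIs(c)) throw error("expected '" + c + "'");
        pos++;
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    private static boolean isIdStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 + 2 * 3"));         // (1 + (2 * 3))
        System.out.println(parse("f(x, 2) * (a + b)")); // (f(x, 2) * (a + b))
    }
}
```

Because every alternative in term starts with a different token (or can be told apart by one extra token of lookahead), there is no decision that needs backtracking.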
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but start with it: with a bit of luck, the errors below it will also disappear once it is fixed.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
compareOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY match this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
On many occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rule like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent its code blocks with spaces (and tabs) just like Python? But perhaps you're going to handle that at a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefore jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.