Grammar in ANTLR and the selected words - antlr

EDIT: I changed the example to explain better what I am trying to get.
This is my grammar:
INTEGER : ' int ';
LET : [a-z] ;
cchar : LET | '-' | ' ' ;
wor : cchar+;
aaa : wor+ | wor* INTEGER wor* ;
aaa is the root. And writing eg.: 'xx int xx int'.
I would like to get a result: 'x x int x x i n t'. Only the first int should be catched, the next one should not give the "extraneous input" mistake but be splitted into letters.
How can I fix it?

this seems to work as you wanted:
LET : [a-z];
INT : 'int ';
cchar : LET | '-' | ' ';
wor: cchar+;
int_string: INT;
aaa: (wor|int_string)+;
What this grammar says is that: alow me a word or an integer declaration, where integer is a declaration if it is 'int' followed by a space defined as a lexer item, everything else are words.
Now the following does not work:
LET : [a-z];
INT : 'int';
cchar : LET | '-' | ' ';
wor: cchar+;
int_string: INT ' ';
aaa: (wor|int_string)+;
After moving the space to parser rules instead of lexer rules, it fails to parse 'intt' for example, in fact any word that has an 'int' substring. It happens because the lexer part seems to read any occurence of 'int' as INT and even wor does not parse 'intt' as a string now, it tries to match (wor int (cchar t)) and it fails for some reason not matching 'int' as separate cchars.
The first example's wor rule parses 'intt' as (wor (cchar i) (cchar n) (cchar t) (cchar t)). Which makes sense. The first example's grammar can't match it in the lexer phase because the space character required for lexer rule INT is not there in 'intt'.
Why has it done so? I'd think that it is because the lexer runs before the parser and what the parser gets is already the semantic equivalent. Even replacing the lexer rule INT with 'int' in int_string in the second example yields the same behaviour as I expect antlr just generates a hidden lexer rule for that match. Not 100% certain though.
Tell me if this helps, if I come up with a way to fix the second case, I'll make an edit :)

Related

How does one remove shift/reduce errors caused by limited lookahead?

In my grammar, it's usually possible to only use declaration which looks like:
int x, y, z = 23;
int i = 1;
int j;
In for I'd like to use a set of comma separated declarations of different types e.g.
for (int i = 0, double d = 2.0; i < 0; i++) { ... }
Using yacc, the limited lookahead creates problems. Here is the naive grammar:
variables_decl
: type_expression IDENT
| variables_decl ',' IDENT
;
declaration
: variables_decl
| variables_decl '=' initializer
;
declaration_list
: declaration
| declaration_list ',' declaration
;
This causes a shift/reduce error on ',':
state 149
100 variables_decl: variables_decl . ',' IDENT
101 declaration: variables_decl .
102 | variables_decl . '=' initializer
',' shift, and go to state 261
'=' shift, and go to state 262
',' [reduce using rule 101 (declaration)]
$default reduce using rule 101 (declaration)
I'd like fix this issue so that this actually works:
for (double x, y, int i, j = 0, long l = 1; i < 0; i++) { ... }
But it's not obvious to me how to do that.
In general terms, you avoid this type of shift/reduce conflict by avoiding forcing the parser to make a decision until absolutely necessary.
It's understandable why you have structured the grammar as you have; intuitively, a declaration list is a list of declarations, where each declaration is a type and a list of variables of that type. The problem is that this definition makes it impossible to know whether a comma belongs to an inner or outer list.
Moreover, one extra lookahead token might not be enough, since the following IDENT could be a typename or the name of a variable to be declared, assuming type-expression is the usual C syntax which can start with an identifier corresponding to a typename.
But that's not the only way to look at the declaration list syntax. Instead, you can think of it as a list of individual declarations, each of which starts with an optional type (except the first in the list, which must have an explicit type), using the semantic convention that an omitted type is the same as the type of the previous variable. That leads to the following conflict-free grammar:
declaration_list: explicit_decl
| declaration_list ',' declaration
declaration : explicit_decl
| implicit_decl
explicit_decl : type_expression implicit_decl
implicit_decl : IDENT opt_init
opt_init : %empty | '=' expr
That does not capture the syntax of C declarations, since not all C declarations have the form type_expression IDENT. The IDENT being defined can be buried inside the declaration, as with, for example, int a[4] or int f(int i); fortunately, these forms are of limited use in a for loop.
On the other hand, unlike your grammar it does allow all declared variables to be initialised, so
int a = 1, b = 0, double x = -1.0, y = 0.0
should work.
Another note: the first item in a C for clause can be empty, a declaration (possibly in the form of a list) or an expression. In the last case, a top-level , is an operator, not a list indicator.
In short, the above fragment might or might not be a solution in the context of your actual grammar. But it is conflict-free in a simple test framework where typed declarations are always of the form typename identifier.

ANTLR: empty condition not working

I want to be able to parse int [] or int tokens.
Consider the following grammar:
TYPE : 'int' AFTERINT;
AFTERINT: '[' ']';
Of course it works, but only for int []. To make it work for int too, I changed AFTERINT to this (added an empty condition':
AFTERINT: '[' ']' |
|;
But now I get this warning and error:
[13:34:08] warning(200): MiniJava.g:5:9: Decision can match input
such as "" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input [13:34:08]
error(201): MiniJava.g:5:9: The following alternatives can never be
matched: 3
Why won't empty condition work?
The lexer cannot cope with tokens that match empty string. If you think about it for a moment, this is not surprising: after all, there are an infinite amount of empty strings in your input. The lexer would always produce an empty string as a valid token, resulting in an infinite loop.
The recognition of types does not belong in the lexer, but in the parser:
type
: (INT | DOUBLE | BOOLEAN | ID) (OBR CBR)?
;
OBR : '[';
CBR : ']';
INT : 'int';
DOUBLE : 'double';
BOOLEAN : 'boolean';
ID : ('a'..'z' | 'A'..'Z')+;
Whenever you start combining different type of characters to create a (single) token, it's usually better to create a parser rule for this. Think of lexer rules (tokens) as the smallest building block of your language. From these building blocks, you compose parser rules.

byacc shift/reduce

I'm having trouble figuring this one out as well as the shift reduce problem.
Adding ';' to the end doesn't solve the problem since I can't change the language, it needs to go just as the following example. Does any prec operand work?
The example is the following:
A variable can be declared as: as a pointer or int as integer, so, both of this are valid:
<int> a = 0
int a = 1
the code goes:
%left '<'
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It obviously gives a shift/reduce problem afer expr. since it can shift for expr of "less" operator or reduce for another variable declaration.
I want precedence to be given on variable declaration, and have tried to create a %nonassoc prec_aux and put after '<' type '>' %prec prec_aux and after type tNAME but it doesn't solve my problem :S
How can I solve this?
Output was:
Well cant figure hwo to post linebreaks and code on reply... so here it goes the output:
35: shift/reduce conflict (shift 47, reduce 7) on '<'
state 35
variable : type tNAME '=' expr . (7)
expr : expr . '+' expr (26)
expr : expr . '-' expr (27)
expr : expr . '*' expr (28)
expr : expr . '/' expr (29)
expr : expr . '%' expr (30)
expr : expr . '<' expr (31)
expr : expr . '>' expr (32)
'>' shift 46
'<' shift 47
'+' shift 48
'-' shift 49
'*' shift 50
'/' shift 51
'%' shift 52
$end reduce 7
tINT reduce 7
Thats the output and the error seems the one I mentioned.
Does anyone know a different solution, other than adding a new terminal to the language that isn't really an option?
I think the resolution is to rewrite the grammar so it can lookahead somehow and see if its a type or expr after the '<' but I'm not seeing how to.
Precedence is unlikely to work since its the same character. Is there a way to give precendence for types that we define? such as declaration?
Thanks in advance
Your grammar gets confused in text like this:
int a = b
<int> c
That '<' on the second line could be part of an expression in the first declaration. It would have to look ahead further to find out.
This is the reason most languages have a statement terminator. This produces no conflicts:
%%
%token tNAME;
%token tINT;
%token tINTEGER;
%token tTERM;
%left '<';
declaration: variable
| declaration variable
variable : type tNAME '=' expr tTERM
| type tNAME tTERM
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It helps when creating a parser to know how to design a grammar to eliminate possible conflicts. For that you would need an understanding of how parsers work, which is outside the scope of this answer :)
The basic problem here is that you need more lookahead than the 1 token you get with yacc/bison. When the parser sees a < it has no way of telling whether its done with the preivous declaration and its looking at the beginning of a bracketed type, or if this is a less-than operator. There's two basic things you can do here:
Use a parsing method such as bison's %glr-parser option or btyacc, which can deal with non-LR(1) grammars
Use the lexer to do extra lookahead and return disambiguating tokens
For the latter, you would have the lexer do extra lookahead after a '<' and return a different token if its followed by something that looks like a type. The easiest is to use flex's / lookahead operator. For example:
"<"/[ \t\n\r]*"<" return OPEN_ANGLE;
"<"/[ \t\n\r]*"int" return OPEN_ANGLE;
"<" return '<';
Then you change your bison rules to expect OPEN_ANGLE in types instead of <:
type : OPEN_ANGLE type '>'
| tINT
expr : tINTEGER
| expr '<' expr
For more complex problems, you can use flex start states, or even insert an entire token filter/transform pass between the lexer and the parser.
Here is the fix, but not entirely satisfactory:
%{
%}
%token tNAME tINT tINTEGER
%left '<'
%left '+'
%nonassoc '=' /* <-- LOOK */
%%
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
| expr '+' expr
;
This issue is a conflict between these two LR items: the dot-final:
variable : type tNAME '=' expr_no_less .
and this one:
expr : expr . '<' expr
Notice that these two have different operators. It is not, as you seem to think, a conflict between different productions involving the '<' operator.
By adding = to the precedence ranking, we fix the issue in the sense that the conflict diagnostic goes away.
Note that I gave = a high precedence. This will resolve the conflict by favoring the reduce. This means that you cannot use a '<' expression as an initializer:
int name = 4 < 3 // syntax error
When the < is seen, the int name = 4 wants to be reduced, and the idea is that < must be the start of the next declaration, as part of a type production.
To allow < relational expressions to be used as initializers, add the support for parentheses into the expression grammar. Then users can parenthesize:
int foo = (4 < 3) <int> bar = (2 < 1)
There is no way to fix that without a more powerful parsing method or hacks.
What if you move the %nonassoc before %left '<', giving it low precedence? Then the shift will be favored. Unfortunately, that has the consequence that you cannot write another <int> declaration after a declaration.
int foo = 3 <int> bar = 4
^ // error: the machine shifted and is now doing: expr '<' . expr.
So that is the wrong way to resolve the conflict; you want to be able to write multiple such declarations.
Another Note:
My TXR language, which implements something equivalent to Parse Expression Grammars handles this grammar fine. This is essentially LL(infinite), which trumps LALR(1).
We don't even have to have a separate lexical analyzer and parser! That's just something made necessary by the limitations of one-symbol-lookahead, and the need for utmost efficiency on 1970's hardware.
Example output from shell command line, demonstrating the parse by translation to a Lisp-like abstract syntax tree, which is bound to the variable dl (declaration list). So this is complete with semantic actions, yielding an output that can be further processed in TXR Lisp. Identifiers are translated to Lisp symbols via calls to intern and numbers are translated to number objects also.
$ txr -l type.txr -
int x = 3 < 4 int y
(dl (decl x int (< 3 4)) (decl y int nil))
$ txr -l type.txr -
< int > x = 3 < 4 < int > y
(dl (decl x (pointer int) (< 3 4)) (decl y (pointer int) nil))
$ txr -l type.txr -
int x = 3 + 4 < 9 < int > y < int > z = 4 + 3 int w
(dl (decl x int (+ 3 (< 4 9))) (decl y (pointer int) nil)
(decl z (pointer int) (+ 4 3)) (decl w int nil))
$ txr -l type.txr -
<<<int>>>x=42
(dl (decl x (pointer (pointer (pointer int))) 42))
The source code of (type.txr):
#(define ws)#/[ \t]*/#(end)
#(define int)#(ws)int#(ws)#(end)
#(define num (n))#(ws)#{n /[0-9]+/}#(ws)#(filter :tonumber n)#(end)
#(define id (id))#\
#(ws)#{id /[A-Za-z_][A-Za-z_0-9]*/}#(ws)#\
#(set id #(intern id))#\
#(end)
#(define type (ty))#\
#(local l)#\
#(cases)#\
#(int)#\
#(bind ty #(progn 'int))#\
#(or)#\
<#(type l)>#\
#(bind ty #(progn '(pointer ,l)))#\
#(end)#\
#(end)
#(define expr (e))#\
#(local e1 op e2)#\
#(cases)#\
#(additive e1)#{op /[<>]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(additive e)#\
#(end)#\
#(end)
#(define additive (e))#\
#(local e1 op e2)#\
#(cases)#\
#(num e1)#{op /[+\-]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(num e)#\
#(end)#\
#(end)
#(define decl (d))#\
#(local type id expr)#\
#(type type)#(id id)#\
#(maybe)=#(expr expr)#(or)#(bind expr nil)#(end)#\
#(bind d #(progn '(decl ,id ,type ,expr)))#\
#(end)
#(define decls (dl))#\
#(coll :gap 0)#(decl dl)#(end)#\
#(end)
#(freeform)
#(decls dl)

ANTLR lexer rule consumes characters even if not matched?

I've got a strange side effect of an antlr lexer rule and I've created an (almost) minimal working example to demonstrate it.
In this example I want to match the String [0..1] for example. But when I debug the grammar the token stream that reaches the parser only contains [..1]. The first integer, no matter how many digits it contains is always consumed and I've got no clue as to how that happens. If I remove the FLOAT rule everything is fine so I guess the mistake lies somewhere in that rule. But since it shouldn't match anything in [0..1] at all I'm quite puzzled.
I'd be happy for any pointers where I might have gone wrong. This is my example:
grammar min;
options{
language = Java;
output = AST;
ASTLabelType=CommonTree;
backtrack = true;
}
tokens {
DECLARATION;
}
declaration : LBRACEVAR a=INTEGER DDOTS b=INTEGER RBRACEVAR -> ^(DECLARATION $a $b);
EXP : 'e' | 'E';
LBRACEVAR: '[';
RBRACEVAR: ']';
DOT: '.';
DDOTS: '..';
FLOAT
: INTEGER DOT POS_INTEGER
| INTEGER DOT POS_INTEGER EXP INTEGER
| INTEGER EXP INTEGER
;
INTEGER : POS_INTEGER | NEG_INTEGER;
fragment NEG_INTEGER : ('-') POS_INTEGER;
fragment POS_INTEGER : NUMBER+;
fragment NUMBER: ('0'..'9');
The '0' is discarded by the lexer and the following errors are produced:
line 1:3 no viable alternative at character '.'
line 1:2 extraneous input '..' expecting INTEGER
This is because when the lexer encounters '0.', it tries to create a FLOAT token, but can't. And since there is no other rule to fall back on to match '0.', it produces the errors, discards '0' and creates a DOT token.
This is simply how ANTLR's lexer works: it will not backtrack to match an INTEGER followed by a DDOTS (note that backtrack=true only applies to parser rules!).
Inside the FLOAT rule, you must make sure that when a double '.' is ahead, you produce a INTEGER token instead. You can do that by adding a syntactic predicate (the ('..')=> part) and produce FLOAT tokens only when a single '.' is followed by a digit (the ('.' DIGIT)=> part). See the following demo:
declaration
: LBRACEVAR INTEGER DDOTS INTEGER RBRACEVAR
;
LBRACEVAR : '[';
RBRACEVAR : ']';
DOT : '.';
DDOTS : '..';
INTEGER
: DIGIT+
;
FLOAT
: DIGIT+ ( ('.' DIGIT)=> '.' DIGIT+ EXP?
| ('..')=> {$type=INTEGER;} // change the token here
| EXP
)
;
fragment EXP : ('e' | 'E') DIGIT+;
fragment DIGIT : ('0'..'9');

Antlr lexer tokens that match similar strings, what if the greedy lexer makes a mistake?

It seems that sometimes the Antlr lexer makes a bad choice on which rule to use when tokenizing a stream of characters... I'm trying to figure out how to help Antlr make the obvious-to-a-human right choice. I want to parse text like this:
d/dt(x)=a
a=d/dt
d=3
dt=4
This is an unfortunate syntax that an existing language uses and I'm trying to write a parser for. The "d/dt(x)" is representing the left hand side of a differential equation. Ignore the lingo if you must, just know that it is not "d" divided by "dt". However, the second occurrence of "d/dt" really is "d" divided by "dt".
Here's my grammar:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : DDT ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
When using this grammar the lexer grabs the first "d/dt(" and turns it to the token DDT. Perfect! Now later the lexer sees the second "d" followed by a "/" and says "hmmm, I can match this as an ID and a '/' or I can be greedy and match DDT". The lexer chooses to be greedy... but little does it know, there is no "(" a few characters later in the input stream. When the lexer looks for the missing "(" it throws a MismatchedTokenException!
The only solution I've found so far, is to move all the rules into the parser with a grammar like:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt id ')' '=' id;
assignment
: id '=' number
| id '=' id '/' id
;
ddt : 'd' '/' 'd' 't' '(';
id : CHAR+;
number : DIGIT+;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
NEWLINE : '\r\n'|'\r'|'\n';
This is a fine solution if I didn't already have thousands of lines of working code that depend on the first grammar working. After spending 2 days researching this problem I have come to the conclusion that a lexer... really ought to be able to distinguish the two cases. At some point the Antlr lexer is deciding between two rules: DDT and ID. It chooses DDT because the lexer is greedy. But when matching DDT fails, I'd like the lexer to go back to using ID.
I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).
Ideally I can modify the lexer rule for DDT with any valid Antlr code... and be done.
My target language is Java.
Thanks!
UPDATE
Thank you guys for some great answers!! I accepted the answer that best fit my question. The actual solution I used is in my own answer (not the accepted answer), and there are more answers that could have worked. Readers, check out all the answers; some of them may suit your case better than mine.
I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).
In that case, force the lexer to look ahead in the char-stream to make sure there really is "d/dt(" using a gated syntactic predicate.
A demo:
grammar diffeq_grammar;
#parser::members {
public static void main(String[] args) throws Exception {
String src =
"d/dt(x)=a\n" +
"a=d/dt\n" +
"d=3\n" +
"dt=4\n";
diffeq_grammarLexer lexer = new diffeq_grammarLexer(new ANTLRStringStream(src));
diffeq_grammarParser parser = new diffeq_grammarParser(new CommonTokenStream(lexer));
parser.program();
}
}
#lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
program
: (statement? NEWLINE)* EOF
;
statement
: diffeq {System.out.println("diffeq : " + $text);}
| assignment {System.out.println("assignment : " + $text);}
;
diffeq
: DDT ID ')' '=' ID
;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : {ahead("d/dt(")}?=> 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n' | '\r' | '\n';
If you now run the demo:
java -cp antlr-3.3.jar org.antlr.Tool diffeq_grammar.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar diffeq_grammarParser
(when using Windows, replace the : with ; in the last command)
you will see the following output:
diffeq : d/dt(x)=a
assignment : a=d/dt
assignment : d=3
assignment : dt=4
Although this is not what you are trying to do considering the large amount of working code that you have in the project, you should still consider separating your parser and lexer more thoroughly. I is best to let the parser and the lexer do what they do best, rather than "fusing" them together. The most obvious indication of something being wrong is the lack of symmetry between your ( and ) tokens: one is part of a composite token, while the other one is a stand-alone token.
If refactoring is at all an option, you could change the parser and lexer like this:
grammar diffeq_grammar;
program : (statement? NEWLINE)* EOF; // <-- You forgot EOF
statement
: diffeq
| assignment;
diffeq : D OVER DT OPEN id CLOSE EQ id; // <-- here, id is a parser rule
assignment
: id EQ NUMBER
| id EQ id OVER id
;
id : ID | D | DT; // <-- Nice trick, isn't it?
D : 'D';
DT : 'DT';
OVER : '/';
EQ : '=';
OPEN : '(';
CLOSE : ')';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
You may need to enable backtracking and memoization for this to work (but try compiling it without backtracking first).
Here's the solution I finally used. I know it violates one of my requirements: to keep lexer rules in the lexer and parser rules in the parser, but as it turns out moving DDT to ddt required no change in my code. Also, dasblinkenlight makes some good points about mismatched parenthesis in his answer and comments.
grammar ddt_problem;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
ddt : ( d=ID ) { $d.getText().equals("d") }? '/' ( dt=ID ) { $dt.getText().equals("dt") }? '(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';