Jison rule precedence - grammar

I'm trying to create a grammar for a programming language in Jison, and have run into a problem with calls. Functions in my language are invoked with the following syntax:
functionName arg1 arg2 arg3
Arguments that aren't just simple expressions need to be wrapped in parentheses, like this:
functionName (1 + 2) (3 + 3) (otherFunction 5)
However, there is a bug in my grammar that causes my parser to interpret functionName arg1 arg2 arg3 as functionName(arg1(arg2(arg3))) instead of functionName(arg1, arg2, arg3).
The relevant part of my jison grammar file looks like this:
expr:
| constantExpr { $$ = $1; }
| binaryExpr { $$ = $1; }
| callExpr { $$ = $1; }
| tupleExpr { $$ = $1; }
| parenExpr { $$ = $1; }
| identExpr { $$ = $1; }
| blockExpr { $$ = $1; }
;
callArgs:
| callArgs expr { $$ = $1.concat($2); }
| expr { $$ = [$1]; }
;
callExpr:
| path callArgs { $$ = ast.Expr.Call($1, $2); }
;
identExpr:
| path { $$ = ast.Expr.Ident($1); }
;
How can I make Jison prefer the callArgs rather than the expr?

You might be able to do this by playing games with precedence relations, but I think the most straightforward solution is to be clear.
What you want to say is that callArgs cannot directly contain a callExpr. As in your example, if you want to pass a callExpr as an argument, you need to enclose it in parentheses, in which case it will match some other production (presumably parenExpr).
So you can write that directly:
callArgExpr
: constantExpr
| binaryExpr
| tupleExpr
| parenExpr
| identExpr
| blockExpr
;
expr
: callArgExpr
| callExpr
;
callArgs
: callArgs callArgExpr { $$ = $1.concat($2); }
| callArgExpr { $$ = [$1]; }
;
callExpr
: path callArgs { $$ = ast.Expr.Call($1, $2); }
;
In fact, it's likely that you want to restrict callArgs even further, since (if I understand correctly) func a + b does not mean "apply func to a + b", which would have been written func (a + b). So you might also want to remove binaryExpr from callArgExpr, and possibly some others. I hope the model above shows how to do that.
By the way, I removed all the empty productions, assuming that they were unintentional (unless jison has some exception for that syntax; I'm not really a jison expert). And I removed { $$ = $1; }, which I believe is as unnecessary in jison as in the classic yacc/bison/etc., since it is the default action.
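To see the effect of keeping callExpr out of the argument position, here is a minimal hand-written sketch in Python rather than Jison. It is not the grammar above; the token set (identifiers, integers, parentheses), the tuple-shaped AST, and the names Parser, parse_expr and parse_call_arg_expr are all assumptions made for illustration. Parentheses simply return the inner expression instead of building a separate parenExpr node.

```python
import re

def tokenize(src):
    # Assumed token set: identifiers, integers, and parentheses only.
    return re.findall(r"[A-Za-z_]\w*|\d+|[()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def starts_arg(self):
        # An argument can start with anything except ")" or end of input.
        tok = self.peek()
        return tok is not None and tok != ")"

    def parse_expr(self):
        # expr : callArgExpr | callExpr
        head = self.parse_call_arg_expr()
        if head[0] == "ident" and self.starts_arg():
            # callExpr : path callArgs, where each argument is a callArgExpr
            args = []
            while self.starts_arg():
                args.append(self.parse_call_arg_expr())
            return ("call", head[1], args)
        return head

    def parse_call_arg_expr(self):
        # callArgExpr never contains a bare call; a call must be parenthesized.
        tok = self.next()
        if tok == "(":
            inner = self.parse_expr()
            assert self.next() == ")", "expected ')'"
            return inner
        if tok.isdigit():
            return ("const", int(tok))
        return ("ident", tok)

def parse(src):
    return Parser(tokenize(src)).parse_expr()
```

With this shape, parse("f a b c") yields one call with three arguments, and a nested call only appears when it is written in parentheses, as in parse("f (g 5) 3").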

A precise answer would require seeing the rest of your grammar. I am not sure this is right, but from what I saw in your code, you could create a rule explicitly for the arguments, in the order that you want, without nesting one inside the other:
args:
| "(" simple_expression ")" args { /*Do something with $2*/ }
| "\n"
;
I hope this has helped you a little. Greetings.

How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)?

I need to tokenize everything that is "outside" any comment, until end of line. For instance:
take me */ and me /* but not me! */ I'm in! // I'm not...
Tokenized as (STR is the "outside" string, BC is block-comment and LC is single-line-comment):
{
STR: "take me */ and me ", // note the "*/" in the string!
BC : " but not me! ",
STR: " I'm in! ",
LC : " I'm not..."
}
And:
/* starting with don't take me */ ...take me...
Tokenized as:
{
BC : " starting with don't take me ",
STR: " ...take me..."
}
The problem is that STR can be anything except the comments, and since the comment openers are not single-character tokens I can't use a negation rule for STR.
I thought maybe to do something like:
STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;
But I don't know how to look-ahead for characters in lexer actions.
Is it even possible to accomplish with the ANTLR4 lexer, if yes then how?
Yes, it is possible to perform the tokenization you are attempting.
Based on what has been described above, you want nested comments. These can be achieved in the lexer alone, without actions, predicates, or any other code. To support nested comments, it's easier if you do not rely on ANTLR's greedy/non-greedy options; you will need to spell the nesting out in the lexer grammar. Below are the three lexer rules you will need, along with the STR definition.
I added a parser rule for testing. I've not tested this, but it should do everything you mentioned. Also, it's not limited to 'end of line'; you can make that modification if you need to.
/*
All 3 COMMENTS are Mutually Exclusive
*/
DOC_COMMENT
: '/**'
( [*]* ~[*/] // Cannot START/END Comment
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*'+ '/' -> channel( DOC_COMMENT )
;
BLK_COMMENT
: '/*'
(
( /* Must never match an '*' in position 3 here, otherwise
there is a conflict with the definition of DOC_COMMENT
*/
[/]? ~[*/] // No START/END Comment
| DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
)
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*/' -> channel( BLK_COMMENT )
;
INL_COMMENT
: '//'
( ~[\n\r*/] // No NEW_LINE
| INL_COMMENT // Nested Inline Comment
)* -> channel( INL_COMMENT )
;
STR // Consume everything up to the start of a COMMENT
: ( ~'/' // Any Char not used to START a Comment
| '/' ~[*/] // Cannot START a Comment
)+
;
start
: DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| STR
;
Try something like this:
grammar T;
@lexer::members {
// Returns true iff either "//" or "/*" is ahead in the char stream.
boolean startCommentAhead() {
return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
}
}
// other rules
STR
: ( {!startCommentAhead()}? . )+
;
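The lookahead idea behind the predicate can be sketched outside ANTLR as well. The following is a hand-written Python scanner, not generated lexer code; the function name tokenize and the (kind, text) tuple output are assumptions for illustration. It treats "/*" and "//" as the only comment openers, does not nest comments, and lets an unterminated block comment run to the end of the input.

```python
def tokenize(src):
    """Split src into (kind, text) pairs: STR, BC (/* ... */) and LC (// ...)."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        if src.startswith("/*", i):
            end = src.find("*/", i + 2)
            if end == -1:
                # Unterminated block comment: take the rest of the input.
                tokens.append(("BC", src[i + 2:]))
                i = n
            else:
                tokens.append(("BC", src[i + 2:end]))
                i = end + 2
        elif src.startswith("//", i):
            end = src.find("\n", i)
            if end == -1:
                end = n
            tokens.append(("LC", src[i + 2:end]))
            i = end  # the newline itself starts the next STR
        else:
            start = i
            # Same test as startCommentAhead(): stop just before "//" or "/*".
            while i < n and not (src.startswith("/*", i) or src.startswith("//", i)):
                i += 1
            tokens.append(("STR", src[start:i]))
    return tokens
```

Note that a stray "*/" outside a comment stays inside STR, because only the two opener sequences end a STR run.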

YACC Grammar: Operator precedence issue

I'm trying to get: (20 + (-3)) * 3 / (20 / 3) / 2 to equal 4. Right now it equals 17.
So basically it's doing (20/3) then dividing that by 2, then dividing 3 by [(20/3)/2], then multiplying that by 17. Not sure how to alter my grammar/rules/precedences to get it to read correctly. Any guidance would be appreciated, thanks.
%%
start: PROGRAMtoken IDtoken IStoken compoundstatement
compoundstatement: BEGINtoken {print_header();} statement semi ENDtoken {print_end();}
semi: SEMItoken statement semi
|
statement: IDtoken EQtoken exp
{ regs[$1] = $3; }
| PRINTtoken exp
{ cout << $2 << endl; }
| declaration
declaration: VARtoken IDtoken comma
comma: COMMAtoken IDtoken comma
|
exp: exp PLUStoken term
{ $$ = $1 + $3; }
| exp MINUStoken term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| factor TIMEStoken term
{$$ = $1 * $3;
}
| factor DIVIDEtoken term
{ $$ = $1 / $3;
}
factor: ICONSTtoken
{ $$ = $1;}
| IDtoken
{ $$ = regs[$1]; }
| LPARENtoken exp RPARENtoken
{ $$ = $2;}
%%
My tokens and types look like:
%token BEGINtoken
%token COMMAtoken
%left DIVIDEtoken
%left TIMEStoken
%token ENDtoken
%token EOFtoken
%token EQtoken
%token <value> ICONSTtoken
%token <value> IDtoken
%token IStoken
%token LPARENtoken
%left PLUStoken MINUStoken
%token PRINTtoken
%token PROGRAMtoken
%token RPARENtoken
%token SEMItoken
%token VARtoken
%type <value> exp
%type <value> term
%type <value> factor
You really wanted someone to work hard to give you an answer, which is why the question has hung around for a year. Have a read of this help page: https://stackoverflow.com/help/how-to-ask, in particular the part about simplifying the problem. There are lots of rules in your grammar file that were not needed to reproduce the problem. We did not need:
%token BEGINtoken
%token COMMAtoken
%token ENDtoken
%token EOFtoken
%token EQtoken
%token <value> IDtoken
%token IStoken
%token PROGRAMtoken
%token VARtoken
%%
start: PROGRAMtoken IDtoken IStoken compoundstatement
compoundstatement: BEGINtoken {print_header();} statement semi ENDtoken {print_end();}
semi: SEMItoken statement semi
|
| declaration
declaration: VARtoken IDtoken comma
comma: COMMAtoken IDtoken comma
|
You could have just removed these tokens and rules to get to the heart of the operator precedence issue. We did not need any variables, declarations, assignments or program structure to illustrate the failure. Learning to simplify is the heart of competent debugging and thus programming. If you'd done this simplification more people would have had a go at answering. I'm saying this not for the OP, but for those that will follow with similar problems!
I'm wondering what school is setting this assignment, as I've seen a fair number of yacc questions on SO around the same dumb problem. I suspect more will come here every year, so answering this will help them. I knew what the issue was on inspection of the grammar, but to test my solution I had to code up a working lexer, some symbol table routines, a main program and other ancillary code. Again, another deterrent for problem solvers.
Let's get to the heart of the problem. You have these token declarations:
%left DIVIDEtoken
%left TIMEStoken
%left PLUStoken MINUStoken
These tell yacc that, wherever the rules are ambiguous, the operators associate to the left. Your rules for these operators are:
exp: exp PLUStoken term
{ $$ = $1 + $3; }
| exp MINUStoken term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| factor TIMEStoken term
{$$ = $1 * $3;
}
| factor DIVIDEtoken term
{ $$ = $1 / $3;
}
However, these rules are not ambiguous, so the operator precedence declarations are never consulted. Yacc simply follows the unambiguous grammar you have written. The exp rules are left-recursive, so + and - already associate to the left. The term rules, however, are right-recursive (factor TIMEStoken term), which tells yacc that * and / associate to the right, the opposite of what you want. The simple arithmetic in your example shows exactly that: the multiplication and divisions were being evaluated right to left. There were really big clues there, weren't there?
OK. How to change the associativity? One way would be to make the grammar ambiguous again so that the %left declaration is used, or just flip the right-recursive rules around to invert the associativity. That's what I did for term (exp is already left-recursive, so it stays as it is):
exp: exp PLUStoken term
{ $$ = $1 + $3; }
| exp MINUStoken term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| term TIMEStoken factor
{$$ = $1 * $3;
}
| term DIVIDEtoken factor
{ $$ = $1 / $3;
}
Do you see what I did there? I rotated the grammar rule around the operator.
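The difference between the two rule shapes can be checked with a small Python sketch. This is not yacc output; the helper names apply_op, eval_left and eval_right are made up for illustration, the parenthesized sub-expressions are assumed to be evaluated already, and integer division is assumed, as in the question's calculator.

```python
from functools import reduce

def apply_op(op, a, b):
    # Integer arithmetic; the operands in this example are non-negative.
    return a * b if op == "*" else a // b

def eval_left(first, rest):
    """term: term OP factor  ->  ((a op b) op c) ..."""
    return reduce(lambda acc, pair: apply_op(pair[0], acc, pair[1]), rest, first)

def eval_right(first, rest):
    """term: factor OP term  ->  a op (b op (c ...))"""
    if not rest:
        return first
    op, nxt = rest[0]
    return apply_op(op, first, eval_right(nxt, rest[1:]))

# (20 + (-3)) * 3 / (20 / 3) / 2, with the parenthesized parts pre-evaluated
first, rest = 17, [("*", 3), ("/", 20 // 3), ("/", 2)]

print(eval_left(first, rest))   # left-recursive rules
print(eval_right(first, rest))  # right-recursive rules
```

The left-recursive evaluation gives 4 and the right-recursive one gives 17, matching the two results described in the question.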
Now for some more disclaimers. I said this is a dumb exercise. Interpreting expressions on the fly is a poor use of the yacc tool, and not what happens in real compilers or interpreters. In a realistic implementation, a parse tree would be built and the value calculations would be performed during the tree walk. This would then enable the issues of undeclared variables to be resolved (which also occurs in this exercise). The use of the regs array to store values is also dumb, because there is clearly an ancillary symbol table in use to return a unique integer ID for the symbols. In a real compiler/interpreter those values would also be stored in that symbol table.
I hope this tutorial helps future students understand these parsing issues in their classwork.

How to handle the precedence of assignment operator in a PHP parser?

I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I cannot handle one of PHP's tricky features. My problem is with the precedence of the assignment operator. As the PHP manual says, assignment is almost at the end of the precedence list. Only and, xor, or and , come after it.
But there is a note on the manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser; I can handle it as a special case in the assignment rule.
But there are more complex cases, e.g.:
if ($a && $b = func()) {}
My parser fails here because it first recognizes $a && $b and cannot deal with the rest of the condition. This is because && has higher precedence than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
In this way the parser recognizes the structure well.
The operators are built the way the ANTLR book recommends: the base expressions come first, and each level of operators follows in turn.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment; let's call it an assignment expression instead. Put this assignment expression "below" the unary expressions (so that it has a higher precedence than the unary operators):
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:

yacc associativity of nonterminal symbols?

Say I have a grammar like this:
expr : expr '+' expr { $$ = operation('+', $1, $3); }
| expr '-' expr { $$ = operation('-', $1, $3); }
| expr '*' expr { $$ = operation('*', $1, $3); }
| expr '/' expr { $$ = operation('/', $1, $3); }
| num
;
Where each of those operators has a precedence attached and is marked as left associative.
Then I want to refactor my grammar such that:
op : '+' | '-' | '*' | '/' ;
expr : expr op expr { $$ = operation($2, $1, $3); }
| num
;
How does yacc (if even at all) determine the associativity and precedence of op in this case? Will it trace its way through all the possible precedences/associativities of +, -, * and / when evaluating op, or does defining an associativity for nonterminal symbols make no sense?
AFAIK, a rule takes its precedence from the rightmost terminal symbol in it, but I can't find any documentation on how associativity applies when the operator is hidden behind a nonterminal.
The "normal" way to do this (as far as I'm aware) is to define a different expr type for each operator, that way you get very explicit control over what's happening.
Python's grammar is a good example of this: http://docs.python.org/reference/grammar.html.
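The "one expression type per operator level" layout translates directly into a hand-written parser. The following Python sketch is an illustration under assumptions, not yacc: the level names additive, multiplicative and primary, the tuple AST, and the integer-only token set are all invented here. Each level loops over its own operators, which makes them left-associative, and calls the next tighter level for its operands, which encodes precedence.

```python
import re

def tokenize(src):
    # Assumed token set: integers, the four operators, and parentheses.
    return re.findall(r"\d+|[-+*/()]", src)

def parse(src):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def additive():
        # additive : additive ('+' | '-') multiplicative | multiplicative
        node = multiplicative()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, multiplicative())
        return node

    def multiplicative():
        # multiplicative : multiplicative ('*' | '/') primary | primary
        node = primary()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, primary())
        return node

    def primary():
        tok = advance()
        if tok == "(":
            node = additive()
            advance()  # consume the closing ")"
            return node
        return int(tok)

    return additive()
```

Because precedence and associativity live in the structure of the levels, no per-operator declarations are needed, which is the explicit control the layered grammar gives you.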

Bison Syntax Error (Beginner)

I'm back and now writing my own language and my OS, but as I'm now starting in the development of my own development language, I'm getting some errors when using Bison and I don't know how to solve them. This is my *.y file code:
input:
| input line
;
line: '\n'
| exp '\n' { printf ("\t%.10g\n", $1); }
;
exp: NUM { $$ = $1; }
| exp exp '+' { $$ = $1 + $2; }
| exp exp '-' { $$ = $1 - $2; }
| exp exp '*' { $$ = $1 * $2; }
| exp exp '/' { $$ = $1 / $2; }
/* Exponentiation */
| exp exp '^' { $$ = pow ($1, $2); }
/* Unary minus */
| exp 'n' { $$ = -$1; }
;
%%
And when I try to use Bison with this source code I'm getting this error:
calc.y:1.1-5: syntax error, unexpected identifier:
You need a '%%' before the rules as well as after them (or, strictly, instead; if there is no code after the second '%%', you can omit that line).
You will also need a '%token NUM' before the first '%%'; the grammar then passes Bison.
Alternatively, upgrading to bison version 3.0.4 may help. I guess the file syntax changed between versions 2.x and 3.x.