Say I have a grammar like this:
expr : expr '+' expr { $$ = operation('+', $1, $3); }
| expr '-' expr { $$ = operation('-', $1, $3); }
| expr '*' expr { $$ = operation('*', $1, $3); }
| expr '/' expr { $$ = operation('/', $1, $3); }
| num
;
Where each of those operators has a precedence attached and is marked as left associative.
Then I want to refactor my grammar such that:
op : '+' | '-' | '*' | '/' ;
expr : expr op expr { $$ = operation($2, $1, $3); }
| num
;
How does yacc determine the associativity and precedence of op in this case, if it does at all? Will it trace its way through all the possible precedences/associativities of +, -, * and / when evaluating op, or does defining an associativity for nonterminal symbols make no sense?
AFAIK, with precedence order for nonterminals, it uses the precedence of the rightmost terminal symbol, but I can't find any documentation on the associativity rules themselves for nonterminals.
The "normal" way to do this (as far as I'm aware) is to define a different expr type for each operator, that way you get very explicit control over what's happening.
Python's grammar is a good example of this: http://docs.python.org/reference/grammar.html.
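To make the precedence question concrete: in yacc and Bison, a rule takes its precedence from the last terminal in its right-hand side (unless overridden with %prec). The sketch below is a hypothetical Python model of just that lookup, not yacc's actual implementation; the table values and the function name are illustrative assumptions.

```python
# Hypothetical model of how yacc assigns a precedence to a rule.
# Levels and names are illustrative; a higher number binds tighter.
PRECEDENCE = {
    "+": (1, "left"),
    "-": (1, "left"),
    "*": (2, "left"),
    "/": (2, "left"),
}
NONTERMINALS = {"expr", "op"}


def rule_precedence(rhs):
    """Return (level, assoc) of the last terminal in rhs, or None."""
    terminals = [sym for sym in rhs if sym not in NONTERMINALS]
    if not terminals:
        return None  # no terminal in the rule, so the rule has no precedence
    return PRECEDENCE.get(terminals[-1])


print(rule_precedence(["expr", "+", "expr"]))   # (1, 'left')
print(rule_precedence(["expr", "op", "expr"]))  # None
```

Because the rule expr op expr contains no terminal, it has no precedence of its own, so the declarations on +, -, * and / cannot be used to resolve its shift/reduce conflicts. yacc does not look through op to the operators behind it.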
My ANTLR Grammar for simple expressions is as below:
This grammar works for most of the scenarios except when I try to use negative numbers.
abs(1.324) is valid
abs(-1.324) is being thrown as an error.
Or if the expression is just a negative number such as -1.344 I am having the following error in the console.
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| expr ( MOD ) expr
| NUM
| ID
| STRING
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
MOD: '%';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: ([0-9]*[.])?[0-9]+;
STRING: '"' ~ ["]* '"';
fragment ID_NODE: [a-zA-Z_$][a-zA-Z0-9_$]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '/*' .*? '*/' -> skip;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
WS: [ \r\t\n]+ -> skip;
The grammar seems fine to me. It could be a bug with the runtime you're using, but that seems odd to me, given you're not doing anything special.
With the Java runtime, this is what I get when parsing/lexing the input abs(-1.324):
The following tokens are produced:
ID `abs`
OPEN_PAR `(`
MIN `-`
NUM `1.324`
CLOSE_PAR `)`
EOF `<EOF>`
and the entry point parse gives:
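As a rough cross-check of the token stream listed above, here is a minimal Python sketch that approximates the lexer rules with regular expressions. It is not the ANTLR runtime: STRING and the comment rules are left out, and the regex alternation picks the first matching rule rather than ANTLR's longest match, which makes no difference for this input.

```python
import re

# Approximation of the lexer rules from the grammar (STRING and comments omitted).
TOKEN_SPEC = [
    ("NUM", r"(?:[0-9]*[.])?[0-9]+"),
    ("ID", r"[a-zA-Z_$][a-zA-Z0-9_$]*(?:\.[a-zA-Z_$][a-zA-Z0-9_$]*)*"),
    ("MUL", r"\*"),
    ("DIV", r"/"),
    ("MIN", r"-"),
    ("ADD", r"\+"),
    ("MOD", r"%"),
    ("OPEN_PAR", r"\("),
    ("CLOSE_PAR", r"\)"),
    ("WS", r"[ \r\t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token type, text) pairs, ending with EOF."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", "<EOF>"))
    return tokens


for kind, value in tokenize("abs(-1.324)"):
    print(kind, value)
```

The minus sign comes out as its own MIN token in front of NUM, so the MIN expr alternative of the parser rule is what has to pick it up.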
My grammar is working, but I have a bunch of elements in the tree that are single element arrays, and I don't really understand why. I tried reading the information about visitors, but I'm pretty sure the "problem" is with the grammar and perhaps its verbosity. Does anything jump out here? Or perhaps I'm just visiting things incorrectly. In the example below I do not react to visitFnArgs or visitArgs, but just visitFunctionCall. Things like function arguments and statements seem to sometimes be wrapped in single element arrays.
grammar Txl;
root: program;
// High level language
program: stmt (NEWLINE stmt)* NEWLINE? EOF # Statement
;
stmt: require # Condition
| entry # CreateEntry
| assignment # Assign
;
require: REQUIRE valueExpression;
entry: (CREDIT | DEBIT) journal valueExpression (IF valueExpression)? (LPAREN 'id:' valueExpression RPAREN)?;
assignment: IDENT ASSIGN valueExpression;
journal: IDENT COLON IDENT;
valueExpression: expr # Expression;
expr: expr (MULT | DIV) expr # MulDiv
| expr (PLUS | MINUS) expr # AddSub
| expr MOD expr # Mod
| expr POW expr # Pow
| MINUS expr # Negative
| expr AND expr # And
| expr OR expr # Or
| NOT expr # Not
| expr EQ expr # Equality
| expr NEQ expr # Inequality
| expr (LTE | GTE) expr # CmpEqual
| expr (LT | GT) expr # Cmp
| expr QUESTION expr COLON expr # Ternary
| LPAREN expr RPAREN # Parens
| NUMBER # NumberLiteral
| IDENT LPAREN args RPAREN # FunctionCall
| IDENT # Identifier
| STRING_LITERAL # StringLiteral
;
fnArg: expr | journal;
args: (fnArg (',' fnArg)*)?;
// Reserved words
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
REQUIRE: 'require';
// Operators
MULT: '*';
DIV: '/';
MINUS: '-';
PLUS: '+';
POW: '^';
MOD: '%';
LPAREN: '(';
RPAREN: ')';
LBRACE: '[';
RBRACE: ']';
COMMA: ',';
EQ: '==';
NEQ: '!=';
GTE: '>=';
LTE: '<=';
GT: '>';
LT: '<';
ASSIGN: '=';
QUESTION: '?';
COLON: ':';
AND: 'and';
OR: 'or';
NOT: 'not';
HASH: '#';
NEWLINE : [\r\n];
WS: [ \t] + -> skip;
// Entities
NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
IDENT: [a-zA-Z]+[0-9a-zA-Z]*;
EXTID: [a-zA-Z0-9-]+;
STRING_LITERAL : '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
This input:
require balance(assets:cash) + balance(assets:earnings) > AMT
Produces the following single element arrays:
SINGLE ELEMENT INSTRUCTION MathOperation (>)
SINGLE ELEMENT INSTRUCTION JournalReference { identifier: 'assets:cash' }
SINGLE ELEMENT INSTRUCTION JournalReference { identifier: 'assets:earnings' }
I wonder if partly my problem is I'm not visiting things properly. Here's my Math visitor:
visitMath(ctx) {
const visited = this.visitChildren(ctx);
return new MathOperation(
visited[0],
ctx.getChild(1).getText(),
visited[2],
);
}
But I assume the problem is in the thing that contains the math operation, which I think is visitRequire:
visitRequire(ctx) {
return new Condition(this.visitExpression(ctx.getChild(1)));
}
Or perhaps in visitValueExpression or visitCondition, which are not overridden in my visitor.
Really short answer: There's nothing wrong with single element arrays. If there was only one instance of a thing that could exist multiple times, then it has to be an array (or List), and that list will have only the one item, if that's how many there are.
Antlr won't "unwrap" a single item to not be in an array. (That would only be valid in untyped languages or languages that allow Union types, and would be a pain to use as you'd always have to check whether you had a "thing" or a list of "thing"s)
Any time the "same type of thing" can exist more than once when matching a rule, ANTLR will make that available as an Array/List of that type.
Example:
journal: IDENT COLON IDENT;
has 2 IDENT tokens, so it'll be made accessible via the context as a List of those types
(in Java, I'm not positive which language you're using).
public List<TerminalNode> IDENT() { return getTokens(TxlParser.IDENT); }
Two of your examples are of "JournalReference" so this would explain getting a list (if you use the ctx.IDENT() or the ctx.getChild(n) methods).
If I change the Journal rule to be:
journal: j1=IDENT COLON j2=IDENT;
I've given names to each IDENT, so I get individual accessors for them (in addition to the IDENT() accessor that returns a list):
public static class JournalContext extends ParserRuleContext {
public Token j1;
public Token j2;
public TerminalNode COLON() { return getToken(TxlParser.COLON, 0); }
public List<TerminalNode> IDENT() { return getTokens(TxlParser.IDENT); }
With the labels you can use ctx.j1 or ctx.j2 to get individual tokens (of course you'd name them as appropriate to your use case).
Since the FunctionCall alternative of the expr rule uses the args rule
args: (fnArg (',' fnArg)*)?;
and that rule can have more than one fnArg, it will necessarily be a list of fnArgs in the context:
public static class ArgsContext extends ParserRuleContext {
public List<FnArgContext> fnArg() {
return getRuleContexts(FnArgContext.class);
}
There's really not much you can do (or should want to do) to not have that in a List: there can be zero, one or more of them.
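The point about lists can be illustrated without ANTLR. The following is a hypothetical Python stand-in for the args rule (the function names are made up for illustration): it always returns a list, so callers handle zero, one or many arguments with the same code.

```python
def parse_args(text):
    """Mimic args: (fnArg (',' fnArg)*)? by always returning a list."""
    text = text.strip()
    if not text:
        return []  # no arguments at all
    return [part.strip() for part in text.split(",")]


def describe(args):
    """A caller never has to ask whether it got one item or a list."""
    return ", ".join(f"arg[{i}]={arg}" for i, arg in enumerate(args))


print(parse_args("assets:cash"))             # ['assets:cash']
print(describe(parse_args("assets:cash")))   # arg[0]=assets:cash
print(describe(parse_args("a, b")))          # arg[0]=a, arg[1]=b
```

A single argument simply shows up as a one-element list, which is the shape reported in the question.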
Since none of the code you present shows where you're writing your output, it's a bit difficult to be more specific than that.
Your visitMath(ctx) example is also a bit perplexing, as math is not a rule in your grammar, so it would not exist in the Visitor interface.
I would suggest taking a closer look at the *Context classes that are generated for you. They provide utility methods that will be much easier to use and read in the future than getChild(n). getChild(n) is obscure, in that you'll have to refer back to the rule and diligently count rule members to determine which child to get, and it is also VERY brittle, in that n will change with any modification to your grammar. (Maintainers, or future you, will appreciate using the utility methods instead.)
I'm trying to convert my existing Antlr3 project to Antlr4 to get more functionality. I have this grammar that wouldn't compile with Antlr4.9
expr
: term ( OR^ term )* ;
and
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
Mostly because Antlr4 doesn't support ^ and ! anymore. From the documentation it seems like those are
AST root operator. When generating abstract syntax trees (ASTs), token
references suffixed with the "^" root operator force AST nodes to be
created and added as the root of the current tree. This symbol is only
effective when the buildAST option is set. More information about ASTs
is also available.
AST exclude operator. When generating abstract syntax trees, token
references suffixed with the "!" exclude operator are not included in
the AST constructed for that rule. Rule references can also be
suffixed with the exclude operator, which implies that, while the tree
for the referenced rule is constructed, it is not linked into the tree
for the referencing rule. This symbol is only effective when the
buildAST option is set. More information about ASTs is also available.
If I take those out it compiles, but I'm not sure what they mean or how Antlr4 supports them.
LPAREN and RPAREN are tokens
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
Antlr4 kindly explains how to convert those in its error messages, but not ^ and !. The grammar is for parsing boolean expressions, for example (a=b AND b=c)
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
The v3 grammar:
...
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
...
expr
: term ( OR^ term )* ;
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
in v4 would look like this:
...
expr
: term ( OR term )* ;
factor
: ava | NOT factor | (LPAREN expr RPAREN) ;
EQUALS : '=';
LPAREN : '(';
RPAREN : ')';
So, just remove the inline ^ and ! operators (tree rewriting is no longer available in ANTLR4), and move the literal tokens in the tokens { ... } section into their own lexer rules.
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
What you posted there is part of a tree grammar, for which there is no equivalent in ANTLR4. In ANTLR4 you'd use a visitor to evaluate your expressions instead of doing it inside a tree grammar.
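To show the shape of the visitor approach, here is a minimal Python sketch over a hand-built tree. It does not use the ANTLR runtime or any generated classes: the node classes, the Evaluator class and the pairs set (a stand-in for targetingContext.contains) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Equals:
    left: str
    right: str


@dataclass
class Not:
    operand: object


@dataclass
class And:
    left: object
    right: object


@dataclass
class Or:
    left: object
    right: object


class Evaluator:
    """Walks the tree and returns a boolean, one visit_* method per node type."""

    def __init__(self, pairs):
        self.pairs = set(pairs)  # stand-in for targetingContext

    def visit(self, node):
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Equals(self, node):
        return (node.left, node.right) in self.pairs

    def visit_Not(self, node):
        return not self.visit(node.operand)

    def visit_And(self, node):
        return self.visit(node.left) and self.visit(node.right)

    def visit_Or(self, node):
        return self.visit(node.left) or self.visit(node.right)


# (a=b AND b=c)
tree = And(Equals("a", "b"), Equals("b", "c"))
print(Evaluator({("a", "b"), ("b", "c")}).visit(tree))  # True
print(Evaluator({("a", "b")}).visit(tree))              # False
```

With ANTLR4 the same logic would live in a subclass of the generated base visitor, with one method per rule or labeled alternative; the evaluation code moves out of the grammar and into that class.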
I am trying to create a calculator using lex and yacc. However, I cannot work out how to give operator precedence to this program, and I could not find any information about it. What do I need to add to my project so that it calculates correctly?
Yacc file is:
%{
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
int yylex();
void yyerror(const char *s);
%}
%token INTEGER
%left '*' '/'
%left '+' '-'
%%
program:
program line | line
line:
expr ';' { printf("%d\n",$1); } ; | '\n'
expr:
expr '+' term { $$ = $1 + $3; }
| expr '-' term { $$ = $1 - $3; }
| expr '*' term { $$ = $1 * $3; }
| expr '/' term { $$ = $1 / $3; }
| expr '%' term { $$ = $1 % $3; }
| expr '^' term { $$ = $1 ; }
| term { $$ = $1; }
term:
INTEGER { $$ = $1; }
%%
void yyerror(const char *s) { fprintf(stderr,"%s\n",s); return ; }
int main(void) { /*yydebug=1;*/ yyparse(); return 0; }
Lex file is:
%{
#include <stdlib.h>
#include <stdio.h>
void yyerror(char*);
extern int yylval;
#include "calc.tab.h"
#include<time.h>
%}
%%
[ \t]+ ; //skip whitespace
[0-9]+ {yylval = atoi(yytext); return INTEGER;}
[-+*/%^] {return *yytext;}
\n {return *yytext;}
; {return *yytext;}
. {char msg[25]; sprintf(msg,"%s <%s>","invalid character",yytext); yyerror(msg);}
%left '*' '/'
%left '+' '-'
Precedence declarations are specified in the order from lowest precedence to highest. So in the above code you give * and / the lowest precedence level and + and - the highest. That's the opposite order of what you want, so you'll need to switch the order of these two lines. You'll also want to add the operators % and ^, which are currently part of your grammar, but not your precedence annotations.
With those changes, you'll now have specified the precedence you want, but it won't take effect yet. Why not? Because precedence annotations are used to resolve ambiguities, but your grammar isn't actually ambiguous.
The way you've written the grammar, with only the left operand of all operators being expr and the right operand being term, there's only one way to derive an expression like 2+4*2, namely by deriving 2+4 from expr and 2 from term (because deriving 4*2 from term would be impossible since term can only match a single number). So your grammar treats all operators as left-associative and having the same precedence and your precedence annotations aren't considered at all.
In order for the precedence annotations to be considered, you'll have to change your grammar, so that both operands of the operators are expr (e.g. expr '+' expr instead of expr '+' term). Written like that an expression like 2+4*2 could either be derived by deriving 2+4 from expr as the left operand and 2 from expr as the right operand or 2 as the left and 4*2 as the right and this ambiguity will be resolved using your precedence annotations.
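The difference between the two grammar shapes can be sketched in Python. This is a hand-written illustration, not what yacc generates: eval_flat mimics expr OP term (strict left to right), while eval_prec mimics expr OP expr with %left declarations, using precedence climbing. Placing % at the same level as * and / is an assumption, and ^ is left out because the original action gives it no meaning.

```python
import re

# Listed from lowest to highest precedence, the same order yacc expects.
LEVELS = [("+", "-"), ("*", "/", "%")]
PREC = {op: level for level, ops in enumerate(LEVELS, start=1) for op in ops}


def tokenize(text):
    return re.findall(r"\d+|[-+*/%]", text)


def apply(op, a, b):
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a // b  # matches C integer division for non-negative operands
    return a % b


def eval_flat(tokens):
    """Mimic `expr: expr OP term`: left to right, precedence ignored."""
    value = int(tokens[0])
    for i in range(1, len(tokens), 2):
        value = apply(tokens[i], value, int(tokens[i + 1]))
    return value


def eval_prec(tokens, min_prec=1):
    """Mimic `expr: expr OP expr` with %left levels (consumes the token list)."""
    value = int(tokens.pop(0))
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Requiring a strictly higher level on the right gives left associativity.
        rhs = eval_prec(tokens, PREC[op] + 1)
        value = apply(op, value, rhs)
    return value


print(eval_flat(tokenize("2+4*2")))  # 12
print(eval_prec(tokenize("2+4*2")))  # 10
```

The first result is what the grammar in the question computes; the second is what you get once both operands are expr and the precedence lines are in the right order.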
I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I cannot handle one of the tricky features of PHP. My problem is with the precedence of the assignment operator. As the PHP manual says, the precedence of assignment is almost at the end of the list. Only and, xor, or and , come after it in the list.
But there is a note on the manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser; I can handle it as a special case in the assignment rule.
But there is more complex code, e.g.:
if ($a && $b = func()) {}
My parser fails here, because it first recognizes $a && $b and cannot deal with the rest of the condition. This is because && has higher precedence than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
In this way the parser recognizes the structure well.
The operators are built the way the ANTLR book recommends: the base expressions come first, and each level of operators follows the previous one.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment, but let's name it an assignment expression. Put this assignment expression "below" the unary expressions (so that it has a higher precedence than the unary operators):
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:
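The effect of the rule layering on the three conditions can be traced with a small hand-written parser. The sketch below is a hypothetical Python recursive-descent parser, not ANTLR output: it mirrors only or_expr, and_expr, unary_expr, assign_expr, atom and func_call from the grammar above, and it returns nested tuples instead of ANTLR's AST nodes (the EXPR_LIST node is flattened into a tuple of arguments).

```python
import re


def tokenize(text):
    # Simplified lexer: characters that match no alternative are skipped.
    return re.findall(r"\$[a-z]+|[a-z]+|[0-9]+|&&|\|\||[=!(),-]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def or_expr(self):  # and_expr ('||' and_expr)*
        node = self.and_expr()
        while self.peek() == "||":
            node = (self.take(), node, self.and_expr())
        return node

    def and_expr(self):  # unary_expr ('&&' unary_expr)*
        node = self.unary_expr()
        while self.peek() == "&&":
            node = (self.take(), node, self.unary_expr())
        return node

    def unary_expr(self):  # ('!' | '-')? assign_expr
        if self.peek() in ("!", "-"):
            return (self.take(), self.assign_expr())
        return self.assign_expr()

    def assign_expr(self):  # Var ('=' atom)* | atom
        tok = self.peek()
        if tok is not None and tok.startswith("$"):
            node = self.take()
            while self.peek() == "=":
                node = (self.take(), node, self.atom())
            return node
        return self.atom()

    def atom(self):  # Num | Id '(' expr_list ')'
        tok = self.take()
        if tok.isdigit():
            return tok
        if not tok.isalpha():
            raise SyntaxError(f"unexpected token {tok!r}")
        self.take("(")
        args = []
        if self.peek() != ")":
            args.append(self.or_expr())
            while self.peek() == ",":
                self.take()
                args.append(self.or_expr())
        self.take(")")
        return ("FUNC_CALL", tok, tuple(args))


def parse_condition(text):
    parser = Parser(tokenize(text))
    tree = parser.or_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse_condition("$a && $b = func()"))
# ('&&', '$a', ('=', '$b', ('FUNC_CALL', 'func', ())))
```

Because assign_expr sits below and_expr, the assignment $b = func() is grouped first and becomes the right operand of &&, without needing the extra parentheses.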