ANTLR grammar definition

I am relatively new to compiler theory and I wanted to create a grammar to parse some comparisons in order to evaluate them later. I found ANTLR, which is a powerful tool for specifying grammars. From what I have learned in theory, operators with higher precedence must be declared at deeper levels than operators with lower precedence. Additionally, if I want a rule to be left-associative, the recursion has to be on the left side of the rule. Knowing that, I have created a basic grammar that supports &&, ||, !=, ==, <, >, <=, >=, (, ) and !:
start
: orExpr
;
orExpr
: orExpr OR andExpr
| andExpr
;
andExpr
: andExpr AND eqNotEqExpr
| eqNotEqExpr
;
eqNotEqExpr
: eqNotEqExpr NEQ compExpr
| eqNotEqExpr EQ compExpr
| compExpr
;
compExpr
: compExpr LT compExpr
| compExpr GT compExpr
| compExpr LTEQ compExpr
| compExpr GTEQ compExpr
| notExpr
;
notExpr
: NOT notExpr
| parExpr
;
parExpr
: OPAR orExpr CPAR
| id
;
id
: INT
| FLOAT
| TRUE
| FALSE
| ID
| STRING
| NULL
;
However, searching the internet I found a different way to specify the above grammar, which does not follow the rules I mentioned regarding operator precedence and left associativity:
start
: expr
;
expr
: NOT expr //notExpr
| expr op=(LTEQ | GTEQ | LT | GT) expr //relationalExpr
| expr op=(EQ | NEQ) expr //equalityExpr
| expr AND expr //andExpr
| expr OR expr //orExpr
| atom //atomExpr
;
atom
: OPAR expr CPAR //parExpr
| (INT | FLOAT) //numberAtom
| (TRUE | FALSE) //booleanAtom
| ID //idAtom
| STRING //stringAtom
| NULL //nullAtom
;
Can someone explain why this way of defining the grammar also works? Is it because of some specific treatment by ANTLR, or is it another type of grammar definition?
Below are the operators and identifiers defined for the grammar:
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
NOT : '!';
OPAR : '(';
CPAR : ')';
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
COMMENT
: '//' ~[\r\n]* -> skip
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;

This is specific to ANTLR v4.
Under the hood, a directly left-recursive rule like expr is rewritten into something equivalent to what you wrote by hand, as part of the left-recursion elimination step. Precedence follows the order of the alternatives (an alternative listed earlier binds tighter), and binary operators are left-associative by default. ANTLR does this as a convenience, because LL parsers cannot handle left-recursive rules directly: converting such a rule straight into parser code would produce infinite recursion (a function that unconditionally calls itself).
There is more info and a transformation example in the docs page about left-recursion.
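As a rough illustration of the idea (this is not ANTLR's generated code), a precedence-climbing parser handles a flat rule like expr with a loop instead of a left-recursive call. The sketch below is in Python; the PRECEDENCE table, the tuple-based tree and all function names are assumptions made for this example.

```python
import re

# Binding strength per operator, mirroring the order of the alternatives in
# the flat `expr` rule (earlier alternative = binds tighter). Assumed values.
PRECEDENCE = {
    "<=": 4, ">=": 4, "<": 4, ">": 4,
    "==": 3, "!=": 3,
    "&&": 2,
    "||": 1,
}

TOKEN = re.compile(r"\s*(<=|>=|==|!=|&&|\|\||[<>!()]|\w+)")


def tokenize(text):
    """Split the input into operator, parenthesis and atom tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parse a whole token list and return the tree as nested tuples."""
    tree, pos = parse_expr(tokens, 0, 1)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


def parse_expr(tokens, pos, min_prec):
    """Parse binary operators whose precedence is at least min_prec."""
    left, pos = parse_unary(tokens, pos)
    # A loop takes the place of the left-recursive call `expr op expr`.
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # Demanding a strictly higher precedence on the right-hand side
        # makes every binary operator left-associative.
        right, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, pos


def parse_unary(tokens, pos):
    """Parse `!` prefixes, parenthesized expressions and atoms."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "!":
        operand, pos = parse_unary(tokens, pos + 1)
        return ("!", operand), pos
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1, 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if tok in PRECEDENCE or tok == ")":
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok, pos + 1
```

Calling parse(tokenize("a || b && c == d")) groups == tightest, then &&, then ||, which is the same shape the layered grammar produces.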


How to detect an expression result is unused in an interpreted programming language?

I'm working on a simple procedural interpreted scripting language, written in Java using ANTLR4. Just a hobby project. I have written a few DSLs using ANTLR4 and the lexer and parser presented no real problems. I got quite a bit of the language working by interpreting directly from the parse tree but that strategy, apart from being slow, started to break down when I started to add functions.
So I've created a stack-based virtual machine, based on Chapter 10 of "Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages". I have an assembler for the VM that works well and I'm now trying to make the scripting language generate assembly via an AST.
Something I can't quite see is how to detect when an expression or function result is unused, so that I can generate a POP instruction to discard the value from the top of the operand stack.
I want things like assignment statements to be expressions, so that I can do things like:
x = y = 1;
In the AST, the assignment node is annotated with the symbol (the lvalue) and the rvalue comes from visiting the children of the assignment node. At the end of the visit of the assignment node, the rvalue is stored into the lvalue, and this is reloaded back into the operand stack so that it can be used as an expression result.
This generates (for x = y = 1):
CLOAD 1 ; Push constant value
GSTOR y ; Store into global y and pop
GLOAD y ; Push value of y
GSTOR x ; Store into global x and pop
GLOAD x ; Push value of x
But it needs a POP instruction at the end to discard the result, otherwise the operand stack starts to grow with these unused results. I can't see the best way of doing this.
I guess my grammar could be flawed, which is preventing me seeing a solution here.
grammar g;
// ----------------------------------------------------------------------------
// Parser
// ----------------------------------------------------------------------------
parse
: (functionDefinition | compoundStatement)*
;
functionDefinition
: FUNCTION ID parameterSpecification compoundStatement
;
parameterSpecification
: '(' (ID (',' ID)*)? ')'
;
compoundStatement
: '{' compoundStatement* '}'
| conditionalStatement
| iterationStatement
| statement ';'
;
statement
: declaration
| expression
| exitStatement
| printStatement
| returnStatement
;
declaration
: LET ID ASSIGN expression # ConstantDeclaration
| VAR ID ASSIGN expression # VariableDeclaration
;
conditionalStatement
: ifStatement
;
ifStatement
: IF expression compoundStatement (ELSE compoundStatement)?
;
exitStatement
: EXIT
;
iterationStatement
: WHILE expression compoundStatement # WhileStatement
| DO compoundStatement WHILE expression # DoStatement
| FOR ID IN expression TO expression (STEP expression)? compoundStatement # ForStatement
;
printStatement
: PRINT '(' (expression (',' expression)*)? ')' # SimplePrintStatement
| PRINTF '(' STRING (',' expression)* ')' # PrintFormatStatement
;
returnStatement
: RETURN expression?
;
expression
: expression '[' expression ']' # Indexed
| ID DEFAULT expression # DefaultValue
| ID op=(INC | DEC) # Postfix
| op=(ADD | SUB | NOT) expression # Unary
| op=(INC | DEC) ID # Prefix
| expression op=(MUL | DIV | MOD) expression # Multiplicative
| expression op=(ADD | SUB) expression # Additive
| expression op=(GT | GE | LT | LE) expression # Relational
| expression op=(EQ | NE) expression # Equality
| expression AND expression # LogicalAnd
| expression OR expression # LogicalOr
| expression IF expression ELSE expression # Ternary
| ID '(' (expression (',' expression)*)? ')' # FunctionCall
| '(' expression ')' # Parenthesized
| '[' (expression (',' expression)* )? ']' # LiteralArray
| ID # Identifier
| NUMBER # LiteralNumber
| STRING # LiteralString
| BOOLEAN # LiteralBoolean
| ID ASSIGN expression # SimpleAssignment
| ID op=(CADD | CSUB | CMUL | CDIV) expression # CompoundAssignment
| ID '[' expression ']' ASSIGN expression # IndexedAssignment
;
// ----------------------------------------------------------------------------
// Lexer
// ----------------------------------------------------------------------------
fragment
IDCHR : [A-Za-z_$];
fragment
DIGIT : [0-9];
fragment
ESC : '\\' ["\\];
COMMENT : '#' .*? '\n' -> skip;
// ----------------------------------------------------------------------------
// Keywords
// ----------------------------------------------------------------------------
DO : 'do';
ELSE : 'else';
EXIT : 'exit';
FOR : 'for';
FUNCTION : 'function';
IF : 'if';
IN : 'in';
LET : 'let';
PRINT : 'print';
PRINTF : 'printf';
RETURN : 'return';
STEP : 'step';
TO : 'to';
VAR : 'var';
WHILE : 'while';
// ----------------------------------------------------------------------------
// Operators
// ----------------------------------------------------------------------------
ADD : '+';
DIV : '/';
MOD : '%';
MUL : '*';
SUB : '-';
DEC : '--';
INC : '++';
ASSIGN : '=';
CADD : '+=';
CDIV : '/=';
CMUL : '*=';
CSUB : '-=';
GE : '>=';
GT : '>';
LE : '<=';
LT : '<';
AND : '&&';
EQ : '==';
NE : '!=';
NOT : '!';
OR : '||';
DEFAULT : '??';
// ----------------------------------------------------------------------------
// Literals and identifiers
// ----------------------------------------------------------------------------
BOOLEAN : ('true'|'false');
NUMBER : DIGIT+ ('.' DIGIT+)?;
STRING : '"' (ESC | .)*? '"';
ID : IDCHR (IDCHR | DIGIT)*;
WHITESPACE : [ \t\r\n] -> skip;
ANYCHAR : . ;
So my question is where is the usual place to detect unused expression results, i.e. when expressions are used as plain statements? Is it something I should detect during the parse, then annotate the AST node? Or is this better done when visiting the AST for code generation (assembly generation in my case)? I just can't see where best to do it.
IMO this is not a question of getting the grammar right, but of how you process the AST/parse tree. Whether a result is used can be determined by checking a node's siblings (and its parent's siblings, and so on). An assignment, for instance, is made of the lvalue, the operator and the rvalue, so once you have determined the rvalue, check whether the previous sibling in the tree is an operator. Similarly, you can check whether the parent is a parenthesized expression (for nested function calls, grouping etc.).
statement
: ...
| expression
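If you label this case with # ExpressionStatement, you can generate a pop after every expression statement by overriding exitExpressionStatement() in the listener or visitExpressionStatement() in the visitor. (ANTLR 4 requires labeling either all alternatives of a rule or none, so the other statement alternatives need labels as well.)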
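The same idea can be sketched independently of ANTLR: every expression leaves exactly one value on the operand stack, and the statement that wraps a bare expression is the one place that discards it. The Python sketch below uses the CLOAD, GSTOR, GLOAD and POP instructions from the question; the tuple-based AST node shapes and the function names are hypothetical.

```python
def gen_expression(node, out):
    """Emit code that leaves exactly one value on the operand stack."""
    kind = node[0]
    if kind == "const":
        out.append(f"CLOAD {node[1]}")
    elif kind == "var":
        out.append(f"GLOAD {node[1]}")
    elif kind == "assign":
        _, name, value = node
        gen_expression(value, out)
        out.append(f"GSTOR {name}")  # the store pops the value...
        out.append(f"GLOAD {name}")  # ...so reload it as the expression result
    else:
        raise ValueError(f"unknown expression node: {kind}")


def gen_statement(node, out):
    """Emit code for one statement; the stack depth is unchanged afterwards."""
    kind = node[0]
    if kind == "expr_stmt":
        gen_expression(node[1], out)
        out.append("POP")  # nobody consumes the result, so discard it here
    elif kind == "var_decl":
        _, name, value = node
        gen_expression(value, out)
        out.append(f"GSTOR {name}")  # the declaration consumes the value itself
    else:
        raise ValueError(f"unknown statement node: {kind}")


def compile_program(statements):
    """Generate assembly lines for a list of statement nodes."""
    out = []
    for statement in statements:
        gen_statement(statement, out)
    return out
```

With this split, nested expressions never need to know whether their result is used: only the expression-statement node emits POP, so x = y = 1; produces the five instructions from the question followed by a single POP.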

Remove warnings instead of using backtrack option

I am not sure how to solve this problem without using backtrack=true;.
My sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression
;
expression : binaryExpression
| tupleExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/'|'div'|'inter') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| sumExpression
| '(' expression ')'
;
sumExpression : 'sum'|'div'|'inter' expression
;
tupleExpression : ('<' expression '>' (',' '<' expression '>')*)
;
literalExpression : INT
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
Is there a way to fix the grammar in such a way that no warnings occur? Let's assume I want to choose both alternatives depending on the case.
Thank you in advance!
Note that:
sumExpression : 'sum'|'div'|'inter' expression
;
gets interpreted as:
sumExpression : 'sum' /* nothing */
| 'div' /* nothing */
| 'inter' expression
;
since the | has a low precedence. You probably want:
sumExpression : ('sum'|'div'|'inter') expression
;
Let's assume I want to choose both alternatives depending on the case.
That is not possible: you cannot let the parser choose both (or more) alternatives, it can only choose one.
I assume you know why the grammar is ambiguous? If not, here's why: the input A div B can be parsed in two ways:
alternative 1
unaryExpression 'div' unaryExpression
| |
A B
alternative 2
id sumExpression
| | \
A 'div' B
It looks like you want 'sum', 'div' and 'inter' to be some sort of unary operator, in which case you could just merge them into your unaryExpression rule:
unaryExpression : '!' unaryExpression
| '-' unaryExpression
| 'sum' unaryExpression
| 'div' unaryExpression
| 'inter' unaryExpression
| primitiveElement
;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
That way you don't have any ambiguity. Note that A div B will now be parsed as a multiplyingExpression and A div sum B as:
multiplyingExpression
/ \
'div' unaryExpression
/ / \
A 'sum' B
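To see why the merged rule is unambiguous, here is a minimal hand-written recursive-descent sketch in Python covering only multiplyingExpression and unaryExpression. The function names are assumptions, and primitiveElement is simplified to a single token. The position of 'div' decides its role: after an operand it is the binary operator, and at the start of an operand it is the prefix operator.

```python
BINARY = {"*", "/", "div", "inter"}
PREFIX = {"!", "-", "sum", "div", "inter"}


def parse_multiplying(tokens, pos=0):
    """multiplyingExpression : unaryExpression (binaryOp unaryExpression)*"""
    left, pos = parse_unary(tokens, pos)
    # After a complete operand, 'div' and 'inter' can only be binary operators.
    while pos < len(tokens) and tokens[pos] in BINARY:
        op = tokens[pos]
        right, pos = parse_unary(tokens, pos + 1)
        left = (op, left, right)
    return left, pos


def parse_unary(tokens, pos):
    """unaryExpression : prefixOp unaryExpression | primitiveElement"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    # At the start of an operand, 'div' and 'inter' can only be prefix operators.
    if tok in PREFIX:
        operand, pos = parse_unary(tokens, pos + 1)
        return (tok, operand), pos
    if tok in BINARY:
        raise SyntaxError(f"unexpected operator {tok!r}")
    return tok, pos + 1  # primitiveElement, simplified to one token
```

For example, parse_multiplying("A div sum B".split()) yields a binary 'div' node whose right operand is the prefix 'sum' applied to B.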

How to create a rule to match 2^3 to create a power operator?

Given the following grammar, how would I add a rule to match something like 2^3 to create a power operator?
negation : '!'* term ;
unary : ('+'!|'-'^)* negation ;
mult : unary (('*' | '/' | ('%'|'mod') ) unary)* ;
add : mult (('+' | '-') mult)* ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (('&&' | '||') relation)* ;
// LEXER ================================================================
HEX_NUMBER : '0x' HEX_DIGIT+;
fragment
FLOAT: ;
INTEGER : DIGIT+ ({input.LA(1)=='.' && input.LA(2)>='0' && input.LA(2)<='9'}?=> '.' DIGIT+ {$type=FLOAT;})? ;
fragment
HEX_DIGIT : (DIGIT|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9') ;
What I have tried:
I tried something like power : ('+' | '-') unary'^' unary but that doesn't seem to work.
I also tried mult : unary (('*' | '/' | ('%'|'mod') | '^' ) unary)* ; but that doesn't work either.
To give ^ higher precedence than negation, do this:
pow : term ('^' term)* ;
negation : '!' negation | pow ;
unary : ('+'! | '-'^)* negation ;
If you want the grammar itself to express the right-associativity of ^, you can also use recursion:
pow : term ('^'^ pow)?
;
negation : '!'* pow;
...
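The effect of the recursive form can be checked with a small Python sketch (not ANTLR output; the function names are assumptions and term is simplified to a single integer token). Because the right operand is parsed by a recursive call to the same rule, 2 ^ 3 ^ 2 groups as 2 ^ (3 ^ 2).

```python
def parse_pow(tokens, pos=0):
    """pow : term ('^' pow)?  with the recursion on the right operand."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    base, pos = tokens[pos], pos + 1  # term, simplified to one token
    if pos < len(tokens) and tokens[pos] == "^":
        # Recursing here (instead of looping) makes '^' right-associative.
        exponent, pos = parse_pow(tokens, pos + 1)
        return ("^", base, exponent), pos
    return base, pos


def evaluate(node):
    """Evaluate a tree produced by parse_pow using integer arithmetic."""
    if isinstance(node, tuple):
        _, base, exponent = node
        return evaluate(base) ** evaluate(exponent)
    return int(node)
```

A left-associative reading of 2 ^ 3 ^ 2 would give (2 ^ 3) ^ 2 = 64, whereas the tree built here evaluates to 2 ^ 9 = 512.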

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
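parse 1: a single expression, 1 - (-2) (a binary minus followed by a unary minus)
parse 2: two expressions, 1 and - - 2 (two unary minuses applied to 2)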
Your multiFunc rule contains a list of expressions. An expression can begin with + or - by way of the unary rule, so within the list an expression can also be followed by those same tokens. This conflicts with the add rule: the parser cannot decide whether a + or - continues the current expression or starts the next one.
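The conflict can be made concrete by counting the ways a token list splits into one or more expressions. The Python sketch below is a deliberately reduced model of the grammar, assuming only integer literals and the + and - operators; the function names are made up for this example.

```python
SIGNS = ("+", "-")


def unary_ends(tokens, pos):
    """End positions of `unary : ('+'|'-')* NUMBER` when started at pos."""
    while pos < len(tokens) and tokens[pos] in SIGNS:
        pos += 1
    if pos < len(tokens) and tokens[pos].isdigit():
        return {pos + 1}
    return set()


def add_ends(tokens, pos):
    """All end positions of `add : unary (('+'|'-') unary)*` started at pos."""
    ends = set()
    frontier = unary_ends(tokens, pos)
    while frontier:
        ends |= frontier
        reachable = set()
        for end in frontier:
            # The expression may stop at `end`, or continue with a binary sign.
            if end < len(tokens) and tokens[end] in SIGNS:
                reachable |= unary_ends(tokens, end + 1)
        frontier = reachable - ends
    return ends


def count_parses(tokens, pos=0):
    """Number of ways to split tokens[pos:] into one or more `add` expressions."""
    total = 0
    for end in add_ends(tokens, pos):
        if end == len(tokens):
            total += 1
        else:
            total += count_parses(tokens, end)
    return total
```

In this model count_parses("1 - - 2".split()) is 2, and even "1 - 2" has two readings (one subtraction, or the two expressions 1 and -2), while "1 2" has only one.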

how to solve this simple antlr recursive issue

I was reading the URL (and trying to copy) and failed... (great article on antlr too).
https://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlr.html
My solution before I added parenthesis stuff:
whereClause: WHERE expression -> ^(WHERE_CLAUSE expression);
expression: orExpr;
orExpr: andExpr (OR^ andExpr)*;
andExpr: primaryExpr (AND^ primaryExpr)*;
primaryExpr: parameterExpr | inExpr | compExpr;
My solution that failed due to infinite recursion (but I thought the LPAREN^ and RPAREN! were supposed to solve that???)....
whereClause: WHERE^ (expression | orExpr);
expression: LPAREN^ orExpr RPAREN!;
orExpr: andExpr (OR^ andExpr)*;
andExpr: primaryExpr (AND^ primaryExpr)*;
primaryExpr: parameterExpr | inExpr | compExpr | expression;
Notice primaryExpr at bottom has expression tacked back on which has LPAREN and RPAREN, but the WHERE can be an orExpr or expression (i.e. the first expression can use or not use parentheses).
I am sure this is probably a simple issue like a typo that I keep staring at for hours or something.
I was reading the url(and trying to copy) and failed...(great article on antlr too)...
Note that the article explains ANTLR v2, which has a significantly different syntax from v3. Better to look for a decent ANTLR v3 tutorial here: https://stackoverflow.com/questions/278480/antlr-tutorials
My solution that failed due to infinite recursion (but I thought the LPAREN^ and RPAREN! were supposed to solve that???)....
It would have, if that were the only expression after the WHERE. However, the orExpr is causing the problem in your case (if you remove it, that recursion error will go away).
A parenthesized expression usually has the highest precedence, and should therefore be placed in your primaryExpr rule, like this:
grammar T;
options {
output=AST;
}
parse : whereClause EOF!;
whereClause : WHERE^ expression;
expression : orExpr;
orExpr : andExpr (OR^ andExpr)*;
andExpr : primaryExpr (AND^ primaryExpr)*;
primaryExpr : bool | NUMBER | '('! expression ')'!;
bool : TRUE | FALSE;
TRUE : 'true';
FALSE : 'false';
WHERE : 'where';
LPAREN : '(';
RPAREN : ')';
OR : '||';
AND : '&&';
NUMBER : '0'..'9'+ ('.' '0'..'9'*)?;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
Now both the input "where true || false" and "where (true || false)" will be parsed into the following AST: