Antlr 4 deactivate a subrule within a left-recursive rule - dynamic

I am writing a parser for Prolog; the following is part of the source. "arg_term" is very similar to "term", but it must not match a ',' expression, because I need to count the number of arguments. "arg_item" does need to match ',' expressions, so I created two similar rules. I tried to use semantic predicates, but ANTLR 4 reported a compile error; it seems that semantic predicates are not supported in a directly left-recursive rule. The implementation looks clumsy. Can anyone provide a better solution?
I am not very familiar with ANTLR or compiler implementation. In Prolog, users can define their own operators and their precedence. How should such cases be handled? For now I just ignore their precedence and put them at the end of the "term" rule.
arguments returns [ int argc ] //return argument number
:
arg {$argc = 1; } (',' arg {$argc = $argc + 1;} )*
;
arg :
arg_term
| '(' arg_item ')'
| '{' arg_item '}'
;
arg_item:
':-' term
| term ':-' term
| term
;
arg_term :
simple_term
|'(' arg_term ')'
| ('+'|'-') arg_term //here '+, -' denotes number's sign.
| arg_term ('**'|'^'|'isa'|'has') arg_term
| arg_term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') arg_term
| arg_term ('+'|'-'|'#') arg_term
| arg_term ':' arg_term
| arg_term (OP_XFY_700|'<'|'>'|'=') arg_term
| '\\+' arg_term
| arg_term '->' arg_term
| arg_term ';' arg_term
| OP_FX_1150 arg_term
| arg_term user_op arg_term
;
term
:
simple_term
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term ',' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;

1) Semantic predicates in ANTLR4 have changed since v3 (see here).
2) To clean up your arg_term and term productions, try something similar to this grammar snippet:
grammar Prolog;
...
argTerm: term (',' term)*;
term :
simpleTerm
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;
...
3) Rather than embedding that Java code in your grammar, use the ANTLR4 generated ParseTreeVisitor.
You can generate a PrologBaseVisitor by using the -visitor argument from the command line:
org.antlr.v4.Tool -visitor Prolog.g4
This is an example of an implementation extending the generated PrologBaseVisitor which would count your arguments:
public class PrologArgCountVis extends PrologBaseVisitor<Integer> {
    // By default, all productions will return 0.
    @Override
    protected Integer defaultResult() {
        return 0;
    }
    // Return the size of ctx.term(), which is a list of
    // TermContexts... see generated parser code.
    @Override
    public Integer visitArgTerm(PrologParser.ArgTermContext ctx) {
        return ctx.term().size();
    }
}
Using this visitor would look something like this:
PrologParser p;
....
Integer argCount = new PrologArgCountVis().visit(p.argTerm());
User defined precedence would be interesting to implement. I think the best way to handle this situation would be to define another PrologBaseVisitor, have it check the precedence of every operator it visits and evaluate accordingly.
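One possible way to make operator precedence dynamic, sketched here outside of ANTLR, is an operator-precedence ("precedence climbing") parser that consults a table at parse time. This is a different technique from the visitor suggested above, not part of the generated parser. The class name OpTableParser, its op method (loosely modelled on Prolog's op/3) and the pre-tokenized input are all assumptions made for illustration; only infix operators are handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: infix-term parsing driven by an operator table that can change at run time. */
final class OpTableParser {
    /** Prolog-style infix operator types. */
    enum Type { XFX, XFY, YFX }

    /** One table entry: a priority (lower binds tighter) and a type. */
    private static final class Op {
        final int priority;
        final Type type;
        Op(int priority, Type type) { this.priority = priority; this.type = type; }
    }

    private final Map<String, Op> ops = new HashMap<>();
    private List<String> tokens;
    private int pos;

    /** Loosely mirrors Prolog's op/3: adds or redefines an infix operator. */
    void op(int priority, Type type, String name) {
        ops.put(name, new Op(priority, type));
    }

    /** Parses a pre-tokenized term and returns it in prefix form, e.g. +(1,2). */
    String parse(List<String> input) {
        tokens = input;
        pos = 0;
        String term = parseTerm(1200); // 1200 is the highest Prolog priority
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + tokens.get(pos));
        }
        return term;
    }

    private String parseTerm(int maxPriority) {
        String left = parsePrimary();
        int leftPriority = 0; // atoms and parenthesized terms have priority 0
        while (pos < tokens.size()) {
            Op op = ops.get(tokens.get(pos));
            if (op == null || op.priority > maxPriority) break;
            // An 'x' argument must have strictly lower priority; a 'y' argument may be equal.
            int leftMax = op.type == Type.YFX ? op.priority : op.priority - 1;
            int rightMax = op.type == Type.XFY ? op.priority : op.priority - 1;
            if (leftPriority > leftMax) break;
            String name = tokens.get(pos++);
            String right = parseTerm(rightMax);
            left = name + "(" + left + "," + right + ")";
            leftPriority = op.priority;
        }
        return left;
    }

    private String parsePrimary() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        String tok = tokens.get(pos++);
        if (tok.equals("(")) {
            String inner = parseTerm(1200);
            if (pos >= tokens.size() || !tokens.get(pos).equals(")")) {
                throw new IllegalArgumentException("missing ')'");
            }
            pos++;
            return inner;
        }
        if (tok.equals(")") || ops.containsKey(tok)) {
            throw new IllegalArgumentException("operand expected, found: " + tok);
        }
        return tok;
    }
}
```

With op(500, Type.YFX, "+") and op(400, Type.YFX, "*") registered, the tokens 1 + 2 * 3 parse as +(1,*(2,3)). Re-registering "+" with priority 300 changes the result for the same tokens to *(+(1,2),3), which is the kind of behaviour a user-defined operator declaration would need. In an ANTLR setting, a table like this could be applied to a flat operand/operator sequence after parsing.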

Related

How to detect an expression result is unused in an interpreted programming language?

I'm working on a simple procedural interpreted scripting language, written in Java using ANTLR4. Just a hobby project. I have written a few DSLs using ANTLR4 and the lexer and parser presented no real problems. I got quite a bit of the language working by interpreting directly from the parse tree but that strategy, apart from being slow, started to break down when I started to add functions.
So I've created a stack-based virtual machine, based on Chapter 10 of "Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages". I have an assembler for the VM that works well and I'm now trying to make the scripting language generate assembly via an AST.
Something I can't quite see is how to detect when an expression or function result is unused, so that I can generate a POP instruction to discard the value from the top of the operand stack.
I want things like assignment statements to be expressions, so that I can do things like:
x = y = 1;
In the AST, the assignment node is annotated with the symbol (the lvalue) and the rvalue comes from visiting the children of the assignment node. At the end of the visit of the assignment node, the rvalue is stored into the lvalue, and this is reloaded back into the operand stack so that it can be used as an expression result.
This generates (for x = y = 1):
CLOAD 1 ; Push constant value
GSTOR y ; Store into global y and pop
GLOAD y ; Push value of y
GSTOR x ; Store into global x and pop
GLOAD x ; Push value of x
But it needs a POP instruction at the end to discard the result, otherwise the operand stack starts to grow with these unused results. I can't see the best way of doing this.
I guess my grammar could be flawed, which is preventing me seeing a solution here.
grammar g;
// ----------------------------------------------------------------------------
// Parser
// ----------------------------------------------------------------------------
parse
: (functionDefinition | compoundStatement)*
;
functionDefinition
: FUNCTION ID parameterSpecification compoundStatement
;
parameterSpecification
: '(' (ID (',' ID)*)? ')'
;
compoundStatement
: '{' compoundStatement* '}'
| conditionalStatement
| iterationStatement
| statement ';'
;
statement
: declaration
| expression
| exitStatement
| printStatement
| returnStatement
;
declaration
: LET ID ASSIGN expression # ConstantDeclaration
| VAR ID ASSIGN expression # VariableDeclaration
;
conditionalStatement
: ifStatement
;
ifStatement
: IF expression compoundStatement (ELSE compoundStatement)?
;
exitStatement
: EXIT
;
iterationStatement
: WHILE expression compoundStatement # WhileStatement
| DO compoundStatement WHILE expression # DoStatement
| FOR ID IN expression TO expression (STEP expression)? compoundStatement # ForStatement
;
printStatement
: PRINT '(' (expression (',' expression)*)? ')' # SimplePrintStatement
| PRINTF '(' STRING (',' expression)* ')' # PrintFormatStatement
;
returnStatement
: RETURN expression?
;
expression
: expression '[' expression ']' # Indexed
| ID DEFAULT expression # DefaultValue
| ID op=(INC | DEC) # Postfix
| op=(ADD | SUB | NOT) expression # Unary
| op=(INC | DEC) ID # Prefix
| expression op=(MUL | DIV | MOD) expression # Multiplicative
| expression op=(ADD | SUB) expression # Additive
| expression op=(GT | GE | LT | LE) expression # Relational
| expression op=(EQ | NE) expression # Equality
| expression AND expression # LogicalAnd
| expression OR expression # LogicalOr
| expression IF expression ELSE expression # Ternary
| ID '(' (expression (',' expression)*)? ')' # FunctionCall
| '(' expression ')' # Parenthesized
| '[' (expression (',' expression)* )? ']' # LiteralArray
| ID # Identifier
| NUMBER # LiteralNumber
| STRING # LiteralString
| BOOLEAN # LiteralBoolean
| ID ASSIGN expression # SimpleAssignment
| ID op=(CADD | CSUB | CMUL | CDIV) expression # CompoundAssignment
| ID '[' expression ']' ASSIGN expression # IndexedAssignment
;
// ----------------------------------------------------------------------------
// Lexer
// ----------------------------------------------------------------------------
fragment
IDCHR : [A-Za-z_$];
fragment
DIGIT : [0-9];
fragment
ESC : '\\' ["\\];
COMMENT : '#' .*? '\n' -> skip;
// ----------------------------------------------------------------------------
// Keywords
// ----------------------------------------------------------------------------
DO : 'do';
ELSE : 'else';
EXIT : 'exit';
FOR : 'for';
FUNCTION : 'function';
IF : 'if';
IN : 'in';
LET : 'let';
PRINT : 'print';
PRINTF : 'printf';
RETURN : 'return';
STEP : 'step';
TO : 'to';
VAR : 'var';
WHILE : 'while';
// ----------------------------------------------------------------------------
// Operators
// ----------------------------------------------------------------------------
ADD : '+';
DIV : '/';
MOD : '%';
MUL : '*';
SUB : '-';
DEC : '--';
INC : '++';
ASSIGN : '=';
CADD : '+=';
CDIV : '/=';
CMUL : '*=';
CSUB : '-=';
GE : '>=';
GT : '>';
LE : '<=';
LT : '<';
AND : '&&';
EQ : '==';
NE : '!=';
NOT : '!';
OR : '||';
DEFAULT : '??';
// ----------------------------------------------------------------------------
// Literals and identifiers
// ----------------------------------------------------------------------------
BOOLEAN : ('true'|'false');
NUMBER : DIGIT+ ('.' DIGIT+)?;
STRING : '"' (ESC | .)*? '"';
ID : IDCHR (IDCHR | DIGIT)*;
WHITESPACE : [ \t\r\n] -> skip;
ANYCHAR : . ;
So my question is where is the usual place to detect unused expression results, i.e. when expressions are used as plain statements? Is it something I should detect during the parse, then annotate the AST node? Or is this better done when visiting the AST for code generation (assembly generation in my case)? I just can't see where best to do it.
IMO it's not a question of the right grammar, but of how you process the AST/parse tree. Whether a result is used can be determined by checking the siblings (and the parent's siblings, etc.). An assignment, for instance, is made of the lvalue, the operator and the rvalue, so once you have determined the rvalue, check whether the previous sibling in the tree is an operator. Similarly, you can check whether the parent is a parenthesized expression (for nested function calls, grouping, etc.).
statement
: ...
| expression
If you label this case with # ExpressionStatement, you can generate a pop after every expression statement by overriding exitExpressionStatement() in the listener or visitExpressionStatement in the visitor.
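As a rough illustration of that idea without the ANTLR-generated classes, the sketch below uses a tiny hand-written AST. Every expression leaves exactly one value on the operand stack, and only the "expression used as a statement" entry point appends a POP. The names PopEmitter, Const, Assign and emitExprStatement are hypothetical; the instruction mnemonics are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: emit POP only where an expression is used as a statement. */
final class PopEmitter {
    interface Expr {}

    /** A numeric constant. */
    static final class Const implements Expr {
        final int value;
        Const(int value) { this.value = value; }
    }

    /** An assignment to a global, usable as an expression. */
    static final class Assign implements Expr {
        final String name;
        final Expr value;
        Assign(String name, Expr value) { this.name = name; this.value = value; }
    }

    private final List<String> code = new ArrayList<>();

    /** Emits code that leaves exactly one value on the operand stack. */
    void emitExpr(Expr e) {
        if (e instanceof Const) {
            code.add("CLOAD " + ((Const) e).value);
        } else if (e instanceof Assign) {
            Assign a = (Assign) e;
            emitExpr(a.value);           // push the rvalue
            code.add("GSTOR " + a.name); // store it and pop
            code.add("GLOAD " + a.name); // reload so the assignment has a result
        } else {
            throw new IllegalArgumentException("unknown expression node");
        }
    }

    /** An expression in statement position: nobody consumes the result, so discard it. */
    void emitExprStatement(Expr e) {
        emitExpr(e);
        code.add("POP");
    }

    List<String> code() { return code; }
}
```

For x = y = 1; this yields the five instructions from the question followed by a single POP. The inner assignment gets no POP because its value is consumed by the outer one, so the decision depends only on the statement context and not on the expression itself.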

How to use ANTLR4 to write a better parser that can distinguish attribute access, method invocation, and array access expressions?

I want to write an expression engine using ANTLR4.
The following is the grammar.
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is right but cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but ANTLR now reports that the rules [expression, attributeAccessMethod] are mutually left-recursive. How can I write a better grammar? Can I somehow remove the left recursion and still distinguish these cases?
Add tags to your different rule alternatives, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
When you do this for all your alternatives in this rule, your BaseVisitor or BaseListener will be generated with public overrides for these special cases, where you can treat each one as you see fit.
I don't suggest defining your grammar this way. In addition to @JLH's answer, your grammar has the potential to mess up the associativity of these expressions.
What I'm suggesting is that you lay out your grammar top-down, in order of associativity.
For example, you can treat all literals, method invocations, etc. as atoms (because they will always start with a literal or an identifier) in your grammar, and combine these atoms with your operators.
Then you could write your grammar like:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '*' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
The left-hand side of each alternative is the current rule and the right-hand side is the next, tighter-binding level, so you preserve associativity while still using left recursion.
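To see why this layering preserves associativity, here is a hand-written recursive-descent sketch of two such levels. It is not ANTLR output: the class LayeredExprParser is hypothetical, the left recursion is expressed as a loop, and the input is assumed to be pre-tokenized. The result is fully parenthesized so the grouping is visible.

```java
import java.util.List;

/** Sketch: two precedence layers, each left-associative. */
final class LayeredExprParser {
    private final List<String> tokens;
    private int pos = 0;

    LayeredExprParser(List<String> tokens) { this.tokens = tokens; }

    /** add_expr : mul_expr | add_expr ('+' | '-') mul_expr */
    String parseAdd() {
        String left = parseMul();
        while (peekIs("+") || peekIs("-")) {
            String op = tokens.get(pos++);
            String right = parseMul(); // right operand comes from the tighter level
            left = "(" + left + op + right + ")";
        }
        return left;
    }

    /** mul_expr : atom | mul_expr ('*' | '/' | '%') atom */
    String parseMul() {
        String left = parseAtom();
        while (peekIs("*") || peekIs("/") || peekIs("%")) {
            String op = tokens.get(pos++);
            String right = parseAtom();
            left = "(" + left + op + right + ")";
        }
        return left;
    }

    private String parseAtom() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        return tokens.get(pos++);
    }

    private boolean peekIs(String s) {
        return pos < tokens.size() && tokens.get(pos).equals(s);
    }
}
```

Parsing the tokens 1 - 2 - 3 with parseAdd() gives ((1-2)-3), and 1 + 2 * 3 gives (1+(2*3)): the left operand accumulates at the current level while the right operand is always taken from the next level down.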

Xtext: Using syntactic predicates with cross-reference

I'm having trouble understanding how to use the syntactic predicates.
My grammar is:
Rule:
'terminalOne' (name=ID ':')?
(field='terminalTwo' | myReference=[Something])? (anotherField=RuleTwo TOK_SEMI);
Which produces a non-LL(*) conflict.
I tried to put '=>' in front of:
(anotherField=RuleTwo TOK_SEMI)
But it doesn't seem to help.
How can I solve it with syntactic predicates?
Thanks.
I did some shortening (your way of left factoring looks very unusual):
RuleA:
'terminalA' (name=ID ':')?
((->fieldA=ID passedParams+=AdditiveExpression (',' passedParams+=AdditiveExpression)*)
|
((fieldB='t' | fieldC='q')? (fieldD=AdditiveExpression ";")));
AdditiveExpression returns BExpression :
RuleB
({BBinaryOpExpression.leftExpr=current} functionName=("+" | "-") rightExpr=RuleB)*
;
RuleB returns BExpression
: PostopExpression
| RuleC
;
RuleC returns BExpression : {BUnaryOpExpression}
functionName="-" expr=UnaryOrPrimaryExpression
;
PostopExpression returns BExpression :
PrimaryExpression ({BUnaryPostOpExpression.expr=current} functionName = ("++"))?
;
PrimaryExpression returns BExpression:
c=constant
| myID=ID '(' myFieldB+=AdditiveExpression (',' myFieldB+=AdditiveExpression)* ')'
| myP=ID (operator+='['intvalue=INT operator+=']')?
| operator+='(' additiveExpression=AdditiveExpression operator+=')'
| operator+='someOperator' operator+='(' additiveExpression=AdditiveExpression operator+=')';
constant:
booleanValue='FALSE'
| booleanValue='TRUE'
| integerValue=INT;

What does this ANTLR4 notation mean?

I have a question regarding the notation of a UCB Logo grammar that I found, which was written for ANTLR4. There are some notations I can't make out, so I thought I would ask. If anyone is willing to clarify, I will be grateful.
Here are the notations I don't quite understand:
WORD
: {listDepth > 0}? ~[ \t\r\n\[\];] ( ~[ \t\r\n\];~] | LINE_CONTINUATION | '\\' ( [ \t\[\]();~] | LINE_BREAK ) )*
| {arrayDepth > 0}? ~[ \t\r\n{};] ( ~[ \t\r\n};~] | LINE_CONTINUATION | '\\' ( [ \t{}();~] | LINE_BREAK ) )*;
array
: '{' ( ~( '{' | '}' ) | array )* '}';
NAME
: ~[-+*/=<> \t\r\n\[\]()":{}] ( ~[-+*/=<> \t\r\n\[\](){}] | LINE_CONTINUATION | '\\' [-+*/=<> \t\r\n\[\]();~{}] )*;
I guess the array means that it can start with { and have an arbitrary number of levels, but has to end with }.
I take it that the others are some form of regular expressions?
To my knowledge, regex syntax differs between programming languages.
Did I get that right?
Antlr does not do regular expressions. It does implement some of the same operators, but that is where the similarity largely ends.
The first sub-terms ({listDepth > 0}? and {arrayDepth > 0}?) in the WORD rule are semantic predicates, which have no relation to anything in the regular expression world. They are defined in the Antlr documentation and explained in detail in The Definitive ANTLR 4 Reference (TDAR).
Your understanding of the array rule is essentially correct.

Using semantic predicates with Python target

I'm currently building a grammar for unit tests regarding a proprietary language my company uses.
This language resembles regex in some ways; for example, F=bing* indicates the possible repetition of bing. A single *, however, represents any one block, and ** means any number of arbitrary blocks.
My only solution to this is using semantic predicates, checking if the preceding token was a space. If anyone has suggestions circumventing this problem in a different way, please share!
Otherwise, my grammar looks like this right now, but the predicates don't seem to work as expected.
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? REPEAT?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
block :
element
| and_con
| or_con
| head_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ ')' REPEAT?;
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\n'|'\r\n'))* (dictentry)? EOF;
ID : ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME REPEAT? '$'? )
| SINGLESTAR REPEAT?;
fragment SINGLESTAR: {_input.LA(-1)==' '}? '*';
fragment REPEATSTAR: {_input.LA(-1)!=' '}? '*';
fragment NAME: CHAR+ | ',' | '.' | '*';
fragment CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
REPEAT: (REPEATSTAR|'+'|'?'|FROMTIL);
fragment FROMTIL: '{'NUM'-'NUM'}';
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
NUM: [0-9]+;
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
When targeting Python, the syntax of lookahead predicates needs to be like this:
SINGLESTAR: {self._input.LA(-1)==ord(' ')}? '*';
Note that it is necessary to add the "self." reference to the call and to wrap the character in the ord() function, which returns its Unicode code point as an integer for the comparison. The Antlr documentation for the Python target is seriously lacking!