How can I use ANTLR4 to write a parser that distinguishes attribute access, method invocation, and array access expressions?

I want to write an expression engine using ANTLR4.
The following is the grammar:
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is correct, but it cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to:
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but now ANTLR reports that the rules [expression, attributeAccessMethod] are mutually left-recursive. How can I write a better grammar? Can I somehow remove the left recursion and still distinguish these cases?

Add labels to the alternatives of your rule, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
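When you do this for every alternative in the rule (ANTLR requires that either all alternatives of a rule are labeled or none are), the generated BaseVisitor and BaseListener get a separate method per label, for example visitRuleAttribute or enterRuleAttribute/exitRuleAttribute, which you can override to treat each case as you see fit.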
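To illustrate what the labels buy you without depending on the ANTLR runtime, here is a minimal hand-written sketch in Python. It is not what ANTLR generates; the tokenizer and the helper names (`tokenize`, `parse_expression`, `labels`) are assumptions made for this example. It parses a primary followed by any number of suffixes and tags each node with the label of the matching alternative, which is the same distinction the labeled visitor methods give you.

```python
import re

# Hypothetical tokenizer: identifiers, integers, and single-character punctuation.
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[.()\[\],])")

def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expression(tokens):
    """Parse a primary followed by any number of '.name', '(args)', '[index]' suffixes."""
    node = ("RulePrimary", tokens.pop(0))
    while tokens and tokens[0] in (".", "(", "["):
        op = tokens.pop(0)
        if op == ".":
            node = ("RuleAttribute", node, tokens.pop(0))
        elif op == "(":
            args = []
            while tokens[0] != ")":
                args.append(parse_expression(tokens))
                if tokens[0] == ",":
                    tokens.pop(0)
            tokens.pop(0)  # consume ')'
            node = ("RuleExpression", node, args)
        else:
            index = parse_expression(tokens)
            tokens.pop(0)  # consume ']'
            node = ("RuleArray", node, index)
    return node

def labels(node):
    """Return the labels from the outermost node down to the primary."""
    out = [node[0]]
    while node[0] != "RulePrimary":
        node = node[1]
        out.append(node[0])
    return out
```

For the input `a.b(c)[0]`, the outermost node is the array access, which wraps the call, which wraps the attribute access on the primary `a`. The sketch does no error recovery; malformed input simply raises an exception.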

I don't recommend defining your grammar this way. In addition to @JLH's answer: your grammar has the potential to mess up the associativity of these expressions.
What I suggest instead is to structure the grammar top-down, with one rule per precedence level.
For example, you can treat all literals, method invocations, etc. as atoms (because they always start with a literal or an identifier), and then combine these atoms with your operators.
Then you could write your grammar like this:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '*' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
In each rule, the left-hand side is the rule itself and the right-hand side is the next, tighter-binding level. That way the operators stay left-associative and the precedence levels stay ordered, while the only left recursion is direct.
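The effect of this layering can be sketched with a small hand-written evaluator in Python. This is only an illustration under assumptions: the function names are hypothetical, operands are non-negative integers, `/` is integer division, and the left recursion of each rule is written as a loop, which yields the same left-associative grouping.

```python
import re

def tokenize(text):
    # Integers and the five operators used by add_expr / mul_expr.
    return re.findall(r"\d+|[-+*/%]", text)

def parse_add(tokens):
    """add_expr : mul_expr | add_expr ('+'|'-') mul_expr, written as a loop."""
    value = parse_mul(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        rhs = parse_mul(tokens)  # right-hand side is the next, tighter level
        value = value + rhs if op == "+" else value - rhs
    return value

def parse_mul(tokens):
    """mul_expr : atom | mul_expr ('*'|'/'|'%') atom, written as a loop."""
    value = parse_atom(tokens)
    while tokens and tokens[0] in ("*", "/", "%"):
        op = tokens.pop(0)
        rhs = parse_atom(tokens)
        if op == "*":
            value *= rhs
        elif op == "/":
            value //= rhs  # integer division, an assumption of this sketch
        else:
            value %= rhs
    return value

def parse_atom(tokens):
    return int(tokens.pop(0))

def evaluate(text):
    return parse_add(tokenize(text))
```

With this structure `evaluate("8 - 3 - 2")` groups as `(8 - 3) - 2`, and `evaluate("2 + 3 * 4")` multiplies before adding, because `parse_add` only ever sees complete `mul_expr` results.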

Related

Mutually left-recursive lexer rules on ANTLR4?

I'm trying to write a syntax highlighter for Swift. In addition to individual tokens, I would also like to highlight some language constructs. I'm having problems with the following rule:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| (Attributes? Function_type_argument_clause 'throws'? '->' Type | Attributes? Function_type_argument_clause 'rethrows' '->' Type)
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type)
| Tuple_type
| Type '?'
| Type '!'
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type) '&' Protocol_composition_continuation
| (Type '.' 'Type' | Type '.' 'Protocol')
| 'Any'
| 'Self'
| '(' Type ')'
;
Error: The following sets of rules are mutually left-recursive [Type]
I tried reducing the rule to only the following cases:
Type
: Type '?'
| 'Any'
| 'Self'
;
But the problem remained: The following sets of rules are mutually left-recursive [Type]
You defined Type as a lexer rule. Lexer rules cannot be left recursive. Type should be a parser rule.
See: Practical difference between parser rules and lexer rules in ANTLR?
Note that there are existing Swift grammars:
https://github.com/antlr/grammars-v4/blob/master/swift2/Swift2.g4
https://github.com/antlr/grammars-v4/blob/master/swift3/Swift3.g4
Note that these grammars are user-committed, so test them properly!
EDIT
I'm still unable to understand it from the point of view of lexical analysis
Oh, you're only tokenising? Well, then you can't use Type as you're doing it now. You will have to rewrite it so that there is no left recursion any more.
For example, let's say the simplified Type rule looks like this:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| Type '?'
| Type '!'
| 'Any'
| 'Self'
| '(' Type ')'
;
then you should rewrite it like this:
Type
: TypeStart TypeTrailing* // '*' rather than '?' so that repeated suffixes such as Any?? still match
;
fragment TypeStart
: '[' Type ']'
| '[' Type ':' Type ']'
| 'Any'
| 'Self'
| '(' Type ')'
;
fragment TypeTrailing: [?!];
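The rewrite above is an instance of the usual left-recursion removal: a rule of the form A : A suffix | start becomes A : start followed by the suffixes. As a rough Python sketch of the same idea (the function names are hypothetical, and whitespace is not handled), the recognizer below matches a start form and then consumes suffixes in a loop. Using a loop is the general form of the transformation, so it accepts repeated suffixes such as `Any??`.

```python
def match_type(text, pos=0):
    """Return the index just past one Type starting at pos, or -1 if none matches."""
    end = match_type_start(text, pos)
    if end < 0:
        return -1
    # Trailing part: consume any run of '?' or '!' suffixes iteratively.
    while end < len(text) and text[end] in ("?", "!"):
        end += 1
    return end

def match_type_start(text, pos):
    """Match the non-left-recursive alternatives: keywords, '(T)', '[T]', '[T:T]'."""
    for keyword in ("Any", "Self"):
        if text.startswith(keyword, pos):
            return pos + len(keyword)
    if text.startswith("(", pos):
        inner = match_type(text, pos + 1)
        if inner >= 0 and text.startswith(")", inner):
            return inner + 1
        return -1
    if text.startswith("[", pos):
        inner = match_type(text, pos + 1)
        if inner < 0:
            return -1
        if text.startswith("]", inner):
            return inner + 1
        if text.startswith(":", inner):
            value = match_type(text, inner + 1)
            if value >= 0 and text.startswith("]", value):
                return value + 1
        return -1
    return -1

def is_type(text):
    return match_type(text) == len(text)
```

Recursion still happens inside brackets and parentheses, but never as the first step of a rule, so the recognizer always consumes input before it calls itself again.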

Xtext: Using syntactic predicates with cross-reference

I'm having trouble understanding how to use the syntactic predicates.
My grammar is:
Rule:
'terminalOne' (name=ID ':')?
(field='terminalTwo' | myReference=[Something])? (anotherField=RuleTwo TOK_SEMI);
This produces a non-LL(*) conflict.
I tried to put '=>' in front of:
(anotherField=RuleTwo TOK_SEMI)
But it doesn't seem to help.
How can I solve it with syntactic predicates?
Thanks.
I did some shortening (your way of left factoring looks very unusual):
RuleA:
'terminalA' (name=ID ':')?
((->fieldA=ID passedParams+=AdditiveExpression (',' passedParams+=AdditiveExpression)*)
|
((fieldB='t' | fieldC='q')? (fieldD=AdditiveExpression ";")));
AdditiveExpression returns BExpression :
RuleB
({BBinaryOpExpression.leftExpr=current} functionName=("+" | "-") rightExpr=RuleB)*
;
RuleB returns BExpression
: PostopExpression
| RuleC
;
RuleC returns BExpression : {BUnaryOpExpression}
functionName="-" expr=UnaryOrPrimaryExpression
;
PostopExpression returns BExpression :
PrimaryExpression ({BUnaryPostOpExpression.expr=current} functionName = ("++"))?
;
PrimaryExpression returns BExpression:
c=constant
| myID=ID '(' myFieldB+=AdditiveExpression (',' myFieldB+=AdditiveExpression)* ')'
| myP=ID (operator+='['intvalue=INT operator+=']')?
| operator+='(' additiveExpression=AdditiveExpression operator+=')'
| operator+='someOperator' operator+='(' additiveExpression=AdditiveExpression operator+=')';
constant:
booleanValue='FALSE'
| booleanValue='TRUE'
| integerValue=INT;

Using semantic predicates with Python target

I'm currently building a grammar for unit tests regarding a proprietary language my company uses.
This language resembles regular expressions in some ways: for example, F=bing* indicates the possible repetition of bing. A single *, however, represents any one block, and ** means any number of arbitrary blocks.
My only solution so far is to use semantic predicates that check whether the preceding character was a space. If anyone has suggestions for solving this problem in a different way, please share!
Otherwise, my grammar looks like this right now, but the predicates don't seem to work as expected.
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? REPEAT?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
block :
element
| and_con
| or_con
| head_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ ')' REPEAT?;
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\n'|'\r\n'))* (dictentry)? EOF;
ID : ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME REPEAT? '$'? )
| SINGLESTAR REPEAT?;
fragment SINGLESTAR: {_input.LA(-1)==' '}? '*';
fragment REPEATSTAR: {_input.LA(-1)!=' '}? '*';
fragment NAME: CHAR+ | ',' | '.' | '*';
fragment CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
REPEAT: (REPEATSTAR|'+'|'?'|FROMTIL);
fragment FROMTIL: '{'NUM'-'NUM'}';
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
NUM: [0-9]+;
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
When targeting Python, the syntax of lookahead predicates needs to be like this:
SINGLESTAR: {self._input.LA(-1)==ord(' ')}? '*';
Note that it is necessary to add the "self." reference to the call and to wrap the character in the ord() function, which returns its Unicode code point as an integer for the comparison. The ANTLR documentation for the Python target is seriously lacking!
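Why ord() matters can be shown without ANTLR at all. The sketch below uses a hypothetical `FakeCharStream` class as a stand-in for the lexer's input stream, on the assumption that LA() returns the character as an integer code point. In Python, comparing that integer with the string ' ' is simply False, so a predicate written that way never fires.

```python
class FakeCharStream:
    """Hypothetical stand-in for an ANTLR input stream; LA(k) returns a code point as int."""

    def __init__(self, text, index):
        self.text = text
        self.index = index  # position of the next unread character

    def LA(self, offset):
        # Only the look-behind case (negative offset) is modelled here.
        pos = self.index + offset
        return ord(self.text[pos]) if 0 <= pos < len(self.text) else -1

stream = FakeCharStream("a *", 2)   # about to read '*', previous char is ' '
wrong = stream.LA(-1) == ' '        # int compared with str: always False
right = stream.LA(-1) == ord(' ')   # int compared with int
```

Here `wrong` is False even though the previous character really is a space, while `right` is True.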

Antlr - mismatched input '1' expecting number

I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-Z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
It is very strange for me, because 1 is a NUMBER...
If I change the filterAttributeName alternative from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it to me?
Thanks.
By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1'; // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever the same input can be matched by two or more lexer rules, ANTLR chooses the one defined first, and the literal tokens used in parser rules (such as '1') take precedence over the named lexer rules. The lexer does not "listen" to the parser to see what it needs at a particular time: lexing and parsing are two distinct phases. This is simply how ANTLR works, as do many other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing" or "packrat parsers".
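The tie-breaking behaviour can be imitated with a toy lexer in Python. This is a simplified model, not ANTLR's implementation: the rule table and the names `T_1` and `T_a` are assumptions for illustration. The longest match wins, and when two rules match the same text, the rule listed first is kept.

```python
import re

# Rules in definition order. Literal tokens from parser rules come first,
# mirroring how implicit tokens precede the named lexer rules.
RULES = [
    ("T_1", r"1"),
    ("T_a", r"a"),
    ("NUMBER", r"-?[0-9]+(\.[0-9]+)?"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens
```

In this model the input `1` becomes a `T_1` token and never a `NUMBER`, whereas `2` and `12` are still `NUMBER` tokens, because no literal rule matches them equally well.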

Antlr 4 deactivate a subrule within a left-recursive rule

I am writing a parser for Prolog; the following is part of the source. "arg_term" is very similar to "term", but it must not match the ',' expression, because I need to count the number of arguments. "arg_item" does need to match the ',' expression, so I created two similar rules. I tried to use semantic predicates, but ANTLR 4 reported a compile error; it does not seem to support semantic predicates in a directly left-recursive rule. The implementation looks clumsy. Can anyone provide a better solution?
I am not very familiar with ANTLR and compiler implementation. In Prolog, users can define their own operators and their precedence. How can I cope with such cases? For now I just ignore their precedence and put them at the end of the "term" rule.
arguments returns [ int argc ] //return argument number
:
arg {$argc = 1; } (',' arg {$argc = $argc + 1;} )*
;
arg :
arg_term
| '(' arg_item ')'
| '{' arg_item '}'
;
arg_item:
':-' term
| term ':-' term
| term
;
arg_term :
simple_term
|'(' arg_term ')'
| ('+'|'-') arg_term //here '+, -' denotes number's sign.
| arg_term ('**'|'^'|'isa'|'has') arg_term
| arg_term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') arg_term
| arg_term ('+'|'-'|'#') arg_term
| arg_term ':' arg_term
| arg_term (OP_XFY_700|'<'|'>'|'=') arg_term
| '\\+' arg_term
| arg_term '->' arg_term
| arg_term ';' arg_term
| OP_FX_1150 arg_term
| arg_term user_op arg_term
;
term
:
simple_term
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term ',' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;
1) Semantic predicates in ANTLR4 have changed since v3 (see here).
2) To clean up your arg_term and term productions, try something similar to this grammar snippet:
grammar Prolog;
...
argTerm: term (',' term)*;
term :
simpleTerm
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;
...
3) Rather than embedding that Java code in your grammar, use the ANTLR4 generated ParseTreeVisitor.
You can generate a PrologBaseVisitor by using the -visitor argument from the command line:
java org.antlr.v4.Tool -visitor Prolog.g4
This is an example of an implementation extending the generated PrologBaseVisitor which would count your arguments:
public class PrologArgCountVis extends PrologBaseVisitor<Integer> {
    // By default, all productions will return 0.
    @Override
    protected Integer defaultResult() {
        return 0;
    }

    // Return the size of ctx.term(), which is a list of
    // TermContexts... see generated parser code.
    @Override
    public Integer visitArgTerm(PrologParser.ArgTermContext ctx) {
        return ctx.term().size();
    }
}
Using this visitor would look something like this:
PrologParser p;
....
Integer argCount = new PrologArgCountVis().visit(p.argTerm());
User-defined precedence would be interesting to implement. I think the best way to handle this situation would be to define another PrologBaseVisitor that checks the precedence of every operator it visits and evaluates accordingly.
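One possible way to apply precedences that are only known at run time is precedence climbing over a flat list of operands and operators. The Python sketch below shows that technique, not the visitor approach described above, and it rests on several assumptions: the function name `parse` and the shape of the operator table are hypothetical, operands are single tokens, only binary infix operators are handled, and a higher priority binds tighter (the reverse of Prolog's numbering).

```python
def parse(tokens, operators, min_priority=0):
    """Precedence climbing over a token list that is consumed in place.

    operators maps an operator to (priority, associativity); in this sketch a
    HIGHER priority binds tighter, which is the reverse of Prolog's numbering.
    """
    left = tokens.pop(0)  # operand (atoms only, no parentheses in this sketch)
    while tokens and tokens[0] in operators:
        priority, assoc = operators[tokens[0]]
        if priority < min_priority:
            break
        op = tokens.pop(0)
        # A left-associative operator requires a strictly tighter right operand.
        next_min = priority + 1 if assoc == "left" else priority
        right = parse(tokens, operators, next_min)
        left = (op, left, right)
    return left
```

Because the table is an ordinary dictionary, a user-defined operator only needs a new entry, for example `operators["isa"] = (3, "right")`; the grammar itself can then parse operator sequences flatly and leave the grouping to a pass like this one.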