Xtext: Using syntactic predicates with cross-reference - grammar

I'm having trouble understanding how to use the syntactic predicates.
My grammar is:
Rule:
'terminalOne' (name=ID ':')?
(field='terminalTwo' | myReference=[Something])? (anotherField=RuleTwo TOK_SEMI);
Which produces a non-LL(*) conflict.
I tried to put '=>' in front of:
(anotherField=RuleTwo TOK_SEMI)
But it doesn't seem to help.
How can I solve it with syntactic predicates?
Thanks.

I did some shortening (your way of left-factoring looks very unusual):
RuleA:
'terminalA' (name=ID ':')?
((->fieldA=ID passedParams+=AdditiveExpression (',' passedParams+=AdditiveExpression)*)
|
((fieldB='t' | fieldC='q')? (fieldD=AdditiveExpression ";")));
AdditiveExpression returns BExpression :
RuleB
({BBinaryOpExpression.leftExpr=current} functionName=("+" | "-") rightExpr=RuleB)*
;
RuleB returns BExpression
: PostopExpression
| RuleC
;
RuleC returns BExpression : {BUnaryOpExpression}
functionName="-" expr=UnaryOrPrimaryExpression
;
PostopExpression returns BExpression :
PrimaryExpression ({BUnaryPostOpExpression.expr=current} functionName = ("++"))?
;
PrimaryExpression returns BExpression:
c=constant
| myID=ID '(' myFieldB+=AdditiveExpression (',' myFieldB+=AdditiveExpression)* ')'
| myP=ID (operator+='['intvalue=INT operator+=']')?
| operator+='(' additiveExpression=AdditiveExpression operator+=')'
| operator+='someOperator' operator+='(' additiveExpression=AdditiveExpression operator+=')';
constant:
booleanValue='FALSE'
| booleanValue='TRUE'
| integerValue=INT;

Related

How to use ANTLR 4 to write a better parser that can distinguish attribute access, method invocation, and array access expressions?

I want to write an expression engine using ANTLR 4.
The following is the grammar.
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is right but cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but ANTLR reports that these rules are mutually left-recursive: [expression, attributeAccessMethod]. How can I write a better grammar? Can I somehow remove the left recursion and still distinguish these cases?
Add tags to your different rule alternatives, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
When you do this for all your alternatives in this rule, your BaseVisitor or BaseListener will be generated with public overrides for these special cases, where you can treat each one as you see fit.
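As a rough illustration of what the labels buy you, here is a toy Python model of the dispatch. It is not ANTLR's generated code: the Node class and DescribeVisitor are hypothetical stand-ins, and the tree is built by hand. With real ANTLR 4 output, each "# Label" gets its own context class and a matching visitLabel method on the generated visitor.

```python
class Node:
    """Hypothetical stand-in for a generated parse-tree context."""

    def __init__(self, label, *children):
        self.label = label        # the "# Label" of the matched alternative
        self.children = children  # sub-nodes or plain identifier strings


class DescribeVisitor:
    """Dispatches on the alternative label, one method per label."""

    def visit(self, node):
        if isinstance(node, str):  # leaf: a bare identifier
            return node
        return getattr(self, "visit" + node.label)(node)

    def visitRulePrimary(self, node):
        return self.visit(node.children[0])

    def visitRuleAttribute(self, node):
        target, name = node.children
        return f"attr({self.visit(target)}, {name})"

    def visitRuleExpression(self, node):
        target, *args = node.children
        rendered = ", ".join(self.visit(a) for a in args)
        return f"call({self.visit(target)}, [{rendered}])"

    def visitRuleArray(self, node):
        target, index = node.children
        return f"index({self.visit(target)}, {self.visit(index)})"


# Hand-built tree for the input: a.b(c)[d]
tree = Node("RuleArray",
            Node("RuleExpression",
                 Node("RuleAttribute", Node("RulePrimary", "a"), "b"),
                 Node("RulePrimary", "c")),
            Node("RulePrimary", "d"))
result = DescribeVisitor().visit(tree)
```

Each kind of access ends up in its own method, so attribute access, invocation, and indexing can be handled separately without changing the shape of the expression rule.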
I don't suggest you define your grammar this way. In addition to @JLH's answer, your grammar has the potential to mess up the associativity of these expressions.
What I'm suggesting is that you structure your grammar "top-down", in order of associativity.
For example, you can treat all literals, method invocations, etc. as atoms in your grammar (because they will always start with a literal or an identifier), and then combine these atoms with your operators.
Then you could write your grammar like:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '*' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
You make the left-hand side the current rule and the right-hand side the rule of the next level, so that the left recursion preserves the order of associativity.
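To see why this layering keeps operators left-associative, here is a minimal hand-written recursive-descent sketch in Python. It is not ANTLR output: the left recursion is replaced by a loop, the pow_expr level is omitted, and the names tokenize, Parser, and parse are made up for this example. The nested tuples show how the tree groups.

```python
import re


def tokenize(text):
    # Integers and single-character operators/parentheses only.
    return re.findall(r"\d+|[-+*/%()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def add_expr(self):
        # add_expr: mul_expr (('+' | '-') mul_expr)*
        left = self.mul_expr()
        while self.peek() in ("+", "-"):
            op = self.take()
            left = (op, left, self.mul_expr())  # fold to the left
        return left

    def mul_expr(self):
        # mul_expr: atom (('*' | '/' | '%') atom)*
        left = self.atom()
        while self.peek() in ("*", "/", "%"):
            op = self.take()
            left = (op, left, self.atom())
        return left

    def atom(self):
        tok = self.take()
        if tok == "(":
            inner = self.add_expr()
            self.take()  # consume ')'
            return inner
        return int(tok)


def parse(text):
    return Parser(text).add_expr()
```

For example, parse("8 - 3 - 2") groups as (8 - 3) - 2, and parse("2 + 3 * 4") puts the multiplication below the addition, because mul_expr sits one level deeper than add_expr.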

Using semantic predicates with Python target

I'm currently building a grammar for unit tests regarding a proprietary language my company uses.
This language resembles regex in some ways; for example, F=bing* indicates the possible repetition of bing. A single *, however, represents exactly one arbitrary block, and ** means any number of arbitrary blocks.
My only solution to this is using semantic predicates, checking if the preceding token was a space. If anyone has suggestions circumventing this problem in a different way, please share!
Otherwise, my grammar looks like this right now, but the predicates don't seem to work as expected.
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? REPEAT?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
block :
element
| and_con
| or_con
| head_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ ')' REPEAT?;
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\n'|'\r\n'))* (dictentry)? EOF;
ID : ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME REPEAT? '$'? )
| SINGLESTAR REPEAT?;
fragment SINGLESTAR: {_input.LA(-1)==' '}? '*';
fragment REPEATSTAR: {_input.LA(-1)!=' '}? '*';
fragment NAME: CHAR+ | ',' | '.' | '*';
fragment CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
REPEAT: (REPEATSTAR|'+'|'?'|FROMTIL);
fragment FROMTIL: '{'NUM'-'NUM'}';
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
NUM: [0-9]+;
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
When targeting Python, the syntax of lookahead predicates needs to be like this:
SINGLESTAR: {self._input.LA(-1)==ord(' ')}? '*';
Note that it is necessary to add the "self." reference to the call and to wrap the character in the ord() function, which returns the character's Unicode code point as an integer for the comparison. The ANTLR documentation for the Python target is seriously lacking!
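The reason ord() matters is that LA() hands back an integer, so comparing it with a one-character string is always false in Python. The sketch below shows this with a hypothetical stand-in for the character stream (FakeCharStream and classify_stars are made up here; this is not the ANTLR runtime), assuming the predicate runs with the stream positioned at the '*'.

```python
class FakeCharStream:
    """Hypothetical stand-in for a char stream whose LA() returns ints."""

    EOF = -1

    def __init__(self, text):
        self.text = text
        self.index = 0  # position of the next character to consume

    def LA(self, offset):
        if offset == 0:
            return 0  # treated as undefined in this sketch
        # LA(1) is the next character, LA(-1) the one just before it.
        pos = self.index + offset - 1 if offset > 0 else self.index + offset
        if pos < 0 or pos >= len(self.text):
            return self.EOF
        return ord(self.text[pos])


def classify_stars(text):
    """Label each '*' by whether the preceding character is a space."""
    stream = FakeCharStream(text)
    kinds = []
    for i, ch in enumerate(text):
        stream.index = i
        if ch == "*":
            after_space = stream.LA(-1) == ord(" ")
            kinds.append((i, "SINGLESTAR" if after_space else "REPEATSTAR"))
    return kinds


# The Java-style comparison silently fails in Python: int vs. str.
probe = FakeCharStream(" *")
probe.index = 1
naive_check = probe.LA(-1) == " "      # always False
fixed_check = probe.LA(-1) == ord(" ")  # True
```

With the naive comparison the predicate never succeeds, which matches the symptom of predicates that "don't seem to work as expected".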

Exclude tokens from Identifier lexical rule

I have Identifier lexical rule:
Identifier
: ( 'a'..'z' | 'A'..'Z' | '_' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )*
;
LogicalOr and LogicalAnd rules:
LogicalOr : '| ' | '||' | OR;
LogicalAnd : '&' | '&&' | AND;
fragment Or : '[Oo][Rr]';
fragment And : '[Aa][Nn][Dd]';
The strings "and" and "or" are recognized as identifiers instead of LogicalAnd and LogicalOr. Could someone help me solve this problem, please?
There are two potential issues at play. First and foremost, ANTLR 3 does not support the character class syntax introduced by ANTLR 4. Your Or fragment literally matches the input [Oo][Rr]; it does not match OR, or, or oR. The same applies to your And fragment. You need to write the rule like this instead:
fragment
Or
: ('O' | 'o') ('R' | 'r')
;
If this does not resolve your issue, then you need to make sure your LogicalOr and LogicalAnd rules are positioned before the Identifier rule in the grammar. The rule which appears first will determine what token type is assigned for this input sequence.
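Both points can be tried out with a toy tokenizer in Python. This is only a simplified model (longest match wins, ties go to the rule listed first) and not how ANTLR's lexer is actually implemented; first_token and the rule tables are names invented for this sketch, and regular expressions stand in for the lexer rules.

```python
import re


def first_token(rules, text):
    """Return (rule_name, lexeme) for the start of text, or None.

    Toy model: the longest match wins; ties go to the earlier rule.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.group() and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


IDENTIFIER = ("Identifier", r"[A-Za-z_][A-Za-z_0-9]*")
LOGICAL_OR = ("LogicalOr", r"\|\||\||[Oo][Rr]")

# What a quoted '[Oo][Rr]' means as a literal: those eight characters.
LITERAL_OR = ("Or", re.escape("[Oo][Rr]"))

keyword_first = [LOGICAL_OR, IDENTIFIER]
identifier_first = [IDENTIFIER, LOGICAL_OR]
```

In this model first_token(keyword_first, "or") yields LogicalOr, first_token(identifier_first, "or") yields Identifier, and a longer word such as "order" is still an Identifier in either order. The literal version never matches "or" at all.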

Antlr 4 deactivate a subrule within a left-recursive rule

I am writing a parser for Prolog; the following is part of the source. "arg_term" is very similar to "term", but it must not match the ',' expression, because I need to count the number of arguments. "arg_item" does need to match the ',' expression, so I created two similar rules. I tried to use semantic predicates, but ANTLR 4 reported a compile error; it seems not to support semantic predicates in a directly left-recursive rule. The implementation looks clumsy. Can anyone provide a better solution?
I am not very familiar with ANTLR or compiler implementation. In Prolog, users can define their own operators and their precedence. How can I cope with such cases? For now I just ignore their precedence and put them at the end of the "term" rule.
arguments returns [ int argc ] //return argument number
:
arg {$argc = 1; } (',' arg {$argc = $argc + 1;} )*
;
arg :
arg_term
| '(' arg_item ')'
| '{' arg_item '}'
;
arg_item:
':-' term
| term ':-' term
| term
;
arg_term :
simple_term
|'(' arg_term ')'
| ('+'|'-') arg_term //here '+, -' denotes number's sign.
| arg_term ('**'|'^'|'isa'|'has') arg_term
| arg_term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') arg_term
| arg_term ('+'|'-'|'#') arg_term
| arg_term ':' arg_term
| arg_term (OP_XFY_700|'<'|'>'|'=') arg_term
| '\\+' arg_term
| arg_term '->' arg_term
| arg_term ';' arg_term
| OP_FX_1150 arg_term
| arg_term user_op arg_term
;
term
:
simple_term
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term ',' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;
1) Semantic predicates in ANTLR 4 have changed since v3.
2) To clean up your arg_term and term productions, try something similar to this grammar snippet:
grammar Prolog;
...
argTerm: term (',' term)*;
term :
simpleTerm
|'(' term ')'
| ('+'|'-') term
| term ('**'|'^'|'isa'|'has') term
| term ('//' | 'mod' | 'rem' | '<<' | '>>' |'*' |'/') term
| term ('+'|'-'|'#') term
| term ':' term
| term (OP_XFY_700|'<'|'>'|'=') term
| '\\+' term
| term '->' term
| term ';' term
| OP_FX_1150 term
| term user_op term
;
...
3) Rather than embedding that Java code in your grammar, use the ANTLR4 generated ParseTreeVisitor.
You can generate a PrologBaseVisitor by using the -visitor argument from the command line:
java org.antlr.v4.Tool -visitor Prolog.g4
This is an example of an implementation extending the generated PrologBaseVisitor which would count your arguments:
public class PrologArgCountVis extends PrologBaseVisitor<Integer> {
    // By default, all productions will return 0.
    @Override
    protected Integer defaultResult() {
        return 0;
    }
    // Return the size of ctx.term(), which is a list of
    // TermContexts... see generated parser code.
    @Override
    public Integer visitArgTerm(PrologParser.ArgTermContext ctx) {
        return ctx.term().size();
    }
}
Using this visitor would look something like this:
PrologParser p;
....
Integer argCount = new PrologArgCountVis().visit(p.argTerm());
User defined precedence would be interesting to implement. I think the best way to handle this situation would be to define another PrologBaseVisitor, have it check the precedence of every operator it visits and evaluate accordingly.
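One possible way to handle runtime-defined operators, sketched here in Python, is precedence climbing over a flat list of operands and operators, driven by a table that can be changed while the program runs. This is an alternative technique rather than the visitor described above, and OperatorTable and parse are hypothetical names. Note the assumption that a higher binding power binds tighter, which is the opposite of Prolog's op/3 priorities, where lower numbers bind tighter.

```python
class OperatorTable:
    """Mutable table of infix operators (hypothetical helper)."""

    def __init__(self):
        self.ops = {}

    def define(self, symbol, binding_power, right_assoc=False):
        # Higher binding_power binds tighter in this sketch.
        self.ops[symbol] = (binding_power, right_assoc)


def parse(tokens, table, min_power=0):
    """Precedence climbing; consumes tokens from the front of the list."""
    left = tokens.pop(0)  # an operand
    while tokens and tokens[0] in table.ops:
        power, right_assoc = table.ops[tokens[0]]
        if power < min_power:
            break
        op = tokens.pop(0)
        # A right-associative operator lets an equal power recurse.
        next_min = power if right_assoc else power + 1
        right = parse(tokens, table, next_min)
        left = (op, left, right)
    return left


table = OperatorTable()
table.define("+", 10)
table.define("*", 20)
before = parse("a + b * c".split(), table)

# A user-defined, right-associative operator added at runtime.
table.define("==>", 5, right_assoc=True)
chained = parse("a ==> b ==> c".split(), table)

# Redefining an existing operator changes how the same input groups.
table.define("+", 30)
after = parse("a + b * c".split(), table)
```

The grammar itself would then only need to collect the flat sequence of terms and operators; the grouping is decided afterwards from the table.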

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
[Figures: parse tree for parse 1, parse tree for parse 2]
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.
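The ambiguity can be checked by brute force. The Python sketch below uses a deliberately simplified version of the grammar (only numbers, unary and binary '+' and '-'; the function names are made up for this example) and enumerates every way the tokens after IDENT can be split into a list of one or more expressions.

```python
def parse_unary(tokens, i):
    """unary: ('+' | '-')* NUMBER. Return the index after it, or None."""
    while i < len(tokens) and tokens[i] in ("+", "-"):
        i += 1
    if i < len(tokens) and tokens[i].isdigit():
        return i + 1
    return None


def is_expression(tokens):
    """add: unary (('+' | '-') unary)*, matching the whole token list."""
    i = parse_unary(tokens, 0)
    while i is not None and i < len(tokens):
        if tokens[i] not in ("+", "-"):
            return False
        i = parse_unary(tokens, i + 1)
    return i == len(tokens)


def splits(tokens):
    """Yield every way to cut tokens into a sequence of expressions."""
    if not tokens:
        yield []
        return
    for k in range(1, len(tokens) + 1):
        head = tokens[:k]
        if is_expression(head):
            for rest in splits(tokens[k:]):
                yield [head] + rest


# Tokens following IDENT in the input "a 1 - - 2"
parses = list(splits(["1", "-", "-", "2"]))
```

The list contains two entries: one where "1" and "- - 2" are separate expressions, and one where "1 - - 2" is a single subtraction. A parser looking at '-' after "1" cannot tell from a fixed lookahead whether to continue the add rule or start the next expression.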