Lvalue awareness in ANTLR grammar and syntax predicates - antlr

I am implementing a parser with ANTLR for D. This language is based on C so there are some ambiguity around the declarations and the expressions. Consider this:
a* b = c; // This is a declaration of the variable d with a pointer-to-a type.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
the inputs were
a = b
b-=c
d
Maybe I'm misunderstanding how the predicates are working therefore it would be great if someone could correct my explanation to the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note, that the assignmentExpression and the conditionalExpression works well). If the next token isn't of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
Any idea what's wrong with my understanding?

If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
What's wrong with that?

Related

How can I correctly express in BNF this condition?

I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);
There are several approaches and they usually rely on the capabilities of the parser to indicate precedence and associativty. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators by using the hierarchy provided by the bnf (or, in this case, pseudo-bnf) structure.
(In the examples bellow, CONDITIONAL_OP are the likes of <, != etc and LOGICAL_OP are &&, || etc)
Something in the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
Hope this points you the right direction.

Xtext: Using syntactic predicates with cross-reference

I'm having trouble understanding how to use the syntactic predicates.
My grammar is:
Rule:
'terminalOne' (name=ID ':')?
(field='terminalTwo' | myReference=[Something])? (anotherField=RuleTwo TOK_SEMI);
Which produces a non-LL(*) conflict.
I tried to put '=>' in-front of:
(anotherField=RuleTwo TOK_SEMI)
But it doesn't seem to help.
How can I solve it with syntactic predicates?
Thanks.
i did some shortening (your way of left factoring looks very unusal
RuleA:
'terminalA' (name=ID ':')?
((->fieldA=ID passedParams+=AdditiveExpression (',' passedParams+=AdditiveExpression)*)
|
((fieldB='t' | fieldC='q')? (fieldD=AdditiveExpression ";")));
AdditiveExpression returns BExpression :
RuleB
({BBinaryOpExpression.leftExpr=current} functionName=("+" | "-") rightExpr=RuleB)*
;
RuleB returns BExpression
: PostopExpression
| RuleC
;
RuleC returns BExpression : {BUnaryOpExpression}
functionName="-" expr=UnaryOrPrimaryExpression
;
PostopExpression returns BExpression :
PrimaryExpression ({BUnaryPostOpExpression.expr=current} functionName = ("++"))?
;
PrimaryExpression returns BExpression:
c=constant
| myID=ID '(' myFieldB+=AdditiveExpression (',' myFieldB+=AdditiveExpression)* ')'
| myP=ID (operator+='['intvalue=INT operator+=']')?
| operator+='(' additiveExpression=AdditiveExpression operator+=')'
| operator+='someOperator' operator+='(' additiveExpression=AdditiveExpression operator+=')';
constant:
booleanValue='FALSE'
| booleanValue='TRUE'
| integerValue=INT;

antlr 4 - warning: rule contains an optional block with at least one alternative that can match an empty string

I work with antlr v4 to write a t-sql parser.
Is this warning a problem?
"rule 'sqlCommit' contains an optional block with at least one alternative that can match an empty string"
My Code:
sqlCommit: COMMIT (TRAN | TRANSACTION | WORK)? id?;
id:
ID | CREATE | PROC | AS | EXEC | OUTPUT| INTTYPE |VARCHARTYPE |NUMERICTYPE |CHARTYPE |DECIMALTYPE | DOUBLETYPE | REALTYPE
|FLOATTYPE|TINYINTTYPE|SMALLINTTYPE|DATETYPE|DATETIMETYPE|TIMETYPE|TIMESTAMPTYPE|BIGINTTYPE|UNSIGNEDBIGINTTYPE..........
;
ID: (LETTER | UNDERSCORE | RAUTE) (LETTER | [0-9]| DOT | UNDERSCORE)*
In a version before I used directly the lexer rule ID instead of the parser rule id in sqlCommit. But after change ID to id the warning appears.
(Hint if you are confused of ID and id: I want to use the parser rule id instead of ID because an identifier can be a literal which maybe already matched by an other lexer rule)
Regards
EDIT
With the help of "280Z28" I solved the problem. In the parser rule "id" was one slash more than needed:
BITTYPE|CREATE|PROC|
|AS|EXEC|OUTPUT|
So the | | includes that the parser rule can match an empty string.
From a Google search:
ErrorType.EPSILON_OPTIONAL
Compiler Warning 154.
rule rule contains an optional block with at least one alternative that can match an empty string
A rule contains an optional block ((...)?) around an empty alternative.
The following rule produces this warning.
x : ;
y : x?; // warning 154
z1 : ('foo' | 'bar'? 'bar2'?)?; // warning 154
z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?; // ok
Since:
4.1
The problem described by this warning is primarily a performance problem. By wrapping a zero-length string in an optional block, you added a completely unnecessary decision to the grammar (whether to enter the optional block or not) which has a high likelihood of forcing the prediction algorithm through its slowest path. It's similar to wrapping Java code in the following:
if (slowMethodThatAlwaysReturnsTrue()) {
...
}
I'm struggling to see how this rule also suffers from this warning (with antlr 4.7.1)
join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
left_right_full__join_type: LEFT | RIGHT | FULL;
JOIN: J O I N;
INNER: I N N E R;
OUTER: O U T E R;
AFAICT it always returns JOIN and optionally preceded by the type.

Unambiguous grammar for arithmetic expression with Unary + and -

I have just started self-studying the Dragon book of Compiler Design. I am working on a problem that says to design grammar for an expression containing binary +,-,*,/ and unary +,-
I came up with following
E -> E+T | E-T | T
T -> T*P | T/P | P
P -> +S | -S | S
S -> id | constant | (E)
However, there is an obvious flaw in it. According to this grammar, expressions like
1--3
are valid, which is an error in all programming languages I know. Though, expressions like
1+-+3
and
1- -3
must be valid. How can such a grammar be designed?
I believe your problem is with tokenization. You are identifying 1--3 as an error because you think it should be resolved as 1 --3 rather than 1 - -3, the latter being perfectly valid. So I think you problem comes because when you tokenize the string you are getting:
['1', '-', '-' , '3']
rather than:
['1', '--', '3']
I think you have one extra production rule
P -> +S | -S | S
S -> id | constant | (E)
can be shrinked to
P -> +P | -P | id | constant | (E)
With such grammar you will succesfully match exp "1+-+3" as valid.
You have tokenizer(scanner) problem! before passing token to parser You must differentiate between "-" and "--". you must define a token struct which contains a token type and value, then parse the token list.
Also the rule P->--S must be added to the production rules!!

ANTLR 3.x - How to format rewrite rules

I'm finding myself challenged on how to properly format rewrite rules when certain conditions occur in the original rule.
What is the appropriate way to rewrite this:
unaryExpression: op=('!' | '-') t=term
-> ^(UNARY_EXPR $op $t)
Antlr doesn't seem to like me branding anything in parenthesis with a label and "op=" fails. Also, I've tried:
unaryExpression: ('!' | '-') t=term
-> ^(UNARY_EXPR ('!' | '-') $t)
Antlr doesn't like the or '|' and throws a grammar error.
Replacing the character class with a token name does solve this problem, however it creates a quagmire of other issues with my grammar.
--- edit ----
A second problem has been added. Please help me format this rule with tree grammar:
multExpression
: unaryExpression (MULT_OP unaryExpression)*
;
Pretty simple: My expectation is to enclose every matched token in a parent (imaginary) token MULT so that I end up with something like:
MULT
o
|
o---o---o---o---o
| | | | |
'3' '*' '6' '%' 2
unaryExpression
: (op='!' | op='-') term
-> ^(UNARY_EXPR[$op] $op term)
;
I used the UNARY_EXPR[$op] so the root node gets some useful line/column information instead of defaulting to -1.