How can I correctly express in BNF this condition? - language-design

I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);

There are several approaches and they usually rely on the capabilities of the parser to indicate precedence and associativty. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators by using the hierarchy provided by the bnf (or, in this case, pseudo-bnf) structure.
(In the examples bellow, CONDITIONAL_OP are the likes of <, != etc and LOGICAL_OP are &&, || etc)
Something in the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
Hope this points you the right direction.

Related

Parsing SQL 'between' and 'and' expressions with ANTLR 4

I have difficulties with a SQL expression parser. Specifically, with the a AND b and a BETWEEN c AND d rules. The alternatives are specified as follows:
| lhs=exprRule K_AND rhs=exprRule # AndExpression
| value=exprRule K_NOT? K_BETWEEN lower=exprRule K_AND upper=exprRule # BetweenExpression
Unfortunately, this grammar parses a string, such as
...
l_discount BETWEEN 0.02 - 0.01 AND 0.02 + 0.01
AND l_quantity < 25
...
as BetweenExpression with lower={0.02 - 0.01 AND 0.02 + 0.01} and upper={l_quantity < 25}. Clearly, I want it to be parsed as lower={0.02 - 0.01} and upper={0.02 + 0.01} with an AndExpression as parent node.
Basically, I want the lower=exprRule of the BetweenExpression to take the smallest number of tokens instead of the largest number. It seems to me that there should be a straightforward solution to this but I lack the nomenclature to phrase the correct google search and could not find an answer in the ANTLR documantation either.
I also tried, as suggest by mnesarco, to give BETWEEN expressions alt a higher precedence, but in both cases, the parse tree:
is created. Which makes sense if you think about it.
The only thing I could come up with is the introduce an extra "numeric expression" rule that does not match and and between expressions:
exprRule
: value=exprRule ( '+' | '-' ) lower=exprRule #AddExpression
| value=exprRule ( '<' | '>' | '<=' | '=>' ) lower=exprRule #ComparisonExpression
| value=exprRule K_NOT? K_BETWEEN lower=exprNumeric K_AND upper=exprNumeric #BetweenExpression
| lhs=exprRule K_AND rhs=exprRule #AndExpression
| NUMBER #NumberExpression
| ID #IdExpression
;
exprNumeric
: value=exprNumeric ( '+' | '-' ) lower=exprNumeric #AddNumericExpression
| NUMBER #NumNumericberExpression
| ID #IdNumericExpression
;
which results in the parse tree:
It looks like a precedence problem. basically you need [Between] operator to have higher precedence than [And] and probably than [Or] too.
In Antlr4, precedence is just order of definition. So just try swapping the alternative order. ie:
| value=exprRule K_NOT? K_BETWEEN lower=exprRule K_AND upper=exprRule # BetweenExpression
| lhs=exprRule K_AND rhs=exprRule # AndExpression

How to use antlr4 write a better parser that can distinguish attribute access expression, method invoke expression, array access expression?

I want to write an expression engine use antlr4.
The following is the grammar.
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is right but cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but this grammar is a left-recursive [expression, attributeAccessMethod]. How can I write a better grammar - can I somehow remove the left-recursive property and distinguish these conditions?
Add tags to your different rule alternatives, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
When you do this for all your alternatives in this rule, your BaseVisitor or BaseListener will be generated with public overrides for these special cases, where you can treat each one as you see fit.
I don't suggest you define your grammar this way. In addition to #JLH's answer, your grammar has a potential to mess up associativity of these expressions.
What I'm suggesting is you should "top-down" your grammar with associativity order.
For example, you can treat all literals, method invokes etc as atoms (because they will always start with a literal or an identifier) in your grammar, and you will associate these atoms with your associate operators.
Then you could write your grammar like:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '+' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
You make your left hand side as current expr, and right hand side as your other level associated expr, so that you can maintain the order of associativity while having left recursion on them.

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However it is breaking on the nested parenthesis, matching the first ( with the first ) instead of the outermost (see below). In ANTLR3 I solved this with semantic predicates but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining a separate rule for conditions' andexpressions`? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).

How to parse a parenthesized hierarchy root?

I'm trying to parse values with ANTLR. Here's the relevant part of my grammar:
root : IDENTIFIER | SELF | literal | constructor | call | indexer;
hierarchy : root (SUB^ (IDENTIFIER | call | indexer))*;
factor : hierarchy ((MULT^ | DIV^ | MODULO^) hierarchy)*;
sum : factor ((PLUS^ | MINUS^) factor)*;
comparison : sum (comparison_operator^ sum)*;
value : comparison | '(' value ')';
I won't describe each token or rule since their name is quite explanatory of their role. This grammar works well and compiles, allowing me to parse, using value, things such as:
a.b[c(5).d[3] * e()] < e("f")
The only thing left for value recognition is to be able to have parenthesized hierarchy roots. For instance:
(a.b).c
(3 < d()).e
...
Naively, and without much expectations, I tried adding the following alternative to my root rule:
root : ... | '(' value ')';
This however breaks the value rule due to non-LL(*)ism:
rule value has non-LL(*) decision due to recursive rule invocations reachable
from alts 1,2. Resolve by left-factoring or using syntactic predicates or using
backtrack=true option.
Even after reading most of The Definitive ANTLR Reference, I still don't understand these errors. However, what I do understand is that, upon seeing a parenthesis opening, ANTLR cannot know if it's looking at the beginning of a parenthesized value, or at the beginning of a parenthesized root.
How can I clearly define the behavior of parenthesized hierarchy root?
Edit: As requested, the additional rules:
parameter : type IDENTIFIER -> ^(PARAMETER ^(type IDENTIFIER));
constructor : NEW type PAREN_OPEN (arguments+=value (SEPARATOR arguments+=value)*)? PAREN_CLOSE -> ^(CONSTRUCTOR type ^(ARGUMENTS $arguments*)?);
call : IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE -> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?);
indexer : IDENTIFIER INDEX_START (values+=value (SEPARATOR values+=value)*)? INDEX_END -> ^(INDEXER IDENTIFIER ^(ARGUMENTS $values*));
Remove '(' value ')' from value and place it in root:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '(' value ')';
...
value : comparison;
Now (a.b).c will result in the following parse:
And (3 < d()).e in:
Of course, you'll probably want to omit the parenthesis from the AST:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '('! value ')'!;
Also, you don't need to append tokens in a List using += in your parser rules. The following:
call
: IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?)
;
can be rewritten into:
call
: IDENTIFIER PAREN_OPEN (value (SEPARATOR value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS value*)?)
;
EDIT
Your main problem is the fact that certain input can be parsed in two (or more!) ways. For example, the input (a) could be parsed by alternative 1 and 2 of your value rule:
value
: comparison // alternative 1
| '(' value ')' // alternative 2
;
Run through your parser rules: a comparison (alternative 1) can match (a) because it matches the root rule, which in its turn matches '(' value ')'. But that is also what alternative 2 matches! And there you have it: the parser "sees" for one input, two different
parses and reports about this ambiguity.

Lvalue awareness in ANTLR grammar and syntax predicates

I am implementing a parser with ANTLR for D. This language is based on C so there are some ambiguity around the declarations and the expressions. Consider this:
a* b = c; // This is a declaration of the variable d with a pointer-to-a type.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
the inputs were
a = b
b-=c
d
Maybe I'm misunderstanding how the predicates are working therefore it would be great if someone could correct my explanation to the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note, that the assignmentExpression and the conditionalExpression works well). If the next token isn't of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
What's wrong with that?