Parsing SQL 'between' and 'and' expressions with ANTLR 4 - antlr

I have difficulties with a SQL expression parser. Specifically, with the a AND b and a BETWEEN c AND d rules. The alternatives are specified as follows:
| lhs=exprRule K_AND rhs=exprRule # AndExpression
| value=exprRule K_NOT? K_BETWEEN lower=exprRule K_AND upper=exprRule # BetweenExpression
Unfortunately, this grammar parses a string, such as
...
l_discount BETWEEN 0.02 - 0.01 AND 0.02 + 0.01
AND l_quantity < 25
...
as BetweenExpression with lower={0.02 - 0.01 AND 0.02 + 0.01} and upper={l_quantity < 25}. Clearly, I want it to be parsed as lower={0.02 - 0.01} and upper={0.02 + 0.01} with an AndExpression as parent node.
Basically, I want the lower=exprRule of the BetweenExpression to take the smallest number of tokens instead of the largest number. It seems to me that there should be a straightforward solution to this but I lack the nomenclature to phrase the correct google search and could not find an answer in the ANTLR documantation either.

I also tried, as suggest by mnesarco, to give BETWEEN expressions alt a higher precedence, but in both cases, the parse tree:
is created. Which makes sense if you think about it.
The only thing I could come up with is the introduce an extra "numeric expression" rule that does not match and and between expressions:
exprRule
: value=exprRule ( '+' | '-' ) lower=exprRule #AddExpression
| value=exprRule ( '<' | '>' | '<=' | '=>' ) lower=exprRule #ComparisonExpression
| value=exprRule K_NOT? K_BETWEEN lower=exprNumeric K_AND upper=exprNumeric #BetweenExpression
| lhs=exprRule K_AND rhs=exprRule #AndExpression
| NUMBER #NumberExpression
| ID #IdExpression
;
exprNumeric
: value=exprNumeric ( '+' | '-' ) lower=exprNumeric #AddNumericExpression
| NUMBER #NumNumericberExpression
| ID #IdNumericExpression
;
which results in the parse tree:

It looks like a precedence problem. basically you need [Between] operator to have higher precedence than [And] and probably than [Or] too.
In Antlr4, precedence is just order of definition. So just try swapping the alternative order. ie:
| value=exprRule K_NOT? K_BETWEEN lower=exprRule K_AND upper=exprRule # BetweenExpression
| lhs=exprRule K_AND rhs=exprRule # AndExpression

Related

How can I correctly express in BNF this condition?

I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);
There are several approaches and they usually rely on the capabilities of the parser to indicate precedence and associativty. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators by using the hierarchy provided by the bnf (or, in this case, pseudo-bnf) structure.
(In the examples bellow, CONDITIONAL_OP are the likes of <, != etc and LOGICAL_OP are &&, || etc)
Something in the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
Hope this points you the right direction.

How to express a hex literal in Spark SQL?

I am new to Spark SQL. I have searched the language manual for Hive/SparkSQL and googled for the answer, but could not find an obvious answer.
In MySQL we can express a hex literal 0xffff like this:
mysql>select 0+0xffff;
+----------+
| 0+0xffff |
+----------+
| 65535 |
+----------+
1 row in set (0.00 sec)
But in Spark SQL (I am using the beeline client), I could only do the following where the numerical values are expressed in decimal not hexidecimal.
> select 0+65535;
+--------------+--+
| (0 + 65535) |
+--------------+--+
| 65535 |
+--------------+--+
1 row selected (0.047 seconds)
If I did the following instead, I would get an error:
> select 0+0xffff;
Error: org.apache.spark.sql.AnalysisException:
cannot resolve '`0xffff`' given input columns: []; line 1 pos 9;
'Project [unresolvedalias((0 + '0xffff), None)]
+- OneRowRelation$ (state=,code=0)
How do we express a hex literal in Spark SQL?
Unfortunatelly, you can't do it in Spark SQL.
You can discover it just by looking at the ANTLR grammar file. There, the number rule defined via DIGIT lexer rule which looks like this:
number
: MINUS? DECIMAL_VALUE #decimalLiteral
| MINUS? INTEGER_VALUE #integerLiteral
| MINUS? BIGINT_LITERAL #bigIntLiteral
| MINUS? SMALLINT_LITERAL #smallIntLiteral
| MINUS? TINYINT_LITERAL #tinyIntLiteral
| MINUS? DOUBLE_LITERAL #doubleLiteral
| MINUS? BIGDECIMAL_LITERAL #bigDecimalLiteral
;
...
INTEGER_VALUE
: DIGIT+
;
...
fragment DIGIT
: [0-9]
;
It does not include any hexadecimal characters, so you can't use them.

How to use antlr4 write a better parser that can distinguish attribute access expression, method invoke expression, array access expression?

I want to write an expression engine use antlr4.
The following is the grammar.
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is right but cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but this grammar is a left-recursive [expression, attributeAccessMethod]. How can I write a better grammar - can I somehow remove the left-recursive property and distinguish these conditions?
Add tags to your different rule alternatives, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
When you do this for all your alternatives in this rule, your BaseVisitor or BaseListener will be generated with public overrides for these special cases, where you can treat each one as you see fit.
I don't suggest you define your grammar this way. In addition to #JLH's answer, your grammar has a potential to mess up associativity of these expressions.
What I'm suggesting is you should "top-down" your grammar with associativity order.
For example, you can treat all literals, method invokes etc as atoms (because they will always start with a literal or an identifier) in your grammar, and you will associate these atoms with your associate operators.
Then you could write your grammar like:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '+' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
You make your left hand side as current expr, and right hand side as your other level associated expr, so that you can maintain the order of associativity while having left recursion on them.

Lvalue awareness in ANTLR grammar and syntax predicates

I am implementing a parser with ANTLR for D. This language is based on C so there are some ambiguity around the declarations and the expressions. Consider this:
a* b = c; // This is a declaration of the variable d with a pointer-to-a type.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
the inputs were
a = b
b-=c
d
Maybe I'm misunderstanding how the predicates are working therefore it would be great if someone could correct my explanation to the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note, that the assignmentExpression and the conditionalExpression works well). If the next token isn't of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
What's wrong with that?

ANTLR 3.x - How to format rewrite rules

I'm finding myself challenged on how to properly format rewrite rules when certain conditions occur in the original rule.
What is the appropriate way to rewrite this:
unaryExpression: op=('!' | '-') t=term
-> ^(UNARY_EXPR $op $t)
Antlr doesn't seem to like me branding anything in parenthesis with a label and "op=" fails. Also, I've tried:
unaryExpression: ('!' | '-') t=term
-> ^(UNARY_EXPR ('!' | '-') $t)
Antlr doesn't like the or '|' and throws a grammar error.
Replacing the character class with a token name does solve this problem, however it creates a quagmire of other issues with my grammar.
--- edit ----
A second problem has been added. Please help me format this rule with tree grammar:
multExpression
: unaryExpression (MULT_OP unaryExpression)*
;
Pretty simple: My expectation is to enclose every matched token in a parent (imaginary) token MULT so that I end up with something like:
MULT
o
|
o---o---o---o---o
| | | | |
'3' '*' '6' '%' 2
unaryExpression
: (op='!' | op='-') term
-> ^(UNARY_EXPR[$op] $op term)
;
I used the UNARY_EXPR[$op] so the root node gets some useful line/column information instead of defaulting to -1.