SQL Parser Syntax error - sql

In an old post I found a recommendation for a Lex and Yacc SQL parser, which is what I had been searching for. Here is the link.
SQL lex yacc grammar
I later found that it is the code explained in the O'Reilly book "lex & yacc".
I am trying to get it working and have successfully integrated it into my application, but whenever I send an UPDATE command I get a syntax error, even with the simplest statements:
UPDATE user SET name = 'johnfoo'
I get the error on the = symbol. I have tried to trace everything but I cannot find why it gives this message. I have tried to analyze the lex and yacc code and it makes no sense to me, as the code looks correct.
[UPDATE]The error I get is just:
1: syntax error at =
Embedded SQL parse failed
INSERT works perfectly.

After trying several different approaches, I found that the suggested solution (now deleted by its author) worked.
What he suggested was updating the lex and yacc definitions of comparison.
In the lex file change
<SQL>"=" |
<SQL>"<>" |
<SQL>"<" |
<SQL>">" |
<SQL>"<=" |
<SQL>">=" TOK(COMPARISON)
to
<SQL>"=" TOK(EQ)
<SQL>"<>" TOK(NE)
<SQL>"<" TOK(LT)
<SQL>">" TOK(GT)
<SQL>"<=" TOK(LE)
<SQL>">=" TOK(GE)
In the yacc file add:
comparison:
EQ
| NE
| LT
| GT
| LE
| GE
;
Then change every reference to = into EQ (likewise for the other symbols) and every COMPARISON into comparison:
%left COMPARISON /* = <> < > <= >= */
to
%left EQ NE LT GT LE GE /* = <> < > <= >= */
assignment:
column = scalar_exp
| column = NULLX
;
to
assignment:
column EQ scalar_exp
| column EQ NULLX
;
And
comparison_predicate:
scalar_exp COMPARISON scalar_exp
| scalar_exp COMPARISON subquery
;
to
comparison_predicate:
scalar_exp comparison scalar_exp
| scalar_exp comparison subquery
;
And it works!
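Why the change helps can be illustrated with a toy model. The sketch below is Python rather than lex/yacc, and every name in it is hypothetical. It assumes one plausible reading of the fix, which the thread does not confirm: the lexer returned a single COMPARISON token for =, while the assignment rule was waiting for a different token for the same character, so the parser could never match it.

```python
import re

# Token names used after the fix; before the fix all six were COMPARISON.
OPERATORS = {"=": "EQ", "<>": "NE", "<": "LT", ">": "GT", "<=": "LE", ">=": "GE"}


def tokenize(sql, lump_comparisons):
    """Toy lexer. With lump_comparisons=True every operator becomes COMPARISON."""
    tokens = []
    for text in re.findall(r"<>|<=|>=|[=<>]|\w+|'[^']*'", sql):
        if text in OPERATORS:
            kind = "COMPARISON" if lump_comparisons else OPERATORS[text]
        elif text.startswith("'"):
            kind = "STRING"
        elif text.upper() in ("UPDATE", "SET"):
            kind = text.upper()
        else:
            kind = "NAME"
        tokens.append((kind, text))
    return tokens


def parse_update(tokens):
    """Toy rule: UPDATE NAME SET NAME EQ STRING (the assignment expects EQ)."""
    expected = ["UPDATE", "NAME", "SET", "NAME", "EQ", "STRING"]
    for (kind, text), wanted in zip(tokens, expected):
        if kind != wanted:
            return "syntax error at " + text
    if len(tokens) != len(expected):
        return "syntax error at end of input"
    return "ok"


statement = "UPDATE user SET name = 'johnfoo'"
print(parse_update(tokenize(statement, lump_comparisons=True)))
print(parse_update(tokenize(statement, lump_comparisons=False)))
```

With the lumped token the toy parser stops at =, much like the reported message; with one token per operator the same statement is accepted.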

Related

How can I correctly express in BNF this condition?

I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);
There are several approaches, and they usually rely on the parser's ability to express precedence and associativity. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators through the hierarchy of the BNF (or, in this case, pseudo-BNF) rules.
(In the examples below, CONDITIONAL_OP stands for operators such as <, != etc., and LOGICAL_OP for &&, || etc.)
Something along the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
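As an illustration of the "special measures", here is a minimal recursive-descent sketch in Python. It assumes a simplified token set, and the class and function names are made up for the example. Each (OP operand)* repetition becomes a loop that folds the result into the left operand, which is what restores left associativity.

```python
import re

LOGICAL_OPS = {"&&", "||"}
CONDITIONAL_OPS = {"=", "!=", "<=", ">=", "<", ">"}


def tokenize(text):
    return re.findall(r"&&|\|\||!=|<=|>=|[=<>()]|\w+", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def condition(self):
        return self.logical_expr()

    def logical_expr(self):
        # logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
        left = self.conditional_expr()
        while self.peek() in LOGICAL_OPS:
            op = self.take()
            left = (op, left, self.conditional_expr())  # fold to the left
        return left

    def conditional_expr(self):
        # conditionalExpr: primary (CONDITIONAL_OP primary)*
        left = self.primary()
        while self.peek() in CONDITIONAL_OPS:
            op = self.take()
            left = (op, left, self.primary())  # fold to the left
        return left

    def primary(self):
        token = self.take()
        if token == "(":
            inner = self.condition()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if token == ")" or token in LOGICAL_OPS or token in CONDITIONAL_OPS:
            raise SyntaxError("unexpected " + token)
        return token


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.condition()
    if parser.peek() is not None:
        raise SyntaxError("unexpected " + parser.peek())
    return tree


print(parse("cars < 3 && cars > 1"))
```

The tree is built as nested (operator, left, right) tuples, so a chain such as a && b || c groups as (a && b) || c.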
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
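A hand-written parser can honour declarations of this kind with a precedence table. The sketch below uses precedence climbing in Python; this is only one possible technique and not necessarily what a given parser generator does internally. The table values and function names are assumptions, with CONDITIONAL_OP binding tighter than LOGICAL_OP and everything left-associative.

```python
import re

# Hypothetical table: a higher number binds tighter.
PRECEDENCE = {"&&": 1, "||": 1, "=": 2, "!=": 2, "<=": 2, ">=": 2, "<": 2, ">": 2}


def tokenize(text):
    return re.findall(r"&&|\|\||!=|<=|>=|[=<>()]|\w+", text)


def parse_condition(tokens, min_prec=1):
    """Parse tokens (consumed from the front of the list) into nested tuples."""
    left = parse_operand(tokens)
    while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # Left associativity: the right operand may only contain
        # operators that bind strictly tighter than this one.
        right = parse_condition(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def parse_operand(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        inner = parse_condition(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    if token == ")" or token in PRECEDENCE:
        raise SyntaxError("unexpected " + token)
    return token


print(parse_condition(tokenize("cars != 15 || cars >= 4")))
```

Changing the grouping then only requires editing the table, which is the convenience the declaration style is meant to provide.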
Hope this points you in the right direction.

Array support for Hplsql.g4 or Hive.g4

Good day everyone,
I am using antlr4 to create a parser and lexer for Hive SQL (Hplsql.g4).
I believe this is the latest grammar file.
https://github.com/AngersZhuuuu/Spark-Hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
However, I found at least two additions that are needed: IF and array indices.
For example, in a select statement, I may have:
a) SELECT if(a>8,12,20) FROM x
b) SELECT column_name[2] FROM x
Both are valid in Hive but both do not parse when I create a parser and lexer for java from the Hplsql.g4 above. I added an expression for the IF and it appears to work.
I added
expr :
...
| expr_if //I added
and a new rule:
expr_if :
T_IF T_OPEN_P bool_expr T_COMMA expr T_COMMA expr T_CLOSE_P //I added
;
However, figuring out how to allow an array index is not so easy because the grammar allows aliases:
select a from x
select a alias_of_a from x
select a[1] from x
select a[1] alias_of_a from x
should all be valid.
I tried adding a new expression for this like so:
expr :
...
| expr_array //I added
expr_array :
T_OPEN_SB L_INT T_OPEN_CB //I added
;
This didn't work for me. (T_OPEN_SB L_INT T_OPEN_CB are [ integer ] respectively). I tried so many variations on this as well. My questions are:
Am I using the right grammar file - if not is there a newer one with IF and array handling?
Has anyone been successful in extending this grammar to handle my cases above?
As per Bart's recommendations:
I updated ident.
I updated expr_atom.
I added array_index.
I had // | '[' .*? ']' commented out before.
Test Sql: select a[0] from t
Result:
line 1:8 no viable alternative at input 'selecta[0]'
line 1:8 mismatched input '[0]'
Tree
(program (block stmt (stmt select) (stmt (expr_stmt (expr (expr_atom (ident a)))))) [0] from t)
I feel like the problem is somehow related to select_list_alias below.
With select_list_alias containing ident and T_AS optional, ident is matching the array index.
I can't reconcile why this happens, especially since ident has been updated.
Excerpt from Hplsql.g4:
select_list :
select_list_set? select_list_limit? select_list_item (T_COMMA select_list_item)*
;
select_list_item :
(ident T_EQUAL)? expr select_list_alias?
| select_list_asterisk
;
select_list_alias :
{!_input.LT(1).getText().equalsIgnoreCase("INTO") && !_input.LT(1).getText().equalsIgnoreCase("FROM")}? T_AS? ident
| T_OPEN_P T_TITLE L_S_STRING T_CLOSE_P
;
If I pass in a simple SQL stmt to grun such as
select a[1] from t
The parse tree should look similar to this:
Instead of expr_atom, I want to see expr_array where it would split into expr_atom for the a and array_index for the [1].
Note that there is one SQL statement here. With my existing g4, the array index [1] (and the remainder of the stmt) gets parsed as a separate SQL statement.
Bart, I see from your parse tree that parsing resulted in two SQL statements from "select a[0] from t" - I was getting the same situation.
I will continue to explore different approaches - I am still suspicious of the select_list_alias which has T_AS? ident at the end. Just to confirm, I have commented out one line from ident_part like this: // | '[' .*? ']'
As mentioned in the comments: [ ... ] will be tokenised as a L_ID token. If you don't want that, remove the | '[' .*? ']' part:
fragment
L_ID_PART :
[a-zA-Z] ([a-zA-Z] | L_DIGIT | '_')* // Identifier part
| ('_' | '#' | ':' | '#' | '$') ([a-zA-Z] | L_DIGIT | '_' | '#' | ':' | '#' | '$')+ // (at least one char must follow special char)
| '"' .*? '"' // Quoted identifiers
// | '[' .*? ']' <-- removed
| '`' .*? '`'
;
and create/edit the grammar like this:
expr_atom :
date_literal
| timestamp_literal
| bool_literal
| expr_array // <-- added
| ident
| string
| dec_number
| int_number
| null_const
;
// new rule
expr_array
: ident array_index+
;
// new rule
array_index
: T_OPEN_SB expr T_CLOSE_SB
;
The rules above will cause select a[1] alias_of_a from x to be parsed successfully, but will fail on input like select a[1] alias_of_a from [identifier]: the [identifier] will not be matched as an identifier.
You could try adding something like this:
ident :
L_ID
| T_OPEN_SB ~T_CLOSE_SB+ T_CLOSE_SB // <-- added
| non_reserved_words
;
which will parse select a[1] alias_of_a from [identifier] properly, but I don't have a good enough picture of the whole grammar (or deep knowledge of HPL/SQL) to determine whether that will mess up other things :)
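The lexing effect described in this answer can be reproduced outside ANTLR. The sketch below is a rough Python model with regular expressions; the patterns are simplified stand-ins for L_ID_PART and the bracket tokens, not the real lexer rules, and keywords are not distinguished from identifiers.

```python
import re

# Simplified stand-ins for the alternatives of L_ID_PART (assumptions).
PLAIN = r"[a-zA-Z][a-zA-Z0-9_]*"
QUOTED = r'"[^"]*"'
BACKTICKED = r"`[^`]*`"
BRACKETED = r"\[[^\]]*\]"


def lex(text, bracketed_identifiers):
    """Return (token type, text) pairs; whitespace is skipped."""
    id_alternatives = [PLAIN, QUOTED, BACKTICKED]
    if bracketed_identifiers:
        id_alternatives.append(BRACKETED)
    pattern = re.compile(
        r"(?P<L_ID>" + "|".join(id_alternatives) + r")"
        r"|(?P<L_INT>[0-9]+)"
        r"|(?P<T_OPEN_SB>\[)"
        r"|(?P<T_CLOSE_SB>\])"
    )
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]


print(lex("a[0]", bracketed_identifiers=True))
print(lex("a[0]", bracketed_identifiers=False))
```

With the bracket alternative enabled, [0] comes out as one identifier token, so no parser rule for array indices ever sees an opening bracket. With it disabled, the brackets and the integer arrive as separate tokens.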
EDIT
With my proposed changes, the grammar looks like this: https://gist.github.com/bkiers/4aedd6074726cbcd5d87ede00000cd0d (I cannot post it here on SO because of the char limit)
Parsing select a[0] from t with this will result in the parse tree:
And parsing select a[0] from [t] with this will result in this parse tree:
You're also able to test it by running the following Java code:
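import java.util.Arrays;
import javax.swing.JFrame;
import javax.swing.JPanel;
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

// Parse the input with the lexer and parser generated from the grammar
String source = "select a[0] from [t]";
HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString(source));
HplsqlParser parser = new HplsqlParser(new CommonTokenStream(lexer));
ParseTree root = parser.program();
// Show the resulting parse tree in a Swing window
JFrame frame = new JFrame("Antlr AST");
JPanel panel = new JPanel();
TreeViewer viewer = new TreeViewer(Arrays.asList(parser.getRuleNames()), root);
viewer.setScale(1.5);
panel.add(viewer);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.pack();
frame.setVisible(true);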

Matching similar terms in ANTLR without capturing difference

As part of the nand2tetris challenge I'm trying to write a parser using ANTLR to generate machine code, having already implemented it using regex.
However, I'm struggling to work out how to use ANTLR effectively. A subset of the problem is below.
(Some) Valid instructions
M=D
D=M
M=D+1
D;JMP
0;JMP
A (partial) Regex
(?<assignment>(?<destination>[ADM])=)?(?<computation>[ADM+10])(?<condition>;(?<jump>JMP))?
A (partial) grammar
command
: assignment '=' computation
| computation ';' condition
| assignment '=' computation ';' condition
;
assignment
: ASSIGNMENT
;
computation
: OPERATION
;
condition
: CONDITION
;
ASSIGNMENT
: DESTINATION
;
CONDITION
: JUMP
;
DESTINATION
: 'A'
| 'D'
| ...etc
;
OPERATION
: 'A'
| 'D'
| 'A+D'
| ... etc
;
JUMP
: JMP
| JLE
| etc...
;
Now, as you can see, the lexer will get mixed up between what is a computation and what is an assignment, as both could be 'A'...
However, if I change the ASSIGNMENT to
ASSIGNMENT
: DESTINATION '='
;
and command to
command
: assignment computation
| etc...
then assignment picks up the equals sign.
So, I am trying to match under two tokens (FOO and FOO=) in different contexts, but I'm not interested in the =, only the FOO.
Am I barking up the wrong tree entirely with the current approach?
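One common way around this kind of ambiguity is to keep the lexer ignorant of roles: it emits plain symbols and =, and the parser decides from position whether a symbol is a destination or part of a computation. The sketch below shows the idea in Python rather than ANTLR. The names, the set of jumps and the token pattern are assumptions, and the computation itself is not validated.

```python
import re

DESTINATIONS = {"A", "D", "M"}
JUMPS = {"JMP", "JLE"}  # assumed subset of the jump mnemonics


def tokenize(line):
    # The lexer only produces generic symbols; "D" has no role yet.
    return re.findall(r"[A-Z]+|[01]|[=;+]", line)


def parse_command(line):
    tokens = tokenize(line)
    destination = None
    jump = None
    if "=" in tokens:
        # Whatever stands before "=" is the destination; "=" is dropped.
        if tokens.index("=") != 1 or tokens[0] not in DESTINATIONS:
            raise SyntaxError("bad destination in " + line)
        destination = tokens[0]
        tokens = tokens[2:]
    if ";" in tokens:
        split = tokens.index(";")
        rest = tokens[split + 1:]
        if len(rest) != 1 or rest[0] not in JUMPS:
            raise SyntaxError("bad jump in " + line)
        jump = rest[0]
        tokens = tokens[:split]
    if not tokens:
        raise SyntaxError("missing computation in " + line)
    return {"destination": destination, "computation": "".join(tokens), "jump": jump}


for instruction in ["M=D", "D=M", "M=D+1", "D;JMP", "0;JMP"]:
    print(instruction, parse_command(instruction))
```

In grammar terms this corresponds to making = and ; tokens of their own and moving the destination/computation distinction into parser rules, so that the = is matched but never becomes part of the captured destination.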

Rascal error when specifying grammar

I have a simple file in rascal for specifying a toy grammar
module temp
import IO;
import ParseTree;
layout LAYOUT = [\t-\n\r\ ]*;
start syntax Simple
= A B ;
syntax A = "Hello"+ ("joe" "pok")* ;
syntax A= "Hi";
syntax B = "world"*|"wembly";
syntax B = C | C C* ;
public void main () {
println("hello");
iprint(parse(#start[Simple], "Hello Hello world world world"));
}
This works fine, however, the problem is that I didn't want to write
syntax B = C | C C* ;
I wanted to write
syntax B = ( C | C C* )?
but it was rejected as a parse error by Rascal, even though all of
syntax B = ( C C C* )? ;
syntax B = ( C | C* )? ;
syntax B = C | C C* ;
are accepted fine. Can anyone explain to me what I'm doing wrong?
The sequence symbol (nested sequence) always requires brackets in Rascal. The meta notation is defined as
syntax Sym = sequence: "(" Sym+ ")" | opt: Sym "?" | alternative: "(" Sym "|" {Sym "|"}+ ")" | ... ;
So, in your example you should have written:
syntax B = (C | (C C*))?;
What is perhaps confusing is that Rascal uses the | sign twice. Once for separating top-level alternatives, once for nested alternative:
syntax X = "a" | "b"; // top-level
syntax Y = ("c" | "d"); // nested, will internally generate a new rule:
syntax ("c" | "d") = "c" | "d";
Finally, normal alternatives have sequences without brackets, as in:
syntax B
= C
| C C*
;
// or less abstractly:
syntax Exp = left Exp "*" Exp
> left Exp "+" Exp
;
BTW, we generally avoid the use of too many nested regular expressions because they are so anonymous and therefore make interpreting parse trees harder. The best usage of regular expressions is for expressing lexical syntax where we are not so much interested in the internal structure anyhow.

Lvalue awareness in ANTLR grammar and syntax predicates

I am implementing a parser with ANTLR for D. This language is based on C, so there is some ambiguity between declarations and expressions. Consider this:
a* b = c; // This is a declaration of the variable b with a pointer-to-a type.
c = a * b; // As an expression, this is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
The inputs were:
a = b
b-=c
d
Maybe I'm misunderstanding how predicates work, so it would be great if someone could correct my explanation of the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note that the assignmentExpression and the conditionalExpression work well.) If the next token isn't one of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now, there's another problem with this solution that I have realized: the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, whether a postfix and an assignment operator follow) if it is chained up.
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now, there's another problem with this solution that I have realized: the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, whether a postfix and an assignment operator follow) if it is chained up.
What's wrong with that?
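The predicate behaviour confirmed above can be mimicked by hand as speculative parsing: try the predicate's contents, remember whether they matched, rewind, and only then commit to an alternative. The sketch below is Python with heavily simplified stand-ins for postfixExpression and conditionalExpression (a single identifier, and identifiers joined by arithmetic operators). All names are hypothetical and only a few assignment operators are included.

```python
import re

ASSIGN_OPS = {"=", "+=", "-=", "*=", "/="}  # subset of the operators in the grammar
BINARY_OPS = {"+", "-", "*", "/"}


def tokenize(text):
    return re.findall(r"[+\-*/]=|[=+\-*/]|\w+", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def postfix_expression(self):
        # Stand-in for the real rule: a single identifier.
        token = self.peek()
        if token is None or not re.fullmatch(r"\w+", token):
            raise SyntaxError("expected an identifier")
        self.pos += 1
        return token

    def predicate_matches(self):
        """Speculatively match `postfixExpression assignOp`, then rewind."""
        start = self.pos
        try:
            self.postfix_expression()
            return self.peek() in ASSIGN_OPS
        except SyntaxError:
            return False
        finally:
            self.pos = start  # the predicate never consumes input

    def assign_or_conditional(self):
        if self.predicate_matches():
            return self.assign_expression()
        return self.conditional_expression()

    def assign_expression(self):
        left = self.postfix_expression()
        op = self.tokens[self.pos]
        self.pos += 1
        # Recursing through the predicated rule handles chained assignments.
        right = self.assign_or_conditional()
        return ("assign", op, left, right)

    def conditional_expression(self):
        # Stand-in for the real rule: identifiers joined by binary operators.
        left = self.postfix_expression()
        while self.peek() in BINARY_OPS:
            op = self.tokens[self.pos]
            self.pos += 1
            left = (op, left, self.postfix_expression())
        return left


def parse(text):
    return Parser(tokenize(text)).assign_or_conditional()


for source in ["a = b", "b-=c", "d", "a = b = c"]:
    print(source, "->", parse(source))
```

Because the right-hand side goes back through the predicated rule, a chained input such as a = b = c nests to the right, which is the situation the EDIT describes.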