Most relational databases handles a JDBC / SQL query in four steps:
Parse the incoming SQL query
Compile the SQL query
Plan/optimize the data acquisition path
Execute the optimized query / acquire and return data
I want to know what does "parse the incoming query" really mean? And what does "plan/optimize data acquisition path" mean?
Parsing means examining the characters input and recognizing it as a command or statement by looking through the characters for keywords and identifiers, ignoring comments, arranging quoted portions as string constants, and matching the overall structure to the language syntax making sense of it all.
Plan/optimize means figure out the best way (of all the possible ways) to determine the result, usually with respect to execution time. It could also mean minimizing the number of locks needed. Maybe some parts of the query can be ignored (where ... and 1 == 1) or a table doesn't need to be accessed at all, etc.
parsing is one of the Process of compilation.
Phases of a Compiler:
Source: Phases of Compiler
1) Parsing: syntactic analysis of the query according to the SQL grammar rules, etc. and attempting to "tokenize" the query into the elementary parts form.
2) Planning/optimization: at that stage the SQL engine tries to evaluate what the best way to execute your query would be. It tries to take advantage of existing indexes, clusters and table relationships; find ways around full table scans, utilize caching effectively by avoiding repeated data reads, and so forth.
Related
Sorry for this question was closed and can not be reopened, and my poor english, it was translated by website indeed. :)
https://stackoverflow.com/questions/70035964/how-to-skip-sql-parsing-in-antlr4
#BartKiers Thanks for being interested in this question, let me give it a detailed example.
There are lots of SQL queries, such as select * from user or update user set field1 = 'value1' where condition = 'value' etc, let's called it original SQL queries.
There is a java program which intercepts and parses all the original SQL queries into Parse Tree nodes by ANTLR4, and then rewrites the query (which depended on the parse phase) by the java program, so the original SQL queries may be parsed and rewritten as
select field1, field1_encrypted, field1_digest, field2 from user
or
update user
set field1 = value1,
field1_encrypted = encrypt_algorithm(value1),
field1_digest = digest_algorithm(value1)
where condition_digest = digest_algorithm(values)
etc.
While they finished the rewritten phase, they should be executed as SQLStatement, the SELECT was executed as SelectSQLStatement while UPDATE executed as UpdateSQLStatement.
Now I thought some of the original SQL queries should skip the parse phase, and the rewrite phase which should be skipped as the same, but the originalSQL queries should be executed as it was.
I thought to mark those with comment as
/* PARSE_PHASE_SKIPPED=TRUE */ originalSQL
or prefix SKIP as
SKIP originalSQL
, I wish to parse the whole marked but original SQL query part into Parse Tree nodes by ANTLR4, and executed it as ParsePhaseSkippedSQLStatement.
Can ANTLR4 support on this situation, and how should the grammar be written? Thanks in advance.
====================
Thank you for your reply #Mike Cargal, Yes, almost.
Let me say it again and give a more detailed example.
There is a java system that we call it X, X has lots of SQL queries that the developers write and guarantee that those SQLs can be executed correctly by Ibatis / JPA etc, let's named those SQL queries as original SQL queries.
Using below original SQL queries as examples:
insert into user (username, id_no) values ('xyz', '123456')
select username, id_no from user u where u.id_no = '123456'
We say the column id_no on table user is sensitive data, we should save ciphertext instead of plaintext, so the originalSQLs would be parsed by ANTLR and rewritten by java code as below, let's named those SQLs as rewritten SQL queries, also rewritten SQL queries should be executed correctly by Ibatis / JPA etc.
insert
into user (username, id_no, id_no_cipher, id_no_digest)
values ('xyz', '', 'encrypted_123456', 'digest_123456')
select username, id_no_cipher as id_no
from user u
where u.id_no_digest = 'digest_123456'
In this case:
1、we see that the rewrite phase depends on the parse phase, original SQL queries need to be correctly parsed then to be rewritten by java code.
2、all original SQL queries are parsed but only a few matching the sensitive rules are rewritten to rewritten SQL queries.
But there are lots original SQL queries we clearly know that do not need to be rewritten, and also no need to be parsed, and may report exceptions in various complex situations while parsing it, but it should be executed correctly by Ibatis / JPA etc.
So I planed to use sql comment / customized keyword annotation to "turn off" parse phase of it.
If I understand your question correctly, you wish to use some sort of comment/annotation to "turn off" execution of the following SQL statement.
(NOTE: You can't really skip "parsing" part of the input. This will address ways in which you could skip processing part of the parsed input, which I believe is what you're ultimately wanting to accomplish.)
This would not be an ANTLR concern. ANTLR's responsibility is to parse you input stream and produce a parse tree (not technically an AST) that correctly represents the structure of your input.
Executing the SQL is not what ANTLR does. It does, however, generate utility Listener and Visitor classes that can be used to cleanly and efficiently navigate the resulting parse tree. There can be a lot of code involved in actually executing the SQL from the parse tree. Often, the first step is to produce an AST from the parse tree to make it easier to deal with.
You have a couple of alternatives (as you hint at).
1 - Using the current grammar an putting these annotations inside of comments (/* PARSE_PHASE_SKIPPED=TRUE */)
This can be done, but it's a bit "messy". It's most likely that COMMENT tokens are -> skiped (or perhaps sent to -> channel(HIDDEN)). This makes it MUCH easier to write the parser rules as you don't have to include optional COMMENTs everywhere a comment could appear. That said, if you send COMMENT tokens to the HIDDEN channel, they are still in the token stream even though they are ignored by the parser. The COMMENT tokens won't be in the rule Context objects that the listeners/visitors deal with, but you could look backwards/forwards in the token stream for COMMENT nodes.
2 - you could introduce some new syntax for annotations (similar to your SKIP idea). To do this you'll have to extend the syntax in the grammar to recognize these annotations. They'd have to be distinguishable from valid SQL, so a simple SKIP is probably not going to work.
The benefit of this approach is that, when you extend the grammar to recognize annotations, you can be very specific about where annotations are allowed. You'd be able to include them in your parse tree, making them easier to locate.
With either of these approaches, you would use a visitor or listener to go through your parse tree looking for the comment/annotation and then mark the subsequent statement with an indicator that you don't want to execute it. (You might use the information to simply skip the parse tree to AST transformation of the "skipped" nodes).
Let me see if I understand your question correctly. In your environment you run SQL queries (not "SQLs", btw.), which may contain data that must not reach the server as is. It doesn't matter if that is sensitive data or what else. All what matters is that you want to replace the text in the queries.
For that you parse the queries and rewrite them, before sending them to the server. However, you don't want to do that for all queries, but only for specific ones. And you came up with the idea to mark queries (or query parts) that must not be transformed, with a special comment. Does this match your intention so far?
Now I wonder why you want to accomplish that on query (parsing) level. It's not the parsing you want to modify but the semantic handling of the parse result (here the parse tree, as Mike Cargal already mentioned). So, in my opinion you don't need to introduce special markup for your queries, but instead define criterions that indicate which data must be replaced.
When you think about that you will probably realise that data for specific fields (columns) in specific tables need to be replaced. You can actually keep a list of schema/table/columns tuples, which tell your rewriter if a value must be rewritten. Everything else stays as it is.
ANTLR4 has nothing to do in this process. It's all to be done in the semantic phase (the processing of the parse tree using a parse tree listener). In this phase you have to collect all column references that are used in a query. Then you compare that list with the list for the rewriter. If a column reference matches, the rewriter has to rewrite the text for it in the query.
That task is however non-trivial, because of nested queries (subqueries, where inner queries can reference tables from an outer query). This is btw. pretty similar to the way code completion works, where you have to provide a possible column list for all mentioned tables in a query. That's why I have written (C++) code to collect such references in MySQL Workbench's SQL code completion.
I wish to analyze the queries executed on certain redshift warehouse (not mine).
In order to do so I'm using a query with a join on stl_querytext and stl_query.
My question is how come I'm also getting illegal queries (I.E queries with wrong sql syntax)?
When I've tried it in my local redshift I haven't seen those. Also, couldn't find relevant documentation.
Is this a configuration issue? And in case I'm supposed to those queries is there a way to know those are illegal ones?
Thanks,
Nir.
So stl_querytext breaks long queries into parts identified by sequence number. I hope you are reconstructing the parts into the original query as a first step. This can be done with listagg() function as long as the resulting query doesn't over the max tex field (about 320 parts).
Now this is not enough to get valid SQL back in all cases because you need to treat combining the parts differently depending if the section of the query is inside or outside a text string in the query. (Is white space needed between parts or not)
I've done this exact process a bunch so this is doable. I don't have a perfect process on the whitespace question, I get close. Maybe others know the expression to get an exact recreation of the query from stl_querytext.
Are there any open source tools that can generate a natural language description of a given SQL query? If not, some general pointers would be appreciated.
I don't know much about NLP, so I am not sure how difficult this is, although I saw from some previous discussion that the vice versa conversion is still an active area of research. It might help to say that the SQL tables I will be handling are not arbitrary in any sense, yet mine, which means that I know exact semantics of each table and its columns.
I can devise two approaches:
SQL was intended to be "legible" to non-technical people. A naïve and simpler way would be to perform a series of replacements right on the SQL query: "SELECT" -> "display"; "X=Y" -> "when the field X equals to value Y"... in this approach, using functions may be problematic.
Use a SQL parser and use a series of templates to realize the parsed structure in a textual form: "(SELECT (SUM(X)) (FROM (Y)))" -> "(display (the summation of (X)) (in the table (Y))"...
ANTLR has a grammar of SQL you can use: https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4 and there are a couple SQL parsers:
http://www.sqlparser.com/sql-parser-java.php
https://github.com/facebook/presto/tree/master/presto-parser/src/main
http://db.apache.org/derby/
Parsing is a core process for executing a SQL query, check this for more information: https://decipherinfosys.wordpress.com/2007/04/19/parsing-of-sql-statements/
There is a new project (I am part of) called JustQuery.Me which intends to do just that with NLP and google's SyntaxNet. You can go to the https://github.com/justquery-me/justqueryme page for more info. Also, sign up for the mailing list at justqueryme-development#googlegroups.com and we will notify you when we have a proof of concept ready.
OK, at the first glance, it seems that it must be more efficient to use SQLPrepare+SQLBindParameter+SQLExecute than format string (e.g. with CString::Format) and pass the whole complete query string to SQLExecDirect. If not, why would there exist the second method (SQLPrepare+SQLBindParameter+SQLExecute) at all?
BUT... here is what I think: The driver has sooner or later (I suspect later, but anyway...) convert the parameters (that I feed it with SQLBindParameter) into string representation right? (Or maybe not?) So if I make this formatting in my application (printf-like formatting), will I have any loss in performance?
One thing I suspect is that when the connection is over the network, passing parameters as raw data and then formatting them at server end might decrease the network traffic, instead of passing preformatted query strings, but lets ignore the network traffic for a moment. If not that, is there any performance gain in using SQLPrepare+SQLBindParameter+SQLExecute instead of formatting full query string in application and then using SQLExecDirect?
For me using SQLExecDirect is simpler and more convenient, so I need good answer on whether (and if) I should opt to another approach.
Important: If you will say that SQLPrepare+SQLBindParameter+SQLExecute approach will give better performance, I'd like to know how much! I don't mind theoretical assumptions, I'd like to know when is it worth practically? My current use-case is not very db-intensive, I won't have more than 100 inserts/updates per second, is it ok to use SQLExecDirect? In what scenarios - if ever - do I have to use SQLPrepare+SQLBindParameter+SQLExecute?
If you are inserting or updating with the same SQL (excluding parameters) repeatedly then SQLPrepare, SQLBindParameter and SQLExecute will be faster than SQLExecDirect every time. Consider:
SQLPrepare("insert into mytable (cola, colb) values(?,?);");
for (n = 0; n < 10000; n++) {
SQLBindParameter(1, n);
SQLBindParameter(2, n);
SQLExecute;
}
and
for (n = 0; n < 10000; n++) {
char sql[1000];
sprintf("insert into mytable (cola, colb) values(%d,%d)", n, n);
SQLExecDirect(sql);
}
In the first example, the statement is prepared once and hence the db engine only has to parse it once and work out an execution plan once. In the second example the sql and parameters are passed every time and the SQL looks different every time so it is parsed each time.
In addition, in the first example you can use arrays of parameters to pass multiple rows of parameters in one go - see SQL_PARAMSET_SIZE.
See 3.1.2 Inserting data for a worked example and an indication of how much time you can save.
Ignore network traffic, you'll just be second guessing what happens under the hood in the driver.
ADDITION:
Regarding your description of what happens with parameters where you seem to think the driver will convert them to strings; the other advantage of binding parameters is you can provide them in one type and ask the driver to use them as another type. You may find you'll come across a parameter type which cannot easily be represented as a string without adding some sort of conversion function which could be avoided with a parameter.
Yes, it's a bad idea, and for two reasons:
Performance
SQLPrepare causes the SQL statement to be parsed, and depending on the SQL statement it can be time consuming. If you're using a DB on another server, it might get sent to it for parsing. Even if the parsing takes only e.g. 10% time of executing your whole query, you save that time when executing the statement twice. That may be the case when you're inserting multiple rows, or call a "select" another time.
Of course the SQL statement passed must always be a static string. Some SQL frameworks may even do prepared statement caching for you. I don't know if ODBC does this. If you want to have real performance numbers, you have to measure for yourself - every query is different (and even might depend on the table contents, too).
SQL Injection
No matter what you say where the data comes from that you're formatting with CString::Format or any other method, you might be at risk for SQL injection. Even if you're using strings from your sources, sometimes later you or someone other may be changing your code to accept data from outside, and then you're vulnerable to SQL injection. If you need more info about SQL injection, just search StackOverflow, I'm sure there are some good questions about it, or see this image:
I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database as this is part of a Unit test and I don't wan't to have to maintain a database to test my code. which may live longer than the project.
Regular expressions can match languages only a finite state automaton can parse, which is very limited, whereas SQL is a syntax. It can be demonstrated you can't validate SQL with a regex. So, you can stop trying.
SQL is a type-2 grammar, it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. Database engine in general is too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.
As far as I know this is beyond regex and your getting close to the dark arts of BnF and compilers.
http://savage.net.au/SQL/
Same things happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...
I had the same problem - an approach that would work for all the more standard sql statements would be to spin up an in-memory Sqlite database and issue the query against it, if you get back a "table does not exist" error, then your query parsed properly.
Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?
Have you tried the lazy selectors. Rather than match as much as possible, they match as little as possible which is probably what you need for quotes.
To validate the queries, just run them with SET NOEXEC ON, that is how Entreprise Manager does it when you parse a query without executing it.
Besides if you are using regex to validate sql queries, you can be almost certain that you will miss some corner cases, or that the query is not valid from other reasons, even if it's syntactically correct.
I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.
I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into syntax parsing errors that indicate incorrect syntax and errors relating to tables name and permissions etc..
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.
There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.
The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do this validations.
public bool IsValid(string sql)
{
string pattern = #"SELECT\s.*FROM\s.*WHERE\s.*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
return rgx.IsMatch(sql);
}
I am assuming you did something like .\* try instead [^"]* that will keep you from eating the whole line. It still will give false positives on cases where you have \ inside your strings.