How to implement SEPARATE island grammar in ANTLR4 with correct line numbers and char index? - sql

I've been developing a COBOL grammar with support for embedded SQL statements. For anyone who's not familiar with COBOL, here is an example.
MOVE A TO B.
EXEC SQL
SELECT C FROM T WHERE ID=1
INTO :E
END-EXEC
MOVE F TO G
The code between "EXEC SQL" and "END-EXEC" uses a (specially augmented) SQL syntax, which is a perfect example of island grammar.
I know this can be implemented with lexer modes in ANTLR4. But I have another requirement: the SQL grammar should be separate from the COBOL grammar, so that it can be reused when embedded in other languages like PL1, without copy-paste programming.
So what I did was use a simple lexer mode to capture anything between "EXEC SQL" and "END-EXEC", extract the SQL code as a String and give it to a separate SQL lexer (and parser).
This worked OK with one drawback: the line numbers and char indexes of tokens recognized by the SQL parser are counted from the start of the extracted SQL string instead of from the start of the original COBOL program. When tracing back to the source code, e.g. when there are parsing errors, the line numbers turn out to be misleading.
So the question is: is there a simpler way in ANTLR 4 to implement island grammars separately (both lexer and parser separated), while still preserving correct line numbers and char indexes in the tokens generated for the island part?
Update: I found that ANTLR 4 has a grammar import feature, and my colleague told me we had tried it but failed. The problem is that lexer modes in imported grammars are not well supported and cause compile errors. This issue is being tracked here.

To expand on Bill's comment: when instantiating your SQL parser/lexer, pass it the line offset of the beginning of the EXEC block. Implement a custom SQL token that reports its line number as that offset plus the line number relative to the SQL text. Have your SQL TokenFactory inject the offset as a constant into each token it generates.
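The offset arithmetic described above can be sketched independently of ANTLR. This is a minimal Python illustration, not the ANTLR API: `Position`, `island_start` and `to_host_position` are hypothetical names, and the column is assumed to be 0-based as ANTLR reports it. In a real setup the same arithmetic would presumably live in the custom token or token factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    line: int    # 1-based line number
    column: int  # 0-based column within the line


def island_start(host_text: str, start_index: int) -> Position:
    """Line/column in the host source where the extracted island begins."""
    before = host_text[:start_index]
    line = before.count("\n") + 1
    column = start_index - (before.rfind("\n") + 1)
    return Position(line, column)


def to_host_position(start: Position, island_pos: Position) -> Position:
    """Translate a position relative to the island text into host coordinates."""
    line = start.line + island_pos.line - 1
    # Only the island's first line shares a host line with preceding text,
    # so the column offset applies there and nowhere else.
    column = island_pos.column + (start.column if island_pos.line == 1 else 0)
    return Position(line, column)


host = (
    "MOVE A TO B.\n"
    "EXEC SQL\n"
    "SELECT C FROM T WHERE ID=1\n"
    "INTO :E\n"
    "END-EXEC\n"
    "MOVE F TO G\n"
)
sql_start = host.index("EXEC SQL") + len("EXEC SQL")
start = island_start(host, sql_start)

# The SQL lexer sees "\nSELECT C ...", so SELECT sits at island line 2, column 0.
select_in_host = to_host_position(start, Position(2, 0))
```

The absolute char index needs no line logic at all: it is the island-relative index plus `sql_start`.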
Update
Using modes to implement an idiomatic island grammar, with or without using includes (which work quite well for me at least), is the most natural approach.
Barring that, an external SQL block parser can be started from an action in the lexer or parser, from an override of the lexer's token emit() method (or related methods), or from a visitor walking the base grammar's parse tree.
Only you can balance which is acceptable, desirable or necessary in any given circumstance.
For example, if the parse tree evaluation provides a value for use in the dynamic execution of an SQL exec block or, conversely, depends on the values returned by such an execution, you are essentially forced to use a symbol table and defer initiations of the SQL executions to a walker. Of course, you can then cache each of the different generated SQL parse trees and reinitialize their symbol tables with instance specific data for reuse without reparsing the SQL blocks.
Just depends on whatever your real requirements are.

Related

How to skip the SQL(part of the SQL) parsing in antlr4?

Sorry, the original question was closed and cannot be reopened. Apologies also for my poor English; this text was machine translated. :)
https://stackoverflow.com/questions/70035964/how-to-skip-sql-parsing-in-antlr4
@BartKiers Thanks for your interest in this question; let me give a detailed example.
There are lots of SQL queries, such as select * from user or update user set field1 = 'value1' where condition = 'value'; let's call them original SQL queries.
A Java program intercepts all the original SQL queries, parses them into parse tree nodes with ANTLR4, and then rewrites each query (the rewrite depends on the parse phase), so the original SQL queries may be parsed and rewritten as
select field1, field1_encrypted, field1_digest, field2 from user
or
update user
set field1 = value1,
field1_encrypted = encrypt_algorithm(value1),
field1_digest = digest_algorithm(value1)
where condition_digest = digest_algorithm(values)
etc.
Once the rewrite phase has finished, the queries are executed as SQLStatement objects: a SELECT is executed as a SelectSQLStatement and an UPDATE as an UpdateSQLStatement.
Now I think some of the original SQL queries should skip the parse phase, and therefore the rewrite phase as well, and be executed exactly as written.
I thought of marking those with a comment, as in
/* PARSE_PHASE_SKIPPED=TRUE */ originalSQL
or with a SKIP prefix, as in
SKIP originalSQL
I would like ANTLR4 to turn the whole marked but otherwise original SQL query into a parse tree node, to be executed as a ParsePhaseSkippedSQLStatement.
Can ANTLR4 support this situation, and how should the grammar be written? Thanks in advance.
====================
Thank you for your reply @Mike Cargal. Yes, almost.
Let me restate it with a more detailed example.
There is a Java system that we call X. X has lots of SQL queries that the developers write and guarantee can be executed correctly by Ibatis / JPA etc.; let's call those the original SQL queries.
Take the following original SQL queries as examples:
insert into user (username, id_no) values ('xyz', '123456')
select username, id_no from user u where u.id_no = '123456'
Say the column id_no on table user holds sensitive data, so we should store ciphertext instead of plaintext. The original SQL queries are therefore parsed by ANTLR and rewritten by Java code as below. Let's call these the rewritten SQL queries; they must also be executed correctly by Ibatis / JPA etc.
insert
into user (username, id_no, id_no_cipher, id_no_digest)
values ('xyz', '', 'encrypted_123456', 'digest_123456')
select username, id_no_cipher as id_no
from user u
where u.id_no_digest = 'digest_123456'
In this case:
1. The rewrite phase depends on the parse phase: original SQL queries need to be parsed correctly before the Java code can rewrite them.
2. All original SQL queries are parsed, but only the few that match the sensitive-data rules are rewritten.
But there are lots of original SQL queries that we know for certain do not need to be rewritten and therefore do not need to be parsed. Parsing them may throw exceptions in various complex situations, yet they would execute correctly through Ibatis / JPA etc.
So I planned to use an SQL comment or a custom keyword annotation to "turn off" the parse phase for them.
If I understand your question correctly, you wish to use some sort of comment/annotation to "turn off" execution of the following SQL statement.
(NOTE: You can't really skip "parsing" part of the input. This will address ways in which you could skip processing part of the parsed input, which I believe is what you're ultimately wanting to accomplish.)
This would not be an ANTLR concern. ANTLR's responsibility is to parse your input stream and produce a parse tree (not technically an AST) that correctly represents the structure of your input.
Executing the SQL is not what ANTLR does. It does, however, generate utility Listener and Visitor classes that can be used to cleanly and efficiently navigate the resulting parse tree. There can be a lot of code involved in actually executing the SQL from the parse tree. Often, the first step is to produce an AST from the parse tree to make it easier to deal with.
You have a couple of alternatives (as you hint at).
1 - Use the current grammar and put these annotations inside comments (/* PARSE_PHASE_SKIPPED=TRUE */)
This can be done, but it's a bit "messy". Most likely COMMENT tokens are sent to -> skip (or perhaps to -> channel(HIDDEN)). This makes it MUCH easier to write the parser rules, as you don't have to include optional COMMENTs everywhere a comment could appear. Tokens matched with -> skip are discarded entirely, so this approach needs the HIDDEN channel: if you send COMMENT tokens there, they are still in the token stream even though the parser ignores them. The COMMENT tokens won't be in the rule Context objects that the listeners/visitors deal with, but you could look backwards/forwards in the token stream for COMMENT tokens.
2 - Introduce some new syntax for annotations (similar to your SKIP idea). To do this you'll have to extend the grammar to recognize these annotations. They'd have to be distinguishable from valid SQL, so a simple SKIP is probably not going to work.
The benefit of this approach is that, when you extend the grammar to recognize annotations, you can be very specific about where annotations are allowed. You'd be able to include them in your parse tree, making them easier to locate.
With either of these approaches, you would use a visitor or listener to go through your parse tree looking for the comment/annotation and then mark the subsequent statement with an indicator that you don't want to execute it. (You might use the information to simply skip the parse tree to AST transformation of the "skipped" nodes).
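The "look backwards in the token stream" idea from the first alternative can be sketched with a toy token list. This is not the ANTLR runtime: `Token`, the channel constants and both helper functions are hypothetical stand-ins (the ANTLR 4 Java runtime offers a comparable lookup through `BufferedTokenStream.getHiddenTokensToLeft`, which is worth checking against the version you use).

```python
from dataclasses import dataclass

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT_CHANNEL


def hidden_tokens_to_left(tokens, index):
    """Collect the run of hidden-channel tokens immediately before tokens[index]."""
    found = []
    i = index - 1
    while i >= 0 and tokens[i].channel == HIDDEN_CHANNEL:
        found.append(tokens[i])
        i -= 1
    found.reverse()
    return found


def is_marked_skipped(tokens, statement_start):
    """True if a PARSE_PHASE_SKIPPED=TRUE comment directly precedes the statement."""
    for tok in hidden_tokens_to_left(tokens, statement_start):
        normalized = tok.text.replace(" ", "").upper()
        if tok.type == "COMMENT" and "PARSE_PHASE_SKIPPED=TRUE" in normalized:
            return True
    return False


# Token stream for: /* PARSE_PHASE_SKIPPED=TRUE */ select * from user
tokens = [
    Token("COMMENT", "/* PARSE_PHASE_SKIPPED=TRUE */", HIDDEN_CHANNEL),
    Token("WS", " ", HIDDEN_CHANNEL),
    Token("SELECT", "select"),
    Token("STAR", "*"),
    Token("FROM", "from"),
    Token("ID", "user"),
]
```

A listener or visitor would call something like `is_marked_skipped` with the index of the statement's first token and then skip the rewrite step for that statement.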
Let me see if I understand your question correctly. In your environment you run SQL queries (not "SQLs", btw.), which may contain data that must not reach the server as is. It doesn't matter whether that is sensitive data or something else. All that matters is that you want to replace the text in the queries.
For that you parse the queries and rewrite them, before sending them to the server. However, you don't want to do that for all queries, but only for specific ones. And you came up with the idea to mark queries (or query parts) that must not be transformed, with a special comment. Does this match your intention so far?
Now I wonder why you want to accomplish that on the query (parsing) level. It's not the parsing you want to modify but the semantic handling of the parse result (here the parse tree, as Mike Cargal already mentioned). So, in my opinion, you don't need to introduce special markup for your queries, but should instead define criteria that indicate which data must be replaced.
When you think about that you will probably realise that data for specific fields (columns) in specific tables need to be replaced. You can actually keep a list of schema/table/columns tuples, which tell your rewriter if a value must be rewritten. Everything else stays as it is.
ANTLR4 has nothing to do in this process. It's all to be done in the semantic phase (the processing of the parse tree using a parse tree listener). In this phase you have to collect all column references that are used in a query. Then you compare that list with the list for the rewriter. If a column reference matches, the rewriter has to rewrite the text for it in the query.
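The comparison step might look like the following sketch. Everything here is illustrative: the `SENSITIVE` rule set, the `app` schema name and the shape of the collected references are assumptions, and the column references are written out by hand rather than collected by a real parse tree listener. Nested queries, which the next paragraph warns about, are not handled.

```python
# Hypothetical rule list: (schema, table, column) tuples that must be rewritten.
SENSITIVE = {("app", "user", "id_no")}


def columns_to_rewrite(column_refs, aliases, default_schema="app"):
    """Resolve column references to tables and keep the sensitive ones.

    column_refs: (qualifier, column) pairs collected from the parse tree;
                 qualifier is None for an unqualified column
    aliases:     alias -> table name, collected from the FROM clause
    """
    hits = []
    for qualifier, column in column_refs:
        if qualifier is None:
            if len(aliases) != 1:
                continue  # ambiguous without schema metadata; skipped in this sketch
            table = next(iter(aliases.values()))
        else:
            table = aliases.get(qualifier, qualifier)
        if (default_schema, table, column) in SENSITIVE:
            hits.append((qualifier, column))
    return hits


# select username, id_no from user u where u.id_no = '123456'
refs = [(None, "username"), (None, "id_no"), ("u", "id_no")]
hits = columns_to_rewrite(refs, {"u": "user"})
```

Queries with no hits pass through unchanged, which gives the "only rewrite specific queries" behaviour without any markup in the SQL itself.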
That task is however non-trivial, because of nested queries (subqueries, where inner queries can reference tables from an outer query). This is btw. pretty similar to the way code completion works, where you have to provide a possible column list for all mentioned tables in a query. That's why I have written (C++) code to collect such references in MySQL Workbench's SQL code completion.

What does parsing a query mean?

Most relational databases handle a JDBC / SQL query in four steps:
Parse the incoming SQL query
Compile the SQL query
Plan/optimize the data acquisition path
Execute the optimized query / acquire and return data
I want to know what "parse the incoming query" really means, and what "plan/optimize the data acquisition path" means.
Parsing means examining the input characters and recognizing them as a command or statement: looking through the characters for keywords and identifiers, ignoring comments, arranging quoted portions as string constants, and matching the overall structure to the language syntax to make sense of it all.
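The first half of that description (keywords, identifiers, comments, string constants) is the lexical step, and a toy version fits in a few lines. This sketch is not how any particular database does it; the keyword list and token categories are assumptions chosen for the example, and matching the overall structure against the grammar would be a separate step on top of these tokens.

```python
import re

# Small, illustrative keyword set; a real SQL dialect has far more.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "INSERT", "UPDATE", "SET"}

TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|--[^\n]*)     # block or line comment
  | (?P<string>'(?:[^']|'')*')          # quoted text, '' is an escaped quote
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)    # keyword or identifier
  | (?P<symbol><>|<=|>=|[=<>*,.()])
  | (?P<space>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize(sql):
    """Split SQL text into (kind, text) pairs, dropping comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("comment", "space"):
            continue
        if kind == "word":
            kind = "keyword" if text.upper() in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("select c /* note */ from t where name = 'a b'")
```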
Plan/optimize means figuring out the best way (of all the possible ways) to determine the result, usually with respect to execution time. It could also mean minimizing the number of locks needed. Maybe some parts of the query can be ignored (where ... and 1 == 1) or a table doesn't need to be accessed at all, etc.
Parsing is one of the phases of compilation.
Phases of a Compiler:
Source: Phases of Compiler
1) Parsing: syntactic analysis of the query according to the SQL grammar rules, etc., and an attempt to "tokenize" the query into its elementary parts.
2) Planning/optimization: at that stage the SQL engine tries to evaluate what the best way to execute your query would be. It tries to take advantage of existing indexes, clusters and table relationships; find ways around full table scans, utilize caching effectively by avoiding repeated data reads, and so forth.

SQL queries to their natural language description

Are there any open source tools that can generate a natural language description of a given SQL query? If not, some general pointers would be appreciated.
I don't know much about NLP, so I am not sure how difficult this is, although I saw from some previous discussion that the reverse conversion is still an active area of research. It might help to say that the SQL tables I will be handling are not arbitrary but my own, which means that I know the exact semantics of each table and its columns.
I can devise two approaches:
SQL was intended to be "legible" to non-technical people. A naïve and simpler way would be to perform a series of replacements right on the SQL query: "SELECT" -> "display"; "X=Y" -> "when the field X equals to value Y"... in this approach, using functions may be problematic.
Use a SQL parser and use a series of templates to realize the parsed structure in a textual form: "(SELECT (SUM(X)) (FROM (Y)))" -> "(display (the summation of (X)) (in the table (Y)))"...
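The template approach could be sketched as below, assuming the parser has already produced a nested structure. The tuple representation and the three templates are made up for the example; a real tool would need one template per construct and smarter joining of the pieces.

```python
# Hypothetical templates keyed by node type; "{}" takes the rendered children.
TEMPLATES = {
    "SELECT": "display {}",
    "SUM": "the summation of {}",
    "FROM": "in the table {}",
}


def describe(node):
    """Render a nested-tuple parse structure as text."""
    if isinstance(node, str):
        return node  # leaf: a column or table name
    head, *children = node
    rendered = " ".join(describe(child) for child in children)
    return TEMPLATES[head].format(rendered)


# (SELECT (SUM(X)) (FROM (Y)))
tree = ("SELECT", ("SUM", "X"), ("FROM", "Y"))
description = describe(tree)
```

Because the templates are applied to the parsed structure rather than to raw text, function calls such as SUM(X) stop being a special case, which was the weak point of the plain replacement approach.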
ANTLR has a grammar of SQL you can use: https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4 and there are a couple SQL parsers:
http://www.sqlparser.com/sql-parser-java.php
https://github.com/facebook/presto/tree/master/presto-parser/src/main
http://db.apache.org/derby/
Parsing is a core process for executing a SQL query, check this for more information: https://decipherinfosys.wordpress.com/2007/04/19/parsing-of-sql-statements/
There is a new project (I am part of) called JustQuery.Me which intends to do just that with NLP and Google's SyntaxNet. You can go to the https://github.com/justquery-me/justqueryme page for more info. Also, sign up for the mailing list at justqueryme-development@googlegroups.com and we will notify you when we have a proof of concept ready.

move subtree from one part of AST to another

I am working on a tool to convert Oracle SQL to ANSI SQL. I have a grammar that will parse both Oracle SQL and ANSI SQL.
I want to extract the Oracle outer join expressions from the where clause part of the AST and insert new join clauses at the end of the from clause part of the AST for the matching select or subquery.
Can a tree parser with rewrite rules do this type of tree transformation?
i.e. take an AST generated from Oracle SQL
SELECT
a.columna, b.columnb
FROM
tablea a,
tableb b
WHERE
a.columna2 (+) = b.columnb2 (+)
AND
a.columna3 = 'foo'
AND
b.columnb3 = 'bar'
and transform it to an AST for ANSI SQL
SELECT
a.columna, b.columnb
FROM
tablea a FULL OUTER JOIN tableb b ON (a.columna2 = b.columnb2 )
WHERE
a.columna3 = 'foo'
and
b.columnb3 = 'bar'
NOTE1: the table references for tablea and tableb are deleted from the FROM clause and replaced with a JOIN clause referring to the same tables and table aliases.
NOTE2: the Oracle join condition is identified as a FULL OUTER JOIN by the presence of the OuterJoinIndicator (+) on both sides of the sql_condition comparison.
NOTE3: the join condition comparison is deleted from the WHERE clause and used to construct the join clause ON condition [with the OuterJoinIndicator(s) removed].
Yes, this is quite possible, especially since you have a grammar that recognizes both Oracle and ANSI SQL. I once wrote a translator from AREV BASIC to Visual BASIC and did many similar transformations.
In my project I used ANTLR 2 and wrote a master tree grammar which did nothing but completely walk the tree according to all rules in my grammars. I then used ANTLR 2's subclassing to override specific rules to do the transformations. I liked this as it let me build up the translation in passes and keep all my expression handling in one pass, control structures in another pass, etc.
ANTLR 3 does not provide grammar subclassing, so you won't be able to use that approach. You will need a complete tree grammar to print out your resulting tree. Personally, I would write that tree grammar first and get it working properly. Then I would copy that grammar and strip all the actions out but put in the option to rewrite the AST. Then modify the rules you need for your transformation. If you do many transformations you may want to use multiple passes, one tree grammar for each pass. You may have a pass or two that does analysis to help drive the later passes. On my BASIC translation project I did control flow analysis, data flow analysis and dead code removal as analysis passes.
If you want help writing the specific transformation you'll need to share your tree grammar. There are quite a few tree grammar idioms to wrap your head around. Terence's ANTLR 3 book would be a valuable purchase if you need help there. If you haven't written the tree grammar yet then post questions when you get stuck. Choosing the correct root nodes is important. If you want to get an idea of how to build trees and tree parsers, you can look at my C grammar. It is ANTLR 2, but the tree building concepts are the same. http://www.antlr3.org/grammar/cgram/grammars/
Do you need to retain comments and formatting? That adds another layer of complexity, for which I would recommend creating another question.
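Leaving ANTLR's tree grammar syntax aside, the transformation itself can be sketched on a plain data structure. The dictionary layout, the `(alias, column, outer_flag)` operand tuples and the function names below are assumptions made for this illustration, and only the case with (+) on both sides is handled; one-sided indicators (LEFT/RIGHT joins) and subqueries are left out.

```python
def is_outer(operand):
    """A column operand is (alias, column, outer_flag); literals are plain strings."""
    return isinstance(operand, tuple) and operand[2]


def to_ansi(select):
    """Move (+) = (+) conditions out of WHERE and into FULL OUTER JOIN clauses."""
    tables = {alias: table for table, alias in select["from"]}
    from_items = list(select["from"])
    joins, where = [], []
    for cond in select["where"]:
        left, right = cond["left"], cond["right"]
        if is_outer(left) and is_outer(right):
            joined = (tables[right[0]], right[0])
            # The joined table leaves the comma-separated FROM list ...
            from_items.remove(joined)
            # ... and reappears in a join clause whose ON condition is the
            # original comparison with the outer join indicators cleared.
            joins.append({
                "type": "FULL OUTER JOIN",
                "table": joined,
                "on": {"left": left[:2] + (False,), "op": cond["op"],
                       "right": right[:2] + (False,)},
            })
        else:
            where.append(cond)
    return {"from": from_items, "joins": joins, "where": where}


oracle = {
    "from": [("tablea", "a"), ("tableb", "b")],
    "where": [  # conditions are implicitly ANDed
        {"left": ("a", "columna2", True), "op": "=", "right": ("b", "columnb2", True)},
        {"left": ("a", "columna3", False), "op": "=", "right": "'foo'"},
        {"left": ("b", "columnb3", False), "op": "=", "right": "'bar'"},
    ],
}
ansi = to_ansi(oracle)
```

The same three moves (remove from FROM, add a join clause, drop the condition from WHERE) are what a rewriting tree pass would have to express, whichever tool does it.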
If you have two different grammars, you are likely to discover that the "small differences" in the grammars lead to quite considerably different ASTs for these clauses, and so your real problem is to convert the tree structure for one into the tree structure for another. And you'll have to do this piecewise for the whole tree because such differences are spread all across the grammars. YMMV.
ANTLR's tree parser will pretty likely let you recognize arbitrary fragments; these are certainly cues for generating equivalents in the other grammar's AST. But you'll have to write lots of such fragments, and code the corresponding routines to assemble the equivalent tree node-by-node. As a general rule for a large grammar (such as Oracle SQL), this can be quite a lot of work. You can do it this way.
An alternative is program transformation systems. These are tools that allow you to write surface syntax patterns (e.g., phrases in Oracle SQL and ANSI SQL) to code and apply your transformations directly. Writing transformations this way is considerably easier IMHO. You'd end up writing something like this:
source domain Oracle.
target domain ANSISQL.
rule xlate_Oracle_SELECT(c: columns, t1: table, t2: table,
c1: column, c2: column,
more_conditions: conditions):SQL_phrase
"SELECT \c FROM \t1, \t2 WHERE \c1 (+) = \c2 (+) and \more_conditions";
=>
"SELECT \c FROM \t1 FULL OUTER JOIN \t2 on ( \c1 = \c2 ) WHERE \more_conditions";
(The backslash-IDs are pattern variables which can match an arbitrary subtree of the declared syntax type legal in that location.)
The reason this works is that the transformation tool parses the first pattern with the first grammar, and so gets a tree it can match on trees of the first grammar, and similarly parses the second pattern using the second grammar, getting a replacement tree that follows the rules of the second grammar. The transformation engine matches the tree for first pattern, and substitutes the tree for the second. So such a rule transforms a small set of blue tree nodes from the blue tree, to a small set of green nodes of the desired tree type. The color analogy should make it clear that you have to translate all the blue nodes into green ones if you want an accurate translation.
You'd need additional rules to translate the various subclauses just to paper over the differences in the grammar, e.g.,:
rule translate column(t: IDENTIFIER, c: IDENTIFIER, ):table->table
"\t.\c" -> " \toSQLidentifier\(\t\).\toSQLidentifier\(\c\)";
This would handle differences in how the two languages spell identifiers, by calling a custom function toSQLidentifier that does string hacking.
I don't think ANTLR supports this kind of transformation rule. You can simulate it all with lots of code.
You might avoid some of this if you have one "union" grammar for both languages (which is what you imply), but this usually gets you a highly ambiguous grammar and that's a huge amount of trouble. If you have succeeded in this, then you only have to apply translation rules where the languages differ (e.g., everything is a blue node).
You can also hack it: scan the tree left to right; prettyprint the parts that are equivalent (figuring this out is harder than it looks), where they differ, prettyprint a substitution. This is a very fragile way to do this.

Regular expression to match common SQL syntax?

I was writing some Unit tests last week for a piece of code that generated some SQL statements.
I was trying to figure out a regex to match SELECT, INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.
I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statement.
Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.
By the way it's C# RegEx that I'm after.
Clarification
I don't want to need access to a database, as this is part of a unit test and I don't want to have to maintain a database to test my code, which may live longer than the project.
Regular expressions can match only languages that a finite state automaton can parse, which is very limited, whereas SQL has a nested, recursive syntax. It can be demonstrated that you can't validate SQL with a regex. So, you can stop trying.
SQL is a type-2 (context-free) grammar; it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. A database engine is in general too complex to be easily stubbed.
That said, you may try ANTLR's SQL grammars.
As far as I know this is beyond regex, and you're getting close to the dark arts of BNF and compilers.
http://savage.net.au/SQL/
The same thing happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...
I had the same problem. An approach that would work for most standard SQL statements is to spin up an in-memory SQLite database and issue the query against it; if you get back a "table does not exist" error, then your query parsed properly.
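A minimal sketch of that idea, using Python's standard `sqlite3` module rather than C# for brevity. The function name is made up, and the check relies on the wording of SQLite's error messages ("no such table" for a missing table), which is an assumption worth pinning down with a test of your own. SQLite's dialect also differs from other engines, so vendor-specific syntax may be rejected.

```python
import sqlite3


def parses_in_sqlite(sql: str) -> bool:
    """True if SQLite accepts the statement's syntax; missing tables are tolerated."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(sql)
        return True
    except sqlite3.OperationalError as exc:
        # Names are resolved after parsing, so "no such table" means the
        # statement got past the parser. Anything else counts as a failure.
        return str(exc).startswith("no such table")
    finally:
        conn.close()


ok = parses_in_sqlite("select * from non_existant_table")
bad = parses_in_sqlite("select * frm non_existant_table")
```

Creating the schema in the in-memory database first would tighten the check further, since column names could then be validated as well.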
Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?
Have you tried lazy quantifiers? Rather than match as much as possible, they match as little as possible, which is probably what you need for quotes.
To validate the queries, just run them with SET NOEXEC ON; that is how Enterprise Manager does it when you parse a query without executing it.
Besides, if you are using regex to validate SQL queries, you can be almost certain that you will miss some corner cases, or that the query is not valid for other reasons even if it's syntactically correct.
I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.
I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.
With Oracle as an example, you would certainly get an error if you did:
select * from non_existant_table;
In this case, "ORA-00942: table or view does not exist".
However if you execute:
select * frm non_existant_table;
Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".
It ought to be possible to classify errors into parsing errors that indicate incorrect syntax and errors relating to table names, permissions, etc.
Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.
There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.
The best way is to validate the parameters used to create the query, rather than the query itself. A function that receives the variables can check the length of the strings, valid numbers, valid emails or whatever. You can use regular expressions to do these validations.
// Requires: using System.Text.RegularExpressions;
public bool IsValid(string sql)
{
    // Rough shape check only: accepts anything of the form SELECT ... FROM ... WHERE ...
    string pattern = @"SELECT\s.*FROM\s.*WHERE\s.*";
    Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
    return rgx.IsMatch(sql);
}
I am assuming you did something like .* so try [^"]* instead; that will keep you from eating the whole line. It will still give false positives in cases where you have \ inside your strings.
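The difference between the greedy pattern and the two suggested fixes is easy to see on a small input. The example below uses Python's `re` module; the quantifier syntax shown (`.*`, `.*?`, `[^"]*`) is the same in .NET regular expressions. The sample string is made up.

```python
import re

sql = 'SELECT "a" FROM t WHERE "b" = 1'

greedy = re.findall(r'".*"', sql)       # runs on to the last quote in the line
negated = re.findall(r'"[^"]*"', sql)   # stops at the next quote
lazy = re.findall(r'".*?"', sql)        # lazy quantifier, same result here
```

Here `greedy` swallows everything from the first quote to the last one, while `negated` and `lazy` each find the two quoted sections separately. Neither fix copes with escaped quotes inside a string, as noted above.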