ANTLR Filtering Grammar for specific tokens and ignore everything else possible? - sql

I am currently trying to create a SQL-grammar for the Data Definition Language.
For my program, the parser only needs to recognize some specific sql-commands like "CREATE TABLE", "ALTER TABLE", etc.
Since I am working with automatically generated export files, the input also contains a lot of overhead, such as "SET CURRENT PATH". None of this needs to be parsed, so I am wondering whether there is a way to ignore "everything else" that is not defined in the SQL statements. I hope someone has experience with this.
Here's the header part of my grammar:
list: sql_expression ENDOFFILE?;
sql_expression:
((create_statement|alter_table_statement|create_unique_index_statement|insert_statement) SEMICOLON)+
;
...
and I am wondering if it is possible to extend the sql_expression rule like this:
list: sql_expression ENDOFFILE?;
sql_expression:
((create_statement|alter_table_statement|create_unique_index_statement|insert_statement|else_stuff) SEMICOLON)+
;
Thanks in advance!

Yes, you can achieve this.
You can ignore statements like "SET CURRENT PATH" or "CONNECT ..blah blah". These are nothing but SQL*Plus commands. You need to swallow everything that comes after the particular keyword, up to the end of the line.
For example, for "ACCEPT ..blah..", you can create the following rule:
SQL_PLUS_ACCEPT
: 'accept' SPACE ( ~('\r' | '\n') )* (NEWLINE|EOF)
;
accept_key
: SQL_PLUS_ACCEPT
;
This will ignore the "ACCEPT ..." command, and you can parse whatever statements you want to parse. You need to do the same for the other SQL*Plus commands, such as SET, CONNECT, EXIT, etc.
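As a rough illustration of the same "swallow the keyword and the rest of the line" idea outside ANTLR, here is a minimal Python sketch that drops such lines before the text reaches the parser. The names IGNORED_COMMANDS, IGNORED_LINE and strip_ignored_lines are made up for this example, and the command list is only an assumption. Note that it would also drop a continuation line of a real statement that happens to start with SET, so treat it as a starting point rather than a finished filter.

```python
import re

# Hypothetical list of line-oriented commands to discard; extend as needed.
IGNORED_COMMANDS = ("accept", "set", "connect", "exit")

# Optional indentation, one of the keywords as a whole word, the rest of the
# line, and finally the line break (or the end of the input).
IGNORED_LINE = re.compile(
    r"^[ \t]*(?:" + "|".join(IGNORED_COMMANDS) + r")\b[^\r\n]*(?:\r?\n|\Z)",
    re.IGNORECASE | re.MULTILINE,
)


def strip_ignored_lines(text: str) -> str:
    """Remove every line that starts with one of the ignored commands."""
    return IGNORED_LINE.sub("", text)


if __name__ == "__main__":
    export = "SET CURRENT PATH = X;\nCREATE TABLE t (a INT);\nexit\n"
    print(strip_ignored_lines(export))
```

The regex mirrors the lexer rule above: keyword, then anything that is not a line break, then NEWLINE or EOF.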

Related

How to debug 'no production for X' in lbnf / bnfc grammar?

When playing around with LBNF/BNFC, in some cases I would like the grammar to optionally allow the plural form. However, it always says 'no production for 'Plural' appearing in rule', and I do not get why.
Relevant line below. SomeOther and SomeToken are basically strings.
HeadAuthors. Authors::= "AUTHOR" [Plural] ":" SomeOther SomeToken ;
Plural. Plural::= "S" ;
I would skip the list, and make Plural into a rule like this
rules Plural ::= "S" | ;
For documentation about the rules macro, see https://bnfc.readthedocs.io/en/latest/lbnf.html#rules.
If you want to keep the list, then you need to give a separator or terminator for Plural, see here https://bnfc.readthedocs.io/en/latest/lbnf.html#terminator, otherwise it doesn't become a list. You can just write
terminator Plural "" ;

Correct use of Syntactic Predicates in XText

I have a grammar with some ambiguities I need to resolve.
One of the rules takes the following form:
TArg:
anys=Anys
| rnumb1=PNumb ".." (rnumb2=PNumb)?
;
The rule Anys has the potential to start with a PNumb. I can see where the ambiguity is, but how do I tell Xtext to take the second path if it sees a PNumb followed by the double dot?
Presumably, if I use
TArg:
(=> rnumb1=PNumb ".." (rnumb2=PNumb)?)
|anys=Anys
;
Then it will always choose the first alternative if it sees a number, regardless of whether it sees the "..", and I will run into problems.
What is the correct usage/placement of the syntactic predicate here to allow Antlr to look ahead to see if the ".." is present?
Cheers in advance.
You also need to include the '..' in the predicate:
TArg:
=>(rnumb1=PNumb "..") (rnumb2=PNumb)?
| anys=Anys
;

Failed to parse command using ANTLR3 grammar, if command has same word which is declared as rule

I am facing a problem while parsing some commands with a parser that I have implemented using ANTLR3. The parser fails on commands that contain any word that is declared as a lexer rule in the grammar.
For example, take a look at the following grammar:
show :
SHOW TABLES '[' projectName? tableName']' -> ^(SHOW TABLES_ ^(PROJECT_NAME projectName)? ^(DATASET_TABLE tableName));
SHOW : S H O W;
If I try to parse the command 'SHOW TABLES [sample-project:SHOW]', parsing fails. But if I change the word SHOW inside the brackets, it works.
SHOW TABLES [sample-project:SHOW] - this works.
I don't want to have to pass the name as a string surrounded by quotes (").
Can anyone suggest solution? I am using ANTLR3.
Thanks in advance.
This is a typical effect of using a reserved word as an identifier. In ANTLR, when you define a reserved word like your SHOW rule, it is implicitly excluded from any identifier rule you define after that keyword rule.
The solution for allowing such keywords as identifiers in rules like your tableName is to make that rule accept certain (or all) keywords that could appear in that place (they will then not act as keywords there). Example:
tableName:
IDENTIFIER
| SHOW
| <others go here>
;
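To make the idea concrete without ANTLR, here is a small hand-written Python sketch. Everything in it (KEYWORDS, tokenize, table_name, parse_show, and the simplified [project:table] syntax) is an assumption for illustration, not generated ANTLR code. The tokenizer always turns SHOW into a keyword token, and the table_name function plays the role of the extended tableName rule by accepting keyword tokens as names.

```python
import re

KEYWORDS = {"SHOW", "TABLES"}
TOKEN = re.compile(r"\s*(?:(?P<word>[A-Za-z_][A-Za-z0-9_-]*)|(?P<punct>[\[\]:]))")


def tokenize(text):
    """Split the input into (kind, text) pairs; keywords win over identifiers."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.group("word"):
            word = m.group("word")
            kind = word.upper() if word.upper() in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, word))
        else:
            tokens.append((m.group("punct"), m.group("punct")))
        pos = m.end()
    return tokens


def table_name(token, allow_keywords=True):
    """Counterpart of: tableName : IDENTIFIER | SHOW | <others> ;"""
    kind, text = token
    if kind == "IDENTIFIER" or (allow_keywords and kind in KEYWORDS):
        return text
    raise SyntaxError(f"expected a table name, got keyword {kind}")


def parse_show(text, allow_keywords=True):
    """Parse SHOW TABLES '[' (project ':')? table ']' into (project, table)."""
    tokens = tokenize(text)
    if [k for k, _ in tokens[:3]] != ["SHOW", "TABLES", "["] or tokens[-1][0] != "]":
        raise SyntaxError("expected SHOW TABLES [ ... ]")
    body = tokens[3:-1]
    project = None
    if len(body) == 3 and body[1][0] == ":":
        project = body[0][1]
        body = body[2:]
    if len(body) != 1:
        raise SyntaxError("expected [project:]table")
    return project, table_name(body[0], allow_keywords)


if __name__ == "__main__":
    print(parse_show("SHOW TABLES [sample-project:SHOW]"))
```

With allow_keywords=False the same input raises a SyntaxError, which corresponds to a tableName rule that only lists IDENTIFIER.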

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes stuff as it finds it even if it is wrong, and how you can use semantic predicates or lookahead to remove ambiguity. But nothing I have read has helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now, whitespace is skipped between the ICHAR and USCORE tokens, so the rule will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule, since it is defined separately. If you want to accept the keyword in certain sequences as a regular identifier, you will have to include it in the accepted identifier rule.
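Here is a rough Python sketch of what a single identifier lexer rule buys you. It is a hand-rolled longest-match tokenizer, not a model of how ANTLR 3 generates its lexer, and the names RULES, FIELDSEGMENT and lex are assumptions for this example. Because the whole segment is matched as one token, 'pct_vac' no longer gets mistaken for the start of the keyword.

```python
import re

# Rule order matters for ties: the keyword is listed before the identifier.
RULES = [
    ("PCT_CONTAINS", re.compile(r"pct_contains")),
    ("FIELDSEGMENT", re.compile(r"[A-Za-z]+(?:_[A-Za-z]+)*")),
    ("DOT", re.compile(r"\.")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def lex(text):
    """Return (rule name, text) pairs, always taking the longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earlier rule keeps a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(lex("attributes.pct_vac"))
    print(lex("pct_contains"))
```

The exact keyword still comes out as PCT_CONTAINS, so a parser rule that wants to allow it as a field name has to list it explicitly, as described above.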

How can I extract field names from SQL with Perl?

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.
Given select statement fields that could have several nested parentheses like:
ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,
Or the simple case of just base_field_name as a field, what would the regex look like in Perl?
Don't try to write a regex parser (though Perl regexes can handle nested patterns like that); use SQL::Statement::Structure instead.
Why not ask the target database itself how it would interpret the queries?
In Perl, one can use the DBI to query the prepared representation of a SQL query. Sometimes this is database-specific: some drivers (under the Perl DBD:: namespace) support their RDBMS's idea of describing statements in ways analogous to the RDBMS's native C or C++ API.
It can be done generically, however, as the DBI will put the names of result columns in the statement handle attribute NAME. The following, for example, has a good chance of working on any DBI-supported RDBMS:
use strict;
use warnings;
use DBI;
use constant DSN => 'dbi:YouHaveNotToldUs:dbname=we_do_not_know';
my $dbh = DBI->connect(DSN, ..., { RaiseError => 1 });
my $sth;
while (<>) {
next unless /^SELECT/i; # SELECTs only, assume whole query on one line
chomp;
my $sql = /\bWHERE\b/i ? "$_ AND 1=0" : "$_ WHERE 1=0"; # XXX ugly!
eval {
$sth = $dbh->prepare($sql); # some drivers don't know column names
$sth->execute(); # until after a successful execute()
};
print $@, next if $@; # oops, problem with that one
print join(', ', @{$sth->{NAME}}), "\n";
}
The XXX ugly! bit there tries to append an always-false condition on the SELECT, so that the SQL engine doesn't have to do any real work when you execute(). It's a terribly naive approach -- that /\bWHERE\b/i test is no more correctly identifying a SQL WHERE clause than simple regexes correctly parse out SELECT field names -- but it is likely to work.
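For comparison, here is a minimal sketch of the same "ask the database" idea in Python, using the standard sqlite3 module and its cursor.description attribute. The function name result_columns and the table in the demo are made up for this example. It also swaps the WHERE-appending trick for wrapping the query in a subquery, which is my own variation and assumes your database accepts subqueries in FROM.

```python
import sqlite3


def result_columns(conn, select_sql):
    """Return the result column names the database reports for a SELECT."""
    inner = select_sql.strip().rstrip(";")
    # Wrapping the query avoids guessing whether it already has a WHERE
    # clause; the always-false condition keeps any rows from coming back.
    cur = conn.execute(f"SELECT * FROM ({inner}) WHERE 1=0")
    return [column[0] for column in cur.description]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (base_field_name TEXT, other INTEGER)")
    query = "SELECT ltrim(rtrim(base_field_name)) renamed_field_name, other FROM t;"
    print(", ".join(result_columns(conn, query)))
```

As with the Perl version, this reports the result column names (aliases included), not the base fields hidden inside nested function calls.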
In a somewhat related problem at the office I used:
my @SqlKeyWordList = qw/select from where .../; # (1)
my @Candidates = split(/\s+/, $SqlSelectQuery); # (2)
my %FieldHash; # (3)
for my $Word (@Candidates) {
next if grep { lc($_) eq lc($Word) } @SqlKeyWordList; # skip SQL keywords
$FieldHash{$Word}++; # count every remaining candidate
}
Comments:
(1) @SqlKeyWordList contains all the SQL keywords that can potentially appear in the SQL statement (we use MySQL; there are many SQL dialects, and choosing/building this list is work, see my comments below!). If someone decided to use a keyword as a field name, you will need a regex after all (better to refactor the code).
(2) Split the SQL statement into a list of words. This is the trickiest part and WILL REQUIRE tweaking. For now it uses Perl's notion of "space" (= not in a word) to split. Splitting the field list (select a,b,c) and the "from" portion of the SQL separately might be advisable here, depending on your SQL statements.
(3) %FieldHash will contain one entry per select field (and gunk, until you have validated your @SqlKeyWordList and the regex in (2)).
Beware
there is nothing in this code that could not be done in Python.
your life would be much easier if you can influence the creation of said SQL statements. (e.g. make sure each field is written to a comment)
there are so many things that can/will go wrong in this parsing approach, you really should sidestep the issue entirely, by changing the process (saves time in the long run).
this is the regex we use at the office
my @Candidates=split(/[\s
\(
\)
\+
\,
\*
\/
\-
\n
\
\=
\r
]+/,$SqlSelectQuery
);
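Since the note above says nothing here could not be done in Python, here is a minimal sketch of the same split-and-filter approach. SQL_KEYWORDS is a deliberately incomplete, hypothetical list, and SPLIT and candidate_fields are names invented for this example. All the caveats from the Beware list apply unchanged.

```python
import re
from collections import Counter

# Hypothetical, incomplete keyword list; building the real one is the work.
SQL_KEYWORDS = {"select", "from", "where", "as", "and", "or"}

# Split on whitespace and on the punctuation used in the regex above.
SPLIT = re.compile(r"[\s()+,*/\-=]+")


def candidate_fields(sql):
    """Count every word of the statement that is not a known SQL keyword."""
    words = [word for word in SPLIT.split(sql) if word]
    return Counter(word for word in words if word.lower() not in SQL_KEYWORDS)


if __name__ == "__main__":
    query = (
        "select ltrim(rtrim(to_char(base_field_name, format))) "
        "renamed_field_name from some_table where a = 1"
    )
    for word in candidate_fields(query):
        print(word)
```

The result still contains function names, table names and literals, which is the "gunk" mentioned in comment (3).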
How about splitting each line into terms (replace every parenthesis, comma and space with a newline), then sorting:
perl -ne's/[(), ]/\n/g; print' < textfile | sort -u
You'll end up with a lot of content like:
fieldname1
fieldname1
formatstring
ltrim
rtrim
to_char