How can I extract field names from SQL with Perl? - sql

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.
Given select statement fields that could have several nested parentheses, like:
ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,
Or the simple case of just base_field_name as a field, what would the regex look like in Perl?

Don't try to write a regex parser (though Perl regexes can handle nested patterns like that); use SQL::Statement::Structure.

Why not ask the target database itself how it would interpret the queries?
In Perl, one can use the DBI to query the prepared representation of a SQL query. Sometimes this is database-specific: some drivers (under the Perl DBD:: namespace) support describing statements in ways analogous to their RDBMS's native C or C++ API.
It can be done generically, however, as the DBI will put the names of result columns in the statement handle attribute NAME. The following, for example, has a good chance of working on any DBI-supported RDBMS:
use strict;
use warnings;
use DBI;
use constant DSN => 'dbi:YouHaveNotToldUs:dbname=we_do_not_know';
my $dbh = DBI->connect(DSN, ..., { RaiseError => 1 });
my $sth;
while (<>) {
    next unless /^SELECT/i; # SELECTs only, assume whole query on one line
    chomp;
    my $sql = /\bWHERE\b/i ? "$_ AND 1=0" : "$_ WHERE 1=0"; # XXX ugly!
    eval {
        $sth = $dbh->prepare($sql); # some drivers don't know column names
        $sth->execute();            # until after a successful execute()
    };
    if ($@) { print $@; next; }     # oops, problem with that one
    print join(', ', @{$sth->{NAME}}), "\n";
}
The XXX ugly! bit there tries to append an always-false condition on the SELECT, so that the SQL engine doesn't have to do any real work when you execute(). It's a terribly naive approach -- that /\bWHERE\b/i test is no more correctly identifying a SQL WHERE clause than simple regexes correctly parse out SELECT field names -- but it is likely to work.
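The same idea (let the database report the result columns instead of parsing the SQL) can be sketched in Python. This is only an illustration, not part of the answer above: it uses the standard-library sqlite3 module and an in-memory table as stand-ins for the real database, and the helper name column_names is made up. Instead of appending to the WHERE clause, it wraps the query in an outer SELECT with an always-false filter, which assumes the dialect accepts a subquery in FROM.

```python
import sqlite3


def column_names(conn, select_sql):
    """Return the result column names of a SELECT without fetching any rows."""
    inner = select_sql.strip().rstrip(";")
    # Wrapping the query means the always-false filter never collides with an
    # existing WHERE clause. Some databases require an alias on the subquery.
    cur = conn.execute("SELECT * FROM (%s) WHERE 1=0" % inner)
    # Each entry of cursor.description is a tuple whose first item is the name.
    return [col[0] for col in cur.description]


# Hypothetical table and query, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (base_field_name TEXT, other_field TEXT)")
query = (
    "SELECT ltrim(rtrim(upper(base_field_name))) AS renamed_field_name, "
    "other_field FROM t"
)
names = column_names(conn, query)
print(", ".join(names))
```

As with the Perl version, the names reported are whatever the engine assigns to the result columns, so an aliased expression shows up under its alias rather than its base field.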

In a somewhat related problem at the office I used:
my @SqlKeyWordList = qw/select from where .../; # (1)
my @Candidates = split(/\s/, $SqlSelectQuery); # (2)
my %FieldHash; # (3)
for my $Word (@Candidates) {
    # skip SQL keywords (case-insensitive comparison)
    next if grep { lc($Word) eq $_ } @SqlKeyWordList;
    $FieldHash{$Word}++;
}
Comments:
(1) @SqlKeyWordList contains all the SQL keywords that could appear in the SQL statement (we use MySQL; there are many SQL dialects, and choosing/building this list is work, see my comments below!). If someone decided to use a keyword as a field name, you will need a regex after all (better to refactor the code).
(2) Split the SQL statement into a list of words. This is the trickiest part and WILL REQUIRE tweaking. For now it uses Perl's notion of "space" (= not in a word) to split. Splitting the field list (select a,b,c) and the "from" portion of the SQL separately might be advisable here, depending on your SQL statements.
(3) %FieldHash will contain one entry per select field (and gunk, until you have validated your @SqlKeyWordList and the regex in (2)).
Beware
there is nothing in this code that could not be done in Python.
your life would be much easier if you can influence the creation of said SQL statements. (e.g. make sure each field is written to a comment)
there are so many things that can/will go wrong in this parsing approach, you really should sidestep the issue entirely, by changing the process (saves time in the long run).
this is the regex we use at the office
my @Candidates=split(/[\s
\(
\)
\+
\,
\*
\/
\-
\n
\
\=
\r
]+/,$SqlSelectQuery
);
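Since the answer notes that nothing here is Perl-specific, the same keyword-filter heuristic might look like this in Python. It is a rough sketch under the same caveats: SQL_KEYWORDS is a deliberately tiny, hypothetical list, candidate_fields is a made-up name, and table names or literals still end up in the result as "gunk".

```python
import re

# Hypothetical, incomplete keyword list; a real one depends on the SQL dialect.
SQL_KEYWORDS = {"select", "from", "where", "and", "or", "as"}

# Split on whitespace and on the punctuation used in the character class above.
SEPARATORS = re.compile(r"[\s()+,*/\-=]+")


def candidate_fields(sql):
    """Count every token that is not a known SQL keyword."""
    counts = {}
    for word in SEPARATORS.split(sql):
        if not word or word.lower() in SQL_KEYWORDS:
            continue  # skip empty strings from leading/trailing separators
        counts[word] = counts.get(word, 0) + 1
    return counts


fields = candidate_fields("select a, b from t where a = b")
print(fields)
```

Note that the table name t is counted as well, which is exactly the kind of noise the answer warns about.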

How about splitting each line into terms (replace every parenthesis, comma and space with a newline), then sorting:
perl -ne's/[(), ]/\n/g; print' < textfile | sort -u
You'll end up with a lot of content like:
fieldname1
fieldname1
formatstring
ltrim
rtrim
to_char

Related

How can I query for specific matching substrings while ignoring the rest in a SCAN query?

I'm trying to run a Redis SCAN command and figure out how to do glob-pattern substring matching on whole words instead of single characters (using the Ruby redis gem).
redis.set("first:url:123", "val1")
redis.set("second:url:123", "val2")
redis.set("third:url:123", "val3")
redis.set("fourth:url:123", "val4")
cursor = 0
pattern = "[first,second]:url:*" ## I only want the first and second keys
redis.scan(cursor, match: pattern)
# => ...
--
According to the docs, these are the available options, but they seem to work only for single characters. How can I use them for words?
h[ae]llo matches hello and hallo, but not hillo
Edit:
https://globster.xyz/
makes me think that using {first,second}:url:123 should work, but that doesn't seem to work either.
Redis doesn't support regular expressions for key name patterns, only glob-like expressions.
You can resort to an EVAL Lua script if you are adamant about it. Here's one that does it, but do read the comments: https://gist.github.com/itamarhaber/19c8393f465b62c9cfa8
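Another workaround, not mentioned in the answer, is to expand the alternation on the client and issue one SCAN per resulting pattern. The sketch below shows only the expansion step in Python; expand_braces is a hypothetical helper, not part of any Redis client. It also uses Python's fnmatch module, whose bracket semantics are close to the glob rules quoted in the question, to show why [first,second] fails: brackets match one character from a set, not a word.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternations into a list of plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        # Recurse in case the pattern contains more than one brace group.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


patterns = expand_braces("{first,second}:url:*")
print(patterns)

# A bracket expression matches exactly one character, so the whole word fails
# while a single letter from the set succeeds.
word_matches = fnmatch.fnmatchcase("first:url:123", "[first,second]:url:*")
letter_matches = fnmatch.fnmatchcase("f:url:123", "[first,second]:url:*")
```

Each entry in patterns could then be passed as the match argument of its own SCAN loop, with the results merged on the client.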

Translate Perl Script to T-SQL

Ok - I need help from a Perl Guru on this one. A co-worker provided me code from a Perl application they support and how it encodes values before it writes the data to Oracle. Don't ask me why they did this encoding (appears to be for special characters). The values are being written to a CLOB in Oracle. I need an equivalent decode for use in an SSIS package in SQL Server.
Basically I am reading the data from the Oracle database using an SSIS package and need to decode the values. The "+" sign between words is easy with a replace statement (not sure that is the best way, but seems to work so far).
This one is beyond my skill set, because my Perl script skills are limited (yes I have done some reading, but not turning out to be as easy as I thought since I don't know Perl very well). I am only interested in decoding the string not encoding.
BTW as a hint to this, I know that %29 equals a ")" sign. Looks to be using regex, but I am not well versed in using that either (I know I need to learn it).
sub decodeValue($)
{
my $varRef = shift;
${$varRef} =~ tr/+/ /;
${$varRef} =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg;
${$varRef} =~ tr/\cM//;
${$varRef} =~ s/"/\"/g;
return;
}
sub encodeValue($)
{
my $varRef = shift;
# ${$varRef} =~ tr/ /+/;
${$varRef} =~ s/"/\"/g;
${$varRef} =~ s/'/\'/g;
${$varRef} =~ s/(\W)/sprintf( "%%%x", ord($1) )/eg;
return;
}
The encodeValue subroutine is a simple URL-encoding algorithm, with additional steps to convert single and double quotes to their equivalent HTML entities. You need to write Transact-SQL code to undo those steps in reverse order, so the first thing that must be done is to replace all %7f-type sequences with their equivalent characters.
You should look at URL Decode in T-SQL for code to do that. It supports the full UTF-8 character set. You could remove all the ELSE IF @Byte1Value blocks to support only 7-bit ASCII if you wish, but it will work fine for you as it stands.
The remaining conversions of single and double quotes and spaces can be undone using REPLACE calls, which I trust you don't need help with. The original decodeValue subroutine restores only double quotes and spaces, leaving single quotes in their encoded form, so I don't know whether you would want to replicate that behaviour.
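For readers who do not know Perl, the first two steps of decodeValue can be restated in Python as a reference for what the T-SQL has to reproduce. This is only a sketch: decode_value is a made-up name, it treats each %XX sequence as a single character the way the Perl pack("C", ...) call does, and it leaves out the carriage-return and quote handling.

```python
import re


def decode_value(text):
    """Undo the '+' and %XX encoding, in the same order as the Perl code."""
    # Step 1: '+' stands for a space. This must run before step 2, otherwise
    # a literal plus sign encoded as %2B would be turned into a space too.
    text = text.replace("+", " ")
    # Step 2: replace each %XX sequence with the character of that hex code.
    text = re.sub(
        r"%([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    return text


decoded = decode_value("total+%28net%29")
print(decoded)
```

The ordering matters in T-SQL as well: do the REPLACE of "+" before decoding the percent sequences.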

Remove quote marks in Google Cloud Datalab SQL module parameters?

The parameterization example in the "SQL Parameters" IPython notebook in the datalab github repo (under datalab/tutorials/BigQuery/) shows how to change the value being tested for in a WHERE clause.
%%sql --module get_data
SELECT *
FROM
[myproject:mydataset.mytable]
WHERE
$query
However, this syntax always seems to insert quotation marks around the parameter. This breaks when I pass parameters that aren't just a simple value:
import gcp.bigquery as bq
query = "(bnf_code LIKE '1202%') OR (bnf_code LIKE '1203%')"
query = bq.Query(get_data, query=query)
print query.sql
This prints an invalid query:
SELECT * FROM [myproject:mydataset.mytable]
WHERE "(bnf_code LIKE '1202%') OR (bnf_code LIKE '1203%')"
Is there any way I can insert values that aren't wrapped in quotation marks?
I'm using the module repeatedly in my code, with variable numbers of OR clauses in the query parameter. So I do need a way to pass in more complicated queries.
Sorry, variables are meant to be simple scalars, or tables, or (soon) lists for use in IN clauses. They are not meant for expressions.
Passing unquoted arguments to SQL modules isn't possible, but it is possible to create a SqlStatement (from datalab.data._sql_statement) with straight-up SQL in string form. With that you can use your own Python-style placeholders to substitute values as you see fit:
import datalab.data._sql_statement as bqsql
statement = bqsql.SqlStatement(
"SELECT some-field FROM %s" % '[your-instance:some-table-name]')
query = bq.Query(statement)
I don't know if they're doing anything special with placeholders or the in-notebook command processing but... well, I didn't see any of that in my (admittedly limited) spelunking.
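Since the question involves a variable number of OR clauses, the SQL string itself can be assembled in plain Python before it is handed to whatever statement object is used. The sketch below is an assumption-laden illustration: build_where and build_query are made-up helpers, the table and column names are taken from the question, and the alphanumeric check is just one simple way to avoid pasting arbitrary text into the query.

```python
def build_where(prefixes, column="bnf_code"):
    """Build '(col LIKE 'p1%') OR (col LIKE 'p2%') ...' from a list of prefixes."""
    clauses = []
    for prefix in prefixes:
        # Refuse anything that is not plain letters/digits, because the value
        # is interpolated directly into the SQL text.
        if not prefix.isalnum():
            raise ValueError("unexpected characters in prefix: %r" % prefix)
        clauses.append("(%s LIKE '%s%%')" % (column, prefix))
    return " OR ".join(clauses)


def build_query(table, prefixes):
    """Substitute the table and the generated WHERE clause into the SQL text."""
    return "SELECT * FROM %s WHERE %s" % (table, build_where(prefixes))


sql = build_query("[myproject:mydataset.mytable]", ["1202", "1203"])
print(sql)
```

The resulting string is what would be passed to the statement constructor shown above.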

Does mIRC Scripting have an escape character?

I'm trying to write a simple multi-line Alias that says several predefined strings of characters in mIRC. The problem is that the strings can contain:
{
}
|
which are all used in the scripting language to group sections of code/commands. So I was wondering if there was an escape character I could use.
In lack of that, is there a method, or alternative way to be able to "say" multiple lines of these strings, so that this:
alias test1 {
/msg # samplestring}contains_chars|
/msg # _that|break_continuity}{
}
Outputs this on typing /test1 on a channel:
<MyName> samplestring}contains_chars|
<MyName> _that|break_continuity}{
It doesn't have to use the /msg command specifically, either, as long as the output is the same.
So basically:
Is there an escape character of sorts I can use to differentiate code from a string in mIRC scripting?
Is there a way to tell a script to evaluate all characters in a string as a literal? Think " " quotes in languages like Java.
Is the above even possible using only mIRC scripting?
"In lack of that, is there a method, or alternative way to be able to "say" multiple lines of these strings, so that this:..."
I think you have to use msg # every time you want to message a channel. Alternatively, you can use the /say command to message the active window.
Regarding the other 3 questions:
Yes, for example you can use $chr(123) instead of a {, $chr(125) instead of a } and $chr(124) instead of a | (pipe). For a full list of numbers you can go to http://www.atwebresults.com/ascii-codes.php?type=2. The code for a dot is 46 so $chr(46) will represent a dot.
I don't think there is any 'simple' way to do this. To print identifiers as plain text you have to add a ! after the $. For example '$!time' will return the plain text '$time' as $time will return the actual value of $time.
Yes.
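Rewriting strings by hand with $chr() gets tedious, so the substitution could be generated. The following Python sketch (not mIRC script) turns a literal line into text where {, } and | are replaced by $chr(N) and glued to their neighbours with mIRC's $+ concatenation identifier. The helper names are made up, and the sketch does not handle words that begin with $ or %, which mIRC would still try to evaluate.

```python
import re

# ASCII codes for the three characters mIRC treats as script syntax.
SPECIAL = {"{": 123, "|": 124, "}": 125}


def escape_word(word):
    """Replace special characters in one word and join the pieces with $+."""
    pieces = [p for p in re.split(r"([{}|])", word) if p]
    converted = ["$chr(%d)" % SPECIAL[p] if p in SPECIAL else p for p in pieces]
    return " $+ ".join(converted)


def escape_line(text):
    """Escape each whitespace-separated word of a line independently."""
    return " ".join(escape_word(word) for word in text.split())


line1 = escape_line("samplestring}contains_chars|")
line2 = escape_line("_that|break_continuity}{")
print("/msg # " + line1)
print("/msg # " + line2)
```

The printed lines are meant to be pasted into the alias in place of the raw strings.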

SQL Regular Expressions

I created the following SQL regex pattern for matching an ISBN:
CREATE RULE ISBN_Rule AS @value LIKE 'ISBN\x20(?=.{13}$)\d{1,5}([-])\d{1,7}\1\d{1,6}\1(\d|X)$'
I used the following values as test data; however, the data is not being committed:
ISBN 0 93028 923 4 | ISBN 1-56389-668-0 | ISBN 1-56389-016-X
Where am I wrong?
You can do this using LIKE.
You'll need some ORs to deal with the different ISBN 10 and 13 formats
For the above strings:
LIKE 'ISBN [0-9][ -][0-9][0-9][0-9][0-9][0-9][ -][0-9][0-9][0-9][ -][0-9X]'
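To sanity-check that pattern outside SQL Server, it can be translated one-to-one into a regular expression. The Python sketch below is only a test harness, under the assumption that the three sample strings from the question are representative; like the LIKE pattern, it accepts just the fixed 1-5-3-1 digit grouping and does not verify the check digit.

```python
import re

# Same structure as the LIKE pattern: one digit, five digits, three digits and
# a final digit or X, each group separated by a space or a hyphen.
ISBN10_PATTERN = re.compile(r"ISBN [0-9][ -][0-9]{5}[ -][0-9]{3}[ -][0-9X]")


def looks_like_isbn10(value):
    """Return True when the whole string has the shape the LIKE pattern accepts."""
    return ISBN10_PATTERN.fullmatch(value) is not None


samples = ["ISBN 0 93028 923 4", "ISBN 1-56389-668-0", "ISBN 1-56389-016-X"]
results = [looks_like_isbn10(s) for s in samples]
print(results)
```

Other groupings, and 13-digit ISBNs, would each need their own pattern combined with OR, as noted above.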
The LIKE operator in SQL Server isn't a regex operator. You can do some complicated pattern matching, but it's not normal regex syntax.
http://msdn.microsoft.com/en-us/library/ms179859.aspx
SQL Server 2005 does not support regular expressions out of the box; you would need OLE Automation or CLR integration to provide that functionality through a UDF.
The only supported wildcards are % (any string) and _ (any single character), plus character ranges using [] and their negation using [^]. So your expression
'ISBN\x20(?=.{13}$)\d{1,5}([- ])\d{1,7}\1\d{1,6}\1(\d|X)$'
means something very weird: [- ] is treated as a character range and everything else is matched literally.
If it splits on | and doesn't strip whitespace, it's probably missing a space before ISBN and/or after (\d|X), here at the $. Also, I doubt this is the problem, but [- ] could be [ -].
edit: ok, well keep this in mind when you get a regex lib/control.