Translate Perl Script to T-SQL

Ok - I need help from a Perl Guru on this one. A co-worker provided me code from a Perl application they support and how it encodes values before it writes the data to Oracle. Don't ask me why they did this encoding (appears to be for special characters). The values are being written to a CLOB in Oracle. I need an equivalent decode for use in an SSIS package in SQL Server.
Basically I am reading the data from the Oracle database using an SSIS package and need to decode the values. The "+" sign between words is easy with a replace statement (not sure that is the best way, but seems to work so far).
This one is beyond my skill set, because my Perl script skills are limited (yes I have done some reading, but not turning out to be as easy as I thought since I don't know Perl very well). I am only interested in decoding the string not encoding.
BTW as a hint to this, I know that %29 equals a ")" sign. Looks to be using regex, but I am not well versed in using that either (I know I need to learn it).
sub decodeValue($)
{
my $varRef = shift;
${$varRef} =~ tr/+/ /;
${$varRef} =~ s/%([a-fA-F0-9]{2})/pack("C",hex($1))/eg;
${$varRef} =~ tr/\cM//;
${$varRef} =~ s/&quot;/\"/g;
return;
}
sub encodeValue($)
{
my $varRef = shift;
# ${$varRef} =~ tr/ /+/;
${$varRef} =~ s/"/&quot;/g;
${$varRef} =~ s/'/\'/g;
${$varRef} =~ s/(\W)/sprintf( "%%%x", ord($1) )/eg;
return;
}

The encodeValue subroutine is a simple URL-encoding algorithm, with additional steps that first convert single and double quotes to their equivalent HTML entities. You need to write Transact-SQL code that undoes those steps in reverse order, so the first thing to do is replace all %7f-type sequences with their equivalent characters.
You should look at URL Decode in T-SQL for code to do that. It supports the full UTF-8 character set. You could remove all the ELSE IF @Byte1Value blocks to support only 7-bit ASCII if you wish, but it will work fine for you as it stands.
The remaining conversions of single and double quotes and spaces can be undone using REPLACE calls, which I trust you don't need help with. The original decodeValue subroutine restores only double quotes and spaces, leaving single quotes as their HTML entity, so I don't know whether you would want to replicate that behaviour.
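As a reference for checking a T-SQL port against known inputs, the decodeValue steps can be sketched in Python. This is only a minimal sketch, not the application's code: the name decode_value is hypothetical, and it assumes the entity being restored is &quot;. The tr/\cM// line is left out because, without the /d modifier, it changes nothing.

```python
import re

def decode_value(encoded: str) -> str:
    """Mirror the Perl decodeValue steps (hypothetical helper name)."""
    # Step 1: '+' stands for a space (Perl: tr/+/ /).
    text = encoded.replace("+", " ")
    # Step 2: each %XX pair becomes the character with that hex code
    # (Perl: pack("C", hex($1))). Substituted text is not rescanned.
    text = re.sub(r"%([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), text)
    # Step 3 (assumption): double quotes were stored as the &quot; entity.
    return text.replace("&quot;", '"')

print(decode_value("total+%28net%29"))  # total (net)
```

The order matters: plus signs are converted before the %XX sequences, so an encoded %2b survives as a literal + instead of turning into a space.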

Related

How to add a small bit of context in a grammar?

I am tasked to parse (and transform) a code of a computer language, that has a slight quirk in its rules, at least I see it this way. To be exact, the compiler treats new lines (as well as semicolons) as statement separators, but other than that (e.g. inside the statement) it treats them as spacers (whitespace).
As an example, this code:
try
local x = 5 / 0
catch (i)
print(i + "\n")
is proved to be equivalent to this:
try local x = 5 / 0 catch (i) print(i + "\n")
I don't see how I can express such a rule in EBNF, or specifically in Lark EBNF dialect. I mean in a sensible way. I probably could define all possible newline positions inside all statements, but it would be cumbersome and error-prone.
I wish to find a way to treat newlines contextually. Is there a proven method for this, preferably within Python/Lark domain? If I have to modify the parser for that purpose, then where should I start?
Or if I misunderstood something in this language in particular or in machine language parsing in general, or my statement of the problem is wrong, I'd also be happy to get educated.
(As you may guess, the language in question has a well proven implementation, but no officially defined grammar. Also, it is Squirrel, for all that it matters.)
The relevant quote from the "specification" is this:
A squirrel program is a simple sequence of statements.:
stats := stat [';'|'\n'] stats
[...] Statements can be separated with a new line or ‘;’ (or with the keywords case or default if inside a switch/case statement), both symbols are not required if the statement is followed by ‘}’.
These are relatively complex rules and, taken together, not context-free if newlines can also be ignored everywhere else. Note however that, as I read it, the text implies that ; or \n is required when none of the other cases apply. That would make your example illegal. That probably means the BNF as written is correct, i.e. both ; and \n are optional everywhere. In that case you can (for lark) just add an %ignore "\n" statement and it should work fine.
Also, lark should not complain if you both ignore \n and use it in a rule: where useful it will match it in a rule, otherwise it will just ignore it. Note however that this breaks if you use a terminal that includes \n (e.g. WS or /\s/). Just have \n as an extra case.
(For the future: you will probably get a faster response to lark questions if you ask over on gitter, or at least put a link to the SO question there.)
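The effect of dropping newlines at the lexing stage can be illustrated without Lark. The following is a minimal sketch in plain Python, not Lark's API: the tokenize helper and its token pattern are hypothetical and only cover the example from the question.

```python
import re

# Hypothetical token pattern: string literals, names/numbers, or single symbols.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\w+|[^\s\w]')

def tokenize(source):
    """Return the tokens of source, dropping all whitespace including newlines."""
    return TOKEN.findall(source)

multi_line = 'try\nlocal x = 5 / 0\ncatch (i)\nprint(i + "\\n")'
one_line = 'try local x = 5 / 0 catch (i) print(i + "\\n")'

# Both spellings yield the same token stream, so a parser fed these
# tokens cannot tell them apart.
print(tokenize(multi_line) == tokenize(one_line))  # True
```

Note that the \n inside the string literal is kept, because it is part of a token and not a separator.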

What regular expression characters have to be escaped in SQL?

To prevent SQL injection attack, the book "Building Scalable Web Sites" has a function to replace regular expression characters with escaped version:
function db_escape_str_rlike($string) {
return preg_replace("/([().\[\]*^\$])/", '\\\$1', $string);
}
Does this function escape ( ) . [ ] * ^ $? Why are only those characters escaped in SQL?
I found an excerpt from the book you mention, and found that the function is not for escaping to protect against SQL injection vulnerabilities. I assumed it was, and temporarily answered your question with that in mind. I think other commenters are making the same assumption.
The function is actually about escaping characters that you want to use in regular expressions. There are several characters that have special meaning in regular expressions, so if you want to search for those literal characters, you need to escape them (precede with a backslash).
This has little to do with SQL. You would need to escape the same characters if you wanted to search for them literally using grep, sed, perl, vim, or any other program that uses regular expression searches.
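To see what that escaping does, here is a minimal Python sketch of the same idea. The name escape_rlike is hypothetical and this is not the book's code; it backslash-escapes exactly the characters ( ) . [ ] * ^ $ from the question.

```python
import re

def escape_rlike(text: str) -> str:
    """Backslash-escape the characters ( ) . [ ] * ^ $ (hypothetical helper)."""
    # Inside the character class only [ and ] need a backslash themselves.
    return re.sub(r"([().\[\]*^$])", r"\\\1", text)

print(escape_rlike("cost (USD) is $5.00"))  # cost \(USD\) is \$5\.00
```

Characters such as + ? | and \ are also special in most regular expression dialects and are not covered by this set; Python's own re.escape handles a wider range.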
Unfortunately, which characters are active inside SQL string literals is an open issue. Each database vendor uses its own rules (notably Oracle's MySQL, which uses \ escape sequences).
The official SQL way to escape a ', which is the string delimiter used for values, is to double it, as in ''.
That should be the only way to ensure transparency in SQL statements, and the only way to introduce a proper ' into a string. As soon as any vendor accepts \' as a synonym for a quote, you have to support all the extra escape sequences when delimiting strings. Suppose you have:
'Mac O''Connor' (should go into "Mac O'Connor" string)
and assume the only way to escape a ' is that... then you have to check the next char when you see a ' for a '' sequence and:
you get '' that you change into '.
you get another, and you terminate the string literal and process the char as the first of the next token.
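The two cases above can be sketched as a small scanner. This is an illustrative sketch only: read_sql_string is a hypothetical helper, and it recognises just the standard '' escape, treating a backslash as an ordinary character.

```python
def read_sql_string(text: str, start: int = 0):
    """Read a standard SQL string literal starting at text[start].

    Returns (value, index_after_literal). Hypothetical helper.
    """
    if start >= len(text) or text[start] != "'":
        raise ValueError("expected opening quote")
    chars = []
    i = start + 1
    while i < len(text):
        if text[i] == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                chars.append("'")  # '' inside the literal is one quote
                i += 2
                continue
            return "".join(chars), i + 1  # a lone quote ends the literal
        chars.append(text[i])
        i += 1
    raise ValueError("unterminated string literal")

value, end = read_sql_string("'Mac O''Connor' AND x = 1")
print(value)  # Mac O'Connor
```

With only one escape rule, a single character of lookahead is enough to decide between the two cases.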
But if you also accept \ as an escape, then you have to check for \' and for \\', and \\\' (this last one should be converted to \' on input) etc. You can run into trouble if you don't detect special cases such as
\'' (should the '' be processed as SQL mandates, or the first \' is escaping the first ' and the second is the string end quote?)
\\'' (should the \\ be converted into a single \ then the ' should be the string terminator, or do we have to switch to SQL way of encoding and consider '' as a single quote?)
etc.
You have to check your database documentation to see whether \ as an escape character affects only the encoding of special characters (such as control characters), whether it also affects the interpretation of the quote character, or whether it is not supported at all and you have to escape ' the standard way.
That is why vendors include functions to escape and unescape character literals that are to be embedded as values in a SQL statement. Attackers put escape sequences into the data they post to you to see whether, if you don't escape it properly, that lets them modify the text of the SQL command, for example by adding a semicolon ; followed by a complete SQL statement that gives them free access to your database.

Does mIRC Scripting have an escape character?

I'm trying to write a simple multi-line Alias that says several predefined strings of characters in mIRC. The problem is that the strings can contain:
{
}
|
which are all used in the scripting language to group sections of code/commands. So I was wondering if there was an escape character I could use.
In lack of that, is there a method, or alternative way to be able to "say" multiple lines of these strings, so that this:
alias test1 {
/msg # samplestring}contains_chars|
/msg # _that|break_continuity}{
}
Outputs this on typing /test1 on a channel:
<MyName> samplestring}contains_chars|
<MyName> _that|break_continuity}{
It doesn't have to use the /msg command specifically, either, as long as the output is the same.
So basically:
Is there an escape character of sorts I can use to differentiate code from a string in mIRC scripting?
Is there a way to tell a script to evaluate all characters in a string as a literal? Think " " quotes in languages like Java.
Is the above even possible using only mIRC scripting?
"In lack of that, is there a method, or alternative way to be able to "say" multiple lines of these strings, so that this:..."
I think you have to use msg # every time you want to message a channel. Alternatively you can use the /say command to message the active window.
Regarding the other 3 questions:
Yes, for example you can use $chr(123) instead of a {, $chr(125) instead of a } and $chr(124) instead of a | (pipe). For a full list of numbers you can go to http://www.atwebresults.com/ascii-codes.php?type=2. The code for a dot is 46 so $chr(46) will represent a dot.
I don't think there is any 'simple' way to do this. To print identifiers as plain text you have to add a ! after the $. For example '$!time' will return the plain text '$time', whereas $time will return the actual value of $time.
Yes.

How to pass a regular expression to a function in AWK

I do not know how to pass a regular expression as an argument to a function.
If I pass a string, it is OK,
I have the following awk file,
#!/usr/bin/awk -f
function find(name){
for(i=0;i<NF;i++)if($(i+1)~name)print $(i+1)
}
{
find("mysql")
}
I do something like
$ ./fct.awk <(echo "$str")
This works OK.
But when I call in the awk file,
{
find(/mysql/)
}
This does not work.
What am I doing wrong?
Thanks,
Eric J.
You cannot (should not) pass a regex constant to a user-defined function. You have to use a dynamic regex in this case, like find("mysql").
If you do find(/mysql/), what awk does is find($0~/mysql/), so it passes a 0 or 1 to your find(..) function.
see this question for detail.
awk variable assignment statement explanation needed
also
http://www.gnu.org/software/gawk/manual/gawk.html#Using-Constant-Regexps
section: 6.1.2 Using Regular Expression Constants
warning: regexp constant for parameter #1 yields boolean value
The regex gets evaluated (matching against $0) before it's passed to the function. You have to use strings.
Note: make sure you do proper escaping: http://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps
If you use GNU awk, you can use regular expression as user defined function parameter.
You have to define your regex as @/.../.
In your example, you would use it like this:
function find(regex){
for(i=1;i<=NF;i++)
if($i ~ regex)
print $i
}
{
find(@/mysql/)
}
It's called a strongly typed regexp constant and it's available since GNU awk version 4.2 (Oct 2017).
Example here.
Use quotation marks and treat them as strings. This way it works for mawk, mawk2, and gnu-gawk. But you'll also need to double the backslashes, since making them strings will eat away one of them right off the bat.
In your example, just find("mysql") will suffice.
You can actually pass arbitrary regexes as you wish, and not be confined to just gnu-gawk, as long as you're willing to make them strings rather than use the @/../ syntax others have mentioned. This is where the number of backslashes makes a difference.
You can even make regex out of arbitrary bytes too, preferably via octal codes. if you do "\342\234\234" as a regex, the system will convert that into actual bytes in the regex before matching.
While there's nothing wrong with that approach, if you wanna be 100% safe and prefer not having arbitrary bytes flying around, write it as
"[\\342][\\234][\\234]" ----> ✜
Once initially read by awk to create an internal representation, it'll look like this :
[\342][\234][\234]
which will still match the identical objects you desire (in this case, some sort of cross-looking dingbat). This will spit out annoying warnings in unicode-aware mode of gawk due to attempting to enclose non-ASCII bytes directly into square brackets. For that use case,
"\\342\\234\\234" ------(eqv to )---> /\342\234\234/
will keep gawk happy and quiet. Lately I've been filling the gaps in my own codes and write regex that can mimic all the Unicode-script classes that perl enjoys.
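To double-check which character a run of octal byte codes stands for, a short Python snippet can decode them. This is just a side check in another language, not part of any awk workflow; the variable name is arbitrary.

```python
# The three octal escapes are the bytes 0xE2 0x9C 0x9C.
pattern_bytes = b"\342\234\234"

# Decoded as UTF-8 they form the single code point U+271C.
print(pattern_bytes.decode("utf-8"))  # ✜
```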

How can I extract field names from SQL with Perl?

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.
Given select statement fields that could have several nested parentheses like:
ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,
Or the simple case of just base_field_name as a field, what would the regex look like in Perl?
Don't try to write a regex parser (though perl regexes can handle nested patterns like that), use SQL::Statement::Structure.
Why not ask the target database itself how it would interpret the queries?
In perl, one can use the DBI to query the prepared representation of a SQL query. Sometimes this is database-specific: some drivers (under the perl DBD:: namespace) support their RDBMS' idea of describing statements in ways analogous to the RDBMS' native C or C++ API.
It can be done generically, however, as the DBI will put the names of result columns in the statement handle attribute NAME. The following, for example, has a good chance of working on any DBI-supported RDBMS:
use strict;
use warnings;
use DBI;
use constant DSN => 'dbi:YouHaveNotToldUs:dbname=we_do_not_know';
my $dbh = DBI->connect(DSN, ..., { RaiseError => 1 });
my $sth;
while (<>) {
next unless /^SELECT/i; # SELECTs only, assume whole query on one line
chomp;
my $sql = /\bWHERE\b/i ? "$_ AND 1=0" : "$_ WHERE 1=0"; # XXX ugly!
eval {
$sth = $dbh->prepare($sql); # some drivers don't know column names
$sth->execute(); # until after a successful execute()
};
print $@, next if $@; # oops, problem with that one
print join(', ', @{$sth->{NAME}}), "\n";
}
The XXX ugly! bit there tries to append an always-false condition on the SELECT, so that the SQL engine doesn't have to do any real work when you execute(). It's a terribly naive approach -- that /\bWHERE\b/i test is no more correctly identifying a SQL WHERE clause than simple regexes correctly parse out SELECT field names -- but it is likely to work.
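The same "ask the database" idea can be tried out quickly in Python. This is a sketch under stated assumptions, not the DBI code above: it uses the standard sqlite3 module as a stand-in database, the helper name result_column_names is hypothetical, and instead of appending WHERE 1=0 it wraps the query in an outer SELECT with LIMIT 0, which avoids having to detect an existing WHERE clause.

```python
import sqlite3

def result_column_names(conn, select_sql):
    """Return the result column names of a SELECT without fetching rows."""
    # Wrapping the query and adding LIMIT 0 avoids editing its WHERE clause.
    # select_sql must not end with a semicolon.
    cursor = conn.execute(f"SELECT * FROM ({select_sql}) LIMIT 0")
    return [column[0] for column in cursor.description]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (base_field_name TEXT, other TEXT)")
names = result_column_names(
    conn,
    "SELECT ltrim(rtrim(base_field_name)) renamed_field_name, other FROM t",
)
print(names)  # ['renamed_field_name', 'other']
```

Whether the wrapping trick is accepted, and how unaliased expressions are named, depends on the database, so treat the output format as specific to SQLite.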
In a somewhat related problem at the office I used:
my @SqlKeyWordList = qw/select from where .../; # (1)
my @Candidates = split(/\s/, $SqlSelectQuery); # (2)
my %FieldHash; # (3)
for my $Word (@Candidates) {
next if grep { lc($Word) eq $_ } @SqlKeyWordList; # skip SQL keywords
$FieldHash{$Word}++; # count every remaining candidate
}
Comments:
(1) @SqlKeyWordList contains all the SQL keywords that are potentially in the SQL statement (we use MySQL, there are many SQL dialects, choosing/building this list is work, look at my comments below!). If someone decided to use a keyword as a field name, you will need a regex after all (better to refactor the code).
(2) Split the SQL statement into a list of words. This is the trickiest part and WILL REQUIRE tweaking. For now it uses Perl's notion of "space" (= not in a word) to split. Splitting the field list (select a,b,c) and the "from" portion of the SQL might be advisable here, depending on your SQL statements.
(3) %FieldHash will contain one entry per select field (and gunk, until you have validated your @SqlKeyWordList and the regex in (2)).
Beware
there is nothing in this code that could not be done in Python.
your life would be much easier if you can influence the creation of said SQL statements. (e.g. make sure each field is written to a comment)
there are so many things that can/will go wrong in this parsing approach, you really should sidestep the issue entirely, by changing the process (saves time in the long run).
this is the regex we use at the office
my @Candidates=split(/[\s
\(
\)
\+
\,
\*
\/
\-
\n
\
\=
\r
]+/,$SqlSelectQuery
);
How about splitting each line into terms (replace every parenthesis, comma and space with a newline), then sorting:
perl -ne's/[(), ]/\n/g; print' < textfile | sort -u
You'll end up with a lot of content like:
fieldname1
fieldname1
formatstring
ltrim
rtrim
to_char