Python 3.8, MySQL 5.7: passing active wildcards to a LIKE query

I am using Python 3.8 and mysql.connector. Can I parameterize a query that contains a LIKE clause and pass in active % wildcard(s)?
To tame the incoming pattern I have tried:
pattern = re.sub(r'%+', '%', target)
~and~
pattern = re.sub(r'%+', '%%', target)
~and~
pattern = re.sub(r'%+', '\\%', target)
For the queries, I have tried:
cursor.execute("""
    DELETE FROM things
    WHERE user = %(user)s
    AND widget LIKE %(pattern)s
""", dict(user=username, pattern=pattern))
cursor.execute("""
    DELETE FROM things
    WHERE user = %s
    AND widget LIKE %s
""", (username, pattern))
pattern could contain just about anything, really. I know that mysql.connector will escape the input, but if the input string contains any number of percent symbols, the query hangs and then dies with: 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
I have tried switching the LIKEs for RLIKEs and changing the wildcards to simple .* strings, with the same result: if there is any active wildcard, the query dies.
Values that have not worked so far:
pattern = "%red"
pattern = "red%"
If this is not possible, then I suppose I could pull in all the values and do the wildcard search locally in the app, but that feels wrong; the dataset could become large.
Is there a correct or better way?
** Edit **
Added another % replacement pattern that I have tried.

You must escape % with %% in the SQL query text that is passed; at least, that is what I remember from long ago.
See also
mysql LIKE with double percent
and my answer there.
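A more defensive pattern worth noting (a minimal sketch, not from the answer above; it assumes the things/widget schema from the question, and delete_matching is a hypothetical name): escape the LIKE metacharacters in the user-supplied text, add the one active wildcard you actually intend, and bind the whole pattern as a parameter.

import re
import mysql.connector  # driver from the question

def delete_matching(cursor, username, target):
    # Backslash-escape %, _ and \ so user-typed characters match literally;
    # MySQL's LIKE treats backslash as the default escape character.
    escaped = re.sub(r'([%_\\])', r'\\\1', target)
    pattern = '%' + escaped + '%'
    # The driver quotes the bound values; the wildcards live inside the
    # value, so they never collide with the %s placeholders.
    cursor.execute(
        "DELETE FROM things WHERE user = %s AND widget LIKE %s",
        (username, pattern),
    )

This keeps the wildcard active while guaranteeing that nothing the user typed can change the shape of the pattern.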

Related

Properly escaping strings in raw Django SQL query

Really, the root of the problem is poor database design going back to long before I started here, but that problem isn't going to be resolved by the end of the day, or even the end of the week. So I need to find a better way to handle the following to get a feature added.
I'm forced to deviate from the Django ORM because I need to build a raw SQL query, owing to logic that has to be written around the FROM <table>. We've elected to go this route instead of updating models.py a couple of times a year as new tables are added.
There are numerous areas starting here where the Django documentation says "Do not use string formatting on raw queries or quote placeholders in your SQL strings!"
If I write the query like this:
cursor.execute("""
SELECT * \
FROM agg_data_%s \
WHERE dataview = %s \
ORDER BY sortorder """, [tbl, data_view])
It adds single quotes around tbl, which obviously causes an issue, but it does correctly construct the WHERE clause with the value in single quotes.
The following will not put single quotes around tbl, but it forces you to quote the value in the WHERE clause yourself, which is bad to say the least (it opens the query up to SQL injection):
sql = """ \
SELECT * \
FROM agg_data_%s \
WHERE dataview = '%s' \
ORDER BY sortorder """ % (tbl, data_view)
cursor.execute(sql)
Any way to make lemonade out of these lemons?
The %s parameters can only be used for values, not for (parts of) identifiers such as table or column names. If you are sure tbl is safe, then you can interpolate it first:
sql = """
SELECT * \
FROM agg_data_%s \
WHERE dataview = %%s \
ORDER BY sortorder """ % tbl, [data_view])
Here the %% will "fold" to % and thus you yield %s after the first string interpolation.
But it might be better not to use separate tables for this at all, and instead filter a single table, for example by a foreign key.
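"Sure tbl is safe" deserves to be made explicit. A minimal sketch of one way to do that (the validation rule and the fetch_agg name are illustrative, not part of the original answer):

import re
from django.db import connection

def fetch_agg(tbl, data_view):
    # tbl becomes part of the identifier, so validate it strictly;
    # only the value data_view goes through the driver's placeholder.
    if not re.fullmatch(r'[A-Za-z0-9_]+', tbl):
        raise ValueError('unsafe table suffix: %r' % tbl)
    sql = ("SELECT * FROM agg_data_%s "
           "WHERE dataview = %%s ORDER BY sortorder") % tbl
    with connection.cursor() as cursor:
        cursor.execute(sql, [data_view])
        return cursor.fetchall()

An allowlist of known table suffixes would be stricter still, if the set of tables is enumerable.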

Regex Help for wildcard lookup

I am having issues with this wildcard lookup, and I am not sure why it doesn't work.
I am looking up example Sales Agent 42. As you can imagine, being sales, they do not really care about garbage in = garbage out, so their agent codes are usually a mess to sort through.
Valid Examples for Agent 42:
42
30-42-22-holiday
42easter
42-coupon
42coupon-423355
29-42sale-52
Non-valid examples that explicitly must not show up:
A4290042
4297901
42cmowc209d
o203f9j42po0
Here is the most successful model I came up with:
SELECT company_id, agent
FROM cust_data
WHERE (agent = '42'
       OR agent LIKE '42%-%'
       OR agent LIKE '%-%42'
       OR agent LIKE '%-%42%-%'
       OR agent LIKE '42[a-z]%-%'
       OR agent LIKE '%-%42[a-z]%'
       OR agent LIKE '%-%42[a-z]%-%'
       OR agent LIKE '42[a-z]%')
I get most of the valid ones to return, and none of the non-valid ones, but I still can't seem to grab examples like 42easter or 29-42sale-52, even though I am telling it to grab that style.
Any suggestions?
If you need to match 42 that is not surrounded with digits, you can use alternations with anchors (^ standing for the start of string and $ standing for the end of string) and negated character classes:
WHERE agent ~ '(^|[^0-9])42($|[^0-9])'
See the regex demo
Explanation:
(^|[^0-9]) - either the start of the string ^ or a non-digit [^0-9]
42 - literal 42
($|[^0-9]) - end of string $ or a non-digit [^0-9]
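To sanity-check the idea outside the database, here is a quick look at a few of the examples (a sketch using Python's re module, which treats this pattern the same way):

import re

# Either an anchor or a non-digit on each side of the literal 42.
pattern = re.compile(r'(^|[^0-9])42($|[^0-9])')

for agent in ['42', '30-42-22-holiday', '29-42sale-52', 'A4290042', '4297901']:
    print(agent, bool(pattern.search(agent)))

The first three match; A4290042 and 4297901 do not, because every 42 in them has a digit on at least one side.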

Limiting character input to specific characters

I'm making a fully working add-and-subtract program as a nice little easy project. One thing I would love to know is whether there is a way to restrict input to certain characters (such as 1 and 0 for the binary inputs, and A and B for the add or subtract inputs). I could always replace all characters that aren't these with empty strings to get rid of them, but doing it that way is quite tedious.
Here is some simple code to filter out the specified characters from a user's input:
local filter = "10abAB"
local input = io.read()
input = input:gsub("[^" .. filter .. "]", "")
The filter variable is just set to whatever characters you want to allow in the user's input. As an example, if you also want to allow c and C, add them: local filter = "10abcABC".
Although I assume that you get input from io.read(), it is possible that you get it from somewhere else, so you can just replace io.read() with whatever you need there.
The third line of code in my example is what actually filters out the text. It uses string:gsub to do this, meaning that it could also be written like this:
input = string.gsub(input, "[^" .. filter .. "]", "").
The benefit of writing it like this is that it's clear that input is meant to be a string.
The gsub pattern is [^10abAB], which means that any character not in that set will be filtered out: the leading ^ negates the set, and the replacement pattern, the empty string passed as the last argument, deletes each match.
Bonus super-short one-liner that you probably shouldn't use:
local input = io.read():gsub("[^10abAB]", "")

regexp_matches() returns two matches for $ (end of string)

Can somebody explain this odd behavior of regexp_matches() in PostgreSQL 9.2.4 (same result in 9.1.9):
db=# SELECT regexp_matches('test string', '$') AS end_of_string;
end_of_string
---------------
{""}
(1 row)
db=# SELECT regexp_matches('test string', '$', 'g') AS end_of_string;
end_of_string
---------------
{""}
{""}
(2 rows)
-> SQLfiddle demo.
The second parameter is a regular expression. $ marks the end of the string.
The third parameter is for flags. g stands for "globally", meaning that the function doesn't stop at the first match.
The function seems to report the end of the string twice with the g flag, but by definition that position can only exist once. It breaks my query. :(
Am I missing something?
I would need my query to return one more row at the end, for any possible string. I expected this query to do the job, but it adds two rows:
SELECT (regexp_matches('test & foo/bar', '(&|/|$)', 'ig'))[1] AS delim
I know how to manually add a row, but I want to let the function take care of it.
It looks like it was a bug in PostgreSQL. I verified that it is fixed in 9.3.8. Looking at the release notes, I see possible references in:
9.3.4
Allow regular-expression operators to be terminated early by query
cancel requests (Tom Lane)
This prevents scenarios wherein a pathological regular expression
could lock up a server process uninterruptably for a long time.
9.3.6
Fix incorrect search for shortest-first regular expression matches
(Tom Lane)
Matching would often fail when the number of allowed iterations is
limited by a ? quantifier or a bound expression.
Thanks to Erwin for narrowing it down to 9.3.x.
I am not sure about what I am going to say, because I don't use PostgreSQL, so this is just me thinking out loud.
Since you are trying to match the end of string/line with $, the outcome in the first situation is expected. But when you turn on the global match modifier g, note that matching the end-of-string position doesn't actually consume any characters from the input, so the next match attempt starts exactly where the first one left off: at the end of the string. Left unchecked, that would loop forever, so the PostgreSQL engine may be detecting this and stopping it to prevent a crash or an infinite loop.
I tested the same expression in RegexBuddy with the POSIX ERE flavor, and it caused the program to become unresponsive and crash, which is the basis for my reasoning.
The same occurs, for example, in C#, where I ran into the same problem recently, so I think this is normal behaviour for regexes:
it happens because $ doesn't stand for a specific character but for a specific position, so $ doesn't really match anything and the parser stays at the same position.
You need to change your convention a little; to test for an empty string you can use ^$.
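The zero-width nature of $ is easy to demonstrate outside the database (a sketch with Python's re, whose global search steps past empty matches instead of repeating them):

import re

# $ matches a position, not a character; Python reports the zero-width
# match at the end of the string exactly once.
print(re.findall(r'(&|/|$)', 'test & foo/bar'))
# ['&', '/', '']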
This was a bug that has been fixed in Postgres 9.3; see the accepted answer.
For Postgres 9.2 or older, a halfway decent workaround for my situation is to use the expression .$ instead, which matches any string once, at the last character:
WITH x(id, t) AS (
   VALUES
      (1, 'test & foo/bar')
    , (2, 'test')
    , (3, '')             -- empty string
    , (4, 'test & foo/')  -- other branch as last character
   )
SELECT id, (regexp_matches(t, '(&|/|.$)', 'ig'))[1] AS delim
FROM x;
But it fails for empty strings.
And it fails if the last character happens to match another branch, like 'foo/bar/'.
And it isn't perfect to have the actual final character returned; an empty string would be much preferable.
-> SQLfiddle.

How can I extract field names from SQL with Perl?

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.
Given select statement fields that could have several nested parentheses, like:
ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,
Or the simple case of just base_field_name as a field, what would the regex look like in Perl?
Don't try to write a regex parser (though Perl regexes can handle nested patterns like that); use SQL::Statement::Structure.
Why not ask the target database itself how it would interpret the queries?
In perl, one can use the DBI to query the prepared representation of a SQL query. Sometimes this is database-specific: some drivers (under the perl DBD:: namespace) support their RDBMS' idea of describing statements in ways analogous to the RDBMS' native C or C++ API.
It can be done generically, however, as the DBI will put the names of result columns in the statement handle attribute NAME. The following, for example, has a good chance of working on any DBI-supported RDBMS:
use strict;
use warnings;
use DBI;
use constant DSN => 'dbi:YouHaveNotToldUs:dbname=we_do_not_know';
my $dbh = DBI->connect(DSN, ..., { RaiseError => 1 });
my $sth;
while (<>) {
    next unless /^SELECT/i;  # SELECTs only, assume whole query on one line
    chomp;
    my $sql = /\bWHERE\b/i ? "$_ AND 1=0" : "$_ WHERE 1=0";  # XXX ugly!
    eval {
        $sth = $dbh->prepare($sql);  # some drivers don't know column names
        $sth->execute();             # until after a successful execute()
    };
    print($@), next if $@;  # oops, problem with that one
    print join(', ', @{$sth->{NAME}}), "\n";
}
The XXX ugly! bit there tries to append an always-false condition to the SELECT, so that the SQL engine doesn't have to do any real work when you execute(). It's a terribly naive approach -- that /\bWHERE\b/i test is no more correct at identifying a SQL WHERE clause than simple regexes are at parsing out SELECT field names -- but it is likely to work.
In a somewhat related problem at the office I used:
my @SqlKeyWordList = qw/select from where .../;  # (1)
my @Candidates = split(/\s/, $SqlSelectQuery);   # (2)
my %FieldHash;                                   # (3)
for my $Word (@Candidates) {
    next if grep { $_ eq $Word } @SqlKeyWordList;
    $FieldHash{$Word}++;
}
Comments:
@SqlKeyWordList contains all the SQL keywords that are potentially in the SQL statement (we use MySQL; there are many SQL dialects, so choosing/building this list is work, see my comments below!). If someone has decided to use a keyword as a field name, you will need a regex after all (better to refactor the code).
Split the SQL statement into a list of words. This is the trickiest part and WILL REQUIRE tweaking. For now it uses Perl's notion of "space" (= not in a word) to split. Splitting the field list (select a,b,c) and the "from" portion of the SQL separately might be advisable here; it depends on your SQL statements.
%FieldHash will contain one entry per select field (and gunk, until you have validated your @SqlKeyWordList and the regex in (2)).
Beware
there is nothing in this code that could not be done in Python (see the sketch after the regex below).
your life would be much easier if you can influence the creation of said SQL statements. (e.g. make sure each field is written to a comment)
there are so many things that can/will go wrong in this parsing approach, you really should sidestep the issue entirely, by changing the process (saves time in the long run).
this is the regex we use at the office:
my @Candidates = split(/[\s\(\)\+\,\*\/\-\n\\\=\r]+/, $SqlSelectQuery);
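As noted above, the keyword-filter idea ports directly to Python. A rough equivalent (the keyword list and split pattern here are illustrative and deliberately simplified):

import re
from collections import Counter

# Illustrative, incomplete keyword list - extend it for your dialect.
SQL_KEYWORDS = {'select', 'from', 'where', 'and', 'or', 'as'}

def candidate_fields(sql):
    # Split on whitespace and common SQL punctuation, then drop keywords.
    words = re.split(r'[\s(),+*/=\\-]+', sql)
    return Counter(w for w in words if w and w.lower() not in SQL_KEYWORDS)

print(candidate_fields(
    "select ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name from t"))

The same caveats apply: this counts function names and aliases as "fields" until you tune the keyword list and the split pattern.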
How about splitting each line into terms (replace every parenthesis, comma, and space with a newline), then sorting:
perl -ne's/[(), ]/\n/g; print' < textfile | sort -u
You'll end up with a lot of content like:
fieldname1
fieldname2
formatstring
ltrim
rtrim
to_char