Regex: does a SQL statement include a WHERE clause? - sql

I need a regex that will determine if a given SQL statement has a WHERE clause. My problem is that the passed SQL statements will most likely be complex, so I can not rely on just the existence of the word WHERE in the statement.
For example this should match
SELECT Contacts.ID
, CASE WHEN (Contacts.Firstname IS NULL) THEN ''
ELSE CAST(Contacts.Firstname AS varchar)
END AS Firstname
, CASE WHEN (Contacts.Lastname IS NULL) THEN ''
ELSE CAST(Contacts.Lastname AS varchar)
END AS Lastname
, CASE WHEN (tbl_ContactExtras.Prequalified=-1 OR
tbl_ContactExtras.Prequalified IS NULL) THEN ''
WHEN tbl_ContactExtras.Prequalified=0 THEN 'No'
WHEN tbl_ContactExtras.Prequalified=1 THEN 'Yes - Other'
WHEN tbl_ContactExtras.Prequalified=2 THEN 'Yes'
ELSE CAST(tbl_ContactExtras.Prequalified AS varchar)
END AS Prequalified
FROM contacts
LEFT JOIN tbl_ContactExtras
ON tbl_ContactExtras.ContactID = Contacts.ID
WHERE (Contacts.Firstname LIKE 'Bob%')
and this should not match:
SELECT Contacts.ID
, CASE WHEN (Contacts.Firstname IS NULL) THEN ''
ELSE CAST(Contacts.Firstname AS varchar)
END AS Firstname
, CASE WHEN (Contacts.Lastname IS NULL) THEN ''
ELSE CAST(Contacts.Lastname AS varchar)
END AS Lastname
, CASE WHEN (tbl_ContactExtras.Prequalified=-1 OR
tbl_ContactExtras.Prequalified IS NULL) THEN ''
WHEN tbl_ContactExtras.Prequalified=0 THEN 'No'
WHEN tbl_ContactExtras.Prequalified=1 THEN 'Yes - Other'
WHEN tbl_ContactExtras.Prequalified=2 THEN 'Yes'
ELSE CAST(tbl_ContactExtras.Prequalified AS varchar)
END AS Prequalified
FROM contacts
LEFT JOIN tbl_ContactExtras
ON tbl_ContactExtras.ContactID = Contacts.ID
Those are examples of some of the simpler statements: a statement could have up to 30 CASE statements in it, or it could have none at all.
I need to programmatically add WHERE parameters, but doing this correctly requires knowing whether a WHERE clause is already present.
Any idea on a regex that would work for this? If not, any other ideas on how to tell the two apart?
Thanks,

This is not possible, since a WHERE clause may be arbitrarily nested inside the FROM clause.

This may not catch all cases but you may find you can catch most of them just by finding the last from and the last where in the statement.
if the where is after the from, then it has a where clause. If the where is before the from (or there is no where at all), then no where clause exists.
Sometimes, it's okay to leave restrictions or limitations in your code, as long as they're properly documented.
For example, I've worked on a project before that parsed SQL and we discovered it didn't handle things like between:
where recdate between '2010-01-01' and '2010-12-31'
Rather than spend a bucket-load of money fixing the problem (and probably introducing bugs on the way), we simply published it as a restriction and told everyone they had to change it to:
where recdate >= '2010-01-01'
and recdate <= '2010-12-31'
Problem solved. While it's good to keep customers happy, you don't have to cater to every whim :-)
Other than that, you need an SQL parser, and SQL is not a pretty language to parse, trust me on that one.

Are all of the joins the same? If so you could find the index of all or part of the FROM statement (perhaps using a regex to be tolerant of slight differences in syntax and whitespace) and then look for the occurrence of the word WHERE after that index.
In general you would be better off using a parser. But if this is just a one off thing and the statements are all fairly similar then the above approach should be okay.

Regex is not designed to do this. Parsing SQL properly requires matching balanced parentheses (and other matching pairs, such as quotes), something regex is not designed to do (and pure regex isn't even equipped to; PCRE can but it's not pretty).
Instead, just write a basic state machine or something to parse it.

What's the problem you're trying to solve? Are you trying to determine if it's safe to add constraints to these existing queries?
For example, if you've got this query
...
where foo = 'bar'
then you know it's safe to add
and bat = 'quux'
but if you don't have a WHERE clause already, then you have to do it as
where bat = 'quux'
Is that the problem you're trying to solve? If so, can you make every SQL query you're working with have a WHERE clause by adding a "WHERE 0=0" to those queries that don't have one? Then you know in your post-process phase that every query already has one.
This is just a guess, of course. Your question sounded like that might be the larger issue.

Related

SQL how to join on another field if the first field has letters in it

I have a SQL join that looks like this
INNER JOIN UPLOAD PH on cast(p.PickTicket_Number as varchar(20))=PH.FIELD004
However, sometimes, the field I really want to join on is PH.FIELD003. This happens when PH.FIELD004 is a different field that has letters in it. How do I include a condition in the ON clause where if field004 does not have letters in it (is numeric), it joins just like that above, but if it does have letters in it, it instead joins on the pickticket number = field003?
You can express the logic like this:
UPLOAD PH
ON cast(p.PickTicket_Number as varchar(20)) = PH.FIELD004 OR
(TRY_CONVERT(int, PH.FIELD004) IS NULL AND
CAST(p.PickTicket_Number as varchar(20)) = PH.FIELD003
)
Notes:
This checks that FIELD004 cannot be converted to an INT. This is hopefully close enough to "has letters in it". If not, you can use a CASE expression with LIKE.
OR is a performance killer for JOIN conditions. Function calls (say for conversion) also impede the optimizer.
If performance is an issue, ask a new question.
I am guessing the "number" value is an INT; if not, change to the appropriate type.
Converting numbers to strings for comparison can be dangerous. I've had problems with leading 0's, for example.
Because of the last issue, I would recommend:
UPLOAD PH
ON TRY_CONVERT(INT, PH.FIELD004) = p.PickTicket_Number OR
(TRY_CONVERT(int, PH.FIELD004) IS NULL AND
p.PickTicket_Number= PH.FIELD003
)

Check column IS NULL not working in Case Expression

I am trying to check a column which belongs to View "IS NULL" and it works good with select statement:
SELECT TOP (1000) [Invoice_Number]
,[Invoice_Date]
,[Invoice_Amount]
,[Invoice_CoCd]
,[Invoice_vendor]
,[Invoice_PBK]
,[Invoice_DType]
,[Invoice_DueDate]
,[Invoice_ClgDate] FROM [dbo].[viewABC] where Invoice_PBK IS NULL
Now I have an update statement, where I need to update a column in a table based on NULL in VIEW:
UPDATE cis
SET
cis.InvoiceStatus =
(
CASE
WHEN RTRIM(LTRIM(imd.[Invoice_PBK])) IS NULL THEN
'HELLO'
WHEN RTRIM(LTRIM(imd.Invoice_DType)) = 'RD'
THEN '233' END)
FROM
[dbo.[tblABC] cis,
[dbo].[viewABC] imd
WHERE [condition logic]
These is no issue with where condition, the IS NULL in the CASE expression causing the problem.
Can someone help please?
You may want to just test your Logic piece by piece.
Try
SELECT ISNULL(RTRIM(LTRIM(imd.[Invoice_PBK])),'Hey Im Null')
FROM
[dbo].[tblABC] cis,
[dbo].[viewABC] imd
WHERE [condition logic]
See if you get that String value on your returned column(s) from the Query. Then, build from there. It could be something as simple as the JOIN. Which I'm not a fan of that old syntax but, without more info on this table other than [conditions]. It's hard to just guess an answer for you. You have no detailed evidence that helps us. It very well could be your conditions but, you're saying "condition logic" and that does nothing for the group on this thread.
Looking at this expression:
cis.InvoiceStatus =
CASE
WHEN RTRIM(LTRIM(imd.[Invoice_PBK])) IS NULL THEN
'HELLO'
WHEN RTRIM(LTRIM(imd.Invoice_DType)) = 'RD' THEN
'233'
END
And this symptom:
CIS.InvoiceStatus gets updated with NULL
The obvious conclusion is neither WHEN condition is met, and therefore the result of the CASE expression is NULL.
Maybe you wanted this, which will preserve the original value in that situation:
cis.InvoiceStatus =
CASE
WHEN RTRIM(LTRIM(imd.[Invoice_PBK])) IS NULL THEN
'HELLO'
WHEN RTRIM(LTRIM(imd.Invoice_DType)) = 'RD' THEN
'233'
ELSE
cis.InvoiceStatus
END
Or maybe you wanted this, to also match an empty string value:
cis.InvoiceStatus =
CASE
WHEN NULLIF(RTRIM(LTRIM(imd.[Invoice_PBK])),'') IS NULL THEN
'HELLO'
WHEN RTRIM(LTRIM(imd.Invoice_DType)) = 'RD' THEN
'233'
END
It's also worth pointing out the two WHEN conditions are looking at two different columns.
Finally, it may be worth a data clean-up project here. Needing to do an LTRIM() will break any chance of using indexes on those fields (RTRIM() is slightly less bad), and index use cuts to the core of database performance.

Can 2 character length variables cause SQL injection vulnerability?

I am taking a text input from the user, then converting it into 2 character length strings (2-Grams)
For example
RX480 becomes
"rx","x4","48","80"
Now if I directly query server like below can they somehow make SQL injection?
select *
from myTable
where myVariable in ('rx', 'x4', '48', '80')
SQL injection is not a matter of length of anything.
It happens when someone adds code to your existing query. They do this by sending in the malicious extra code as a form submission (or something). When your SQL code executes, it doesn't realize that there are more than one thing to do. It just executes what it's told.
You could start with a simple query like:
select *
from thisTable
where something=$something
So you could end up with a query that looks like:
select *
from thisTable
where something=; DROP TABLE employees;
This is an odd example. But it does more or less show why it's dangerous. The first query will fail, but who cares? The second one will actually work. And if you have a table named "employees", well, you don't anymore.
Two characters in this case are sufficient to make an error in query and possibly reveal some information about it. For example try to use string ')480 and watch how your application will behave.
Although not much of an answer, this really doesn't fit in a comment.
Your code scans a table checking to see if a column value matches any pair of consecutive characters from a user supplied string. Expressed in another way:
declare #SearchString as VarChar(10) = 'Voot';
select Buffer, case
when DataLength( Buffer ) != 2 then 0 -- NB: Len() right trims.
when PatIndex( '%' + Buffer + '%', #SearchString ) != 0 then 1
else 0 end as Match
from ( values
( 'vo' ), ( 'go' ), ( 'n ' ), ( 'po' ), ( 'et' ), ( 'ry' ),
( 'oo' ) ) as Samples( Buffer );
In this case you could simply pass the value of #SearchString as a parameter and avoid the issue of the IN clause.
Alternatively, the character pairs could be passed as a table parameter and used with IN: where Buffer in ( select CharacterPair from #CharacterPairs ).
As far as SQL injection goes, limiting the text to character pairs does preclude adding complete statements. It does, as others have noted, allow for corrupting the query and causing it to fail. That, in my mind, constitutes a problem.
I'm still trying to imagine a use-case for this rather odd pattern matching. It won't match a column value longer (or shorter) than two characters against a search string.
There definitely should be a canonical answer to all these innumerable "if I have [some special kind of data treatment] will be my query still vulnerable?" questions.
First of all you should ask yourself - why you are looking to buy yourself such an indulgence? What is the reason? Why do you want add an exception to your data processing? Why separate your data into the sheep and the goats, telling yourself "this data is "safe", I won't process it properly and that data is unsafe, I'll have to do something?
The only reason why such a question could even appear is your application architecture. Or, rather, lack of architecture. Because only in spaghetti code, where user input is added directly to the query, such a question can be ever occur. Otherwise, your database layer should be able to process any kind of data, being totally ignorant of its nature, origin or alleged "safety".

SQL produced by Entity Framework for string matching

Given this linq query against an EF data context:
var customers = data.Customers.Where(c => c.EmailDomain.StartsWith(term))
You’d expect it to produce SQL like this, right?
SELECT {cols} FROM Customers WHERE EmailDomain LIKE #term+’%’
Well, actually, it does something like this:
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) = 1)
Do you know why?
Also, replacing the Where selector to:
c => c.EmailDomain.Substring(0, term.Length) == term
it runs 10 times faster but still produces some pretty yucky SQL.
NOTE: Linq to SQL correctly translates StartsWith into Like {term}%, and nHibernate has a dedicated LikeExpression.
I don't know about MS SQL server but on SQL server compact LIKE 'foo%' is thousands time faster than CHARINDEX, if you have INDEX on seach column. And now I'm sitting and pulling my hair out how to force it use LIKE.
http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/1b835b94-7259-4284-a2a6-3d5ebda76e4b
The reason is that CharIndex is a lot faster and cleaner for SQL to perform than LIKE. The reason is, that you can have some crazy "LIKE" clauses. Example:
SELECT * FROM Customer WHERE EmailDomain LIKE 'abc%de%sss%'
But, the "CHARINDEX" function (which is basically "IndexOf") ONLY handles finding the first instance of a set of characters... no wildcards are allowed.
So, there's your answer :)
EDIT: I just wanted to add that I encourage people to use CHARINDEX in their SQL queries for things that they didn't need "LIKE" for. It is important to note though that in SQL Server 2000... a "Text" field can use the LIKE method, but not CHARINDEX.
Performance seems to be about equal between LIKE and CHARINDEX, so that should not be the reason. See here or here for some discussion. Also the CAST is very weird because CHARINDEX returns an int.
charindex returns the location of the first term within the second term.
sql starts with 1 as the first location (0 = not found)
http://msdn.microsoft.com/en-us/library/ms186323.aspx
i don't know why it uses that syntax but that's how it works
I agree that it is no faster, I was retrieving tens of thousands of rows from our database with the letter i the name. I did find however that you need to use > rather than = ... so use
{cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) > 0)
rather than
{cols} FROM Customer WHERE ((CAST(CHARINDEX(#term, EmailDomain) AS int)) = 1)
Here are my two tests ....
select * from members where surname like '%i%' --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) > 0) --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) = 1) --too few results

SQL Query - Use Like only if no exact match exists?

I'm having an issue with a query that currently uses
LEFT JOIN weblog_data AS pwd
ON (pwd.field_id_41 != ''
AND pwd.field_id_41 LIKE CONCAT('%', ewd.field_id_32, '%'))
However I'm discovering that I need it to only use that if there is no exact match first. What's happening is that the query is double dipping due to the use of LIKE, so if it tests for an exact match first then it will avoid the double dipping issue. Can anyone provide me with any further guidance?
It sounds like you want to join the tables aliased as pwd and ewd in your snippet based first on an exact match, and if that fails, then on the like comparison you have now.
Try this:
LEFT JOIN weblog_data AS pwd1 ON (pwd.field_id_41 != '' AND pwd.field_id_41 = ewd.field_id_32)
LEFT JOIN weblog_data AS pwd2 ON (pwd.field_id_41 != '' AND pwd.field_id_41 LIKE CONCAT('%', ewd.field_id_32, '%'))
Then, in your select clause, use something like this:
select
isnull(pwd1.field, pwd2.field)
however, if you are dealing with a field that can be null in pwd, that will cause problems, this should work though:
select
case pwd1.nonnullfield is null then pwd2.field else pwd1.field end
You'll also have to make sure to do a group by, as the join to pwd2 will still add rows to your result set, even if you end up ignoring the data in it.
you're talking about short circuit evaluation.
Take a look at this article it might help you:
http://beingmarkcohen.com/?p=62
using TSQL, run an exact match, check for num of rows == 0, if so, run the like, otherwise don't run the like or add the like results below the exact matches.
I can only think of doing it in code. Look for an exact match, if the result is empty, look for a LIKE.
One other option is a WHERE within this query such that WHERE ({count from exact match}=0), in which case, it wont go through the comparison with LIKE if the exact match returns more than 0 results. But its terribly inefficient... not to mention the fact that using it meaningfully in code is rather difficult.
i'd go for a If(count from exact match = 0) then do like query, else just use the result from exact match.