Premise
I recently ran into a bug in a select statement in my code. It was fairly trivial to fix after I realized what was going on, but I'm interested in finding a way to make sure a similar bug doesn't happen again.
Here's an example of an offending query:
select
the,
quick,
brown
fox,
jumped,
over,
the,
lazy,
dog
from table_name;
What I had intended was:
select
the,
quick,
brown,
fox,
jumped,
over,
the,
lazy,
dog
from table_name;
For those who don't see it, a comma is missing after brown in the former. This causes brown to be aliased as fox, because the as keyword is optional when aliasing. So, what you get in the result is:
the,
quick,
fox,
jumped,
over,
the,
lazy,
dog
...with all the values of brown in a column named fox. This can be noticed pretty easily for a short query like the above (especially when each column has very different values), but where it came up was in a fairly complicated query with mostly integer columns like this:
select
foo,
bar,
baz,
another_table.quux,
a1,
a2,
a3,
a4,
a5,
a6,
a7,
a8,
a9,
a10,
a11,
a12,
a13,
a14,
a15,
a16,
b1,
b2,
b3,
b7,
b8,
b9,
b10,
b11,
b12,
b13,
b14,
b18,
b19,
b20,
b21,
c1,
c2,
c3,
c4,
c5,
c6,
c7,
c8
from table_name
join another_table on table_name.foo_id = another_table.id
where
blah = 'blargh'
-- many other things here
;
Even with better column names, the values are all very similar. If I miss a comma after b11 (for example), all of the b11 values end up in a column named b12, which is pretty unfortunate when we run the data through our processing pipeline (which depends on these column names in the result). Normally, I'd do select * from table_name, but what we needed required us to be a little more selective than that.
Question
What I'm looking for is a strategy to stop this from happening again.
Is there a way to require as when aliasing columns? Or a trick of writing things that makes the mistake produce an error? (For example, in C-like languages, I started writing 1 == foo instead of foo == 1, so that accidentally leaving out an equals sign produces the invalid 1 = foo, a compile error, instead of the silently accepted assignment foo = 1.)
I use vim normally, so I can use hlsearch to highlight commas just so I can eyeball it. However, I have to write queries in other environments quite often, including a proprietary interface in which I can't do something like this easily.
Thanks for your help!
One thing that I've done before is to move the commas to the beginning of the line. This has a couple of benefits. First, you can instantly see if any commas are missing. Second, you can add a new column at the end without having to modify the previously last line.
Missing:
select
the
, quick
, brown
fox
, jumped
, over
, the
, lazy
, dog
from table_name;
Not missing:
select
the
, quick
, brown
, fox
, jumped
, over
, the
, lazy
, dog
from table_name;
You could wrap your SQL calls in a function that would either:
Iterate over the columns in the result set, checking for column names containing a space
or
Accept both the SQL statement and the intended number of columns (an integer), then check the result set to make sure the column count matches what you intended; a sketch of that check follows.
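A minimal sketch of the second check, assuming SQL Server 2012 or later (the question never names a DBMS; other databases would need the count taken in the wrapper itself). sys.dm_exec_describe_first_result_set reports the columns a query would return without running it, so the wrapper can compare that count against the intended one:
-- Hedged sketch: the intended count for the example query is 9, but the
-- missing comma makes the statement return only 8 columns.
DECLARE @expected int = 9;

SELECT CASE WHEN COUNT(*) = @expected
            THEN 'column count OK'
            ELSE 'column count mismatch - check for a missing comma'
       END AS check_result
FROM sys.dm_exec_describe_first_result_set(
         N'select the, quick, brown fox, jumped, over, the, lazy, dog
           from table_name',
         NULL, 0);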
I have the same problem that you do. I have used make and a Perl script to do a "lint"-like check on my code for a long time. It has helped prevent a number of mistakes like this.
In the makefile I have:
lint_code:
	perl lint_code.pl <file_1.php
The perl file is:
# State machine: 0 = outside SQL, 1 = inside a start-sql/end-sql block,
# 2 = inside a select list, 3 = past the from clause.
$st = 0;
$line_no = 0;
while (<>)
{
    $line_no++;
    $st = 1 if ( /start-sql/ );
    $st = 0 if ( /end-sql/ );
    $st = 2 if ( $st == 1 && /select/ );
    $st = 3 if ( $st == 2 && /from/ );
    # A select-list line that is just a bare identifier with nothing (no
    # comma) after it is a likely missing-comma alias.
    if ( $st == 2 && /^[ \t]+[a-zA-Z][a-zA-Z0-9_]*[ \t]*$/ )
    {
        if ( ! /select/ )
        {
            printf ( "Possible Error: Line: $line_no\n" );
        }
    }
}
I surround my select statements with comments //start-sql and //end-sql. I hope this helps.
I have changed the regular expression to reflect how you formatted your SQL, as I have been using a different format (with the commas in the front).
As part of my build/test process I run a set of checks over the code. This is a less-than-perfect solution, but it has helped me.
Write the comma before the name:
first
,short
,medium
,longlonglong
,...
vs
first,
short,
medium,
longlonglong,
...
This also makes it really easy to see the list of SQL select arguments, and it works in any IDE :)
If you have columns with similar names, distinguished only by suffix numbers, you've already lost. You have a bad database design.
And most modern developers use SQL generators or ORMs these days, instead of writing this "assembly language" SQL.
Related
So let's say I have this list of strings in an Excel file:
33000
33100
33010
33110
45050
45150
45250
45350
45360
45370
55360
55370
And I've got a SQL table that has this list of strings (and more), and I want to make a SELECT statement that searches only for this list of strings.
I could make a brute-force statement like SELECT * FROM Table WHERE field = '33100' OR field = '33010' .... However, I could make the WHERE clause smaller by using LIKE patterns.
I'm trying to find a way to keep the number of LIKE patterns as small as possible, so I need to generate the smallest set of SQL patterns that identifies the whole list. For the list above, that smallest set would be this:
33[01][01]0
45[0123]50
[45]53[67]0
How could I generate a list of patterns like this dynamically where the input is the list of strings?
An alternative approach might be more "elegant", but it will not be faster. Your strings start with different characters, so the first part of a like pattern would be a wildcard or character range -- effectively precluding the use of an index.
A simple in expression, on the other hand, can use an index:
where col in ('33100', '33010', '33110', '45050', ...)
Okay, let's say you have this data in Excel, starting in cell A2.
In cell C1 write this code: create table ##TEMP(STRS varchar(20))
In cell C2 write this code: ="insert into ##TEMP"&" values"&" ('"&A2&"' )"&","
In cell C3 write this code: =" ('"&A3&"')"&","
Now copy (Ctrl+C) the formula in cell C3 and paste it into the range C4:C13.
Column C now contains the generated SQL script.
Copy the text from range C1:C13, open SQL Server Management Studio, paste it, delete the last comma (in this case the one at the end of cell C13's text, which must be removed for the SQL to run), and execute it; you now have the ##TEMP table.
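Assuming the sample values from the question above, the pasted script ends up looking roughly like this (abbreviated; the trailing comma is already removed here):
create table ##TEMP(STRS varchar(20))
insert into ##TEMP values ('33000'),
 ('33100'),
 ('33010'),
 -- ...one line per remaining value...
 ('55370')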
INNER JOIN it with your table like
SELECT * FROM MYTABLE M INNER JOIN ##TEMP AS T ON T.STRS = M.COLUMN_NAME_STR
And you should get the data you need. Hope it helps.
I've been asked to run a query to return a list of UK post codes from a table full of filters for email reports which only have 1 number at the end. The problem is that UK post codes are of variable length; some are structured 'AA#' or 'AA##' and some are structured 'A#' or 'A##'. I only want those that are either 'AA#' or 'A#'.
I tried running the below SQL, using length and (attempting to) use regex to filter out all results which didn't match what I wanted, but I'm very new to using ranges and it hasn't worked.
SELECT PostCode
FROM ReportFilterTable RFT
WHERE RFT.FilterType = 'Postcode'
AND LEN(RFT.Postcode) < 4
AND RFT.PostCode LIKE '%[0-9]'
I think the way I'm approaching this is flawed, but I'm clueless as to a better way. Could anyone help me out?
Thanks!
EDIT:
Since I helpfully didn't include any example data originally, I've now done so below.
This is a sample of the kind of values in the column I'm returning, with examples of what I need to return and what I don't.
B1 -- Should be returned
B10 -- Should not be returned
B2 -- Should be returned
B20 -- Should not be returned
B3 -- Should be returned
B30 -- Should not be returned
SE1 -- Should be returned
SE10 -- Should not be returned
You could filter for one or two letters (and omit the length check, since it's implicit in the LIKE):
WHERE RFT.FilterType = 'Postcode' AND
(RFT.PostCode LIKE '[A-Z][0-9]' OR RFT.PostCode LIKE '[A-Z][A-Z][0-9]')
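To see the effect, here is the same pair of patterns run against the sample values from the question, using a throwaway VALUES constructor (SQL Server syntax, as the answer assumes):
-- Only B1, B2, B3 and SE1 survive; B10, B20, B30 and SE10 are filtered out.
SELECT v.PostCode
FROM (VALUES ('B1'), ('B10'), ('B2'), ('B20'), ('B3'), ('B30'), ('SE1'), ('SE10')
     ) AS v(PostCode)
WHERE v.PostCode LIKE '[A-Z][0-9]'
   OR v.PostCode LIKE '[A-Z][A-Z][0-9]';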
If the issue is that you are getting values with multiple digits and you are using SQL Server (as suggested by the syntax), then you can do:
WHERE RFT.FilterType = 'Postcode' AND
LEN(RFT.Postcode) < 4 AND
(RFT.PostCode LIKE '%[0-9]' AND RFT.PostCode NOT LIKE '%[0-9][0-9]')
Or, if you know there are at least two characters, you could use:
WHERE RFT.FilterType = 'Postcode' AND
LEN(RFT.Postcode) < 4 AND
RFT.PostCode LIKE '%[^0-9][0-9]'
Non-digit followed by 1 digit ... LIKE '%[^0-9][0-9]'
I am taking text input from the user, then converting it into two-character strings (2-grams).
For example
RX480 becomes
"rx","x4","48","80"
Now if I directly query server like below can they somehow make SQL injection?
select *
from myTable
where myVariable in ('rx', 'x4', '48', '80')
SQL injection is not a matter of length of anything.
It happens when someone adds code to your existing query. They do this by sending in the malicious extra code as a form submission (or something). When your SQL code executes, it doesn't realize that there is more than one thing to do. It just executes what it's told.
You could start with a simple query like:
select *
from thisTable
where something=$something
If the user submits something like ; DROP TABLE employees; as the value, you could end up with a query that looks like:
select *
from thisTable
where something=; DROP TABLE employees;
This is an odd example, but it does more or less show why it's dangerous. The first statement will fail, but who cares? The second one will actually run. And if you have a table named "employees", well, you don't anymore.
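For contrast, here is a minimal sketch of the same lookup with the value bound as a parameter instead of spliced into the statement text (T-SQL's sp_executesql is shown; most client libraries expose an equivalent facility):
EXEC sp_executesql
    N'select * from thisTable where something = @something',
    N'@something varchar(100)',
    @something = '; DROP TABLE employees;';
-- The "attack" string is now just a value to compare against; no second
-- statement ever runs.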
Two characters are enough in this case to cause an error in the query and possibly reveal some information about it. For example, try the input ')480 and watch how your application behaves.
Although not much of an answer, this really doesn't fit in a comment.
Your code scans a table checking to see if a column value matches any pair of consecutive characters from a user supplied string. Expressed in another way:
declare @SearchString as VarChar(10) = 'Voot';

select Buffer,
    case
        when DataLength( Buffer ) != 2 then 0 -- NB: Len() right trims.
        when PatIndex( '%' + Buffer + '%', @SearchString ) != 0 then 1
        else 0
    end as Match
from ( values
    ( 'vo' ), ( 'go' ), ( 'n ' ), ( 'po' ), ( 'et' ), ( 'ry' ),
    ( 'oo' ) ) as Samples( Buffer );
In this case you could simply pass the value of @SearchString as a parameter and avoid the issue of the IN clause.
Alternatively, the character pairs could be passed as a table-valued parameter and used with IN: where Buffer in ( select CharacterPair from @CharacterPairs ).
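As a rough sketch of that table-parameter idea, assuming SQL Server and hypothetical names (CharacterPairList, @CharacterPairs, dbo.SearchPairs are made up for illustration):
-- A user-defined table type carries the pairs; the procedure only ever sees
-- them as rows, never as statement text.
CREATE TYPE CharacterPairList AS TABLE ( CharacterPair char(2) NOT NULL );
GO
CREATE PROCEDURE dbo.SearchPairs
    @CharacterPairs CharacterPairList READONLY
AS
    SELECT *
    FROM myTable
    WHERE myVariable IN ( SELECT CharacterPair FROM @CharacterPairs );
GO
The application then fills @CharacterPairs through its data-access library's table-valued-parameter support.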
As far as SQL injection goes, limiting the text to character pairs does preclude adding complete statements. It does, as others have noted, allow for corrupting the query and causing it to fail. That, in my mind, constitutes a problem.
I'm still trying to imagine a use-case for this rather odd pattern matching. It won't match a column value longer (or shorter) than two characters against a search string.
There definitely should be a canonical answer to all these innumerable "if I have [some special kind of data treatment], will my query still be vulnerable?" questions.
First of all, you should ask yourself: why are you looking to buy yourself such an indulgence? What is the reason? Why do you want to add an exception to your data processing? Why separate your data into the sheep and the goats, telling yourself "this data is safe, so I won't process it properly, while that data is unsafe, so I'll have to do something about it"?
The only reason such a question could even appear is your application architecture. Or, rather, the lack of one. Because only in spaghetti code, where user input is added directly to the query, can such a question ever occur. Otherwise, your database layer should be able to process any kind of data, being totally ignorant of its nature, origin, or alleged "safety".
Okay, so I have a huge list of entries, and in one of the columns (for simplicity let's call it num) there's a number, something like 123456780000 (they are all the same length and format), but sometimes there are fields that look something like this
12345678E000
or
12345678H000
Now, I need to delete all the rows in which the num column is not entirely numeric. The type of num is TEXT, not INTEGER. So the above examples should be deleted, while 123456780000 should not.
I have tried two solutions, of which one works but is inelegant and messy, and the other one doesn't work at all.
The first thing I tried is
DELETE FROM MY_TABLE WHERE abs(num) == 0.0
Because according to the documentation, abs(X) returns exactly 0.0 if a TEXT value is given that cannot be converted to a real number. So I was thinking it should let all the "numbers-only" rows pass and delete the ones with a character in them. But it doesn't do a thing; it doesn't delete even a single row.
The next thing I tried is
DELETE FROM MY_TABLE WHERE num LIKE "%A%" OR "%B%" OR "%C%"
Which seems to work, but the database is large and I am not sure which characters can appear, and while I could just do "%D%" OR "%E%" OR "%F%" OR ... with the entire alphabet, this feels inelegant and messy. And I actually want to learn something about the SQLite language.
My question, finally, is: how do I solve this problem in a nice and simple way? Perhaps there's something I'm doing wrong with the abs(X) solution, or is there another way that I do not know of/thought of?
Thanks in advance.
EDIT:
According to a comment I tried SELECT abs(num) FROM MY_TABLE WHERE num like '%A%'
and it returned the following
12345678.0
That's strange. It seems it has converted only the part of the number before the alphabetic character. The documentation claimed it would return 0.0 if it couldn't convert it to a number. Hmm...
You can use GLOB in SQLite with a range to single them out:
SELECT *
FROM MY_TABLE
WHERE num GLOB '*[A-Za-z]*'
See it in use with this fiddle: http://sqlfiddle.com/#!7/4bc21/10
For example, for these records:
num
----------
1234567890
0987654321
1000000000
1000A00000
1000B00000
1000c00000
GLOB '*[A-Za-z]*' will return these three:
num
----------
1000A00000
1000B00000
1000c00000
You can then translate that to the appropriate DELETE:
DELETE
FROM MY_TABLE
WHERE num GLOB '*[A-Za-z]*'
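Since the question notes it's not certain which characters can appear, a hedged variant of the same idea is to match any character that isn't a digit rather than only letters; SQLite's GLOB supports negated character classes:
DELETE
FROM MY_TABLE
WHERE num GLOB '*[^0-9]*'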
My code is as follows:
REPLACE(REPLACE(cc.contype,'x','y'),'y','z') as ContractType,
This REPLACE does what I intend, but it unfortunately also changes all of the newly created "y"s to "z"s, when what I would like is
x > y
y > z
Does this make sense? I would not like all of the new Y's to then change again in my second REPLACE function. In Microsoft Access, I would do this with the following
Iif(cc.contype = x, y, iif(cc.contype = y, x))
But I am not sure how to express this in SQL; would it be best to do this kind of thing in the client-side language?
Many thanks.
EDIT: Have also tried with no luck:
CASE WHEN SUBSTRING(cc.contype, 1, 1) = 'C'
THEN REPLACE(cc.contype, 'C', 'Signed')
CASE WHEN SUBSTRING(cc.contype, 1, 1) = 'E'
THEN REPLACE(cc.contype, 'E', 'Estimate') as ContractType,
Try doing it the other way round if you don't want the new "y"'s to become "z"'s:
REPLACE(REPLACE(cc.contype,'y','z'),'x','y') as ContractType
Not that I'm a big fan of the performance killing process of handling sub-columns, but it appears to me you can do that just by reversing the order:
replace(replace(cc.contype,'y','z'),'x','y') as ContractType,
This will transmute all the y characters to z before transmuting the x characters to y.
If you're after a more general solution, you can do unioned queries like:
select 'Signed: ' || cc.contype as ContractType
from wherever cc
where cc.contype like 'C%'
union all
select 'Estimate: ' || cc.contype as ContractType
from wherever cc
where cc.contype like 'E%'
without having to mess about with substrings at all (at the slight cost of prefixing the string rather than modifying it, and adding any other required conditions as well, of course). This will usually be much more efficient than per-row functions.
Some DBMSs will actually run these sub-queries in parallel for efficiency.
Of course, the ideal solution is to change your schema so that you don't have to handle sub-columns. Separate the contype column into two, storing the first character into contype_first and contype_rest.
Then whenever you want the full contype:
select contype_first || contype_rest ...
For your present query, you could then use a lookup table:
lookup_table:
first char(1) primary key
description varchar(20)
containing:
first description
----- -----------
C Signed:
E Estimate:
and the query:
select lkp.description || cc.contype_rest
from lookup_table lkp, real_table cc
where lkp.first = cc.first ...
Both these queries are likely to be blazingly fast compared to one that does repeated string substitutions on each row.
Even if you can't replace the single column with two independent columns, you can at least create the two new ones and use an insert/update trigger to keep them in sync. This gives you the old way and a new improved way for accessing the contype information.
And while this technically violates 3NF, that's often acceptable for performance reasons, provided you understand and mitigate the risks (with the triggers).
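As a rough illustration of the trigger idea, here is a minimal sketch assuming SQLite-style trigger syntax and the hypothetical names used above (real_table, contype, contype_first, contype_rest); other DBMSs spell triggers differently:
-- Keep the split columns in step whenever contype is updated; a matching
-- AFTER INSERT trigger would cover new rows the same way.
CREATE TRIGGER real_table_contype_sync
AFTER UPDATE OF contype ON real_table
FOR EACH ROW
BEGIN
    UPDATE real_table
    SET contype_first = substr(NEW.contype, 1, 1),
        contype_rest  = substr(NEW.contype, 2)
    WHERE rowid = NEW.rowid;
END;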
How about
REPLACE(REPLACE(REPLACE(cc.contype,'x','ahhhgh'),'y','z'),'ahhhgh','y') as ContractType,
ahhhgh can be replaced with whatever placeholder you like, as long as it can never occur in the actual data.