Query applying Where clause BEFORE Joins? - sql

I'm honestly really confused here, so I'll try to keep it simple.
We have Table A:
id
Table B:
id || number
Table A is a "prefilter" to B, since B contains a lot of different objects, including A.
So my query, trying to get all A's with a filter;
SELECT * FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(SUBSTRING(b.number, 2, 30) AS integer) between 151843 and 151865
Since ALL instances of A starts with a letter ("X******"), I just want to truncate the first letter to let the filter do his work with the number specified by the user.
At first glance, there should be absolutely no worries. But it seems I was wrong. And on something I didn't expect to be...
It seems like my WHERE clause is executed BEFORE my JOIN. Therefore, since many B's have number with more than one Letter at the start, I have an invalid conversion happening. Despite the fact that it would NEVER happen if we stay in A's.
I always thought that where clause was executed after joins, but in this case, it seems postgres wants to prove me wrong.
Any explanations ?
SQLFiddle demonstrating problem: http://sqlfiddle.com/#!15/cd7e6e/7
And even with the SubQuery, it still makes the same error...

You can use the regex substr function to remove everything but digits: CAST(substring(B.number from '\d') AS integer).
See working example here: http://sqlfiddle.com/#!15/cd7e6e/18

SQL is a declarative language. For a select statement, you declare the criteria the data you are looking for must meet. You don't get to choose the execution path, your query isn't executed procedurally.
Thus the optimizer is free to choose any execution plan it likes, as long as it returns records specified by your criteria.
I suggest you change your query to cast to string instead of to integer. Something like:
WHERE SUBSTRING(b.number, 2, 30) between CAST(151843 AS varchar) and CAST(151865 AS varchar)

Do the records of A that are in B have the same id in table B as in A. If those records are inserted in a different order, this may not be the case and therefore return different records than expected.

Related

How to do damage with SQL by adding to the end of a statement?

Perhaps I am not creative or knowledgeable enough with SQL... but it looks like there is no way to do a DROP TABLE or DELETE FROM within a SELECT without the ability to start a new statement.
Basically, we have a situation where our codebase has some gigantic, "less-than-robust" SQL generation component that never uses prepared statements and we now have an API that interacts with this legacy component.
Right now we can modify a query by appending to the end of it, but have been unable to insert any semicolons. Thus, we can do something like this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
which will result in this
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');...
This is just one example.
Basically we can append pretty much anything to the end of any/most generated SQL queries, less adding a semicolon.
Any ideas on how this could potentially do some damage? Can you add something to the end of a SQL query that deletes from or drops tables? Or create a query so absurd that it takes up all CPU and never completes?
You said that this:
/query?[...]&location_ids=loc1')%20or%20L1.ID%20in%20('loc2
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('loc1') or L1.ID in ('loc2');
so it looks like this:
/query?[...]&location_ids=');DROP%20TABLE users;--
will result in this:
SELECT...WHERE L1.PARENT_ID='1' and L1.ID IN ('');DROP TABLE users;--');
which is a SELECT, a DROP and a comment.
If it’s not possible to inject another statement, you limited to the existing statement and its abilities.
Like in this case, if you are limited to SELECT and you know where the injection happens, have a look at PostgreSQL’s SELECT syntax to see what your options are. Since you’re injecting into the WHERE clause, you can only inject additional conditions or other clauses that are allowed after the WHERE clause.
If the result of the SELECT is returned back to the user, you may want to add your own SELECT with a UNION operation. However, PostgreSQL requires compatible data types for corresponding columns:
The two SELECT statements that represent the direct operands of the UNION must produce the same number of columns, and corresponding columns must be of compatible data types.
So you would need to know the number and data types of the columns of the original SELECT first.
The number of columns can be detected with the ORDER BY clause by specifying the column number like ORDER BY 3, which would order the result by the values of the third column. If the specified column does not exist, the query will fail.
Now after determining the number of columns, you can inject a UNION SELECT with the appropriate number of columns with an null value for each column of your UNION SELECT:
loc1') UNION SELECT null,null,null,null,null --
Now you determine the types of each column by using a different value for each column one by one. If the types of a column are incompatible, you may an error that hints the expected data type like:
ERROR: invalid input syntax for integer
ERROR: UNION types text and integer cannot be matched
After you have determined enough column types (one column may be sufficient when it’s one that is presented the user), you can change your SELECT to select whatever you want.

Can a SQL Case statment be used to test if a Join statement should be used?

What I am trying to do, using java, is:
access a database
read a record from table "Target_stats"
if the field "threat_level" = 0, doAction1
if the field "threat_level" > 0, get additional fields from another table "Attacker_stats" and doAction2
read the next record
Now I have everything I need but a well thought out SQL statement that will allow me to only go through the database only once, if this does not work I suspect I will need to use two separate SQL statements and go through the database a second time. I do not have a clear understanding of case statements, so I will just provide pseudo code using an if statement.
SELECT A.1, A.2, A.3
if(A.3 > 0){
SELECT A.1, A.2, A.3, B.1, B.3
FROM A
JOIN B
ON A.1 = B.1
}
FROM A
Can anyone shed any light on my situation?
EDIT: Thankyou both for your time and effort. I understand both of your comments and I believe that I am headed more towards the right direction however, I'm still having some trouble. I didn't know about SQLfiddle before so I have now gone ahead and made a sample DB and tried to demonstrate my purpose. Here is the link: http://sqlfiddle.com/#!3/ea108/1 What I want to do here is Select target_stats.server_id, target_stats.target, target_stats.threat_level Where interval_id=3 and if the threat_level>0 I want to retrieve attack_stats.attacker, attack_stats.sig_name Where interval_id=3. Again, thankyou for your time and effort it is very useful to me
EDIT: after some tinkering around, I figured it out. thankyou so much for your help
As #Ocelot20 said, SQL is not procedural code. It is based on set-based operations, not per row operations. One immediate consequence of this is that the SELECT in your pseudo-example is wrong as it relies on rows in the same result set having different column lists.
That said, you can get pretty close to your pseudo-code example, if you can tolerate NULL values where the join is not possible.
Here's an example that (to me anyway) seems to be close to what your are driving at:
select *
from A
left outer join B
on A.a = B.d and A.a > 2
You can see it in action in this SQLFiddle, which should show you what sort of output to expect.
Note that what this is actually saying is something like this:
Fetch all the records from table A and also fetch any records from
table B have their d column the same as the a column in table
A, provided the value of A.a is greater than 2.
(This was picked for convenience. In my rather contrived example shifting the conditional column does not effect the output as can be see here).

I have a large query, how do I debug this?

So, I get this error message:
EDT ERROR: syntax error at or near "union" at character 436
The query in question is a large query that consists of 12 smaller queries all connected together with UNION ALL, and each small query has two inner join statements. So, something like:
SELECT table.someid as id
,table.lastname as name
,table2.groupname as groupname
, 'Leads ' as Type
from table
inner join table3 on table3.specificid = table.someid
INNER JOIN table2 on table3.specificid=table2.groupid
where table3.deleted=0
and table.someid > 0
and table2.groupid in ('2','3','4')
LIMIT 5
UNION all
query2....
Note that table2 and table3 are the same tables in each query, and the fields from table2 and table3 are also the same, I think.
Quick question (I am still kinda new to all this):
What does 'Leads ' as Type mean? Unlike the other statements preceding an AS, this one isn't written like table.something.
Quick edit question: What does table2.groupid in ('2','3','4') mean?
I checked each small query one by one, each one works and returns a result, though the results are always empty for some reason(this may or may not be dependent on the user logged in though, as some PHP code generated this query).
As for the results themselves, most of them look something like this (they are arranged horizontally though):
id(integer)
name (character varying(80))
groupname (character varying(100))
type (unknown)
The difference in the results are twofold:
1)Most of the results contain the same field names but quite a few of them have different field lengths. Like some will say character varying (80), while others will say character varying (100), please correct me if this is actually not field length.
2)2 of the queries contain different fields, but only the id field is different, and it's probably because they don't have the "as id" part.
I am not quite sure of what the requirements of UNION ALL are, but if I think, it is meant to only work if all the fields are the same, but if that funky number changes (the one in the brackets), then are the fields considered to be different even if they have the same name?
Also, what's strange is that some of the queries returned the exact same fields, with the same field length, so I tried to UNION ALL only those queries, but no luck, still got a syntax error at UNION.
Another important thing I should mention is that the DB used to be MySQL, but we changed to PostGreSQL, so this bug might be a result of the change (i.e. code that might work in MySQL but not in PostGres).
Thanks for your time.
You can have only one "LIMIT xxx" clause. At the end of the query, and not before the UNION.
The error you get is due to missing parentheses here:
...
LIMIT 5
UNION all
...
The manual:
(ORDER BY and LIMIT can be attached to a subexpression if it is
enclosed in parentheses. Without parentheses, these clauses will be
taken to apply to the result of the UNION, not to its right-hand input
expression.)
Later example:
Sum results of a few queries and then find top 5 in SQL
The only real way I have found to debug big queries is to break it into understandable parts and debug each subexpression independently:
Does each show the expected rows?
Are the resulting fields and types as expected?
For union, do the result fields and types exactly match corresponding other subexpressions?

How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER,
PRIMARY KEY (a, b)
)
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to have get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me a row for every value of a. Therefor, if, for a single value of a, there are many rows to choose from, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? How would this apply to strings of blobs or dates or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLITE3, but I'm guessing this is an SQL issue not dependent of the DBMS...
That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there's a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, the type belonging to the data in that row/column intersect rather than the column specification).
All the SQL engines that I know will complain about the query that you mentioned with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG whatever) or you'll be forced to add them in the GROUP BY list.
You're not quite correct about your assumption that this is RDBMS-independent. Most RDBMS don't allow to select fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because values for b and c are chosen pretty arbitrarily (depending on the applied grouping algorithm). Even if this may be documented in your database, it's always better to express a query in a way that fully and non-ambiguously specifies the outcome
It's not a matter of what the database will choose, but the order your data are going to be returned.
Your primary key is handling your sort order by default since you didn't provide one.
You can use Order By a, c if that's what you want.

In SQL, 'distinct' reduces the number of result rows from one to zero

I have a SQL statement of the following structure:
select distinct ...
from table1,
(select from table2, table3, table4 where ...)
where ...
order by ...
With certain values in the where clauses, the statement returns zero rows in the result set. When I remove the 'distinct' keyword, it returns a single row. I would expect to see a single result row in both cases. Is there some property of the 'distinct' keyword that I am not aware of and that causes this behavior?
The database is Oracle 11g.
What you describe is not the expected behaviour of DISTINCT. This is:
SQL> select * from dual
2 /
D
-
X
1 row selected.
SQL> select distinct * from dual
2 /
D
-
X
1 row selected.
SQL>
So, if what you say is happening really is what is happening then it's a bug. However, you also say it's a rare occurrence which means there is a good chance it is some peculiarity in your data and/or transient conditions in your environment, and not a bug.
You need to create a reproducible test case, for two reasons. Partly, nobody will be able to investigate your problem without one. But mainly because building a test case is an investigation in its own right: attempting to isolate the precise combination of data and/or ambient factors often generates the insight which leads to a solution.
It turned out that one of the sub-selects resulted in a data set that contained, among others, a row where every column was NULL. It seems that this row influenced the evaluation of the DISTINCT in a non-obvious way (at least to me). Maybe this is due to some under-the-hood SQL optimizations. After I removed the cause of this NULL-filled row, the problem is gone and the statement evaluates to one row in the result as it should.