SQL SELECT ... IN (SELECT ...) takes too long to process - why?

Let's say TABLE1 has 1 million entries in it.
TABLE2 has 50k entries in it.
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
Result of select:
5
4
That select takes 0.02s to process.
But when I use it within IN, it takes up to 20.20s:
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
If I use it like this instead, it processes in 0.02s:
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
Can anyone explain how things work here?

I think your RDBMS is doing a poor job of executing your query. Other RDBMSs (e.g. SQL Server) can detect that a subquery is not correlated with the outer query; they will internally materialize its result and will not execute the subquery repeatedly. e.g.
select *
, (select count(*) from tbl) -- a smart RDBMS won't execute this repeatedly
from tbl
A good RDBMS will not execute the count repeatedly, since it is an independent query (not correlated to the outer query).
Try all of the options; there are just a few of them anyway.
1st, try EXISTS. Your RDBMS's EXISTS might be faster than its IN. I have also encountered cases where IN is faster than EXISTS, for example: Why the most natural query (i.e. using INNER JOIN instead of LEFT JOIN) is very slow. Quassnoi made the same observation (IN is faster than EXISTS): http://explainextended.com/2009/06/16/in-vs-join-vs-exists/
SELECT count(*)
FROM TABLE1
WHERE
-- stringVal IN
EXISTS(
SELECT * -- please, don't bikeshed ;-)
FROM TABLE2
where
table1.stringVal = table2.stringVal -- simulated IN
and table2.idTable2 = 5);
2nd, try INNER JOIN; use this if there are no duplicates, or use DISTINCT to remove duplicates.
SELECT count(*)
FROM TABLE1
JOIN (
SELECT DISTINCT stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5 ) as x
ON X.stringVal = table1.stringVal
3rd, try to materialize the rows yourself. I encountered the same problem with SQL Server: querying materialized rows is faster than querying the result of another query.
Check the example of materializing the query result into a table and then using IN on that result. I found it faster than using IN directly on another query; you can just read the bottom part of the post: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html
Example:
SELECT distinct stringVal -- remove duplicates
into anotherTable
FROM TABLE2
where idTable2 = 5;
SELECT count(*)
FROM TABLE1 where stringVal in (select stringVal from anotherTable);
The above works on SQL Server and PostgreSQL; on other RDBMSs it might look like this:
create table anotherTable as
SELECT distinct stringVal -- remove duplicates
FROM TABLE2
where table2.idTable2 = 5;
select count(*)
from table1 where stringVal in (select stringVal from anotherTable)

While I love subqueries, and they are immensely powerful, they're also quite slow, since the query may have to be completely evaluated at each iteration, ouch! (depending on the implementation)
This is why they are my/our last resort.
Some SQL implementations are quite good and will cache the subquery, though I'm not quite sure how safe that is. Even then, you still have to traverse the entire structure, and if the structure isn't properly optimized it can take quadratic or even cubic time if you nest enough of them...
SELECT stringVal
FROM TABLE2
WHERE idTable2=5
This is linear time, O(n). It could even be constant, O(1), if the SQL database stores statistical information, but we will assume it doesn't, so it will scan every row and return all those that match the WHERE clause.
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (
SELECT stringVal FROM TABLE2 where idTable2=5);
Assuming the subquery isn't cached, it is being evaluated for each row, and if you have a lot of rows that's a lot of evaluations, many wasted repeated calculations. Even if it is cached, the cached structure may not be optimal for searching, not to mention you are also comparing strings against a list of strings.
SELECT count(*)
FROM TABLE1
WHERE stringVal IN (5,4);
The IN list is still being evaluated, but it's a constant expression, so there's basically no overhead at all; it doesn't need to do any IO or deal with locks or anything :)

Try this
SELECT count(*) FROM TABLE1 where EXISTS
(SELECT 1 FROM TABLE2 where idTable2=5 and stringVal = TABLE1.stringVal );
And you should create indexes on stringVal in both TABLE1 and TABLE2, for example:
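A minimal sketch of what those indexes could look like, assuming standard CREATE INDEX syntax and that stringVal is not already indexed (the index names are hypothetical):
-- Index on the outer table's column being probed
CREATE INDEX ix_table1_stringval ON TABLE1 (stringVal);
-- A composite index lets the EXISTS/subquery be answered from the index alone
CREATE INDEX ix_table2_idtable2_stringval ON TABLE2 (idTable2, stringVal);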

Here is a simple join that will give you the same kind of result you were looking for. This can be applied in many different situations, and it avoids having to run a separate query against another table.
SELECT COUNT(*)
FROM TABLE1 INNER JOIN TABLE2 ON TABLE1.'COLUMN' = TABLE2.'COLUMN' AND TABLE2.IDTABLE2=5
WHERE 'WHATEVER YOU WANT'
Replace 'COLUMN' with a column that exists in both tables, normally an ID or primary key.
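Applied to the tables from the original question, the pattern might look like this (a sketch, assuming stringVal is the shared column):
SELECT COUNT(*)
FROM TABLE1
INNER JOIN TABLE2
    ON TABLE1.stringVal = TABLE2.stringVal
   AND TABLE2.idTable2 = 5;
Note that, as discussed in another answer, this join form can count a TABLE1 row more than once if TABLE2 contains duplicate stringVal values; add DISTINCT in a derived table if that matters.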

Related

Which is better to use in SQL Server, the IN operator or a JOIN, when the list of values comes from a second table?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
tl;dr: once the queries are fixed so that they yield the same results, the performance is the same.
The two queries are not the same, and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query with table_one.*, or better yet, by specifying only the columns you want to get back from the query (which is best practice).
However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it matches multiple records in table_two, while the JOIN query will simply duplicate the record as many times as the criterion in the ON clause is met.
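A tiny hypothetical illustration: suppose table_one has one row with column_one = 1 and table_two has two rows with column_one = 1.
-- IN: the table_one row is returned once
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two);   -- 1 row

-- JOIN: the table_one row is returned once per match in table_two
SELECT table_one.*
FROM table_one
JOIN table_two ON table_one.column_one = table_two.column_one;  -- 2 rows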
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
So, in the performance front:
The IN operator has a history of poor performance with a large values list - in earlier versions of SQL Server, if you used the IN operator with, say, 10,000 or more values, it would have suffered from a performance issue.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research on the only version of SQL Server available to me, which is 2016.
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I ran four different queries that get all the values of table_one matching values of table_two. The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth is a query that uses the EXISTS operator with a correlated subquery instead of the IN operator:
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan, so it's safe to say that performance, under these circumstances, is exactly the same.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).
From a performance point of view, using EXISTS is often a better option than using the IN operator or a JOIN between the tables:
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided there are indexes on the column column_one used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes:
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
The second (JOIN) query cannot be optimal compared to the first query unless you put a WHERE clause within the subquery, as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_two = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, the better decision can be made based on the execution plan, taking the following points into consideration:
How many tasks the query has to perform to get the result
What the task type and execution time of each task are
The variance between the estimated number of rows and the actual number of rows in each task - if the variance is too high, it can be fixed by running UPDATE STATISTICS on the table, as sketched below.
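For example, on SQL Server that statistics refresh would look roughly like this (the table name is just taken from the test script above):
-- Refresh the optimizer's statistics so estimated vs. actual row counts line up
UPDATE STATISTICS dbo.Table_One;
-- Or, for more accurate histograms on a badly skewed table (slower to build):
UPDATE STATISTICS dbo.Table_One WITH FULLSCAN;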
In general, the logical processing order of the SELECT statement is shown below. If you arrange your query so that it reads the smallest number of rows/pages as early as possible in this order, the query incurs less logical I/O cost and ends up better optimized. In other words, it's better to filter rows in the FROM or WHERE clause than to filter them in the GROUP BY or HAVING clause (see the sketch after the list).
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
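A rough illustration of that point (table and column names are hypothetical): both queries below return the same counts, but the first filters rows in WHERE before grouping, so far fewer rows ever reach the GROUP BY.
-- Filter early: non-EU rows never reach the GROUP BY
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE region = 'EU'
GROUP BY customer_id;

-- Filter late: every row is grouped first, then most groups are discarded
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id, region
HAVING region = 'EU';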

Reusing a select subquery/result

I am trying to optimize the speed of a query which uses a redundant query block. I am trying to do a row-wise join in SQL Server 2008 using the query below.
Select * from
(<complex subquery>) cq
join table1 t1 on (cq.id=t1.id)
union
Select * from
<complex subquery> cq
join table2 t2 on (cq.id=t2.id)
The <complex subquery> is exactly the same in both UNION pieces, except that we need to join it with different tables to obtain the same columnar data.
Is there any way I can rewrite the query to make it faster without using a temporary table to cache the results?
Why not use a temporary table and see if that improves the execution stats?
In some circumstances the Query Optimizer will automatically add a spool to the plan that caches subqueries; this is basically just a temporary table, though.
Have you checked your current plan to be sure that the query is actually being evaluated more than once?
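If you do try the temporary-table route, a minimal sketch (SQL Server temp-table syntax; the #cq name and the assumption that the subquery exposes an id column are mine, and the <complex subquery> placeholder stays as in the question):
-- Materialize the complex subquery once
SELECT *
INTO #cq                          -- hypothetical temp table
FROM ( /* <complex subquery> */ ) AS cq;

SELECT * FROM #cq cq JOIN table1 t1 ON cq.id = t1.id
UNION
SELECT * FROM #cq cq JOIN table2 t2 ON cq.id = t2.id;

DROP TABLE #cq;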
Without a concrete example it's difficult to help, but try a WITH statement (a common table expression), something like:
WITH csq AS (
<complex subquery>
)
Select * from
csq
join table1 t1 on (csq.id=t1.id)
union
Select * from
csq
join table2 t2 on (csq.id=t2.id)
It sometimes speeds things up no end.
Does the nature of your query allow you to invert it? Instead of "(join, join) union", do "(union) join"? Like:
Select * from
(<complex subquery>) cq
join (
Select * from table1 t1
union
Select * from table2 t2
) ts
on cq.id=ts.id
I'm not really sure if double evaluation of your complex query is actually what's wrong. But, as per your question, this would be a form of the query that would encourage SQL to only evaluate <complex query> once.

Why is the equality (=) operator faster than the inequality (<>) operator in Oracle?

Like the title says, if anyone has the answer I would like to know. I've been googling but couldn't find a straight answer.
Example:
This works
SELECT COUNT(*) FROM Table1 TB1, Table2 TB2
WHERE TB1.Field1 = TB2.Table2
This seems to take hours
SELECT COUNT(*) FROM Table1 TB1, Table2 TB2
WHERE TB1.Field1 <> TB2.Table2
Because they are different SQL statements. In the first one, you are joining the two tables using the Field1 and Table2 columns, probably returning only a few records.
In the second one, your query is probably returning a lot of records, since you are effectively doing a cross join, and a lot of row pairs will satisfy your Field1 <> Table2 condition.
A very simplified example
Table1
Field1
------
1
2
5
9
Table2
Table2
------
3
4
5
6
9
Query1 will return 2 since only 5 and 9 are common.
Query2 will return 18, since most of the 4 x 5 = 20 row pairs from the cross join satisfy the inequality (all except the two matching pairs).
If you have tables with a lot of records, it will take a while to process the second query.
It's important to realize that SQL is a declarative language and not an imperative one. You describe what conditions you want your data to fit and not how those comparisons should be executed. It's the job of the database to find the fastest way to give you an answer (a task taken over by the query optimizer). This means that a seemingly small change in your query can result in a wildly different query plan, which in turn results in a wildly different runtime behaviour.
The = comparison can be converted to and optimized the same way as a simple join on the two fields. This means that normal indices can be used to execute the query very fast, probably without reading the actual data and using only the indices instead.
A <> comparison on the other hand requires a full cartesian product to be calculated and checked for the condition, usually (there might be a way to optimize this with the correct index, but usually an index won't help here). It will also usually return a lot more results, which adds to the execution time.
Probably, the second query processes way more rows than the first one.
(Thinking back to a similar question)
Are you trying to count the rows in Table1 for which there is no matching record in Table2?
If so you could use this
SELECT COUNT(*) FROM Table1 TB1
WHERE NOT EXISTS
(SELECT * FROM Table2 TB2
WHERE TB1.Field1 = TB2.Field2 )
or this for example
SELECT COUNT(*)
FROM
(
SELECT Field1 FROM Table1
MINUS
SELECT Field2 FROM Table2
) T

How does the IN predicate work in SQL?

After preparing an answer for this question I found I couldn't verify my answer.
In my first programming job I was told that a query within the IN () predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.
For example, given the query:
SELECT count(*) FROM Table1 WHERE Table1Id NOT IN (
SELECT Table1Id FROM Table2 WHERE id_user = 1)
Table1 Rows | # of "IN" executions
----------------------------------
10 | 10
100 | 100
1000 | 1000
10000 | 10000
Is this correct? How does the IN predicate actually work?
The warning you got about subqueries executing for each row is true -- for correlated subqueries.
SELECT COUNT(*) FROM Table1 a
WHERE a.Table1id NOT IN (
SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user
);
Note that the subquery references the id_user column of the outer query. The value of id_user on each row of Table1 may be different. So the subquery's result will likely be different, depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.
The example you tested is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the subquery's result doesn't depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and uses it repeatedly for the predicate in the outer query.
PS: In SQL, IN() is called a "predicate," not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed independently as a statement. That is, you can't just run this as an SQL query: "2 IN (1,2,3);" Although this is a valid predicate, it's not a valid statement.
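To evaluate a predicate on its own, you have to wrap it in a statement, for instance (a sketch; exact syntax varies slightly between RDBMSs):
-- Turn the predicate into an expression inside a SELECT statement
SELECT CASE WHEN 2 IN (1, 2, 3) THEN 'true' ELSE 'false' END;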
It will entirely depend on the database you're using, and the exact query.
Query optimisers are very smart at times - in your sample query, I'd expect the better databases to be able to use the same sort of techniques that they do with a join. More naive databases may just execute the same query many times.
This depends on the RDBMS in question.
See detailed analysis here:
MySQL, part 1
MySQL, part 2
SQL Server
Oracle
PostgreSQL
In short:
MySQL will optimize the query to this:
SELECT COUNT(*)
FROM Table1 t1
WHERE NOT EXISTS
(
SELECT 1
FROM Table2 t2
WHERE t2.id_user = 1
AND t2.Table1ID = t1.Table1ID
)
and run the inner subquery in a loop, using the index lookup each time.
SQL Server will use MERGE ANTI JOIN.
The inner subquery will not be "executed" in the common sense of the word; instead, the results from both query and subquery will be fetched concurrently.
See the link above for detailed explanation.
Oracle will use HASH ANTI JOIN.
The inner subquery will be executed once, and a hash table will be built from the resultset.
The values from the outer query will be looked up in the hash table.
PostgreSQL will use NOT (HASHED SUBPLAN).
Much like Oracle.
Note that rewriting the query as this:
SELECT (
SELECT COUNT(*)
FROM Table1
) -
(
SELECT COUNT(*)
FROM Table2 t2
WHERE (t2.id_user, t2.Table1ID) IN
(
SELECT 1, Table1ID
FROM Table1
)
)
will greatly improve the performance in all four systems.
It depends on the optimizer. Check the exact query plan for each particular query to see how the RDBMS will actually execute it.
In Oracle that'd be:
EXPLAIN PLAN FOR «your query»
In MySQL or PostgreSQL
EXPLAIN «your query»
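For example, in Oracle the plan is written to the plan table and then read back, typically with DBMS_XPLAN (this sketch assumes the default PLAN_TABLE is available and reuses the query from the question):
EXPLAIN PLAN FOR
SELECT count(*)
FROM Table1
WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2 WHERE id_user = 1);

-- Display the plan that was just generated
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);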
Most SQL engines nowadays will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS.
I would say look at your execution plan and find out :-)
Also, if you have NULL values in the Table1Id column, you will not get any data back.
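A quick illustration of that NULL gotcha, reusing the query from the question (here the NULL sits in Table2.Table1Id):
-- If the subquery returns even one NULL, "x NOT IN (...)" can never evaluate to TRUE,
-- because x <> NULL is UNKNOWN, so the WHERE clause matches no rows
SELECT count(*)
FROM Table1
WHERE Table1Id NOT IN (SELECT Table1Id FROM Table2 WHERE id_user = 1);
-- COUNT(*) comes back as 0 as soon as any Table2.Table1Id for id_user = 1 is NULL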
Not really. But it's better to write such queries using JOIN.
Yes, but execution stops as soon as the query processor "finds" the value you are looking for... So if, for example, the first row in the outer select has Table1Id = 32, and Table2 has a record with Table1Id = 32, then
as soon as the subquery finds the row in Table2 where Table1Id = 32, it stops...

SQL: Optimization problem, has rows?

I have a query with five joins on some rather large tables (the largest table has 10 mil. records), and I want to know if rows exist. So far I've done this to check if rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I have indexes on the fields that I'm joining on and on the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write it like IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 ...) and do your stuff.
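A rough sketch of that suggestion (the table names, join condition and @value parameter are placeholders, not the asker's actual schema):
IF EXISTS (SELECT 1
           FROM table1 tbl
           INNER JOIN table2 tbl2 ON tbl2.table1_id = tbl.Id   -- repeat for the remaining joins
           WHERE tbl.xxx = @value)
    PRINT 'rows exist';
ELSE
    PRINT 'no matching rows';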
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
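In SQL Server, for instance, you can also get per-query I/O and timing figures alongside the graphical plan (a sketch; run these in the same session before your query):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the slow query here; the Messages tab then shows logical reads per table
-- and CPU/elapsed time, which points to the expensive join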
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes, or try the Database Tuning Advisor.
Also, it's possible that you don't need 5 joins.
Assuming a parent-child-grandchild hierarchy, grandchild rows can't exist without the parent rows (assuming you have foreign keys).
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either the 5 tables or the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance, all the joins will operate on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
Beyond that, you would likely need to provide more of the query for us to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
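If you do decide to tune a clustered index specifically for this query, a sketch of what it could look like (the index name and column are hypothetical, and this assumes the table doesn't already have a clustered primary key):
-- A table can have only one clustered index; it defines the physical row order
CREATE CLUSTERED INDEX cix_grandchildtable_xxx
    ON grandchildtable (xxx);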
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the table with the most rows first in every join. If more than one condition is used in the WHERE clause, the order of the conditions matters: put the most selective condition, the one that filters out the most rows, first.
Use filters very carefully when optimizing a query.