Performance of nested select - sql

I know this is a common question and I have read several other posts and papers but I could not find one that takes into account indexed fields and the volume of records that both queries could return.
My question is simple, really: which of the two queries below (written in an SQL-like syntax) is recommended in terms of performance?
First query:
Select *
from someTable s
where s.someTable_id in
(Select someTable_id
from otherTable o
where o.indexedField = 123)
Second query:
Select *
from someTable s
where someTable_id in
(Select someTable_id
from otherTable o
where o.someIndexedField = s.someIndexedField
and o.anotherIndexedField = 123)
My understanding is that the second query will be evaluated once for every tuple the outer query returns, whereas the first query will evaluate the inner select once and then apply the filter to the outer query.
Now, the second query may run superfast given that someIndexedField is indexed, but say we have thousands or millions of records: wouldn't the first query be faster?
Note: In an Oracle database.

In MySQL, if nested selects are over the same table, the execution time of the query can be terrible.
A good way to improve performance in MySQL is to create a temporary table for the nested select and run the main select against that table.
For example:
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from someTable s2
where s2.Field = 123);
Can have better performance with:
create temporary table temp_table as (
Select someTable_id
from someTable s2
where s2.Field = 123
);
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from temp_table s2);
I'm not sure about performance for a large amount of data.

About the first query:
first query will evaluate the inner select first and then apply the
filter to the outer query.
That's not so simple.
In SQL it is mostly NOT possible to tell what will be executed first and what later, because SQL is a declarative language.
Your "nested selects" are nested only visually, not technically.
Example 1: "someTable" has 10 rows, "otherTable" has 10,000 rows.
In most cases the database optimizer will read "someTable" first and then check "otherTable" for matches. It may or may not use indexes for that, depending on the situation; my feeling is that in this case it will use the "indexedField" index.
Example 2: "someTable" has 10,000 rows, "otherTable" has 10 rows.
In most cases the database optimizer will read all rows from "otherTable" into memory, filter them by 123, and then find matches through someTable's PK (someTable_id) index. As a result, no indexes on "otherTable" will be used.
About the second query:
It is completely different from the first, so I don't know how to compare them:
The first query links the two tables by one pair: s.someTable_id = o.someTable_id.
The second query links the two tables by two pairs: s.someTable_id = o.someTable_id AND o.someIndexedField = s.someIndexedField.
The common practice for linking two tables is your first query.
But o.someTable_id should be indexed.
So the common rules are:
all PKs should be indexed (they are indexed by default)
all columns used for filtering (e.g. in the WHERE clause) should be indexed
all columns used to match rows between tables (including IN, JOIN, etc.) are also filters, so they should be indexed too
The DB engine chooses the best order of operations (possibly in parallel) by itself; in most cases you cannot determine it.
Use Oracle EXPLAIN PLAN (similar exists for most DBs) to compare execution plans of different queries on real data.
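For example, a minimal sketch of that comparison in Oracle (the index name is an assumption; table and column names are taken from the question above):

```sql
-- Index per the rules above: the column used to match rows between tables
CREATE INDEX ix_othertable_sometable_id ON otherTable (someTable_id);

-- Generate the plan for the first query, then display it
EXPLAIN PLAN FOR
SELECT *
FROM someTable s
WHERE s.someTable_id IN
      (SELECT someTable_id
       FROM otherTable o
       WHERE o.indexedField = 123);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Repeating the same steps for the second query lets you compare the two plans side by side on your real data.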

When I used the subquery directly:
where not exists (select VAL_ID from #newVals where VAL_ID = OLDPAR.VAL_ID)
it cost 20 seconds. When I added the indexed temp table it cost 0 seconds. I don't understand why; as a C++ developer, I imagine that internally there is a loop over the values.
-- Temp table for IDX give me big speedup
declare @newValID table (VAL_ID int INDEX IX1 CLUSTERED);
insert into @newValID select VAL_ID from #newVals;
insert into #deleteValues
select OLDPAR.VAL_ID
from #oldVal AS OLDPAR
where
not exists (select VAL_ID from @newValID where VAL_ID = OLDPAR.VAL_ID)
or exists (select VAL_ID from #VaIdInternals where VAL_ID = OLDPAR.VAL_ID);


Which is better to use in SQL Server, the IN operator or a JOIN, when table two provides the list of values?

I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
tl;dr: once the queries are fixed so that they yield the same results, the performance is the same.
The two queries are not the same and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query with table_one.*, or better yet, specifying only the columns you want back from the query (which is best practice).
However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it matches multiple records in table_two, while the JOIN query will simply duplicate the records as many times as the criteria in the ON clause are met.
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
So, in the performance front:
The IN operator has a history of poor performance with a large values list: in earlier versions of SQL Server, if you used the IN operator with, say, 10,000 or more values, it would suffer from a performance issue.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research on the only SQL Server version available to me, which is 2016.
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I ran four different ways of getting all the values of table_one that match values of table_two. The first two are from the original question (with minor changes), the third is a simplified version of the JOIN query, and the fourth uses the EXISTS operator with a correlated subquery instead of the IN operator:
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan, so it's safe to say that, under these circumstances, performance is exactly the same for all of them.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).
From a performance point of view, using EXISTS is often a better option than the IN operator or a JOIN between the tables:
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided there are indexes on the column_one columns used in the join condition, a JOIN would be better than an IN operator, since you will be able to benefit from the indexes:
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
The second (JOIN) query cannot be optimal compared to the first query unless you put the WHERE clause inside the subquery, as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_two = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, the better decision can be based on the execution plan, with the following points taken into consideration:
How many tasks the query has to perform to get the result
The type and execution time of each task
The variance between the estimated and actual number of rows in each task (if the variance is too high, it can be fixed by running UPDATE STATISTICS on the table)
In general, the logical processing order of the SELECT statement is as follows. If you manage your query to read fewer rows/pages at a higher level (earlier in the following order), the query has less logical I/O cost and ends up better optimized; i.e., it is better to filter rows in the FROM or WHERE clause than in the GROUP BY or HAVING clause.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
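As a sketch of that principle (the table and column names here are invented for illustration): filtering in WHERE discards rows before grouping, while the same filter in HAVING is applied only after every row has already been grouped.

```sql
-- Less optimal: all rows are grouped first, then groups are discarded
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
HAVING region = 'EU';

-- Better: rows outside 'EU' never reach the GROUP BY at all
SELECT region, SUM(amount) AS total
FROM sales
WHERE region = 'EU'
GROUP BY region;
```

Both queries return the same result here, but the second one reads and aggregates fewer rows.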

Query equivalence with DISTINCT

Let us have a simple table orders(id: int, category: int, order_date: int) created using the following script:
IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL DROP TABLE dbo.orders
SELECT TOP 1000000
NEWID() id,
ABS(CHECKSUM(NEWID())) % 100 category,
ABS(CHECKSUM(NEWID())) % 10000 order_date
INTO orders
FROM sys.sysobjects
CROSS JOIN sys.all_columns
Now, I have two equivalent queries (at least I believe that they are equivalent):
-- Q1
select distinct o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from orders o1
-- Q2
select o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from (select distinct category from orders) o1
However, when I run those queries they have significantly different characteristics. Q2 is twice as fast for my data, and this is clearly caused by the fact that its query plan first finds the unique categories (a hash match in the query plans) before the join.
The difference is still there if I add the requested index:
CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)
Moreover, Q2 can also make efficient use of the following index, whereas Q1's plan remains the same:
CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)
My questions are:
Are these queries equivalent?
If yes, what is the obstacle for the SQL Server 2016 query optimizer to find the second query plan in the case of Q1 (I believe that the search space must be quite small in this case)?
If no, could you post a counter example?
EDIT
My motivation for the question is that I would like to understand why query optimizers are so poor at rewriting even simple queries and rely on SQL syntax so heavily. SQL is a declarative language, so why are SQL query processors driven by syntax so often, even for simple queries like this one?
The queries are functionally equivalent, meaning that they should return the same data.
However, they are interpreted differently by the SQL engine. The first (SELECT DISTINCT) generates all the results and then removes the duplicates.
The second extracts the distinct values first, so the subquery is only called on the appropriate subset.
An index might make either query more efficient, but it won't fundamentally affect whether the distinct processing occurs before or after the subquery.
In this case, the results are the same. However, that is not necessarily true depending on the subquery.
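As a sketch of such a case (T-SQL, using the orders table from the question): if the scalar subquery is nondeterministic, the two forms can return different numbers of rows, because the Q1 form applies DISTINCT to the category together with the subquery value, while the Q2 form deduplicates the category alone.

```sql
-- Q1 style: DISTINCT sees (category, NEWID()); every value is unique,
-- so nothing is deduplicated and one row per orders row comes back
select distinct o1.category, (select NEWID()) as val
from orders o1;

-- Q2 style: one row per category, each with a single NEWID()
select o1.category, (select NEWID()) as val
from (select distinct category from orders) o1;
```

With a deterministic subquery such as the COUNT(*) in the question, this difference disappears and the two forms return the same data.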

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other DBs a join is often faster.
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to be run in, is to take the small_id column out of the large_table, possibly into a temp table, and then do a left join / EXISTS query.
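A sketch of that idea, assuming Firebird 2.1+ global temporary tables (the temporary table name is an invention):

```sql
-- One-time DDL: a session-scoped scratch table for the extracted ids
CREATE GLOBAL TEMPORARY TABLE tmp_small_ids (small_id INT)
  ON COMMIT DELETE ROWS;

-- Per run: copy only the referenced ids out of the large table,
-- then anti-join the small table against this much smaller set
INSERT INTO tmp_small_ids
  SELECT DISTINCT small_id FROM large_table;

SELECT st.*
FROM small_table st
LEFT JOIN tmp_small_ids t ON t.small_id = st.id
WHERE t.small_id IS NULL;
```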
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st left outer join
(select distinct small_id
from large_table
) lt
on lt.small_id = st.id
where lt.small_id is null
In this case, performance would be better with a full scan of the large table followed by index lookups in the small table, which is the opposite of what the plan is currently doing. With the DISTINCT, the engine could do just an index scan on the large table and then use the primary key index on the small table.

Sub-query Optimization Talk with an example case

I need advice and want to share my experience with query optimization. This week I found myself stuck in an interesting dilemma.
I'm a novice in MySQL (2 years of theory, less than one of practice).
Environment :
I have a table that contains articles with a column 'type', another table article_version that contains the date when an article was added to the DB, and a third table that contains all the article types along with their labels and so on.
The first 2 tables are huge (800,000+ rows and growing daily); the 3rd one is naturally small. The article table has a lot of columns, but to simplify things we will only need 'ID' and 'type' in articles and 'dateAdded' in article_version.
What I want to do :
A query that, for a specified 'dateAdded', returns the number of articles of each type (there are ~50 types to scan).
What was already in place was 50 separate counts, one for each document type (not efficient; ~5 seconds in general).
I wanted to do it all in one query, and I came up with this:
SELECT type,
(SELECT COUNT(DISTINCT articles.ID)
FROM articles
INNER JOIN article_version
ON article_version.ARTI_ID = articles.ID
WHERE type = td.NEW_ID
AND dateAdded = '2009-01-01 00:00:00') AS nbrArti
FROM type_document td
WHERE td.NEW_ID != ''
GROUP BY td.NEW_ID;
The outer select (on type_document) allows me to get the 55 types of documents I need.
The subquery counts the articles of each type_document for the given date '2009-01-01'.
A common result is like :
* type * nbrArti *
*************************
* 123456 * 23 *
* 789456 * 5 *
* 16578 * 98 *
* .... * .... *
* .... * .... *
*************************
This query gets the job done, but the join in the subquery makes it extremely slow. The reason, if I'm right, is that the join is performed by the server once for each type, so 50+ times; this solution is even slower than running the 50 queries independently. Awesome :/
A Solution
I came up with a solution myself that drastically improves performance with the same result: I just created a view corresponding to the subquery, doing the join on ids for each type... and boom, it's fast.
I think, correct me if I'm wrong, that the reason is that the server only runs the JOIN once.
This solution is ~5 times faster than the solution that was already there, and ~20 times faster than my first attempt. Sweet.
Questions / thoughts
With yet another view, I'll now need to check whether I lose more than I gain when documents are inserted...
Is there a way to improve the original query by getting the JOIN out of the subquery (and getting rid of the view)?
Any other tips/thoughts (on server optimization, for example)?
Apologies for my approximate English; it is not my first language.
You cannot create a single index on (type, date_added), because these fields are in different tables.
Without the view, the subquery most probably picks articles as the leading table and uses the index on type, which is not very selective.
By creating the view, you force the subquery to calculate the counts for all types first (using the selective index on date) and then use a join buffer (which is fast enough for only 55 types).
You can achieve similar results by rewriting your query like this:
SELECT new_id, COALESCE(cnt, 0) AS cnt
FROM type_document td
LEFT JOIN
(
SELECT type, COUNT(DISTINCT article_id) AS cnt
FROM article_versions av
JOIN articles a
ON a.id = av.article_id
WHERE av.date = '2009-01-01 00:00:00'
GROUP BY
type
) q
ON q.type = td.new_id
Unfortunately, MySQL is not able to do table spools or hash joins, so to improve the performance you'll need to denormalize your tables: add type to article_version and create a composite index on (date, type).
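A sketch of that denormalization, with assumed column types and the names used in the rewritten query above:

```sql
-- Add the article's type to article_versions (INT is an assumption;
-- match it to the actual type of articles.type)
ALTER TABLE article_versions ADD COLUMN type INT;

-- Backfill it from articles
UPDATE article_versions av
JOIN articles a ON a.id = av.article_id
SET av.type = a.type;

-- Composite index: the date filter and the GROUP BY on type
-- can then both be served from the index
CREATE INDEX ix_av_date_type ON article_versions (date, type);
```

New rows would also need to carry the type on insert, e.g. via the application or a trigger.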

Why is inserting into and joining #temp tables faster?

I have a query that looks like
SELECT
P.Column1,
P.Column2,
P.Column3,
...
(
SELECT
A.ColumnX,
A.ColumnY,
...
FROM
dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
WHERE
A.Key = P.Key
FOR XML AUTO, TYPE
),
(
SELECT
B.ColumnX,
B.ColumnY,
...
FROM
dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
WHERE
B.Key = P.Key
FOR XML AUTO, TYPE
)
FROM
(
<joined tables here>
) AS P
FOR XML AUTO,ROOT('ROOT')
P has ~ 5000 rows
A and B ~ 4000 rows each
This query has a runtime performance of ~10+ minutes.
Changing it to this however:
SELECT
P.Column1,
P.Column2,
P.Column3,
...
INTO #P
SELECT
A.ColumnX,
A.ColumnY,
...
INTO #A
FROM
dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
SELECT
B.ColumnX,
B.ColumnY,
...
INTO #B
FROM
dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
SELECT
P.Column1,
P.Column2,
P.Column3,
...
(
SELECT
A.ColumnX,
A.ColumnY,
...
FROM
#A AS A
WHERE
A.Key = P.Key
FOR XML AUTO, TYPE
),
(
SELECT
B.ColumnX,
B.ColumnY,
...
FROM
#B AS B
WHERE
B.Key = P.Key
FOR XML AUTO, TYPE
)
FROM #P AS P
FOR XML AUTO,ROOT('ROOT')
Has a performance of ~4 seconds.
This doesn't make a lot of sense, as it would seem the cost of inserting into a temp table and then doing the join should be higher by default. My inclination is that SQL Server is doing the wrong type of "join" with the subquery, but unless I've missed it, there's no way to specify the join type to use with correlated subqueries.
Is there a way to achieve this without using #temp tables / @table variables, via indexes and/or hints?
EDIT: Note that dbo.TableReturningFunc1 and dbo.TableReturningFunc2 are inline TVFs, not multi-statement; i.e., they behave like parameterized views.
Your functions are being re-evaluated for each row in P.
What you do with the temp tables is in fact caching the result sets generated by the functions, thus removing the need to re-evaluate them.
Inserting into a temp table is fast because it does not generate redo/rollback.
Joins are also fast, since having a stable result set allows the possibility of creating a temporary index with an Eager Spool or a Worktable.
You can reuse the procedures without temp tables, using CTE's, but for this to be efficient, SQL Server needs to materialize the results of CTE.
You may try to force it to do this by using an ORDER BY inside a subquery:
WITH f1 AS
(
SELECT TOP 1000000000
A.ColumnX,
A.ColumnY
FROM dbo.TableReturningFunc1(@StaticParam1, @StaticParam2) AS A
ORDER BY
A.key
),
f2 AS
(
SELECT TOP 1000000000
B.ColumnX,
B.ColumnY
FROM dbo.TableReturningFunc2(@StaticParam1, @StaticParam2) AS B
ORDER BY
B.Key
)
SELECT …
which may result in an Eager Spool generated by the optimizer.
However, this is far from guaranteed.
The guaranteed way is to add OPTION (USE PLAN) to your query and wrap the corresponding CTE in the Spool clause.
See this entry in my blog on how to do that:
Generating XML in subqueries
This is hard to maintain, since you will need to rewrite your plan each time you rewrite the query, but this works well and is quite efficient.
Using the temp tables will be much easier, though.
This answer needs to be read together with Quassnoi's article
http://explainextended.com/2009/05/28/generating-xml-in-subqueries/
With judicious application of CROSS APPLY, you can force the caching or shortcut evaluation of inline TVFs. This query returns instantaneously.
SELECT *
FROM (
SELECT (
SELECT f.num
FOR XML PATH('fo'), ELEMENTS ABSENT
) AS x
FROM [20090528_tvf].t_integer i
cross apply (
select num
from [20090528_tvf].fn_num(9990) f
where f.num = i.num
) f
) q
--WHERE x IS NOT NULL -- covered by using CROSS apply
FOR XML AUTO
You haven't provided real structures so it's hard to construct something meaningful, but the technique should apply as well.
If you change the multi-statement TVF in Quassnoi's article to an inline TVF, the plan becomes even faster (at least one order of magnitude) and the plan magically reduces to something I cannot understand (it's too simple!).
CREATE FUNCTION [20090528_tvf].fn_num(@maxval INT)
RETURNS TABLE
AS RETURN
SELECT num + @maxval num
FROM t_integer
Statistics
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(10 row(s) affected)
Table 't_integer'. Scan count 2, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
It is a problem with your subquery referencing your outer query, meaning the subquery has to be compiled and executed for each row in the outer query.
Rather than using explicit temp tables, you can use a derived table.
To simplify your example:
SELECT P.Column1,
(SELECT [your XML transformation etc] FROM A where A.ID = P.ID) AS A
If P contains 10,000 records then SELECT A.ColumnX FROM A where A.ID = P.ID will be executed 10,000 times.
You can instead use a derived table as thus:
SELECT P.Column1, A2.Column FROM
P LEFT JOIN
(SELECT A.ID, [your XML transformation etc] FROM A) AS A2
ON P.ID = A2.ID
Okay, that's not very illustrative pseudo-code, but the basic idea is the same as with the temp table, except that SQL Server does the whole thing in memory: it first selects all the data in "A2" and constructs a temp table in memory, then joins on it. This saves you from having to select it into a temp table yourself.
Just to give you an example of the principle in another context where it may make more immediate sense. Consider employee and absence information where you want to show the number of days absence recorded for each employee.
Bad: (runs as many queries as there are employees in the DB)
SELECT EmpName,
(SELECT SUM(absdays) FROM Absence where Absence.PerID = Employee.PerID) AS Abstotal
FROM Employee
Good: (Runs only two queries)
SELECT EmpName, AbsSummary.Abstotal
FROM Employee LEFT JOIN
(SELECT PerID, SUM(absdays) As Abstotal
FROM Absence GROUP BY PerID) AS AbsSummary
ON AbsSummary.PerID = Employee.PerID
There are several possible reasons why using intermediate temp tables might speed up a query, but the most likely in your case is that the functions being called (but not listed) are probably multi-statement TVFs, not inline TVFs. Multi-statement TVFs are opaque to the optimization of their calling queries, so the optimizer cannot tell whether there are any opportunities for data reuse or other logical/physical operator reordering optimizations. Thus, all it can do is re-execute the TVFs every time the containing query is supposed to produce another row with the XML columns.
In short, multi-statement TVF's frustrate the optimizer.
The usual solutions, in order of (typical) preference are:
Re-write the offending multi-statement TVF to be an in-line TVF
In-line the function code into the calling query, or
Dump the offending TVF's data into a temp table, which is what you've done...
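As a sketch of option 1 (the function bodies here are invented for illustration; batch GO separators are omitted): the multi-statement form materializes its result into a table variable, while the inline form is a single RETURN SELECT that the optimizer can expand into the calling query.

```sql
-- Multi-statement form: opaque to the optimizer
CREATE FUNCTION dbo.fn_multi (@StaticParam1 INT, @StaticParam2 INT)
RETURNS @result TABLE ([Key] INT, ColumnX INT)
AS
BEGIN
    INSERT INTO @result
    SELECT [Key], ColumnX
    FROM dbo.SomeSource
    WHERE Filter1 = @StaticParam1 AND Filter2 = @StaticParam2;
    RETURN;
END;

-- Inline form: the body is expanded into the calling query,
-- so the optimizer can reorder operators and reuse the data
CREATE FUNCTION dbo.fn_inline (@StaticParam1 INT, @StaticParam2 INT)
RETURNS TABLE
AS RETURN
    SELECT [Key], ColumnX
    FROM dbo.SomeSource
    WHERE Filter1 = @StaticParam1 AND Filter2 = @StaticParam2;
```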
Consider using the WITH common_table_expression construct for what you now have as sub-selects or temporary tables, see http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx .
This makes not a lot of sense, as it would seem the cost to insert into a temp table and then do the join should be higher by default.
With temporary tables, you explicitly instruct SQL Server which intermediate storage to use. If you stash everything in one big query, SQL Server decides for itself. The difference is not really that big; at the end of the day, temporary storage is used whether you specify it as a temp table or not.
In your case, temporary tables work faster, so why not stick with them?
Agreed, the temp table is a good concept, but consider this case: when the row count is large, say 40 million rows, and I want to update multiple columns in a table by joining it to another table, I would always prefer a common table expression, computing the updated column values in the select with CASE statements so that the CTE's result set contains the updated rows. Inserting 40 million records into a temp table with such a select took 21 minutes for me, and creating an index took another 10, so insert plus index creation took 30 minutes, before I could even apply the update by joining the temp table back to the main table. It then took 5 minutes to update 10 million of the 40 million records, so my overall update time for those 10 million records was almost 35 minutes, versus 5 minutes with the common table expression. My choice in that case is the common table expression.
If temp tables turn out to be faster in your particular instance, you should instead use a table variable.
There is a good article here on the differences and performance implications:
http://www.codeproject.com/KB/database/SQP_performance.aspx
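For comparison, a minimal T-SQL sketch of the two forms (the table and column names are assumed from the earlier examples):

```sql
-- Temp table: lives in tempdb, has statistics,
-- and can be indexed after creation
CREATE TABLE #ids (id INT PRIMARY KEY);
INSERT INTO #ids SELECT id FROM table_one;

-- Table variable: batch-scoped, causes fewer recompiles,
-- but carries no statistics, so the optimizer assumes few rows
DECLARE @ids TABLE (id INT PRIMARY KEY);
INSERT INTO @ids SELECT id FROM table_one;
```

The lack of statistics is why table variables tend to suit small row counts, while temp tables usually win for large intermediate result sets.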