Why does the query optimizer select completely different query plans? - sql

Let's say we have the following table in SQL Server 2016:
-- generating 1M test table with four attributes
WITH x AS
(
SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
), t1 AS
(
SELECT ones.n + 10 * tens.n + 100 * hundreds.n + 1000 * thousands.n + 10000 * tenthousands.n + 100000 * hundredthousands.n as id
FROM x ones, x tens, x hundreds, x thousands, x tenthousands, x hundredthousands
)
SELECT id,
id % 50 predicate_col,
row_number() over (partition by id % 50 order by id) join_col,
LEFT('Value ' + CAST(CHECKSUM(NEWID()) AS VARCHAR) + ' ' + REPLICATE('*', 1000), 1000) as padding
INTO TestTable
FROM t1
GO
-- setting the `id` as a primary key (therefore, creating a clustered index)
ALTER TABLE TestTable ALTER COLUMN id int not null
GO
ALTER TABLE TestTable ADD CONSTRAINT pk_TestTable_id PRIMARY KEY (id)
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col
ON TestTable (predicate_col, join_col)
GO
OK, and now when I run the following two queries, which have only slightly different predicates (b.predicate_col <= 0 vs. b.predicate_col = 0), I get completely different plans.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
-- Q2
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col = 0
option (maxdop 1)
If I look at the query plans, it is clear that in the case of Q1 the optimizer chooses to join the key lookup with the non-clustered index seek first and then do the final join with the non-clustered index (which is bad). A much better solution appears in the case of Q2: it joins the non-clustered indexes first and then does the final key lookup.
The question is: why is that and can I improve it somehow?
In my intuitive understanding of histograms, it should be easy to estimate the correct result size for both variants of the predicate (b.predicate_col <= 0 vs. b.predicate_col = 0), so why the different query plans?
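For reference, the histogram these estimates are derived from can be inspected directly against the index created above:
DBCC SHOW_STATISTICS ('TestTable', ix_TestTable_predicate_col_join_col)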
EDIT:
Actually, I do not want to change the indexes or the physical structure of the table. I would like to understand why the optimizer picks such a bad query plan in the case of Q1. Therefore, my question is precisely this:
Why does the optimizer pick such a bad query plan in the case of Q1, and can I improve it without altering the physical design?
I have checked the cardinality estimates in the query plans, and both plans have exact row count estimates for every operator! I have checked the memo structure (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8615, QUERYTRACEON 8620)) and the rules applied during compilation (OPTION (QUERYTRACEON 3604, QUERYTRACEON 8619, QUERYTRACEON 8620)), and it seems that the optimizer finishes the plan search as soon as it hits the first plan. Is this the reason for such behaviour?

This is caused by SQL Server's inability to use Index Columns to the Right of the Inequality search.
This code produces the same issue:
SELECT * FROM TestTable WHERE predicate_col <= 0 and join_col = 1
SELECT * FROM TestTable WHERE predicate_col = 0 and join_col <= 1
Inequality predicates such as >= or <= put a limitation on SQL Server: the optimizer can't use the index columns to the right of the inequality, so when you put an inequality on [predicate_col] you render the rest of the index useless. SQL Server can't make full use of the index and produces an alternate (bad) plan. [join_col] is the last column in the index, so in the second query SQL Server can still make full use of the index.
The reason SQL Server opts for the Hash Match is that it can't guarantee the order of the data coming out of table B. The inequality renders [join_col] in the index useless, so SQL Server has to prepare for unsorted data on the join, even though the row count is the same.
The only way to fix your problem (even though you don't like it) is to alter the index so that equality columns come before inequality columns.
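Under the assumption that [join_col] acts as the equality column in Q1 (via the join on a.id), the reordered index would look something like this sketch:
CREATE NONCLUSTERED INDEX ix_TestTable_join_col_predicate_col
ON TestTable (join_col, predicate_col)
GO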

OK, this can be answered from a statistics and histogram point of view, or from an index structure point of view. I will try to answer it from the index structure side.
(Note that both queries return the same result here only because there are no records with predicate_col < 0.)
When there is a range predicate on a column of a composite index, the index columns to the right of it cannot be used for the seek. There can be many other reasons for an index not being fully used as well.
-- Q1
select b.id, b.predicate_col, b.join_col, b.padding
from TestTable b
join TestTable a on b.join_col = a.id
where a.predicate_col = 1 and b.predicate_col <= 0
option (maxdop 1)
If we want a plan like Q2's, then we can create another composite index.
-- creating a non-clustered index
CREATE NONCLUSTERED INDEX ix_TestTable_predicate_col_join_col_1
ON TestTable (join_col,predicate_col)
GO
We get a query plan exactly like Q2's.
Another way is to define a CHECK constraint on predicate_col:
ALTER TABLE TestTable ADD CHECK (predicate_col >= 0)
GO
This also gives the same query plan as Q2: with the constraint in place, the optimizer can infer that predicate_col <= 0 is equivalent to predicate_col = 0.
Though whether you can create a CHECK constraint or another composite index on a real table with real data is another discussion.

Related

Find rows with all columns duplicated and no unique field in PostgreSQL

Say I have a table like this, where no column or combination of columns is guaranteed to be unique:
GAME_EVENT | USERNAME   | ITEM      | QUANTITY
sell       | poringLUVR | sword     | 1
sell       | poringLUVR | sword     | 1
kill       | daenerys   | civilians | 200000
kill       | daenerys   | civilians | 200000
invoke     | sylvanas   | undead    | 1000000
And I want to retrieve the list of all rows that exist more than once (where the combination of ALL their columns appears more than once).
(In this case I would expect to get a list with the "sell/poringLUVR" and "kill/daenerys" rows)
What would be a good way of approaching this? Would a combined index be of any help? Suggestions for non-Postgres approaches are also welcome.
Assuming all columns NOT NULL, this will do:
SELECT *
FROM tbl t1
WHERE EXISTS (
SELECT FROM tbl t2
WHERE (t1.*) = (t2.*)
AND t1.ctid <> t2.ctid
);
ctid is a system column, the "tuple identifier" / "item pointer" that can serve as poor-man's PK in the absence of an actual PK (which you obviously don't have), and only within the scope of a single query. Related:
Delete duplicate rows from small table
How do I decompose ctid into page and row numbers?
If columns can be NULL, (more expensively) operate with IS NOT DISTINCT FROM instead of =. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
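A minimal sketch of the NULL-safe variant (same query as above, with the row comparison swapped out):
SELECT *
FROM tbl t1
WHERE EXISTS (
SELECT FROM tbl t2
WHERE (t1.*) IS NOT DISTINCT FROM (t2.*)
AND t1.ctid <> t2.ctid
);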
(t1.*) = (t2.*) is comparing ROW values. This shorter syntax is equivalent: t1 = t2 unless a column of the same name exists in the underlying tables, in which case the second form fails while the first won't. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Index?
If any of the columns has a particularly high cardinality (many distinct values, few duplicates), let's call it hi_cardi_column for the purpose of this answer, a plain btree index on just that column can be efficient for your task. A combination of a few, small columns with a multicolumn index can work, too. The point is to have a small, fast index or the overhead won't pay.
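For example, using the placeholder name from above (CREATE INDEX builds a btree by default):
CREATE INDEX tbl_hi_cardi_idx ON tbl (hi_cardi_column);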
SELECT *
FROM tbl t1
WHERE EXISTS (
SELECT FROM tbl t2
WHERE t1.hi_cardi_column = t2.hi_cardi_column -- logically redundant
AND (t1.*) = (t2.*)
AND t1.ctid <> t2.ctid
);
The added condition t1.hi_cardi_column = t2.hi_cardi_column is logically redundant, but helps to utilize said index.
Other than that I don't see much potential for index support as all rows of the table have to be visited anyway, and all columns have to be checked.

performance penalty when using "join with temp table " in contrast of "IN clause with constant values"

I have a temp table with two records like this:
select * into #T from (select 1 id union select 2) tbl
and also the related index:
Create nonclustered index IX_1 on #T(id)
The following query takes 4000ms to run:
SELECT AncestorId
FROM myView
WHERE AncestorId = ANY (select id from #t)
But the equivalent query (with IN and literal values) takes only 3ms to run:
SELECT ProjectStructureId
FROM myView
WHERE AncestorId in (1,2)
Why this huge difference and how can I change the first query to be as fast as the second one?
P.S.
SQL SERVER 2014 SP2
myView is a Recursive CTE
Changing the first query to an INNER JOIN or EXISTS form didn't help
Changing the IX_1 index to a clustered index didn't help
Using FORCESEEK didn't help
P.S.2
The execution plans of both can be downloaded here: https://www.dropbox.com/s/pas1ovyamqojhba/Query-With-In.sqlplan?dl=0
Execution plans in Paste the Plan
P.S. 3
The view definition is :
ALTER VIEW [dbo].[myView]
AS
WITH parents AS (SELECT main.Id, main.NodeTypeCode, main.ParentProjectStructureId AS DirectParentId, parentInfo.Id AS AncestorId, parentInfo.ParentProjectStructureId AS AncestorParentId, CASE WHEN main.NodeTypeCode <> IsNull(parentInfo.NodeTypeCode, 0)
THEN 1 ELSE 0 END AS AncestorTypeDiffLevel
FROM dbo.ProjectStructures AS main LEFT OUTER JOIN
dbo.ProjectStructures AS parentInfo ON main.ParentProjectStructureId = parentInfo.Id
UNION ALL
SELECT m.Id, m.NodeTypeCode, m.ParentProjectStructureId, parents.AncestorId, parents.AncestorParentId,
CASE WHEN m.NodeTypeCode <> parents.NodeTypeCode THEN AncestorTypeDiffLevel + 1 ELSE AncestorTypeDiffLevel END AS AncestorTypeDiffLevel
FROM dbo.ProjectStructures AS m INNER JOIN
parents ON m.ParentProjectStructureId = parents.Id)
SELECT ISNULL(Id, - 1) AS ProjectStructureId,
ISNULL(NodeTypeCode,-1) NodeTypeCode,
DirectParentId,
ISNULL(AncestorId, - 1) AS AncestorId,
AncestorParentId,
AncestorTypeDiffLevel
FROM parents
WHERE (AncestorId IS NOT NULL)
In your good plan it is able to push the literal values right into the index seek of the anchor part of the recursive CTE.
It refuses to do that when they come from a table.
You could create a table type
CREATE TYPE IntegerSet AS TABLE
(
Integer int PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);
And then pass that to an inline TVF written to use that in the anchor part directly.
Then just call it like
DECLARE @AncestorIds IntegerSet;
INSERT INTO @AncestorIds
VALUES (1),
(2);
SELECT *
FROM [dbo].[myFn](@AncestorIds);
The inline TVF would be much the same as the view but with
WHERE parentInfo.Id IN (SELECT Integer FROM #AncestorIds)
in the anchor part of the recursive CTE.
CREATE FUNCTION [dbo].[myFn]
(
@AncestorIds IntegerSet READONLY
)
RETURNS TABLE
AS
RETURN
WITH parents
AS (SELECT /*omitted for clarity*/
WHERE parentInfo.Id IN (SELECT Integer FROM @AncestorIds)
UNION ALL
SELECT/* Rest omitted for clarity*/
Also, you might as well change that LEFT JOIN to an INNER JOIN, though the optimiser does that for you.
I just want to say that I would write the query as:
SELECT AncestorId
FROM myView
WHERE AncestorId IN (select id from #t);
I doubt this would help.
The issue is that SQL Server can optimize literal values better than values inside a table. The result is that the execution plan changes.
If neither IN nor JOIN fix the problem, then you probably have to fiddle with the definition of the view to improve performance.

T-SQL Stored Procedure: Performance of select count(*) vs. select count([uniqueId])

So, I'm looking at a stored procedure here, which has more than one line like the following pseudocode:
if(select count(*) > 0)
...
on tables having a unique id (or identifier, for making it more general).
Now, in terms of performance, is it more performant to change this clause
to
if(select count([uniqueId]) > 0)
...
where uniqueId is, e.g., an Idx containing double values?
An example:
Consider a table like Idx (double) | Name (String) | Address (String)
Now the 'Idx' is a foreign key which I want to join in a stored procedure.
So, in terms of performance: what is better here?
if(select count(*) > 0)
...
or
if(select count(Idx) > 0)
...
? Or does the SQL engine change select count(*) to select count(Idx) internally, so that we do not have to bother about this? Because at first sight, I'd say that select count(Idx) would be more performant.
The two are slightly different. count(*) counts rows. count([uniqueid]) counts the number of non-NULL values for uniqueid. Because a unique constraint allows a NULL value, SQL Server actually needs to read the column. This could add microseconds of time to a query, particularly if the page with the id is not already in memory. This also gives SQL Server more opportunities to optimize count(*).
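A quick self-contained sketch of the semantic difference (note that SQL Server's UNIQUE constraint permits a single NULL):
DECLARE @t TABLE (uniqueId int NULL UNIQUE);
INSERT INTO @t VALUES (1), (NULL);
SELECT COUNT(*) AS all_rows, COUNT(uniqueId) AS non_null_ids FROM @t;
-- all_rows = 2, non_null_ids = 1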
As @lad2025 writes in a comment, the performant solution is to use IF EXISTS (...).
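A minimal sketch of that pattern (table name and predicate are illustrative):
IF EXISTS (SELECT 1 FROM Table1 WHERE Idx = 42)
PRINT 'found';
EXISTS can stop at the first matching row, whereas COUNT(*) generally has to count all matching rows before the comparison with 0.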
SELECT t1.*
FROM Table1 t1
JOIN Table2 t2 ON t2.idx = t1.idx
will give you only the rows in t1 that match an idx value in Table2. I'm not sure there is a good reason to do an if(select count...).
If you are really interested in the performance of something like this, just create a temp table with a million rows and give it a go:
CREATE TABLE #TempTable (id int identity, txt varchar(50))
GO
INSERT #TempTable (txt) VALUES (@@IDENTITY)
GO 1000000

Nested select query optimization - slow execution

I have a query that looks like the following:
SELECT
ROUND(SUM(AGLR * BlokInsideAreaFactor), 2) AS AGLRSum,
ROUND(SUM(Vaarsaed * BlokInsideAreaFactor), 2) AS VaarsaedSum,
ROUND(SUM(Vintsaed * BlokInsideAreaFactor), 2) AS VintsaedSum,
ROUND(SUM(Oliefroe * BlokInsideAreaFactor), 2) AS OliefroeSum,
ROUND(SUM(Baelgsaed * BlokInsideAreaFactor), 2) AS BaelgsaedSum
.... (+ 10 more columns)
FROM
(
SELECT
AGLR,
Vaarsaed,
Vintsaed,
Oliefroe,
Baelgsaed,
.... (+ 10 more columns)
Round((CASE WHEN bloktema.AREAL > 0 THEN
omraade.Geom.STIntersection(bloktema.Geom).STArea() / bloktema.AREAL ELSE 0 END), 2)
AS BlokInsideAreaFactor
FROM [CTtoolsData].dbo.BlokAfgroedeGrp blokAfgroed
INNER JOIN [CTtoolsTema].dbo.bloktema2012 bloktema
ON (bloktema.bloknr = blokAfgroed.bloknr)
INNER JOIN [CTtoolsTema].dbo.Area omraade
ON omraade.Geom.STIntersects(bloktema.GEOM) = 1
where omraade.Id = 296
AND blokAfgroed.[Year] = 2012
) AS Q1
The reason I have done a nested select is that I have to calculate "BlokInsideAreaFactor" before multiplying it with the other column values in the outer select.
My initial thought was that this would optimize the query, because "BlokInsideAreaFactor" is only calculated once per row instead of fifteen times per row (once per column). The thing is that the query gets very, very slow this way: it takes about 15 minutes for a result of about 4000 rows. Unfortunately we have ageing hardware and are running the query on SQL Server 2012 Express.
I have looked at indexes and can't seem to optimize further that way. Why does a query like this get so slow, and most importantly, is there a way to optimize it?
UPDATE:
The tables involved look as follows:
BlokAfgroedeGrp:
Columns: Id (Primary key, identity), BlokNr, Year, AGLR, Vaarsaed, Vintsaed...etc.
Indexes: Clustered on Id, Unique Non-Clustered on BlokNr + Year
Bloktema2012:
Columns: Id (Primary key, identity), BlokNr, Geom (geometry) + others (not important)
Indexes: Clustered on Id, Spatial on Geom, Non-Unique - Non Clustered on Id + BlokNr, Non-Unique - Non Clustered on BlokNr alone.
Area:
Columns: Id (Primary key, identity), Geom (geometry) + others (not important)
Indexes: Clustered on Id, Spatial on Geom
I have made sure that there is no fragmentation on any of the indexes.
I recently came back to this question after learning about temp tables. I've been able to optimize the query to this:
DECLARE @TempTable TABLE (AGLR float,
Vaarsaed float,
Vintsaed float,
Oliefroe float,
Baelgsaed float,
BlokInsideAreaFactor float)
INSERT INTO @TempTable (AGLR, Vaarsaed, Vintsaed, Oliefroe, Baelgsaed, BlokInsideAreaFactor)
SELECT
AGLR,
Vaarsaed,
Vintsaed,
Oliefroe,
Baelgsaed,
Round((CASE WHEN bloktema.AREAL > 0 THEN
omraade.Geom.STIntersection(bloktema.Geom).STArea() / bloktema.AREAL ELSE 0 END), 2)
AS BlokInsideAreaFactor
FROM [CTtoolsData].dbo.BlokAfgroedeGrp blokAfgroed
INNER JOIN [CTtoolsTema].dbo.bloktema2012 bloktema
ON (bloktema.bloknr = blokAfgroed.bloknr)
INNER JOIN [CTtoolsTema].dbo.Area omraade
ON omraade.Geom.STIntersects(bloktema.GEOM) = 1
where omraade.Id = 296
AND blokAfgroed.[Year] = 2012
SELECT
ROUND(SUM(AGLR * BlokInsideAreaFactor), 2) AS AGLRSum,
ROUND(SUM(Vaarsaed * BlokInsideAreaFactor), 2) AS VaarsaedSum,
ROUND(SUM(Vintsaed * BlokInsideAreaFactor), 2) AS VintsaedSum,
ROUND(SUM(Oliefroe * BlokInsideAreaFactor), 2) AS OliefroeSum,
ROUND(SUM(Baelgsaed * BlokInsideAreaFactor), 2) AS BaelgsaedSum
FROM @TempTable
...so now the query takes about 11 sec, instead of 15 min.
Hope it helps someone else!
Why don't you declare a variable, put the dataset or value you need into the variable, and then reference the variable to do all of the calculations? Then you only need to find that value once.
If you don't want to do that, you could create a CTE (Common Table Expression) table, so you can reference and join to that table instead of doing anything in the where clause.
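For illustration, a minimal sketch of the CTE form, reusing the inner query from the question (note that a CTE is logically equivalent to the nested select, so the optimizer may well produce the same plan):
WITH factored AS
(
SELECT AGLR, Vaarsaed, Vintsaed, Oliefroe, Baelgsaed,
Round((CASE WHEN bloktema.AREAL > 0 THEN
omraade.Geom.STIntersection(bloktema.Geom).STArea() / bloktema.AREAL ELSE 0 END), 2) AS BlokInsideAreaFactor
FROM [CTtoolsData].dbo.BlokAfgroedeGrp blokAfgroed
INNER JOIN [CTtoolsTema].dbo.bloktema2012 bloktema
ON (bloktema.bloknr = blokAfgroed.bloknr)
INNER JOIN [CTtoolsTema].dbo.Area omraade
ON omraade.Geom.STIntersects(bloktema.GEOM) = 1
WHERE omraade.Id = 296
AND blokAfgroed.[Year] = 2012
)
SELECT ROUND(SUM(AGLR * BlokInsideAreaFactor), 2) AS AGLRSum,
ROUND(SUM(Vaarsaed * BlokInsideAreaFactor), 2) AS VaarsaedSum,
ROUND(SUM(Vintsaed * BlokInsideAreaFactor), 2) AS VintsaedSum,
ROUND(SUM(Oliefroe * BlokInsideAreaFactor), 2) AS OliefroeSum,
ROUND(SUM(Baelgsaed * BlokInsideAreaFactor), 2) AS BaelgsaedSum
FROM factored;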
If you're not using SQL Server then you can look into using temp tables.

sql server-query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum, profile.id from Profile
where
(@a is null or a = @a) and
(@b is null or b = @b) and
...(over 60 columns)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
SQL Server by default uses the clustered index to execute the query, but the total execution time is over 300. We tested another solution, a multi-column index covering all columns in the WHERE clause, but the total execution time was over 400.
Do you have any solution to bring the total execution time below 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided. Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
One other option, which I remember implementing in a system once: create a vertical table that includes all of the data you are searching on and build up a query for it. This is easiest to do with dynamic SQL, but could be done using table-valued parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID | Attribute Name | Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
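A sketch of that definition (names and types are illustrative):
CREATE TABLE ProfileAttributes (
ProfileID int NOT NULL,
AttributeName varchar(50) NOT NULL,
AttributeValue varchar(100) NOT NULL
);
CREATE UNIQUE INDEX ux_ProfileAttributes ON ProfileAttributes (ProfileID, AttributeName);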
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
OR (AttributeName = 'state' AND AttributeValue = 'MI')
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE @a IS NULL OR a = @a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = @b UNION ALL SELECT NULL WHERE @b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = @c UNION ALL SELECT NULL WHERE @c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = @zz UNION ALL SELECT NULL WHERE @zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY @sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= @FirstRow)
AND (RowNum <= @LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You have several issues, IMHO. One is that you're going to end up with a sequential scan no matter what you do.
But I think your more crucial issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= @FirstRow)
AND
(RowNum <= @LastRow)
This is a classic "SQL filter" query problem. I've found that the typical approach of "(@b is null or b = @b)" and its common derivatives all yield mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of performance tuning and query optimisation. The approach I've found best is to generate dynamic SQL inside a stored procedure. Most times you also need to add WITH RECOMPILE to the procedure (or OPTION (RECOMPILE) on the statement). The stored procedure helps reduce the potential for SQL injection attacks. The recompile is needed to force the selection of indexes appropriate to the parameters you are actually searching on.
Generally it is at least an order of magnitude faster.
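A minimal sketch of that approach, using just two of the sixty columns (a and b, as in the question) and leaving out the sorting/paging part:
CREATE PROCEDURE SearchProfiles
@a int = NULL,
@b int = NULL
AS
BEGIN
DECLARE @sql nvarchar(max) = N'SELECT id FROM Profile WHERE 1 = 1';
-- append predicates only for parameters that were actually supplied,
-- so each distinct search shape compiles to its own simple, indexable plan
IF @a IS NOT NULL SET @sql += N' AND a = @a';
IF @b IS NOT NULL SET @sql += N' AND b = @b';
EXEC sp_executesql @sql, N'@a int, @b int', @a, @b;
END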
I agree you should also look at the points mentioned above, like:
If you commonly refer to only a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) work best as the lead column in an index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype". But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to get the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire: if the first part of the query does a clustered index scan, it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combinations of parameters.
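For example:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the query under test; per-table reads and CPU/elapsed times appear in the Messages output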
I hope this makes sense without long code samples (they are on my other machine).