SQL Server: adding an ORDER BY clause significantly improves SELECT performance

I am executing the following query directly in SQL Server:
SELECT *
FROM TableA
LEFT JOIN TableB
ON TableB.field1 = TableA.field1
LEFT JOIN TableC
ON TableC.field2 = TableA.field2
LEFT JOIN TableD
ON TableD.field3 = TableA.field3
LEFT JOIN TableE
ON TableE.field4 = TableA.field4
LEFT JOIN TableF
ON TableF.field5 = TableA.field5
LEFT JOIN
(SELECT *
FROM
(SELECT
Id1, Id2,
UpdateDate,
ROW_NUMBER() OVER(PARTITION BY Id1, Id2
ORDER BY UpdateDate DESC) AS RN
FROM TableG) AS G
WHERE G.RN = 1) TableH
ON TableA.Id1 = TableH.Id1
AND TableA.Id2 = TableH.Id2
For reference, Tables A through F are about 1,000 rows each, and Table G is about 10,000 rows.
For a particular input, this query takes about 1 minute to run.
I then add an
ORDER BY Id1 ASC
at the end of the statement, and now it takes about 6 seconds to run. How can adding a sort significantly improve performance like this?

Run a showplan on both versions of your query.
Most likely the sort forces a different query plan, one that uses a join strategy that happens to be more efficient for your particular data (probably an in-memory join), even though it has a higher estimated cost.

After examining the execution plan, it seems the issue was with the join between Table A and Table G. Initially the optimizer chose a nested loop join, which was very inefficient for tables of their size. Adding the ORDER BY clause nudged the optimizer toward a merge join instead, which was much faster. Thanks for the answers!
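The "latest row per group" pattern used in the TableH derived table (ROW_NUMBER partitioned by the key columns, then keep RN = 1) can be sketched in runnable form. This is a minimal demo using Python's sqlite3 with hypothetical miniature data, not the asker's schema; SQLite needs version 3.25+ for window functions.

```python
import sqlite3

# Hypothetical miniature TableG: several updates per (Id1, Id2) pair.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableG (Id1 INT, Id2 INT, UpdateDate TEXT)")
con.executemany("INSERT INTO TableG VALUES (?, ?, ?)", [
    (1, 10, "2023-01-01"),
    (1, 10, "2023-03-01"),   # latest row for (1, 10)
    (2, 20, "2023-02-01"),   # only row for (2, 20)
])

# Keep only the most recent row per (Id1, Id2) -- same dedup pattern as
# the TableH derived table in the question.
rows = con.execute("""
    SELECT Id1, Id2, UpdateDate
    FROM (SELECT Id1, Id2, UpdateDate,
                 ROW_NUMBER() OVER (PARTITION BY Id1, Id2
                                    ORDER BY UpdateDate DESC) AS RN
          FROM TableG)
    WHERE RN = 1
    ORDER BY Id1
""").fetchall()
print(rows)  # [(1, 10, '2023-03-01'), (2, 20, '2023-02-01')]
```

Each group contributes exactly one row: the one with the highest UpdateDate.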

Related

Improving performance of a full outer join in Redshift

I need to complete a full outer join using two columns: date_local and hour_local but am concerned about query performance when using an outer join as opposed to another type of join.
Below is the query using outer join:
SELECT *
FROM TABLE_A
FULL OUTER JOIN TABLE_B
USING (DATE_LOCAL, HOUR_LOCAL)
Would the following query perform better than the query above:
WITH JOIN_VALS AS
(SELECT DATE_LOCAL
, HOUR_LOCAL
FROM TABLE_A
UNION
SELECT
DATE_LOCAL
, HOUR_LOCAL
FROM TABLE_B
)
SELECT
JV.DATE_LOCAL
, JV.HOUR_LOCAL
, TA.PLANNED
, TB.ACTUAL
FROM JOIN_VALS JV
LEFT JOIN TABLE_A TA
ON JV.DATE_LOCAL = TA.DATE_LOCAL
AND JV.HOUR_LOCAL = TA.HOUR_LOCAL
LEFT JOIN TABLE_B TB
ON JV.DATE_LOCAL = TB.DATE_LOCAL
AND JV.HOUR_LOCAL = TB.HOUR_LOCAL;
I'm wondering if I'd get any performance improvement by isolating the unique join values first, rather than finding them during the outer join.
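The rewrite in the question (UNION of the key pairs, then a LEFT JOIN back to each table) does reproduce FULL OUTER JOIN semantics; whether it is faster is a separate matter. A minimal correctness check, using Python's sqlite3 with hypothetical planned/actual tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_a (date_local TEXT, hour_local INT, planned INT);
    CREATE TABLE table_b (date_local TEXT, hour_local INT, actual  INT);
    INSERT INTO table_a VALUES ('2023-01-01', 0, 100), ('2023-01-01', 1, 110);
    INSERT INTO table_b VALUES ('2023-01-01', 1, 105), ('2023-01-01', 2, 120);
""")

# FULL OUTER JOIN emulated as: UNION of the key pairs, then LEFT JOIN each side.
rows = con.execute("""
    WITH join_vals AS (
        SELECT date_local, hour_local FROM table_a
        UNION
        SELECT date_local, hour_local FROM table_b
    )
    SELECT jv.date_local, jv.hour_local, ta.planned, tb.actual
    FROM join_vals jv
    LEFT JOIN table_a ta ON jv.date_local = ta.date_local
                        AND jv.hour_local = ta.hour_local
    LEFT JOIN table_b tb ON jv.date_local = tb.date_local
                        AND jv.hour_local = tb.hour_local
    ORDER BY jv.hour_local
""").fetchall()
print(rows)
# hour 0 only in table_a, hour 2 only in table_b, hour 1 in both
```

Rows that exist on only one side come back with NULL for the other side's columns, exactly as a full outer join would produce.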
UNION can be expensive, and I don't think you will see any benefit from this construct in Redshift; more likely a performance loss. Redshift is a columnar database and gains nothing from peeling off these columns first.
The big cost will come if the matches between the two tables on these two columns are many-to-many: that leads to additional row creation, which can cause slow performance.
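The many-to-many row-creation effect is easy to demonstrate: if a key occurs m times on one side and n times on the other, the join produces m × n rows for that key. A toy illustration with sqlite3 and made-up single-column tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (k INT);
    CREATE TABLE b (k INT);
    INSERT INTO a VALUES (1), (1);          -- 2 rows with key 1
    INSERT INTO b VALUES (1), (1), (1);     -- 3 rows with key 1
""")

# A duplicated key on both sides multiplies: 2 x 3 = 6 output rows.
n = con.execute("SELECT COUNT(*) FROM a JOIN b USING (k)").fetchone()[0]
print(n)  # 6
```

On real tables with heavily duplicated join keys, this multiplication is what makes the join slow, regardless of how the key list was produced.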

SQL takes very long to execute

This SQL statement left joins two tables, both with approx. 10,000 rows (table1 has 20 columns, table2 has 50+ columns), and it takes 60+ seconds to execute. Is there a way to make it faster?
SELECT
t.*, k.*
FROM
table1 AS t
LEFT JOIN
table2 AS k ON t.key_Table1 = k.Key_Table2
WHERE
((t.Time) = (SELECT MAX(t2.Time) FROM table1 AS t2
WHERE t2.key2_Table1 = t.key2_Table1))
ORDER BY
t.Time;
The ideal execution time would be under 5 seconds; an Excel query does it in 8 seconds, and it is very surprising that Excel would outperform a SQL Server Express query.
Execution plan:
You can also rewrite your query more efficiently:
select *
from table2 as k
join (
select *, row_number() over (partition by key2_Table1 order by time desc) rn
from table1
) t
on t.rn = 1
and t.key_Table1 = k.Key_Table2
but you need indexes on the Key_Table2, time and key_Table1 columns if you don't already have them.
Another improvement would be to select only the columns you need instead of select *.
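The rewrite above replaces the correlated MAX(Time) subquery with a single ROW_NUMBER pass. A quick equivalence check on hypothetical tie-free data (sqlite3; note the partition column is key2_Table1, matching the correlated subquery in the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (key_Table1 INT, key2_Table1 INT, Time INT);
    CREATE TABLE table2 (Key_Table2 INT, val TEXT);
    INSERT INTO table1 VALUES (1, 7, 10), (2, 7, 20), (3, 8, 30);
    INSERT INTO table2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# Original shape: keep the latest-Time row per key2_Table1 via a correlated subquery.
orig = con.execute("""
    SELECT t.key_Table1, k.val
    FROM table1 t LEFT JOIN table2 k ON t.key_Table1 = k.Key_Table2
    WHERE t.Time = (SELECT MAX(t2.Time) FROM table1 t2
                    WHERE t2.key2_Table1 = t.key2_Table1)
    ORDER BY t.key_Table1
""").fetchall()

# Rewrite: number the rows once, then join only the rn = 1 rows.
rewritten = con.execute("""
    SELECT t.key_Table1, k.val
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY key2_Table1
                                       ORDER BY Time DESC) rn
          FROM table1) t
    JOIN table2 k ON t.rn = 1 AND t.key_Table1 = k.Key_Table2
    ORDER BY t.key_Table1
""").fetchall()

print(orig == rewritten, orig)
```

One caveat: with tied Time values, ROW_NUMBER keeps exactly one row per group while the MAX filter keeps all tied rows, so the two forms only match when the sort key is unique within each group.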
The optimizer is determining that a merge join is best, but if both tables have 10,000 rows and they aren't joined on indexed columns, overriding the optimizer and telling it to use a hash join may improve performance.
The syntax is to change LEFT JOIN to LEFT HASH JOIN.
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/ms191426(v=sql.100)
https://learn.microsoft.com/en-us/sql/relational-databases/performance/joins?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15
I would recommend rewriting the query using outer apply:
SELECT t.*, k.*
FROM table1 t OUTER APPLY
(SELECT TOP (1) k.*
FROM table2 k
WHERE t.key_Table1 = k.Key_Table2
ORDER BY k.Time DESC
) k
ORDER BY t.Time;
And for this query, you want an index on table2(Key_Table2, time desc).

Which is better for performance: selecting all the columns, or only the required columns, when performing a join?

I have been asked to do performance tuning of a SQL Server query that has many joins in it.
For example
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
There are almost 25 columns in vw_BILLABLE_CENSUS_R, but we only need 3 of them. So instead of selecting all the columns from the view or table, I want to select only the required columns and then perform the join, like this:
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
So will this improve performance or not?
What matters is which columns you actually use in the outermost SELECT, not the ones you select inside the join. The SQL Server engine is smart enough to realize that it does not need to retrieve all columns from the referenced table (or view) if they are never used.
So the following 2 queries should yield the exact same query execution plan:
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
The difference shows up when you actually use the selected columns (in a WHERE condition or by retrieving their values), as in here:
SELECT
A.SomeColumn,
X.* -- * has all X columns
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn,
X.* -- * has only X's SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
I would rather use this approach:
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
than this
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
Since in this case:
you make your query simpler,
you do not have to rely on the query optimizer's smartness, expecting it to eliminate unnecessary columns and rows,
finally, you can select as many columns in the outer SELECT as necessary without using derived-table techniques.
In some cases derived tables are welcome, e.g. when you want to eliminate duplicates in a table you are joining on the fly, but, imho, not in your case.
It depends on how many records are stored, but generally it will improve performance.
In this case, read @LukStorms's comments; I think he is right.

Can the order of Inner Joins Change the results of a query

I have the following scenario on a SQL Server 2008 R2:
The following queries return:
select * from TableA where ID = '123'; -- 1 row
select * from TableB where ID = '123'; -- 5 rows
select * from TableC where ID = '123'; -- 0 rows
When joining these tables the following way, it returns 1 row
SELECT A.ID
FROM TableA A
INNER JOIN ( SELECT DISTINCT ID
FROM TableB ) AS D
ON D.ID = A.ID
INNER JOIN TableC C
ON A.ID = C.ID
ORDER BY A.ID
But when switching the order of the inner joins, it does not return any rows:
SELECT A.ID
FROM TableA A
INNER JOIN TableC C
ON A.ID = C.ID
INNER JOIN ( SELECT DISTINCT ID
FROM TableB ) AS D
ON D.ID = A.ID
ORDER BY A.ID
Can this be possible?
Print Screen:
For inner joins, the order of the join operations does not affect the query results (it can affect the ordering of the rows and columns, but the same data is returned).
In this case, the result set is a subset of the Cartesian product of all the tables. The ordering doesn't matter.
The order can and does matter for outer joins.
In your case, one of the tables is empty. So, the Cartesian product is empty and the result set is empty. It is that simple.
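The claim is easy to check mechanically. A sketch with sqlite3 and hypothetical single-column tables mirroring the question (TableB with duplicates, TableC empty): both join orders return the same, empty, result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ta (id TEXT);
    CREATE TABLE tb (id TEXT);
    CREATE TABLE tc (id TEXT);
    INSERT INTO ta VALUES ('123');
    INSERT INTO tb VALUES ('123'), ('123');   -- duplicates, hence the DISTINCT
    -- tc stays empty, as in the question
""")

# Join order 1: ta -> tb (distinct) -> tc
q1 = con.execute("""
    SELECT a.id FROM ta a
    JOIN (SELECT DISTINCT id FROM tb) d ON d.id = a.id
    JOIN tc c ON a.id = c.id
""").fetchall()

# Join order 2: ta -> tc -> tb (distinct)
q2 = con.execute("""
    SELECT a.id FROM ta a
    JOIN tc c ON a.id = c.id
    JOIN (SELECT DISTINCT id FROM tb) d ON d.id = a.id
""").fetchall()

print(q1, q2)  # [] [] -- both orders return zero rows
```

Any chain of inner joins involving an empty table yields an empty result, regardless of the order in which the joins are written.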
As Gordon mentioned, for inner joins the order of joins doesn't matter, whereas it does matter when at least one outer join is involved. In your case, however, none of this is pertinent: you are inner joining 3 tables, one of which returns zero rows, so every join order results in zero rows.
You cannot reproduce the erratic behavior with the queries as shown in this question, since they will always return zero records. Try it again on your end to see what you come up with, and if you do find a difference, please share it with us.
For the future, whenever you have something like this, creating some dummy data (either as insert statements or in rextester or the like) makes it that much easier for someone to help you.
Best of luck.

SQL Server query performance - removing need for Hash Match (Inner Join)

I have the following query, which is doing very little and is an example of the kind of joins I am doing throughout the system.
select t1.PrimaryKeyId, t1.AdditionalColumnId
from TableOne t1
join TableTwo t2 on t1.ForeignKeyId = t2.PrimaryKeyId
join TableThree t3 on t1.PrimaryKeyId = t3.ForeignKeyId
join TableFour t4 on t3.ForeignKeyId = t4.PrimaryKeyId
join TableFive t5 on t4.ForeignKeyId = t5.PrimaryKeyId
where
t1.StatusId = 1
and t5.TypeId = 68
There are indexes on all the join columns, however the performance is not great. Inspecting the query plan reveals a lot of Hash Match (Inner Joins) when really I want to see Nested Loop joins.
The number of records in each table is as follows:
select count(*) from TableOne
= 64393
select count(*) from TableTwo
= 87245
select count(*) from TableThree
= 97141
select count(*) from TableFour
= 116480
select count(*) from TableFive
= 62
What is the best way in which to improve the performance of this type of query?
First thoughts:
Change to EXISTS (turns the equi-joins into semi-joins)
You need indexes on t1.StatusId and t5.TypeId, with INCLUDE (t1.AdditionalColumnId)
I wouldn't worry about your join method yet...
Personally, I've never used a JOIN hint. Hints only fit the data, indexes and statistics you have at that point in time; as these change, your JOIN hint limits the optimiser.
select t1.PrimaryKeyId, t1.AdditionalColumnId
from
TableOne t1
where
t1.StatusId = 1
AND EXISTS (SELECT *
FROM
TableThree t3
join TableFour t4 on t3.ForeignKeyId = t4.PrimaryKeyId
join TableFive t5 on t4.ForeignKeyId = t5.PrimaryKeyId
WHERE
t1.PrimaryKeyId = t3.ForeignKeyId
AND
t5.TypeId = 68)
AND EXISTS (SELECT *
FROM
TableTwo t2
WHERE
t1.ForeignKeyId = t2.PrimaryKeyId)
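The join-to-EXISTS change is not just a plan hint: it also changes semantics when a child table has several matching rows, because a join duplicates the parent row while a semi-join does not. A minimal sqlite3 sketch with made-up data (two TableThree rows matching one TableOne row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableOne   (PrimaryKeyId INT, StatusId INT);
    CREATE TABLE TableThree (ForeignKeyId INT);
    INSERT INTO TableOne VALUES (1, 1);
    INSERT INTO TableThree VALUES (1), (1);   -- two matching child rows
""")

# Equi-join: the parent row is repeated once per matching child row.
joined = con.execute("""
    SELECT t1.PrimaryKeyId FROM TableOne t1
    JOIN TableThree t3 ON t1.PrimaryKeyId = t3.ForeignKeyId
    WHERE t1.StatusId = 1
""").fetchall()

# Semi-join via EXISTS: the parent row appears once, match or matches.
semi = con.execute("""
    SELECT t1.PrimaryKeyId FROM TableOne t1
    WHERE t1.StatusId = 1
      AND EXISTS (SELECT * FROM TableThree t3
                  WHERE t1.PrimaryKeyId = t3.ForeignKeyId)
""").fetchall()

print(len(joined), len(semi))  # 2 1
```

If the original query never produced duplicates (the joins were all key-to-key), the EXISTS form returns the same rows while giving the optimizer a cheaper semi-join to plan.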
Index for TableOne: one of
(StatusId, ForeignKeyId) INCLUDE (AdditionalColumnId)
(ForeignKeyId, StatusId) INCLUDE (AdditionalColumnId)
Index for TableFive: probably (TypeId, PrimaryKeyId)
Edit: updated JOINS and EXISTS to match question fixes
SQL Server is pretty good at optimizing queries, but it's also conservative: it optimizes for the worst case. A loop join typically results in an index lookup and a bookmark lookup for every row. Because loop joins degrade dramatically on large sets, SQL Server is hesitant to use them unless it's sure about the number of rows.
You can use the forceseek query hint to force an index lookup:
inner join TableTwo t2 with (FORCESEEK) on t1.ForeignKeyId = t2.PrimaryKeyId
Alternatively, you can force a loop join with the loop keyword:
inner LOOP join TableTwo t2 on t1.ForeignKeyId = t2.PrimaryKeyId
Query hints limit SQL Server's freedom, so it can no longer adapt to changed circumstances. It's best practice to avoid query hints unless there is a business need that cannot be met without them.