Query faster with top attribute - sql

Why is this query faster in SQL Server 2008 R2 (Version 10.50.2806.0)
SELECT
MAX(AtDate1),
MIN(AtDate2)
FROM
(
SELECT TOP 1000000000000
at.Date1 AS AtDate1,
at.Date2 AS AtDate2
FROM
dbo.tab1 a
INNER JOIN
dbo.tab2 at
ON
a.id = at.RootId
AND CAST(GETDATE() AS DATE) BETWEEN at.Date1 AND at.Date2
WHERE
a.Number = 223889
)B
then
SELECT
MAX(AtDate1),
MIN(AtDate2)
FROM
(
SELECT
at.Date1 AS AtDate1,
at.Date2 AS AtDate2
FROM
dbo.tab1 a
INNER JOIN
dbo.tab2 at
ON
a.id = at.RootId
AND CAST(GETDATE() AS DATE) BETWEEN at.Date1 AND at.Date2
WHERE
a.Number = 223889
)B
?
The second statement with the TOP attribute is six times faster.
The count(*) of the inner subquery is 9280 rows.
Can I use a HINT to declare that SQL Server optimiser make it right?

I see you've now posted the plans. Just luck of the draw.
Your actual query is a 16 table join.
SELECT max(atDate1) AS AtDate1,
min(atDate2) AS AtDate2,
max(vtDate1) AS vtDate1,
min(vtDate2) AS vtDate2,
max(bgtDate1) AS bgtDate1,
min(bgtDate2) AS bgtDate2,
max(lftDate1) AS lftDate1,
min(lftDate2) AS lftDate2,
max(lgtDate1) AS lgtDate1,
min(lgtDate2) AS lgtDate2,
max(bltDate1) AS bltDate1,
min(bltDate2) AS bltDate2
FROM (SELECT TOP 100000 at.Date1 AS atDate1,
at.Date2 AS atDate2,
vt.Date1 AS vtDate1,
vt.Date2 AS vtDate2,
bgt.Date1 AS bgtDate1,
bgt.Date2 AS bgtDate2,
lft.Date1 AS lftDate1,
lft.Date2 AS lftDate2,
lgt.Date1 AS lgtDate1,
lgt.Date2 AS lgtDate2,
blt.Date1 AS bltDate1,
blt.Date2 AS bltDate2
FROM dbo.Tab1 a
INNER JOIN dbo.Tab2 at
ON a.id = at.Tab1Id
AND cast(Getdate() AS DATE) BETWEEN at.Date1 AND at.Date2
INNER JOIN dbo.Tab5 v
ON v.Tab1Id = a.Id
INNER JOIN dbo.Tab16 g
ON g.Tab5Id = v.Id
INNER JOIN dbo.Tab3 vt
ON v.id = vt.Tab5Id
AND cast(Getdate() AS DATE) BETWEEN vt.Date1 AND vt.Date2
LEFT OUTER JOIN dbo.Tab4 vk
ON v.id = vk.Tab5Id
LEFT OUTER JOIN dbo.VerkaufsTab3 vkt
ON vk.id = vkt.Tab4Id
LEFT OUTER JOIN dbo.Plu p
ON p.Tab4Id = vk.Id
LEFT OUTER JOIN dbo.Tab15 bg
ON bg.Tab5Id = v.Id
LEFT OUTER JOIN dbo.Tab7 bgt
ON bgt.Tab15Id = bg.Id
AND cast(Getdate() AS DATE) BETWEEN bgt.Date1 AND bgt.Date2
LEFT OUTER JOIN dbo.Tab11 b
ON b.Tab15Id = bg.Id
LEFT OUTER JOIN dbo.Tab14 lf
ON lf.Id = b.Id
LEFT OUTER JOIN dbo.Tab8 lft
ON lft.Tab14Id = lf.Id
AND cast(Getdate() AS DATE) BETWEEN lft.Date1 AND lft.Date2
LEFT OUTER JOIN dbo.Tab13 lg
ON lg.Id = b.Id
LEFT OUTER JOIN dbo.Tab9 lgt
ON lgt.Tab13Id = lg.Id
AND cast(Getdate() AS DATE) BETWEEN lgt.Date1 AND lgt.Date2
LEFT OUTER JOIN dbo.Tab10 bl
ON bl.Tab11Id = b.Id
LEFT OUTER JOIN dbo.Tab6 blt
ON blt.Tab10Id = bl.Id
AND cast(Getdate() AS DATE) BETWEEN blt.Date1 AND blt.Date2
WHERE a.Nummer = 223889) B
On both the good and bad plans the Execution Plan shows "Reason for Early Termination of Statement Optimization" as "Time Out".
The two plans have slightly different join orders.
The only join in the plans not satisfied by an index seek is that on Tab9. This has 63,926 rows.
The missing index details in the execution plan suggest that you create the following index.
CREATE NONCLUSTERED INDEX [miising_index]
ON [dbo].[Tab9] ([Date1],[Date2])
INCLUDE ([Tab13Id])
The problematic part of the bad plan can be clearly seen in SQL Sentry Plan Explorer
SQL Server estimates that 1.349174 rows will be returned from the previous joins coming into the join on Tab9. And therefore costs the nested loops join as if it will need to execute the scan on the inside table 1.349174 times.
In fact 2,600 rows feed into that join meaning that it does 2,600 full scans of Tab9 (2,600 * 63,926 = 164,569,600 rows.)
It just so happens that on the good plan the estimated number of rows coming in to the join is 2.74319. This is still wrong by three orders of magnitude but the slightly increased estimate means SQL Server favors a hash join instead. A hash join just does one pass through Tab9
I would first try adding the missing index on Tab9.
Also/instead you might try updating the statistics on all tables involved (especially those with a date predicate such as Tab2 Tab3 Tab7 Tab8 Tab6) and see if that goes some way to correcting the huge discrepancy between estimated and actual rows on the left of the plan.
Also breaking the query up into smaller parts and materialising these into temporary tables with appropriate indexes might help. SQL Server can then use the statistics on these partial results to make better decisions for joins later in the plan.
Only as a last resort would I consider using query hints to try and force the plan with a hash join. Your options for doing that are either the USE PLAN hint in which case you dictate exactly the plan you want including all join types and orders or by stating LEFT OUTER HASH JOIN tab9 .... This second option also has the side effect of fixing all join orders in the plan. Both mean that SQL Server will be severely limited is its ability to adjust the plan with changes in data distribution.

It's hard to answer not knowing the size and structure of your tables, and not being able to see the entire execution plan. But the difference in both plans is Hash Match join for "top n" query vs Nested Loop join for the other one.
Hash Match is very resource intensive join, because the server has to prepare hash buckets in order to use it. But it becomes much more effective for big tables, while Nested Loops, comparing each row in one table to every row in another table works great for small tables, because there's no such preparation needed.
What I think is that by selecting TOP 1000000000000 rows in subquery you give the optimizer a hint that you're subquery will produce a great amount of data, so it uses Hash Match. But in fact the output is small, so Nested Loops works better.
What I just said is based on shreds of information, so please have heart criticising my answer ;).

Related

Converting a SQL subquery to a join for performance gains

I have a subquery with an inner join, This join is meant to cut down the data size to a more manageable size, before extracting data via an unpivot which has a further join to only pull out relevant matches..
When i've looked at the execution plan, it seems like the outer select is being executed first and thus taking an inordinate amount of time to complete as it is processing the data for all gamers instead of the cut down cohort..
this is the query
SELECT
t2.Gamer_ID,
C.Feature_Code,
C.Feature_Name,
t2.CODE_DATE
FROM
(
SELECT
CAST(A.Gamer_ID
,[Identification_Code_Code_1]
,[Identification_Code_Code_2]
,[Identification_Code_Code_3]
,[Identification_Code_Code_4]
,[Identification_Code_Code_5]
,[Identification_Code_Code_6]
,[Identification_Code_Code_7]
,[Identification_Code_Code_8]
,[Identification_Code_Code_9]
,[Identification_Code_Code_10]
,CAST(Joining_date AS DATE) AS CODE_DATE
FROM Gamer_Characteristics A
INNER JOIN Gamer_Population P ON P.Gamer_ID = A.Gamer_ID --cuts down the number of gamers to the selected cohort
) s
unpivot (CODE for col in (
[Identification_Code_Code_1]
,[Identification_Code_Code_2]
,[Identification_Code_Code_3]
,[Identification_Code_Code_4]
,[Identification_Code_Code_5]
,[Identification_Code_Code_6]
,[Identification_Code_Code_7]
,[Identification_Code_Code_8]
,[Identification_Code_Code_9]
,[Identification_Code_Code_10])) as t2
INNER JOIN Gamer_feature_Code C ON C.CODE = LEFT(t2.CODE,C.CODE_LENGTH) --join to a dimension table to pull through characteristcs based on code and code length
WHERE
T2.CODE_DATE <= '2020-03-31'
GROUP BY t2.Gamer_ID,
C.Feature_Code,
C.Feature_Name,
t2.CODE_DATE
I have two questions.
1: can this be converted to use a join instead of subquery
2: Can i force the inner join in the subquery to take precedence over the inner join in the outer select?

Sql View with WHERE clause runs slower than a raw query

This runs in a constant time:
SELECT row_number() OVER (order by PackagingUniqueId) as RowNum, Barcode, pu.PackagingUniqueId,
rd.Name, pu.ComponentBarcode, rrl.ponum, rrl.mfgpart, rrl.new_lot_code, rrl.pno
FROM Trace.dbo.TraceData td
INNER JOIN Trace.dbo.TraceJob tj ON td.Id = tj.TraceDataId
INNER JOIN Trace.dbo.Job j ON tj.JobId = j.Id
INNER JOIN Trace.dbo.[Order] o ON j.OrderId = o.id
INNER JOIN Trace.dbo.PCBBarcode p ON td.PCBBarcodeId = p.Id
INNER JOIN Trace.dbo.TracePlacement tp ON td.Id = tp.TraceDataId
INNER JOIN Trace.dbo.Placement p2 ON p2.PlacementGroupId = tp.PlacementGroupId
INNER JOIN Trace.dbo.Charge c ON p2.ChargeId = c.Id
INNER JOIN Trace.dbo.PackagingUnit pu ON c.PackagingUnitId = pu.Id
INNER JOIN Trace.dbo.RefDesignator rd ON p2.RefDesignatorId = rd.Id
INNER JOIN SpotlightSQL.spot_light_dbo.peel_off_ids po ON po.peel_off_id = pu.PackagingUniqueId
INNER JOIN SpotlightSQL.spot_light_dbo.recv_receipts_log rrl ON rrl.label_id = po.label_id
WHERE p.Barcode = '20092619153'
However, this one takes about 7 seconds:
SELECT * FROM Component WHERE Barcode = '20092619153'
Component is a SQL view which consists of the first longer query without WHERE clause.
Why this happens? Does the view retrieve all records and then apply Where clause? Is there a way to speed up the second query? (without applying indexes)
Why this happens? Does the view retrieve all records and then apply Where clause?
Yes, in this particular case, SQL Server will first execute the original underlying query, and then apply a WHERE filter on top of that intermediate result.
Is there a way to speed up the second query? (without applying indexes)
A SQL view generally performs as well as the underlying query. So, if Barcode is a good way to filter off many records, then adding an index to Barcode is the way to go. Other than this, there is not much you can do to speed up the view.
One option would be to create a materialized view, which is basically just a table whose data is generated by your view's query. Selecting all records from a materialized view, with no additional restrictions, should have a speed limited only by the time of data transfer.

SQL - faster to filter by large table or small table

I have the below query which takes a while to run, since ir_sales_summary is ~ 2 billion rows:
select c.ChainIdentifier, s.SupplierIdentifier, s.SupplierName, we.Weekend,
sum(sales_units_cy) as TY_unitSales, sum(sales_cost_cy) as TY_costDollars, sum(sales_units_ret_cy) as TY_retailDollars,
sum(sales_units_ly) as LY_unitSales, sum(sales_cost_ly) as LY_costDollars, sum(sales_units_ret_ly) as LY_retailDollars
from ir_sales_summary i
left join Chains c
on c.ChainID = i.ChainID
inner join Suppliers s
on s.SupplierID = i.SupplierID
inner join tmpWeekend we
on we.SaleDate = i.saledate
where year(i.saledate) = '2017'
group by c.ChainIdentifier, s.SupplierIdentifier, s.SupplierName, we.Weekend
(Worth noting, it takes roughly 3 hours to run since it is using a view that brings in data from a legacy service)
I'm thinking there's a way to speed up the filtering, since I just need the data from 2017. Should I be filtering from the big table (i) or be filtering from the much smaller weekending table (which gives us just the week ending dates)?
Try this. This might help, joining a static table as first table in query onto a fact/dynamic table will impact query performance i believe.
SELECT c.ChainIdentifier
,s.SupplierIdentifier
,s.SupplierName
,i.Weekend
,sum(sales_units_cy) AS TY_unitSales
,sum(sales_cost_cy) AS TY_costDollars
,sum(sales_units_ret_cy) AS TY_retailDollars
,sum(sales_units_ly) AS LY_unitSales
,sum(sales_cost_ly) AS LY_costDollars
,sum(sales_units_ret_ly) AS LY_retailDollars
FROM Suppliers s
INNER JOIN (
SELECT we
,weeekend
,supplierid
,chainid
,sales_units_cy
,sales_cost_cy
,sales_units_ret_cy
,sales_units_ly
,sales_cost_ly
,sales_units_ret_ly
FROM ir_sales_summary i
INNER JOIN tmpWeekend we
ON we.SaleDate = i.saledate
WHERE year(i.saledate) = '2017'
) i
ON s.SupplierID = i.SupplierID
INNER JOIN Chains c
ON c.ChainID = i.ChainID
GROUP BY c.ChainIdentifier
,s.SupplierIdentifier
,s.SupplierName
,i.Weekend

Why does separate table perform significantly better than subquery?

I was trying to improve performance of a SQL query and tried few combinations.
Original Query
SELECT ALIAS_A.id1,
ALIAS_A.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
FROM db_A.table_A ALIAS_A
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON ALIAS_A.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON ALIAS_A.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(ALIAS_A.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
AND Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE
The above query consumes nearly 400k impactCPU
Optimized Query 1
SELECT New_sub_table.id1,
New_sub_table.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
--changed part start--
FROM ( sel * from db_A.table_A ALIAS_A WHERE Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE ) New_sub_table -- created a subquery
--changed part end--
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON New_sub_table.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON New_sub_table.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(New_sub_table.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
I thought to filter the data first and then do the joins. After I checked the performance stats. It was consuming nearly 390k CPU. Not much of a difference.
Optimized Query 2
SELECT ALIAS_A.id1,
ALIAS_A.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
--changed part start--
FROM INTERMEDIATE_DB.INTERMEDIATE_TABLE ALIAS_A --CREATED AN INTERMEDIATE TABLE
--changed part end--
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON ALIAS_A.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON ALIAS_A.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(ALIAS_A.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
MACRO for loading data into intermediate table
INSERT INTO INTERMEDIATE_DB.INTERMEDIATE_TABLE
sel * from db_A.table_A ALIAS_A WHERE Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE
So what I did here was. I used an intermediate table instead of subquery. The intermediate table gets loaded via the macro first and then the select query runs. It now consumes only 50k impactCPU (for both Macro and Select query combined).
My question -
I am unable to reason why this is happening even though the logic behind both queries is same (or so I think it is). What would be the best practice if this is incorrect way ?
Your main problem is the Cast(ALIAS_A.columnD AS DATE). When you check Explains you will notice the optimizer has no confidence for this step, probably greatly overestimating the number of rows returned.
But when you materialize the Select the number of rows is better known and the order of joins changes.
You would probably get the same plan when you Collect Statistics on the Cast(ALIAS_A.columnD AS DATE), run DIAGNOSTIC HELPSTATS ON FOR SESSION; and Explain should show you this as recommended stats.

Why does my SELECT return only a subset of the rows?

I'm working with an Oracle database for the first time and stumbled upon another problem again. When I want to select all the rows in a table with some JOINS I only get the first 350 rows of about 15.000 rows.
Anyone know if there is a set limit somewhere I'm not aware of?
Below is my query if needed:
SELECT orders.plant, orders.workcenter, workcenters.occupied,
workcenters.section, workcentersections.section, orders.capacitycat,
orders.week, orders.earlieststartdate, orders.lateststartdate,
orders.useropstatus, orders.programstatus, orders.reqhours,
orders.finishdate, orders.reqquantity, orders.material, parts.TYPE,
parttypes.TYPE, orders.ordernumber, orders.operation,
orders.preoperation, orders.seqoperation, orders.projectcode,
orders.queuetime, orders.hoursworked, orders.operationtext,
orders.shorttext
FROM (((orders INNER JOIN workcenters ON orders.workcenter =
workcenters.code)
INNER JOIN
workcentersections ON workcenters.section = workcentersections.ID)
INNER JOIN
parts ON orders.material = parts.material)
INNER JOIN
parttypes ON parts.TYPE = parttypes.ID
Assuming your ORDERS table contains 15000 rows and your original query returns only 350 rows, you can replace your (INNER) JOINs with OUTER JOINs:
SELECT orders.plant, orders.workcenter, workcenters.occupied,
workcenters.section, workcentersections.section, orders.capacitycat,
orders.week, orders.earlieststartdate, orders.lateststartdate,
orders.useropstatus, orders.programstatus, orders.reqhours,
orders.finishdate, orders.reqquantity, orders.material, parts.TYPE,
parttypes.TYPE, orders.ordernumber, orders.operation,
orders.preoperation, orders.seqoperation, orders.projectcode,
orders.queuetime, orders.hoursworked, orders.operationtext,
orders.shorttext
FROM orders
LEFT OUTER JOIN workcenters ON orders.workcenter = workcenters.code
LEFT OUTER JOIN workcentersections
ON workcenters.section = workcentersections.ID
LEFT OUTER JOIN parts ON orders.material = parts.material
LEFT OUTER JOIN parttypes ON parts.TYPE = parttypes.ID
This will give you all rows from ORDERS (you might get duplicates if you haven't got strict 1:N relationships).
Then, you should replace the LEFT OUTER JOINs one-by-one with INNER JOINs and check the row count of each of these modified queries to find out which of the JOINs is responsible for the missing data.