I've got a query here that uses dense_rank to number groups so that only the first group is selected. It works, but it's slow, and tempdb (SQL Server) grows so large that the disk fills up. Is it normal for dense_rank to be such a heavy operation? And how else should this be done, without resorting to application code?
select
    a, b, c, d
from
    (select a, b, c, d,
            dense_rank() over (order by s.[time] desc) as gn
     from [Order] o
     JOIN Scan s ON s.OrderId = o.OrderId
     JOIN PriceDetail p ON p.ScanId = s.ScanId) as p
where p.OrderNumber = @OrderNumber
  and p.Number = @Number
  and p.Time > getdate() - 20
  and p.gn = 1
group by a, b, c, d, p.gn
Any operation that has to sort a large dataset may fill tempdb. dense_rank is no exception, just like rank, row_number, ntile, etc.
You are asking for what appears to be a global, complete sort of every scan entry since the database started. The way you expressed the query, the join must occur before the sort, so the sort will be both big and wide. After all is said and done, having consumed a lot of IO, CPU and tempdb space, you restrict the result to a small subset for a single order and a few conditions (which mention columns not present in the projection, so this must be a made-up example rather than the real code).
You have a filter on WHERE gn = 1 followed by a GROUP BY that includes gn. This is unnecessary: gn is already fixed by the predicate, so it cannot contribute anything to the GROUP BY.
You compute the dense_rank over every order's scans and then filter by p.OrderNumber = @OrderNumber AND p.gn = 1. This makes even less sense: the query will only return results if @OrderNumber happens to contain the scan ranked 1 across all orders. It cannot possibly be correct.
Your query makes no sense. The fact that it is slow is just a bonus. Post your actual requirements.
If you want to learn about performance investigation, read How to analyse SQL Server performance.
PS. As a rule, computing ranks and selecting =1 can always be expressed as a TOP(1) correlated subquery, usually with much better results. Indexes help, obviously.
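A minimal sketch of that rewrite against the tables in the question, assuming the intent is to return the rows for the latest scan of each order (since the real requirement isn't stated, the correlation on OrderId is my assumption):

SELECT a, b, c, d
FROM [Order] o
JOIN Scan s ON s.OrderId = o.OrderId
JOIN PriceDetail p ON p.ScanId = s.ScanId
WHERE s.[time] = (SELECT TOP (1) s2.[time]      -- latest scan time for this order
                  FROM Scan s2
                  WHERE s2.OrderId = o.OrderId
                  ORDER BY s2.[time] DESC);

With an index on Scan (OrderId, [time]) the subquery becomes a cheap seek instead of a global sort.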
PPS. Use of GROUP BY without any aggregate function is yet another serious code smell.
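For illustration (the table name here is a placeholder), GROUP BY with no aggregates is just a roundabout DISTINCT:

SELECT a, b, c, d FROM SomeTable GROUP BY a, b, c, d;
-- reads the same as
SELECT DISTINCT a, b, c, d FROM SomeTable;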
I've got a view defined that lists transactions together with a running total, something like
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
The biggest tables this query looks at have ~10 million rows, but on average when the view is queried it will only return a few tens of rows.
My issue is that when this SELECT statement is run directly for a given member, it executes extremely quickly and returns results in a couple of milliseconds. However, when queried as a view...
SELECT h.createdDate, h.value, h.runningTotal
FROM historyView h
WHERE h.username = 'blah@blah.com'
...the performance is dreadful. The two query plans are very different: in the first case it is pretty much ideal, but in the latter case there are loads of scans and hundreds of thousands or millions of rows being read. This is clearly because the filter on the member is applied at the very end, after everything else has been done, rather than right up front at the start.
If I remove the SUM(x) OVER (ORDER BY y) clause, this problem goes away.
Is there something I can do to ensure that the SUM(x) OVER (ORDER BY y) clause does not ruin the query plan?
One solution to my problem is to let the query optimiser know it is safe to filter before running the windowed function, by PARTITIONing by that property. The change to the view is:
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (PARTITION BY m.username ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
Unfortunately this only produces the correct plan if filtering by the member's username is part of the query.
That's because there's probably an index on m.username. When it comes to query tuning, it takes some trial and error.
When using window functions there is the concept of a 'POC' index to take into consideration - just search for it on Google (Itzik Ben-Gan has good references about this as well).
From the book 'High-Performance T-SQL Using Window Functions':
Absent a POC index, the plan includes a Sort iterator, and with large input sets, it can be quite expensive. Sorting has N * LOG(N) complexity, which is worse than linear. This means that with more rows, you pay more per row. For example, 1000 * LOG(1000) = 3000 and 10000 * LOG(10000) = 40000. This means that 10 times more rows results in 13 times more work, and it gets worse the further you go.
Here's a reference link to get started on window functions and indexes.
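For illustration only, here is what a POC index might look like for a simplified, single-table window function such as SUM(value) OVER (PARTITION BY memberId ORDER BY createdDate). The index name and the assumption that these columns all live on dbo.allocations are mine, not taken from the original schema (in the view above the partitioning column actually comes from the member table, so the real index design needs more thought):

CREATE NONCLUSTERED INDEX IX_allocations_memberId_createdDate
    ON dbo.allocations (memberId, createdDate)   -- P(artitioning), then O(rdering) columns in the key
    INCLUDE (value);                             -- C(overing): the remaining referenced columns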
I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
    SELECT
        il2.log_number,
        il2.final_date
    FROM log il2
    INNER JOIN agent A ON A.agent_id = il2.agent_id
    INNER JOIN activity lat ON il2.activity_id = lat.activity_id
    WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
      AND lat.criteria2 = p_criteria2
      AND lat.criteria3 = p_criteria3
      AND il2.criteria3 = p_criteria4
      AND il2.current_user IS NULL
    GROUP BY il2.log_number, il2.final_date
    ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans can be chosen when you limit the number of rows returned. It can be easier to select the top n rows by some order key and then sort just those than to sort the entire table and then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
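As a hedged illustration of the first point: an index matching the inner ORDER BY can let Oracle walk the rows already in final_date order and stop at the ROWNUM limit (a COUNT STOPKEY step) instead of sorting everything. The index name and column list below are assumptions, not taken from the actual schema:

CREATE INDEX log_final_date_ix ON log (final_date, log_number);

As noted above, though, the GROUP BY in the inner query will usually force all qualifying rows to be read and grouped before the ROWNUM filter can stop anything, so such an index may not help in this exact form.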
I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have written
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other two tables to that one.
Join order in SQL Server 2008 R2 unquestionably does affect query performance, particularly in queries with a large number of table joins and WHERE clauses applied against multiple tables.
Although the join order is changed during optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1 min+ execution time) come down to sub-second performance just by changing the order of the join expressions. Please note, however, that these are queries with 12 to 20 joins and WHERE clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use FORCE ORDER, but that can be too rigid. Try to make sure that your join order starts with the tables that will reduce the data most through their WHERE clauses.
No, the join order is changed during optimization.
The only caveat is OPTION (FORCE ORDER), which forces the joins to happen in the exact order you have specified them.
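For illustration, here is the query from the question rewritten to start from Reporting_JourneyMaster90 with the written join order pinned (the NOLOCK hints are omitted here purely for brevity):

SELECT jm.IMEI, jm.MaxSpeedKM, jm.MaxAccel, jm.MaxDeccel,
       jm.JourneyMaxLeft, jm.JourneyMaxRight, jm.DistanceKM,
       jm.IdleTimeSeconds, jm.WebUserJourneyId, jm.lifetime_odo_metres, jm.[Descriptor]
FROM dbo.Reporting_JourneyMaster90 AS jm
INNER JOIN dbo.Reporting_WebUsers AS wu ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
  AND j.JourneyDuration > 2
  AND j.JourneyDuration < 1000
  AND j.JourneyDistance > 0
OPTION (FORCE ORDER);   -- joins are performed in exactly this order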
I have a clear example of join order affecting performance. It is a simple join between two tables. One has 50+ million records, the other has 2,000. If I select from the smaller table and join the larger, it takes 5+ minutes.
If I select from the larger table and join the smaller, it takes 2 minutes 30 seconds.
This is with SQL Server 2012.
To me this is counter-intuitive, since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
JOIN order doesn't matter; the query engine will reorganize the order based on index statistics and other factors.
To test this, do the following:
1. Enable "show actual execution plan" and run the first query.
2. Change the JOIN order and run the query again.
3. Compare the execution plans.
They should be identical, as the query engine will reorganize the joins according to those other factors.
As commented on another answer, you could use OPTION (FORCE ORDER) to get exactly the order you wrote, but it may not be the most efficient one.
As a general rule of thumb, put the table with the fewest records first in the JOIN order and the one with the most records last, as in some DBMS engines the order can make a difference, particularly when FORCE ORDER is used to pin the join order.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.
I have the following query, which takes too long to retrieve around 70,000 records. I noticed that the execution time is proportional to the number of records retrieved. I need to optimize this query so that the execution time is not proportional to the number of records retrieved. Any ideas?
;WITH TT AS (
    SELECT TaskParts.[TaskPartID],
           PartCost,
           LabourCost,
           VendorPaidPartAmount,
           VendorPaidLabourAmount,
           ROW_NUMBER() OVER (ORDER BY [Employees].[EmpCode] ASC) AS RowNum
    FROM [TaskParts], [Tasks], [WorkOrders], [Employees], [Status], [Models], [SubAccounts]
    WHERE 1 = 1
      AND (TaskParts.TaskLineID = Tasks.TaskLineID)
      AND (Tasks.WorkOrderID = [WorkOrders].WorkOrderID)
      AND (Tasks.EmpID = [Employees].EmpID)
      AND (TaskParts.StatusID = [Status].StatusID)
      AND (Models.ModelID = Tasks.FailedModelID)
      AND (SubAccounts.SubAccountID = Tasks.SubAccountID)
      AND (SubAccounts.GLAccountID = 5))
SELECT COUNT(0),
       SUM(ISNULL(PartCost, 0)),
       SUM(ISNULL(LabourCost, 0)),
       SUM(ISNULL(VendorPaidPartAmount, 0)),
       SUM(ISNULL(VendorPaidLabourAmount, 0))
FROM TT
As Lieven noted, you can remove TD0, TD1 and TP1 as they are redundant.
You can also remove the row_number column, as that is not used and windowing functions are relatively expensive.
It may also be possible to remove some of the tables from the TT CTE if they are not used; however, as table names have not been included with each column selected, it isn't possible to tell which tables are not being used.
Aside from that, your query's response time will always be proportional to the number of rows that qualify, because the RDBMS has to read each of those rows to calculate the results.
Make sure that you have a supporting index for each foreign key. It is most probably not the issue in this case, but the SQL Server optimizer also tends to work better with explicit INNER JOIN syntax than with old-style comma joins.
Also, I don't see any reason why you need RowNum if you only need totals.
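A hedged sketch combining those suggestions: explicit INNER JOINs, no ROW_NUMBER, and only the columns that are actually aggregated. Table and column names are taken from the question; the aliases are mine, and any tables mentioned elsewhere but not shown in the posted query are simply left out:

SELECT COUNT(0),
       SUM(ISNULL(tp.PartCost, 0)),
       SUM(ISNULL(tp.LabourCost, 0)),
       SUM(ISNULL(tp.VendorPaidPartAmount, 0)),
       SUM(ISNULL(tp.VendorPaidLabourAmount, 0))
FROM TaskParts tp
INNER JOIN Tasks t        ON tp.TaskLineID = t.TaskLineID
INNER JOIN WorkOrders wo  ON t.WorkOrderID = wo.WorkOrderID
INNER JOIN Employees e    ON t.EmpID = e.EmpID
INNER JOIN [Status] st    ON tp.StatusID = st.StatusID
INNER JOIN Models m       ON m.ModelID = t.FailedModelID
INNER JOIN SubAccounts sa ON sa.SubAccountID = t.SubAccountID
WHERE sa.GLAccountID = 5;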