Optimize SQL Query having SUM and COUNT functions - sql

I have the following query which takes too long to retrieve around 70000 records. I noticed that the execution time is proportional to the number of the records retrieved. I need to optimize this query so that the execution time is not proportional to the number of records retrieved. Any idea?
;WITH TT AS (
    SELECT TaskParts.[TaskPartID],
           PartCost,
           LabourCost,
           VendorPaidPartAmount,
           VendorPaidLabourAmount,
           ROW_NUMBER() OVER (ORDER BY [Employees].[EmpCode] ASC) AS RowNum
    FROM [TaskParts], [Tasks], [WorkOrders], [Employees], [Status], [Models], [SubAccounts]
    WHERE 1 = 1
      AND (TaskParts.TaskLineID = Tasks.TaskLineID)
      AND (Tasks.WorkOrderID = [WorkOrders].WorkOrderID)
      AND (Tasks.EmpID = [Employees].EmpID)
      AND (TaskParts.StatusID = [Status].StatusID)
      AND (Models.ModelID = Tasks.FailedModelID)
      AND (SubAccounts.SubAccountID = Tasks.SubAccountID)
      AND (SubAccounts.GLAccountID = 5)
)
SELECT COUNT(0),
       SUM(ISNULL(PartCost, 0)),
       SUM(ISNULL(LabourCost, 0)),
       SUM(ISNULL(VendorPaidPartAmount, 0)),
       SUM(ISNULL(VendorPaidLabourAmount, 0))
FROM TT

As Lieven noted, you can remove TD0, TD1 and TP1 as they are redundant.
You can also remove the row_number column, as that is not used and windowing functions are relatively expensive.
It may also be possible to remove some of the tables from the TT CTE if they are not used; however, as table names have not been included with each column selected, it isn't possible to tell which tables are not being used.
Aside from that, your query's response time will always be proportional to the number of rows being aggregated, because the RDBMS has to read each of those rows to calculate the count and the sums.

Make sure that you have a supporting index for each foreign key. Also, although it is most probably not the issue in this case, the SQL Server optimizer tends to work better with explicit INNER JOIN syntax.
Finally, I don't see any reason why you need RowNum if you only need totals.
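Putting those two suggestions together, a possible shape for the rewrite (this is only a sketch; the unqualified cost columns are left as they were because it is not clear which tables they come from, and every table is kept in case some joins exist purely as filters):
SELECT COUNT(0) AS RecordCount,
       SUM(ISNULL(PartCost, 0)) AS TotalPartCost,
       SUM(ISNULL(LabourCost, 0)) AS TotalLabourCost,
       SUM(ISNULL(VendorPaidPartAmount, 0)) AS TotalVendorPaidPartAmount,
       SUM(ISNULL(VendorPaidLabourAmount, 0)) AS TotalVendorPaidLabourAmount
FROM TaskParts
INNER JOIN Tasks ON TaskParts.TaskLineID = Tasks.TaskLineID
INNER JOIN WorkOrders ON Tasks.WorkOrderID = WorkOrders.WorkOrderID
INNER JOIN Employees ON Tasks.EmpID = Employees.EmpID
INNER JOIN [Status] ON TaskParts.StatusID = [Status].StatusID
INNER JOIN Models ON Models.ModelID = Tasks.FailedModelID
INNER JOIN SubAccounts ON SubAccounts.SubAccountID = Tasks.SubAccountID
WHERE SubAccounts.GLAccountID = 5;
The column aliases (RecordCount and so on) are made up. With supporting indexes on the join keys and on SubAccounts.GLAccountID, this is about as lean as the statement can get while still touching every row that feeds the totals.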

Related

Oracle FIRST_ROWS optimizer hint

I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
    SELECT il2.log_number,
           il2.final_date
    FROM log il2
    INNER JOIN agent A ON A.agent_id = il2.agent_id
    INNER JOIN activity lat ON il2.activity_id = lat.activity_id
    WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
      AND lat.criteria2 = p_criteria2
      AND lat.criteria3 = p_criteria3
      AND il2.criteria3 = p_criteria4
      AND il2.current_user IS NULL
    GROUP BY il2.log_number, il2.final_date
    ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans can be selected when you limit the number of rows returned. It can be cheaper to pick up rows in order via an index on the ordering key and stop after n, than to sort the entire result set and then take the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
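For illustration, this is roughly where the hint would sit on the inner query. As far as I know the hint argument has to be an integer literal, so the PL/SQL variable p_how_many would not be understood inside the comment text; 10 below is just a placeholder:
SELECT log_number FROM (
    SELECT /*+ FIRST_ROWS(10) */
           il2.log_number,
           il2.final_date
    FROM log il2
    INNER JOIN agent A ON A.agent_id = il2.agent_id
    INNER JOIN activity lat ON il2.activity_id = lat.activity_id
    WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
      AND lat.criteria2 = p_criteria2
      AND lat.criteria3 = p_criteria3
      AND il2.criteria3 = p_criteria4
      AND il2.current_user IS NULL
    GROUP BY il2.log_number, il2.final_date
    ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Whether the hint changes anything here is best checked against the actual plan once the table has realistic volumes, for the reason given above: the GROUP BY plus ORDER BY tends to force a blocking sort before the ROWNUM stopkey anyway.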

Filtering on ROW_NUMBER() is changing the results

I implemented an OData service of my own that takes an SQL statement and applies the top/skip filters using ROW_NUMBER(). Most statements tested so far work well, except for one involving two levels of LEFT JOIN. For some reason I can't explain, the data returned by the SQL changes when I apply a WHERE clause on the row number column.
For readability (and testing), I removed most of the SQL to keep only the faulty part. Basically, you have a Patients table that may have 0 to N Diagnostics, and the Diagnostics may have 0 to N Treatments:
SELECT RowNumber, PatientID, DiagnosticID, TreatmentID
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNumber,
           *
    FROM PATIENTS
    LEFT JOIN DIAGNOSTICS ON DIAGNOSTICS.PatientID = PATIENTS.PatientID
    LEFT JOIN TREATMENTS ON TREATMENTS.DiagnosticID = DIAGNOSTICS.DiagnosticID
) AS Wrapper
--WHERE RowNumber BETWEEN 1 AND 10
--If I uncomment the line above, I get 10 rows that differ from the first 10 rows of this query
These are the results I got from the statement above. The result set on the left shows the first 10 rows without the WHERE clause, while the one on the right shows the results with the WHERE clause.
For the record, I'm using SQL Server 2008 R2 SP3. My application is in C# but the problem occurs in SQL server too so I don't think .NET is involved in this case.
EDIT
About the ORDER BY (SELECT NULL): I took that code a while ago from this SO question. However, ordering by NULL only works if the statement itself is sorted... in my case, I had forgotten to add an ORDER BY clause, which is why I was getting seemingly random ordering.
Let me first ask: why do you expect it to be the same? Or rather, why do you expect it to be anything in particular? You haven't imposed an ordering, so the query optimizer is free to use whatever execution operators are most efficient (according to its cost scheme). When you add the WHERE clause, the plan will change and the natural ordering of the results will be different. This can also happen when adding joins or subqueries, for example.
If you want the results to come back in a specific order, you need to actually use the ORDER BY subclause of the ROW_NUMBER() window function. I'm not sure why you are ordering by SELECT NULL, but I can guarantee you that's the problem.
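For example, a deterministic version could look like the following. I'm assuming here that the three ID columns together identify a row uniquely, which is what makes the numbering (and therefore the paging) stable between runs:
SELECT RowNumber, PatientID, DiagnosticID, TreatmentID
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY PATIENTS.PatientID,
                                       DIAGNOSTICS.DiagnosticID,
                                       TREATMENTS.TreatmentID) AS RowNumber,
           PATIENTS.PatientID,
           DIAGNOSTICS.DiagnosticID,
           TREATMENTS.TreatmentID
    FROM PATIENTS
    LEFT JOIN DIAGNOSTICS ON DIAGNOSTICS.PatientID = PATIENTS.PatientID
    LEFT JOIN TREATMENTS ON TREATMENTS.DiagnosticID = DIAGNOSTICS.DiagnosticID
) AS Wrapper
WHERE RowNumber BETWEEN 1 AND 10;
With a stable ORDER BY inside the OVER clause, the first 10 rows with and without the outer WHERE clause will match.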

dense_rank filling up tempdb on SQL server?

I've got this query here which uses dense_rank to number groups in order to select only the first group. It works, but it's slow, and tempdb (SQL Server) grows so large that the disk fills up. Is it normal for dense_rank to be such a heavy operation? And how else should this be done, without resorting to procedural code?
SELECT a, b, c, d
FROM
(
    SELECT a, b, c, d,
           DENSE_RANK() OVER (ORDER BY s.[time] DESC) AS gn
    FROM [Order] o
    JOIN Scan s ON s.OrderId = o.OrderId
    JOIN PriceDetail p ON p.ScanId = s.ScanId
) AS p
WHERE p.OrderNumber = #OrderNumber
  AND p.Number = #Number
  AND p.Time > GETDATE() - 20
  AND p.gn = 1
GROUP BY a, b, c, d, p.gn
Any operation that has to sort a large dataset may fill tempdb. dense_rank is no exception, just like rank, row_number, ntile etc etc.
You are asking for what is effectively a global, complete sort of every scan entry since the database began. The way the query is expressed, the joins must occur before the sort, so the sort will be both big and wide. After all is said and done, having consumed a lot of IO, CPU and tempdb space, you restrict the result to a small subset for one specified order and a few conditions (which reference columns not present in the projection, so this must be a made-up example rather than the real code).
You have a filter of WHERE gn = 1 followed by a GROUP BY gn. This is unnecessary; gn is already fixed to a single value by the predicate, so it adds nothing to the GROUP BY.
You compute the dense_rank over every order scan and then you filter by p.OrderNumber = #OrderNumber AND p.gn = 1. This makes even less sense. This query will only return results if the #OrderNumber happens to contain the scan with rank 1 over all orders! It cannot possibly be correct.
Your query makes no sense. The fact that it is slow is just a bonus. Post your actual requirements.
If you want to learn about performance investigation, read How to analyse SQL Server performance.
PS. As a rule, computing ranks and selecting =1 can always be expressed as a TOP(1) correlated subquery, with usually much better results. Indexes help, obviously.
PPS. Use of GROUP BY without any aggregate function is yet another serious code smell.
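To illustrate the PS: if the intent is something like "the price details for the most recent scan of a given order", the rank-then-filter can be replaced by a TOP(1) correlated subquery. This is only a guess at the requirement, and @OrderNumber plus the join columns are assumptions on my part:
SELECT p.*
FROM [Order] o
CROSS APPLY
(
    SELECT TOP (1) s.ScanId          -- latest scan for this order
    FROM Scan s
    WHERE s.OrderId = o.OrderId
    ORDER BY s.[time] DESC
) AS latest
JOIN PriceDetail p ON p.ScanId = latest.ScanId
WHERE o.OrderNumber = @OrderNumber;  -- @OrderNumber is a hypothetical parameter
An index on Scan(OrderId, [time] DESC) lets the TOP (1) resolve as a single seek per order, so nothing global needs to be sorted and tempdb stays out of the picture.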

SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application and it has a stored procedure that contains the following, but it hits a timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives such as ROW_NUMBER() OVER (PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query. The compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2, you need an index on BItems.ICID.
The query could probably also be expressed as a correlated APPLY, because it seems that is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
SQL Fiddle.
This is not semantically the same query as the OP's; the original would return multiple rows on a StatusTime collision. My guess, though, is that this is what is desired ('the most recent BData row for each BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue, where a simple query involving max() was taking more than an hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the WHERE clause condition has to be fetched. In your case, every record in the table would need to be fetched before the max() function runs. Also, indexing BData.StatusTime alone will not speed up the query; such an index is useful for looking up a particular record, but it will not help with performing this kind of comparison.
In my case, I didn't have a GROUP BY, so all I did was use an ORDER BY ... DESC clause and SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query can be sped up.
Cheers!
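For reference, the no-GROUP-BY shape described above would be something along these lines, assuming you only need the latest row for a single BID:
SELECT TOP (1) *
FROM dbo.BData
WHERE BID = @BID          -- @BID is a hypothetical parameter
ORDER BY StatusTime DESC;
With the (BID, StatusTime DESC) index suggested earlier, this should resolve as a single seek rather than a scan of the whole table.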
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE B.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with INCLUDE containing the StatusTime, and if possible filtering it to InternalIDs matching BItems.ICID = 2.
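Spelled out, that suggestion would look roughly like the statement below (the index name is invented). Note that a filtered index can only reference columns of its own table, so the BItems.ICID = 2 condition cannot be baked into an index on BData directly:
CREATE NONCLUSTERED INDEX IX_BData_BID_StatusTime   -- hypothetical name
    ON dbo.BData (BID)
    INCLUDE (StatusTime);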
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so have given-up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table, so that getting the latest readings becomes a really quick and simple SELECT.
Thanks again for the suggestions.

ORDER BY column that has allow null is slow. Why?

So really my question is WHY this worked.
Anyway, I had this query that does a few inner joins, has a WHERE clause, and does an ORDER BY on an nvarchar column. If I run the query WITHOUT the ORDER BY, it takes less than a second. If I run the query WITH the ORDER BY, it takes 12 seconds.
Now I had a great idea and changed all the INNER JOINs to LEFT JOINs, and also included the ORDER BY clause. That took less than a second. So I remembered the difference between LEFT JOINs and INNER JOINs: INNER JOINs check for NULL and LEFT JOINs don't. So I went into the table design and unchecked "Allow Nulls". Now I run the query WITH INNER JOINs and an ORDER BY clause and it takes less than a second. WHY?
From what I understand, the FROM, JOIN, WHERE, and then SELECT clauses should run first and return a result set. Then the ORDER BY clause runs at the very end on the resulting record set. Therefore the query should have taken AT MOST a second or so, yes, even with the column allowing NULLs. So why would the query take less than a second WITHOUT the ORDER BY clause, but 12 seconds WITH it? That doesn't make sense to me.
Query below:
SELECT PlanInfo.PlanId, PlanName, COALESCE(tResponsible, '') AS tResponsible, Processor, CustName, TaskCategoryId, MapId, tEnd,
CASE MapId WHEN 9 THEN 1 ELSE 2 END AS sor
FROM PlanInfo INNER JOIN [orders].dbo.BaanOrders_Ext ON PlanInfo.PlanName = [orders].dbo.BaanOrders_Ext.OrderNo
INNER JOIN [orders].dbo.BaanOrders ON PlanInfo.PlanName = [orders].dbo.BaanOrders.OrderNo
INNER JOIN Tasks ON PlanInfo.PlanId = Tasks.PlanId
INNER JOIN EngSchedToTimingMap ON Tasks.CatId = EngSchedToTimingMap.TaskCategoryId
WHERE (MapId = 9 OR MapId = 11 or MapId = 13 or MapId = 15)
AND([orders].dbo.BaanOrders_Ext.Processor = 'metest' OR tResponsible = 'metest')
ORDER BY PlanInfo.PlanId
I would have to guess that it is due to having an index on PlanInfo.PlanId, on which you are sorting.
SQL Server could streamline the retrieval so that it follows the index and builds the rest of the columns in that order. When the field is NULLable, the index cannot be used for sorting, because it will not contain the NULL values (which, incidentally, sort first), so the optimizer decides to go down a different path.
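For reference, unchecking "Allow Nulls" in the table designer corresponds to a statement along these lines; the int data type is a guess, and the statement will fail if the column already contains NULLs:
ALTER TABLE PlanInfo ALTER COLUMN PlanId int NOT NULL;  -- data type assumed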
Showing the execution plan always helps. Either paste images of the plans, or just show the text-mode plans, i.e. add the following line above the query, then execute it:
SET SHOWPLAN_TEXT ON;
<the query>
When you use the ORDER BY clause, you force the database engine to sort the results. This takes some time (especially if the result contains many rows) - thus it is possible that a query that runs 1 second without an ORDER BY clause runs 12 seconds with it. Note that sorting takes at best O(N*log(N)) time where N is the number of rows.
The reason why NULLs are generally slow is the fact that they must be treated specially. Sorting with NULLs adds more complex comparison conditions and slows the sorting down.
If your question is "Why does the ORDER BY clause cause my query to run longer?" the answer is because sorting the results is added to the query execution plan.
If you use the "Show Estimated Query Execution Plan" tool in SQL Server Studio, it will show you exactly what it thinks the SQL Server engine will do.