SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server-based application, and it has a stored procedure that contains the following, but it hits the timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives such as ROW_NUMBER() OVER (PARTITION BY ...).
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.

SQL performance problems are seldom addressed by rewriting the query; the compiler already knows how to rewrite it anyway. The problem is almost always indexing. To compute MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2, you need an index on BItems(ICID).
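For example (a sketch; the index names here are my own invention, adjust to your conventions):
-- supports MAX(StatusTime) ... GROUP BY BID as one seek per group
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
-- supports the seek for WHERE B.ICID = 2
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);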
The query could probably also be expressed as a correlated APPLY, because it seems that is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the original would return multiple rows when two rows tie on StatusTime, while TOP(1) returns just one. My guess, though, is that this is what's desired ('the most recent BData for this BItem').
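If you do want the OP's tie-keeping behaviour, a minimal sketch is to change the inner SELECT to TOP(1) WITH TIES, which returns every row sharing the top StatusTime:
SELECT TOP(1) WITH TIES *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC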

Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
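As a hedged sketch of such a covering index (Col1 and Col2 are hypothetical placeholders for whichever BData columns your SELECT actually needs):
CREATE INDEX IX_BData_Covering
ON dbo.BData (BID, StatusTime DESC)
INCLUDE (Col1, Col2); -- hypothetical columns; list only what the query returns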
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix

This may be a late response, but I recently ran into the same performance issue, where a simple query involving MAX() was taking more than an hour to execute.
After looking at the execution plan, it seems that in order to perform the MAX() function, every record meeting the WHERE clause condition will be fetched. In your case, every record in the table will need to be fetched before performing the MAX() function. Also, indexing BData.StatusTime alone will not speed up the query: indexing is useful for looking up a particular record, but it will not help with performing a comparison across all rows.
In my case I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause and SELECT TOP 1. The query went from over an hour down to under five minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query speeds up.
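For reference, the shape of that rewrite looks roughly like this (a sketch against the OP's table; with a supporting index on StatusTime it becomes a single-row read instead of a full scan):
-- instead of SELECT MAX(StatusTime) FROM dbo.BData ...
SELECT TOP (1) StatusTime
FROM dbo.BData
ORDER BY StatusTime DESC;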
Cheers!

The following is a version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE b.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.

It depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery, but more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with INCLUDE containing StatusTime, and, if possible, filtering it to the BIDs whose matching BItems rows have ICID = 2.
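A sketch of that index (note that a filtered index's WHERE clause can only reference columns of BData itself, so it cannot filter directly on BItems.ICID):
CREATE INDEX IX_BData_BID_Incl
ON dbo.BData (BID)
INCLUDE (StatusTime);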

[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers and suggestions. Unfortunately I couldn't get any further with this, so I've given up trying for now.
It looks like the best solution is to rewrite the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
Thanks again for the suggestions.

Related

Can anyone please suggest which SQL query is the better approach for a large data set in MySQL?

I have to find the max-id row for each group in a table and show that row's details. The following two approaches both achieve it, but which will be the better approach for large data? Or is there any other approach that will take less time to execute?
Approach 1:
select a.* from tab1 a left join (SELECT max(id) as id,name from tab1
GROUP by name) as tab2 on a.id=tab2.id where a.id=tab2.id
Approach 2:
SELECT id,name from tab1 where id in(SELECT MAX(id) FROM tab1 GROUP by name)
Taken from the manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT JOIN can be faster than a subquery because
the server might be able to optimize it better.
So subqueries can be slower than LEFT [OUTER] JOINs, but in my opinion their strength is slightly higher readability. Since the first approach contains both a LEFT JOIN and a subquery, the second approach might be faster for large-scale queries.
You can also use a window function (available in MySQL 8.0+) to avoid a self-join:
SELECT id, name
FROM (
SELECT
id,
name,
RANK() OVER(PARTITION BY name ORDER BY Id DESC) AS IdRankPerGroup
FROM tab1
) src
WHERE IdRankPerGroup = 1
The RANK() function orders the rows within each "name" group and assigns a ranking based on the "id" value within each group; the outer query then keeps only the rows with a ranking of 1. Note that RANK() gives ties the same rank, so duplicate ids within a group would all be returned; use ROW_NUMBER() if you need exactly one row per group.
Try all three queries, check out the EXPLAIN plans, and see which one works best with large amounts of data.
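For example, in MySQL:
EXPLAIN
SELECT id, name
FROM tab1
WHERE id IN (SELECT MAX(id) FROM tab1 GROUP BY name);
-- on MySQL 8.0.18+ you can also use EXPLAIN ANALYZE to get measured timings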

Oracle performance issue in getting first row in sub query

I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would, however, prefer something akin to the following, but correlating the outer table across two levels of nested SELECTs doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
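If you are on Oracle 12c or later, another option worth trying is the row-limiting clause inside the scalar subquery, which avoids both the DISTINCT and the ROWNUM gymnastics (a sketch, untested against your schema):
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT L.LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER = P.ITEM_NUMBER
AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
FETCH FIRST 1 ROW ONLY),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P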
I encountered a similar problem in the past, and while this is not an ultimate solution (in fact it might just be corner-cutting), the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init parameter.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
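For example (a sketch; FIRST_ROWS_10 is one of the documented FIRST_ROWS_n values):
-- session-wide:
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_10;
-- or per statement, via a hint:
SELECT /*+ FIRST_ROWS(10) */ ITEM_NUMBER, PROJECT_NUMBER FROM PROJECT P;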
Of course, there are tons of other factors you should analyse, like your indexes, join efficiency, the query plan, etc.

How to retrieve the value corresponding to the max of another column in SQL?

I have the following table, which represents valuations of items.
ITEM REFERENCEDATE VALUATION
------------------------------------------------
A 25/01/2012 25.35
A 26/01/2012 51.35
B 25/01/2012 25.00
Edit: (ITEM, REFERENCEDATE) is a unique index.
The goal is to get the latest valuation for a set of items, which means I'm trying to create a SQL query that would return something like:
ITEM REFERENCEDATE VALUATION
------------------------------------------------
A 26/01/2012 51.35
B 25/01/2012 25.00
Following a tutorial on GROUP BY, I ended up trying
SELECT A.ITEM, A.VALUATION, MAX(A.REFERENCEDATE)
FROM VALUATIONS A
GROUP BY A.ITEM
Full of hope that SQL Server would understand that I need the A.VALUATION from the row that realizes the max of A.REFERENCEDATE for the ITEM represented on the current result line.
But instead, I have this unpleasant error message:
Column 'VALUATIONS.VALUATION' is invalid in the select list because it is not contained
in either an aggregate function or the GROUP BY clause.
How can I indicate that the VALUATION from the row where the maximum REFERENCEDATE is reached should be used?
Note: I need a solution that works at least on Oracle and SQL Server
EDIT: Thanks, everybody, for your help. I was stuck trying to get away with only a single SELECT ... GROUP BY query. Now I see there are two approaches that revolve around the same idea:
Making a JOIN with the result of another independent query that returns all the item/max(date) pairs
Using a subquery result in the WHERE clause which will have a different value for each item
Could anybody provide a reason (or a pointer to a reason) to prefer one over the other?
Select V.Item, V.ReferenceDate, V.Valuation
From Valuations As V
Where V.ReferenceDate = (
Select Max(V1.ReferenceDate)
From Valuations As V1
Where V1.Item = V.Item
)
In response to your edit, the only way to know for sure which approach will perform better is to evaluate the execution plan on each of the queries. There are many factors that can come into determining the fastest approach and certainly the DBMS itself is one of those factors. A good query engine should be able to deduce the same or similar execution plan regardless of the approach. That said, using a derived table (i.e. approach #1) may be a bit more explicit to the query engine (even if less explicit to the reader of the query) and thus might perform better. Often it is the case that derived tables perform better than correlated subqueries (my solution and your approach #2). However, I wouldn't alter the approach until I had evidence to support the change. Again, the only way to know which will perform better for certain is to evaluate the execution plan against your data.
If you are using almost any database other than MySQL, the answer is to use ranking functions. In particular, row_number() does what you are looking for:
select ITEM, REFERENCEDATE, VALUATION
from (select t.*,
row_number() over (partition by item order by referencedate desc) as seqnum
from t
) t
where seqnum = 1 and
item in (<your list of items>)
Row number assigns a sequence number to the records for each item. It starts at 1 for the biggest reference date, then 2 for the next biggest, and so on (based on the ORDER BY clause). You want the first one, where seqnum = 1.
select a.item, a.valuation, a.referencedate
from valuations a
join (select a2.item, max(referencedate) as max_date
from valuations a2
group by a2.item
) b ON a.item = b.item and a.referencedate = b.max_date
Try this:
SELECT A.ITEM, MAX(A.VALUATION), A.REFERENCEDATE
FROM VALUATIONS A
JOIN
(
SELECT A.ITEM, MAX(A.REFERENCEDATE) AS REFERENCEDATE
FROM VALUATIONS A
GROUP BY A.ITEM
) B ON A.ITEM = B.ITEM AND A.REFERENCEDATE = B.REFERENCEDATE
GROUP BY A.ITEM, A.REFERENCEDATE
It will select the MAX valuation among the rows holding the max REFERENCEDATE. If you expect only one row to have the max date, then it simply selects from that single row.
This is the code you possibly need:
Select *
From ItemValues As A
Inner Join
ItemValues As MaxValuedItem
On MaxValuedItem.Id = (
Select Top 1
B.Id
From ItemValues As B
Where B.Item_Id = A.Item_Id
Order By B.Valuation Desc
)
You need to use a "join" with the table itself that refers to the record that has the maximum value for the same item.

Optimize SQL Query having SUM and COUNT functions

I have the following query, which takes too long to retrieve around 70,000 records. I noticed that the execution time is proportional to the number of records retrieved. I need to optimize this query so that the execution time is not proportional to the number of records retrieved. Any ideas?
;WITH TT AS (
SELECT TaskParts.[TaskPartID],
PartCost,
LabourCost,
VendorPaidPartAmount,
VendorPaidLabourAmount,
ROW_NUMBER() OVER (ORDER BY [Employees].[EmpCode] ASC) AS RowNum
FROM [TaskParts], [Tasks], [WorkOrders], [Employees], [Status], [Models], [SubAccounts]
WHERE 1 = 1
AND (TaskParts.TaskLineID = Tasks.TaskLineID)
AND (Tasks.WorkOrderID = [WorkOrders].WorkOrderID)
AND (Tasks.EmpID = [Employees].EmpID)
AND (TaskParts.StatusID = [Status].StatusID)
AND (Models.ModelID = Tasks.FailedModelID)
AND (SubAccounts.SubAccountID = Tasks.SubAccountID)
AND (SubAccounts.GLAccountID = 5))
SELECT COUNT(0),
SUM(ISNULL(PartCost, 0)),
SUM(ISNULL(LabourCost, 0)),
SUM(ISNULL(VendorPaidPartAmount, 0)),
SUM(ISNULL(VendorPaidLabourAmount, 0))
FROM TT
As Lieven noted, you can remove TD0, TD1 and TP1 as they are redundant.
You can also remove the row_number column, as that is not used and windowing functions are relatively expensive.
It may also be possible to remove some of the tables from the TT CTE if they are not used; however, as table names have not been included with each column selected, it isn't possible to tell which tables are not being used.
Aside from that, your query's response time will always be proportional to the number of rows read, because the RDBMS has to read each qualifying row to calculate the results.
Make sure that you have a supporting index for each foreign key. Also, while it is most probably not the issue in this case, the SQL Server optimizer works better with explicit inner joins.
Also, I don't see any reason why you need RowNum if you only need totals.
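Putting those two points together, a sketch of the simplified query (dropping the CTE and ROW_NUMBER, and assuming the cost columns live on TaskParts, which the original query doesn't specify):
SELECT COUNT(*),
SUM(ISNULL(TaskParts.PartCost, 0)),
SUM(ISNULL(TaskParts.LabourCost, 0)),
SUM(ISNULL(TaskParts.VendorPaidPartAmount, 0)),
SUM(ISNULL(TaskParts.VendorPaidLabourAmount, 0))
FROM TaskParts
INNER JOIN Tasks ON TaskParts.TaskLineID = Tasks.TaskLineID
INNER JOIN WorkOrders ON Tasks.WorkOrderID = WorkOrders.WorkOrderID
INNER JOIN Employees ON Tasks.EmpID = Employees.EmpID
INNER JOIN [Status] ON TaskParts.StatusID = [Status].StatusID
INNER JOIN Models ON Models.ModelID = Tasks.FailedModelID
INNER JOIN SubAccounts ON SubAccounts.SubAccountID = Tasks.SubAccountID
WHERE SubAccounts.GLAccountID = 5;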

Is there a better way to sort this query?

We generate a lot of SQL procedurally, and SQL Server is killing us. Because of some issues documented elsewhere, we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) )
AS rno__row__index FROM (
SELECT [me].[id], [me].[status] FROM (
SELECT TOP 4294967296 [me].[id], [me].[status] FROM
[PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id]
WHERE ( [line_items].[part_id] = ? )
ORDER BY [me].[id] ASC
) [me]
) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate result sets so that they can be stacked together like building blocks. For example, I want to get the first 10 CDs ordered by the name of the artist who produced them, and also get the related artist for each CD. What I do is assemble a monolithic subselect representing the CDs ordered by the joined artist names, then apply a limit to it, then join the nested subselects to the artist table, and only then execute the resulting query. The isolation is necessary because the code that requests the ordered CDs is unrelated and oblivious to the code selecting the top 10 CDs, which in turn is unrelated and oblivious to the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
If you want the top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY; don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a field from an outer-joined table, the outer join is pointless: all the outer-extended fields will be NULL and filtered out by the WHERE, so it is effectively an inner join.
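To illustrate the difference (a sketch; the predicate placement is the whole point):
-- filtered in WHERE: NULL-extended rows are discarded, effectively an INNER JOIN
SELECT [me].[id]
FROM [PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
WHERE [line_items].[part_id] = ?;
-- filtered in ON: purchase orders with no matching line item are kept (a true outer join)
SELECT [me].[id]
FROM [PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id] AND [line_items].[part_id] = ?;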
Applying those points, the query can be rewritten as:
WITH cteRowNumbered AS (
SELECT [me].id, [me].status,
ROW_NUMBER() OVER (ORDER BY [me].id ASC) AS rno__row__index
FROM [PurchaseOrders] [me]
JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
WHERE [line_items].[part_id] = ?)
SELECT id, status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 AND 25
I use CTEs instead of subqueries just because I find them more readable.
Use:
SELECT x.*
FROM (SELECT po.id,
po.status,
ROW_NUMBER() OVER (ORDER BY po.id) AS rno__row__index
FROM [PurchaseOrders] po
JOIN [POLineItems] li ON li.id = po.id
WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (it will return random rows): we know what the manual says, and we know it is a hack; this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think the approach you're using is seriously flawed. While it's nice to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables measured in the millions of rows, and sometimes more. Using the same methodologies will often result in code that performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.