Find the Max and related fields - SQL

Here is my (simplified) problem, very common I guess:
create table sample (client, recordDate, amount)
I want to find the latest record for each client, with its recordDate and amount.
I wrote the code below, which works, but I wonder if there is a better pattern or any Oracle tweak to improve the efficiency of such a SELECT. (I am not allowed to modify the structure of the database, so indexes etc. are out of reach for me and out of scope for the question.)
select s.client, s.recordDate, s.Amount
from sample s
inner join (select client, max(recordDate) lastDate
            from sample
            group by client) t
        on s.client = t.client and s.recordDate = t.lastDate
The table has half a million records and the select takes 2-4 secs, which is acceptable but I am curious to see if that can be improved.
Thanks

In most cases Windowed Aggregate Functions might perform better (at least they're easier to write):
select client, recordDate, Amount
from
(
    select client, recordDate, Amount,
           rank() over (partition by client order by recordDate desc) as rn
    from sample s
) dt
where rn = 1

Another way to structure the query is with not exists. This can perform faster under some circumstances:
select client, recordDate, Amount
from sample s
where not exists (select 1
                  from sample s2
                  where s2.client = s.client and
                        s2.recordDate > s.recordDate
                 );
This would take good advantage of an index on sample(client, recordDate), if one were available.
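For reference, such an index would be created like this (the index name is illustrative):
create index sample_client_date_ix on sample (client, recordDate);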
And, another thing to try is keep:
select client, max(recordDate),
       max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;
This version assumes only one max record date per client (your original query does not make that assumption).
These queries (plus the one by dnoeth) should all have different query plans and you might get lucky on one of them. The best solution, though, is to have the appropriate index.
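To compare the plans yourself, Oracle's EXPLAIN PLAN facility shows them; a minimal sketch, using the keep variant as the example:
-- capture the plan for one candidate query
explain plan for
select client, max(recordDate),
       max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;

-- then display it
select * from table(dbms_xplan.display);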

Related

Very huge data avoid analytical function

Assume there are 100 million distinct SID values in this table.
Example: Table test, columns SID, date_operation, score
The score changes every day, so I want a report of all the SIDs with their most recent score. I don't want to use an analytic function, as the cost would be very high. I tried a self join as well, but that also looks costly.
If this question is redundant, please direct me to a similar question and I will delete this one.
select sid, max(date_operation)
from test
group by sid
will return what you asked for:
get all the Sid with most recent score
One method is:
select t.*
from test t
where t.date_operation = (select max(t2.date_operation)
                          from test t2
                          where t2.sid = t.sid);
This can take advantage of an index on test(sid, date_operation).
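For reference, that index would be created like this (the index name is illustrative):
create index test_sid_dateop_ix on test (sid, date_operation);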
However I have found good performance in Oracle with keep:
select sid, max(date_operation),
       max(score) keep (dense_rank first order by date_operation desc) as most_recent_score
from test
group by sid;

Retrieving last record in each group from database with additional max() condition in MSSQL

This is a follow-up question to Retrieving last record in each group from database - SQL Server 2005/2008
In the answers, this example was provided to retrieve the last record for each group (the example below retrieves the last update for each value of computername):
select t.*
from t
where t.lastupdate = (select max(t2.lastupdate)
from t t2
where t2.computername = t.computername
);
In my case, however, "lastupdate" is not unique (some updates come in batches and share the same lastupdate value, and if two updates for the same computername arrive in one batch, you get non-unique output for "computername + lastupdate").
Suppose I also have a field "rowId" that is simply auto-incremented. The mitigation would be to include another criterion in the query: the maximal rowId.
NB: while the example uses the time-specific name "lastupdate", the actual selection criteria may not be related to time at all.
I would therefore like to ask: what would be the most performant query that selects the last record in each group, based both on the group-defining parameter (in the case above, "computername") and on the maximal rowId?
If you don't have uniqueness, then row_number() is simpler:
select t.*
from (select t.*,
             row_number() over (partition by computername
                                order by lastupdate desc, rowid desc) as seqnum
      from t
     ) t
where seqnum = 1;
With the right indexes, the correlated subquery is usually faster. However, the performance difference is not that great.
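For completeness, a correlated version that honors the rowId tie-break could look like this (a sketch only; it assumes rowId is unique, which an auto-incremented column is):
select t.*
from t
where t.rowId = (select top (1) t2.rowId
                 from t t2
                 where t2.computername = t.computername
                 order by t2.lastupdate desc, t2.rowId desc);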

SQL Querying Column on Max Date

I apologize if my code is not properly typed. I am trying to query a table to return the latest bgcheckdate and its status. The table contains several bgcheckdates and statuses per person, but in my report I only need to see the latest bgcheckdate with its status.
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN) AS DATERUN, BG.STATUS
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID, BG.status;
When I run the above query, I still see people with multiple background check dates and statuses.
Whereas when I run without the status, it works fine:
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN)
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID;
So I'm just wondering if anyone can help me figure out how to query the date run and status, both reflecting the latest date.
The best solution depends on which RDBMS you are using.
Here is one with basic, standard SQL:
SELECT bg.PEOPLE_ID, bg.DATE_RUN, bg.STATUS
FROM (
    SELECT PEOPLE_ID, MAX(DATE_RUN) AS MAX_DATERUN
    FROM PKS_BGCHECK
    GROUP BY PEOPLE_ID
) sub
JOIN PKS_BGCHECK bg ON bg.PEOPLE_ID = sub.PEOPLE_ID
                   AND bg.DATE_RUN = sub.MAX_DATERUN;
But you can get multiple rows per PEOPLE_ID if there are ties.
In Oracle, Postgres or SQL Server and others (but not MySQL before version 8.0) you can also use the window function row_number():
WITH cte AS (
SELECT PEOPLE_ID, DATE_RUN, STATUS
, ROW_NUMBER() OVER(PARTITION BY PEOPLE_ID ORDER BY DATE_RUN DESC) AS rn
FROM PKS_BGCHECK
)
SELECT PEOPLE_ID, DATE_RUN, STATUS
FROM cte
WHERE rn = 1;
This guarantees 1 row per PEOPLE_ID. Ties are resolved arbitrarily. Add more expressions to ORDER BY to break ties deterministically.
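For example, if STATUS were an acceptable tie-breaker (an assumption; pick whatever column makes the order deterministic for your data), the OVER clause would become:
ROW_NUMBER() OVER(PARTITION BY PEOPLE_ID ORDER BY DATE_RUN DESC, STATUS DESC) AS rn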
In Postgres, the simplest solution would be with DISTINCT ON.
Details for both in this related answer:
Select first row in each GROUP BY group?
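For illustration, the Postgres DISTINCT ON version would look roughly like this (not applicable to Oracle or SQL Server):
SELECT DISTINCT ON (PEOPLE_ID)
       PEOPLE_ID, DATE_RUN, STATUS
FROM PKS_BGCHECK
ORDER BY PEOPLE_ID, DATE_RUN DESC;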
Selecting the latest row in a time-sensitive set is fairly easy and largely platform independent:
SELECT BG.PEOPLE_ID, BG.DATE_RUN, BG.STATUS
FROM PKS_BGCHECK BG
WHERE BG.DATE_RUN = (SELECT MAX(DATE_RUN)
                     FROM PKS_BGCHECK
                     WHERE PEOPLE_ID = BG.PEOPLE_ID
                       AND DATE_RUN < SYSDATE);
If the PK is (PEOPLE_ID, DATE_RUN), the query will execute about as quickly as any other method. If they don't form the PK (why not???) then use them to form a unique index. But I'm sure you're already doing one or the other.
Btw, you don't really need the AND DATE_RUN < SYSDATE part of the sub-query if you don't allow future dates to be entered. Some temporal implementations allow for future dates (planned or scheduled events), so I'm used to adding it.
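For reference, that unique index would be created along these lines (the index name is illustrative):
CREATE UNIQUE INDEX PKS_BGCHECK_UX ON PKS_BGCHECK (PEOPLE_ID, DATE_RUN);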

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to try to loop in SQL Server in a stored procedure. But I haven't quite found what I am trying to do. We are using a data connection that can be linked directly into Excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs that have an order right before an event of type 38 or 40, but only if there is no other order between that event and the order from the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
select odate, custId into temp1 from orders where odate > '5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
select eventdate, temp1.custID into temp2
from LogEvent inner join temp1 on temp1.custID = LogEvent.custID
where EventType in (38,40) and temp1.odate > eventdate
order by eventdate desc
The problem here is that the queries I am trying to run return all rows for each of the customers from the first query, where I only want the latest for each customer. This is where, on the client side, I would loop to keep only the latest event instead of all the old ones. But as the whole query has to run inside Excel, I can't really loop client side.
The third step could then use the results from the second query to check whether the event occurred between the most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
select ordernum, shopcart.custID
from shopcart right outer join temp2 on shopcart.custID = temp2.custID
where shopcart.odate >= temp2.eventdate and ordernum is null
Is there a way to simplify this and make it set-based so it runs in SQL Server, instead of some kind of loop performed at the client?
This is a great example of switching to set-based logic.
First, I combined all three of your queries into a single query. In general, a single query lets the query optimizer do what it does best: determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (select eventdate, temp1.custID,
             row_number() over (partition by temp1.custID order by eventdate desc) as seqnum
      from LogEvent inner join
           (select odate, custId
            from orders
            where odate > '5/1/12'
           ) temp1
           on temp1.custID = LogEvent.custID
      where EventType in (38,40) and temp1.odate > eventdate
     ) temp2 left outer join
     ShopCart
     on shopcart.custID = temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, with the table spelled as orders: from order would generate a syntax error, since ORDER is a reserved SQL word, and it is bad practice to name tables and columns with reserved words anyway.
If you are using a newer version of SQL Server (2005 or later), you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK; however, that will produce duplicate rankings for ties, whereas ROW_NUMBER guarantees no duplicate numbers within a partition.
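To see the difference, compare the two functions side by side on the same partition (a sketch: if one customer has two events sharing the top eventdate, rnk comes out 1, 1, 3 while rn comes out 1, 2, 3):
SELECT custID, eventdate,
       RANK()       OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rnk,
       ROW_NUMBER() OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rn
FROM LogEvent;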

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
    SELECT COUNT(1) AS cnt
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when no index is on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
    SELECT 1 AS x
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server could now, in theory, stop counting after n + 1 rows per group, it produces the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The two greatest costs in your query are the re-ordering for the GROUP BY (due to the lack of an appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
    SELECT
        -- ROW_NUMBER() requires an ORDER BY in SQL Server; (SELECT NULL) imposes no particular order
        ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY (SELECT NULL)) AS sequence_id
    FROM
        yourTable
)
SELECT
    COUNT(*)
FROM
    sequenced_data
WHERE
    sequence_id = (n+1)
Assumes SQLServer2005+
Without an index, the GROUP BY solution is the best: every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
    SELECT TOP(1) 1 AS x
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but the worst case will take as long as the original approach). I have to say, though, that this is a somewhat imperative way of thinking - not sure if it's correct...
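The same early-exit idea can also be written with EXISTS, which states the intent more directly (a sketch, with the same semantics as the TOP(1) version):
IF EXISTS (SELECT 1
           FROM [table]
           GROUP BY c
           HAVING COUNT(1) > n)
    SELECT 1 AS has_duplicates;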
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
    SELECT a.aFK,
           ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
    FROM Address a
) SELECT COUNT(*) FROM duplicateRows WHERE DuplicateCount = n + 1
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.