Query Optimization with ROW_NUMBER - sql

I have this query:
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt--,
    -- PE2.ep_rnk ------------------ UNCOMMENT THIS LINE
FROM
    INT_ADM.Product_Equipment_Dim PE1
    INNER JOIN
    (
        SELECT
            PRODUCT_EQUIPMENT_KEY,
            ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
        FROM INT_ADM.Product_Equipment_Dim PE2
    ) PE2
    ON PE2.PRODUCT_EQUIPMENT_KEY = PE1.PRODUCT_EQUIPMENT_KEY
WHERE
    Line_Of_Business_Cd = 'M'
    AND /*v_Date_Start*/ TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
    AND Current_Ind = 'Y'
If I run it as you see it then it runs in under a second.
If I run it with the -- PE2.ep_rnk line uncommented then the query takes up to 5 minutes to complete.
I know it has something to do with ROW_NUMBER(), but after looking all over online I can't find a good explanation or solution. Does anyone know why uncommenting that line makes the query so slow, and what I can do about it so it runs fast?
I would much appreciate your help.

The root cause is that, even if the predicate in the WHERE clause allows efficient access to the rows of the table (though I suspect your sub-second response is only the time to get the first page of the result), the subquery has to access all rows of the table, window-sort them, and finally join them back to the first row source.
So if you comment out ep_rnk, Oracle is smart enough to see that it does not need to evaluate the subquery at all, because the subquery is on the same table and the join is on the primary key - so no row can be lost or duplicated in the join.
What can you improve?
Not much. If the WHERE condition filters the table very restrictively (and you end up with only a small number of PRODUCT_EQUIPMENT_KEY values), apply the same filter in the subquery:
(
    SELECT
        PRODUCT_EQUIPMENT_KEY,
        ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
    FROM INT_ADM.Product_Equipment_Dim PE2
    --- filter added
    WHERE PRODUCT_EQUIPMENT_KEY IN (
        SELECT PRODUCT_EQUIPMENT_KEY
        FROM INT_ADM.Product_Equipment_Dim
        WHERE ... same predicate as in the main query ...
    )
) PE2
If the predicate returns all (or most) of the PRODUCT_EQUIPMENT_KEY values, the only (and often used) option is to pre-calculate the rank, e.g. in a materialized view.
The materialized view is defined as follows:
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
    INT_ADM.Product_Equipment_Dim PE1
and you simply query from it, without a join.
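For illustration, a minimal sketch of how such a materialized view might be created and then queried, assuming Oracle syntax, a hypothetical view name Product_Equipment_Rnk_MV, and one possible choice of build/refresh options; the filter columns are added to the view so the original WHERE clause can run directly against it:
CREATE MATERIALIZED VIEW INT_ADM.Product_Equipment_Rnk_MV
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    PE1.Line_Of_Business_Cd,   -- filter columns included so the original WHERE clause still works
    PE1.Start_Dt,
    PE1.End_Dt,
    PE1.Current_Ind,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE1;

-- The report query then reads the pre-calculated rank directly, no self-join needed:
SELECT PRODUCT_EQUIPMENT_KEY, Customer_Ban, Subscriber_No,
       Prod_Equip_Cd, Prod_Equip_Txt, Prod_Equip_Category_Txt, ep_rnk
FROM   INT_ADM.Product_Equipment_Rnk_MV
WHERE  Line_Of_Business_Cd = 'M'
  AND  TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
  AND  Current_Ind = 'Y';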

Related

Delete using CTE slower than using temp table in Postgres

I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).
This is Postgres 9.5.
The CTE version is this:
with targets as
(
select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer
)
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
I killed that version this morning after running for over an hour.
The temp table version is this:
create temp table
targets
as select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer;
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
This version finishes in about 7 seconds.
Any idea what may be causing this?
The CTE is slower because it has to be executed unaltered (via a CTE scan).
TFM (section 7.8.2) states:
Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output.
Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output.
It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.
The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is nowadays usually faster than the EXISTS() variant.
DELETE FROM customer del
USING ( SELECT id
             , row_number() over (partition by uuid order by created_date desc) as rn
        FROM customer
      ) sub
WHERE sub.id = del.id
AND   sub.rn > 1
;
Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.
CREATE TEMP VIEW targets
AS SELECT id
, row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;
EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
FROM targets
WHERE rn > 1
);
[UPDATED: I was wrong about CTEs needing to be always executed to completion; that is only the case for data-modifying CTEs]
Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTEs, but the result set is likely kept in memory, so if your server is memory-starved and the result set of your CTE is very large then you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
An alternative way to do that delete, which might be faster than both of your methods (written in Postgres syntax, since the original SQL Server-style DELETE alias form is not accepted by Postgres):
DELETE FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer c2
    WHERE c2.uuid = c.uuid
      AND c2.created_date > c.created_date
);
That won't handle situations where you have exact matches with created_date, but that can be solved by adding the id to the subquery as well.
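For illustration, a hedged sketch of that tie-breaking variant, assuming id is unique (e.g. the primary key), so exactly one row per uuid survives:
DELETE FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer c2
    WHERE c2.uuid = c.uuid
      AND ( c2.created_date > c.created_date
            OR (c2.created_date = c.created_date AND c2.id > c.id) )  -- break created_date ties on id
);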

SQL Querying Column on Max Date

I apologize if my code is not properly typed. I am trying to query a table that will return the latest bgcheckdate and status report. The table contains additional bgcheckdates and statuses for each record but in my report I only need to see the latest bgcheckdate with its status.
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN) AS DATERUN, BG.STATUS
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID, BG.status;
When I run the above query, I still get multiple background check dates and statuses per person.
Whereas when I run without the status, it works fine:
SELECT BG.PEOPLE_ID, MAX(BG.DATE_RUN)
FROM PKS_BGCHECK BG
GROUP BY BG.PEOPLE_ID;
So I'm just wondering if anyone can help me figure out how to query the date run and status, with both reflecting the latest date.
The best solution depends on which RDBMS you are using.
Here is one with basic, standard SQL:
SELECT bg.PEOPLE_ID, bg.DATE_RUN, bg.STATUS
FROM (
SELECT PEOPLE_ID, MAX(DATE_RUN) AS MAX_DATERUN
FROM PKS_BGCHECK
GROUP BY PEOPLE_ID
) sub
JOIN PKS_BGCHECK bg ON bg.PEOPLE_ID = sub.PEOPLE_ID
AND bg.DATE_RUN = sub.MAX_DATERUN;
But you can get multiple rows per PEOPLE_ID if there are ties.
In Oracle, Postgres or SQL Server and others (but not MySQL) you can also use the window function row_number():
WITH cte AS (
SELECT PEOPLE_ID, DATE_RUN, STATUS
, ROW_NUMBER() OVER(PARTITION BY PEOPLE_ID ORDER BY DATE_RUN DESC) AS rn
FROM PKS_BGCHECK
)
SELECT PEOPLE_ID, DATE_RUN, STATUS
FROM cte
WHERE rn = 1;
This guarantees 1 row per PEOPLE_ID. Ties are resolved arbitrarily. Add more expressions to ORDER BY to break ties deterministically.
In Postgres, the simplest solution would be with DISTINCT ON.
Details for both in this related answer:
Select first row in each GROUP BY group?
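For illustration, a minimal sketch of that DISTINCT ON variant, using the column names from the question:
SELECT DISTINCT ON (PEOPLE_ID)
       PEOPLE_ID, DATE_RUN, STATUS
FROM   PKS_BGCHECK
ORDER  BY PEOPLE_ID, DATE_RUN DESC;  -- one row per PEOPLE_ID, picking the latest DATE_RUN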
Selecting the latest row in a time-sensitive set is fairly easy and largely platform independent:
SELECT BG.PEOPLE_ID, BG.DATE_RUN, BG.STATUS
FROM PKS_BGCHECK BG
WHERE BG.DATE_RUN = (
    SELECT MAX( DATE_RUN )
    FROM PKS_BGCHECK
    WHERE PEOPLE_ID = BG.PEOPLE_ID
      AND DATE_RUN < SYSDATE );
If the PK is (PEOPLE_ID, DATE_RUN), the query will execute about as quickly as any other method. If they don't form the PK (why not???) then use them to form a unique index. But I'm sure you're already doing one or the other.
Btw, you don't really need the AND part of the sub-query if you don't allow future dates to be entered. Some temporal implementations allow for future dates (planned or scheduled events), so I'm used to adding it.
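As a sketch of the unique-index suggestion above (the index name is hypothetical):
CREATE UNIQUE INDEX PKS_BGCHECK_PEOPLE_DATE_UX
    ON PKS_BGCHECK (PEOPLE_ID, DATE_RUN);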

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to try to loop in SQL Server in a stored procedure. But I haven't quite found what I am trying to do. We are using a data connection that can be linked directly into Excel internally.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs who have orders right before an event of type 38,40. But only get them if there is no other order between the event and the order in the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
Select odate, custId into temp1 from orders where odate > '5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
Select eventdate, temp1.custID into temp2 from LogEvent inner join temp1 on
temp1.custID=LogEvent.custID where EventType in (38,40) and temp1.odate>eventdate
order by eventdate desc
The problem here is that the queries I am trying to run return all rows for each of the customers from the first query, where I only want the latest for each customer. This is where, on the client side, I would loop to only get one event instead of all the old ones. But as the whole query has to run inside of Excel, I can't really loop client-side.
The third step then could use the results from the second query to check if the event occurred between the most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
Select ordernum, shopcart.custID from shopcart right outer join temp2 on
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null
Is there a way to simplify this and make it set-based to run in SQL Server, instead of some kind of loop that I'd have to perform at the client?
This is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, having a single query lets the query optimizer do what it does best -- determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (select eventdate, temp1.custID,
             row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
      from LogEvent inner join
           (select odate, custId
            from order
            where odate > '5/1/12'
           ) temp1
           on temp1.custID = LogEvent.custID
      where EventType in (38,40) and temp1.odate > eventdate
     ) temp2 left outer join
     ShopCart
     on shopcart.custID = temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions (dropping only the ORDER BY inside the derived table, which SQL Server does not allow there), even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.
If you are using a newer version of SQL Server, then you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK; however, that will create duplicates for ties, whereas ROW_NUMBER guarantees no duplicate numbers within your partition.
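For illustration only (not part of the original answer), the same CTE with RANK swapped in shows the difference on ties:
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
RANK() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
-- With RANK, two events sharing the same latest eventdate both get rank 1,
-- so this can return more than one row per custID; ROW_NUMBER never does.
SELECT * from myCTE WHERE CustomerRanking = 1;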

Getting the Nth most recent business date - very different query performance using two different methods

I have a requirement to get txns on a T-5 basis. Meaning I need to "go back" 5 business days.
I've coded up two SQL queries for this and the second method is 5 times slower than the first.
How come?
-- Fast
with
BizDays as
( select top 5 bdate
from dbo.business_days
where bdate < '20091211'
order by bdate Desc
)
,BizDate as ( select min(bdate) bdate from BizDays)
select t.* from txns t
join BizDate on t.bdate <= BizDate.bdate
-- Slow
with
BizDays as
( select dense_rank() Over(order by bdate Desc) RN
, bdate
from dbo.business_days
where bdate < '20091211'
)
,BizDate as ( select bdate from BizDays where RN = 5)
select t.* from txns t
join BizDate on t.bdate <= BizDate.bdate
DENSE_RANK does not stop after the first 5 records like TOP 5 does.
Though DENSE_RANK is monotonic and hence theoretically could be optimized to TOP WITH TIES, SQL Server's optimizer is not aware of that and does not do this optimization.
If your business days are unique, you can replace DENSE_RANK with ROW_NUMBER and get the same performance, since ROW_NUMBER is optimized to a TOP.
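For illustration, a sketch of the slow query with ROW_NUMBER swapped in for DENSE_RANK (assuming bdate values are unique in dbo.business_days):
-- Should perform like the TOP 5 version, since ROW_NUMBER is optimized to a TOP
with
BizDays as
( select row_number() Over(order by bdate Desc) RN
, bdate
from dbo.business_days
where bdate < '20091211'
)
,BizDate as ( select bdate from BizDays where RN = 5)
select t.* from txns t
join BizDate on t.bdate <= BizDate.bdate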
Instead of putting the conditions in the WHERE and JOIN clauses, could you perhaps use ORDER BY on your data and then LIMIT offset, rowcount?
The reason this is running so slow is that DENSE_RANK() and ROW_NUMBER() are functions. The engine has to read every record in the table that matches the WHERE clause, apply the function to each row, save the function value, and then get the top 5 from that list.
A "plain" top 5 uses the index on the table to get the first 5 records that meet the WHERE clause. In the best case, the engine may only have to read a couple of index pages. Worst case, it may have to read a few data pages as well. Even without an index, the engine is reading the rows but does not have to execute the function or work with temporary tables.

Are inline queries a bad idea?

I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.
I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.
The table is as follows:
CREATE TABLE generator_logs (
id integer NOT NULL,
site_id character varying(4) NOT NULL,
start timestamp without time zone NOT NULL,
"end" timestamp without time zone NOT NULL,
duration integer NOT NULL
);
And the query:
SELECT id, site_id, start, "end", duration
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start
FROM generator_logs
GROUP BY site_id)
ORDER BY start DESC
There isn't a huge amount of data, so I'm not worried about optimizing this query. However, I do have to do similar things on tables with tens of millions of rows (big tables as far as I'm concerned!), and there optimisation is more important.
So is there a better query for this, and are inline queries generally a bad idea?
Should your query not be correlated? i.e.:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE start = (SELECT MAX(g2.start) AS start
FROM generator_logs g2
WHERE g2.site_id = g1.site_id)
ORDER BY start DESC
Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.
Or alternatively:
SELECT id, site_id, start, "end", duration
FROM generator_logs g1
WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start
FROM generator_logs g2
GROUP BY site_id)
ORDER BY start DESC
I would use joins, as they perform much better than the IN clause:
select gl.id, gl.site_id, gl.start, gl."end", gl.duration
from
generator_logs gl
inner join (
select max(start) as start, site_id
from generator_logs
group by site_id
) gl2
on gl.site_id = gl2.site_id
and gl.start = gl2.start
Also, as Tony pointed out, you were missing correlation in your original query.
In MySQL it could be problematic because, last I checked, it was unable to optimise subqueries effectively (i.e. by query rewriting).
Many DBMSs have genetic query planners which will do the same thing regardless of your input query's structure.
MySQL will in some cases create a temp table for that situation, other times not; depending on the circumstances, indexing and conditions, subqueries can still be rather quick.
Some complain that subqueries are hard to read, but they're perfectly fine if you fork them into local variables.
$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
$q ="
SELECT id, site_id, start, \"end\", duration
FROM generator_logs
WHERE start IN ($maxids)
ORDER BY start DESC
";
This problem - finding not just the MAX, but the rest of the corresponding row - is a common one. Luckily, Postgres provides a nice way to do this with one query, using DISTINCT ON:
SELECT DISTINCT ON (site_id)
id, site_id, start, "end", duration
FROM generator_logs
ORDER BY site_id, start DESC;
DISTINCT ON (site_id) means "return one record per site_id". The order by clause determines which record that is. Note, however, that this is subtly different from your original query - if you have two records for the same site with the same start, your query would return two records, while this returns only one.
A way to find records having the MAX value per group is to select those records for which there is no record within the same group having a higher value:
SELECT id, site_id, "start", "end", duration
FROM generator_logs g1
WHERE NOT EXISTS (
SELECT 1
FROM generator_logs g2
WHERE g2.site_id = g1.site_id
AND g2."start" > g1."start"
);