Delete using CTE slower than using temp table in Postgres - sql

I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).
This is Postgres 9.5.
The CTE version is this:
with targets as
(
    select
        id,
        row_number() over (partition by uuid order by created_date desc) as rn
    from customer
)
delete from customer
where id in
(
    select id
    from targets
    where rn > 1
);
I killed that version this morning after running for over an hour.
The temp table version is this:
create temp table targets as
select
    id,
    row_number() over (partition by uuid order by created_date desc) as rn
from customer;

delete from customer
where id in
(
    select id
    from targets
    where rn > 1
);
This version finishes in about 7 seconds.
Any idea what may be causing this?

The CTE is slower because it has to be executed unaltered (via a CTE scan).
TFM (section 7.8.2) states:
Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output.
Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output.
It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.
The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is nowadays usually faster than the EXISTS() variant.
DELETE FROM customer del
USING ( SELECT id
             , row_number() over (partition by uuid order by created_date desc) as rn
        FROM customer
      ) sub
WHERE sub.id = del.id
  AND sub.rn > 1
;
Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.
CREATE TEMP VIEW targets
AS SELECT id
, row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;
EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
FROM targets
WHERE rn > 1
);
[UPDATED: I was wrong about CTEs needing to be executed to completion in all cases; that only applies to data-modifying CTEs.]

Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTEs, but the result is likely held in memory, so if your server is memory-starved and the result set of your CTE is very large, you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
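If you do want to see inside Postgres where the time goes rather than only watching the OS, here is one hedged sketch (table and column names taken from the question; the ANALYZE variant really executes the delete, hence the transaction and rollback):
-- Plan only, returns immediately: in the CTE version look for "CTE Scan on targets",
-- the materialization step discussed in the answer above.
EXPLAIN
WITH targets AS (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
)
DELETE FROM customer
WHERE id IN (SELECT id FROM targets WHERE rn > 1);

-- Actual timings, row counts and buffer usage; this really runs the delete,
-- so wrap it in a transaction and roll back.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
WITH targets AS (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
)
DELETE FROM customer
WHERE id IN (SELECT id FROM targets WHERE rn > 1);
ROLLBACK;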
An alternative way of doing that delete, which might be faster than both of your methods (written here in Postgres syntax, since the question is about Postgres 9.5):
DELETE FROM customer c
WHERE EXISTS (SELECT 1
              FROM customer c2
              WHERE c2.uuid = c.uuid
                AND c2.created_date > c.created_date);
That won't handle situations where you have exact matches with created_date, but that can be solved by adding the id to the subquery as well.
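A minimal sketch of that tie-breaking variant, assuming a higher id can be treated as the row to keep (any other deterministic tie-breaker works just as well):
-- Keeps exactly one row per uuid even when created_date values are identical,
-- by falling back to id when the dates tie.
DELETE FROM customer c
WHERE EXISTS (SELECT 1
              FROM customer c2
              WHERE c2.uuid = c.uuid
                AND (c2.created_date > c.created_date
                     OR (c2.created_date = c.created_date AND c2.id > c.id)));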

Related

Query Optimization with ROW_NUMBER

I have this query:
SELECT
PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
PE1.Customer_Ban,
PE1.Subscriber_No,
PE1.Prod_Equip_Cd,
PE1.Prod_Equip_Txt,
PE1.Prod_Equip_Category_Txt--,
-- PE2.ep_rnk ------------------ UNCOMMENT THIS LINE
FROM
INT_ADM.Product_Equipment_Dim PE1
INNER JOIN
(
SELECT
PRODUCT_EQUIPMENT_KEY,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE2
) PE2
ON PE2.PRODUCT_EQUIPMENT_KEY = PE1.PRODUCT_EQUIPMENT_KEY
WHERE
Line_Of_Business_Cd = 'M'
AND /*v_Date_Start*/ TO_DATE( '2022/01/12', 'yyyy/mm/dd' ) BETWEEN Start_Dt AND End_Dt
AND Current_Ind = 'Y'
If I run it as you see it then it runs in under a second.
If I run it with -- PE2.ep_rnk ------------------ UNCOMMENT THIS LINE uncommented then the query takes up to 5 minutes to complete.
I know it's something to do with ROW_NUMBER() but after looking all over online I can't find a good explanation and solution. Does anyone know why uncommenting that line makes the query so slow, and what I can do about it so it runs fast?
Much appreciate your help in advance.
The root cause is that even if the predicate in the WHERE clause allows efficient access to the rows of the table (and I suspect your sub-second response is just the time to get the first page of the result), the subquery still has to access all rows of the table, window-sort them, and finally join them back to the first row source.
So when you comment out ep_rnk, Oracle is smart enough not to evaluate the subquery at all, because the subquery is on the same table and the join is on the primary key, so no row can be lost or duplicated by the join.
What can you improve?
Not much. If the WHERE condition filters the table very restrictively (and you end up with only a small number of PRODUCT_EQUIPMENT_KEY values), apply the same filter in the subquery:
(
SELECT
PRODUCT_EQUIPMENT_KEY,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE2
-- filter added
WHERE PRODUCT_EQUIPMENT_KEY in (
SELECT PRODUCT_EQUIPMENT_KEY
FROM INT_ADM.Product_Equipment_Dim
WHERE ... same predicate as in the main query ...
)
) PE2
If the predicate returns all (or most) of the PRODUCT_EQUIPMENT_KEY values, the only (commonly used) way is to pre-calculate the rank, e.g. in a materialized view.
The materialized view would be defined as follows:
SELECT
PE1.PRODUCT_EQUIPMENT_KEY, -- primary key
PE1.Customer_Ban,
PE1.Subscriber_No,
PE1.Prod_Equip_Cd,
PE1.Prod_Equip_Txt,
PE1.Prod_Equip_Category_Txt,
ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM
INT_ADM.Product_Equipment_Dim PE1
and you simply query from it, without a join.
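A hedged sketch of what that could look like; the view name and refresh options are my assumptions, and a ROW_NUMBER over the whole table generally rules out a fast refresh, hence REFRESH COMPLETE:
CREATE MATERIALIZED VIEW INT_ADM.Product_Equipment_Rnk_MV   -- hypothetical name
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT
    PE1.PRODUCT_EQUIPMENT_KEY,   -- primary key
    PE1.Customer_Ban,
    PE1.Subscriber_No,
    PE1.Prod_Equip_Cd,
    PE1.Prod_Equip_Txt,
    PE1.Prod_Equip_Category_Txt,
    ROW_NUMBER() OVER (PARTITION BY Customer_Ban, Subscriber_No ORDER BY Start_Dt ASC) AS ep_rnk
FROM INT_ADM.Product_Equipment_Dim PE1;
-- To keep filtering on Line_Of_Business_Cd, Start_Dt, End_Dt and Current_Ind without
-- joining back to the base table, those columns would need to be added to the select
-- list above as well.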

SQL Eliminate Duplicates with NO ID

I have a table with the following Columns...
Node, Date_Time, Market, Price
I would like to delete all but 1 record for each Node, Date_Time combination.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
Group BY Node, Date_Time
That gets the results I would like to see, but I can't figure out how to remove the other records.
Note - There is no ID for this table
Here are steps that are more of a workaround than a single command, but they will work in any relational database (a sketch follows these steps):
Create new table that looks just like the one you already have
Insert the data computed by your group-by query to newly created table
Drop the old table
Rename new table to the name the old one used to have
Just remember that locking takes place and you need to have some maintenance time to perform this action.
There are simpler ways to achieve this, but they are DBMS specific.
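A sketch of those steps using the table and column names from the question. Keeping MAX(Market) is an assumption (the column that is neither grouped nor aggregated needs some one-value-per-group pick), and the exact CREATE TABLE ... AS and rename syntax varies by DBMS - SQL Server, for instance, uses SELECT ... INTO and sp_rename:
-- 1. + 2. Create the new table and fill it from the group-by query
CREATE TABLE Hourly_Data_New AS
SELECT Node, Date_Time, MAX(Market) AS Market, MAX(Price) AS Price
FROM Hourly_Data
GROUP BY Node, Date_Time;

-- 3. Drop the old table
DROP TABLE Hourly_Data;

-- 4. Rename the new table to the old name
ALTER TABLE Hourly_Data_New RENAME TO Hourly_Data;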
Here is an easy SQL Server method that creates a row number within a CTE and deletes from it. I believe this method also works for most RDBMSs that support window functions and common table expressions.
;WITH cte AS (
    SELECT
        *,
        RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
    FROM Hourly_Data
)
DELETE FROM cte
WHERE RowNum > 1

How to retrieve the last 2 records from a table?

I have a table with n records.
How can I retrieve the nth record and the (n-1)th record from my table in SQL without using a derived table?
I have tried using ROWID as
select * from table where rowid in (select max(rowid) from table);
It gives me the nth record, but I want the (n-1)th record as well.
Is there any other method besides using MAX, derived tables, and pseudo-columns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
from t
) t
where seqnum <= 2
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of rowids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on Oracle handling or not handling order by's in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single processor, single threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order bys in subqueries are generally ignored. My advice would be to not depend on the functionality, until Oracle clearly states that it is supported.
select * from (select * from my_table order by rowid) where rownum <= 2
and for rows between N and M (ROWNUM has to be captured under an alias in the inner query, because it is evaluated afresh in the outer one):
select * from (
    select t.*, rownum rnum from (
        select * from my_table order by rowid
    ) t where rownum <= M
) where rnum >= N
Try this
select top 2 * from table order by rowid desc
Assuming rowid is a column in your table:
SELECT * FROM table ORDER BY rowid DESC LIMIT 2

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to try to loop in SQL Server in a stored procedure, but I haven't quite found what I am trying to do. We are using data connectivity that can be linked internally directly into Excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs that have orders right before an event of type 38 or 40, but only if there is no other order between the event and the order from the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
Select odate, custId into temp1 from orders where odate > '5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
Select eventdate, temp1.custID into temp2 from LogEvent inner join temp1 on
temp1.custID=LogEvent.custID where EventType in (38,40) and temp1.odate>eventdate
order by eventdate desc
The problem here is that the queries I am trying to run will return all rows for each of the customers from the first query, where I only want the latest for each customer. This is where, on the client side, I would loop to get only one event instead of all the old ones. But since the whole query has to run inside of Excel, I can't really loop client-side.
The third step could then use the results from the second query to check whether the event occurred between the most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
Select ordernum, shopcart.custID from shopcart right outer join temp2 on
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null
Is there a way to simplify this and make it set-based to run in SQL Server, instead of some kind of loop performed at the client?
This is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, having a single query lets the query optimizer do what it does best -- determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (Select eventdate, temp1.custID,
row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
from LogEvent inner join
(Select odate, custId
from order
where odate>'5/1/12'
) temp1
on temp1.custID=LogEvent.custID
where EventType in (38,40) and temp1.odate>eventdate
) temp2 left outer join
ShopCart
on shopcart.custID=temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.
If you are using a newer version of SQL Server, then you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
    SELECT
        eventdate, temp1.custID,
        ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate DESC) AS CustomerRanking
    FROM LogEvent
    JOIN temp1
        ON temp1.custID = LogEvent.custID
    WHERE EventType IN (38,40) AND temp1.odate > eventdate
)
SELECT * INTO temp2 FROM myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK; however, that will produce duplicate values for ties, whereas ROW_NUMBER guarantees no duplicate numbers within a partition.
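A small illustration of that difference on tied values, assuming SQL Server 2008 or later for the VALUES row constructor (the dates are made up):
SELECT eventdate,
       ROW_NUMBER() OVER (ORDER BY eventdate DESC) AS rn,
       RANK()       OVER (ORDER BY eventdate DESC) AS rnk
FROM (VALUES ('2012-05-03'), ('2012-05-03'), ('2012-05-01')) AS v(eventdate);
-- rn:  1, 2, 3
-- rnk: 1, 1, 3   (filtering on rnk = 1 would keep both tied rows)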

Deleting non-distinct rows

I have a table that has a unique non-clustered index, and 4 of the columns are listed in this index. I want to update a large number of rows in the table. If I do so, they will no longer be distinct, and therefore the update fails because of the index.
I want to disable the index and then delete the oldest duplicate rows. Here's my query so far:
SELECT t.itemid, t.fieldid, t.version, updated
FROM dbo.VersionedFields w
INNER JOIN
(
    SELECT itemid, fieldid, version, COUNT(*) AS QTY
    FROM dbo.VersionedFields
    GROUP BY itemid, fieldid, version
    HAVING COUNT(*) > 1
) t
    ON w.itemid = t.itemid AND w.fieldid = t.fieldid AND w.version = t.version
The SELECT inside the inner join returns the right number of records that we want to delete, but groups them, so the join actually returns twice as many rows.
After the join it shows all the records, but all I want to delete is the oldest ones.
How can this be done?
If you say SQL (Structured Query Language), but really mean SQL Server (the Microsoft relational database system) by it, and if you're using SQL Server 2005 or newer, you can use a CTE (Common Table Expression) for this purpose.
With this CTE, you can partition your data by some criteria - e.g. your ItemId (or a combination of columns) - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - most likely version (or some other column).
So try something like this:
;WITH PartitionedData AS
(
    SELECT
        itemid, fieldid, version,
        ROW_NUMBER() OVER (PARTITION BY ItemId ORDER BY version DESC) AS 'RowNum'
    FROM dbo.VersionedFields
)
DELETE FROM PartitionedData
WHERE RowNum > 1
Basically, you're partitioning your data by some criteria and numbering each partition, starting at 1 for each new partition, ordered by some other criteria (e.g. Date or Version).
So for each "partition" of data, the "newest" entry has RowNum = 1, and any others that belongs into the same partition (by means of having the same partitino values) will have sequentially numbered values from 2 up to however many rows there are in that partition.
If you want to keep only the newest entry - delete anything with a RowNum larger than 1 and you're done!
In SQL Server 2005 and above:
WITH q AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version ORDER BY updated DESC) AS rn
FROM versionedFields
)
DELETE
FROM q
WHERE rn > 1
Try something like:
DELETE w FROM dbo.VersionedFields w WHERE w.version < (SELECT MAX(version) FROM dbo.VersionedFields)
Of course, you'd want to limit the MAX(version) to only the versions of the field you're wanting to delete.
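A sketch of that per-field restriction, assuming itemid and fieldid together identify "the same field" and that version increases with newer rows:
-- Correlated MAX keeps the highest version per (itemid, fieldid) pair and
-- deletes everything older (assumption: ties on version do not occur).
DELETE w
FROM dbo.VersionedFields w
WHERE w.version < (SELECT MAX(v.version)
                   FROM dbo.VersionedFields v
                   WHERE v.itemid = w.itemid
                     AND v.fieldid = w.fieldid);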
You probably need to look at this Stack Overflow answer (delete earlier of duplicate rows).
Essentially, the technique uses grouping (or, optionally, windowing) to find the minimum id value in a group in order to delete it. It may be more accurate to delete rows where the value <> max(row identifier); a sketch follows the steps below.
So:
Drop unique index
Load data
Delete data using the grouping mechanism (ideally in a transaction, so that you can roll back if there is a mistake), then commit
Recreate the index.
Note that recreating an index on a big table can take a long time.
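For this table, a sketch of that grouping-based delete, assuming itemid/fieldid/version identify a duplicate group and updated is the column that tells the newest row apart (rows that tie on updated would all survive, so check the counts before committing):
BEGIN TRANSACTION;

DELETE w
FROM dbo.VersionedFields w
INNER JOIN
(
    SELECT itemid, fieldid, version, MAX(updated) AS max_updated
    FROM dbo.VersionedFields
    GROUP BY itemid, fieldid, version
) t
    ON  w.itemid  = t.itemid
    AND w.fieldid = t.fieldid
    AND w.version = t.version
WHERE w.updated < t.max_updated;

-- Inspect @@ROWCOUNT / rerun your duplicate check here, then:
COMMIT TRANSACTION;   -- or ROLLBACK TRANSACTION if the numbers look wrong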