SQL Eliminate Duplicates with NO ID

I have a table with the following columns:
Node, Date_Time, Market, Price
I would like to delete all but one record for each Node, Date_Time pair.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
GROUP BY Node, Date_Time
That query returns the rows I would like to keep, but I can't figure out how to remove the other records.
Note - There is no ID for this table

Here are steps that are more of a workaround than a single command, but they work in any relational database (a sketch follows below):
1. Create a new table that looks just like the one you already have.
2. Insert the data computed by your GROUP BY query into the newly created table.
3. Drop the old table.
4. Rename the new table to the name the old one used to have.
Just remember that locking takes place, so you need some maintenance time to perform this action. There are simpler ways to achieve this, but they are DBMS-specific.
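A hedged sketch of those steps for the table in this question (the column types, the MAX(Market) choice, and the rename syntax are all assumptions; the original query never said which Market value to keep):
-- 1. Create a new table with the same shape as the old one (types assumed)
CREATE TABLE Hourly_Data_New (
    Node      VARCHAR(50),
    Date_Time DATETIME,
    Market    VARCHAR(50),
    Price     DECIMAL(18, 4)
);

-- 2. Insert one row per (Node, Date_Time) group
-- note: MAX(Market) may come from a different row than MAX(Price)
INSERT INTO Hourly_Data_New (Node, Date_Time, Market, Price)
SELECT Node, Date_Time, MAX(Market), MAX(Price)
FROM Hourly_Data
GROUP BY Node, Date_Time;

-- 3. Drop the old table
DROP TABLE Hourly_Data;

-- 4. Rename the new table (SQL Server uses sp_rename instead)
ALTER TABLE Hourly_Data_New RENAME TO Hourly_Data;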

Here is an easy SQL Server method that creates a row number within a CTE and deletes from it. I believe this method also works for most RDBMSs that support window functions and common table expressions.
;WITH cte AS (
    SELECT *,
           RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
    FROM Hourly_Data
)
DELETE FROM cte
WHERE RowNum > 1

Related

Handling duplicates in BigQuery (Nested Table)

I think this is a very simple question, but I would like some guidance. I didn't want to have to drop the table and create a new one just to load the deduplicated records. Is it possible in BigQuery to use something like DELETE FROM based on the query below? PS: this is a nested table!
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id, date_register) AS row_number
  FROM dataset.table)
WHERE row_number = 1
ORDER BY id, date_register
To de-duplicate in place, without re-creating the table - use MERGE:
MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
It's simpler than the current accepted answer, as it won't ask you to match the current partitioning or clustering - it will just respect it.
Update: please also check Felipe Hoffa's answer, which is simpler, and learn more in this post: BigQuery Deduplication.
You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:
CREATE OR REPLACE TABLE your_table
PARTITION BY DATE(date_register)
AS
SELECT * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id, date_register) AS row_number
  FROM your_table)
WHERE row_number = 1
If you don't have a partition field defined at the source, I recommend creating the new table with a partition field so this query works and you can automate the process.

Delete using CTE slower than using temp table in Postgres

I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).
This is Postgres 9.5.
The CTE version is this:
with targets as
(
    select id,
           row_number() over (partition by uuid order by created_date desc) as rn
    from customer
)
delete from customer
where id in
(
    select id
    from targets
    where rn > 1
);
I killed that version this morning after running for over an hour.
The temp table version is this:
create temp table targets as
select id,
       row_number() over (partition by uuid order by created_date desc) as rn
from customer;

delete from customer
where id in
(
    select id
    from targets
    where rn > 1
);
This version finishes in about 7 seconds.
Any idea what may be causing this?
The CTE is slower because it has to be executed unaltered (via a CTE scan).
TFM (section 7.8.2) states:
Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output.
Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output.
It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.
The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is nowadays usually faster than the EXISTS() variant.
DELETE FROM customer del
USING ( SELECT id,
               row_number() over (partition by uuid order by created_date desc) as rn
        FROM customer
      ) sub
WHERE sub.id = del.id
  AND sub.rn > 1;
Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.
CREATE TEMP VIEW targets AS
SELECT id,
       row_number() over (partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;

EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
              FROM targets
              WHERE rn > 1
            );
[UPDATE: I was wrong about CTEs always needing to be executed to completion; that is only the case for data-modifying CTEs.]
Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTEs, but the result is likely held in memory, so if your server is memory-starved and the result set of your CTE is very large, you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
An alternative way of doing that delete, which might be faster than both of your methods:
DELETE FROM customer c
WHERE EXISTS (SELECT 1
              FROM customer c2
              WHERE c2.uuid = c.uuid
                AND c2.created_date > c.created_date);
That won't handle situations where you have exact matches with created_date, but that can be solved by adding the id to the subquery as well.
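For instance, a minimal sketch of that tiebreaker (assuming id increases with insertion order, so the highest id among exact created_date ties is the one kept):
DELETE FROM customer c
WHERE EXISTS (SELECT 1
              FROM customer c2
              WHERE c2.uuid = c.uuid
                AND (c2.created_date > c.created_date
                     -- break exact created_date ties using the id
                     OR (c2.created_date = c.created_date AND c2.id > c.id)));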

SQL ROW_NUMBER function is not working after being created in a view

I hope this isn't a repeated discussion. I did search through the forums and didn't find anything that related to my problem. I love this site by the way. It has helped me for a couple years now and I usually will get all my questions solved by searching this site.
Anyway, I am running into an issue in SQL with the ROW_NUMBER() function. I created a view that joins another view and a table, and in it I pulled fields from both. I also created two computed fields: a ROW_NUMBER() field called seq_number and another field called seq_alpha.
The seq_number field is:
ROW_NUMBER() over (order by book_date, room, start) as seq_number,
The seq_alpha field is a CASE expression based on the row number that gives a letter instead.
For example
case ROW_NUMBER() over (order by book_date, Room, Start)
    when 1 then 'A'
    when 2 then 'B'
    when 3 then 'C'
    ....
end as seq_alpha
When I created the view for testing purposes I used a WHERE clause, and everything worked exactly the way it should. I then commented out the WHERE clause and had the view created.
Then, after the view was created, I queried it using the same WHERE clause that worked during testing:
select *
from created_view
where (same as I used for testing)
But now the seq_number field is computed over the entire view instead of the rows left after the WHERE clause filters the results. All the other data pulls correctly, but the seq_number and seq_alpha fields don't. So instead of the seq number running 1-22 across my 22 results, it is in the 400,000 range, and the seq_alpha field doesn't display anything at all because my CASE only went up to 51.
Has anyone had similar issues where a row_number field queried from a view is not filtered by the WHERE clause?
Thanks for your help in advance!
EDIT
After Mikael's response it seems unlikely that I can create a row_number field in a view, query the view afterward, and have it work the way I want. So, my next question is: is there a way to create an alpha sequence based on the row number that I can put in a view, query later, and have it work correctly with the WHERE clause? Or do I just need to create the alpha sequence field every time I pull from this view?
J
row_number() enumerates the rows it sees, and it is evaluated after the where clause of the query it appears in has been applied.
This will enumerate the whole table
select ID,
row_number() over(order by ID) as rn
from YourTable
and this will enumerate all rows where ID < 10
select ID,
row_number() over(order by ID) as rn
from YourTable
where ID < 10
If you create a view that queries the whole table
create view ViewYourTable as
select ID,
row_number() over(order by ID) as rn
from YourTable
and then query the view with a where clause ID < 10
select T.*
from ViewYourTable as T
where T.ID < 10
you are doing the same as if you used a derived table.
select T.*
from (
select ID,
row_number() over(order by ID) as rn
from YourTable
) as T
where T.ID < 10
The where clause is applied after the rows are enumerated.
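Regarding the EDIT: one option consistent with this answer is to compute the alpha sequence at query time, after the filter, rather than storing it in the view. A minimal sketch, assuming SQL Server, at most 26 filtered rows, and the ordering columns from the question (the WHERE clause is a hypothetical placeholder):
select T.*,
       -- map 1 -> 'A', 2 -> 'B', ...; extend with a CASE if more than 26 rows are possible
       CHAR(ASCII('A') - 1 + CAST(ROW_NUMBER() OVER (ORDER BY book_date, room, start) AS int)) as seq_alpha
from created_view as T
where T.book_date = '20130101'  -- your real filter goes here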

Deleting Duplicate Records from a Table

I have a table called Table1 which has 48 records, out of which only 24 should be there. For some reason I got duplicate records inserted into it. How do I delete the duplicate records from that table?
Here's something you might try if your SQL Server version is 2005 or later.
WITH cte AS
(
    SELECT {list-of-columns-in-table},
           row_number() over (PARTITION BY {list-of-key-columns} ORDER BY {rule-to-determine-row-to-keep}) as sequence
    FROM myTable
)
DELETE FROM cte
WHERE sequence > 1
This uses a common table expression (CTE) and adds a sequence column. {list-of-columns-in-table} is just what it states; not all columns are needed, but I won't explain that here.
{list-of-key-columns} is the set of columns that you use to define what counts as a duplicate.
{rule-to-determine-row-to-keep} is an ordering such that the first row in each partition is the row to keep. For example, if you want to keep the oldest row, you would order by a date column.
Here's an example of the query with real columns.
WITH cte AS
(
    SELECT ID, CourseName, DateAdded,
           row_number() over (PARTITION BY CourseName ORDER BY DateAdded) as sequence
    FROM Courses
)
DELETE FROM cte
WHERE sequence > 1
This example removes duplicate rows based on the CourseName value and keeps the oldest based on the DateAdded value.
http://support.microsoft.com/kb/139444
This section is the key. The primary point you should take away. ;)
This article discusses how to locate and remove duplicate primary keys from a table. However, you should closely examine the process which allowed the duplicates to happen in order to prevent a recurrence.
Identify your records by grouping the data by your logical keys (since you obviously haven't defined them) and applying a HAVING COUNT(*) > 1 clause at the end. The article goes into this in depth.
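For example, a minimal sketch with hypothetical logical-key columns Col1 and Col2:
SELECT Col1, Col2, COUNT(*) AS DupeCount
FROM Table1
GROUP BY Col1, Col2
HAVING COUNT(*) > 1   -- each row returned represents one group of duplicates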
Here is an easier way:
Select * Into #TempTable FROM YourTable
Truncate Table YourTable
Insert into YourTable Select Distinct * from #TempTable
Drop Table #TempTable

Remove duplicate rows - Impossible to find a decisive answer

You'd think I came straight here to ask my question, but I googled an awful lot and still could not find a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key and thus unique.
I have to remove all rows where columns 2 through 11 are duplicated. In essence a basic question, but there are so many different approaches, even though everyone seeks the same solution in the end: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
L
As a generic answer:
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
    FROM Table)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates; you're still left with the 'trivial' task of choosing a 'keeper' among them.
ROW_NUMBER OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep the first 3 of the duplicates', not only 'keep just 1', which is really hard to do with GROUP BY ... HAVING.
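For example, the 'keep the first 3' variant is just a different filter on the same generic CTE:
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
    FROM Table)
DELETE FROM cte
WHERE rn > 3;  -- keep the first three rows of each group instead of one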
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches: delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
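A hedged sketch of that batching loop in T-SQL (the placeholders and the 10000 batch size are carried over from the generic answer above; the DECLARE initializer assumes SQL Server 2008+):
DECLARE @deleted INT = 1;
WHILE @deleted > 0
BEGIN
    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
        FROM Table)
    DELETE TOP (10000) FROM cte
    WHERE rn > 1;
    SET @deleted = @@ROWCOUNT;  -- stop once a batch removes nothing
END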
And the final question is whether this will perform acceptably. That depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, it won't perform. An appropriate index will help, but I can't say more without knowing the exact schema involved (structure of the clustered index/heap, all non-clustered indexes, etc.).
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
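A minimal sketch of that approach, with invented names (id as the pk field, col2 through col11 as the columns that should be unique):
SELECT t.*
INTO YourTable_Deduped          -- new table, one row per distinct key combination
FROM YourTable t
JOIN (SELECT MIN(id) AS keep_id -- the aggregate picks the keeper in each group
      FROM YourTable
      GROUP BY col2, col3, col4, col5, col6, col7, col8, col9, col10, col11) k
  ON t.id = k.keep_id;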
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide much details about your tables - so let me just give you a sample:
;WITH Duplicates AS
(
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
    FROM dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (WITH ... AS (...)) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH..... expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with creating a new table, then just use DELETE TOP(1). Use a subquery to get the key values that have duplicate rows, then use DELETE TOP to remove one row at a time. You might have to run it more than once if there is more than one duplicate, but you get the point.
DELETE TOP(1) FROM Table
WHERE Field IN (SELECT Field FROM Table GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.
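Since that's pseudo code, here is a hedged sketch of what the repeat-until-clean loop might look like in T-SQL (YourTable and Field are hypothetical names; note this keeps an arbitrary row from each duplicate group and is slow for large tables):
WHILE EXISTS (SELECT 1 FROM YourTable GROUP BY Field HAVING COUNT(*) > 1)
BEGIN
    -- delete a single row from among the duplicated groups, then re-check
    DELETE TOP (1) FROM YourTable
    WHERE Field IN (SELECT Field FROM YourTable
                    GROUP BY Field HAVING COUNT(*) > 1);
END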