SQL to delete the duplicates in a table - sql

I have a table transaction which has duplicates. i want to keep the record that had minimum id and delete all the duplicates based on four fields DATE, AMOUNT, REFNUMBER, PARENTFOLDERID. I wrote this query but i am not sure if this can be written in an efficient way. Do you think there is a better way? I am asking because i am worried about the run time.
DELETE FROM TRANSACTION
WHERE ID IN
(SELECT FIT2.ID
FROM
(SELECT MIN(ID) AS ID, FIT.DATE, FIT.AMOUNT, FIT.REFNUMBER, FIT.PARENTFOLDERID
FROM EWORK.TRANSACTION FIT
GROUP BY FIT.DATE, FIT.AMOUNT , FIT.REFNUMBER, FIT.PARENTFOLDERID
HAVING COUNT(1)>1 and FIT.AMOUNT >0) FIT1,
EWORK.TRANSACTION FIT2
WHERE FIT1.DATE=FIT2.DATE AND
FIT1.AMOUNT=FIT2.AMOUNT AND
FIT1.REFNUMBER=FIT2.REFNUMBER AND
FIT1.PARENTFOLDERID=FIT2.PARENTFOLDERID AND
FIT1.ID<>FIT2.ID)

It would probably be more efficient to do something like
DELETE FROM transaction t1
WHERE EXISTS( SELECT 1
FROM transaction t2
WHERE t1.date = t2.date
AND t1.refnumber = t2.refnumber
AND t1.parentFolderId = t2.parentFolderId
AND t2.id > t1.id )

DELETE FROM transaction
WHERE ID IN (
SELECT ID
FROM (SELECT ID,
ROW_NUMBER () OVER (PARTITION BY date
,amount
,refnumber
,parentfolderid
ORDER BY ID) rn
FROM transaction)
WHERE rn <> 1);
I will try like this

I would try something like this:
DELETE transaction
FROM transaction
LEFT OUTER JOIN
(
SELECT MIN(id) as id, date, amount, refnumber, parentfolderid
FROM transaction
GROUP BY date, amount, refnumber, parentfolderid
) as validRows
ON transaction.id = validRows.id
WHERE validRows.id IS NULL

Related

SQL - delete record where sum = 0

I have a table which has below values:
If Sum of values = 0 with same ID I want to delete them from the table. So result should look like this:
The code I have:
DELETE FROM tmp_table
WHERE ID in
(SELECT ID
FROM tmp_table WITH(NOLOCK)
GROUP BY ID
HAVING SUM(value) = 0)
Only deletes rows with ID = 2.
UPD: Including additional example:
Rows in yellow needs to be deleted
Your query is working correctly because the only group to total zero is id 2, the others have sub-groups which total zero (such as the first two with id 1) but the total for all those records is -3.
What you're wanting is a much more complex algorithm to do "bin packing" in order to remove the sub groups which sum to zero.
You can do what you want using window functions -- by enumerating the values for each id. Taking your approach using a subquery:
with t as (
select t.*,
row_number() over (partition by id, value order by id) as seqnum
from tmp_table t
)
delete from t
where exists (select 1
from t t2
where t2.id = t.id and t2.value = - t.value and t2.seqnum = t.seqnum
);
You can also do this with a second layer of window functions:
with t as (
select t.*,
row_number() over (partition by id, value order by id) as seqnum
from tmp_table t
),
tt as (
select t.*, count(*) over (partition by id, abs(value), seqnum) as cnt
from t
)
delete from tt
where cnt = 2;

Delete Distinct column and latest date of other column

I have a table where the primary key is a composite key of ID and date. Is there a way that I can delete a single row where ID matches and the date is the latest date?
I am new to SQL, so I have tried a few things, but I either don't get the results I am looking for or cant get the syntax correct
DELETE FROM Master
WHERE ((Identifier = 'SomeID')
AND (EffectiveDate = MAX(EffectiveDate));
There are multiple columns with the same ID, but different dates, ie.
ID EffectiveDate
-------------------------
A '2019-09-18'
A '2019-09-17'
A '2019-09-16'
Is there a way I can delete only the row with A | '2019-09-18'?
You can use window functions and an updatable CTE:
with todelete as (
select t.*, row_number() over (partition by id order by effective_date desc) as seqnum
from t
)
delete from todelete
where seqnum = 1;
Note: If you want to limit this to a single id, then be sure to include a where id = 'a' in either the subquery or outer query.
use row_number()
delete from (select *, row_number() over(partition by id order by effectivedate desc) rn from table_name
) a where a.rn=1
A correlated subquery might get the job done:
DELETE FROM Master
WHERE
Identifier = 'SomeID'
AND EffectiveDate = (
SELECT MAX(EffectiveDate) FROM Master WHERE Identifier = 'SomeID'
)
;
Use the CTE Function to Delete the Row but the below Query will not delete the Record of Max Date of those ID's where Single Record exist against that.
with todelete as (
select t.*, row_number() over (partition by id order by effective_date desc) as seqnum
from t
)
delete from todelete
where seqnum = 1 and id in(select distinct id from todelete where seqnum<>1)
With correlated subquery for all IDs:
delete table1
from table1 t1
where t1.EffectiveDate =
(
select max(t2.EffectiveDate)
from table1 t2
where t2.ID = t1.ID
)

How to delete a record by its OID?

I have a weird case which I have no idea how it happened.
This is my table:
id
date
amount
where id can not be NULL and is auto increasing.
Someone last year the system created the following situation:
OID id date amount
710604512 197 2015-03-11 10657.61
710604513 197 2015-03-11 10657.61
This causes huge problems as id should be unique.
I can't fix this from regular SQL because any action I'll do will be done on both rows.
One of them needs to be deleted.
The solution of deleting both and inserting one is unacceptable as I can not play with the dates (it records the date of creation and the logs will show it)
How can I delete the row by its OID?
If you want to delete the record with the smaller OID should duplicates occur, then you can try this:
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id, date, amount ORDER BY OID DESC) AS rn
FROM yourTable
)
DELETE FROM cte WHERE rn=2; -- or rn >=2 to delete all duplicates
To delete the record with the greater OID, just change the ORDER BY clause to this:
ORDER BY OID
Assuming you have want to keep the row with max OID for each ID, you can use this:
delete
from your_table t1
using (
select id, max(OID)
from your_table
group by id
) t2
where t1.id = t2.id and t1.OID <> t2.OID;
Or:
delete
from your_table t1
where exists (
select 1
from your_table t2
where t1.id = t2.id
and t1.OID < t2.OID
);
If id, date, amount are the business key in your case, you can remove all records beyond the second by grouping by these columns. Something like this:
DELETE FROM theTable
WHERE OID IN (
SELECT OID
FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY id, date, amount) AS RowNo, OID
FROM tab) x
WHERE x.RowNo > 1);
Note: this should work regardless of the number of duplicates.

How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:
year, user_id, sid, cid
The unique constraint is currently the first field called "id", however I am now looking to add a constraint to make sure the year, user_id, sid and cid are all unique but I cannot apply the constraint because duplicate values already exist which violate this constraint.
Is there a way to find all duplicates?
The basic idea will be using a nested query with count aggregation:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
There is another good solution for that mentioned in the comments, (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
From "Find duplicate rows with PostgreSQL" here's smart solution:
select * from (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
FROM tbl
) dups
where
dups.Row > 1
In order to make it easier I assume that you wish to apply a unique constraint only for column year and the primary key is a column named id.
In order to find duplicate values you should run,
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
Using the sql statement above you get a table which contains all the duplicate years in your table. In order to delete all the duplicates except of the the latest duplicate entry you should use the above sql statement.
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE_AGAIN B
WHERE A.year=B.year AND A.id<B.id;
You can join to the same table on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1) and then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, you will group the results by the tn1.id field. This will produce a result set that contains the the id of a record and an array of all the id's that fit the join conditions.
select tn1.id,
array_agg(tn2.id) as duplicate_entries,
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id;
Obviously, id's that will be in the duplicate_entries array for one id, will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of 'truth.' The one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
select tn1.id,
array_agg(tn2.id) as duplicate_entries,
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists
(select de from unnest(ds.duplicate_entries) as de where de < ds.id)
Selects the lowest number ID's that have duplicates (assuming the ID is increasing int PK). These would be the ID's that you would keep around.
Inspired by Sandro Wiggers, I did something similiar to
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id
FROM ordered
WHERE rnk > 1
)
DELETE
FROM user_links
USING to_delete
WHERE user_link.id = to_delete.id;
If you want to test it, change it slightly:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id,year,user_id,sid, cid
FROM ordered
WHERE rnk > 1
)
SELECT * FROM to_delete;
This will give an overview of what is going to be deleted (there is no problem to keep year,user_id,sid,cid in the to_delete query when running the deletion, but then they are not needed)
In your case, because of the constraint you need to delete the duplicated records.
Find the duplicated rows
Organize them by created_at date - in this case I'm keeping the oldest
Delete the records with USING to filter the right rows
WITH duplicated AS (
SELECT id,
count(*)
FROM products
GROUP BY id
HAVING count(*) > 1),
ordered AS (
SELECT p.id,
created_at,
rank() OVER (partition BY p.id ORDER BY p.created_at) AS rnk
FROM products o
JOIN duplicated d ON d.id = p.id ),
products_to_delete AS (
SELECT id,
created_at
FROM ordered
WHERE rnk = 2
)
DELETE
FROM products
USING products_to_delete
WHERE products.id = products_to_delete.id
AND products.created_at = products_to_delete.created_at;
Following SQL syntax provides better performance while checking for duplicate rows.
SELECT id, count(id)
FROM table1
GROUP BY id
HAVING count(id) > 1
begin;
create table user_links(id serial,year bigint, user_id bigint, sid bigint, cid bigint);
insert into user_links(year, user_id, sid, cid) values (null,null,null,null),
(null,null,null,null), (null,null,null,null),
(1,2,3,4), (1,2,3,4),
(1,2,3,4),(1,1,3,8),
(1,1,3,9),
(1,null,null,null),(1,null,null,null);
commit;
set operation with distinct and except.
(select id, year, user_id, sid, cid from user_links order by 1)
except
select distinct on (year, user_id, sid, cid) id, year, user_id, sid, cid
from user_links order by 1;
except all also works. Since id serial make all rows unique.
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1;
So far works nulls and non-nulls.
delete:
with a as(
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1)
delete from user_links using a where user_links.id = a.id returning *;

aggregate value from one table updated in other table

i want to insert an aggregated value from one table to other using Update clause.Here is what i did.
SELECT Sum(1.0 * volume * speed) / Sum(volume) AS aggregated_speed
FROM traffic_data_replica
GROUP BY date,
id
ORDER BY date;
I want update this values in other table. This what i have tried.
UPDATE traffic_data_aggregated_lanes
SET traffic_data_aggregated_lanes.aggregated_speed = (SELECT
Sum(1.0 * volume *
speed) / Sum(volume) AS aggregated_speed
FROM
traffic_data_replica
GROUP BY date,
id
ORDER BY date);
Please Help
relation is not clear.
Somewhat
Merge traffic_data_aggregated_lanes a
using (select id, sum(1.0*volume*speed)/sum(volume) as aggregated_speed from
traffic_data_replica group by date,id) b
on a.id=b.id
when matched then
update set a.aggregated_speed= b.aggregated_speed
we can go by using CTE also i am not sure of data.
So i have given you assumption result
WITH CTE AS (
select t.Id,sum(1.0*volume*speed)/sum(volume) as aggregated_speed from traffic_data_replica t
group by date,id order by date
)
Select * FROM CTE
update traffic_data_aggregated_lanes set traffic_data_aggregated_lanes.aggregated_sp= t.aggregated_speed From traffic_data_replica t
INNER JOIN CTE c ON C.Id = t.Id