Identify duplicate records and insert them into another table - SQL

I have a table that contains duplicate records. My requirement is to identify the duplicate records and store them in another table, i.e. Customer_Duplicate, and the distinct records in a separate table.
Existing query:
Create proc usp_store_duplicate_into_table
as
begin
insert into Customer_Duplicate
select *
from Customer C
group by cid
having count(cid) > 1
end

What you have is fine, except that you can't select columns that are not in your GROUP BY; for example, you could do:
insert into Customer_Duplicate
select cid, count(*)
from Customer C
group by cid
having count(cid) > 1
That depends on what Customer_Duplicate looks like. If you really need to include all the rows, then something like this might work for you:
insert into Customer_Duplicate
select *
from customer c
where c.cid in
(
select cid
from Customer
group by cid
having count(cid) > 1
)

You can use the ROW_NUMBER() ranking function with PARTITION BY in SQL Server to identify duplicate rows.
In PARTITION BY you define the columns on which you want to find duplicates.
In this example I am using Name and No; replace them with your column names.
insert into Customer_Duplicate
SELECT * FROM (
select * , ROW_NUMBER() OVER(PARTITION BY NAME,NO ORDER BY NAME,NO) AS RNK
from Customer C
) AS d
WHERE rnk > 1

For finding the duplicates, you can use the below code.
insert into Customer_Duplicate
SELECT c.name, c.othercolumns
FROM (select c.name, c.othercolumns,
ROW_NUMBER() OVER(PARTITION BY cid ORDER BY (SELECT NULL)) AS rnk   -- a constant like ORDER BY 1 is not allowed in a window function; (SELECT NULL) gives an arbitrary order
from Customer C
) AS c
WHERE c.rnk > 1;
If you want to insert distinct records into another table, you can use the below code.
insert into Customer_Distinct
SELECT c.name, c.othercolumns
FROM (select c.name, c.othercolumns,
ROW_NUMBER() OVER(PARTITION BY cid ORDER BY (SELECT NULL)) AS rnk
from Customer C
) AS c
WHERE c.rnk = 1;

Related

Returning the full record of each duplicated row by selecting the table and joining it to the duplicates?

The first query (Query A) works; it is based on a post from Stack Overflow (Using GROUP BY and HAVING COUNT(*) > 1 to select duplicate and non-duplicate fields).
But is it possible to return the full record of each duplicated row by selecting the table and joining it to the duplicates? That's what I'm attempting in Query B. I'm trying to do so on two fields. Is it possible to accomplish this with the HAVING clause constructed this way? I'm a n00b. Any advice or education would be appreciated.
Query A) Based on an example from StackOverflow:
SELECT InstanceID, InstanceSequenceNumber
FROM [dbo].[ANBasics]
WHERE InstanceID IN
(SELECT InstanceID FROM [dbo].[ANBasics]
GROUP BY InstanceID
HAVING (COUNT(*) > 1))
ORDER BY InstanceID
Query B) What I'm trying to accomplish:
SELECT A.*, COUNT(*) AS B
FROM [dbo].[ANBasics] AS A
JOIN(
SELECT [InstanceID], [InstanceSequenceNumber], COUNT(*)
FROM [dbo].[ANBasics]
GROUP BY [InstanceID], [InstanceSequenceNumber]
HAVING (B > 1) )
ON A.[InstanceID] = B.[InstanceID]
AND A.[InstanceSequenceNumber] = B.[InstanceSequenceNumber]
ORDER BY A.[InstanceID]
If I understand correctly, window functions are the simplest solution:
SELECT ab.*
FROM (SELECT ab.*,
COUNT(*) OVER (PARTITION BY InstanceID, InstanceSequenceNumber) as cnt
FROM [dbo].[ANBasics] ab
) ab
WHERE cnt > 1;
If you only want duplicates of a single column (InstanceID):
SELECT ab.*
FROM (SELECT ab.*,
COUNT(*) OVER (PARTITION BY InstanceID) as cnt
FROM [dbo].[ANBasics] ab
) ab
WHERE cnt > 1;

SQL show orders where 2 values are distinct and match the first

I'm looking for a way to select all orders that have multiple distinct names within the same order number. The data looks like this:
order - name
111-Paul
112-Paula
113-John
113-John
113-Jessica
114-Eric
114-Eric
114-John
115-Zack
115-Zack
115-Zack
etc.
so that I would get all the orders that have two or more distinct names in them:
113-John
113-Jessica
114-Eric
114-John
with which I could do further queries but I'm stuck. Can anyone give me some hints on how to tackle this problem please? I've tried it with count(*) which looked like this:
select order, name, count(name) from dbo.orders
group by order, name
having count(name) > 1
which gave me all the orders which had more than 1 name in it but I don't know how to let it only show orders with the distinct names.
Here's one approach using exists:
select distinct [order], name
from orders o
where exists (
select 1
from orders o2
where o.[order] = o2.[order] and o.name != o2.name)
I would use window functions for this.
For example:
select distinct [order]
from
(select
[order],
row_number() over(partition by [order], name order by [order] asc) as rn
from dbo.orders
) as t1
where rn > 1
You can do the same with count:
count(*) over(partition by [order], name) as cnt
Here's a straightforward implementation in SQL Server:
select distinct *
from table1
where [order] in (
select [order]
from (select distinct * from table1) iq
group by [order]
having count(*) > 1)
It's essentially breaking down the problem into:
Finding the orders that have more than one distinct value.
Finding the pairs of distinct order - name that belong to the list previously calculated.
When you use HAVING COUNT(name) > 1, it is counting all of the rows in those groups, including duplicate rows (rows 113-John and 113-John are 2 rows for order 113). I would query distinct rows from your table, and then select from that:
SELECT [order], [name] FROM (
SELECT DISTINCT [order], [name] FROM dbo.orders
) A
GROUP BY [order], [name]
HAVING COUNT([name]) > 1
As a note, if a [name] is null, then it will not be counted with COUNT(name). If you want nulls to be counted, use COUNT(*) instead.
You can use count(distinct name) to get the number of unique names for each order:
select [order], count(distinct name)
from orders
group by [order]
To just get the order for those you can use having:
select [order]
from orders
group by [order]
having count(distinct name) > 1
To get the details for those orders you can put that in the where clause to just return the rows with order in that list:
select *
from orders
where [order] in (
select [order]
from orders
group by [order]
having count(distinct name) > 1
)
I would use RANK (or DENSE_RANK) for this as shown below.
SELECT [Order]
FROM (SELECT
[Order],
RANK() OVER(PARTITION BY [Order] ORDER BY Name) AS NameRank
FROM [StackOverflow].[dbo].[OrderAndName]) ranked
WHERE ranked.NameRank > 1
GROUP BY [Order]
The sub-query assigns a rank (a seeding) to the names within an order according to their value. Names with the same value get the same rank, i.e. when an order contains one name multiple times (like 115) the rank of every name is 1.
The partition is important here as otherwise you would get the rank for all names for all orders which wouldn't give you the result you'd like.
It is then just a case of pulling out the orders that have a RANK greater than 1 and grouping (could use distinct if that's a preference).
You can then join to this result to get the orders and names as follows:
SELECT oan.[Order], [Name]
FROM [StackOverflow].[dbo].[OrderAndName] oan
INNER JOIN (SELECT [Order]
FROM (SELECT [Order],
RANK() OVER(PARTITION BY [Order] ORDER BY Name) AS NameRank
FROM [StackOverflow].[dbo].[OrderAndName]) ranked
WHERE ranked.NameRank > 1
GROUP BY [Order]) twoOrMore ON oan.[Order] = twoOrMore.[Order]

How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:
year, user_id, sid, cid
The unique constraint is currently the first field called "id", however I am now looking to add a constraint to make sure the year, user_id, sid and cid are all unique but I cannot apply the constraint because duplicate values already exist which violate this constraint.
Is there a way to find all duplicates?
The basic idea is to use a nested query with a count aggregate:
select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1
You can adjust the where clause in the inner query to narrow the search.
There is another good solution for this mentioned in the comments (but not everyone reads them):
select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1
Or shorter:
SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1
From "Find duplicate rows with PostgreSQL" here's smart solution:
select * from (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
FROM tbl
) dups
where
dups.Row > 1
To keep it simple, I assume that you wish to apply a unique constraint only to the column year and that the primary key is a column named id.
To find the duplicate values, run:
SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);
The SQL statement above gives you all the duplicate years in your table. To delete all the duplicates except the latest entry, use the following SQL statement.
DELETE
FROM YOUR_TABLE A USING YOUR_TABLE B   -- the same table, joined to itself under a second alias
WHERE A.year=B.year AND A.id<B.id;
You can join the table to itself on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1) and then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, group the results by the tn1.id field. This produces a result set that contains the id of a record and an array of all the ids that fit the join conditions.
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id;
Obviously, id's that will be in the duplicate_entries array for one id, will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of 'truth.' The one record that shouldn't get deleted. Maybe you could do something like this:
with dupe_set as (
select tn1.id,
array_agg(tn2.id) as duplicate_entries
from table_name tn1 join table_name tn2 on
tn1.year = tn2.year
and tn1.sid = tn2.sid
and tn1.user_id = tn2.user_id
and tn1.cid = tn2.cid
and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists
(select de from unnest(ds.duplicate_entries) as de where de < ds.id)
This selects the lowest-numbered IDs that have duplicates (assuming the ID is an increasing integer PK). These are the IDs you would keep around.
Inspired by Sandro Wiggers, I did something similar to this:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id
FROM ordered
WHERE rnk > 1
)
DELETE
FROM user_links
USING to_delete
WHERE user_links.id = to_delete.id;
If you want to test it, change it slightly:
WITH ordered AS (
SELECT id,year, user_id, sid, cid,
rank() OVER (PARTITION BY year, user_id, sid, cid ORDER BY id) AS rnk
FROM user_links
),
to_delete AS (
SELECT id,year,user_id,sid, cid
FROM ordered
WHERE rnk > 1
)
SELECT * FROM to_delete;
This gives an overview of what is going to be deleted. (There is no harm in keeping year, user_id, sid and cid in the to_delete query when running the deletion, but they are not needed there.)
In your case, because of the constraint, you need to delete the duplicated records:
1. Find the duplicated rows
2. Organize them by created_at date - in this case I'm keeping the oldest
3. Delete the records with USING to filter the right rows
WITH duplicated AS (
SELECT id,
count(*)
FROM products
GROUP BY id
HAVING count(*) > 1),
ordered AS (
SELECT p.id,
created_at,
rank() OVER (partition BY p.id ORDER BY p.created_at) AS rnk
FROM products p
JOIN duplicated d ON d.id = p.id ),
products_to_delete AS (
SELECT id,
created_at
FROM ordered
WHERE rnk > 1   -- keep only the oldest copy; delete every later duplicate
)
DELETE
FROM products
USING products_to_delete
WHERE products.id = products_to_delete.id
AND products.created_at = products_to_delete.created_at;
The following SQL provides better performance when checking for duplicate rows.
SELECT id, count(id)
FROM table1
GROUP BY id
HAVING count(id) > 1
begin;
create table user_links(id serial,year bigint, user_id bigint, sid bigint, cid bigint);
insert into user_links(year, user_id, sid, cid) values (null,null,null,null),
(null,null,null,null), (null,null,null,null),
(1,2,3,4), (1,2,3,4),
(1,2,3,4),(1,1,3,8),
(1,1,3,9),
(1,null,null,null),(1,null,null,null);
commit;
A set operation using DISTINCT ON and EXCEPT:
(select id, year, user_id, sid, cid from user_links order by 1)
except
select distinct on (year, user_id, sid, cid) id, year, user_id, sid, cid
from user_links order by 1;
EXCEPT ALL also works, since the serial id makes every row unique:
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1;
This handles both NULL and non-NULL values.
To delete the duplicates:
with a as(
(select id, year, user_id, sid, cid from user_links order by 1)
except all
select distinct on (year, user_id, sid, cid)
id, year, user_id, sid, cid from user_links order by 1)
delete from user_links using a where user_links.id = a.id returning *;

display max on one columns with multiple columns in output

How can I display the maximum OrderId for a CustomerId together with the other columns?
I have a table with following columns:
CustomerId, OrderId, Status, OrderType, CustomerType
A customer with the same customer id could have many order ids (1, 2, 3, ...). I want to display the max order id along with the rest of the columns in a SQL view. How can I achieve this?
Sample Data:
CustomerId OrderId OrderType
145042 1 A
110204 1 C
145042 2 D
162438 1 B
110204 2 B
103603 1 C
115559 1 D
115559 2 A
110204 3 A
I'd use a common table expression and ROW_NUMBER:
;With Ordered as (
select *,
ROW_NUMBER() OVER (PARTITION BY CustomerID
ORDER BY OrderId desc) as rn
from [Unnamed table from the question]
)
select * from Ordered where rn = 1
select * from table_name t
where t.orderid = (select max(orderid)   -- correlate on the customer, otherwise any customer's max could match
from table_name t2
where t2.customerid = t.customerid)
One way to do this is with not exists:
select t.*
from table t
where not exists (select 1
from table t2
where t2.CustomerId = t.CustomerId and
t2.OrderId > t.OrderId
);
This is saying: "get me all rows from t where there is no higher order id for the customer."

Getting the smaller index for each duplicate in SQL

Let's say I have a table with two columns, one column for the ID and another for a Name. All the names in this table appear more than once.
How can I get all the IDs in the table excluding the smallest IDs for each Name?
In SQL Server 2005+ you could go like this:
SELECT ID FROM atable
EXCEPT
SELECT MIN(ID) FROM atable GROUP BY Name
I would use a CTE (Common Table Expression) using the ROW_NUMBER() ranking function for that:
;WITH GroupedNames AS
(
SELECT ID, Name,
ROW_NUMBER() OVER(PARTITION BY Name ORDER BY ID) AS 'RowNum'
FROM
dbo.YourTable
)
SELECT *
FROM GroupedNames
This will "partition" your data by means, e.g. create groups by name, and each group will get consecutive numbers starting at 1. This way, you can easily select everything except the entry (ID, Name) with the smallest ID with this:
.....
SELECT *
FROM GroupedNames
WHERE RowNum > 1
and if you need to, you can even use this construct to actually delete all those names with a row number bigger than 1 (all the "duplicates"):
;WITH GroupedNames AS
(
SELECT ID, Name,
ROW_NUMBER() OVER(PARTITION BY Name ORDER BY ID) AS 'RowNum'
FROM
dbo.YourTable
)
DELETE FROM GroupedNames
WHERE RowNum > 1
Maybe this would work?
SELECT id FROM table WHERE id NOT IN (SELECT MIN(id) FROM table GROUP BY name)
SELECT DISTINCT b.id
FROM yourTable a
JOIN yourTable b
ON a.name = b.name
AND a.id < b.id