Finding duplicate values in a table where all the columns are not the same - sql

I am working with a set of data in a table.
For simplicity i have the table like below with some sample data:
Some of the data in this table came from a different source, such data are the ones that have cqmRecordID != null
I need to find duplicate values in this table and delete the duplicate ones that came over from the other source (ones with a cqmRecordID)
A record is considered duplicate if they have the same values for these cols:
[Name]
Cast([CreatedDate] as Date)
[CreatedBy]
So in the sample data i have above, record #5 and record #6 would be considered duplicates.
As solutions I came up with these two queries:
Query #1:
select * from (
select recordid, cqmrecordid, ROW_NUMBER() over (partition by name, cast(createddate as date), createdby
order by cqmrecordid, recordid) as rownum
from vmsNCR ) A
where cqmrecordid is not null
order by recordid
Query #2:
select A.recordID, A.cqmRecordID, B.RecordID, B.cqmRecordID
from vmsNCR A
join vmsNCR B
on A.Name = B.Name
and cast(A.CreatedDate as date) = cast(B.CreatedDate as date)
and A.CreatedBy = B.CreatedBy
and A.RecordID != B.RecordID
and A.cqmRecordID is not null
order by A.RecordID
Is there a better approach to this? Is one better than the other performance wise?

If you want to fetch all the rows without duplicates, then:
select t.* -- or all columns except seqnum
from (select t.*,
row_number() over (partition by name, cast(createddate as date), createdby
order by (case when cqmRecordId is not null then 1 else 2 end)
) as seqnum
from t
) t
where seqnum = 1;
If you want performance, create a columns and then an index:
alter table t add cqmRecordId_flag as (case when cqmRecordId is null then 0 else 1 end) persisted;
alter table t add createddate_date as (cast(createddate as date)) persisted;
And then an index:
create index idx_t_4 on t(name, createddate_date, createdby, cqmRecordId_flag desc);
EDIT:
If you actually just want to delete the NULL values from the table, you can use:
delete t from t
where t.cqmRecordId is null and
exists (select 1
from t t2
where t2.name = t.name and
convert(date, t2.createddate_date) =convert(date, t.createddate_date) and
t2.createdby = t.createdby and
t2.cqmRecordId is not null
);
You can use the same logic with select to just select the duplicates.

Try below Query it might work for You
;WITH TestCTE
AS
(
SELECT *,ROW_NUMBER() OVER(
PARTITION BY [Name],Cast([CreatedDate] as Date),[CreatedBy]
ORDER BY RecordId
) AS RowNumber
)
DELETE FROM TestCTE
WHERE RowNumber > 1

Use the below code to eliminate duplicates
;WITH CTE
AS
(
SELECT ROW_NUMBER() OVER(
PARTITION BY [Name],Cast([CreatedDate] as Date),[CreatedBy]
ORDER BY cqmRecordId
) AS Rnk
,*
)
DELETE FROM CTE
WHERE Rnk <> 1

Related

Records apart form max date

I have been helped by Metal to write a SQL as below
select id
, OrderDate
, RejectDate
, max(case when RejectDate = '1900-01-01' then '9999-12-31' else RejectDate end) as rSum
from tableA
group by id, OrderDate, RejectDate
Now, I would like to find out all the records for a partcular id below the max reject date to delete them from a transformation table
An option is to use row_number():
select
id,
OrderDate,
RejectDate
from (
select
t.*,
row_number() over(
partition by id
order by case when RejectDate = '1900-01-01' then '9999-12-31' else RejectDate end desc
) rn
from tableA t
) t
where rn > 1
The advantage of this technique is that it avoids aggregation, which may lead to better performance. Also, you can easily turn this into a delete statement by leveraging the concept of updateable CTE, as follows:
with cte as (
select
row_number() over(
partition by id
order by case when RejectDate = '1900-01-01' then '9999-12-31' else RejectDate end desc
) rn
from tableA t
)
delete from cte where rn > 1
This should work...
SELECT *
FROM tableA t1
INNER JOIN (
SELECT ID, MAX(RejectDate) as MaxRejectDate
FROM tableA) t2 ON t1.ID = t2.ID
WHERE t1.RejectDate < t2.MaxRejectDate

Random records in Oracle table based on conditions

I have a Oracle table with the following columns
Table Structure
In a query I need to return all the records with CPER>=40 which is trivial. However, apart from CPER>=40 I need to list 5 random records for each CPID.
I have attached a sample list of records. However, in my table I have around 50,000 records.
Appreciate if you can help.
Oracle solution:
with CTE as
(
select t1.*,
row_number() over(order by DBMS_RANDOM.VALUE) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40
SQL Server:
with CTE as
(
select t1.*,
row_number() over(order by newid()) as rn -- random order assigned
from MyTable t1
where CPID <40
)
select *
from CTE
where rn <=5 -- pick 5 at random
union all
select t2.*, null
from my_table t2
where CPID >= 40
How about something like this...
SELECT *
FROM (SELECT CID,
CVAL,
CPID,
CPER,
Row_number() OVER (partition BY CPID ORDER BY CPID ASC ) AS RN
FROM Table) tmp
WHERE CPER>=40 OR pids <= 5
However, this is not random.
Assuming that you want five additional random records, you can do:
select t.*
from (select t.*,
row_number() over (partition by cpid,
(case when cper >= 40 then 1 else 2 end)
order by dbms_random.value
) as seqnum
from t
) t
where seqnum <= 5 or cper >= 40;
The row_number() is enumerating the rows for each cpid in two groups -- based on the cper value. The outer where is taking all cper values in the range you want as well as five from the other group.

Find duplicate ID and add new sequence ID

I have a table where ID must be unique. There are some IDs that are not unique. How do I generate a new column which adds a sequence to this ID? I want to generate ID_new_generated in the table below
ID Company Name ID_new_generated
1 A 1
1 B 1_2
2 C 2
You can use a windowing function (e.g. Rank) to to generate an secondary ID, over each window defined by rows that have the same ID number, then just concatenate it to create the new one.
something like:
select
ID
, companyName
, rank() over(partition by ID ORDER BY companyName)
, concat(ID, '_', rank() over(partition by ID ORDER BY companyName)) as new_id
from test;
See this demo: https://www.db-fiddle.com/f/bd6aQKnZ7gcZCQjFpZicrp/0
Syntax will be different depending on which sql you are using.
Assumed you are looking for a solution in SQL Server:
First you will need to add a nullable column ID_Generated like below:
ALTER TABLE tablename
ADD COLUMN ID_Generated varchar(25) null
GO
Then, use row_number like below in a cte structure (you can use temp table if you are using mysql):
;with cte as (
SELECT DISTINCT t.ID,
(ROW_NUMBER() over (partition by t.ID order by t.ID)) as RowNumber
FROM tablename t
INNER JOIN (select ID, Count(*) RecCount
From tablename
group by ID
having Count(*) > 1) tt on t.ID = t.ID
ORDER BY id ASC
)
Update t
set t.ID_Generated = cte.RowNumber
from tablename t
inner join cte on t.ID = cte.ID
I think you want:
select ID, companyName,
(case when row_number() over (partition by id order by companyname) = 1
then cast(id as varchar(255))
else id || '_' || row_number() over (partition by id order by companyname)
end) as new_id
from test;
|| is the ANSI/ISO standard concatenation operator in SQL. Not all databases support it, so you might need to replace the operator with the one appropriate for your database.

removing duplicate values in column

there is a column called r_no which as three duplicate values associated with date column
r_no date values
1 09-Apr-2016 03:49:24
1 09-Apr-2016 03:49:30
1 09-Apr-2016 03:49:40
I want output as:
1 09-Apr-2016 03:49:40
removing duplicate values with latest date value
Assuming that You're working with Microsoft SQL Server :
SELECT TOP 1 ID, DATE FROM #TEMP ORDER BY DATE DESC
Delete & show the latest :
;WITH TTT AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ID)SR, DATE, ID FROM #TEMP
)
DELETE FROM TTT WHERE DATE NOT IN(SELECT MAX(DATE) FROM TTT)
SELECT * FROM #TEMP
Explanation: The CTE will grab all of your records and apply a row number for each unique entry. Each additional entry will get an incrementing number.
You can delete dublicate data directly from CTE expression where Row_Number() function is used as follows
;with cte as (
select
r_no, [date values],
ROW_NUMBER() over (partition by r_no order by [date values] desc) as rn
from temptbl
)
delete from cte where rn > 1
You can do this with an OUTER APPLY.
SELECT DISTINCT yt.r_no, oa.[date values]
FROM YourTable yt
OUTER APPLY (SELECT TOP 1 [date values]
FROM YourTable
WHERE r_no = yt.r_no
ORDER BY [date values] DESC) oa
You can also do this with an embedded query.
SELECT DISTINCT *
FROM YourTable yt
WHERE [date values] = (SELECT MAX([date values]) FROM YourTable WHERE r_no = yt.r_no)

SQL query: how to distinct count of a column group by another column

In my table I need to know if each ID has one and only one ID_name. How can I write such query?
I tried:
select ID, count(distinct ID_name) as count_name
from table
group by ID
having count_name > 1
But it takes forever to run.
Any thoughts?
select ID
from YourTable
group by
ID
having count(distinct ID_name) > 1
or
select *
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.ID = yt2.ID
and yt1.ID_Name <> yt2.ID_Name
)
Now, most ID columns are defined as primary key and are unique. So in a regular database you'd expect both queries to return an empty set.
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
ROW_Number() over (partition by ip.ID,ip.ID_nameorder by ip.ID) as myRank
from YourTable ip
) tt
group by tt.ID
This gives you every ID with it's total number of ID_Name
If you want only those ID's which have more than one name associated just add a where clause
e.g.
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
ROW_NUMBER() over (partition by ip.ID,ip.ID_nameorder by ip.ID) as myRank
from YourTable ip
) tt
**where tt.myRank > 1**
group by tt.ID