How can I remove duplicates in SQL but keep one copy?

I have the following table in SQL with lines of an order as follows:
RowId  OrderId  Type  Text
----------------------------------------
1      1        5     "Sometext"
2      1        5     "Sometext"
3      2        4     "Sometext"
4      3        5     "Sometext"
5      2        4     "Sometext"
6      1        3     "Sometext"
Each order cannot have a duplicate type, but can have multiple different types.
Rows 1 and 2 are duplicates for Order 1, but row 6 is fine.
Rows 3 and 5 are duplicates for Order 2.
I need to delete all of the duplicated data, so in this case I need to delete row 2 and row 5.
What is the best query to delete the data? Or even just return a list of RowIDs that contain duplicates to be deleted (or the opposite, a list of RowIDs to be kept)?
Thanks.

Try a simple approach:
DELETE FROM t
WHERE rowid NOT IN (
SELECT min(rowid) FROM t
GROUP BY orderid, type
)
Note that it seems you want to keep the lowest rowid when a row is repeated. That's why I'm keeping the min.
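To check the query against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite and the Python harness are assumptions made for illustration; the question does not say which database is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (RowId INTEGER PRIMARY KEY, OrderId INTEGER, Type INTEGER, Text TEXT);
    INSERT INTO t VALUES
        (1, 1, 5, 'Sometext'), (2, 1, 5, 'Sometext'), (3, 2, 4, 'Sometext'),
        (4, 3, 5, 'Sometext'), (5, 2, 4, 'Sometext'), (6, 1, 3, 'Sometext');
""")

# Keep the lowest RowId of every (OrderId, Type) pair and delete everything else.
conn.execute("""
    DELETE FROM t
    WHERE RowId NOT IN (SELECT MIN(RowId) FROM t GROUP BY OrderId, Type)
""")
conn.commit()

remaining = [row[0] for row in conn.execute("SELECT RowId FROM t ORDER BY RowId")]
print(remaining)  # [1, 3, 4, 6]
```

Rows 2 and 5 are gone, which matches what the question asks for.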

Please try:
with c as
(
select *, row_number() over(partition by OrderId, Type order by (select 0)) as n
from YourTable
)
delete from c
where n > 1;
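Deleting through a CTE like this works in SQL Server. As a rough sketch for engines that do not allow it, the same row_number() idea can first list the duplicate RowIds and then delete them with IN. The example below assumes SQLite 3.25 or later (for window functions) driven from Python, and it orders by RowId instead of (select 0) so that the surviving row is predictable; both choices are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (RowId INTEGER PRIMARY KEY, OrderId INTEGER, Type INTEGER, Text TEXT);
    INSERT INTO YourTable VALUES
        (1, 1, 5, 'Sometext'), (2, 1, 5, 'Sometext'), (3, 2, 4, 'Sometext'),
        (4, 3, 5, 'Sometext'), (5, 2, 4, 'Sometext'), (6, 1, 3, 'Sometext');
""")

# Number the rows inside each (OrderId, Type) group; every row with n > 1 is a duplicate.
duplicates_sql = """
    SELECT RowId FROM (
        SELECT RowId,
               ROW_NUMBER() OVER (PARTITION BY OrderId, Type ORDER BY RowId) AS n
        FROM YourTable
    )
    WHERE n > 1
"""

# First just list the duplicates, then delete exactly those rows.
to_delete = sorted(row[0] for row in conn.execute(duplicates_sql))
conn.execute("DELETE FROM YourTable WHERE RowId IN (" + duplicates_sql + ")")
conn.commit()

kept = [row[0] for row in conn.execute("SELECT RowId FROM YourTable ORDER BY RowId")]
print(to_delete)  # [2, 5]
print(kept)       # [1, 3, 4, 6]
```

Running the SELECT on its own also answers the "just return a list of RowIDs" part of the question.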

;with cte as
(
Select Row_Number() Over(Partition BY ORDERID,TYPE ORDER BY RowId) as Rows,
RowId , OrderId , Type , Text from TableName
)
Select RowId , OrderId , Type , Text from cte where Rows>1

Related

Remove all non contiguous records with identical fields

I got a table with some columns like
ID  RecordID  DateInserted
1   10        now + 1
2   10        now + 2
3   4         now + 3
4   10        now + 4
5   10        now + 5
I would like to remove all non contiguous duplicates of the RecordID Column when they are sorted by DateInserted
In my example I would like to remove records 4 and 5, because between records 2 and 4 there is a record with a different RecordID.
Is there a way to do it with one query?

You can use window functions. One method is to count the changes in value that occur up to each row and just take the rows with one change:
select t.*
from (select t.*,
sum(case when prev_recordid = recordid then 0 else 1 end) over (order by dateinserted) as grp_num
from (select t.*,
lag(recordid) over (order by dateinserted) as prev_recordid
from t
) t
) t
where grp_num = 1;
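With the sample data, the run numbers come out as 1, 1, 2, 3, 3, so a filter on grp_num = 1 alone would return only IDs 1 and 2. A possible variant, sketched here under assumptions, keeps each row whose run number equals the smallest run number of its RecordID, which yields IDs 1, 2 and 3. The sketch uses SQLite 3.25 or later through Python, and plain integers stand in for the "now + n" timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (ID INTEGER PRIMARY KEY, RecordID INTEGER, DateInserted INTEGER);
    INSERT INTO t VALUES (1, 10, 1), (2, 10, 2), (3, 4, 3), (4, 10, 4), (5, 10, 5);
""")

kept_ids = [row[0] for row in conn.execute("""
    SELECT ID
    FROM (
        SELECT ID, DateInserted, grp_num,
               -- the first run in which this RecordID appears
               MIN(grp_num) OVER (PARTITION BY RecordID) AS first_grp
        FROM (
            SELECT ID, RecordID, DateInserted,
                   -- running count of value changes = run number
                   SUM(CASE WHEN prev_recordid = RecordID THEN 0 ELSE 1 END)
                       OVER (ORDER BY DateInserted) AS grp_num
            FROM (
                SELECT ID, RecordID, DateInserted,
                       LAG(RecordID) OVER (ORDER BY DateInserted) AS prev_recordid
                FROM t
            )
        )
    )
    WHERE grp_num = first_grp
    ORDER BY DateInserted
""")]
print(kept_ids)  # [1, 2, 3]
```

Rows 4 and 5 belong to a later run of RecordID 10, so they are the ones left out.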

One way would be to "flag" all the rows where it is not the first time this RecordID appeared and the prior row contained a different RecordID. Then you just exclude any row beyond that point for that RecordID.
;WITH cte AS
(
SELECT ID, RecordID, DateInserted,
dr = DENSE_RANK() OVER (PARTITION BY RecordID ORDER BY DateInserted),
prior = COALESCE(LAG(RecordID,1) OVER (ORDER BY DateInserted), RecordID)
FROM dbo.table_name
),
FlaggedRows AS
(
SELECT RecordID, dr
FROM cte
WHERE dr > 1 AND prior <> RecordID
)
SELECT cte.ID, cte.RecordID, cte.DateInserted
FROM cte
LEFT OUTER JOIN FlaggedRows AS f
ON cte.RecordID = f.RecordID
WHERE cte.dr < COALESCE(f.dr, cte.dr + 1)
ORDER BY cte.DateInserted;
If you want to actually delete the rows from the source (remove will typically be inferred as removing from the result), then change the SELECT at the end to:
DELETE cte
FROM cte
INNER JOIN FlaggedRows f
ON cte.RecordID = f.RecordID
WHERE cte.dr >= f.dr;

SQL Select One Record over another based on column value

What I am trying to do is select rows based off of a 'priority'.
Say I have this:
ControlID  ProgramID  Priority
1          4          0
1          4          1
2          4          0
I want to choose one row for each control ID (the whole row). That would be the third row, because no row for that control ID has priority, and the second row, because it has priority. So if two rows have the same control ID, the one I want to choose is the one with 'priority'.
So my results would be:
ControlID  ProgramID  Priority
1          4          1
2          4          0
I've tried doing a sub query but I'm not that good at them...

You can do that by using row_number:
with r as (
select
ControlId,
ProgramId,
Priority,
row_number() over(partition by ControlId order by Priority desc) rn
from YourTable -- placeholder table name; the question does not give one
)
select
ControlId,
ProgramId,
Priority
from r
where rn = 1
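As a quick check of the row_number approach on the question's data, here is a small sketch using SQLite 3.25 or later from Python. The table name controls is hypothetical, since the question never names the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "controls" is a made-up table name for this sketch
    CREATE TABLE controls (ControlId INTEGER, ProgramId INTEGER, Priority INTEGER);
    INSERT INTO controls VALUES (1, 4, 0), (1, 4, 1), (2, 4, 0);
""")

# Rank the rows of each ControlId by Priority (highest first) and keep rank 1.
rows = conn.execute("""
    WITH r AS (
        SELECT ControlId, ProgramId, Priority,
               ROW_NUMBER() OVER (PARTITION BY ControlId ORDER BY Priority DESC) AS rn
        FROM controls
    )
    SELECT ControlId, ProgramId, Priority
    FROM r
    WHERE rn = 1
    ORDER BY ControlId
""").fetchall()
print(rows)  # [(1, 4, 1), (2, 4, 0)]
```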

Retrieve specific rows without using rownum

Since I can't use rownum in the query, how can I use rowid (or any other approach apart from rownum) to get the 2nd through 4th rows?
Here is my current query, which retrieves the 2nd and 4th rows:
SELECT * FROM Record a
WHERE
2 = (SELECT COUNT (rowid)
FROM Record b
WHERE a.rowid >= b.rowid)
UNION
SELECT * FROM Record a
WHERE
4 = (SELECT COUNT (rowid)
FROM Record c
WHERE a.rowid >= c.rowid);
Maybe there are better ways to do it? Thanks.

If you can't use rownum, then use row_number():
SELECT a.*
FROM (SELECT a.*, ROW_NUMBER() OVER (ORDER BY rowid) as seqnum
FROM Record a
) a
WHERE seqnum BETWEEN 2 and 4;
Note: ROW_NUMBER() needs an ordering column in its ORDER BY. SQL tables represent unordered sets, so there is no concept of a first row or a second row, except in reference to an ordering column. You can use rowid for this purpose.
In Oracle 12c, you would use OFFSET/FETCH:
SELECT a.*
FROM Record a
OFFSET 1 ROWS
FETCH FIRST 3 ROWS ONLY;
I should point out that you can use rownum. You just can't do:
SELECT a.*
FROM Record a
WHERE rownum BETWEEN 2 and 4;
You can use it in a subquery:
SELECT a.*
FROM (SELECT a.*, rownum as seqnum
FROM Record a
) a
WHERE seqnum BETWEEN 2 and 4;
Do note that without an ORDER BY, there is no guarantee that the results come back in any order, including rowid order.
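The queries above are Oracle syntax. As an illustration of the same two ideas in a database that is easy to run locally, here is a sketch in SQLite 3.25 or later via Python. It numbers rows by an explicit key instead of rowid, and it uses LIMIT/OFFSET, which is SQLite's counterpart to OFFSET/FETCH. The columns id and name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- id and name are hypothetical columns; the question does not show the table layout
    CREATE TABLE Record (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO Record VALUES (10, 'a'), (20, 'b'), (30, 'c'), (40, 'd'), (50, 'e');
""")

# Number the rows by an explicit ordering column, then filter on the number.
by_row_number = conn.execute("""
    SELECT id, name
    FROM (SELECT r.*, ROW_NUMBER() OVER (ORDER BY id) AS seqnum FROM Record r)
    WHERE seqnum BETWEEN 2 AND 4
    ORDER BY seqnum
""").fetchall()

# Skip one row and take the next three, with an explicit ORDER BY.
by_offset = conn.execute("""
    SELECT id, name FROM Record ORDER BY id LIMIT 3 OFFSET 1
""").fetchall()

print(by_row_number)  # [(20, 'b'), (30, 'c'), (40, 'd')]
print(by_offset)      # [(20, 'b'), (30, 'c'), (40, 'd')]
```

Both forms return the same rows only because each one states its ordering explicitly.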

If you want to avoid rownum and row_number, use sum:
select *
from (
select sum(1) over ( order by rowid /* or whatever you need */ ) as rn,
r.*
from record r
)
where rn between 2 and 4
The trick is that sum(1) here gives the same result as count(1), count(rowid), or any count over a non-null value, which is the same as numbering the rows with row_number or rownum.
In this way you use sum to compute a row number without explicitly writing 'row_number' or 'rownum'.
SQL> create table testTab(x) as ( select level from dual connect by level <= 6);
Table created.
SQL> select t.*,
2 count(1) over (order by rowid desc) as count,
3 sum(1) over (order by rowid desc) as sum,
4 row_number() over (order by rowid desc) as rowNumber
5 from testTab t;
         X      COUNT        SUM  ROWNUMBER
---------- ---------- ---------- ----------
         6          1          1          1
         5          2          2          2
         4          3          3          3
         3          4          4          4
         2          5          5          5
         1          6          6          6
The external query simply applies the filter.

With Oracle 12c, you can now easily do row limiting. In your scenario you can do something like this:
SELECT *
FROM RECORD
OFFSET 1 ROWS FETCH NEXT 1 ROWS ONLY
UNION
SELECT *
FROM RECORD
OFFSET 3 ROWS FETCH NEXT 1 ROWS ONLY

Update statement with lookup table

I have a SQL table Customer with the following columns:
Customer_ID, Actioncode
I have another table with 1000+ actioncodes. Now I want to update the records in the Customer table with a unique code from the actioncode table.
I use this update statement at the moment:
update t
set t.actiecode = (select top 1 actiecode from data_mgl_campagnemails_codes)
from data_mgl_campagnemails_transfer t;
The result is that all records are updated with the same actiecode. The top 1 is responsible for that. When I remove that I got an error:
Subquery returned more than 1 value
This seems logical. How can I do this without using a cursor?
There is no relationship between the Customer and Code table.
Table structure:
data_mgl_campagnemails_transfer
id  customer_id  actioncode  actioncode_id
1   1            -           -
2   3            -           -
3   4            -           -
data_mgl_campagnemails_codes
id  actioncode  active
1   TTTT
2   RRRR
3   VVVV
4   RRRW
The result should be:
data_mgl_campagnemails_transfer
id  customer_id  actioncode  actioncode_id
1   1            TTTT        1
2   3            RRRR        2
3   4            VVVV        3
data_mgl_campagnemails_codes
id  actioncode  active
1   TTTT        YES
2   RRRR        YES
3   VVVV        YES
4   RRRW

This can be a bit tricky using a single statement, because SQL Server likes to optimize things. So the obvious:
update t
set t.actiecode = (select top 1 actiecode
from data_mgl_campagnemails_codes
order by newid()
)
from data_mgl_campagnemails_transfer t;
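also doesn't work, because SQL Server may evaluate the uncorrelated subquery only once, so every row gets the same code. One method is to enumerate things and use a join or correlated subquery: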
with t as (
select t.*, row_number() over (order by newid()) as seqnum
from data_mgl_campagnemails_transfer t
),
a as (
select a.*, row_number() over (order by newid()) as seqnum
from data_mgl_campagnemails_codes a
)
update t
set t.actiecode = a.actiecode
from t join
a
on t.seqnum = a.seqnum;
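Here is a rough sketch of the enumerate-and-pair idea on the question's sample tables, written for SQLite 3.25 or later and run from Python. Several things are assumptions for illustration: SQLite has no newid(), so rows are numbered by id, which also reproduces the exact result the question expects; the pairing is done with a correlated subquery rather than UPDATE ... FROM; and the column is spelled actioncode as in the table listing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_mgl_campagnemails_transfer (
        id INTEGER PRIMARY KEY, customer_id INTEGER, actioncode TEXT, actioncode_id INTEGER);
    CREATE TABLE data_mgl_campagnemails_codes (
        id INTEGER PRIMARY KEY, actioncode TEXT, active TEXT);
    INSERT INTO data_mgl_campagnemails_transfer (id, customer_id) VALUES (1, 1), (2, 3), (3, 4);
    INSERT INTO data_mgl_campagnemails_codes (id, actioncode) VALUES
        (1, 'TTTT'), (2, 'RRRR'), (3, 'VVVV'), (4, 'RRRW');
""")

# Step 1: number both tables and give each transfer row the code with the same number.
conn.execute("""
    WITH t AS (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS seqnum
        FROM data_mgl_campagnemails_transfer
    ),
    a AS (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS seqnum
        FROM data_mgl_campagnemails_codes
    )
    UPDATE data_mgl_campagnemails_transfer
    SET actioncode_id = (
        SELECT a.id
        FROM t JOIN a ON a.seqnum = t.seqnum
        WHERE t.id = data_mgl_campagnemails_transfer.id
    )
""")

# Step 2: copy the code text over, then mark the codes that were handed out.
conn.execute("""
    UPDATE data_mgl_campagnemails_transfer
    SET actioncode = (
        SELECT c.actioncode
        FROM data_mgl_campagnemails_codes c
        WHERE c.id = data_mgl_campagnemails_transfer.actioncode_id
    )
""")
conn.execute("""
    UPDATE data_mgl_campagnemails_codes
    SET active = 'YES'
    WHERE id IN (SELECT actioncode_id FROM data_mgl_campagnemails_transfer)
""")
conn.commit()

transfer = conn.execute(
    "SELECT id, customer_id, actioncode, actioncode_id "
    "FROM data_mgl_campagnemails_transfer ORDER BY id").fetchall()
codes = conn.execute(
    "SELECT id, actioncode, active FROM data_mgl_campagnemails_codes ORDER BY id").fetchall()
print(transfer)  # [(1, 1, 'TTTT', 1), (2, 3, 'RRRR', 2), (3, 4, 'VVVV', 3)]
print(codes)     # [(1, 'TTTT', 'YES'), (2, 'RRRR', 'YES'), (3, 'VVVV', 'YES'), (4, 'RRRW', None)]
```

Because every sequence number is used at most once, no code is handed to two customers.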
Another way is to "trick" SQL Server into running the correlated subquery more than once. I think something like this:
update t
set t.actiecode = (select top 1 actiecode
from data_mgl_campagnemails_codes
where t.customer_id is not null -- references the outer table but really does nothing
order by newid()
)
from data_mgl_campagnemails_transfer t;

Getting all fields from table filtered by MAX(Column1)

I have table with some data, for example
ID  Specified  TIN   Value
--------------------------
1   0          tin1  45
2   1          tin1  34
3   0          tin2  23
4   3          tin2  47
5   3          tin2  12
I need to get the full rows that have MAX(Specified) for each TIN. If several rows share the maximum (IDs 4 and 5 in the example), I must take the last one (ID 5).
Finally, the result must be:
ID  Specified  TIN   Value
--------------------------
2   1          tin1  34
5   3          tin2  12

This will give the desired result using a window function:
;with cte as(select *, row_number() over (partition by tin order by specified desc, id desc) as rn
from tablename)
select * from cte where rn = 1

Edit: Updated query after question edit.
Here is the fiddle
http://sqlfiddle.com/#!9/20e1b/1/0
SELECT * FROM TBL WHERE ID IN (
SELECT max(id) FROM
TBL WHERE SPECIFIED IN
(SELECT MAX(SPECIFIED) FROM TBL
GROUP BY TIN)
group by specified)
I am sure we can simplify it further, but this will work.
select * from tbl where id =(
SELECT MAX(ID) FROM
tbl where specified =(SELECT MAX(SPECIFIED) FROM tbl))

One method is to use window functions, row_number():
select t.*
from (select t.*, row_number() over (partition by tin
order by specified desc, id desc
) as seqnum
from t
) t
where seqnum = 1;
However, if you have an index on (tin, specified, id) and on id, the most efficient method is:
select t.*
from t
where t.id = (select top 1 t2.id
from t t2
where t2.tin = t.tin
order by t2.specified desc, id desc
);
The reason this is better is that the index will be used for the subquery, and then an index will be used for the outer query as well. This is highly efficient. Although an index can also be used for the window function, the resulting execution plan probably requires scanning the entire table.
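For reference, here is a sketch of the correlated-subquery form on the question's data, using SQLite from Python. SQLite has no TOP, so LIMIT 1 is swapped in, and the index name is made up. The sketch only checks that the query returns the expected rows; it makes no claim about the execution plan on any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, specified INTEGER, tin TEXT, value INTEGER);
    -- hypothetical index name; covers the subquery's filter and sort columns
    CREATE INDEX idx_t_tin_specified_id ON t (tin, specified, id);
    INSERT INTO t VALUES
        (1, 0, 'tin1', 45), (2, 1, 'tin1', 34),
        (3, 0, 'tin2', 23), (4, 3, 'tin2', 47), (5, 3, 'tin2', 12);
""")

# For each row, find the "best" id of its tin (highest specified, then highest id)
# and keep the row only if it is that one.
rows = conn.execute("""
    SELECT t.id, t.specified, t.tin, t.value
    FROM t
    WHERE t.id = (SELECT t2.id
                  FROM t AS t2
                  WHERE t2.tin = t.tin
                  ORDER BY t2.specified DESC, t2.id DESC
                  LIMIT 1)
    ORDER BY t.id
""").fetchall()
print(rows)  # [(2, 1, 'tin1', 34), (5, 3, 'tin2', 12)]
```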
