Removing duplicate values from SQL Server based on two columns

| Rownumber | OldIdassigned | commoncode |
| --------- | ------------- | ---------- |
| 1 | FLEX | Y2573F102 |
| 2 | RCL | Y2573F102 |
| 3 | FLEX | Y2573F102 |
| 4 | QGEN | N72482123 |
| 5 | QGEN | N72482123 |
| 6 | QGEN | N72482123 |
| 7 | RACE | N72482123 |
| 8 | CLB | N22717107 |
| 9 | CLB | N22717107 |
| 10 | CLB | N22717107 |
I need to delete duplicate records based on commoncode, with one condition: a row is deleted only if its OldIdassigned is also the same; otherwise it is kept.
For example, Y2573F102 has 3 records (rows 1, 2 and 3). Rows 1 and 2 must not be deleted; only row 3 has to be deleted.

I like updatable CTEs and window functions for this purpose:
with todelete as (
select t.*,
row_number() over (partition by commoncode, OldIdassigned order by rownumber) as seqnum
from t
)
delete todelete
where seqnum > 1;

Use ROW_NUMBER():
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY OldIdassigned, commoncode ORDER BY rownumber) AS Seq
FROM table t
) t
WHERE t.seq > 1;
EDIT: If you want to check for duplicates based on commoncode only, remove OldIdassigned from the PARTITION BY clause:
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY commoncode ORDER BY rownumber DESC) AS Seq
FROM table t
) t
WHERE t.seq > 1;

Use the window function row_number(). According to your description and comments, it seems you need to change the PARTITION BY clause:
delete t
from
(select t1.*,row_number() over(partition by commoncode order by Rownumber) rn from table t1
)t where rn<>1
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=eacc0688efb534a0addee68678f323fe

Use Row_Number()
delete t from
(select *, row_number() over(partition by commoncode order by
rownumber) as rn from table) t
where rn<>1

Since all answers are similar (and correct), I will post one alternative way:
DELETE FROM TableA
WHERE EXISTS ( SELECT * FROM TableA AS A2
WHERE A2.commoncode = TableA.commoncode
AND A2.OldIdassigned = TableA.OldIdassigned
AND A2.Rownumber < TableA.Rownumber )
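
To check the two-column partitioning against the question's sample data, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It rests on several assumptions: SQLite (3.25 or later, for window functions) stands in for SQL Server, the table is called t, and because SQLite cannot delete through a CTE or derived table, the ROW_NUMBER() result is fed to DELETE through an IN subquery instead.

```python
import sqlite3

# Sample rows from the question: (rownumber, OldIdassigned, commoncode)
ROWS = [
    (1, "FLEX", "Y2573F102"),
    (2, "RCL", "Y2573F102"),
    (3, "FLEX", "Y2573F102"),
    (4, "QGEN", "N72482123"),
    (5, "QGEN", "N72482123"),
    (6, "QGEN", "N72482123"),
    (7, "RACE", "N72482123"),
    (8, "CLB", "N22717107"),
    (9, "CLB", "N22717107"),
    (10, "CLB", "N22717107"),
]


def surviving_rownumbers():
    """Delete repeats of (commoncode, OldIdassigned) and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (rownumber INTEGER PRIMARY KEY,"
        " OldIdassigned TEXT, commoncode TEXT)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    # Number the rows inside each (commoncode, OldIdassigned) group and
    # delete everything after the first row of the group.
    conn.execute("""
        DELETE FROM t
        WHERE rownumber IN (
            SELECT rownumber
            FROM (SELECT rownumber,
                         ROW_NUMBER() OVER (PARTITION BY commoncode, OldIdassigned
                                            ORDER BY rownumber) AS seq
                  FROM t) AS numbered
            WHERE seq > 1
        )
    """)
    rows = conn.execute("SELECT rownumber FROM t ORDER BY rownumber").fetchall()
    conn.close()
    return [r[0] for r in rows]


print(surviving_rownumbers())  # [1, 2, 4, 7, 8]
```

Row 2 (RCL) survives because it differs from row 1 in OldIdassigned, which is the behaviour the question asks for. Partitioning by commoncode alone would delete it as well.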

Related

SQL query - remove duplicated

I have a table with the following columns that matter:
ID | commentid
1 | abs345
2 | abs345
3 | abs345
4 | poly234
5 | poly234
6 | qq1r4c
7 | abs345
8 | abs345
I want to delete the rows where a commentid shows up again later, that is, where its IDs no longer follow on sequentially from its first run.
For this example, the rows with ID 7 and 8 would be deleted.
Do you want to return all rows except for the last comment id when it is duplicated?
select t.*
from (select t.*,
count(*) over (partition by commentid) as commentid_cnt,
max(id) over (partition by commentid) as max_commentid_id,
max(id) over () as max_id
from t
) t
where max_id = max_commentid_id and commentid_cnt > 1;
EDIT:
Oh, I think I understand. You want to keep only the first "grouping" of commentid. Assuming that the ids are sequential with no gaps, one approach is:
enumerate the rows for each commentid
subtract that row number from id
If the result is larger than the minimum id minus 1, then you are not in the "first" group.
This looks like:
select t.*
from (select t.*,
min(id) over (partition by commentid) as min_id,
row_number() over (partition by commentid order by id) as seqnum
from t
) t
where id - seqnum = min_id - 1
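
The arithmetic can be checked against the question's eight rows. This is a minimal sketch using Python's sqlite3 module; SQLite 3.25 or later is assumed for window-function support, and only the id column is selected to keep the output short.

```python
import sqlite3

# Sample rows from the question: (id, commentid)
COMMENTS = [
    (1, "abs345"), (2, "abs345"), (3, "abs345"),
    (4, "poly234"), (5, "poly234"),
    (6, "qq1r4c"),
    (7, "abs345"), (8, "abs345"),
]


def first_run_ids():
    """Return the ids that belong to the first run of their commentid."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, commentid TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", COMMENTS)
    # Inside the first run, id - seqnum stays constant at min_id - 1.
    # Once the ids jump (3 -> 7 for abs345), the difference grows.
    rows = conn.execute("""
        SELECT id
        FROM (SELECT id,
                     MIN(id) OVER (PARTITION BY commentid) AS min_id,
                     ROW_NUMBER() OVER (PARTITION BY commentid ORDER BY id) AS seqnum
              FROM t) AS numbered
        WHERE id - seqnum = min_id - 1
        ORDER BY id
    """).fetchall()
    conn.close()
    return [r[0] for r in rows]


print(first_run_ids())  # [1, 2, 3, 4, 5, 6]
```

Ids 7 and 8 drop out because for them id - seqnum is 3, while min_id - 1 for abs345 is 0.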

how to find which element appears the most in an sql table

I have a table set in the following manner:
band_id | song_name
1 | rolling
2 | stomp
1 | rage
3 | atmosphere
And so on. How can I find out which band appears the most?
You can use RANK() window function:
select t.band_id
from (
select band_id,
rank() over (order by count(*) desc) rn
from tablename
group by band_id
) t
where t.rn = 1;
or if you don't need ties in the results:
select band_id
from tablename
group by band_id
order by count(*) desc limit 1;
Results:
| band_id |
| ------- |
| 1 |
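
The difference between the two queries only shows up when there is a tie. The following sketch runs the RANK() version through Python's sqlite3 module (SQLite 3.25 or later assumed). The extra song "encore" for band 2 is invented purely to create a tie and is not part of the question's data.

```python
import sqlite3

# Sample rows from the question: (band_id, song_name)
SONGS = [(1, "rolling"), (2, "stomp"), (1, "rage"), (3, "atmosphere")]


def top_bands(songs):
    """Return every band_id that shares the highest song count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (band_id INTEGER, song_name TEXT)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", songs)
    # RANK() gives all bands with the top count the value 1,
    # so ties are all returned.
    rows = conn.execute("""
        SELECT band_id
        FROM (SELECT band_id,
                     RANK() OVER (ORDER BY COUNT(*) DESC) AS rn
              FROM tablename
              GROUP BY band_id) AS ranked
        WHERE rn = 1
        ORDER BY band_id
    """).fetchall()
    conn.close()
    return [r[0] for r in rows]


print(top_bands(SONGS))                    # [1]
print(top_bands(SONGS + [(2, "encore")]))  # [1, 2]
```

With the LIMIT 1 variant, the second call would return only one of the two tied bands, and which one is not guaranteed.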

hadoop hive using row_number()

I have a dataset with many duplicate IDs. I just want to apply row_number() and take the first row per ID. If I left join table1 with table2 and keep only table2.rownumber=1, it works, but a standalone query without the join does not. I have the following code:
SELECT
ID,
NAME,
NRIC,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) as RNK
FROM TABLE1
WHERE RNK=1;
The error message says that RNK is not a valid table column or alias.
Any help would be greatly appreciated. Thanks.
You have to use a subquery or CTE to refer to a column alias for filtering:
SELECT ID, NAME, NRIC, RNK
FROM (SELECT t1.*, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) as RNK
FROM TABLE1
) t1
WHERE RNK = 1;
This is true of all column aliases, even those defined by window functions.
Given:
create table dupes (
id string,
democode string,
extract_timestamp string
);
And:
insert into dupes (id, democode,extract_timestamp) values
('1','code','2020')
,('2','code2','2020')
,('2','code22','2021')
,('3','code3','2020')
,('3','code33','2021')
,('3','code333','2012')
;
When:
SELECT id,democode,extract_timestamp
FROM (
SELECT id,democode,extract_timestamp,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num
FROM dupes
) t1
WHERE row_num = 1;
Then:
+-----+-----------+--------------------+--+
| id | democode | extract_timestamp |
+-----+-----------+--------------------+--+
| 1 | code | 2020 |
| 2 | code22 | 2021 |
| 3 | code33 | 2021 |
+-----+-----------+--------------------+--+
Note that tables are often partitioned, and we might want to deduplicate within each partition. In that case we would add the partition key(s) to the OVER clause. For example, if the table were partitioned by report_date DATE, we might use:
ROW_NUMBER() OVER (PARTITION BY id, report_date ORDER BY extract_timestamp DESC) AS row_num
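
To see the effect of the extra partition key, here is a rough sketch using Python's sqlite3 module in place of Hive (SQLite 3.25 or later assumed). The dupes rows are the ones from above; the report_date values are invented for illustration, and report_date is an ordinary column here rather than a real Hive partition.

```python
import sqlite3

# (id, democode, extract_timestamp, report_date); report_date is made up
DUPES = [
    ("1", "code", "2020", "2021-01-01"),
    ("2", "code2", "2020", "2021-01-01"),
    ("2", "code22", "2021", "2021-01-01"),
    ("3", "code3", "2020", "2021-01-01"),
    ("3", "code33", "2021", "2021-01-02"),
    ("3", "code333", "2012", "2021-01-02"),
]


def latest_per_id_and_date():
    """Keep the newest row for each (id, report_date) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dupes (id TEXT, democode TEXT,"
        " extract_timestamp TEXT, report_date TEXT)"
    )
    conn.executemany("INSERT INTO dupes VALUES (?, ?, ?, ?)", DUPES)
    rows = conn.execute("""
        SELECT id, report_date, democode
        FROM (SELECT id, report_date, democode,
                     ROW_NUMBER() OVER (PARTITION BY id, report_date
                                        ORDER BY extract_timestamp DESC) AS row_num
              FROM dupes) AS t1
        WHERE row_num = 1
        ORDER BY id, report_date
    """).fetchall()
    conn.close()
    return rows


for row in latest_per_id_and_date():
    print(row)
```

Id 3 now keeps two rows, code3 for 2021-01-01 and code33 for 2021-01-02, because deduplication happens separately within each report_date.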

How to compare the column values of two last inserted specified rows on the same table?

I have a data set that is updated on each operation made by customers.
For example, I am getting a customer's last two operations by
select id,
referance
from (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id')
al where seqnum <= 2
where id comes from a feature file. But now I need to compare the referance values of these two operations.
mytable:
id | name | referance | time |
-------------------------------------
11 | abc | 4589 | 09:05 |
11 | abc | 1234 | 09:04 |
10 | xyz | 0185 | 09:02 |
15 | qpr | 9564 | 08:54 |
so on...
Again, I can get the last two rows with id = 11, and as long as none of the columns is null, the query returns "true", which is exactly what I want.
But I'd also like to compare whether their referance values are the same, so that the query returns "true" or "false".
Thanks in advance
P.S. I really just need a useful function or idea. I've already tried an inner join but couldn't get it to work:
select table1.id,
table1.referance,
table2.id,
table2.referance
from (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id') table1
inner join (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id') table2
on table1.referance != table2.referance
al where seqnum <= 2 order by seqnum
Aggregate your current query over the id and check whether the two referance values are the same.
select
id,
case when count(distinct referance) = 1
then 'true' else 'false' end as result
from
(
select id, referance,
row_number() over (order by time desc) as seqnum
from mytable
where id=':id'
) al
where seqnum <= 2
group by id;
If the distinct count of referance over the two records is 1, they have the same value. Otherwise, we can assume that the values are different.
Why are you using row_number()? You can get the last two rows as:
select top 2 id, referance
from mytable
where id=':id'
order by time desc;
You can then determine if these are the same using aggregation:
select (case when min(referance) <> max(referance) then 'false'
else 'true'
end) as is_same
from (select top 2 id, referance
from mytable
where id=':id'
order by time desc
) t;
Note: This doesn't take NULL values of referance into account, but that is easily incorporated into the logic.
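
One way the NULL handling could be folded in is sketched below with Python's sqlite3 module. Several things here are assumptions rather than part of the answers above: SQLite is used, so LIMIT 2 replaces TOP 2; the id is passed as a bound parameter instead of the ':id' literal; and a missing or NULL referance is treated as "not the same". The extra rows in the later calls are invented test data.

```python
import sqlite3

# Sample rows from the question: (id, name, referance, time)
ROWS = [
    (11, "abc", "4589", "09:05"),
    (11, "abc", "1234", "09:04"),
    (10, "xyz", "0185", "09:02"),
    (15, "qpr", "9564", "08:54"),
]


def last_two_match(rows, customer_id):
    """Return 'true' if the customer's two latest referance values are equal."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (id INTEGER, name TEXT, referance TEXT, time TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)
    # COUNT(referance) ignores NULLs, so it is 2 only when two rows exist
    # and both carry a value; MIN = MAX then means the values are equal.
    (result,) = conn.execute("""
        SELECT CASE WHEN COUNT(referance) = 2
                     AND MIN(referance) = MAX(referance)
                    THEN 'true' ELSE 'false' END
        FROM (SELECT referance
              FROM mytable
              WHERE id = ?
              ORDER BY time DESC
              LIMIT 2) AS last_two
    """, (customer_id,)).fetchone()
    conn.close()
    return result


print(last_two_match(ROWS, 11))                                   # false
print(last_two_match(ROWS + [(10, "xyz", "0185", "09:01")], 10))  # true
print(last_two_match(ROWS + [(15, "qpr", None, "08:50")], 15))    # false
```

Whether a NULL should count as a mismatch is a business decision; if two NULLs should compare as equal, the CASE condition would need to change.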

sql query distinct with Row_Number

I am struggling with the DISTINCT keyword in SQL.
I just want to display row numbers for the unique (distinct) values in a column, so I tried:
SELECT DISTINCT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
However, the code below does give me the distinct values:
SELECT distinct id FROM table WHERE fid = 64
But when I combine it with ROW_NUMBER(), it does not work.
This can be done very simply; you were pretty close already:
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
Use this:
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM
(SELECT DISTINCT id FROM table WHERE fid = 64) Base
This puts the "output" of one query in as the "input" of another.
Using CTE:
; WITH Base AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM Base
The two queries should be equivalent.
Technically you could
SELECT DISTINCT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
but if you increase the number of DISTINCT fields, you have to put all these fields in the PARTITION BY, so for example
SELECT DISTINCT id, description,
ROW_NUMBER() OVER (PARTITION BY id, description ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
I hope you realize that you are going against standard naming conventions here: id should probably be a primary key, and therefore unique by definition, so DISTINCT would be useless on it unless you combined the query with some JOINs or UNION ALL.
This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
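
The two result sets above can be reproduced with a small sketch in Python's sqlite3 module (SQLite 3.25 or later assumed). The table t and column v follow the article's example; the alias rn is used instead of row_number simply to avoid naming a column after a function.

```python
import sqlite3

# The eight values behind the article's example output
VALUES = ["a", "a", "a", "b", "c", "c", "d", "e"]


def distinct_with(window_function):
    """Run SELECT DISTINCT v, <fn>() OVER (ORDER BY v) and return the rows."""
    if window_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError("expected ROW_NUMBER or DENSE_RANK")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (v TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in VALUES])
    rows = conn.execute(
        f"SELECT DISTINCT v, {window_function}() OVER (ORDER BY v) AS rn "
        "FROM t ORDER BY v, rn"
    ).fetchall()
    conn.close()
    return rows


print(len(distinct_with("ROW_NUMBER")))  # 8: every (v, rn) pair is already unique
print(distinct_with("DENSE_RANK"))
# [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
```

DISTINCT is applied after the window function has been evaluated, which is why it cannot collapse the ROW_NUMBER() rows but does collapse the DENSE_RANK() ones.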
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (window) row_number,
RANK() OVER (window) rank,
DENSE_RANK() OVER (window) dense_rank
FROM t
WINDOW window AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
Using DISTINCT causes issues as you add fields and it can also mask problems in your select. Use GROUP BY as an alternative like this:
SELECT id
,ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
where fid = 64
group by id
Then you can add other interesting information from your select like this:
,count(*) as thecount
or
,max(description) as description
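
A runnable sketch of the GROUP BY variant, using Python's sqlite3 module (SQLite 3.25 or later assumed). The table name items, its description column values and the sample rows are all hypothetical; only id, fid = 64 and the thecount alias come from the answer above.

```python
import sqlite3

# Hypothetical sample rows: (id, fid, description)
ITEMS = [
    (10, 64, "a"), (10, 64, "b"),
    (20, 64, "c"),
    (30, 65, "d"),
    (30, 64, "e"), (30, 64, "f"), (30, 64, "g"),
]


def numbered_groups():
    """Number each distinct id for fid = 64 and count its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, fid INTEGER, description TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", ITEMS)
    # GROUP BY collapses the duplicates first; ROW_NUMBER() then
    # numbers the grouped rows, one per id.
    rows = conn.execute("""
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY id) AS RowNum,
               COUNT(*) AS thecount
        FROM items
        WHERE fid = 64
        GROUP BY id
        ORDER BY id
    """).fetchall()
    conn.close()
    return rows


print(numbered_groups())  # [(10, 1, 2), (20, 2, 1), (30, 3, 3)]
```

Because the window function is evaluated after grouping, each id gets exactly one row number, and aggregates such as COUNT(*) can sit alongside it.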
How about something like
;WITH DistinctVals AS (
SELECT distinct id
FROM table
where fid = 64
)
SELECT id,
ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM DistinctVals
You could also try
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM #mytable
where fid = 64
Try this:
;WITH CTE AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM CTE
Try this
SELECT distinct id
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
Or partition ROW_NUMBER() by id and keep only the first row of each partition:
SELECT id
FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
WHERE t.RowNum=1
This also returns the distinct ids
The question is old and my answer might not add much, but here are my two cents for making the query a little more useful:
;WITH DistinctRecords AS (
SELECT DISTINCT [col1,col2,col3,..]
FROM tableName
where [my condition]
),
serialize AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY [colNameAsNeeded] ORDER BY [colNameNeeded]) AS Sr,*
FROM DistinctRecords
)
SELECT * FROM serialize
The usefulness of two CTEs lies in the fact that you can now use the serialized records much more easily in your query, for example to do count(*).
DistinctRecords selects all distinct records, and serialize applies serial numbers to them. Afterwards you can use the final serialized result for your purposes without clutter.
PARTITION BY might not be needed in most cases.