delete duplicates in table using partition and where clause - sql

Using SQL Server 2016
I have found the partition-over-rows method to be the fastest for de-duplicating rows in large tables. I'm trying to use the same process to delete some duplicates, but now I have a unique situation.
Basically, I need to delete rows that are duplicated on all columns except one. However, rows are allowed to be duplicates if the excluded column is also duplicated, but not if it differs.
For example:
col1 col2 col3 col4
---- ---- ---- ----
1    2    3    4
1    2    3    4
1    2    3    5
The first 2 rows would be allowed to stay, but the 3rd row needs to be removed.
Normally I would use the code below to delete rows that are duplicated on certain criteria, but I don't know how to account for my current situation.
delete x
from (select col1,
             row_number() over (partition by col1 order by col1) as rn
      from table1
     ) x
where rn > 1
Thanks for any help.
Just FYI, the table contains 226 million rows.

I think you want a count, not a row number:
with todelete as (
      select t1.*,
             count(*) over (partition by col1, col2, col3, col4) as cnt
      from t1
     )
delete from todelete
where cnt > 1;
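For the exact rule in the question (keep exact duplicates, but remove rows whose col4 differs within a col1-col3 group), one sketch is DENSE_RANK, since SQL Server does not support COUNT(DISTINCT ...) as a window function. Table and column names follow the question; which col4 value survives is simply whichever sorts first:
-- Sketch: rows sharing col1-col3 get dense_rank 1 only while col4 also
-- matches; any second distinct col4 value ranks > 1 and is deleted
with todelete as (
    select t1.*,
           dense_rank() over (partition by col1, col2, col3
                              order by col4) as dr
    from table1 t1
)
delete from todelete
where dr > 1;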

Related

BigQuery delete duplicates keep last

I have a BigQuery table with columns col1, col2, col3. I want to delete rows that have duplicated values in col1 but keep the last one (the row that was pushed to BigQuery last). col2 and col3 do not have to be duplicates.
My main problem is that I cannot identify the last row. I tried the query below, but the ordering was not right. Yet when I just SELECT all the rows, the ordering is from the oldest to the newest rows.
SELECT *, ROW_NUMBER() OVER (PARTITION BY col1) AS row_numb
FROM table
I saw other solutions, but they ordered by some timestamp/created_at column that I do not have. I know one solution would be to add a timestamp column and then order by it to get the most recent row. But is there any other way?
Example:
col1 col2 col3
---- ---- ----
1    2    3
2    3    1
1    4    3
The last row (col1 = 1, col2 = 4, col3 = 3) was added last to BigQuery
So what I want is to find duplicates in the first column (that would be the 1st and 3rd rows) and delete all of those duplicates except the one that was added to BigQuery last (that is, the 3rd row).
The result would be
col1 col2 col3
---- ---- ----
2    3    1
1    4    3
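One sketch: BigQuery guarantees no row order without a column to order by, so the timestamp approach the asker mentions is the dependable one. Assuming a hypothetical inserted_at column populated at insert time, the keep-last pattern would be:
-- Sketch: keep only the newest row per col1; inserted_at is an assumed
-- timestamp column, and dataset.table is a placeholder name
CREATE OR REPLACE TABLE dataset.table AS
SELECT * EXCEPT (row_numb)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY col1
                            ORDER BY inserted_at DESC) AS row_numb
  FROM dataset.table
)
WHERE row_numb = 1;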

SQL query to remove duplicates from a table with 139 columns and load all columns to another table

I need to remove the duplicates from a table with 139 columns based on 2 columns, and load the unique rows, with all 139 columns, into another table.
e.g.:
col1 col2 col3 .....col139
a    b    .............
b    c    .............
a    b    .............
Output:
col1 col2 col3 .....col139
a    b    .............
b    c    .............
I need a SQL query for DB2.
If the "other table" does not exist yet you can create it like this
CREATE TABLE othertable LIKE originaltable
And the insert the requested row with this statement:
INSERT INTO othertable
SELECT col1, ..., coln
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS num
      FROM t) t
WHERE num = 1
There are numerous tools out there that generate queries and column lists, so if you do not want to write the column list by hand you could generate it with one of those tools, or use another SQL statement to select it from the Db2 catalog table (syscat.columns).
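A minimal sketch of that catalog query, assuming the table is named T in schema MYSCHEMA (both names are placeholders):
-- Sketch: build the comma-separated column list from the Db2 catalog
SELECT LISTAGG(colname, ', ') WITHIN GROUP (ORDER BY colno)
FROM syscat.columns
WHERE tabschema = 'MYSCHEMA'
  AND tabname = 'T';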
You might be better off just deleting the duplicates in place. This can be done without specifying a column list.
DELETE FROM
    (SELECT
         ROW_NUMBER() OVER (PARTITION BY col1, col2) AS DUP
     FROM t
    )
WHERE
    DUP > 1
You can use row_number():
select t.*
from (select t.*,
             row_number() over (partition by a, b order by a) as seqnum
      from t
     ) t
where seqnum = 1;
If you don't want seqnum in the result set, though, you need to list out all the columns.
To find duplicate values in col1 (or any column), you can run the following query:
SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1;
And if you want to delete the rows with those duplicated col1 values, you can run the following query (note that this removes every copy of a duplicated value, not just the extras):
DELETE FROM your_table WHERE col1 IN (SELECT col1 FROM your_table GROUP BY col1 HAVING COUNT(*) > 1);
You can use the same approach to delete duplicate rows from the table using col2 values.

How to select the min value from a table if the table has two unique values and the rest of the columns are identical

Ex: Input
ID Col1 Col2 Col3
-- ---- ---- ----
1 a a sql
2 a a hive
Output
ID Col1 Col2 Col3
-- ---- ---- ----
1 a a sql
Here my ID values and Col3 values are unique, but I need to filter on the min ID and return all the columns.
I know the approach below will work, but please suggest a better approach if there is one:
select Col1,Col2,min(ID) from table group by Col1,Col2;
and then join this back on ID, Col1, Col2.
I think you want row_number():
select t.*
from (select t.*,
             row_number() over (partition by col1, col2 order by id) as seqnum
      from t
     ) t
where seqnum = 1
It appears that Hive supports ROW_NUMBER. Though I've never used Hive, other RDBMSs would use it like this to get the entire contents of the min row without needing a join (this approach doesn't suffer problems if there are repeated minimum values):
SELECT a.*
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY id) rn
    FROM yourtable
) a
WHERE a.rn = 1
The inner query selects all the table data and establishes an incrementing counter in order of ID. It could be based on any column; the min ID (in this case) gets row number 1. If you wanted the max, order by ID desc.
If you want the number to restart for different values of another column (e.g. if ten of your rows had Col3 = 'sql' and twenty had 'hive'), you can say PARTITION BY col3 ORDER BY id, and the row number becomes a counter that increments for identical values of col3, restarting from 1 for each distinct value of col3, as sketched below.
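A minimal sketch of that partitioned variant, reusing the hypothetical yourtable name from the answer above:
-- Sketch: one row per distinct col3 value, taking the min id within each group
SELECT a.*
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col3 ORDER BY id) rn
    FROM yourtable
) a
WHERE a.rn = 1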

Rows between rownumber 1-2 million from an oracle table without field rownum in final output?

How do I get rows between row number 1 million and 2 million from an Oracle table, without having a rownum field in the final output?
Just do:
select col1, col2, col3, . . .
from (select t.*, rownum as seqnum
      from t
     ) t
where seqnum between 1000000 and 2000000;
That is, select the columns that you want in the output.
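On Oracle 12c and later, an alternative sketch uses OFFSET/FETCH; it assumes some column, say id, defines the row order (without an order by, rownum order is arbitrary):
-- Sketch: rows 1,000,000 through 2,000,000 inclusive
select col1, col2, col3
from t
order by id
offset 999999 rows fetch next 1000001 rows only;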

Update Table Beginning At Record One SQL Server

I am trying to update a table with records from another table. Whenever I use the INSERT INTO statement, I find that the records are simply appended. Instead, I want the records to be inserted from the top of the table. What is the easiest way to do this? I am thinking I could use an UPDATE statement, but that means I will have to join the tables. One of the tables (the one I am pulling records from) has only one column, so I would have to include another column to do the join. I am trying not to make it so complicated. If there is a simpler way, please let me know.
Sample:
Table One
Col1
1
2
3
4
Table 2
Col1 Col2
a
b
c
d
I want to move column 1 from table 1 to column 2 in table 2 such that table 2 will be:
Table 2
Col1 Col2
a 1
b 2
c 3
d 4
You can do the update using row_number(), but the rows will be assigned in an indeterminate order:
with toupdate as (
      select t2.*, row_number() over (order by (select null)) as seqnum
      from table2 t2
     ),
     t1 as (
      select t1.*, row_number() over (order by (select null)) as seqnum
      from table1 t1
     )
update toupdate
    set col2 = t1.col1
    from toupdate join
         t1
         on toupdate.seqnum = t1.seqnum;
Note: if you have an ordering in mind, then use the appropriate order by inside the over () clauses instead of (select null).
Unless you explicitly define an ORDER BY clause in your SELECT statements, your result set will be completely arbitrary. This is in line with how any RDBMS should operate. You should consider including a timestamp at the time of insertion to identify the latest rows.
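A minimal sketch of that suggestion in SQL Server syntax (the column and constraint names are illustrative):
-- Sketch: record insertion order explicitly so later updates and joins
-- have a deterministic key to order by
ALTER TABLE table2
    ADD inserted_at DATETIME2 NOT NULL
        CONSTRAINT df_table2_inserted_at DEFAULT SYSUTCDATETIME();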