Remove duplicate records in SQL table

I have duplicate records in my table and want to delete them using the Identity column values. I want the columns "Fname" and "Lname" to uniquely identify every record, but there are duplicate Fname/Lname pairs with different upload dates. Below is the SQL query I designed to solve the problem, but it will take Min(id) rather than the row with Max(uploaddate). Please help me fix this code.
Select Max(uploaddate),
Min(id),
Fname,
Lname
From tbl
Group By Fname, Lname

This may be helpful for you.
CAUTION
Since this is a DELETE, before executing it, change it to SELECT * instead of DELETE and validate the output. If you are okay with the result, change it back to DELETE.
DELETE FROM MY_TABLE
WHERE MY_TABLE.ROWID IN (
    SELECT ROWID
    FROM (
        SELECT MY_TABLE.ROWID,
               -- rank rows within each FNAME/LNAME group, newest upload first
               ROW_NUMBER() OVER (PARTITION BY FNAME, LNAME
                                  ORDER BY UPLOADDATE DESC, ID ASC) RNK
        FROM MY_TABLE
    ) TMP
    WHERE RNK > 1   -- remove every duplicate, not just the second one
)
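As a concrete example of that validation pass, here is the same statement with SELECT in place of DELETE (identical logic; it just previews the rows that would be removed):
SELECT *
FROM MY_TABLE
WHERE MY_TABLE.ROWID IN (
    SELECT ROWID
    FROM (
        SELECT MY_TABLE.ROWID,
               ROW_NUMBER() OVER (PARTITION BY FNAME, LNAME
                                  ORDER BY UPLOADDATE DESC, ID ASC) RNK
        FROM MY_TABLE
    ) TMP
    WHERE RNK > 1
)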
I'm not sure what your database is; I have tested this with Oracle (ROWID is Oracle-specific).
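If you are on SQL Server instead (the mention of an identity column suggests it might be), a minimal sketch of the same idea against the tbl table from the question; SQL Server lets you delete through a CTE:
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY Fname, Lname
                              ORDER BY uploaddate DESC, id ASC) AS rnk
    FROM tbl
)
DELETE FROM ranked   -- deleting through the CTE deletes the underlying tbl rows
WHERE rnk > 1;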


Deleting multiple duplicate rows

This code I have finds duplicate rows in a table:
SELECT position, name, count(*) as cnt
FROM team
GROUP BY position, name
HAVING COUNT(*) > 1
How do I delete the duplicate rows that I have found in HiveQL?
Apart from distinct, you can use row_number for this in Hive. Explicit DELETE and UPDATE can only be performed on tables that support ACID, so INSERT OVERWRITE is more universal.
insert overwrite table team
select position, name, other1, other2...   -- list every original column except rn
from (
    select
        *,
        row_number() over (partition by position, name order by rand()) as rn
    from team
) tmp
where rn = 1
;
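If you would rather keep a specific row per group, say the most recently loaded one, you can order the window by a real column instead of rand(); a sketch assuming a hypothetical upload_date column:
insert overwrite table team
select position, name, other1, other2...   -- same column list as above
from (
    select
        *,
        -- upload_date is a hypothetical column; order by whatever marks the row to keep
        row_number() over (partition by position, name order by upload_date desc) as rn
    from team
) tmp
where rn = 1
;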
Please try this, assuming id is the primary key column (note that explicit DELETE in Hive needs an ACID-enabled table, as mentioned above):
delete from team where id in (
select t1.id from team t1,
(SELECT position, name, count(*) as cnt, max(id) as id1
FROM team
GROUP BY position, name
HAVING COUNT(*) > 1) t2
where t1.position=t2.position
and t1.name=t2.name
and t1.id<>t2.id1)
This is an alternative way, since deletes are expensive in Hive:
Create table Team_new
As
Select distinct <col1>, <col2>,...
from Team;
Drop table Team purge;
Alter table Team_new rename to Team;
This is assuming you don’t have an id column. If you have an id column, then the first query would change slightly:
Create table Team_new
As
Select <col1>,<col2>,...,max(id) as id from Team
Group by <col1>,<col2>,... ;
The other queries (the drop and alter after this) would remain the same as above.

Remove duplicates from table in BigQuery

I found duplicates in my table by running the query below.
SELECT name, id, count(1) as count
FROM [myproject:dev.sample]
group by name, id
having count(1) > 1
Now I would like to remove these duplicates based on id and name by using a DML statement, but it shows a '0 rows affected' message.
Am I missing something?
DELETE FROM PRD.GPBP WHERE
id not in(select id from [myproject:dev.sample] GROUP BY id) and
name not in (select name from [myproject:dev.sample] GROUP BY name)
I suggest you create a new table without the duplicates, drop your original table, and rename the new table to the original table name.
You can de-duplicate like below:
Create table new_table as
Select name, id, ......   -- put your remaining 10 cols here
FROM (
    SELECT *,
        ROW_NUMBER() OVER(Partition by name, id Order by id) as rnk
    FROM `myproject.dev.sample`   -- CTAS requires standard SQL table references
) a
WHERE rnk = 1;
Then drop the old table and rename new_table to the old table name.
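A sketch of that last step in BigQuery DDL, assuming the new table was created as myproject.dev.new_table (ALTER TABLE ... RENAME TO is available in standard SQL; the bq cp CLI command is another option):
DROP TABLE `myproject.dev.sample`;
ALTER TABLE `myproject.dev.new_table` RENAME TO sample;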
The below query (BigQuery Standard SQL) should be more optimal for de-duping in a case like yours:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
If you run it from within the UI, you can just set the Write Preference to "Overwrite Table" and you are done.
Or, if you want, you can use a DML INSERT into a new table and then copy it over the original one.
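A sketch of that INSERT route, assuming a pre-created table with the same schema (the sample_dedup target name is hypothetical):
#standardSQL
INSERT INTO `myproject.dev.sample_dedup`
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id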
Meanwhile, the easiest way is as below (using DDL):
#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
)

PostgreSQL shuffle column values

In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500,000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly, which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right-hand side checks whether any first name stayed the same after the "randomizing".
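A sketch of such a check, assuming you saved a copy of the original data in a hypothetical person_backup table before shuffling:
select count(*) as unchanged
from person p
join person_backup b on b.id = p.id   -- person_backup is a hypothetical pre-shuffle copy
where p.first_name = b.first_name;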
The problem with Postgres is that every update means a delete + insert.
You can check the plan by running EXPLAIN ANALYZE on a SELECT instead of the UPDATE to see how the CTEs perform.
You can drop the indexes so the updates are faster.
But the best solution I use when I need to update all the rows is to create the table again:
CREATE TABLE new_table AS
SELECT * ....;   -- apply the transformation in this SELECT
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;
-- then re-create the indexes and constraints
Sorry that isn't an option for you :(
EDIT: After reading a_horse_with_no_name, it looks like you need:
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again, for performance questions, it is better if you provide the EXPLAIN ANALYZE output.

Find duplicated rows that are not exactly the same

Can I select all rows that have the same value in one column (for example, the SSN field) but display them all separately?
I've searched for an answer, but the ones I found all have a "count(*) and group by" section that requires the rows to be exactly the same.
Try This:
SELECT A, B FROM MyTable
WHERE A IN
(
SELECT A FROM MyTable GROUP BY A HAVING COUNT(*)>1
)
I have done this with SQL Server, but I hope this is what you need.
Here is another approach, which only references the table once, using an analytic function instead of a subquery to get the duplicate counts. It might be faster; it also might not, depending on the particular data.
SELECT * FROM (
    SELECT col1, col2, col3, ssn,
           COUNT(*) OVER (PARTITION BY ssn) ssn_dup_count
    FROM MyTable
)
WHERE ssn_dup_count > 1
ORDER BY ssn_dup_count DESC
SELECT *
FROM MyTable
WHERE EXISTS
(
    SELECT NULL
    FROM MyTable MT
    WHERE MyTable.SameColumnName = MT.SameColumnName
      AND MyTable.DifferentColumnName <> MT.DifferentColumnName
)
This will fetch the required data and show it in order, so that we can see the grouped data together.
SELECT * FROM TABLENAME
WHERE SSN IN
(
SELECT SSN FROM TABLENAME GROUP BY SSN HAVING COUNT(SSN)>1
)
ORDER BY SSN
Here SSN is the column name for which the similar-value check is done.

Removing dups and updating null values

I've just been tasked with removing all the duplicate values in a database. Simple enough. But they also want me to go through and check if there are any Null values that were not Null in previous entries for that record.
So let's say that we have user 123. User 123 doesn't have a zip code listed for whatever reason. But in a past entry he had zip code 55555. I'm supposed to update the latest entry with that zip code from a past entry and then delete the past entry. Leaving me with only one entry for user 123 AND having the zip code 55555.
I'm just unsure how to do the update portion. Anybody have any suggestions?
Thanks!
Here is how you can do the update. It finds the last value for zip, and then updates the field, if necessary:
with lastval as (
select *
from (select id, zip, row_number() over (partition by id order by datecreated desc) as seqnum
from t
where zip is not null
) t
where seqnum = 1
)
update t
set t.zip = lastval.zip
from lastval
where t.id = lastval.id
However, I would suggest that you create a new table with the data that you want. Don't bother deleting and updating a zillion rows; create a table using a query such as:
select *
from (select t.*, row_number() over (partition by id order by datecreated desc) as seqnum
from t
where zip is not null
) t
where seqnum = 1
And insert the rows into a new table.
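A minimal sketch of that step, assuming SQL Server, where SELECT ... INTO creates the new table (t_new is a hypothetical name):
select *
into t_new   -- hypothetical name for the new table
from (select t.*, row_number() over (partition by id order by datecreated desc) as seqnum
      from t
      where zip is not null
     ) t
where seqnum = 1;
-- drop the helper seqnum column afterwards if you don't want to keep it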
And, one more suggestion: ask another question, with a better notion of what the fields in the table are like and which ones you want to look up last values for. That will provide additional information for better solutions.
You could use a statement similar to the following one:
update t1
set t1.address = dt.address,
t1.city = dt.city,
... and so on ...
from your_table as t1
inner join
(
select
max(id) as id,
companyname,
max(address) as address,
max(city) as city,
... and so on ...
from your_table
group by companyname -- your duplicate detection goes here
) dt
on dt.id = t1.id
This way you fill up all gaps in your duplicates. Then you just have to delete the duplicates.
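Once the gaps are filled, a sketch of that final delete (SQL Server syntax, keeping the max(id) row per companyname group, matching the duplicate detection above):
delete t1
from your_table as t1
inner join
(
    select companyname, max(id) as id1
    from your_table
    group by companyname
    having count(*) > 1
) dt
on dt.companyname = t1.companyname
where t1.id <> dt.id1;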