Randomise Table data in Oracle - sql

Our client has asked us to randomise some contact data before we export data from a live database.
My plan was to create a copy of the contact table, and then update the copied tables "FIRST_NAME" using a random row from the original table, and then update the "LAST_NAME" using a different random row of the original table.
Any idea how to do this? FYI, it's an oracle 10 db and I'm using SQL developer to do the work.

I would propose a hash function: STANDARD_HASH
By this the data gets anonymized but the customer is still able to use the data and run joins, analytic, etc.
SELECT
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA1')) AS SHA1,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA256')) AS SHA256,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA384')) AS SHA384,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA512')) AS SHA512,
RAWTOHEX(STANDARD_HASH('Wernfried', 'MD5')) AS MD5
FROM dual;
So, something like
UPDATE my_table SET
FIRST_NAME = RAWTOHEX(STANDARD_HASH('secretPrefix'||FIRST_NAME, 'MD5')),
LAST_NAME = RAWTOHEX(STANDARD_HASH('secretPrefix'||LAST_NAME, 'MD5'));
Where "my_table" is of course a copy of your original data. I set a 'secretPrefix' because otherwise it would be rather simple to revert the hash value just by reading and hashing all names from a telephone directory for example.
Update
ALTER TABLE CONTACTS ADD (rn NUMBER);
UPDATE CONTACTS a SET rn = (SELECT RN FROM (SELECT ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS RN, ROWID AS ROW_ID FROM CONTACTS b) WHERE a.rowid = ROW_ID);
UPDATE CONTACTS a SET (first_name, last_Name) = (
SELECT first_name, last_Name
FROM (SELECT first_name, last_Name, ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS RN FROM ANONYMISED_NAMES) r WHERE r.RN = a.RN)
Note, for this update the amount of random names must be equal or bigger than table of real names. Otherwise you may use MOD function.

insert into NewTable(...., name, last_name)
select ...., n.name, l.last_name
from OldTable u,
(select count(distinct name) name_count, count(distinct last_name) last_name_count
from OldTable) c,
(select name, row_number() over(order by DBMS_RANDOM.value) as id
from (select distinct name from OldTable)
) n,
(select last_name, row_number() over(order by DBMS_RANDOM.value) as id
from (select distinct last_name from OldTable)
) l
where n.id=mod(u.ID, c.name_count)+1 and l.id=mod(u.ID, c.last_name_count)+1
u.ID - is primary key of source table or other unique, "random" integer value.

Related

Deleting multiple duplicate rows

This code I have finds duplicate rows in a table. H
SELECT position, name, count(*) as cnt
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1
How do I delete the duplicate rows that I have found in Hiveql?
Apart from distinct, you can use row_number for this in Hive. Explicit delete and update can only be performed on tables that support ACID. So insert overwrite is more universal.
insert overwrite table team
select position, name, other1, other2...
from (
select
*,
row_number() over(partition by position, name order by rand()) as rn
from team
) tmp
where rn = 1
;
Please try this.assuming id is primary key column
delete from team where id in (
select t1.id from team t1,
(SELECT position, name, count(*) as cnt ,max(id) as id1
FROM team
GROUP BY position, name,
HAVING COUNT(*) > 1) t2
where t1.position=t2.position
and t1.name=t2.name
and t1.id<>t2.id1)
This is an alternative way, since deletes are expensive in Hive
Create table Team_new
As
Select distinct <col1>, <col2>,...
from Team;
Drop table Team purge;
Alter table Team_new rename to Team;
This is assuming you don’t have an id column. If you have an id column then the 1st query would change slightly as
Create table Team_new
As
Select <col1>,<col2>,...,max(id) as id from Team
Group by <col1>,<col2>,... ;
Other queries (drop & alter post this) would remain the same as above.

Remove duplicate records in sql table

I have duplicate records in my table and want to delete them using Identity column values. I want columns "Fname" and "Lname" to uniquely identify every record. But there are duplicate Fname and Lname with different upload date. Below is the SQL query I designed to solve the problem but is will take Min(id) rather than Max(uploaddate). Please help fix this code.
Select Max(uploaddate),
Min(id),
Fname,
Lname
From tbl
Group By Fname, Lname
This may be helpful for you,
CAUTION
Since this is DELETE, before executing this, change it to SELECT * instead of DELETE and validate the output. If you are okay with result change it back to DELETE
DELETE FROM MY_TABLE
WHERE MY_TABLE.ROWID IN (
SELECT ROWID
FROM (
SELECT MY_TABLE.ROWID
, ROW_NUMBER() OVER (PARTITION BY FNAME, LNAME ORDER BY UPLOADDATE DESC, ID ASC) RNK
FROM MY_TABLE
) TMP
WHERE RNK = 2
)
I'm not sure what is your database. I have tested this with ORACLE

PostgreSQL shuffle column values

In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500.000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly. Which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right hand side checks if any first name stayed the same after the "randomizing"
The problem with postgres is every update mean delete + insert
You can check the analyze with using a SELECT instead UPDATE to see what is the performance of CTE
You can turn off index so update are faster
But the best solution I use when need update all the rows is create the table again
.
CREATE TABLE new_table AS
SELECT * ....
DROP oldtable;
Rename new_table to old_table
CREATE index and constrains
Sorry that isnt an option for you :(
EDIT: After reading a_horse_with_no_name
looks like you need
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again for performance question is better if you provide the ANALYZE / EXPLAIN result.

Can we use join with in same table while using group by function?

For instance, I have a table with columns below:
pk_id,address,first_name,last_name
and I have a query like this to display the first name ans last name that are repetitive(duplicates)
select first_name,last_name
from table
group by first_name,last_name
having count(*)>1;
but the above query just returns first and last names but I want to display pk_id and address too that are tied to these duplicate first and last names
Can we use joins to do this on the same table.Please help!!
A simple way of doing is to build a view with the pk_id and the count of duplicates. Once you have it, it is only a matter of using a JOIN on the base table, and a filter to only keep rows having a duplicate:
SELECT T.*
FROM T
JOIN (SELECT "pk_id",
COUNT(*) OVER(PARTITION BY "first_name", "last_name") cnt
FROM T) V
ON T."pk_id" = V."pk_id"
WHERE cnt > 1
See http://sqlfiddle.com/#!4/3ecd0/9
You have to call it from an outer query, like this:
select * from table
where first_name||last_name in
(select first_name||last_name from
(select first_name, last_name, count( * )
from table
group by first_name,last_name
having count( * ) > 1
)
)
note: you may not need to concatenate the 2 fields, but I haven't tested thaT.
with
my_duplicates as
(
select
first_name,
last_name
from
my_table
group by
first_name,
last_name
having
count(*) > 1
)
select
bb.pk_id,
bb.address,
bb.first_name,
bb.last_name
from
my_duplicates aa
join my_table bb on
(
aa.first_name = bb.first_name
and
aa.last_name = bb.last_name
)
order by
bb.last_name,
bb.first_name,
bb.pk_id

Delete duplicate records from a SQL table without a primary key

I have the below table with the below records in it
create table employee
(
EmpId number,
EmpName varchar2(10),
EmpSSN varchar2(11)
);
insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (1, 'Jack', '555-55-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6 ,'Lisa', '555-70-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
I dont have any primary key in this table .But i have the above records in my table already.
I want to remove the duplicate records which has the same value in EmpId and EmpSSN fields.
Ex : Emp id 5
How can I frame a query to delete those duplicate records?
It is very simple. I tried in SQL Server 2008
DELETE SUB FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY EmpId, EmpName, EmpSSN ORDER BY EmpId) cnt
FROM Employee) SUB
WHERE SUB.cnt > 1
Add a Primary Key (code below)
Run the correct delete (code below)
Consider WHY you woudln't want to keep that primary key.
Assuming MSSQL or compatible:
ALTER TABLE Employee ADD EmployeeID int identity(1,1) PRIMARY KEY;
WHILE EXISTS (SELECT COUNT(*) FROM Employee GROUP BY EmpID, EmpSSN HAVING COUNT(*) > 1)
BEGIN
DELETE FROM Employee WHERE EmployeeID IN
(
SELECT MIN(EmployeeID) as [DeleteID]
FROM Employee
GROUP BY EmpID, EmpSSN
HAVING COUNT(*) > 1
)
END
Use the row number to differentiate between duplicate records. Keep the first row number for an EmpID/EmpSSN and delete the rest:
DELETE FROM Employee a
WHERE ROW_NUMBER() <> ( SELECT MIN( ROW_NUMBER() )
FROM Employee b
WHERE a.EmpID = b.EmpID
AND a.EmpSSN = b.EmpSSN )
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by EmpID,EmpSSN Order by EmpID,EmpSSN) as Duplicate From Employee)
delete From duplicates
Where Duplicate > 1 ;
This will update Table and remove all duplicates from the Table!
select distinct * into newtablename from oldtablename
Now, the newtablename will have no duplicate records.
Simply change the table name(newtablename) by pressing F2 in object explorer in sql server.
Code
DELETE DUP
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY Clientid ORDER BY Clientid ) AS Val
FROM ClientMaster
) DUP
WHERE DUP.Val > 1
Explanation
Use an inner query to construct a view over the table which includes a field based on Row_Number(), partitioned by those columns you wish to be unique.
Delete from the results of this inner query, selecting anything which does not have a row number of 1; i.e. the duplicates; not the original.
The order by clause of the row_number window function is needed for a valid syntax; you can put any column name here. If you wish to change which of the results is treated as a duplicate (e.g. keep the earliest or most recent, etc), then the column(s) used here do matter; i.e. you want to specify the order such that the record you wish to keep will come first in the result.
You could create a temporary table #tempemployee containing a select distinct of your employee table.
Then delete from employee.
Then insert into employee select from #tempemployee.
Like Josh said - even if you know the duplicates, deleting them will be impossile since you cannot actually refer to a specific record if it is an exact duplicate of another record.
If you don't want to create a new primary key you can use the TOP command in SQL Server:
declare #ID int
while EXISTS(select count(*) from Employee group by EmpId having count(*)> 1)
begin
select top 1 #ID = EmpId
from Employee
group by EmpId
having count(*) > 1
DELETE TOP(1) FROM Employee WHERE EmpId = #ID
end
ITS easy use below query
WITH Dups AS
(
SELECT col1,col2,col3,
ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY (SELECT 0)) AS rn
FROM mytable
)
DELETE FROM Dups WHERE rn > 1
delete sub from (select ROW_NUMBER() OVer(Partition by empid order by empid)cnt from employee)sub
where sub.cnt>1
I'm not an SQL expert so bear with me. I'm sure you'll get a better answer soon enough. Here's how you can find the duplicate records.
select t1.empid, t1.empssn, count(*)
from employee as t1
inner join employee as t2 on (t1.empid=t2.empid and t1.empssn = t2.empssn)
group by t1.empid, t1.empssn
having count(*) > 1
Deleting them will be more tricky because there is nothing in the data that you could use in a delete statement to differentiate the duplicates. I suspect the answer will involve row_number() or adding an identity column.
create unique clustered index Employee_idx
on Employee ( EmpId,EmpSSN )
with ignore_dup_key
You can drop the index if you don't need it.
no ID, no rowcount() or no temp table needed....
WHILE
(
SELECT COUNT(*)
FROM TBLEMP
WHERE EMPNO
IN (SELECT empno from tblemp group by empno having count(empno)>1)) > 1
DELETE top(1)
FROM TBLEMP
WHERE EMPNO IN (SELECT empno from tblemp group by empno having count(empno)>1)
there are two columns in the a table ID and name where names are repeating with different IDs so for that you may use this query:
.
.
DELETE FROM dbo.tbl1
WHERE id NOT IN (
Select MIN(Id) AS namecount FROM tbl1
GROUP BY Name
)
Having a database table without Primary Key is really and will say extremely BAD PRACTICE...so after you add one (ALTER TABLE)
Run this until you don't see any more duplicated records (that is the purpose of HAVING COUNT)
DELETE FROM [TABLE_NAME] WHERE [Id] IN
(
SELECT MAX([Id])
FROM [TABLE_NAME]
GROUP BY [TARGET_COLUMN]
HAVING COUNT(*) > 1
)
SELECT MAX([Id]),[TABLE_NAME], COUNT(*) AS dupeCount
FROM [TABLE_NAME]
GROUP BY [TABLE_NAME]
HAVING COUNT(*) > 1
MAX([Id]) will cause to delete latest records (ones added after first created) in case you want the opposite meaning that in case of requiring deleting first records and leave the last record inserted please use MIN([Id])
Let's think out of the box.
I don't delete from the table, I make a new table first, for safety.
I personally prefer do a
INSERT INTO new_table SELECT DISTINCT * FROM orig_table;
Now, new_table now should contains the expected data I want.
I can check new_table to ensure that.
Then I have 2 options to replace the orig_table
A. delete orig_table; rename new_table to orig_table
B. truncate orig_table; insert data from new_table to orig_table; delete new_table (Recommended: in case you have some trigger/something else linked to the original orig_table)
select t1.* from employee t1, employee t2 where t1.empid=t2.empid and t1.empname = t2.empname and t1.salary = t2.salary
group by t1.empid, t1.empname,t1.salary having count(*) > 1
delete from employee where rowid in (select rowid from (select rowid, name_count from (select rowid, count(emp_name) as name_count from employee group by emp_id, emp_name) where name_count>1))
DELETE FROM 'test'
USING 'test' , 'test' as vtable
WHERE test.id>vtable.id and test.common_column=vtable.common_column
Using this we can remove duplicate records
ALTER IGNORE TABLE test
ADD UNIQUE INDEX 'test' ('b');
# here 'b' is column name to uniqueness,
# here 'test' is index name.