In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500,000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly, which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right-hand side checks whether any first name stayed the same after the "randomizing".
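To make the lead/lag idea concrete, here is a minimal Python sketch of the same logic. It is only a model of the SQL, not the query itself: `shuffle_by_neighbour` is a made-up name, and the list index stands in for the `id` column.

```python
import random

def shuffle_by_neighbour(names, rng=random):
    """Model of coalesce(lead(first_name), lag(first_name)) over a random order."""
    order = list(range(len(names)))
    rng.shuffle(order)                      # stands in for ORDER BY random()
    result = list(names)
    for pos, row in enumerate(order):
        if pos + 1 < len(order):
            result[row] = names[order[pos + 1]]   # lead(): next row in random order
        elif pos > 0:
            result[row] = names[order[pos - 1]]   # lag(): fallback for the last row
        else:
            result[row] = None                    # single row: both lead and lag are NULL
    return result

names = ["Ann", "Bob", "Cid", "Dee", "Eve"]
shuffled = shuffle_by_neighbour(names, random.Random(1))
print(shuffled)
```

One thing the sketch makes visible: the result is not a strict permutation. The last row in the random order reuses the name of the row before it, and the name of the first row in that order is not handed to anyone. For anonymizing that is usually fine, but it is worth knowing.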
The problem with Postgres is that every UPDATE amounts to a delete plus an insert.
You can run EXPLAIN ANALYZE on a SELECT in place of the UPDATE to see how the CTE itself performs.
You can drop the indexes before the update and recreate them afterwards, so the update is faster.
But the best solution I use when I need to update all the rows is to create the table again:
CREATE TABLE new_table AS
SELECT * ....
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;
-- then recreate the indexes and constraints
Sorry, that isn't an option for you :(
EDIT: After reading a_horse_with_no_name's answer,
it looks like you need:
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again, for a performance question it is better if you provide the EXPLAIN / EXPLAIN ANALYZE output.
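The difference from the query in the question is the join on rn: without it, every id is paired with every name, which is n x n rows. A rough Python sketch of the pairing, assuming a dict from id to first name as a stand-in for the person table (`shuffle_by_row_number` is a made-up name):

```python
import random

def shuffle_by_row_number(rows, rng=random):
    """rows maps id -> first_name; returns a new mapping with the names permuted."""
    ids = list(rows)                  # the "ids" CTE
    names = list(rows.values())       # the "first_names" CTE
    rng.shuffle(ids)                  # row_number() over (order by random())
    rng.shuffle(names)                # a second, independent random order
    return dict(zip(ids, names))      # join on rn: exactly one name per id

person = {1: "Ann", 2: "Bob", 3: "Cid", 4: "Dee"}
shuffled = shuffle_by_row_number(person, random.Random(7))
print(shuffled)
```

Because each name is used exactly once, this variant is a true permutation, although some rows may end up with their original name by chance.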
Related
I am trying to come up with a single query that helps me atomically normalise a table (populate a new table with initial values from another table, and simultaneously add the foreign key reference to it)
Obviously populating one table from another is a standard INSERT INTO ... SELECT ..., but I also want to update the foreign key reference in the 'source' table to reference the new record in the 'new' table.
Let's say I am migrating a schema from:
CREATE TABLE companies (
id INTEGER PRIMARY KEY,
address_line_1 TEXT,
address_line_2 TEXT,
address_line_3 TEXT
)
to:
CREATE TABLE addresses (
id INTEGER PRIMARY KEY,
line_1 TEXT,
line_2 TEXT,
line_3 TEXT
)
CREATE TABLE companies (
id INTEGER PRIMARY KEY,
address_id INTEGER REFERENCES addresses(id)
)
... I thought perhaps a CTE might help of the form
WITH new_addresses AS (
INSERT INTO addresses (line_1, line_2, line_3)
SELECT address_line_1, address_line_2, address_line_3
FROM companies
RETURNING id, companies.id AS company_id -- DOESN'T WORK
)
UPDATE companies
SET address_id = new_addresses.id
FROM new_addresses
WHERE new_addresses.company_id = companies.id
However, it seems that RETURNING can only return data from the inserted record, so this will not work.
I assume at this point that the answer will be to either use PLSQL or incorporate domain knowledge of the data to do this in a multi step process. My current solution is pretty much:
-- FIRST QUERY
-- ensure address record IDs will be in the same sequential order as their relating companies
INSERT INTO addresses (line_1, line_2, line_3)
SELECT address_line_1, address_line_2, address_line_3
FROM companies
ORDER BY id;
-- SECOND QUERY
-- join, making the assumption that the table IDs are in the same order
WITH address_ids AS (
SELECT id AS address_id, ROW_NUMBER() OVER(ORDER BY id) AS idx
FROM addresses
), company_ids AS (
SELECT id AS company_id, ROW_NUMBER() OVER(ORDER BY id) AS idx
FROM companies
), company_address_ids AS (
SELECT company_id, address_id
FROM address_ids
JOIN company_ids USING (idx)
)
UPDATE companies
SET address_id = company_address_ids.address_id
FROM company_address_ids
WHERE id = company_address_ids.company_id
This is obviously problematic in that it relies on the addresses table containing exactly as many records as the company table, but such a query would be a one-off when the table is first created.
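For what it's worth, the positional pairing in the second query boils down to something like the following Python sketch. The function name and the sample IDs are made up for illustration; the explicit length check is the guard that the SQL version does not have.

```python
def pair_by_position(company_ids, address_ids):
    """Pair the k-th smallest company id with the k-th smallest address id."""
    if len(company_ids) != len(address_ids):
        # Without this check a mismatch would silently leave rows unpaired.
        raise ValueError("row counts differ; positional pairing would be wrong")
    return dict(zip(sorted(company_ids), sorted(address_ids)))

mapping = pair_by_position([10, 30, 20], [101, 102, 103])
print(mapping)  # {10: 101, 20: 102, 30: 103}
```

The pairing is only as good as the assumption that both tables were filled in the same order, which is exactly the weak spot described above.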
This code I have finds duplicate rows in a table.
SELECT position, name, count(*) as cnt
FROM team
GROUP BY position, name
HAVING COUNT(*) > 1
How do I delete the duplicate rows that I have found in HiveQL?
Apart from distinct, you can use row_number for this in Hive. Explicit delete and update can only be performed on tables that support ACID. So insert overwrite is more universal.
insert overwrite table team
select position, name, other1, other2...
from (
select
*,
row_number() over(partition by position, name order by rand()) as rn
from team
) tmp
where rn = 1
;
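In plain terms, `rn = 1` with `order by rand()` keeps one arbitrary row per (position, name) group. A small Python sketch of that logic, with made-up sample rows and a hypothetical helper name:

```python
import random
from collections import defaultdict

def keep_one_per_group(rows, key, rng=random):
    """Keep one arbitrary row per group, like row_number() ... where rn = 1."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)          # partition by position, name
    return [rng.choice(members) for members in groups.values()]

# Hypothetical rows: (position, name, some other column)
team = [("GK", "Ann", 1), ("GK", "Ann", 2), ("DF", "Bob", 3)]
deduped = keep_one_per_group(team, key=lambda r: (r[0], r[1]), rng=random.Random(0))
print(deduped)
```

If it matters which duplicate survives, replace the random choice with a deterministic ordering, just as you would replace `rand()` in the window's ORDER BY.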
Please try this, assuming id is the primary key column:
delete from team where id in (
select t1.id from team t1,
(SELECT position, name, count(*) as cnt ,max(id) as id1
FROM team
GROUP BY position, name
HAVING COUNT(*) > 1) t2
where t1.position=t2.position
and t1.name=t2.name
and t1.id<>t2.id1)
This is an alternative way, since deletes are expensive in Hive
Create table Team_new
As
Select distinct <col1>, <col2>,...
from Team;
Drop table Team purge;
Alter table Team_new rename to Team;
This is assuming you don’t have an id column. If you have an id column then the 1st query would change slightly as
Create table Team_new
As
Select <col1>,<col2>,...,max(id) as id from Team
Group by <col1>,<col2>,... ;
Other queries (drop & alter post this) would remain the same as above.
My query deletes the whole table instead of duplicate rows.
Video as proof: https://streamable.com/3s843
create table customer_info (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
phone_number VARCHAR(50)
);
insert into customer_info (id, first_name, last_name, phone_number) values
(1, 'Kevin', 'Binley', '600-449-1059'),
(1, 'Kevin', 'Binley', '600-449-1059'),
(2, 'Skippy', 'Lam', '779-278-0889');
My query:
with t1 as (
select *, row_number() over(partition by id order by id) as rn
from customer_info)
delete
from customer_info
where id in (select id from t1 where rn > 1);
Your query would delete all rows from each set of dupes (as all share the same id by which you select - that's what @wildplasser hinted at with subtle comments) and only initially unique rows would survive. So if it "deletes the whole table", that means there were no unique rows at all.
In your query, dupes are defined by (id) alone, not by the whole row as your title suggests.
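To see why, here is the question's DELETE traced step by step in Python on the sample data. This is only a model of the SQL logic, not something you would run against the database.

```python
customer_info = [
    (1, "Kevin", "Binley", "600-449-1059"),
    (1, "Kevin", "Binley", "600-449-1059"),
    (2, "Skippy", "Lam", "779-278-0889"),
]

# row_number() over (partition by id order by id): number the rows within each id
seen = {}
numbered = []
for row in customer_info:
    seen[row[0]] = seen.get(row[0], 0) + 1
    numbered.append((row, seen[row[0]]))

# select id from t1 where rn > 1
ids_to_delete = {row[0] for row, rn in numbered if rn > 1}

# delete ... where id in (...): this matches every row with that id, rn = 1 included
remaining = [row for row in customer_info if row[0] not in ids_to_delete]
print(remaining)  # [(2, 'Skippy', 'Lam', '779-278-0889')]
```

The subquery returns only an id, so the outer DELETE can no longer tell the first copy from the second.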
Either way, there is a remarkably simple solution:
DELETE FROM customer_info c
WHERE EXISTS (
SELECT FROM customer_info c1
WHERE ctid < c.ctid
AND c1 = c -- comparing whole rows
);
Since you deal with completely identical rows, the remaining way to tell them apart is the internal tuple ID ctid.
My query deletes all rows, where an identical row with a smaller ctid exists. Hence, only the "first" row from each set of dupes survives.
Notably, NULL values compare equal in this case - which is most probably as desired. The manual:
The SQL specification requires row-wise comparison to return NULL if
the result depends on comparing two NULL values or a NULL and a
non-NULL. PostgreSQL does this only when comparing the results of two
row constructors (as in Section 9.23.5) or comparing a row constructor
to the output of a subquery (as in Section 9.22). In other contexts
where two composite-type values are compared, two NULL field values
are considered equal, [...]
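The whole-row DELETE above can be modelled in a few lines of Python, with the list position standing in for ctid (`delete_later_duplicates` is a made-up name). Python's `None == None` happens to mirror the "two NULL field values are considered equal" behaviour quoted from the manual.

```python
def delete_later_duplicates(rows):
    """Keep a row only if no identical row sits at a smaller position."""
    kept = []
    for pos, row in enumerate(rows):                 # pos plays the role of ctid
        has_earlier_twin = any(rows[i] == row for i in range(pos))
        if not has_earlier_twin:                     # NOT EXISTS (... ctid < c.ctid AND c1 = c)
            kept.append(row)
    return kept

customer_info = [
    (1, "Kevin", "Binley", "600-449-1059"),
    (1, "Kevin", "Binley", "600-449-1059"),
    (2, "Skippy", "Lam", "779-278-0889"),
]
print(delete_later_duplicates(customer_info))
```

This sketch is quadratic and only meant to show the logic; the database is free to plan the EXISTS more cleverly.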
If dupes are defined by id alone (as your query suggests), then this would work:
DELETE FROM customer_info c
WHERE EXISTS (
SELECT FROM customer_info c1
WHERE ctid < c.ctid
AND id = c.id
);
But then there might be a better way to decide which rows to keep than ctid as a measure of last resort!
Obviously, you would then add a PRIMARY KEY to avoid the initial dilemma from reappearing. For the second interpretation, id is the candidate.
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
About ctid:
How do I decompose ctid into page and row numbers?
You can't if the table does not have a key.
Tables have "keys" that identify each row uniquely. If your table does not have any key, then you won't be able to identify one row from the other one.
The only workaround to delete duplicate rows I can think of would be to:
Add a key on the table.
Use the key to delete the rows that are in excess.
For example:
create sequence seq1;
alter table customer_info add column k1 int;
update customer_info set k1 = nextval('seq1');
delete from customer_info where k1 in (
select k1
from (
select
k1,
row_number() over(partition by id, first_name, last_name, phone_number) as rn
from customer_info
) x
where rn > 1
)
Now you only have two rows.
I have a statement in a stored procedure:
INSERT into table(ID, name, age)
SELECT fnGetLowestFreeID(), name, age
FROM #tempdata
The function fnGetLowestFreeID() gets the lowest free ID of the table table.
I want to insert unique ID with every record in the table. I have tried iteration and transaction. But they aren't fitting the scenario.
I cannot use Identity Column. I have this restriction of using IDs between 0-4 and assigning the lowest free ID using that function. In case of returned ID greater than 4, the function is returning an error. Suppose there are already 1 and 2 in the table. The function will return 0 and I have to assign this ID to the new record, 3 to the next record and so on on the basis of number of records in the #tempdata.
Try this:
CREATE TABLE dbo.Tmp_City(Id int NOT NULL IDENTITY(1, 1),
Name varchar(50) , Country varchar(50), )
OR
ALTER TABLE dbo.Tmp_City
ADD Id int NOT NULL IDENTITY(1, 1) -- SQL Server cannot turn an existing column into IDENTITY, so add it as a new column
OR
Create a Sequence and assign Sequence.NEXTVAL as ID
in the insert statement
You can make use of a rank function like row_number and do something like this.
INSERT into table(ID, name, age)
SELECT row_number() over (order by id) + fnGetLowestFreeID(), name, age
FROM #tempdata
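One caveat with the offset approach, worked through in Python for the example from the question (IDs 1 and 2 already taken). This assumes fnGetLowestFreeID() behaves as described in the question and returns the same value for every row of the statement; the variable names are made up.

```python
existing_ids = {1, 2}
new_rows = ["a", "b", "c"]

# What fnGetLowestFreeID() would return according to the question: 0
lowest_free = min(n for n in range(5) if n not in existing_ids)

# row_number() starts at 1, so the assigned IDs are lowest_free + 1, + 2, ...
assigned = [seq + lowest_free for seq in range(1, len(new_rows) + 1)]
collisions = sorted(set(assigned) & existing_ids)

print(assigned)    # [1, 2, 3]
print(collisions)  # [1, 2]
```

So the offset only works when the taken IDs form a gapless block starting at 0 (and even then you would want `row_number() - 1`); with gaps it hands out IDs that already exist.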
Here are 3 scenarios:
1) Show the function you are using.
2) It doesn't make sense to call a function and expect it to make the IDs unique.
Still, you can use a ranking function:
INSERT into table(ID, name, age)
SELECT row_number() over (order by id) + fnGetLowestFreeID(), name, age
FROM #tempdata
3) Otherwise, get rid of the function and use max(id)+1, since you don't want to use an identity column.
You could use a Numbers table to join the query doing your insert. You can google the concept for more info, but essentially you create a table (for example with the name "Numbers") with a single column "nums" of some integer type, and then you add some amount of rows, starting with 0 or 1, and going as far as you need. For example, you could end with this table content:
nums
----
0
1
2
3
4
5
6
Then you can use such a table to modify your insert; you don't need the function anymore:
INSERT into table(ID, name, age)
SELECT t2.nums, t.name, t.age
FROM (
SELECT name, age, row_number() over (order by name) as seq
FROM #tempdata
) t
INNER JOIN (
SELECT n.nums, row_number() over (order by n.nums) as seq
FROM Numbers n
WHERE n.nums < 5 AND NOT EXISTS (
SELECT * FROM table WHERE table.ID = n.nums
)
) t2 ON t.seq = t2.seq
Now, this query leaves out one of your requirements, namely raising an error when no slots are available, but that is easy to fix. Beforehand, you can run a query that tests whether the count of records in table plus the count of records in #tempdata is higher than 5. If so, you raise the error, because you know there would not be enough free slots for the records in #tempdata.
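If it helps, this is the same logic as a short Python sketch, including the "not enough slots" check. The function name and the `limit` parameter are made up for illustration.

```python
def assign_free_ids(existing_ids, new_rows, limit=5):
    """Pair each new row with the next free ID in 0..limit-1, lowest first."""
    # Numbers table filtered by NOT EXISTS: the IDs that are still free
    free = [n for n in range(limit) if n not in existing_ids]
    if len(new_rows) > len(free):
        raise ValueError("not enough free IDs for the new rows")
    # Both sides get a sequence number and are joined on it
    return list(zip(free, sorted(new_rows)))

print(assign_free_ids({1, 2}, ["x", "y", "z"]))  # [(0, 'x'), (3, 'y'), (4, 'z')]
```

With 1 and 2 taken, the new rows get 0, 3 and 4, which is the behaviour asked for in the question.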
Side note: table looks like a terrible name for a table, I hope that in your real code you have a meaningful name :)
Our client has asked us to randomise some contact data before we export data from a live database.
My plan was to create a copy of the contact table, then update the copied table's "FIRST_NAME" using a random row from the original table, and then update the "LAST_NAME" using a different random row of the original table.
Any idea how to do this? FYI, it's an Oracle 10 database and I'm using SQL Developer to do the work.
I would propose a hash function: STANDARD_HASH (note that it was introduced in Oracle 12c, so it is not available on an Oracle 10 database).
This way the data gets anonymized, but the customer is still able to use the data and run joins, analytics, etc.
SELECT
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA1')) AS SHA1,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA256')) AS SHA256,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA384')) AS SHA384,
RAWTOHEX(STANDARD_HASH('Wernfried', 'SHA512')) AS SHA512,
RAWTOHEX(STANDARD_HASH('Wernfried', 'MD5')) AS MD5
FROM dual;
So, something like
UPDATE my_table SET
FIRST_NAME = RAWTOHEX(STANDARD_HASH('secretPrefix'||FIRST_NAME, 'MD5')),
LAST_NAME = RAWTOHEX(STANDARD_HASH('secretPrefix'||LAST_NAME, 'MD5'));
Where "my_table" is of course a copy of your original data. I set a 'secretPrefix' because otherwise it would be rather simple to reverse the hash value, for example just by reading and hashing all names from a telephone directory.
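The effect of the prefix can be shown outside the database with a small Python sketch. Note the swaps: it uses SHA-256 (one of the algorithms in the SELECT above) rather than the MD5 of the UPDATE, `pseudonymise` is a made-up name, and the exact bytes hashed by Oracle depend on the database character set, so the digests need not match what STANDARD_HASH returns.

```python
import hashlib

def pseudonymise(value, secret_prefix):
    """Salted hash as uppercase hex, similar in spirit to RAWTOHEX(STANDARD_HASH(...))."""
    data = (secret_prefix + value).encode("utf-8")
    return hashlib.sha256(data).hexdigest().upper()

salted = pseudonymise("Wernfried", "secretPrefix")
unsalted = pseudonymise("Wernfried", "")
print(salted != unsalted)                                   # True
print(salted == pseudonymise("Wernfried", "secretPrefix"))  # True
```

Equal inputs still give equal outputs, which is what keeps joins working, while somebody without the prefix cannot simply hash a list of known names and compare.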
Update
ALTER TABLE CONTACTS ADD (rn NUMBER);
UPDATE CONTACTS a SET rn = (SELECT RN FROM (SELECT ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS RN, ROWID AS ROW_ID FROM CONTACTS b) WHERE a.rowid = ROW_ID);
UPDATE CONTACTS a SET (first_name, last_Name) = (
SELECT first_name, last_Name
FROM (SELECT first_name, last_Name, ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS RN FROM ANONYMISED_NAMES) r WHERE r.RN = a.RN)
Note that for this update the number of random names must be equal to or greater than the number of rows in the table of real names. Otherwise you can use the MOD function to wrap around.
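What "use MOD" means here, as a minimal Python sketch (the function name and sample names are made up): the row number wraps around the list of random names, so names repeat once the list is exhausted.

```python
def assign_names(contact_count, random_names):
    """Give every contact a name; wrap around when there are fewer names than contacts."""
    return [random_names[i % len(random_names)] for i in range(contact_count)]

print(assign_names(5, ["Alpha", "Beta"]))  # ['Alpha', 'Beta', 'Alpha', 'Beta', 'Alpha']
```

In the SQL this would correspond to joining on something like MOD(a.RN, name_count) + 1 in place of a.RN, where name_count is the number of rows in ANONYMISED_NAMES.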
insert into NewTable(...., name, last_name)
select ...., n.name, l.last_name
from OldTable u,
(select count(distinct name) name_count, count(distinct last_name) last_name_count
from OldTable) c,
(select name, row_number() over(order by DBMS_RANDOM.value) as id
from (select distinct name from OldTable)
) n,
(select last_name, row_number() over(order by DBMS_RANDOM.value) as id
from (select distinct last_name from OldTable)
) l
where n.id=mod(u.ID, c.name_count)+1 and l.id=mod(u.ID, c.last_name_count)+1
Here u.ID is the primary key of the source table, or some other unique, "random" integer value.