I want to update names in a table with random names from another table. There are real user names in the table Users and made-up names in the table tmp_users. I'd like to update all the names in the Users table with random names from the tmp_users table. The idea is to anonymize real production data with fake customers. There are fewer entries in the tmp_users table, so I don't think I can correlate on an id.
The problem I have is that all users get set to the same name.
Some sample data:
create table users
(
name varchar2(50)
);
create table tmp_users
(
name varchar2(50)
);
insert into users values ('Cora');
insert into users values ('Rayna');
insert into users values ('Heidi');
insert into users values ('Gilda');
insert into users values ('Dorothy');
insert into users values ('Elena');
insert into users values ('Providencia');
insert into users values ('Louetta');
insert into users values ('Portia');
insert into users values ('Rodrick');
insert into users values ('Rocco');
insert into users values ('Nelson');
insert into users values ('Derrick');
insert into users values ('Everett');
insert into users values ('Nisha');
insert into users values ('Amy');
insert into users values ('Hyun');
insert into users values ('Brendon');
insert into users values ('Gabriela');
insert into users values ('Melina');
insert into tmp_users values ('Snow White');
insert into tmp_users values ('Cinderella');
insert into tmp_users values ('Aurora');
insert into tmp_users values ('Ariel');
insert into tmp_users values ('Belle');
insert into tmp_users values ('Jasmine');
insert into tmp_users values ('Pocahontas');
insert into tmp_users values ('Mulan');
insert into tmp_users values ('Tinker Bell');
insert into tmp_users values ('Anna');
insert into tmp_users values ('Elsa');
--Wrong, sets all users to the same random name
update users set name = (select name from (select name from tmp_users order by sys_guid()) where rownum = 1);
--Wrong, sets all users to the same random name
update users set name = (select name from (select name from tmp_users order by dbms_random.value) where rownum = 1);
When doing this:
select * from users;
The result I get is something like this, which I don't want.
Cinderella
Cinderella
Cinderella
Cinderella
Cinderella
...
I'd like to assign a random name to each row in the Users table. Not the same name to all rows. I'd like something like this:
Mulan
Cinderella
Belle
Elsa
Jasmine
Tinker Bell
...
Any idea how this can be done? I'm using Oracle Database 11g Express Edition 11.2.0.2.0. It would be easy to do with a cursor but I'm trying to figure out how to do it with a set operation.
Update:
I've now tested on two different Oracle versions. The correlated subquery solution doesn't work on Oracle Database 11g Express Edition 11.2.0.2.0.
But it does work sometimes on Oracle Database 11g Enterprise Edition 11.2.0.4.0. On one table it works all the time and on another it never works.
Testing with 11.2.0.4 the following works, similar to what @VR46 suggested:
SQL> UPDATE users u
2 SET name = (SELECT name
3 FROM (SELECT NAME,
4 row_number() over(ORDER BY dbms_random.value) rn
5 FROM tmp_users) tu
6 WHERE u.name IS NOT NULL
7 AND rn = 1);
20 rows updated
SQL> select * from users;
NAME
--------------------------------------------------
Ariel
Aurora
Belle
Ariel
Anna
Mulan
Aurora
Ariel
Mulan
Tinker Bell
Mulan
Ariel
Aurora
Pocahontas
Pocahontas
Aurora
Snow White
Mulan
Aurora
Anna
20 rows selected
I think correlating the sub-query will pull random names from the sub-query for each row:
UPDATE users
SET name = (SELECT name
FROM (SELECT name
FROM tmp_users tu
ORDER BY Sys_guid())
WHERE ROWNUM = 1
AND users.name <> name );
You need some correlation, as @VR46 suggested; but sys_guid() doesn't work for this (in 11gR2 anyway; I think the optimiser is only evaluating it once in this scenario). You can use dbms_random.value though:
update users u set name = (
select name from (
select name from tmp_users order by dbms_random.value
)
where rownum = 1 and u.name is not null
);
NAME
--------------------------------------------------
Jasmine
Tinker Bell
Ariel
Elsa
Elsa
Elsa
Belle
Snow White
...
If you don't want a subquery you could use keep (dense_rank first) instead:
update users u set name = (
select max(name) keep (dense_rank first order by dbms_random.value)
from tmp_users
where u.name is not null
);
NAME
--------------------------------------------------
Mulan
Anna
Snow White
Elsa
Tinker Bell
Belle
Belle
Elsa
...
The correlation just has to be true; if you have null values in your user table this will update them to null, and you could use a different condition if that is an issue.
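For example, here is a sketch of the same update correlated on rowid instead of the name (rowid is never null for an existing row, so the condition is always true); this is untested, and on 11.2.0.2 dbms_random may still only be evaluated once:
update users u set name = (
  select name from (
    select name from tmp_users order by dbms_random.value
  )
  -- u.rowid is never null, so the correlation holds for every row
  where rownum = 1 and u.rowid is not null
);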
From comments it seems you have the same problem with dbms_random in 11.2.0.2 as I have with sys_guid in 11.2.0.3 and 11.2.0.4. If your users table also has a numeric unique/primary key such as an ID you can use that for the correlation and pass it into the value function, which might make a difference, but I don't have a suitable instance to test against:
update users u set name = (
select max(name) keep (dense_rank first order by dbms_random.value(0, u.id))
from tmp_users
);
MOS note 420779.1 includes the line "Including DBMS_RANDOM.VALUE in a subquery may or may not work depending on the optimization and execution code path chosen", which seems to be the problem here.
You could also try variations with merge, e.g. (again assuming there's an ID you can use):
merge into users u
using (
select u.id, max(tu.name) keep (dense_rank first
order by dbms_random.value(0, u.id)) as name
from users u
cross join tmp_users tu
group by u.id
) tu
on (tu.id = u.id)
when matched then update set u.name = tu.name;
The cross join may make this impractical though, depending on the number of rows you actually have in each table.
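If your users table has a numeric id (the sample tables here only have a name column, so that is an assumption), a rough, untested sketch of a merge that avoids the cross join by matching each id against a numbered, shuffled copy of tmp_users:
merge into users u
using (
  select rownum rn, name, count(*) over () cnt
  from (select name from tmp_users order by dbms_random.value)
) tu
-- each user id maps to exactly one shuffled fake name
on (mod(u.id, tu.cnt) + 1 = tu.rn)
when matched then update set u.name = tu.name;
Each users row matches exactly one shuffled name, so the merge has a stable source set; the assignment is only as random as the shuffle, though.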
Here's one more approach, which uses rowid and modulus:
MERGE into users_tab u
USING (
select
actual.row_id as actual_rowid,
actual.rnum actual_rnum,
actual.name actual_name,
fake.rnum fake_rnum,
fake.name fake_name,
mod(actual.rnum, fake_count.cnt) modulus
from
(
select rownum rnum, name, rowid as row_id
from users_tab
) actual,
(
select rownum-1 rnum, name
from (select distinct name from tmp_users_tab)
) fake,
(select count(distinct name) cnt from tmp_users_tab) fake_count
where mod(actual.rnum, fake_count.cnt) = fake.rnum
) x
ON (x.actual_rowid = u.rowid)
WHEN MATCHED THEN UPDATE
set name = x.fake_name;
Not sure how this will perform on a very large user table, however. It's not random, but follows a series of fake names. So if you have 10 fake names, records 1->10 in users will be assigned fake names 1->10, and user 11 will start over with fake name #1.
The USING query has extra fields for testing.
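If you'd like the series itself to be in a random order on each run, one variation (my own untested tweak) is to shuffle the fake names before numbering them, i.e. build the fake inline view like this (shown as a standalone query):
select rownum-1 rnum, name
from (
  select name
  from (select distinct name from tmp_users_tab)
  order by dbms_random.value   -- shuffle before numbering
);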
Related
I have a table that looks like:
ID|CREATED |VALUE
1 |1649122158|200
1 |1649122158|200
1 |1649122158|200
That I'd like to look like:
ID|CREATED |VALUE
1 |1649122158|200
And I run the following query:
DELETE FROM MY_TABLE T USING (SELECT ID,CREATED,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CREATED DESC) AS RANK_IN_KEY FROM MY_TABLE T) X WHERE X.RANK_IN_KEY <> 1 AND T.ID = X.ID AND T.CREATED = X.CREATED
But it removes everything from MY_TABLE and not just other rows with the same value. This is more than just selecting distinct records, I'd like to enforce a unique constraint to get the latest value of ID and keep just one record for it, even if there were duplicates.
So
ID|CREATED |VALUE
1 |1649122158|200
1 |1649122159|300
2 |1649122158|200
2 |1649122158|200
3 |1649122170|500
3 |1649122160|200
Would become (using the same final unique constraint statement):
ID|CREATED |VALUE
1 |1649122159|300
2 |1649122158|200
3 |1649122170|500
How can I improve my logic to properly handle these unique constraint modifications?
Check out this post: https://community.snowflake.com/s/question/0D50Z00008EJgemSAD/how-to-delete-duplicate-records-
If all columns make up a unique record, the recommended solution is to insert all the records into a new table with SELECT DISTINCT * and do a swap. You could also do an INSERT OVERWRITE INTO the same table, something like INSERT OVERWRITE INTO tableA SELECT DISTINCT * FROM tableA;
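A sketch of the new-table-and-swap variant (your_table is a placeholder name):
-- build a deduplicated copy, then swap it in
CREATE TABLE your_table_dedup AS
SELECT DISTINCT * FROM your_table;

-- SWAP WITH exchanges the two tables in a single operation
ALTER TABLE your_table_dedup SWAP WITH your_table;

-- the _dedup table now holds the old, duplicated data
DROP TABLE your_table_dedup;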
The following setup should leave rows with id of 1 and 3. And not delete all rows as you say.
Schema
create table t (
id int,
created int ,
value int
);
insert into t values(1, 1649122158, 200);
insert into t values(1 ,1649122159, 300);
insert into t values(2 ,1649122158, 200);
insert into t values(2 ,1649122158, 200);
insert into t values(3 ,1649122170, 500);
insert into t values(3 ,1649122160, 200);
Delete statement
with x as (
SELECT
id, created,
row_number() over(partition by id) as r
FROM t
)
delete from t
using x
where x.id = t.id and x.r <> 1 and x.created = t.created
;
Output
select * from t;
1 1649122158 200
3 1649122170 500
The logic is such that the table in the USING clause is joined with the table being operated on. Following the join logic, it just matches by some key; in your case the key is {id, created}. That key is duplicated for the rows with id of 2, so the whole group is deleted.
I'm no expert in database schemas, but as a thought, you could add a rank column to the existing table and after that proceed with the deletion. That way you do not need to create another table and insert values into it. Be warned that the data may become fragmented (physically, on disk), so you may need to run some kind of tune-up later.
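If the goal is to keep exactly one row per id, the latest by created, even when the duplicates share the same created value, here is an untested sketch using QUALIFY together with the INSERT OVERWRITE idea from above:
-- rebuild the table keeping one row per id, preferring the latest created
INSERT OVERWRITE INTO t
SELECT *
FROM t
QUALIFY row_number() over (partition by id order by created desc) = 1;
With your sample data that keeps (1, 1649122159, 300), (2, 1649122158, 200) and (3, 1649122170, 500), which matches the result you asked for.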
Update
You may find this almost one-liner interesting:
SO answer
I will duplicate code here, as it is so small and well written.
WITH
u AS (SELECT DISTINCT * FROM your_table),
x AS (DELETE FROM your_table)
INSERT INTO your_table SELECT * FROM u;
I have two tables, defaults and users, with the same columns, except that in defaults the column user_id is unique while in users it's not.
I want to take all the rows from users and insert them into defaults. If two rows in users have the same user_id, I want to merge them in such a way that all the empty/null values are overridden with non-empty/non-null values.
Here is an example
users
-----
user_id|name|email|address
--------------------------
1 |abc |null |J St
1 |coco|a#b.c|null
After inserting into defaults I expect the following result:
defaults
-----
user_id|name|email|address
--------------------------
1 |abc |a#b.c|J St
@Eric B provided an answer showing how to do this with insert values:
Assuming 3 columns in the table: ID, NAME, ROLE.
This will update 2 of
the columns. When ID=1 exists, the ROLE will be unaffected. When ID=1
does not exist, the role will be set to 'Benchwarmer' instead of the
default value.
INSERT OR REPLACE INTO Employee (id, name, role)
VALUES ( 1,
'Susan Bar',
COALESCE((SELECT role FROM Employee WHERE id = 1), 'Benchwarmer')
);
How do I do this when I use a select for my insert?
insert or replace into defaults select ??? from users
INSERT OR REPLACE INTO defaults (user_id, name, email, address)
SELECT user_id, MAX(name), MAX(email), MAX(address)
FROM users
GROUP BY user_id;
The max will select the max value instead of null if there is one.
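If you also want to keep any value already stored in defaults when every users row has null for that column (mirroring @Eric B's COALESCE trick), here is a sketch, assuming the user_id/name/email/address columns from your example:
INSERT OR REPLACE INTO defaults (user_id, name, email, address)
SELECT u.user_id,
       -- fall back to the existing defaults value when all users rows are null
       COALESCE(MAX(u.name),    (SELECT d.name    FROM defaults d WHERE d.user_id = u.user_id)),
       COALESCE(MAX(u.email),   (SELECT d.email   FROM defaults d WHERE d.user_id = u.user_id)),
       COALESCE(MAX(u.address), (SELECT d.address FROM defaults d WHERE d.user_id = u.user_id))
FROM users u
GROUP BY u.user_id;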
I have two tables: table #1 contains user information (email, password, etc.) and table #2 contains item information.
When I do an insert into table #2, and then use the returning statement to gather what was inserted (returning auto values as well as other information), I also need to return information from table #1.
(excuse the syntax)
example:
insert into table #1(item,user) values('this item','the user')
returning *, select * from table 2 where table #1.user = table #2.user)
In other words, after the insert I need to return the values inserted, as well as the information about the user who inserted the data.
Is this possible to do?
The only thing I came up with is using a whole bunch of subquery statements in the returning clause. There has to be a better way.
I suggest a data-modifying CTE (Postgres 9.1 or later):
WITH ins AS (
INSERT INTO tbl1(item, usr)
VALUES('this item', 'the user')
RETURNING usr
)
SELECT t2.*
FROM ins
JOIN tbl2 t2 USING (usr)
Working with the column name usr instead of user, which is a reserved word.
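If you also need the inserted row itself back, not just the tbl2 columns, you can return everything from the CTE and select from both; a sketch with the same placeholder names:
WITH ins AS (
   INSERT INTO tbl1(item, usr)
   VALUES ('this item', 'the user')
   RETURNING *              -- every inserted column, including generated ones
   )
SELECT ins.*, t2.*
FROM   ins
JOIN   tbl2 t2 USING (usr);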
Use a subquery.
Simple demo: http://sqlfiddle.com/#!15/bcc0d/3
insert into table2( userid, some_column )
values( 2, 'some data' )
returning
userid,
some_column,
( SELECT username FROM table1
WHERE table1.userid = table2.userid
);
I have two similar table hierarchies:
Owner -> OwnerGroup -> Parent
and
Owner2 -> OwnerGroup2
I would like to determine if there is an exact match of Owners that exists in Owner2 based on a set of values. There are approximately a million rows in each Owner table. Some OwnerGroups contain up to 100 Owners.
So basically, if there is an OwnerGroup that contains Owners "Smith, John" and "Smith, Jane", I want to know the ids of the OwnerGroup2s that are exact matches.
The first attempt at this was to generate a join per Owner (which required dynamic SQL to be generated in the application):
select og.id
from owner_group2 og
-- dynamic bit starts here
join owner2 o1 on
(og.id = o1.og_id) AND
(o1.given_names = 'JOHN' and o1.surname='SMITH')
-- dynamic bit ends here
join owner2 o2 on
(og.id = o2.og_id) AND
(o2.given_names = 'JANE' and o2.surname='SMITH');
This works fine for small numbers of owners, but when we have to deal with the 100-owners-in-a-group scenario the query plan means there are 100 nested loops and it takes almost a minute to run.
Another option I had was to use something around the intersect operator. E.g.
select * from (
select o1.surname, o1.given_names
from owner1 o1
join owner_group1 og1 on o1.og_id = og1.id
where
og1.parent_id = 1936233
)
intersect
select o2.surname, o2.given_names
from owner2 o2
join owner_group2 og2 on og2.id = o2.og_id;
I'm not sure how to suck out the owner2.id in this scenario either - and it was still running in the 4-5 second range.
I feel like I am missing something obvious - so please feel free to provide some better solutions!
You're on the right track with intersect, you just need to go a bit further. You need to join the results of it back to the owner_groups2 table to find the ids.
You can use the listagg function to convert the groups into comma-separated lists of the names (note - requires 11g). You can then take the intersection of these name lists to find the matches and join this back to the list in owner_groups2.
I've created a simplified example below; in it, "Dave, Jill" is the group that is present in both tables.
create table grps (id integer, name varchar2(100));
create table grps2 (id integer, name varchar2(100));
insert into grps values (1, 'Dave');
insert into grps values(1, 'Jill');
insert into grps values (2, 'Barry');
insert into grps values(2, 'Jane');
insert into grps2 values(3, 'Dave');
insert into grps2 values(3, 'Jill');
insert into grps2 values(4, 'Barry');
with grp1 as (
SELECT id, listagg(name, ',') within group (order by name) n
FROM grps
group by id
), grp2 as (
SELECT id, listagg(name, ',') within group (order by name) n
FROM grps2
group by id
)
SELECT * FROM grp2
where n in (
-- find the duplicates
select n from grp1
intersect
select n from grp2
);
Note this will still require a full scan of owner_groups2; I can't think of a way you can avoid this. So your query is likely to remain slow.
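One thing you could experiment with (my own suggestion, untested) is materializing the aggregated name list for the second hierarchy so repeated checks don't re-aggregate it every time; using the simplified tables above:
-- precompute the comma-separated name list per group
create materialized view grps2_names
  build immediate
  refresh complete on demand
as
select id, listagg(name, ',') within group (order by name) n
from grps2
group by id;
You would refresh it whenever the groups change, and the 4000-byte limit on listagg in 11g could be a problem with up to 100 names per group, so treat this as a sketch only.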
I have an Access table of the form (I'm simplifying it a bit)
ID AutoNumber Primary Key
SchemeName Text (50)
SchemeNumber Text (15)
This contains some data eg...
ID SchemeName SchemeNumber
--------------------------------------------------------------------
714 Malcolm ABC123
80 Malcolm ABC123
96 Malcolms Scheme ABC123
101 Malcolms Scheme ABC123
98 Malcolms Scheme DEF888
654 Another Scheme BAR876
543 Whatever Scheme KJL111
etc...
Now. I want to remove duplicate names under the same SchemeNumber. But I want to leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length then I just want to leave only one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (to leave only 96).
I thought this would be relatively easy to achieve but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop it programmatically but I'd rather have a single DELETE query.
See if this query returns the rows you want to keep:
SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
(SELECT
SchemeNumber,
SchemeName,
Len(SchemeName) AS name_length,
ID
FROM tblSchemes
) AS r
INNER JOIN
(SELECT
SchemeNumber,
Max(Len(SchemeName)) AS name_length
FROM tblSchemes
GROUP BY SchemeNumber
) AS w
ON
(r.SchemeNumber = w.SchemeNumber)
AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;
If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.
DELETE
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);
Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:
DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));
DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
WHERE t1.SchemeNumber = t2.SchemeNumber
AND Length(t2.SchemeName) > Length(t1.SchemeName)
)
Depending on your RDBMS you may need a different function from Length (Oracle: LENGTH, MySQL: LENGTH, SQL Server: LEN).
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))
(SQL Server flavored)
Now updated to include the specified tie resolution. You may get better performance, though, doing it in two queries: first deleting the schemes with shorter names as in my original query, and then going back and deleting the higher IDs where there was a tie in name length.
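A sketch of that two-query variant (again SQL Server flavored, untested):
-- step 1: remove rows whose name is shorter than the longest for their SchemeNumber
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
  on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
 and len(ShortScheme.SchemeName) < len(LongScheme.SchemeName);

-- step 2: among the remaining ties, keep only the lowest ID
delete Dup
from Scheme Dup
join Scheme Keeper
  on Dup.SchemeNumber = Keeper.SchemeNumber
 and len(Dup.SchemeName) = len(Keeper.SchemeName)
 and Dup.ID > Keeper.ID;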
I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no sql 'undo' statement.
-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
id int(11) NOT NULL,
SchemeName varchar(50),
SchemeNumber varchar(15),
PRIMARY KEY (id)
);
insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );
-- Find all the records that have dups, find the longest one
create table bar as
select max(length(SchemeName)) as max_length, SchemeNumber
from foo
group by SchemeNumber
having count(*) > 1;
-- Find the one we want to keep
create table bat as
select min(a.id) as id, a.SchemeNumber
from foo a join bar b on a.SchemeNumber = b.SchemeNumber
and length(a.SchemeName) = b.max_length
group by SchemeNumber;
-- Select into this table all the rows to delete
create table baz as
select a.id from foo a join bat b on a.SchemeNumber = b.SchemeNumber
and a.id != b.id;
This will give you a new table with only records for rows that you want to remove.
Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.
Then when you're ready, use this command to delete the rows:
delete from foo where id in (select id from baz);
This seems like more work because of the different tables, but it's safer and probably just as fast as the other ways. Plus you can stop at any step and make sure the data is what you want before you do any actual deletes.
If your platform supports ranking functions and common table expressions:
with cte as (
select row_number()
over (partition by SchemeNumber order by len(SchemeName) desc) as rn
from Table)
delete from cte where rn > 1;
try this:
Select * From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or this:
Select * From Table t
Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))
If either of these selects the records that should be deleted, just change it to a delete:
Delete
From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or using the second construction:
Delete From Table t Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))