Anonymise SQLite database? - sql

Problem: I have a table with first name, surname, and gender columns. I need to partially anonymise the database, by replacing all the names in this table with arbitrary made-up names. I also have a spreadsheet with lots of gender-specific arbitrary names.
Given this, how do I iterate through the rows of this table, and replace each name in turn with a name from the spreadsheet?
I can do this in C fairly trivially, but it's a day's work: export the spreadsheet as CSV, then iterate through the rows of the table, updating each name with the next one from the CSV file. However, I can't help feeling that there's a much simpler way to do this by turning the CSV name data into a script, but I've got no idea how to iterate through the table from a script. Any pointers/ideas appreciated.

I believe you are on the right track with the application route, whether in C, Python, or whatever you find convenient. Here is a different method that can be scripted.
Export data from Excel as CSV
$ cat test.csv
Jacob Jacobs,M
Rogers Bogers,M
Marsha Darsha,F
Tina Fina,F
Mono Bono,M
Import this into sqlite
sqlite> .mode csv
sqlite> .import test.csv proxy
sqlite> select * from proxy;
"Jacob Jacobs",M
"Rogers Bogers",M
"Marsha Darsha",F
"Tina Fina",F
"Mono Bono",M
Remember count of males and females
Let's say your table was called main in which you have real names, and you want to change them to names from proxy table randomly.
sqlite> .schema
CREATE TABLE proxy (fullname text, gender text);
CREATE TABLE main(fullname TEXT,gender TEXT,age INT);
sqlite> select * from main;
fullname,gender,age
"John Smith",M,20
"Marshall Dubin",M,20
"Kate Ortiz",F,20
"Ron Bunsh",M,20
"Kelly Torro",F,20
sqlite> select count(*) from main where gender='M';
count(*)
3
sqlite> select count(*) from main where gender='F';
count(*)
2
Have your application remember these counts: 3 males and 2 females.
Execute update statement repeatedly with different offset
sqlite> update main
...> set fullname = (
...> select fullname from proxy where gender='M' order by random() limit 1)
...> where rowid = (
...> select rowid from main where gender='M' order by rowid limit 0,1);
Change the limit 0,1 to limit 1,1 and re-execute. Go on till you reach limit 2,1. Since you have 3 records for Males, go from limit 0,1 to limit 2,1.
Repeat the same thing to anonymize Female records. Change gender='M' to gender='F'. Since there are only 2 females, you will execute update two times. Once with limit 0,1 and then with limit 1,1.
If you run this in a single transaction, your script should be able to churn through the updates quite fast.
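The offset loop described above can be driven from a short script instead of being re-typed by hand. The following is a minimal sketch in Python using the standard sqlite3 module, assuming the main and proxy tables from this answer; the function name anonymise_by_offset and the in-memory demo database are hypothetical and only there to make the example runnable.

```python
import sqlite3


def anonymise_by_offset(conn: sqlite3.Connection) -> None:
    """Replace each name in main with a random same-gender name from proxy."""
    with conn:  # one transaction: commit on success, roll back on error
        for gender in ("M", "F"):
            (count,) = conn.execute(
                "SELECT count(*) FROM main WHERE gender = ?", (gender,)
            ).fetchone()
            # Same statement as in the answer, with the LIMIT offset as a parameter.
            for offset in range(count):
                conn.execute(
                    """
                    UPDATE main
                    SET fullname = (SELECT fullname FROM proxy
                                    WHERE gender = ? ORDER BY random() LIMIT 1)
                    WHERE rowid = (SELECT rowid FROM main
                                   WHERE gender = ? ORDER BY rowid LIMIT ?, 1)
                    """,
                    (gender, gender, offset),
                )


# Demo data copied from the answer (in-memory database, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE proxy (fullname TEXT, gender TEXT);
    CREATE TABLE main (fullname TEXT, gender TEXT, age INT);
    INSERT INTO proxy VALUES ('Jacob Jacobs', 'M'), ('Rogers Bogers', 'M'),
        ('Marsha Darsha', 'F'), ('Tina Fina', 'F'), ('Mono Bono', 'M');
    INSERT INTO main VALUES ('John Smith', 'M', 20), ('Marshall Dubin', 'M', 20),
        ('Kate Ortiz', 'F', 20), ('Ron Bunsh', 'M', 20), ('Kelly Torro', 'F', 20);
""")
anonymise_by_offset(conn)
rows = conn.execute("SELECT fullname, gender FROM main ORDER BY rowid").fetchall()
for row in rows:
    print(row)
```

Because each row draws its replacement independently, the same fake name can be assigned more than once, as in the sample result shown in this answer.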
End Result
WAS
fullname gender age
---------- ---------- ----------
John Smith M 20
Marshall D M 20
Kate Ortiz F 20
Ron Bunsh M 20
Kelly Torr F 20
IS
fullname gender age
------------- ---------- ----------
Rogers Bogers M 20
Jacob Jacobs M 20
Tina Fina F 20
Jacob Jacobs M 20
Jasmine F 20
Example of scripting SQLite with Bash - http://andreaolivato.tumblr.com/post/133473114/using-sqlite3-in-bash
Other option
In your application, hold the fake names in two arrays - one for male and one for female. The idea is to be able to pull a random fake name by gender on demand
Do a select rowid, gender from main order by rowid
Iterate through the records
If gender is male, pull a random fake record from male array; likewise for female record
Run update main set fullname=<fake-record> where rowid=<selected-row-id>
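The steps of this second option might look roughly like the sketch below. It assumes the main table from the first method; the function name anonymise_with_lists, the fake_names dictionary, and the in-memory demo data are hypothetical stand-ins for whatever your application loads from the CSV file.

```python
import random
import sqlite3


def anonymise_with_lists(conn: sqlite3.Connection, fake_names: dict) -> None:
    """fake_names maps a gender code to a list of fake names (assumed shape)."""
    # Read the row ids first so the cursor is not open while we update.
    rows = conn.execute(
        "SELECT rowid, gender FROM main ORDER BY rowid"
    ).fetchall()
    with conn:  # single transaction for all updates
        for rowid, gender in rows:
            conn.execute(
                "UPDATE main SET fullname = ? WHERE rowid = ?",
                (random.choice(fake_names[gender]), rowid),
            )


# Hypothetical demo data, mirroring the tables used earlier in the answer.
fake_names = {
    "M": ["Jacob Jacobs", "Rogers Bogers", "Mono Bono"],
    "F": ["Marsha Darsha", "Tina Fina"],
}
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (fullname TEXT, gender TEXT, age INT);
    INSERT INTO main VALUES ('John Smith', 'M', 20), ('Marshall Dubin', 'M', 20),
        ('Kate Ortiz', 'F', 20), ('Ron Bunsh', 'M', 20), ('Kelly Torro', 'F', 20);
""")
anonymise_with_lists(conn, fake_names)
result = conn.execute("SELECT fullname, gender FROM main ORDER BY rowid").fetchall()
for row in result:
    print(row)
```

If every person should get a distinct fake name, one possible variation is to shuffle each list once and pop names from it instead of calling random.choice, provided the lists are at least as long as the number of rows per gender.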

MS Access - Continuous Form Select with Dropdown from Another Table

I've been using Databasedevelopment.co.uk's excellent example of how to do a continuous form select with an invisible button overlaying a checkbox to assign employees to a specific shift. I'd like the continuous form to also have a dropdown of the different Paycodes, so that I can use a combobox to indicate "Regular Pay", "Overtime", etc. I'm running into a wall because, with the query as-is from the example, the recordset for the Paycode field is not updateable.
Messing with the primary key for the employee's table fixes the paycode issue but prevents the selection code from working properly.
I'm a bit out of my depth here, what's the easiest way to accomplish this?
SELECT CAT.EmployeeID, CAT.FirstName, CAT.LastName, ASGN_TEMP.ShiftNum, ASGN_TEMP.PayCode, IIf(ASGN_TEMP.[ShiftNum] Is Null,0,-1) AS IsSelected
FROM tblEmployees AS CAT
LEFT JOIN (SELECT ASGN.EmployeeID, ASGN.ShiftNum, ASGN.PayCode FROM tblAssignedEmployees AS ASGN
WHERE ASGN.ShiftNum = Forms!frmMainMenu![txtShiftNum]) AS ASGN_TEMP
ON CAT.EmployeeID = ASGN_TEMP.EmployeeID;
Paycode is a static table with an ID, a Paycode and a description and would only correspond with each record in "tblAssignedEmployee". That is to say, there is no relationship between the employee or the shift with what Paycodes are available, I'd just like a second table for ease of updates.
---EDIT---
Table: Employees
ID  EmployeeID  Firstname  LastName
--  ----------  ---------  --------
1   1234        Bob        Jones
2   9999        Mary       Sue
Table: AssignedEmployees
ID  EmployeeID  ShiftNum  PayCode
--  ----------  --------  -------
1   1234        1         OT
2   9999        2         Regular
3   1234        2         OT
Table: PayCodes
ID  PayCode  Desc
--  -------  -----------
1   Regular  Regular Pay
2   OT       Overtime

How to load grouped data with SSIS

I have a tricky flat file data source. The data is grouped, like this:
Country  City
U.S.     New York
         Washington
         Baltimore
Canada   Toronto
         Vancouver
But I want it to be this format when it's loaded in to the database:
Country City
U.S. New York
U.S. Washington
U.S. Baltimore
Canada Toronto
Canada Vancouver
Has anyone met such a problem before? Any idea how to deal with it?
The only idea I have so far is to use a cursor, but that is just too slow.
Thank you!
The answer by cha will work, but here is another in case you need to do it in SSIS without temporary/staging tables:
You can run your dataflow through a Script Transformation that uses a DataFlow-level variable. As each row comes in the script checks the value of the Country column.
If it has a non-blank value, then populate the variable with that value, and pass it along in the dataflow.
If Country has a blank value, then overwrite it with the value of the variable, which will be the last non-blank Country value you got.
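The row-by-row logic described above is a simple "carry the last non-blank value forward" loop. In SSIS it would live inside the Script Component; the Python sketch below only illustrates the logic, with a plain local variable standing in for the variable mentioned above. The function name fill_down and the sample rows (taken from the question) are for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def fill_down(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (country, city) pairs with blank countries filled from the row above."""
    last_country = ""  # remembers the most recent non-blank Country
    for country, city in rows:
        if country.strip():
            # Non-blank: remember it and pass it along unchanged.
            last_country = country.strip()
        # Blank: fall back to the remembered value.
        yield last_country, city


grouped = [
    ("U.S.", "New York"),
    ("", "Washington"),
    ("", "Baltimore"),
    ("Canada", "Toronto"),
    ("", "Vancouver"),
]
filled = list(fill_down(grouped))
for row in filled:
    print(row)
```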
EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):
The collection of ReadWriteVariables is only available in the
PostExecute method to maximize performance and minimize the risk of
locking conflicts. Therefore you cannot directly increment the value
of a package variable as you process each row of data. Increment the
value of a local variable instead, and set the value of the package
variable to the value of the local variable in the PostExecute method
after all data has been processed. You can also use the
VariableDispenser property to work around this limitation, as
described later in this topic. However, writing directly to a package
variable as each row is processed will negatively impact performance
and increase the risk of locking conflicts.
That comes from this MSDN article, which also has more information about the VariableDispenser work-around, if you want to go that route. Apparently I misled you above when I said you can set the value of the package variable in the script. You have to use a variable that is local to the script, and then set the package variable in the PostExecute method. I can't tell from the article whether that means you will not be able to read the package variable in the script; if that's the case, then the VariableDispenser would be the only option. Or I suppose you could create another variable that the script has read-only access to, and set its value to an expression so that it always has the value of the read-write variable. That might work.
Yes, it is possible. First you need to load the data to a table with an IDENTITY column:
-- drop table #t
CREATE TABLE #t (id INTEGER IDENTITY PRIMARY KEY,
Country VARCHAR(20),
City VARCHAR(20))
INSERT INTO #t(Country, City)
SELECT a.Country, a.City
FROM OPENROWSET( BULK 'c:\import.txt',
FORMATFILE = 'c:\format.fmt',
FIRSTROW = 2) AS a;
select * from #t
The result will be:
id          Country              City
----------- -------------------- --------------------
1           U.S.                 New York
2                                Washington
3                                Baltimore
4           Canada               Toronto
5                                Vancouver
And now with a bit of recursive CTE magic you can populate the missing details:
;WITH a as(
SELECT Country
,City
,ID
FROM #t WHERE ID = 1
UNION ALL
SELECT COALESCE(NULLIF(LTrim(#t.Country), ''),a.Country)
,#t.City
,#t.ID
FROM a INNER JOIN #t ON a.ID+1 = #t.ID
)
SELECT * FROM a
OPTION (MAXRECURSION 0)
Result:
Country City ID
-------------------- -------------------- -----------
U.S. New York 1
U.S. Washington 2
U.S. Baltimore 3
Canada Toronto 4
Canada Vancouver 5
Update:
As Tab Alleman suggested below the same result can be achieved without the recursive query:
SELECT ID
, COALESCE(NULLIF(LTrim(a.Country), ''), (SELECT TOP 1 Country FROM #t t WHERE t.ID < a.ID AND LTrim(t.Country) <> '' ORDER BY t.ID DESC))
, City
FROM #t a
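To try the non-recursive query without a SQL Server instance, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. It is an adaptation, not the original T-SQL: the temp table #t becomes a regular table t, IDENTITY becomes INTEGER PRIMARY KEY, and TOP 1 becomes LIMIT 1. The sample rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, Country TEXT, City TEXT);
    INSERT INTO t (Country, City) VALUES
        ('U.S.', 'New York'), ('', 'Washington'), ('', 'Baltimore'),
        ('Canada', 'Toronto'), ('', 'Vancouver');
""")

# For a blank Country, look up the nearest earlier row with a non-blank Country.
filled = conn.execute("""
    SELECT a.id,
           COALESCE(NULLIF(ltrim(a.Country), ''),
                    (SELECT p.Country FROM t AS p
                     WHERE p.id < a.id AND ltrim(p.Country) <> ''
                     ORDER BY p.id DESC LIMIT 1)) AS Country,
           a.City
    FROM t AS a
    ORDER BY a.id
""").fetchall()

for row in filled:
    print(row)
```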
BTW, the format file for your input data is this (if you want to try the scripts save the input data as c:\import.txt and the format file below as c:\format.fmt):
9.0
2
1 SQLCHAR 0 11 "" 1 Country SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "\r\n" 2 City SQL_Latin1_General_CP1_CI_AS

Populating column for Oracle Text search from 2 tables

I am investigating the benefits of Oracle Text search, and currently am looking at collecting search text data from multiple (related) tables and storing the data in the smaller table in a 1-to-many relationship.
Consider these 2 simple tables, house and inhabitants, and there are NEVER any uninhabited houses:
HOUSE
ID Address Search_Text
1 44 Some Road
2 31 Letsby Avenue
3 18 Moon Crescent
INHABITANT
ID House Name Nickname
1 1 Jane Doe Janey
2 1 John Doe JD
3 2 Jo Smythe Smithy
4 2 Percy Plum PC
5 3 Apollo Lander Moony
I want to write SQL that updates the HOUSE.Search_Text column with text from INHABITANT. Because this is a 1-to-many relationship, the SQL needs to collate the data in INHABITANT for each matching row in HOUSE, combine the data (comma separated), and update the Search_Text field.
Once done, the Oracle Text search index on HOUSE.Search_Text will return me HOUSEs that match the search criteria, and I can look up INHABITANTs accordingly.
Of course, this is a very simplified example, I want to pick up data from many columns and Full Text Search across fields in both tables.
With the help of a colleague we've got:
select id, ADDRESS||'; '||Names||'; '||Nicknames as Search_Text
from house left join(
SELECT distinct house_id,
LISTAGG(NAME, ', ') WITHIN GROUP (ORDER BY NAME) OVER (PARTITION BY house_id) as Names,
LISTAGG(NICKNAME, ', ') WITHIN GROUP (ORDER BY NICKNAME) OVER (PARTITION BY house_id) as Nicknames
FROM INHABITANT)
i on house.id = i.house_id;
which returns:
1 44 Some Road; Jane Doe, John Doe; JD, Janey
2 31 Letsby Avenue; Jo Smythe, Percy Plum; PC, Smithy
3 18 Moon Crescent; Apollo Lander; Moony
Some questions:
Is this an efficient query to return this data? I'm slightly concerned about the distinct.
Is this the right way to use Oracle Text search across multiple text fields?
How to update House.Search_Text with the results above? I think I need a correlated subquery, but can't quite work it out.
Would it be more efficient to create a new table containing House_ID and Search_Text only, rather than update House?
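As a rough illustration of the correlated-subquery update asked about above, here is a sketch that runs against an in-memory SQLite database from Python. It is not Oracle syntax: SQLite's group_concat stands in for LISTAGG, and the lower-case table and column names (house, inhabitant, house_id, search_text) are assumptions modelled on the tables in the question. The order of names inside each list is not guaranteed by group_concat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE house (id INTEGER PRIMARY KEY, address TEXT, search_text TEXT);
    CREATE TABLE inhabitant (id INTEGER PRIMARY KEY, house_id INTEGER,
                             name TEXT, nickname TEXT);
    INSERT INTO house (id, address) VALUES
        (1, '44 Some Road'), (2, '31 Letsby Avenue'), (3, '18 Moon Crescent');
    INSERT INTO inhabitant VALUES
        (1, 1, 'Jane Doe', 'Janey'), (2, 1, 'John Doe', 'JD'),
        (3, 2, 'Jo Smythe', 'Smithy'), (4, 2, 'Percy Plum', 'PC'),
        (5, 3, 'Apollo Lander', 'Moony');
""")

# Each subquery is correlated on house.id, so it aggregates only the
# inhabitants of the house row currently being updated.
with conn:
    conn.execute("""
        UPDATE house
        SET search_text = address
            || '; ' || (SELECT group_concat(name, ', ')
                        FROM inhabitant WHERE house_id = house.id)
            || '; ' || (SELECT group_concat(nickname, ', ')
                        FROM inhabitant WHERE house_id = house.id)
    """)

search = dict(conn.execute("SELECT id, search_text FROM house").fetchall())
for house_id, text in search.items():
    print(house_id, text)
```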

Table Join issue

Right now I've got a Main table in which I am uploading data. Because the Main table has many different duplicates, I Append various data out of the Main table into other tables such as, username, phone number, and locations in order to keep things optimized. Once I have everything stripped down from the Main table, I then append what's left into a final optimized Main table. Before this happens though, I run a select query joining all the stripped tables with the original Main table in order to connect the IDs from each table, with the correct data. For example:
Original Main Table
Name         Number     Due Date   Location   Charges Monthly   Charges Total
----------   --------   --------   --------   ---------------   -------------
John Smith   111-1111   4/3        Chicago    234.56            500.23
Todd Jones   222-2222   4/3        New York   174.34            323.56
John Smith   111-1111   4/3        Chicago    274.56            670.23
Bill James   333-3333   4/3        Orlando    100.00            100.00
This gets split into 3 tables (name, number, location) and then there is a date table with all the dates for the year:
Name Table Number Table Location Table Due Date Table
--ID---Name------ -ID--Number--------- ---ID---Location---- --Date---
1 John Smith 1 111-1111 1 Chicago 4/1
2 Todd Jones 2 222-2222 2 New York 4/2
3 Bill James 3 333-3333 3 Orlando 4/3
Before The Original table gets stripped, I run a select query that grabs the ID from the 3 new tables, and joins them based on the connection they have with the original Main table.
Select Output
--Name ID----Number ID---Location ID---Due Date--
1 1 1 4/3
2 2 2 4/3
1 1 1 4/3
3 3 3 4/3
My issue comes when I need to introduce a new table that isn't able to be tied into the Original Main Table. I have an inventory table that, much like the original Main table, has duplicates and needs to be optimized. I do this by creating a secondary table that takes all the duplicated devices out and put them in their own table, and then strips the username and number out and puts them into their tables. I would like to add the IDs from this new device table into the select output that I have above. Resulting in:
Select Output
--Name ID----Number ID---Location ID---Due Date--Device ID---
1 1 1 4/3 1
2 2 2 4/3 1
1 1 1 4/3 2
3 3 3 4/3 1
Unlike the previous tables, the device table has no relationship to the original Main table, which is what is causing me so much headache. I can't seem to find a way to make this happen. Is there any way to accomplish this?
Any two tables can be joined. A table represents an application relationship. In some versions (not the original) of Entity-Relationship Modelling (notice that the "R" in E-R stands for "(application) relationship"!) a foreign key is sometimes called a "relationship". You do not need other tables or FKs to join any two tables.
Explain, in terms of its column names and the values for those names, exactly when a row should turn up in the result. Maybe you want:
SELECT *
FROM the stripped-and-ID'd version of the Original AS o
JOIN the stripped-and-ID'd version of the Device AS d
USING (NameID, NumberID, LocationID, DueDate)
i.e.
SELECT *
FROM the stripped-and-ID'd version of the Original AS o
JOIN the stripped-and-ID'd version of the Device AS d
ON o.NameID=d.NameId AND o.NumberID=d.NumberID
AND o.LocationID=d.LocationID AND o.DueDate=d.DueDate
Suppose p(a,...) is some statement parameterized by a,... .
If o holds the rows where o(NameID,NumberID,LocationID,DueDate) and d holds the rows where d(NameID,NumberID,LocationID,DueDate,DeviceID) then the above holds the rows where o(NameID, NumberID, LocationID, DueDate) AND d(NameID,NumberID,LocationID,DueDate,DeviceID). But you really have not explained what rows you want.
The only way to "join" tables that have no relation is by unioning them together:
select attribute1, attribute2, ... , attributeN
from table1
where <predicate>
union -- or union all
select attribute1, attribute2, ... , attributeN
from table2
where <predicate>
The where clauses are optional.
EDIT
Optionally, you could join the tables together by stating ON true, which will act like a cross product.

SQL rand() dependent on another column?

I have a table with the following format
id cityname user
1 newyork a
2 newyork b
3 newyork c
4 denver d
5 colorodo e
6 colorodo e
I need to add a new column named version, randomly generated using rand(), that has the same value for all rows with the same cityname:
id cityname user version
1 newyork a 1111111.11
2 newyork b 1111111.11
3 newyork c 1111111.11
4 denver d 7845156.12
5 colorodo e 8765589.12
6 colorodo e 8765589.12
How can I generate a random value that is the same for every row in a group?
Please help.
If you are on SQL Server 2008 or above, you can use the CHECKSUM function. Keep in mind that you may get collisions with a 4 byte hash.
SELECT *, CHECKSUM(CityName) as Version
FROM Cities
For something a bit less likely to have a collision, you could use HASHBYTES:
SELECT *, HASHBYTES('SHA1', CityName) as Version
FROM Cities
For MySQL, you can use any of the hashing functions, and take a substring:
SELECT *, LEFT(SHA1(CityName), 8) as Version
FROM Cities
or, just use the whole hash for some heavier collision protection. Most other RDBMS have similar hash functions.
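To see why a hash gives every row of a group the same value, here is a small Python sketch of the same idea as the MySQL example: take the SHA-1 hex digest of the city name and keep the first 8 characters. The function name version_for is hypothetical, and the rows are the ones from the question. Note that the result is deterministic per city name rather than truly random.

```python
import hashlib


def version_for(cityname: str) -> str:
    """First 8 hex characters of the SHA-1 digest of the city name."""
    return hashlib.sha1(cityname.encode("utf-8")).hexdigest()[:8]


rows = [
    (1, "newyork", "a"),
    (2, "newyork", "b"),
    (3, "newyork", "c"),
    (4, "denver", "d"),
    (5, "colorodo", "e"),
    (6, "colorodo", "e"),
]

# Equal city names always hash to the same version string.
versioned = [(i, city, user, version_for(city)) for i, city, user in rows]
for row in versioned:
    print(row)
```

Keeping only 8 hex characters means 32 bits of the digest, so the same caveat about collisions applies as for the 4-byte CHECKSUM.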
As @Mitch mentioned, you can use CHECKSUM.
You can make it a computed column, so that it is computed automatically on INSERT or UPDATE:
ALTER TABLE tableA ADD version AS CHECKSUM(CityName) PERSISTED