(Data protection) Want to mask/replace some data in a database - SQL

I'm working on a problem where I have to mask or replace (I know the two are different) data such as credit card numbers, account numbers, dates of birth, etc. with a particular pattern.
For example, if a credit card number is 123/456/789, it should show as ###/###/### on the front end.
The solution I thought of is to use the regexp_replace function, and it works, but it takes too much time, the query is tedious, and it produces a new column for each pattern (I need to match more than 75 patterns for credit card and account numbers alone), and more patterns will come in the future.
Secondly, is it possible to create a table in which we store all the patterns and refer to that table using a dynamic SQL query? (Assuming we get access to create the table; I don't know how to do this.)
Thirdly, we could use a procedure to mask the data (rather than replacing it with a pattern) by generating random numbers to protect the data. (I don't think the senior members will agree to this.)
If there is another, more optimal solution, please share it. I also don't know whether all the credit card numbers, account numbers, etc. reside in one table or are spread across several tables; if the data is in more than one table, what would the solution be?
A detailed explanation would be appreciated.
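A minimal sketch of the regexp_replace idea described above, assuming an Oracle-style database; the table and column names are invented for illustration. Replacing every digit with '#' keeps separators such as '/' intact, so 123/456/789 becomes ###/###/###:

-- hypothetical masking view, so the masking logic lives in one place
CREATE OR REPLACE VIEW customers_masked AS
SELECT
    customer_id,
    REGEXP_REPLACE(card_no, '[0-9]', '#')    AS card_no_masked,
    REGEXP_REPLACE(account_no, '[0-9]', '#') AS account_no_masked
FROM customers;

Pointing the front end at such a view avoids repeating the 75+ patterns in every query.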

From a design point of view these data points should have been stored in dedicated columns -- a column for credit card numbers, for example. Is that not the structure of this table? If it is, why would you even include that column in your query? If credit card numbers, etc. are mixed in with other columns, you may want to take the time to restructure the table if you plan to keep using this data moving forward.
Continuing on the assumption that they are stored in the same column: you are really risking a breach of PII by relying on a replace function to remove sensitive information. Consider other ways of accessing the data you need so that you don't expose confidential information because of a mistake in data entry.

Related

Is it possible to create indexes for a string/UUID-based primary key to enable fast search by similarity (e.g. noisy UUIDs)?

I will give a concrete case for better comprehension.
I have some codes, which I will call UUIDs here, that come from OCR.
Of the, say, 25 characters, a few are misrecognized.
Is it possible to "index by similarity" the UUID column in a SQL database?
Will a SELECT ... LIKE statement already have a good behavior, supposing only one character is wrong per UUID and I perform 25 queries?
[The noisy uuid is not going to be inserted, just SELECTed.]
I'm sorry, I don't know if there is a built-in function to do so, but what you are trying to compute is called the Levenshtein distance. Have a look at that:
Definition :
https://en.wikipedia.org/wiki/Levenshtein_distance#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,considered%20this%20distance%20in%201965.
Using SQL :
https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,not%20match%20exactly%20the%20fields.
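As a rough sketch, assuming a LEVENSHTEIN() user-defined function has been installed as described in the linked article (MySQL has no built-in one), and with invented table and column names:

-- find stored codes within one edit of the noisy OCR value
SELECT id, uuid_code
FROM codes
WHERE LEVENSHTEIN(uuid_code, 'ABCDE12345FGHIJ67890KLMNO') <= 1;

Note that this cannot use a normal index, so it scans the whole table; that is exactly why indexing by similarity is hard.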
You should fix the data that goes into the database -- or at least have the original code and an imputed code.
If you need to keep the original code, then my suggestion would be a look-up table with the original code and imputed code. This table would be used for queries that want to filter by the actual code.
To give a concrete example, if I have a column with US state abbreviations and one of the codes is RA, I would not want to "automatically" figure out whether it is:
AR backwards (Arkansas)
RI (Rhode Island)
CA (California)
MA (Massachusetts)
PA (Pennsylvania)
VA (Virginia)
WA (Washington)
It seems like a manual effort would be required.
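To make the look-up table suggestion concrete, a minimal sketch with invented table and column names:

-- maps each raw OCR value to its manually corrected value
CREATE TABLE code_lookup (
    original_code VARCHAR(25) PRIMARY KEY,
    imputed_code  VARCHAR(25) NOT NULL
);

-- queries filter on the corrected code via the look-up table
SELECT d.*
FROM documents d
JOIN code_lookup c ON c.original_code = d.uuid_code
WHERE c.imputed_code = 'THE-CORRECT-CODE';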

How could I write this code in a more performant way?

In our app, people have one or more projects. These projects have a start and an end date. People have a limited number of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date = Date.today)
  days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
  available = days_total - hours_engaged
end
This means that to display the page described above, the app fires 18(!) queries at the database. We have pages that list the availability of multiple people in a table; for those pages the number of queries quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. The easiest and fastest way is in SQL:
Join your events to a generated date table (see "generate days from date range") so that you have a row for each day a person or people are occupied. Once you have the data in this form, it is simply a matter of grouping by the week part of the date and counting the rows per group.
You can extend this to group by person for multiple person queries.
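A rough sketch of that shape, assuming a pre-generated calendar table with one row per day in a calendar_date column, and simplified table and column names; the week/year functions vary by database (this uses SQL Server's DATEPART):

-- one row per person per occupied day, rolled up to weeks
SELECT pr.person_id,
       DATEPART(year, c.calendar_date) AS yr,
       DATEPART(week, c.calendar_date) AS wk,
       COUNT(*)                        AS days_occupied
FROM projects pr
JOIN calendar c
  ON c.calendar_date >= pr.start_date
 AND c.calendar_date <  pr.finish_date
GROUP BY pr.person_id, DATEPART(year, c.calendar_date), DATEPART(week, c.calendar_date);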
From a SQL point of view, I'd advise using a stored procedure and passing in your date/range requirement; you can then return a recordset for a user, or possibly multiple users. This way your code only has to access the DB once.
You can then output recordset data in one go, by iterating through.
Hope this helps.
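A rough sketch of such a procedure in T-SQL-style syntax, with invented table and column names, just to show the shape:

-- returns engaged days per person for a given date, in one round trip
CREATE PROCEDURE get_days_engaged
    @query_date DATE
AS
BEGIN
    SELECT person_id,
           SUM(days_on_project) AS days_engaged
    FROM projects
    WHERE start_date < @query_date
      AND finish_date > @query_date
    GROUP BY person_id;
END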
Use a stored procedure to run your query against SQL and get the data.
Pass parameters to the SQL query; in your case, that is today's date.
Apply your conditions and logic in the stored procedure. Using a procedure is a good and fast way to retrieve data from SQL, and it also helps protect your code from SQL injection.
Call that stored procedure from your code; as I don't know Ruby on Rails, I can't provide the steps for calling it from there.
The data fetched by the stored procedure will then be available in a data table or something similar.
After getting the data you can do whatever you need with it.
Hope this helps.
See what query is executed. You can also run EXPLAIN on your query:
explain select * from project where start_date < any_date and end_date > any_date2
This shows you the query plan; use that plan to optimize your query.
For example:
If you have an index on the end_date field, reorder the condition as (end_date > any_date2 AND start_date < any_date); this will use the index if one exists on that field. This step is database-dependent; the example is for MySQL. If you want the index to be used in MySQL, the indexed condition must be on the left-hand side of the WHERE clause.
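For example, an index on that field might be created like this (index and table names are illustrative):

CREATE INDEX idx_project_end_date ON project (end_date);

Re-running EXPLAIN afterwards should show whether the plan now uses the index.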
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get stuff like - list of users for this projects for this week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well, and also a lot of terminology that can be confusing, but for this type of "I need to slice and dice my data across several sets of things (people and time)" problem, data warehousing techniques can be quite powerful.
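A minimal sketch of what that star schema might look like; all names are placeholders:

-- dimension tables: one row per calendar day, one row per person
CREATE TABLE dim_date (
    date_id       INT PRIMARY KEY,
    calendar_date DATE,
    week_of_year  INT,
    year          INT
);
CREATE TABLE dim_person (
    person_id  INT PRIMARY KEY,
    name       VARCHAR(100),
    days_total INT
);
-- fact table: one row per person per engaged day
CREATE TABLE fact_engagement (
    date_id      INT REFERENCES dim_date (date_id),
    person_id    INT REFERENCES dim_person (person_id),
    days_engaged DECIMAL(4,2)
);

Weekly availability then becomes a single query grouping fact_engagement by person and by the week columns of dim_date.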
As I don't know Ruby on Rails, from a SQL point of view I suggest you write a stored procedure and return a dataset, then do the necessary table operations on the dataset from the front end. It will reduce unnecessary calls to the DB.

Storing phone numbers in DB

This was asked in an interview: I had to store multiple phone numbers for each employee. I answered that we could have a comma-separated string of numbers. The next question was: what if the string becomes really long (say, hypothetically, 1000 numbers)? Come up with a better solution. I was clueless. Could someone suggest the correct approach to this problem?
EDIT: I did suggest we fix the number of columns at some maximum and fill them in as needed, but that would lead to too many NULL values in most cases, so it would have been a bad design.
EDIT: I just wanted to know whether there is some other way of solving this problem besides adding a new table, as suggested in one of the comments below (which I did give as an answer).
By the way, is this some trick on the interviewer's part, or does another solution actually exist?
How about a simple 1:n relation? Create a separate table for the phone numbers like this:
Phone_Numbers(id, employee_id, phone_number_type, phone_number)
This way you can add thousands of phone numbers for each employee and not have a problem.
In general: it is never a good idea to store a comma-separated anything in a database field. You should read up on database normalization. Usually 3NF is a good compromise to aim for.
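A sketch of that 1:n relation in SQL; names are illustrative, assuming an existing employees table:

CREATE TABLE phone_numbers (
    id                INT PRIMARY KEY,
    employee_id       INT NOT NULL REFERENCES employees (id),
    phone_number_type VARCHAR(20),          -- e.g. 'home', 'mobile', 'work'
    phone_number      VARCHAR(20) NOT NULL
);

-- all numbers for one employee
SELECT phone_number_type, phone_number
FROM phone_numbers
WHERE employee_id = 42;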
Here the phone number is a multi-valued attribute. You could use comma-separated values with an upper and lower bound on the attribute to keep it manageable, but since your interviewer asked about 1000 entries, it is better to keep the table atomic and create a new row for every phone number. This increases the number of rows, and you can then normalize: this is a case of a multi-valued dependency, so you have to go up to 4NF to resolve it.
You said you wanted to store a long string in the DB. In that case I think the DB cannot be a relational DB; it could be a NoSQL DB instead. If the string is very long, you can choose to store the difference from the first number instead of storing each number in full, and this way you can save disk space.
E.g. if you want to store 12345, 12346, 12347, 12348,
you can store 12345, 1, 2, 3.

Removing privacy data from a database?

Say that I needed to share a database with a partner. Obviously I have customer information in that database. Short of going through and identifying every column that contains privacy information and writing a custom script to 'scrub' the data, is there any tool or script that can scrub the data but keep the format intact (for example, a 5-character string would stay 5 characters, only scrubbed)?
If not, how would you accomplish something like this, preferably in TSQL?
You may consider sharing only views: create views that hide the data you don't want to share.
Example:
CREATE VIEW v_customer
AS
SELECT
NAME,
LEFT(CreditCard,5) + '****' As CreditCard -- OR, don't show this column at all
....
FROM customer
Firstly, I need to declare a professional interest: I work for IBM, which has tools that do exactly this.
Step 1. Ensure you identify all the PII (Personally Identifiable Information). When sharing database information, the obvious column names like "name" are typically found, but you also need to find the "hidden" data, where the data is either embedded in a standard format (e.g. string-name-string) under a column name like "reference code", or sits in free-format text fields. As you have seen, this is not going to be an easy job unless you automate it. The tool for this is InfoSphere Discovery.
Step 2. Decide what state the "scrubbed" data needs to be in. Changing name fields to random characters causes problems when testing, because users focus on text errors rather than functional failures, so change names to real but fictitious ones. Credit card information often needs to be "valid": by that I mean it needs a valid prefix, say 49XX, but the rest an invalid sequence. Finally, you need to ensure that every instance of the change is propagated through the database to maintain consistency. The tool for this is Optim Test Data Management with the Data Privacy option.
The two tools integrate to give a full data privacy solution.
Based on the original question, it seems you need the fields to be the same length, but not in a "valid" format? How about:
UPDATE customers
SET email = REPLICATE('z', LEN(email))
-- additional fields as needed
Copy/paste and rename tables/fields as appropriate. I think you're going to have a hard time finding a tool that's less work, unless your schema is very complicated, or my formatting assumptions are incorrect.
I don't have an MSSQL database in front of me right now, but you can also find all of the string-like columns by something like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('...', '...')
I don't remember the exact values you need to compare for, but if you run the query and see what's there, they should be pretty self-explanatory.
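For what it's worth, on SQL Server the string types in INFORMATION_SCHEMA.COLUMNS are typically char, varchar, nchar, nvarchar, text and ntext, so the filter would look roughly like:

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('char', 'varchar', 'nchar', 'nvarchar', 'text', 'ntext');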

How to get multi-row data of one column into one row of one column

I need to combine data from multiple rows of one column into a single row.
For example, from this format
ID Interest
1 Sports
1 Cooking
2 Movie
2 Reading
to this format
ID Interest
1 Sports,Cooking
2 Movie,Reading
I wonder whether we can do that in MS Access SQL. If anybody knows, please help me with it.
Take a look at Allen Browne's approach: Concatenate values from related records
As for the normalization argument, I'm not suggesting you store concatenated values. But if you want to join them together for display purposes (like a report or form), I don't think you're violating the rules of normalization.
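Allen Browne's article supplies a VBA function, ConcatRelated(); assuming it has been imported into a module, the Access query could look roughly like this (table and field names are guesses based on the question):

SELECT DISTINCT ID,
       ConcatRelated("Interest", "tblInterests", "ID = " & [ID]) AS Interests
FROM tblInterests;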
This is called de-normalizing data. It may be acceptable for final reporting. Apparently some experts believe it's good for something, as seen here.
(Mind you, kevchadder's question is right on.)
Have you looked into the SQL Pivot operation?
Take a look at this link:
http://technet.microsoft.com/en-us/library/ms177410.aspx
Just noticed you're using Access. Take a look at this article:
http://www.blueclaw-db.com/accessquerysql/pivot_query.htm
This is not something you should do in SQL, and it's most likely not possible at all.
Merging the rows in your application code shouldn't be too hard.