SQL Masking A Mapping Field In The Query - sql

I am creating a view to extract data from a table and load that data into a fixed-width file, which will then be loaded into another system. The view maps the table columns to a particular format.
There is one column, Account_Number, which needs to be masked because it contains sensitive information.
My logic for masking the value is to shift each digit to the next place on the number line:
so, if the digit is 0 it becomes 1, 4 becomes 5, etc. I am not able to come up with this logic in the view itself.
Any help would be appreciated.
CREATE OR REPLACE FORCE EDITIONABLE VIEW "Schema1"."VW_ActiveTraders" ("FUND", "NAME", "CITY", "ACN") AS
Select
    TD_Fund as FUND,
    Name as NAME,
    City as CITY,
    Account_Number as ACN
FROM Trader1 -- Table Name
Account Number
023457456
123456789
012345678
Masked Account Number
134568567
012345678
123456789
Please note that the Account Number column has more than 1,000 entries.

You may use TRANSLATE to shift the digits:
with dt as (
  select '023457456' ACN from dual union all
  select '123456789' ACN from dual union all
  select '012345678' ACN from dual)
select ACN,
       TRANSLATE(ACN, '0123456789', '1234567890') as ACN_WEAK_MASK
from dt;
ACN ACN_WEAK_
--------- ---------
023457456 134568567
123456789 234567890
012345678 123456789
But note that this is not real masking of sensitive information. It is very easy to unmask the values and recover the original account ID.
An often used masking is e.g. 012345678 gets ******678.
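A minimal sketch of that more common star-out masking in Oracle SQL, keeping only the last three digits (this assumes the account numbers are stored as strings, as in the examples above):

```sql
-- Keep the last 3 characters and pad back to the original
-- length with '*' (so a 9-digit value becomes ******678).
with dt as (
  select '012345678' ACN from dual)
select ACN,
       LPAD(SUBSTR(ACN, -3), LENGTH(ACN), '*') as ACN_MASKED  -- ******678
from dt;
```

Unlike the digit-shift, this cannot be reversed from the masked value alone.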

@MarmiteBomber @Stilgar - Thanks so much for the clarification and help with the answer.
I just tweaked the query and it ran successfully.
Changed Query
------------------------------------------------------------------------------------------
CREATE OR REPLACE FORCE EDITIONABLE VIEW "Schema1"."VW_ActiveTraders" ("FUND", "NAME", "CITY", "ACN") AS
Select
    TD_Fund as FUND,
    Name as NAME,
    City as CITY,
    --Account_Number as ACN
    TRANSLATE(Account_Number, '0123456789', '1234567890') as ACN
FROM Trader1 -- Table Name
------------------------------------------------------------------------------------------

Related

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction, but I only have a set of fields, no key, and I don't really know how to do this matching.
This is a sample of the data:
Revenue

Year  Agreement  Station_origin  Station_destination  Product
2020  123123     London          Manchester           Qwerty

Journeys

Year  Agreement  Station_origin  Station_destination  Product
2020  123123     Kings Cross     Piccadilly Gardens   Qwer
2020  123123     Kings Cross     Victoria Station     Qwert
2020  123123     London          Manchester           Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations

Station Name  Station Name 2      Station Name 3    ...
London        Kings Cross         Euston            ...
Manchester    Piccadilly Gardens  Victoria Station  ...
I would like to first test matching or joining the tables on the original fields. This will generate some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would then like to vary the fields: first shorten the product name to two letters (which could yield many matches in the production table), and then swap in alternative station names, first for station_origin and then for station_destination. When using a shortened product name I could get many candidate matches, but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the stations table are in order of importance, so "Station Name" is more important than "Station Name 2", and so on.
I started writing a query with one subquery per rank combined with UNION ALL, but there are so many combinations that there must be another way to do this.
I don't know if this makes sense, but I would appreciate any help or ideas on how to do this in a better way.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query does the following steps:
1. Take the top 200 male names in the US.
2. Find whether one of the top 200 female names matches exactly.
3. If not, look for the most similar female name among the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
  SELECT name, gender, SUM(number) c
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  GROUP BY 1, 2
), top_men AS (
  SELECT * FROM data WHERE gender = 'M'
  ORDER BY c DESC LIMIT 200
), top_women AS (
  SELECT * FROM data WHERE gender = 'F'
  ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
  COALESCE(
    (SELECT name FROM top_women WHERE name = a.name),
    fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
  ) female_version
FROM top_men a
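If you prefer to stay in pure SQL, the staged matching described in the question can also be sketched by generating candidate joins with a priority column and keeping the best-ranked match per journey. This is a sketch, not a complete solution: it shows only the first three rules, uses the table names from the question, and assumes the station alternatives are stored in columns named Station_Name and Station_Name_2:

```sql
-- Try each matching rule with a priority, keep the best match per journey.
WITH candidates AS (
  -- Rule 1: direct match on all fields
  SELECT j.*, r.Product AS rev_product, 1 AS priority
  FROM Journeys j
  JOIN Revenue r USING (Year, Agreement, Station_origin, Station_destination, Product)

  UNION ALL

  -- Rule 2: match on a two-letter product prefix
  SELECT j.*, r.Product, 2
  FROM Journeys j
  JOIN Revenue r
    ON j.Year = r.Year
   AND j.Agreement = r.Agreement
   AND j.Station_origin = r.Station_origin
   AND j.Station_destination = r.Station_destination
   AND SUBSTR(j.Product, 1, 2) = SUBSTR(r.Product, 1, 2)

  UNION ALL

  -- Rule 3: origin station replaced by its first alternative name
  SELECT j.*, r.Product, 3
  FROM Journeys j
  JOIN Stations s ON j.Station_origin = s.Station_Name_2
  JOIN Revenue r
    ON j.Year = r.Year
   AND j.Agreement = r.Agreement
   AND s.Station_Name = r.Station_origin
   AND j.Station_destination = r.Station_destination
   AND j.Product = r.Product
)
SELECT * EXCEPT (rn)
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY Year, Agreement, Station_origin, Station_destination, Product
           ORDER BY priority) AS rn
  FROM candidates c
)
WHERE rn = 1
```

The remaining rules from the list (destination-station alternatives, combined with the product prefix) would be added as further UNION ALL branches with higher priority numbers.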

BigQuery: grouping by similar strings for a large dataset

I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that there are many string values in company_name that are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example above; occasionally there is simply an extra space or a PTY/LTD suffix which throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington
Jim's Pizza Oxford
McDonald's Peckham
McDonald's Victoria
And be able to group by the underlying company name rather than relying exclusively on an exact string match.
Have you tried using the Soundex function?
SELECT
  SOUNDEX(name) AS code,
  MAX(name) AS sample_name,
  COUNT(name) AS records
FROM ((
    SELECT "Jim's Pizza Paddington" AS name)
  UNION ALL (
    SELECT "Jim's Pizza Oxford" AS name)
  UNION ALL (
    SELECT "McDonald's Peckham" AS name)
  UNION ALL (
    SELECT "McDonald's Victoria" AS name))
GROUP BY 1
ORDER BY 1
You can then use the soundex code to create groupings, with a split or other string function to pull out the part of the string which matches the name group, or use a window function to pull back one occurrence to get the name string. It's not perfect, but it means you do not need to pull the data into other tools with advanced language recognition.
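A minimal sketch of the window-function variant mentioned above, run against the invoice_data table from the question: each SOUNDEX group is tagged with one representative company name.

```sql
-- One representative name per SOUNDEX group, chosen alphabetically.
SELECT DISTINCT
  SOUNDEX(company_name) AS code,
  FIRST_VALUE(company_name) OVER (
    PARTITION BY SOUNDEX(company_name)
    ORDER BY company_name) AS group_name
FROM invoice_data
```

Grouping invoices by group_name instead of company_name then collapses the near-duplicate spellings.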

matching names with SSN in a columnar table

thanks in advance for any advice that you might have.
This is my first time attempting to query from a columnar database, so I'm a bit uncertain as to how to write a query that gives me the results that I'm looking for.
The table ("census_data") that I'm querying from has the following types of values (41 rows total):
plan_id  ssn_key    field  value
1        111111111  DOB    1732-02-22
1        111111111  DOR    1830-11-02
1        111111111  FNAME  GEORGE
1        111111111  LNAME  WASHINGTON
1        863283322  DOR    2020-03-22
As an FYI, in some cases, we might only have someone's SSN and DOB, but not their FNAME, LNAME, DOR (date of retirement), etc.
We're working with dummy data now and attempting to have queries in place for when we begin working with a large-scale data set.
We know that in some cases in the actual data set, there will be illogical data, such as a Date of Retirement ('DOR') that occurs in the future (assuming for our rules, a 'DOR' value must occur in the past in order for it to be valid).
We've written some queries that have given us the results that we're looking for, such as:
1) Give us the birthdays of all people with FNAME = 'GEORGE' and LNAME = 'WASHINGTON'
select [value]
from [testdb3].[dbo].[census_data]
where field = 'DOB'
  and ssn_key in (select ssn_key from census_data
                  where field = 'LNAME' and [value] = 'WASHINGTON'
                    and ssn_key in (select ssn_key from census_data
                                    where field = 'FNAME' and [value] = 'GEORGE'))
2) Give us all SSNs of people with a Date of Retirement after today
select [plan_id], [ssn_key], [field], [value]
from [testdb3].[dbo].[census_data] as cd
where cd.field = 'DOR' and [value] > GETDATE()
As a reminder, the SSN values are in the 2nd column in our table, whereas the values for DOB, FNAME, DOR, LNAME, etc. are all in the 4th column of our table.
And here's where we're stumped. We're trying to write a query that gives us the first name of anyone with a date of retirement later than today. We've spent a few hours trying to come up with something that works and have come up empty so far. If anyone has any thoughts on what the code would be, please let me know; I would greatly appreciate it. Thank you.
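Following the pattern of the two working queries above, one sketch of that query (untested, using the same table and column names) is to take the ssn_key values from the DOR filter and feed them into an FNAME lookup:

```sql
-- First names of everyone whose DOR is in the future:
-- the inner query finds the ssn_keys, the outer pulls their FNAME rows.
select [value] as first_name
from [testdb3].[dbo].[census_data]
where field = 'FNAME'
  and ssn_key in (select ssn_key
                  from [testdb3].[dbo].[census_data]
                  where field = 'DOR' and [value] > GETDATE())
```

People who have a future DOR but no FNAME row (the question notes some fields can be missing) will simply not appear in the result.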

SQL select query to find the latest "destination_id" to track moves

Hello – I am trying to construct an Oracle 11g query that will find the latest version of an entity by going through a table that has a history of moves. An example of this is that the table could contain a list of addresses that a person has lived at and different addresses that they have moved to.
For example, you might live at ADDRESS_ID 123 but then moved to ADDRESS_ID 456 and moved again to ADDRESS_ID 789.
It is also possible that you lived at ADDRESS_ID 123 the whole time and never moved therefore you would never appear on the MOVE_LIST table.
The goal of the query would be so if I select ADDRESS_ID 123 in the first example above then it would tell me the MOST RECENT ADDRESS_ID that the person is at (789).
The table is called MOVE_LIST and has the following columns:
MOVE_LIST_ID
ORIGINAL_ADDRESS_ID
DESTINATION_ADDRESS_ID
The query I have so far doesn’t complete this task since it doesn’t go through the list of moves:
Select DESTINATION_ADDRESS_ID
from MOVE_LIST
where ORIGINAL_ADDRESS_ID = '123'
Any tips on this query would be GREATLY appreciated.
Here is some sample data:
MOVED_LIST_ID  ORIGINAL_ADDRESS_ID  DESTINATION_ADDRESS_ID
1              123                  456
2              456                  789
Thank you
In your case the data in the MOVE_LIST table forms a hierarchy. So, in order to find the last address a person moved to, a simple hierarchical query can be used:
with move_list(moved_list_id, original_address_id, destination_address_id) as (
select 1, 123, 456 from dual union all
select 2, 456, 789 from dual
)
select destination_address_id
from move_list
where connect_by_isleaf = 1
start with original_address_id = 123
connect by original_address_id = prior destination_address_id
Result:
DESTINATION_ADDRESS_ID
----------------------
789
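Against the real MOVE_LIST table (dropping the dummy-data CTE), the same query can be wrapped so that an address that never appears in MOVE_LIST is returned unchanged, as in the second scenario from the question. A sketch, assuming the move history forms a single chain per address:

```sql
-- Latest address for 123; falls back to 123 itself if there were no moves.
select nvl(
  (select destination_address_id
   from move_list
   where connect_by_isleaf = 1
   start with original_address_id = 123
   connect by original_address_id = prior destination_address_id),
  123) as current_address_id
from dual
```

The NVL fallback covers the person who lived at 123 the whole time and so has no MOVE_LIST rows.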

Pentaho Report Designer (PRD): Use parameter in select clause

In the report I'm working with, I need to display information from four columns of a database table. The first three columns of the table are SEX, AGE and NAME. The other N columns (N being around 100!) are questions, with each row holding that person's answer to each question:
SEX | AGE | NAME | Q1 | Q2 | Q3 | ... | Q100
In my report, I need to show four of these columns, where the first three are always the same and the fourth column varies according to the option selected by the user:
SEX | AGE | NAME | <QUESTION_COLUMN>
So far I've created a dropdown parameter (filled with "Q1", "Q2", "Q3", etc) where the user can select the question he wants to compare. I've tried to use the value of the selected option (for instance, "Q1") in the SELECT clause of my report query, without success:
SELECT sex, age, name, ${QUESTION} FROM user_answers
Pentaho Report Designer doesn't show any errors with that; it simply doesn't show any values for the question column (the other columns - sex, age and name - always return their values).
So, I would like to know:
Can I do this? I mean, use parameters in the SELECT clause?
Is there any other way to have this "wildcard" column according to a parameter?
Thanks in advance!
Bruno Gama
You can use Pentaho Report Designer to do this.
First, you must build the parameter QUESTION in the parameters dialog,
e.g. SELECT question FROM user_answers ORDER BY XXXX
Then you can use the SQL:
SELECT sex, age, name, question FROM user_answers
WHERE question = ${QUESTION}
Finally, use the drop-down to make the selection.
I am using SQL Server as the database. This problem can be solved like this:
execute('SELECT sex, age, name, '+${QUESTION}+' as Q1 FROM user_answers')
Please note that ${QUESTION} must be a column name of user_answers. In this example I used a text-box parameter named QUESTION where the column name is given as input. You may need other code if the input parameter is not a text box.
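If dynamic SQL is not an option, another approach (not from the answers above, and verbose with 100 question columns) is a static CASE expression that maps the parameter value onto the matching column; a sketch for the first three questions:

```sql
-- Map the selected parameter value onto the corresponding column.
-- One WHEN branch is needed per question column.
SELECT sex, age, name,
       CASE ${QUESTION}
         WHEN 'Q1' THEN Q1
         WHEN 'Q2' THEN Q2
         WHEN 'Q3' THEN Q3
       END AS question_value
FROM user_answers
```

Since ${QUESTION} is used only as a compared value, not spliced into the SQL text, this also avoids the injection risk of the execute() approach.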