Masking a Substring in Hive Views - sql

I need to create a view on top of a Hive table that masks data in a particular column.
The table has a column of type String whose data is JSON. I need to mask the value of one field, say 'ip_address':
{"id":1,"first_name":"john","last_name":"doe","email":"sample#123.com","ip_address":"111.111.111.111"}
expected:
{"id":1,"first_name":"john","last_name":"doe","email":"sample#123.com","ip_address":null}
These are the built-in Hive functions I have tried; none of them seems to help:
mask
get_json_object
STR_TO_MAP
if clause
I also don't think substring and regexp_extract are useful here, because the position of the field value is not known in advance, and I'm not familiar with regular expressions.
PS: Any help that lets me avoid writing a new UDF is appreciated.

Use regexp_replace:
select regexp_replace(column_name,'"ip_address":".*?"', '"ip_address":null') as column_name
This works regardless of where the field appears in the string.
You can also allow any number of optional spaces before and after the colon:
regexp_replace(column_name,'"ip_address" *: *".*?"', '"ip_address":null')
The regexp '"ip_address" *: *".*?"' means:
"ip_address" - literally "ip_address"
' *' (a space followed by *) - 0 or more spaces (allowed in JSON)
: - literally :
' *' (a space followed by *) - 0 or more spaces
".*?" - any number of any characters (non-greedy) inside double quotes.
See also similar question if you want to replace value with some calculated value, for example obfuscate using sha256, not with just null: https://stackoverflow.com/a/54179543/2700344
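As a rough illustration of the pattern explained above, here is a minimal Python sketch. Python's re module stands in for Hive's regexp_replace (an assumption: both accept this particular pattern with the same meaning), and the helper name mask_ip is hypothetical.

```python
import re

# Same pattern as in the answer: optional spaces are allowed around the colon,
# and the non-greedy .*? stops at the first closing double quote.
IP_PATTERN = re.compile(r'"ip_address" *: *".*?"')


def mask_ip(json_text: str) -> str:
    """Replace the quoted ip_address value with a JSON null (hypothetical helper)."""
    return IP_PATTERN.sub('"ip_address":null', json_text)


row = (
    '{"id":1,"first_name":"john","last_name":"doe",'
    '"email":"sample#123.com","ip_address":"111.111.111.111"}'
)
print(mask_ip(row))
```

The field is matched by name, so it can sit anywhere in the string. Note that the pattern only matches values written as quoted strings without embedded escaped quotes.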

Related

Regular expression metacharacter in SQL yields different results in Oracle vs. Postgres

I'm trying to convert some queries from an Oracle environment to Postgres. This is a simplified version of one of the queries:
SELECT * FROM TABLE
WHERE REGEXP_LIKE(TO_CHAR(LINK_ID),'\D')
I believe the equivalent postgreSQL should be this:
SELECT * FROM TABLE
WHERE CAST(LINK_ID AS TEXT) ~ '\D'
But when I run these queries in their respective environments on the exact same dataset, the first query outputs no records (which is correct) and the second query outputs all records in the table. I didn't write the original code, but as I understand it, it's looking for any values in the numeric field LINK_ID that are non-digit characters. Is the \D metacharacter supposed to behave differently in Oracle vs. postgres? I'm not seeing anything in documentation to say they should.
The documentation for Oracle's TO_CHAR(number) states
If you omit fmt, then n is converted to a VARCHAR2 value exactly long enough to hold its significant digits.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions181.htm
This means that the only non-numeric character which might be produced is a negative sign or a decimal point. If the number is positive and has no fractional part, it will not match the regular expression \D.
On the other hand, in PostgreSQL, casting a numeric(38,8) value to TEXT returns a value with the number of decimal places given by the type specification, in this case 8.
E.g.:
cast( cast(12341234 as numeric(38,8)) as TEXT)
generates 12341234.00000000. The result of such a cast always contains a decimal point and therefore always matches the regular expression \D.
You may find that replacing it with this solves your problem:
(LINK_ID % 1) <> 0.0
Alternatively, if you need to keep the regex (e.g. to simplify migration work), consider changing it to '\.0*[1-9]', i.e. find a decimal point that is followed by a nonzero digit, possibly after some zeros.
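The difference between the two patterns can be sketched in Python. This is only an approximation: Decimal.quantize stands in for the PostgreSQL cast to numeric(38,8), and the helper name numeric_38_8_as_text is hypothetical.

```python
import re
from decimal import Decimal


def numeric_38_8_as_text(value: str) -> str:
    """Render a number with exactly 8 decimal places (stand-in for the cast)."""
    return str(Decimal(value).quantize(Decimal("0.00000001")))


whole = numeric_38_8_as_text("12341234")
fractional = numeric_38_8_as_text("12.5")

# \D matches the decimal point, so every value matches.
whole_matches_non_digit = re.search(r"\D", whole) is not None

# \.0*[1-9] matches only when a nonzero digit follows the decimal point.
whole_has_fraction = re.search(r"\.0*[1-9]", whole) is not None
fractional_has_fraction = re.search(r"\.0*[1-9]", fractional) is not None

print(whole, whole_matches_non_digit, whole_has_fraction)
print(fractional, fractional_has_fraction)
```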

Determine if substring corresponds to specific code (character types) in SQL

I have a collection of strings and want to filter out those whose last four characters are: (alpha)(alpha)(number)(number).
I know I can take a substring of each of these characters separately, but how do I determine the types of the characters in the sequence?
This is for SQL in Hive.
You can use regular expressions. Something like:
where col regexp '[a-zA-Z]{2}[0-9]{2}$'
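To see what that pattern accepts, here is a small Python sketch. It assumes Python's re.search behaves like Hive's regexp operator for this pattern (a match anywhere in the string, anchored to the end by $); the function name ends_with_code is hypothetical.

```python
import re

# Two ASCII letters followed by two digits, anchored at the end of the string.
TAIL = re.compile(r"[a-zA-Z]{2}[0-9]{2}$")


def ends_with_code(value: str) -> bool:
    """True when the last four characters are alpha, alpha, digit, digit."""
    return TAIL.search(value) is not None


for sample in ["orderAB12", "order1234", "AB12x", "xy99"]:
    print(sample, ends_with_code(sample))
```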

How to select values around .(dot) using sql

I am running the query below in Teradata:
sel requesttext from dbc.tables
where tablename='old_employee_table'
Result:
alter table DB_NAME.employee_table,no fallback ;
I want to get the result below using SQL:
DB_NAME.employee_table
Requesttext can be:
create set table DB_NAME.employee_table;
The database name and table name can occur anywhere in the result. Since a dot joins them, I want to split on the dot.
Basically, I need SQL that returns the values surrounding the dot.
I want the database name and table name in the result.
I'm not a Teradata person, but this should work for both strings given so far, as long as teradata's regexp_substr() supports positive look-behind and positive look-ahead assertions (I might have the Teradata syntax wrong, so a little tweaking may be needed):
SELECT REGEXP_SUBSTR(requesttext, '(?<= )(\w+\.\w+)(?=[,$]?)', 1, 1)
FROM dbc.tables
WHERE tablename='old_employee_table'
See the regex101 example. Hopefully it translates to Teradata easily.
The regex looks for and returns the words either side of and including the period, when preceded by a space, and followed by an optional comma or the end of the line.
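The pattern can be tried outside Teradata with a short Python sketch. This only shows how the regex itself behaves; whether Teradata's regexp_substr() accepts the same syntax remains the open question noted above. The function name qualified_name is hypothetical.

```python
import re

# Same pattern as in the answer: a space before, then word.word.
NAME = re.compile(r"(?<= )(\w+\.\w+)(?=[,$]?)")


def qualified_name(request_text: str):
    """Return the first database.table pair preceded by a space, or None."""
    match = NAME.search(request_text)
    return match.group(1) if match else None


print(qualified_name("alter table DB_NAME.employee_table,no fallback ;"))
print(qualified_name("create set table DB_NAME.employee_table;"))
```

Because the trailing look-ahead is optional, it does not restrict the match; the look-behind for a space and the word.word shape do the actual work.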
You could do this with either regexp_substr() or strtok().
As Jamie Zawinski said:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
So I would go with the strtok() method. Also I'm lazy and regular expressions are hard.
Function strtok() takes three arguments:
The string being split
The delimiter to split the string
The number of the token to grab.
To get at the <database>.<table> from that string that is returned in your query, we can split by a space, grab the third token, then split that by a comma and grab the first token.
That would look like:
SELECT strtok(strtok(requestText,' ',3),',',1)
FROM dbc.tables
WHERE tablename='old_employee_table'
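The nested split can be sketched in Python to check which token lands where. The strtok function below is a hypothetical stand-in, not Teradata's implementation: it assumes 1-based token numbers, treats every character of the delimiter string as a delimiter, and skips empty tokens.

```python
import re


def strtok(text: str, delimiters: str, token_number: int):
    """Return the 1-based token after splitting on any delimiter character."""
    parts = re.split("[" + re.escape(delimiters) + "]", text)
    tokens = [part for part in parts if part]  # drop empty tokens
    if 0 < token_number <= len(tokens):
        return tokens[token_number - 1]
    return None


alter_text = "alter table DB_NAME.employee_table,no fallback ;"
create_text = "create set table DB_NAME.employee_table;"

# Third space-separated token, then the part before the first comma.
print(strtok(strtok(alter_text, " ", 3), ",", 1))

# The token position is fixed, so the second statement form needs token 4.
print(strtok(create_text, " ", 3))
print(strtok(create_text, " ", 4))
```

The sketch suggests the position-based approach fits the "alter table" text from the query, while the "create set table" form puts the name in the fourth token with a trailing semicolon attached.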

Redshift SQL - Extract numbers from string

In Amazon Redshift tables, I have a string column from which I need to extract only the digits. Currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I tried the REPLACE function, but the result is not elegant.
Any thoughts on converting the string to ASCII first and then doing some operation to extract only the numbers? Or any other alternatives?
This is hard because Redshift does not support user-defined functions and is missing a lot of traditional functions.
Edit:
I am trying the query below, but it only returns 051-a92, whereas I need 05192 as output. I am thinking of substring etc., but I only have regexp_substr available right now. How do I get rid of the characters in between?
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)
This might be late, but I was solving the same problem and finally came up with this:
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
Alternatively, you can now create a Python UDF.
Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() { aka STRPOS(), POSITION() }.
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII), then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data, though, this is a non-starter.
A better method is to remove all the non-numeric characters:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
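The two character classes suggested so far can be compared with a short Python sketch. Python's re.sub is used as a stand-in for Redshift's REGEXP_REPLACE (an assumption for these simple patterns), and both function names are hypothetical.

```python
import re


def strip_listed_chars(value: str) -> str:
    """Remove only lowercase letters, slashes and hyphens (the '[a-z/-]' class)."""
    return re.sub(r"[a-z/-]", "", value)


def digits_only(value: str) -> str:
    """Remove every character that is not a digit (the '[^0-9]' class)."""
    return re.sub(r"[^0-9]", "", value)


for sample in ["somestring-051-a92", "brws--A*1"]:
    print(sample, strip_listed_chars(sample), digits_only(sample))
```

Listing the characters to remove works only while the input stays within that list; the negated class removes anything that is not a digit, including uppercase letters and symbols.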
You can specify "any non digit" that includes non-printable, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1','[\D]')
returns
"1"

count number of characters in nvarchar column

Does anyone know a good way to count characters in a text (nvarchar) column in Sql Server?
The values there can be text, symbols and/or numbers.
So far I used sum(datalength(column))/2 but this only works for text. (it's a method based on datalength and this can vary from a type to another).
You can find the number of characters using system function LEN.
i.e.
SELECT LEN(Column) FROM TABLE
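Use the following (in dialects that provide LENGTH, such as MySQL or PostgreSQL; SQL Server uses LEN instead):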
SELECT length(yourfield) FROM table;
Use the LEN function:
Returns the number of characters of the specified string expression, excluding trailing blanks.
Doesn't SELECT LEN(column_name) work?
The LEN function doesn't work with the text data type.
ntext, text, and image data types will be removed in a future version
of Microsoft SQL Server. Avoid using these data types in new
development work, and plan to modify applications that currently use
them. Use nvarchar(max), varchar(max), and varbinary(max) instead. For
more information, see Using Large-Value Data Types.
Source
I had a similar problem recently, and here's what I did:
SELECT
columnname as 'Original_Value',
LEN(LTRIM(columnname)) as 'Orig_Val_Char_Count',
N'['+columnname+']' as 'UnicodeStr_Value',
LEN(N'['+columnname+']')-2 as 'True_Char_Count'
FROM mytable
The first two columns look at the original value and count the characters (minus leading/trailing spaces).
I needed to compare that with the true count of characters, which is why I used the second LEN function. It sets the column value to a string, forces that string to Unicode, and then counts the characters.
By using the brackets, you ensure that any leading or trailing spaces are also counted as characters; of course, you don't want to count the brackets themselves, so you subtract 2 at the end.
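The effect of the bracket trick can be sketched in Python. These functions are hypothetical stand-ins that only mimic the behavior described in this thread (LEN ignoring trailing blanks, DATALENGTH counting two bytes per nvarchar character); they are not SQL Server's implementation.

```python
def sql_len(value: str) -> int:
    """Mimic LEN: count characters, excluding trailing blanks."""
    return len(value.rstrip(" "))


def bracketed_len(value: str) -> int:
    """Mimic LEN(N'[' + value + N']') - 2: the brackets protect trailing blanks."""
    return sql_len("[" + value + "]") - 2


def datalength_over_two(value: str) -> int:
    """Mimic DATALENGTH(value) / 2 for nvarchar: two bytes per UTF-16 code unit."""
    return len(value.encode("utf-16-le")) // 2


sample = "  ab  "
print(sql_len(sample))              # trailing blanks are not counted
print(sql_len(sample.lstrip(" ")))  # like LEN(LTRIM(...)): leading blanks dropped too
print(bracketed_len(sample))        # every character is counted
print(datalength_over_two(sample))
```

For this sample the bracketed count and the byte-based count agree, while the plain count drops the trailing blanks.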