REGEXP_REPLACE for name and surname masking - sql

select REGEXP_REPLACE('Tina Frederich Piedro', '\w+', '*') from table;
I'm using \w+, but it returns * * *. What is the correct regex for the expected output?
Input:
Tina Frederich Piedro
Expected output:
T*** F******** P*****

This is not a general solution, but it might work in your case. You can replace the lower case letters with '*'s:
select REGEXP_REPLACE('Tina Frederich Piedro', '[a-z]', '*', 1, 0, 'c')
The 'c' is for a case-sensitive replace.

I'm by no means a regexp expert, so I had to split the answer into two stages; I imagine someone more capable can combine the two steps. But this does the trick, and it also accounts for upper-case letters in the middle of names and for punctuation, for example O'Brian.
select
regexp_replace(lowers_done,'\*[A-Z]','**') first_letters_only
from
(
select
regexp_replace('Tina McDonald O''Brian','[a-z]|[[:punct:]]','*') lowers_done
from
dual
)
Output:
T*** M******* O******
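The two stages above can also be written as one nested call, without the inline view. This is only a sketch of the same logic, not a more general regex; the column alias full_name is made up for illustration, and the sample names are the ones already used in this thread.

```sql
-- Same two replacements as above, nested instead of split across a subquery.
-- Inner call: mask lowercase letters and punctuation.
-- Outer call: mask a capital that directly follows an already-masked character.
select regexp_replace(
         regexp_replace(full_name, '[a-z]|[[:punct:]]', '*'),
         '\*[A-Z]', '**') as masked_name
from (
  select 'Tina Frederich Piedro' as full_name from dual union all
  select 'Tina McDonald O''Brian' from dual
);
-- Expected:
-- T*** F******** P*****
-- T*** M******* O******
```

One limitation to keep in mind: with the default case-sensitive matching, a name stored entirely in upper case (for example TINA) passes through unmasked, because neither pattern matches it.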

If it's for security reasons, you can use the DBMS_REDACT package to apply a masking pattern to sensitive information; see Oracle's documentation for the package.
I'm aware it's not a regexp solution, and furthermore this functionality may be subject to additional licensing from Oracle, but it is the solution Oracle suggests for PCI-compliant handling of sensitive data.

Related

ORACLE: How to use regexp_like to find a string with single quotes between two characters?

I need to query the DB for all records that have two single quotes between characters. Example: We've, who's.
I have the regex https://regex101.com/r/6MtB9j/1 but it doesn't work with REGEXP_LIKE.
Tried this
SELECT content
FROM MyTable
WHERE REGEXP_LIKE (content, '(?<=[a-zA-Z])''(?=[a-zA-Z])')
Appreciate the help!
Oracle regex does not support lookarounds.
You do not actually need lookaround in this case, you can use
SELECT content
FROM MyTable
WHERE REGEXP_LIKE (content, '[a-zA-Z]''[a-zA-Z]')
This will work since REGEXP_LIKE only attempts one match, and if there is a match, it returns true, otherwise, false (eventually, fetching a record or not).
Lookarounds are useful in case you need to replace or extract values, when matches may overlap.
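To illustrate the overlap point, here is a small sketch using REGEXP_COUNT with the same consuming pattern; the sample string is made up for the example.

```sql
-- In a'b'c both quotes sit between letters, but the first match a'b consumes
-- the b, so the second quote has no letter left in front of it.
select regexp_count('a''b''c', '[a-zA-Z]''[a-zA-Z]') as quote_matches
from dual;
-- Returns 1, not 2.
```

For a plain yes/no filter with REGEXP_LIKE this does not matter, since one match is enough.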
If you just need a single quote in a string, you can use:
where content like '%''%'
If they specifically need to be letters, then you need a regular expression:
regexp_like(content, '[a-zA-Z][''][a-zA-Z]')
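or, using the q-quote syntax so the quote does not have to be doubled (a backslash does not escape a quote inside an Oracle string literal):
regexp_like(content, q'{[a-zA-Z]'[a-zA-Z]}')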
If I understand well, you may need something like
regexp_count(content, '[a-zA-Z]''[a-zA-Z]') = 2.
For example, this
with myTable(content) as
(
select q'[what's]' from dual union all
select q'[who's, what's]' from dual union all
select q'[who's, what's, I'm]' from dual
)
select *
from myTable
where regexp_count(content, '[a-zA-Z]''[a-zA-Z]') = 2
gives
CONTENT
------------------
who's, what's

How to remove leftmost group of numbers from string in Oracle SQL?

I have a string like T_44B56T4 that I'd like to make T_B56T4. I can't use positional logic because the string could instead be TE_2BMT that I'd like to make TE_BMT.
What is the most concise Oracle SQL logic to remove the leftmost grouping on consecutive numbers from the string?
EDIT:
REGEXP_REPLACE is unavailable, but I have LTRIM, REPLACE, SUBSTR, etc.
Would this fit the bill? I am assuming there are alphanumeric characters, then an underscore, and then the digits you want to remove, followed by anything.
select regexp_replace(s, '^([[:alnum:]]+)_\d*(.*)$', '\1_\2')
from (
select 'T_44B56T4' s from dual union all
select 'TXM_1JK7B' from dual
)
It uses regular expressions with matched groups.
Alphanumeric characters before underscore are matched and stored in first group, then underscore followed by 0-many digits (it will match as many digits as possible) followed by anything else that is stored in second group.
If we have a match, the string will be replaced by content of the first group followed by underscore and content of the second group.
If there is no match, the string will not be changed.
It seems that you must use standard string functions, as regular expression functions are not available to you. (Comment under Gordon Linoff's answer; it would help if you would add the same at the bottom of your original question, marked clearly as EDIT).
Also, it seems that the input will always have at least one underscore, and any digits that must be removed will always be immediately after the first underscore.
If so, here is one way you could solve it:
select s, substr(s, 1, instr(s, '_')) ||
ltrim(substr(s, instr(s, '_') + 1), '0123456789') as result
from (
select 'T_44B56T4' s from dual union all
select 'TXM_1JK7B' from dual union all
select '34_AB3_1D' from dual
)
S RESULT
--------- ------------------
T_44B56T4 T_B56T4
TXM_1JK7B TXM_JK7B
34_AB3_1D 34_AB3_1D
I added one more test string, to show that only digits immediately following the first underscore are removed; any other digits are left unchanged.
Note that this solution would very likely be faster than regexp solutions, too (assuming that matters; sometimes it does, but often it doesn't).
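If the digits are not guaranteed to follow the first underscore, the same standard-function idea can be stretched to strip the first run of digits wherever it occurs. A rough sketch, assuming only TRANSLATE, INSTR, SUBSTR and LTRIM are available; the alias first_digit_pos is made up for illustration.

```sql
-- TRANSLATE maps every digit 1-9 to '0', so INSTR(..., '0') gives the position
-- of the first digit in s (0 when s has no digits at all).
select s,
       case
         when first_digit_pos = 0 then s
         else substr(s, 1, first_digit_pos - 1)
              || ltrim(substr(s, first_digit_pos), '0123456789')
       end as result
from (
  select s,
         instr(translate(s, '123456789', '000000000'), '0') as first_digit_pos
  from (
    select 'T_44B56T4' s from dual union all
    select 'TE_2BMT' from dual union all
    select '34_AB3_1D' from dual
  )
);
-- Expected: T_B56T4, TE_BMT, _AB3_1D
```

Note that the third row now behaves differently from the query above: the leading 34 is the leftmost group of digits, so it is the part that gets removed.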
If I understand correctly, you can use regexp_replace():
select regexp_replace('T_44B56T4', '_[0-9]+', '_')
Note: Your question says the leftmost grouping, but the examples all have the number following an underscore, so the underscore seems to be important.
EDIT:
If you really just want the first string of digits replaced without reference to the underscore:
select regexp_replace(code, '[0-9]+', '', 1, 1)
from (select 'T_44B56T4' as code from dual union all select 'TE_2BMT' from dual ) t

substr in Oracle from column

What is the syntax for SUBSTR in Oracle to extract part of a string?
I have "123456789 #073"
I only want what comes after the #
substr (table.col, 17,3)
Is that OK?
Most likely the simplest (and most performant) way of doing this would be to use the base string functions:
SELECT SUBSTR(col, INSTR(col, '#') + 1)
FROM yourTable;
We could also try using REGEXP_REPLACE here:
SELECT REGEXP_REPLACE(col, '.*#(.*)', '\1')
FROM yourTable;
The regex option would in general not perform as well as the first query. The reason for this is that invoking a regex incurs a performance overhead. You might want to consider a regex option if you expect that the string logic might change or get more complicated in the future. Otherwise, go with base string functions wherever possible.
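One detail of the SUBSTR/INSTR approach worth knowing: when a value contains no '#', INSTR returns 0 and the expression returns the whole string. If that is not wanted, a CASE guard can return NULL instead. A small sketch with made-up sample rows; the alias after_hash is hypothetical.

```sql
-- Only take the substring when a '#' is actually present;
-- a CASE without ELSE yields NULL for the other rows.
select col,
       case
         when instr(col, '#') > 0 then substr(col, instr(col, '#') + 1)
       end as after_hash
from (
  select '123456789 #073' as col from dual union all
  select '123456789' from dual
);
-- Expected: 073 for the first row, NULL for the second
```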
I think the most direct method might be regexp_substr():
select regexp_substr('123456789 #073', '[^#]+$')
from dual;
The regular expression says: "get me all non-hash characters at the end of the string".
If you happen to know that there are 3 characters and really want the last three characters of the string:
select substr('123456789 #073', -3)

How to match and replace sections of a string in SQL

I'm pulling a list of popular sites from my database, but I want to combine results that are from the same domain. I've been able to do this partially by using :
REGEXP_REPLACE(site, '%|^www([123])?\.|^m\.|^mobile\.|^desktop\.') as site
so that "www.facebook.com" and "facebook.com" or "m.facebook.com"
- all of which appear in the database - are treated as the same when I do a select distinct.
However, I want to take this a step further by writing an expression that looks at each string between periods. If a match is found consecutively in three or more strings between periods, then I want to treat those as the same. I simply can't predict every possible string that could come before "facebook.com", or any other site.
So for example:
"my.careerone.com.au" and
"careerone.com.au" match in three places.
Or "yahoo.realestate.com.au" and "rs.realestate.com.au" match in three places.
Any ideas on how to achieve this?
@David's code will work in Vertica as well, though perhaps not as well performance-wise.
You can use Vertica's own built-in functions such as TRIM and REGEXP_REPLACE.
After borrowing @David Faber's regexp, I ended up with this:
select TRIM(LEADING '.' from REGEXP_REPLACE(col_name,'^.*((\.[^.]+){3})$', '\1')) AS fixed_dn from table_name;
I don't have Vertica available so I tested this in Oracle SQL (which does have REGEXP_REPLACE() that is similar to Vertica's). Not sure what the CTE syntax would be in Vertica but you'll be querying against a table anyway:
WITH d1 AS (
SELECT 'my.careerone.com.au' AS domain_nm FROM dual
UNION ALL
SELECT 'careerone.com.au' FROM dual
UNION ALL
SELECT 'yahoo.realestate.com.au' FROM dual
UNION ALL
SELECT 'rs.realestate.com.au' FROM dual
)
SELECT domain_nm, TRIM('.' FROM REGEXP_REPLACE(domain_nm, '^.*((\.[^.]+){3})$', '\1')) AS domain_nm_fix
FROM d1;
What REGEXP_REPLACE() does here is strip the highest-level subdomains from the domain name when it has more than three levels. If there are only three levels, the regex does not match and nothing is replaced. When it does match, the captured group begins with a '.', which is why the leading . then has to be trimmed. So, for example, careerone.com.au is left unaltered, while my.careerone.com.au is changed to .careerone.com.au by REGEXP_REPLACE(), from which the leading . is then trimmed.
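As a possible variation, the dot can be matched outside the captured group, so that no TRIM is needed, and the result can be grouped directly. This is a sketch tested against nothing but the four sample values; it assumes, as above, that "same site" means "same last three labels", and the alias hits is made up.

```sql
-- '^.*\.' consumes everything up to the dot before the last three labels,
-- so the captured group \1 never starts with a dot.
WITH d1 AS (
  SELECT 'my.careerone.com.au' AS domain_nm FROM dual
  UNION ALL
  SELECT 'careerone.com.au' FROM dual
  UNION ALL
  SELECT 'yahoo.realestate.com.au' FROM dual
  UNION ALL
  SELECT 'rs.realestate.com.au' FROM dual
)
SELECT domain_nm_fix, COUNT(*) AS hits
FROM (
  SELECT REGEXP_REPLACE(domain_nm, '^.*\.(([^.]+\.){2}[^.]+)$', '\1') AS domain_nm_fix
  FROM d1
)
GROUP BY domain_nm_fix;
-- Expected groups: careerone.com.au (2), realestate.com.au (2)
```

Names with three labels or fewer do not match the pattern and are returned unchanged, just as with the original expression.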

RegEx: Repeated identical vowels in a string - Oracle SQL

I need to only display those strings (name of manufacturers) that contain 2 or more identical vowels in Oracle11g. I am using a RegEx to find this.
SELECT manuf_name "Manufacturer", REGEXP_LIKE(manuf_name,'([aeiou])\2') Counter FROM manufacturer;
For example:
The RegEx accepts
OtterBox
Abca
abcA
The RegEx rejects
Samsung
Apple
I am not sure how to proceed ahead.
I think you want something like this:
WITH mydata AS (
SELECT 'OtterBox' AS manuf_name FROM dual
UNION ALL
SELECT 'Apple' FROM dual
UNION ALL
SELECT 'Samsung' FROM dual
)
SELECT * FROM mydata
WHERE REGEXP_LIKE(manuf_name, '([aeiou]).*\1', 'i');
I am not sure why you used \2 as a backreference instead of \1 -- \2 doesn't refer to anything in this regex. Also, note the wildcard and quantifier .* to indicate that there can be any number of any character between the first occurrence of the vowel and the second. Third, note the 'i' parameter to indicate a case-insensitive search (which I think is what you want since you say that the regex should match "OtterBox").
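If the goal is to show a flag next to each manufacturer, as the query in the question attempts, note that in Oracle 11g REGEXP_LIKE is a condition and cannot appear on its own in the select list. One way around that, sketched here with the sample names from the question and a made-up alias has_repeated_vowel, is to wrap it in CASE:

```sql
-- REGEXP_LIKE is only valid where a condition is allowed (WHERE, CASE WHEN, ...),
-- so CASE turns its result into an ordinary column value.
WITH mydata AS (
  SELECT 'OtterBox' AS manuf_name FROM dual
  UNION ALL
  SELECT 'Abca' FROM dual
  UNION ALL
  SELECT 'Apple' FROM dual
  UNION ALL
  SELECT 'Samsung' FROM dual
)
SELECT manuf_name AS "Manufacturer",
       CASE
         WHEN REGEXP_LIKE(manuf_name, '([aeiou]).*\1', 'i') THEN 'Y'
         ELSE 'N'
       END AS has_repeated_vowel
FROM mydata;
```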
David, yours wasn't quite working for me. What about this?
\w*([aeiou])\w*\1+\w*
https://regex101.com/r/eE3iC2/3
EDIT: updated version, per suggestions:
.*([aeiou]).*\1.*
https://regex101.com/r/eE3iC2/5