Oracle SQL - Redacting all but the last four digits of multiple numbers of varying length within free text narrative - sql

Is there a straightforward way, perhaps using REGEXP_REPLACE or the like, to redact all but the last four digits of numbers (of varying length, 5 digits or more) appearing within free text (there may be multiple occurrences of separate numbers within the text)?
E.g.
Input = 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text'
Output = 'This is a test text with numbers ****5, *****3210 and separately number ************4321 all buried within the text'
With REGEXP_REPLACE it's obviously straightforward to replace all numbers with the *, but it's maintaining the final four digits and replacing with the correct number of *s that's vexing me.
Any help would be much appreciated!
(Just for context, due to the usual kind of business limitations, this had to be done within the query retrieving the data rather than using actual Oracle DBMS redaction functionality).
Many thanks.

You could try the following regex:
regexp_replace(txt, '(\d{4})(\d+(\D|$))', '****\2')
This matches 4 digits followed by at least one more digit and then a non-digit character (or the end of the string), and replaces only the first 4 digits with stars; the remaining digits are kept through the \2 back-reference.
Demo on DB Fiddle:
with t as (select 'select This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' txt from dual)
select regexp_replace(txt, '(\d{4})(\d+\D)', '****\2') new_text from t
| NEW_TEXT |
| :-------------------------------------------------------------------------------------------------------------------------- |
| select This is a test text with numbers ****5, ****543210 and separately number ****567887654321 all buried within the text |
Edit
Here is a simplified version, suggested by Aleksej in the comments:
regexp_replace(txt, '(\d{4})(\d+)', '****\2')
This works because of the greediness of the regexp engine, which makes '\d+' consume as many digits as possible.
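The greedy behaviour is not specific to Oracle, so it can be checked quickly outside the database. The snippet below is only an illustrative sketch using Python's re module (an assumption made for demonstration; the sample text is shortened and Oracle's engine differs in other details):

```python
import re

text = "numbers 12345, 9876543210 and 1234567887654321 in text"

# Group 1 takes exactly four digits; the greedy \d+ in group 2 then
# consumes every remaining digit of the run, so no trailing \D is needed.
masked = re.sub(r"(\d{4})(\d+)", r"****\2", text)
print(masked)
```

As in the SQL version, only the first four digits of each run are masked, so the number of stars does not grow with the length of the number.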

If you really need to keep the length of the numbers, then (I think) there is no way to do it in one step. You'll have to split the string into numeric and non-numeric parts and then replace the digits separately:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', LENGTH(txt) - 4,'*') || SUBSTR(txt, LENGTH(txt) -3) -- replace all digits but the last 4 with *
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
Result:
This is a test text with numbers *2345, ******3210 and separately number ************4321 all buried within the text
And as your example replaced the first four digits of your first number, you might also want to replace at least 4 digits:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', GREATEST(LENGTH(txt) - 4, 4),'*') || SUBSTR(txt, GREATEST(LENGTH(txt) -3, 5)) -- mask at least 4 digits, keeping at most the last 4
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
(Added GREATEST in the second line to replace at least 4 digits.)
Result:
This is a test text with numbers ****5, ******3210 and separately number ************4321 all buried within the text
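For comparison, the same "split, then mask each number" idea can be sketched outside the database with a per-match callback. This is a hypothetical Python helper (the name redact_numbers and its parameters are invented for illustration), not a replacement for the in-query solution:

```python
import re

def redact_numbers(text, keep=4, min_len=5):
    """Mask all but the last `keep` digits of every run of `min_len`+ digits."""
    def mask(match):
        digits = match.group(0)
        # One star per hidden digit, followed by the digits that stay visible.
        return "*" * (len(digits) - keep) + digits[-keep:]
    return re.sub(r"\d{%d,}" % min_len, mask, text)

sample = ("This is a test text with numbers 12345, 9876543210 and separately "
          "number 1234567887654321 all buried within the text")
print(redact_numbers(sample))
```

This corresponds to the first query above (length-preserving, last four digits kept); it can be handy for producing expected values when testing the SQL.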

Related

Search a pattern from comma separated parameters in plsql

My Parameter to a procedure lv_ip := 'MNS-GC%|CS,MIB-TE%|DC'
My cursor query should search for records that start with 'MNS-GC%' and 'MIB-TE%'.
Select id, date,program,program_start_date
from table_1
where program like 'MNS-GC%' or program LIKE 'MIB-TE%'
Please suggest ways to read it from the parameter and an alternative to LIKE.
Since you mention you want to preserve what's on the right side of the pipe, and want to be able to process parameters dynamically, here's a way to parse multi-delimited data that could give you some ideas using a CTE.
The table called 'tbl' just sets up your original data. tbl_comma contains that data split on the comma. The final query splits that data into name/value pairs.
Hopefully this will help give you some ideas even though it's not the exact answer you are looking for.
COLUMN ID FORMAT a3
COLUMN PROGRAM FORMAT a10
COLUMN part2 FORMAT a6
-- Original data
WITH tbl(ID, DATA) AS (
SELECT 1, 'MNS-GC%|CS,MIB-TE%|DC' FROM dual UNION ALL
SELECT 2, 'MNS-GC%|CS,MIB-TE%|DC,MIB-TA%|AB,MIB-TB%|BC' FROM dual
),
tbl_comma(ID, CASE) AS (
SELECT ID,
REGEXP_SUBSTR(DATA, '(.*?)(,|$)', 1, LEVEL, NULL, 1) CASE
FROM tbl
CONNECT BY REGEXP_SUBSTR(DATA, '(.*?)(,|$)', 1, LEVEL) IS NOT NULL
AND PRIOR ID = ID
AND PRIOR SYS_GUID() IS NOT NULL
)
--SELECT * FROM tbl_comma;
-- Parse into name/value pairs
SELECT ID,
REGEXP_REPLACE(CASE, '^(.*)\|.*', '\1') PROGRAM,
REGEXP_REPLACE(CASE, '.*\|(.*)$', '\1') PART2
FROM tbl_comma;
ID PROGRAM PART2
--- ---------- ------
1 MNS-GC% CS
1 MIB-TE% DC
2 MNS-GC% CS
2 MIB-TE% DC
2 MIB-TA% AB
2 MIB-TB% BC
6 rows selected.
If you're stuck with that input and the structure is fixed, with each comma-separated element having a pipe-delimited value, you could possibly convert that string to a regular expression pattern, and then use regexp_like to pattern-match:
select id, date, program, program_start_date
from table_1
where regexp_like(
program,
'^(' || rtrim(regexp_replace(lv_ip, '%\|.*?(,|$)', '|'), '|') || ')')
With your example parameter, the
'^(' || rtrim(regexp_replace(lv_ip, '%\|.*?(,|$)', '|'), '|') || ')'
would generate the pattern
^(MNS-GC|MIB-TE)
i.e. looking for either of those strings at the start of the program value.
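The string manipulation that builds this pattern can be tried in isolation. The following is a rough Python sketch of the same replace-then-trim steps (the function name prefix_pattern is made up for this example, and the prefixes are assumed to contain no regex metacharacters):

```python
import re

def prefix_pattern(param):
    # Drop each "%|value" tail (up to the next comma or the end of the
    # string), leaving a "|" between the remaining prefixes.
    alternatives = re.sub(r"%\|.*?(,|$)", "|", param).rstrip("|")
    return "^(" + alternatives + ")"

pattern = prefix_pattern("MNS-GC%|CS,MIB-TE%|DC")
print(pattern)
```

If a prefix could contain characters such as "." or "(", they would need escaping before being used in the pattern, both here and in the SQL version.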
Alternatively you could split the input up yourself, with instr and substr, and - since the number of elements may vary - create a dynamic query using them. That might be faster than using regular expression, but might be harder to maintain.
What would the regexp be to match CS|DC?
It depends how you plan to use those values, but if you're looking for some column exactly matching one of them, then you could do something similar with:
'^(' || ltrim(regexp_replace(lv_ip, '(^|,)[^|]*', null), '|') || ')$'
which with your input string would generate the pattern
^(CS|DC)$
But if you need to match the corresponding values as pairs - so the equivalent of something like:
where (program like 'MNS-GC%' and some_col = 'CS')
or (program like 'MIB-TE%' and some_col = 'DC')
... then you'd need to extract them as pairs, as @Gary_W has shown.
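To make the pair-wise requirement concrete, here is a small Python sketch of the logic only. The helper names parse_pairs and row_matches are hypothetical, and some_col stands for whichever column holds the second value:

```python
def parse_pairs(param):
    """Split 'PREFIX%|CODE,PREFIX%|CODE' into (prefix, code) tuples."""
    pairs = []
    for element in param.split(","):
        prefix, _, code = element.partition("|")
        # The trailing % is the LIKE wildcard; strip it for a prefix test.
        pairs.append((prefix.rstrip("%"), code))
    return pairs

def row_matches(program, some_col, pairs):
    # Equivalent of: (program LIKE 'P1%' AND some_col = 'C1') OR ...
    return any(program.startswith(p) and some_col == c for p, c in pairs)

pairs = parse_pairs("MNS-GC%|CS,MIB-TE%|DC")
```

A row only qualifies when the prefix and the code come from the same element, which a single combined pattern per column cannot guarantee.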

Consecutive Pattern replacing is not happening with REGEXP_REPLACE

I have a string as below
Welcome to the world of the Hackers
I am trying to remove the listed words (of, to, the) from within the string using the query below, but it fails when the matches are consecutive.
SELECT regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome the world the Hackers
It also fails when the same word is repeated consecutively, e.g.
SELECT regexp_replace( 'Welcome to to the world of the Hackers', '( to )|( the )|( of )', ' ' )
FROM dual;
Output: Welcome to world the Hackers
Whereas my expected output is: Welcome world Hackers
Is there any alternative/solution for this using REGEXP_REPLACE?
You can use the regular expression (^|\s+)((to|the|of)(\s+|$))+:
Query 1:
WITH test_data ( sentence ) AS (
SELECT 'to the of' FROM DUAL UNION ALL
SELECT 'woof breathe toto' FROM DUAL UNION ALL -- has all the words as sub-strings of words
SELECT 'theory of the offer to total' FROM DUAL -- mix of words to replace and words starting with those words
)
SELECT sentence,
regexp_replace(
sentence,
'(^|\s+)((to|the|of)(\s+|$))+',
'\1'
) AS replaced
FROM test_data
Results:
| SENTENCE | REPLACED |
|------------------------------|--------------------|
| to the of | (null) | -- All words replaced
| woof breathe toto | woof breathe toto |
| theory of the offer to total | theory offer total |
Why doesn't regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) work with successive matches?
Because the regular expression parser will look for the second match after the end of the first match and will not include the already parsed part of the string or the replacement text when looking for subsequent matches.
So the first match will be:
'Welcome to the world of the Hackers'
        ^^^^
The second match will look in the sub-string following that match:
'the world of the Hackers'
          ^^^^
The 'the ' at the start of the sub-string will not be matched as it has no leading space character (yes, there was a space before it but that was matched in the previous match and, yes, that match was replaced with a space but overlapping matches and matches on previous replacements are not how regular expressions work).
So the second match is the ' of ' in the middle of the remaining sub-string.
There will be no third match as the remaining un-parsed sub-string is:
'the Hackers'
and, again, the 'the ' is not matched as there is no leading space character to match.
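The non-overlapping behaviour described above is common to regular expression engines in general, so it can be reproduced outside Oracle. A minimal sketch in Python's re module (used here purely as a stand-in) comparing the original pattern with the grouped one:

```python
import re

sentence = "Welcome to the world of the Hackers"

# Each alternative consumes its surrounding spaces, so the word that
# follows a match has no leading space left and is skipped.
naive = re.sub(r"( to )|( the )|( of )", " ", sentence)

# One leading separator, then one or more listed words each followed by
# whitespace or the end of the string; the separator is put back via \1.
grouped = re.sub(r"(^|\s+)((to|the|of)(\s+|$))+", r"\1", sentence)

print(naive)
print(grouped)
```

The second pattern treats a whole run of listed words as a single match, which is why consecutive words no longer interfere with each other.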
REGEXP_REPLACE does not match a second pattern which is a part of the already matched pattern. This is more apparent when you use the multi-pattern matching like |. Thus, you can't rely on spaces for word boundaries to match multiple patterns this way. One solution could be to split and combine the characters. This may not be the best way, but works nonetheless. I would be glad to know a better solution.
This also assumes that you are OK with single spaces in the combined string where the original had more than one. Also, words followed by a comma or semicolon aren't handled. You could extend it by using NOT REGEXP_LIKE instead of NOT IN for such cases.
WITH t (id,s)
AS (
SELECT 1 , 'Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
UNION ALL
SELECT 2, 'The second line.Welcome to the world of the Hackers, you told me these words at the'
FROM DUAL
)
SELECT LISTAGG(word, ' ') WITHIN
GROUP (
ORDER BY w
)
FROM (
SELECT id,
LEVEL AS w
,REGEXP_SUBSTR(s, '[^ ]+', 1, LEVEL) AS word
FROM t CONNECT BY LEVEL <= REGEXP_COUNT(s, '[^ ]+')
AND PRIOR id = id
AND PRIOR SYS_GUID() IS NOT NULL
)
WHERE lower(word) NOT IN (
'to'
,'the'
,'of'
)
GROUP BY id;

Counting word lengths in a string

I am using an Oracle regular expression to extract the first letter of each word in a string. The results are returned in a single cell, with spaces representing hard breaks. Here is an example...
input:
'I hope that some kind person
browsing stack overflow
can help me'
output:
ihtskp bso chm
What I am trying to do next is count the length of each "word" in my output, like this:
6 3 3
Alternatively, a count of the words in each line of the original string would be acceptable, as it would yield the same result.
Thanks!
Count the number of spaces and add one:
select (length(your_col) - length(replace(your_col, ' '))+1) from your_table;
It will give you the number of words per line. From there you can get all counts on one line by using listagg function:
select LISTAGG(cnt,' ') within group (order by null) from (
select (length(a)-length(replace(a,' '))+1) cnt from (
select 'apa bpa bv' a from dual
union all
select 'n bb gg' a from dual
union all
select 'ff ff rr gg' a from dual))
group by null;
Perhaps you also need to split the strings if they contain newlines, or are they split already?
I tried to edit my original post but it hasn't appeared, but I figured out a way to solve my issue. I just decided to break the words into rows, since I know how to character count rows, and then reassembled the character counts into a single cell using listagg:
with my_string as (
select lvl, regexp_substr (words,'[0-9]+|[a-z]+|[A-Z]+',1,lvl) parsed
from (
select words, level lvl
from letters connect by level <= length(words) - length(replace(words,' ')) + 1)
)
select listagg(length(parsed),' ') within group (order by lvl) word_count -- order by position so the counts keep the original word order
from my_string
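The two counting approaches discussed in this thread (length of each word in the output, or number of words per line of the input) can be cross-checked with a short Python sketch. The function names are invented for this example:

```python
def word_lengths(text):
    """Length of each space-separated word, in original order."""
    return " ".join(str(len(word)) for word in text.split())

def words_per_line(text):
    """Number of words on each line of a multi-line string."""
    return " ".join(str(len(line.split())) for line in text.splitlines())

original = "I hope that some kind person\nbrowsing stack overflow\ncan help me"
print(word_lengths("ihtskp bso chm"))
print(words_per_line(original))
```

Both calls give the same counts for the example in the question, which is the equivalence the question relies on.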

splitting strings in oracle sql based on length

I want to split my strings in Oracle based on length with space as a delimiter.
For example,
MY_STRING="welcome to programming world"
My output should be
STRING1="welcome to "
STRING2="programming "
The strings should be a maximum of 13 characters in length. The words after position 26 can be ignored.
You don't mention what version of Oracle you're using. If you're using 11g or above (REGEXP_COUNT was introduced in 11g) you can use regular expressions to get what you need:
with spaces as (
select regexp_instr('welcome to programming world' || ' '
, '[[:space:]]', 1, level) as s
from dual
connect by level <= regexp_count('welcome to programming world' || ' '
, '[[:space:]]')
)
, actual as (
select max(case when s <= 13 then s else 0 end) as a
, max(case when s <= 26 then s else 0 end) as b
from spaces
)
select substr('welcome to programming world',1,a)
, substr('welcome to programming world', a + 1, b - a) -- start after the space at position a so the piece ends, rather than begins, with a space
from actual
This finds the position of every space, then picks the last one at or before position 13 (and likewise position 26). Lastly, it uses a simple substr to split your string. The strings will have a trailing space so you might want to trim this.
You have to concatenate your string with a space to ensure that there is a trailing space so the last word doesn't get removed if your string is shorter than 26 characters.
Assuming you're using an earlier version you could hack something together with instr and length but it won't be very pretty at all.
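The position arithmetic is easy to get wrong by one, so it may help to see it spelled out. Below is a hedged Python sketch of the same idea (the function name and the default limits of 13 and 26 are taken from the question or invented for illustration); positions are kept 1-based to mirror Oracle's instr and substr:

```python
def split_at_spaces(text, first_limit=13, second_limit=26):
    padded = text + " "  # guarantee a trailing space, as in the query
    # 1-based positions of every space, like REGEXP_INSTR would report.
    positions = [i + 1 for i, ch in enumerate(padded) if ch == " "]
    a = max((p for p in positions if p <= first_limit), default=0)
    b = max((p for p in positions if p <= second_limit), default=0)
    # First piece: characters 1..a; second piece: characters a+1..b.
    return padded[:a], padded[a:b]

first, second = split_at_spaces("welcome to programming world")
print(repr(first), repr(second))
```

Note that, as in the SQL, the second piece is only bounded by position 26 overall, so it can exceed 13 characters when the first break falls early in the string.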

SQL change date formats inside a string

I would like to convert a string containing dates in SQL select from Oracle 11g database.
Original string (CLOB) example:
"1.12.2011 - event 1
2.2.2012 - event 2
13.3.2012 - event 44"
Desired output:
"20111201 - event 1
20120202 - event 2
20120313 - event 44"
Is there a better (faster) way than using 4 separate replacements?
regexp_replace(regexp_replace(regexp_replace(regexp_replace(my_string,
'(\d\d)\.(\d\d)\.(20\d\d)', '\3\2\1'),
'(\d\d)\.(\d)\.(20\d\d)', '\30\2\1'),
'(\d)\.(\d\d)\.(20\d\d)', '\3\20\1'),
'(\d)\.(\d)\.(20\d\d)', '\30\20\1')
Especially if you're using clobs you have to be careful unless you're certain of the data in there.
However, if your clob only looks like that then you need only three regexp_replace calls for this to work; it'll also be much more flexible. Just explicitly specify digits using [[:digit:]], then specify the minimum and maximum number of times these digits may appear using {1,2}.
Then the following would work:
select regexp_replace(
regexp_replace(
regexp_replace( my_string
, '([[:digit:]]{1,2})\.([[:digit:]]{1,2})\.(20[[:digit:]]{2})'
, '\3-\2-\1')
, '-([[:digit:]]{1}(-|$))'
, '0\1' )
, ('-')
, '')
from dual
This means:
match ( group 1 ) 1 or 2 digits
match a full stop.
match ( group 2 ) 1 or 2 digits
match a full stop
match ( group 3 ) 20 + 2 digits.
Then take only groups 1, 2 and 3, i.e. ignoring the full stops, and return them in the order 3, 2, 1, separated by hyphens.
Then replace any [digit] that is followed by either a hyphen or the end of the string, i.e. the number of digits is only 1 with -0[digit].
Lastly replace all the hyphens.
Separately from that I agree with tbone. It would make a lot more sense to store this data in a separate table (event_id number, event_date date). Any string transformations are easy with no chance of getting it wrong, unlike in this situation, and the data is easy to query and compare.
There are no better options (both correct and readable) with better performance, or if there are, no one cares.
I prefer a 2-level regexp_replace for the date part:
select regexp_replace(
regexp_replace( my_string,
'([[:digit:]]{1,2})\.([[:digit:]]{1,2})\.(20[[:digit:]]{2})',
'\3-0\2-0\1' ),
'(20[[:digit:]]{2})-0?([[:digit:]]{2})-0?([[:digit:]]{2})',
'\1\2\3' )
from dual;
Maybe try doing:
select to_char(to_date('13.3.2011', 'DD.MM.YYYY'),'YYYYMMDD') from dual;
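For checking the expected output of any of the SQL variants above, the whole transformation can be expressed as a single substitution with a callback. This is only a Python sketch under the assumption that all dates are D.M.YYYY with years 20xx, as in the question:

```python
import re

DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(20\d{2})\b")

def to_yyyymmdd(text):
    def reorder(match):
        day, month, year = match.groups()
        # Zero-pad day and month to two digits while reordering.
        return f"{year}{int(month):02d}{int(day):02d}"
    return DATE.sub(reorder, text)

events = "1.12.2011 - event 1\n2.2.2012 - event 2\n13.3.2012 - event 44"
print(to_yyyymmdd(events))
```

Oracle's REGEXP_REPLACE has no callback mechanism, which is why the SQL answers need nested calls to achieve the padding.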