Masking a query string param value using Postgres regexp_replace - sql

I want to mask movie names with XXXXXXXX in a PostgreSQL table column. The content of the column is something like
hollywood_genre_movieTitle0=The watergate&categorey=blabla&hollywood_genre_movieTitle1=Terminator&hollywood_genre_movieTitle2=Spartacus&hollywood_genre_movieTitle3=John Wayne and the Indians&categorey=blabla&hollywood_genre_movieTitle4=Start Trek&hollywood_genre_movieTitle5=ET&categorey=blabla
And I would like to mask the titles (behind the pattern hollywood_genre_movieTitle\d) using the regexp_replace function
regexp_replace('(hollywood_genre_movieTitle\d+=)(.*?)(&?)', '\1XXXXXXXX\3', 'g')
This replaces only the first occurrence of a title and cuts off the rest of the string, which is not what I want. I would like all movie names to be replaced with XXXXXXXX.
Can someone help me solve that?

The regex does not work because (.*?)(&?) can match an empty string: the lazy .*? is free to consume nothing, and an & lands in Group 3 only if it immediately follows the hollywood_genre_movieTitle\d+= pattern.
You need to use a negated character class [^&] and a + quantifier to match any 1 or more chars other than & after the hollywood_genre_movieTitle\d+= pattern.
SELECT regexp_replace(
'hollywood_genre_movieTitle0=The watergate&categorey=blabla&hollywood_genre_movieTitle1=Terminator&hollywood_genre_movieTitle2=Spartacus&hollywood_genre_movieTitle3=John Wayne and the Indians&categorey=blabla&hollywood_genre_movieTitle4=Start Trek&hollywood_genre_movieTitle5=ET&categorey=blabla',
'(hollywood_genre_movieTitle\d+=)[^&]+',
'\1XXXXXXXX',
'g')
Details
(hollywood_genre_movieTitle\d+=) - Capturing group 1:
hollywood_genre_movieTitle - a substring
\d+= - 1 or more digits and a = after them
[^&]+ - 1 or more chars other than &.
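The same pattern can be tried outside the database. The sketch below uses Python's re module purely as a stand-in for Postgres regexp_replace (an assumption: the two engines are not identical, but they should agree on this simple pattern); the shortened query string is made up for illustration.

```python
import re

# Shortened sample in the same shape as the column value from the question.
query = (
    "hollywood_genre_movieTitle0=The watergate&categorey=blabla"
    "&hollywood_genre_movieTitle1=Terminator&categorey=blabla"
)

# Group 1 keeps the key and "="; [^&]+ consumes the title up to the next "&".
pattern = re.compile(r"(hollywood_genre_movieTitle\d+=)[^&]+")

# re.sub replaces every match, like regexp_replace with the 'g' flag.
masked = pattern.sub(r"\1XXXXXXXX", query)
print(masked)
```

Because [^&]+ stops at the next &, the categorey parameters between the titles are left untouched.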

Related

Remove space between number and character - PostgreSQL/REGEXP_REPLACE

I have a table with a medication_product_amount column in which there are spaces between numbers and characters, like below:
medication_product_amount
1 UN DE 50 ML
20 UN
1 UN DE 600 G
What I want is to remove the single space ONLY between numbers and characters, something like this:
new_medication_product_amount
1UN DE 50ML
20UN
1UN DE 600G
To do this, I am looking for a regular expression to use in the function REGEXP_REPLACE. I tried using the pattern below, indicating to replace the single space after the numbers, but the output remained the same as the input:
select REGEXP_REPLACE(medication_product_amount, '(^[0-9])( )', '\1') as new_medication_product_amount
from medications
Can anyone help me come up with the right way to do this? Thanks!
Your regex is a little off. First, what yours does: '(^[0-9])( )', '\1'
(^[0-9]) starts capture group 1 at the beginning of the string and matches exactly 1 digit,
followed by capture group 2, which matches 1 space.
The match is then replaced by group 1.
The problems and corrections:
What you want to capture is not necessarily at the start of the string, so remove the anchor ^.
What you want to capture may be more than 1 digit long, so replace [0-9] with [0-9]+, i.e. one or more digits.
Not actually a problem, but a space holds no special meaning in a regexp; it is just a space, so there is no need to capture it unless it is used later. Replace ( ) with just a plain space.
That is the end of the pattern, but there may be other occurrences. Tell Postgres to keep applying the pattern until the end of the string (see flag 'g').
Resulting expression/query:
select regexp_replace(medication_product_amount, '([0-9]+) ', '\1', 'g') as new_medication_product_amount
from medications;
Alternatively, match "digit space letter", capturing the digit and the letter using '([0-9]) ([A-Z])', then put them back using back references. The 'g' flag is needed so that every occurrence in the string is replaced, not just the first:
select REGEXP_REPLACE(medication_product_amount, '([0-9]) ([A-Z])', '\1\2', 'g') as new_medication_product_amount
from medications
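Both patterns can be compared on the sample rows with a small sketch. Python's re module is used here only as a stand-in for Postgres (an assumption, not the database engine itself); re.sub replaces all matches by default, so count=1 is used to mimic a call without the 'g' flag.

```python
import re

rows = ["1 UN DE 50 ML", "20 UN", "1 UN DE 600 G"]

# Drop the space that follows any run of digits.
after_digits = [re.sub(r"([0-9]+) ", r"\1", row) for row in rows]

# Drop the space only when a digit is followed by an uppercase letter.
digit_letter = [re.sub(r"([0-9]) ([A-Z])", r"\1\2", row) for row in rows]

# count=1 mimics regexp_replace without 'g': only the first match changes.
first_only = re.sub(r"([0-9]) ([A-Z])", r"\1\2", rows[0], count=1)

print(after_digits)
print(digit_letter)
print(first_only)
```

The two patterns agree on this data. They differ when a number is followed by another number: '([0-9]+) ' would join "1 2" into "12", while '([0-9]) ([A-Z])' leaves it alone.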

Extract character between the first two characters

I have a table in BigQuery:
ab_col_jfsfhfd_ggg_sdf
arfd_am_fdsf_fddg_fg
d_fdf_fdddg_ffddd_f
I would like to extract the characters that come right after the first _ character and are followed by the second _ character. I want to get the following:
col
am
fdf
I used the following regular expression to extract the characters but it does not work as intended:
^.*\_(\D+)\_.*$
regexp_replace(id,'^.*\\_(\\D+)\\_.*$' , '\\1')
Please help!
If I follow you correctly, you can use split():
(split(col, '_'))[safe_ordinal(2)]
split() turns the string column into an array of values, given a separator (here, we use _). Then we can just grab the second array element.
split() is a very simple way of solving this. But regular expressions are also quite simple:
with t as (
select 'ab_col_jfsfhfd_ggg_sdf' as id union all
select 'arfd_am_fdsf_fddg_fg' union all
select 'd_fdf_fdddg_ffddd_f'
)
select id, regexp_extract(id, '[^_]+', 1, 2)
from t;
The logic for the pattern is: "Look for any string of characters that is not an underscore. Then take the second one in the string."
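The two ideas above (split on the separator, or take the second run of non-underscore characters) can be sketched side by side. This uses Python as a stand-in for BigQuery, which is an assumption; second_token is a hypothetical helper name.

```python
import re

ids = ["ab_col_jfsfhfd_ggg_sdf", "arfd_am_fdsf_fddg_fg", "d_fdf_fdddg_ffddd_f"]

def second_token(value):
    """Return the second run of non-underscore characters, or None."""
    tokens = re.findall(r"[^_]+", value)
    return tokens[1] if len(tokens) > 1 else None

# Approach 1: split on "_" and take the second element.
by_split = [value.split("_")[1] for value in ids]

# Approach 2: second occurrence of the pattern [^_]+.
by_regex = [second_token(value) for value in ids]

print(by_split)
print(by_regex)
```

In this sketch the two approaches differ on consecutive underscores: for "a__b" the split gives an empty second element, while the regex skips the empty piece and returns "b".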
Use regexp_extract:
regexp_extract(id,'^[^_]+_([^_]+)')
Explanation
^ - the beginning of the string
[^_]+ - any character except '_' (1 or more times, matching as many as possible)
_ - a literal '_'
( - group and capture to \1:
[^_]+ - any character except '_' (1 or more times, matching as many as possible)
) - end of \1
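To see why the anchored pattern works where the original attempt did not, here is a minimal sketch using Python's re module as a stand-in for BigQuery (an assumption; extract is a hypothetical helper). In the original pattern the leading .* is greedy, so it swallows as many underscores as it can before the capture group.

```python
import re

anchored = re.compile(r"^[^_]+_([^_]+)")

def extract(value):
    """Return the text between the first and second underscore, or None."""
    match = anchored.search(value)
    return match.group(1) if match else None

sample = "ab_col_jfsfhfd_ggg_sdf"

# The pattern from the question: greedy .* pushes the capture toward the end.
original_attempt = re.sub(r"^.*_(\D+)_.*$", r"\1", sample)

print(extract(sample))
print(original_attempt)
```

Using [^_]+ instead of .* prevents the first part of the pattern from crossing an underscore, which pins the capture group to the second field.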

Get second match from regexp_matches results

I have a name column which looks like this:
'1234567 - 7654321 - some - more - text'
I need to get a string "7654321". I am stuck with the following:
SELECT regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g');
regexp_matches
----------------
{1234567}
{7654321}
(2 rows)
How do I get what I want? Maybe there's a better option than regexp_matches; I will gladly consider it. Thanks!
You could use REGEXP_REPLACE:
SELECT REGEXP_REPLACE('1234567 - 7654321 - some - more - text', '^\d+[^\d]+(\d+).*$', '\1');
Output
7654321
This regexp looks for a string starting with some number of digits (^\d+) followed by some non-digit characters ([^\d]+) and then another group of digits ((\d+)) followed by some number of characters until the end of the string (.*$). The () around the second group of digit characters makes that a capturing group, which we can then refer to in the replacement string with \1. Since REGEXP_REPLACE only replaces the parts of the string that match the regex, it is necessary to have a regex that matches the whole string in order to replace it with just the desired data.
Update
If there are potentially characters before the first set of digits, you should change the regex to
^[^\d]*\d+[^\d]+(\d+).*$
Update 2
If it's possible that there is only one set of numbers at the beginning, we must make matching the first part optional. We can do that with a non-capturing group:
^[^\d]*(?:\d+[^\d]+)?(\d+).*$
This makes the match on the first set of digits optional so that if it doesn't exist (i.e. there is only one set of digits) the regex will still match. By using a non-capturing group (adding the ?: to the beginning of the group), we don't need to change the replacement string from \1.
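The three variants can be compared on a few inputs with a short sketch. Python's re module stands in for Postgres here (an assumption), the extra sample strings are made up, and second_number is a hypothetical helper. When a pattern does not match, the substitution leaves the input unchanged, just as REGEXP_REPLACE does.

```python
import re

strict = r"^\d+[^\d]+(\d+).*$"                      # digits must start the string
leading_text = r"^[^\d]*\d+[^\d]+(\d+).*$"          # allows text before the first number
optional_first = r"^[^\d]*(?:\d+[^\d]+)?(\d+).*$"   # first number is optional

def second_number(pattern, text):
    """Replace the whole string with capture group 1 if the pattern matches."""
    return re.sub(pattern, r"\1", text)

two = "1234567 - 7654321 - some - more - text"
prefixed = "id 1234567 - 7654321 - text"
single = "1234567 - some text"

print(second_number(strict, two))
print(second_number(leading_text, prefixed))
print(second_number(optional_first, single))
```

Note that with the last variant a string containing only one number yields that number, so the result is no longer guaranteed to be the second one.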
regexp_matches() returns a table, so you can use that in the from clause together with the with ordinality option:
SELECT t.value
from regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g') with ordinality as t(value,idx)
where t.idx = 2;
Note that value is still an array; to get the actual array element you can use:
SELECT t.value[1]
from regexp_matches('1234567 - 7654321 - some - more - text', '\d+', 'g') with ordinality as t(value,idx)
where t.idx = 2;
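The idea behind with ordinality (number the matches, then filter on the index) can be sketched in Python as an illustration only; nth_match is a hypothetical helper, and enumerate plays the role of the ordinality column.

```python
import re

def nth_match(pattern, text, n):
    """Return the n-th (1-based) match of pattern in text, or None."""
    for idx, match in enumerate(re.finditer(pattern, text), start=1):
        if idx == n:
            return match.group(0)
    return None

name = "1234567 - 7654321 - some - more - text"
print(nth_match(r"\d+", name, 2))
```

Unlike the single-regex approach, this does not depend on what surrounds the numbers, and asking for a match that does not exist simply yields no value.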

Oracle SQL - find string pattern in string

I need to extract some text from a string, but only where the text matches a string pattern. The string pattern will consist of...
2 numbers, a forward slash and 6 numbers
e.g. 12/123456
or
2 numbers, a forward slash, 6 numbers, a hyphen and 2 numbers
e.g. 12/123456-12
I know how to use INSTR to find a specific string. Is it possible to find a string that matches a specific pattern?
You'll need to use regexp_like to filter the results and regexp_substr to get the substring.
Here is roughly what it should look like (the optional (-[0-9]{2})? part covers the second form, with a hyphen and 2 more numbers):
select id, myValue, regexp_substr(myValue, '[0-9]{2}/[0-9]{6}(-[0-9]{2})?') as myRegExMatch
from Foo
where regexp_like(myValue,'^([a-zA-Z0-9 ])*[0-9]{2}/[0-9]{6}(-[0-9]{2})?([a-zA-Z0-9 ])*$')
The regexp_like provided in the sample above takes into consideration the alphanumerics and whitespace characters that may bound the number pattern.
Use regexp_like.
where regexp_like(col_name,'\s[0-9]{2}\/[0-9]{6}(-[0-9]{2})?\s')
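\s matches a whitespace character. Placing it at the start and end of the pattern requires whitespace on both sides of the value, so a value at the very beginning or end of the string will not match.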
[0-9]{2}\/[0-9]{6} matches 2 numerics, a forward slash and 6 numerics
(-[0-9]{2})? is optional for a hyphen and 2 numerics following the previous pattern.
If the column holds only the value itself, anchor the pattern at both ends:
regexp_like(col_name,'^\d{2}/\d{6}($|-\d{2}$)')
or
regexp_like(col_name,'^\d{2}/\d{6}(-\d{2})?$')
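The difference between finding the pattern inside a longer string and requiring the whole value to be the pattern can be sketched as follows. Python's re module stands in for Oracle's regexp_substr and regexp_like (an assumption); find_code and is_exact_code are hypothetical helper names, and the sample strings are made up.

```python
import re

# 2 digits, a slash, 6 digits, then an optional hyphen and 2 digits.
CODE = r"\d{2}/\d{6}(?:-\d{2})?"

def find_code(text):
    """Return the first substring that looks like the pattern, or None."""
    match = re.search(CODE, text)
    return match.group(0) if match else None

def is_exact_code(text):
    """True only if the whole string is the pattern (like ^...$ anchors)."""
    return re.fullmatch(CODE, text) is not None

print(find_code("ref 12/123456-12 filed"))
print(find_code("ref 12/123456 filed"))
print(is_exact_code("12/123456"))
```

The unanchored search has no boundaries, so it would also pick up the pattern inside a longer run of digits; add boundaries or anchors if that matters for your data.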

Oracle regexp to match only digits after certain combination of signs

I have a string which roughly looks like: XXXXXXXXX - 1234567 XXXXXXXX,
where X can be either digit, string or sign (<,>,. or space).
I need to extract only these numbers after ' - '.
I have tried following:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}') from dual
I end up with - 1234567.
How do I get rid of '- '?
Thank you in advance
This should work with Oracle 11g.
Place the capturing group around the pattern part you are interested in first. Since you need the digits, wrap the [0-9]{1,7} with the capturing parentheses.
Then, pass all 6 arguments to the REGEXP_SUBSTR function, where the 6th one indicates the number of the capturing group you want to extract:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>',' - ([0-9]{1,7})', 1,1,NULL,1) from dual
Here, 1,1,NULL,1 means: start looking for a pattern match from Position 1, just for the first match, with no specific regex options, and return the contents of Group 1.
What @Gordon Linoff was trying to say was:
select substr(regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}'), 3)
from dual
Substr the remaining "- " off of your result.
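Both approaches (return only the capturing group, or take the whole match and cut off the prefix) can be sketched in Python, used here as a stand-in for Oracle (an assumption). Oracle's substr(..., 3) is 1-based, so it corresponds to the slice [2:] below.

```python
import re

text = "17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>"

# Approach 1: capture only the digits and return group 1.
match = re.search(r" - ([0-9]{1,7})", text)
digits = match.group(1) if match else None

# Approach 2: take the whole match, then strip the leading "- ".
whole = re.search(r"(- )[0-9]{1,7}", text).group(0)
trimmed = whole[2:]

print(digits)
print(trimmed)
```

The capturing-group version is less fragile, since it does not rely on the prefix having a fixed length.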