Regex to replace multiple patterns with single not working - sql

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL.
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM

Try this regex:
((0000) +)+(0000)
And if it's OK to match any digits, in runs longer or shorter than 4:
(\d+ +)+\d+
Good Luck!

You may use
\b(\d+)(?:\s+\1)+\b
See the regex demo
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
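The difference between the question's pattern and this one can be checked outside HANA. Below is a minimal sketch that uses Python's re module as a stand-in for REPLACE_REGEXPR (an assumption: both engines treat this simple syntax the same way). The helper collapse_repeats and the fixed replacement text RANDOM are hypothetical and only for illustration.

```python
import re

SAMPLE = "0000 0000 0000"

# Question's pattern: matches exactly one pair, so an odd repeat is left over.
PAIR_ONLY = r"(\w+)\s+\1"
# Answer's pattern: one value followed by 1+ repeats of itself.
WHOLE_RUN = r"\b(\d+)(?:\s+\1)+\b"

def collapse_repeats(pattern, text, replacement="RANDOM"):
    """Replace every match of pattern in text (comparable to OCCURRENCE ALL)."""
    return re.sub(pattern, replacement, text)

print(collapse_repeats(PAIR_ONLY, SAMPLE))  # RANDOM 0000
print(collapse_repeats(WHOLE_RUN, SAMPLE))  # RANDOM
```

The pair-only pattern consumes the first two values as one match and then finds no partner for the third, which is why the last 0000 survives.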

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains strings with numeric substrings. See the shaded column of my dataframe.
Rows with values that have 0E as a prefix, or 21 (any two-digit number) as a suffix, or 24A (any two-digit number followed by a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but one place needs to be modified, if I understand the question correctly.
For values that end with two digits, or with two digits and a capital letter, the regex should be:
.*\d{2}[A-Z]?$
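To compare the two suffix patterns, here is a small sketch with Python's re module. str.contains() searches anywhere in the string by default, so re.search is used here and the leading .* is not needed. The sample values and the helper should_drop are made up for illustration and do not come from the asker's dataframe.

```python
import re

# Pattern from the first answer, and the same rule with the two suffix
# alternatives folded into one optional trailing letter.
ORIGINAL = re.compile(r"^0E|\d{2}$|\d{2}[A-Z]$")
MERGED = re.compile(r"^0E|\d{2}[A-Z]?$")

# Hypothetical sample values.
values = ["0E123X", "AB21", "XY24A", "ABC", "A1B", "Q7"]

def should_drop(pattern, value):
    """True if the value matches anywhere, like str.contains() with a regex."""
    return pattern.search(value) is not None

kept = [v for v in values if not should_drop(MERGED, v)]
print(kept)  # ['ABC', 'A1B', 'Q7']
```

On these sample values both patterns flag the same rows, so either form can be passed to str.contains().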

Regex - trying to get the 5 digit words extracted from the string (presto)

I am trying to retrieve each sequence of 5 numbers / letters that is in brackets, as in this example:
accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march
and from that I want to extract 1232d 74294 qw23t 23d45 76wer 12874
I know that to extract only a single 5-character sequence in square brackets I can use \[[a-z0-9 ]{5,7}\], but I don't know how to retrieve several 5-character sequences.
Right now, since all the words inside [...] consist of 5 alphanumeric chars, you can use
(?:\G(?!^),\s*|\[)(\w+)(?=[^\]\[]*])
See the regex demo.
Details:
(?:\G(?!^),\s*|\[) - either the end of the preceding successful match followed by a comma and zero or more whitespaces, or a [ char
(\w+) - Group 1: one or more word chars
(?=[^\]\[]*]) - followed with zero or more chars other than [ and ] and then a ].
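For readers who want to try the extraction outside Presto: Python's built-in re module has no \G anchor, so the sketch below swaps in a different, two-step technique (first isolate the bracketed part, then pull the word tokens out of it). It is not the pattern above, and the function name bracket_tokens is hypothetical.

```python
import re

def bracket_tokens(text):
    """Return all word tokens found inside [...] groups, in order."""
    tokens = []
    # Step 1: capture the content of each bracket pair.
    for inner in re.findall(r"\[([^\]\[]*)\]", text):
        # Step 2: split that content into runs of word characters.
        tokens.extend(re.findall(r"\w+", inner))
    return tokens

sample = "accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march"
print(" ".join(bracket_tokens(sample)))  # 1232d 74294 qw23t 23d45 76wer 12874
```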

How to extract just numeric value with REGEXP_EXTRACT in BigQuery?

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract the number of days for each value preceded by the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the string that comes after the numbers, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J
You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
See the regex demo
Details
(?:.*\D)? - an optional non-capturing group that matches 0+ chars other than line break chars, as many as possible, and then a non-digit char (this pattern is required to advance the index to the location right before the last sequence of digits, not the last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.
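The pattern can be tried on the sample rows with Python's re module, as a rough stand-in for REGEXP_EXTRACT (an assumption: BigQuery's engine behaves the same way for this syntax). The function name relay_days is hypothetical.

```python
import re

DAYS = re.compile(r"(?:.*\D)?(\d+)\s*(?:J|DAY)")

def relay_days(delivery):
    """Return the last number followed by J or DAY, or None if there is none."""
    m = DAYS.search(delivery)
    return m.group(1) if m else None

for row in ("RETRAIT_2H|LIVRAISON_5J|RELAIS_5J",
            "LIVRAISON_21J|RELAIS_19J",
            "RELAIS_15 DAY"):
    print(row, "->", relay_days(row))
# RETRAIT_2H|LIVRAISON_5J|RELAIS_5J -> 5
# LIVRAISON_21J|RELAIS_19J -> 19
# RELAIS_15 DAY -> 15
```

Note that the pattern picks the last number followed by J or DAY, which happens to be the RELAIS entry in every sample row; it does not check for the keyword itself.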

Teradata regular expressions, look behind

I have a field, Simplified_Description and I'm looking for patterns in it. Specifically, I'm looking for a pattern like 6 X 8 or 6X8 or 600X800. I want to pull out the first and second numbers into new fields. I've been able to get the first number (with much help) using a look-ahead.
REGEXP_substr(Simplified_Description, '[0-9]+(?= {0,1}[X] {0,1}[0-9]+)') AS FirstNum,
When I try to get the second number by changing the look-ahead to a look-behind (by simply adding in a "<"),
REGEXP_substr(Simplified_Description, '[0-9]+(?<= {0,1}[X] {0,1}[0-9]+)') AS SecondNum
I now get an error
SELECT Failed. [9134] The pattern specified is not a valid pattern.
I am a complete newb on regular expressions, especially on look-ahead and look-behind, so it's possible I have some extremely simple error, but I can't figure it out as what I'm doing appears to be the correct syntax.
You may use the following regex to extract the first number:
REGEXP_substr(Simplified_Description, '\d+(?=\s*X\s*\d)') AS FirstNum
and this regex for the second number:
REGEXP_substr(Simplified_Description, '\d+\s*X\s*\K\d+') AS SecondNum
See the regex 1 and regex 2 demo.
Pattern 1 details
\d+ - 1 or more digits that are followed with...
(?=\s*X\s*\d) - a sequence of patterns:
\s* - 0+ whitespaces
X - an X char
\s* - 0+ whitespaces
\d - a digit.
Pattern 2 details
\d+ - 1 or more digits
\s*X\s* - an X char enclosed with any 0+ whitespace chars
\K - a match reset operator that omits (removes) the text matched so far from the match value
\d+ - 1 or more digits.
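As a cross-check outside Teradata, here is a hedged Python sketch. Python's re module has no \K, so it swaps in capturing groups to return both numbers from one match. It also shows that Python rejects the question's look-behind because its width is variable; whether Teradata raises error 9134 for the same reason is an assumption. The sample descriptions and the names split_dimensions and compiles are made up.

```python
import re

# One pattern, two capturing groups: digits, optional spaces around X, digits.
DIMENSIONS = re.compile(r"(\d+)\s*X\s*(\d+)")

def split_dimensions(description):
    """Return (first, second) as strings, or None when no N X M pattern exists."""
    m = DIMENSIONS.search(description)
    return (m.group(1), m.group(2)) if m else None

def compiles(pattern):
    """True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(split_dimensions("PANEL 600X800 GREY"))  # ('600', '800')
print(split_dimensions("BOARD 6 X 8"))         # ('6', '8')
# The question's look-behind has no fixed width, so Python refuses it.
print(compiles(r"[0-9]+(?<= {0,1}[X] {0,1}[0-9]+)"))  # False
```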

Extra blank space between words

Please help me with 2 questions on how to do the GREL expression for:
If there are double spaces between 2 words in a column, how can I eliminate 1 space? Example: Robert--Smith to Robert-Smith (the minus character stands for a blank, for illustration).
How can I look for an exact word in a text filter?
Thanks!
1°) Try Transform ---> value.replace("  "," ")
Or simply Common transforms ---> Collapse consecutive white space
2°) Column ---> Text filter, and enter your word
Or do Column ---> Facet ---> Custom text facet and type: value.contains(" your_word ")
or value.contains(/(yourexactword)/)
This will return a True or False facet
H.
@hpiedcoq's is the right answer if you need to do this in GREL. If not, you can just use the point-and-click interface:
For the first question: select your column and choose Edit cells > Common transforms > Collapse consecutive white space.
For the second question: select your column > Text filter > enter the word you are looking for. You can select case sensitive if you want to take upper and lower case into account in your search.
1.1 transform -- > value.replace("  "," ")
Replaces every double whitespace with a single one.
1.2 transform -- > value.trim()
Deletes whitespace before and after the string.
1.3 transform -- > value.replace(/\b  \b/," ")
Replaces with a regular expression; affects only double whitespace between two words.
Text filter > turn on regular expression and use \b.
Text filter with regular expression: \bWord\b = exact word; the word may or may not have whitespace before and after it.
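The same two ideas (collapsing runs of spaces, and matching a whole word with \b) can be tried outside OpenRefine. This is a minimal Python sketch, not GREL; the function names collapse_spaces and has_exact_word are hypothetical.

```python
import re

def collapse_spaces(value):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", value)

def has_exact_word(value, word):
    """True if word occurs as a whole word, delimited by word boundaries."""
    return re.search(r"\b" + re.escape(word) + r"\b", value) is not None

name = "Robert" + " " * 2 + "Smith"  # two spaces between the words
print(collapse_spaces(name))                      # Robert Smith
print(has_exact_word("Robert Smith", "Smith"))    # True
print(has_exact_word("Robert Smithson", "Smith")) # False
```

Unlike a plain replace of two spaces by one, the {2,} quantifier also collapses runs of three or more spaces in a single pass.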