How to extract just numeric value with REGEXP_EXTRACT in BigQuery? - sql

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract the number of days for each value preceded by the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the string that comes after the numbers, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J

You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
Details
(?:.*\D)? - an optional non-capturing group that matches 0+ chars other than line break chars, as many as possible, and then a non-digit char (this pattern is required to advance the index to the location right before the last sequence of digits, not the last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.
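The pattern above can be tried outside BigQuery. The following is a minimal sketch in which Python's `re` module stands in for BigQuery's regex engine (the pattern only uses constructs both accept). The helper `extract_days` and the second, keyword-anchored pattern `RELAIS_DAYS` are illustrative assumptions and are not part of the answer.

```python
import re

# Pattern from the answer above: capture the last run of digits followed by J or DAY.
LAST_DAYS = re.compile(r"(?:.*\D)?(\d+)\s*(?:J|DAY)")

# Hypothetical alternative (not from the answer): anchor on the keyword itself,
# case-insensitively, so the position of the RELAIS entry does not matter.
RELAIS_DAYS = re.compile(r"(?i)RELAIS_(\d+)")


def extract_days(pattern, value):
    """Return the captured digits as an int, or None when nothing matches."""
    match = pattern.search(value)
    return int(match.group(1)) if match else None


# Values taken from the question.
samples = [
    "RETRAIT_2H|LIVRAISON_5J|RELAIS_5J",
    "LIVRAISON_21J|RELAIS_19J",
    "Livraison_21J|Relais_19J",
    "RELAIS_15 DAY",
]

for value in samples:
    print(value, extract_days(LAST_DAYS, value), extract_days(RELAIS_DAYS, value))
```

Both patterns agree on the sample data because the RELAIS entry always comes last. If a row could list RELAIS before another entry, the answer's pattern would return the last number in the string, which is why the anchored variant may be worth considering.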

Related

How to add a character to the last third place of a string?

I have a column with numbers of various lengths, such as 50055, 1055, 155, etc. How can I insert a decimal point before the last two digits of each, so that they become 500.55, 10.55, and 1.55?
I tried using replace to find the last 2 digits and replace them with '.' || the last 2 digits. That doesn't always work, because the same two-digit sequence may occur more than once in the string.
replace(round(v_num/2),substr(round(v_num/2),-2),'.'||substr(round(v_num/2),-2))
You would divide by 100:
select v_num / 100
You can convert this into a string, if you want.
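To illustrate why division avoids the repeated-sequence problem, here is a small Python sketch of the same arithmetic. The function name `insert_decimal` is hypothetical, and the sketch assumes non-negative integers; it uses integer division so the result is built as a string without any floating-point rounding.

```python
def insert_decimal(v_num: int) -> str:
    """Format a non-negative integer so its last two digits become the decimal part."""
    whole, cents = divmod(v_num, 100)
    # Pad the remainder to two digits so that 5 becomes 0.05 rather than 0.5.
    return f"{whole}.{cents:02d}"


for v in (50055, 1055, 155, 5):
    print(v, "->", insert_decimal(v))
```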

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains string which contains numeric substrings. See the shaded column of my dataframe.
Rows with values that have 0E as a prefix, 21 (any two-digit number) as a suffix, or 24A (any two-digit number followed by a letter) as a suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
@tdy gave a good answer, but if I understand correctly, only one part needs to be modified.
For value ends with two digits or two digits and a capital character, the regex should be:
.*\d{2}[A-Z]?$
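The two answers can be compared without a dataframe. The sketch below uses Python's `re.search`, which, to my understanding, is how `str.contains` evaluates a regex by default. The sample values are hypothetical because the real column is not shown in the thread, and `MERGED` is one possible way to fold the optional-letter idea into the first answer's pattern (the leading `.*` is dropped since a search does not need it).

```python
import re

# Pattern from the first answer.
ORIGINAL = re.compile(r"^0E|\d{2}$|\d{2}[A-Z]$")
# Same alternatives with the two suffix cases merged via an optional letter.
MERGED = re.compile(r"^0E|\d{2}[A-Z]?$")

# Hypothetical column values; the real dataframe is not shown in the thread.
values = ["0EABC", "ABC21", "ABC24A", "ABCD", "A1B"]


def keep(pattern, items):
    """Mimic df.loc[~mask]: keep the items the pattern does NOT match."""
    return [item for item in items if not pattern.search(item)]


print(keep(ORIGINAL, values))
print(keep(MERGED, values))
```

On these sample values both patterns keep the same rows, so the merged form is a shorter spelling of the same filter rather than a change in behavior.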

Regex - trying to get the 5 digit words extracted from the string (presto)

I am trying to retrieve each sequence of 5 numbers / letters that are in brackets, just like in this example:
accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march
and from that I want to extract 1232d 74294 qw23t 23d45 76wer 12874
I know that to extract only a single 5-character sequence in square brackets I can use \[[a-z0-9 ]{5,7}\], but I don't know how to retrieve several 5-character sequences.
Right now, since all the words inside [...] consist of 5 alphanumeric chars, you can use
(?:\G(?!^),\s*|\[)(\w+)(?=[^\]\[]*])
Details:
(?:\G(?!^),\s*|\[) - either the end of the preceding successful match and a comma and zero or more whitespaces, or a [ char
(\w+) - Group 1: one or more word chars
(?=[^\]\[]*]) - followed with zero or more chars other than [ and ] and then a ].
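For readers who want to check the expected result outside Presto, here is a sketch in Python. Python's `re` module does not support `\G`, so this is a different technique from the answer: it first isolates the text inside each pair of brackets and then collects the 5-character tokens from it. The names `BRACKETED`, `TOKEN`, and `extract_tokens` are assumptions made for illustration.

```python
import re

# Step 1: grab whatever sits between a [ and the next ].
BRACKETED = re.compile(r"\[([^\[\]]*)\]")
# Step 2: inside that text, pick standalone runs of exactly 5 lowercase letters or digits.
TOKEN = re.compile(r"\b[a-z0-9]{5}\b")


def extract_tokens(text):
    """Return every 5-character token found inside square brackets."""
    tokens = []
    for inner in BRACKETED.findall(text):
        tokens.extend(TOKEN.findall(inner))
    return tokens


example = "accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march"
print(" ".join(extract_tokens(example)))
```

Note that "march" is also 5 characters long but is skipped because it sits outside the brackets, which is the same job the lookahead does in the single-pattern version.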

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM
Try this regex:
((0000) +)+(0000)
And if it's OK to match any digits, with more or fewer than 4 of them:
(\d+ +)+\d+
Good Luck!
You may use
\b(\d+)(?:\s+\1)+\b
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
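The difference between the question's pattern and the one above can be reproduced in Python, whose `re` module supports the same backreference syntax. This is only a sketch of the matching behavior, not of HANA's REPLACE_REGEXPR; the helper `collapse` is a hypothetical name.

```python
import random
import re

# Pattern from the question: matches exactly one pair, so a third copy is left over.
NAIVE = re.compile(r"(\w+)\s+\1")
# Pattern from the answer: one value followed by one or more repeats of itself.
REPEATED = re.compile(r"\b(\d+)(?:\s+\1)+\b")


def collapse(pattern, text, replacement):
    """Replace every match of pattern in text with the replacement string."""
    return pattern.sub(replacement, text)


text = "0000 0000 0000"
replacement = str(random.randint(0, 999))
print(collapse(NAIVE, text, replacement))     # replacement followed by a leftover 0000
print(collapse(REPEATED, text, replacement))  # replacement only
```

The first pattern consumes "0000 0000" and then finds nothing to pair the last "0000" with, which matches the "RANDOM 0000" output reported in the question.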

Teradata regular expressions, look behind

I have a field, Simplified_Description and I'm looking for patterns in it. Specifically, I'm looking for a pattern like 6 X 8 or 6X8 or 600X800. I want to pull out the first and second numbers into new fields. I've been able to get the first number (with much help) using a look-ahead.
REGEXP_substr(Simplified_Description, '[0-9]+(?= {0,1}[X] {0,1}[0-9]+)') AS FirstNum,
When I try to get the second number by changing the look-ahead to a look-behind (by simply adding in a "<"),
REGEXP_substr(Simplified_Description, '[0-9]+(?<= {0,1}[X] {0,1}[0-9]+)') AS SecondNum
I now get an error
SELECT Failed. [9134] The pattern specified is not a valid pattern.
I am a complete newb on regular expressions, especially on look-ahead and look-behind, so it's possible I have some extremely simple error, but I can't figure it out as what I'm doing appears to be the correct syntax.
You may use the following regex to extract the first number:
REGEXP_substr(Simplified_Description, '\d+(?=\s*X\s*\d)') AS FirstNum
and this regex for the second number:
REGEXP_substr(Simplified_Description, '\d+\s*X\s*\K\d+') AS SecondNum
Pattern 1 details
\d+ - 1 or more digits that are followed with...
(?=\s*X\s*\d) - a sequence of patterns:
\s* - 0+ whitespaces
X - an X char
\s* - 0+ whitespaces
\d - a digit.
Pattern 2 details
\d+ - 1 or more digits
\s*X\s* - an X char enclosed with any 0+ whitespace chars
\K - a match reset operator that omits (removes) the text matched so far from the match value
\d+ - 1 or more digits.
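To check both numbers side by side outside Teradata, here is a sketch in Python. Python's `re` module does not support `\K`, so this swaps in capture groups: one pattern matches the whole "number X number" shape and each group returns one side. The function name `split_dimensions` and the sample descriptions are assumptions for illustration.

```python
import re

# One pattern for the whole shape; group 1 is the first number, group 2 the second.
DIMENSIONS = re.compile(r"(\d+)\s*X\s*(\d+)")


def split_dimensions(description):
    """Return (first, second) as ints, or None when no 'N X M' shape is present."""
    match = DIMENSIONS.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical descriptions built around the shapes named in the question.
for description in ("TILE 6 X 8 BLUE", "6X8", "600X800", "NO SIZE GIVEN"):
    print(description, "->", split_dimensions(description))
```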