Regex - trying to get the 5 digit words extracted from the string (presto) - sql

I am trying to retrieve each sequence of 5 numbers / letters that are in brackets, just like in this example:
accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march
and from that I want to extract 1232d 74294 qw23t 23d45 76wer 12874
I know that to extract only a single 5 digit sequence in square brackets I can do \[[a-z0-9 ]{5,7}\] but I don't know how to retrieve several 5 digit sequences.

Right now, since all the words inside [...] consist of 5 alphanumeric chars, you can use
(?:\G(?!^),\s*|\[)(\w+)(?=[^\]\[]*])
See the regex demo.
Details:
(?:\G(?!^),\s*|\[) - either the end of the preceding successful match followed by a comma and zero or more whitespaces, or a [ char
(\w+) - Group 1: one or more word chars
(?=[^\]\[]*]) - followed with zero or more chars other than [ and ] and then a ].
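
For readers who want to try the idea outside Presto, here is a minimal sketch in Python. Python's re module does not support \G, so this swaps in a different technique: first grab the text between the brackets, then pull every 5-character alphanumeric word out of it. The function name extract_codes is made up for this example.

```python
import re

def extract_codes(text):
    """Return every 5-char alphanumeric word found inside [...] in text."""
    codes = []
    # Step 1: capture the content of each bracketed group.
    for group in re.findall(r"\[([^\]\[]*)\]", text):
        # Step 2: collect the standalone 5-character words in that group.
        codes.extend(re.findall(r"\b[a-z0-9]{5}\b", group))
    return codes

sample = "accuracy of action - [1232d, 74294, qw23t, 23d45, 76wer, 12874] march"
print(extract_codes(sample))
```

Words outside the brackets, such as "march", are never inspected because the second step only runs on the captured group.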

Related

how to use positioning/range in regexp

I have a product code where the references always follow this pattern: XX00XX000XX. Characters 1 and 2 are always a combination of 2 letters, 3 to 4 a combination of 2 numbers, 5 to 6 letters, 7 to 9 numbers and 10 to 11 letters again (they're always varying so it'll never be the same).
I want to do a regexp_contains (or another variant) that matches by position, like: position 1 - 2 must be [[:alpha:]], 3 - 4 [[:digit:]], and so on.
(I need this to find product codes that match the reference pattern inside sell links, but I can't find any clear explanation on how to use positioning on regex statements...)
You can use character classes for this.
[a-zA-Z][a-zA-Z]\d\d[a-zA-Z][a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]
This regex uses the classes [a-zA-Z] and \d, which match a letter and a digit respectively. It explicitly checks that the first character is a letter, the second character is a letter, the third character is a digit, etc.
A character class matches exactly 1 character from the set specified, so [a-zA-Z] matches any letter, [13579] matches any odd digit, etc.
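
As a rough illustration, the same pattern can be tried in Python. This is a sketch, not the SQL call itself: the sell link below is invented for the example, and COMPACT is just a shorter spelling of the answer's pattern using {n} quantifiers.

```python
import re

# Pattern copied from the answer: letter, letter, digit, digit, ...
EXPLICIT = re.compile(r"[a-zA-Z][a-zA-Z]\d\d[a-zA-Z][a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]")
# Equivalent, shorter spelling using {n} quantifiers.
COMPACT = re.compile(r"[a-zA-Z]{2}\d{2}[a-zA-Z]{2}\d{3}[a-zA-Z]{2}")

def find_product_code(link):
    """Return the first XX00XX000XX-shaped substring in link, or None."""
    match = COMPACT.search(link)
    return match.group(0) if match else None

# Hypothetical sell link, invented for this example.
link = "https://shop.example.com/sell/AB12CD345EF?ref=home"
print(find_product_code(link))
```

Because the pattern is not anchored, it also matches when the code is embedded in a longer run of letters and digits. Whether that is acceptable depends on how the links are built.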

How to extract just numeric value with REGEXP_EXTRACT in BigQuery?

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract the number of days for each value preceded by the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the string that comes after the numbers, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J
You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
See the regex demo
Details
(?:.*\D)? - an optional non-capturing group that matches 0+ chars other than line break chars, as many as possible, and then a non-digit char (this pattern is required to advance the index to the location right before the last sequence of digits, not the last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.
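
The pattern can be checked against the sample rows with a short Python sketch. This only exercises the regex in Python's re module, on the assumption that it behaves the same way here as in BigQuery for these constructs. The helper name relay_days is made up.

```python
import re

# Same pattern as in the answer above.
RELAY_DAYS = re.compile(r"(?:.*\D)?(\d+)\s*(?:J|DAY)")

def relay_days(delivery):
    """Return the captured number of days as an int, or None if no match."""
    match = RELAY_DAYS.search(delivery)
    return int(match.group(1)) if match else None

samples = [
    "RETRAIT_2H|LIVRAISON_5J|RELAIS_5J",
    "LIVRAISON_21J|RELAIS_19J",
    "RETRAIT_2H|LIVRAISON_3J|RELAIS_3J",
    "RELAIS_15 DAY",
]
for value in samples:
    print(value, "->", relay_days(value))
```

Note that the greedy prefix makes the pattern pick the last number followed by J or DAY. In the sample rows that is the RELAIS value only because RELAIS comes last.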

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM
Try this regex:
((0000) +)+(0000)
And if it's OK to match any digits, and more or fewer than 4 of them:
(\d+ +)+\d+
Good Luck!
You may use
\b(\d+)(?:\s+\1)+\b
See the regex demo
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
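
A small Python sketch shows the difference between the pattern from the question and this one. It uses re.sub rather than REPLACE_REGEXPR, and a fixed placeholder string "RANDOM" instead of a real random number, so it only illustrates the matching behaviour.

```python
import re

text = "0000 0000 0000"

# Pattern from the question: matches one word plus ONE repetition,
# so the third 0000 is left behind.
print(re.sub(r"(\w+)\s+\1", "RANDOM", text))

# Pattern from this answer: the quantified group (?:\s+\1)+ consumes
# every further repetition, so the whole run collapses to one value.
print(re.sub(r"\b(\d+)(?:\s+\1)+\b", "RANDOM", text))
```

The first call reproduces the "RANDOM 0000" output from the question, and the second gives the expected single value.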

Teradata regular expressions, look behind

I have a field, Simplified_Description and I'm looking for patterns in it. Specifically, I'm looking for a pattern like 6 X 8 or 6X8 or 600X800. I want to pull out the first and second numbers into new fields. I've been able to get the first number (with much help) using a look-ahead.
REGEXP_substr(Simplified_Description, '[0-9]+(?= {0,1}[X] {0,1}[0-9]+)') AS FirstNum,
When I try to get the second number by changing the look-ahead to a look-behind (by simply adding in a "<"),
REGEXP_substr(Simplified_Description, '[0-9]+(?<= {0,1}[X] {0,1}[0-9]+)') AS SecondNum
I now get an error
SELECT Failed. [9134] The pattern specified is not a valid pattern.
I am a complete newb on regular expressions, especially on look-ahead and look-behind, so it's possible I have some extremely simple error, but I can't figure it out as what I'm doing appears to be the correct syntax.
You may use the following regex to extract the first number:
REGEXP_substr(Simplified_Description, '\d+(?=\s*X\s*\d)') AS FirstNum
and this regex for the second number:
REGEXP_substr(Simplified_Description, '\d+\s*X\s*\K\d+') AS SecondNum
See the regex 1 and regex 2 demos.
Pattern 1 details
\d+ - 1 or more digits that are followed with...
(?=\s*X\s*\d) - a sequence of patterns:
\s* - 0+ whitespaces
X - an X char
\s* - 0+ whitespaces
\d - a digit.
Pattern 2 details
\d+ - 1 or more digits
\s*X\s* - an X char enclosed with any 0+ whitespace chars
\K - a match reset operator that omits (removes) the text matched so far from the match value
\d+ - 1 or more digits.
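
To experiment with the two extractions outside Teradata, here is a sketch in Python. Python's re module has no \K operator, so this swaps in two capturing groups, which pull both numbers out of a single match. The name split_dimensions and the last sample string are invented for the example.

```python
import re

# Two capturing groups replace the look-ahead and the \K operator.
DIMENSIONS = re.compile(r"(\d+)\s*X\s*(\d+)")

def split_dimensions(description):
    """Return (first, second) as ints, or None if there is no N X M pattern."""
    match = DIMENSIONS.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

for text in ["6 X 8", "6X8", "600X800", "TILE 600X800 GREY"]:
    print(text, "->", split_dimensions(text))
```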

regex - match exactly 10 digits with at least one symbol or space between them

I'm trying to write a query in Oracle SQL to get rows which have invalid 10 digit numbers, i.e. with other symbols in between the digits.
For example:
(111) 111-1111 #10 digit number with some symbols and spaces in between
111-111-1111
(111)111-1111
111)111-1111
(111) 11 1-1111
i.e., it should match numbers with exactly 10 digits that are non-consecutive because there are symbols among them.
So it should not match the following example:
111 #consecutive 3 digit number
11 1 #3 digit number with spaces
11-1 #3 digit number with symbol in between
1111111111 #consecutive 10 digit number
And I'm using REGEXP_LIKE, something like this
select * from table where REGEXP_LIKE(column, ?)
Any help is much appreciated. Thanks.
You could use a combination of a regex and length; the latter to exclude a pure 10-digit number without other characters:
regexp_like(col, '^[ .()-]*(\d[ .()-]*){10}$') and length(col) > 10
In the [ .()-] class you would list all the characters that you would allow as symbols among the digits. Note that - needs to be the last in that list or else be escaped.
If you would allow any non-digit to occur among the 10 digits, you can use \D:
regexp_like(col, '^\D*(\d\D*){10}$') and length(col) > 10
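
The first of these conditions can be checked against the examples from the question with a short Python sketch. It uses re.fullmatch in place of the ^ and $ anchors, and the function name is made up. It is meant only to show how the regex and the length check work together.

```python
import re

# Optional leading symbols, then exactly 10 digits, each optionally
# followed by allowed symbols. fullmatch plays the role of ^...$.
TEN_DIGITS = re.compile(r"[ .()-]*(?:\d[ .()-]*){10}")

def is_separated_ten_digits(value):
    """True if value has exactly 10 digits plus at least one allowed symbol."""
    return TEN_DIGITS.fullmatch(value) is not None and len(value) > 10

should_match = ["(111) 111-1111", "111-111-1111", "(111)111-1111",
                "111)111-1111", "(111) 11 1-1111"]
should_not_match = ["111", "11 1", "11-1", "1111111111"]

for value in should_match + should_not_match:
    print(repr(value), is_separated_ten_digits(value))
```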
So: the string should have length greater than 10, and the total number of digits must be exactly 10. This can be done without regular expressions (which should make it faster):
... where length(str) > 10 and
length(str) = 10 + length(translate(str, 'z0123456789', 'z'))
translate maps the letter z to itself and removes all the other listed characters (the digits). Having to include the z is annoying but unavoidable: translate returns NULL if any of its arguments is NULL, and in Oracle an empty string is NULL, so the third argument cannot be empty. The second condition says the length of the input str is exactly 10 more than the length of the string with all digits removed - so there are exactly 10 digits.
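
The same length arithmetic can be sketched in Python, where str.translate with a deletion table plays the role of Oracle's translate. The names below are made up for the example. Python has no empty-string-is-NULL issue, so no placeholder character like z is needed.

```python
# Translation table that deletes the ASCII digits 0-9.
_STRIP_DIGITS = str.maketrans("", "", "0123456789")

def has_ten_separated_digits(value):
    """True if value is longer than 10 chars and contains exactly 10 digits."""
    without_digits = value.translate(_STRIP_DIGITS)
    return len(value) > 10 and len(value) == 10 + len(without_digits)

for value in ["(111) 111-1111", "111-111-1111", "1111111111", "11-1"]:
    print(repr(value), has_ten_separated_digits(value))
```

Like the \D variant of the regex, this accepts any non-digit characters among the digits, not just a fixed list of symbols.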