Regex sub-group behaviour with and without space

Say the task were to append the last numbers in a product code to itself with a hyphen between the original and added numbers (purely for experimentation).
I would like to understand why including a space is necessary in the following example:
with foo ( prod )
as ( values ('MYPRODUCT 123'))
select
'dot aster space' as test_type,
'''(.* (\d+))'',''$1-$2''' as the_regex,
regexp_replace(prod,'(.* (\d+))','$1-$2')
from foo
UNION ALL
select
'dot aster no space',
'''(.*(\d+))'',''$1-$2''',
regexp_replace(prod,'(.*(\d+))','$1-$2')
from foo
Result:
TEST_TYPE            THE_REGEX                REGEXP_REPLACE
dot aster space      '(.* (\d+))','$1-$2'     MYPRODUCT 123-123
dot aster no space   '(.*(\d+))','$1-$2'      MYPRODUCT 123-3
I would have expected that, since the period matches any character, including a blank space, the two regexes would have the same result.
However, even accepting that they do not, I can't figure out why only the last 3 is captured in the second group.
Thanks.

It's a matter of greediness.
With the regex
'(.* (\d+))'
you explicitly ask for a space before the digits, so \d+ gets all three digits.
With the regex
'(.*(\d+))'
the greedy .* takes as many characters as it can while still leaving \d+ at least one digit to match. So .* matches 'MYPRODUCT 12' and \d+ matches only '3'.
Solution: make the quantifier non-greedy by adding '?'.
The regex would be
'(.*?(\d+))'
Now .*? matches as few characters as possible ('MYPRODUCT '), and \d+ takes the maximum number of digits that follow ('123').
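
The three variants can be compared side by side. This is only a sketch in Python's re module, assuming its greedy and lazy quantifiers behave like the SQL engine's here; Python writes the backreferences as \1 and \2 where the query uses $1 and $2, and the helper name append_last_number is made up for the illustration.

```python
import re

prod = "MYPRODUCT 123"

def append_last_number(pattern: str) -> str:
    # Replace each match with "group 1, hyphen, group 2".
    return re.sub(pattern, r"\1-\2", prod)

with_space = append_last_number(r"(.* (\d+))")  # the space pins where \d+ must start
greedy = append_last_number(r"(.*(\d+))")       # greedy .* leaves a single digit for \d+
lazy = append_last_number(r"(.*?(\d+))")        # lazy .*? stops just before the first digit

print(with_space)
print(greedy)
print(lazy)
```

Note that the lazy version locks onto the first run of digits it meets, so it only yields "the last numbers" when the product code contains a single run of digits.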

Related

SQL, extract everything before 5th comma

For example, my column "tags" has
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj,movie:aa,movie:mdkk"
I'm trying to return everything before the 5th comma. Below is an example of the expected result:
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj"
I've tried the code below, but it's not working.
select
NVL(SUBSTRING(tags, 1,REGEXP_INSTR(tags,',',1,5) -1),tags)
from myTable

You can use
REGEXP_REPLACE(tags, '^(([^,]*,){4}[^,]*).*', '\\1')
See the regex demo.
The REGEXP_REPLACE will find the occurrence of the following pattern:
^ - start of string
(([^,]*,){4}[^,]*) - Group 1 (\1 refers to this part of the match): four sequences of any zero or more chars other than a comma and a comma, and then zero or more chars other than a comma
.* - the rest of the string.
The \1 replacement restores Group 1 value in the resulting string.
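
To check the pattern against the sample rows, here is a small sketch in Python's re module (an assumption: the SQL engine's regex flavour treats this pattern the same way). The function name first_five_fields is invented for the example.

```python
import re

# Group 1: four "field," sequences plus a fifth field; the trailing .* swallows the rest.
FIRST_FIVE = re.compile(r"^(([^,]*,){4}[^,]*).*")

def first_five_fields(tags: str) -> str:
    # With fewer than four commas the pattern does not match,
    # so the value comes back unchanged.
    return FIRST_FIVE.sub(r"\1", tags)

rows = [
    "movie/spiderman,genre/action,movie:marvel",
    "movie/kingsman,genre/action",
    "movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj,movie:aa,movie:mdkk",
]
for row in rows:
    print(first_five_fields(row))
```

Short rows pass through untouched because a replace only changes text where the pattern matches, which is why no NVL-style fallback is needed.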

Replacing the nth whitespace by an asterisk in GBQ

REGEXP_REPLACE("My dog is funny and happy", r"(\S+ \S+ \S+)", r"*")
This is my SQL for achieving this. My output should look something like this: My dog is funny *and happy
When I try the above query, it removes the first few words. How do I work this out?

You should use a backreference:
REGEXP_REPLACE("My dog is funny and happy", r"^((?:\S+\s+){4})", r"\1*")
or refer to the whole match with \0:
REGEXP_REPLACE("My dog is funny and happy", r"^(?:\S+\s+){4}", r"\0*")
See the regex demo. Details:
^ - start of string
((?:\S+\s+){4}) - Group 1 (\1 in the replacement will refer to this group value): four occurrences of one or more non-whitespace characters followed by one or more whitespace characters.
\0 refers to the whole match value.
See the regexp_replace reference:
REGEXP_REPLACE(value, regexp, replacement)
Returns a STRING where all substrings of value that match regular
expression regexp are replaced with replacement.
You can use backslashed-escaped digits (\1 to \9) within the
replacement argument to insert text matching the corresponding
parenthesized group in the regexp pattern. Use \0 to refer to the
entire matching text.
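
Both replacements can be tried outside BigQuery. The sketch below uses Python's re module as a stand-in (an assumption about equivalent matching behaviour); the one syntax difference is that Python spells the whole-match reference \g<0>, because \0 in a Python replacement string is read as an octal escape.

```python
import re

sentence = "My dog is funny and happy"

# Capture the first four "word + whitespace" units and put them back, followed by *.
with_group = re.sub(r"^((?:\S+\s+){4})", r"\1*", sentence)

# Same idea without a capturing group: re-insert the whole match.
with_whole_match = re.sub(r"^(?:\S+\s+){4}", r"\g<0>*", sentence)

print(with_group)
print(with_whole_match)
```

The original attempt removed words because the replacement contained only the asterisk; whatever the pattern matches is discarded unless the replacement puts it back.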

Extract string between different special symbols

I have the following string in my query
.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt
It begins with a period, and I need to extract the segment between the final \ and the period of the file extension, i.e. the expected result is
ABC__123_123_123_ABC123
I am fairly new to REGEXP and couldn't find an elegant (or workable) solution in the Q&A here or elsewhere. In all queries the pattern is the same in quantity and order, but to grow my knowledge I'd prefer not to just count and cut.

You can use the REGEXP_REPLACE function, such as
REGEXP_REPLACE(col,'(.*\\)(.*)\.(.*)','\2')
to extract the piece from the last backslash up to the last dot. The leading backslashes in \\ and \. are escape characters: they make the pattern match a literal \ and a literal . instead of treating them as special characters.
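
As a quick check of how the three groups split the string, here is the same pattern in Python's re module (a sketch, assuming the greedy matching is the same as in the database). The second path is an invented example to show what happens with more than one dot.

```python
import re

PATTERN = r"(.*\\)(.*)\.(.*)"

path = r".\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt"
# Group 1 runs to the last backslash, group 2 to the last dot, group 3 is the extension.
stem = re.sub(PATTERN, r"\2", path)

# Hypothetical file name with two dots: the greedy group 2 keeps everything up to the last one.
double_ext = re.sub(PATTERN, r"\2", r"a\b\file.tar.gz")

print(stem)
print(double_ext)
```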

You just need regexp_substr and a simple regexp: ([^\]+)\.[^.]*$
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'([^\]+)\.[^.]*$',
1, -- position
1, -- occurrence
null, -- match_parameter
1 -- subexpr
) substring
from dual;
([^\]+)\.[^.]*$ means:
([^\]+) - one or more (+) characters other than a backslash ([] is a set, ^ negates it), captured as group \1 (subexpression #1)
\. - a literal dot (. is a special character meaning any character, so we need to "escape" it with \, the escape character)
[^.]* - zero or more characters other than .
$ - end of line
So this regexp means: find a substring that consists of one or more characters other than a backslash, followed by a dot, followed by zero or more characters other than a dot, all at the end of the string. The subexpr parameter = 1 tells Oracle to return the first subexpression (i.e. the first group matched in (...)).
You can find the other parameters in the doc.
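
The capture-group approach can be sketched in Python as well. This is an approximation of regexp_substr with subexpr = 1, not Oracle itself: Python's re needs the backslash doubled inside the set ([^\\]), whereas the pattern above writes it as [^\].

```python
import re

path = r".\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt"

# Group 1: the run of non-backslash characters just before the final ".extension".
match = re.search(r"([^\\]+)\.[^.]*$", path)
stem = match.group(1) if match else None

print(stem)
```

Because the pattern is anchored with $, only the last path segment can satisfy it, so no counting of separators is required.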

Here is my simple example, fully compatible with Oracle 11g R2, PCRE2 and some other regex flavours.
Oracle 11g R2 using the regexp_substr function (reference documentation)
select
regexp_substr(
'.\ABC\ABC\2021\02\24\ABC__123_123_123_ABC123.txt',
'((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}',
1,
1
) substring
from dual;
Pattern: ((\w)+(_){2}(((\d){3}(_)){3}){1}((\w)+(\d)+){1}){1}
Result: ABC__123_123_123_ABC123
This is about as simple as it can be. Regular expressions follow a minimal common standard, so the pattern is also portable, in case someone else is interested in going the simplest way.
Hopefully, this will help you out!

Regexp_Like to Validate Uppercase Characters [A-Z] and Numbers [0-9] Only

I would like a query using regexp_like within Oracle's SQL which only validates uppercase characters [A-Z] and numbers [0-9]
SELECT *
FROM dual
WHERE REGEXP_LIKE('AAAA1111', '[A-Z, 0-9]')

The select statement should probably look like
SELECT 'Yes' as MATCHING
FROM dual
WHERE REGEXP_LIKE ('AAAA1111', '^[A-Z0-9]+$')
This means that, from the start of the string (^) to the end ($), every character must be an uppercase letter or a digit. Important: no comma or space between Z and 0. The + requires at least one character.
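
The difference between the anchored pattern and the original attempt is easy to see in a small test. The sketch below uses Python's re module, with fullmatch standing in for the ^...$ anchors; the function names are invented for the illustration.

```python
import re

def is_upper_alnum(value: str) -> bool:
    # Every character must be A-Z or 0-9, and there must be at least one.
    return re.fullmatch(r"[A-Z0-9]+", value) is not None

def loose_check(value: str) -> bool:
    # The pattern from the question: unanchored, and the set also contains "," and " ".
    return re.search(r"[A-Z, 0-9]", value) is not None

for sample in ["AAAA1111", "AAAa1111", "AAAA 1111", "abc 1", ""]:
    print(repr(sample), is_upper_alnum(sample), loose_check(sample))
```

The unanchored version is true as soon as one character belongs to the set, which is why it accepts strings that are mostly lowercase.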
Edit: Based on the answer of Barbaros another way of selecting would be possible
SELECT 'Yes' as MATCHING
FROM DUAL
WHERE regexp_like('AAAA1111','^[[:digit:][:upper:]]+$')
Edit: added a DBFiddle
A quick help may be found here and for oracle regular expressions here.

You can use:
select str as "Result String"
from tab
where not regexp_like(str,'[[:lower:] ]')
and regexp_like(str,'[[:alnum:]]')
where not regexp_like with the POSIX [[:lower:] ] pattern eliminates the strings
containing a lowercase letter or a space (the space comes from the trailing blank inside [[:lower:] ]),
and regexp_like with the POSIX [[:alnum:]] pattern accepts the strings
containing at least one letter or digit.
Note that [[:alnum:]] is not anchored, so a string that also contains symbols (for example 'AAAA-1111') still passes; use the anchored pattern ^[[:digit:][:upper:]]+$ if those must be rejected too.
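
To see which strings the two conditions let through, here is a rough Python sketch. Python's re module has no POSIX classes, so the ASCII ranges [a-z ] and [A-Za-z0-9] stand in for [[:lower:] ] and [[:alnum:]]; that substitution, and the name passes_filter, are assumptions of the example.

```python
import re

def passes_filter(value: str) -> bool:
    # Mirrors: NOT regexp_like(str, '[[:lower:] ]') AND regexp_like(str, '[[:alnum:]]')
    has_lower_or_space = re.search(r"[a-z ]", value) is not None
    has_alnum = re.search(r"[A-Za-z0-9]", value) is not None
    return (not has_lower_or_space) and has_alnum

for sample in ["AAAA1111", "AAAa1111", "AAAA 1111", "AAAA-1111"]:
    print(repr(sample), passes_filter(sample))
```

The last sample passes because neither condition looks at punctuation, so this filter is looser than a single anchored character class.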

Oracle SQL - find string pattern in string

I need to extract some text from a string, but only where the text matches a string pattern. The string pattern will consist of...
2 numbers, a forward slash and 6 numbers
e.g. 12/123456
or
2 numbers, a forward slash, 6 numbers, a hyphen and 2 numbers
e.g. 12/123456-12
I know how to use INSTR to find a specific string. Is it possible to find a string that matches a specific pattern?

You'll need to use regexp_like to filter the results and regexp_substr to get the substring.
Here is roughly what it should look like:
select id, myValue, regexp_substr(myValue, '[0-9]{2}/[0-9]{6}') as myRegExMatch
from Foo
where regexp_like(myValue,'^([a-zA-Z0-9 ])*[0-9]{2}/[0-9]{6}([a-zA-Z0-9 ])*$')
Try it in a SQLFiddle and adjust it to your taste.
The regexp_like in the sample above allows for alphanumeric and space characters on either side of the number pattern. As written, both patterns cover only the first form; append (-[0-9]{2})? after [0-9]{6} to also accept the optional hyphen and two digits.
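
For experimenting with the extraction step, here is a sketch in Python's re module that plays the role of regexp_substr. It adds an optional group for the hyphenated form from the question; the names CODE and extract_code are made up for the example.

```python
import re
from typing import Optional

# Two digits, a slash, six digits, then an optional hyphen and two digits.
CODE = re.compile(r"[0-9]{2}/[0-9]{6}(?:-[0-9]{2})?")

def extract_code(text: str) -> Optional[str]:
    # Return the first substring that matches, or None when there is none.
    found = CODE.search(text)
    return found.group(0) if found else None

print(extract_code("ref 12/123456 ok"))
print(extract_code("ref 12/123456-12 ok"))
print(extract_code("no code here"))
```

Without boundaries the pattern will also pick a code out of a longer run of digits, so some surrounding condition (like the regexp_like filter) is still useful.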

Use regexp_like.
where regexp_like(col_name,'\s[0-9]{2}\/[0-9]{6}(-[0-9]{2})?\s')
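\s matches a whitespace character. It is included at the start and end of the pattern, so the code must be surrounded by whitespace; a code at the very start or end of the value is not matched.
[0-9]{2}\/[0-9]{6} matches 2 digits, a forward slash and 6 digits
(-[0-9]{2})? optionally matches a hyphen and 2 digits following the previous pattern.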
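
The effect of the two \s boundaries shows up quickly with a few sample values. This is a Python re sketch of the same pattern (assuming \s behaves alike in both engines); has_bounded_code is a name invented for the example.

```python
import re

BOUNDED = re.compile(r"\s[0-9]{2}/[0-9]{6}(-[0-9]{2})?\s")

def has_bounded_code(text: str) -> bool:
    # True only when the code has a whitespace character on both sides.
    return BOUNDED.search(text) is not None

for sample in ["x 12/123456 y", "x 12/123456-12 y", "12/123456", "x 12/123456-1 y"]:
    print(repr(sample), has_bounded_code(sample))
```

A value consisting of nothing but the code fails here, so this form suits codes embedded in longer free text.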

regexp_like(col_name,'^\d{2}/\d{6}($|-\d{2}$)')
or
regexp_like(col_name,'^\d{2}/\d{6}(-\d{2})?$')
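
These two anchored patterns validate the whole column value and should accept exactly the same strings. A small Python re sketch to compare them on a few hand-picked samples (the samples are invented for the test):

```python
import re

ALTERNATION = re.compile(r"^\d{2}/\d{6}($|-\d{2}$)")
OPTIONAL = re.compile(r"^\d{2}/\d{6}(-\d{2})?$")

samples = ["12/123456", "12/123456-12", "12/123456-1", "x12/123456", "12/1234567"]

# For each sample, record the verdict of both patterns.
results = {
    s: (ALTERNATION.search(s) is not None, OPTIONAL.search(s) is not None)
    for s in samples
}
for sample, verdicts in results.items():
    print(repr(sample), verdicts)
```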
