Extract a number from comma separated string using regular expressions in oracle sql - sql

I am trying to fetch a number which starts with 628 in a comma separated string.
Below is what I am using:
SELECT
REGEXP_REPLACE(REGEXP_SUBSTR('62810,5152,,', ',?628[[:alnum:]]+,?'),',','') first,
REGEXP_REPLACE(REGEXP_SUBSTR('5152,62810,,', ',?628[[:alnum:]]+,?'),',','') second,
REGEXP_REPLACE(REGEXP_SUBSTR('5152,562810,,', ',?628[[:alnum:]]+,?'),',','') third,
REGEXP_REPLACE(REGEXP_SUBSTR(',5152,,62810', ',?(628[[:alnum:]]+),?'),',','') fourth
FROM DUAL;
Its working but in one case it fails which is the third column where number is 562810. Actually I am expecting NULL in the third column.
Actual output from above query is:
"FIRST","SECOND","THIRD","FOURTH"
"62810","62810","62810","62810"

Not sure why you are using [[:alnum::]]. You could use matching group to extract the number starting with 628 or followed by a comma. REPLACE may be avoided this way
If you have alphabets as well, modify the 2nd match group () accordingly.
SELECT
REGEXP_SUBSTR('62810,5152,,' , '(^|,)(628\d*)',1,1,NULL,2) first,
REGEXP_SUBSTR('5152,62810,,' , '(^|,)(628\d*)',1,1,NULL,2) second,
REGEXP_SUBSTR('5152,562810,,', '(^|,)(628\d*)',1,1,NULL,2) third,
REGEXP_SUBSTR(',5152,,62810' , '(^|,)(628\d*)',1,1,NULL,2) fourth
FROM DUAL;
Demo

The problem with your regex logic is that you are searching for an optional comma before the numbers 628. This means that any number having 628 anywhere would match. Instead, you can phrase this by looking for 628 which is either preceded by either a comma, or the start of the string.
SELECT
REGEXP_REPLACE(REGEXP_SUBSTR('62810,5152,,', '(,|^)628[[:alnum:]]+,?'),',','') first,
REGEXP_REPLACE(REGEXP_SUBSTR('5152,62810,,', '(,|^)628[[:alnum:]]+,?'),',','') second,
REGEXP_REPLACE(REGEXP_SUBSTR('5152,562810,,', '(,|^)628[[:alnum:]]+,?'),',','') third,
REGEXP_REPLACE(REGEXP_SUBSTR(',5152,,62810', '(,|^)(628[[:alnum:]]+),?'),',','') fourth
FROM DUAL
Demo
The ideal pattern we'd like to use here is \b628.*, or something along these lines. But Oracle's regex functions do not appear to support word boundaries, hence we can use (^|,)628.* as an alternative.

Related

Using regexp_like in Oracle to match on multiple string conditions using a range of values

I have a field in my Oracle DB that contains codes and from this I need to pull multiple values using a range of values.
As an example I need to pull all codes in the range C00.0 - C39.9 i.e. begins with C, the second character can be 0-3, third character is 0-9, followed by a "." and then the last digit is 0-9 e.g.
CODES
-----
C00.0
C10.4
C15.8
C39.8
The example above is for one pattern, I have multiple patterns to match on, here is another example
C50.011-C69.92
Again, starts with C, second character is 5-6, third is 0-9, fourth is ".", fifth is 0-9, sixth is 1-2 etc.
I have tried the following but my pipe function doesn't appear to pick up the second condition and therefore I am only getting results for the first condition '^[C][0-3][0-9][.][0-9]':
SELECT DISTINCT CODES
FROM
TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]|
^[C][4][0-3][.][0-9]|
^[C][4][A][.][0-9]|
^[C][4][4-9][.][0-9]|
^[C][4][9][.][A][0-9]|
^[C][5-6][0-9][.][0-9][1-9]|
^[C][7][0-5][.][0-9]|
^[C][7][A-B][.][0-8]')
ORDER BY CODES
I would be very grateful if anyone could make a suggestion on how I can pull the additional patterns.
You have newlines in the pattern -- in other words, your attempt at readability is causing the problem. You can just remove them, although I would probably factor out common elements:
WHERE REGEXP_LIKE (CODES, '^[C]([0-3][0-9][.][0-9]|[4][0-3][.][0-9]|[4][A][.][0-9]|[4][4-9][.][0-9]|[4][9][.][A][0-9]|[5-6][0-9][.][0-9][1-9]|[7][0-5][.][0-9]|[7][A-B][.][0-8])')
I think you also want $ at the end.
If you want readability, you could use or:
SELECT DISTINCT CODES
FROM TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]') OR
REGEXP_LIKE (CODES, '^[C][4][0-3][.][0-9]|') OR
. . .
Here is a regex pattern for what you want to match here:
^C[0-3][0-9][.][0-9]$
Demo
This would match the range of C00.0 - C39.9. If you want to match other ranges, then you would need an alternation with another pattern to cover those ranges.
Applying this to your current query:
SELECT DISTINCT CODES
FROM yourTable
WHERE REGEXP_LIKE (CODES, '^C[0-3][0-9][.][0-9]$');

How can I remove characters in a string after a specific special character (~) in snowflake sql?

I am using Snowflake SQL. I would like to remove characters from a string after a special character ~. How can I do that?
here is the whole scenario. Let me explain. I do have a string like 'CK#123456~fndkjfgdjkg'. Now, i want only the number after #.And not anything after ~. This is number length varies for that field value. It might be 1 or 5 or 3. And i want to add the condition in where class where this number is equal to check_num from other table after joining. I am trying REGEXP_SUBSTR(A.SRC_TXT, '(?<=CK#)(.+?\b)') = C.CHK_NUM in the where condition. I am getting the error as 'No repititive argument after ?'
You can use a regex for this
-- To remove just the character after a ~
select regexp_replace('fo~o bar','~.', '');
-- returns 'fo bar'
--If you want to keep the ~
select regexp_replace('fo~o bar','~.', '~');
-- returns 'fo~ bar'
--If you want to remove everything after the ~
select regexp_replace('fo~o bar','~.*', '');
--returns 'fo'
If you need to remove other specific character sets after a ~, you can probably do this with a slightly more complicated regex, but I'd need examples of your desired input/output to help with that.
EDIT for updated question
This regex replace should get what you need.
select regexp_replace('CK#123456~fndkjfgdjkg','CK#(\\d*)~.*', '\\1');
-- returns 123456
(\\d*) gets ANY number of digits in a row, and the \\1 causes it to replace the match with what was in the first set of parenthesis, which is your list of digits. the CK# and ~.* are there to make sure the whole string gets matched and replaced.
If the CK# can vary as well, you can use .*? like this.
select regexp_replace('ABCD123HI#123456~fndkjfgdjkg','.*?#(\\d*)~.*', '\\1')
-- returns 123456
I'd probably do something like the following, easy enough but not as cool as RegEx type of functions.
set my_string='fooo~12345';
set search_for_me = '~';
SELECT SUBSTR($my_string, 1, DECODE(position($search_for_me, $my_string), 0, length($my_string), position($search_for_me, $my_string)));
I hope this helps...Rich
It looks like lookahead and lookbehinds are not supported in REGEXP functions, they seem to work in the PATTERN clause of a LIST command. Snowflake documentation makes no mention either way of lookahead or lookbehinds.
In your example:
It seems that the query engine is looking for that repeating argument, where you are attempting a lookbehind
You have not specified what you wanted extracted. You have two capture groups, but in this scenario everything would be returned
Since you are looking to remove everything after ~ you have a delimiter, why not use it in your REGEXP_SUBSTR function?
Try the following:
SELECT $1,REGEXP_SUBSTR($1,'\\w+#(.+?)~',1,1,'is',1)
FROM VALUES
('CK#123456~fndkjfgdjkg')
,('QH#128fklj924~fndkjfgdjkg')
;
This looks for:
One or more word characters
Followed by #
Capturing one or more characters upto and not including ~
Returns the characters within the capture group
You can change the .+? to \\d+? to make sure the pattern is only digits. Backslashes must be escaped with a backslash.
The descriptions for each argument of the function can be found here:
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_substr.html
You could check this!!
select substr('CK#123456~fndkjfgdjkg',4,6) from dual;
OUTPUT
123456
https://docs.snowflake.net/manuals/sql-reference/functions/substr.html

Regular expression - capture number between underscores within a sequence between commas

I have a field in a database table in the format:
111_2222_33333,222_444_3,aaa_bbb_ccc
This is format is uniform to the entire field. Three underscore separated numeric values, a comma, three more underscore separated numeric values, another comma and then three underscore separated text values. No spaces in between
I want to extract the middle value from the second numeric sequence, in the example above I want to get 444
In a SQL query I inherited, the regex used is ^.,(\d+)_.$ but this doesn't seem to do anything.
I've tried to identify the first comma, first number after and the following underscore ,222_ to use as a starting point and from there get the next number without the _ after it
This (,\d*_)(\d+[^_]) selects ,222_444 and is the closest I've gotten
We can try using REGEXP_REPLACE with a capture group:
SELECT
REGEXP_REPLACE(
'111_2222_33333,222_444_3,aaa_bbb_ccc',
'^[^,]+,[^_]+_(.*?)_[^_]+,.*$',
'\1') AS num
FROM yourTable;
Here is a demo showing that the above regex' first capture group contains the quantity you want.
Demo

How can I extract a substring from a character column without using SUBSTR()?

I have a questions regarding below data.
You clearly can see each EMP_IDENTIFIER has connected with EMP_ID.
So I need to pull only identifier which is 10 characters that will insert another column.
How would I do that?
I did some traditional way, using INSTR, SUBSTR.
I just want to know is there any other way to do it but not using INSTR, SUBSTR.
EMP_ID(VARCHAR2)EMP_IDENTIFIER(VARCHAR2)
62049 62049-2162400111
6394 6394-1368000222
64473 64473-1814702333
61598 61598-0876000444
57452 57452-0336503555
5842 5842-0000070666
75778 75778-0955501777
76021 76021-0546004888
76274 76274-0000454999
73910 73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the second argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function), and when the match is found, return the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.

How to extract group from regular expression in Oracle?

I got this query and want to extract the value between the brackets.
select de_desc, regexp_substr(de_desc, '\[(.+)\]', 1)
from DATABASE
where col_name like '[%]';
It however gives me the value with the brackets such as "[TEST]". I just want "TEST". How do I modify the query to get it?
The third parameter of the REGEXP_SUBSTR function indicates the position in the target string (de_desc in your example) where you want to start searching. Assuming a match is found in the given portion of the string, it doesn't affect what is returned.
In Oracle 11g, there is a sixth parameter to the function, that I think is what you are trying to use, which indicates the capture group that you want returned. An example of proper use would be:
SELECT regexp_substr('abc[def]ghi', '\[(.+)\]', 1,1,NULL,1) from dual;
Where the last parameter 1 indicate the number of the capture group you want returned. Here is a link to the documentation that describes the parameter.
10g does not appear to have this option, but in your case you can achieve the same result with:
select substr( match, 2, length(match)-2 ) from (
SELECT regexp_substr('abc[def]ghi', '\[(.+)\]') match FROM dual
);
since you know that a match will have exactly one excess character at the beginning and end. (Alternatively, you could use RTRIM and LTRIM to remove brackets from both ends of the result.)
You need to do a replace and use a regex pattern that matches the whole string.
select regexp_replace(de_desc, '.*\[(.+)\].*', '\1') from DATABASE;