Explanation in REGEXP_REPLACE function output [duplicate] - sql

The following two queries produce different output:
Select REGEXP_REPLACE('Milind,Milind,Gopal,Gopal,Gopal,Milind', '([^,]+)(,\1)+(,|$)', '\1\3') "OUTPUT"
FROM dual;
Output: Milind,Gopal,Milind
Select REGEXP_REPLACE('Milind,Milind,Gopal,Gopal,Gopal,Milind', '([^,]+)(,\1+)(,|$)', '\1\3') "OUTPUT"
FROM dual;
Output: Milind,Gopal,Gopal,Milind
I do not understand how each pattern produces its output.
Can anyone explain?

[^,] matches one character, that can be anything except a comma. For example: M.
[^,]+ matches 1 or more times any character except a comma. For example: Milind
([^,]+) same, but capture the result, so it can be referenced later with \1.
(,\1)+ matches a comma followed by the previous capture, one or more times, and captures it as \2 (the group keeps its last repetition). If \1 is Milind, it can match ,Milind or ,Milind,Milind or ,Milind,Milind,Milind, etc.
(,|$) matches either a comma or the end of the line, and captures it as \3.
\1\3 This is the replacement pattern: we are only keeping \1 and \3, so everything matched in capture \2 is effectively removed.
The second statement differs slightly, and the difference introduces a bug:
(,\1+) matches a comma followed by one or more back-to-back occurrences of the previous capture, as in ,Milind or ,MilindMilind or ,MilindMilindMilind. Because the comma is outside the repetition, each match can remove only one comma-separated duplicate, so a run of three (Gopal,Gopal,Gopal) still leaves two.
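The difference between the two patterns can be checked outside the database. The sketch below uses Python's re module as a stand-in for REGEXP_REPLACE, on the assumption that groups, backreferences and greedy quantifiers behave the same way for these patterns; the helper name dedupe is only illustrative.

```python
import re

SOURCE = "Milind,Milind,Gopal,Gopal,Gopal,Milind"


def dedupe(pattern: str, text: str = SOURCE) -> str:
    """Replace every match with group 1 followed by group 3."""
    return re.sub(pattern, r"\1\3", text)


# (,\1)+ repeats the whole ",name" unit, so a full run of duplicates is consumed.
first = dedupe(r"([^,]+)(,\1)+(,|$)")

# (,\1+) allows a single comma, so each match swallows only one duplicate.
second = dedupe(r"([^,]+)(,\1+)(,|$)")

print(first)   # Milind,Gopal,Milind
print(second)  # Milind,Gopal,Gopal,Milind
```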

Related

Using regexp_like in Oracle to match on multiple string conditions using a range of values

I have a field in my Oracle DB that contains codes and from this I need to pull multiple values using a range of values.
As an example I need to pull all codes in the range C00.0 - C39.9 i.e. begins with C, the second character can be 0-3, third character is 0-9, followed by a "." and then the last digit is 0-9 e.g.
CODES
-----
C00.0
C10.4
C15.8
C39.8
The example above is for one pattern, I have multiple patterns to match on, here is another example
C50.011-C69.92
Again, starts with C, second character is 5-6, third is 0-9, fourth is ".", fifth is 0-9, sixth is 1-2 etc.
I have tried the following but my pipe function doesn't appear to pick up the second condition and therefore I am only getting results for the first condition '^[C][0-3][0-9][.][0-9]':
SELECT DISTINCT CODES
FROM
TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]|
^[C][4][0-3][.][0-9]|
^[C][4][A][.][0-9]|
^[C][4][4-9][.][0-9]|
^[C][4][9][.][A][0-9]|
^[C][5-6][0-9][.][0-9][1-9]|
^[C][7][0-5][.][0-9]|
^[C][7][A-B][.][0-8]')
ORDER BY CODES
I would be very grateful if anyone could make a suggestion on how I can pull the additional patterns.
You have newlines in the pattern -- in other words, your attempt at readability is causing the problem. You can just remove them, although I would probably factor out common elements:
WHERE REGEXP_LIKE (CODES, '^[C]([0-3][0-9][.][0-9]|[4][0-3][.][0-9]|[4][A][.][0-9]|[4][4-9][.][0-9]|[4][9][.][A][0-9]|[5-6][0-9][.][0-9][1-9]|[7][0-5][.][0-9]|[7][A-B][.][0-8])')
I think you also want $ at the end.
If you want readability, you could use or:
SELECT DISTINCT CODES
FROM TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]') OR
REGEXP_LIKE (CODES, '^[C][4][0-3][.][0-9]') OR
. . .
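The effect of the embedded newlines, and one way to keep the pattern readable without them, can be sketched in Python. This uses the re module as a stand-in for REGEXP_LIKE; the names ALTERNATIVES, CODE_PATTERN, BROKEN_PATTERN and is_wanted are illustrative and not part of the original query.

```python
import re

# Alternatives copied from the question, with the shared "^C" prefix factored out.
ALTERNATIVES = [
    "[0-3][0-9][.][0-9]",
    "4[0-3][.][0-9]",
    "4A[.][0-9]",
    "4[4-9][.][0-9]",
    "49[.]A[0-9]",
    "[5-6][0-9][.][0-9][1-9]",
    "7[0-5][.][0-9]",
    "7[A-B][.][0-8]",
]

# Joining the pieces in code keeps the source readable
# without putting any newline into the pattern itself.
CODE_PATTERN = re.compile("^C(" + "|".join(ALTERNATIVES) + ")")

# Layout from the question: the newline after the pipe becomes part of the
# second alternative, which can then never match a code without a newline.
BROKEN_PATTERN = re.compile("^C[0-3][0-9][.][0-9]|\n^C4[0-3][.][0-9]")


def is_wanted(code: str) -> bool:
    """Return True when the code falls in one of the listed ranges."""
    return CODE_PATTERN.search(code) is not None


print(is_wanted("C41.2"))                      # True
print(BROKEN_PATTERN.search("C41.2") is None)  # True: second alternative is lost
```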
Here is a regex pattern for what you want to match here:
^C[0-3][0-9][.][0-9]$
This would match the range of C00.0 - C39.9. If you want to match other ranges, then you would need an alternation with another pattern to cover those ranges.
Applying this to your current query:
SELECT DISTINCT CODES
FROM yourTable
WHERE REGEXP_LIKE (CODES, '^C[0-3][0-9][.][0-9]$');
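What the trailing $ changes can be seen in a small Python sketch (re stands in for REGEXP_LIKE here). The values C10.45 and C50.011 are made-up test inputs, not rows from the question.

```python
import re

UNANCHORED = re.compile(r"^C[0-3][0-9][.][0-9]")
ANCHORED = re.compile(r"^C[0-3][0-9][.][0-9]$")

codes = ["C00.0", "C39.8", "C10.45", "C50.011"]

# Without $, any code that merely starts with the pattern is accepted.
unanchored_hits = [c for c in codes if UNANCHORED.search(c)]

# With $, the code must end right after the final digit.
anchored_hits = [c for c in codes if ANCHORED.search(c)]

print(unanchored_hits)  # ['C00.0', 'C39.8', 'C10.45']
print(anchored_hits)    # ['C00.0', 'C39.8']
```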

Extract word by Regex postgreSQL [duplicate]

A column contains codes like 'a357' and 'b123' mixed in with many other words, and I want to extract them.
This works:
select *, regexp_matches(col1,'a\d{3}') from table
But I also want the 'b123' codes, so I wrote this, which does not work:
select *, regexp_matches(col1,'(a|b)\d{3}') from table
where (a|b) is a regex alternation. Please show me a solution, or any other way that does not use regexp_matches, because I need to trim the '{}' signs afterwards.
The issue is the subexpression in parentheses. As the documentation explains:
If a match is found, and the pattern contains parenthesized
subexpressions, then the result is a text array whose n'th element is
the substring matching the n'th parenthesized subexpression of the
pattern.
In your case, this is easily fixed by using a character class:
regexp_matches(col1, '[ab]\d{3}')
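Python's re.findall happens to follow a similar rule to the one quoted from the documentation, so it makes a convenient way to see the effect. This is only an analogy run outside PostgreSQL, and the sample text is invented.

```python
import re

text = "foo a357 bar b123 baz"

# With a parenthesized subexpression, only the group is reported.
with_group = re.findall(r"(a|b)\d{3}", text)

# A character class involves no group, so the whole match is reported.
with_class = re.findall(r"[ab]\d{3}", text)

print(with_group)  # ['a', 'b']
print(with_class)  # ['a357', 'b123']
```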

Regular expression - capture number between underscores within a sequence between commas

I have a field in a database table in the format:
111_2222_33333,222_444_3,aaa_bbb_ccc
This format is uniform across the entire field: three underscore-separated numeric values, a comma, three more underscore-separated numeric values, another comma, and then three underscore-separated text values, with no spaces in between.
I want to extract the middle value from the second numeric sequence. In the example above, I want to get 444.
In a SQL query I inherited, the regex used is ^.,(\d+)_.$ but this doesn't seem to do anything.
I've tried to identify the first comma, first number after and the following underscore ,222_ to use as a starting point and from there get the next number without the _ after it
This (,\d*_)(\d+[^_]) selects ,222_444 and is the closest I've gotten
We can try using REGEXP_REPLACE with a capture group:
SELECT
REGEXP_REPLACE(
'111_2222_33333,222_444_3,aaa_bbb_ccc',
'^[^,]+,[^_]+_(.*?)_[^_]+,.*$',
'\1') AS num
FROM yourTable;
In the regex above, the first capture group contains the value you want.
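The same pattern can be tried quickly in Python, whose re module handles the lazy .*? and the \1 replacement the same way for this input. It is a sketch for checking the regex, not a replacement for the SQL.

```python
import re

value = "111_2222_33333,222_444_3,aaa_bbb_ccc"

# Skip the first comma-separated block, skip the first number of the second
# block, then capture lazily up to the next underscore.
pattern = r"^[^,]+,[^_]+_(.*?)_[^_]+,.*$"

# The pattern spans the whole string, so the result is just group 1.
middle = re.sub(pattern, r"\1", value)

print(middle)  # 444
```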

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
You can clearly see that each EMP_IDENTIFIER is connected with its EMP_ID.
I need to pull only the 10-character identifier part so it can be inserted into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is another way that does not use INSTR or SUBSTR.
EMP_ID (VARCHAR2)  EMP_IDENTIFIER (VARCHAR2)
62049              62049-2162400111
6394               6394-1368000222
64473              64473-1814702333
61598              61598-0876000444
57452              57452-0336503555
5842               5842-0000070666
75778              75778-0955501777
76021              76021-0546004888
76274              76274-0000454999
73910              73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
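For a quick check of the pattern against a few of the sample rows, here is a Python sketch. re.search with group(1) plays the role of the final subexpression argument of REGEXP_SUBSTR, and second_part is a hypothetical helper name.

```python
import re
from typing import Optional

identifiers = ["62049-2162400111", "6394-1368000222", "5842-0000070666"]


def second_part(identifier: str) -> Optional[str]:
    """Return the 10 characters that follow the first dash, if present."""
    match = re.search(r"-(.{10})", identifier)
    return match.group(1) if match else None


parts = [second_part(i) for i in identifiers]
print(parts)  # ['2162400111', '1368000222', '0000070666']
```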
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits (\d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument of the function is 1); it looks for the second occurrence (the fourth argument, 2); it doesn't apply any special matching such as ignoring case (the argument null); and when the match is found, it returns the fragment of the match included in the second set of parentheses (the last argument, 2). The second fragment is the \d+, the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.
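When trying this pattern outside the database, note how delimiters are handled. In Python's re module, the dash consumed at the end of one match is not available to start the next one, so two numeric groups that sit side by side are not both found; it is worth verifying on your own data whether the database behaves the same way. The lookaround variant below is an assumption added for illustration, not part of the answer, and lookarounds may not be available in every SQL regex engine.

```python
import re

sample = "AS23302-ATX-20032-33900293-CWV20-3499-RA"

# Pattern from the answer: the delimiters are consumed as part of each match.
consuming = [m.group(2) for m in re.finditer(r"(-|^)(\d+)(-|$)", sample)]

# Variant (an assumption): lookarounds test the delimiters without consuming
# them, so neighbouring numeric groups are each found.
lookaround = re.findall(r"(?:^|(?<=-))\d+(?=-|$)", sample)

print(consuming)      # ['20032', '3499']
print(lookaround)     # ['20032', '33900293', '3499']
print(lookaround[1])  # 33900293
```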

SQL - need help in parsing text of a field

I have a select query that fetches a field with complex data. I need to parse that data into a specified format. Please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it in plain language. Based on what we know from your post (and as others have said, more info is really needed), some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
SQL> with tbl(str) as (
2 select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
3 )
4 select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
.* followed by zero or more of any character (* repeats the preceding element, here the dot)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group 1
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 The second remembered group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
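The walkthrough above can be replayed in Python, where re.sub stands in for regexp_replace. The second call uses an invented input with a single equal sign to show what happens when the "exactly two equal signs" assumption does not hold: nothing matches and the string comes back unchanged.

```python
import re

PATTERN = r"^.*=([^ ]+).*=([^ ]+)$"

raw = "complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I"

# Keep the two remembered groups, separated by a pipe.
result = re.sub(PATTERN, r"\1|\2", raw)
print(result)  # PB|I

# Only one "=": the pattern cannot match, so the input is returned as-is.
unchanged = re.sub(PATTERN, r"\1|\2", "ChannelCode=PB")
print(unchanged)  # ChannelCode=PB
```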
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
SQL> with tbl(str) as (
2 select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
3 )
4 select regexp_substr(str, '=([^ |]*)( |\||$)', 1, level, null, 1) output, level
5 from tbl
6 connect by level <= regexp_count(str, '=')
7 ORDER BY level;
OUTPUT                    LEVEL
-------------------- ----------
zz                            1
PB                            2
I                             3
                              4
test2                         5
SQL>
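A rough Python equivalent of the query above, for experimenting with the pattern: re.findall returns the captured value for every equal sign, which replaces the connect by level loop. The delimiter set (space, pipe, or end of line) is written with a non-capturing group, a Python convenience that is assumed here and is not part of the original query.

```python
import re

raw = (
    "a=zz|complexType|ChannelCode=PB - Phone In A Box|"
    "IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing"
)

# Capture whatever follows "=" up to a space, a pipe, or the end of the line.
# [^ |]* allows zero characters, so an empty element yields an empty string.
values = re.findall(r"=([^ |]*)(?: |\||$)", raw)

print(values)  # ['zz', 'PB', 'I', '', 'test2']
```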