PostgreSQL regular expression capture group in select - sql

How can the matched regular expression be returned from an SQL select? I tried using REGEXP_EXTRACT with no luck (function not available). What I've done that does work is this:
SELECT column ~ '^stuff.*$'
FROM table;
but this gives me a list of true / false. I want to know what is extracted in each case.

If you're trying to capture the regex match that resulted from the expression, then substring would do the trick:
select substring ('I have a dog', 'd[aeiou]g')
Would return any match, in this case "dog."
I think the missing link of what you were trying above was that you need to put the expression you want to capture in parentheses. regexp_matches would work in this case (had you included parentheses around the expression you wanted to capture), but would return an array of text with each match. If it's one match, substring is sort of convenient.
So, circling back to your example, if you're trying to return stuff if and only if it's at the beginning of a column:
select substring (column, '^(stuff)')
or
select (regexp_matches (column, '^(stuff)'))[1]

Use regexp_matches.
SELECT regexp_matches(column,'^stuff.*$')
FROM table
The regexp_matches function returns a text array of all of the captured substrings resulting from matching a POSIX regular expression pattern. It has the syntax regexp_matches(string, pattern [, flags ]). The function can return no rows, one row, or multiple rows (see the g flag below). If the pattern does not match, the function returns no rows. If the pattern contains no parenthesized subexpressions, then each row returned is a single-element text array containing the substring matching the whole pattern. If the pattern contains parenthesized subexpressions, the function returns a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern (not counting "non-capturing" parentheses; see below for details). The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag g causes the function to find each match in the string, not only the first one, and return a row for each such match.

I'm using Amazon Redshift which uses PostgreSQL 8.0.2 (I should have mentioned this in the question). For me what worked was REGEXP_SUBSTR
e.g.
SELECT REGEXP_SUBSTR(column,'^stuff.*$')
FROM table

Related

Postgres Regular Expression Positive Lookbehind with Repetition

Working for the first time with Postgres flavor regular expressions and stumped on it's behavior here.
Input: 'xxxaaabc'
Expected: 'xxxaaazzzbc'
Logic: Match M a's, capture group, and add 'zzz' directly after the sequence
Attempts:
select regexp_replace('aaabc','(?!\Sa+)','zzz','i')
Result: zzzxxxaaabc
select regexp_replace('aaabc','(?!\Sa+)','zzz','i')
Result: xxxazzzaabc
select select regexp_replace('xxxaaabc','(?!=a+)','zzz','i')
Result: zzzxxxaaabc
This one seems to get pretty close regexp_replace('aaabc','(?!\Sa+)','zzz','i'), but the repetition of a's isn't working.
Note the (?!\Sa+) pattern matches an empty location that is not immediately followed with a single non-whitespace char and then one or more a chars.
The (?!=a+) pattern matches a location in a string, that is not immediately followed with = and then one or more as.
You can capture one or more as and then use a backreference to the match and then just append the zzz:
select regexp_replace('xxxaaabc','(a+)','\1zzz','i') AS Result;
See the DB fiddle.

PostgreSQL - find matching line in char/string column?

How can I find matching line in char/string type column?
For example let say I have column called text and some row has content of:
12345\nabcdf\nXKJKJ
(where \n are real new lines)
Now I want to find related row if any of lines match. For example, I have value 12345,
then it should find match. But if I have value 123, It would not.
I tried using like but it finds in both cases, when I have matching value (like 12345) and partially matching value (like 123).
For example something like this, but to have boundary for checking whole line:
SELECT id
FROM my_table
WHERE text like [SOME_VALUE]
Update
Maybe its not yet clear what Im asking. But basically I want something equivalent what you can do with regular expression,
like this: https://regexr.com/5akj1
Here regular expression /^123$/m would not match my string, it would only match if it would have been with pattern /^12345$/m (when I use pattern, value is dynamic, so pattern would change depending what value I got).
You may use regexp_replace and then check that the replaced string is not equal to the original column value:
select count(*)
from dummy
where regexp_replace(mytext, '(?m)^1234$', '') <> mytext;
You have a demo here.
Bear in mind that I have used the (?m) modifier, which makes ^ and $ match begin and end of line instead of begin and end of string.
You should be able to use ~ for matching:
where mytext ~ '(\n|^)1234(\n|$)'

How can I extract a substring from a character column without using SUBSTR()?

I have a questions regarding below data.
You clearly can see each EMP_IDENTIFIER has connected with EMP_ID.
So I need to pull only identifier which is 10 characters that will insert another column.
How would I do that?
I did some traditional way, using INSTR, SUBSTR.
I just want to know is there any other way to do it but not using INSTR, SUBSTR.
EMP_ID(VARCHAR2)EMP_IDENTIFIER(VARCHAR2)
62049 62049-2162400111
6394 6394-1368000222
64473 64473-1814702333
61598 61598-0876000444
57452 57452-0336503555
5842 5842-0000070666
75778 75778-0955501777
76021 76021-0546004888
76274 76274-0000454999
73910 73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the second argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function), and when the match is found, return the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.

How to retrieve specific character positions within rows of database column using REGEX in Oracle SQL?

What Oracle SQL query could return the second, third and fourth positions of characters contained within rows of a specific column using the REGEXP_SUBSTR method instead of using SUBSTR method like my example provided below?
SELECT SUBSTR(city,2,3) AS "2nd, 3rd, 4th"
FROM student.zipcode;`
One way that works for me (with test data) is:
SELECT REGEXP_SUBSTR(city, '\S{3}', 2) AS partial FROM student.zipcode;
Note that this is set to find three non-whitespace characters beginning at the second position of the string.
You could also use:
SELECT REGEXP_SUBSTR(city, '.{3}', 2) AS partial FROM student.zipcode;
which will instead match any three characters in the 2nd to 4th position.
However, I'm not sure what advantage this has over simply:
SELECT SUBSTR(city,2,3) AS partial FROM student.zipcode;
The REGEXP_INSTR function is not what you want, as it returns an index (position number) for the search item in the searched string. You can read about it here: http://www.techonthenet.com/oracle/functions/regexp_instr.php

How to extract group from regular expression in Oracle?

I got this query and want to extract the value between the brackets.
select de_desc, regexp_substr(de_desc, '\[(.+)\]', 1)
from DATABASE
where col_name like '[%]';
It however gives me the value with the brackets such as "[TEST]". I just want "TEST". How do I modify the query to get it?
The third parameter of the REGEXP_SUBSTR function indicates the position in the target string (de_desc in your example) where you want to start searching. Assuming a match is found in the given portion of the string, it doesn't affect what is returned.
In Oracle 11g, there is a sixth parameter to the function, that I think is what you are trying to use, which indicates the capture group that you want returned. An example of proper use would be:
SELECT regexp_substr('abc[def]ghi', '\[(.+)\]', 1,1,NULL,1) from dual;
Where the last parameter 1 indicate the number of the capture group you want returned. Here is a link to the documentation that describes the parameter.
10g does not appear to have this option, but in your case you can achieve the same result with:
select substr( match, 2, length(match)-2 ) from (
SELECT regexp_substr('abc[def]ghi', '\[(.+)\]') match FROM dual
);
since you know that a match will have exactly one excess character at the beginning and end. (Alternatively, you could use RTRIM and LTRIM to remove brackets from both ends of the result.)
You need to do a replace and use a regex pattern that matches the whole string.
select regexp_replace(de_desc, '.*\[(.+)\].*', '\1') from DATABASE;