How to retrieve specific character positions within rows of a database column using regex in Oracle SQL? - sql

What Oracle SQL query returns the characters in the second, third, and fourth positions of each value in a specific column, using REGEXP_SUBSTR instead of SUBSTR as in my example below?
SELECT SUBSTR(city,2,3) AS "2nd, 3rd, 4th"
FROM student.zipcode;

One way that works for me (with test data) is:
SELECT REGEXP_SUBSTR(city, '\S{3}', 2) AS partial FROM student.zipcode;
Note that this finds the first run of three consecutive non-whitespace characters at or after the second position of the string, so it can differ from SUBSTR when the value contains whitespace.
You could also use:
SELECT REGEXP_SUBSTR(city, '.{3}', 2) AS partial FROM student.zipcode;
which instead matches any three characters starting at the second position, i.e. positions 2 through 4.
However, I'm not sure what advantage this has over simply:
SELECT SUBSTR(city,2,3) AS partial FROM student.zipcode;
The REGEXP_INSTR function is not what you want, as it returns an index (position number) for the search item in the searched string. You can read about it here: http://www.techonthenet.com/oracle/functions/regexp_instr.php
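To make the difference between the three queries above concrete, here is a minimal sketch in Python. The `re` module is only a stand-in for Oracle's regex engine, and the helper names `substr` and `regexp_substr` as well as the sample city values are hypothetical; they mimic the 1-based start position described above, not Oracle's full behavior.

```python
import re

def substr(s, start, length):
    # Rough stand-in for SUBSTR(s, start, length) with a positive, 1-based start.
    return s[start - 1:start - 1 + length]

def regexp_substr(s, pattern, position=1):
    # Rough stand-in for REGEXP_SUBSTR(s, pattern, position):
    # the first match found at or after the 1-based position, else None (NULL).
    match = re.compile(pattern).search(s, position - 1)
    return match.group(0) if match else None

# Hypothetical sample values, not taken from student.zipcode.
for city in ["Boston", "New York", "Ada"]:
    print(repr(city),
          repr(substr(city, 2, 3)),
          repr(regexp_substr(city, r".{3}", 2)),
          repr(regexp_substr(city, r"\S{3}", 2)))
```

In this sketch all three approaches agree on "Boston". For "New York" the `\S{3}` pattern skips past the space and picks up "Yor", and for the three-letter "Ada" both regex patterns find no match while the plain substring still yields "da".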

Related

Impala equivalent to regexp_substr

I have an Oracle query which needs to be converted to Impala. I know that Impala has regexp_extract to return a string based on the regular expression that I provide. My concern is: if there is more than one occurrence of the same pattern, how do I capture the later ones?
Let's say the dummy Oracle code I have:
Select t1.r1, REGEXP_SUBSTR('RMG123/RMG987','(RMG\d{3})+',1,1) as r2, REGEXP_SUBSTR('RMG123/RMG987','(RMG\d{3})+',1,2) as r3 From t1;
Here I will get value of r2 and r3 as RMG123 and RMG987 respectively.
When I converted it into Impala equivalent as
Select t1.r1, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',1) as r2, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',2) as r3 From t1;
I got the value for r2 as RMG123 but didn't get any value for r3, as regexp_extract does not allow checking for the second occurrence of the pattern.
Note that RMG123/RMG987 is just sample data. The user doesn't know that these two fields are separated by /.
Please suggest a way in Impala where I can achieve the result as same as in Oracle.
In Impala's regexp_extract, the last parameter is the number of a capture group () in the pattern, not the n-th occurrence as in Oracle's regexp_substr. Your pattern contains only group 1; there is no group 2. To extract the 2nd occurrence of the substring, change the pattern, for example like this:
regexp_extract("RMG123/RMG987",'(RMG\\d{3})+.*?(RMG\\d{3})',2)
The pattern '(RMG\\d{3})+.*?(RMG\\d{3})' means:
(RMG\\d{3})+ - the first group, one or more times. The + means that two or more occurrences of the pattern in a row are treated as a single occurrence.
.*? - any delimiter characters, zero or more times, non-greedy.
(RMG\\d{3}) - the second group: this is the second occurrence of the pattern, which is what you want to extract.
The + sign after the first group is significant: without it, consecutive occurrences of the group with no delimiter between them count as separate occurrences; with it, they count as a single one.
For example if initial string is RMG123RMG980/RMG987,
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3})+.*?(RMG\\d{3})',2)
will produce RMG987
And the same pattern without +
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3}).*?(RMG\\d{3})',2)
will produce RMG980
Unfortunately I have no Impala instance to test this on. The same works in Hive, but Impala's regex flavor may be a bit different.
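The group-number versus occurrence-number distinction described above can be checked with a small Python sketch. Python's `re` module is used here only as a stand-in; Impala's regex flavor may behave differently, and the helper names `extract_group` and `nth_occurrence` are hypothetical.

```python
import re

def extract_group(s, pattern, group):
    # Group-number semantics: return the given capture group of the FIRST match.
    match = re.search(pattern, s)
    return match.group(group) if match else None

def nth_occurrence(s, pattern, n):
    # Occurrence-number semantics: return the whole n-th match (1-based).
    matches = [m.group(0) for m in re.finditer(pattern, s)]
    return matches[n - 1] if len(matches) >= n else None

with_plus = r"(RMG\d{3})+.*?(RMG\d{3})"
without_plus = r"(RMG\d{3}).*?(RMG\d{3})"

# The original pattern defines a single capture group, so there is no group 2.
print(re.compile(r"(RMG\d{3})+").groups)

print(extract_group("RMG123/RMG987", with_plus, 2))
print(extract_group("RMG123RMG980/RMG987", with_plus, 2))
print(extract_group("RMG123RMG980/RMG987", without_plus, 2))
print(nth_occurrence("RMG123/RMG987", r"(RMG\d{3})+", 2))
```

Under Python's matching rules, the pattern with + returns RMG987 for both inputs, while the pattern without + returns RMG980 for the second input, which agrees with the results stated in the answer.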

Postgres SQL regexp_replace replace all number

I need some help with the following. I have a text field in SQL that records a list of times separated by '|'. For example:
'14613|15474|3832|148|5236|5348|1055|524'. Each value is a time in milliseconds. The field can be any length; for example, '3215|2654' or '4565' (only one value) is perfectly valid. I need to take this field and replace every number with the value -1000.
So '14613|15474|3832|148|5236|5348|1055|524' becomes '-1000|-1000|-1000|-1000|-1000|-1000|-1000|-1000'
Or '3215|2654' => '-1000|-1000', or '4565' => '-1000'.
I tried regexp_replace(times_field,'[[:digit:]]','-1000','g'), but it replaces each digit rather than the complete number, so in this example
'3215|2654', which should become '-1000|-1000', I get:
'-1000-1000-1000-1000|-1000-1000-1000-1000'. I tried other combinations and more regexp options, but I'm stuck.
Please help, thanks!
We can try using REGEXP_REPLACE here:
UPDATE yourTable
SET times_field = REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g');
If instead you don't really want to alter your data but rather just view your data this way, then use a select:
SELECT
times_field,
REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g') AS times_field_replace
FROM yourTable;
Note that in either case we pass g as the fourth parameter to REGEXP_REPLACE to do a global replacement of all pipe-separated numbers.
[[:digit:]] - matches a digit [0-9]
+ quantifier - matches one or more times, as many times as possible
So your regexp should look like:
regexp_replace(times_field,'[[:digit:]]+','-1000','g')
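The effect of the + quantifier can be reproduced with a short Python sketch. Python's `re.sub` is only a stand-in for PostgreSQL's regexp_replace here (it replaces every match by default, which roughly corresponds to the 'g' flag), and the function names are hypothetical.

```python
import re

def replace_each_digit(times_field):
    # Without a quantifier, every single digit is replaced on its own.
    return re.sub(r"[0-9]", "-1000", times_field)

def replace_each_number(times_field):
    # With +, each run of digits is replaced as one unit.
    return re.sub(r"[0-9]+", "-1000", times_field)

for value in ["3215|2654", "4565"]:
    print(replace_each_digit(value), "vs", replace_each_number(value))
```

The first function reproduces the unwanted output from the question, and the second gives one -1000 per pipe-separated value.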

PostgreSQL regular expression capture group in select

How can the matched regular expression be returned from an SQL select? I tried using REGEXP_EXTRACT with no luck (function not available). What I've done that does work is this:
SELECT column ~ '^stuff.*$'
FROM table;
but this gives me a list of true / false. I want to know what is extracted in each case.
If you're trying to capture the regex match that resulted from the expression, then substring would do the trick:
select substring ('I have a dog', 'd[aeiou]g')
This returns the first match, in this case "dog".
I think the missing link of what you were trying above was that you need to put the expression you want to capture in parentheses. regexp_matches would work in this case (had you included parentheses around the expression you wanted to capture), but would return an array of text with each match. If it's one match, substring is sort of convenient.
So, circling back to your example, if you're trying to return stuff if and only if it's at the beginning of a column:
select substring (column, '^(stuff)')
or
select (regexp_matches (column, '^(stuff)'))[1]
Use regexp_matches.
SELECT regexp_matches(column,'^stuff.*$')
FROM table
The regexp_matches function returns a text array of all of the captured substrings resulting from matching a POSIX regular expression pattern. It has the syntax regexp_matches(string, pattern [, flags ]). The function can return no rows, one row, or multiple rows (see the g flag below). If the pattern does not match, the function returns no rows. If the pattern contains no parenthesized subexpressions, then each row returned is a single-element text array containing the substring matching the whole pattern. If the pattern contains parenthesized subexpressions, the function returns a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern (not counting "non-capturing" parentheses; see below for details). The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag g causes the function to find each match in the string, not only the first one, and return a row for each such match.
I'm using Amazon Redshift, which is based on PostgreSQL 8.0.2 (I should have mentioned this in the question). For me what worked was REGEXP_SUBSTR
e.g.
SELECT REGEXP_SUBSTR(column,'^stuff.*$')
FROM table
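As a rough illustration of the substring behavior discussed in the answers above (whole match when the pattern has no parentheses, first parenthesized part when it does), here is a Python sketch. The `re` module is a stand-in for PostgreSQL's regex engine, and `pg_substring` is a hypothetical helper name, not a real API.

```python
import re

def pg_substring(s, pattern):
    # Sketch of substring(string, pattern): return the first capture group
    # if the pattern defines one, otherwise the whole match; None if no match.
    match = re.search(pattern, s)
    if match is None:
        return None
    return match.group(1) if match.re.groups else match.group(0)

print(pg_substring("I have a dog", r"d[aeiou]g"))
print(pg_substring("stuff and more", r"^(stuff)"))
print(pg_substring("no stuff here", r"^(stuff)"))
```

The last call finds no match because the pattern is anchored to the beginning of the value, which corresponds to a NULL result in SQL.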

How can I extract a substring from a character column without using SUBSTR()?

I have a question regarding the data below.
As you can see, each EMP_IDENTIFIER begins with the corresponding EMP_ID.
I need to pull out only the identifier part, which is 10 characters long, so that it can be inserted into another column.
How would I do that?
I did it the traditional way, using INSTR and SUBSTR.
I just want to know whether there is any other way to do it without using INSTR and SUBSTR.
EMP_ID (VARCHAR2)    EMP_IDENTIFIER (VARCHAR2)
62049                62049-2162400111
6394                 6394-1368000222
64473                64473-1814702333
61598                61598-0876000444
57452                57452-0336503555
5842                 5842-0000070666
75778                75778-0955501777
76021                76021-0546004888
76274                76274-0000454999
73910                73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
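Both approaches above (the last 10 characters, or the 10 characters that follow a dash) can be compared on the sample rows with a short Python sketch. Python slicing and the `re` module are stand-ins for Oracle's SUBSTR and REGEXP_SUBSTR, and the function names are hypothetical.

```python
import re

def last_ten(emp_identifier):
    # Counterpart of substr(emp_identifier, -10): the final 10 characters.
    return emp_identifier[-10:]

def ten_after_dash(emp_identifier):
    # Counterpart of the '-(.{10})' pattern with sub-expression 1:
    # the first 10 characters that follow a dash, or None if there are none.
    match = re.search(r"-(.{10})", emp_identifier)
    return match.group(1) if match else None

for identifier in ["62049-2162400111", "6394-1368000222", "5842-0000070666"]:
    print(identifier, last_ten(identifier), ten_after_dash(identifier))
```

On these rows both functions return the same 10-character part. They would diverge if anything followed the 10 characters after the dash.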
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the third argument of the function is 1); it looks for the second occurrence (the fourth argument, 2); it doesn't apply any special match options such as ignoring case (the argument null); and when the match is found, it returns the part of the match corresponding to the second set of parentheses (the last argument, 2). That second part is the \d+, the sequence of digits without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.

Is it possible to get the matching string from an SQL query?

If I have a query to return all matching entries in a DB that have "news" in the searchable column (i.e. SELECT * FROM table WHERE column LIKE '%news%'), and one particular row has an entry starting with "In recent World news, Somalia was invaded by ...", can I return a specific "chunk" of an SQL entry? Kind of like a teaser, if you will.
select substring(column,
    CHARINDEX('news', lower(column)) - 10,
    20)
FROM table
WHERE column LIKE '%news%'
basically substring the column starting 10 characters before where the word 'news' is and continuing for 20.
Edit: You'll need to make sure that 'news' isn't in the first 10 characters and adjust the start position accordingly.
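The start-position adjustment mentioned in the edit can be sketched in Python as follows. This is an illustration of the clamping idea only, not SQL; the function name `teaser` and its parameters are hypothetical.

```python
def teaser(text, term="news", before=10, length=20):
    # Find the term case-insensitively; return None when it is absent.
    index = text.lower().find(term.lower())
    if index == -1:
        return None
    # Clamp the start so a term near the beginning does not push it below 0.
    start = max(index - before, 0)
    return text[start:start + length]

print(teaser("In recent World news, Somalia was invaded by ..."))
print(teaser("News at ten"))
```

When the term sits within the first 10 characters, the teaser simply starts at the beginning of the text.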
You can use the substring function in the SELECT list. Something like:
SELECT SUBSTRING(column, 1, 20) FROM table WHERE column LIKE '%news%'
This will return the first 20 characters of the column.
I had the same problem. I ended up loading the whole field into C#, then searched the text again for the search string, and then selected x characters on either side.
This will work fine for LIKE, but not for full-text queries that use FORMSOF(INFLECTIONAL, ...), because those may match "women" when you search for "woman".
If you are using MSSQL, you can perform all kinds of VB-like substring functions as part of your query.