Impala equivalent to regexp_substr - sql

I have a Oracle query which needs to be converted to Impala. I know that Impala has regexp_extract to return the string based on the regular expression that I provide. What my concern is if there is more that one occurance of the same string how do I capture that?
Let's say the dummy Oracle code I have:
Select t1.r1, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,1) as r2, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,2) as r3 From t1;
Here I will get value of r2 and r3 as RMG123 and RMG987 respectively.
When I converted it into Impala equivalent as
Select t1.r1, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',1) as r2, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',2) as r3 From t1;
I got the value for r2 as RMG123 but didn't get any value for r3 as regexp_extract is not allowing to check for second occurance of the pattern.
Note that the data RMG123/RMH987 is just a sample data. The user doesn't know that these two field are seperated by /.
Please suggest a way in Impala where I can achieve the result as same as in Oracle.

In Impala regexp_extract the last parameter is a group () number in a pattern, not n-th occurence number as in Oracle regesp_substr. Your pattern contains single group number 1, no group 2. And if you want to extract 2nd occurence of substring, change the pattern for example like this:
regexp_extract("RMG123/RMG987",'(RMG\d{3})+.*?(RMG\d{3})',2)
Pattern '(RMG\\d{3})+.*?(RMG\\d{3})' means:
(RMG\\d{3})+ - first group 1+ times. + here means that two or more pattern occurrences in a row will be considered as single one.
.*? - some delimiters any times non-greedy
(RMG\\d{3}) - Second group - this is second occurrence of the pattern you want to extract.
+ sign after first group in the pattern is significant here because without it, multiple occurrences of the group without any delimiters will be considered as new occurrence, with + sign, multiple occurrences will be considered as the single one.
For example if initial string is RMG123RMG980/RMG987,
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3})+.*?(RMG\\d{3})',2)
will produce RMG987
And the same pattern without +
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3}).*?(RMG\\d{3})',2)
will produce RMG980
Unfortunately I have no Impala to test it, the same works in Hive, Impala regex flavor may be a bit different.

Related

Using regexp_like in Oracle to match on multiple string conditions using a range of values

I have a field in my Oracle DB that contains codes and from this I need to pull multiple values using a range of values.
As an example I need to pull all codes in the range C00.0 - C39.9 i.e. begins with C, the second character can be 0-3, third character is 0-9, followed by a "." and then the last digit is 0-9 e.g.
CODES
-----
C00.0
C10.4
C15.8
C39.8
The example above is for one pattern, I have multiple patterns to match on, here is another example
C50.011-C69.92
Again, starts with C, second character is 5-6, third is 0-9, fourth is ".", fifth is 0-9, sixth is 1-2 etc.
I have tried the following but my pipe function doesn't appear to pick up the second condition and therefore I am only getting results for the first condition '^[C][0-3][0-9][.][0-9]':
SELECT DISTINCT CODES
FROM
TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]|
^[C][4][0-3][.][0-9]|
^[C][4][A][.][0-9]|
^[C][4][4-9][.][0-9]|
^[C][4][9][.][A][0-9]|
^[C][5-6][0-9][.][0-9][1-9]|
^[C][7][0-5][.][0-9]|
^[C][7][A-B][.][0-8]')
ORDER BY CODES
I would be very grateful if anyone could make a suggestion on how I can pull the additional patterns.
You have newlines in the pattern -- in other words, your attempt at readability is causing the problem. You can just remove them, although I would probably factor out common elements:
WHERE REGEXP_LIKE (CODES, '^[C]([0-3][0-9][.][0-9]|[4][0-3][.][0-9]|[4][A][.][0-9]|[4][4-9][.][0-9]|[4][9][.][A][0-9]|[5-6][0-9][.][0-9][1-9]|[7][0-5][.][0-9]|[7][A-B][.][0-8])')
I think you also want $ at the end.
If you want readability, you could use or:
SELECT DISTINCT CODES
FROM TABLE
WHERE REGEXP_LIKE (CODES, '^[C][0-3][0-9][.][0-9]') OR
REGEXP_LIKE (CODES, '^[C][4][0-3][.][0-9]|') OR
. . .
Here is a regex pattern for what you want to match here:
^C[0-3][0-9][.][0-9]$
Demo
This would match the range of C00.0 - C39.9. If you want to match other ranges, then you would need an alternation with another pattern to cover those ranges.
Applying this to your current query:
SELECT DISTINCT CODES
FROM yourTable
WHERE REGEXP_LIKE (CODES, '^C[0-3][0-9][.][0-9]$');

Postgres SQL regexp_replace replace all number

I need some help with the next. I have a field text in SQL, this record a list of times sepparates with '|'. For example
'14613|15474|3832|148|5236|5348|1055|524' Each value is a time in milliseconds. This field could any length, for example is perfect correct '3215|2654' or '4565' (only 1 value). I need get this field and replace all number with -1000 value.
So '14613|15474|3832|148|5236|5348|1055|524' will be '-1000|-1000|-1000|-1000|-1000|-1000|-1000|-1000'
Or '3215|2654' => '-1000|-1000' Or '4565' => '-1000'.
I try use regexp_replace(times_field,'[[:digit:]]','-1000','g') but it replace each digit, not the complete number, so in this example:
'3215|2654' than must be '-1000|-1000', i get:
'-1000-1000-1000-1000|-1000-1000-1000-1000', I try with other combinations and more options of regexp but i'm done.
Please need your help, thanks!!!.
We can try using REGEXP_REPLACE here:
UPDATE yourTable
SET times_field = REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g');
If instead you don't really want to alter your data but rather just view your data this way, then use a select:
SELECT
times_field,
REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g') AS times_field_replace
FROM yourTable;
Note that in either case we pass g as the fourtb parameter to REGEXP_REPLACE to do a global replacement of all pipe separated numbers.
[[:digit:]] - matches a digit [0-9]
+ Quantifier - matches between one and unlimited times, as many times as possible
your regexp must look like
regexp_replace(times_field,'[[:digit:]]+','-1000','g')

Use REGEXP_SUBSTR to extract string of varied length

I want to extract alphanumeric text of varied length from a string between the second occurrence of a specific characters.
I have tried various forms of substr and regexp_substr but can't seem to get the syntax right. This is for use in Teradata SQL assistant. In the past I would have to create a temp table and use substr twice before trimming down the string to what I need. I want to do it all in one go.
SELECT regexp_substr('Channel:DF GB, Order Num:12345T6, Order Date:01/01/2019, Charge Codes:TAXES,,GBRAX', 'Num\\:+(\\:+)',1,2, ':') as RESULTING_STRING
My desired result is to return ONLY what is between "Num:" and the next "," in this case "12345T6". The length of the order number can vary so it is not a fixed length. When I run my code the actual output is a '?' returned by Teradata. What am I doing wrong?
This seems to work:
SELECT regexp_substr('Channel:DF GB, Order Num:12345T6, Order Date:01/01/2019, Charge Codes:TAXES,,GBRAX', 'Num:(\w*)', 1, 1, NULL, 1) as RESULTING_STRING from dual
Finds Num: and then captures as many word characters (, is not a word char) as there are available. The last parameter - subexpr - specifies which subexpression (aka capture group) you want, without it the whole thing will be matched (Num:12345T6).
Assuming you use Teradata SQL Assistant to query a Teradata system (but why do you tag Oracle then) the RegEx syntax is slightly different (both use a different RegEx dialects):
Teradata's RegExp_Substr doesn't support the subexpression parameter, you can either switch to the (I really don't know why) undocumented RegExp_Substr_gpl
RegExp_Substr_gpl(x, 'Num:([^,]*)', 1, 1, 'i', 1)
or tell the RegEx to forget the previous match using \K:
RegExp_Substr(x, 'Num:\K[^,]*', 1,1, 'i')
You can give a try to the below pattern search !
SELECT REGEXP_REPLACE ((REGEXP_SUBSTR('Channel:DF GB, Order Num:12345T6, Order Date:01/01/2019, Charge Codes:TAXES,,GBRAX', 'Num:[A-Za-z0-9]*',1,1, 'i')),'Num:','',1,1,'i') AS RESULTING_STRING
Regexp_substr pattern search ['Num:[A-Za-z0-9]*'], will first filter out the alphanumeric characters that follow the pattern 'Num:',astriek, helps to find out zero or more occurrences of the specified pattern.
For eg:, in this 'Num:12345T6' will be filtered out of the string provided, also note the last parameter in the regexp_substr is 'i', which ensures case in-specific search.
Lastly, Regexp_replace will replace the pattern 'Num:' from the output of the regexp_substr with an empty string,resulting in a final string as '12345T6'.

How can I extract a substring from a character column without using SUBSTR()?

I have a questions regarding below data.
You clearly can see each EMP_IDENTIFIER has connected with EMP_ID.
So I need to pull only identifier which is 10 characters that will insert another column.
How would I do that?
I did some traditional way, using INSTR, SUBSTR.
I just want to know is there any other way to do it but not using INSTR, SUBSTR.
EMP_ID(VARCHAR2)EMP_IDENTIFIER(VARCHAR2)
62049 62049-2162400111
6394 6394-1368000222
64473 64473-1814702333
61598 61598-0876000444
57452 57452-0336503555
5842 5842-0000070666
75778 75778-0955501777
76021 76021-0546004888
76274 76274-0000454999
73910 73910-0574500122
I am using Oracle 11g.
If you want the second part of the identifier and it is always 10 characters:
select t.*, substr(emp_identifier, -10) as secondpart
from t;
Here is one way:
REGEXP_SUBSTR (EMP_IDENTIFIER, '-(.{10})',1,1,null,1)
That will give the 1st 10 character string that follows a dash ("-") in your string. Thanks to mathguy for the improvement.
Beyond that, you'll have to provide more details on the exact logic for picking out the identifier you want.
Since apparently this is for learning purposes... let's say the assignment was more complicated. Let's say you had a longer input string, and it had several groups separated by -, and the groups could include letters and digits. You know there are at least two groups that are "digits only" and you need to grab the second such "purely numeric" group. Then something like this will work (and there will not be an instr/substr solution):
select regexp_substr(input_str, '(-|^)(\d+)(-|$)', 1, 2, null, 2) from ....
This searches the input string for one or more digits ( \d means any digit, + means one or more occurrences) between a - or the beginning of the string (^ means beginning of the string; (a|b) means match a OR b) and a - or the end of the string ($ means end of the string). It starts searching at the first character (the second argument of the function is 1); it looks for the second occurrence (the argument 2); it doesn't do any special matching such as ignore case (the argument "null" to the function), and when the match is found, return the fragment of the match pattern included in the second set of parentheses (the last argument, 2, to the regexp function). The second fragment is the \d+ - the sequence of digits, without the leading and/or trailing dash -.
This solution will work in your example too, it's just overkill. It will find the right "digits-only" group in something like AS23302-ATX-20032-33900293-CWV20-3499-RA; it will return the second numeric group, 33900293.

How to retrieve specific character positions within rows of database column using REGEX in Oracle SQL?

What Oracle SQL query could return the second, third and fourth positions of characters contained within rows of a specific column using the REGEXP_SUBSTR method instead of using SUBSTR method like my example provided below?
SELECT SUBSTR(city,2,3) AS "2nd, 3rd, 4th"
FROM student.zipcode;`
One way that works for me (with test data) is:
SELECT REGEXP_SUBSTR(city, '\S{3}', 2) AS partial FROM student.zipcode;
Note that this is set to find three non-whitespace characters beginning at the second position of the string.
You could also use:
SELECT REGEXP_SUBSTR(city, '.{3}', 2) AS partial FROM student.zipcode;
which will instead match any three characters in the 2nd to 4th position.
However, I'm not sure what advantage this has over simply:
SELECT SUBSTR(city,2,3) AS partial FROM student.zipcode;
The REGEXP_INSTR function is not what you want, as it returns an index (position number) for the search item in the searched string. You can read about it here: http://www.techonthenet.com/oracle/functions/regexp_instr.php