teradata sql - regexp_substr to split on ' - ' - sql

I am somewhat new to Teradata. I am more familiar with Presto SQL, where split_part is available.
I'm looking to split a string on a space, hyphen, space (' - ').
Example: 'Wal-Mart - Target - Best Buy - K-Mart - Staples'
I'm used to using split_part(split_part(COLUMN, ' - ',2), ' - '), 1) to get Target, which ignores the hyphens in Wal-Mart and K-Mart because the hyphen is not preceeded and followed by a space.
But, I can't figure out how to get 'Target' with Teradata.
strtok() only seems to work with a single character, which isn't sufficient since I want to split on 3 (' - ').
Any help would be appreciated!

Depending on your version (14.0 or recent), you could use strtok to parse it out
select strtok(oreplace('Wal-Mart - Target - Costco - K-Mart - Staples',' - ','|'),'|',2)

While not a direct answer, maybe it will help with the logic so at the risk of getting flamed I'm throwing it out here anyway. With a regex you should be able to first describe your pattern in plain language to help analyze and define what you really need to get. i.e. You want the 2nd occurrence of a string that is surrounded by the pattern space-dash-space. What if the pattern is at the start or end of the line? Let's revise. You want a specified occurrence of a string that is preceded by the start of the line optionally OR by the pattern space-dash-space, and is followed by space-dash-space OR the end of the line.
In Oracle it would look like this where the first '2' in the argument list means get the 2nd occurrence of the pattern and the 2nd '2' means return the 2nd remembered group (in parenthesis). The WITH statement just sets up the data. You would have to translate this regex to Teradata.
WITH tbl(str) AS (
SELECT 'Wal-Mart - Target - Best Buy - K-Mart - Staples' FROM dual
)
SELECT REGEXP_SUBSTR(str, '(^?| - )(.*?)( - |$)', 1, 2, NULL, 2) retailer
FROM tbl;
RETAILER
--------
Target
1 row selected.
Play with query here

Related

SnowflakeSQL get me last text after specific char

I'm trying to retrieve last part of string after underscore.. but every record has different number of underscore and I'm not sure how to write it correctly.
Example:
aaa_bb_cccc_dddd_ee - only ee
aaa_bb_cccc_dddd - only dddd
sss_aas_ww_ww_ww_bb - only bb
As you can see there is different number of underscore and I need only last part after last underscore.
I have been playing with regex and split_part but since I don't now how to point to the last _ then it's not working correctly.
My Idea is to start reading string from the right a then you just pick the first one but couldn't find a way how to do it.
It's probably basic thing but I'm struggling with it so please help.
Use REGEXP_SUBSTR:
SELECT col, REGEXP_SUBSTR(col, '_([^_]+)$', 1, 1, 'e', 1) AS col_out
FROM yourTable;
You can make split_part change the direction of split using negative integers
select split_part(col,'_',-1)
from your_table;
You could also just regexp_replace everything up until and including the last underscore
select regexp_replace(col,'.*_','')
from your_table;

Oracle regexp to match only digits after certain combination of signs

I have a string which roughly looks like: XXXXXXXXX - 1234567 XXXXXXXX,
where X can be either digit, string or sign (<,>,. or space).
I need to extract only these numbers after ' - '.
I have tried following:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}') from dual
I end up with - 1234567.
How to I get rid of '- '?
Thank you in advance
This should work with Oracle 11g.
Place the capturing group around the pattern part you are interested in first. Since you need the digits, wrap the [0-9]{1,7} with the capturing parentheses.
Then, pass all the 6 arguments to the REGEXP_SUBSTR function where the 6th one indicates the number of capturing group you want to extract:
select regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>',' - ([0-9]{1,7})', 1,1,NULL,1) from dual
Here, 1,1,NULL,1 means: start looking for a pattern match from Position 1, just for the first match, with no specific regex options, and return the contents of Group 1.
What #Gordon Linoff was trying to say was:
select substr(regexp_substr('17.12.12 <XXXXXXXXXX> - 1234567 <XXXXXXXXXX>','(- )[0-9]{1,7}'), 3)
from dual
Substr the remaining "- " off of your result.

Extracting specific part of column values in Oracle SQL

I want to extract a specific part of column values.
The target column and its values look like
TEMP_COL
---------------
DESCOL 10MG
TEGRAL 200MG 50S
COLOSPAS 135MG 30S
The resultant column should look like
RESULT_COL
---------------
10MG
200MG
135MG
This can be done using a regular expression:
SELECT regexp_substr(TEMP_COL, '[0-9]+MG')
FROM the_table;
Note that this is case sensitive and it always returns the first match.
I would probably approach this using REGEXP_SUBSTR() rather than base functions, because the structure of the prescription text varies from record to record.
SELECT TRIM(REGEXP_SUBSTR(TEMP_COL, '(\s)(\S*)', 1, 1))
FROM yourTable
The pattern (\s)(\S*) will match a single space followed by any number of non-space characters. This should match the second term in all cases. We use TRIM() to remove a leading space which is matched and returned.
how do you know what is the part you want to extract? how do you know where it begins and where it ends? using the white-spaces?
if so, you can use substr for cutting the data and instr for finding the white-spaces.
example:
select substr(tempcol, -- string
instr(tempcol, ' ', 1), -- location of first white-space
instr(tempcol, ' ', 1, 2) - instr(tempcol, ' ', 1)) -- length until next space
from dual
another solution is using regexp_substr (but it might be harder on performance if you have a lot of rows):
SELECT REGEXP_SUBSTR (tempcol, '(\S*)(\s*)', 1, 2)
FROM dual;
edit: fixed the regular expression to include expressions that don't have space after the parsed text. sorry about that.. ;)

SQL - need help in parsing text of a field

I have a select query and it fetches a field with complex data. I need to parse that data in specified format. please help with your expertise:
selected string = complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I
expected output - PB|I
Please help me in writing a sql regular expression to accomplish this output.
The first step in figuring out the regular expression is to be able to describe it plain language. Based on what we know (and as others have said, more info is really needed) from your post, some assumptions have to be made.
I'd take a stab at it by describing it like this, which is based on the sample data you provided: I want the sets of one or more characters that follow the equal signs but not including the following space or end of the line. The output should be these sets of characters, separated by a pipe, in the order they are encountered in the string when reading from left to right. My assumptions are based on your test data: only 2 equal signs exist in the string and the last data element is not followed by a space but by the end of the line. A regular expression can be built using that info, but you also need to consider other facts which would change the regex.
Could there be more than 2 equal signs?
Could there be an empty data element after the equal sign?
Could the data set after the equal sign contain one or more spaces?
All these affect how the regex needs to be designed. All that said, and based on the data provided and the assumptions as stated, next I would build a regex that describes the string (really translating from the plain language to the regex language), grouping around the data sets we want to preserve, then replace the string with those data sets separated by a pipe.
SQL> with tbl(str) as (
2 select 'complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I' from dual
3 )
4 select regexp_replace(str, '^.*=([^ ]+).*=([^ ]+)$', '\1|\2') result from tbl;
RESU
----
PB|I
The match regex explained:
^ Match the beginning of the line
. followed by any character
* followed by 0 or more 'any characters' (refers to the previous character class)
= followed by an equal sign
( start remembered group 1
[^ ]+ which is a set of one or more characters that are not a space
) end remembered group one
.*= followed by any number of any characters but ending in an equal sign
([^ ]+) followed by the second remembered group of non-space characters
$ followed by the end of the line
The replace string explained:
\1 The first remembered group
| a pipe character
\2 the second remember group
Keep in mind this answer is for your exact sample data as shown, and may not work in all cases. You need to analyse the data you will be working with. At any rate, these steps should get you started on breaking down the problem when faced with a challenging regex. The important thing is to consider all types of data and patterns (or NULLs) that could be present and allow for all cases in the regex so you return accurate data.
Edit: Check this out, it parses all the values right after the equal signs and allows for nulls:
SQL> with tbl(str) as (
2 select 'a=zz|complexType|ChannelCode=PB - Phone In A Box|IncludeExcludeIndicator=I - testing|test1=|test2=test2 - testing' from dual
3 )
4 select regexp_substr(str, '=([^ |]*)( |||$)', 1, level, null, 1) output, level
5 from tbl
6 connect by level <= regexp_count(str, '=')
7 ORDER BY level;
OUTPUT LEVEL
-------------------- ----------
zz 1
PB 2
I 3
4
test2 5
SQL>

Comparing fields when a field has data in between 2 characters that match the field being compared

I have code that looks like this:
left outer join
gme_batch_header bh
on
substr(ln.lot_number,instr(ln.lot_number,'(') + 1,
instr(ln.lot_number,')') - instr(ln.lot_number,'(') - 1)
=
bh.batch_no
It works fine, but I have come across a few lot numbers that have two sections of strings that are between parenthesis. How would I compare what is between the second set of parenthesis? Here is an example of the data in the lot number field:
E142059-307-SCRAP-(74055)
This one works with the code,
58LF-3-B-2-2-2 (SCRAP)-(61448)
This one tries comparing SCRAP with the batch no, which isn't correct. It needs to be the 61448.
The result is always the last item in parenthesis.
After more research, I actually got it to work with this code:
substr(ln.lot_number,instr(ln.lot_number,'(',-1) + 1, instr(ln.lot_number,')',-1) - instr(ln.lot_number,'(',-1) - 1)
Assuming SQL2005+, and it is always the last occurrence you want, then I would suggest finding the last instance of a ( in your query and substring to there. To get the last instance you could use something like:
REVERSE(SUBSTRING(REVERSE(lot_number),0,CHARINDEX('(',REVERSE(lot_number))))
If your version of Oracle supports regular expressions try this:
substr(regexp_substr(ln.lot_number,'[0-9]+\)$'),1,length(regexp_substr(ln.lot_number,'[0-9]+\)$'))-1)
Explanation:
regexp_substr(scrap_row,'[0-9]+\)$' ==> find me just numbers in the string that ends in ). This returns the numbers but it includes the closing parenthesis.
To remove the closing parenthsis, just send it through substring and extract first number through the length of the number stopping at 1 character from the end of the string.
Query for analysis:
with scrap
as (select '58LF-3-B-2-2-2 (SCRAP)-(61448)' as scrap_row from dual)
select scrap_row,
regexp_substr(scrap_row,'[0-9]+\)$') as regex_substring,
length(regexp_substr(scrap_row,'[0-9]+\)$')) as length_regex_substring,
substr(regexp_substr(scrap_row,'[0-9]+\)$'),1,length(regexp_substr(scrap_row,'[0-9]+\)$'))-1) as regex_sans_parenthesis
from scrap
If you have 11g, this will do it pretty simply by using the subgroup argument of regexp_substr() and constructing the regex appropriately:
SQL> with tbl(data) as
(
select 'E142059-307-SCRAP-(74055)' from dual
union
select '58LF-3-B-2-2-2 (SCRAP)-(61448)' from dual
)
select data from tbl
where regexp_substr(data, '\((\d+)\)$', 1, 1, NULL, 1)
= '61448';
DATA
------------------------------
58LF-3-B-2-2-2 (SCRAP)-(61448)
The regular expression can be read as:
\( - Search for a literal left paren
( - Start a remembered subgroup
\d+ - followed by 1 more more digits
) - End remembered subgroup
\) - followed by a literal right paren
$ - at the end of the line.
The regexp_substr function arguments are:
Source - the source string
Pattern - The regex pattern to look for
position - Position in the string to start looking for the pattern
occurrence - If the pattern occurs multiple times, which occurrence you want
match_params - See the docs, not used here
subexpression - which subexpression to use (the remembered group)
So in English, look for a series of 1 or more digits surrounded by parens, where it occurs at the end of the line and save the digit part only to use to compare. IMHO a lot easier to follow/maintain than nested instr(), substr().
For re-useability, make a function called get_last_number_in_parens() that contains this code and uses an argument of the string to search. This way that logic is encapsulated and can be re-used by folks that may not be so comfortable with regular expressions, but can benefit from the power! One place to maintain code too. Then call like this:
select data from tbl
where get_last_number_in_parens(data) = '61448';
How easy is that?!
Hello you can check with this code. It works whaever the condition may be
SELECT SUBSTR('58LF-3-B-2-2-2-(61448)',instr('58LF-3-B-2-2-2-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2-(61448)')-instr('58LF-3-B-2-2-2-(61448)','(',-1)-1)
FROM dual;
SELECT SUBSTR('58LF-3-B-2-2-2 (SCRAP)-(61448)',instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2 (SCRAP)-(61448)')-instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)-1)
FROM dual;
Output
==================================
61448
==================================