Remove all characters between two substrings in Hive SQL query - hive

I have a column of strings that looks like this:
STRING:SECTION1/SECTION2/0000123456789/SECTION3/SECTION4
STRING:SECTION1/SECTION2/0000987654321/SECTION3/SECTION4
STRING:SECTION1/SECTION2/00005552121X/SECTION3/SECTION4
STRING:SECTION1/SECTION2/00005552222:ID/SECTION3/SECTION4
I am trying to use REGEXP_REPLACE to replace the variable length, alpha/num/special char string from the middle and replace it with something generic so that they all look like this:
STRING:SECTION1/SECTION2/id_number_removed/SECTION3/SECTION4
I have been trying all morning to try to find the right regex expression to replace everything between '/SECTION2/' and '/SECTION3/' but have had no success.

Replace regex pattern 'SECTION2/[^/]+/SECTION3' with 'SECTION2/id_number_removed/SECTION3'. [^/]+ means 1 or more characters that are not slashes.
select regexp_replace(
'STRING:SECTION1/SECTION2/00005552222:ID/SECTION3/SECTION4',
'SECTION2/[^/]+/SECTION3',
'SECTION2/id_number_removed/SECTION3');
which gives
STRING:SECTION1/SECTION2/id_number_removed/SECTION3/SECTION4

Related

How can I remove characters in a string after a specific special character (~) in snowflake sql?

I am using Snowflake SQL. I would like to remove characters from a string after a special character ~. How can I do that?
here is the whole scenario. Let me explain. I do have a string like 'CK#123456~fndkjfgdjkg'. Now, i want only the number after #.And not anything after ~. This is number length varies for that field value. It might be 1 or 5 or 3. And i want to add the condition in where class where this number is equal to check_num from other table after joining. I am trying REGEXP_SUBSTR(A.SRC_TXT, '(?<=CK#)(.+?\b)') = C.CHK_NUM in the where condition. I am getting the error as 'No repititive argument after ?'
You can use a regex for this
-- To remove just the character after a ~
select regexp_replace('fo~o bar','~.', '');
-- returns 'fo bar'
--If you want to keep the ~
select regexp_replace('fo~o bar','~.', '~');
-- returns 'fo~ bar'
--If you want to remove everything after the ~
select regexp_replace('fo~o bar','~.*', '');
--returns 'fo'
If you need to remove other specific character sets after a ~, you can probably do this with a slightly more complicated regex, but I'd need examples of your desired input/output to help with that.
EDIT for updated question
This regex replace should get what you need.
select regexp_replace('CK#123456~fndkjfgdjkg','CK#(\\d*)~.*', '\\1');
-- returns 123456
(\\d*) gets ANY number of digits in a row, and the \\1 causes it to replace the match with what was in the first set of parenthesis, which is your list of digits. the CK# and ~.* are there to make sure the whole string gets matched and replaced.
If the CK# can vary as well, you can use .*? like this.
select regexp_replace('ABCD123HI#123456~fndkjfgdjkg','.*?#(\\d*)~.*', '\\1')
-- returns 123456
I'd probably do something like the following, easy enough but not as cool as RegEx type of functions.
set my_string='fooo~12345';
set search_for_me = '~';
SELECT SUBSTR($my_string, 1, DECODE(position($search_for_me, $my_string), 0, length($my_string), position($search_for_me, $my_string)));
I hope this helps...Rich
It looks like lookahead and lookbehinds are not supported in REGEXP functions, they seem to work in the PATTERN clause of a LIST command. Snowflake documentation makes no mention either way of lookahead or lookbehinds.
In your example:
It seems that the query engine is looking for that repeating argument, where you are attempting a lookbehind
You have not specified what you wanted extracted. You have two capture groups, but in this scenario everything would be returned
Since you are looking to remove everything after ~ you have a delimiter, why not use it in your REGEXP_SUBSTR function?
Try the following:
SELECT $1,REGEXP_SUBSTR($1,'\\w+#(.+?)~',1,1,'is',1)
FROM VALUES
('CK#123456~fndkjfgdjkg')
,('QH#128fklj924~fndkjfgdjkg')
;
This looks for:
One or more word characters
Followed by #
Capturing one or more characters upto and not including ~
Returns the characters within the capture group
You can change the .+? to \\d+? to make sure the pattern is only digits. Backslashes must be escaped with a backslash.
The descriptions for each argument of the function can be found here:
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_substr.html
You could check this!!
select substr('CK#123456~fndkjfgdjkg',4,6) from dual;
OUTPUT
123456
https://docs.snowflake.net/manuals/sql-reference/functions/substr.html

Snowflake SQL Regex

I am trying to identify a value that is nested in a string using Snowflakes regexp_substr()
The value that I want to access is in quotes:
...
Type:
value: "CategoryA"
...
Edit: This text is nested in a much larger portion of text.
I want to extract CategoryA for all columns using regexp_substr. But I am unsure how.
I have tried:
regexp_substr(col, 'Type\\W+(\\w+)\\W+\\w.+')
and while that gives the portion of the string, I just want what is in quotes and can't figure out how to do so.
You could use regexp_replace() instead:
regexp_replace(col, '(^[^"]*")|("[^"]*$)", '')
The regexp matches on both following conditions, and replaces matching parts with the empty string:
^[^"]*": everything from the beginning of the string to the first double quote
("[^"]*$)": everything from the last double quote to the end of the string

How can I replace a string pattern with blank in hive?

I have a string as:
https://maps.googleapis.com/maps/api/staticmap?center=41.892532+-87.63811&zoom=11&scale=2&size=280x320&maptype=roadmap&format=png&visual_refresh=true%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:1%7C2413+S+State+St++Chicago+IL+60616%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:2%7C3000+N+Halsted+St++Chicago+IL+60657%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&key=AIzaSyBNEAQcC5niAEeiP3zkA_nuWGvtl0IOEs4
I want to replace the '++++' pattern at the end with blank and not the single occurrence of '+'. Tried using regexp_replace and translate functions in hive but that replaces all the single occurrences of '+' as well.
Use
regexp_replace(string,'[+]{4}','')
Pattern '[+]{4}' means + caracter four times.
Test:
select regexp_replace('++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&','[+]{4}','');
Result:
OK
++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C&
Dod you try this?
replace(string, '++++', '')
Admittedly, this will replace all occurrences of '++++', but your string only has one of them.

Regex not matching correct string

I am busy building a lookup table for specific names of merchants. I tried to make use of the following regex but it's returning less results than the standard "like" function in Netezza SQL. Please refer to below:
SQL Like function: where trim(upper(a.MRCH_NME)) like '%CNA %' -- returns 4622 matches
Regex function in Netezza SQL: where array_combine(regexp_extract_all(trim(upper(a.MRCH_NME)),'.*CNA\s','i'),'|') = 'CNA' -- returns 2226 matches
I looked at the two result sets and found that strings such as the following aren't matched:
!C CNA INT ARR
*CNA PLATZ 0400
015764 CNA CRAD
C#CNA PARK 0
I made use of the following regex expression: /.*CNA\s'/
Any idea why the above strings aren't being returned as matches?
Thank you.
You probably should be using regexp_like:
SELECT *
FROM yourTable
WHERE REGEXP_LIKE(MRCH_NME, 'CNA[ ]', 'i');
This would be logically identical to the following query using LIKE:
SELECT *
FROM yourTable
WHERE MRCH_NME LIKE '%CNA ';
It seems to me the problem is more with your code rather than the regex. Look: like '%CNA %' returns all entries that contain a CNA substring followed with a literal space anywhere inside the entry. The '.*CNA\s' regex matches any 0+ chars other than newline followed with CNA and **any whitespace char*.
Acc. to this reference, \s matches "a white space character. White space is defined as [\t\n\f\r\p{Z}].
Thus, you should in fact just use
WHERE REGEXP_LIKE(MRCH_NME, 'CNA ', 'i')
or, better with a word boundary check:
WHERE REGEXP_LIKE(MRCH_NME, '\bCNA\b', 'i')
where \b marks a transition from a word to non-word and non-word to word character, thus ensuring a whole word search and justifying the regex usage.
If you do not need to match the merchant name as a whole word, use the regular LIKE with '%CNA %', it should be more efficient.

Replace multiple repeating character to one

I have a varchar column, and each field contains a single word, but there are random number of pipe character before and after the word.
Something like this:
MyVarcharColumn
'|||Apple|||||'
'|||||Pear|||||'
'||Leaf|'
When I query the table, I wish to replace the multiple pipes to a single one, so the result would be like this:
MyVarcharColumn
'|Apple|'
'|Pear|'
'|Leaf|'
Cannot figure out how to solve it with REPLACE function, anybody knows?
vkp's method absolutely solves your issue. Another method that works, and also will work in a variety of other situations, is using a triple REPLACE()
SELECT REPLACE(REPLACE(REPLACE('|||Apple|||||', '|', '><'), '<>',''), '><','|')
This method will allow you to keep a delimiter between multiple strings where Mr. VPK's method will concat the strings and put a delim at the very beginning and the very end.
SELECT REPLACE(REPLACE(REPLACE('|||Apple|||||Banana||||||||||', '|', '><'), '<>',''), '><','|')
One way is to replace all the | with blanks and add a pipe character at the beginning and the end of string.
select '|'+replace(mycolumn,'|','')+'|' from tablename