Removing leading special characters in Hive - sql

I am trying to remove leading special characters (could be -"$&^#_)
from "Persi és Levon Cnatówóeez using Hive.
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '[^a-zA-Z0-9]+', '')
but this removes all special characters.
I am expecting an output similar to
Persi és Levon Cnatówóeez

Try this:
select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
I tried it on Hive and it replaces any character that is not a letter (a-zA-Z) a number (0-9) or an accented character (\u00E0-\u00FC).
0: jdbc:hive2://localhost:10000> select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
+----------------------------+--+
| _c0 |
+----------------------------+--+
| Persi és Levon Cnatówóeez |
+----------------------------+--+
1 row selected (0.104 seconds)
0: jdbc:hive2://localhost:10000>

From the Hive documentation:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\s' is necessary to match whitespace, etc.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
You should do something like this:
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '^[\!-\/\[-\`]+', '')
I haven't Hive right know to try this code, but the idea should be correct. In the second field you must put what you want to substitute, not what you want to keep in your string. In this specific case, this should remove (substitute with empty string '') every consequent character in the beginning of the line, that is in the range from ! to /, or in the range [ to ` referring to the ASCII table.

Related

How to remove characters that are not in ASCII range 32 - 126 from a column in SQL Query

I have a table containing two columns which can hold any character values. However the target system only allows characters in the ASCII range 32 - 126.
My requirement is to remove the characters out of this ASCII range, and allow others to flow
Ex : Frédéric should become Frdric - which just drops all spl chars out of that range.
It should only be within a query, not a procedure. Thanks in advance
You could possibly use the REGEXP_REPLACE function?
Two basic examples could be:
SELECT REGEXP_REPLACE('Frédéric', '[^ -~]', '') FROM dual;
SELECT REGEXP_REPLACE('QuÉbec', '[^ -~]', '') FROM dual;
Generates the output:
The second parameter, or the regexp portion inside the [] set is: ^ Not
space single space
- hyphen
~ tilde
Which means that any characters that are not ^ in the range of space-~ (space to tilde) get replaced by the '' empty string.
Here's a dbfiddle for the image examples.
Note that I also tried using the '[^[:print:]]' character class, which covers the same range, but it does not seem to work.

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution which will work for dynamic length string. Split string by the empty string that has the last match (\\G) followed by two digits (\\d{2}) before it ((?<= )), concatenate array and remove delimiter at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+

How can I replace a string pattern with blank in hive?

I have a string as:
https://maps.googleapis.com/maps/api/staticmap?center=41.892532+-87.63811&zoom=11&scale=2&size=280x320&maptype=roadmap&format=png&visual_refresh=true%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:1%7C2413+S+State+St++Chicago+IL+60616%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:2%7C3000+N+Halsted+St++Chicago+IL+60657%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&key=AIzaSyBNEAQcC5niAEeiP3zkA_nuWGvtl0IOEs4
I want to replace the '++++' pattern at the end with blank and not the single occurrence of '+'. Tried using regexp_replace and translate functions in hive but that replaces all the single occurrences of '+' as well.
Use
regexp_replace(string,'[+]{4}','')
Pattern '[+]{4}' means + caracter four times.
Test:
select regexp_replace('++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&','[+]{4}','');
Result:
OK
++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C&
Dod you try this?
replace(string, '++++', '')
Admittedly, this will replace all occurrences of '++++', but your string only has one of them.

Regex not matching correct string

I am busy building a lookup table for specific names of merchants. I tried to make use of the following regex but it's returning less results than the standard "like" function in Netezza SQL. Please refer to below:
SQL Like function: where trim(upper(a.MRCH_NME)) like '%CNA %' -- returns 4622 matches
Regex function in Netezza SQL: where array_combine(regexp_extract_all(trim(upper(a.MRCH_NME)),'.*CNA\s','i'),'|') = 'CNA' -- returns 2226 matches
I looked at the two result sets and found that strings such as the following aren't matched:
!C CNA INT ARR
*CNA PLATZ 0400
015764 CNA CRAD
C#CNA PARK 0
I made use of the following regex expression: /.*CNA\s'/
Any idea why the above strings aren't being returned as matches?
Thank you.
You probably should be using regexp_like:
SELECT *
FROM yourTable
WHERE REGEXP_LIKE(MRCH_NME, 'CNA[ ]', 'i');
This would be logically identical to the following query using LIKE:
SELECT *
FROM yourTable
WHERE MRCH_NME LIKE '%CNA ';
It seems to me the problem is more with your code rather than the regex. Look: like '%CNA %' returns all entries that contain a CNA substring followed with a literal space anywhere inside the entry. The '.*CNA\s' regex matches any 0+ chars other than newline followed with CNA and **any whitespace char*.
Acc. to this reference, \s matches "a white space character. White space is defined as [\t\n\f\r\p{Z}].
Thus, you should in fact just use
WHERE REGEXP_LIKE(MRCH_NME, 'CNA ', 'i')
or, better with a word boundary check:
WHERE REGEXP_LIKE(MRCH_NME, '\bCNA\b', 'i')
where \b marks a transition from a word to non-word and non-word to word character, thus ensuring a whole word search and justifying the regex usage.
If you do not need to match the merchant name as a whole word, use the regular LIKE with '%CNA %', it should be more efficient.

Extract Specific Set of data from a String in Oracle

I have the string '1_A_B_C_D_E_1_2_3_4_5' and I am trying to extract the data 'A_B_C_D_E'. I am trying to remove the _1_2_3_4_5 & the 1_ portion from the string. Which is essentially the numeric portion in the string. any special characters after the last alphabet must also be removed. In this example the _ after the character E must also not be present.
and the Query I am trying is as below
SELECT
REGEXP_SUBSTR('1_A_B_C_D_E_1_2_3_4_5','[^0-9]+',1,1)
from dual
The Data I get from the above query is as below: -
_A_B_C_D_E_
I am trying to figure a way to remove the underscore towards the end. Any other way to approach this?
Assuming the "letters" come first and then the "digits", you could do something like this:
select regexp_substr('A_B_C_D_E_1_2_3_4_5','.*[A-Z]') from dual;
This will pull all the characters from the beginning of the string, up to the last upper-case letter in the string (.* is greedy, it will extend as far as possible while still allowing for one more upper-case letter to complete the match).
I have the string '1_A_B_C_D_E_1_2_3_4_5' and I am trying to extract the data 'A_B_C_D_E'
Use REGEXP_REPLACE:
SQL> SELECT trim(BOTH '_' FROM
2 (REGEXP_SUBSTR('1_A_B_C_D_E_1_2_3_4_5','[0-9]+', ''))) str
3 FROM dual;
STR
---------
A_B_C_D_E
How it works:
REGEXP_REPLACE will replace all numeric occurrences '[0-9]+' from the string. Alternatively, you could also use POSIX character class '[^[:digit:]]+'
TRIM BOTH '_' will remove any leading and lagging _ from the string.
Also using REGEXP_SUBSTR:
SELECT trim(BOTH '_' FROM
(REGEXP_SUBSTR('1_A_B_C_D_E_1_2_3_4_5','[^0-9]+'))) str
FROM dual;
STR
---------
A_B_C_D_E