REPLACE() function for HiveSQL, REGEXP_REPLACE() not working as intended - hive

I run the following statement to replace ".." with ".":
CREATE TABLE TableA AS
SELECT Column1,
REGEXP_REPLACE(Column2, "..", ".") AS NewColumn
FROM TableB;
NewColumn came out as ".......". What's wrong with the REGEXP_REPLACE() function?

regexp_replace expects a regex pattern. In a regex, . matches any single character, so the pattern .. matches every pair of characters, and each pair is replaced with a full stop.
To prevent this, you can either escape the fullstop:
REGEXP_REPLACE(Column2, "\\.\\.", ".")
or use replace, which expects a string pattern:
REPLACE(Column2, "..", ".")
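To see the difference outside Hive, here is a minimal sketch using Python's re module. It is only an analogy: the sample value is made up, and the doubled backslashes needed inside a Hive string literal become a raw string in Python.

```python
import re

value = "a..b..c"  # hypothetical sample value for Column2

# Unescaped: each "." matches any character, so every pair of characters
# is replaced with a single dot.
unescaped = re.sub(r"..", ".", value)

# Escaped: only two literal dots in a row are replaced.
escaped = re.sub(r"\.\.", ".", value)

# Plain string replacement, analogous to REPLACE().
plain = value.replace("..", ".")

print(unescaped)  # ...c
print(escaped)    # a.b.c
print(plain)      # a.b.c
```

The unescaped version collapses three character pairs into three dots and leaves the odd trailing character, which is the same effect seen in NewColumn.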

In addition to what @mck proposed, you can use a quantifier for repeating patterns:
REGEXP_REPLACE(Column2, "\\.{2}", ".")
Or if you want to replace 2 or more dots with a single one:
REGEXP_REPLACE(Column2, "\\.{2,}", ".")
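The two quantifiers behave differently on longer runs of dots. A small sketch in Python's re module (an analogy only, with a made-up sample value) shows it:

```python
import re

value = "a..b....c...d"  # hypothetical sample with runs of 2, 4 and 3 dots

# {2} consumes exactly two dots per match, so longer runs are only halved.
exactly_two = re.sub(r"\.{2}", ".", value)

# {2,} consumes the whole run, so every run collapses to one dot.
two_or_more = re.sub(r"\.{2,}", ".", value)

print(exactly_two)  # a.b..c..d
print(two_or_more)  # a.b.c.d
```

If the goal is "never more than one dot in a row", the {2,} form is the one that does it in a single pass.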

Related

How can I replace a string pattern with blank in hive?

I have a string as:
https://maps.googleapis.com/maps/api/staticmap?center=41.892532+-87.63811&zoom=11&scale=2&size=280x320&maptype=roadmap&format=png&visual_refresh=true%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:1%7C2413+S+State+St++Chicago+IL+60616%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:2%7C3000+N+Halsted+St++Chicago+IL+60657%7C&markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&key=AIzaSyBNEAQcC5niAEeiP3zkA_nuWGvtl0IOEs4
I want to replace the '++++' pattern at the end with blank and not the single occurrence of '+'. Tried using regexp_replace and translate functions in hive but that replaces all the single occurrences of '+' as well.
Use
regexp_replace(string,'[+]{4}','')
Pattern '[+]{4}' means the + character four times.
Test:
select regexp_replace('++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&','[+]{4}','');
Result:
OK
++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C&
Did you try this?
replace(string, '++++', '')
Admittedly, this will replace all occurrences of '++++', but your string only has one of them.
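Both suggestions can be checked quickly with Python's re module, as a rough stand-in for Hive (the test fragment is the one from the answer above):

```python
import re

fragment = "++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C++++&"

# [+]{4} matches four consecutive plus signs; the pair at the start is left alone.
by_regex = re.sub(r"[+]{4}", "", fragment)

# Plain string replacement, analogous to replace(string, '++++', '').
by_replace = fragment.replace("++++", "")

print(by_regex)  # ++markers=size:mid%7Ccolor:0x8000ff%7Clabel:3%7C&
```

Putting the + inside brackets avoids having to escape it, since a bare + is a quantifier in a regex.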

Postgresql: Extracting substring after first instance of delimiter

I'm trying to extract everything after the first instance of a delimiter.
For example:
01443-30413 -> 30413
1221-935-5801 -> 935-5801
I have tried the following queries:
select regexp_replace(car_id, E'-.*', '') from schema.table_name;
select reverse(split_part(reverse(car_id), '-', 1)) from schema.table_name;
However both of them return:
01443-30413 -> 30413
1221-935-5801 -> 5801
So it's not working if delimiter appears multiple times.
I'm using Postgresql 11. I come from a MySQL background where you can do:
select SUBSTRING(car_id FROM (LOCATE('-',car_id)+1)) from table_name
Why not just do the PG equivalent of your MySQL approach and substring it?
SELECT SUBSTRING('abcdef-ghi' FROM POSITION('-' in 'abcdef-ghi') + 1)
If you don't like the "from" and "in" way of writing arguments, PG also has "normal" comma separated functions:
SELECT SUBSTR('abcdef-ghi', STRPOS('abcdef-ghi', '-') + 1)
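The position-plus-substring idea can be sketched in Python to check the expected results from the question. The helper name after_first_dash is made up for illustration, and str.find is 0-based where STRPOS is 1-based.

```python
def after_first_dash(car_id: str) -> str:
    """Return everything after the first '-', or the whole string if there is none."""
    # find() returns -1 when there is no dash, so the slice then starts at 0,
    # much like STRPOS returning 0 and SUBSTR starting at position 1.
    return car_id[car_id.find("-") + 1:]

print(after_first_dash("01443-30413"))    # 30413
print(after_first_dash("1221-935-5801"))  # 935-5801
```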
I think that regexp_replace is appropriate, but using the correct pattern:
select regexp_replace('1221-935-5801', E'^[^-]+-', '');
935-5801
The regex pattern ^[^-]+- matches, from the start of the string, one or more non dash characters, ending with a dash. It then replaces with empty string, effectively removing this content.
Note that this approach also works if the input has no dashes at all, in which case it would just return the original input.
Use this regexp pattern:
select regexp_replace(car_id, E'^[^-]+-', '') from schema.table_name
Regexp explanation:
^ is the beginning of the string
[^-]+ means at least one character different than -
...until the - character is met
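The same pattern can be tried in Python's re module, which treats ^[^-]+- the same way for these inputs. This is a sketch for checking the behaviour, not PostgreSQL itself, and the last sample value is made up.

```python
import re

def strip_first_segment(car_id: str) -> str:
    # ^[^-]+- : from the start, one or more non-dash characters, then one dash.
    return re.sub(r"^[^-]+-", "", car_id)

print(strip_first_segment("01443-30413"))    # 30413
print(strip_first_segment("1221-935-5801"))  # 935-5801
print(strip_first_segment("12345"))          # 12345 (no dash, input unchanged)
```

Because the pattern is anchored at the start and [^-] cannot cross a dash, only the first segment is removed, however many dashes follow.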
I tried it the conventional way, using strpos, which is PostgreSQL's counterpart to instr. You can try the query below:
SELECT
SUBSTR(car_id, strpos(car_id, '-') + 1)
FROM table_name;

Regex not matching correct string

I am busy building a lookup table for specific names of merchants. I tried to make use of the following regex but it's returning fewer results than the standard "like" function in Netezza SQL. Please refer to the queries below:
SQL Like function: where trim(upper(a.MRCH_NME)) like '%CNA %' -- returns 4622 matches
Regex function in Netezza SQL: where array_combine(regexp_extract_all(trim(upper(a.MRCH_NME)),'.*CNA\s','i'),'|') = 'CNA' -- returns 2226 matches
I looked at the two result sets and found that strings such as the following aren't matched:
!C CNA INT ARR
*CNA PLATZ 0400
015764 CNA CRAD
C#CNA PARK 0
I made use of the following regex expression: /.*CNA\s'/
Any idea why the above strings aren't being returned as matches?
Thank you.
You probably should be using regexp_like:
SELECT *
FROM yourTable
WHERE REGEXP_LIKE(MRCH_NME, 'CNA[ ]', 'i');
This would be logically equivalent to the following query using LIKE (apart from the 'i' flag, which makes the regex match case-insensitively):
SELECT *
FROM yourTable
WHERE MRCH_NME LIKE '%CNA %';
It seems to me the problem is more with your code than with the regex. Look: like '%CNA %' returns all entries that contain a CNA substring followed by a literal space anywhere inside the entry. The '.*CNA\s' regex matches any 0+ chars other than newline, followed by CNA and any whitespace char.
According to this reference, \s matches "a white space character. White space is defined as [\t\n\f\r\p{Z}]."
Thus, you should in fact just use
WHERE REGEXP_LIKE(MRCH_NME, 'CNA ', 'i')
or, better with a word boundary check:
WHERE REGEXP_LIKE(MRCH_NME, '\bCNA\b', 'i')
where \b marks a transition from a word to non-word and non-word to word character, thus ensuring a whole word search and justifying the regex usage.
If you do not need to match the merchant name as a whole word, use the regular LIKE with '%CNA %', it should be more efficient.
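To make the difference between the substring test and the whole-word test concrete, here is a sketch in Python's re module. Python's regex engine is not Netezza's, so treat it as an illustration only. The first four names come from the question; the last two are hypothetical additions.

```python
import re

names = [
    "!C CNA INT ARR",
    "*CNA PLATZ 0400",
    "015764 CNA CRAD",
    "C#CNA PARK 0",
    "DCNA SHOP",      # hypothetical: "CNA " at the end of a longer word
    "MCNALLY STORE",  # hypothetical: "CNA" not followed by a space
]

# Counterpart of LIKE '%CNA %': the substring "CNA " anywhere in the name.
like_style = [n for n in names if "CNA " in n.upper()]

# Counterpart of '\bCNA\b' with the 'i' flag: CNA as a whole word.
whole_word = [n for n in names if re.search(r"\bCNA\b", n, re.IGNORECASE)]

print(like_style)  # the four original names plus "DCNA SHOP"
print(whole_word)  # only the four original names
```

All four strings from the question pass both tests, which supports the point that the mismatch comes from how the regex result was compared, not from the pattern failing on those strings.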

SQL Server : PATINDEX using pattern

I am using PATINDEX to replace stray characters in string (column in my table).
select PATINDEX('%[^A-Z,a-z0-9 -()&_/\.]%', 'This has a stray character$')
I am puzzled as to why the result is 0 (I am expecting 27): $ is not in the pattern that I want to keep. Any insights?
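The thing is that the dash - has a special meaning inside brackets: it defines a range. In your pattern, ' -(' is read as the range from space to (, which includes $ (in ASCII order, space is 32, $ is 36 and ( is 40), so $ counts as an allowed character and the result is 0. So you have to escape the dash: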
select PATINDEX('%[^A-Z,a-z0-9 \-()&_/\.]%', 'This has a stray character$')
I would escape all special characters:
select PATINDEX('%[^A-Z,a-z0-9 \-\(\)\&\_\/\\\.]%', 'This has a stray character$')
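The range effect is easy to reproduce with a regex character class in Python. This is only an analogy: Python's re syntax is not T-SQL's PATINDEX pattern syntax, and escaping rules differ between the two, but the way a dash between two characters forms a range is the same idea.

```python
import re

text = "This has a stray character$"

# " -(" is a range from space (32) to "(" (40), which covers "$" (36),
# so "$" is treated as an allowed character and nothing is flagged.
with_range = re.search(r"[^A-Z,a-z0-9 -()&_/\.]", text)

# With the dash made literal, "$" is no longer allowed and is found.
with_literal_dash = re.search(r"[^A-Z,a-z0-9 \-()&_/\.]", text)

print(with_range)                     # None
print(with_literal_dash.start() + 1)  # 27 (1-based, like PATINDEX)
```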

How to write MySQL REGEXP?

A table contains the string "Hello world!"
Thinking of * as the ordinary wildcard character, how can I write a REGEXP that will evaluate to true for 'W*rld!' but false for 'H*rld!', since H is part of another word? 'W*rld' should evaluate to false as well because of the trailing '!'.
Use:
WHERE column REGEXP 'W[[:alnum:]]+rld!'
Alternately, you can use:
WHERE column RLIKE 'W[[:alnum:]]+rld!'
RLIKE is a synonym for REGEXP
[[:alnum:]] will allow any alphanumeric character, [[:alnum:]]+ will allow multiples
REGEXP / RLIKE is not case sensitive, except when used with binary strings.
Reference: MySQL Regex Support
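If you are just looking to match the word world, then do this:
SELECT * FROM `table` WHERE `field_name` LIKE "%w_rld!";
The _ matches exactly one character, and the leading % is needed because LIKE has to match the whole value ("Hello world!"), not just a part of it.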
Edit: I realize the OP requested this solution with REGEXP, but since the same result can be achieved without using regular expressions, I provided this as viable solution that should perform faster than a REGEXP.
You can use regular expressions in MySQL:
SELECT 'Hello world!' REGEXP 'H[[:alnum:]]+rld!'
0
SELECT 'Hello world!' REGEXP 'w[[:alnum:]]+rld!'
1
More information about the syntax can be found in the MySQL documentation on regular expressions.
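The two results above can be reproduced with Python's re module as a rough cross-check. Python does not support the [[:alnum:]] class, so this sketch substitutes [a-z0-9] with the IGNORECASE flag, which is an assumption that only holds for ASCII text.

```python
import re

text = "Hello world!"

def matches(pattern: str) -> bool:
    # IGNORECASE mirrors the case-insensitive behaviour of REGEXP on
    # non-binary strings.
    return re.search(pattern, text, re.IGNORECASE) is not None

w_result = matches(r"w[a-z0-9]+rld!")
h_result = matches(r"h[a-z0-9]+rld!")

print(w_result)  # True
print(h_result)  # False
```

The H pattern fails because the space between "Hello" and "world!" is not alphanumeric, so the character class cannot bridge the two words.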