Hive - Omitting an exact bracketed substring from a string field

I have a string field with records like the following
“Harry Potter (HP) (ab-cd)”
“John Doe (ab-cd)”
“Richard Smith (RS)”
“William Johnson”
I would like to remove the “(ab-cd)” part from all records without removing any other bracketed expressions.
The results should be:
“Harry Potter (HP)”
“John Doe”
“Richard Smith (RS)”
“William Johnson”
I think regexp_replace() needs to be used, but I am not good with regular expressions.

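Use a simple replace() if you are not replacing a pattern. You don't need a slower, more complex regex. Wrap the call in trim() so that the space left in front of the removed '(ab-cd)' does not remain at the end of the value:
select trim(replace('Harry Potter (HP) (ab-cd)', '(ab-cd)', ''))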
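The behaviour of a literal (non-regex) replacement can be checked outside Hive with a small Python sketch. The function name strip_tag is hypothetical and only serves the illustration. It removes the exact text '(ab-cd)' and then trims the whitespace that the removal leaves behind, so the other bracketed expressions are untouched.

```python
def strip_tag(value: str, tag: str = "(ab-cd)") -> str:
    """Remove every literal occurrence of `tag`, then trim leftover edge whitespace."""
    return value.replace(tag, "").strip()


records = [
    "Harry Potter (HP) (ab-cd)",
    "John Doe (ab-cd)",
    "Richard Smith (RS)",
    "William Johnson",
]

for record in records:
    # repr() makes any stray leading or trailing space visible
    print(repr(strip_tag(record)))
```

Without the final strip(), the first two records would end in a space, because the space before '(ab-cd)' is not part of the text being replaced.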

Related

SQL Server LIKE caret (^) for NOT does not work as expected

I was reading the article at mssqltips and wanted to try the caret in regex. I understand regex pretty well and use it often, although not much in SQL Server queries.
For the following list of names, I had thought that 1) select * from people where name like '%[^m]%'; would return the names that do not contain 'm'. But it doesn't work like that. I know I can use 2) select * from people where name not like '%m%'; to get the result I want, but I'm baffled why 1) doesn't work as expected.
Amy
Jasper
Jim
Kathleen
Marco
Mike
Mitchell
I am using SQL Server 2017, but here is a fiddle:
sql fiddle
'%[^m]%' would be true for any string containing a character that is not m. An expanded version would be '%[Any character not m]%'. Since all of those strings contain a character other than m, they are valid results.
If you had a string like mmm, where name like '%[^m]%' would not return that row.
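The difference between the two predicates can be modelled in a few lines of Python. This is only a sketch of the semantics, not of how SQL Server evaluates LIKE, and it assumes a case-insensitive collation (which is why both functions lower-case their input). The function names are hypothetical.

```python
def like_has_non_m(name: str) -> bool:
    """Model of  name LIKE '%[^m]%' : true if ANY character is not 'm'."""
    return any(ch.lower() != "m" for ch in name)


def not_like_m(name: str) -> bool:
    """Model of  name NOT LIKE '%m%' : true if NO character is 'm'."""
    return "m" not in name.lower()


names = ["Amy", "Jasper", "Jim", "Kathleen", "Marco", "Mike", "Mitchell"]

print([n for n in names if like_has_non_m(n)])  # every name qualifies
print([n for n in names if not_like_m(n)])      # only names without an 'm'
print(like_has_non_m("mmm"))                    # the one kind of string query 1) rejects
```

The first predicate asks "is there at least one character that is not m", while the second asks "are all characters not m", which is the question the asker actually had in mind.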

String operations in BigQuery

Adams Allen "prop1-prop2-pro3"
Burns Bonnie "prop1Burns-prop2bon-prop3-ch"
Cannon Charles "prop1a-prop2b-prop3c"
I have the above table stored in BigQuery and the 3rd column is guaranteed to have 3 properties separated by '-'.
I want to do string operations on the 3rd column and return something like 'custom_string-prop1-custom_string2-prop2' for each row. How do I do this in BigQuery?
You can use split():
select split('prop1-prop2-pro3', '-')[ordinal(1)] as custom1,
split('prop1-prop2-pro3', '-')[ordinal(2)] as custom2,
split('prop1-prop2-pro3', '-')[ordinal(3)] as custom3
You can also just put them into an array -- that might be as or more convenient:
select split(col, '-') as prop_array
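To make the split-then-reassemble idea concrete, here is a minimal Python sketch that builds the string from the question. The function name build_custom is hypothetical, and 'custom_string' and 'custom_string2' are the placeholder labels the asker used. Note that Python lists are 0-based, whereas ordinal() in the BigQuery answer is 1-based.

```python
def build_custom(props: str) -> str:
    """Split on '-' and interleave the first two properties with fixed labels."""
    parts = props.split("-")
    # parts[0] corresponds to [ordinal(1)], parts[1] to [ordinal(2)]
    return f"custom_string-{parts[0]}-custom_string2-{parts[1]}"


print(build_custom("prop1-prop2-pro3"))
print(build_custom("prop1a-prop2b-prop3c"))
```

In BigQuery the same assembly would presumably be done by concatenating the split elements with the fixed labels in a single select.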

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8}
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}")
You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly, you only need to match whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})")
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)")
See the re2 regex online test.
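The capturing-group technique can be tried locally with a short Python sketch. Python's re module is not RE2, but this pattern only uses constructs that both engines support. The function name extract_digits is hypothetical. Returning None on a miss mirrors REGEXP_EXTRACT returning NULL when nothing matches.

```python
import re
from typing import Optional

# The known prefix is matched but left outside the group,
# so only the eight digits are returned.
PATTERN = re.compile(r"mypasswordis\s*([0-9]{8})")


def extract_digits(text: str) -> Optional[str]:
    """Return the 8 digits that follow 'mypasswordis', or None if absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


for row in ["mypasswordis 12345678",
            "# mypasswordis 12345678",
            "foobar mypasswordis 12345678"]:
    print(extract_digits(row))
```

The prefix still has to match, so it does the job of the lookbehind; the group only decides which part of the match is handed back.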
Try to avoid regexp where you can, as it can be quite slow. For example, use SUBSTR and INSTR (this assumes exactly one space between the known string and the digits):
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
Otherwise, Wiktor Stribiżew probably has the right answer.
Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')

Trying to remove spaces between initials in SQL

I have a column that contains a person's name. I need to extract it to pass to another system, but I need to remove the spaces only from between the initials.
for example I might have
Mr A B Bloggs and I want Mr AB Bloggs or
Mrs A B C Bloggs and I want Mrs ABC Bloggs
As there are millions of records in the table, I won't know how many initials there are, or indeed whether there are any initials. All I know is that the prefix (Mr, Mrs etc.) will be more than 1 character long, and so will the surname. I've tried using trim, replace and charindex, but obviously not in the right combination. Any help would be appreciated.
Unfortunately, SQL Server does not support regex. You have two options:
Use .NET in CLR to perform the transformation. This link explains how to implement regex in SQL Server using CLR: https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/.
The other option is to use a cursor to iterate through all the records and transform each entry. This may be slow for a large table. For example, you could write a function that returns the locations of spaces surrounded by single letters and then removes them. The trick is not to remove any of them until you have recorded all of them, and then to remove them from right to left so that the recorded locations do not shift.
Try this:
declare @test varchar(100)='Mrs A B C Bloggs'
select (substring (@test,0,charindex(' ',@test)))+' '+
replace(replace(replace(substring(@test,len((substring (@test,0,charindex(' ',@test))))+1,len(@test)),
(substring (@test,0,charindex(' ',@test))),''),reverse((substring (reverse(@test),0,charindex(' ',reverse(@test))))),''),' ','')
+' '+reverse((substring (reverse(@test),0,charindex(' ',reverse(@test)))))
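For comparison, the "remove only the spaces that sit between single letters" rule is compact when a regex engine is available, for instance in a CLR function or in a pre-processing step outside the database. The Python sketch below is an illustration under that assumption, not T-SQL. It also assumes initials are single uppercase ASCII letters, and the function name join_initials is hypothetical.

```python
import re

# A space qualifies only if a standalone capital letter is on BOTH sides:
#   (?<=\b[A-Z])  the character before the space is a one-letter word
#   (?=[A-Z]\b)   the character after the space is a one-letter word
BETWEEN_INITIALS = re.compile(r"(?<=\b[A-Z]) (?=[A-Z]\b)")


def join_initials(name: str) -> str:
    """Collapse 'A B C' into 'ABC' while leaving title and surname alone."""
    return BETWEEN_INITIALS.sub("", name)


for name in ["Mr A B Bloggs", "Mrs A B C Bloggs", "Mr Bloggs"]:
    print(join_initials(name))
```

Because the lookarounds inspect the original string and consume nothing, every qualifying space is found in one pass, which sidesteps the shifting-position problem that a manual right-to-left removal has to work around.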

Informix Accent Insensitive Search

Is there any way (a function, a config option, etc.) to force Informix to ignore accents in searches?
Example:
select id, name from user where name like 'conceição%'
Returns:
1 | conceicao oliveira
2 | conceiçao santos
3 | conceicão andrade
4 | conceição barros
Thanks
Not directly, that I'm aware of. You could install the Regex DataBlade module and then use its regexp_match function, replacing the query with something like this:
where regexp_match(name , '^concei[çc][ãa]o.*')
Or, if you don't have that option, I would add another 'normalized_name' column in which every accented character is replaced with a "standard" character, and then query the table on that column.
name='conceiçao santos', normalized_name='conceicao santos'
Adding a normalized column will also make sure you're not dependent on any module, or on any particular database for that matter.
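One way to populate such a normalized column is to strip the accents in application code before inserting or updating the row. The Python sketch below shows the idea using Unicode decomposition. The function name normalize_name is hypothetical, and it is an assumption that the names reach the application as proper Unicode strings.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


rows = [
    "conceicao oliveira",
    "conceiçao santos",
    "conceicão andrade",
    "conceição barros",
]

# The same function is applied to the search term, so both sides compare equal.
prefix = normalize_name("conceição")
print([r for r in rows if normalize_name(r).startswith(prefix)])
```

The query would then compare the normalized search term against normalized_name, so all four spellings match the same prefix.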