REGEXP REPLACE with backslashes in Spark-SQL - sql

I have a string containing \s\ keyword. Now, I want to replace it with NULL.
select string,REGEXP_REPLACE(string,'\\\s\\','') from test
But unable to replace with the above statement in spark sql
input: \s\help
output: help
want to use regexp_replace

To replace one \ in the actual string you need to use \\\\ (4 backslashes) in the pattern of the regexep_replace. Please do look at https://stackoverflow.com/a/4025508/9042433 to understand why 4 backslashes are needed to replace just one backslash
So, the required statement would become like below
select name, regexp_replace(name, '\\\\s\\\\', '') from test
Below screenshot has examples for better understanding

Related

TERADATA REGEXP_SUBSTR Get string between two values

I am fairly new to teradata, but I was trying to understand how to use REGEXP_SUBSTR
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didnt seem to work.
Can anyone help?
It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.

How to remove a specific part of a string in Postgres SQL?

Say I have a column in postgres database called pid that looks like this:
set1/2019-10-17/ASR/20190416832-ASR.pdf
set1/2019-03-15/DEED/20190087121-DEED.pdf
set1/2021-06-22/DT/20210376486-DT.pdf
I want to remove everything after the last dash "-" including the dash itself. So expected results:
set1/2019-10-17/ASR/20190416832.pdf
set1/2019-03-15/DEED/20190087121.pdf
set1/2021-06-22/DT/20210376486.pdf
I've looked into replace() and split_part() functions but still can't figure out how
to do this. Please advise.
We can use a regex replacement here:
SELECT col, REGEXP_REPLACE(col, '^(.*)-[^-]+(\.\w+)$', '\1\2') AS col_out
FROM yourTable;
The regex used above captures the column value before the last dash in \1, and the extension in \2. It then builds the output using \1\2 as the replacement.
Here is a working regex demo.

How do I print the first occurence of a string after a special character in Hive using reg_extract or split?

I am having a deep dilemma in hive. My data set in Hive looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the reg_extract function, but alas I am getting only NULL values. Since hive doesn't support reg_substr, the following synatax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are wecome!
You can use regexp_replace and then substr combination.
First remove all multiple occurrences of # from the string using regexp_replace().
regexp_replace(col,'#+','#') -- for data '#####123##' this will produce '#123#'
Then remove first # using substr. And then use instr to fetch everything starting from first till #.
substr(substr(str,2),1, instr(substr(str,2),'#')-1) this will produce '123'
You can see whole sql below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed you always have # in the beginning. if you just add if left(str,1)='#'... and handle according to the data.

Extract characters between a string and the first occurrence of something in BigQuery

I want to extract a set of characters between "u1=" and the first semi-colon using a regex. For instance, given the following string: id=1w54;name=nick;u1=blue;u2=male;u3=ohio;u5=
The desired regex output should be just blue.
I tested (?<=u1=)[^;]* on https://regex101.com and it works. However, when I run this in BigQuery, using regexp_extract(string, '(?<=u1=)[^;]*') , I get an error that reads "Cannot parse regular expression: invalid perl operator: (?<"
I'm confused why this isn't working in BQ. Any help would be appreciated.
You can use regexp_extract() like this:
regexp_extract(string, 'u1=([^;]+)')

How to remove semicolon in a string in HIVE SQL

I am trying to remove ";" semi-colon from a string.
What command in HIVE SQL should I use. I know regexp_replace may work..but what to put ?
It appears that ; - the special character does not work but other special characters like , or : works.
For example ,
Data looks like
;;;;;0123445
I want the data to look like this
0123445
Any help on this will be appreciated. I have been struggling with this.
REGEXP_REPLACE indeed looks like a good pick. For example, this removes all semicolons from the field :
REGEXP_REPLACE(my_column, ';', '')
From the documentation :
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT.
Please note that the semicolon has no special meaning in the regexp language.
If you want to match on semicolons on the beginning of the string only (as shown in your question), use regexp special character ^, which indicates the beginning of the string
REGEXP_REPLACE(my_column, '^;', '')
To remove all semicolons, you can simply use replace():
replace(my_column, ';', '')
To remove leading semicolons, you can use:
replace(my_column, '^;+', '')
In Hive you need to escape the semi-colon.
regexp_replace(column_name,'\;','')