TERADATA REGEXP_SUBSTR Get string between two values - sql

I am fairly new to teradata, but I was trying to understand how to use REGEXP_SUBSTR
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didnt seem to work.
Can anyone help?

It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.

Related

How to remove a specific part of a string in Postgres SQL?

Say I have a column in postgres database called pid that looks like this:
set1/2019-10-17/ASR/20190416832-ASR.pdf
set1/2019-03-15/DEED/20190087121-DEED.pdf
set1/2021-06-22/DT/20210376486-DT.pdf
I want to remove everything after the last dash "-" including the dash itself. So expected results:
set1/2019-10-17/ASR/20190416832.pdf
set1/2019-03-15/DEED/20190087121.pdf
set1/2021-06-22/DT/20210376486.pdf
I've looked into replace() and split_part() functions but still can't figure out how
to do this. Please advise.
We can use a regex replacement here:
SELECT col, REGEXP_REPLACE(col, '^(.*)-[^-]+(\.\w+)$', '\1\2') AS col_out
FROM yourTable;
The regex used above captures the column value before the last dash in \1, and the extension in \2. It then builds the output using \1\2 as the replacement.
Here is a working regex demo.

How do I print the first occurence of a string after a special character in Hive using reg_extract or split?

I am having a deep dilemma in hive. My data set in Hive looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the reg_extract function, but alas I am getting only NULL values. Since hive doesn't support reg_substr, the following synatax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are wecome!
You can use regexp_replace and then substr combination.
First remove all multiple occurrences of # from the string using regexp_replace().
regexp_replace(col,'#+','#') -- for data '#####123##' this will produce '#123#'
Then remove first # using substr. And then use instr to fetch everything starting from first till #.
substr(substr(str,2),1, instr(substr(str,2),'#')-1) this will produce '123'
You can see whole sql below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed you always have # in the beginning. if you just add if left(str,1)='#'... and handle according to the data.

REGEXP REPLACE with backslashes in Spark-SQL

I have a string containing \s\ keyword. Now, I want to replace it with NULL.
select string,REGEXP_REPLACE(string,'\\\s\\','') from test
But unable to replace with the above statement in spark sql
input: \s\help
output: help
want to use regexp_replace
To replace one \ in the actual string you need to use \\\\ (4 backslashes) in the pattern of the regexep_replace. Please do look at https://stackoverflow.com/a/4025508/9042433 to understand why 4 backslashes are needed to replace just one backslash
So, the required statement would become like below
select name, regexp_replace(name, '\\\\s\\\\', '') from test
Below screenshot has examples for better understanding

Finding strings between dashes using REGEXP_EXTRACT in Bigquery

In Bigquery, I am trying to find a way to extract particular segments of a string based on how many dashes come before it. The number of total dashes in the string will always be the same. For example, I could be looking for the string after the second dash and before the third dash in the following string:
abc-defgh-hij-kl-mnop
Currently, I am using the following regex to extract, which counts the dashes from the back:
([^-]+)(?:-[^-]+){2}$
The problem is that if there is nothing in between the dashes, the regex doesn't work. For example, something like this returns null:
abc-defgh-hij--mnop
Is there a way to use regex to extract a string after a certain number of dashes and cut it off before the subsequent dash?
Thank you!
Below is for BigQuery Standrd SQL
The simplest way in your case is to use SPLIT and OFFSET as in below example
SELECT SPLIT(str, '-')[OFFSET(3)]
above will return empty string for abc-defgh-hij--mnop
to prevent error in case of calling non-existing element - better to use SAFE_OFFSET
SELECT SPLIT(str, '-')[SAFE_OFFSET(3)]

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_ and _', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the piece of a string starting at a value until the end (or for a particular length)
I'd also utilize the function locate to determine the location of the '-' and '_' in the URL.
As long as there are always three dashes and three underscores this should be pretty straight forward.
Might need case statements to determine number of dashes and underscores otherwise.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))