Regexp_Extract BigQuery anything up to "|" - sql

I'm fairly new to coding and I was wondering if you could give me a hand writing some regular expression for BigQuery SQL.
Basically I would like to extract everything before the bar sign "|" for one of my column.
Example:
Source string:
bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah|someMore_string_stuff-IDontNeed
Desired output:
bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah
I thought about using the REGEXP_EXTRACT(string, delimiter) function but I'm totally unable to write some regex (LOL). Therefore I had a look over Stack, and have found stuff like:
SELECT REGEXP_EXTRACT( String_Name , "\S*\s*\|" ) ,
# or
SELECT REGEXP_EXTRACT( String_Name , '.+?(?=|)')
But every time I get error messages like " invalid perl operator: (?= " or "Illegal escape space"
Would you have any suggestions on why I get these messages and/or how could I proceed to extract these strings?
Many many thanks in advance <3

You can use SPLIT instead:
SELECT SPLIT("bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah|someMore_string_stuff-IDontNeed", "|")[OFFSET(0)]

Prefix the pattern string with r:
SELECT REGEXP_EXTRACT(String_Name, r'\S*\s*\|')
This is the syntax for a raw string constant. You can review what this means in the documentation.

Related

TERADATA REGEXP_SUBSTR Get string between two values

I am fairly new to teradata, but I was trying to understand how to use REGEXP_SUBSTR
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didnt seem to work.
Can anyone help?
It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.

How do I print the first occurence of a string after a special character in Hive using reg_extract or split?

I am having a deep dilemma in hive. My data set in Hive looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the reg_extract function, but alas I am getting only NULL values. Since hive doesn't support reg_substr, the following synatax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are wecome!
You can use regexp_replace and then substr combination.
First remove all multiple occurrences of # from the string using regexp_replace().
regexp_replace(col,'#+','#') -- for data '#####123##' this will produce '#123#'
Then remove first # using substr. And then use instr to fetch everything starting from first till #.
substr(substr(str,2),1, instr(substr(str,2),'#')-1) this will produce '123'
You can see whole sql below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed you always have # in the beginning. if you just add if left(str,1)='#'... and handle according to the data.

Extract characters between a string and the first occurrence of something in BigQuery

I want to extract a set of characters between "u1=" and the first semi-colon using a regex. For instance, given the following string: id=1w54;name=nick;u1=blue;u2=male;u3=ohio;u5=
The desired regex output should be just blue.
I tested (?<=u1=)[^;]* on https://regex101.com and it works. However, when I run this in BigQuery, using regexp_extract(string, '(?<=u1=)[^;]*') , I get an error that reads "Cannot parse regular expression: invalid perl operator: (?<"
I'm confused why this isn't working in BQ. Any help would be appreciated.
You can use regexp_extract() like this:
regexp_extract(string, 'u1=([^;]+)')

REGEXP REPLACE with backslashes in Spark-SQL

I have a string containing \s\ keyword. Now, I want to replace it with NULL.
select string,REGEXP_REPLACE(string,'\\\s\\','') from test
But unable to replace with the above statement in spark sql
input: \s\help
output: help
want to use regexp_replace
To replace one \ in the actual string you need to use \\\\ (4 backslashes) in the pattern of the regexep_replace. Please do look at https://stackoverflow.com/a/4025508/9042433 to understand why 4 backslashes are needed to replace just one backslash
So, the required statement would become like below
select name, regexp_replace(name, '\\\\s\\\\', '') from test
Below screenshot has examples for better understanding

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_ and _', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the piece of a string starting at a value until the end (or for a particular length)
I'd also utilize the function locate to determine the location of the '-' and '_' in the URL.
As long as there are always three dashes and three underscores this should be pretty straight forward.
Might need case statements to determine number of dashes and underscores otherwise.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))