Replace String from the End of a String | REGEXP_REPLACE() - google-bigquery

I am looking (probably) for REGEXP_REPLACE() in BigQuery to remove specific strings from the end of another string.
I need to remove ".html" and ".htm" and "/" (...plus a few more strings) from the end of the following URLs:
someurl.com/page.html
someurl.com/page.htm
someurl.com/page/
someurl.com/page/
I know I need REGEXP_REPLACE() but I'm too lame to build it.
Can someone give me a little push?
Thx!
DZ

Use the query below:
select
url,
regexp_replace(url, r'(\.html|\.htm|/)$', '') as output
from t
Applied to the sample data in your question, the output for every row is someurl.com/page.
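To check the pattern outside BigQuery, here is a minimal sketch in Python. It assumes Python's re module behaves like BigQuery's regex engine for this simple pattern; the helper name strip_suffix and the "learn-xhtml" URL are made up for illustration. It also shows why the dots should be escaped: an unescaped "." matches any character, so it can eat part of a page name.

```python
import re

# Escaped dots match a literal "."; "$" anchors the match at the end of the string.
SUFFIX = re.compile(r'(\.html|\.htm|/)$')

def strip_suffix(url: str) -> str:
    """Remove one trailing ".html", ".htm" or "/" (hypothetical helper)."""
    return SUFFIX.sub('', url)

urls = ['someurl.com/page.html', 'someurl.com/page.htm', 'someurl.com/page/']
cleaned = [strip_suffix(u) for u in urls]

# Pitfall: with unescaped dots, ".html" also matches "xhtml".
tricky = 'someurl.com/learn-xhtml'
unescaped_result = re.sub(r'(.html|.htm|/)$', '', tricky)
escaped_result = strip_suffix(tricky)
```

To cover the "few more strings", add further alternatives inside the parentheses, for example \.php, keeping the trailing $.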

Related

TERADATA REGEXP_SUBSTR Get string between two values

I am fairly new to Teradata, and I am trying to understand how to use REGEXP_SUBSTR.
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didn't seem to work.
Can anyone help?
It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.
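Note that an unescaped ^ is the start-of-string anchor, which is why the pattern above writes the literal caret as \^. The sketch below replays the replacement in Python, purely to check the regex logic (Python's re module is a stand-in here, not Teradata's engine), and adds the capture-group variant mentioned above for comparison.

```python
import re

value = 'ABCD^1234567890^1'

# Replacement approach: delete everything up to the first "^"
# and everything from the second "^" to the end.
removed = re.sub(r'^.*?\^|\^.*$', '', value)

# Capture-group approach: take the text between the first two carets.
m = re.search(r'\^([^\^]*)\^', value)
captured = m.group(1) if m else None
```

Both variables end up holding the middle field for this sample value.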

Alternative for Positive Lookahead on Big Query - Match everything before the last delimiter

I'm currently cleaning up URLs and I want to get everything before the last slash ("/")
This is an example string:
https://www.businessinsider.de/gruenderszene/plus-angebot/?tpcc=onsite_gs_header_nav&verification_code=DOVCGF75J8LSID
and the part I want to extract is: https://www.businessinsider.de/gruenderszene/plus-angebot
With normal RegEx, it is super simple with .*(?=\/)
You can see it here on regex101.com
Can you help me to replicate this on BigQuery please, as they don't allow for lookahead/lookbehind?
I might phrase this as a regex replacement which removes the last path separator and path:
SELECT url, REGEXP_REPLACE(url, r'/[^/]+$', '') AS url_out
FROM yourTable;
If you want to specifically target a final path separator immediately followed by a query parameter, then use:
SELECT url, REGEXP_REPLACE(url, r'/\?[^/]+$', '') AS url_out
FROM yourTable;
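Here is a small Python sketch of the first pattern, assuming Python's re module matches BigQuery's behaviour for patterns this simple. It also tries a capture-group alternative, (.*)/, which is the closest lookahead-free equivalent of .*(?=\/); in BigQuery that would presumably go through REGEXP_EXTRACT with one capturing group.

```python
import re

url = ('https://www.businessinsider.de/gruenderszene/plus-angebot/'
       '?tpcc=onsite_gs_header_nav&verification_code=DOVCGF75J8LSID')

# Replacement approach: drop the last "/" and everything after it.
replaced = re.sub(r'/[^/]+$', '', url)

# Capture-group approach: greedy ".*" backtracks to the last "/".
m = re.match(r'(.*)/', url)
extracted = m.group(1) if m else None

# The two differ when the string ends in "/": the replacement pattern
# needs at least one character after the final slash, so nothing is removed.
trailing = 'someurl.com/page/'
replaced_trailing = re.sub(r'/[^/]+$', '', trailing)
extracted_trailing = re.match(r'(.*)/', trailing).group(1)
```

If trailing slashes should be stripped too, the capture-group form is the safer choice.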

SQL Regex - Select everything after '/' and split into array

I have to write an HSQLDB query that splits this string on '/':
/2225/golf drive/#305/Huntsville/AL/1243
This is where I am at
select REGEXP_SUBSTRING_ARRAY(Terms, ''/[a-zA-Z0-9]*'') as ARR from Address
This is giving me
/2225, /golf, /, /Huntsville, /AL, /1243 - (Missing "#305" and "drive" in second split)
How can I modify the regex such that it includes everything after "/" and give me this result
/2225, /golf drive, /#305, /Huntsville, /AL, /1243
In this case, why not use the regexp /[a-zA-Z0-9, #]*? It looks good for your goal.
I've checked, and it works for me here: https://regex101.com/r/8bJQEk/1
PS: The regexp /\/([^\/]*)/g can help to split everything (be careful with the slashes).
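To compare the three patterns side by side, here is a quick Python sketch. re.findall stands in for REGEXP_SUBSTRING_ARRAY, on the assumption that both simply return every match in order.

```python
import re

terms = '/2225/golf drive/#305/Huntsville/AL/1243'

# Original pattern: stops at the space and at "#".
original = re.findall(r'/[a-zA-Z0-9]*', terms)

# Widened character class: also allows comma, space and "#".
widened = re.findall(r'/[a-zA-Z0-9, #]*', terms)

# Negated class: a slash followed by anything that is not a slash.
negated = re.findall(r'/[^/]*', terms)
```

The negated class is the more general of the two fixes, since it does not need updating when a new character shows up in the address.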

Split a string to only use the middle part in SQL

I have a string
ABC - ABCDEFGHIJK - 05/07/2016
I want to only use the ABCDEFGHIJK section and remove the first and third parts of the string.
I have tried using SUBSTRING with CHARINDEX, but was only able to remove the first part of the string.
Can anyone help with this?
You can use SUBSTRING_INDEX() (a MySQL function) to pick the middle part and TRIM() to strip the spaces:
SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(string_col,'-',2),'-',-1)) AS String_Col
FROM YourTable;
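To see what the two nested calls do step by step, here is a rough Python sketch. The substring_index helper is hypothetical and only imitates how MySQL's SUBSTRING_INDEX treats positive and negative counts.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Hypothetical helper imitating MySQL's SUBSTRING_INDEX."""
    parts = s.split(delim)
    if count > 0:
        # Everything before the count-th delimiter, counted from the left.
        return delim.join(parts[:count])
    if count < 0:
        # Everything after the count-th delimiter, counted from the right.
        return delim.join(parts[count:])
    return ''

value = 'ABC - ABCDEFGHIJK - 05/07/2016'

inner = substring_index(value, '-', 2)    # keep the first two fields
outer = substring_index(inner, '-', -1)   # keep the last of those
middle = outer.strip()                    # TRIM() equivalent
```

The inner call cuts off the date, the outer call cuts off the first field, and the trim removes the spaces left around the dashes.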

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_ and _', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function, as documented here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the part of a string from a given start position to the end (or for a given length).
I'd also use the locate function to find the positions of the '-' and '_' characters in the URL.
As long as there are always three dashes and three underscores, this should be pretty straightforward.
Otherwise, you might need CASE statements to handle a varying number of dashes and underscores.
Solution here:
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_and_)|(\.html)|_', ''))
Interestingly, the following did NOT work. JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_and_)|(\.html)|_]', 0))
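A likely reason: square brackets form a character class, so the parentheses and | inside [^...] are just literal characters, not grouping or alternation. The pattern therefore means "one single character that is not in this set", and REGEXP_EXTRACT returns only the first such character. The Python sketch below illustrates this with the sample URL and its _and_ separator; Python's re module is only a stand-in for Hive's Java regex engine, which treats this class the same way as far as I know.

```python
import re

url = '/Cats-g294078-o303631-Maine_Coon_and_Tabby.html'

# Same idea as LOCATE('-', url, 19) + 1: find the dash at or after
# 1-based position 19 (0-based index 18) and keep what follows it.
tail = url[url.index('-', 18) + 1:]

# Alternation: remove "_and_", ".html" and any remaining underscore.
alternation = re.sub(r'(_and_)|(\.html)|_', '', tail).lower()

# Negated character class: matches ONE character outside the set
# ( _ a n d ) | . h t m l, so only the first such character comes back.
char_class = re.search(r'[^(_and_)|(\.html)|_]', tail).group(0)
```

So the REGEXP_REPLACE version removes the unwanted pieces, while the REGEXP_EXTRACT version can only ever return a single character.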