SQL Regex - Select everything after '/' and split into array - sql

I have to write an HSQLDB query that splits this string on '/':
/2225/golf drive/#305/Huntsville/AL/1243
This is where I am at
select REGEXP_SUBSTRING_ARRAY(Terms, ''/[a-zA-Z0-9]*'') as ARR from Address
This is giving me
/2225, /golf, /, /Huntsville, /AL, /1243 - (Missing "#305" and "drive" in second split)
How can I modify the regex so that it includes everything after each "/" and gives me this result?
/2225, /golf drive, /#305, /Huntsville, /AL, /1243

In this case, why not use the regexp /[a-zA-Z0-9, #]*? It looks sufficient for your goal.
I've checked, and it works for me here: https://regex101.com/r/8bJQEk/1
PS: The regexp /\/([^\/]*)/g (written as a JavaScript-style literal) can help to split on every slash, whatever characters sit between them. Be careful with escaping the slashes.
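To see why the original character class drops "drive" and "#305", here is a minimal sketch that uses Python's `re` module as a stand-in for the HSQLDB regex engine (an assumption: the two flavors agree on these simple patterns, but this is not HSQLDB code). The variable names are illustrative only.

```python
import re

terms = "/2225/golf drive/#305/Huntsville/AL/1243"

# Original pattern: the class allows only letters and digits,
# so each match stops at the first space or '#'.
original = re.findall(r"/[a-zA-Z0-9]*", terms)

# Negated class: a slash followed by everything up to the next slash.
fixed = re.findall(r"/[^/]*", terms)

print(original)
print(fixed)
```

If the engine behaves the same way, passing '/[^/]*' as the pattern to REGEXP_SUBSTRING_ARRAY should return one element per path segment.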

Related

TERADATA REGEXP_SUBSTR Get string between two values

I am fairly new to teradata, but I was trying to understand how to use REGEXP_SUBSTR
For example I have the following cell value = ABCD^1234567890^1
How can I extract 1234567890
What I attempted to do is the following:
REGEXP_SUBSTR(x, '(?<=^).*?(?=^)')
But this didn't seem to work.
Can anyone help?
It might (or might not) be possible to use REGEXP_SUBSTR() to handle this, but you would need to use a capture group. An alternative here would be to do a regex replacement instead:
SELECT x, REGEXP_REPLACE(x, '^.*?\^|\^.*$', '') AS output
FROM yourTable;
The regex pattern used here matches:
^.*?\^ everything from the start to the first ^
| OR
\^.*$ everything from the second ^ to the end
We then replace with empty string to remove the content being matched.
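The attempt in the question most likely fails because an unescaped ^ is the start-of-string anchor, not a literal caret; it has to be written as \^. As a rough check of the replacement pattern, and of the capture-group idea mentioned above, here is a sketch in Python's `re` (an assumption: Teradata's regex flavor treats these patterns the same way).

```python
import re

x = "ABCD^1234567890^1"

# Replacement approach: delete everything up to the first '^'
# and everything from the second '^' to the end.
by_replace = re.sub(r"^.*?\^|\^.*$", "", x)

# Capture-group approach: take the text between two literal carets.
m = re.search(r"\^(.*?)\^", x)
by_group = m.group(1) if m else None

print(by_replace, by_group)
```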

Alternative for Positive Lookahead on Big Query - Match everything before the last delimiter

I'm currently cleaning up URLs and I want to get everything before the last slash ("/")
This is an example string:
https://www.businessinsider.de/gruenderszene/plus-angebot/?tpcc=onsite_gs_header_nav&verification_code=DOVCGF75J8LSID
and the part I want to extract is: https://www.businessinsider.de/gruenderszene/plus-angebot
With normal RegEx, it is super simple with .*(?=\/)
You can see it here on regex101.com
Can you help me to replicate this on BigQuery please, as they don't allow for lookahead/lookbehind?
I might phrase this as a regex replacement which removes the last path separator and path:
SELECT url, REGEXP_REPLACE(url, r'/[^/]+$', '') AS url_out
FROM yourTable;
If you want to specifically target a final path separator immediately followed by a query parameter, then use:
SELECT url, REGEXP_REPLACE(url, r'/\?[^/]+$', '') AS url_out
FROM yourTable;
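Both patterns can be sanity-checked outside BigQuery. This is a minimal sketch in Python's `re`, assuming it agrees with BigQuery's RE2 engine for these lookaround-free patterns; the `example.com` URL is a made-up input added to show one edge case.

```python
import re

url = ("https://www.businessinsider.de/gruenderszene/plus-angebot/"
       "?tpcc=onsite_gs_header_nav&verification_code=DOVCGF75J8LSID")

# Remove the last '/' and everything after it.
url_out = re.sub(r"/[^/]+$", "", url)

# Remove the last '/' only when a query string follows it directly.
query_only = re.sub(r"/\?[^/]+$", "", url)

# Edge case (hypothetical URL): with '+', a trailing slash that has
# nothing after it does not match, so the string is left unchanged.
trailing = re.sub(r"/[^/]+$", "", "https://example.com/a/")

print(url_out)
```

If a bare trailing slash should also be removed, `[^/]*` instead of `[^/]+` would cover that case.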

REGEX to search for and remove all characters up to and including the last hyphen

I am looking for a way to search for and remove everything up to and including the - from my strings below. I have tried variations and none works exactly how I want it to. I tried regex_replace, but it did not catch all of them, and I found myself creating individual regexp statements for each scenario, which did not seem any better than hard-coding. I am hoping someone has a solution. I would very much appreciate it.
POLY GON - HOME
POLY-GON-HOME
POLY - GON - HOME
POLY - GON HOME
PG - HOME
PG-HOME
I want to show everything after the second hyphen. So, HOME is what I want to display.
I tried
regexp_replace(string,\A[^-]+-[^-]+)
but it removes everything except for the second hyphen. Otherwise it works.
Use
SELECT regexp_replace(string, '^.*-', '')
^ matches the beginning of the string.
.* matches any string
- matches hyphen
Since * is greedy, this will match everything up to the last hyphen. It then gets replaced with an empty string.
I am thinking something like:
select regexp_substr(string, '[^-]+$')
This basically keeps the last string of characters that are not hyphens.
Not all databases that support regular expressions support regexp_substr(), but they have some similar function.
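Here is a quick sketch of both suggestions applied to the sample strings, using Python's `re` as a stand-in for the database's regex functions (an assumption; the helper names are made up). Note that both keep any space after the last hyphen, and that "POLY - GON HOME" contains only one hyphen, so " GON HOME" is what remains; the third helper, which also consumes whitespace after the hyphen, is my own variation and not part of either answer.

```python
import re

samples = [
    "POLY GON - HOME",
    "POLY-GON-HOME",
    "POLY - GON - HOME",
    "POLY - GON HOME",
    "PG - HOME",
    "PG-HOME",
]

def after_last_hyphen(s):
    # Greedy '.*' runs to the last hyphen, which is then removed with it.
    return re.sub(r"^.*-", "", s)

def last_run(s):
    # Keep the final run of non-hyphen characters.
    m = re.search(r"[^-]+$", s)
    return m.group(0) if m else None

def after_last_hyphen_trimmed(s):
    # Variation: also drop whitespace that follows the last hyphen.
    return re.sub(r"^.*-\s*", "", s)

for s in samples:
    print(repr(after_last_hyphen(s)), repr(last_run(s)),
          repr(after_last_hyphen_trimmed(s)))
```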

SQL substring non greedy regex

I have data like
http://www.linz.at/politik_verwaltung/32386.asp
stored in a text column. I thought a non-greedy extraction with
select substring(turl from '\..*?$') as ext from tdata
would give me .asp, but instead it still greedily results in
.linz.at/politik_verwaltung/32386.asp
How can I match only against the last occurrence of the dot (.)?
Using Postgresql 9.3
\.[^.]*$ matches . followed by any number of non-dot characters followed by end-of-string:
# select substring('http://www.linz.at/politik_verwaltung/32386.asp'
from '\.[^.]*$');
substring
-----------
.asp
(1 row)
As for why the non-greedy quantifier does not help here: the match still starts as early as possible (at the first dot), and only from there on does it try to match as little as possible.
Try this:
\.[\w]*$
Here is how it works:
It matches a dot (\.) followed by any number (*) of word characters (\w) up to the end of the string ($), so the match includes the last . itself.
Note: I updated the answer; it now also captures strings that end with a dot.
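The leftmost-match behavior is easy to reproduce. This is a minimal sketch in Python's `re`, assuming it follows the same leftmost-first rule as PostgreSQL for these three patterns; the variable names are illustrative.

```python
import re

turl = "http://www.linz.at/politik_verwaltung/32386.asp"

# Lazy quantifier: the match still begins at the FIRST dot, and '$'
# then forces it to extend all the way to the end of the string.
lazy = re.search(r"\..*?$", turl).group(0)

# Negated class: no further dot may appear, so only the last dot fits.
last_dot = re.search(r"\.[^.]*$", turl).group(0)

# Word-character variant from the second answer.
word = re.search(r"\.\w*$", turl).group(0)

print(lazy, last_dot, word)
```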

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_and_', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the piece of a string starting at a value until the end (or for a particular length)
I'd also utilize the function locate to determine the location of the '-' and '_' in the URL.
As long as there are always three dashes and three underscores this should be pretty straightforward.
Might need case statements to determine number of dashes and underscores otherwise.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
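To trace what this expression does step by step, here is a sketch that mirrors LOCATE, SUBSTRING, REGEXP_REPLACE and LOWER with Python equivalents (an assumption: Hive's Java regex engine treats these patterns the same way). The posted pattern removes '_to_', while the sample URL in the question uses '_and_', so the second variant with '_and_' is my adaptation for that sample.

```python
import re

cat_string = "/Cats-g294078-o303631-Maine_Coon_and_Tabby.html"

# LOCATE('-', s, 19) is 1-based; the 0-based equivalent starts at index 18.
start = cat_string.find("-", 18)

# SUBSTRING(s, pos + 1): everything after that hyphen.
tail = cat_string[start + 1:]

# Pattern as posted: strips '_to_', '.html' and any remaining underscore.
as_posted = re.sub(r"(_to_)|(\.html)|_", "", tail).lower()

# Adapted to the sample URL, which uses '_and_' as the separator.
with_and = re.sub(r"(_and_)|(\.html)|_", "", tail).lower()

print(tail, as_posted, with_and)
```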
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))
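A likely explanation: square brackets define a character class, so '[^(_to_)|(\.html)|_]' does not mean "not these substrings". It matches exactly one character that is not any of ( _ t o ) | . h m l, and REGEXP_EXTRACT with index 0 returns only that first match. A small sketch in Python's `re` (assuming Hive's Java regex engine reads the class the same way) shows the effect:

```python
import re

tail = "Maine_Coon_and_Tabby.html"

# Inside [...], parentheses and '|' are literal characters, not
# grouping or alternation, so this matches a single character.
m = re.search(r"[^(_to_)|(\.html)|_]", tail)
first_match = m.group(0) if m else None

print(first_match)
```

Removing the unwanted pieces with REGEXP_REPLACE, as in the working solution, avoids this because alternation outside brackets does match whole substrings.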