Alternative for Positive Lookahead on Big Query - Match everything before the last delimiter - google-bigquery

I'm currently cleaning up URLs and I want to get everything before the last slash ("/")
This is an example string:
https://www.businessinsider.de/gruenderszene/plus-angebot/?tpcc=onsite_gs_header_nav&verification_code=DOVCGF75J8LSID
and the part I want to extract is: https://www.businessinsider.de/gruenderszene/plus-angebot
With normal RegEx, it is super simple with .*(?=\/)
You can see it here on regex101.com
Can you help me to replicate this on BigQuery please, as they don't allow for lookahead/lookbehind?

I might phrase this as a regex replacement which removes the last path separator and path:
SELECT url, REGEXP_REPLACE(url, r'/[^/]+$', '') AS url_out
FROM yourTable;
If you want to specifically target a final path separator immediately followed by a query parameter, then use:
SELECT url, REGEXP_REPLACE(url, r'/\?[^/]+$', '') AS url_out
FROM yourTable;

Related

REGEXP_REPLACE URL BIGQUERY

I have two types of URL's which I would need to clean, they look like this:
["//xxx.com/se/something?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
["//www.xxx.com/se/car?p_color_car=White?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
The outcome I want is;
SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"
I want to remove the brackets and everything up to SE, the URLS differ so I want to remove:
First URL
["//xxx.com/se/something?
Second URL:
["//www.xxx.com/se/car?p_color_car=White?
I can't get my head around it,I've tried this .*\/ . But it will still keep strings I don't want such as:
(1 url) =
something?
(2 url) car?p_color_car=White?
You can use
regexp_replace(FinalUrls, r'.*\?|"\]$', '')
See the regex demo
Details
.*\? - any zero or more chars other than line breakchars, as many as possible and then ? char
| - or
"\]$ - a "] substring at the end of the string.
Mind the regexp_replace syntax, you can't omit the replacement argument, see reference:
REGEXP_REPLACE(value, regexp, replacement)
Returns a STRING where all substrings of value that match regular
expression regexp are replaced with replacement.
You can use backslashed-escaped digits (\1 to \9) within the
replacement argument to insert text matching the corresponding
parenthesized group in the regexp pattern. Use \0 to refer to the
entire matching text.

Big Query Regex Extraction

I am trying to extract a item_subtype field from an URL.
This regex works fine in the to get the first item item_type
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_type=(\w+)')
but what is the correct regex to get everything starting from 'chocolate' all the way to before the '&page1'
I have tried this, but can't seem to get it to work to go further
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=(\w+[^Z])')
basically, I want to extract 'chocolate/cookies%20cream,vanilla'
In your case, \w+ only matches one or more letters, digits or underscores. Your expected values may contain other characters, too.
You may use
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=([^&]+)')
See the regex demo.
Notes:
item_subtype= - this string is matched as a literal char sequence
([^&]+) - a Capturing group 1 that matches and captures one or more chars other than & into a separate memory buffer that is returned by REGEXP_EXTRACT function.

Regexp_extract everything after appearance of '-q_'

Have strings containing 'q_' which I want to extract everything that comes after it. Some rows contain occurrence of q_ which I want everything that occurs after it. Example values in the column are:
prod-q_cat_trait_cat_social_issue
_prod-q_body_modification_graffiti
event_tickets
dappled_grey
_prod-q_cat_tech_support
What is wrong with my regular expression as I'm trying to remove the trailing '_' after q.
REGEXP_EXTRACT(queue_id, '[^q_]+$')
Is just returning
issue
I've also tried the split method:
SPLIT(queue_id, 'q_')[OFFSET(2)]
But this returns
Array index 2 is out of bounds (overflow)
Any suggestions. Thanks! (I am using Google Cloud SQL)
Using a capturing group, you may extract all after the first q_ with:
REGEXP_EXTRACT(queue_id, 'q_(.*)')
You may extract all after the last q_ with:
REGEXP_EXTRACT(queue_id, '.*q_(.*)')
See the regex demo #1 and regex demo #2.
Here, q_ finds the first occurrence of q_ and (.*) grabs the rest of the line into Group 1, and this is the value returned by REGEXP_EXTRACT. .* matches any 0+ chars other than line break chars as many as possible, that is why the second regex will start capturing the rest of the line after the last occurrence of q_.
Google Cloud SQL uses MySQL. I think the simplest method is substring_index():
select substring_index(queue_id, '-q_', -1)
Can you try this : q_([^q_]+)$? You'll have what you want in the first group.
Edit: this one match all the cases > (?(?<=-q_).*|^((?!-q_).)*$)

how to get specific part from string in sql

I want to retrieve file names from urls in sql.
for example:
Input:
url:
https://www.google.co.in/root/subdir/file.extension?p1=v1&p2=v2
https://www.abxdhcak.com/sitemap-companies.xml
then Output should be:
file.extension
sitemap-companies.xml
To match your expected output you can use REGEXP_REPLACE
REGEXP_REPLACE(txt, '^.*/|\?.*$') as rg
This does 2 things:
'^.*/'
This removes all characters up to and including the last forward-slash in the string.
'\?.*$'
This removes all characters after and including a question mark.
This may not work for all cases, but it works for the examples provided.

SQL Regex - Select everything after '/' and split into array

I have to write a HSQLDB query that splits this string on '/'
/2225/golf drive/#305/Huntsville/AL/1243
This is where I am at
select REGEXP_SUBSTRING_ARRAY(Terms, ''/[a-zA-Z0-9]*'') as ARR from Address
This is giving me
/2225, /golf, /, /Huntsville, /AL, /1243 - (Missing "#305" and "drive" in second split)
How can I modify the regex such that it includes everything after "/" and give me this result
/2225, /golf drive, /#305, /Huntsville, /AL, /1243
In this case why can't you use /[a-zA-Z0-9, #]* regexp? It seems good for your goal.
I've checked, it works here for me: https://regex101.com/r/8bJQEk/1
PS This regexp /\/([^\/]*)/g can helps to split everything. Be careful with slashes). Example