SQL substring non-greedy regex - sql

I have data like
http://www.linz.at/politik_verwaltung/32386.asp
stored in a text column. I thought a non-greedy extraction with
select substring(turl from '\..*?$') as ext from tdata
would give me .asp but instead it still greedily results in
.linz.at/politik_verwaltung/32386.asp
How can I match only against the last occurrence of a dot (.)?
Using Postgresql 9.3

\.[^.]*$ matches . followed by any number of non-dot characters followed by end-of-string:
# select substring('http://www.linz.at/politik_verwaltung/32386.asp'
from '\.[^.]*$');
substring
-----------
.asp
(1 row)
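The non-greedy quantifier does not help here because the regex engine still starts matching as early as possible (at the first dot) and only then tries to match as little as possible from that point on. Since $ forces the match to reach the end of the string, the result is everything from the first dot onward.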
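The difference can be checked outside the database. The sketch below uses Python's re module as a stand-in for the PostgreSQL regex engine; that substitution is an assumption (the engines differ in details), but for these two patterns both look for the leftmost match first.

```python
import re

url = "http://www.linz.at/politik_verwaltung/32386.asp"

# Lazy quantifier: the match still starts at the FIRST dot, then grows
# one character at a time until $ can match, so it runs to the end.
lazy = re.search(r"\..*?$", url).group()

# Negated class: no further dot may follow the matched dot, so only the
# last dot in the string can satisfy the pattern.
last_dot = re.search(r"\.[^.]*$", url).group()

print(lazy)      # .linz.at/politik_verwaltung/32386.asp
print(last_dot)  # .asp
```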

Try this:
\.[\w]*$
Here is how it works:
It matches a dot (\.) followed by any number (*) of word characters (\w) up to the end of the string ($), so the dot itself is part of the result.
Note: the answer has been updated; the pattern now also matches strings that end with a dot.
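To see where \.\w*$ and \.[^.]*$ agree and where they differ, here is a small comparison in Python's re module (used only as an approximation of the PostgreSQL engine). The sample strings "trailing." and "archive.tar-gz" are made up for illustration.

```python
import re

word_only = re.compile(r"\.\w*$")   # equivalent to \.[\w]*$
non_dot = re.compile(r"\.[^.]*$")

def ext(pattern, text):
    """Return the matched suffix, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None

# Hypothetical inputs: a normal extension, a trailing dot, and an
# extension that contains a non-word character ('-').
samples = ["32386.asp", "trailing.", "archive.tar-gz"]
results = {s: (ext(word_only, s), ext(non_dot, s)) for s in samples}

for s, (w, n) in results.items():
    print(s, w, n)
```

The two patterns give the same result unless the part after the last dot contains something other than letters, digits, or underscores; in that case \w* cannot reach the end of the string and nothing is returned.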

Related

Postgres - substring from the beginning to the second last occurrence of a char within a string

I need to retrieve the leading part (SEALS_LME_TRADES_MBL) of the string below. This value is in a column within my Postgres database table.
SEALS_LME_TRADES_MBL_20220919_00212.csv
I tried the functions substring, reverse, and strpos, but they all have limitations. It seems like regex is the best option; however, I was not able to get it to work.
Essentially I need the substring from the beginning up to the second-to-last '_'. I do not want the date, the sequence number, or the file extension at the end.
The closest regex I managed to get is: ^(([^]*){4})
https://regex101.com/
This looks a little wonky, but how about this?
select substring ('SEALS_LME_TRADES_MBL_20220919_00212.csv', '^(.+)_[^_]+_[^_]+')
Translation
^ from the beginning
(.+) any characters (capture and return this value), followed by
_ an underscore, followed by
[^_]+ one or more non-underscores, followed by
_ an underscore, followed by
[^_]+ one or more non-underscores
Regex greediness will cause any incidental underscores to be captured in the initial string.
Technically speaking the last portion (one or more non-underscores) can probably be omitted.
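The translation above can be tried quickly in Python's re module, used here as an assumption-laden stand-in for PostgreSQL's substring(string, pattern), which likewise returns the parenthesized group. The second pattern drops the final [^_]+ to check the remark that it can probably be omitted.

```python
import re

name = "SEALS_LME_TRADES_MBL_20220919_00212.csv"

# Greedy (.+) grabs as much as it can while still leaving room for
# two "_<non-underscores>" pieces at the end.
full = re.search(r"^(.+)_[^_]+_[^_]+", name).group(1)

# Same idea without the trailing [^_]+.
short = re.search(r"^(.+)_[^_]+_", name).group(1)

print(full)   # SEALS_LME_TRADES_MBL
print(short)  # SEALS_LME_TRADES_MBL
```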

REGEXP_REPLACE URL BIGQUERY

I have two types of URL's which I would need to clean, they look like this:
["//xxx.com/se/something?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
["//www.xxx.com/se/car?p_color_car=White?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
The outcome I want is;
SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"
I want to remove the brackets and everything up to SE, the URLS differ so I want to remove:
First URL
["//xxx.com/se/something?
Second URL:
["//www.xxx.com/se/car?p_color_car=White?
I can't get my head around it. I've tried .*\/ but it still keeps strings I don't want, such as:
(1 url) =
something?
(2 url) car?p_color_car=White?
You can use
regexp_replace(FinalUrls, r'.*\?|"\]$', '')
See the regex demo
Details
.*\? - any zero or more chars other than line break chars, as many as possible, and then a ? char
| - or
"\]$ - a "] substring at the end of the string.
Mind the regexp_replace syntax, you can't omit the replacement argument, see reference:
REGEXP_REPLACE(value, regexp, replacement)
Returns a STRING where all substrings of value that match regular
expression regexp are replaced with replacement.
You can use backslashed-escaped digits (\1 to \9) within the
replacement argument to insert text matching the corresponding
parenthesized group in the regexp pattern. Use \0 to refer to the
entire matching text.
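As a quick check of the pattern, here is a sketch in Python, where re.sub plays the role of REGEXP_REPLACE. This is an assumption for illustration only: BigQuery uses the RE2 engine, though this particular pattern uses no syntax that differs between the two.

```python
import re

urls = [
    '["//xxx.com/se/something?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]',
    '["//www.xxx.com/se/car?p_color_car=White?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]',
]

# Alternative 1 removes everything up to and including the LAST '?'
# (greedy .*); alternative 2 removes the trailing '"]'.
pattern = re.compile(r'.*\?|"\]$')

cleaned = [pattern.sub("", u) for u in urls]

for c in cleaned:
    print(c)  # SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H
```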

Big Query Regex Extraction

I am trying to extract an item_subtype field from a URL.
This regex works fine to get the first item, item_type:
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_type=(\w+)')
but what is the correct regex to get everything starting from 'chocolate' up to, but not including, '&page=1'?
I have tried this, but can't seem to get it to work to go further
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=(\w+[^Z])')
basically, I want to extract 'chocolate/cookies%20cream,vanilla'
In your case, \w+ only matches one or more letters, digits or underscores. Your expected values may contain other characters, too.
You may use
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=([^&]+)')
See the regex demo.
Notes:
item_subtype= - this string is matched as a literal char sequence
([^&]+) - a Capturing group 1 that matches and captures one or more chars other than & into a separate memory buffer that is returned by REGEXP_EXTRACT function.
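The two patterns can be compared side by side. The sketch uses Python's re module in place of REGEXP_EXTRACT (an assumption; BigQuery itself uses RE2), taking group 1 as the returned value.

```python
import re

url = ("info?item_type=icecream&item_subtype="
       "chocolate/cookies%20cream,vanilla&page=1")

# \w+ stops at the first character that is not a letter, digit or '_',
# which here is the '/'.
word_only = re.search(r"item_subtype=(\w+)", url).group(1)

# [^&]+ keeps going until the next '&', i.e. the end of the parameter.
up_to_amp = re.search(r"item_subtype=([^&]+)", url).group(1)

print(word_only)  # chocolate
print(up_to_amp)  # chocolate/cookies%20cream,vanilla
```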

Regexp_extract everything after appearance of '-q_'

I have strings, some of which contain 'q_'. For those rows I want to extract everything that comes after it. Example values in the column are:
prod-q_cat_trait_cat_social_issue
_prod-q_body_modification_graffiti
event_tickets
dappled_grey
_prod-q_cat_tech_support
What is wrong with my regular expression? I'm trying to remove everything up to the trailing '_' after q.
REGEXP_EXTRACT(queue_id, '[^q_]+$')
Is just returning
issue
I've also tried the split method:
SPLIT(queue_id, 'q_')[OFFSET(2)]
But this returns
Array index 2 is out of bounds (overflow)
Any suggestions? Thanks! (I am using Google Cloud SQL)
Using a capturing group, you may extract all after the first q_ with:
REGEXP_EXTRACT(queue_id, 'q_(.*)')
You may extract all after the last q_ with:
REGEXP_EXTRACT(queue_id, '.*q_(.*)')
See the regex demo #1 and regex demo #2.
Here, q_ finds the first occurrence of q_ and (.*) grabs the rest of the line into Group 1, and this is the value returned by REGEXP_EXTRACT. .* matches any 0+ chars other than line break chars as many as possible, that is why the second regex will start capturing the rest of the line after the last occurrence of q_.
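Here is a small sketch of all three patterns in Python's re module. The helper extract is hypothetical and only approximates REGEXP_EXTRACT (group 1 if the pattern has a group, otherwise the whole match, otherwise None). The value 'prod-q_a_q_b' is made up to show a string with two occurrences of q_.

```python
import re

def extract(pattern, text):
    """Rough stand-in for REGEXP_EXTRACT (hypothetical helper)."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(1) if m.re.groups else m.group()

row = "prod-q_cat_trait_cat_social_issue"

# [^q_] is a character class: "any char that is neither 'q' nor '_'",
# so only the tail after the last underscore survives.
attempt = extract(r"[^q_]+$", row)

first = extract(r"q_(.*)", row)            # after the first q_
no_match = extract(r"q_(.*)", "event_tickets")

twice = "prod-q_a_q_b"                      # made-up value with two q_
after_first = extract(r"q_(.*)", twice)
after_last = extract(r".*q_(.*)", twice)

print(attempt, first, no_match, after_first, after_last)
```

This also shows why the original attempt returned only issue: the brackets do not mean "the sequence q_", they list individual characters to exclude.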
Google Cloud SQL can run MySQL. If that is your engine, I think the simplest method is substring_index():
select substring_index(queue_id, '-q_', -1)
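For readers without a MySQL instance at hand, the behaviour of substring_index with a count of -1 (everything to the right of the last delimiter, or the whole string when the delimiter is absent) can be imitated in Python. The function after_last is a hypothetical helper, not part of any library.

```python
def after_last(text, delim="-q_"):
    """Imitate substring_index(text, delim, -1) for a non-empty delim."""
    # rsplit with maxsplit=1 splits once, at the last occurrence of delim;
    # when delim is absent the list has one element, the whole string.
    return text.rsplit(delim, 1)[-1]

values = [
    "prod-q_cat_trait_cat_social_issue",
    "_prod-q_body_modification_graffiti",
    "event_tickets",
]
tails = [after_last(v) for v in values]

for t in tails:
    print(t)
```

Note that rows without '-q_' come back unchanged rather than as NULL, which differs from the regex approach.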
Can you try this: q_([^q_]+)$? You'll have what you want in the first group.
Edit: this one matches all the cases: (?(?<=-q_).*|^((?!-q_).)*$)

Regular expression to match specific variations of function

I am trying to construct a regular expression to find the text of the following variations.
NSLocalizedString(#"TEXT")
NSLocalizedStringFromTable(#"TEXT")
NSLocalizedStringWithDefaultValue(#"TEXT")
...
The goal is to extract TEXT. I have been able to construct a regex for each individual function or macro, e.g., (?<=NSLocalizedString)\(#"(.*?)". However, I am looking for a solution that does the job no matter what the name of the function as long as it starts with NSLocalizedString.
I assumed it was as simple as (?<=NSLocalizedString\w+)\(#"(.*?)", but that doesn't seem to do the trick.
How about this one?
/NSLocalizedString\w*\(#"(.*)"\)/
Explanation:
NSLocalizedString    'NSLocalizedString'
\w*                  word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
\(                   '('
#"                   '#"'
(                    group and capture to \1:
  .*                   any character except \n (0 or more times (matching the most amount possible))
)                    end of \1
"                    '"'
\)                   ')'
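A quick way to try this pattern is Python's re module (an assumption; the question targets a different regex engine, but the syntax used here is common to both). The second part uses a made-up line with two calls to show what the greedy (.*) does in that case, compared with the lazy (.*?) from the question.

```python
import re

greedy = re.compile(r'NSLocalizedString\w*\(#"(.*)"\)')
lazy = re.compile(r'NSLocalizedString\w*\(#"(.*?)"\)')

calls = [
    'NSLocalizedString(#"TEXT")',
    'NSLocalizedStringFromTable(#"TEXT")',
    'NSLocalizedStringWithDefaultValue(#"TEXT")',
]
texts = [greedy.search(c).group(1) for c in calls]

# Made-up line with two calls: greedy .* runs to the LAST '")'.
line = 'NSLocalizedString(#"A") NSLocalizedString(#"B")'
greedy_hit = greedy.search(line).group(1)
lazy_hit = lazy.search(line).group(1)

print(texts)       # ['TEXT', 'TEXT', 'TEXT']
print(greedy_hit)  # A") NSLocalizedString(#"B
print(lazy_hit)    # A
```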
The only reason your regex doesn't work is because the regex engine doesn't support variable length lookbehinds. The (?<=NSLocalizedString\w+) is variable length so can't be used.
Firstly it needs to be \w* not \w+, to allow your first example string to match.
If you move the \w* outside the lookbehind (?<=NSLocalizedString)\w* it will work just fine.
Alternatively, since you have to use a capturing group to grab the text value anyway, there's no need for the lookbehind at all. Change the (?<= to a (?: and it becomes a non-capturing group (which can be variable length), and then just grab your text value from group 1.
Your attempt was:
(?<=NSLocalizedString\w+)\(#"(.*?)"
Both of these minor changes should make it work:
(?<=NSLocalizedString)\w*\(#"(.*?)"
(?:NSLocalizedString\w*)\(#"(.*?)"
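Python's re module happens to have the same fixed-width lookbehind restriction, so it can serve as a rough illustration (it is an assumption that the target engine behaves the same way for these patterns). The sketch checks that the original attempt is rejected at compile time and that both rewrites capture TEXT in group 1.

```python
import re

# The original attempt: \w+ inside the lookbehind makes it variable width.
try:
    re.compile(r'(?<=NSLocalizedString\w+)\(#"(.*?)"')
    attempt_compiles = True
except re.error:
    attempt_compiles = False

fixes = [
    r'(?<=NSLocalizedString)\w*\(#"(.*?)"',   # \w* moved out of the lookbehind
    r'(?:NSLocalizedString\w*)\(#"(.*?)"',    # non-capturing group instead
]
calls = [
    'NSLocalizedString(#"TEXT")',
    'NSLocalizedStringFromTable(#"TEXT")',
    'NSLocalizedStringWithDefaultValue(#"TEXT")',
]

# For every fix, collect group 1 from each call.
captured = [[re.search(p, c).group(1) for c in calls] for p in fixes]

print(attempt_compiles)  # False
print(captured)
```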
The following is actually not supported in Objective-C:
The solution that will extract exactly TEXT without using any groups is:
NSLocalizedString\w*\(#"\K[^"]*
It avoids the need for a lookbehind (which can't be used here because it would have to be variable length) by using \K, which drops everything matched before it from the reported match.