BigQuery REGEXP_REPLACE referencing capture group in the replacement expression


I'm very new to Regex so this may seem a very dumb question.
I've been playing around with capture groups in Google Sheets without any problems, but when I try to apply the same approach in BigQuery it doesn't seem to work, and I can't find out how to implement the syntax.
I looked around and this seems to be the closest answer, but I can't make it work:
Find and replace using regular expression, group capturing, and back referencing
I want to reference a capture group in the replacement expression to either extract or replace £ 1,000.23 in this text:
random text £ 1,000.23 other text
I've got 3 groups:
(.+)
(£\ *[\d\.\,]+)
(.+)
It may not be the best example, but I really want to understand how to use a capture group in the replacement part so I'm not looking for an alternative solution.
The code below literally returns '$2' rather than '£ 1,000.23'.
SELECT
note,
REGEXP_REPLACE(note,r'(.+)(£\ *[\d\.\,]+)(.+)','$2') AS note2
FROM
`project.dataset.table`
LIMIT
100
Thanks for any help!

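According to the note on the replacement argument in the docs, BigQuery refers to capture groups with backslash-escaped digits (\1 to \9) rather than $1, so I think the following should work (the backslash is doubled because the replacement is not a raw string):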
SELECT
note,
REGEXP_REPLACE(note,r'(.+)(£\ *[\d\.\,]+)(.+)','\\2') AS note2
FROM
`project.dataset.table`
LIMIT
100
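The difference between '$2' and '\2' can be reproduced outside BigQuery. Below is a minimal sketch that uses Python's re module as a stand-in; this is an assumption for illustration only, since Python's re is not the engine BigQuery runs, although both accept backslash-digit group references in the replacement. The sample string is the one from the question, and keep_amount is a hypothetical helper name.

```python
import re

# Pattern from the question: text before, the amount, text after.
PATTERN = r'(.+)(£\ *[\d\.\,]+)(.+)'


def keep_amount(note: str, replacement: str) -> str:
    """Replace the whole match with `replacement` (hypothetical helper)."""
    return re.sub(PATTERN, replacement, note)


note = 'random text £ 1,000.23 other text'

# '$2' is not a group reference in this syntax, so it is inserted literally.
literal = keep_amount(note, '$2')    # '$2'

# r'\2' refers to the second capture group, i.e. the amount.
amount = keep_amount(note, r'\2')    # '£ 1,000.23'
```

Because the three groups together cover the whole string, replacing the match with group 2 leaves only the amount.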

Related

Snowflake SQL REGEXP - Capturing After Hyphen Before File Extension

Looking to capture a customer code that appears after a hyphen and before the .file_extension ...
Example: DWL-202_EJJFT_Transactions-EOTTFFS001.csv
In this case, I want to capture EOTTFFS001 as my account code.
Thus far I have tried working with RIGHT, but since our customers have codes of different lengths, I sometimes end up with -DJTSM001.csv because, in this case, the customer had a five-letter code. This approach also does not remove the .csv extension. I have also tried nesting a RIGHT statement inside another RIGHT statement, but that does not seem to work.
My goal is to use REGEXP_SUBSTR.

I think you just want the non-hyphenated string just before the last period:
select regexp_substr(col, '-([^-]+)[.][^.-]+$', 1, 1, 'e')
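To see what that pattern captures, here is a small sketch that applies the same regular expression with Python's re module. This is a stand-in for Snowflake's REGEXP_SUBSTR, not its implementation; account_code is a hypothetical name, and the second file name in the usage note is made up around the -DJTSM001.csv fragment from the question.

```python
import re
from typing import Optional


def account_code(filename: str) -> Optional[str]:
    """Return the text between the last hyphen and the extension, or None."""
    # '-'       a literal hyphen
    # ([^-]+)   the code: one or more non-hyphen characters (captured)
    # [.]       the period before the extension
    # [^.-]+$   the extension, running to the end of the string
    match = re.search(r'-([^-]+)[.][^.-]+$', filename)
    return match.group(1) if match else None


code = account_code('DWL-202_EJJFT_Transactions-EOTTFFS001.csv')  # 'EOTTFFS001'
```

The capture group plays the role of the 'e' argument in the SQL above: only the part in parentheses is returned, so a shorter code such as the one in X-DJTSM001.csv comes back as DJTSM001 without the hyphen or the extension.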

Here is a split_part-based alternative:
select split_part(replace(col,'-','.'),'.',-2) -- -2 gets you the second-to-last item
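The same idea can be sketched in Python with str.replace and str.split; this only mirrors the logic and says nothing about how Snowflake handles edge cases. The function name is hypothetical.

```python
def account_code_by_split(filename: str) -> str:
    """Take the second-to-last piece after splitting on hyphens and periods."""
    # Turn every hyphen into a period so one delimiter covers both.
    unified = filename.replace('-', '.')
    # Index -2 is the piece just before the extension.
    return unified.split('.')[-2]


code = account_code_by_split('DWL-202_EJJFT_Transactions-EOTTFFS001.csv')  # 'EOTTFFS001'
```

In this sketch a name with no hyphen and no period raises IndexError, so it assumes every file name has at least one delimiter.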

Replace function, keep unknown substrings/wildcards

I have tried looking for answers online, but I am lacking the right nomenclature to find any answers matching my question.
The DB I am working with is an inconsistent mess. I am currently trying to import a number of maintenance codes which I have to link to a pre-existing Excel table. For this reason, the maintenance codes I import have to follow a uniform format.
The table is designed to work with a 2-3 digit number (a time length) followed by a time unit.
For example, SERV-01W and SERV-03M .
As these used to be added to the DB by hand, a large number of older maintenance codes are actually written with 1 digit numbers.
For example, SERV-1W and SERV-3M.
I would like to replace the old codes by the new codes. In other words, I want to add a leading 0 if only one digit is used in the code.
REPLACE(T.Code,'-[0-9][DWM]','-0[0-9][DWM]') unfortunately does not work, most likely because I am using wildcards in the result string.
What would be a good way of handling this issue?
Thank you in advance.

Assuming I understand your requirement, this should get you what you are after:
WITH VTE AS(
    SELECT *
    FROM (VALUES('SERV-03M'),
                ('SERV-01W'),
                ('SERV-1Q'),
                ('SERV-4X')) V(Example))
SELECT Example,
       -- Find '-<digit><letter>' and insert '0' right after the hyphen; fall back to the original value
       ISNULL(STUFF(Example, NULLIF(PATINDEX('%-[0-9][A-Za-z]%',Example),0)+1,0,'0'),Example) AS NewExample
FROM VTE;
Instead of trying to replace the pattern, I used PATINDEX to find the pattern and then inject the extra '0' character. If the pattern wasn't found, so 0 was returned by PATINDEX, I forced the expression to return NULL and then wrapped the entire thing with a further ISNULL, so that the original value was returned.
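The find-then-inject logic described above can be sketched in Python as follows. This is an illustration of the approach rather than a translation of T-SQL semantics; pad_single_digit is a hypothetical name, and the regular expression stands in for the PATINDEX lookup.

```python
import re


def pad_single_digit(code: str) -> str:
    """Insert '0' after the hyphen when a single digit is followed by a letter."""
    # Locate a hyphen, one digit, then a letter (the role PATINDEX plays above).
    match = re.search(r'-[0-9][A-Za-z]', code)
    if match is None:
        # Pattern absent: return the original value, like the ISNULL fallback.
        return code
    insert_at = match.start() + 1  # position just after the hyphen
    # Inject '0' without removing anything, like STUFF with a length of 0.
    return code[:insert_at] + '0' + code[insert_at:]


padded = [pad_single_digit(c) for c in ['SERV-03M', 'SERV-01W', 'SERV-1Q', 'SERV-4X']]
# ['SERV-03M', 'SERV-01W', 'SERV-01Q', 'SERV-04X']
```

Codes that already have two digits do not match, because the character after the first digit is another digit rather than a letter.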

I find a CASE expression to be a simple way to express the logic:
SELECT (CASE WHEN code LIKE '%-[0-9][0-9]%'
THEN code
ELSE REPLACE(code, '-', '-0')
END)
That is, if the code has two digits, then do nothing. Otherwise, add a zero. The code should be quite clear on what it is doing.
This is not generalizable (it doesn't add two zeros for instance), but it does do exactly what you are asking for.
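For comparison, here is the same two-branch logic as a Python sketch. The function name is hypothetical, and the regular expression stands in for the LIKE pattern.

```python
import re


def pad_with_case_logic(code: str) -> str:
    """Leave two-digit codes alone; otherwise put '0' after each hyphen."""
    # WHEN code LIKE '%-[0-9][0-9]%' THEN code
    if re.search(r'-[0-9][0-9]', code):
        return code
    # ELSE REPLACE(code, '-', '-0'): note this touches every hyphen.
    return code.replace('-', '-0')


one_digit = pad_with_case_logic('SERV-1W')    # 'SERV-01W'
two_digits = pad_with_case_logic('SERV-03M')  # 'SERV-03M'
```

Since the replacement applies to every hyphen, a code with more than one hyphen (say, a made-up A-B-1W) would become A-0B-01W, which is one more way this approach is not generalizable.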

SQL Alphanumeric Select

I apologize if this has been covered previously. I could not find exactly what I was looking for by searching. So I hope it's okay that I ask.
I have a column with many different types of alphanumeric values (e.g. A101, F576, AI01, etc.). What I want to do is find the values that follow a specific pattern, F100 through F9999. Using between F1000 and F9999 gets me what I need, but I don't think that's exactly the right way to go about querying for future reference.
Does anyone have any suggestions? Any help would be greatly appreciated!

This should do the trick:
SELECT * FROM THE_TABLE
WHERE THE_COLUMN LIKE 'F[1-9][0-9][0-9][0-9]'
That will match F1000 through F9999. The key point is that you can use [0-9] as a range. Other ranges, such as [A-Z], work too.
If you want F100 to F9999, you could do:
SELECT * FROM THE_TABLE
WHERE THE_COLUMN LIKE 'F[1-9][0-9][0-9][0-9]'
OR THE_COLUMN LIKE 'F[1-9][0-9][0-9]'
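To check which values such a pattern accepts, the sketch below expresses the same ranges as a Python regular expression. A LIKE pattern without % has to match the whole value, so re.fullmatch is used as the closest analogue; in_f_range and its parameter are hypothetical names.

```python
import re


def in_f_range(value: str, allow_three_digits: bool = False) -> bool:
    """True for F1000-F9999, or F100-F9999 when allow_three_digits is set."""
    # F, a first digit 1-9, then three digits (or two or three).
    pattern = r'F[1-9][0-9]{2,3}' if allow_three_digits else r'F[1-9][0-9]{3}'
    return re.fullmatch(pattern, value) is not None


values = ['A101', 'F576', 'AI01', 'F1000', 'F9999', 'F0999']
four_digit = [v for v in values if in_f_range(v)]                           # ['F1000', 'F9999']
three_or_four = [v for v in values if in_f_range(v, allow_three_digits=True)]  # ['F576', 'F1000', 'F9999']
```

The leading [1-9] is what rules out values such as F0999, which a plain string comparison with BETWEEN would let through.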

SQL exclusion regexp does not work, why?

I have some sentences in db. I want to select the ones that don't have urls in them.
So what I do is
select ID, SENTENCE
from SENTENCE_TABLE
where regexp_like(SENTENCE, '[^http]');
However after the query is executed the sentences that appear in the results pane still have urls. I tried a lot of other combinations without any success.
Can somebody explain, or give a good link that explains, how regexps actually work in SQL?
How can I filter out (exclude) actual words in the db with an SQL query?

You're over-complicating this. Just use a standard LIKE.
select ID, SENTENCE
from SENTENCE_TABLE
where SENTENCE not like '%http%';
regexp_like(SENTENCE, '[^http]') matches any sentence containing at least one character other than h, t or p, because [^http] is a character class rather than the word http. I like the PSOUG page on regular expressions in Oracle, but I would also recommend reading the documentation.
To respond to your comment: you can use REGEXP_LIKE, there's just no point.
select ID, SENTENCE
from SENTENCE_TABLE
where not regexp_like(SENTENCE, 'http');
This looks for the string http rather than the letters individually.
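The difference between the character class and the plain substring can be checked with a short sketch. It uses Python's re.search as a stand-in for regexp_like (both look for a match anywhere in the string), and the sample sentences, including the example.com URL, are made up for illustration.

```python
import re

sentences = ['plain sentence', 'see http://example.com', 'http']

# [^http] is a character class: it matches ONE character that is not h, t or p.
# Almost every sentence contains such a character, so almost everything passes.
char_class_hits = [s for s in sentences if re.search(r'[^http]', s)]
# ['plain sentence', 'see http://example.com']

# Negating a search for the literal string 'http' excludes the URLs.
without_urls = [s for s in sentences if not re.search(r'http', s)]
# ['plain sentence']
```

Only the sentence made up entirely of the letters h, t and p fails the character-class test, which is why the original query still returned sentences with URLs.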

[^http] matches any single character except h, t or p, so regexp_like(SENTENCE, '[^http]') is true for any string that contains at least one character other than those three.
It should be where not regexp_like(SENTENCE, '^http'); this matches anything that doesn't start with http.

Contains() function falters with strings of numbers?

For some background information, I'm creating an application that searches against a couple of indexed tables to retrieve some records. It isn't overly complex, nowhere near the level of, say, Google, but it's good enough for the purpose it serves, barring this strange issue.
I'm using the Contains() function, and it's going very well, except when the search contains strings of numbers. Now, I'm only passing in a string; no numerical datatypes are being passed in anywhere, only characters. We're searching against a collection of emails, each appended with a custom ID when shot off from a workflow. So while testing, we decided to search via number strings.
In our test, we isolated a number, 0042600006, which belongs to one and only one email subject. However, when using our query we are getting results for 0042600001, 0042600002, etc. The query is as follows (with some generic columns standing in):
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006')
We've tried every possible combination: '0042600006*', '"0042600006"' and '"0042600006*"'.
I think it's just a limitation of the function, but I thought this would probably be the best place for answers. Thanks in advance.

Asked this same question recently. Please see the insightful answer someone left me here
Essentially what this user says to do is to turn off the noise words (Microsoft has included integers 0-9 as noise in the Full Text Search). Hope you can use this awesome tool with integers as I now am!

Try adding language 1033 as an additional parameter. That worked with my solution.
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '0042600006', language 1033)

Try using:
SELECT description, subject FROM tableA WHERE CONTAINS((subject), '%0042600006%')

We Keep Coding
