Snowflake SQL REGEXP - Capturing After Hyphen Before File Extension

Looking to capture a customer code that appears after a hyphen and before the .file_extension ...
Example: DWL-202_EJJFT_Transactions-EOTTFFS001.csv
In this case, I want to capture EOTTFFS001 as my account code.
Thus far I have tried working with RIGHT, but since our customers have codes of different lengths, I sometimes end up with -DJTSM001.csv because, in this case, the customer had a five-letter code. This approach also does not remove the .csv extension. I have also tried nesting a RIGHT call inside another RIGHT call, but that does not seem to work.
My goal is to use REGEXP_SUBSTR.

I think you just want the non-hyphenated string just before the last period:
select regexp_substr(col, '-([^-]+)[.][^.-]+$', 1, 1, 'e')
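To check the pattern outside Snowflake, here is a minimal Python sketch of the same idea. It assumes Python's re module treats this particular pattern the same way Snowflake does; the helper name extract_account_code is made up for illustration.

```python
import re

# Same pattern as the REGEXP_SUBSTR call above: a hyphen, then a run of
# non-hyphen characters (captured), then the final ".extension".
PATTERN = re.compile(r"-([^-]+)[.][^.-]+$")

def extract_account_code(filename):
    """Return the text between the last hyphen and the extension, or None."""
    match = PATTERN.search(filename)
    return match.group(1) if match else None

code = extract_account_code("DWL-202_EJJFT_Transactions-EOTTFFS001.csv")
print(code)
```

Because the captured group excludes hyphens and the match is anchored at the end of the string, the length of the customer code does not matter.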

Here is a split_part-based alternative:
select split_part(replace(col,'-','.'),'.',-2) -- -2 gets you the second last item
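The same trick can be sketched in Python to see why it works: once hyphens are turned into periods, the code is simply the second-to-last piece. The function name is hypothetical, and returning None for inputs with fewer than two pieces is a choice made for this sketch, not Snowflake's behavior.

```python
def account_code_by_split(filename):
    # Turn hyphens into periods so both act as separators, then take the
    # second-to-last piece (the piece just before the extension).
    parts = filename.replace("-", ".").split(".")
    return parts[-2] if len(parts) >= 2 else None

split_code = account_code_by_split("DWL-202_EJJFT_Transactions-EOTTFFS001.csv")
print(split_code)
```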

Related

BigQuery REGEXP_REPLACE referencing capture group in the replacement expression

I'm very new to Regex so this may seem a very dumb question.
I've been playing around with captured groups in Google Sheets, without any problems, but when I try and apply it to BigQuery, it doesn't seem to work and I can't find out how to implement the syntax.
I looked round and this seems to be the closest answer, but I can't make it work:
Find and replace using regular expression, group capturing, and back referencing
I want to reference a capture group in the replacement expression to either extract or replace £ 1,000.23 in this text:
random text £ 1,000.23 other text
I've got 3 groups:
(.+)
(£\ *[\d\.\,]+)
(.+)
It may not be the best example, but I really want to understand how to use a capture group in the replacement part so I'm not looking for an alternative solution.
The code below literally returns '$2' rather than '£ 1,000.23'.
SELECT
note,
REGEXP_REPLACE(note,r'(.+)(£\ *[\d\.\,]+)(.+)','$2') AS note2
FROM
`project.dataset.table`
LIMIT
100
Thanks for any help!

According to the replacement note in the doc, I think the following should work:
SELECT
note,
REGEXP_REPLACE(note,r'(.+)(£\ *[\d\.\,]+)(.+)','\\2') AS note2
FROM
`project.dataset.table`
LIMIT
100
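As a rough analogy only, the same difference shows up in Python's re module, where a dollar sign is not a group reference but a backslash followed by a digit is. BigQuery uses a different regex engine and different string-literal escaping (hence '\\2' in the query above), so treat this as a sketch of the idea rather than of BigQuery itself.

```python
import re

note = "random text £ 1,000.23 other text"
pattern = r"(.+)(£\ *[\d\.\,]+)(.+)"

# "$2" is not a backreference here, so it is inserted literally.
dollar_result = re.sub(pattern, "$2", note)

# r"\2" refers to the second capture group.
group_result = re.sub(pattern, r"\2", note)

print(dollar_result)
print(group_result)
```

Since the three groups together match the whole string, replacing the match with group 2 leaves only the amount.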

Invalid Position in SQL WHERE Clause

I am writing a query that examines an ID field and derives an ID number from that column based on several criteria. Now that the logic is written, I want to run the query against each criterion to check that it works. To do that, the last part of my query is as follows:
FROM TABLE1
WHERE SOURCE_SYSTEM_NM = 'XYZ' AND ((STRLEFT(SOURCE_ARRANGEMENT_ID,4)) NOT IN ('23CC','21CC'))
LIMIT 10000
Essentially, I am trying to return only rows with SOURCE_SYSTEM_NM equal to 'XYZ', while eliminating any whose SOURCE_ARRANGEMENT_ID has '21CC' or '23CC' as its first 4 characters. I have a third criterion I want to filter on as well, which is that the first three characters must be '0CC'.
My problem when I run this is I get back an "Invalid Position" error. I removed one of the strings from the criteria, and it works. So, I decided to add the second in its own 'NOT IN...' clause with an AND between them, but that resulted in the same error.
If I had to guess, the NOT IN ('21CC','23CC') puts an AND between them and I think that must be the root of my issue. The criteria in my CASE statement derives the ID number with the following:
WHEN (M_CRF_CU_PRODUCT_ARRANGEMENT.SOURCE_SYSTEM_NM) IN ('XYZ') AND STRLEFT(SOURCE_ARRANGEMENT_ID, 4) IN ('23CC','21CC') THEN STRRIGHT(SOURCE_ARRANGEMENT_ID, LENGTH(SOURCE_ARRANGEMENT_ID)-4)
WHEN (M_CRF_CU_PRODUCT_ARRANGEMENT.SOURCE_SYSTEM_NM) IN ('XYZ') AND STRLEFT(SOURCE_ARRANGEMENT_ID, 3) IN ('0CC') THEN STRRIGHT(SOURCE_ARRANGEMENT_ID, LENGTH(SOURCE_ARRANGEMENT_ID)-3)
WHEN (M_CRF_CU_PRODUCT_ARRANGEMENT.SOURCE_SYSTEM_NM) IN ('XYZ') AND (STRLEFT(SOURCE_ARRANGEMENT_ID, 4) NOT IN ('23CC','21CC') OR STRLEFT(SOURCE_ARRANGEMENT_ID, 3) NOT IN ('0CC')) THEN (SOURCE_ARRANGEMENT_ID)
So with that, I am just trying to check each criterion to make sure the ID derived/created is correct. I need to filter down to get results for that last WHEN statement above, but I keep getting that "Invalid Position" in my WHERE statement at the end. I am using Aginity to run this query and it's running against an IBM Netezza database. Thanks in advance!

I figured out what the issue was on this - when performing
STRRIGHT(SOURCE_ARRANGEMENT_ID, LENGTH(SOURCE_ARRANGEMENT_ID)-4)
Some of those arrangement IDs have fewer than 4 characters, so LENGTH(SOURCE_ARRANGEMENT_ID)-4 is negative and I was getting "Invalid Position". I fixed this by updating the query to use SUBSTRING() instead:
SUBSTRING(SOURCE_ARRANGEMENT_ID,5,LENGTH(SOURCE_ARRANGEMENT_ID))
This fixed my issue. Just wanted to post an answer in case others run into this. It's not Netezza-specific; other SQL variants may react the same way.
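The failure mode can be sketched in Python. Both functions below are hypothetical stand-ins, not Netezza's implementation: strright assumes a negative count is rejected (which is what the error message suggests), and substring_from assumes that starting past the end of the string yields an empty string.

```python
def strright(text, count):
    # Hypothetical stand-in for STRRIGHT: assume a negative count is an error.
    if count < 0:
        raise ValueError("Invalid Position")
    return text[max(len(text) - count, 0):]

def substring_from(text, start):
    # Stand-in for SUBSTRING(text, start, LENGTH(text)) with a 1-based start.
    return text[start - 1:]

full_id = "23CC12345"
short_id = "AB"  # fewer than 4 characters

stripped = strright(full_id, len(full_id) - 4)

try:
    strright(short_id, len(short_id) - 4)  # count is -2
    short_id_fails = False
except ValueError:
    short_id_fails = True

safe = substring_from(short_id, 5)
```

The point is that the length arithmetic is evaluated for every row, including short IDs that the prefix check would never select.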

Replace function, keep unknown substrings/wildcards

I have tried looking for answers online, but I am lacking the right nomenclature to find any answers matching my question.
The DB I am working with is an inconsistent mess. I am currently trying to import a number of maintenance codes which I have to link to a pre-existing Excel table. For this reason, the maintenance codes I import have to be very universal.
The table is designed to work with a 2-3 digit number (a time length), followed by a time unit.
For example, SERV-01W and SERV-03M .
As these used to be added to the DB by hand, a large number of older maintenance codes are actually written with 1 digit numbers.
For example, SERV-1W and SERV-3M.
I would like to replace the old codes by the new codes. In other words, I want to add a leading 0 if only one digit is used in the code.
REPLACE(T.Code,'-[0-9][DWM]','-0[0-9][DWM]') unfortunately does not work, most likely because I am using wildcards in the result string.
What would be a good way of handling this issue?
Thank you in advance.

Assuming I understand your requirement this should get you what you are after:
WITH VTE AS(
SELECT *
FROM (VALUES('SERV-03M'),
('SERV-01W'),
('SERV-1Q'),
('SERV-4X')) V(Example))
SELECT Example,
ISNULL(STUFF(Example, NULLIF(PATINDEX('%-[0-9][A-z]%',Example),0)+1,0,'0'),Example) AS NewExample
FROM VTE;
Instead of trying to replace the pattern, I used PATINDEX to find the pattern and then inject the extra '0' character. If the pattern wasn't found, so 0 was returned by PATINDEX, I forced the expression to return NULL and then wrapped the entire thing with a further ISNULL, so that the original value was returned.
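The find-then-insert idea can be sketched in Python as follows. This is an approximation: the function name is made up, and [A-Za-z] is assumed to be close enough to the [A-z] range in the PATINDEX pattern, whose exact meaning depends on the collation.

```python
import re

def pad_single_digit(code):
    # Find the first "-<digit><letter>" run, like PATINDEX('%-[0-9][A-z]%', ...).
    match = re.search(r"-[0-9][A-Za-z]", code)
    if match is None:
        return code  # the ISNULL fallback: leave the value unchanged
    insert_at = match.start() + 1  # position right after the hyphen
    # Like STUFF(..., position, 0, '0'): insert without deleting anything.
    return code[:insert_at] + "0" + code[insert_at:]

for example in ["SERV-03M", "SERV-01W", "SERV-1Q", "SERV-4X"]:
    print(example, pad_single_digit(example))
```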

I find a CASE expression to be a simple way to express the logic:
SELECT (CASE WHEN code LIKE '%-[0-9][0-9]%'
THEN code
ELSE REPLACE(code, '-', '-0')
END)
That is, if the code has two digits, then do nothing. Otherwise, add a zero. The code should be quite clear on what it is doing.
This is not generalizable (it doesn't add two zeros for instance), but it does do exactly what you are asking for.
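A small Python sketch of the same branch logic makes its limits easy to test. The function name is hypothetical, and the regular expression is assumed to be equivalent to the LIKE pattern above.

```python
import re

def pad_with_case_logic(code):
    # LIKE '%-[0-9][0-9]%': a hyphen followed by two digits anywhere in the code.
    if re.search(r"-[0-9][0-9]", code):
        return code
    # REPLACE(code, '-', '-0') touches every hyphen, not just the first.
    return code.replace("-", "-0")

print(pad_with_case_logic("SERV-1W"))
print(pad_with_case_logic("SERV-03M"))
```

Note that a code containing more than one hyphen would get a zero after each of them, so this fits only codes shaped like the examples in the question.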

SQL Select Query Returns 5,448 rows, the update query affects 93,205 rows, but nothing is actually updated?

If I run this:
SELECT *
FROM [myDB].[dbo].[content]
where content_html like '%<images>%<img src=''/'' alt=''''>%</img>%</images>%'
It returns 5,448 rows.
Those empty image tags are breaking my pulls in Ektron – our current CMS. They suggested I replace those empty image tags with
So I did this Update query:
UPDATE [myDB].[dbo].[content]
SET content_html=REPLACE(CONVERT(VARCHAR(MAX), content_html),'%<images>%<img src=''/'' alt=''''>%</img>%</images>%','<images></images>')
It returns this: (93205 row(s) affected)
But nothing actually gets changed. I tried it with one specific record and it says it was affected but the data remains the same.

Your update statement doesn't have a WHERE clause. That means it will operate on every row. Even if the data in content_html does not change, the UPDATE statement will rewrite it again back to the table. Why? Because that's what you told it to do.
You will need to use this to only affect the 5448 rows you want:
UPDATE [myDB].[dbo].[content]
SET content_html=REPLACE(CONVERT(VARCHAR(MAX), content_html),'%<images>%<img src=''/'' alt=''''>%</img>%</images>%','<images></images>')
where content_html like '%<images>%<img src=''/'' alt=''''>%</img>%</images>%'
However, this is still unlikely to change anything, because REPLACE() does literal string replacement. It doesn't know anything about the wildcards that LIKE uses. So the WHERE clause pattern-matches to find the rows to update, but REPLACE() will almost certainly never find the literal string %<images>%<img src='/' alt=''>%</img>%</images>% in content_html.
There is no built in regex replace function in SQL Server. There are other options, but I don't think any of them are particularly useful for you. You might be able to use a series of PATINDEX() calls to find <images> and <img src='/' alt=''> and </images>, use some comparison to decide if the <img tag is inside those, and then use STUFF(), but without knowing your data I've no way of knowing how well that would work.
The easiest way to use a real regex replace is probably to SELECT the fields to a DataTable, apply your regex from C# or PowerShell, and then merge the changes back.
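As an illustration of the application-side route, here is a minimal sketch in Python rather than C# or PowerShell. The regular expression is one guess at what the LIKE pattern was meant to express, the sample string is invented, and reading from and writing back to the database is left out.

```python
import re

# Non-greedy gaps stand in for the % wildcards of the LIKE pattern.
EMPTY_IMG = re.compile(
    r"<images>.*?<img src='/' alt=''>.*?</img>.*?</images>", re.DOTALL
)

def clean_html(content_html):
    return EMPTY_IMG.sub("<images></images>", content_html)

sample = "<p>intro</p><images> <img src='/' alt=''> </img> </images><p>end</p>"

# A literal replace with the LIKE pattern finds nothing, as described above.
like_pattern = "%<images>%<img src='/' alt=''>%</img>%</images>%"
literal_attempt = sample.replace(like_pattern, "<images></images>")

cleaned = clean_html(sample)
print(cleaned)
```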

First, your WHERE clause from the first query is missing from the second.
Second, unless you want to replace the entire field, you don't want leading and trailing wildcard symbols in your REPLACE statement. I don't know if you can even use wildcards in a REPLACE call - it's not used as an example in any Microsoft documentation that I've read, and a few Google searches didn't yield anything:
UPDATE [myDB].[dbo].[content]
SET content_html=REPLACE(CONVERT(VARCHAR(MAX), content_html),'<images>%<img src=''/'' alt=''''>%</img>%</images>','<images></images>')
where content_html like '%<images>%<img src=''/'' alt=''''>%</img>%</images>%'
EDIT:
A REALLY easy way to test this would be:
SELECT content_html as [Original], REPLACE(CONVERT(VARCHAR(MAX), content_html),'%<images>%<img src=''/'' alt=''''>%</img>%</images>%','<images></images>') as [Replaced]
FROM YourTable
WHERE content_html like '%<images>%<img src=''/'' alt=''''>%</img>%</images>%'
This query will give you the original value - and the replaced value that you're creating - and only on records that match your like expression. If the two columns are different, it works. If they're the same, it doesn't and you'll need to use other code to accomplish the desired results.
FINAL EDIT:
I tested this, and REPLACE only works on literal strings. Your wildcards will not work in your REPLACE call, and you will need either very complex SQL or application code to achieve your goal. The app route is far better, so do that! Best of luck!

SQL Contains - only match at start

For some reason I cannot find the answer on Google! With the SQL CONTAINS function, how can I tell it to match only at the beginning of a string? That is, I am looking for the full-text equivalent of
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!

You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/

You can use CONTAINS with a LIKE subquery for matching only a start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS(edition, '"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set.
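The two-stage idea can be sketched in Python. Both helpers are crude stand-ins invented for this example (a substring test is not how a full-text index works), and the titles are made-up sample data.

```python
def full_text_match(text, phrase):
    # Crude stand-in for CONTAINS: the phrase appears anywhere, ignoring case.
    return phrase.lower() in text.lower()

def starts_with(text, prefix):
    # Stand-in for LIKE 'prefix%' (case-insensitive here by assumption).
    return text.lower().startswith(prefix.lower())

titles = [
    "Alice in Wonderland (annotated)",
    "The Annotated Alice in Wonderland",
    "Through the Looking-Glass",
]

# Stage 1: the indexed search narrows the set.
candidates = [t for t in titles if full_text_match(t, "Alice in wonderland")]
# Stage 2: the prefix check runs only on the survivors.
matches = [t for t in candidates if starts_with(t, "Alice in wonderland")]
print(matches)
```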

The only solution I can think of is to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)

I am facing a similar issue. This is what I have implemented as a workaround.
I created another table and pulled in only the rows matching LIKE 'some_term%'.
On this new table I then set up the full-text search.
Please let me know if you have tried a better approach.
