Extract number from a URL in Redshift

I would like to extract an ID (a number) from a bunch of URLs in Redshift. I know I can use regexp_substr for this purpose, but my knowledge of regular expressions is weak. Here are a couple example URLs:
/checkout?feature=ADVANCED_SEARCH&upgradeRedirect=%2Fmentions%3Ftop_ids%3D1222874068&btv=feature_ADVANCED_SEARCH
/checkout?feature=ADVANCED_SEARCH&trigger=mentioning-author-rw&upgradeRedirect=%2Fmentions%3Ftop_ids%3D160447990
After parsing the above URLs, I would like the output to be:
1222874068
160447990
Note that the parameter top_ids remains constant and will help break the URL.
I tried using multiple versions of split_part as well. But there may be variations in the URL where it might break. So using a regular expression may be a better idea.
Any help would be greatly appreciated.

You can use:
select regexp_substr(column,'top_ids%3D([0-9]*)', 1, 1, 'e')
The 'e' parameter makes regexp_substr return only the parenthesized subexpression (here, the digits) instead of the whole match.
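To sanity-check the pattern outside Redshift, here is a minimal sketch in Python that mimics what the 'e' parameter does by returning the first capture group. The helper name extract_top_id is made up for illustration, and Python's regex engine is not the same as Redshift's, so treat it as an approximation rather than a guarantee of Redshift's behavior.

```python
import re

# Same pattern as in the Redshift call; the parentheses mark the part to keep.
PATTERN = re.compile(r"top_ids%3D([0-9]*)")

def extract_top_id(url):
    """Return the digits after 'top_ids%3D', or None if the marker is absent."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

# The two sample URLs from the question.
urls = [
    "/checkout?feature=ADVANCED_SEARCH&upgradeRedirect=%2Fmentions%3Ftop_ids%3D1222874068&btv=feature_ADVANCED_SEARCH",
    "/checkout?feature=ADVANCED_SEARCH&trigger=mentioning-author-rw&upgradeRedirect=%2Fmentions%3Ftop_ids%3D160447990",
]
for url in urls:
    print(extract_top_id(url))
```

Because the pattern keys on the top_ids%3D marker rather than on position, it should not matter whether other parameters come before or after it.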

Try something like this:
SUBSTR(REGEXP_SUBSTR(yourcolumn, 'top_ids%3D([0-9]{2,})'), 11, 20)
This looks for 'top_ids%3D' followed by 2 or more digits.
SUBSTR then drops the first 10 characters (the length of 'top_ids%3D') and keeps up to 20 characters of what follows.

Related

Regex match first number if it does not appear at the end

I am currently facing a Regex problem which apparently I cannot find an answer to.
My Regex is embedded in a teradata SQL of the form:
REGEXP_SUBSTR(column, 'regex_pattern')
I want to find the first appearance of any number except if it appears at the end of the string.
For Example:
"YEL2X30" -> "2"
"YEL19XYZ05" -> "19"
"YELLOW05" -> ""
I tried it with '[0-9]+(?!$)/' but this returns me a blank String always.
Thanks in Advance!

Shot in the dark here since I'm unfamiliar with Teradata and its supported SQL functionality. However, reading the docs on the REGEXP_SUBSTR() function, it seems you may want to use the optional 3rd and 4th arguments along with a slightly different regular expression:
[0-9]+(?![0-9]|$)
Meaning: one or more digits that are followed by neither another digit nor the end of the string.
I believe the following syntax should then retrieve the first such number:
REGEXP_SUBSTR(column, '[0-9]+(?![0-9]|$)', 1, 1)
The 3rd parameter states the position in the source string at which the search starts, whereas the 4th selects which occurrence to return when there are multiple matches (that is how I read the docs). For example, abc123def456ghi789 would return 123.
Fiddling around in online IDEs gave me this:
CREATE TABLE TBL (TST varchar(100));
INSERT INTO TBL values ('YEL2X30'), ('YEL19XYZ05'), ('YELLOW05'), ('abc123def456ghi789');
SELECT REGEXP_SUBSTR(TST, '[0-9]+(?![0-9]|$)', 1, 1) as 'RESULTS' FROM TBL;
Resulted in:
RESULTS
2
19
NULL
123
NOTE: I also noticed that leaving out the 3rd and 4th parameters made no difference, since both default to 1 when not given explicitly.
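For anyone who wants to poke at the lookahead without a Teradata instance, here is a small sketch of the same pattern in Python. The function name first_number_not_at_end is hypothetical, and Python's regex engine may differ from Teradata's in edge cases, so this only illustrates the matching logic.

```python
import re

# One or more digits NOT followed by another digit or by the end of the string.
PATTERN = re.compile(r"[0-9]+(?![0-9]|$)")

def first_number_not_at_end(text):
    """Return the first run of digits that is not at the end, else None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

samples = ["YEL2X30", "YEL19XYZ05", "YELLOW05", "abc123def456ghi789"]
for sample in samples:
    print(sample, "->", first_number_not_at_end(sample))
```

The [0-9] inside the lookahead is what stops the engine from backtracking to a shorter match such as the 0 in 05.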

Possibly the simplest way is to look for digits followed by a non-digit. Then keep all the digits:
regexp_substr(regexp_substr(column, '[0-9]+[^0-9]'), '[0-9]+')
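The two-step idea can be sketched in Python as follows. The helper name first_number_before_non_digit is an assumption for illustration, and the Python engine only approximates what the database does.

```python
import re

def first_number_before_non_digit(text):
    """Mimic the nested regexp_substr calls: match digits plus one non-digit, then keep the digits."""
    # Step 1: digits immediately followed by one non-digit character.
    outer = re.search(r"[0-9]+[^0-9]", text)
    if outer is None:
        return None
    # Step 2: keep only the digits from that match.
    return re.search(r"[0-9]+", outer.group(0)).group(0)

for sample in ["YEL2X30", "YEL19XYZ05", "YELLOW05"]:
    print(sample, "->", first_number_before_non_digit(sample))
```

A number at the very end of the string has no non-digit after it, so step 1 finds nothing and no value comes back.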

Not Like in Teradata

I am new to Teradata and trying to figure out how to do a NOT LIKE statement with multiple wildcards. I've tried several different ways, but haven't found a way that works. Most recently I've tried the code below.
WHERE DIAG_CD NOT IN ALL ('S060%','S340%')
Any help you all can provide would be much appreciated.
Thanks!

You are on the right track. You can use ANY / ALL quantifier with LIKE or NOT LIKE.
WHERE DIAG_CD NOT LIKE ALL ('S060%','S340%')
or
WHERE NOT (DIAG_CD LIKE ANY ('S060%','S340%'))

IN does not support wildcards. You need to repeat the conditions:
where diag_cd not like 'S060%' and diag_cd not like 'S340%'
Or you can do regex matching instead: ^ anchors the match at the beginning of the string, and | stands for "or". This syntax is easier to extend with more patterns.
where not regexp_like(diag_cd, '(^S060)|(^S340)')
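The repeated NOT LIKE form is portable enough to try in SQLite, so here is a rough sketch using Python's built-in sqlite3 module. The table name claims and the sample codes are made up, and SQLite is only standing in for Teradata here (it has no LIKE ANY / LIKE ALL), so only the repeated-condition variant is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (diag_cd TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO claims VALUES (?)",
    [("S0601",), ("S3402",), ("S0701",), ("T1234",)],  # made-up sample codes
)

# One NOT LIKE per prefix, combined with AND.
rows = conn.execute(
    "SELECT diag_cd FROM claims "
    "WHERE diag_cd NOT LIKE 'S060%' AND diag_cd NOT LIKE 'S340%' "
    "ORDER BY diag_cd"
).fetchall()
kept = [row[0] for row in rows]
print(kept)
```

The conditions have to be joined with AND: a row is kept only if it matches none of the prefixes.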

SQL Server : extracting the Middle Characters without CHARINDEX

To start, I have seen the CHARINDEX results on here but none of them seem to be working for my case. The reasons are either a, CHARINDEX can't help me, or b, I am not understanding how CHARINDEX works. That being said, I would like to ask my question here in hopes that I can get some clarification on both how to solve my issue and CHARINDEX if that so happens to be the way this question is answered.
The variable that I am trying to extract from has varying length. However, two things are always constant.
There is always a '/' as the 16th character in the string
The last character in the string is always '0' OR '1'
What I am trying to do is extract the name from between '/' and '0' or '1'. In short, I want to chop off the first 16 characters and the last character of every string. So far, this is what I have:
SELECT
SUBSTRING([string_name], 17, LEN([string_name]) - 1) AS 'username'
FROM
[table_name]
The results I get still contain the 0 OR 1 at the end. What I need to do is somehow remove that 0 from the string. It is important to note that the number of characters between '/' and '0' are always different.
Current results:
gordon0
grant0
greg0
guy1
hanying0
Desired results:
gordon
grant
greg
guy
hanying
Any advice here would be wonderful.
Please let me know if you need any additional information from me. If possible, would like to maintain using either SUBSTRING, LEFT or RIGHT.
Thanks

Adjusting the length would seem to address your problem:
SELECT SUBSTRING([string_name], 17, LEN([string_name])-17) AS username
FROM [table_name]
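To see why the length has to be LEN - 17 (16 leading characters plus 1 trailing character are dropped), here is a quick sketch using SQLite through Python. SQLite's substr and length are assumed to behave like SUBSTRING and LEN for these inputs, and the 15-character filler prefix is invented for the example.

```python
import sqlite3

prefix = "x" * 15 + "/"  # 16 characters, with '/' as the 16th (made-up filler)
values = [prefix + "gordon" + "0", prefix + "guy" + "1"]

conn = sqlite3.connect(":memory:")

def middle(value, trim):
    """Take characters from position 17 with a length of len(value) - trim."""
    query = "SELECT substr(?, 17, length(?) - ?)"
    return conn.execute(query, (value, value, trim)).fetchone()[0]

original = [middle(v, 1) for v in values]    # length from the question
adjusted = [middle(v, 17) for v in values]   # length from this answer
print(original)
print(adjusted)
```

With LEN - 1 the requested length runs past the end of the string, so everything from position 17 onward comes back, including the trailing 0 or 1.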

Using 'LIKE' and 'REGEXP' in a SQL query

I'm trying to use some regex on an expression where I have two conditions on the WHERE clause. The pattern I want to capture is 106 followed by any digit followed by a digit that must be either 3 or 4, i.e. 106[0-9][3-4]
First, I tried this:
SELECT DISTINCT Loggers
FROM [alo].[Forests] C
WHERE (R.LogSU = 3)
AND (ForestID REGEXP '106[0-9][3-4]')
This produced an error as below and it would be good to know why.
Msg 102, Level 15, State 1, Line 16
Incorrect syntax near 'REGEXP'.
Next, I have tried this, which is now running but I am unsure about whether this is doing what I want it to do.
SELECT DISTINCT Loggers
FROM [alo].[Forests] C
WHERE (R.LogSU = 3)
AND (ForestID LIKE '106[0-9][3-4]')
Would this do as I described above?

You specify this:
The pattern I want to capture is 106 followed by any digit followed by
a digit that must be either 3 or 4, i.e. 106[0-9][3-4]
And then you give an example using a regular expression:
WHERE ForestID REGEXP '106[0-9][3-4]'
Regular expressions match patterns anywhere inside a string. So, this will match '10603'. It will also match 'abc10694 def'. This is true of regular expressions in general, not merely one database's implementation of them.
If this is the behavior you want, then the corresponding LIKE (in SQL Server) is:
WHERE ForestID LIKE '%106[0-9][3-4]%'
If you only want 5-digit values, then the corresponding regular expression is:
WHERE ForestID REGEXP '^106[0-9][3-4]$'
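The difference between the unanchored and the anchored pattern is easy to check with a small Python sketch. Python's re module is used here only as a stand-in for whichever regex engine the database provides, and the test values are examples.

```python
import re

unanchored = re.compile(r"106[0-9][3-4]")    # matches anywhere in the string
anchored = re.compile(r"^106[0-9][3-4]$")    # must be the whole string

results = {}
for value in ["10603", "abc10694 def", "10605"]:
    results[value] = (
        bool(unanchored.search(value)),
        bool(anchored.search(value)),
    )
    print(value, results[value])
```

LIKE works the other way around: it is anchored unless you add % at either end.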

You do not need to interact with managed code, as you can use LIKE:
SELECT DISTINCT Loggers
FROM [alo].[Forests] C
WHERE (R.LogSU = 3)
AND (ForestID LIKE '106[0-9][3-4]')
To make it clear: SQL Server doesn't support regular expressions without managed code. Depending on the situation, the LIKE operator can be an option, but it lacks the flexibility that regular expressions provide.
If you would like to have full regular expression functionality, try this.

Try the following:
SELECT DISTINCT Loggers
FROM [alo].[Forests] C
WHERE (R.LogSU = 3)
AND ((ForestID LIKE '%106_3%' OR ForestID LIKE '%106_4%'))

Using SQL - how do I match an exact number of characters?

My task is to validate existing data in an MSSQL database. I've got some SQL experience, but not enough, apparently. We have a zip code field that must be either 5 or 9 digits (US zip). What we are finding in the zip field are embedded spaces and other oddities that will be prevented in the future. I've searched enough to find the references for LIKE that leave me with this "novice approach":
ZIP NOT LIKE '[0-9][0-9][0-9][0-9][0-9]'
AND ZIP NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
Is this really what I must code? Is there nothing similar to...?
ZIP NOT LIKE '[\d]{5}' AND ZIP NOT LIKE '[\d]{9}'
I will loathe validating longer fields! I suppose, ultimately, both code sequences will be equally efficient (or should be).
Thanks for your help

Unfortunately, LIKE is not regex-compatible, so there is nothing like \d. However, combining a length function with a numeric function may give an acceptable result:
WHERE ISNUMERIC(ZIP) <> 1 OR LEN(ZIP) NOT IN(5,9)
I would not recommend it, however, because ISNUMERIC returns 1 for a +, a -, or a valid currency symbol. The minus sign in particular may be prevalent in the data set, so I'd still favor your "novice" approach.
Another approach is to use:
ZIP LIKE '%[^0-9]%' OR LEN(ZIP) NOT IN(5,9)
which will find any row where ZIP contains a character that is not 0-9 (i.e. only 0-9 is allowed) or where the length is not 5 or 9.
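The rule being validated (digits only, length 5 or 9) can be written out as a small Python sketch to test against sample values. The function name is_invalid_zip and the sample inputs are hypothetical. This only mirrors the logic and does not reproduce SQL Server details such as how LEN treats trailing spaces.

```python
import re

def is_invalid_zip(zip_code):
    """Flag values that contain a non-digit or whose length is not 5 or 9."""
    has_non_digit = re.search(r"[^0-9]", zip_code) is not None
    wrong_length = len(zip_code) not in (5, 9)
    return has_non_digit or wrong_length

for sample in ["12345", "123456789", "1234", "12 45", "12345-6789"]:
    print(repr(sample), is_invalid_zip(sample))
```

The two checks are joined with OR because a value is bad as soon as either one fails.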

There are a few ways you could achieve that.
You can replace each [0-9] with _ (which matches any single character, not only a digit), for example for five characters:
ZIP NOT LIKE '_____'
Or use LEN(), like so:
LEN(ZIP) NOT IN(5,9)

You are looking for LEN() (SQL Server's counterpart to LENGTH() in other databases):
select * from table WHERE LEN(ZIP)=5;
select * from table WHERE LEN(ZIP)=9;

To test for non-numeric values you can use ISNUMERIC():
WHERE ISNUMERIC(ZIP) <> 1
