Google BigQuery extract string from column with regexp_extract - sql

I need to extract the app_id from the following string in the column below:
&app_id=4.25.9&
so anything that starts with &app_id= and ends with &
Any ideas on how to write this example? The number of characters and symbols may differ.

This should be sufficient:
REGEXP_EXTRACT(your_column, r'\&app_id=(.+?)\&')
The .+? that matches the app id is a little broad, since it accepts any character; you may want to restrict it further, for example to [^&]+.
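To see how a narrower capture behaves, here is a small sketch in Python. Python's re module only stands in for BigQuery's REGEXP_EXTRACT here, and the helper name extract_app_id and the longer sample string are made up for illustration; only &app_id=4.25.9& comes from the question.

```python
import re

# Same idea as the BigQuery pattern, with the capture narrowed to
# "one or more characters that are not &".
APP_ID = re.compile(r"&app_id=([^&]+)&")


def extract_app_id(text):
    """Return the captured app id, or None when the marker is absent."""
    match = APP_ID.search(text)
    return match.group(1) if match else None


print(extract_app_id("&app_id=4.25.9&"))
print(extract_app_id("x=1&app_id=4.25.9&y=2"))  # hypothetical longer value
```

Like the original pattern, this still requires a trailing &, so an app_id at the very end of the string would not be captured.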

Related

Get last occurrence for the string after '/' in redshift

I am fairly new to regular expressions and have always had trouble following them. It would be really helpful if I could get an answer to the following problem.
I have a column of strings in a Redshift table and want to extract a certain part of each string (the part after the last '/'). For example, I have https://hello.com/my_first_website in my Redshift table in a column named customer_site, and from this I want to extract my_first_website as output. Can someone tell me a regular expression that would extract this?
You can use the regexp_substr function, for example:
SELECT regexp_substr('https://hello.com/my_first_website','[^/]*$')
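The pattern [^/]*$ can be checked outside Redshift too. The sketch below uses Python's re module as a stand-in for regexp_substr; the function name last_path_segment is hypothetical.

```python
import re


def last_path_segment(url):
    # [^/]*$ matches the run of non-slash characters that ends the string,
    # which is whatever follows the last '/'.
    return re.search(r"[^/]*$", url).group(0)


print(last_path_segment("https://hello.com/my_first_website"))
```

Because * also allows zero characters, a value ending in '/' gives an empty string rather than failing to match.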

Hive SQL regexp_extract (number)_(number)

I'm new to hiveSQL and I'm trying to extract a value from the column col_a from the data df which is in this format:
\\\"id\\\":\\\"101_12345\\\"
I only need to extract 101_12345, but the underscore makes this hard to get right. I tried using regexp_extract(col_a, '(\\d+)[_](\\d+)') but it only outputs 101.
Could I get some help with regexp? Thank you
Simple solution: you don't need the two pairs of parentheses.
Here's a working pattern: '\\d+[_]\\d+'. To get the complete match back, pass 0 as the group index: regexp_extract(col_a, '\\d+[_]\\d+', 0).
When you put tokens into parentheses, the regex engine captures their match as a group, separate from the complete match. So the result comprises the complete match (group 0) plus two extra groups, one for the digits before the underscore and one for the digits after it. To avoid this, just remove the parentheses, as you don't really need them.
In the future, if you want to group part of a regex but don't want the result to contain it separately, use a non-capturing group, written (?:...).
Here's a demo of what your code resulted in, hosted at regex101.com
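The difference between the complete match and the captured groups can be sketched as follows. This uses Python's re module rather than Hive, so it only illustrates the grouping behaviour described above; the variable names are made up.

```python
import re

# The sample value from the question, kept verbatim in a raw string.
text = r'\\\"id\\\":\\\"101_12345\\\"'

capturing = re.search(r"(\d+)_(\d+)", text)
non_capturing = re.search(r"(?:\d+)_(?:\d+)", text)

print(capturing.group(0))      # complete match
print(capturing.group(1))      # first group: digits before the underscore
print(capturing.group(2))      # second group: digits after the underscore
print(non_capturing.groups())  # no groups are recorded for (?:...)
```

Asking for group 1 is what produced 101 in the question; group 0 is the whole 101_12345.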

Extract all elements from a string separated by underscores using regex

I am trying to capture all elements of the string below, which are separated by underscores (it is the campaign name for an advertisement), and store each in a separate column. I then wish to compare them with a master table holding the true values to determine how accurately the data is being recorded.
eg: Input :
Expected output is :
My first element extraction was: REGEXP_EXTRACT(campaign_name, r"[^_+]{3}") as parsed_campaign_agency
I only extracted first 3 letters because according to the naming convention(truth table), the agency name is made of only 3 letters.
Caveat: Some elements can have variable lengths too. eg. The third element "CrossBMC" could be 3 letters in length or more.
I am new to regex and the data lives in a SQL table (in BigQuery), so I thought this could be achieved with SQL's REGEXP_EXTRACT, but what I am having trouble with is extracting all elements at once.
Any help is appreciated :)
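If the number of underscores is constant and known, you can use SUBSTRING_INDEX (a MySQL function; in BigQuery, SPLIT(campaign_name, '_')[OFFSET(n)] gives the same pieces), like: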
SELECT
SUBSTRING_INDEX(campaign_name,'_',1) first,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',2),'_',-1) second,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',3),'_',-1) third
FROM your_table;
You can try an example at SQLize.online
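The same positional split can be sketched in Python to show what each column would hold. The campaign name ABC_Display_CrossBMC is invented for illustration (only CrossBMC as the third element comes from the question), and split_campaign is a hypothetical helper.

```python
def split_campaign(name, expected_parts=3):
    """Split on underscores and return the first expected_parts elements."""
    parts = name.split("_")
    if len(parts) < expected_parts:
        # Fewer elements than the naming convention promises.
        raise ValueError(f"expected {expected_parts} parts, got {len(parts)}")
    return parts[:expected_parts]


first, second, third = split_campaign("ABC_Display_CrossBMC")
print(first, second, third)
```

Splitting on the delimiter copes with variable-length elements, which a fixed {3} quantifier does not.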

Hive Regular expression - only portion of string needed

Hi, I am trying to extract a portion of the data from one column in my Hive table, but the text I need does not start at a fixed position.
select value4,regexp_extract(value4,'*****',0) from hive_table;
column value is shown below
grade:data:home made;Cat;dinnerbox_grade_Enroll
list:date:may;animal;dinnerbox_list_value
cgrade:made_data;dinnerbox_cgrade_notEnroll
I want the data from dinnerbox to the end of the value.
Can anyone help with this?
It is a pretty simple regular expression:
.*dinnerbox(.*?)$
The leading .* is greedy, so the match is pushed to the last dinnerbox in the value, and anchoring the non-greedy group with $ makes it run to the end of the line.
You want capture group 1.
To get rid of the leading _ you can use:
.*dinnerbox_(.*?)$
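Running the second pattern over the three sample values from the question shows what capture group 1 holds. This is a sketch with Python's re module standing in for Hive's regexp_extract; the name after_dinnerbox is hypothetical.

```python
import re

# Sample column values copied from the question.
ROWS = [
    "grade:data:home made;Cat;dinnerbox_grade_Enroll",
    "list:date:may;animal;dinnerbox_list_value",
    "cgrade:made_data;dinnerbox_cgrade_notEnroll",
]


def after_dinnerbox(value):
    """Return the text following 'dinnerbox_', or None if it is missing."""
    match = re.search(r".*dinnerbox_(.*?)$", value)
    return match.group(1) if match else None


for row in ROWS:
    print(after_dinnerbox(row))
```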

How do I check the end of a particular string using SQL pattern matching?

I am trying to use sql pattern matching to check if a string value is in the correct format.
The string code should have the correct format of:
alphanumericvalue.alphanumericvalue
Therefore, the following are valid codes:
D0030.2190
C0052.1925
A0025.2013
And the following are invalid codes:
D0030
.2190
C0052.
A0025.2013.
A0025.2013.2013
So far I have the following SQL IF clause to check that the string is correct:
IF @vchAccountNumber LIKE '_%._%[^.]'
I believe that the "_%" part checks for 1 or more characters. Therefore, this statement checks for one or more characters, followed by a "." character, followed by one or more characters and checking that the final character is not a ".".
It seems that this would work for all combinations except for the following format which the IF clause allows as a valid code:
A0025.2013.2013
I'm having trouble correcting this IF clause to allow it to treat this format as incorrect. Can anybody help me to correct this?
Thank you.
This stackoverflow question mentions using word-boundaries: [[:<:]] and [[:>:]] for whole word matches. You might be able to use this since you don't have spaces in your code.
This is an ANSI SQL solution.
This LIKE expression finds any value that does not fit the alphanumeric.alphanumeric pattern, so NOT LIKE keeps only the values that match the format you want:
IF @vchAccountNumber NOT LIKE '%[^A-Z0-9].[^A-Z0-9]%'
However, based on your examples, you can use this...
LIKE '[A-Z][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]'
...or one like this if you have 5 alphanumerics, a dot, then 4 alphanumerics
LIKE '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9].[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
The 2nd one is slightly more obvious for fixed-length values. The 1st one is slightly less intuitive but works with variable-length codes on either side of the dot.
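To check a candidate rule against the valid and invalid codes listed in the question, here is a sketch that uses a regular expression in Python rather than LIKE, so it is a different technique from the answers above. It assumes "alphanumeric" means uppercase letters and digits; is_valid_code is a hypothetical name.

```python
import re

# One or more alphanumerics, exactly one dot, one or more alphanumerics.
CODE = re.compile(r"[A-Z0-9]+\.[A-Z0-9]+")


def is_valid_code(code):
    # fullmatch requires the whole string to fit, so extra dots are rejected.
    return CODE.fullmatch(code) is not None


VALID = ["D0030.2190", "C0052.1925", "A0025.2013"]
INVALID = ["D0030", ".2190", "C0052.", "A0025.2013.", "A0025.2013.2013"]

for code in VALID + INVALID:
    print(code, is_valid_code(code))
```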
Other SO questions Creating a Function in SQL Server with a Phone Number as a parameter and returns a Random Number and Best equivalent for IsInteger in SQL Server