Regular Expressions CONTAINS in BigQuery example - google-bigquery

I need to write a regular expression in BigQuery to match the following two values in the title column. There are other values containing "3 Percent", but I want to get exactly these two:
WBC - SAV - 3 Percent Q4 FY20
Canstar - canstar.com.au - AFF: Table Listing - Cost per click - National - 1x1 - 3 percent Savings
My code is:
WHEN REGEXP_CONTAINS(title, '(?i) 3 Percent')
THEN '3% PF'
I am not getting the correct output. Can anyone please assist?

There are some other values containing 3 Percent, but I want to get only these two.
So, in this case you don't need a regular expression; rather, use the below:
WHEN title IN (
'WBC - SAV - 3 Percent Q4 FY20',
'Canstar - canstar.com.au - AFF: Table Listing - Cost per click - National - 1x1 - 3 percent Savings'
) THEN '3% PF'
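If you do want to keep a pattern-based check (for example, because small variations of these two titles can appear), a hedged sketch is to anchor the regex on the distinct endings of the two titles so that other "3 Percent" rows are not matched; your_table and the bucket alias are placeholder names here:
-- (?i) makes the match case-insensitive; $ anchors the pattern to the end of the title
SELECT title,
  CASE
    WHEN REGEXP_CONTAINS(title, r'(?i)(3 percent q4 fy20|3 percent savings)$')
      THEN '3% PF'
  END AS bucket
FROM your_table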

Related

SQL query to get different columns from a JSON string using pattern matching

I have a json string - {"exteriorCheck":{"criteria":[{"code":"EXTERIOR_BODY","title":"Exterior - XYZ","value":5},{"code":"EXTERIOR_RIMS","title":"Exterior - ABC","value":4}],"images":[{"code":"EXTERIOR_PICTURES","keys":["share-tasks-b1c757e3-0cb6-41ea-a298-f3430aafb36c/0"]}],"comment":"i.o "},"interiorCheck":{"criteria":[{"code":"INTERIOR_SEATS","title":"Interior - Seats","value":5}
I want to create a column for each "title"; for example, for "title":"Exterior - XYZ" the column would be Exterior - XYZ and its value would be taken from "value":5, so 5 would be my output. Since there are multiple such cases in the string, it is difficult to use substr with a position. I have tried:
select
  case when "json" like '%Exterior - XYZ%' then substr("json", 89, 1)
  else null end as "Exterior - XYZ"
But for the entire JSON it's difficult to get the position.
Desired output:
Exterior - XYZ | Exterior - ABC | Interior - Seats
5 | 4 | 5
How do I proceed using AWS Athena (considering that multiple string functions won't work in Athena)?
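One possible direction, sketched for Athena (Presto SQL) under the assumption that the JSON sits in a column named json_col of a table named checks and that the positions inside each criteria array are stable, is json_extract_scalar with JSONPath; if the positions vary, the criteria array would have to be unnested and filtered on its "title" field instead:
SELECT
  json_extract_scalar(json_col, '$.exteriorCheck.criteria[0].value') AS "Exterior - XYZ",
  json_extract_scalar(json_col, '$.exteriorCheck.criteria[1].value') AS "Exterior - ABC",
  json_extract_scalar(json_col, '$.interiorCheck.criteria[0].value') AS "Interior - Seats"
FROM checks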

BigQuery: Remove last N characters

I am trying to remove the last 8 characters from a long string, but only in case it ends with the 6-character string in parentheses (the bolded ones). Does anyone know how to do this in BigQuery?
Here are some very random data examples:
01/5/2014 - new planted trees - email - juniper
04/22/2021 - fridge remote, I want fresh tea (xgssjj)
re- engagement email
5/20 - example reminder (hfgfgh)
repeat customer example #2 (ttrdgd)
Thanks!
Consider below approach
select longString,
trim(regexp_replace(longString, r'\(\w{6}\)$', '')) newString
from your_table
If applied to the sample data in your question, the rows ending in a 6-character parenthesized code have that suffix removed, and the remaining rows come back unchanged.
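A quick way to try the query above, with the sample rows inlined (your_table is just a stand-in name, as in the answer):
WITH your_table AS (
  SELECT '01/5/2014 - new planted trees - email - juniper' AS longString UNION ALL
  SELECT '04/22/2021 - fridge remote, I want fresh tea (xgssjj)' UNION ALL
  SELECT 're- engagement email' UNION ALL
  SELECT '5/20 - example reminder (hfgfgh)' UNION ALL
  SELECT 'repeat customer example #2 (ttrdgd)'
)
SELECT longString,
  TRIM(REGEXP_REPLACE(longString, r'\(\w{6}\)$', '')) AS newString
FROM your_table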

Extracting data from a SAS data set based on values with different lengths

I am looking to automate a process which has a sales dataset and a specific column named SALESCODE, which is 5 characters long.
Based on the input given by the user, I would like to filter the data, but the problem is that the user can give multiple sales codes, and the length of a code could be 5, 4, 3, 2, or 1 characters depending on the condition. How can I filter out the required rows based on the above condition?
SALESCODE area value units rep
A10AA KR 100 10 Jay
B10AQ TN 120 12 Jrn
C10AH KR 200 10 Jay
T11TA TR 180 10 Jay
Say I give the input as A10AA, B10A, T11; I should be able to
get the sales data with codes A10AA, B10AQ, and T11TA. Kindly help.
Use the IN operator. Since you want to match values that start with the specified value, use the : modifier. Since your values are character values, make sure to include quotes.
proc print data=sales_data ;
where salescode in: ("A10AA" "B10A" "T11");
run;
If you want you can use commas between the values in the list, but I find it easier to type spaces instead.

How to create deltas in BigQuery

I have a table in BQ which I refresh on a daily basis. It's a full snapshot every day.
I have a business requirement to create deltas of that feed.
Table Details :
Table contains 10 columns
Out of the 10 columns, 5 change on a daily basis. How do I identify which columns changed and create a snapshot of only the changed data?
For example, here are the columns in tableA; the columns which will frequently change are in bold.
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - ebook
last_product_purchase_date - 2018-05-01
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - null
third_product_purchase_date - null
fourth_product - null
fourth_product_purchase_date - null
After more purchase Data will look like this:
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - Hardbook
last_product_purchase_date - 2018-05-17
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - CD
third_product_purchase_date - 2017-01-01
fourth_product - null
fourth_product_purchase_date - null
first_product = first product ever purchased
last_product = most recent product purchased
This is just one row of records for one customer. I have millions of customers with all these columns, and let's say half a million of the rows will be updated on a daily basis.
In my delta, I just want the rows where any column value changed.
It seems like you have a column for each product bought and its repetition; perhaps this comes from a de-normalized dimensional model. To query the last "update" you would have to compare each column to the previous row by using the LEAD function. This would use a lot of computation and might not be optimal.
I recommend using repeated fields. The product and product_purchase_date would be repeated fields, and you could simply query with a WHERE product_purchase_date = CURRENT_DATE() clause, which would use much less computation.
De-normalized dimensional models are meant to use less computation on traditional data warehouses. BigQuery, being a fast, highly scalable enterprise data warehouse, has a lot of computing power.
To get a better understanding of how BigQuery works under the hood, I recommend reviewing this document.
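For illustration only, a minimal sketch of the repeated-field layout and the query described above, assuming a hypothetical table purchases with one row per customer and an ARRAY<STRUCT<product STRING, purchase_date DATE>> column named products:
-- Rows whose purchase happened today, without scanning every product column separately.
SELECT custid, p.product, p.purchase_date
FROM `your_project.your_dataset.purchases`,
  UNNEST(products) AS p
WHERE p.purchase_date = CURRENT_DATE()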

SQL Server - Update field in all table records, using a field from the same table and values from another table

I have this scenario:
Table Territory
ID (int) - CODE (varchar) - NAME (varchar)
Data:
1 - GB - UNITED KINGDOM
2 - GB - ISLE OF MAN
3 - GB - NORTHERN IRELAND
4 - PT - PORTUGAL
5 - DE - GERMANY
6 - DE - HELGOLAND ISLAND
Table Rules:
ID (int) - TERRITORY_CODES (varchar) - TERRITORY_IDS (varchar)
1 - 'GB,PT' - NULL
2 - 'DE,PT' - NULL
I know the second table should not be like this, but I have no option to change it.
I want to fill the column TERRITORY_IDS with the IDs from the table TERRITORY separated by comma. For example:
Table Rules
ID (int) - TERRITORY_CODES (varchar) - TERRITORY_IDS (varchar)
1 - 'GB,PT' - '1,4'
2 - 'DE,PT' - '5,4'
There are several IDs for each territory code, but I want only one ID per code; it could be the first one, it doesn't matter.
What you are looking to do is a bad idea. It is a good thing that you recognize this. But for those reading this question who do not understand why it is bad: it violates the first normal form (1NF) principle, which says that all columns should be atomic, meaning they hold one and only one value.
Let's get to the nuts and bolts of how to do this: COALESCE to the rescue.
Since I do not know why 'gb,pt' and 'de,pt' are grouped that way, I didn't wrap this in a cursor to go through the whole table. But you can easily wrap it in a cursor and have it process the entire table's contents.
-- Concatenate the IDs of every territory whose code is in the list.
DECLARE @TERRITORY_Ids varchar(100)

SELECT @TERRITORY_Ids = COALESCE(@TERRITORY_Ids + ', ', '') +
       CAST(ID AS varchar(10))
FROM table_territory
WHERE CODE IN ('gb', 'pt')

-- Store the result as a rules row.
INSERT INTO table_rules (TERRITORY_CODES, TERRITORY_IDS)
SELECT 'gb,pt', @TERRITORY_Ids
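If you are on SQL Server 2017 or later, a set-based sketch without a cursor could look like the following. Assumptions: the tables are named table_territory and table_rules as above, one (the lowest) territory ID is kept per code, and the order of the IDs inside TERRITORY_IDS is not guaranteed to match the order of the codes:
-- For every rule row, split its code list, pick one territory ID per code,
-- and write the comma-separated result back into TERRITORY_IDS.
UPDATE r
SET TERRITORY_IDS = ids.id_list
FROM table_rules AS r
CROSS APPLY (
    SELECT STRING_AGG(CONVERT(varchar(10), x.min_id), ',') AS id_list
    FROM (
        SELECT MIN(t.ID) AS min_id
        FROM STRING_SPLIT(r.TERRITORY_CODES, ',') AS s
        JOIN table_territory AS t ON t.CODE = s.value
        GROUP BY s.value
    ) AS x
) AS ids;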