BQ scripting: Writing results of a loop to a table - google-bigquery

I am working with BigQuery scripting. I have written a simple WHILE loop that iterates through daily Google Analytics tables and sums the visits, and now I'd like to write these results out to a table.
I've gotten as far as creating the table, but I can't capture the value of visits from my SQL query to populate the table. Date works fine, because it is defined outside of the SQL. I tried to DECLARE the value of visits with a new variable, but again this does not work because it's not known outside of the statement.
SET vis = visits;
How can I correctly write my results out to a table?
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),"-","");
DECLARE vis INT64;
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SELECT d, SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix
GROUP BY Date;
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET vis = visits;
INSERT INTO test.looped_results VALUES (d, visits);
END WHILE;
Update: I also tried an alternative solution, assigning visits to its own variable, but this produces the same error:
WHILE d > '2019-10-01' DO
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `mindful-agency-136314.43786551.ga_sessions_*`
WHERE _table_suffix = pfix);
INSERT INTO test.looped_results VALUES (d, vis_count);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
END WHILE;
Results:
In my results I see the correct number of rows created, with the correct dates, but the value of visits for each is the value for the most recent day.

I would also move the INSERT INTO outside of the WHILE loop by collecting the results in an array variable (along with a few other minor changes), as in the example below:
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING;
DECLARE vis_count INT64;
DECLARE result ARRAY<STRUCT<vis_date DATE, vis_count INT64>> DEFAULT [];
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix);
SET result = ARRAY_CONCAT(result, [STRUCT(d, vis_count)]);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
END WHILE;
INSERT INTO test.looped_results SELECT * FROM UNNEST(result);
Note: I hope your example is for learning scripting and not for production. Whenever possible we should stick with set-based processing, which can easily be done in your case.

Here is a better way, which is faster and does not use a loop.
Basically, you form an array of suffixes and do the SELECT/INSERT in a single query:
DECLARE date_range ARRAY<DATE> DEFAULT
GENERATE_DATE_ARRAY(DATE '2019-10-01', DATE '2019-10-10', INTERVAL 1 DAY);
DECLARE suffix_array ARRAY<STRING>
DEFAULT (SELECT ARRAY_AGG(REGEXP_REPLACE(CAST(dates AS STRING),"-",""))
FROM UNNEST(date_range) dates);
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
INSERT INTO test.looped_results
SELECT Date, SUM(totals.visits)
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix IN UNNEST(suffix_array)
GROUP BY Date;

Actually, you need to update the pfix variable inside the loop. Also, it is a good idea to initialize visits. Finally, your GROUP BY isn't needed when the pfix filter already restricts the query to a single daily table.
This should do it:
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),'-','');
DECLARE visits int64;
SET visits = 0;
CREATE OR REPLACE TABLE project.dataset.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET visits = (SELECT SUM(totals.visits) FROM `project.dataset.ga_sessions_*` WHERE _table_suffix = pfix);
INSERT INTO project.dataset.looped_results VALUES (d, visits);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
END WHILE;
Hope it helps.

Having reviewed my code (several times!) I realized that I wasn't refreshing the variable that transforms the date into the table suffix within the loop.
Here is a working version of the script, where I set pfix at the end of the loop:
DECLARE d DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
DECLARE pfix STRING DEFAULT REGEXP_REPLACE(CAST(d AS STRING),"-","");
DECLARE vis_count INT64;
CREATE OR REPLACE TABLE test.looped_results (Date DATE, Visits INT64);
WHILE d > '2019-10-01' DO
SET vis_count = (SELECT SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
WHERE _table_suffix = pfix);
INSERT INTO test.looped_results VALUES (d, vis_count);
SET d = DATE_SUB(d, INTERVAL 1 DAY);
SET pfix = REGEXP_REPLACE(CAST(d AS STRING),"-","");
END WHILE;

I would actually not use a while loop but rather a group by
SELECT date, SUM(totals.visits) AS visits
FROM `mindful-agency-136314.43786551.ga_sessions_*`
GROUP BY Date;
This gives you the results in the table shape you want; you don't need to loop over the tables.
Depending on your setup, you can schedule the query to run every day, so that each new day you get the new values.
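As a rough sketch of what such a daily run could look like, the statement below appends only yesterday's total to the results table from the question. It assumes the daily tables are named `ga_sessions_YYYYMMDD`, so the date can be parsed from `_TABLE_SUFFIX`; the alias `visit_date` is only illustrative.

```sql
-- Sketch of a daily job: append yesterday's visits to the results table.
-- Assumes table suffixes are dates in YYYYMMDD format.
INSERT INTO test.looped_results (Date, Visits)
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS visit_date,
  SUM(totals.visits) AS visits
FROM `project.dataset.ga_sessions_*`
-- An exact match on the suffix also keeps any intraday tables out of the result.
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY visit_date;
```

Because the filter is a constant expression on `_TABLE_SUFFIX`, only the matching daily table should be scanned.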

Related

How to turn this script into a UDF in BigQuery

This should be easy to do but I'm struggling.
I have the script below, which calculates the date a given number of business days before or after a date. I need to turn it into a UDF, as I'll be using it in multiple views. It is for BigQuery:
DECLARE Date DATE;
DECLARE DAYS_EXTEND INT64;
DECLARE COUNTER INT64;
SET Date = '2021-12-23';
SET DAYS_EXTEND = 3;
SET COUNTER = DAYS_EXTEND;
BEGIN
WHILE COUNTER > 0 Do
SET COUNTER = COUNTER -1;
SET Date = Date +1;
IF Extract( DAYOFWEEK from Date) in (1,7) or Date in ('2021-01-01','2021-04-02','2021-04-05','2021-05-03','2021-05-31','2021-08-30','2021-12-27','2021-12-28','2022-01-03','2022-04-15','2022-04-18','2022-05-02','2022-06-02','2022-06-03','2022-08-29','2022-12-26','2022-12-27')
Then
BEGIN
SET DAYS_EXTEND = DAYS_EXTEND +1;
SET COUNTER = COUNTER +1;
END;
END IF;
END WHILE;
WHILE COUNTER <0 DO
SET COUNTER = COUNTER +1;
SET Date = Date -1;
IF Extract( DAYOFWEEK from Date) in (1,7) or Date in ('2021-01-01','2021-04-02','2021-04-05','2021-05-03','2021-05-31','2021-08-30','2021-12-27','2021-12-28','2022-01-03','2022-04-15','2022-04-18','2022-05-02','2022-06-02','2022-06-03','2022-08-29','2022-12-26','2022-12-27')
Then
BEGIN
SET DAYS_EXTEND = DAYS_EXTEND - 1;
SET COUNTER = COUNTER - 1;
END;
END IF;
END WHILE;
END;
I'm just not sure how to turn it into a Select statement or if it is possible to do UDF without a Select statement with the loops remaining.
I'd appreciate any help.
From what I understand of your question, you are trying to create a function or procedure that uses selects and loops to identify business days. Personally, I would avoid complex scripting and keep it as a last resort. I think this can be achieved using selects alone.
Here is some sample code that can help you identify business days:
DECLARE MyDate DATE;
DECLARE DAYS_RANGE INT64;
SET MyDate = '2021-11-22';
SET DAYS_RANGE = 10;
/*if it fits you, create a table for special dates and holidays*/
create temp table offdays (
nonworkingdays date
);
insert into offdays values('2021-11-23');
insert into offdays values('2021-11-19');
with working_dates as(
select dafter,
case when EXTRACT(DAYOFWEEK FROM dafter) not in (1,7) then 1 else 0 end as isweekdays_after,
case when dafter in (select nonworkingdays from offdays) then 1 else 0 end as isoffdays_after,
dbefore,
case when EXTRACT(DAYOFWEEK FROM dbefore) not in (1,7) then 1 else 0 end as isweekdays_before,
case when dbefore in (select nonworkingdays from offdays) then 1 else 0 end as isoffdays_before
from (
select date_add(MyDate, INTERVAL days_to_count DAY) as dafter,
date_add(MyDate, INTERVAL -days_to_count DAY) as dbefore
from (SELECT days_to_count FROM UNNEST(GENERATE_ARRAY(1, DAYS_RANGE)) AS days_to_count)
)
)
select * from working_dates
If you run it, you will get a result set working_dates with as many rows as the given range, and columns that help you identify weekdays and off days.
You can use this code to create a function or procedure that takes parameters, looks at the days after or before the date, and returns either the days themselves or a count of them, filtered by the weekday and off-day columns.
Take this as a sample function derived from the script above:
CREATE TEMP FUNCTION GetBusinessDaysAfter(fnDate Date, days_range INT64)
RETURNS INT64
AS ((
select count(d.dafter) from(
select dafter,
case when EXTRACT(DAYOFWEEK FROM dafter) not in (1,7) then 1 else 0 end as isweekdays_after,
case when dafter in ('2021-11-23') then 1 else 0 end as isoffdays_after,
from (
select date_add(fnDate, INTERVAL days_to_count DAY) as dafter
from (SELECT days_to_count FROM UNNEST(GENERATE_ARRAY(1, days_range)) AS days_to_count)
)) as d
where d.isweekdays_after=1 and d.isoffdays_after=0
));
select GetBusinessDaysAfter('2021-11-22',3)
I can use it to retrieve how many business days fall within the next 3 days. It turns out to be just 2. (For the sake of the sample, I hard-coded a single off day; you can replace it with a reference to a holidays table in your dataset.)
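The function above counts business days, whereas the script in the question returns the date reached after a number of business days. A minimal sketch of that variant follows, under a few assumptions: the name `AddBusinessDays` is hypothetical, the holidays are passed in as an array instead of being read from a table, and the search window of `2 * n + 30` calendar days is a heuristic that may need widening for long holiday runs. It only handles positive `n`; the backward case would need a mirrored date range.

```sql
-- Hypothetical helper: returns the n-th business day after start_date.
-- holidays is supplied by the caller, so the function needs no table access.
CREATE TEMP FUNCTION AddBusinessDays(start_date DATE, n INT64, holidays ARRAY<DATE>)
RETURNS DATE
AS ((
  -- Collect the candidate business days in order and pick the n-th one.
  -- SAFE_ORDINAL yields NULL when fewer than n business days are in the window.
  SELECT ARRAY_AGG(d ORDER BY d)[SAFE_ORDINAL(n)]
  FROM UNNEST(GENERATE_DATE_ARRAY(
         DATE_ADD(start_date, INTERVAL 1 DAY),
         -- Heuristic search window (an assumption, not a proven bound).
         DATE_ADD(start_date, INTERVAL (2 * n + 30) DAY))) AS d
  WHERE EXTRACT(DAYOFWEEK FROM d) NOT IN (1, 7)  -- skip Sunday and Saturday
    AND d NOT IN UNNEST(holidays)                -- skip listed holidays
));

SELECT AddBusinessDays(DATE '2021-12-23', 3,
                       [DATE '2021-12-27', DATE '2021-12-28']) AS result;
```

With the inputs from the question (2021-12-23, 3 days, and the two late-December holidays), this should return 2021-12-30, the same date the original loop arrives at.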
For more information about scripting, functions and procedures, here are some useful links:
Data definition language
Working with arrays
Working with dates functions

Loop over query in bigquery

I'm trying to loop over a query result and combine the result.
I want to loop over the variable called rolling date, which gives out an array of dates with 30 day difference.
DECLARE rollingdate ARRAY<DATE>;
SET rollingdate = ( GENERATE_DATE_ARRAY(CURRENT_DATE(), DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY), INTERVAL -30 DAY) );
My table is partitioned by DATE, and I'd like to loop over two consecutive dates from the rolling date and union all the results
select *, rollingdate[0]
from table
where date > rollingdate[1] and date < rollingdate[0]
union all
select *, rollingdate[1]
from table
where date > rollingdate[2] and date < rollingdate[1]
How do I achieve this in BigQuery? I tried with BigQuery scripts, but they don't take subqueries.
You can try using EXECUTE IMMEDIATE:
DECLARE i INT64 DEFAULT 1;
DECLARE dsql STRING DEFAULT '';
DECLARE rollingdate ARRAY<DATE>;
SET rollingdate = ( GENERATE_DATE_ARRAY(CURRENT_DATE(), DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY), INTERVAL -30 DAY) );
WHILE i <= 2
DO
SET dsql = dsql || " select *, '" || CAST(rollingdate[ORDINAL(i)] AS STRING) || "' from table where date > '" || CAST(rollingdate[ORDINAL(i+1)] AS STRING) || "' and date < '" || CAST(rollingdate[ORDINAL(i)] AS STRING) || "' union all";
SET i = i + 1;
END WHILE;
SET dsql = SUBSTR(dsql, 1, LENGTH(dsql) - LENGTH(' union all'));
EXECUTE IMMEDIATE dsql;
An approach like this should be ok:
with
dates as (
select * from unnest(generate_date_array(current_date, date_sub(current_date(), interval 365 DAY), interval -30 day)) as date
),
date_start_end as (
select lag(date,1) over(order by date asc) as begin_date, date as end_date
from dates
)
select table.*, end_date
from table
inner join date_start_end on date between begin_date and end_date
You might need to make adjustments depending on whether your date ranges are meant to be inclusive or exclusive.
As mentioned in my comment, any date in table will only be in 1 of your rollingdate intervals. SQL is much more performant when you can do operations on a set and avoid loops.
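Since each date falls into exactly one 30-day interval, the interval can also be computed arithmetically per row, without a join or a loop. The sketch below assumes a placeholder table `project.dataset.my_table` with a DATE column named `date`; the names `bucket`, `days_back` and `window_end` are illustrative. A date lying exactly on an interval boundary is assigned to the interval that ends on it, which differs from the strict `>` and `<` comparisons in the question.

```sql
-- Placeholder table and column names; adjust to your schema.
WITH bucketed AS (
  SELECT
    t.*,
    -- 0 for the most recent 30 days, 30 for the 30 days before that, and so on.
    30 * DIV(DATE_DIFF(CURRENT_DATE(), t.date, DAY), 30) AS days_back
  FROM `project.dataset.my_table` AS t
  -- Keep roughly the last year: twelve 30-day windows.
  WHERE t.date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 359 DAY) AND CURRENT_DATE()
)
SELECT
  * EXCEPT (days_back),
  -- End date of the window the row belongs to (matches the rollingdate values).
  DATE_SUB(CURRENT_DATE(), INTERVAL days_back DAY) AS window_end
FROM bucketed;
```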

google bigquery in visual data studio error declare variable

I have the query below on Google BigQuery. It runs fine in BigQuery, but when I copy it to Data Studio it fails with the error "Data set configuration error". I don't understand this error; please help.
I think the cause of the problem is the variable declarations in BigQuery:
DECLARE yesterday date;
DECLARE count_week01 INT64; -- number of users active in (last week - 1)
DECLARE count_week0 INT64; -- number of users active in last week
DECLARE retention FLOAT64; -- % of users from (last week - 1) still active last week
DECLARE churn_rate FLOAT64; -- % of users from (last week - 1) who left
DECLARE count_user_active_week0 INT64; -- number of users active in both weeks
DECLARE week0_begin string; -- begin date of last week
DECLARE week0_end string; -- end date of last week
DECLARE week01_begin string; -- begin date of (last week - 1)
DECLARE week01_end string; -- end date of (last week - 1)
SET yesterday = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY);
--Set value for last week from yesterday
SET week0_begin = FORMAT_DATE('%Y%m%d', DATE_ADD(yesterday, INTERVAL -6 DAY));
SET week0_end = FORMAT_DATE('%Y%m%d', yesterday);
--Set value for (last week -1) from yesterday
SET week01_begin = FORMAT_DATE('%Y%m%d', DATE_ADD(yesterday, INTERVAL -13 DAY));
SET week01_end = FORMAT_DATE('%Y%m%d', DATE_ADD(yesterday, INTERVAL -7 DAY));
--Count number user active time in last week -1
SET count_week01 = (SELECT COUNT(DISTINCT param.value.string_value) as phone
FROM `analytics_223664948.events_*` ev,
UNNEST(event_params) AS param
WHERE ev.event_name = 'spl_active_time' and param.key = 'PhoneNumber' AND _TABLE_SUFFIX BETWEEN week01_begin AND week01_end);
--Count number user active time in last week
SET count_week0 = (SELECT COUNT(DISTINCT param.value.string_value) as phone
FROM `analytics_223664948.events_*` ev,
UNNEST(event_params) AS param
WHERE ev.event_name = 'spl_active_time' and param.key = 'PhoneNumber' AND _TABLE_SUFFIX BETWEEN week0_begin AND week0_end);
--Join last week and last week -1
SET count_user_active_week0 = (SELECT COUNT(week0.phone) FROM
(
(SELECT DISTINCT param.value.string_value as phone
FROM `analytics_223664948.events_*` ev,
UNNEST(event_params) AS param
WHERE ev.event_name = 'spl_active_time' and param.key = 'PhoneNumber' AND _TABLE_SUFFIX BETWEEN week0_begin AND week0_end) week0
INNER JOIN
(SELECT DISTINCT param.value.string_value as phone
FROM `analytics_223664948.events_*` ev,
UNNEST(event_params) AS param
WHERE ev.event_name = 'spl_active_time' and param.key = 'PhoneNumber' AND _TABLE_SUFFIX BETWEEN week01_begin AND week01_end) week01
ON week0.phone = week01.phone
)
);
SET retention = 100*count_user_active_week0/count_week01;
SET churn_rate = 100 - retention;
SELECT week01_begin, week01_end, week0_begin, week0_end, count_week0, count_week01, retention, churn_rate
You can use WITH ... AS instead of DECLARE.
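A sketch of how the script could be folded into a single SELECT with common table expressions is shown below, on the assumption that the reporting tool accepts one SELECT statement but not a multi-statement script. The CTE names (`bounds`, `per_user`, `counts`) and the columns `in_week0`, `in_week01` and `count_both` are illustrative. `SAFE_DIVIDE` is swapped in for `/` so that an empty earlier week gives NULL instead of an error.

```sql
WITH bounds AS (
  -- Same date boundaries as the DECLARE/SET version, relative to today.
  SELECT
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)) AS week01_begin,
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))  AS week01_end,
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))  AS week0_begin,
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))  AS week0_end
),
per_user AS (
  -- One row per phone number, flagged by the week(s) it was active in.
  SELECT
    param.value.string_value AS phone,
    LOGICAL_OR(_TABLE_SUFFIX BETWEEN b.week0_begin AND b.week0_end)   AS in_week0,
    LOGICAL_OR(_TABLE_SUFFIX BETWEEN b.week01_begin AND b.week01_end) AS in_week01
  FROM `analytics_223664948.events_*` AS ev
  CROSS JOIN UNNEST(ev.event_params) AS param
  CROSS JOIN bounds AS b
  WHERE ev.event_name = 'spl_active_time'
    AND param.key = 'PhoneNumber'
    AND param.value.string_value IS NOT NULL
    -- Constant-expression filter so only the two weeks of tables are scanned.
    AND _TABLE_SUFFIX BETWEEN
          FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY))
      AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  GROUP BY phone
),
counts AS (
  SELECT
    COUNTIF(in_week0)               AS count_week0,
    COUNTIF(in_week01)              AS count_week01,
    COUNTIF(in_week0 AND in_week01) AS count_both
  FROM per_user
)
SELECT
  b.week01_begin, b.week01_end, b.week0_begin, b.week0_end,
  c.count_week0, c.count_week01,
  SAFE_DIVIDE(100 * c.count_both, c.count_week01)       AS retention,
  100 - SAFE_DIVIDE(100 * c.count_both, c.count_week01) AS churn_rate
FROM counts AS c
CROSS JOIN bounds AS b;
```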

How to generate with scripting INTERVAL 1 <day|week|month>?

We are trying to find a syntax to generate the DAY|WEEK|MONTH option that is passed as the third parameter of date functions.
DECLARE var_date_option STRING DEFAULT 'DAY';
select GENERATE_DATE_ARRAY('2019-01-01','2020-01-01',INTERVAL 1 WEEK)
dynamic param here -^^^
Do you know the proper syntax to use in DECLARE so that it is converted to valid SQL?
Below is for BigQuery Standard SQL
Those DAY|WEEK|MONTH are LITERALs and cannot be parametrized
And, as you know - dynamic SQL is also not available yet
So, unfortunately below is the only solution I can think of as of today
#standardSQL
DECLARE var_date_option STRING DEFAULT 'DAY';
DECLARE start_date, end_date DATE;
DECLARE date_array ARRAY<DATE>;
SET (start_date, end_date, var_date_option) = ('2019-01-01','2020-01-01', 'MONTH');
SET date_array = (
SELECT CASE var_date_option
WHEN 'DAY' THEN GENERATE_DATE_ARRAY(start_date, end_date, INTERVAL 1 DAY)
WHEN 'WEEK' THEN GENERATE_DATE_ARRAY(start_date, end_date, INTERVAL 1 WEEK)
WHEN 'MONTH' THEN GENERATE_DATE_ARRAY(start_date, end_date, INTERVAL 1 MONTH)
END
);
SELECT * FROM UNNEST(date_array) AS date_dt;
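BigQuery scripting has since gained EXECUTE IMMEDIATE (it is used in the "Loop over query in bigquery" answer in this thread), so the date part can also be spliced into the statement text. The following is a sketch under that assumption: the allow-list check is a precaution added here because the value is inserted into the SQL string, and the parameter names `start_date` and `end_date` are illustrative.

```sql
DECLARE var_date_option STRING DEFAULT 'WEEK';
DECLARE date_array ARRAY<DATE>;

-- Only splice in known date parts, since the value becomes part of the SQL text.
IF var_date_option NOT IN ('DAY', 'WEEK', 'MONTH', 'QUARTER', 'YEAR') THEN
  RAISE USING MESSAGE = 'Unsupported date part: ' || var_date_option;
END IF;

-- The date part is formatted into the statement; the dates are bound as parameters.
EXECUTE IMMEDIATE FORMAT("""
  SELECT GENERATE_DATE_ARRAY(@start_date, @end_date, INTERVAL 1 %s)
""", var_date_option)
INTO date_array
USING DATE '2019-01-01' AS start_date, DATE '2020-01-01' AS end_date;

SELECT * FROM UNNEST(date_array) AS date_dt;
```

The CASE-based version above remains the simpler choice when the set of options is small and fixed.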

Grouping by contiguous dates, ignoring weekends in SQL

I'm attempting to group contiguous date ranges to show the minimum and maximum date for each range. So far I've used a solution similar to this one: http://www.sqlservercentral.com/articles/T-SQL/71550/ however I'm on SQL 2000 so I had to make some changes. This is my procedure so far:
create table #tmp
(
date smalldatetime,
rownum int identity
)
insert into #tmp
select distinct date from testDates order by date
select
min(date) as dateRangeStart,
max(date) as dateRangeEnd,
count(*) as dates,
dateadd(dd,-1*rownum, date) as GroupID
from #tmp
group by dateadd(dd,-1*rownum, date)
drop table #tmp
It works exactly how I want except for one issue: weekends. My data sets have no records for weekend dates, which means any group found is at most 5 days. For instance, in the results below, I would like the last 3 groups to show up as a single record, with a dateRangeStart of 10/6 and a dateRangeEnd of 10/20:
Is there some way I can set this up to ignore a break in the date range if that break is just a weekend?
Thanks for the help.
EDITED
I didn't like my previous idea very much. Here's a better one, I think:
1. Based on the first and the last dates from the set of those to be grouped, prepare the list of all the intermediate weekend dates.
2. Insert the working dates together with weekend dates, ordered, so they would all be assigned rownum values according to their normal order.
3. Use your method of finding contiguous ranges with the following modifications:
   a) when calculating dateRangeStart, if it's a weekend date, pick the nearest following weekday;
   b) accordingly for dateRangeEnd, if it's a weekend date, pick the nearest preceding weekday;
   c) when counting dates for the group, pick only weekdays.
4. Select from the resulting set only those rows where dates > 0, thus eliminating the groups formed only of the weekends.
And here's an implementation of the method, where it is assumed, that a week starts on Sunday (DATEPART returns 1) and weekend days are Sunday and Saturday:
DECLARE @tmp TABLE (date smalldatetime, rownum int IDENTITY);
DECLARE @weekends TABLE (date smalldatetime);
DECLARE @minDate smalldatetime, @maxDate smalldatetime, @date smalldatetime;
/* #1 */
SELECT @minDate = MIN(date), @maxDate = MAX(date)
FROM testDates;
SET @date = @minDate - DATEPART(dw, @minDate) + 7;
WHILE @date < @maxDate BEGIN
INSERT INTO @weekends
SELECT @date UNION ALL
SELECT @date + 1;
SET @date = @date + 7;
END;
/* #2 */
INSERT INTO @tmp
SELECT date FROM testDates
UNION
SELECT date FROM @weekends
ORDER BY date;
/* #3 & #4 */
SELECT *
FROM (
SELECT
MIN(date + CASE DATEPART(dw, date) WHEN 1 THEN 1 WHEN 7 THEN 2 ELSE 0 END)
AS dateRangeStart,
MAX(date - CASE DATEPART(dw, date) WHEN 1 THEN 2 WHEN 7 THEN 1 ELSE 0 END)
AS dateRangeEnd,
COUNT(CASE WHEN DATEPART(dw, date) NOT IN (1, 7) THEN date END) AS dates,
DATEADD(d, -rownum, date) AS GroupID
FROM @tmp
GROUP BY DATEADD(d, -rownum, date)
) s
WHERE dates > 0;