BigQuery: aggregate metrics by running 7d/14d/30d etc. buckets - google-bigquery

Anyone aware of a short, neat BQ query (#standardsql) to aggregate metrics (sessions / PVs / users etc.) by running 7d/14d/30d etc. buckets? For example:
16th-22nd April: 300K sessions
9th-15th April: 330K sessions
2nd-8th April: 270K sessions
OR, an out-of-the-box function that converts GA's date field (STRING) to days_since_epoch?
I wrote a query, but it's very complicated:
- manually extract the YYYY, MM, DD components with REGEXP_EXTRACT()
- convert to days_since_epoch using UNIX_DATE
- divide by 7 to group each row into weekly observations
- use GROUP BY to aggregate & report
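Roughly, a hypothetical reconstruction of that approach (the table name, its 'YYYYMMDD' STRING date column and the sessions metric are assumptions for illustration):
#standardSQL
SELECT
  DIV(
    UNIX_DATE(DATE(
      CAST(REGEXP_EXTRACT(date, r'^(\d{4})') AS INT64),       -- YYYY
      CAST(REGEXP_EXTRACT(date, r'^\d{4}(\d{2})') AS INT64),  -- MM
      CAST(REGEXP_EXTRACT(date, r'(\d{2})$') AS INT64)        -- DD
    )),
    7
  ) AS week_bucket,   -- days_since_epoch DIV 7
  SUM(sessions) AS sessions
FROM `your_dataset.your_table`
GROUP BY week_bucket
ORDER BY week_bucket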
Any pointers to simplify this use case would be highly appreciated!
Cheers!

Anyone aware of a short, neat BQ query (#standardsql) to aggregate metrics (sessions / PVs / users etc.) by running 7d/14d/30d etc. buckets
See the 7d example below for BigQuery Standard SQL - you can apply this logic to whatever data you have, hopefully with only light adjustments.
#standardSQL
WITH data AS (
  SELECT
    day, CAST(1000 * RAND() AS INT64) AS events
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-25')) AS day
)
SELECT
  FORMAT_DATE('%U', day) AS week,
  FORMAT_DATE('%Y, %B %d', MIN(day)) AS start,
  FORMAT_DATE('%Y, %B %d', MAX(day)) AS finish,
  SUM(events) AS events
FROM data
GROUP BY week
ORDER BY week
It produces the output below, which can be used as a starting point for further tailoring to your desired layout:
week start finish events
01 2017, January 01 2017, January 07 3699
02 2017, January 08 2017, January 14 4008
03 2017, January 15 2017, January 21 3726
... ... ... ...
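If you prefer the bucket label to be the week's start date rather than the week number, a small variant (a sketch reusing the same dummy data) is to group by DATE_TRUNC, which by default truncates to weeks starting on Sunday:
#standardSQL
WITH data AS (
  SELECT
    day, CAST(1000 * RAND() AS INT64) AS events
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-25')) AS day
)
SELECT
  DATE_TRUNC(day, WEEK) AS week_start,
  SUM(events) AS events
FROM data
GROUP BY week_start
ORDER BY week_start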
OR, out-of-the box function that converts GA's date field (STRING) to days_since_epoch
To convert a date expressed as a STRING into a value of DATE type, use PARSE_DATE as in the example below:
#standardSQL
SELECT PARSE_DATE('%Y%m%d', '20170425') AS date
The result is:
date
2017-04-25
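If the goal is specifically days_since_epoch, PARSE_DATE can be combined with UNIX_DATE (a small sketch):
#standardSQL
SELECT UNIX_DATE(PARSE_DATE('%Y%m%d', '20170425')) AS days_since_epoch
which returns 17281.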
Finally, below is an example/template for running 7d/14d/30d etc. buckets:
#standardSQL
WITH data AS (
  SELECT
    DAY, CAST(1000 * RAND() AS INT64) AS events
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-25')) AS DAY
)
SELECT
  DAY,
  SUM(CASE WHEN period = 7 THEN events END) AS days_07,
  SUM(CASE WHEN period = 14 THEN events END) AS days_14,
  SUM(CASE WHEN period = 30 THEN events END) AS days_30
FROM (
  SELECT
    dates.day AS DAY,
    periods.period AS period,
    SUM(events) AS events
  FROM data AS activity
  CROSS JOIN (SELECT DAY FROM data GROUP BY DAY) AS dates
  CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL
              SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
  WHERE dates.day >= activity.day
    AND CAST(DATE_DIFF(dates.day, activity.day, DAY) / periods.period AS INT64) = 0
  GROUP BY 1, 2
)
GROUP BY DAY
ORDER BY DAY DESC
with output as below
DAY days_07 days_14 days_30
2017-04-25 2087 4004 9700
2017-04-24 1947 4165 9611
2017-04-23 1666 4066 9599
2017-04-22 2121 4820 10014
2017-04-21 2885 5421 10192
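For strictly trailing 7/14/30-day totals, an alternative sketch (not part of the answer above) uses analytic functions with a RANGE frame over UNIX_DATE; it assumes at most one row per day in data and sums the full trailing window of calendar days:
#standardSQL
WITH data AS (
  SELECT
    day, CAST(1000 * RAND() AS INT64) AS events
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-25')) AS day
)
SELECT
  day,
  SUM(events) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS days_07,
  SUM(events) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS days_14,
  SUM(events) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS days_30
FROM data
ORDER BY day DESC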

Related

prestosql get average from last 7 days for each day

The question I have is very similar to the question here, but I am using Presto SQL (on AWS Athena) and couldn't find information on loops in Presto.
To reiterate the issue, I want a query that does the following:
Given a table that contains: Day, Number of Items for this Day
I want: Day, Average Items for Last 7 Days before "Day"
So if I have a table that has data from Dec 25th to Jan 25th, my output table should have data from Jan 1st to Jan 25th. And for each day from Jan 1-25th, it will be the average number of items from last 7 days.
Is it possible to do this with Presto?
Maybe you can try this one.
The calendar common table expression (CTE) is used to generate the dates in the range between the two boundary dates.
with calendar as (
select date_generated
from (
values (sequence(date'2021-12-25', date'2022-01-25', interval '1' day))
) as t1(date_array)
cross join unnest(date_array) as t2(date_generated)),
The temp CTE is used to build, for each date, a date group containing the last 7 days (the date itself and the previous 6 days).
temp as (select c1.date_generated as date_groups
, format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
from calendar c1, calendar c2
where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
and c1.date_generated >= date'2021-12-25' + interval '6' day)
Output for this part:
date_groups   dates
2022-01-01    2021-12-26
2022-01-01    2021-12-27
2022-01-01    2021-12-28
2022-01-01    2021-12-29
2022-01-01    2021-12-30
2022-01-01    2021-12-31
2022-01-01    2022-01-01
The last part joins the day column from your table to each date and then groups by the date group.
select temp.date_groups as day
, avg(your_table.num_of_items) avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1
You want a running average (AVG OVER)
select
day, amount,
avg(amount) over (order by day rows between 6 preceding and current row) as avg_amount
from mytable
order by day
offset 6;
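Note that the ROWS frame counts rows, not calendar days, so if some days are missing from the table it will reach back further than 7 calendar days. A sketch that guards against that (assuming at most one row per day, and treating missing days as 0 items) first left-joins onto a generated calendar, reusing the pattern from the first answer:
with calendar as (
  select date_generated
  from (
    values (sequence(date'2021-12-25', date'2022-01-25', interval '1' day))
  ) as t1(date_array)
  cross join unnest(date_array) as t2(date_generated)
)
select c.date_generated as day,
       avg(coalesce(t.amount, 0))
         over (order by c.date_generated rows between 6 preceding and current row) as avg_amount
from calendar c
left join mytable t on t.day = c.date_generated
order by c.date_generated
offset 6;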
I tried many different variations to get the "running average" (which, thanks to Thorsten's answer, I now know is what I was looking for), but I couldn't get exactly the output I wanted together with my other columns (which weren't included in my original question). This ended up working:
SELECT day, <other columns>, avg(amount) OVER (
PARTITION BY <other columns>
ORDER BY date(day) ASC
ROWS 6 PRECEDING) as avg_7_days_amount FROM table ORDER BY date(day) ASC

create a temporary sql table using recursion as a loop to populate custom time interval

Suppose you have a table like:
id subscription_start subscription_end segment
1 2016-12-01 2017-02-01 87
2 2016-12-01 2017-01-24 87
...
And wish to generate a temporary table with months.
One way would be to encode the month dates manually, as in:
with months as (
select
'2016-12-01' as 'first',
'2016-12-31' as 'last'
union
select
'2017-01-01' as 'first',
'2017-01-31' as 'last'
...
) select * from months;
So that I have an output table like:
first_day last_day
2017-01-01 2017-01-31
2017-02-01 2017-02-28
2017-03-01 2017-03-31
I would like to generate a temporary table with a custom interval (above), without manually encoding all the dates.
Say the interval is 12 months, for each year, for as many years as there are in the db.
I'd like to have general approach to compute the months table with the same output as above.
Or, one may adjust the range to a custom interval (months split a year into 12 parts, but one may want to split a period into a custom interval of days).
To start, I was thinking of using a recursive query like:
with months(id, first_day, last_day, month) as (
select
id,
first_day,
last_day,
0
where
subscriptions.first_day = min(subscriptions.first_day)
union all
select
id,
first_day,
last_day,
months.month + 1
from
subscriptions
left join months on cast(
strftime('%m', datetime(subscriptions.subscription_start)) as int
) = months.month
where
months.month < 13
)
select
*
from
months
where
month = 1;
but it does not do what I'd expect: here I was attempting to select the first row from the table with the minimum date, and to populate a table at intervals of months, ranging from 1 to 12. For each month, I was comparing the string date field of my table (e.g. 2017-03-01 = 3 is March).
The query above does not do what I want and also seems a bit complicated; for the sake of learning, what alternative would you propose to create a temporary months table without manually coding the intervals?
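One possible direction (a sketch only, assuming SQLite, since the question uses strftime/datetime; the boundary dates are hard-coded here but could instead come from min(subscription_start)/max(subscription_end)): use a recursive CTE plus date arithmetic to generate the month boundaries instead of hand-coding them:
with recursive months(first_day) as (
  select date('2016-12-01')
  union all
  select date(first_day, '+1 month')
  from months
  where first_day < '2017-11-01'
)
select first_day,
       date(first_day, '+1 month', '-1 day') as last_day
from months;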

7-day user count: Big-Query self-join to get date range and count?

My Google Firebase event data is integrated with BigQuery, and I'm trying to fetch from there some of the information that Firebase gives me automatically: 1-day, 7-day, 28-day user counts.
The 1-day count is quite straightforward:
SELECT
"1-day" as period,
events.event_date,
count(distinct events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date
with a neat result like
period event_date uid
1-day 20190609 5
1-day 20190610 7
1-day 20190611 5
1-day 20190612 7
1-day 20190613 37
1-day 20190614 73
1-day 20190615 52
1-day 20190616 36
But to me it gets complicated when I try to count, for each day, how many unique users I had in the previous 7 days.
From the above query, I know my target value for day 20190616 will be 142, by filtering 7 days and removing the group by condition.
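For instance, a hypothetical way to verify that figure (assuming the 7-day window means the day itself plus the 6 preceding days) would be:
SELECT count(distinct user_pseudo_id) as uid
FROM `your_path.events_*`
WHERE event_name = "session_start"
  AND PARSE_DATE("%Y%m%d", event_date)
      BETWEEN DATE_SUB(PARSE_DATE("%Y%m%d", "20190616"), INTERVAL 6 DAY)
          AND PARSE_DATE("%Y%m%d", "20190616")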
The solution I tried is a direct self-join (and variations of it that didn't change the result):
SELECT
"7-day" as period,
events.event_date,
count(distinct user_events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events,
`your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
and PARSE_DATE("%Y%m%d", events.event_date) between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY) and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
and events.event_date = "20190616" #fixed date to check
group by events.event_date
Now, I know I'm barely setting any join conditions, but if anything I expected that to produce cross joins and huge results. Instead, the count this way is 70, which is a lot lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.
I'm clearly doing something very wrong here, but I also thought that the way I'm doing it is very rudimental, and there must be a smarter way to accomplish this.
I have checked Calculating a current day 7 day active user with BigQuery?, but the explicit cross join there is with event_dim, whose definition I'm unsure about.
Checked the solution provided at Rolling 90 days active users in BigQuery, improving preformance (DAU/MAU/WAU), as suggested in a comment.
The solution seemed sound at first, but it has some problems the more recent the day is. Here's the query using COUNT(DISTINCT) that I adapted to my case:
SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT user_pseudo_id) unique_90_day_users
, COUNT(DISTINCT IF(i<29,user_pseudo_id,null)) unique_28_day_users
, COUNT(DISTINCT IF(i<8,user_pseudo_id,null)) unique_7_day_users
, COUNT(DISTINCT IF(i<2,user_pseudo_id,null)) unique_1_day_users
FROM (
SELECT PARSE_DATE("%Y%m%d",event_date) as event_date, user_pseudo_id
FROM `your_path_here.events_*`
WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d",event_date))=2019
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
and here is the result for the latest days (note that the data starts on 23rd May), where you can see that the result is wrong:
row_num date_grp 90-day 28-day 7-day 1-day
114 2019-06-16 273 273 273 210
115 2019-06-17 78 78 78 78
So on the last day the 90-day, 28-day and 7-day counts only consider that very same day instead of all the days before.
It's not possible for the 90-day count on 17th June to be 78 if the 1-day count on 16th June was higher.
This is AN answer to my own question.
My means are rudimentary as I'm not extremely familiar with BQ shortcuts and some advanced functions, but the result is still correct.
I hope others will be able to integrate with better queries.
#standardSQL
WITH dates AS (
SELECT i as event_date
FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
SELECT DISTINCT "90-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 90)) i
UNION ALL
SELECT distinct "28-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 29)) i
UNION ALL
SELECT distinct "7-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 7)) i
UNION ALL
SELECT distinct "1-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",event_date) as ptd_date
FROM dates
)
SELECT event_date,
sum(IF(day_category="90-day",unique_ptd_users,null)) as count_90_day ,
sum(IF(day_category="28-day",unique_ptd_users,null)) as count_28_day,
sum(IF(day_category="7-day",unique_ptd_users,null)) as count_7_day,
sum(IF(day_category="1-day",unique_ptd_users,null)) as count_1_day
from (
SELECT ptd_dates.day_category
, ptd_dates.event_date
, COUNT(DISTINCT user_pseudo_id) unique_ptd_users
FROM ptd_dates,
`your_path_here.events_*` events,
unnest(events.event_params) e_params
WHERE ptd_dates.ptd_date = events.event_date
GROUP BY ptd_dates.day_category
, ptd_dates.event_date)
group by event_date
order by 1,2,3
As per the suggestion from ECris, I first defined a calendar table: it contains 4 categories of PTDs (period to date). Each is generated from basic elements; this should scale linearly, as it doesn't query the event dataset and therefore has no gaps.
Then the join is made with the events, where the join condition ensures that for each date I count distinct users across all related days in the period.
The results are correct.
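A more compact sketch of the same idea (an illustrative alternative, not part of the answer above; the table path, the 2019-05-24 start date and the session_start filter are taken from the examples earlier in the question):
#standardSQL
WITH events AS (
  SELECT PARSE_DATE("%Y%m%d", event_date) AS event_date, user_pseudo_id
  FROM `your_path_here.events_*`
  WHERE event_name = "session_start"
  GROUP BY 1, 2
)
SELECT
  d AS event_date,
  COUNT(DISTINCT IF(e.event_date > DATE_SUB(d, INTERVAL 7 DAY), e.user_pseudo_id, NULL)) AS count_7_day,
  COUNT(DISTINCT IF(e.event_date > DATE_SUB(d, INTERVAL 28 DAY), e.user_pseudo_id, NULL)) AS count_28_day,
  COUNT(DISTINCT e.user_pseudo_id) AS count_90_day
FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE())) AS d
JOIN events e
  ON e.event_date BETWEEN DATE_SUB(d, INTERVAL 89 DAY) AND d
GROUP BY d
ORDER BY d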

how to produce a customer retention table /cohort analysis with SQL

I'm trying to write an SQL query (Presto SQL syntax) to produce a customer retention table (see sample below).
A customer who makes at least one transaction in a month is considered as retained for that month.
This is the table:
user_id        transaction_date
bdcff651-...   2018-01-01
bdcff641...    2018-03-15
This is the result I would like to get.
The first row should be understood as follows:
Out of all customers who made their first transaction in the month of Jan 2018 (defined as “Jan Activation Cohort”), 35% subsequently made a transaction during the one month period following their first transaction date, 23% in the next month, 15% in the next month and so on.
Date        1st Month  2nd Month  3rd Month
2018-01-01  35%        23%        15%
2018-02-01  33%        26%        13%
2018-03-01  36%        27%        12%
As an example, if person XYZ makes his first transaction on 10th February 2018, his 1st month will be from 11th February 2018 to 10th March 2018, 2nd month will be from 11th March 2018 to 10th April 2018 and so on. This person’s details need to appear in the Feb 2018 cohort in the Customer Retention Table.
Would appreciate any help! Thanks.
You can use conditional aggregation. However, I am not sure what your real calculations are.
If I just use the built-in definitions of date_diff(), then the logic looks like:
select date_trunc('month', first_td) as yyyymm,
       count(distinct user_id) as cnt,
       (count(distinct case when date_diff('month', first_td, transaction_date) = 1
                            then user_id
                        end) * 1.0 /
        count(distinct user_id)
       ) as month_1_ratio,
       (count(distinct case when date_diff('month', first_td, transaction_date) = 2
                            then user_id
                        end) * 1.0 /
        count(distinct user_id)
       ) as month_2_ratio
from (select t.*,
             min(transaction_date) over (partition by user_id) as first_td
      from t
     ) t
group by date_trunc('month', first_td)
order by yyyymm;
I am not familiar with Presto exactly, and I do not have a way to test Presto code. However, from searching around a bit, it looks like it wouldn't be too hard to convert to Presto syntax from something like SQL Server syntax. Here is what I would do in SQL Server; you should be able to carry the concept over to Presto:
with transactions_info_per_user as (
  select user_id, min(transaction_date) as first_transaction,
         convert(varchar(4), datepart(year, min(transaction_date))) + convert(varchar(2), datepart(month, min(transaction_date))) as activation_cohort
  from my_table
  group by user_id
),
users_per_activation_cohort as (
  select activation_cohort, count(*) as number_of_users
  from transactions_info_per_user
  group by activation_cohort
),
months_after_activation_per_purchase as (
  select distinct mt.user_id, ti.activation_cohort, datediff(month, ti.first_transaction, mt.transaction_date) as months_after_activation
  from my_table mt
  left join transactions_info_per_user as ti
    on mt.user_id = ti.user_id
),
final as (
  select activation_cohort, months_after_activation, count(*) as user_count_per_cohort_with_purchase_per_month_after_activation
  from months_after_activation_per_purchase
  group by activation_cohort, months_after_activation
)
select f.activation_cohort, f.months_after_activation,
       convert(decimal(9,2), f.user_count_per_cohort_with_purchase_per_month_after_activation) / convert(decimal(9,2), u.number_of_users) * 100
from final f
join users_per_activation_cohort u
  on u.activation_cohort = f.activation_cohort
--Then pivot months_after_activation into columns
I was very explicit with the naming of things so you could follow the thought process. Here is an example of how to pivot in Presto. Hopefully this helps you!
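For instance, a hypothetical sketch of that pivot using conditional aggregation (assuming the query above is wrapped in a CTE named retention_long and its percentage column is aliased retention_pct):
select activation_cohort,
       max(case when months_after_activation = 1 then retention_pct end) as month_1,
       max(case when months_after_activation = 2 then retention_pct end) as month_2,
       max(case when months_after_activation = 3 then retention_pct end) as month_3
from retention_long
group by activation_cohort
order by activation_cohort;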

how to make this query also return rows with 0 count value?

I have written a small PostgreSQL query that totals the number of jobs executed per hourly interval on every day between two given dates - e.g. all the jobs executed between February 2, 2012 and March 3, 2012, hour by hour, starting with the hour given on February 2 and ending with the hour given on March 3. I have noticed that this query doesn't print the rows with a 0 count (no job executed within that time interval, e.g. on February 21, 2012 between 5 and 6 pm). How can I make it also return results (rows) with a 0 count? The code is as below:
SELECT date_trunc('hour', executiontime), count(executiontime)
FROM mytable
WHERE executiontime BETWEEN '2011-2-2 0:00:00' AND '2012-3-2 5:00:00'
GROUP BY date_trunc('hour', executiontime)
ORDER BY date_trunc('hour', executiontime) ASC;
Thanks in advance.
-- CTE to the rescue!!!
WITH cal AS (
SELECT generate_series('2012-02-02 00:00:00'::timestamp , '2012-03-02 05:00:00'::timestamp , '1 hour'::interval) AS stamp
)
, qqq AS (
SELECT date_trunc('hour', executiontime) AS stamp
, count(*) AS zcount
FROM mytable
GROUP BY date_trunc('hour', executiontime)
)
SELECT cal.stamp
, COALESCE (qqq.zcount, 0) AS zcount
FROM cal
LEFT JOIN qqq ON cal.stamp = qqq.stamp
ORDER BY stamp ASC
;
Look at this. The idea is to generate an array or table with the dates in this period and join it with the job execution table.
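For instance, a more compact variant of the same idea (a sketch; table and column names are taken from the question):
SELECT gs.stamp
     , count(m.executiontime) AS zcount
FROM generate_series('2012-02-02 00:00:00'::timestamp
                   , '2012-03-02 05:00:00'::timestamp
                   , '1 hour'::interval) AS gs(stamp)
LEFT JOIN mytable m
       ON date_trunc('hour', m.executiontime) = gs.stamp
GROUP BY gs.stamp
ORDER BY gs.stamp;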