Google BigQuery: Rolling Count Distinct - sql

I have a table with is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?

BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method would adds a record for each date after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.

Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
Above assumes that day field is represented as '2016-01-10' format.
If it is not a case , you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in most inner select
Also please take a look at COUNT(DISTINC) specifics in BigQuery
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using below dummy sample
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1

Related

Day wise Rolling 30 day uniques user count bigquery

I am trying to generate a day on day rolling 30 days unique count using this query but the problem is running this query day on the day I need aug full month rolling 30 days day on day count in one script pls help
-----------------------------------------
SELECT max(date),count(DISTINCT user_id) as MAU
FROM user_data
WHERE date between DATE_SUB('2020-08-31' ,INTERVAL 29 DAY) and '2020-08-31';
BigQuery doesn't support rolling windows for count(distinct). So, one approach is a brute force method:
select dte,
(select count(distinct ud.user_id)
from user_data ud
where ud.date between DATE_SUB(dte, INTERVAL 29 DAY) and dte
) as num_users
from unnest(generate_date_array(date('2020-08-01'), date('2020-08-31'))) dte
Gordon approach works great.
If you need to calculate more numbers - Cross join the data.
SELECT
date_gen,
COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen ,INTERVAL 29 DAY) AND date_gen,ud.user_id,NULL)) as MAU
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_SUB('2020-08-31' ,INTERVAL 29 DAY), date('2020-08-31'))) date_gen,
(SELECT * FROM user_data WHERE date BETWEEN DATE_SUB('2020-08-31' ,INTERVAL 60 DAY) AND '2020-08-31') AS ud
GROUP BY 1
ORDER BY 1 DESC
With SET and DECLARE you can get rid of replacing the 'DATE' multiple times.
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM `project.dataset.user_data`
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t
Above assumes you have entries in project.dataset.user_data table for all days in time period of your interest
If this is not a case, and you actually have some gaps in your data - you can use below
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-31')) AS date
LEFT JOIN `project.dataset.user_data`
USING(date)
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t

BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

Example relevant table schema:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count active distinct users over a large data set over a rolling time period (90 days), and am running into issues due to the size of the dataset.
At first, I attempted to use a window function, similar to the answer here.
https://stackoverflow.com/a/27574474
WITH
daily AS (
SELECT
DATE(activity_date) day,
user_id
FROM
`fake-table`)
SELECT
day,
SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
daily
GROUP BY
1
ORDER BY
1 DESC
However, this resulted in getting the distinct number of users per day, then summing these up - but distincts could be duplicated within the window, if they appeared multiple times. So this is not a true accurate measure of distinct users over 90 days.
The next thing I tried is to use the following solution
https://stackoverflow.com/a/47659590
- concatenating all the distinct user_ids for each window to an array and then counting the distincts within this.
WITH daily AS (
SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
FROM `fake-table`
GROUP BY day
), temp2 AS (
SELECT
day,
STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
FROM daily
)
SELECT day,
(SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However this quickly ran out of memory with anything large.
Next was to use a HLL sketch to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but I'm getting an error when running the following: The error is simply "Function MERGE_PARTIAL is not supported." I tried with MERGE as well and got the same error. It only happens when using the window function. Creating the sketches for each day's value works fine.
I read through the BigQuery Standard SQL documentation and don't see anything about HLL_COUNT.MERGE_PARTIAL and HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch, representing the distinct values between the 90 original sketches?
WITH
daily AS (
SELECT
DATE(activity_date) day,
HLL_COUNT.INIT(user_id) sketch
FROM
`fake-table`
GROUP BY
1
ORDER BY
1 DESC),
rolling AS (
SELECT
day,
HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
FROM daily)
SELECT
day,
HLL_COUNT.EXTRACT(rolling_sketch)
FROM
rolling
ORDER BY
1
"Image of the error - Function MERGE_PARTIAL is not supported"
Any ideas why this error happens or how to adjust?
Below is for BigQuery Standard SQL and does exactly what you want with use of window function
#standardSQL
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
You can test, play with above using [totally] dummy data as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2019-01-01' day UNION ALL
SELECT 2, '2019-01-01' UNION ALL
SELECT 3, '2019-01-01' UNION ALL
SELECT 1, '2019-01-02' UNION ALL
SELECT 4, '2019-01-02' UNION ALL
SELECT 2, '2019-01-03' UNION ALL
SELECT 3, '2019-01-03' UNION ALL
SELECT 4, '2019-01-03' UNION ALL
SELECT 5, '2019-01-03' UNION ALL
SELECT 1, '2019-01-04' UNION ALL
SELECT 4, '2019-01-04' UNION ALL
SELECT 2, '2019-01-05' UNION ALL
SELECT 3, '2019-01-05' UNION ALL
SELECT 5, '2019-01-05' UNION ALL
SELECT 6, '2019-01-05'
)
SELECT day,
(SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch) rolling_sketch
FROM (
SELECT day,
ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr
FROM (
SELECT day, HLL_COUNT.INIT(id) ids_sketch
FROM `project.dataset.table`
GROUP BY day
)
)
-- ORDER BY day
with result
Row day rolling_sketch
1 2019-01-01 3
2 2019-01-02 4
3 2019-01-03 5
4 2019-01-04 5
5 2019-01-05 6
Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90 days cross join with GENERATE_ARRAY(1, 90) instead of OVER.
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp

how to calculate difference between dates in BigQuery

I have a table named Employees with Columns: PersonID, Name, StartDate. I want to calculate 1) difference in days between the newest and oldest employee and 2) the longest period of time (in days) without any new hires. I have tried to use DATEDIFF, however the dates are in a single column and I'm not sure what other method I should use. Any help would be greatly appreciated
Below is for BigQuery Standard SQL
#standardSQL
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
You can test, play with above using dummy data as in the example below
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT DATE '2019-01-01' StartDate UNION ALL
SELECT '2019-01-03' StartDate UNION ALL
SELECT '2019-01-13' StartDate
)
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
with result
Row days_between_newest_and_oldest_employee longest_period_without_new_hire
1 12 9
Note use of -1 in calculating longest_period_without_new_hire - it is really up to you to use this adjustment or not depends on your preferences of counting gaps
1) difference in days between the newest and oldest record
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT DATE_DIFF(MAX(date), MIN(date), DAY) max_minus_min
FROM table
2) the longest period of time (in days) without any new records
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT MAX(diff) max_diff
FROM (
SELECT DATE_DIFF(date, LAG(date) OVER(ORDER BY date), DAY) diff
FROM table
)

Group By - select by a criteria that is met every month

The below query returns all USERS that have SUM(AMOUNT) > 10 in a given month. It includes Users in a month even if they don't meet the criteria in other months.
But I'd like to transform this query to return all USERS who must meet the criteria SUM(AMOUNT) > 10 every single month (i.e., from the first month in the table to the last one) across the entire data.
Put another way, exclude users who don't meet SUM(AMOUNT) > 10 every single month.
select USERS, to_char(transaction_date, 'YYYY-MM') as month
from Table
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10;
One approach uses a generated calendar table representing all months in your data set. We can left join this calendar table to your current query, and then aggregate over all months by user:
WITH months AS (
SELECT DISTINCT TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
),
cte AS (
SELECT USERS, TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10
)
SELECT
t.USERS
FROM months m
LEFT JOIN cte t
ON m.month = t.month
GROUP BY
t.USERS
HAVING
COUNT(t.USERS) = (SELECT COUNT(*) FROM months);
The HAVING clause above asserts that the number of months to which a user matches is in fact the total number of months. This would imply that the user meets the sum criteria for every month.
Perhaps you could use a correlated subquery, such as:
select t.*
from (select distinct table.users from table) t
where not exists
(
select to_char(u.transaction_date, 'YYYY-MM') as month
from table u
where u.users = t.users
group by month
having sum(u.amount) <= 10
)
One option would be using sign(amount-10) vs. sign(amount) logic as
SELECT q.users
FROM
(
with tab(users, transaction_date,amount) as
(
select 1,date'2018-11-24',8 union all
select 1,date'2018-11-24',18 union all
select 2,date'2018-10-24',13 union all
select 3,date'2018-11-24',18 union all
select 3,date'2018-10-24',28 union all
select 3,date'2018-09-24', 3 union all
select 4,date'2018-10-24',28
)
SELECT users, to_char(transaction_date, 'YYYY-MM') as month,
sum(sign(amount-10)) as cnt1,
sum(sign(amount)) as cnt2
FROM tab t
GROUP BY users, month
) q
GROUP BY q.users
HAVING sum(q.cnt1) = sum(q.cnt2)
GROUP BY q.users
users
-----
2
4
Rextester Demo
You need to compare the number of months > 10 to the number of months between the min and the max date:
SELECT users, Count(flag) AS months, Min(mth), Max(mth)
FROM
(
SELECT users, date_trunc('month',transaction_date) AS mth,
CASE WHEN Sum(amount) > 10 THEN 1 end AS flag
FROM tab t
GROUP BY users, mth
) AS dt
GROUP BY users
HAVING -- adding the number of months > 10 to the min date and compare to max
Min(mth) + (INTERVAL '1' MONTH * (Count(flag)-1)) = Max(mth)
If missing months don't count it would be a simple count(flag) = count(*)

How can I count users in a month that were not present in the month before?

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1
I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])
First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month)
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is not null or
prev_yyyymm < date_add(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.
The way I understood question is - to exclude user to be counted in given month only if same user presented in previous month. But if same user present in few months before given, but not in previous - user should be counted.
If this is correct - Try below for BigQuery Standard SQL
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
you can test / play with above using below example with dummy data from your question
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1
Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.d = t2.d+1
WHERE
(
t1.user_id <> t2.user_id --User is in previous month
OR t2.user_id IS NULL --To handle january, since there is no previous month to compare to
)
GROUP BY t1.d;