Cumulative count over weeks in SQL - sql

I have table of items with owner ids referencing to a user from users table.
I want to show for each week (group by week) how many items were created that week per user + all the items created before - cumulative count.
For this table:
id
owner
created
1
xxxxx
'2021-01-01'
2
xxxxx
'2021-01-01'
3
xxxxx
'2021-01-09'
I want to get:
count
owner
week
2
xxxxx
'2021-01-01' - '2021-01-07'
3
xxxxx
'2021-01-08' - '2021-01-14'
This is code for non-cumulative count. How can I change it to be cumulative?
select
count(*),
uu.id,
date_trunc('week', CAST(it.created AS timestamp)) as week
from items it
left join users uu on uu.id = item.owner_id
group by uu.id, week

I'm a little confused by your query:
You have a left join from items to users as if you expect some items with no valid user id.
You are using u.id in the select, but that would be NULL with no match.
I would suggest:
select it.owner_id,
date_trunc('week', it.created::timestamp) as week_start,
date_trunc('week', it.created::timestamp) + interval '6 day' as week_end,
count(*) as this_week,
sum(count(*)) over (partition by uu.id order by min(timestamp)) as running_count
from items it
group by it.owner_id, week_start;
This uses Postgres syntax because your code looks like Postgres.

Remove user id from the GROUP BY clause and from SELECT list:
select
count(*),
date_trunc('week', CAST(it.created AS timestamp)) as week
from items it
left join users uu on uu.id = item.owner_id
group by week

here's a little runnable sample (SQL Server), maybe it will help:
create table #temp (week int, cnt int)
select * from #temp
insert into #temp select 1,2
insert into #temp select 1,1
insert into #temp select 2,3
insert into #temp select 3,3
select
week,
sum(count(*)) over (order by week) as runningCnt
from #temp
group by week
The output is:
week - runningCnt
1 - 3
2 - 5
3 - 6
So 1st week there were 3, next week there came 2 more, and last week one more.
You could also do a cumulative sum of the values in the cnt-column.

Related

Day wise Rolling 30 day uniques user count bigquery

I am trying to generate a day on day rolling 30 days unique count using this query but the problem is running this query day on the day I need aug full month rolling 30 days day on day count in one script pls help
-----------------------------------------
SELECT max(date),count(DISTINCT user_id) as MAU
FROM user_data
WHERE date between DATE_SUB('2020-08-31' ,INTERVAL 29 DAY) and '2020-08-31';
BigQuery doesn't support rolling windows for count(distinct). So, one approach is a brute force method:
select dte,
(select count(distinct ud.user_id)
from user_data ud
where ud.date between DATE_SUB(dte, INTERVAL 29 DAY) and dte
) as num_users
from unnest(generate_date_array(date('2020-08-01'), date('2020-08-31'))) dte
Gordon approach works great.
If you need to calculate more numbers - Cross join the data.
SELECT
date_gen,
COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen ,INTERVAL 29 DAY) AND date_gen,ud.user_id,NULL)) as MAU
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_SUB('2020-08-31' ,INTERVAL 29 DAY), date('2020-08-31'))) date_gen,
(SELECT * FROM user_data WHERE date BETWEEN DATE_SUB('2020-08-31' ,INTERVAL 60 DAY) AND '2020-08-31') AS ud
GROUP BY 1
ORDER BY 1 DESC
With SET and DECLARE you can get rid of replacing the 'DATE' multiple times.
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM `project.dataset.user_data`
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t
Above assumes you have entries in project.dataset.user_data table for all days in time period of your interest
If this is not a case, and you actually have some gaps in your data - you can use below
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-31')) AS date
LEFT JOIN `project.dataset.user_data`
USING(date)
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t

Finding the customers that are coming back in a week

Here is my table:
sessid userid date prodcode
xxxxx xx0101 01/01/2020 rpd032
xxxxx xx2021 01/01/2020 xxxx01
xxxxx xx0101 01/01/2020 xx0381
xxxxx xxju23 02/01/2020 xxx023
xxxxx xxjp17 03/01/2020 xxx016
xxxxx xxju23 03/01/2020 xxxx03
xxxxx xx2021 04/01/2020 xxx023
xxxxx xxx270 05/01/2020 xxx023
xxxxx xx0j34 06/01/2020 rpd032
xxxxx xxcj02 07/01/2020 xxx333
xxxxx xxjr04 08/01/2020 rpd032
I want to run a query every week. I might just turn into a procedure later. For now, I want to know the number of customers coming back to the website for the week starting the 02/01/2020. As you can see from the sample above there is only one customer that is coming back (xxju23) so the result of my query should be 1 but I am struggling with it.
select count(userid)
from (
select userid, count(*) as comingbak
from orders
where customers in dateadd(week,7,'02/01/2020')
groupby comingback
having cominback > 1
);
I understand that you are looking for the count of customers that had more than one visit in the website during the week that started on January 2nd.
Consider:
select count(*)
from (
select 1
from orders
where date >= '20200102' and date < dateadd(week, 1, '20200102')
group by userid
having count(*) > 1
) t
Demo on db<>fiddle
You can use datepart(wk, date) to get week in a year.
;with t1 as ( -- Exclude customer comeback in the same date
select distinct userid, date
from #table1
),
t2 as (-- Get week in year
select userid, 'Week ' + cast(datepart(wk, date) as varchar(2)) Week
from t1
)
select userid, Week, count(*) as numberOfVisit -- group by userId and week in year
from t2
group by userid, Week
having count(*) > 1
You can also Count all customer to get the last result.
;with t1 as (
select distinct userid, date
from #table1
),
t2 as (
select userid, 'Week ' + cast(datepart(wk, date) as varchar(2)) Week
from t1
),
t3 as (
select userid, Week, count(*) as numberOfVisit
from t2
group by userid, Week
having count(*) > 1)
select count(*) Total
from t3
Based on GMB's comment. There are some following mistakes (Feel free to correct me if I'm mistaken):
Error syntax: No column name was specified for column 1 of 't'. https://ibb.co/52FBxMc
Condition in Where clause combines with having count(*) > 1 is wrong: You won't get any value >= '20200102'. It should be value >= '20200101'
You will get xx0101 as well. However, it should be excluded as back in the same day https://ibb.co/YtbCL1z
You should select userId or something like that instead of 1 as it makes confuse
Your condition just works in the time range 20200101 while it should be dynamic.
In short, the answer to #Phong might be more suitable.

Group By - select by a criteria that is met every month

The below query returns all USERS that have SUM(AMOUNT) > 10 in a given month. It includes Users in a month even if they don't meet the criteria in other months.
But I'd like to transform this query to return all USERS who must meet the criteria SUM(AMOUNT) > 10 every single month (i.e., from the first month in the table to the last one) across the entire data.
Put another way, exclude users who don't meet SUM(AMOUNT) > 10 every single month.
select USERS, to_char(transaction_date, 'YYYY-MM') as month
from Table
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10;
One approach uses a generated calendar table representing all months in your data set. We can left join this calendar table to your current query, and then aggregate over all months by user:
WITH months AS (
SELECT DISTINCT TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
),
cte AS (
SELECT USERS, TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10
)
SELECT
t.USERS
FROM months m
LEFT JOIN cte t
ON m.month = t.month
GROUP BY
t.USERS
HAVING
COUNT(t.USERS) = (SELECT COUNT(*) FROM months);
The HAVING clause above asserts that the number of months to which a user matches is in fact the total number of months. This would imply that the user meets the sum criteria for every month.
Perhaps you could use a correlated subquery, such as:
select t.*
from (select distinct table.users from table) t
where not exists
(
select to_char(u.transaction_date, 'YYYY-MM') as month
from table u
where u.users = t.users
group by month
having sum(u.amount) <= 10
)
One option would be using sign(amount-10) vs. sign(amount) logic as
SELECT q.users
FROM
(
with tab(users, transaction_date,amount) as
(
select 1,date'2018-11-24',8 union all
select 1,date'2018-11-24',18 union all
select 2,date'2018-10-24',13 union all
select 3,date'2018-11-24',18 union all
select 3,date'2018-10-24',28 union all
select 3,date'2018-09-24', 3 union all
select 4,date'2018-10-24',28
)
SELECT users, to_char(transaction_date, 'YYYY-MM') as month,
sum(sign(amount-10)) as cnt1,
sum(sign(amount)) as cnt2
FROM tab t
GROUP BY users, month
) q
GROUP BY q.users
HAVING sum(q.cnt1) = sum(q.cnt2)
GROUP BY q.users
users
-----
2
4
Rextester Demo
You need to compare the number of months > 10 to the number of months between the min and the max date:
SELECT users, Count(flag) AS months, Min(mth), Max(mth)
FROM
(
SELECT users, date_trunc('month',transaction_date) AS mth,
CASE WHEN Sum(amount) > 10 THEN 1 end AS flag
FROM tab t
GROUP BY users, mth
) AS dt
GROUP BY users
HAVING -- adding the number of months > 10 to the min date and compare to max
Min(mth) + (INTERVAL '1' MONTH * (Count(flag)-1)) = Max(mth)
If missing months don't count it would be a simple count(flag) = count(*)

Google BigQuery: Rolling Count Distinct

I have a table with is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?
BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method would adds a record for each date after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
Above assumes that day field is represented as '2016-01-10' format.
If it is not a case , you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in most inner select
Also please take a look at COUNT(DISTINC) specifics in BigQuery
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using below dummy sample
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1

RedShift: Alternative to 'where in' to compare annual login activity

Here are the two cases:
Members Lost: Get the distinct count of user ids from 365 days ago who haven't had any activity since then
Members Added: Get the distinct count of user ids from today who don't exist in the previous 365 days.
Here are the SQL statements I've been writing. Logically I feel like this should work (and it does for sample data), but the dataset is 5Million+ rows and takes forever! Is there any way to do this more efficiently? (base_date is a calendar that I'm joining on to build out a 2 year trend. I figured this was faster than joining the 5million table on itself...)
-- Members Lost
SELECT
effective_date,
COUNT(DISTINCT dwuserid) as members_lost
FROM base_date
LEFT JOIN site_visit
-- Get Login Activity for 365th day
ON DATEDIFF(day, srclogindate, effective_date) = 365
WHERE dwuserid NOT IN (
-- Get Distinct Login activity for Current Day (PY) + 1 to Current Day (CY) (i.e. 2013-01-02 to 2014-01-01)
SELECT DISTINCT dwuserid
FROM site_visit b
WHERE DATEDIFF(day, b.srclogindate, effective_date) BETWEEN 0 AND 364
)
GROUP BY effective_date
ORDER BY effective_date;
-- Members Added
SELECT
effective_date,
COUNT(DISTINCT dwuserid) as members_added
FROM base_date
LEFT JOIN site_visit ON srclogindate = effective_date
WHERE dwuserid NOT IN (
SELECT DISTINCT dwuserid
FROM site_visit b
WHERE DATEDIFF(day, b.srclogindate, effective_date) BETWEEN 1 AND 365
)
GROUP BY effective_date
ORDER BY effective_date;
Thanks in advance for any help.
UPDATE
Thanks to #JohnR for pointing me in the right direction. I had to tweak your response a bit because I need to know on any login day how many were "Member Added" or "Member Lost" so it had to be a 365 rolling window looking back or looking forward. Finding the IDs that didn't have a match in the LEFT JOIN was much faster.
-- Trim data down to one user login per day
CREATE TABLE base_login AS
SELECT DISTINCT "dwuserid", "srclogindate"
FROM site_visit
-- Members Lost
SELECT
current."srclogindate",
COUNT(DISTINCT current."dwuserid") as "members_lost"
FROM base_login current
LEFT JOIN base_login future
ON current."dwuserid" = future."dwuserid"
AND current."srclogindate" < future."srclogindate"
AND DATEADD(day, 365, current."srclogindate") >= future."srclogindate"
WHERE future."dwuserid" IS NULL
GROUP BY current."srclogindate"
-- Members Added
SELECT
current."srclogindate",
COUNT(DISTINCT current."dwuserid") as "members_added"
FROM base_login current
LEFT JOIN base_login past
ON current."dwuserid" = past."dwuserid"
AND current."srclogindate" > past."srclogindate"
AND DATEADD(day, 365, past."srclogindate") >= current."srclogindate"
WHERE past."dwuserid" IS NULL
GROUP BY current."srclogindate"
NOT IN should generally be avoided because it has to scan all data.
Instead of joining to the site_visit table (which is presumably huge), try joining to a sub-query that selects UserID and the most recent login date -- that way, there is only one row per user instead of one row per visit.
For example:
SELECT dwuserid, min (srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
You could then simplify the queries to something like:
-- Members Lost: Last login was between 12 and 13 months ago
SELECT
COUNT(*)
FROM
(
SELECT dwuserid, min(srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
)
WHERE
last_login BETWEEN current_date - interval '13 months' and current_date - interval '12 months'
-- Members Added: First visit in last 12 months
SELECT
COUNT(*)
FROM
(
SELECT dwuserid, min(srclogindate) as first_login, max(srclogindate) as last_login
FROM site_visit
GROUP BY dwuserid
)
WHERE
first_login > current_date - interval '12 months'