SQL Count distinct per 30 days - sql

Can SQL compute a distinct count over the preceding 30 days, i.e. MAU (monthly active users)? For example, if I have data like this:
date user
1/1/2020 A
1/2/2020 B
1/2/2020 C
...
1/30/2020 Z
And I transform it into this using COUNT(DISTINCT):
date distinct_user
1/1/2020 1
1/2/2020 2
...
1/30/2020 30
To make it easier, assume that distinct_user is the number of distinct users active per day and that there is no overlap between days (in reality there is overlap). So the MAU result will be like this:
date distinct_user MAU
1/1/2020 1 1
1/2/2020 2 3
...
1/30/2020 30 465
465 is the result of counting distinct users over 30 days (assuming no user is active on more than one day). So if 5 new users are active on 1/31/2020, the result will be like this:
date distinct_user MAU
1/1/2020 1 1
1/2/2020 2 3
...
1/30/2020 30 465
1/31/2020 5 469
469 comes from (last MAU) + (new distinct users) - (distinct users from 1/1/2020, which falls out of the 30-day range), so the result is 465 + 5 - 1, assuming that the 5 users active on 1/31/2020 were not active from 1/2/2020 to 1/30/2020.
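The arithmetic above can be checked with a small Python sketch. It assumes the simplified setup from the question (no user overlap between days), so the 30-day figure is just a rolling sum of the daily counts; the names daily_users and rolling_sum are made up for illustration.

```python
# Hypothetical daily distinct-user counts from the example:
# days 1..30 have 1..30 users, and day 31 has 5 new users.
daily_users = list(range(1, 31)) + [5]


def rolling_sum(values, window=30):
    """Sum of each value and the (window - 1) values before it."""
    return [
        sum(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


mau = rolling_sum(daily_users)
print(mau[0], mau[1], mau[29], mau[30])  # 1 3 465 469
```

With real data the users overlap between days, so a plain sum over-counts and a true distinct count over the window is needed instead.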

There are different approaches to answer this question; the better one in terms of performance may be the following:
SELECT mt1.`date`, SUM(mt2.distinct_user) AS MAU
FROM (
SELECT `date`
FROM myTable
GROUP BY `date`
) mt1 INNER JOIN (
SELECT `date`, SUM(distinct_user) AS distinct_user
FROM myTable
GROUP BY `date`
) mt2
WHERE mt2.`date` BETWEEN mt1.`date` - INTERVAL 29 DAY AND mt1.`date`
GROUP BY mt1.`date`
ORDER BY mt1.`date`;

Perhaps the simplest method is to "unpivot" the data and reaggregate:
with t1 as (
select date, user, 1 as inc
from t
union all
select date + interval 30 day, user, -1 as inc
from t
)
select date,
sum(case when sum_inc > 0 then 1 else 0 end) as running_30day_users
from (select t1.*,
sum(inc) over (partition by user order by date) as sum_inc
from t1
) t1
group by date;
I should note that this can also be expressed in SQL as:
select distinct date, running_30
from (select t.*,
count(distinct user) over (order by date range between interval 29 day preceding and current row) as running_30
from t
) t;
However, I'm not sure if Athena supports that syntax.

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE USERID EVENTS
2021-08-27 1 5
2021-07-25 1 7
2021-07-23 2 3
2021-07-20 3 9
2021-06-22 1 9
2021-05-05 1 4
2021-05-05 2 2
2021-05-05 3 6
2021-05-05 4 8
2021-05-05 5 1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE ACTIVE_USERS
2021-08-27 1
2021-07-25 3
2021-07-23 2
2021-07-20 2
2021-06-22 1
2021-05-05 5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with rows between but seems to end up getting the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with window functions, because count(distinct) is not permitted as a window function. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
table t2
on t2.date <= t1.date and
t2.date > t1.date - interval '30 day'
group by t1.date;
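To make the join condition concrete, here is a minimal Python sketch of the same logic run against the sample rows from the question. It is a brute-force illustration rather than something to run on large data, and the function name active_users is made up for this example.

```python
from datetime import date, timedelta

# (date, userid) pairs copied from the question's sample table
rows = [
    (date(2021, 8, 27), 1),
    (date(2021, 7, 25), 1),
    (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3),
    (date(2021, 6, 22), 1),
    (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2),
    (date(2021, 5, 5), 3),
    (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]


def active_users(rows, window_days=30):
    """For each date in rows, count distinct users seen in the window ending that day."""
    result = {}
    for day in sorted({d for d, _ in rows}):
        start = day - timedelta(days=window_days)
        # Same condition as the join: start < other_day <= day
        result[day] = len({u for d, u in rows if start < d <= day})
    return result


counts = active_users(rows)
for day, n in sorted(counts.items(), reverse=True):
    print(day, n)
```

On this sample it reproduces the ACTIVE_USERS column the question asks for.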
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select user, date, +1 as inc
from table
union all
select user, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, user, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by user order by date) as running_inc
from d
group by date, user
),
d3 as ( -- summarize into active periods
select user, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by user order by date) as active_period
from d2
) d2
where running_inc > 0
group by user, active_period
)
select d.date, count(d3.user)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;
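The "in" and "out" idea can also be sketched outside SQL. The Python below is an illustration under assumptions, not a translation of the query: each event adds +1 on its date and -1 thirty days later, and a user counts as active on a day while the running total is positive. The names active_users_by_sweep and sample_rows are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# (date, userid) pairs copied from the question's sample table
sample_rows = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]


def active_users_by_sweep(rows, window_days=30):
    """Count active users per reporting date from +1/-1 change events."""
    # Net change per user and day: +1 when an event fires, -1 when it expires.
    changes = defaultdict(lambda: defaultdict(int))
    for day, user in rows:
        changes[user][day] += 1
        changes[user][day + timedelta(days=window_days)] -= 1

    report_days = sorted({day for day, _ in rows})
    counts = {day: 0 for day in report_days}
    for user, per_day in changes.items():
        for report_day in report_days:
            # Running total of this user's changes up to the reporting date.
            running = sum(inc for d, inc in per_day.items() if d <= report_day)
            if running > 0:
                counts[report_day] += 1
    return counts


sweep_counts = active_users_by_sweep(sample_rows)
```

Because the -1 lands exactly 30 days after the event, a user stays active for the event date plus the following 29 days, which matches the strict "greater than date minus 30 days" condition used in the self-join.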

rolling count on Athena or Quicksight

I have this dataset
date id
1/1/2020 1
1/1/2020 2
...
n m
I want to have a rolling count of distinct monthly users in AWS QuickSight or Athena. For example:
date MAU
1/1/2020 -
1/2/2020 -
1/30/2020 100
1/31/2020 102
100 on 1/30/2020 means that in the past 30 days, 100 distinct users were active (from 1/1/2020 to 1/30/2020). 102 on 1/31/2020 means that in the past 30 days, 102 distinct users were active (from 1/2/2020 to 1/31/2020).
The basic idea is to use a window count with a range frame. Does it work in Amazon Athena if we convert the date to an epoch and use the following range frame?
select date,
sum(count(*)) over(
order by to_unixtime(date)
range between 60 * 60 * 24 * 30 preceding and current row
) mau
from mytable
group by date
An alternative to the window function solution would be a correlated subquery:
select date,
count(*) + (
select count(*)
from mytable t1
where t1.date >= t.date - interval '30' day and t1.date < t.date
) mau
from mytable t
group by date

SQL question - how to output using iterative date logic in SQL Server

I have the following sample table (provided with single ID for simplicity - need to perform the same logic across all IDs)
ID Visit_date
-----------------
ABC 8/7/2019
ABC 9/10/2019
ABC 9/12/2019
ABC 10/1/2019
ABC 10/1/2019
ABC 10/8/2019
ABC 10/15/2019
ABC 10/17/2019
ABC 10/24/2019
Here is what I need to get the sample output
Mark the first visit as 1 in the "new_visit" column
Compare the subsequent dates with the 1st date until it exceeds 21 days condition. Example Sep 10 is compared to Aug 7 and it doesn’t fall within 21 days of Aug 7, therefore this is considered as another new_visit, so mark new_visit as 1
Then we compare Sep 10 with the subsequent dates with 21 days criteria and mark all of them as follow_up of Sep 10 visit. Eg. Sep 12, Oct 1 are within 21 days of Sep 10; hence they are considered as follow up visits, so mark "follow_up" as 1
When the subsequent date exceeds 21 days criteria of the previous new visit (e.g. Oct 8 compared to Sep 10) then Oct 8 will be considered a new visit & mark "New_visit" as 1 and the subsequent dates will be compared against Oct 8
Sample Output :
Dates       New_Visit  Follow_up
--------------------------------
8/7/2019    1
9/10/2019   1
9/12/2019              1
10/1/2019              1
10/1/2019              1
10/8/2019   1
10/15/2019             1
10/17/2019             1
10/24/2019             1
You need a recursive query for this.
You would enumerate the rows, then walk through the dataset by ascending date, while keeping track of the first visit date of each group; when the interval since the last first visit exceeds 21 days, the date of the first visit resets, and a new group starts.
with
data as (
select t.*, row_number() over(partition by id order by visit_date) rn
from mytable t
),
cte as (
select id, visit_date, visit_date first_visit_date, rn
from data
where rn = 1
union all
select c.id, d.visit_date, case when d.visit_date > dateadd(day, 21, c.first_visit_date) then d.visit_date else c.first_visit_date end, d.rn
from cte c
inner join data d on d.id = c.id and d.rn = c.rn + 1
)
select
id,
visit_date,
case when visit_date = first_visit_date then 1 else 0 end as is_new,
case when visit_date = first_visit_date then 0 else 1 end as is_follow_up
from cte
If a patient may have more than 100 visits, then you need to add option (maxrecursion 0) at the very end of the query.
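The walk described above (keep the current first-visit date and reset it once a visit falls more than 21 days later) can be sketched procedurally. This Python version is only an illustration for a single ID; flag_visits is a made-up name, and it treats a visit exactly 21 days after the first visit as a follow-up, as in the question's sample output.

```python
from datetime import date, timedelta


def flag_visits(visit_dates, max_gap_days=21):
    """Return (visit_date, is_new, is_follow_up) tuples for one ID."""
    flagged = []
    first_visit = None
    for d in sorted(visit_dates):
        if first_visit is None or d > first_visit + timedelta(days=max_gap_days):
            first_visit = d  # start a new group
            flagged.append((d, 1, 0))
        else:
            flagged.append((d, 0, 1))
    return flagged


# Visit dates for ID 'ABC' from the question
visits = [
    date(2019, 8, 7), date(2019, 9, 10), date(2019, 9, 12),
    date(2019, 10, 1), date(2019, 10, 1), date(2019, 10, 8),
    date(2019, 10, 15), date(2019, 10, 17), date(2019, 10, 24),
]
flags = flag_visits(visits)
new_flags = [is_new for _, is_new, _ in flags]
```

The result depends on the order of processing, since each row needs the group start decided by the previous row. That dependency is why a plain window function is not enough and a recursive query is used.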
You need a recursive CTE to handle this. This is the idea, although the exact syntax might vary by database:
with recursive t as (
select id, date,
row_number() over (partition by id order by date) as seqnum
from yourtable
),
cte as (
select id, date, date as visit_start, 1 as is_new_visit, seqnum
from t
where seqnum = 1
union all
select cte.id, t.date,
(case when t.date <= cte.visit_start + interval '21 day'
then cte.visit_start else t.date
end) as visit_start,
(case when t.date <= cte.visit_start + interval '21 day'
then 0 else 1
end) as is_new_visit,
t.seqnum
from cte join
t
on t.id = cte.id and t.seqnum = cte.seqnum + 1
)
select *
from cte
where is_new_visit = 1;

SQL - How to group/count items by age and status on every date of a year?

I am trying to build a query from a multi-year data set (tickets table) of support tickets, with relevant columns of ticket_id, status, created_on date and closed_on date for each ticket. There is also a generic dates table I can join/query to get a list of dates.
I'd like to create a "burn down" chart for this year that displays the number of open tickets that were at least a year old on any given date this year. I have been able to create tables that use a sum(case... statement to group by a date - for example to show how many tickets were created on a given week - but I can't figure out how to group by every day or week this year the number of tickets that were open on that day and at least a year old.
Any help is appreciated.
Example Data:
ticket_id | status | created_on | closed_on
--------------------------------------------
1 open 1/5/2019
2 open 1/26/2019
3 closed 1/28/2019 2/1/2020
4 open 6/1/2019
5 closed 6/5/2019 1/1/2020
Example Results I Seek:
Date (2020) | Count of Year+ Aged Tickets
------------------------------------------------
1/1/2020 0
1/2/2020 0
1/3/2020 0
1/4/2020 0
1/5/2020 1
1/6/2020 1
... (skipping dates here but want all dates in results)...
1/25/2020 1
1/26/2020 2
1/27/2020 2
1/28/2020 3
1/29/2020 3
1/30/2020 3
1/31/2020 3
2/1/2020 2
... (skipping dates here but want all dates up to current date in results)...
ticket_id 1 reached one year of age on 1/5/2020 and is still open (remains in count)
ticket_id 2 reached one year of age on 1/26/2020 and is still open (remains in count)
ticket_id 3 reached one year of age on 1/28/2020 and was still open, adding to the count, but was closed on 2/1/2020, reducing the count
ticket_id 4 will only add to the count if it is still open on 6/1/2020, but not if it is closed before then
ticket_id 5 will never appear in the count because it never reached one year of age and is closed
One option is to build a sequential list of dates, then bring in the table with a left join and conditional logic, and finally aggregate.
This would give the results you want for year 2020.
select d.dt, count(t.ticket_id) no_tickets
from (
select date '2020-01-01' + i * interval '1 day' dt
from generate_series(0, 365) i
) d
left join mytable t
on t.created_on + interval '1 year' <= d.dt
and (
t.closed_on is null
or t.closed_on > d.dt
)
group by d.dt
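The join condition can be checked against the sample tickets with a short Python sketch. It is an illustration only: the tuple layout and the helper names (add_one_year, aged_open_count, burn_down) are assumptions made for this example, and a ticket is counted on a day when it is at least one year old and not yet closed.

```python
from datetime import date, timedelta

# (ticket_id, created_on, closed_on) from the question; None means still open
tickets = [
    (1, date(2019, 1, 5), None),
    (2, date(2019, 1, 26), None),
    (3, date(2019, 1, 28), date(2020, 2, 1)),
    (4, date(2019, 6, 1), None),
    (5, date(2019, 6, 5), date(2020, 1, 1)),
]


def add_one_year(d):
    """Same calendar day next year; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def aged_open_count(tickets, day):
    """Tickets at least a year old on `day` and not closed on or before it."""
    return sum(
        1
        for _, created, closed in tickets
        if add_one_year(created) <= day and (closed is None or closed > day)
    )


def burn_down(tickets, start, end):
    """Count for every calendar day from start to end, inclusive."""
    days = (end - start).days + 1
    return {
        start + timedelta(days=i): aged_open_count(tickets, start + timedelta(days=i))
        for i in range(days)
    }


result = burn_down(tickets, date(2020, 1, 1), date(2020, 2, 1))
```

Generating every day first and then testing each ticket against it is what guarantees that days with a zero count still appear in the output.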
If your version of Redshift does not support generate_series(), you can emulate it with a custom number table, or with row_number() against a large table (say mylargetable):
select d.dt, count(t.ticket_id) no_tickets
from (
select date '2020-01-01' + (row_number() over(order by 1) - 1) * interval '1 day' dt
from mylargetable
) d
left join mytable t
on t.created_on + interval '1 year' <= d.dt
and (
t.closed_on is null
or t.closed_on > d.dt
)
where d.dt < date '2021-01-01'
group by d.dt
If ticket_id is unique, then you can do this to get all tickets that are at least 1 year old:
select ticket_id, created_on, status from tickets where status = 'open' and created_on <= dateadd(year,-1,getdate())
If you want to count the number of tickets per month, then:
select count(ticket_id), month(created_on), status from tickets where status = 'open' and created_on <= dateadd(year,-1,getdate())
group by month(created_on), status

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where each date is the last day of a month. Some of the dates are missing from the data. I want to insert those dates and put a zero value for the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in Postgresql?
Currently we are doing this in Python. As our data is growing day by day, it's not efficient to handle the I/O just for this one task.
Thank you
You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
select id, min(report_date) as minrd, max(report_date) as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of month doesn't keep the last day of the month.
This is easily fixed:
with t as (
select 1 as id, date '2012-01-31' as report_date, 10 as price union all
select 1 as id, date '2012-04-30', 20
), m as (
select id, min(report_date) - interval '1 day' as minrd, max(report_date) - interval '1 day' as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) + interval '1 day' as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
The first CTE is just to generate sample data.
This is a slight improvement over Gordon's query which fails to get the last date of a month in some cases.
Essentially you generate all the month end dates between the min and max date for each id (using generate_series) and left join on this generated table to show the missing dates with 0 price.
with minmax as (
select id, min(report_date) as mindt, max(report_date) as maxdt
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
generate_series(date_trunc('MONTH',mindt+interval '1' day),
date_trunc('MONTH',maxdt+interval '1' day),
interval '1' month) - interval '1 day' as report_date
from minmax
) m
left join t on m.id = t.id and m.report_date = t.report_date
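For comparison with the current Python approach mentioned in the question, the same idea (generate every month-end between the min and max date per id, then fill gaps with 0) can be sketched as follows. This is an illustration only; month_ends and fill_missing are made-up names, and the rows are the sample data from the question.

```python
import calendar
from datetime import date


def month_ends(start, end):
    """Yield the last day of every month from start's month to end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, calendar.monthrange(year, month)[1])
        month += 1
        if month > 12:
            year, month = year + 1, 1


def fill_missing(rows):
    """rows: (id, report_date, price). Missing month-ends get price 0."""
    prices = {(i, d): p for i, d, p in rows}
    spans = {}
    for i, d, _ in rows:
        lo, hi = spans.get(i, (d, d))
        spans[i] = (min(lo, d), max(hi, d))
    filled = []
    for i in sorted(spans):
        lo, hi = spans[i]
        for d in month_ends(lo, hi):
            filled.append((i, d, prices.get((i, d), 0)))
    return filled


rows = [
    (1, date(2015, 1, 31), 40),
    (1, date(2015, 2, 28), 56),
    (1, date(2015, 4, 30), 34),
    (2, date(2014, 5, 31), 45),
    (2, date(2014, 8, 31), 47),
]
filled = fill_missing(rows)
```

Working from the month number and looking up its last day avoids the drift that comes from repeatedly adding one month to a month-end date.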