Querying DAU/MAU over time (daily) - sql

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date        MAU     DAU    DAU/MAU
2014-06-01  20,000  5,000  20%
2014-06-02  21,000  4,000  19%
2014-06-03  20,050  3,050  17%
...         ...     ...    ...
Calculating daily active users is straightforward, but calculating monthly active users, i.e. the number of distinct users who logged in during the 30 days up to and including a given day, is causing problems. How can this be achieved without a left join for each day?
Edit: I'm using Postgres.

Assuming you have values for each day, you can get the total counts using a CTE and a windowed sum with rows between:
with dau as (
      select date, count(user_id) as dau
      from dailysessions ds
      group by date
     )
select date, dau,
       sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
      select date, count(distinct user_id) as dau
      from dailysessions ds
      group by date
     )
select dau.date, dau.dau,
       (select count(distinct ds.user_id)
        from dailysessions ds
        where ds.date between dau.date - 29 * interval '1 day' and dau.date
       ) as mau
from dau;
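If the correlated subquery is too slow on a large table, one alternative (just a sketch, assuming Postgres 9.3+ for lateral and the same dailysessions table) is to expand each session into the 30 daily windows it contributes to, then count distinct users per day:
-- sketch: a session on day X contributes to the MAU of every day from X to X+29
with mau as (
      select g.day::date as date, count(distinct ds.user_id) as mau
      from dailysessions ds
      cross join lateral
           generate_series(ds.date, ds.date + 29, interval '1 day') as g(day)
      group by g.day::date
     )
select date, mau
from mau
order by date;
The per-day DAU from the first CTE can then be joined back in on date to compute the ratio.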

This one uses COUNT DISTINCT to get the rolling 30-day DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
  SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
  FROM (
    SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
    FROM [fh-bigquery:reddit_comments.2015_09]
    WHERE subreddit='AskReddit') a
  JOIN (
    SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
    FROM (SELECT created_utc, subreddit, author
          FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
    CROSS JOIN (
      SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
      FROM [fh-bigquery:reddit_comments.2015_09]
      GROUP BY 1
    ) b
    WHERE subreddit='AskReddit'
    AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
    GROUP BY 1
  ) b
  ON a.day=b.stopday
  GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)

I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
  , LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
  , CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
  , CASE
      WHEN LEAD("date") OVER w IS NULL THEN true
      WHEN LEAD("date") OVER w - "date" > 30 THEN true
      ELSE false
    END AS "churned"
  , CASE
      WHEN LAG("date") OVER w IS NULL THEN false
      WHEN "date" - LAG("date") OVER w <= 30 THEN false
      WHEN row_number() OVER w > 1 THEN true
      ELSE false
    END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date");
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally, calculate the running total of active users (MAU) by taking cumulative sums over those columns. You need to join vw_activity twice, since the second copy is joined to the day on which a user becomes inactive (i.e. 30 days after their last login).
I've included a date series to ensure that all days are present; you can do without it, but days missing from your dataset would then be skipped.
SELECT
  d."date"
  , SUM(COALESCE(a.activated::int,0)
      - COALESCE(a2.churned::int,0)
      + COALESCE(a.resurrected::int,0)) OVER w AS "active"
  , a."activated", a2."churned", a."resurrected"
FROM
  generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) d("date")
  LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
  LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date")
ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.

You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)

Related

Oracle SQL - Get difference between dates based on check in and checkout records

Assume I have the following table data.
# | USER | Entrance | Transaction Date Time
-----------------------------------------------
1 | ALEX | INBOUND | 2020-01-01 10:20:00
2 | ALEX | OUTBOUND | 2020-01-02 10:00:00
3 | ALEX | INBOUND | 2020-01-04 11:30:00
4 | ALEX | OUTBOUND | 2020-01-07 15:00:00
5 | BEN | INBOUND | 2020-01-08 08:00:00
6 | BEN | OUTBOUND | 2020-01-09 09:00:00
I would like to know the total of how many days the user has stayed outbound.
Each inbound and outbound pair is considered one trip; every trip that exceeds 24 hours is considered as 2 days.
Below is my desired output:
No. of Days | Trips Count
----------------------------------
Stay < 1 day | 1
Stay 1 day | 1
Stay 2 days | 0
Stay 3 days | 0
Stay 4 days | 1
I would use lead() and aggregation. Assuming that the rows are properly interlaced:
select floor( (next_dt - dt) ) as num_days, count(*)
from (select t.*,
lead(dt) over (partition by user order by dt) as next_dt
from trips t
) t
where entrance = 'INBOUND'
group by floor( (next_dt - dt) )
order by num_days;
Note: This does not include the 0 rows. That does not seem central to your question and is a significant complication.
I still don't know what you mean by < 1 day, but I got this far.
Setup
create table trips (id number, name varchar2(10), entrance varchar2(10), ts TIMESTAMP);
insert into trips values( 1 , 'ALEX','INBOUND', TIMESTAMP '2020-01-01 10:20:00');
insert into trips values(2 , 'ALEX','OUTBOUND',TIMESTAMP '2020-01-02 10:00:00');
insert into trips values(3 , 'ALEX','INBOUND',TIMESTAMP '2020-01-04 11:30:00');
insert into trips values(4 , 'ALEX','OUTBOUND',TIMESTAMP '2020-01-07 15:00:00');
insert into trips values(5 , 'BEN','INBOUND',TIMESTAMP '2020-01-08 08:00:00');
insert into trips values(6 , 'BEN','OUTBOUND',TIMESTAMP '2020-01-09 07:00:00');
Query
select decode(t.days, 0, 'Stay < 1 day', 1, 'Stay 1 day', 'Stay ' || t.days || ' days') Days
     , count(d.days) Trips_count
FROM (Select Rownum - 1 days From dual Connect By Rownum <= 6) t
left join
     (select extract(day from b.ts - a.ts) + 1 as days
      from trips a
      inner join trips b on a.name = b.name
         and a.entrance = 'INBOUND'
         and b.entrance = 'OUTBOUND'
         and a.ts < b.ts
         and not exists (select ts from trips
                         where entrance = 'OUTBOUND' and ts > a.ts and ts < b.ts)
     ) d
  on t.days = d.days
group by t.days
order by t.days
Result
DAYS | TRIPS_COUNT
----------------|------------
Stay < 1 day | 0
Stay 1 day | 2
Stay 2 days | 0
Stay 3 days | 0
Stay 4 days | 1
Stay 5 days | 0
You could replace the 6 with a select max, with the second subquery repeated, as sketched below.
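Here is a hedged sketch of that idea (same trips setup as above, using LEVEL rather than Rownum): the upper bound of the generated day list comes from the longest trip instead of being hard-coded, and this derived table would replace the t subquery in the query above.
-- sketch: generate days 0 .. max trip length instead of a fixed 0 .. 5
Select Level - 1 days
From (select max(extract(day from b.ts - a.ts) + 1) as max_days
      from trips a
      inner join trips b on a.name = b.name
         and a.entrance = 'INBOUND'
         and b.entrance = 'OUTBOUND'
         and a.ts < b.ts
         and not exists (select ts from trips
                         where entrance = 'OUTBOUND' and ts > a.ts and ts < b.ts))
Connect By Level <= max_days + 1
With this sample data it yields days 0 through 4, so the trailing "Stay 5 days | 0" row disappears.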

SQL Server : processing by group

I have a table with the following data:
Id Date Value
---------------------------
1 Dec-01-2019 10
1 Dec-03-2019 5
1 Dec-05-2019 8
1 Jan-03-2020 6
1 Jan-07-2020 3
1 Jan-08-2020 9
2 Dec-01-2019 4
2 Dec-03-2019 7
2 Dec-31-2019 9
2 Jan-04-2020 4
2 Jan-09-2020 6
I need to group it into the following format: 1 record per month per id. If the month is closed, the date should be the last day of that month; if not, the last day available. Max and average are calculated using all data up to that date.
Id Date Max_Value Average_Value
-----------------------------------------------
1 Dec-31-2019 10 7,6
1 Jan-08-2020 10 6,8
2 Dec-31-2019 9 6,6
2 Jan-09-2020 9 6,0
Any easy SQL to obtain this analysis?
Regards,
Hmmm . . . You want to aggregate by month and then just take the maximum date in the month:
select id, max(date), max(value), avg(value * 1.0)
from t
group by id, eomonth(date)
order by id, max(date);
If by closed month you mean that it's not the last month of the id then:
select id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end Date,
max(maxValue) Max_Value,
round(avg(1.0 * Value), 1) Average_Value
from (
select *,
max(Date) over (partition by Id) maxDate,
max(Value) over (partition by Id) maxValue
from tablename
) t
group by id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end
order by id, Date
See the demo.
Results:
> id | Date | Max_Value | Average_Value
> -: | :--------- | --------: | :------------
> 1 | 2019-12-31 | 10 | 7.7
> 1 | 2020-01-08 | 10 | 6.0
> 2 | 2019-12-31 | 9 | 6.7
> 2 | 2020-01-09 | 9 | 5.0
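If "using all data until that date" instead means a running max and running average over all earlier rows of the id (which is closer to the 6,8 in the expected output), here is a hedged sketch along the same lines, assuming the same tablename and SQL Server 2012+ for eomonth and window frames:
select id, Date, Max_Value, Average_Value
from (
    select id,
           case
               when year(Date) = year(max(Date) over (partition by id))
                and month(Date) = month(max(Date) over (partition by id))
               then max(Date) over (partition by id)
               else eomonth(Date)
           end Date,
           max(Value) over (partition by id order by Date
                            rows between unbounded preceding and current row) Max_Value,
           round(avg(1.0 * Value) over (partition by id order by Date
                                        rows between unbounded preceding and current row), 1) Average_Value,
           -- keep only the last available row of each calendar month
           row_number() over (partition by id, eomonth(Date) order by Date desc) rn
    from tablename
) t
where rn = 1
order by id, Date
For the sample data this returns averages of 7.7 and 6.8 for id 1 and 6.7 and 6.0 for id 2, i.e. computed over everything up to each month's last available day.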

Rolling Aggregation

I am trying to write a program in SQL Server that aggregates based on rolling dates.
Take this below
Acc Dte Amount
1 1/1/20 100
1 1/3/20 200
1 1/8/20 100
1 1/8/20 75
2 1/1/20 50
2 1/2/20 100
2 1/3/20 75
2 1/3/20 125
3 1/3/20 100
3 1/6/20 75
3 1/8/20 75
3 1/10/20 200
3 1/10/20 150
So the goal is: I want to find the avg and the count of records and dates for each account PRIOR TO the record being analyzed. I also need to sum the records based on the date. So based on the above it would look like this...
Acc Dte Num_of_dates Avg_Amount_per_day Current_Amount
1 1/3/20 1 100 200
1 1/8/20 2 150 175
2 1/2/20 1 50 100
2 1/3/20 2 75 200
3 1/6/20 1 100 75
3 1/8/20 2 83.3 75
3 1/10/20 3 83.3 350
The goal is to create a z-score comparing an account's numbers for the current day to that account's average per day. But we also need a minimum of 10 days of historical data for each account.
Right now my code looks like this and is not working
select Account,
Dte,
(select sum(case when Cast(EventTimestamp as DATE) < Dte then 1 else 0 end) Num_of_Date,
(select (case when Cast(EventTimestamp as DATE) < Dte then sum(Amount) else 0 end) t_amount
from Data
group by Account, Dte
Any ideas? Thanks
You can use window functions with a proper rows clause. For once, distinct comes in handy here:
select distinct
acc,
dte,
count(*) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) num_of_dates,
avg(1.0 * amount) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) avg_amount_per_day,
sum(amount) over(partition by acc, dte) current_amount
from mytable
If you do want just one record per date and account, as shown in your sample data, you can nest the query and use row_number() - in the absence of an obvious column to define the sorting order, I relied on the cumulative count:
select acc, dte, num_of_dates, avg_amount_per_day, current_amount
from (
select
t.*,
row_number() over(partition by acc, dte order by num_of_dates) rn
from (
select
acc,
dte,
count(*) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) num_of_dates,
avg(1.0 * amount) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) avg_amount_per_day,
sum(amount) over(partition by acc, dte) current_amount
from mytable
) t
) t
where rn = 1 and avg_amount_per_day is not null
Demo on DB Fiddle:
acc | dte | num_of_dates | avg_amount_per_day | current_amount
--: | :--------- | -----------: | :----------------- | -------------:
1 | 2020-01-03 | 1 | 100.000000 | 200
1 | 2020-01-08 | 2 | 150.000000 | 175
2 | 2020-01-02 | 1 | 50.000000 | 100
2 | 2020-01-03 | 2 | 75.000000 | 200
3 | 2020-01-06 | 1 | 100.000000 | 75
3 | 2020-01-08 | 2 | 87.500000 | 75
3 | 2020-01-10 | 3 | 83.333333 | 350
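If the eventual goal is the z-score mentioned in the question, a hedged sketch of one way to extend this (stdev() is T-SQL's sample standard deviation; mytable is the same assumed table name) is to collapse to one row per account and day first, then compare each day's total to the running mean and standard deviation of the previous days' totals:
select acc, dte, day_amount,
       (day_amount - avg_prev) / nullif(std_prev, 0) as z_score
from (
    select acc, dte,
           sum(amount) as day_amount,
           avg(1.0 * sum(amount)) over (
               partition by acc order by dte
               rows between unbounded preceding and 1 preceding
           ) as avg_prev,
           -- standard deviation of the previous days' totals for the same account
           stdev(1.0 * sum(amount)) over (
               partition by acc order by dte
               rows between unbounded preceding and 1 preceding
           ) as std_prev
    from mytable
    group by acc, dte
) t
where std_prev is not null
The stdev needs at least two prior days to be non-null, and the 10-day history minimum from the question could be enforced by also carrying a running count(*) and filtering on it.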
Your sample data and description suggest:
select acc, dte,
       count(*) as num_on_day,
       sum(amount) as sum_on_day,
       -- grouping yields one row per account and day, so a rows frame covers all previous days
       -- (SQL Server does not support range frames with a numeric offset)
       avg(1.0 * sum(amount)) over (partition by acc order by dte
                                    rows between unbounded preceding and 1 preceding) as avg_previous
from t
group by acc, dte;
I'm not sure why you don't include the first date for each acc.

Find a query that computes the time between first and second order of an ecommerce dataset

I am trying to compute the time between a first and second order in my ecommerce table (orders) for every customer.
I found this document that is useful to select all the top n rows per group, but I am not sure how to pair it up with the computation of second order time - first order time https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/
Here is what I wrote so far:
SELECT customer_id, datediff(day, min(order_time), max(order_time))
as avg_time
FROM ORDERS AS so
WHERE
(select count(*) from ORDERS as se
where se.customer_ID = so.customer_ID and se.order_time <= so.order_time
) <= 2
group by customer_id
having count(distinct order_time)>1
order by avg_time desc) t
However, it is wrong because it computes the max from the whole dataset, which would be the latest order and not the second order.
Thank you in advance!
As mentioned, LAG/LEAD functions will give you what you need here. I assume you want this on a per customer basis.
SELECT "Customer ID"
,"Transaction ID"
,"SKU"
,"Date"
,LAG("Date") OVER (PARTITION BY "Customer ID" ORDER BY "Date") AS LastOrderDate
,DATEDIFF(dd, LAG("Date") OVER (PARTITION BY "Customer ID" ORDER BY "Date"), "Date") AS DaysBetweenOrders
FROM ORDERS
;
So, given the following in postgres (which I have readily available):
so1=# SELECT * FROM table1;
cust | tx | sku | _date
------+-----+-----+---------------------
1 | 111 | 3 | 2010-01-01 12:30:00
2 | 222 | 1 | 2010-01-01 11:00:00
2 | 222 | 2 | 2010-01-01 11:00:00
3 | 333 | 7 | 2010-01-03 15:00:00
1 | 444 | 8 | 2010-01-04 21:00:00
(5 rows)
The following allows you to perform date arithmetic on consecutive rows (by date):
so1=# SELECT
so1-# fst - snd
so1-# FROM (
so1(# SELECT
so1(# _date AS fst,
so1(# lag(_date, 1) OVER (ORDER BY _date) AS snd
so1(# FROM
so1(# table1) AS s;
    ?column?
-----------------

 00:00:00
 01:30:00
 2 days 02:30:00
 1 day 06:00:00
(5 rows)
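Since the question asks specifically for the gap between the first and second order, a hedged variation (a sketch using row_number and SQL Server-style DATEDIFF, with the same column names as the query above) keeps only each customer's first two orders:
select "Customer ID",
       datediff(day, min("Date"), max("Date")) as days_to_second_order
from (
    select "Customer ID", "Date",
           row_number() over (partition by "Customer ID" order by "Date") as rn
    from ORDERS
) t
where rn <= 2
group by "Customer ID"
having count(*) = 2;
The having clause drops customers who only ever placed one order.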

Teradata sql query for grouping records using Intervals

In Teradata SQL, how do I assign the same row number to groups of records created within an 8-second time interval?
Example:-
Customerid Customername Itembought dateandtime
(yyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01
100 ALex Circketball 2017-02-10 10:10:06
100 ALex Baseball 2017-02-10 10:10:08
100 ALex volleyball 2017-02-10 10:11:01
100 ALex footbball 2017-02-10 10:11:05
100 ALex ringball 2017-02-10 10:11:08
100 Alex football 2017-02-10 10:12:10
My expected result should have an additional column, Row_number, which assigns the same number to all of the customer's purchases made within 8 seconds. Refer to the expected result below:
Customerid Customername Itembought dateandtime Row_number
(yyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01 1
100 ALex Circketball 2017-02-10 10:10:06 1
100 ALex Baseball 2017-02-10 10:10:08 1
100 ALex volleyball 2017-02-10 10:11:01 2
100 ALex footbball 2017-02-10 10:11:05 2
100 ALex ringball 2017-02-10 10:11:08 2
100 Alex football 2017-02-10 10:12:10 3
This is one way to do it with a recursive CTE. Reset the running total of differences from the previous row's timestamp to 0 when it goes above 8, and start a new group.
WITH ROWNUMS AS
     (SELECT T.*
            ,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY TM) AS RNUM
            /*Replace DATEDIFF with the Teradata-specific function*/
            ,DATEDIFF(SECOND,
                      COALESCE(MIN(TM) OVER(PARTITION BY ID
                                            ORDER BY TM ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), TM),
                      TM) AS DIFF
      FROM T --replace this with your tablename and add columns as required
     )
,CTE(ID,TM,DIFF,SUM_DIFF,RNUM,GRP) AS /*in Teradata the statement starts with WITH RECURSIVE instead*/
     (SELECT ID,
             TM,
             DIFF,
             DIFF,
             RNUM,
             CAST(1 AS int)
      FROM ROWNUMS
      WHERE RNUM=1
      UNION ALL
      SELECT T.ID,
             T.TM,
             T.DIFF,
             CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN 0 ELSE C.SUM_DIFF+T.DIFF END,
             T.RNUM,
             CAST(CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN T.RNUM ELSE C.GRP END AS int)
      FROM CTE C
      JOIN ROWNUMS T ON T.RNUM=C.RNUM+1 AND T.ID=C.ID
     )
SELECT ID,
       TM,
       DENSE_RANK() OVER(PARTITION BY ID ORDER BY GRP) AS row_num
FROM CTE
Demo in SQL Server
I am going to interpret the problem differently from vkp. Any row within 8 seconds of another row should be in the same group. Such values can chain together, so the overall span can be more than 8 seconds.
The advantage of this method is that recursive CTEs are not needed, so it should be faster. (Of course, this is not an advantage if the OP does not agree with the definition.)
The basic idea is to look at the previous date/time value; if it is more than 8 seconds away, then add a flag. The cumulative sum of the flag is the row number you are looking for.
select t.*,
       sum(case when prev_dt >= dateandtime - interval '8' second
                then 0 else 1
           end) over (partition by customerid order by dateandtime
                     ) as row_number
from (select t.*,
             max(dateandtime) over (partition by customerid order by dateandtime
                                    rows between 1 preceding and 1 preceding) as prev_dt
      from t
     ) t;
Using Teradata's PERIOD data type and the awesome td_normalize_overlap_meet:
Consider table test32:
SELECT * FROM test32
+----+----+------------------------+
| f1 | f2 | f3 |
+----+----+------------------------+
| 1 | 2 | 2017-05-11 03:59:00 PM |
| 1 | 3 | 2017-05-11 03:59:01 PM |
| 1 | 4 | 2017-05-11 03:58:58 PM |
| 1 | 5 | 2017-05-11 03:59:26 PM |
| 1 | 2 | 2017-05-11 03:59:28 PM |
| 1 | 2 | 2017-05-11 03:59:46 PM |
+----+----+------------------------+
The following will group your records:
WITH
normalizedCTE AS
(
SELECT *
FROM TABLE
(
td_normalize_overlap_meet(NEW VARIANT_TYPE(periodCTE.f1), periodCTE.fper)
RETURNS (f1 integer, fper PERIOD(TIMESTAMP(0)), recordCount integer)
HASH BY f1
LOCAL ORDER BY f1, fper
) as output(f1, fper, recordcount)
),
periodCTE AS
(
SELECT f1, f2, f3, PERIOD(f3, f3 + INTERVAL '9' SECOND) as fper FROM test32
)
SELECT t2.f1, t2.f2, t2.f3, t1.fper, DENSE_RANK() OVER (PARTITION BY t2.f1 ORDER BY t1.fper) as fgroup
FROM normalizedCTE t1
INNER JOIN periodCTE t2 ON
t1.fper P_INTERSECT t2.fper IS NOT NULL
Results:
+----+----+------------------------+-------------+
| f1 | f2 | f3 | fgroup |
+----+----+------------------------+-------------+
| 1 | 2 | 2017-05-11 03:59:00 PM | 1 |
| 1 | 3 | 2017-05-11 03:59:01 PM | 1 |
| 1 | 4 | 2017-05-11 03:58:58 PM | 1 |
| 1 | 5 | 2017-05-11 03:59:26 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:28 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:46 PM | 3 |
+----+----+------------------------+-------------+
A Period in Teradata is a special data type that holds a date or datetime range. The first parameter is the start of the range and the second is the ending time, which is exclusive (up to, but not including, which is why it's "+ 9 seconds"). The result is that we get an 8 second time "Period" where each record might "intersect" with another record.
We then use td_normalize_overlap_meet to merge records that intersect, sharing the f1 field's value as the key. In your case that would be customerid. The result is three records for this one customer since we have three groups that "overlap" or "meet" each other's time periods.
We then join the td_normalize_overlap_meet output with the output from when we determined the periods. We use the P_INTERSECT function to see which periods from the normalized CTE INTERSECT with the periods from the initial Period CTE. From the result of that P_INTERSECT join we grab the values we need from each CTE.
Lastly, Dense_Rank() gives us a rank based on the normalized period for each group.