I have the following table in Vertica:
Item_id event_date Price
A 2019-01-01 100
A 2019-01-04 200
B 2019-01-05 150
B 2019-01-06 250
B 2019-01-09 350
As you can see, some dates are missing: those between 2019-01-01 and 2019-01-04, and those between 2019-01-06 and 2019-01-09.
What I need is to add, for each item_id, the missing dates between the existing ones. Since the Price cell of each added row would be NULL, it should be filled with the Price from the previous date.
So it will be like this:
Item_id event_date Price
A 2019-01-01 100
A 2019-01-02 100
A 2019-01-03 100
A 2019-01-04 200
B 2019-01-05 150
B 2019-01-06 250
B 2019-01-07 250
B 2019-01-08 250
B 2019-01-09 350
I tried to go with
SELECT Item_id, event_date,
CASE Price WHEN 0 THEN NVL( LAG( CASE Price WHEN 0 THEN NULL ELSE Price END ) IGNORE NULLS OVER ( ORDER BY NULL ), 0 ) ELSE Price END AS Price_new
FROM item_price_table
from this article https://blog.jooq.org/2015/12/17/how-to-fill-sparse-data-with-the-previous-non-empty-value-in-sql/ , but it seems to work for SQL Server and not for Vertica, as there is no IGNORE NULLS option there...
Does anyone know how to deal with it?
Let me assume you have a calendar table. In Vertica, you can then use last_value(ignore nulls) to fill in the rest:
select c.event_date, i.item_id,
coalesce(ipt.price,
last_value(ipt.price ignore nulls) over (partition by i.item_id order by c.event_date)
) as price
from calendar c cross join
(select distinct item_id from item_price_table) i left join
item_price_table ipt
on i.item_id = ipt.item_id and c.event_date = ipt.event_date
I was waiting for that one....!
I just love Vertica's TIMESERIES clause !
It works on TIMESTAMPs, not DATEs, so I have to cast back and forth, but it's unbeatable.
See here:
WITH
input(item_id,event_dt,price) AS (
SELECT 'A',DATE '2019-01-01',100
UNION ALL SELECT 'A',DATE '2019-01-04',200
UNION ALL SELECT 'B',DATE '2019-01-05',150
UNION ALL SELECT 'B',DATE '2019-01-06',250
UNION ALL SELECT 'B',DATE '2019-01-09',350
)
SELECT
item_id
, event_dts::DATE AS event_dt
, TS_FIRST_VALUE(price) AS price
FROM input
TIMESERIES event_dts AS '1 DAY' OVER(PARTITION BY item_id ORDER BY event_dt::timestamp)
-- out item_id | event_dt | price
-- out ---------+------------+-------
-- out A | 2019-01-01 | 100
-- out A | 2019-01-02 | 100
-- out A | 2019-01-03 | 100
-- out A | 2019-01-04 | 200
-- out B | 2019-01-05 | 150
-- out B | 2019-01-06 | 250
-- out B | 2019-01-07 | 250
-- out B | 2019-01-08 | 250
-- out B | 2019-01-09 | 350
-- out (9 rows)
-- out
-- out Time: First fetch (9 rows): 68.057 ms. All rows formatted: 68.221 ms
;
Need explanations?
Happy playing ...
I don't have the code off the top of my head, but to add the missing dates you'll want to create a calendar table and join to that. Then you can use the lag function to replace the NULL Price with the one above it. There's plenty of code around if you search for a CTE that creates a calendar table.
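To make the "add the missing days, then carry the previous price forward" idea concrete, here is a minimal sketch in plain Python rather than SQL. It uses the sample rows from the question; the function name fill_gaps is only illustrative and not part of any answer above.

```python
from datetime import date, timedelta

# Sample rows from the question: (item_id, event_date, price)
rows = [
    ("A", date(2019, 1, 1), 100),
    ("A", date(2019, 1, 4), 200),
    ("B", date(2019, 1, 5), 150),
    ("B", date(2019, 1, 6), 250),
    ("B", date(2019, 1, 9), 350),
]


def fill_gaps(rows):
    """Return one row per item per day, carrying the last known price forward."""
    by_item = {}
    for item_id, event_date, price in rows:
        by_item.setdefault(item_id, {})[event_date] = price

    filled = []
    for item_id in sorted(by_item):
        prices = by_item[item_id]
        day, last_day = min(prices), max(prices)
        last_price = None
        while day <= last_day:
            # Use this day's price if there is one, otherwise keep the previous one.
            last_price = prices.get(day, last_price)
            filled.append((item_id, day, last_price))
            day += timedelta(days=1)
    return filled


for item_id, day, price in fill_gaps(rows):
    print(item_id, day.isoformat(), price)
```

The loop over days plays the role of the calendar table, and last_price plays the role of last_value(... ignore nulls) within each item_id partition.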
I have a date dimension table containing all dates and another table containing the value of items at specific dates.
E.g
(a) Date_Dim table
|Full_Date |
|-----------|
| .... |
|1-jan-2021 |
|2-Jan-2021 |
|3-jan-2021 |
| ... |
(b) Item_value table
|P_Date | ITEM | Value |
|-----------:|:------|-------:|
|20-Dec-2020 |AA1 |9 |
|1-jan-2021 |AA1 |10 |
|1-jan-2021 |AA2 |100 |
| ... | ... | ... |
I am trying to build a fact table containing the latest value of every item in the item_value table for every date in the date_dim table, i.e. the value of each item on every day.
e.g.
|Full_date | ITEM | Value |
|-----------:|-------:|------:|
|31-Dec-2020 |AA1 | 9 |
|31-Dec-2020 |AA2 | null |
|1-Jan-2021 |AA1 | 10 |
|1-Jan-2021 |AA2 | 100 |
|2-Jan-2021 |AA1 | 10 |
|2-Jan-2021 |AA2 | 100 |
|3-Jan-2021 |AA1 | 10 |
|3-Jan-2021 |AA2 | 100 |
|4-Jan-2021 |AA1 | 10 |
|4-Jan-2021 |AA2 | 100 |
How can this query be built, please?
I have tried the following, but it is not working:
select full_date,p_date,item,value
from dim_date
left outer join item_value on full_date=p_date;
Not sure whether max(p_date) over (partition by ...) will work.
Thank you
You can use a partitioned outer join and then aggregate:
WITH date_dim ( full_date ) AS (
SELECT DATE '2020-12-31' + LEVEL - 1 AS full_Date
FROM DUAL
CONNECT BY DATE '2020-12-31' + LEVEL - 1 <= DATE '2021-01-04'
)
SELECT item,
full_date,
MAX( value ) KEEP ( DENSE_RANK LAST ORDER BY p_date ) AS value
FROM date_dim d
LEFT OUTER JOIN item_value i
PARTITION BY ( i.item )
ON ( d.full_date >= i.p_date )
GROUP BY item, full_date
Which, for the sample data:
CREATE TABLE item_value ( P_Date, ITEM, Value ) AS
SELECT DATE '2020-12-20', 'AA1', 9 FROM DUAL UNION ALL
SELECT DATE '2021-01-01', 'AA1', 10 FROM DUAL UNION ALL
SELECT DATE '2021-01-01', 'AA2', 100 FROM DUAL;
Outputs:
ITEM | FULL_DATE | VALUE
:--- | :-------- | ----:
AA1 | 31-DEC-20 | 9
AA1 | 01-JAN-21 | 10
AA1 | 02-JAN-21 | 10
AA1 | 03-JAN-21 | 10
AA1 | 04-JAN-21 | 10
AA2 | 31-DEC-20 | null
AA2 | 01-JAN-21 | 100
AA2 | 02-JAN-21 | 100
AA2 | 03-JAN-21 | 100
AA2 | 04-JAN-21 | 100
Note: You do not need to store the date_dim dimension table; it can be generated on the fly, which avoids the (expensive) IO operations needed to read the table from disk.
db<>fiddle here
You may simply add a validity interval to your item_value table using the analytical function LEAD:
select
P_DATE,
lead(P_DATE-1,1,(select max(full_date) from date_dim)) over (partition by ITEM order by P_DATE) P_DATE_TO,
ITEM, VALUE
from item_value
;
P_DATE P_DATE_TO ITE VALUE
------------------- ------------------- --- ----------
20.12.2020 00:00:00 31.12.2020 00:00:00 AA1 9
01.01.2021 00:00:00 04.01.2021 00:00:00 AA1 10
01.01.2021 00:00:00 04.01.2021 00:00:00 AA2 100
In some cases this is enough for your use case: assuming the result above is available as item_value_hist, you can query the VALUE of a specific ITEM on a given date with
select VALUE from item_value_hist h where ITEM = 'AA2'
and <query_date> BETWEEN h.P_DATE and h.P_DATE_TO
Note that the validity interval is inclusive, because P_DATE_TO is the adjacent P_DATE minus one day. Take some care if the DATEs have a time component.
If you want the ITEM per DAY overview, you must first add the missing early history with a VALUE of NULL:
select
(select min(full_date) from date_dim) P_DATE, min(P_DATE)-1 P_DATE_TO, ITEM, null VALUE
from item_value
group by ITEM
having min(P_DATE) > (select min(full_date) from date_dim)
P_DATE P_DATE_TO ITE VALUE
------------------- ------------------- --- -----
31.12.2020 00:00:00 31.12.2020 00:00:00 AA2
Then simply join to your dimension table, matching every day within the validity interval:
with item as (
select
P_DATE,
lead(P_DATE-1,1,(select max(full_date) from date_dim)) over (partition by ITEM order by P_DATE) P_DATE_TO,
ITEM, VALUE
from item_value
union all
select
/* add the missing early history without a VALUE */
(select min(full_date) from date_dim) P_DATE, min(P_DATE)-1 P_DATE_TO, ITEM, null VALUE
from item_value
group by ITEM
having min(P_DATE) > (select min(full_date) from date_dim)
)
select dt.full_date, item.ITEM, item.VALUE from item
join date_dim dt
on dt.full_date between item.P_DATE and item.P_DATE_TO
order by item.ITEM, dt.full_date
FULL_DATE ITE VALUE
------------------- --- ----------
31.12.2020 00:00:00 AA1 9
01.01.2021 00:00:00 AA1 10
02.01.2021 00:00:00 AA1 10
03.01.2021 00:00:00 AA1 10
04.01.2021 00:00:00 AA1 10
31.12.2020 00:00:00 AA2
01.01.2021 00:00:00 AA2 100
02.01.2021 00:00:00 AA2 100
03.01.2021 00:00:00 AA2 100
04.01.2021 00:00:00 AA2 100
Two steps:
Cross join dates and items. If you don't have an item table (which you should), join distinct items from your item_value table.
Get the value in the FROM clause with OUTER APPLY or in the SELECT clause with a subquery using FETCH FIRST ROW ONLY.
The query:
select
d.full_date,
i.item,
(
select iv.value
from Item_value iv
where iv.item = i.item
and iv.p_date <= d.full_date
order by iv.p_date desc
fetch first row only
) as value
from dim_date d
cross join (select distinct item from item_value) i
order by d.full_date, i.item;
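The two steps can also be sketched outside the database. The following is a minimal Python illustration, assuming the sample data from the question and a five-day calendar; value_as_of is a hypothetical helper that mimics the "latest row with p_date <= full_date" subquery.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Sample item_value rows from the question: (p_date, item, value)
item_value = [
    (date(2020, 12, 20), "AA1", 9),
    (date(2021, 1, 1), "AA1", 10),
    (date(2021, 1, 1), "AA2", 100),
]

# Assumed calendar: 31 Dec 2020 through 4 Jan 2021.
calendar = [date(2020, 12, 31) + timedelta(days=n) for n in range(5)]

# Per item, keep the dates and values in ascending date order.
history = {}
for p_date, item, value in sorted(item_value):
    dates, values = history.setdefault(item, ([], []))
    dates.append(p_date)
    values.append(value)


def value_as_of(item, day):
    """Latest value whose p_date is on or before day, or None if there is none."""
    dates, values = history[item]
    pos = bisect_right(dates, day)
    return values[pos - 1] if pos else None


# Step 1: cross join dates and items. Step 2: look up the latest value.
fact = [
    (day, item, value_as_of(item, day))
    for day in calendar
    for item in sorted(history)
]

for day, item, value in fact:
    print(day.isoformat(), item, value)
```

The binary search is only a convenience here; the point is that every (date, item) pair exists first and the value is looked up second.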
You can generate the full list of dates and items using cross join followed by a left join to bring in the existing values. Then you can use last_value() or lag() to fill in the values:
select d.full_date, i.item,
coalesce(iv.value,
lag(iv.value ignore nulls) over (partition by i.item order by d.full_date)
) as value
from date_dim d cross join
(select distinct iv.item from item_value iv) i left join
item_value iv
on iv.p_date = d.full_date and iv.item = i.item;
You can also do this using a join by adding an "end" date to the values table:
select d.full_date, i.item,
coalesce(iv.value,
lag(iv.value ignore nulls) over (partition by i.item order by d.full_date)
) as value
from date_dim d cross join
(select distinct iv.item from item_value iv) i left join
(select iv.*,
lead(p_date) over (partition by item order by p_date) as next_p_date
from item_value iv
) iv
on i.item = iv.item and
d.full_date >= iv.p_date and
(iv.next_p_date is null or d.full_date < iv.next_p_date);
Story:
For each id, they have a join date to a subscription, and when they get rebilled monthly, they have a returning date. The first part of the exercise was to flag consecutive months of returned dates from the join date. Here's an example:
+----+------------+----------------+------+
| id | join_date | returning_date | flag |
+----+------------+----------------+------+
| 1 | 2018-12-01 | 2019-01-01 | 1 |
| 1 | 2018-12-01 | 2019-02-01 | 1 |
| 1 | 2018-12-01 | 2019-03-01 | 1 |
+----+------------+----------------+------+
Objective:
What I would like to add is a flag for those who return from a canceled subscription. That flag can be in another column. For example, the following result shows that the subscriber returned on May 1st 2019. This date needs to be flagged:
+----+------------+----------------+------+
| id | join_date | returning_date | flag |
+----+------------+----------------+------+
| 1 | 2018-12-01 | 2019-01-01 | 1 |
| 1 | 2018-12-01 | 2019-02-01 | 1 |
| 1 | 2018-12-01 | 2019-03-01 | 1 |
| 1 | 2018-12-01 | 2019-05-01 | 0 |
| 1 | 2018-12-01 | 2019-06-01 | 0 |
+----+------------+----------------+------+
Fiddle Data:
DROP TABLE IF EXISTS #T1
create table #t1 (id int,join_date date, returning_date date)
insert into #t1 values
(1,'2018-12-01', '2019-01-01'),
(1,'2018-12-01', '2019-02-01'),
(1,'2018-12-01', '2019-03-01'),
(1,'2018-12-01', '2019-05-01'),
(1,'2018-12-01', '2019-06-01'),
(2,'2018-12-01', '2019-02-01'),
(2,'2018-12-01', '2019-03-01'),
(2,'2018-12-01', '2019-05-01'),
(2,'2018-12-01', '2019-06-01'),
(3,'2019-05-01', '2019-06-01'),
(3,'2019-05-01', '2019-08-01'),
(3,'2019-05-01', '2019-10-01')
Current query with flag for consecutive months:
select *
,CASE WHEN DATEDIFF(MONTH,join_date,returning_date) = ROW_NUMBER() OVER (PARTITION BY id ORDER BY returning_date ASC) THEN 1 ELSE 0 END AS flag
from #t1
ORDER BY ID,returning_date
You seem to be asking whether there are any gaps since an id first returned (with a given join_date).
If so, that is simply counting. How many months have passed since the first returning_date? How many rows are there? Compare the two to see if there are gaps:
select t1.*,
(case when datediff(month, min(returning_date) over (partition by id, join_date order by returning_date), returning_date) <>
row_number() over (partition by id, join_date order by returning_date) - 1
then 0 else 1
end) as flag
from t1;
Here is a db<>fiddle.
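The counting idea can be checked by hand with a small Python sketch. It is only an illustration of the comparison in the query above (months since the first returning_date versus the row's position), using the rows for ids 1 and 3 from the fiddle data; the helper names are made up.

```python
from datetime import date
from itertools import groupby

# Rows for ids 1 and 3 from the fiddle: (id, join_date, returning_date)
rows = [
    (1, date(2018, 12, 1), date(2019, 1, 1)),
    (1, date(2018, 12, 1), date(2019, 2, 1)),
    (1, date(2018, 12, 1), date(2019, 3, 1)),
    (1, date(2018, 12, 1), date(2019, 5, 1)),
    (1, date(2018, 12, 1), date(2019, 6, 1)),
    (3, date(2019, 5, 1), date(2019, 6, 1)),
    (3, date(2019, 5, 1), date(2019, 8, 1)),
    (3, date(2019, 5, 1), date(2019, 10, 1)),
]


def month_index(d):
    """Number of months since year 0, so differences behave like DATEDIFF(MONTH, ...)."""
    return d.year * 12 + d.month


def flag_rows(rows):
    """Flag 1 while the months since the first return equal the row position, else 0."""
    flagged = []
    for _, group in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        group = list(group)
        first = month_index(group[0][2])
        for position, (id_, join_date, returning_date) in enumerate(group):
            months_since_first = month_index(returning_date) - first
            flag = 1 if months_since_first == position else 0
            flagged.append((id_, join_date, returning_date, flag))
    return flagged


for row in flag_rows(rows):
    print(row[0], row[1].isoformat(), row[2].isoformat(), row[3])
```

Once a month is skipped, the month count runs ahead of the row count for every later row, which is why all rows after the first gap get 0.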
Since you didn't specify which recurrence of returning should be flagged, my query flags any non-consecutive date as a return date, because a subscriber could leave and return many times after their join date. (The subscriber with [id] 3 technically returned in August and then again in October, so that is returning twice, but October is marked as END instead because it is the last date in the data set.) I also made it easier to read by adding START and END markers based on the data set in your fiddle.
You can use this query as a temp table, CTE, or other basis to query against if you need to manipulate the data further.
select a.*
,case
when a.returning_date = (select min(c.returning_date) from subscription c where c.id = a.id and c.join_date = a.join_date) then 'START'
when a.returning_date = (select max(c.returning_date) from subscription c where c.id = a.id and c.join_date = a.join_date) then 'END'
when b.id is null then 'RETURN'
else 'CONSECUTIVE'
end as SubStatus
from subscription a
left join subscription b on a.id = b.id and a.join_date = b.join_date and DATEADD(month,-1,a.returning_date) = b.returning_date
here is the result set from my query:
id join_date returning_date SubStatus
----------- ---------- -------------- -----------
1 2018-12-01 2019-01-01 START
1 2018-12-01 2019-02-01 CONSECUTIVE
1 2018-12-01 2019-03-01 CONSECUTIVE
1 2018-12-01 2019-05-01 RETURN
1 2018-12-01 2019-06-01 END
2 2018-12-01 2019-02-01 START
2 2018-12-01 2019-03-01 CONSECUTIVE
2 2018-12-01 2019-05-01 RETURN
2 2018-12-01 2019-06-01 END
3 2019-05-01 2019-06-01 START
3 2019-05-01 2019-08-01 RETURN
3 2019-05-01 2019-10-01 END
"flag consecutive months" and "renders all future payments" are not phrases that are going to lead to a pretty query, which is why you had to resort to a while loop. Nevertheless, what you seek is possible, and with work it may prove more performant than your while loop for large data. I present my sample code below using CTEs, but you may want to use temp tables or update an originally NULL 'flag' column on the base table.
In flagNonConsecutive, each row gets a nonConsecutive value: 1 if the date is consecutive with the join_date or with the previous date (as identified using the lag window function), and 0 otherwise.
This meets the first requirement. Then in minNonConsecutives, you identify the earliest of the rows marked 0 for each id.
In the main query, that date and any dates after it get the 0 treatment:
with
flagNonConsecutive as (
select *,
nonConsecutive =
case
when datediff(month, join_date, returning_date) = 1 then 1
when datediff(
month,
lag(returning_date) over(
partition by id
order by returning_date
),
returning_date
) = 1 then 1
else 0
end
from #t1
),
minNonConsecutives as (
select id,
minNonConsec = min(returning_date)
from flagNonConsecutive
where nonConsecutive = 0
group by id
)
select fnc.id,
fnc.join_date,
fnc.returning_date,
flag = iif(fnc.returning_date >= mnc.minNonConsec, 0, 1)
from flagNonConsecutive fnc
left join minNonConsecutives mnc on fnc.id = mnc.id;
I would like to list the missing dates between two dates in a query. For example,
my data:
TABLE ORDER
DATE_order | AMOUNT
01/01/2020 | 500
01/01/2020 | 600
03/01/2020 | 100
05/01/2020 | 300
I want the query to return:
01/01/2020 | 1100
02/01/2020 | 0
03/01/2020 | 100
04/01/2020 | 0
05/01/2020 | 300
I use a Cassandra database with the Apache Hive connector.
Can someone help me?
You can generate missing rows using lateral view and posexplode:
with your_data as (
select stack(4,
'2020-01-01',500,
'2020-01-01',600,
'2020-01-03',100,
'2020-01-05',300
) as (DATE_order,AMOUNT )
)
select date_sub(s.date_order ,nvl(d.i,0)) as date_order, case when d.i > 0 then 0 else s.amount end as amount
from
(--find previous date
select date_order, amount,
lag(date_order) over(order by date_order) prev_date,
datediff(date_order,lag(date_order) over(order by date_order)) datdiff
from
( --aggregate
select date_order, sum(amount) amount from your_data group by date_order )s
)s
--generate rows
lateral view outer posexplode(split(space(s.datdiff-1),' ')) d as i,x
order by date_order;
Result:
date_order amount
2020-01-01 1100
2020-01-02 0
2020-01-03 100
2020-01-04 0
2020-01-05 300
Time taken: 10.04 seconds, Fetched: 5 row(s)
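For readers who find the posexplode trick hard to follow, here is a rough Python sketch of the same steps: aggregate per date, look at the previous date, and for every gap generate the missing days backwards with an amount of 0. The function name fill_with_zero is just for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample data from the question: (date_order, amount)
orders = [
    (date(2020, 1, 1), 500),
    (date(2020, 1, 1), 600),
    (date(2020, 1, 3), 100),
    (date(2020, 1, 5), 300),
]


def fill_with_zero(orders):
    # Aggregate: one total per date.
    totals = defaultdict(int)
    for date_order, amount in orders:
        totals[date_order] += amount

    rows = []
    prev = None
    for date_order in sorted(totals):
        # Days since the previous date; the first date has no gap to fill.
        gap = (date_order - prev).days if prev is not None else 1
        for i in range(gap):
            # i == 0 is the real row; i > 0 are the generated missing days.
            amount = totals[date_order] if i == 0 else 0
            rows.append((date_order - timedelta(days=i), amount))
        prev = date_order
    return sorted(rows)


for date_order, amount in fill_with_zero(orders):
    print(date_order.isoformat(), amount)
```

Here range(gap) stands in for the positions produced by posexplode(split(space(datdiff-1),' ')), and subtracting i days corresponds to date_sub.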
In Teradata SQL, how can I assign the same row number to a group of records created within an 8-second time interval?
Example:
Customerid Customername Itembought dateandtime
(yyyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01
100 ALex Circketball 2017-02-10 10:10:06
100 ALex Baseball 2017-02-10 10:10:08
100 ALex volleyball 2017-02-10 10:11:01
100 ALex footbball 2017-02-10 10:11:05
100 ALex ringball 2017-02-10 10:11:08
100 Alex football 2017-02-10 10:12:10
My expected result should have an additional Row_number column that assigns the same number to all purchases a customer made within 8 seconds of each other. See the expected result below:
Customerid Customername Itembought dateandtime Row_number
(yyyy-mm-dd hh:mm:ss)
100 ALex Basketball 2017-02-10 10:10:01 1
100 ALex Circketball 2017-02-10 10:10:06 1
100 ALex Baseball 2017-02-10 10:10:08 1
100 ALex volleyball 2017-02-10 10:11:01 2
100 ALex footbball 2017-02-10 10:11:05 2
100 ALex ringball 2017-02-10 10:11:08 2
100 Alex football 2017-02-10 10:12:10 3
This is one way to do it with a recursive CTE. Keep a running total of the differences from the previous row's timestamp; when it exceeds 8 seconds, reset it to 0 and start a new group.
WITH ROWNUMS AS
(SELECT T.*
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY TM) AS RNUM
/*Replace DATEDIFF with Teradata specific function*/
,DATEDIFF(SECOND,COALESCE(MIN(TM) OVER(PARTITION BY ID
ORDER BY TM ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), TM),TM) AS DIFF
FROM T --replace this with your tablename and add columns as required
)
,RECURSIVE CTE(ID,TM,DIFF,SUM_DIFF,RNUM,GRP) AS
(SELECT ID,
TM,
DIFF,
DIFF,
RNUM,
CAST(1 AS int)
FROM ROWNUMS
WHERE RNUM=1
UNION ALL
SELECT T.ID,
T.TM,
T.DIFF,
CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN 0 ELSE C.SUM_DIFF+T.DIFF END,
T.RNUM,
CAST(CASE WHEN C.SUM_DIFF+T.DIFF > 8 THEN T.RNUM ELSE C.GRP END AS int)
FROM CTE C
JOIN ROWNUMS T ON T.RNUM=C.RNUM+1 AND T.ID=C.ID
)
SELECT ID,
TM,
DENSE_RANK() OVER(PARTITION BY ID ORDER BY GRP) AS row_num
FROM CTE
Demo in SQL Server
I am going to interpret the problem differently from vkp. Any row within 8 seconds of another row should be in the same group. Such values can chain together, so the overall span can be more than 8 seconds.
The advantage of this method is that recursive CTEs are not needed, so it should be faster. (Of course, this is not an advantage if the OP does not agree with the definition.)
The basic idea is to look at the previous date/time value; if it is more than 8 seconds away, then add a flag. The cumulative sum of the flag is the row number you are looking for.
select t.*,
sum(case when prev_dt >= dateandtime - interval '8' second
then 0 else 1
end) over (partition by customerid order by dateandtime
) as row_num
from (select t.*,
max(dateandtime) over (partition by customerid order by dateandtime rows between 1 preceding and 1 preceding) as prev_dt
from t
) t;
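The flag-plus-cumulative-sum idea is easy to verify on the sample data. Below is a minimal Python sketch for a single customer, assuming the timestamps from the question; group_numbers is an illustrative name.

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Purchase times for customer 100 from the question.
raw_times = [
    "2017-02-10 10:10:01",
    "2017-02-10 10:10:06",
    "2017-02-10 10:10:08",
    "2017-02-10 10:11:01",
    "2017-02-10 10:11:05",
    "2017-02-10 10:11:08",
    "2017-02-10 10:12:10",
]
times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in raw_times]


def group_numbers(times, max_gap=timedelta(seconds=8)):
    """Number the groups: a new group starts when the gap to the previous row exceeds max_gap."""
    ordered = sorted(times)
    if not ordered:
        return []
    # The first row always starts a group (no previous row), like the NULL prev_dt case.
    flags = [1] + [
        0 if cur - prev <= max_gap else 1
        for prev, cur in zip(ordered, ordered[1:])
    ]
    # The running sum of the flags is the group number.
    return list(accumulate(flags))


print(group_numbers(times))
```

For more than one customer, the same function would be applied per customerid, which is what the partition by clause does in the query.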
Using Teradata's PERIOD data type and the awesome td_normalize_overlap_meet:
Consider table test32:
SELECT * FROM test32
+----+----+------------------------+
| f1 | f2 | f3 |
+----+----+------------------------+
| 1 | 2 | 2017-05-11 03:59:00 PM |
| 1 | 3 | 2017-05-11 03:59:01 PM |
| 1 | 4 | 2017-05-11 03:58:58 PM |
| 1 | 5 | 2017-05-11 03:59:26 PM |
| 1 | 2 | 2017-05-11 03:59:28 PM |
| 1 | 2 | 2017-05-11 03:59:46 PM |
+----+----+------------------------+
The following will group your records:
WITH
normalizedCTE AS
(
SELECT *
FROM TABLE
(
td_normalize_overlap_meet(NEW VARIANT_TYPE(periodCTE.f1), periodCTE.fper)
RETURNS (f1 integer, fper PERIOD(TIMESTAMP(0)), recordCount integer)
HASH BY f1
LOCAL ORDER BY f1, fper
) as output(f1, fper, recordcount)
),
periodCTE AS
(
SELECT f1, f2, f3, PERIOD(f3, f3 + INTERVAL '9' SECOND) as fper FROM test32
)
SELECT t2.f1, t2.f2, t2.f3, t1.fper, DENSE_RANK() OVER (PARTITION BY t2.f1 ORDER BY t1.fper) as fgroup
FROM normalizedCTE t1
INNER JOIN periodCTE t2 ON
t1.fper P_INTERSECT t2.fper IS NOT NULL
Results:
+----+----+------------------------+-------------+
| f1 | f2 | f3 | fgroup |
+----+----+------------------------+-------------+
| 1 | 2 | 2017-05-11 03:59:00 PM | 1 |
| 1 | 3 | 2017-05-11 03:59:01 PM | 1 |
| 1 | 4 | 2017-05-11 03:58:58 PM | 1 |
| 1 | 5 | 2017-05-11 03:59:26 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:28 PM | 2 |
| 1 | 2 | 2017-05-11 03:59:46 PM | 3 |
+----+----+------------------------+-------------+
A Period in Teradata is a special data type that holds a date or datetime range. The first parameter is the start of the range and the second is the end (up to, but not including, which is why it's "+ 9 seconds"). The result is an 8-second "Period" for each record, which might "intersect" with another record's.
We then use td_normalize_overlap_meet to merge records that intersect, sharing the f1 field's value as the key. In your case that would be customerid. The result is three records for this one customer since we have three groups that "overlap" or "meet" each other's time periods.
We then join the td_normalize_overlap_meet output with the output from when we determined the periods. We use the P_INTERSECT function to see which periods from the normalized CTE INTERSECT with the periods from the initial Period CTE. From the result of that P_INTERSECT join we grab the values we need from each CTE.
Lastly, Dense_Rank() gives us a rank based on the normalized period for each group.
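If it helps to see the normalization step without Teradata, here is a small Python sketch of the same logic under the assumptions described above: each timestamp becomes a half-open period of 9 seconds, periods that overlap or meet are merged, and each record gets the number of the merged period it falls into. The function names are hypothetical and this is not how td_normalize_overlap_meet is implemented.

```python
from datetime import datetime, timedelta

# f3 values from test32 (all for f1 = 1), in table order.
f3_values = [
    datetime(2017, 5, 11, 15, 59, 0),
    datetime(2017, 5, 11, 15, 59, 1),
    datetime(2017, 5, 11, 15, 58, 58),
    datetime(2017, 5, 11, 15, 59, 26),
    datetime(2017, 5, 11, 15, 59, 28),
    datetime(2017, 5, 11, 15, 59, 46),
]


def merge_periods(timestamps, length=timedelta(seconds=9)):
    """Merge half-open periods [t, t + length) that overlap or meet."""
    merged = []
    for start in sorted(timestamps):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or meets the current merged period: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def group_of(timestamp, merged):
    """1-based number of the merged period that contains timestamp."""
    for number, (start, end) in enumerate(merged, start=1):
        if start <= timestamp < end:
            return number
    return None


merged = merge_periods(f3_values)
groups = [group_of(t, merged) for t in f3_values]
print(groups)
```

Numbering the merged periods in start order corresponds to the DENSE_RANK() over the normalized period in the query.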
I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
Calculating daily active users is straightforward, but calculating the monthly active users (i.e. the number of users that logged in during the 30 days up to today) is causing problems. How is this achieved without a left join for each day?
Edit: I'm using Postgres.
Assuming you have values for each day, you can get the total counts using a CTE and a window frame (rows between):
with dau as (
select date, count(user_id) as dau
from dailysessions ds
group by date
)
select date, dau,
sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
select date, count(distinct user_id) as dau
from dailysessions ds
group by date
)
select d.date, d.dau,
(select count(distinct ds.user_id)
from dailysessions ds
where ds.date between d.date - 29 * interval '1 day' and d.date
) as mau
from dau d;
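To see what "distinct users over a rolling 30 days" means in practice, here is a minimal Python sketch. The session rows are made up for illustration (they are not from the question), and dau_mau is a hypothetical helper; the window is the current day plus the 29 days before it, matching the between condition above.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sessions: (date, user_id). User 1 has two sessions on 2 June.
sessions = [
    (date(2014, 6, 1), 1),
    (date(2014, 6, 1), 2),
    (date(2014, 6, 2), 1),
    (date(2014, 6, 2), 1),
    (date(2014, 7, 5), 3),
]


def dau_mau(sessions, window_days=30):
    """Return (day, dau, mau) per day with activity, counting distinct users."""
    users_by_day = defaultdict(set)
    for day, user_id in sessions:
        users_by_day[day].add(user_id)

    result = []
    for day in sorted(users_by_day):
        # Union of the user sets for this day and the window_days - 1 days before it.
        window_users = set()
        for offset in range(window_days):
            window_users |= users_by_day.get(day - timedelta(days=offset), set())
        result.append((day, len(users_by_day[day]), len(window_users)))
    return result


for day, dau, mau in dau_mau(sessions):
    print(day.isoformat(), dau, mau)
```

Taking the union of sets is what makes this a distinct count; simply summing the daily counts would count a user once for every day they were active.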
This one uses COUNT DISTINCT to get the rolling 30 days DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)
I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).
I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w
, d."date", a."activated", a2."churned", a."resurrected" FROM
generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.
You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>