Rolling Aggregation - SQL

I am trying to write a query in SQL Server that aggregates based on rolling dates.
Take the data below:
Acc  Dte      Amount
1    1/1/20   100
1    1/3/20   200
1    1/8/20   100
1    1/8/20   75
2    1/1/20   50
2    1/2/20   100
2    1/3/20   75
2    1/3/20   125
3    1/3/20   100
3    1/6/20   75
3    1/8/20   75
3    1/10/20  200
3    1/10/20  150
So the goal is: I want to find the average and the count of records/dates for each account PRIOR TO the record being analyzed. I also need to sum the records for each date. Based on the above, the result would look like this (e.g., for account 3 on 1/8/20 the prior average is (100 + 75) / 2 = 87.5):
Acc  Dte      Num_of_dates  Avg_Amount_per_day  Current_Amount
1    1/3/20   1             100                 200
1    1/8/20   2             150                 175
2    1/2/20   1             50                  100
2    1/3/20   2             75                  200
3    1/6/20   1             100                 75
3    1/8/20   2             87.5                75
3    1/10/20  3             83.3                350
The goal is to create a z-score comparing each account's total for the current day to that account's average per day. But we also need to hit a minimum of 10 days of historical data for each account.
Right now my code looks like this, and it is not working:
select Account,
Dte,
(select sum(case when Cast(EventTimestamp as DATE) < Dte then 1 else 0 end) Num_of_Date,
(select (case when Cast(EventTimestamp as DATE) < Dte then sum(Amount) else 0 end) t_amount
from Data
group by Account, Dte
Any ideas? Thanks

You can use window functions with a proper ROWS clause. For once, DISTINCT comes in handy here:
select distinct
acc,
dte,
count(*) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) num_of_dates,
avg(1.0 * amount) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) avg_amount_per_day,
sum(amount) over(partition by acc, dte) current_amount
from mytable
If you do want just one record per date and account, as shown in your sample data, you can nest the query and use row_number() - in the absence of an obvious column to define the sorting order, I relied on the cumulative count:
select acc, dte, num_of_dates, avg_amount_per_day, current_amount
from (
select
t.*,
row_number() over(partition by acc, dte order by num_of_dates) rn
from (
select
acc,
dte,
count(*) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) num_of_dates,
avg(1.0 * amount) over(
partition by acc
order by dte
rows between unbounded preceding and 1 preceding
) avg_amount_per_day,
sum(amount) over(partition by acc, dte) current_amount
from mytable
) t
) t
where rn = 1 and avg_amount_per_day is not null
Demo on DB Fiddle:
acc | dte | num_of_dates | avg_amount_per_day | current_amount
--: | :--------- | -----------: | :----------------- | -------------:
1 | 2020-01-03 | 1 | 100.000000 | 200
1 | 2020-01-08 | 2 | 150.000000 | 175
2 | 2020-01-02 | 1 | 50.000000 | 100
2 | 2020-01-03 | 2 | 75.000000 | 200
3 | 2020-01-06 | 1 | 100.000000 | 75
3 | 2020-01-08 | 2 | 87.500000 | 75
3 | 2020-01-10 | 3 | 83.333333 | 350
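Neither query above covers the z-score step from the question. A hedged sketch of how to finish, assuming the deduplicated result above is wrapped in a CTE (daily_stats is a hypothetical name) and extended with a STDEV over the same frame as the average:
-- daily_stats is hypothetical: the rn = 1 query above, plus
--   stdev(1.0 * amount) over (partition by acc order by dte
--     rows between unbounded preceding and 1 preceding) as std_amount_per_day
select acc, dte, current_amount,
       (current_amount - avg_amount_per_day)
           / nullif(std_amount_per_day, 0) as z_score  -- nullif avoids divide-by-zero
from daily_stats
where num_of_dates >= 10;  -- the question's 10-day history minimum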

Your sample data and description suggest:
select acc, dte,
       count(*) as num_on_day,
       sum(amount) as sum_on_day,
       -- SQL Server does not support RANGE with a numeric offset; ROWS works here
       -- because the GROUP BY yields exactly one row per (acc, dte)
       avg(sum(amount)) over (partition by acc order by dte
                              rows between unbounded preceding and 1 preceding) as avg_previous
from t
group by acc, dte;
I'm not sure why you don't include the first date for each acc.

Related

percentage per month in BigQuery

I am working in BigQuery and I need the percentages for each result for each month. I have the following query, but the percentage is calculated with respect to the overall total; I have tried to add a PARTITION BY in the OVER clause, but it does not work.
SELECT CAST(TIMESTAMP_TRUNC(CAST((created_at) AS TIMESTAMP), MONTH) AS DATE) AS `month`,
result,
count(*) * 100.0 / sum(count(1)) over() as percentage
FROM table_name
GROUP BY 1,2
ORDER BY 1
month    result  percentage
2021-01  0001    50
2021-01  0000    50
2021-02  00001   33.33
2021-02  0000    33.33
2021-02  0002    33.33
Using the data that you shared as:
WITH data as (
  SELECT "2021-01-01" as created_at, "0001" as result UNION ALL
  SELECT "2021-01-01", "0000" UNION ALL
  SELECT "2021-02-01", "00001" UNION ALL
  SELECT "2021-02-01", "0000" UNION ALL
  SELECT "2021-02-01", "0002"
)
I used a second CTE to deal with the month field, and then used that field in the PARTITION BY while grouping by month and result:
, d as (
  SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS month,
         result, created_at
  from data
)
SELECT d.month,
d.result,
count(*) * 100.0 / sum(count(1)) over(partition by month) as percentage
FROM d
GROUP BY 1, 2
ORDER BY 1
The output matches the expected percentages shown in the question.
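The extra CTE can also be folded away by partitioning on the same expression used in the GROUP BY - a sketch, untested:
SELECT CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE) AS month,
       result,
       COUNT(*) * 100.0
         / SUM(COUNT(*)) OVER (PARTITION BY CAST(TIMESTAMP_TRUNC(CAST(created_at AS TIMESTAMP), MONTH) AS DATE)) AS percentage
FROM table_name
GROUP BY 1, 2
ORDER BY 1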
This example is coded on db<>fiddle for SQL Server, but according to the documentation, google-bigquery also has the COUNT( ~ ) OVER ( PARTITION BY ~ ) construct:
create table table_name(month char(7), result int)
insert into table_name values
('2021-01',50),
('2021-01',30),
('2021-01',20),
('2021-02',70),
('2021-02',80);
select
month,
result,
sum(result) over (partition by month) month_total,
100 * result / sum(result) over (partition by month) per_cent
from table_name
order by month, result;
month | result | month_total | per_cent
:------ | -----: | ----------: | -------:
2021-01 | 20 | 100 | 20
2021-01 | 30 | 100 | 30
2021-01 | 50 | 100 | 50
2021-02 | 70 | 150 | 46
2021-02 | 80 | 150 | 53
db<>fiddle here
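One caveat with the snippet above: in SQL Server, 100 * result / sum(result) is integer division, which is why the 2021-02 rows show 46 and 53 rather than 46.66 and 53.33. A decimal literal avoids the truncation:
select
  month,
  result,
  -- 100.0 forces decimal arithmetic instead of integer division
  100.0 * result / sum(result) over (partition by month) per_cent
from table_name
order by month, result;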

SQL Server : processing by group

I have a table with the following data:
Id Date Value
---------------------------
1 Dec-01-2019 10
1 Dec-03-2019 5
1 Dec-05-2019 8
1 Jan-03-2020 6
1 Jan-07-2020 3
1 Jan-08-2020 9
2 Dec-01-2019 4
2 Dec-03-2019 7
2 Dec-31-2019 9
2 Jan-04-2020 4
2 Jan-09-2020 6
I need to group it into the following format: one record per month per id. If the month is closed, the date will be the last day of that month; if not, the last day available. Max and average are calculated using all data up to that date.
Id Date Max_Value Average_Value
-----------------------------------------------
1 Dec-31-2019 10 7.6
1 Jan-08-2020 10 6.8
2 Dec-31-2019 9 6.6
2 Jan-09-2020 9 6.0
Any easy SQL to obtain this analysis?
Regards,
Hmmm . . . You want to aggregate by month and then just take the maximum date in the month:
select id, max(date), max(value), avg(value * 1.0)
from t
group by id, eomonth(date)
order by id, max(date);
If by closed month you mean that it's not the last month of the id then:
select id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end Date,
max(maxValue) Max_Value,
round(avg(1.0 * Value), 1) Average_Value
from (
select *,
max(Date) over (partition by Id) maxDate,
max(Value) over (partition by Id) maxValue
from tablename
) t
group by id,
case
when year(Date) = year(maxDate) and month(Date) = month(maxDate) then maxDate
else eomonth(Date)
end
order by id, Date
See the demo.
Results:
id | Date       | Max_Value | Average_Value
-: | :--------- | --------: | :------------
 1 | 2019-12-31 |        10 | 7.7
 1 | 2020-01-08 |        10 | 6.0
 2 | 2019-12-31 |         9 | 6.7
 2 | 2020-01-09 |         9 | 5.0
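Note that the queries above compute the average within each month, while the question asks for max and average over all data up to each month's last date. If the cumulative behavior is required, windowed aggregates layered over the monthly groups are one hedged option (table name assumed):
select id,
       max(date) as date,
       -- the default frame with ORDER BY makes these running aggregates per id
       max(max(value)) over (partition by id order by max(date)) as max_value,
       round(sum(sum(1.0 * value)) over (partition by id order by max(date))
             / sum(count(*)) over (partition by id order by max(date)), 1) as average_value
from tablename
group by id, eomonth(date)
order by id, max(date);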

Is it possible to do projection in Google BigQuery?

I have a query (due to restrictions, it is using Legacy SQL) that produces a column that is the rolling average of the last 3 days of sales (excluding today):
SELECT
id, date, sales, AVG(sales) OVER (PARTITION BY id ORDER BY date RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING) AS projected_sale
FROM tableA
tableA
+-------+---------+---------+
| id | date | sales |
+-------+---------+---------+
| 1 | 01-01-17| 5 |
| 1 | 01-02-17| 6 |
| 1 | 01-03-17| 7 |
| 1 | 01-04-17| 10 |
+-------+---------+---------+
The query produces
+-------+---------+---------+--------------+
| id | date | sales |projected_sale|
+-------+---------+---------+--------------+
| 1 | 01-01-17| 5 | . |
| 1 | 01-02-17| 6 | . |
| 1 | 01-03-17| 7 | . |
| 1 | 01-04-17| 10 | 6 |
+-------+---------+---------+--------------+
Since the average excludes the current row, theoretically I can project the sale for 01-05-17 using the sales from 01-02 to 01-04. However, since tableA doesn't actually have an entry with date 01-05-17, my query stops at 01-04-17 as the last row.
Is what I am trying to do possible in Big Query?
Thank you
First, I think using RANGE is incorrect here - it should be ROWS instead
Anyway, below is an example for BigQuery Legacy SQL that demonstrates how to achieve the result you need.
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
As you can see, here you are simply adding (via UNION ALL - a comma in Legacy SQL) that missing day. Of course, you can transform this so that it adds such a missing row for all ids; a sketch of that follows the results below.
Nevertheless - hope this is a good starting point for you.
You can test / play with it using dummy data as in your question
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM (
SELECT * FROM
(SELECT 1 id, '01-01-17' dt, 5 sales),
(SELECT 1 id, '01-02-17' dt, 6 sales),
(SELECT 1 id, '01-03-17' dt, 7 sales),
(SELECT 1 id, '01-04-17' dt, 10 sales)
) tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
with the result:
Row id dt sales projected_sale
1 1 01-01-17 5 null
2 1 01-02-17 6 5.0
3 1 01-03-17 7 5.5
4 1 01-04-17 10 6.0
5 1 01-05-17 0 7.0
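And to add the missing next day for every id rather than hard-coding it, a sketch along the following lines could work, assuming dt is a real TIMESTAMP column rather than the string used in the dummy data above:
#legacySQL
SELECT
  id, dt, sales,
  AVG(sales) OVER (
    PARTITION BY id ORDER BY dt
    ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) AS projected_sale
FROM tableA,
  -- one extra row per id: the day after that id's last date, with zero sales
  (SELECT id, DATE_ADD(MAX(dt), 1, 'DAY') AS dt, 0 AS sales
   FROM tableA GROUP BY id)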

SQL order by two columns, omit if second column doesn't meet the order

Let's say we have the following data:
id | date | price
------------------------
1 | 10-09-2016 | 200
2 | 11-09-2016 | 190
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
5 | 14-09-2016 | 200
6 | 15-09-2016 | 200
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
and we have to order by date first and price second; however, the prices must be in ascending order. If the current price is not greater than the previous row's price, we should omit that row, and the result will be:
id | date | price
------------------------
1 | 10-09-2016 | 200
3 | 12-09-2016 | 210
4 | 13-09-2016 | 220
7 | 16-09-2016 | 230
8 | 17-09-2016 | 240
Is it possible without join?
Use the LAG window function:
SELECT *
FROM (SELECT *,
Lag(price) OVER (ORDER BY date) AS prev_price
FROM Yourtable) a
WHERE price > prev_price
OR prev_price IS NULL -- to get the first record
If "previous" is supposed to mean the previous row in the output, then keep track of a running maximum. Postgres solution with a window function in a subquery:
SELECT id, date, price
FROM (
SELECT *, price >= max(price) OVER (ORDER BY date, price) AS ok
FROM tbl
) sub
WHERE ok;
If Postgres:
select id, date, price
from
(select
t.*,
price - lag(price, 1, price) over (order by id) diff
from
your_table) t
where diff > 0;
If MySQL:
select id, date, price from
(
select t.*,
price - #lastprice diff,
#lastprice := price
from
(select *
from your_table
order by id) t
cross join (select #lastprice := 0) t2
) t where t.diff > 0;
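On MySQL 8+, where user variables in a SELECT are deprecated for this purpose, the same comparison can be written with LAG, mirroring the first answer - a sketch:
select id, date, price
from
(
  select t.*,
         -- previous row's price in id order; NULL for the first row
         lag(price) over (order by id) lastprice
  from your_table t
) t
where lastprice is null or price > lastprice;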

Querying DAU/MAU over time (daily)

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
Calculating daily active users is straightforward, but calculating monthly active users (the number of distinct users who logged in between 30 days ago and today) is causing problems. How is this achieved without a left join for each day?
Edit: I'm using Postgres.
Assuming you have values for each day, you can get the total counts using a subquery and a rolling window frame:
with dau as (
     select date, count(user_id) as dau
     from dailysessions ds
     group by date
    )
select date, dau,
       sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
     select date, count(distinct user_id) as dau
     from dailysessions ds
     group by date
    )
select d.date, d.dau,
       (select count(distinct ds.user_id)
        from dailysessions ds
        -- qualify the outer date, otherwise it resolves to ds.date
        where ds.date between d.date - 29 * interval '1 day' and d.date
       ) as mau
from dau d;
This one uses COUNT DISTINCT to get the rolling 30 days DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)
I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).
I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w
, d."date", a."activated", a2."churned", a."resurrected" FROM
generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) d
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.
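For reference, a hedged sketch of that single-query form, folding the two views into CTEs and using integer flags in place of the booleans (same logic as above):
WITH logins AS (
  SELECT *
  , CASE WHEN LAG("date") OVER w IS NULL THEN 1 ELSE 0 END AS activated
  , CASE WHEN LEAD("date") OVER w IS NULL
           OR LEAD("date") OVER w - "date" > 30 THEN 1 ELSE 0 END AS churned
  , CASE WHEN LAG("date") OVER w IS NOT NULL
          AND "date" - LAG("date") OVER w > 30 THEN 1 ELSE 0 END AS resurrected
  FROM "login"
  WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
), activity AS (
  SELECT "date"
  , SUM(activated) AS activated
  , SUM(churned) AS churned
  , SUM(resurrected) AS resurrected
  FROM logins
  GROUP BY "date"
)
SELECT d."date"
-- running total: users who activated or resurrected, minus those who churned
, SUM(COALESCE(a.activated, 0)
    - COALESCE(a2.churned, 0)
    + COALESCE(a.resurrected, 0)) OVER (ORDER BY d."date") AS mau
FROM generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) d("date")
LEFT OUTER JOIN activity a ON d."date" = a."date"
LEFT OUTER JOIN activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
ORDER BY d."date";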
You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>