Rolling 3-day average transaction amount for each day - SQL

I'm trying to get the rolling 3-day average transaction amount for each day. I first grouped my data by day, casting the timestamp to a date:

select
    cast(transaction_time as date) as date,
    sum(transaction_amount) as total_transaction_amount
from transactions
group by cast(transaction_time as date)
order by cast(transaction_time as date)

Now I want to get the rolling 3-day average:

select *,
    avg(transaction_amount) over (
        order by transaction_time
        rows between 2 preceding and current row
    ) as moving_average
from transactions;

but I don't know how to make the two statements work together. Any ideas?

You've basically done all the hard work; you just need to stick the two together, and a CTE is great for this. Note that the ORDER BY moves to the outer query, since some databases (e.g. SQL Server) won't accept it inside a CTE:

with transactions_by_day as (
    select
        cast(transaction_time as date) as date,
        sum(transaction_amount) as total_transaction_amount
    from transactions
    group by cast(transaction_time as date)
)
select *,
    avg(total_transaction_amount) over (
        order by date
        rows between 2 preceding and current row
    ) as moving_average
from transactions_by_day
order by date;
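If you'd rather not use a CTE, the same grouped query can go in a derived table instead; this is just a sketch equivalent to the query above:

select d.*,
    avg(d.total_transaction_amount) over (
        order by d.date
        rows between 2 preceding and current row
    ) as moving_average
from (
    select
        cast(transaction_time as date) as date,
        sum(transaction_amount) as total_transaction_amount
    from transactions
    group by cast(transaction_time as date)
) d
order by d.date;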

Related

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in the trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
Edit: Additional data below. Data collection begins on 11/23; no data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions:

select date_trunc('week', event_date) as week_start,
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           order by date_trunc('week', event_date)
           -- interval offsets in RANGE frames need PostgreSQL 11+; Redshift does not support them
           range between interval '3 weeks' preceding and current row
       ) as rolling_cnt
from final
group by 1
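If you also need to break the figures out by additional dimensions (you mention country and subscription; those column names are assumptions here), add them to the grouping and to the window's partition by, roughly like this:

select date_trunc('week', event_date) as week_start,
       country,        -- assumed column name
       subscription,   -- assumed column name
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           partition by country, subscription
           order by date_trunc('week', event_date)
           range between interval '3 weeks' preceding and current row
       ) as rolling_cnt
from final
group by 1, 2, 3

That way each country/subscription combination gets its own rolling figure instead of everything being added into one running total.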
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:

select t.date,
       count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
       count(distinct tprev.customer_id) as trailing_30
from t join
     t tprev
     on tprev.date >= t.date - interval '29 day' and
        tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
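For example, one way to keep a single row per week is to filter on the day of week (a sketch, keeping just the 30-day count for brevity; extract(dow ...) works on both Redshift and PostgreSQL, and which weekday you keep is an assumption):

with daily as (
    select t.date,
           count(distinct tprev.customer_id) as trailing_30
    from t join
         t tprev
         on tprev.date >= t.date - interval '29 day' and
            tprev.date <= t.date
    group by t.date
)
select date, trailing_30
from daily
where extract(dow from date) = 0   -- 0 = Sunday; pick the weekday your reporting weeks end on
order by date;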
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and exit the time periods in which they are counted. This is a pain with two different time frames; here is what it looks like for one.
The idea is to:
1. Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
2. Summarize these into periods of activity for each customer, so there is one record with an enter and an exit date. This is a type of gaps-and-islands problem.
3. Unpivot this result to count +1 when a customer starts being counted and -1 when they stop.
4. Do a cumulative sum of this count.
The code looks something like this:
with cd as (
      -- one row per customer per date, with a running count of "active" increments:
      -- +1 when a click starts counting, -1 seven days later when it stops
      select customer_id, date,
             sum(sum(inc)) over (partition by customer_id order by date) as cnt
      from ((select t.customer_id, t.date, 1 as inc
             from t
            ) union all
            (select t.customer_id, t.date + interval '7 day', -1 as inc
             from t
            )
           ) tt
      group by customer_id, date
     ),
cd2 as (
      -- collapse each customer's consecutive active stretch into one enter/exit period
      -- (gaps-and-islands: a new island starts whenever the previous running count was 0)
      select customer_id, min(date) as enter_date, max(date) as exit_date
      from (select cd.*,
                   sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
            from (select cd.*,
                         lag(cnt) over (partition by customer_id order by date) as prev_cnt
                  from cd
                 ) cd
           ) cd
      group by customer_id, grp
      having max(cnt) > 0
     )
-- unpivot the periods to +1/-1 events and take a cumulative sum per date
select dte, sum(sum(inc)) over (order by dte) as active_customers
from ((select customer_id, enter_date as dte, 1 as inc
       from cd2
      ) union all
      (select customer_id, exit_date as dte, -1 as inc
       from cd2
      )
     ) cd2
group by dte;

PostgreSQL subquery - calculating average of lagged values

I am looking at Sales Rates by month, and was able to query the 1st table. I am quite new to PostgreSQL and am trying to figure out how I can query the second (for now I had to build the 2nd one in Excel).
I have the current Sales Rate and I would like to compare it to the Sales Rate 1 and 2 months ago, as an averaged rate.
I am not asking for the exact solution, since that is not the point of getting better, just for hints about which PostgreSQL-specific functions to use. What I am trying to calculate is the 2-month average in the 2nd table, based on the lagged values of the 2nd table. Thanks!
Here is the query for the 1st table:
with t1 as
  (select date,
          count(sales)::numeric/count(poss_sales) as SR_1M_before
   from data
   where date between '2019-07-01' and '2019-11-30'
   group by 1),
t2 as
  (select date,
          count(sales)::numeric/count(poss_sales) as SR_2M_before
   from data
   where date between '2019-07-01' and '2019-10-31'
   group by 1)
select t0.date,
       count(t0.sales)::numeric/count(t0.poss_sales) as Sales_Rate,
       t1.SR_1M_before,
       t2.SR_2M_before
from data as t0
left join t1 on t0.date = t1.date
left join t2 on t0.date = t2.date
where t0.date between '2019-07-01' and '2019-12-31'
group by 1, 3, 4
order by 1;
As commented by a_horse_with_no_name, you can use window functions to take the average of the two previous months with a range clause:

select
    date,
    count(sales)::numeric/count(poss_sales) as Sales_Rate,
    avg(count(sales)::numeric/count(poss_sales)) over(
        order by date
        range between interval '2 month' preceding and interval '1 month' preceding
    ) as Avg_Sales_Rate_2M,
    count(sales)::numeric/count(poss_sales)
      - avg(count(sales)::numeric/count(poss_sales)) over(
            order by date
            range between interval '2 month' preceding and interval '1 month' preceding
        ) as PercentDeviation
from data
where date between '2019-07-01' and '2019-12-31'
group by date
order by date;

Note that RANGE frames with interval offsets require PostgreSQL 11 or later.
Your data is a bit confusing -- it would be less confusing if you had decimal places (that is, 58% being the average of 57% and 58% is not obvious).
Because you want to have NULL values on the first two rows, I'm going to calculate the values using sum() and count():
with q as (
      <whatever generates the data you have shown>
     )
select q.*,
       (sum(sales_rate) over (order by date
                              rows between 2 preceding and 1 preceding
                             ) /
        nullif(count(*) over (order by date
                              rows between 2 preceding and 1 preceding
                             ), 0)
       ) as two_month_average
from q;
You could also express this using case and avg():
select q.*,
       (case when row_number() over (order by date) > 2
             then avg(sales_rate) over (order by date
                                        rows between 2 preceding and 1 preceding
                                       )
        end) as two_month_average
from q;

SQL - Rolling avg over truncated date

I want to compute a rolling mean of a calculated field on a weekly basis, from data whose timestamps have second-level precision. That is why I first truncate the date to the week.
So my provisional query is
SELECT week, AVG(my_value) OVER(ORDER BY week ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_my_value
FROM
(SELECT id,
DATE_TRUNC('week', created_at) AS week,
my_value
FROM my_table
ORDER BY week ASC
)
GROUP BY week
The problem is that the AVG works, but it is computed separately for each set of rows that share the same week! I think some sort of inner grouping is needed, but I can't work out how to express it for an average.
If it matters, I am looking for a solution that works on Redshift or PostgreSQL.
If you want a cumulative average, then:
SELECT week,
AVG(AVG(my_value)) OVER (ORDER BY week ASC) AS avg_my_value
FROM (SELECT id, DATE_TRUNC('week', created_at) AS week, my_value
FROM my_table
) t
GROUP BY week;
Notes:
The ORDER BY in the subquery is superfluous.
Note the nesting of the aggregation functions.
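If what you want is a rolling mean over a fixed number of weeks rather than a cumulative one (say the current week plus the three preceding weeks; the window size here is just an example), you can bound the frame:

SELECT week,
       AVG(AVG(my_value)) OVER (
           ORDER BY week ASC
           ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
       ) AS avg_my_value
FROM (SELECT id, DATE_TRUNC('week', created_at) AS week, my_value
      FROM my_table
     ) t
GROUP BY week;

Like the cumulative version, this averages the weekly averages, so weeks with many rows and weeks with few rows carry equal weight.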

Calculate MAX for value over a relative date range

I am trying to calculate the max of a value over a relative date range. Suppose I have these columns: Date, Week, Category, Value. Note: The Week column is the Monday of the week of the corresponding Date.
I want to produce a table which gives the MAX value within the last two weeks for each Date, Week, Category combination so that the output produces the following: Date, Week, Category, Value, 2WeeksPriorMAX.
How would I go about writing that query? I don't think the following would work:
SELECT Date, Week, Value,
MAX(Value) OVER (PARTITION BY Category
ORDER BY Week
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as 2WeeksPriorMAX
The above query doesn't account for cases where there are missing rows for a given Category, Week combination within the last 2 weeks, and therefore it would reach further back than 2 weeks when it looks at the 2 preceding rows.
Left joining or using a lateral join/subquery might be expensive. You can do this with window functions, but you need to have a bit more logic:
select t.*,
       (case -- only the current row falls within the last 2 weeks
             when lag(date, 1) over (partition by category order by date) < date - interval '2 week'
             then value
             -- only the current and previous rows fall within the last 2 weeks
             when lag(date, 2) over (partition by category order by date) < date - interval '2 week'
             then max(value) over (partition by category order by date rows between 1 preceding and current row)
             -- all three most recent rows fall within the last 2 weeks
             else max(value) over (partition by category order by date rows between 2 preceding and current row)
        end) as TwoWeekMax
from t;
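If your database supports RANGE frames with interval offsets (PostgreSQL 11+ does; Redshift does not, and which platform you are on is an assumption here), the missing-week problem goes away and this can be written directly, as a sketch:

select t.*,
       max(value) over (
           partition by category
           order by date
           range between interval '13 days' preceding and current row
       ) as TwoWeekMax   -- a 14-day window ending at the current row's date
from t;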

How can I select one row for each week in a date range that spans more than a year?

In my postgreSQL data base, I have a table with columns of dates and prices. ('transdate' and 'price')
I would like to form a query which selects one row for each week over a date range which spans more than one year.
From another question/answer here, I implemented this code which works for date ranges of less than a year:
with cte as
(
  select *,
         row_number() over (partition by extract(week from transdate) order by transdate desc) as rn
  from "tablename"
  where transdate between '06-01-1999' and '06-01-1999'::timestamp + '50 week'::interval
)
select transdate, price from cte where rn = 1 order by transdate;
However, when I extend the interval beyond 50 weeks, it still only selects a maximum of 12 months' worth of rows.
How can I re-write this code to select one date/price from every week in the range?
Your problem is that week numbers wrap around at year boundaries but you want to look at the week number and the year at the same time. Lucky for you, you can PARTITION BY several things at once:
row_number() over (
partition by extract(week from transdate),
extract(year from transdate)
order by transdate desc
) as rn
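Plugged back into your query, that looks roughly like this (a sketch that keeps the rest of your query as-is; the 150-week interval is just an example of a range longer than a year):

with cte as
(
  select *,
         row_number() over (
             partition by extract(week from transdate),
                          extract(year from transdate)
             order by transdate desc
         ) as rn
  from "tablename"
  where transdate between '06-01-1999' and '06-01-1999'::timestamp + '150 week'::interval
)
select transdate, price from cte where rn = 1 order by transdate;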