Calculate a 3-month moving average from non-aggregated data - sql

I have a bunch of orders. Each order is either a type A or type B order. I want a 3-month moving average of time it takes to ship orders of each type. How can I aggregate this order data into what I want using Redshift or Postgres SQL?
Start with this:
order_id | order_type | ship_date  | time_to_ship
---------+------------+------------+-------------
1        | a          | 2021-12-25 | 100
2        | b          | 2021-12-31 | 110
3        | a          | 2022-01-01 | 200
4        | a          | 2022-01-01 | 50
5        | b          | 2022-01-15 | 110
6        | a          | 2022-02-02 | 100
7        | a          | 2022-02-28 | 300
8        | b          | 2022-04-05 | 75
9        | b          | 2022-04-06 | 210
10       | a          | 2022-04-15 | 150
Note: Some months have no shipments. The solution should allow for this.
I want this:
order_type | ship_month | mma3_time_to_ship
-----------+------------+------------------
a          | 2022-02-01 | 150
a          | 2022-04-01 | 160
b          | 2022-04-01 | 126.25
Where a 3-month moving average is only calculated for months with at least 2 preceding months, each record is an order type-month, and the ship_month column denotes the month of shipment (Redshift represents months as the date of the first of the month).
Here's how the mma3_time_to_ship column is calculated, expressed as Excel-like formulas:
150 = AVERAGE(100, 200, 50, 100, 300) <- The average for all A orders in Dec, Jan, and Feb.
160 = AVERAGE(200, 50, 100, 300, 150) <- The average for all A orders in Jan, Feb, Apr (no orders in March)
126.25 = AVERAGE(110, 110, 75, 210) <- The average for all B orders in Dec, Jan, Apr (no B orders in Feb, no orders at all in Mar)
My attempt doesn't aggregate it into monthly data and 3-month averages (this query runs without error in Redshift):
SELECT
    order_type,
    DATE_TRUNC('month', ship_date) AS ship_month,
    AVG(time_to_ship) OVER (
        PARTITION BY
            order_type,
            ship_month
        ORDER BY ship_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS avg_time_to_ship
FROM tbl
Is what I want possible?

This is honestly a complete stab in the dark, so it won't surprise me if it's not correct... but it seems to me you can accomplish this with a self join using a range of dates within the join.
select
    t1.order_type, t1.ship_date, avg(t2.time_to_ship) as mma3_time_to_ship
from
    tbl t1
    join tbl t2 on
        t1.order_type = t2.order_type and
        t2.ship_date between t1.ship_date - interval '3 months' and t1.ship_date
group by
    t1.order_type, t1.ship_date
The results don't match your example, but then I'm not entirely sure where they came from anyway.
Perhaps this will be the catalyst towards an eventual solution or at least an idea to start.
This is Pg12, by the way. Not sure if it will work on Redshift.
-- EDIT --
Per your updates, I was able to match your three results identically. I used dense_rank to find the closest three months:
with foo as (
    select
        order_type, date_trunc('month', ship_date)::date as ship_month,
        time_to_ship,
        dense_rank() over (partition by order_type
                           order by date_trunc('month', ship_date)) as dr
    from tbl
)
select
    f1.order_type, f1.ship_month,
    avg(f2.time_to_ship),
    array_agg(f2.time_to_ship)
from
    foo f1
    join foo f2 on
        f1.order_type = f2.order_type and
        f2.dr between f1.dr - 2 and f1.dr
group by
    f1.order_type, f1.ship_month
Results:
b 2022-01-01 110.0000000000000000 {110,110}
a 2022-01-01 116.6666666666666667 {100,50,200,100,50,200}
b 2022-04-01 126.2500000000000000 {110,110,75,210,110,110,75,210}
b 2021-12-01 110.0000000000000000 {110}
a 2021-12-01 100.0000000000000000 {100}
a 2022-02-01 150.0000000000000000 {100,50,200,100,300,100,50,200,100,300}
a 2022-04-01 160.0000000000000000 {50,200,100,300,150}
There are some dupes in the array elements, but they don't affect the averages: each f2 value repeats once per f1 order in the month, so every value is duplicated the same number of times. I'm sure that part could be fixed.
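For what it's worth, here is a minimal sketch of one way to drop the dupes (my guess, untested): pre-aggregate to one row per order type and month, carry the sum and count through the self join, and take the weighted average at the end. Table and column names follow the question; the m1.dr >= 3 filter is my assumption implementing the "at least 2 preceding months" rule.
with monthly as (
    select
        order_type,
        date_trunc('month', ship_date)::date as ship_month,
        sum(time_to_ship) as sum_tts,  -- one row per (type, month), so no dupes downstream
        count(*) as n_orders,
        dense_rank() over (partition by order_type
                           order by date_trunc('month', ship_date)::date) as dr
    from tbl
    group by order_type, date_trunc('month', ship_date)::date
)
select
    m1.order_type,
    m1.ship_month,
    sum(m2.sum_tts)::numeric / sum(m2.n_orders) as mma3_time_to_ship
from
    monthly m1
    join monthly m2 on
        m1.order_type = m2.order_type and
        m2.dr between m1.dr - 2 and m1.dr
where m1.dr >= 3  -- only months with at least 2 preceding shipment months
group by m1.order_type, m1.ship_month;
Against the sample data this should reproduce the three expected rows (150, 160, 126.25), since a weighted average over month sums equals the plain average over the underlying orders.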


Unpivoting for large dataset and greater number of unique columns

The pivot and unpivot functions in Snowflake are not efficient for converting 30+ unique columns into rows.
Use case: I have 35 different month columns which need to become rows, and another 35 columns holding the quantity for the corresponding month.
So in the end the 70 unique columns will become 2 columns (one for the month data and another for the quantity),
with the quantity aggregated by month.
But unpivoting is not at all efficient. The below query scans 15 GB of data from the main table:
select part_num,
    concat(date_part(year, dates), '-', date_part(month, dates)) as month_year,
    sum(quantity) as quantities
from table_name
    unpivot(dates for cols in (month_1, 30 other unique cols)),
    unpivot(quantity for cols in (qunatity_1, 30 other unique cols)),
group by part_num, month_year
Is there any other approach to unpivot a large dataset?
Thanks
An alternative approach could be using conditional aggregation:
with cte as (
    select part_num
        ,concat(date_part(year, dates), '-', date_part(month, dates)) as month_year
        ,sum(quantity) as quantities
    from table_name
    group by part_num, month_year
)
SELECT part_num
    -- lowest date
    ,'2020-01' AS "2020-01"
    ,MAX(IFF(month_year = '2020-01', quantities, NULL)) AS "quantities_2020-01"
    -- next date
    ,...
    -- last date
    ,'2022-04' AS "2022-04"
    ,MAX(IFF(month_year = '2022-04', quantities, NULL)) AS "quantities_2022-04"
FROM cte
GROUP BY part_num;
Version using single GROUP BY and TO_VARCHAR with format:
SELECT part_num
    -- lowest date
    ,MAX(IFF(TO_VARCHAR(dates, 'YYYY-MM') = '2020-01', '2020-01', NULL)) AS "2020-01"
    ,MAX(IFF(TO_VARCHAR(dates, 'YYYY-MM') = '2020-01', quantities, NULL)) AS "quantities_2020-01"
    -- next date
    ,...
    -- last date
    ,MAX(IFF(TO_VARCHAR(dates, 'YYYY-MM') = '2022-04', '2022-04', NULL)) AS "2022-04"
    ,MAX(IFF(TO_VARCHAR(dates, 'YYYY-MM') = '2022-04', quantities, NULL)) AS "quantities_2022-04"
FROM table_name
GROUP BY part_num;
So if we get some example data we can test whether what is happening is what is wanted.
Here is a trivial and tiny CTE worth of data:
with table_name(part_num, month_1, month_2, month_3, qunatity_1, qunatity_2, qunatity_3) as (
select * from values
(1, '2022-01-01'::date, '2022-02-01'::date, '2022-03-01'::date, 4, 5, 6)
)
Now pointing your SQL at it (after making it compile):
select
    part_num
    ,to_char(dates, 'yyyy-mm') as month_year
    ,sum(quantity) as quantities
from table_name
    unpivot(dates for month in (month_1, month_2, month_3))
    unpivot(quantity for quan in (qunatity_1, qunatity_2, qunatity_3))
group by part_num, month_year
gives:
PART_NUM | MONTH_YEAR | QUANTITIES
---------+------------+-----------
1        | 2022-01    | 15
1        | 2022-02    | 15
1        | 2022-03    | 15
which is not what I think you are after.
If we look at the unaggregated rows:
PART_NUM | MONTH   | DATES      | QUAN       | QUANTITY
---------+---------+------------+------------+---------
1        | MONTH_1 | 2022-01-01 | QUNATITY_1 | 4
1        | MONTH_1 | 2022-01-01 | QUNATITY_2 | 5
1        | MONTH_1 | 2022-01-01 | QUNATITY_3 | 6
1        | MONTH_2 | 2022-02-01 | QUNATITY_1 | 4
1        | MONTH_2 | 2022-02-01 | QUNATITY_2 | 5
1        | MONTH_2 | 2022-02-01 | QUNATITY_3 | 6
1        | MONTH_3 | 2022-03-01 | QUNATITY_1 | 4
1        | MONTH_3 | 2022-03-01 | QUNATITY_2 | 5
1        | MONTH_3 | 2022-03-01 | QUNATITY_3 | 6
we are getting a cross join, which is not what I believe you are wanting.
My understanding is you want a pairwise relationship between month (1-35) and quantity (1-35),
thus a mix like:
PART_NUM | MONTH   | DATES      | QUAN       | QUANTITY
---------+---------+------------+------------+---------
1        | MONTH_1 | 2022-01-01 | QUNATITY_1 | 4
1        | MONTH_2 | 2022-02-01 | QUNATITY_2 | 5
1        | MONTH_3 | 2022-03-01 | QUNATITY_3 | 6
Guessed Answer:
My guess at what you really are wanting is:
select
    part_num
    ,to_char(dates, 'yyyy-mm') as month_year
    ,array_construct(qunatity_1, qunatity_2, qunatity_3)[split_part(month, '_', 2)::number - 1] as qunatity
from table_name
    unpivot(dates for month in (month_1, month_2, month_3))
order by 1, 2;
which gives (for the same above CTE data):
PART_NUM | MONTH_YEAR | QUNATITY
---------+------------+---------
1        | 2022-01    | 4
1        | 2022-02    | 5
1        | 2022-03    | 6
Another way to get that guessed answer:
select
    part_num
    ,to_char(dates, 'yyyy-mm') as month_year
    ,sum(iff(split_part(month, '_', 2) = split_part(q_name, '_', 2), q_val, null)) as qunatity
from table_name
    unpivot(dates for month in (month_1, month_2, month_3))
    unpivot(q_val for q_name in (qunatity_1, qunatity_2, qunatity_3))
group by 1, 2
order by 1, 2;
This uses the double unpivot, so it might be slow, but it only aggregates the values whose suffixes match. It feels almost as gross as building an array just to rip it apart, but this version doesn't need to do large joins, just some per-row grossness.
Assuming your data is already aggregated at part_num level, you could divide and conquer like this:
with year_month as (
    select a.part_num, b.index + 1 as month_num, left(b.value, 7) as year_month
    from my_table a, table(flatten(input => array_construct(m1, m2, m3...))) b
),
quantities as (
    select a.part_num, b.index + 1 as month_num, b.value::int as quantity
    from my_table a, table(flatten(input => array_construct(q1, q2, q3...))) b
)
select a.part_num, a.year_month, b.quantity
from year_month a
join quantities b on a.part_num = b.part_num and a.month_num = b.month_num
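For instance, applied to the trivial three-column CTE from the earlier answer, a minimal sketch of this approach might look like the following (my assumption, untested; left(b.value, 7) relies on the variant dates rendering as 'YYYY-MM-DD' strings):
with table_name(part_num, month_1, month_2, month_3, qunatity_1, qunatity_2, qunatity_3) as (
    select * from values
    (1, '2022-01-01'::date, '2022-02-01'::date, '2022-03-01'::date, 4, 5, 6)
),
year_month as (
    -- one row per part_num and month position
    select a.part_num, b.index + 1 as month_num, left(b.value, 7) as year_month
    from table_name a, table(flatten(input => array_construct(month_1, month_2, month_3))) b
),
quantities as (
    -- one row per part_num and quantity position
    select a.part_num, b.index + 1 as month_num, b.value::int as quantity
    from table_name a, table(flatten(input => array_construct(qunatity_1, qunatity_2, qunatity_3))) b
)
select a.part_num, a.year_month, b.quantity
from year_month a
join quantities b on a.part_num = b.part_num and a.month_num = b.month_num
order by 1, 2;
This should give the same three rows as the guessed answer (4, 5, 6 against 2022-01 through 2022-03), with the join being 1:1 on (part_num, month_num).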

How to calculate the average monthly number of some action in some period in Teradata SQL?

I have a table in Teradata SQL like below:
ID  | trans_date
----+-----------
123 | 2021-01-01
887 | 2021-01-15
123 | 2021-02-10
45 | 2021-03-11
789 | 2021-10-01
45 | 2021-09-02
And I need to calculate the average monthly number of transactions made by customers in the period between 2021-01-01 and 2021-09-01, so the client with "ID" = 789 will not be counted because they made their transaction later.
In the first month (01) there were 2 transactions
In the second month there was 1 transaction
In the third month there was 1 transaction
In the ninth month there was 1 transaction
So the result should be (2+1+1+1) / 4 = 1.25, isn't it?
How can I calculate this in Teradata SQL? Of course, this is only a sample of my data.
SELECT ID, AVG(txns) FROM
    (SELECT ID, TRUNC(trans_date, 'MON') as mth, COUNT(*) as txns
     FROM mytable
     -- WHERE condition matches the question but likely want to
     -- use end date 2021-09-30 or use mth instead of trans_date
     WHERE trans_date BETWEEN DATE '2021-01-01' AND DATE '2021-09-01'
     GROUP BY id, mth) mth_txn
GROUP BY id;
Your logic translated to SQL:
--(2+1+1+1) / 4
SELECT id,
       -- CAST avoids Teradata integer division truncating 1.25 down to 1
       CAST(COUNT(*) AS DECIMAL(10,2)) / COUNT(DISTINCT TRUNC(trans_date, 'MON')) AS avg_tx
FROM mytable
WHERE trans_date BETWEEN DATE '2021-01-01' AND DATE '2021-09-01'
GROUP BY id;
You should compare to Fred's answer to see which is more efficient on your data.

Count distinct customers who bought in previous period and not in next period Bigquery

I have a dataset in BigQuery which contains order_date: DATE and customer_id.
order_date | CustomerID
2019-01-01 | 111
2019-02-01 | 112
2020-01-01 | 111
2020-02-01 | 113
2021-01-01 | 115
2021-02-01 | 119
I am trying to count distinct customer_id between each month of the previous year and the same month of the current year, for example from 2019-01-01 to 2020-01-01, then from 2019-02-01 to 2020-02-01, and then count those who did not buy in the same period of the next year (2020-01-01 to 2021-01-01, then 2020-02-01 to 2021-02-01).
The output I expect:
order_date | count distinct CustomerID | who did not buy in the next period
2020-01-01 | 5191                      | 250
2020-02-01 | 4859                      | 500
2020-03-01 | 3567                      | 349
.......... | ....                      | ......
and the next periods shouldn't include the previous.
I tried the code below, but it works in another way:
with customers as (
    select distinct
        date_trunc(date(order_date), month) as dates,
        CUSTOMER_WID
    from t
    where date(order_date) between '2018-01-01' and current_date() - 1
)
select
    dates,
    customers_previous,
    customers_next_period
from (
    select
        dates,
        count(CUSTOMER_WID) as customers_previous,
        count(case when customer_wid_next is null then 1 end) as customers_next_period
    from (
        select
            prev.dates,
            prev.CUSTOMER_WID,
            next.dates as next_dates,
            next.CUSTOMER_WID as customer_wid_next
        from customers as prev
        left join customers as next
            on next.dates = date_add(prev.dates, interval 1 year)
            and prev.CUSTOMER_WID = next.CUSTOMER_WID
    ) as t2
    group by dates
)
order by 1, 2
Thanks in advance.
If I understand correctly, you are trying to count values over a window of time, and for that I recommend using window functions (see the docs here, and here is a great article explaining how they work).
That said, my recommendation would be:
SELECT DISTINCT
    periods,
    COUNT(DISTINCT customer_id) OVER last_12mos AS count_customers_last_12_mos
FROM (
    SELECT
        order_date,
        FORMAT_DATE('%Y%m', order_date) AS periods,
        customer_id
    FROM dataset
)
WINDOW last_12mos AS ( # window of last 12 months without current month
    PARTITION BY periods ORDER BY periods DESC
    ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
)
I believe from this you can build some customizations to improve the aggregations you want.
You can generate the periods using unnest(generate_date_array()). Then use joins to bring in the customers from the previous 12 months and the next 12 months. Finally, aggregate and count the customers:
select period,
       count(distinct c_prev.customer_wid),
       count(distinct c_next.customer_wid)
from unnest(generate_date_array(date '2020-01-01', date '2021-01-01', interval 1 month)) period
join customers c_prev
    on c_prev.order_date <= period and
       c_prev.order_date > date_add(period, interval -12 month)
left join customers c_next
    on c_next.customer_wid = c_prev.customer_wid and
       c_next.order_date > period and
       c_next.order_date <= date_add(period, interval 12 month)
group by period;

How to LEFT JOIN on ROW_NUM using WITH

Right now I'm in the testing phase of this query, so I'm only testing it on two queries. I've gotten stuck on the final part where I want to left join everything (this will have to be extended to 12 separate queries). The problem is basically as the title suggests: I want to join 12 queries on the created Row_Num column using a WITH statement, instead of creating 12 separate tables and saving them in a database.
WITH Jan_Table AS
    (SELECT ROW_NUMBER() OVER (ORDER BY a.SALE_DATE) as Row_ID, a.SALE_DATE, sum(a.revenue) as Jan_Rev
     FROM ba.SALE_TABLE a
     WHERE a.SALE_DATE BETWEEN '2015-01-01' and '2015-01-31'
     GROUP BY a.SALE_DATE)
SELECT ROW_NUMBER() OVER (ORDER BY a.SALE_DATE) as Row_ID, a.SALE_DATE, sum(a.revenue) as Jun_Rev, j.Jan_Rev
FROM ba.SALE_TABLE a
LEFT JOIN Jan_Table j
    on "j.Row_ID" = a.Row_ID
WHERE a.SALE_DATE BETWEEN '2015-06-01' and '2015-06-30'
GROUP BY a.SALE_DATE
And then I get this error message:
ERROR: column "j.Row_ID" does not exist
I put in the "j.Row_ID" because the previous message was:
ERROR: column a.row_id does not exist Hint: Perhaps you meant to
reference the column "j.row_id".
Each query works individually without the JOIN and WITH functions. I have one for every month of the year and want to join 12 of these together eventually.
The output should be a single column with ROW_NUM and 12 Monthly Revenues columns. Each row should be a day of the month. I know not every month has 31 days. So, for example, Feb only has 28 days, meaning I'd want days 29, 30, and 31 as NULLs. The query above still has the dates--but I will remove the "SALE_DATE" column after I can just get these two queries to join.
My initial thought was just to create 12 tables, but I think that'd be a really bad use of space and not the most logical solution to this problem if I were to extend it.
edit
Below are the separate outputs of the two queries above, and the third table is what I'm trying to make. I can't give you the raw data. Everything above has been altered from the actual column names and purposes of the data that I'm using. And I don't know how to create a dataset--that's above my head in SQL.
Jan_Table (first five lines)
Row_Num | Date       | Jan_Rev
1       | 2015-01-01 | 20
2       | 2015-01-02 | 20
3       | 2015-01-03 | 20
4       | 2015-01-04 | 20
5       | 2015-01-05 | 20
Jun_Table (first five lines)
Row_Num | Date       | Jun_Rev
1       | 2015-06-01 | 30
2       | 2015-06-02 | 30
3       | 2015-06-03 | 30
4       | 2015-06-04 | 30
5       | 2015-06-05 | 30
JOINED_TABLE (first five lines)
Row_Num | Date       | Jun_Rev | Date       | Jan_Rev
1       | 2015-06-01 | 30      | 2015-01-01 | 20
2       | 2015-06-02 | 30      | 2015-01-02 | 20
3       | 2015-06-03 | 30      | 2015-01-03 | 20
4       | 2015-06-04 | 30      | 2015-01-04 | 20
5       | 2015-06-05 | 30      | 2015-01-05 | 20
It seems like you can just use group by and conditional aggregation for your full query:
select day(sale_date),
       max(case when month(sale_date) = 1 then sale_date end) as jan_date,
       max(case when month(sale_date) = 1 then revenue end) as jan_revenue,
       max(case when month(sale_date) = 2 then sale_date end) as feb_date,
       max(case when month(sale_date) = 2 then revenue end) as feb_revenue,
       . . .
from sale_table s
group by day(sale_date)
order by day(sale_date);
You haven't specified the database you are using. DAY() is a common function to get the day of the month; MONTH() is a common function to get the months of the year. However, those particular functions might be different in your database.
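For example, a minimal Postgres sketch of the same idea (my assumption, since Postgres has no DAY()/MONTH(); it uses EXTRACT instead, keeps the question's sum(revenue) aggregation, and reuses the ba.SALE_TABLE name from the question):
SELECT EXTRACT(DAY FROM sale_date) AS day_of_month,
       MAX(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN sale_date END) AS jan_date,
       SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN revenue END) AS jan_rev,
       MAX(CASE WHEN EXTRACT(MONTH FROM sale_date) = 6 THEN sale_date END) AS jun_date,
       SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 6 THEN revenue END) AS jun_rev
       -- ...repeat the pair for the other ten months
FROM ba.SALE_TABLE
WHERE sale_date BETWEEN '2015-01-01' AND '2015-12-31'
GROUP BY EXTRACT(DAY FROM sale_date)
ORDER BY day_of_month;
Days that don't exist in a month (e.g. days 29-31 for February 2015) simply produce NULLs in that month's columns, which matches the desired output.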

SQL How to calculate Average time between Order Purchases? (do sql calculations based on next and previous row)

I have a simple table that contains the customer email, their order count (so if this is their 1st order, 3rd, 5th, etc), the date that order was created, the value of that order, and the total order count for that customer.
Here is what my table looks like
Email           | Order | Date       | Value | Total
r2n1w#gmail.com | 1     | 12/1/2016  | 85    | 5
r2n1w#gmail.com | 2     | 2/6/2017   | 125   | 5
r2n1w#gmail.com | 3     | 2/17/2017  | 75    | 5
r2n1w#gmail.com | 4     | 3/2/2017   | 65    | 5
r2n1w#gmail.com | 5     | 3/20/2017  | 130   | 5
ation#gmail.com | 1     | 2/12/2018  | 150   | 1
ylove#gmail.com | 1     | 6/15/2018  | 36    | 3
ylove#gmail.com | 2     | 7/16/2018  | 41    | 3
ylove#gmail.com | 3     | 1/21/2019  | 140   | 3
keria#gmail.com | 1     | 8/10/2018  | 54    | 2
keria#gmail.com | 2     | 11/16/2018 | 65    | 2
What I want to do is calculate the average time between purchases for each customer. So let's take customer ylove. The first purchase is on 6/15/18. The next one is on 7/16/18, so that's 31 days, and the next purchase is on 1/21/2019, so that is 189 days. The average time between orders would be 110 days.
But I have no idea how to make SQL look at the next row and calculate based on that, but then restart when it reaches a new customer.
Here is my query to get that table:
SELECT
    F.CustomerEmail
    ,F.OrderCountBase
    ,F.Date_Created
    ,F.Total
    ,F.TotalOrdersBase
FROM #FullBase F
ORDER BY F.CustomerEmail
If anyone can give me some suggestions, that would be greatly appreciated.
And then maybe I can calculate value differences (in percentage). So for example, ylove spent $36 on their first order and $41 on their second, which is a 13% increase. Then their third order was $140, which is a 341% increase. So on average, this customer increased their purchase order value by 177%. Unrelated to SQL, but is this the correct way of calculating a metric like this?
Looking at your sample, you could try using the difference between the min and max date divided by total - 1:
select email, datediff(day, min(Order_Date), max(Order_Date)) / (total - 1) as avg_days
from your_table
group by email, total
And to also manage customers with only one order:
select email,
       case when total - 1 > 0 then
                datediff(day, min(Order_Date), max(Order_Date)) / (total - 1)
            else datediff(day, min(Order_Date), max(Order_Date))
       end as avg_days
from your_table
group by email, total
The simplest formulation is:
select email,
       datediff(day, min(Order_Date), max(Order_Date)) / nullif(total - 1, 0) as avg_days
from t
group by email, total;
You can see this is the case. Consider three orders with od1, od2, and od3 as the order dates. The average is:
( (od2 - od1) + (od3 - od2) ) / 2
Check the arithmetic:
--> ( od2 - od1 + od3 - od2 ) / 2
--> ( od3 - od1 ) / 2
This pretty obviously generalizes to more orders.
Hence the max() minus min().
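And if you do want the row-by-row gaps (or the order-value percentage changes from the end of the question), a LAG-based sketch is one option. This is my own illustration, untested, in SQL Server syntax; it assumes the #FullBase table from the question also exposes the Value column holding the order amount:
WITH gaps AS (
    SELECT
        F.CustomerEmail,
        F.Date_Created,
        -- days since this customer's previous order; NULL on their first order
        DATEDIFF(day,
                 LAG(F.Date_Created) OVER (PARTITION BY F.CustomerEmail ORDER BY F.Date_Created),
                 F.Date_Created) AS days_since_prev,
        -- percent change vs the previous order's value; NULL on the first order
        100.0 * (F.Value - LAG(F.Value) OVER (PARTITION BY F.CustomerEmail ORDER BY F.Date_Created))
              / NULLIF(LAG(F.Value) OVER (PARTITION BY F.CustomerEmail ORDER BY F.Date_Created), 0)
            AS pct_change_vs_prev
    FROM #FullBase F
)
SELECT
    CustomerEmail,
    AVG(CAST(days_since_prev AS FLOAT)) AS avg_days_between_orders,
    AVG(pct_change_vs_prev) AS avg_pct_change
FROM gaps
WHERE days_since_prev IS NOT NULL  -- drop each customer's first order; PARTITION BY restarts LAG per customer
GROUP BY CustomerEmail;
For ylove this should give gaps of 31 and 189 days (average 110), matching the hand calculation in the question.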