How to "calculate performant wise" cumulative sum column in sql - sql

Hi, let's say I have a table that contains a cost per day,
and I want to calculate, for each day, the cumulative sum of cost from the start of the month up to that day.
So if, say, we have these values: 1, 2, 3 (over 3 days),
we'll calculate 1, (1+2)=3, (1+2+3)=6 (for the last day).
I wonder how we can do it through SQL without paying the O(n log n) cost of sorting the days.
Is there any other way to solve it?
Sample data:
1/1, 1
2/1, 10
3/1, 12
Desired result (with total from start of the month):
1/1, 1, 1
2/1, 10, 11
3/1, 12, 23

I'm guessing you want a rolling sum.
select *
, sum(cost_column) over (order by day_column asc) as rolling_cost
from yourtable
day_column | cost_column | rolling_cost
-----------|-------------|-------------
2022-1-1   | 1           | 1
2022-1-2   | 10          | 11
2022-1-3   | 12          | 23
Demo on db<>fiddle here
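On the O(n log n) concern: the window function still needs the rows in day order, so the database will sort unless it can read them pre-sorted. A minimal sketch, assuming the table and column names from the answer above (the index name is made up):
-- hypothetical index so the ORDER BY day_column inside the window
-- can be satisfied by an index scan instead of an explicit sort
create index idx_yourtable_day on yourtable (day_column);
Whether the optimizer actually uses it for the window's sort is engine-dependent, so check the query plan.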

Related

Dynamic average calculation

I want to add average cost columns which calculate the average across different time periods.
So in the example below there are 6 months of cost; the first average column finds the average across all 6, i.e. average(1,5,8,12,15,20).
The next "Half Period" column determines how many total periods there are and calculates the average across the most recent half of them, here the most recent 3 periods, i.e. average(12,15,20).
The first average is straightforward e.g.
AVG(COST)
What I've tried for the half period is:
AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN x PRECEDING AND CURRENT ROW)
x is of course an integer literal; how would I write the statement so that the required integer is filled in automatically? I.e. in this example, 6 periods require 3 rows to be averaged, therefore x=2.
x can be found by some sub-query e.g.
SELECT ( CEILING(COUNT(PERIOD) / 2) - 1) FROM TABLE
Example table:
Period | Cost
-------|-----
Jan    | 1
Feb    | 5
Mar    | 8
Apr    | 12
May    | 15
Jun    | 20
Desired Output:
Period | Cost | All Time Average Cost | Half Period Average Cost
-------|------|-----------------------|-------------------------
Jan    | 1    | 10.1                  | 1
Feb    | 5    | 10.1                  | 3
Mar    | 8    | 10.1                  | 4.7
Apr    | 12   | 10.1                  | 8.3
May    | 15   | 10.1                  | 11.7
Jun    | 20   | 10.1                  | 15.7
The main problem here is that you cannot use a variable or an expression for the number of rows preceding in the window frame; a literal value is required for x in the following:
BETWEEN x PRECEDING
If there is a finite number of periods, then we can use a CASE expression to switch between the possible frames:
-- COUNT(PERIOD) OVER () makes the count available per row, and the 2.0
-- divisor forces decimal division so CEILING is not defeated by integer division
CASE
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 1
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 2
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 3
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 4
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 5
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(PERIOD) OVER () / 2.0) - 1 <= 6
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
END AS [Half Period Average Cost]
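Another way around the literal-only frame bound, sketched rather than tested: compute the half-period width once as a window aggregate, then self-join on row numbers instead of using a frame. The table name MY_TABLE is an assumption (the question doesn't name it), and ordering by PERIOD only works if PERIOD sorts chronologically:
WITH numbered AS (
    SELECT PERIOD, COST,
           ROW_NUMBER() OVER (ORDER BY PERIOD) AS rn,
           COUNT(*) OVER () AS n   -- total number of periods
    FROM MY_TABLE
)
SELECT t1.PERIOD, t1.COST,
       AVG(1.0 * t2.COST) AS [Half Period Average Cost]   -- 1.0 * avoids integer truncation
FROM numbered t1
JOIN numbered t2
    ON t2.rn BETWEEN t1.rn - (CEILING(t1.n / 2.0) - 1) AND t1.rn
GROUP BY t1.PERIOD, t1.COST, t1.rn
ORDER BY t1.rn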
I added this step in SQL, but my window function refused to accept the variable half_period_rounded. So we're not quite there yet. :-)
This looks like a job for sneaky windowed function aggregates!
DECLARE @TABLE TABLE (SaleID INT IDENTITY, Cost DECIMAL(12,4), SaleDateTime DATETIME)
INSERT INTO @TABLE (SaleDateTime, Cost) VALUES
('2022-Jan-01', 1 ),
('2022-Feb-01', 5 ),
('2022-Mar-01', 8 ),
('2022-Apr-01', 12),
('2022-May-01', 15),
('2022-Jun-01', 20)
SELECT DISTINCT
    DATEPART(YEAR, SaleDateTime) AS Year,
    DATEPART(MONTH, SaleDateTime) AS MonthNumber,
    DATENAME(MONTH, SaleDateTime) AS Month,
    AVG(Cost) OVER (ORDER BY (SELECT 1)) AS AllTimeAverage,
    AVG(Cost) OVER (PARTITION BY DATEPART(YEAR, SaleDateTime), DATEPART(MONTH, SaleDateTime) ORDER BY SaleDateTime) AS MonthlyAverage,
    AVG(Cost) OVER (PARTITION BY DATEPART(YEAR, SaleDateTime), DATEPART(QUARTER, SaleDateTime) ORDER BY SaleDateTime) AS QuarterlyAverage,
    AVG(Cost) OVER (PARTITION BY CASE WHEN SaleDateTime BETWEEN CAST(DATEADD(MONTH, -1, DATEADD(DAY, 1 - DATEPART(DAY, SaleDateTime), SaleDateTime)) AS DATE)
                                                            AND CAST(DATEADD(MONTH, 2, DATEADD(DAY, 1 - DATEPART(DAY, SaleDateTime), SaleDateTime)) AS DATE)
                                      THEN 1 END
                    ORDER BY SaleDateTime) AS RollingThreeMonthAverage
FROM @TABLE
ORDER BY DATEPART(YEAR, SaleDateTime), DATEPART(MONTH, SaleDateTime)
We're cheating here, and having the case expression find the rows we want in our rolling 3 month window. I've opted to keep it to a rolling window of last month, this month and next month (from the first day of last month, to the last day of next month - '2022-01-01 00:00:00' to '2022-04-01 00:00:00' for February).
Partitioning over the whole result set, month and quarter is straightforward, but the rolling three months isn't much more complicated when you turn it into a case expression describing it.
Year MonthNumber Month AllTimeAverage MonthlyAverage QuarterlyAverage RollingThreeMonthAverage
--------------------------------------------------------------------------------------------------------
2022 1 January 10.166666 1.000000 1.000000 1.000000
2022 2 February 10.166666 5.000000 3.000000 3.000000
2022 3 March 10.166666 8.000000 4.666666 4.666666
2022 4 April 10.166666 12.000000 12.000000 6.500000
2022 5 May 10.166666 15.000000 13.500000 8.200000
2022 6 June 10.166666 20.000000 15.666666 10.166666
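As an aside: if the data really is one row per month, as in the sample, a plain ROWS frame is the more direct way to get a strict three-month rolling average. A sketch against the same @TABLE variable:
SELECT DATENAME(MONTH, SaleDateTime) AS Month,
       AVG(Cost) OVER (ORDER BY SaleDateTime
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS RollingThreeMonthAverage
FROM @TABLE
ORDER BY SaleDateTime
Note this is a different definition from the case-expression version above: April becomes avg(5, 8, 12) = 8.33 rather than the running 6.50 shown in the output.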

Calculate a 3-month moving average from non-aggregated data

I have a bunch of orders. Each order is either a type A or type B order. I want a 3-month moving average of time it takes to ship orders of each type. How can I aggregate this order data into what I want using Redshift or Postgres SQL?
Start with this:
order_id | order_type | ship_date  | time_to_ship
---------|------------|------------|-------------
1        | a          | 2021-12-25 | 100
2        | b          | 2021-12-31 | 110
3        | a          | 2022-01-01 | 200
4        | a          | 2022-01-01 | 50
5        | b          | 2022-01-15 | 110
6        | a          | 2022-02-02 | 100
7        | a          | 2022-02-28 | 300
8        | b          | 2022-04-05 | 75
9        | b          | 2022-04-06 | 210
10       | a          | 2022-04-15 | 150
Note: Some months have no shipments. The solution should allow for this.
I want this:
order_type | ship_month | mma3_time_to_ship
-----------|------------|------------------
a          | 2022-02-01 | 150
a          | 2022-04-01 | 160
b          | 2022-04-01 | 126.25
Where a 3-month moving average is only calculated for months with at least 2 preceding months. Each record is an order type-month. The ship_month column denotes the month of shipment (Redshift represents months as the date of the first of the month).
Here's how the mma3_time_to_ship column is calculated, expressed as Excel-like formulas:
150 = AVERAGE(100, 200, 50, 100, 300) <- The average for all A orders in Dec, Jan, and Feb.
160 = AVERAGE(200, 50, 100, 300, 150) <- The average for all A orders in Jan, Feb, Apr (no orders in March)
126.25 = AVERAGE(110, 110, 75, 210) <- The average for all B orders in Dec, Jan, Apr (no B orders in Feb, no orders at all in Mar)
My attempt doesn't aggregate it into monthly data and 3-month averages (this query runs without error in Redshift):
SELECT
order_type,
DATE_TRUNC('month', ship_date) AS ship_month,
AVG(time_to_ship) OVER (
PARTITION BY
order_type,
ship_month
ORDER BY ship_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS avg_time_to_ship
FROM tbl
Is what I want possible?
This is honestly a complete stab in the dark, so it won't surprise me if it's not correct... but it seems to me you can accomplish this with a self join using a range of dates within the join.
select
    t1.order_type, t1.ship_date, avg(t2.time_to_ship) as mma3_time_to_ship
from
    tbl t1
    join tbl t2 on
        t1.order_type = t2.order_type and
        t2.ship_date between t1.ship_date - interval '3 months' and t1.ship_date
group by
    t1.order_type, t1.ship_date
The results don't match your example, but then I'm not entirely sure where they came from anyway.
Perhaps this will be the catalyst towards an eventual solution or at least an idea to start.
This is Pg12, by the way. Not sure if it will work on Redshift.
-- EDIT --
Per your updates, I was able to match your three results identically. I used dense_rank to find the closest three months:
with foo as (
    select
        order_type,
        date_trunc('month', ship_date)::date as ship_month,
        time_to_ship,
        dense_rank() over (partition by order_type
                           order by date_trunc('month', ship_date)) as dr
    from tbl
)
select
    f1.order_type, f1.ship_month,
    avg(f2.time_to_ship),
    array_agg(f2.time_to_ship)
from
    foo f1
    join foo f2 on
        f1.order_type = f2.order_type and
        f2.dr between f1.dr - 2 and f1.dr
group by
    f1.order_type, f1.ship_month
Results:
b 2022-01-01 110.0000000000000000 {110,110}
a 2022-01-01 116.6666666666666667 {100,50,200,100,50,200}
b 2022-04-01 126.2500000000000000 {110,110,75,210,110,110,75,210}
b 2021-12-01 110.0000000000000000 {110}
a 2021-12-01 100.0000000000000000 {100}
a 2022-02-01 150.0000000000000000 {100,50,200,100,300,100,50,200,100,300}
a 2022-04-01 160.0000000000000000 {50,200,100,300,150}
There are some dupes in the array elements, but it doesn't seem to impact the averages. I'm sure that part could be fixed.
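The dupes come from months that contain more than one f1 row, so each f2 value repeats once per same-month f1 row; the averages are unaffected because the duplication factor is constant within a group. A sketch (same assumptions as above) that avoids them by collapsing the left side of the join to one row per type-month first:
with foo as (
    select order_type,
           date_trunc('month', ship_date)::date as ship_month,
           time_to_ship,
           dense_rank() over (partition by order_type
                              order by date_trunc('month', ship_date)) as dr
    from tbl
), months as (
    -- one row per (order_type, month) for the left side of the join
    select distinct order_type, ship_month, dr
    from foo
)
select m.order_type, m.ship_month,
       avg(f2.time_to_ship) as mma3_time_to_ship
from months m
join foo f2
    on m.order_type = f2.order_type
    and f2.dr between m.dr - 2 and m.dr
group by m.order_type, m.ship_month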

Running Total by Year in SQL

I have a table broken out into a series of numbers by year, and I need to build a running total column that restarts with each new year.
The desired outcome is below
Amount | Year | Running Total
-------|------|--------------
1      | 2000 | 1
5      | 2000 | 6
10     | 2000 | 16
5      | 2001 | 5
10     | 2001 | 15
3      | 2001 | 18
I can do an ORDER BY to get a standard running total, but can't figure out how to base it just on the year such that it does the running total for each unique year.
SQL tables represent unordered sets. You need a column to specify the ordering. Once you have this, it is a simple cumulative sum:
select amount, year, sum(amount) over (partition by year order by <ordering column>)
from t;
Without a column that specifies ordering, "cumulative sum" does not make sense on a table in SQL.
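For example, assuming an id column that reflects insertion order (an assumption; any monotonically increasing column works):
select amount, year,
       sum(amount) over (partition by year order by id) as running_total
from t;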

SQL How to calculate Average time between Order Purchases? (do sql calculations based on next and previous row)

I have a simple table that contains the customer email, their order count (so if this is their 1st order, 3rd, 5th, etc), the date that order was created, the value of that order, and the total order count for that customer.
Here is what my table looks like
Email           | Order | Date       | Value | Total
----------------|-------|------------|-------|------
r2n1w@gmail.com | 1     | 12/1/2016  | 85    | 5
r2n1w@gmail.com | 2     | 2/6/2017   | 125   | 5
r2n1w@gmail.com | 3     | 2/17/2017  | 75    | 5
r2n1w@gmail.com | 4     | 3/2/2017   | 65    | 5
r2n1w@gmail.com | 5     | 3/20/2017  | 130   | 5
ation@gmail.com | 1     | 2/12/2018  | 150   | 1
ylove@gmail.com | 1     | 6/15/2018  | 36    | 3
ylove@gmail.com | 2     | 7/16/2018  | 41    | 3
ylove@gmail.com | 3     | 1/21/2019  | 140   | 3
keria@gmail.com | 1     | 8/10/2018  | 54    | 2
keria@gmail.com | 2     | 11/16/2018 | 65    | 2
What I want to do is calculate the average time between purchases for each customer. So let's take customer ylove. The first purchase is on 6/15/18. The next one is on 7/16/18, so that's 31 days, and the next purchase is on 1/21/2019, so that is 189 days. The average time between orders would be 110 days.
But I have no idea how to make SQL look at the next row and calculate based on that, but then restart when it reaches a new customer.
Here is my query to get that table:
SELECT
F.CustomerEmail
,F.OrderCountBase
,F.Date_Created
,F.Total
,F.TotalOrdersBase
FROM #FullBase F
ORDER BY f.CustomerEmail
If anyone can give me some suggestions, that would be greatly appreciated.
And then maybe I can calculate value differences (in percentage). So for example, ylove spent $36 on their first order and $41 on their second, which is a 13% increase. Then their third order was $140, which is a 341% increase. So on average, this customer increased their purchase order value by 177%. Unrelated to SQL, but is this the correct way of calculating a metric like this?
Looking at your sample, you could try using the difference between the min and max order dates divided by the number of gaps between orders (total - 1):
select email, datediff(day, min(Order_Date), max(Order_Date)) / (total - 1) as avg_days
from your_table
group by email, total
And to also handle customers with only one order:
select email,
       case when total - 1 > 0
            then datediff(day, min(Order_Date), max(Order_Date)) / (total - 1)
            else datediff(day, min(Order_Date), max(Order_Date))
       end as avg_days
from your_table
group by email, total
The simplest formulation is:
select email,
       datediff(day, min(Order_Date), max(Order_Date)) / nullif(total - 1, 0) as avg_days
from t
group by email, total;
You can see this is the case. Consider three orders with od1, od2, and od3 as the order dates. The average is:
( (od2 - od1) + (od3 - od2) ) / 2
Check the arithmetic:
--> ( od2 - od1 + od3 - od2 ) / 2
--> ( od3 - od1 ) / 2
This pretty obviously generalizes to more orders.
Hence the max() minus min().
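If you do want SQL to "look at the next row", as the question puts it, LAG() exposes the previous order date per customer. A sketch in SQL Server syntax against the #FullBase table from the question's query:
WITH gaps AS (
    SELECT F.CustomerEmail,
           DATEDIFF(day,
                    LAG(F.Date_Created) OVER (PARTITION BY F.CustomerEmail
                                              ORDER BY F.Date_Created),
                    F.Date_Created) AS days_between
    FROM #FullBase F
)
SELECT CustomerEmail,
       AVG(1.0 * days_between) AS avg_days   -- 1.0 * avoids integer division
FROM gaps
WHERE days_between IS NOT NULL               -- the first order has no previous row
GROUP BY CustomerEmail;
This agrees with the max() minus min() approach, since the intermediate dates cancel out exactly as shown above.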

SQL - Grouping results by custom 24 hour period

I need to create an Oracle 11g SQL report showing daily productivity: how many units were shipped during a 24 hour period. Each period starts at 6am and finishes at 5:59am the next day.
How could I group the results in such a way as to display this 24-hour period? I've tried grouping by day, but a day is 00:00 - 23:59, so the results are inaccurate.
The results will cover the past 2 months.
Many thanks.
group by trunc(your_date - 1/4)
Days are whole numbers in Oracle, so 6 am is 0.25 of a day. So:
select
    trunc(your_date - 0.25) as period, count(*) as units
from your_table
group by trunc(your_date - 0.25)
(Note the subtraction: shifting each timestamp back 6 hours makes everything before 6 am fall on the previous day before truncating. I haven't got an Oracle instance to try it on at the moment.)
Well, you could group by a calculated date. Shift the dates back by 6 hours and group by that, which then technically groups your dates correctly and produces the correct results.
Assuming that you have a units column or similar on your table, perhaps something like this:
SQL Fiddle
SELECT
TRUNC(us.shipping_datetime - 0.25) + 0.25 period_start
, TRUNC(us.shipping_datetime - 0.25) + 1 + (1/24 * 5) + (1/24/60 * 59) period_end
, SUM(us.units) units
FROM units_shipped us
GROUP BY TRUNC(us.shipping_datetime - 0.25)
ORDER BY 1
This simply subtracts 6 hours (0.25 of a day) from each date. If the time is earlier than 6am, the subtraction will make it fall prior to midnight, and when the resultant value is truncated (time element is removed, the date at midnight is returned), it falls within the grouping for the previous day.
Results:
| PERIOD_START | PERIOD_END | UNITS |
-----------------------------------------------------------------------
| April, 22 2013 06:00:00+0000 | April, 23 2013 05:59:00+0000 | 1 |
| April, 23 2013 06:00:00+0000 | April, 24 2013 05:59:00+0000 | 3 |
| April, 24 2013 06:00:00+0000 | April, 25 2013 05:59:00+0000 | 1 |
The bit of dynamic maths in the SELECT is just to help readability of the results. If you don't have a units column to SUM() up, i.e. each row represents a single unit, then substitute COUNT(*) instead.
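For instance, a sketch of the COUNT(*) variant against the same units_shipped table, for when each row represents one shipped unit:
SELECT TRUNC(us.shipping_datetime - 0.25) + 0.25 AS period_start,
       COUNT(*) AS units
FROM units_shipped us
GROUP BY TRUNC(us.shipping_datetime - 0.25)
ORDER BY 1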