Dynamic average calculation - sql

I want to add an average cost column which calculates the average across different time periods.
So in the example below there are 6 months of cost. The first column finds the average across all 6, i.e. average(1,5,8,12,15,20).
The next "Half Period" column determines how many total periods there are and calculates the average across the most recent 3 periods i.e. average(12,15,20)
The first average is straightforward e.g.
AVG(COST)
What I've tried for the half period is:
AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN x PRECEDING AND CURRENT ROW)
The x must of course be an integer. How would I write the statement so that the required integer is filled in automatically? In this example, 6 periods require 3 rows to be averaged, therefore x=2.
x can be found by some sub-query e.g.
SELECT ( CEILING(COUNT(PERIOD) / 2.0) - 1) FROM TABLE
Example table:
Period | Cost
-------------
Jan      1
Feb      5
Mar      8
Apr      12
May      15
Jun      20
Desired Output:
Period | Cost | All Time Average Cost | Half Period Average Cost
----------------------------------------------------------------
Jan      1      10.1                    1
Feb      5      10.1                    3
Mar      8      10.1                    4.7
Apr      12     10.1                    8.3
May      15     10.1                    11.7
Jun      20     10.1                    15.7

The main problem here is that you cannot use a variable or an expression for the number of preceding rows in the window frame; x must be a literal value in the following:
BETWEEN x PRECEDING
If the number of periods is bounded, we can use a CASE expression to switch between the possible window expressions:
-- COUNT(*) OVER () keeps the count a window function; 2.0 avoids integer division
CASE
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 1
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 2
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 3
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 4
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 5
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)
    WHEN CEILING(COUNT(*) OVER () / 2.0) - 1 <= 6
        THEN AVG(COST) OVER (ORDER BY PERIOD ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
END AS [Half Period Average Cost]
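If a fixed list of CASE branches is not acceptable, one alternative is to avoid the window frame altogether: number the rows, then self-join each row to the x rows before it. The following is only a sketch, run through Python's sqlite3 module so that it is self-contained (it needs SQLite 3.25 or later for window functions). The table name costs, the numeric period_no column (added because month names do not sort chronologically) and the aliases are assumptions, not part of the question.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for the asker's DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE costs (period_no INTEGER PRIMARY KEY, period TEXT, cost REAL)"
)
conn.executemany(
    "INSERT INTO costs VALUES (?, ?, ?)",
    [(1, "Jan", 1), (2, "Feb", 5), (3, "Mar", 8),
     (4, "Apr", 12), (5, "May", 15), (6, "Jun", 20)],
)

HALF_PERIOD_SQL = """
WITH numbered AS (
    SELECT period, cost,
           ROW_NUMBER() OVER (ORDER BY period_no) AS rn,  -- position of the row
           COUNT(*) OVER ()                       AS n    -- total number of periods
    FROM costs
)
SELECT cur.period,
       cur.cost,
       ROUND(AVG(prev.cost), 2) AS half_period_avg
FROM numbered AS cur
JOIN numbered AS prev
  -- (n + 1) / 2 is integer division, i.e. CEILING(n / 2); minus 1 gives x
  ON prev.rn BETWEEN cur.rn - ((cur.n + 1) / 2 - 1) AND cur.rn
GROUP BY cur.rn, cur.period, cur.cost
ORDER BY cur.rn
"""

half_period_rows = conn.execute(HALF_PERIOD_SQL).fetchall()
for row in half_period_rows:
    print(row)
```

With the six sample rows, x works out to 2 and the averages come out as 1, 3, 4.67, 8.33, 11.67 and 15.67, which matches the desired output in the question. The self-join is quadratic in the worst case, so this is a trade of speed for flexibility.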

I added this step in SQL, but my window function would not accept the variable half_period_rounded. So we're not quite there yet. :-)

This looks like a job for sneaky windowed function aggregates!
-- A table variable is declared with @, not #
DECLARE @TABLE TABLE (SaleID INT IDENTITY, Cost DECIMAL(12,4), SaleDateTime DATETIME)
INSERT INTO @TABLE (SaleDateTime, Cost) VALUES
('2022-Jan-01', 1 ),
('2022-Feb-01', 5 ),
('2022-Mar-01', 8 ),
('2022-Apr-01', 12),
('2022-May-01', 15),
('2022-Jun-01', 20)
SELECT DISTINCT DATEPART(YEAR,SaleDateTime) AS Year, DATEPART(MONTH,SaleDateTime) AS MonthNumber, DATENAME(MONTH,SaleDateTime) AS Month,
    AVG(Cost) OVER (ORDER BY (SELECT 1)) AS AllTimeAverage,
    AVG(Cost) OVER (PARTITION BY DATEPART(YEAR,SaleDateTime), DATEPART(MONTH, SaleDateTime) ORDER BY SaleDateTime) AS MonthlyAverage,
    AVG(Cost) OVER (PARTITION BY DATEPART(YEAR,SaleDateTime), DATEPART(QUARTER,SaleDateTime) ORDER BY SaleDateTime) AS QuarterlyAverage,
    AVG(Cost) OVER (PARTITION BY CASE WHEN SaleDateTime BETWEEN CAST(DATEADD(MONTH,-1,DATEADD(DAY,1-DATEPART(DAY,SaleDateTime),SaleDateTime)) AS DATE)
                                                            AND CAST(DATEADD(MONTH,2,DATEADD(DAY,1-DATEPART(DAY,SaleDateTime),SaleDateTime)) AS DATE)
                                      THEN 1 END ORDER BY SaleDateTime) AS RollingThreeMonthAverage
FROM @TABLE
ORDER BY DATEPART(YEAR,SaleDateTime), DATEPART(MONTH,SaleDateTime)
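We're cheating here by having the CASE expression find the rows we want in our rolling 3 month window. I've opted to keep it to a rolling window of last month, this month and next month (from the first day of last month to the last day of next month: '2022-01-01 00:00:00' to '2022-04-01 00:00:00' for February).
Partitioning over the whole result set, month and quarter is straightforward, and the rolling three months isn't much more complicated once you turn it into a CASE expression describing it. Be aware, though, that this CASE expression only compares each row's SaleDateTime with bounds derived from that same row, so it is true for every row. All rows therefore land in a single partition, and because of the ORDER BY the last column behaves as a running average (April's 6.5 is the average of 1, 5, 8 and 12).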
Year MonthNumber Month AllTimeAverage MonthlyAverage QuarterlyAverage RollingThreeMonthAverage
--------------------------------------------------------------------------------------------------------
2022 1 January 10.166666 1.000000 1.000000 1.000000
2022 2 February 10.166666 5.000000 3.000000 3.000000
2022 3 March 10.166666 8.000000 4.666666 4.666666
2022 4 April 10.166666 12.000000 12.000000 6.500000
2022 5 May 10.166666 15.000000 13.500000 8.200000
2022 6 June 10.166666 20.000000 15.666666 10.166666
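For comparison, here is a sketch of how the frame clause changes an ordered window average. With only ORDER BY, the default frame runs from the first row of the partition to the current row (a running average); an explicit ROWS BETWEEN 2 PRECEDING AND CURRENT ROW restricts it to the current row and the two before it. SQLite via Python's sqlite3 stands in for SQL Server here, and the sales table and its column names are assumptions.

```python
import sqlite3

# Hypothetical table; one row per month, same costs as the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2022-01-01", 1), ("2022-02-01", 5), ("2022-03-01", 8),
     ("2022-04-01", 12), ("2022-05-01", 15), ("2022-06-01", 20)],
)

FRAME_SQL = """
SELECT sale_date,
       -- default frame: everything up to the current row
       ROUND(AVG(cost) OVER (ORDER BY sale_date), 2) AS running_avg,
       -- explicit frame: current row plus the two rows before it
       ROUND(AVG(cost) OVER (ORDER BY sale_date
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2)
           AS last_three_avg
FROM sales
ORDER BY sale_date
"""

frame_rows = conn.execute(FRAME_SQL).fetchall()
for row in frame_rows:
    print(row)
```

A ROWS frame counts rows, not months, so "last three rows" only means "last three months" when there is exactly one row per month, as in this sample.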

Related

Count total without duplicate records

I have a table that contains the following columns: TrackingStatus, Year, Month, Order, Notes
I need to calculate the total number of tracking status for each year and month.
For example, if the table contains the following orders:
TrackingStatus | Year | Month | Order | Notes
---------------------------------------------
F                2020   1       33
F                2020   1       33      DFF
E                2020   2       36      xxx
A                2021   3       34      X1
A                2021   3       34      DD
A                2021   3       88
A                2021   2       45
The result should be:
• Tracking F, year 2020, month 1: the total will be one (because both records have the same year, month, and order).
• Tracking A, year 2021, month 2: the total will be one (because there is only one record with that year, month, and order).
• Tracking A, year 2021, month 3: the total will be two (because there are two different orders within the same year and month).
So the expected SELECT output will be like that:
TrackingStatus | Year | Month | Total
-------------------------------------
F                2020   1       1
E                2020   2       1
A                2021   2       1
A                2021   3       2
I was trying to use group by but then it will count the number of records which in my scenario is wrong.
How can I get the total orders for each month without counting “duplicate” records?
Thank you
You can use the COUNT(DISTINCT ...) aggregate: COUNT counts the values, and DISTINCT makes each distinct value count only once within its group.
SELECT TrackingStatus,
       Year,
       Month,
       COUNT(DISTINCT `Order`) AS Total  -- Order is a reserved word, so it is quoted
FROM tab
GROUP BY TrackingStatus,
         Year,
         Month
ORDER BY Year,
         Month
This was tested in MySQL and should work in many DBMSs, provided the reserved word Order is quoted the way the dialect expects (backticks in MySQL, double quotes in standard SQL).
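As a quick check of the COUNT(DISTINCT ...) approach against the sample rows from the question, here is a minimal sketch using Python's sqlite3 module. SQLite is only a stand-in for MySQL, which is why the Order column is quoted with double quotes rather than backticks; the table name tab is taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE tab (TrackingStatus TEXT, Year INTEGER, Month INTEGER, '
    '"Order" INTEGER, Notes TEXT)'
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?, ?)",
    [("F", 2020, 1, 33, None), ("F", 2020, 1, 33, "DFF"),
     ("E", 2020, 2, 36, "xxx"), ("A", 2021, 3, 34, "X1"),
     ("A", 2021, 3, 34, "DD"),  ("A", 2021, 3, 88, None),
     ("A", 2021, 2, 45, None)],
)

COUNT_SQL = """
SELECT TrackingStatus, Year, Month,
       COUNT(DISTINCT "Order") AS Total   -- repeated order numbers count once
FROM tab
GROUP BY TrackingStatus, Year, Month
ORDER BY Year, Month
"""

totals = conn.execute(COUNT_SQL).fetchall()
for row in totals:
    print(row)
```

The two F rows for order 33 collapse to a total of 1, while A in 2021/3 gets 2 because orders 34 and 88 are different.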

How to "calculate performant wise" cumulative sum column in sql

Hi, let's say I have a table that contains the cost per day,
and at the end of the month I want to calculate the cumulative sum up to each day.
So if, say, we have the values 1, 2, 3 (for 3 days),
we'll calculate 1, (1+2)=3, (1+2+3)=6 (for the last day).
I wonder how we can do this in SQL without paying the O(n log n) cost of sorting the days.
Is there any other way to solve it?
Sample data:
1/1, 1
2/1, 10
3/1, 12
Desired result (with the total from the start of the month):
1/1, 1, 1
2/1, 10, 11
3/1, 12, 23
I'm guessing you want a rolling sum.
select *
, sum(cost_column) over (order by day_column asc) as rolling_cost
from yourtable
day_column | cost_column | rolling_cost
---------------------------------------
2022-1-1     1             1
2022-1-2     10            11
2022-1-3     12            23
Demo on db<>fiddle here

Running Total by Year in SQL

I have a table broken out into a series of numbers by year, and need to build a running total column but restart during the next year.
The desired outcome is below
Amount | Year | Running Total
-----------------------------
1 2000 1
5 2000 6
10 2000 16
5 2001 5
10 2001 15
3 2001 18
I can do an ORDER BY to get a standard running total, but can't figure out how to base it just on the year such that it does the running total for each unique year.
SQL tables represent unordered sets. You need a column to specify the ordering. Once you have this, it is a simple cumulative sum:
select amount, year, sum(amount) over (partition by year order by <ordering column>)
from t;
Without a column that specifies ordering, "cumulative sum" does not make sense on a table in SQL.
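To make this concrete, here is a small sketch with the sample amounts from the question, run through Python's sqlite3 module (SQLite 3.25 or later). The id column is an assumption: it plays the role of the ordering column that the original table lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is a hypothetical ordering column added for this example.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 2000), (2, 5, 2000), (3, 10, 2000),
     (4, 5, 2001), (5, 10, 2001), (6, 3, 2001)],
)

RUNNING_SQL = """
SELECT amount, year,
       SUM(amount) OVER (PARTITION BY year ORDER BY id) AS running_total
FROM t
ORDER BY id
"""

running_rows = conn.execute(RUNNING_SQL).fetchall()
for row in running_rows:
    print(row)
```

PARTITION BY year is what restarts the total; the running_total column comes out as 1, 6, 16 for 2000 and 5, 15, 18 for 2001. Because id is unique there are no ties, so the default frame gives a strict row-by-row total.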

Can I calculate an aggregate duration over multiple rows with a single row per day?

I'm creating an Absence Report for HR. The Absence Data is stored in the database as a single row per day (the columns are EmployeeId, Absence Date, Duration). So if I'm off work from Tuesday 11 February 2020 to Friday 21 February 2020 inclusive, there will be 9 rows in the table:
11 February 2020 - 1 day
12 February 2020 - 1 day
13 February 2020 - 1 day
14 February 2020 - 1 day
17 February 2020 - 1 day
18 February 2020 - 1 day
19 February 2020 - 1 day
20 February 2020 - 1 day
21 February 2020 - 1 day
(see screenshot below)
HR would like to see a single entry in the report for a contiguous period of absence:
My question is: without using a cursor, how can I calculate this in SQL? (It's even more complicated because I have to do this using Linq to SQL, but I might be able to swap this out for a stored procedure.) Note that the criterion for contiguous data is adjacent working days EXCLUDING weekends and bank holidays. I hope I've made myself clear ... apologies if not.
This is a form of gaps-and-islands. In this case, use lag() to see if two vacations overlap and then a cumulative sum:
select employee, min(absent_from), max(absent_to)
from (select t.*,
             sum(case when prev_absent_to = dateadd(day, -1, absent_from) then 0 else 1
                 end) over (partition by employee order by absent_to) as grp
      from (select t.*,
                   lag(absent_to) over (partition by employee order by absent_from) as prev_absent_to
            from t
           ) t
     ) t
group by employee, grp;
If you need to deal with holidays and weekends, then you need a calendar table.
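One way the calendar-table idea could look for the single-row-per-day layout in the question is sketched below. Each working day gets a sequence number, and within a run of consecutive working days that number minus the absence's own row number stays constant, which gives the island key. Everything here is an assumption for illustration: the table names working_days and absences, the choice to build the calendar in Python, the absence of bank holidays in the range, and the extra absence on 26 February that is added only to show a second island. It runs on SQLite through Python's sqlite3 rather than on SQL Server.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE working_days (work_date TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE absences (employee_id INTEGER, absence_date TEXT)")

# Hypothetical calendar: weekdays from 3 to 28 February 2020, no bank holidays.
start = date(2020, 2, 3)
calendar = [start + timedelta(days=i) for i in range(26)]
conn.executemany(
    "INSERT INTO working_days VALUES (?)",
    [(d.isoformat(),) for d in calendar if d.weekday() < 5],
)

# The nine days from the question, plus 26 Feb (illustrative only).
absent_days = [11, 12, 13, 14, 17, 18, 19, 20, 21, 26]
conn.executemany(
    "INSERT INTO absences VALUES (1, ?)",
    [(date(2020, 2, d).isoformat(),) for d in absent_days],
)

ISLANDS_SQL = """
WITH numbered_days AS (
    SELECT work_date, ROW_NUMBER() OVER (ORDER BY work_date) AS day_no
    FROM working_days
),
islands AS (
    SELECT a.employee_id, a.absence_date,
           -- constant within a run of consecutive working days
           d.day_no - ROW_NUMBER() OVER (PARTITION BY a.employee_id
                                         ORDER BY a.absence_date) AS grp
    FROM absences AS a
    JOIN numbered_days AS d ON d.work_date = a.absence_date
)
SELECT employee_id,
       MIN(absence_date) AS absent_from,
       MAX(absence_date) AS absent_to,
       COUNT(*)          AS days      -- assumes each row is one full day
FROM islands
GROUP BY employee_id, grp
ORDER BY employee_id, absent_from
"""

absence_periods = conn.execute(ISLANDS_SQL).fetchall()
for row in absence_periods:
    print(row)
```

Because the weekend of 15 and 16 February is simply missing from working_days, Friday the 14th and Monday the 17th get adjacent day numbers and fall into the same period, so the nine days from the question are reported as one entry from 2020-02-11 to 2020-02-21.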

Get just one row per ID from SQL query with specifc condition

I'm using Postgres v>9.
I'd like to get values of a table like this:
id year value
1 2015 0.1
2 2015 0.2
6 2030 0.3
6 2015 0.4
6 2017 0.3
The idea is to get rows where year is < 2019 or year = 2030. If an id is repeated, I'd like to get only the 2030 row, not the 2015 ones; that is, the result I'm looking for is:
id year value
1 2015 0.1
2 2015 0.2
6 2030 0.3
How can I do that?
This only considers the year 2030 or any year < 2019. At least that's what the question says. (I suspect there's something fuzzy there.)
It picks one row per id, with the latest year first.
SELECT DISTINCT ON (id) *
FROM tbl
WHERE (year = 2030 OR year < 2019)
ORDER BY id, year DESC;
If there can be multiple rows with the same (id, year), you need a tiebreaker.
About this and more details for DISTINCT ON:
Select first row in each GROUP BY group?
Use distinct on if you want one row per id:
select distinct on (id) t.*
from t
order by id, year desc;
SELECT DISTINCT ID,
FIRST_VALUE(YEAR) OVER (PARTITION BY ID ORDER BY YEAR DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS year,
FIRST_VALUE(Value) OVER (PARTITION BY ID ORDER BY YEAR DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS value
FROM t
WHERE YEAR = 2030 OR YEAR < 2019
I think this is the standard syntax for first_value; Postgres might require a separate clause?
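For engines without DISTINCT ON, a common portable alternative is to rank the rows per id with ROW_NUMBER() and keep rank 1. This is a sketch only, run on SQLite through Python's sqlite3 rather than on Postgres; the table name tbl follows the first answer and the alias ranked is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(1, 2015, 0.1), (2, 2015, 0.2), (6, 2030, 0.3),
     (6, 2015, 0.4), (6, 2017, 0.3)],
)

ONE_PER_ID_SQL = """
SELECT id, year, value
FROM (
    SELECT id, year, value,
           -- latest qualifying year gets rank 1 within each id
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY year DESC) AS rn
    FROM tbl
    WHERE year = 2030 OR year < 2019
) AS ranked
WHERE rn = 1
ORDER BY id
"""

one_per_id = conn.execute(ONE_PER_ID_SQL).fetchall()
for row in one_per_id:
    print(row)
```

For id 6 the 2030 row wins over 2017 and 2015. As with DISTINCT ON, several rows sharing the same (id, year) would need an extra tiebreaker column in the ORDER BY to make the choice deterministic.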