T-SQL calculate moving average - sql

I am working with SQL Server 2008 R2, trying to calculate a moving average. For each record in my view, I would like to collect the values of the 250 previous records, and then calculate the average for this selection.
My view columns are as follows:
TransactionID | TimeStamp | Value | MovAvg
----------------------------------------------------
1 | 01.09.2014 10:00:12 | 5 |
2 | 01.09.2014 10:05:34 | 3 |
...
300 | 03.09.2014 09:00:23 | 4 |
TransactionID is unique. For each TransactionID, I would like to calculate the average for column value, over previous 250 records. So for TransactionID 300, collect all values from previous 250 rows (view is sorted descending by TransactionID) and then in column MovAvg write the result of the average of these values. I am looking to collect data within a range of records.

Window functions in SQL Server 2008 are limited compared to later versions: aggregates with OVER accept only PARTITION BY, and the ROWS/RANGE frame clauses only arrived in SQL Server 2012. Something like this should do what you want:
;WITH cte (rn, transactionid, value) AS (
SELECT
rn = ROW_NUMBER() OVER (ORDER BY transactionid),
transactionid,
value
FROM your_table
)
SELECT
transactionid,
value,
movavg = (
SELECT AVG(value)
FROM cte AS inner_ref
-- average over 250 rows ending at the current row (inclusive);
-- use rn-250 AND rn-1 to average only the 250 rows before the current one
WHERE inner_ref.rn BETWEEN outer_ref.rn-249 AND outer_ref.rn
)
FROM cte AS outer_ref
Note that this runs a correlated subquery for every row, so performance might not be great.
With SQL Server 2012 or later you can use a window frame instead:
SELECT
transactionid,
value,
-- avg over the 250 rows before the current row
AVG(value) OVER (ORDER BY transactionid
ROWS BETWEEN 250 PRECEDING AND 1 PRECEDING),
-- or 250 rows ending at the current row
AVG(value) OVER (ORDER BY transactionid
ROWS BETWEEN 249 PRECEDING AND CURRENT ROW)
FROM your_table
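A quick way to check the frame bounds is to run the same shape of query on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption for illustration; it needs SQLite 3.25 or later for window frames), with a window of 3 rows instead of 250 and made-up values 1 to 10.

```python
import sqlite3

WINDOW = 3  # stands in for the 250 rows in the question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (transactionid INTEGER PRIMARY KEY, value REAL)")
# Hypothetical sample data: ten transactions with values 1.0 .. 10.0
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 11)],
)

def moving_avg_previous(conn, n):
    """Average of the n rows strictly before each row; None when there are none."""
    sql = f"""
        SELECT transactionid,
               AVG(value) OVER (ORDER BY transactionid
                                ROWS BETWEEN {int(n)} PRECEDING AND 1 PRECEDING)
        FROM your_table
        ORDER BY transactionid
    """
    return conn.execute(sql).fetchall()

rows = moving_avg_previous(conn, WINDOW)
for transactionid, movavg in rows:
    print(transactionid, movavg)
```

The first row has no predecessors, so its average is NULL (None in Python). Rows near the start average over fewer than n values.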

Use a Common Table Expression (CTE) to include the rownum for each transaction, then join the CTE against itself on the row number so you can get the previous values to calculate the average with.
CREATE TABLE MyTable (TransactionId INT, Value INT)
;with Data as
(
SELECT TransactionId,
Value,
ROW_NUMBER() OVER (ORDER BY TransactionId ASC) as rownum
FROM MyTable
)
SELECT d.TransactionId , Avg(h.Value) as MovingAverage
FROM Data d
JOIN Data h on h.rownum between d.rownum-250 and d.rownum-1
GROUP BY d.TransactionId
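The self-join can be tried out on a few rows. This is only a sketch: it uses Python's sqlite3 module in place of SQL Server, a window of 2 rows instead of 250, and invented TransactionId values with gaps to show why the join is on the row number rather than on the id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (TransactionId INTEGER, Value REAL)")
# Hypothetical data; ids are deliberately non-contiguous
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(10, 5.0), (20, 3.0), (35, 4.0), (50, 8.0), (51, 6.0)],
)

def self_join_moving_avg(conn, n):
    """Average of the n previous rows, found by joining on the row number."""
    sql = """
        WITH Data AS (
            SELECT TransactionId, Value,
                   ROW_NUMBER() OVER (ORDER BY TransactionId ASC) AS rownum
            FROM MyTable
        )
        SELECT d.TransactionId, AVG(h.Value) AS MovingAverage
        FROM Data d
        JOIN Data h ON h.rownum BETWEEN d.rownum - ? AND d.rownum - 1
        GROUP BY d.TransactionId
        ORDER BY d.TransactionId
    """
    return conn.execute(sql, (n,)).fetchall()

result = self_join_moving_avg(conn, 2)
print(result)
```

Because this is an inner join, the very first transaction has no previous rows to match and does not appear in the result; a LEFT JOIN would keep it with a NULL average.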

Related

In Spark SQL how do I take 98% of the lowest values

I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date | ID | Name | Transactions
----------------------------------------------
02/02/2022 | ABC123 | Bob | 107
01/05/2022 | ACD232 | Emma | 34
12/03/2022 | HH254 | Kirsten | 23
12/11/2022 | HH254 | Kirsten | 47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
row_number() Over (Order By Transactions, Date desc) * 100
/ count(*) Over (Rows Between Unbounded Preceding And Unbounded Following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
The Count(*) needs an Over clause to act as a window function rather than a plain aggregate, but an empty Over () is enough to count the whole result set.
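The row_number / count idea can be checked on a small table. The sketch below runs the same query shape in SQLite through Python's sqlite3 module rather than in Spark SQL (an assumption made so the example is self-contained), with 49 ordinary rows and one invented outlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ID TEXT, Transactions INTEGER)")
# Hypothetical data: 49 ordinary rows plus one extreme outlier (50 rows total)
sample = [(f"ID{i}", i) for i in range(1, 50)] + [("OUTLIER", 10000)]
conn.executemany("INSERT INTO myTable VALUES (?, ?)", sample)

def lowest_percent(conn, pct):
    """Keep the rows whose rank, as a percentage of the row count, is <= pct."""
    sql = """
        SELECT ID, Transactions
        FROM (
            SELECT t.*,
                   ROW_NUMBER() OVER (ORDER BY Transactions) * 100.0
                   / COUNT(*) OVER () AS percentile
            FROM myTable t
        )
        WHERE percentile <= ?
        ORDER BY Transactions
    """
    return conn.execute(sql, (pct,)).fetchall()

kept = lowest_percent(conn, 98)
print(len(kept), kept[-1])
```

With 50 rows, each row is worth 2 percentile points, so the cutoff at 98 drops exactly the single highest row.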
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
SELECT Date, ID, Name, Transactions
FROM your_table
),
percentiles AS (
SELECT percentile_approx(Transactions, 0.98) AS p
FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
ON Transactions <= p
The percentile_approx function is applied to the base_data CTE to obtain the 98th percentile value, and the join keeps only the rows at or below it.

PostgreSQL using sum in where clause

I have a table with a numeric column named 'capacity'. I want to select the first rows whose total capacity is no greater than X, something like this query:
select * from table where sum(capacity )<X
But I know I cannot use aggregate functions in the where clause. What other ways are there to solve this problem?
Here is some sample data
id| capacity
1 | 12
2 | 13.5
3 | 15
I want to list the rows, in id order, whose running sum is less than 26, so a query like this
select * from table where sum(capacity )<26 order by id
and it must give me
id| capacity
1 | 12
2 | 13.5
because 12+13.5<26
A bit late to the party, but for future reference, the following works for a similar problem, where the sum is taken per id rather than as a running total across rows:
SELECT id, sum(capacity)
FROM table
GROUP BY id
HAVING sum(capacity) < 26
ORDER by id ASC;
Use the PostgreSQL docs for reference to aggregate functions: https://www.postgresql.org/docs/9.1/tutorial-agg.html
Use a HAVING clause. It needs a GROUP BY and must come before ORDER BY:
select id, sum(capacity) from table group by id having sum(capacity)<X order by id
You can use the window variant of sum to produce a cumulative sum, and then use it in the where clause. Note that window functions can't be placed directly in the where clause, so you'd need a subquery:
SELECT id, capacity
FROM (SELECT id, capacity, SUM(capacity) OVER (ORDER BY id ASC) AS cum_sum
FROM mytable) t
WHERE cum_sum < 26
ORDER BY id ASC;
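The cumulative-sum query can be verified against the sample data from the question. This is a sketch using Python's sqlite3 module in place of PostgreSQL (an assumption for the sake of a self-contained example; the SQL itself is unchanged).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, capacity REAL)")
# Sample data from the question
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, 12), (2, 13.5), (3, 15)])

def rows_under_total(conn, limit):
    """Rows, in id order, whose running sum of capacity stays below limit."""
    sql = """
        SELECT id, capacity
        FROM (SELECT id, capacity,
                     SUM(capacity) OVER (ORDER BY id ASC) AS cum_sum
              FROM mytable) t
        WHERE cum_sum < ?
        ORDER BY id ASC
    """
    return conn.execute(sql, (limit,)).fetchall()

selected = rows_under_total(conn, 26)
print(selected)
```

The running sums are 12, 25.5 and 40.5, so only the first two rows pass the filter.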

Selecting row with max running total on one column less than a given value

For example, for a table such as :
ID | col_a | col_b | col_c
=============================
1 |5.0 |7.0 |3
2 |3.0 |6.8 |5
I need to find the value of col_a / col_b for which the running total on col_c is less than a given value.
So far I have:
select MAX(running_total) as max FROM (select (col_a / col_b) as val, SUM(col_c)
OVER (ORDER BY value ROWS UNBOUNDED PRECEDING) as running_total FROM tableName)
WHERE running_total < 50;
This gives me the maximum running total but I also need the val (col_a/col_b) for the row where this running_total was achieved.
I am using Amazon Redshift for this query, which, unlike MySQL, won't let me place val in the outer select statement without adding a group by clause on val. I can't add the group by clause because that would change the semantics of the whole query.
I have found solutions to a similar problem - Fetch the row which has the Max value for a column
Most of these solutions suggest joining the table with itself and then matching on the column's value. But the running_total column is calculated, so to join on it I would have to calculate it again, which sounds fairly expensive.
You can do this. Window functions to the rescue.
Just add another layer of subquery that calculates the maximum running total on each row. Then use a where clause to get the row where they match:
select t.*
from (select t.*,
max(running_total) over () as maxrt
FROM (select (col_a / col_b) as val,
SUM(col_c) OVER (ORDER BY value ROWS UNBOUNDED PRECEDING
) as running_total
FROM tableName
) t
WHERE running_total < 50
) t
where running_total = maxrt;
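Here is a small runnable sketch of the layered-subquery approach, using Python's sqlite3 module instead of Redshift. The id ordering for the running total and the extra sample rows are assumptions made for illustration, since the question does not show the column it orders by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableName (id INTEGER PRIMARY KEY, col_a REAL, col_b REAL, col_c INTEGER)"
)
# First two rows come from the question; the last two are hypothetical
conn.executemany(
    "INSERT INTO tableName VALUES (?, ?, ?, ?)",
    [(1, 5.0, 7.0, 3), (2, 3.0, 6.8, 5), (3, 4.0, 8.0, 45), (4, 1.0, 2.0, 10)],
)

def row_at_max_running_total(conn, limit):
    """Return (val, running_total) for the largest running total below limit."""
    sql = """
        SELECT val, running_total
        FROM (SELECT t.*,
                     MAX(running_total) OVER () AS maxrt
              FROM (SELECT col_a / col_b AS val,
                           SUM(col_c) OVER (ORDER BY id
                                            ROWS UNBOUNDED PRECEDING) AS running_total
                    FROM tableName) t
              WHERE running_total < ?) t2
        WHERE running_total = maxrt
    """
    return conn.execute(sql, (limit,)).fetchone()

best = row_at_max_running_total(conn, 60)
print(best)
```

The WHERE filter on the middle level runs before MAX(...) OVER () is evaluated, so the maximum is taken only over rows that are already below the limit.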

Multiple filters on SQL query

I have been reading many topics about filtering SQL queries, but none seems to apply to my case, so I'm in need of a bit of help. I have the following data on a SQL table.
Date | item | quantity moved | quantity in stock | sequence
13-03-2012 16:51:00 | xpto | 2 | 2 | 1
13-03-2012 16:51:00 | xpto | -2 | 0 | 2
21-03-2012 15:31:21 | zyx | 4 | 6 | 1
21-03-2012 16:20:11 | zyx | 6 | 12 | 2
22-03-2012 12:51:12 | zyx | -3 | 9 | 1
These are quantities moved in the warehouse. The problem is the first two rows, which are a reception and a return at the same time. I'm trying to write a query that gives me the stock of all items at a given time. I use max(date), but I don't get the right quantity in the result.
SELECT item, qty_in_stock
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY item_date DESC, sequence DESC) rn
FROM mytable
WHERE item_date <= @date_of_stock
) q
WHERE rn = 1
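The ROW_NUMBER approach above can be tried against the question's data. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server; the column names (item_date, qty_moved, qty_in_stock) and the ISO date strings are assumptions chosen so that text comparison orders the timestamps correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (item_date TEXT, item TEXT, qty_moved INTEGER, "
    "qty_in_stock INTEGER, sequence INTEGER)"
)
# Rows from the question, with dates rewritten as ISO strings
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        ("2012-03-13 16:51:00", "xpto", 2, 2, 1),
        ("2012-03-13 16:51:00", "xpto", -2, 0, 2),
        ("2012-03-21 15:31:21", "zyx", 4, 6, 1),
        ("2012-03-21 16:20:11", "zyx", 6, 12, 2),
        ("2012-03-22 12:51:12", "zyx", -3, 9, 1),
    ],
)

def stock_at(conn, date_of_stock):
    """Latest qty_in_stock per item at or before date_of_stock."""
    sql = """
        SELECT item, qty_in_stock
        FROM (SELECT *,
                     ROW_NUMBER() OVER (PARTITION BY item
                                        ORDER BY item_date DESC, sequence DESC) AS rn
              FROM mytable
              WHERE item_date <= ?) q
        WHERE rn = 1
        ORDER BY item
    """
    return conn.execute(sql, (date_of_stock,)).fetchall()

print(stock_at(conn, "2012-03-21 23:59:59"))
```

Ordering by sequence DESC after the date is what breaks the tie between the two xpto rows that share a timestamp.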
If you are on SQL Server 2012, there are several nice features added.
You can use the LAST_VALUE - or the FIRST_VALUE() - function, in combination with a ROWS or RANGE window frame (see OVER clause):
SELECT DISTINCT
item,
LAST_VALUE(quantity_in_stock) OVER (PARTITION BY item
ORDER BY date, sequence
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS quantity_in_stock
FROM tableX
WHERE date <= @date_of_stock
Add a where clause (before the group by) and do the summation:
select item, sum([quantity moved])
from t
where t.date <= @DESIREDDATETIME
group by item
If you pass a date without a time for the desired datetime, remember that it means midnight at the start of that day, so movements later that day are excluded.

Return min date and corresponding amount to that distinct ID

Afternoon
I am trying to return the min value/ max values in SQL Server 2005 when I have multiple dates that are the same but the values in the Owed column are all different. I've already filtered the table down by my select statement into a temp table for a different query, when I've then tried to mirror I have all the duplicated dates that you can see below.
I now have a table that looks like:
ID| Date |Owes
-----------------
1 20110901 89
1 20110901 179
1 20110901 101
1 20110901 197
1 20110901 510
2 20111001 10
2 20111001 211
2 20111001 214
2 20111001 669
My current query:
Drop Table #Temp
Select Distinct Convert(Varchar(8), DateAdd(dd, Datediff(DD,0,DateDue),0),112)as Date
,ID
,Paid
Into #Temp
From Table
Where Paid <> '0'
Select ,Id
,Date
,Max(Owed)
,Min(Owed)
From #Temp
Group by ID, Date, Paid
Order By ID, Date, Paid
This doesn't strip out any of my dates that are the same. I'm new to SQL, but I'm presuming it's because my Owed column has different values. I basically want to be able to pull back the first record, as this will always be my minimum paid, and my last record, which will always be my maximum owed, to work out my total owed by ID.
I'm new to SQL so would like to understand what I've done wrong for my future knowledge of structuring queries?
Many Thanks
In your "select into" statement, you don't have an Owed column?
GROUP BY is the normal way you "strip out values that are the same". If you group by ID and Date, you will get one row in your result for each distinct pair of values in those two columns. Each row in the results represents ALL the rows in the underlying table, and aggregate functions like MIN, MAX, etc. can pull out values.
SELECT id, date, MAX(owes) as MaxOwes, MIN(owes) as minOwes
FROM myFavoriteTable
GROUP BY id, date
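To see the GROUP BY collapse the duplicate dates, here is a sketch that loads the table from the question into SQLite via Python's sqlite3 module (a stand-in for SQL Server 2005, assumed only to make the example runnable) and runs the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myFavoriteTable (id INTEGER, date TEXT, owes INTEGER)")
# Data from the question
conn.executemany(
    "INSERT INTO myFavoriteTable VALUES (?, ?, ?)",
    [
        (1, "20110901", 89), (1, "20110901", 179), (1, "20110901", 101),
        (1, "20110901", 197), (1, "20110901", 510),
        (2, "20111001", 10), (2, "20111001", 211),
        (2, "20111001", 214), (2, "20111001", 669),
    ],
)

def min_max_owes(conn):
    """One row per (id, date) pair with the largest and smallest owes value."""
    sql = """
        SELECT id, date, MAX(owes) AS MaxOwes, MIN(owes) AS MinOwes
        FROM myFavoriteTable
        GROUP BY id, date
        ORDER BY id, date
    """
    return conn.execute(sql).fetchall()

summary = min_max_owes(conn)
print(summary)
```

Nine input rows become two output rows because only id and date are in the GROUP BY; adding the amount column to the GROUP BY, as in the original query, would bring the duplicates back.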
In SQL Server 2005 there are "windowing functions" that allow you to use aggregate functions on groups of records, without grouping. An example below. You will get one row for each row in the table:
SELECT id, date, owes,
MAX(Owes) over (PARTITION BY date, id) AS MaxOwes,
MIN(Owes) over (PARTITION BY date, id) AS MinOwes
FROM myfavoriteTable
If you name a column "MinOwes" it might sound like you're just fishing tho.
If you want to group by date you can't also group by ID, because ID is probably unique. Try:
Select Date
,Min(Owed) AS min_owed
,Max(Owed) AS max_owed
From #Temp
Group by Date
Order By Date
To get additional values from the row (your question is a bit vague there), you could use window functions, provided first_value() is available (SQL Server 2012 or later, not 2005). Ordering descending for the maximum avoids last_value(), whose default frame ends at the current row:
SELECT DISTINCT
Date
,first_value(ID) OVER (PARTITION BY Date ORDER BY Owed) AS min_owed_ID
,first_value(ID) OVER (PARTITION BY Date ORDER BY Owed DESC) AS max_owed_ID
,first_value(Owed) OVER (PARTITION BY Date ORDER BY Owed) AS min_owed
,first_value(Owed) OVER (PARTITION BY Date ORDER BY Owed DESC) AS max_owed
FROM #Temp
ORDER BY Date;