Moving average within groups that returns NULL if any row is NULL (Snowflake - SQL) - sql

I need to calculate the moving average of a column per group (partitioned by id). The only twist is that I need the result to be NULL if any value in the corresponding window is NULL.
Example of expected behaviour (for a given id and window size=3):
A    | mov_ave_A
-----+----------
NULL | NULL
1    | NULL
1    | NULL
1    | 1
4    | 2
The first 3 rows of the moving average are NULL, because the first value (which is included in the first 3 windows) is NULL. Row 4 of mov_ave_A is equal to 1 because it's the average of rows 2 to 4 of A, and so on.
I tried:
CASE WHEN SUM(CASE WHEN a IS NULL THEN 1 ELSE 0 END) = 0 THEN AVG(a) ELSE NULL END
OVER (
PARTITION BY id
ORDER BY date_month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS mov_ave_A
but I get
"Sliding window frame unsupported for function CASE".
Also, I'd really like the solution to be short and simple as I need to create 6 such columns. So, I'll have to repeat the logic 6 times.

The issue with your query is that the OVER clause comes after the END, so it is applied to the CASE expression rather than to the aggregates. I believe this should work. Each window function needs its own OVER clause, so once for COUNT and once for AVG. COUNT is an easier way to check for NULLs than SUM, because COUNT(a) only counts non-NULL values.
SELECT
*
,CASE
/*Check for 3 values in a, if so return the rolling AVG value. Implicit ELSE NULL*/
WHEN COUNT(a) OVER (PARTITION BY ID ORDER BY date_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
THEN AVG(a) OVER (PARTITION BY ID ORDER BY date_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
END AS mov_ave_A
FROM YourTable
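To check the COUNT/AVG pattern against the example from the question, here is a minimal sketch that runs it in SQLite through Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The table name t, the integer date_month values and the named WINDOW clause are assumptions for illustration; the named window is SQLite syntax used here to avoid repeating the frame, and I have not verified it on Snowflake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the example column A for a single id.
conn.execute("CREATE TABLE t (id INTEGER, date_month INTEGER, a REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, None), (1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 4)],
)

query = """
SELECT CASE WHEN COUNT(a) OVER w = 3 THEN AVG(a) OVER w END AS mov_ave_a
FROM t
WINDOW w AS (PARTITION BY id ORDER BY date_month
             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY id, date_month
"""
# COUNT(a) skips NULLs, so it equals 3 only for a full window without NULLs.
mov_ave = [row[0] for row in conn.execute(query)]
print(mov_ave)  # [None, None, None, 1.0, 2.0]
```

The first two rows are also NULL because their windows hold fewer than three rows, which matches the expected output in the question.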

Use the following case expression:
CASE WHEN COUNT(a) OVER (
PARTITION BY id
ORDER BY date_month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) = 3 THEN AVG(a) OVER (
PARTITION BY id
ORDER BY date_month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) END AS mov_avg

Related

SQL query to allow for latest datasets per items

I have this table in an SQL server database:
and I would like a query that gives me the values of cw1, cw2,cw3 for a restricted date condition.
I would like a query that returns the "latest" values of cw1, cw2 and cw3 within a date range, falling back to the most recent previous value of a column if it is null on the last plan_date.
So if the condition is plan_date between "02.01.2020" and "04.01.2020" then the result should be
1 04.01.2020 null, 9, 4
2 03.01.2020 30 , 15, 2
where, for example, the "30" is from the last previous date for item_nr 2.
You can get the last value using first_value(). Unfortunately, that is a window function, but select distinct solves that:
select distinct item_nr,
first_value(cw1) over (partition by item_nr
order by (case when cw1 is not null then 1 else 2 end), plan_date desc
) as imputed_cw1,
first_value(cw2) over (partition by item_nr
order by (case when cw2 is not null then 1 else 2 end), plan_date desc
) as imputed_cw2,
first_value(cw3) over (partition by item_nr
order by (case when cw3 is not null then 1 else 2 end), plan_date desc
) as imputed_cw3
from t;
You can add a where clause after the from.
The first_value() window function returns the first value from each partition. The partition is ordered to put the non-NULL values first, and then order by time descending. So, the most recent non-NULL value is first.
The only downside is that it is a window function, so the select distinct is needed to get the most recent value for each item_nr.
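As a rough check of how the ordering trick behaves, the sketch below runs the same query in SQLite via Python's sqlite3 module. The original table was not shown, so the rows here are hypothetical values chosen to reproduce the expected result from the question, and the dates are stored as ISO strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (item_nr INTEGER, plan_date TEXT, "
    "cw1 INTEGER, cw2 INTEGER, cw3 INTEGER)"
)
# Hypothetical sample rows (the question's table is not available).
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-02", None, 9, None),
        (1, "2020-01-04", None, None, 4),
        (2, "2020-01-02", 30, 10, 1),
        (2, "2020-01-03", None, 15, 2),
    ],
)

# Non-NULL values sort first, then newest first, so first_value() picks
# the most recent non-NULL value (or NULL if the column is always NULL).
column_sql = ",\n".join(
    f"first_value({c}) OVER (PARTITION BY item_nr "
    f"ORDER BY (CASE WHEN {c} IS NOT NULL THEN 1 ELSE 2 END), plan_date DESC"
    f") AS imputed_{c}"
    for c in ("cw1", "cw2", "cw3")
)
query = f"""
SELECT DISTINCT item_nr,
{column_sql}
FROM t
WHERE plan_date BETWEEN '2020-01-02' AND '2020-01-04'
ORDER BY item_nr
"""
latest = conn.execute(query).fetchall()
print(latest)  # [(1, None, 9, 4), (2, 30, 15, 2)]
```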

SQL Query getting the latest record of the Group and calculate the value of those particular records

I have the following table (just a sample) and would like to get the difference in Points between Record 2 and Record 1 (Record 2 - Record 1), using the latest record of each. The records are entered per Match: one Match consists of 2 records, Record 1 and Record 2.
The output should be 3, as the newest records are ID 3 and 4 from Match 2.
ID | Name     | Points | TimeRecorded | Match
---+----------+--------+--------------+------
1  | Record 1 | 3      | 2-Mar 2pm    | 1
2  | Record 2 | 5      | 2-Mar 2pm    | 1
3  | Record 1 | 5      | 4-Mar 5pm    | 2
4  | Record 2 | 8      | 4-Mar 5pm    | 2
I tried to get the value by subtracting two subqueries, as below. But I feel this is not a good way, as the Match and the Name of the record are hard-coded. How can I construct a better query that gets the latest record of each name in the latest match and calculates the points by subtracting Record 1 from Record 2?
SELECT
(select Points from RunRecord where Name= 'Record2' AND Match = 2)
- (select Points from RunRecord where Name= 'Record1' AND Match = 2)
You could use:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY TimeRecorded DESC) rn
FROM yourTable
)
SELECT
MAX(CASE WHEN Name = 'Record 2' THEN Points END) -
MAX(CASE WHEN Name = 'Record 1' THEN Points END) AS diff
FROM cte
WHERE rn = 1;
The CTE assigns a row number for each group of records of the same name, with 1 being assigned to the most recent record. Then, we aggregate over the entire table and pivot out the points to find the difference.
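For what it's worth, this is how the CTE version behaves on the sample table when run in SQLite through Python's sqlite3 module. The timestamps are written as ISO strings with an arbitrary year, which is an assumption made only so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE yourTable (ID INTEGER, Name TEXT, Points INTEGER, '
    'TimeRecorded TEXT, "Match" INTEGER)'
)
# Sample rows from the question; the year 2000 is a placeholder.
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Record 1", 3, "2000-03-02 14:00", 1),
        (2, "Record 2", 5, "2000-03-02 14:00", 1),
        (3, "Record 1", 5, "2000-03-04 17:00", 2),
        (4, "Record 2", 8, "2000-03-04 17:00", 2),
    ],
)

query = """
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY TimeRecorded DESC) rn
    FROM yourTable
)
SELECT
    MAX(CASE WHEN Name = 'Record 2' THEN Points END) -
    MAX(CASE WHEN Name = 'Record 1' THEN Points END) AS diff
FROM cte
WHERE rn = 1
"""
# rn = 1 keeps ID 3 (Record 1, 5 points) and ID 4 (Record 2, 8 points).
diff = conn.execute(query).fetchone()[0]
print(diff)  # 3
```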
You can use the rank() window function to rank the records by match descending. Then take the top of the ranked records and use conditional aggregation to control the sign of the points added.
SELECT sum(CASE x.name
WHEN 'Record2' THEN
x.points
WHEN 'Record1' THEN
-x.points
END)
FROM (SELECT rr.name,
rr.points,
rank() OVER (ORDER BY rr.match DESC) r
FROM runrecord rr
WHERE name IN ('Record1',
'Record2')) x
WHERE x.r = 1;

How to identify rows per group before a certain value gap?

I'd like to update a certain column in a table based on the difference in another column's value between neighboring rows in PostgreSQL.
Here is a test setup:
CREATE TABLE test(
main INTEGER,
sub_id INTEGER,
value_t INTEGER);
INSERT INTO test (main, sub_id, value_t)
VALUES
(1,1,8),
(1,2,7),
(1,3,3),
(1,4,85),
(1,5,40),
(2,1,3),
(2,2,1),
(2,3,1),
(2,4,8),
(2,5,41);
My goal is to determine in each group main starting from sub_id 1 which value in diff exceeds a certain threshold (e.g. <10 or >-10) by checking in ascending order by sub_id. Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.
Should I use a loop or are there smarter solutions?
The task description in pseudocode:
FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
DO until diff > 10 OR diff <-10
SET newval = 1 AND LEAD(newval) = 1
Basic SELECT
As fast as possible:
SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub;
Fine points
Your thought model revolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).
To avoid NULL for the first row in the partition, provide value_t as default as 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.
diff BETWEEN -10 AND 10 is slightly faster than @ diff < 11 (clearer and more flexible, too). (@ being the "absolute value" operator, equivalent to the abs() function.)
bool_or() or bool_and() in the outer window function is probably cheapest to mark all rows up to the big gap.
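To see which rows the SELECT flags on the test data, here is a small sketch that runs it in SQLite via Python's sqlite3 module. SQLite has no bool_and(), so MIN() over the 0/1 result of the BETWEEN test is swapped in as a stand-in; that substitution is my assumption and not part of the PostgreSQL query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (main INTEGER, sub_id INTEGER, value_t INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, 1, 8), (1, 2, 7), (1, 3, 3), (1, 4, 85), (1, 5, 40),
        (2, 1, 3), (2, 2, 1), (2, 3, 1), (2, 4, 8), (2, 5, 41),
    ],
)

query = """
SELECT main, sub_id,
       -- running MIN over 0/1 values stands in for PostgreSQL's bool_and()
       MIN(diff BETWEEN -10 AND 10)
           OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
    SELECT *,
           value_t - lag(value_t, 1, value_t)
               OVER (PARTITION BY main ORDER BY sub_id) AS diff
    FROM test
) sub
ORDER BY main, sub_id
"""
flags = conn.execute(query).fetchall()
for row in flags:
    print(row)  # flag stays 1 up to and including the row before the big gap
```

With this data the flag is 1 for sub_id 1 to 3 in group 1 and for sub_id 1 to 4 in group 2, and 0 afterwards.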
Your UPDATE
Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.
Again, as fast as possible.
UPDATE test AS t
SET newval = 1
FROM (
SELECT main, sub_id
, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT main, sub_id
, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub
) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND u.flag;
Fine points
Computing all values in a single query is typically substantially faster than a correlated subquery.
The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.
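A comparable update can be tried in SQLite through Python's sqlite3 module. This is only a sketch under assumptions: a row-value IN subquery replaces the PostgreSQL UPDATE ... FROM join, MIN() over 0/1 values replaces bool_and(), newval IS NOT 1 plays the role of IS DISTINCT FROM 1, and the newval column is added to the test table here because the original setup does not define it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (main INTEGER, sub_id INTEGER, value_t INTEGER, newval INTEGER)"
)
conn.executemany(
    "INSERT INTO test (main, sub_id, value_t) VALUES (?, ?, ?)",
    [
        (1, 1, 8), (1, 2, 7), (1, 3, 3), (1, 4, 85), (1, 5, 40),
        (2, 1, 3), (2, 2, 1), (2, 3, 1), (2, 4, 8), (2, 5, 41),
    ],
)

conn.execute("""
UPDATE test
SET newval = 1
WHERE newval IS NOT 1              -- skip rows that already hold the value
  AND (main, sub_id) IN (
      SELECT main, sub_id
      FROM (
          SELECT main, sub_id,
                 MIN(diff BETWEEN -10 AND 10)
                     OVER (PARTITION BY main ORDER BY sub_id) AS flag
          FROM (
              SELECT main, sub_id,
                     value_t - lag(value_t, 1, value_t)
                         OVER (PARTITION BY main ORDER BY sub_id) AS diff
              FROM test
          ) sub
      ) u
      WHERE u.flag = 1
  )
""")

newvals = [
    row[0] for row in conn.execute("SELECT newval FROM test ORDER BY main, sub_id")
]
print(newvals)  # [1, 1, 1, None, None, 1, 1, 1, 1, None]
```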
Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.
Anyhow, here's your solution:
WITH data AS (
SELECT main, sub_id, value_t
, abs(value_t
- lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
FROM test
)
SELECT main, sub_id, value_t
, CASE max(is_evil::int)
OVER (PARTITION BY main ORDER BY sub_id
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
WHEN 1 THEN NULL ELSE 1 END newval
FROM data;
I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.
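The lead()-based variant can be checked the same way. The sketch below runs it in SQLite via Python's sqlite3 module; the only change I made is dropping the ::int cast, since SQLite comparisons already return 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (main INTEGER, sub_id INTEGER, value_t INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, 1, 8), (1, 2, 7), (1, 3, 3), (1, 4, 85), (1, 5, 40),
        (2, 1, 3), (2, 2, 1), (2, 3, 1), (2, 4, 8), (2, 5, 41),
    ],
)

query = """
WITH data AS (
    SELECT main, sub_id, value_t,
           abs(value_t - lead(value_t)
                   OVER (PARTITION BY main ORDER BY sub_id)) > 10 AS is_evil
    FROM test
)
SELECT main, sub_id,
       CASE max(is_evil)
                OVER (PARTITION BY main ORDER BY sub_id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
            WHEN 1 THEN NULL ELSE 1 END AS newval
FROM data
ORDER BY main, sub_id
"""
# The frame ends one row before the current row, so the row just before
# the gap is still marked with 1 and only later rows get NULL.
lead_newvals = [row[2] for row in conn.execute(query)]
print(lead_newvals)  # [1, 1, 1, None, None, 1, 1, 1, 1, None]
```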
EXISTS on an aggregating subquery:
UPDATE test u
SET value_t = NULL
WHERE EXISTS (
SELECT * FROM (
SELECT main,sub_id
, value_t , ABS(value_t - lag(value_t)
OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
FROM test
) x
WHERE x.main = u.main
AND x.sub_id <= u.sub_id
AND x.absdiff >= 10
)
;
SELECT * FROM test
ORDER BY main, sub_id;
Result:
UPDATE 3
main | sub_id | value_t
------+--------+---------
1 | 1 | 8
1 | 2 | 7
1 | 3 | 3
1 | 4 |
1 | 5 |
2 | 1 | 3
2 | 2 | 1
2 | 3 | 1
2 | 4 | 8
2 | 5 |
(10 rows)

Oracle SQL - select last 3 rows after a specific row

Below is my data:
My requirement is to get the first 3 consecutive approvals. So from above data, ID 4, 5 and 6 are the rows that I need to select. ID 1 and 2 are not eligible, because ID 3 is a rejection and hence breaks the consecutive condition of actions. Basically, I am looking for the last rejection in the list and then finding the 3 consecutive approvals after that.
Also, if there are no rejections in the chain of actions then the first 3 actions should be the result. For below data:
So my output should be ID 11, 12 and 13.
And if there are less than 3 approvals, then the output should be the list of approvals. For below data:
output should be ID 21 and 22.
Is there any way to achieve this with SQL query only - i.e. no PL-SQL code?
Here is one method that uses window functions:
Find the first row where there are three approvals.
Find the minimum action_at among the rows with three approvals
Filter
Keep the three rows you want
This version uses fetch which is in Oracle 12+:
select t.*
from (select t.*,
min(case when has_approval_3 = 3 then action_at end) over () as first_action_at
from (select t.*,
sum(case when action = 'APPROVAL' then 1 else 0 end) over (order by action_at rows between current row and 2 following) as has_approval_3
from t
) t
) t
where action = 'APPROVAL' and
(action_at >= first_action_at or first_action_at is null)
order by action_at
fetch first 3 rows only;
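Since the sample data was not included, here is a sketch with made-up rows that mirror the first scenario (two approvals, a rejection, then further approvals), run in SQLite through Python's sqlite3 module. The column names id, action and action_at, the action values and the dates are assumptions, and LIMIT 3 replaces Oracle's fetch first 3 rows only because SQLite lacks that syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, action TEXT, action_at TEXT)")
# Hypothetical data: IDs 1-2 approved, ID 3 rejected, IDs 4-7 approved.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "APPROVAL", "2020-01-01"),
        (2, "APPROVAL", "2020-01-02"),
        (3, "REJECT", "2020-01-03"),
        (4, "APPROVAL", "2020-01-04"),
        (5, "APPROVAL", "2020-01-05"),
        (6, "APPROVAL", "2020-01-06"),
        (7, "APPROVAL", "2020-01-07"),
    ],
)

query = """
select t.id
from (select t.*,
             min(case when has_approval_3 = 3 then action_at end) over () as first_action_at
      from (select t.*,
                   sum(case when action = 'APPROVAL' then 1 else 0 end)
                       over (order by action_at
                             rows between current row and 2 following) as has_approval_3
            from t
           ) t
     ) t
where action = 'APPROVAL' and
      (action_at >= first_action_at or first_action_at is null)
order by action_at
limit 3
"""
# ID 4 is the first row whose 3-row look-ahead window holds 3 approvals.
ids = [row[0] for row in conn.execute(query)]
print(ids)  # [4, 5, 6]
```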
You can use a scalar subquery and the ROW_NUMBER analytic function as follows:
SELECT * FROM
( SELECT
Y.*,
ROW_NUMBER() OVER(ORDER BY Y.ACTION_AT) AS RN
FROM YOUR_TABLE Y
WHERE Y.ACTION = 'APPROVE'
AND Y.ACTION_AT >= COALESCE(
(SELECT MAX(YIN.ACTION_AT)
FROM YOUR_TABLE YIN
WHERE YIN.ACTION = 'REJECT'
), Y.ACTION_AT) )
WHERE RN <= 3;
Cheers!!

HIVE/Impala query: Count the number of rows between rows fulfilling certain conditions

I need to count the number of rows that fulfill certain conditions contained in intervals defined by other rows that fulfill other conditions. Examples: the number of rows N between 'Reference' having values 1 and 4 that fulfill the condition 'Other_condition' = b is N=1, the number of rows N between 'Reference' having values 2 and 5 that fulfill the condition 'Other_condition' = b is N=2 etc.
Date Reference Other_condition
20171111 1 a
20171112 2 a
20171113 3 b
20171114 4 b
20171115 5 b
I'm accessing the database through Hive/Impala SQL queries and unfortunately I have no idea where to start implementing such a window function. A half-pseudocode version of what I want would be something like:
SELECT COUNT (DISTINCT database.Date) AS counter, Other_condition, reference
FROM database
WHERE database.Other_condition = a AND database.Reference BETWEEN
(window function condition 1: database.Reference = 2) AND
(window function condition 2: database.Reference = 5)
GROUP BY counter
Your question is rather hard to follow. I get the first condition, which is the number of rows between "1" and "4".
Here is one method that should be pretty easy to generalize:
select (max(case when reference = 4 then seqnum end) -
max(case when reference = 1 then seqnum end)
) as num_rows_1_4
from (select t.*,
row_number() over (order by date) as seqnum
from t
) t;
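To make the result concrete, the sketch below runs this query on the five sample rows in SQLite via Python's sqlite3 module (SQLite is an assumption here; the question targets Hive/Impala). Note that the query measures the positional distance between the two reference rows and does not filter on Other_condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date INTEGER, reference INTEGER, other_condition TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (20171111, 1, "a"),
        (20171112, 2, "a"),
        (20171113, 3, "b"),
        (20171114, 4, "b"),
        (20171115, 5, "b"),
    ],
)

query = """
select (max(case when reference = 4 then seqnum end) -
        max(case when reference = 1 then seqnum end)
       ) as num_rows_1_4
from (select t.*,
             row_number() over (order by date) as seqnum
      from t
     ) t
"""
# Reference 4 is the 4th row and reference 1 is the 1st row by date.
num_rows_1_4 = conn.execute(query).fetchone()[0]
print(num_rows_1_4)  # 3
```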