Calculate moving weather stats in PostgreSQL - sql

I'm trying to calculate the days since last rain and the amount of rain in that event for each day in my PostgreSQL table of weather data. I've been trying to achieve this with window functions but the limitation of ranges having to be unbounded has left me a bit stuck on how to proceed.
Here's the query I have so far:
SELECT
station_num,
ob_date,
rain,
max(rain) OVER (PARTITION BY station_num ORDER BY ob_date ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as prev_rain_mm,
'' as days_since_rain --haven't attempted this calculation yet
FROM
obs_daily_ground_moisture
This results in the following:
but I'm trying to achieve something more like this:
I feel like all the pieces are there in regards to window functions range & filter and nested queries but I'm not sure how to pull it all together. Also the above data is just a subset of the actual dataset, the entire dataset is just over half a million rows.

The key here is to group the observations starting from the first occurrence of rain>0 value to the next occurrence of rain>0 value. Thereafter you can use window functions to calculate the needed columns.
select
x.station_num,
x.ob_date,
max(rain) over(partition by station_num,col) prev_rain,
case when rain > 0 then 0
else row_number() over(partition by station_num, col order by ob_date)-1 end days_since_rain
from (select t.*,
sum(case when rain > 0 then 1 else 0 end) over(partition by station_num order by ob_date) col
from t) x
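The grouping idea above can be tried on a tiny dataset. The following is only a sketch: it runs the same query through Python's built-in sqlite3 module rather than PostgreSQL, and the table name t and the six sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical sample of daily observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (station_num INTEGER, ob_date TEXT, rain REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 0.0),
        (1, "2020-01-02", 5.0),
        (1, "2020-01-03", 0.0),
        (1, "2020-01-04", 0.0),
        (1, "2020-01-05", 2.5),
        (1, "2020-01-06", 0.0),
    ],
)

# col is a running count of rainy days, so every rain event and the dry
# days that follow it share one group number.
query = """
SELECT x.station_num,
       x.ob_date,
       MAX(rain) OVER (PARTITION BY station_num, col) AS prev_rain,
       CASE WHEN rain > 0 THEN 0
            ELSE ROW_NUMBER() OVER (PARTITION BY station_num, col
                                    ORDER BY ob_date) - 1
       END AS days_since_rain
FROM (SELECT t.*,
             SUM(CASE WHEN rain > 0 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY station_num ORDER BY ob_date) AS col
      FROM t) x
ORDER BY x.station_num, x.ob_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Days before the first recorded rain fall into group 0, so they report prev_rain = 0 and a counter that starts at 0 on the first observation.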

Try this (note that the variable syntax below is T-SQL rather than PostgreSQL).
DECLARE @Rain AS FLOAT
UPDATE A
SET
@Rain = CASE WHEN A.Rain = 0 THEN @Rain ELSE A.Rain END,
A.Rain = CASE WHEN @Rain IS NULL OR A.Rain <> 0 THEN A.Rain ELSE @Rain END
FROM obs_daily_ground_moisture A
SELECT ob_date, Rain,
max(rain) OVER (PARTITION BY station_num ORDER BY ob_date ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as prev_rain_mm,
ROW_NUMBER() OVER(PARTITION BY Rain ORDER BY ob_date) - 1 as days_since_rain
FROM obs_daily_ground_moisture ORDER BY ob_date

Related

Create partitions based on column values in sql

I am very new to SQL and query writing, and after a lot of trying I am asking for help.
As shown in the picture, I want to partition the data based on is_late = 1 and show its count (that is, 2), but at the same time capture the value of last_status where is_late = 0, all displayed in a single row.
The task is to calculate how many times the rider was late and the time he took from the first occurrence of the estimated time to last_status.
Desired output:
You can use the following query:
SELECT
rider_id,
task_created_time,
expected_time_to_arrive,
is_late,
last_status,
task_count,
CONVERT(VARCHAR(5), DATEADD(MINUTE, DATEDIFF(MINUTE, expected_time_to_arrive, last_status), 0), 114) AS time_delayed
FROM
(SELECT
rider_id,
task_created_time,
expected_time_to_arrive,
is_late,
SUM(CASE WHEN is_late = 1 THEN 1 ELSE 0 END) OVER(PARTITION BY rider_id ORDER BY rider_id) AS task_count,
ROW_NUMBER() OVER(PARTITION BY rider_id ORDER BY rider_id) AS num,
MAX(last_status) OVER(PARTITION BY rider_id ORDER BY rider_id) AS last_status
FROM myTestTable) t
WHERE num = 1

Retain function in SQL

I have a scenario where I need to retain values based on a condition.
Assign "Change date" to "AA start Date", then check "Change in Duration".
If "Change in Duration" < 60, the first Change date is retained until the next row where Change in Duration > 60;
at that point the new Change date is retained in "AA start date". A sample is given below.
"AA_START_DATE" is the final column I am looking for.
I think this is a type of gap and islands problem, where you want to remember the first date in sequence that is 60+ days from the previous date.
You can handle this by using lag() to get the previous date. Then use a cumulative conditional maximum to get the time when the most recent change occurred:
select t.*,
max(case when prev_change_date > date_add(change_date, interval -60 day) then null else change_date
end) over (partition by cn, aa_code
order by change_date
) as aa_start_date
from (select t.*,
lag(change_date) over (partition by cn, aa_code order by change_date) as prev_change_date
from t
) t
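To see the lag() plus cumulative conditional maximum idea in action, here is a small sketch run through Python's sqlite3 module. It is an illustration under assumptions: the sample rows are invented, the comparison uses prev_change_date as described in the text above, and SQLite's date(change_date, '-60 days') stands in for date_add(), which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cn INTEGER, aa_code TEXT, change_date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "A", "2020-01-01"),
        (1, "A", "2020-01-20"),
        (1, "A", "2020-02-15"),
        (1, "A", "2020-06-01"),  # more than 60 days after the previous row
        (1, "A", "2020-06-10"),
    ],
)

# A row starts a new island when it has no previous row or when the
# previous change is 60 or more days older; the running MAX carries the
# most recent island start forward (NULLs are ignored by MAX).
query = """
SELECT change_date,
       MAX(CASE WHEN prev_change_date > date(change_date, '-60 days')
                THEN NULL ELSE change_date END)
           OVER (PARTITION BY cn, aa_code ORDER BY change_date) AS aa_start_date
FROM (SELECT t.*,
             LAG(change_date) OVER (PARTITION BY cn, aa_code
                                    ORDER BY change_date) AS prev_change_date
      FROM t) t
ORDER BY change_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The ISO date strings compare correctly as text, which is why a plain MAX works here; with a real date type the same query shape applies.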
I achieved this as follows; any alternative approach would be appreciated.
(select CIN,AA_CODE,EXT_AA_MESSAGE,CHG_DATE,EXP_DATE,PREV_CHG_DATE,CHG_DATE_DUR,AA_START_DTE,
sum(case when AA_START_DTE is null then 0 else 1 end) over (partition by CIN,AA_CODE order by CHG_DATE) as value_partition from (select CIN,AA_CODE,EXT_AA_MESSAGE,CHG_DATE,EXP_DATE,
case when CIN_AA_CODE_FIRST = 1 then NULL else PREV_CHG_DATE end as PREV_CHG_DATE,
case when CIN_AA_CODE_FIRST <> 1 then date_diff(CHG_DATE,PREV_CHG_DATE,DAY) end as CHG_DATE_DUR,
case when CIN_AA_CODE_FIRST = 1 or (case when CIN_AA_CODE_FIRST <> 1 then date_diff(CHG_DATE,PREV_CHG_DATE,DAY) end ) > 60 then CHG_DATE
end as AA_START_DTE,
from
( select CIN,AA_CODE,EXT_AA_MESSAGE,CHG_DATE,EXP_DATE, LAG(CHG_DATE) OVER (PARTITION BY CIN,AA_CODE ORDER BY CHG_DATE ASC) as PREV_CHG_DATE,
rank() OVER (PARTITION BY CIN,AA_CODE ORDER BY CHG_DATE ASC) AS CIN_AA_CODE_FIRST from TABLE )))

Snowflake SQL UDFs: SELECT TOP N, LIMIT, ROW_NUMBER() AND RANK() not working in subqueries?

I've been experimenting with Snowflake SQL UDFs to add a desired number of working days to a timestamp. I've been trying to define a function that takes a timestamp and the number of working days to add as parameters and returns a date. The function uses a date dimension table. It works when I pass it a single date as a parameter, but whenever I try to give it a full column of dates, it throws the error "Unsupported subquery type cannot be evaluated". This seems to happen whenever I use SELECT TOP N, LIMIT, ROW_NUMBER() or RANK() in a subquery.
Here is an example of an approach I tried:
CREATE OR REPLACE FUNCTION "ADDWORKINGDAYSTOWORKINGDAY"(STARTDATE TIMESTAMP_NTZ, DAYS NUMBER)
RETURNS DATE
LANGUAGE SQL
AS '
WITH CTE AS (
SELECT PAIVA
FROM EDW_DEV.REPORTING_SCHEMA."D_PAIVA" P
WHERE ARKIPAIVA = 1 AND ARKIPYHA_FI = FALSE
AND 1 = CASE WHEN DAYS < 0 AND P.PAIVA < TO_DATE(STARTDATE) THEN 1
WHEN DAYS < 0 AND P.PAIVA >= TO_DATE(STARTDATE) THEN 0
WHEN DAYS >= 0 AND P.PAIVA > TO_DATE(STARTDATE) THEN 1
ELSE 0
END),
CTE2 AS (
SELECT
PAIVA
,CASE WHEN DAYS >= 0 THEN RANK() OVER
(ORDER BY PAIVA)
ELSE RANK() OVER
(ORDER BY PAIVA DESC)
END AS RANK
FROM CTE
ORDER BY RANK)
SELECT TOP 1
ANY_VALUE (CASE WHEN DAYS IS NULL OR TO_DATE(STARTDATE) IS NULL THEN NULL
WHEN DAYS = 0 THEN TO_DATE(STARTDATE)
ELSE PAIVA
END) AS PAIVA
FROM CTE2
WHERE CASE WHEN DAYS IS NULL OR TO_DATE(STARTDATE) IS NULL THEN 1 = 1
WHEN DAYS > 0 THEN RANK = DAYS
WHEN DAYS = 0 THEN 1 = 1
ELSE RANK = -DAYS
END
';
UDFs are scalar. They will return only one value of a specified type, in this case a date. If you want to return a set of values for a column, you may want to investigate UDTFs, User Defined Table Functions, which return a table.
https://docs.snowflake.net/manuals/sql-reference/udf-table-functions.html
With a bit of modification to your UDF, you can convert it to a UDTF. You can pass it columns instead of scalar values. You can then join the table resulting from the UDTF with the base table to get the work day addition values.

Reset rolling sum to 0 after reaching the threshold

I'm trying to compute a running total and reset it to 0 based on 2 conditions or if the limit is reached.
Here is an example.
As in the image above, I need to get the running total while the following conditions are met:
monthly discount = 0 and monthly ticket = 1
If either discount = 1 or ticket = 0, the next value of the running total has to be 0.
running_total < 50
If the running total is >= 50, it has to restart from the value on the same row.
Is there any way to do this in Hive? Thank you so much!
Here is what I'm trying to do now:
SELECT * ,
SUM(tag_flg) OVER (PARTITION BY account, flg_sum
ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS running_sum
FROM
( SELECT * ,
SUM(CASE
WHEN tag_flg>=50 THEN value
ELSE tag_flg
END) OVER (PARTITION BY account
ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS flg_sum
FROM
( SELECT * ,
CASE
WHEN month_disc =0
AND month_ticket = 1 THEN value
ELSE 0
END AS tag_flg
FROM source_table) x) y
Do the 40, 60 and 20 that aren't being accounted for matter at all in your report? Like would you want them to be counted then a new row added with a total of 0 to restart?
Here is the way I managed to do it:
SELECT *,
SUM(case when month_disc=1 OR month_ticket=0 then 0 else value end) OVER (PARTITION BY account, flg_sum, band_sum ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
FROM (
SELECT *,
FLOOR(SUM(case when month_disc=1 OR month_ticket=0 then 0 else value end) OVER (PARTITION BY account, flg_sum ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)/50.000001) as band_sum ---- create bands for running total
FROM (
SELECT *,
SUM(tag_flg) OVER (PARTITION BY account ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS flg_sum
FROM (
SELECT *,
CASE WHEN (month_disc=1 OR month_ticket=0) THEN 1 ELSE 0 END AS tag_flg ---- flag to count when the value is reset due to one of the conditions
FROM source_table) x ) y) z
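For checking a SQL result by hand, the reset rule from the question can also be written as a plain loop. This is only a sketch of one reading of that rule: the function name, the tuple layout (value, month_disc, month_ticket) and the sample rows are hypothetical, and it assumes the rows of one account are already sorted by date.

```python
def running_total_with_reset(rows, limit=50):
    """Return the running total per row, given (value, month_disc, month_ticket) tuples."""
    total = 0
    result = []
    for value, month_disc, month_ticket in rows:
        if month_disc == 1 or month_ticket == 0:
            # Condition broken: the running total drops to 0.
            total = 0
        elif total + value >= limit:
            # Limit reached: restart from the value on this row.
            total = value
        else:
            total += value
        result.append(total)
    return result


# Hypothetical sample for a single account, already ordered by date.
sample = [
    (20, 0, 1),
    (20, 0, 1),
    (20, 0, 1),  # 60 would reach the limit, so restart at 20
    (10, 1, 1),  # discount month, so reset to 0
    (30, 0, 1),
    (30, 0, 1),  # 60 would reach the limit, so restart at 30
]
result = running_total_with_reset(sample)
print(result)
```

Bucketing a cumulative sum into bands of 50, as in the query above, may give different numbers from this sequential reset for some inputs (for example four consecutive values of 30), so it is worth comparing both against the expected output.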

SUM OVER with GROUP BY

I am working on a large database with millions of rows and I am trying to be efficient in my queries. The database contains regular snapshots of a loan portfolio where sometimes loans default (status goes from '1' to <>'1'). When they do, they appear only once in the corresponding snapshot, then they are no longer reported. I am trying to get a cumulative count of such loans - as they develop over time and divided into many buckets depending on country of origin, vintage, etc.
SUM (...) OVER seems to be a very efficient way to achieve this, but when I run the following query:
Select
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate
I get
SQL Error (8120) column aa27 is invalid in the selected list because it is not contained in either an aggregate function or the GROUP BY clause
Can anyone shed some light? Thanks
I believe you want:
Select assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(SUM(aa27)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(SUM(aa26)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22 >= '2014-01' and aa22 <= '2014-12' and vintage = '2015' and
active = 0 and aa74 <> '1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate;
Note the SUM(SUM()) in the cumulative sum expressions.
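The nesting works because the inner SUM() is the GROUP BY aggregate and the outer SUM() ... OVER is a window function evaluated over the grouped rows. Below is a minimal sketch of that pattern, run through Python's sqlite3 module instead of SQL Server; the loans table and its five rows are made up, and only two of the original columns are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (poolcutoffdate TEXT, aa27 INTEGER)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [
        ("2014-01", 100),
        ("2014-01", 50),
        ("2014-02", 30),
        ("2014-03", 20),
        ("2014-03", 10),
    ],
)

# Inner COUNT/SUM: one value per poolcutoffdate group.
# Outer SUM ... OVER: running total across those grouped rows.
query = """
SELECT poolcutoffdate,
       COUNT(*)  AS loans,
       SUM(aa27) AS cur_bal,
       SUM(COUNT(*))  OVER (ORDER BY poolcutoffdate
                            ROWS UNBOUNDED PRECEDING) AS loan_count_cumul,
       SUM(SUM(aa27)) OVER (ORDER BY poolcutoffdate
                            ROWS UNBOUNDED PRECEDING) AS cur_bal_cumul
FROM loans
GROUP BY poolcutoffdate
ORDER BY poolcutoffdate
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row carries both the per-snapshot figures and the totals accumulated up to that snapshot.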
This is what I found to be working, comparing my results with some external research data.
I have simplified the fields for readability:
select
poolcutoffdate,
count(1) as LoanCount,
MAX(sum(case status when 'default' then 1 else 0 end))
over (order by poolcutoffdate
ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults
from myDatabase
group by poolcutoffdate
order by poolcutoffdate asc
I am thus counting all loans that have been in the 'default' status at least once from inception to the current cutoff date.
Note the use of MAX(SUM()) so that the result is the largest of the various iterations from the first to the current row. Using SUM(SUM()) would add the various iterations together, leading to a cumulative of cumulatives.
I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the tally restarts from 0 and does not add from the previous cutoff date, but this would only include loans from the latest cutoff, so a loan that had defaulted and been removed from the pool would wrongly not be counted.
Note the CASE inside the windowed aggregate.
Thanks for all the help