SUM OVER with GROUP BY - sql

I am working on a large database with millions of rows and I am trying to be efficient in my queries. The database contains regular snapshots of a loan portfolio where sometimes loans default (status goes from '1' to <>'1'). When they do, they appear only once in the corresponding snapshot, then they are no longer reported. I am trying to get a cumulative count of such loans as it develops over time, divided into many buckets depending on country of origin, vintage, etc.
SUM (...) OVER seems to be a very efficient function to achieve the result but when I run the following query
Select
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate
I get
SQL Error (8120) column aa27 is invalid in the selected list because it is not contained in either an aggregate function or the GROUP BY clause
Can anyone shed some light? Thanks

I believe you want:
Select assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(SUM(aa27)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(SUM(aa26)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22 >= '2014-01' and aa22 <= '2014-12' and vintage = '2015' and
active = 0 and aa74 <> '1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate;
Note the SUM(SUM()) in the cumulative sum expressions.
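To see why the nesting matters, here is a minimal sketch on a toy table. It assumes SQLite (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for the actual database; the table `loans` and its columns are hypothetical. The inner aggregate is evaluated per GROUP BY group, and the outer window SUM then accumulates those per-group results.

```python
import sqlite3

# Hypothetical toy data: one row per loan per snapshot (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (cutoff TEXT, bal REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("2015-01", 100.0), ("2015-01", 50.0), ("2015-02", 30.0),
     ("2015-03", 20.0), ("2015-03", 5.0)],
)

# COUNT(*) and SUM(bal) are computed per cutoff by the GROUP BY;
# the outer SUM(...) OVER then runs across the grouped rows.
rows = conn.execute(
    """
    SELECT cutoff,
           COUNT(*)  AS loans,
           SUM(bal)  AS cur_bal,
           SUM(COUNT(*)) OVER (ORDER BY cutoff ROWS UNBOUNDED PRECEDING) AS loans_cumul,
           SUM(SUM(bal)) OVER (ORDER BY cutoff ROWS UNBOUNDED PRECEDING) AS bal_cumul
    FROM loans
    GROUP BY cutoff
    ORDER BY cutoff
    """
).fetchall()

for r in rows:
    print(r)
```

Writing SUM(bal) OVER (...) here instead would refer to the raw column bal, which is neither grouped nor aggregated, and that is what error 8120 complains about.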

This is what I found to be working, comparing my results with some external research data.
I have simplified the fields for readability:
select
poolcutoffdate,
count(1) as LoanCount,
MAX(sum(case status when 'default' then 1 else 0 end))
over (order by poolcutoffdate
ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults
from myDatabase
group by poolcutoffdate
order by poolcutoffdate asc
I am thus counting all loans that have been in the 'default' status at least once from inception to the current cutoff date.
Note the use of MAX(SUM()) so that the result is the largest of the various iterations from the first to the current row. Using SUM(SUM()) would add the various iterations, leading to a cumulative of cumulatives.
I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the tally restarts from 0 and does not add from the previous cutoff date, but this would only include loans from the latest cutoff, so a loan that had defaulted and been removed from the pool would wrongly not be counted.
Note the CASE inside the SUM that the window function is applied to.
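The difference between the two nestings can be checked on a small example. This is only a sketch, assuming SQLite 3.25+ through Python's sqlite3 and a hypothetical table `snap`; which of the two columns is the right answer depends on whether the per-snapshot counts are already cumulative in the real data.

```python
import sqlite3

# Hypothetical snapshots: 1 default in January, 3 in February, 2 in March.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snap (cutoff TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO snap VALUES (?, ?)",
    [("2015-01", "default"), ("2015-01", "ok"),
     ("2015-02", "default"), ("2015-02", "default"), ("2015-02", "default"),
     ("2015-03", "default"), ("2015-03", "default"), ("2015-03", "ok")],
)

rows = conn.execute(
    """
    SELECT cutoff,
           SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END) AS defaults,
           -- running maximum of the per-snapshot counts
           MAX(SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END))
               OVER (ORDER BY cutoff
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_max,
           -- running total of the per-snapshot counts
           SUM(SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END))
               OVER (ORDER BY cutoff
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
    FROM snap
    GROUP BY cutoff
    ORDER BY cutoff
    """
).fetchall()

for r in rows:
    print(r)
```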
Thanks for all the help

Related

Partition SQL WINDOW function on certain criteria

I am trying to calculate a running total of the AddToCart metric that only starts after a 'product/search/details' page was seen.
Here's the link to SQL Fiddle: http://sqlfiddle.com/#!15/bbf9b/1
In the sqlfiddle link, I've manually created a column to reflect my desiredoutput. The workingoutput column shows where I have gotten to with my code.
SUM(AddToCart) OVER (PARTITION BY SessionID ORDER BY HitNumber ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as workingoutput
I know the below syntax is all wrong, but this is essentially what I am trying to achieve
SUM(AddToCart) OVER (PARTITION BY SessionID ORDER BY HitNumber ROWS BETWEEN UNBOUNDED PRECEDING AND FIRST_VALUE(ROW LIKE "%/product/search/details%")) as workingoutput
You need to nest your window functions here:
Start with a running conditional count that checks whether /product/search/details has been reached yet, and return AddToCart only once it has.
Then do a running sum over that result.
SELECT
wd.SessionID,
wd.HitNumber,
wd.HitType,
wd.EventType,
wd.PageName,
wd.AddToCart,
SUM(wd.AddToCartFromSearch) OVER (PARTITION BY wd.SessionID
ORDER BY HitNumber ROWS UNBOUNDED PRECEDING) AS DesiredOutput
FROM (
SELECT *,
CASE WHEN COUNT(CASE WHEN wd.PageName = '/product/search/details' THEN 1 END)
OVER (PARTITION BY wd.SessionID ORDER BY HitNumber ROWS UNBOUNDED PRECEDING) > 0
THEN AddToCart ELSE 0 END AS AddToCartFromSearch
FROM WebData wd
) wd
ORDER BY HitNumber;
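The two-step approach above can be tried on a few rows. A minimal sketch, assuming SQLite 3.25+ via Python's sqlite3 rather than the PostgreSQL of the fiddle; the table `web_data` and its sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_data (session_id INT, hit_number INT, page_name TEXT, add_to_cart INT)"
)
conn.executemany(
    "INSERT INTO web_data VALUES (?, ?, ?, ?)",
    [(1, 1, "/home", 1),
     (1, 2, "/product/search/details", 0),
     (1, 3, "/cart", 1),
     (1, 4, "/cart", 1),
     (2, 1, "/home", 1),   # session 2 never visits the search details page
     (2, 2, "/cart", 1)],
)

rows = conn.execute(
    """
    SELECT session_id, hit_number,
           SUM(from_search) OVER (PARTITION BY session_id
                                  ORDER BY hit_number ROWS UNBOUNDED PRECEDING) AS running
    FROM (
        -- step 1: keep add_to_cart only once the search page has been seen
        SELECT *,
               CASE WHEN COUNT(CASE WHEN page_name = '/product/search/details' THEN 1 END)
                         OVER (PARTITION BY session_id
                               ORDER BY hit_number ROWS UNBOUNDED PRECEDING) > 0
                    THEN add_to_cart ELSE 0 END AS from_search
        FROM web_data
    )
    ORDER BY session_id, hit_number
    """
).fetchall()

for r in rows:
    print(r)
```

The add-to-cart on hit 1 of session 1 is ignored because it happens before the search details page, and session 2 stays at 0 throughout.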

Reset rolling sum to 0 after reaching the threshold

I'm trying to compute a running total and reset it to 0 based on 2 conditions or if the limit is reached.
I need to get the running total while the following conditions are met:
monthly discount = 0 and monthly ticket = 1
If either discount = 1 or ticket = 0, the next value for the running total has to be 0.
running_total < 50
If the running total is >= 50, the running total has to restart from the value on the same row.
Is there any possibility to do this in HIVE? Thank you so much!!!
Here is what I'm trying to do now:
SELECT * ,
SUM(tag_flg) OVER (PARTITION BY account, flg_sum
ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS running_sum
FROM
( SELECT * ,
SUM(CASE
WHEN tag_flg>=50 THEN value
ELSE tag_flg
END) OVER (PARTITION BY account
ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS flg_sum
FROM
( SELECT * ,
CASE
WHEN month_disc =0
AND month_ticket = 1 THEN value
ELSE 0
END AS tag_flg
FROM source_table) x) y
Do the 40, 60 and 20 that aren't being accounted for matter at all in your report? For example, would you want them to be counted and then a new row added with a total of 0 to restart?
Here is the way I managed to do it:
SELECT *,
SUM(case when month_disc=1 OR month_ticket=0 then 0 else value end) OVER (PARTITION BY account, flg_sum, band_sum ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
FROM (
SELECT *,
FLOOR(SUM(case when month_disc=1 OR month_ticket=0 then 0 else value end) OVER (PARTITION BY account, flg_sum ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)/50.000001) as band_sum ---- create bands for running total
FROM (
SELECT *,
SUM(tag_flg) OVER (PARTITION BY account ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS flg_sum
FROM (
SELECT *,
CASE WHEN (month_disc=1 OR month_ticket=0) THEN 1 ELSE 0 END AS tag_flg ---- flag to count when the value is reset due to one of the conditions
FROM source_table) x ) y) z
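The flag part of this query (restart whenever month_disc=1 or month_ticket=0) can be isolated and checked on its own. The sketch below assumes SQLite 3.25+ via Python's sqlite3 instead of Hive, uses a hypothetical table `txn`, and deliberately leaves out the 50 threshold banding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (account TEXT, dt TEXT, month_disc INT, month_ticket INT, value INT)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?, ?)",
    [("A", "2020-01", 0, 1, 10),
     ("A", "2020-02", 0, 1, 20),
     ("A", "2020-03", 1, 1, 40),   # discount row: running total resets
     ("A", "2020-04", 0, 1, 5),
     ("A", "2020-05", 0, 0, 60),   # no ticket: running total resets
     ("A", "2020-06", 0, 1, 7)],
)

rows = conn.execute(
    """
    SELECT dt,
           SUM(CASE WHEN reset_flg = 1 THEN 0 ELSE value END)
               OVER (PARTITION BY account, grp ORDER BY dt
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
    FROM (
        -- a running count of reset rows gives each stretch its own group number
        SELECT *,
               SUM(reset_flg) OVER (PARTITION BY account ORDER BY dt
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
        FROM (
            SELECT *,
                   CASE WHEN month_disc = 1 OR month_ticket = 0 THEN 1 ELSE 0 END AS reset_flg
            FROM txn
        )
    )
    ORDER BY dt
    """
).fetchall()

for r in rows:
    print(r)
```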

Running count shows all values instead of the total number of values

My data is stored in an Amazon Redshift db. I am attempting to get a running count of loans by month. This is my query:
SELECT
TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE,
COUNT( LD.LOAN_ID) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD
INNER JOIN LOANS L ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...')
AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY
LD.LOAN_ID,
LD.INITIAL_PURCHASE_DATE;
My expected result is as follows:
INITIAL_PURCHASE|TOTAL_LOANS
...|...
2016-10|369
2016-11|424
But instead I get one record for every day of the month like so
INITIAL_PURCHASE|TOTAL_LOANS
...|...
2016-10|366
2016-10|367
2016-10|368
2016-10|369
2016-11|371
I checked the source system and confirmed there were a total of 369 loans in October, 424 in November so I know data's correct.
How do I get the total number of loans per month?
SOLUTION:
This is the correct query.
SELECT
TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE_DATE,
SUM(COUNT( LD.LOAN_ID )) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD
INNER JOIN LOANS L ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...') AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')
Your group by needs to be by month, not day, and you need to remove LOAN_ID from the GROUP BY:
SELECT TO_CHAR(LD.INITIAL_PURCHASE_DATE, 'YYYY-MM') AS INITIAL_PURCHASE,
SUM(COUNT( LD.LOAN_ID)) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD INNER JOIN
LOANS L
ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...') AND
LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY TO_CHAR(LD.INITIAL_PURCHASE_DATE, 'YYYY-MM')
Notes:
I think Amazon Redshift allows aliases in the GROUP BY, so you could use GROUP BY INITIAL_PURCHASE.
The SUM(COUNT(*)) should give you the running sum.
LOAN_ID should not be in the GROUP BY if you want totals by month.
This is what you were aiming for.
You group by INITIAL_PURCHASE ('YYYY-MM') and do a running total on count(*).
SELECT TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE
,sum(count(*)) OVER
(ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')
ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD
INNER JOIN LOANS L
ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...')
AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY INITIAL_PURCHASE
P.s.
I think the alias INITIAL_PURCHASE should be recognized in the GROUP BY clause, if I am mistaken then use TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')
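For anyone wanting to check the month-level grouping on a small sample, here is a rough sketch. It assumes SQLite 3.25+ via Python's sqlite3 instead of Redshift, so strftime('%Y-%m', ...) stands in for TO_CHAR(..., 'YYYY-MM'), and the table `loans_details` holds made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans_details (loan_id INT, initial_purchase_date TEXT)")
conn.executemany(
    "INSERT INTO loans_details VALUES (?, ?)",
    [(1, "2016-10-03"), (2, "2016-10-03"), (3, "2016-10-17"), (4, "2016-11-02")],
)

# Grouping by the month expression (not by day or loan) yields one row per month;
# SUM(COUNT(*)) OVER then accumulates the monthly counts.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', initial_purchase_date) AS initial_purchase,
           SUM(COUNT(*)) OVER (ORDER BY strftime('%Y-%m', initial_purchase_date)
                               ROWS UNBOUNDED PRECEDING) AS total_loans
    FROM loans_details
    WHERE initial_purchase_date IS NOT NULL
    GROUP BY strftime('%Y-%m', initial_purchase_date)
    ORDER BY initial_purchase
    """
).fetchall()

for r in rows:
    print(r)
```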

Calculate moving weather stats in PostgreSQL

I'm trying to calculate the days since last rain and the amount of rain in that event for each day in my PostgreSQL table of weather data. I've been trying to achieve this with window functions but the limitation of ranges having to be unbounded has left me a bit stuck on how to proceed.
Here's the query I have so far:
SELECT
station_num,
ob_date,
rain,
max(rain) OVER (PARTITION BY station_num ORDER BY ob_date ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as prev_rain_mm,
'' as days_since_rain --haven't attempted this calculation yet
FROM
obs_daily_ground_moisture
This results in the following:
but I'm trying to achieve something more like this:
I feel like all the pieces are there in regards to window functions range & filter and nested queries but I'm not sure how to pull it all together. Also the above data is just a subset of the actual dataset, the entire dataset is just over half a million rows.
The key here is to group the observations starting from the first occurrence of rain>0 value to the next occurrence of rain>0 value. Thereafter you can use window functions to calculate the needed columns.
select
x.station_num,
x.ob_date,
max(rain) over(partition by station_num,col) prev_rain,
case when rain > 0 then 0
else row_number() over(partition by station_num, col order by ob_date)-1 end days_since_rain
from (select t.*,
sum(case when rain > 0 then 1 else 0 end) over(partition by station_num order by ob_date) col
from t) x
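A small runnable version of the same grouping idea, as a sketch only: it assumes SQLite 3.25+ via Python's sqlite3 rather than PostgreSQL, and the table `obs` with its five rows is invented for illustration. The running count of rainy days (`grp`) labels each stretch that starts with rain and continues through the dry days after it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station_num INT, ob_date TEXT, rain REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [(1, "2020-01-01", 5.0),
     (1, "2020-01-02", 0.0),
     (1, "2020-01-03", 0.0),
     (1, "2020-01-04", 2.5),
     (1, "2020-01-05", 0.0)],
)

rows = conn.execute(
    """
    SELECT ob_date, rain,
           MAX(rain) OVER (PARTITION BY station_num, grp) AS prev_rain,
           CASE WHEN rain > 0 THEN 0
                ELSE ROW_NUMBER() OVER (PARTITION BY station_num, grp
                                        ORDER BY ob_date) - 1
           END AS days_since_rain
    FROM (
        -- grp increases by one on every rainy day
        SELECT o.*,
               SUM(CASE WHEN rain > 0 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY station_num ORDER BY ob_date) AS grp
        FROM obs o
    )
    ORDER BY station_num, ob_date
    """
).fetchall()

for r in rows:
    print(r)
```

Days before the first rain of a station fall into group 0, where prev_rain comes out as 0, so those rows may need separate handling.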
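Try this.
-- T-SQL (SQL Server) syntax: intended to carry the last non-zero Rain value forward onto the dry days
DECLARE @Rain AS FLOAT
UPDATE A
SET
@Rain = CASE WHEN A.Rain = 0 THEN @Rain ELSE A.Rain END,
A.Rain = CASE WHEN @Rain IS NULL OR A.Rain <> 0 THEN A.Rain ELSE @Rain END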
FROM obs_daily_ground_moisture A
SELECT ob_date, Rain,
max(rain) OVER (PARTITION BY station_num ORDER BY ob_date ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as prev_rain_mm,
ROW_NUMBER() OVER(PARTITION BY Rain ORDER BY ob_date) - 1 as days_since_rain
FROM obs_daily_ground_moisture ORDER BY ob_date

Sum Until Value Reached - Teradata

In Teradata, I need a query to first identify all members in the MEM TABLE that currently have a negative balance, let's call that CUR_BAL. Then, for those members only, sum all transactions from the TRAN TABLE in date order until the sum of those transactions is equal to the CUR_BAL.
Editing to add a third ADJ table that contains MEM_NBR, ADJ_DT and ADJ_AMT that need to be included in the running total in order to capture all of the records.
I would like the outcome to include the MEM.MEM_NBR, MEM.CUR_BAL, TRAN.TRAN_DATE OR ADJ.ADJ_DT (date associated with the transaction that resulted in the running total to equal CUR_BAL), MEM.LST_UPD_DT. I don't need to know if the balance is negative as a result of a transaction or adjustment, just the date that it went negative.
Thank you!
select
mem_nbr,
cur_bal,
tran_date,
tran_type
from (
select
a.mem_nbr,
a.cur_bal,
b.tran_date,
b.tran_type,
a.lst_upd_dt,
sum(b.tran_amt) over (partition by b.mem_nbr order by b.tran_date rows between unbounded preceding and current row) as cumulative_bal
from mem a
inner join (
select
mem_nbr,
tran_date,
tran_amt,
'Tran' as tran_type
from tran
union all
select
mem_nbr,
adj_dt,
adj_amt,
'Adj' as tran_type
from adj
) b
on a.mem_nbr = b.mem_nbr
where a.cur_bal < 0
qualify cumulative_bal < 0
) z
qualify rank() over (partition by mem_nbr order by tran_date) = 1
The subquery picks up all instances where the cumulative balance is negative, then the outer query picks up the earliest instance of it. If you want the latest, add desc after tran_date in the final qualify line.
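The same logic can be tried outside Teradata. The sketch below assumes SQLite 3.25+ via Python's sqlite3, which has no QUALIFY clause, so each QUALIFY is replaced by a WHERE on an enclosing subquery and ROW_NUMBER() stands in for RANK(). The table names `mem`, `txn`, `adj` and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mem (mem_nbr INT, cur_bal REAL);
    CREATE TABLE txn (mem_nbr INT, tran_date TEXT, tran_amt REAL);
    CREATE TABLE adj (mem_nbr INT, adj_dt TEXT, adj_amt REAL);
    INSERT INTO mem VALUES (1, -30), (2, 50);
    INSERT INTO txn VALUES (1, '2021-01-05', 100), (1, '2021-02-10', -120),
                           (1, '2021-03-01', -20), (2, '2021-01-01', 50);
    INSERT INTO adj VALUES (1, '2021-02-20', 10);
    """
)

rows = conn.execute(
    """
    SELECT mem_nbr, cur_bal, event_date, event_type
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY mem_nbr ORDER BY event_date) AS rn
        FROM (
            SELECT m.mem_nbr, m.cur_bal, e.event_date, e.event_type,
                   SUM(e.amt) OVER (PARTITION BY m.mem_nbr ORDER BY e.event_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                       AS cumulative_bal
            FROM mem m
            JOIN (
                SELECT mem_nbr, tran_date AS event_date, tran_amt AS amt,
                       'Tran' AS event_type FROM txn
                UNION ALL
                SELECT mem_nbr, adj_dt, adj_amt, 'Adj' FROM adj
            ) e ON e.mem_nbr = m.mem_nbr
            WHERE m.cur_bal < 0
        )
        WHERE cumulative_bal < 0      -- first QUALIFY in the Teradata version
    )
    WHERE rn = 1                      -- second QUALIFY: earliest negative row
    """
).fetchall()

print(rows)
```

Member 1's running balance goes 100, -20, -10, -30, so the query reports 2021-02-10 as the date it first went negative; member 2 is skipped because its current balance is positive.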