Running count shows all values instead of the total number of values - sql

My data is stored in an Amazon Redshift db. I am attempting to get a running count of loans by month. This is my query:
SELECT
TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE,
COUNT( LD.LOAN_ID) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS
INNER JOIN LOANS L ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...')
AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY
LD.LOAN_ID,
LD.INITIAL_PURCHASE_DATE;
My expected result is as follow:
INITIAL_PURCHASE|TOTAL_LOANS
...|...
2016-10|369
2016-11|424
But instead I get one record for every day of the month like so
INITIAL_PURCHASE|TOTAL_LOANS
...|...
2016-10|366
2016-10|367
2016-10|368
2016-10|369
2016-11|371
I checked the source system and confirmed there were a total of 369 loans in October, 424 in November so I know data's correct.
How do I get the total number of loans per month?
SOLUTION:
This is the correct query.
SELECT
TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE_DATE,
SUM(COUNT( LD_LOANS.LOAN_ID )) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAIL LD
INNER JOIN LOANS L ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...') AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')

Your group by needs to be by month, not day, and you need to remove LOAN_ID from the GROUP BY:
SELECT TO_CHAR(LD.INITIAL_PURCHASE_DATE, 'YYYY-MM') AS INITIAL_PURCHASE,
SUM(COUNT( LD.LOAN_ID)) OVER (ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD INNER JOIN
LOANS L
ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...') AND
LD.INITIAL_PURCHASE_DATE IS NOT
GROUP BY TO_CHAR(LD.INITIAL_PURCHASE_DATE, 'YYYY-MM')
Notes:
I think Amazon Redshift allows aliases in the GROUP BY, so you could use GROUP BY INITIAL_PURPOSE, LD.LOAN_ID.
The SUM(COUNT(*)) should give you the running sum.
LOAN_ID should not be in the GROUP BY if you want totals by month.

This is what you were aiming for.
You group by INITIAL_PURCHASE ('YYYY-MM') and do a running total on count(*).
SELECT TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM') AS INITIAL_PURCHASE
,sum(count(*)) OVER
(ORDER BY TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')
ROWS UNBOUNDED PRECEDING ) AS TOTAL_LOANS
FROM LOANS_DETAILS LD
INNER JOIN LOANS L
ON LD.LOAN_ID = L.ID
WHERE L.UNDERWRITING_STATUS IN ('...')
AND LD.INITIAL_PURCHASE_DATE IS NOT NULL
GROUP BY INITIAL_PURCHASE
P.s.
I think the alias INITIAL_PURCHASE should be recognized in the GROUP BY clause, if I am mistaken then use TO_CHAR(LD.INITIAL_PURCHASE_DATE,'YYYY-MM')

Related

ETL query need some changes go get it right

Hello guys I have a query which is working but when I remove 2 filters (2 where clauses at the end doesn't work as expected but still have to be removed from the query)
I have accounts 1000001,1000002,1000003,1000004 and 1000005
I only get 1000005 accounts, Pretty sure that it`s is about the window MAX function, but still.
I want to get the all values for the accounts.
SELECT a12.month_id,
a12.populate_id AS account_id,
LAST_VALUE(current_bal IGNORE NULLS) OVER
(PARTITION BY Populate_id ORDER BY date_id ASC ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_dly_bal
FROM (SELECT TO_CHAR(date_id, 'YYYYMM') AS month_id,
date_id,
account_id AS "account_id",
MAX(account_id) OVER (PARTITION by TO_CHAR(date_id, 'YYYYMM')) as populate_id,
current_bal
FROM (SELECT t.date_id, ad.account_id, ad.current_bal
FROM timedate t
FULL OUTER JOIN (SELECT src_extract_dt, account_id, current_bal
FROM account_dly
WHERE account_id = 1000001) ad
on t.date_id = ad.src_extract_dt
WHERE TO_CHAR(date_id, 'YYYYMM') = '201908'
order by t.date_id)) a12;
https://i.stack.imgur.com/xphVh.png

Sum for a rolling total

I have the following query:
select b.month_date,total_signups,active_users from
(
SELECT date_trunc('month',confirmed_at) as month_date
, count(distinct id) as total_signups
FROM follower.users
WHERE confirmed_at::date >= dateadd(day,-90,getdate())::date
and (deleted_at is null or deleted_at > date_trunc('month',confirmed_at))
group by 1
) a ,
(
SELECT date_trunc('month', inv.created_at) AS month_date
,COUNT(DISTINCT em.user_id) AS active_users
FROM follower.invitees inv
INNER JOIN follower.events
ON inv.event_id = em.event_id
where inv.created_at::date >= dateadd(day,-90,getdate())::date
GROUP BY 1
) b
where a.month_date=b.month_date
This returns three columns month date, total signups and active users, what I need is a rolling total for all users in the fourth column (rolling total of signups). I've tried over and partition functions with no luck. Could someone help? Appreciate it very much.
Try adding this column definition to your first Select:
SUM(total_signups)
OVER (ORDER BY b.month_date ASC rows between unbounded preceding and current row)
AS running_total
Here's a mini-demo

SUM OVER with GROUP BY

I am working on a large database with millions of rows and I am trying to be efficient in my queries. The database contains regular snapshots of a loan portfolio where sometimes loans default (status goes from '1' to <>'1'). When they do, they appear only once in the corresponding snapshot, then they are no longer reported. I am trying to get a cumulative count of such loans - as they develop over time and divided into many buckets depending on country of origin, vintage, etc.
SUM (...) OVER seems to be a very efficient function to achieve the result but when I run the following query
Select
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate
I get
SQL Error (8120) column aa27 is invalid in the selected list because it is not contained in either an aggregate function or the GROUP BY clause
Can anyone shed some light? Thanks
I believe you want:
Select assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(SUM(aa27)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(SUM(aa26)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22 >= '2014-01' and aa22 <= '2014-12' and vintage = '2015' and
active = 0 and aa74 <> '1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate;
Note the SUM(SUM()) in the cumulative sum expressions.
This is what I found to be working, comparing my results with some external research data.
I have simplified the fields for readability:
select
poolcutoffdate,
count(1) as LoanCount,
MAX(sum(case status when 'default' then 1 else 0 end))
over (order by poolcutoffdate
ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults
from myDatabase
group by poolcutoffdate
order by poolcutoffdate asc
I am thus counting all loans that have been in the 'default' status at least once from inception to the current cutoff date.
Note the use of MAX(SUM()) so that the result is the largest of the various iteration from the first to the current row. Using SUM(SUM()) would add the various iterations leading to a cumulative of cumulatives.
I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the tally restarts from 0 and does not add from the previous cutoff date but this would only include loans from the latest cutoff so if a loan had defaulted and removed from the pool it would wrongly not be counted.
Note the CASE in the OVER statement.
Thanks for all the help

ROW_NUMBER() OVER (PARTITION BY) showing duplicate results for Group By Clause

I have the below query that was created to show the summation of the "Last" values for a year, usually this is a december value, but the year could potentially end in any month so i want to add together the last values for each goalmontecarloheaderid. I have it working 99%, but there are some random duplicates in the [year] value.
WITH endBalances AS (
SELECT ROW_NUMBER() OVER (PARTITION By GoalMonteCarloHeaderID, Year(Convert(date,MonthDate)) Order By Max(Month(Convert(date,MonthDate))) desc) n, Max(Month(Convert(date,MonthDate))) maxMonth, GrowthBucket, WithdrawalBucket, NoTaxesBucket,
Year(MonthDate) [year]
From GoalMonteCarloMedianResults mcmr
full join GoalMonteCarloHeader mch on mch.ID = mcmr.GoalMonteCarloHeaderID
full join GoalChartData gcd on gcd.ID = mch.GoalChartDataID and gcd.TypeID = 2
inner join Goal g on g.iGoalID = gcd.GoalID
where g.iTypeID in (1) and g.iHHID = 850802
group by GoalMonteCarloHeaderID, MonthDate, GrowthBucket, WithdrawalBucket, NoTaxesBucket
)
SELECT [year], Sum(GrowthBucket) GrowthBucket, Sum(WithdrawalBucket) WithdrawalBucket,Sum(NoTaxesBucket) NoTaxesBucket, maxMonth
From endBalances
where [year] is not null and n=1
Group By [year], maxMonth
order by [year] asc
Showing two random duplicates in the database result;
you can see in the image there are two examples where the year is duplicated and displayed for more than just the 'last' month in the year. Am I doing something wrong with the group by or the PARTITION BY() in my query? I am not the most familiar with this functionality of T-SQL.
T-SQL has a lovely function for this which has no direct equivalent in MySQL.
ROW_NUMBER() OVER (PARTITION BY [year] ORDER BY MonthDate DESC) AS rn
Then anything with rn=1 will be the last entry in a year.
The answers to this question have a few ideas:
ROW_NUMBER() in MySQL

Sum Until Value Reached - Teradata

In Teradata, I need a query to first identify all members in the MEM TABLE that currently have a negative balance, let's call that CUR_BAL. Then, for all of those members only, sum all transactions from the TRAN TABLE in order by date until the sum of those transactions is equal to the CUR_BAL.
Editing to add a third ADJ table that contains MEM_NBR, ADJ_DT and ADJ_AMT that need to be included in the running total in order to capture all of the records.
I would like the outcome to include the MEM.MEM_NBR, MEM.CUR_BAL, TRAN.TRAN_DATE OR ADJ.ADJ_DT (date associated with the transaction that resulted in the running total to equal CUR_BAL), MEM.LST_UPD_DT. I don't need to know if the balance is negative as a result of a transaction or adjustment, just the date that it went negative.
Thank you!
select
mem_nbr,
cur_bal,
tran_date,
tran_type
from (
select
a.mem_nbr,
a.cur_bal,
b.tran_date,
b.tran_type,
a.lst_upd_dt,
sum(b.tran_amt) over (partition by b.mem_nbr order by b.tran_date rows between unbounded preceding and current row) as cumulative_bal
from mem a
inner join (
select
mem_nbr,
tran_date,
tran_amt,
'Tran' as tran_type
from tran
union all
select
mem_nbr,
adj_date,
adj_amt,
'Adj' as tran_type
from adj
) b
on a.mem_nbr = b.mem_nbr
where a.cur_bal < 0
qualify cumulative_bal < 0
) z
qualify rank() over (partition by mem_nbr order by tran_date) = 1
The subquery picks up all instances where the cumulative balance is negative, then the outer query picks up the earliest instance of it. If you want the latest, add desc after tran_date in the final qualify line.