HANA SQL: Filling missing gaps in a date table with balance history

In a HANA SQL environment I have this table of balance changes for customer accounts, by date:
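"BalanceTable"
| CustomerID | BalDate    | Balance |
|------------|------------|---------|
| 1          | 2021-06-01 | 0       |
| 1          | 2021-06-04 | 100     |
| 1          | 2021-06-28 | 500     |
| 2          | 2021-06-01 | 200     |
| 2          | 2021-06-03 | 0       |
| 2          | 2021-07-02 | 300     |
| ...        | ...        | ...     |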
The real table has many more rows.
I have now created a date table covering the whole interval, with the earliest day as the first row and the latest day as the last row:
"DateTable"
Day
2021-06-01
2021-06-02
2021-06-03
2021-06-04
2021-06-05
2021-06-06
...
2021-07-02
I need to join the two tables to get the daily balance of each customer:
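| Day        | CustomerID | Balance |
|------------|------------|---------|
| 2021-06-01 | 1          | 0       |
| 2021-06-02 | 1          | 0       |
| 2021-06-03 | 1          | 0       |
| 2021-06-04 | 1          | 100     |
| 2021-06-05 | 1          | 100     |
| 2021-06-06 | 1          | 100     |
| ...        | ...        | ...     |
| 2021-06-27 | 1          | 100     |
| 2021-06-28 | 1          | 500     |
| 2021-06-29 | 1          | 500     |
| 2021-06-30 | 1          | 500     |
| 2021-07-01 | 1          | 500     |
| 2021-07-02 | 1          | 500     |
| 2021-06-01 | 2          | 200     |
| 2021-06-02 | 2          | 200     |
| 2021-06-03 | 2          | 0       |
| 2021-06-04 | 2          | 0       |
| 2021-06-05 | 2          | 0       |
| ...        | ...        | ...     |
| 2021-06-30 | 2          | 0       |
| 2021-07-01 | 2          | 0       |
| 2021-07-02 | 2          | 300     |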
As a first approach I tried joining the two tables with a left join:
SELECT * FROM "DateTable" T0 LEFT JOIN "BalanceTable" T1 ON T0."Day"=T1."BalDate"
But I know the proper solution is well beyond my limited SQL knowledge. The key is to fill the gaps for the days in "DateTable" that have no balance in "BalanceTable" with the balance from the most recent earlier day that has data.
I've read about similar cases where the IFNULL function is combined with a PARTITION BY clause to fill the gaps with the last value, but after many attempts I wasn't able to apply that to my case.
Thank you for your ideas, and sorry if I missed something; this is my first post asking for help.

So you have this example data:
CREATE TABLE BALANCETAB (CUSTOMERID INTEGER, BALDATE DATE, BALANCE INTEGER);
INSERT INTO BALANCETAB VALUES (1, '2021-06-01', 0);
INSERT INTO BALANCETAB VALUES (1, '2021-06-04', 100);
INSERT INTO BALANCETAB VALUES (1, '2021-06-28', 500);
INSERT INTO BALANCETAB VALUES (2, '2021-06-01', 200);
INSERT INTO BALANCETAB VALUES (2, '2021-06-03', 0);
INSERT INTO BALANCETAB VALUES (2, '2021-07-02', 300);
You already headed in the right direction by creating the dates table:
CREATE TABLE DATETAB AS (
SELECT GENERATED_PERIOD_START DAY
-- The end bound of the series is exclusive, so '2021-07-03' is needed to include 2021-07-02.
FROM SERIES_GENERATE_DATE('INTERVAL 1 DAY', '2021-06-01', '2021-07-03')
);
However, additionally you will need to know all customers since you want to add one row per date and per customer (cross join):
CREATE TABLE CUSTOMERTAB AS (
SELECT DISTINCT CUSTOMERID FROM BALANCETAB
);
From this you can infer the table with NULL values, that you would like to fill:
WITH DATECUSTOMERTAB AS (
SELECT * FROM DATETAB, CUSTOMERTAB
)
SELECT DCT.DAY, DCT.CUSTOMERID, BT.BALANCE
FROM DATECUSTOMERTAB DCT
LEFT JOIN BALANCETAB BT ON DCT.DAY = BT.BALDATE AND DCT.CUSTOMERID = BT.CUSTOMERID
ORDER BY DCT.CUSTOMERID, DCT.DAY;
On this table, you can apply a self-join (BTFILL) and use the window function RANK to determine the latest previous balance value.
WITH DATECUSTOMERTAB AS (
SELECT * FROM DATETAB, CUSTOMERTAB
)
SELECT DAY, CUSTOMERID, IFNULL(BALANCE, BALANCEFILL) BALANCE_FILLED
FROM
(
SELECT DCT.DAY, DCT.CUSTOMERID, BT.BALANCE, BTFILL.BALANCE AS BALANCEFILL,
RANK() OVER (PARTITION BY DCT.DAY, DCT.CUSTOMERID, BT.BALANCE ORDER BY BTFILL.BALDATE DESC) RNK
FROM DATECUSTOMERTAB DCT
LEFT JOIN BALANCETAB BT ON DCT.DAY = BT.BALDATE AND DCT.CUSTOMERID = BT.CUSTOMERID
LEFT JOIN BALANCETAB BTFILL ON BTFILL.BALDATE <= DCT.DAY AND DCT.CUSTOMERID = BTFILL.CUSTOMERID AND BTFILL.BALANCE IS NOT NULL
)
WHERE RNK = 1
ORDER BY CUSTOMERID, DAY;
Of course, you would omit the explicit creation of the tables DATETAB and CUSTOMERTAB. The list of expected customers probably already exists somewhere in your system, and the series generator function could be part of the final statement.
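To make the fill rule itself easier to check outside the database, here is a minimal Python sketch of the same idea: build every (customer, day) pair and carry the last known balance forward. The function name `fill_daily_balances` and the tuple layout are assumptions made for illustration; they are not part of the SQL answer above.

```python
from datetime import date, timedelta


def fill_daily_balances(changes, first_day, last_day):
    """Carry each customer's last known balance forward to every day.

    changes: iterable of (customer_id, bal_date, balance) tuples.
    Returns (day, customer_id, balance) tuples ordered by customer, then day.
    Days before a customer's first recorded balance get None.
    """
    # Index the recorded changes by (customer, date) for direct lookup.
    by_key = {(cust, day): bal for cust, day, bal in changes}
    customers = sorted({cust for cust, _, _ in changes})
    n_days = (last_day - first_day).days + 1

    result = []
    for cust in customers:
        last_balance = None  # nothing is known before the first change
        for offset in range(n_days):
            day = first_day + timedelta(days=offset)
            if (cust, day) in by_key:
                last_balance = by_key[(cust, day)]
            result.append((day, cust, last_balance))
    return result


# Sample data from the question.
changes = [
    (1, date(2021, 6, 1), 0),
    (1, date(2021, 6, 4), 100),
    (1, date(2021, 6, 28), 500),
    (2, date(2021, 6, 1), 200),
    (2, date(2021, 6, 3), 0),
    (2, date(2021, 7, 2), 300),
]
rows = fill_daily_balances(changes, date(2021, 6, 1), date(2021, 7, 2))
```

With the sample data this yields one row per customer per day, which can be compared against the output of the SQL statement.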

Related

Calculate a 3-month moving average from non-aggregated data

I have a bunch of orders. Each order is either a type A or a type B order. I want a 3-month moving average of the time it takes to ship orders of each type. How can I aggregate this order data into what I want using Redshift or Postgres SQL?
Start with this:
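| order_id | order_type | ship_date  | time_to_ship |
|----------|------------|------------|--------------|
| 1        | a          | 2021-12-25 | 100          |
| 2        | b          | 2021-12-31 | 110          |
| 3        | a          | 2022-01-01 | 200          |
| 4        | a          | 2022-01-01 | 50           |
| 5        | b          | 2022-01-15 | 110          |
| 6        | a          | 2022-02-02 | 100          |
| 7        | a          | 2022-02-28 | 300          |
| 8        | b          | 2022-04-05 | 75           |
| 9        | b          | 2022-04-06 | 210          |
| 10       | a          | 2022-04-15 | 150          |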
Note: Some months have no shipments. The solution should allow for this.
I want this:
| order_type | ship_month | mma3_time_to_ship |
|------------|------------|-------------------|
| a          | 2022-02-01 | 150               |
| a          | 2022-04-01 | 160               |
| b          | 2022-04-01 | 126.25            |
Here, a 3-month moving average is only calculated for months that have at least 2 preceding months with data. Each record is one order type and month. The ship_month column denotes the month of shipment (Redshift represents months as the date of the first of the month).
Here's how the mma3_time_to_ship column is calculated, expressed as Excel-like formulas:
150 = AVERAGE(100, 200, 50, 100, 300) <- The average for all A orders in Dec, Jan, and Feb.
160 = AVERAGE(200, 50, 100, 300, 150) <- The average for all A orders in Jan, Feb, Apr (no orders in March)
126.25 = AVERAGE(110, 110, 75, 210) <- The average for all B orders in Dec, Jan, Apr (no B orders in Feb, no orders at all in Mar)
My attempt doesn't aggregate it into monthly data and 3-month averages (this query runs without error in Redshift):
SELECT
order_type,
DATE_TRUNC('month', ship_date) AS ship_month,
AVG(time_to_ship) OVER (
PARTITION BY
order_type,
ship_month
ORDER BY ship_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS avg_time_to_ship
FROM tbl
Is what I want possible?
This is honestly a complete stab in the dark, so it won't surprise me if it's not correct... but it seems to me you can accomplish this with a self join using a range of dates within the join.
select
t1.order_type, t1.ship_date, avg (t2.time_to_ship) as mma3_time_to_ship
from
tbl t1
join tbl t2 on
t1.order_type = t2.order_type and
t2.ship_date between t1.ship_date - interval '3 months' and t1.ship_date
group by
t1.order_type, t1.ship_date
The results don't match your example, but then I'm not entirely sure where they came from anyway.
Perhaps this will be the catalyst towards an eventual solution or at least an idea to start.
This is Pg12, by the way. Not sure if it will work on Redshift.
-- EDIT --
Per your updates, I was able to match your three results identically. I used dense_rank to find the closest three months:
with foo as (
select
order_type, date_trunc ('month', ship_date)::date as ship_month,
time_to_ship, dense_rank() over (partition by order_type order by date_trunc ('month', ship_date)) as dr
from tbl
)
select
f1.order_type, f1.ship_month,
avg (f2.time_to_ship),
array_agg (f2.time_to_ship)
from
foo f1
join foo f2 on
f1.order_type = f2.order_type and
f2.dr between f1.dr - 2 and f1.dr
group by
f1.order_type, f1.ship_month
Results:
b 2022-01-01 110.0000000000000000 {110,110}
a 2022-01-01 116.6666666666666667 {100,50,200,100,50,200}
b 2022-04-01 126.2500000000000000 {110,110,75,210,110,110,75,210}
b 2021-12-01 110.0000000000000000 {110}
a 2021-12-01 100.0000000000000000 {100}
a 2022-02-01 150.0000000000000000 {100,50,200,100,300,100,50,200,100,300}
a 2022-04-01 160.0000000000000000 {50,200,100,300,150}
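There are some dupes in the array elements, but they don't affect the averages: foo has one row per order, so every f1 row in a month joins to the same window of f2 rows and each value is repeated the same number of times. Joining from the distinct (order_type, ship_month, dr) combinations instead of from foo would remove the duplicates.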
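The dense_rank idea (average over the current month and the two preceding months that actually have data) can be sketched in plain Python to cross-check the three expected values. This is only an illustration under the assumption that "preceding months" means months with at least one shipment of that type; the function name `mma3_by_present_months` is made up for this example.

```python
from collections import defaultdict
from datetime import date


def mma3_by_present_months(orders):
    """orders: iterable of (order_type, ship_date, time_to_ship).

    Returns {(order_type, ship_month): average} for every month that has
    two earlier months with data for the same order type.
    """
    # Group the raw values by (order_type, first day of the ship month).
    by_month = defaultdict(list)
    for order_type, ship_date, time_to_ship in orders:
        by_month[(order_type, ship_date.replace(day=1))].append(time_to_ship)

    # List the months that actually occur for each type, in date order.
    months_by_type = defaultdict(list)
    for order_type, month in sorted(by_month):
        months_by_type[order_type].append(month)

    result = {}
    for order_type, months in months_by_type.items():
        # Start at index 2 so each window has two preceding months with data.
        for i in range(2, len(months)):
            window = months[i - 2 : i + 1]
            values = [v for m in window for v in by_month[(order_type, m)]]
            result[(order_type, months[i])] = sum(values) / len(values)
    return result


# Sample data from the question.
orders = [
    ("a", date(2021, 12, 25), 100),
    ("b", date(2021, 12, 31), 110),
    ("a", date(2022, 1, 1), 200),
    ("a", date(2022, 1, 1), 50),
    ("b", date(2022, 1, 15), 110),
    ("a", date(2022, 2, 2), 100),
    ("a", date(2022, 2, 28), 300),
    ("b", date(2022, 4, 5), 75),
    ("b", date(2022, 4, 6), 210),
    ("a", date(2022, 4, 15), 150),
]
averages = mma3_by_present_months(orders)
```

Because each raw value enters a window exactly once, there are no duplicated elements to worry about in this version.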

How to calculate average monthly number of some action in some period in Teradata SQL?

I have a table in Teradata SQL like the one below:
ID trans_date
------------------------
123 | 2021-01-01
887 | 2021-01-15
123 | 2021-02-10
45 | 2021-03-11
789 | 2021-10-01
45 | 2021-09-02
I need to calculate the average monthly number of transactions made by customers in the period between 2021-01-01 and 2021-09-01, so the client with "ID" = 789 will not be counted because he made his transaction later.
In the first month (01) there were 2 transactions
In the second month there was 1 transaction
In the third month there was 1 transaction
In the ninth month there was 1 transaction
So the result should be (2+1+1+1) / 4 = 1.25, shouldn't it?
How can I calculate this in Teradata SQL? Of course, I have only shown a sample of my data.
SELECT ID, AVG(txns) FROM
(SELECT ID, TRUNC(trans_date,'MON') as mth, COUNT(*) as txns
FROM mytable
-- WHERE condition matches the question but likely want to
-- use end date 2021-09-30 or use mth instead of trans_date
WHERE trans_date BETWEEN date'2021-01-01' and date'2021-09-01'
GROUP BY id, mth) mth_txn
GROUP BY id;
Your logic translated to SQL:
--(2+1+1+1) / 4
-- cast first: dividing two integer counts would truncate the result
SELECT id, CAST(COUNT(*) AS FLOAT) / COUNT(DISTINCT TRUNC(trans_date,'MON')) AS avg_tx
FROM mytable
WHERE trans_date BETWEEN date'2021-01-01' and date'2021-09-01'
GROUP BY id;
You should compare this with Fred's answer to see which is more efficient on your data.
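The arithmetic in the question, (2+1+1+1) / 4, is a single figure across all customers: transactions in the period divided by the number of months that contain at least one transaction. A small Python sketch of that calculation follows. It assumes an inclusive end date of 2021-09-30 so that the 2021-09-02 row is counted, as the question's own tally does; the function name `avg_monthly_transactions` is hypothetical.

```python
from datetime import date


def avg_monthly_transactions(rows, start, end):
    """rows: iterable of (customer_id, trans_date); both bounds are inclusive.

    Returns transactions per month, counting only months that contain
    at least one transaction, or None if the period is empty.
    """
    in_period = [d for _, d in rows if start <= d <= end]
    months = {(d.year, d.month) for d in in_period}
    return len(in_period) / len(months) if months else None


# Sample data from the question.
rows = [
    (123, date(2021, 1, 1)),
    (887, date(2021, 1, 15)),
    (123, date(2021, 2, 10)),
    (45, date(2021, 3, 11)),
    (789, date(2021, 10, 1)),
    (45, date(2021, 9, 2)),
]
overall = avg_monthly_transactions(rows, date(2021, 1, 1), date(2021, 9, 30))
strict = avg_monthly_transactions(rows, date(2021, 1, 1), date(2021, 9, 1))
```

Here `overall` matches the 1.25 from the question, while `strict` uses the literal 2021-09-01 bound, which drops the September transaction and gives 4 transactions over 3 months.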

SQL Query: Finding Overdue Start Date and Overdue Amount from a table

I have a table as follows, with an 'ODType' column that states whether a transaction is a Due (D) or a Collected (C) amount. From this I need to find the overdue start date and overdue amount for each loan.
LoanID OverDueDate TotalAmount ODType
12345 01/10/17 1000 D
12345 01/11/17 500 C
12345 03/12/17 1000 D
12346 01/10/17 1500 D
12346 01/11/17 500 C
12346 03/12/17 1000 C
12346 01/01/18 2000 D
12346 01/02/18 1000 C
Example scenarios:
For LoanID 12345, the overdue start date is 01/10/2017 and
the overdue amount is 1500.
For LoanID 12346, the overdue start date is
01/01/2018 and the overdue amount is 1000.
I am able to get the overdue amount for each LoanID, but I'm not sure how to get the overdue start date. I did it with the following query:
SELECT t.LoanID, (t."DemandAmount" -t."CollectionAmount") Overdue
FROM (SELECT
LoanID,
MAX(CASE
WHEN ODType = 'D' THEN ("TotalAmount")
END) AS DemandAmount,
MAX(CASE
WHEN (ODType = 'C') THEN ("TotalAmount")
END) AS CollectionAmount
FROM TXN_OverdueCollection GROUP BY LoanID ) t
How do I find the overdue start date? What additional criteria do I need to add to get it alongside the overdue amount? Or do I need to change the query completely to get both the overdue start date and the overdue amount?
UPDATE:
The overdue amount and overdue start date are calculated as follows:
The overdue amount is the SUM of Dues (D) minus the SUM of Collections (C).
For LoanID 12345, the sum of D (Dues) is 2000 and the sum of
C (Collections) is only 500, so 2000 - 500 = 1500 is overdue. Since
the 01/10/2017 due has not been paid in full, the overdue start
date is 01/10/2017.
For LoanID 12346, the sum of D (Dues) is 3500 and the sum of C
(Collections) is 2500, so the overdue amount is 3500 - 2500 = 1000 and
the overdue start date is 01/01/18, as that date's due has not been
paid in full yet.
Note:
This needs to be achieved with simple JOIN queries (INNER, LEFT or RIGHT). PARTITION, LAG, OVER and ROW_NUMBER are not available, so these built-in functions cannot be used in the query.
I appreciate any help.
This is Microsoft T-SQL syntax; depending on your database server, yours will likely be different. I use the LAG() function, which was introduced in SQL Server 2012. All of the concepts should be convertible to whatever flavor of SQL you are using.
MS SQL Server 2017 Schema Setup:
CREATE TABLE t ( LoanID int, OverDueDate date, TotalAmount decimal(10,2), ODType varchar(1));
INSERT INTO t ( LoanID, OverDueDate, TotalAmount, ODType )
VALUES
(12345, '01/10/17', 1000, 'D')
, (12345, '01/11/17', 500, 'C')
, (12346, '02/10/17', 1500, 'D')
, (12346, '03/12/17', 1000, 'C') /* Paid off. But more loans. */
, (12346, '01/02/18', 1000, 'C')
, (12345, '03/12/17', 1000, 'D') /* Additional deposit. Maintains original overdue date */
, (12346, '02/11/17', 500, 'C')
, (12346, '01/01/18', 2000, 'D')
, (12347, '10/01/17', 1000, 'D')
, (12347, '11/01/17', 1001, 'C') /* Overpaid */
, (12348, '11/11/17', 1000, 'D')
, (12348, '12/11/17', 1000, 'C') /* Paid off */
;
I added a couple of extra rows to the data to demonstrate some variations, like over-payment or paying off a loan. I also changed up the order of some of the dates to show how the ORDER BY in the OVER() window function will correct for out-of-order data.
Query: NOTE: I commented the SQL to explain some of what I did.
; WITH cte1 AS ( /* A CTE, because this query is used in both the main query and the subquery. */
SELECT s1.LoanID
, s1.OverDueDate
, s1.TotalAmount
, s1.ODType
, s1.runningTotal
, CASE
WHEN (
COALESCE ( /* COALESCE() will handle NULL dates. */
LAG(s1.runningTotal) /* LAG() is SQL2012. */
OVER ( PARTITION BY s1.LoanID ORDER BY s1.LoanID, s1.OverDueDate )
, 0 ) <= 0
/* This resets the OverDueDate. "<=0" will reset date for overpays. */
) THEN s1.OverDueDate
ELSE NULL
END AS od
, s1.rn
FROM (
SELECT t.LoanID
, t.OverDueDate
, t.TotalAmount
, t.ODType
, SUM( CASE
WHEN t.ODType = 'D' THEN t.TotalAmount
WHEN t.ODType = 'C' THEN t.TotalAmount*-1
ELSE 0
END )
OVER (
PARTITION BY LoanID
ORDER BY OverDueDate
) AS runningTotal
/* We need to be able to calculate + (D) and - (C) to get a running total. */
, ROW_NUMBER() OVER ( PARTITION BY t.LoanID ORDER BY t.OverDueDate DESC ) AS rn
/* ROW_NUMBER() helps us find the most recent record for the LoanID. */
FROM t
) s1
)
SELECT b.LoanID
, b.OverDueDate
, b.TotalAmount
, b.ODType
, b.runningTotal
, CASE
WHEN b.od IS NOT NULL THEN b.od
WHEN b.runningTotal <= 0 THEN NULL /* If they don't owe, they aren't overdue. */
ELSE ( SELECT max(s1.od)
FROM cte1 s1
WHERE b.LoanID = s1.LoanID
AND s1.OverDueDate <= b.OverDueDate
)
END AS runningOverDue /* Calculate the running overdue date. */
FROM cte1 b
WHERE b.rn=1 /* rn=1 gets the most recent record for each LoanID. */
AND b.runningTotal <> 0 /* This will exclude anyone who doesn't currently
owe money but did. Change to >0 to include only overdues. */
ORDER BY b.LoanID, b.OverDueDate
Results:
| LoanID | overduedate | TotalAmount | ODType | runningTotal | runningOverDue |
|--------|-------------|-------------|--------|--------------|----------------|
| 12345 | 2017-03-12 | 1000 | D | 1500 | 2017-01-10 |
| 12346 | 2018-01-02 | 1000 | C | 1000 | 2018-01-01 |
| 12347 | 2017-11-01 | 1001 | C | -1 | (null) |
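The running-total rule used above (a new overdue period starts whenever the balance had been cleared before a row) can also be written as a short procedural sketch, which may help when porting it to a SQL dialect without window functions. This is a simplified variant rather than a line-by-line translation of the query: it only reports loans that still owe money, and it reads the question's dates as dd/mm/yy. The function name `overdue_by_loan` is made up for this example.

```python
from datetime import date


def overdue_by_loan(transactions):
    """transactions: iterable of (loan_id, txn_date, amount, od_type),
    with od_type 'D' for a due and 'C' for a collection.

    Returns {loan_id: (overdue_amount, overdue_start)} for loans whose
    dues exceed their collections.
    """
    by_loan = {}
    for loan_id, txn_date, amount, od_type in transactions:
        by_loan.setdefault(loan_id, []).append((txn_date, amount, od_type))

    result = {}
    for loan_id, txns in by_loan.items():
        running = 0
        start = None
        # Process rows in date order, like ORDER BY OverDueDate in the window.
        for txn_date, amount, od_type in sorted(txns, key=lambda t: t[0]):
            previous = running
            running += amount if od_type == "D" else -amount
            # A new overdue period begins when the balance was cleared
            # (or overpaid) before this row and is positive after it.
            if previous <= 0 < running:
                start = txn_date
        if running > 0:
            result[loan_id] = (running, start)
    return result


# Sample rows from the question, reading the dates as dd/mm/yy.
transactions = [
    (12345, date(2017, 10, 1), 1000, "D"),
    (12345, date(2017, 11, 1), 500, "C"),
    (12345, date(2017, 12, 3), 1000, "D"),
    (12346, date(2017, 10, 1), 1500, "D"),
    (12346, date(2017, 11, 1), 500, "C"),
    (12346, date(2017, 12, 3), 1000, "C"),
    (12346, date(2018, 1, 1), 2000, "D"),
    (12346, date(2018, 2, 1), 1000, "C"),
]
overdue = overdue_by_loan(transactions)
```

For the question's data this gives 1500 since 01/10/2017 for loan 12345 and 1000 since 01/01/2018 for loan 12346, matching the expected scenarios.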

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there a way to easily do this in a single query?
I'm not 100% sure this will work on Impala,
but it should if you have a days table, or a way to create a derived table on the fly in Impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day"
FROM sales
You can use this query (the demo was run on PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATED VERSION
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
try this one:
select day, count(distinct(id)) from yourtable group by day
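For the cumulative count the question asks for (distinct IDs seen up to and including each day), the expected numbers can be reproduced with a running set. The sketch below is plain Python rather than Impala SQL, and the function name `cumulative_distinct_counts` is an assumption for illustration; it is mainly useful for validating whichever query you end up running.

```python
def cumulative_distinct_counts(visits):
    """visits: iterable of (day, customer_id).

    Returns [(day, number of distinct ids seen up to and including day)],
    ordered by day.
    """
    # Collect the ids seen on each individual day.
    ids_by_day = {}
    for day, customer_id in visits:
        ids_by_day.setdefault(day, set()).add(customer_id)

    seen = set()
    counts = []
    for day in sorted(ids_by_day):
        seen |= ids_by_day[day]  # accumulate ids across days
        counts.append((day, len(seen)))
    return counts


# Sample data from the question.
visits = [
    (1, 1234), (1, 5631), (1, 1234),
    (2, 1234), (2, 4456), (2, 5631),
    (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234),
]
counts = cumulative_distinct_counts(visits)
```

A per-day `GROUP BY day` with `COUNT(DISTINCT id)` would instead count each day in isolation, which is why the self-join on `"Day" <= "DayC"` is needed in SQL.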

T-SQL Programming: Common Table Expression

I need help with the following scenario. I am using T-SQL.
Here are my table details; say the table name is #tempk.
Customer Current_Month Contract Amount
201 2015-09-01 3 100
My requirement is to add 12 months to the current month, which gives 2016-09-01 (assuming
I always get the first day of the month). I need the data in the following format:
Customer Renewal_Month Contract_months End_Month Amount
201 2015-09-01 3 2016-09-01 100
201 2015-12-01 3 2016-09-01 100
201 2016-03-01 3 2016-09-01 100
201 2016-06-01 3 2016-09-01 100
The Contract column can have any value.
Each subsequent record's renewal month is the previous record's renewal month plus the Contract months.
I am using the following query. I have a date dimension table called Dim_Date that has date, quarter, year, month, etc.
WITH GetProrateCTE (Customer_ID,Renewal_Month,Contract_Months,End_Month,MRR) as
(SELECT Customer_ID,Renewal_Month,Contract_Months,DATEADD(month, 12,Renewal_Month) End_Month,MRR
from #tempk),
GetRenewalMonths (Customer_ID,Renewal_Month,Contract_Months,End_Month,MRR) as
(
SELECT A.Customer_ID,B.Month Renewal_Month,A.Contract_Months,A.End_Month,A.MRR
FROM GetProrateCTE A
INNER JOIN (SELECT Month from DW..Dim_Date B GROUP BY MONTH) B
ON B.Month between A.Renewal_Month and A.End_Month
)
SELECT G.Customer_ID,G.Renewal_Month,G.Contract_Months,G.End_Month,G.MRR
FROM GetRenewalMonths G
Could you please help me achieve this result? Any help would be greatly appreciated.
I want to do this with common table expressions. Or would it be better to use a cursor?
You can try it this way:
WITH CTE AS
(SELECT Customer,DATEADD(MM,DATEDIFF(MM,0,Current_Month), 0) AS Renewal_Month,Contract,DATEADD(YEAR,1,Current_Month) AS End_Month,Amount,1 AS Level FROM #tempk
UNION ALL
SELECT t.Customer,DATEADD(MONTH,t.Contract,c.Renewal_Month),t.Contract,DATEADD(YEAR,1,t.Current_Month) AS End_Month,t.Amount,Level + 1
FROM #tempk t join CTE c on t.customer = c.customer
WHERE Level < (12/t.Contract))
SELECT Customer,Renewal_Month,Contract AS Contract_months,End_Month,Amount
FROM CTE
Just append your logic of the date dimension table to this.
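As a cross-check for the recursive CTE, the same schedule can be generated procedurally: start at the current month and keep adding the contract length until the end month is reached. This is a sketch under the assumption that every date is the first day of a month; `add_months` and `renewal_schedule` are hypothetical helper names. Note that for contract lengths that do not divide 12 evenly, this loop may produce a different number of rows than the `Level < (12/t.Contract)` condition, which uses integer division.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def renewal_schedule(customer, current_month, contract, amount):
    """Return (customer, renewal_month, contract, end_month, amount) rows,
    one per renewal within 12 months of current_month."""
    end_month = add_months(current_month, 12)
    rows = []
    renewal = current_month
    while renewal < end_month:
        rows.append((customer, renewal, contract, end_month, amount))
        renewal = add_months(renewal, contract)
    return rows


# Sample row from the question: customer 201, 3-month contract.
schedule = renewal_schedule(201, date(2015, 9, 1), 3, 100)
renewal_months = [row[1] for row in schedule]
```

For the sample row this produces four renewals, from 2015-09-01 through 2016-06-01, each with an end month of 2016-09-01.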