Generate dates without gaps and vlookup 2 tables - SQL

I have 2 tables:
Table 1 has the available principal balance on the transaction dates
=> (Transaction_Date, Principal_Balance)
Table 2 has interest rates and their effective dates
=> (Effective_Date, Interest)
I have to generate a result set with the below data:
Result_Date - list out all the dates from Table 1 (Transaction_Date), filling all the gaps between the available dates.
Principal_Balance - vlookup the Result_Date in Table 1 and fill in the corresponding Principal_Balance until the next Table 1 Transaction_Date.
Interest - vlookup the Result_Date in Table 2 (Effective_Date) and fill in the Interest for that date range.
How can I implement this in Snowflake?
Note: I can use CTEs but cannot create tables.

This takes a couple of simple steps: find the min/max dates for both tables, take the least/greatest of those values, spread a generator range across them by feeding a row_number into dateadd, then left join this range of dates to the original data, and use nvl to pick between the current row's value and the carried-forward value from lag with the ignore nulls extension.
with table_1(t_date, p_bal) as (
    select * from values
        ('2023-02-10'::date, 10000),
        ('2023-01-20'::date, 9000),
        ('2023-01-03'::date, 8000)
), table_2(e_date, interest) as (
    select * from values
        ('2023-02-05'::date, 1),
        ('2023-01-15'::date, 2),
        ('2023-01-01'::date, 3)
), range_of_date as (
    -- one row per day, from the earliest date in either table to the latest
    select dateadd(day, rn, min_date) as date
    from (
        select
            least(min_t_date, min_e_date) as min_date
            ,greatest(max_t_date, max_e_date) as max_date
        from (
            select
                min(t_date) as min_t_date
                ,max(t_date) as max_t_date
            from table_1
        ) as t
        cross join (
            select
                min(e_date) as min_e_date
                ,max(e_date) as max_e_date
            from table_2
        ) as e
    ) as a
    cross join (
        -- the generator emits a fixed row count, so 1000 days is the cap here
        select
            row_number() over (order by null) - 1 as rn
        from table(generator(ROWCOUNT => 1000))
    ) as b
    -- having can reference the select alias, trimming the generator overshoot
    having date <= max_date
)
select
    *
    ,nvl(t1.p_bal, lag(t1.p_bal) ignore nulls over (order by rd.date)) as filled_p_bal
    ,nvl(t2.interest, lag(t2.interest) ignore nulls over (order by rd.date)) as filled_interest
from range_of_date as rd
left join table_1 as t1
    on rd.date = t1.t_date
left join table_2 as t2
    on rd.date = t2.e_date
order by 1;
gives:
DATE        T_DATE      P_BAL   E_DATE      INTEREST  FILLED_P_BAL  FILLED_INTEREST
2023-01-01                      2023-01-01  3                       3
2023-01-02                                                          3
2023-01-03  2023-01-03  8,000                         8,000         3
2023-01-04                                            8,000         3
2023-01-05                                            8,000         3
2023-01-06                                            8,000         3
2023-01-07                                            8,000         3
2023-01-08                                            8,000         3
2023-01-09                                            8,000         3
2023-01-10                                            8,000         3
2023-01-11                                            8,000         3
2023-01-12                                            8,000         3
2023-01-13                                            8,000         3
2023-01-14                                            8,000         3
2023-01-15                      2023-01-15  2         8,000         2
2023-01-16                                            8,000         2
2023-01-17                                            8,000         2
2023-01-18                                            8,000         2
2023-01-19                                            8,000         2
2023-01-20  2023-01-20  9,000                         9,000         2
2023-01-21                                            9,000         2
2023-01-22                                            9,000         2
2023-01-23                                            9,000         2
2023-01-24                                            9,000         2
2023-01-25                                            9,000         2
2023-01-26                                            9,000         2
2023-01-27                                            9,000         2
2023-01-28                                            9,000         2
2023-01-29                                            9,000         2
2023-01-30                                            9,000         2
2023-01-31                                            9,000         2
2023-02-01                                            9,000         2
2023-02-02                                            9,000         2
2023-02-03                                            9,000         2
2023-02-04                                            9,000         2
2023-02-05                      2023-02-05  1         9,000         1
2023-02-06                                            9,000         1
2023-02-07                                            9,000         1
2023-02-08                                            9,000         1
2023-02-09                                            9,000         1
2023-02-10  2023-02-10  10,000                        10,000        1
Take 2:
To make the result only cover the date range of table 1, while still carrying interest values that took effect earlier and are not aligned with the first balance date, the time range we generate needs to be wider. (The match could be more constrained, but I have opted for what is simplest to me.)
with table_1(t_date, p_bal) as (
    select * from values
        ('2023-02-10'::date, 10000),
        ('2023-01-20'::date, 9000),
        ('2023-01-03'::date, 8000)
), table_2(e_date, interest) as (
    select * from values
        ('2023-02-05'::date, 1),
        ('2023-01-15'::date, 2),
        ('2023-01-01'::date, 3)
), range_1 as (
    select
        min(t_date) as min_t_date
        ,max(t_date) as max_t_date
    from table_1
), range_of_date as (
    -- start as early as the earliest interest date, so the first effective
    -- rate can be carried forward, but stop at the last transaction date
    select dateadd(day, rn, min_date) as date
    from (
        select
            least(min_t_date, (select min(e_date) from table_2)) as min_date
            ,max_t_date as max_date
        from range_1
    ) as a
    cross join (
        select
            row_number() over (order by null) - 1 as rn
        from table(generator(ROWCOUNT => 1000))
    ) as b
    having date <= max_date
)
select
    rd.date
    ,nvl(t1.p_bal, lag(t1.p_bal) ignore nulls over (order by rd.date)) as filled_p_bal
    ,nvl(t2.interest, lag(t2.interest) ignore nulls over (order by rd.date)) as filled_interest
from range_of_date as rd
left join table_1 as t1
    on rd.date = t1.t_date
left join table_2 as t2
    on rd.date = t2.e_date
-- drop the leading rows that have an interest rate but no balance yet
qualify filled_p_bal is not null
order by 1;
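If your Snowflake account has ASOF JOIN available (a newer feature, so treat that as an assumption), the fill can also be written without the lag/nvl trick: each spine date joins to the most recent row at or before it in each table. A sketch, reusing the table_1, table_2 and range_of_date CTEs from Take 2 above in place of the final SELECT:
select
    rd.date
    ,t1.p_bal as filled_p_bal
    ,t2.interest as filled_interest
from range_of_date as rd
asof join table_1 as t1
    match_condition (rd.date >= t1.t_date)  -- latest balance on or before each date
asof join table_2 as t2
    match_condition (rd.date >= t2.e_date)  -- latest rate on or before each date
order by rd.date;
The inner ASOF JOIN drops spine dates that have no matching balance yet, which plays the same role as the qualify filter.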

Related

First value in subsequent rows that matches a condition

order_at    delivery_at
2023-01-01  2023-01-03
2023-01-02  2023-01-03
2023-01-03  2023-01-05
2023-01-04  2023-01-05
I want a new field, next_delivery_at, which for each row is the first delivery_at in subsequent rows that is not the same value as the current delivery_at, so the final table would be:
order_at    delivery_at  next_delivery_at
2023-01-01  2023-01-03   2023-01-05
2023-01-02  2023-01-03   2023-01-05
2023-01-03  2023-01-05   null
2023-01-04  2023-01-05   null
For this specific case, I could do something like:
CASE
    WHEN (LEAD(delivery_at) OVER (PARTITION BY NULL ORDER BY delivery_at DESC) = delivery_at)
        THEN (LEAD(delivery_at, 2) OVER (PARTITION BY NULL ORDER BY delivery_at DESC))
    ELSE LEAD(delivery_at) OVER (PARTITION BY NULL ORDER BY delivery_at DESC)
END AS next_delivery_at
But if there are more than two consecutive rows with the same delivery_at, the output will be wrong, so I am looking for a generic way of getting the first value in subsequent rows that is distinct from each row's delivery_at value.
You can use a self join to match successive deliveries, then get the minimum next delivery.
SELECT t1.order_at, t1.delivery_at, MIN(t2.delivery_at) AS next_delivery_at
FROM tab t1
LEFT JOIN tab t2
ON t1.delivery_at < t2.delivery_at
GROUP BY t1.order_at, t1.delivery_at
You might consider the below, using a logical window frame (RANGE):
WITH sample_table AS (
SELECT '2023-01-01' order_at, '2023-01-03' delivery_at UNION ALL
SELECT '2023-01-02' order_at, '2023-01-03' delivery_at UNION ALL
SELECT '2023-01-03' order_at, '2023-01-05' delivery_at UNION ALL
SELECT '2023-01-04' order_at, '2023-01-05' delivery_at
)
SELECT *,
FIRST_VALUE(delivery_at) OVER (
ORDER BY UNIX_DATE(DATE(delivery_at))
RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
) AS next_delivery_at
FROM sample_table;
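A portable variant of the same idea, with no BigQuery-specific functions, is to rank the distinct delivery dates and pull the date one rank ahead. A sketch, assuming the table is named tab as in the self-join answer above:
WITH ranked AS (
    -- DENSE_RANK gives equal delivery dates the same rank
    SELECT order_at, delivery_at,
           DENSE_RANK() OVER (ORDER BY delivery_at) AS rnk
    FROM tab
),
dates AS (
    SELECT DISTINCT delivery_at, rnk
    FROM ranked
)
SELECT r.order_at, r.delivery_at, d.delivery_at AS next_delivery_at
FROM ranked r
LEFT JOIN dates d
    ON d.rnk = r.rnk + 1   -- the next distinct delivery date, or null for the last
ORDER BY r.order_at;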

Filling missing weekend rows with previous working day values

I have a data table like the below. For each customer, missing days (weekends or holidays) should be inserted with the balance of the previous working day, and this should only be done between the dates that the customer has in the table. Balance should be added as 0 for dates outside the customer's date range in the table. So customer 1 should be filled between 2022-07-01 and 2022-07-31, and customer 2 between 2022-07-07 and 2022-07-19. Also, for the dates 2022-07-01 to 2022-07-07 and 2022-07-19 to 2022-07-31, customer 2's balance should be added as 0.
Data Table
date customer_id balance
2022-07-01 1 100
2022-07-04 1 150
2022-07-05 1 200
. 1 .
. 1 .
2022-07-31 1 650
2022-07-07 2 200
2022-07-08 2 300
2022-07-11 2 400
. 2 .
. 2 .
2022-07-19 2 750
Output table should look like this:
date customer_id balance
2022-07-01 1 100
2022-07-02 1 100
2022-07-03 1 100
2022-07-04 1 150
2022-07-05 1 200
. 1 .
. 1 .
2022-07-31 1 650
2022-07-01 2 0
2022-07-02 2 0
. 2 .
. 2 .
2022-07-07 2 200
2022-07-08 2 300
2022-07-09 2 300
2022-07-10 2 300
2022-07-11 2 400
. 2 .
. 2 .
2022-07-19 2 750
2022-07-20 2 0
. 2 .
. 2 .
2022-07-31 2 0
There are solutions to similar questions on this site that use a cross join with a calendar table, but I couldn't adapt them to my case.
Any help is much appreciated.
The below is a solution that uses recursion instead of a calendar table.
It essentially works by 'extending' your original data to create some extra rows with 0 balances for every customer at:
- the min date in the table (if the customer didn't already have a record at the min date)
- the max date in the table (if the customer didn't already have a record at the max date)
- the day after the last record for the customer (as long as this doesn't go over the max date in the table)
It then uses recursion to plug the gaps between the dates for each customer.
With balances as (
    -- This is a simplified version of the data already in your table
    SELECT '2022-07-01' as dt, 1 as customer_id, 100 as balance
    UNION ALL SELECT '2022-07-04' as dt, 1 as customer_id, 150 as balance
    UNION ALL SELECT '2022-07-05' as dt, 1 as customer_id, 200 as balance
    UNION ALL SELECT '2022-07-31' as dt, 1 as customer_id, 650 as balance
    UNION ALL SELECT '2022-07-07' as dt, 2 as customer_id, 200 as balance
    UNION ALL SELECT '2022-07-08' as dt, 2 as customer_id, 300 as balance
    UNION ALL SELECT '2022-07-11' as dt, 2 as customer_id, 400 as balance
    UNION ALL SELECT '2022-07-19' as dt, 2 as customer_id, 750 as balance
)
, min_records as (
    -- This creates a 0 balance record for each customer at the min date in the table
    SELECT dt, customer_id, 0 as balance
    FROM (
        SELECT min(dt) as dt
        FROM balances
    ) as min_dt
    CROSS JOIN (
        SELECT DISTINCT customer_id
        FROM balances
    ) as customers
)
, max_records as (
    -- This creates a 0 balance record for each customer at the max date in the table
    SELECT dt, customer_id, 0 as balance
    FROM (
        SELECT max(dt) as dt
        FROM balances
    ) as max_dt
    CROSS JOIN (
        SELECT DISTINCT customer_id
        FROM balances
    ) as customers
)
, max_customer_records as (
    -- This creates a 0 balance record for each customer for the day after their last record,
    -- so long as that date does not go beyond the max date in the table
    SELECT dateadd(day, 1, max(dt)) as dt, customer_id, 0 as balance
    FROM balances as a
    CROSS JOIN (
        SELECT max(dt) as max_dt
        FROM balances
    ) as m
    GROUP BY customer_id, max_dt
    HAVING max(dt) < max_dt
)
, extended_balances as (
    -- We then union all of the tables above with the original balances table.
    -- Grouping to the dt + customer level and sum(balance) won't cause issues for customers
    -- who already had a record on the min(dt) or max(dt), because x + 0 still = x
    SELECT dt, customer_id, sum(balance) as balance
    FROM (
        SELECT *
        FROM balances
        UNION
        SELECT dt, customer_id, balance
        FROM min_records
        UNION
        SELECT dt, customer_id, balance
        FROM max_records
        UNION
        SELECT dt, customer_id, balance
        FROM max_customer_records
    ) AS A
    GROUP BY dt, customer_id
)
, recursive_query as (
    -- Now we use recursion to fill in the gaps between the dates
    SELECT dt as original_dt
        , dt
        , customer_id
        , balance
        -- We use lead() to find the date when a new balance exists
        , coalesce(lead(dt) over(partition by customer_id order by dt asc), dateadd(day, 1, dt)) as next_dt
    FROM extended_balances
    UNION ALL
    SELECT original_dt
        , dateadd(day, 1, dt)
        , customer_id
        , balance
        , next_dt
    FROM recursive_query
    WHERE dateadd(day, 1, dt) < next_dt
)
SELECT dt, customer_id, balance
FROM recursive_query
ORDER BY customer_id, dt
To help illustrate the steps, I've included examples of key tables:
Balances:
dt          customer_id  balance
2022-07-01  1            100
2022-07-04  1            150
2022-07-05  1            200
2022-07-31  1            650
2022-07-07  2            200
2022-07-08  2            300
2022-07-11  2            400
2022-07-19  2            750
Extended Balances:
dt          customer_id  balance
2022-07-01  1            100
2022-07-04  1            150
2022-07-05  1            200
2022-07-31  1            650
2022-07-01  2            0
2022-07-07  2            200
2022-07-08  2            300
2022-07-11  2            400
2022-07-19  2            750
2022-07-20  2            0
2022-07-31  2            0
First 10 records of the recursive query:
original_dt  dt          customer_id  balance  next_dt
2022-07-01   2022-07-01  1            100      2022-07-04
2022-07-01   2022-07-02  1            100      2022-07-04
2022-07-01   2022-07-03  1            100      2022-07-04
2022-07-04   2022-07-04  1            150      2022-07-05
2022-07-05   2022-07-05  1            200      2022-07-31
2022-07-05   2022-07-06  1            200      2022-07-31
2022-07-05   2022-07-07  1            200      2022-07-31
2022-07-05   2022-07-08  1            200      2022-07-31
2022-07-05   2022-07-09  1            200      2022-07-31
2022-07-05   2022-07-10  1            200      2022-07-31
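For reference, the calendar-table approach the question mentions can also be made to work without recursion. A sketch in Snowflake (an assumption about the platform, suggested by the dateadd calls above), assuming a real balances table with a date-typed dt column and a history no longer than 1000 days:
with calendar as (
    -- one row per day between the table's min and max dates
    select dateadd(day, row_number() over (order by null) - 1, min_dt) as dt
        , max_dt
    from (select min(dt) as min_dt, max(dt) as max_dt from balances) as r
    cross join table(generator(rowcount => 1000))  -- assumed cap of 1000 days
    qualify dt <= max_dt
)
select c.dt
    , cust.customer_id
    , case
        -- inside the customer's own range: carry the last known balance forward
        when c.dt between cust.first_dt and cust.last_dt
            then coalesce(b.balance,
                          lag(b.balance) ignore nulls
                              over (partition by cust.customer_id order by c.dt))
        -- outside it: 0, as the question requires
        else 0
      end as balance
from calendar as c
cross join (
    select customer_id, min(dt) as first_dt, max(dt) as last_dt
    from balances
    group by customer_id
) as cust
left join balances as b
    on b.customer_id = cust.customer_id
   and b.dt = c.dt
order by cust.customer_id, c.dt;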

SQL: create intermediate data from a date range

I have a table as shown here:
USER  ROI  DATE
1     5    2021-11-24
1     4    2021-11-26
1     6    2021-11-29
I want to get the ROI for the dates in between the given dates. The expected result, from 2021-11-24 to 2021-11-30, will be as below:
USER  ROI  DATE
1     5    2021-11-24
1     5    2021-11-25
1     4    2021-11-26
1     4    2021-11-27
1     4    2021-11-28
1     6    2021-11-29
1     6    2021-11-30
You may use a calendar table approach here. Create a table containing all dates and then join with it. Sans an actual table, you may use an inline CTE:
WITH dates AS (
SELECT '2021-11-24' AS dt UNION ALL
SELECT '2021-11-25' UNION ALL
SELECT '2021-11-26' UNION ALL
SELECT '2021-11-27' UNION ALL
SELECT '2021-11-28' UNION ALL
SELECT '2021-11-29' UNION ALL
SELECT '2021-11-30'
),
cte AS (
SELECT USER, ROI, DATE, LEAD(DATE) OVER (ORDER BY DATE) AS NEXT_DATE
FROM yourTable
)
SELECT t.USER, t.ROI, d.dt
FROM dates d
INNER JOIN cte t
ON d.dt >= t.DATE AND (d.dt < t.NEXT_DATE OR t.NEXT_DATE IS NULL)
ORDER BY d.dt;
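If hand-writing the dates CTE is impractical, most engines with recursive CTE support can generate the range instead. A sketch of the same query with a generated calendar (the date-arithmetic syntax varies by engine, so treat the INTERVAL form as an assumption):
WITH RECURSIVE dates AS (
    -- anchor: the first date of the desired range
    SELECT DATE '2021-11-24' AS dt
    UNION ALL
    -- step: add one day until the end of the range
    SELECT dt + INTERVAL '1' DAY
    FROM dates
    WHERE dt < DATE '2021-11-30'
),
cte AS (
    SELECT USER, ROI, DATE, LEAD(DATE) OVER (ORDER BY DATE) AS NEXT_DATE
    FROM yourTable
)
SELECT t.USER, t.ROI, d.dt
FROM dates d
INNER JOIN cte t
    ON d.dt >= t.DATE AND (d.dt < t.NEXT_DATE OR t.NEXT_DATE IS NULL)
ORDER BY d.dt;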

SQL query joining on existing date records and max date for missing records

I have an items table with dates and values. Once the value reaches 1.0, there are no more records for that ItemId.
Item Table
Itemid ItemDate Value
1 2020-04-30 0.5
1 2020-05-31 0.75
1 2020-06-30 1.0
2 2020-05-31 0.6
2 2020-06-30 1.0
I want to join this with a simple date table
dateId EOMDate
1 2020-04-30
2 2020-05-31
3 2020-06-30
4 2020-07-31
5 2020-08-31
The result should contain one record for each date in the date table, for each item, where the date is >= the item's first ItemDate. Where there is an exact date match with the item table, it uses that record. Where there is no matching record in the item table, it uses the item record with the max ItemDate on or before that EOMDate.
So it should produce this:
Result EOMDate ItemDate Value
1 2020-04-30 2020-04-30 0.5
1 2020-05-31 2020-05-31 0.75
1 2020-06-30 2020-06-30 1.0
1 2020-07-31 2020-06-30 1.0
1 2020-08-31 2020-06-30 1.0
2 2020-05-31 2020-05-31 0.6
2 2020-06-30 2020-06-30 1.0
2 2020-07-31 2020-06-30 1.0
2 2020-08-31 2020-06-30 1.0
The item table has several hundred million rows, and the date table has 120 records (each month end for 10 years), so I need a well-performing solution. This has completely stumped me for some reason!
EDIT
My initial (and non-working) solution uses an OUTER APPLY:
select p.ItemId, p.ItemDate, d.EOMDate, p.Value
from (select ItemId, ItemDate, Value from Items) p
OUTER APPLY
(
    SELECT EOMDate from dates
) d
order by p.ItemDate, d.EOMDate
However, it returns a table with one record for each combination of ItemDate and EOMDate: in the above example, 20 records for ItemId 1 and 16 records for ItemId 2.
Here is the SQL to create the example tables above:
CREATE TABLE #Items (ItemId int, ItemDate date, [Value] float)

INSERT INTO #Items (ItemId, ItemDate, [Value])
VALUES (1,'2020-04-30',0.5), (1,'2020-05-31',0.75), (1,'2020-06-30',1), (2,'2020-05-31',0.6), (2,'2020-06-30',1)

CREATE TABLE #dates (dateId int, EOMDate date)

INSERT INTO #dates (dateId, EOMDate)
VALUES (1,'2020-04-30'), (2,'2020-05-31'), (3,'2020-06-30'), (4,'2020-07-31'), (5,'2020-08-31')
One method uses APPLY:
select i.*, d.*
from (select ItemId, max(ItemDate) as max_date
      from #Items
      group by ItemId
     ) i outer apply
     (select top (1) d.*
      from #dates d
      where d.EOMDate >= max_date
      order by d.EOMDate asc
     ) d
You can use a cross join and an analytic function as follows:
select *
from (select a.ItemId, d.EOMDate, i.ItemDate, i.[Value],
             row_number() over (partition by a.ItemId, d.EOMDate
                                order by i.ItemDate desc) as rn
      from (select distinct ItemId from #Items) a
      cross join #dates d
      join #Items i
        on i.ItemId = a.ItemId and d.EOMDate >= i.ItemDate
     ) t
where rn = 1
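Another angle in SQL Server, closer to the original OUTER APPLY attempt, is to apply per (ItemId, EOMDate) pair and take the latest item row at or before each month end. A sketch against the temp tables from the question; whether it performs at several hundred million rows would need testing, and an index on (ItemId, ItemDate) is assumed:
SELECT i.ItemId, d.EOMDate, x.ItemDate, x.[Value]
FROM #dates AS d
CROSS JOIN (SELECT DISTINCT ItemId FROM #Items) AS i
OUTER APPLY (
    -- latest item record at or before this month end
    SELECT TOP (1) it.ItemDate, it.[Value]
    FROM #Items AS it
    WHERE it.ItemId = i.ItemId
      AND it.ItemDate <= d.EOMDate
    ORDER BY it.ItemDate DESC
) AS x
WHERE x.ItemDate IS NOT NULL  -- drop month ends before the item's first record
ORDER BY i.ItemId, d.EOMDate;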

How to shift a year-week field in BigQuery

This question is about shifting the values of a year-week field in BigQuery.
run_id year_week value
0001 201451 13
0001 201452 6
0001 201503 3
0003 201351 8
0003 201352 5
0003 201403 1
Here, the week within each year can range from 01 to 53. For example, the last week of 2014 is 201452, but the last week of 2015 is 201553.
Now I want to shift the values for each year_week in each run_id forward by 5 weeks. Weeks with no value are assumed to have a value of 0. For example, the output for the example table above should look like this:
run_id year_week value
0001 201504 13
0001 201505 6
0001 201506 0
0001 201507 0
0001 201508 3
0003 201404 8
0003 201405 5
0003 201406 0
0003 201407 0
0003 201408 1
Explanation of the output: in the table above, for run_id 0001 the year_week 201504 has a value of 13 because the input table had a value of 13 for year_week 201451, which is 5 weeks before 201504.
I could create the mapping from each year_week to a shifted year_week programmatically and then join to get the output, but I was wondering whether there is a way to do it purely in SQL.
#standardSQL
WITH `project.dataset.table` AS (
SELECT '001' run_id, 201451 year_week, 13 value UNION ALL
SELECT '001', 201452, 6 UNION ALL
SELECT '001', 201503, 3
), weeks AS (
SELECT 100 * year + week year_week
FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1+year,1,1)) = 1, 52, 53))) week
), temp AS (
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
LEFT JOIN `project.dataset.table` d
USING(year_week, run_id)
)
SELECT * FROM (
SELECT run_id, year_week,
SUM(value) OVER(win) value
FROM temp
WINDOW win AS (
PARTITION BY run_id ORDER BY year_week ROWS BETWEEN 5 PRECEDING AND 5 PRECEDING
)
)
WHERE NOT value IS NULL
ORDER BY run_id, year_week
with the result:
Row run_id year_week value
1 001 201504 13
2 001 201505 6
3 001 201508 3
If you need to "preserve" zero rows, just change the below portion:
SELECT i.run_id, w.year_week, d.year_week week2, value
FROM weeks w
to
SELECT i.run_id, w.year_week, d.year_week week2, IFNULL(value, 0) value
FROM weeks w
or
SUM(value) OVER(win) value
FROM temp
to
SUM(IFNULL(value, 0)) OVER(win) value
FROM temp
If you have data in the table for all year-weeks, then you can do:
with yw as (
    select year_week, row_number() over (order by year_week) as seqnum
    from t
    group by year_week
)
select t.*, yw5.year_week as new_year_week
from t join
     yw
     on t.year_week = yw.year_week left join
     yw yw5
     on yw5.seqnum = yw.seqnum + 5;
If you don't have a table of year weeks, then I would advise you to create such a table, so you can do such manipulations -- or a more general calendar table.
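For completeness, such a year-week list can be generated the same way as the weeks CTE in the first answer, with a sequence number defining the 5-week shift. A sketch (like the seqnum answer above, this only shifts existing rows and does not zero-fill the gaps):
#standardSQL
WITH weeks AS (
    -- one row per ISO week of each year, 52 or 53 weeks as appropriate
    SELECT 100 * year + week AS year_week,
           ROW_NUMBER() OVER (ORDER BY year, week) AS seqnum
    FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
         UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
)
SELECT t.run_id, w5.year_week, t.value
FROM `project.dataset.table` t
JOIN weeks w
    ON t.year_week = w.year_week
JOIN weeks w5
    ON w5.seqnum = w.seqnum + 5   -- the week 5 positions later in the calendar
ORDER BY t.run_id, w5.year_week;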