Find max of total days using Partition By [duplicate] - sql

This question already has answers here:
Merge overlapping date intervals
(9 answers)
Closed 4 months ago.
Need SQL Query to get a max of the sum of Days where enddate is equal to startdate
.Below is a table
ID
StartDate
EndDate
Days
121
01-01-2022
01-03-2022
2
121
01-03-2022
01-04-2022
1
121
01-04-2022
01-06-2022
2
121
01-07-2022
01-08-2022
1
121
01-08-2022
01-09-2022
1
In the above table, the 01-01-2022 to 01-06-2022 sum is 5 which is greater than the sum of 2 from 01-07-2022 to 01-09-2022.
Output required
ID
Days
121
5

I have written as per my understanding. correct me if I did worse
with table as (
select 121 id, "12-28-2021" startdate, "12-30-2021" enddate, 2 days
union all
select 121 id, "12-30-2021" startdate, "12-31-2021" enddate, 1 days
union all
select 121 id, "01-01-2022" startdate, "01-03-2022" enddate, 2 days
union all
select 121 id, "01-03-2022" startdate, "01-04-2022" enddate, 1 days
union all
select 121 id, "01-04-2022" startdate, "01-06-2022" enddate, 2 days
union all
select 121 id, "01-07-2022" startdate, "01-08-2022" enddate, 1 days
union all
select 121 id, "01-08-2022" startdate, "01-09-2022" enddate, 1 days
)
select
table3.id,
max(days) days
from
(
select
table2.id,
sum(days) days
from
(
select
table1.*,
case
when sum(x) over(rows between unbounded preceding and 1 preceding) is null then 0
else sum(x) over(rows between unbounded preceding and 1 preceding)
end as y
from
(
select
table.*,
case
when enddate = lead(startdate) over(order by startdate) then 0
else 1
end as x
from table
) table1
) table2
group by id,y
) table3
group by id

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, If input is:
Output will be:
I have tried lag lead, unbounded, bounded precedings but unable to achieve the output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consective rows having same ID and Badge.
Help will be appreciated, Thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step using a boolean field that is 1 when there's no consecutive date match (row1.ENDDATE = row2.STARTDATE + 1 day). This value will indicate when a new partition should be created. Hence if you compute a running sum, you should have your correctly numbered partitions.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
)
SELECT *
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
In Snowflake, such gaps and island problem can be solved using
function conditional_true_event
As below query -
First CTE, creates a column to indicate a change event (true or false) when a value changes for column badge.
Next CTE (cte_1) using this change event column with function conditional_true_event produces another column (increment if change is TRUE) to be used as grouping, in the final main query.
And, final query is just min, max group by.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by null)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by null) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID
MS
ME
BADGE
51
1985-02-01
2019-04-28
1
51
2019-04-29
2020-08-16
2
51
2020-08-17
2021-04-03
3
51
2021-04-04
2021-04-05
1
51
2021-04-06
2022-08-20
2
51
2022-08-21
9999-12-31
3
10
2020-02-06
9999-12-31
3
With data -
select * from merge_tab;
ID
STARTDATE
ENDDATE
BADGE
51
1985-02-01
2019-04-28
1
51
2019-04-29
2019-04-28
2
51
2019-09-16
2019-11-16
2
51
2019-11-17
2020-08-16
2
51
2020-08-17
2021-04-03
3
51
2021-04-04
2021-04-05
1
51
2021-04-06
2022-05-05
2
51
2022-05-06
2022-08-20
2
51
2022-08-21
9999-12-31
3
10
2020-02-06
2019-04-28
3
10
2021-03-21
9999-12-31
3

BigQuery - Picking latest not null value within 28 interval

I'm trying to add a column on this table and stuck for a little while
ID
Category 1
Date
Data1
A
1
2022-05-30
21
B
2
2022-05-21
15
A
2
2022-05-02
33
A
1
2022-02-11
3
B
2
2022-05-01
19
A
1
2022-05-15
null
A
1
2022-05-20
11
A
2
2022-04-20
22
to
ID
Category 1
Date
Data1
Picked_Data
A
1
2022-05-30
21
11
B
2
2022-05-21
15
19
A
2
2022-05-02
33
22
A
1
2022-02-11
3
some number or null
B
2
2022-05-01
19
some number or null
A
1
2022-05-15
null
some number or null
A
1
2022-05-20
11
some number or null
A
2
2022-04-20
22
some number or null
The logic is to partition by Category1 and ID then pick the latest none null value within the past 28 days. If there is no data exist, it'll be null
For the first row, ID = A and Category 1, it will pick 7th row as they are in the same category, ID and the date difference is <= 28. It skipped row 4th and 6th as the date is too far back and null value.
I've tried querying this by
select first_value(Data1) over (partition bty Category1 order by case when Data1 is not null and Date between Date - Inteverval 28 DAY and Date then 1 else 2) as Picked_Data
but it's picking incorrect rows,my guess is this query
Date between Date - Inteverval 28 DAY and Date
is not picking the correct date.. could anyone give me advise/suggestion how I could twick this query?
Consider below approach
select *,
first_value(data1 ignore nulls) over past_28_days as picked_data
from your_table
window past_28_days as (
partition by id, category_1
order by unix_date(date)
range between 29 preceding and 1 preceding
)
if applied to sample data in your question - output is
Consider below approach:
with sample_data as (
select 'A' as ID, 1 as category_1, date('2022-05-30') as date, 21 as data1,
union all select 'B' as ID, 2 as category_1, date('2022-05-21') as date, 15 as data1,
union all select 'A' as ID, 2 as category_1, date('2022-05-02') as date, 33 as data1,
union all select 'A' as ID, 1 as category_1, date('2022-02-11') as date, 3 as data1,
union all select 'B' as ID, 2 as category_1, date('2022-05-01') as date, 19 as data1,
union all select 'A' as ID, 1 as category_1, date('2022-05-15') as date, NULL as data1,
union all select 'A' as ID, 1 as category_1, date('2022-05-20') as date, 11 as data1,
union all select 'A' as ID, 2 as category_1, date('2022-04-20') as date, 22 as data1,
),
with_next_data as (
select *,
lag(date) over (partition by ID,category_1 order by date) as next_date,
lag(data1) over (partition by ID,category_1 order by date) as next_data,
from sample_data
)
select
id,
category_1,
date,
data1,
if(date_diff(date, next_date,day) <= 28, next_data, null) as picked_data
from with_next_data
Output:

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new SQL coding using in SQL developer.
I have a table that has 4 columns: Patient ID (ptid), service date (dt), insurance payment amount (insr_amt), out of pocket payment amount (op_amt). (see table 1 below)
What I would like to do is (1) create two columns "start_dt" and "end_dt" using the "dt" column where if there are no gaps in the date by the patient ID then populate the start and end date with the first and last date by patient ID, however if there is a gap in service date within the patient ID then to create the separate start and end date rows per patient ID, along with (2) summing the two payment amounts by patient ID with in the one set of start and end date visits (see table 2 below).
What would be the way to run this using SQL code in SQL developer?
Thank you!
Table 1:
Ptid
dt
insr_amt
op_amt
A
1/1/2021
30
20
A
1/2/2021
30
10
A
1/3/2021
30
10
A
1/4/2021
30
30
B
1/6/2021
10
10
B
1/7/2021
20
10
C
2/1/2021
15
30
C
2/2/2021
15
30
C
2/6/2021
60
30
Table 2:
Ptid
start_dt
end_dt
total_insr_amt
total_op_amt
A
1/1/2021
1/4/2021
120
70
B
1/6/2021
1/7/2021
30
20
C
2/1/2021
2/2/2021
30
60
C
2/6/2021
2/6/2021
60
30
You didn't mention the specific database so this solution works in PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.

Can this daily inventory balance calculation on bigquery be improved

i came up with the following query to calculate inventory balances per day. The query works and gives me the expected results but it takes over 200 seconds to run on a subset of the transaction table with about 2mio rows.
Being new to bigquery i am wondering if there is a better/more efficient way to do this?
The code with some sample data is below.
Thanks in advance for any thoughts or tips.
#### Generate a continuous date range
WITH days AS
(
SELECT day
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2011-01-01'), CURRENT_DATE(), INTERVAL 1 DAY)) AS day
),
#### Transactional information of inventory movements. Simple example
movements AS
(
SELECT 1 AS ItemID
,1 AS Location
,DATE('2017-12-01') AS TransactionDate
,0 AS Quantity
UNION ALL SELECT 1, 1, DATE('2017-12-03'), 10
UNION ALL SELECT 1, 1, DATE('2017-12-06'), 100
UNION ALL SELECT 1, 1, DATE('2017-12-12'), 1000
),
#### Calculate cumulative sum for each item and location based on the transaction date
cumsum AS
(
SELECT ItemID
,TransactionDate
,Location
,SUM(Quantity) OVER (PARTITION BY ItemID, Location ORDER BY TransactionDate ROWS UNBOUNDED PRECEDING) as cumulative_quantity
FROM movements
),
#### Cross join with the date range to backfill cumulative values for each day
#### This will return multiple lines for a day when there are multiple transaction date balances
cross_sum AS
(
SELECT m.ItemID
,m.Location
,d.day
,m.TransactionDate
,m.cumulative_quantity
FROM days d
CROSS JOIN cumsum m
WHERE m.TransactionDate <= d.day
),
#### Get just one line per day, based on the latest transaction date
filtered AS
(
SELECT ItemID
,Location
,CAST (day AS datetime) AS BalanceDate
,ARRAY_AGG(cumulative_quantity ORDER BY TransactionDate DESC LIMIT 1) AS InventoryBalance
FROM cross_sum
GROUP BY 1,2,3
)
#### Final result, flattened out
SELECT ItemID
,Location
,BalanceDate
,(SELECT SUM(InventoryBalance) FROM UNNEST(InventoryBalance) AS InventoryBalance) AS InventoryBalance
FROM filtered
ORDER BY 1,2,3
i am wondering if there is a better/more efficient way to do this?
Below is for BigQuery Standard SQL
as you can see: days, cumsum and cross_sum are modified/optimized and the rest just eliminated. It has good potentials to be more efficient but needs to be tested on real data - so you should try and see if it is
#standardSQL
#### Transactional information of inventory movements. Simple example
WITH movements AS (
SELECT 1 AS ItemID, 1 AS Location, DATE('2017-12-01') AS TransactionDate, 0 AS Quantity UNION ALL
SELECT 1, 1, DATE('2017-12-03'), 10 UNION ALL
SELECT 1, 1, DATE('2017-12-06'), 100 UNION ALL
SELECT 1, 1, DATE('2017-12-12'), 1000
), days AS (
SELECT day, ItemID, Location
FROM UNNEST(GENERATE_DATE_ARRAY((SELECT MIN(TransactionDate) AS d FROM movements), CURRENT_DATE(), INTERVAL 1 DAY)) AS day
CROSS JOIN (SELECT DISTINCT ItemID, Location FROM movements)
), cumsum AS (
SELECT ItemID
,TransactionDate
,Location
,LEAD(TransactionDate) OVER(PARTITION BY ItemID, Location ORDER BY TransactionDate) AS NextTransactionDate
,SUM(Quantity) OVER(PARTITION BY ItemID, Location ORDER BY TransactionDate ROWS UNBOUNDED PRECEDING) AS cumulative_quantity
FROM movements
), cross_sum AS (
SELECT d.ItemID
,d.Location
,d.day AS BalanceDate
,m.cumulative_quantity
FROM days d
JOIN cumsum m
ON d.day >= IFNULL(m.TransactionDate, d.day)
AND d.day < IFNULL(m.NextTransactionDate, CURRENT_DATE())
)
SELECT ItemID
,Location
,BalanceDate
,cumulative_quantity
FROM cross_sum
ORDER BY 1,2,3
result is
ItemID Location BalanceDate cumulative_quantity
1 1 2017-12-01 0
1 1 2017-12-02 0
1 1 2017-12-03 10
1 1 2017-12-04 10
1 1 2017-12-05 10
1 1 2017-12-06 110
1 1 2017-12-07 110
1 1 2017-12-08 110
1 1 2017-12-09 110
1 1 2017-12-10 110
1 1 2017-12-11 110
1 1 2017-12-12 1110
1 1 2017-12-13 1110
1 1 2017-12-14 1110
1 1 2017-12-15 1110

T/SQL - Group/Multiply records

Source date:
CREATE TABLE #Temp (ID INT Identity(1,1) Primary Key, BeginDate datetime, EndDate datetime, GroupBy INT)
INSERT INTO #Temp
SELECT '2015-06-05 00:00:00.000','2015-06-12 00:00:00.000',7
UNION
SELECT '2015-06-05 00:00:00.000', '2015-06-08 00:00:00.000',7
UNION
SELECT '2015-10-22 00:00:00.000', '2015-10-31 00:00:00.000',7
SELECT *, DATEDIFF(DAY,BeginDate, EndDate) TotalDays FROM #Temp
DROP TABLE #Temp
ID BeginDate EndDate GroupBy TotalDays
1 6/5/15 0:00 6/8/15 0:00 7 3
2 6/5/15 0:00 6/12/15 0:00 7 7
3 10/22/15 0:00 10/31/15 0:00 7 9
Desired Output:
ID BeginDate EndDate GroupBy TotalDays GroupCnt GroupNum
1 6/5/15 0:00 6/8/15 0:00 7 3 1 1
2 6/5/15 0:00 6/12/15 0:00 7 7 1 1
3 10/22/15 0:00 10/29/15 0:00 7 9 2 1
3 10/29/15 0:00 10/31/15 0:00 7 9 2 2
Goal:
Group the records based on ID/BeginDate/EndDate.
Based on the GroupBy number (# of days) and TotalDays (days diff),
if the GroupBy => TotalDays, keep a single group record
else multiply the group records (1 record per GroupBy count) while staying within TotalDays limit.
Apologies if it's confusing but basically, in the above example, there should be one record for each group (ID/BeginDate/EndDate) for the record where days diff b/w Begin/End date = 7 or less (GroupBy).
If the days diff goes above 7 days, create another record (for every additional 7 days diff).
So since 1st two records have days diff of 7 days or less, there's only one record.
The 3rd record has days diff of 9 (7 + 2). Therefore, there should be 2 records (1st for the first 7 days and 2nd for the additional 2 days).
GroupCNT = how many records there're of the grouped records after applying the above records.
GroupNum is basically row number of the group.
GroupBy # can be different for each record. Dataset is huge so performance does matter.
One pattern I was able to figure out was related to the modulus b/w GroupBy and days diff.
When the GroupBy value is < days diff, modulus is always less than GroupBy. When the GroupBy value = days diff, modulus is always 0. And when the GroupBy value > days diff, modulus is always equals GroupBy. I'm not sure if/how to use that to group/multiply records to meet the requirement.
SELECT DISTINCT
ID
, BeginDate
, EndDate
, GroupBy
, DATEDIFF(DAY,BeginDate, EndDate) TotalDays
, CAST(GroupBy as decimal(18,6))%CAST(DATEDIFF(DAY,BeginDate, EndDate) AS decimal(18,6)) Modulus
, CASE WHEN DATEDIFF(DAY,BeginDate, EndDate) <= GroupBy THEN BeginDate END NewBeginDate
, CASE WHEN DATEDIFF(DAY,BeginDate, EndDate) <= GroupBy THEN EndDate END NewEndDate
FROM #Temp
Update:
Forgot to mention/include that the begin/enddate, when the records gets multiplied, will change accordingly. In other words, begin/end date will reflect the GroupBy - desired output shows what I mean more clearly in the 3rd and 4th record.
Also, GroupCnt/GroupNum are not as important to calculate as grouping/multiplying the records.
You could do something like this using a recursive CTE..
;WITH cte AS (
SELECT ID,
BeginDate,
EndDate,
GroupBy,
DATEDIFF(DAY, BeginDate, EndDate) AS TotalDays,
1 AS GroupNum
FROM #Temp
UNION ALL
SELECT ID,
BeginDate,
EndDate,
GroupBy,
TotalDays,
GroupNum + 1
FROM cte
WHERE GroupNum * GroupBy < TotalDays
)
SELECT ID,
BeginDate = CASE WHEN GroupNum = 1 THEN BeginDate
ELSE DATEADD(DAY, GroupBy * (GroupNum - 1), BeginDate)
END ,
EndDate = CASE WHEN TotalDays <= GroupBy THEN EndDate
WHEN DATEADD(DAY, GroupBy * GroupNum, BeginDate) > EndDate THEN EndDate
ELSE DATEADD(DAY, GroupBy * GroupNum, BeginDate)
END ,
GroupBy,
TotalDays,
COUNT(*) OVER (PARTITION BY ID) GroupCnt,
GroupNum
FROM cte
OPTION (MAXRECURSION 0)
the cte builds out a recordset like this.
ID BeginDate EndDate GroupBy TotalDays GroupNum
----------- ----------------------- ----------------------- ----------- ----------- -----------
1 2015-06-05 00:00:00.000 2015-06-08 00:00:00.000 7 3 1
2 2015-06-05 00:00:00.000 2015-06-12 00:00:00.000 7 7 1
3 2015-10-22 00:00:00.000 2015-10-31 00:00:00.000 7 9 1
3 2015-10-22 00:00:00.000 2015-10-31 00:00:00.000 7 9 2
then you just have to take this and use some case statements to determine what the begin and end date should be.
you should end up with
ID BeginDate EndDate GroupBy TotalDays GroupCnt GroupNum
----------- ----------------------- ----------------------- ----------- ----------- ----------- -----------
1 2015-06-05 00:00:00.000 2015-06-08 00:00:00.000 7 3 1 1
2 2015-06-05 00:00:00.000 2015-06-12 00:00:00.000 7 7 1 1
3 2015-10-22 00:00:00.000 2015-10-29 00:00:00.000 7 9 2 1
3 2015-10-29 00:00:00.000 2015-10-31 00:00:00.000 7 9 2 2
since you're using SQL 2012, you can also use the LAG and LEAD functions in your final query.
;WITH cte AS (
SELECT ID,
BeginDate,
EndDate,
GroupBy,
DATEDIFF(DAY, BeginDate, EndDate) AS TotalDays,
1 AS GroupNum
FROM #Temp
UNION ALL
SELECT ID,
BeginDate,
EndDate,
GroupBy,
TotalDays,
GroupNum + 1
FROM cte
WHERE GroupNum * GroupBy < TotalDays
)
SELECT ID,
BeginDate = COALESCE(LAG(BeginDate) OVER (PARTITION BY ID ORDER BY GroupNum) + GroupBy * (GroupNum - 1), BeginDate),
EndDate = COALESCE(LEAD(BeginDate) OVER (PARTITION BY ID ORDER BY GroupNum) + GroupBy * GroupNum, EndDate),
GroupBy,
TotalDays,
COUNT(*) OVER (PARTITION BY ID) GroupCnt,
GroupNum
FROM cte
OPTION (MAXRECURSION 0)
CREATE TABLE dim_number (id INT);
INSERT INTO dim_number VALUES ((0), (1), (2), (3)); -- Populate this to a large number
SELECT
#Temp.Id,
CASE WHEN dim_number.id = 0
THEN #Temp.BeginDate
ELSE DATEADD(DAY, dim_number.id * #Temp.GroupBy, #Temp.BeginDate)
END AS BeginDate,
CASE WHEN dim_number.id = parts.count
THEN #Temp.EndDate
ELSE DATEADD(DAY, (dim_number.id + 1) * #Temp.GroupBy, #Temp.BeginDate)
END AS EndDate,
#Temp.GroupBy AS GroupBy,
DATEDIFF(DAY, #Temp.BeginDate, #Temp.EndDate) AS TotalDays,
parts.count + 1 AS GroupCnt,
dim_number.id + 1 AS GroupNum
FROM
#Temp
CROSS APPLY
(SELECT DATEDIFF(DAY, #Temp.BeginDate, #Temp.EndDate) / #Temp.GroupBy AS count) AS parts
INNER JOIN
dim_number
ON dim_number.id >= 0
AND dim_number.id <= parts.count