SQL recursively creating matching groups based on reference table - sql

Imagine you had a data source like:
Id  Val  Data_Date
1   A    2022-01-01
2   B    2022-01-05
3   C    2022-01-09
4   D    2022-01-31
5   E    2022-02-01
With a reference table matching values in this way:
Target_Val  Matching_Val  Valid_Start  Valid_End
B           A             2022-01-04   2022-01-06
C           B             2022-01-09   2022-01-09
D           A             2022-01-31   2022-01-31
Imagine you want to create a table grouping values together where there is a match in the reference table within X days, say 4.
And you want to apply this matching recursively.
Output would be something like this:
Group_Id  Id
1         1
1         2
1         3
2         4
3         5
The logic here would be that C matches to B in the appropriate date range, and B matches to A in the appropriate date range, therefore they are all one group.
But although D matches to A, it is too far apart (greater than 4 days). And E doesn't match to anything.
There could be any depth (A > B > C > D ...)
Is there an appropriate algorithm in SQL to accomplish this? The values of the group IDs are unimportant and just meant to group data points together.

Here's my attempt. You do indeed need a recursive CTE, but you need to join the source table to the groups table and then join back to the source table to ensure that the child fits within the parent's 4-day window. E.g. in the case of D and A, as you mention, they match, but they aren't close enough to be counted.
Then I added a calculation to work out which rows belong to a valid hierarchy and used that in the recursive join, because we can exclude anything that is not part of a hierarchy.
After that we need to order the records by their depth so we know which parent record comes first, e.g. in the case of A > B > C.
Then apply DENSE_RANK over the results to get your final groups. This will need some testing with deeper levels of recursion, but it should point you in the right direction:
CREATE TABLE SourceData
(
Id INTEGER,
Val CHAR(1),
Data_Date DATE
);
CREATE TABLE Groups
(
Target_Val CHAR(1),
Matching_Val CHAR(1),
Valid_Start DATE,
Valid_End DATE
);
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (1,'A','2022-01-01');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (2,'B','2022-01-05');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (3,'C','2022-01-09');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (4,'D','2022-01-31');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (5,'E','2022-02-01');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('B','A','2022-01-04','2022-01-06');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('C','B','2022-01-09','2022-01-09');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('D','A','2022-01-31','2022-01-31');
WITH sourceCTE AS
(
SELECT sd.Id, sd.Val, sd.Data_Date, g.Valid_Start, g.Valid_End, IIF(s.Val IS NULL, sd.Val, g.Matching_Val) [ParentVal], CAST(NULL AS DATE) [start], CAST(NULL AS DATE) [end], 1 [Depth],
IIF(s.Val IS NULL, 0, 1) IsHeirarchy
FROM SourceData sd
LEFT JOIN Groups g ON g.Target_Val = sd.Val AND sd.Data_Date BETWEEN g.Valid_Start AND g.Valid_End
LEFT JOIN SourceData s ON s.Val = g.Matching_Val AND ABS(DATEDIFF(DAY, s.Data_Date, sd.Data_Date)) < 5
UNION ALL
SELECT s.Id, s.Val, s.Data_Date, g.Valid_Start, g.Valid_End, g.Matching_Val, g.Valid_Start, g.Valid_End, s.[Depth] + 1, 1
FROM sourceCTE s
INNER JOIN Groups g ON g.Target_Val = s.[ParentVal] AND s.IsHeirarchy = 1
),
ResultCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY [Depth] DESC) [RNum]
FROM sourceCTE
)
SELECT DENSE_RANK() OVER (ORDER BY ParentVal) [Group_Id], Id
FROM ResultCTE
WHERE [RNum] = 1
Here's a working fiddle.
I can't promise this is the best solution, because just like the query optimiser I gave up after about 2 hours, ha.
Also, for any future questions, please provide sample data in script format to save time creating the structure.
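As a cross-check for deeper recursion levels, the grouping rule itself can be sketched outside SQL. The following is a minimal Python sketch, not part of the query above: it treats every valid match as a link and merges linked rows with a union-find structure, which is a swapped-in technique rather than a recursive CTE. The function name group_ids and the MAX_GAP_DAYS constant are made up for illustration, and "within 4 days" is assumed to be inclusive, as in the < 5 test in the query.

```python
from datetime import date

# Sample data from the question.
source = [
    (1, "A", date(2022, 1, 1)),
    (2, "B", date(2022, 1, 5)),
    (3, "C", date(2022, 1, 9)),
    (4, "D", date(2022, 1, 31)),
    (5, "E", date(2022, 2, 1)),
]
reference = [
    ("B", "A", date(2022, 1, 4), date(2022, 1, 6)),
    ("C", "B", date(2022, 1, 9), date(2022, 1, 9)),
    ("D", "A", date(2022, 1, 31), date(2022, 1, 31)),
]
MAX_GAP_DAYS = 4  # assumption: "within 4 days" is inclusive


def group_ids(source, reference, max_gap_days):
    # Each row starts as its own group; parent[] stores union-find links.
    parent = {row_id: row_id for row_id, _, _ in source}

    def find(x):
        # Follow links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for target_val, matching_val, valid_start, valid_end in reference:
        for t_id, t_val, t_date in source:
            # The target row must fall inside the rule's validity range.
            if t_val != target_val or not (valid_start <= t_date <= valid_end):
                continue
            for m_id, m_val, m_date in source:
                # The matched row must be close enough in time.
                if m_val == matching_val and abs((t_date - m_date).days) <= max_gap_days:
                    ra, rb = find(t_id), find(m_id)
                    if ra != rb:
                        parent[max(ra, rb)] = min(ra, rb)

    # Number the groups in order of their smallest Id.
    roots = sorted({find(i) for i in parent})
    label = {root: n for n, root in enumerate(roots, start=1)}
    return [(label[find(i)], i) for i in sorted(parent)]


result = group_ids(source, reference, MAX_GAP_DAYS)
for group_id, row_id in result:
    print(group_id, row_id)
```

Because merging is transitive, chains of any depth (A > B > C > D ...) end up in one group, which makes this a convenient reference to compare the SQL output against.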

Related

How split comma separated string into multiple rows in AWS redshift?

I'm trying to split string values into multiple rows, grouped by the id column.
Most of the answers I saw rely on functions that are not supported by AWS Redshift: https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html
Assume I have a table like this:
id  order_id
1   10001,10005,10006
2   11000,12005
And I would like to have a result like this:
id  order_id
1   10001
1   10005
1   10006
2   11000
2   12005
Two concepts to keep in mind. The first is a recursive CTE, which can be used to generate a number for each position in the order_id string. The second is the JSON functions, which can split the string into parts at the commas.
A full test case with expanded input data:
create table test (id int, order_id varchar(256));
insert into test values
(1, '10001,10005,10006'),
(2, '11000,12005'),
(3, '10001,10005,10006,21000,22005'),
(4, '21000,22005,10001,10005,10006,11000,12005,10001,10005,10006,21000,22005')
;
with recursive numbers(n) as (
select 0 as n
union all
select n + 1
from numbers n
where n < (select max(length(order_id) - length(replace(order_id, ',',''))) from test)
),
input as (
select id, order_id,
length(order_id) - length(replace(order_id, ',','')) no_of_elements --counts the number of commas in the string
from test
)
select id, json_extract_array_element_text('['||order_id||']', n.n) as order_id
from input t
join numbers n
on n.n <= t.no_of_elements
order by id, order_id
;
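The same two steps can be traced in a few lines of Python. This is only an illustrative sketch of the idea, not Redshift code: the function name split_rows is made up, range() stands in for the recursive numbers CTE, and json.loads stands in for json_extract_array_element_text.

```python
import json

# The two sample rows from the question.
rows = [(1, "10001,10005,10006"), (2, "11000,12005")]


def split_rows(rows):
    # Step 1: one number per possible position (0 .. largest comma count).
    max_commas = max(order_ids.count(",") for _, order_ids in rows)
    numbers = range(max_commas + 1)

    result = []
    for row_id, order_ids in rows:
        no_of_elements = order_ids.count(",")  # number of commas in the string
        # Step 2: wrap the string in brackets so it parses as a JSON array.
        elements = json.loads("[" + order_ids + "]")
        for n in numbers:
            # Same condition as the join: n.n <= t.no_of_elements
            if n <= no_of_elements:
                result.append((row_id, str(elements[n])))
    return result


result = split_rows(rows)
for row_id, order_id in result:
    print(row_id, order_id)
```

Note that the bracket trick only works while every element is valid JSON on its own, as plain integers are here.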

Determining consecutive and independent PTO days

Based on feedback, I am restructuring my question.
I am working with SQL on a Presto database.
My objective is to report on employees that take consecutive days of PTO or Sick Time since the beginning of 2018. My desired output would have the individual islands of time taken by employee with the start and end dates, along the lines of:
The main table I am using is d_employee_time_off
There are only two time_off_type_name: PTO and Sick Leave.
The ds is a datestamp and I use the latest ds (usually the current date)
I have access to a date table named d_date
I can join the tables on d_employee_time_off.time_off_date = d_date.full_date
I hope that I have structured this question in a fashion that is understandable.
I believe the need here is to join the day off material to a calendar table.
In the example solution below I am generating this "on the fly", but I think you have your own solution for this. Also, in my example I have used the string 'Monday' and moved backward from that (or you could use 'Friday' and move forward). I'm not keen on language-dependent solutions, but as I'm not a Presto user I wasn't able to test anything on Presto. So the example below uses some of your own logic, but in SQL Server syntax, which I trust you can translate to Presto:
Query:
;WITH
Digits AS (
SELECT 0 AS digit UNION ALL
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
SELECT 9
)
, cal AS (
SELECT
ca.number
, dateadd(day,ca.number,'20180101') as cal_date
, datename(weekday,dateadd(day,ca.number,'20180101')) weekday
FROM Digits [1s]
CROSS JOIN Digits [10s]
CROSS JOIN Digits [100s] /* add more like this as needed */
cross apply (
SELECT
[1s].digit
+ [10s].digit * 10
+ [100s].digit * 100 /* add more like this as needed */
AS number
) ca
)
, time_off AS (
select
*
from cal
inner join mytable t on (cal.cal_date = t.time_off_date and cal.weekday <> 'Monday')
or (cal.cal_date between dateadd(day,-2,t.time_off_date)
and t.time_off_date and datename(weekday,t.time_off_date) = 'Monday')
)
, starting_points AS (
SELECT
employee_id,
cal_date,
dense_rank() OVER(partition by employee_id
ORDER BY
time_off_date
) AS rownum
FROM
time_off A
WHERE
NOT EXISTS (
SELECT
*
FROM
time_off B
WHERE
B.employee_id = A.employee_id
AND B.cal_date = DATEADD(day, -1, A.cal_date)
)
)
, ending_points AS (
SELECT
employee_id,
cal_date,
dense_rank() OVER(partition by employee_id
ORDER BY
time_off_date
) AS rownum
FROM
time_off A
WHERE
NOT EXISTS (
SELECT
*
FROM
time_off B
WHERE
B.employee_id = A.employee_id
AND B.cal_date = DATEADD(day, 1, A.cal_date)
)
)
SELECT
S.employee_id,
S.cal_date AS start_range,
E.cal_date AS end_range
FROM
starting_points S
JOIN
ending_points E
ON E.employee_id = S.employee_id
AND E.rownum = S.rownum
order by employee_id
, start_range
Result:
employee_id start_range end_range
1 200035 02.01.2018 02.01.2018
2 200035 20.04.2018 27.04.2018
3 200037 27.01.2018 29.01.2018
4 200037 31.03.2018 02.04.2018
see: http://rextester.com/MISZ50793
CREATE TABLE mytable(
ID INT NOT NULL
,employee_id INTEGER NOT NULL
,type VARCHAR(3) NOT NULL
,time_off_date DATE NOT NULL
,time_off_in_days INT NOT NULL
);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (1,200035,'PTO','2018-01-02',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (2,200035,'PTO','2018-04-20',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (3,200035,'PTO','2018-04-23',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (4,200035,'PTO','2018-04-24',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (5,200035,'PTO','2018-04-25',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (6,200035,'PTO','2018-04-26',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (7,200035,'PTO','2018-04-27',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (8,200037,'PTO','2018-01-29',1);
INSERT INTO mytable(id,employee_id,type,time_off_date,time_off_in_days) VALUES (9,200037,'PTO','2018-04-02',1);
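To make the island logic easier to follow before translating it to Presto, here is a minimal Python sketch of the same idea under the same assumption: a Monday off also claims the preceding Saturday and Sunday, so that a Friday and the following Monday count as consecutive. The function name islands is made up for illustration, and date.weekday() is used in place of the 'Monday' string so that nothing depends on the session language.

```python
from datetime import date, timedelta

# (employee_id, time_off_date) pairs from the sample data above.
time_off = [
    (200035, date(2018, 1, 2)),
    (200035, date(2018, 4, 20)),
    (200035, date(2018, 4, 23)),
    (200035, date(2018, 4, 24)),
    (200035, date(2018, 4, 25)),
    (200035, date(2018, 4, 26)),
    (200035, date(2018, 4, 27)),
    (200037, date(2018, 1, 29)),
    (200037, date(2018, 4, 2)),
]


def islands(time_off):
    days = set()
    for employee_id, day in time_off:
        days.add((employee_id, day))
        if day.weekday() == 0:  # Monday: also cover the weekend before it
            days.add((employee_id, day - timedelta(days=1)))
            days.add((employee_id, day - timedelta(days=2)))

    result = []
    for employee_id, day in sorted(days):
        last = result[-1] if result else None
        if last and last[0] == employee_id and day - last[2] == timedelta(days=1):
            # The day directly follows the current island: extend it.
            result[-1] = (employee_id, last[1], day)
        else:
            # Otherwise a new island starts here.
            result.append((employee_id, day, day))
    return result


result = islands(time_off)
for employee_id, start_range, end_range in result:
    print(employee_id, start_range, end_range)
```

On the sample rows this yields the same four ranges as the result listed above.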

Percentage by group - oracle

I have this sample.
What I need is an average per key, not per key and value. However, the syntax I used appears to give me the average per key and value.
select avg(value2),KEY,VALUE from testavg
GROUP BY key,value
order by key, value
Doing otherwise yields a syntax error. The results I need are as follows:
10 A 0.96
10 B 0.04
12 C 1
But the statement I used yields the incorrect results above.
Could this be achieved by issuing 1 single oracle select statement? I have included the statement to create the entire table.
CREATE TABLE "TESTAVG"
( "KEY" NUMBER,
"VALUE" VARCHAR2(20 BYTE),
"VALUE2" NUMBER
)
Insert into TESTAVG (KEY,VALUE,VALUE2) values (10,'A',12);
Insert into TESTAVG (KEY,VALUE,VALUE2) values (10,'A',13);
Insert into TESTAVG (KEY,VALUE,VALUE2) values (10,'B',1);
Insert into TESTAVG (KEY,VALUE,VALUE2) values (12,'C',20);
This query might run faster on larger data - only reads the table once:
select distinct key, value,
sum(value2) over (partition by key, value) / sum(value2) over (partition by key) r
from testavg
/
KEY VALUE R
---------- -------------------- ----------
10 A .961538462
10 B .038461538
12 C 1
select avg(value2),KEY from testavg
GROUP BY key
order by key;
8.66666666666666666666666666666666666667 10
20 12
EDIT: Specs are still not clear but this might be what you need...
with gr1 as (select key,sum(value2) sumvalue
from testavg
group by key)
, gr2 as (select key,value,sum(value2) sumvalue
from testavg
GROUP BY key,value)
select gr1.key,gr2.value,gr2.sumvalue/gr1.sumvalue
from gr1
, gr2
where gr1.key = gr2.key;
10 B 0.0384615384615384615384615384615384615385
12 C 1
10 A 0.9615384615384615384615384615384615384615
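For reference, the arithmetic behind the requested figures is a share of the key's total rather than an average: for key 10, A is (12 + 13) / 26 and B is 1 / 26. A minimal Python sketch of that calculation follows; the function name shares is made up for illustration, and it assumes, as the answers above do, that this ratio is what the question is after.

```python
from collections import defaultdict

# Rows of TESTAVG as (key, value, value2).
rows = [(10, "A", 12), (10, "A", 13), (10, "B", 1), (12, "C", 20)]


def shares(rows):
    key_total = defaultdict(int)   # sum(value2) per key
    pair_total = defaultdict(int)  # sum(value2) per (key, value)
    for key, value, value2 in rows:
        key_total[key] += value2
        pair_total[(key, value)] += value2
    # Share of each (key, value) pair within its key.
    return [
        (key, value, total / key_total[key])
        for (key, value), total in sorted(pair_total.items())
    ]


result = shares(rows)
for key, value, ratio in result:
    print(key, value, round(ratio, 2))
```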

Drop rows identified within moving time window

I have a dataset of hospitalisations ('spells') - 1 row per spell. I want to drop any spells recorded within a week after another (there could be multiple) - the rationale being that they're likely symptomatic of the same underlying cause. Here is some play data:
create table hif_user.rzb_recurse_src (
patid integer not null,
eventdate integer not null,
type smallint not null
);
insert into hif_user.rzb_recurse_src values (1,1,1);
insert into hif_user.rzb_recurse_src values (1,3,2);
insert into hif_user.rzb_recurse_src values (1,5,2);
insert into hif_user.rzb_recurse_src values (1,9,2);
insert into hif_user.rzb_recurse_src values (1,14,2);
insert into hif_user.rzb_recurse_src values (2,1,1);
insert into hif_user.rzb_recurse_src values (2,5,1);
insert into hif_user.rzb_recurse_src values (2,19,2);
Only spells of type 2 - within a week after any other - are to be dropped. Type 1 spells are to remain.
For patient 1, dates 1 & 9 should be kept. For patient 2, all rows should remain.
The issue is with patient 1. Spell date 9 is identified for dropping as it is close to spell date 5; however, as spell date 5 is close to spell date 1, it should be dropped, therefore allowing spell date 9 to live...
So, it seems a recursive problem. However, I've not used recursive programming in SQL before and I'm struggling to really picture how to do it. Can anyone help? I should add that I'm using Teradata which has more restrictions than most with recursive SQL (only UNION ALL sets allowed I believe).
This is cursor logic: you check one row after the other to see whether it fits your rules, so recursion is the easiest (maybe the only) way to solve your problem.
To get decent performance you need a volatile table to facilitate this row-by-row processing:
CREATE VOLATILE TABLE vt (patid, eventdate, exac_type, rn) AS
(
SELECT r.*
,ROW_NUMBER() -- needed to facilitate the join
OVER (PARTITION BY patid ORDER BY eventdate) AS rn
FROM hif_user.rzb_recurse_src AS r
) WITH DATA ON COMMIT PRESERVE ROWS;
WITH RECURSIVE cte (patid, eventdate, exac_type, rn, startdate) AS
(
SELECT vt.*
,eventdate AS startdate
FROM vt
WHERE rn = 1 -- start with the first row
UNION ALL
SELECT vt.*
-- check if type = 1 or more than 7 days from the last eventdate
,CASE WHEN vt.eventdate > cte.startdate + 7
OR vt.exac_type = 1
THEN vt.eventdate -- new start date
ELSE cte.startdate -- keep old date
END
FROM vt JOIN cte
ON vt.patid = cte.patid
AND vt.rn = cte.rn + 1 -- proceed to next row
)
SELECT *
FROM cte
WHERE eventdate - startdate = 0 -- only new start days
order by patid, eventdate
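The row-by-row rule that the recursion walks through can be written as a plain loop, which may help when checking the result. This is a minimal Python sketch under the same rule as the CASE expression above: a row starts a new window if it is the patient's first row, is of type 1, or lies more than 7 days after the current window start. The function name kept_spells and the window parameter are made up for illustration.

```python
# (patid, eventdate, type) rows from the play data.
spells = [
    (1, 1, 1), (1, 3, 2), (1, 5, 2), (1, 9, 2), (1, 14, 2),
    (2, 1, 1), (2, 5, 1), (2, 19, 2),
]


def kept_spells(spells, window=7):
    kept = []
    start = {}  # current window start per patient
    for patid, eventdate, spell_type in sorted(spells):
        last_start = start.get(patid)
        if last_start is None or spell_type == 1 or eventdate > last_start + window:
            start[patid] = eventdate  # new start date: the row is kept
            kept.append((patid, eventdate, spell_type))
        # otherwise the old start date stays and the row is dropped
    return kept


result = kept_spells(spells)
for row in result:
    print(row)
```

For patient 1 this keeps dates 1 and 9, and for patient 2 it keeps all rows, as required in the question.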
I think the key to solving this is getting the first date more than 7 days from the current date and then doing a recursive subquery:
with rrs as (
select rrs.*,
(select min(rrs2.eventdate)
from hif_user.rzb_recurse_src rrs2
where rrs2.patid = rrs.patid and
rrs2.eventdate > rrs.eventdate + 7
) as eventdate7
from hif_user.rzb_recurse_src rrs
),
recursive cte as (
select patid, min(eventdate) as eventdate, min(eventdate7) as eventdate7
from rrs -- read from the rrs CTE, which carries eventdate7
group by patid
union all
select cte.patid, cte.eventdate7, rrs.eventdate7
from cte join
rrs -- the base table has no eventdate7 column
on rrs.patid = cte.patid and
rrs.eventdate = cte.eventdate7
)
select cte.patid, cte.eventdate
from cte;
If you want additional columns, then join in the original table at the last step.

create a table of duplicated rows of another table using the select statement

I have a table with one column containing different integers.
For each integer in the table I would like to duplicate it as many times as it has digits.
For example:
12345 (5 digits):
1. 12345
2. 12345
3. 12345
4. 12345
5. 12345
I thought of doing it using with recursive t (...) as (), but I didn't manage, since I don't really understand how it works and what is happening "behind the scenes".
I don't want to use insert because I want it to be scalable and automatic for as many integers as needed in a table.
Any thoughts and an explanation would be great.
The easiest way is to join to a table with numbers from 1 to n in it.
SELECT n, x
FROM yourtable
JOIN
(
SELECT day_of_calendar AS n
FROM sys_calendar.CALENDAR
WHERE n BETWEEN 1 AND 12 -- maximum number of digits
) AS dt
ON n <= CHAR_LENGTH(TRIM(ABS(x)))
In my example I abused TD's builtin calendar, but that's not a good choice, as the optimizer doesn't know how many rows will be returned and as the plan must be a Product Join it might decide to do something stupid. So better use a number table...
Create a numbers table that will contain the integers from 1 to the maximum number of digits that the numbers in your table will have (I went with 6):
create table numbers(num int)
insert numbers
select 1 union select 2 union select 3 union select 4 union select 5 union select 6
You already have your table (but here's what I was using to test):
create table your_table(num int)
insert your_table
select 12345 union select 678
Here's the query to get your results:
select ROW_NUMBER() over(partition by b.num order by b.num) row_num, b.num, LEN(cast(b.num as char)) num_digits
into #temp
from your_table b
cross join numbers n
select t.num
from #temp t
where t.row_num <= t.num_digits
I found a nice way to perform this action. Here goes:
with recursive t (num,num_as_char,char_n)
as
(
select num
,cast (num as varchar (100)) as num_as_char
,substr (num_as_char,1,1)
from numbers
union all
select num
,substr (t.num_as_char,2) as num_as_char2
,substr (num_as_char2,1,1)
from t
where char_length (num_as_char2) > 0
)
select *
from t
order by num,char_length (num_as_char) desc
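Since the question also asks what happens "behind the scenes", here is a small runnable sketch of a recursive CTE that produces exactly one copy per digit. It uses SQLite through Python's sqlite3 module, so the syntax is SQLite's rather than Teradata's, and the column names copy_no and digits are made up for illustration. The first SELECT (the seed) runs once and emits copy 1 of every number; the second SELECT is then applied repeatedly to the rows produced in the previous step, adding one more copy each time, until the WHERE clause lets no row through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (num INTEGER)")
conn.executemany("INSERT INTO your_table VALUES (?)", [(12345,), (678,)])

query = """
WITH RECURSIVE t(num, copy_no, digits) AS (
    -- seed: copy 1 of each number, plus its digit count
    SELECT num, 1, length(CAST(abs(num) AS TEXT))
    FROM your_table
    UNION ALL
    -- step: emit the next copy until copy_no reaches the digit count
    SELECT num, copy_no + 1, digits
    FROM t
    WHERE copy_no < digits
)
SELECT copy_no, num
FROM t
ORDER BY num, copy_no
"""

rows = conn.execute(query).fetchall()
for copy_no, num in rows:
    print(copy_no, num)
```

The recursion depth equals the largest digit count, so no separate numbers table is needed, at the cost of one iteration per digit.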