Running recursive query in Vertica - sql

I am trying to do exactly what this question asks, but in Vertica, and I cannot find a way to apply the top answer or any of the other answers. I have tried both CONNECT BY and the subquery UNION ALL method, and I don't think Vertica supports either.
Is there any way I can replicate the solution in Vertica?
EDIT: Full Question
I am trying to calculate 30-day readmission chains: sequences of readmissions in which each one falls within 30 days of the previous admission. The following data shows a simplified situation where we have events rather than admissions and discharges. The difference in days between consecutive events determines whether an event is a 30-day readmission; consecutive 30-day readmissions (Chain Len) form a single readmission chain (Count).
Sample Data
CREATE TABLE dbo.Events (
EventID INT IDENTITY(1,1) PRIMARY KEY,
EventDate DATE NOT NULL,
PersonID INT NOT NULL
);
GO
INSERT dbo.Events (EventDate, PersonID)
VALUES
('2014-01-01', 1), ('2014-01-05', 1), ('2014-02-02', 1), ('2014-03-30', 1), ('2014-04-04', 1),
('2014-01-11', 2), ('2014-02-02', 2),
('2014-01-03', 3), ('2014-03-03', 3);
GO
Sample Output
EventID  EventDate   PersonID  CHAIN LEN  Count
-------  ----------  --------  ---------  -----
1        2014-01-01  1         1          1
2        2014-01-05  1         2          1
3        2014-02-02  1         3          1
-------  ----------  --------  ---------  -----
4        2014-03-30  1         1          2
5        2014-04-04  1         2          2
-------  ----------  --------  ---------  -----
6        2014-01-11  2         1          1
7        2014-02-02  2         2          1
-------  ----------  --------  ---------  -----
8        2014-01-03  3         1          1
-------  ----------  --------  ---------  -----
9        2014-03-03  3         1          2
-------  ----------  --------  ---------  -----

Here is an Oracle solution; see if it works. You may need to make some changes for Vertica, as each database dialect has its own quirks. Vertica does support analytic functions, which are the main ingredient.
The method used here is well known; it is usually called the "start-of-groups" method (after the "flags" created in the innermost subquery).
select eventid, eventdate, personid,
row_number() over
(partition by personid, ct order by eventdate) as chain_len,
ct
from (
select eventid, eventdate, personid,
count(flag) over
(partition by personid order by eventdate) + 1 as ct
from (
select eventid, eventdate, personid,
case when eventdate > lag(eventdate) over
(partition by personid order by eventdate) + 30
then 0 end as flag
from events
)
)
order by personid, eventdate -- if needed
;
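For Vertica specifically, a rough sketch of the same start-of-groups idea might look like the following. It rests on a few assumptions: that Vertica wants an alias on the subquery in the FROM clause, that subtracting two DATE values yields a number of days, and that LAG can be used inside Vertica's CONDITIONAL_TRUE_EVENT analytic function to number the chains. The table and column names are taken from the question; the alias chains is made up.

```sql
-- Sketch for Vertica; table and column names follow the question's sample data.
-- CONDITIONAL_TRUE_EVENT starts at 0 and increments each time the condition is true,
-- so adding 1 gives the chain number (the "Count" column).
-- For the first row per person LAG is NULL, the condition is not true, and ct stays 1.
select eventid, eventdate, personid,
       row_number() over
           (partition by personid, ct order by eventdate) as chain_len,
       ct
from (
    select eventid, eventdate, personid,
           conditional_true_event(eventdate - lag(eventdate) > 30) over
               (partition by personid order by eventdate) + 1 as ct
    from events
) as chains   -- hypothetical alias
order by personid, eventdate;
```

If CONDITIONAL_TRUE_EVENT is not available, the Oracle query above with an alias added to each subquery is the more conservative starting point.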

Related

Most Efficient SQL to Calculate Running Streak Occurrences

I am looking for the most efficient manner to determine the longest occurrence of a streak within a given data set; specifically, to determine the longest winning streak of games.
Below is the SQL that I have thus far, and it does seem to perform very fast, and as expected from the limited testing I've done on a dataset with around 100,000 records.
DECLARE @HistoryDateTimeLimit datetime = '3/15/2018';
CTE to create result subset from voting dataset.
WITH Results AS (
SELECT
EntityPlayerId,
(CASE
WHEN VoteTeamA = 1 AND ParticipantAScore > ParticipantBScore THEN 'W'
WHEN VoteTeamA = 0 AND ParticipantBScore > ParticipantAScore THEN 'W'
ELSE 'L'
END) AS WinLoss,
match.ScheduledStartDateTime
FROM
[dbo].[MatchVote] vote
INNER JOIN [dbo].[MatchMetaData] match ON vote.MatchId = match.MatchId
WHERE
IsComplete = 1
AND ScheduledStartDateTime >= @HistoryDateTimeLimit
)
CTE to create subset of data with streak type as WinLoss and total count of votes in the partition using ROW_NUMBER().
WITH Streaks AS (
SELECT
EntityPlayerId,
ScheduledStartDateTime,
WinLoss,
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId ORDER BY ScheduledStartDateTime) -
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId, WinLoss ORDER BY ScheduledStartDateTime) AS Streak
FROM
Results
)
CTE to summarize the partitioned vote streaks by WinLoss and a begin date/time, with the total count in the streak.
WITH StreakCounts AS (
SELECT
EntityPlayerId,
WinLoss,
MIN(ScheduledStartDateTime) StreakStart,
MAX(ScheduledStartDateTime) StreakEnd,
COUNT(*) as Streak
FROM
Streaks
GROUP BY
EntityPlayerId, WinLoss, Streak
)
CTE to select the MAXIMUM (longest) vote streak for WinLoss of W (win) grouped by players.
WITH LongestWinStreak AS (
SELECT
EntityPlayerId,
MAX(Streak) AS LongestStreak
FROM
StreakCounts
WHERE
WinLoss = 'W'
GROUP BY
EntityPlayerId
)
Selecting the useful data from the LongestWinStreak CTE.
SELECT * FROM LongestWinStreak
This is the third iteration of the code; at first I was overthinking it, using windows with the LAG function to define a reset period that was later used for partitioning.
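As written, each fragment above starts with its own WITH, so the pieces will not run together as one statement; in T-SQL the CTEs are chained with commas under a single WITH. A minimal, self-contained sketch of that chained form follows. The inline rows and the names PlayerId, PlayedAt, StreakGroup and StreakLength are made up for illustration and are not the real schema.

```sql
-- Sketch: difference-of-row-numbers streak detection in one statement (T-SQL).
-- The six inline rows are hypothetical sample data: W, W, L, W, W, W.
WITH Results AS (
    SELECT *
    FROM (VALUES
        (1, 1, 'W'), (1, 2, 'W'), (1, 3, 'L'),
        (1, 4, 'W'), (1, 5, 'W'), (1, 6, 'W')
    ) AS v (PlayerId, PlayedAt, WinLoss)
),
Streaks AS (
    -- Rows in the same unbroken run of W or L share the same difference value.
    SELECT PlayerId, PlayedAt, WinLoss,
           ROW_NUMBER() OVER (PARTITION BY PlayerId ORDER BY PlayedAt) -
           ROW_NUMBER() OVER (PARTITION BY PlayerId, WinLoss ORDER BY PlayedAt) AS StreakGroup
    FROM Results
),
StreakCounts AS (
    -- One row per run, with its length.
    SELECT PlayerId, WinLoss, COUNT(*) AS StreakLength
    FROM Streaks
    GROUP BY PlayerId, WinLoss, StreakGroup
)
SELECT PlayerId, MAX(StreakLength) AS LongestWinStreak
FROM StreakCounts
WHERE WinLoss = 'W'
GROUP BY PlayerId;
```

With these six rows the query should report a longest win streak of 3 for player 1.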
[UPDATE]: SQLFiddle example at http://sqlfiddle.com/#!18/5b33a/1 -- Sample data for the two tables that are used above are as follows.
The data is meant to show the schema and can be extrapolated for your own testing or usage.
MatchVote table data.
EntityPlayerId IsExtMatch MatchId VoteTeamA VoteDateTime IsComplete
-------------------- ------------ -------------------- --------- ----------------------- ----------
158 1 152639 0 2018-03-20 23:25:28.910 1
158 1 156058 1 2018-03-13 23:36:57.517 1
MatchMetaData table data.
MatchId IsTeamTournament MatchCompletedDateTime ScheduledStartDateTime MatchIsFinalized TournamentId TournamentTitle TournamentLogoUrl TournamentLogoThumbnailUrl GameName GameShortCode GameLogoUrl ParticipantAScore ParticipantAName ParticipantALogoUrl ParticipantBScore ParticipantBName ParticipantBLogoUrl
--------- ---------------- ----------------------- ----------------------- ---------------- -------------------- ------------------ ----------------------- ---------------------------- --------------------------------- -------------- ----------------------- ------------------ ------------------- --------------------- ----------------- ------------------- --------------------
23354 1 2014-07-30 00:30:00.000 2014-07-30 00:00:00.000 1 543 Sample https://...Small.png https://...Small.png Dota 2 Dota 2 https://...logo.png 3 Natus Vincere.US https://...VI.png 0 Not Today https://...ay.png
44324 1 2014-12-15 12:40:00.000 2014-12-15 11:40:00.000 1 786 Sample https://...Small.png https://...Small.png Counter-Strike: Global Offensive CS:GO https://...logo.png 0 Avalier's stars https://...oto.png 1 Kassad's Legends https://...oto.png

How to join multiple rows by continue from and to id columns in oracle

I have a scenario where I need to find the start date and end date from multiple rows which are tied by continued_from and continued_to date fields in Oracle.
The result should look like:
ID STARTDATE ENDDATE
-- ---------- ----------
3 01/01/1000 12/31/9999
ID STARTDATE ENDDATE CONT_FROM_ID CONT_TO_ID
-- ---------- ---------- ------------ -----------
1 01/01/1000 10/10/1999 NULL 2
2 10/10/1999 11/11/2000 1 3
3 11/11/2000 12/31/9999 2 NULL
Oracle's hierarchical query syntax makes it easy to walk the tree from parent to child. The analytical lead() and lag() functions track the next and previous IDs.
select c23.id
, c23.startdate
, c23.enddate
, lag(c23.id) over (partition by p23.id order by c23.id) as cont_from_id
, lead(c23.id) over (partition by p23.id order by c23.id) as cont_to_id
from p23
join c23 on p23.startdate <= c23.startdate
and p23.enddate >= c23.enddate
order by c23.id
/
Here is a test using your sample data:
SQL> select c23.id
2 , c23.startdate
3 , c23.enddate
4 , lag(c23.id) over (partition by p23.id order by c23.id) as cont_from_id
5 , lead(c23.id) over (partition by p23.id order by c23.id) as cont_to_id
6 from p23
7 join c23 on p23.startdate <= c23.startdate
8 and p23.enddate >= c23.enddate
9 order by c23.id
10 /
ID STARTDATE ENDDATE CONT_FROM_ID CONT_TO_ID
---------- --------- --------- ------------ ----------
1 01-JAN-00 10-OCT-99 2
2 10-OCT-99 11-NOV-00 1 3
3 11-NOV-00 31-DEC-99 2
SQL>

How to report for all rows when joining?

I'd like to report on all batch_runs where batch_run > 200833, showing the lots with batch_id = 100.
If a BATCH_RUN does not have any rows with batch_id = 100, then report 0.
select batch_id,
batch_run,
count(*) over (partition by batch_id,
batch_run
order by batch_run) as total_lot_count,
sum(lot_size) over (partition by batch_id,
batch_run) as total_lot_size,
row_number() over (partition by batch_id
order by batch_run) as line_number
from batch_jobs
-- inner join batches on batch_jobs.batch_run = batches.batch_run
-- left join batches on batch_jobs.batch_run = batches.batch_run
where batch_run > 200833
and batch_id = 100
See Fiddle
BATCHES
--------------- ----------
BatchSequence Batch_run
--------------- ----------
1 200833
2 200911
3 200922
4 200933
5 201011
6 201022
7 201033
BATCH_RUNS
------------- ---------- ---------
Batch_id Batch_run Lot_size
------------- ---------- ---------
100 200933 10
100 200933 20
100 200933 30
100 201022 400
100 201022 500
Desired result:
--------------- --------- ---------- ----- ---- -------
Batch_Run Batch_id Lot_count Total_Lots Line_No
--------------- --------- ---------- ----- ---- -------
200911 0 1
200922 100 3 60 2
200933 0 3
201011 0 4
201022 100 2 900 5
201033 0 6
It's too bad that your post has inconsistencies with your SQL Fiddle. It makes it all a bit confusing. But I think that this is what you were looking for. And as you'll see, apart from row_number, analytic functions are not really needed.
select b.batch_run,
bj.batch_id,
count(bj.batch_run) as total_lot_count,
coalesce(sum(bj.lot_size), 0) as total_lot_size,
row_number() over (order by b.batch_sequence) as Line_No
from batches b
left join batch_jobs bj
on bj.batch_run = b.batch_run
and bj.batch_id = 100
where b.batch_run > 200833
group by b.batch_sequence, b.batch_run, bj.batch_id
order by Line_No
SQLFiddle Demo
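The batch_id = 100 condition sits in the ON clause rather than the WHERE clause because a WHERE filter on the outer-joined table would discard the NULL-extended rows. For contrast, here is a sketch of the variant that loses the empty batch runs. It uses the same table and column names as the query above and is shown only to illustrate the difference, not as something to use.

```sql
-- Contrast sketch: this is NOT the desired query.
-- Filtering bj.batch_id in WHERE removes every row where the left join found
-- no match (bj.batch_id is NULL there), so the join behaves like an inner join
-- and batch runs without batch_id = 100 drop out of the report.
select b.batch_run,
       count(bj.batch_run) as total_lot_count
from batches b
left join batch_jobs bj
       on bj.batch_run = b.batch_run
where b.batch_run > 200833
  and bj.batch_id = 100
group by b.batch_run
order by b.batch_run;
```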

How to partition the following table in DB2

I am trying to partition and order the following table, where I have used all sorts of row_number() over() and dense_rank() over() combinations but am not getting what I need.
The MWE table is as follows:
Person Visit Last_Visit Gap_1_yr
------ ----- -------- --------
1 01/01/2001 01/01/2000 NULL
1 01/01/2003 01/01/2001 gap
1 01/01/2004 01/01/2003 NULL
1 01/01/2006 01/01/2004 gap
2 01/01/2005 01/01/2002 gap
2 01/01/2010 01/01/2005 gap
Each row is an appointment a person turned up for; if that appointment is more than 365 days after their previous appointment, the row is flagged as a gap (I used a lag function for this).
What I want is, whenever there is a gap, I want to partition so I have the following:
Person Visit Last_Visit Gap_1_yr SEQ
------ ----- -------- -------- ---
1 01/01/2001 01/01/2000 NULL 1
1 01/01/2003 01/01/2001 gap 2
1 01/01/2004 01/01/2003 NULL 2
1 01/01/2006 01/01/2004 gap 3
2 01/01/2005 01/01/2002 gap 1
2 01/01/2010 01/01/2005 gap 2
You can see that whenever there is a gap, the sequence increments by one and holds that value until the next gap, all per person.
I have tried:
row_number() over(partition by person order by gap)
but this increments for every row in SEQ until it reaches a new person, ignoring gaps.
I have also tried:
dense_rank() over(partition by person order by gap)
which returns 1 in every cell in SEQ, and
dense_rank() over(partition by person,gap order by gap)
which also returns all 1's.
Does anyone have any suggestions?
Convert the gap to a flag. Then use sum() to do a cumulative sum of the flag:
select mwe.*,
sum(case when gap_1_yr = 'gap' then 1 else 0 end) over
(partition by person order by visit) as seq
from mwe;
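With the sample data, the cumulative sum starts person 1 at 0 rather than 1, because that person's first visit carries no gap flag. If the sequence should always start at 1, one possible variation also counts each person's first row as a start. This is a sketch that assumes the table is called mwe as above and that visit is a date column; the rn column and the alias t are made up.

```sql
-- Sketch: treat each person's first visit as the start of sequence 1,
-- then add 1 for every later row flagged as a gap.
select person, visit, last_visit, gap_1_yr,
       sum(case when gap_1_yr = 'gap' or rn = 1 then 1 else 0 end)
           over (partition by person order by visit) as seq
from (
    select mwe.*,
           row_number() over (partition by person order by visit) as rn
    from mwe
) t;
```

For the rows in the question this gives 1, 2, 2, 3 for person 1 and 1, 2 for person 2, which matches the desired SEQ column.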

SQL: Identify distinct blocks of treatment over multiple start and end date ranges for each member

Objective: Identify distinct episodes of continuous treatment for each member in a table. Each member has a diagnosis and a service date, and an episode is defined as all services where the time between each consecutive service is less than some number (let's say 90 days for this example). The query will need to loop through each row and calculate the difference between dates, and return the first and last date associated with each episode. The goal is to group results by member and episode start/end date.
A very similar question has been asked before, and was somewhat helpful. The problem is that in customizing the code, the returned tables are excluding first and last records. I'm not sure how to proceed.
My data currently looks like this:
MemberCode Diagnosis ServiceDate
1001 ----- ABC ----- 2010-02-04
1001 ----- ABC ----- 2010-03-20
1001 ----- ABC ----- 2010-04-18
1001 ----- ABC ----- 2010-05-22
1001 ----- ABC ----- 2010-09-26
1001 ----- ABC ----- 2010-10-11
1001 ----- ABC ----- 2010-10-19
2002 ----- XYZ ----- 2010-07-10
2002 ----- XYZ ----- 2010-07-21
2002 ----- XYZ ----- 2010-11-08
2002 ----- ABC ----- 2010-06-03
2002 ----- ABC ----- 2010-08-13
In the above data, the first record for Member 1001 is 2010-02-04, and there is not a difference of more than 90 days between consecutive services until 2010-09-26 (the date at which a new episode starts). So Member 1001 has two distinct episodes: (1) Diagnosis ABC, which goes from 2010-02-04 to 2010-05-22, and (2) Diagnosis ABC, which goes from 2010-09-26 to 2010-10-19.
Similarly, Member 2002 has three distinct episodes: (1) Diagnosis XYZ, which goes from 2010-07-10 to 2010-07-21, (2) Diagnosis XYZ, which begins and ends on 2010-11-08, and (3) Diagnosis ABC, which goes from 2010-06-03 to 2010-08-13.
Desired output:
MemberCode Diagnosis EpisodeStartDate EpisodeEndDate
1001 ----- ABC ----- 2010-02-04 ----- 2010-05-22
1001 ----- ABC ----- 2010-09-26 ----- 2010-10-19
2002 ----- XYZ ----- 2010-07-10 ----- 2010-07-21
2002 ----- XYZ ----- 2010-11-08 ----- 2010-11-08
2002 ----- ABC ----- 2010-06-03 ----- 2010-08-13
I've been working on this query for too long, and still can't get exactly what I need. Any help would be appreciated. Thanks in advance!
SQL Server 2012 has the lag() and cumulative sum functions, which make it easier to write such a query. The idea is to flag the first row in each sequence, then take the cumulative sum of that flag to identify each group. Here is the code:
select MemberCode, Diagnosis, min(ServiceDate) as EpisodeStartDate,
max(ServiceDate) as EpisodeEndDate
from (select t.*, sum(ServiceStartFlag) over (partition by MemberCode, Diagnosis order by ServiceDate) as grp
from (select t.*,
(case when datediff(day,
lag(ServiceDate) over (partition by MemberCode, Diagnosis
order by ServiceDate),
ServiceDate) < 90
then 0
else 1 -- handles both NULL and >= 90
end) as ServiceStartFlag
from table t -- substitute your table name
) t
) t
group by grp, MemberCode, Diagnosis;
You can do this in earlier versions of SQL Server but the code is more cumbersome.
For versions of SQL Server prior to 2012, here are some code snippets that should work.
First, you'll need a temp table or table variable rather than a CTE, because with a CTE the lookup of the edge event would fire the newid() function again instead of retrieving the value stored for that row.
DECLARE @Edges TABLE (MemberCode INT, Diagnosis VARCHAR(3), ServiceDate DATE, GroupID VARCHAR(40))
INSERT INTO @Edges
SELECT *
FROM Treatments E
CROSS APPLY (
SELECT
CASE
WHEN EXISTS (
SELECT TOP 1 E2.ServiceDate
FROM Treatments E2
WHERE E.MemberCode = E2.MemberCode
AND E.Diagnosis = E2.Diagnosis
AND E.ServiceDate > E2.ServiceDate
AND DATEDIFF(dd,E2.ServiceDate,E.ServiceDate) BETWEEN 1 AND 90
ORDER BY E2.ServiceDate DESC
) THEN 'Group'
ELSE CAST(NEWID() AS VARCHAR(40))
END AS GroupID
) z
The EXISTS operator contains a query that looks into the past for a date between 1 and 90 days ago. Once the edge cases are gathered, this query will produce the desired results you posted from your test data.
SELECT MemberCode, Diagnosis, MIN(ServiceDate) AS StartDate, MAX(ServiceDate) AS EndDate
FROM (
SELECT
MemberCode
, Diagnosis
, ServiceDate
, CASE GroupID
WHEN 'Group' THEN (
SELECT TOP 1 GroupID
FROM @Edges E2
WHERE E.MemberCode = E2.MemberCode
AND E.Diagnosis = E2.Diagnosis
AND E.ServiceDate > E2.ServiceDate
AND GroupID != 'Group'
ORDER BY ServiceDate DESC
)
ELSE GroupID END AS GroupID
FROM @Edges E
) Z
GROUP BY MemberCode, Diagnosis, GroupID
ORDER BY MemberCode, Diagnosis, MIN(ServiceDate)
Like Gordon said, more cumbersome, but it can be done if your server is not SQL 2012 or greater.