Most Efficient SQL to Calculate Running Streak Occurrences - sql

I am looking for the most efficient manner to determine the longest occurrence of a streak within a given data set; specifically, to determine the longest winning streak of games.
Below is the SQL that I have thus far, and it does seem to perform very fast, and as expected from the limited testing I've done on a dataset with around 100,000 records.
DECLARE #HistoryDateTimeLimit datetime = '3/15/2018';
CTE to create result subset from voting dataset.
WITH Results AS (
SELECT
EntityPlayerId,
(CASE
WHEN VoteTeamA = 1 AND ParticipantAScore > ParticipantBScore THEN 'W'
WHEN VoteTeamA = 0 AND ParticipantBScore > ParticipantAScore THEN 'W'
ELSE 'L'
END) AS WinLoss,
match.ScheduledStartDateTime
FROM
[dbo].[MatchVote] vote
INNER JOIN [dbo].[MatchMetaData] match ON vote.MatchId = match.MatchId
WHERE
IsComplete = 1
AND ScheduledStartDateTime >= #HistoryDateTimeLimit
)
CTE to create subset of data with streak type as WinLoss and total count of votes in the partition using ROW_NUMBER().
WITH Streaks AS (
SELECT
EntityPlayerId,
ScheduledStartDateTime,
WinLoss,
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId ORDER BY ScheduledStartDateTime) -
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId, WinLoss ORDER BY ScheduledStartDateTime) AS Streak
FROM
Results
)
CTE to summarize the partitioned vote streaks by WinLoss and a begin date/time, with the total count in the streak.
WITH StreakCounts AS (
SELECT
EntityPlayerId,
WinLoss,
MIN(ScheduledStartDateTime) StreakStart,
MAX(ScheduledStartDAteTime) StreakEnd,
COUNT(*) as Streak
FROM
Streaks
GROUP BY
EntityPlayerId, WinLoss, Streak
)
CTE to select the MAXIMUM (longest) vote streak for WinLoss of W (win) grouped by players.
WITH LongestWinStreak AS (
SELECT
EntityPlayerId,
MAX(Streak) AS LongestStreak
FROM
StreakCounts
WHERE
WinLoss = 'W'
GROUP BY
EntityPlayerId
)
Selecting the useful data from the LongestWinStreak CTE.
SELECT * FROM LongestWinStreak
This is the 3rd iteration of the code; at first I feel like I was overthinking and using windows with the LAG function to define a reset period that was later used for partitioning.
[UPDATE]: SQLFiddle example at http://sqlfiddle.com/#!18/5b33a/1 -- Sample data for the two tables that are used above are as follows.
The data is meant to show the schema, and can be extrapolated for your own testing/usage;
MatchVote table data.
EntityPlayerId IsExtMatch MatchId VoteTeamA VoteDateTime IsComplete
-------------------- ------------ -------------------- --------- ----------------------- ----------
158 1 152639 0 2018-03-20 23:25:28.910 1
158 1 156058 1 2018-03-13 23:36:57.517 1
MatchMetaData table data.
MatchId IsTeamTournament MatchCompletedDateTime ScheduledStartDateTime MatchIsFinalized TournamentId TournamentTitle TournamentLogoUrl TournamentLogoThumbnailUrl GameName GameShortCode GameLogoUrl ParticipantAScore ParticipantAName ParticipantALogoUrl ParticipantBScore ParticipantBName ParticipantBLogoUrl
--------- ---------------- ----------------------- ----------------------- ---------------- -------------------- ------------------ ----------------------- ---------------------------- --------------------------------- -------------- ----------------------- ------------------ ------------------- --------------------- ----------------- ------------------- --------------------
23354 1 2014-07-30 00:30:00.000 2014-07-30 00:00:00.000 1 543 Sample https://...Small.png https://...Small.png Dota 2 Dota 2 https://...logo.png 3 Natus Vincere.US https://...VI.png 0 Not Today https://...ay.png
44324 1 2014-12-15 12:40:00.000 2014-12-15 11:40:00.000 1 786 Sample https://...Small.png https://...Small.png Counter-Strike: Global Offensive CS:GO https://...logo.png 0 Avalier's stars https://...oto.png 1 Kassad's Legends https://...oto.png

Related

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
first
I want to know how many times customer visits per month using their visit_date
i.e i want below output
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
and what is the avg frequency of visit per ids
ie.
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried with below query but it counts each item as one transaction
select avg(count_e) as num_visits_per_month,individual_id
from
(
select r.individual_id, cal_month_nbr, count(*) as count_e
from
ww_customer_dl_secure.cust_scan
GROUP by
r.individual_id, cal_month_nbr
order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions
You can divide the total visits by the number of months:
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of days per month, then:
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
from (select distinct individual_id, visit_date, cal_month_nbr
from ww_customer_dl_secure.cust_scan c
) iv
group by individual_id, cal_month_nbr
) ic
group by individual_id;

How to select rows where values changed for an ID

I have a table that looks like the following
id effective_date number_of_int_customers
123 10/01/19 0
123 02/01/20 3
456 10/01/19 6
456 02/01/20 6
789 10/01/19 5
789 02/01/20 4
999 10/01/19 0
999 02/01/20 1
I want to write a query that looks at each ID to see if the salespeople have newly started working internationally between October 1st and February 1st.
The result I am looking for is the following:
id effective_date number_of_int_customers
123 02/01/20 3
999 02/01/20 1
The result would return only the salespeople who originally had 0 international customers and now have at least 1.
I have seen similar posts here that use nested queries to pull records where the first date and last have different values. But I only want to pull records where the original value was 0. Is there a way to do this in one query in SQL?
In your case, a simple aggregation would do -- assuming that 0 is the earliest value:
select id, max(number_of_int_customers)
from t
where effective_date in ('2019-10-01', '2020-02-01')
group by id
having min(number_of_int_customers) = 0;
Obviously, this is not correct if the values can decrease to zero. But this having clause fixes that problem:
having min(case when number_of_int_customers = 0 then effective_date end) = min(effective_date)
An alternative is to use window functions, such asfirst_value():
select distinct id, last_noic
from (select t.*,
first_value(number_of_int_customers) over (partition by id order by effective_date) as first_noic,
first_value(number_of_int_customers) over (partition by id order by effective_date desc) as last_noic,
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where first_noic = 0;
Hmmm, on second thought, I like lag() better:
select id, number_of_int_customers
from (select t.*,
lag(number_of_int_customers) over (partition by id order by effective_date) as prev_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where prev_noic = 0;

Running recursive query in Vertica

I am trying to do the exact same thing as this question. But I am in Vertica, and I am finding no way to carry out the top answer, or the other answers. So basically I have tried connect by and sub query UNION ALL method, and I don't think Vertica supports it.
Is there any way I can replicate the solution in Vertica?
EDIT: Full Question
I am trying to calculate 30-day readmission chains, which is a sequence of readmissions within 30 days from its previous admission. The following data shows a simplified situation where we have events, rather than admission and discharges. Difference in days between events will identify it as a 30-day readmission, consecutive 30 day readmissions(Chain Len) will be a single chain of readmission (Count).
Sample Data
CREATE TABLE dbo.Events (
EventID INT IDENTITY(1,1) PRIMARY KEY,
EventDate DATE NOT NULL,
PersonID INT NOT NULL
);
GO
INSERT dbo.Events (EventDate, PersonID)
VALUES
('2014-01-01', 1), ('2014-01-05', 1), ('2014-02-02', 1), ('2014-03-30', 1), ('2014-04-04', 1),
('2014-01-11', 2), ('2014-02-02', 2),
('2014-01-03', 3), ('2014-03-03', 3);
GO
Sample Output
EventID EventDate PersonID CHAIN LEN Count
------- ---------- -------- --------- -----
1 2014-01-01 1 1 1
2 2014-01-05 1 2 1
3 2014-02-02 1 3 1
------- ---------- -------- --------- -----
4 2014-03-30 1 1 2
5 2014-04-04 1 2 2
------- ---------- -------- --------- -----
6 2014-01-11 2 1 1
7 2014-02-02 2 2 1
------- ---------- -------- --------- -----
8 2014-01-03 3 1 1
------- ---------- -------- --------- -----
9 2014-03-03 3 1 2
------- ---------- -------- --------- -----
Here is an Oracle solution; see if it works. You may need to make some changes for vertica, as each db dialect has its own quirks. Vertica does support analytic functions, which is the main ingredient.
The method used here is a very well known one, it is usually called "start-of-groups" method (for the "flags" created in the innermost subquery).
select eventid, eventdate, personid,
row_number() over
(partition by personid, ct order by eventdate) as chain_len,
ct
from (
select eventid, eventdate, personid,
count(flag) over
(partition by personid order by eventdate) + 1 as ct
from (
select eventid, eventdate, personid,
case when eventdate > lag(eventdate) over
(partition by personid order by eventdate) + 30
then 0 end as flag
from events
)
)
order by personid, eventdate -- if needed
;

How to partition the following table in DB2

I am trying to partition and order the following table, where I have used all sorts of row_number() over() and dense_rank() over() combinations but am not getting what I need.
The MWE table is as follows:
Person Visit Last_Visit Gap_1_yr
------ ----- -------- --------
1 01/01/2001 01/01/2000 NULL
1 01/01/2003 01/01/2001 gap
1 01/01/2004 01/01/2003 NULL
1 01/01/2006 01/01/2004 gap
2 01/01/2005 01/01/2002 gap
2 01/01/2010 01/01/2005 gap
where a person turns up for an appointment, and if the persons next appointment is > 365 days from their previous appointment (I used a lag function for this).
What I want is, whenever there is a gap, I want to partition so I have the following:
Person Visit Last_Visit Gap_1_yr SEQ
------ ----- -------- -------- ---
1 01/01/2001 01/01/2000 NULL 1
1 01/01/2003 01/01/2001 gap 2
1 01/01/2004 01/01/2003 NULL 2
1 01/01/2006 01/01/2004 gap 3
2 01/01/2005 01/01/2002 gap 1
2 01/01/2010 01/01/2005 gap 2
You see that when there is a gap, the sequence iterates by one until the next gap - all per person.
I have tried:
row_number() over(partition by person order by gap)
but this iterates for every cell in SEQ until finding a new person -ignores gaps
and have tried:
dense_rank() over(partition by person order by gap)
returns 1's in every cell in SEQ
dense_rank() over(partition by person,gap order by gap)
also returns all 1's.
does anyone have any suggestions?
Convert the gap to a flag. Then use sum() to do a cumulative sum of the flag:
select mwe.*,
sum(case when gap_1_yr = 'gap' then 1 else 0 end) over
(partition by person order by visit)
) as seq
from mwe;

SQL: Identify distinct blocks of treatment over multiple start and end date ranges for each member

Objective: Identify distinct episodes of continuous treatment for each member in a table. Each member has a diagnosis and a service date, and an episode is defined as all services where the time between each consecutive service is less than some number (let's say 90 days for this example). The query will need to loop through each row and calculate the difference between dates, and return the first and last date associated with each episode. The goal is to group results by member and episode start/end date.
A very similar question has been asked before, and was somewhat helpful. The problem is that in customizing the code, the returned tables are excluding first and last records. I'm not sure how to proceed.
My data currently looks like this:
MemberCode Diagnosis ServiceDate
1001 ----- ABC ----- 2010-02-04
1001 ----- ABC ----- 2010-03-20
1001 ----- ABC ----- 2010-04-18
1001 ----- ABC ----- 2010-05-22
1001 ----- ABC ----- 2010-09-26
1001 ----- ABC ----- 2010-10-11
1001 ----- ABC ----- 2010-10-19
2002 ----- XYZ ----- 2010-07-10
2002 ----- XYZ ----- 2010-07-21
2002 ----- XYZ ----- 2010-11-08
2002 ----- ABC ----- 2010-06-03
2002 ----- ABC ----- 2010-08-13
In the above data, the first record for Member 1001 is 2010-02-04, and there is not a difference of more than 90 days between consecutive services until 2010-09-26 (the date at which a new episode starts). So Member 1001 has two distinct episodes: (1) Diagnosis ABC, which goes from 2010-02-04 to 2010-05-22, and (2) Diagnosis ABC, which goes from 2010-09-26 to 2010-10-19.
Similarly, Member 2002 has three distinct episodes: (1) Diagnosis XYZ, which goes from 2010-07-10 to 2010-07-21, (2) Diagnosis XYZ, which begins and ends on 2010-11-08, and (3) Diagnosis ABC, which goes from 2010-06-03 to 2010-08-13.
Desired output:
MemberCode Diagnosis EpisodeStartDate EpisodeEndDate
1001 ----- ABC ----- 2010-02-04 ----- 2010-05-22
1001 ----- ABC ----- 2010-09-26 ----- 2010-10-19
2002 ----- XYZ ----- 2010-07-10 ----- 2010-07-21
2002 ----- XYZ ----- 2010-11-08 ----- 2010-11-08
2002 ----- ABC ----- 2010-06-03 ----- 2010-08-13
I've been working on this query for too long, and still can't get exactly what I need. Any help would be appreciated. Thanks in advance!
SQL Server 2012 has the lag() and cumulative sum functions, which makes it easier to write such a query. The idea is to find the first in each sequence. Then take the cumulative sum of the first flag to identify each group. Here is the code:
select MemberId, Diagnosis, min(ServiceDate) as EpisodeStartDate,
max(ServiceStartDate) as EpisodeEndDate
from (select t.*, sum(ServiceStartFlag) over (partition by MemberId, Diagnosis order by ServiceDate) as grp
from (select t.*,
(case when datediff(day,
lag(ServiceDate) over (partition by MemberId, Diagnosis
order by ServiceDate),
ServiceDate) < 90
then 0
else 1 -- handles both NULL and >= 90
end) as ServiceStartFlag
from table t
) t
group by grp, MemberId, Diagnosis;
You can do this in earlier versions of SQL Server but the code is more cumbersome.
For versions of SQL Server prior to 2012, here's some code snippets that should work.
First, you'll need a temp table (as opposed to a CTE, as the lookup of the edge event will fire the newid() function again, rather than retriving the value for that row)
DECLARE #Edges TABLE (MemberCode INT, Diagnosis VARCHAR(3), ServiceDate DATE, GroupID VARCHAR(40))
INSERT INTO #Edges
SELECT *
FROM Treatments E
CROSS APPLY (
SELECT
CASE
WHEN EXISTS (
SELECT TOP 1 E2.ServiceDate
FROM Treatments E2
WHERE E.MemberCode = E2.MemberCode
AND E.Diagnosis = E2.Diagnosis
AND E.ServiceDate > E2.ServiceDate
AND DATEDIFF(dd,E2.ServiceDate,E.ServiceDate) BETWEEN 1 AND 90
ORDER BY E2.ServiceDate DESC
) THEN 'Group'
ELSE CAST(NEWID() AS VARCHAR(40))
END AS GroupID
) z
The EXISTS operator contains a query that looks into the past for a date between 1 and 90 days ago. Once the Edge cases are gathered, this query will provide the results you posted as desired from the test data you posted.
SELECT MemberCode, Diagnosis, MIN(ServiceDate) AS StartDate, MAX(ServiceDate) AS EndDate
FROM (
SELECT
MemberCode
, Diagnosis
, ServiceDate
, CASE GroupID
WHEN 'Group' THEN (
SELECT TOP 1 GroupID
FROM #Edges E2
WHERE E.MemberCode = E2.MemberCode
AND E.Diagnosis = E2.Diagnosis
AND E.ServiceDate > E2.ServiceDate
AND GroupID != 'Group'
ORDER BY ServiceDate DESC
)
ELSE GroupID END AS GroupID
FROM #Edges E
) Z
GROUP BY MemberCode, Diagnosis, GroupID
ORDER BY MemberCode, Diagnosis, MIN(ServiceDate)
Like Gordon said, more cumbersome, but it can be done if your server is not SQL 2012 or greater.