Count multiple repeats after event as single repeat - sql

What I'm trying to do is come up with a single query that gives the percentage of repeats within 30 days of an initial event, counting any number of events within those 30 days as a single repeat. Here's a sample data set for a single person:
Person Date
══════════════
A 3/1/14
A 3/21/14
A 3/29/14
A 4/14/14
A 4/17/14
In this case, 3/21 would be the repeat event, and 3/29 wouldn't be counted as a second. 4/14 would be the start of the next window, with 4/17 being the second repeat.
To calculate the percentage of repeats here, the numerator would be the distinct count of people who had an initial event in the month and also had a subsequent event within 30 days. The denominator is a distinct count of people with events in that month. In the case of crossing months, the repeat is counted within the month of the initial event.
I know I could come up with something that uses a loop/cursor or temp table, but as the data set grows, it's going to take forever. Does anyone have any thoughts on how to do this as a single query? It's probably going to involve a couple of CTEs. Everything I've come up with so far has failed.

Nice one... try this:
create table #t (Person varchar(10), EventDate date);
insert #t (Person, EventDate)
values
('A', '3/1/14'),
('A', '3/21/14'),
('A', '3/29/14'),
('A', '4/14/14'),
('A', '4/17/14'),
('A', '8/3/14'),
('B', '3/25/14'),
('B', '4/2/14'),
('B', '4/20/14'),
('B', '6/14/14'),
('B', '8/17/14'),
('B', '8/26/14');
;WITH OrderedEvents AS (
SELECT Person, EventDate, ROW_NUMBER() OVER (PARTITION BY Person ORDER BY EventDate) AS Ord
FROM #t
)
, RepeatedEvents AS (
SELECT Person, EventDate, Ord, EventDate AS InitialDate
FROM OrderedEvents
WHERE Ord = 1
UNION ALL
SELECT o.Person, o.EventDate, o.Ord
, CASE WHEN DATEDIFF(DAY, r.InitialDate, o.EventDate) > 30 THEN o.EventDate ELSE r.InitialDate END
FROM OrderedEvents o
JOIN RepeatedEvents r ON o.Person = r.Person AND o.Ord = r.Ord + 1
)
, GroupedEvents AS (
SELECT Person, MONTH(InitialDate) AS Mth, YEAR(InitialDate) AS Yr
, IsRepeat = CASE WHEN COUNT(*) > 1 THEN 1 ELSE 0 END
FROM RepeatedEvents
GROUP BY Person, MONTH(InitialDate), YEAR(InitialDate)
)
SELECT Mth, Yr, CAST(SUM(IsRepeat) AS NUMERIC) / CAST(COUNT(DISTINCT person) AS NUMERIC) AS Pct
FROM GroupedEvents
GROUP BY Mth, Yr;
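To see how the recursive CTE assigns each event to a window, it can help to select from RepeatedEvents directly. The sketch below is a trimmed copy of the query above, run against an inline copy of person A's rows from the question. The Events CTE and the Role column are added here for illustration only; this is a way to inspect the intermediate result, not a replacement for the full query.

```sql
WITH Events AS (
    -- Inline copy of person A's sample rows (illustrative stand-in for #t)
    SELECT v.Person, CAST(v.EventDate AS date) AS EventDate
    FROM (VALUES ('A', '2014-03-01'),
                 ('A', '2014-03-21'),
                 ('A', '2014-03-29'),
                 ('A', '2014-04-14'),
                 ('A', '2014-04-17')) v (Person, EventDate)
)
, OrderedEvents AS (
    SELECT Person, EventDate,
           ROW_NUMBER() OVER (PARTITION BY Person ORDER BY EventDate) AS Ord
    FROM Events
)
, RepeatedEvents AS (
    -- Anchor: each person's first event opens the first window
    SELECT Person, EventDate, Ord, EventDate AS InitialDate
    FROM OrderedEvents
    WHERE Ord = 1
    UNION ALL
    -- Step: an event more than 30 days after the current window's
    -- initial date opens a new window; otherwise it stays in the old one
    SELECT o.Person, o.EventDate, o.Ord
         , CASE WHEN DATEDIFF(DAY, r.InitialDate, o.EventDate) > 30
                THEN o.EventDate ELSE r.InitialDate END
    FROM OrderedEvents o
    JOIN RepeatedEvents r ON o.Person = r.Person AND o.Ord = r.Ord + 1
)
SELECT Person, EventDate, InitialDate
     , CASE WHEN EventDate = InitialDate THEN 'initial' ELSE 'repeat' END AS Role
FROM RepeatedEvents
ORDER BY Person, EventDate;
```

For person A this should label 3/1 and 4/14 as initial events and the other three dates as repeats, which matches the description in the question. One caveat: SQL Server stops a recursive CTE after 100 recursion levels by default, so a person with more than about 100 events would need an OPTION (MAXRECURSION n) hint on the final SELECT.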

Related

SQL grouping data with overlapping timespans

I need to group data together that are related to each other by overlapping timespans based on the records start and end times. SQL-fiddle here: http://sqlfiddle.com/#!18/87e4b/1/0
The current query I have built is giving incorrect results. Callid 3 should give a callCount of 4. It does not because record 6 is not included since it does not overlap with 3, but should be included because it does overlap with one of the other related records. So I believe a recursive CTE may be in need but I am unsure how to write this.
Schema:
CREATE TABLE Calls
([callid] int, [src] varchar(10), [start] datetime, [end] datetime, [conf] varchar(5));
INSERT INTO Calls
([callid],[src],[start],[end],[conf])
VALUES
('1','5555550001','2019-07-09 10:00:00', '2019-07-09 10:10:00', '111'),
('2','5555550002','2019-07-09 10:00:01', '2019-07-09 10:11:00', '111'),
('3','5555550011','2019-07-09 11:00:00', '2019-07-09 11:10:00', '111'),
('4','5555550012','2019-07-09 11:00:01', '2019-07-09 11:11:00', '111'),
('5','5555550013','2019-07-09 11:01:00', '2019-07-09 11:15:00', '111'),
('6','5555550014','2019-07-09 11:12:00', '2019-07-09 11:16:00', '111'),
('7','5555550014','2019-07-09 15:00:00', '2019-07-09 15:01:00', '111');
Current query:
SELECT
detail_record.callid,
detail_record.conf,
MIN(related_record.start) AS sessionStart,
MAX(related_record.[end]) As sessionEnd,
COUNT(related_record.callid) AS callCount
FROM
Calls AS detail_record
INNER JOIN
Calls AS related_record
ON related_record.conf = detail_record.conf
AND ((related_record.start >= detail_record.start
AND related_record.start < detail_record.[end])
OR (related_record.[end] > detail_record.start
AND related_record.[end] <= detail_record.[end])
OR (related_record.start <= detail_record.start
AND related_record.[end] >= detail_record.[end])
)
WHERE
detail_record.start > '1/1/2019'
AND detail_record.conf = '111'
GROUP BY
detail_record.callid,
detail_record.start,
detail_record.conf
HAVING
MIN(related_record.start) >= detail_record.start
ORDER BY sessionStart DESC
Expected Results:
callid conf sessionStart sessionEnd callCount
7 111 2019-07-09T15:00:00Z 2019-07-09T15:01:00Z 1
3 111 2019-07-09T11:00:00Z 2019-07-09T11:15:00Z 4
1 111 2019-07-09T10:00:00Z 2019-07-09T10:11:00Z 2
This is a gaps-and-islands problem. It does not require a recursive CTE. You can use window functions:
select min(callid) as callid, conf, grouping, min([start]) as sessionStart, max([end]) as sessionEnd, count(*) as callCount
from (select c.*,
sum(case when prev_end < [start] then 1 else 0 end) over (partition by conf order by [start]) as grouping
from (select c.*,
max([end]) over (partition by conf order by [start] rows between unbounded preceding and 1 preceding) as prev_end
from calls c
) c
) c
group by conf, grouping;
The innermost subquery calculates the latest end time among all preceding rows for the same conf. The middle subquery compares this to the current start to flag each row that begins a new group; a cumulative sum of those flags then determines the grouping.
And, the outer query aggregates to summarize information about each group.
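To make the two inner steps concrete, here is a small sketch that stops before the aggregation and shows prev_end next to a flag for each row. It uses an inline copy of calls 3 to 7 from the question; the SampleCalls CTE and the starts_new_group column name are illustrative assumptions, not part of the original query.

```sql
WITH SampleCalls AS (
    -- Inline copy of calls 3..7 from the question's schema
    SELECT v.callid, v.conf,
           CAST(v.[start] AS datetime) AS [start],
           CAST(v.[end]   AS datetime) AS [end]
    FROM (VALUES
        (3, '111', '2019-07-09T11:00:00', '2019-07-09T11:10:00'),
        (4, '111', '2019-07-09T11:00:01', '2019-07-09T11:11:00'),
        (5, '111', '2019-07-09T11:01:00', '2019-07-09T11:15:00'),
        (6, '111', '2019-07-09T11:12:00', '2019-07-09T11:16:00'),
        (7, '111', '2019-07-09T15:00:00', '2019-07-09T15:01:00')
    ) v (callid, conf, [start], [end])
)
SELECT c.callid, c.[start], c.[end], c.prev_end,
       -- 1 when no earlier call in the same conf is still running
       CASE WHEN c.prev_end < c.[start] THEN 1 ELSE 0 END AS starts_new_group
FROM (SELECT s.*,
             -- latest end time among all earlier calls in the same conf
             MAX(s.[end]) OVER (PARTITION BY s.conf ORDER BY s.[start]
                                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_end
      FROM SampleCalls s
     ) c
ORDER BY c.[start];
```

Call 6 starts at 11:12, after call 3 has ended, but prev_end for that row is 11:15 (from call 5), so it is not flagged and stays in the same group. Only call 7 should come out with starts_new_group = 1 here. Using the running maximum of the end times rather than just the previous row's end is what lets the chain of overlaps carry through.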
Here is a db<>fiddle.

Find date of most recent overdue

I have the following problem: from the table of pays and dues, I need to find the date of the last overdue. Here is the table and data for example:
create table t (
Id int
, [date] date
, Customer varchar(6)
, Deal varchar(6)
, Currency varchar(3)
, [Sum] int
);
insert into t values
(1, '2017-12-12', '1110', '111111', 'USD', 12000)
, (2, '2017-12-25', '1110', '111111', 'USD', 5000)
, (3, '2017-12-13', '1110', '122222', 'USD', 10000)
, (4, '2018-01-13', '1110', '111111', 'USD', -10100)
, (5, '2017-11-20', '2200', '222221', 'USD', 25000)
, (6, '2017-12-20', '2200', '222221', 'USD', 20000)
, (7, '2017-12-31', '2201', '222221', 'USD', -10000)
, (8, '2017-12-29', '1110', '122222', 'USD', -10000)
, (9, '2017-11-28', '2201', '222221', 'USD', -30000);
If the value of "Sum" is positive - it means overdue has begun; if "Sum" is negative - it means someone paid on this Deal.
In the example above on Deal '122222' overdue starts at 2017-12-13 and ends on 2017-12-29, so it shouldn't be in the result.
And for the Deal '222221', the first overdue of 25000, which started on 2017-11-20, was completely paid on 2017-11-28, so the last date of the current overdue (the one we are interested in) is 2017-12-31
I've made this selection to sum up all the payments, and stuck here :(
WITH cte AS (
SELECT *,
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
Apparently I need to find (for each Deal) the minimum of the Dates if there is no 0 or negative Debt_balance, and otherwise the next date after the last 0 balance.
I will be grateful for any tips and ideas on the subject.
Thanks!
UPDATE
My version of solution:
WITH cte AS (
SELECT ROW_NUMBER() OVER (ORDER BY Deal, [Date]) id,
Deal, [Date], [Sum],
SUM([Sum]) OVER(PARTITION BY Deal ORDER BY [Date]) AS Debt_balance
FROM t
)
SELECT a.Deal,
SUM(a.Sum) AS NET_Debt,
isnull(max(b.date), min(a.date)),
datediff(day, isnull(max(b.date), min(a.date)), getdate())
FROM cte as a
LEFT OUTER JOIN cte AS b
ON a.Deal = b.Deal AND a.Debt_balance <= 0 AND b.Id=a.Id+1
GROUP BY a.Deal
HAVING SUM(a.Sum) > 0
I believe you are trying to use a running sum and keep track of when it changes to positive. It can change to positive multiple times, and you want the last date on which it became positive. You need LAG() in addition to the running sum:
WITH cte1 AS (
-- running balance column
SELECT *
, SUM([Sum]) OVER (PARTITION BY Deal ORDER BY [Date], Id) AS RunningBalance
FROM t
), cte2 AS (
-- overdue begun column - set whenever the running balance changes from <= 0 to > 0
SELECT *
, CASE WHEN LAG(RunningBalance, 1, 0) OVER (PARTITION BY Deal ORDER BY [Date], Id) <= 0 AND RunningBalance > 0 THEN 1 END AS OverdueBegun
FROM cte1
)
-- eliminate groups that are paid i.e. sum = 0
SELECT Deal, MAX(CASE WHEN OverdueBegun = 1 THEN [Date] END) AS RecentOverdueDate
FROM cte2
GROUP BY Deal
HAVING SUM([Sum]) <> 0
Demo on db<>fiddle
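As a rough illustration of what the two CTEs produce, the sketch below runs the same running-sum and LAG() expressions over an inline copy of the four rows for Deal '222221' from the question. The SampleRows CTE is an assumption added so the snippet runs on its own; the expressions themselves are taken from the answer above.

```sql
WITH SampleRows AS (
    -- Inline copy of the question's rows for Deal '222221'
    SELECT v.Id, CAST(v.[Date] AS date) AS [Date], v.Deal, v.[Sum]
    FROM (VALUES
        (5, '2017-11-20', '222221',  25000),
        (9, '2017-11-28', '222221', -30000),
        (6, '2017-12-20', '222221',  20000),
        (7, '2017-12-31', '222221', -10000)
    ) v (Id, [Date], Deal, [Sum])
), cte1 AS (
    -- running balance per deal, with Id as a tie-breaker on equal dates
    SELECT *
         , SUM([Sum]) OVER (PARTITION BY Deal ORDER BY [Date], Id) AS RunningBalance
    FROM SampleRows
)
SELECT Id, [Date], [Sum], RunningBalance
     -- 1 on rows where the balance crosses from <= 0 to > 0
     , CASE WHEN LAG(RunningBalance, 1, 0) OVER (PARTITION BY Deal ORDER BY [Date], Id) <= 0
             AND RunningBalance > 0 THEN 1 END AS OverdueBegun
FROM cte1
ORDER BY [Date], Id;
```

The running balance should come out as 25000, -5000, 15000, 5000, with OverdueBegun set on 2017-11-20 and 2017-12-20. The third argument of LAG() supplies 0 as the "previous balance" for the first row of each deal, which is why a deal that opens with a positive amount is flagged straight away.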
You can use window functions. These can calculate intermediate values:
Last day when the sum is negative (i.e. last "good" record).
Last sum
Then you can combine these:
select deal, min(date) as last_overdue_start_date
from (select t.*,
first_value(sum) over (partition by deal order by date desc) as last_sum,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where last_sum > 0 and date > max_date_neg
group by deal;
Actually, the value on the last date is not necessary. So this simplifies to:
select deal, min(date) as last_overdue_start_date
from (select t.*,
max(case when sum < 0 then date end) over (partition by deal order by date) as max_date_neg
from t
) t
where date > max_date_neg
group by deal;

Splitting up group by with relevant aggregates beyond the basic ones?

I'm not sure if this has been asked before because I'm having trouble even asking it myself. I think the best way to explain my dilemma is to use an example.
Say I've rated my happiness on a scale of 1-10 every day for 10 years and I have the results in a big table where I have a single date correspond to a single integer value of my happiness rating. I say, though, that I only care about my happiness over 60 day periods on average (this may seem weird but this is a simplified example). So I wrap up this information to a table where I now have a start date field, an end date field, and an average rating field where the start days are every day from the first day to the last over all 10 years, but the end dates are exactly 60 days later. To be clear, these 60 day periods are overlapping (one would share 59 days with the next one, 58 with the next, and so on).
Next I pick a threshold rating, say 5, where I want to categorize everything below it into a "bad" category and everything above into a "good" category. I could easily add another field and use a case structure to give every 60-day range a "good" or "bad" flag.
Then to sum it up, I want to display the total periods of "good" and "bad" from maximum beginning to maximum end date. This is where I'm stuck. I could group by the good/bad category and then just take min(start date) and max(end date), but then if, say, the ranges go from good to bad to good then to bad again, output would show overlapping ranges of good and bad. In the aforementioned situation, I would want to show four different ranges.
I realize this may seem clearer to me than it would to someone else, so if you need clarification just ask.
Thank you
---EDIT---
Here's an example of what the before would look like:
StartDate| EndDate| MoodRating
------------+------------+------------
1/1/1991 |3/1/1991 | 7
1/2/1991 |3/2/1991 | 7
1/3/1991 |3/3/1991 | 4
1/4/1991 |3/4/1991 | 4
1/5/1991 |3/5/1991 | 7
1/6/1991 |3/6/1991 | 7
1/7/1991 |3/7/1991 | 4
1/8/1991 |3/8/1991 | 4
1/9/1991 |3/9/1991 | 4
And the after:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/2/1991 |good
1/3/1991|3/4/1991 |bad
1/5/1991|3/6/1991 |good
1/7/1991|3/9/1991 |bad
Currently my query with the group by rating would show:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/6/1991 |good
1/3/1991|3/9/1991 |bad
This is something along the lines of
select min(StartDate), max(EndDate), Good_Bad
from sourcetable
group by Good_Bad
While Jason A Long's answer may be correct - I can't read it or figure it out, so I figured I would post my own answer. Assuming that this isn't a process that you're going to be constantly running, the CURSOR's performance hit shouldn't matter. But (at least to me) this solution is very readable and can be easily modified.
In a nutshell - we insert the first record from your source table into our results table. Next, we grab the next record and see if the mood score is the same as the previous record. If it is, we simply update the previous record's end date with the current record's end date (extending the range). If not, we insert a new record. Rinse, repeat. Simple.
Here is your setup and some sample data:
DECLARE @MoodRanges TABLE (StartDate DATE, EndDate DATE, MoodRating int)
INSERT INTO @MoodRanges
VALUES
('1/1/1991','3/1/1991', 7),
('1/2/1991','3/2/1991', 7),
('1/3/1991','3/3/1991', 4),
('1/4/1991','3/4/1991', 4),
('1/5/1991','3/5/1991', 7),
('1/6/1991','3/6/1991', 7),
('1/7/1991','3/7/1991', 4),
('1/8/1991','3/8/1991', 4),
('1/9/1991','3/9/1991', 4)
Next, we can create a table to store our results, as well as some variable placeholders for our cursor:
DECLARE @MoodResults TABLE(ID INT IDENTITY(1, 1), StartDate DATE, EndDate DATE, MoodScore varchar(50))
DECLARE @CurrentStartDate DATE, @CurrentEndDate DATE, @CurrentMoodScore INT,
@PreviousStartDate DATE, @PreviousEndDate DATE, @PreviousMoodScore INT
Now we put all of the sample data into our CURSOR:
DECLARE MoodCursor CURSOR FOR
SELECT StartDate, EndDate, MoodRating
FROM @MoodRanges
ORDER BY StartDate -- process the ranges in date order
OPEN MoodCursor
FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @PreviousStartDate IS NOT NULL
    BEGIN
        -- same category as the previous row: extend the latest result range
        IF (@PreviousMoodScore >= 5 AND @CurrentMoodScore >= 5)
        OR (@PreviousMoodScore < 5 AND @CurrentMoodScore < 5)
        BEGIN
            UPDATE @MoodResults
            SET EndDate = @CurrentEndDate
            WHERE ID = (SELECT MAX(ID) FROM @MoodResults)
        END
        ELSE
        -- category changed: start a new result range
        BEGIN
            INSERT INTO
            @MoodResults
            VALUES
            (@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
        END
    END
    ELSE
    -- very first row: it always opens the first result range
    BEGIN
        INSERT INTO
        @MoodResults
        VALUES
        (@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
    END
    SET @PreviousStartDate = @CurrentStartDate
    SET @PreviousEndDate = @CurrentEndDate
    SET @PreviousMoodScore = @CurrentMoodScore
    FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore
END
CLOSE MoodCursor
DEALLOCATE MoodCursor
And here are the results:
SELECT * FROM @MoodResults
ID StartDate EndDate MoodScore
----------- ---------- ---------- --------------------------------------------------
1 1991-01-01 1991-03-02 GOOD
2 1991-01-03 1991-03-04 BAD
3 1991-01-05 1991-03-06 GOOD
4 1991-01-07 1991-03-09 BAD
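For comparison, the same "extend the range while the category stays the same" idea can be sketched without a cursor, using LAG() to flag where the category changes and a running sum of those flags to number the runs. This is only a sketch under the assumption that the rows are processed in StartDate order; the CTE names (MoodRanges, Flagged, Marked, Numbered) and the IsNewRun and RunId columns are made up for this example.

```sql
WITH MoodRanges AS (
    -- Inline copy of the sample data from the question's edit
    SELECT CAST(v.StartDate AS date) AS StartDate,
           CAST(v.EndDate   AS date) AS EndDate,
           v.MoodRating
    FROM (VALUES
        ('1991-01-01', '1991-03-01', 7),
        ('1991-01-02', '1991-03-02', 7),
        ('1991-01-03', '1991-03-03', 4),
        ('1991-01-04', '1991-03-04', 4),
        ('1991-01-05', '1991-03-05', 7),
        ('1991-01-06', '1991-03-06', 7),
        ('1991-01-07', '1991-03-07', 4),
        ('1991-01-08', '1991-03-08', 4),
        ('1991-01-09', '1991-03-09', 4)
    ) v (StartDate, EndDate, MoodRating)
),
Flagged AS (
    -- Same threshold as the cursor: 5 and above is GOOD
    SELECT StartDate, EndDate,
           CASE WHEN MoodRating >= 5 THEN 'GOOD' ELSE 'BAD' END AS MoodScore
    FROM MoodRanges
),
Marked AS (
    -- 1 on the first row and wherever the category differs from the previous row
    SELECT *,
           CASE WHEN LAG(MoodScore) OVER (ORDER BY StartDate) = MoodScore
                THEN 0 ELSE 1 END AS IsNewRun
    FROM Flagged
),
Numbered AS (
    -- Running sum of the flags gives each consecutive run its own id
    SELECT *,
           SUM(IsNewRun) OVER (ORDER BY StartDate ROWS UNBOUNDED PRECEDING) AS RunId
    FROM Marked
)
SELECT MIN(StartDate) AS MinStart, MAX(EndDate) AS MaxEnd, MoodScore
FROM Numbered
GROUP BY RunId, MoodScore
ORDER BY MinStart;
```

On the sample data this should give the same four ranges as the cursor output above. Whether it is more readable is a matter of taste, but it avoids the row-by-row UPDATE and needs SQL Server 2012 or later for LAG().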
Is this what you're looking for?
IF OBJECT_ID('tempdb..#MyDailyMood', 'U') IS NOT NULL
DROP TABLE #MyDailyMood;
CREATE TABLE #MyDailyMood (
TheDate DATE NOT NULL,
MoodLevel INT NOT NULL
);
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Calendar (dt) AS (
SELECT TOP (DATEDIFF(dd, '2007-01-01', '2017-01-01'))
DATEADD(dd, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, '2007-01-01')
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT #MyDailyMood (TheDate, MoodLevel)
SELECT
c.dt,
ABS(CHECKSUM(NEWID()) % 10) + 1
FROM
cte_Calendar c;
--==========================================================
WITH
cte_AddRN AS (
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS (
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
)
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup;
Post OP update solution...
WITH
cte_AddRN AS ( -- Add a row number to each row that resets to 1 every 60 rows.
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS ( -- Use DENSE_RANK to create groups based on the RN added above.
-- How it works: RN sets the row number 1 - 60, then repeats itself,
-- but we don't want every 60th row grouped together. We want blocks of 60 consecutive rows grouped together.
-- DENSE_RANK accomplishes this by ranking within all the "1's", "2's"... and so on.
-- Verify with the following query... SELECT * FROM cte_AssignGroups ag ORDER BY ag.TheDate
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
),
cte_AggRange AS ( -- This is just a straightforward aggregation/rollup. It produces results similar to the sample data you posted in your edit.
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
GorB = CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END,
ag.DateGroup
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup
),
cte_CompactGroup AS ( -- This time we're using dense rank to group all of the consecutive "Good" and "Bad" values so that they can be further aggregated below.
SELECT
ar.BegOfRange, ar.EndOfRange, ar.AverageMoodLevel, ar.GorB, ar.DateGroup,
DenseGroup = ar.DateGroup - DENSE_RANK() OVER (PARTITION BY ar.GorB ORDER BY ar.BegOfRange)
FROM
cte_AggRange ar
)
-- The final aggregation step...
SELECT
BegOfRange = MIN(cg.BegOfRange),
EndOfRange = MAX(cg.EndOfRange),
cg.GorB
FROM
cte_CompactGroup cg
GROUP BY
cg.DenseGroup,
cg.GorB
ORDER BY
BegOfRange;

setting a flag for score change in SQL

I have a table with exam scores for different weeks. I want to create an extra column that flags the score change: 1 if the score decreased by 0-5, 2 if by 5-9, 3 if by 10 or more, and 4 if the score increased. Here is the sample data I have in the table.
--DROP TABLE #Scores
CREATE TABLE #Scores (
NAME varchar(10),
Grade varchar(10),
Subject varchar(25),
Exam_Date datetime,
Score int
)
INSERT INTO #Scores
VALUES ('Sam', 'XI', 'Maths', '2016-08-01 15:47:29.533', 38),
('Sam', 'XI', 'Maths', '2016-07-25 15:47:29.533', 50),
('Mike', 'XI', 'Maths', '2016-08-01 15:47:29.533', 50),
('Mike', 'XI', 'Maths', '2016-07-25 15:47:29.533', 45)
SELECT * FROM #Scores
Thanks in advance
You would use lag() and case:
select s.*,
(case when score - prev_score < 0 then 4
when score - prev_score <= 5 then 1
when score - prev_score <= 9 then 2
else 3
end) as score_diff
from (select s.*,
lag(score) over (partition by name, subject order by exam_date) as prev_score
from #scores s
) s;
Thanks to @Gordon Linoff, I changed the code a little bit. The logic is right; only the math needed a small change.
select s.*,
(case when score - prev_score > 0 then 4
when score - prev_score between -5 and 0 then 1
when score - prev_score between -9 and -5 then 2
else 3
end) as score_diff
from (select s.*,
lag(score) over (partition by name, subject order by exam_date) as prev_score
from #scores s
) s;
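One thing to watch with the query above: for each student's first exam, lag() returns NULL, every comparison with NULL is unknown, and the row falls through to the else branch and gets 3. If that is not wanted, a sketch like the following leaves the flag NULL when there is no previous score. It runs on an inline copy of the sample rows; the Scores and WithPrev CTE names, and the choice of NULL for the first exam, are assumptions for illustration. The boundaries follow the corrected query (a drop of exactly 5 counts as 1).

```sql
WITH Scores AS (
    -- Inline copy of the question's sample rows
    SELECT v.Name, v.Subject, CAST(v.Exam_Date AS datetime) AS Exam_Date, v.Score
    FROM (VALUES
        ('Sam',  'Maths', '2016-08-01T15:47:29', 38),
        ('Sam',  'Maths', '2016-07-25T15:47:29', 50),
        ('Mike', 'Maths', '2016-08-01T15:47:29', 50),
        ('Mike', 'Maths', '2016-07-25T15:47:29', 45)
    ) v (Name, Subject, Exam_Date, Score)
),
WithPrev AS (
    -- previous score for the same student and subject, by exam date
    SELECT *,
           LAG(Score) OVER (PARTITION BY Name, Subject ORDER BY Exam_Date) AS prev_score
    FROM Scores
)
SELECT Name, Subject, Exam_Date, Score, prev_score,
       CASE WHEN prev_score IS NULL      THEN NULL  -- first exam: nothing to compare with
            WHEN Score > prev_score      THEN 4     -- score increased
            WHEN prev_score - Score <= 5 THEN 1     -- dropped by 0 to 5
            WHEN prev_score - Score <= 9 THEN 2     -- dropped by 6 to 9
            ELSE 3                                  -- dropped by 10 or more
       END AS score_diff
FROM WithPrev
ORDER BY Name, Exam_Date;
```

With the sample data, Sam's second exam (50 down to 38) should get 3 and Mike's second exam (45 up to 50) should get 4, while both first exams stay NULL.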
Consider a further step of normalization. Keep the scores in a separate table. Relate the student to the scores table.
You have to decide how you are going to reference the previous score in order to compare it to the current one. Either store the change from the last score in an additional field and have a calculated field show the current score, or store the previous score in a field alongside the new score and have a calculated field show the change between the two.

Count previous consecutive rows in SQL Server

I have a list of attendance data. I am trying to query it for a specific date range (01/05/2016 – 07/05/2016) with a Total Present column, which is calculated from the preceding present (P) rows. Suppose today is 04/05/2016: if a person has status ‘P’ on 01, 02, 03 and 04, then the row for 04-05-2016 should show a total present of 4.
Could you help me find the total present from this result set?
You can check this example, which has the logic to calculate a running sum.
declare @t table (employeeid int, datecol date, status varchar(2) )
insert into @t values (10001, '01-05-2016', 'P'),
(10001, '02-05-2016', 'P'),
(10001, '03-05-2016', 'P'),
(10001, '04-05-2016', 'P'),
(10001, '05-05-2016', 'A'),
(10001, '06-05-2016', 'P'),
(10001, '07-05-2016', 'P'),
(10001, '08-05-2016', 'L'),
(10002, '07-05-2016', 'P'),
(10002, '08-05-2016', 'L')
--select * from @t
select * ,
SUM(case when status = 'P' then 1 else 0 end) OVER (PARTITION BY employeeid ORDER BY employeeid, datecol
ROWS BETWEEN UNBOUNDED PRECEDING
AND current row) as TotalPresent
from
@t
Another twist on the same thing via a CTE (since you mentioned SQL Server 2012: the solution below only works in SQL Server 2012 and above)
;with cte as
(
select employeeid , datecol , ROW_NUMBER() over(partition by employeeid order by employeeid, datecol) rowno
from
@t where status = 'P'
)
select t.*, cte.rowno ,
case when ( isnull(cte.rowno, 0) = 0)
then LAG(cte.rowno) OVER (ORDER BY t.employeeid, t.datecol)
else cte.rowno
end LagValue
from @t t left join cte on t.employeeid = cte.employeeid and t.datecol = cte.datecol
order by t.employeeid, t.datecol
You could use a subquery to calculate TotalPresent for each row:
SELECT
main.EmployeeID,
main.[Date],
main.[Status],
(
SELECT SUM(CASE WHEN t.[Status] = 'P' THEN 1 ELSE 0 END)
FROM [TableName] t
WHERE t.EmployeeID = main.EmployeeID AND t.[Date] <= main.[Date]
) as TotalPresent
FROM [TableName] main
ORDER BY
main.EmployeeID,
main.[Date]
Here I used a subquery to sum over the records that have the same EmployeeID and a date less than or equal to the date of the current row. If the status of a record is 'P', 1 is added to the sum, otherwise 0, so only records with status P are counted.
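The queries so far keep a running total that never resets. If the intent is, as the title suggests, a count of consecutive 'P' rows that starts over after any other status, one possible sketch is below. It assumes one row per employee per day and uses an inline copy of the sample rows with the dates written in ISO format; the Attendance and Runs CTE names and the BreaksSoFar and ConsecutivePresent columns are made up for this example.

```sql
WITH Attendance AS (
    -- Inline copy of the sample rows (dates assumed to be in May 2016)
    SELECT v.employeeid, CAST(v.datecol AS date) AS datecol, v.status
    FROM (VALUES
        (10001, '2016-05-01', 'P'),
        (10001, '2016-05-02', 'P'),
        (10001, '2016-05-03', 'P'),
        (10001, '2016-05-04', 'P'),
        (10001, '2016-05-05', 'A'),
        (10001, '2016-05-06', 'P'),
        (10001, '2016-05-07', 'P'),
        (10001, '2016-05-08', 'L'),
        (10002, '2016-05-07', 'P'),
        (10002, '2016-05-08', 'L')
    ) v (employeeid, datecol, status)
),
Runs AS (
    -- Every non-P row bumps the counter, so each streak of P rows
    -- shares one BreaksSoFar value
    SELECT *,
           SUM(CASE WHEN status <> 'P' THEN 1 ELSE 0 END)
               OVER (PARTITION BY employeeid ORDER BY datecol
                     ROWS UNBOUNDED PRECEDING) AS BreaksSoFar
    FROM Attendance
)
SELECT employeeid, datecol, status,
       -- Count P rows within the current streak only
       SUM(CASE WHEN status = 'P' THEN 1 ELSE 0 END)
           OVER (PARTITION BY employeeid, BreaksSoFar ORDER BY datecol
                 ROWS UNBOUNDED PRECEDING) AS ConsecutivePresent
FROM Runs
ORDER BY employeeid, datecol;
```

For employee 10001 this should count 1, 2, 3, 4 for the first four days, 0 on the 'A' day, 1 and 2 for the next two 'P' days, and 0 on the 'L' day. Rows with a status other than 'P' show 0 here; whether they should instead repeat the previous streak length depends on what the report needs.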
Interesting question, this should work:
select *
, (select count(retail) from p g
where g.date <= p.date and g.id = p.id and retail = 'P')
from p
order by ID, Date;
So I believe I understand correctly. You would like to count the occurrences of P per ID, date-wise.
This makes a lot of sense. That is why the first occurrence of ID2 was L and the Total is 0. This query will count P status for each occurrence, pause at non-P for each ID.