I have a data.table with many rows and I would like to conditionally group two columns, namely Begin and End. These columns each hold a month in which the associated person was doing something. Here is some sample data (you can read it in with R, or find the pure tables below if you don't use R):
# base:
test <- read.table(
text = "
1 A mnb USA prim 4 12
2 A mnb USA x 13 15
3 A mnb USA un 16 25
4 A mnb USA fdfds 1 2
5 B ghf CAN sdg 3 27
6 B ghf CAN hgh 28 29
7 B ghf CAN y 24 31
8 B ghf CAN ghf 38 42
",header=F)
library(data.table)
setDT(test)
names(test) <- c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1 A mnb USA fdfds 1 2
2 A mnb USA - 4 25
3 B ghf CAN - 3 31
4 B ghf CAN ghf 38 42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")
The grouping should be done as follows: If person A did hiking from month 4 to month 15 and travelling from month 16 to month 24, I would group the consecutive (i.e. without break) activity from month 4 to month 24. If afterwards person A did surfing from month 25 to month 28, I would also add this, and the whole group activity would last from 4 to 28.
Now problematic are cases where there are overlapping periods; for example, person A might also do fishing from 11 to 31, so the whole thing would become 4 to 31. However, if person A did something from 1 to 2, that would be a separate activity (as compared to 1 to 3, which would also have to be added, because 3 is connected to 4). I hope that was clear; if not, you can find more examples in the code above.
I am using data.table because my dataset is quite large. I started with sqldf, but it becomes problematic when there are many activities per person (let's say 8 or more).
Can this be done in data.table, plyr, or sqldf?
Please note: I am also looking for an answer in SQL, because I could use that directly in sqldf or try to convert it to another language. sqldf supports (1) the SQLite backend database (by default), (2) the H2 Java database, (3) the PostgreSQL database, and (4), from sqldf 0.4-0 onwards, MySQL.
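To make the merging rule concrete: this is essentially the classic gaps-and-islands problem, so a rough sketch of what I am after, written for the PostgreSQL backend (which supports the window functions used here) and with column names suffixed by an underscore to dodge reserved words such as End, could look like the following. It merges rows whose ranges overlap or touch (a gap of at most one month) and keeps the original add info only for rows that end up alone in their group; the table and column names are illustrative, not what sqldf would produce automatically:
WITH w AS (
  SELECT t.*,
         MAX(end_) OVER (PARTITION BY person_
                         ORDER BY begin_, end_
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max_end
  FROM test t
),
islands AS (
  SELECT w.*,
         SUM(CASE WHEN prev_max_end IS NULL OR begin_ > prev_max_end + 1
                  THEN 1 ELSE 0 END)
             OVER (PARTITION BY person_ ORDER BY begin_, end_) AS island_id
  FROM w
)
SELECT person_, name_, country_,
       CASE WHEN COUNT(*) = 1 THEN MIN(info_) ELSE '-' END AS info_,
       MIN(begin_) AS begin_,
       MAX(end_)   AS end_
FROM islands
GROUP BY person_, name_, country_, island_id
ORDER BY person_, MIN(begin_);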
Edit: Here are the 'pure' tables:
In:
Person Name Country add info Begin End
A mnb USA prim 4 12
A mnb USA x 13 15
A mnb USA un 16 25
A mnb USA fdfds 1 2
B ghf CAN sdg 3 27
B ghf CAN hgh 28 29
B ghf CAN y 24 31
B ghf CAN ghf 38 42
Out:
A mnb USA fdfds 1 2
A mnb USA - 4 25
B ghf CAN - 3 31
B ghf CAN ghf 38 42
I did this one, which worked in my tests, and almost all of the main databases out there should be able to run it. I suffixed my column names with an underscore, so please change the names before testing:
SELECT
r1.person_,
r1.name_,
r1.country_,
CASE
WHEN max(r2.begin_) = max(r1.begin_)
THEN max(r1.info_) ELSE '-'
END info_,
MAX(r2.begin_) begin_,
r1.end_
FROM stack_39626781 r1
INNER JOIN stack_39626781 r2 ON 1=1
AND r2.person_ = r1.person_
AND r2.begin_ <= r1.begin_ -- just optimizing...
LEFT JOIN stack_39626781 r3 ON 1=1
AND r3.person_ = r1.person_
-- matches when another range overlaps this range end
AND r3.end_ >= r1.end_ + 1
AND r3.begin_ <= r1.end_ + 1
LEFT JOIN stack_39626781 r4 ON 1=1
AND r4.person_ = r2.person_
-- matches when another range overlaps this range begin
AND r4.end_ >= r2.begin_ - 1
AND r4.begin_ <= r2.begin_ - 1
WHERE 1=1
-- get rows
-- with no overlaps on end range and
-- with no overlaps on begin range
AND r3.person_ IS NULL
AND r4.person_ IS NULL
GROUP BY
r1.person_,
r1.name_,
r1.country_,
r1.end_
This query is based on the fact that no range in the output has any connections/overlaps. Let's say that, for an output of five ranges, five begins and five ends exist with no connections/overlaps. Finding and associating them should be easier than generating all connections/overlaps. So, what this query does is:
Find all ranges per person with no overlaps/connections on their end value;
Find all ranges per person with no overlaps/connections on their begin value;
These are the valid ranges, so associate them all to find the correct pairs;
For each person and end, the correct begin pair is the maximum one available whose value is equal to or less than that end. It's easy to validate this rule: first, you can't have a begin greater than an end; also, if you have two or more possible begins less than the end, e.g. begin1 = end - 2 and begin2 = end - 5, selecting the lesser one (begin2) makes the greater one (begin1) an overlap of this range.
Hope it helps.
If you are working with SQL Server 2012 or above, you can use the LAG and LEAD functions to build up your logic to arrive at your final desired dataset. Those functions have also been available in Oracle since Oracle 8i, I believe.
Below is a solution that I created in SQL Server 2012, which should help you. The example values you provided are loaded into a temporary table to represent what you referred to as your first "pure table." Using those two functions, along with the OVER clause, I arrived at your final dataset with the T-SQL code below. I left some of the commented-out lines in the code to show how I built up the overall solution piece by piece, which accounts for the various scenarios placed into the CASE statement for the GapMarker column that acts as a grouping flag.
IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
DROP TABLE #MyTable
CREATE TABLE #MyTable (
Person CHAR(1)
,[Name] VARCHAR(3)
,Country VARCHAR(10)
,add_info VARCHAR(10)
,[Begin] INT
,[End] INT
)
INSERT INTO #MyTable (Person, Name, Country, add_info, [Begin], [End])
VALUES ('A', 'mnb', 'USA', 'prim', 4, 12),
('A', 'mnb', 'USA', 'x', 13, 15),
('A', 'mnb', 'USA', 'un', 16, 25),
('A', 'mnb', 'USA', 'fdfds', 1, 2),
('B', 'ghf', 'CAN', 'sdg', 3, 27),
('B', 'ghf', 'CAN', 'hgh', 28, 29),
('B', 'ghf', 'CAN', 'y', 24, 31),
('B', 'ghf', 'CAN', 'ghf', 38, 42);
WITH CTE
AS
(SELECT
mt.Person
,mt.Name
,mt.Country
,mt.add_info
,mt.[Begin]
,mt.[End]
--,LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
--,CASE WHEN [End] + 1 = LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- --AND LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]) = LEAD([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])
-- THEN 1
-- ELSE 0
-- END AS Grp
--,MARKER = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End]))
,CASE
WHEN mt.[End] + 1 = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) OR
1 + COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [End]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [End])) = mt.[Begin] OR
COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])) BETWEEN mt.[Begin] AND mt.[End] OR
[End] BETWEEN LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) AND LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]) THEN 1
ELSE 0
END AS GapMarker
,InBetween = COALESCE(LEAD([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]), LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin]))
,EndInBtw = LAG([Begin], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
,LagEndInBtw = LAG([End], 1) OVER (PARTITION BY mt.Person ORDER BY [Begin])
FROM #MyTable mt
--ORDER BY mt.Person, mt.[Begin]
)
SELECT DISTINCT
X.Person
,X.[Name]
,X.Country
,t.add_info
,X.MinBegin
,X.MaxEnd
FROM (SELECT
c.Person
,c.[Name]
,c.Country
,c.add_info
,c.[Begin]
,c.[End]
,c.GapMarker
,c.InBetween
,c.EndInBtw
,c.LagEndInBtw
,MIN(c.[Begin]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MinBegin
,MAX(c.[End]) OVER (PARTITION BY c.Person, c.GapMarker ORDER BY c.Person) AS MaxEnd
--, CASE WHEN c.[End]+1 = c.MARKER
-- OR c.MARKER +1 = c.[Begin]
-- THEN 1
-- ELSE 0
-- END Grp
FROM CTE AS c) X
LEFT JOIN #MyTable AS t
ON t.[Begin] = X.[MinBegin]
AND t.[End] = X.[MaxEnd]
AND t.Person = X.Person
ORDER BY X.Person, X.MinBegin
--ORDER BY Person, [Begin]
And here's a screen shot of the results that match your desired final dataset:
Related
Greetings. I need to select a specific date, and if the specific date does not match, I need to select the max. I'm no expert in SQL; this is what I have achieved so far, but it returns all records:
select
scs.subscription_id,
case
when scs.end_date = max(scs.end_date) then max(scs.end_date)
when scs.end_date = '1900-01-01 00:00:00.000Z' then '1900-01-01 00:00:00.000Z'
end as end_date
from
sim_cards sc
inner join sim_card_subscriptions scs on
sc.id = scs.sim_card_id
where
scs.subscription_id = 1
group by
scs.end_date,
scs.subscription_id
Assuming you want the "match a date X, otherwise use MAX" policy applied to each row individually, and that the MAX has to be calculated only across related subscriptions (those sharing the same sim_card_id), you can use the max(end_date::date) over(partition by sim_card_id) window function in your CASE statement to fall back to MAX when the specific date is not matched.
Consider the following sample dataset
with data (id, sim_card_id, end_date) as (
values
(1, 1, '2021-08-05 20:21:00'),
(2, 1, '2021-10-10 12:12:10'),
(3, 1, '2021-12-11 00:11:14'),
(4, 2, '2021-12-14 09:08:45'),
(5, 2, '2021-12-14 15:42:07'),
(6, 3, '2021-10-09 20:20:33')
)
select
id,
case
when end_date::date = '2021-10-10' then end_date::date
else max(end_date::date) over(partition by sim_card_id)
end as end_date
from data
which yields the following output:
1 2021-12-11 -- max across sim_card ID=1
2 2021-10-10 -- matches desired date
3 2021-12-11 -- max across sim_card ID=1
4 2021-12-14 -- max across sim_card ID=2
5 2021-12-14 -- max across sim_card ID=2
6 2021-10-09 -- max across sim_card ID=3
I'm not sure if this has been asked before because I'm having trouble even asking it myself. I think the best way to explain my dilemma is to use an example.
Say I've rated my happiness on a scale of 1-10 every day for 10 years and I have the results in a big table where I have a single date correspond to a single integer value of my happiness rating. I say, though, that I only care about my happiness over 60 day periods on average (this may seem weird but this is a simplified example). So I wrap up this information to a table where I now have a start date field, an end date field, and an average rating field where the start days are every day from the first day to the last over all 10 years, but the end dates are exactly 60 days later. To be clear, these 60 day periods are overlapping (one would share 59 days with the next one, 58 with the next, and so on).
Next I pick a threshold rating, say 5, where I want to categorize everything below it into a "bad" category and everything above into a "good" category. I could easily add another field and use a case structure to give every 60-day range a "good" or "bad" flag.
Then to sum it up, I want to display the total periods of "good" and "bad" from maximum beginning to maximum end date. This is where I'm stuck. I could group by the good/bad category and then just take min(start date) and max(end date), but then if, say, the ranges go from good to bad to good then to bad again, output would show overlapping ranges of good and bad. In the aforementioned situation, I would want to show four different ranges.
I realize this may seem clearer to me that it would to someone else so if you need clarification just ask.
Thank you
---EDIT---
Here's an example of what the before would look like:
StartDate| EndDate| MoodRating
------------+------------+------------
1/1/1991 |3/1/1991 | 7
1/2/1991 |3/2/1991 | 7
1/3/1991 |3/3/1991 | 4
1/4/1991 |3/4/1991 | 4
1/5/1991 |3/5/1991 | 7
1/6/1991 |3/6/1991 | 7
1/7/1991 |3/7/1991 | 4
1/8/1991 |3/8/1991 | 4
1/9/1991 |3/9/1991 | 4
And the after:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/2/1991 |good
1/3/1991|3/4/1991 |bad
1/5/1991|3/6/1991 |good
1/7/1991|3/9/1991 |bad
Currently my query with the group by rating would show:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/6/1991 |good
1/3/1991|3/9/1991 |bad
This is something along the lines of
select min(StartDate), max(EndDate), Good_Bad
from sourcetable
group by Good_Bad
While Jason A Long's answer may be correct, I can't read it or figure it out, so I figured I would post my own answer. Assuming this isn't a process you're going to be running constantly, the CURSOR's performance hit shouldn't matter. But (at least to me) this solution is very readable and can easily be modified.
In a nutshell - we insert the first record from your source table into our results table. Next, we grab the next record and see if the mood score is the same as the previous record. If it is, we simply update the previous record's end date with the current record's end date (extending the range). If not, we insert a new record. Rinse, repeat. Simple.
Here is your setup and some sample data:
DECLARE @MoodRanges TABLE (StartDate DATE, EndDate DATE, MoodRating int)
INSERT INTO @MoodRanges
VALUES
('1/1/1991','3/1/1991', 7),
('1/2/1991','3/2/1991', 7),
('1/3/1991','3/3/1991', 4),
('1/4/1991','3/4/1991', 4),
('1/5/1991','3/5/1991', 7),
('1/6/1991','3/6/1991', 7),
('1/7/1991','3/7/1991', 4),
('1/8/1991','3/8/1991', 4),
('1/9/1991','3/9/1991', 4)
Next, we can create a table to store our results, as well as some variable placeholders for our cursor:
DECLARE @MoodResults TABLE(ID INT IDENTITY(1, 1), StartDate DATE, EndDate DATE, MoodScore varchar(50))
DECLARE @CurrentStartDate DATE, @CurrentEndDate DATE, @CurrentMoodScore INT,
@PreviousStartDate DATE, @PreviousEndDate DATE, @PreviousMoodScore INT
Now we put all of the sample data into our CURSOR:
DECLARE MoodCursor CURSOR FOR
SELECT StartDate, EndDate, MoodRating
FROM @MoodRanges
OPEN MoodCursor
FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore
WHILE @@FETCH_STATUS = 0
BEGIN
IF @PreviousStartDate IS NOT NULL
BEGIN
IF (@PreviousMoodScore >= 5 AND @CurrentMoodScore >= 5)
OR (@PreviousMoodScore < 5 AND @CurrentMoodScore < 5)
BEGIN
UPDATE @MoodResults
SET EndDate = @CurrentEndDate
WHERE ID = (SELECT MAX(ID) FROM @MoodResults)
END
ELSE
BEGIN
INSERT INTO
@MoodResults
VALUES
(@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
END
END
ELSE
BEGIN
INSERT INTO
@MoodResults
VALUES
(@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
END
SET @PreviousStartDate = @CurrentStartDate
SET @PreviousEndDate = @CurrentEndDate
SET @PreviousMoodScore = @CurrentMoodScore
FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore
END
CLOSE MoodCursor
DEALLOCATE MoodCursor
And here are the results:
SELECT * FROM @MoodResults
ID StartDate EndDate MoodScore
----------- ---------- ---------- --------------------------------------------------
1 1991-01-01 1991-03-02 GOOD
2 1991-01-03 1991-03-04 BAD
3 1991-01-05 1991-03-06 GOOD
4 1991-01-07 1991-03-09 BAD
Is this what you're looking for?
IF OBJECT_ID('tempdb..#MyDailyMood', 'U') IS NOT NULL
DROP TABLE #MyDailyMood;
CREATE TABLE #MyDailyMood (
TheDate DATE NOT NULL,
MoodLevel INT NOT NULL
);
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Calendar (dt) AS (
SELECT TOP (DATEDIFF(dd, '2007-01-01', '2017-01-01'))
DATEADD(dd, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, '2007-01-01')
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT #MyDailyMood (TheDate, MoodLevel)
SELECT
c.dt,
ABS(CHECKSUM(NEWID()) % 10) + 1
FROM
cte_Calendar c;
--==========================================================
WITH
cte_AddRN AS (
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS (
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
)
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup;
Post OP update solution...
WITH
cte_AddRN AS ( -- Add a row number to each row that resets to 1 every 60 rows.
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS ( -- Use DENSE_RANK to create groups based on the RN added above.
-- How it works: RN sets the row number 1 - 60 and then repeats itself,
-- but we don't want every 60th row grouped together. We want blocks of 60 consecutive rows grouped together.
-- DENSE_RANK accomplishes this by ranking within all the "1's", "2's"... and so on.
-- Verify with the following query... SELECT * FROM cte_AssignGroups ag ORDER BY ag.TheDate
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
),
cte_AggRange AS ( -- This is just a straightforward aggregation/rollup. It produces results similar to the sample data you posted in your edit.
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
GorB = CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END,
ag.DateGroup
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup
),
cte_CompactGroup AS ( -- This time we're using dense rank to group all of the consecutive "Good" and "Bad" values so that they can be further aggregated below.
SELECT
ar.BegOfRange, ar.EndOfRange, ar.AverageMoodLevel, ar.GorB, ar.DateGroup,
DenseGroup = ar.DateGroup - DENSE_RANK() OVER (PARTITION BY ar.GorB ORDER BY ar.BegOfRange)
FROM
cte_AggRange ar
)
-- The final aggregation step...
SELECT
BegOfRange = MIN(cg.BegOfRange),
EndOfRange = MAX(cg.EndOfRange),
cg.GorB
FROM
cte_CompactGroup cg
GROUP BY
cg.DenseGroup,
cg.GorB
ORDER BY
BegOfRange;
Hello fellow developers and analysts. I have some experience in SQL and have looked through similar posts; however, this is slightly more niche. Thank you in advance for helping.
I have the below dataset (edited; apologies):
Setup
CREATE TABLE CustomerPoints
(
CustomerID INT,
[Date] Date,
Points INT
)
INSERT INTO CustomerPoints
VALUES
(1, '20150101', 500),
(1, '20150201', -400),
(1, '20151101', 300),
(1, '20151201', -400)
and need to turn it into the desired output (edited: the figures in the previous table were incorrect).
Any positive amount of points is points earned, whereas negative amounts are points redeemed. Because of FIFO (the first-in, first-out concept), of the second batch of points spent (-400), 100 were taken from the points earned on 20150101 (UK format) and 300 from the points earned on 20151101.
The goal is to calculate, for each customer, the number of points spent within x and y months of earning. Again, thank you for your help.
I have already answered a similar question here and here
You need to explode the points earned and redeemed into single units and then couple them, so that each point earned is matched with a redeemed point.
For each of these matching rows, calculate the months elapsed from the earning to the redeeming, and then aggregate it all.
FN_NUMBERS(n) is a tally-table function; see the other answers I have linked above.
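In case you do not want to chase the links, a minimal sketch of such a tally function might look like the one below (this is my own stand-in, not the exact definition from the linked answers). Being a user-defined function, it has to be called with a two-part name, e.g. dbo.FN_NUMBERS(1000):
-- Tally-table function sketch: returns the integers 1..@max as a column named N.
CREATE FUNCTION dbo.FN_NUMBERS (@max INT)
RETURNS TABLE
AS
RETURN
    SELECT TOP (@max)
           N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b; -- cross join just to guarantee enough rows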
;with
p as (select * from CustomerPoints),
e as (select * from p where points>0),
r as (select * from p where points<0),
ex as (
select *, ROW_NUMBER() over (partition by CustomerID order by [date] ) rn
from e
join dbo.FN_NUMBERS(1000) on N <= e.points
),
rx as (
select *, ROW_NUMBER() over (partition by CustomerID order by [date] ) rn
from r
join dbo.FN_NUMBERS(1000) on N <= -r.points
),
j as (
select ex.CustomerID, DATEDIFF(month,ex.date, rx.date) mm
from ex
join rx on ex.CustomerID = rx.CustomerID and ex.rn = rx.rn and rx.date>ex.date
)
-- use this select to see points redeemed in current and past semester
select * from j join (select 0 s union all select 1 s) p on j.mm >= (p.s*6)+(p.s) and j.mm < p.s*6+6 pivot (count(mm) for s in ([0],[1])) pv order by 1, 2
-- use this select to see points redeemed with months detail
--select * from j pivot (count(mm) for mm in ([0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) p order by 1
-- use this select to see points redeemed in rows per month
--select CustomerID, mm, COUNT(mm) PointsRedeemed from j group by CustomerID, mm order by 1
output of default query, 0 is 0-6 months, 1 is 7-12 (age of redemption in months)
CustomerID 0 1
1 700 100
output of 2nd query, 0..12 is the age of redemption in months
CustomerID 0 1 2 3 4 5 6 7 8 9 10 11 12
1 0 700 0 0 0 0 0 0 0 0 0 100 0
output from the 3rd query; mm is the age of redemption in months
CustomerID mm PointsRedeemed
1 1 700
1 11 100
bye
I want to count the number of 2 or more consecutive week periods that have negative values within a range of weeks.
Example:
Week | Value
201301 | 10
201302 | -5 <--| both weeks have negative values and are consecutive
201303 | -6 <--|
Week | Value
201301 | 10
201302 | -5
201303 | 7
201304 | -2 <-- negative but not consecutive to the last negative value in 201302
Week | Value
201301 | 10
201302 | -5
201303 | -7
201304 | -2 <-- 1st group of negative and consecutive values
201305 | 0
201306 | -12
201307 | -8 <-- 2nd group of negative and consecutive values
Is there a better way of doing this other than using a cursor and a reset variable and checking through each row in order?
Here is some of the SQL I have setup to try and test this:
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestOne') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestOne
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestTwo') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestTwo
CREATE TABLE #ConsecutiveNegativeWeekTestOne
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- I have a condition where I expect to see at least 2 consecutive weeks with negative values
-- TRUE : Week 201328 & 201329 are both negative.
INSERT INTO #ConsecutiveNegativeWeekTestOne
VALUES
(201327, 5)
,(201328,-11)
,(201329,-18)
,(201330, 25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, 59)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestOne
WHERE Value < 0
ORDER BY [Week] ASC
CREATE TABLE #ConsecutiveNegativeWeekTestTwo
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- FALSE: The negative weeks are not consecutive
INSERT INTO #ConsecutiveNegativeWeekTestTwo
VALUES
(201327, 5)
,(201328,-11)
,(201329,20)
,(201330, -25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, -15)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestTwo
WHERE Value < 0
ORDER BY [Week] ASC
My SQL fiddle is also here:
http://sqlfiddle.com/#!3/ef54f/2
First, would you please share the formula for calculating week number, or provide a real date for each week, or some method to determine if there are 52 or 53 weeks in any particular year? Once you do that, I can make my queries properly skip missing data AND cross year boundaries.
Now to queries: this can be done without a JOIN, which, depending on the exact indexes present, may improve performance a huge amount over any solution that does use JOINs. Then again, it may not. It is also harder to understand, so it may not be worth it if other solutions perform well enough (especially when the right indexes are present).
Simulate a PREORDER BY windowing function (respects gaps, ignores year boundaries):
WITH Calcs AS (
SELECT
Grp =
[Week] -- comment out to ignore gaps and gain year boundaries
-- Row_Number() OVER (ORDER BY [Week]) -- swap with previous line
- Row_Number() OVER
(PARTITION BY (SELECT 1 WHERE Value < 0) ORDER BY [Week]),
*
FROM dbo.ConsecutiveNegativeWeekTestOne
)
SELECT
[Week] = Min([Week])
-- NumWeeks = Count(*) -- if you want the count
FROM Calcs C
WHERE Value < 0
GROUP BY C.Grp
HAVING Count(*) >= 2
;
See a Live Demo at SQL Fiddle (1st query)
And another way, simulating LAG and LEAD with a CROSS JOIN and aggregates (respects gaps, ignores year boundaries):
WITH Groups AS (
SELECT
Grp = T.[Week] + X.Num,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (-1), (0), (1)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 0) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
See a Live Demo at SQL Fiddle (2nd query)
And, my original second query, but simplified (ignores gaps, handles year boundaries):
WITH Groups AS (
SELECT
Grp = (Row_Number() OVER (ORDER BY T.[Week]) + X.Num) / 3,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (0), (2), (4)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 2) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
Note: The execution plan for these may be rated as more expensive than other queries, but there will be only 1 table access instead of 2 or 3, and while the CPU may be higher it is still respectably low.
Note: I originally was not paying attention to only producing one row per group of negative values, and so I produced this query as only requiring 2 table accesses (respects gaps, ignores year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T2.[Week] IN (T1.[Week] - 1, T1.[Week] + 1)
)
;
See a Live Demo at SQL Fiddle (3rd query)
However, I have now modified it to perform as required, showing only each starting date (respects gaps, ignores year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM
dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T1.[Week] - 1 <= T2.[Week]
AND T1.[Week] + 1 >= T2.[Week]
AND T1.[Week] <> T2.[Week]
HAVING
Min(T2.[Week]) > T1.[Week]
)
;
See a Live Demo at SQL Fiddle (3rd query)
Last, just for fun, here is a SQL Server 2012 and up version using LEAD and LAG:
WITH Weeks AS (
SELECT
PrevValue = Lag(Value, 1, 0) OVER (ORDER BY [Week]),
SubsValue = Lead(Value, 1, 0) OVER (ORDER BY [Week]),
PrevWeek = Lag(Week, 1, 0) OVER (ORDER BY [Week]),
SubsWeek = Lead(Week, 1, 0) OVER (ORDER BY [Week]),
*
FROM
dbo.ConsecutiveNegativeWeekTestOne
)
SELECT @Week = [Week]
FROM Weeks W
WHERE
(
[Week] - 1 > PrevWeek
OR PrevValue >= 0
)
AND Value < 0
AND SubsValue < 0
AND [Week] + 1 = SubsWeek
;
See a Live Demo at SQL Fiddle (4th query)
I am not sure I am doing this the best way as I haven't used these much, but it works nonetheless.
You should do some performance testing of the various queries presented to you, and pick the best one, considering that code should be, in order:
Correct
Clear
Concise
Fast
Seeing that some of my solutions are anything but clear, other solutions that are fast enough and concise enough will probably win out in the competition of which one to use in your own production code. But... maybe not! And maybe someone will appreciate seeing these techniques, even if they can't be used as-is this time.
So let's do some testing and see what the truth is about all this! Here is some test setup script. It will generate the same data on your own server as it did on mine:
IF Object_ID('dbo.ConsecutiveNegativeWeekTestOne', 'U') IS NOT NULL DROP TABLE dbo.ConsecutiveNegativeWeekTestOne;
GO
CREATE TABLE dbo.ConsecutiveNegativeWeekTestOne (
[Week] int NOT NULL CONSTRAINT PK_ConsecutiveNegativeWeekTestOne PRIMARY KEY CLUSTERED,
[Value] decimal(18,6) NOT NULL
);
SET NOCOUNT ON;
DECLARE
@f float = Rand(5.1415926535897932384626433832795028842),
@Dt datetime = '17530101',
@Week int;
WHILE @Dt <= '20140106' BEGIN
INSERT dbo.ConsecutiveNegativeWeekTestOne
SELECT
Format(@Dt, 'yyyy') + Right('0' + Convert(varchar(11), DateDiff(day, DateAdd(year, DateDiff(year, 0, @Dt), 0), @Dt) / 7 + 1), 2),
Rand() * 151 - 76
;
SET @Dt = DateAdd(day, 7, @Dt);
END;
This generates 13,620 weeks, from 175301 through 201401. I modified all the queries to select the Week values instead of the count, in the format SELECT @Week = Expression ... so that tests are not affected by returning rows to the client.
I tested only the gap-respecting, non-year-boundary-handling versions.
Results
Query Duration CPU Reads
------------------ -------- ----- ------
ErikE-Preorder 27 31 40
ErikE-CROSS 29 31 40
ErikE-Join-IN -------Awful---------
ErikE-Join-Revised 46 47 15069
ErikE-Lead-Lag 104 109 40
jods 12 16 120
Transact Charlie 12 16 120
Conclusions
The reduced reads of the non-JOIN versions are not significant enough to warrant their increased complexity.
The table is so small that the performance almost doesn't matter. 261 years of weeks is insignificant, so a normal business operation won't see any performance problem even with a poor query.
I tested with an index on Week (which is more than reasonable); doing two separate JOINs with a seek was far, far superior to any device that tries to get the relevant related data in one swoop. Charlie and jods were spot on in their comments.
This data is not large enough to expose real differences between the queries in CPU and duration. The values above are representative, though at times the 31 ms were 16 ms and the 16 ms were 0 ms. Since the resolution is ~15 ms, this doesn't tell us much.
My tricky query techniques do perform better. They might be worth it in performance critical situations. But this is not one of those.
Lead and Lag may not always win. The presence of an index on the lookup value is probably what determines this. The ability to still pull prior/next values based on a certain order even when the order by value is not sequential may be one good use case for these functions.
You could use a combination of EXISTS checks.
Assuming you only want to know the groups (series of consecutive weeks that are all negative):
--Find the potential start weeks
;WITH starts as (
SELECT [Week]
FROM #ConsecutiveNegativeWeekTestOne AS s
WHERE s.[Value] < 0
AND NOT EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS p
WHERE p.[Week] = s.[Week] - 1
AND p.[Value] < 0
)
)
SELECT COUNT(*)
FROM
Starts AS s
WHERE EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS n
WHERE n.[Week] = s.[Week] + 1
AND n.[Value] < 0
)
If you have an index on Week this query should even be moderately efficient.
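For reference, such an index on the temp table from the question could be created like this (the question's setup script does not create one; the index name is just illustrative):
CREATE CLUSTERED INDEX IX_Week ON #ConsecutiveNegativeWeekTestOne ([Week]);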
You can replace LEAD and LAG with a self-join.
The counting idea is basically to count start of negative sequences rather than trying to consider each row.
SELECT COUNT(*)
FROM ConsecutiveNegativeWeekTestOne W
LEFT OUTER JOIN ConsecutiveNegativeWeekTestOne Prev
ON W.week = Prev.week + 1
INNER JOIN ConsecutiveNegativeWeekTestOne Next
ON W.week = Next.week - 1
WHERE W.value < 0
AND (Prev.value IS NULL OR Prev.value > 0)
AND Next.value < 0
Note that I simply did "week + 1", which would not work when there is a year change.
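One possible workaround, just as a sketch: derive a sequential position with ROW_NUMBER() and join on that position instead of on week arithmetic, so the first week of a year is still treated as consecutive to the last week of the previous year. This assumes every week is present in the table; a missing week would then also be treated as consecutive.
WITH Numbered AS (
    SELECT *, pos = ROW_NUMBER() OVER (ORDER BY [Week])
    FROM ConsecutiveNegativeWeekTestOne
)
SELECT COUNT(*)
FROM Numbered W
LEFT OUTER JOIN Numbered Prev
    ON W.pos = Prev.pos + 1
INNER JOIN Numbered Next
    ON W.pos = Next.pos - 1
WHERE W.value < 0
  AND (Prev.value IS NULL OR Prev.value > 0)
  AND Next.value < 0;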
I am relatively new to writing SQL. I have a requirement where I have to display the top 5 records as they are and consolidate the rest into a single record appended as the 6th record. I know TOP 5 selects the top 5 records, but I am finding it difficult to put together the logic to consolidate the rest of the records and append it at the bottom of the result set.
weekof sales year weekno
-------------------------------------------------------------
07/01 - 07/07 2 2012 26
07/08 - 07/14 2 2012 27
07/29 - 08/04 1 2012 30
08/05 - 08/11 1 2012 31
08/12 - 08/18 32 2012 32
08/26 - 09/01 2 2012 34
09/02 - 09/08 8 2012 35
09/09 - 09/15 46 2012 36
09/16 - 09/22 26 2012 37
I want this to be displayed as
weekof sales
----------------------
09/16 - 09/22 26
09/09 - 09/15 46
09/02 - 09/08 8
08/26 - 09/01 2
08/12 - 08/18 32
07/01 - 08/11 6
Except when weekof spans years, this will get the data you want and in the correct order:
;WITH x AS
(
SELECT weekof, sales,
rn = ROW_NUMBER() OVER (ORDER BY [year] DESC, weekno DESC)
FROM dbo.table_name
)
SELECT weekof, sales FROM x WHERE rn <= 5
UNION ALL
SELECT MIN(LEFT(weekof, 5)) + ' - ' + MAX(RIGHT(weekof, 5)), SUM(sales)
FROM x WHERE rn > 5
ORDER BY weekof DESC;
When the rows being returned span a year, you may have to return the rn as well (and just ignore it at the presentation layer):
;WITH x AS
(
SELECT weekof, sales,
rn = ROW_NUMBER() OVER (ORDER BY [year] DESC, weekno DESC)
FROM dbo.table_name
)
SELECT weekof, sales, rn FROM x WHERE rn <= 5
UNION ALL
SELECT MIN(LEFT(weekof, 5)) + ' - ' + MAX(RIGHT(weekof, 5)), SUM(sales), rn = 6
FROM x WHERE rn > 5
ORDER BY rn;
Here's an alternate version that displays in the correct order even when crossing years, and requires a single scan instead of two (important only if the amount of data is large). It will work in SQL 2005 and up:
WITH Sales AS (
SELECT
Seq = Row_Number() OVER (ORDER BY S.yr DESC, S.weekno DESC),
*
FROM dbo.SalesByWeek S
)
SELECT
weekof =
Substring(Min(Convert(varchar(11), S.yr) + Left(S.weekof, 5)), 5, 5)
+ ' - '
+ Substring(Max(Convert(varchar(11), S.yr) + Right(S.weekof, 5)), 5, 5),
sales = Sum(S.sales)
FROM
Sales S
OUTER APPLY (SELECT S.Seq WHERE S.Seq <= 5) G
GROUP BY G.Seq
ORDER BY Coalesce(G.Seq, 2147483647)
;
There's also a query possible that might perform even better for very large data sets, since it doesn't calculate row number (using TOP once instead). But this is all theoretical without real data and performance testing:
SELECT
weekof =
Substring(Min(Convert(varchar(11), S.yr) + Left(S.weekof, 5)), 5, 5)
+ ' - '
+ Substring(Max(Convert(varchar(11), S.yr) + Right(S.weekof, 5)), 5, 5),
sales = Sum(S.sales)
FROM
dbo.SalesByWeek S
LEFT JOIN (
SELECT TOP 5 *
FROM dbo.SalesByWeek
ORDER BY yr DESC, weekno DESC
) G ON S.yr = G.yr AND S.weekno = G.weekno
GROUP BY G.yr, G.weekno
ORDER BY Coalesce(G.yr, 0) DESC, G.weekno DESC
;
Ultimately, I really don't like the string manipulation to pull out the date ranges. It would be better if the table had the MinDate and MaxDate as separate columns, and then creating a label for them would be much simpler.
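Just to illustrate that point, with hypothetical MinDate and MaxDate columns (the names are made up here) the label in the Row_Number version above could be built without any string slicing, roughly like this:
WITH Sales AS (
    SELECT Seq = Row_Number() OVER (ORDER BY S.yr DESC, S.weekno DESC), *
    FROM dbo.SalesByWeek S
)
SELECT
    weekof = Convert(varchar(5), Min(S.MinDate), 1) + ' - '
           + Convert(varchar(5), Max(S.MaxDate), 1), -- style 1 truncated to mm/dd
    sales  = Sum(S.sales)
FROM Sales S
OUTER APPLY (SELECT S.Seq WHERE S.Seq <= 5) G
GROUP BY G.Seq
ORDER BY Coalesce(G.Seq, 2147483647);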