SQL query to identify paired items (challenging) - sql

Assume there is a relational database with one table:
{datetime, tapeID, backupStatus}
2012-07-09 3:00, ID33, Start
2012-07-09 3:05, ID34, Start
2012-07-09 3:10, ID35, Start
2012-07-09 4:05, ID34, End
2012-07-09 4:10, ID33, Start
2012-07-09 5:05, ID33, End
2012-07-09 5:10, ID34, Start
2012-07-09 6:00, ID34, End
2012-07-10 4:00, ID35, Start
2012-07-11 5:00, ID35, End
tapeID = any of 100 different tapes each with their own unique ID.
backupStatus = one of two values, either Start or End.
I want to write a SQL query that returns five fields
{startTime,endTime,tapeID,totalBackupDuration,numberOfRestarts}
2012-07-09 3:00,2012-07-09 5:05, ID33, 0days2hours5min,1
2012-07-09 3:05,2012-07-09 4:05, ID34, 0days1hours0min,0
2012-07-09 3:10,2012-07-10 5:00, ID35, 0days0hours50min,1
2012-07-09 5:10,2012-07-09 6:00, ID34, 0days0hours50min,0
I'm looking to pair the start and end dates to identify when each backupset has truly completed. The caveat here is that the backup of a single backupset may be restarted, so there may be multiple Start times that are not considered complete until the following End event. A single backupset may also be backed up multiple times a day, and each of those runs would need to be identified with its own separate start and end time.
Thank you for your assistance in advance!
B

Here's my version. If you add INSERT #T SELECT '2012-07-11 12:00', 'ID35', 'Start' to the table, you'll see unfinished backups in this query as well. OUTER APPLY is a natural way to solve the problem.
SELECT
    Min(T.dt) StartTime,
    Max(E.dt) EndTime,
    T.tapeID,
    Datediff(Minute, Min(T.dt), Max(E.dt)) TotalBackupDuration,
    Count(*) - 1 NumberOfRestarts
FROM
    #T T
    OUTER APPLY (
        SELECT TOP 1 E.dt
        FROM #T E
        WHERE
            T.tapeID = E.tapeID
            AND E.BackupStatus = 'End'
            AND E.dt > T.dt
        ORDER BY E.dt
    ) E
WHERE
    T.BackupStatus = 'Start'
GROUP BY
    T.tapeID,
    IsNull(E.dt, T.dt)
One thing about CROSS APPLY is that if you're only returning one row and the outer references are all real tables, you have an equivalent in SQL 2000 by moving it into the SELECT list of a derived table:
SELECT
    Min(T.dt) StartTime,
    Max(T.EndTime) EndTime,
    T.tapeID,
    Datediff(Minute, Min(T.dt), Max(T.EndTime)) TotalBackupDuration,
    Count(*) - 1 NumberOfRestarts
FROM (
    SELECT
        T.*,
        (SELECT TOP 1 E.dt
         FROM #T E
         WHERE
             T.tapeID = E.tapeID
             AND E.BackupStatus = 'End'
             AND E.dt > T.dt
         ORDER BY E.dt
        ) EndTime
    FROM #T T
    WHERE T.BackupStatus = 'Start'
) T
GROUP BY
    T.tapeID,
    IsNull(T.EndTime, T.dt)
For outer references that are not all real tables (you want a calculated value from another subquery's outer reference) you have to add nested derived tables to accomplish this.
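For illustration, here is a minimal sketch of that nesting; the CalcValue and NextEnd names are hypothetical and only stand in for whatever calculated outer reference you actually need:
SELECT
    X.*,
    (SELECT TOP 1 E.dt
     FROM #T E
     WHERE E.tapeID = X.tapeID
         AND E.dt > X.CalcValue -- CalcValue only exists because the inner derived table computed it
     ORDER BY E.dt
    ) NextEnd
FROM (
    SELECT T.*, CalcValue = DateAdd(minute, 5, T.dt) -- the calculated outer reference
    FROM #T T
) X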
I finally bit the bullet and did some real testing. I used SPFiredrake's table population script to see the actual performance with a large amount of data. I did it programmatically so there are no typing errors. I took 10 executions each, and threw out the worst and best value for each column, then averaged the remaining 8 column values for that statistic.
The indexes were created after populating the table, with 100% fill factor. The Indexes column shows 1 when just the clustered index is present. It shows 2 when the nonclustered index on BackupStatus is added.
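Roughly, the two indexes were of this shape (the exact key and included columns here are a guess, not the definitions actually used in the tests):
CREATE CLUSTERED INDEX CIX_T ON #T (tapeID, dt) WITH (FILLFACTOR = 100);
CREATE NONCLUSTERED INDEX IX_T_BackupStatus ON #T (BackupStatus) INCLUDE (tapeID, dt) WITH (FILLFACTOR = 100);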
To exclude client network data transfer from the testing, I selected each query into variables like so:
DECLARE
    @StartTime datetime,
    @EndTime datetime,
    @TapeID varchar(5),
    @Duration int,
    @Restarts int;

WITH A AS (
    -- The query here
)
SELECT
    @StartTime = StartTime,
    @EndTime = EndTime,
    @TapeID = TapeID,
    @Duration = TotalBackupDuration,
    @Restarts = NumberOfRestarts
FROM A;
I also trimmed the table column lengths to something more reasonable: tapeID varchar(5), BackupStatus varchar(5). In fact, the BackupStatus should be a bit column, and the tapeID should be an integer. But we'll stick with varchar for the time being.
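For reference, the trimmed test table would look something like this (the NOT NULL constraints are assumed; the comments note the tighter types suggested above):
CREATE TABLE #T (
    dt datetime NOT NULL,
    tapeID varchar(5) NOT NULL,       -- ideally an integer
    BackupStatus varchar(5) NOT NULL  -- ideally a bit (e.g. 0 = Start, 1 = End)
);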
Server Indexes UserName Reads Writes CPU Duration
--------- ------- ------------- ------ ------ ----- --------
x86 VM 1 ErikE 97219 0 599 325
x86 VM 1 Gordon Linoff 606 0 63980 54638
x86 VM 1 SPFiredrake 344927 260 23621 13105
x86 VM 2 ErikE 96388 0 579 324
x86 VM 2 Gordon Linoff 251443 0 22775 11830
x86 VM 2 SPFiredrake 197845 0 11602 5986
x64 Beefy 1 ErikE 96745 0 919 61
x64 Beefy 1 Gordon Linoff 320012 70 62372 13400
x64 Beefy 1 SPFiredrake 362545 288 20154 1686
x64 Beefy 2 ErikE 96545 0 685 164
x64 Beefy 2 Gordon Linoff 343952 72 65092 17391
x64 Beefy 2 SPFiredrake 198288 0 10477 924
Notes:
x86 VM: an almost idle virtual machine, Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)
x64 Beefy: a quite beefy and possibly very busy Microsoft SQL Server 2008 R2 (RTM) - 10.50.1765.0 (X64)
The second index helped all the queries, mine the least.
It is interesting that Gordon's query, with its initially low number of reads on one server, had a high number on the second--but it also had a lower duration there, so it clearly picked a different execution plan, probably because the beefier server had more resources to search the possible plan space faster. And the index raised the number of reads because the plan it enabled lowered the CPU cost by a ton and so costed out cheaper in the optimizer.

What you need to do is to assign the next end date to all starts. Then count the number of starts in-between.
select tstart.datetime as starttime, min(tend.datetime) as endtime, tstart.tapeid
from (select *
from t
where BackupStatus = 'Start'
) tstart join
(select *
from t
where BackupStatus = 'End'
) tend
on tstart.tapeid = tend.tapeid and
tend.datetime >= tstart.datetime
This is close, but we have multiple rows for each end time (depending on the number of starts). To handle this, we need to group by the tapeid and the end time:
select min(a.starttime) as starttime, a.endtime, a.tapeid,
datediff(s, min(a.starttime), endtime), -- NOT CORRECT, DATABASE SPECIFIC
count(*) - 1 as NumRestarts
from (select tstart.dt as starttime, min(tend.dt) as endtime, tstart.tapeid
from (select *
from #t
where BackupStatus = 'Start'
) tstart join
(select *
from #t
where BackupStatus = 'End'
) tend
on tstart.tapeid = tend.tapeid and
tend.dt >= tstart.dt
group by tstart.dt, tstart.tapeid
) a
group by a.endtime, a.tapeid
I've written this version using SQL Server syntax. To create the test table, you can use:
create table #t (
dt datetime,
tapeID varchar(255),
BackupStatus varchar(255)
)
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:00', 'ID33', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:05', 'ID34', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:10', 'ID35', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 4:05', 'ID34', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 4:10', 'ID33', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 5:05', 'ID33', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 5:10', 'ID34', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 6:00', 'ID34', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-10 4:00', 'ID35', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-11 5:00', 'ID35', 'End')

Thought I'd take a stab at it. Tested out Gordon Linoff's solution, and it doesn't quite calculate correctly for tapeID 33 in his own example (matches to the next start, not the corresponding end).
My attempt assumes you're using SQL Server 2005+, as it utilizes CROSS/OUTER APPLY. If you need it for SQL Server 2000 I could probably swing it, but this seemed like the cleanest solution to me (as you're starting with all 'End' elements and matching the first 'Start' elements to get the result). I'll annotate as well so you can understand what I'm doing.
SELECT
startTime, endT.dt endTime, endT.tapeID, DATEDIFF(s, startTime, endT.dt), restarts
FROM
#t endT -- Main source, getting all 'End' records so we can match.
OUTER APPLY ( -- Match possible previous 'End' records for the tapeID
SELECT TOP 1 dt
FROM #t
WHERE dt < endT.dt AND tapeID = endT.tapeID
AND BackupStatus = 'End') g
CROSS APPLY (SELECT ISNULL(g.dt, CAST(0 AS DATETIME)) dt) t
CROSS APPLY (
-- Match 'Start' records between our 'End' record
-- and our possible previous 'End' record.
SELECT MIN(dt) startTime,
COUNT(*) - 1 restarts -- Restarts, so -1 for the first 'Start'
FROM #t
WHERE tapeID = endT.tapeID AND BackupStatus = 'Start'
-- This is where our previous possible 'End' record is considered
AND dt > t.dt AND dt < endt.dt) starts
WHERE
endT.BackupStatus = 'End'
Edit: Test data generation script found at this link.
So decided to run some data against the three methods, and found that ErikE's solution is fastest, mine is a VERY close second, and Gordon's is just inefficient for any sizable set (even when working with 1000 records, it started showing slowness). For smaller sets (at about 5k records), my method wins over Erik's, but not by much. Honestly, I like my method as it doesn't require any additional aggregate functions to get the data, but ErikE's wins in the efficiency/speed battle.
Edit 2: For 55k records in the table (and 12k matching start/end pairs), Erik's takes ~0.307s and mine takes ~0.157s (averaging over 50 attempts). I was a little surprised about this, because I would've assumed that individual runs would've translated to the overall, but I guess the index cache is being better utilized by my query so subsequent hits are less expensive. Looking at the execution plans, ErikE's only has 1 branch off the main path, so he's ultimately working with a larger set for most of the query. I have 3 branches that combine closer to the output, so I'm churning on less data at any given moment and combine right at the end.
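For reference, a simple harness along these lines can produce that kind of average; this is only a sketch of one way to do it, not necessarily how the timings above were collected:
DECLARE @i int = 0, @t0 datetime2, @total_ms bigint = 0;
WHILE @i < 50
BEGIN
    SET @t0 = SYSDATETIME();
    -- run the query under test here, selecting into variables to avoid client I/O
    SET @total_ms = @total_ms + DATEDIFF(millisecond, @t0, SYSDATETIME());
    SET @i = @i + 1;
END;
SELECT AvgMs = @total_ms / 50.0;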

Make it very simple --
Make one subquery for the Start events and another one for the End events, use the RANK function in each set to number the rows per ID, and then LEFT JOIN the two subqueries:
-- QUERY
WITH CTE as
(
    SELECT dt
        , ID
        , status
        , RANK () OVER (PARTITION BY ID ORDER BY DT) as rnk1
        , RANK () OVER (PARTITION BY status ORDER BY DT) as rnk2
    FROM INT_backup
)
SELECT *
FROM CTE
ORDER BY id, rnk2

select * FROM INT_backup order by id, dt
SELECT *
FROM
(
SELECT dt
, ID
, status
, rank () over (PARTITION by ID ORDER BY dt) as rnk
FROM INT_backup
WHERE status = 'start'
) START_SET
LEFT JOIN
(
SELECT dt
, ID
, status
, rank () over (PARTITION by ID ORDER BY dt) as rnk
FROM INT_backup
where status = 'END'
) END_SET
ON Start_Set.ID = End_SET.ID
AND Start_Set.Rnk = End_Set.rnk

Related

How to create start & stop times when multiple start times are given?

I'm trying to quantify active vs. idle times, and the first thing I need to do is to create distinct and discrete start and end times. The issue is that the database is (I'm told this is a bug) creating multiple "start" times for the events. To make it even more complicated, a "report" can have multiple instances of being worked on, and each one should be logged as a discrete duration.
For instance,
WorkflowID ReportID User Action Timestamp
1 1 A Start 1:00
2 1 A Stop 1:03
3 1 B Start 1:05
4 1 B Start 1:06
5 1 B Stop 1:08
6 1 B Start 1:10
7 1 B Start 1:11
8 1 B Stop 1:14
I want to write a SQL query that would output the following:
User StartTime EndTime
A 1:00 1:03
B 1:05 1:08
B 1:10 1:14
The issue I'm running into is that the number of start/stop events can be arbitrary (per ReportID per User). In addition, the superfluous "start" times between the first "start" in a series and the following "stop" need to be removed so they don't throw off the pairing.
Maybe I'm missing something, but this is tricky to me. Any thoughts? Thank you.
To deduplicate, use lag() to compare the previous action for a user and a report with the current one. If they are the same, it's a duplicate, so mark it as such. Then number the starts and stops using row_number(), so that each pair of a start and a stop belonging together share a number (per report and user). Then join on the report, the user and that number.
For convenience you can use CTEs to structure the query and prevent the necessity of duplicating some subqueries.
WITH
[DeduplicatedAndNumbered]
AS
(
SELECT [WorkflowID],
[ReportID],
[User],
[Action],
[Timestamp],
row_number() OVER (PARTITION BY [ReportID],
[User],
[Action]
ORDER BY [Timestamp]) [Number]
FROM (SELECT [WorkflowID],
[ReportID],
[User],
[Action],
[Timestamp],
CASE
WHEN lag([Action]) OVER (PARTITION BY [ReportId],
[User]
ORDER BY [Timestamp]) = [Action] THEN
1
ELSE
0
END [IsDuplicate]
FROM [elbaT]) [x]
WHERE [IsDuplicate] = 0
),
[DeduplicatedAndNumberedStart]
AS
(SELECT [WorkflowID],
[ReportID],
[User],
[Action],
[Timestamp],
[Number]
FROM [DeduplicatedAndNumbered]
WHERE [Action] = 'Start'),
[DeduplicatedAndNumberedStop]
AS
(SELECT [WorkflowID],
[ReportID],
[User],
[Action],
[Timestamp],
[Number]
FROM [DeduplicatedAndNumbered]
WHERE [Action] = 'Stop')
SELECT [DeduplicatedAndNumberedStart].[User],
[DeduplicatedAndNumberedStart].[Timestamp] [StartTime],
[DeduplicatedAndNumberedStop].[Timestamp] [EndTime]
FROM [DeduplicatedAndNumberedStart]
INNER JOIN [DeduplicatedAndNumberedStop]
ON [DeduplicatedAndNumberedStart].[ReportId] = [DeduplicatedAndNumberedStop].[ReportId]
AND [DeduplicatedAndNumberedStart].[User] = [DeduplicatedAndNumberedStop].[User]
AND [DeduplicatedAndNumberedStart].[Number] = [DeduplicatedAndNumberedStop].[Number];
db<>fiddle
OP has tagged their question with sql-server-2008.
Since SQL Server 2008 lacks the lag() function (it was added in SQL Server 2012), here is a solution that uses Common Table Expressions and row_number() which were available from SQL Server 2005 onwards...
;with [StopEvents] as (
select [WorkflowID],
[ReportID],
[User],
[EndTime] = [Timestamp],
[StopEventSeq] = row_number() over (
partition by [ReportID], [User]
order by [Timestamp])
from Workflow
where [Action] = 'Stop'
)
select this.[User], [StartTime], this.[EndTime]
from [StopEvents] this
-- Left join here because first Stop event won't have a previous Stop event
left join [StopEvents] previous
on previous.[ReportID] = this.[ReportID]
and previous.[User] = this.[User]
and previous.[StopEventSeq] = this.[StopEventSeq] - 1
outer apply (
select [StartTime] = min([Timestamp])
from Workflow W
where W.[ReportID] = this.[ReportID]
and W.[User] = this.[User]
and W.[Timestamp] < this.[EndTime]
-- First Stop event won't have a previous, so just get the min([Timestamp])
and (previous.[EndTime] is null or W.[Timestamp] >= previous.[EndTime])
) thisStart
order by this.[User], this.[EndTime]

Splitting up group by with relevant aggregates beyond the basic ones?

I'm not sure if this has been asked before because I'm having trouble even asking it myself. I think the best way to explain my dilemma is to use an example.
Say I've rated my happiness on a scale of 1-10 every day for 10 years and I have the results in a big table where I have a single date correspond to a single integer value of my happiness rating. I say, though, that I only care about my happiness over 60 day periods on average (this may seem weird but this is a simplified example). So I wrap up this information to a table where I now have a start date field, an end date field, and an average rating field where the start days are every day from the first day to the last over all 10 years, but the end dates are exactly 60 days later. To be clear, these 60 day periods are overlapping (one would share 59 days with the next one, 58 with the next, and so on).
Next I pick a threshold rating, say 5, where I want to categorize everything below it into a "bad" category and everything above into a "good" category. I could easily add another field and use a case structure to give every 60-day range a "good" or "bad" flag.
Then to sum it up, I want to display the total periods of "good" and "bad" from maximum beginning to maximum end date. This is where I'm stuck. I could group by the good/bad category and then just take min(start date) and max(end date), but then if, say, the ranges go from good to bad to good then to bad again, output would show overlapping ranges of good and bad. In the aforementioned situation, I would want to show four different ranges.
I realize this may seem clearer to me that it would to someone else so if you need clarification just ask.
Thank you
---EDIT---
Here's an example of what the before would look like:
StartDate| EndDate| MoodRating
------------+------------+------------
1/1/1991 |3/1/1991 | 7
1/2/1991 |3/2/1991 | 7
1/3/1991 |3/3/1991 | 4
1/4/1991 |3/4/1991 | 4
1/5/1991 |3/5/1991 | 7
1/6/1991 |3/6/1991 | 7
1/7/1991 |3/7/1991 | 4
1/8/1991 |3/8/1991 | 4
1/9/1991 |3/9/1991 | 4
And the after:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/2/1991 |good
1/3/1991|3/4/1991 |bad
1/5/1991|3/6/1991 |good
1/7/1991|3/9/1991 |bad
Currently my query with the group by rating would show:
MinStart| MaxEnd | Good/Bad
-----------+------------+----------
1/1/1991|3/6/1991 |good
1/3/1991|3/9/1991 |bad
This is something along the lines of
select min(StartDate), max(EndDate), Good_Bad
from sourcetable
group by Good_Bad
While Jason A Long's answer may be correct - I can't read it or figure it out, so I figured I would post my own answer. Assuming that this isn't a process that you're going to be constantly running, the CURSOR's performance hit shouldn't matter. But (at least to me) this solution is very readable and can be easily modified.
In a nutshell - we insert the first record from your source table into our results table. Next, we grab the next record and see if the mood score is the same as the previous record. If it is, we simply update the previous record's end date with the current record's end date (extending the range). If not, we insert a new record. Rinse, repeat. Simple.
Here is your setup and some sample data:
DECLARE @MoodRanges TABLE (StartDate DATE, EndDate DATE, MoodRating int)
INSERT INTO @MoodRanges
VALUES
('1/1/1991','3/1/1991', 7),
('1/2/1991','3/2/1991', 7),
('1/3/1991','3/3/1991', 4),
('1/4/1991','3/4/1991', 4),
('1/5/1991','3/5/1991', 7),
('1/6/1991','3/6/1991', 7),
('1/7/1991','3/7/1991', 4),
('1/8/1991','3/8/1991', 4),
('1/9/1991','3/9/1991', 4)
Next, we can create a table to store our results, as well as some variable placeholders for our cursor:
DECLARE @MoodResults TABLE(ID INT IDENTITY(1, 1), StartDate DATE, EndDate DATE, MoodScore varchar(50))
DECLARE @CurrentStartDate DATE, @CurrentEndDate DATE, @CurrentMoodScore INT,
        @PreviousStartDate DATE, @PreviousEndDate DATE, @PreviousMoodScore INT
Now we put all of the sample data into our CURSOR:
DECLARE MoodCursor CURSOR FOR
SELECT StartDate, EndDate, MoodRating
FROM @MoodRanges

OPEN MoodCursor
FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @PreviousStartDate IS NOT NULL
    BEGIN
        IF (@PreviousMoodScore >= 5 AND @CurrentMoodScore >= 5)
            OR (@PreviousMoodScore < 5 AND @CurrentMoodScore < 5)
        BEGIN
            UPDATE @MoodResults
            SET EndDate = @CurrentEndDate
            WHERE ID = (SELECT MAX(ID) FROM @MoodResults)
        END
        ELSE
        BEGIN
            INSERT INTO
                @MoodResults
            VALUES
                (@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
        END
    END
    ELSE
    BEGIN
        INSERT INTO
            @MoodResults
        VALUES
            (@CurrentStartDate, @CurrentEndDate, CASE WHEN @CurrentMoodScore >= 5 THEN 'GOOD' ELSE 'BAD' END)
    END

    SET @PreviousStartDate = @CurrentStartDate
    SET @PreviousEndDate = @CurrentEndDate
    SET @PreviousMoodScore = @CurrentMoodScore

    FETCH NEXT FROM MoodCursor INTO @CurrentStartDate, @CurrentEndDate, @CurrentMoodScore
END

CLOSE MoodCursor
DEALLOCATE MoodCursor
And here are the results:
SELECT * FROM @MoodResults
ID StartDate EndDate MoodScore
----------- ---------- ---------- --------------------------------------------------
1 1991-01-01 1991-03-02 GOOD
2 1991-01-03 1991-03-04 BAD
3 1991-01-05 1991-03-06 GOOD
4 1991-01-07 1991-03-09 BAD
Is this what you're looking for?
IF OBJECT_ID('tempdb..#MyDailyMood', 'U') IS NOT NULL
DROP TABLE #MyDailyMood;
CREATE TABLE #MyDailyMood (
TheDate DATE NOT NULL,
MoodLevel INT NOT NULL
);
WITH
cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),
cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
cte_n3 (n) AS (SELECT 1 FROM cte_n2 a CROSS JOIN cte_n2 b),
cte_Calendar (dt) AS (
SELECT TOP (DATEDIFF(dd, '2007-01-01', '2017-01-01'))
DATEADD(dd, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, '2007-01-01')
FROM
cte_n3 a CROSS JOIN cte_n3 b
)
INSERT #MyDailyMood (TheDate, MoodLevel)
SELECT
c.dt,
ABS(CHECKSUM(NEWID()) % 10) + 1
FROM
cte_Calendar c;
--==========================================================
WITH
cte_AddRN AS (
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS (
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
)
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup;
Post OP update solution...
WITH
cte_AddRN AS ( -- Add a row number to each row that resets to 1 ever 60 rows.
SELECT
*,
RN = ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY mdm.TheDate) % 60, 0), 60)
FROM
#MyDailyMood mdm
),
cte_AssignGroups AS ( -- Use DENSE_RANK to create groups based on the RN added above.
-- How it works: RN sets the row number 1 - 60 and then repeats itself,
-- but we don't want every 60th row grouped together. We want blocks of 60 consecutive rows grouped together.
-- DENSE_RANK accomplishes this by ranking within all the "1's", "2's"... and so on.
-- verify with the following query... SELECT * FROM cte_AssignGroups ag ORDER BY ag.TheDate
SELECT
*,
DateGroup = DENSE_RANK() OVER (PARTITION BY arn.RN ORDER BY arn.TheDate)
FROM
cte_AddRN arn
),
cte_AggRange AS ( -- This is just a straightforward aggregation/rollup. It produces results similar to the sample data you posted in your edit.
SELECT
BegOfRange = MIN(ag.TheDate),
EndOfRange = MAX(ag.TheDate),
AverageMoodLevel = AVG(ag.MoodLevel),
GorB = CASE WHEN AVG(ag.MoodLevel) >= 5 THEN 'Good' ELSE 'Bad' END,
ag.DateGroup
FROM
cte_AssignGroups ag
GROUP BY
ag.DateGroup
),
cte_CompactGroup AS ( -- This time we're using dense rank to group all of the consecutive "Good" and "Bad" values so that they can be further aggregated below.
SELECT
ar.BegOfRange, ar.EndOfRange, ar.AverageMoodLevel, ar.GorB, ar.DateGroup,
DenseGroup = ar.DateGroup - DENSE_RANK() OVER (PARTITION BY ar.GorB ORDER BY ar.BegOfRange)
FROM
cte_AggRange ar
)
-- The final aggregation step...
SELECT
BegOfRange = MIN(cg.BegOfRange),
EndOfRange = MAX(cg.EndOfRange),
cg.GorB
FROM
cte_CompactGroup cg
GROUP BY
cg.DenseGroup,
cg.GorB
ORDER BY
BegOfRange;

SQL table and data extraction

I have never done SQL before and I have been reading up on it. There is an exercise in the book I am reading to get me started; I am also looking at a website called W3School. The book is telling me to attempt the below:
Trades, which has the following structure –
trade_id: primary key
timestamp: timestamp of trade
security: underlying security (bought or sold in trade)
quantity: underlying quantity (positive signifies bought, negative indicates sold)
price: price of one security item for this trade
Consider the following table
CREATE TABLE tbProduct
([TRADE_ID] varchar(8), [TIMESTAMP] varchar(8), [SECURITY] varchar(8), [QUANTITY] varchar(8), [PRICE] varchar(8))
;
INSERT INTO tbProduct
([TRADE_ID], [TIMESTAMP], [SECURITY], [QUANTITY], [PRICE])
VALUES
('TRADE1', '10:01:05', 'BP', '+100', '20'),
('TRADE2', '10:01:06', 'BP', '+20', '15'),
('TRADE3', '10:10:00', 'BP', '-100', '19'),
('TRADE4', '10:10:01', 'BP', '-100', '19')
;
In the book it is telling me to write a query to find all trades that happened within 10 seconds of each other and have prices differing by more than 10%.
The result should also list the percentage of price difference between the 2 trades.
For a person who has not done SQL before, reading that has really confused me. They have also provided the expected outcome, but I am unsure how they arrived at it.
Expected result:
First_Trade Second_Trade PRICE_DIFF
TRADE1 TRADE2 25
I have created a fiddle if this helps. If someone could show me how to get the expected result, it will help me understand the book exercise.
Thanks
This will get the result you want.
;with cast_cte
as
(
select [TRADE_ID], cast([TIMESTAMP] as datetime) timestamp, [SECURITY], [QUANTITY], cast([PRICE] as float) as price
from tbProduct
)
select t1.trade_id, t2.trade_id, datediff(ms, t1.timestamp, t2.timestamp) as milliseconds_diff,
((t1.price - t2.price) / t1.price) * 100 as price_diff
from cast_cte t1
inner join cast_cte t2
on datediff(ms, t1.timestamp, t2.timestamp) between 0 and 10000
and t1.trade_id <> t2.trade_id
where ((t1.price - t2.price) / t1.price) * 100 > 10
or ((t1.price - t2.price) / t1.price) * 100 < -10
However, there are a number of problems with the schema and general query parameters:
1) The columns are all varchars. This is very inefficient because they all need to be cast to their appropriate data types in order to get the results you desire. Use datetime, int, float etc. (I have used a CTE to clean up the query as per @Jeroen-Mostert's suggestion)
2) As the table gets larger this query will start performing very poorly as the predicate used (the 10 second timestamp) is not indexed properly.
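For example, once the TIMESTAMP column is stored as a real datetime, an index along these lines (the name and INCLUDE columns are only a suggestion) would let the self-join seek on the time range:
CREATE NONCLUSTERED INDEX IX_tbProduct_Timestamp
    ON tbProduct ([TIMESTAMP])
    INCLUDE ([TRADE_ID], [PRICE]);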
Slightly different approach to the other answer, but pretty much the same effect. I use 'Between' to find the date range rather than datediff.
select
trade1.trade_ID as TRADE1,
trade2.trade_ID as TRADE2,
(cast(trade1.price as float)-cast(trade2.price as float))/cast(trade1.price as float)*100 as PRICE_DIFF_PERC
from
tbProduct trade1
inner join
tbProduct trade2
on
trade2.timestamp between trade1.timestamp and dateadd(s,10,trade1.TIMESTAMP)
and trade1.TRADE_ID <> trade2.TRADE_ID
where (cast(trade1.price as float)-cast(trade2.price as float))/cast(trade1.price as float) >0.1
The schema could definitely be improved; removing the need for 'CAST's would make this a lot clearer:
CREATE TABLE tbProduct2
([TRADE_ID] varchar(8), [TIMESTAMP] datetime, [SECURITY] varchar(8), [QUANTITY] int, [PRICE] float)
;
Allows you to do:
select *,
trade1.trade_ID as TRADE1,
trade2.trade_ID as TRADE2,
((trade1.price-trade2.price)/trade1.price)*100 as PRICE_DIFF_PERC
from
tbProduct2 trade1
inner join
tbProduct2 trade2
on
trade2.timestamp between trade1.timestamp and dateadd(s,10,trade1.TIMESTAMP)
and trade1.TRADE_ID <> trade2.TRADE_ID
where (trade1.price-trade2.price) /trade1.price >0.1
;
I have used the LEAD function to get the expected result. Try this:
select
iq.trade_id as FIRST_TRADE,
t1 as SECOND_TRADE,
((price-t3)/price*100) as PRICE_DIFF
from
(
Select trade_id, timestamp, security, quantity, cast(price as float) price,
lead(trade_id) over (partition by security order by timestamp) t1
,lead(timestamp) over (partition by security order by timestamp) t2
,lead(cast(price as float)) over (partition by security order by timestamp) t3
from tbProduct
) iq
where DATEDIFF(SECOND, iq.timestamp,iq.t2) between 0 and 10
and ((price-t3)/price*100) > 10
It is based on the fact that the partitioning is done over security. Feel free to comment or suggest corrections.

SQL to find overlapping time periods and sub-faults

Long-time stalker, first-time poster (and SQL beginner). My question is similar to this one: SQL to find time elapsed from multiple overlapping intervals, except that I'm able to use CTEs, UDFs, etc., and am looking for more detail.
On a piece of large scale equipment I have a record of all faults that arise. Faults can arise on different sub-components of the system, some may take it offline completely (complete outage = yes), while others do not (complete outage = no). Faults can overlap in time, and may not have end times if the fault has not yet been repaired.
Outage_ID StartDateTime EndDateTime CompleteOutage
1 07:00 3-Jul-13 08:55 3-Jul-13 Yes
2 08:30 3-Jul-13 10:00 4-Jul-13 No
3 12:00 4-Jul-13 No
4 12:30 4-Jul-13 12:35 4-Jul-13 No
1 |---------|
2 |---------|
3 |--------------------------------------------------------------
4 |---|
I need to be able to work out, for a user-defined time period, how long the total system is fully functional (no faults), how long it is degraded (one or more non-complete outages), and how long it is inoperable (one or more complete outages). I also need to be able to work out which faults were on the system for any given time period. I was thinking of creating a "State Change" table entry any time a fault is opened or closed, but I am stuck on the best way to do this - any help on this or better solutions would be appreciated!
This isn't a complete solution (I leave that as an exercise :)) but should illustrate the basic technique. The trick is to create a state table (as you say). If you record a 1 for a "start" event and a -1 for an "end" event then a running total in event date/time order gives you the current state at that particular event date/time. The SQL below is T-SQL but should be easily adaptable to whatever database server you're using.
Using your data for partial outage as an example:
DECLARE @Faults TABLE (
StartDateTime DATETIME NOT NULL,
EndDateTime DATETIME NULL
)
INSERT INTO @Faults (StartDateTime, EndDateTime)
SELECT '2013-07-03 08:30', '2013-07-04 10:00'
UNION ALL SELECT '2013-07-04 12:00', NULL
UNION ALL SELECT '2013-07-04 12:30', '2013-07-04 12:35'
-- "Unpivot" the events and assign 1 to a start and -1 to an end
;WITH FaultEvents AS (
SELECT *, Ord = ROW_NUMBER() OVER(ORDER BY EventDateTime)
FROM (
SELECT EventDateTime = StartDateTime, Evt = 1
FROM @Faults
UNION ALL SELECT EndDateTime, Evt = -1
FROM @Faults
WHERE EndDateTime IS NOT NULL
) X
)
-- Running total of Evt gives the current state at each date/time point
, FaultEventStates AS (
SELECT A.Ord, A.EventDateTime, A.Evt, [State] = (SELECT SUM(B.Evt) FROM FaultEvents B WHERE B.Ord <= A.Ord)
FROM FaultEvents A
)
SELECT StartDateTime = S.EventDateTime, EndDateTime = F.EventDateTime
FROM FaultEventStates S
OUTER APPLY (
-- Find the nearest transition to the no-fault state
SELECT TOP 1 *
FROM FaultEventStates B
WHERE B.[State] = 0
AND B.Ord > S.Ord
ORDER BY B.Ord
) F
-- Restrict to start events transitioning from the no-fault state
WHERE S.Evt = 1 AND S.[State] = 1
If you are using SQL Server 2012 then you have the option to calculate the running total using a windowing function.
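For example, on SQL Server 2012+ the FaultEventStates CTE above could be swapped for a windowed SUM; this is a drop-in sketch of that replacement:
, FaultEventStates AS (
    SELECT A.Ord, A.EventDateTime, A.Evt,
           [State] = SUM(A.Evt) OVER (ORDER BY A.Ord
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM FaultEvents A
)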
The below is a rough guide to getting this working. It will compare against an interval table of dates and an interval table of 15 mins. It will then sum the outage events (1 event per interval), but not sum a partial outage if there is a full outage.
You could use a more granular time interval if you needed to; I chose 15 minutes for speed of coding.
I already had a date interval table set up "CAL.t_Calendar" so you would need to create one of your own to run this code.
Please note, this does not represent actual code you should use. It is only intended as a demonstration and to point you in a possible direction...
EDIT: I've just realised I haven't accounted for the null end dates. The code will need amending to check for NULL EndDateTimes and use @EndDate, or GETDATE() if @EndDate is in the future; a sketch of that amendment appears after the code below.
--drop table #Events
CREATE TABLE #Events (OUTAGE_ID INT IDENTITY(1,1) PRIMARY KEY
,StartDateTime datetime
,EndDateTime datetime
, completeOutage bit)
INSERT INTO #Events VALUES ('2013-07-03 07:00','2013-07-03 08:55',1),('2013-07-03 08:30','2013-07-04 10:00',0)
,('2013-07-04 12:00',NULL,0),('2013-07-04 12:30','2013-07-04 12:35',0)
--drop table #FiveMins
CREATE TABLE #FiveMins (ID int IDENTITY(1,1) PRIMARY KEY, TimeInterval Time)
DECLARE @Time INT = 0
WHILE @Time <= 1425 --96 fifteen-minute intervals in a day (0 through 1425)
BEGIN
    INSERT INTO #FiveMins SELECT DATEADD(MINUTE , @Time, '00:00')
    SET @Time = @Time + 15
END
SELECT * from #FiveMins
DECLARE @StartDate DATETIME = '2013-07-03'
DECLARE @EndDate DATETIME = '2013-07-04 23:59:59.999'
SELECT SUM(FullOutage) * 15 as MinutesFullOutage
,SUM(PartialOutage) * 15 as MinutesPartialOutage
,SUM(NoOutage) * 15 as MinutesNoOutage
FROM
(
SELECT DateAnc.EventDateTime
, CASE WHEN COUNT(OU.OUTAGE_ID) > 0 THEN 1 ELSE 0 END AS FullOutage
, CASE WHEN COUNT(OU.OUTAGE_ID) = 0 AND COUNT(pOU.OUTAGE_ID) > 0 THEN 1 ELSE 0 END AS PartialOutage
, CASE WHEN COUNT(OU.OUTAGE_ID) > 0 OR COUNT(pOU.OUTAGE_ID) > 0 THEN 0 ELSE 1 END AS NoOutage
FROM
(
SELECT CAL.calDate + MI.TimeInterval AS EventDateTime
FROM CAL.t_Calendar CAL
CROSS JOIN #FiveMins MI
WHERE CAL.calDate BETWEEN @StartDate AND @EndDate
) DateAnc
LEFT JOIN #Events OU
ON DateAnc.EventDateTime BETWEEN OU.StartDateTime AND OU.EndDateTime
AND OU.completeOutage = 1
LEFT JOIN #Events pOU
ON DateAnc.EventDateTime BETWEEN pOU.StartDateTime AND pOU.EndDateTime
AND pOU.completeOutage = 0
GROUP BY DateAnc.EventDateTime
) AllOutages
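As flagged in the EDIT above, a sketch of the amended join condition would cap open-ended outages at @EndDate, or at GETDATE() if @EndDate is in the future (the same change applies to the pOU join):
LEFT JOIN #Events OU
    ON DateAnc.EventDateTime BETWEEN OU.StartDateTime
       AND COALESCE(OU.EndDateTime,
                    CASE WHEN @EndDate > GETDATE() THEN GETDATE() ELSE @EndDate END)
    AND OU.completeOutage = 1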

SQL to check for 2 or more consecutive negative week values

I want to count the number of 2 or more consecutive week periods that have negative values within a range of weeks.
Example:
Week | Value
201301 | 10
201302 | -5 <--| both weeks have negative values and are consecutive
201303 | -6 <--|
Week | Value
201301 | 10
201302 | -5
201303 | 7
201304 | -2 <-- negative but not consecutive to the last negative value in 201302
Week | Value
201301 | 10
201302 | -5
201303 | -7
201304 | -2 <-- 1st group of negative and consecutive values
201305 | 0
201306 | -12
201307 | -8 <-- 2nd group of negative and consecutive values
Is there a better way of doing this other than using a cursor and a reset variable and checking through each row in order?
Here is some of the SQL I have setup to try and test this:
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestOne') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestOne
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestTwo') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestTwo
CREATE TABLE #ConsecutiveNegativeWeekTestOne
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- I have a condition where I expect to see at least 2 consecutive weeks with negative values
-- TRUE : Week 201328 & 201329 are both negative.
INSERT INTO #ConsecutiveNegativeWeekTestOne
VALUES
(201327, 5)
,(201328,-11)
,(201329,-18)
,(201330, 25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, 59)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestOne
WHERE Value < 0
ORDER BY [Week] ASC
CREATE TABLE #ConsecutiveNegativeWeekTestTwo
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- FALSE: The negative weeks are not consecutive
INSERT INTO #ConsecutiveNegativeWeekTestTwo
VALUES
(201327, 5)
,(201328,-11)
,(201329,20)
,(201330, -25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, -15)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestTwo
WHERE Value < 0
ORDER BY [Week] ASC
My SQL fiddle is also here:
http://sqlfiddle.com/#!3/ef54f/2
First, would you please share the formula for calculating week number, or provide a real date for each week, or some method to determine if there are 52 or 53 weeks in any particular year? Once you do that, I can make my queries properly skip missing data AND cross year boundaries.
Now to queries: this can be done without a JOIN, which depending on the exact indexes present, may improve performance a huge amount over any solution that does use JOINs. Then again, it may not. This is also harder to understand so may not be worth it if other solutions perform well enough (especially when the right indexes are present).
Simulate a PREORDER BY windowing function (respects gaps, ignores year boundaries):
WITH Calcs AS (
SELECT
Grp =
[Week] -- comment out to ignore gaps and gain year boundaries
-- Row_Number() OVER (ORDER BY [Week]) -- swap with previous line
- Row_Number() OVER
(PARTITION BY (SELECT 1 WHERE Value < 0) ORDER BY [Week]),
*
FROM dbo.ConsecutiveNegativeWeekTestOne
)
SELECT
[Week] = Min([Week])
-- NumWeeks = Count(*) -- if you want the count
FROM Calcs C
WHERE Value < 0
GROUP BY C.Grp
HAVING Count(*) >= 2
;
See a Live Demo at SQL Fiddle (1st query)
And another way, simulating LAG and LEAD with a CROSS JOIN and aggregates (respects gaps, ignores year boundaries):
WITH Groups AS (
SELECT
Grp = T.[Week] + X.Num,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (-1), (0), (1)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 0) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
See a Live Demo at SQL Fiddle (2nd query)
And, my original second query, but simplified (ignores gaps, handles year boundaries):
WITH Groups AS (
SELECT
Grp = (Row_Number() OVER (ORDER BY T.[Week]) + X.Num) / 3,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (0), (2), (4)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 2) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
Note: The execution plan for these may be rated as more expensive than other queries, but there will be only 1 table access instead of 2 or 3, and while the CPU may be higher it is still respectably low.
Note: I originally was not paying attention to only producing one row per group of negative values, and so I produced this query as only requiring 2 table accesses (respects gaps, ignores year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T2.[Week] IN (T1.[Week] - 1, T1.[Week] + 1)
)
;
See a Live Demo at SQL Fiddle (3rd query)
However, I have now modified it to perform as required, showing only each starting date (respects gaps, ignored year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM
dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T1.[Week] - 1 <= T2.[Week]
AND T1.[Week] + 1 >= T2.[Week]
AND T1.[Week] <> T2.[Week]
HAVING
Min(T2.[Week]) > T1.[Week]
)
;
See a Live Demo at SQL Fiddle (3rd query)
Last, just for fun, here is a SQL Server 2012 and up version using LEAD and LAG:
WITH Weeks AS (
SELECT
PrevValue = Lag(Value, 1, 0) OVER (ORDER BY [Week]),
SubsValue = Lead(Value, 1, 0) OVER (ORDER BY [Week]),
PrevWeek = Lag(Week, 1, 0) OVER (ORDER BY [Week]),
SubsWeek = Lead(Week, 1, 0) OVER (ORDER BY [Week]),
*
FROM
dbo.ConsecutiveNegativeWeekTestOne
)
SELECT @Week = [Week]
FROM Weeks W
WHERE
(
[Week] - 1 > PrevWeek
OR PrevValue >= 0
)
AND Value < 0
AND SubsValue < 0
AND [Week] + 1 = SubsWeek
;
See a Live Demo at SQL Fiddle (4th query)
I am not sure I am doing this the best way as I haven't used these much, but it works nonetheless.
You should do some performance testing of the various queries presented to you, and pick the best one, considering that code should be, in order:
Correct
Clear
Concise
Fast
Seeing that some of my solutions are anything but clear, other solutions that are fast enough and concise enough will probably win out in the competition of which one to use in your own production code. But... maybe not! And maybe someone will appreciate seeing these techniques, even if they can't be used as-is this time.
So let's do some testing and see what the truth is about all this! Here is some test setup script. It will generate the same data on your own server as it did on mine:
IF Object_ID('dbo.ConsecutiveNegativeWeekTestOne', 'U') IS NOT NULL DROP TABLE dbo.ConsecutiveNegativeWeekTestOne;
GO
CREATE TABLE dbo.ConsecutiveNegativeWeekTestOne (
[Week] int NOT NULL CONSTRAINT PK_ConsecutiveNegativeWeekTestOne PRIMARY KEY CLUSTERED,
[Value] decimal(18,6) NOT NULL
);
SET NOCOUNT ON;
DECLARE
    @f float = Rand(5.1415926535897932384626433832795028842),
    @Dt datetime = '17530101',
    @Week int;
WHILE @Dt <= '20140106' BEGIN
    INSERT dbo.ConsecutiveNegativeWeekTestOne
    SELECT
        Format(@Dt, 'yyyy') + Right('0' + Convert(varchar(11), DateDiff(day, DateAdd(year, DateDiff(year, 0, @Dt), 0), @Dt) / 7 + 1), 2),
        Rand() * 151 - 76
    ;
    SET @Dt = DateAdd(day, 7, @Dt);
END;
This generates 13,620 weeks, from 175301 through 201401. I modified all the queries to select the Week values instead of the count, in the format SELECT @Week = Expression ... so that tests are not affected by returning rows to the client.
I tested only the gap-respecting, non-year-boundary-handling versions.
Results
Query Duration CPU Reads
------------------ -------- ----- ------
ErikE-Preorder 27 31 40
ErikE-CROSS 29 31 40
ErikE-Join-IN -------Awful---------
ErikE-Join-Revised 46 47 15069
ErikE-Lead-Lag 104 109 40
jods 12 16 120
Transact Charlie 12 16 120
Conclusions
The reduced reads of the non-JOIN versions are not significant enough to warrant their increased complexity.
The table is so small that the performance almost doesn't matter. 261 years of weeks is insignificant, so a normal business operation won't see any performance problem even with a poor query.
I tested with an index on Week (which is more than reasonable); doing two separate JOINs with a seek was far, far superior to any device that tries to get the relevant related data in one swoop. Charlie and jods were spot on in their comments.
This data is not large enough to expose real differences between the queries in CPU and duration. The values above are representative, though at times the 31 ms were 16 ms and the 16 ms were 0 ms. Since the resolution is ~15 ms, this doesn't tell us much.
My tricky query techniques do perform better. They might be worth it in performance critical situations. But this is not one of those.
Lead and Lag may not always win. The presence of an index on the lookup value is probably what determines this. The ability to still pull prior/next values based on a certain order even when the order by value is not sequential may be one good use case for these functions.
You could use a combination of EXISTS clauses.
Assuming you only want to count groups (series of consecutive weeks that are all negative):
--Find the potential start weeks
;WITH starts as (
SELECT [Week]
FROM #ConsecutiveNegativeWeekTestOne AS s
WHERE s.[Value] < 0
AND NOT EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS p
WHERE p.[Week] = s.[Week] - 1
AND p.[Value] < 0
)
)
SELECT COUNT(*)
FROM
Starts AS s
WHERE EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS n
WHERE n.[Week] = s.[Week] + 1
AND n.[Value] < 0
)
If you have an index on Week this query should even be moderately efficient.
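For example, an index of roughly this shape (the INCLUDE column is optional) would cover both the outer scan and the EXISTS lookups:
CREATE NONCLUSTERED INDEX IX_ConsecutiveNegativeWeek_Week
    ON #ConsecutiveNegativeWeekTestOne ([Week])
    INCLUDE ([Value]);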
You can replace LEAD and LAG with a self-join.
The counting idea is basically to count the starts of negative sequences rather than trying to consider each row.
SELECT COUNT(*)
FROM ConsecutiveNegativeWeekTestOne W
LEFT OUTER JOIN ConsecutiveNegativeWeekTestOne Prev
ON W.week = Prev.week + 1
INNER JOIN ConsecutiveNegativeWeekTestOne Next
ON W.week = Next.week - 1
WHERE W.value < 0
AND (Prev.value IS NULL OR Prev.value >= 0)
AND Next.value < 0
Note that I simply did "week + 1", which would not work when there is a year change.
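One way around that, sketched below on the assumption that no weeks are missing from the table, is to number the rows by week and join on that sequence instead of on week + 1:
WITH Numbered AS (
    SELECT *, Seq = ROW_NUMBER() OVER (ORDER BY [Week])
    FROM ConsecutiveNegativeWeekTestOne
)
SELECT COUNT(*)
FROM Numbered W
LEFT OUTER JOIN Numbered Prev
    ON W.Seq = Prev.Seq + 1
INNER JOIN Numbered Next
    ON W.Seq = Next.Seq - 1
WHERE W.[Value] < 0
  AND (Prev.[Value] IS NULL OR Prev.[Value] >= 0)
  AND Next.[Value] < 0;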