I have data table that has column ORDER that is supposed to indicate if the values are increasing or decreasing and another column ORDER_BASIS. However, data in ORDER is often incorrect at the first place so I am trying to determine the correct order using ORDER_BASIS.
Here's what the table looks like:
ORDER
ORDER_BASIS
INCREASING
8
INCREASING
16
INCREASING
12
INCREASING
5
INCREASING
1
INCREASING
1
INCREASING
10
INCREASING
16
INCREASING
16
I am trying to achieve this:
ORDER
ORDER_BASIS
CORRECT_ORDER
INCREASING
8
INCREASING
INCREASING
16
INCREASING
INCREASING
12
DECREASING
INCREASING
5
DECREASING
INCREASING
1
DECREASING
INCREASING
1
DECREASING
INCREASING
10
INCREASING
INCREASING
16
INCREASING
INCREASING
16
INCREASING
First column may use ORDER then the following rows should determine if it's increasing or decreasing. If value did not increase or decrease then remain with it's current status until there's a change in value.
My current logic uses LAG and LEAD:
SELECT
LEAD (ORDER_BASIS, 1, 0) AS NEXT_BASIS,
LAG (ORDER_BASIS, 1, 0) AS PREV_BASIS
FROM
DATA_TABLE
Then created a condition but cannot get it to work correctly
CASE
WHEN CAST(PREV_BASIS AS int) = 0
OR (CAST(PREV_BASIS AS int) >= CAST(ORDER_BASIS AS int)
AND CAST(NEXT_BASIS AS int) <= CAST(ORDER_BASIS AS int))
THEN ORDER_BASIS
ELSE 'OPPOSITE_DIRECTION'
END AS CORRECT_ORDER
Using SQL Server 2014
if your query has not any order by statement, the order of rows is totally random anytime,and can be different. so to solve this problem you need some column that you can guarantee the initial order of rows , then we can fix the issue :
select * ,
case when ORDER_BASIS > LAG(ORDER_BASIS,1,-1) over (order by <the column>)
then 'INCREASING'
case when ORDER_BASIS = LAG(ORDER_BASIS,1,-1) over (order by <the column>)
then 'No change'
else 'DECREASING' end CORRECT_ORDER
from DATA_TABLE
With sequential processing functions, like LAG and LEAD the sequence is the most important factor to maintain and is the one item that was left out of the original post. In SQL Server, window functions will operate on their own partition (grouping) and sort criteria, so when visually correlating the data it is important to use the same criteria in the external query as you do for the window functions.
The following solution can be explored in this fiddle: http://sqlfiddle.com/#!18/5e1ee/31
To validate your input conditions, run the query to output the LAG and LEAD results:
SELECT
[Id],[Order]
, LAG (ORDER_BASIS, 1, NULL) OVER (ORDER BY [Id]) AS PREV_BASIS
, [Order_Basis]
, LEAD (ORDER_BASIS, 1, NULL) OVER (ORDER BY [Id]) AS NEXT_BASIS
FROM DATA_TABLE;
Id
Order
PREV_BASIS
Order_Basis
NEXT_BASIS
1
INCREASING
(null)
8
16
2
INCREASING
8
16
12
3
INCREASING
16
12
5
4
INCREASING
12
5
1
5
INCREASING
5
1
1
6
INCREASING
1
1
10
7
INCREASING
1
10
16
8
INCREASING
10
16
16
9
INCREASING
16
16
(null)
The next issue is that your attempted logic is using the LAG AND the LEAD values, which is not invalid, but is usually used to compute a value that either smooths out the curve or is trying to detect spikes or Highs and Lows.
It is not necessary to do this via a CTE however, it simplifies the readability of the syntax for this discussion, within the CTE we can perform the Integer Casting as well, however in a production environment it might be optimal to store the ORDER_BASIS column as an Integer in the first place.
WITH Records as
(
SELECT
[Id],[Order]
, CAST(LAG (ORDER_BASIS, 1, NULL) OVER (ORDER BY [Id]) AS INT) AS PREV_BASIS
, CAST([Order_Basis] AS INT) AS [Order_Basis]
, CAST(LEAD (ORDER_BASIS, 1, NULL) OVER (ORDER BY [Id]) AS INT) AS NEXT_BASIS
FROM DATA_TABLE
)
SELECT
[Id],[Order],PREV_BASIS,[Order_Basis],NEXT_BASIS
,CASE
WHEN NEXT_BASIS > ORDER_BASIS AND PREV_BASIS > ORDER_BASIS THEN 'LOW'
WHEN NEXT_BASIS < ORDER_BASIS AND PREV_BASIS < ORDER_BASIS THEN 'HIGH'
WHEN ISNULL(PREV_BASIS, ORDER_BASIS) = ORDER_BASIS THEN 'NO CHANGE'
WHEN ISNULL(PREV_BASIS, ORDER_BASIS) >= ORDER_BASIS
AND ISNULL(NEXT_BASIS, ORDER_BASIS) <= ORDER_BASIS
THEN 'DECREASING'
WHEN ISNULL(PREV_BASIS, ORDER_BASIS) <= ORDER_BASIS
AND ISNULL(NEXT_BASIS, ORDER_BASIS) >= ORDER_BASIS
THEN 'INCREASING'
ELSE 'INDETERMINATE'
END AS CORRECT_ORDER
FROM Records
ORDER BY [Id];
Id
Order
PREV_BASIS
Order_Basis
NEXT_BASIS
CORRECT_ORDER
1
INCREASING
(null)
8
16
NO CHANGE
2
INCREASING
8
16
12
HIGH
3
INCREASING
16
12
5
DECREASING
4
INCREASING
12
5
1
DECREASING
5
INCREASING
5
1
1
DECREASING
6
INCREASING
1
1
10
NO CHANGE
7
INCREASING
1
10
16
INCREASING
8
INCREASING
10
16
16
INCREASING
9
INCREASING
16
16
(null)
NO CHANGE
You could extend this by using a LAG comparison again to determine if the NO CHANGE in the middle of the above record set is in fact a low point over a longer period.
If the CORRECT ORDER should only be a function of the previous record, then there is no need to use a LEAD evaluation at all:
WITH Records as
(
SELECT
[ID],[ORDER]
, CAST(LAG (ORDER_BASIS, 1, NULL) OVER (ORDER BY [Id]) AS INT) AS PREV_BASIS
, CAST([ORDER_BASIS] AS INT) AS [ORDER_BASIS]
FROM DATA_TABLE
)
SELECT
[ID],[ORDER],[PREV_BASIS],[ORDER_BASIS]
, CASE WHEN ORDER_BASIS < PREV_BASIS
THEN 'DECREASING'
WHEN ORDER_BASIS > PREV_BASIS
THEN 'INCREASING'
ELSE 'NO CHANGE'
END CORRECT_ORDER
FROM Records;
ID
ORDER
PREV_BASIS
ORDER_BASIS
CORRECT_ORDER
1
INCREASING
(null)
8
NO CHANGE
2
INCREASING
8
16
INCREASING
3
INCREASING
16
12
DECREASING
4
INCREASING
12
5
DECREASING
5
INCREASING
5
1
DECREASING
6
INCREASING
1
1
NO CHANGE
7
INCREASING
1
10
INCREASING
8
INCREASING
10
16
INCREASING
9
INCREASING
16
16
NO CHANGE
Related
I have the following table:
RequestId,Type, Date, ParentRequestId
1 1 2020-10-15 null
2 2 2020-10-19 1
3 1 2020-10-20 null
4 2 2020-11-15 3
For this example I am interested in the request type 1 and 2, to make the example simpler. My task is to query a big database and to see the distribution of the secondary transaction based on the difference of dates with the parent one. So the result would look like:
Interval,Percentage
0-7 days,50 %
8-15 days,0 %
16-50 days, 50 %
So for the first line from teh expected result we have the request with the id 2 and for the third line from the expected result we have the request with the id 4 because the date difference fits in this interval.
How to achieve this?
I'm using sql server 2014.
We like to see your attempts, but by the looks of it, it seems like you're going to need to treat this table as 2 tables and do a basic GROUP BY, but make it fancy by grouping on a CASE statement.
WITH dateDiffs as (
/* perform our date calculations first, to get that out of the way */
SELECT
DATEDIFF(Day, parent.[Date], child.[Date]) as daysDiff,
1 as rowsFound
FROM (SELECT RequestID, [Date] FROM myTable WHERE Type = 1) parent
INNER JOIN (SELECT ParentRequestID, [Date] FROM myTable WHERE Type = 2) child
ON parent.requestID = child.parentRequestID
)
/* Now group and aggregate and enjoy your maths! */
SELECT
case when daysDiff between 0 and 7 then '0-7'
when daysDiff between 8 and 15 then '8-15'
when daysDiff between 16 and 50 THEN '16-50'
else '50+'
end as myInterval,
sum(rowsFound) as totalFound,
(select sum(rowsFound) from dateDiffs) as totalRows,
1.0 * sum(rowsFound) / (select sum(rowsFound) from dateDiffs) * 100.00 as percentFound
FROM dateDiffs
GROUP BY
case when daysDiff between 0 and 7 then '0-7'
when daysDiff between 8 and 15 then '8-15'
when daysDiff between 16 and 50 THEN '16-50'
else '50+'
end;
This seems like basically a join and group by query:
with dates as (
select 0 as lo, 7 as hi, '0-7 days' as grp union all
select 8 as lo, 15 as hi, '8-15 days' union all
select 16 as lo, 50 as hi, '16-50 days'
)
select d.grp,
count(*) as cnt,
count(*) * 1.0 / sum(count(*)) over () as raio
from dates left join
(t join
t tp
on tp.RequestId = t. ParentRequestId
)
on datediff(day, tp.date, t.date) between d.lo and d.hi
group by d.grp
order by d.lo;
The only trick is generating all the date groups, so you have rows with zero values.
My background is Oracle but we've moved to Hadoop on AWS and I'm accessing our logs using Hive SQL. I've been asked to return a report where the number of high severity errors on the system of any given type exceeds 9 in any rolling period of 30 days (9 but I use 2 in the example to keep the example data volumes down) by uptime. I've written code to do this but I don't really understand performance tuning in Hive. A lot of the stuff I learned in Oracle doesn't seem applicable.
Can this be improved?
Data is roughly
CREATE TABLE LOG_TABLE
(SYSTEM_ID VARCHAR(1),
EVENT_TYPE VARCHAR(2),
EVENT_ID VARCHAR(3),
EVENT_DATE DATE,
UPTIME INT);
INSERT INOT LOG_TABLE
VALUES
('1','A1','138','2018-10-29',34),
('1','A2','146','2018-11-13',49),
('1','A3','140','2018-11-02',38),
('1','B1','130','2018-10-13',18),
('1','B1','150','2018-11-19',55),
('1','B2','137','2018-10-27',32),
('2','A1','128','2018-10-11',59),
('2','A1','131','2018-10-16',64),
('2','A1','136','2018-10-25',73),
('2','A2','139','2018-10-31',79),
('2','A2','145','2018-11-11',90),
('2','A2','147','2018-11-14',93),
('2','A3','135','2018-10-24',72),
('2','B1','124','2018-10-03',51),
('2','B1','133','2018-10-19',67),
('2','B2','134','2018-10-22',70),
('2','B2','142','2018-11-06',85),
('2','B2','148','2018-11-15',94),
('2','B2','149','2018-11-17',96),
('3','A2','127','2018-10-10',122),
('3','A3','123','2018-10-01',113),
('3','A3','125','2018-10-06',118),
('3','A3','126','2018-10-07',119),
('3','A3','141','2018-11-05',148),
('3','A3','144','2018-11-10',153),
('3','B1','132','2018-10-18',130),
('3','B1','143','2018-11-08',151),
('3','B2','129','2018-10-12',124);
and code that works is as follows. I do a self join on the log table to return all the records with the gap between them and include those with a gap of 30 days or less. I then select those where there are more than 2 events into a second cte and from these I count distinct event types and event ids by system and uptime range
WITH EVENTGAP AS
(SELECT T1.EVENT_TYPE,
T1.SYSTEM_ID,
T1.EVENT_ID,
T2.EVENT_ID AS EVENT_ID2,
T1.EVENT_DATE,
T2.EVENT_DATE AS EVENT_DATE2,
T1.UPTIME,
DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) AS EVENT_GAP
FROM LOG_TABLE T1
INNER JOIN LOG_TABLE T2
ON (T1.EVENT_TYPE=T2.EVENT_TYPE
AND T1.SYSTEM_ID=T2.SYSTEM_ID)
WHERE DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) BETWEEN 0 AND 30
AND T1.UPTIME BETWEEN 0 AND 299
AND T2.UPTIME BETWEEN 0 AND 330),
EVENTCOUNT
AS (SELECT EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE,
COUNT(1)
FROM EVENTGAP
GROUP BY EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE
HAVING COUNT(1)>2)
SELECT EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END AS UPTIME_BAND,
COUNT(DISTINCT EVENTGAP.EVENT_ID2) AS EVENT_COUNT,
COUNT(DISTINCT EVENTGAP.EVENT_TYPE) AS TYPE_COUNT
FROM EVENTGAP
WHERE EVENTGAP.EVENT_ID IN (SELECT DISTINCT EVENTCOUNT.EVENT_ID FROM EVENTCOUNT)
GROUP BY EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END
This gives the following result, which should be unique counts of event ids and event types that have 3 or more events falling in any rolling 30 day period. Some events may be in more than one period but will only be counted once.
EVENTGAP.SYSTEM_ID UPTIME_BAND EVENT_COUNT TYPE_COUNT
2 50-99 10 3
3 100-149 4 1
In both Hive and Oracle, you would want to do this using window functions, using a window frame clause. The exact logic is different in the two databases.
In Hive you can use range between if you convert event_date to a number. A typical method is to subtract a fixed value from it. Another method is to use unix timestamps:
select lt.*
from (select lt.*,
count(*) over (partition by event_type
order by unix_timestamp(event_date)
range between 60*24*24*30 preceding and current row
) as rolling_count
from log_table lt
) lt
where rolling_count >= 2 -- or 9
I have a query that works perfectly for combining ElapsedTime that is considered Non-Productive when NonProductive = 1. However, I have been trying to get a running total to work. This the main query that totals by ReportNo for each day:
Select SUM(CASE
When NonProductive = 1 Then ElapsedTime
Else 0
End)
From DailyOperations
Where (DailyOperations.WellID = 'ZCQ-5') AND (DailyOperations.JobID = 'Original') and (ReportNo = 9)
ReportNo = 9 is the first Reportno that has NonProductive time which is 4. The next is ReportNo = 14. It has 5.5 hours of NonProductive time. So when I run ReportNo 14 I am hoping to see a total of 9.5 and nothing else. Below is the query that I am using for my running total but it is listing all of the Non-Productive time. So Instead of getting just 9.5 for ReportNo 14 I am also getting a running total for each instance of NonProductive time in the report:
SELECT (ElapsedTime),(Reportno),NonProductive,
SUM(ElapsedTime) OVER (PARTITION BY NonProductive ORDER BY REPORTNO ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS RUNNINGTOTAL
FROM dbo.DailyOperations
WHERE (NonProductive IN(1)) and (WellID = 'ZCQ-5') AND (JobID = 'Original')
Group by ReportNo,ElapsedTime,NonProductive
Order by ReportNo
This gives me:
ReportNo RUNNINGTOTAL
9 4
14 6
14 9.5
What I want is:
ReportNo RUNNINGTOTAL
9 4
14 9.5
I think you want:
SELECT Reportno,NonProductive,
SUM(SUM(ElapsedTime)) OVER (PARTITION BY NonProductive ORDER BY REPORTNO) AS RUNNINGTOTAL
FROM dbo.DailyOperations o
WHERE NonProductive IN (1) and WellID = 'ZCQ-5' AND JobID = 'Original')
GROUP BY ReportNo
ORDER BY ReportNo;
Notes:
The GROUP BY defines each row that you want in the result set. Hence, you only want ReportNo in it.
When combining window functions and aggregation functions, you sometimes end up with strange looking constructs such as SUM(SUM()).
The windowing clause is unnecessary. What you have is basically the default when you use ORDER BY (actually, the default is RANGE BETWEEN, but ReportId is unique so the two are equivalent).
I am trying to figure out the SQL to do a running total for a daily quota system. The system works like this...
Each day a user gets a quota of 2 "consumable things". If they use them all up, the next day they get another 2. If they somehow over use them (use more than 2), the next day they still get 2 (they can't have a negative balance). If they don't use them all, the remainder carries to the next day (which can carry to the next, etc...).
Here is a chart of data to use as validation. It's laid out as quota for the day, amount used that day, amount left at the end of the day:
2 - 2 - 0
2 - 0 - 2
4 - 3 - 1
3 - 0 - 3
5 - 7 - 0
2 - 1 - 1
3 - 0 - 3
5 - 2 - 3
5 - 1 - 4
6 - 9 - 0
The SQL to start of with would be:
WITH t(x, y) AS (
VALUES (2, '2013-09-16'),
(0, '2013-09-17'),
(3, '2013-09-18'),
(0, '2013-09-19'),
(7, '2013-09-20'),
(1, '2013-09-21'),
(0, '2013-09-22'),
(2, '2013-09-23'),
(1, '2013-09-24'),
(9, '2013-09-25')
)
For the life of me, trying recursive with statements and window aggregates, I cannot figure out how to make it work (but I can certainly see the pattern).
It should be something like 2 - x + SUM(previous row), but I don't know how to put that in to SQL.
Try creating custom aggregate function like:
CREATE FUNCTION quota_calc_func(numeric, numeric, numeric) -- carry over, daily usage and daily quota
RETURNS numeric AS
$$
SELECT GREATEST(0, $1 + $3 - $2);
$$
LANGUAGE SQL STRICT IMMUTABLE;
CREATE AGGREGATE quota_calc( numeric, numeric ) -- daily usage and daily quota
(
SFUNC = quota_calc_func,
STYPE = numeric,
INITCOND = '0'
);
WITH t(x, y) AS (
VALUES (2, '2013-09-16'),
(0, '2013-09-17'),
(3, '2013-09-18'),
(0, '2013-09-19'),
(7, '2013-09-20'),
(1, '2013-09-21'),
(0, '2013-09-22'),
(2, '2013-09-23'),
(1, '2013-09-24'),
(9, '2013-09-25')
)
SELECT x, y, quota_calc(x, 2) over (order by y)
FROM t;
May contain bugs, haven't tested it.
they can't have a negative balance
That triggered my memory :-)
I had a similar problem >10 years ago on a Teradata system.
The logic could be easily implemented using recursion, for each row do:
add 2 "new" and substract x "used" quota, if this is less than zero
use zero instead.
I can't remember how i found that solution, but i finally implemented it using simple cumulative sums:
SELECT
dt.*,
CASE -- used in following calculation, this is just for illustration
WHEN MIN(quota_raw) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING) >= 0 THEN 0
ELSE MIN(quota_raw) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING)
END AS correction,
quota_raw
- CASE
WHEN MIN(quota_raw) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING) >= 0 THEN 0
ELSE MIN(quota_raw) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING)
END AS quote_left
FROM
(
SELECT quota, datecol,
SUM(quota) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING) AS quota_used,
2*COUNT(*) OVER (ORDER BY datecol ROWS UNBOUNDED PRECEDING) AS quota_available,
quota_available - quota_used AS quota_raw
FROM t
) AS dt
ORDER BY datecol
The secret sauce is the moving min "correction" which adjusts negative results to zero.
simple recursive cte solution, assuming you have no gaps in dates:
with recursive cte as (
select
t.dt,
2 as quote_day,
t.quote_used,
greatest(2 - t.quote_used, 0) as quote_left
from t
where t.dt = '2013-09-16'
union all
select
t.dt,
2 + c.quote_left as quote_day,
t.quote_used,
greatest(2 + c.quote_left - t.quote_used, 0) as quote_left
from cte as c
inner join t on t.dt = c.dt + 1
)
select *
from cte
sql fiddle demo
another solution - with cumulative aggregates:
with cte1 as (
select
dt, quote_used,
sum(2 - quote_used) over(order by dt asc) as quote_raw
from t
), cte2 as (
select
dt, quote_used, quote_raw,
least(min(quote_raw) over(order by dt asc), 0) as quote_corr
from cte1
)
select
dt,
quote_raw - quote_corr + quote_used as quote_day,
quote_used,
quote_raw - quote_corr as quote_left
from cte2
sql fiddle demo
I want to count the number of 2 or more consecutive week periods that have negative values within a range of weeks.
Example:
Week | Value
201301 | 10
201302 | -5 <--| both weeks have negative values and are consecutive
201303 | -6 <--|
Week | Value
201301 | 10
201302 | -5
201303 | 7
201304 | -2 <-- negative but not consecutive to the last negative value in 201302
Week | Value
201301 | 10
201302 | -5
201303 | -7
201304 | -2 <-- 1st group of negative and consecutive values
201305 | 0
201306 | -12
201307 | -8 <-- 2nd group of negative and consecutive values
Is there a better way of doing this other than using a cursor and a reset variable and checking through each row in order?
Here is some of the SQL I have setup to try and test this:
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestOne') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestOne
IF OBJECT_ID('TempDB..#ConsecutiveNegativeWeekTestTwo') IS NOT NULL DROP TABLE #ConsecutiveNegativeWeekTestTwo
CREATE TABLE #ConsecutiveNegativeWeekTestOne
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- I have a condition where I expect to see at least 2 consecutive weeks with negative values
-- TRUE : Week 201328 & 201329 are both negative.
INSERT INTO #ConsecutiveNegativeWeekTestOne
VALUES
(201327, 5)
,(201328,-11)
,(201329,-18)
,(201330, 25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, 59)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestOne
WHERE Value < 0
ORDER BY [Week] ASC
CREATE TABLE #ConsecutiveNegativeWeekTestTwo
(
[Week] INT NOT NULL
,[Value] DECIMAL(18,6) NOT NULL
)
-- FALSE: The negative weeks are not consecutive
INSERT INTO #ConsecutiveNegativeWeekTestTwo
VALUES
(201327, 5)
,(201328,-11)
,(201329,20)
,(201330, -25)
,(201331, 30)
,(201332, -36)
,(201333, 43)
,(201334, 50)
,(201335, -15)
,(201336, 0)
,(201337, 0)
SELECT * FROM #ConsecutiveNegativeWeekTestTwo
WHERE Value < 0
ORDER BY [Week] ASC
My SQL fiddle is also here:
http://sqlfiddle.com/#!3/ef54f/2
First, would you please share the formula for calculating week number, or provide a real date for each week, or some method to determine if there are 52 or 53 weeks in any particular year? Once you do that, I can make my queries properly skip missing data AND cross year boundaries.
Now to queries: this can be done without a JOIN, which depending on the exact indexes present, may improve performance a huge amount over any solution that does use JOINs. Then again, it may not. This is also harder to understand so may not be worth it if other solutions perform well enough (especially when the right indexes are present).
Simulate a PREORDER BY windowing function (respects gaps, ignores year boundaries):
WITH Calcs AS (
SELECT
Grp =
[Week] -- comment out to ignore gaps and gain year boundaries
-- Row_Number() OVER (ORDER BY [Week]) -- swap with previous line
- Row_Number() OVER
(PARTITION BY (SELECT 1 WHERE Value < 0) ORDER BY [Week]),
*
FROM dbo.ConsecutiveNegativeWeekTestOne
)
SELECT
[Week] = Min([Week])
-- NumWeeks = Count(*) -- if you want the count
FROM Calcs C
WHERE Value < 0
GROUP BY C.Grp
HAVING Count(*) >= 2
;
See a Live Demo at SQL Fiddle (1st query)
And another way, simulating LAG and LEAD with a CROSS JOIN and aggregates (respects gaps, ignores year boundaries):
WITH Groups AS (
SELECT
Grp = T.[Week] + X.Num,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (-1), (0), (1)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 0) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
See a Live Demo at SQL Fiddle (2nd query)
And, my original second query, but simplified (ignores gaps, handles year boundaries):
WITH Groups AS (
SELECT
Grp = (Row_Number() OVER (ORDER BY T.[Week]) + X.Num) / 3,
*
FROM
dbo.ConsecutiveNegativeWeekTestOne T
CROSS JOIN (VALUES (0), (2), (4)) X (Num)
)
SELECT
[Week] = Min(C.[Week])
-- Value = Min(C.Value)
FROM
Groups G
OUTER APPLY (SELECT G.* WHERE G.Num = 2) C
WHERE G.Value < 0
GROUP BY G.Grp
HAVING
Min(G.[Week]) = Min(C.[Week])
AND Max(G.[Week]) > Min(C.[Week])
;
Note: The execution plan for these may be rated as more expensive than other queries, but there will be only 1 table access instead of 2 or 3, and while the CPU may be higher it is still respectably low.
Note: I originally was not paying attention to only producing one row per group of negative values, and so I produced this query as only requiring 2 table accesses (respects gaps, ignores year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T2.[Week] IN (T1.[Week] - 1, T1.[Week] + 1)
)
;
See a Live Demo at SQL Fiddle (3rd query)
However, I have now modified it to perform as required, showing only each starting date (respects gaps, ignored year boundaries):
SELECT
T1.[Week]
FROM
dbo.ConsecutiveNegativeWeekTestOne T1
WHERE
Value < 0
AND EXISTS (
SELECT *
FROM
dbo.ConsecutiveNegativeWeekTestOne T2
WHERE
T2.Value < 0
AND T1.[Week] - 1 <= T2.[Week]
AND T1.[Week] + 1 >= T2.[Week]
AND T1.[Week] <> T2.[Week]
HAVING
Min(T2.[Week]) > T1.[Week]
)
;
See a Live Demo at SQL Fiddle (3rd query)
Last, just for fun, here is a SQL Server 2012 and up version using LEAD and LAG:
WITH Weeks AS (
SELECT
PrevValue = Lag(Value, 1, 0) OVER (ORDER BY [Week]),
SubsValue = Lead(Value, 1, 0) OVER (ORDER BY [Week]),
PrevWeek = Lag(Week, 1, 0) OVER (ORDER BY [Week]),
SubsWeek = Lead(Week, 1, 0) OVER (ORDER BY [Week]),
*
FROM
dbo.ConsecutiveNegativeWeekTestOne
)
SELECT #Week = [Week]
FROM Weeks W
WHERE
(
[Week] - 1 > PrevWeek
OR PrevValue >= 0
)
AND Value < 0
AND SubsValue < 0
AND [Week] + 1 = SubsWeek
;
See a Live Demo at SQL Fiddle (4th query)
I am not sure I am doing this the best way as I haven't used these much, but it works nonetheless.
You should do some performance testing of the various queries presented to you, and pick the best one, considering that code should be, in order:
Correct
Clear
Concise
Fast
Seeing that some of my solutions are anything but clear, other solutions that are fast enough and concise enough will probably win out in the competition of which one to use in your own production code. But... maybe not! And maybe someone will appreciate seeing these techniques, even if they can't be used as-is this time.
So let's do some testing and see what the truth is about all this! Here is some test setup script. It will generate the same data on your own server as it did on mine:
IF Object_ID('dbo.ConsecutiveNegativeWeekTestOne', 'U') IS NOT NULL DROP TABLE dbo.ConsecutiveNegativeWeekTestOne;
GO
CREATE TABLE dbo.ConsecutiveNegativeWeekTestOne (
[Week] int NOT NULL CONSTRAINT PK_ConsecutiveNegativeWeekTestOne PRIMARY KEY CLUSTERED,
[Value] decimal(18,6) NOT NULL
);
SET NOCOUNT ON;
DECLARE
#f float = Rand(5.1415926535897932384626433832795028842),
#Dt datetime = '17530101',
#Week int;
WHILE #Dt <= '20140106' BEGIN
INSERT dbo.ConsecutiveNegativeWeekTestOne
SELECT
Format(#Dt, 'yyyy') + Right('0' + Convert(varchar(11), DateDiff(day, DateAdd(year, DateDiff(year, 0, #Dt), 0), #Dt) / 7 + 1), 2),
Rand() * 151 - 76
;
SET #Dt = DateAdd(day, 7, #Dt);
END;
This generates 13,620 weeks, from 175301 through 201401. I modified all the queries to select the Week values instead of the count, in the format SELECT #Week = Expression ... so that tests are not affected by returning rows to the client.
I tested only the gap-respecting, non-year-boundary-handling versions.
Results
Query Duration CPU Reads
------------------ -------- ----- ------
ErikE-Preorder 27 31 40
ErikE-CROSS 29 31 40
ErikE-Join-IN -------Awful---------
ErikE-Join-Revised 46 47 15069
ErikE-Lead-Lag 104 109 40
jods 12 16 120
Transact Charlie 12 16 120
Conclusions
The reduced reads of the non-JOIN versions are not significant enough to warrant their increased complexity.
The table is so small that the performance almost doesn't matter. 261 years of weeks is insignificant, so a normal business operation won't see any performance problem even with a poor query.
I tested with an index on Week (which is more than reasonable), doing two separate JOINs with a seek was far, far superior to any device to try to get the relevant related data in one swoop. Charlie and jods were spot on in their comments.
This data is not large enough to expose real differences between the queries in CPU and duration. The values above are representative, though at times the 31 ms were 16 ms and the 16 ms were 0 ms. Since the resolution is ~15 ms, this doesn't tell us much.
My tricky query techniques do perform better. They might be worth it in performance critical situations. But this is not one of those.
Lead and Lag may not always win. The presence of an index on the lookup value is probably what determines this. The ability to still pull prior/next values based on a certain order even when the order by value is not sequential may be one good use case for these functions.
you could use a combination of EXISTS.
Assuming you only want to know groups (series of consecutive weeks all negative)
--Find the potential start weeks
;WITH starts as (
SELECT [Week]
FROM #ConsecutiveNegativeWeekTestOne AS s
WHERE s.[Value] < 0
AND NOT EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS p
WHERE p.[Week] = s.[Week] - 1
AND p.[Value] < 0
)
)
SELECT COUNT(*)
FROM
Starts AS s
WHERE EXISTS (
SELECT 1
FROM #ConsecutiveNegativeWeekTestOne AS n
WHERE n.[Week] = s.[Week] + 1
AND n.[Value] < 0
)
If you have an index on Week this query should even be moderately efficient.
You can replace LEAD and LAG with a self-join.
The counting idea is basically to count start of negative sequences rather than trying to consider each row.
SELECT COUNT(*)
FROM ConsecutiveNegativeWeekTestOne W
LEFT OUTER JOIN ConsecutiveNegativeWeekTestOne Prev
ON W.week = Prev.week + 1
INNER JOIN ConsecutiveNegativeWeekTestOne Next
ON W.week = Next.week - 1
WHERE W.value < 0
AND (Prev.value IS NULL OR Prev.value > 0)
AND Next.value < 0
Note that I simply did "week + 1", which would not work when there is a year change.