SQL pivot based on fuzzy condition

I have a SQL Server 2016 database representing patients (pID), with visits (vID) to a healthcare facility. When patients move facilities, new visits are created.
I would like to piece together visits whose admit/discharge dates (represented here by example ints vdStart and vdEnd) are close to one another (fuzzy joining), and display them as extra columns, so that 1 row represents a patient's healthcare journey. Future visits that aren't close to previous visits are separate journeys.
Here's some sample data:
CREATE TABLE t
(
[pID] varchar(7),
[vID] int,
[vdStart] int,
[vdEnd] int
);
INSERT INTO t ([pID], [vID], [vdStart], [vdEnd])
VALUES
('Jenkins', 1, 100, 102),
('Jenkins', 3, 102, 110),
('Jenkins', 7, 111, 130),
('Barnaby', 2, 90, 114),
('Barnaby', 5, 114, 140),
('Barnaby', 9, 153, 158),
('Forster', 4, 100, 130),
('Smith', 6, 120, 131),
('Smith', 8, 140, 160),
('Everett', 10, 158, 165),
('Everett', 12, 165, 175),
('Everett', 15, 186, 190),
('Everett', 17, 190, 195),
('Everett', 18, 195, 199),
('Everett', 19, 199, 210)
;
Here's an example of what I want:
Visits that all correspond to the same "healthcare journey" are joined, with a new row for each journey.
I wasn't able to get the PIVOT function to do what I wanted based on a fuzzy joining logic (the ints stand in for datetimes). My approach was using LEAD, but this quickly becomes unwieldy when trying to connect more than 2 visits, and it was showing incorrect values when there were gaps in between, which I don't want.
SELECT
pID,
vdStart,
vdEnd,
vID,
(
CASE WHEN ((
LEAD (vdStart, 1) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) - vdEnd < 2) THEN (
LEAD (vID, 1) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) ELSE NULL END
) AS vID2,
(
CASE WHEN ((
LEAD (vdStart, 2) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) - (
LEAD (vdEnd, 1) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) < 2) THEN (
LEAD (vID, 2) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) ELSE NULL END
) AS vID3,
(
CASE WHEN ((
LEAD (vdStart, 3) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) - (
LEAD (vdEnd, 2) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) < 2) THEN (
LEAD (vID, 3) OVER (PARTITION BY pID ORDER BY vdStart ASC)
) ELSE NULL END
) AS vID4
FROM t
;
I'm unsure how else to approach this based on the fuzzy pivot logic I'm after. This only needs to be run occasionally, and should run in less than 10 minutes.

This is a classic gaps-and-islands problem.
One solution uses a conditional count:
Get each row's previous vdEnd using LAG.
Use a conditional count to number the groups of rows.
Use ROW_NUMBER to number each row within its group.
Group up and pivot by pID and group ID.
WITH cte1 AS (
SELECT *,
PrevEnd = LAG(t.vdEnd) OVER (PARTITION BY t.pID ORDER BY t.vdStart)
FROM t
),
cte2 AS (
SELECT *,
GroupId = COUNT(CASE WHEN cte1.PrevEnd >= cte1.vdStart - 1 THEN NULL ELSE 1 END)
OVER (PARTITION BY cte1.pID ORDER BY cte1.vdStart ROWS UNBOUNDED PRECEDING)
FROM cte1
),
Numbered AS (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY cte2.pID, cte2.GroupID ORDER BY cte2.vdStart)
FROM cte2
)
SELECT
n.pID,
vdStart = MIN(n.vdStart),
vdEnd = MIN(n.vdEnd),
vID = MIN(CASE WHEN n.rn = 1 THEN n.vID END),
vID1 = MIN(CASE WHEN n.rn = 2 THEN n.vID END),
vID2 = MIN(CASE WHEN n.rn = 3 THEN n.vID END),
vID3 = MIN(CASE WHEN n.rn = 4 THEN n.vID END)
FROM Numbered n
GROUP BY
n.pID,
n.GroupID
ORDER BY
n.pID,
n.GroupID;
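To see how the conditional count builds the groups, you can run the first two CTEs on their own. Taking Everett as an example, the ELSE 1 branch fires whenever a visit does not start within 1 of the previous visit's end, which bumps the running count:

```sql
WITH cte1 AS (
    SELECT *,
        PrevEnd = LAG(t.vdEnd) OVER (PARTITION BY t.pID ORDER BY t.vdStart)
    FROM t
)
SELECT *,
    GroupId = COUNT(CASE WHEN cte1.PrevEnd >= cte1.vdStart - 1 THEN NULL ELSE 1 END)
              OVER (PARTITION BY cte1.pID ORDER BY cte1.vdStart ROWS UNBOUNDED PRECEDING)
FROM cte1
WHERE cte1.pID = 'Everett';
-- vID 10 (158-165) and 12 (165-175) fall in GroupId 1;
-- vID 15 starts at 186, more than 1 after 175, so 15/17/18/19 form GroupId 2.
```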
Another option is a recursive algorithm:
Get all rows which are starting rows (no previous row in the sequence for this pID).
Recursively get the next row in the sequence, keeping track of the first row's vdStart.
Number the sequence results.
Group up and pivot by pID and sequence number.
WITH cte AS (
SELECT pID, vID, vdStart, vdEnd, GroupID = vdStart
FROM t
WHERE NOT EXISTS (SELECT 1
FROM t Other
WHERE Other.pID = t.pID
AND t.vdStart BETWEEN Other.vdEnd AND Other.vdEnd + 1)
UNION ALL
SELECT t.pID, t.vID, t.vdStart, t.vdEnd, cte.GroupID
FROM cte
JOIN t ON t.pID = cte.pID AND t.vdStart BETWEEN cte.vdEnd AND cte.vdEnd + 1
),
Numbered AS (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY cte.pID, cte.GroupID ORDER BY cte.vdStart)
FROM cte
)
SELECT
n.pID,
vdStart = MIN(n.vdStart),
vdEnd = MIN(n.vdEnd),
vID = MIN(CASE WHEN n.rn = 1 THEN n.vID END),
vID1 = MIN(CASE WHEN n.rn = 2 THEN n.vID END),
vID2 = MIN(CASE WHEN n.rn = 3 THEN n.vID END),
vID3 = MIN(CASE WHEN n.rn = 4 THEN n.vID END)
FROM Numbered n
GROUP BY
n.pID,
n.GroupID
ORDER BY
n.pID,
n.GroupID;
db<>fiddle

Related

SQL - Decaying Time Since Event Then Starting Over At the Next Event

There are many similar questions and answers already posted, but I could not find one with these differences: 1) the count of NULLs starts over, and 2) a math function is applied to the replaced value.
An event either takes place or not (NULL or 1), by date, by customer. You can assume that a customer has one and only one row for every date.
I want to replace the NULLs with a decay function based on the number of consecutive NULLs (time from the event). A customer can have the event every day, skip a day, or skip multiple days. But once the event takes place, the decay starts over. Currently my decay is divide-by-2, but that is just an example.
DT          CUSTOMER   EVENT   DESIRED
2022-01-01  a          1       1
2022-01-02  a          1       1
2022-01-03  a          1       1
2022-01-04  a          1       1
2022-01-05  a          1       1
2022-01-01  b          1       1
2022-01-02  b                  0.5
2022-01-03  b                  0.25
2022-01-04  b          1       1
2022-01-05  b                  0.5
I can produce the desired result, but it is very unwieldy. Looking if there is a better way. This will need to be extended for multiple event columns.
create or replace temporary table the_data (
dt date,
customer char(10),
event int,
desired float)
;
insert into the_data values ('2022-01-01', 'a', 1, 1);
insert into the_data values ('2022-01-02', 'a', 1, 1);
insert into the_data values ('2022-01-03', 'a', 1, 1);
insert into the_data values ('2022-01-04', 'a', 1, 1);
insert into the_data values ('2022-01-05', 'a', 1, 1);
insert into the_data values ('2022-01-01', 'b', 1, 1);
insert into the_data values ('2022-01-02', 'b', NULL, 0.5);
insert into the_data values ('2022-01-03', 'b', NULL, 0.25);
insert into the_data values ('2022-01-04', 'b', 1, 1);
insert into the_data values ('2022-01-05', 'b', NULL, 0.5);
with
base as (
select * from the_data
),
find_nan as (
select *, case when event is null then 1 else 0 end as event_is_nan from base
),
find_nan_diff as (
select *, event_is_nan - coalesce(lag(event_is_nan) over (partition by customer order by dt), 0) as event_is_nan_diff from find_nan
),
find_nan_group as (
select *, sum(case when event_is_nan_diff = -1 then 1 else 0 end) over (partition by customer order by dt) as nan_group from find_nan_diff
),
consec_nans as (
select *, sum(event_is_nan) over (partition by customer, nan_group order by dt) as n_consec_nans from find_nan_group
),
decay as (
select *, case when n_consec_nans > 0 then 0.5 / n_consec_nans else 1 end as decay_factor from consec_nans
),
ffill as (
select *, first_value(event) over (partition by customer order by dt) as ffill_value from decay
),
final as (
select *, ffill_value * decay_factor as the_answer from ffill
)
select * from final
order by customer, dt
;
Thanks
The query can be simplified by using CONDITIONAL_CHANGE_EVENT to generate a subgrp helper column:
WITH cte AS (
SELECT *, CONDITIONAL_CHANGE_EVENT(event IS NULL) OVER(PARTITION BY CUSTOMER
ORDER BY DT) AS subgrp
FROM the_data
)
SELECT *, COALESCE(EVENT, 0.5 / ROW_NUMBER() OVER(PARTITION BY CUSTOMER, SUBGRP
ORDER BY DT)) AS computed_decay
FROM cte
ORDER BY CUSTOMER, DT;
Output:
EDIT:
Without using CONDITIONAL_CHANGE_EVENT:
WITH cte AS (
SELECT *,
CASE WHEN
event = LAG(event,1, event) OVER(PARTITION BY customer ORDER BY dt)
OR (event IS NULL AND LAG(event) OVER(PARTITION BY customer ORDER BY dt) IS NULL)
THEN 0 ELSE 1 END AS l
FROM the_data
), cte2 AS (
SELECT *, SUM(l) OVER(PARTITION BY customer ORDER BY dt) AS SUBGRP
FROM cte
)
SELECT *, COALESCE(EVENT, 0.5 / ROW_NUMBER() OVER(PARTITION BY CUSTOMER, SUBGRP
ORDER BY DT)) AS computed_decay
FROM cte2
ORDER BY CUSTOMER, DT;
db<>fiddle demo

How to create a stored procedure to run a script in Snowflake?

I want to execute a SQL script on a schedule in Snowflake. There are several CTEs and a JavaScript UDF in my query.
My idea is to:
Store the query result in a table.
Create a stored procedure.
Create a task to execute the procedure.
How can I create one stored procedure for the entire query and update the result set?
Example data:
WITH dataset AS (
select $1 id, $2 status, $3 created_at, $4 completed_at
from values
('A','created','2021-07-15 10:30:00'::timestamp, NULL), ('A','missing_info','2021-07-15 11:10:00'::timestamp,NULL),
('A','pending','2021-07-15 12:05:00'::timestamp, NULL), ('A','successful','2021-07-15 16:05:00'::timestamp,'2021-07-15 17:05:00'::timestamp),
('B','created','2021-07-16 11:30:00'::timestamp, NULL), ('B',NULL,'2021-07-16 11:30:00'::timestamp, NULL),
('B','successful','2021-07-16 12:30:00'::timestamp, '2021-07-16 16:30:00'::timestamp) )
UDF to calculate timediff:
create or replace function tsrange_intersection(s string, e string)
RETURNS double
LANGUAGE JAVASCRIPT
AS
$$
let minutes = 0
start = new Date(S)
end = new Date(E)
let t = start
while(t < end) {
if ([1, 2, 3, 4, 5].includes(t.getDay())
&& [9, 10, 11, 12, 13, 14, 15, 16].includes(t.getHours())) {
minutes += 1
}
t = new Date(t.getTime() + 60*1000);
}
return minutes
$$;
The query (continuing from the dataset CTE above):
-- simple filtering
cte1 AS (
SELECT *
FROM dataset
WHERE id IS NOT NULL AND status IS NOT NULL
ORDER BY created_at ),
-- retrieve only the id's which started from 'created'
cte2 AS (
SELECT *
FROM cte1
QUALIFY FIRST_VALUE(status) OVER (PARTITION BY id ORDER BY created_at) = 'created' )
-- pattern match
SELECT *
FROM cte2
MATCH_RECOGNIZE (
PARTITION BY id
ORDER BY created_at
MEASURES MATCH_NUMBER() AS mn,
MATCH_SEQUENCE_NUMBER AS msn
ALL ROWS PER MATCH
PATTERN (c+i+|p+i+|ps+)
DEFINE
c AS status='created',
i AS status='missing_info',
p AS status='pending',
s AS status='successful'
) mr
QUALIFY (ROW_NUMBER() OVER (PARTITION BY mn, id ORDER BY msn) = 1)
OR(ROW_NUMBER() OVER (PARTITION BY mn, id ORDER BY msn DESC) =1)
ORDER BY id, created_at;
-- retrieve the result set above
WITH cte3 AS (
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
),
-- start time of each status
cte4 AS (
SELECT
*,
CASE WHEN status = 'successful' THEN IFNULL(completed_at, created_at) ELSE created_at END AS start_timestamp
FROM cte3 ),
-- end time of each status
cte5 AS (
SELECT
*,
CASE WHEN status = 'successful' THEN COMPLETED_AT
ELSE IFNULL(LAG(start_timestamp, -1) OVER (PARTITION BY id ORDER BY start_timestamp), completed_at) END AS end_timestamp
FROM cte4 )
-- final query
SELECT
id, status, start_timestamp, end_timestamp,
tsrange_intersection(start_timestamp, end_timestamp) AS time_diff
FROM cte5
ORDER BY start_timestamp
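The three steps can be sketched in Snowflake SQL. Everything below is a hypothetical outline, not the exact solution: the names journey_status, refresh_journey_status, refresh_journey_status_task, and my_wh are made up, and the RESULT_SCAN(LAST_QUERY_ID()) trick in the query above only works in an interactive session, so inside the procedure the MATCH_RECOGNIZE part would need to be folded into a single statement (e.g. as one more CTE) or written to an intermediate table first:

```sql
-- 1. Materialize the query result once.
CREATE OR REPLACE TABLE journey_status AS
SELECT ...;  -- the full CTE query from above, combined into one statement

-- 2. Wrap the refresh in a stored procedure (Snowflake Scripting).
CREATE OR REPLACE PROCEDURE refresh_journey_status()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    CREATE OR REPLACE TABLE journey_status AS
    SELECT ...;  -- same combined query
    RETURN 'done';
END;
$$;

-- 3. Schedule it with a task.
CREATE OR REPLACE TASK refresh_journey_status_task
    WAREHOUSE = my_wh          -- hypothetical warehouse
    SCHEDULE = '60 MINUTE'
AS
    CALL refresh_journey_status();

ALTER TASK refresh_journey_status_task RESUME;
```

Once the task is resumed, querying journey_status always reflects the last scheduled refresh.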

Mandt Specified Multiple Times But Not in Query

I get this error
Msg 8156, Level 16, State 1, Line 67
The column 'MANDT' was specified multiple times for 'cte'.
when attempting to run the code below, even though I am not including the column MANDT in my query. Both tables that I am joining have a MANDT column, and both have a STAT column as well. I did not have this problem with another table in the same kind of join, but that table did not have MANDT; only STAT was shared.
I attempted to alias both MANDT columns (JCDS_SOGR.MANDT as Client and TJ30T.MANDT as Client2), separately and together, but this did not pan out: I got the same error message.
;WITH cte AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY STAT ORDER BY UDATE) AS Rn,
*,
LAG(UDATE) OVER (PARTITION BY STAT ORDER BY UDATE) AS PrevUDate,
COUNT(*) OVER (PARTITION BY STAT) AS [Count]
FROM
JCDS_SOGR
JOIN
TJ30T on JCDS_SOGR.STAT = TJ30T.ESTAT
WHERE
OBJNR = 'IE000000000010003137'
)
SELECT
MAX(rn) AS [Count],
OBJNR, STAT, TXT30,
SUM(CASE
WHEN rn % 2 = 0
THEN DATEDIFF(d, PrevUDate, UDATE)
WHEN rn = [Count]
THEN DATEDIFF(d, UDATE, GETDATE())
ELSE 0
END) AS DIF
FROM
cte
GROUP BY
OBJNR, STAT, TXT30
This is the other query I referred to that works fine with this same code.
;with cte
AS
(
select ROW_NUMBER() OVER(partition by STAT Order by UDATE ) as Rn
, *
, LAG(UDATE) OVER(partition by STAT Order by UDATE ) As PrevUDate
, COUNT(*) OVER(partition by STAT) As [Count]
from JCDS_SOGR
join TJ02T on JCDS_SOGR.STAT = TJ02T.ISTAT
where OBJNR = 'IE000000000010003137'
and TJ02T.SPRAS = 'E'
)
select Max(rn) As [Count]
, OBJNR,STAT,TXT30
, SUM(CASE WHEN rn%2=0 THEN DATEDIFF(d,PrevUDate,UDATE)
WHEN rn=[Count] THEN DATEDIFF(d,UDATE,getDate())
ELSE 0 END) as DIF
from cte
group BY OBJNR, STAT,TXT30
The expected result is this:
COUNT  OBJNR                 STAT   TXT30      DIF
1      IE000000000010003137  I0099  Available  2810
In your CTE you are selecting *, so the join returns both tables' MANDT columns, and every column of a CTE must have a unique name. Remove the * and list the columns you need, aliasing any duplicates. That should fix the problem you described.
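A sketch of what that looks like here, with the CTE's select list written out (the alias names Client and Client2 are arbitrary):

```sql
;WITH cte AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY j.STAT ORDER BY j.UDATE) AS Rn,
        j.OBJNR,
        j.STAT,
        j.UDATE,
        t.TXT30,
        j.MANDT AS Client,    -- aliased: the two MANDT columns no longer collide
        t.MANDT AS Client2,
        LAG(j.UDATE) OVER (PARTITION BY j.STAT ORDER BY j.UDATE) AS PrevUDate,
        COUNT(*) OVER (PARTITION BY j.STAT) AS [Count]
    FROM JCDS_SOGR j
    JOIN TJ30T t ON j.STAT = t.ESTAT
    WHERE j.OBJNR = 'IE000000000010003137'
)
SELECT ...  -- same outer query as before, now against unique column names
```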

Entry and Exit points on times series chart data

Is the following actually possible in SQL?
I have some time-series data and I want to extract some entry and exit points based on prices.
Desired output:
Example Data:
SQL Data:
CREATE TABLE Control
([PKey] int, [TimeStamp] datetime, [Name] varchar(10), [Price1] float, [Price2] float);
INSERT INTO Control ([PKey], [TimeStamp], [Name], [Price1], [Price2])
VALUES
(1,'2018-10-01 09:00:00', 'Name1',120, 125),
(2,'2018-10-01 09:10:00', 'Name1',110, 115),
(3,'2018-10-01 09:20:00', 'Name1',101, 106),
(4,'2018-10-01 09:30:00', 'Name1',105, 110),
(5,'2018-10-01 09:40:00', 'Name1',106, 111),
(6,'2018-10-01 09:50:00', 'Name1',108, 113),
(7,'2018-10-01 10:00:00', 'Name1',110, 115),
(8,'2018-10-01 10:10:00', 'Name1',104, 109),
(9,'2018-10-01 10:20:00', 'Name1',101, 106),
(10,'2018-10-01 10:30:00', 'Name1',99, 104),
(11,'2018-10-01 10:40:00', 'Name1',95, 100),
(12,'2018-10-01 10:50:00', 'Name1',101, 106),
(13,'2018-10-01 11:00:00', 'Name1',102, 107),
(14,'2018-10-01 11:10:00', 'Name1',101, 106),
(15,'2018-10-01 11:20:00', 'Name1',99, 104),
(16,'2018-10-01 11:30:00', 'Name1',105, 110),
(17,'2018-10-01 11:40:00', 'Name1',108, 113),
(18,'2018-10-01 11:50:00', 'Name1',108, 113),
(19,'2018-10-01 12:00:00', 'Name1',109, 114),
(20,'2018-10-01 12:10:00', 'Name1',108, 113),
(21,'2018-10-01 12:20:00', 'Name1',105, 110),
(22,'2018-10-01 12:30:00', 'Name1',101, 106),
(23,'2018-10-01 12:40:00', 'Name1',102, 107),
(24,'2018-10-01 09:00:00', 'Name2',103, 108),
(25,'2018-10-01 09:10:00', 'Name2',101, 106),
(26,'2018-10-01 09:20:00', 'Name2',104, 109),
(27,'2018-10-01 09:30:00', 'Name2',106, 111),
(28,'2018-10-01 09:40:00', 'Name2',108, 113),
(29,'2018-10-01 09:50:00', 'Name2',108, 113),
(30,'2018-10-01 10:00:00', 'Name2',105, 110),
(31,'2018-10-01 10:10:00', 'Name2',103, 108),
(32,'2018-10-01 10:20:00', 'Name2',101, 106),
(33,'2018-10-01 10:30:00', 'Name2',99, 104),
(34,'2018-10-01 10:40:00', 'Name2',101, 106),
(35,'2018-10-01 10:50:00', 'Name2',104, 109),
(36,'2018-10-01 11:00:00', 'Name2',101, 106),
(37,'2018-10-01 11:10:00', 'Name2',99, 104),
(38,'2018-10-01 11:20:00', 'Name2',106, 111),
(39,'2018-10-01 11:30:00', 'Name2',103, 108),
(40,'2018-10-01 11:40:00', 'Name2',105, 110),
(41,'2018-10-01 11:50:00', 'Name2',108, 113),
(42,'2018-10-01 12:00:00', 'Name2',105, 110),
(43,'2018-10-01 12:10:00', 'Name2',104, 109),
(44,'2018-10-01 12:20:00', 'Name2',108, 113),
(45,'2018-10-01 12:30:00', 'Name2',110, 115),
(46,'2018-10-01 12:40:00', 'Name2',105, 110)
;
What have I tried:
I am able to get the first instance of an entry and exit point using the following query, which finds the first entry point PKey and then the first exit point after it.
declare @EntryPrice1 float = 101.0; -- Entry when Price1 <= 101.0 (when not already Entered)
declare @ExitPrice2 float = 113.0; -- Exit when Price2 >= 113.0 (after Entry only)
select
t1.[Name]
,t2.[Entry PKey]
,min(case when t1.[Price2] >= @ExitPrice2 and t1.[PKey] > t2.[Entry PKey] then t1.[PKey] else null end) as [Exit PKey]
from [dbo].[Control] t1
left outer join
(select min(case when [Price1] <= @EntryPrice1 then [PKey] else null end) as [Entry PKey]
,[Name]
from [dbo].[Control]
group by [Name]) t2
on t1.[Name] = t2.[Name]
group by t1.[Name],t2.[Entry PKey]
--Name Entry PKey Exit PKey
--Name1 3 6
--Name2 25 28
I'm stuck on an approach that will allow multiple entry/exit points to be returned, and I'm not sure it's even possible in SQL.
The logic for entry and exit points is:
Entry - when Price1 <= 101.0 and not already in an entry that has not exited.
Exit - when Price2 >= 113.0 and inside an entry.
It's a kind of gaps-and-islands problem. This is a generic solution using windowed aggregates (it should work for most DBMSes):
declare @EntryPrice1 float = 101.0; -- Entry when Price1 <= 101.0 (when not already Entered)
declare @ExitPrice2 float = 113.0; -- Exit when Price2 >= 113.0 (after Entry only)
WITH cte AS
( -- apply your logic to mark potential entry and exit rows
SELECT *
,CASE WHEN Price1 <= @EntryPrice1 THEN Timestamp END AS possibleEntry
,CASE WHEN Price2 >= @ExitPrice2 THEN Timestamp END AS possibleExit
,Max(CASE WHEN Price1 <= @EntryPrice1 THEN Timestamp END) -- most recent possibleEntry
Over (PARTITION BY Name
ORDER BY Timestamp
ROWS Unbounded Preceding) AS lastEntry
,Max(CASE WHEN Price2 >= @ExitPrice2 THEN Timestamp END) -- most recent possibleExit
Over (PARTITION BY Name
ORDER BY Timestamp
ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS lastExit
FROM [dbo].[Control]
)
-- SELECT * FROM cte ORDER BY Name, PKey
,groupRows AS
( -- mark rows from the 1st entry to the exit row
SELECT *
-- if lastEntry <= lastExit we're after an exit and before an entry -> don't return this row
,CASE WHEN lastEntry <= lastExit THEN 0 ELSE 1 END AS returnFlag
-- assign the same group number to consecutive rows in group
,Sum(CASE WHEN lastEntry <= lastExit THEN 1 ELSE 0 END)
Over (PARTITION BY Name
ORDER BY Timestamp
ROWS Unbounded Preceding) AS grp
FROM cte
WHERE (possibleEntry IS NOT NULL OR possibleExit IS NOT NULL)
AND lastEntry IS NOT NULL
)
-- SELECT * FROM groupRows ORDER BY Name, PKey
,rowNum AS
( -- get the data from the first and last row of an entry/exit group
SELECT *
-- to get the values of the 1st row in a group
,Row_Number() Over (PARTITION BY Name, grp ORDER BY Timestamp) AS rn
-- to get the values of the last row in a group
,Last_Value(Price2)
Over (PARTITION BY Name, grp
ORDER BY Timestamp
ROWS BETWEEN Unbounded Preceding AND Unbounded Following) AS ExitPrice
,Last_Value(possibleExit)
Over (PARTITION BY Name, grp
ORDER BY Timestamp
ROWS BETWEEN Unbounded Preceding AND Unbounded Following) AS ExitTimestamp
,Last_Value(CASE WHEN possibleExit IS NOT NULL THEN PKey END)
Over (PARTITION BY Name, grp
ORDER BY Timestamp
ROWS BETWEEN Unbounded Preceding AND Unbounded Following) AS ExitPKey
FROM groupRows
WHERE returnFlag = 1
)
SELECT Name
,Price1 AS EntryPrice
,ExitPrice
,Timestamp AS EntryTimestamp
,ExitTimestamp
,PKey AS EntryPKey
,ExitPKey
FROM rowNum
WHERE rn = 1 -- return 1st row of each group
ORDER BY Name, Timestamp
See dbfiddle
Of course it might be possible to simplify the logic or apply some proprietary SQL Server syntax...
This is a weird form of gaps-and-islands. Start with the very basic definitions of entry and exit:
select c.*,
(case when [Price1] <= @EntryPrice1 then 1 else 0 end) as is_entry,
(case when [Price2] >= @ExitPrice2 then 1 else 0 end) as is_exit
from control c;
This doesn't quite work because two adjacent "entries" count only as a single entry. We can get the information we need by looking at the previous entry/exit time. With that logic, we can determine which entries are "real". We might as well get the next exit time as well:
with cee as (
select c.*,
(case when [Price1] <= @EntryPrice1 then 1 else 0 end) as is_entry,
(case when [Price2] >= @ExitPrice2 then 1 else 0 end) as is_exit
from control c
),
cp as (
select cee.*,
max(case when is_entry = 1 then pkey end) over (partition by name order by timestamp rows between unbounded preceding and 1 preceding) as prev_entry,
max(case when is_exit = 1 then pkey end) over (partition by name order by timestamp) as prev_exit,
min(case when is_exit = 1 then pkey end) over (partition by name order by timestamp desc) as next_exit
from cee
)
Next, use this logic to generate a cumulative sum of real entries, and then do some fancy filtering:
with cee as (
select c.*,
(case when [Price1] <= @EntryPrice1 then 1 else 0 end) as is_entry,
(case when [Price2] >= @ExitPrice2 then 1 else 0 end) as is_exit
from control c
),
cp as (
select cee.*,
max(case when is_entry = 1 then pkey end) over (partition by name order by timestamp rows between unbounded preceding and 1 preceding) as prev_entry,
max(case when is_exit = 1 then pkey end) over (partition by name order by timestamp) as prev_exit,
min(case when is_exit = 1 then pkey end) over (partition by name order by timestamp desc) as next_exit
from cee
)
select *
from cp
where cp.is_entry = 1 and
(prev_entry is null or prev_exit > prev_entry)
This gives you the rows where the entry starts. You can join in to get the additional information you want.
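For example, assuming the cp CTE from above, one way to pair each qualifying entry with its exit row is to join back to Control on next_exit (a LEFT JOIN keeps an entry that never exits):

```sql
SELECT cp.[Name],
       cp.[PKey]      AS EntryPKey,
       cp.[TimeStamp] AS EntryTimestamp,
       cp.[Price1]    AS EntryPrice,
       ex.[PKey]      AS ExitPKey,
       ex.[TimeStamp] AS ExitTimestamp,
       ex.[Price2]    AS ExitPrice
FROM cp
LEFT JOIN [dbo].[Control] ex
    ON ex.[PKey] = cp.next_exit
WHERE cp.is_entry = 1
  AND (cp.prev_entry IS NULL OR cp.prev_exit > cp.prev_entry);
```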

SQL Server query to do a Group By-like task

I have a table with SQL server as below,
Date Value
---------------------------------------------------
08-01-2016 1
08-02-2016 1
08-03-2016 1
08-04-2016 1
08-05-2016 1
08-06-2016 2
08-07-2016 2
08-08-2016 2
08-09-2016 2.5
08-10-2016 1
08-11-2016 1
Since the original table is too large, even when I used 'Results to file' it still raised a 'System.OutOfMemoryException'. That's why I want to condense the table.
I don't have good logic to deal with this, so I want to change the table into the form below.
Date_from Date_to Value
-------------------------------------------------
08-01-2016 08-05-2016 1
08-06-2016 08-08-2016 2
08-09-2016 08-09-2016 2.5
08-10-2016 08-11-2016 1
I appreciate your ideas!
This is commonly called a gaps-and-islands problem. Here is one trick to do it:
;WITH data
AS (SELECT *,Lag(Value, 1)OVER(ORDER BY Dates) [pVal]
FROM (VALUES ('08-01-2016',1 ),
('08-02-2016',1 ),
('08-03-2016',1 ),
('08-04-2016',1 ),
('08-05-2016',1 ),
('08-06-2016',2 ),
('08-07-2016',2 ),
('08-08-2016',2 ),
('08-09-2016',2.5 ),
('08-10-2016',1 ),
('08-11-2016',1 )) tc (Dates, Value)),
intr
AS (SELECT Dates,
Value,
Sum(Iif(pVal = Value, 0, 1)) OVER(ORDER BY Dates) AS [Counter]
FROM data)
SELECT Min(Dates) AS Dates_from,
Max(Dates) AS Dates_to,
Value
FROM intr
GROUP BY [Counter],
Value
The cumulative sum/lag approach is one method. In this case, a simpler method is:
select min(date) as date_from, max(date) as date_to, value
from (select t.*,
dateadd(day, - row_number() over (partition by value order by date),date) as grp
from t
) t
group by value, grp;
This uses the observation that the dates are consecutive with no gaps. Hence, subtracting a sequence number from the date yields a constant when the values are the same.
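Here is the subtraction worked out by hand for the sample data; within a value the difference stays constant across consecutive days, and it changes when the value re-appears after a break:

```sql
-- value 1:   08-01 - 1 day = 07-31, 08-02 - 2 = 07-31, ..., 08-05 - 5 = 07-31  -> grp 07-31
--            08-10 - 6 = 08-04, 08-11 - 7 = 08-04                              -> grp 08-04
-- value 2:   08-06 - 1 = 08-05, 08-07 - 2 = 08-05, 08-08 - 3 = 08-05           -> grp 08-05
-- value 2.5: 08-09 - 1 = 08-08                                                 -> grp 08-08
select t.*,
       dateadd(day, - row_number() over (partition by value order by date), date) as grp
from t
order by value, date;
```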
Here is an example:
DECLARE #T TABLE (
[Date] DATE,
[Value] DECIMAL(9,2)
)
INSERT #T VALUES
( '08-01-2016', 1 ),
( '08-02-2016', 1 ),
( '08-03-2016', 1 ),
( '08-04-2016', 1 ),
( '08-05-2016', 1 ),
( '08-06-2016', 2 ),
( '08-07-2016', 2 ),
( '08-08-2016', 2 ),
( '08-09-2016', 2.5 ),
( '08-10-2016', 1 ),
( '08-11-2016', 1 )
SELECT * FROM #T
SELECT A.[Date] StartDate, B.[Date] EndDate, A.[Value] FROM (
SELECT A.*, ROW_NUMBER() OVER (ORDER BY A.[Date], A.[Value]) O FROM #T A
LEFT JOIN #T B ON B.[Value] = A.[Value] AND B.[Date] = DATEADD(d, -1, A.[Date])
WHERE B.[Date] IS NULL
) A
JOIN (
SELECT A.*, ROW_NUMBER() OVER (ORDER BY A.[Date], A.[Value]) O FROM #T A
LEFT JOIN #T B ON B.[Value] = A.[Value] AND B.[Date] = DATEADD(d, 1, A.[Date])
WHERE B.[Date] IS NULL
) B ON B.O = A.O
Prdp's solution is great, but in case anyone is still using SQL Server 2008, where LAG() (introduced in SQL Server 2012) is not available, here is an alternative:
SAMPLE DATA:
IF OBJECT_ID('tempdb..#Temp') IS NOT NULL
DROP TABLE #Temp;
CREATE TABLE #Temp([Dates] DATE
, [Value] FLOAT);
INSERT INTO #Temp([Dates]
, [Value])
VALUES
('08-01-2016'
, 1),
('08-02-2016'
, 1),
('08-03-2016'
, 1),
('08-04-2016'
, 1),
('08-05-2016'
, 1),
('08-06-2016'
, 2),
('08-07-2016'
, 2),
('08-08-2016'
, 2),
('08-09-2016'
, 2.5),
('08-10-2016'
, 1),
('08-11-2016'
, 1);
QUERY:
;WITH Seq
AS (SELECT SeqNo = ROW_NUMBER() OVER(ORDER BY [Dates]
, [Value])
, t.Dates
, t.[Value]
FROM #Temp t)
SELECT StartDate = MIN([Dates])
, EndDate = MAX([Dates])
, [Value]
FROM
(SELECT [Value]
, [Dates]
, SeqNo
, rn = SeqNo - ROW_NUMBER() OVER(PARTITION BY [Value] ORDER BY SeqNo)
FROM Seq s) a
GROUP BY [Value]
, rn
ORDER BY StartDate;
RESULTS: