A follow-up question on a Gaps and Islands solution - sql

This is a continuation of my previous question, A question again on cursors in SQL Server.
To reiterate, I get values from a sensor as 0 (off) or 1 (on) every 10 seconds. I need to log the on times, i.e. when the sensor value is 1, in another table.
I will process the data every minute (which means I will have 6 rows of data). I needed a way to do this without using cursors and was answered by @Charlieface:
WITH cte1 AS (
    SELECT *,
        PrevValue = LAG(t.Value) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
    FROM YourTable t
),
cte2 AS (
    SELECT *,
        NextTime = LEAD(t.Timestamp) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
    FROM cte1 t
    WHERE (t.Value <> t.PrevValue OR t.PrevValue IS NULL)
)
SELECT
    t.SlaveID,
    t.Register,
    StartTime = t.Timestamp,
    Endtime = t.NextTime
FROM cte2 t
WHERE t.Value = 1;
db<>fiddle
The raw data set and desired outcome are as below. Here register 250 represents the sensor, value represents the reading as 0 or 1, and timestamp represents the time the value was read.
SlaveID  Register  Value  Timestamp  ProcessTime
------------------------------------------------
3        250       0      13:30:10   NULL
3        250       0      13:30:20   NULL
3        250       1      13:30:30   NULL
3        250       1      13:30:40   NULL
3        250       1      13:30:50   NULL
3        250       1      13:31:00   NULL
3        250       0      13:31:10   NULL
3        250       0      13:31:20   NULL
3        250       0      13:32:30   NULL
3        250       0      13:32:40   NULL
3        250       1      13:32:50   NULL
The required entry in the logging table is
SlaveID  Register  StartTime  Endtime
-------------------------------------
3        250       13:30:30   13:31:10
3        250       13:32:50   NULL      //value is still 1
The solution given works fine, but when the next set of data is processed, the existing open entry (end time is NULL) has to be considered.
If the next set of values is all 1s, then no entry is to be made in the log table, since the value was 1 in the previous set of data and continues to be 1. When the value changes to 0 in one of the sets, the end time of the open entry should be updated with that time. A fresh row is to be inserted in the log table when the value becomes 1 again.

I solved the issue by using a 'hybrid'. I get 250 rows (values of 250 sensors polled) every 10 seconds and process the data once every 180 seconds, so about 4,500 records per batch, which I process using the CTE. That yields a result set of around 250 records (a few more than 250 if some signals have changed state). I insert this into a #table (of the table being processed) and use a cursor on this #table to check and insert into the log table. Since there are only around 250 rows, the cursor runs without issue.
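For completeness, the cursor step could probably also be done set-based. Below is an untested sketch: it assumes the island rows of the current batch are first materialized (e.g. SELECT * INTO #islands FROM cte2, before the final Value = 1 filter), so that PrevValue IS NULL marks the first reading of the batch per (SlaveID, Register):

-- 1) Insert fresh "on" islands; skip the island that starts the batch
--    (PrevValue IS NULL) when an open log row already covers it.
INSERT INTO Table_signal_on_log (SlaveID, Register, StartTime, Endtime)
SELECT i.SlaveID, i.Register, i.Timestamp, i.NextTime
FROM #islands i
WHERE i.Value = 1
  AND NOT (i.PrevValue IS NULL
           AND EXISTS (SELECT 1 FROM Table_signal_on_log l
                       WHERE l.SlaveID = i.SlaveID
                         AND l.Register = i.Register
                         AND l.Endtime IS NULL));

-- 2) Close or extend the open log row using the island that starts the batch:
--    batch starts at 1 -> the island continues the open row (Endtime = NextTime,
--                         which is NULL again if the signal is still on);
--    batch starts at 0 -> close at the first 0 reading (the real transition
--                         happened between batches and is not recorded).
UPDATE l
SET l.Endtime = CASE WHEN i.Value = 1 THEN i.NextTime ELSE i.Timestamp END
FROM Table_signal_on_log l
JOIN #islands i
  ON i.SlaveID = l.SlaveID
 AND i.Register = l.Register
WHERE l.Endtime IS NULL
  AND i.PrevValue IS NULL
  AND l.StartTime < i.Timestamp;  -- only rows that predate this batch

The insert has to run before the update, so that the open row is still open when the continuation island is checked.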
Thanks to @Charlieface for the original answer.

Related

A question again on cursors in SQL Server

I am reading data using Modbus. The data contains the status of 250 registers in a PLC as either off or on, with the time of reading as the timestamp. The raw data received is stored in a table as below, where the column Register represents the register read and the column Value represents the status of the register as 0 or 1, with a timestamp. In the sample I am showing data for just one register (i.e. 250). SlaveID represents the PLC from which the data was obtained.
I need to populate another table, Table_signal_on_log, from the raw data table. This table should contain the time at which the value changed to 1 as the start time and the time at which it changed back to 0 as the end time. This table is also given below.
I am able to do it with a cursor, but it is slow, and if the number of signals increases it could slow down the processing. How can I do it without a cursor? I tried set-based operations but couldn't get one working. I need to avoid repeat values, i.e. after recording 13:30:30 as the time at which the signal becomes 1, I have to ignore all entries until it becomes 0 and record that as the end time, then again ignore all values until it becomes 1. This process is done once every 20 seconds (it can be done at any interval, but presently 20), so I may have 500 rows to loop through every time. This may increase as the number of PLCs connected increases, and the cursor operation is bound to be an issue.
Raw data table
SlaveID Register Value Timestamp ProcessTime
-------------------------------------------------------
3 250 0 13:30:10 NULL
3 250 0 13:30:20 NULL
3 250 1 13:30:30 NULL
3 250 1 13:30:40 NULL
3 250 1 13:30:50 NULL
3 250 1 13:31:00 NULL
3 250 0 13:31:10 NULL
3 250 0 13:31:20 NULL
3 250 0 13:32:30 NULL
3 250 0 13:32:40 NULL
3 250 1 13:32:50 NULL
Table_signal_on_log
SlaveID Register StartTime Endtime
3 250 13:30:30 13:31:10
3 250 13:32:50 NULL //value is still 1
This is a classic gaps-and-islands problem, and there are a number of solutions. Here is one:
Get the previous Value for each row using LAG.
Filter so we only have rows where the previous Value is different or non-existent; in other words, the beginning of an "island" of rows.
Of those rows, get the next Timestamp for each row using LEAD.
Filter so we only have Value = 1.
WITH cte1 AS (
    SELECT *,
        PrevValue = LAG(t.Value) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
    FROM YourTable t
),
cte2 AS (
    SELECT *,
        NextTime = LEAD(t.Timestamp) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
    FROM cte1 t
    WHERE (t.Value <> t.PrevValue OR t.PrevValue IS NULL)
)
SELECT
    t.SlaveID,
    t.Register,
    StartTime = t.Timestamp,
    Endtime = t.NextTime
FROM cte2 t
WHERE t.Value = 1;
db<>fiddle

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;

CREATE TABLE IF NOT EXISTS analytics.measurement_table (
    hour smallint NOT NULL DEFAULT NULL,
    measurement smallint NOT NULL DEFAULT NULL
);

INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
       ( 1, 1), ( 1, 1),
       (10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour  Count  Sum
----  -----  ---
0     1      1
1     2      2
2     0      0
3     0      0
4     0      0
5     0      0
6     0      0
7     0      0
8     0      0
9     0      0
10    3      10
11    0      0
12    0      0
This works correctly:
WITH hour_series AS (
    SELECT * FROM generate_series(0, 12) AS hour
)
SELECT hour_series.hour,
       count(measurement_table.hour) AS frequency,
       COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns a misleading 1 on the hours with no match:
WITH hour_series AS (
    SELECT * FROM generate_series(0, 12) AS hour
)
SELECT hour_series.hour,
       count(*) AS frequency,
       COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
Hour  Count  Sum
----  -----  ---
0     1      1
1     2      2
2     1      0
3     1      0
4     1      0
5     1      0
6     1      0
7     1      0
8     1      0
9     1      0
10    3      10
11    1      0
12    1      0
The only difference between these two examples is the count term:
count(*) -- a result of 1 on no match, and a correct count otherwise.
count(joined table field) -- 0 on no match, and a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1, since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0, since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as an argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres-specific; it belongs to the ANSI SQL standard (all databases that I know of conform to it).
Bottom line:
in general cases, use count(*); it is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count()
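A minimal illustration of the difference (an assumed example, not from the original post): count(*) counts every row in the group, while count(x) skips rows where x is null.

SELECT count(*) AS all_rows,   -- 3: every row in the group
       count(x) AS non_null_x  -- 2: the NULL row is ignored
FROM (VALUES (1), (NULL), (2)) AS t(x);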

Oracle PL\SQL to Update Data with the 2nd minimum values between two dates

I have the following problem related to industrial pump readings. A pump usually has a meter that keeps a record of the volume of material processed by that specific pump. Sometimes the meter needs to be replaced, either with an entirely new meter (meter reading starts at 0) or with an old working meter (meter reading can be more than 0). I have a dataset that keeps the maintenance record of the pump with meter readings.
The only indication of a meter change is data in the OLD_METER_READING column; otherwise it is blank.
In the ideal scenario the data looks like the following:
PUMP_NO  INSPECTION_DATE  MAINTENANCE_TASK  METER_READING  OLD_METER_READING  TOTAL_PUMP_LIFE
11       11-AUG-2000      A                 12489                             12489
11       14-JUL-2001      B                 14007                             14007
11       03-SEP-2002      Y                 0              14007              14007
11       03-SEP-2002      C                 0              14007              14007
11       03-SEP-2002      B                 0              14007              14007
11       04-JUN-2003      A                 1200                              16007
11       21-DEC-2003      A                 8000                              22007
11       23-FEB-2004      Y                 0              10000              24007
11       26-MAY-2004      B                 10                                24017
11       26-MAY-2004      P                 20                                24027
11       26-MAY-2004      R                 300                               24307
11       04-OCT-2004      B                 2312                              26319
11       31-MAR-2005      A                 2889                              26896
11       06-NOV-2006      V                 5000                              29007
11       14-JUL-2008      T                 0              7000               31007
However, in many cases the pump technician will make a mistake in logging METER_READING during a change of meter, so the data may end up looking like:
PUMP_NO  INSPECTION_DATE  MAINTENANCE_TASK  METER_READING  OLD_METER_READING  TOTAL_PUMP_LIFE
11       11-AUG-2000      A                 12489                             12489
11       14-JUL-2001      B                 14007                             14007
11       03-SEP-2002      Y                 0              14007              14007
11       03-SEP-2002      C                 0              14007              14007
11       03-SEP-2002      B                 0              14007              14007
11       04-JUN-2003      A                 1200                              16007
11       21-DEC-2003      A                 8000                              22007
11       23-FEB-2004      Y                 0              10000              24007
11       26-MAY-2004      B                 10000                             34007
11       26-MAY-2004      P                 10000                             34007
11       26-MAY-2004      R                 10000                             34007
11       04-OCT-2004      B                 2312                              26319
11       31-MAR-2005      A                 2889                              26896
11       06-NOV-2006      V                 5000                              29007
11       14-JUL-2008      T                 0              7000               31007
The mistake in the 2nd set of data is that on 26-MAY-2004 the technician, rather than logging the actual METER_READING, used the last METER_READING from the old meter as the new METER_READING. However, correct METER_READING values were logged again from 04-OCT-2004. We have numerous occasions where, for a specific pump (PUMP_NO), erroneous METER_READING values are entered in the database after a meter change event. It also creates wrong and confusing values for TOTAL_PUMP_LIFE.
So, to correct the data we want to add another column to the table and update it with an Oracle procedure that checks the METER_READING field with the following logic:
Check the data between two subsequent meter change events (for example, in this case between the 1st meter change on 03-SEP-2002 and the 2nd meter change on 23-FEB-2004, and again between the 2nd meter change on 23-FEB-2004 and the 3rd meter change on 14-JUL-2008).
If a METER_READING within any of these periods is higher on an earlier date than a METER_READING on a later date, then update the higher METER_READING with the 2nd lowest value in that period (0 and 2312 are the two lowest, so update with 2312).
So the period between the first two meter changes will pass and no update will be necessary. However, in the 2nd set of data, all the values (10000) in the METER_READING column for 26-MAY-2004 will be updated with the value 2312.
I am not sure how to write PL/SQL to compare the values between two events, and also how to update the value on a prior date (if a higher value is found in the METER_READING column) with a lower value from that period.
Database: Oracle 11g
So, looking at your problem, I don't know that you need to resort to PL/SQL. The following query should help you identify which records are in need of updating:
SELECT m.*,
MIN(meter_reading)
OVER (PARTITION BY m.pump_no
ORDER BY m.inspection_date
RANGE BETWEEN NVL((SELECT min(n.inspection_date)-m.inspection_date
FROM maintenance n
WHERE n.inspection_date > m.inspection_date),
0) FOLLOWING
AND NVL((SELECT min(n.inspection_date)-m.inspection_date-1
FROM maintenance n
WHERE n.old_meter_reading IS NOT NULL
AND n.inspection_date > m.inspection_date),
0) FOLLOWING) AS MIN_READING_FOLLOWING
FROM maintenance m
ORDER BY m.inspection_date, old_meter_reading ASC NULLS LAST;
I created a SQLFiddle to demonstrate the query. (Link)
The analytic MIN function is looking at all rows between the next date a read was performed AND the next meter change to see if any of them have a value which is less than the current read.
You could use this as part of an update statement. As for TOTAL_PUMP_LIFE, it might be easiest to recalculate that after you've corrected the meter_readings as part of a separate operation.
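For instance (a hedged sketch, not from the original answer), the identification query could be folded into a single set-based MERGE, assuming (pump_no, inspection_date, maintenance_task) uniquely identifies a row:

MERGE INTO maintenance tgt
USING (
    SELECT m.pump_no, m.inspection_date, m.maintenance_task,
           MIN(meter_reading)
             OVER (PARTITION BY m.pump_no
                   ORDER BY m.inspection_date
                   RANGE BETWEEN NVL((SELECT MIN(n.inspection_date) - m.inspection_date
                                      FROM maintenance n
                                      WHERE n.inspection_date > m.inspection_date),
                                     0) FOLLOWING
                             AND NVL((SELECT MIN(n.inspection_date) - m.inspection_date - 1
                                      FROM maintenance n
                                      WHERE n.old_meter_reading IS NOT NULL
                                        AND n.inspection_date > m.inspection_date),
                                     0) FOLLOWING) AS min_reading_following
    FROM maintenance m
) src
ON (    tgt.pump_no          = src.pump_no
    AND tgt.inspection_date  = src.inspection_date
    AND tgt.maintenance_task = src.maintenance_task)
WHEN MATCHED THEN UPDATE
    SET tgt.meter_reading = src.min_reading_following
    -- only touch rows where a later reading in the same period is lower
    WHERE tgt.meter_reading > src.min_reading_following;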
Edit 1: Adding PL/SQL to make updates
DECLARE
CURSOR c_readings IS
SELECT m.*,
MIN(meter_reading)
OVER (PARTITION BY m.pump_no
ORDER BY m.inspection_date
RANGE BETWEEN NVL((SELECT min(n.inspection_date)-m.inspection_date
FROM maintenance n
WHERE n.inspection_date > m.inspection_date),
0) FOLLOWING
AND NVL((SELECT min(n.inspection_date)-m.inspection_date-1
FROM maintenance n
WHERE n.old_meter_reading IS NOT NULL
AND n.inspection_date > m.inspection_date),
0) FOLLOWING) AS MIN_READING_FOLLOWING
FROM maintenance m
ORDER BY m.inspection_date, old_meter_reading ASC NULLS LAST;
BEGIN
FOR rec IN c_readings LOOP
IF rec.meter_reading > rec.min_reading_following THEN
UPDATE maintenance m
SET m.meter_reading = rec.min_reading_following
WHERE m.pump_no = rec.pump_no
AND m.inspection_date = rec.inspection_date
AND m.maintenance_task = rec.maintenance_task;
END IF;
END LOOP;
END;
/
You'll need to either COMMIT when this is done or add it to the code.
Maybe what you need to do is something like this:
update MyTable mt1
set value = (select min(value)
from MyTable2 mt2
where mt1.id = mt2.id --your relation
and value NOT IN (select min(value)
from MyTable2 mt3
where mt2.id = mt3.id))
With this update you get the minimum value, and the NOT IN keeps you from taking the original minimum value.

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom falls before the TimeTo of another record:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps I end up with negative time in use. I.e., if the window above for ID = 10 was 150 seconds long, and I summed the differences of the relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75)=-55. I tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "which seconds have an interval", this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
    SELECT A.N, ROW_NUMBER() OVER (ORDER BY A.N) RowID
    FROM (SELECT TOP 60 1 N FROM master..spt_values) A,
         (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
    SELECT 1
    FROM Times SE
    WHERE SE.ID = 10
      AND SE.TimeFrom <= C.RowID
      AND SE.TimeTo >= C.RowID
      AND EXISTS (
          SELECT 1
          FROM Times2 D
          WHERE D.ID = SE.ID
            AND D.TimeFrom <= C.RowID
            AND D.TimeTo >= C.RowID
      )
    GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate why getting around this approach is for the best: even with the Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further edit: The end goal is to put this in a procedure, so table variables and temp tables are also possible tools to use.
OK. I'm still trying to do this with just one SELECT, but this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)

INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom

DECLARE @timeTo INT, @id INT, @groupId INT
SET @groupId = 0

-- Quirky update: walks the rows in order, starting a new group whenever
-- the current row does not overlap the running end time.
UPDATE @tmp
SET @groupId = CASE WHEN id != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = id

SELECT Id, MIN(TimeFrom), MAX(TimeTo)
FROM @tmp
GROUP BY ID, GroupId
ORDER BY ID
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now, for each row in the result set of LHS left join RHS, the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer (see the sketch after the notes below).
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
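Putting that together, an untested sketch of the query (OUTER APPLY picks the single earliest overlapping successor, since a plain left join could match more than one):

SELECT LHS.ID,
       SUM(ISNULL(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times LHS
OUTER APPLY (SELECT TOP 1 T.TimeFrom
             FROM Times T
             WHERE T.ID = LHS.ID
               AND T.TimeFrom > LHS.TimeFrom
               AND T.TimeFrom < LHS.TimeTo  -- successor starts inside LHS's window
             ORDER BY T.TimeFrom) RHS
GROUP BY LHS.ID;

With the sample data this returns 120 seconds for ID 10, which matches the flattened intervals 10-30 and 50-150.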
What about something like below (assumes SQL 2008+ due to CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So, you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
how can it be possible that a user is idle until second 70, but goes idle at second 60? This already implies that he came back at the latest around second 59.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID,TimeFrom,TimeTo)VALUES('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 reproduces the chain of events of your example, but results in a clean and logical set of idle windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: The output is slightly different from the output you desired, but I feel that this is more accurate, for the reason outlined above: a user cannot go idle at second 70 without returning from his current idle state before that. He either STAYS idle (and a second thread/tab runs into the idle event), or he returned in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions is the cause of the wrong inserts, you would also need to implement a check that most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded idle on tab2 after a few seconds, because tab2 ran into its idle timeout merely because the user only refreshed tab1.
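A rough sketch of that guard (MySQL, untested; currentTimeStamp and idleTimeout are placeholders, as in the statements above) could be folded into the idle-event insert:

INSERT INTO Times (ID, TimeFrom, TimeTo)
SELECT '10', currentTimeStamp, -1
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT MAX(TimeTo) AS lastBack   -- most recent recorded comeback
          FROM Times
          WHERE ID = '10' AND TimeTo <> -1) t
    WHERE t.lastBack >= currentTimeStamp - idleTimeout
);
-- (Nothing is inserted if a comeback was recorded within the timeout window.)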
I had the 'same' problem once with 'days' (additionally without counting weekends and holidays).
The word "counting" gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10-second blocks by dividing; you lose some precision but gain speed:
select count(distinct sec)*d.d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m.m)/d.d and (t.timeto-m.m)/d.d-1
and id=10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

create variable for unique sessions

I have some data about when, for how long, and to what channel people are listening to the radio. I need to make a variable called sessions that groups all entries which occur while the radio is on. Because the data may contain some errors, I would like to say that if less than five minutes passes from the end of one channel period to the start of the next, then it is still the same session. Hopefully a brief example will clarify.
obs  Entry_date  Entry_time  duration(in secs)  channel
1    01/01/12    23:25:21    6000               2
2    01/03/12    01:05:64    300                5
3    01/05/12    12:12:35    456                5
4    01/05/12    16:45:21    657                8
I want to create the variable sessions so that
obs  Entry_date  Entry_time  duration(in secs)  channel  session
1    01/01/12    23:25:21    6000               2        1
2    01/03/12    01:05:64    300                5        1
3    01/05/12    12:12:35    456                5        2
4    01/05/12    16:45:21    657                8        3
For defining a session I need to use entry_time (and date, if it goes from 11pm into the next morning), so that if entry_time + duration + (5 minutes) < entry_time(next channel), then the session changes. This has been killing me, and simple arrays won't do the trick, or at least my attempt using arrays has not worked. Thanks in advance.
Aside from the comments I made on the OP, here's how I would do it using a SAS data step. I've changed the date and time values for row 2 to what I suspect they should be (in order to get the same result as in the OP). This avoids having to perform a self-join, which is likely to be performance-intensive on a large dataset.
I've used the DIF and LAG functions, so care needs to be taken if you're adding in extra code (particularly IF statements).
data have;
input entry_date :mmddyy10. entry_time :time. duration channel;
format entry_date date9. entry_time time.;
datalines;
01/01/2012 23:25:21 6000 2
01/02/2012 01:05:54 300 5
01/05/2012 12:12:35 456 5
01/05/2012 16:45:21 657 8
;
run;
data want;
set have;
by entry_date entry_time; /* put in to check data is sorted correctly */
retain session 1; /* initialise session with value 1 */
session+(dif(dhms(entry_date,0,0,entry_time))-lag(duration)>300); /* increment session by 1 if time difference > 5 minutes */
run;
Hopefully I got your requirements right!
Since you need to base the result on adjoining rows, there is a need to join the table to itself.
The session #s are not consecutive, but you should get the point.
create table #temp
(obs int not null,
entry_date datetime not null,
duration int not null,
channel int not null)
--obs Entry_date Entry_time duration(in secs) channel
insert #temp
select 1, '01/01/12 23:25:21', 6000, 2
union all select 2, '01/03/12 01:05:54', 300, 5
union all select 3, '01/05/12 12:12:35', 456, 5
union all select 4, '01/05/12 16:45:21', 657, 8
select a.obs,
       a.entry_date,
       a.duration,
       -- duration is in seconds, hence dateadd(ss, ...); the extra 5 minutes
       -- is the allowed gap between sessions
       endSession = dateadd(mi,5,dateadd(ss,a.duration,a.entry_date)),
       a.channel,
       b.entry_date,
       minOverlapping = datediff(mi,b.entry_date,
                                 dateadd(mi,5,dateadd(ss,a.duration,a.entry_date))),
       anotherSession = case
                            when dateadd(mi,5,dateadd(ss,a.duration,a.entry_date)) < b.entry_date
                                then b.obs
                            else a.obs
                        end
from #temp a
left join #temp b on a.obs = b.obs - 1
hope this helps a bit