Create variable for unique sessions - SQL

I have some data about when, for how long, and on what channel people are listening to the radio. I need to make a variable called sessions that groups all entries which occur while the radio is on. Because the data may contain some errors, I would like to say that if less than five minutes passes from the end of one channel period to the start of the next, then it is still the same session. Hopefully a brief example will clarify.
obs Entry_date Entry_time duration(in secs) channel
1 01/01/12 23:25:21 6000 2
2 01/03/12 01:05:64 300 5
3 01/05/12 12:12:35 456 5
4 01/05/12 16:45:21 657 8
I want to create the variable sessions so that
obs Entry_date Entry_time duration(in secs) channel session
1 01/01/12 23:25:21 6000 2 1
2 01/03/12 01:05:64 300 5 1
3 01/05/12 12:12:35 456 5 2
4 01/05/12 16:45:21 657 8 3
For defining one session I need to use entry_time (and entry_date, if a session runs from 11pm into the next morning), so that if entry_time + duration + 5 minutes < entry_time (of the next record) then the session changes. This has been killing me and simple arrays won't do the trick, or at least my attempt using arrays has not worked. Thanks in advance.

Aside from the comments I made in the OP, here's how I would do it using a SAS data step; this avoids having to perform a self join, which is likely to be performance-intensive on a large dataset. I've changed the date and time values for row 2 to what I suspect they should be, in order to get the same result as in the OP: row 1 ends at 01:05:21 on 01/02/2012, so an entry at 01:05:54 on 01/02/2012 falls within the 5-minute window and stays in session 1.
I've used the DIF and LAG functions, so care needs to be taken if you're adding extra code (particularly IF statements): DIF and LAG maintain queues that are updated only when the functions execute, so calling them conditionally can give unexpected results.
data have;
input entry_date :mmddyy10. entry_time :time. duration channel;
format entry_date date9. entry_time time.;
datalines;
01/01/2012 23:25:21 6000 2
01/02/2012 01:05:54 300 5
01/05/2012 12:12:35 456 5
01/05/2012 16:45:21 657 8
;
run;
data want;
set have;
by entry_date entry_time; /* put in to check data is sorted correctly */
retain session 1; /* initialise session with value 1 */
session+(dif(dhms(entry_date,0,0,entry_time))-lag(duration)>300); /* increment session by 1 if time difference > 5 minutes */
run;

Hopefully I got your requirements right!
Since you need to base the result on adjoining rows, there is a need to join the table to itself.
The Session #s are not consecutive, but you should get the point.
create table #temp
(obs int not null,
entry_date datetime not null,
duration int not null,
channel int not null)
--obs Entry_date Entry_time duration(in secs) channel
insert #temp
select 1, '01/01/12 23:25:21', 6000, 2
union all select 2, '01/03/12 01:05:54', 300, 5
union all select 3, '01/05/12 12:12:35', 456, 5
union all select 4, '01/05/12 16:45:21', 657, 8
select a.obs,
a.entry_date,
a.duration,
endSession = dateadd(mi,5,dateadd(ss,a.duration,a.entry_date)), -- duration is in seconds
a.channel,
b.entry_date,
minOverlapping = datediff(mi,b.entry_date,
dateadd(mi,5,dateadd(ss,a.duration,a.entry_date))),
anotherSession = case
when dateadd(mi,5,dateadd(ss,a.duration,a.entry_date))<b.entry_date
then b.obs
else a.obs end
from #temp a
left join #temp b on a.obs = b.obs - 1
Hope this helps a bit.
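If consecutive session numbers are wanted rather than the per-pair comparison above, a running-sum variant along these lines could be layered on top. This is only a sketch against the #temp table defined above; it assumes SQL Server 2012+ for LAG and windowed SUM, and it treats duration as seconds per the OP:
WITH flagged AS (
    SELECT t.*,
           -- flag a row as the start of a new session when more than 5 minutes pass
           -- between the end of the previous row and this row's start
           CASE WHEN DATEADD(mi, 5,
                        DATEADD(ss, LAG(t.duration)   OVER (ORDER BY t.entry_date),
                                    LAG(t.entry_date) OVER (ORDER BY t.entry_date)))
                     < t.entry_date
                THEN 1 ELSE 0
           END AS new_session
    FROM #temp AS t
)
SELECT obs, entry_date, duration, channel,
       1 + SUM(new_session) OVER (ORDER BY entry_date
                                  ROWS UNBOUNDED PRECEDING) AS [session]
FROM flagged;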

Related

A question again on cursors in SQL Server

I am reading data using Modbus. The data contains the status of 250 registers in a PLC as either off or on, with the time of reading as the timestamp. The raw data received is stored in a table as below, where the column Register represents the register read and the column Value represents the status of the register as 0 or 1, along with a timestamp. In the sample I am showing data for just one register (i.e. 250). SlaveID represents the PLC from which the data was obtained.
I need to populate another table, Table_signal_on_log, from the raw data table. It should contain the time at which the value changed to 1 as the start time and the time at which it changed back to 0 as the end time. This table is also given below.
I am able to do it with a cursor, but it is slow, and if the number of signals increases it could slow down the processing. How could I do this without a cursor? I tried set-based operations but couldn't get one working. I need to avoid repeat values, i.e. after recording 13:30:30 as the time at which the signal becomes 1, I have to ignore all entries until it becomes 0 and record that as the end time, then again ignore all values until it becomes 1. This process runs once every 20 seconds (it can be done at any interval, but presently 20), so I may have around 500 rows to loop through each time. This will increase as the number of connected PLCs grows, and the cursor operation is bound to become an issue.
Raw data table
SlaveID Register Value Timestamp ProcessTime
-------------------------------------------------------
3 250 0 13:30:10 NULL
3 250 0 13:30:20 NULL
3 250 1 13:30:30 NULL
3 250 1 13:30:40 NULL
3 250 1 13:30:50 NULL
3 250 1 13:31:00 NULL
3 250 0 13:31:10 NULL
3 250 0 13:31:20 NULL
3 250 0 13:32:30 NULL
3 250 0 13:32:40 NULL
3 250 1 13:32:50 NULL
Table_signal_on_log
SlaveID Register StartTime Endtime
3 250 13:30:30 13:31:10
3 250 13:32:50 NULL //value is still 1
This is a classic gaps-and-islands problem; there are a number of solutions. Here is one:
1. Get the previous Value for each row using LAG.
2. Filter so we only have rows where the previous Value is different or non-existent, in other words the beginning of an "island" of rows.
3. Of those rows, get the next Timestamp for each row using LEAD.
4. Filter so we only have Value = 1.
WITH cte1 AS (
SELECT *,
PrevValue = LAG(t.Value) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM YourTable t
),
cte2 AS (
SELECT *,
NextTime = LEAD(t.Timestamp) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM cte1 t
WHERE (t.Value <> t.PrevValue OR t.PrevValue IS NULL)
)
SELECT
t.SlaveID,
t.Register,
StartTime = t.Timestamp,
Endtime = t.NextTime
FROM cte2 t
WHERE t.Value = 1;
db<>fiddle

A follow-up question on the Gaps and Islands solution

This is a continuation of my previous question, A question again on cursors in SQL Server.
To reiterate: I get values from a sensor as 0 (off) or 1 (on) every 10 seconds. I need to log the on times, i.e. when the sensor value is 1, in another table.
I will process the data every minute (which means I will have 6 rows of data). I needed a way to do this without using cursors and was answered by @Charlieface.
WITH cte1 AS (
SELECT *,
PrevValue = LAG(t.Value) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM YourTable t
),
cte2 AS (
SELECT *,
NextTime = LEAD(t.Timestamp) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM cte1 t
WHERE (t.Value <> t.PrevValue OR t.PrevValue IS NULL)
)
SELECT
t.SlaveID,
t.Register,
StartTime = t.Timestamp,
Endtime = t.NextTime
FROM cte2 t
WHERE t.Value = 1;
db<>fiddle
The raw data set and desired outcome are as below. Here Register 250 represents the sensor, Value represents the reading as 0 or 1, and Timestamp represents the time the value was read.
SlaveID Register Value Timestamp ProcessTime
-------------------------------------------------------
3 250 0 13:30:10 NULL
3 250 0 13:30:20 NULL
3 250 1 13:30:30 NULL
3 250 1 13:30:40 NULL
3 250 1 13:30:50 NULL
3 250 1 13:31:00 NULL
3 250 0 13:31:10 NULL
3 250 0 13:31:20 NULL
3 250 0 13:32:30 NULL
3 250 0 13:32:40 NULL
3 250 1 13:32:50 NULL
The required entry in the logging table is
SlaveID Register StartTime Endtime
3 250 13:30:30 13:31:10
3 250 13:32:50 NULL //value is still 1
The solution given works fine, but when the next set of data is processed, the existing open entry (end time is NULL) has to be taken into account.
If the next set contains only 1s (i.e. all values are 1), then no entry is to be made in the log table, since the value was 1 in the previous set of data and continues to be 1. When the value changes to 0 in one of the sets, the end time of the open entry should be updated with that time. A fresh row is to be inserted in the log table when it becomes 1 again.
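One set-based way to fold a new batch into the log is a close-then-insert pass over the batch's islands. The sketch below is illustrative only: #islands is assumed to hold the CTE output (SlaveID, Register, StartTime, EndTime) for the current batch, Table_signal_on_log is the log table from the question, and a sensor with an open log row is assumed to still be ON at the first reading of the new batch (the case where it flips exactly between batches would need extra handling). The poster's own workaround, a hybrid using a small cursor, is described next.
-- Sensors that were still ON when the previous batch ended (open log rows).
SELECT SlaveID, Register
INTO #open
FROM Table_signal_on_log
WHERE EndTime IS NULL;

-- 1) Close an open row when the new batch shows that sensor turning OFF:
--    the batch's earliest island for the sensor continues the open row.
UPDATE l
SET    l.EndTime = i.EndTime
FROM   Table_signal_on_log AS l
JOIN   #islands AS i
       ON  i.SlaveID  = l.SlaveID
       AND i.Register = l.Register
WHERE  l.EndTime IS NULL
  AND  i.EndTime IS NOT NULL
  AND  i.StartTime = (SELECT MIN(i2.StartTime)
                      FROM #islands AS i2
                      WHERE i2.SlaveID = l.SlaveID
                        AND i2.Register = l.Register);

-- 2) Insert genuinely new ON periods, skipping the earliest island of any sensor
--    that already had an open row (that island merely continued the old period).
INSERT INTO Table_signal_on_log (SlaveID, Register, StartTime, EndTime)
SELECT i.SlaveID, i.Register, i.StartTime, i.EndTime
FROM   #islands AS i
WHERE  NOT (
        EXISTS (SELECT 1 FROM #open AS o
                WHERE o.SlaveID = i.SlaveID AND o.Register = i.Register)
        AND i.StartTime = (SELECT MIN(i2.StartTime)
                           FROM #islands AS i2
                           WHERE i2.SlaveID = i.SlaveID
                             AND i2.Register = i.Register)
       );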
I solved the issue by using a 'hybrid'. I get 250 rows (the values of 250 polled sensors) every 10 seconds and process the data once every 180 seconds, so I have about 4,500 records, which I process using the CTE. That gives a result set of around 250 records (a few more than 250 if some signals have changed state). I insert this into a #temp table and use a cursor on that table to check and insert into the log table. Since the number of rows is only around 250, the cursor runs without issue.
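For illustration, the cursor stage of that hybrid might look roughly like the following; the #batch_islands temp table, the column types, and the simple open-row check are assumptions made for the sketch, not the poster's actual code:
DECLARE @SlaveID int, @Register int, @StartTime time, @EndTime time;

DECLARE island_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT SlaveID, Register, StartTime, EndTime
    FROM #batch_islands;    -- roughly 250 rows: the CTE output for this batch

OPEN island_cur;
FETCH NEXT FROM island_cur INTO @SlaveID, @Register, @StartTime, @EndTime;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF EXISTS (SELECT 1 FROM Table_signal_on_log
               WHERE SlaveID = @SlaveID AND Register = @Register AND EndTime IS NULL)
        -- sensor was already ON: close the open row if this island has an end time
        UPDATE Table_signal_on_log
        SET EndTime = @EndTime
        WHERE SlaveID = @SlaveID AND Register = @Register
          AND EndTime IS NULL AND @EndTime IS NOT NULL
    ELSE
        -- genuinely new ON period
        INSERT INTO Table_signal_on_log (SlaveID, Register, StartTime, EndTime)
        VALUES (@SlaveID, @Register, @StartTime, @EndTime);

    FETCH NEXT FROM island_cur INTO @SlaveID, @Register, @StartTime, @EndTime;
END

CLOSE island_cur;
DEALLOCATE island_cur;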
Thanks to @Charlieface for the original answer.

How to pull duplicates in transactional data based on date and other fields

I am looking at transactional data such as my credit card statement. I want to ensure that I am not getting my card swiped twice. The fields that I have are card number (I have multiple), amount of transaction, transaction date, merchant code, merchant name, and transaction code.
To know whether it is a true duplicate transaction, I want to check if the merchant code, merchant name, and transaction amount appear more than once. I also want to make sure the transactions were within 5 days of each other if all else matches.
I am doing the work in SAS code, but I can also do it in PROC SQL. So far in SAS I've sorted the data and then pulled a table that only holds duplicates, but since I've sorted the data, it will only call something a duplicate if the dates are exactly the same, rather than applying the 5-day rule mentioned.
I did a simple PROC SORT.
PROC SORT DATA=WORK.TRANSACTIONS
OUT=WORK.TRANSACTIONS1
DUPOUT=WORK.SORTSORTEDDUPS
NODUPKEY;
BY CARD NUMBER TRANSACTION_AMOUNT TRANSACTION_DATE MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE;
RUN;
What do I need to incorporate to add my rule of transaction within 5 days?
You can do it with an additional pass, retaining (and comparing to) the last transaction date as per the below. Note the change in the sort BY statement (you'll need to update the proc sort also).
data duplicates;
set work.transactions1;
by CARD NUMBER TRANSACTION_AMOUNT MERCHANT_CODE MERCHANT_NAME TRANSACTION_CODE TRANSACTION_DATE;
retain datecheck 0;
if first.TRANSACTION_CODE then datecheck=0;
else if TRANSACTION_DATE-datecheck le 5 then output;
datecheck=TRANSACTION_DATE;
run;
Let's create our practice data source:
DATA MY_CREDIT_CARDS;
INPUT
C_NUMBER
TRANC_AMOUNT
TRANSC_DATE :DATE10.
TRANSC_CODE
MERCH_CODE
MERCH_NAME $10.;
FORMAT TRANSC_DATE DDMMYY10.;
CARDS;
1 100 17JAN1990 1 1 AMAZON
2 200 01JAN1990 2 8 WALLMART
4 100 04JAN1990 3 5 CRUSTYKRAB
2 200 07JAN1990 4 7 NETFLIX
1 300 01JAN1990 5 2 GOOGLEPLAY
3 200 17JAN1990 6 8 WALLMART
5 100 18JAN1990 7 2 GOOG.PLAY
5 300 19JAN1990 8 2 GOOGLEPLAY
2 200 22JAN1990 9 8 WALLMART
4 200 20JAN1990 10 2 GOOGLEPLAY
1 100 03JAN1990 11 2 GOOG.PLAY
1 100 17JAN1990 12 1 AMZN
;
RUN;
Now, first of all, I recommend not using descriptive fields such as names (the merchant name in this case) as keys, because descriptive fields can be very variable, i.e. someone could register AMAZON as AMZN or AMAZN, or any combination you could imagine, as the merchant name. Use ID fields instead. So, assuming merchant code is a unique ID, I think that is enough to identify the merchant.
Considering the above, using PROC SQL you could do something like this to find duplicates based on the rule you provided (and without the need for any extra step):
PROC SQL;
/*The following assuming each record are unique
(identified by 'transaction code' in this case),
otherwise you must handle duplicate records properly.*/
SELECT
DISTINCT A.*,
CASE WHEN
B.TRANSC_CODE IS NOT NULL
THEN 1 ELSE 0 END AS DUPLICATED
FROM MY_CREDIT_CARDS AS A
LEFT JOIN MY_CREDIT_CARDS AS B
ON
A.MERCH_CODE = B.MERCH_CODE AND
A.TRANC_AMOUNT = B.TRANC_AMOUNT AND
A.TRANSC_CODE ^= B.TRANSC_CODE AND
A.TRANSC_DATE >= INTNX('day',B.TRANSC_DATE,-5) AND
A.TRANSC_DATE <= INTNX('day',B.TRANSC_DATE,5)
;
/*You could use an ORDER BY clause to sort the
results as you want.*/
RUN;
The result is a new column named DUPLICATED, showing 1 if the row was found to be duplicated and 0 if not.
Hope it helps.

Summation/counting over overlapping values or dates with GROUP BY over IDs in SQL

I am working with a SAS table in which dates are represented as numbers, given in the columns "entered" and "left". I have to count the days a member remained in the system. For example, below for id 1, the person entered on 7071 and then used a different product from 7075, although he remained continuously in the system from 7071 to 7083; that is, the dates overlap. I want to count the final duration a member stayed in the system: for id 1 it is 12 days (7071 to 7083) + 2 days (7087 to 7089) + 4 days (7095 to 7099), so the total is 18 days. (There are some duplicate entered and left values, but other columns not shown here differ, so these rows were not removed.) Since I'm working in SAS, the idea can be in either SAS data step or SAS SQL form.
For member 2, there is no overlap, so the day count is 2 days (8921 to 8923) + 5 days (8935 to 8940) = 7 days. I was able to solve this case as the days didn't overlap, but for the overlapping case any suggestion or code/advice is appreciated.
id Entered left
1 7071 7077
1 7071 7077
1 7075 7079
1 7077 7083
1 7077 7083
1 7078 7085
1 7087 7089
1 7095 7099
2 8921 8923
2 8935 8940
So the final table should be of the form
id days_in_system
1 18
2 7
This is a surprisingly tricky problem, as every row has to be compared to every other row for the same id to check for overlaps, and if there are multiple overlaps you have to be very careful not to double-count them.
Here's a hash-based solution - the idea is to build up a hash containing all of the individual days a member has stayed as you go along, then count the number of items in it at the end:
data have;
input id Entered left;
cards;
1 7071 7077
1 7071 7077
1 7075 7079
1 7077 7083
1 7077 7083
1 7078 7085
1 7087 7089
1 7095 7099
2 8921 8923
2 8935 8940
;
run;
data want;
length day 8;
if _n_ = 1 then do;
declare hash h();
rc = h.definekey('day');
rc = h.definedone();
end;
do until(last.id);
set have;
by id;
do day = entered to left - 1;
rc = h.add();
end;
end;
total_days = h.num_items;
rc = h.clear();
keep id total_days;
run;
This should be fairly light on memory as it only has to load the days for 1 id at a time.
The output from id 1 is 20, not 18 - here's a breakdown of the new days added row-by-row that I generated by adding a bit of debugging logic. If this is wrong, please indicate where:
_N_=1
7071 7072 7073 7074 7075 7076
_N_=2
No new days
_N_=3
7077 7078
_N_=4
7079 7080 7081 7082
_N_=5
No new days
_N_=6
7083 7084
_N_=7
7087 7088
_N_=8
7095 7096 7097 7098
_N_=1
8921 8922
_N_=2
8935 8936 8937 8938 8939
If you want to add only days for rows matching a particular condition, you can pick those using a where clause on the set statement, e.g.
set have(where = (var1 in ('value1', 'value2', ...)));

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom falls before another record's TimeTo, i.e. the ranges overlap:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps I end up with negative time in use. I.e. if the window above for ID = 10 was 150 seconds long, and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75) = -55. I've tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "which seconds have an interval": this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
SELECT A.N, ROW_NUMBER()OVER(ORDER BY A.N) RowID
FROM
(SELECT TOP 60 1 N FROM master..spt_values) A
, (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
SELECT 1
FROM Times SE
WHERE SE.ID = 10
AND SE.TimeFrom <= C.RowID
AND SE.TimeTo >= C.RowID
AND EXISTS (
SELECT 1
FROM Times2 D
WHERE ID = SE.ID
AND D.TimeFrom <= C.RowID
AND D.TimeTo >= C.RowID
)
GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best, because even though the Row Count Spool reduces the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)
INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom
DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0
UPDATE @tmp
SET
@groupId = CASE WHEN id != @id THEN 0
WHEN TimeFrom > @timeTo THEN @groupId + 1
ELSE @groupId END,
GroupId = @groupId,
@timeTo = TimeTo,
@id = id
SELECT Id, MIN(TimeFrom), MAX(TimeTo) FROM @tmp
GROUP BY ID, GroupId ORDER BY ID
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now, for each row in the result set of LHS left-joined to RHS, the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
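A minimal sketch of that calculation, using the Times table from the question: OUTER APPLY is one way to pick the nearest overlapping successor per row. This has only been checked mentally against the sample rows, not benchmarked:
SELECT lhs.ID,
       SUM(ISNULL(rhs.TimeFrom, lhs.TimeTo) - lhs.TimeFrom) AS TimeElapsed
FROM dbo.Times AS lhs
OUTER APPLY (
    -- nearest successor row for the same ID that starts inside this row's window
    SELECT TOP (1) t.TimeFrom
    FROM dbo.Times AS t
    WHERE t.ID = lhs.ID
      AND t.TimeFrom > lhs.TimeFrom
      AND t.TimeFrom < lhs.TimeTo
    ORDER BY t.TimeFrom
) AS rhs
GROUP BY lhs.ID;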
What about something like below (assumes SQL Server 2005+ due to the CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
How can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he came back at second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table at all? You could avoid it by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID,TimeFrom,TimeTo)VALUES('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever an comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 reproduces the chain of events for your example, resulting in a clean and logical set of idle windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: the output is slightly different from the output you desired, but I feel this is more accurate, for the reason outlined above: a user cannot go idle at second 60 without first returning from the idle state that lasts until second 70. Either he STAYS idle (and a second thread/tab runs into the idle event), or he returned in between.
Especially given your need to maximize performance, you should fix the data rather than invent a work-around query. This costs maybe 3 ms on insert, but could be worth 20 seconds on select!
Edit: if multi-threading / multiple sessions are the cause of the wrong inserts, you would also need to check whether most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab 1 and be recorded as idle on tab 2 a few seconds later, because tab 2 ran into its idle timeout while the user only refreshed tab 1.
I had the 'same' problem once with days (additionally without counting weekends and holidays).
The word counting gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds by dividing; you lose some precision but gain speed.
select count(distinct sec)*d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m)/d and (t.timeto-m)/d-1
and id=10;
Sure, it depends on the range you have to look at, but a day or two of seconds should work (although I did not test it).
fiddle ...