Detect Anomaly Intervals with SQL

My problem is simple: I have a table with a series of statuses and timestamps (for the sake of curiosity, these statuses indicate alarm levels), and I would like to query this table to get the duration between two statuses.
Seems simple, but here comes the tricky part: I can't create look-up tables or procedures, and it should be as fast as possible, as this table is a little monster holding over 1 billion records (no kidding!)...
The schema is drop dead simple:
[pk] Time
Value
(actually, there is a second pk but it is useless for this)
And below is a real-world example:
Timestamp Status
2013-1-1 00:00:00 1
2013-1-1 00:00:05 2
2013-1-1 00:00:10 2
2013-1-1 00:00:15 2
2013-1-1 00:00:20 0
2013-1-1 00:00:25 1
2013-1-1 00:00:30 2
2013-1-1 00:00:35 2
2013-1-1 00:00:40 0
The output, considering only level 2 alarms, should report the beginning of a level 2 alarm and its end (when the status reaches 0):
StartTime EndTime Interval
2013-1-1 00:00:05 2013-1-1 00:00:20 15
2013-1-1 00:00:30 2013-1-1 00:00:40 10
I have been trying all sorts of inner joins, but all of them lead me to an amazing Cartesian explosion. Can you guys help me figure out a way to accomplish this?
Thanks!

This has to be one of the harder questions I've seen today - thanks! I assume you can use CTEs? If so, try something like this:
;WITH Filtered
AS
(
    -- number every row in time order so consecutive rows can be matched up
    SELECT ROW_NUMBER() OVER (ORDER BY dateField) AS RN, dateField, Status
    FROM Test
)
SELECT F1.RN, F3.MinRN,
       F1.dateField AS StartDate,
       F2.dateField AS EndDate
FROM Filtered F1
JOIN (
      -- for each row starting a level-2 run (a 2 preceded by a non-2),
      -- find the first following row whose status is not 2 (the run's end)
      SELECT F1a.RN, MIN(F3a.RN) AS MinRN
      FROM Filtered F1a
      JOIN Filtered F2a ON F1a.RN = F2a.RN + 1 AND F1a.Status = 2 AND F2a.Status <> 2
      JOIN Filtered F3a ON F1a.RN < F3a.RN AND F3a.Status <> 2
      GROUP BY F1a.RN) F3 ON F1.RN = F3.RN
JOIN Filtered F2 ON F2.RN = F3.MinRN
And the Fiddle. I didn't add the intervals, but I imagine you can handle that part from here.
Good luck.
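For completeness, a hedged sketch of the same query with the missing interval column added (assuming SQL Server's DATEDIFF and seconds as the unit, as in the expected output):
;WITH Filtered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY dateField) AS RN, dateField, Status
    FROM Test
)
SELECT F1.dateField AS StartDate,
       F2.dateField AS EndDate,
       DATEDIFF(second, F1.dateField, F2.dateField) AS Interval  -- the missing piece
FROM Filtered F1
JOIN (SELECT F1a.RN, MIN(F3a.RN) AS MinRN
      FROM Filtered F1a
      JOIN Filtered F2a ON F1a.RN = F2a.RN + 1 AND F1a.Status = 2 AND F2a.Status <> 2
      JOIN Filtered F3a ON F1a.RN < F3a.RN AND F3a.Status <> 2
      GROUP BY F1a.RN) F3 ON F1.RN = F3.RN
JOIN Filtered F2 ON F2.RN = F3.MinRN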

Finally figured out a version I was happy with. It took me remembering an answer from another question (can't remember which one, though) where it was pointed out that the difference between two increasing sequences is constant: within an unbroken run of one status, the overall row number and the per-status row number both advance by one per row, so their difference is the same for every row of the run.
WITH Ordered (occurredAt, status, row, grp)
as (SELECT occurredAt, status,
           -- overall position in time order
           ROW_NUMBER() OVER (ORDER BY occurredAt),
           -- position within rows of the same status
           ROW_NUMBER() OVER (PARTITION BY status
                              ORDER BY occurredAt)
    FROM Alert)
SELECT Event.startDate, Ending.occurredAt as endDate,
       DATEDIFF(second, Event.startDate, Ending.occurredAt) as interval
FROM (
      -- row - grp is constant within each unbroken run of status 2,
      -- so grouping by it collapses every alarm run into one row
      SELECT MIN(occurredAt) as startDate, MAX(row) as ending
      FROM Ordered
      WHERE status = 2
      GROUP BY row - grp) Event
LEFT JOIN (SELECT occurredAt, row
           FROM Ordered
           WHERE status != 2) Ending
       ON Event.ending + 1 = Ending.row  -- first non-2 row after the run
(working SQL Fiddle example, with some additional data rows for work checking).
This unfortunately doesn't deal correctly with a level-2 run that is still open at the end of the data (the desired behavior there is unspecified), although it does list it, with a NULL end date.

Just for the sake of having an alternative. I tried to run some performance tests, but did not finish.
SELECT
MIN([main].[Start]) AS [Start],
[main].[End],
DATEDIFF(s, MIN([main].[Start]), [main].[End]) AS [Seconds]
FROM
(
SELECT
[sub].[Start],
MIN([sub].[End]) AS [End]
FROM
(
SELECT
[start].[Timestamp] AS [Start],
[start].[Status] AS [StartingStatus],
[end].[Timestamp] AS [End],
[end].[Status] AS [EndingStatus]
FROM [Alerts] [start], [Alerts] [end]
WHERE [start].[Status] = 2
AND [start].[Timestamp] < [end].[Timestamp]
AND [start].[Status] <> [end].[Status]
) AS [sub]
GROUP BY
[sub].[Start],
[sub].[StartingStatus]
) AS [main]
GROUP BY
[main].[End]
And here is a Fiddle.

I do something similar by using an id column that is an identity on the table.
create table test(id int primary key identity(1,1), timstamp datetime, val int)
insert into test(timstamp,val) values('1/1/2013 00:00:00',1)
insert into test(timstamp,val) values('1/1/2013 00:00:05',2)
insert into test(timstamp,val) values('1/1/2013 00:00:25',1)
insert into test(timstamp,val) values('1/1/2013 00:00:30',2)
insert into test(timstamp,val) values('1/1/2013 00:00:35',1)
-- join each row to the next one (id is a gapless identity),
-- then take the difference in seconds
select t1.timstamp, t1.val, DATEDIFF(s, t1.timstamp, t2.timstamp) as seconds
from test t1
left join test t2 on t1.id = t2.id - 1
drop table test
I would also store the timestamps as seconds since 1980 or 2000 or whatever. But then you might not want to do the reverse conversion all the time, so it depends on how often you need the actual timestamp.
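For illustration, a minimal T-SQL sketch of that epoch-seconds idea (the 2000-01-01 epoch is an arbitrary assumption):
-- store seconds since an agreed epoch instead of a datetime;
-- convert on the way in, and back only when the real timestamp is needed
DECLARE @epoch datetime = '2000-01-01';

-- datetime -> epoch seconds (on insert)
SELECT DATEDIFF(second, @epoch, '2013-01-01 00:00:05') AS epoch_seconds;  -- 410313605

-- epoch seconds -> datetime (only when displaying)
SELECT DATEADD(second, 410313605, @epoch) AS real_timestamp;  -- 2013-01-01 00:00:05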

Related

For each unique item in a Redshift SQL column, get the last rows based on a looking/scanning window

patient_id | alert_id | alert_timestamp
---------------------------------------
3          | xyz      | 2022-10-10
1          | anp      | 2022-10-12
1          | gfe      | 2022-10-10
2          | fgy      | 2022-10-02
2          | gpl      | 2022-10-03
1          | gdf      | 2022-10-13
2          | mkd      | 2022-10-23
1          | liu      | 2022-10-01
I have a SQL table (see the simplified version above) where, for each patient_id, I want to keep only the latest alert (i.e. the last one) that was sent out in a given window period, e.g. window_size = 7.
Note, the window size needs to look at consecutive days, i.e. between day 1 and day 1 + window_size. The range of alert_timestamp for each patient_id varies and is usually well beyond the window_size range.
Note that the table above is a very simple example; the real one has many more patient_ids and is in mixed order in terms of alert_timestamp and alert_id.
The approach is to start from the last alert_timestamp for a given patient_id and work back, using the window_size to select the alert that was the last one in each window time frame.
Please note the idea is to have a scanning/looking window, e.g. window_size = 7 days, moving across the timestamps of each patient.
The end result I want is a table with the filtered alerts.
Expected output (for this example) with window_size = 7:
patient_id | alert_id | alert_timestamp
---------------------------------------
1          | liu      | 2022-10-01
1          | gdf      | 2022-10-13
2          | gpl      | 2022-10-03
2          | mkd      | 2022-10-23
3          | xyz      | 2022-10-10
What's the most efficient way to solve this?
This can be done with the last_value window function but you need to prep your data a bit. Here's an example of what this could look like:
create table test (
patient_id int,
alert_id varchar(8),
alert_timestamp date);
insert into test values
(3, 'xyz', '2022-10-10'),
(1, 'anp', '2022-10-12'),
(1, 'gfe', '2022-10-10'),
(2, 'fgy', '2022-10-02'),
(2, 'gpl', '2022-10-03'),
(1, 'gdf', '2022-10-13'),
(2, 'mkd', '2022-10-23'),
(1, 'liu', '2022-10-01');
WITH RECURSIVE dates (dt) AS
(
SELECT '2022-09-30'::DATE AS dt UNION ALL SELECT dt + 1
FROM dates d
WHERE dt < '2022-10-31'::DATE
),
p_dates AS
(
SELECT pid,
dt
FROM dates d
CROSS JOIN (SELECT DISTINCT patient_id AS pid FROM test) p
),
combined AS
(
SELECT *
FROM p_dates d
LEFT JOIN test t
ON d.dt = t.alert_timestamp
AND d.pid = t.patient_id
),
latest AS
(
SELECT patient_id,
pid,
alert_id,
dt,
alert_timestamp,
LAST_VALUE(alert_id IGNORE NULLS) OVER (PARTITION BY pid ORDER BY dt ROWS BETWEEN CURRENT ROW AND 7 following) AS at
FROM combined
)
SELECT patient_id,
alert_id,
alert_timestamp
FROM latest
WHERE patient_id IS NOT NULL
AND alert_id = at
ORDER BY patient_id,
alert_timestamp;
This produces the results you are looking for with the test data, but there are a few assumptions. The big one is that there is at most 1 alert per patient per day. If this isn't true, some more data massaging will be needed. Either way, this should give you an outline of how to do it.
The first need is to ensure that there is 1 row per patient per day, so that the window function can operate on rows, as these will be equivalent to days (for each patient). The date range is generated by a recursive CTE, cross-joined to the distinct patients, and then left-joined to the test data to achieve the 1 row per day per patient.
The "ignore nulls" option in the last_value window function skips the "extra" rows created by the above process. The last step prunes out all the unneeded rows and ensures that only the latest alert of each window is produced.

The nearest row in the other table

One table is a sample of users and their purchases.
Structure:
Email | NAME | TRAN_DATETIME (Varchar)
So we have customer email + FirstName&LastName + Date of transaction
and the second table, which comes from a second system, contains all users, their sensitive data, and when they got registered in our system.
Simplified Structure:
Email | InsertDate (varchar)
My task is to count the difference in minutes between the rows inserted from the sale (first table) and the rows with users and their sensitive data.
The issue is that the second table contains many rows per user, and I want to find the row nearest in time that was inserted in the 2nd table, because sometimes there may be a few minutes difference (a delay in either direction) and sometimes it can be a few days.
So for email x I have this row in the 1st table:
E_MAIL NAME TRAN_DATETIME
p****#****.eu xxx xxx 2021-10-04 00:03:09.0000000
But then I have 3 rows, and the latest is the one I want to use to compute the difference:
Email InsertDate
p****#****.eu 2021-05-20 19:12:07
p****#****.eu 2021-05-20 19:18:48
p****#****.eu 2021-10-03 18:32:30 <--
I wrote the query below, but I have no idea how to match the nearest row in the 2nd table:
SELECT DISTINCT TOP (100)
       a.[E_MAIL]
      ,a.[NAME]
      ,a.[TRAN_DATETIME]
      ,CASE WHEN b.EMAIL IS NOT NULL THEN 'YES' ELSE 'NO' END AS 'EXISTS'
      ,ABS(CONVERT(INT, CONVERT(datetime, LEFT(a.[TRAN_DATETIME], 10), 120))
         - CONVERT(INT, CONVERT(datetime, LEFT(b.[InsertDate], 10), 120))) AS 'DateAccuracy'
FROM [crm].[SalesSampleTable] a
LEFT JOIN [crm].[SensitiveTable] b ON a.[E_MAIL] = b.[EMAIL]
Totally untested (I'd need sample data and a database): the suspect area is the casting of dates and the date math, since I don't know what RDBMS and version this is, so consider the following "pseudo code".
We assign a row number ordered by the absolute difference in seconds between the dates; the rows numbered 1 (the nearest in time) win.
WITH CTE AS (
    SELECT a.*, b.*,
           ROW_NUMBER() OVER (PARTITION BY a.E_MAIL
                              ORDER BY ABS(DATEDIFF(second,
                                           CAST(a.TRAN_DATETIME AS datetime),
                                           CAST(b.InsertDate AS datetime)))) AS RN
    FROM [crm].[SalesSampleTable] a
    LEFT JOIN [crm].[SensitiveTable] b
           ON a.[E_MAIL] = b.[EMAIL])
SELECT * FROM CTE WHERE RN = 1

Getting the latest entry per day / SQL Optimizing

Given the following database table, which records events (status) for different objects (id) with their timestamps:
ID | Date | Time | Status
-------------------------------
7 | 2016-10-10 | 8:23 | Passed
7 | 2016-10-10 | 8:29 | Failed
7 | 2016-10-13 | 5:23 | Passed
8 | 2016-10-09 | 5:43 | Passed
I want to get a result table using plain SQL (MS SQL) like this:
ID | Date | Status
------------------------
7 | 2016-10-10 | Failed
7 | 2016-10-13 | Passed
8 | 2016-10-09 | Passed
where the "status" is the latest entry on a day, given that at least one event for this object has been recorded.
My current solution is using "Outer Apply" and "TOP(1)" like this:
SELECT DISTINCT rn.id,
tmp.date,
tmp.status
FROM run rn OUTER apply
(SELECT rn2.date, tmp2.status AS 'status'
FROM run rn2 OUTER apply
(SELECT top(1) rn3.id, rn3.date, rn3.time, rn3.status
FROM run rn3
WHERE rn3.id = rn.id
AND rn3.date = rn2.date
ORDER BY rn3.id ASC, rn3.date + rn3.time DESC) tmp2
WHERE tmp2.status <> '' ) tmp
As far as I understand this outer apply command works like:
For every id
For every recorded day for this id
Select the newest status for this day and this id
But I'm facing performance issues, therefore I think that this solution is not adequate. Any suggestions how to solve this problem or how to optimize the sql?
Your code seems too complicated. Why not just do this?
SELECT r.id, r.date, r2.status
FROM run r OUTER APPLY
(SELECT TOP 1 r2.*
FROM run r2
WHERE r2.id = r.id AND r2.date = r.date AND r2.status <> ''
ORDER BY r2.time DESC
) r2;
For performance, I would suggest an index on run(id, date, status, time).
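That suggestion as DDL, for reference (the index name is made up):
CREATE INDEX ix_run_id_date_status_time ON run (id, date, status, time);
The leading id and date columns support the correlated seek from the OUTER APPLY, and since the table has no other columns, the inner query can be answered from the index alone.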
Using a CTE will probably be the fastest:
with cte as
(
select ID, Date, Status, row_number() over (partition by ID, Date order by Time desc) rn
from run
)
select ID, Date, Status
from cte
where rn = 1
Do not SELECT from a log table, instead, write a trigger that updates a latest_run table like:
CREATE TRIGGER tr_run_insert ON run FOR INSERT AS
BEGIN
    -- try to update an existing (ID, Date) row first
    UPDATE lr
       SET Status = i.Status
      FROM latest_run lr
      JOIN INSERTED i ON lr.ID = i.ID AND lr.Date = i.Date
    -- nothing updated: this (ID, Date) is new, so insert it
    IF @@ROWCOUNT = 0
        INSERT INTO latest_run (ID, Date, Status)
        SELECT ID, Date, Status FROM INSERTED
END
Then perform reads from the much shorter latest_run table.
This will add a performance penalty on writes because you'll need two writes instead of one. But it will give you much more stable response times on reads. And if you do not need to SELECT from the run table, you can avoid indexing it, so the penalty of two writes is partly offset by less index maintenance.
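The read side then becomes a plain lookup against the summary table, e.g. (a sketch; columns as maintained by the trigger above):
-- reads hit the compact latest_run table instead of the full log
SELECT ID, Date, Status
FROM latest_run
WHERE ID = 7;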

Joining next Sequential Row

I am planning an SQL statement right now and would need someone to look over my thoughts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thoughts:
I need to join each record with its subsequent row. I can do that using the ever-flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join the two copies together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1, and 1 + 1 = 2, so it should join to the row with id 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction, regardless of which side of the expression is the higher figure.
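Expressed as SQL, the plan would look something like this (untested sketch, using the DDL above):
-- self-join staggered by one id; ABS keeps the gap positive either way
SELECT t1.id,
       ABS(t1.stat - t2.stat) AS gap
FROM tbstats t1
JOIN tbstats t2 ON t2.id = t1.id + 1;

-- and the mean of those gaps in one go (the CAST avoids integer division)
SELECT AVG(CAST(ABS(t1.stat - t2.stat) AS float)) AS mean_gap
FROM tbstats t1
JOIN tbstats t2 ON t2.id = t1.id + 1;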
Is there an easier way to achieve what I want?
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average value of the gaps can be computed from just the first and last values, because the intermediate terms telescope away: the mean gap is the difference between the last value and the first value, divided by one less than the number of elements:
select sum(case when seqnum = num then stat else - stat end) / (max(num) - 1)
from (select period, stat, row_number() over (order by period) as seqnum,
             count(*) over () as num
      from tbstats
     ) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
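For reference, the lead() version of the same average on SQL Server 2012+ might look like this (a sketch; the CAST avoids integer division, and with signed gaps it equals (last - first) / (n - 1)):
SELECT AVG(CAST(next_stat - stat AS float)) AS avg_gap
FROM (SELECT stat,
             LEAD(stat) OVER (ORDER BY period) AS next_stat
      FROM tbstats) t
WHERE next_stat IS NOT NULL;  -- the last row has no successor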
You can also achieve this with a join:
SELECT t1.period,
       t1.stat,
       t1.stat - t2.stat AS gap
FROM tbstats t1
LEFT JOIN tbstats t2
       ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x

GROUP values separated by specific records

I want to make a specific counter which increases by one after a specific record is found in a row.
time event revenue counter
13.37 START 20 1
13.38 action A 10 1
13.40 action B 5 1
13.42 end 1
14.15 START 20 2
14.16 action B 5 2
14.18 end 2
15.10 START 20 3
15.12 end 3
I need to find out the total revenue for every visit (the actions between START and END). I was thinking the best way would be to set a counter like the one shown above, so I could group events. But if you have a better solution, I would be grateful.
You can use a query similar to the following:
with StartTimes as
(
select time,
startRank = row_number() over (order by time)
from events
where event = 'START'
)
select e.*, counter = st.startRank
from events e
outer apply
(
select top 1 st.startRank
from StartTimes st
where e.time >= st.time
order by st.time desc
) st
SQL Fiddle with demo.
May need to be updated based on the particular characteristics of the actual data, things like duplicate times, missing events, etc. But it works for the sample data.
SQL Server 2012 supports an OVER clause for aggregates, so if you're up to date on version, this will give you the counter you want:
count(case when eventname='START' then 1 end) over (order by eventtime)
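In context, that expression might be used like this (untested sketch; the eventname/eventtime/revenue column names follow the snippet above and the schema referenced in the next answer):
WITH numbered AS
(
    SELECT *,
           -- running count of START rows seen so far = visit number
           COUNT(CASE WHEN eventname = 'START' THEN 1 END)
               OVER (ORDER BY eventtime) AS counter
    FROM YourTable
)
SELECT counter, SUM(revenue) AS totalRevenue
FROM numbered
GROUP BY counter;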
You could also use the latest START time instead of a counter to group by, like this:
with t as (
select
*,
max(case when eventname='START' then eventtime end)
over (order by eventtime) as timeStart
from YourTable
)
select
timeStart,
max(eventtime) as timeEnd,
sum(revenue) as totalRevenue
from t
group by timeStart;
Here's a SQL Fiddle demo using the schema Ian posted for his solution.