Finding statistical outliers in timestamp intervals with SQL Server

We have a bunch of devices in the field (various customer sites) that "call home" at regular intervals, configurable at the device but defaulting to 4 hours.
I have a view in SQL Server that displays the following information in descending chronological order:
DeviceInstanceId uniqueidentifier not null
AccountId int not null
CheckinTimestamp datetimeoffset(7) not null
SoftwareVersion string not null
Each time the device checks in, it will report its id and current software version which we store in a SQL Server db.
Some of these devices are in places with flaky network connectivity, which obviously prevents them from operating properly. There are also a bunch in datacenters where administrators regularly forget about them and change firewall/proxy settings, accidentally preventing outbound communication for the device. We need to proactively identify this bad connectivity so we can start investigating the issue before finding out from an unhappy customer, because even if the problem is 99% certainly on their end, they tend to feel (and as far as we are concerned, correctly) that we should know about it and be bringing it to their attention rather than vice versa.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval. For example, let's say device 87C92D22-6C31-4091-8985-AA6877AD9B40 has, for the last 1000 checkins, checked in every 4 hours or so (give or take a few seconds)... but the last time it checked in was just a little over 6 hours ago now. This is information I would like to highlight for immediate review, along with device E117C276-9DF8-431F-A1D2-7EB7812A8350 which normally checks in every 2 hours, but it's been a little over 3 hours since the last check-in.
It seems relatively straightforward to brute-force this: loop through all the devices, examine the average interval between check-ins, see what the last check-in was, compare that to the current time, and so on. But there are thousands of these devices, and the count grows larger every day. I need an efficient query to quickly generate this list of uncommunicative devices at least every hour... I just can't picture how to write that query.
Can someone help me with this? Maybe point me in the right direction? Thanks.

I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval.
I think you can do:
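select *
from (select DeviceInstanceId,
             -- average gap in seconds: total time span divided by the number of intervals
             datediff(second, min(CheckinTimestamp), max(CheckinTimestamp)) / nullif(count(*) - 1, 0) as avg_secs,
             max(CheckinTimestamp) as max_CheckinTimestamp
      from t
      group by DeviceInstanceId
     ) t
-- sysdatetimeoffset() returns a datetimeoffset like the column; getdate() carries no offset
where max_CheckinTimestamp < dateadd(second, - avg_secs * 1.5, sysdatetimeoffset());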
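To check the logic of that query on a small data set, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for SQL Server. The table name checkins, the columns device_id and checkin_epoch (timestamps stored as epoch seconds), and the function overdue_devices are all made up for the illustration; only the grouping idea comes from the query above.

```python
import sqlite3

HOUR = 3600

def overdue_devices(conn, now_epoch, factor=1.5):
    """Return ids of devices silent for longer than factor x their average gap."""
    sql = """
        SELECT device_id
        FROM (SELECT device_id,
                     (MAX(checkin_epoch) - MIN(checkin_epoch)) * 1.0
                         / NULLIF(COUNT(*) - 1, 0) AS avg_secs,
                     MAX(checkin_epoch) AS last_checkin
              FROM checkins
              GROUP BY device_id)
        WHERE avg_secs IS NOT NULL            -- skip devices with one check-in
          AND ? - last_checkin > avg_secs * ?
        ORDER BY device_id
    """
    return [row[0] for row in conn.execute(sql, (now_epoch, factor))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (device_id TEXT, checkin_epoch INTEGER)")

rows = []
# Hypothetical device "A": every 4 hours, last seen at hour 16.
rows += [("A", h * HOUR) for h in (0, 4, 8, 12, 16)]
# Hypothetical device "B": every 2 hours, last seen at hour 20.
rows += [("B", h * HOUR) for h in (12, 14, 16, 18, 20)]
conn.executemany("INSERT INTO checkins VALUES (?, ?)", rows)

# At hour 22.5, A has been silent 6.5h (limit 6h) and B 2.5h (limit 3h).
late = overdue_devices(conn, int(22.5 * HOUR))
```

With this sample data only device "A" is flagged at hour 22.5. Passing the current time in as a parameter keeps the check repeatable, which is handy when testing the threshold.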

Related

Incremental load of a full api call

I have an API from which I need to get signup data into my database and aggregate it daily. Every time I call the API I get a full copy of the data. Sometimes old accounts get deleted, so the historical data will change.
This is what the data from the API looks like:
I want to aggregate it like so, to see the daily account creations and activations:
Now, what I could do is a daily import of the full data and then aggregate like this:
SELECT
Current_date() as snapshot_date,
SUM(CASE WHEN accountCreateOn = current_date() THEN 1 ELSE 0 END) as accountCreateOn,
SUM(CASE WHEN accountActivateOn = current_date() THEN 1 ELSE 0 END) as accountActivateOn
FROM full_data
But this doesn't seem very failure-resistant. What happens if the pipeline fails for a couple of days? What would be the right way to solve such a problem?

The easiest and most fault-tolerant way is to store the data you receive completely and in full detail. You can't get any better information, and leaving out information (which includes aggregating it) always carries the danger that one day you will want to answer a question that the complete dataset could have answered and the reduced one can't.
The only reason to leave this path would be datasets so huge that storing and processing them isn't feasible. For a modern DBMS running on modern hardware, it's rather unlikely that you'll run into that problem. So I would create synthetic test data of the maximum size I expect for my business, say 10 times the account activations per year that I dream of. If the database can handle this, you have one less problem to worry about.
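One way to picture this, as a rough sketch only: keep every full snapshot and derive the daily counts from the stored date columns instead of from the day the job happens to run. The example below uses Python with the built-in sqlite3 module; the account_id column and the sample rows are invented, while full_data, snapshot_date, accountCreateOn and accountActivateOn are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE full_data (
    snapshot_date TEXT, account_id INTEGER,
    accountCreateOn TEXT, accountActivateOn TEXT)""")

# One full copy of the API data, stored unaggregated (made-up rows).
snapshot = [
    ("2024-01-03", 1, "2024-01-01", "2024-01-02"),
    ("2024-01-03", 2, "2024-01-02", None),
    ("2024-01-03", 3, "2024-01-02", "2024-01-02"),
]
conn.executemany("INSERT INTO full_data VALUES (?, ?, ?, ?)", snapshot)

# Daily counts come from the dates inside the latest snapshot, so a run
# that was missed for a few days can still be rebuilt afterwards.
DAILY_SQL = """
    WITH latest AS (
        SELECT * FROM full_data
        WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM full_data)
    )
    SELECT day, SUM(created) AS created, SUM(activated) AS activated
    FROM (SELECT accountCreateOn AS day, 1 AS created, 0 AS activated
          FROM latest
          UNION ALL
          SELECT accountActivateOn, 0, 1
          FROM latest
          WHERE accountActivateOn IS NOT NULL)
    GROUP BY day
    ORDER BY day
"""
daily = conn.execute(DAILY_SQL).fetchall()
```

Because older snapshots stay in the table, the same query can be pointed at an earlier snapshot_date to see what the numbers looked like before accounts were deleted.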

Qlikview: Build Logic/KPI to Count distinct devices using Loops/Set Analysis

Please help me build the logic for the scenario below.
I have data containing many devices (A, B, C, ...), each with a server up/down status of 1/0 respectively and the dates (24 hours) corresponding to it.
What I want is to count the number of distinct devices in the dataset that were up at least once during the day. That is, if a device is up (status 1) at least once in a day, it counts as 1; the other devices are checked and counted the same way, and finally the total number of up devices is reported. The reverse applies to devices that were down the whole day.
I am sorry if this has been asked before, but I didn't find any post about it.
I am not sure which function or loop will give the correct logic. Can we do it through a loop, or can set analysis do this?
Thanks in advance!

Should be pretty simple in set analysis. Something like
count({<Status={1}>}distinct DeviceID) to get all machines that have been up at all.
There's probably a clever way to do the down-all-day count, but I can't think of one other than
count(distinct DeviceID)-count({<Status={1}>}distinct DeviceID)
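To make the counting logic behind those two expressions explicit, here is a small Python sketch of the same idea using sets. It is not Qlikview code; the function name up_down_counts and the sample readings are invented, and it assumes the readings have already been limited to a single day.

```python
def up_down_counts(readings):
    """Count devices that were up at least once, and devices never up.

    readings is a list of (device_id, status) pairs for one day,
    where status is 1 for up and 0 for down.
    """
    all_devices = {device for device, _ in readings}
    up_devices = {device for device, status in readings if status == 1}
    # "Down all day" is everything that never reported status 1.
    return len(up_devices), len(all_devices - up_devices)

# Hypothetical readings: A was up once, B never, C always.
readings = [("A", 0), ("A", 1), ("B", 0), ("B", 0), ("C", 1), ("C", 1)]
up_count, down_count = up_down_counts(readings)
```

The subtraction in the second expression works for the same reason the set difference does: a device is either up at least once or down the whole day, never both.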

SQL MIN_ACTIVE_ROWVERSION() value does not change for a long while

We're troubleshooting a sort of Sync Framework between two SQL Server databases on separate servers (both SQL Server 2008 Enterprise 64 bits SP2 - 10.0.4000.0), through linked server connections, and we've reached a point at which we're sort of stuck.
The logic to identify which are the records "pending to be synced" is of course based on ROWVERSION values, including the use of MIN_ACTIVE_ROWVERSION() to avoid dirty reads.
All SELECT operations are encapsulated in SPs on each "source" side. This is a schematic sample of one SP:
PROCEDURE LoaderRetrieve(@LastStamp bigint, @Rows int)
BEGIN
...
(vars handling)
...
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
SELECT TOP (@Rows) Field1, Field2, Field3
FROM Table
WHERE [RowVersion] > @LastStampAsRowVersionDataType
AND [RowVersion] < @MinActiveVersion
ORDER BY [RowVersion]
END
The approach works just fine; we usually sync records at the expected rate of 600k/hour (job every 30 seconds, batch size = 5k). But at some point the sync process does not find a single record to transfer, even though there are several thousand records with a ROWVERSION value greater than the @LastStamp parameter.
When checking the reason, we found that MIN_ACTIVE_ROWVERSION() has a value less than (or only slightly greater than, by just 5 or 10 increments) the @LastStamp being searched. This of course shouldn't be a problem, since the MIN_ACTIVE_ROWVERSION() approach was introduced to avoid dirty reads and subsequent issues, BUT:
The problem we see on some occasions, while the above scenario occurs, is that the value of MIN_ACTIVE_ROWVERSION() does not change for a long (really long) period of time, like 30 to 40 minutes, sometimes more than one hour. And this value is far less than the @@DBTS value.
We first thought this was related to a pending DB transaction not yet committed. As per MSDN definition about the MIN_ACTIVE_ROWVERSION() (link):
Returns the lowest active rowversion value in the current database. A rowversion value is active if it is used in a transaction that has not yet been committed.
But when checking sessions (sys.sysprocesses) with open_tran > 0 during the duration of this issue, we couldn't find any session with a waittime greater than a few seconds, only one or two occurrences of +/- 5 minutes waittime sessions.
So at this point we're struggling to understand the situation: MIN_ACTIVE_ROWVERSION() does not change for a long period of time, and no uncommitted transactions with long waits are found within this time frame.
I'm not a DBA, and it could be that we're missing something needed to analyze this problem; some research on forums and blogs didn't turn up any other clue. So far open_tran > 0 was the valid explanation, but under the circumstances I've described, it's clear that there's something else, and I don't know what.
Any feedback is appreciated.

Well, I finally found the solution after digging a bit more.
The problem is that we were looking for sessions with a long waittime, when what we really needed to find was sessions that have had an active batch for a while.
If there's a session where open_tran = 1, check the last_batch field from sys.sysprocesses to find out exactly since when that transaction has been open (and of course still active, not yet committed).
Using this query:
select
batchDurationMin= DATEDIFF(second,last_batch,getutcdate())/60.0,
batchDurationSecs= DATEDIFF(second,last_batch,getutcdate()),
hostname,open_tran,* from sys.sysprocesses a
where spid > 50
and a.open_tran >0
order by last_batch asc
we could identify a session with an open transaction that had been active for 30+ minutes. With the hostname values and some more checks within the web services (and also using dbcc inputbuffer), we found the responsible process.
So the final answer is that there was indeed an active session with an uncommitted transaction, and therefore MIN_ACTIVE_ROWVERSION() did not change. We were just looking for processes with the wrong criteria.
Now that we know which process behaves like this, the next step will be to improve it.
Hope this is useful to someone else.

CRM 2011 - Set/Retrieve work hours programmatically

I am attempting to retrieve a resource's work hours to perform some logic I require. I understand that the CRM scheduling engine is a little clunky around such things, but I assumed that I would eventually be able to find out how the working hours are stored in the DB...
So a resource has associated calendars, and those calendars have associated calendar rules, inner calendars, etc. It is possible to look at the start/end and frequency of the aforementioned calendar rules and query their codes to work out whether a resource is 'working' during a given period. However, I have not been able to find the actual working hours (the 9-5, shall we say) in any field in the DB.
I even tried some SQL profiling while I was creating a new schedule for a resource via the UI, but the results don't show any work hours passing to SQL. For those with the patience the intercepted SQL statement is below:-
EXEC Sp_executesql
N'update [CalendarRuleBase] set [ModifiedBy]=@ModifiedBy0, [EffectiveIntervalEnd]=@EffectiveIntervalEnd0, [Description]=@Description0, [ModifiedOn]=@ModifiedOn0, [GroupDesignator]=@GroupDesignator0, [IsSelected]=@IsSelected0, [InnerCalendarId]=@InnerCalendarId0, [TimeZoneCode]=@TimeZoneCode0, [CalendarId]=@CalendarId0, [IsVaried]=@IsVaried0, [Rank]=@Rank0, [ModifiedOnBehalfBy]=NULL, [Duration]=@Duration0, [StartTime]=@StartTime0, [Pattern]=@Pattern0 where ([CalendarRuleId] = @CalendarRuleId0)',
N'@ModifiedBy0 uniqueidentifier,@EffectiveIntervalEnd0 datetime,@Description0 ntext,@ModifiedOn0 datetime,@GroupDesignator0 ntext,@IsSelected0 bit,@InnerCalendarId0 uniqueidentifier,@TimeZoneCode0 int,@CalendarId0 uniqueidentifier,@IsVaried0 bit,@Rank0 int,@Duration0 int,@StartTime0 datetime,@Pattern0 ntext,@CalendarRuleId0 uniqueidentifier',
@ModifiedBy0='EB04662A-5B38-E111-9889-00155D79A113',
@EffectiveIntervalEnd0='2012-01-13 00:00:00',
@Description0=N'Weekly Single Rule',
@ModifiedOn0='2012-03-12 16:02:08',
@GroupDesignator0=N'FC5769FC-4DE9-445d-8F4E-6E9869E60857',
@IsSelected0=1,
@InnerCalendarId0='3C806E79-7A49-4E8D-B97E-5ED26700EB14',
@TimeZoneCode0=85,
@CalendarId0='E48B1ABF-329F-425F-85DA-3FFCBB77F885',
@IsVaried0=0,
@Rank0=2,
@Duration0=1440,
@StartTime0='2000-01-01 00:00:00',
@Pattern0=N'FREQ=WEEKLY;INTERVAL=1;BYDAY=SU,MO,TU,WE,TH,FR,SA',
@CalendarRuleId0='0A00DFCF-7D0A-4EE3-91B3-DADFCC33781D'
The key part of the statement is the setting of the pattern:
@Pattern0=N'FREQ=WEEKLY;INTERVAL=1;BYDAY=SU,MO,TU,WE,TH,FR,SA'
However, as mentioned, no indication of the work hours set.
Am I thinking about this incorrectly or is CRM doing something interesting around these work hours?
Any thoughts greatly appreciated, thanks.

If you look in the CalendarRuleBase table you should see a record with the data you gathered in your trace. You should also see another record created approximately the same time and it will have a CalendarId that equals the InnerCalendarId of the data from the trace. In this record there is a value - Offset which appears to represent the number of minutes past midnight for the start time. There is another value - Duration which appears to be the number of minutes of the shift.
I created work hours from 8-5. My offset was 480 (480/60 = 8), an 8 AM start time, and the duration was 540 (540/60 = 9), a 9-hour shift.
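If the Offset and Duration values really are minutes as they appear to be, turning them into clock times is simple arithmetic. A minimal Python sketch, where the function name shift_bounds is made up and the reference date is arbitrary because only the time of day matters:

```python
from datetime import datetime, time, timedelta

def shift_bounds(offset_minutes, duration_minutes):
    """Convert minutes-past-midnight and shift length into start/end times.

    Assumes both values are minutes and the shift ends on the same day.
    """
    midnight = datetime(2000, 1, 1)  # arbitrary date, used only as midnight
    start = midnight + timedelta(minutes=offset_minutes)
    end = start + timedelta(minutes=duration_minutes)
    return start.time(), end.time()

# The 8-5 example above: offset 480, duration 540.
start_time, end_time = shift_bounds(480, 540)
```

For the values above this gives a start of 08:00 and an end of 17:00, which matches the 8-5 hours that were entered.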

How to calculate blocks of free time using start and end time?

I have a Ruby on Rails application that uses MySQL and I need to calculate blocks of free (available) time given a table that has rows of start and end datetimes. This needs to be done for a range of dates, so for example, I would need to look for which times are free between May 1 and May 7. I can query the table with the times that are NOT available and use that to remove periods of time between May 1 and May 7. Times in the database are stored at a fidelity of 15 minutes on the quarter hour, meaning all times end at 00, 15, 30 or 45 minutes. There is never a time like 11:16 or 10:01, so no rounding is necessary.
I've thought about creating a hash that has time represented in 15 minute increments and defaulting all of the values to "available" (1), then iterating over an ordered resultset of rows and flipping the values in the hash to 0 for the times that come back from the database. I'm not sure if this is the most efficient way of doing this, and I'm a little concerned about the memory utilization and computational intensity of that approach. This calculation won't happen all the time, but it needs to scale to happening at least a couple hundred times a day. It seems like I would also need to reprocess the entire hash to find the blocks of time that are free after this which seems pretty inefficient.
Any ideas on a better way to do this?
Thanks.

I've done this a couple of ways. First, my assumption is that your table shows appointments, and now you want to get a list of un-booked time, right?
So, the first way I did this was like yours, just a hash of unused times. It's slow and limited and a little wasteful, since I have to re-calculate the hash every time someone needs to know the times that are available.
The next way I did this was to borrow an idea from the data warehouse people. I build an attribute table of all the time slots that I'm interested in. If you build this kind of table, you may want to put more information in there besides the slot times. You may also include things like whether it's a weekend, which hour of the day it's in, whether it's during regular business hours, whether it's on a holiday, that sort of thing. Then I join all slots between my start and end times to the appointments and keep only the slots where the appointment is null. So, this is a LEFT JOIN, something like:
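SELECT slots.*
FROM slots
LEFT JOIN appointments
  ON ...  -- condition that matches a slot to any appointment covering it
WHERE ...  -- restrict the slots to the date range of interest
  AND appointments.id IS NULL  -- keep only slots with no matching appointment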
That keeps me from having to re-create the hash every time, and it's using the database to do the set operations, something the database is optimized to do.
Also, if you make your slots table a little rich, you can start doing all sorts of queries about not only the available slots you may be after, but also on the kinds of times that tend to get booked, or the kinds of times that tend to always be available, or other interesting questions you might want to answer some day. At the very least, you should keep track of the fields that tell you whether a slot should be one that is being filled or not (like for business hours).
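Here is a minimal runnable sketch of the slots idea, written in Python against the built-in sqlite3 module instead of MySQL. The column names starts_at and ends_at, the overlap condition in the ON clause, and the sample appointment are all assumptions made for the illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (starts_at TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE appointments (
    id INTEGER PRIMARY KEY, starts_at TEXT, ends_at TEXT)""")

# Pre-build 15-minute slots from 09:00 up to (not including) 11:00.
fmt = "%Y-%m-%d %H:%M"
first = datetime(2024, 5, 1, 9, 0)
conn.executemany(
    "INSERT INTO slots VALUES (?)",
    [((first + timedelta(minutes=15 * i)).strftime(fmt),) for i in range(8)],
)

# One hypothetical booking from 09:30 to 10:15.
conn.execute(
    "INSERT INTO appointments (starts_at, ends_at) VALUES (?, ?)",
    ("2024-05-01 09:30", "2024-05-01 10:15"),
)

# Anti-join: a slot is free when no appointment covers its start time.
free = [row[0] for row in conn.execute(
    """
    SELECT s.starts_at
    FROM slots s
    LEFT JOIN appointments a
      ON s.starts_at >= a.starts_at AND s.starts_at < a.ends_at
    WHERE s.starts_at >= ? AND s.starts_at < ?
      AND a.id IS NULL
    ORDER BY s.starts_at
    """,
    ("2024-05-01 09:00", "2024-05-01 11:00"),
)]
```

Storing the times as fixed-width text keeps the comparisons simple in this sketch; in MySQL the same join would use real DATETIME columns. Adjacent free slots can then be merged into blocks in the application if needed.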

Why not have a flag in the row that indicates this? As time is allocated, flip the flag for every date/time in the appropriate range. For example, May 2, 12pm to 1pm, would be marked as not available.
Then it's a simple matter of querying the date range for every row that has the availability flag set to true.
