Azure stream analytics query using Tumbling window - azure-stream-analytics

In our application, multiple IoT devices publish data to IoT Hub. They emit readings from rooms (for example, power usage).
Now we have a requirement to find the total energy consumed in an area in the last hour and log it.
Suppose there is a light bulb that was switched on at 8:00 AM and draws 60 watts. It was switched off at 8:20 for 10 minutes, and at 8:30 it was switched back on, dimmed, drawing 40 watts. So the energy (in watt-hours) consumed between 8 and 9 AM should be:
60 * 20/60 (for 8:00 AM to 8:20 AM) + 0 (8:20 to 8:30) + 40 * 30/60 (8:30 to 9:00) = 20 + 0 + 20 = 40 watt-hours.
How can we write a Stream Analytics query (using a tumbling window) to achieve this?

You can use a HoppingWindow to produce an event every minute repeating the latest signal from each device, and then use a TumblingWindow to get hourly aggregates.
-- First query produces an event every minute with the latest known value up to 1 hour back.
-- Note: this repeats the last reported wattage, so the device should emit a 0-watt reading
-- when it is switched off for the 8:20-8:30 gap to count as zero.
WITH MinuteData AS
(
    SELECT deviceId, TopOne() OVER (ORDER BY ts DESC) AS lastRecord
    FROM input TIMESTAMP BY ts
    GROUP BY deviceId, HoppingWindow(minute, 60, 1)
)
-- Second query sums the per-minute wattage over each hour; dividing by 60 yields watt-hours.
-- 'watt' stands for whatever field carries the power reading on the input event.
SELECT
    deviceId,
    SUM(lastRecord.watt) / 60 AS wattHours
FROM MinuteData
GROUP BY deviceId, TumblingWindow(hour, 1)

Related

Running a Count on an Interval

I'm trying to do an alert of sorts for customers joining. The alert needs to run on an interval of one hour, which is possible with an integration we have.
The sample data is this:
Name   | Time
-------+--------------------
John   | 2022-04-21T13:49:51
Mary   | 2022-04-23T13:49:51
Dave   | 2022-04-25T13:49:51
Gregg  | 2022-04-27T13:49:51
The problem with the query below is that it only captures the count within the hour, and so it yields no results. What I'm really trying to determine is the moment (well, within the hour) the count crosses above a threshold of 3. Is there something I'm missing?
SELECT COUNT(name)
FROM Table
WHERE Time >= TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -60 MINUTE)
HAVING COUNT(name) > 3
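For the "moment the threshold is crossed" part, a sliding window is one way to get an event-driven alert instead of an hourly poll. A rough sketch, assuming the join events can be routed through Azure Stream Analytics with Time as the event timestamp (the input and output names here are placeholders):
-- SlidingWindow emits a result whenever an event enters or leaves the window,
-- so this fires as soon as the rolling one-hour count exceeds 3.
SELECT
    System.Timestamp() AS alertTime,
    COUNT(*) AS joinsLastHour
INTO alertOutput
FROM customerJoins TIMESTAMP BY Time
GROUP BY SlidingWindow(hour, 1)
HAVING COUNT(*) > 3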

Azure Stream Analytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders from users of a certain amount. So, key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserId's and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I either want the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
    SELECT
        UserId,
        System.Timestamp() AS windowEnd,
        CollectTop(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
    FROM input1
    TIMESTAMP BY EventEnqueuedUtcTime
    GROUP BY
        SlidingWindow(hour, 24),
        UserId
    HAVING COUNT(*) >= 5 -- We want at least 5
)
SELECT
    L.UserId,
    System.Timestamp() AS ts,
    SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
    System.Timestamp(), -- Snapshot window
    L.UserId
We use a CTE to first get the sliding window of 24 h. In there we both filter to retain only windows of at least 5 records (HAVING COUNT(*) >= 5) and collect only the last 5 of them (CollectTop(5) OVER ...). Note that I had to TIMESTAMP BY and CAST my own timestamp when testing the query; you may not need that in your case.
Next we need to unpack the collected records, which is done via CROSS APPLY GetArrayElements, and sum them. I use a snapshot window for that, as I don't need time grouping on that one.
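If you would rather keep users with fewer than 5 events and sum whatever exists (the other option mentioned in the question), dropping the HAVING clause should be enough: CollectTop(5) simply returns fewer elements in that case.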
Please let me know if you need more details.

What is the most efficient way to compute "colocation" in BigQuery?

Assuming that you have a table of the form:
vehicle_id | timestamp | lat | lon
What is the most efficient way to create a query to compute "colocation"?
Colocation means two vehicles at nearly the same location at the same time.
What I am doing is first creating a cell_id from a grid (for example, by rounding lat/lon to the 4th decimal digit) and then running a GROUP BY on the cell_id (and time). Is there a more efficient way?
I'd suggest using a GeoHash. Demonstrating this on NYC taxicab data and grouping by hour in time:
WITH top_pickup_locations AS (
    SELECT
        TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour,
        ST_GeoHash(ST_GeogPoint(pickup_longitude, pickup_latitude), 15) AS geohash,
        COUNT(*) AS num_pickups
    FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
    GROUP BY hour, geohash
    ORDER BY num_pickups DESC
    LIMIT 10
)
SELECT
    hour,
    ST_GeogPointFromGeoHash(geohash) AS location,
    num_pickups
FROM top_pickup_locations
To read more about GeoHash, see here: https://en.wikipedia.org/wiki/Geohash
Increase the number of characters (I'm using 15) to control the precision.
The other alternative is to use ST_SnapToGrid() instead of the geohash:
WITH top_pickup_locations AS (
    SELECT
        TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour,
        ST_AsGeoJson(ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.0001)) AS cellid,
        COUNT(*) AS num_pickups
    FROM `bigquery-public-data.new_york.tlc_green_trips_2013`
    GROUP BY hour, cellid
    ORDER BY num_pickups DESC
    LIMIT 10
)
SELECT
    hour,
    ST_GeogFromGeoJson(cellid) AS location,
    num_pickups
FROM top_pickup_locations
When I did it, the geohash method took 11 seconds of slot time, while the snap-to-grid method took 57 seconds of slot time.
15 characters of geohash and 4 digits of lat-lon are approximately similar in the number of groups.
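To turn either of these groupings into actual colocation pairs, the hourly cells can be self-joined so that two vehicles sharing a cell in the same hour come out as a pair. A rough sketch of that step, assuming a table named `my_dataset.vehicle_positions` with the columns from the question (the table name and the 7-character geohash precision are assumptions):
-- Hypothetical table and precision; lengthen the geohash to tighten how close
-- "nearly the same location" has to be.
WITH cells AS (
    SELECT
        vehicle_id,
        TIMESTAMP_TRUNC(`timestamp`, HOUR) AS hour,
        ST_GeoHash(ST_GeogPoint(lon, lat), 7) AS geohash
    FROM `my_dataset.vehicle_positions`
)
SELECT DISTINCT
    a.vehicle_id AS vehicle_a,
    b.vehicle_id AS vehicle_b,
    a.hour,
    a.geohash
FROM cells AS a
JOIN cells AS b
    ON a.hour = b.hour
    AND a.geohash = b.geohash
    AND a.vehicle_id < b.vehicle_id  -- each pair once, no self-pairs
Both the time bucket and the spatial cell have edge effects: two vehicles just across a cell or hour boundary won't match, so pick the grain (or add overlapping buckets) accordingly.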

How to aggregate unique users by hour in Amazon Redshift?

With Amazon Redshift I want to count every unique visitor.
A unique visitor is a visitor who has not visited within the previous hour.
So for the following rows of users and timestamps we'd get a total count of 4 unique visits, with user1 and user2 each counting as 2.
Please note that I do not want to aggregate by clock hour within a 24-hour day. I want to aggregate by an hour after the timestamp of the user's first visit.
I'm guessing a straight up SQL expression won't do it.
user1,"2015-07-13 08:28:45.247000"
user1,"2015-07-13 08:30:17.247000"
user1,"2015-07-13 09:35:00.030000"
user1,"2015-07-13 09:54:00.652000"
user2,"2015-07-13 08:28:45.247000"
user2,"2015-07-13 08:30:17.247000"
user2,"2015-07-13 09:35:00.030000"
user2,"2015-07-13 09:54:00.652000"
So user1 arrives at 8:28, which counts as one hit. He comes back at 8:30, which counts as zero. He then comes back at 9:35, which is more than an hour after 8:30, so he gets another hit. Then he comes back at 9:54, which is only 19 minutes after 9:35, so this counts as zero. The total is 2 hits for user1. The same thing happens for user2, meaning two hits each, bringing the final total to 4.
You can use lag() to accomplish this. However, you will also have to handle the end of day by partitioning on day as well. The query below would be a starting point.
with prev as (
    select user_id,
           datecol,
           lag(datecol) over (partition by user_id order by datecol) as prev_visit
    from tablename
)
select user_id,
       -- the first visit (prev_visit is null) and any visit at least 60 minutes
       -- after the previous one count as a hit
       sum(case when prev_visit is null
                  or datediff(minute, prev_visit, datecol) >= 60
                then 1 else 0 end) as totalvisits
from prev
group by user_id
-- with the sample data above this returns 2 visits for user1 and 2 for user2

How can I select one row of data per hour, from a table of time stamps?

Excuse me if this is confusing, as I am not very familiar with postgresql. I have a postgres database with a table full of "sites". Each site reports about once an hour, and when it reports, it makes an entry in this table, like so:
site | tstamp
-----+--------------------
6000 | 2013-05-09 11:53:04
6444 | 2013-05-09 12:58:00
6444 | 2013-05-09 13:01:08
6000 | 2013-05-09 13:01:32
6000 | 2013-05-09 14:05:06
6444 | 2013-05-09 14:06:25
6444 | 2013-05-09 14:59:58
6000 | 2013-05-09 19:00:07
As you can see, the time stamps are almost never on-the-nose, and sometimes there will be 2 or more within only a few minutes/seconds of each other. Furthermore, some sites won't report for hours at a time (on occasion). I want to only select one entry per site, per hour (as close to each hour as I can get). How can I go about doing this in an efficient way? I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible).
Thank you for any and all suggestions.
You could use DISTINCT ON:
select distinct on (site, date_trunc('hour', tstamp)) site, tstamp
from t
order by site, date_trunc('hour', tstamp), tstamp
Be careful with the ORDER BY if you care about which entry you get.
Alternatively, you could use the row_number window function to mark the rows of interest and then peel off the first result in each group from a derived table:
select site, tstamp
from (
    select site, tstamp,
           row_number() over (partition by site, date_trunc('hour', tstamp) order by tstamp) as r
    from t
) as dt
where r = 1
Again, you'd adjust the ORDER BY to select the specific row of interest for each date.
You are looking for the closest value per hour. Some are before the hour and some are after. That makes this a hardish problem.
First, we need to identify the range of values that work for a particular hour. For this, I'll consider anything from 15 minutes before the hour to 45 minutes after as being for that hour. So, the period of consideration for 2:00 goes from 1:45 to 2:45 (arbitrary, but seems reasonable for your data). We can do this by shifting the time stamps by 15 minutes.
Second, we need to get the value closest to the hour. So, we prefer 1:57 to 2:05. We can do this by ordering on the distance from the hour, least(m, 60 - m), where m is the minute of the timestamp: for 1:57 that is least(57, 3) = 3 and for 2:05 it is least(5, 55) = 5, so 1:57 wins.
We can put these rules into a SQL statement, using row_number():
select site, tstamp, usedTimestamp
from (select site, tstamp,
             date_trunc('hour', tstamp + interval '15 minutes') as usedTimestamp,
             row_number() over (partition by site, date_trunc('hour', tstamp + interval '15 minutes')
                                order by least(extract(minute from tstamp), 60 - extract(minute from tstamp))
                               ) as seqnum
      from t
     ) as dt
where seqnum = 1;
For the extensibility aspect of your question.
I also will need to extend this to other time frames (like one entry per site per day
From the distinct set of site ids, and using a (recursive) CTE, I would build a set comprised of one entry per site per hour (or other specified interval), within a specified StartDateTime, EndDateTime range.
SITE | DATE-TIME-HOUR
6000 12.1.2013 00:00:00
6000 12.1.2013 01:00:00
.
.
.
6000 12.1.2013 24:00:00
7000 12.1.2013 00:00:00
7000 12.1.2013 01:00:00
.
.
.
7000 12.1.2013 24:00:00
Then I would left join that CTE against your SITES log on site id and on the min absolute difference between the CTE point-in-time and the LOG's point-in-time.
That way you are assured of a row for each site per interval.
P.S. For a site that has not phoned home for a long time, its most recent phone-in timestamp will be repeated multiple times as the closest one available.
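A minimal sketch of that idea in PostgreSQL, using generate_series instead of a recursive CTE for brevity; the table name t, the column names, and the date range are assumptions:
-- Build one row per site per hour for the chosen range, then pick the log entry
-- closest in time to each hour (so a silent site repeats whatever entry is nearest).
with hours as (
    select s.site, h.hour
    from (select distinct site from t) as s
    cross join generate_series(timestamp '2013-05-09 00:00',
                               timestamp '2013-05-09 23:00',
                               interval '1 hour') as h(hour)
)
select distinct on (hours.site, hours.hour)
       hours.site, hours.hour, t.tstamp
from hours
left join t on t.site = hours.site
order by hours.site, hours.hour,
         abs(extract(epoch from (t.tstamp - hours.hour)));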