Azure Stream Analytics takes too long to process the events

I'm trying to configure an Azure Stream Analytics job, but I'm consistently getting poor performance. A client system pushes data into an Event Hub, and the ASA job queries it into an Azure SQL database.
A few days ago I noticed that it was generating a large number of InputEventLateBeyondThreshold errors. Here is an example event out of the ASA. The Timestamp element is set by the client system.
{
    "Tag": "MANG_POWER_5",
    "Value": 1.08411181,
    "ValueType": "Analogue",
    "Timestamp": "2022-02-01T09:00:00.0000000Z",
    "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
    "PartitionId": 0,
    "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z"
}
You can see that the event arrives within a second of its timestamp, but takes more than 30 minutes to be processed. To try and avoid InputEventLateBeyondThreshold errors, I have increased the late event threshold. This may be contributing to the increased processing time, but having it too low also increases the number of InputEventLateBeyondThreshold errors.
The Watermark Delay is consistently high, yet SU usage is around 5%. I have increased the SUs to as high as I can for this query.
I'm trying to figure out why it takes so long to process the events once they have arrived.
This is the query I'm using:
WITH PIDataSet AS (SELECT * FROM [<event-hub>] TIMESTAMP BY timestamp)

--Write data to SQL joining with a lookup
SELECT
    i.Timestamp as timestamp,
    i.Value as value
INTO [<sql-database>]
FROM PIDataSet as i
INNER JOIN [tagmapping-ref-alias] tm ON tm.sourcename = i.Tag

--Write data to AzureTable joining with a lookup
SELECT
    DATEDIFF(second, CAST('1970-01-01' as DateTime), I1.Timestamp) as Rowkey,
    I2.TagId as PartitionKey,
    I1.Value as Value,
    UDF.formatTime(I1.Timestamp) as DeviceTimeStamp
INTO [<azure-table>]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag

--Get an hourly count into a SQL table
SELECT
    I2.TagId,
    System.Timestamp() as WindowEndTime,
    COUNT(I2.TagId) AS InputCount
INTO [tagmeta-ref-alias]
FROM PIDataSet as I1
JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
GROUP BY I2.TagId, TumblingWindow(Duration(hour, 1))

When you set up a 59-minute out-of-order window, you set up a 59-minute buffer for that input. When records land in that buffer, they wait 59 minutes until they get out. What you get in exchange is that we have the opportunity to re-order these events so they appear in order to the job.
Using it at 1 hour is an extreme setting that automatically gives you 59 minutes of watermark delay, by definition. This is very surprising and I'm wondering why you need a value so high.
Edit
Now looking at the late arrival policy.
You are using event time (TIMESTAMP BY timestamp), which means that your events can now be late, see this doc and this one.
What this means is that when a record comes in later than 1 hour (so its timestamp is older than 1 hour compared with the wall clock on our servers, in UTC), then we adjust its timestamp to our wall clock minus 1 hour and send it to the query. For example, if our wall clock reads 10:00 and an event arrives stamped 08:30, its System.Timestamp is adjusted to 09:00. It also means that your tumbling window always has to wait an additional hour to be sure it's not missing those late records.
Here is what I would do: restore the default settings (no out-of-order window, 5 seconds for late events, adjust events). Then when you get InputEventLateBeyondThreshold, it means that the job received a timestamp more than 5 seconds in the past. You're not losing the data; we adjust its System.Timestamp to a more recent value (but not the timestamp field itself, which we don't change).
What we then need to understand is why it takes more than 5 seconds for a record in your pipeline to go from production to consumption. Is it because you have big delays in your ingestion pipeline, or because you have a time skew on your producer clock? Do you know?
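To quantify that, one option is a quick diagnostic query that compares the payload timestamp with the Event Hub enqueue time. This is a sketch reusing the input alias and fields from the question:

-- Per-event delay between the producer's timestamp and Event Hub enqueue time.
-- A consistently large or drifting value suggests producer clock skew;
-- spikes suggest delays in the ingestion pipeline.
SELECT
    Tag,
    Timestamp,
    EventEnqueuedUtcTime,
    DATEDIFF(millisecond, CAST(Timestamp AS datetime), EventEnqueuedUtcTime) AS ingestionDelayMs
FROM [<event-hub>]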

Related

Flink SQL interval join not triggering

I have a simple interval join between two unbounded streams. This works with small workloads, but with a larger (production environment) workload it no longer works. From observing the output I can see that the Flink SQL job triggers/emits records only once the entire topic has been scanned (and consequently read into memory?), but I would want the job to emit a record as soon as a single match is found, since in my production environment the job cannot withstand reading the entire table into memory.
The interval join which I'm making is very similar to the examples provided here: https://github.com/ververica/flink-sql-cookbook/blob/main/joins/02_interval_joins/02_interval_joins.md
SELECT
    o.id AS order_id,
    o.order_time,
    s.shipment_time,
    TIMESTAMPDIFF(DAY, o.order_time, s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id
WHERE o.order_time BETWEEN s.shipment_time - INTERVAL '3' DAY AND s.shipment_time;
Except my time interval is as small as possible (couple of seconds). I also have a watermark of 5 seconds on the Flink SQL source tables.
How can I instruct Flink to emit/trigger the records as soon as it has made a single match in the join? Currently the job tries to scan the entire table before emitting any records, which is not feasible with my data volumes. From my understanding it should only need to scan up to the interval (time window), check that, and emit the record once the interval has passed.
Also, from observing the cluster I can see that the watermark is moving, but no records are being emitted.
Maybe some data was discarded; you can check whether your event times are reasonable. In this scenario, you can try using a regular join and setting a 3-day state TTL (table.exec.state.ttl = 3 days), which ensures output for every record that joins.
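For illustration, a minimal sketch of that suggestion, reusing the tables from the question and assuming the Flink SQL client's SET syntax for the table.exec.state.ttl option:

-- Regular (non-interval) join with bounded state: join state unused for 3 days
-- is evicted, which keeps state small while still emitting a row per match.
SET 'table.exec.state.ttl' = '3 d';

SELECT
    o.id AS order_id,
    o.order_time,
    s.shipment_time,
    TIMESTAMPDIFF(DAY, o.order_time, s.shipment_time) AS day_diff
FROM orders o
JOIN shipments s ON o.id = s.order_id;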

Finding statistical outliers in timestamp intervals with SQL Server

We have a bunch of devices in the field (various customer sites) that "call home" at regular intervals, configurable at the device but defaulting to 4 hours.
I have a view in SQL Server that displays the following information in descending chronological order:
DeviceInstanceId uniqueidentifier not null
AccountId int not null
CheckinTimestamp datetimeoffset(7) not null
SoftwareVersion string not null
Each time the device checks in, it will report its id and current software version which we store in a SQL Server db.
Some of these devices are in places with flaky network connectivity, which obviously prevents them from operating properly. There are also a bunch in datacenters where administrators regularly forget about them and change firewall/proxy settings, accidentally preventing outbound communication for the device. We need to proactively identify this bad connectivity so we can start investigating the issue before finding out from an unhappy customer... because even if the problem is 99% certain to be on their end, they tend to feel (and as far as we are concerned, correctly) that we should know about it and bring it to their attention rather than vice versa.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval. For example, let's say device 87C92D22-6C31-4091-8985-AA6877AD9B40 has, for the last 1000 checkins, checked in every 4 hours or so (give or take a few seconds)... but the last time it checked in was just a little over 6 hours ago now. This is information I would like to highlight for immediate review, along with device E117C276-9DF8-431F-A1D2-7EB7812A8350 which normally checks in every 2 hours, but it's been a little over 3 hours since the last check-in.
It seems relatively straightforward to brute-force this, looping through all the devices, examining the average interval between check-ins, seeing when the last check-in was, comparing that to the current time, etc... but there are thousands of these, and the device count grows larger every day. I need an efficient query that can generate this list of uncommunicative devices at least every hour... I just can't picture how to write that query.
Can someone help me with this? Maybe point me in the right direction? Thanks.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval.
I think you can do:
select *
from (
    select DeviceInstanceId,
           -- average seconds between check-ins: total span / number of gaps
           datediff(second, min(CheckinTimestamp), max(CheckinTimestamp)) / nullif(count(*) - 1, 0) as avg_secs,
           max(CheckinTimestamp) as max_CheckinTimestamp
    from t
    group by DeviceInstanceId
) t
-- flag devices whose last check-in is older than 150% of their average interval;
-- sysdatetimeoffset() rather than getdate(), since CheckinTimestamp is datetimeoffset
where max_CheckinTimestamp < dateadd(second, -avg_secs * 1.5, sysdatetimeoffset());
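Since the question judges "normal" by roughly the last 1000 check-ins, a variation that bounds the average to each device's most recent rows might look like this (a sketch reusing the same table and column names):

select *
from (
    select DeviceInstanceId,
           datediff(second, min(CheckinTimestamp), max(CheckinTimestamp)) / nullif(count(*) - 1, 0) as avg_secs,
           max(CheckinTimestamp) as max_CheckinTimestamp
    from (
        -- number check-ins newest-first within each device
        select DeviceInstanceId, CheckinTimestamp,
               row_number() over (partition by DeviceInstanceId
                                  order by CheckinTimestamp desc) as rn
        from t
    ) recent
    where rn <= 1000
    group by DeviceInstanceId
) d
where max_CheckinTimestamp < dateadd(second, -avg_secs * 1.5, sysdatetimeoffset());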

How to understand statistics in an Oracle trace file, such as CPU, elapsed time, query, etc.

I am learning query optimization in Oracle, and I know that the trace file records statistics about the query execution and the EXPLAIN PLAN of the query.
At the bottom of the trace file is the EXPLAIN PLAN of the query. My first question is: does the part "time = 136437 us" show the duration of each step of the query execution? What does "us" mean? Is it a unit of time?
In addition, can anyone explain what statistics such as count, cpu, elapsed, disk and query mean? I have googled and read the Oracle docs about them already, but I still cannot understand them. Can anyone clarify the meaning of those stats more clearly?
Thanks in advance. I am new and sorry for my English.
The smallest unit of data access in Oracle Database is a block. Not a row.
Each block can store many rows.
The database can access a block in current or consistent mode.
Current = as the block exists "right now".
Consistent = as the block existed at the time your query started.
The query and current columns report how many times the database accessed a block in consistent (query) and current mode.
When accessing a block it may already be in the buffer cache (memory). If so, no disk access is needed. If not, it has to do a physical read (pr). The disk column is a count of the total physical reads.
The stats for each line in the plan are the figures for that operation, plus the sum of all its child operations.
In simple terms, the database processes the plan by accessing the first child operation first, then passing its rows up to the parent, then running the other child operations of that parent in order. Child operations are indented from their parent in the display.
So the database processed your query like so:
Read 2,000 rows from CUSTOMER. This required 749 consistent block gets and 363 disk reads (the cr and pr values on this row). This took 10,100 microseconds.
Read 112,458 rows from BOOKING. This did 8,203 consistent reads and zero disk reads. This took 337,595 microseconds.
Joined these two tables using a hash join. The CR, PR, PW (physical writes) and time values are the sum of the operations below it, plus whatever work this operation did itself. So the hash join:
did 8,952 - ( 749 + 8,203 ) = zero consistent reads
did 363 - ( 363 + 0 ) = zero physical reads
took 1,363,447 - ( 10,100 + 337,595 ) = 1,015,752 microseconds to execute
Notice that the CR & PR totals for the hash join match the query and disk totals in the fetch line?
The count column reports the number of times that operation happened. A fetch is a call to the database to get rows. So the client called the database 7,499 times. Each time it received ceil( 112,458 / 7,499 ) = 15 rows.
CPU is the total time in seconds the server's processors were executing that step. Elapsed is the total wall clock time; this is the CPU time plus any waits, such as disk reads, network time, etc.
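For reference, one common way to generate such a trace file yourself is extended SQL tracing, sketched below; exact privileges, events and trace file locations vary by version and configuration:

-- Enable extended SQL trace (event 10046, level 8 includes wait events) for this session
ALTER SESSION SET tracefile_identifier = 'my_test';
ALTER SESSION SET events '10046 trace name context forever, level 8';

-- ... run the query you want to analyse ...

ALTER SESSION SET events '10046 trace name context off';

-- Then format the raw trace file on the server, for example:
--   tkprof <raw_trace_file>.trc report.txt sys=no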

What is the maximum LIMIT DURATION in the LAG function in ASA?

I am streaming data from devices and I want to use the LAG function to identify the last value received from a particular device. The data is not streamed at regular intervals, and in rare cases days can pass between receiving data from a device.
Is there a maximum period for the LIMIT DURATION clause?
Is there any down-side to having long LIMIT DURATION periods?
There is no maximum period for LIMIT DURATION in the language. However, it is limited by the amount of data the input source can hold, e.g. 1 day is the default retention policy for Event Hub (it can be increased in the configuration).
When the job starts, Azure Stream Analytics reads up to the LIMIT DURATION amount of data from the source to make sure it has the correct value for the LAG at job start time. If the data volume is high, this can increase job start time.
If you need to use data that is more than several days old, it may make more sense to use it as reference data (which can be updated at daily intervals, for example).
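For illustration, LAG with a long lookback might be written like this; the input alias and field names here are hypothetical:

-- Last value received per device, looking back up to 7 days.
SELECT
    deviceId,
    value,
    LAG(value) OVER (PARTITION BY deviceId LIMIT DURATION(day, 7)) AS previousValue
FROM [input]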

SQL Server Timeout based on locking?

I've got a SQL query that normally runs in about 0.5 seconds, and now, as our server gets slightly busier, it is timing out. It involves a join on a table that is updated every couple of seconds as a matter of course.
That is, the EmailDetails table gets updated by a mail service to notify me of reads, clicks, etc., and that table gets updated every couple of seconds as those notifications come through. I'm wondering if the join below is timing out because those updates are locking the table and blocking my SQL join.
If so, any suggestions how to avoid my timeout would be appreciated.
SELECT TOP 25
    dbo.EmailDetails.Id,
    dbo.EmailDetails.AttendeesId,
    dbo.EmailDetails.Subject,
    dbo.EmailDetails.BodyText,
    dbo.EmailDetails.EmailFrom,
    dbo.EmailDetails.EmailTo,
    dbo.EmailDetails.EmailDetailsGuid,
    dbo.Attendees.UserFirstName,
    dbo.Attendees.UserLastName,
    dbo.Attendees.OptInTechJobKeyWords,
    dbo.Attendees.UserZipCode,
    dbo.EmailDetails.EmailDetailsTopicId,
    dbo.EmailDetails.EmailSendStatus,
    dbo.EmailDetails.TextTo
FROM
    dbo.EmailDetails
LEFT OUTER JOIN
    dbo.Attendees ON (dbo.EmailDetails.AttendeesId = dbo.Attendees.Id)
WHERE
    (dbo.EmailDetails.EmailSendStatus = 'NEEDTOSEND' OR
     dbo.EmailDetails.EmailSendStatus = 'NEEDTOTEXT')
    AND dbo.EmailDetails.EmailDetailsTopicId IS NOT NULL
ORDER BY
    dbo.EmailDetails.EmailSendPriority,
    dbo.EmailDetails.Id DESC
Adding Execution Plan:
https://dl.dropbox.com/s/bo6atz8bqv68t0i/emaildetails.sqlplan?dl=0
It takes 0.5 seconds on my fast MacBook, but on my real server with magnetic media it takes 8 seconds.
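One way to test the locking theory while the query is running slowly is to look for blocked sessions with the standard DMVs (a sketch):

-- Requests currently blocked by another session, with what they are waiting on
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;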