Stream Analytics: event processing delay, performance issues - azure-stream-analytics

I am seeing delays in event processing through Stream Analytics. I am working on an IoT project in which a gateway sends messages to IoT Hub. Since we deal with different types of messages, we route them through an Event Hub endpoint. Our Stream Analytics job picks the events up from Event Hub, processes them, and pushes the results to a Service Bus queue.
About a minute after the job starts, messages begin arriving in the Service Bus queue. We compared the Event Hub enqueue time (UTC) against the Stream Analytics processing time: initially, events are processed within 5 to 10 seconds.
After 1 hour, however, this delay grows to 15 seconds, and it keeps increasing; after 6 or 7 hours, the gap between a message being produced and being delivered to the Service Bus queue is more than 1 minute. Since our application is based directly on real-time data, I currently have to restart the job every 7 hours, which is not a permanent solution.
The Event Hub has 4 partitions and the job uses 3 streaming units. Streaming-unit utilization is very low (16%), so increasing streaming units is not a solution to this problem.
Kindly help on this issue.
My Stream Analytics query is given below:
WITH
rawmessage AS
(
    SELECT
        digitaleventhubstreaminputonlineclassaforrawdata.*,
        GetMetadataPropertyValue(digitaleventhubstreaminputonlineclassaforrawdata, 'EventHub.IoTConnectionDeviceId') AS iotdevice
    FROM digitaleventhubstreaminputonlineclassaforrawdata
    PARTITION BY PartitionId
),
messagetoprocess AS
(
    SELECT
        rawmessage.*,
        digitalblobreferenceinputnmea.*,
        digitalblobreferenceinputwidget.*,
        digitalblobreferenceinputscalingfactor.*
    FROM rawmessage PARTITION BY PartitionId
    LEFT JOIN digitalblobreferenceinputnmea
        ON rawmessage.vessel_id = digitalblobreferenceinputnmea.vessel_id
    LEFT JOIN digitalblobreferenceinputwidget
        ON rawmessage.vessel_id = digitalblobreferenceinputwidget.vessel_id
    LEFT JOIN digitalblobreferenceinputscalingfactor
        ON rawmessage.vessel_id = digitalblobreferenceinputscalingfactor.vessel_id
    WHERE rawmessage.sensorval IS NOT NULL
),
processedmessage AS
(
    SELECT
        event.vessel_id AS vessel_id_fk,
        event.iotdevice AS device_id_fk,
        event.PartitionId AS partitionId,
        UDF.getEpochTime(event.EventProcessedUtcTime) AS EventProcessedUtcTime,
        UDF.getAnalyticsProcessTime('arg') AS AnalyticsProcessTime
    FROM messagetoprocess AS event PARTITION BY PartitionId
)
-- Output is written to a Service Bus queue for socket push
SELECT *
INTO digitalqueueoutputonlineclassasocketdata
FROM processedmessage
PARTITION BY PartitionId
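For context, this is roughly how we measure the gap on the consuming side: each output message carries the job's EventProcessedUtcTime, so a consumer can see how stale a message already is when it reaches the queue. A minimal sketch, assuming the azure-servicebus v7 Python SDK, that UDF.getEpochTime emits a Unix timestamp in seconds, and a placeholder connection string:

import json
from datetime import datetime, timezone
from azure.servicebus import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"  # placeholder, not from the original post
QUEUE = "digitalqueueoutputonlineclassasocketdata"

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_receiver(queue_name=QUEUE) as receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            body = json.loads(str(msg))
            # Assumes UDF.getEpochTime converted EventProcessedUtcTime to epoch seconds
            processed = datetime.fromtimestamp(body["EventProcessedUtcTime"], tz=timezone.utc)
            delay = (datetime.now(timezone.utc) - processed).total_seconds()
            print(f"end-to-end delay: {delay:.1f}s")
            receiver.complete_message(msg)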

We raised a support ticket, and Microsoft resolved it. It was a bug on Microsoft's side, and they have deployed the fix to all regions. The delay was happening while merging the reference input.

Related

Query Redis data after elapsed time

I am currently in the midst of a POC where I plan to store some IoT data in Redis.
Here's my question:
I would like to monitor the data sent by multiple IoT devices and raise alarms if a device fails to report telemetry within a certain time threshold.
For example:
Device 1: boots at 09:00 am; expected turnaround time: 2 min.
After 2 min 01 sec:
Device 1 has failed to report back in the given time.
Is there a way to query Redis so that it returns the data that has passed a certain time threshold?
Any references will be appreciated, thanks!
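One common pattern for this (a sketch, not from the original thread): keep a Redis sorted set whose score is each device's next report deadline, then periodically ask for members whose deadline has already passed. This assumes redis-py; the key name, helper names, and default turnaround are illustrative:

import time
import redis

r = redis.Redis()

def record_ping(device_id: str, turnaround_sec: int = 120) -> None:
    # Score = the time by which the next report is expected.
    r.zadd("device:deadlines", {device_id: time.time() + turnaround_sec})

def overdue_devices() -> list:
    # Members whose deadline (score) is already in the past missed their window.
    return r.zrangebyscore("device:deadlines", "-inf", time.time())

record_ping("device-1")   # boots at 09:00, expected turnaround 2 min
print(overdue_devices())  # empty now; contains b"device-1" after 2 min with no new ping

An alarm loop would call overdue_devices() on a timer, raise alerts, and either remove the member or leave it until the next ping rewrites its score.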

Azure Stream Analytics takes too long to process the events

I'm trying to configure an Azure Stream Analytics job, but I'm consistently getting bad performance. I receive data from a client system that pushes it into an Event Hub, and the ASA job writes it into an Azure SQL database.
A few days ago I noticed that it was generating a large number of InputEventLateBeyondThreshold errors. Here is an example event out of the ASA job; the Timestamp element is set by the client system.
{
  "Tag": "MANG_POWER_5",
  "Value": 1.08411181,
  "ValueType": "Analogue",
  "Timestamp": "2022-02-01T09:00:00.0000000Z",
  "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
  "PartitionId": 0,
  "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z"
}
You can see that the event arrives quickly but takes more than 30 minutes to be processed. To try to avoid InputEventLateBeyondThreshold errors, I increased the late-event threshold. This may be contributing to the increased processing time, but setting it too low also increases the number of InputEventLateBeyondThreshold errors.
The watermark delay is consistently high, and yet SU usage is around 5%. I have increased the SUs as high as I can for this query.
I'm trying to figure out why it takes so long to process the events once they have arrived.
This is the query I'm using:
WITH PIDataSet AS (SELECT * FROM [<event-hub>] TIMESTAMP BY timestamp)

-- Write data to SQL, joining with a lookup
SELECT
    i.Timestamp AS timestamp,
    i.Value AS value
INTO [<sql-database>]
FROM PIDataSet AS i
INNER JOIN [tagmapping-ref-alias] tm ON tm.sourcename = i.Tag

-- Write data to Azure Table, joining with a lookup
SELECT
    DATEDIFF(second, CAST('1970-01-01' AS DateTime), I1.Timestamp) AS Rowkey,
    I2.TagId AS PartitionKey,
    I1.Value AS Value,
    UDF.formatTime(I1.Timestamp) AS DeviceTimeStamp
INTO [<azure-table>]
FROM PIDataSet AS I1
JOIN [tagmapping-ref-alias] AS I2 ON I2.Sourcename = I1.Tag

-- Get an hourly count into a SQL table
SELECT
    I2.TagId,
    System.Timestamp() AS WindowEndTime,
    COUNT(I2.TagId) AS InputCount
INTO [tagmeta-ref-alias]
FROM PIDataSet AS I1
JOIN [tagmapping-ref-alias] AS I2 ON I2.Sourcename = I1.Tag
GROUP BY I2.TagId, TumblingWindow(Duration(hour, 1))
When you set up a 59-minute out-of-order window, you set up a 59-minute buffer for that input. When records land in that buffer, they wait 59 minutes until they come out. What you get in exchange is the opportunity for us to re-order these events so they appear in order to the job.
Using a value close to 1 hour is an extreme setting that will, by definition, automatically give you 59 minutes of watermark delay. This is very surprising, and I'm wondering why you need a value so high.
Edit
Now let's look at the late-arrival policy.
You are using an event time (TIMESTAMP BY timestamp), which means that your events can now be late; see this doc and this one.
What this means is that when a record arrives more than 1 hour late (i.e., its timestamp is more than 1 hour older than the wall clock on our servers, in UTC), we adjust its timestamp to our wall clock minus 1 hour and send it to the query. It also means that your tumbling window always has to wait an additional hour to be sure it isn't missing those late records.
Here is what I would do: restore the default settings (no out-of-order window, 5-second late events, adjust events). Then, when you get InputEventLateBeyondThreshold, it means the job received a timestamp that was more than 5 seconds in the past. You're not losing the data; we adjust its System.Timestamp to a more recent value (but not the timestamp field, which we don't change).
What we then need to understand is why it takes more than 5 seconds for a record in your pipeline to go from production to consumption. Is it because you have big delays in your ingestion pipeline, or because you have a clock skew on your producer? Do you know?
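To separate producer clock skew from in-job delay, you can diff the three timestamps on the sample event above. A minimal Python sketch (field values copied from the example event; the parser just trims ASA's 7-digit fractional seconds to something fromisoformat accepts):

from datetime import datetime

def parse(ts: str) -> datetime:
    head, _, frac = ts.rstrip("Z").partition(".")
    return datetime.fromisoformat(f"{head}.{frac[:6]}" if frac else head)

event = {
    "Timestamp": "2022-02-01T09:00:00.0000000Z",
    "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z",
    "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
}

ingest_lag = parse(event["EventEnqueuedUtcTime"]) - parse(event["Timestamp"])
process_lag = parse(event["EventProcessedUtcTime"]) - parse(event["EventEnqueuedUtcTime"])
print(ingest_lag.total_seconds())   # ~0.9 s: producer clock and ingestion look healthy
print(process_lag.total_seconds())  # ~2164 s: the half-hour delay happens inside the job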

Storing time intervals efficiently in redis

I am trying to track server uptimes using Redis.
The approach I have chosen is as follows:
Server xyz keeps sending my service a ping indicating that it was alive and working during the last 30 seconds.
My service stores a list of all time intervals during which the server was active. This is done by storing a list of {startTime, endTime} entries in Redis, with the server's name (xyz) as the key.
Based on a user query, I use this list to generate server-uptime metrics, such as the percentage of downtime between times (T1, T2).
Example:
Assume the current time is T.
At T+30, the server sends a ping:
xyz: ["{start:T end:T+30}"]
At T+60, the server sends another ping:
xyz: ["{start:T end:T+30}", "{start:T+30 end:T+60}"]
and so on for all pings.
This works fine, but the issue is that over a long time period the list accumulates a lot of elements. To avoid this, on each ping I currently pop the last element of the list and check whether it can be merged with the latest time interval. If it can, I coalesce them and push a single time interval onto the list; if not, I push two time intervals.
With this, my list after step 2 becomes: xyz: ["{start:T end:T+60}"]
Some problems I see with this approach:
the merging is done in my service, not in Redis;
if my service is distributed, the list ordering might get corrupted by multiple readers and writers.
Is there a more efficient/elegant way to handle this, such as handling the merging of time intervals in Redis itself?
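One option (a sketch, not from the original thread): move the pop-merge-push into a Lua script so Redis executes it atomically on the server, which also removes the multi-writer corruption risk. This assumes intervals are stored as "start:end" strings of epoch seconds and uses redis-py; key and helper names are illustrative:

import redis

r = redis.Redis()

# Atomically merge a new interval with the tail of the list, or append it.
MERGE_PING = r.register_script("""
local last = redis.call('RPOP', KEYS[1])
local s, e = tonumber(ARGV[1]), tonumber(ARGV[2])
if last then
    local ls, le = string.match(last, '(%d+):(%d+)')
    if tonumber(le) >= s then
        s = tonumber(ls)                    -- contiguous: coalesce into one interval
    else
        redis.call('RPUSH', KEYS[1], last)  -- gap: keep the previous interval
    end
end
redis.call('RPUSH', KEYS[1], s .. ':' .. e)
""")

def ping(server: str, start_ts: int, end_ts: int) -> None:
    MERGE_PING(keys=[server], args=[start_ts, end_ts])

ping("xyz", 1700000000, 1700000030)
ping("xyz", 1700000030, 1700000060)  # list now holds the single entry b"1700000000:1700000060"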

Dataflow Apache beam Python job stuck at Group by step

I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records. At the group-by step I need to group on a key and concatenate one column. After concatenation, the size of the concatenated column exceeds 100 MB, which is why the group-by has to be done in the Dataflow job: it cannot be done at the BigQuery level because of BigQuery's 100 MB row-size limit.
The Dataflow job scales well when reading from BigQuery but gets stuck at the group-by step. I have two versions of the Dataflow code, and both get stuck there. The Stackdriver logs show messages like "processing stuck at lull for more than 1010 sec" (or similar) and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by stage to complete within 20 minutes, but it has been stuck for more than an hour and never finishes.
I figured it out myself.
Below are the 2 changes I made in my pipeline:
1. I added a Combine function just after the Group by Key (see screenshot; a sketch follows below).
2. Since the Group by Key, when running on multiple workers, does a lot of network-traffic exchange, and by default the network we use does not allow inter-worker communication, I had to create a firewall rule allowing traffic from one worker to another, i.e., opening the workers' IP range to network traffic.
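For change 1, this is roughly what combiner lifting looks like in Beam Python (step names and input are illustrative, not from the post). Unlike a bare GroupByKey followed by an in-memory concat, CombinePerKey lets each worker pre-combine its values before the shuffle; note that value order within a key is not guaranteed:

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([("k1", "a"), ("k1", "b"), ("k2", "c")])
        # CombinePerKey(fn) is applied to partial batches on each worker,
        # then again to merge the partial results after the shuffle.
        | "ConcatPerKey" >> beam.CombinePerKey(lambda parts: ",".join(parts))
        | "Print" >> beam.Map(print)
    )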

Azure SQL high wait time on "VDI_CLIENT_OTHER"

We're benchmarking our app with different scales of an Azure SQL database, and we're having a hard time saturating the db. Among other things, we've executed this query:
SELECT *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC
The top row of the result was something like:
wait_type         waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
VDI_CLIENT_OTHER  19560                409007428     60016             37281
What is this wait time? What exactly have we been waiting for during those 409000 seconds (almost 5 days)? Google doesn't seem to know what VDI_CLIENT_OTHER is.
VDI_CLIENT_OTHER is used in the case of new replica seeding or any other user-initiated workflow that triggers copies, such as updating the service tier or setting up a geo-replication link. A high wait time likely just means that seeding completed and the task remained running, waiting for additional work items that aren't arriving.