Azure Stream Analytics job hangs or stops processing input - azure-stream-analytics

I am using an Azure Stream Analytics job with IoT Hub as input and Azure Document DB and Azure Event Hub as outputs.
Streaming Units assigned: 1.
Message rate: 36 messages per second, with a payload of less than 1 KB per message.
After some time the ASA job stops processing input messages: the monitoring graph shows zero input and no output events are produced. I have verified that messages are still arriving at the IoT Hub.
There are no errors in the Activity or Diagnostic logs for that period.
After a restart, without any modification, the job works as expected again: the monitoring graph shows the input count and output events are generated.
This keeps happening at random intervals.

Related

Which GCP Log Explorer query will show the success message of data loading into BigQuery by a Dataflow job, so that a log sink to Pub/Sub can be created

I am running a Dataflow streaming job which reads data from a Pub/Sub topic and performs streaming inserts into BigQuery.
Once the data is loaded, I want to capture the success status from Log Explorer and send an acknowledgement back to another Pub/Sub topic.
Which Log Explorer query can serve this purpose?
I tried the query below, but it did not help.
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.datasetId="[REPLACE_WITH_YOUR_DATASET_ID]"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.projectId="REPLACE_WITH_YOUR_PROJECT_ID"
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.state="DONE"
Please help.
Thanking you,
Santanu
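For reference, here is a minimal sketch of the acknowledgement-forwarding step described above. It assumes a Cloud Logging sink already routes the matching job-completion entries to a Pub/Sub topic and uses the google-cloud-pubsub client library; every name in it is hypothetical.

import json
from google.cloud import pubsub_v1

PROJECT = "my-project"                  # hypothetical
SINK_SUBSCRIPTION = "bq-load-logs-sub"  # subscription on the log sink's topic (hypothetical)
ACK_TOPIC = "load-ack-topic"            # topic that should receive the acknowledgement (hypothetical)

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()
sub_path = subscriber.subscription_path(PROJECT, SINK_SUBSCRIPTION)
ack_topic_path = publisher.topic_path(PROJECT, ACK_TOPIC)

def forward_ack(message):
    # The sink delivers the matched LogEntry as the Pub/Sub message body (JSON).
    entry = json.loads(message.data.decode("utf-8"))
    ack = {"status": "DONE", "logName": entry.get("logName")}
    publisher.publish(ack_topic_path, json.dumps(ack).encode("utf-8"))
    message.ack()

streaming_pull = subscriber.subscribe(sub_path, callback=forward_ack)
try:
    streaming_pull.result(timeout=60)   # run briefly for this sketch
except Exception:
    streaming_pull.cancel()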

How to measure execution time of ASA job on edge device and on cloud?

I am running an Azure Stream Analytics query on an edge device as well as in the cloud.
I have the watermark delay (seconds) metric for both Stream Analytics pipelines.
But how can I measure the execution time of the Azure Stream Analytics jobs on the edge as well as in the cloud?
Thanks.
There is no such thing as an execution time for a job. A job can have multiple queries that each output events at different times. Some will take one message and output one message; others will take 100 messages and output 1; and yet others will compute over a time window and output events whenever the window ends.
The job itself can be started and stopped, but I don't think the time between those two actions is what you're looking for.
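If what you are after is end-to-end latency per event rather than a job execution time, one option (not from the answer above, just a hedged sketch) is to stamp each event at the source and compare that timestamp with the time the event lands in the output. The field names below are hypothetical:

from datetime import datetime

def parse(ts: str) -> datetime:
    # ISO-8601 timestamps such as "2020-01-01T12:00:07Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def latency_seconds(record: dict) -> float:
    # "eventTime" stamped by the device, "outputTime" stamped at the sink (hypothetical names)
    return (parse(record["outputTime"]) - parse(record["eventTime"])).total_seconds()

records = [
    {"eventTime": "2020-01-01T12:00:00Z", "outputTime": "2020-01-01T12:00:07Z"},
    {"eventTime": "2020-01-01T12:00:01Z", "outputTime": "2020-01-01T12:00:09Z"},
]
latencies = [latency_seconds(r) for r in records]
print(f"max latency: {max(latencies):.1f}s, avg latency: {sum(latencies)/len(latencies):.1f}s")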

Need burst message-per-second speed for devices at various times during the day with Azure IoT Hub

While Azure Event Hubs can handle thousands or even millions of messages per second, Azure IoT Hub has a surprisingly low limit on this.
S1 allows 12 msg/sec but 400,000 daily messages per unit.
S2 allows 120 msg/sec but 6,000,000 daily messages per unit.
S3 allows 6,000 msg/sec but 300,000,000 daily messages per unit.
Imagine an IoT solution where your devices normally send 1 message every hour, but can activate a short "realtime" mode in which they send a message every second for about 2 minutes.
Example: 10,000 IoT devices:
Let's say 20% of these devices happen to start a realtime-mode session simultaneously 4 times a day (we have no control over when individual customers start them). That is 2,000 devices, so the burst speed needed is 2,000 msg/second.
Daily message need:
Normal messages: 10,000 devices * 24 hours = 240,000 msg/day
Realtime messages: 2,000 devices * 120 msg (2 minutes at 1 msg/second) * 4 times a day = 960,000 msg/day
Total daily message count: 240,000 + 960,000 = 1,200,000 msg/day
Needed Azure IoT Hub tier: S1 with 3 units gives 1,200,000 msg/day ($25 * 3 units = $75/month).
Burst speed needed:
2,000 devices sending simultaneously every second for a couple of minutes a few times a day: 2,000 msg/second.
Needed Azure IoT Hub tier: S2 with 17 units gives 2,040 msg/second ($250 * 17 units = $4,250/month), or go for S3 with 1 unit, which gives 6,000 msg/second ($2,500/month).
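As a quick sanity check, the sizing arithmetic above can be reproduced with a short sketch, using only the figures quoted in this question:

# Sketch: reproduce the sizing arithmetic from this question using the quoted tier limits.
DEVICES = 10_000
BURST_SHARE = 0.20        # 20% of devices enter realtime mode at the same time
BURST_DURATION_S = 120    # 2 minutes at 1 msg/second
BURSTS_PER_DAY = 4

normal_per_day = DEVICES * 24                                       # 1 msg/hour per device -> 240,000
burst_devices = int(DEVICES * BURST_SHARE)                          # 2,000 devices
burst_per_day = burst_devices * BURST_DURATION_S * BURSTS_PER_DAY   # 960,000
total_per_day = normal_per_day + burst_per_day                      # 1,200,000 msg/day
burst_speed = burst_devices                                         # 2,000 msg/second

s1_units_for_quota = -(-total_per_day // 400_000)  # ceil -> 3 units  ($25 * 3  = $75/month)
s2_units_for_speed = -(-burst_speed // 120)        # ceil -> 17 units ($250 * 17 = $4,250/month)
print(total_per_day, burst_speed, s1_units_for_quota, s2_units_for_speed)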
The daily message count alone requires only a low IoT Hub tier, but the burst speed needed when realtime mode is activated requires a disproportionately high tier, which skyrockets the monthly cost of the solution (33 times) and ruins the business case.
Is it possible to allow temporary burst speeds at varying times during the day, as long as the total number of daily messages sent does not surpass the current tier's maximum limit?
I understood from a 2016 article by Nicole Berdy that the throttling on Azure IoT Hub is in place to avoid DDoS attacks and misuse. However, to be able to simulate realtime-mode functionality with Azure IoT Hub, we need more Event Hub-like messages-per-second speed. Can this be opened up by contacting support or something? Would it help if the whole solution lived inside its own protected network bubble?
Thanks,
For real-time needs, definitely consider Azure IoT Edge and double-check whether it can be implemented in your scenario.
In the calculations above you state, for example, that S2 has a speed of 120 msg/sec. That is not fully correct. Let me explain:
The throttle for device-to-cloud sends is applied only if you exceed 120 send operations/sec/unit.
Each message can be up to 256 KB, which is the maximum message size.
Therefore, the questions you need to answer to implement your scenario at the lowest possible cost are:
What is the message size of my devices?
Do I need to display messages in near real time in the customer's cloud environment, or is my concern the level of detail of the sensors during a specific time?
When I enable "burst mode", am I leveraging the batch mode of the Azure IoT SDK? (See the sketch below.)
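On that last point, here is a minimal sketch of application-level batching: packing several readings into a single device-to-cloud message so that a 2-minute burst costs far fewer send operations against the throttle. It assumes the azure-iot-device Python SDK and a hypothetical connection string and payload; it is not the SDK's built-in batch-send API.

import json
import time
from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=...;DeviceId=...;SharedAccessKey=..."  # hypothetical

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

# Collect 10 seconds of readings locally instead of sending 10 individual messages.
batch = []
for _ in range(10):
    batch.append({"ts": time.time(), "value": 42})  # replace with a real sensor reading
    time.sleep(1)

msg = Message(json.dumps(batch))  # one send operation for the whole batch
msg.content_type = "application/json"
msg.content_encoding = "utf-8"
client.send_message(msg)
client.disconnect()

Note that the daily quota still meters device-to-cloud messages in 4 KB blocks, so batching mainly helps against the send-operations-per-second throttle, not against the daily message count.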
To your questions:
Is it possible to allow for temporary burst speeds at varying times during a day as long as the total number of daily messages sent does not surpass the current tier max limit?
No. The limit for S2, for example, is 120 device-to-cloud send operations/sec/unit.
Can this be opened up by contacting support or something? Will it help if the whole solution is living inside its own protected network bubble?
No. The only exception is when you need to increase the total number of devices plus modules that can be registered to a single IoT Hub beyond 1,000,000. In that case you should contact Microsoft Support.

How to batch process data from Google Pub/Sub to Cloud Storage using Dataflow?

I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load this data into BigQuery? I want a batch job because a streaming job costs more.
Thank you for the clarification.
First, when you use Pub/Sub in Dataflow (the Beam framework), it is only possible in streaming mode:
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process doesn't need real time, you can skip Dataflow and save money. You can use Cloud Functions or Cloud Run for the process that I propose (App Engine also works, but it's not my first recommendation).
In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.
Solution 1
Connect your process to the pull subscription.
Every time you read a message (or a chunk of messages, for example 1,000), stream-write it into BigQuery. -> However, streaming writes are not free in BigQuery ($0.05 per GB).
Loop until the queue is empty. Set the timeout to the maximum value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue. (A sketch of this flow follows below.)
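A minimal sketch of Solution 1 (assuming the google-cloud-pubsub and google-cloud-bigquery client libraries; subscription, table and payload shape are hypothetical):

import json
from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"
SUBSCRIPTION = "cdc-pull-sub"
TABLE_ID = "my-project.my_dataset.my_table"

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

while True:
    # Pull a chunk of messages (pull is best effort and may return fewer than requested).
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1000})
    if not response.received_messages:
        break  # treat an empty pull as "queue is empty" for this sketch
    rows = [json.loads(m.message.data) for m in response.received_messages]
    errors = bq.insert_rows_json(TABLE_ID, rows)  # streaming insert (not free)
    if errors:
        raise RuntimeError(errors)
    subscriber.acknowledge(request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    })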
Solution 2
Connect your process to the pull subscription.
Read a chunk of messages (for example 1,000) and keep them in memory (in an array).
Loop until the queue is empty. Set the timeout to the maximum value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue. Also set the memory to the maximum value (2 GB) to prevent out-of-memory crashes.
Create a load job into BigQuery from your in-memory data array. -> Here the load job is free, and you are limited to 1,000 load jobs per day and per table.
However, this solution can fail if your app plus the data is larger than the maximum memory value. An alternative is to create a file in GCS every, for example, 1 million rows (depending on the size and memory footprint of each row). Name the files with a unique prefix, for example the date of the day (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, using a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). This way, all the files matching the prefix will be loaded. (See the sketch below.)
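And a minimal sketch of the load-job side of Solution 2 (again with hypothetical names; the commented line shows the GCS wildcard variant described above):

from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.my_table"  # hypothetical
bq = bigquery.Client()

# `rows` would be the array accumulated in memory by the same pull loop as above.
rows = [{"id": 1, "change": "INSERT"}]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = bq.load_table_from_json(rows, TABLE_ID, job_config=job_config)
job.result()  # wait for the (free) load job to finish

# GCS alternative when memory is not enough: write temp files and load them via a wildcard.
# job = bq.load_table_from_uri("gs://myBucket/YYYYMMDD-tempFile*", TABLE_ID, job_config=job_config)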
Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days so you have time to react and debug before messages are deleted from the subscription.
Personal experience: streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend considering the first solution if you can pay for it. The management and the code are smaller/easier!

Stream Analytics job has validation errors: Job will exceed the maximum amount of Event Hub Receivers

I am trying to write a query in ASA (Azure Stream Analytics) with a lot of left joins, and it fails to start the job with the following error:
Stream Analytics job has validation errors: Job will exceed the maximum amount of Event Hub Receivers.
You are most likely hitting Event Hub's limit of 5 readers per consumer group per partition. Take a look at this article.
Here is the relevant part:
Consumer groups
... When a job contains a self-join or multiple inputs, some input may be read by more than one reader downstream, which impacts the number of readers in a single consumer group. To avoid exceeding the Event Hub limit of 5 readers per consumer group per partition, it is a best practice to designate a consumer group for each Stream Analytics job.
...
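If the fix is to give the job its own consumer group, it can be created from the portal, the CLI, or programmatically. Below is a hedged sketch using the azure-mgmt-eventhub management SDK; all names are hypothetical and the call should be verified against the SDK version you use.

from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # hypothetical

client = EventHubManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One dedicated consumer group per Stream Analytics job keeps each job's readers
# within the 5-readers-per-consumer-group-per-partition limit.
client.consumer_groups.create_or_update(
    resource_group_name="my-rg",
    namespace_name="my-eh-namespace",
    event_hub_name="my-eventhub",
    consumer_group_name="asa-job-1",
    parameters={},  # no extra properties needed for a plain consumer group
)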