How to measure execution time of ASA job on edge device and on cloud? - azure-iot-hub

I am running an Azure Stream Analytics query on an edge device as well as in the cloud.
I have the watermark delay (in seconds) for both Stream Analytics pipelines.
But how can I measure the execution time of the Azure Stream Analytics jobs, both on the edge and in the cloud?
Thanks.

There is no such thing as an execution time for a job. A job can have multiple queries that each output events at different times. Some take one message and output one message, others take 100 messages and output 1, and still others compute over a time window and output events whenever the window closes.
The job itself can be started and stopped, but I don't think the time between those two actions is what you're looking for.
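If what you are actually after is end-to-end latency, the watermark delay you already have is commonly used as a proxy for the cloud job, and it can be read programmatically from Azure Monitor. Below is a minimal sketch for the cloud job only, assuming the azure-identity and azure-monitor-query Python packages; the resource ID is a placeholder, and the metric name (OutputWatermarkDelaySeconds) is an assumption you should verify against the job's Metrics blade.

```python
# Sketch: read the watermark delay metric of a *cloud* ASA job from Azure Monitor.
# The resource ID is a placeholder and the metric name is an assumption.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())

# Hypothetical resource ID of the cloud Stream Analytics job.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.StreamAnalytics/streamingjobs/<job-name>"
)

result = client.query_resource(
    resource_id,
    metric_names=["OutputWatermarkDelaySeconds"],  # assumed metric name: verify in the portal
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.MAXIMUM],
)

for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.maximum)
```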

Related

Cloud Pub/Sub to BigQuery through Dataflow SQL

I would like to understand how a Dataflow pipeline works.
In my case, I have something published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has batch settings of 1000 messages, 1 MB, and 10 seconds of latency.
The question is: when messages are published in a batch as stated above, does Dataflow SQL take the messages in the batch and write them to BigQuery all in one go, or does it write one message at a time?
Also, is there any benefit of one over the other?
Please comment if any other details are required. Thanks.
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline, and to run it on Dataflow.
Because the source is Pub/Sub, a streaming pipeline is created based on your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in mind that Dataflow streaming never scales to 0, so you will always pay for 1 or more VMs to keep your pipeline up and running.
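For reference, the publisher-side batch settings the question describes would look roughly like the sketch below (google-cloud-pubsub Python client; project and topic names are placeholders). Batching only groups the underlying publish RPCs: each publish call is still one individual Pub/Sub message, so the streaming pipeline downstream sees individual messages and scales on them as a stream.

```python
# Sketch of the publisher batch settings described in the question:
# flush a batch after 1000 messages, 1 MB, or 10 s, whichever comes first.
import json

from google.cloud import pubsub_v1

batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,      # flush after 1000 messages...
    max_bytes=1024 * 1024,  # ...or 1 MB of data...
    max_latency=10,         # ...or 10 seconds
)

publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Each publish() is still a single Pub/Sub message; only the RPCs are batched.
future = publisher.publish(topic_path, json.dumps({"value": 42}).encode("utf-8"))
print(future.result())  # message ID
```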

How to batch process data from Google Pub/Sub to Cloud Storage using Dataflow?

I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have been able to figure out how to stream the changes, but I need to run batch processing for a few tables in my database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load this data into BigQuery? I want a batch job because a streaming job costs more.
Thanks for the clarification.
First, when you use Pub/Sub in Dataflow (the Beam framework), it's only possible in streaming mode:
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process doesn't need real time, you can skip Dataflow and save money. You can use Cloud Functions or Cloud Run for the process I propose (App Engine works too, but it's not my first recommendation).
In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.
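As an illustration only (the function name and body are hypothetical), the periodically triggered process could be a plain HTTP-triggered Python Cloud Function that Cloud Scheduler calls on a schedule; its body would implement one of the two solutions below.

```python
# Minimal sketch of an HTTP entry point that Cloud Scheduler could invoke.
import functions_framework


@functions_framework.http
def drain_subscription(request):
    # Pull from the Pub/Sub subscription and write to BigQuery here,
    # following either Solution 1 or Solution 2 described below.
    processed = 0
    return f"processed {processed} messages", 200
```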
Solution 1
Connect your process to the pull subscription
Every time you read a message (or a chunk of messages, for example 1000), stream-write it into BigQuery. However, streaming writes are not free in BigQuery ($0.05 per GB).
Loop until the queue is empty. Set the timeout to the max value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue.
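A minimal sketch of Solution 1, assuming the google-cloud-pubsub and google-cloud-bigquery Python clients; the project, subscription, and table names are placeholders:

```python
# Solution 1 sketch: synchronously pull chunks of messages from a pull
# subscription and stream-insert them into BigQuery until the backlog is empty.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"                      # placeholder
SUBSCRIPTION = "cdc-subscription"           # placeholder pull subscription
TABLE = "my-project.my_dataset.cdc_events"  # placeholder table

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

while True:
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # the queue is empty

    rows = [json.loads(m.message.data) for m in response.received_messages]

    # Streaming insert: billed per GB of inserted data.
    errors = bq.insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

    # Only acknowledge once the rows are safely in BigQuery.
    subscriber.acknowledge(
        request={
            "subscription": sub_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```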
Solution 2
Connect your process to the pull subscription
Read a chunk of messages (for example 1000) and keep them in memory (in an array).
Loop until the queue is empty. Set the timeout to the max value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue. Also set the memory to the max value (2 GB) to prevent out-of-memory crashes.
Create a load job into BigQuery from your in-memory data array. Here the load job is free, but you are limited to 1000 load jobs per day and per table.
However, this solution can fail if your app plus the data is larger than the max memory value. An alternative is to write a file to GCS every, say, 1 million rows (depending on the size and memory footprint of each row). Name the files with a unique prefix, for example the current date (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, using a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). This way, all the files matching the prefix will be loaded. A sketch of both variants follows below.
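A minimal sketch of Solution 2 under the same assumptions (placeholder project, subscription, and table names), with the GCS wildcard variant noted in a comment:

```python
# Solution 2 sketch: pull everything into memory, then run a single load job.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"                      # placeholder
SUBSCRIPTION = "cdc-subscription"           # placeholder pull subscription
TABLE = "my-project.my_dataset.cdc_events"  # placeholder table

subscriber = pubsub_v1.SubscriberClient()
bq = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

rows, ack_ids = [], []
while True:
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break
    rows.extend(json.loads(m.message.data) for m in response.received_messages)
    ack_ids.extend(m.ack_id for m in response.received_messages)

if rows:
    # Load job from in-memory rows: free, but limited in jobs per day/table.
    bq.load_table_from_json(rows, TABLE).result()

    # GCS variant: write the rows to gs://myBucket/YYYYMMDD-tempFileXX files
    # instead, then load them all with a wildcard URI, e.g.:
    # bq.load_table_from_uri(
    #     "gs://myBucket/YYYYMMDD-tempFile*", TABLE,
    #     job_config=bigquery.LoadJobConfig(
    #         source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON),
    # ).result()

    # Ack only after the load succeeds. Note: if draining takes longer than the
    # subscription's ack deadline, messages may be redelivered; acking per
    # chunk is the alternative trade-off.
    for i in range(0, len(ack_ids), 1000):
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": ack_ids[i : i + 1000]}
        )
```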
Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days so you have time to react and debug before messages are deleted from the subscription.
Personal experience: streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend considering the first solution if you can pay for it. The management and the code are smaller/easier!

Azure Stream Analytics Job hangs or not processing input

Using an Azure Stream Analytics job with input: IoT Hub; outputs: Azure DocumentDB and Azure Event Hub.
Streaming Units assigned: 1.
Message rate: 36 messages per second, message payload less than 1 KB.
After some time the ASA job stops processing input messages: the monitoring graph shows zero input and no output events are produced. I verified that messages are still arriving at IoT Hub.
There are no errors in the Activity or Diagnostic logs for that duration.
On restart, without any modification, it starts working as expected, showing the input count on the ASA job monitoring graph and generating output events.
This keeps happening at random time intervals.

Azure Streaming Analytics Jobs scaling SU

I'm running into an issue I can't fix:
I created an Azure Stream Analytics job that sometimes runs into this error:
Resource usage is over the capacity for one or more of the query steps. Event processing may be delayed or stop making progress. This may be a result of large window in your query, large events in your input, large out of order tolerance window, or a combination of the above. Please try to partition your query, or break down your query to more steps, and add Streaming Unit resources from the Scale tab to avoid such condition.
So I decided to scale up the SUs. I stopped the job and opened the Scale pane, but the input box stays greyed out. I can't change the SU value, and there is no error message.
What can I do?
Many thanks!
In order to change the SU scale, you need the admin/owner role on the Azure Stream Analytics job.
Sorry for the inconvenience, we are changing the user experience in order to make this more explicit.
Let me know if it works for you after you get the admin role.
Thanks,
JS

Stream analytics small rules on high amount of device data

We have the following situation.
We have multiple devices sending data to an Event Hub (the interval is one second).
We have a lot of small Stream Analytics rules for alarm checks. The rules are applied to a small subset of the devices.
Example:
10000 Devices sending data every second.
Rules for roughly 10 devices.
Our problem:
Each Stream Analytics query processes all of the input data, although the job only needs a small subset of it. Each query filters on device ID and discards most of the data. Thus we need a huge number of Streaming Units, which leads to high Stream Analytics costs.
Our first idea was to create an Event Hub for each query. However, each Event Hub requires at least one throughput unit, which also leads to high costs.
What is the best solution in our case?
One possible solution would be to use IoT Hub and create a different endpoint with a specific route for the devices you want to monitor.
Have a look at this blog post to see if it will work for your particular scenario: https://azure.microsoft.com/en-us/blog/azure-iot-hub-message-routing-enhances-device-telemetry-and-optimizes-iot-infrastructure-resources/
Then, in Azure Stream Analytics, you can use this specific endpoint as the input.
Thanks,
JS (Azure Stream Analytics team)