Capture backlog is high even though the captured messages metric matches incoming messages - azure-data-lake

In Azure Event Hubs Capture, the capture backlog is always high even though the captured messages metric matches the incoming messages. How should we interpret this? Does it mean that Azure is dropping message payloads while capturing to Azure Data Lake Gen1?

From the Microsoft Azure metrics documentation, I can see that the Capture Backlog metric means:
The number of bytes that are yet to be captured to the chosen destination.
The default aggregation type is Bytes and the unit is bytes.
When Capture is enabled on an Event Hub, an empty file is created in the capture storage location for periods when there are no events arriving at the Event Hub.
From Microsoft's definition of Capture Backlog, the value shown is the amount of data that is queued to be captured, i.e. written to the capture destination, which in your case is Azure Data Lake Gen1.
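If you want to compare the three metrics side by side rather than reading them off the portal charts, they can also be pulled programmatically. Below is a minimal sketch using the azure-monitor-query Python package; the namespace resource ID is a placeholder, and I'm assuming the standard Event Hubs metric names IncomingMessages, CapturedMessages and CaptureBacklog are the ones you are comparing.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID - replace with your Event Hubs namespace.
resource_id = ("/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
               "Microsoft.EventHub/namespaces/<namespace>")

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    resource_id,
    metric_names=["IncomingMessages", "CapturedMessages", "CaptureBacklog"],
    timespan=timedelta(hours=1),        # last hour
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
)
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)

Looking at the three series over time makes it easier to judge whether the backlog is steady-state queuing or is actually growing without bound.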

Related

Azure IoT Hub monitoring usage and history

I recently started a project with Azure IoT Edge on the IoT Hub free tier, so I'm a total beginner. I set up a device sensor and a module, and I am successfully sending data to my module and to IoT Hub. I can monitor the messages sent from the sensor with the Azure IoT Hub extension for Visual Studio Code.
I can see the messages I'm sending, but I am having an issue with the number of messages being sent.
I use Azure portal metrics to monitor the number of messages sent, and very often Azure shows me different numbers as I refresh. For example "1000" messages and, after a refresh, "800" messages, etc.
Another issue I'm having is that the Metrics functionality shows messages being sent during a time when my sensors weren't sending any.
Is there a way to get a detailed report with a history of the messages that the Hub receives?
Any help or advice would be highly appreciated! Thank you
As far as I know there is no "nice and simple" report that will show you what you need. However, if you want to get the historical events that the IoT Hub processed, it can be done. Note, though, that the history is limited and can be no longer than 7 days. The current retention period can be seen in the Azure portal in the "Built-in endpoints" section - there is a "Retain for" setting with a value from 1 day (default) to 7 days.
If the events are within this range, you can use the Azure CLI command az iot hub monitor-events (it runs fine from PowerShell) with the --enqueued-time parameter to "look back" in history. The time is specified in milliseconds since the Unix epoch. Example command:
az iot hub monitor-events --hub-name 'your_iot_hub_name' --device-id 'your_registered_device_id' --properties all --enqueued-time 1593734400000
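If it helps, the epoch value can be computed with a couple of lines of Python; the date below is just an illustration and happens to correspond to the 1593734400000 used above.

from datetime import datetime, timezone

# 2020-07-03 00:00:00 UTC expressed as milliseconds since the Unix epoch
start = datetime(2020, 7, 3, tzinfo=timezone.utc)
print(int(start.timestamp() * 1000))  # 1593734400000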

How to understand Bulk transfer using libusb

Say I have a USB device, a camera for instance, and I would like to load the image sequence captured by the camera onto the host using the libusb API.
The following points are not clear to me:
How is the IN endpoint on the device populated? Is it always the full image data of one frame (optionally plus some status data)?
libusb_bulk_transfer() has a parameter length to specify how much data the host wants to read IN, and another parameter transferred indicating how much data was actually transferred. The question is: should I always request the same amount of data that the IN endpoint would send? If so, in what case would transferred be smaller than length? (See the sketch below.)
How is it determined how much data the IN endpoint sends for each transfer request?
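For what it's worth, here is a rough sketch of a bulk-read loop using the libusb1 Python binding (the usb1 package), which wraps libusb_bulk_transfer with the same length / actually-transferred semantics. Every constant in it is an assumption for illustration, not something read from a real camera.

import usb1

# All of these values are made up; the real ones come from your camera's
# descriptors (lsusb -v or the vendor documentation).
VENDOR_ID, PRODUCT_ID, EP_IN = 0x1234, 0x5678, 0x81
FRAME_SIZE = 640 * 480 * 2   # assumed payload size of one frame
CHUNK = 16384                # how much we request per bulk transfer ("length")

with usb1.USBContext() as ctx:
    handle = ctx.openByVendorIDAndProductID(VENDOR_ID, PRODUCT_ID)
    if handle is None:
        raise RuntimeError("camera not found")
    handle.claimInterface(0)
    frame = bytearray()
    while len(frame) < FRAME_SIZE:
        # bulkRead may return fewer bytes than requested; a short read is the
        # usual way a device marks the end of a transfer (e.g. end of frame),
        # which is exactly the case where "transferred" < "length".
        data = handle.bulkRead(EP_IN, CHUNK, timeout=1000)
        frame.extend(data)
        if len(data) < CHUNK:
            break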

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will be used as my data lake. But doing a PUT call for every event is too costly, so I want to append events into a file and then do one PUT call per hour. What is a way of doing this without losing data if a node in my message processing service goes down?
If my processing service acks a message, the message is no longer in Google Pub/Sub but is not yet in Google Cloud Storage, and if the processing node goes down at that moment, the data is lost.
My desired usage is similar to this post about using AWS Kinesis Firehose to batch messages before PUTting them into S3, but even Kinesis Firehose's maximum batch interval is only 900 seconds (or 128 MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.
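Here is a minimal sketch of that hourly pattern with the google-cloud-pubsub and google-cloud-storage Python clients. The project, subscription, bucket and object names are placeholders, and a real job would keep pulling until the backlog (e.g. num_undelivered_messages) is drained and would add error handling.

from google.cloud import pubsub_v1, storage

# Placeholder names - substitute your own project, subscription and bucket.
PROJECT_ID = "my-project"
SUBSCRIPTION_ID = "iot-events-sub"
BUCKET_NAME = "my-data-lake"

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

# Pull a batch of messages without acking them yet.
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1000})

if response.received_messages:
    # Assumes text/JSON payloads.
    payloads = [m.message.data.decode("utf-8") for m in response.received_messages]
    ack_ids = [m.ack_id for m in response.received_messages]

    # One PUT per batch: write the whole batch as a single object.
    blob = storage.Client().bucket(BUCKET_NAME).blob("events/batch-0001.jsonl")
    blob.upload_from_string("\n".join(payloads))

    # Ack only after the write succeeded, so a crash before this point
    # leaves the messages unacked in Pub/Sub for redelivery.
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})

Because the acknowledge happens only after the upload succeeds, the worst case on a crash is a duplicate object write rather than data loss; the hourly trigger can come from cron, Cloud Scheduler, or whatever scheduler you already run.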

Stream Analytics and stream position

I have two general questions about Stream Analytics behavior. I found nothing, or (for me) misleading information, in the documentation about them.
Both questions concern a Stream Analytics job with an Event Hub as the input source.
1. Stream position
When the analytics job is started, are only events that arrive after startup processed? Are older events that are still in the Event Hub ignored?
2. Long span time window
The documentation says:
"The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time."
If I create a SELECT statement with, for example, a 7-day tumbling window, is there any limit on how many output elements the job can hold in memory before closing the window and sending out the result set? On my heavily loaded Event Hub that can be millions of output results.
For your first question, there is no evidence that Stream Analytics ignores older events that arrived before the job started. The event lifecycle actually depends on the Event Hub message retention (1~7 days), not on Stream Analytics. However, you can specify eventStartTime & eventEndTime for an input to retrieve the data you want; see the REST request properties of a Stream Analytics input.
These properties are also exposed in the Azure portal.
For your second question, according to the Azure limits & quotas for Stream Analytics and the Windowing reference, no limit is documented for memory usage; the only documented limits are the following.
For windowing, "The maximum size of the window in all cases is 7 days."
For Stream Analytics, "Maximum throughput of a Streaming Unit" is 1 MB/s.
For Event Hubs, the limits from the Event Hubs quotas table apply.
These limits are what can cause output delay.

Stream Analytics job not receiving inputs from IoT Hub

I followed an IoT Hub tutorial and got it working. I then created a Stream Analytics job and used the above as an input (the test connection succeeds).
However, I do not see any input being received. When running a sample test I get the following error:
Description: Error code: ServiceUnavailable. Error message: Unable to connect to input source at the moment. Please check if the input source is available and if it has not hit connection limits.
I can see telemetry messages being received in the IoT Hub. Any help would be appreciated.
Is the Stream Analytics job running?
I had a similar problem where I wasn't getting any events from Stream Analytics, and I had forgotten to turn it on.
Click on the Stream Analytics job > Overview > Start.
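If you would rather start the job from code than from the portal, here is a rough sketch with the azure-mgmt-streamanalytics Python package. The subscription, resource group and job name are placeholders, and I'm assuming the track-2 SDK where long-running operations are exposed as begin_* methods.

from azure.identity import DefaultAzureCredential
from azure.mgmt.streamanalytics import StreamAnalyticsManagementClient

# Placeholder values - substitute your own subscription, resource group and job.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"
JOB_NAME = "my-streaming-job"

client = StreamAnalyticsManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start the job; by default output starts at job start time.
poller = client.streaming_jobs.begin_start(RESOURCE_GROUP, JOB_NAME)
poller.wait()
print("job started")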
I had the same problem (using Event Hubs in my case). The root cause was that I had too many queries within my job running against the same input. I solved it by splitting my input into several inputs across multiple consumer groups.
From the documentation (emphasis added):
Each Stream Analytics IoT Hub input should be configured to have its own consumer group. When a job contains a self-join or multiple inputs, some input may be read by more than one reader downstream, which impacts the number of readers in a single consumer group. To avoid exceeding IoT Hub limit of 5 readers per consumer group per partition, it is a best practice to designate a consumer group for each Stream Analytics job.
I have exactly the same problem, though the modules on my Raspberry Pi are running without failure.
SA says: "Make sure the input has recently received data and the correct format of those events has been selected."