Stream Analytics and stream position - azure-stream-analytics

I have two general questions about Stream Analytics behavior. I found either nothing or (for me) misleading information in the documentation.
Both questions concern a Stream Analytics job with an Event Hub as the input source.
1. Stream position
When the analytics job is started, are only events that arrive after startup processed? Are older events that are still in the Event Hub ignored?
2. Long span time window
The documentation says:
"The output of the window will be a single event based on the aggregate function used with a timestamp equal to the window end time."
Suppose I create a SELECT statement with, for example, a 7-day tumbling window. Is there any limit on how many elements the job can hold in memory before closing the window and sending out the result set? On my heavily loaded Event Hub that could be millions of results.

For your first question, there is no evidence that Stream Analytics ignores older events that arrived before the job started. The event lifecycle actually depends on the Event Hub message retention period (1–7 days), not on Stream Analytics. However, you can specify eventStartTime & eventEndTime on an input to retrieve exactly the data you want; see the properties of the first REST request in the Stream Analytics Input reference.
(The original answer included an Azure portal screenshot showing these settings.)
For your second question, according to the Azure limits & quotas for Stream Analytics and the Windowing reference, no memory-usage limit is documented; the only documented limits are the following.
For windowing, "The maximum size of the window in all cases is 7 days."
For Stream Analytics, the "Maximum throughput of a Streaming Unit" is 1 MB/s.
For Event Hubs, the relevant quotas are listed in the Event Hubs quotas documentation (the original answer reproduced the table here).
Hitting any of these limits will cause output delay.
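As a rough sanity check against the 1 MB/s-per-streaming-unit figure, here is a back-of-envelope calculation in Python. The event rate and average event size are made-up assumptions, so substitute numbers from your own Event Hub.

```python
# Back-of-envelope check of a 7-day tumbling window against the
# 1 MB/s per-streaming-unit throughput figure quoted above.
# The event rate and event size below are assumptions -- replace them
# with measurements from your own Event Hub.

EVENTS_PER_SECOND = 5_000                 # assumed ingress rate
AVG_EVENT_SIZE_BYTES = 1_024              # assumed average event size (1 KB)
SU_THROUGHPUT_BYTES_PER_SEC = 1_000_000   # ~1 MB/s per streaming unit

ingress_bytes_per_sec = EVENTS_PER_SECOND * AVG_EVENT_SIZE_BYTES
streaming_units_needed = ingress_bytes_per_sec / SU_THROUGHPUT_BYTES_PER_SEC

window_days = 7
events_per_window = EVENTS_PER_SECOND * 60 * 60 * 24 * window_days

print(f"Ingress: {ingress_bytes_per_sec / 1e6:.1f} MB/s "
      f"-> roughly {streaming_units_needed:.0f} SU of throughput")
print(f"Events aggregated per 7-day window: {events_per_window:,}")
```

Note that, per the sentence quoted in the question, a tumbling window emits a single aggregated event per window, so the volume that matters for the limits above is the ingress the job has to process, not millions of buffered output rows.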

Related

Stream Analytics - No such host is known

We are in Australia East.
We have an event hub with events coming through from an application. On the morning of Friday 19th March I created a Stream Analytics job to try to read one of the event streams. This worked successfully against the Event Hub and returned results in the "Input preview" window during setup. That seems to match the timings on the message below (we are about 12 hours ahead of UTC).
However, by Friday afternoon it started failing with one of the error messages "InternalServerError" or "No such host is known". I was working through the drop-down boxes available when creating a new input after selecting "Select Event Hub from your subscriptions", so I know I haven't got anything wrong in the setup.
When trying to submit a support request, we got a slightly cryptic message (the original question included a screenshot) whose link points at Resource Health.
The link doesn't work: it claims Stream Analytics is not supported in Resource Health, even though it is. Does this mean "it's down, sorry, we are working on it" (as in actually working on it), or is it a canned response we should escalate?
Or is anyone else having trouble creating Stream Analytics jobs, meaning we are suffering an outage? The Azure Status monitor shows everything in good health.
It looks like this was a permission error. I needed to go into the Event Hub's IAM settings and give the Stream Analytics job Reader permissions. For some reason this wasn't added automatically when setting up the Stream Analytics job, as I thought it would be.
Once the permission was set, the job started successfully.
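If you prefer scripting that fix, here is a minimal sketch that drives the Azure CLI from Python. The principal object ID and Event Hub scope are placeholders, and the role name ("Reader") simply mirrors the answer above; adjust both for your environment.

```python
import subprocess

# Hypothetical placeholders -- substitute the Stream Analytics job's
# managed-identity object ID and your Event Hub namespace's resource ID.
PRINCIPAL_OBJECT_ID = "<stream-analytics-identity-object-id>"
EVENT_HUB_SCOPE = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.EventHub/namespaces/<namespace>"
)

# Grants the job the Reader role on the Event Hub, mirroring the IAM
# change described above.
subprocess.run(
    [
        "az", "role", "assignment", "create",
        "--assignee", PRINCIPAL_OBJECT_ID,
        "--role", "Reader",
        "--scope", EVENT_HUB_SCOPE,
    ],
    check=True,
)
```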

Capture backlog is high even though the captured message metric matches the incoming messages

In Azure Event Hubs Capture, the capture backlog is always high even though the captured message metric matches the incoming messages. How should we interpret this? Does it mean that Azure is dropping message payloads while capturing to Azure Data Lake Gen1?
From the Microsoft Azure metrics documentation, I can see that the Capture Backlog metric means:
The number of bytes that are yet to be captured to the chosen destination.
The default aggregation type is Bytes and the unit is bytes.
When Capture is enabled on an Event Hub, an empty file is created in the capture storage location for time windows in which no events arrive at the Event Hub.
From Microsoft's definition of Capture Backlog, the value shown is the amount of data still queued to be captured to the chosen destination, i.e. in your case Azure Data Lake Gen1.
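If you want to compare these metrics programmatically rather than eyeballing the portal charts, something like the following sketch with the azure-monitor-query package should work. The namespace resource ID is a placeholder, and the metric names and aggregation are assumptions based on the standard Event Hubs metrics.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder -- full resource ID of the Event Hubs namespace.
RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.EventHub/namespaces/<namespace>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Pull the three metrics discussed above over the last hour.
response = client.query_resource(
    RESOURCE_ID,
    metric_names=["IncomingMessages", "CapturedMessages", "CaptureBacklog"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)
```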

Twitch API: How can I know that a stream has finished?

I have a stream URL like https://www.twitch.tv/streams/26114851120/channel/31809543. The stream is online and I need to catch the moment when it finishes.
I searched the Twitch API documentation and didn't find any events for this. My first thought was to send a request every few minutes and handle the change in status when it happens. There would be a small delay, but that isn't critical.
However, there are many streams I have to track, and I'm afraid Twitch might block me for this.
Are there any other ways to detect when a stream finishes?
As best I can tell there's no way to directly listen for a stream going online or offline, but you can still monitor a large number of streams in spite of that.
There are a fair number of Q&As on the official Twitch developer site asking for this functionality, but all the ones I could find are answered with the same "it's not currently possible."
Keep in mind that you can check the status of multiple channels simultaneously (up to 100 per request) using a comma-separated list and the limit query parameter (see Get-Live-Streams):
https://api.twitch.tv/kraken/streams?channel=Channel1,Channel2&limit=100
That'll return an object containing an array of online streams (the streams property).
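A minimal polling sketch in Python against that endpoint, assuming a registered client ID; the client ID and channel list are placeholders, and depending on the API version the channel parameter may expect numeric channel IDs rather than names.

```python
import requests

# Placeholders -- your registered application's client ID and the
# channels you want to check (names here; numeric IDs for v5).
CLIENT_ID = "<your-client-id>"
CHANNELS = ["channel1", "channel2"]

resp = requests.get(
    "https://api.twitch.tv/kraken/streams",
    params={"channel": ",".join(CHANNELS), "limit": 100},
    headers={
        "Client-ID": CLIENT_ID,
        "Accept": "application/vnd.twitchtv.v5+json",
    },
    timeout=10,
)
resp.raise_for_status()

# Only channels that are currently live appear in the "streams" array;
# anything missing from it can be treated as offline/finished.
live = {s["channel"]["name"].lower() for s in resp.json()["streams"]}
offline = {c.lower() for c in CHANNELS} - live
print("live:", live, "offline:", offline)
```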
Rate Limits
Twitch's official stance regarding rate limiting is a recommendation of no more than "about 1 request per second". That said, they don't throttle you for making several requests in immediate succession, but rather on the cumulative amount.
Note that there's a separate rate limit for IRC-related actions of 20 commands/messages per 30 seconds normally or 100 per 30 if a mod. Violating that will trigger a 30 minute lockout.
API-Side Caching
API results are also cached for 1-3 minutes which reduces load on their end. Given that, there's not much value in polling for anything more frequently than that (i.e. you should wait at least 1 minute before making the exact same request again since you'd just get the same response).
You Can Still Monitor ~6000 Streams
Given the ability to check 100 streams at a time, the need to wait at least 1 minute per request to get new results, and an approximate rate limit of 1 request per second, you can theoretically check the status of about 6000 streams continuously (assuming you're not making other requests: 100 streams per request × 60 requests per minute ≈ 6000).
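Putting those numbers together, here is a sketch of a polling loop that stays within roughly 1 request per second and re-checks each batch about once a minute. The channel list is an assumption, and check_batch is a placeholder for the request shown in the earlier sketch.

```python
import time

# All channels you want to monitor, split into batches of up to 100
# (the per-request maximum mentioned above).
ALL_CHANNELS = [f"channel{i}" for i in range(6000)]  # assumed list
BATCH_SIZE = 100

def check_batch(channels):
    """Placeholder: swap in the /kraken/streams request from the
    previous sketch; should return the subset of `channels` that are live."""
    return set()

batches = [ALL_CHANNELS[i:i + BATCH_SIZE]
           for i in range(0, len(ALL_CHANNELS), BATCH_SIZE)]

while True:
    cycle_start = time.monotonic()
    for batch in batches:          # 60 batches -> one pass per minute
        live = check_batch(batch)
        # ... compare against the previous pass to detect streams
        # that have gone offline ...
        time.sleep(1)              # ~1 request per second
    # With 6000 channels one pass already takes ~60 s; otherwise pad it
    # out so the same request isn't repeated inside the API's cache window.
    elapsed = time.monotonic() - cycle_start
    time.sleep(max(0, 60 - elapsed))
```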
PubSub For Monitoring Other Things
Currently the PubSub API doesn't have anything for monitoring streams going online, but you may want to keep it in mind for other polling-type actions (it currently deals with things like new subscriptions or donations).
Using The Embedded Player
One last thing worth noting is you can listen for a channel going online or offline when you're using the Twitch Embedded Player.
Might be a little late to reply, but now you can look into Twitch WebHooks.
They allow you to subscribe to specific stream(s) and have Twitch notify your callback URL when a stream goes up or down.
This seems more accurate, and it saves bandwidth compared with polling Twitch yourself.
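For completeness, a hedged sketch of subscribing via the webhook hub endpoint as it worked at the time of this answer. The client ID, callback URL, user ID, and secret are placeholders, and the actual notification arrives later as an HTTP request to your callback.

```python
import requests

# Placeholders -- your app's client ID, a publicly reachable callback
# URL, the numeric user ID of the channel, and a shared secret used to
# verify notification signatures.
CLIENT_ID = "<your-client-id>"
CALLBACK_URL = "https://example.com/twitch/callback"
USER_ID = "123456"
SECRET = "<shared-secret>"

resp = requests.post(
    "https://api.twitch.tv/helix/webhooks/hub",
    # Depending on API changes, an app access token
    # ("Authorization: Bearer ...") may also be required here.
    headers={"Client-ID": CLIENT_ID},
    json={
        "hub.mode": "subscribe",
        # Topic that fires when the stream goes up or down:
        "hub.topic": f"https://api.twitch.tv/helix/streams?user_id={USER_ID}",
        "hub.callback": CALLBACK_URL,
        "hub.lease_seconds": 864000,   # maximum lease; renew before it expires
        "hub.secret": SECRET,
    },
    timeout=10,
)
resp.raise_for_status()
# Twitch then sends a GET challenge to CALLBACK_URL to confirm the
# subscription, and later POSTs stream-change notifications to it.
```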

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will be used as my data lake. But doing a PUT call for every event is too costly, so I want to append events to a file and then do one PUT call per hour. How can I do this without losing data if a node in my message-processing service goes down?
If my processing service acks a message, that message is no longer in Google Pub/Sub but is not yet in Google Cloud Storage, and if the processing node goes down at that moment, the data is lost.
My desired usage is similar to this post, which talks about using AWS Kinesis Firehose to batch messages before PUTting them into S3, but even Kinesis Firehose's maximum batch interval is only 900 seconds (or 128 MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
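A minimal sketch of that "hold the ack until the write succeeds" approach using the Python client libraries; the subscription path, bucket name, object naming, and size/age thresholds are all assumptions to adjust.

```python
import threading
import time

from google.cloud import pubsub_v1, storage

# Placeholders -- substitute your own project, subscription, and bucket.
SUBSCRIPTION = "projects/<project-id>/subscriptions/<subscription-id>"
BUCKET = "<bucket-name>"
MAX_BATCH_BYTES = 64 * 1024 * 1024   # flush at ~64 MB...
MAX_BATCH_AGE_SECONDS = 50 * 60      # ...or after ~50 min (stay under the
                                     # client's default max lease duration)

storage_client = storage.Client()
subscriber = pubsub_v1.SubscriberClient()

lock = threading.Lock()
pending = []          # (message, data) pairs not yet written to GCS
pending_bytes = 0
batch_started = time.time()

def flush():
    """Write the buffered events to one GCS object, then ack them."""
    global pending, pending_bytes, batch_started
    if not pending:
        return
    blob_name = f"events/{int(batch_started)}-{int(time.time())}.ndjson"
    payload = b"\n".join(data for _, data in pending)
    storage_client.bucket(BUCKET).blob(blob_name).upload_from_string(payload)
    for message, _ in pending:   # ack only after the write succeeded
        message.ack()
    pending, pending_bytes, batch_started = [], 0, time.time()

def callback(message):
    global pending_bytes
    with lock:
        pending.append((message, message.data))
        pending_bytes += len(message.data)
        too_big = pending_bytes >= MAX_BATCH_BYTES
        too_old = time.time() - batch_started >= MAX_BATCH_AGE_SECONDS
        if too_big or too_old:
            flush()
    # Un-acked messages have their ack deadline extended automatically by
    # the client library, as described above. A background timer calling
    # flush() would also help when traffic stops mid-batch.

# Raise flow control so the client keeps delivering while a large batch
# is still un-acked; otherwise delivery pauses long before 64 MB.
flow_control = pubsub_v1.types.FlowControl(
    max_messages=100_000,
    max_bytes=2 * MAX_BATCH_BYTES,
)

future = subscriber.subscribe(SUBSCRIPTION, callback=callback,
                              flow_control=flow_control)
future.result()   # block forever; Ctrl+C to stop
```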
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.
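And a sketch of the second, "run the subscriber once an hour" variant using synchronous pull. The names and the stopping condition (an empty pull) are assumptions; as noted above, you could instead watch num_undelivered_messages in Stackdriver to decide when to stop.

```python
import time

from google.cloud import pubsub_v1, storage

SUBSCRIPTION = "projects/<project-id>/subscriptions/<subscription-id>"
BUCKET = "<bucket-name>"

subscriber = pubsub_v1.SubscriberClient()
storage_client = storage.Client()

collected = []
ack_ids = []

# Drain the backlog with synchronous pulls; stop when a pull comes back
# empty. For very large backlogs, messages pulled early may exceed their
# ack deadline before the final ack below -- in that case write and ack
# in smaller chunks, or extend deadlines with modify_ack_deadline.
while True:
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": 1000}
    )
    if not response.received_messages:
        break
    for received in response.received_messages:
        collected.append(received.message.data)
        ack_ids.append(received.ack_id)

if collected:
    blob_name = f"events/batch-{int(time.time())}.ndjson"
    storage_client.bucket(BUCKET).blob(blob_name).upload_from_string(
        b"\n".join(collected)
    )
    # Ack only after the object has been written.
    for i in range(0, len(ack_ids), 1000):
        subscriber.acknowledge(
            request={"subscription": SUBSCRIPTION,
                     "ack_ids": ack_ids[i:i + 1000]}
        )
```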

Stream Analytics job not receiving inputs from IoT Hub

I followed an IoT Hub tutorial and got it working. I then created a Stream Analytics job and used the hub above as an input (the test connection succeeds).
However, I do not see any inputs being received. When running a sample test I get the following error:
Error code: ServiceUnavailable
Error message: Unable to connect to input source at the moment. Please check if the input source is available and if it has not hit connection limits.
I can see telemetry messages being received in the IoT Hub. Any help would be appreciated.
Is the Stream Analytics job running?
I had a similar problem where I wasn't getting any events from Stream Analytics, and I had forgotten to turn it on.
Click on the Stream Analytics job > Overview > Start.
I had the same problem (using Event Hubs in my case). The root cause was that I had too many queries within my job running against the same input. I solved it by splitting my input into several inputs across multiple consumer groups.
From the documentation (emphasis added):
Each Stream Analytics IoT Hub input should be configured to have its own consumer group. When a job contains a self-join or multiple inputs, some input may be read by more than one reader downstream, which impacts the number of readers in a single consumer group. To avoid exceeding IoT Hub limit of 5 readers per consumer group per partition, it is a best practice to designate a consumer group for each Stream Analytics job.
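If you want to script that "one consumer group per job" practice, here is a sketch that drives the Azure CLI from Python; the hub name, resource group, and consumer group name are placeholders.

```python
import subprocess

# Placeholders -- your IoT hub, its resource group, and a dedicated
# consumer group name for this Stream Analytics input.
IOT_HUB = "<iot-hub-name>"
RESOURCE_GROUP = "<resource-group>"
CONSUMER_GROUP = "asa-job-input-1"

# Creates a consumer group on the hub's built-in events endpoint so the
# Stream Analytics input doesn't share readers with other consumers.
subprocess.run(
    [
        "az", "iot", "hub", "consumer-group", "create",
        "--hub-name", IOT_HUB,
        "--resource-group", RESOURCE_GROUP,
        "--name", CONSUMER_GROUP,
    ],
    check=True,
)
```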
I have exactly the same problem, even though the modules on my Raspberry Pi are running without failure.
Stream Analytics says: "Make sure the input has recently received data and the correct format of those events has been selected."