When is a Stream Analytics job without Window getting triggered - azure-iot-hub

I'm unclear on when exactly a Stream Analytics job is getting called if I don't specify any windowing function. Is it running periodically or is it getting called whenever a new event arrives? Unfortunately I haven't found any documentation specifically mentioning this yet.
Background: I have multiple devices sending updates to my IoT Hub and I only want to do a new calculation whenever a certain device sends an update.

Without a windowing function, your Azure Stream Analytics job will run every time an input message is received. However, it might batch the output depending on the egress rate. This depends on the type of the output as well, as described on this page.
One sidenote is that depending on how complex your query is, you need to take the Streaming Units usage into account. Even if you're not looking at windowed functions, you want to check this page if you want to use UDF's or reference data, as it might affect your memory usage.

Related

Custom Dataflow Template - BigQuery to CloudStorage - documentation? general solution advice?

I am consuming a BigQuery table datasource. It is 'unbounded' as it is updated via a batch process. It contains session keyed reporting data from server logs where each row captures a request. I do not have access to the original log data and must consume the BigQuery table.
I would like to develop a custom Java based google Dataflow template using beam api with the goals of :
collating keyed session objects
deriving session level metrics
deriving filterable window level metrics based on session metrics, e.g., percentage of sessions with errors during previous window and percentage of errors per filtered property, e.g., error percentage per device type
writing the result as a formatted/compressed report to cloud storage.
This seems like a fairly standard use case? In my research thus far, I have not yet found a perfect example and still have not been able to determine the best practice approach for certain basic requirements. I would very much appreciate any pointers. Keywords to research? Documentation, tutorials. Is my current thinking right or do I need to consider other approaches?
Questions :
beam windowing and BigQuery I/O Connector - I see that I can specify a window type and size via beam api. My BQ table has a timestamp field per row. Am I supposed to somehow pass this via configuration or is it supposed to be automagic? Do I need to do this manually via a SQL query somehow? This is not clear to me.
fixed time windowing vs. session windowing functions - examples are basic and do not address any edge cases. My sessions can last hours. There are potentially 100ks plus session keys per window. Would session windowing support this?
BigQuery vs. BigQueryClientStorage - The difference is not clear to me. I understand that BQCS provides a performance benefit, but do I have to store BQ data in a preliminary step to use this? Or can I simply query my table directly via BQCS and it takes care of that for me?
For number 1 you can simply use a withTimestamps function before applying windowing, this assigns the timestamp to your items. Here are some python examples.
For number 2 the documentation states:
Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. [...] If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Also in the java documentation, you can only specify a minimum gap duration, but not a maximum. This means that session windowing can easily support hour-lasting sessions. After all, the only thing it does is putting a watermark on your data and keeping it alive.
For number 3, the differences between the BigQuery IO Connector and the BigQuery storage APIs is that the latter (an experimental feature as of 01/2020) access directly data stored, without the logical passage through BigQuery (BigQuery data isn't stored in BigQuery). This means that with storage APIs, the documentation states:
you can't use it to read data sources such as federated tables and logical views
Also, there are different limits and quotas between the two methods, that you can find in the documentation link above.

Sending data from website to BigQuery using Pub/Sub and Cloud Functions

Here's what I'm trying to accomplish
A visitor lands on my website
Javascript collects some information and sends a hit
The hit is processed and inserted into BigQuery
And here's how I have planned to solve it
The hit is sent to Cloud Functions HTTP trigger (using Ajax)
Cloud Functions sends a message to Pub/Sub
Pub/Sub sends data to another Cloud Function using a Pub/Sub trigger
The second Cloud Function processes the hit into Biguery row and inserts it into BigQuery
Is there a simpler way to solve this?
Some other details to take into account
There are around 1 million hits a day
Don't want to use Cloud Dataflow because it inflates the costs
Can't (probably) skip Pub/Sub because some hits are sent when a person is leaving the site and the request might not have enough time to process everything.
You can perform a Big Query streaming, this one is less expensive and you avoid reach the Load Jobs quotas 1000 per table per day.
Another option is if you don't mind that the data spend a lot of time loading, you can store all the info in a Cloud Storage bucket and then load all the data with a transfer. You can program it in order that data be uploaded daily. This solution is focus in a batch environment in which you will store all the info in one side and then you transfer it to the final destination. If you only want to streaming the solution that you mentioned is ok.
It’s up to you to choose the option that better fits to your specific usage.

How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery in a beam job from an unbounded source. I'm using STREAMING INSERTS as the Method. I was looking at how to throttle the rows to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the micro batches.
I was looking at using Triggers but not sure if BigQuery groups everything in a pane into a request. I've setup the trigger as below
Window.<Long>into(new GlobalWindows())
.triggering(
Repeatedly.forever(
AfterFirst.of(
AfterPane.elementCountAtLeast(5),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(2)))
))
.discardingFiredPanes());
Q1. Does Beam support micro batches or does it create one request for each element in the PCollection?
Q2. If the above trigger makes sense? Even If I set the window/trigger it could be sending one request for every element.
I don't know what you mean by micro-batch. The way I see it, BigQuery support loading data either as batches, either in streaming.
Basically, batch loads are subject to quotas and streaming loads are a bit more expensive.
Once you set the insertion method for your BigQueryIO the documentation states :
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
Never tried it, but withTriggeringFrequency seems to be what you need here.

Using Graylog to monitor resources + notifications

Since we're already using Graylog (version 2.4.6) as a general purpose logging backend for our project, we thought we might as well also use it to monitor resource use. The three major benefits would be:
No need to change our codebase to add additional libraries.
Easy to create charts and graphs for the metrics we're tracking.
Built-in notifications.
Concretely, we're trying to track how many jobs our various Beanstalk server has in each of its tubes. If a given tube accumulates for than a certain amount of jobs, we would like to be alerted.
Here's a typical message that we're using for a given tube:
{
"count" => $totalJobsInTube,
"tube" => $tubeName,
"env" => $env,
}
I can't think of a way to set up an alert condition in Graylog that allows me to specify a query + which field to look at. The only conditions we have are:
Field content alert condition
Field aggregation alert condition
Message count alert condition
Message conditional count alert condition
Can this even be done i Graylog?
Graylog is using Elasticsearch as a backend, which is not a good system for metrics (time series data), it's not efficient and doesn't scale well with time series data. This is the reason that most use another monitoring system for measuring resources and other time series data. It depends on your stack, but there are lots of open source and commercial offerings to do that.
If you wanted to do logs and metrics together I would suggest using open source software the Elasic Stack can do both, but that is only my reccomendation if you have limited numbers of metrics. Splunk and SumoLogic can also do logs and metrics, but they are not ideal for time series, especially large numbers of them.

Bigquery Stream Benchmark

Bigquery officially becomes our device log data repository and live monitor/analysis/diagnostic base. As one step further, We need to measure and monitor data streaming performance. Any relevant benchmark you are using for Bigquery live stream? What relevant once I can refer to?
Since streaming has a limited payload size, see Quota policy it's easier to talk about times and other side effects.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout'
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
UPDATE
Someone requested in the comments new stats, so I posted 2017. It's still the same, there was some heavy data reorganization for us, you see the spike, but essentially it's the same it's around 2sec if you use the max of the streaming insert.