How to make the output as another input? - azure-stream-analytics

The case: We have a 1day aggregation window which sum the total value received from Event hub (1, 2, 3 … each minute send a value), we set the output to a blob called 1dayresult. Now we want to get the blob data as another 1week aggregation input, each week we want to get the data from the blob and do calculation, so can we set the 1day result blob as the input for the 1week aggregation? We know we can set the window unit is 7day, but we think it will slow down the performance, because if we make the 1day result blob as the input, we only need 7 values, but if we use the 7day window, we will get more than 7*24*60 values and then do the calculation. We also want to have a month aggregation, but the The maximum size of the window is 7 days. So how to achieve it?

You can use WITH statement to "chain" multiple subqueries together, where next subquery can be using output of the previous as input. Take a look at this doc
However, as you noted, sometimes it can be more efficient to persist intermediate output results in blob storage or another Event Hub. You can define same storage location both as input and as output and one subquery can be outputting while another is reading.
The maximum size of windows in Azure Stream Analytics is indeed 7 days. For bigger size window computations involving bigger amounts of historic data it might be better to use product like Azure Data Factory.

Related

Using 1 Dataflow Job (Apache Beam Pipeline) to aggregate data on different window periods, and write them to different column families in BigTable

I am trying to optimize my Apache Beam pipeline on Google Cloud Platform Dataflow.
Background information: I am trying to read streaming data from PubSub Messages, and aggregate them based on 3 time windows: 1 min, 5 min and 60 min. Such aggregations consists of summing, averaging, finding the maximum or minimum, etc. For example, for all data collected from 1200 to 1201, I want to aggregate them and write the output into BigTable's 1-min column family. And for all data collected from 1200 to 1205, I want to similarly aggregate them and write the output into BigTable's 5-min column. Same goes for 60min.
The current approach I took is to have 3 separate dataflow jobs (i.e. 3 separate Beam Pipelines), each one having a different window duration (1min, 5min and 60min). See https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html. And the outputs of all 3 dataflow jobs are written to the same BigTable, but on different column families. Other than that, the function and aggregations of the data are the same for the 3 jobs.
However, this seems to be very computationally inefficient, and cost inefficient, as the 3 jobs are essentially doing the same function, with the only exception being the window time duration and output column family.
Some challenges and limitations we faced was that from the Apache Beam documentation, it seems like we are unable to create multiple windows of different periods in a singular dataflow job. Also, when we write the final data into big table, we would have to define the table, column family, column, and rowkey. And unfortunately, the column family is a fixed property (i.e. it cannot be redefined or changed given the window period).
Hence, I am wondering if there is a way to only use 1 dataflow job (i.e. 1 Apache Beam pipeline) that fulfils the objective of this project? Which is to aggregate data on different window periods, and write them to different column families of the same BigTable.
I was considering using Split stream: first window by 1-min, then split into 3 streams (1 write to bigtable for 1-min interval, another for 5-min aggregation, and another for 60-min aggregation). However, the problem is that we are working with streaming data and not batch data.
Thank you

What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operation pattern is that sensors write info to a table while they have an active internet connection. This means they are anywhere from a few minutes to a few hours per day or constantly sending me data. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
It becomes obvious that I'm skipping directly to the next reported value when graphing so instead of a cyclical pattern, I get a sudden cliff when the sensor comes online again:
If you want to fill missing values, there is also the option to use the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling by previous value, linear interpolation, or specify a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this on the official documentation
Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in Web UI by simply typing (and not running) below query changing to field of your interest
SELECT <column_name>
FROM YourTable
and looking into Validation Message that consists of respective size
Important - you do not need to run it – just check validation message for bytesProcessed and this will be a size of respective column
Validation is free and invokes so called dry-run
If you need to do such “columns profiling” for many tables or for table with many columns - you can code this with your preferred language using Tables.get API to get table schema ; then loop thru all fields and build respective SELECT statement and finally Dry Run it (within the loop for each column) and get totalBytesProcessed which as you already know is the size of respective column
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, just that those are intersecting, which prevents usage of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would be implemented maybe implicitely by just not reentering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops and they can cause real trouble in transforms. Instead you should do this by adding a step that will put a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may process in parallel and out of order and the only guarantee is that the row will pass through the steps in order. Because of this a loop in a transform does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86.400 seconds
1 month = 2.592.000 seconds
Around 1000 values to keep track of every seconds.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool - it provides a round robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function, for example (sum, min, max, avg) for a given period, 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been agregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data saving approach and instead of saving 'raw' data as values I would save 5-20 minutes of data in an array (Memory, BL side), compress that array using LZ based algorithm and then store the data in the database as binary data. Also, it would be nice to save Max/Min/Avg/etc.. info for that binary chunk.
When you want to process the data you can process the data chunk after chunk and by that you keep a low memory profile for your application. this approach is a little more complex but very scalable in terms of memory/processing.
hope this helps.
Is the problem the database schema?
1 second to many trends obviously first shows you a separate table with a seconds-table foreign key. Alternatively, if the "many trend values" is represented by the columns and not rows you can always append the columns to the seconds table and incur null values.
Have you tried that? Was performance poor?