What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operating pattern is that sensors write to the table while they have an active internet connection, which means they may be sending me data anywhere from a few minutes to a few hours per day, or constantly. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
When graphing this, it becomes obvious that I'm skipping directly to the next reported value, so instead of a cyclical pattern I get a sudden cliff when the sensor comes back online.

If you want to fill missing values, there is also the option to use the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling with the previous value, linear interpolation, or a specified constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this in the official documentation.
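For example, carrying the last reported value across a gap or substituting a constant would look like this (a sketch using the same table and column names as in the question; fill(prev) repeats the last seen value for each empty minute, while fill(0) inserts a literal zero):
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(prev);
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(0);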

Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.

Related

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: there is one Meterpoint which is a submeter of another Meterpoint. I'm interested in the delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution using a subquery, but it does not appear to be very efficient:
SELECT
    A.dat_Time,
    (A.int_Counts - (SELECT B.int_Counts
                     FROM tbl_Metering AS B
                     WHERE B.fk_MetPoint = 2
                       AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS Delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach (using a GROUP BY) instead of running an inline sub-query for each row, but this is also a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
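For reference, a GROUP BY version of the same idea might look something like this (a sketch assuming exactly one reading per Meterpoint per timestamp, as described in the question):
SELECT dat_Time,
       MAX(CASE WHEN fk_MetPoint = 1 THEN int_Counts END)
     - MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) AS Delta
FROM tbl_Metering
WHERE fk_MetPoint IN (1, 2)
GROUP BY dat_Time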
However, if this is a metering project then we can expect a high volume of records, if not now then certainly over time.
I would recommend you look into altering the input process so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably only ever occurs once for each record, whereas your SELECT will be executed many times.
This can be done using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading, which greatly simplified many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
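As a rough illustration only of storing the delta at write time (table, column, and parameter names here are invented, and this assumes SQL Server style T-SQL inside a stored procedure or the insert logic):
-- Enriched readings table: the derived columns are filled in once, at write time
CREATE TABLE tbl_Metering_Enriched (
    ID            int      IDENTITY PRIMARY KEY,
    dat_Time      datetime NOT NULL,
    int_Counts    int      NOT NULL,
    fk_MetPoint   int      NOT NULL,
    prev_ID       int      NULL,  -- id of the previous sequential reading
    prev_dat_Time datetime NULL,  -- timestamp of the previous sequential reading
    delta_Counts  int      NULL,  -- value delta
    delta_Seconds int      NULL   -- time delta
);

-- @dat_Time, @int_Counts, @fk_MetPoint are the incoming reading (e.g. procedure parameters)
DECLARE @prev_ID int, @prev_Time datetime, @prev_Counts int;

SELECT TOP (1) @prev_ID = ID, @prev_Time = dat_Time, @prev_Counts = int_Counts
FROM tbl_Metering_Enriched
WHERE fk_MetPoint = @fk_MetPoint
ORDER BY dat_Time DESC;

INSERT INTO tbl_Metering_Enriched
    (dat_Time, int_Counts, fk_MetPoint, prev_ID, prev_dat_Time, delta_Counts, delta_Seconds)
VALUES
    (@dat_Time, @int_Counts, @fk_MetPoint,
     @prev_ID, @prev_Time,
     @int_Counts - @prev_Counts,               -- NULL for the very first reading
     DATEDIFF(second, @prev_Time, @dat_Time)); -- NULL for the very first reading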

Internal db logic/operation to group/compress result

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that typically it wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold. Basically: group the profiled category into buckets, assign each zipcode to the correct bucket, and further reduce the size by using a range representation. I could settle for a predefined number of groups or have the group buckets defined by the request/query itself. This would hopefully reduce the response from something too large for a single query to something manageable.
Is it possible to write a CrateDB function to do something similar, to avoid the bandwidth issues of having this grouping done on a different service/container/vm?
You could probably create groups on the fly, or as columns if you wish, with a regex; I have done this on a 23M row table and grouped by it.
In my example the regex grouping and AVG took around 30s, but this is very dependent on my hardware.
Something like this would probably work as a general pointer:
SELECT avg(--yourColumn--), regexp_matches(postcode, '--your regex--', 'i')[1]
FROM "doc"."--yourTable--"
GROUP BY regexp_matches(postcode, '--your regex--', 'i')[1]
ORDER BY regexp_matches(postcode, '--your regex--', 'i')[1]
You could also use an OVER windowed function, but CrateDB doesn't yet have full SQL support for partitioning etc.
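If your buckets are numeric rather than regex-based, a sketch along these lines might also work (this assumes your table has population and zipcode columns and that your CrateDB version supports array_agg; collapsing consecutive zipcodes into ranges would still need a small post-processing step on the client):
-- Bucket by population range, then collect the zipcodes per bucket
SELECT floor(population / 1000) AS pop_bucket,
       count(*) AS zip_count,
       array_agg(zipcode) AS zipcodes
FROM "doc"."zipcodes"
GROUP BY floor(population / 1000)
ORDER BY pop_bucket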

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (but not running) the query below, changing <column_name> to the field of your interest:
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code it in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each column, dry-run it, and read totalBytesProcessed, which as you already know is the size of the respective column.
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 rows, and use this in your storage calculations.
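For a string column, a rough sample-based estimate could look like this (table and column names are placeholders, in the spirit of the query above; note this counts characters, not stored bytes, so non-ASCII text can take more space):
SELECT AVG(LENGTH(my_string_col)) AS avg_chars
FROM (
  SELECT my_string_col
  FROM YourTable
  LIMIT 1000
)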

How to use the output as another input?

The case: we have a 1-day aggregation window which sums the total value received from an Event Hub (1, 2, 3 ... a value is sent each minute), and we set the output to a blob called 1dayresult. Now we want to use that blob data as the input of another, 1-week aggregation: each week we want to read the data from the blob and do the calculation. So can we set the 1-day result blob as the input for the 1-week aggregation? We know we can set the window unit to 7 days, but we think that will slow down the performance, because with the 1-day result blob as input we only need 7 values, while with a 7-day window we would read more than 7*24*60 values before doing the calculation. We also want a monthly aggregation, but the maximum size of a window is 7 days. So how can we achieve this?
You can use a WITH statement to "chain" multiple subqueries together, where the next subquery uses the output of the previous one as its input. Take a look at this doc.
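A rough sketch of that chaining, with made-up input, output, and field names, could look like this:
-- Daily totals feed a weekly rollup within the same job
WITH DailyTotals AS (
    SELECT System.Timestamp() AS WindowEnd, SUM(SensorValue) AS DailyValue
    FROM EventHubInput TIMESTAMP BY EventTime
    GROUP BY TumblingWindow(day, 1)
)
SELECT System.Timestamp() AS WindowEnd, SUM(DailyValue) AS WeeklyValue
INTO WeeklyBlobOutput
FROM DailyTotals
GROUP BY TumblingWindow(day, 7)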
However, as you noted, it can sometimes be more efficient to persist intermediate output results in blob storage or another Event Hub. You can define the same storage location both as an input and as an output, so one query can be writing to it while another one reads from it.
The maximum window size in Azure Stream Analytics is indeed 7 days. For bigger window computations involving larger amounts of historic data, it might be better to use a product like Azure Data Factory.

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86,400 seconds
1 month = 2,592,000 seconds
Around 1,000 values to keep track of every second.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool? It provides a round-robin database, or circular buffer, for time-series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function (for example sum, min, max, avg) for a given period: 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been aggregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data-saving approach: instead of saving 'raw' data as individual values, I would collect 5-20 minutes of data in an array (in memory, on the business-logic side), compress that array using an LZ-based algorithm, and then store it in the database as binary data. It would also be nice to save Max/Min/Avg/etc. info for each binary chunk.
When you want to process the data, you can do it chunk by chunk, which keeps a low memory profile for your application. This approach is a little more complex but very scalable in terms of memory/processing.
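A sketch of what such a chunk table might look like (names and types are illustrative, assuming SQL Server; aggregate queries that only need min/max/avg can then skip decompressing the payload entirely):
CREATE TABLE trend_chunk (
    trend_id     smallint NOT NULL,         -- which of the ~1000 tracked values
    chunk_start  datetime NOT NULL,         -- first second covered by the chunk
    chunk_end    datetime NOT NULL,         -- last second covered by the chunk
    sample_count int      NOT NULL,         -- number of raw samples in the payload
    val_min      real     NOT NULL,
    val_max      real     NOT NULL,
    val_avg      real     NOT NULL,
    raw_payload  varbinary(max) NOT NULL,   -- LZ-compressed array of raw samples
    PRIMARY KEY (trend_id, chunk_start)
);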
Hope this helps.
Is the problem the database schema?
A one-second-to-many-trends relationship obviously suggests, first, a separate table with a foreign key to the seconds table. Alternatively, if the "many trend values" are represented by columns rather than rows, you can always append the columns to the seconds table and accept the null values.
Have you tried that? Was performance poor?
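If you do try the separate-table route, a minimal sketch of the normalized layout (names invented here, SQL Server style) could be the following; it avoids the null padding of 20-column rows but multiplies the row count, trading storage width for row volume:
CREATE TABLE trend_time (
    trend_time_id bigint IDENTITY PRIMARY KEY,
    trend_time    datetime NOT NULL UNIQUE
);

CREATE TABLE trend_value (
    trend_time_id bigint   NOT NULL REFERENCES trend_time (trend_time_id),
    trend_id      smallint NOT NULL,  -- which of the ~1000 tracked values
    trend_value   real     NOT NULL,
    PRIMARY KEY (trend_time_id, trend_id)
);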