BigQuery SQL computing spatial statistics around multiple locations

Problem: I want to make a query that allows me to compute sociodemographic statistics within a given distance of multiple locations. I can only make a query that computes the statistics at a single location at a time.
My desired result would be a table where I can see the name of the library (title), P_60YMAS (some sociodemographic data near these libraries), and a geometry of the multipolygons within the buffer distance of this location as a GEOGRAPHY data type.
Context:
I have two tables:
'cis-sdhis.de.biblioteca' (the libraries), which has points as a GEOGRAPHY data type;
'cis-sdhis.inegi.resageburb', which has many sociodemographic variables, including polygons as a GEOGRAPHY data type (column name: 'GEOMETRY').
I want to make a 1 km buffer around each library, build new multipolygons from the polygons within these buffers, and get some sociodemographic data and the geometry from those multipolygons.
My first approach was with this query:
SELECT
SUM(P_60YMAS) AS age60_plus,
ST_UNION_AGG(GEOMETRY) AS geo
FROM
`cis-sdhis.inegi.resageburb`
WHERE
ST_WITHIN(GEOMETRY,
(
SELECT
ST_BUFFER(geography,
1000)
FROM
`cis-sdhis.de.biblioteca`
WHERE
id = 'bpm-mty3'))
As you can see, this query only gives me one library ('bpm-mty3'), and that's my problem: I want them all at once.
I thought that using OVER() would be one solution, but I don't really know where or how to use it.

OVER() could be a solution, but a more performant solution is a JOIN.
First, join the two tables on the condition that the distance between them is less than 1000 meters. That gives you pairs of rows in which each library is paired with all relevant data; note that you get multiple rows per library. The predicate to use is ST_DWithin(geo1, geo2, distance); an alternative form is ST_Distance(geo1, geo2) < distance. In both cases you don't need a buffer. Note that ST_DWithin also matches polygons that only partly fall inside the 1000 meter radius, whereas ST_WITHIN against a buffer keeps only polygons that lie entirely inside it.
SELECT
*
FROM
`cis-sdhis.inegi.resageburb` data,
`cis-sdhis.de.biblioteca` lib
WHERE
ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
Then we need to compute stats per library. Remember that we have many rows per library and need a single row for each, so we aggregate per library with GROUP BY on id. When we need information about the library itself, the cheapest way to get it is the ANY_VALUE aggregation function. So it will be something like
SELECT
lib.id, ANY_VALUE(lib.title) AS title,
SUM(P_60YMAS) AS age60_plus,
-- if you need to show the circle around library
ST_BUFFER(ANY_VALUE(lib.geography), 1000) AS buffered_location
FROM
`cis-sdhis.inegi.resageburb` data,
`cis-sdhis.de.biblioteca` lib
WHERE
ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
GROUP BY lib.id
One thing to note here is that ST_BUFFER(ANY_VALUE(...)) is much cheaper than ANY_VALUE(ST_BUFFER(...)): it computes the buffer only once per output row instead of once per joined row.
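The join-then-aggregate pattern above can be tried locally with a rough stand-in. This is only a sketch: it uses SQLite instead of BigQuery, points instead of polygons, and a hypothetical distance_m function (a haversine helper registered from Python) in place of ST_DWithin. The table contents and coordinates are made up.

```python
import math
import sqlite3

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for ST_Distance on points; not a BigQuery function.
conn.create_function("distance_m", 4, haversine_m)
conn.executescript("""
CREATE TABLE lib (id TEXT, title TEXT, lat REAL, lon REAL);
CREATE TABLE blocks (block_id TEXT, p_60ymas INTEGER, lat REAL, lon REAL);
INSERT INTO lib VALUES ('lib-a', 'Library A', 25.6700, -100.3100),
                       ('lib-b', 'Library B', 25.7500, -100.2000);
INSERT INTO blocks VALUES ('b1', 10, 25.6710, -100.3100),
                          ('b2', 20, 25.6750, -100.3120),
                          ('b3',  5, 25.7505, -100.2005),
                          ('b4', 99, 26.5000, -101.0000);
""")

# Step 1: pair every library with the blocks within 1000 m (the join).
# Step 2: collapse the pairs to one row per library (the GROUP BY).
rows = conn.execute("""
SELECT lib.id, MIN(lib.title) AS title,
       SUM(blocks.p_60ymas) AS age60_plus, COUNT(*) AS n_blocks
FROM blocks
JOIN lib ON distance_m(lib.lat, lib.lon, blocks.lat, blocks.lon) < 1000
GROUP BY lib.id
ORDER BY lib.id
""").fetchall()

for row in rows:
    print(row)
```

SQLite has no ANY_VALUE, so MIN(lib.title) plays that role here; block b4 is far from both libraries and drops out of the result.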

Related

Improve performance of deducting values of same table in SQL

for a metering project I use a simple SQL table in the following format
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general I have not found an efficient solution for one specific problem: There is one Meterpoint which is a submeter of another Meterpoint. I'd be interested in the Delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device I get datapoints for the various Meterpoints at the same Timestamp.
I found a solution using a subquery, but it does not appear to be very efficient.
SELECT
A.dat_Time,
(A.int_Counts- (SELECT B.int_Counts FROM tbl_Metering AS B WHERE B.fk_MetPoint=2 AND B.dat_Time=A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint=1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
(m.int_counts - m.int_counts_2) as delta
FROM (SELECT m.*,
MAX(CASE WHEN fk_MetPoint = 2 THEN int_counts END) OVER (PARTITION BY dat_time) as int_counts_2
FROM tbl_Metering m
) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum change to a set-based approach instead of an inline sub-query for each row. A GROUP BY would do, but this is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
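To compare the two set-based options, here is a small sketch that runs a GROUP BY version and a window-function version against the question's table layout. It uses SQLite (3.25 or newer for window functions) as a stand-in for the unnamed database, and the four sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_Metering (
    ID INTEGER PRIMARY KEY, dat_Time TEXT, int_Counts INTEGER, fk_MetPoint INTEGER);
INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint) VALUES
    ('2024-01-01 00:00', 100, 1), ('2024-01-01 00:00', 40, 2),
    ('2024-01-01 00:15', 130, 1), ('2024-01-01 00:15', 55, 2);
""")

# GROUP BY with conditional aggregation: one pass, one row per timestamp.
group_by_sql = """
SELECT dat_Time,
       MAX(CASE WHEN fk_MetPoint = 1 THEN int_Counts END)
     - MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) AS Delta
FROM tbl_Metering
WHERE fk_MetPoint IN (1, 2)
GROUP BY dat_Time
ORDER BY dat_Time
"""

# Window version: attach the submeter value to every row, then keep meter 1.
window_sql = """
SELECT dat_Time, int_Counts - int_counts_2 AS Delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END)
                 OVER (PARTITION BY dat_Time) AS int_counts_2
      FROM tbl_Metering AS m) AS m
WHERE fk_MetPoint = 1
ORDER BY dat_Time
"""

group_rows = conn.execute(group_by_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()
print(group_rows)
print(window_rows)
```

Both queries return the same deltas on this data; which one is faster depends on the engine and its indexes, so it is worth measuring on the real table.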
However if this is a metering project, then we are going to expect a high volume of records, if not now, certainly over time.
I would recommend you look into altering the input so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably will only ever occur once for each record, whereas your select will be executed many times.
This can be performed using an INSTEAD OF trigger, or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario, we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
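As a rough illustration of the separate-table idea, the sketch below derives a second table from a raw feed and stores the previous reading's id, the previous timestamp, the value delta and the time delta. The table and column names (raw_readings, readings_enriched, meter, ts, counts) are hypothetical, SQLite stands in for the real database, and LAG() is used instead of a trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw feed, left untouched so it can be re-processed on demand.
CREATE TABLE raw_readings (
    id INTEGER PRIMARY KEY, meter INTEGER, ts INTEGER, counts INTEGER);
INSERT INTO raw_readings (meter, ts, counts) VALUES
    (1, 0, 100), (1, 900, 130), (1, 1800, 170),
    (2, 0, 40),  (2, 900, 55);

-- Derived table: each reading carries facts about the previous one.
CREATE TABLE readings_enriched AS
SELECT id, meter, ts, counts,
       LAG(id) OVER (PARTITION BY meter ORDER BY ts) AS prev_id,
       LAG(ts) OVER (PARTITION BY meter ORDER BY ts) AS prev_ts,
       counts - LAG(counts) OVER (PARTITION BY meter ORDER BY ts) AS value_delta,
       ts - LAG(ts) OVER (PARTITION BY meter ORDER BY ts) AS time_delta
FROM raw_readings;
""")

enriched = conn.execute("""
SELECT meter, ts, prev_id, value_delta, time_delta
FROM readings_enriched
ORDER BY meter, ts
""").fetchall()

for row in enriched:
    print(row)
```

The first reading of each meter has no predecessor, so its derived columns are NULL. Analysis queries can then read value_delta directly instead of recomputing it on every select.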

What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operation pattern is that sensors write info to a table while they have an active internet connection. This means they send me data for anywhere from a few minutes to a few hours per day, or constantly. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
When graphing, it becomes obvious that I'm skipping directly to the next reported value, so instead of a cyclical pattern I get a sudden cliff when the sensor comes online again.
If you want to fill missing values, there is also the option to use the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling with the previous value, linear interpolation, or specifying a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this in the official documentation.
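To make the idea of a linear fill concrete, here is a plain-Python sketch of the concept: average the readings per minute, then interpolate the empty minutes between two known buckets. This is not QuestDB's implementation, and the function names and sample readings are made up for illustration.

```python
from collections import defaultdict

def sample_by_minute(readings):
    """Average (epoch_seconds, value) readings into per-minute buckets."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts // 60].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in buckets.items()}

def fill_linear(averages):
    """Fill minutes missing between known buckets by linear interpolation."""
    minutes = sorted(averages)
    if not minutes:
        return {}
    filled = {}
    for left, right in zip(minutes, minutes[1:]):
        span = right - left
        step = averages[right] - averages[left]
        for m in range(left, right):
            filled[m] = averages[left] + step * (m - left) / span
    filled[minutes[-1]] = averages[minutes[-1]]
    return filled

# Minutes 2 and 3 have no readings at all (sensor offline).
readings = [(0, 70.0), (30, 72.0), (60, 74.0), (240, 80.0)]
averages = sample_by_minute(readings)
filled = fill_linear(averages)
print(averages)
print(filled)
```

Without the fill step the series jumps straight from minute 1 to minute 4; with it, the gap is bridged by evenly spaced values, which is what removes the cliff in a graph.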
Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.
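The null-skipping behaviour of avg() is standard SQL and easy to check. The sketch below uses SQLite as a stand-in, on the assumption that QuestDB behaves as this answer describes; the three rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (minute INTEGER, tempFahren REAL);
INSERT INTO readings VALUES (0, 70.0), (1, NULL), (2, 74.0);
""")

# AVG and COUNT(column) skip NULLs; COUNT(*) counts every row.
avg_value, n_rows, n_values = conn.execute(
    "SELECT AVG(tempFahren), COUNT(*), COUNT(tempFahren) FROM readings"
).fetchone()
print(avg_value, n_rows, n_values)
```

The average is taken over the two present values only, so the missing reading neither raises nor lowers it.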

Internal db logic/operation to group/compress result

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that typically it wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold. Basically, group profiled category into buckets, assign zipcodes to correct bucket, and further reduce size by using range representation. I could settle for a predefined number of groups or have group buckets defined by request/query itself. This would hopefully reduce the response from something that would be too large for a single query to one that's manageable.
Is it possible to write a cratedb function to do something similar to avoid bandwidth issues from having this grouping done on a different service/container/vm?
You could probably create the groups on the fly, or as columns if you wish, with a regex. I have done this on a 23M row table and grouped by that.
In my example the regex grouping and AVG took around 30s, but this depends heavily on my hardware.
Something like this would probably work as a general pointer
SELECT avg(--yourColumn--), regexp_matches(postcode, '--your regex--','i')[1]
FROM "doc"."--yourTable--"
group by regexp_matches(postcode, '--your regex--','i')[1]
order by regexp_matches(postcode, '--your regex--','i')[1]
You could also use a window function with OVER, but this doesn't yet have full SQL support for partitioning etc.
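The bucketing and range compression described in the question can be sketched outside the database to pin down the expected output. This is only an illustration in Python: the function names are hypothetical, "consecutive" is assumed to mean numerically adjacent 5-digit codes, and the populations are made up.

```python
def compress_ranges(zipcodes):
    """Collapse numerically consecutive zipcodes into 'start-end' strings."""
    codes = sorted({int(z) for z in zipcodes})
    parts = []
    i = 0
    while i < len(codes):
        j = i
        # Extend the run while the next code is exactly one higher.
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        if i == j:
            parts.append(f"{codes[i]:05d}")
        else:
            parts.append(f"{codes[i]:05d}-{codes[j]:05d}")
        i = j + 1
    return ",".join(parts)

def bucket_zipcodes(populations, edges):
    """Assign each zipcode to the first bucket whose upper edge covers it.

    `edges` are inclusive upper bounds in ascending order; use
    float("inf") as the last edge so no zipcode is left out.
    """
    groups = {edge: [] for edge in edges}
    for zipcode, population in populations.items():
        for edge in edges:
            if population <= edge:
                groups[edge].append(zipcode)
                break
    return {edge: compress_ranges(z) for edge, z in groups.items() if z}

populations = {"00001": 500, "00002": 800, "00003": 950,
               "00004": 2500, "00005": 100, "00006": 1500}
result = bucket_zipcodes(populations, [1000, 3000, float("inf")])
print(result)
```

Running the same logic inside the database would need a user-defined function or an aggregation step there; this sketch only shows how small the response becomes once the ranges are collapsed.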

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing it to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the validation message, which reports the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as you already know, is the size of the respective column.
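The loop described above might look roughly like this with the google-cloud-bigquery Python client. Treat it as a sketch under assumptions: the helper names are made up, it needs the library installed and credentials configured, and it covers top-level columns only (a RECORD column is measured as a whole, nested fields included).

```python
def column_sizes(client, table_id, dry_run_config):
    """Return {column_name: bytes a dry run reports for SELECT column}."""
    table = client.get_table(table_id)  # fetches the schema (Tables.get)
    sizes = {}
    for field in table.schema:  # top-level fields only
        sql = f"SELECT `{field.name}` FROM `{table_id}`"
        job = client.query(sql, job_config=dry_run_config)
        sizes[field.name] = job.total_bytes_processed
    return sizes

def profile_with_bigquery(table_id):
    """Run column_sizes against BigQuery; table_id is 'project.dataset.table'."""
    # Imported here so the helper above stays usable without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    # dry_run=True validates the query and estimates bytes without running it.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return column_sizes(client, table_id, config)
```

Calling profile_with_bigquery("my-project.my_dataset.my_table") (a placeholder name) and sorting the result by value gives a quick ranking of the heaviest columns.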
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 values, and use this for your storage calculations.

How can I perform aggregate function logic across multiple fields?

How can I get max() of three dimensions to come from the same record?
Description:
I have a large list of widgets, with multiple attributes, from multiple sources. Think manual data entry, where you have the same stuff being entered by different people, and then you need to consolidate differences. Though, instead of auditing each difference, I just want to perform some logic to choose a value over another under certain criteria.
An analogous example: if source a says widget xyz weighs 3 pounds, and source b says it weighs 4 pounds, I am just blindly taking the 4, as it is greater, and say I need to be overcautious for packing/shipping purposes. That is easy, I choose MAX().
Now, I have a group of attributes that are in separate fields but related. Think dimensions of a box. There are width/length/height fields. If one source says the 'dimensions' are 2x3x4, and another says they are 3x3x4, I need to take the larger, for the same reason as above. Also sounds like MAX(), except...
My sources disagree on which is the width, height, or length. A 2x3x4 box could be entered 4x3x2, or 2x4x3, depending on how the source was looking at it. If I took the MAX of 3 such sources, I would end up with 4x4x4, even though all 3 sources measured it correctly. This is undesirable.
How do I take the greatest 'measurement' value, but make sure all three values come from the same record?
If 'greatest' is impossible, we could settle for unique... except there is a fourth source, which has 0x0x0 for about 40% of the widgets. I can't leave a 0x0x0 if any of the other sources did in fact measure that widget.
some sample data
ID,widget_name,height,width,leng
(a1,widget3,2,3,4)
(b1,widget3,2,4,3)
(c1,widget3,4,3,2)
(d1,widget3,0,0,0)
output should be (widget3,4,3,2)
You could use ROW_NUMBER() instead of GROUP BY, like
select * from
(select ID,widget_name,height,width,leng, ROW_NUMBER() over ( partition by widget_name order by height + width + leng desc ) rowid
from yourTable
) as t
where rowid = 1
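Here is a runnable sketch of the ROW_NUMBER() approach on the question's sample data, using SQLite (3.25 or newer) as a stand-in. Three of the sample rows tie on height + width + leng, so the extra ORDER BY terms are an assumption added here to make the pick deterministic; they are not part of the answer above. The row number is aliased rn because rowid has a special meaning in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widgets (
    ID TEXT, widget_name TEXT, height INTEGER, width INTEGER, leng INTEGER);
INSERT INTO widgets VALUES
    ('a1', 'widget3', 2, 3, 4),
    ('b1', 'widget3', 2, 4, 3),
    ('c1', 'widget3', 4, 3, 2),
    ('d1', 'widget3', 0, 0, 0);
""")

# Rank the records of each widget by total size and keep the top one, so
# all three dimensions come from the same source row. The 0x0x0 record
# only wins if no other source measured the widget.
best = conn.execute("""
SELECT widget_name, height, width, leng
FROM (SELECT w.*,
             ROW_NUMBER() OVER (
                 PARTITION BY widget_name
                 ORDER BY height + width + leng DESC,
                          height DESC, width DESC, leng DESC) AS rn
      FROM widgets AS w) AS t
WHERE rn = 1
""").fetchall()
print(best)
```

Ordering by the sum is only one possible notion of "greatest"; ordering by volume or by the largest single side would be equally easy to swap in.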