Generate convex hull from points - hive

I have a Hive table with many thousands of points. The only columns are latitude and longitude. I know in advance that these points are all in a certain area, and the extreme outer edge of the points does form a continuous polygon, but many of the points are interior. I'm trying to determine which points form the external convex hull for visualization. I don't want to use all the points, because the result has messy interior holes that don't look good in a visualization. I'm using hive-1.2.1000.2.4.2.0. Here's what I tried:
hive> add jar /home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar;
Added [/home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar] to class path
Added resources: [/home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar]
hive> add jar /home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added [/home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar] to class path
Added resources: [/home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar]
hive> create temporary function ST_ConvexHull AS 'com.esri.hadoop.hive.ST_ConvexHull';
OK
Time taken: 0.014 seconds
hive> create temporary function ST_AsText AS 'com.esri.hadoop.hive.ST_AsText';
OK
Time taken: 0.009 seconds
hive> create temporary function ST_Point AS 'com.esri.hadoop.hive.ST_Point';
OK
Time taken: 0.009 seconds
hive> SELECT ST_AsText(ST_ConvexHull(ST_Point(latitude, longitude))) FROM sandbox11.cnst_zn;
I have also tried flipping latitude and longitude order in my query. In both instances, I get 'MULTIPOLYGON EMPTY' as the response. The documentation of the UDF is here: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Operations#st_convexhull

If you want the convex hull of the geometries across all the records of a table, use ST_Aggr_ConvexHull rather than ST_ConvexHull (which expects a list of multiple geometries within a single row).
Update: the syntax for aggregate ConvexHull would be similar to the syntax for aggregate Union, for which we have an example in an article.
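As a sketch only (assuming the aggregate UDF follows the same com.esri.hadoop.hive class-name pattern as the other functions, and noting that ST_Point takes x, y, i.e. longitude first), the query against the table in the question could look something like:
create temporary function ST_Aggr_ConvexHull AS 'com.esri.hadoop.hive.ST_Aggr_ConvexHull';
-- aggregate the per-row points into a single convex hull for the whole table
SELECT ST_AsText(ST_Aggr_ConvexHull(ST_Point(longitude, latitude)))
FROM sandbox11.cnst_zn;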

Related

Big Query External table - Query performance degrades with increased number of files in the Source URI

I have an external big query table created to read "Parquet" files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized in such a way that it follows the Hive partitioning layout as explained in the BigQuery documentation. The columns "year" and "model" are seen as partition columns in the external table.
**External Data Configuration**
Source URI(s)- gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s)- year, model
Problem: When I run queries on the external table as given below, I have observed that every query spends an initial 2-3 minutes before the actual run happens. The BigQuery console shows "Query pending" during this time, and as soon as it turns to "Query running" the output gets displayed with minimal slot time consumption (slot time shows as 1-2 seconds).
Select * from myTable Where year = 2022 and model = 'abc'
The underlying file count varies and increases for every year and model. For years with more Parquet files the initial time is sometimes around 4-5 minutes.
My understanding, as per the documentation, is that if the partition columns are present in the query, partition pruning happens and the query should be responsive almost immediately:
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But my observations contradict this. If the source URIs are restricted to one year, so the table reads data from only that year, the initial time (where it remains "Query pending" in the console) is reduced to 1-2 minutes (or even less):
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior? As the volume of files in the GCS bucket increases, the query takes even longer to run (especially the initial time; the actual run time doesn't change much), even though the year and model partition columns are applied in the WHERE clause.
This is likely expected behavior. Before partition pruning can happen, the objects in GCS need to be listed, which is likely where the time is being taken. We are working on improvements in this area. The fact that slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read, there isn't a lot of slot time to consume).
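In the meantime, if querying one year at a time is acceptable, a workaround in line with your own observation is to define one external table per year, so that far fewer objects have to be listed. A sketch only, using the DDL form of the hive partitioning options with made-up dataset and table names (partition columns are auto-detected, so only model remains under the restricted prefix):
CREATE EXTERNAL TABLE mydataset.myTable_2022
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://mybucket/root/year=2022/*'],
  hive_partition_uri_prefix = 'gs://mybucket/root/year=2022'
);

SELECT * FROM mydataset.myTable_2022 WHERE model = 'abc';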

BigQuery SQL computing spatial statistics around multiple locations

Problem: I want to make a query that allows me to compute sociodemographic statistics within a given distance of multiple locations. I can only make a query that computes the statistics at a single location at a time.
My desired result would be a table where I can see the name of the library (title), P_60YMAS (some sociodemographic data near these libraries), and a geometry of the multipolygons within the buffer distance of this location as a GEOGRAPHY data type.
Context:
I have two tables:
'cis-sdhis.de.biblioteca', or libraries, which has points as a GEOGRAPHY data type;
'cis-sdhis.inegi.resageburb', which has a lot of sociodemographic data, including polygons as a GEOGRAPHY data type (column name: 'GEOMETRY').
I want to make a 1 km buffer around the libraries, build new multipolygons within these buffers, and get some sociodemographic data and geometry from those multipolygons.
My first approach was with this query:
SELECT
SUM(P_60YMAS) AS age60_plus,
ST_UNION_AGG(GEOMETRY) AS geo
FROM
`cis-sdhis.inegi.resageburb`
WHERE
ST_WITHIN(GEOMETRY,
(
SELECT
ST_BUFFER(geography,
1000)
FROM
`cis-sdhis.de.biblioteca`
WHERE
id = 'bpm-mty3'))
As you can see, this query only gives me one library ('bpm-mty3'), and that's my problem: I want them all at once.
I thought that using OVER() would be one solution, but I don't really know where or how to use it.
OVER() could be a solution, but a more performant solution is a JOIN.
You first need to join the two tables on the condition that the distance between them is less than 1000 meters. That gives you pairs of rows, where libraries are paired with all relevant data; note that we'll get multiple rows per library. The predicate to use is ST_DWithin(geo1, geo2, distance); an alternative form is ST_Distance(geo1, geo2) < distance. In both cases you don't need a buffer.
SELECT
  *
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
Then we need to compute stats per library. Remember we have many rows per library, and we need a single row for each library, so we need to aggregate per library; let's GROUP BY the library id. When we need information about the library itself, the cheapest way to get it is the ANY_VALUE aggregation function. So it will be something like:
SELECT
  lib.id,
  ANY_VALUE(lib.title) AS title,
  SUM(data.P_60YMAS) AS age60_plus,
  -- if you need to show the circle around the library
  ST_BUFFER(ANY_VALUE(lib.geography), 1000) AS buffered_location
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
GROUP BY lib.id
One thing to note here is that ST_BUFFER(ANY_VALUE(...)) is much cheaper than ANY_VALUE(ST_BUFFER(...)): it only computes the buffer once per output row.

What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operation pattern is that sensors write info to a table while they have an active internet connection. This means they are sending me data anywhere from a few minutes to a few hours per day, or constantly. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
It becomes obvious when graphing that I'm skipping directly to the next reported value, so instead of a cyclical pattern I get a sudden cliff when the sensor comes back online.
If you want to fill missing values, there is also the option to use the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling with the previous value, linear interpolation, or a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this in the official documentation.
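For example (a sketch using the same table and column as above), carrying the last reported value forward, or filling the gaps with a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(prev);
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(0);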
Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.

Visualising data in Tableau when connected to BigQuery taking an eternity

I have a dataset that I've loaded into BigQuery, it consists of 3 separate tables with a common identifier in each of the files.
When I set up my project in Tableau I performed an inner join on two of the tables. I set the connection up as an extract and not live.
There's some geo info in my file, lats and longs. When I drag lat to the rows section of my worksheet it takes an eternity; it has currently been 18 minutes and counting just to process whatever it's doing.
Is there some other way that I can take a random sample of my data for working on it rather than having to wait for each query to process? My data is not even that big, it's around 1M rows.
I've found Tableau to bog down quite a bit long before 1 million rows, and I suspect the join compounds the problem for you.
Aggregating as much as possible in BigQuery itself, before making the extract, is your friend. The random excerpt is a good idea, too. You could try:
SELECT
*
FROM
([subquery joining your tables])
WHERE RAND() < 0.05 # or whatever gives an acceptable sample size

postgresql with postgis & polygons

I'm working with PostgreSQL with PostGIS.
I have a few questions:
trying to insert into a table with a polygon column using the following syntax:
ST_GeomFromText('POLYGON((long1 lat1, long2 lat2, long3 lat3))')
fails with the following error: function geomfromtext(unknown) does not exist
what is the difference between
CREATE TABLE my_table (my_polys polygon);
and
CREATE TABLE my_table2 (my_polys GEOGRAPHY(POLYGON));
and why does the following:
INSERT INTO my_table (my_polys) VALUES ('
(51.504824, -0.125918),
(51.504930, -0.122743),
(51.504930, -0.110297),
(51.504824, -0.102229),
(51.503435, -0.099311)'
);
work fine with my_table and not with my_table2 (I've changed the table name to my_table2 in the insert)?
what is the maximum number of points a polygon can have?
The geography data type is mostly useful when your geometry might wrap around the date line. If it does not, you are better off using normal geometry data types, as most functions in PostGIS are designed to work on planar geometries. If your data are lat/lon, you can explicitly set this when you create your column with:
create table foo (geom geometry(POLYGON, 4326));
Explicitly setting the coordinate reference system is useful if you want to convert between one coordinate system and another, and it helps prevent surprises if you attempt to run an intersection, say, between geometries in different coordinate systems.
It is hard to imagine your insert fragment above working with any table: not only is there a trailing comma, but there is no st_geomfromtext, st_makepolygon or anything else to convert what you have written into a geometry.
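For illustration, a sketch of an insert that would go through st_geomfromtext against a geometry column like the one suggested above (the table name is made up; note that WKT is written x y, i.e. lon lat, and the ring is closed by repeating the first point):
create table my_table3 (my_polys geometry(POLYGON, 4326));

insert into my_table3 (my_polys) values (
  st_geomfromtext('POLYGON((
    -0.125918 51.504824,
    -0.122743 51.504930,
    -0.110297 51.504930,
    -0.102229 51.504824,
    -0.099311 51.503435,
    -0.125918 51.504824))', 4326)
);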
I have no idea what the maximum size for a polygon is, although, as Chris has already said, you are likely to have performance issues with polygons of 1 GB in size when you try to do spatial queries on them.
Your polygon size is limited by the maximum storage size. I believe this is limited to 1 GB for a TOASTed column, though my guess is that you are likely to run into performance issues long before you exceed the maximum number of points.
Polygon is a built-in type in PostgreSQL, and PostGIS offers additional types for geometry and geography. As I understand it, the built-in types only work for 2D Euclidean space, while PostGIS offers additional options there.
As for why your query fails, I wonder which version of PostGIS you are running and why st_geomfromtext is calling geomfromtext.