List all points within radius using Hive - hive

I have a table as follows:
id_center|latitude_of_center|longitude_of_center|id_point|latitude_of_point|longitude_of_point
The table has many millions of rows.
I'm trying to get output that shows, for each id_center, which id_points are within a 5-mile radius and how far away each one is, sorted by distance in descending order. Each row is fully populated, so each id_center is paired with every possible id_point in a separate row. Here's what I've tried so far; I'm just getting null results:
hive> add jar /home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar;
Added [/home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar] to class path
Added resources: [/home/me/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar]
hive> add jar /home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added [/home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar] to class path
Added resources: [/home/me/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar]
hive> create temporary function ST_GeodesicLengthWGS84 AS 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
OK
Time taken: 0.014 seconds
hive> create temporary function ST_SetSRID AS 'com.esri.hadoop.hive.ST_SetSRID';
OK
Time taken: 0.008 seconds
hive> create temporary function ST_LineString AS 'com.esri.hadoop.hive.ST_LineString';
SELECT * FROM mytable WHERE ST_GeodesicLengthWGS84(ST_SetSRID(ST_LineString(latitude_of_center, longitude_of_center, latitude_of_point, longitude_of_point), 4326)) <= 8046.72

For ST_LineString, you need longitude first, then latitude - (X,Y) order.
(As also discussed on GIS-SE https://gis.stackexchange.com/questions/178950/hive-gis-st-geodesiclengthwgs84-not-returning-expected-distance)
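Applied to the query from the question, that swap looks roughly like this - just a sketch, assuming the ST_LineString(x1, y1, x2, y2) form and the same 8046.72 m (~5 miles) threshold:
SELECT *
FROM mytable
WHERE ST_GeodesicLengthWGS84(
        ST_SetSRID(
          ST_LineString(longitude_of_center, latitude_of_center,
                        longitude_of_point, latitude_of_point),
          4326)) <= 8046.72;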

I used ST_Point inside ST_LineString when I worked on a similar task. Check this option in the docs.
In your case:
SELECT * FROM mytable
WHERE ST_GeodesicLengthWGS84(
        ST_SetSRID(
          ST_LineString(array(ST_Point(longitude_of_center, latitude_of_center),
                              ST_Point(longitude_of_point, latitude_of_point))),
          4326)) <= 8046.72;
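To also return the distance and sort it in descending order, as the question asks, one possible extension of that query (same UDFs; the expression is repeated in the WHERE clause because Hive does not allow referencing the alias there) could be:
SELECT id_center,
       id_point,
       ST_GeodesicLengthWGS84(
         ST_SetSRID(
           ST_LineString(array(ST_Point(longitude_of_center, latitude_of_center),
                               ST_Point(longitude_of_point, latitude_of_point))),
           4326)) AS distance_m
FROM mytable
WHERE ST_GeodesicLengthWGS84(
        ST_SetSRID(
          ST_LineString(array(ST_Point(longitude_of_center, latitude_of_center),
                              ST_Point(longitude_of_point, latitude_of_point))),
          4326)) <= 8046.72
ORDER BY distance_m DESC;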

Related

Fastest way to filter latitude and longitude from a sql table?

I have a huge table where the sample data is like below. I want to filter a few latitude and longitude records from it, and I am using an IN clause to filter a list of lat,lon values, but the query takes more than a minute to execute. What is a better query to make it faster? The list of lat,lon pairs is around 120-150 entries.
id longitude latitude
--------------------------
190 -0.410123 51.88409
191 -0.413256 51.84567
Query:
SELECT DISTINCT id, longitude, latitude
FROM geo_table
WHERE ROUND(longitude::numeric, 3) IN (-0.418, -0.417, -0.417, -0.416 and so on )
AND ROUND(latitude::numeric, 3) IN (51.884, 51.884, 51.883, 51.883 and so on);
If at least one of the ranges of values in X or Y is tight you can try prefiltering rows. For example, if X (longitude) values are all close together you could try:
SELECT distinct id,longitude,latitude
from (
select *
FROM geo_table
where longitude between -0.418 and -0.416 -- prefilter with index scan
and latitude between 51.883 and 51.884 -- prefilter with index filter
) x
-- now the re-check logic for exact filtering
where ROUND(longitude::numeric,3) in (-0.418, -0.417, -0.417, -0.416, ...)
and ROUND(latitude::numeric,3) in (51.884, 51.884, 51.883, 51.883, ...)
You would need an index of the form:
create index ix1 on geo_table (longitude, latitude);
First, looking for a list of latitudes and a list of longitudes separately is likely wrong if you are looking for point locations:
point: lat;long
----------------
Point A: 1;10
Point B: 2;10
Point C: 1;20
Point D: 2;20
--> if you search for latitude in (1;2) and longitude in (10;20), the query will return the 4 points, while if you search for (latitude,longitude) in ((1;10),(2;20)), the query will return only points A and D.
Then, since you are looking for rounded values, you must index the rounded values:
create index latlong_rdn on geo_table (round(longitude::numeric,3), round(latitude::numeric,3));
and the query should use the exact same expression:
select *
from geo_table
where (round(longitude::numeric,3), round(latitude::numeric,3)) in
(
(-0.413,51.846),
(-0.410,51.890)
);
But here again rounding is not necessarily the best approach when dealing with locations. You may want to have a look at the PostGIS extension, to save the points as geography, add a spatial index, and to search for points within a distance (st_dwithin()) of the input locations.
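A minimal PostGIS sketch of that idea - the geog column, index name and the 100 m radius are illustrative assumptions:
-- store each point once as geography and index it
ALTER TABLE geo_table ADD COLUMN geog geography(Point, 4326);
UPDATE geo_table SET geog = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)::geography;
CREATE INDEX geo_table_geog_idx ON geo_table USING gist (geog);

-- all rows within 100 m of an input location, using the spatial index
SELECT id, longitude, latitude
FROM geo_table
WHERE ST_DWithin(geog,
                 ST_SetSRID(ST_MakePoint(-0.413, 51.846), 4326)::geography,
                 100);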

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window. The time window can range from "one day" to "many years". There are measurement values roughly every minute in the DB.
So the number of entries for a time window can vary a lot, say from a few hundred to several thousand or millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is - let's say - 800px wide, it does not make sense to fetch thousands of rows from the database if the time window is big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every nth row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a unix timestamp but a date, "trunc" it, average the values and group by the trunc'ed date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
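Translated to PostgreSQL, that approach would look roughly like this - a sketch that assumes the column names from the table definition above and keeps the existing unix-timestamp column by converting it on the fly:
SELECT date_trunc('day', to_timestamp("timestamp")) AS bucket,
       avg(value) AS avg_value
FROM measuredata
WHERE entityid = 38
GROUP BY 1
ORDER BY 1;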
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value`
FROM `measuredata`
WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25')
group by aggtimestamp
Works, but my DB/index/structure seems not really optimized for this: a query for the last year took ~75 sec (slow test machine) and ends up with only one value per day. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with "group by".
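For reference, combining it with avg(value) as mentioned could look like this against the same table:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
       `entity`,
       AVG(`value`) AS avg_value
FROM `measuredata`
WHERE `entity` = 38
  AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp, `entity`;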
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . . -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field with its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so it incurs less cost for your dev/test. The query below can help with this - you can run it in the BigQuery UI with a destination table which you will then use for your dev. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 is taken from the query below (at the time of running) - the reason I took four days ago is that I know the number of rows in this table at that time was just 912 vs the current 528,858.
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data, which you might want to flatten to only its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

How to get repeatable sample using Presto SQL?

I am trying to get a sample of data from a large table and want to make sure this can be repeated later on. Other SQL dialects allow repeatable sampling, either by setting a seed using set.seed(integer) or with a repeatable(integer) command. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with random content (such as a UUID) and then selecting rows by filtering on this column (for example, UUIDs ending with '1'). You can tune the condition to get the sample size you need.
By design, the result is random and also repeatable across multiple runs.
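A rough sketch of that idea in Presto - the table name is hypothetical; uuid() provides the stored random key, and filtering on its last hex character yields roughly a 1-in-16 sample:
-- one-time: materialize a random key per row
CREATE TABLE mytable_with_key AS
SELECT t.*, uuid() AS sample_key
FROM mytable t;

-- repeatable ~6% sample: filter on the stored key
SELECT *
FROM mytable_with_key
WHERE substr(cast(sample_key AS varchar), -1) = '1';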
If you are using Presto 0.263 or higher you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.
For example, to reproducibly sample 20% of records in table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero because of the negative exponent.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This will contain only the sampled ids and will be ready to use downstream in your analysis by doing a JOIN with the data of interest.
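A sketch of that downstream JOIN, using the same placeholders as above:
SELECT t.*
FROM <tablename> t
JOIN <temp1> s
  ON t.<id_column> = s.<id_column>;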

Most efficient way to query for lat-long rectangle in SQL

I'm currently making consistent queries for a block of land within a given latitude/longitude rectangle. The coordinates are stored as individual double precision values. I've created a single index on both columns, so the current query, which returns 15,240 tiles, takes 0.10 seconds on my local machine.
At the moment there are 23 million rows in the table, but there will be around 800 million upon completion, so I expect this query to get much slower.
Here's the query I'm running, with example values:
SELECT * FROM territories
WHERE nwlat < 47.606977 and nwlat > 47.506977
and nwlng < -122.232991 and nwlng > -122.338991;
Is there a more efficient way of doing this? I'm fairly new to large databases, so any help is appreciated. FYI, I'm using PostgreSQL.
It would be much more efficient with a GiST or an SP-GiST index and a "box-contains-points" query ...
GiST index
The index is on a box with zero area, built from the same point (point(nwlat, nwlng)) twice.
There is a related code example in the manual for CREATE INDEX.
CREATE INDEX territories_box_gist_idx ON territories
USING gist (box(point(nwlat, nwlng), point(nwlat, nwlng)));
Query with the "overlaps" operator &&:
SELECT *
FROM territories
WHERE box(point(nwlat, nwlng), point(nwlat, nwlng))
&& '(47.606977, -122.232991), (47.506977, -122.338991)'::box;
SP-GiST index
Smaller index on just points:
CREATE INDEX territories_box_spgist_idx ON territories
USING spgist (point(nwlat, nwlng));
Query with the contains operator #>:
SELECT *
FROM territories
WHERE '(47.606977, -122.232991), (47.506977, -122.338991)'::box
#> point(nwlat, nwlng);
I get fastest results for the SP-GiST index in a simple test with 1M rows on Postgres 9.6.1.
For more sophisticated needs consider the PostGIS extension.