Incorrect results returned by postgres - sql

I ran the following commands in PostgreSQL 9.6:
./bin/createdb testSpatial
./bin/psql -d testSpatial -c "CREATE EXTENSION postgis;"
create table test(name character varying(250), lat_long character varying(90250), the_geom geometry);
\copy test(name,lat_long) FROM 'test.csv' DELIMITERS E'\t' CSV HEADER;
CREATE INDEX spatial_gist_index ON test USING gist (the_geom );
UPDATE test SET the_geom = ST_GeomFromText(lat_long,4326);
Running `select * from test;` gives the following output (reflowed here for readability):
 name | lat_long | the_geom
------+----------+----------
 A    | POLYGON((-0.061225 -128.427791,-0.059107 -128.428264,-0.056311 -128.428911,-0.054208 -128.426510,-0.055431 -128.426324,-0.057363 -128.426124,-0.059315 -128.425843,-0.061225 -128.427791)) | 0103000020E61000000100000008000000D42B6519E258AFBFBE50C076B00D60C07DE9EDCF4543AEBFBC41B456B40D60C08063CF9ECBD4ACBFA1BC8FA3B90D60C07BF65CA626C1ABBF58AD4CF8A50D60C0BF805EB87361ACBFFFAF3A72A40D60C0B83A00E2AE5EADBF4D81CCCEA20D60C01F115322895EAEBF60C77F81A00D60C0D42B6519E258AFBFBE50C076B00D60C0
 B    | POINT(1.978165 -128.639779) | 0101000020E61000002D78D15790A6FF3F5D35CF11791460C0
(2 rows)
Next, I ran a query to find all "name" values that are within 5 meters of each other:
testSpatial=# select s1.name, s2.name from test s1, test s2 where ST_DWithin(s1.the_geom, s2.the_geom, 5);
name | name
------+------
A | A
A | B
B | A
B | B
(4 rows)
To my surprise, I am getting incorrect output, as "A" and "B" are 227.301 km away from each other (as calculated using the haversine distance here: http://andrew.hedges.name/experiments/haversine/). Can someone please help me understand where I am going wrong?

You have defined your geometry as follows
the_geom geometry
i.e., it's not geography. But the ST_DWithin docs say:
For Geometries: The distance is specified in units defined by the spatial reference system of the geometries. For this function to make sense, the source geometries must both be of the same coordinate projection, having the same SRID.
For geography units are in meters and measurement is defaulted to use_spheroid=true, for faster check, use_spheroid=false to measure along sphere.
So you are actually searching for places that are within 5 degrees of each other. A degree is roughly equal to 111km so you are looking for places that are about 550 km from each other rather than 5 meters.
Additionally, it doesn't make much sense to store strings like POINT(1.978165 -128.639779) in your table. They're completely redundant: that information can be regenerated quite easily from the geometry column.
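As a minimal sketch of the meter-based version (assuming the same `test` table as above): casting both operands to `geography` makes `ST_DWithin` interpret the third argument in meters rather than degrees.

```sql
-- Distance is now in meters because both arguments are geography.
-- The name comparison skips self-pairs and mirrored duplicates.
SELECT s1.name, s2.name
FROM test s1, test s2
WHERE s1.name < s2.name
  AND ST_DWithin(s1.the_geom::geography, s2.the_geom::geography, 5);
```

With this version, A and B (roughly 227 km apart) would no longer match a 5 meter radius.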

Related

Filter list of points using list of Polygons

Given a list of points and a list of polygons, how do you return the subset of points that lie within any of the polygons?
I've removed other columns in the sample tables to simplify things
Points Table:
| Longitude| Latitude |
|----------|-----------|
| 7.07491 | 51.28725 |
| 3.674765 | 51.40205 |
| 6.049105 | 51.86624 |
LocationPolygons Table:
| LineString |
|----------------------|
| CURVEPOLYGON (COMPOUNDCURVE (CIRCULARSTRING (-122.20 47.45, -122.81 47.0, -122.942505 46.687131 ... |
| MULTIPOLYGON (((-110.3086 24.2154, -110.30842 24.2185966, -110.3127...
If I had a row from the LocationPolygons table I could do something like
DECLARE @homeLocation geography;
SET @homeLocation = (select top 1 GEOGRAPHY::STGeomFromText(LineString, 4326)
FROM LocationPolygon where LocationPolygonId = '123abc')
select Id, Longitude, Latitude, @homeLocation.STContains(geography::Point(Latitude, Longitude, 4326))
as IsInLocation from Points where PointId in (1, 2, 3)
which would return what I want in a format like the below. However this is only true for just one location on the list
| Id | Longitude| Latitude | IsInLocation |
|----|----------|-----------|--------------|
| 1 | 7.07491 | 51.28725 | 0 |
| 2 | 3.674765 | 51.40205 | 1 |
| 3 | 6.049105 | 51.86624 | 0 |
How do I handle the scenario with multiple rows of the LocationPolygon table?
I'd like to know:
1. if any of the points are in any of the LocationPolygons
2. which specific location polygon they are in, or whether they are in more than one polygon
Question 2 is more of an extra. Can someone help?
Update #1
In response to @Ben-Thul's answer.
Unfortunately I don't have access/permission to make changes to the original tables. I can request access, but I'm not certain it'll be given, so I may not be able to add the columns or create the indexes. However, I can create temp tables in a stored proc, so I might be able to test your solution that way.
I stumbled on an answer like the one below, but I'm slightly worried about the performance implications of using a cross join.
WITH cte AS (
select *, (GEOGRAPHY::STGeomFromText(LineString, 4326)).STContains(geography::Point(Latitude, Longitude, 4326)) as IsInALocation from
(
select Longitude, Latitude from Points nolock
) a cross join (
select LineString FROM LocationPolygons nolock
) b
)
select * from cte where IsInALocation = 1
Obviously it's better to look at a query plan, but is the solution I stumbled upon essentially the same as yours? Are there any potential issues that I've missed? Apologies for this, but my SQL isn't very good.
Question 1 shouldn't be too bad. First, some set up:
alter table dbo.Points add Point as (GEOGRAPHY::Point(Latitude, Longitude, 4326));
create spatial index IX_Point on dbo.Points (Point) with (online = on);
alter table dbo.LocationPolygon add Polygon as (GEOGRAPHY::STGeomFromText(LineString, 4326));
create spatial index IX_Polygon on dbo.LocationPolygon (Polygon) with (online = on);
This will create a computed column on each of your tables that is of type geography that has a spatial index on it.
From there, you should be able to do something like this:
select pt.ID,
pt.Longitude,
pt.Latitude,
coalesce(pg.IsInLocation, 0) as IsInLocation
from Points as pt
outer apply (
select top(1) 1 as IsInLocation
from dbo.LocationPolygon as pg
where pg.Polygon.STContains(pt.Point) = 1
) as pg;
Here, you're selecting every row from the Points table and using outer apply to see if any polygons contain that point. If one does (it doesn't matter which one), that query will return a 1 in the result set and bubble that back up to the driving select.
To extend this to Question 2, you can remove the top() from the outer apply and have it return either the IDs from the Polygon table or whatever you want. Note though that it'll return one row per polygon that contains the point, potentially changing the cardinality of your result set!
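For instance, a sketch of the Question 2 variant (assuming `LocationPolygon` has a `LocationPolygonId` key column, as in the question, and the computed columns from the setup above):

```sql
select pt.ID,
       pt.Longitude,
       pt.Latitude,
       pg.LocationPolygonId  -- NULL when the point is in no polygon
from Points as pt
outer apply (
    select lp.LocationPolygonId
    from dbo.LocationPolygon as lp
    where lp.Polygon.STContains(pt.Point) = 1
) as pg;
```

A point contained in three polygons will appear three times in this result set, once per containing polygon.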

How to select polygon from table by Latitude and Longitude (Postgis)

There is a polygon table. I need to find the record whose area contains a given point (lat/long).
Example coordinates: 149.14668176, -35.32202098
Could you please help me write a select statement to find the area that contains my point?
SELECT PostGIS_full_version();
postgis_full_version
POSTGIS="2.5.2 r17328" [EXTENSION] PGSQL="96" GEOS="3.5.1-CAPI-1.9.1 r4246" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.1.2, released 2016/10/24" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.2.1" RASTER
Something like that:
SELECT id,name FROM area_polygon WHERE ST_Within('149.14668176, -35.32202098', geog);
bounds=# \d bounds.area_polygon;
id | integer | | not null |
geog | geography(Polygon,4283) | | |
name | text | | |
I expected:
id | name
------+--------
1 | Alabama
ST_Within only supports geometry types, which is why the earlier attempt produces an error: you have a geography column type.
You can either cast to geometry:
SELECT id,name
FROM area_polygon
WHERE ST_Within(ST_SetSRID(ST_POINT(149.14668176,-35.32202098),4283), geog::geometry);
Or you can use st_dwithin, with distance set to zero:
SELECT id,name
FROM area_polygon
WHERE ST_DWithin(ST_SetSRID(ST_POINT(149.14668176,-35.32202098),4283)::geography, geog,0);
Note that the order of the coordinates must be lon/lat (and not lat/lon) and I am assuming those coordinates are in your SRID 4283. They have to either match the geog SRID or be transformed to it...
See here for a list of which functions support which arguments.

Amazon Redshift - Pivot Large JSON Arrays

I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
95% contains 16 or fewer elements
the longest (to date) contains 67 elements
the hard limit is 512 elements (before 64kB isn't big enough)
The task is simple, pivot each array such that each element has its own row.
id | json
----+---------------------------------------------
01 | [{"something":"here"}, {"fu":"bar"}]
=>
id | element_id | json
----+------------+---------------------------------
01 | 1 | {"something":"here"}
01 | 2 | {"fu":"bar"}
Without having any kind of table valued functions (user defined or otherwise), I've resorted to pivoting via joining against a numbers table.
SELECT
src.id,
pvt.element_id,
json_extract_array_element_text(
src.json,
pvt.element_id
)
AS json
FROM
source_table AS src
INNER JOIN
numbers_table AS pvt(element_id)
ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
The elapsed time is horrendous. And it's not to do with distribution or sort order or encoding. It's to do with (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB.
15MB * 512 rows in numbers = 7.5GB
If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time similarly reduces (not 4x, the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
pvt.element_id < (SELECT MAX(json_array_length(src.json)) FROM source_table)
No change to the working memory needed, the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers, that didn't help. I've tried making a CTE to create the list of numbers, with a WHERE clause to limit the size, that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
My next effort will be to work on it in two steps.
As above, but with a table of only 16 rows, and only for row with 16 or fewer elements
As above, with the dynamically mixed numbers table, and only for rows with more than 16 elements
Question: Does anyone have any better ideas?
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
id | given | family | price | quantity
----+-------+--------+-------+----------
1 | John | Smith | 100.5 | 9
1 | John | Smith | 99.12 | 2
Neither the data format, nor the task you wish to do, is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to do queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (eg Distribution Keys, Sort Keys, Zone Maps, Parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
id | element_id | key | value
----+------------+---------------------
01 | 1 | something | here
01 | 2 | fu | bar
As to how best to convert the existing JSON data into separate rows, I would frankly recommend doing it outside of Redshift, then loading the result into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
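A sketch of that script (the input layout and column names are assumptions mirroring the question's `id`/`json` table): it explodes each JSON array into one `(id, element_id, element_json)` row per element, ready to be written to a delimited file for COPY.

```python
import json

def explode(rows):
    """Yield (id, element_id, element_json) for each JSON array element.

    `rows` mirrors the source table: an iterable of (id, json_text)
    pairs where json_text is a JSON array of objects.
    """
    for row_id, json_text in rows:
        for i, element in enumerate(json.loads(json_text), start=1):
            # Re-serialize compactly so each element is one clean field
            yield row_id, i, json.dumps(element, separators=(",", ":"))

# The example row from the question:
rows = [("01", '[{"something":"here"}, {"fu":"bar"}]')]
for line in explode(rows):
    print(line)
# → ('01', 1, '{"something":"here"}')
# → ('01', 2, '{"fu":"bar"}')
```

This processes one row at a time, so memory stays proportional to the largest single array rather than to rows × 512 as in the numbers-table join.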
Maybe if you avoid parsing and interpreting the JSON as JSON, and instead work with it as plain text, it can work faster. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce JSON parsing errors), you might try using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json, ',', pvt.element_id + 1)
If your elements contain commas you might use:
split_part(src.json, '},{', pvt.element_id + 1)
(split_part is 1-based while element_id runs from 0, hence the + 1.)
Also, the part with ON pvt.element_id < json_array_length(src.json) in the join condition is still there, so to avoid JSON parsing completely you might try to cross join and then filter out non-null values.
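A rough sketch of that cross-join variant (assumptions: in Redshift, split_part returns an empty string past the last delimiter, and no string value inside an element contains the '},{' delimiter):

```sql
-- Explode every row against every candidate position, then keep
-- only positions that actually produced an element.
SELECT id, element_id, json
FROM (
    SELECT src.id,
           pvt.element_id,
           split_part(src.json, '},{', pvt.element_id + 1) AS json
    FROM source_table AS src
    CROSS JOIN numbers_table AS pvt(element_id)
) exploded
WHERE json <> '';  -- drop padding rows past the array's end
```

This removes json_array_length from the join condition entirely, at the cost of generating the full rows × numbers intermediate set before filtering.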

Getting all Buildings in range of 5 miles from specified coordinates

I have database table Building with these columns: name, lat, lng
How can I get all Buildings in range of 5 miles from specified coordinates, for example these:
-84.38653999999998
33.72024
My attempt, which does not work:
SELECT ST_CONTAINS(
SELECT ST_BUFFER(ST_Point(-84.38653999999998,33.72024), 5),
SELECT ST_POINT(lat,lng) FROM "my_db"."Building" LIMIT 50
);
https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list.html
Why are you storing x,y in separate columns? I strongly suggest storing them as geometry or geography to avoid unnecessary casting overhead at query time.
That being said, you can compute and check distances in miles using ST_DWithin or ST_Distance:
(Test data)
CREATE TABLE building (name text, long numeric, lat numeric);
INSERT INTO building VALUES ('Kirk Michael',-4.5896,54.2835);
INSERT INTO building VALUES ('Baldrine',-4.4077,54.2011);
INSERT INTO building VALUES ('Isle of Man Airport',-4.6283,54.0804);
ST_DWithin
ST_DWithin returns true if the given geometries are within the specified distance from another. The following query searches for geometries that are in 5 miles radius from POINT(-4.6314 54.0887):
SELECT name,long,lat,
ST_Distance('POINT(-4.6314 54.0887)'::geography,
ST_MakePoint(long,lat)) * 0.000621371 AS distance
FROM building
WHERE
ST_DWithin('POINT(-4.6314 54.0887)'::geography,
ST_MakePoint(long,lat),8046.72); -- 8046.72 metres = 5 miles;
name | long | lat | distance
---------------------+---------+---------+-------------------
Isle of Man Airport | -4.6283 | 54.0804 | 0.587728347062174
(1 row)
ST_Distance
The function ST_Distance (with geography type parameters) will return the distance in meters. Using this function all you have to do is to convert meters to miles in the end.
Attention: Distances in queries using ST_Distance are computed in real time and therefore do not use the spatial index. So, it is not recommended to use this function in the WHERE clause! Use it rather in the SELECT clause. Nevertheless the example below shows how it could be done:
SELECT name,long,lat,
ST_Distance('POINT(-4.6314 54.0887)'::geography,
ST_MakePoint(long,lat)) * 0.000621371 AS distance
FROM building
WHERE
ST_Distance('POINT(-4.6314 54.0887)'::geography,
ST_MakePoint(long,lat)) * 0.000621371 <= 5;
name | long | lat | distance
---------------------+---------+---------+-------------------
Isle of Man Airport | -4.6283 | 54.0804 | 0.587728347062174
(1 row)
Mind the parameters order with ST_MakePoint: It is longitude,latitude.. not the other way around.
Demo: db<>fiddle
Amazon Athena equivalent (distance in degrees):
SELECT *, ST_DISTANCE(ST_GEOMETRY_FROM_TEXT('POINT(-84.386330 33.753746)'),
ST_POINT(long,lat)) AS distance
FROM building
WHERE
ST_Distance(ST_GEOMETRY_FROM_TEXT('POINT(-84.386330 33.753746)'),
ST_POINT(long,lat)) <= 5;
First things first: if possible, use PostGIS, not amazon-athena. Looking at the documentation, Athena appears to be a stripped-down version of a spatial tool.
First, install PostGIS:
CREATE EXTENSION postgis SCHEMA public;
Now create a geometry column (if you want to use a metric SRID, like 3857) or a geography column (if you want to use a degree-based SRID, like 4326) for your data:
alter table building add column geog geography;
Then transform your long/lat point data to geometry/geography (note the argument order: longitude first):
update building
set geog = ST_SetSRID(ST_MakePoint(long, lat), 4326)::geography;
Next create a spatial index on it:
create index on building using gist (geog);
Now you are ready for action:
select *,
    st_distance(geog, ST_SetSRID(ST_MakePoint(-84.38653999999998, 33.72024), 4326)::geography) / 1609.34 as dist_miles
from building
where st_dwithin(geog, ST_SetSRID(ST_MakePoint(-84.38653999999998, 33.72024), 4326)::geography, 5 * 1609.34);
A few words of explanation:
The index is useful if you have many records in your table.
ST_DWithin uses the index while ST_Distance does not, so ST_DWithin will make your query much faster on big data sets.
For AWS Athena, try using this to calculate an approximate search radius in degrees:
decimal_degree_distance = 5000.0 * 360.0 / (2.0 * pi() * cos(radians(latitude)) * 6400000.0)
where 5000.0 is the distance in meters. This approximation works well for places near the equator.
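To make the arithmetic concrete, a quick sketch of that conversion in Python (the latitude and radius here are example values, not taken from the question):

```python
from math import pi, cos, radians

def meters_to_decimal_degrees(meters, latitude):
    """Approximate a metric radius as decimal degrees of longitude
    at the given latitude, taking the Earth radius as 6,400,000 m."""
    return meters * 360.0 / (2.0 * pi * cos(radians(latitude)) * 6400000.0)

# A 5 km radius at latitude 33.75 comes out to roughly 0.054 degrees
print(round(meters_to_decimal_degrees(5000.0, 33.753746), 3))
```

The resulting number is what you would pass as the threshold to Athena's degree-based ST_Distance comparison.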

Efficient sorted bounding box query

How would I create indexes in PostgreSQL 8.3 to make a sorted bounding box query efficient? The table I'm querying has quite a few rows.
That is, I want to create indexes that make the following query as efficient as possible:
SELECT * FROM features
WHERE lat BETWEEN ? AND ?
AND lng BETWEEN ? AND ?
ORDER BY score DESC
The features table look like this:
Column | Type |
------------+------------------------+
id | integer |
name | character varying(255) |
type | character varying(255) |
lat | double precision |
lng | double precision |
score | double precision |
html | text |
To create a GiST index on a point attribute so that we can efficiently use box operators on the result of the conversion function:
CREATE INDEX pointloc
ON points USING gist (box(location,location));
SELECT * FROM points
WHERE box(location,location) && '(0,0),(1,1)'::box;
http://www.postgresql.org/docs/9.0/static/sql-createindex.html
This is the example in 9.0 docs. It should work for 8.3 though as these are features that have been around for ages.
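Adapting that docs example to the features table above might look like the following sketch: each row's lat/lng pair becomes a degenerate (zero-area) box, the GiST index serves the `&&` overlap filter, and the sort on score is applied to the filtered rows.

```sql
-- Functional GiST index over a per-row degenerate box;
-- note point() takes (x, y), i.e. (lng, lat).
CREATE INDEX features_bbox
    ON features USING gist (box(point(lng, lat), point(lng, lat)));

SELECT *
FROM features
WHERE box(point(lng, lat), point(lng, lat))
      && box(point(:lng_min, :lat_min), point(:lng_max, :lat_max))
ORDER BY score DESC;
```

The index only accelerates the bounding-box filter; the ORDER BY score still requires sorting the matching rows, so a very large match set will still pay for the sort.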
You could try using a GiST index to implement an R-Tree. This type of index is poorly documented, so you might have to trawl through example code in the source distribution.
(Note: My prior advice to use R-Tree indexes appears to be out of date; they are deprecated.)
Sounds like you'd want to take a look at PostGIS, a PostgreSQL module for spatial data types and queries. It supports quick lookups using GiST indexes. Unfortunately I can't guide you further as I haven't used PostGIS myself.