Spatial SQL query showing parcels containing centroid of building

I am trying to write a query that selects parcels that contain the centroid of a certain building code (bldg_code = 3).
The parcels are listed in the table "city.zoning", which contains columns for the PIN, geometry, and area of each parcel. The table "buildings" contains columns for bldg_type and bldg_code, indicating the building type and its corresponding code. The building type of interest for this query has a bldg_code of 3.
So far I've developed a query that shows parcels that interact with the building type of interest:
select a.*
from city.zoning a, username.buildings b
where b.bldg_code = 3 and sdo_anyinteract(a.geom,b.geom) = 'TRUE';
Any ideas?

You can use SDO_GEOM.SDO_CENTROID (documentation) to find the centroid of a geometry.
Note that the centroid provided by this function is the mathematical centroid only and may not always lie inside the geometry, for example if your polygon is L-shaped. SpatialDB Adviser has a good article on this.
If this isn't a problem for you and you don't need that level of accuracy, just use the built-in function; but if you do consider this to be a problem (as I did in the past), then SpatialDB Adviser has a standalone PL/SQL package that correctly calculates centroids.
Depending on your performance needs, you could calculate the centroids on the fly and use them in your query directly, or alternatively add a centroid column to the table and compute and cache the values with application code (best case) or a trigger (worst case).
Your query would look something like this:
SELECT a.*
FROM city.zoning a
JOIN username.buildings b ON sdo_contains(a.geom, b.centroid) = 'TRUE'
WHERE b.bldg_code = 3
Note that this uses SDO_CONTAINS on the assumption that the a.geom column is spatially indexed and that a new column b.centroid has been added and populated (note: query not tested). If the zoning geometry is not spatially indexed, then you would need to use SDO_GEOM.RELATE, or index the centroid column and invert the logic to use SDO_INSIDE.
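If you go down the cached-column route, the population step could look something like this (a sketch only, untested; the tolerance of 0.005 is an assumed value, use whatever matches your layer's metadata):
ALTER TABLE username.buildings ADD (centroid SDO_GEOMETRY);

UPDATE username.buildings b
SET b.centroid = SDO_GEOM.SDO_CENTROID(b.geom, 0.005);
COMMIT;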


How to divide world into cells (grid)

How to divide the world into cells of almost equal size, such that each lat,lon can be mapped to a different cell?
I am pretty sure I've seen a library do this, labelling cells as S1, S2, etc.
Say we have 62.356279,-99.422395; how do we map it to a 2km*2km cell named "FR,23"?
Thank you!
PostGIS 3.1+
PostGIS 3.1 introduces very easy-to-use grid generators, namely ST_SquareGrid and ST_HexagonGrid. An easy way to use these functions with data from a table is to execute them with LATERAL, e.g. for cells 0.1° in size:
Sample Data
Consider the following polygon (stored, for the examples below, in a table called isle_of_man):
Creating a square grid with a cell size of 0.1° over a given geometry
SELECT grid.* FROM isle_of_man imn,
LATERAL ST_SquareGrid(0.1,imn.geom) grid;
If you only want the cells that intersect with the geometry, just call the function ST_Intersects in the WHERE clause:
SELECT grid.* FROM isle_of_man imn,
LATERAL ST_SquareGrid(0.1,imn.geom) grid
WHERE ST_Intersects(imn.geom,grid.geom);
The same principle applies to ST_HexagonGrid:
SELECT grid.* FROM isle_of_man imn,
LATERAL ST_HexagonGrid(0.1,imn.geom) grid;
SELECT grid.* FROM isle_of_man imn,
LATERAL ST_HexagonGrid(0.1,imn.geom) grid
WHERE ST_Intersects(imn.geom,grid.geom);
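Since the generators also return the cell indices i and j alongside geom, you can derive a stable cell code for a single point, in the spirit of the "FR,23" naming asked about. A sketch (the coordinate is the one from the question; the 'S' label scheme is illustrative only):
SELECT format('S%s,%s', grid.i, grid.j) AS cell_code
FROM ST_SquareGrid(0.1, ST_SetSRID(ST_MakePoint(-99.422395, 62.356279), 4326)) grid;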
Older PostGIS versions
Inspired by this post I started writing a function to do just that - it still needs some tweaking, but it'll certainly give you a direction to look at. The following function creates a grid with cells of a given size covering the area of a given geometry:
CREATE OR REPLACE FUNCTION public.generate_grid(_size numeric, _geom geometry)
RETURNS TABLE(gid bigint, cell geometry) LANGUAGE 'plpgsql'
AS $BODY$
DECLARE
  _bbox box2d := ST_Extent(_geom);
  _ncol int := ceil(abs(ST_XMax(_bbox) - ST_XMin(_bbox)) / _size);
  _nrow int := ceil(abs(ST_YMax(_bbox) - ST_YMin(_bbox)) / _size);
  _srid int DEFAULT 4326;
BEGIN
  -- Only lon/lat CRSs are supported; keep the geometry's SRID if it has one.
  IF ST_SRID(_geom) <> 0 THEN
    IF EXISTS (SELECT 1 FROM spatial_ref_sys crs
               WHERE crs.srid = ST_SRID(_geom)
                 AND NOT crs.proj4text LIKE '+proj=longlat%') THEN
      RAISE EXCEPTION 'Only lon/lat spatial reference systems are supported in this function.';
    ELSE
      _srid := ST_SRID(_geom);
    END IF;
  END IF;
  -- Build an empty raster over the bounding box and turn each pixel into a cell.
  RETURN QUERY
  SELECT ROW_NUMBER() OVER (), geom FROM (
    SELECT
      ST_SetSRID(
        (ST_PixelAsPolygons(
          ST_AddBand(
            ST_MakeEmptyRaster(_ncol, _nrow, ST_XMin(_bbox), ST_YMax(_bbox), _size),
            '1BB'::text, 1, 0),
          1, false)).geom, _srid)) j(geom);
END;
$BODY$;
Note: This function relies on the extension PostGIS Raster.
SELECT cell FROM isle_of_man,
LATERAL generate_grid(0.1,geom);
... if you're only interested in cells that overlap your polygon, add an ST_Intersects to the query:
SELECT cell FROM isle_of_man,
LATERAL generate_grid(0.1,geom)
WHERE ST_Intersects(geom,cell)
Other alternatives
Mike's fishnet function does basically the same, but you'd need to manually provide the number of rows and columns, and the coordinate pair of the lower left corner:
SELECT ST_SetSRID(cells.geom,4326)
FROM ST_CreateFishnet(4, 6, 0.1, 0.1,-4.8411, 54.0364) AS cells;
You could use this makegrid_2d function to create a grid over an area using a polygon, e.g. a grid with cells of 5000 meters in size:
CREATE TABLE grid_isle_of_man AS
SELECT 'S'||ROW_NUMBER() OVER () AS grid_id, (g).geom
FROM (
SELECT ST_Dump(makegrid_2d(geom,5000))
FROM isle_of_man) j(g)
JOIN isle_of_man ON ST_Intersects((g).geom,geom);
The same logic applies for this hexagrid function. It creates a hexagon grid with fixed-size cells over a given BBOX. You can either manually provide the BBOX (the function's second parameter) or extract it from a given polygon. For instance, to create a hexagrid that matches the polygon's extent and store it in a new table with the label you want - with cells of 0.1° in size:
CREATE TABLE hexgrid_isle_of_man AS
WITH j (hex_rec) AS (
SELECT generate_hexagons(0.1,ST_Extent(geom))
FROM isle_of_man
)
SELECT 'S'||ROW_NUMBER() OVER () AS grid_id,(hex_rec).hexagon FROM j
JOIN isle_of_man t ON ST_Intersects(t.geom,(hex_rec).hexagon);
Further reading:
Import World Shapefile
Waiting for PostGIS 3.1: Grid Generators, by Paul Ramsey
Jim's answer is excellent. There are use cases, though, where you don't need the actual geometries - where, as you've mentioned, all you need is for coordinates in the same cell to map to the same code. So instead of a costly point-in-polygon operation that takes O(n) for n polygons - without an index, that is ;) - you call a function that simply evaluates a formula transforming a coordinate into a code in O(1). Very handy for aggregating data spatially fast.
Personally, I love Uber's H3 library for that sort of thing, but I'm sure S2 does something similar and does it well. There is a well-maintained PostgreSQL binding for H3 and a simple aggregation example would look something like this:
SELECT h3_geo_to_h3(geom_4326, 9) AS h3res09, SUM(pop_19) AS pop_19
FROM uk_postcode_population
GROUP BY 1;
Read: Sum up the postcode-level population of the UK for each resolution 9 hexagon.
You can still create the actual hexagon geometries when you need them (whole hex grids even). But in my experience, once you commit to the grid approach, you will only need polygons for visualisation in the very end.
I should note that you can't divide the world yourself using this library - Uber has already divided it for you. So if 2km squares are a hard requirement, this is not for you.
Installing the H3 extension is not as straightforward as CREATE EXTENSION postgis, but you don't need to be a command line warrior either. You will at the very least have to install PGXN, most likely PostgreSQL's extension build tooling, and the extension itself.
Someone asked for a point-in-polygon example in the comments. This is not exactly related to the question, but highlights how one would use H3.
Polygon layer prep:
CREATE TABLE poly_h3 AS (
SELECT id, h3_polyfill(geom_4326, 13) AS h3res13
FROM poly
);
CREATE INDEX ON poly_h3 (h3res13);
Point layer prep:
ALTER TABLE points ADD COLUMN h3res13 h3index;
UPDATE points
SET h3res13 = h3_geo_to_h3(geom_p_4326, 13);
CREATE INDEX ON points (h3res13);
Count points per polygon:
SELECT poly.id, COUNT(*) AS n
FROM poly_h3 AS poly
INNER JOIN points AS x
ON poly.h3res13 = x.h3res13
GROUP BY poly.id;

Query times out after 6 hours, how to optimize it?

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAPHY columns.
The shapes table contains travel routes for vehicles:
shape_key (STRING): identifier for the shape
shape_lines (ARRAY<GEOGRAPHY>): consecutive line segments making up the shape
shape_geography (GEOGRAPHY): the union of all shape_lines
shape_length_km (FLOAT64): length of the shape in kilometers
Rows: 65k
Size: 718 MB
We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.
The squares table contains a grid of 1×1 km squares:
square_key (INT64): identifier of the grid square
square_geography (GEOGRAPHY): four-cornered polygon describing the grid square
Rows: 102k
Size: 15 MB
The shapes represent travel routes for vehicles. For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of the route shape intersects with each grid cell.
Here's the query to compute that:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
shapes,
squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
Sadly, this query times out after 6 hours instead of producing a useful result.
In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.
I have considered the geographic join optimizations performed by BigQuery:
Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.
Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.
Spatial joins perform better when your geography data is persisted.
Check. Both shape_geography and square_geography come straight from existing tables.
BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects
Check. Just a single ST_Intersect call, no other conditions.
Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.
Check. None of these cases apply.
So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.
I have also considered the advice about cross joins:
Avoid joins that generate more outputs than inputs.
This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.
When a CROSS JOIN is required, pre-aggregate your data.
To avoid performance issues associated with joins that generate more outputs than inputs:
Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.
Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think it's possible to use a window function for this query.
I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.
How can I make this query run in reasonable time?
I think the squares got inverted, resulting in almost-full Earth polygons:
select st_area(square_geography), * from `open-transport-data.public.squares`
Prints results like 5.1E14, which is the area of the full globe. So any line intersects almost all the squares. See the BigQuery docs for details: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
You can fix the inverted polygons by running ST_GeogFromText(wkt, FALSE), which chooses the smaller polygon, ignoring polygon orientation; this works reasonably fast:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
`open-transport-data.public.shapes`,
(select
square_key,
st_geogfromtext(st_astext(square_geography), FALSE) as square_geography
from `open-transport-data.public.squares`) squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
This would definitely not fit the comment format, so I have to post it as an answer ...
I made three adjustments to your query:
using JOIN ... ON instead of CROSS JOIN ... WHERE
commenting out square_portion calculation
using destination table with Allow Large Results option
Even though you expected just 3.3M rows in output, in reality it is about 6.6B (6,591,549,944) rows, as my experiment showed.
Note the warning about the Billing Tier - so you had better use Reservations if available.
Obviously, un-commenting the square_portion calculation will increase slot usage - so you might need to revisit your requirements/expectations.

Limit dimension values displayed in QlikView Pivot Table Chart

I have a pivot table chart in QlikView that has a dimension and an expression. The dimension is a column with 5 possible values: 'a','b','c','d','e'.
Is there a way to restrict the values to 'a','b' and 'c' only?
I would prefer to enforce this from the chart properties with a condition, instead of choosing the values from a listbox if possible.
Thank you very much, I_saw_drones! There is a problem I have, though. I have different expressions defined depending on the category, like this:
IF([Category] = 'A', COUNT({<[field1] = {'x','y'}>} [field2]), IF([Category] = 'B', SUM({<[field3] = {'z'}>} [field4]), IF([Category] = 'C', ..., 0)))
In this case, where would I add $<Category={'A','B','C'}>? My expression so far doesn't help because, although I tell QV to use a different formula/calculation for each category, the category overall (all 5 values) represents the dimension.
One possible method to do this is to use QlikView's Set Analysis to create an expression which sums only your desired values.
For this example, I have a very simple load script:
LOAD * INLINE [
Category, Value
A, 1
B, 2
C, 3
D, 4
E, 5
];
I then have a Pivot Table Chart set up with a single expression, =sum(Value), which just sums the values.
What we need to do is modify the expression so that it only sums A, B and C from the Category field.
If I then use QlikView's Set Analysis to modify the expression to the following:
=sum({$<Category={A,B,C}>} Value)
I then achieve my desired result.
This then restricts my Pivot Table Chart to displaying only these three values for Category without me having to make a selection in a Listbox. The form of this expression also allows other dimensions to be filtered at the same time (i.e. the selections "add up"), so I could say, filter on a Country dimension, and my restriction for Category would still be applied.
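For example, adding a hypothetical Country field to the modifier would filter both dimensions at once:
=sum({$<Category={A,B,C}, Country={UK}>} Value)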
How this works
Let's pick apart the expression:
=sum({$<Category={A,B,C}>} Value)
Here you can recognise the original form we had before (sum(Value)), but with a modification. The part {$<Category={A,B,C}>} is the Set Analysis part and has this format: {set_identifier<set_modifier>}. Coming back to our original expression:
{: Set Analysis expressions always start with a {.
$: Set Identifier: This symbol represents the current selections in the QlikView document. This means that any subsequent restrictions are applied on top of the existing selections. 1 can also be used; this represents the full set of data in your document irrespective of selections (see the example after this list).
<: Start of the set modifiers.
Category={A,B,C}: The dimension that we wish to place a restriction on. The values required are contained within the curly braces and in this case they are ORed together.
>: End of the set modifiers.
}: End of the set analysis expression.
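For example, swapping the $ identifier for 1 applies the Category restriction to the full data set, ignoring any current selections:
=sum({1<Category={A,B,C}>} Value)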
Set Analysis can be quite complex and I've only scratched the surface here, I would definitely recommend checking the QlikView topic "Set Analysis" in both the installed helpfile and the reference manual (PDF).
Finally, Set Analysis in QlikView is quite powerful, however it should be used sparingly as it can lead to some performance problems. In this case, as this is a fairly simple expression the performance should be reasonable.
Whoa! A year later, but what you are looking for is something like this:
Go to the Dimension sheet, then select the Category dimension, and click on the Edit Dimension button.
There you can use something like this:
= If(Match(Category, 'a', 'b', 'c'), Category, Null())
This will make the object display only the 'a', 'b' and 'c' Categories, and a line for the Null value.
All that's left is to check the "Suppress value when null" option on the Dimension sheet.
c ya around
Just thought of another solution to this which may still be useful to people looking for this.
How about creating a bookmark with the categories that you want and then setting the expressions to be evaluated in the context of that bookmark only?
(Will expand on this later, but take a look at how set analysis can be affected by a bookmark)

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched)
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However, that ran for over 20 minutes with no results, so I canceled it. When I tried hardcoding the zip code, it immediately returned a list:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function, you probably have to calculate every single possible distance for each row. That is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zip code to every other zip code (and update it only once a month or so, when we added new possible zip codes). Once the distances are precalculated, all you have to do is run a query like:
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
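The precalculation itself can be a one-off self-join. A hypothetical sketch, where zipcoords and dbo.udf_Distance stand in for your coordinate table and distance function:
INSERT INTO zipprecalc (zip1, zip2, distance)
SELECT a.zip, b.zip, dbo.udf_Distance(a.lat, a.lon, b.lat, b.lon)
FROM zipcoords AS a
INNER JOIN zipcoords AS b ON b.zip <> a.zip;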
We have something similar and optimized it by only calculating the distance of other zipcodes whose latitude is within a bounded range. So if you want other zips within #miles, you use a
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude, because the distance a degree of longitude represents varies between the equator and the poles.
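If you do want to bound longitude as well, you can widen the band by the cosine of the target latitude. A sketch using the same assumed variables and the same 69.2 miles-per-degree figure:
where latitude >= #targetLat - (#miles/69.2)
and latitude <= #targetLat + (#miles/69.2)
and longitude >= #targetLon - (#miles/(69.2 * cos(radians(#targetLat))))
and longitude <= #targetLon + (#miles/(69.2 * cos(radians(#targetLat))))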
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
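For reference, an inline table-valued function is one whose body is a single RETURN ... SELECT with no BEGIN/END block, so the optimiser can expand it into the calling query. A hypothetical sketch of udf_GetLongLatDist in inline form, assuming a zipcoords table with lat/lon columns (3959 is the Earth's radius in miles):
CREATE FUNCTION dbo.udf_GetLongLatDist (@zip VARCHAR(10), @miles FLOAT)
RETURNS TABLE
AS
RETURN
    -- great-circle distance between the searched zip and every other zip
    SELECT b.zip AS zipnum
    FROM zipcoords AS a
    INNER JOIN zipcoords AS b
        ON b.zip <> a.zip
    WHERE a.zip = @zip
      AND 3959 * ACOS(
              COS(RADIANS(a.lat)) * COS(RADIANS(b.lat))
            * COS(RADIANS(b.lon) - RADIANS(a.lon))
            + SIN(RADIANS(a.lat)) * SIN(RADIANS(b.lat))) <= @miles;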

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair using another node.
An example:
If I want to find the distance between A and B,
If I find a node x and have:
x -> A
x -> B
I can add these distances and have the distance between A and B.
My question:
How can I find all the nodes (such as x) and get their distances to (A and B)?
My purpose is to select the min value of distance.
P.s: A and B are just one connection (I need to do it for 100K connections).
Thanks !
As Andomar said, you'll need the Dijkstra's algorithm, here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm
Assuming you want to get the path from A to B with many intermediate steps, it is impossible to do in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and use Dijkstra's algorithm.
This sounds like the traveling salesman problem.
From a SQL syntax standpoint: CONNECT BY PRIOR would build the tree you're after, using START WITH and a limit on the number of layers it can traverse; however, doing so will not guarantee the minimum.
I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT id FROM mytable WHERE source='A' ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing 'A' with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE @var_id VARCHAR(10) = 'A';
WHILE @var_id <> 'B'
BEGIN
    -- follow the cheapest outgoing link from the current node
    SELECT TOP 1 @var_id = dest
    FROM mytable
    WHERE source = @var_id
    ORDER BY distance ASC;

    SELECT @var_id;
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.
Join the table to itself with destination joined to source. Add the distance from the two links. Insert that as a new link with left side source, right side destination and total distance if that isn't already in the table. If that is in the table but with a shorter total distance then update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.
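A T-SQL-flavored sketch of one pass of this idea, with links(source, dest, distance) as an assumed table name; you would repeat both statements until neither affects any rows:
-- Add two-hop routes that aren't in the table yet
INSERT INTO links (source, dest, distance)
SELECT a.source, b.dest, MIN(a.distance + b.distance)
FROM links AS a
INNER JOIN links AS b ON b.source = a.dest
WHERE a.source <> b.dest
  AND NOT EXISTS (SELECT 1 FROM links AS c
                  WHERE c.source = a.source AND c.dest = b.dest)
GROUP BY a.source, b.dest;

-- Shorten existing routes where a two-hop alternative is cheaper
UPDATE l
SET l.distance = s.best
FROM links AS l
INNER JOIN (SELECT a.source, b.dest, MIN(a.distance + b.distance) AS best
            FROM links AS a
            INNER JOIN links AS b ON b.source = a.dest
            GROUP BY a.source, b.dest) AS s
    ON s.source = l.source AND s.dest = l.dest
WHERE s.best < l.distance;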
IIUC this should do, but I'm not sure if it is really viable (performance-wise) due to the large number of rows involved and the self-join
SELECT
t1.source AS A,
t1.dest AS x,
t2.dest AS B,
t1.distance + t2.distance AS total_distance
FROM
big_table AS t1
INNER JOIN
big_table AS t2 ON t2.source = t1.dest
WHERE
t1.source = 'insert source (A) here' AND
t2.dest = 'insert destination (B) here'
ORDER BY
total_distance ASC
LIMIT
1
The above snippet will work for the case in which you have two rows in the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
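That view could be as simple as the following sketch, which treats every link as bidirectional:
CREATE VIEW big_table_undirected AS
SELECT source, dest, distance FROM big_table
UNION ALL
SELECT dest AS source, source AS dest, distance FROM big_table;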