How to get polygons contained within another polygon, including those that touch the border in BigQuery Standard SQL?

I'm trying to get blockgroups contained within a designated market area (DMA) polygon in BigQuery. I've tried using st_contains and st_covers, but I still only get the completely contained ones (and not the bordering ones):
SELECT a.blockgroup_geom as the_geom,
a.geo_id,
c.total_pop
FROM `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` a
join `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` c
on a.geo_id = c.geo_id
join `bigquery-public-data.geo_us_boundaries.designated_market_area` b
on st_covers(b.dma_geom,a.blockgroup_geom)
where b.dma_name = 'Milwaukee, WI'
What do I need to do to get all of the blockgroups within the dma polygon?

It looks like the two datasets are not exactly aligned, so slight discrepancies between the two polygons, or errors due to planar-to-spherical conversion, prevent perfect nesting of the polygons.
This post has deeper discussion and an idea how to handle this: https://mentin.medium.com/creating-spatial-hierarchy-2ba5488eac0a
Here I would use an ST_Intersects condition and check the area of the intersection to see whether the larger part of the blockgroup belongs to the DMA:
SELECT a.blockgroup_geom as the_geom,
a.geo_id,
c.total_pop,
(st_area(st_intersection(b.dma_geom,a.blockgroup_geom))
> 0.75 * st_area(a.blockgroup_geom)) as mostly
FROM `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` a
join `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` c
on a.geo_id = c.geo_id
join `bigquery-public-data.geo_us_boundaries.designated_market_area` b
on st_intersects(b.dma_geom,a.blockgroup_geom)
where b.dma_name = 'Milwaukee, WI'
Does this result make sense? I've colored the polygons with mostly = true green, and those with mostly = false blue.
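If the split looks right, the check can move into the WHERE clause so only the mostly-contained blockgroups come back. A minimal sketch reusing the same tables (the 0.5 majority threshold is my choice here; the query above used 0.75, so tune it to your needs):
SELECT a.blockgroup_geom as the_geom,
  a.geo_id,
  c.total_pop
FROM `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` a
JOIN `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` c
  ON a.geo_id = c.geo_id
JOIN `bigquery-public-data.geo_us_boundaries.designated_market_area` b
  ON st_intersects(b.dma_geom, a.blockgroup_geom)
WHERE b.dma_name = 'Milwaukee, WI'
  -- keep a blockgroup only when most of its area falls inside the DMA
  AND st_area(st_intersection(b.dma_geom, a.blockgroup_geom))
      > 0.5 * st_area(a.blockgroup_geom)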

Related

How to use PostGIS to buffer a point layer and find if points fall within these buffers, ignoring each buffer's own central point

I have a table containing point geometries and a buffer_distance column. I want to create a buffer around each point using the buffer_distance column and check whether any point falls within each buffer, ignoring the point at the center of each buffer polygon (the point from which I created the buffer).
I have the following query, but it returns True for all values in the point_within column, as it is not ignoring the point at the center of the buffered geometry.
SELECT p.id, ST_Buffer(p.geom, p.buffer_distance), ST_Within(p.geom, ST_Buffer(p.geom, p.buffer_distance)) AS point_within
FROM point_table as p
What you're looking for is a spatial join. Join the table with itself, check which records lie inside each individual buffer, and exclude each buffer's own central point:
SELECT * FROM point_table p
JOIN point_table p2 ON
ST_Contains(ST_Buffer(p.geom, p.buffer_distance), p2.geom) AND
p2.id <> p.id;
Demo: db<>fiddle
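A variant worth trying, as a sketch under the same schema assumptions: ST_DWithin expresses the same radius test without materializing buffer polygons, which usually lets PostGIS use a spatial index on geom:
SELECT p.id AS center_id, p2.id AS neighbor_id
FROM point_table AS p
JOIN point_table AS p2
  ON ST_DWithin(p.geom, p2.geom, p.buffer_distance)  -- inside p's buffer radius
  AND p2.id <> p.id;                                 -- skip the buffer's own center point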

Query times out after 6 hours, how to optimize it?

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAPHY columns.
The shapes table contains travel routes for vehicles:
shape_key (STRING): identifier for the shape
shape_lines (ARRAY<GEOGRAPHY>): consecutive line segments making up the shape
shape_geography (GEOGRAPHY): the union of all shape_lines
shape_length_km (FLOAT64): length of the shape in kilometers
Rows: 65k
Size: 718 MB
We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.
The squares table contains a grid of 1×1 km squares:
square_key (INT64): identifier of the grid square
square_geography (GEOGRAPHY): four-cornered polygon describing the grid square
Rows: 102k
Size: 15 MB
The shapes represent travel routes for vehicles. For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of the route shape intersects with each grid cell.
Here's the query to compute that:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
shapes,
squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
Sadly, this query times out after 6 hours instead of producing a useful result.
In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.
I have considered the geographic join optimizations performed by BigQuery:
Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.
Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.
Spatial joins perform better when your geography data is persisted.
Check. Both shape_geography and square_geography come straight from existing tables.
BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects
Check. Just a single ST_Intersects call, no other conditions.
Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.
Check. None of these cases apply.
So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.
I have also considered the advice about cross joins:
Avoid joins that generate more outputs than inputs.
This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.
When a CROSS JOIN is required, pre-aggregate your data.
To avoid performance issues associated with joins that generate more outputs than inputs:
Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.
Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think it's possible to use a window function for this query.
I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.
How can I make this query run in reasonable time?
I think the squares got inverted, resulting in almost-full Earth polygons:
select st_area(square_geography), * from `open-transport-data.public.squares`
It prints results like 5.1E14, which is the full globe area, so any line intersects almost all the squares. See the BigQuery docs for details: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
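To gauge how many squares are affected, a quick count along these lines may help (a sketch; the 1e12 m² cutoff is an arbitrary threshold, far above the roughly 1e6 m² a genuine 1×1 km cell should have):
SELECT COUNT(*) AS inverted_squares
FROM `open-transport-data.public.squares`
WHERE st_area(square_geography) > 1e12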
You can fix them by re-parsing with ST_GeogFromText(wkt, FALSE), which chooses the smaller polygon, ignoring polygon orientation. This works reasonably fast:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
`open-transport-data.public.shapes`,
(select
square_key,
st_geogfromtext(st_astext(square_geography), FALSE) as square_geography
from `open-transport-data.public.squares`) squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
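If this query runs repeatedly, it may also be worth materializing the corrected squares once instead of re-parsing WKT on every run. A sketch (the squares_fixed table name is made up):
CREATE OR REPLACE TABLE `open-transport-data.public.squares_fixed` AS
SELECT
  square_key,
  -- re-parse so the smaller polygon is chosen, ignoring orientation
  st_geogfromtext(st_astext(square_geography), FALSE) AS square_geography
FROM `open-transport-data.public.squares`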
Below would definitely not fit the comment format, so I have to post this as an answer...
I made three adjustments to your query (the first two are sketched below):
using JOIN ... ON instead of CROSS JOIN ... WHERE
commenting out the square_portion calculation
using a destination table with the Allow Large Results option
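A minimal sketch of those first two adjustments (the destination table and Allow Large Results are set in the query options, not in the SQL itself):
SELECT
  shape_key,
  square_key
  -- square_portion (the SAFE_DIVIDE expression) commented out for this experiment
FROM `open-transport-data.public.shapes` s
JOIN `open-transport-data.public.squares` q
  ON ST_INTERSECTS(s.shape_geography, q.square_geography)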
Even though you expected just 3.3M rows in the output, in reality it is about 6.6B (6,591,549,944) rows; you can see the result of my experiment below.
Note the warning about the Billing Tier, so you had better use Reservations if available.
Obviously, un-commenting the square_portion calculation will increase slot usage, so you might need to revisit your requirements/expectations.

Find number of polygons a line intersects

I am working with SQL Server and spatial types, and I have found them very interesting. I have reached a situation that I am not sure how to work out. Let me outline it:
I have 2 points (coordinates), e.g. (3, 9) and (50, 20); joining them gives a straight line.
I have multiple polygons (50 +).
I want to be able to calculate how many polygons the above line passes through. By "pass through" I mean: when I join the 2 coordinates, how many polygons does the line intersect? I want to work this out with a SQL query.
Please let me know if not clear - it is difficult to explain!
Based on your coordinates, I'm assuming geometry (as opposed to geography), but the approach should hold regardless. If you have a table dbo.Shapes that has a Shape column of type geometry and each row contains one polygon, this should do:
declare @line geometry = geometry::STLineFromText('LINESTRING(3 9, 50 20)', 0);
select count(*)
from dbo.Shapes as s
where s.shape.STIntersects(@line) = 1;
If you want to know which polygons intersect, just change the count(*) to something more appropriate.
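For example, to list the intersecting polygons instead of counting them, something along these lines should do (ShapeID is an assumed identifier column; use whatever key your table actually has):
declare @line geometry = geometry::STLineFromText('LINESTRING(3 9, 50 20)', 0);
select s.ShapeID          -- assumed identifier column
from dbo.Shapes as s
where s.shape.STIntersects(@line) = 1;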

How to find the points that did not intersect with STIntersects using SQL Server

I performed STIntersects using two tables to find the intersections of points with a given polygon. I have converted all the tables to have geometry columns. I am having a problem writing the query for this. I am trying to look for the points that did not intersect.
These are my two tables:
PO_Database = contains the points
POLY_Database = contains the polygon of interest
This is my script:
SELECT GEOM
FROM [dbo].[PO_Database] as PO
JOIN [dbo].[POLY_Database] as p ON PO.GEOM.STIntersects(p.NEATCELL) = 1
I tried changing the value from 1 to 0, but I get repeating values of the geometry when the query is run with 0. How do I write the query to give me the names of the points that did not intersect with the polygon? Also, is there a way to check whether the intersections were computed correctly?
If you get repeating values, you probably have multiple rows in the POLY_Database table. If you want to find the points that do not intersect any of those polygons, try this query:
SELECT GEOM
FROM [dbo].[PO_Database] as PO
WHERE NOT EXISTS (
SELECT * FROM [dbo].[POLY_Database] as p
WHERE PO.GEOM.STIntersects(p.NEATCELL) = 1
)
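An equivalent formulation, if you prefer a join, is an anti-join via LEFT JOIN; a sketch:
SELECT PO.GEOM
FROM [dbo].[PO_Database] as PO
LEFT JOIN [dbo].[POLY_Database] as p
  ON PO.GEOM.STIntersects(p.NEATCELL) = 1
WHERE p.NEATCELL IS NULL  -- no polygon matched, so the point intersects nothing
As a rough sanity check, the row count of this query plus the row count of the intersecting version should equal the total number of points in PO_Database.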

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) and returns a list of neighboring zip codes (excluding the searched zip code itself).
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However, that ran for over 20 minutes with no results, so I canceled it. I did try hardcoding in the zipcode, and it immediately returned a list:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keep in mind that while I used "04" as an example here, I want to run the analysis for all 2,000 locations.
The "udf_GetLongLatDist" function uses some math to calculate the distance between two geographic coordinates and returns a list of zipcodes with a distance > 0. Nothing fancy within it.
When you use the function, you probably have to calculate every single possible distance for each row. That is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zipcode to every other zipcode (and update it only once a month or so, when we added new possible zipcodes). Once the distances are precalculated, all you have to do is run a query like
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
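The one-off build of that lookup table could look something like the following sketch. The zipprecalc and tbl_zips names and columns are assumptions, the distance comes from SQL Server's built-in geography type rather than hand-rolled math, and the 50-mile cap just keeps the table from storing pairs you would never query:
INSERT INTO zipprecalc (zip1, zip2, distance)
SELECT z1.zip, z2.zip, d.miles
FROM tbl_zips AS z1
JOIN tbl_zips AS z2
  ON z2.zip <> z1.zip
CROSS APPLY (
  -- great-circle distance in miles (STDistance returns meters)
  SELECT geography::Point(z1.latitude, z1.longitude, 4326)
           .STDistance(geography::Point(z2.latitude, z2.longitude, 4326))
         / 1609.344 AS miles
) AS d
WHERE d.miles <= 50;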
We have something similar and optimized it by only calculating the distance of other zipcodes whose latitude is within a bounded range. So if you want other zips within @miles, you use a clause like
where latitude >= @targetLat - (@miles/69.2) and latitude <= @targetLat + (@miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude, because the distance a degree of longitude represents varies between the equator and the poles.
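Put together, the band prefilter plus the exact test could look like this sketch (tbl_zips and its latitude/longitude columns are assumed names; the great-circle math uses the built-in geography type):
declare @miles float = 7;
declare @targetLat float, @target geography;

select @targetLat = latitude,
       @target = geography::Point(latitude, longitude, 4326)
from tbl_zips
where zip = '12345';

select z.zip
from tbl_zips as z
where z.latitude between @targetLat - (@miles / 69.2)
                     and @targetLat + (@miles / 69.2)  -- cheap, index-friendly band
  and geography::Point(z.latitude, z.longitude, 4326)
        .STDistance(@target) <= @miles * 1609.344;     -- exact test, in meters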
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro-like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
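For illustration, an inline (single-statement) rewrite might look like the sketch below. tbl_zips and its columns are assumed names, and the built-in geography type stands in for whatever math the original udf uses:
CREATE FUNCTION dbo.udf_GetLongLatDist_Inline (@zip varchar(10), @miles float)
RETURNS TABLE
AS
RETURN
(
    -- a single SELECT with no BEGIN/END block, so the optimizer can
    -- expand it macro-like into the calling query's plan
    SELECT z2.zip AS zipnum
    FROM tbl_zips AS z1
    JOIN tbl_zips AS z2
      ON z2.zip <> z1.zip
    WHERE z1.zip = @zip
      AND geography::Point(z1.latitude, z1.longitude, 4326)
            .STDistance(geography::Point(z2.latitude, z2.longitude, 4326))
          <= @miles * 1609.344  -- miles to meters
);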