ST_EXTENT or ST_ENVELOPE in BigQuery? - sql

I want the equivalent of ST_EXTENT or ST_ENVELOPE in BigQuery, but I can't find a way to make this query run:
SELECT REGEXP_EXTRACT(name, ', (..)') state
, ST_EXTENT(ARRAY_AGG(urban_area_geom)) corners
, COUNT(*) cities
FROM `bigquery-public-data.geo_us_boundaries.urban_areas`
GROUP BY state
The desired result of this query is a list of bounding boxes to cover all urban areas around the US, grouped by state.

I created a feature request to get a native implementation of ST_EXTENT(). Please add your votes and evidence of why you need this function so the team can prioritize and keep you informed of any developments:
https://issuetracker.google.com/issues/148915449
In the meantime, the best solution I can offer:
fhoffa.x.st_bounding_box(): a naive bounding box UDF.
Use it like this:
SELECT REGEXP_EXTRACT(name, ', (..)') state
, fhoffa.x.st_bounding_box(ARRAY_AGG(urban_area_geom)).polygon
, COUNT(*) urban_areas
FROM `bigquery-public-data.geo_us_boundaries.urban_areas`
GROUP BY state
The code behind it:
CREATE OR REPLACE FUNCTION fhoffa.x.st_bounding_box(arr ANY TYPE) AS ((
SELECT AS STRUCT *
, ST_MakePolygon(ST_GeogFromText(FORMAT('LINESTRING(%f %f,%f %f,%f %f,%f %f)',minlon,minlat,maxlon,minlat,maxlon,maxlat,minlon, maxlat))) polygon
FROM (
SELECT MIN(m.min_x) minlon, MAX(m.max_x) maxlon , MIN(m.min_y) minlat, MAX(m.max_y) maxlat
FROM (
SELECT
(SELECT AS STRUCT MIN(x) min_x, MAX(x) max_x, MIN(y) min_y, MAX(y) max_y FROM UNNEST(coords)) m
FROM (
SELECT ARRAY(
SELECT STRUCT(
CAST(SPLIT(c, ', ')[OFFSET(0)] AS FLOAT64) AS x,
CAST(SPLIT(c, ', ')[OFFSET(1)] AS FLOAT64) AS y
)
FROM UNNEST(REGEXP_EXTRACT_ALL(ST_ASGEOJSON(geog), r'\[([^[\]]*)\]')) c
) coords
FROM UNNEST(arr) geog
)
)
)
))
Notes:
Additional effort is needed to make it work with geometries that cross the -180 line.
Due to geodesic edges, the function result is not a true bounding box, i.e. ST_Covers(box, geom) might return FALSE.
In the picture above I'm not expecting each state to be fully covered, just its urban areas. So the bounding box is correct if there's no urban area in those uncovered corners.
The following polygon construction will give you exact "rectangles", but they become much more complex structures to work with.
ST_GEOGFROMGEOJSON(
FORMAT('{"type": "Polygon", "coordinates": [[[%f,%f],[%f,%f],[%f,%f],[%f,%f],[%f, %f]]]}'
, minlon,minlat,maxlon,minlat,maxlon,maxlat,minlon,maxlat,minlon,minlat)
)
I'll be looking forward to your comments and suggestions.
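The UDF above boils down to "collect every coordinate, take the min and max of each axis". A minimal Python sketch of that same idea over GeoJSON-style geometries, for checking results outside BigQuery. The function names are made up for illustration, and like the UDF it ignores geometries that cross the -180 line and treats coordinates as planar:

```python
def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        # A single position: [lon, lat]
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bounding_box(geometries):
    """Return (min_lon, min_lat, max_lon, max_lat) over all geometries."""
    lons, lats = [], []
    for geom in geometries:
        for lon, lat in iter_positions(geom["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    if not lons:
        raise ValueError("no coordinates found")
    return min(lons), min(lats), max(lons), max(lats)


def box_ring(min_lon, min_lat, max_lon, max_lat):
    """Closed ring for the box, in the same corner order as the GeoJSON variant."""
    return [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]


geoms = [
    {"type": "Point", "coordinates": [1, 2]},
    {"type": "Polygon", "coordinates": [[[0, 0], [4, 0], [4, 3], [0, 0]]]},
]
box = bounding_box(geoms)
ring = box_ring(*box)
```

The ring repeats its first corner at the end, which is what the GeoJSON polygon construction above does as well.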

Since September 27, 2021, BigQuery supports ST_BOUNDINGBOX and ST_EXTENT.

Related

SQL select features within a polygon

I have the following code, which works fine:
select vissoort, count(1), ST_Buffer(ST_GeomFromText('POINT(5.341248 51.615590)',4326):: geography, 2500)
from visvangsten
where st_intersects(visvangsten.locatie,
ST_Buffer(ST_GeomFromText('POINT(5.3412480 51.615590)',4326):: geography, 2500))
group by vissoort
order by 2 desc
Now I want the same query, but selecting the features within a polygon instead of the circle/buffer.
I tried things like this but nothing worked:
select vissoort, count(1), ST_asText( ST_Polygon('LINESTRING(5.303 51.629, 5.387 51.626, 5.393 51.588, 5.281 51.592)'::geometry, 4326) )
from visvangsten
where st_contains(ST_asText( ST_Polygon('LINESTRING(5.303 51.629, 5.387 51.626, 5.393 51.588, 5.281 51.592)'::geometry, 4326) ), visvangsten.locatie);
group by vissoort
order by 2 desc limit 1
The database table looks like this:
id ([PK] bigint)  datum (date)  vissoort (character varying)  locatie (geometry)
----------------  ------------  ----------------------------  ------------------
15729             2007-06-23    Blankvoorn                    0101000...etc.
etc.              etc.          etc.                          etc.
Does someone know the answer?
Keep in mind that to transform a LineString into a Polygon you need to have a closed ring - in other words, the first and last coordinate pairs must be identical. That being said, you can convert a LineString into a Polygon using the function ST_MakePolygon. The following example is probably what you're looking for:
WITH j (geom) AS (
VALUES
(ST_MakePolygon('SRID=4326;LINESTRING(-4.59 54.19,-4.55 54.23,-4.52 54.19,-4.59 54.19)'::geometry)),
(ST_Buffer('SRID=4326;LINESTRING(-4.59 54.19,-4.55 54.23,-4.52 54.19,-4.59 54.19)'::geometry,0.1))
)
SELECT ST_Contains(geom,'SRID=4326;POINT(-4.5541 54.2043)'::geometry) FROM j;
st_contains
-------------
t
t
(2 rows)
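The closed-ring requirement is easy to handle before the coordinates ever reach the database. A small Python sketch, assuming the four corner points from the question; the helper names are hypothetical and the output is plain WKT that could be passed to ST_GeomFromText:

```python
def close_ring(points):
    """Return a copy of points whose last vertex equals the first."""
    ring = list(points)
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


def polygon_wkt(ring):
    """Format a closed ring as a WKT polygon with a single outer ring."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("a polygon ring needs at least 4 points and must be closed")
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({body}))"


# The four corners from the question; the first point is not repeated at the end.
corners = [(5.303, 51.629), (5.387, 51.626), (5.393, 51.588), (5.281, 51.592)]
ring = close_ring(corners)
wkt = polygon_wkt(ring)
```

Calling close_ring on an already closed ring leaves it unchanged, so it is safe to apply unconditionally.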

BigQuery Geo functions - How to compute the shortest distance to the polygon perimeter

I would like to compute the shortest distance from the yellow point in the image below to the polygon boundary using built in BigQuery Geo functions.
I could not find anything myself.
Here is the query that builds the example.
WITH objects AS(
SELECT 'POLYGON((-84.3043408314983 33.78004925, -84.3058929975152 33.7780287948446, -84.3026549053438 33.77962155, -84.3018234603607 33.7798783, -84.3041030408163 33.7785105714286, -84.2983655895464 33.7814847396304, -84.2869801170094 33.7772419185107, -84.2842584693878 33.7827876938775, -84.2863881748169 33.7848439284835, -84.2963746470588 33.7897689411765, -84.2979002513655 33.790508814658, -84.2978883265306 33.7851126734694, -84.300035153059 33.78268675, -84.3043408314983 33.78004925))' wkt_string
UNION ALL
SELECT 'POINT(-84.2998716702097 33.7796025711153)' wkt_string
)
SELECT ST_GEOGFROMTEXT(wkt_string) geo
FROM objects
This is the function I was looking for:
ST_CLOSESTPOINT(geography_1, geography_2[, use_spheroid]).
Use ST_Distance function to compute shortest distance between shapes:
WITH objects AS(
SELECT
'POLYGON((-84.3043408314983 33.78004925, -84.3058929975152 33.7780287948446, -84.3026549053438 33.77962155, -84.3018234603607 33.7798783, -84.3041030408163 33.7785105714286, -84.2983655895464 33.7814847396304, -84.2869801170094 33.7772419185107, -84.2842584693878 33.7827876938775, -84.2863881748169 33.7848439284835, -84.2963746470588 33.7897689411765, -84.2979002513655 33.790508814658, -84.2978883265306 33.7851126734694, -84.300035153059 33.78268675, -84.3043408314983 33.78004925))'
AS poly,
'POINT(-84.2998716702097 33.7796025711153)' AS point
)
SELECT ST_Distance(ST_GEOGFROMTEXT(poly), ST_GEOGFROMTEXT(point))
FROM objects
One caveat - it computes distance between point and polygon, so if the point is inside the polygon, the distance is 0. If you really want distance to polygon boundary, add ST_Boundary to the mix:
WITH objects AS(
...
)
SELECT ST_Distance(ST_Boundary(ST_GEOGFROMTEXT(poly)), ST_GEOGFROMTEXT(point))
FROM objects
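To see why the boundary matters, here is a rough Python sketch of "distance from a point to a polygon's perimeter". It uses flat (planar) geometry rather than the geodesic distance BigQuery computes, so it only illustrates the idea and is not a replacement for ST_Distance; the function names are made up for this example:

```python
import math


def point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(p, ring):
    """Smallest distance from p to any edge of a closed ring."""
    return min(point_segment_distance(p, a, b) for a, b in zip(ring, ring[1:]))


square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
inside = distance_to_boundary((1, 2), square)
outside = distance_to_boundary((5, 2), square)
```

The point (1, 2) lies inside the square, so its distance to the filled polygon would be 0, while its distance to the boundary is positive. That is the same difference ST_Boundary makes in the query above.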

Oracle Spatial Geometry covered by the most

I have a table which contains a number of geometries. I am attempting to extract the one which is most covered by another geometry.
This is best explained with pictures and code.
Currently I am doing this simple spatial query to get any rows that spatially interact with a passed in WKT Geometry
SELECT ID, NAME FROM MY_TABLE WHERE
sdo_anyinteract(geom,
sdo_geometry('POLYGON((400969 95600,402385 95957,402446 95579,400905 95353,400969 95600))',27700)) = 'TRUE';
Works great, returns a bunch of rows that interact in any way with my passed in geometry.
What I preferably want though is to find which one is covered most by my passed in geometry. Consider this image.
The coloured blocks represent 'MY_TABLE'. The black polygon over the top represents my passed in geometry I am searching with. The result I want returned from this is Polygon 2, as this is the one that is most covered by my polygon. Is this possible? Is there something I can use to pull the cover percentage in and order by that or a way of doing it that simply returns just that one result?
--EDIT--
Just to supplement the accepted answer (which you should go down and give an upvote as it is the entire basis for this) this is what I ended up with.
SELECT name, MI_PRINX,
SDO_GEOM.SDO_AREA(
SDO_GEOM.SDO_INTERSECTION(
GEOM,
sdo_geometry('POLYGON((400969.48717156524 95600.59583240788,402385.9445972018 95957.22742049221,402446.64806962677 95579.91508788493,400905.95874489535 95353.03765349534,400969.48717156524 95600.59583240788))',27700)
,0.005
)
,0.005) AS intersect_area
FROM LIFE_HEATHLAND WHERE sdo_anyinteract(geom, sdo_geometry('POLYGON((400969.48717156524 95600.59583240788,402385.9445972018 95957.22742049221,402446.64806962677 95579.91508788493,400905.95874489535 95353.03765349534,400969.48717156524 95600.59583240788))',27700)) = 'TRUE'
ORDER BY INTERSECT_AREA DESC;
This returns me all the results that intersect my query polygon with a new column called INTERSECT_AREA, which provides the area. I can then sort this and pick up the highest number.
Just compute the intersection between each of the returned geometries and your query window (using SDO_GEOM.SDO_INTERSECTION()), compute the area of each such intersection (using SDO_GEOM.SDO_AREA()) and return the row with the largest area (order the results in descending order of the computed area and only retain the first row).
For example, the following computes how much space Yellowstone National Park occupies in each state it covers. The results are ordered by area (descending).
SELECT s.state,
sdo_geom.sdo_area (
sdo_geom.sdo_intersection (
s.geom, p.geom, 0.5),
0.5, 'unit=sq_km') area
FROM us_states s, us_parks p
WHERE SDO_ANYINTERACT (s.geom, p.geom) = 'TRUE'
AND p.name = 'Yellowstone NP'
ORDER by area desc;
Which returns:
STATE AREA
------------------------------ ----------
Wyoming 8100.64988
Montana 640.277886
Idaho 154.657145
3 rows selected.
To only retain the row with the largest intersection do:
SELECT * FROM (
SELECT s.state,
sdo_geom.sdo_area (
sdo_geom.sdo_intersection (
s.geom, p.geom, 0.5),
0.5, 'unit=sq_km') area
FROM us_states s, us_parks p
WHERE SDO_ANYINTERACT (s.geom, p.geom) = 'TRUE'
AND p.name = 'Yellowstone NP'
ORDER by area desc
)
WHERE rownum = 1;
giving:
STATE AREA
------------------------------ ----------
Wyoming 8100.64988
1 row selected.
The following variant also returns the percentage of the park's surface in each intersecting state:
WITH p AS (
SELECT s.state,
sdo_geom.sdo_area (
sdo_geom.sdo_intersection (
s.geom, p.geom, 0.5),
0.5, 'unit=sq_km') area
FROM us_states s, us_parks p
WHERE SDO_ANYINTERACT (s.geom, p.geom) = 'TRUE'
AND p.name = 'Yellowstone NP'
)
SELECT state, area,
RATIO_TO_REPORT(area) OVER () * 100 AS pct
FROM p
ORDER BY pct DESC;
If you want to return the geometry of the intersections, just include it in your result set.
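The "keep the largest" and percentage steps can be tried without an Oracle instance. The sketch below uses Python's built-in sqlite3 and the three areas from the output above; the table name overlaps is made up, LIMIT 1 stands in for the rownum filter, and dividing by the total stands in for RATIO_TO_REPORT(area) OVER ():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overlaps (state TEXT, area REAL)")
conn.executemany(
    "INSERT INTO overlaps VALUES (?, ?)",
    [("Wyoming", 8100.64988), ("Montana", 640.277886), ("Idaho", 154.657145)],
)

# Largest intersection: sort by area, keep only the first row.
largest = conn.execute(
    "SELECT state, area FROM overlaps ORDER BY area DESC LIMIT 1"
).fetchone()

# Share of the total: each area divided by the sum of all areas.
shares = conn.execute(
    """
    SELECT state, area * 100.0 / (SELECT SUM(area) FROM overlaps) AS pct
    FROM overlaps
    ORDER BY pct DESC
    """
).fetchall()
```

In Oracle the areas would come from SDO_GEOM.SDO_AREA over SDO_GEOM.SDO_INTERSECTION as shown above; here they are simply stored values.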

Bigquery: "Not enough memory"

BigQuery started giving me the error "not enough memory" when I ran this query this morning. The two tables involved contain no more than 5 GB of data. Plus, I'm using table decorators; 1407249067530 corresponds to around 10:30 am today (20140805). I wonder what the problem is.
Job ID: red-road-574:job_x8flLfo4QwA1gQ_FCrNWbKY-bZM
select * from
(
select t_connection.row_id AS debug_row_id,
t_connection.hardware_id AS hardware_id,
t_connection.debug_data AS debug_data,
t_connection.connection_status AS connection_status,
t_connection.date_time AS debug_date_time,
t_gps.hardware_id AS hardware_id2,
t_gps.latitude AS latitude,
t_gps.longitude AS longitude,
t_gps.date_time AS gps_date_time,
t_gps.zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id,
gg.hardware_id as hardware_id,
gg.latitude as latitude,
gg.longitude as longitude,
gg.date_time as date_time,
gg.zip_code as zip_code
from [my data set.table1_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
dd.date_time as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [my data set.table2_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id
)
) WHERE row_num=1
You're hitting an odd corner case. When you use allowLargeResults with results that are nested or repeated and you don't use flattenResults=false, the query goes into a special mode. (when you use timestamps, you're really using a nested data structure, which was a design decision that spawned 1000 bugs and is hopefully changing soon). This special query mode has some limitations, which are what you're hitting.
In general, we want this to be seamless, which is why it isn't documented. However, since you're running into a problem here, I'll explain a little about how to avoid it.
You have a couple of options to get around this:
If you're using nested or repeated results (it looks like you're not, which is good):
  rename your results without dots in the name.
  set the flattenResults field on the query to 'false'. This means that nested and repeated fields will be actually nested and repeated in the results.
If you're using timestamps in the results:
  Convert your timestamps to strings or numeric values. Sorry.
If you don't really need large results:
  unset the allowLargeResults flag.
I realize that all of these options are deeply unsatisfying. This is an area we're actively working to improve.
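For reference, the two flags live in the query configuration of the job request. A minimal sketch of such a request body as a Python dict, assuming the REST jobs.insert field names (allowLargeResults, flattenResults, destinationTable, useLegacySql); the project, dataset and table names are placeholders, and a destination table is included because allowLargeResults needs one:

```python
def build_query_job(sql, project_id, dataset_id, table_id):
    """Build a jobs.insert request body with large, unflattened results."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": True,       # the queries in this thread are legacy SQL
                "allowLargeResults": True,  # requires destinationTable
                "flattenResults": False,    # keep nested/repeated fields as they are
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


# Placeholder identifiers, not taken from the question.
body = build_query_job("SELECT 1", "my-project", "my_dataset", "my_results")
```

The dict only describes the job; sending it still requires an authenticated client or HTTP call.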
Now with allowLargeResults=true and flattenResults=false, and with the timestamps converted to numeric values in the first step:
select * from
(
select row_id AS debug_row_id,
hardware_id AS hardware_id,
debug_data AS debug_data,
connection_status AS connection_status,
date_time AS debug_date_time,
hardware_id2 AS hardware_id2,
latitude AS latitude,
longitude AS longitude,
date_time2 AS gps_date_time,
zip_code AS zip_code,
ROW_NUMBER() OVER (PARTITION BY debug_row_id ORDER BY time_diff) row_num,
from(
select *,
ABS(t_gps.date_time2-t_connection.date_time) AS time_diff
from ( select CONCAT(String(gg.hardware_id),String(gg.date_time)) as row_id_gps,
gg.hardware_id as hardware_id2,
gg.latitude as latitude,
gg.longitude as longitude,
TIMESTAMP_TO_MSEC(gg.date_time) as date_time2,
gg.zip_code as zip_code
from [test.gps32_20140805#1407249067530-] gg
) AS t_gps
INNER JOIN EACH
( select CONCAT(CONCAT(String(dd.debug_reason),String(dd.hardware_id)),String(dd.date_time)) as row_id,
dd.hardware_id as hardware_id,
TIMESTAMP_TO_MSEC(dd.date_time) as date_time,
dd.debug_data as debug_data,
case
when dd.debug_reason = 1 then 'Successful_Connection'
when dd.debug_reason = 2 then 'Dropped_Connection'
when dd.debug_reason = 3 then 'Failed_Connection'
end AS connection_status
from [test.debug_data_developer_20140805#1407249067530-] dd
where dd.debug_reason in (50013, 50017, 50018)
) as t_connection
ON t_connection.hardware_id = t_gps.hardware_id2
)
) WHERE row_num=1
it gives me
Query Failed
Error: Resources exceeded during query execution.
Job ID: red-road-574:job_ikWQvffmPEUP6DtTvJaYpXHFJ2M
This is the functioning SQL with allowLargeResults=true, flattenResults=true. I don't know what I did to make this work; maybe it was only adding a HAVING clause? But in the JOIN, I changed one side to be a whole table instead of the one with a decorator as above, so the data involved actually increased. I'm not sure whether it will keep succeeding or whether it's just temporary luck.

Help me to optimize hard logic on SQL

First, I'll try to explain what I have and what I need.
I have a list of events on a timeline.
An event is a discrete 1/0 signal that occurs at some time and lasts for some duration.
My event list looks like this:
RecTime - event start time
Col - event name
ChangedDate - event end time
InitialValue - event message
Value - event state 1/0
These events can trigger a complex event, for example when A1 is 1 and A2 is 0, or A5 is 1, at the same time.
My complex event (incident) structure is:
[ID] - just ID
[Name] - just Name
[SQL] - the list of event names combined with logic, like ***(A1 AND NOT A2) OR A5***
[Message] - event message
I must not miss any possible change, so whenever an event happens I look for the complex events it could change. But to know whether a complex event actually changed, I need to know about the other events that complex event depends on, so the next step is to get all the dependent events and their states (1/0). Here is my attempt:
With DependencedIncidents AS -- Get all dependenced Incidents from this Event
(
SELECT INC.[RecTime],INC.[SQL] AS [str] FROM
(
SELECT A.[RecTime] As [RecTime],X.[SQL] As [SQL] FROM [EventView] AS A
CROSS JOIN [Incident] AS X
WHERE
patindex('%' + A.[Col] + '%', X.[SQL]) > 0
) AS INC
)
, DependencedEvents AS -- Split SQL string to get dependeced Events for each dependeced Incident
(
select distinct word AS [Event] , [RecTime]
from
(
select v.number, t.[RecTime] As [RecTime],
substring(t.str+')',
v.number+1,
patindex('%[() ]%',
substring(t.str+')',
v.number+1,
1000))-1) word
from DependencedIncidents AS t
inner join master..spt_values v on v.type='P'
and v.number < len(t.str)
and (v.number=0 or substring(t.str,v.number,1) like '[() ]')
) x
where word not in ('','OR','AND')
)
, EventStates AS -- Dependeced events with their states 1/0
(
Select D.[RecTime], D.[Event], X.[Value]
From [DependencedEvents] AS D
LEFT JOIN [EventView] AS X
ON X.Col = D.[Event]
AND D.[Rectime] >= X.[Rectime]
AND D.[Rectime] <= X.[ChangedDate]
)
select * from EventStates
order by [RecTime]
It runs very slowly; I need serious optimization, if that is possible.
The slowest part (95% of the time) is:
LEFT JOIN [EventView] AS X
ON X.Col = D.[Event]
AND D.[Rectime] BETWEEN X.[Rectime] AND X.[ChangedDate]
Maybe I'm doing something wrong here...
I just want to look up the Value of D.[Event] in EventView at the time D.[Rectime].
The EventView definition, added at the request of commenters:
ALTER VIEW [dbo].[EventView] AS
(SELECT RecTime, ChangedDate, ( 'Alarm' + CAST(ID as nvarchar(MAX)) ) AS Col, InitialValue, Value FROM [dbo].[Changes]
WHERE InitialValue <> '')
UNION ALL
SELECT RecTime, ChangedDate, Col, InitialValue, Value FROM [dbo].[XDeltaIntervals]
UNION ALL
SELECT RecTime, ChangedDate, Col, InitialValue, Value FROM [dbo].[ActvXDeltaIntervals]
I think this should be about the same:
SELECT
ev.Rectime,
ev.Col AS [Event],
ev2.Value
FROM EventView AS ev
INNER JOIN Incident i
ON PATINDEX('%' + ev.Col + '%', i.SQL) > 0
LEFT JOIN EventView ev2
ON ev.Col = ev2.Col AND ev.Rectime BETWEEN ev2.Rectime AND ev2.ChangedDate
The thing is, you are finding your complex events using event names, then you are extracting those very names from the complex events found, and finally you are using the extracted names in the last CTE to compare against themselves. So, it seemed to me that the extracting part was completely unnecessary.
And without it the resulting query turned out to be quite simple (in appearance at least).
Well one of the most basic concepts of relational data storage is to
store the data in a normalized way and
use the relational database to store the data, but do not parse/process it etc. Use the application layer to do that.
That should be the first thing you do and then you may move to the next level of optimizing the queries, joins, making indices etc.
I think the slowest part originates from the EventView definition:
SELECT ... ( 'Alarm' + CAST(ID as nvarchar(MAX)) ) AS Col, ...
Joining on such a calculated field causes a nasty performance hit.
Can't you:
record (Col=)Alarm+ID directly to Changes table or
update Alarm+ID by trigger or
use indexed view for calculating Alarm+ID or
use temporary table for storing Alarm+ID or at least
not use nvarchar(MAX), but something like nvarchar(10) (if this changes query plan)
?
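The first suggestions share one idea: store the Alarm+ID key as a real column so it can be indexed, then do the "value at time T" lookup against that index. A small sketch using Python's sqlite3 as a stand-in for SQL Server; the table layout, integer timestamps and index name are assumptions for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE changes (
        id INTEGER, rec_time INTEGER, changed_date INTEGER,
        value INTEGER, col TEXT
    );
    INSERT INTO changes (id, rec_time, changed_date, value) VALUES
        (1, 0, 9, 1), (1, 10, 19, 0), (2, 0, 19, 1);
    -- Persist the key once instead of computing it in the view on every join.
    UPDATE changes SET col = 'Alarm' || id;
    -- Index on the stored key plus the interval bounds.
    CREATE INDEX ix_changes_col_time ON changes (col, rec_time, changed_date);
    """
)


def value_at(col, when):
    """State of event `col` at time `when`, or None if no interval covers it."""
    row = conn.execute(
        "SELECT value FROM changes "
        "WHERE col = ? AND ? BETWEEN rec_time AND changed_date",
        (col, when),
    ).fetchone()
    return None if row is None else row[0]
```

Whether this helps in SQL Server depends on the actual query plan, so compare the plans before and after the change.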