ST_GeogFromGeoJSON fails in BigQuery while succeeding in Postgres - google-bigquery

We have GeoJSON polygons we would like to convert to geography objects in BigQuery using ST_GeogFromGeoJSON. The conversion fails in BigQuery while it succeeds in Postgres using the equivalent command ST_GeomFromGeoJSON.
I am familiar with the SAFE prefix that can be added to the BigQuery call, but we would like to use the object rather than just ignore it when the conversion fails. I tried converting the object using ST_CONVEXHULL but wasn't able to make it work.
Is there some workaround in BigQuery?
Example:
Running the following command in BigQuery
select ST_GeogFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
returns
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
While it runs successfully in Postgres
select ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
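For context, the SAFE prefix mentioned above would be used like this; it makes the conversion return NULL instead of raising an error, which is why it is not enough on its own (a sketch, not part of the original question):
select SAFE.ST_GeogFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
-- returns NULL for the invalid polygon rather than failing the whole query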

October 2020 Update for this post
No tricks are needed any more - the ST_GEOGFROMGEOJSON and ST_GEOGFROMTEXT geography functions now support a new make_valid parameter. If set to TRUE, the function attempts to correct polygon issues when importing geography data.
So the simple statement below now works perfectly ...
select ST_GeogFromGeoJSON(
'{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}'
, make_valid => true
)
and returns expected output
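The same make_valid named argument also works with ST_GEOGFROMTEXT on WKT input; for example, a sketch using the WKT equivalent of the polygon from the original question (not part of the original post):
select ST_GeogFromText(
'POLYGON((-82.022982 26.69785, -81.606813 26.710698, -81.999574 26.109253, -81.615053 26.105558, -82.022982 26.69785))'
, make_valid => true
)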

Below is for BigQuery Standard SQL
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
... Is there some workaround in BigQuery? ...
The proposed workaround is an admittedly naive and simple way of fixing this specific issue, but it can easily be extended to more generic cases. The idea here is to extract the coordinates and reorder them to eliminate the problem ...
WITH test AS (
  SELECT '{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}' AS geojson
)
SELECT ST_GEOGFROMGEOJSON('{"type":"Polygon","coordinates":' || fixed_coordinates || '}') AS geo
FROM (
  -- rebuild the coordinates array from the reordered points, repeating the first point to close the ring
  SELECT '[[[' || STRING_AGG(lat_lon, '],[') || '],[' || ANY_VALUE(ordered_coordinates[OFFSET(0)]) || ']]]' fixed_coordinates
  FROM (
    SELECT
      -- extract the "lon,lat" pairs and sort them by longitude, then latitude
      ARRAY( SELECT lon_lat
        FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat
        ORDER BY CAST(SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64), CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64)
      ) ordered_coordinates
    FROM test
  ) t, t.ordered_coordinates lat_lon
)
This produces the correct output
POLYGON((-82.022982 26.69785, -81.999574 26.109253, -81.8073135 26.1074055, -81.615053 26.105558, -81.606813 26.710698, -81.8148975 26.704274, -82.022982 26.69785))
and the respective visualization is

Below is for BigQuery Standard SQL
My previous answer is based on the oversimplified logic of re-ordering coordinates. Obviously it will not work in more complex cases like the one below:
{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}
Is there some more advanced sorting logic that can be applied?
More complex logic can be used to address this:
#standardSQL
WITH test AS (
  SELECT '{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}' geojson
), coordinates AS (
  -- extract each "lon,lat" pair from the GeoJSON coordinates array
  SELECT CAST(SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64) lon, CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64) lat
  FROM test, UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat
), stats AS (
  SELECT ST_CENTROID(ST_UNION_AGG(ST_GEOGPOINT(lon, lat))) centroid FROM coordinates
)
SELECT ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_AGG(point ORDER BY sequence))) AS polygon
FROM (
  SELECT point,
    -- turn the quadrant-relative angle into a full sweep around the centroid (3.14 and 6.28 approximate pi and 2*pi)
    CASE
      WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) > ST_Y(centroid) THEN 3.14 - angle
      WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 3.14 + angle
      WHEN ST_X(point) < ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 6.28 - angle
      ELSE angle
    END sequence
  FROM (
    -- angle at the centroid between each point and the horizontal (anchor = same longitude, centroid's latitude)
    SELECT point, centroid,
      ACOS(ST_DISTANCE(centroid, anchor) / ST_DISTANCE(centroid, point)) angle
    FROM (
      SELECT centroid,
        ST_GEOGPOINT(lon, lat) point,
        ST_GEOGPOINT(lon, ST_Y(centroid)) anchor
      FROM coordinates, stats
    )
  )
)
This approach produces the correct output
POLYGON((-0.49075 51.46989, -0.48664 51.46987, -0.48664 51.47341, -0.48923 51.47336, -0.48921 51.4737, -0.49072 51.47462, -0.49114 51.47446, -0.49044 51.4737, -0.4907 51.4737, -0.49075 51.46989))
which is visualized as below

Related

ST_MAKEPOLYGON inverse function

Is there a function to invert ST_MAKEPOLYGON and get a linestring back from a polygon?
The best option I have found so far is to modify the geometry at the WKB level.
with data AS (
  SELECT ST_MAKEPOLYGON(ST_MAKELINE([
    ST_GEOGPOINT(7.48,6.74),
    ST_GEOGPOINT(7.50,6.73),
    ST_GEOGPOINT(7.47,6.76),
    ST_GEOGPOINT(7.48,6.74)
  ])) AS my_polygon
)
SELECT
  -- swap the polygon's WKB header for a little-endian linestring header (b'\x01\x02'),
  -- so the ring's point count and coordinates are re-read as a linestring
  ST_GEOGFROMWKB(CONCAT(b'\x01\x02', SUBSTR(ST_ASBINARY(my_polygon), 7)))
FROM data
Try below
SELECT ST_EXTERIORRING(my_polygon)
If applied to the sample data in your question, the output is the exterior ring as a linestring.
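For completeness, a self-contained sketch combining the sample data above with this call (the set-up is assumed, not part of the original answer):
with data AS (
  SELECT ST_MAKEPOLYGON(ST_MAKELINE([
    ST_GEOGPOINT(7.48,6.74),
    ST_GEOGPOINT(7.50,6.73),
    ST_GEOGPOINT(7.47,6.76),
    ST_GEOGPOINT(7.48,6.74)
  ])) AS my_polygon
)
SELECT ST_EXTERIORRING(my_polygon) AS my_linestring
FROM data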

Zip parallel arrays in hive

I have parallel arrays in a hive table, like this:
with tbl as ( select array(1,2,3) as x, array('a','b','c') as y)
select x,y from tbl;
x y
[1,2,3] ["a","b","c"]
1 row selected (0.108 seconds)
How can I zip them together (like the python zip function), so that I get back a list of structs, like
[(1, "a"), (2, "b"), (3,"c")]
You can use posexplode, which gives the position of each element in the array; the positions can then be used for filtering.
select x,y,collect_list(struct(val1,val2))
from tbl
lateral view posexplode(x) t1 as p1,val1
lateral view posexplode(y) t2 as p2,val2
where p1=p2
group by x,y
Here was my attempt at avoiding a double-explode:
with tbl as (select array(1,2,3,4,5) as x, array('a','b','c','d','e') as y)
select collect_list(struct(xi, y[i-1]))
from tbl lateral view posexplode(x) tbl2 as xi, i;
However, I ran into a strange error:
Error: Error while compiling statement: FAILED: IllegalArgumentException Size requested for unknown type: java.util.Collection (state=42000,code=40000)
I was able to work around it using
set hive.execution.engine=mr;
which is not as fast or as well optimized as using Spark or Tez as the back end.

How to calculate area in SQL using geographic coordinates?

Does anybody know what the problem with my query is? I am trying to calculate an area using geographic coordinates, but the result seems too small to be true: 0.00118 sqm. Can anybody help?
SELECT ST_Area(the_geom) As sqm
FROM (SELECT
ST_GeomFromText('POLYGON
(
(14.604514925547997 121.0968017578125,
14.595212295624522 121.08512878417969,
14.567302046916149 121.124267578125,
14.596541266841905 121.14761352539062,
14.604514925547997 121.0968017578125)
)',4326) ) As foo(the_geom)
How accurate does the calculation need to be?
A solution is to cast GEOMETRY to GEOGRAPHY, which is acceptably accurate for most use cases:
SELECT ST_Area(the_geom::GEOGRAPHY ) As sqm
FROM (SELECT
ST_GeomFromText('POLYGON
(
(14.604514925547997 121.0968017578125,
14.595212295624522 121.08512878417969,
14.567302046916149 121.124267578125,
14.596541266841905 121.14761352539062,
14.604514925547997 121.0968017578125)
)',4326) ) As foo(the_geom)
With the geography type, areas are returned in square meters rather than in square degrees.
Depending on your scenario, you could also use the geography constructor ST_GeographyFromText directly, which accepts a WKT string as an argument, very similar to ST_GeomFromText:
ST_GeographyFromText('POLYGON((14.604514925547997 121.0968017578125,
14.595212295624522 121.08512878417969,
14.567302046916149 121.124267578125,
14.596541266841905 121.14761352539062,
14.604514925547997 121.0968017578125))'
)

point is within circle in postgresql

I want to find out whether a point is within a circle or not using PostgreSQL.
For a point within a polygon, I have used the following query; I need an equivalent query for a circle too.
SELECT a
FROM a_table
WHERE
ST_within(a::geometry,ST_GeomFromText('Polygon((50 -80.98 , 20.99 -90.99 , 90.98 -99.99 , 50 -80.98))'));
For a circle, I tried the two queries below:
SELECT a
FROM a_table
WHERE
ST_within(a::geometry,ST_GeomFromText('POINT(10 20)',10));
and
SELECT a
FROM a_table
WHERE
ST_within(a::geometry,ST_GeomFromText('circle((10 20),10)'));
but both of these give errors like this:
ERROR: parse error - invalid geometry
SQL state: XX000
Hint: "714" <-- parse error at position 4 within geometry
and
ERROR: parse error - invalid geometry
SQL state: XX000
Hint: "ci" <-- parse error at position 2 within geometry
select a
from a_table
where a <# circle '((10, 20),10)';
Geometric Functions
select point '(1,1)' <# circle '((0,0), 1)';
?column?
----------
f
select point '(1,1)' <# circle '((0,0), 1.5)';
?column?
----------
t
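If a is a PostGIS geometry or geography (as in the ST_Within query above) rather than a native point type, one possible equivalent is ST_DWithin, treating the circle as a center point plus a radius; a sketch (with geography the radius is in meters):
SELECT a
FROM a_table
WHERE ST_DWithin(a::geography, ST_SetSRID(ST_MakePoint(10, 20), 4326)::geography, 10);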

bigquery url decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?
Below is built on top of the answer by #sigpwned, but slightly refactored and wrapped in a SQL UDF (which does not have the limitations that a JS UDF has, so it is safe to use).
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
It can be tested with the example from the question, as below:
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with result
Row url column_name
1 http://www.example.com/hello?v=12345 http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
Update: a further, more optimized SQL UDF
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
It's a good feature request, but currently there is no built-in BigQuery function that provides URL decoding.
One more workaround is to use a user-defined function.
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  return decodeURI(enc);
} catch (e) {
  return null;
}
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
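One caveat worth noting: decodeURI leaves percent-escapes for reserved characters (such as %3A and %2F) in place, so for the example URL in the original question decodeURIComponent is likely the better choice. A minimal variant of the same UDF (only the decode call changes):
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  // decodeURIComponent also decodes reserved characters like %3A and %2F
  return decodeURIComponent(enc);
} catch (e) {
  return null;
}
""";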
I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b''))
FROM (
  SELECT
    id,
    ARRAY_AGG(CASE
      WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
      ELSE CAST(y AS bytes)
    END ORDER BY i) AS ps
  FROM (
    SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element
    FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x
  ) AS x
  CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i
  GROUP BY id
);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
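As a rough illustration of the "wrap this in a view" suggestion, here is a sketch (project, dataset, table, and column names are placeholders; a view cannot reference a temporary UDF, so the decode expression is inlined as a correlated subquery):
CREATE OR REPLACE VIEW `project.dataset.decoded_urls` AS
SELECT
  column_name,
  (
    -- same decoding idea as above, applied per row to the extracted url parameter
    SELECT SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ARRAY_AGG(
      IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i), b''))
    FROM UNNEST(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(column_name, r"url=([^&]+)"), r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
  ) AS url
FROM `project.dataset.table`;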
EDIT: #MikhailBerylyant's post above has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!