Zip parallel arrays in Hive

I have parallel arrays in a hive table, like this:
with tbl as ( select array(1,2,3) as x, array('a','b','c') as y)
select x,y from tbl;
x y
[1,2,3] ["a","b","c"]
1 row selected (0.108 seconds)
How can I zip them together (like the Python zip function) so that I get back a list of structs, like
[(1, "a"), (2, "b"), (3, "c")]

You can use posexplode, which also emits each element's position in the array; the positions can then be used to match up the elements of the two arrays.
select x,y,collect_list(struct(val1,val2))
from tbl
lateral view posexplode(x) t1 as p1,val1
lateral view posexplode(y) t2 as p2,val2
where p1=p2
group by x,y
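
For reference, here is the same query made self-contained with the sample data from the question (a sketch; the aliases t1, t2, p1, p2, val1, val2 are arbitrary):
with tbl as (select array(1,2,3) as x, array('a','b','c') as y)
select x, y, collect_list(struct(val1, val2)) as zipped
from tbl
lateral view posexplode(x) t1 as p1, val1
lateral view posexplode(y) t2 as p2, val2
where p1 = p2
group by x, y;
This should return one row with zipped = [{"col1":1,"col2":"a"},{"col1":2,"col2":"b"},{"col1":3,"col2":"c"}], since struct() names its fields col1, col2, ... by default.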

Here was my attempt at avoiding a double-explode:
with tbl as (select array(1,2,3,4,5) as x, array('a','b','c','d','e') as y)
select collect_list(struct(xi, y[i-1]))
from tbl lateral view posexplode(x) tbl2 as xi, i;
However, I ran into a strange error:
Error: Error while compiling statement: FAILED: IllegalArgumentException Size requested for unknown type: java.util.Collection (state=42000,code=40000)
I was able to work around it using
set hive.execution.engine=mr;
which is not as fast or as well optimized as using Spark or Tez as the execution engine.
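
For what it's worth, posexplode emits the position first and the value second, so a single-explode version with the output columns bound in that order and y indexed by position would look like the sketch below (whether it still triggers the compiler bug on Tez or Spark may depend on the Hive version):
with tbl as (select array(1,2,3,4,5) as x, array('a','b','c','d','e') as y)
select collect_list(struct(xi, y[i]))
from tbl lateral view posexplode(x) tbl2 as i, xi;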

Related

Spark SQL: Can't UNNEST lambda variables

I am encountering strange behaviour: I can't access a lambda variable with UNNEST in my Spark code:
FILTER(boxes.clicks, x -> EXISTS (SELECT 1 FROM UNNEST(x) AS clicks WHERE clicks.href IS NOT NULL))
This will complain that x does not exist: cannot resolve 'x' given input columns: []
However, without UNNEST, x can be accessed without any problems. For example, this will work just fine:
FILTER(boxes.clicks, x -> size(x) > 1)
Is it possible to use lambda variables in combination with UNNEST?
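
A hedged workaround sketch, assuming Spark 2.4+ where the exists() higher-order function is available (and a version that accepts a nested lambda here): exists() takes the array and a predicate directly, so no subquery and no UNNEST are needed:
FILTER(boxes.clicks, x -> EXISTS(x, c -> c.href IS NOT NULL))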

ST_GeogFromGeoJSON fails in BigQuery while successful in Postgres

We have GeoJSON polygons we would like to convert to geography objects in BigQuery using ST_GeogFromGeoJSON. The conversion fails in BigQuery, while the equivalent command, ST_GeomFromGeoJSON, succeeds in Postgres.
I am familiar with the SAFE prefix that can be added to the BigQuery call, but we would like to use the object rather than just ignore it when the conversion fails. I tried converting the object using ST_CONVEXHULL but wasn't able to make it work.
Is there some workaround in BigQuery?
Example:
Running the following command in BigQuery
select ST_GeogFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
returns
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
while it runs successfully in Postgres:
select ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')
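For reference, the SAFE prefix mentioned above suppresses the error but yields NULL rather than a usable object, so it does not help when the geography itself is needed:
select SAFE.ST_GeogFromGeoJSON('{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}')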
October 2020 Update for this post
No tricks are needed anymore: the ST_GEOGFROMGEOJSON and ST_GEOGFROMTEXT geography functions now support a new make_valid parameter. If set to TRUE, the function attempts to correct polygon issues when importing geography data.
So, the simple statement below works perfectly now ...
select ST_GeogFromGeoJSON(
'{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}'
, make_valid => true
)
and returns the expected output.
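The same applies to the original polygon from the question:
select ST_GeogFromGeoJSON(
'{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}'
, make_valid => true
)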
Below is for BigQuery Standard SQL
Query failed: ST_GeogFromGeoJSON failed: Invalid polygon loop: Edge 4 crosses edge 9
... Is there some workaround in BigQuery? ...
The proposed workaround is an obviously naive and simple way of fixing this specific issue, though it can be extended to more generic cases. The idea here is to extract the coordinates and reorder them to eliminate the problem ...
WITH test AS (
SELECT '{"type":"Polygon","coordinates":[[[-82.022982,26.69785],[-81.606813,26.710698],[-81.999574,26.109253],[-81.615053,26.105558],[-82.022982,26.69785]]]}' AS geojson
)
SELECT ST_GEOGFROMGEOJSON('{"type":"Polygon","coordinates":' || fixed_coordinates || '}') AS geo
FROM (
SELECT '[[[' || STRING_AGG(lat_lon, '],[') || '],[' || ANY_VALUE(ordered_coordinates[OFFSET(0)]) || ']]]' fixed_coordinates
FROM (
SELECT
ARRAY( SELECT lon_lat
FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat
ORDER BY CAST( SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64), CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64)
) ordered_coordinates
FROM test
) t, t.ordered_coordinates lat_lon
)
This produces the correct output
POLYGON((-82.022982 26.69785, -81.999574 26.109253, -81.8073135 26.1074055, -81.615053 26.105558, -81.606813 26.710698, -81.8148975 26.704274, -82.022982 26.69785))
and the respective visualization (image omitted) confirms the result.
Below is for BigQuery Standard SQL
My previous answer is based on the oversimplified logic of re-ordering coordinates. Obviously it will not work in more complex cases, like the one below:
{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}
Is there more advanced sorting logic that can be applied?
So, more complex logic can be used to address this:
#standardSQL
WITH test AS (
SELECT '{"type":"Polygon","coordinates":[[[-0.49044,51.4737],[-0.4907,51.4737],[-0.49075,51.46989],[-0.48664,51.46987],[-0.48664,51.47341],[-0.48923,51.47336],[-0.48921,51.4737],[-0.49072,51.47462],[-0.49114,51.47446],[-0.49044,51.4737]]]}' geojson
), coordinates AS (
SELECT CAST(SPLIT(lon_lat)[OFFSET(0)] AS FLOAT64) lon, CAST(SPLIT(lon_lat)[OFFSET(1)] AS FLOAT64) lat
FROM test, UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(geojson, '$.coordinates'), r'\[+(.*?)\]+')) lon_lat
), stats AS (
SELECT ST_CENTROID(ST_UNION_AGG(ST_GEOGPOINT(lon, lat))) centroid FROM coordinates
)
SELECT ST_MAKEPOLYGON(ST_MAKELINE(ARRAY_AGG(point ORDER BY sequence))) AS polygon
FROM (
SELECT point,
CASE
WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) > ST_Y(centroid) THEN 3.14 - angle
WHEN ST_X(point) > ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 3.14 + angle
WHEN ST_X(point) < ST_X(centroid) AND ST_Y(point) < ST_Y(centroid) THEN 6.28 - angle
ELSE angle
END sequence
FROM (
SELECT point, centroid,
ACOS(ST_DISTANCE(centroid, anchor) / ST_DISTANCE(centroid, point)) angle
FROM (
SELECT centroid,
ST_GEOGPOINT(lon, lat) point,
ST_GEOGPOINT(lon, ST_Y(centroid)) anchor
FROM coordinates, stats
)
)
)
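Roughly, the logic is: for each point, ACOS of (horizontal distance / full distance) gives the angle at the centroid between the point and a horizontal anchor at the centroid's latitude, and the CASE expression unfolds that angle per quadrant into a full 0 to 2π ordering (with 3.14 and 6.28 standing in for π and 2π), so the points are aggregated in angular order around the centroid.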
This approach produces the correct output
POLYGON((-0.49075 51.46989, -0.48664 51.46987, -0.48664 51.47341, -0.48923 51.47336, -0.48921 51.4737, -0.49072 51.47462, -0.49114 51.47446, -0.49044 51.4737, -0.4907 51.4737, -0.49075 51.46989))
and the visualization (image omitted) shows a valid polygon.

Cannot have map type columns in DataFrame which calls set operations

org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map<float,float>
I have a Hive table with a column of type MAP<FLOAT, FLOAT>. I get the above error when I try to insert into this table in a Spark context. The insertion works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the DataFrame to an RDD and applying the .distinct function there.
Example:
spark.sql("""select 'a' test_col, map(0,0) map_col
union all
select 'a' test_col, map(0,0) map_col""").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])
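Alternatively, since the restriction only applies to set operations over map-typed columns, here is a sketch that deduplicates the non-map columns first and builds the map afterwards (this assumes the map values are fully determined by the other columns):
insert into test_insert2
select test_col, map(0,0) as map_col
from (select distinct 'a' as test_col) t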

LOG function in Redshift

I am trying to run the following query.
CREATE TEMP TABLE tmp_variables AS SELECT
0.99::numeric(10,8) AS y ;
select y, log(y) from tmp_variables
It gives me the following error. Is there a way to get around this?
[Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
Warnings:
Function "log(numeric,numeric)" not supported.
A workaround is to use "float" instead.
CREATE TEMP TABLE tmp_variables AS SELECT
0.99::float AS y ;
select y, log(y) from tmp_variables
works fine and returns
y log
0.99 -0.004364805402450088
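Note that Redshift's single-argument LOG is base 10, which matches the value above (log10(0.99) ≈ -0.0043648). If you want the natural logarithm, LN also accepts the float column:
select y, ln(y) from tmp_variables;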
The LOG function requires an argument of data type "double precision". Your code is passing in a "numeric" value, which is why you are getting the error.
This will work:
CREATE TEMP TABLE tmp_variables AS
SELECT 0.99::numeric(10,8) AS y ;
select y, log(cast(y as double precision)) from tmp_variables;

ST_DIFFERENCE returning GeometryCollection instead of MultiPoint

I'm trying to get the difference between two multipoints. I am doing this using the query location = ST_Difference(location, other_geo). This works when the result is not empty; however, if the two multipoints are exactly the same, the resulting object is a GeometryCollection instead of an empty MultiPoint, as would be returned by ST_geomFromText('MULTIPOINT EMPTY'). How do I get the result to be an empty multipoint object?
The following query returns a non-empty geometry (a point):
SELECT ST_asGeoJSON(ST_Difference(ST_geomFromText('MultiPoint(1 2, 3 4)', 4326), ST_geomFromText('MultiPoint(1 2)', 4326)));
Result: {"type":"Point","coordinates":[3,4]}
This one results in an empty GeometryCollection:
SELECT ST_asGeoJSON(ST_Difference(ST_geomFromText('MultiPoint(1 2)', 4326), ST_geomFromText('MultiPoint(1 2)', 4326)));
Result: {"type":"GeometryCollection","geometries":[]}
Try using ST_Multi and ST_CollectionExtract to always return a MultiPoint geometry with zero or more points:
SELECT ST_AsGeoJSON(ST_Multi(ST_CollectionExtract(ST_Difference(a, b), 1)))
FROM (
SELECT 'MultiPoint(1 2, 3 4)'::geometry a, 'MultiPoint(1 2)'::geometry b
UNION SELECT 'MultiPoint(1 2)', 'MultiPoint(1 2)'
) data;
st_asgeojson
---------------------------------------------
{"type":"MultiPoint","coordinates":[]}
{"type":"MultiPoint","coordinates":[[3,4]]}
(2 rows)
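Applied to the update from the question (mytable is a hypothetical table name), the same wrapping keeps location a MultiPoint even when the difference is empty:
UPDATE mytable
SET location = ST_Multi(ST_CollectionExtract(ST_Difference(location, other_geo), 1));
Here the second argument of ST_CollectionExtract selects which geometry type to keep: 1 for points, 2 for lines, 3 for polygons.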