Array operation on hive collect_set - sql

I am working in Hive on a large dataset. I have a table with an array column, and the content of the column is as follows.
["20190302Prod4"
"20190303Prod1"
"20190303Prod4"
"20190304Prod4"
"20190305Prod3"
"20190307Prod4"
"20190308Prod4"
"20190309Prod4"
"20190310Prod2"
"20190311Prod1"
"20190311Prod4"
"20190312Prod1"
"20190312Prod4"
"20190313Prod2"
"20190313Prod1"
"20190313Prod4"
"20190314Prod4"
"20190315Prod4"
"20190316Prod4"
"20190317Prod1"
"20190317Prod4"]
I need a set of products ordered by ascending date, i.e. I need to trim the date from each array element and apply collect_set to get the result below.
["Prod4",
"Prod1",
"Prod3",
"Prod2"]

Explode array, remove date (digits at the beginning of the string), aggregate using collect_set:
with mydata as (--use your table instead of this
select array(
"20190302Prod4",
"20190303Prod1",
"20190303Prod4",
"20190304Prod4",
"20190305Prod3",
"20190307Prod4",
"20190308Prod4",
"20190309Prod4",
"20190310Prod2",
"20190311Prod1",
"20190311Prod4",
"20190312Prod1",
"20190312Prod4",
"20190313Prod2",
"20190313Prod1",
"20190313Prod4",
"20190314Prod4",
"20190315Prod4",
"20190316Prod4",
"20190317Prod1",
"20190317Prod4"
) myarray
)
select collect_set(regexp_extract(elem,'^\\d*(.*?)$',1)) col_name
from mydata a --Use your table instead
lateral view outer explode(myarray) s as elem;
Result:
col_name
["Prod4","Prod1","Prod3","Prod2"]
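To make the logic of the query easier to follow, here is a minimal sketch of the same three steps (explode, strip the leading digits, collect distinct values) in plain Python. It is only a model of what the query computes, not how Hive executes it. The function name collect_products and the shortened sample list are assumptions made for illustration. The sketch sorts the elements first so that the "first date seen" order is explicit rather than depending on the engine.

```python
import re

def collect_products(elements):
    """Strip the leading date digits and keep each product once, in first-seen order."""
    seen = {}
    # yyyymmdd prefixes sort chronologically when compared as text
    for elem in sorted(elements):
        product = re.sub(r"^\d*", "", elem)  # same idea as regexp_extract(elem,'^\\d*(.*?)$',1)
        seen.setdefault(product, None)       # dicts preserve insertion order
    return list(seen)

# Shortened sample taken from the question's array
myarray = [
    "20190302Prod4", "20190303Prod1", "20190303Prod4",
    "20190304Prod4", "20190305Prod3", "20190307Prod4",
    "20190310Prod2", "20190311Prod1", "20190311Prod4",
]

result = collect_products(myarray)
print(result)
```

If the order of the set matters in Hive as well, it may be worth checking the ordering on your own data rather than assuming it.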
Another possible method is to concatenate the array into a string first, remove the dates from the string, and split it to get an array again. Unfortunately, we still need to explode in order to apply collect_set and remove duplicates (this example uses the same WITH mydata CTE):
select collect_set(elem) col_name
from mydata a --Use your table instead
lateral view outer explode(split(regexp_replace(concat_ws(',',myarray),'(^|,)\\d{8}','$1'),',')) s as elem
;
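The concatenate, replace, split variant can be sketched the same way in plain Python. This is an illustrative model under the assumption that every element starts with an 8-digit date; the helper name strip_dates_joined and the short sample are hypothetical.

```python
import re

def strip_dates_joined(elements):
    """Join with commas, drop an 8-digit date at the start or after each comma, split again."""
    joined = ",".join(elements)                      # concat_ws(',', myarray)
    cleaned = re.sub(r"(^|,)\d{8}", r"\1", joined)   # keep the comma (group 1), drop the date
    return cleaned.split(",")                        # split(..., ',')

sample = ["20190302Prod4", "20190303Prod1", "20190303Prod4", "20190305Prod3"]

stripped = strip_dates_joined(sample)
unique = list(dict.fromkeys(stripped))  # de-duplicate, keeping first-seen order

print(stripped)
print(unique)
```

The split result still contains duplicates, which is why the query explodes it again before collect_set.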

Related

How to flatten nested array data into row in bigquery

I am trying to flatten inside_array, a sub-array of nested array data, into table rows.
I am able to flatten array_data, which is the outer array.
Does anybody have any suggestions? Thanks in advance.
#standardSQL
SELECT ...
FROM `project.dataset.table`,
UNNEST(array_data) AS array_data_rec,
UNNEST(array_data_rec.inside_array) AS inside_array_rec
To handle rows with no data inside inside_array, use LEFT JOIN instead, as in the example below:
#standardSQL
SELECT ...
FROM `project.dataset.table`,
UNNEST(array_data) AS array_data_rec
LEFT JOIN UNNEST(array_data_rec.inside_array) AS inside_array_rec
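The difference between the two queries can be modelled with nested loops. This is a rough sketch in plain Python, not BigQuery code; the field names id and name and the sample row are assumptions added for illustration. The comma (cross) join drops outer records whose inside_array is empty, while the LEFT JOIN keeps them with a NULL (None here).

```python
def cross_join_unnest(rows):
    """Model of ', UNNEST(...)': records with an empty inside_array yield no rows."""
    out = []
    for row in rows:
        for rec in row["array_data"]:
            for inner in rec["inside_array"]:
                out.append((row["id"], rec["name"], inner))
    return out

def left_join_unnest(rows):
    """Model of 'LEFT JOIN UNNEST(...)': an empty inside_array yields one row with None."""
    out = []
    for row in rows:
        for rec in row["array_data"]:
            for inner in (rec["inside_array"] or [None]):
                out.append((row["id"], rec["name"], inner))
    return out

# Hypothetical table content: one row whose second record has an empty inside_array
table = [
    {"id": 1, "array_data": [
        {"name": "a", "inside_array": [10, 20]},
        {"name": "b", "inside_array": []},
    ]},
]

print(cross_join_unnest(table))
print(left_join_unnest(table))
```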
You can do the following:
...
FROM
AA.nested_array,
UNNEST(array_data) as array_data,
UNNEST(array_data.inside_array) as array_data_inside_array

Get a list of all objects with the same key inside a jsonb array

I have a table mytable and a JSONB column employees that contains data like this:
[ {
"name":"Raj",
"email":"raj@gmail.com",
"age":32
},
{
"name":"Mohan",
"email":"Mohan@yahoo.com",
"age":21
}
]
I would like to extract only the names and save them in a list format, so the resulting cell would look like this:
['Raj','Mohan']
I have tried
select l1.obj ->> 'name' names
from mytable t
cross join jsonb_array_elements(t.employees) as l1(obj)
but this only returns the name of the first array element.
How do I get the name of all array elements?
Thanks!
PostgreSQL 11.8
In Postgres 12 and later, you can use jsonb_path_query_array():
select jsonb_path_query_array(employees, '$[*].name') as names
from mytable
In earlier versions you need to unnest then aggregate back:
select (select jsonb_agg(e -> 'name')
from jsonb_array_elements(employees) as t(e)) as names
from mytable
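Both queries do the same thing per row: walk the array and collect the name of every element into a new array. A minimal Python sketch of that per-row logic, assuming the column value arrives as JSON text (the function name extract_names is hypothetical and the sample objects are trimmed to two keys):

```python
import json

def extract_names(employees_json):
    """Return the 'name' value of every object in a JSON array."""
    return [employee["name"] for employee in json.loads(employees_json)]

# Trimmed version of the sample data from the question
employees = '[{"name": "Raj", "age": 32}, {"name": "Mohan", "age": 21}]'

names = extract_names(employees)
print(names)
```

The key point is that the aggregation happens inside each row, so one input row produces one array rather than one output row per element.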

Cannot have map type columns in DataFrame which calls set operations

: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to insert into this table from a Spark context. The insert works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the DataFrame to an RDD with .rdd and then applying .distinct on it.
Example:
// Triple quotes are needed because the SQL string spans several lines
spark.sql("""select 'a' test_col, map(0,0) map_col
union all
select 'a' test_col, map(0,0) map_col""").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])

hive explode list from json-string

I have a table with JSON strings:
CREATE TABLE TABLE_JSON (
json_body string
);
The JSON has this structure:
{ obj1: { fields ... }, obj2: [array] }
I want to select all elements from array, but I can't.
For example, I can get all fields from first object:
SELECT f.fields...
FROM (
SELECT q1.obj1, q1.obj2
FROM TABLE_JSON jt
LATERAL VIEW JSON_TUPLE(jt.json_body, 'obj1', 'obj2') q1 AS obj1, obj2
) as json_table2
LATERAL VIEW JSON_TUPLE(TABLE_JSON.obj1, 'fields...') f AS fields...;
But this method doesn't work with the array.
I've tried to use
...
LATERAL VIEW explode(json_table2.obj2) adTable AS arr;
(see the Hive explode documentation)
But obj2 is a string containing an array. How do I transform the JSON string into an array and explode it?
The json_split UDF from Brickhouse ( http://github.com/klout/brickhouse ) can convert a JSON array to a Hive List, and then you can explode that.
See http://mail-archives.apache.org/mod_mbox/hive-user/201406.mbox/%3CCAO78EnLgSrrUY3Ad_ZWS9zWNKLQRwS9jXrqEE869FhUNiWgCXA#mail.gmail.com%3E and https://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
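The idea behind "convert the JSON array to a list, then explode" can be sketched in plain Python. This is not the Brickhouse implementation, only a model of the two steps; the function name explode_obj2 is hypothetical, and the sample body assumes obj1 is a simple string and obj2 an array of strings.

```python
import json

def explode_obj2(json_body):
    """Parse the JSON string, then emit one (obj1, element) row per item of obj2."""
    doc = json.loads(json_body)          # string -> structured value
    return [(doc["obj1"], item) for item in doc["obj2"]]  # the "explode" step

# Assumed sample body with a simple obj1 and a three-element obj2
json_body = '{"obj1":"field1","obj2":["a1","a2","a3"]}'

rows = explode_obj2(json_body)
for row in rows:
    print(row)
```

The point is that the string has to be parsed into a real list before explode can produce one row per element.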
You can consider using Hive-JSON SerDe to read the data from JSON.
Refer: https://github.com/rcongiu/Hive-JSON-Serde
This may not be an optimal solution but can help unblock you. For a JSON object which looks like below
'{"obj1":"field1","obj2":["a1","a2","a3"]}'
this query can help you obtain all items of array into individual columns given that the size of the array is constant across all rows.
SELECT split(results,",")[0] AS arrayItem1,
split(results,",")[1] AS arrayItem2,
regexp_replace(split(results,",")[2], "[\\]|}]", "") AS arrayItem3
FROM
(SELECT split(translate(get_json_object(TABLE_JSON.json_body,'$.obj2'), '"\\[|]|\""',''), "},") AS r
FROM TABLE_JSON) t1 LATERAL VIEW explode(r) rr AS results
It produces the result which looks like this
arrayitem1| arrayitem2| arrayitem3
a1 | a2 | a3
You can extend it to an array of any size, on the condition that the size is constant across the table.

explode function in hive

I have the following sample data and I am trying to explode it in hive.. I used split but I know I am missing something..
["[[-80.742426,35.23248],[-80.740424,35.23184],[-80.739583,35.231562],[-80.735935,35.23041],[-80.728624,35.228069],[-80.727753,35.227836],[-80.727294,35.227741],[-80.726762,35.227647],[-80.726321,35.227594],[-80.725687,35.227544],[-80.725134,35.227535],[-80.721502,35.227615],[-80.691298,35.216202],[-80.688009,35.215396],[-80.686516,35.215016],[-80.598433,35.234307]]"]
I used the below query
select explode(split(col, ',')) from sample2;
and the result is this
["[[-80.742426
35.23248]
[-80.740424
35.23184]
[-80.739583
35.231562]
[-80.735935
35.23041]
[-80.728624
35.228069]
[-80.727753
35.227836]
[-80.71143
35.227831]
[-80.711007
35.227795]
[-80.710638
35.227741]
[-80.673884
35.21014]
[-80.672358
35.209481]
[-80.672036
35.209356]
[-80.671686
35.209234]
[-80.67124
35.209099]
[-80.670815
35.209006]
[-80.670267
35.208906]
[-80.669612
35.208833]
[-80.668924
35.208806]
[-80.598433
35.234307]]"]
I need it in below format
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]
Any help over here..?
Your data is an array of arrays and you want to explode it at the first level only, so use LATERAL VIEW explode(colname) to explode at the first level.
Below is the SELECT query with explode():
SELECT col1 FROM sample2 LATERAL VIEW EXPLODE(col) explodeVal AS col1;
The output generated from your input data set is as follows:
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]
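To illustrate why splitting on ',' breaks the pairs apart while exploding at the first level keeps them together, here is a small plain-Python sketch. It assumes the column holds the nested array as JSON text, as the sample in the question suggests; the function name explode_pairs is hypothetical and the sample is shortened to three pairs.

```python
import json

def explode_pairs(col):
    """Parse each string as a nested array and emit one row per inner [lon, lat] pair."""
    rows = []
    for text in col:
        rows.extend(json.loads(text))  # first-level elements only
    return rows

# Shortened sample from the question
col = ["[[-80.742426,35.23248],[-80.740424,35.23184],[-80.739583,35.231562]]"]

naive = col[0].split(",")   # what split(col, ',') does: every comma is a boundary
pairs = explode_pairs(col)

print(len(naive))
for pair in pairs:
    print(pair)
```

Splitting on commas cannot tell the comma inside a pair from the comma between pairs, so the structure has to be parsed (or already typed as an array of arrays) before exploding.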