Hive - Explode an array of struct key/value pairs - sql

This is the Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
USER_ID string,
DETAIL_DATA array<struct<key:string,value:string>>
)
And this is the data in the table:
11111 [{"key":"client_status","value":"ACTIVE"},{"key":"name","value":"Jane Doe"}]
Is there any way I can get the below output using HiveQL?
**client_status** | **name**
------------------+----------
ACTIVE            | Jane Doe
I tried using explode(), but I get a result like this:
SELECT details
FROM sample_table
lateral view explode(DETAIL_DATA) exploded_table as details;
**details**
------------------------------------------
{"key":"client_status","value":"ACTIVE"}
{"key":"name","value":"Jane Doe"}

Use lateral view [outer] inline to get the struct elements already extracted, and use conditional aggregation to get the values corresponding to particular keys grouped into a single row; group by USER_ID.
Demo:
with sample_table as (--This is your data example
select '11111' USER_ID,
array(named_struct('key','client_status','value','ACTIVE'),named_struct('key','name','value','Jane Doe')) DETAIL_DATA
)
SELECT max(case when e.key='name' then e.value end) as name,
max(case when e.key='client_status' then e.value end) as status
FROM sample_table
lateral view inline(DETAIL_DATA) e as key, value
group by USER_ID
Result:
name status
------------------------
Jane Doe ACTIVE
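The [outer] keyword mentioned above matters when DETAIL_DATA can be an empty array: a plain lateral view drops such rows entirely, while lateral view outer inline keeps them with NULL key and value. Only the lateral view line changes:
lateral view outer inline(DETAIL_DATA) e as key, value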
If you can guarantee the order of the structs in the array (the one with the status always comes first), you can address the nested elements directly:
SELECT detail_data[0].value as client_status,
detail_data[1].value as name
from sample_table
One more approach: if you do not know the order in the array, but the array is always of size 2, CASE expressions without explode will give better performance:
SELECT case when DETAIL_DATA[0].key='name' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as name,
case when DETAIL_DATA[0].key='client_status' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as status
FROM sample_table

Related

Efficiently outer join two array columns row-wise in BigQuery table

I'll first state the question as simply as possible, then elaborate with more detail and an example.
Concise question without context
I have a table with rows containing columns of arrays. I need to outer join the elements of some pairs of these, compute some variables, and then aggregate the results back into a new array. I'm currently using a pattern where I:
1. unnest each column in the pair to be joined (cross join to the PK of the row)
2. full outer join the two on the PK and compute the desired fields
3. group by the PK to get back to a single row with an array column that summarizes the results
Is there a way to do this without the multiple unnesting and grouping back down?
More context and an example
I have a table which represents edits to an entity that is made up of multiple sub-records. Each row represents a single entity. There is a before column that contains the records before the edit, and an after column that contains the records afterwards.
My goal is to label each sub-record with exactly one of the four valid edit types:
- DELETE - record exists in before but not after
- ADD - record exists in after but not before
- EDIT - record exists in both before and after, but at least one field was changed
- NONE - record exists in both before and after, and no fields were changed
Each of the sub-record values is represented by its ID and a hash of all of its fields. I've created some fake data and provided my initial implementation below. This works, but it seems very roundabout.
WITH source_data AS (
SELECT
1 AS pkField,
[
STRUCT(1 AS id, 1 AS fieldHash),
STRUCT(2 AS id, 2 AS fieldHash),
STRUCT(3 AS id, 3 AS fieldHash)
] AS before,
[
STRUCT(1 AS id, 1 AS fieldHash),
STRUCT(2 AS id, 0 AS fieldHash), -- record 2 edited
-- record 3 deleted
STRUCT(4 AS id, 4 AS fieldHash), -- record 4 added
STRUCT(5 AS id, 5 AS fieldHash) -- record 5 added
] AS after
)
SELECT
pkField,
ARRAY_AGG(STRUCT(
id,
CASE
WHEN beforeHash IS NULL THEN "ADD"
WHEN afterHash IS NULL THEN "DELETE"
WHEN beforeHash <> afterHash THEN "EDIT"
ELSE "NONE"
END AS editType
)) AS edits
FROM (
SELECT pkField, id, fieldHash AS beforeHash
FROM source_data
CROSS JOIN UNNEST(source_data.before)
)
FULL OUTER JOIN (
SELECT pkField, id, fieldHash AS afterHash
FROM source_data
CROSS JOIN UNNEST(source_data.after)
)
USING (pkField, id)
GROUP BY pkField
Is there a simpler and/or more efficient way to do this? Perhaps something that avoids the multiple unnesting and grouping back down?
I think what you have is already a simple and efficient way!
Meanwhile, you can consider the optimized version below:
select pkField,
array(select struct(
id, case
when b.fieldHash is null then 'ADD'
when a.fieldHash is null then 'DELETE'
when b.fieldHash != a.fieldHash then 'EDIT'
else 'NONE'
end as editType
) edits
from (select id, fieldHash from t.before) b
full outer join (select id, fieldHash from t.after) a
using(id)
) edits
from source_data t
If applied to the sample data in your question, the output is:
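pkField | edits
--------+-------
1 | [{"id":1,"editType":"NONE"},{"id":2,"editType":"EDIT"},{"id":3,"editType":"DELETE"},{"id":4,"editType":"ADD"},{"id":5,"editType":"ADD"}]
(reconstructed here from the sample data and the CASE logic above; the order of elements within the edits array may vary)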

Hive - Reformat data structure

So I have a sample of Hive data:
Customer | xx_var | yy_var | branchflow
---------+--------+--------+-----------
{"customer_no":"239230293892839892","acct":["2324325","23425345"]} | 23 | 3 | [{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]
And I want to transform it into something like this:
Customer_no | acct | xx_var | yy_var | branchflow
------------+------+--------+--------+-----------
239230293892839892 | 2324325 | 23 | 3 | [1,2,3,4,5,6,6,6,4]
239230293892839892 | 23425345 | 23 | 3 | [1,2,3,4,5,6,6,6,99,4]
I have tried using this query, but I am getting the wrong output format.
SELECT
customer.customer_no,
acct,
xx_var,
yy_var,
bi_acctno,
values_bi
FROM
struct_test
LATERAL VIEW explode(customer.acct) acct AS acctno
LATERAL VIEW explode(branchflow.acctno) bia as bi_acctno
LATERAL VIEW explode(branchflow.value) biv as values_bi
WHERE bi_acctno = acctno
Does anyone know how to approach this problem?
Use json_tuple to extract the JSON elements. In the case of an array, it is also returned as a string: remove the square brackets, split, and explode. See the comments in the demo code.
Demo:
with mytable as (--demo data, use your table instead of this CTE
select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer,
23 xx_var, 3 yy_var,
'[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]' branchflow
)
select c.customer_no,
a.acct,
t.xx_var, t.yy_var,
get_json_object(b.acct_branchflow,'$.value') value
from mytable t
--extract customer_no and acct array
lateral view json_tuple(t.customer, 'customer_no', 'acct') c as customer_no, accts
--remove [] and " and explode array of acct
lateral view explode(split(regexp_replace(c.accts,'^\\[|"|\\]$',''),',')) a as acct
--remove [] and explode array of json
lateral view explode(split(regexp_replace(t.branchflow,'^\\[|\\]$',''),'(?<=\\}),(?=\\{)')) b as acct_branchflow
--this will remove duplicates after lateral view: need only matching acct
where get_json_object(b.acct_branchflow,'$.acctno') = a.acct
Result:
customer_no acct xx_var yy_var value
239230293892839892 2324325 23 3 [1,2,3,4,5,6,6,6,4]
239230293892839892 23425345 23 3 [1,2,3,4,5,6,6,6,99,4]

How to parse a string to an array of maps in HIVE

I have a Hive table which is ingested from system logs. The data is encoded in a weird format (an array of maps) in which each element of the array contains the field name and its value. The column type is STRING, just like in the example below:
select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
union all
select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info;
Which creates something like this:
user_id | user_info
--------+-----------
1 | [{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]
2 | [{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]
Notice that the array size is not always the same. I'm trying to convert the array of maps to a simple map. Then, this is what I expect as result:
user_id | user_info
--------+-----------
1 | {"name":"Bob", "gender":"M"}
2 | {"name":"Ana", "gender":"F", "age":22}
I was planning to get there in 3 steps: (1) parse the string column to create an array of maps, (2) explode the array (using lateral view), (3) collect the list of fields and group them by user_id.
I'm struggling with the first step: parsing the string column to create an array of maps. Any help would be much appreciated :D
See the comments in the code. The array of strings to be transformed into maps is produced by split(user_info, '(?<=\\}) *, *(?=\\{)'). Then it is exploded and each element is converted to a map.
with mydata as
(select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
union all
select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info
)
select user_id,
--build new map
str_to_map(concat('name:', name, nvl(concat(',','gender:', gender),''), nvl(concat(',','age:', age),'') )) as user_info
from
(
select user_id,
--get name, gender, age, aggregate by user_id
max(case when user_info['field'] = 'name' then user_info['value'] end) name,
max(case when user_info['field'] = 'gender' then user_info['value'] end) gender,
max(case when user_info['field'] = 'age' then user_info['value'] end) age
from
(
select s.user_id,
--remove {} and ", convert to map
str_to_map(regexp_replace(e.element,'^\\{| *"|\\}$','')) as user_info
from
(
select user_id, regexp_replace(user_info, '^\\[|\\]$','') as user_info -- remove []
from mydata
)s lateral view outer explode(split(user_info, '(?<=\\}) *, *(?=\\{)'))e as element --split by comma between }{ with optional spaces in between
) s
group by user_id
)s
Result:
user_id user_info
1 {"name":"Bob","gender":"M"}
2 {"name":"Ana","gender":"F","age":"22"}
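Note that age comes back as the string "22" rather than the number 22: str_to_map in Hive always produces a map<string,string>.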

Dynamic transpose of unknown row values into column names in Postgres

I have a table such as:
customer_number | label | value
----------------+---------+-----------------
1 | address | St. John 1A
1 | phone | 111111111
1 | email | john#cena.com
2 | address | St. Marry 231A
2 | phone | 222222222
2 | email | please#marry.me
I want a new table or view so that it becomes:
customer_number | address | phone | email
----------------+----------------+-----------+-----------------
1 | St. John 1A | 111111111 | john#cena.com
2 | St. Marry 231A | 222222222 | please#marry.me
But in the future there is the possibility of adding different labels; for example, there might be a new label called occupation.
It is important to note that I don't know the values in the label column in advance, so the query should work with any value that appears in that column.
Is there any way to do this?
You can't have a "dynamic" pivot as the number, names and data types of all columns of a query must be known to the database before the query is actually executed (i.e. at parse time).
I find aggregating stuff into a JSON easier to deal with.
select customer_number,
jsonb_object_agg(label, value) as props
from the_table
group by customer_number
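For the sample data, each customer then carries a single JSON document as props (note that jsonb does not preserve the insertion order of keys):
customer_number | props
----------------+------
1 | {"email": "john#cena.com", "phone": "111111111", "address": "St. John 1A"}
2 | {"email": "please#marry.me", "phone": "222222222", "address": "St. Marry 231A"}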
If your frontend can deal with JSON values directly, you can stop here.
If you really need a view with one column per attribute, you can extract them from the JSON value:
select customer_number,
props ->> 'address' as address,
props ->> 'phone' as phone,
props ->> 'email' as email
from (
select customer_number,
jsonb_object_agg(label, value) as props
from the_table
group by customer_number
) t
I find this a bit easier to manage when new attributes are added.
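For example, if the occupation label mentioned in the question shows up later, only one line needs to be added to the select list:
props ->> 'occupation' as occupation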
If you need a view with all labels, you can create a stored procedure to dynamically create it. If the number of different labels doesn't change too often, this might be a solution:
create procedure create_customer_view()
as
$$
declare
l_sql text;
l_columns text;
begin
select string_agg(distinct format('(props ->> %L) as %I', label, label), ', ')
into l_columns
from the_table;
-- drop the old view first so the procedure can be re-run when the label set changes
drop view if exists customer_properties;
l_sql :=
'create view customer_properties as
select customer_number, '||l_columns||'
from (
select customer_number, jsonb_object_agg(label, value) as props
from the_table
group by customer_number
) t';
execute l_sql;
end;
$$
language plpgsql;
Then create the view using:
call create_customer_view();
And in your code just use:
select *
from customer_properties;
You can schedule that procedure to run at regular intervals (e.g. through a cron job on Linux).
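For example, a crontab entry along these lines (the database name mydb and the nightly 3 a.m. schedule are just placeholders) would rebuild the view once a day:
0 3 * * * psql -d mydb -c 'call create_customer_view()'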
Generally speaking, SQL is not good at pivoting dynamically.
Here is a query that will pivot the data for you. However, it is not dynamic, i.e. if a future occupation label were added then you would have to change the query. Not sure whether that is acceptable or not:
select customer_number,
max(value) filter (where label='address') as address,
max(value) filter (where label='phone') as phone,
max(value) filter (where label='email') as email
from your_customer_table
group by customer_number
There is a bit of an assumption here that you are running Postgres 9.4 or later, so that the FILTER clause is supported. If not, it can be re-worked using CASE expressions:
select customer_number,
max(case when label='address' then value else null end) as address,
max(case when label='phone' then value else null end) as phone,
max(case when label='email' then value else null end) as email
from your_customer_table
group by customer_number
I used CROSS APPLY to solve this problem. Here is my query:
select distinct tb9.customer_number, tb9_2.*
from Table_9 tb9 cross apply
(select max(case when tb9_2.[label] like '%address%' then [value] end) as [address],
max(case when tb9_2.[label] like '%phone%' then [value] end) as [phone],
max(case when tb9_2.[label] like '%email%' then [value] end) as [email]
from Table_9 tb9_2
where tb9.customer_number = tb9_2.customer_number
) tb9_2;
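Note that CROSS APPLY is SQL Server syntax; PostgreSQL spells the same idea as a LATERAL join. A sketch of the equivalent query, keeping the Table_9 naming from the answer above:
select distinct tb9.customer_number, tb9_2.*
from Table_9 tb9
cross join lateral (
    select max(case when t2.label like '%address%' then t2.value end) as address,
           max(case when t2.label like '%phone%' then t2.value end) as phone,
           max(case when t2.label like '%email%' then t2.value end) as email
    from Table_9 t2
    where t2.customer_number = tb9.customer_number
) tb9_2;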

Building a string based on columns from a GROUP

I have a table like this:
user 1 A
user 1 B
user 2 H
user 2 G
user 2 A
and I need a result like:
user 1 AB
user 2 HGA
Is there a way to obtain a result like this?
So here we create some test data:
CREATE TABLE foo AS
SELECT * FROM (
VALUES (1,'A'),(1,'B'),(2,'H'),(2,'G'),(2,'A')
) AS f(id,col);
This should work,
SELECT id, array_to_string(array_agg(col), '')
FROM foo
GROUP BY id;
Here is what we're doing:
1. GROUP BY id.
2. Build a PostgreSQL text[] (text array) of that column with array_agg.
3. Convert the array back to text by joining on an empty string '' with array_to_string.
You can also use string_agg,
SELECT id, string_agg(col, '')
FROM foo
GROUP BY id;
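Note that neither query guarantees the concatenation order. If a deterministic order is needed, add an ORDER BY inside the aggregate (here by col itself, since the sample table has no explicit ordering column):
SELECT id, string_agg(col, '' ORDER BY col)
FROM foo
GROUP BY id;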
Another suggested solution uses a str_sum aggregate function (note that str_sum is not built into Postgres; it would have to be user-defined):
select
user,
str_sum(column_name,'')
from table_name
group by user;