How to parse a string into an array of maps in Hive - sql

I have a Hive table which is ingested from system logs. The data is encoded in an awkward format (an array of maps) in which each element of the array contains the field_name and its value. The column type is STRING, just like in the example below:
select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
union all
select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info;
Which creates something like this:
user_id  user_info
1        [{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]
2        [{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]
Notice that the array size is not always the same. I'm trying to convert the array of maps to a simple map. Then, this is what I expect as result:
user_id  user_info
1        {"name":"Bob", "gender":"M"}
2        {"name":"Ana", "gender":"F", "age":22}
I was planning to reach that in 3 steps: (1) parse the string column to create an array of maps, (2) explode the array (using lateral view), (3) collect the list of fields and group them by user_id
I'm struggling to complete the first step: parse the string column to create an array of maps. Any help would be much appreciated :D

See the comments in the code. The array of strings to be transformed into maps is produced by split(user_info, '(?<=\\}) *, *(?=\\{)'). That array is then exploded and each element is converted to a map.
with mydata as
(
  select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
  union all
  select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info
)
select user_id,
       --build new map
       str_to_map(concat('name:', name, nvl(concat(',', 'gender:', gender), ''), nvl(concat(',', 'age:', age), ''))) as user_info
from
(
  select user_id,
         --get name, gender, age, aggregate by user_id
         max(case when user_info['field'] = 'name'   then user_info['value'] end) name,
         max(case when user_info['field'] = 'gender' then user_info['value'] end) gender,
         max(case when user_info['field'] = 'age'    then user_info['value'] end) age
  from
  (
    select s.user_id,
           --remove {} and ", convert to map
           str_to_map(regexp_replace(e.element, '^\\{| *"|\\}$', '')) as user_info
    from
    (
      select user_id, regexp_replace(user_info, '^\\[|\\]$', '') as user_info -- remove []
      from mydata
    ) s lateral view outer explode(split(user_info, '(?<=\\}) *, *(?=\\{)')) e as element --split by comma between }{ with optional spaces in between
  ) s
  group by user_id
) s
Result:
user_id user_info
1 {"name":"Bob","gender":"M"}
2 {"name":"Ana","gender":"F","age":"22"}

Related

Use a CASE expression without typing matched conditions manually using PostgreSQL

I have a long and wide table; the following is just an example. The structure might look a bit horrible in SQL, but I was wondering whether there's a way to extract each ID's price using a CASE expression without typing out the column names to match in the expression.
IDs  A_Price  B_Price  C_Price  ...
A    23                         ...
B             65                ...
C                      82       ...
A    10                         ...
..   ...      ...      ...      ...
Table I want to achieve:
IDs  price
A    23;10
B    65
C    82
..   ...
I tried:
SELECT IDs, string_agg(CASE IDs WHEN 'A' THEN A_Price
WHEN 'B' THEN B_Price
WHEN 'C' THEN C_Price
end::text, ';') as price
FROM table
GROUP BY IDs
ORDER BY IDs
To avoid typing A, B, A_Price, B_Price, etc., I tried to build the column names in a subquery and reference them there, but it seems SQL cannot recognise the resulting strings as columns, so it cannot retrieve the corresponding values.
WITH CTE AS (
SELECT IDs, IDs||'_Price' as t FROM ID_list
)
SELECT IDs, string_agg(CASE IDs WHEN CTE.IDs THEN CTE.t
end::text, ';') as price
FROM table
LEFT JOIN CTE ON CTE.IDs=table.IDs
GROUP BY IDs
ORDER BY IDs
You can use a document type like json or hstore as a stepping stone:
Basic query:
SELECT t.ids
, to_json(t.*) ->> (t.ids || '_price') AS price
FROM tbl t;
to_json() converts the whole row to a JSON object, which you can then pick a (dynamically concatenated) key from.
Your aggregation:
SELECT t.ids
, string_agg(to_json(t.*) ->> (t.ids || '_price'), ';') AS prices
FROM tbl t
GROUP BY 1
ORDER BY 1;
Converting the whole (big?) row adds some overhead, but you have to read the whole table for your query anyway.
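The same idea works with hstore instead of json - a sketch, assuming the additional module is installed (CREATE EXTENSION hstore):
SELECT t.ids
     , string_agg(hstore(t) -> (t.ids || '_price'), ';') AS prices
FROM   tbl t
GROUP  BY 1
ORDER  BY 1;
hstore() stores every column value as text, so no explicit cast is needed before aggregating.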
A union would be one approach here:
SELECT IDs, A_Price FROM yourTable WHERE A_Price IS NOT NULL
UNION ALL
SELECT IDs, B_Price FROM yourTable WHERE B_Price IS NOT NULL
UNION ALL
SELECT IDs, C_Price FROM yourTable WHERE C_Price IS NOT NULL;

Add a column to BigQuery results that adds a description for an ID in a column from the results

I am using BQ to pull some data and I need to add a column to the results that includes a lookup.
SELECT
timestamp_trunc(a.timestamp,day) date,
a.custom_parameter1,
a.custom_parameter2,
a.score,
a.type,
b.ref
FROM
`data-views_batch_20221021` a
left outer join (select client_uuid,STRING_AGG(document_referrer, "," LIMIT 1) ref from `activities_batch_20221021` where app_id="12345" and document_referrer is not null group by client_uuid) b using (client_uuid)
WHERE
a.app_id="12345"
How can I add a column that takes the array in a.type and looks up each value in the dict? I currently do this in Python with a dict lookup, but I want to include it in the query.
The dict is:
{23:"Description1", 24:"Description2", 25:"Description3"}
I don't have these values in a table within BQ, can I include it within the query? There are about 14 total descriptions to map.
My end result would look like this:
date | custom_parameter1 | custom_parameter2 | score | types | ref | type_descriptions
Edited to add that types is an array.
I don't have these values in a table within BQ, can I include it within the query?
Yes, you can have them as a CTE, as in the example below:
with dict as (
select 23 type, "description1" type_description union all
select 24, "description2" union all
select 25, "description3"
)
select
timestamp_trunc(a.timestamp,day) date,
a.custom_parameter1,
a.custom_parameter2,
a.score,
a.type,
b.ref,
type_description
from `data-views_batch_20221021` a
left outer join (
select client_uuid, string_agg(document_referrer, "," limit 1) ref
from `activities_batch_20221021`
where app_id="12345" and document_referrer is not null
group by client_uuid
) b using (client_uuid)
left join dict using (type)
where a.app_id="12345"
There are about 14 total descriptions to map
You can add as many entries to the dict CTE as you need.
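Since types is actually an array (per the edit to the question), the scalar left join dict using (type) would not apply directly. One possible sketch, assuming a.type is an ARRAY&lt;INT64&gt;, is to look each element up with a correlated ARRAY subquery instead (the document_referrer join is left out for brevity):
with dict as (
select 23 type, "description1" type_description union all
select 24, "description2" union all
select 25, "description3"
)
select
timestamp_trunc(a.timestamp,day) date,
a.custom_parameter1,
a.custom_parameter2,
a.score,
a.type,
array(
  select d.type_description
  from unnest(a.type) t
  join dict d on d.type = t
) as type_descriptions
from `data-views_batch_20221021` a
where a.app_id="12345"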

How to unnest BigQuery nested records into multiple columns

I am trying to unnest the table below.
I am using the following unnest query to flatten it:
SELECT
id,
name ,keyword
FROM `project_id.dataset_id.table_id`
,unnest (`groups` ) as `groups`
where id = 204358
The problem is that this duplicates the rows (except name), as is always the case when flattening a table.
How can I modify the query to put the names in two different columns rather than rows?
Expected output below -
That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join. You repeat the parent row for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of elements, but a table must have a fixed number of columns.
So you need a way to decide which array row becomes which column.
E.g. with
SELECT
id,
name,
groups[ordinal(1)] as firstArrayEntry,
groups[ordinal(2)] as secondArrayEntry,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
If your array had a key-value pair you could decide using the key. E.g.
SELECT
id,
name,
(select value from unnest(groups) where key='key1') as key1,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but this one has restrictions too, and I'm not sure how computation-heavy it is.
Consider below simple solution
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, this produces the expected output.
Note: when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list accordingly - for example, for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
If for whatever reason you want to improve on this, use the version below, where everything is built dynamically so you don't need to know in advance how many columns there will be in the output:
execute immediate (select '''
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg('' || pos, ', ') || '''))
'''
from (select pos from (
select max(array_length(`groups`)) cnt
from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1
) as name_1,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1 OFFSET 1
) as name_2,
'OVG' as keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;

Hive Explode the Array of Struct key: value:

This is the Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
USER_ID string,
DETAIL_DATA array<struct<key:string,value:string>>
)
And this is the data in the above table:
11111 [{"key":"client_status","value":"ACTIVE"},{"key":"name","value":"Jane Doe"}]
Is there any way I can get the below output using HiveQL?
**client_status** | **name**
------------------+----------------
ACTIVE            | Jane Doe
I tried to use explode(), but I get a result like this:
SELECT details
FROM sample_table
lateral view explode(DETAIL_DATA) exploded_table as details;
**details**
-------------------------------------------+
{"key":"client_status","value":"ACTIVE"}
------------------------------------------+
{"key":"name","value":"Jane Doe"}
Use lateral view [outer] inline to get the struct elements already extracted, and use conditional aggregation to get the values corresponding to particular keys grouped into a single row; group by user_id.
Demo:
with sample_table as (--This is your data example
select '11111' USER_ID,
array(named_struct('key','client_status','value','ACTIVE'),named_struct('key','name','value','Jane Doe')) DETAIL_DATA
)
SELECT max(case when e.key='name' then e.value end) as name,
max(case when e.key='client_status' then e.value end) as status
FROM sample_table
lateral view inline(DETAIL_DATA) e as key, value
group by USER_ID
Result:
name status
------------------------
Jane Doe ACTIVE
If you can guarantee the order of the structs in the array (the one with the status always comes first), you can address the nested elements directly:
SELECT detail_data[0].value as client_status,
detail_data[1].value as name
from sample_table
One more approach: if you do not know the order in the array but the array is always of size 2, CASE expressions without explode will give better performance:
SELECT case when DETAIL_DATA[0].key='name' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as name,
case when DETAIL_DATA[0].key='client_status' then DETAIL_DATA[0].value else DETAIL_DATA[1].value end as status
FROM sample_table

Building a string based on columns from a GROUP

I have a table like this:
user 1 A
user 1 B
user 2 H
user 2 G
user 2 A
and I need a result like:
user 1 AB
user 2 HGA
Is there a way to obtain a result like this?
So here we create some test data
CREATE TABLE foo AS
SELECT * FROM (
VALUES (1,'A'),(1,'B'),(2,'H'),(2,'G'),(2,'A')
) AS f(id,col);
This should work,
SELECT id, array_to_string(array_agg(col), '')
FROM foo
GROUP BY id;
Here is what we're doing:
1. GROUP BY id.
2. Build a PostgreSQL text[] (text array) of that column with array_agg.
3. Convert the array back to text by joining on an empty string '' with array_to_string.
You can also use string_agg,
SELECT id, string_agg(col, '')
FROM foo
GROUP BY id;
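If the concatenation order matters (for example, to reliably get AB rather than BA), both aggregates accept an ORDER BY inside the call - a sketch, assuming the value column itself defines the desired order:
SELECT id, string_agg(col, '' ORDER BY col) -- order within each group before concatenating
FROM foo
GROUP BY id;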
The better solution is using the str_sum aggregate function:
select
user,
str_sum(column_name,'')
from table_name
group by user;