Hive: Aggregate values by attribute into a JSON or MAP field

I have a table that looks like this:
| user | attribute | value |
|--------|-------------|---------|
| 1 | A | 10 |
| 1 | A | 20 |
| 1 | B | 5 |
| 2 | B | 10 |
| 2 | B | 15 |
| 2 | C | 100 |
| 2 | C | 200 |
I'd like to group this table by user and collect the sum of the value field into a JSON or a MAP with attributes as keys, like:
| user | sum_values_by_attribute |
|------|--------------------------|
| 1 | {"A": 30, "B": 15} |
| 2 | {"B": 25, "C": 300} |
Is there a way to do that in Hive?
I've found related questions such as this and this but none consider the case of a summation over values.

A JSON string corresponding to map<string, int> can be built in Hive using native functions only: aggregate by user and attribute, concatenate each pair as "key": value, collect the pairs into an array, concatenate the array using concat_ws, and add curly braces.
Demo:
with initial_data as (
select stack(7,
1,'A',40,
1,'A',20,
1,'B',5,
2,'B',10,
2,'B',15,
2,'C',100,
2,'C',200) as (`user`, attribute, value )
)
select `user`, concat('{',concat_ws(',',collect_set(concat('"', attribute, '": ',sum_value))), '}') as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result (JSON string):
user sum_values_by_attribute
1 {"A": 60,"B": 5}
2 {"B": 25,"C": 300}
Note: if you are running this on Spark, you can cast the result as map<string, int>; Hive does not support casting to complex types.
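As a minimal sketch of that Spark-only cast (reusing the str_to_map approach and the same column names as in the demo; Spark can cast map<string, string> to map<string, int>, Hive cannot):

```sql
-- Spark SQL only: Hive rejects casts to complex types
select `user`,
       cast(str_to_map(concat_ws(',', collect_set(concat(attribute, ':', sum_value))))
            as map<string, int>) as sum_values_by_attribute
from (select `user`, attribute, sum(value) as sum_value
      from initial_data
      group by `user`, attribute) s
group by `user`;
```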
A map<string, string> can also be built easily using native functions only: build the same array of key-value pairs but without double quotes (like A:10), concatenate it into a comma-delimited string using concat_ws, and convert it to a map using the str_to_map function (the same WITH CTE is omitted):
select `user`, str_to_map(concat_ws(',',collect_set(concat(attribute, ':',sum_value)))) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result (map<string, string>):
user sum_values_by_attribute
1 {"A":"60","B":"5"}
2 {"B":"25","C":"300"}
And if you need map<string, int>, unfortunately it cannot be done using Hive native functions only, because str_to_map returns map<string, string>, not map<string, int>. You can try the Brickhouse collect function:
add jar ~/brickhouse/target/brickhouse-0.6.0.jar; -- see the Brickhouse site https://github.com/klout/brickhouse for build instructions
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select `user`, collect(attribute, sum_value) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

You can first calculate the sum by attribute and user and then use collect_list.
Please let me know if the output below works for you.
SQL below:
select `user`,
       collect_list(concat(att, ':', cast(val as string))) as sum_values_by_attribute
from
(select `user`, `attribute` att, sum(`value`) val from tmp2 group by `user`, `attribute`) t
group by `user`;
Testing Query -
create table tmp2 ( `user` int, `attribute` string, `value` int);
insert into tmp2 select 1,'A',40;
insert into tmp2 select 1,'A',20;
insert into tmp2 select 1,'B',5;
insert into tmp2 select 2,'C',20;
insert into tmp2 select 1,'B',10;
insert into tmp2 select 2,'B',10;
insert into tmp2 select 2,'C',10;
select `user`,
       collect_list(concat(att, ':', cast(val as string))) as sum_values_by_attribute
from
(select `user`, `attribute` att, sum(`value`) val from tmp2 group by `user`, `attribute`) t
group by `user`;

Related

How to extract a JSON value in Hive

I Have a JSON string that is stored in a single cell in the DB corresponding to a parent ID
{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"}
Now I want to use the above JSON to extract the id from this. Let say we have data like
Parent ID | Profile JSON
1 | {profile_json} (see above string)
I want the output to look like this
Parent ID | ID
1 | abc
1 | efg
Now, I've tried a couple of iterations to solve this
First Approach:
select
    get_json_object(p.profile, '$$.id') as id,
    test.parent_id
from (
    select split(
               regexp_replace(
                   regexp_extract(profiles, '^\\[(.+)\\]$$', 1),
                   '\\}\\,\\{', '\\}\\|\\|\\{'),
               '\\|\\|') as profile_list,
           parent_id
    from source_table) test
lateral view explode(test.profile_list) p as profile;
But this is returning NULL values in the id column. Is there something I'm missing here?
Second Approach:
with profiles as(
select regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{') as profile_list,
parent_id
from source_table
)
SELECT
get_json_object (t1.profile_list,'$.id')
FROM profiles t1
The second approach is only returning the first id (abc) as per the above JSON string.
I tried to replicate this in Apache Hive v4.
Data
+----------------------------------------------------+------------------+
| data | parent_id |
+----------------------------------------------------+------------------+
| {"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"} | 1.0 |
+----------------------------------------------------+------------------+
Sql
select pid,get_json_object(expl_jid,'$.id') json_id from
(select parent_id pid,split(data,'\\|\\|') jid from tabl1)a
lateral view explode(jid) exp_tab as expl_jid;
+------+----------+
| pid | json_id |
+------+----------+
| 1.0 | abc |
| 1.0 | efg |
+------+----------+
Solved it: the problem in the first approach was using `$$` instead of `$` in the get_json_object path.
select
    get_json_object(p.profile, '$.id') as id,
    test.parent_id
from (
    select split(
               regexp_replace(
                   regexp_extract(profiles, '^\\[(.+)\\]$', 1),
                   '\\}\\,\\{', '\\}\\|\\|\\{'),
               '\\|\\|') as profile_list,
           parent_id
    from source_table) test
lateral view explode(test.profile_list) p as profile;

Count string occurrences within a list column - Snowflake/SQL

I have a table with a column that contains a list of strings like below:
EXAMPLE:
| STRING | User_ID | ... |
|--------|---------|-----|
| "[""null"",""personal"",""Other""]" | 2122213 | ... |
| "[""Other"",""to_dos_and_thing""]" | 2132214 | ... |
| "[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" | 4342323 | ... |
QUESTION:
I want to get a count of the number of times each unique string appears (strings are separated within the STRING column by commas), but I only know how to do the following:
SELECT u.STRING, count(u.USERID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However, the above method doesn't work, as it only counts the number of user IDs that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
| STRING | COUNT_Instances |
|--------|-----------------|
| "null" | 1223 |
| "personal" | 543 |
| "Other" | 324 |
| "to_dos_and_thing" | 221 |
| "getting_things_done" | 146 |
| "Work!!!!!" | 22 |
Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string as a row, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*) from u,
lateral SPLIT_TO_TABLE( string , ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED | COUNT(*) |
+---------------------+----------+
| Other | 2 |
| null | 1 |
| personal | 1 |
| to_dos_and_thing | 1 |
| getting_things_done | 1 |
| TO_dos_and_thing | 1 |
| Work!!!!! | 1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html

Statistical functions on columns of arrays in BigQuery

If I have data that looks like the follows:
+------+------------+-------+
| a | val | val2 |
+------+------------+-------+
| 3.14 | [1, 2, 3] | [2, 3]|
| 1.59 | [7, 8, 9] | ... |
| -1 | [4, 5, 6] | ... |
+------+------------+-------+
and I want to get the array averages of the val column, naively I'd want to just try something like
SELECT
AVG(val)
FROM
<Table>
But that doesn't work. I get an error like No matching signature for aggregate function AVG for argument types: ARRAY<INT64>. Supported signatures: AVG(INT64); AVG(UINT64); AVG(DOUBLE); AVG(NUMERIC)
I know that if I have just one column val I can do something like
SELECT avg
FROM
(
SELECT AVG(val) as avg
FROM UNNEST(val) AS val
)
but what if I have multiple columns (val, val2, etc.) and need multiple statistics? The above method just seems really cumbersome.
To be clear the result I'd want is:
+------+------------+-------------+--------------+
| a | avg_val | std_dev_val | avg_val2 |
+------+------------+-------------+--------------+
| 3.14 | 2 | 1 | ... |
| 1.59 | 8 | .... | ... |
| -1 | 5 | .... | ... |
+------+------------+-------------+--------------+
Is there a simple way to do this? Or do I need to create some sort of temporary function to accomplish this? Or am I stuck doing something like what I see in https://stackoverflow.com/a/45560462/1902480
If you want the averages as an array (one element-wise average per offset, aggregated across rows), you can unnest with an offset and then reaggregate:
select array_agg(avg_val order by n)
from (select n, avg(v) as avg_val
      from t cross join
      unnest(val) v with offset n
      group by n
     ) a;
EDIT:
If you want the values per row, just use a subquery with unnest():
select t.*,
(select avg(el)
from unnest(t.val) el
),
(select avg(el)
from unnest(t.val2) el
)
from t;
And so on for whatever aggregation functions you want.
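For instance, a sketch producing the exact columns in the desired output above with one correlated subquery per statistic (table name t and columns val, val2 assumed from the question):

```sql
-- sketch: one correlated subquery per statistic, names assumed from the question
select
  a,
  (select avg(el)    from unnest(val)  el) as avg_val,
  (select stddev(el) from unnest(val)  el) as std_dev_val,
  (select avg(el)    from unnest(val2) el) as avg_val2
from t;
```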
Consider the example below:
#standardSQL
create temp function array_avg(arr any type) as ((
select avg(val) from unnest(arr) val
));
create temp function array_std_dev(arr any type) as ((
select stddev(val) from unnest(arr) val
));
select a,
val, array_avg(val) val_avg, array_std_dev(val) val_stddev,
val2, array_avg(val2) val2_avg, array_std_dev(val2) val2_stddev
from `project.dataset.table`
If applied to the sample data in your question, the output shows each row with its per-array average and standard deviation (the output table was shown as an image in the original answer).
I think simple subqueries should be fine: AVG() only works on tables, and UNNEST() turns arrays into tables, so you can just combine them:
SELECT
(SELECT AVG(val) FROM UNNEST(val1) as val) AS avg_val1,
(SELECT AVG(val) FROM UNNEST(val2) as val) AS avg_val2,
(SELECT AVG(val) FROM UNNEST(val3) as val) AS avg_val3
FROM
<table>
val1, val2 and val3 are looked up as columns in <table>, while val within the subqueries is looked up in the table coming from the respective UNNEST().

Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT / EXPECTED OUTPUT (tables not included in the text of the original post)
The logic is that since "abc" is connected to "abcd" in the first row, and "abcd" is connected to "abcde" in the second row, and so on, "abc", "abcd", "abcde", and "abcdef" are all connected and put in one array. The same applies to the remaining rows. The number of connected neighboring rows is arbitrary.
The question is how to do that using Hive script without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag rows first, as in the expected output above. How can I do that using a Hive script only?
This is an example of a CONNECT BY query, which is not supported in Hive or Spark, unlike DB2 or Oracle, et al.
You can simulate such a query with Spark Scala, but it is far from handy. And if you put the tag in first, the question becomes less relevant, in my opinion.
Here is a work-around using Hive script to get the intermediate table.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
select split(u.tmp,",")[0] as node_1, split(u.tmp,",")[1] as node_2
from
(select stack (7,
"abc,abcd",
"abcd,abcde",
"abcde,abcdef",
"bcd,bcde",
"bcde,bcdef",
"cdef,cdefg",
"def,defg"
) as tmp
) u
)
select node_1, node_2, if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag, row_number() OVER (order by node_1) as row_num
from src;
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
SELECT cast(v.tag as int) as tag
FROM (
SELECT
split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx-start_idx), ",$",""), ",") as tags --repeat the row number by the number of rows
FROM (
SELECT COALESCE(lag(row_num, 1) over(ORDER BY row_num), 0) as start_idx, row_num as end_idx, row_number() over (ORDER BY row_num) as key
FROM step1 where tag=0
) a
) b
LATERAL VIEW explode(tags) v as tag
) c ;
drop table if exists step3;
create table step3 STORED as orc as
SELECT
a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
ON a.row_num=b.row_num;
The final table looks like this:
select * from step3;
+---------------+---------------+------------+
| step3.node_1 | step3.node_2 | step3.tag |
+---------------+---------------+------------+
| abc | abcd | 1 |
| abcd | abcde | 1 |
| abcde | abcdef | 1 |
| bcd | bcde | 2 |
| bcde | bcdef | 2 |
| cdef | cdefg | 3 |
| def | defg | 4 |
+---------------+---------------+------------+
The third column can be used to collect node pairs.
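As a final step, a sketch of how that tag column might be used to collect the connected nodes into one array per group (union both node columns of step3 as defined above, then collect_set per tag):

```sql
-- sketch: collect all connected nodes per tag from step3
select tag, collect_set(node) as nodes
from (
  select tag, node_1 as node from step3
  union all
  select tag, node_2 as node from step3
) n
group by tag;
```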

How to count frequency of elements in a bigquery array field

I have a table that looks like this:
I am looking for a table that gives a frequency count of the elements in the fields l_0, l_1, l_2, l_3.
For example the output should look like this:
| author_id | year | l_0.name | l_0.count | l_1.name | l_1.count | l_2.name | l_2.count | l_3.name | l_3.count |
|------------|------|------------------|-----------|------------|-----------|---------------------|-----------|--------------------|-----------|
| 2164089123 | 1987 | biology | 3 | botany | 3 | | | | |
| 2595831531 | 1987 | computer science | 2 | simulation | 2 | computer simulation | 2 | mathematical model | 2 |
Edit:
In some cases the array field might have more than one type of element. For example l_0 could be ['biology', 'biology', 'geometry', 'geometry']. In that case the output for fields l_0, l_1, l_2, l_3 would be a nested repeated field with all the elements in l_0.name and all the corresponding counts in the l_0.count.
This should work, assuming you want to count on a per-array basis:
SELECT
author_id,
year,
(SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
FROM UNNEST(l_0) AS l_0) AS l_0,
(SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
FROM UNNEST(l_1) AS l_1) AS l_1,
(SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
FROM UNNEST(l_2) AS l_2) AS l_2,
(SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;
To avoid so much repetition, you can make use of a SQL UDF:
CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
(SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem)
);
SELECT
author_id,
year,
GetNameAndCount(l_0) AS l_0,
GetNameAndCount(l_1) AS l_1,
GetNameAndCount(l_2) AS l_2,
GetNameAndCount(l_3) AS l_3
FROM YourTable;
If you potentially need to account for multiple different names within an array, you can have the UDF return an array of them with associated counts instead:
CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
ARRAY(
SELECT AS STRUCT elem AS name, COUNT(*) AS count
FROM UNNEST(elements) AS elem
GROUP BY elem
ORDER BY count
)
);
SELECT
author_id,
year,
GetNamesAndCounts(l_0) AS l_0,
GetNamesAndCounts(l_1) AS l_1,
GetNamesAndCounts(l_2) AS l_2,
GetNamesAndCounts(l_3) AS l_3
FROM YourTable;
Note that if you want to perform counting across rows, however, you'll need to unnest the arrays and perform the GROUP BY at the outer level, but it doesn't look like this is your intention based on the question.
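For reference, a sketch of that cross-row variant for a single array column (YourTable and l_0 assumed as above):

```sql
-- sketch: frequency of each element across all rows of one array column
select elem as name, count(*) as count
from YourTable, unnest(l_0) as elem
group by elem
order by count desc;
```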