Statistical functions on columns of arrays in BigQuery - sql

If I have data that looks like the following:
+------+-----------+--------+
| a    | val       | val2   |
+------+-----------+--------+
| 3.14 | [1, 2, 3] | [2, 3] |
| 1.59 | [7, 8, 9] | ...    |
| -1   | [4, 5, 6] | ...    |
+------+-----------+--------+
and I want to get the array averages of the val column, naively I'd want to just try something like
SELECT
AVG(val)
FROM
<Table>
But that doesn't work. I get an error like No matching signature for aggregate function AVG for argument types: ARRAY<INT64>. Supported signatures: AVG(INT64); AVG(UINT64); AVG(DOUBLE); AVG(NUMERIC)
I know that if I have just one column val I can do something like
SELECT avg
FROM
(
SELECT AVG(val) as avg
FROM UNNEST(val) AS val
)
but what if I have multiple columns (val, val2, etc.) and need multiple statistics? The above method just seems really cumbersome.
To be clear, the result I'd want is:
+------+---------+-------------+----------+
| a    | avg_val | std_dev_val | avg_val2 |
+------+---------+-------------+----------+
| 3.14 | 2       | 1           | ...      |
| 1.59 | 8       | ...         | ...      |
| -1   | 5       | ...         | ...      |
+------+---------+-------------+----------+
Is there a simple way to do this? Or do I need to create some sort of temporary function to accomplish this? Or am I stuck doing something like what I see in https://stackoverflow.com/a/45560462/1902480

If you want the element-wise averages across all rows as a single array, you can unnest with the element offset and then reaggregate:
select array_agg(avg_val order by n)
from (select n, avg(v) as avg_val
from t cross join
unnest(t.val) v with offset n
group by n
) a;
EDIT:
If you want the values per row, just use scalar subqueries with unnest():
select t.*,
(select avg(el)
from unnest(t.val) el
) as avg_val,
(select avg(el)
from unnest(t.val2) el
) as avg_val2
from t;
And so on for whatever aggregation functions you want.
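For the exact shape asked for in the question, the subqueries can be combined in one select; a sketch, assuming the table is t and using BigQuery's STDDEV (which is the sample standard deviation):
select t.a,
(select avg(el) from unnest(t.val) el) as avg_val,
(select stddev(el) from unnest(t.val) el) as std_dev_val,
(select avg(el) from unnest(t.val2) el) as avg_val2
from t;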

Consider below example
#standardSQL
create temp function array_avg(arr any type) as ((
select avg(val) from unnest(arr) val
));
create temp function array_std_dev(arr any type) as ((
select stddev(val) from unnest(arr) val
));
select a,
val, array_avg(val) val_avg, array_std_dev(val) val_stddev,
val2, array_avg(val2) val2_avg, array_std_dev(val2) val2_stddev
from `project.dataset.table`
Applied to the sample data in your question, this returns one row per input row with the statistics alongside each array (for the first row: val_avg = 2.0, val_stddev = 1.0, val2_avg = 2.5, val2_stddev ≈ 0.707).

I think simple subqueries should be fine - AVG() only works with tables and UNNEST() turns arrays into tables - so you can just combine them:
SELECT
(SELECT AVG(val) FROM UNNEST(val1) as val) AS avg_val1,
(SELECT AVG(val) FROM UNNEST(val2) as val) AS avg_val2,
(SELECT AVG(val) FROM UNNEST(val3) as val) AS avg_val3
FROM
<table>
val1, val2 and val3 are looked up as columns in <table>, while val within each subquery is resolved against the table produced by the respective UNNEST().

Related

Hive: Aggregate values by attribute into a JSON or MAP field

I have a table that looks like this:
| user | attribute | value |
|------|-----------|-------|
| 1    | A         | 10    |
| 1    | A         | 20    |
| 1    | B         | 5     |
| 2    | B         | 10    |
| 2    | B         | 15    |
| 2    | C         | 100   |
| 2    | C         | 200   |
I'd like to group this table by user and collect the sum of the value field into a JSON or a MAP with attributes as keys, like:
| user | sum_values_by_attribute |
|------|-------------------------|
| 1    | {"A": 30, "B": 15}      |
| 2    | {"B": 25, "C": 300}     |
Is there a way to do that in Hive?
I've found related questions such as this and this but none consider the case of a summation over values.
A JSON string corresponding to map<string, int> can be built in Hive using native functions only: aggregate by user and attribute, concatenate "key": value pairs, collect them into an array, concatenate the array using concat_ws, and add curly braces.
Demo:
with initial_data as (
select stack(7,
1,'A',40,
1,'A',20,
1,'B',5,
2,'B',10,
2,'B',15,
2,'C',100,
2,'C',200) as (`user`, attribute, value )
)
select `user`, concat('{',concat_ws(',',collect_set(concat('"', attribute, '": ',sum_value))), '}') as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result ( JSON string ):
user sum_values_by_attribute
1 {"A": 60,"B": 5}
2 {"B": 25,"C": 300}
Note: if you are running this on Spark, you can cast the result as map<string, int>; Hive does not support casting to complex types.
Also a map<string, string> can easily be built using native functions only: create the same array of key-value pairs but without the double quotes (like A:10), concatenate it to a comma-delimited string using concat_ws, and convert it to a map using the str_to_map function (the same WITH CTE is skipped):
select `user`, str_to_map(concat_ws(',',collect_set(concat(attribute, ':',sum_value)))) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result ( map<string, string> ):
user sum_values_by_attribute
1 {"A":"60","B":"5"}
2 {"B":"25","C":"300"}
And if you need map<string, int> in Hive, unfortunately, it cannot be done using native functions only, because str_to_map returns map<string, string>, not map<string, int>. You can try the brickhouse collect function:
add jar '~/brickhouse/target/brickhouse-0.6.0.jar'; --check brickhouse site https://github.com/klout/brickhouse for instructions
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select `user`, collect(attribute, sum_value) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
You can first calculate the sum by attribute and user and then use collect_list.
Please let me know if the output below is fine.
SQL below -
select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute
from
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,att) tmp2
group by `user`;
Testing Query -
create table tmp2 ( `user` int, `attribute` string, `value` int);
insert into tmp2 select 1,'A',40;
insert into tmp2 select 1,'A',20;
insert into tmp2 select 1,'B',5;
insert into tmp2 select 2,'C',20;
insert into tmp2 select 1,'B',10;
insert into tmp2 select 2,'B',10;
insert into tmp2 select 2,'C',10;
select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute
from
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,att) tmp2
group by `user`;

how to merge two columns and make a new column in sql

I have to merge two columns into one column that holds different values.
I wrote this code but don't know how to continue:
SELECT Title, ...
FROM BuyItems
input:
| title | UK | US | total |
|-------|----|----|-------|
| coca  | 3  | 5  | 8     |
| cake  | 2  | 0  | 2     |
output:
| title | Origin | Total |
|-------|--------|-------|
| coca  | UK     | 3     |
| coca  | US     | 5     |
| cake  | UK     | 2     |
You can use CROSS APPLY and a table value constructor to do this:
-- EXAMPLE DATA START
WITH BuyItems AS
( SELECT x.title, x.UK, x.US
FROM (VALUES ('coca', 3, 5), ('cake', 2, 0)) x (title, UK, US)
)
-- EXAMPLE DATA END
SELECT bi.Title, upvt.Origin, upvt.Total
FROM BuyItems AS bi
CROSS APPLY (VALUES ('UK', bi.UK), ('US', bi.US)) upvt (Origin, Total)
WHERE upvt.Total <> 0;
Alternatively, you can use the UNPIVOT function:
-- EXAMPLE DATA START
WITH BuyItems AS
( SELECT x.title, x.UK, x.US
FROM (VALUES ('coca', 3, 5), ('cake', 2, 0)) x (title, UK, US)
)
-- EXAMPLE DATA END
SELECT upvt.Title, upvt.Origin, upvt.Total
FROM BuyItems AS bi
UNPIVOT (Total FOR Origin IN (UK, US)) AS upvt
WHERE upvt.Total <> 0;
My preference is usually for the former, as it is much more flexible: you can use explicit casting to combine columns of different types, or unpivot multiple columns at once. UNPIVOT works just fine and there is no reason not to use it, but since UNPIVOT handles only limited scenarios while CROSS APPLY/VALUES works in all of them, I go for CROSS APPLY by default.
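For instance, a sketch of unpivoting two measures at once with CROSS APPLY/VALUES (the UK_Qty/UK_Price/US_Qty/US_Price columns are hypothetical, not from the question):
-- one output row per region, carrying both measures
SELECT bi.Title, upvt.Origin, upvt.Qty, upvt.Price
FROM BuyItems AS bi
CROSS APPLY (VALUES
    ('UK', bi.UK_Qty, bi.UK_Price),
    ('US', bi.US_Qty, bi.US_Price)
) upvt (Origin, Qty, Price);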
Use apply:
select v.*
from t cross apply
(values (t.title, 'UK', uk), (t.title, 'US', us)
) v(title, origin, total)
where v.total > 0;
This is a simple unpivot:
SELECT YT.title,
V.Origin,
V.Total
FROM dbo.YourTable YT
CROSS APPLY (VALUES('UK',UK),
('US',US))V(Origin,Total);

How do I accomplish "cascading grouping" of columns in ANSI SQL?

I have a Presto SQL table that looks something like this:
| tenant | type | environment                                     |
|--------|------|-------------------------------------------------|
| X      | A    | http:a.b.c(foo)/http:a.b.c(bar)/http:a.b.c(baz) |
| X      | A    | http:d.e.f(foo)/http:d.e.f(bar)/http:d.e.f(baz) |
| X      | A    | http:g.h.i(foo)                                 |
| X      | B    | http:g.h.i(foo)/http:g.h.i(bar)                 |
All columns are of type string.
I need to produce output that counts each environment type (foo, bar, or baz) per tenant and type. I.e. the above data should be listed somewhat like this:
X A foo 3
bar 2
baz 2
X B foo 1
bar 1
I've been trying queries like this:
SELECT "tenant_id", "type_id", "environment", count(*) AS total_count
FROM "tenant_table"
WHERE "environment" LIKE '%foo%'
GROUP BY "tenant_id", "type_id", "environment";
But I'm not getting the output I need. I do have a little bit of flexibility of changing the data types. The data comes from a CSV file originally. For example, if it makes things easier to redefine the type of the "environment" column to something like an array, that is an option. Any help in resolving this would be greatly appreciated. Thanks.
If that's a fixed list of values, with at most one occurrence per string, then you can put it in a derived table and use like to search for matches:
select t.tenant, t.type, v.val, count(*) cnt
from tenant_db t
inner join (values ('foo'), ('bar'), ('baz')) v(val)
on t.environment like '%' || v.val || '%'
group by t.tenant, t.type, v.val
Depending on your requirements, you might want to narrow the search criteria in order to avoid false positives, maybe by including the parentheses:
on t.environment like '%(' || v.val || ')%'
Or using a regex.
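A sketch of that regex variant, using Presto's regexp_like with the parentheses escaped so the value must appear exactly between them:
on regexp_like(t.environment, '\(' || v.val || '\)')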
You can extract the values with regexp_extract_all and use UNNEST to "flatten" the resulting arrays before computing the aggregation:
WITH data(tenant, type, environment) AS (
VALUES
('X', 'A', 'http:a.b.c(foo)/http:a.b.c(bar)/http:a.b.c(baz)'),
('X', 'A', 'http:d.e.f(foo)/http:d.e.f(bar)/http:d.e.f(baz)'),
('X', 'A', 'http:g.h.i(foo)'),
('X', 'B', 'http:g.h.i(foo)/http:g.h.i(bar)')
)
SELECT tenant, type, value, count(*)
FROM data, UNNEST(regexp_extract_all(data.environment, '\(([^\)]+)\)', 1)) t(value)
GROUP BY tenant, type, value
produces:
tenant | type | value | _col3
--------+------+-------+-------
X | A | baz | 2
X | A | bar | 2
X | A | foo | 3
X | B | bar | 1
X | B | foo | 1

Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT and EXPECTED OUTPUT were given as images; the input is a table of string pairs (node_1, node_2) such as (abc, abcd) and (abcd, abcde), as reproduced in the answer below.
The logic is that since "abc" is connected to "abcd" in the first row and "abcd" is connected to "abcde" in the second row and so on, "abc", "abcd", "abcde", and "abcdef" are all connected and should be put into one array. The same applies to the remaining rows. The number of connected neighboring rows is arbitrary.
The question is how to do that using a Hive script without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag the rows first with a group id (my sketch of that intermediate table was also an image).
How to do that using Hive script only?
This is an example of a CONNECT BY query, which is not supported in Hive or Spark, unlike DB2 or Oracle et al.
You can simulate such a query with Spark Scala, but it is far from handy. And once a tag column is in place, the hard part of the question is already solved, imo.
Here is a work-around using Hive script to get the intermediate table.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
select split(u.tmp,",")[0] as node_1, split(u.tmp,",")[1] as node_2
from
(select stack (7,
"abc,abcd",
"abcd,abcde",
"abcde,abcdef",
"bcd,bcde",
"bcde,bcdef",
"cdef,cdefg",
"def,defg"
) as tmp
) u
)
select node_1, node_2, if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag, row_number() OVER (order by node_1) as row_num --tag=0 marks the last row of each connected chain
from src;
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
SELECT cast(v.tag as int) as tag
FROM (
SELECT
split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx-start_idx), ",$",""), ",") as tags --repeat each group key once per row in its chain
FROM (
SELECT COALESCE(lag(row_num, 1) over(ORDER BY row_num), 0) as start_idx, row_num as end_idx, row_number() over (ORDER BY row_num) as key --chain boundaries (tag=0 rows), numbered as group keys
FROM step1 where tag=0
) a
) b
LATERAL VIEW explode(tags) v as tag --one output row per repetition
) c ;
drop table if exists step3;
create table step3 STORED as orc as
SELECT
a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
ON a.row_num=b.row_num;
The final table looks like
select * from step3;
+---------------+---------------+------------+
| step3.node_1 | step3.node_2 | step3.tag |
+---------------+---------------+------------+
| abc | abcd | 1 |
| abcd | abcde | 1 |
| abcde | abcdef | 1 |
| bcd | bcde | 2 |
| bcde | bcdef | 2 |
| cdef | cdefg | 3 |
| def | defg | 4 |
+---------------+---------------+------------+
The third column can be used to collect node pairs.
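From there, a sketch of the final collection step (collect_set also deduplicates the shared middle nodes):
select tag, collect_set(node) as connected_nodes
from (
select tag, node_1 as node from step3
union all
select tag, node_2 as node from step3
) n
group by tag;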

Postgres GROUP BY Array Column

I use Postgres and have a table like this:
id | arr
-------------------
1 | [A,B,C]
2 | [C,B,A]
3 | [A,A,B]
4 | [B,A,B]
I created a GROUP BY 'arr' query.
SELECT COUNT(*) AS total, "arr" FROM "table" GROUP BY "arr"
... and the result :
total | arr
-------------------
1 | [A,B,C]
1 | [C,B,A]
1 | [A,A,B]
1 | [B,A,B]
BUT, since [A,B,C] and [C,B,A] have the same elements, so i expected the result should be like this :
total | arr
-------------------
2 | [A,B,C]
2 | [A,A,B]
Did I miss something in the query, or something else? Please help me.
You do not need to create a separate function to do this. It can all be done in a single statement:
select array(select unnest(arr) order by 1) as sorted_arr, count(*)
from t
group by sorted_arr;
Here is a rextester.
[A,B,C] and [C,B,A] are different arrays: even though they contain the same elements, the elements are not in the same positions, so a GROUP BY clause will never group them together. If you want them treated as equivalent, you need to sort them first.
This thread has info about sorting arrays.
You should do something like:
SELECT COUNT(*) AS total, array_sort("arr") FROM "table" GROUP BY array_sort("arr")
After creating a sort function like the one proposed there:
CREATE OR REPLACE FUNCTION array_sort (ANYARRAY)
RETURNS ANYARRAY LANGUAGE SQL
AS $$
SELECT ARRAY(SELECT unnest($1) ORDER BY 1)
$$;
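With the function in place, a quick check against the sample data (a sketch; the table name t is assumed):
SELECT array_sort(arr) AS sorted_arr, COUNT(*) AS total
FROM t
GROUP BY array_sort(arr);
-- groups [A,B,C] with [C,B,A]; note that [A,A,B] and [B,A,B] sort to
-- {A,A,B} and {A,B,B} respectively, so those two rows remain in separate groups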