How to aggregate arrays element by element in BigQuery? - sql

In BigQuery, how can I aggregate arrays element by element?
For instance, if I have this table:
id   array_value
1    [1, 2, 3]
2    [4, 5, 6]
3    [7, 8, 9]
I want to sum all the vectors element-wise and output [1+4+7, 2+5+8, 3+6+9] = [12, 15, 18].
I can sum float fields with SELECT SUM(float_field) FROM table, but when I try to apply SUM to an array I get:
No matching signature for aggregate function SUM for argument types: ARRAY.
Supported signatures: SUM(INT64); SUM(FLOAT64); SUM(NUMERIC); SUM(BIGNUMERIC) at [1:8]
I have found ARRAY_AGG in the doc but it is not what I want: it just creates an array from values.

I think you want:
select array_agg(sum_val order by id) as res
from (
select idx, sum(val) as sum_val
from mytable t
cross join unnest(t.array_value) as val with offset as idx
group by idx
) t
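To check the expected result outside BigQuery, here is a minimal plain-Python sketch of what this kind of query computes: each array is unnested with its offset, values are summed per offset, and the sums are reassembled in offset order. The function name elementwise_sum is hypothetical and only for illustration; this is a model of the logic, not BigQuery code.

```python
from collections import defaultdict


def elementwise_sum(rows):
    """Sum arrays position by position.

    Mirrors: UNNEST(array_value) AS val WITH OFFSET AS idx,
    GROUP BY idx, then ARRAY_AGG(sum_val ORDER BY idx).
    """
    sums = defaultdict(int)
    for array_value in rows:
        for idx, val in enumerate(array_value):  # idx plays the role of the offset
            sums[idx] += val
    # Reassemble the per-offset sums in offset order.
    return [sums[idx] for idx in sorted(sums)]


print(elementwise_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [12, 15, 18]
```

Grouping by offset means the arrays do not need to have the same length: a shorter array simply contributes nothing to the later positions.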

I think you want:
select array_agg(sum_val)
from (select (select sum(val)
from unnest(t.array_value) val
) as sum_val
from t
) x

Technically, you can simply refer to the individual values in the arrays using offset(), or safe_offset() in case there might be missing values:
-- example data
with temp as (
select * from unnest([
struct(1 as id, [1, 2, 3] as array_value),
(2, [4,5,6]),
(3, [7,8])
])
)
-- actual query
select
[
SUM( array_value[safe_offset(0)] ),
SUM( array_value[safe_offset(1)] ),
SUM( array_value[safe_offset(2)] )
] as result_array
from temp
I put them in a result array, but you don't have to. I left one value out of the last array to show that the query doesn't break. If you want it to fail on a missing value, use offset() instead of safe_offset().
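As a rough illustration of why the short array does not break the query, the sketch below models the two pieces in plain Python: safe_offset returning NULL (None here) for a missing position, and SUM skipping NULLs. The helper names are hypothetical and only mirror the SQL; this is not BigQuery code.

```python
def safe_offset(arr, i):
    """Return arr[i], or None when position i does not exist (like SAFE_OFFSET)."""
    return arr[i] if 0 <= i < len(arr) else None


def sum_ignoring_nulls(values):
    """Sum the non-None values; None if there are none (like SUM over NULLs)."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# Same example data as the query above; the last array is missing one value.
rows = [[1, 2, 3], [4, 5, 6], [7, 8]]
result_array = [
    sum_ignoring_nulls(safe_offset(row, i) for row in rows) for i in range(3)
]
print(result_array)  # [12, 15, 9]
```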

Below is for BigQuery Standard SQL
select array_agg(val order by offset)
from (
select offset, sum(val) as val
from `project.dataset.table` t,
unnest(array_value) as val with offset
group by offset
)

Related

How to move this group/concat logic into a function?

Given a column of integers
ids AS (
SELECT
id
FROM
UNNEST([1, 2, 3, 4, 5, 6, 7]) AS id)
I'd like to convert them into the following (batched) string representations:
"1,2,3,4,5"
"6,7"
Currently, I do this as follows:
SELECT
STRING_AGG(CAST(id AS STRING), ',')
FROM (
SELECT
DIV(ROW_NUMBER() OVER() - 1, 5) batch,
id
FROM
ids)
GROUP BY
batch
Since I use this on multiple occasions, I'd like to move this into a function.
Is this possible, and if so how?
(I guess, since we can't pass the table (ids), we'd need to pass an ARRAY<INT64>, but that would be ok.)
I think you might consider the two approaches below.
UDF
Returns the result as ARRAY<STRING>.
CREATE TEMP FUNCTION batched_string(ids ARRAY<INT64>) AS (
ARRAY(
SELECT STRING_AGG('' || id) FROM (
SELECT DIV(offset, 5) batch, id
FROM UNNEST(ids) id WITH offset
) GROUP BY batch
)
);
SELECT * FROM UNNEST(batched_string([1, 2, 3, 4, 5, 6, 7]));
Table function
Returns the result as a table.
Note that a table function can't be a temp function.
CREATE OR REPLACE TABLE FUNCTION `your-project.dataset.batched_string`(ids ARRAY<INT64>) AS (
SELECT STRING_AGG('' || id) batched FROM (
SELECT DIV(offset, 5) batch, id
FROM UNNEST(ids) id WITH offset
) GROUP BY batch
);
SELECT * FROM `your-project.dataset.batched_string`([1, 2, 3, 4, 5, 6, 7]);
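For comparison, the batching logic itself can be sketched in a few lines of plain Python: DIV(offset, 5) assigns each element to a batch of five, and each batch is joined with commas. The function name and the size parameter are assumptions made for this illustration; it models the intended output and is not BigQuery code.

```python
def batched_string(ids, size=5):
    """Join consecutive groups of `size` ids into comma-separated strings."""
    return [
        ",".join(str(i) for i in ids[start:start + size])
        for start in range(0, len(ids), size)
    ]


print(batched_string([1, 2, 3, 4, 5, 6, 7]))  # ['1,2,3,4,5', '6,7']
```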

Count the number of matches in an array in BigQuery

How can I count the number of matches in an array? For example, for the numbers [1,3] there are 2 matches in the array [1,2,3] and 1 match in the array [1,2]. Right now I can only check whether [1,3] is in the array or not.
WITH `arrays` AS (
SELECT 1 id, [1,2,3] as arr
UNION ALL
SELECT 2, [1,2]
UNION ALL
SELECT 3, [3]
)
SELECT id, arr, [1,3] as numbers,
CASE
1 IN UNNEST(arr) and
3 IN UNNEST(arr)
WHEN TRUE THEN 'numbers is in array'
ELSE 'numbers is not in array'
END conclusion
FROM `arrays`
I'm trying to get the number of matches for each row.
With a bit of math, the following seems possible:
If the union of arr and numbers has the same length as arr, then every element of numbers is in the array.
If the union is longer than arr, the extra elements are exactly the elements of numbers that are not in arr.
So numbers_len - (union_len - arr_len) gives the number of matches (the check column).
WITH `arrays` AS (
SELECT 1 id, [1,2,3] as arr
UNION ALL
SELECT 2, [1,2]
UNION ALL
SELECT 3, [3]
),
calculated_arrays AS (
SELECT *, [1,3] as numbers,
ARRAY_LENGTH(ARRAY(SELECT DISTINCT * FROM UNNEST(arr || [1, 3]))) AS union_len,
ARRAY_LENGTH(arr) AS arr_len,
ARRAY_LENGTH([1, 3]) AS numbers_len
FROM `arrays`
)
SELECT id, arr, numbers,
numbers_len - union_len + arr_len AS check,
IF (union_len = arr_len, 'numbers is in array', 'numbers is not in array') AS conclusion
FROM calculated_arrays
;
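The arithmetic behind the check column can be verified with a small plain-Python sketch. It assumes, as the example data does, that neither arr nor numbers contains duplicates; the function name count_matches is hypothetical.

```python
def count_matches(arr, numbers):
    """Count elements of `numbers` present in `arr` via the union-length formula.

    Assumes both lists contain no duplicates.
    """
    union_len = len(set(arr) | set(numbers))
    # Elements that enlarge the union are the ones missing from arr.
    missing = union_len - len(arr)
    return len(numbers) - missing


for arr in ([1, 2, 3], [1, 2], [3]):
    print(arr, count_matches(arr, [1, 3]))
```

Under the no-duplicates assumption, the result equals the size of the intersection of the two lists.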
Consider below approach
with `arrays` as (
select 1 id, [1,2,3] as arr union all
select 2, [1,2] union all
select 3, [3]
)
select *,
( select count(*)
from t.numbers num join t.arr num
using(num)
) check,
( select format('number is %sin array',
if(logical_and(if(num2 is null, false, true)), '', 'not '))
from t.numbers num1 left join t.arr num2
on num1 = num2
) conclusion
from (
select id, arr, [1,3] as numbers
from `arrays`
) t

BigQuery arrays - SELECT DISTINCT ordering guarantees?

I want to filter out the duplicates from a BigQuery array. I also need the order of the elements to be preserved. The docs mention that this can be done by combining SELECT DISTINCT with UNNEST. However, it doesn't mention any ordering behavior. I ran this query and got the desired ordering of [5, 3, 1, 4, 10, 8].
WITH an_array AS (
SELECT [5, 5, 3, 1, 4, 4, 10, 8, 5, 1] AS nums
)
SELECT
ARRAY((
SELECT DISTINCT num
FROM UNNEST(nums) num
))
FROM an_array;
I don't know if that's coincidence or if that ordering is guaranteed. I also tried adding WITH OFFSET with an ORDER BY to specify the order explicitly, but in that case I get Query error: ORDER BY clause expression references table alias offset which is not visible after SELECT DISTINCT.
You should always be explicit about ordering if you care about it:
WITH an_array AS (
SELECT [5, 5, 3, 1, 4, 4, 10, 8, 5, 1] AS nums
)
SELECT ARRAY((SELECT num
FROM UNNEST(nums) num WITH OFFSET o
GROUP BY num
ORDER BY MIN(o)
)
)
FROM an_array;
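The idea of the query above (group by value, order by the smallest offset at which the value appears) can be modeled in plain Python as follows. This is only a sketch to show the expected ordering; the function name is hypothetical.

```python
def distinct_in_order(nums):
    """Remove duplicates, keeping values in order of first appearance.

    Mirrors: GROUP BY num ORDER BY MIN(o) over UNNEST(nums) WITH OFFSET o.
    """
    first_offset = {}
    for o, num in enumerate(nums):
        first_offset.setdefault(num, o)  # keeps the minimum offset per value
    return sorted(first_offset, key=first_offset.get)


print(distinct_in_order([5, 5, 3, 1, 4, 4, 10, 8, 5, 1]))  # [5, 3, 1, 4, 10, 8]
```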

How to use percentile_disc on array

I am able to use approx_quantiles on an array by doing
(select approx_quantiles(reps, 10)[offset(5)] from unnest(arr_tab.arr) as reps) as med,
where arr_tab.arr is an array of values.
I would like to get exact numbers the same way with percentile_disc (the arrays are relatively small), but the following:
(select percentile_disc(reps, .5) from unnest(arr_tab.arr) as reps) as med,
gives the error
Analytic function PERCENTILE_DISC cannot be called without an OVER clause at [17:11]
Here is a full example query, which runs if I comment out the percentile_disc attempt:
with arr_tab as (
SELECT [1, 2, 3] AS arr, 'a' as label UNION ALL
SELECT [4, 5, 6], 'c' UNION ALL
SELECT [10, 11, 12], 'd'
)
, q2 as (
select
label,
(select approx_quantiles(reps, 10)[offset(5)] from unnest(arr_tab.arr) as reps) as med,
-- (select percentile_disc(reps, .5) from unnest(arr_tab.arr) as reps) as med2,
from arr_tab
)
select *
from q2
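PERCENTILE_DISC is an analytic function, so it needs an OVER() clause. It then returns the same value on every row of the unnested array, and LIMIT 1 keeps just one of them: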
(SELECT PERCENTILE_DISC(reps, .5) OVER() FROM UNNEST(arr_tab.arr) AS reps LIMIT 1) AS med2
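To sanity-check the values this should return, here is a plain-Python sketch of a discrete percentile: the first sorted value whose cumulative distribution is greater than or equal to the requested percentile, which is how the BigQuery documentation describes PERCENTILE_DISC. The function below is an illustrative model with a hypothetical name, not the BigQuery implementation.

```python
def percentile_disc(values, percentile):
    """Return the first sorted value whose cumulative distribution >= percentile."""
    ordered = sorted(v for v in values if v is not None)  # skip NULL-like values
    if not ordered:
        return None
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        if position / n >= percentile:
            return value
    return ordered[-1]


# Arrays from the example query above.
for arr in ([1, 2, 3], [4, 5, 6], [10, 11, 12]):
    print(arr, percentile_disc(arr, 0.5))
```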

How do I add arrays in BigQuery SQL?

I have a UDF which returns a floating-point array of the same size for each row of a table. How do I sum the values of these arrays?
In other words, how can I do something like this:
create temp function f(...)
returns array<float64>
...;
select sum(f(column)) from table
As the result of this operation I need to get another array of equal size where
result[i] = sum(over rows) f(row, column)[i]
Here is a function that uses ANY TYPE in order to support summing arrays of FLOAT64, INT64, or NUMERIC along with some sample input:
CREATE TEMP FUNCTION ElementWiseSum(arr1 ANY TYPE, arr2 ANY TYPE) AS (
ARRAY(SELECT x + arr2[OFFSET(off)] FROM UNNEST(arr1) AS x WITH OFFSET off ORDER BY off)
);
SELECT arr1, arr2, ElementWiseSum(arr1, arr2) AS result
FROM (
SELECT [1, 2, 3] AS arr1, [4, 5, 6] AS arr2 UNION ALL
SELECT [7, 8], [9, 10] UNION ALL
SELECT [], [] UNION ALL
SELECT [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]
);
It unnests arr1 using WITH OFFSET, then retrieves the equivalent element from arr2 using this offset, and orders by the offset to ensure that the element order is preserved.
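The same pairing by offset can be sketched in plain Python, which may help when checking the expected rows by hand. The function name is hypothetical; like OFFSET in the SQL version, the lookup fails if arr2 is shorter than arr1.

```python
def element_wise_sum(arr1, arr2):
    """Add arr2[off] to each element x of arr1, preserving element order."""
    # Raises IndexError when arr2 has no element at position off.
    return [x + arr2[off] for off, x in enumerate(arr1)]


print(element_wise_sum([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```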
Edit: to sum across rows, you can unnest the arrays, compute sums grouped by the offset of the elements, then reaggregate the sums into a new array:
SELECT
ARRAY_AGG(sum ORDER BY off) AS arr
FROM (
SELECT
off,
SUM(x) AS sum
FROM (
SELECT [1, 2, 3] AS arr UNION ALL
SELECT [7, 8, 9] UNION ALL
SELECT [4, 5, 6] UNION ALL
SELECT [10, 11, 12]
), UNNEST(arr) AS x WITH OFFSET off
GROUP BY off
);
Based on your comment, what you are looking for is the sum of the values of all your arrays. This is how you can do it using the UNNEST operator:
WITH mydata AS (
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
union all
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
union all
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
)
SELECT SUM(eachelement) from mydata, UNNEST(myarray) AS eachelement;
If you have your UDF defined (it takes your column(s) and returns a float64 array of a predetermined, fixed size), you can use a simplified solution. For example, for arrays with 3 elements, something like:
create temp function f(...)
returns array<float64>
...;
with dataset as (
select arr[offset(0)] as col_a, arr[offset(1)] as col_b, arr[offset(2)] as col_c
from (
select f(mycolumn) as arr
from `mydataset.mytable`
)
)
select [sum(col_a), sum(col_b), sum(col_c)] as new_array from dataset
This does not directly answer OP's question, but people landing on this page searching for "How do I add arrays in BigQuery SQL?" might benefit.
(Based on the edit in Elliott Brossard's answer.) If you have two arrays and one of them is an array of structs, you can use the following code to add them together:
WITH mydata AS (
SELECT
[1, 2, 3] AS arr
-- ,[7, 8, 9] AS arr2
,[
STRUCT(7 AS timeOnSite)
,STRUCT(8 AS timeOnSite)
,STRUCT(9 AS timeOnSite)
] AS arr2
)
SELECT
(
SELECT
ARRAY_AGG(sum ORDER BY off) AS arr
FROM (
SELECT
off,
SUM(x) AS sum
FROM (
SELECT arr UNION ALL
-- SELECT arr2
SELECT (SELECT ARRAY_AGG(t.timeOnSite) FROM UNNEST(arr2) AS t)
), UNNEST(arr) AS x WITH OFFSET off
GROUP BY off
)
) AS sum_arrays
FROM
mydata