BigQuery arrays - SELECT DISTINCT ordering guarantees? - sql

I want to filter out the duplicates from a BigQuery array. I also need the order of the elements to be preserved. The docs mention that this can be done by combining SELECT DISTINCT with UNNEST. However, it doesn't mention any ordering behavior. I ran this query and got the desired ordering of [5, 3, 1, 4, 10, 8].
WITH an_array AS (
SELECT [5, 5, 3, 1, 4, 4, 10, 8, 5, 1] AS nums
)
SELECT
ARRAY((
SELECT DISTINCT num
FROM UNNEST(nums) num
))
FROM an_array;
I don't know if that's coincidence or if that ordering is guaranteed. I also tried adding WITH OFFSET with an ORDER BY to specify the order explicitly, but in that case I get Query error: ORDER BY clause expression references table alias offset which is not visible after SELECT DISTINCT.

You should always be explicit about ordering if you care about it:WITH an_array AS (
WITH an_array as (
SELECT [5, 5, 3, 1, 4, 4, 10, 8, 5, 1] AS nums
)
SELECT ARRAY((SELECT num
FROM UNNEST(nums) num WITH OFFSET o
GROUP BY num
ORDER BY MIN(o)
)
)
FROM an_array;

Related

Check if array element appears more than once

Is there a way to check an array of integers for any elements that appear more than once? Either boolean False if there are redundancies or list of offending elements would work.
unnest() the array and then use GROUP BY and HAVING with count() to filter for values that appear more than once.
SELECT un.n
FROM unnest('{1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10}'::integer[]) un (n)
GROUP BY un.n
HAVING count(*) > 1;
To just get a Boolean you can use EXISTS and the above in a subquery.
SELECT EXISTS (SELECT un.n
FROM unnest('{1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10}'::integer[]) un (n)
GROUP BY un.n
HAVING count(*) > 1);
If you have the intarray extension installed, this will return false if you have duplicates:
icount(your_array) = icount(uniq(your_array))
If not, then I would use unnest() to return false in case of duplicates.
select count(unnest) = count(distinct unnest)
from your_table
cross join lateral unnest(your_array)

SQL ARRAY: Select ID from my_table where "arrayvalue" = "defined_arrayvalue"

This is a beginner-question relating arrays. I hope the answer is simple.
The example is taken from Oracle Spatial, but I think it is valid for all arrays.
I have this SELECT:
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO -- column GEOM contains spatial data
FROM
my_table D
I get this result:
73035 MDSYS.SDO_ELEM_INFO_ARRAY(1, 2, 1)
73036 MDSYS.SDO_ELEM_INFO_ARRAY(1, 4, 3, 1, 2, 1, 11, 2, 2, 19, 2, 1)
73037 MDSYS.SDO_ELEM_INFO_ARRAY(1, 2, 1)
Now I want to SELECT all rows where (1,2,1) is defined:
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO
FROM
my_table D
WHERE
-- Pseudo-Code is following
D.GEOM.SDO_ELEM_INFO is "(1, 2, 1)";
So, in simple words: "array_from_row = defined_array".
I found a lot about IMPLODE and TABLE and COLLECT etc. But how to define a clause on two arrays?
Thanks for help!
Try IN clause, you can also use both
SELECT
D.FID
, D.GEOM.SDO_ELEM_INFO
FROM
my_table D
WHERE
D.GEOM.SDO_ELEM_INFO in (1, 2, 1) or ( D.GEOM.SDO_ELEM_INFO = 1 or D.GEOM.SDO_ELEM_INFO = 2 or D.GEOM.SDO_ELEM_INFO = 3);

How do I add arrays in BigQuery SQL?

I have a UDF which returns a floating point array of the same size for each row of a table. How do I sum values of these arrays ?
In other words, how can I do something like this:
create temp function f(...)
returns array<float64>
...;
select sum(f(column)) from table
As the result of this operation I need to get another array of equal size where
result[i] = sum(over rows) f(row, column)[i]
Here is a function that uses ANY TYPE in order to support summing arrays of FLOAT64, INT64, or NUMERIC along with some sample input:
CREATE TEMP FUNCTION ElementWiseSum(arr1 ANY TYPE, arr2 ANY TYPE) AS (
ARRAY(SELECT x + arr2[OFFSET(off)] FROM UNNEST(arr1) AS x WITH OFFSET off ORDER BY off)
);
SELECT arr1, arr2, ElementWiseSum(arr1, arr2) AS result
FROM (
SELECT [1, 2, 3] AS arr1, [4, 5, 6] AS arr2 UNION ALL
SELECT [7, 8], [9, 10] UNION ALL
SELECT [], [] UNION ALL
SELECT [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]
);
It unnests arr1 using WITH OFFSET, then retrieves the equivalent element from arr2 using this offset, and orders by the offset to ensure that the element order is preserved.
Edit: to sum across rows, you can unnest the arrays, compute sums grouped by the offset of the elements, then reaggregate the sums into a new array:
SELECT
ARRAY_AGG(sum ORDER BY off) AS arr
FROM (
SELECT
off,
SUM(x) AS sum
FROM (
SELECT [1, 2, 3] AS arr UNION ALL
SELECT [7, 8, 9] UNION ALL
SELECT [4, 5, 6] UNION ALL
SELECT [10, 11, 12]
), UNNEST(arr) AS x WITH OFFSET off
GROUP BY off
);
So based on your comment, what you are looking for is the sum the values of all your arrays. This is how you can do it using UNNEST operator
WITH mydata AS (
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
union all
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
union all
SELECT [1.4, 1.3, 1.4, 1.1] as myarray
)
SELECT SUM(eachelement) from mydata, UNNEST(myarray) AS eachelement;
If you have your UDF defined (takes in a your column(s) and returns a float64 array of a pre-determined (or fixed) dimensions), you can use a simplified solution. For example in case of 3-d arrays, something like:
create temp function f(...)
returns array<float64>
...;
with dataset as (
select arr[offset(0)] as col_a, arr[offset(1)] as col_b, arr[offset(2)] as col_c
from (
select f(mycolumn) as arr
from `mydataset.mytable`
)
)
select [sum(col_a), sum(col_b), sum(col_c)] as new_array from dataset
This does not directly answer OP's question, but people landing on this page searching for "How do I add arrays in BigQuery SQL?" might benefit.
(Based on #elliott-brossard answer edit) In case you have 2 arrays, but 1 array includes a struct, you can use the following code to add them together:
WITH mydata AS (
SELECT
[1, 2, 3] AS arr
-- ,[7, 8, 9] AS arr2
,[
STRUCT(7 AS timeOnSite)
,STRUCT(8 AS timeOnSite)
,STRUCT(9 AS timeOnSite)
] AS arr2
)
SELECT
(
SELECT
ARRAY_AGG(sum ORDER BY off) AS arr
FROM (
SELECT
off,
SUM(x) AS sum
FROM (
SELECT arr UNION ALL
-- SELECT arr2
SELECT (SELECT ARRAY_AGG(t.timeOnSite) FROM UNNEST(arr2) AS t)
), UNNEST(arr) AS x WITH OFFSET off
GROUP BY off
)
) AS sum_arrays
FROM
mydata

Exclude matched array elements

How can I exclude matched elements of one array from another?
Postgres code:
a1 := '{1, 2, 5, 15}'::int[];
a2 := '{1, 2, 3, 6, 7, 9, 15}'::int[];
a3 := a2 ??magic_operator?? a1;
In a3 I expect exactly '{3, 6, 7, 9}'
Final Result
My and lad2025 solutions works fine.
Solution with array_position() required PostgreSQL 9.5 and later, executes x3 faster.
It looks like XOR between arrays:
WITH set1 AS
(
SELECT * FROM unnest('{1, 2, 5, 15}'::int[])
), set2 AS
(
SELECT * FROM unnest('{1, 2, 3, 6, 7, 9, 15}'::int[])
), xor AS
(
(SELECT * FROM set1
UNION
SELECT * FROM set2)
EXCEPT
(SELECT * FROM set1
INTERSECT
SELECT * FROM set2)
)
SELECT array_agg(unnest ORDER BY unnest)
FROM xor
Output:
"{3,5,6,7,9}"
How it works:
Unnest both arrays
Calculate SUM
Calculate INTERSECT
From SUM - INTERSECT
Combine to array
Alternatively you could use sum of both minus(except) operations:
(A+B) - (A^B)
<=>
(A-B) + (B-A)
Utilizing FULL JOIN:
WITH set1 AS
(
SELECT *
FROM unnest('{1, 2, 5, 15}'::int[])
), set2 AS
(
SELECT *
FROM unnest('{1, 2, 3, 6, 7, 9, 15}'::int[])
)
SELECT array_agg(COALESCE(s1.unnest, s2.unnest)
ORDER BY COALESCE(s1.unnest, s2.unnest))
FROM set1 s1
FULL JOIN set2 s2
ON s1.unnest = s2.unnest
WHERE s1.unnest IS NULL
OR s2.unnest IS NULL;
EDIT:
If you want only elements from second array that are not is first use simple EXCEPT:
SELECT array_agg(unnest ORDER BY unnest)
FROM (SELECT * FROM unnest('{1, 2, 3, 6, 7, 9, 15}'::int[])
EXCEPT
SELECT * FROM unnest('{1, 2, 5, 15}'::int[])) AS sub
Output:
"{3,6,7,9}"
The additional module intarray provides a simple and fast subtraction operator - for integer arrays, exactly the magic_operator you are looking for:
test=# SELECT '{1, 2, 3, 6, 7, 9, 15}'::int[] - '{1, 2, 5, 15}'::int[] AS result;
?column?
-----------
{3,6,7,9}
You need to install the module once per database:
CREATE EXTENSION intarray;
It also provides special operator classes for indexes:
Postgresql intarray error: undefined symbol: pfree
Note that it only works for:
... null-free arrays of integers.
I found a little similar case and modify.
That SQL solve my case.
with elements (element) as (
select unnest(ARRAY[1, 2, 3, 6, 7, 9, 15])
)
select array_agg(element)
from elements
where array_position(ARRAY[1, 2, 5, 15],element) is null
PostgreSQL 9.5 and later required.

How to do equivalent of "limit distinct"?

How can I limit a result set to n distinct values of a given column(s), where the actual number of rows may be higher?
Input table:
client_id, employer_id, other_value
1, 2, abc
1, 3, defg
2, 3, dkfjh
3, 1, ldkfjkj
4, 4, dlkfjk
4, 5, 342
4, 6, dkj
5, 1, dlkfj
6, 1, 34kjf
7, 7, 34kjf
8, 6, lkjkj
8, 7, 23kj
desired output, where limit distinct=5 distinct values of client_id:
1, 2, abc
1, 3, defg
2, 3, dkfjh
3, 1, ldkfjkj
4, 4, dlkfjk
4, 5, 342
4, 6, dkj
5, 1, dlkfj
Platform this is intended for is MySQL.
You can use a subselect
select * from table where client_id in
(select distinct client_id from table order by client_id limit 5)
This is for SQL Server. I can't remember, MySQL may use a LIMIT keyword instead of TOP. That may make the query more efficient if you can get rid of the inner most subquery by using the LIMIT and DISTINCT in the same subquery. (It looks like Vinko used this method and that LIMIT is correct. I'll leave this here for the second possible answer though.)
SELECT
client_id,
employer_id,
other_value
FROM
MyTable
WHERE
client_id IN
(
SELECT TOP 5
client_id
FROM
(
SELECT DISTINCT
client_id
FROM
MyTable
) SQ
ORDER BY
client_id
)
Of course, add in your own WHERE clause and ORDER BY clause in the subquery.
Another possibility (compare performance and see which works out better) is:
SELECT
client_id,
employer_id,
other_value
FROM
MyTable T1
WHERE
T1.code IN
(
SELECT
T2.code
FROM
MyTable T2
WHERE
(SELECT COUNT(*) FROM MyTable T3 WHERE T3,code < T2.code) < 5
)
-- Using Common Table Expression in Microsoft SQL Server.
-- LIMIT function does not exist in MS SQL.
WITH CTE
AS
(SELECT DISTINCT([COLUMN_NAME])
FROM [TABLE_NAME])
SELECT TOP (5) [[COLUMN_NAME]]
FROM CTE;
This works for ‍‍MS SQL if anyone is on that platform:
SET ROWCOUNT 10;
SELECT DISTINCT
column1, column2, column3,...
FROM
Table1
WHERE ...