Get first N elements from an array in BigQuery table - sql

I have an array column and I would like to get the first N elements of it (keeping the array data type). Is there a nice way to do it? Ideally without unnesting, ranking, and ARRAY_AGG back to an array.
I could also do this (for getting first 2 elements):
WITH data AS
(
SELECT 1001 as id, ['a', 'b', 'c'] as array_1
UNION ALL
SELECT 1002 as id, ['d', 'e', 'f', 'g'] as array_1
UNION ALL
SELECT 1003 as id, ['h', 'i'] as array_1
)
select *,
[array_1[SAFE_OFFSET(0)], array_1[SAFE_OFFSET(1)]] as my_result
from data
But obviously this is not a nice solution, as it fails whenever an array has fewer than two elements: SAFE_OFFSET returns NULL for the missing positions, and BigQuery rejects arrays containing NULL elements.

Here's a general solution with a UDF that you can call for any array type:
CREATE TEMP FUNCTION TopN(arr ANY TYPE, n INT64) AS (
ARRAY(SELECT x FROM UNNEST(arr) AS x WITH OFFSET off WHERE off < n ORDER BY off)
);
WITH data AS
(
SELECT 1001 as id, ['a', 'b', 'c'] as array_1
UNION ALL
SELECT 1002 as id, ['d', 'e', 'f', 'g'] as array_1
UNION ALL
SELECT 1003 as id, ['h', 'i'] as array_1
)
select *, TopN(array_1, 2) AS my_result
from data
It uses UNNEST and the ARRAY function, which it sounds like you wanted to avoid, but it has the advantage of being general enough that you can pass any array type to it.

Another option for BigQuery Standard SQL (with JS UDF)
#standardSQL
CREATE TEMP FUNCTION FirstN(arr ARRAY<STRING>, N FLOAT64)
RETURNS ARRAY<STRING> LANGUAGE js AS """
return arr.slice(0, N);
""";
SELECT *,
FirstN(array_1, 3) AS my_result
FROM data
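The N parameter is declared FLOAT64 because BigQuery JavaScript UDFs do not accept INT64 arguments. The UDF body itself is plain JavaScript, and Array.prototype.slice clamps the end index, so arrays shorter than N are handled safely. A quick standalone check of the logic (outside BigQuery):

```javascript
// Standalone check of the UDF body: slice clamps the end index,
// so arrays shorter than n are returned whole and never throw.
function firstN(arr, n) {
  return arr.slice(0, n);
}

console.log(firstN(['a', 'b', 'c'], 2)); // [ 'a', 'b' ]
console.log(firstN(['h', 'i'], 3));      // [ 'h', 'i' ]
console.log(firstN([], 2));              // []
```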

Related

How to convert a BigQuery array row into a column and then count [duplicate]

This question already has answers here:
Bigquery SQL: convert array to columns
I have a table like as shown below
[sample table AS-IS]
As shown, I would like to count each value in the "Value" column per ID, as in the example below.
[sample table TO-BE]
The data in the "Value" column is subject to change.
Can you help?
You might consider the PIVOT operator combined with dynamic SQL:
CREATE TEMP TABLE sample_table AS
SELECT '111' ID, ['A', 'B', 'C', 'D'] Value UNION ALL
SELECT '222', ['E', 'F', 'G'] UNION ALL
SELECT '222', ['A', 'H', 'D'];
EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM (
SELECT ID, v FROM sample_table, UNNEST(Value) v
) PIVOT (COUNT(v) FOR v IN ('%s'))
""", (SELECT STRING_AGG(DISTINCT v, "','") FROM (SELECT ID, v FROM sample_table, UNNEST(Value) v)));

How to get all overlapping (ordered) 3-tuples from an array in BigQuery

Given a table like the following
elems
['a', 'b', 'c', 'd', 'e']
['v', 'w', 'x', 'y']
I'd like to transform it into something like this:
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['v', 'w', 'x']
['w', 'x', 'y']
I.e., I'd like to get all overlapping 3-tuples.
My current attempt looks as follows:
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3
But the problem is, it returns some unwanted rows too, i.e., the ones that are "between" the original rows from the foo table.
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['d', 'e', 'v'] <--- unwanted
['e', 'v', 'w'] <--- unwanted
['v', 'w', 'x']
['w', 'x', 'y']
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance because of the low cardinality? (I guess there may be a simpler solution without this intermediate step.)
Consider below approach
select [elems[offset(index - 1)], elems[offset(index)], elems[offset(index + 1)]] as tuple
from your_table, unnest([array_length(elems)]) len,
unnest(generate_array(1, len - 2)) index
If applied to the sample data in your question, this produces the expected output.
You might consider the query below.
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance because of the low cardinality?
AFAIK, it is not guaranteed without explicitly using WITH OFFSET in the query.
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem WITH OFFSET
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (PARTITION BY FORMAT('%t', elems) ORDER BY offset ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3;
Just to give you another idea
create temp function slice(arr ARRAY<string>, pos float64, len float64)
returns array<string> language js as
r"return arr.slice(pos, pos + len);";
select slice(elems, index, 3) as tuple
from foo, unnest([array_length(elems)]) len,
unnest(generate_array(0, len - 3)) index
I'll leave it to you to refactor the above query to the point where it looks something like
select tuple
from foo, unnest(slices(elems, 3)) as tuple
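The slices() helper hinted at above could look like the sketch below in JavaScript. Note this only shows the windowing logic: an actual BigQuery JS UDF could not return ARRAY<ARRAY<STRING>> directly, since BigQuery forbids nested arrays, so each window would have to be wrapped in a STRUCT.

```javascript
// Return every overlapping window of length `len` from arr,
// in the original order. Arrays shorter than len yield no windows.
function slices(arr, len) {
  const out = [];
  for (let i = 0; i + len <= arr.length; i++) {
    out.push(arr.slice(i, i + len));
  }
  return out;
}

console.log(slices(['a', 'b', 'c', 'd', 'e'], 3));
// [ [ 'a', 'b', 'c' ], [ 'b', 'c', 'd' ], [ 'c', 'd', 'e' ] ]
```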

PostgreSQL: Select unique rows where distinct values are in list

Say that I have the following table:
with data as (
select 'John' "name", 'A' "tag", 10 "count"
union all select 'John', 'B', 20
union all select 'Jane', 'A', 30
union all select 'Judith', 'A', 40
union all select 'Judith', 'B', 50
union all select 'Judith', 'C', 60
union all select 'Jason', 'D', 70
)
I know there are a number of distinct tag values, namely (A, B, C, D).
I would like to select the unique names that only have the tag A
I can get close by doing
-- wrong!
select
distinct("name")
from data
group by "name"
having count(distinct tag) = 1
however, this will include unique names that have only one distinct tag, regardless of which tag it is.
I am using PostgreSQL, although having more generic solutions would be great.
You're almost there - you already have groups with one tag, now just test if it is the tag you want:
select
distinct("name")
from data
group by "name"
having count(distinct tag) = 1 and max(tag)='A'
(Note: max could be min as well; SQL just doesn't have a single() aggregate function, but that's a different story.)
You can use not exists here:
select distinct "name"
from data d
where "tag" = 'A'
and not exists (
select * from data d2
where d2."name" = d."name" and d2."tag" != d."tag"
);
This is one possible way of solving it:
select
distinct("name")
from data
where "name" not in (
-- create list of names we want to exclude
select distinct name from data where "tag" != 'A'
)
But I don't know if it's the best or most efficient one.
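All three queries encode the same set logic: a name qualifies when its set of tags is exactly {A}. A minimal sketch of that logic in JavaScript, using the sample data from the question:

```javascript
// Group tags by name, then keep names whose tag set is exactly {'A'}.
const data = [
  ['John', 'A'], ['John', 'B'], ['Jane', 'A'],
  ['Judith', 'A'], ['Judith', 'B'], ['Judith', 'C'], ['Jason', 'D'],
];

const tagsByName = new Map();
for (const [name, tag] of data) {
  if (!tagsByName.has(name)) tagsByName.set(name, new Set());
  tagsByName.get(name).add(tag);
}

const onlyA = [...tagsByName]
  .filter(([, tags]) => tags.size === 1 && tags.has('A'))
  .map(([name]) => name);

console.log(onlyA); // [ 'Jane' ]
```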

Extract last N elements of an array in SQL (hive)

I have a column with arrays and I want to extract the X last elements in an array.
Example trying to extract the last two elements:
Column A
['a', 'b', 'c']
['d', 'e']
['f', 'g', 'h', 'i']
Expected output:
Column A
['b', 'c']
['d', 'e']
['h', 'i']
The best-case scenario would be to do it without using a UDF.
One method: reverse the concatenated string, explode it, filter, and re-assemble the array (note that reverse operates on the whole concatenated string, so this works as written only for single-character elements):
with your_table as (
select stack (4,
0, array(), --empty array to check it works if no elements or less than n
1, array('a', 'b', 'c'),
2, array('d', 'e'),
3, array('f', 'g', 'h', 'i')
) as (id, col_A)
)
select s.id, collect_list(s.value) as col_A
from
(select s.id, a.value, a.pos
from your_table s
lateral view outer posexplode(split(reverse(concat_ws(',',s.col_A)),',')) a as pos, value
where a.pos between 0 and 1 --last two (use n-1 instead of 1 if you want last n)
distribute by s.id sort by a.pos desc --keep original order
)s
group by s.id
Result:
s.id col_a
0 []
1 ["b","c"]
2 ["d","e"]
3 ["h","i"]
A more elegant way uses the Brickhouse numeric_range UDF, as shown in the linked answer.
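For comparison, the whole reverse/explode/re-assemble dance expresses what a one-line slice does in a general-purpose language. A sketch of the intended logic (hypothetical helper, not Hive code):

```javascript
// Return the last n elements of arr. The start index is clamped to 0,
// so arrays shorter than n are returned whole, and n = 0 yields [].
function lastN(arr, n) {
  return arr.slice(Math.max(arr.length - n, 0));
}

console.log(lastN(['a', 'b', 'c'], 2));      // [ 'b', 'c' ]
console.log(lastN(['d', 'e'], 2));           // [ 'd', 'e' ]
console.log(lastN(['f', 'g', 'h', 'i'], 2)); // [ 'h', 'i' ]
console.log(lastN([], 2));                   // []
```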

Conditionally include fields in a Presto query

I have a presto query which works as expected:
SELECT json_format(cast(MAP(ARRAY['random_name'],
ARRAY[
MAP(
ARRAY['a', 'b', 'c'],
ARRAY[a, b, c]
)]
) as JSON)) as metadata from a_table_with_a_b_c; -- a, b, c are all INTs
Now I only want to include a, b, and c when they are larger than 0. How do I change the query? I can add CASE WHEN, but it seems I will end up with 'a': null instead of the key being omitted entirely.
You can try this:
with rows as (
select row_number() over() as row_id, ARRAY['a', 'b', 'c'] as keys, ARRAY[a, b, c] as vals
from a_table_with_a_b_c
)
select
json_format(cast(MAP(ARRAY['random_name'],
ARRAY[
MAP(
array_agg(key), array_agg(value)
)]
) as JSON)) as metadata
from rows
cross join unnest (keys, vals) as t (key, value)
where value is not null and value > 0
group by row_id;
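The key move in this query is zipping the keys with the values, filtering the pairs, and only then re-aggregating into a map, rather than building the map up front. The same idea in miniature (plain JavaScript, illustrative only):

```javascript
// Zip keys with values, drop pairs that are null or <= 0,
// then assemble the surviving pairs into the output object.
function buildMetadata(keys, vals) {
  const kept = {};
  keys.forEach((k, i) => {
    if (vals[i] != null && vals[i] > 0) kept[k] = vals[i];
  });
  return { random_name: kept };
}

console.log(JSON.stringify(buildMetadata(['a', 'b', 'c'], [1, 0, 3])));
// {"random_name":{"a":1,"c":3}}
```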