Comparing two arrays in BigQuery - google-bigquery

Postgres has a nice feature: arrays can be compared directly. For example:
SELECT ARRAY[1,2] < ARRAY[1,3];
# t
How could this behavior be emulated in BigQuery, for example with a function?

Consider the hint below. Note that format('%t', ...) renders each array as a string, so this is a lexicographic string comparison; it agrees with true array ordering only while the string order matches the element order (multi-digit numbers, for instance, can break it).
select format('%t', ARRAY[1,2]) < format('%t', ARRAY[1,3])

You can join the arrays on their offsets and compare the values at the first offset where they differ. Note the ORDER BY offset: without it, LIMIT 1 may return an arbitrary differing offset rather than the first one. Here is an example:
WITH
  a AS (SELECT * FROM UNNEST([1,2]) AS elem WITH OFFSET),
  b AS (SELECT * FROM UNNEST([1,3]) AS elem WITH OFFSET)
SELECT
  a.elem < b.elem
FROM
  a JOIN b USING (offset)
WHERE
  a.elem != b.elem
ORDER BY offset
LIMIT 1
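For reference, the element-wise logic the join emulates can be sketched outside SQL. Here is a minimal Python sketch (the function name is my own) that mirrors Postgres array ordering, including the length tie-break that the join-on-offset query above does not handle:

```python
def array_lt(a, b):
    """True if a < b under Postgres-style array ordering:
    compare element-wise at the first differing offset;
    if one array is a prefix of the other, the shorter sorts first."""
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return len(a) < len(b)

print(array_lt([1, 2], [1, 3]))   # True, like SELECT ARRAY[1,2] < ARRAY[1,3]
```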

Related

selecting row if value of attribute in array of objects one of multiple values

I have data shaped like this: arrays of objects in a jsonb column in postgres
id | data
---+------------------------------------------------
 1 | [{"a":3, "b":"green"},  {"a":5, "b":"blue"}]
 2 | [{"a":3, "b":"red"},    {"a":5, "b":"yellow"}]
 3 | [{"a":3, "b":"orange"}, {"a":5, "b":"blue"}]
I am trying to select the rows where b is either "green" or "yellow"
I know I can unroll the data using jsonb_array_elements to get all the b values
select jsonb_array_elements(data) ->> 'b' from table
but I am failing to use that in a where query like this
select * from table where jsonb_array_elements(data) ->> 'b' && ARRAY["green","yellow"]::varchar[]
(not working "set-returning functions are not allowed in WHERE")
You can use the @> containment operator:
select *
from the_table
where data @> '[{"b": "green"}]'
   or data @> '[{"b": "yellow"}]'
Or a JSON path predicate with the @@ operator:
select *
from the_table
where data @@ '$[*].b == "green" || $[*].b == "yellow"';
Or by unnesting the array with an EXISTS condition:
select t.*
from the_table t
where exists (select *
              from jsonb_array_elements(t.data) as x(item)
              where x.item ->> 'b' in ('green', 'yellow'))
You can try a subquery with a column alias and ANY, like below:
SELECT *
FROM (
  select *, jsonb_array_elements(data) ->> 'b' AS val
  from t
) t1
WHERE t1.val = ANY (ARRAY['green','yellow'])
sqlfiddle
Note: the array values need single quotes, not double quotes.
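To make the intended filter concrete, here is a small Python sketch of the same predicate, with the sample rows copied from the question:

```python
rows = [
    (1, [{"a": 3, "b": "green"},  {"a": 5, "b": "blue"}]),
    (2, [{"a": 3, "b": "red"},    {"a": 5, "b": "yellow"}]),
    (3, [{"a": 3, "b": "orange"}, {"a": 5, "b": "blue"}]),
]
wanted = {"green", "yellow"}

# keep a row if ANY object in its array has b in the wanted set
matches = [row_id for row_id, data in rows
           if any(obj.get("b") in wanted for obj in data)]
print(matches)   # [1, 2]
```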

Running nested CTEs

On the BigQuery SELECT syntax page it gives the following:
query_statement:
query_expr
query_expr:
[ WITH cte[, ...] ]
{ select | ( query_expr ) | set_operation }
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ LIMIT count [ OFFSET skip_rows ] ]
I understand how the (second line) select could be either:
{ select | set_operation }
But what is the ( query_expr ) in the middle for? For example, if it can refer to itself, wouldn't it make the possibility to construct a lisp-like query such as:
with x as (select 1 a)
(with y as (select 2 b)
(with z as (select 3 c)
select * from x, y, z))
Actually, I just tested it and the answer is yes. If so, what would be an actual use case of the above construction where you can use ( query_expr ) ?
And is there ever a case where a nested CTE can do something that multiple CTEs cannot? (For example, the current answer is just a verbose way of writing what would more properly be written as a single WITH clause with multiple CTEs.)
The following construction might be useful to model how an ETL flow might work and to encapsulate certain 'steps' that you don't want the outer query to have access to. But this is quite a stretch...
WITH output AS (
  WITH step_csv_query AS (
    SELECT * FROM 'Sales.CSV'),
  step_filter_csv AS (
    SELECT * FROM step_csv_query WHERE Country='US'),
  step_mysql_query AS (
    SELECT * FROM MySQL1 LEFT OUTER JOIN MySQL2...),
  step_join_queries AS (
    SELECT * FROM step_filter_csv INNER JOIN step_mysql_query USING (id)
  ) SELECT * FROM step_join_queries -- output is last step
) SELECT * FROM output -- or whatever we want to do with the output...
This construction is also useful when a CTE is referred to by subsequent CTEs.
For instance, you can use this if you want to join two tables, use expressions and query the resulting table.
with x as (select '2' as id, 'sample' as name)
(with y as (select '2' as number, 'customer' as type)
(with z as (select CONCAT('C00', id), name, type from x inner join y on x.id = y.number)
select * from z))
The above query returns a single row: C002, sample, customer.
Though there are other ways to achieve the same, the above method would be much easier for debugging.
In nested CTEs the same CTE alias can be reused, which is not possible among multiple CTEs at the same level. For example, in the following query each inner CTE shadows the outer CTE with the same alias, so the innermost x wins:
with x as (select '1')
(with x as (select '2' as id, 'sample' as name)
(with x as (select '2' as number, 'customer' as type)
select * from x))
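The shadowing behaves like variable scoping in most languages. A rough Python analogy, with hypothetical values mirroring the query above:

```python
x = "1"                         # outermost x
def middle():
    x = ("2", "sample")         # shadows the outer x
    def inner():
        x = ("2", "customer")   # shadows again; the innermost binding wins
        return x
    return inner()

print(middle())                 # ('2', 'customer')
```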

Perform loop and calculation on BigQuery Array type

My original data has a column B that is an ARRAY<INT64>.
I want to calculate the difference B[n+1] - B[n] for each adjacent pair, producing a new table as follows:
I figured I could somehow achieve this using a LOOP and an IF condition:
DECLARE x INT64 DEFAULT 0;
LOOP
SET x = x + 1
IF(x < array_length(table.B))
THEN INSERT INTO newTable (SELECT A, B[OFFSET(x+1)] - B[OFFSET(x)]) from table
END IF;
END LOOP;
The problem is that the above idea doesn't work across the rows of my data: I would still need to loop through each row of the table, and I can't find a way to integrate the scripting part into a normal query, where I could write
SELECT A, [calculation script] from table
Can someone point me how can I do it? Or any better way to solve this problem?
Thank you.
Below actually works - BigQuery
select * replace(
  array(select diff from (
      select offset, lead(el) over(order by offset) - el as diff
      from unnest(B) el with offset
    )
    where not diff is null
    order by offset
  ) as B
)
from `project.dataset.table` t
Applied to the sample data in your question, this produces the expected per-row arrays of differences.
You can use unnest() with offset for this purpose:
select id, a,
       array_agg(b_el - prev_b_el order by n) as b_diffs
from (select t.*, b_el, lag(b_el) over (partition by t.id order by n) as prev_b_el
      from t cross join
           unnest(b) b_el with offset n
     ) t
where prev_b_el is not null
group by t.id, t.a
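Outside SQL, the per-array computation both answers perform is just consecutive differences. A minimal Python sketch (the function name is my own):

```python
def consecutive_diffs(b):
    # B[n+1] - B[n] for each adjacent pair; result has len(b) - 1 elements
    return [b[i + 1] - b[i] for i in range(len(b) - 1)]

print(consecutive_diffs([1, 3, 6, 10]))   # [2, 3, 4]
```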

LIMIT within ARRAY_AGG in BigQuery

When I put a LIMIT clause in an ARRAY_AGG, I still get many items in the array. The docs suggest that this should work.
Am I doing something wrong?
SELECT
  x,
  ARRAY_AGG((SELECT AS STRUCT y LIMIT 1)) y
FROM
  `a`,
  UNNEST(b) b
WHERE
  x = 'abc'
GROUP BY 1
LIMIT 1
...produces a result with one row containing a STRING and an ARRAY of 50 items, when I would have expected only 1 item.
The issue was the placement of the LIMIT clause. It was in scope for the SELECT statement, rather than the ARRAY_AGG function. This corrected it:
SELECT
  x,
  ARRAY_AGG((SELECT AS STRUCT y) LIMIT 1) y
FROM
  `a`,
  UNNEST(b) b
WHERE
  x = 'abc'
GROUP BY 1
LIMIT 1
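The semantics of LIMIT inside the aggregate (as opposed to LIMIT on the outer query) can be sketched in Python; the sample values are hypothetical. Here each group's array is capped at the limit, keeping the first value seen per group; note that without an ORDER BY inside the aggregate, BigQuery does not guarantee which element you get.

```python
from collections import defaultdict

rows = [("abc", "y1"), ("abc", "y2"), ("abc", "y3"), ("def", "y4")]
limit = 1

groups = defaultdict(list)
for x, y in rows:
    # LIMIT inside ARRAY_AGG caps each group's array, not the row count
    if len(groups[x]) < limit:
        groups[x].append(y)
print(dict(groups))   # {'abc': ['y1'], 'def': ['y4']}
```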

Find all possible combinations of array without permutations

The input is an array of length n.
I need all combinations of its elements stored in a new array.
IN: j='{A, B, C ..}'
OUT: k='{A, B, C, AB, AC, BC, ABC ..}'
Without repetitions, so no BA, CA, etc.
Generic solution using a recursive CTE
Works for any number of elements and any base data type that supports the > operator.
WITH RECURSIVE t(i) AS (SELECT * FROM unnest('{A,B,C}'::text[])) -- provide array
, cte AS (
SELECT i::text AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT cte.combo || t.i::text, t.i, ct + 1
FROM cte
JOIN t ON t.i > cte.i
)
SELECT ARRAY (
SELECT combo
FROM cte
ORDER BY ct, combo
) AS result;
Result is an array of text in the example.
Note that you can have any number of additional non-recursive CTEs when using the RECURSIVE keyword.
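The result of the recursive CTE can be cross-checked with Python's itertools.combinations, which likewise emits each subset exactly once, in input order:

```python
from itertools import combinations

elems = ["A", "B", "C"]
result = ["".join(c)
          for ct in range(1, len(elems) + 1)   # ct = combination size
          for c in combinations(elems, ct)]
print(result)   # ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```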
More generic yet
If any of the following apply:
Array elements are non-unique (like '{A,B,B}').
The base data type does not support the > operator (like json).
Array elements are very big - for better performance.
Use a row number instead of comparing elements:
WITH RECURSIVE t AS (
SELECT i::text, row_number() OVER () AS rn
FROM unnest('{A,B,B}'::text[]) i -- duplicate element!
)
, cte AS (
SELECT i AS combo, rn, 1 AS ct
FROM t
UNION ALL
SELECT cte.combo || t.i, t.rn, ct + 1
FROM cte
JOIN t ON t.rn > cte.rn
)
SELECT ARRAY (
SELECT combo
FROM cte
ORDER BY ct, combo
) AS result;
Or use WITH ORDINALITY in Postgres 9.4+:
PostgreSQL unnest() with element number
Special case: generate decimal numbers
To generate decimal numbers with 5 digits along these lines:
WITH RECURSIVE t AS (
SELECT i
FROM unnest('{1,2,3,4,5}'::int[]) i
)
, cte AS (
SELECT i AS nr, i
FROM t
UNION ALL
SELECT cte.nr * 10 + t.i, t.i
FROM cte
JOIN t ON t.i > cte.i
)
SELECT ARRAY (
SELECT nr
FROM cte
ORDER BY nr
) AS result;
SQL Fiddle demonstrating all.
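The decimal-number special case can be cross-checked the same way; with digits {1..5} it enumerates every number whose digits are strictly increasing:

```python
from itertools import combinations

digits = "12345"
result = sorted(int("".join(c))
                for r in range(1, len(digits) + 1)
                for c in combinations(digits, r))
print(len(result), result[:8])   # 31 [1, 2, 3, 4, 5, 12, 13, 14]
```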
If n is small (< 20), all possible combinations can be found using a bitmask approach. There are 2^n different combinations; each value from 0 to (2^n - 1) represents one of them.
E.g. for n = 3:
0 represents {}, the empty set
2^3 - 1 = 7 = 111b represents all elements, abc
In Python, for example (note the bitwise &, not logical &&):
arr = ["a", "b", "c"]
n = len(arr)
for b in range(2 ** n):          # each combination
    res = ""
    for i in range(n):           # which elements are included
        if b & (1 << i):
            res += arr[i]
    print(res)