Bigquery union two arrays of different struct - google-bigquery

I have two tables
v1
ARRAY<STRUCT<a int64>>
and
v2
ARRAY<STRUCT<a int64, b int64>>
I want to write query which unions both tables using union all and for v1 rows put nulls in place of b field. Any help is appreciated :)
I'm using standard SQL.

Below is for BigQuery Standard SQL
#standardSQL
SELECT
ARRAY(SELECT AS STRUCT val.a, NULL AS b FROM UNNEST(arr1) val) arr
FROM `project.dataset.v1`
UNION ALL
SELECT arr2 AS arr
FROM `project.dataset.v2`
you can test, play with above using dummy data as below
#standardSQL
WITH `project.dataset.v1` AS (
SELECT [STRUCT<a INT64>(1),STRUCT(2),STRUCT(3)] arr1
), `project.dataset.v2` AS (
SELECT [STRUCT<a INT64, b INT64>(100, 1),STRUCT(100, 2),STRUCT(100, 3)] arr2
)
SELECT
ARRAY(SELECT AS STRUCT val.a, NULL AS b FROM UNNEST(arr1) val) arr
FROM `project.dataset.v1`
UNION ALL
SELECT arr2 AS arr
FROM `project.dataset.v2`
with result as
Row arr.a arr.b
1 1 null
2 null
3 null
2 100 1
100 2
100 3

Related

Is there something like Spark's unionByName in BigQuery?

I'd like to concatenate tables with different schemas, filling unknown values with null.
Simply using UNION ALL of course does not work like this:
WITH
x AS (SELECT 1 AS a, 2 AS b ),
y AS (SELECT 3 AS b, 4 AS c )
SELECT * FROM x
UNION ALL
SELECT * FROM y
a b
1 2
3 4
(unwanted result)
In Spark, I'd use unionByName to get the following result:
a b c
1 2
3 4
(wanted result)
Of course, I can manually create the needed query (adding nullss) in BigQuery like so:
SELECT a, b, NULL c FROM x
UNION ALL
SELECT NULL a, b, c FROM y
But I'd prefer to have a generic solution, not requiring me to generate something like that.
So, is there something like unionByName in BigQuery? Or can one come up with a generic SQL function for this?
Consider below approach (I think it is as generic as one can get)
create temp function json_extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function json_extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp table temp_table as (
select json, key, value
from (
select to_json_string(t) json from table_x as t
union all
select to_json_string(t) from table_y as t
) t, unnest(json_extract_keys(json)) key with offset
join unnest(json_extract_values(json)) value with offset
using(offset)
order by key
);
execute immediate(select '''
select * except(json) from temp_table
pivot (any_value(value) for key in ("''' || string_agg(distinct key, '","') || '"))'
from temp_table
)
if applied to sample data in your question - output is

How to query on fields from nested records without referring to the parent records in BigQuery?

I have data structured as follows:
{
"results": {
"A": {"first": 1, "second": 2, "third": 3},
"B": {"first": 4, "second": 5, "third": 6},
"C": {"first": 7, "second": 8, "third": 9},
"D": {"first": 1, "second": 2, "third": 3},
... },
...
}
i.e. nested records, where the lowest level has the same schema for all records in the level above. The schema would be similar to this:
results RECORD NULLABLE
results.A RECORD NULLABLE
results.A.first INTEGER NULLABLE
results.A.second INTEGER NULLABLE
results.A.third INTEGER NULLABLE
results.B RECORD NULLABLE
results.B.first INTEGER NULLABLE
...
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level? Put differently, can I do a query on first for all records in results without having to specify A, B, ... in my query?
I would for example want to achieve something like
SELECT SUM(results.*.first) FROM table
in order to get 1+4+7+1 = 13,
but SELECT results.*.first isn't supported.
(I've tried playing around with STRUCTs, but haven't gotten far.)
Below trick is for BigQuery Standard SQL
#standardSQL
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
You can test, play with above using dummy/sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
with output
Row id sum_first sum_second sum_third
1 1 13 17 21
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level?
Below is for BigQuery Standard SQL and totally avoids referencing parent records (A, B, C, D, etc.)
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
if to apply to sample data from your question as in below example
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
output is
Row id first_sum second_sum third_sum forth_sum
1 1 13 17 21 null
I adapted Mikhail's answer in order to support grouping on the values of the lowest-level fields:
#standardSQL
CREATE TEMP FUNCTION Nested_AGGREGATE(entries ANY TYPE, field_name STRING) AS ((
SELECT ARRAY(
SELECT AS STRUCT TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') AS value, COUNT(SPLIT(kv, ':')[OFFSET(1)]) AS count
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
GROUP BY TRIM(SPLIT(kv, ':')[OFFSET(1)], '"')
)
));
SELECT id,
Nested_AGGREGATE(results, 'first') AS first_agg,
Nested_AGGREGATE(results, 'second') AS second_agg,
Nested_AGGREGATE(results, 'third') AS third_agg,
FROM `project.dataset.table`
Output for WITH `project.dataset.table` AS (SELECT 1 AS id, STRUCT( STRUCT(1 AS first, 2 AS second, 3 AS third) AS A, STRUCT(4 AS first, 5 AS second, 6 AS third) AS B, STRUCT(7 AS first, 8 AS second, 9 AS third) AS C, STRUCT(1 AS first, 2 AS second, 3 AS third) AS D) AS results ):
Row id first_agg.value first_agg.count second_agg.value second_agg.count third_agg.value third_agg.count
1 1 1 2 2 2 3 2
4 1 5 1 6 1
7 1 8 1 9 1

Convert data from wide format to long format in SQL

I have some data in the format:
VAR1 VAR2 Score1 Score2 Score3
A B 1 2 3
I need to convert it into the format
VAR1 VAR2 VarName Value
A B Score1 1
A B Score2 2
A B Score3 3
How can I do this in SQL?
Provided your score columns are fixed and you require no aggregation, you can use multiple SELECT and UNION ALL statements to generate the shape of data you requested. E.g.
SELECT [VAR1], [VAR2], [VarName] = 'Score1', [Value] = [Score1]
FROM [dbo].[UnknownMe]
UNION ALL
SELECT [VAR1], [VAR2], [VarName] = 'Score2', [Value] = [Score2]
FROM [dbo].[UnknownMe]
UNION ALL
SELECT [VAR1], [VAR2], [VarName] = 'Score3', [Value] = [Score3]
FROM [dbo].[UnknownMe]
SQL Fiddle: http://sqlfiddle.com/#!6/f54b2/4/0
In hive, you could use the named_struct function, the array function, and the explode function in conjunction with the LATERAL VIEW construct
SELECT VAR1, VAR2, var_struct.varname, var_struct.value FROM
(
SELECT
VAR1,
VAR2,
array (
named_struct("varname","Score1","value",Score1),
named_struct("varname","Score2","value",Score2),
named_struct("varname","Score3","value",Score3)
) AS struct_array1
FROM OrignalTable
) t1 LATERAL VIEW explode(struct_array1) t2 as var_struct;

Is there better Oracle operator to do null-safe equality check?

According to this question, the way to perform an equality check in Oracle, and I want null to be considered equal null is something like
SELECT COUNT(1)
FROM TableA
WHERE
wrap_up_cd = val
AND ((brn_brand_id = filter) OR (brn_brand_id IS NULL AND filter IS NULL))
This can really make my code dirty, especially if I have a lot of where like this and the where is applied to several column. Is there a better alternative for this?
Well, I'm not sure if this is better, but it might be slightly more concise to use LNNVL, a function (that you can only use in a WHERE clause) which returns TRUE if a given expression is FALSE or UNKNOWN (NULL). For example...
WITH T AS
(
SELECT 1 AS X, 1 AS Y FROM DUAL UNION ALL
SELECT 1 AS X, 2 AS Y FROM DUAL UNION ALL
SELECT 1 AS X, NULL AS Y FROM DUAL UNION ALL
SELECT NULL AS X, 1 AS Y FROM DUAL
)
SELECT
*
FROM
T
WHERE
LNNVL(X <> Y);
...will return all but the row where X = 1 and Y = 2.
As an alternative you can use NVL function and designated literal which will be returned if a value is null:
-- both are not nulls
SQL> with t1(col1, col2) as(
2 select 123, 123 from dual
3 )
4 select 1 res
5 from t1
6 where nvl(col1, -1) = nvl(col2, -1)
7 ;
RES
----------
1
-- one of the values is null
SQL> with t1(col1, col2) as(
2 select null, 123 from dual
3 )
4 select 1 res
5 from t1
6 where nvl(col1, -1) = nvl(col2, -1)
7 ;
no rows selected
-- both values are nulls
SQL> with t1(col1, col2) as(
2 select null, null from dual
3 )
4 select 1 res
5 from t1
6 where nvl(col1, -1) = nvl(col2, -1)
7 ;
RES
----------
1
As #Codo has noted in the comment, of course, above approach requires choosing a literal comparing columns will never have. If comparing columns are of number datatype(for example) and are able to accept any value, then choosing -1 of course won't be an option. To eliminate that restriction we can use decode function(for numeric or character datatypes) for that:
with t1(col1, col2) as(
2 select null, null from dual
3 )
4 select 1 res
5 from t1
6 where decode(col1, col2, 'same', 'different') = 'same'
7 ;
RES
----------
1
With the LNNVL function, you still have a problem when col1 and col2 (x and y in the answer) are both null. With nvl it works but it is inefficient (not understood by the optimizer) and you have to find a value that cannot appear in the data (and the optimizer should know it cannot).
For strings you can choose a value that have more characters than the maximum of the columns but it is dirty.
The true efficient way to do it is to use the (undocumented) function SYS_OP_MAP_NONNULL().
like this:
where SYS_OP_MAP_NONNULL(col1) <> SYS_OP_MAP_NONNULL(col2)
SYS_OP_MAP_NONNULL(a) is equivalent to nvl(a,'some internal value that cannot appear in the data but that is not null')

Oracle SQL -- select from two columns and combine into one

I have this table:
Vals
Val1 Val2 Score
A B 1
C 2
D 3
I would like the output to be a single column that is the "superset" of the Vals1 and Val2 variable. It also keeps the "score" variable associated with that value.
The output should be:
Val Score
A 1
B 1
C 2
D 3
Selecting from this table twice and then unioning is absolutely not a possibility because producing it is very expensive. In addition I cannot use a with clause because this query uses one in a sub-query and for some reason Oracle doesn't support two with clauses.
I don't really care about how repeat values are dealt with, whatever is easiest/fastest.
How can I generate my appropriate output?
Here is solution without using unpivot.
with columns as (
select level as colNum from dual connect by level <= 2
),
results as (
select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
columns
)
select * from results where val is not null
Here is essentially the same query without the WITH clause:
select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
(select level as colNum from dual connect by level <= 2) columns
where case colNum
when 1 then Val1
when 2 then Val2
end is not null
Or a bit more concisely
select *
from ( select case colNum
when 1 then Val1
when 2 then Val2
end Val,
score
from vals,
(select level as colNum from dual connect by level <= 2) columns
) results
where val is not null
try this, looks like you want to convert column values into rows
select val1, score from vals where val1 is not null
union
select val2,score from vals where val2 is not null
If you're on Oracle 11, unPivot will help:
SELECT *
FROM vals
UNPIVOT ( val FOR origin IN (val1, val2) )
you can choose any names instead of 'val' and 'origin'.
See Oracle article on pivot / unPivot.