How to merge different schemas of structs in arrays (filling missing columns with null)? - google-bigquery

Given these two tables (a being an array of structs).
baz_v1 (a being ARRAY<STRUCT<x INT64>>):
+===========+
| a.x | b |
+===========+
| 1 | one |
| 2 | |
+===========+
baz_v2 (a being ARRAY<STRUCT<x INT64, y INT64>>):
+=================+
| a.x | a.y | b |
+=================+
| 3 | 4 | one |
| 5 | 6 | |
+-----------------+
| 7 | 8 | two |
| 9 | 10 | |
+-----------------+
| 11 | 12 | two |
| 13 | 14 | |
+=================+
How can I obtain the following (concatenated) table/view?
+==================+
| a.x | a.y | b |
+==================+
| 1 | null | one |
| 2 | null | |
+------------------+
| 3 | 4 | one |
| 5 | 6 | |
+------------------+
| 7 | 8 | two |
| 9 | 10 | |
+------------------+
| 11 | 12 | two |
| 13 | 14 | |
+==================+
Code:
WITH `baz_v1` AS (
SELECT
[
STRUCT(1 AS x),
STRUCT(2 AS x)
]
a,
"one" b
), `baz_v2` AS (
SELECT
[
STRUCT(3 AS x, 4 AS y),
STRUCT(5 AS x, 6 AS y)
]
a,
"one" b
UNION ALL
SELECT
[
STRUCT(7 AS x, 8 AS y),
STRUCT(9 AS x, 10 AS y)
]
a,
"two" b
UNION ALL
SELECT
[
STRUCT(11 AS x, 12 AS y),
STRUCT(13 AS x, 14 AS y)
]
a,
"two" b
)
-- todo: Insert magic here, because the below, of course, does not work.
SELECT * FROM baz_v2
UNION ALL
SELECT * FROM baz_v1

Consider below
select * replace(
array(select as struct x, null as y from t.a) as a)
from `baz_v1` t
union all
select * from `baz_v2`
When applied to the sample data in your question, this yields the requested output.

Building on the very good answer given by Mikhail Berlyant, I've found another solution, one that does not use SELECT * REPLACE:
SELECT ARRAY(SELECT AS STRUCT x, NULL AS y FROM baz_v1.a) AS a, b FROM baz_v1
UNION ALL
SELECT * FROM baz_v2
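One caveat with both answers above: as far as I know, an untyped NULL literal in BigQuery defaults to INT64, which happens to match the type of y in this example. If the missing field had another type, the UNION ALL would presumably fail on mismatched struct types, so casting the NULL explicitly is the safer habit. A minimal sketch, assuming a hypothetical variant of baz_v2 in which y is a STRING (the sample values and the alias e are made up for illustration):

```sql
-- Hypothetical variant: y is STRING in baz_v2, so the filler NULL is cast to STRING.
WITH baz_v1 AS (
  SELECT [STRUCT(1 AS x), STRUCT(2 AS x)] AS a, "one" AS b
), baz_v2 AS (
  SELECT [STRUCT(3 AS x, "four" AS y), STRUCT(5 AS x, "six" AS y)] AS a, "two" AS b
)
SELECT
  -- Rebuild each array element with the extra field, typed to match baz_v2.
  ARRAY(
    SELECT AS STRUCT e.x, CAST(NULL AS STRING) AS y
    FROM UNNEST(t.a) AS e
  ) AS a,
  b
FROM baz_v1 AS t
UNION ALL
SELECT a, b FROM baz_v2;
```

Listing the columns explicitly instead of using SELECT * also keeps the two sides aligned if either table later gains a column.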

The following method was used to merge these tables.
(1) The array columns were flattened with UNNEST so that UNION ALL could be applied to the tables.
(2) To match the number of columns on both sides of the UNION ALL, the missing column is added to baz_v1 as an always-NULL column via
LEFT JOIN (SELECT '' as y) ON FALSE
(3) ARRAY_AGG, STRUCT, and GROUP BY were used to rebuild the array result.
WITH `baz_v1` AS (
SELECT
[
STRUCT(1 AS x),
STRUCT(2 AS x)
] a,
"one" b
), `baz_v2` AS (
SELECT
[
STRUCT(3 AS x, 4 AS y),
STRUCT(5 AS x, 6 AS y)
] a,
"two" b
)
SELECT ARRAY_AGG (
STRUCT(x, y)
) as a,
b
FROM (
SELECT xy.x as x, xy.y as y, b
FROM baz_v2, UNNEST(a) as xy
UNION ALL
SELECT x, y, b
FROM (
SELECT x.x as x, CAST(y as INT64) as y, b
FROM baz_v1, UNNEST(a) as x
LEFT JOIN (SELECT '' as y) ON FALSE
)
)
GROUP BY b

Related

SQL/BigQuery: GroupBy If List Contains Common Element

I would like to perform a query in SQL that will do a groupby if a list contains a common element. For example:
| ID | Groups | Amount |
| -------- | -------------- | -------|
| 1 |[A,B] | 5 |
| 2 |[A,C,D] | 10 |
| 3 |[C,B] | 20 |
So that if I do a GROUP BY on the GROUPS, it will do:
|Groups | AVG(Amount)|
|------ | -----------|
|A | 7.5 |
|B | 12.5 |
|C | 15 |
|D | 10 |
The list lengths are variable.
A few ideas I had are one-hot encoding (expanding along the columns), or duplicate the rows by flattening or using UNNEST, but am not sure the best way to implement. Thanks!
I'm trying to "flatten" an array.
Consider below query:
CREATE TEMP TABLE sample_table AS
SELECT 1 ID, ['A','B'] `Groups`, 5 Amount UNION ALL
SELECT 2, ['A','C','D'], 10 UNION ALL
SELECT 3, ['C','B'], 20;
SELECT g AS `Groups`, AVG(Amount) AS avg_amount
FROM sample_table, UNNEST(`Groups`) g
GROUP BY 1;
Your question does not mention which column you want to group by, but it sounds like you are looking for the ARRAY_AGG() function.
For example:
WITH vals AS
(
SELECT 1 x, 'a' y UNION ALL
SELECT 1 x, 'b' y UNION ALL
SELECT 2 x, 'a' y UNION ALL
SELECT 2 x, 'c' y
)
SELECT x, ARRAY_AGG(y) as array_agg
FROM vals
GROUP BY x;
+---------------+
| x | array_agg |
+---------------+
| 1 | [a, b] |
| 2 | [a, c] |
+---------------+

How do I return two lines of unions and a count for each row?

I'm trying to take 10 columns and put into one column, take 10 more different columns into a separate column and get a count for each unique couple in a third column.
So far I've got
UNION
(SELECT a_3 as x, a_4 as y FROM result)
....
UNION
(SELECT a_19 as x, a_20 as y FROM result)
This gives me a table as follows
| x | y |
---------
| a | 1 |
| a | 2 |
| b | 1 |
| b | 3 |
etc...
I want this, however I also want a third column counting how many times each row occurs, like below
| x | y |count|
---------------
| a | 1 | 10 |
| a | 2 | 3 |
| b | 1 | 6 |
| b | 3 | 2 |
etc...
I can also do:
select count(*) from (insert above union SQL)
but then I just get a total number for the table.
Thanks!
It looks like your original table design is not normalized at all.
Instead of a query with multiple unions, you can use CROSS APPLY with a table value constructor to simplify the query:
select x, y, count(*) as [count]
from [result] t
cross apply
(
values (a_1, a_2), (a_3, a_4), (a_5, a_6),
. . .
(a_19, a_20)
) v (x, y)
group by x, y
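Alternatively, wrap the union in a derived table and group by both columns. Use UNION ALL rather than UNION, because UNION removes duplicate rows before they can be counted:
SELECT x, y, COUNT(1) AS [count] FROM
(
SELECT a_3 as x, a_4 as y FROM result
....
UNION ALL
SELECT a_19 as x, a_20 as y FROM result
) t
GROUP BY x, y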
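To see the counting on concrete data, here is a small self-contained sketch. It assumes a cut-down result table with only two column pairs (a_1/a_2 and a_3/a_4) and made-up sample rows. The alias pairs and the column name cnt are illustrative only.

```sql
-- Made-up sample: two rows, two (x, y) column pairs per row.
WITH result AS (
  SELECT 'a' AS a_1, 1 AS a_2, 'a' AS a_3, 1 AS a_4
  UNION ALL
  SELECT 'b', 3, 'a', 2
)
SELECT x, y, COUNT(*) AS cnt
FROM (
  -- UNION ALL keeps repeated pairs so they can be counted.
  SELECT a_1 AS x, a_2 AS y FROM result
  UNION ALL
  SELECT a_3, a_4 FROM result
) AS pairs
GROUP BY x, y
ORDER BY x, y;
```

With these rows the pair (a, 1) occurs twice and (a, 2) and (b, 3) once each. Replacing UNION ALL with UNION would report 1 for every pair.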

Grouping by similar values in multiple columns

I have a table of entities with an id, and a category (few different values with NULL allowed) from 3 different years (category can be different from 1 year to another), in 'wide' table format:
| ID | CATEG_Y1 | CATEG_Y2 | CATEG_Y3 |
+-----+----------+----------+----------+
| 1 | NULL | B | C |
| 2 | A | A | C |
| 3 | B | A | NULL |
| 4 | A | C | B |
| ... | ... | ... | ... |
I would like to simply count the number of entities by category, grouped by category, independently for the year:
+-------+----+----+----+
| CATEG | Y1 | Y2 | Y3 |
+-------+----+----+----+
| A | 6 | 4 | 5 | <- 6 entities w/ categ_y1, 4 w/ categ_y2, 5 w/ categ_y3
| B | 3 | 1 | 10 |
| C | 8 | 4 | 5 |
| NULL | 3 | 3 | 3 |
+-------+----+----+----+
I guess I could do it by grouping values one column after the other and UNION ALL the results, but I was wondering if there was a more rapid & convenient way, and if it can be generalized if I have more columns/years to manage (e.g. 20-30 different values)
A bit clumsy, and probably someone has a better idea. The query first collects all different categories (the union query in the from part), and then counts the occurrences with dedicated subqueries in the select part. One could omit the union part if there is a table already defining the available categories (I suppose categ_y1 is a foreign key to such a primary category table). Hope there are not too many typos:
select categories.cat,
(select count(categ_y1) from table ty1 where categories.cat = categ_y1) as y1,
(select count(categ_y2) from table ty2 where categories.cat = categ_y2) as y2,
(select count(categ_y3) from table ty3 where categories.cat = categ_y3) as y3
from ( select categ_y1 as cat from table t1
union select categ_y2 as cat from table t2
union select categ_y3 as cat from table t3) categories
Use jsonb functions to transpose the data (from the question) to this format:
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
order by 1;
categ | jdata
-------+-----------------------------------------------
A | {"categ_y1": 2, "categ_y2": 2}
B | {"categ_y1": 1, "categ_y2": 1, "categ_y3": 1}
C | {"categ_y2": 1, "categ_y3": 2}
| {"categ_y1": 1, "categ_y3": 1}
(4 rows)
For a known (static) number of years you can easily unpack the jsonb column:
select categ, jdata->'categ_y1' as y1, jdata->'categ_y2' as y2, jdata->'categ_y3' as y3
from (
select categ, jsonb_object_agg(key, count) as jdata
from (
select value as categ, key, count(*)
from my_table t,
jsonb_each_text(to_jsonb(t)- 'id')
group by 1, 2
) s
group by 1
) s
order by 1;
categ | y1 | y2 | y3
-------+----+----+----
A | 2 | 2 |
B | 1 | 1 | 1
C | | 1 | 2
| 1 | | 1
(4 rows)
To get fully dynamic solution you can use the function create_jsonb_flat_view() described in Flatten aggregated key/value pairs from a JSONB field.
I would do this using union all followed by aggregation:
select categ, sum(categ_y1) as y1, sum(categ_y2) as y2,
sum(categ_y3) as y3
from ((select categ_y1 as categ, 1 as categ_y1, 0 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y2, 0 as categ_y1, 1 as categ_y2, 0 as categ_y3
from t
) union all
(select categ_y3, 0 as categ_y1, 0 as categ_y2, 1 as categ_y3
from t
)
) c
group by categ ;
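Another option, assuming PostgreSQL 9.4 or later (the jsonb answer suggests Postgres, but the question does not say so): unpivot the year columns with a LATERAL VALUES list and count per year with aggregate FILTER clauses. This is a sketch, not the method of any answer above. The table name my_table and the sample rows come from the question, while the aliases v, yr, and categ are made up for illustration.

```sql
-- Sample rows from the question, inlined so the query is self-contained.
WITH my_table (id, categ_y1, categ_y2, categ_y3) AS (
  VALUES (1, NULL, 'B', 'C'),
         (2, 'A',  'A', 'C'),
         (3, 'B',  'A', NULL),
         (4, 'A',  'C', 'B')
)
SELECT v.categ,
       count(*) FILTER (WHERE v.yr = 1) AS y1,
       count(*) FILTER (WHERE v.yr = 2) AS y2,
       count(*) FILTER (WHERE v.yr = 3) AS y3
FROM my_table AS t
-- Turn each wide row into one (year, category) row per year column.
CROSS JOIN LATERAL (
  VALUES (1, t.categ_y1), (2, t.categ_y2), (3, t.categ_y3)
) AS v (yr, categ)
GROUP BY v.categ   -- NULL categories form their own group
ORDER BY v.categ;
```

On the four sample rows this gives A: 2, 2, 0; B: 1, 1, 1; C: 0, 1, 2; NULL: 1, 0, 1. Supporting another year means adding one entry to the VALUES list and one FILTER column, and the table is scanned only once.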

Combine 2 tables which don't have any relationship

I have a couple of tables like below:
Table1:
A B C D <<Columns
1 2 3 4 <<single row
Table2:
W X Y Z << Columns
5 6 7 8 << Single row
I want to combine these 2 tables in such a way that it gives me the following result:
Result:
P Q R S << Column headers
1 2 3 4 << row from table1
5 6 7 8 << row from table2
The expected result will have the column headers P, Q, R, S, one row from table1, and one row from table2.
How to achieve this using SQL?
UNION ALL will not eliminate duplicates.
In set operations (UNION / INTERSECT / EXCEPT) the aliases are taken from the first query. (Currently I'm aware of only one exception: Hive requires the aliases to be the same for all queries. I consider this a bug.)
select A as P, B as Q, C as R, D as S
from table1
union all
select W,X,Y,Z
from table2
+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+
table2 with 3 Columns
select B as Q, C as R, D as S
from table1
union all
select X,Y,Z
from table2
+---+---+---+
| q | r | s |
+---+---+---+
| 2 | 3 | 4 |
| 6 | 7 | 8 |
+---+---+---+
or
select A as P, B as Q, C as R, D as S
from table1
union all
select null,X,Y,Z
from table2
+--------+---+---+---+
| p | q | r | s |
+--------+---+---+---+
| 1 | 2 | 3 | 4 |
| (null) | 6 | 7 | 8 |
+--------+---+---+---+
Updated to be more strict and more complete, thanks to @AntDC (and @Matt) and @Dudu Markovitz.
Use UNION with aliases, like this:
SELECT A AS P, B AS Q, C AS R, D AS S
FROM table1
UNION
-- or UNION ALL if you want to keep duplicate rows
SELECT W, X, Y, Z
FROM table2

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?