Flattening nested hierarchies in BigQuery - google-bigquery

I have a BigQuery table with two nested levels of repeated field hierarchies.
I need to do a self join (join the table with itself) on a leaf field in the inner level.
Using the FLATTEN clause only flattens one level, and I couldn't figure out how to flatten both.
In theory I need to write a nested FLATTEN, but I couldn't make this work.
Any help would be appreciated.
Example:
Given the following table structure:
a1, integer
a2, record (repeated)
a2.b1, integer
a2.b2, record (repeated)
a2.b2.c1, integer
How do I write a query that does a self join (JOIN EACH) on a2.b2.c1 on both sides?

Nested flatten -- that is flatten of a subquery -- should work. Note it requires a plethora of parentheses.
Given the schema:
{nested_repeated_f: [
{inner_nested_repeated_f: [
{string_f}]}]}
The following query will work:
SELECT t1.f1 FROM (
SELECT nested_repeated_f.inner_nested_repeated_f.string_f as f1
FROM (FLATTEN((
SELECT nested_repeated_f.inner_nested_repeated_f.string_f
FROM
(FLATTEN(lotsOdata.nested002, nested_repeated_f.inner_nested_repeated_f))
), nested_repeated_f))) as t1
JOIN (
SELECT nested_repeated_f.inner_nested_repeated_f.string_f as f2
FROM (FLATTEN((
SELECT nested_repeated_f.inner_nested_repeated_f.string_f
FROM
(FLATTEN(lotsOdata.nested002, nested_repeated_f.inner_nested_repeated_f))
), nested_repeated_f))) as t2
on t1.f1 = t2.f2
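To make the shape of the result easier to see, here is a minimal Python sketch of what flattening both levels and then self-joining on the leaf does. It is not BigQuery code: the sample rows are made up, and the field names follow the a1 / a2.b1 / a2.b2.c1 schema from the question.

```python
# Hypothetical sample rows shaped like the question's schema:
# a1 (integer), a2 (repeated record) -> b1 (integer), b2 (repeated record) -> c1 (integer)
rows = [
    {"a1": 1, "a2": [{"b1": 10, "b2": [{"c1": 100}, {"c1": 200}]}]},
    {"a1": 2, "a2": [{"b1": 20, "b2": [{"c1": 200}]},
                     {"b1": 30, "b2": [{"c1": 300}]}]},
]

def flatten_two_levels(table):
    """Yield one flat row per leaf value: outer repeated field, then inner one."""
    for row in table:
        for a2 in row["a2"]:
            for b2 in a2["b2"]:
                yield {"a1": row["a1"], "b1": a2["b1"], "c1": b2["c1"]}

flat = list(flatten_two_levels(rows))

# Self join on the leaf field c1: pair every flat row with every flat row
# that carries the same c1 value.
joined = [(left["a1"], right["a1"], left["c1"])
          for left in flat for right in flat
          if left["c1"] == right["c1"]]
```

Each leaf value becomes its own row before the join, which is why both repeated levels have to be flattened on each side.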

Related

Options for using BigQuery TABLESAMPLE results to filter two tables

I have one big table which I want to sub-sample once and use that result to filter two other tables. To do so, I have constructed a query which is something like this:
with subset_ids as (
select
distinct(id) as id
from `project.dataset.very_large_table`
TABLESAMPLE SYSTEM (25 percent)
),
sampled_table_1 as (
select * from `project.dataset.table_1` t1
right join subset_ids
on t1.id = subset_ids.id
),
sampled_table_2 as (
select * from `project.dataset.table_2` t2
right join subset_ids
on t2.id = subset_ids.id
)
select * from sampled_table_1
UNION ALL
select * from sampled_table_2
In this query I'm getting a subset of IDs using the TABLESAMPLE clause, then joining two tables on that subset to filter them, and finally doing something with the results. Here I just UNION them together, but that's not the important bit.
This gives me the following error:
sampling of table project.dataset.very_large_table not supported. Possible reasons: (1) sampled table referenced more than once or (2) sampling used inside IN subquery.
Which makes sense - the random sampling is not deterministic and will give different results each time.
Short of writing the subsampled list of IDs to a temp table, what other options are there to re-use a sampled table?

How to apply a filter on jsonb array of objects - after aggregating?

I have the following select statement:
SELECT
cards.*,
COUNT(cards.*) OVER() AS full_count,
p.printing_information
FROM
cards
LEFT JOIN
(SELECT
pr.card_id, jsonb_agg(to_jsonb(pr)) AS printing_information
FROM
printings pr
GROUP BY
pr.card_id) p ON cards.card_id = p.card_id
WHERE
...
I would like to be able to query on set_id that is within the printings table. I tried to do this within my above select statement by including pr.set_id but it then required a GROUP BY pr.card_id, pr.set_id which then made a row per printing rather than having all printings within the printing_information sub-array.
If I can't work out how to do the above, is it possible to search within the printing_information array of jsonb?
Ideally I would like to be able to do something like:
WHERE p.printing_information->set_id = '123'
Unfortunately I can't do that as it's within an array.
What's the best way to achieve this? I could just do post-processing of the result to strip out unnecessary results, but I feel there must be a better way.
SELECT cards.*
, count(cards.*) over() AS full_count
, p.printing_information
FROM cards
LEFT JOIN (
SELECT pr.card_id, jsonb_agg(to_jsonb(pr)) AS printing_information
FROM printings pr
WHERE pr.set_id = '123' -- HERE!
GROUP BY pr.card_id
) p ON cards.card_id = p.card_id
WHERE ...
This is much cheaper than filtering after the fact. And can be supported with an index on (set_id) - unlike any attempts to filter on the dynamically generated jsonb column.
This is efficient as long as we need to aggregate all or most rows from table printings anyway. But your added WHERE ... implies more filters on the outer SELECT. If that means only a few rows from printings are needed, a LATERAL subquery should be more efficient:
SELECT c.*
, count(c.*) OVER() AS full_count
, p.printing_information
FROM cards c
CROSS JOIN LATERAL (
SELECT jsonb_agg(to_jsonb(pr)) AS printing_information
FROM printings pr
WHERE pr.card_id = c.card_id
AND pr.set_id = '123' -- here!
) p
WHERE ... -- selective filter here?!?
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Aside, there is no "array of jsonb" here. The subquery produces a jsonb containing an array.
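As a rough illustration of why filtering inside the subquery works, here is a small Python sketch that mirrors the first query: filter printings by set_id, aggregate per card_id, then left-join to cards. The sample rows and the name column are made up, and this models the logic only, not Postgres itself.

```python
from collections import defaultdict

# Hypothetical sample data; only card_id and set_id come from the question.
cards = [{"card_id": 1, "name": "A"}, {"card_id": 2, "name": "B"}]
printings = [
    {"card_id": 1, "set_id": "123"},
    {"card_id": 1, "set_id": "456"},
    {"card_id": 2, "set_id": "456"},
]

def aggregate_printings(rows, set_id):
    """Filter first (WHERE pr.set_id = ...), then collect per card (jsonb_agg ... GROUP BY)."""
    grouped = defaultdict(list)
    for pr in rows:
        if pr["set_id"] == set_id:
            grouped[pr["card_id"]].append(pr)
    return grouped

agg = aggregate_printings(printings, "123")

# LEFT JOIN: every card is kept; cards without a matching printing get None (NULL).
result = [{**card, "printing_information": agg.get(card["card_id"])} for card in cards]
```

Note that with the LEFT JOIN, cards without any printing in the set still show up with a NULL printing_information. If only cards that have such a printing are wanted, a plain JOIN (or an extra filter on p.printing_information IS NOT NULL) would drop them.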

How to JOIN a JSONB column with an array by a text column?

Using PostgreSQL 9.6+
Two tables (simplified to only the columns that matter with example data):
Table 1:
-------------------------------------------------------
key (PK) [Text]| resources [JSONB]
-------------------------------------------------------
asdfaewdfas | [i0c1d1233s49f3fce, z0k1d9921s49f3glk]
Table 2:
-------------------------------------------------------
resource (PK) [Text]| data [JSONB]
-------------------------------------------------------
i0c1d1233s49f3fce | {large json of data}
z0k1d9921s49f3glk | {large json of data}
Trying to access the data column(s) of Table 2 from the resources column of Table 1.
Unnest the JSON array and join to the second table. Like:
SELECT t1.*, t2.data -- or just the bits you need
FROM table1 t1, jsonb_array_elements_text(t1.resources) r(resource)
JOIN table2 t2 USING (resource)
WHERE t1.key = ?
Or, to preserve all rows in table1 with empty / null / unmatched resources:
SELECT t1.*, t2.data -- or just the bits you need
FROM table1 t1
LEFT JOIN LATERAL jsonb_array_elements_text(t1.resources) r(resource) ON true
LEFT JOIN table2 t2 USING (resource)
WHERE t1.key = ?
About jsonb_array_elements_text():
How to turn json array into postgres array?
There is an implicit LATERAL join in the first query. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
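To show how the two queries differ, here is a minimal Python sketch of the same logic: unnest the array, then join on the resource key. The "empty" row and the small JSON payloads are made up for illustration, and the WHERE t1.key = ? filter is left out.

```python
# Sample data modelled on the question; the "empty" row and payloads are hypothetical.
table1 = {
    "asdfaewdfas": ["i0c1d1233s49f3fce", "z0k1d9921s49f3glk"],
    "empty": [],
}
table2 = {
    "i0c1d1233s49f3fce": {"n": 1},
    "z0k1d9921s49f3glk": {"n": 2},
}

def inner_join(t1, t2):
    """Like the first query: unnest resources, keep only matches in table2."""
    return [(key, res, t2[res])
            for key, resources in t1.items()
            for res in resources if res in t2]

def left_join(t1, t2):
    """Like the second query: keep table1 rows with empty or unmatched resources."""
    out = []
    for key, resources in t1.items():
        if not resources:                        # LEFT JOIN LATERAL ... ON true
            out.append((key, None, None))
        for res in resources:
            out.append((key, res, t2.get(res)))  # LEFT JOIN table2
    return out

inner = inner_join(table1, table2)
left = left_join(table1, table2)
```

The row with the empty array disappears from the inner variant and survives in the left variant with NULLs.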
Consider a normalized DB design with a junction table with one row per linked resource instead of the column table1.resources, implementing the m:n relation properly. This way you can enforce referential integrity, data integrity etc. with relational features. And queries become simpler. jsonb for everything is simple at first. But if you work a lot with nested data, this may turn around on you.
Can PostgreSQL array be optimized for join?
How to implement a many-to-many relationship in PostgreSQL?
Can PostgreSQL have a uniqueness constraint on array elements?

Memory allocation failed: How to combine four result sets into one table

I have four tables. Every table has just one column with 32768 rows, like:
|calculated|
|2.45644534|
|3.23323567|
[...]
Now I want to combine these four results/tables into one table with four columns, like:
|calc1|calc2|calc3|calc4|
[values]
There are no IDs or something else to identify unique rows.
This is my query:
SELECT A.*, B.*, C.*, D.*
FROM
(
SELECT * FROM :REAL_RESULT
) AS A
JOIN
(
SELECT * FROM :PHASE_RESULT
) AS B
ON 1=1
JOIN
(
SELECT * FROM :AMPLITUDE_RESULT
) AS C
ON 1=1 [...]
Now the server is throwing this error:
Error: (dberror) 2048 - column store error: search table error:
"TEST"."data::fourier": line 58
col 4 (at pos 1655): [2048] (range 3): column store error: search
table error: [9] Memory allocation failed
What can I do now? Are there any other options? Thanks!
What you do in your original code is effectively a cross join on four tables, each containing 2^15 rows. The result would contain 2^60 rows, which is quite a few petabytes... That's the reason for the OOM. I used a similar example to show colleagues what can happen when joining big tables with the wrong join condition.
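The arithmetic can be checked quickly. This is a back-of-the-envelope sketch; the 8 bytes per value is an assumption (a double per column), not something stated in the question.

```python
rows_per_table = 32768            # from the question; equals 2**15
tables = 4

# A join with ON 1=1 pairs every row with every other row: sizes multiply.
result_rows = rows_per_table ** tables

# Assumed storage cost: four 8-byte values per result row.
bytes_per_row = tables * 8
total_bytes = result_rows * bytes_per_row
exbibytes = total_bytes / 2**60
```

Under that assumption the full result would need about 32 EiB, which is tens of thousands of pebibytes.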
Besides that, SQL is set based and your rows do not have any natural order.
If the tables are column store tables, you could technically join on the internal column $rowid$. But $rowid$ is not officially documented and I can therefore not recommend using it.
A clean solution is the one suggested by Craig. I would probably use an IDENTITY column.
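Here is a small sketch of the identity-column idea, using Python's sqlite3 with an AUTOINCREMENT key as a stand-in for an IDENTITY column on the real database. The table name fourth_result is hypothetical (the question does not show the fourth table), and the values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table names follow the question; "fourth_result" is a hypothetical name
# for the fourth table, which the question does not show.
data = {
    "real_result": [1.0, 2.0, 3.0],
    "phase_result": [10.0, 20.0, 30.0],
    "amplitude_result": [100.0, 200.0, 300.0],
    "fourth_result": [1000.0, 2000.0, 3000.0],
}
for name, values in data.items():
    # The generated id records insertion order and serves as the join key.
    conn.execute(
        f"CREATE TABLE {name} (id INTEGER PRIMARY KEY AUTOINCREMENT, calculated REAL)"
    )
    conn.executemany(
        f"INSERT INTO {name} (calculated) VALUES (?)", [(v,) for v in values]
    )

combined = conn.execute("""
    SELECT a.calculated AS calc1, b.calculated AS calc2,
           c.calculated AS calc3, d.calculated AS calc4
    FROM real_result a
    JOIN phase_result b ON b.id = a.id
    JOIN amplitude_result c ON c.id = a.id
    JOIN fourth_result d ON d.id = a.id
    ORDER BY a.id
""").fetchall()
```

With a real join key the result has as many rows as each input table. This only helps if the order in which the rows were inserted is the order you want to line them up in.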
If this cross join was not your original intention, and you just wanted to combine lists of values without any actual join condition, you might try UNION:
SELECT COLUMN,0,0,0 from A
union all
SELECT 0,COLUMN,0,0 from B
union all
SELECT 0,0,COLUMN,0 from C
union all
SELECT 0,0,0,COLUMN from D
The output contains one row per input row, so its row count is the sum of the row counts of the four tables.

Find difference between two big tables in PostgreSQL

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash).
Both tables have ~30,000,000 rows. The tables differ only slightly (10-1000 rows are different).
Is it possible with Postgres to find the difference between these tables? The result should be the 10-1000 rows I described above.
This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.
EXISTS seems like the best option.
tbl1 is the table with surplus rows in this example:
SELECT *
FROM tbl1
WHERE NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);
If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:
SELECT *
FROM tbl1
FULL OUTER JOIN tbl2 USING (col)
WHERE tbl2.col IS NULL OR
tbl1.col IS NULL;
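For a quick way to try the anti-join pattern, here is a sketch using Python's sqlite3 with tiny made-up tables. SQLite is only a stand-in here: it needs an explicit select list (SELECT 1) inside EXISTS, and the two directions are combined with UNION ALL instead of a FULL OUTER JOIN so it also runs on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (col TEXT PRIMARY KEY);
    CREATE TABLE tbl2 (col TEXT PRIMARY KEY);
    INSERT INTO tbl1 VALUES ('a'), ('b'), ('c');
    INSERT INTO tbl2 VALUES ('b'), ('c'), ('d');
""")

# Rows of tbl1 missing from tbl2, plus rows of tbl2 missing from tbl1.
diff = sorted(conn.execute("""
    SELECT 'tbl1' AS source, col FROM tbl1
    WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col)
    UNION ALL
    SELECT 'tbl2' AS source, col FROM tbl2
    WHERE NOT EXISTS (SELECT 1 FROM tbl1 WHERE tbl1.col = tbl2.col)
""").fetchall())
```

The extra source column is an addition for illustration; it shows which table each surplus row came from.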
Overview over basic techniques in a later post:
Select rows which are not present in other table
Aside: The data type uuid is efficient for md5 hashes:
Convert hex in text representation to decimal number
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
To augment the existing answers: I use the row() function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this:
select *
from tbl1
full outer join tbl2
on row(tbl1) = row(tbl2)
where tbl1.col is null
or tbl2.col is null
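If you want to find the difference without knowing which table has more rows than the other, you can try this option, which gets all rows present in only one of the tables. The subqueries have to be correlated on the compared column (called col here, as an example):
SELECT * FROM A
WHERE NOT EXISTS (SELECT * FROM B WHERE B.col = A.col)
UNION
SELECT * FROM B
WHERE NOT EXISTS (SELECT * FROM A WHERE A.col = B.col)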
In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:
DELETE FROM table1 where ID IN (
SELECT table1.id FROM table1
LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield
WHERE table2.hashfield IS NULL)
And then do the same the other way around for the other table.