Nest multiple columns into an array in BigQuery - google-bigquery

Given this BQ table:
There are 1,026 rows with 944 unique modemio_cat_ids.
How can I return a query that nests all non-null columns into a single array called "parents" for each modemio_cat_id?
example: for modemio_cat_id = 1111118
and then finally group by modemio_cat_id and accumulate all arrays in case of duplicates.
Wrong approach: this query always returns the same array for every modemio_cat_id, because the inner SELECT is uncorrelated and builds the array from the whole table each time:
SELECT modemio_cat_id, ARRAY (
SELECT AS STRUCT cat1_id, cat2_id FROM `modemutti-8d8a6.categorization.test`
) as parent
FROM `modemutti-8d8a6.categorization.test`
group by modemio_cat_id

Below example for BigQuery Standard SQL
#standardSQL
SELECT modemio_cat_id,
ARRAY_AGG(DISTINCT cat_id IGNORE NULLS) parents
FROM `modemutti-8d8a6.categorization.test`,
UNNEST([cat1_id, cat2_id, cat3_id, cat4_id, cat5_id, cat6_id]) cat_id
GROUP BY modemio_cat_id
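To try this without access to the original table, here is a minimal self-contained sketch with inline data (the INT64 column types and the sample values are assumptions, since the question's table is only shown as a screenshot):
#standardSQL
WITH test AS (
  -- stand-in for `modemutti-8d8a6.categorization.test`
  SELECT 1111118 AS modemio_cat_id,
         111111 AS cat1_id, 111112 AS cat2_id,
         CAST(NULL AS INT64) AS cat3_id, CAST(NULL AS INT64) AS cat4_id,
         CAST(NULL AS INT64) AS cat5_id, CAST(NULL AS INT64) AS cat6_id
)
SELECT modemio_cat_id,
  ARRAY_AGG(DISTINCT cat_id IGNORE NULLS) parents
FROM test,
UNNEST([cat1_id, cat2_id, cat3_id, cat4_id, cat5_id, cat6_id]) cat_id
GROUP BY modemio_cat_id
-- returns one row: modemio_cat_id = 1111118, parents = [111111, 111112]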

Related

Inclusion of nulls with ANY_VALUE in BigQuery

I have a 'vendors' table that looks like this...
company  | itemKey | itemPriceA | itemPriceB
companyA | 203913  | 20         | 10
companyA | 203914  | 20         | 20
companyA | 203915  | 25         | 5
companyA | 203916  | 10         | 10
It has potentially millions of rows per company and I want to query it to bring back a representative delta between itemPriceA and itemPriceB for each company. I don't care which delta I bring back as long as it isn't zero/null (like row 2 or 4), so I was using ANY_VALUE like this...
SELECT company
, ANY_VALUE(CASE WHEN (itemPriceA-itemPriceB)=0 THEN null ELSE (itemPriceA-itemPriceB) END)
FROM vendors
GROUP BY 1
It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...
"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."
If ANY_VALUE returns null "when expression is NULL for all rows in the group" it should NEVER return null for companyA right (since only 2 of 4 rows are null)? But the second sentence sounds like it will indeed include the null rows.
P.s. you may be wondering why I don't simply add a WHERE clause saying "WHERE itemPriceA-itemPriceB>0" but in the event that a company has ONLY matching prices, I still want the company to be returned in my results.
Clarification
I'm afraid the accepted answer will have to show stronger evidence that contradicts the docs.
@Raul Saucedo suggests that the following BigQuery documentation is referring to WHERE clauses:
rows for which expression is NULL are considered and may be selected
This is not the case. WHERE clauses are not mentioned anywhere in the ANY_VALUE docs. (Nowhere on the page. Try to ctrl+f for it.) And the docs are clear, as I'll explain.
@d3wannabe is correct to wonder about this:
It seems to be working but I notice 2 sentences that seem contradictory from Google's documentation...
"Returns NULL when expression is NULL for all rows in the group. ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected."
But the docs are not contradictory. The 2 sentences coexist.
"Returns NULL when expression is NULL for all rows in the group." So if all rows in a column are NULL, it will return NULL.
"ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which expression is NULL are considered and may be selected." So if the column has rows mixed with NULLs and actual data, it will select anything from that column, including nulls.
How to create an ANY_VALUE without nulls in BigQuery
We can use ARRAY_AGG to turn a group of values into a list. This aggregate function has the option to IGNORE NULLS. We then select 1 item from the list after ignoring nulls.
If we have a table with 2 columns: id and mixed_data, where mixed_data has some rows with nulls:
SELECT
id,
ARRAY_AGG( -- turn the mixed_data values into a list
mixed_data -- we'll create an array of values from our mixed_data column
IGNORE NULLS -- there we go!
LIMIT 1 -- only fill the array with 1 thing
)[SAFE_OFFSET(0)] -- grab the first item in the array
AS any_mixed_data_without_nulls
FROM your_table
GROUP BY id
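For example, a minimal runnable sketch with inline data (id and mixed_data keep the placeholder names from above):
SELECT
  id,
  ARRAY_AGG(mixed_data IGNORE NULLS LIMIT 1)[SAFE_OFFSET(0)]
    AS any_mixed_data_without_nulls
FROM UNNEST([
  STRUCT(1 AS id, CAST(NULL AS STRING) AS mixed_data),
  STRUCT(1 AS id, 'a' AS mixed_data),
  STRUCT(2 AS id, CAST(NULL AS STRING) AS mixed_data)
])
GROUP BY id
-- id 1 returns 'a'; id 2 returns NULL because all of its rows are NULL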
See similar answers here:
https://stackoverflow.com/a/53508606/6305196
https://stackoverflow.com/a/62089838/6305196
Update, 2022-08-12
There is evidence that the docs may be inconsistent with the actual behavior of the function. See Samuel's latest answer to explore his methodology.
However, we cannot know if the docs are incorrect and ANY_VALUE behaves as expected or if ANY_VALUE has a bug and the docs express the intended behavior. We don't know if Google will correct the docs or the function when they address this issue.
Therefore I would continue to use ARRAY_AGG to create a safe ANY_VALUE that ignores nulls until we see a fix from Google.
Please upvote the issue in Google's Issue Tracker to see this resolved.
This is an explanation of how ANY_VALUE works with NULL values.
ANY_VALUE always returns the first value, if there is a value different from NULL:
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, "banana",null,null]) as fruit;
It returns NULL if all rows have NULL values. This refers to the sentence:
“Returns NULL when expression is NULL for all rows in the group”
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST([null, null, null]) as fruit
It returns NULL if a value is NULL and you select for it in the WHERE clause. This refers to the sentence:
“ANY_VALUE behaves as if RESPECT NULLS is specified; rows for which
expression is NULL are considered and may be selected.”
SELECT ANY_VALUE(fruit) as any_value
FROM UNNEST(["apple", "banana", null]) as fruit
where fruit is null
It always depends on which filter you are using and on the field inside the ANY_VALUE.
You can see in this example, which returns the rows whose delta is different from 0:
SELECT ANY_VALUE(e).company, ANY_VALUE(itemPriceA-itemPriceB) as value
FROM `vendor` e
where (itemPriceA-itemPriceB)!=0
group by e.company
The documentation says that rows for which the expression is NULL "are considered and may be selected" by an ANY_VALUE statement. However, I am quite sure the documentation is wrong here. In the current implementation, tested on 13 August 2022, ANY_VALUE returns the first non-NULL value of the column. However, if the table does not have an ORDER BY specified, the ordering may be random, because the data is processed on several nodes.
For the test, a large table of NULLs is needed, and GENERATE_ARRAY comes in handy for that. The generated array has a million entries, with the value zero standing in for NULL: the first 1 million zero entries are aggregated into the table tmp. The table tbl then places those zeros before and after the list [-100,0,-90,-80,3,4,5,6,7,8,9]. Finally, NULLIF(x,0) AS x replaces all zeros with NULL.
Several tests of ANY_VALUE against the test table tbl are run. If the table is not sorted further, the first non-NULL value of the column is returned: -100.
WITH
tmp AS (SELECT ARRAY_AGG(0) AS tmp0 FROM UNNEST(GENERATE_ARRAY(1,1000*1000))),
tbl AS (
SELECT
NULLIF(x,0) AS x,
IF(x!=0,x,NULL) AS y,
rand() AS rand
FROM
tmp,
UNNEST(ARRAY_CONCAT(tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0)) AS x )
SELECT "count rows", COUNT(1) FROM tbl
UNION ALL SELECT "count items not null", COUNT(x) FROM tbl
UNION ALL SELECT "any_value(x): (returns first non null element in list: -100)", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "2nd run", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "3rd run", ANY_VALUE(x) FROM tbl
UNION ALL SELECT "any_value(y)", ANY_VALUE(y) FROM tbl
UNION ALL SELECT "order asc", ANY_VALUE(x) FROM (Select * from tbl order by x asc)
UNION ALL SELECT "order desc (returns largest element: 9)", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order desc", ANY_VALUE(x) FROM (Select * from tbl order by x desc)
UNION ALL SELECT "order abs(x) desc", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) desc )
UNION ALL SELECT "order abs(x) asc (smallest number: 3)", ANY_VALUE(x) FROM (Select * from tbl order by abs(x) asc )
UNION ALL SELECT "order rand asc", ANY_VALUE(x) FROM (Select * from tbl order by rand asc )
UNION ALL SELECT "order rand desc", ANY_VALUE(x) FROM (Select * from tbl order by rand desc )
This gives the following result:
The first non-NULL entry, -100, is returned.
Sorting the table by this column causes ANY_VALUE to always return the first entry.
In the last two examples, the table is ordered by random values, so ANY_VALUE returns random entries.
If the dataset is larger than 2 million rows, the table may be split internally for processing, which results in an unordered table. Without an ORDER BY, the first entry of the table, and thus the result of ANY_VALUE, cannot be predicted.
To test this, replace the 10th line of the query (the UNNEST(ARRAY_CONCAT(...)) line) with
UNNEST(ARRAY_CONCAT(tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0,tmp0, [0,0,0,0,0,-100,0,-90,-80,3,4,5,6,7,8,9] , tmp0,tmp0)) AS x )

How to get count(percentage) for columns after each groupby item?

I have the following table in a SQLite DB:
Item | Result
A    | Pass
B    | Pass
A    | Fail
B    | Fail
I want to transform the above table into the one below using a query.
Item | Total | Accept | Reject
A    | 2     | 1(50%) | 1(50%)
B    | 2     | 1(50%) | 1(50%)
How should I construct this query?
You can try PIVOT() if your DBMS supports it. Then use CONCAT or the || operator, depending on the DBMS.
Query:
SELECT
item,
total,
SUM(Pass)||'('|| CAST((SUM(Pass)*1.0/total*1.0)*100.0 AS DECIMAL)||'%)' AS Accept,
SUM(Fail)||'('|| CAST((SUM(Fail)*1.0/total*1.0)*100.0 AS DECIMAL)||'%)' AS Reject
FROM
(
SELECT
Item,
result,
COUNT(result) OVER(PARTITION BY item ORDER BY result ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total,
CASE
WHEN Result = 'Pass' then 1
ELSE 0
END AS Pass,
CASE
WHEN Result = 'Fail' then 1
ELSE 0
END AS Fail
FROM t
) AS j
GROUP BY item, total
Query explanation:
Since SQLite does not support PIVOT, we create the Pass and Fail flags manually using a CASE expression.
To calculate total, COUNT is used as an analytic (window) function here. It is basically a shortcut to calculate the count and place it in all rows.
Then in the outer query, we calculate the percentages, using || as the concatenation operator to combine each count with its percentage of the total.
See demo in db<>fiddle
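As a side note, SQLite can express the same result more compactly, because a comparison such as Result = 'Pass' evaluates to 0 or 1 there and can be summed directly. A hedged alternative sketch against the same table t:
SELECT
  Item,
  COUNT(*) AS Total,
  SUM(Result = 'Pass') || '(' || CAST(SUM(Result = 'Pass') * 100.0 / COUNT(*) AS INT) || '%)' AS Accept,
  SUM(Result = 'Fail') || '(' || CAST(SUM(Result = 'Fail') * 100.0 / COUNT(*) AS INT) || '%)' AS Reject
FROM t
GROUP BY Item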

Unnesting repeated records to a single row in BigQuery

I have a dataset that includes repeated records. When I unnest them I get 2 rows, 1 per nested record.
Before unnest raw data:
After unnest using this query:
SELECT
eventTime,
participant.id
FROM
`public.table`,
UNNEST(people) AS participant
WHERE
verb = 'event'
These are actually 2 rows that are expanded to 4. I've been trying to unnest into a single row so I have 3 columns,
eventTime, buyer.Id, seller.Id.
I've been trying to use REPLACE to build a struct of the unnested content but I cannot figure out how to do it. Any pointers, documentation, or steps that could help me out?
Consider below approach
SELECT * EXCEPT(key) FROM (
SELECT
eventTime,
participant.id,
personEventRole,
TO_JSON_STRING(t) key
FROM `public.table` t,
UNNEST(people) AS participant
WHERE verb = 'event'
)
PIVOT (MIN(id) FOR personEventRole IN ('buyer', 'seller'))
if applied to the sample data in your question, the output is one row per event with separate buyer and seller columns
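Since the question's sample data exists only as screenshots, here is a hedged, self-contained version with inline data; the people struct fields id and personEventRole (with the values 'buyer' and 'seller') are assumptions inferred from the query above:
WITH sample AS (
  -- stand-in for `public.table`
  SELECT TIMESTAMP '2021-01-01 10:00:00' AS eventTime, 'event' AS verb,
         [STRUCT('buyer-123' AS id, 'buyer' AS personEventRole),
          STRUCT('seller-456' AS id, 'seller' AS personEventRole)] AS people
)
SELECT * EXCEPT(key) FROM (
  SELECT
    eventTime,
    participant.id,
    participant.personEventRole,
    TO_JSON_STRING(t) key
  FROM sample t,
  UNNEST(people) AS participant
  WHERE verb = 'event'
)
PIVOT (MIN(id) FOR personEventRole IN ('buyer', 'seller'))
-- returns one row: eventTime, buyer = 'buyer-123', seller = 'seller-456'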

Is there a way to check if any items in a string array are in a string in Snowflake/Redshift?

I am looking for a way to check if a string contains any words in another field which is a single string that holds a list of items. Something like this...
id items (STRING)
1 burger;hotdog
I have a second dataset that might look like...
transaction_id description amount
10 cheeseburger 10
Now I need to grab the amount if the description matches any of the items in the first table; in this case it does match on the string burger. However, I can't seem to get the SQL right: if I were to use LIKE ANY in Snowflake, I'd need to pass in ('%burger%', '%hotdog%'), which are two separate strings - and I can't make explicit calls, as each id/item permutation may be different in the first table. Meanwhile in Redshift, when I try to use
CASE WHEN lower(t.description) SIMILAR TO '%(' || replace(items,';','|') || ')%' then amount END
I get the following error: Specified types or functions (one per INFO message) not supported on Redshift tables.
Thanks in advance!
If you're wanting a Snowflake answer:
WITH keys AS (
SELECT * FROM VALUES (1,'burger;hotdog') a(id,items)
), data AS (
SELECT * FROM VALUES (10,'cheeseburger',10) b(transaction_id, description, amount)
), seq_keys AS (
SELECT s.seq_id, f.value as key
FROM (
SELECT seq8() as seq_id, k.*
FROM keys AS k
) AS s
,lateral flatten(input=>split(s.items,';')) F
)
SELECT d.*, sk.*
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%'||sk.key||'%'
gives:
TRANSACTION_ID DESCRIPTION AMOUNT SEQ_ID KEY
10 cheeseburger 10 0 "burger"
With this you can DISTINCT on the SEQ_ID and de-dupe if there are multiple keys that match. I would be inclined to also add an ID to the "data" table.
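One possible shape for that de-dupe step, a sketch using Snowflake's QUALIFY clause (my addition, not part of the original answer), appended to the WITH blocks above and keeping one row per transaction when several keys match:
SELECT d.*, sk.key
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%'||sk.key||'%'
QUALIFY ROW_NUMBER() OVER (PARTITION BY d.transaction_id ORDER BY sk.seq_id) = 1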

How can I aggregate Jsonb columns in postgres using another column type

I have the following data in a Postgres table,
where data is a jsonb column. I would like to get the result as
[
{field_type: "Design", briefings_count: 1, meetings_count: 13},
{field_type: "Engineering", briefings_count: 1, meetings_count: 13},
{field_type: "Data Science", briefings_count: 0, meetings_count: 3}
]
Explanation
Use the jsonb_each_text function to extract the data from the jsonb column named data. Then aggregate rows with GROUP BY to get one row for each distinct field_type. Each aggregation also needs to include the meetings and briefings counts, which is done by selecting the maximum value with a CASE expression, so that the two counts end up in two separate columns. On top of that, apply the coalesce function to return 0 instead of NULL if some information is missing - in your example that would be briefings for Data Science.
At the higher level of the statement, now that we have the results as a table with fields, we need to build a jsonb object from each row and aggregate them all into one. For that we use jsonb_build_object, passing pairs that consist of a field name and its value. That leaves us with 3 rows of data, each row holding a separate jsonb column. Since we want only one row (an aggregated JSON array) in the output, we apply jsonb_agg on top of that. This produces the result you're looking for.
Code
Check LIVE DEMO to see how it works.
select
jsonb_agg(
jsonb_build_object('field_type', field_type,
'briefings_count', briefings_count,
'meetings_count', meetings_count
)
) as agg_data
from (
select
j.k as field_type
, coalesce(max(case when t.count_type = 'briefings_count' then j.v::int end),0) as briefings_count
, coalesce(max(case when t.count_type = 'meetings_count' then j.v::int end),0) as meetings_count
from tbl t,
jsonb_each_text(data) j(k,v)
group by j.k
) t
You can aggregate columns like this and then insert the data into another table:
select array_agg(data)
from the_table
Or use one of the built-in JSON functions to create a new JSON array, for example jsonb_agg(expression).
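A minimal sketch of that jsonb_agg route (assuming the same table the_table with a jsonb column named data):
-- collapses every row's jsonb value into a single json array
select jsonb_agg(data) as all_data
from the_table;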