How to query multiple STRUCTs in BigQuery in wildcard-like way - google-bigquery

I'm struggling to query multiple STRUCTs that share the same record fields.
Let me show you what the table looks like.
Tables with multiple STRUCTs with same record fields
Each of the mango, melon, apple, and banana STRUCTs (RECORDs) shares the same fields: qty and price.
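For reference, a row of that shape could be mocked up like this (the values are made up):
select
  struct(7 as qty, 1.5 as price) as mango,
  struct(3 as qty, 2.5 as price) as melon,
  struct(9 as qty, 0.5 as price) as apple,
  struct(2 as qty, 0.25 as price) as banana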
Now I want to query them all at once, as in "Show me where qty > 5."
Is there any wildcard-like way to do this? Maybe something like SELECT %.qty > 5. Of course that's invalid syntax (just an example).
I know that the best way is to change the schema to a single fruit RECORD with fruit.qty and fruit.price, and to store mango and the others as data in the fruit field rather than keeping them as fields themselves.
However, for certain reasons, I want to keep that schema and query multiple RECORDs at once. Is that possible?
Thank you.

Consider below approach
with temp as (
  select
    trim(fruit, '"') as fruit,
    cast(json_extract(info, '$.qty') as int64) as qty,
    cast(json_extract(info, '$.price') as float64) as price
  from your_table t,
  unnest(split(trim(to_json_string(t), '{}'), '},')) record,
  unnest([struct(
    split(record, ':{')[offset(0)] as fruit,
    '{' || split(record, ':{')[offset(1)] || '}' as info
  )])
)
select *
from temp
where qty > 5
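The trick here is that TO_JSON_STRING(t) serializes the whole row to JSON; TRIM strips the braces at both ends, and the SPLIT on '},' yields one fragment per fruit (each missing its closing brace, which the '{' || ... || '}' step restores). A quick way to inspect that intermediate, using made-up values:
-- illustrative only: shows what the query above is splitting
select to_json_string(t) as row_json
from (select struct(7 as qty, 1.5 as price) as mango,
             struct(3 as qty, 2.5 as price) as melon) t
-- row_json: {"mango":{"qty":7,"price":1.5},"melon":{"qty":3,"price":2.5}}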
If applied to sample data like in your question, the output is:

Related

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed-length array of integers (say length = 1000). I want to know how to convert it into 1000 columns, with column names given by index_i for i = 0, 1, 2, ..., 999, where each element is the corresponding integer. I can get it done with something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw
  CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like:
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals
  FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce: you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
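A tiny illustration of that last point, with made-up values:
-- one row per group, so ANY_VALUE is effectively deterministic here
SELECT name, ANY_VALUE(vals) AS vals
FROM (SELECT 'A' AS name, 42 AS vals)
GROUP BY name
-- returns a single row: A, 42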
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the EXECUTE IMMEDIATE statement, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
  SELECT "A" AS name, [1,2,3,4,5] AS a
  UNION ALL
  SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
  SELECT name, vals, offset
  FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
  )
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (the pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
  ANY_VALUE(vals) AS vals
  FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
  )
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
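A minimal sketch of that setup, assuming the toy data lives in a table d.raw (the dataset and table names here are placeholders):
CREATE OR REPLACE VIEW d.long_format AS
SELECT name, vals, offset
FROM d.raw AS raw, UNNEST(raw.a) AS vals WITH OFFSET;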
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
  select to_json_string(A) id, * except(A)
  from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg(cast(val as string) order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like below (with 10 elements instead of 1000, adjusting generate_array accordingly)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

Is there a way to check if any items in a string array are in a string in Snowflake/Redshift?

I am looking for a way to check if a string contains any words in another field which is a single string that holds a list of items. Something like this...
id items (STRING)
1 burger;hotdog
I have a second dataset that might look like...
transaction_id description amount
10 cheeseburger 10
Now I need to grab the amount if the description matches any items in the first table; in this case it does match on the string burger. However, I can't seem to get the SQL right, since if I were to use LIKE ANY in Snowflake I'd need to pass in ('%burger%', '%hotdog%'), which are two separate strings. In this case I can't make explicit calls, as each id/item permutation may be different in the first table. Meanwhile, in Redshift, when I try to use
CASE WHEN lower(t.description) SIMILAR TO '%(' || replace(items,';','|') || ')%' then amount END
I get the following error: Specified types or functions (one per INFO message) not supported on Redshift tables.
Thanks in advance!
If you're wanting a Snowflake answer:
WITH keys AS (
    SELECT * FROM VALUES (1, 'burger;hotdog') a(id, items)
), data AS (
    SELECT * FROM VALUES (10, 'cheeseburger', 10) b(transaction_id, description, amount)
), seq_keys AS (
    SELECT s.seq_id, f.value AS key
    FROM (
        SELECT seq8() AS seq_id, k.*
        FROM keys AS k
    ) AS s
    ,LATERAL FLATTEN(input => SPLIT(s.items, ';')) f
)
SELECT d.*, sk.*
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%' || sk.key || '%'
gives:
TRANSACTION_ID DESCRIPTION AMOUNT SEQ_ID KEY
10 cheeseburger 10 0 "burger"
If you take DISTINCT on the SEQ_ID you can de-dupe cases where multiple keys match. I would also be inclined to add an ID to the data table.
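A sketch of that de-dupe, reusing the keys/data/seq_keys CTEs from above and keeping one row per matching transaction:
SELECT DISTINCT d.transaction_id, d.description, d.amount
FROM data d
JOIN seq_keys sk ON d.description ILIKE '%' || sk.key || '%'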

Extracting string and converting columns to rows in SQL (Redshift)

I have a column called "Description" in a table called "Food" which includes multiple food item names delimited by commas, such as chicken, soup, bread, coke.
How can I extract each item from the column and create multiple rows?
e.g. Currently it's like
{FoodID, FoodName, Description} ==> {123, Meal, "chicken, soup, bread, coke"}
and what I need is
{FoodID, FoodName, Description} ==> {123, Meal, chicken},
{123, Meal, soup},
{123, Meal, bread} etc.
In Redshift, I first did a split of "description" column as
select FoodID, FoodName, Description,
SPLIT_PART(Description, ',', 1) AS Item1,
SPLIT_PART(Description, ',', 2) AS Item2,
SPLIT_PART(Description, ',', 3) AS Item3, ..... till Item10
FROM Food
Consider that a max of 10 items can be there, hence Item10.
What's the best method to convert these columns Item1 to Item10 into rows? I tried UNION ALL, but it's taking a long time considering the huge volume of data.
Your question is answered here in detail, specifically for Redshift.
You just need to map your queries to the example queries provided there.
Your query will be something like below.
select (row_number() over (order by true))::int as n into numbers from food limit 100;
This will create the numbers table.
Your query would become:
select foodId, foodName, split_part(Description, ',', n) as descriptions
from food
cross join numbers
where split_part(Description, ',', n) is not null
  and split_part(Description, ',', n) != '';
Now, coming back to your original question about performance:
it's taking a longer time considering huge load of data.
Considering the typical data-warehouse use case of heavy reads and seldom writes, you should keep the raw food data you've described in a staging table, say stg_food.
Then use the following kind of query to do a one-time insert into the actual food table:
insert into food
select foodId, foodName, split_part(Description, ',', n) as descriptions
from stg_food
cross join numbers
where split_part(Description, ',', n) is not null
  and split_part(Description, ',', n) != '';
This writes once and makes your select queries faster.

Count number of occurrences of keyword in comma separated column?

I have a column which stores data like this:
Product:
product1,product2,product5
product5,product7
product1
What I would like to do is count the number of occurrences of product1, product2, etc., but where a record contains multiple products I want each of them counted.
So for the above example the totals would be:
product1: 2
product2: 1
product5: 2
product7: 1
How can I achieve this?
I was trying something like this:
select count(case when prodcolumn like '%product1%' then 'product1' end) from myTable
This gets me the count of times product1 appears, but how do I extend this to go through each product?
I also tried something like this:
select new_productvalue, count(new_productvalue) from OpportunityExtensionBase
group by new_ProductValue
But that lists all different combinations of the products which were found and how many times they were found...
These products don't change so hard coding it is ok...
EDIT: here is what worked for me.
WITH Product_CTE (prod) AS
(
    -- turn the delimited list into XML (<r>a</r><r>b</r>) and shred it into rows
    SELECT n.q.value('.', 'varchar(50)')
    FROM (
        SELECT cast('<r>' + replace(new_productvalue, ';', '</r><r>') + '</r>' AS xml)
        FROM table
    ) AS s(XMLCol)
    CROSS APPLY s.XMLCol.nodes('r') AS n(q)
    WHERE n.q.value('.', 'varchar(50)') <> ''
)
SELECT prod, count(*) AS [Num of Opps.]
FROM Product_CTE
GROUP BY prod
You have a lousy, lousy data structure, but sometimes one must make do with that. You should have a separate table storing each product/whatever pair; that is the relational way.
with prodref as (
    select 'product1' as prod union all
    select 'product2' as prod union all
    select 'product5' as prod union all
    select 'product7' as prod
)
select pr.prod, count(p.col)
from prodref pr left outer join
     product p
     on ',' + p.col + ',' like '%,' + pr.prod + ',%'
group by pr.prod;
This will be quite slow on a large table. And, the query cannot make use of standard indexes. But, it should work. If you can restructure the data, then you should.
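If restructuring ever becomes an option, the relational fix is a junction table with one row per record/product pair; the count then becomes a plain, index-friendly GROUP BY. A sketch (the table and column names here are hypothetical):
-- hypothetical junction table: one row per (record, product) pair
create table record_product (
    record_id int,
    prod varchar(50)
);
-- counting occurrences is then a simple aggregate
select prod, count(*)
from record_product
group by prod;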
Never mind, all you need is one split function: see
SQL query to split column data into rows
Hope you can manage after this.

Case Statement with Text Search

I have a data set (using SQL Server Management Studio) that is used for sales analysis. For this example, when an agent fulfills a Sales Call or Account Review, they list (via a drop-down) the topics they discussed in the call/review. There is then a corresponding column of the products that client purchased after the fact (in this example, I'm using automobiles). I'm thinking maybe a case statement is the way to do it, but in essence I need to figure out whether any of the makers the person suggested exist in the products column.
So in this example, in line 1, they suggested a Mazda and a Toyota (separated by ";") and Mazda appears in the products line, so that would be marked as effective. In line 3, they suggested Honda but the person ended up getting a Jeep, so that's not effective. So on and so forth.
I'd like for it to be dynamic (maybe an EXISTS??) so that I don't have to write/maintain something like 'Effective' = CASE WHEN Topic LIKE '%Mazda%' AND Products LIKE '%Mazda%' THEN 'Yes' ELSE 'No' WHEN .....
Thoughts?
If you have a Product table, then you might be able to get away with something like this:
select RowId, Topic, Products,
       (case when exists (select 1
                          from Products p
                          where t.Topic like '%' + p.brand + '%' and
                                t.Products like '%' + p.brand + '%')
             then 'Yes' else 'No'
        end) as Effective
from t;
This is based on the fact that the "brand" seems to be mentioned in both the topic and products fields. If you don't have such a table, you could do something like:
with products as (
    select 'Mercedes' as brand union all
    select 'Mazda' union all
    select 'Toyota' ...
)
select RowId, Topic, Products,
       (case when exists (select 1
                          from products p
                          where t.Topic like '%' + p.brand + '%' and
                                t.Products like '%' + p.brand + '%')
             then 'Yes' else 'No'
        end) as Effective
from t;
However, this may not work, because in the real world, text is more complicated. It has misspellings, abbreviations, and synonyms. There is no guarantee that there is even a matching word on both lists, and so on. But, if your text is clean enough, this approach might be helpful.