We are trying to create a materialized view of a large BQ table. The table receives a high volume of streaming web activity inserts, is multi-tenant, and really leverages BQ's nested columnar structure.
We want to create a subset of this table for more efficient, near-real-time query execution with minimal administrative overhead. We thought the simplest solution would be a materialized view that is just a subset of rows (by client) and columns, but currently materialized views require aggregation.
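For context, a minimal sketch of the kind of materialized view the beta does accept (the table and column names here are hypothetical): plain aggregates over a GROUP BY, with nothing nested:

CREATE MATERIALIZED VIEW project.dataset.client_event_counts AS
SELECT
  client_id,  -- hypothetical tenant column
  COUNT(*) AS event_count
FROM project.dataset.events
GROUP BY client_id;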
Additionally, the materialized view beta supports a limited set of aggregation functions and does not support sub-selects or UNNEST operations. We have not found a good method of extracting the deeply nested STRUCTs into the materialized view. A simple example:
SELECT
  '7602E3E96349E972' as session_id,
  '084F0262' as transaction_id,
  [STRUCT(
    [STRUCT(
      'promotions' as name,
      ['SAVE50'] as value),
     STRUCT(
      'discounts' as name,
      ['9.99'] as value)
    ] as modifiers
  )] as contexts_transaction
UNION ALL
SELECT
  '7602E3E96349E972' as session_id,
  '01ECB6EF' as transaction_id,
  [STRUCT(
    [STRUCT(
      'promotions' as name,
      ['SPRING','LOVE'] as value),
     STRUCT(
      'discounts' as name,
      ['14.99','6.99'] as value)
    ] as modifiers
  )] as contexts_transaction
UNION ALL
SELECT
  '508082BC49BAC09F' as session_id,
  '038B67CF' as transaction_id,
  [STRUCT(
    [STRUCT(
      'promotions' as name,
      ['FREESHIP','HOLIDAY25'] as value),
     STRUCT(
      'discounts' as name,
      ['9.99'] as value)
    ] as modifiers
  )] as contexts_transaction
UNION ALL
SELECT
  'C88AE153C784D910' as session_id,
  'EA716BD2' as transaction_id,
  [STRUCT(
    [STRUCT(
      'promotions' as name,
      ['CYBER'] as value),
     STRUCT(
      'discounts' as name,
      ['9.99','19.99'] as value)
    ] as modifiers
  )] as contexts_transaction
Ideally we would retain this STRUCT as is. We are trying to accomplish something like this in the materialized view (recognizing these are not supported MV features):
SELECT
session_id,
transaction_id,
ARRAY_AGG(STRUCT<name STRING, value ARRAY<STRING>>(mods_array.name,mods_array.value)) as modifiers
FROM data,
UNNEST(contexts_transaction) trans_array,
UNNEST(trans_array.modifiers) mods_array
GROUP BY 1,2
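For comparison, the same SQL is accepted as an ordinary logical view (a sketch with a hypothetical view name, assuming data is the base table); a logical view re-runs the query on every read, so it lacks the cost and freshness benefits we want from an MV:

CREATE OR REPLACE VIEW project.dataset.transaction_modifiers AS
SELECT
  session_id,
  transaction_id,
  ARRAY_AGG(STRUCT<name STRING, value ARRAY<STRING>>(mods_array.name, mods_array.value)) as modifiers
FROM project.dataset.data,
  UNNEST(contexts_transaction) trans_array,
  UNNEST(trans_array.modifiers) mods_array
GROUP BY 1, 2;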
We are open to any method of subsetting this massive table, not just MV, but would love it to have the same benefits (low maintenance, automatic, low cost). Any suggestions appreciated!
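One direction we have considered is a scheduled query that maintains a plain subset table incrementally, though it is not as hands-off as an MV. A sketch only; client_id, ingest_time, and the table names are hypothetical placeholders for whatever identifies the tenant and the arrival time:

-- Hypothetical incremental refresh, run on a schedule:
INSERT INTO project.dataset.client_subset
SELECT session_id, transaction_id, contexts_transaction
FROM project.dataset.events
WHERE client_id = 'CLIENT_A'  -- hypothetical tenant column
  AND ingest_time > (SELECT IFNULL(MAX(ingest_time), TIMESTAMP('1970-01-01'))
                     FROM project.dataset.client_subset);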
As far as I can understand from your question, you want output similar to this:
with rawdata AS
(
SELECT 1 as userid, [STRUCT('transactionIds' as name, ['ABCDEF'] as value), STRUCT('couponIds' as name, ['123456'] as value)] as transactions union all
SELECT 1 as userid, [STRUCT('transactionIds' as name, ['XYZ', 'KLM'] as value), STRUCT('couponIds' as name, ['789', '567'] as value)] union all
SELECT 2 as userid, [STRUCT('transactionIds' as name, ['XY', 'KL'] as value), STRUCT('couponIds' as name, ['10', '15'] as value)] union all
SELECT 2 as userid, [STRUCT('transactionIds' as name, ['X', 'K'] as value), STRUCT('couponIds' as name, ['20', '25'] as value)]
)
select
userid,
ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'transactionIds')) as transactionIds,
ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'couponIds')) as couponIds
from rawdata
group by userid;
So, the input table is the rawdata above, while the output table looks like this:

userid | transactionIds     | couponIds
1      | [ABCDEF, XYZ, KLM] | [123456, 789, 567]
2      | [XY, KL, X, K]     | [10, 15, 20, 25]
If your intention is different, please state it in the question with more details.
For this purpose, I tried to create that query as a materialized view.
create or replace table project.dataset.rawdata as
SELECT 1 as userid, [STRUCT('transactionIds' as name, ['ABCDEF'] as value), STRUCT('couponIds' as name, ['123456'] as value)] as transactions union all
SELECT 1 as userid, [STRUCT('transactionIds' as name, ['XYZ', 'KLM'] as value), STRUCT('couponIds' as name, ['789', '567'] as value)] union all
SELECT 2 as userid, [STRUCT('transactionIds' as name, ['XY', 'KL'] as value), STRUCT('couponIds' as name, ['10', '15'] as value)] union all
SELECT 2 as userid, [STRUCT('transactionIds' as name, ['X', 'K'] as value), STRUCT('couponIds' as name, ['20', '25'] as value)]
;
create materialized view project.dataset.mview as
select
userid,
ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'transactionIds')) as transactionIds,
ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'couponIds')) as couponIds
from project.dataset.rawdata
GROUP BY userid
However, I get the error "Unsupported aggregation function in materialized view: array_concat_agg".
Since materialized views are still in beta, we don't know whether this will be supported in the future. For now, however, it's not possible with the current capabilities.
@fhoffa can tell more about it, maybe.
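Until then, one workaround is to materialize the same result as a plain table and refresh it on a schedule; this is the exact query the MV rejected, so it runs fine as a regular statement, but the refresh is neither automatic nor incremental:

CREATE OR REPLACE TABLE project.dataset.mview_fallback AS
SELECT
  userid,
  ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'transactionIds')) as transactionIds,
  ARRAY_CONCAT_AGG((SELECT trx.value FROM UNNEST(transactions) trx WHERE trx.name = 'couponIds')) as couponIds
FROM project.dataset.rawdata
GROUP BY userid;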
I'm using BigQuery SQL here.
This is the table I built:
This table shows that customer id_123 has purchases of type_shop and uses both delivery_shop and delivery_home.
Is it possible to get the result reflected in a single row instead of 2 different rows?
I only want to show that customer id_123 purchased via type_shop and uses delivery_home and delivery_shop, in one row.
I tried a few methods using array_agg(stru) but it still shows 2 rows of results instead of 1.
Not sure what other SQL function I should try here? I tried searching for similar content on Stack Overflow, but there isn't one that I can apply.
Assuming your sample data is your first table, consider the approach below:
with sample_data as (
select 'id_123' as customer, 'm' as gender, [1,1] as type_shop, [0,0] as type_online, [0,0] as delivery_pickup,[0,1] as delivery_home, [1,0] as delivery_shop,
union all select 'id_456' as customer, 'f' as gender, [1,0,1] as type_shop, [0,1,0] as type_online, [0,0,0] as delivery_pickup,[1,0,0] as delivery_home, [0,1,1] as delivery_shop,
),
normalize_data as (
select
customer,
gender,
type_shop[safe_offset(index)] as type_shop,
type_online[safe_offset(index)] as type_online,
delivery_pickup[safe_offset(index)] as delivery_pickup,
delivery_home[safe_offset(index)] as delivery_home,
delivery_shop[safe_offset(index)] as delivery_shop,
from sample_data,
unnest(generate_array(0,array_length(type_shop)-1)) as index
),
join_data as (
select
customer,
gender,
max(type_shop) as t_shop,
max(type_online) as t_online,
max(delivery_pickup) as delivery_pickup,
max(delivery_home) as delivery_home,
max(delivery_shop) as delivery_shop,
from normalize_data
group by customer,gender,type_shop,type_online
)
select
customer,
gender,
array_agg(t_shop) as type_shop,
array_agg(t_online) as type_online,
array_agg(delivery_pickup) as delivery_pickup,
array_agg(delivery_home) as delivery_home,
array_agg(delivery_shop) as delivery_shop,
from join_data
group by customer,gender
Output:
Beginner's question here... I have a table of tree measurements with 3 fields: ID, Diameter_1, Diameter_2,
and I wish to get to these 3 fields: ID, DiameterName, DiameterMeasurement.
Input and Desired Output
SELECT DISTINCT ID, Diameter_1
FROM tblDiameters
UNION SELECT DISTINCT ID, Diameter_2
FROM tblDiameters;
Though it results in only 2 fields. How may the field DiameterMeasurement be brought in?
Many thanks :-)
You were on the right track to use a union. Here is one viable approach:
SELECT ID, 'Diameter_1' AS DiameterName, Diameter_1 AS DiameterMeasurement
FROM tblDiameters
UNION ALL
SELECT ID, 'Diameter_2', Diameter_2
FROM tblDiameters
ORDER BY ID, DiameterName;
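If the table lives in BigQuery (as most of this thread does), the UNPIVOT operator expresses the same reshaping more compactly; a sketch assuming the same table and column names:

SELECT ID, DiameterName, DiameterMeasurement
FROM tblDiameters
UNPIVOT (DiameterMeasurement FOR DiameterName IN (Diameter_1, Diameter_2))
ORDER BY ID, DiameterName;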
I have a JSON string with two levels: list[json{list[json]}]:
UserID Json_List
100 [{"application_charge":{"id":13409353813,"name":"Starter","api_client_id":2485321},"usage_charges":[{"id":48216805,"description":"Extras","price":"60.70"}]}]
200 [{"application_charge":{"id":13409353814,"name":"Starter","api_client_id":2485322},"usage_charges":[{"id":48216890,"description":"Extras","price":"80.79"}]}]
I need the output as a table like this:

UserID | application_charge.id | name    | api_client_id | usage_charges.id | description | price
100    | 13409353813           | Starter | 2485321       | 48216805         | Extras      | 60.7
200    | 13409353814           | Starter | 2485322       | 48216890         | Extras      | 80.79
I managed to pull out the first level, "application_charge", but I don't understand how to get to the next level, "usage_charges":
select
UserID,
json_extract_scalar(json, '$.application_charge.id') as id,
json_extract_scalar(json, '$.application_charge.name') as name,
json_extract_scalar(json, '$.application_charge.api_client_id') as api_client_id
from `be-prod-data.retailers_billing_data_production.shopify_application_charges`,
unnest(json_extract_array( json_list, '$')) json
How can I extract data from the second level?
Just add one more unnest(json_extract_array(...)):
with mytable as (
select 100 as userid, '[{"application_charge":{"id":13409353813,"name":"Starter","api_client_id":2485321},"usage_charges":[{"id":48216805,"description":"Extras","price":"60.70"}]}]' as json_list union all
select 200, '[{"application_charge":{"id":13409353814,"name":"Starter","api_client_id":2485322},"usage_charges":[{"id":48216890,"description":"Extras","price":"80.79"}]}]'
)
select
UserID,
json_extract_scalar(json, '$.application_charge.id') as id,
json_extract_scalar(json, '$.application_charge.name') as name,
json_extract_scalar(json, '$.application_charge.api_client_id') as api_client_id,
json_extract_scalar(nested, '$.id') as usage_charges_id,
from mytable,
unnest(json_extract_array( json_list, '$')) json, unnest(json_extract_array(json, '$.usage_charges')) as nested
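To produce the question's full desired output, the remaining columns follow the same pattern over each usage_charges element (same sample CTE as above):

select
  UserID,
  json_extract_scalar(json, '$.application_charge.id') as id,
  json_extract_scalar(json, '$.application_charge.name') as name,
  json_extract_scalar(json, '$.application_charge.api_client_id') as api_client_id,
  json_extract_scalar(nested, '$.id') as usage_charges_id,
  json_extract_scalar(nested, '$.description') as description,
  json_extract_scalar(nested, '$.price') as price
from mytable,
  unnest(json_extract_array(json_list, '$')) json,
  unnest(json_extract_array(json, '$.usage_charges')) as nested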
I'm trying to calculate the quality of users with cohort data in BigQuery.
My current query is:
WITH analytics_data AS (
SELECT user_pseudo_id, event_timestamp, event_name, app_info.id,geo.country as country,platform ,app_info.id as bundle_id,
UNIX_MICROS(TIMESTAMP("2019-12-05 00:00:00")) AS start_day,
3600*1000*1000*24 AS one_day_micros
FROM `table.events_*`
WHERE _table_suffix BETWEEN "20191205" AND "20191218"
)
SELECT day_7_cohort / day_0_cohort AS seven_day_conversion FROM (
WITH day_7_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = 'watched_20_ads' AND event_timestamp BETWEEN start_day AND start_day+(12*one_day_micros)
), day_0_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = "first_open"
AND bundle_id = "com.bundle.id"
AND country = "United States"
AND platform = "ANDROID"
AND event_timestamp BETWEEN start_day AND start_day+(1*one_day_micros)
)
SELECT
(SELECT count(*)
FROM day_0_users) AS day_0_cohort,(SELECT count(*)
FROM day_7_users
JOIN day_0_users USING (user_pseudo_id)) AS day_7_cohort
)
The problem is that I'm unable to separate the users by tracking source.
I want to separate the users by tracking source and country.
What I'm currently getting:
What I would like to see:
What would be perfect:
I'm not sure if it's possible to write a query that would return the data in a single table, without involving more queries and data storage elsewhere.
So your question is missing some data/fields, but I will provide a 'general' solution.
with data as (
-- Select the fields you need to define criteria and cohorts
),
cohort_info as (
-- Cohort Logic (might be more complicated than this)
select user_id, source, country---, etc...
from data
group by 1,2,3
),
day_0_users as (
-- Logic to determine who you are measuring for your calculation
),
day_7_users as (
-- Logic to determine who qualifies as a 7 day user for your calculation
),
joined as (
-- Join your CTEs together
select
cohort_info.source,
cohort_info.country,
count(distinct day_0_users.user_id) as day_0_count,
count(distinct day_7_users.user_id) as day_7_count
from day_0_users
left join day_7_users using(user_id)
inner join cohort_info using(user_id)
group by 1,2
)
select *, day_7_count/day_0_count as seven_day_conversion
from joined
I think using several CTEs in this manner will make your code more readable and will let you follow your logic a bit better. Nested subqueries tend to get ugly.
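As a concrete illustration of the cohort_info piece against the Firebase/GA4 export the question is querying: traffic_source.source is the standard acquisition-source field in the events_* schema (verify it exists in your export), so a sketch could look like:

with cohort_info as (
  select
    user_pseudo_id as user_id,
    traffic_source.source as source,  -- acquisition source in the Firebase/GA4 export
    geo.country as country
  from `table.events_*`
  where _table_suffix between "20191205" and "20191218"
  group by 1, 2, 3
)
select * from cohort_info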
I have a table query result that looks like this (after a few SQL queries):

Element | Subelement | ID       | Email              | Value
1003022 | 10003981   | "454255" | "email1#yahoo.com" | 25.5
1003022 | 10003981   | "454109" | "email2#yahoo.com" | 34.45
1003027 | 10033987   | "454369" | "email3#yahoo.com" | 1.9
1003027 | 10033987   | "454255" | "email1#yahoo.com" | 25.5
1003011 | 10021233   | "454209" | "email2#yahoo.com" | 34.45
1222011 | 13513544   | "454209" | "email2#yahoo.com" | 34.45
These are events where the first two columns (Element and Subelement) vary together as a group.
Based on the ID (email and value are the same for one ID), I want to have a result like this:
ID       | Email              | Value | Elements
"454255" | "email1#yahoo.com" | 25.5  | {[1003022, 10003981], [1003027, 10033987]}
"454109" | "email2#yahoo.com" | 34.45 | {[1003022, 10003981], [1003011, 10021233], [1222011, 13513544]}
"454369" | "email3#yahoo.com" | 1.9   | {[1003027, 10033987]}
Or any format that keeps ID (email, value) on one line and adds the Element and Subelement to a list/array.
UPDATE:
I've tried group_concat, but could not find a way to do it.
How about this?
#standardSQL
SELECT ID, email, value, ARRAY_AGG(STRUCT(element, subelement)) AS Elements
FROM YourTable
GROUP BY ID, email, value;
With Standard SQL you can do the following:
#standardSQL
with t as
(select 1003022 element, 10003981 subelement, "454255" id, "email1#yahoo.com" email, 25.5 value union all
select 1003027, 10033987, "454255", "email1#yahoo.com", 25.5)
SELECT id, email, value,
array_agg(struct<array<int64>>([element, subelement])) elements
FROM t
GROUP BY 1, 2, 3
#standardSQL
WITH yourTable AS (
SELECT 1003022 AS Element, 10003981 AS Subelement, "454255" AS ID, "email1#yahoo.com" AS Email, 25.5 AS Value UNION ALL
SELECT 1003022 AS Element, 10003981 AS Subelement, "454209" AS ID, "email2#yahoo.com" AS Email, 34.45 AS Value UNION ALL
SELECT 1003027 AS Element, 10033987 AS Subelement, "454369" AS ID, "email3#yahoo.com" AS Email, 1.9 AS Value UNION ALL
SELECT 1003027 AS Element, 10033987 AS Subelement, "454255" AS ID, "email1#yahoo.com" AS Email, 25.5 AS Value UNION ALL
SELECT 1003011 AS Element, 10021233 AS Subelement, "454209" AS ID, "email2#yahoo.com" AS Email, 34.45 AS Value UNION ALL
SELECT 1222011 AS Element, 13513544 AS Subelement, "454209" AS ID, "email2#yahoo.com" AS Email, 34.45 AS Value
)
SELECT
ID, Email, Value,
CONCAT('{', STRING_AGG(CONCAT('[', CAST(Element AS STRING), ',', CAST(Subelement AS STRING), ']')), '}') AS Elements
FROM yourTable
GROUP BY ID, Email, Value
-- ORDER BY Email
The result is
ID     | Email            | Value | Elements
454255 | email1#yahoo.com | 25.5  | {[1003022,10003981],[1003027,10033987]}
454209 | email2#yahoo.com | 34.45 | {[1003022,10003981],[1003011,10021233],[1222011,13513544]}
454369 | email3#yahoo.com | 1.9   | {[1003027,10033987]}
Your question is a little fuzzy in terms of the expected output, thus you have many answers.
I think - the more the merrier :o)