Accessing BigQuery RECORD - Repeated in Tableau - google-bigquery

I have a BigQuery table with a column of type RECORD and mode REPEATED. I have to query and use this table in Tableau. Using UNNEST or FLATTEN in BigQuery performs a CROSS JOIN of the table, which hurts performance. Is there any other way to use this table in Tableau without flattening it? I have posted a link to the table schema image below.
[Schema of Table]
https://i.stack.imgur.com/T4jHg.png

Is there any other way to use ... ?
You should not be afraid of UNNEST just because it “does” a CROSS JOIN.
The trick is that even though it is a cross join, it is a cross join within each row only, not global across all rows in the table. At the same time, there is always another way to do things.
So, Example 1 below presents a dummy example using UNNEST,
and Example 2 shows how to do the same without UNNEST in the main query, using a SQL UDF instead.
You have not presented specifics about your case, so the examples below are generic enough to show the ‘other’ way.
With Flattening via UNNEST
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, SUM(t.details) AS details
FROM yourTable, UNNEST(type) AS t
WHERE t.flag = 'y'
GROUP BY id
With SQL UDF
#standardSQL
CREATE TEMP FUNCTION do_something (
type ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
)
RETURNS INT64 AS ((
SELECT SUM(t.details) AS details
FROM UNNEST(type) AS t
WHERE t.flag = 'y'
));
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, do_something(type) AS details
FROM yourTable
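
To make the "cross join within the row only" point concrete, here is a minimal sketch in the same style. The table and column names (yourTable, vals) are made up for illustration. Each row is joined only with the elements of its own array, so the result size is the total number of array elements, not rows multiplied by all elements.

```sql
#standardSQL
-- Hypothetical sample data: row 1 has 3 elements, row 2 has 1 element.
WITH yourTable AS (
  SELECT 1 AS id, [10, 20, 30] AS vals UNION ALL
  SELECT 2 AS id, [40] AS vals
)
SELECT id, v
FROM yourTable
CROSS JOIN UNNEST(vals) AS v
-- Produces 4 rows (3 for id 1, 1 for id 2),
-- not 2 rows x 4 elements = 8 as a global cross join would.
```

The comma used in Example 1 (FROM yourTable, UNNEST(type)) is shorthand for the same correlated CROSS JOIN.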

Related

Updating Google Analytics (UA) export tables in BigQuery - how to unnest?

I want to update the existing GA (Universal Analytics) table that I exported to BigQuery. What I want to do is change the existing hits.eventInfo.eventLabel values that contain 'abc' to 'xyz'. I wrote this script, but it gives me the error "Cannot access field eventInfo on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [10:12]".
UPDATE `myProject.ga_sessions_20220403`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE (
(SELECT AS STRUCT *
REPLACE ('null' AS eventLabel)
FROM UNNEST([eventInfo])
) AS eventInfo)
FROM UNNEST(hits)
)
WHERE hits.eventInfo.eventLabel = 'abc'
What am I doing wrong, and how do I get this to work?
Also, how does the query change if I want to update multiple tables (i.e. multiple dates) with the same criteria? And what if I want to add another WHERE condition that accesses the page RECORD (e.g. hits.eventInfo.eventLabel = 'abc' AND hits.page.pagePath = '12345')?
You should keep exactly the same structure of hits after the update.
In the schema, hits is an array of structs with many fields.
Each struct also contains eventInfo, which is another struct.
Firstly, since hits is an array, you can't access hits.eventInfo in the WHERE clause. If you want to select rows whose hits contain eventInfo.eventLabel 'abc' and pagePath '12345', you can use this WHERE condition:
where exists (select 1 from unnest(hits) as hit where hit.eventInfo.eventLabel = 'abc' AND hit.page.pagePath = '12345')
Secondly, since eventInfo is not an array, you don't need to unnest it; you can access its fields directly.
So the complete code looks like this:
update `myProject.ga_sessions_20220403`
set hits =
ARRAY(
(
SELECT AS STRUCT
* REPLACE
(
case when eventInfo is not null
then
(
select as struct eventInfo.* replace
(
CASE
WHEN eventInfo.eventLabel = 'abc' THEN 'null'
ELSE eventInfo.eventLabel
END as eventLabel
)
)
end as eventInfo
)
FROM UNNEST(hits) as hit
))
WHERE exists (select 1 from unnest(hits) as hit where hit.eventInfo.eventLabel = 'abc' AND hit.page.pagePath = '12345')
Besides all the code above, I don't recommend updating the original data. I'd create another table using a SELECT statement instead of updating the original table, in case you want to access the original data in the future.
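
A rough sketch of that "write to a new table" alternative, reusing the same REPLACE logic as the UPDATE above. The destination name ga_sessions_20220403_copy is hypothetical, and the source name is the placeholder from the question. The WITH OFFSET ordering is an added precaution to keep the hits in their original order.

```sql
-- Sketch only: build a corrected copy instead of updating in place.
-- `myProject.ga_sessions_20220403_copy` is a made-up destination name.
CREATE TABLE `myProject.ga_sessions_20220403_copy` AS
SELECT
  * REPLACE (
    ARRAY(
      SELECT AS STRUCT
        hit.* REPLACE (
          CASE
            WHEN hit.eventInfo IS NOT NULL THEN (
              -- Rebuild eventInfo, changing only eventLabel.
              SELECT AS STRUCT hit.eventInfo.* REPLACE (
                CASE
                  WHEN hit.eventInfo.eventLabel = 'abc' THEN 'null'
                  ELSE hit.eventInfo.eventLabel
                END AS eventLabel
              )
            )
          END AS eventInfo
        )
      FROM UNNEST(hits) AS hit WITH OFFSET AS pos
      ORDER BY pos  -- keep the original order of hits
    ) AS hits
  )
FROM `myProject.ga_sessions_20220403`;
```

Rows without a matching hit pass through unchanged, so no WHERE clause is needed here.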

How to pass a string of column name as a parameter into a CREATE TABLE FUNCTION in BigQuery

I want to create a table function that takes two arguments, fieldName and parameter, so that I can later use this function to create tables for other fieldName and parameter pairs. I tried multiple ways, and it seems the fieldName (column name) is always parsed as a string in the WHERE clause. How should I do this correctly?
CREATE OR REPLACE TABLE FUNCTION dataset.functionName( fieldName ANY TYPE, parameter ANY TYPE)
as
(SELECT *
FROM `dataset.table`
WHERE format("%t",fieldName ) = parameter
)
Later call the function as
SELECT *
from dataset.functionName( 'passed_qa', 'yes')
(passed_qa is a column name and assume it only has 'yes' and 'no' value)
I tried using EXECUTE IMMEDIATE, it works, but I just want to know if there's a way to approach this in a functional way.
Thanks for any help!
Good news - IT IS POSSIBLE!!! (Side note: in my experience, I haven't had any cases where something was not possible to achieve in BigQuery, directly or indirectly via a workaround, with maybe a few exceptions.)
See example below
create or replace table function dataset.functionName(fieldName any type, parameter any type)
as (
select * from `bigquery-public-data.utility_us.us_states_area` t
where exists ( select true
from unnest(`bqutil.fn.json_extract_keys`(to_json_string(t))) key with offset
join unnest(`bqutil.fn.json_extract_values`(to_json_string(t))) value with offset
using(offset)
where key = fieldName and value = parameter
)
)
Now that the table function is created, run the query below and see the result
select *
from dataset.functionName('state_abbreviation', 'GU')
you will get the record for Guam
Then try below
select *
from dataset.functionName('division_code', '0')
with output
For details see:
https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions
A workaround can be to use a CASE statement to select the desired column. If any arbitrary column is needed, please use Mikhail Berlyant's solution.
Create or replace table function Test.HL(fieldName string,parameter ANY TYPE)
as
(
SELECT *
From ( select "1" as tmp, 2 as passed_qa) # generate dummy table
Where case
when fieldName="passed_qa" then format("%t",passed_qa)
when fieldName="tmp" then format("%t",tmp)
else ERROR(concat('column ',fieldName,' not found')) end = parameter
)

Getting selection of structs from an array of structs in BQ

I have a table where one column is defined as:
my_column ARRAY<STRUCT<key STRING, value FLOAT64, description STRING>>
Is there an easy way to specify the list of fields to be returned in a SELECT statement? For instance, removing description, so that the result column would still be an array of structs but containing only key and value.
Below is for BigQuery Standard SQL
#standardSQL
SELECT * REPLACE(
ARRAY(
SELECT AS STRUCT * EXCEPT(description)
FROM UNNEST(my_column)
) AS my_column)
FROM `project.dataset.table`
The above fully preserves the table schema and only changes the my_column field by removing description.
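
For trying the query without a real table, here is a self-contained variant with inline data. The sample CTE, its id column and its values are invented for illustration; only my_column and its field names come from the question.

```sql
#standardSQL
-- Hypothetical one-row table with the column type from the question.
WITH sample AS (
  SELECT 1 AS id,
    ARRAY<STRUCT<key STRING, value FLOAT64, description STRING>>
    [('a', 1.5, 'first'), ('b', 2.5, 'second')] AS my_column
)
SELECT * REPLACE(
  ARRAY(
    SELECT AS STRUCT * EXCEPT(description)
    FROM UNNEST(my_column)
  ) AS my_column)
FROM sample
-- my_column is now ARRAY<STRUCT<key STRING, value FLOAT64>>; id is untouched.
```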
I would just unnest and then re-aggregate your selected fields.
select array_agg(struct(m.key,m.value)) as my_new_column
from table
left join unnest(my_column) m
I found this way:
SELECT
ARRAY(SELECT AS VALUE STRUCT(key, value) FROM a.my_column) as my_new_column
FROM my_table a
No joining or unnesting needed.

SQL Error in Google Big Query with UNION ALL on tables with same schema EDIT: change in schema from String to INT

I have the following query
SELECT *
FROM `January_2018`
UNION ALL
SELECT *
FROM `February_2018`
I get the following error on the second SELECT call
Column 14 in UNION ALL has incompatible types: STRING, STRING, INT64,
INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64, INT64 at [7:3]
The column name is travel_type with a type of integer with values 0, 1 and 2.
I am trying to make one large table from several smaller ones - monthly tables of the same data. It seems that one of the fields has changed from String to Int data type after the 4th month and stays Int ongoing after that.
Try the following so both table schemas match:
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table1
UNION ALL
SELECT * EXCEPT(changed_column)
, CAST(changed_column AS STRING) AS changed_column
FROM table2
To select data from several tables you can use a wildcard table instead of UNION. A wildcard query runs against all tables matching the pattern. Use the wildcard ‘*’ after a table prefix to select multiple tables at once. Your table names must share the same prefix and differ only in suffix, e.g. Mytable_1, Mytable_2, Mytable_3, ...
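
A minimal sketch of such a wildcard query. The project, dataset and Mytable_ prefix are placeholders. The _TABLE_SUFFIX pseudo-column restricts which matching tables are scanned.

```sql
#standardSQL
-- Hypothetical names: project.dataset and the Mytable_ prefix are placeholders.
SELECT *
FROM `project.dataset.Mytable_*`
WHERE _TABLE_SUFFIX IN ('1', '2', '3')  -- scan only Mytable_1, Mytable_2, Mytable_3
```

Note that wildcard tables expect compatible schemas across the matched tables, so a column that changed from STRING to INT64 may still need the CAST approach shown above.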

How to get Records based on multiple columns from a table

Consider the following table.
From the above table I want to select the Middle BFS_SCORE per LN_LOAN_ID and BR_ID. There are some LN_LOAN_ID with single score.
As an example for the above table the output I need is as below.
Please let me know how this can be achieved.
To handle cases where there are two scores for a unique pair of LN_LOAN_ID, BR_ID you need a median, as there is no middle value for BFS_SCORE.
Postgres solution:
Create a median aggregate function following Postgres wiki:
CREATE OR REPLACE FUNCTION _final_median(NUMERIC[])
RETURNS NUMERIC AS
$$
SELECT AVG(val)
FROM (
SELECT val
FROM unnest($1) val
ORDER BY 1
LIMIT 2 - MOD(array_upper($1, 1), 2)
OFFSET CEIL(array_upper($1, 1) / 2.0) - 1
) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE AGGREGATE median(NUMERIC) (
SFUNC=array_append,
STYPE=NUMERIC[],
FINALFUNC=_final_median,
INITCOND='{}'
);
Then your query would look as simple as this:
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
But the tricky part comes with score_order. If a pair has two scores and you really need a median, not a middle value, then there may be no row matching the calculated score, so score_order will be null. Other than that, join back to your table to retrieve it for the "middle" row:
select
t1.ln_loan_id, t1.bfs_score, t1.br_id, t2.score_order
from (
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
) t1
left join yourtable t2 on
t1.ln_loan_id = t2.ln_loan_id
and t1.br_id = t2.br_id
and t1.bfs_score = t2.bfs_score
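
As a possible alternative to a custom aggregate, PostgreSQL 9.4 and later ship ordered-set aggregates that cover both cases. The sketch below assumes the table is called yourtable and uses the column names from the question. percentile_cont interpolates (a true median), while percentile_disc returns a value that actually exists in the group, so a join back on that score can find a row.

```sql
-- Sketch, assuming PostgreSQL 9.4+ and the question's column names.
SELECT
  ln_loan_id,
  br_id,
  -- interpolated median; for two scores this is their average
  percentile_cont(0.5) WITHIN GROUP (ORDER BY bfs_score) AS median_score,
  -- an actual score from the group; for two scores this is the lower one
  percentile_disc(0.5) WITHIN GROUP (ORDER BY bfs_score) AS middle_score
FROM yourtable
GROUP BY ln_loan_id, br_id;
```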