Updating Google Analytics (UA) export tables in BigQuery - how to unnest?

I want to update an existing GA (Universal Analytics) table that I exported to BigQuery. Specifically, I want to modify the existing hits.eventInfo.eventLabel field, changing values that contain 'abc' into 'xyz'. I wrote this script, but it gives me the error "Cannot access field eventInfo on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [10:12]".
UPDATE `myProject.ga_sessions_20220403`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE (
(SELECT AS STRUCT *
REPLACE ('null' AS eventLabel)
FROM UNNEST([eventInfo])
) AS eventInfo)
FROM UNNEST(hits)
)
WHERE hits.eventInfo.eventLabel = 'abc'
What am I doing wrong, and how do I get this to work?
Also, how does the query change if I want to update multiple tables (i.e. multiple dates) with the same criteria? And what if I want to add another WHERE condition that accesses the page RECORD (e.g. hits.eventInfo.eventLabel = 'abc' AND hits.page.pagePath = '12345')?

You should keep the exact same structure of hits after update.
In the schema, hits is an array of struct with many fields.
This struct also contains eventInfo, which is another struct.
First, since hits is an array, you can't access hits.eventInfo directly in the WHERE clause. To select rows that have a hit with eventInfo.eventLabel 'abc' and page.pagePath '12345', you can use this WHERE condition:
where exists (select 1 from unnest(hits) as hit where hit.eventInfo.eventLabel = 'abc' AND hit.page.pagePath = '12345')
Second, since eventInfo is a struct, not an array, you don't need to UNNEST it; you can access its fields directly.
Putting it all together:
update `myProject.ga_sessions_20220403`
set hits =
ARRAY(
(
SELECT AS STRUCT
* REPLACE
(
case when eventInfo is not null
then
(
select as struct eventInfo.* replace
(
CASE
WHEN eventInfo.eventLabel = 'abc' THEN 'null'
ELSE eventInfo.eventLabel
END as eventLabel
)
)
end as eventInfo
)
FROM UNNEST(hits) as hit
))
WHERE exists (select 1 from unnest(hits) as hit where hit.eventInfo.eventLabel = 'abc' AND hit.page.pagePath = '12345')
Apart from the code above, I don't recommend updating the original data. I would create a new table with a SELECT statement instead, so that the original data stays available in case you need it later.
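To make the row-level versus hit-level behaviour of the UPDATE statement above concrete, here is a small plain-Python model of it. This is only a sketch: the sample rows are made up, and row_matches, rewrite_hits and update are hypothetical helper names, not BigQuery functions. It shows that the WHERE EXISTS filter selects whole rows (sessions), while the CASE inside ARRAY(...) decides hit by hit which labels change.

```python
import copy


def row_matches(row):
    # Models: WHERE EXISTS (SELECT 1 FROM UNNEST(hits) AS hit WHERE ...)
    return any(
        hit.get("eventInfo") is not None
        and hit["eventInfo"].get("eventLabel") == "abc"
        and hit["page"]["pagePath"] == "12345"
        for hit in row["hits"]
    )


def rewrite_hits(hits):
    # Models: ARRAY(SELECT AS STRUCT * REPLACE (...) FROM UNNEST(hits))
    out = []
    for hit in hits:
        new_hit = copy.deepcopy(hit)  # every other field is kept as-is
        info = new_hit.get("eventInfo")
        if info is not None and info.get("eventLabel") == "abc":
            info["eventLabel"] = "null"
        out.append(new_hit)
    return out


def update(table):
    # Only rows that pass the filter get their hits array rebuilt.
    return [
        {**row, "hits": rewrite_hits(row["hits"])} if row_matches(row) else row
        for row in table
    ]


# Made-up sample data with just the fields used here.
table = [
    {"id": 1, "hits": [
        {"eventInfo": {"eventLabel": "abc"}, "page": {"pagePath": "12345"}},
        {"eventInfo": {"eventLabel": "abc"}, "page": {"pagePath": "other"}},
        {"eventInfo": None, "page": {"pagePath": "12345"}},
    ]},
    {"id": 2, "hits": [
        {"eventInfo": {"eventLabel": "abc"}, "page": {"pagePath": "other"}},
    ]},
]

updated = update(table)
labels = [
    [(h["eventInfo"] or {}).get("eventLabel") for h in row["hits"]]
    for row in updated
]
print(labels)
```

In this model the second hit of row 1 is rewritten even though its pagePath is 'other', because its row matched and the inner CASE only tests eventLabel. If only hits on that page should change, the pagePath test would also have to go into the inner CASE.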

Related

How to pass a string of column name as a parameter into a CREATE TABLE FUNCTION in BigQuery

I want to create a table function that takes two arguments, fieldName and parameter, so that I can later use it to create tables for other fieldName and parameter pairs. I tried multiple ways, and it seems that fieldName (the column name) is always treated as a string in the WHERE clause. How should I do this correctly?
CREATE OR REPLACE TABLE FUNCTION dataset.functionName( fieldName ANY TYPE, parameter ANY TYPE)
as
(SELECT *
FROM `dataset.table`
WHERE format("%t",fieldName ) = parameter
)
Later call the function as
SELECT *
from dataset.functionName( 'passed_qa', 'yes')
(passed_qa is a column name and assume it only has 'yes' and 'no' value)
I tried using EXECUTE IMMEDIATE, it works, but I just want to know if there's a way to approach this in a functional way.
Thanks for any help!

Good news - IT IS POSSIBLE!!! (Side note: in my experience, with maybe a few exceptions, I haven't had any cases where something could not be achieved in BigQuery, either directly or through a workaround.)
See example below
create or replace table function dataset.functionName(fieldName any type, parameter any type)
as (
select * from `bigquery-public-data.utility_us.us_states_area` t
where exists ( select true
from unnest(`bqutil.fn.json_extract_keys`(to_json_string(t))) key with offset
join unnest(`bqutil.fn.json_extract_values`(to_json_string(t))) value with offset
using(offset)
where key = fieldName and value = parameter
)
)
Once the table function is created, run the query below to see the result:
select *
from dataset.functionName('state_abbreviation', 'GU')
You will get the record for Guam.
Then try below
select *
from dataset.functionName('division_code', '0')

For details see:
https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions
A workaround is to use a CASE expression to select the desired column. If arbitrary columns must be supported, use Mikhail Berlyant's solution instead.
Create or replace table function Test.HL(fieldName string,parameter ANY TYPE)
as
(
SELECT *
From ( select "1" as tmp, 2 as passed_qa) # generate dummy table
Where case
when fieldName="passed_qa" then format("%t",passed_qa)
when fieldName="tmp" then format("%t",tmp)
else ERROR(concat('column ',fieldName,' not found')) end = parameter
)
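Both answers come down to looking up a column by its name at query time. The following plain-Python model is only a sketch of the two strategies, with made-up rows and hypothetical function names (filter_generic, filter_whitelist); it is not how BigQuery evaluates either query. The generic variant serializes each row and compares key/value pairs as strings, and the whitelist variant dispatches on a fixed set of known columns and fails loudly otherwise.

```python
import json


def filter_generic(rows, field_name, parameter):
    # Mirrors the to_json_string + key/value matching idea:
    # works for any column, at the cost of serializing every row.
    out = []
    for row in rows:
        pairs = json.loads(json.dumps(row)).items()
        if any(k == field_name and str(v) == parameter for k, v in pairs):
            out.append(row)
    return out


def filter_whitelist(rows, field_name, parameter):
    # Mirrors the CASE-based workaround: only listed columns are supported.
    known = {"passed_qa", "tmp"}
    if field_name not in known:
        raise ValueError(f"column {field_name} not found")
    return [row for row in rows if str(row[field_name]) == parameter]


# Made-up sample rows.
rows = [
    {"tmp": "1", "passed_qa": "yes"},
    {"tmp": "2", "passed_qa": "no"},
]

print(filter_generic(rows, "passed_qa", "yes"))
print(filter_whitelist(rows, "tmp", "2"))
```

The trade-off is the same as in the SQL: the generic version needs no maintenance when columns change, while the whitelist version avoids per-row serialization but must be edited for each new column.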

BigQuery MERGE statement with NESTED+REPEATED fields

I need to do a merge statement in BigQuery using a classic flat table, having as target a table with nested and repeated fields, and I'm having trouble understanding how this is supposed to work. Google's examples use direct values, so the syntax here is not really clear to me.
Using this example:
CREATE OR REPLACE TABLE
mydataset.DIM_PERSONA (
IdPersona STRING,
Status STRING,
Properties ARRAY<STRUCT<
Id STRING,
Value STRING,
_loadingDate TIMESTAMP,
_lastModifiedDate TIMESTAMP
>>,
_loadingDate TIMESTAMP NOT NULL,
_lastModifiedDate TIMESTAMP
);
INSERT INTO mydataset.DIM_PERSONA
values
('A', 'KO', [('FamilyMembers', '2', CURRENT_TIMESTAMP(), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),
('B', 'KO', [('FamilyMembers', '4', CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),('Pets', '1', CURRENT_TIMESTAMP(), NULL)], CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
;
CREATE OR REPLACE TABLE
mydataset.PERSONA (
IdPersona STRING,
Status STRING,
IdProperty STRING,
Value STRING
);
INSERT INTO mydataset.PERSONA
VALUES('A', 'OK','Pets','3'),('B', 'OK','FamilyMembers','5'),('C', 'OK','Pets','2')
The goal is to:
Update IdPersona='A', adding a new element in Properties and changing Status
Update IdPersona='B', updating the existent element in Properties
Insert IdPersona='C'
This INSERT works:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
) Properties,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
But I would like to build the nested/repeated fields in the INSERT clause, because for the UPDATE I would also need (I think) to do a "SELECT AS STRUCT * REPLACE" by comparing the values of TRG with SRC.
This doesn't work:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
*
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
),
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
)
I get "Correlated Subquery is unsupported in INSERT clause."
Even if I used the first option, I don't see how to reference TRG.Properties in the UPDATE:
WHEN MATCHED THEN
UPDATE
SET Properties = ARRAY(
SELECT AS STRUCT p_SRC.*
REPLACE (IF(p_SRC.IdProperty=p_TRG.id AND p_SRC.Value<>p_TRG.Value,p_SRC.Value,p_TRG.Value) AS Value)
FROM SRC.Properties p_SRC, TRG.Properties p_TRG
)
Obviously this is wrong, though.
One way to solve this, as I see it, is to pre-join everything in the USING clause, therefore doing all the replacement there, but it feels very wrong for a merge statement.
Can anyone help me figure this out, please? :\

So, I wanted to share a possible solution, although I still hope there's another way.
As mentioned, I pre-compute what I need with a CTE and a FULL OUTER JOIN, therefore recreating the array of structs I need later on (tables will be relatively small so I can afford it).
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
WITH NEW_PROPERTIES AS (
SELECT
COALESCE(idp,IdPersona) IdPersona,
ARRAY_AGG((
SELECT AS STRUCT
COALESCE(idpro,Id) IdProperty,
COALESCE(vl,Value) Value,
COALESCE(_loadingDate,CURRENT_TIMESTAMP) _loadingDate,
IF(idp=IdPersona,CURRENT_TIMESTAMP,TIMESTAMP(NULL)) _lastModifiedDate
)) Properties
FROM (
SELECT DIP.IdPersona, DIP.Status, DIP_PR.*, PER.IdPersona idp, PER.Status st, PER.IdProperty idpro, PER.Value vl
FROM `clean-yew-281811.mydataset.DIM_PERSONA` DIP
CROSS JOIN UNNEST(DIP.Properties) DIP_PR
FULL OUTER JOIN mydataset.PERSONA PER
ON DIP.IdPersona=PER.IdPersona
AND DIP_PR.Id=PER.IdProperty
)
GROUP BY IdPersona
)
SELECT
IdPersona,
'subquery to do here' Status,
NP.Properties
FROM (SELECT DISTINCT IdPersona FROM mydataset.PERSONA) PE
LEFT JOIN NEW_PROPERTIES NP USING (IdPersona)
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
WHEN MATCHED THEN
UPDATE
SET
TRG.Status = SRC.Status,
TRG.Properties = SRC.Properties,
TRG._lastModifiedDate = CURRENT_TIMESTAMP()
This works but I'm pretty much avoiding the syntax to update an array of structs, as what I'm doing is a rebuild and replace operation. Hopefully someone can suggest a better way.
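The rebuild-and-replace idea can be modelled in a few lines of plain Python, which may help when checking the expected result for the three goals. This is a sketch under assumptions: the dictionaries stand in for the two tables, merge is a hypothetical helper, Status is taken from the source row as in the UPDATE clause above, and the fixed timestamp replaces CURRENT_TIMESTAMP().

```python
from datetime import datetime, timezone


def merge(dim, src, now):
    # dim maps IdPersona -> {"Status": ..., "Properties": [ {...}, ... ]}
    result = {
        key: {"Status": val["Status"],
              "Properties": [dict(p) for p in val["Properties"]]}
        for key, val in dim.items()
    }
    for row in src:
        # Unknown personas are inserted with an empty property list.
        persona = result.setdefault(
            row["IdPersona"], {"Status": row["Status"], "Properties": []})
        persona["Status"] = row["Status"]
        for prop in persona["Properties"]:
            if prop["Id"] == row["IdProperty"]:
                # Existing property: overwrite the value, stamp the change.
                prop["Value"] = row["Value"]
                prop["_lastModifiedDate"] = now
                break
        else:
            # New property: append it to the array.
            persona["Properties"].append({
                "Id": row["IdProperty"], "Value": row["Value"],
                "_loadingDate": now, "_lastModifiedDate": None})
    return result


now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed timestamp
dim = {
    "A": {"Status": "KO", "Properties": [
        {"Id": "FamilyMembers", "Value": "2",
         "_loadingDate": now, "_lastModifiedDate": None}]},
    "B": {"Status": "KO", "Properties": [
        {"Id": "FamilyMembers", "Value": "4",
         "_loadingDate": now, "_lastModifiedDate": None},
        {"Id": "Pets", "Value": "1",
         "_loadingDate": now, "_lastModifiedDate": None}]},
}
src = [
    {"IdPersona": "A", "Status": "OK", "IdProperty": "Pets", "Value": "3"},
    {"IdPersona": "B", "Status": "OK", "IdProperty": "FamilyMembers", "Value": "5"},
    {"IdPersona": "C", "Status": "OK", "IdProperty": "Pets", "Value": "2"},
]

merged = merge(dim, src, now)
summary = {
    key: (val["Status"], [(p["Id"], p["Value"]) for p in val["Properties"]])
    for key, val in merged.items()
}
print(summary)
```

Here A gains a Pets property, B keeps Pets and has FamilyMembers changed to 5, and C is inserted, which is the outcome the FULL OUTER JOIN in the USING clause is meant to precompute.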

Also, while you did not provide your desired output, I was able to create a query based on the objectives you described and your code and with the sample data you provided.
Following the below goals:
Update IdPersona='A', adding a new element in Properties and changing Status
Update IdPersona='B', updating the existent element in Properties
Insert IdPersona='C'
Instead of doing a replace and rebuild operation, I used:
MERGE: to perform the updates and insert the new rows, such as IdPersona="C".
INSERT: within MERGE it is not possible to use INSERT with WHEN MATCHED. Thus, to add a new property when IdPersona="A", a separate INSERT is run after the MERGE.
CREATE TABLE: after the INSERT, the new properties for IdPersona="A" are not aggregated with the existing ones, since WHEN MATCHED could not be used. So a final table is built with CREATE OR REPLACE TABLE to aggregate the results properly.
LEFT JOIN: to add back the fields _loadingDate and _lastModifiedDate, which are not aggregated into the ARRAY<STRUCT<>>.
Below is the query with the proper comments:
#first step update current values and insert new IdPersonas
MERGE sample.DIM_PERSONA_test2 T
USING sample.PERSONA_test2 S
ON T.IdPersona = S.IdPersona
#update A but not insert
WHEN MATCHED AND T.IdPersona ="A" THEN
UPDATE SET STATUS = "OK"
#update B
WHEN MATCHED AND T.IdPersona ="B" THEN
UPDATE SET Properties = [( S.IdProperty, S.Value, TIMESTAMP(NULL), TIMESTAMP(NULL) )]
#insert what is not in the target table
WHEN NOT MATCHED THEN
INSERT(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate ) VALUES (S.IdPersona, S.Status, [( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL));
#insert new values when IdPersona="A"
#you will see the result won't be aggregated properly
INSERT INTO sample.DIM_PERSONA_test2(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate)
SELECT IdPersona, Status,[( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL) from sample.PERSONA_test2
where IdPersona = "A";
#replace the above table to recreate the ARRAY<STRUCT<>>
CREATE OR REPLACE TABLE sample.DIM_PERSONA_FINAL_test2 AS(
SELECT t1.*, t2._loadingDate,t2._lastModifiedDate
FROM( SELECT a.IdPersona,
a.Status,
ARRAY_AGG(STRUCT( Properties.Id as Id, Properties.Value as Value, Properties._loadingDate ,
Properties._lastModifiedDate AS _lastModifiedDate)) AS Properties
FROM sample.DIM_PERSONA_test2 a, UNNEST(Properties) as Properties
GROUP BY 1,2
ORDER BY a.IdPersona)t1 LEFT JOIN sample.DIM_PERSONA_test2 t2 USING(IdPersona)
)
Notice that when updating the ARRAY<STRUCT<>>, the values are wrapped within [()]. Lastly, note that the result contains two records for IdPersona="A": _loadingDate is required, so it cannot be NULL, and because it is filled with CURRENT_TIMESTAMP() there are two different values for this field, and therefore two different records.

How to efficiently select records matching substring in another table using BigQuery?

I have a table of several million strings that I want to match against a table of about twenty thousand strings like this:
#standardSQL
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Unfortunately this is taking an awfully long time.
Considering that the fragment table is only 20k records, can I load it into a JavaScript array using a UDF and match it that way? I'm trying to figure out how to do this right now, but perhaps there's already some magic I could use here to make this faster. I tried a CROSS JOIN and got a resources exceeded error fairly quickly. I've also tried using EXISTS, but I can't reference record.name inside that subquery's WHERE without getting an error.
Example using Public Data
This seems to reflect about the same amount of data ...
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT LOWER(name) AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')

Below is for BigQuery Standard SQL
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT DISTINCT LOWER(name) AS name
FROM `bigquery-public-data.usa_names.usa_1910_current`
), temp_record AS (
SELECT record, TO_JSON_STRING(record) id, name, item
FROM record, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
), temp_fragment AS (
SELECT name, item FROM fragment, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
)
SELECT AS VALUE ANY_VALUE(record) FROM (
SELECT ANY_VALUE(record) record, id, r.name name, f.name fragment_name
FROM temp_record r
JOIN temp_fragment f
USING(item)
GROUP BY id, name, fragment_name
)
WHERE name LIKE CONCAT('%', fragment_name, '%')
GROUP BY id
The query above completed in 375 seconds, while the original query was still running after 2740 seconds, so I did not wait for it to complete.
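The speed-up comes from replacing the all-pairs LIKE comparison with an equality join on word tokens, then running the substring check only on the surviving candidate pairs. Below is a minimal plain-Python sketch of that idea, assuming made-up sample strings; match_by_tokens is a hypothetical name, and the in-memory index only stands in for the JOIN ... USING(item).

```python
import re
from collections import defaultdict

# ASCII-only \w, to stay close to the r'\w+' used in the query.
WORD = re.compile(r"\w+", re.ASCII)


def match_by_tokens(records, fragments):
    # Index fragments by each word token they contain (temp_fragment).
    by_token = defaultdict(set)
    for frag in fragments:
        for tok in WORD.findall(frag):
            by_token[tok].add(frag)

    matched = []
    for rec in records:
        # Equality join on tokens (temp_record JOIN temp_fragment USING(item)).
        candidates = set()
        for tok in set(WORD.findall(rec)):
            candidates |= by_token.get(tok, set())
        # Final check mirrors WHERE name LIKE CONCAT('%', fragment_name, '%').
        if any(frag in rec for frag in candidates):
            matched.append(rec)
    return matched


# Made-up sample data, already lower-cased like the CTEs in the query.
records = ["i met mary yesterday", "hannah was here", "nothing to see"]
fragments = ["mary", "ann"]

print(match_by_tokens(records, fragments))
```

One consequence of joining on whole tokens: a fragment that only occurs inside a longer word ("ann" inside "hannah") never becomes a candidate, so this approach can return fewer rows than the original LIKE join. Whether that matters depends on the data.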

Mikhail's answer appears to be faster, but let's have one that needs neither SPLIT nor separating the text into words.
First, compute a regular expression with all the words to be searched:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT FORMAT('(%s)',STRING_AGG(name,'|'))
FROM fragment
Now you can take the resulting string and use it in a case-insensitive regular expression:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), largestring AS (
SELECT '(?i)(mary|margaret|helen|more_names|more_names|more_names|josniel|khaiden|sergi)'
)
SELECT record.* FROM `record`
WHERE REGEXP_CONTAINS(record.name, (SELECT * FROM largestring))
(~510 seconds)
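The same two steps (build one alternation pattern, then scan every record once) can be sketched in plain Python with the standard re module. The sample strings and the helper name build_pattern are made up for illustration, and re.escape is an addition not present in the SQL above, included so that fragments containing regex metacharacters are matched literally.

```python
import re


def build_pattern(fragments):
    # Equivalent in spirit to FORMAT('(%s)', STRING_AGG(name, '|')),
    # with a leading (?i) so the match ignores case.
    alternation = "|".join(re.escape(f) for f in fragments)
    return re.compile("(?i)(" + alternation + ")")


# Made-up sample data.
fragments = ["mary", "ann"]
records = ["I met MARY today", "Hannah was here", "nothing"]

pattern = build_pattern(fragments)
matched = [rec for rec in records if pattern.search(rec)]
print(matched)
```

Because this is a plain substring search, "Hannah" matches the fragment "ann". Python's re engine is not the RE2 engine BigQuery uses, so this only illustrates the shape of the approach, not its performance.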

As alluded to in my question, I worked on a version using a JavaScript UDF which solves this, albeit more slowly than the answer I accepted. For completeness, I'm posting it here because perhaps someone (like myself in the future) may find it useful.
CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
for (var i in fragments) {
if (str.indexOf(fragments[i]) >= 0) {
return fragments[i];
}
}
return null;
""";
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
WHERE text IS NOT NULL
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name IS NOT NULL
GROUP BY name
), fragment_array AS (
SELECT ARRAY_AGG(name) AS names, COUNT(*) AS count
FROM fragment
GROUP BY LENGTH(name)
), records_with_fragments AS (
SELECT record.name,
CONTAINS_ANY(record.name, fragment_array.names)
AS fragment_name
FROM record INNER JOIN fragment_array
ON CONTAINS_ANY(name, fragment_array.names) IS NOT NULL
)
SELECT * EXCEPT(rownum) FROM (
SELECT record.name,
records_with_fragments.fragment_name,
ROW_NUMBER() OVER (PARTITION BY record.name) AS rownum
FROM record
INNER JOIN records_with_fragments
ON records_with_fragments.name = record.name
AND records_with_fragments.fragment_name IS NOT NULL
) WHERE rownum = 1
The idea is that the list of fragments is small enough to be processed in an array, similar to Felipe's answer using regular expressions. The first thing I do is create a fragment_array table grouped by fragment length, a cheap way of preventing an oversized array, which I found can cause UDF timeouts.
Next I create a table called records_with_fragments that joins those arrays to the original records, finding only those which contain a matching fragment using the JavaScript UDF CONTAINS_ANY(). This will result in a table containing some duplicates since one record may match multiple fragments.
The final SELECT then pulls in the original record table, joins to records_with_fragments to determine which fragment matched, and also uses the ROW_NUMBER() function to prevent duplicates, e.g. only showing the first row of each record as uniquely identified by its name.
Now, the reason I do the join in the final query is because in my actual data there are more fields I want besides just the string being matched. Earlier on in my actual data I create a table of DISTINCT strings which then later need to be re-joined.
Voila! Not the most elegant but it gets the job done.

Update a nested field in BigQuery using another nested field as a condition

I am trying to update the sourcePropertyDisplayName on a ga_sessions_ table WHERE it matches the value of another nested field. I found this answer here:
Update nested field in BigQuery table
But this only has a very simple WHERE TRUE; whereas I only want to apply it if it matches a specified hits.eventInfo.eventCategory.
Here is what I have so far:
UPDATE `dataset_name`.`ga_sessions_20170720`
SET hits =
ARRAY(
SELECT AS STRUCT * REPLACE(
(SELECT AS STRUCT sourcePropertyInfo.* REPLACE('updated text' AS
sourcePropertyDisplayName)) AS sourcePropertyInfo)
FROM UNNEST(hits)
)
WHERE ARRAY(
SELECT AS STRUCT eventInfo.eventCategory
FROM UNNEST(hits)
) LIKE '%SEARCH%'
But I'm currently getting following error:
Error: No matching signature for operator LIKE for argument types:
ARRAY<STRUCT<eventCategory STRING>>, STRING. Supported signatures: STRING
LIKE STRING; BYTES LIKE BYTES at [8:7]
How can I update one nested field by using the value of another in a WHERE clause?

Your WHERE clause should look like this:
WHERE EXISTS (
SELECT 1 FROM UNNEST(hits) AS h
WHERE h.eventInfo.eventCategory LIKE '%SEARCH%'
)

Accessing BigQuery RECORD - Repeated in Tableau

I have a BigQuery table with a column of type RECORD and mode REPEATED. I have to query and use this table in Tableau. Using UNNEST or FLATTEN in BigQuery performs a CROSS JOIN on the table, which hurts performance. Is there any other way to use this table in Tableau without flattening it? I have posted a link to the table schema image below.
[Schema of Table]
https://i.stack.imgur.com/T4jHg.png

Is there any other way to use ... ?
You should not be afraid of UNNEST just because it "does" a CROSS JOIN.
The trick is that even though it is a cross join, it is a cross join within each row only, not global across all rows in the table. At the same time, there is always another way to do things.
So, below example 1 – presents dummy example using UNNEST
And then Example 2 – shows how to do the same without using UNNEST, but rather using SQL UDF
You have not presented specifics about your case, so below is generic enough to show ‘other’ way
With Flattening via UNNEST
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, SUM(t.details) AS details
FROM yourTable, UNNEST(type) AS t
WHERE t.flag = 'y'
GROUP BY id
With SQL UDF
#standardSQL
CREATE TEMP FUNCTION do_something (
type ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
)
RETURNS INT64 AS ((
SELECT SUM(t.details) AS details
FROM UNNEST(type) AS t
WHERE t.flag = 'y'
));
WITH yourTable AS (
SELECT 1 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(1,'y','a','xxx'),(2,'n','b','yyy'),(3,'y','c','zzz'),(4,'n','d','vvv')] AS type UNION ALL
SELECT 2 AS id, ARRAY<STRUCT<details INT64, flag STRING, value STRING, description STRING>>
[(11,'t','c','xxx'),(21,'n','a','yyy'),(31,'y','c','zzz'),(41,'f','d','vvv')] AS type
)
SELECT id, do_something(type) AS details
FROM yourTable
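To see why the "cross join" stays inside each row, here is a plain-Python model of both queries. It is a sketch only: the structs are reduced to the details and flag fields, with_unnest and with_udf are hypothetical names, and the row with id 3 is an extra made-up row (not in the queries above) added to show where the two forms differ.

```python
def with_unnest(rows):
    # Models: FROM yourTable, UNNEST(type) AS t WHERE t.flag = 'y' GROUP BY id
    totals = {}
    for row in rows:
        for t in row["type"]:  # pairs a row only with its own array elements
            if t["flag"] == "y":
                totals[row["id"]] = totals.get(row["id"], 0) + t["details"]
    return totals


def do_something(type_array):
    # Models the SQL UDF: a scalar subquery over UNNEST(type).
    values = [t["details"] for t in type_array if t["flag"] == "y"]
    return sum(values) if values else None  # SUM over no rows is NULL


def with_udf(rows):
    # Models: SELECT id, do_something(type) AS details FROM yourTable
    return {row["id"]: do_something(row["type"]) for row in rows}


def structs(pairs):
    return [{"details": d, "flag": f} for d, f in pairs]


rows = [
    {"id": 1, "type": structs([(1, "y"), (2, "n"), (3, "y"), (4, "n")])},
    {"id": 2, "type": structs([(11, "t"), (21, "n"), (31, "y"), (41, "f")])},
    {"id": 3, "type": structs([(5, "n")])},  # extra row: no 'y' flags
]

print(with_unnest(rows))
print(with_udf(rows))
```

For ids 1 and 2 both forms agree. For a row with no matching element, the flattening form drops the row entirely, while the UDF form keeps it with a NULL result, which may matter when the output feeds a reporting tool.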
