Query multiple params in multiple tables with TABLE_DATE_RANGE for Firebase Analytics

I intend to get, from the events my applications send, a stat of the most played audios within an article. With each event I send the articleId and the audioId that has been played.
I want to obtain result rows like this, ordered by number of occurrences:
| ID of the article | ID of the audio | number of occurrences
Since Firebase Analytics exports to BigQuery on a daily basis and I want those events per month, I created a query that takes the values from multiple tables, and mixed it with the info I found in this thread.
The resulting query is:
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(event_dim.name) as Number_Of_Plays
FROM
TABLE_DATE_RANGE([project-id:my_app_id.app_events_], DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP()), UNNEST(event_dim) AS x
WHERE event_dim.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
Unfortunately this query does not parse correctly and gives me an error:
Error: Table name cannot be resolved: dataset name is missing.
I am pretty sure the issue is related to querying multiple tables in a range, but I am not sure how to fix it. Thanks.

The other answer you reference uses Standard SQL, while you are trying to use TABLE_DATE_RANGE, which is only available in Legacy SQL.
Here is the query in Standard SQL, which lets you address multiple tables with a wildcard:
#standardSQL
SELECT
  (SELECT params.value.int_value FROM x.params
   WHERE params.key = 'Article_ID') AS Article_ID,
  (SELECT params.value.int_value FROM x.params
   WHERE params.key = 'Audio_ID') AS Audio_ID,
  COUNT(x.name) AS Number_Of_Plays
FROM
  `project-id.my_app_id.app_events_*`, UNNEST(event_dim) AS x
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  AND x.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays DESC
Note the wildcard in the FROM clause (project-id.my_app_id.app_events_* — Standard SQL uses a dot, not a colon, between project and dataset) and the WHERE _TABLE_SUFFIX BETWEEN line. The suffix bounds are formatted with FORMAT_DATE('%Y%m%d', ...) so they match the daily tables' YYYYMMDD suffixes.
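For reference, a minimal Legacy SQL sketch of the same 30-day window (same assumed table prefix; the nested params cannot be pivoted per key as cleanly in Legacy SQL, so this only counts plays):
-- Legacy SQL: TABLE_DATE_RANGE takes an unquoted table prefix and two timestamps
SELECT
  event_dim.name AS event_name,
  COUNT(*) AS number_of_plays
FROM
  TABLE_DATE_RANGE([project-id:my_app_id.app_events_],
                   DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                   CURRENT_TIMESTAMP())
WHERE event_dim.name = 'Audio_Play'
GROUP BY event_name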

Related

GA4 vs BigQuery - User Counts don't match

I have extracted from BigQuery the active_users and totalusers on 31/12/2022, grouped by CampaignName and Country, using the following query:
SELECT
  COUNT(DISTINCT CASE
    WHEN (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') > 0
      OR (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_engaged') = '1'
    THEN user_pseudo_id ELSE NULL END) AS active_users,
  COUNT(DISTINCT user_pseudo_id) AS totalusers,
  traffic_source.name AS CampaignName,
  geo.country AS Country
FROM `independent-tea-354108.analytics_254831690.events_20221231`
GROUP BY
  traffic_source.name,
  geo.country
The result filtered by CampaignName='(organic)' was:
(screenshot: https://i.stack.imgur.com/LMQAH.png)
But when I compare with the data from GA4 it doesn't match, and the difference is huge (around 15000 more active_users in GA4 than in BigQuery). Note that this is only one day; over a month the difference would be even higher:
(screenshot: https://i.stack.imgur.com/8arYs.png)
I've tried filtering by other CampaignNames, and not a single value matches; the differences are always huge.
These are two common reasons for the GA4-to-BigQuery difference; you have probably looked at them already.
Check your source table for blank user_pseudo_ids. If you have consent mode on your website, those users may be counted in GA4 but not in BigQuery, and this can cause big differences (a quick check is sketched below).
Time zone is another area that can make a difference: BigQuery is always in UTC time; your GA4 property may not be.
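A minimal sketch of the first check, against the table from the question (adjust the date shard as needed):
#standardSQL
-- Count events that arrived without a user_pseudo_id (e.g. due to consent mode)
SELECT
  COUNTIF(user_pseudo_id IS NULL) AS events_missing_pseudo_id,
  COUNT(*) AS total_events
FROM `independent-tea-354108.analytics_254831690.events_20221231`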
I hope these help

Results in BigQuery do not match GA4

I'm in BigQuery running the query below to see how many users I had from August 1st to August 14th, but the number does not match what GA4 shows me.
with event AS (
SELECT
user_id,
event_name,
PARSE_DATE('%Y%m%d',
event_date) AS event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY TIMESTAMP_MICROS(event_timestamp) DESC) AS rn,
FROM
`events_*`
WHERE
event_name= 'push_received')
SELECT COUNT ( DISTINCT user_id)
FROM
event
WHERE
event_date >= '2022-08-01'
Result from GA4
Result BQ = 37024
There are quite a few reasons why your GA4 data in the web UI will not match the BigQuery export or the Data API.
In this case, I believe you are running into the time zone issue. event_date is the date the event was logged, in the registered time zone of your property. However, event_timestamp is the UTC time at which the event was logged by the client.
To resolve this, simply update your query with:
EXTRACT(DATETIME FROM TIMESTAMP_MICROS(`event_timestamp`) at TIME ZONE 'TIMEZONE OF YOUR PROPERTY' )
Your data should then match the WebUI and the GA4 Data API. This post that I co-authored goes into more detail on this and other reasons why your data doesn't match: https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
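A sketch of the fix applied to the question's query; 'America/Sao_Paulo' is an assumed property time zone - substitute your own:
#standardSQL
-- Derive the event date in the property's time zone instead of UTC
SELECT COUNT(DISTINCT user_id) AS users
FROM `events_*`
WHERE event_name = 'push_received'
  AND EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp)
              AT TIME ZONE 'America/Sao_Paulo')
      BETWEEN '2022-08-01' AND '2022-08-14'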
You cannot simply compare totals. Divide it into daily comparisons and look at the details.
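A sketch of such a daily breakdown on the question's table:
#standardSQL
-- Per-day user counts make it easier to localize where the numbers diverge
SELECT
  event_date,
  COUNT(DISTINCT user_id) AS users
FROM `events_*`
WHERE event_name = 'push_received'
GROUP BY event_date
ORDER BY event_date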

Google Analytics to BigQuery data - what is the SQL code for a custom dimension with transactions?

How can I see the data above in BigQuery? The tables have been there for a year.
What code should I use to see the above result?
User subscription status is a session-based dimension against which transactions were made.
I have enabled the export to BigQuery, but how do I see exactly the same results in BQ?
Try the code below; change the table name and date interval to match your data.
#standardSQL
SELECT
date,
SUM(totals.visits) AS visits,
SUM(totals.pageviews) AS pageviews,
SUM(totals.transactions) AS transactions,
SUM(totals.transactionRevenue)/1000000 AS revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170731'
GROUP BY date
ORDER BY date ASC
These documents could be useful for you before posting questions:
https://support.google.com/analytics/answer/4419694?hl=tr
https://support.google.com/analytics/answer/3437719?hl=tr
For custom dimensions at session scope, write a subquery that runs on the unnested array:
#standardSQL
SELECT
date,
-- select one value from unnested array
(SELECT value FROM UNNEST(customDimensions) WHERE index=4) AS cd4,
SUM(totals.transactions) AS transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20160802'
GROUP BY
date, cd4
ORDER BY
date ASC
You need to change the condition in the subquery to match your custom dimension's index.

BigQuery: compute average time between two custom events

I'm attempting to determine the average time between two events in my Firebase Analytics data using BigQuery. The table looks something like this:
I'd like to collect the timestamp_micros for the LOGIN_CALL and LOGIN_CALL_OK events, subtract the LOGIN_CALL timestamp from the LOGIN_CALL_OK timestamp, and compute the average of this difference across all rows.
#standardSQL
SELECT AVG(
(SELECT
event.timestamp_micros
FROM
`table`,
UNNEST(event_dim) AS event
where event.name = "LOGIN_CALL_OK") -
(SELECT
event.timestamp_micros
FROM
`table`,
UNNEST(event_dim) AS event
where event.name = "LOGIN_CALL"))
from `table`
I've managed to list either the low or the high numbers, but any time I try to do math on them I run into errors that I'm struggling to pull apart. The approach above seems like it should work, but I get the following error:
Error: Scalar subquery produced more than one element
I read this error to mean that each of the UNNEST() calls returns an array rather than a single value, which is causing AVG to barf. I've tried to unnest once and apply "low" and "hi" names to the values, but can't figure out how to filter on event_dim.name correctly.
I couldn't fully test this one, but maybe it will work for you:
WITH data AS(
SELECT STRUCT('1' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1497088800000000), ('20170610', 'LOGIN_CALL', 1498088800000000), ('20170610', 'LOGIN_CALL_OK', 1498888800000000), ('20170610', 'EVENT2', 159788800000000), ('20170610', 'LOGIN_CALL', 1599088800000000), ('20170610', 'LOGIN_CALL_OK', 1608888800000000)] event_dim union all
SELECT STRUCT('2' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1497688500400000), ('20170610', 'LOGIN_CALL', 1497788800000000)] event_dim UNION ALL
SELECT STRUCT('3' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1487688500400000), ('20170610', 'LOGIN_CALL', 1487788845000000), ('20170610', 'LOGIN_CALL_OK', 1498888807700000)] event_dim
)
SELECT
  AVG(time_diff) avg_time_diff
FROM (
  SELECT
    CASE
      WHEN e.name = 'LOGIN_CALL'
        AND LEAD(e.name, 1) OVER (PARTITION BY user_dim.user_id ORDER BY e.timestamp_micros ASC) = 'LOGIN_CALL_OK'
      THEN TIMESTAMP_DIFF(
        TIMESTAMP_MICROS(LEAD(e.timestamp_micros, 1) OVER (PARTITION BY user_dim.user_id ORDER BY e.timestamp_micros ASC)),
        TIMESTAMP_MICROS(e.timestamp_micros),
        DAY)
    END AS time_diff
  FROM data,
    UNNEST(event_dim) e
  WHERE e.name IN ('LOGIN_CALL', 'LOGIN_CALL_OK')
)
I've simulated 3 users with the same schema that you have in the Firebase schema.
Basically, I first applied the UNNEST operation to get each value of event_dim.name, then filtered for only the events you are interested in, that is, "LOGIN_CALL" and "LOGIN_CALL_OK".
As Mosha commented above, you do need some identification for these rows, as otherwise you won't know which event succeeded which; that's why the partitioning in the analytic functions takes user_dim.user_id as input as well.
After that, it's just TIMESTAMP operations to get the differences where appropriate (when the leading event is "LOGIN_CALL_OK" and the current one is "LOGIN_CALL", take the difference; this is what the CASE expression does).
In the TIMESTAMP_DIFF function you can choose which part of the date to analyze, such as seconds, minutes, days and so on.
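For instance, a minimal sketch of switching the unit to seconds, using the first user's pair from the mock data above:
#standardSQL
-- Same TIMESTAMP_DIFF with SECOND instead of DAY (swap in MINUTE, HOUR, etc.)
SELECT TIMESTAMP_DIFF(
  TIMESTAMP_MICROS(1498888800000000),  -- LOGIN_CALL_OK
  TIMESTAMP_MICROS(1498088800000000),  -- LOGIN_CALL
  SECOND) AS diff_seconds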

How to choose the latest partition in a BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER BY pt DESC) AS rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1
    )
  )
  WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col FROM table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date, does work.
However, I need to run this script in the future, and the table update (and thus the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?
October 2019 Update
Support for scripting and stored procedures is now in beta (as of October 2019).
You can submit multiple statements separated by semicolons, and BigQuery is able to run them now.
See the example below:
DECLARE max_date TIMESTAMP;
SET max_date = (
  SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who downvote without checking the context:
I think this answer was accepted because it addressed the OP's main question - "Is there a way I can pull data only from the latest partition in BigQuery?" - and it was noted in the comments that the BQ engine obviously still scans ALL rows but returns the result based on ONLY the most recent partition. As already mentioned in a comment on the question, this is easily addressed by scripting that logic: first get the result of the subquery, then use it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)
Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and from running tests, the accepted answer will not prune partitions, because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to compare against a more-or-less constant value instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by selecting yesterday's partition like so:
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data and don't mind that the query will return 0 results for the part of the day where the partition hasn't been created yet.
I'm happy to learn if there is a better workaround to get the latest partition!
List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
I found a workaround for this issue. You can use a WITH statement to select the last few partitions and filter the result. I think this is the better approach because:
You are not limited to a fixed partition date (like today minus 1 day). It will always take the latest partition in the given range.
It will only scan the last few partitions, not the whole table.
Example scanning only the last 3 partitions:
WITH last_three_partitions AS (
  SELECT *, _PARTITIONTIME AS PARTITIONTIME
  FROM dataset.partitioned_table
  WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME FROM last_three_partitions
WHERE PARTITIONTIME = (SELECT MAX(PARTITIONTIME) FROM last_three_partitions)
Here is a compromise that queries only a few partitions, without resorting to scripting or failing on missing partitions for fixed dates.
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)
You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope that the latest partition is ~3 days old. I did the split-and-ordinal stuff to guard against the case where my table prefix appears more than once in the table name for some reason.
The suffix pulled from __TABLES__ is a string, so it compares directly against _TABLE_SUFFIX; for _PARTITIONTIME, parse it into a timestamp first.
SELECT * FROM `project.dataset.tablePrefix*`
WHERE _TABLE_SUFFIX = (
  SELECT SPLIT(table_id, 'tablePrefix')[ORDINAL(2)]
  FROM `project.dataset.__TABLES__`
  WHERE table_id LIKE 'tablePrefix%'
  ORDER BY table_id DESC LIMIT 1)
I had this answer on a less popular question, so I'm copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
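With BigQuery scripting (mentioned in the October 2019 update above), the two steps can now be expressed as a single script; a sketch against the same public table:
-- Scripted two-step version: resolve the max date first, then filter on it
DECLARE max_date DATE DEFAULT (
  SELECT DATE(MAX(datehour))
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE wiki = 'es');
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = max_date
  AND wiki = 'es';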
Building on the answer from Chase: if you have a table that requires you to filter over a partition column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.
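A sketch of that variant:
#standardSQL
-- Same query with MIN to get the earliest partition instead
SELECT
  MIN(_PARTITIONTIME) AS pt
FROM
  `myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL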