I am aware that Google Analytics can be linked to Bigquery using BigQuery Linking features in the GA.
But I experienced the drawback that it's scheduled at a random time. So, it's messed up my table with dependencies to these GA data, which I set up at 9 AM using DBT -- so if the GA data is updated above 9 AM, my table won't have today's GA data.
My questions are:
Is there a way to schedule the updated GA data to have constant time, as the cronjob did?
Or if there is not any. Is there a way for DBT to run the job after the GA data is updated on bigquery?
Unfortunately Google provide no SLA on the BigQuery export from Google Analytics 3, if you have the option the best solution would be to migrate to Google Analytics 4, which was an almost realtime export to BigQuery and appears to be much more robust. Find out more on the official Google support page.
I currently get around this by using event based triggers that look at the meta data of a table, or check for the existence of a sharded table for yesterday, then proceed down downstream jobs, I'm sure you could achieve something similar with DBT.
Here is some example SQL code which checks for the existence of yesterday's Google Analytics sharded table by returning the maximum timestamp:
SELECT MAX(cast(PARSE_DATE('%Y%m%d', SUBSTR(table_id,13)) as timestamp)) as max_date
FROM `my_ga_dataset.__TABLES__`
WHERE table_id LIKE'%ga_sessions_%'
AND table_id NOT LIKE '%intraday%'
AND PARSE_DATE('%Y%m%d', SUBSTR(table_id,13)) >= CURRENT_DATE() -9
This works for sharded tables, if you want to use table metadata to get the date/time of the last table update you can use INFORMATION_SCHEMA:
https://cloud.google.com/bigquery/docs/information-schema-tables
Related
Sorry, I'm new to this. I read a few sources including some google documentation guides but still don't quiet understand:
Every time GA4 streams data into bigquery, bigquery table get a new row, or it can update existing?
For example, if I want data from a particular client_id should I expect only 1 row, or I get a bunch with different data to research?
Every record should be insert entry as if you see table column , each streamed record loaded with epoch timestamp as event_timestamp.
We are using the Google Ads transfer in BigQuery to ingest our Google Ads data. One thing I have noticed when querying the results is that all of the metrics are exactly 156x of the values we would expect in the Google Ads UI (cost, clicks, etc.)
We have tested multiple transfers and each time we have this same issue. The transfer process seems pretty straight forward, but am I missing something? Has anyone else noticed a similar issue or have any ideas of what to look at to adjust in the data transfer?
For which tables do you notice this behavior?
The dimension tables such as Customer, Campaign, AdGroup are exported every day and so are partitioned by day.
This could cause your duplication?!
You only need the latest partition/day.
So this is for example how I get the latest account / customer data:
SELECT
-- main reason I cast all the id's to string is because BI reporting tool will not see it as a metric but as a dimension field
CAST(customer_id AS STRING) AS account_id, --globally unique, see also: https://developers.google.com/google-ads/api/docs/concepts/api-structure
customer_descriptive_name,
customer_auto_tagging_enabled,
customer_currency_code,
customer_manager,
customer_test_account,
customer_time_zone,
_DATA_DATE AS date, --source table is paritioned on date
_LATEST_DATE,
CASE WHEN _DATA_DATE = _LATEST_DATE THEN TRUE ELSE FALSE END is_most_recent_record
FROM
`YOURPROJECTID.google_ads.ads_Customer_YOURID`
WHERE
_DATA_DATE = _LATEST_DATE
I started to test Google AdWords transfers for Big Query (https://cloud.google.com/bigquery/docs/adwords-transfer).
I have few questions for which I cannot find answers anywhere.
Is it possible to e.g. edit which columns are downloaded from AdWords to Big Query? E.g. Keyword report has only ad group ID column but not ad group text name.
Or is it possible to decide which tables=reports are downloaded? The transfer creates around 60 tables and I need just 5...
DZ
According to here, AdWords data transfer
store your AdWords data into a Dataset. So, the inputs are in terms of Adwords customer IDs (minimum one customer ID) and the output is a collection of Datasets.
I think, you need a modified version of PubSub to store special columns or tables in BigQuery.
I'm running the following query to select the active users for a time frame on my project.
SELECT DISTINCT
active_users,
unix
FROM [mobileapp_logs].[dbo].[active_users]
WHERE (rtrim(app_id) + ':' + app_os) = 'tbl'
AND [aggregation] = '30-day-active'
AND [unix] BETWEEN 1491696000 AND 1494288000
AND active_users >= 100
The query seems to be working but with every row returned for that day it will give me about 10 - 30 more than what's in firebase. Is this normal for bigquery -> firebase?
I'm not familiar with the table you are querying, according to the documentation Firebase imports data to app_events_intraday_YYYYMMDD. Could you provide more information about [mobileapp_logs].[dbo].[active_users]?
According to different SO questions it seems there may be a delay of a few days where offline devices upload their data. Also Firebase updates data in BigQuery daily. Since you are querying up until today you may be seeing data that has already been updated in Firebase but not in BigQuery. I would recommend changing your query to a range ending 3 days before today.
I attached Tableau with Bigquery and was working on the Dash boards. Issue hear is Bigquery charges on the data a query picks everytime.
My table is 200GB data. When some one queries the dash board on Tableau, it runs on total query. Using any filters on the dashboard it runs again on the total table.
on 200GB data, if someone does 5 filters on different analysis, bigquery is calculating 200*5 = 1 TB (nearly). For one day on testing the analysis we were charged on a 30TB analysis. But table behind is 200GB only. Is there anyway I can restrict Tableau running on total data on Bigquery everytime there is any changes?
The extract in Tableau is indeed one valid strategy. But only when you are using a custom query. If you directly access the table it won't work as that will download 200Gb to your machine.
Other options to limit the amount of data are:
Not calling any columns that you don't need. Do this by hiding unused fields in Tableau. It will not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and then you pay for the full 200Gb even if you don't use those fields.
Another option that we use a lot is partitioning our tables. For instance, a partition per day of data if you have a date field. Using TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the amount of partitions and hence rows that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view. And then I use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
1) Right now I learning BQ + Tableau too. And I found that using "Extract" is must for BQ in Tableau. With this option you can also save time building dashboard. So my current pipeline is "Build query > Add it to Tableau > Make dashboard > Upload Dashboard to Tableau Online > Schedule update for Extract
2) You can send Custom Quota Request to Google and set up limits per project/per user.
3) If each of your query touching 200GB each time, consider to optimize these queries (Don't use SELECT *, use only dates you need, etc)
The best approach I found was to partition the table in BQ based on a date (day) field which has no timestamp. BQ allows you to partition a table by a day level field. The important thing here is that even though the field is day/date with no timestamp it should be a TIMESTAMP datatype in the BQ table. i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reasons the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is because when you create a viz in Tableau it will generate SQL to run against BQ and for the partitioned field to be utilised by the Tableau generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
The thing about tableau and Big Query is that tableau calculates the filter values using your query ( live query ). What I have seen in my project logging is, it creates filters from your own query.
select 'Custom SQL Query'.filtered_column from ( your_actual_datasource_query ) as 'Custom SQL Query' group by 'Custom SQL Query'.filtered_column
Instead, try to create the tableau data source with incremental extracts and also try to have your query date partitioned ( Big Query only supports date partitioning) so that you can limit the data use.