Daily Retention with Filter in BigQuery

Daily Retention with Filter in BigQuery - google-bigquery

I am using a query to calculate daily retention on my Firebase Analytics data exported to BigQuery. It is working well and the numbers match with the numbers in Firebase, but when I try to filter the query by a cohort of users, the numbers don't add up.
I want to compare the results of an A/B test from Firebase, and so I've looked at the user_property "firebase_exp_2" which is my A/B test, and I've split up the users in each group (0/1). The retention numbers do not match (at all) the numbers that I can see in my A/B test results in Firebase - actually they show the opposite pattern.
The query is adapted from here: https://github.com/sagishporer/big-query-queries-for-firebase/wiki/Query:-Daily-retention
All I've changed is adding the following under the "WHERE" clause:
WHERE
event_name = 'user_engagement' AND user_pseudo_id IN
(SELECT user_pseudo_id
FROM `analytics_XXX.events_*`,
UNNEST (user_properties) user_properties
WHERE user_properties.key = 'firebase_exp_2' AND user_properties.value.string_value='1')
Firebase says that there are 6,043 users in the Control group and 6,127 in the Variant A group, but my numbers are 5,632 and 5,730, and the retained users are around 1,000 users more than what Firebase reports.
What am I doing wrong?

The export to BigQuery happens on a daily basis and each imported table is named events_YYYYMMDD. Additionally, a table is imported for events received throughout the current day. This table is named events_intraday_YYYYMMDD.
The additions you made are querying from events_* which is fine. The example uses events_201812* though which would ignore the intraday table. That would explain why your numbers a lower. You are missing users added to the A/B test during the current day.

Related

BigQuery Google Ads Transfer - Duplicate data

We are using the Google Ads transfer in BigQuery to ingest our Google Ads data. One thing I have noticed when querying the results is that all of the metrics are exactly 156x of the values we would expect in the Google Ads UI (cost, clicks, etc.)
We have tested multiple transfers and each time we have this same issue. The transfer process seems pretty straight forward, but am I missing something? Has anyone else noticed a similar issue or have any ideas of what to look at to adjust in the data transfer?

For which tables do you notice this behavior?
The dimension tables such as Customer, Campaign, AdGroup are exported every day and so are partitioned by day.
This could cause your duplication?!
You only need the latest partition/day.
So this is for example how I get the latest account / customer data:
SELECT
-- main reason I cast all the id's to string is because BI reporting tool will not see it as a metric but as a dimension field
CAST(customer_id AS STRING) AS account_id, --globally unique, see also: https://developers.google.com/google-ads/api/docs/concepts/api-structure
customer_descriptive_name,
customer_auto_tagging_enabled,
customer_currency_code,
customer_manager,
customer_test_account,
customer_time_zone,
_DATA_DATE AS date, --source table is paritioned on date
_LATEST_DATE,
CASE WHEN _DATA_DATE = _LATEST_DATE THEN TRUE ELSE FALSE END is_most_recent_record
FROM
`YOURPROJECTID.google_ads.ads_Customer_YOURID`
WHERE
_DATA_DATE = _LATEST_DATE

Bigquery Active User count not accurate (Google Analytics)

I have Google Analytics integrated to Bigquery and I'm trying to write a query to fetch Active Users that should match with the number on GA Portal.
Here's the query I've written;
SELECT
date(date) as date,
EXACT_COUNT_DISTINCT(fullVisitorId) as daily_active_users,
FROM TABLE_DATE_RANGE([<project_id>:<dataset>.ga_sessions_],
TIMESTAMP('2018-01-01'),
TIMESTAMP(CURRENT_DATE()))
group by date
order by date desc
The numbers I get in response are somehow related to the ones Google Analytics shows me, but they aren't a 100% accurate.
The numebers I get in return are slightely higher than the ones on the portal and I assume I need to put a where clause to filter a property GA might be filtering on the portal.

Your query looks fine to me. Assuming that you're looking at the same GA view as the one linked to BigQuery, I think that the problem could be sampling.
Even if the GA UI says that "This report is based on 100% of sessions.", try to export it as an Unsampled Report and check the numbers (in my experience, the users metric sometimes doesn't match between unsampled reports and default reports without sampling).

Bigquery and Tableau

I attached Tableau with Bigquery and was working on the Dash boards. Issue hear is Bigquery charges on the data a query picks everytime.
My table is 200GB data. When some one queries the dash board on Tableau, it runs on total query. Using any filters on the dashboard it runs again on the total table.
on 200GB data, if someone does 5 filters on different analysis, bigquery is calculating 200*5 = 1 TB (nearly). For one day on testing the analysis we were charged on a 30TB analysis. But table behind is 200GB only. Is there anyway I can restrict Tableau running on total data on Bigquery everytime there is any changes?

The extract in Tableau is indeed one valid strategy. But only when you are using a custom query. If you directly access the table it won't work as that will download 200Gb to your machine.
Other options to limit the amount of data are:
Not calling any columns that you don't need. Do this by hiding unused fields in Tableau. It will not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and then you pay for the full 200Gb even if you don't use those fields.
Another option that we use a lot is partitioning our tables. For instance, a partition per day of data if you have a date field. Using TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the amount of partitions and hence rows that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view. And then I use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.

1) Right now I learning BQ + Tableau too. And I found that using "Extract" is must for BQ in Tableau. With this option you can also save time building dashboard. So my current pipeline is "Build query > Add it to Tableau > Make dashboard > Upload Dashboard to Tableau Online > Schedule update for Extract
2) You can send Custom Quota Request to Google and set up limits per project/per user.
3) If each of your query touching 200GB each time, consider to optimize these queries (Don't use SELECT *, use only dates you need, etc)

The best approach I found was to partition the table in BQ based on a date (day) field which has no timestamp. BQ allows you to partition a table by a day level field. The important thing here is that even though the field is day/date with no timestamp it should be a TIMESTAMP datatype in the BQ table. i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reasons the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is because when you create a viz in Tableau it will generate SQL to run against BQ and for the partitioned field to be utilised by the Tableau generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.

The thing about tableau and Big Query is that tableau calculates the filter values using your query ( live query ). What I have seen in my project logging is, it creates filters from your own query.
select 'Custom SQL Query'.filtered_column from ( your_actual_datasource_query ) as 'Custom SQL Query' group by 'Custom SQL Query'.filtered_column
Instead, try to create the tableau data source with incremental extracts and also try to have your query date partitioned ( Big Query only supports date partitioning) so that you can limit the data use.

Where do you get Google Bigquery usage info (mainly for processed data)

I know that BigQuery offers the first "1 TB of data processed" per month for free but I can't figure out where to look on my dashboard to see my monthly usage. I used to be able to "revert" to the old dashboard which had the info but for the past couple of weeks the "old dashboard" isn't accessible.

From the Google Cloud Console overview page for your project, click on the "details" section on the top-right, next to the charge estimate :
You'll get an estimate of the charges for the current month for each service and item in the service, including Big Query analysis :
If you want to track this usage, you can also export the data into CSV every day by going in the Billing settings and enable the usage export feature. Do not worry about the fact that it only mentions Compute Engine, it actually works for other services also.
You can also access directly the billing history by clicking on the billing account link :
You will get a detailed bill with the usage info :

Post GCP Console Redesign Answer
The GCP console was redesigned and now the other answer here no longer applies, but it is still possible to view your usage by going to IAM & Admin -> Quotas.
What you're looking for is "Big Query API: Query usage per day". It doesn't seem possible to view your usage over 30 days unfortunately, but you can see your current usage (per day) and your peak usage over the past 7 days. You can also set a daily quota. If you're just working infrequently or doing a lot in one day, you can set a quota to 1 TiB and prevent yourself from blowing your whole allocation in one day.
You can try sending feedback about these limitations, like I did, by clicking the question mark at the top right and then send feedback.

Theo is correct that there is no way to view the number of bytes processed or billed since the start of the month (inside of the free tier) in the GCP Billing Console. However, you can extract the bytes processed and bytes billed data from logs in Cloud Logging and calculate the total bytes processed/billed since the start of the month inside of BigQuery.
Here are the steps to count total bytes billed in a month:
Under Cloud Logging, go to Logs Explorer (NOT the Legacy Logs Explorer) and run the following query in the query builder frame:
resource.type="bigquery_project" AND
protoPayload.metadata.jobChange.job.jobStats.queryStats.totalBilledBytes>1 AND
timestamp>="2021-04-01T00:00:00Z"
The timestamp clause is not actually necessary, but it will speed up the query. You can set timestamp >= <value> to any valid timestamp you want as long as it returns at least one result.
In the Query Results frame, click the "Action" button, and select "Create Sink".
In the window that opens, give your sink a name, click "Next", and in the "Select sink service" dropdown menu select "BigQuery dataset".
In the "Select BigQuery dataset" dropdown menu, either select an existing dataset where you would like to create your sink (which is a table containing logs) or if you prefer, choose "Create new BigQuery dataset.
Finally, you will likely want to check the box for Partition Table, since this will help you control costs whenever you query this sink. As of the time of this answer, however, Google limits partition tables to 4000 partitions, so you may find it is necessary to clear out old logs eventually.
Click "Create Sink" (there is no need for any inclusion or exclusion filters).
Run a query in BigQuery that produces bytes billed (i.e. a query that does not return a previously cached result). This is necessary to instantiate the sink. Moments after your query runs, you should now see a table called <your_biquery_dataset>.cloudaudit_googleapis_com_data_access
Enter the following Standard SQL query in the BigQuery query editor:
WITH
bytes_table AS (
SELECT
JSON_VALUE(protopayload_auditlog.metadataJson,
'$.jobChange.job.jobStats.createTime') AS date_time,
JSON_VALUE(protopayload_auditlog.metadataJson,
'$.jobChange.job.jobStats.queryStats.totalBilledBytes') AS billedbytes
FROM
`<your_project><your_bigquery_dataset>.cloudaudit_googleapis_com_data_access`
WHERE
EXTRACT(MONTH
FROM
timestamp) = 4
AND EXTRACT(YEAR
FROM
timestamp) = 2021)
SELECT
(SUM(CAST(billedbytes AS INT64))/1073741824) AS total_GB
FROM
bytes_table;
You will want to chance the month from 4 to whatever month you intend to query, and 2021 to whatever year you intend to query. Also, you may find it helpful to save this query as a view if you intend to rerun it periodically.
Be advised that your sink does not contain your past BigQuery logs, only BigQuery logs produced after you created the sink. Therefore in the first month the number of GB returned by this query will not be an accurate count your bytes billed in month unless you happen to have created the sink prior to running any queries in BigQuery during the current month.

Might be related to How can I monitor incurred BigQuery billings costs (jobs completed) by table/dataset in real-time?
If you are fine by using BigQuery itself to get that information (instead of using a UI), you can use something like this:
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
ROUND(SUM(total_bytes_processed) / gb_divisor,2) as bytes_processed_in_gb,
ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor,4) as cost_in_dollar,
user_email,
FROM (
(SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
UNION ALL
(SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
GROUP BY
user_email
Open in BigQuery UI
Explanation
Please consider the caveats I mentioned in my answer here

SQL Server Query: Daily Data Snapshot Comparison (Counting Delta Occurrences)

I am working towards counting customer subscription ("package") changes. To do this, I am selecting all data from my package table once, every day. I am calling the daily query results "snapshots" (approx 500k rows). I then load the snapshot data into a new table. After 10 days I have a total of 5 million rows in the snapshots table (500k rows * 10 days). The majority of customers do not changes packages (65%). I need to report which customers, of the remaining 35%, are switching packages, when they are switching packages, what package changes they are making (from "package X" to "package y") and which customers are changing packages most frequently.
The query I have written uses a self-join. I am identifying the changes but my results contain duplicate rows.
This is my query:
select *
from UserPackageDump UPD1, UserPackageDump UPD2
where UPD1.user_id = UPD2.user_id
and UPD1.package_id <> UPD2.package_id
How can I change this query to yield only distinct results?

SELECT
DISTINCT *
FROM
UserPackageDump UPD1
JOIN UserPackageDump UPD2
ON UPD1.user_id = UPD2.user_id
WHERE
UPD1.package_id <> UPD2.package_id

You have many options for doing this, and I'm not sure your approach is the right one to take. Firstly to answer your specific question, you could perform a DISTINCT as per #sqlab's answer. Or you could include the date in the join, ensuring that UDP1 only matches a record in UDP2 that is one day different.
However, to come back to the approach, there should be no need to take a full copy of all the data. You have lots of other options for more efficient data storage, some of which being:
Put a "LastUpdated" datetime2 field in the database, to be populated each time the row is changed. Copy only those rows that have a LastUpdated more recent than the last time the copy was made. Assuming the only change possible to the table is to change the package_id then you will now only have rows in the table for users that have changed.
Create a UserPackageHistory table into which rows are written each time a user subscribes to a package, at the same time that UserPackage is updated. This then leaves you with much the same result as the first bullet, but in advance of running the copy job.
Then, with any one of these sets of data, to satisfy the reporting requirements you could populate a cube. Your source would be a set of rows containing user_id, old_package_id, new_package_id and date. You would create a measure group containing these measures:
Distinct count of user_id
Count of switches (basically just the row count of the source data)
This measure group could then be related to the following dimensions:
Date, so you can see when the switches are taking place
User, so you can drill down on who is switching
Switch Type, which is a dimension built from the selecting the old_package_id and new_package_id from your source data. This gives you the ability to see the popularity of particular shifts.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas