GA4 BigQuery Export doesn't export all the data - google-bigquery

I've enabled BigQuery Export in Google Analytics 4, but on inspection I noticed that roughly half of the events were missing in the raw data (in my case those were sign_up events that also had user_id as a parameter).
When inspecting the event stats in the standard GA4 report I noticed that "Data Thresholding" was applied when inspecting stats pertaining to the said event. My understanding is that when thresholding is applied, GA4 omits certain events from exporting, although I can't be sure of that.
Is there a way to make sure all the data gets exported to BigQuery?

Related

Google Analytics (GA4) event data is different between reports and exploration

I am new to GA4 and I found that reports and explorations return different data for the same event.
For example I want to get the data:
eventName: Load Result
date: 2023-02-01
The event has 2 parameters:
event_category
event_label
Here is the data from reports and from exploration; exploration returns larger numbers than reports. When I use the GA4 API, I only get the same data as reports. I suspect sampling is involved, but how can I get unsampled data? Is there a way to get the same data as exploration through the GA4 API?
reports data: [image]
exploration data: [image]
API data: [image]
Yes, this is one of the known GA4 bugs. The general consensus and advice for now is to never use Reports and to always use Explorations.
But even Explorations are, at times, pretty odd with certain combinations of dimensions and metrics, often conflicting with themselves within the same tab.
For now, the best you can do with Firebase and GA4 data is to export it to BigQuery (BQ) and then look at the raw numbers. They look much better and seem to be significantly more reliable.
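To compare raw numbers against Reports or Explorations, a per-day event count over the export tables is usually enough. Here is a minimal sketch that only builds the query text; the project and dataset names are hypothetical placeholders, and the function name is mine. It relies on the standard export columns event_date and event_name and on the _TABLE_SUFFIX pseudo-column of wildcard tables.

```python
from datetime import date


def build_event_count_query(project: str, dataset: str, start: date, end: date) -> str:
    """Return BigQuery Standard SQL that counts raw events per day and event name.

    `project` and `dataset` are placeholders for your own export location.
    """
    if start > end:
        raise ValueError("start must not be after end")
    table = f"`{project}.{dataset}.events_*`"
    # Restricting _TABLE_SUFFIX to YYYYMMDD strings limits the scan to the
    # daily tables in the requested range (intraday suffixes sort outside it).
    return (
        "SELECT event_date, event_name, COUNT(*) AS event_count\n"
        f"FROM {table}\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        "GROUP BY event_date, event_name\n"
        "ORDER BY event_date, event_count DESC"
    )


if __name__ == "__main__":
    # Hypothetical identifiers, shown only to illustrate the call.
    print(build_event_count_query("my-project", "analytics_123456789",
                                  date(2023, 2, 1), date(2023, 2, 1)))
```

Paste the printed statement into the BigQuery console (or pass it to whatever client you use) and compare event_count with what the GA4 UI shows for the same date.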

Firebase Analytics doesn't export events table to Bigquery, even though streaming export is enabled

I'm trying to export my Google Analytics data from Firebase into Bigquery.
About 3 weeks ago the connector in Firebase was enabled with the "Streaming" export setting.
Just recently I decided to check BigQuery to start making some views and noticed that there are only 3 weeks' worth of "intraday" tables, which I understand are staging tables of sorts.
However, per the documentation, there should also be another table containing all the data, simply called "events_", but these are completely missing:
"You should query events_YYYYMMDD rather than events_intraday_YYYYMMDD"
https://support.google.com/analytics/answer/9358801?authuser=0
Where is the "events_" table? Is it safe to use the events_intraday tables instead, despite what the documentation says?
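To see exactly which days are affected, it can help to classify the table names in the dataset. The sketch below is only an illustration: it assumes you have already listed the table names by some means, and the function name and return shape are my own. It uses nothing beyond the two naming patterns quoted from the documentation.

```python
import re

# Naming patterns quoted from the documentation excerpt above.
DAILY = re.compile(r"^events_(\d{8})$")
INTRADAY = re.compile(r"^events_intraday_(\d{8})$")


def classify_export_days(table_names):
    """Split export days into those with a daily table and those with intraday only."""
    daily, intraday = set(), set()
    for name in table_names:
        match = DAILY.match(name)
        if match:
            daily.add(match.group(1))
            continue
        match = INTRADAY.match(name)
        if match:
            intraday.add(match.group(1))
    return {
        "daily": sorted(daily),
        # Days for which no events_YYYYMMDD table exists yet.
        "intraday_only": sorted(intraday - daily),
    }


if __name__ == "__main__":
    # Hypothetical table listing for illustration.
    names = ["events_20230201", "events_intraday_20230201", "events_intraday_20230202"]
    print(classify_export_days(names))
```

If every day ends up under intraday_only, as described in the question, the daily tables are not being created at all, rather than just lagging for the most recent day.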

linking GA4 project to bigquery - streaming vs daily

I recently linked my GA4 property to BigQuery to get a better look at the analytics data. The export was initially set to daily, so every day the data was exported from Google Analytics to BigQuery. I then decided that streaming was necessary, so I switched from daily to streaming in the BigQuery Linking section of GA4's admin tab. However, the streaming data has not shown up after a few hours. Has anyone run into similar problems? Do I need to recreate the entire BigQuery project?
If you look at your configuration options for GA to BigQuery, you will see a message under the streaming option.
This option will take effect after the next date boundary (tomorrow) for this property.
This property = the "Data exported continuously" option (streaming)
You will probably see your data tomorrow.

About the way of saving BigQuery data capacity.(BigQuery/Data Portal/Data Studio/Google)

I want to know how to reduce BigQuery data usage by changing settings in Data Portal (Google's BI tool, formerly named Data Studio).
The reason is that, unless I reduce my BigQuery data usage, I can't run SQL or afford the cost.
I'm looking for a solution that changes Data Portal settings, not BigQuery settings (including changes to the SQL code).
The dashboard in Data Portal keeps consuming BigQuery data capacity, so I can't solve my problem even if I change the SQL code.
My situation is as follows:
1. I made a view in my BigQuery environment.
I tried to write the query so that it doesn't use a lot of BigQuery data capacity.
For example, I didn't use "SELECT * FROM ...".
I set the view as a data source in Data Portal.
Then I built the dashboard on that data source.
Whenever someone opens the dashboard, the view I made is executed.
As a result, BigQuery data capacity is used every time someone opens the dashboard.
If I'm understanding correctly, you want to reduce the amount of data processed in BigQuery by your Data Studio (or in Japan, Data Portal) reports.
There are a few ways to do this:
Make sure that the "Enable Cache" option is checked in the report settings.
Avoid using BigQuery views as a query source, as these aren't cached at the BigQuery level (the view query is run every time, and likely many times per report for various charts). Instead, use a Custom Query connection or pull the table data directly to allow caching. Another option (which we use heavily) is to run a scheduled query that saves the output of a view as a table and replaces it regularly (or is triggered when the underlying data is refreshed). This way your queries can be cached, but the business logic can still exist within the view.
Create a BI Engine reservation in BigQuery. This adds another level of caching to Data Studio reports, and may give you better results for things that can't be query-cached or cached in Data Studio. (While there will be a cost to the service in the future based on the size of instance you reserve, it's free during their beta period.)
Don't base your queries on tables with a streaming buffer attached (even if it hasn't received rows recently), on wildcard tables, or on an external data source (e.g. a file in Cloud Storage or Bigtable), since results of such queries aren't cached. See Caching Exceptions for details.
Pull as little data as possible by using the new Data Source Parameters. This means you can pass the values of your date range or other filters directly to BigQuery and filter the data before it reaches your report. This is especially helpful if you have a date-partitioned table, as you can scan only the needed partitions (which greatly reduces processing and the amount of data returned).
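For the "save the output of a view as a table" option, the statement the scheduled query runs can be very small. A rough sketch that only builds the SQL text follows; the view and table names are hypothetical, and the helper name is mine. CREATE OR REPLACE TABLE ... AS SELECT is standard BigQuery DDL; how you schedule it is up to you.

```python
import re

# Accepts dataset.table or project.dataset.table; purely a safety check
# before the names are interpolated into the statement.
_QUALIFIED_NAME = re.compile(r"^[A-Za-z0-9_\-]+(\.[A-Za-z0-9_]+){1,2}$")


def build_materialize_statement(view: str, table: str) -> str:
    """Return SQL that replaces `table` with the current output of `view`."""
    for name in (view, table):
        if not _QUALIFIED_NAME.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"CREATE OR REPLACE TABLE `{table}` AS\nSELECT * FROM `{view}`"


if __name__ == "__main__":
    # Hypothetical names: the view holds the business logic, the table is
    # what the dashboard's data source should point at.
    print(build_materialize_statement("reporting.daily_summary_view",
                                      "reporting.daily_summary"))
```

The SELECT * here only reads the columns the view already exposes, so the column pruning done inside the view still applies. The dashboard then reads the small materialized table instead of re-running the view on every load.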
Also, sometimes it seems like you're moving a lot of data, but that doesn't always translate into a high cost. Check your cost breakdowns, or look at the logging filtered to the user your data source authenticates as, and see how much cost has actually been incurred. Certain operations fall under a free tier, and others don't result in cost for non-egress use cases like Data Studio. All that to say that you may want to make sure there's a cost problem at the BigQuery level in the first place before killing yourself trying to optimize the usage.

Service that does advanced queries on a data set, and automatically returns relevant updated results every time new data is added to the set?

I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kind of votes like picking a favorite, rating 1-5, say yes/no etc. on various topics.
We also want to show "live" statistics to the user, showing the popularity of a person etc. This will be generated by a rather complex SQL where we are calculating the average number of times a person was picked as favorite, divided by total number of votes and the number of games in which the person has been participating etc. And the score for the latest X games should count higher than the overall score for all games. This is just an example, there are several other SQL queries with similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQLs). Like setting up views in MySQL
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge of the defined queries and the content of the pushed vote data, determine which statistics are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this in the market? Or can it be built by combining available cloud services? And what is the official term for such a product, if I should search for it myself?
I know that I can probably build a solution like this myself, and run it on a cloud hosted database server, which can scale as our need grows - but I believe that I'm not the first developer with a need of this, so I hope that someone has solved it before me :)
You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery : Here you can define schema for the data on which you're going to run the SQL queries. BigQuery supports Standard and legacy SQL queries.
2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.
3) Every vote pushed can trigger the recalculation service which calculates the statistics by executing the defined SQL queries and the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: Here you can store the live statistics of the user. This is a real time database, so you'll be able to configure listeners for the modifications to the statistics and show the modifications as soon as the statistics are recalculated.
5) In the same service which inserts every vote, create a new record with a "syncId" in another table. The idea is to group the votes cast in a particular interval under their corresponding syncId. The syncId can be suffixed with a timestamp. According to your requirement, a particular time interval can be set so that the recalculation is triggered by the CRON jobs service, which invokes the recalculation service once per interval. Once the recalculation related to a particular syncId is completed, the record corresponding to that syncId should be marked as completed.
We are leveraging the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded in Google Cloud Firestore and then stream-inserted into Google BigQuery. The data stored in BigQuery is queried 30 seconds after each update using SQL queries, and the query results are stored in Google Cloud Firestore to serve dashboards, which are automatically updated using listeners configured for the collection in which the dashboard information is stored.
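The syncId idea from step 5 can be sketched in memory to show how it throttles recalculation. This is only an illustration of the grouping logic, not our production code: the class, its method names, and the syncId format (topic plus window start time) are assumptions, and in a real setup the pending and completed records would live in a table, with the scheduler calling run_due.

```python
from dataclasses import dataclass, field


@dataclass
class SyncBatcher:
    """Groups votes into fixed time windows, each identified by a syncId."""
    interval_seconds: int = 60
    pending: dict = field(default_factory=dict)   # syncId -> list of votes
    completed: set = field(default_factory=set)   # syncIds already recalculated

    def sync_id_for(self, topic: str, ts: float) -> str:
        # Suffix the topic with the start of the window the vote falls into.
        window_start = int(ts) - int(ts) % self.interval_seconds
        return f"{topic}-{window_start}"

    def add_vote(self, topic: str, vote, ts: float) -> str:
        sync_id = self.sync_id_for(topic, ts)
        self.pending.setdefault(sync_id, []).append(vote)
        return sync_id

    def run_due(self, now: float, recalculate) -> list:
        """Meant to be called by the scheduler; handles windows that have closed."""
        done = []
        for sync_id in sorted(self.pending):
            window_start = int(sync_id.rsplit("-", 1)[1])
            if window_start + self.interval_seconds <= now:
                recalculate(sync_id, self.pending.pop(sync_id))
                self.completed.add(sync_id)  # mark the syncId as completed
                done.append(sync_id)
        return done


if __name__ == "__main__":
    batcher = SyncBatcher(interval_seconds=60)
    batcher.add_vote("topicA", 5, ts=100)
    batcher.add_vote("topicA", 3, ts=110)
    batcher.run_due(now=120, recalculate=lambda sid, votes: print(sid, votes))
```

However many votes arrive within one window, recalculate runs once per syncId, which gives the "at most once per minute" behaviour asked for in the question.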