Google Analytics (GA4) event data is different between Reports and Explorations (API)

I am fairly new to GA4, and I have found that I get different data from Reports and Explorations for the same event.
For example, I want to get the following data:
eventName: Load Result
date: 2023-02-01
The event has two parameters:
event_category
event_label
Here is the data from Reports and from an Exploration; the Exploration shows larger numbers than the Report. When I use the GA4 API, I can only get the same numbers as the Report. I guess there may be sampling involved, but how can I get unsampled data? Is there a way to get the same numbers as the Exploration through the GA4 API?
Reports data: (screenshot)
Exploration data: (screenshot)
API data: (screenshot)
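For reference, a minimal sketch of the kind of runReport call in question, using the GA4 Data API Python client. The property ID is a placeholder, and the customEvent: dimension names only work if event_category and event_label are registered as event-scoped custom dimensions:

```python
# Sketch of a GA4 Data API runReport call for one event on one day.
# "properties/123456789" and the customEvent: dimensions are placeholders/assumptions.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder: your GA4 property ID
    date_ranges=[DateRange(start_date="2023-02-01", end_date="2023-02-01")],
    dimensions=[
        Dimension(name="eventName"),
        Dimension(name="customEvent:event_category"),  # assumes a registered custom dimension
        Dimension(name="customEvent:event_label"),     # assumes a registered custom dimension
    ],
    metrics=[Metric(name="eventCount")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="eventName",
            string_filter=Filter.StringFilter(value="Load Result"),
        )
    ),
)

response = client.run_report(request)
for row in response.rows:
    print([d.value for d in row.dimension_values], row.metric_values[0].value)
```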

Yes, this is one of the known GA4 bugs. The general consensus and advice for now is to never rely on Reports; always use Explorations.
But even Explorations are, at times, pretty odd with certain combinations of dimensions and metrics, often conflicting with themselves within the same tab.
For now, the best you can do with Firebase and GA4 data is to export it to BigQuery and then look at the raw numbers. They look much better and seem to be significantly more reliable.
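For example, once the export is set up, the same event can be counted straight from the raw export tables. This is only a sketch, with placeholder project and dataset names:

```python
# Sketch: count 'Load Result' events for one day directly from the GA4 BigQuery export.
# `my-project.analytics_123456789` is a placeholder; use your own export dataset.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'event_category') AS event_category,
  (SELECT value.string_value FROM UNNEST(event_params)
   WHERE key = 'event_label') AS event_label,
  COUNT(*) AS event_count
FROM `my-project.analytics_123456789.events_20230201`
WHERE event_name = 'Load Result'
GROUP BY event_name, event_category, event_label
"""

for row in client.query(sql).result():
    print(row.event_category, row.event_label, row.event_count)
```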

Related

GA4 BigQuery Export doesn't export all the data

I've enabled BigQuery Export in Google Analytics 4, but on inspection I noticed that roughly half of the events were missing in the raw data (in my case those were sign_up events that also had user_id as a parameter).
When inspecting the event stats in the standard GA4 report, I noticed that "Data Thresholding" was applied for that event. My understanding is that when thresholding is applied, GA4 omits certain events from the export, although I can't be sure of that.
Is there a way to make sure all the data gets exported to BigQuery?
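A sketch of the kind of check that can be run against the raw export to compare exported counts with the UI; the dataset name is a placeholder:

```python
# Sketch: count exported sign_up events per day, split by whether a user_id was set,
# to compare against the numbers shown in the GA4 UI. Dataset name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  _TABLE_SUFFIX AS day,
  COUNTIF(user_id IS NOT NULL) AS sign_ups_with_user_id,
  COUNTIF(user_id IS NULL)     AS sign_ups_without_user_id
FROM `my-project.analytics_123456789.events_*`
WHERE event_name = 'sign_up'
GROUP BY day
ORDER BY day
"""

for row in client.query(sql).result():
    print(row.day, row.sign_ups_with_user_id, row.sign_ups_without_user_id)
```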

Most efficient way to clean data in BigQuery

I need some help cleaning my data...
I have a BigQuery table that receives new entries from my back end, and I'm using Google Data Studio to present this data.
My problem is that I have a field named sessions that sometimes contains duplicates. I can't solve that directly in my back end, because a user can send different data from the same session, so I can't simply stop recording duplicates.
I've worked around this by creating a view that selects the newest record among the duplicates, and I'm using this view as the data source for my report. The problem with this approach is that I lose the real-time reporting feature, which is important in this case. Another problem is that I also lose "Accelerated by BigQuery BI Engine", and I would like to keep that feature too.
Is this the best solution to my problem, so that I'll need to accept this outcome, or is there another way?
Many thanks in advance, kind regards.
Using the view should work with BI Engine acceleration. Can you please share more details from BI Engine? It should show you the reason the query wasn't accelerated, likely mentioning one of the limitations. If you hover over the "not accelerated" sign, it should give you more details on why your query wasn't supported. Feel free to share it here and I will be happy to help.
Another way you can clean up the data: have a scheduled job preprocess it. That means the data may not be the most recent, but it gives you the ability to clean up and aggregate the data, as in the sketch below.
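As a rough illustration of that scheduled pre-processing idea, a sketch that keeps only the newest row per sessions value and writes the result to a clean table. The project, dataset, table, and created_at column names are assumptions; the same SELECT could also back the view described in the question:

```python
# Sketch of a scheduled dedup job: keep only the newest row per `sessions` value
# and materialize the result into a clean table. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY sessions ORDER BY created_at DESC) AS row_num
  FROM `my-project.my_dataset.raw_events`
)
WHERE row_num = 1
"""

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.clean_events",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()
```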

Custom Dataflow Template - BigQuery to CloudStorage - documentation? general solution advice?

I am consuming a BigQuery table datasource. It is 'unbounded' as it is updated via a batch process. It contains session keyed reporting data from server logs where each row captures a request. I do not have access to the original log data and must consume the BigQuery table.
I would like to develop a custom Java-based Google Dataflow template using the Beam API, with the goals of:
collating keyed session objects
deriving session level metrics
deriving filterable window level metrics based on session metrics, e.g., percentage of sessions with errors during previous window and percentage of errors per filtered property, e.g., error percentage per device type
writing the result as a formatted/compressed report to cloud storage.
This seems like a fairly standard use case? In my research thus far, I have not yet found a perfect example and still have not been able to determine the best practice approach for certain basic requirements. I would very much appreciate any pointers. Keywords to research? Documentation, tutorials. Is my current thinking right or do I need to consider other approaches?
Questions :
Beam windowing and the BigQuery I/O connector - I see that I can specify a window type and size via the Beam API. My BQ table has a timestamp field per row. Am I supposed to somehow pass this via configuration, or is it supposed to be automatic? Do I need to do this manually via a SQL query somehow? This is not clear to me.
Fixed-time windowing vs. session windowing functions - the examples are basic and do not address any edge cases. My sessions can last hours. There are potentially hundreds of thousands of session keys per window. Would session windowing support this?
BigQuery vs. BigQueryClientStorage - The difference is not clear to me. I understand that BQCS provides a performance benefit, but do I have to store BQ data in a preliminary step to use this? Or can I simply query my table directly via BQCS and it takes care of that for me?
For number 1, you can simply use a WithTimestamps transform before applying windowing; this assigns the timestamp to your items. Here are some Python examples.
For number 2 the documentation states:
Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. [...] If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Also, per the Java documentation, you can only specify a minimum gap duration, not a maximum. This means that session windowing can easily support sessions lasting hours. After all, the only thing it does is put a watermark on your data and keep the window alive.
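Putting 1 and 2 together, a rough sketch in the Beam Python SDK; the table, field names, and gap duration are placeholders, and in the Python SDK the timestamp assignment is usually done by wrapping elements in TimestampedValue rather than a separate WithTimestamps transform:

```python
# Sketch: read rows from BigQuery, assign event-time timestamps from a row field,
# then apply session windowing keyed by session id. All names are placeholders.
import apache_beam as beam
from apache_beam.transforms import window

def to_timestamped(row):
    # Assumes the table has a TIMESTAMP column named `event_ts`,
    # returned by the connector as a Python datetime.
    return window.TimestampedValue(row, row['event_ts'].timestamp())

with beam.Pipeline() as p:
    sessions = (
        p
        | 'Read' >> beam.io.ReadFromBigQuery(table='my-project:my_dataset.requests')
        | 'AssignTs' >> beam.Map(to_timestamped)
        | 'KeyBySession' >> beam.Map(lambda row: (row['session_key'], row))
        | 'SessionWindows' >> beam.WindowInto(window.Sessions(gap_size=30 * 60))
        | 'GroupSessions' >> beam.GroupByKey()
        # ... derive session-level metrics here, then write the report to Cloud Storage
    )
```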
For number 3, the difference between the BigQuery I/O connector and the BigQuery Storage API is that the latter (an experimental feature as of 01/2020) reads the stored data directly, without the logical passage through a BigQuery query job (BigQuery's storage layer is separate from its query engine). This means that with the Storage API, the documentation states:
you can't use it to read data sources such as federated tables and logical views
Also, there are different limits and quotas between the two methods, which you can find in the documentation linked above.
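As a sketch, in Beam versions where the Storage Read API is supported, the Python connector exposes it through the method parameter; no intermediate storage step is needed on your side, and the table name below is a placeholder:

```python
# Sketch: read a table via the BigQuery Storage Read API instead of an export job.
import apache_beam as beam

with beam.Pipeline() as p:
    row_count = (
        p
        | 'ReadDirect' >> beam.io.ReadFromBigQuery(
            table='my-project:my_dataset.requests',
            method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,
        )
        | 'CountRows' >> beam.combiners.Count.Globally()
    )
```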

How to store weekly data from Google Analytics

I have some simple weekly aggregates from Google Analytics that I'd like to store somewhere. The reason for storing them is that if I run a query against too much data in Google Analytics, it becomes sampled, and I want it to be totally accurate.
What is the best way to solve this?
My thoughts are:
1) Write a process in BigQuery to append the data each week to a permanent dataset
2) Use an API that gets the data each week and stores the data in a google spreadsheet (appending a line each time)
What is the best recommendation for my problem - and how do I go about executing it?
Checking your previous questions, we see that you already use BigQuery.
When you run a query against the Google Analytics export tables in BigQuery, it is not sampled, as those tables contain all the data. There is no need to store a separate copy, as you can query them every time you need.
If you do want to store the aggregates, and pay for the additional table, you can go ahead and write the results to a destination table.
If you want quick access, try creating a view.
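A sketch of the destination-table option with the BigQuery Python client, appending last week's aggregate to a roll-up table; the project, dataset, roll-up table, and Universal Analytics 360 export (ga_sessions_*) names are assumptions:

```python
# Sketch: append a weekly sessions aggregate from the GA export tables to a roll-up table.
# Project, dataset, and table names are placeholders; assumes the ga_sessions_* export schema.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  PARSE_DATE('%Y%m%d', date) AS day,
  SUM(totals.visits) AS sessions
FROM `my-project.12345678.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
                        AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY day
"""

job_config = bigquery.QueryJobConfig(
    destination="my-project.reporting.weekly_ga_rollup",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.query(sql, job_config=job_config).result()
```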
I suggest the following:
1) Make a roll-up table for your weekly data - you can do that either by writing a query and running it manually, or with a script in a Google Spreadsheet that runs the same query (using the API) and is scheduled to run every week. I tried a bunch of the tutorials out there and this one is the simplest to implement.
2) Depending on the data points you want, you can even use the Google Analytics API without having to go through BigQuery for this request; try pulling this report of yours from here. If it works, there are a bunch of Google Sheets extensions that can make it a lot quicker to set up a weekly report. Or you can just code it yourself.
Would that work for you?
Thanks!

How do I automate getting BigQuery billing data?

We are using BigQuery rather heavily now and I've been tasked with keeping track of how much we are spending on queries each day. There seems to be no easy way to do this within BigQuery? Has anyone else done this already?
I started trying to scrape it myself, but it's a real mess. Retrieving data involves a POST to https://bpui0.google.com/billing/ui/batchservice which sends the entire contents of my about:plugins to Google for every new request.
There are two components for BigQuery pricing: Data storage, and data processed by each query.
https://developers.google.com/bigquery/pricing#table
To keep track of daily spend, you'd want to track how much data is being processed. An easy way to do this is to look at the 'bytes_processed' field that comes with each API query response.
You could even pipe this data back to BigQuery, to further dice and analyze usage :).
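A sketch using the current google-cloud-bigquery Python client, where that field is exposed on the query job as total_bytes_processed; the rate used in the cost estimate is only illustrative, so check current on-demand pricing:

```python
# Sketch: run a query and record how many bytes it processed.
# The $/TB rate below is illustrative only; look up current on-demand pricing.
from google.cloud import bigquery

client = bigquery.Client()

job = client.query("SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`")
job.result()  # wait for the query to finish

bytes_processed = job.total_bytes_processed
estimated_cost = bytes_processed / 1e12 * 5.0  # assumed $5 per TB; adjust to current pricing
print(f"{bytes_processed} bytes processed, ~${estimated_cost:.4f}")
```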