Session count in GA4 is much higher than in the events_intraday table in BigQuery - google-bigquery

Issue
For example, GA4 reports 559,555 sessions for November (taken from Report / Acquisition / Traffic Acquisition), but when I calculate the session count from the BigQuery table, it is 468,991.
That is a big difference. I suspect BigQuery's number is closer to our actual traffic and to our Google Analytics 360 number.
This started when we implemented ecommerce events on our site, but we are not sure whether that is related.
Question
Should GA4's on-screen numbers and the data in BigQuery be the same (or close)?
How can we solve this issue? We would like the numbers to match more closely.
FYI
We used the following query to calculate the session count in BigQuery.
SELECT
  HLL_COUNT.EXTRACT(
    HLL_COUNT.INIT(
      CONCAT(
        user_pseudo_id,
        (SELECT `value` FROM UNNEST(event_params) WHERE key = 'ga_session_id' LIMIT 1).int_value),
      12)) AS session_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
https://developers.google.com/analytics/blog/2022/hll
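For comparison, here is a minimal exact-count sketch against the same public sample table; COUNT(DISTINCT ...) is exact but heavier than the HLL approximation, so it can help isolate whether a gap comes from the approximation or from the GA4 report itself:

SELECT
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    -- CAST keeps the session key construction explicit; events missing the
    -- ga_session_id param yield NULL and are ignored by COUNT(DISTINCT).
    CAST((SELECT value.int_value
          FROM UNNEST(event_params)
          WHERE key = 'ga_session_id'
          LIMIT 1) AS STRING))) AS session_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`

With precision 12, the HLL estimate carries roughly a 1-2% error, so a gap of ~16% like the one described above is unlikely to come from the approximation itself; the linked blog post describes how GA4 itself estimates sessions with HLL++, so the report's own settings (such as reporting identity) are a more likely source.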
I'd really appreciate it if you could give us some advice.

Related

How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to find out how many active users I have in my app, broken down by the two or three latest versions of the app.
I've read some documentation and other Stack Overflow questions, but none of them solved my problem (and some others had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users - this solution is probably the best, but even after changing the dataset name correctly and removing the _TABLE_SUFFIX conditions, it kept returning a single column n_day_active_users_count = 0)
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this one returns no values for me, and I don't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts in Data Studio, so using the REST API would make it harder to join my two solutions - one from BigQuery and one from the REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So I started to write a solution from scratch, and this is what I have so far:
SELECT
  user_pseudo_id,
  app_info.version,
  ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version)
        / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: users_of_X_version / users_of_all_versions.
(The results are fairly close to the ones shown in the Google Analytics web interface - I believe the difference is due to the date on which I turned on the BigQuery integration. But I'd like some confirmation that my logic is correct.)
The biggest problem in my code right now is that I cannot write it without grouping by user_pseudo_id (because when I don't, BigQuery says: "SELECT list expression references column user_pseudo_id which is neither grouped nor aggregated at [2:3]"), and that's why I have duplicated rows in the query result. A possible fix is sketched below.
Also, about the first link in the examples... is it possible for a record to have an engagement_time_msec param with a value < 0? If not, why is that condition in the WHERE clause?
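One way to avoid grouping by user_pseudo_id is to aggregate per version only and take the share against the overall total with a window function. A minimal sketch, keeping the question's placeholder table name `projet-table.events_*`:

SELECT
  app_info.version,
  COUNT(DISTINCT user_pseudo_id) AS users,
  -- Share of this version's distinct users against all versions combined.
  -- Note: a user active on two versions is counted once per version, so
  -- the shares can sum to slightly more than 1.
  ROUND(COUNT(DISTINCT user_pseudo_id)
        / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version
ORDER BY app_info.version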

Finding statistical outliers in timestamp intervals with SQL Server

We have a bunch of devices in the field (various customer sites) that "call home" at regular intervals, configurable at the device but defaulting to 4 hours.
I have a view in SQL Server that displays the following information in descending chronological order:
DeviceInstanceId uniqueidentifier not null
AccountId int not null
CheckinTimestamp datetimeoffset(7) not null
SoftwareVersion string not null
Each time the device checks in, it will report its id and current software version which we store in a SQL Server db.
Some of these devices are in places with flaky network connectivity, which obviously prevents them from operating properly. There are also a bunch in datacenters where administrators regularly forget about it and change firewall/ proxy settings, accidentally preventing outbound communication for the device. We need to proactively identify this bad connectivity so we can start investigating the issue before finding out from an unhappy customer... because even if the problem is 99% certainly on their end, they tend to feel (and as far as we are concerned, correctly) that we should know about it and be bringing it to their attention rather than vice-versa.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval. For example, let's say device 87C92D22-6C31-4091-8985-AA6877AD9B40 has, for the last 1000 checkins, checked in every 4 hours or so (give or take a few seconds)... but the last time it checked in was just a little over 6 hours ago now. This is information I would like to highlight for immediate review, along with device E117C276-9DF8-431F-A1D2-7EB7812A8350 which normally checks in every 2 hours, but it's been a little over 3 hours since the last check-in.
It seems relatively straightforward to brute-force this (loop through all the devices, examine the average interval between check-ins, see when the last check-in was, compare that to the current time, and so on), but there are thousands of these devices, and the count grows larger every day. I need an efficient query that can generate this list of uncommunicative devices at least every hour... I just can't picture how to write that query.
Can someone help me with this? Maybe point me in the right direction? Thanks.
I am trying to come up with a way to query all distinct DeviceInstanceId that have currently not checked in for a period of 150% their normal check-in interval.
I think you can do:
select *
from (
    -- Average gap per device: total observed span divided by the number
    -- of gaps (count - 1); nullif avoids division by zero for devices
    -- with a single check-in.
    select DeviceInstanceId,
           datediff(second, min(CheckinTimestamp), max(CheckinTimestamp)) / nullif(count(*) - 1, 0) as avg_secs,
           max(CheckinTimestamp) as max_CheckinTimestamp
    from t
    group by DeviceInstanceId
) t
-- Flag devices whose last check-in is older than 150% of their average
-- interval. sysdatetimeoffset() matches the datetimeoffset column type;
-- getdate() would return a datetime without offset information.
where max_CheckinTimestamp < dateadd(second, -avg_secs * 1.5, sysdatetimeoffset());
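If the underlying check-in table is large, this aggregation benefits from an index keyed on device and timestamp; the table and index names below are assumptions, since the question only describes a view:

-- Hypothetical supporting index; substitute the real base table behind the view.
CREATE INDEX IX_Checkins_Device_Timestamp
    ON dbo.DeviceCheckins (DeviceInstanceId, CheckinTimestamp);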

Confirmation of how to calculate bigquery query costs

I want to double-check what I need to look at when assessing query costs for BigQuery. I've found the quoted price per TB here, which says $5 per TB, but precisely 1 TB of what? Up until now (before it seemed to matter) I have been assuming that the relevant number is the one the BigQuery UI outputs above the results, so for this example query:
...in this case 2.34 GB. So, as a fraction of a terabyte and multiplied by $5, this would cost around 1.2 cents, assuming I'd used up my free allowance for the month.
Can anyone confirm that I'm correct? I'm checking this before I process something that I think could rack up some non-negligible costs for once. I should say I've never been stung with a sizeable BigQuery bill before; it seems difficult to do.
Can anyone confirm that I'm correct?
Confirmed
Please note - the BigQuery UI in fact uses a dry run, which only estimates Total Bytes Processed. The final cost is based on Bytes Billed, which reflects some nuances, for example a minimum of 10 MB billed per table involved in the query. You can see more details here - https://cloud.google.com/bigquery/pricing#on_demand_pricing
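As a cross-check after the fact, per-user billed bytes can also be pulled from INFORMATION_SCHEMA without exporting audit logs. A sketch, assuming the `region-us` qualifier and the $5 per (binary) TB on-demand rate:

-- Sketch: estimated on-demand cost per user over the last 30 days.
SELECT
  user_email,
  ROUND(SUM(total_bytes_billed) / POW(2, 40) * 5.0, 2) AS estimated_usd_cost
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY estimated_usd_cost DESC;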
I know I am late, but this might help you.
If you are pushing your audit logs to another dataset, you can run the query below on that dataset.
WITH data AS (
  SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS principalEmail,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent AS jobCompletedEvent
  FROM
    `administrative-audit-trail.gcp_audit_logs.cloudaudit_googleapis_com_data_access_20190227`
)
SELECT
  principalEmail,
  FORMAT('%9.2f', 5.0 * (SUM(jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POWER(2, 40))) AS Estimated_USD_Cost
FROM data
WHERE jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY principalEmail
ORDER BY Estimated_USD_Cost DESC
Reference: https://cloud.google.com/bigquery/docs/reference/auditlogs/

Is Bigtable (or BigQuery) the right platform for correlation analysis of logs?

I'm faced with the challenge of analysing different system log files based on the following requirements:
several hundred systems
millions of logs every day, in different formats
Besides many other objectives, my biggest challenge is real-time correlation analysis of all incoming logs against all current system logs and, partially, against historical log events.
Currently we're focusing on MongoDB, ElasticSearch, Hadoop, ... to meet this challenge.
On the other hand I've read some interesting things about Google Bigtable and Bigquery.
So my question is: are Bigtable and/or BigQuery solutions worth looking at in order to do this real-time analysis?
I have no experience with either of these products, so I'm hoping for some tips on whether these Google solutions could be an alternative for my requirements.
THX & BR
bdriven
EDIT:
(A comment said the question was too broad: "You need to show the actual analysis you need to make. BigQuery will be much, much cheaper than homemade with NoSQL.")
Our goal is to develop a system which is able to generate warnings based on current log events (or a combination of different log events) and their past interactions with other systems' behavior.
Therefore we have to be able to run fast correlation analyses of current events against huge amounts of unstructured historical data.
I know that this requirement description is probably not the most specific one, but we're right at the beginning of this project.
So my goal with this question is to get some arguments for our next team meeting on whether we should consider taking a closer look at Bigtable / BigQuery or not.
One of my favorite features of BigQuery is its ability to run correlations.
Here's a correlations-with-BigQuery tutorial I wrote a couple of years ago: http://nbviewer.ipython.org/gist/fhoffa/6459195
For example, to rank and find the most correlated airports in terms of flight delays:
SELECT a.departure_state, b.departure_state, CORR(a.avg, b.avg) corr, COUNT(*) c
FROM (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2
  HAVING c > 5
) a
JOIN (
  SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
  FROM [bigquery-samples:airline_ontime_data.flights]
  GROUP BY 1, 2
  HAVING c > 5
) b
ON a.date = b.date
WHERE a.departure_state < b.departure_state
GROUP EACH BY 1, 2
HAVING c > 5
ORDER BY corr DESC;
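The query above is written in BigQuery's legacy SQL dialect (bracketed table names, GROUP EACH BY). A sketch of the same pattern in today's Standard SQL, assuming the sample dataset is still reachable at the backticked path:

-- Standard SQL translation sketch; the table path is an assumption
-- (legacy [project:dataset.table] becomes `project.dataset.table`).
WITH daily AS (
  SELECT date, departure_state, AVG(departure_delay) AS avg_delay, COUNT(*) AS c
  FROM `bigquery-samples.airline_ontime_data.flights`
  GROUP BY date, departure_state
  HAVING c > 5
)
SELECT
  a.departure_state AS state_a,
  b.departure_state AS state_b,
  CORR(a.avg_delay, b.avg_delay) AS corr,
  COUNT(*) AS c
FROM daily a
JOIN daily b ON a.date = b.date
WHERE a.departure_state < b.departure_state
GROUP BY state_a, state_b
HAVING c > 5
ORDER BY corr DESC;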
Try it yourself in the next 5 minutes! A quick getting started tutorial: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/

Trouble Looking For Events WITHIN a Session In BigQuery or WITHIN Multiple Sessions

I wanted to get a second pair of eyes and some help confirming the best way to look within a session at the hit level in BigQuery. I have read the BigQuery developer documentation thoroughly, which provides insight on working WITHIN a session. My challenge is this. Let us assume I write the high-level query to count the number of sessions and group them by device.deviceCategory, as below:
SELECT device.deviceCategory,
       COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 10000000) AS sessions
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY sessions DESC
I then run a follow-up query like the following to find the number of distinct users (client IDs):
SELECT device.deviceCategory,
       COUNT(DISTINCT fullVisitorId) AS users
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY users DESC
(Note that I broke those up because of the sheer size of the data I am working with, which produces runs greater than 5 TB in some cases.)
My challenge is the following. I feel like I have the wrong approach and have not had success with the WITHIN function. For every user ID (or fullVisitorId), I want to look within all of their sessions to find out how many were desktop and how many were mobile: basically, the cross-device users. I want to collect a table of these users. I started here:
SELECT COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 10000000) AS sessions
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
WHERE device.deviceCategory = 'desktop' AND device.deviceCategory = 'mobile'
This is not correct though; the WHERE clause can never match, since device.deviceCategory cannot be both 'desktop' and 'mobile' on the same session row. Moreover, any version of a WITHIN query I write gives me nonsense results or counts of 0. Does anyone have any strategies or tips for a way forward here? What is the best way to use the WITHIN function to look for sessions that may have multiple events happening WITHIN the session (my goal being to collect the user IDs that meet certain requirements within a session or over various sessions)? Two days ago I did this in a very manual way, working through the steps by hand and saving intermediate data frames to generate counts. That said, I wanted to see if there is any guidance on doing this quickly with a single query.
I'm not sure if this question is still open on your end, but I believe I see your problem, and it is not misuse of the WITHIN function. It is a data-understanding problem.
When dealing with GA and cross-device identification, you cannot reliably use any combination of fullVisitorId and visitId to identify users, as these are derived from the cookie that GA places on the user's browser. The fullVisitorId therefore identifies a specific browser on a specific device more accurately than it identifies a specific user.
In order to truly track users across devices, you must leverage the userId functionality (follow this link). This requires you to have the user sign in in some way, giving them an identifier that you can use across all of their devices to tie their behavior together.
After you implement some type of user identification that you control, rather than GA's cookie assignment, you can use that to look for details across sessions and within those individual sessions; a sketch of such a query follows.
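For the mechanical part, here is a minimal legacy-SQL sketch (matching the question's dialect and placeholder table) that surfaces IDs whose sessions span more than one device category; substitute your own sign-in-based identifier for fullVisitorId once it is populated, since fullVisitorId alone only identifies a browser:

-- IDs seen on more than one device category, with per-category session counts.
SELECT fullVisitorId,
       COUNT(DISTINCT device.deviceCategory) AS categories,
       SUM(IF(device.deviceCategory = 'desktop', 1, 0)) AS desktop_sessions,
       SUM(IF(device.deviceCategory = 'mobile', 1, 0)) AS mobile_sessions
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY fullVisitorId
HAVING categories > 1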
Hope that helps!