Getting session counts, by source, from GA4 and BigQuery - google-bigquery

I've got GA4 pushing to BQ and I wanted to test some ideas, so I started by making sure my BigQuery queries could duplicate (or at least get close to) what I see in the GA4 UI.
In GA4, I opened the Traffic acquisition: Session source report and filtered to traffic with a source of 'google' for a specific date range. This is what I see:
So just under 16k sessions for google during this timeframe.
I then ran this query to get session counts by source and just had it show me the results for Google:
WITH prep AS (
SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
MAX((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source')) AS source
FROM `disco-retina-319219.analytics_312878766.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20221112' AND '20221209'
GROUP BY
user_pseudo_id,
session_id)
SELECT
COALESCE(source,'(direct)') AS source,
COUNT(DISTINCT CONCAT(user_pseudo_id,session_id)) AS sessions
FROM
prep
WHERE source='google'
GROUP BY
source
ORDER BY
sessions DESC
But it only shows 4,200 sessions.
I expected the results might be slightly off, but my BQ query is showing just over 4k sessions. Clearly I am doing something wrong, but I don't understand what. I'm getting a distinct list of the concatenated session ID and pseudo user ID, which, from what I've read, is the proper way to get a list of distinct sessions. I've verified that I am using the same date range in both (11/12/2022 - 12/9/2022) and that I am hitting the proper table in BQ.
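A first diagnostic step (a sketch only, reusing the question's own table and session definition) might be to count distinct sessions with no source filter at all and compare that figure with GA4's overall session total, to separate "my session counting is off" from "my source attribution is off":
-- Sanity check (sketch): total distinct sessions in the same window, ignoring source entirely.
SELECT
COUNT(DISTINCT CONCAT(user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id'))) AS sessions
FROM `disco-retina-319219.analytics_312878766.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20221112' AND '20221209'
If that total lines up with GA4, the gap is likely in how the source value is attributed per session rather than in the session counting itself.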

Related

MERGE on multiple tables

I am trying to do the following, but it fails with an "Illegal operation (write) on meta-table" error.
MERGE x.y.events_* as events
USING
(
select distinct
user_id,
user_pseudo_id
from x.y.events_*
where user_id is not null
and user_pseudo_id is not null
qualify row_number() over (partition by user_pseudo_id) = 1
order by user_pseudo_id
) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id
WHEN MATCHED THEN
UPDATE SET events.user_id = user_ids.user_id
This works fine if I specify x.y.events_20230115 after MERGE, but I have about 700 tables to update, plus I would like to run this dynamically every day so it updates yesterday's table. With the wildcard, BigQuery tells me that this is an "Illegal operation (write) on meta-table". Makes sense, but I can't figure out how to proceed.
I am aware that I can use something like _table_suffix = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY)) in WHERE clauses, but that doesn't seem like a solution here since I'm trying to write data.
Could anyone kindly point me in the right direction here? How can I dynamically supply the table suffix after MERGE x.y.events_, or is there perhaps a better way of doing this? Some sort of iteration?
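Not from the question, but one possible direction is a BigQuery script that builds yesterday's table name and runs the MERGE through EXECUTE IMMEDIATE. This is only a sketch: the dataset path x.y and the column names come from the question, while the FORMAT-based statement construction and the use of CURRENT_DATE() are assumptions.
-- Sketch: construct the MERGE against yesterday's daily table and run it dynamically.
DECLARE target_table STRING DEFAULT CONCAT(
  'x.y.events_',
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)));

EXECUTE IMMEDIATE FORMAT("""
MERGE `%s` AS events
USING (
  SELECT DISTINCT user_id, user_pseudo_id
  FROM `x.y.events_*`
  WHERE user_id IS NOT NULL
    AND user_pseudo_id IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY user_pseudo_id) = 1
) user_ids
ON events.user_pseudo_id = user_ids.user_pseudo_id
WHEN MATCHED THEN
  UPDATE SET events.user_id = user_ids.user_id
""", target_table);
In a scheduled query, @run_date could presumably replace CURRENT_DATE() so the script always targets the previous day's table.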

Google Looker Studio - Google Data Studio | Bad performance using table_suffix filter in BigQuery data source

CONTEXT
Hi,
In BigQuery, I have a table that is partitioned by an integer that can range from 0 to 999.
Every time I use this data source in Looker Studio for reporting, I filter this column using a parameter to get the right partition; after that, another filter is used on the date column.
The queries are fast but very expensive.
GOAL
To reduce cost, I divided the table into 1,000 wildcard tables in my BigQuery project and used the date as the partition column for all of them.
So,
before: I had my_project.big_table partitioned by id;
now: I have my_project.table_* partitioned by date, and I can use the table suffix to get the right table.
In the Looker Studio, I changed the custom query for the data source from:
SELECT a.*
FROM `my_project.big_table` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE) AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
AND a.id = @id1
AND a.user_email = @DS_USER_EMAIL
to:
SELECT a.*
FROM `my_project.table_*` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE) AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
AND a._TABLE_SUFFIX = @id1
AND a.user_email = @DS_USER_EMAIL
ISSUE DESCRIPTION
The change above caused a dramatic drop in the performance of the dashboard.
Every page now takes more than 5 minutes to return results, whereas before the pages loaded in less than 10 seconds.
I tried to use:
the parameter @id1 directly in the FROM clause, but it is not automatically substituted and it causes an error: Not found: Table my_project.table_@id1 was not found in location EU
an EXECUTE IMMEDIATE, but it is not recognized by the tool
When I try to use one of the 1,000 table suffixes directly, for example id 400:
SELECT a.*
FROM `my_project.table_400` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE) AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
AND a.user_email = @DS_USER_EMAIL
the performance is exactly the same as before, but I need the suffix filter for reporting.
I know that wildcard tables are limited in many respects (cache, for example), but testing the query directly in BigQuery, the time spent is 0 to 1 second.
Is there something that I'm missing or can change in the query?
Do you have some advice/suggestions?
Many thanks!
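Not something from this thread, but a commonly suggested alternative, as a sketch: instead of 1,000 wildcard tables, keep one table partitioned by date and clustered on the integer id, so the original parameterized query shape can still prune data. The table name big_table_clustered is hypothetical.
-- Sketch: one date-partitioned, id-clustered table instead of 1,000 wildcard tables.
CREATE TABLE `my_project.big_table_clustered`
PARTITION BY date
CLUSTER BY id
AS
SELECT * FROM `my_project.big_table`;

-- The Looker Studio custom query could then stay close to the original form:
SELECT a.*
FROM `my_project.big_table_clustered` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE) AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
AND a.id = @id1
AND a.user_email = @DS_USER_EMAIL
Clustering on id keeps the parameter in the WHERE clause (which Looker Studio substitutes reliably) while still reducing the bytes scanned.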

BQ - Materialised Views and ARRAY_AGG

I am trialling materialised views in our BQ eventing system but have hit a roadblock.
For context:
Our source event ingest tables use streaming inserts only (append only), are partitioned by event time (more or less true time, but always in order with respect to the entity involved in the event stream), and we extract a given entity's 'latest' / most recent full state. I feel that with data being append-only and history immutable there could be benefits here, but I currently cannot get it to work (yet).
A lot of our base BQ code is spent determining the 'latest' state of an entity. This latest state is baked into the payload of the most recent event ingested into that table, e.g. given an OrderAccepted and then a later OrderItemsDespatched event (for the same OrderId), the OrderItemsDespatched event will have the most up-to-date snapshot of the order (after processing the items dispatch).
Thus in BQ, for BI, we need to surface the most current state of that order, e.g. we need to extract the order struct from the OrderItemsDespatched event since it is the most recent event.
This could involve an analytic function:
ROW_NUMBER() OVER (PARTITION BY entityId ORDER BY EventOccurredTimestamp DESC)
and picking row = 1 - however, analytic functions are not supported by MVs, and that approach is not as efficient anyway as the ARRAY_AGG below:
CREATE MATERIALIZED VIEW dataset.order_events_latest_mv
PARTITION BY EventOccurredDate
CLUSTER BY OrderId
AS
WITH ord_events AS (
SELECT
oe.*,
orderEvent.latestOrder.id AS OrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) AS EventOccurredTimestamp,
EXTRACT(DATE FROM PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime)) AS EventOccurredDate,
FROM
`project.dataset.order_events` oe
),
ord_events_latest AS (
SELECT
ARRAY_AGG(
e ORDER BY EventOccurredTimestamp DESC LIMIT 1
)[OFFSET(0)].*
FROM
ord_events e
GROUP BY
e.OrderId
)
SELECT
*
FROM
ord_events_latest
However, this errors with:
Materialized view query contains unsupported feature.
Fundamentally, we could save a heck of a lot of current processing and cost by only processing changed data rather than scanning all the data every time, which, given it's an append-only, partitioned source table, seems feasible?
The logic would be quite similar for deduping our events, which we also do a lot of, with a slightly different query but again using ARRAY_AGG.
Any advice welcome; hopefully support for the feature the error message says is unsupported is not far off. Thanks!
I hope it works:
WITH
latest_records AS
(
  SELECT
    entityId,
    SPLIT(MAX(CONCAT(CAST(EventOccurredTimestamp AS STRING), '||', Col1, '||', CAST(Col2 AS STRING), '||', CAST(Col3 AS STRING))), '||') AS values
  FROM `project.dataset.order_events`
  GROUP BY entityId
)
SELECT
  entityId,
  CAST(values[OFFSET(0)] AS TIMESTAMP) AS EventOccurredTimestamp,
  values[OFFSET(1)] AS Col1,                -- let's say it's a string
  CAST(values[OFFSET(2)] AS BOOL) AS Col2,  -- it's bool
  CAST(values[OFFSET(3)] AS INT64) AS Col3  -- it's int64
FROM latest_records

Window function (LEAD/LAG) with where clause?

So I have a table of page hits on a website, and for each page of a specific type (marketing_page), I am trying to identify the next page a customer hits. My query would probably look something like this:
Select * from
(
Select page_id
, hit_time
, customer_id
, session_id
, page_type
, LEAD(page_id, 1) over (PARTITION BY customer_id, session_id ORDER BY hit_time) as next_page_id
FROM page_hits
)
WHERE page_type = 'marketing_page'
The problem with this approach is that the sub-query becomes HUGE if I keep the WHERE clause outside the sub-query. Ideally I'd like to be able to do something like:
Select page_id
, hit_time
, customer_id
, session_id
, page_type
, LEAD(page_id, 1) over (PARTITION BY customer_id, session_id ORDER BY hit_time) as next_page_id
FROM page_hits
WHERE page_type = 'marketing_page'
but have it still account for pages outside the WHERE clause when doing the LEAD function. I understand that the LEAD function gets evaluated after the WHERE so this is not possible.
I would also like to avoid a self join because of the efficiency issue. Is there a fast/simple way to achieve this?
Thanks!
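If this table lives in BigQuery (an assumption; this page is tagged google-bigquery, though the answer below discusses Redshift), one sketch is to move the filter into a QUALIFY clause, which is applied after the window function is evaluated, so LEAD still sees every page hit in the session:
-- Sketch: QUALIFY filters after LEAD is computed over the whole session,
-- so next_page_id can still point at a non-marketing page.
SELECT page_id
, hit_time
, customer_id
, session_id
, page_type
, LEAD(page_id, 1) OVER (PARTITION BY customer_id, session_id ORDER BY hit_time) AS next_page_id
FROM page_hits
WHERE TRUE  -- kept in case QUALIFY requires an accompanying WHERE/GROUP BY/HAVING
QUALIFY page_type = 'marketing_page'
This removes the outer subquery but still scans all rows, which seems unavoidable if LEAD must consider non-marketing pages.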
This is too long for a comment.
If a simple lead() does not work on a Redshift table, that could mean one of several things. What comes to mind:
The Redshift database is busy, having used up all query connections, and you are just waiting. I'll assume this is not the case.
Your data is seriously big.
Your "table" is really a complicated view.
Given the nature of the data, I would assume that the data is seriously big. I would further assume that it is partitioned by some time unit, probably day.
You need to limit the query to one or a handful of partitions to run it. Your question provides no information on how that might be done.
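To make that last point concrete, a sketch (the date bounds are placeholders, not from the question, and hit_time is assumed to be the partition column) that restricts the scan to a narrow window while still letting LEAD consider every page type inside it:
-- Sketch: limit the scan to a few partitions inside the subquery,
-- then filter to marketing pages outside, so LEAD still sees all page types in the window.
SELECT *
FROM (
  SELECT
    page_id,
    hit_time,
    customer_id,
    session_id,
    page_type,
    LEAD(page_id, 1) OVER (PARTITION BY customer_id, session_id ORDER BY hit_time) AS next_page_id
  FROM page_hits
  WHERE hit_time >= TIMESTAMP '2023-01-01'  -- placeholder lower bound
    AND hit_time <  TIMESTAMP '2023-01-08'  -- placeholder upper bound
) t
WHERE page_type = 'marketing_page'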

How do I query GA RealtimeView with Standard SQL in BigQuery?

When exporting Google Analytics data to Google BigQuery, you can set up a realtime table that is populated with Google Analytics data in real time. However, this table will contain duplicates due to the eventually consistent nature of distributed computing.
To overcome this, Google provides a view where the duplicates are filtered out. However, this view is not queryable with Standard SQL.
If I try querying it with Standard SQL, I get:
Cannot reference a legacy SQL view in a standard SQL query
We have standardized on Standard SQL, and I am hesitant to rewrite all our batch queries in legacy SQL just for when we want to use them on realtime data. Is there a way to switch the realtime view to be a standard SQL view?
EDIT:
This is the view definition (which is recreated every day by Google):
SELECT *
FROM [111111.ga_realtime_sessions_20190625]
WHERE exportKey IN (SELECT exportKey
FROM
(SELECT
exportKey, exportTimeUsec,
MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
FROM
[111111.ga_realtime_sessions_20190625])
WHERE exportTimeUsec >= maxexportTimeUsec );
You can create a logical view like this using standard SQL:
CREATE VIEW dataset.realtime_view_20190625 AS
SELECT
visitKey,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (visitKey))
ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM dataset.ga_realtime_sessions_20190625 AS t
GROUP BY visitKey
This selects the most recent row for each visitKey. If you want to generalize this across days, you can do something like this:
CREATE VIEW dataset.realtime_view AS
SELECT
CONCAT('20', _TABLE_SUFFIX) AS date,
visitKey,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (visitKey))
ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM `dataset.ga_realtime_sessions_20*` AS t
GROUP BY date, visitKey
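A hypothetical query against the generalized view (the date value simply follows the CONCAT('20', _TABLE_SUFFIX) convention above):
-- Hypothetical usage: pull the deduplicated realtime sessions for a single day.
SELECT *
FROM dataset.realtime_view
WHERE date = '20190625'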