Grouping by Session ID in BigQuery for GA4 data

I'm trying to build a query that groups all my GA4 event data by session ID, so that I can get information about all the events in each session instead of analyzing each event separately.
The output of my initial query is a new table that has session ID as its own column, instead of being nested inside the event_params array of each event.
The problem is that the session_id column has non-unique values: a session ID is repeated on every row that is a new event within that session. I am trying to combine (merge) those repeated session IDs so that I can get ALL the events associated with a particular session_id.
I have tried this query, which gives me session_id as a new column, repeated for each event:
SELECT
  *,
  (
    SELECT COALESCE(value.int_value, value.float_value, value.double_value)
    FROM UNNEST(event_params)
    WHERE key = 'ga_session_id'
  ) AS session_id,
  (
    SELECT COALESCE(value.string_value)
    FROM UNNEST(event_params)
    WHERE key = 'page_location'
  ) AS page_location
FROM
  `digital-marketing-xxxxxx.analytics_xxxxxxx.events_intraday*`
This gives me output like the following (the real table has many more columns; this is just an example):
session_id  event_name
1234567     session_start
1234567     click_url
I need a way to basically merge the two session ids into a single cell. When I try this:
SELECT
*,
(
SELECT COALESCE(value.int_value, value.float_value, value.double_value)
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
(
SELECT COALESCE(value.string_value)
FROM UNNEST(event_params)
WHERE key = 'page_location'
) AS page_location
FROM
`digital-marketing-xxxxxxx.analytics_xxxxxxx.events_intraday*`
GROUP BY session_id
I get an error telling me (if I understand correctly) that certain columns (like date) are neither grouped nor aggregated, which the query would need in order to group by session ID.
Is there any way around this? I'm new to SQL, and the searches I've done so far haven't given me a clear answer on how to attempt this.

I use this code to understand the sequence of events. It might not be that efficient, as I have it set up to look at other things as well:
with _latest as (
  SELECT
    --create unique id
    concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) as unique_session_id,
    --create event id
    concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id'),event_name) as session_ids,
    event_name,
    event_date,
    TIMESTAMP_MICROS(event_timestamp) AS event_timestamp
  FROM *******
  where
    -- change the date range by using static and/or dynamic dates
    _table_suffix between '20221113' and '20221114'),
Exit_count as (
  select *,
    row_number() over (partition by session_ids order by event_timestamp desc) as Event_order
  from _latest)
select
  Event_order,
  unique_session_id,
  event_date,
  event_name,
FROM
  Exit_count
group by
  Event_order,
  event_name,
  unique_session_id,
  --pagepath,
  event_date
  --Country_site
order by
  unique_session_id,
  Event_order
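
The error in the question comes from combining SELECT * with GROUP BY: every selected column must either be grouped or wrapped in an aggregate. A minimal sketch of a one-row-per-session result is below. It uses Python's sqlite3 module as a runnable stand-in for BigQuery, and the table, its sample rows, and the assumption that ga_session_id has already been pulled out of event_params are all hypothetical.

```python
import sqlite3

# Hypothetical, already-flattened events table (not the real GA4 export schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_pseudo_id TEXT, ga_session_id INTEGER, "
    "event_name TEXT, event_timestamp INTEGER)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("u1", 1234567, "session_start", 100),
    ("u1", 1234567, "click_url", 250),
    ("u2", 1234567, "session_start", 120),  # same session number, different user
    ("u2", 7654321, "page_view", 900),
])

# One row per session: group by user + session number and aggregate the rest.
session_rows = conn.execute("""
    SELECT user_pseudo_id || '-' || ga_session_id AS unique_session_id,
           COUNT(*)             AS event_count,
           MIN(event_timestamp) AS first_event,
           MAX(event_timestamp) AS last_event
    FROM events
    GROUP BY user_pseudo_id, ga_session_id
    ORDER BY unique_session_id
""").fetchall()

# Ordered list of event names per session, collected on the Python side.
events_by_session = {}
for session_id, event_name in conn.execute("""
    SELECT user_pseudo_id || '-' || ga_session_id, event_name
    FROM events
    ORDER BY event_timestamp
"""):
    events_by_session.setdefault(session_id, []).append(event_name)

for row in session_rows:
    print(row, events_by_session[row[0]])
```

The session number is combined with user_pseudo_id because ga_session_id alone can repeat across users. In BigQuery itself, the per-session list could probably be built directly in SQL with ARRAY_AGG(event_name ORDER BY event_timestamp) instead of the Python loop.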

Related

BigQuery unnest multiple params

I need to select event_date and user_pseudo_id together with two event parameters, page_location and page_title, which I need to unnest and show separately. The code below returns either the location or the title value in a single column; I need each of them shown separately.
SELECT
event_date, value.string_value, user_pseudo_id,
FROM
`mydata.events20220909*`,
unnest(event_params)
WHERE
key = "page_title" OR key = "page_location"

You don't necessarily have to unnest all the event_params. With the following query, you can put any parameter in its own column:
select
event_date,
user_pseudo_id,
(select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
(select value.string_value from unnest(event_params) where key = 'page_title') as page_title
from `mydata.events_20220909*`
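
To see why one scalar subquery per key yields one column per parameter, here is a small runnable sketch. It is only an analogy: SQLite has no UNNEST, so the nested event_params array is modelled as a hypothetical child table keyed by a made-up event_id, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins: one row per event, one row per (event, parameter).
conn.execute("CREATE TABLE events (event_id INTEGER, event_date TEXT, user_pseudo_id TEXT)")
conn.execute("CREATE TABLE event_params (event_id INTEGER, key TEXT, string_value TEXT)")
conn.execute("INSERT INTO events VALUES (1, '20220909', 'u1')")
conn.executemany("INSERT INTO event_params VALUES (?, ?, ?)", [
    (1, "page_location", "https://example.com/home"),
    (1, "page_title", "Home"),
    (1, "campaign", "autumn"),
])

# Each correlated subquery looks up exactly one key for the current event,
# so the event stays on a single row and each parameter gets its own column.
rows = conn.execute("""
    SELECT e.event_date,
           e.user_pseudo_id,
           (SELECT p.string_value FROM event_params p
             WHERE p.event_id = e.event_id AND p.key = 'page_location') AS page_location,
           (SELECT p.string_value FROM event_params p
             WHERE p.event_id = e.event_id AND p.key = 'page_title') AS page_title
    FROM events e
""").fetchall()
print(rows)
```

By contrast, joining against the parameters and filtering on two keys produces one output row per matching parameter, which is why the original query shows the location on one row and the title on another.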

SQL - timeline based queries

I have a table of events which has:
user_id
event_name
event_time
There are event names of types: meeting_started, meeting_ended, email_sent
I want to create a query that counts the number of times an email has been sent during a meeting.
UPDATE: I'm using Google BigQuery.
Example query:
SELECT
  event_name,
  count(distinct user_id) users,
FROM
  events_table
WHERE
  event_name IN ('meeting_started', 'meeting_ended')
group by 1
How can I achieve that?
Thanks!

You can do this in BigQuery using last_value():
Presumably, an email is sent during a meeting if the most recent "meeting" event before it is 'meeting_started'. So, you can solve this by getting the most recent meeting event for each event and then filtering:
select et.*
from (select et.*,
             last_value(case when event_name in ('meeting_started', 'meeting_ended') then event_name end ignore nulls) over
                 (partition by user_id order by event_time) as last_meeting_event
      from events_table et
     ) et
where event_name = 'email_sent' and last_meeting_event = 'meeting_started'

This reads like some kind of gaps-and-islands problem, where an island is a meeting, and you want emails that belong to islands.
How do we define an island? Assuming that meeting starts and ends properly interleave, we can just compare the count of starts and ends on a per-user basis. If there are more starts than ends, then a meeting is in progress. Using this logic, you can get all emails that were sent during a meeting like so:
select *
from (
    select e.*,
        countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
        countif(event_name = 'meeting_ended'  ) over(partition by user_id order by event_time) as cnt_ended
    from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended
It is unclear where you want to go from here. If you want the count of such emails, just use select count(*) instead of select * in the outer query.
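
The running-count idea can be checked on a few rows. The sketch below is an illustration only: it runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), replaces BigQuery's countif(...) with SUM over a 0/1 comparison, and uses invented sample events with integer timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_table (user_id INTEGER, event_name TEXT, event_time INTEGER)")
conn.executemany("INSERT INTO events_table VALUES (?, ?, ?)", [
    (1, "email_sent", 5),         # before any meeting
    (1, "meeting_started", 10),
    (1, "email_sent", 15),        # during user 1's meeting
    (1, "meeting_ended", 20),
    (1, "email_sent", 25),        # after the meeting
    (2, "meeting_started", 12),
    (2, "email_sent", 13),        # during user 2's meeting
])

# SUM(condition) plays the role of countif(): the comparison evaluates to 0 or 1,
# and the window turns it into a running count per user ordered by time.
emails_in_meetings = conn.execute("""
    SELECT user_id, event_time
    FROM (
        SELECT e.*,
               SUM(event_name = 'meeting_started')
                   OVER (PARTITION BY user_id ORDER BY event_time) AS cnt_started,
               SUM(event_name = 'meeting_ended')
                   OVER (PARTITION BY user_id ORDER BY event_time) AS cnt_ended
        FROM events_table e
    )
    WHERE event_name = 'email_sent' AND cnt_started > cnt_ended
    ORDER BY user_id, event_time
""").fetchall()
print(emails_in_meetings)
```

Only the emails at times 15 and 13 have more starts than ends behind them, so they are the ones counted as sent during a meeting.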

BigQuery how to order a nested and repeated column?

I have a table containing events that occurred with a certain animal, for example purchase, death, birth, etc. Most events occur on different dates, but some occur on the same date (format yyyy-mm-dd). For those that occur on the same date, the "event_id" column determines the order. My question is this: how do I create a query that returns a nested and repeated column with all events ordered by date + ID, for each existing animal?
select animal,
array_agg(struct(event_id, event_date, event_name, event_etc, ...)) as event
from events
group by animal;

ARRAY_AGG supports ORDER BY. So, you can do something like this:
SELECT
  animal,
  ARRAY_AGG(
    STRUCT(
      event_id,
      event_date,
      event_name,
      ...
    ) ORDER BY event_date, event_id
  ) AS events
FROM events
GROUP BY animal;
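
To illustrate the shape of the result, here is a small Python sketch that mimics what the ordered aggregation produces: one entry per animal holding its events sorted by date and then by ID. The sample rows and animal names are invented, and this is an emulation of the idea rather than BigQuery code.

```python
from collections import defaultdict

# Hypothetical sample rows; event_date is an ISO yyyy-mm-dd string,
# so plain string comparison sorts chronologically.
events = [
    {"animal": "cow-1", "event_id": 2, "event_date": "2021-03-01", "event_name": "vaccination"},
    {"animal": "cow-1", "event_id": 1, "event_date": "2021-03-01", "event_name": "purchase"},
    {"animal": "cow-1", "event_id": 3, "event_date": "2020-01-15", "event_name": "birth"},
    {"animal": "cow-2", "event_id": 4, "event_date": "2021-05-20", "event_name": "birth"},
]

# Sort once by (date, id), then append into per-animal lists; the lists
# inherit the sort order, like ARRAY_AGG(... ORDER BY event_date, event_id).
events_by_animal = defaultdict(list)
for row in sorted(events, key=lambda r: (r["event_date"], r["event_id"])):
    events_by_animal[row["animal"]].append(
        (row["event_id"], row["event_date"], row["event_name"])
    )

for animal, animal_events in events_by_animal.items():
    print(animal, animal_events)
```

The two cow-1 events on 2021-03-01 come out in event_id order, which is the tie-breaking behaviour the question asks for.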

Optimizing query when trying to find latest record in multiple tables for specific column

Problem: Find the most recent record based on the (created) column for each (linked_id) across multiple tables. The results should include (user_id, MAX(created), linked_id). The query must also work with a WHERE clause to find a single record by (linked_id).
There are actually several tables in question, but here are 3 of them so you can get an idea of the structure (several other columns in each table have been omitted since they are not to be returned).
CREATE TABLE em._logs_adjustments
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_adjustments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE TABLE em._logs_assets
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_assets_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE TABLE em._logs_condition_assessments
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_condition_assessments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
The query I'm currently using relies on a small hack to get around the need for user_id in the GROUP BY clause; if possible, array_agg should be removed.
SELECT MAX(MaxDate), linked_id, (array_agg(user_id ORDER BY MaxDate DESC))[1] AS user_id FROM (
SELECT user_id, MAX(created) as MaxDate, asset_id AS linked_id FROM _logs_assets
GROUP BY asset_id, user_id
UNION ALL
SELECT user_id, MAX(created) as MaxDate, linked_id FROM _logs_adjustments
GROUP BY linked_id, user_id
UNION ALL
SELECT user_id, MAX(created) as MaxDate, linked_id FROM _logs_condition_assessments
GROUP BY linked_id, user_id
) as subQuery
GROUP BY linked_id
ORDER BY linked_id DESC
I receive the desired results, but I don't believe this is the right way to do it, especially since array_agg is used where it shouldn't be. Some tables can have upwards of 1.5 million records, which makes the query take 10-15 seconds or more to run. Any help or steering in the right direction is much appreciated.

Use DISTINCT ON. From the PostgreSQL documentation:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
select distinct on (linked_id) created, linked_id, user_id
from (
select user_id, created, asset_id as linked_id
from _logs_assets
union all
select user_id, created, linked_id
from _logs_adjustments
union all
select user_id, created, linked_id
from _logs_condition_assessments
) s
order by linked_id desc, created desc
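
DISTINCT ON is specific to PostgreSQL. As a point of comparison, the same "latest row per linked_id" result can be expressed with ROW_NUMBER(), which is a different technique from the one above. The sketch below runs it on SQLite through Python's sqlite3 module, using only two of the tables, invented sample rows, and a linked_id column in every table as in the CREATE TABLE statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("_logs_assets", "_logs_adjustments"):
    conn.execute(f"CREATE TABLE {table} (user_id INTEGER, created TEXT, linked_id INTEGER)")
conn.executemany("INSERT INTO _logs_assets VALUES (?, ?, ?)", [
    (7, "2015-01-01 10:00", 100),
    (8, "2015-01-03 09:00", 100),
    (7, "2015-01-02 12:00", 200),
])
conn.executemany("INSERT INTO _logs_adjustments VALUES (?, ?, ?)", [
    (9, "2015-01-02 08:00", 100),
    (9, "2015-01-05 08:00", 200),
])

# Number the rows of each linked_id from newest to oldest and keep number 1.
latest = conn.execute("""
    SELECT created, linked_id, user_id
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY linked_id ORDER BY created DESC) AS rn
        FROM (
            SELECT user_id, created, linked_id FROM _logs_assets
            UNION ALL
            SELECT user_id, created, linked_id FROM _logs_adjustments
        ) s
    )
    WHERE rn = 1
    ORDER BY linked_id DESC
""").fetchall()
print(latest)
```

Either way, the user_id comes from the same row as the latest created value, so no array_agg workaround is needed.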

SQL query with logic

Please help me with an SQL query.
My table structure with data is:
How can I select all rows without duplicate Event_ID values?
Event_NAME should preferably contain http://
But if no Event_NAME with http:// exists, then Event_NAME must contain Connect
The expected selection result is:
The syntax is Oracle.
Thanks in advance for any help.

If I understand correctly, you want to select all rows with Event_NAME containing 'http://', and then select any events not in the first set that have 'Connect' in Event_NAME.
I'm assuming that the only possibilities are the ones you show above: either there are two entries (http and Connect) for an event, or there's just 'Connect'. The query could work for other situations too.
The query is a union of 1. all events with 'http://' in the Event_NAME, and 2. events whose Event_ID does not appear in the first set. Under the assumption above, those are the rows with 'Connect' in their Event_NAME.
There are probably prettier ways to do this, but it works in Oracle with the test data:
SELECT * FROM eventtest WHERE Event_NAME LIKE 'http://%'
UNION
SELECT * FROM eventtest
WHERE Event_ID NOT IN
(SELECT Event_ID FROM eventtest WHERE Event_NAME LIKE 'http://%');

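Here are 2 approaches requiring only a single pass of the data. Both rely on 'http://...' values sorting after 'Connect...' values (true in a binary, byte-order sort, where lowercase 'h' follows uppercase 'C'), so that MAX() and the descending ROW_NUMBER() pick the http:// row when one exists: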
SELECT
Event_ID
, MAX(Event_NAME) AS Event_NAME
FROM eventtest
GROUP BY
Event_ID
;
SELECT
ID
, Event_ID
, Event_NAME
FROM (
SELECT
ID
, Event_ID
, Event_NAME
, ROW_NUMBER() OVER (PARTITION BY Event_ID ORDER BY Event_NAME DESC) AS rn
FROM eventtest
) dt
WHERE rn = 1
;
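
If other Event_NAME values could sort after 'http://', the preference can be spelled out with a CASE expression in the ORDER BY instead of relying on string order. A minimal sketch, run on SQLite through Python's sqlite3 module rather than Oracle, with invented sample rows because the original table was not shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eventtest (ID INTEGER, Event_ID INTEGER, Event_NAME TEXT)")
conn.executemany("INSERT INTO eventtest VALUES (?, ?, ?)", [
    (1, 10, "Connect to host"),
    (2, 10, "http://example.com/a"),
    (3, 20, "Connect to host"),
    (4, 30, "zz-other"),               # sorts after 'http://' as a plain string
    (5, 30, "http://example.com/b"),
])

# Rank rows per Event_ID by explicit priority: http:// first, then Connect,
# then anything else; ID breaks ties so the result is deterministic.
rows = conn.execute("""
    SELECT ID, Event_ID, Event_NAME
    FROM (
        SELECT ID, Event_ID, Event_NAME,
               ROW_NUMBER() OVER (
                   PARTITION BY Event_ID
                   ORDER BY CASE
                                WHEN Event_NAME LIKE 'http://%'  THEN 0
                                WHEN Event_NAME LIKE '%Connect%' THEN 1
                                ELSE 2
                            END,
                            ID
               ) AS rn
        FROM eventtest
    ) dt
    WHERE rn = 1
    ORDER BY Event_ID
""").fetchall()
print(rows)
```

For Event_ID 30 this keeps the http:// row, whereas ordering by Event_NAME DESC would have kept 'zz-other'.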
