BigQuery: how to order a nested and repeated column?

I have a table of events that happened to a given animal, for example purchase, death, birth, etc. Most events occur on different dates, but some occur on the same date (format yyyy-mm-dd). For events on the same date, the "event_id" column determines their order. My question is this: how do I write a query that returns, for each animal, a nested and repeated column with all of its events ordered by date and then by ID?
select animal,
array_agg(struct(event_id, event_date, event_name, event_etc, ...)) as event
from events
group by animal;

ARRAY_AGG accepts an ORDER BY clause inside the function call, so you can do something like this:
SELECT
animal,
ARRAY_AGG(
STRUCT(
event_id,
event_date,
event_name,
...
) ORDER BY event_date, event_id
) AS events
FROM events
GROUP BY animal;
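To make the expected result shape concrete, here is a plain-Python emulation of what the ordered ARRAY_AGG produces. This is only a sketch: the sample rows and the helper name aggregate_events are made up for illustration, and it does not talk to BigQuery at all.

```python
from collections import defaultdict

# Hypothetical sample rows: (animal, event_id, event_date, event_name)
rows = [
    ("cow-1", 2, "2023-01-05", "purchase"),
    ("cow-1", 1, "2023-01-05", "birth"),
    ("cow-1", 3, "2022-12-31", "vaccination"),
    ("cow-2", 7, "2023-02-01", "death"),
]

def aggregate_events(rows):
    """Group rows by animal and order each group by (event_date, event_id)."""
    grouped = defaultdict(list)
    for animal, event_id, event_date, event_name in rows:
        grouped[animal].append(
            {"event_id": event_id, "event_date": event_date, "event_name": event_name}
        )
    for events in grouped.values():
        # ISO yyyy-mm-dd strings sort chronologically as plain text
        events.sort(key=lambda e: (e["event_date"], e["event_id"]))
    return dict(grouped)

result = aggregate_events(rows)
```

Each animal ends up with one list of event records, sorted by date first and by event_id within the same date, which is the role the ORDER BY inside ARRAY_AGG plays in the query above.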

Related

Grouping by Session ID in BigQuery for GA4 data

I'm currently trying to build a query that groups all my GA4 event data by session ID, so that I can get information about all the events per session, as opposed to analyzing each event separately.
The resulting output of my initial query is a new table that has session ID as its own column in the table, instead of being within an array for event parameters for a particular event.
The problem is that the session_id column has non-unique values, a session id is repeated multiple times for each row that is a new event (that happens within that session). I am trying to combine (merge) those non-unique session ids so that I can get ALL the events associated with a particular session_id.
I have tried this query which provides me with session_id as a new column, that is repeated for each event.
SELECT
  *,
  (
    SELECT COALESCE(value.int_value, value.float_value, value.double_value)
    FROM UNNEST(event_params)
    WHERE key = 'ga_session_id'
  ) AS session_id,
  (
    SELECT COALESCE(value.string_value)
    FROM UNNEST(event_params)
    WHERE key = 'page_location'
  ) AS page_location
FROM
  `digital-marketing-xxxxxx.analytics_xxxxxxx.events_intraday*`
This gives me output like the following (the real table has many more columns; this is just an example):
session_id | event_name
1234567 | session_start
1234567 | click_url
I need a way to basically merge the two session ids into a single cell. When I try this:
SELECT
*,
(
SELECT COALESCE(value.int_value, value.float_value, value.double_value)
FROM UNNEST(event_params)
WHERE key = 'ga_session_id'
) AS session_id,
(
SELECT COALESCE(value.string_value)
FROM UNNEST(event_params)
WHERE key = 'page_location'
) AS page_location
FROM
`digital-marketing-xxxxxxx.analytics_xxxxxxx.events_intraday*`
GROUP BY session_id
I get an error that tells me (if I understand correctly) that I can't aggregate certain values (like date) which is what the code is trying to do when attempting to group by session id.
Is there any way around this? I'm new to SQL, but the searches I've done so far haven't given me a clear answer on how to attempt this.
I use this code to understand the sequence of events. It might not be that efficient, as I have it set up to look at other things as well.
with _latest as (
SELECT
--create unique id
concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id')) as unique_session_id,
--create event id
concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id'),event_name) as session_ids,
event_name,
event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp
FROM *******
where
-- change the date range by using static and/or dynamic dates
_table_suffix between '20221113' and '20221114'),
Exit_count as (
select *,
row_number() over (partition by session_ids order by event_timestamp desc) as Event_order
from _latest)
select
Event_order,
unique_session_id,
event_date,
event_name,
FROM
Exit_count
group by
Event_order,
event_name,
unique_session_id,
--pagepath,
event_date
--Country_site
order by
unique_session_id,
Event_order

Find highest Value among duplicates in a table (Oracle)

I have a table with duplicate values as shown below.
I would like to find the latest start time among the events. Expected output is
I used the below query but it seems to get the latest start time in the entire table.
SELECT ID,
EVENT,
START_TIME,
LAST_VALUE(START_TIME) OVER (ORDER BY ID,EVENT RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_start_time
FROM
(Select * from EVENTS)
order by ID,EVENT;
I know I am missing something, probably a group by. Can you please help me out? I am using Oracle.
You just need the MAX analytic function and partition on id and event.
SELECT ID,
EVENT,
START_TIME,
MAX(START_TIME) OVER (PARTITION BY id, event) AS latest_start_time
FROM EVENTS
order by ID,EVENT;
Try
SELECT DISTINCT
ID,
EVENT,
START_TIME,
MAX(START_TIME) OVER (PARTITION BY ID, EVENT) AS latest_start_time
FROM EVENTS
order by ID,EVENT;
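To see what the PARTITION BY changes, here is a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for Oracle (SQLite 3.25 or later is assumed for window functions), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event TEXT, start_time TEXT)")
# Hypothetical sample data with duplicate (id, event) pairs
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "A", "2020-01-01 08:00"),
        (1, "A", "2020-01-01 09:30"),
        (1, "B", "2020-01-01 07:00"),
        (2, "A", "2020-01-02 10:00"),
    ],
)

# MAX as a window function: every row keeps its own start_time and
# also receives the latest start_time within its (id, event) group.
latest = conn.execute(
    """
    SELECT id, event, start_time,
           MAX(start_time) OVER (PARTITION BY id, event) AS latest_start_time
    FROM events
    ORDER BY id, event, start_time
    """
).fetchall()
```

Without the PARTITION BY, the window would span the whole table, which matches the behaviour described in the question.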

SQL - timeline based queries

I have a table of events which has:
user_id
event_name
event_time
There are event names of types: meeting_started, meeting_ended, email_sent
I want to create a query that counts the number of times an email has been sent during a meeting.
UPDATE: I'm using Google BigQuery.
Example query:
SELECT
event_name,
count(distinct user_id) users,
FROM
events_table
WHERE event_name IN ('meeting_started', 'meeting_ended')
group by 1
How can I achieve that?
Thanks!
You can do this in BigQuery using last_value():
Presumably, an email is sent during a meeting if the most recent "meeting" event is 'meeting_started'. So, you can solve this by getting the most recent meeting event for each event and then filtering:
select et.*
from (select et.*,
last_value(case when event_name in ('meeting_started', 'meeting_ended') then event_name end ignore nulls) over
(partition by user_id order by event_time) as last_meeting_event
from events_table et
) et
where event_name = 'email_sent' and last_meeting_event = 'meeting_started'
This reads like some kind of gaps-and-islands problem, where an island is a meeting, and you want emails that belong to islands.
How do we define an island? Assuming that meeting starts and ends properly interleave, we can just compare the count of starts and ends on a per-user basis. If there are more starts than ends, then a meeting is in progress. Using this logic, you can get all emails that were sent during a meeting like so:
select *
from (
select e.*,
countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
countif(event_name = 'meeting_ended' ) over(partition by user_id order by event_time) as cnt_ended
from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended
It is unclear where you want to go from here. If you want the count of such emails, just use select count(*) instead of select * in the outer query.
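The running-count idea can be tried locally with a small sketch. It uses Python's sqlite3 rather than BigQuery, so COUNTIF is replaced by SUM(CASE ...), and SQLite 3.25 or later is assumed for window functions. The sample events are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events_table (user_id INTEGER, event_name TEXT, event_time TEXT)"
)
# Hypothetical timeline: user 1 emails before, during and after a meeting;
# user 2 emails during a meeting that has not ended yet.
conn.executemany(
    "INSERT INTO events_table VALUES (?, ?, ?)",
    [
        (1, "email_sent", "2023-01-01 09:00"),
        (1, "meeting_started", "2023-01-01 10:00"),
        (1, "email_sent", "2023-01-01 10:15"),
        (1, "meeting_ended", "2023-01-01 11:00"),
        (1, "email_sent", "2023-01-01 11:30"),
        (2, "meeting_started", "2023-01-01 10:00"),
        (2, "email_sent", "2023-01-01 10:05"),
    ],
)

# Running counts of starts and ends per user; a meeting is in progress
# whenever more starts than ends have been seen so far.
emails_in_meeting = conn.execute(
    """
    SELECT user_id, event_time
    FROM (
        SELECT e.*,
               SUM(CASE WHEN event_name = 'meeting_started' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY user_id ORDER BY event_time) AS cnt_started,
               SUM(CASE WHEN event_name = 'meeting_ended' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY user_id ORDER BY event_time) AS cnt_ended
        FROM events_table e
    ) AS t
    WHERE event_name = 'email_sent' AND cnt_started > cnt_ended
    ORDER BY user_id, event_time
    """
).fetchall()
```

Only the emails sent between a start and its matching end (or after a start with no end yet) survive the filter; len(emails_in_meeting) gives the count.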

Last click attribution/greatest n per user in SQL

I would like to select the last campaign a user clicked in my dataset and return a table with the name of the last clicked campaign and date for each anonymous id.
This is what I have written
select anon,
source,
medium,
campaign,
max(ts) as ts
from attribution
group by 1,2,3,4
This code seems to return the last click date, but in cases where the user clicked on two campaigns it will return both campaigns with the latest date appended to the date column.
TS in this scenario refers to the timestamp
You could use row_number():
select *
from (
select
anon,
source,
medium,
campaign,
ts,
row_number() over(partition by anon order by ts desc) rn
from attribution
) t where rn = 1
This assumes that anon is the column that identifies the user. If that's not the case, change it to the relevant column in the OVER(PARTITION BY ...) clause.
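For a quick check of the row_number() approach, here is a minimal sketch that runs the same pattern on Python's built-in sqlite3 (SQLite 3.25 or later assumed). The question does not say which database is in use, so SQLite is only a stand-in, and the sample clicks are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribution "
    "(anon TEXT, source TEXT, medium TEXT, campaign TEXT, ts TEXT)"
)
# Hypothetical clicks: u1 clicked two campaigns, u2 clicked one
conn.executemany(
    "INSERT INTO attribution VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "google", "cpc", "winter_sale", "2023-03-01 12:00"),
        ("u1", "newsletter", "email", "spring_sale", "2023-03-02 08:00"),
        ("u2", "google", "cpc", "winter_sale", "2023-03-01 09:00"),
    ],
)

# Number each user's clicks from newest to oldest and keep only the newest.
last_clicks = conn.execute(
    """
    SELECT anon, campaign, ts
    FROM (
        SELECT anon, source, medium, campaign, ts,
               ROW_NUMBER() OVER (PARTITION BY anon ORDER BY ts DESC) AS rn
        FROM attribution
    ) AS t
    WHERE rn = 1
    ORDER BY anon
    """
).fetchall()
```

Unlike the GROUP BY over source, medium and campaign, this returns exactly one row per anon, because the ranking is done per user rather than per user and campaign.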

Max of a Date field into another field in Postgresql

I have a PostgreSQL table with a few fields such as id and date. I need to find the max date for each id and show it in a new field on every row for that id. The SQLFiddle site was not responding, so I put an example in Excel. Here is the screenshot of the data and the output for the table.
You could use the windowing variant of max:
SELECT id, date, MAX(date) OVER (PARTITION BY id)
FROM mytable
Something like this might work:
WITH maxdts AS (
SELECT id, MAX(date) AS maxdt FROM mytable GROUP BY id
)
SELECT t.id, t.date, m.maxdt FROM mytable t, maxdts m WHERE t.id = m.id;
Keep in mind that, without knowing more about your data, this could be a horribly inefficient query, but it will get you what you need.