Could you help me? I have the query below, which works and meets my need, but I don't know whether it was done in the best way or whether there is a less complex way to build it.
Purpose: capture the last event (event_name) per user_id, returning only one row per user (the most recent).
The query was developed in the BigQuery environment against GA4 event tables:
SELECT
CAST(CONCAT(SUBSTR(t2.event_date, 0, 4), '-', SUBSTR(t2.event_date, 5, 2), '-', SUBSTR(t2.event_date, 7, 2)) AS DATE) AS event_timestamp,
DATE_DIFF(CURRENT_DATE("UTC-3"),CAST(CONCAT(SUBSTR(t2.event_date, 0, 4), '-', SUBSTR(t2.event_date, 5, 2), '-', SUBSTR(t2.event_date, 7, 2)) AS DATE),day) AS days_ult_event,
t1.user_id,
t2.event_name
FROM (
SELECT
MAX(TIMESTAMP_MICROS(event_timestamp)) event_timestamp,
user_id
FROM
`events_*`
WHERE
user_id IS NOT NULL
GROUP BY
2) t1
LEFT JOIN (
SELECT
DISTINCT user_id,
event_name,
event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp
FROM
`events_*`) t2
ON
t1.user_id = t2.user_id
AND t1.event_timestamp = t2.event_timestamp
Your approach works, but it is a little hard to read. To return the latest event per user, try the row_number() window function:
with _latest as (
SELECT user_id,
event_name,
event_date,
TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
row_number() over (partition by user_id order by TIMESTAMP_MICROS(event_timestamp) desc) as rn
FROM `events_*`
)
select *
from _latest
where rn=1
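If your project can use BigQuery's QUALIFY clause, you can filter on the window function without a wrapper query at all. A minimal sketch against the same `events_*` tables, keeping the question's user_id filter (the alias event_ts is mine, chosen so it doesn't shadow the raw column):
SELECT user_id,
  event_name,
  event_date,
  TIMESTAMP_MICROS(event_timestamp) AS event_ts
FROM `events_*`
WHERE user_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) = 1
As an aside, PARSE_DATE('%Y%m%d', event_date) is a simpler way to turn the GA4 event_date string into a DATE than the SUBSTR/CONCAT assembly in the original query.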
I have a query like this:
With cte as(
Select min(date1) as min_date,
max(date1) as max_date,
id,
city,
time_id
From some_table
Group by id, city, time_id
),
range as
(select dateadd('month', row_number() over (order by null), (select min_date from cte)) as date_expand
From table(generator(rowcount => 12*36)))
Select * from range;
It gives this error:
Single row subquery returns more than one row.
Is there a way to pass a variable as the 3rd argument of the dateadd function? My cte returns many min_date values because of the group by clause. TIA
Yes, a sub-select in the SELECT list needs to return only one row, and your CTE returns many rows.
Your query makes more sense to me if you did something like this:
With some_table as (
SELECT * FROM VALUES
(1, 'new york', 10, '2020-01-01'::date),
(1, 'new york', 10, '2020-02-01'::date),
(2, 'christchurch', 20, '2021-01-01'::date)
v(id, city, time_id, date1)
), cte as (
Select
min(date1) as min_date,
max(date1) as max_date,
id,
city,
time_id
FROM some_table
GROUP BY 3,4,5
), range as (
SELECT
id, city, time_id,
dateadd('month', row_number()over(partition by id, city, time_id ORDER BY null), min_date) as date_expand
FROM table(generator(rowcount =>12*36))
CROSS JOIN cte
)
Select * from range;
But if your CTE were changed to be like this:
With cte as (
Select
min(date1) as min_date,
max(date1) as max_date
FROM some_table
), range as (
SELECT
dateadd('month', row_number()over(ORDER BY null), (select min_date from cte)) as date_expand
FROM table(generator(rowcount =>2*4))
)
Select * from range;
this would work, as there is only one min_date value returned.
Or you could find the smallest of the min_dates, like:
WITH cte as (
Select
min(date1) as min_date,
max(date1) as max_date,
id,
city,
time_id
FROM some_table
GROUP BY 3,4,5
), range as (
SELECT
dateadd('month', row_number()over(ORDER BY null), (select min(min_date) from cte)) as date_expand
FROM table(generator(rowcount =>2*3))
)
Select * from range;
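Since the grouped CTE already computes max_date but never uses it, one refinement of the per-group variant above is to expand each group only up to its own max_date. A sketch over the same sample values; row_number() - 1 makes the range start at min_date itself, and the WHERE clause trims each group (the generator still produces 12*36 candidate months per group):
With some_table as (
SELECT * FROM VALUES
(1, 'new york', 10, '2020-01-01'::date),
(1, 'new york', 10, '2020-02-01'::date),
(2, 'christchurch', 20, '2021-01-01'::date)
v(id, city, time_id, date1)
), cte as (
Select min(date1) as min_date, max(date1) as max_date, id, city, time_id
FROM some_table
GROUP BY 3,4,5
), range as (
SELECT id, city, time_id, max_date,
dateadd('month', row_number() over (partition by id, city, time_id ORDER BY null) - 1, min_date) as date_expand
FROM table(generator(rowcount => 12*36))
CROSS JOIN cte
)
Select id, city, time_id, date_expand
from range
where date_expand <= max_date;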
I have a table with the number of trips taken and a station_id, and I want to return the 5 most recent trips per ID (a sample image of the table is below).
The query I made below aggregates the station IDs and the most recent trip, but I am having a difficult time returning the 5 most recent:
SELECT start_station_id, MAX(start_time)
FROM `bpd.shop.trips`
group by start_station_id, start_time
Trips:
https://imgur.com/Ebh9FeZ
Any help would be much appreciated, thanks!
You can use row_number():
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY start_station_id ORDER BY start_time DESC) as seqnum
FROM `bpd.shop.trips` t
) t
WHERE seqnum <= 5;
Below is for BigQuery Standard SQL
Option 1
#standardSQL
SELECT record.*
FROM (
SELECT ARRAY_AGG(t ORDER BY start_time DESC LIMIT 5) arr
FROM `bpd.shop.trips` t
GROUP BY start_station_id
), UNNEST(arr) record
Option 2
#standardSQL
SELECT * EXCEPT (pos) FROM (
SELECT *, ROW_NUMBER() OVER(win) AS pos
FROM `bpd.shop.trips`
WINDOW win AS (PARTITION BY start_station_id ORDER BY start_time DESC)
)
WHERE pos <= 5
I recommend Option 1 as the more scalable option: ROW_NUMBER() has to number every row in each partition, whereas ARRAY_AGG(... ORDER BY ... LIMIT 5) lets BigQuery discard all but the top five rows per group as it aggregates, which helps avoid resource errors on tables with very large groups.
I have a simple table that has lat, long, and time. Basically, I want the result of my query to give me something like this:
lat,long,hourwindow,count
I can't seem to figure out how to do this; I've tried so many things I can't keep them straight. Here's what I've got so far:
WITH all_lat_long_by_time AS (
SELECT
trunc(cast(lat AS NUMERIC), 4) AS lat,
trunc(cast(long AS NUMERIC), 4) AS long,
date_trunc('hour', time :: TIMESTAMP WITHOUT TIME ZONE) AS hourWindow
FROM my_table
),
unique_lat_long_by_time AS (
SELECT DISTINCT * FROM all_lat_long_by_time
),
all_with_counts AS (
-- what do I do here?
)
SELECT * FROM all_with_counts;
I think this is a pretty basic aggregation query:
SELECT date_trunc('hour', time :: TIMESTAMP WITHOUT TIME ZONE) AS hourWindow,
trunc(cast(lat AS NUMERIC), 4) AS lat,
trunc(cast(long AS NUMERIC), 4) AS long,
COUNT(*)
FROM my_table
GROUP BY hourWindow, trunc(cast(lat AS NUMERIC), 4), trunc(cast(long AS NUMERIC), 4)
ORDER BY hourWindow
If "count of rows by uniqueness" is meant to count distinct coordinates per hour (after truncating the numbers), count(DISTINCT (lat,long)) does the job:
SELECT date_trunc('hour', time::timestamp) AS hour_window
, count(DISTINCT (trunc( lat::numeric, 4)
, trunc(long::numeric, 4))) AS count_distinct_coordinates
FROM tbl
GROUP BY 1
ORDER BY 1;
Details are in the Postgres manual.
(lat,long) is a ROW value, short for ROW(lat,long).
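For instance, a quick check you can run in psql:
SELECT (1, 2) = ROW(1, 2) AS same;  -- true: both sides are the same row value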
But count(DISTINCT ...) is typically slow; a subquery should be faster for your case:
SELECT hour_window, count(*) AS count_distinct_coordinates
FROM (
SELECT date_trunc('hour', time::timestamp) AS hour_window
, trunc( lat::numeric, 4) AS lat
, trunc(long::numeric, 4) AS long
FROM tbl
GROUP BY 1, 2, 3
) sub
GROUP BY 1
ORDER BY 1;
Or:
SELECT hour_window, count(*) AS count_distinct_coordinates
FROM (
SELECT DISTINCT
date_trunc('hour', time::timestamp) AS hour_window
, trunc( lat::numeric, 4) AS lat
, trunc(long::numeric, 4) AS long
FROM tbl
) sub
GROUP BY 1
ORDER BY 1;
After the subquery folds duplicates, the outer SELECT can use a plain count(*).
How to calculate Session Duration in Firebase analytics raw data which is linked to BigQuery?
I have used the blog below to calculate users, using the flatten command for the events that are nested within each record, but I would like to know how to calculate sessions and session duration by country and time.
(I have many apps configured, but if you could help me with the SQL query for calculating sessions and session duration, it would be of immense help.)
Google blog on using Firebase and BigQuery
First you need to define a session - in the following query I'm going to break a session whenever a user is inactive for more than 20 minutes.
Now, to find all sessions with SQL you can use a trick described at https://blog.modeanalytics.com/finding-user-sessions-sql/.
The following query finds all sessions and their lengths:
#standardSQL
SELECT app_instance_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records, MAX(sess_id) OVER(PARTITION BY app_instance_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY app_instance_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(
previous IS null
OR (min_time-previous)>(20*60*1000*1000), # sessions broken by this inactivity
1, 0) session_start
#https://blog.modeanalytics.com/finding-user-sessions-sql/
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY app_instance_id ORDER BY max_time) previous
FROM (
SELECT user_dim.app_info.app_instance_id
, (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
, (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160601`
)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
With the new schema of Firebase in BigQuery, I found that the answer by @Maziar did not work for me, but I am not sure why.
Instead I have used the following to calculate it, where a session is defined as a user engaging with your app for a minimum of 10 seconds and where the session stops if a user doesn't engage with the app for 30 minutes.
It provides total number of sessions and the session length in minutes, and it is based on this query: https://modeanalytics.com/modeanalytics/reports/5e7d902f82de/queries/2cf4af47dba4
SELECT COUNT(*) AS sessions,
AVG(length) AS average_session_length
FROM (
SELECT global_session_id,
(MAX(event_timestamp) - MIN(event_timestamp))/(60 * 1000 * 1000) AS length
FROM (
SELECT user_pseudo_id,
event_timestamp,
SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
FROM (
SELECT *,
CASE WHEN event_timestamp - last_event >= (30*60*1000*1000)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM (
SELECT user_pseudo_id,
event_timestamp,
LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
FROM `dataset.events_2019*`
) last
) final
) session
GROUP BY 1
) agg
WHERE length >= (10/60)
As you know, Google has changed the schema of the Firebase datasets in BigQuery:
https://support.google.com/analytics/answer/7029846
Thanks to @Felipe's answer, here is the query adapted to the new format:
SELECT SUM(total_sessions) AS Total_Sessions, AVG(sess_length_seconds) AS Average_Session_Duration
FROM (
SELECT user_pseudo_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records,
MAX(sess_id) OVER(PARTITION BY user_pseudo_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(previous IS null OR (min_time-previous) > (20*60*1000*1000), 1, 0) session_start
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY user_pseudo_id ORDER BY max_time) previous
FROM (SELECT user_pseudo_id, MIN(event_timestamp) AS min_time, MAX(event_timestamp) AS max_time
FROM `dataset_name.table_name` GROUP BY user_pseudo_id)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
)
Note: change dataset_name and table_name based on your project info
With the recent change that includes ga_session_id on each event row in the BigQuery table, you can calculate the number of sessions and the average session length much more easily.
The value of ga_session_id remains the same for the whole session, so you don't need to define the session yourself.
Take the MIN and MAX of the event_timestamp column, grouping by user_pseudo_id, ga_session_id and event_date, to get the duration of each user's session on a given date.
WITH
UserSessions as (
SELECT
user_pseudo_id,
event_timestamp,
event_date,
(Select value.int_value from UNNEST(event_params) where key = "ga_session_id") as session_id,
event_name
FROM `projectname.dataset_name.events_*`
),
SessionDuration as (
SELECT
user_pseudo_id,
session_id,
COUNT(*) AS events,
TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration,
event_date
FROM
UserSessions
WHERE session_id is not null
GROUP BY
user_pseudo_id,
session_id,
event_date
)
Select count(session_id) as NumofSessions, avg(session_duration) as AverageSessionLength from SessionDuration
Finally, count the session_id values to get the total number of sessions, and average session_duration to get the average session length.
I have gathered data for certain events that happen within a video. I need to figure out the total time that any event occurred within that video, but I cannot double count periods where there are multiple events happening simultaneously. This image below demonstrates the situation.
In this scenario, there are 4 events which take up 7 seconds of the entire 10 second video. Simply summing the total time of each event incorrectly yields 3 + 2 + 3 + 2 = 10 out of 10 seconds. The table I'm working in has:
video_id, video_length, event_id, event_start, event_end
Does anyone know how I can write a query to produce the result I'm looking for?
This is called a gaps-and-islands problem. Basically, you need to find groups of overlapping records. You can do this by flagging each record that starts a new group, i.e. one that does not overlap any earlier event; the running sum of those flags then serves as the group id.
The following finds each "island" with the start and end time, assuming that two events don't start at the same time.
select video_id, min(event_start) as event_start, max(event_end) as event_end
from (select e.*,
sum(IsNotOverlap) over (partition by video_id order by event_start) as grp
from (select e.*,
(case when exists (select 1 from events e2 where e2.event_start < e.event_start and e2.event_end > e.event_start and e2.video_id = e.video_id)
then 0 else 1
end) as IsNotOverlap
from events e
) e
) e
group by video_id, grp;
You can use this as a subquery or CTE to get the total time for a given video, as sketched below.
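For example, a minimal sketch that wraps the query above in a CTE and sums the islands (assuming, as the query does, a table named events with numeric event_start/event_end values in seconds):
with islands as (
select video_id, min(event_start) as event_start, max(event_end) as event_end
from (select e.*,
             sum(IsNotOverlap) over (partition by video_id order by event_start) as grp
      from (select e.*,
                   (case when exists (select 1 from events e2 where e2.event_start < e.event_start and e2.event_end > e.event_start and e2.video_id = e.video_id)
                         then 0 else 1
                    end) as IsNotOverlap
            from events e
           ) e
     ) e
group by video_id, grp
)
select video_id, sum(event_end - event_start) as total_event_time
from islands
group by video_id;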
This works even if two events have the same start date, end date or even if one event is completely contained in another:
Oracle Setup:
CREATE TABLE videos ( video_id, video_length, event_id, event_start, event_end ) AS
SELECT 1, 10, 1, 1, 4 FROM DUAL UNION ALL
SELECT 1, 10, 2, 1, 3 FROM DUAL UNION ALL -- Same start date
SELECT 1, 10, 3, 2, 4 FROM DUAL UNION ALL -- Same end date
SELECT 1, 10, 4, 3, 6 FROM DUAL UNION ALL
SELECT 1, 10, 5, 7, 9 FROM DUAL UNION ALL
SELECT 1, 10, 6, 8, 8.5 FROM DUAL; -- Contained in previous event
Query:
SELECT video_id,
SUM( event_duration ) AS event_duration,
MAX( video_length ) AS video_length
FROM (
SELECT video_id,
video_length,
end_date
- LAST_VALUE( start_date ) IGNORE NULLS
OVER ( PARTITION BY video_id
ORDER BY ROWNUM ) AS event_duration
FROM (
SELECT video_id,
video_length,
CASE WHEN 1 = lvl
AND 1 = SUM( lvl ) OVER ( PARTITION BY video_id
ORDER BY event_date, lvl DESC, ROWNUM )
THEN event_date
END AS start_date,
CASE WHEN 0 = SUM( lvl ) OVER ( PARTITION BY video_id
ORDER BY event_date, lvl DESC, ROWNUM )
THEN event_date
END AS end_date
FROM videos
UNPIVOT ( event_date FOR lvl IN ( event_start AS 1, event_end AS -1 ) )
)
)
GROUP BY video_id;
Output:
VIDEO_ID EVENT_DURATION VIDEO_LENGTH
---------- -------------- ------------
1 7 10
Variant 1 (complicated):
(Always partition by video_id and order by event_start.) First compute the running MAX of event_end, then compare each event's start with that running max from the previous record: when the start is <= the previous running max, the intervals overlap. A running sum of the non-overlap flags then assigns a group to each chain of overlapping intervals, and finally we aggregate over those groups.
SELECT video_id, video_length, SUM (new_end - new_start) total_time
FROM ( SELECT video_id, video_length, MIN (event_start) new_start, MAX (new_end) new_end
FROM (SELECT b.*, SUM (counting) OVER (PARTITION BY video_id ORDER BY event_start) time_group
FROM (SELECT a.*, CASE WHEN LAG (new_end, 1) OVER (PARTITION BY video_id ORDER BY event_start) >= event_start THEN NULL ELSE 1 END counting
FROM (SELECT x.*, MAX (event_end) OVER (PARTITION BY video_id ORDER BY event_start) new_end
FROM videos x) a) b) c
GROUP BY video_id, video_length, time_group)
GROUP BY video_id, video_length
ORDER BY video_id
Variant 2: Get the start and end of overlapping periods (or of the same period), get only distinct values and sum the time:
SELECT video_id, SUM (new_end - new_start) total_time
FROM (SELECT DISTINCT a.video_id,
(SELECT MIN (event_start)
FROM videos b
WHERE ( (a.event_start BETWEEN b.event_start AND b.event_end) OR (a.event_end BETWEEN b.event_start AND b.event_end)) AND a.video_id = b.video_id)
new_start,
(SELECT MAX (event_end)
FROM videos b
WHERE ( (a.event_start BETWEEN b.event_start AND b.event_end) OR (a.event_end BETWEEN b.event_start AND b.event_end)) AND a.video_id = b.video_id)
new_end
FROM videos a)
GROUP BY video_id
Variant 3: the same as Variant 2, but modified to use a feature new in Oracle 12, LATERAL inline views:
SELECT video_id, SUM (new_end - new_start) total_time
FROM (SELECT DISTINCT a.video_id, b.new_start, b.new_end
FROM videos a,
LATERAL (SELECT MIN (event_start) new_start, MAX (event_end) new_end
FROM videos b
WHERE ( (a.event_start BETWEEN b.event_start AND b.event_end) OR (a.event_end BETWEEN b.event_start AND b.event_end)) AND a.video_id = b.video_id) b)
GROUP BY video_id
You can use a CROSS APPLY or OUTER APPLY join too; they give the same result, because the subquery always returns exactly one row. A sketch of the CROSS APPLY form follows.
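For instance, the CROSS APPLY form (Oracle 12c; the logic is unchanged from Variant 3):
SELECT video_id, SUM (new_end - new_start) total_time
FROM (SELECT DISTINCT a.video_id, b.new_start, b.new_end
FROM videos a
CROSS APPLY (SELECT MIN (event_start) new_start, MAX (event_end) new_end
FROM videos b
WHERE ( (a.event_start BETWEEN b.event_start AND b.event_end) OR (a.event_end BETWEEN b.event_start AND b.event_end)) AND a.video_id = b.video_id) b)
GROUP BY video_id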