Define user sessions (SQL)

I have an event table (user_id, timestamp). I need to write a query that defines user sessions: every user can have more than one session, and every session can have one or more events. A session is complete after 30 minutes of inactivity for the user.
The output table should have the following format: (user_id, start_session, end_session). I wrote part of the query, but I have no idea what to do next.
select
t.user_id,
t.ts start_session,
t.next_ts
from ( select
user_id,
ts,
DATEDIFF(SECOND, lag(ts, 1) OVER(partition by user_id order by ts), ts) next_ts
from
events_tabl ) t

You want a cumulative sum to identify the sessions and then aggregation:
select user_id, session_id, min(ts) as start_session, max(ts) as end_session
from (select e.*,
sum(case when prev_ts > dateadd(minute, -30, ts)
then 0 else 1
end) over (partition by user_id order by ts) as session_id
from (select e.*,
lag(ts) over (partition by user_id order by ts) as prev_ts
from events_tabl e
) e
) e
group by user_id, session_id;
Note that I changed the date/time logic from using datediff() to a direct comparison of the times. datediff() counts the number of "boundaries" between two times. So, there is 1 hour between 12:59 a.m. and 1:01 a.m. -- but zero hours between 1:01 a.m. and 1:59 a.m.
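To make the boundary-counting behaviour concrete, here is a small Python sketch that imitates it. `hour_boundaries` is a hypothetical helper written for this illustration, not a SQL Server function; it truncates both times to the hour before subtracting, which is one way to reproduce the behaviour described above. The date used is arbitrary.

```python
from datetime import datetime

def hour_boundaries(start, end):
    """Count the hour boundaries crossed between start and end (hypothetical helper)."""
    def to_hour(d):
        return d.replace(minute=0, second=0, microsecond=0)
    return int((to_hour(end) - to_hour(start)).total_seconds() // 3600)

def elapsed_minutes(start, end):
    """Actual elapsed time in minutes."""
    return (end - start).total_seconds() / 60

day = dict(year=2024, month=1, day=1)  # arbitrary date, for illustration only
crossing = (datetime(hour=0, minute=59, **day), datetime(hour=1, minute=1, **day))
same_hour = (datetime(hour=1, minute=1, **day), datetime(hour=1, minute=59, **day))

print(hour_boundaries(*crossing), elapsed_minutes(*crossing))    # 1 boundary, 2.0 minutes
print(hour_boundaries(*same_hour), elapsed_minutes(*same_hour))  # 0 boundaries, 58.0 minutes
```

Two minutes of real time count as one "hour", while 58 minutes count as zero, which is why comparing the date/time values directly is less surprising.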
Although handling the diffs at the second level produces similar results, you can run into occasions where you are working with seconds or milliseconds -- but the time spans are too long to fit into an integer, which causes overflow errors. It is just easier to work directly with the date/time values.
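The lag-plus-cumulative-sum idea can be tried end to end with a rough sketch in SQLite, driven from Python's built-in sqlite3 module. Several things here are assumptions made for the demo: ts is stored as integer epoch seconds (SQLite has no dateadd), the 30 minutes become 1800 seconds, the sample rows are invented, and the bundled SQLite must be 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: ts holds integer epoch seconds instead of a datetime.
conn.execute("CREATE TABLE events_tabl (user_id INTEGER, ts INTEGER)")
conn.executemany(
    "INSERT INTO events_tabl VALUES (?, ?)",
    [(1, 0), (1, 600), (1, 3000), (1, 3100), (2, 100)],  # invented sample data
)

SESSIONS_SQL = """
SELECT user_id, MIN(ts) AS start_session, MAX(ts) AS end_session
FROM (
    -- running count of session starts = session number per user
    SELECT user_id, ts,
           SUM(CASE WHEN prev_ts > ts - 1800 THEN 0 ELSE 1 END)
               OVER (PARTITION BY user_id ORDER BY ts) AS session_id
    FROM (
        -- previous event of the same user (NULL for the first one)
        SELECT user_id, ts,
               LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS prev_ts
        FROM events_tabl
    )
)
GROUP BY user_id, session_id
ORDER BY user_id, start_session
"""

sessions = conn.execute(SESSIONS_SQL).fetchall()
print(sessions)  # [(1, 0, 600), (1, 3000, 3100), (2, 100, 100)]
```

The first event of each user has a NULL prev_ts, so the comparison is not true and the CASE falls through to 1, which opens that user's first session.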

Related

get count of users having at least one transaction made within 7 days from the previous transaction

I am trying to get the count of users who made at least one transaction within 7 days of their previous transaction.
What I've tried so far is:
WITH criteria_1 AS (
SELECT
fullvisitorid,
COUNT(hits.TRANSACTION.transactionid) AS number_of_transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS hits
GROUP BY
fullvisitorid
HAVING
COUNT( hits.TRANSACTION.transactionid) > 1
ORDER BY
2 DESC)
SELECT
COUNT(*) AS number_of_users_matching_crit1
FROM
criteria_1
But I feel like I'm missing something here. How could this be improved?
This is the implementation of @HimanshuAgrawal's comment on comparing the current and lag timestamps to get the transactions made within 7 days of the previous transaction. In this query I used a wildcard on the table (ga_sessions_20170*) to get data on multiple dates.
WITH
criteria_1 AS (
SELECT
fullvisitorid as visitor_id,
hits.TRANSACTION.transactionid as transaction_id,
date as visit_date,
LAG(date) OVER (PARTITION BY fullvisitorid ORDER BY date ASC) AS preceding_date
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170*` AS outside,
UNNEST(hits) AS hits
WHERE
hits.TRANSACTION.transactionid IS NOT NULL)
SELECT
visitor_id,
transaction_id,
visit_date ,
preceding_date
FROM
criteria_1
where preceding_date is not null
and visit_date != preceding_date
and safe.date_diff(cast(visit_date as date format 'YYYYMMDD')
, cast(preceding_date as date format 'YYYYMMDD'), DAY) <= 7

How to skip gaps when using SQL GROUP BY to compute duration

My table has ACTIONS with USER and date/time of each action. A group of Actions by the same user is a session. I want to measure the duration of each session. However if there is a gap larger than 5 minutes without any action, then I want to start a new session.
I am using this SQL
Select USER,
TRUNC((Max(ACTION_DATE) - Min(ACTION_DATE))*24*60,2) as Minutes
from USER_ACTION_DATA
Group by USER,TRUNC(ACTION_DATE,'J') ;
And it works pretty well if the user only has one session per day. But if the data is
USER ACTION_DATE
---------------------
john 2021-05-24 11:30:22
john 2021-05-24 11:32:12
john 2021-05-24 11:32:44
john 2021-05-24 11:36:08
john 2021-05-24 14:20:02
john 2021-05-24 14:23:52
it will show a single session with 173 minutes. But that is wrong. It should be two sessions with 6 and 3 minutes (because the gap between the 4th and 5th record is more than 5 minutes). Is this possible with SQL, or do I have to do the grouping with a real programming language?
If you're on a recent version of Oracle you can use match_recognize:
select user_name,
start_date,
end_date,
trunc((end_date - start_date) * 24 * 60, 2) as minutes
from user_action_data
match_recognize (
partition by user_name
order by action_date
measures
first(action_date) as start_date,
last(action_date) as end_date
pattern (A B*)
define B as action_date <= prev(action_date) + interval '5' minute
);
USER_NAME  START_DATE           END_DATE             MINUTES
-----------------------------------------------------------
john       2021-05-24 11:30:22  2021-05-24 11:36:08  5.76
john       2021-05-24 14:20:02  2021-05-24 14:23:52  3.83
USER is a reserved word (and a function) so I've changed the column name to USER_NAME to make it valid.
Because it's only looking at the interval it will allow a session to span midnight, rather than restricting to sessions within a day, as you are doing by truncating with J. I'm assuming that's a good thing, of course. If it isn't then you can change it to only look within the same day:
define B as trunc(action_date) = trunc(prev(action_date))
and action_date <= prev(action_date) + interval '5' minute
db<>fiddle with some additional sample data to go into the next day.
You can assign sessions using window functions. Check when the previous action occurred and flag a session start when the gap is too large:
select uad.*,
sum(case when prev_action_date > action_date - interval '5' minute then 0 else 1 end) over
(partition by user order by action_date) as session_id
from (select uad.*,
lag(action_date) over (partition by user order by action_date) as prev_action_date
from USER_ACTION_DATA uad
) uad;
You can then aggregate this if you like:
select user, min(action_date), max(action_date),
( max(action_date) - min(action_date) ) * 24*60*60
from (select uad.*,
sum(case when prev_action_date > action_date - interval '5' minute then 0 else 1 end) over
(partition by user order by action_date) as session_id
from (select uad.*,
lag(action_date) over (partition by user order by action_date) as prev_action_date
from USER_ACTION_DATA uad
) uad
) uad
group by user, session_id;
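As a cross-check of the logic (not a replacement for the SQL), the same "previous row plus running session" idea can be sketched in plain Python on the sample data from the question. `sessionize` and `minutes_truncated` are hypothetical helper names; the strict `<` comparison mirrors the `prev_action_date > action_date - interval '5' minute` condition used in the window-function query.

```python
from datetime import datetime, timedelta

def sessionize(rows, gap=timedelta(minutes=5)):
    """rows: (user_name, action_date) pairs; returns (user_name, start, end) per session."""
    sessions = []
    prev_user = prev_ts = None
    for user, ts in sorted(rows):
        if user == prev_user and ts - prev_ts < gap:
            sessions[-1][2] = ts             # same session: extend its end
        else:
            sessions.append([user, ts, ts])  # first row for the user, or gap too large
        prev_user, prev_ts = user, ts
    return [tuple(s) for s in sessions]

def minutes_truncated(start, end):
    """Duration in minutes, truncated to two decimals like TRUNC(..., 2)."""
    return int((end - start).total_seconds() * 100 // 60) / 100

fmt = "%Y-%m-%d %H:%M:%S"
rows = [("john", datetime.strptime(s, fmt)) for s in (
    "2021-05-24 11:30:22", "2021-05-24 11:32:12", "2021-05-24 11:32:44",
    "2021-05-24 11:36:08", "2021-05-24 14:20:02", "2021-05-24 14:23:52",
)]

sessions = sessionize(rows)
minutes = [minutes_truncated(start, end) for _, start, end in sessions]
print(minutes)  # [5.76, 3.83]
```

The two durations match the two sessions expected in the question, rather than a single 173-minute session.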

how to find consecutive user login across week

I'm fairly new to SQL, and maybe the complexity level of this report is above my pay grade.
I need help to figure out the list of users who log in to the app consecutively every week in the chosen time period (this logic eventually needs to be extended to a month, quarter and year, but a week is good for now).
Table structure for ref
events: User_id int, login_date timestamp
The events table can have one or more entries per user, since a user can log in to the app multiple times. To shed some light, if we focus on Jan 2020 - Mar 2020, then I need the following in the output:
user_id of each user who logged into the app at least once every week from 2020Wk1 to 2020Wk14
the week they logged in
number of times they logged in that week
I'm also okay if the output of the query is just the user_id. The thing is I'm unable to make sense out of the output that I'm seeing on my end after trying the following SQL code, perhaps working on this problem for so long might be the reason for that!
SQL code tried so far:
SELECT DISTINCT user_id
,extract('year' FROM timestamp)||'Wk'|| extract('week' FROM timestamp)
,lead(extract('week' FROM timestamp)) over (partition by user_id, extract('week' FROM timestamp) order by extract('week' FROM timestamp))
FROM events
WHERE user_id = 'Anything that u wish to enter'
You can get the summary you want as:
select user_id, date_trunc('week', timestamp) as week, count(*)
from events
group by user_id, week;
But the filtering is trickier. It is better to go with dates rather than week numbers:
select user_id, date_trunc('week', timestamp) as week, count(*) as cnt,
count(*) over (partition by user_id) as num_weeks
from events
where timestamp >= ? and timestamp < ?
group by user_id, week;
Then you can use a subquery:
select uw.*
from (select user_id, date_trunc('week', timestamp) as week, count(*) as cnt,
count(*) over (partition by user_id) as num_weeks
from events
where timestamp >= ? and timestamp < ?
group by user_id, week
) uw
where num_weeks = ? -- 14 in your example
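The filtering step can be illustrated outside the database with a short Python sketch. It assumes weeks start on Monday, which is how date_trunc('week', ...) behaves in Postgres; `week_start` and `weekly_regulars` are hypothetical helper names and the login rows are invented for the example.

```python
from collections import Counter
from datetime import date, timedelta

def week_start(d):
    """Monday of the week containing d."""
    return d - timedelta(days=d.weekday())

def weekly_regulars(logins, start, end, required_weeks):
    """Return (user_id, week, cnt) rows for users active in exactly required_weeks weeks."""
    # cnt per (user, week): the GROUP BY user_id, week step
    counts = Counter((u, week_start(d)) for u, d in logins if start <= d < end)
    # number of distinct weeks per user: the count(*) over (partition by user_id) step
    weeks_per_user = Counter(u for u, _ in counts)
    return sorted(
        (u, w, n) for (u, w), n in counts.items()
        if weeks_per_user[u] == required_weeks
    )

logins = [
    (1, date(2020, 1, 6)), (1, date(2020, 1, 7)), (1, date(2020, 1, 14)),
    (2, date(2020, 1, 8)),  # user 2 skips the second week
]
result = weekly_regulars(logins, date(2020, 1, 6), date(2020, 1, 20), required_weeks=2)
print(result)
```

Only user 1 is returned, with two logins in the week of 2020-01-06 and one in the week of 2020-01-13; user 2 is dropped because it appears in one of the two weeks.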

How to calculate Session and Session duration in Firebase Analytics raw data?

How to calculate Session Duration in Firebase analytics raw data which is linked to BigQuery?
I have used the following blog to calculate the users, using the flatten command for the events that are nested within each record, but I would like to know how to proceed with calculating sessions and session duration by country and time.
(I have many apps configured, but if you could help me with the SQL query for calculating sessions and session duration, it would be of immense help.)
Google Blog on using Firebase and big query
First you need to define a session - in the following query I'm going to break a session whenever a user is inactive for more than 20 minutes.
Now, to find all sessions with SQL you can use a trick described at https://blog.modeanalytics.com/finding-user-sessions-sql/.
The following query finds all sessions and their lengths:
#standardSQL
SELECT app_instance_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records, MAX(sess_id) OVER(PARTITION BY app_instance_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY app_instance_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(
previous IS null
OR (min_time-previous)>(20*60*1000*1000), # sessions broken by this inactivity
1, 0) session_start
#https://blog.modeanalytics.com/finding-user-sessions-sql/
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY app_instance_id ORDER BY max_time) previous
FROM (
SELECT user_dim.app_info.app_instance_id
, (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
, (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time
FROM `firebase-analytics-sample-data.ios_dataset.app_events_20160601`
)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
With the new schema of Firebase in BigQuery, I found that the answer by @Maziar did not work for me, but I am not sure why.
Instead I have used the following to calculate it, where a session is defined as a user engaging with your app for a minimum of 10 seconds and where the session stops if a user doesn't engage with the app for 30 minutes.
It provides total number of sessions and the session length in minutes, and it is based on this query: https://modeanalytics.com/modeanalytics/reports/5e7d902f82de/queries/2cf4af47dba4
SELECT COUNT(*) AS sessions,
AVG(length) AS average_session_length
FROM (
SELECT global_session_id,
(MAX(event_timestamp) - MIN(event_timestamp))/(60 * 1000 * 1000) AS length
FROM (
SELECT user_pseudo_id,
event_timestamp,
SUM(is_new_session) OVER (ORDER BY user_pseudo_id, event_timestamp) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS user_session_id
FROM (
SELECT *,
CASE WHEN event_timestamp - last_event >= (30*60*1000*1000)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM (
SELECT user_pseudo_id,
event_timestamp,
LAG(event_timestamp,1) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS last_event
FROM `dataset.events_2019*`
) last
) final
) session
GROUP BY 1
) agg
WHERE length >= (10/60)
As you know, Google has changed the schema of BigQuery firebase databases:
https://support.google.com/analytics/answer/7029846
Thanks to @Felipe's answer, the query adapted to the new schema looks as follows:
# the subquery returns one row per (user, session), so COUNT(*) is the number of sessions
SELECT COUNT(*) AS Total_Sessions, AVG(sess_length_seconds) AS Average_Session_Duration
FROM (
SELECT user_pseudo_id, sess_id, MIN(min_time) sess_start, MAX(max_time) sess_end, COUNT(*) records,
MAX(sess_id) OVER(PARTITION BY user_pseudo_id) total_sessions,
(ROUND((MAX(max_time)-MIN(min_time))/(1000*1000),1)) sess_length_seconds
FROM (
SELECT *, SUM(session_start) OVER(PARTITION BY user_pseudo_id ORDER BY min_time) sess_id
FROM (
SELECT *, IF(previous IS null OR (min_time-previous) > (20*60*1000*1000), 1, 0) session_start
FROM (
SELECT *, LAG(max_time, 1) OVER(PARTITION BY user_pseudo_id ORDER BY max_time) previous
# each row of the new schema is a single event, so keep one row per event;
# grouping by user here would collapse every user into a single session
FROM (SELECT user_pseudo_id, event_timestamp AS min_time, event_timestamp AS max_time
FROM `dataset_name.table_name`)
)
)
)
GROUP BY 1, 2
ORDER BY 1, 2
)
Note: change dataset_name and table_name based on your project info
With the recent change in which we have ga_session_id with each event row in the BigQuery table you can calculate number of sessions and average session length much more easily.
The value of the ga_session_id would remain same for the whole session, so you don't need to define the session separately.
You take the Min and the Max value of the event_timestamp column by grouping the result by user_pseudo_id , ga_session_id and event_date so that you get the session duration of the particular session of any user on any given date.
WITH
UserSessions as (
SELECT
user_pseudo_id,
event_timestamp,
event_date,
(Select value.int_value from UNNEST(event_params) where key = "ga_session_id") as session_id,
event_name
FROM `projectname.dataset_name.events_*`
),
SessionDuration as (
SELECT
user_pseudo_id,
session_id,
COUNT(*) AS events,
TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration
,event_date
FROM
UserSessions
WHERE session_id is not null
GROUP BY
user_pseudo_id,
session_id
,event_date
)
Select count(session_id) as NumofSessions,avg(session_duration) as AverageSessionLength from SessionDuration
Finally, count the session_id values to get the total number of sessions, and average the session durations to get the average session length.
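The grouping step is simple enough to sketch in a few lines of Python, which may help when checking the query against a small export. `session_stats` is a hypothetical helper, the event tuples are invented, and the duration is computed as a float number of seconds rather than reproducing the exact rounding of TIMESTAMP_DIFF.

```python
def session_stats(events):
    """events: (user_pseudo_id, session_id, event_date, event_timestamp_micros) tuples."""
    bounds = {}
    for user, session_id, event_date, ts in events:
        if session_id is None:  # mirrors WHERE session_id is not null
            continue
        key = (user, session_id, event_date)
        first, last = bounds.get(key, (ts, ts))
        bounds[key] = (min(first, ts), max(last, ts))
    durations = [(last - first) / 1_000_000 for first, last in bounds.values()]
    if not durations:
        return 0, None
    return len(durations), sum(durations) / len(durations)

events = [
    ("u1", 1, "20240101", 0),
    ("u1", 1, "20240101", 30_000_000),
    ("u2", 7, "20240101", 5_000_000),
    ("u2", 7, "20240101", 15_000_000),
    ("u2", None, "20240101", 99_000_000),  # no ga_session_id, ignored
]
num_sessions, avg_length = session_stats(events)
print(num_sessions, avg_length)  # 2 20.0
```

Because event_date is part of the key, a session that runs past midnight is counted as two sessions here, exactly as in the query that groups by event_date.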

Microsoft SQL server count distinct every 30 minutes

We have an activity database that records user interaction to a website, storing a log that includes values such as [UserId] and [LogDate] e.g.
UserId|LogDate
123 |2017-01-01 11:17:35.190
I am trying to find out the count of distinct user sessions over time.
This would be easy enough by counting the distinct users:
SELECT COUNT(DISTINCT UserId) FROM ActivityDatabase.dbo.Logs
However, I need to count a user multiple times if they have a log more than 30 minutes from the previous log as this is then classed as a new session.
A session is defined as having a log in a 30 minute timeframe. For example:
If a user creates a log at 13.30, the value for distinct user sessions over time would be 1.
If the user creates another log at 13.40, the count should still be 1 as it is within 30 minutes of the previous log.
If the user creates another log at 14.20, the count should then be 2 as this is more than 30 minutes after the previous log.
Is this possible in SQL? I would need a way of checking every log for a user against the previous user log, and if the time difference between these is more than 30 minutes, it should count as a unique session.
The output of the SQL should be a number rather than broken down by a time period.
Thank you.
Sessionizing is a bit tricky. Let me show you how to do that. Perhaps this will solve your problem:
select userid, min(logdate) as session_start,
dateadd(minute, 30, max(logdate)) as session_end,
-- row_number() needs an order by in SQL Server
row_number() over (order by userid, min(logdate)) as session_id
from (select l.*,
sum(case when logdate < dateadd(minute, 30, prev_logdate)
then 0 else 1
end) over (partition by userid order by logdate
) as grp
from (select l.*,
lag(logdate) over (partition by userid order by logdate) as prev_logdate
from ActivityDatabase.dbo.Logs l
) l
) l
group by userid, grp;
If you want the number of unique users at a given point in time, then:
with s as (
select userid, min(logdate) as session_start,
dateadd(minute, 30, max(logdate)) as session_end,
-- row_number() needs an order by in SQL Server
row_number() over (order by userid, min(logdate)) as session_id
from (select l.*,
sum(case when logdate < dateadd(minute, 30, prev_logdate)
then 0 else 1
end) over (partition by userid order by logdate
) as grp
from (select l.*,
lag(logdate) over (partition by userid order by logdate) as prev_logdate
from ActivityDatabase.dbo.Logs l
) l
) l
group by userid, grp
)
select count(*)
from s
where @datetime between session_start and session_end;
A more brute force alternative for a given time is:
select count(distinct userid)
from ActivityDatabase.dbo.Logs l
where @datetime between logdate and dateadd(minute, 30, logdate);
If you are using sql server 2012 or greater, I would use the lag function to find the previous row and then you can compare the two datetimes to see if the difference is greater than 30 mins
select
userId,
LogDate,
LAG(LogDate, 1) OVER (PARTITION BY userId ORDER BY LogDate) AS PreviousLogDate
from logTbl
You can then add datediff and a case statement to flag a new login where the difference is greater than your threshold.
If no previous row is found, then the lag function will return null.
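A rough sketch of the flagging step, run in SQLite through Python's sqlite3 module so it can be tried without a SQL Server instance. The assumptions are loud ones: LogDate is stored as integer seconds (so the 30-minute threshold becomes 1800 and no datediff is needed), the rows are the 13.30 / 13.40 / 14.20 example from the question expressed as seconds since midnight, and the bundled SQLite must be 3.25 or newer for LAG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: LogDate is an integer number of seconds, not a datetime.
conn.execute("CREATE TABLE logTbl (UserId INTEGER, LogDate INTEGER)")
conn.executemany(
    "INSERT INTO logTbl VALUES (?, ?)",
    [(123, 48600), (123, 49200), (123, 51600)],  # 13:30, 13:40, 14:20
)

FLAG_SQL = """
SELECT UserId, LogDate,
       CASE WHEN PreviousLogDate IS NULL            -- first log of the user
              OR LogDate - PreviousLogDate > 1800   -- more than 30 minutes later
            THEN 1 ELSE 0 END AS IsNewSession
FROM (
    SELECT UserId, LogDate,
           LAG(LogDate, 1) OVER (PARTITION BY UserId ORDER BY LogDate) AS PreviousLogDate
    FROM logTbl
)
ORDER BY UserId, LogDate
"""

flagged = conn.execute(FLAG_SQL).fetchall()
session_count = sum(flag for _, _, flag in flagged)
print(flagged)        # [(123, 48600, 1), (123, 49200, 0), (123, 51600, 1)]
print(session_count)  # 2
```

Summing the flags gives the number of sessions, which is the single number the question asks for.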
If you play around with the definition you're trying to use, it becomes a lot easier to write the SQL.
What we want to identify are "starting logs" - logs that mark the start of a session. We don't want to identify any other logs.
How do we define a "starting log"? It's a log that doesn't have another log within 30 minutes before it.
SELECT COUNT(*)
FROM ActivityDatabase.dbo.Logs l1
WHERE NOT EXISTS (
SELECT * FROM ActivityDatabase.dbo.Logs l2
WHERE l1.UserId = l2.UserId AND
l2.LogDate < l1.LogDate AND
l2.LogDate >= DATEADD(minute,-30,l1.LogDate)
)
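The "starting log" definition can be checked on the example from the question with a small SQLite sketch run from Python. As an assumption for the demo, LogDate is stored as integer seconds since midnight, so DATEADD(minute,-30,...) becomes a subtraction of 1800; the table name Logs and the extra user 456 are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: LogDate is an integer number of seconds, not a datetime.
conn.execute("CREATE TABLE Logs (UserId INTEGER, LogDate INTEGER)")
conn.executemany(
    "INSERT INTO Logs VALUES (?, ?)",
    [(123, 48600), (123, 49200), (123, 51600),  # 13:30, 13:40, 14:20
     (456, 49000)],                             # a second, invented user
)

STARTING_LOGS_SQL = """
SELECT COUNT(*)
FROM Logs l1
WHERE NOT EXISTS (
    SELECT 1 FROM Logs l2
    WHERE l1.UserId = l2.UserId
      AND l2.LogDate < l1.LogDate            -- an earlier log of the same user
      AND l2.LogDate >= l1.LogDate - 1800    -- within the preceding 30 minutes
)
"""

(starting_logs,) = conn.execute(STARTING_LOGS_SQL).fetchone()
print(starting_logs)  # 3: two sessions for user 123, one for user 456
```

The log at 13:40 is excluded because 13:30 falls inside its preceding 30 minutes, while the log at 14:20 has no such predecessor and therefore starts a new session.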