Finding total session time of a user in postgres - sql

I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time | credential_id
Joe | 1 | 2021-11-01 09:00:00 | 44
Joe | 2 | 2021-11-01 10:00:00 | 44
Jeff | 1 | 2021-11-01 11:00:00 | 45
Jeff | 2 | 2021-11-01 12:00:00 | 45
Joe | 1 | 2021-11-01 12:00:00 | 46
Joe | 2 | 2021-11-01 12:30:00 | 46
Joe | 1 | 2021-12-06 14:30:00 | 47
Joe | 2 | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
Joe | 2021-11 | 1:30
Jeff | 2021-11 | 1:00
Joe | 2021-12 | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.

Solution 1 partially working
Not sure that window functions will help you in your case, but aggregate functions will:
WITH list AS
(
  SELECT username
       , date_trunc('month', time) AS year_month
       , max(time ORDER BY time) - min(time ORDER BY time) AS session_duration
  FROM your_table
  GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
     , to_char(year_month, 'YYYY-MM') AS year_month
     , sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month
The first part of the query aggregates the login/logout times for the same username and credential_id; the second part sums, per username and year_month, the difference between the logout and login times. This query works well as long as the login and logout times fall in the same month, but it fails when they don't.
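For instance (a minimal sketch with made-up rows; credential_id 48 is hypothetical), a session that crosses a month boundary gets split into two groups, each holding only one of the two events:
WITH your_table (username, auth_event_type, time, credential_id) AS
(
  VALUES ('Joe', 1, timestamp '2021-11-30 23:00:00', 48)
       , ('Joe', 2, timestamp '2021-12-01 01:00:00', 48)
)
SELECT username
     , date_trunc('month', time) AS year_month
     , max(time) - min(time) AS session_duration
FROM your_table
GROUP BY username, date_trunc('month', time), credential_id
-- returns two rows, one per month, each with session_duration = 00:00:00,
-- instead of one hour attributed to November and one hour to December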
Solution 2 fully working
In order to calculate the total_time per username and per month whatever the login and logout times are, we can use a time range approach which intersects the session ranges [login_time, logout_time) with the monthly ranges [monthly_start_time, monthly_end_time):
WITH monthly_range AS
(
  SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
       , tsrange(m.month_start_date, m.month_start_date + interval '1 month') AS monthly_range
  FROM
  ( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
    FROM your_table
  ) AS m
), session_range AS
(
  SELECT username
       , tsrange(min(time ORDER BY auth_event_type), max(time ORDER BY auth_event_type)) AS session_range
  FROM your_table
  GROUP BY username, credential_id
)
SELECT s.username
     , m.month
     , sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
        ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month
see the result in dbfiddle
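To see what the range operators do in isolation, here is a minimal sketch with made-up literals (a session that crosses the month boundary, intersected with the November range):
SELECT sess && mon AS overlaps                                 -- true
     , sess * mon AS november_part                             -- ["2021-11-30 23:00:00","2021-12-01 00:00:00")
     , upper(sess * mon) - lower(sess * mon) AS time_in_month  -- 01:00:00
FROM (SELECT tsrange('2021-11-30 23:00', '2021-12-01 01:00') AS sess
           , tsrange('2021-11-01 00:00', '2021-12-01 00:00') AS mon) AS x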

Use the window function lag(), partitioning by credential_id and ordering by time, e.g.
WITH j AS (
  SELECT username, time, age(time, LAG(time) OVER w) AS age
  FROM t
  WINDOW w AS (PARTITION BY credential_id ORDER BY time
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time, 'yyyy-mm'), sum(age)
FROM j
GROUP BY 1, 2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is pretty much optional in this case, but it is considered a good practice to keep window functions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
Demo: db<>fiddle

Related

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who checked in on their 2nd day, on their 3rd day and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum((ci.date = ci.min_date)::int)::numeric / u.num_users as day_0,
       sum((ci.date = ci.min_date + interval '1 day')::int)::numeric / u.num_users as day_1,
       sum((ci.date = ci.min_date + interval '2 day')::int)::numeric / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as date,
             min(min(ci.date::date)) over (partition by ci.user_id) as min_date
      from checkins ci
      group by ci.user_id, ci.date::date
     ) ci
     on ci.user_id = u.id   -- assumes the user table's key column is named id
group by u.num_users;
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per user per date.
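As a side note, the (condition)::int cast simply turns each matching row into a 1 so that sum() counts matches; a tiny standalone sketch with made-up values:
SELECT sum((val = 1)::int) AS matches   -- true casts to 1, false to 0; returns 2
FROM (VALUES (1), (2), (1)) AS t(val);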

SQL: last 7 Days Calculations based on date

The table below consists of the count of users who signed up on each day. I am looking to populate the Total_Users column.
Logic: it contains the user count between Signupdate-14 and Signupdate-7.
For example: 15/01/2020 contains the user count between 1/1/2020 and 1/7/2020.
Signupdate | Users | Total_Users (b/w D-14 & D-7)
1/1/2020   | 20    | 60
2/1/2020   | 30    | 80
3/1/2020   | 10    | 90
...        | ...   | ...
15/1/2020  | 30    | 120
16/1/2020  | 10    | 40
SELECT Signupdate
     , Users
     , SUM(CASE
             WHEN Signupdate BETWEEN to_date(Signupdate, 'DDMMYYYY') - 14 AND to_date(Signupdate, 'DDMMYYYY') - 7
             THEN Users
           END) AS 'Total_Users'
FROM
This is assuming that the users column is of numeric type
Assuming you have a row for each date, you would use window functions with a windowing clause. I'm not sure if Redshift supports window frames with intervals, but this is the basic logic:
select t.*,
       sum(users) over (order by signupdate
                        range between interval '14' day preceding
                                  and interval '7' day preceding
                       ) as total_users
from t;
If not, you can turn the date into a number and use that:
select t.*,
       sum(users) over (order by diff
                        rows between 14 preceding and 7 preceding
                       ) as total_users
from (select t.*,
             datediff(day, date '2000-01-01', signupdate) as diff
      from t
     ) t
I am guessing you want a complete week; note, though, that D-14 through D-7 inclusive spans 8 days.
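If neither frame syntax is supported, a correlated subquery can express the same D-14 to D-7 logic. This is only a rough sketch in Postgres-style syntax, with the table name t and the column names assumed from the query above:
select t.signupdate
     , t.users
     , (select sum(t2.users)
        from t as t2
        where t2.signupdate between t.signupdate - interval '14 day'
                                and t.signupdate - interval '7 day'
       ) as total_users
from t;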

Calculate previous login for each row grouped by day relative to current row

Goal: I would like to gather logins for each user grouped by day.
Problem: I am struggling with the function to calculate the last column, which is the previous login relative to the current row's login (somewhat like a lag function, but I am not sure how to use it). The issue is that I only need to show logins from the last three months, so how would the fifth observation of the days_last_login column in the following table be calculated if I put a WHERE condition on the last three months?
Desired Output:
+----+---------------------+-----------------+
| id | login | days_last_login |
+----+---------------------+-----------------+
| 1 | 2018-12-10 05:00:00 | 5 |
| 1 | 2018-12-07 05:30:00 | 3 |
| 1 | 2018-12-01 05:30:00 | 6 |
| 2 | 2019-08-01 05:30:00 | 7 |
| 2 | 2019-01-01 05:30:00 | 365 |
+----+---------------------+-----------------+
Current Query:
SELECT id
,YEAR(login) as yr, MONTH(login) as mm, DAY(login) as dd
,CAST(login AS DATE) as logins
,FUNCTION FOR DAYS_LAST_LOGIN
FROM database.table
WHERE login > DATEADD(month,-3,getdate())
GROUP BY YEAR(login), MONTH(login), DAY(login), id
ORDER BY id, yr desc, mm desc, dd desc
Note: I omitted the yr, month and day columns from the table to make it clearer.
From what I can tell, the logic is the number of days from a given login date to the next, presumably with the most recent date measured up to the current date.
That suggests a query like this:
SELECT id, CONVERT(date, login) as dte,
       DATEDIFF(day,
                CONVERT(date, login),
                LEAD(MAX(login), 1, GETDATE()) OVER (PARTITION BY id ORDER BY CONVERT(date, login))
               ) as DAYS_LAST_LOGIN
FROM database.table
WHERE login > DATEADD(month, -3, getdate())
GROUP BY id, CONVERT(date, login)
ORDER BY id, CONVERT(date, login) DESC;
I removed the date parts because I don't find them useful, but you can of course include them.

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user saved in BigQuery. A user is active by opening a mobile app triggering the event session_start. Example of the unnested dataset.
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
Check out https://stackoverflow.com/a/49866033/132438, where the question specifically asks about counting uniques in a rolling window: it turns out to be a very slow operation given how much memory it requires.
The solution for this when you want a rolling count of uniques: Go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC
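Note that the same query produces a daily rather than weekly series if the generator step is changed to one day, e.g. UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 DAY)) AS day.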

How to get the row_numbers in sql where condition is a specific date difference?

I am stuck on a portion of my query that should extract the row_numbers with a date difference of at least three months. In the example below I would like to extract row_number 1 (always the first one), then 5 and 6: after row_number 1, take the next row_number with a date_diff > 3 months, and after each extracted row_number apply the condition again until there are none left. Is there any function or way within SQL that allows for such a condition?
table_name: users
id row_number User date
---|----------|-------|---------------------|
1 |1 | Usr1 | 2017-10-01 12:35:00 |
2 |2 | Usr1 | 2017-10-01 12:35:00 |
3 |3 | Usr1 | 2017-12-03 07:47:00 |
4 |4 | Usr1 | 2018-01-10 07:47:00 |
5 |5 | Usr1 | 2018-02-10 07:47:00 |
6 |6 | Usr1 | 2018-04-10 07:47:00 |
You can use the lag() function to calculate the difference:
select *
from (
select id, row_number, "User", date,
date - lag(date) over (order by id) as diff
from users
) t
where diff is null -- first row
or diff > interval '3 month';
I'm not sure you want to compare intervals as months; the difference between two timestamps is normally represented as a number of days (and hours), not months.
So, I would phrase this as:
select u.*
from (select u.*, lag(date) over (order by id) as prev_date
from users u
) u
where prev_date is null or prev_date < date - interval '3 month';
If the or bothers you, you can remove it by using default values in the lag():
select u.*
from (select u.*, lag(date, 1, date - interval '100 year') over (order by id) as prev_date
from users u
) u
where prev_date < date - interval '3 month';