SQL query to GROUP BY with a deduplicated column - sql

I have the following table
create table events (
  event_id   int,        -- column types assumed for illustration; the original omitted them
  event_name text,
  datetime   timestamp,
  email      text)
And I want to display the events per week, and the events per week deduplicated by email, in a single query.
While doing:
select date_trunc('week', datetime) wdt, event_name, count(1)
from events
group by wdt, event_name;
         wdt         | event_name | count
---------------------+------------+-------
 2014-10-27 00:00:00 | deliver    |    32
 2014-11-17 00:00:00 | open       |    30
 2014-10-20 00:00:00 | deliver    |    25
 2014-10-20 00:00:00 | click      |    19
 2014-10-27 00:00:00 | click      |    29
I can get the plain count, but I don't know how to produce the deduplicated count column (two clicks from the same email in the same week should count as one, not two).

Just specify which column to count only distinct values for, like this:
select date_trunc('week', datetime) wdt, event_name, count(distinct email)
from events
group by wdt, event_name;
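Since the goal is both numbers in a single query, the two aggregates can simply sit side by side; a minimal sketch:

select date_trunc('week', datetime) wdt,
       event_name,
       count(*) as total_events,              -- events per week
       count(distinct email) as unique_emails -- events per week, deduplicated by email
from events
group by wdt, event_name
order by wdt, event_name;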

I think you just need a distinct count on the deduplication column, as you pointed out. Note that count(distinct 1) would always return 1 per group, since the literal 1 has only one distinct value; the column has to be named explicitly:
select date_trunc('week', datetime) wdt, event_name, count(distinct email)
from events
group by wdt, event_name;
However, without the raw data and some samples I'm not sure how to confirm, as I can't see why counts of 32 and 29 would occur for the same date (10/27) in wdt; in the sample output they belong to different event_name values (deliver and click).

Related

Finding total session time of a user in postgres

I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time                | credential_id
Joe      | 1               | 2021-11-01 09:00:00 | 44
Joe      | 2               | 2021-11-01 10:00:00 | 44
Jeff     | 1               | 2021-11-01 11:00:00 | 45
Jeff     | 2               | 2021-11-01 12:00:00 | 45
Joe      | 1               | 2021-11-01 12:00:00 | 46
Joe      | 2               | 2021-11-01 12:30:00 | 46
Joe      | 1               | 2021-12-06 14:30:00 | 47
Joe      | 2               | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
Joe      | 2021-11    | 1:30
Jeff     | 2021-11    | 1:00
Joe      | 2021-12    | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.
Solution 1 partially working
Not sure that window functions will help you in your case, but aggregate functions will:
WITH list AS
(
  SELECT username
       , date_trunc('month', time) AS year_month
       , max(time) - min(time) AS session_duration  -- ORDER BY inside max()/min() has no effect, so it is dropped
  FROM your_table
  GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
     , to_char(year_month, 'YYYY-MM') AS year_month
     , sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month;
The first part of the query computes, per username and credential_id, the time between login and logout; the second part sums those durations per year_month. This query works well as long as the login and logout times fall in the same month, but it fails when they don't, as the hypothetical example below shows.
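Consider an assumed session 48 that crosses a month boundary:

-- Joe | 1 | 2021-11-30 23:30:00 | 48   (login,  grouped under November)
-- Joe | 2 | 2021-12-01 00:30:00 | 48   (logout, grouped under December)
-- Grouping by date_trunc('month', time) splits the pair into two one-row
-- groups, so max(time) - min(time) is zero in each and the hour is lost.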
Solution 2 fully working
In order to calculate the total_time per username and per month, whatever the login and logout times are, we can use a time-range approach which intersects the session ranges [login_time, logout_time) with the monthly ranges [monthly_start_time, monthly_end_time):
WITH monthly_range AS
(
  SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
       , tsrange(m.month_start_date, m.month_start_date + interval '1 month') AS monthly_range
  FROM
  ( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
    FROM your_table
  ) AS m
), session_range AS
(
  SELECT username
       , tsrange(min(time), max(time)) AS session_range  -- min(time) is the login, max(time) the logout
  FROM your_table
  GROUP BY username, credential_id
)
SELECT s.username
     , m.month
     , sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
  ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month;
See the result in dbfiddle.
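In that query, && tests whether two ranges overlap and * computes their intersection; an illustrative one-liner with made-up values:

SELECT tsrange('2021-11-28 09:00', '2021-12-02 10:00')
     * tsrange('2021-12-01 00:00', '2022-01-01 00:00') AS period;
-- => ["2021-12-01 00:00:00","2021-12-02 10:00:00")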
Use the window function lag() with a window partitioned by credential_id and ordered by time, e.g.
WITH j AS (
  SELECT username, time, age(time, LAG(time) OVER w)
  FROM t
  WINDOW w AS (PARTITION BY credential_id ORDER BY time
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time, 'yyyy-mm'), sum(age)
FROM j
GROUP BY 1, 2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is pretty much optional in this case, but it is considered good practice to keep window functions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
Demo: db<>fiddle

Get rolling 30 day count of users logging in to site

I have a table of logins to my site in the format below:
logins
+---------+--------------------------+-----------------------+
| USER_ID | LOGIN_TIMESTAMP | LOGOUT_TIMESTAMP |
+---------+--------------------------+-----------------------+
| 274385 | 01-JAN-20 02.56.12 PM | 02-JAN-20 10.04.40 AM |
| 32498 | 01-JAN-20 05.12.14 PM | 01-JAN-20 08.26.43 PM |
| 981231 | 01-JAN-20 04.41.04 PM | 01-JAN-20 10.51.11 PM |
+---------+--------------------------+-----------------------+
I would like to calculate, per day, a unique count of users who logged in only once in the previous 30 days, to get something like the below (note: USER_COUNT_LAST_30_DAYS counts only those users who logged in exactly once in the previous 30 days):
+-----------+-------------------------+
| DAY | USER_COUNT_LAST_30_DAYS |
+-----------+-------------------------+
| 01-JAN-20 | 14 |
| 02-JAN-20 | 23 |
| 03-JAN-20 | 29 |
+-----------+-------------------------+
My first thought would be the query below, but I recognise this would just count all users who logged in during the last 30 days, rather than those who logged in only once:
SELECT
CAST(LOGIN_TIMESTAMP AS DATE),
COUNT(DISTINCT USER_ID)
FROM
logins
WHERE
LOGIN_TIMESTAMP > SYSDATE - 30
GROUP BY
CAST(LOGIN_TIMESTAMP AS DATE);
Would this query work for getting a count of users who logged in only once in the last 30 days if I added a ROW_NUMBER partition filter on user id? Or is there something else I would have to do to get a rolling 30-day count?
The DATE datatype still has a time component, even if the format mask doesn't show it. You can use the TRUNC function on either a date or a timestamp, and if you really want your grouping to be limited to the day, you'll need to truncate the timestamp. You also need to use INTERVAL, as timestamp arithmetic and date arithmetic are not the same:
SELECT TRUNC(LOGIN_TIMESTAMP) LOGIN_DATE,
COUNT(DISTINCT USER_ID) USER_COUNT
FROM logins
WHERE TRUNC(LOGIN_TIMESTAMP) > TRUNC(SYSTIMESTAMP - INTERVAL '30' DAY)
GROUP BY TRUNC(LOGIN_TIMESTAMP)
ORDER BY TRUNC(LOGIN_TIMESTAMP) ASC;
Example:
alter session set nls_date_format='DD-MON-YY HH24.MI.SS';
SELECT
SYSTIMESTAMP raw_timestamp,
CAST(SYSTIMESTAMP AS DATE) raw_date,
TRUNC(CAST(SYSTIMESTAMP AS DATE)) trunc_date,
TRUNC(SYSTIMESTAMP) - INTERVAL '30' DAY
from dual;
RAW_TIMESTAMP RAW_DATE TRUNC_DATE TRUNC(SYSTIMESTAMP
-------------------------------------- ------------------ ------------------ ------------------
25-JUN-20 12.27.21.756299000 PM -04:00 25-JUN-20 12.27.21 25-JUN-20 00.00.00 26-MAY-20 00.00.00
For identifying users that have only logged in once, try this:
WITH user_logins as (
SELECT USER_ID,
COUNT(*) LOGIN_COUNT
FROM logins
WHERE TRUNC(LOGIN_TIMESTAMP) > TRUNC(SYSTIMESTAMP - INTERVAL '30' DAY)
GROUP BY USER_ID)
SELECT user_id, login_count
from user_logins
where login_count=1
order by user_id;
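To get the rolling per-day figure the question asks for, the two pieces can be combined: for each day, count the users whose logins in the preceding 30 days total exactly one. A hypothetical, untested sketch along those lines:

WITH days AS (
  SELECT DISTINCT TRUNC(LOGIN_TIMESTAMP) AS day
  FROM logins
), per_user AS (
  SELECT d.day, l.USER_ID, COUNT(*) AS logins_30d
  FROM days d
  JOIN logins l
    ON l.LOGIN_TIMESTAMP >= d.day - 29   -- window assumed to be the 30
   AND l.LOGIN_TIMESTAMP <  d.day + 1    -- calendar days ending with d.day
  GROUP BY d.day, l.USER_ID
)
SELECT day, COUNT(*) AS user_count_last_30_days
FROM per_user
WHERE logins_30d = 1
GROUP BY day
ORDER BY day;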
Please use the query below. Since you have the string value PM in the date, you cannot use the CAST function; instead you have to use TO_DATE with a format mask matching the sample data to convert the string to a date:
SELECT
  to_date(LOGIN_TIMESTAMP, 'DD-MON-YY HH.MI.SS PM'),
  COUNT(DISTINCT USER_ID)
FROM
  logins
WHERE
  to_date(LOGIN_TIMESTAMP, 'DD-MON-YY HH.MI.SS PM') >= SYSDATE - 30
GROUP BY
  to_date(LOGIN_TIMESTAMP, 'DD-MON-YY HH.MI.SS PM');

Get a rolling count of timestamps in SQL

I have a table (in an Oracle DB) that looks something like what is shown below, with about 4000 records. This is just an example of how the table is designed; the timestamps span several years.
| Time | Action |
| 9/25/2019 4:24:32 PM | Yes |
| 9/25/2019 4:28:56 PM | No |
| 9/28/2019 7:48:16 PM | Yes |
| .... | .... |
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval. I would like this done by looking at each timestamp and getting a count of timestamps that appear within 15 minutes of that timestamp.
My goal would to have something like
| Interval | Count |
| 9/25/2019 4:24:00 PM - 9/25/2019 4:39:00 | 2 |
| 9/25/2019 4:25:00 PM - 9/25/2019 4:40:00 | 2 |
| ..... | ..... |
| 9/25/2019 4:39:00 PM - 9/25/2019 4:54:00 | 0 |
I am not sure how I would be able to do this, if at all. Any ideas or advice would be much appreciated.
If you want any 15 minute interval in the data, then you can use:
select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t;
If you want the maximum, then use rank() on this:
select t.*
from (select t.*, rank() over (order by cnt_15 desc) as seqnum
from (select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t
) t
) t
where seqnum = 1;
This doesn't produce exactly the results you specify in the query. But it does answer the question:
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval.
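Note that this frame looks backward from each timestamp. If you want intervals that start at each timestamp, as in the question's sample output, a forward-looking frame is a possible variant (a sketch, not tested against the real data):

select t.*,
       count(*) over (order by timestamp
                      range between current row
                            and interval '15' minute following
       ) as cnt_15
from t;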
You could enumerate the minutes with a recursive query, then bring in the table with a left join:
with cte (start_dt, max_dt) as (
  -- Oracle recursive CTEs omit the RECURSIVE keyword
  select trunc(min(time), 'mi'), max(time) from mytable
  union all
  select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
  c.start_dt,
  c.start_dt + interval '15' minute end_dt,
  count(t.time) cnt
from cte c
left join mytable t
  on t.time >= c.start_dt
 and t.time < c.start_dt + interval '15' minute
group by c.start_dt

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user, saved in BigQuery. A user counts as active by opening the mobile app, which triggers the event session_start. Example of the unnested dataset:
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
Check out https://stackoverflow.com/a/49866033/132438, where the question asks specifically about counting uniques in a rolling window: it turns out to be a very slow operation, given how much memory it requires.
The solution when you want a rolling count of uniques: go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days:
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC
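If the figures are needed per day rather than per week, only the series step in the days CTE changes, at a roughly proportional increase in the amount of data scanned per window; a sketch under the same table assumptions:

SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 DAY)) AS day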

How can I find the time between users session activity using SQL

Using SQL, how can you find the duration or time elapsed between each user's sessions? For instance, user_id 1234 had one session on 2017-01-01 00:00:00 and another session on 2017-01-02 (see table below). How can I find the time between the last session_end and the beginning of the next session_start?
user_id | session_start       | session_end
1234    | 2017-01-01 00:00:00 | 2017-01-01 00:30:30
1236    | 2017-01-01 01:00:00 | 2017-01-01 01:05:30
1234    | 2017-01-02 12:00:09 | 2017-01-02 12:00:30
1234    | 2017-01-01 02:00:00 | 2017-01-01 03:30:30
1236    | 2017-01-01 00:00:00 | 2017-01-01 00:30:30
Thanks.
This can easily be done using window functions:
select user_id, session_start, session_end,
session_start - lag(session_end) over (partition by user_id order by session_start) as time_diff
from the_table
order by user_id, session_start;
Online example: http://rextester.com/NTVH38963
Subtracting one timestamp from another returns an interval. To convert that to minutes, extract the number of seconds the interval represents and divide by 60:
select user_id, session_start, session_end,
extract(epoch from
session_start - lag(session_end) over (partition by user_id order by session_start)
) / 60 as minutes
from the_table
order by user_id, session_start;
Here's one way to do it with a subquery (note that this returns only the gap before each user's most recent session):
SELECT dT.user_ID
,dT.max_session_start
,DATEDIFF(minute, (SELECT MAX(session_end)
FROM tablename T
WHERE T.user_ID = dT.user_ID
AND T.session_end < dT.max_session_start)
, dT.max_session_start
) AS minutes
FROM (
SELECT user_ID
,MAX(session_start) AS max_session_start
FROM tablename
GROUP BY user_ID
) AS dT
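To cover every session rather than just the latest, a LAG-based variant in the same dialect is a possible alternative (a sketch assuming SQL Server 2012+ and the same tablename):

SELECT user_ID,
       session_start,
       DATEDIFF(minute,
                LAG(session_end) OVER (PARTITION BY user_ID ORDER BY session_start),
                session_start) AS minutes   -- NULL for each user's first session
FROM tablename;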