How can I calculate an "active users" aggregation from an activity log in SQL? - sql

In PostgreSQL, I have a table that logs activity for all users, with an account ID and a timestamp field:
SELECT account_id, created FROM activity_log;
A single account_id can appear many times in a day, or not at all.
I would like a chart showing the number of "active users" each day, where "active users"
means "users who have done any activity within the previous X days".
If X is 1, then we can just truncate timestamp to 'day' and aggregate:
SELECT date_trunc('day', created) AS date, count(DISTINCT account_id)
FROM activity_log
GROUP BY date_trunc('day', created) ORDER BY date;
If X is exactly 7, then we could truncate to 'week' and aggregate - although this gives
me only one data point for a week, when I actually want one data point per day.
But I need to solve for the general case of different X, and give a distinct data point for each day.

One method is to generate the dates and then count using left join and group by or similar logic. The following uses a lateral join:
select gs.dte, al.num_accounts
from generate_series('2021-01-01'::date, '2021-01-31'::date, interval '1 day'
) gs(dte) left join lateral
(select count(distinct al.account_id) as num_accounts
from activity_log al
where al.created >= gs.dte - (<n - 1>) * interval '1 day' and
al.created < gs.dte + interval '1 day'
) al
on 1=1
order by gs.dte;
<n - 1> is one less than the number of days. So for one week, it would be 6.

If your goal is to get day wise distinct account_id for last X days you can use below query. Instead of 7 you can use any number as you wise:
SELECT date_trunc('day', created) AS date, count(DISTINCT account_id)
FROM activity_log
where date_trunc('day', created)>=date_trunc('day',CURRENT_DATE) +interval '-7' day
GROUP BY date_trunc('day', created)
ORDER BY date
(If there is no activity in any given date then the date will not be in the output.)

Related

Count of active user sessions per hour

For each user login to our website, we insert a record into our user_session table with the user's login and logout timestamps. If I wanted to produce a graph of the number of logins per hour over time, it would be easy with the following SQL.
SELECT
date_trunc('hour',login_time) AS "time",
count(*)
FROM user_session
group by time
order by time
Time would be the X-axis and count would be the Y-axis.
But what I really need is the number of active sessions in each hour where "active" means
login_time <= foo and logout_time >= foo where foo is the particular time slot.
How can I do this in a single SELECT statement?
One brute force method generates the hours and then uses a lateral join or correlated subquery to do the calculation:
select gs.ts, us.num_active
from generate_series('2021-03-21'::timestamp, '2021-03-22'::timestamp, interval '1 hour') gs(ts) left join lateral
(select count(*) as num_active
from user_session us
where us.login_time <= gs.ts and
us.logout_time > gs.ts
) us
on 1=1;
A more efficient method -- particularly for longer periods of time -- is to pivot the times and keep an incremental count is ins and outs:
with cte as (
select date_trunc('hour', login_time) as hh, count(*) as inc
from user_session
group by hh
select date_trunc('hour', logout_time + interval '1 hour') as hh, - count(*) as inc
from user_session
group by hh
)
select hh, sum(inc) as net_in_hour,
sum(sum(inc)) over (order by hh) as active_in_hour
from cte
group by hh;

Filling in empty dates

This query returns the number of alarms created by day between a specific date range.
SELECT CAST(created_at AS DATE) AS date, SUM(1) AS count
FROM ew_alarms
LEFT JOIN site ON site.id = ew_alarms.site_id
AND ew_alarms.created_at BETWEEN '12/22/2020' AND '01/22/2021' AND (CAST(EXTRACT(HOUR FROM ew_alarms.created_at) AS INT) BETWEEN 0 AND 23.99)
GROUP BY CAST(created_at AS DATE)
ORDER BY date DESC
Result: screenshot
What the best way to fill in the missing dates (1/16, 1/17, 1/18, etc)? Due to no alarms created on those days these results throw off the daily average I'm ultimately trying to achieve.
Would it be a generate_series query?
Yes, use generate_series(). I would suggest:
SELECT gs.date, COUNT(s.site_id) AS count
FROM GENERATE_SERIES('2020-12-22'::date, '2021-01-22'::date, INTERVAL '1 DAY') gs(dte) LEFT JOIN
ew_alarms a
ON ew.created_at >= gs.dte AND
ew.created_at < gs.dte + INTERVAL '1 DAY' LEFT JOIN
site s
ON s.id = a.site_id
GROUP BY gs.dte
ORDER BY date DESC;
I don't know what the hour comparison is supposed to be doing. The hour is always going to be between 0 and 23, so I removed that logic.
Note: Presumably, you want to count something from either site or ew_alarms. That is expected with LEFT JOINs so 0 can be returned.

How to get a count of data for every date in postgres

I am trying to get data to populate a multi-line graph. The table jobs has the columns id, created_at, and partner_id. I would like to display the sum of jobs for each partner_id each day. My current query has 2 problems. 1) It is missing a lot of jobs. 2) It only contains an entry for a given day if there was a row on that day. My current query is where start is an integer denoting how many days back we are looking for data:
SELECT d.date, count(j.id), j.partner_id FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, #{start}, 1)
AS offs
) d
JOIN (
SELECT jobs.id, jobs.created_at, jobs.partner_id FROM jobs
WHERE jobs.created_at > now() - INTERVAL '#{start} days'
) j
ON (d.date=to_char(date_trunc('day', j.created_at), 'YYYY-MM-DD'))
GROUP BY d.date, j.partner_id
ORDER BY j.partner_id, d.date;
This returns records like the following:
[{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"099"},
{"date"=>"2019-06-22", "count"=>1, "partner_id"=>"099"},
{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>1, "partner_id"=>"099"}]
what I want is something like this:
[{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"099"},
{"date"=>"2019-06-22", "count"=>1, "partner_id"=>"099"},
{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"075"},
{"date"=>"2019-06-22", "count"=>0, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>0, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>1, "partner_id"=>"099"}]
So that for every day in the query I have an entry for every partner even if that count is 0. How can I adjust the query to populate data even when the count is 0?
Use a LEFT JOIN. You also don't need so many subqueries and there is no need to translate to a date to a string and then back to a date:
SELECT d.date, count(j.id), j.partner_id
FROM (SELECT to_char(dte, 'YYYY-MM-DD') AS date , dte
FROM generate_series(current_date - {start} * interval '1 day', current_date, interval '1 day') gs(dte)
) d LEFT JOIN
jobs j
ON DATE_TRUNC('day', j.created_at) = d.dte
GROUP BY d.date, j.partner_id
ORDER BY j.partner_id, d.date;

SQL: COUNT DISTINCT over more groups

I have this query:
SELECT
date_trunc('week', q.order_date::date+2)::date-2 AS weekly,
COUNT(DISTINCT q.provider_username) AS engaged_providers
FROM tbl_quotes q
GROUP BY weekly
What I need is to COUNT(DISTINCT q.provider_username) over 4 weeks, not just one, and I don't want to change my date-trunc. obviously I can't use OVER(ORDER BY ...) because DISTINCT is not implemented for window functions. Is there any other solution for this?
Sample data and desired results would really help. For one interpretation of your question, I would suggest a correlated subquery. The date arithmetic is a bit arcane, but I think this is what you want:
SELECT weekly,
(SELECT COUNT(DISTINCT q.provider_username)
FROM tbl_quotes q
WHERE q.order_date >= w.weekly - interval '3 week' and
q.order_date < w.weekly + interval '1 week'
) as engaged_providers
FROM (SELECT DISTINCT date_trunc('week', q.order_date::date+2)::date-2 as weekly
FROM tbl_quotes q
) w;

grouping by column but getting multiple results for each

I am trying to calculate the median response time for conversations on each date for the last X days.
I use the following query below, but for some reason, it will generate multiple rows with the same date.
with grouping as (
SELECT a.id, d.date, extract(epoch from (first_response_at - started_at)) as response_time
FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 2) AS offs
) d
LEFT OUTER JOIN apps a on true
LEFT OUTER JOIN conversations c ON (d.date=to_char(date_trunc('day'::varchar, c.started_at), 'YYYY-MM-DD')) and a.id = c.app_id
and c.app_id = a.id and c.first_response_at > (current_date - (2 || ' days')::interval)::date
)
select
*
from grouping
where grouping.id = 'ASnYW1-RgCl0I'
Any ideas?
First a number of issues with your query, assuming there aren't any parts you haven't shown us:
You don't need a CTE for this query.
From table apps you only use column id whose value is the same as c.app_id. You can remove the table apps and select c.app_id for the same result.
When you use to_char() you do not first have to date_trunc() to a date, the to_char() function handles that.
generate_series() also works with timestamps. Just enter day values with an interval and cast the end result to date before using it.
So, removing all the flotsam we end up with this which does exactly the same as the query in your question but now we can at least see what is going on.
SELECT c.app_id, to_date(d.date, 'YYYY-MM-DD') AS date,
extract(epoch from (first_response_at - started_at)) AS response_time
FROM generate_series(CURRENT_DATE - 2, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN conversations c ON d.date::date = c.started_at::date
AND c.app_id = 'ASnYW1-RgCl0I'
AND c.first_response_at > CURRENT_DATE - 2;
You don't calculate the median response time anywhere, so that is a big problem you need to solve. This only requires data from table conversations and would look somewhat like this to calculate the median response time for the past 2 days:
SELECT app_id, started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = 'ASnYW1-RgCl0I'
AND first_response_at > CURRENT_DATE - 2
GROUP BY 2;
When we fold the two queries, and put the parameters handily in a single place, this is the final result:
SELECT p.id, to_date(d.date, 'YYYY-MM-DD') AS date,
extract(epoch from (c.median_response)) AS response_time
FROM (VALUES ('ASnYW1-RgCl0I', 2)) p(id, days)
JOIN generate_series(CURRENT_DATE - p.days, CURRENT_DATE, interval '1 day') d(date) ON true
LEFT JOIN LATERAL (
SELECT started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = p.id
AND first_response_at > CURRENT_DATE - p.days
GROUP BY 2) c ON d.date::date = c.start_date;
If you want to change the id of the app or the number of days to look back, you only have to change the VALUES clause accordingly. You can also wrap the whole thing in a SQL function and convert the VALUES clause into two parameters.