PostgreSQL - generating an hourly list - sql

I have an API that counts events from a table and groups them by hour of day and severity; I use the result to draw a graph. This is my current query:
SELECT
extract(hour FROM time) AS hours,
alarm. "severity",
COUNT(*)
FROM
alarm
WHERE
date = '2019-06-12'
GROUP BY
extract(hour FROM time),
alarm. "severity"
ORDER BY
extract(hour FROM time),
alarm. "severity"
What I really want is a list of hours from 00 to 23 with the corresponding event counts, and 0 if there are no events in that hour. Is there a way to make Postgres generate such a structure?

Use generate_series() to generate the hours and a cross join for the severities:
SELECT gs.h, s.severity, COUNT(a.time)
FROM GENERATE_SERIES(0, 23, 1) gs(h) CROSS JOIN
(SELECT DISTINCT severity FROM alarm
) s LEFT JOIN
alarm a
ON extract(hour FROM a.time) = gs.h AND
a.severity = s.severity AND
a.date = '2019-06-12'
GROUP BY gs.h, s.severity
ORDER BY gs.h, s.severity;
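To see why the answer counts a.time rather than using COUNT(*), here is a minimal self-contained sketch. The three sample rows in the alarm CTE are made up for illustration, and the column types (date, time, text) are an assumption about the real table.

```sql
-- Hypothetical sample data standing in for the real alarm table.
WITH alarm(date, time, severity) AS (
    VALUES (date '2019-06-12', time '00:15', 'high'),
           (date '2019-06-12', time '00:40', 'high'),
           (date '2019-06-12', time '13:05', 'low')
)
SELECT gs.h,
       s.severity,
       count(a.time) AS by_column,  -- counts only matched alarm rows, so it can be 0
       count(*)      AS by_row      -- also counts the NULL-extended row, so it is never below 1
FROM generate_series(0, 23) AS gs(h)
CROSS JOIN (SELECT DISTINCT severity FROM alarm) AS s
LEFT JOIN alarm AS a
       ON extract(hour FROM a.time) = gs.h
      AND a.severity = s.severity
      AND a.date = date '2019-06-12'
GROUP BY gs.h, s.severity
ORDER BY gs.h, s.severity;
```

With this sample, hour 0 / high shows 2 in both columns, while hour 1 / high shows by_column = 0 and by_row = 1, which is why the count has to be taken on a column from the LEFT JOINed table.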

Related

How can I calculate an "active users" aggregation from an activity log in SQL?

In PostgreSQL, I have a table that logs activity for all users, with an account ID and a timestamp field:
SELECT account_id, created FROM activity_log;
A single account_id can appear many times in a day, or not at all.
I would like a chart showing the number of "active users" each day, where "active users"
means "users who have done any activity within the previous X days".
If X is 1, then we can just truncate timestamp to 'day' and aggregate:
SELECT date_trunc('day', created) AS date, count(DISTINCT account_id)
FROM activity_log
GROUP BY date_trunc('day', created) ORDER BY date;
If X is exactly 7, then we could truncate to 'week' and aggregate - although this gives
me only one data point for a week, when I actually want one data point per day.
But I need to solve for the general case of different X, and give a distinct data point for each day.
One method is to generate the dates and then count using left join and group by or similar logic. The following uses a lateral join:
select gs.dte, al.num_accounts
from generate_series('2021-01-01'::date, '2021-01-31'::date, interval '1 day'
) gs(dte) left join lateral
(select count(distinct al.account_id) as num_accounts
from activity_log al
where al.created >= gs.dte - (<n - 1>) * interval '1 day' and
al.created < gs.dte + interval '1 day'
) al
on 1=1
order by gs.dte;
<n - 1> is one less than the number of days. So for one week, it would be 6.
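As a concrete instance, this is roughly what the query looks like for a 7-day window (so the placeholder becomes 6). It is only a sketch: the activity_log CTE holds three invented rows so the statement runs on its own.

```sql
-- Hypothetical sample data standing in for the real activity_log table.
WITH activity_log(account_id, created) AS (
    VALUES (1, timestamp '2021-01-01 09:00'),
           (2, timestamp '2021-01-03 17:30'),
           (1, timestamp '2021-01-09 08:00')
)
SELECT gs.dte::date AS day, al.num_accounts
FROM generate_series(timestamp '2021-01-01',
                     timestamp '2021-01-10',
                     interval '1 day') AS gs(dte)
LEFT JOIN LATERAL (
    SELECT count(DISTINCT l.account_id) AS num_accounts
    FROM activity_log AS l
    WHERE l.created >= gs.dte - 6 * interval '1 day'  -- 6 = window of 7 days, minus 1
      AND l.created <  gs.dte + interval '1 day'      -- up to the end of the current day
) AS al ON true
ORDER BY gs.dte;
```

With this sample, 2021-01-07 reports 2 accounts, 2021-01-08 reports 1 (account 1's first activity has dropped out of the window), and 2021-01-09 reports 2 again.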
If your goal is to get the day-wise distinct account_id count for the last X days, you can use the query below. Replace 7 with any number you wish:
SELECT date_trunc('day', created) AS date, count(DISTINCT account_id)
FROM activity_log
where date_trunc('day', created)>=date_trunc('day',CURRENT_DATE) +interval '-7' day
GROUP BY date_trunc('day', created)
ORDER BY date
(If there is no activity in any given date then the date will not be in the output.)

Filling in empty dates

This query returns the number of alarms created by day between a specific date range.
SELECT CAST(created_at AS DATE) AS date, SUM(1) AS count
FROM ew_alarms
LEFT JOIN site ON site.id = ew_alarms.site_id
AND ew_alarms.created_at BETWEEN '12/22/2020' AND '01/22/2021' AND (CAST(EXTRACT(HOUR FROM ew_alarms.created_at) AS INT) BETWEEN 0 AND 23.99)
GROUP BY CAST(created_at AS DATE)
ORDER BY date DESC
Result: screenshot
What's the best way to fill in the missing dates (1/16, 1/17, 1/18, etc.)? Because no alarms were created on those days, these results throw off the daily average I'm ultimately trying to achieve.
Would it be a generate_series query?
Yes, use generate_series(). I would suggest:
SELECT gs.dte::date AS date, COUNT(s.id) AS count
FROM GENERATE_SERIES('2020-12-22'::date, '2021-01-22'::date, INTERVAL '1 DAY') gs(dte) LEFT JOIN
ew_alarms a
ON a.created_at >= gs.dte AND
a.created_at < gs.dte + INTERVAL '1 DAY' LEFT JOIN
site s
ON s.id = a.site_id
GROUP BY gs.dte
ORDER BY date DESC;
I don't know what the hour comparison is supposed to be doing. The hour is always going to be between 0 and 23, so I removed that logic.
Note: Presumably, you want to count something from either site or ew_alarms. Counting a column from the LEFT JOINed side, rather than using COUNT(*), is what lets days without alarms return 0.
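Since the stated goal is a daily average, here is a small sketch of how the zero-filled counts feed into it. The ew_alarms CTE holds three invented rows over a five-day range, and the site join is left out for brevity.

```sql
-- Hypothetical sample data standing in for the real ew_alarms table.
WITH ew_alarms(site_id, created_at) AS (
    VALUES (1, timestamp '2021-01-15 10:00'),
           (1, timestamp '2021-01-15 11:00'),
           (2, timestamp '2021-01-19 09:30')
),
daily AS (
    SELECT gs.dte::date AS day, count(a.created_at) AS alarms
    FROM generate_series(timestamp '2021-01-15',
                         timestamp '2021-01-19',
                         interval '1 day') AS gs(dte)
    LEFT JOIN ew_alarms AS a
           ON a.created_at >= gs.dte
          AND a.created_at <  gs.dte + interval '1 day'
    GROUP BY gs.dte
)
SELECT avg(alarms)                           AS avg_with_empty_days,  -- 3 alarms / 5 days = 0.6
       avg(alarms) FILTER (WHERE alarms > 0) AS avg_busy_days_only    -- 3 alarms / 2 days = 1.5
FROM daily;
```

The second column is what the original query effectively averaged over, because the days without alarms never appeared in its output.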

Get list of active users per day

I have a dataset that has a list of users that are connected to the server at every 15 minutes, e.g.
May 7, 2020, 8:09 AM user1
May 7, 2020, 8:09 AM user2
...
May 7, 2020, 8:24 AM user1
May 7, 2020, 8:24 AM user3
...
And I'd like to get a number of active users for every day, e.g.
May 7, 2020 71
May 8, 2020 83
Now, the tricky part. A user counts as active if he/she has been connected 80% of the time or more across the last 7 days. There are 672 15-minute intervals in a week (1440 / 15 x 7), so a user has to appear at least 538 times (672 x 0.8 = 537.6, rounded up).
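The threshold arithmetic can be checked directly in PostgreSQL. This is only a worked calculation; the column aliases are made up.

```sql
SELECT 1440 / 15 * 7             AS intervals_per_week,  -- 96 per day * 7 days = 672
       ceil(1440 / 15 * 7 * 0.8) AS min_intervals;       -- ceil(537.6) = 538
```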
My code so far is:
SELECT
DATE_TRUNC('week', ts) AS ts_week
,COUNT(DISTINCT user)
FROM activeusers
GROUP BY 1
Which only gives a list of unique users connected at every week.
July 13, 2020, 12:00 AM 435
July 20, 2020, 12:00 AM 267
But I'd like to implement the active user definition, as well as get the result for every day, not just Mondays.
The resulting special difficulty here is that users might qualify for days where they have no connections at all, if they were connected sufficiently during the previous 6 days.
That makes it harder to use a window function. Aggregating in a LATERAL subquery is the obvious alternative:
WITH daily AS ( -- ① granulate daily
SELECT ts::date AS the_day
, "user"
, count(*)::int AS daily_cons
FROM activeusers
GROUP BY 1, 2
)
SELECT d.the_day, count("user") AS active_users
FROM ( -- ② time frame
SELECT generate_series (timestamp '2020-07-01'
, LOCALTIMESTAMP
, interval '1 day')::date
) d(the_day)
LEFT JOIN LATERAL (
SELECT "user"
FROM daily a
WHERE a.the_day >= d.the_day - 6
AND a.the_day <= d.the_day
GROUP BY "user"
HAVING sum(daily_cons) >= 538 -- ③
) sum7 ON true
GROUP BY d.the_day
ORDER BY d.the_day;
① The CTE daily is optional, but starting with daily aggregates should help performance a lot.
② You'll have to define the time frame somehow. The example runs from 2020-07-01 up to the current time. Replace with your choice. To work with the total range present in your table, use instead:
SELECT generate_series (min(the_day)::timestamp
, max(the_day)::timestamp
, interval '1 day')::date AS the_day
FROM daily
Consider basics here:
Generating time series between two dates in PostgreSQL
This also overcomes the "special difficulty" mentioned above.
③ The condition in the HAVING clause eliminates all rows with insufficient connections over the last 7 days (including "today").
Related:
Cumulative sum of values by month, filling in for missing months
Best way to count records by arbitrary time intervals in Rails+Postgres
Total Number of Records per Week
Aside:
You wouldn't really use the reserved word "user" as identifier.
I have done something similar to this for device monitoring reports. I was never able to come up with a solution that does not involve building a calendar and cross joining it to a distinct list of devices (user values in your case).
This deliberately verbose query builds the cross join, gets active counts per user and ddate, performs the running sum() over seven days, and then counts the number of users on a given ddate that had 538 or more actives in the seven days ending on that ddate.
with drange as (
select min(ts) as start_ts, max(ts) as end_ts
from activeusers
), alldates as (
select (start_ts + make_interval(days := x))::date as ddate
from drange
cross join generate_series(0, end_ts::date - start_ts::date) as gs(x)
), user_dates as (
select ddate, "user"
from alldates
cross join (select distinct "user" from activeusers) u
), user_date_counts as (
select u.ddate, u."user",
sum(case when a."user" is null then 0 else 1 end) as actives
from user_dates u
left join activeusers a
on a."user" = u."user"
and a.ts::date = u.ddate
group by u.ddate, u."user"
), running_window as (
select ddate, "user",
sum(actives) over (partition by "user"
order by ddate
rows between 6 preceding
and current row) seven_days
from user_date_counts
), flag_active as (
select ddate, "user",
seven_days >= 538 as is_active
from running_window
)
select ddate, count(*) as active_users
from flag_active
where is_active
group by ddate
;
Because you want the active user for every day but are determining by week, I think you might use a CROSS APPLY to duplicate the count for every day. The FROM part of the query will give you the days and the users, the CROSS APPLY will limit to active users. You can specify in the final WHERE what users or dates you want.
SELECT users.UserName, users.LogDate
FROM (
SELECT UserName, CAST(ts AS DATE) AS LogDate
FROM activeusers
GROUP BY UserName, CAST(ts AS DATE)
) AS users
CROSS APPLY (
SELECT a.UserName, COUNT(1) AS Hits
FROM activeusers AS a
WHERE a.UserName = users.UserName AND CAST(a.ts AS DATE) BETWEEN DATEADD(DAY, -6, users.LogDate) AND users.LogDate
GROUP BY a.UserName
HAVING COUNT(1) >= 538
) AS activeUsers
WHERE users.LogDate > '2020-01-01' AND users.UserName = 'user1'
This is SQL Server syntax, so it needs revising for PostgreSQL: CROSS APPLY corresponds to CROSS JOIN LATERAL (OUTER APPLY corresponds to LEFT JOIN LATERAL (...) ON true), and DATEADD becomes plain date arithmetic.
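A possible PostgreSQL rendering of the same idea, offered as a sketch rather than a drop-in replacement. The username column name is an assumption carried over from this answer (the question's table calls it "user"), and the activeusers CTE generates invented data: user1 seen every 15 minutes for seven days, user2 seen once.

```sql
-- Hypothetical sample data standing in for the real activeusers table.
WITH activeusers(ts, username) AS (
    SELECT t, 'user1'
    FROM generate_series(timestamp '2020-05-01 00:00',
                         timestamp '2020-05-07 23:45',
                         interval '15 minutes') AS t
    UNION ALL
    SELECT timestamp '2020-05-07 08:09', 'user2'
)
SELECT users.username, users.log_date
FROM (
    SELECT DISTINCT username, ts::date AS log_date
    FROM activeusers
) AS users
CROSS JOIN LATERAL (          -- PostgreSQL counterpart of CROSS APPLY
    SELECT count(*) AS hits
    FROM activeusers AS a
    WHERE a.username = users.username
      AND a.ts >= users.log_date - 6   -- 7-day window ending on log_date
      AND a.ts <  users.log_date + 1
    HAVING count(*) >= 538             -- no row is returned below the threshold
) AS active
WHERE users.log_date > date '2020-01-01'
ORDER BY users.username, users.log_date;
```

With this sample, only user1 on 2020-05-06 (576 connections in the window) and 2020-05-07 (672) qualify. Like the SQL Server version, this reports only days on which the user actually connected.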

How to get a count of data for every date in postgres

I am trying to get data to populate a multi-line graph. The table jobs has the columns id, created_at, and partner_id. I would like to display the sum of jobs for each partner_id each day. My current query has 2 problems. 1) It is missing a lot of jobs. 2) It only contains an entry for a given day if there was a row on that day. My current query is where start is an integer denoting how many days back we are looking for data:
SELECT d.date, count(j.id), j.partner_id FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, #{start}, 1)
AS offs
) d
JOIN (
SELECT jobs.id, jobs.created_at, jobs.partner_id FROM jobs
WHERE jobs.created_at > now() - INTERVAL '#{start} days'
) j
ON (d.date=to_char(date_trunc('day', j.created_at), 'YYYY-MM-DD'))
GROUP BY d.date, j.partner_id
ORDER BY j.partner_id, d.date;
This returns records like the following:
[{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"099"},
{"date"=>"2019-06-22", "count"=>1, "partner_id"=>"099"},
{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>1, "partner_id"=>"099"}]
what I want is something like this:
[{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"099"},
{"date"=>"2019-06-22", "count"=>1, "partner_id"=>"099"},
{"date"=>"2019-06-21", "count"=>3, "partner_id"=>"075"},
{"date"=>"2019-06-22", "count"=>0, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>0, "partner_id"=>"075"},
{"date"=>"2019-06-23", "count"=>1, "partner_id"=>"099"}]
So that for every day in the query I have an entry for every partner even if that count is 0. How can I adjust the query to populate data even when the count is 0?
Use a LEFT JOIN. You also don't need so many subqueries, and there is no need to translate a date to a string and then back to a date:
SELECT d.date, count(j.id), j.partner_id
FROM (SELECT to_char(dte, 'YYYY-MM-DD') AS date , dte
FROM generate_series(current_date - #{start} * interval '1 day', current_date, interval '1 day') gs(dte)
) d LEFT JOIN
jobs j
ON DATE_TRUNC('day', j.created_at) = d.dte
GROUP BY d.date, j.partner_id
ORDER BY j.partner_id, d.date;
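Note that a LEFT JOIN on its own yields a single row with a NULL partner_id for a day without jobs. To get a 0 row for every partner on every day, as in the desired output, the dates can be cross joined with the list of partners first. A minimal sketch, with three invented rows in the jobs CTE and a fixed date range in place of the start parameter:

```sql
-- Hypothetical sample data standing in for the real jobs table.
WITH jobs(id, created_at, partner_id) AS (
    VALUES (1, timestamp '2019-06-21 10:00', '099'),
           (2, timestamp '2019-06-21 12:00', '075'),
           (3, timestamp '2019-06-23 09:00', '099')
)
SELECT d.dte::date AS date, p.partner_id, count(j.id) AS count
FROM generate_series(timestamp '2019-06-21',
                     timestamp '2019-06-23',
                     interval '1 day') AS d(dte)
CROSS JOIN (SELECT DISTINCT partner_id FROM jobs) AS p  -- every partner for every day
LEFT JOIN jobs AS j
       ON j.partner_id = p.partner_id
      AND j.created_at >= d.dte
      AND j.created_at <  d.dte + interval '1 day'
GROUP BY d.dte, p.partner_id
ORDER BY p.partner_id, d.dte;
```

This returns six rows for the sample: partner 075 with counts 1, 0, 0 and partner 099 with counts 1, 0, 1 for the three days.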

SQL - Unequal left join BigQuery

New here. I am trying to get the daily and weekly active users over time. Users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period and I have the simplified orders table with the base info that I need to calculate this.
I am trying to do a Left Join to get the status by date using the following SQL Query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. How can I work around this? I am using Standard SQL within BigQuery.
Thank you
The query below is for BigQuery Standard SQL and mostly reproduces the logic of your query, except that it omits days on which no activity at all is found:
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to reproduce your logic exactly, you can wrap the above in a final LEFT JOIN as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2
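As an aside, the dates CTE in the question builds its calendar by numbering rows of the Order table, which only works if that table has at least 1096 rows. A possible alternative, sketched here on the assumption that the intended range is the 1096 days starting 2016-01-01, is BigQuery's GENERATE_DATE_ARRAY:

```sql
#standardSQL
-- 1096 consecutive days from 2016-01-01 through 2018-12-31,
-- the same range the ROW_NUMBER() version is meant to cover.
WITH dates AS (
  SELECT date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2016-01-01', DATE '2018-12-31', INTERVAL 1 DAY)) AS date
)
SELECT MIN(date) AS first_day, MAX(date) AS last_day, COUNT(*) AS days
FROM dates;
```

The final SELECT is only a sanity check on the generated range; in the real query the dates CTE would be used exactly as before.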