How to GROUP BY several days in PostgreSQL? - ruby-on-rails-3

The following code generates a series of dates and counts distinct user_ids by day.
SELECT ts, COUNT(DISTINCT(user_id)) FROM
( SELECT current_date + s.ts FROM generate_series(-20,0,1) AS s(ts) )
AS series(ts)
LEFT JOIN messages
ON messages.created_at::date = ts
GROUP BY ts
ORDER BY ts
The output looks like:
2011-07-07 0
2011-07-08 0
2011-07-09 0
2011-07-10 0
2011-07-11 0
2011-07-12 94
2011-07-13 56
2011-07-14 35
2011-07-15 56
2011-07-16 0
2011-07-17 13
How would you modify it to group by 2 days, so that the results overlap? Instead of counting the distinct user_id's for each day, it would count the distinct user_id's for each 2 day period.
This is different from summing the counts of the 2 days, as the user_id should be counted only once for each 2 day period.
Working in PostgreSQL 8.3.
Thanks.

SELECT ts, COUNT(DISTINCT(user_id)) FROM
( SELECT current_date + s.ts FROM generate_series(-20,0,1) AS s(ts) )
AS series(ts)
LEFT JOIN messages
ON messages.created_at::date between ts - 1 and ts -- JOIN on a range
GROUP BY ts
ORDER BY ts
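The same range-join pattern generalizes to any window width. For example, a sketch of a 7-day rolling distinct-user count, assuming the same messages table as above:
SELECT ts, COUNT(DISTINCT user_id) FROM
( SELECT current_date + s.ts FROM generate_series(-20,0,1) AS s(ts) )
AS series(ts)
LEFT JOIN messages
ON messages.created_at::date BETWEEN ts - 6 AND ts -- 7-day window ending on ts
GROUP BY ts
ORDER BY ts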

Try this:
SELECT ts, COUNT(DISTINCT(user_id))
FROM
( SELECT current_date + s.ts
FROM generate_series(-20,0,2) AS s(ts) ) AS series(ts)
LEFT JOIN messages
ON messages.created_at::date = ts or messages.created_at::date = ts + 1
GROUP BY ts
ORDER BY ts
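Note the design difference: because generate_series steps by 2, these are non-overlapping 2-day buckets, whereas the range join above slides one day at a time. A small sketch (same assumed schema) that labels each bucket with its start and end date:
SELECT ts AS bucket_start, ts + 1 AS bucket_end, COUNT(DISTINCT user_id)
FROM
( SELECT current_date + s.ts
FROM generate_series(-20,0,2) AS s(ts) ) AS series(ts)
LEFT JOIN messages
ON messages.created_at::date BETWEEN ts AND ts + 1
GROUP BY ts
ORDER BY ts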

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE        USERID  EVENTS
2021-08-27  1       5
2021-07-25  1       7
2021-07-23  2       3
2021-07-20  3       9
2021-06-22  1       9
2021-05-05  1       4
2021-05-05  2       2
2021-05-05  3       6
2021-05-05  4       8
2021-05-05  5       1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE        ACTIVE_USERS
2021-08-27  1
2021-07-25  3
2021-07-23  2
2021-07-20  2
2021-06-22  1
2021-05-05  5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with ROWS BETWEEN, but I seem to end up with the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with window functions, because count(distinct) is not permitted in a window function. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
table t2
on t2.date <= t1.date and
t2.date > t1.date - interval '30 day'
group by t1.date;
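Because the left side of that self-join repeats each date once per event row, it can be a little cheaper to drive the join from the distinct dates instead; a sketch, assuming the table and column names from the question:
select d.date, count(distinct t.userid) as active_users
from (select distinct date from table) d
join table t
on t.date <= d.date
and t.date > d.date - interval '30 day'
group by d.date
order by d.date desc;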
However, that can be expensive. One solution is to "unpivot" the data: record, per user, the dates where they go "in" and "out" of the active state, and then take a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select user, date, +1 as inc
from table
union all
select user, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, user, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by user order by date) as running_inc
from d
group by date, user
),
d3 as ( -- summarize into active periods
select user, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by user order by date) as active_period
from d2
) d2
where running_inc > 0
group by user, active_period
)
select d.date, count(d3.user)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;

SQL - In a week, get the count of records in that week and the count of records ageing 7 days from that week

This is Redshift SQL.
I'm trying to get 2 results for each week:
Total records in that week
Total records ageing greater than 7 days from that week.
Say there are about 100 sample records in the format below, 7 records per week in this example:
day code week
1/1/2020 P001 1
1/2/2020 P002 1
1/3/2020 P003 1
1/4/2020 P004 1
1/5/2020 P005 2
1/6/2020 P006 2
1/7/2020 P007 2
1/8/2020 P008 2
1/9/2020 P009 2
1/10/2020 P010 2
1/11/2020 P011 2
.....................
4/8/2020 P099 15
Trying to get output like this:
Week count count>7 days
1 7 0
2 7 7
3 7 14
4 7 21
15 7 98
Basically, for the latest week I'm trying to get the distinct number of records ageing more than 7 days. In the actual use case, the number of records per week will vary.
What I've tried:
select
calendar_week_number,
count(code) as count1,
count(DISTINCT (case when datediff(day, trunc(completion_date-7), '2020-01-01') then code end)) as count2,
count(case when completion_date between TO_DATE('20200101','YYYYMMDD') and TO_DATE(completion_date,'YYYYMMDD')-7 then code end) as count3
from rbsrpt.RBS_DAILY_ASIN_PROC_SNPSHT ul
LEFT JOIN rbsrpt.dim_rbs_time t ON Trunc(ul.completion_date) = trunc(t.cal_date)
where
mp=1
and calendar_year=2020
group by
calendar_week_number
order by calendar_week_number desc
but my output is as below:
week count1 count2 count3
51 2866 2866 0
50 3211 3211 0
49 6377 6377 0
48 9013 9013 0
47 5950 5950 0
One option uses lateral joins. It is probably more efficient to aggregate the calendar table by weeks first, then perform the searches week by week against the dataset.
Assuming Postgres (since there is no TO_DATE() in MySQL):
select t.cal_date, c1.*, c2.*
from (
select calendar_week_number, min(cal_date) as cal_date
from rbsrpt.dim_rbs_time t
group by calendar_week_number
) t
cross join lateral (
select count(*) as cnt
from rbsrpt.rbs_daily_asin_proc_snpsht r
where r.completion_date >= t.cal_date
and r.completion_date < t.cal_date + interval '7 day'
) c1
cross join lateral (
select count(*) as cnt_aged
from rbsrpt.rbs_daily_asin_proc_snpsht r
where r.completion_date >= t.cal_date - interval '7' day
and r.completion_date < t.cal_date
) c2
This ages out records after 7 days. If you wanted 30 days instead, you would change the where clause of the second subquery:
cross join lateral (
select count(*) as cnt_aged
from rbsrpt.rbs_daily_asin_proc_snpsht r
where r.completion_date >= t.cal_date - interval '30 day'
and r.completion_date < t.cal_date - interval '23 day'
) c2
Edit: if your database does not support lateral joins, you can use subqueries instead:
select t.cal_date,
(
select count(*)
from rbsrpt.rbs_daily_asin_proc_snpsht r
where r.completion_date >= t.cal_date
and r.completion_date < t.cal_date + interval '7 day'
) as cnt,
(
select count(*)
from rbsrpt.rbs_daily_asin_proc_snpsht r
where r.completion_date >= t.cal_date - interval '7' day
and r.completion_date < t.cal_date
) as cnt_aged
from (
select calendar_week_number, min(cal_date) as cal_date
from rbsrpt.dim_rbs_time t
group by calendar_week_number
) t

SQL not returning a value if no rows exist for the time queried

I'm writing a SQL query that returns the number of records created per hour over the last 24 hours. I'm only getting results for those hours that have a non-zero count; if no records were created in an hour, nothing is returned for it at all.
Here's my query:
SELECT HOUR(timeStamp) as hour, COUNT(*) as count
FROM `events`
WHERE timeStamp > DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY HOUR(timeStamp)
ORDER BY HOUR(timeStamp)
The output of current Query:
+-----------------+----------+
| hour | count |
+-----------------+----------+
| 14 | 6 |
| 15 | 5 |
+-----------------+----------+
But I'm expecting 0 for hours in which no records were created. Where am I going wrong?
One solution is to generate a table of numbers from 0 to 23 and left join it with your original table.
Here is a query that uses a recursive CTE to generate the list of hours (if you are running MySQL, this requires version 8.0):
with recursive hours as (
select 0 hr
union all select hr + 1 from hours where hr < 23
)
select h.hr, count(e.eventID) as cnt
from hours h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
If your RDBMS does not support recursive CTEs, then one option is to use an explicit derived table:
select h.hr, count(e.eventID) as cnt
from (
select 0 hr union all select 1 union all select 2 ... union all select 23
) h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
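If the database happens to be PostgreSQL rather than MySQL, generate_series() avoids spelling the list out by hand; a sketch under that assumption, reusing the column names above:
select h.hr, count(e.eventID) as cnt
from generate_series(0, 23) as h(hr)
left join events e
on e.timeStamp > now() - interval '1 day'
and extract(hour from e.timeStamp) = h.hr
group by h.hr
order by h.hr;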

getting first column blank postgres

SELECT CASE WHEN date_part('hour',created_at) BETWEEN 3 AND 15 THEN '9am-3pm'
WHEN date_part('hour',created_at) BETWEEN 15 AND 18 THEN '3pm-6pm' END "time window",COUNT(*) FROM tickets where created_at < now()
GROUP BY CASE WHEN date_part('hour',created_at) BETWEEN 3 AND 15 THEN '9am-3pm' WHEN date_part('hour',created_at) BETWEEN 15 AND 18 THEN '3pm-6pm' END;
time window | count
-------------+-------
| 6
9am-3pm | 69
Is it possible to group it by date along with the time window, so that my result set looks like
Date | time window | count
------------+-------------+-------
12-01-2020 | 9am-3pm| 6
12-01-2020 | 3pm-6pm| 69
13-01-2020 | 9am-3pm| 12
13-01-2020 | 3pm-6pm| 14
We can handle this requirement using a calendar table approach:
WITH dates AS (
SELECT '2020-01-12'::date AS created_at UNION ALL
SELECT '2020-01-13'::date
),
tw AS (
SELECT '9am-3pm' AS "time window" UNION ALL
SELECT '3pm-6pm'
),
cte AS (
SELECT
created_at::date AS created_at,
CASE WHEN DATE_PART('hour', created_at) BETWEEN 3 AND 15 THEN '9am-3pm'
WHEN DATE_PART('hour', created_at) BETWEEN 15 AND 18 THEN '3pm-6pm' END "time window",
COUNT(*) AS cnt
FROM tickets
WHERE created_at < NOW()
GROUP BY 1, 2
)
SELECT
d.created_at,
tw."time window",
COALESCE(t.cnt, 0) AS count
FROM dates d
CROSS JOIN tw
LEFT JOIN cte t
ON d.created_at = t.created_at AND tw."time window" = t."time window"
ORDER BY
d.created_at,
tw."time window";
You are actually asking two questions:
The "empty space" (really an SQL NULL) is there because there are dates that do not fall within any of the time ranges. You can exclude them with an additional WHERE condition.
To get the date part as well, add
CAST (created_at AS date)
to the SELECT list and the GROUP BY clause.
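Putting both suggestions together, a minimal sketch of what the reworked query could look like, assuming the same tickets table as in the question:
SELECT CAST(created_at AS date) AS "Date",
CASE WHEN date_part('hour',created_at) BETWEEN 3 AND 15 THEN '9am-3pm'
WHEN date_part('hour',created_at) BETWEEN 15 AND 18 THEN '3pm-6pm' END AS "time window",
COUNT(*)
FROM tickets
WHERE created_at < now()
AND date_part('hour',created_at) BETWEEN 3 AND 18 -- excludes rows outside both windows
GROUP BY 1, 2
ORDER BY 1, 2;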

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where dates are stored as the last day of each month. Some of the dates are missing from the data. I want to insert those dates and put a zero value in the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in Postgresql?
Currently we are doing this in Python, but as our data grows day by day it is not efficient to handle that I/O just for this one task.
Thank you
You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
select id, min(report_date) as minrd, max(report_date) as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
from m
) m left join
t
on m.report_date = t.report_date;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of month doesn't keep the last day of the month.
This is easily fixed:
with t as (
select 1 as id, date '2012-01-31' as report_date, 10 as price union all
select 1 as id, date '2012-04-30', 20
), m as (
select id, min(report_date) - interval '1 day' as minrd, max(report_date) - interval '1 day' as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) + interval '1 day' as report_date
from m
) m left join
t
on m.report_date = t.report_date;
The first CTE is just to generate sample data.
This is a slight improvement over Gordon's query, which fails to get the last date of the month in some cases.
Essentially you generate all the month end dates between the min and max date for each id (using generate_series) and left join on this generated table to show the missing dates with 0 price.
with minmax as (
select id, min(report_date) as mindt, max(report_date) as maxdt
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
generate_series(date_trunc('MONTH',mindt+interval '1' day),
date_trunc('MONTH',maxdt+interval '1' day),
interval '1' month) - interval '1 day' as report_date
from minmax
) m
left join t on m.report_date = t.report_date
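To try the query above without a real table, the question's sample rows can be supplied as a stand-in CTE named t (hypothetical data copied from the question); splice it in as the first CTE of the WITH list:
with t as ( -- hypothetical sample data; in the full query write "with t as (...), minmax as (...)"
select 1 as id, date '2015-01-31' as report_date, 40 as price union all
select 1, date '2015-02-28', 56 union all
select 1, date '2015-04-30', 34 union all
select 2, date '2014-05-31', 45 union all
select 2, date '2014-08-31', 47
)
select * from t;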