Getting counts for overlapping time periods - sql

I have a table data in PostgreSQL with this structure:
created_at customer_email status
2020-12-31 xxx#gmail.com opened
...
2020-12-24 yyy#gmail.com delivered
2020-12-24 xxx#gmail.com opened
...
2020-12-17 zzz#gmail.com opened
2020-12-10 xxx#gmail.com opened
2020-12-03 hhh#gmail.com enqueued
2020-11-27 xxx#gmail.com opened
...
2020-11-20 rrr#gmail.com opened
2020-11-13 ttt#gmail.com opened
There are many rows for each day.
Basically, for the current week (2021-W01) I need the count of unique emails with status "opened" within the last 90 days, and likewise for every week before that.
Desired output:
period active
2021-W01 1539
2020-W53 1480
2020-W52 1630
2020-W51 1820
2020-W50 1910
2020-W49 1890
2020-W48 2000
How can I do that?

Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.
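For the record, this is roughly what one would like to write, but Postgres rejects it outright (a sketch only, assuming a 90-day range frame):

SELECT created_at
     , count(DISTINCT customer_email) OVER (ORDER BY created_at
             RANGE BETWEEN interval '90 days' PRECEDING AND CURRENT ROW)
FROM   tbl
WHERE  status = 'opened';
-- ERROR:  DISTINCT is not implemented for window functions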
Instead, get distinct counts from a LATERAL subquery:
WITH weekly_dist AS (
   SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
   FROM   tbl
   WHERE  status = 'opened'
   )
SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
FROM  (
   SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
                        , date_trunc('week', now()::timestamp)
                        , interval '1 week') AS wk
   FROM   tbl
   ) t
LEFT  JOIN LATERAL (
   SELECT count(DISTINCT customer_email) AS active
   FROM   weekly_dist d
   WHERE  d.wk >= t.wk - interval '91 days'
   AND    d.wk <  t.wk
   ) ct ON true;
I work with timestamp, not timestamptz, which might make a corner-case difference.
The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.
The derived table t generates a timestamp for the beginning of each week since the earliest entry in the table up to "now". This way I make sure no week is skipped, even if there are no rows for it. See:
PostgreSQL: running count of rows for a query 'by minute'
Generating time series between two dates in PostgreSQL
But I do skip the first week since I count active emails before each start of the week.
Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.
Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details for to_char() in the manual.
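A couple of quick examples of the format (note that IYYY, the ISO year, pairs more safely with IW for weeks that straddle a year boundary; a Monday-truncated week start mostly avoids the issue, but not for an ISO week 1 that begins in late December):

SELECT to_char(date '2021-01-04', 'YYYY"-W"IW');  -- 2021-W01
SELECT to_char(date '2020-12-28', 'YYYY"-W"IW');  -- 2020-W53
SELECT to_char(date '2019-12-30', 'IYYY"-W"IW');  -- 2020-W01; with YYYY this would read 2019-W01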

You can combine the date_part() function with a GROUP BY like this:
SELECT
    DATE_PART('year', created_at)::varchar || '-W' || DATE_PART('week', created_at)::varchar,
    SUM(CASE WHEN status = 'opened' THEN 1 ELSE 0 END)
FROM
    your_table
GROUP BY 1
ORDER BY min(created_at) DESC
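Two caveats with this shortcut: date_part() returns a plain number, so week 1 renders as 2021-W1 rather than 2021-W01, and the SUM counts rows per calendar week rather than distinct emails over a trailing 90 days. A hedged variant that at least pads the week and counts distinct opened emails (PostgreSQL, using FILTER):

SELECT date_part('isoyear', created_at)::int || '-W'
       || lpad(date_part('week', created_at)::int::text, 2, '0') AS period,
       count(DISTINCT customer_email) FILTER (WHERE status = 'opened') AS active
FROM your_table
GROUP BY 1
ORDER BY 1 DESC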

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id   dates
1    2002-01-01
2    2002-01-01
3    2002-01-01
...  ...
3    2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
    select id, count(id) counts
    from source
    group by id
),
unique_obs as (
    select id
    from source
    where dates >= DATEADD(Day, -30, current_date)
    group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
  on total_counter.id = unique_obs.id;
The problem is that this returns a single result: today's count, as provided by current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like
counts   date
1235     2023-01-10
1234     2023-01-09
1265     2023-01-08
...      ...
7383     2022-12-11
So, for example, if the current_date was 2023-01-10, my query would've returned 1235.
If you need a distinct count of ids from the 30 days up to and including each date, the below should work:
WITH CTE_DATES AS
(
    --Create a list of anchor dates
    SELECT DISTINCT dates
    FROM source
)
SELECT COUNT(DISTINCT S.id) AS "counts"
      ,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S
    ON S.dates BETWEEN DATEADD(DAY, -29, D.dates) AND D.dates --30 days inclusive
GROUP BY D.dates
ORDER BY D.dates DESC;
If the distinct count didn't matter, you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
      ,COUNT(1) AS "count_daily"
      ,SUM(COUNT(1)) OVER (ORDER BY S.dates DESC
                           ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
This won't work, though, if you have gaps in your list of dates, as the frame would then just span the latest 30 rows available rather than 30 days. In that case the first example, without DISTINCT in the count, will do the trick.
SELECT count(*) AS Counts
      ,dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC
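Alternatively, if your dialect supports RANGE frames with a date offset (PostgreSQL 11+ does; syntax varies elsewhere), gaps stop being a problem, because the frame is defined on the date values rather than on row positions. A sketch (still a row count, not a distinct count):

SELECT dates AS "date"
      ,COUNT(1) AS "count_daily"
      ,SUM(COUNT(1)) OVER (ORDER BY dates
                           RANGE BETWEEN INTERVAL '29 days' PRECEDING
                                     AND CURRENT ROW) AS "count_rolling"
FROM source
GROUP BY dates
ORDER BY dates DESC;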

prestosql get average from last 7 days for each day

The question I have is very similar to the question here, but I am using Presto SQL (on aws athena) and couldn't find information on loops in presto.
To reiterate the issue, I want the query that:
Given table that contains: Day, Number of Items for this Day
I want: Day, Average Items for Last 7 Days before "Day"
So if I have a table that has data from Dec 25th to Jan 25th, my output table should have data from Jan 1st to Jan 25th. And for each day from Jan 1-25th, it will be the average number of items from last 7 days.
Is it possible to do this with presto?
Maybe you can try this one.
The calendar Common Table Expression (CTE) is used to generate the dates between the two ends of the range:
with calendar as (
    select date_generated
    from (
        values (sequence(date '2021-12-25', date '2022-01-25', interval '1' day))
    ) as t1(date_array)
    cross join unnest(date_array) as t2(date_generated)
),
The temp CTE is basically used to build, for each date, the group containing the last 7 days up to that date:
temp as (
    select c1.date_generated as date_groups
         , format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
    from calendar c1, calendar c2
    where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
      and c1.date_generated >= date '2021-12-25' + interval '6' day
)
Output for this part:
date_groups   dates
2022-01-01    2021-12-26
2022-01-01    2021-12-27
2022-01-01    2021-12-28
2022-01-01    2021-12-29
2022-01-01    2021-12-30
2022-01-01    2021-12-31
2022-01-01    2022-01-01
The last part joins the day column from your table to each date and then groups by the date group:
select temp.date_groups as day
     , avg(your_table.num_of_items) as avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1
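Stitched together end to end (all pieces exactly as above), the full statement reads:

with calendar as (
    select date_generated
    from (
        values (sequence(date '2021-12-25', date '2022-01-25', interval '1' day))
    ) as t1(date_array)
    cross join unnest(date_array) as t2(date_generated)
),
temp as (
    select c1.date_generated as date_groups
         , format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
    from calendar c1, calendar c2
    where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
      and c1.date_generated >= date '2021-12-25' + interval '6' day
)
select temp.date_groups as day
     , avg(your_table.num_of_items) as avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1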
You want a running average (AVG OVER):
select day, amount,
       avg(amount) over (order by day
                         rows between 6 preceding and current row) as avg_amount
from mytable
order by day
offset 6;
I tried many different variations of getting the "running average" (which I now know is what I was looking for, thanks to Thorsten's answer), but couldn't get exactly the output I wanted together with my other columns (not included in my original question). This ended up working:
SELECT day, <other columns>,
       avg(amount) OVER (PARTITION BY <other columns>
                         ORDER BY date(day) ASC
                         ROWS 6 PRECEDING) as avg_7_days_amount
FROM table
ORDER BY date(day) ASC

7-day user count: Big-Query self-join to get date range and count?

My Google Firebase event data is integrated with BigQuery and I'm trying to fetch from it one of the metrics Firebase gives me automatically: the 1-day, 7-day, 28-day user counts.
The 1-day count is quite straightforward:
SELECT
    "1-day" as period,
    events.event_date,
    count(distinct events.user_pseudo_id) as uid
FROM
    `your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date
with a neat result like
period event_date uid
1-day 20190609 5
1-day 20190610 7
1-day 20190611 5
1-day 20190612 7
1-day 20190613 37
1-day 20190614 73
1-day 20190615 52
1-day 20190616 36
But to me it gets complicated when I try to count, for each day, how many unique users I had in the previous 7 days.
From the above query, I know my target value for day 20190616 will be 142, obtained by filtering to 7 days and removing the GROUP BY condition.
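For reference, that one-off check would look something like this (a sketch; whether the lower bound is 6 or 7 days back depends on how you define the window):

SELECT
    "7-day" as period,
    count(distinct events.user_pseudo_id) as uid
FROM
    `your_path.events_*` as events
WHERE events.event_name = "session_start"
  AND PARSE_DATE("%Y%m%d", events.event_date)
      BETWEEN DATE_SUB(PARSE_DATE("%Y%m%d", "20190616"), INTERVAL 6 DAY)
          AND PARSE_DATE("%Y%m%d", "20190616")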
The solution I tried is a direct self join (and variations of it that didn't change the result):
SELECT
    "7-day" as period,
    events.event_date,
    count(distinct user_events.user_pseudo_id) as uid
FROM
    `your_path.events_*` as events,
    `your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
  and PARSE_DATE("%Y%m%d", events.event_date)
      between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY)
          and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
  and events.event_date = "20190616" #fixed date to check
group by events.event_date
Now, I know I'm barely setting any join conditions, but if anything I expected that to produce cross joins and huge results. Instead, the count this way is 70, which is a lot lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.
I'm clearly doing something very wrong here, but I also thought that the way I'm doing it is very rudimentary, and there must be a smarter way to accomplish this.
I have checked Calculating a current day 7 day active user with BigQuery?, but the explicit cross join there is with event_dim, whose definition I'm unsure about.
Checked the solution provided at Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU), as suggested by a comment.
The solution seemed sound at first, but it has problems the more recent the day is. Here's the query using COUNT(DISTINCT) that I adapted to my case:
SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
     , COUNT(DISTINCT user_pseudo_id) unique_90_day_users
     , COUNT(DISTINCT IF(i < 29, user_pseudo_id, null)) unique_28_day_users
     , COUNT(DISTINCT IF(i < 8,  user_pseudo_id, null)) unique_7_day_users
     , COUNT(DISTINCT IF(i < 2,  user_pseudo_id, null)) unique_1_day_users
FROM (
    SELECT PARSE_DATE("%Y%m%d", event_date) as event_date, user_pseudo_id
    FROM `your_path_here.events_*`
    WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d", event_date)) = 2019
    GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
and here is the result for the latest days (consider that data starts on 23rd May), where you can see that the result is wrong:
row_num date_grp 90-day 28-day 7-day 1-day
114 2019-06-16 273 273 273 210
115 2019-06-17 78 78 78 78
So for the last day, the 90-day, 28-day and 7-day counts only consider that very same day instead of all the days before.
It's not possible for the 90-day count on 17th June to be 78 if the 1-day count on 16th June was higher.
This is AN answer to my own question.
My means are rudimentary, as I'm not extremely familiar with BQ shortcuts and some advanced functions, but the result is correct.
I hope others will be able to contribute better queries.
#standardSQL
WITH dates AS (
    SELECT i as event_date
    FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
    SELECT DISTINCT "90-day" as day_category
         , FORMAT_DATE("%Y%m%d", event_date) AS event_date
         , FORMAT_DATE("%Y%m%d", DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
    FROM dates, UNNEST(GENERATE_ARRAY(1, 90)) i
    UNION ALL
    SELECT DISTINCT "28-day" as day_category
         , FORMAT_DATE("%Y%m%d", event_date) AS event_date
         , FORMAT_DATE("%Y%m%d", DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
    FROM dates, UNNEST(GENERATE_ARRAY(1, 29)) i
    UNION ALL
    SELECT DISTINCT "7-day" as day_category
         , FORMAT_DATE("%Y%m%d", event_date) AS event_date
         , FORMAT_DATE("%Y%m%d", DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
    FROM dates, UNNEST(GENERATE_ARRAY(1, 7)) i
    UNION ALL
    SELECT DISTINCT "1-day" as day_category
         , FORMAT_DATE("%Y%m%d", event_date) AS event_date
         , FORMAT_DATE("%Y%m%d", event_date) as ptd_date
    FROM dates
)
SELECT event_date
     , sum(IF(day_category = "90-day", unique_ptd_users, null)) as count_90_day
     , sum(IF(day_category = "28-day", unique_ptd_users, null)) as count_28_day
     , sum(IF(day_category = "7-day",  unique_ptd_users, null)) as count_7_day
     , sum(IF(day_category = "1-day",  unique_ptd_users, null)) as count_1_day
from (
    SELECT ptd_dates.day_category
         , ptd_dates.event_date
         , COUNT(DISTINCT user_pseudo_id) unique_ptd_users
    FROM ptd_dates
       , `your_path_here.events_*` events
       , unnest(events.event_params) e_params
    WHERE ptd_dates.ptd_date = events.event_date
    GROUP BY ptd_dates.day_category
           , ptd_dates.event_date
)
group by event_date
order by 1, 2, 3
As per the suggestion from ECris, I first defined a calendar table to use: it contains the 4 categories of PTDs (periods to date). Each is generated from basic elements, so this should scale linearly, since it does not query the event dataset and therefore has no gaps.
Then the join is made with events, where the join condition shows how for each date I'm counting distinct users in all related days in the period.
The results are correct.

How to select the count of total active listings week by week?

I want to get the count from a table of active apartment listings week by week. The table looks like this (except much longer):
id created_at delisted_at
2318867 2014-11-12 18:57:44 Null
2329665 2014-11-14 4:36:32 Null
1431098 2014-07-25 5:45:03 Null
1930123 2014-09-28 10:10:46 2014-09-28 10:10:45
2490774 2014-12-05 0:08:47 Null
To get the active listings for a single week, you have to check that created_at <= end_of_week and delisted_at > end_of_week (or delisted_at is null).
The results table would look like a longer version of this:
Week Number of Active Listings
5/1/2016 3024
5/8/2016 11234
5/15/2016 11234
I would also like to produce another results table month by month as opposed to week by week.
How do I write a query to achieve this behavior?
Here is an example for month. First generate a list of months using generate_series(). The rest is just joins and aggregations:
select g.mon_start, g.mon_end, count(a.id) as numActives
from (select g.mon_start, g.mon_start + interval '1 month' as mon_end
      from generate_series('2016-01-01'::timestamp, '2016-06-01'::timestamp, interval '1 month'
           ) g(mon_start)
     ) g left join
     actives a
     on a.created_at < g.mon_end and
        (a.delisted_at >= g.mon_end or a.delisted_at is null)
group by g.mon_start, g.mon_end
order by g.mon_start, g.mon_end;
A similar idea works for weeks as well.
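For instance, a weekly version only changes the series step and the interval (a sketch, reusing the same actives table and generate_series() bounds):

select g.week_start, g.week_end, count(a.id) as numActives
from (select g.week_start, g.week_start + interval '1 week' as week_end
      from generate_series('2016-05-01'::timestamp, '2016-06-01'::timestamp, interval '1 week'
           ) g(week_start)
     ) g left join
     actives a
     on a.created_at < g.week_end and
        (a.delisted_at >= g.week_end or a.delisted_at is null)
group by g.week_start, g.week_end
order by g.week_start, g.week_end;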

Calculate closest working day in Postgres

I need to schedule some items in a Postgres query based on a requested delivery date for an order. So, for example, the order has a requested delivery on a Monday (say 20120319), and the order needs to be prepared on the prior working day (20120316).
Thoughts on the most direct method? I'm open to adding a dates table. I'm thinking there's got to be a better way than a long set of case statements using:
SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
This gets you the previous business day.
SELECT CASE (EXTRACT(ISODOW FROM current_date)::integer) % 7
       WHEN 1 THEN current_date - 3
       WHEN 0 THEN current_date - 2
       ELSE current_date - 1
       END AS previous_business_day
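Plugging in the delivery date from the question as a quick check (2012-03-19 was a Monday):

SELECT CASE (EXTRACT(ISODOW FROM DATE '2012-03-19')::integer) % 7
       WHEN 1 THEN DATE '2012-03-19' - 3
       WHEN 0 THEN DATE '2012-03-19' - 2
       ELSE DATE '2012-03-19' - 1
       END AS previous_business_day;  -- 2012-03-16, the prior Friday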
To have the previous work day:
select max(s.a) as work_day
from (
    select s.a::date
    from generate_series('2012-01-02'::date, '2050-12-31', '1 day') s(a)
    where extract(dow from s.a) between 1 and 5
    except
    select holiday_date
    from holiday_table
) s
where s.a < '2012-03-19';
If you want the next work day, just invert the query.
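Inverted, that would be (same assumed holiday_table as above):

select min(s.a) as work_day
from (
    select s.a::date
    from generate_series('2012-01-02'::date, '2050-12-31', '1 day') s(a)
    where extract(dow from s.a) between 1 and 5
    except
    select holiday_date
    from holiday_table
) s
where s.a > '2012-03-19';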
SELECT y.d AS prep_day
FROM  (
   SELECT generate_series(dday - 8, dday - 1, interval '1d')::date AS d
   FROM  (SELECT '2012-03-19'::date AS dday) x
   ) y
LEFT  JOIN holiday h USING (d)
WHERE h.d IS NULL
AND   extract(isodow FROM y.d) < 6
ORDER BY y.d DESC
LIMIT 1;
It should be faster to generate only as many days as necessary. I generate one week prior to the delivery; that should cover all possibilities.
isodow as extract parameter is more convenient than dow to test for workdays.
min() / max() versus ORDER BY / LIMIT 1 is a matter of taste with the few rows in my query.
To get several candidate days in descending order, not just the top pick, increase the LIMIT.
I put the dday (delivery day) in a subquery so you only have to input it once. You can enter any date or timestamp literal; it is cast to date either way.
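Note that the LEFT JOIN holiday h USING (d) assumes a holiday table with a single date column named d; a minimal stand-in could be:

CREATE TABLE holiday (d date PRIMARY KEY);  -- one row per non-working day
INSERT INTO holiday VALUES ('2012-12-25'), ('2012-12-26');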
CREATE TABLE Holidays (Holiday, PrecedingBusinessDay) AS VALUES
    ('2012-12-25'::DATE, '2012-12-24'::DATE),
    ('2012-12-26'::DATE, '2012-12-24'::DATE);

SELECT Day, COALESCE(PrecedingBusinessDay, PrecedingMondayToFriday)
FROM (SELECT Day, Day - CASE DATE_PART('DOW', Day)
                        WHEN 0 THEN 2
                        WHEN 1 THEN 3
                        ELSE 1
                        END AS PrecedingMondayToFriday
      FROM TestDays) AS PrecedingMondaysToFridays
LEFT JOIN Holidays ON PrecedingMondayToFriday = Holiday;
You might want to rename some of the identifiers :-).
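TestDays is not defined in the answer; a minimal stand-in to make it runnable, picking dates that exercise both branches, could be:

CREATE TABLE TestDays (Day) AS VALUES
    ('2012-03-19'::DATE),  -- Monday: the plain rule gives Friday 2012-03-16
    ('2012-12-27'::DATE);  -- Thursday: 2012-12-26 is a holiday, so 2012-12-24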