Querying count on daily basis with date constraints over multiple weeks - sql

I'm trying to find the # active users over time on a daily basis.
A user is active when he has made more than 10 requests per week for 4 consecutive weeks.
ie. On Oct 31, 2014, a user is active if he has made more than 10 requests in total per week between:
Oct 24-Oct 30, 2014 AND
Oct 17-Oct 23, 2014 AND
Oct 10-Oct 16, 2014 AND
Oct 3-Oct 9, 2014
I have a table of requests:
CREATE TABLE requests (
id text PRIMARY KEY, -- id of the request
amount bigint, -- sum of requests made by accounts_id to recipient_id,
-- aggregated on a daily basis based on "date"
accounts_id text, -- id of the user
recipient_id text, -- id of the recipient
date timestamp -- date that the request was made in YYYY-MM-DD
);
Sample values:
INSERT INTO requests
VALUES
('1', 19, 'a1', 'b1', '2014-10-05 00:00:00'),
('2', 19, 'a2', 'b2', '2014-10-06 00:00:00'),
('3', 85, 'a3', 'b3', '2014-10-07 00:00:00'),
('4', 11, 'a1', 'b4', '2014-10-13 00:00:00'),
('5', 2, 'a2', 'b5', '2014-10-14 00:00:00'),
('6', 50, 'a3', 'b5', '2014-10-15 00:00:00'),
('7', 787323, 'a1', 'b6', '2014-10-17 00:00:00'),
('8', 33, 'a2', 'b8', '2014-10-18 00:00:00'),
('9', 14, 'a3', 'b9', '2014-10-19 00:00:00'),
('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');
Example output:
Date | # Active users
-----------+---------------
10-01-2014 | 600
10-02-2014 | 703
10-03-2014 | 891
Here's what I tried to do to find the number of active users for a certain date (e.g. 10-01-2014):
SELECT count(*)
FROM
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
JOIN
(SELECT accounts_id
FROM requests
WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
Since this is just the query to get the number for 1 day, I need to get this number on a daily basis over time. I think the idea is to do a join to get the date so I tried to do something like this:
SELECT week_1."Date_series",
count(*)
FROM
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_1
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
AND week_1."Date_series" = week_2."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
AND week_2."Date_series" = week_3."Date_series"
JOIN
(SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
accounts_id
FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
GROUP BY "Date_series",
accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
AND week_3."Date_series" = week_4."Date_series"
GROUP BY week_1."Date_series"
However, I think I'm not getting the right answer and I'm not sure why. Any tips/ guidance/ pointers is much appreciated! :) :)
PS. I'm using Postgres 9.3

Here is a long answer on how to make your queries short. :)
Table
Building on my table (created before you provided a table definition with different (odd!) data types):
CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);
Active Users for given day
The list of "active users" for one given day:
SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;
Step 1
In the innermost subquery w (for "week") build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of generate_series(0,3).
To add / subtract days to / from a date (not from a timestamp!) just add / subtract integer numbers. The expression day - 7 * w subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end).
Subtract another 6 days (not 7!) from each to compute the respective start (w_start).
Additionally, keep the week number w (0-3) for the later aggregation.
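To see what subquery w produces, it can be run on its own. This is a minimal sketch using the effective date from the question (2014-10-31); the ORDER BY is only added here to make the output easier to read:

```sql
-- Week bounds for the 4 weeks before 2014-10-31.
-- date - integer yields a date, so no interval arithmetic is needed.
SELECT w
     , day - 6 - 7 * w AS w_start
     , day - 7 * w     AS w_end
FROM  (SELECT date '2014-10-31' - 1 AS day) d   -- effective date here
CROSS  JOIN generate_series(0, 3) w
ORDER  BY w;

--  w |  w_start   |   w_end
-- ---+------------+------------
--  0 | 2014-10-24 | 2014-10-30
--  1 | 2014-10-17 | 2014-10-23
--  2 | 2014-10-10 | 2014-10-16
--  3 | 2014-10-03 | 2014-10-09
```

These are the same four ranges listed in the question for Oct 31, 2014.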
Step 2
In subquery sub join rows from requests to the set of 4 weeks, where the date lies between start and end date. GROUP BY the week number w and the accounts_id.
Only weeks with more than 10 requests total qualify.
Step 3
In the outer SELECT count the number of weeks in which each user (accounts_id) qualified. It must be 4 to qualify as "active user".
Count of active users per day
This is dynamite.
Wrapped in a simple SQL function to simplify general use, but the query can be used on its own just as well:
CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;
The function takes any day (_now), "today" by default, and the number of days (_days) in the result, 3 by default. Call:
SELECT * FROM f_active_users('2014-10-31', 5);
Or without parameters to use defaults:
SELECT * FROM f_active_users();
The approach is different from the first query.
SQL Fiddle with both queries and variants for your table definition.
Step 0
In the CTE r pre-aggregate amounts per (accounts_id, date) for only the period of interest, for better performance. The table is only scanned once, and the suggested index (see below) will kick in here.
Step 1
In the inner subquery d generate the necessary list of days: 27 + _days rows, where _days is the desired number of rows in the output, effectively 28 days or more.
While at it, compute the day of the week (dow) to be used for aggregating in step 3. i%7 coincides with weekly intervals; the query works for any interval, though.
In the inner subquery a generate a unique list of users (accounts_id) that exist in CTE r and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).
Step 2
Generate a Cartesian product from d and a with a CROSS JOIN to have one row for every relevant day for every relevant user. LEFT JOIN to r to append the amount of requests (if any). No WHERE condition, we want every day in the result, even if there are no active users at all.
Compute the total amount for the past week (w_amount) in the same step using a window function with a custom frame. Example:
How to use a ring data structure in window functions
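The custom frame can be tried in isolation. The sketch below is not part of the function above; it uses a made-up derived table (sample, with columns the_day and amount) of nine consecutive days where the amount equals the day's position, so the running 7-day total is easy to check by hand:

```sql
-- Rolling 7-day sum: the current row plus the 6 following rows,
-- which are the 6 preceding days because of ORDER BY ... DESC.
SELECT the_day, amount
     , sum(amount) OVER (ORDER BY the_day DESC
                         ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM  (
   SELECT date '2014-10-01' + i AS the_day   -- hypothetical sample days
        , i                     AS amount    -- hypothetical amounts 1..9
   FROM   generate_series(1, 9) i
   ) sample
ORDER  BY the_day DESC;

-- First row:  2014-10-10 | 9 | 42   (9+8+7+6+5+4+3)
-- Third row:  2014-10-08 | 7 | 28   (7+6+5+4+3+2+1)
-- Rows for earlier days see fewer than 7 rows in the frame, so their sums cover a partial week.
```

A ROWS frame counts rows, not days, which is presumably why the function first builds one row per user and day before applying it.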
Step 3
Cut off the last 6 days now; which is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days)
Count the weeks where the minimum amount is met (w_ct) in a similar window function, this time partitioned by dow additionally to only have same weekdays for the past 4 weeks in the frame (which carry the sum of the respective past week).
The expression count(w_amount > 10 OR NULL) only counts rows with more than 10 requests. Detailed explanation:
Compute percents from SUM() in the same SELECT sql query
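A small stand-alone illustration of the count(... OR NULL) idiom, using a made-up VALUES list rather than the requests table. The FILTER variant is shown only for comparison and needs Postgres 9.4 or later, so it would not run on the 9.3 mentioned in the question:

```sql
-- count() ignores NULL, and "false OR NULL" evaluates to NULL,
-- so only rows where the condition is true are counted.
SELECT count(*)                            AS all_rows
     , count(amount > 10 OR NULL)          AS over_10
     , count(*) FILTER (WHERE amount > 10) AS over_10_filter   -- Postgres 9.4+
FROM  (VALUES (5), (11), (20), (NULL)) t(amount);

--  all_rows | over_10 | over_10_filter
-- ----------+---------+----------------
--         4 |       2 |              2
```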
Step 4
In the outer SELECT group by date and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)). Add 1 to the date to compensate off-by-1, ORDER and LIMIT to the requested number of days.
Performance and outlook
The perfect index for both queries would be:
CREATE INDEX foo ON requests (date, accounts_id, amount);
Performance should be good, but get even (much) better with the upcoming Postgres 9.4, due to the new moving aggregate support:
Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual
Aside: don't call a timestamp column "date", it's a timestamp, not a date. Better yet, never use basic type names like date or timestamp as identifiers. Ever.

Related

SQL Query to group dates and includes different dates in the aggregation

I have a table with two columns, dates and number of searches in each date. What I want to do is group by the dates, and find the sum of number of searches for each date.
The trick is that for each group, I also want to include the number of searches for the date exactly the following week, and the number of searches for the date exactly the previous week.
So If I have
Date      | Searches
----------+---------
2/3/2023  | 2
2/10/2023 | 4
2/17/2023 | 1
2/24/2023 | 5
I want the output for the 2/10/2023 and 2/17/2023 groups to be
Date      | Sum
----------+----
2/10/2023 | 7
2/17/2023 | 10
How can I write a query for this?
You can use a correlated query for this:
select date, (
select sum(searches)
from t as x
where x.date between t.date - interval '7 day' and t.date + interval '7 day'
) as sum_win
from t
Replace interval 'x day' with the appropriate date add function for your RDBMS.
If your RDBMS supports interval in window functions then a much better solution would be:
select date, sum(searches) over (
order by date
range between interval '7 day' preceding and interval '7 day' following
) as sum_win
from t
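For PostgreSQL specifically, RANGE frames with an interval offset are available from version 11 on. A self-contained sketch with the sample rows from the question inlined as a CTE (the CTE stands in for the real table t):

```sql
-- Sum of searches within 7 days before and after each date.
WITH t(date, searches) AS (
   VALUES (date '2023-02-03', 2)
        , (date '2023-02-10', 4)
        , (date '2023-02-17', 1)
        , (date '2023-02-24', 5)
   )
SELECT date
     , sum(searches) OVER (ORDER BY date
                           RANGE BETWEEN interval '7 day' PRECEDING
                                     AND interval '7 day' FOLLOWING) AS sum_win
FROM   t
ORDER  BY date;

--     date    | sum_win
-- ------------+---------
--  2023-02-03 |       6
--  2023-02-10 |       7
--  2023-02-17 |      10
--  2023-02-24 |       6
```

Unlike the expected output in the question, the first and last dates are included here with partial sums, because they only have one neighbouring week.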
Assuming weekly rows
CREATE TABLE Table1
([Dates] date, [Searches] int)
;
INSERT INTO Table1
([Dates], [Searches])
VALUES
('2023-02-03 00:00:00', 2),
('2023-02-10 00:00:00', 4),
('2023-02-17 00:00:00', 1),
('2023-02-24 00:00:00', 5)
;
;with cte as (
select dates
, searches
+ lead(searches) over(order by dates)
+ lag(searches) over(order by dates) as sum_searches
from table1)
select * from cte
where sum_searches is not null;
dates      | sum_searches
-----------+-------------
2023-02-10 | 7
2023-02-17 | 10

Assign value to day per month and only display selected days in query result

DB-Fiddle
CREATE TABLE costs (
id SERIAL PRIMARY KEY,
entry_date DATE,
costs DECIMAL
);
INSERT INTO costs
(entry_date, costs)
VALUES
('2020-01-01', '500'),
('2020-02-01', '325'),
('2020-03-01', '200'),
('2020-04-01', '400'),
('2020-05-01', '900'),
('2020-06-01', '700'),
('2020-07-01', '900'),
('2020-08-01', '100'),
('2020-09-01', '300'),
('2020-10-01', '850'),
('2020-11-01', '470'),
('2020-12-01', '800');
Expected result:
date_list | costs
--------------|----------------------------
2020-05-01 | 29.03 (=900/31)
2020-05-02 | 29.03 (=900/31)
2020-05-03 | 29.03 (=900/31)
2020-05-04 | 29.03 (=900/31)
In the table I have costs per month assigned to one day per month.
Now I want to do the following:
Divide the costs by the number of days in the month to get the costs per day.
Only display the dates that are selected in the WHERE-Clause of the query.
With reference to this question I tried this query:
SELECT
gs::date AS entry_date,
costs / date_part('day', entry_date + interval '1 month - 1 day') AS costs
FROM costs,
generate_series(
entry_date,
entry_date + interval '1 month - 1 day',
interval '1 day'
) gs
WHERE entry_date BETWEEN '2020-05-01' AND '2020-05-04'
The query does the first step of the expected result.
However, as you can see in the DB-Fiddle, it shows all days of the month and is not limited to 2020-05-04.
How do I need to change the query to also make the WHERE-condition work correctly?
I think the approach is a little inverted. You want one row per date, so use generate_series() to generate those days. And you can do so explicitly.
Then just do the arithmetic to get what you want:
SELECT gs.entry_date,
c.costs / date_part('day', date_trunc('month', gs.entry_date) + interval '1 month - 1 day') AS costs
FROM costs c join lateral
generate_series('2020-05-01'::date, '2020-05-04'::date, interval '1 day') as gs(entry_date)
on date_trunc('month', gs.entry_date) = c.entry_date;
Here is a db<>fiddle.
If you put the WHERE clause into the same query, it just filters your original table, which results in the May record. This will be used to do all the magic and will be expanded. That's why you get all May days.
You have to put the WHERE into a subquery because first you need to do the calculation and then you can filter it:
SELECT
*
FROM (
SELECT
gs::date AS entry_date,
costs / date_part('day', entry_date + interval '1 month - 1 day') AS costs
FROM costs,
generate_series(
entry_date,
entry_date + interval '1 month - 1 day',
interval '1 day'
) gs
) s
WHERE entry_date BETWEEN '2020-05-01' AND '2020-05-04'

How to get a date difference in Postgres using the date_part option

How to get date time difference in PostgreSQL
I am using the syntax below:
select id, A_column,B_column,
(SELECT count(*) AS count_days_no_weekend
FROM generate_series(B_column ::timestamp , A_column ::timestamp, interval '1 day') the_day
WHERE extract('ISODOW' FROM the_day) < 5) * 24 + DATE_PART('hour', B_column::timestamp-A_column ::timestamp ) as hrs
FROM table req where id='123';
If A_column = 2020-05-20 00:00:00 and B_column = 2020-05-15 00:00:00, I want to get 72 (in hours).
Is there any way to skip weekends (Saturday and Sunday), so that the result is 72 hours (excluding weekend hours)?
I am getting 0, but I need to get 72 hours.
And if A_column = 2020-08-15 12:00:00 and B_column = 2020-08-15 00:00:00, I want to get 12 (in hours).
One option uses a lateral join and generate_series() to enumerate each and every hour between the two timestamps, while filtering out weekends:
select t.a_column, t.b_column, h.count_hours_no_weekend
from mytable t
cross join lateral (
select count(*) count_hours_no_weekend
-- stop one hour before a_column so the end timestamp does not count as an extra hour
from generate_series(t.b_column::timestamp, t.a_column::timestamp - interval '1 hour', interval '1 hour') s(col)
where extract('isodow' from s.col) < 6 -- ISODOW: 1 = Monday ... 7 = Sunday, so < 6 keeps Monday to Friday
) h
where id = 123
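As a quick check of the hour-counting idea against the first example in the question, here is a sketch with the two timestamps written as literals. It assumes Friday counts as a working day (ISODOW numbers Monday as 1 and Sunday as 7, so working days are 1 to 5) and that the end timestamp is exclusive:

```sql
-- Hours from Fri 2020-05-15 00:00 up to (not including) Wed 2020-05-20 00:00,
-- skipping Saturday (6) and Sunday (7).
SELECT count(*) AS count_hours_no_weekend
FROM   generate_series(timestamp '2020-05-15 00:00'
                     , timestamp '2020-05-20 00:00' - interval '1 hour'
                     , interval '1 hour') s(col)
WHERE  extract(isodow FROM s.col) < 6;

-- count_hours_no_weekend = 72   (24 hours each for Friday, Monday and Tuesday)
```

Note that 2020-08-15 from the second example is a Saturday, so a weekend filter like this one returns 0 for it rather than 12.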
I would attack this by calculating the weekend hours to let the database deal with daylight savings time. I would then subtract the intervening weekend hours from the difference between the two date values.
with weekend_days as (
select *, date_part('isodow', ddate) as dow
from table1
cross join lateral
generate_series(
date_trunc('day', b_column),
date_trunc('day', a_column),
interval '1 day') as gs(ddate)
where date_part('isodow', ddate) in (6, 7)
), weekend_time as (
select id,
sum(
least(ddate + interval '1 day', a_column) -
greatest(ddate, b_column)
) as we_ival
from weekend_days
group by id
)
select t.id,
a_column - b_column as raw_difference,
coalesce(we_ival, interval '0') as adjustment,
a_column - b_column -
coalesce(we_ival, interval '0') as adj_difference
-- drive from table1 so that rows without any weekend time are kept
from table1 t
left join weekend_time w on w.id = t.id;
Working fiddle.

PostgreSQL generate month and year series based on table field and fill with nulls if no data for a given month

I want to generate series of month and year from the next month of current year(say, start_month) to 12 months from start_month along with the corresponding data (if any, else return nulls) from another table in PostgreSQL.
SELECT ( ( DATE '2019-03-01' + ( interval '1' month * generate_series(0, 11) ) )
:: DATE ) dd,
extract(year FROM ( DATE '2019-03-01' + ( interval '1' month *
generate_series(0, 11) )
)),
coalesce(SUM(price), 0)
FROM items
WHERE s.date_added >= '2019-03-01'
AND s.date_added < '2020-03-01'
AND item_type_id = 3
GROUP BY 1,
2
ORDER BY 2;
The problem with the above query is that it is giving me the same value for price for all the months. The requirement is that the price column be filled with nulls or zeros if no price data is available for a given month.
Put the generate_series() in the FROM clause. You are summarizing the data -- i.e. calculating the price over the entire range -- and then projecting this on all months. Instead:
SELECT gs.yyyymm,
coalesce(SUM(i.price), 0)
FROM generate_series('2019-03-01'::date, '2020-02-01', INTERVAL '1 MONTH'
) gs(yyyymm) LEFT JOIN
items i
ON gs.yyyymm = DATE_TRUNC('month', i.date_added) AND
i.item_type_id = 3
GROUP BY gs.yyyymm
ORDER BY gs.yyyymm;
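To see the effect of moving generate_series() into the FROM clause, here is a self-contained sketch. The items CTE holds a few made-up rows (the real table and its other columns are not shown in the question), and the series is shortened to four months:

```sql
-- Months without matching items still appear, with a total of 0.
WITH items(date_added, price, item_type_id) AS (
   VALUES (date '2019-03-15', 10, 3)   -- hypothetical sample rows
        , (date '2019-03-20',  5, 3)
        , (date '2019-05-02',  7, 3)
   )
SELECT gs.yyyymm::date            AS month_start
     , coalesce(sum(i.price), 0)  AS total_price
FROM   generate_series(timestamp '2019-03-01'
                     , timestamp '2019-06-01'
                     , interval '1 month') gs(yyyymm)
LEFT   JOIN items i ON date_trunc('month', i.date_added::timestamp) = gs.yyyymm
                   AND i.item_type_id = 3
GROUP  BY gs.yyyymm
ORDER  BY gs.yyyymm;

--  month_start | total_price
-- -------------+-------------
--  2019-03-01  |          15
--  2019-04-01  |           0
--  2019-05-01  |           7
--  2019-06-01  |           0
```

The item_type_id condition sits in the join clause on purpose: in a WHERE clause it would remove the months that have no matching rows.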
You want generate_series in the FROM clause and join with it, somewhat like
SELECT months.m::date, ...
FROM generate_series(
start_month,
start_month + INTERVAL '11 months',
INTERVAL '1 month'
) AS months(m)
LEFT JOIN items
ON months.m::date = items.date_added

Get value zero if data is not there in PostgreSQL

I have a table employee in Postgres:
Query:
SELECT DISTINCT month_last_date,number_of_cases,reopens,csat
FROM employee
WHERE month_last_date >=(date('2017-01-31') - interval '6 month')
AND month_last_date <= date('2017-01-31')
AND agent_id='analyst'
AND name='SAM';
Output:
But if there is no data in the table for a month, I want the column values to be 0 for that month.
Generate all dates you are interested in, LEFT JOIN to the table and default to 0 with COALESCE:
SELECT DISTINCT -- see below
i.month_last_date
, COALESCE(number_of_cases, 0) AS number_of_cases -- see below
, COALESCE(reopens, 0) AS reopens
, COALESCE(csat, 0) AS csat
FROM (
SELECT date '2017-01-31' - i * interval '1 mon' AS month_last_date
FROM generate_series(0, 5) i -- see below
) i
LEFT JOIN employee e ON e.month_last_date = i.month_last_date
AND e.agent_id = 'analyst' -- see below
AND e.name = 'SAM';
Notes
If you add or subtract an interval of 1 month and the same day does not exist in the target month, Postgres defaults to the latest existing day of that month. So this works as desired, you get the last day of each month:
SELECT date '2017-12-31' - i * interval '1 mon' -- note 31
FROM generate_series(0,11) i;
But this does not, you'd get the 28th of each month:
SELECT date '2017-02-28' - i * interval '1 mon' -- note 28
FROM generate_series(0,11) i;
The safe alternative is to subtract 1 day from the first day of the next month, like @Oto demonstrated. Related:
Daily average for the month (needs number of days in month)
Here are two optimized ways to generate a series of last days of the month - up to and including a given month:
1.
SELECT (timestamp '2017-01-01' - i * interval '1 month')::date - 1 AS month_last_date
FROM generate_series(-1, 10) i; -- generate 12 months, off-by-1
Input is the first day of the month - or calculate it from a given date or timestamp with date_trunc():
SELECT date_trunc('month', timestamp '2017-01-17')::date AS this_mon1
Subtracting an interval from a date produces a timestamp. After the cast back to date we can simply subtract an integer to subtract days.
2.
SELECT m::date - 1 AS month_last_date
FROM generate_series(timestamp '2017-02-01' - interval '11 month' -- for 12 months
, timestamp '2017-02-01'
, interval '1 mon') m;
Input is the first day of the next month - or calculate it from any given date or timestamp with:
SELECT date_trunc('month', timestamp '2017-01-17' + interval '1 month')::date AS next_mon1
Related:
How do I determine the last day of the previous month using PostgreSQL?
Create list with first and last day of month for given period
Not sure you actually need DISTINCT. Typically, (agent_id, month_last_date) would be defined unique, then remove DISTINCT ...
Be sure to use the LEFT JOIN correctly. Join conditions go into the join clause, not the WHERE clause:
Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail
Finally, default to 0 with COALESCE where NULL values are filled in by the LEFT JOIN.
Note that COALESCE cannot distinguish between actual NULL values from the right table and NULL values filled in for missing rows. If your columns are not defined NOT NULL, there may be ambiguity to address.
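One possible way to tell the two cases apart is to test a join column of the right table for NULL. A sketch with a made-up two-row employee CTE (only the two columns needed here; row_exists is a name introduced for illustration):

```sql
-- row_exists is false only where the LEFT JOIN found no row at all;
-- a stored NULL in number_of_cases still reports row_exists = true.
WITH employee(month_last_date, number_of_cases) AS (
   VALUES (date '2017-01-31', 12)          -- hypothetical sample rows
        , (date '2016-12-31', NULL::int)
   )
SELECT m.month_last_date
     , COALESCE(e.number_of_cases, 0)  AS number_of_cases
     , e.month_last_date IS NOT NULL   AS row_exists
FROM  (
   SELECT (timestamp '2017-02-01' - i * interval '1 month')::date - 1 AS month_last_date
   FROM   generate_series(0, 2) i
   ) m
LEFT   JOIN employee e ON e.month_last_date = m.month_last_date
ORDER  BY m.month_last_date DESC;

--  month_last_date | number_of_cases | row_exists
-- -----------------+-----------------+------------
--  2017-01-31      |              12 | t
--  2016-12-31      |               0 | t
--  2016-11-30      |               0 | f
```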
As I see it, you need to generate the last days of the last 6 months before a certain date ("2017-01-31" in this case).
If I understand correctly, you can use this query, which generates all of these days:
SELECT (date_trunc('MONTH', mnth) + INTERVAL '1 MONTH - 1 day')::DATE
FROM
generate_series('2017-01-31'::date - interval '6 month', '2017-01-31'::date, '1 month') as mnth;
You just need to LEFT JOIN this query to your existing query to get the desired result.
Please note that this returns 7 records (days), not 6.