Ensuring no dupe ids in query return - sql

So for the following schema:
CREATE TABLE activity (
  id integer NOT NULL,
  start_date date NOT NULL
);
CREATE TABLE account (
  id integer NOT NULL,
  name varchar NOT NULL
);
CREATE TABLE contact (
  id integer NOT NULL,
  account_id integer NOT NULL,
  name varchar NOT NULL
);
CREATE TABLE activity_contact (
  id integer NOT NULL,
  contact_id integer NOT NULL,
  activity_id integer NOT NULL
);
insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');
insert into account(id, name)
values
(1, 'Test Account');
insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');
insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);
You can see that there are 3 activities and each contact is linked to two of them. What I am searching for is the number of activities per account in the previous two months, so I have the following query:
SELECT contact.account_id AS accountid,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;
This returns:
accountid | last_month | prev_month
1 | 3 | 1
However, this is incorrect. There are only 3 activities; it's just that both contacts see activity 1, so that activity is counted twice. Is there a way for me to count each activity id only once so there is no duplication?

Use count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
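Applied directly to your query, the minimal change is just the count expression (a sketch; everything else left exactly as you had it):
SELECT contact.account_id AS accountid,
       count(DISTINCT activity_contact.activity_id) FILTER (WHERE date_trunc('month', activity.start_date) = date_trunc('month', CURRENT_DATE - interval '1 mon')) AS last_month,
       count(DISTINCT activity_contact.activity_id) FILTER (WHERE date_trunc('month', activity.start_date) = date_trunc('month', CURRENT_DATE - interval '2 mons')) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;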
But there is more:
SELECT con.account_id AS accountid
, count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)) AS last_month
, count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
RIGHT JOIN contact con ON con.id = aco.contact_id
-- JOIN account acc ON con.account_id = acc.id -- noise
GROUP BY 1;
db<>fiddle here
Most importantly, filter irrelevant rows early, as an outer WHERE clause would. This can make a big difference for a small selection from a big table.
Here we have to move that predicate into the JOIN clause instead, lest we exclude accounts with no activity. (LEFT JOIN and RIGHT JOIN can both be used, mirroring each other.)
See:
Postgres Left Join with where condition
Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail
Make that filter "sargable", so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.
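For example, a plain index like this (not part of the schema in the question; added here only for illustration) can then serve those range conditions, while the original date_trunc(...) = ... comparison could not:
CREATE INDEX activity_start_date_idx ON activity (start_date);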
Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Unlike other aggregate functions, count() returns 0 (not NULL) for "no rows", so we don't have to do anything extra.
Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it.
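Referential integrity here would mean constraints along these lines (assumed; the schema in the question declares neither primary nor foreign keys):
ALTER TABLE account ADD PRIMARY KEY (id);
ALTER TABLE contact ADD PRIMARY KEY (id);
ALTER TABLE contact ADD CONSTRAINT contact_account_id_fkey FOREIGN KEY (account_id) REFERENCES account (id);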
CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it's a bit more efficient to use LOCALTIMESTAMP to begin with.
Compare with your original to see that this is quite a bit faster.
And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:
Ignoring time zones altogether in Rails and PostgreSQL
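To see that dependency in an interactive session (illustrative only; pick any two time zones):
SET timezone = 'UTC';
SELECT CURRENT_DATE, LOCALTIMESTAMP;
SET timezone = 'Pacific/Auckland';
SELECT CURRENT_DATE, LOCALTIMESTAMP;  -- may already report the next calendar day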
If you are not bound to this particular output format, an unpivoted form (one row per account and month) is simpler, now that we filter rows early:
SELECT con.account_id AS accountid
, date_trunc('month', act.start_date) AS mon
, count(DISTINCT aco.activity_id) AS dist_count
FROM activity act
JOIN activity_contact aco ON aco.activity_id = act.id
AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
AND act.start_date < date_trunc('month', LOCALTIMESTAMP)
RIGHT JOIN contact con ON con.id = aco.contact_id
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
Again, we can include accounts without activity. But months without activity do not show up ...
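If you need those empty months as well, a sketch along these lines (reusing the same range expressions; it still requires the account to have at least one contact) generates the months first and left-joins the activities to them:
SELECT con.account_id AS accountid
     , m.mon::date AS mon
     , count(DISTINCT act.id) AS dist_count
FROM generate_series(date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                   , date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                   , interval '1 mon') AS m(mon)
CROSS JOIN contact con
LEFT JOIN activity_contact aco ON aco.contact_id = con.id
LEFT JOIN activity act ON act.id = aco.activity_id
                      AND act.start_date >= m.mon
                      AND act.start_date <  m.mon + interval '1 mon'
GROUP BY 1, 2
ORDER BY 1, 2 DESC;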

Related

how to aggregate data by month overlapping postgresql

I have a type 2 SCD (slowly changing dimension) table that I join with various other tables, and I am looking to aggregate a sum total from any entity that was active (by active I mean the ones that don't yet have an end_date) during an individual month.
Currently, I have a query similar to this (let's say aggregating data for the months of May and April 2022):
select
count(1) as enitities_agg,
DATE_TRUNC('Month', h.start) as date,
sum(h.price) filter (where c.name='HIGH') as sum_total,
----....----
from
project as p
join class as c on p.class_id = c.id
join stage as s on s.project_id = p.id
join stage_info as si on si.stage_id = s.id
join history as h on h.stage_info_id = si.id
where
h.start <= '2022-06-01' and
h.end_date >= '2022-04-01' and
COALESCE(p.end_date, '2099-01-01') >= '2022-04-01' and
COALESCE(p.start_date, '2099-01-01') <= '2022-06-01' and
COALESCE(s.end, '2099-01-01') >= '2022-04-01' and
h.price is not null and
h.price != 0
group by DATE_TRUNC('Month', h.start)
It aggregates fine, but only the entities whose history starts in May or April, not the ones that overlap those months and are still active.
The problem I have is that some history entities start in April, March, etc., and still haven't ended by May. Because I group by the history start date (group by DATE_TRUNC('Month', h.start)), entities that start earlier and continue to be active after May only show up in the month they started in.
I was trying to do it by generating a series and grouping by the generated month, but I didn't find a way that groups them correctly. Here is an example of one experiment I tried:
from
generate_series('2022-03-01', '2022-07-01', INTERVAL '1 month') as mt
join project as p on COALESCE(p.end_date, '2099-01-01') >= mt and
COALESCE(p.start_date, '2099-01-01') <= mt + INTERVAL '1 month'
join class as c on p.class_id = c.id
join stage as stage on stage.project_id = p.id and
COALESCE(stage.end, '2099-01-01') >= mt
join stage_info as si on si.stage_id = stage.id
join history as h on h.stage_info_id = si.id
where
h.start <= mt and
h.end_date >= mt + INTERVAL '1 month' and
h.price is not null and
h.price != 0
group by mt
How would it be possible to iterate through the history table, aggregate any entities active in a month, group them by that same month, and get something like this?
"enitities_agg" | "date" | "sum_total"
832 | "2022-04-01 00:00:00" | 15432234
1020 | "2022-05-01 00:00:00" | 19979458
It seems your logic is: if any day of the begin_ to _end interval falls into a month, count the entity for that month. This was the hardest part to guess from the desired results.
So I guess you need this:
with dim as (
select
m::date as month_start
,(date_trunc('month', m) + interval '1 month - 1 day')::date as month_end
,to_char(date_trunc('month', m), 'Mon') as month
from generate_series('2022-01-01', '2022-08-01', INTERVAL '1 month') as m
)
SELECT
dim.month
, sum(coalesce(t.price, 0)) as sum_price
FROM dim
left join test as t
on t.begin_ <= dim.month_end
and t._end >= dim.month_start
group by dim.month_start, dim.month
order by dim.month_start, dim.month
;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=614030d4db5e03876f693a9a8a5ff122
You want all history entities that were active during May 2022? If so, the following may help.
daterange(h.start, h.end_date, '[]') && daterange('2022-05-01', '2022-06-01', '[]');
demo:
CREATE temp TABLE test (
  begin_ date,
  _end date
);
INSERT INTO test
VALUES ('2022-01-01', '2022-05-01');
INSERT INTO test
VALUES ('2022-01-01', '2022-05-11');
INSERT INTO test
VALUES ('2022-05-01', '2022-07-11');
INSERT INTO test
VALUES ('2022-06-11', '2022-07-11');
SELECT
*,
daterange(begin_, _end, '[]')
FROM
test t
WHERE
daterange(t.begin_, t._end, '[]') && daterange('2022-05-01', '2022-05-31', '[]');
&& range operator reference: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
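Building on that, a sketch that buckets the overlapping entities per month against the same demo table (only the count; a real query would add the price sums and other filters from the question):
SELECT g.m::date AS month_start
     , count(t.begin_) AS entities_agg   -- 0 for months with no overlapping entity
FROM generate_series(timestamp '2022-03-01', timestamp '2022-07-01', interval '1 month') AS g(m)
LEFT JOIN test t
       ON daterange(t.begin_, t._end, '[]')
       && daterange(g.m::date, (g.m + interval '1 month')::date, '[)')
GROUP BY 1
ORDER BY 1;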

Postgresql left join date_trunc with default values

I have 3 tables that I'm querying based on different conditions. I have from and to params, and I use them to build the time range in which I look for data in those tables.
For instance, if from equals '2020-07-01' and to equals '2020-08-01', I expect to receive the row values of the tables grouped by week. If some of the weeks don't have records, I want to return 0 for them, and if several tables have records for the same week, I'd like to sum them.
Currently I have this:
SELECT d.day, COALESCE(t.total, 0)
FROM (
SELECT day::date
FROM generate_series(timestamp '2020-07-01',
timestamp '2020-08-01',
interval '1 week') day
) d
LEFT JOIN (
SELECT date AS day,
SUM(total)
FROM table1
WHERE id = '1'
AND date BETWEEN '2020-07-01' AND '2020-08-01'
GROUP BY day
) t USING (day)
ORDER BY d.day;
I'm generating a series of dates grouped by week, and on top of that I'm adding a left join. Now for some reason it only works if the dates match completely; otherwise COALESCE(t.total, 0) returns 0 even if the SUM(total) for that week is not 0.
In the same way I'm applying this LEFT JOIN, I'm using other left joins with other tables in the same query, so I'm running into the same problem there.
Please see if this works for you. Whenever you find yourself aggregating more than once, ask yourself whether it is necessary.
Rather than try to match on discrete days, use time ranges.
with limits as (
select '2020-07-01'::timestamp as dt_start,
'2020-08-01'::timestamp as dt_end
), weeks as (
SELECT x.day::date as day, least(x.day::date + 7, dt_end::date) as day_end
FROM limits l
CROSS JOIN LATERAL
generate_series(l.dt_start, l.dt_end, interval '1 week') as x(day)
WHERE x.day::date != least(x.day::date + 7, dt_end::date)
), t1 as (
select w.day,
sum(coalesce(t.total, 0)) as t1total
from weeks w
left join table1 t
on t.id = 1
and t.date >= w.day
and t.date < w.day_end
group by w.day
), t2 as (
select w.day,
sum(coalesce(t.sum_measure, 0)) as t2total
from weeks w
left join table2 t
on t.something = 'whatever'
and t.date >= w.day
and t.date < w.day_end
group by w.day
)
select t1.day,
t1.t1total,
t2.t2total
from t1
join t2 on t2.day = t1.day;
You can keep adding tables like that with CTEs.
My earlier example with multiple left joins was bad because it blows out the rows due to a lack of join conditions between the left-joined tables.
There is an interesting corner case for e.g. 2019-02-01 to 2019-03-01, which returns an empty interval as the last week. I have updated the query to filter that out.
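To see that corner case, this is what the week series looks like for those limits (the last row is the empty interval that the WHERE clause in the weeks CTE drops):
SELECT x.day::date AS day
     , least(x.day::date + 7, date '2019-03-01') AS day_end
FROM generate_series(timestamp '2019-02-01', timestamp '2019-03-01', interval '1 week') AS x(day);
-- last row: 2019-03-01 | 2019-03-01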

grouping by column but getting multiple results for each

I am trying to calculate the median response time for conversations on each date for the last X days.
I use the query below, but for some reason it generates multiple rows with the same date.
with grouping as (
SELECT a.id, d.date, extract(epoch from (first_response_at - started_at)) as response_time
FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 2) AS offs
) d
LEFT OUTER JOIN apps a on true
LEFT OUTER JOIN conversations c ON (d.date=to_char(date_trunc('day'::varchar, c.started_at), 'YYYY-MM-DD')) and a.id = c.app_id
and c.app_id = a.id and c.first_response_at > (current_date - (2 || ' days')::interval)::date
)
select
*
from grouping
where grouping.id = 'ASnYW1-RgCl0I'
Any ideas?
First a number of issues with your query, assuming there aren't any parts you haven't shown us:
You don't need a CTE for this query.
From table apps you only use column id whose value is the same as c.app_id. You can remove the table apps and select c.app_id for the same result.
When you use to_char() you do not first have to date_trunc() to a date, the to_char() function handles that.
generate_series() also works with timestamps. Just enter day values with an interval and cast the end result to date before using it.
So, removing all the flotsam, we end up with this, which does exactly the same as the query in your question, but now we can at least see what is going on.
SELECT c.app_id, d.date::date AS date,
extract(epoch from (first_response_at - started_at)) AS response_time
FROM generate_series(CURRENT_DATE - 2, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN conversations c ON d.date::date = c.started_at::date
AND c.app_id = 'ASnYW1-RgCl0I'
AND c.first_response_at > CURRENT_DATE - 2;
You don't calculate the median response time anywhere, so that is a big problem you need to solve. This only requires data from table conversations and would look somewhat like this to calculate the median response time for the past 2 days:
SELECT app_id, started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = 'ASnYW1-RgCl0I'
AND first_response_at > CURRENT_DATE - 2
GROUP BY 1, 2;
When we fold the two queries, and put the parameters handily in a single place, this is the final result:
SELECT p.id, d.date::date AS date,
extract(epoch from (c.median_response)) AS response_time
FROM (VALUES ('ASnYW1-RgCl0I', 2)) p(id, days)
JOIN generate_series(CURRENT_DATE - p.days, CURRENT_DATE, interval '1 day') d(date) ON true
LEFT JOIN LATERAL (
SELECT started_at::date AS start_date,
percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = p.id
AND first_response_at > CURRENT_DATE - p.days
GROUP BY 1) c ON d.date::date = c.start_date;
If you want to change the id of the app or the number of days to look back, you only have to change the VALUES clause accordingly. You can also wrap the whole thing in a SQL function and convert the VALUES clause into two parameters.
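A sketch of such a function (the function name, parameter names, and the assumption that app_id is text are mine, not from the question):
CREATE OR REPLACE FUNCTION median_response_per_day(_app_id text, _days int)
  RETURNS TABLE (id text, day date, response_time numeric)
  LANGUAGE sql STABLE AS
$func$
SELECT _app_id, d.date::date,
       extract(epoch from c.median_response)::numeric
FROM generate_series(CURRENT_DATE - _days, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN LATERAL (
   SELECT started_at::date AS start_date,
          percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
   FROM conversations
   WHERE app_id = _app_id
   AND first_response_at > CURRENT_DATE - _days
   GROUP BY 1) c ON d.date::date = c.start_date;
$func$;
Then call it like: SELECT * FROM median_response_per_day('ASnYW1-RgCl0I', 2);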

Retrieve IDs with a minimum time gap between consecutive rows

I have the following event table in Postgres 9.3:
CREATE TABLE event (
  event_id integer PRIMARY KEY,
  user_id integer,
  event_type varchar,
  event_time timestamptz
);
My goal is to retrieve all user_id's with a gap of at least 30 days between any of their events (or between their last event and the current time). An additional complication is that I only want users for whom one of these gaps occurs after they performed a certain event_type, 'convert'. How can this be done easily?
Some example data in the event table might look like:
INSERT INTO event (event_id, user_id, event_type, event_time)
VALUES
(10, 1, 'signIn', '2015-05-05 00:11'),
(11, 1, 'browse', '2015-05-05 00:12'), -- no 'convert' event
(20, 2, 'signIn', '2015-06-07 02:35'),
(21, 2, 'browse', '2015-06-07 02:35'),
(22, 2, 'convert', '2015-06-07 02:36'), -- only 'convert' event
(23, 2, 'signIn', '2015-08-10 11:00'), -- gap of >= 30 days
(24, 2, 'signIn', '2015-08-11 11:00'),
(30, 3, 'convert', '2015-08-07 02:36'), -- starting with 1st 'convert' event
(31, 3, 'signIn', '2015-08-07 02:36'),
(32, 3, 'convert', '2015-08-08 02:36'),
(33, 3, 'signIn', '2015-08-12 11:00'), -- all gaps below 30 days
(34, 3, 'browse', '2015-08-12 11:00'), -- gap until today (2015-08-20) too small
(40, 4, 'convert', '2015-05-07 02:36'),
(41, 4, 'signIn', '2015-05-12 11:00'); -- gap until today (2015-08-20) >= 30 days
Expected result:
user_id
--------
2
4
One way to do it:
SELECT user_id
FROM (
SELECT user_id
, lead(e.event_time, 1, now()) OVER (PARTITION BY e.user_id ORDER BY e.event_time)
- event_time AS gap
FROM ( -- only users with 'convert' event
SELECT user_id, min(event_time) AS first_time
FROM event
WHERE event_type = 'convert'
GROUP BY 1
) e1
JOIN event e USING (user_id)
WHERE e.event_time >= e1.first_time
) sub
WHERE gap >= interval '30 days'
GROUP BY 1;
The window function lead() allows you to provide a default value that is used when there is no "next row", which is convenient to cover your additional requirement "or between their last event and the current time".
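To see the default argument in isolation (a throwaway query on the same event table; the last event per user gets now() as its "next" value):
SELECT user_id, event_time
     , lead(event_time, 1, now()) OVER (PARTITION BY user_id ORDER BY event_time) AS next_event_or_now
FROM event
ORDER BY user_id, event_time;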
Indexes
You should at least have an index on (user_id, event_time) if your table is big:
CREATE INDEX event_user_time_idx ON event(user_id, event_time);
If you do that often and the event_type 'convert' is rare, add another partial index:
CREATE INDEX event_user_time_convert_idx ON event(user_id, event_time)
WHERE event_type = 'convert';
For many events per user
And only if gaps of 30 days are common (not a rare case).
Indexes become even more important.
Try this recursive CTE for better performance:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT DISTINCT ON (user_id)
user_id, event_time, interval '0 days' AS gap
FROM event
WHERE event_type = 'convert'
ORDER BY user_id, event_time
)
UNION ALL
SELECT c.user_id, e.event_time, COALESCE(e.event_time, now()) - c.event_time
FROM cte c
LEFT JOIN LATERAL (
SELECT e.event_time
FROM event e
WHERE e.user_id = c.user_id
AND e.event_time > c.event_time
ORDER BY e.event_time
LIMIT 1 -- the next later event
) e ON true -- add 1 row after last to consider gap till "now"
WHERE c.event_time IS NOT NULL
AND c.gap < interval '30 days'
)
SELECT * FROM cte
WHERE gap >= interval '30 days';
It has considerably more overhead, but can stop - per user - at the first gap that's big enough. If that gap is the one between the last event and now, then event_time in the result is NULL.
New SQL Fiddle with more revealing test data demonstrating both queries.
Detailed explanation in these related answers:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
SQL Fiddle
This is another way, probably not as neat as Erwin's, but it has all the steps separated, so it is easy to adapt.
include_today: adds a dummy event per user to represent the current date.
event_convert: calculates the first time the 'convert' event appears for each user_id (in this case only user_id = 2222).
event_row: assigns a unique consecutive number to each event, starting from 1 for each user_id.
The last part joins it all together, using rnum = rnum + 1 so the date difference between consecutive events can be calculated.
The result also shows both events involved in each 30-day gap, so you can check whether that is the result you want.
WITH include_today as (
(SELECT null::integer AS event_id, user_id, 'today' AS event_type, current_date AS event_time -- null id so the UNION column types match
FROM users)
UNION
(SELECT *
FROM event)
),
event_convert as (
SELECT user_id, MIN(event_time) min_time
FROM event
WHERE event_type = 'convert'
GROUP BY user_id
),
event_row as (
SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY event_time desc) as rnum
FROM
include_today
)
SELECT
A.user_id,
A.event_id eventA,
A.event_type typeA,
A.event_time timeA,
B.event_id eventB,
B.event_type typeB,
B.event_time timeB,
(B.event_time - A.event_time) days
FROM
event_convert e
Inner Join event_row A
ON e.user_id = A.user_id and e.min_time <= A.event_time
Inner Join event_row B
ON A.rnum = B.rnum + 1
AND A.user_id = B.user_id
WHERE
(B.event_time - A.event_time) > interval '30 days'
ORDER BY 1,4

Join a count query on generate_series() and retrieve Null values as '0'

I want to count IDs per month using generate_series(). This query works in PostgreSQL 9.1:
SELECT (to_char(serie,'yyyy-mm')) AS year, sum(amount)::int AS eintraege FROM (
SELECT
COUNT(mytable.id) as amount,
generate_series::date as serie
FROM mytable
RIGHT JOIN generate_series(
(SELECT min(date_from) FROM mytable)::date,
(SELECT max(date_from) FROM mytable)::date,
interval '1 day') ON generate_series = date(date_from)
WHERE version = 1
GROUP BY generate_series
) AS foo
GROUP BY Year
ORDER BY Year ASC;
This is my output:
"2006-12" | 4
"2007-02" | 1
"2007-03" | 1
But what I want to get is this output ('0' value in January):
"2006-12" | 4
"2007-01" | 0
"2007-02" | 1
"2007-03" | 1
Months without id should be listed nevertheless.
Any ideas how to solve this?
Sample data:
drop table if exists mytable;
create table mytable(id bigint, version smallint, date_from timestamp);
insert into mytable(id, version, date_from) values
(4084036, 1, '2006-12-22 22:46:35'),
(4084938, 1, '2006-12-23 16:19:13'),
(4084938, 2, '2006-12-23 16:20:23'),
(4084939, 1, '2006-12-23 16:29:14'),
(4084954, 1, '2006-12-23 16:28:28'),
(4250653, 1, '2007-02-12 21:58:53'),
(4250657, 1, '2007-03-12 21:58:53')
;
Untangled, simplified and fixed, it might look like this:
SELECT to_char(s.tag,'yyyy-mm') AS monat
, count(t.id) AS eintraege
FROM (
SELECT generate_series(min(date_from)::date
, max(date_from)::date
, interval '1 day'
)::date AS tag
FROM mytable t
) s
LEFT JOIN mytable t ON t.date_from::date = s.tag AND t.version = 1
GROUP BY 1
ORDER BY 1;
db<>fiddle here
Among all the noise, misleading identifiers, and unconventional formatting, the actual problem was hidden here:
WHERE version = 1
You made correct use of RIGHT [OUTER] JOIN. But adding a WHERE clause that requires an existing row from mytable effectively converts the RIGHT [OUTER] JOIN into an [INNER] JOIN.
Move that filter into the JOIN condition to make it work.
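For reference, the minimal change to your original query is just that (a sketch, with everything else left as you wrote it):
SELECT to_char(serie, 'yyyy-mm') AS year, sum(amount)::int AS eintraege
FROM (
   SELECT count(mytable.id) AS amount
        , generate_series::date AS serie
   FROM mytable
   RIGHT JOIN generate_series(
      (SELECT min(date_from) FROM mytable)::date,
      (SELECT max(date_from) FROM mytable)::date,
      interval '1 day') ON generate_series = date(date_from)
                        AND version = 1   -- moved here from the WHERE clause
   GROUP BY generate_series
   ) AS foo
GROUP BY year
ORDER BY year;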
I simplified some other things while being at it.
Better yet:
SELECT to_char(mon, 'yyyy-mm') AS monat
, COALESCE(t.ct, 0) AS eintraege
FROM (
SELECT date_trunc('month', date_from)::date AS mon
, count(*) AS ct
FROM mytable
WHERE version = 1
GROUP BY 1
) t
RIGHT JOIN (
SELECT generate_series(date_trunc('month', min(date_from))
, max(date_from)
, interval '1 mon')::date
FROM mytable
) m(mon) USING (mon)
ORDER BY mon;
db<>fiddle here
It's much cheaper to aggregate first and join later - joining one row per month instead of one row per day.
It's cheaper to base GROUP BY and ORDER BY on the date value instead of the rendered text.
count(*) is a bit faster than count(id), while equivalent in this query.
generate_series() is a bit faster and safer when based on timestamp instead of date. See:
Generating time series between two dates in PostgreSQL