I have a table of events whose schema and data distribution are very similar to this artificial table, which can easily be generated locally:
CREATE TABLE events AS
WITH args AS (
SELECT
300 AS scale_factor, -- feel free to reduce this to speed up local testing
1000 AS pa_count,
1 AS l_count_min,
29 AS l_count_rand,
10 AS c_count,
10 AS pr_count,
3 AS r_count,
'10 days'::interval AS time_range -- edit 2017-05-02: the real data set has years worth of data here, but the query time ranges stay small (a couple days)
)
SELECT
p.c_id,
'ABC'||lpad(p.pa_id::text, 13, '0') AS pa_id,
'abcdefgh-'||((random()*(SELECT pr_count-1 FROM args)+1))::int AS pr_id,
((random()*(SELECT r_count-1 FROM args)+1))::int AS r,
'2017-01-01Z00:00:00'::timestamp without time zone + random()*(SELECT time_range FROM args) AS t
FROM (
SELECT
pa_id,
((random()*(SELECT c_count-1 FROM args)+1))::int AS c_id,
(random()*(SELECT l_count_rand FROM args)+(SELECT l_count_min FROM args))::int AS l_count
FROM generate_series(1, (SELECT pa_count*scale_factor FROM args)) pa_id
) p
JOIN LATERAL (
SELECT generate_series(1, p.l_count)
) l(id) ON (true);
Excerpt from SELECT * FROM events:
What I need is a query that selects all rows for a given c_id in a given time range of t, then filters them down to only include the most recent rows (by t) for each unique pr_id and pa_id combination, and then counts the number of pr_id and r combinations of those rows.
That's quite a mouthful, so here are 3 SQL queries that I came up with that produce the desired results:
WITH query_a AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
),
query_b AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT
pr_id,
pa_id,
first_not_null(r ORDER BY t DESC) AS r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
GROUP BY
1,
2
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
),
query_c AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT
pr_id,
pa_id,
first_not_null(r) AS r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
GROUP BY
1,
2
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
)
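To cross-check any of the three queries on a small sample, the shared intent can be restated in a few lines of plain Python. This is only a reference sketch: the function name count_latest and the sample rows are made up for illustration, and it assumes rows arrive as (c_id, pa_id, pr_id, r, t) tuples.

```python
from collections import Counter
from datetime import datetime


def count_latest(rows, c_id, t_from, t_to):
    """Count (pr_id, r) combinations over the latest row of each (pr_id, pa_id)."""
    latest = {}  # (pr_id, pa_id) -> (t, r) of the most recent row seen so far
    for row_c_id, pa_id, pr_id, r, t in rows:
        # Same filter as the WHERE clause: one c_id, half-open range of t.
        if row_c_id != c_id or not (t_from <= t < t_to):
            continue
        key = (pr_id, pa_id)
        if key not in latest or t > latest[key][0]:
            latest[key] = (t, r)
    # One count per (pr_id, pa_id) group, bucketed by (pr_id, r).
    return Counter((pr_id, r) for (pr_id, _), (_, r) in latest.items())


# Hypothetical sample rows, not taken from the real table.
rows = [
    (5, "ABC1", "abcdefgh-1", 1, datetime(2017, 1, 3, 10)),
    (5, "ABC1", "abcdefgh-1", 2, datetime(2017, 1, 4, 10)),  # latest for ABC1
    (5, "ABC2", "abcdefgh-1", 2, datetime(2017, 1, 5, 10)),  # latest in range for ABC2
    (5, "ABC2", "abcdefgh-1", 3, datetime(2017, 1, 7, 10)),  # outside the range
    (6, "ABC3", "abcdefgh-1", 1, datetime(2017, 1, 4, 10)),  # other c_id
]
result = count_latest(rows, 5, datetime(2017, 1, 3), datetime(2017, 1, 6))
```

With these rows, both ABC1 and ABC2 end on r = 2 within the range, so the only combination counted is ("abcdefgh-1", 2) with a quantity of 2.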
And here is the custom aggregate function used by query_b and query_c, as well as what I believe to be the most optimal index, settings and conditions:
CREATE FUNCTION first_not_null_agg(before anyelement, value anyelement) RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT
AS $_$
SELECT $1;
$_$;
CREATE AGGREGATE first_not_null(anyelement) (
SFUNC = first_not_null_agg,
STYPE = anyelement
);
CREATE INDEX events_idx ON events USING btree (c_id, t DESC, pr_id, pa_id, r);
VACUUM ANALYZE events;
SET work_mem='128MB';
My dilemma is that query_c outperforms query_a and query_b by a factor of > 6x, but is technically not guaranteed to produce the same result as the other queries (notice the missing ORDER BY in the first_not_null aggregate). However, in practice it seems to pick a query plan that I believe to be correct and most optimal.
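The order dependence can be illustrated outside the database. The sketch below mimics first_not_null in Python (the helper and the sample rows are hypothetical) and shows that the result only matches "latest by t" when the rows happen to arrive newest-first, which is what the index scan on (c_id, t DESC, ...) appears to deliver for a fixed c_id.

```python
def first_not_null(values):
    """Mimic the aggregate: keep the first non-NULL input, ignore the rest."""
    for value in values:
        if value is not None:
            return value
    return None


# (t, r) pairs for one hypothetical (pr_id, pa_id) group, in arbitrary order.
group = [("2017-01-03", 1), ("2017-01-05", 3), ("2017-01-04", 2)]

# Rows delivered newest-first, as a scan over an index on t DESC would.
newest_first = sorted(group, key=lambda tr: tr[0], reverse=True)
from_index_order = first_not_null(r for _, r in newest_first)

# Rows delivered in some other order, e.g. after a different plan is chosen.
from_other_order = first_not_null(r for _, r in group)
```

Here from_index_order is 3, the r of the latest row, while from_other_order is 1. Nothing in query_c itself forces the first ordering, so the result depends on the plan the optimizer happens to pick.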
Below are the EXPLAIN (ANALYZE, VERBOSE) outputs for all 3 queries on my local machine:
query_a:
CTE Scan on query_a (cost=25810.77..26071.25 rows=13024 width=44) (actual time=3329.921..3329.934 rows=30 loops=1)
Output: query_a.pr_id, query_a.r, query_a.quantity
CTE query_a
-> Sort (cost=25778.21..25810.77 rows=13024 width=23) (actual time=3329.918..3329.921 rows=30 loops=1)
Output: events.pr_id, events.r, (count(1))
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=24757.86..24888.10 rows=13024 width=23) (actual time=3329.849..3329.892 rows=30 loops=1)
Output: events.pr_id, events.r, count(1)
Group Key: events.pr_id, events.r
-> Unique (cost=21350.90..22478.71 rows=130237 width=40) (actual time=3168.656..3257.299 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
-> Sort (cost=21350.90..21726.83 rows=150375 width=40) (actual time=3168.655..3209.095 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Sort Key: events.pr_id, events.pa_id, events.t DESC
Sort Method: quicksort Memory: 18160kB
-> Index Only Scan using events_idx on public.events (cost=0.56..8420.00 rows=150375 width=40) (actual time=0.038..101.584 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.316 ms
Execution time: 3331.082 ms
query_b:
CTE Scan on query_b (cost=67140.75..67409.53 rows=13439 width=44) (actual time=3761.077..3761.090 rows=30 loops=1)
Output: query_b.pr_id, query_b.r, query_b.quantity
CTE query_b
-> Sort (cost=67107.15..67140.75 rows=13439 width=23) (actual time=3761.074..3761.081 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), (count(1))
Sort Key: (count(1)), (first_not_null(events.r ORDER BY events.t DESC)), events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=66051.24..66185.63 rows=13439 width=23) (actual time=3760.997..3761.049 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), count(1)
Group Key: events.pr_id, first_not_null(events.r ORDER BY events.t DESC)
-> GroupAggregate (cost=22188.98..63699.49 rows=134386 width=32) (actual time=2961.471..3671.669 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, first_not_null(events.r ORDER BY events.t DESC)
Group Key: events.pr_id, events.pa_id
-> Sort (cost=22188.98..22578.94 rows=155987 width=40) (actual time=2961.436..3012.440 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Sort Key: events.pr_id, events.pa_id
Sort Method: quicksort Memory: 18160kB
-> Index Only Scan using events_idx on public.events (cost=0.56..8734.27 rows=155987 width=40) (actual time=0.038..97.336 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.385 ms
Execution time: 3761.852 ms
query_c:
CTE Scan on query_c (cost=51400.06..51660.54 rows=13024 width=44) (actual time=524.382..524.395 rows=30 loops=1)
Output: query_c.pr_id, query_c.r, query_c.quantity
CTE query_c
-> Sort (cost=51367.50..51400.06 rows=13024 width=23) (actual time=524.380..524.384 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r)), (count(1))
Sort Key: (count(1)), (first_not_null(events.r)), events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=50347.14..50477.38 rows=13024 width=23) (actual time=524.311..524.349 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r)), count(1)
Group Key: events.pr_id, first_not_null(events.r)
-> HashAggregate (cost=46765.62..48067.99 rows=130237 width=32) (actual time=401.480..459.962 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, first_not_null(events.r)
Group Key: events.pr_id, events.pa_id
-> Index Only Scan using events_idx on public.events (cost=0.56..8420.00 rows=150375 width=32) (actual time=0.027..109.459 rows=153795 loops=1)
Output: events.c_id, events.t, events.pr_id, events.pa_id, events.r
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.296 ms
Execution time: 525.566 ms
Broadly speaking, I believe that the index above should allow query_a and query_b to be executed without the Sort nodes that slow them down, but so far I've failed to convince the postgres query optimizer to do my bidding.
I'm also somewhat confused about the t column not being included in the Sort key for query_b, considering that quicksort is not stable. It seems like this could yield the wrong results.
I've verified that all 3 queries generate the same results running the following queries and verifying they produce an empty result set:
SELECT * FROM query_a
EXCEPT
SELECT * FROM query_b;
and
SELECT * FROM query_a
EXCEPT
SELECT * FROM query_c;
I'd consider query_a to be the canonical query when in doubt.
I greatly appreciate any input on this. I've actually found a terribly hacky workaround to achieve acceptable performance in my application, but this problem continues to haunt me in my sleep (and in fact vacation, which I'm currently on) ... 😬.
FWIW, I've looked at many similar questions and answers which have guided my current thinking, but I believe there is something unique about the two column grouping (pr_id, pa_id) and having to sort by a 3rd column (t) that doesn't make this a duplicate question.
Edit: The outer queries in the example may be entirely irrelevant to the question, so feel free to ignore them if it helps.
You wrote: "I'd consider query_a to be the canonical query when in doubt."
I found a way to make query_a run in about half a second.
Your inner query from query_a
SELECT DISTINCT ON (pr_id, pa_id)
needs to go with
ORDER BY pr_id, pa_id, t DESC
especially with pr_id and pa_id listed first.
c_id = 5 is constant, but your index events_idx (c_id, t DESC, pr_id, pa_id, r) cannot deliver the rows in the required order, because its columns are not organized as (pr_id, pa_id, t DESC), which your ORDER BY clause demands.
If you had an index on at least (pr_id, pa_id, t DESC) then the sorting does not have to happen, because the ORDER BY condition aligns with this index.
So here is what I did.
CREATE INDEX events_idx2 ON events (c_id, pr_id, pa_id, t DESC, r);
This index can be used by your inner query - at least in theory.
Unfortunately the query planner thinks that it's better to reduce the number of rows by using index events_idx with c_id and x <= t < y.
Postgres does not have index hints, so we need a different way to convince the query planner to take the new index events_idx2.
One way to force the use of events_idx2 is to make the other index more expensive.
This can be done by removing the last column r from events_idx, which makes it unusable for query_a (at least unusable without loading the pages from the heap).
It is counter-intuitive to move the t column later in the index layout, because usually the first columns will be chosen for = and ranges, which c_id and t qualify well for.
However, your ORDER BY (pr_id, pa_id, t DESC) mandates at least this subset as-is in your index. Of course we still put the c_id first to reduce the rows as soon as possible.
You can still have an index on (c_id, t DESC, pr_id, pa_id), if you need, but it cannot be used in query_a.
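The effect of the column order can be mimicked with sorted lists in Python. This is a toy model, not how PostgreSQL stores indexes: each list stands for the order in which an index scan would return entries for a fixed c_id, and all names and rows are hypothetical.

```python
from itertools import groupby

# Hypothetical entries for c_id = 5: (pr_id, pa_id, t as day of January 2017, r).
entries = [
    ("pr-1", "pa-1", 3, 1),
    ("pr-1", "pa-1", 5, 2),
    ("pr-1", "pa-2", 4, 3),
    ("pr-2", "pa-1", 4, 1),
    ("pr-2", "pa-1", 3, 2),
]

# Order delivered by an index on (c_id, t DESC, pr_id, pa_id, r).
by_t = sorted(entries, key=lambda e: (-e[2], e[0], e[1], e[3]))

# Order delivered by an index on (c_id, pr_id, pa_id, t DESC, r).
by_group = sorted(entries, key=lambda e: (e[0], e[1], -e[2], e[3]))


def group_key(entry):
    return (entry[0], entry[1])


def is_grouped(ordered):
    """True if every (pr_id, pa_id) pair forms one contiguous run."""
    keys = [key for key, _ in groupby(ordered, key=group_key)]
    return len(keys) == len(set(keys))


def first_per_group(ordered):
    """One streaming pass, similar in spirit to the Unique node."""
    return [next(run) for _, run in groupby(ordered, key=group_key)]
```

Only by_group keeps each (pr_id, pa_id) pair contiguous with its newest row first, so first_per_group(by_group) yields the latest rows in a single pass. With by_t the pairs are interleaved, which is why a Sort node is needed before Unique.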
Here is the query plan for query_a with events_idx2 used and events_idx deleted.
Look for events_c_id_pr_id_pa_id_t_r_idx, which is how PG names indices automatically, when you don't give them a name.
I like it this way, because I can see the order of the columns in the name of the index in every query plan.
Sort (cost=30076.71..30110.75 rows=13618 width=23) (actual time=426.898..426.914 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=29005.43..29141.61 rows=13618 width=23) (actual time=426.820..426.859 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.56..26622.33 rows=136177 width=40) (actual time=0.037..328.828 rows=117204 loops=1)
-> Index Only Scan using events_c_id_pr_id_pa_id_t_r_idx on events (cost=0.56..25830.50 rows=158366 width=40) (actual time=0.035..178.594 rows=154940 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.201 ms
Execution time: 427.017 ms
(11 rows)
Planning is near-instantaneous and execution is sub-second, because the index matches the ORDER BY of the inner query.
With good performance on query_a there is no need for an additional function to make alternative queries query_b and query_c faster.
Remarks:
Somehow I could not find a primary key in your relation.
The aforementioned proposed solution works without any primary key assumption.
I still think that you have some primary key, but maybe forgot to mention it.
The natural key is pa_id. Each pa_id refers to "a thing" that has ~1...30 events recorded about it.
If pa_id is in relation to multiple c_id's, then pa_id alone cannot be key.
If pr_id and r are data, then maybe (c_id, pa_id, t) is unique key?
Also your index events_idx is not created unique, but spans all columns of the relation, so you could have multiple equal rows - do you want to allow that?
If you really need both indices events_idx and the proposed events_idx2, then you will have the data stored 3 times in total (twice in indices, once on the heap).
Since this really is a tricky query optimization, I kindly ask you to at least consider adding a bounty for whoever answers your question, also since it has been sitting on SO without answer for quite some time.
EDIT A
I inserted another set of data using your excellently generated setup above, basically doubling the number of rows.
The dates started from '2017-01-10' this time.
All other parameters stayed the same.
Here is a partial index on the time attribute and its query behaviour.
CREATE INDEX events_timerange ON events (c_id, pr_id, pa_id, t DESC, r) WHERE '2017-01-03' <= t AND t < '2017-01-06';
Sort (cost=12510.07..12546.55 rows=14591 width=23) (actual time=361.579..361.595 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=11354.99..11500.90 rows=14591 width=23) (actual time=361.503..361.543 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.55..8801.60 rows=145908 width=40) (actual time=0.026..265.084 rows=118571 loops=1)
-> Index Only Scan using events_timerange on events (cost=0.55..8014.70 rows=157380 width=40) (actual time=0.024..115.265 rows=155800 loops=1)
Index Cond: (c_id = 5)
Heap Fetches: 0
Planning time: 0.214 ms
Execution time: 361.692 ms
(11 rows)
Without the index events_timerange (that's the regular full index).
Sort (cost=65431.46..65467.93 rows=14591 width=23) (actual time=472.809..472.824 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=64276.38..64422.29 rows=14591 width=23) (actual time=472.732..472.776 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.56..61722.99 rows=145908 width=40) (actual time=0.024..374.392 rows=118571 loops=1)
-> Index Only Scan using events_c_id_pr_id_pa_id_t_r_idx on events (cost=0.56..60936.08 rows=157380 width=40) (actual time=0.021..222.987 rows=155800 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.171 ms
Execution time: 472.925 ms
(11 rows)
With the partial index the runtime is about 100 ms faster, even though the whole table is now twice as big.
(Note: the second time around it was only 50 ms faster. The advantage should grow as more events are recorded, though, because the queries requiring the full index will become slower, as you suspect (and I agree).)
Also, on my machine, the full index is 810 MB for two inserts (create table + additional from 2017-01-10).
The partial index WHERE 2017-01-03 <= t < 2017-01-06 is only 91 MB.
Maybe you can create partial indices on a monthly or yearly basis?
Depending on what time range is queried, maybe only recent data needs to be indexed, or otherwise only old data partially?
I also tried partial indexing with WHERE c_id = 5, so partitioning by c_id.
Sort (cost=51324.27..51361.47 rows=14880 width=23) (actual time=550.579..550.592 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=50144.21..50293.01 rows=14880 width=23) (actual time=550.481..550.528 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.42..47540.21 rows=148800 width=40) (actual time=0.050..443.393 rows=118571 loops=1)
-> Index Only Scan using events_cid on events (cost=0.42..46736.42 rows=160758 width=40) (actual time=0.047..269.676 rows=155800 loops=1)
Index Cond: ((t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.366 ms
Execution time: 550.706 ms
(11 rows)
So partial indexing may also be a viable option.
If you get ever more data, then you may also consider partitioning, for example all rows aged two years and older into a separate table or something.
I don't think Block Range Indexes (BRIN) would help here, though.
If your machine is beefier than mine, then you can just insert 10 times the amount of data and check how the regular full index behaves on a growing table first.
[EDITED]
OK, as this depends on your data distribution, here is another way to do it.
First add the following index :
CREATE INDEX events_idx2 ON events (c_id, t DESC, pr_id, pa_id, r);
This extracts MAX(t) per group as quickly as possible, on the assumption that the resulting subset is much smaller than the parent table it is joined back to. It will probably be slower if that subset is not small.
SELECT
e.pr_id,
e.r,
count(1) AS quantity
FROM events e
JOIN (
SELECT
pr_id,
pa_id,
MAX(t) last_t
FROM events e
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pr_id,
pa_id
) latest
ON (
c_id = 5
AND latest.pr_id = e.pr_id
AND latest.pa_id = e.pa_id
AND latest.last_t = e.t
)
GROUP BY
e.pr_id,
e.r
ORDER BY 3, 2, 1 DESC
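One behavioural difference is worth checking before adopting this: if two rows of the same (pr_id, pa_id) group share the maximum t, the join returns both of them, whereas DISTINCT ON keeps exactly one. The sketch below runs the same query shape against an in-memory SQLite database from Python, with hypothetical rows and timestamps stored as text, purely to make that case visible. It says nothing about PostgreSQL performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (c_id INTEGER, pa_id TEXT, pr_id TEXT, r INTEGER, t TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (5, "pa-1", "pr-1", 1, "2017-01-03 10:00:00"),
        (5, "pa-1", "pr-1", 2, "2017-01-04 10:00:00"),  # latest for (pr-1, pa-1)
        (5, "pa-2", "pr-1", 3, "2017-01-05 10:00:00"),  # tie for (pr-1, pa-2) ...
        (5, "pa-2", "pr-1", 1, "2017-01-05 10:00:00"),  # ... same t, different r
    ],
)

MAX_JOIN = """
SELECT e.pr_id, e.r, count(1) AS quantity
FROM events e
JOIN (
    SELECT pr_id, pa_id, MAX(t) AS last_t
    FROM events
    WHERE c_id = 5 AND t >= '2017-01-03 00:00:00' AND t < '2017-01-06 00:00:00'
    GROUP BY pr_id, pa_id
) latest
  ON e.c_id = 5
 AND latest.pr_id = e.pr_id
 AND latest.pa_id = e.pa_id
 AND latest.last_t = e.t
GROUP BY e.pr_id, e.r
ORDER BY 3, 2, 1 DESC
"""
result = conn.execute(MAX_JOIN).fetchall()

# Number of distinct (pr_id, pa_id) groups, for comparison with the total count.
group_count = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT pr_id, pa_id FROM events)"
).fetchone()[0]
```

There are two groups here, but the quantities add up to three because the tied group is counted once per tied row. With randomly generated timestamps such ties are unlikely, but real data may contain them.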
Full Fiddle
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
--PostgreSQL 9.6
--'\\' is a delimiter
-- CREATE TABLE events AS...
VACUUM ANALYZE events;
CREATE INDEX idx_events_idx ON events (c_id, t DESC, pr_id, pa_id, r);
Query 1:
-- query A
explain analyze SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=2170.24..2170.74 rows=200 width=15) (actual time=358.239..358.245 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=2160.60..2162.60 rows=200 width=15) (actual time=358.181..358.189 rows=30 loops=1)
-> Unique (cost=2012.69..2132.61 rows=1599 width=40) (actual time=327.345..353.750 rows=12098 loops=1)
-> Sort (cost=2012.69..2052.66 rows=15990 width=40) (actual time=327.344..348.686 rows=15966 loops=1)
Sort Key: events.pr_id, events.pa_id, events.t
Sort Method: external merge Disk: 792kB
-> Index Only Scan using idx_events_idx on events (cost=0.42..896.20 rows=15990 width=40) (actual time=0.059..5.475 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 358.610 ms
Query 2:
-- query max/JOIN
explain analyze SELECT
e.pr_id,
e.r,
count(1) AS quantity
FROM events e
JOIN (
SELECT
pr_id,
pa_id,
MAX(t) last_t
FROM events e
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pr_id,
pa_id
) latest
ON (
c_id = 5
AND latest.pr_id = e.pr_id
AND latest.pa_id = e.pa_id
AND latest.last_t = e.t
)
GROUP BY
e.pr_id,
e.r
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=4153.31..4153.32 rows=1 width=15) (actual time=68.398..68.402 rows=30 loops=1)
Sort Key: (count(1)), e.r, e.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=4153.29..4153.30 rows=1 width=15) (actual time=68.363..68.371 rows=30 loops=1)
-> Merge Join (cost=1133.62..4153.29 rows=1 width=15) (actual time=35.083..64.154 rows=12098 loops=1)
Merge Cond: ((e.t = (max(e_1.t))) AND (e.pr_id = e_1.pr_id))
Join Filter: (e.pa_id = e_1.pa_id)
-> Index Only Scan Backward using idx_events_idx on events e (cost=0.42..2739.72 rows=53674 width=40) (actual time=0.010..8.073 rows=26661 loops=1)
Index Cond: (c_id = 5)
Heap Fetches: 0
-> Sort (cost=1133.19..1137.19 rows=1599 width=36) (actual time=29.778..32.885 rows=12098 loops=1)
Sort Key: (max(e_1.t)), e_1.pr_id
Sort Method: external sort Disk: 640kB
-> HashAggregate (cost=1016.12..1032.11 rows=1599 width=36) (actual time=12.731..16.738 rows=12098 loops=1)
-> Index Only Scan using idx_events_idx on events e_1 (cost=0.42..896.20 rows=15990 width=36) (actual time=0.029..5.084 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 68.736 ms
Query 3:
DROP INDEX idx_events_idx
CREATE INDEX idx_events_flutter ON events (c_id, pr_id, pa_id, t DESC, r)
Query 5:
-- query A + index by flutter
explain analyze SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=2744.82..2745.32 rows=200 width=15) (actual time=20.915..20.916 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=2735.18..2737.18 rows=200 width=15) (actual time=20.883..20.892 rows=30 loops=1)
-> Unique (cost=0.42..2707.20 rows=1599 width=40) (actual time=0.037..16.488 rows=12098 loops=1)
-> Index Only Scan using idx_events_flutter on events (cost=0.42..2627.25 rows=15990 width=40) (actual time=0.036..10.893 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 20.964 ms
Just two different methods (YMMV):
-- using a window function to find the record with the most recent t ::
EXPLAIN ANALYZE
SELECT pr_id, r, count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id, pa_id,
first_value(r) OVER www AS r
-- last_value(r) OVER www AS r
FROM events
WHERE c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
WINDOW www AS (PARTITION BY pr_id, pa_id ORDER BY t DESC)
ORDER BY 1, 2, t DESC
) sss
GROUP BY 1, 2
ORDER BY 3, 2, 1 DESC
;
-- Avoiding the window function; find the MAX via NOT EXISTS() ::
EXPLAIN ANALYZE
SELECT pr_id, r, count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id, pa_id, r
FROM events e
WHERE c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
AND NOT EXISTS ( SELECT * FROM events nx
WHERE nx.c_id = 5 AND nx.pr_id =e.pr_id AND nx.pa_id =e.pa_id
AND nx.t >= '2017-01-03Z00:00:00'
AND nx.t < '2017-01-06Z00:00:00'
AND nx.t > e.t
)
) sss
GROUP BY 1, 2
ORDER BY 3, 2, 1 DESC
;
Note: the DISTINCT ON can be omitted from the second query as long as t is unique within each (pr_id, pa_id) group, because the results are then already unique.
I'd try to use a standard ROW_NUMBER() function with a matching index instead of Postgres-specific DISTINCT ON to find the "latest" rows.
Index
CREATE INDEX ix_events ON events USING btree (c_id, pa_id, pr_id, t DESC, r);
Query
WITH
CTE_RN
AS
(
SELECT
pa_id
,pr_id
,r
,ROW_NUMBER() OVER (PARTITION BY c_id, pa_id, pr_id ORDER BY t DESC) AS rn
FROM events
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
)
SELECT
pr_id
,r
,COUNT(*) AS quantity
FROM CTE_RN
WHERE rn = 1
GROUP BY
pr_id
,r
ORDER BY quantity, r, pr_id DESC
;
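The logic of the query can be checked on a handful of rows without a PostgreSQL server. The sketch below is an assumption-laden stand-in: it uses Python's sqlite3 module (window functions need SQLite 3.25 or newer), stores t as text, and uses hypothetical rows. It only verifies the result set, not the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (c_id INTEGER, pa_id TEXT, pr_id TEXT, r INTEGER, t TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (5, "pa-1", "pr-1", 1, "2017-01-03 10:00:00"),
        (5, "pa-1", "pr-1", 2, "2017-01-04 10:00:00"),  # latest for (pa-1, pr-1)
        (5, "pa-2", "pr-1", 2, "2017-01-05 10:00:00"),  # latest for (pa-2, pr-1)
        (5, "pa-2", "pr-2", 3, "2017-01-03 12:00:00"),  # only row for (pa-2, pr-2)
        (5, "pa-3", "pr-1", 1, "2017-01-07 10:00:00"),  # outside the time range
    ],
)

ROW_NUMBER_QUERY = """
WITH cte_rn AS (
    SELECT pa_id, pr_id, r,
           ROW_NUMBER() OVER (
               PARTITION BY c_id, pa_id, pr_id ORDER BY t DESC
           ) AS rn
    FROM events
    WHERE c_id = 5 AND t >= '2017-01-03 00:00:00' AND t < '2017-01-06 00:00:00'
)
SELECT pr_id, r, COUNT(*) AS quantity
FROM cte_rn
WHERE rn = 1
GROUP BY pr_id, r
ORDER BY quantity, r, pr_id DESC
"""
result = conn.execute(ROW_NUMBER_QUERY).fetchall()
```

Two groups end on ("pr-1", 2) and one on ("pr-2", 3), so the query returns ("pr-2", 3, 1) followed by ("pr-1", 2, 2).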
I don't have Postgres at hand, so I'm using http://rextester.com for testing. I set the scale_factor to 30 in the data generation script, otherwise it takes too long for rextester. I'm getting the following query plan. The time component should be ignored, but you can see that there are no intermediate sorts, only the sort for the final ORDER BY. See http://rextester.com/GUFXY36037
Please try this query on your hardware and your data. It would be interesting to see how it compares to your query. I noticed that the optimizer doesn't choose this index if the table also has the index that you defined. If you see the same on your server, please try to drop or disable the other indexes to get the plan that I got.
Sort (cost=158.07..158.08 rows=1 width=44) (actual time=81.445..81.448 rows=30 loops=1)
Output: cte_rn.pr_id, cte_rn.r, (count(*))
Sort Key: (count(*)), cte_rn.r, cte_rn.pr_id DESC
Sort Method: quicksort Memory: 27kB
CTE cte_rn
-> WindowAgg (cost=0.42..157.78 rows=12 width=88) (actual time=0.204..56.215 rows=15130 loops=1)
Output: events.pa_id, events.pr_id, events.r, row_number() OVER (?), events.t, events.c_id
-> Index Only Scan using ix_events3 on public.events (cost=0.42..157.51 rows=12 width=80) (actual time=0.184..28.688 rows=15130 loops=1)
Output: events.c_id, events.pa_id, events.pr_id, events.t, events.r
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 15130
-> HashAggregate (cost=0.28..0.29 rows=1 width=44) (actual time=81.363..81.402 rows=30 loops=1)
Output: cte_rn.pr_id, cte_rn.r, count(*)
Group Key: cte_rn.pr_id, cte_rn.r
-> CTE Scan on cte_rn (cost=0.00..0.27 rows=1 width=36) (actual time=0.214..72.841 rows=11491 loops=1)
Output: cte_rn.pa_id, cte_rn.pr_id, cte_rn.r, cte_rn.rn
Filter: (cte_rn.rn = 1)
Rows Removed by Filter: 3639
Planning time: 0.452 ms
Execution time: 83.234 ms
There is one more optimisation you could do that relies on the external knowledge of your data.
If you can guarantee that each pair of pa_id, pr_id has values for each, say, day, then you can safely reduce the user-defined range of t to just one day.
This will reduce the number of rows that the engine reads and sorts if the user usually specifies a range of t longer than 1 day.
If you can't provide this kind of guarantee in your data for all values, but you still know that usually all pa_id, pr_id are close to each other (by t) and the user usually provides a wide range for t, you can run a preliminary query to narrow down the range of t for the main query.
Something like this:
SELECT
MIN(MaxT) AS StartT
,MAX(MaxT) AS EndT
FROM
(
SELECT
pa_id
,pr_id
,MAX(t) AS MaxT
FROM events
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pa_id
,pr_id
) AS T
And then use the found StartT, EndT in the main query, hoping that the new range is much narrower than the one originally defined by the user.
The query above doesn't have to sort rows, so it should be fast. The main query has to sort rows, but there will be fewer rows to sort, so the overall run time may be better.
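The two-step idea can be sketched in plain Python to show why narrowing does not change the result. All names and rows below are hypothetical, t is a plain day number, and the rows are assumed to be already filtered to c_id = 5 and the user's range. Note that the narrowed range has to include EndT itself (t <= EndT), unlike the half-open range of the original query.

```python
from collections import Counter


def latest_per_group(rows):
    """Map (pa_id, pr_id) -> (r, t) of the row with the greatest t."""
    latest = {}
    for pa_id, pr_id, r, t in rows:
        key = (pa_id, pr_id)
        if key not in latest or t > latest[key][1]:
            latest[key] = (r, t)
    return latest


def count_pr_r(latest):
    """Count groups per (pr_id, r), like the outer GROUP BY."""
    return Counter((pr_id, r) for (_, pr_id), (r, _) in latest.items())


# Hypothetical rows: (pa_id, pr_id, r, t).
rows = [
    ("pa-1", "pr-1", 1, 1),
    ("pa-1", "pr-1", 2, 8),
    ("pa-2", "pr-1", 3, 2),
    ("pa-2", "pr-1", 1, 9),
    ("pa-3", "pr-2", 2, 3),
    ("pa-3", "pr-2", 2, 8),
]

# Preliminary query: MIN and MAX over the per-group MAX(t).
max_t = [t for _, t in latest_per_group(rows).values()]
start_t, end_t = min(max_t), max(max_t)

# Main query restricted to the narrowed, inclusive range.
narrowed = [row for row in rows if start_t <= row[3] <= end_t]
```

Every group's latest row lies inside [start_t, end_t] by construction, so count_pr_r(latest_per_group(narrowed)) equals the count over all rows, while only 3 of the 6 rows remain to be sorted.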
So I've taken a slightly different tack and tried moving your grouping and distinct data into their own tables, so that we can leverage indexes on multiple tables. Note that this solution only works if you have control over the way data gets inserted into the database, i.e. you can change the data source application. If not, alas, this is moot.
In practice, instead of inserting into the events table immediately, you would first check if the relational date and prpa exist in their relevant tables. If not, create them. Then get their ids and use that for your insert statement to the events table.
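As a rough illustration of that insert path, here is a minimal get-or-create sketch using Python's sqlite3 module rather than PostgreSQL. Several things in it are assumptions and not part of the schema used in this answer: the UNIQUE constraint on (pr_id, pa_id), the helper names, and the sample rows. The dates lookup is left out for brevity and would follow the same pattern. In PostgreSQL 9.5 or later, INSERT ... ON CONFLICT DO NOTHING could play the role of INSERT OR IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prpa (
        id INTEGER PRIMARY KEY,
        pr_id TEXT NOT NULL,
        pa_id TEXT NOT NULL,
        UNIQUE (pr_id, pa_id)  -- assumption: needed so duplicates are ignored
    );
    CREATE TABLE events (c_id INTEGER, r INTEGER, t TEXT, prpa_id INTEGER);
    """
)


def get_or_create_prpa(conn, pr_id, pa_id):
    """Return the id for (pr_id, pa_id), inserting the pair on first sight."""
    conn.execute(
        "INSERT OR IGNORE INTO prpa (pr_id, pa_id) VALUES (?, ?)", (pr_id, pa_id)
    )
    row = conn.execute(
        "SELECT id FROM prpa WHERE pr_id = ? AND pa_id = ?", (pr_id, pa_id)
    ).fetchone()
    return row[0]


def insert_event(conn, c_id, pr_id, pa_id, r, t):
    """Resolve the lookup id first, then insert the event that references it."""
    prpa_id = get_or_create_prpa(conn, pr_id, pa_id)
    conn.execute(
        "INSERT INTO events (c_id, r, t, prpa_id) VALUES (?, ?, ?, ?)",
        (c_id, r, t, prpa_id),
    )
    return prpa_id


first = insert_event(conn, 5, "pr-1", "pa-1", 1, "2017-01-03 10:00:00")
second = insert_event(conn, 5, "pr-1", "pa-1", 2, "2017-01-04 10:00:00")
third = insert_event(conn, 5, "pr-2", "pa-1", 3, "2017-01-04 11:00:00")
```

The first two events share one prpa row, and the third creates a new one, so prpa ends up with two rows for three events.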
Before I start: on my machine query_c was about 10x faster than query_a, and my final result for the rewritten query_a is about 4x faster than the original. If that's not good enough, feel free to stop reading here.
Given the initial data seeding queries you gave in the first instance, I calculated the following benchmarks:
query_a: 5228.518 ms
query_b: 5708.962 ms
query_c: 538.329 ms
So, about a 10x increase in performance, give or take.
I'm going to alter the data that's generated in events, and this alteration takes quite a while. You would not need to do this in practice, as your INSERTs to the tables would be covered already.
For my optimisation, the first step is to create a table that houses dates and then transfer the data over, and relate back to it in the events table, like so:
CREATE TABLE dates (
id SERIAL,
year_part INTEGER NOT NULL,
month_part INTEGER NOT NULL,
day_part INTEGER NOT NULL
);
-- Total runtime: 8.281 ms
INSERT INTO dates(year_part, month_part, day_part) SELECT DISTINCT
EXTRACT(YEAR FROM t), EXTRACT(MONTH FROM t), EXTRACT(DAY FROM t)
FROM events;
-- Total runtime: 12802.900 ms
CREATE INDEX dates_ymd ON dates USING btree(year_part, month_part, day_part);
-- Total runtime: 13.750 ms
ALTER TABLE events ADD COLUMN date_id INTEGER;
-- Total runtime: 2.468ms
UPDATE events SET date_id = dates.id
FROM dates
WHERE EXTRACT(YEAR FROM t) = dates.year_part
AND EXTRACT(MONTH FROM t) = dates.month_part
AND EXTRACT(DAY FROM T) = dates.day_part
;
-- Total runtime: 388024.520 ms
Next, we do the same, but with the key pair (pr_id, pa_id), which doesn't reduce the cardinality too much, but when we're talking large sets it can help with memory usage and swapping in and out:
CREATE TABLE prpa (
id SERIAL,
pr_id TEXT NOT NULL,
pa_id TEXT NOT NULL
);
-- Total runtime: 5.451 ms
CREATE INDEX events_prpa ON events USING btree(pr_id, pa_id);
-- Total runtime: 218,908.894 ms
INSERT INTO prpa(pr_id, pa_id) SELECT DISTINCT pr_id, pa_id FROM events;
-- Total runtime: 5566.760 ms
CREATE INDEX prpa_idx ON prpa USING btree(pr_id, pa_id);
-- Total runtime: 84185.057 ms
ALTER TABLE events ADD COLUMN prpa_id INTEGER;
-- Total runtime: 2.067 ms
UPDATE events SET prpa_id = prpa.id
FROM prpa
WHERE events.pr_id = prpa.pr_id
AND events.pa_id = prpa.pa_id;
-- Total runtime: 757915.192
DROP INDEX events_prpa;
-- Total runtime: 1041.556 ms
Finally, let's get rid of the old indexes and the now defunct columns, and then vacuum up the new tables:
DROP INDEX events_idx;
-- Total runtime: 1139.508 ms
ALTER TABLE events
DROP COLUMN pr_id,
DROP COLUMN pa_id
;
-- Total runtime: 5.376 ms
VACUUM ANALYSE prpa;
-- Total runtime: 1030.142
VACUUM ANALYSE dates;
-- Total runtime: 6652.151
So we now have the following tables:
events (c_id, r, t, prpa_id, date_id)
dates (id, year_part, month_part, day_part)
prpa (id, pr_id, pa_id)
Let's toss on an index now, pushing t DESC to the end where it belongs, which we can do now because we're filtering results on dates before ORDERing, which cuts down the need for t DESC to be so prominent in the index:
CREATE INDEX events_idx_new ON events USING btree (c_id, date_id, prpa_id, t DESC);
-- Total runtime: 27697.795
VACUUM ANALYSE events;
Now we rewrite the query, (using a table to store intermediary results - I find this works well with large datasets) and awaaaaaay we go!
DROP TABLE IF EXISTS temp_results;
SELECT DISTINCT ON (prpa_id)
prpa_id,
r
INTO TEMPORARY temp_results
FROM events
INNER JOIN dates
ON dates.id = events.date_id
WHERE c_id = 5
AND dates.year_part BETWEEN 2017 AND 2017
AND dates.month_part BETWEEN 1 AND 1
AND dates.day_part BETWEEN 3 AND 5
ORDER BY prpa_id, t DESC;
SELECT
prpa.pr_id,
r,
count(1) AS quantity
FROM temp_results
INNER JOIN prpa ON prpa.id = temp_results.prpa_id
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC;
-- Total runtime: 1233.281 ms
So, not a 10x increase in performance, but 4x which is still alright.
This solution is a combination of a couple of techniques I have found work well with large datasets and with date ranges. Even if it's not good enough for your purposes, there might be some gems in here you can repurpose throughout your career.
EDIT:
EXPLAIN ANALYSE on SELECT INTO query:
Unique (cost=171839.95..172360.53 rows=51332 width=16) (actual time=819.385..857.777 rows=117471 loops=1)
-> Sort (cost=171839.95..172100.24 rows=104117 width=16) (actual time=819.382..836.924 rows=155202 loops=1)
Sort Key: events.prpa_id, events.t
Sort Method: external sort Disk: 3944kB
-> Hash Join (cost=14340.24..163162.92 rows=104117 width=16) (actual time=126.929..673.293 rows=155202 loops=1)
Hash Cond: (events.date_id = dates.id)
-> Bitmap Heap Scan on events (cost=14338.97..160168.28 rows=520585 width=20) (actual time=126.572..575.852 rows=516503 loops=1)
Recheck Cond: (c_id = 5)
Heap Blocks: exact=29610
-> Bitmap Index Scan on events_idx2 (cost=0.00..14208.82 rows=520585 width=0) (actual time=118.769..118.769 rows=516503 loops=1)
Index Cond: (c_id = 5)
-> Hash (cost=1.25..1.25 rows=2 width=4) (actual time=0.326..0.326 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on dates (cost=0.00..1.25 rows=2 width=4) (actual time=0.320..0.323 rows=3 loops=1)
Filter: ((year_part >= 2017) AND (year_part <= 2017) AND (month_part >= 1) AND (month_part <= 1) AND (day_part >= 3) AND (day_part <= 5))
Rows Removed by Filter: 7
Planning time: 3.091 ms
Execution time: 913.543 ms
EXPLAIN ANALYSE on SELECT query:
(Note: I had to alter the first query to select into an actual table, not a temporary table, in order to get the query plan for this one. AFAIK EXPLAIN ANALYSE only works on single queries.)
Sort (cost=89590.66..89595.66 rows=2000 width=15) (actual time=1248.535..1248.537 rows=30 loops=1)
Sort Key: (count(1)), temp_results.r, prpa.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=89461.00..89481.00 rows=2000 width=15) (actual time=1248.460..1248.468 rows=30 loops=1)
Group Key: prpa.pr_id, temp_results.r
-> Hash Join (cost=73821.20..88626.40 rows=111280 width=15) (actual time=798.861..1213.494 rows=117471 loops=1)
Hash Cond: (temp_results.prpa_id = prpa.id)
-> Seq Scan on temp_results (cost=0.00..1632.80 rows=111280 width=8) (actual time=0.024..17.401 rows=117471 loops=1)
-> Hash (cost=36958.31..36958.31 rows=2120631 width=15) (actual time=798.484..798.484 rows=2120631 loops=1)
Buckets: 16384 Batches: 32 Memory Usage: 3129kB
-> Seq Scan on prpa (cost=0.00..36958.31 rows=2120631 width=15) (actual time=0.126..350.664 rows=2120631 loops=1)
Planning time: 1.073 ms
Execution time: 1248.660 ms
I have this SQL query:
delete from scans
where scandatetime>(current_timestamp - interval '21 days') and
scandatetime <> (select min(tt.scandatetime) from scans tt where tt.imb = scans.imb) and
scandatetime <> (select max(tt.scandatetime) from scans tt where tt.imb = scans.imb)
;
That I use to delete records from the following table:
|imb |scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 19:30:32|Received |12345 |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
such that only 2 records remain per IMB: the one with the minimum scandatetime and the one with the maximum scandatetime. I also limit this so that it only touches records that are less than 3 weeks old. The resulting table looks like this:
|imb |scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
This table has a few indexes and has tens of millions of rows, so the query usually takes forever to run. How can I speed this up?
Explain output:
Delete on scans (cost=0.57..115934571.45 rows=10015402 width=6)
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.57..115934571.45 rows=10015402 width=6)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '21 days'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
SubPlan 2
-> Result (cost=3.91..3.92 rows=1 width=8)
InitPlan 1 (returns $1)
-> Limit (cost=0.70..3.91 rows=1 width=8)
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..16.79 rows=5 width=8)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
SubPlan 4
-> Result (cost=3.91..3.92 rows=1 width=8)
InitPlan 3 (returns $3)
-> Limit (cost=0.70..3.91 rows=1 width=8)
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..16.79 rows=5 width=8)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Table DDL:
-- Table Definition ----------------------------------------------
CREATE TABLE scans (
imb text,
scandatetime timestamp without time zone,
status text,
scanfacilityzip text
);
-- Indices -------------------------------------------------------
CREATE INDEX scans_staging_scandatetime_idx ON scans(scandatetime timestamp_ops);
CREATE INDEX scans_staging_imb_idx ON scans(imb text_ops);
CREATE INDEX scans_staging_status_idx ON scans(status text_ops);
CREATE INDEX scans_staging_scandatetime_status_idx ON scans(scandatetime timestamp_ops,status text_ops);
CREATE INDEX scans_staging_imb_scandatetime_idx ON scans(imb text_ops,scandatetime timestamp_ops);
Edit:
Here is the explain analyze output (note, I changed the interval to 1 day to make it run faster):
Delete on scans (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.877..831562.877 rows=0 loops=1)
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.875..831562.875 rows=0 loops=1)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
Rows Removed by Filter: 277756
SubPlan 2
-> Result (cost=3.92..3.93 rows=1 width=8) (actual time=1.675..1.675 rows=1 loops=277756)
InitPlan 1 (returns $1)
-> Limit (cost=0.70..3.92 rows=1 width=8) (actual time=1.673..1.674 rows=1 loops=277756)
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..16.80 rows=5 width=8) (actual time=1.672..1.672 rows=1 loops=277756)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 277761
SubPlan 4
-> Result (cost=3.92..3.93 rows=1 width=8) (actual time=0.086..0.086 rows=1 loops=164210)
InitPlan 3 (returns $3)
-> Limit (cost=0.70..3.92 rows=1 width=8) (actual time=0.084..0.085 rows=1 loops=164210)
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..16.80 rows=5 width=8) (actual time=0.083..0.083 rows=1 loops=164210)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 164210
Planning Time: 11.360 ms
Execution Time: 831562.956 ms
EDIT: Result with explain analyze buffers:
Delete on scans (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.026..19309.027 rows=0 loops=1)
Buffers: shared hit=743430 read=46033
I/O Timings: read=15917.966
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.025..19309.025 rows=0 loops=1)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
Rows Removed by Filter: 74564
Buffers: shared hit=743430 read=46033
I/O Timings: read=15917.966
SubPlan 2
-> Result (cost=4.05..4.06 rows=1 width=8) (actual time=0.232..0.233 rows=1 loops=74564)
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
InitPlan 1 (returns $1)
-> Limit (cost=0.70..4.05 rows=1 width=8) (actual time=0.231..0.231 rows=1 loops=74564)
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..20.81 rows=6 width=8) (actual time=0.230..0.230 rows=1 loops=74564)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 74583
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
SubPlan 4
-> Result (cost=4.05..4.06 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=34497)
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
InitPlan 3 (returns $3)
-> Limit (cost=0.70..4.05 rows=1 width=8) (actual time=0.041..0.041 rows=1 loops=34497)
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..20.81 rows=6 width=8) (actual time=0.040..0.040 rows=1 loops=34497)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 34497
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
Planning Time: 5.350 ms
Execution Time: 19313.242 ms
Without the pre-aggregation (and avoiding the CTE):
DELETE FROM scans del
WHERE del.scandatetime > (current_timestamp - interval '21 days')
AND EXISTS (SELECT *
FROM scans x
WHERE x.imb = del.imb
AND x.scandatetime < del.scandatetime
)
AND EXISTS (SELECT *
FROM scans x
WHERE x.imb = del.imb
AND x.scandatetime > del.scandatetime
)
;
The idea is: you only delete a record if there is (at least) one record before it and (at least) one record after it, both with the same imb. This is not true for the first and last records, only for the ones in between.
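To check that reasoning on a small scale, here is a minimal sketch that runs the same two EXISTS conditions against an in-memory SQLite database from Python. SQLite is only a stand-in for Postgres here; the 21-day filter is left out and the sample rows are a de-duplicated version of the ones in the question, so treat it as an illustration of the logic rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (imb TEXT, scandatetime TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scans VALUES (?, ?, ?)",
    [
        ("isdijh23452", "2020-01-01 13:45:12", "Intake"),
        ("isdijh23452", "2020-01-01 19:30:32", "Received"),
        ("isdijh23452", "2020-01-02 04:50:22", "Confirmed"),
        ("isdijh23452", "2020-01-03 19:32:18", "Processed"),
        ("awgjnh09864", "2020-01-01 10:24:16", "Intake"),
        ("awgjnh09864", "2020-01-01 19:30:32", "Received"),
        ("awgjnh09864", "2020-01-02 02:15:52", "Processed"),
    ],
)

# Delete a row only if the same imb has an earlier AND a later scan.
conn.execute(
    """
    DELETE FROM scans
    WHERE EXISTS (SELECT 1 FROM scans x
                  WHERE x.imb = scans.imb
                    AND x.scandatetime < scans.scandatetime)
      AND EXISTS (SELECT 1 FROM scans x
                  WHERE x.imb = scans.imb
                    AND x.scandatetime > scans.scandatetime)
    """
)

remaining = conn.execute(
    "SELECT imb, scandatetime, status FROM scans ORDER BY imb, scandatetime"
).fetchall()
for row in remaining:
    print(row)
```

Only the first and last scan of each imb should survive. Note that rows sharing the exact minimum (or maximum) timestamp are all kept, because none of them has a strictly earlier (or later) sibling.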
Consider running aggregation once and incorporating it in an EXISTS clause.
with agg as (
select imb
, min(scandatetime) as min_dt
, max(scandatetime) as max_dt
from scans
group by imb
)
delete from scans s
where s.scandatetime > (current_timestamp - interval '21 days')
and exists
(select 1
from agg
where s.imb = agg.imb
and (s.scandatetime > agg.min_dt and
s.scandatetime < agg.max_dt)
);
In the request comments you say that the table contains no rows older than 21 days. The condition scandatetime > (current_timestamp - interval '21 days') is hence superfluous. This also means that you delete almost all rows from the table. You only keep one or two rows per imb.
DELETE on so many rows (you mention tens of millions of rows) can be very slow. Not only must the table rows be deleted one by one, but also all the indexes updated.
That said, you may be better off copying those few desired rows into a temporary table, truncating the original table and copying the rows back. TRUNCATE doesn't look at single rows like DELETE does. It simply empties the whole table and its indexes in one go and immediately reclaims disk space.
The script would look something like this:
create table temp_desired_scans as
select *
from scans s
where (imb, scandatetime) in
(
select imb, min(scandatetime) from scans group by imb
union all
select imb, max(scandatetime) from scans group by imb
);
truncate table scans;
insert into scans
select * from temp_desired_scans;
drop table temp_desired_scans;
(Another common option for such mass deletes is to keep the temp table, drop the original table, rename the temp table to the original table's name and install all constraints and indexes on this new table.)
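A rough sketch of that drop-and-rename variant, run against an in-memory SQLite database from Python so it can be tried without touching a real server. The names scans_new and scans_imb_scandatetime_idx are made up for the illustration, and SQLite only stands in for Postgres; on Postgres you would probably wrap the three DDL statements in a single transaction and recreate every index and constraint the original table had.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scans (imb TEXT, scandatetime TEXT, status TEXT);
    CREATE INDEX scans_imb_scandatetime_idx ON scans (imb, scandatetime);
    INSERT INTO scans VALUES
        ('isdijh23452', '2020-01-01 13:45:12', 'Intake'),
        ('isdijh23452', '2020-01-01 19:30:32', 'Received'),
        ('isdijh23452', '2020-01-03 19:32:18', 'Processed'),
        ('awgjnh09864', '2020-01-01 10:24:16', 'Intake'),
        ('awgjnh09864', '2020-01-01 19:30:32', 'Received'),
        ('awgjnh09864', '2020-01-02 02:15:52', 'Processed');
    """
)

conn.executescript(
    """
    -- 1. Build the replacement table from the rows worth keeping.
    CREATE TABLE scans_new AS
    SELECT s.*
    FROM scans s
    JOIN (SELECT imb,
                 min(scandatetime) AS lo,
                 max(scandatetime) AS hi
          FROM scans
          GROUP BY imb) a
      ON a.imb = s.imb AND s.scandatetime IN (a.lo, a.hi);

    -- 2. Swap it in place of the original (this also drops the old indexes).
    DROP TABLE scans;
    ALTER TABLE scans_new RENAME TO scans;

    -- 3. Reinstall the indexes on the new table.
    CREATE INDEX scans_imb_scandatetime_idx ON scans (imb, scandatetime);
    """
)

row_count = conn.execute("SELECT count(*) FROM scans").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('scans')")]
print(row_count, index_names)
```

The appeal of this form is that the indexes are built once over the small surviving set instead of being maintained row by row during a mass delete.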
Given the fact that select is the problem, I would focus on just select. You can make a delete from it any time. You may try this form if it helps:
select * from
(select *,
row_number() over (partition by imb order by scandatetime asc) ar,
row_number() over (partition by imb order by scandatetime desc) dr
from scans
)s
where ar>1 and dr>1 and scandatetime>(current_timestamp - interval '21 days')
Edit
It seems that a pure materialization can be stored as a column on the table and indexed; however, my specific use case (semver.satisfies) requires a more general solution:
create table Submissions (
version text,
created_at timestamp
);
create index Submissions_1 on Submissions (created_at);
My query would then look like:
select * from Submissions
where
created_at <= '2016-07-12' and
satisfies(version, '>=1.2.3 <4.5.6')
order by created_at desc
limit 1;
Where I wouldn't be able to practically use the same memoization technique.
Original
I have a table storing text data and the dates at which they were created:
create table Submissions (
content text,
created_at timestamp
);
create index Submissions_1 on Submissions (created_at);
Given a checksum and a reference date, I want to get the latest Submission where the content field matches that checksum:
select * from Submissions
where
created_at <= '2016-07-12' and
expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
order by created_at desc
limit 1;
This works, but it's very slow. What Postgres ends up doing is taking a checksum of every row, and then performing the order by:
Limit (cost=270834.18..270834.18 rows=1 width=32) (actual time=1132.898..1132.898 rows=1 loops=1)
-> Sort (cost=270834.18..271561.27 rows=290836 width=32) (actual time=1132.898..1132.898 rows=1 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on installation (cost=0.00..269380.00 rows=290836 width=32) (actual time=0.118..1129.961 rows=17305 loops=1)
Filter: created_at <= '2016-07-12' AND expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
Rows Removed by Filter: 982695
Planning time: 0.066 ms
Execution time: 1246.941 ms
Without the order by, it is a sub-millisecond operation, because Postgres knows that I only want the first result. The only difference is that I want Postgres to start searching from the latest date down.
Ideally, Postgres would:
filter by created_at
sort by created_at, descending
return the first row where the checksum matches
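To make the desired evaluation order concrete, here is a client-side sketch in Python: it walks the rows newest-first and stops at the first checksum match, counting how often the expensive function actually runs. The md5-based expensive_chksm and the SQLite in-memory table are assumptions for the illustration; this is not what the Postgres planner does, only the behaviour being asked for.

```python
import hashlib
import sqlite3


def expensive_chksm(content):
    # Stand-in for the real checksum function (assumed to be md5 here).
    return hashlib.md5(content.encode()).hexdigest()


def latest_match(conn, cutoff, wanted):
    """Return (row, checksum_calls) for the newest row at or before cutoff."""
    calls = 0
    cursor = conn.execute(
        "SELECT content, created_at FROM submissions "
        "WHERE created_at <= ? ORDER BY created_at DESC",
        (cutoff,),
    )
    for content, created_at in cursor:
        calls += 1
        if expensive_chksm(content) == wanted:
            return (content, created_at), calls  # stop at the first hit
    return None, calls


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (content TEXT, created_at TEXT)")
conn.execute("CREATE INDEX submissions_1 ON submissions (created_at)")
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?)",
    [(f"doc-{day}", f"2016-07-{day:02d}") for day in range(1, 21)],
)

row, calls = latest_match(conn, "2016-07-12", expensive_chksm("doc-10"))
print(row, calls)
```

Of the twelve rows that pass the date filter, only the three newest are checksummed before the match is found.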
I've tried to write queries with inline views, but an explain analyze shows that it will just be rewritten into what I already had above.
You can create a single index covering both fields:
create index Submissions_1 on Submissions (created_at DESC, expensive_chksm(content));
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.15..8.16 rows=1 width=40) (actual time=0.004..0.004 rows=0 loops=1)
-> Index Scan using submissions_1 on submissions (cost=0.15..16.17 rows=2 width=40) (actual time=0.002..0.002 rows=0 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND ((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text))
Planning time: 0.414 ms
Execution time: 0.036 ms
It is important to also use DESC in the index.
UPDATED:
For storing and comparing versions you can use int[]:
create table Submissions (
version int[],
created_at timestamp
);
INSERT INTO Submissions SELECT ARRAY [ (random() * 10)::int2, (random() * 10)::int2, (random() * 10)::int2], '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 1000000);
create index Submissions_1 on Submissions (created_at DESC, version);
EXPLAIN ANALYZE select * from Submissions
where
created_at <= '2016-07-12'
AND version <= ARRAY [5,2,3]
AND version > ARRAY [1,2,3]
order by created_at desc
limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..13.24 rows=1 width=40) (actual time=0.074..0.075 rows=1 loops=1)
-> Index Only Scan using submissions_1 on submissions (cost=0.42..21355.76 rows=1667 width=40) (actual time=0.073..0.073 rows=1 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND (version <= '{5,2,3}'::integer[]) AND (version > '{1,2,3}'::integer[]))
Heap Fetches: 1
Planning time: 3.019 ms
Execution time: 0.100 ms
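The int[] trick works because Postgres compares arrays element by element, which is the ordering you want for dotted version numbers. Python tuples compare the same way, so the idea can be sketched like this (parse_version and in_range are hypothetical helper names, and pre-release tags such as 1.2.3-beta are deliberately not handled):

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so it compares numerically."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, low, high):
    # Mirrors: version > ARRAY[low] AND version <= ARRAY[high]
    return parse_version(low) < parse_version(version) <= parse_version(high)


print(in_range("1.10.0", "1.2.3", "5.2.3"))  # numeric comparison per component
print("1.10.0" > "1.2.3")                    # plain text comparison, for contrast
```

Plain text comparison sorts "1.10.0" before "1.2.3", which is why storing the version as text and comparing it directly would give wrong answers.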
In response to a_horse_with_no_name's comment:
"The order of the conditions in the where clause is irrelevant for the index usage. It's better to put the one that can be used for the equality expression first in the index, then the range expression."
BEGIN;
create table Submissions (
content text,
created_at timestamp
);
CREATE FUNCTION expensive_chksm(varchar) RETURNS varchar AS $$
SELECT $1;
$$ LANGUAGE sql;
INSERT INTO Submissions SELECT (random() * 1000000000)::text, '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 1000000);
INSERT INTO Submissions SELECT '77ac76dc0d4622ba9aa795acafc05f1e', '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 100000);
create index Submissions_1 on Submissions (created_at DESC, expensive_chksm(content));
-- create index Submissions_2 on Submissions (expensive_chksm(content), created_at DESC);
EXPLAIN ANALYZE select * from Submissions
where
created_at <= '2016-07-12' and
expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
order by created_at desc
limit 1;
Using Submissions_1:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..10.98 rows=1 width=40) (actual time=0.018..0.019 rows=1 loops=1)
-> Index Scan using submissions_1 on submissions (cost=0.43..19341.43 rows=1833 width=40) (actual time=0.018..0.018 rows=1 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND ((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text))
Planning time: 0.257 ms
Execution time: 0.033 ms
Using Submissions_2:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4482.39..4482.40 rows=1 width=40) (actual time=29.096..29.096 rows=1 loops=1)
-> Sort (cost=4482.39..4486.98 rows=1833 width=40) (actual time=29.095..29.095 rows=1 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on submissions (cost=67.22..4473.23 rows=1833 width=40) (actual time=15.457..23.683 rows=46419 loops=1)
Recheck Cond: (((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text) AND (created_at <= '2016-07-12 00:00:00'::timestamp without time zone))
Heap Blocks: exact=936
-> Bitmap Index Scan on submissions_1 (cost=0.00..66.76 rows=1833 width=0) (actual time=15.284..15.284 rows=46419 loops=1)
Index Cond: (((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text) AND (created_at <= '2016-07-12 00:00:00'::timestamp without time zone))
Planning time: 0.583 ms
Execution time: 29.134 ms
PostgreSQL 9.6.1
You can use a subquery for the timestamp and ordering part, and then run the checksum outside:
select * from (
select * from submissions where
created_at <= '2016-07-12'
order by created_at desc) as S
where expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
LIMIT 1;
If you are always going to query on checksum then an alternative would be to have another column called checksum in the table, e.g.:
create table Submissions (
content text,
created_at timestamp,
checksum varchar
);
You can then set the checksum whenever a row gets inserted or updated (or write a trigger that does it for you) and query on the checksum column directly for a quick result.
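A minimal sketch of the application-side variant, using Python with an in-memory SQLite database as a stand-in for Postgres. The insert_submission helper, the md5-based expensive_chksm and the index name submissions_checksum are all assumptions made for the example; the point is only that the expensive function runs once per write and never at query time.

```python
import hashlib
import sqlite3


def expensive_chksm(content):
    # Stand-in for the real checksum function (assumed to be md5 here).
    return hashlib.md5(content.encode()).hexdigest()


def insert_submission(conn, content, created_at):
    # Compute the checksum once, at write time, and store it with the row.
    conn.execute(
        "INSERT INTO submissions (content, created_at, checksum) VALUES (?, ?, ?)",
        (content, created_at, expensive_chksm(content)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (content TEXT, created_at TEXT, checksum TEXT)"
)
# Equality column first, then the column used for the range and the ordering.
conn.execute(
    "CREATE INDEX submissions_checksum ON submissions (checksum, created_at)"
)

for content, created_at in [
    ("a", "2016-07-01"),
    ("b", "2016-07-05"),
    ("a", "2016-07-10"),
    ("a", "2016-07-20"),
]:
    insert_submission(conn, content, created_at)

latest = conn.execute(
    "SELECT content, created_at FROM submissions "
    "WHERE checksum = ? AND created_at <= ? "
    "ORDER BY created_at DESC LIMIT 1",
    (expensive_chksm("a"), "2016-07-12"),
).fetchone()
print(latest)
```

The trade-off is that every code path that writes content has to go through the helper (or a trigger), otherwise the stored checksum can drift from the content.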
Try this
select *
from Submissions
where created_at = (
select max(created_at)
from Submissions
where expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e')
Let's say we have a table like this:
CREATE TABLE user_device_infos
(
id integer NOT NULL DEFAULT nextval('user_device_infos_id_seq1'::regclass),
user_id integer,
data jsonb,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT user_device_infos_pkey PRIMARY KEY (id),
CONSTRAINT fk_rails_e4001464ba FOREIGN KEY (user_id)
REFERENCES public.users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
CREATE INDEX index_user_device_infos_imei_user_id
ON public.user_device_infos
USING btree
(((data -> 'Network'::text) ->> 'IMEI No'::text) COLLATE pg_catalog."default", user_id);
CREATE INDEX index_user_device_infos_on_user_id
ON public.user_device_infos
USING btree
(user_id);
Now I try to select the user_id of the first device with a given IMEI:
SELECT user_id FROM user_device_infos WHERE (data->'Network'->>'IMEI No' = 'xxxx') order by id LIMIT 1
This query takes 5 seconds on my table (152000 entries).
But if I write
SELECT user_id FROM user_device_infos WHERE (data->'Network'->>'IMEI No' = 'xxxx') order by created_at asc LIMIT 1
SELECT user_id FROM user_device_infos WHERE (data->'Network'->>'IMEI No' = 'xxxx') order by created_at desc LIMIT 1
the query takes less than 1 ms.
Why are these queries so much faster than the first variant? There is no index on created_at, and id is the primary key.
Upd
As suggested in the comments, I ran explain analyze, but I still don't understand what is wrong with the "order by id" query. Sorry, I am not an SQL dev.
# explain analyze SELECT user_id FROM user_device_infos WHERE (data->'Network'->>'IMEI No' = 'xxxx') order by id LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..416.84 rows=1 width=8) (actual time=5289.784..5289.784 rows=0 loops=1)
-> Index Scan using user_device_infos_pkey on user_device_infos (cost=0.42..316483.14 rows=760 width=8) (actual time=5289.782..5289.782 rows=0 loops=1)
Filter: (((data -> 'Network'::text) ->> 'IMEI No'::text) = 'xxxx'::text)
Rows Removed by Filter: 152437
Planning time: 0.153 ms
Execution time: 5289.817 ms
(6 rows)
# explain analyze SELECT user_id FROM user_device_infos WHERE (data->'Network'->>'IMEI No' = 'xxxx') order by created_at LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2823.73..2823.74 rows=1 width=12) (actual time=0.064..0.064 rows=0 loops=1)
-> Sort (cost=2823.73..2825.63 rows=760 width=12) (actual time=0.062..0.062 rows=0 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on user_device_infos (cost=22.31..2819.93 rows=760 width=12) (actual time=0.039..0.039 rows=0 loops=1)
Recheck Cond: (((data -> 'Network'::text) ->> 'IMEI No'::text) = 'xxxx'::text)
-> Bitmap Index Scan on index_user_device_infos_imei_user_id (cost=0.00..22.12 rows=760 width=0) (actual time=0.037..0.037 rows=0 loops=1)
Index Cond: (((data -> 'Network'::text) ->> 'IMEI No'::text) = 'xxxx'::text)
Planning time: 0.144 ms
Execution time: 0.092 ms
(10 rows)
For the following query:
SELECT *
FROM "routes_trackpoint"
WHERE "routes_trackpoint"."track_id" = 593
ORDER BY "routes_trackpoint"."id" ASC
LIMIT 1;
Postgres is choosing a query plan which reads all the rows in the "id" index to perform the ordering, and then performs manual filtering to get the entries with the correct track id:
Limit (cost=0.43..511.22 rows=1 width=65) (actual time=4797.964..4797.966 rows=1 loops=1)
Buffers: shared hit=3388505
-> Index Scan using routes_trackpoint_pkey on routes_trackpoint (cost=0.43..719699.79 rows=1409 width=65) (actual time=4797.958..4797.958 rows=1 loops=1)
Filter: (track_id = 75934)
Rows Removed by Filter: 13005436
Buffers: shared hit=3388505
Total runtime: 4798.019 ms
(7 rows)
With index scans disabled (SET enable_indexscan=OFF;), the query plan is better and the response is much faster.
Limit (cost=6242.46..6242.46 rows=1 width=65) (actual time=77.584..77.586 rows=1 loops=1)
Buffers: shared hit=1075 read=6
-> Sort (cost=6242.46..6246.64 rows=1674 width=65) (actual time=77.577..77.577 rows=1 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=1075 read=6
-> Bitmap Heap Scan on routes_trackpoint (cost=53.41..6234.09 rows=1674 width=65) (actual time=70.384..74.782 rows=1454 loops=1)
Recheck Cond: (track_id = 75934)
Buffers: shared hit=1075 read=6
-> Bitmap Index Scan on routes_trackpoint_track_id (cost=0.00..52.99 rows=1674 width=0) (actual time=70.206..70.206 rows=1454 loops=1)
Index Cond: (track_id = 75934)
Buffers: shared hit=2 read=6
Total runtime: 77.655 ms
(13 rows)
How can I get Postgres to select the better plan automatically?
I have tried the following:
ALTER TABLE routes_trackpoint ALTER COLUMN track_id SET STATISTICS 5000;
ALTER TABLE routes_trackpoint ALTER COLUMN id SET STATISTICS 5000;
ANALYZE routes_trackpoint;
But the query plan remained the same.
The table definition is:
watchdog2=# \d routes_trackpoint
Table "public.routes_trackpoint"
Column | Type | Modifiers
----------+--------------------------+----------------------------------------------------------------
id | integer | not null default nextval('routes_trackpoint_id_seq'::regclass)
track_id | integer | not null
position | geometry(Point,4326) | not null
speed | double precision | not null
bearing | double precision | not null
valid | boolean | not null
created | timestamp with time zone | not null
Indexes:
"routes_trackpoint_pkey" PRIMARY KEY, btree (id)
"routes_trackpoint_position_id" gist ("position")
"routes_trackpoint_track_id" btree (track_id)
Foreign-key constraints:
"track_id_refs_id_d59447ae" FOREIGN KEY (track_id) REFERENCES routes_track(id) DEFERRABLE INITIALLY DEFERRED
PS: We have forced Postgres to sort by "created" instead, which also helped it use the index on "track_id".
Avoid LIMIT as much as you can.
Plan #1: use NOT EXISTS() to get the first one
EXPLAIN ANALYZE
SELECT * FROM routes_trackpoint tp
WHERE tp.track_id = 593
AND NOT EXISTS (
SELECT * FROM routes_trackpoint nx
WHERE nx.track_id = tp.track_id AND nx.id < tp.id
);
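As a quick sanity check that the NOT EXISTS form selects the same row as ORDER BY id LIMIT 1, here is a small sketch against an in-memory SQLite table from Python. The table is cut down to the two columns that matter and the rows are invented; it says nothing about which plan Postgres will pick, only that the two formulations agree on the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes_trackpoint (id INTEGER PRIMARY KEY, track_id INTEGER NOT NULL)"
)
conn.execute(
    "CREATE INDEX routes_trackpoint_track_id ON routes_trackpoint (track_id)"
)
conn.executemany(
    "INSERT INTO routes_trackpoint VALUES (?, ?)",
    [(1, 7), (2, 593), (3, 7), (4, 593), (5, 593)],
)

# The original formulation: order by id and take the first row.
limit_row = conn.execute(
    "SELECT id, track_id FROM routes_trackpoint "
    "WHERE track_id = ? ORDER BY id ASC LIMIT 1",
    (593,),
).fetchone()

# The NOT EXISTS formulation: keep the row with no smaller id in its track.
not_exists_rows = conn.execute(
    """
    SELECT tp.id, tp.track_id
    FROM routes_trackpoint tp
    WHERE tp.track_id = ?
      AND NOT EXISTS (SELECT 1
                      FROM routes_trackpoint nx
                      WHERE nx.track_id = tp.track_id
                        AND nx.id < tp.id)
    """,
    (593,),
).fetchall()

print(limit_row, not_exists_rows)
```

Because id is unique, the NOT EXISTS form returns at most one row per track, which is what makes it a drop-in replacement for the LIMIT 1 query.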
Plan #2: use row_number() OVER some_window to get the first one of the group.
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
) omg ON omg.id = tp.id
WHERE tp.track_id = 593
AND omg.rn = 1
;
Or -even better- move the WHERE clause to the subquery :
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
) omg ON omg.id = tp.id
WHERE 1=1
-- AND tp.track_id = 593
AND omg.rn = 1
;
Plan #3: use the Postgres-specific DISTINCT ON() construct (thanks to @a_horse_with_no_name):
-- EXPLAIN ANALYZE
SELECT DISTINCT ON (track_id) track_id, id
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
-- order by track_id, created desc
order by track_id, id
;
Imagine an account table that looks like this:
Column | Type | Modifiers
------------+-----------------------------+-----------
id | bigint | not null
signupdate | timestamp without time zone | not null
canceldate | timestamp without time zone |
I want to get a report of the number of signups and cancellations by month.
It is pretty straight-forward to do it in two queries, one for the signups by month and then one for the cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups and cancellations, and should show up with a zero in the results.
With source data like this:
id signupDate cancelDate
1 2012-01-13
2 2012-01-15 2012-02-05
3 2012-03-01 2012-03-20
we should get the following results:
Date signups cancellations
2012-01 2 0
2012-02 0 1
2012-03 1 1
I'm using postgresql 9.0
Update after the first answer:
Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere, it returned incorrect results.
Looking at the results from an explain analyze (and my table does have an index on signup_date), the first query returns:
Sort (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 2 (returns $1)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 3 (returns $2)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
-> HashAggregate (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
-> Nested Loop (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
-> Materialize (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
-> Seq Scan on account a (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms
and the third query returns:
Sort (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 4 (returns $2)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
InitPlan 3 (returns $1)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 5 (returns $3)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
-> HashAggregate (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
Filter: (date_trunc('month'::text, created) = $0)
SubPlan 2
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms
This certainly makes it appear that the third query is faster, but when I run the queries from the command-line using the 'time' command, the first query is consistently faster, though only by a few milliseconds.
Surprisingly to me, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It took less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it still might be the best solution. Here are the single queries:
select
m,
count(a.id) as "signups"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m
;
select
m,
count(a.id) as "cancellations"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m
;
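The external work amounts to merging the two result sets by month and filling in zeros for months that appear in neither, since the INNER JOINs above drop empty months. A possible sketch in Python, assuming each query result has already been turned into a dict keyed by a 'YYYY-MM' string (month_range and merge_monthly are hypothetical helper names):

```python
def month_range(first, last):
    """Yield 'YYYY-MM' strings from first to last, inclusive."""
    year, month = (int(part) for part in first.split("-"))
    end = tuple(int(part) for part in last.split("-"))
    while (year, month) <= end:
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month == 13:
            year, month = year + 1, 1


def merge_monthly(signups, cancels):
    """Combine two {month: count} dicts into (month, signups, cancels) rows."""
    months = sorted(set(signups) | set(cancels))
    if not months:
        return []
    return [
        (m, signups.get(m, 0), cancels.get(m, 0))
        for m in month_range(months[0], months[-1])
    ]


# Counts matching the three sample accounts from the question.
signups = {"2012-01": 2, "2012-03": 1}
cancels = {"2012-02": 1, "2012-03": 1}
report = merge_monthly(signups, cancels)
for row in report:
    print(row)
```

With the sample data this reproduces the three-row report from the question, including the zero entries.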
I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it.
Here are three different ways to do it. All depend on generating a time series then scanning it. One uses subqueries to aggregate data for each month. One joins the table twice against the series with different criteria. An alternate form does a single join on the time series, retaining rows that match either start or end date, then uses predicates in the counts to further filter the results.
EXPLAIN ANALYZE will help you pick which approach works best for your data.
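As a rough sketch of how that comparison might be run: prefix each candidate query with EXPLAIN and compare the reported actual times over several runs. The simple per-month signup count here is only a stand-in for whichever query you are testing, and BUFFERS is an optional extra that needs PostgreSQL 9.0 or later. Keep in mind that EXPLAIN ANALYZE really executes the statement.

```sql
-- Stand-in query against the accounts test table; substitute any of the
-- candidate queries after the EXPLAIN (...) prefix.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('month', signup_date) AS m,
       count(*) AS n_signups
FROM accounts
GROUP BY 1
ORDER BY 1;
```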
http://sqlfiddle.com/#!12/99c2a/9
Test setup:
CREATE TABLE accounts
("id" int, "signup_date" timestamp, "cancel_date" timestamp);
INSERT INTO accounts
("id", "signup_date", "cancel_date")
VALUES
(1, '2012-01-13 00:00:00', NULL),
(2, '2012-01-15 00:00:00', '2012-02-05'),
(3, '2012-03-01 00:00:00', '2012-03-20')
;
By single join and filter in count:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m OR date_trunc('month',a.cancel_date) = m)
GROUP BY m
ORDER BY m;
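If you're on PostgreSQL 9.4 or later, the same thing can be sketched with an aggregate FILTER clause, which some find easier to read than count(nullif(...,'f')). I haven't benchmarked this variant; it should count the same rows, since both forms only count rows where the comparison is true. The n_signups and n_cancels aliases are just borrowed from the other variants.

```sql
SELECT m,
       -- FILTER keeps only rows where the predicate is true, like nullif(...,'f')
       count(*) FILTER (WHERE date_trunc('month', a.signup_date) = m) AS n_signups,
       count(*) FILTER (WHERE date_trunc('month', a.cancel_date) = m) AS n_cancels
FROM generate_series(
       (SELECT date_trunc('month', min(signup_date)) FROM accounts),
       (SELECT date_trunc('month', greatest(max(signup_date), max(cancel_date))) FROM accounts),
       INTERVAL '1' MONTH
     ) AS m
INNER JOIN accounts a
        ON (date_trunc('month', a.signup_date) = m
         OR date_trunc('month', a.cancel_date) = m)
GROUP BY m
ORDER BY m;
```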
By joining the accounts table twice:
SELECT m,
count(DISTINCT s.id) AS n_signups, -- DISTINCT because the two joins multiply rows within a month; assumes id is unique
count(DISTINCT c.id) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
LEFT OUTER JOIN accounts s ON (date_trunc('month',s.signup_date) = m)
LEFT OUTER JOIN accounts c ON (date_trunc('month',c.cancel_date) = m)
GROUP BY m
ORDER BY m;
Alternately, using subqueries:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE date_trunc('month',signup_date) = m
) AS n_signups, (
SELECT count(cancel_date)
FROM accounts
WHERE date_trunc('month',cancel_date) = m
) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
New answer after update.
I'm not shocked that you get better results from two simpler queries; sometimes it's simply more efficient to do things that way. However, there was an issue with my original answer that will have significantly impacted performance.
Erwin accurately pointed out in another answer that Pg can't use a simple b-tree index on a date with date_trunc, so you're better off using ranges. It can use an index created on the expression date_trunc('month',colname) but you're better off avoiding the creation of another unnecessary index.
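For completeness, such an expression index might look like the sketch below. The index names are made up, and it relies on signup_date and cancel_date being timestamp without time zone, as in the test setup; the two-argument date_trunc on timestamp with time zone is not immutable, so it can't be indexed this way.

```sql
-- Hypothetical index names. These only help queries that filter on exactly
-- this expression, e.g. WHERE date_trunc('month', signup_date) = m.
CREATE INDEX accounts_signup_month_idx
    ON accounts (date_trunc('month', signup_date));
CREATE INDEX accounts_cancel_month_idx
    ON accounts (date_trunc('month', cancel_date));
```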
Rephrasing the single-scan-and-filter query to use ranges produces:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (
(a.signup_date >= m AND a.signup_date < m + INTERVAL '1' MONTH)
OR (a.cancel_date >= m AND a.cancel_date < m + INTERVAL '1' MONTH))
GROUP BY m
ORDER BY m;
There's no need to avoid date_trunc in non-indexable conditions, so I've only changed to use interval ranges in the join condition.
Where the original query used a seq scan and materialize, this now uses a bitmap index scan if there are indexes on signup_date and cancel_date.
In PostgreSQL 9.2, better performance may be gained by adding:
CREATE INDEX account_signup_or_cancel ON accounts(signup_date,cancel_date);
and possibly:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_desc_nonnull
ON accounts(cancel_date DESC) WHERE (cancel_date IS NOT NULL);
to allow index-only scans. It's hard to make solid index recommendations without the actual data to test with.
Alternately, the subquery based approach with improved indexable filter condition:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE signup_date >= m AND signup_date < m + INTERVAL '1' MONTH
) AS n_signups, (
SELECT count(cancel_date)
FROM accounts
WHERE cancel_date >= m AND cancel_date < m + INTERVAL '1' MONTH
) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
will benefit from ordinary b-tree indexes on signup_date and cancel_date, or from:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_nonnull
ON accounts(cancel_date) WHERE (cancel_date IS NOT NULL);
Remember that every index you create imposes a penalty on INSERT and UPDATE performance, and competes with other indexes and heap data for cache space. Try to create only indexes that make a big difference and are useful for other queries.