Optimizing this counting query in PostgreSQL - sql

I need to implement a basic facet search sidebar in my app. Unfortunately I can't use Elasticsearch/Solr/alternatives and am limited to Postgres.
I have 10+ columns ('status', 'classification', 'filing_type'...) for which I need to return counts for every distinct value after every search, and display them accordingly. I've drafted this bit of SQL, but it won't take me very far in the long run, as it will slow down massively once I reach a high number of rows.
select row_to_json(t) from (
select 'status' as column, status as value, count(*) from api_articles_mv_temp group by status
union
select 'classification' as column, classification as value, count(*) from api_articles_mv_temp group by classification
union
select 'filing_type' as column, filing_type as value, count(*) from api_articles_mv_temp group by filing_type
union
...) t;
This yields
{"column":"classification","value":"State","count":2001}
{"column":"classification","value":"Territory","count":23}
{"column":"filing_type","value":"Joint","count":169}
{"column":"classification","value":"SRO","count":771}
{"column":"filing_type","value":"Single","count":4238}
{"column":"status","value":"Updated","count":506}
{"column":"classification","value":"Federal","count":1612}
{"column":"status","value":"New","count":3901}
From the query plan, the HashAggregates are slowing it down.
Subquery Scan on t (cost=2397.58..2397.76 rows=8 width=32) (actual time=212.822..213.022 rows=8 loops=1)
-> HashAggregate (cost=2397.58..2397.66 rows=8 width=186) (actual time=212.780..212.856 rows=8 loops=1)
Group Key: ('status'::text), api_articles_mv_temp.status, (count(*))
-> Append (cost=799.11..2397.52 rows=8 width=186) (actual time=75.238..212.701 rows=8 loops=1)
-> HashAggregate (cost=799.11..799.13 rows=2 width=44) (actual time=75.221..75.242 rows=2 loops=1)
Group Key: api_articles_mv_temp.status
...
Is there a simpler, more optimized way of getting this result?

It may improve performance to read api_articles_mv_temp only once. I've written up two examples; give them a try.
If the combinations of "column" and "value" are fixed, the query looks like this:
select row_to_json(t) from (
select "column", "value", count(*) as "count"
from column_temp left outer join api_articles_mv_temp on
"value"=
case "column"
when 'status' then status
when 'classification' then classification
when 'filing_type' then filing_type
end
group by "column", "value"
) t;
The column_temp has records below:
column |value
---------------+----------
status |New
status |Updated
classification |State
classification |Territory
classification |SRO
filing_type |Single
filing_type |Joint
DB Fiddle
If just the "column" is fixed, the query looks like this:
select row_to_json(t) from (
select "column",
case "column"
when 'status' then status
when 'classification' then classification
when 'filing_type' then filing_type
end as "value",
sum("count") as "count"
from column_temp a
cross join (
select
status,
classification,
filing_type,
count(*) as "count"
from api_articles_mv_temp
group by
status,
classification,
filing_type) b
group by "column", "value"
) t;
The column_temp has records below:
column
---------------
status
classification
filing_type
DB Fiddle
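The single-scan idea behind the second query can be reproduced in miniature with SQLite via Python's sqlite3 module (a stand-in for Postgres here; the sample rows are made up). The base table is aggregated once, and the per-column facet counts are then derived from that small pre-aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE api_articles_mv_temp (status TEXT, classification TEXT, filing_type TEXT);
INSERT INTO api_articles_mv_temp VALUES
  ('New', 'State', 'Single'),
  ('New', 'State', 'Joint'),
  ('Updated', 'Federal', 'Single'),
  ('New', 'Federal', 'Single');
""")

# One scan of the base table produces the pre-aggregate; the facet counts
# for every column are then derived from it, mirroring the second query.
rows = conn.execute("""
    WITH cols(colname) AS (VALUES ('status'), ('classification'), ('filing_type')),
    pre AS (
        SELECT status, classification, filing_type, COUNT(*) AS n
        FROM api_articles_mv_temp
        GROUP BY status, classification, filing_type
    )
    SELECT colname,
           CASE colname
               WHEN 'status' THEN status
               WHEN 'classification' THEN classification
               WHEN 'filing_type' THEN filing_type
           END AS value,
           SUM(n) AS cnt
    FROM cols CROSS JOIN pre
    GROUP BY colname, value
""").fetchall()
```

With the four sample rows this yields, among others, ('status', 'New', 3) and ('classification', 'State', 2), matching what per-column GROUP BYs would produce while scanning the table only once.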


How to optimize a GROUP BY query

I've been given a task to optimize the following query (not written by me):
SELECT
"u"."email" as email,
r.url as "domain",
"u"."id" as "requesterId",
s.total * 100 / count("r"."id") as "rate",
count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM
(
SELECT
url,
id,
"requesterId",
created_at
FROM
(
SELECT
url,
id,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
id,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "r"
INNER JOIN (
SELECT
"requesterId",
url,
count(created_at) AS "total"
FROM
(
SELECT
url,
status,
created_at,
"requesterId"
FROM
(
SELECT
url,
status,
created_at,
"requesterId",
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
status,
created_at,
"requesterId"
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "s"
WHERE
status IN ('success')
AND s."created_at" :: date >= '2022-01-07' :: date
AND s."created_at" :: date <= '2022-02-07' :: date
GROUP BY
s.url,
s."requesterId"
) "s" ON s."requesterId" = r."requesterId"
AND s.url = r.url
INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE
r."created_at" :: date >= '2022-01-07' :: date
AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY
r.url,
"u"."email",
"u"."id",
s.total
LIMIT
10
So there is the requests table, which stores some API requests and there is a mechanism to retry a request if it fails, which is repeated 5 times, while keeping separate rows for each retry. If after 5 times it still fails, it's not continued anymore. This is the reason for the partition by subquery, which selects only the main requests.
What the query should return is the total number of requests and success rate, grouped by the url and requesterId. The query I was given is not only wrong, but also takes huge amounts of time to execute, so I came up with the optimized version below
WITH a AS (SELECT url,
id,
status,
"requesterId",
created_at
FROM (
SELECT url,
id,
status,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM "requests" "request"
WHERE
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
GROUP BY main_request_uuid,
retry_number,
url,
id,
status,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE request_.row_number = 1),
b AS (SELECT count(*) total, a2.url as url, a2."requesterId" FROM a a2 GROUP BY a2.url, a2."requesterId"),
c AS (SELECT count(*) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status IN ('success')
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId" JOIN users u ON b."requesterId" = u.id
LIMIT 10;
What the new version basically does is select all the main requests, and count the successful ones and the total count. The new version still takes a lot of time to execute (around 60s on a table with 4 million requests).
Is there a way to optimize this further?
You can see the table structure below. The table has no relevant indexes; adding one on (url, requesterId) had no effect.
column_name         | data_type
--------------------+--------------------------
id                  | bigint
requesterId         | bigint
proxyId             | bigint
url                 | character varying
status              | USER-DEFINED
time_spent          | integer
created_at          | timestamp with time zone
request_information | jsonb
retry_number        | smallint
main_request_uuid   | character varying
And here is the execution plan on a backup table with 100k rows. It takes 1.1s for 100k rows; it would be desirable to cut that down to at least 200ms for this case.
Limit (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
CTE a
-> Subquery Scan on request_ (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
Filter: (request_.row_number = 1)
Rows Removed by Filter: 70962
-> WindowAgg (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
-> Group (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Gather Merge (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Group (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Sort (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
" Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request.""requesterId"", request.created_at"
Sort Method: external merge Disk: 7952kB
Worker 0: Sort Method: external merge Disk: 8568kB
Worker 1: Sort Method: external merge Disk: 9072kB
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
-> Nested Loop (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
" Join Filter: (a2.""requesterId"" = u.id)"
-> Nested Loop (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
" Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2.""requesterId"" = a3.""requesterId""))"
Rows Removed by Join Filter: 69
-> HashAggregate (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
" Group Key: a3.url, a3.""requesterId"""
Batches: 5 Memory Usage: 4297kB Disk Usage: 7040kB
-> CTE Scan on a a3 (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
Filter: (status = 'success'::requests_status_enum)
Rows Removed by Filter: 8196
-> HashAggregate (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
" Group Key: a2.url, a2.""requesterId"""
Batches: 41 Memory Usage: 4297kB Disk Usage: 7328kB
-> CTE Scan on a a2 (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
" -> Index Scan using ""PK_a3ffb1c0c8416b9fc6f907b7433"" on users u (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)"
" Index Cond: (id = a3.""requesterId"")"
Planning Time: 1.494 ms
Execution Time: 1102.488 ms
These lines of your plan point to a possible optimization.
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
Sequential scans, parallel or not, are somewhat costly.
So, try changing these WHERE conditions to make them sargable and useful for a range scan.
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
Change these to
created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY
And, put a BTREE index on the created_at column.
CREATE INDEX ON requests (created_at);
Your query is complex, so I'm not totally sure this will work. But try it. The index should pull out only the rows for the dates you need.
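The sargability point can be illustrated with SQLite's EXPLAIN QUERY PLAN, standing in for Postgres EXPLAIN (table and index names here are made up for the demo). Wrapping the indexed column in a function forces a full scan, while comparing the bare column allows a range search on the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE INDEX requests_created_at ON requests (created_at);
""")

def plan(sql):
    # flatten the EXPLAIN QUERY PLAN detail rows into one string
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# wrapping the column in a function hides it from the index -> full scan
p1 = plan("SELECT * FROM requests WHERE date(created_at) >= '2022-01-07'")

# comparing the bare column is sargable -> range search on the index
p2 = plan("SELECT * FROM requests WHERE created_at >= '2022-01-07'"
          " AND created_at < '2022-02-08'")
```

p1 reports a scan of the table, while p2 reports a search using the index with a range condition, which is exactly the transformation the answer recommends for the Postgres query.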
And, your LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL permission to give you back whatever ten rows it wants from the result set. Don't use LIMIT without ORDER BY, and don't use it at all unless you need it.
Writing queries efficiently is a major part of query optimization, especially when handling huge data sets. When dealing with a large volume of data, avoid unnecessary GROUP BY or ORDER BY clauses, explicit type casts, too many joins, extra subqueries, and LIMIT (or LIMIT without ORDER BY) wherever the requirements allow. Create an index on the created_at column. If LEFT JOIN were used in your given query, the query pattern would change. My observations are:
-- avoid unnecessary GROUP BY (no aggregate function use) or ORDER BY
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
N.B.: if the created_at column's data type is timestamp without time zone, then no extra casting is needed; follow the statement below:
(created_at >= '2022-01-07'
AND created_at <= '2022-02-07')
-- Combine two CTE into one as per requirement
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
So the final query looks like this:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
) select (success * 100) / total as rate
, b.url, b."requesterId", total, email
from b
JOIN users u
ON u.id = b."requesterId"
limit 10;
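The combined one-pass idea (dedup by row_number, then count totals and successes in a single GROUP BY with FILTER) can be checked on a toy data set using SQLite (3.30+ for the FILTER clause) through Python's sqlite3; the retry rows below are fabricated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (main_request_uuid TEXT, retry_number INT, url TEXT,
                       requesterId INT, status TEXT);
INSERT INTO requests VALUES
  ('a', 0, '/x', 1, 'failed'),
  ('a', 1, '/x', 1, 'success'),   -- latest retry of request 'a'
  ('b', 0, '/x', 1, 'failed'),
  ('c', 0, '/y', 2, 'success');
""")

rows = conn.execute("""
WITH latest AS (
  SELECT url, requesterId, status,
         row_number() OVER (PARTITION BY main_request_uuid
                            ORDER BY retry_number DESC) AS rn
  FROM requests
)
SELECT url, requesterId,
       COUNT(*) AS total,
       COUNT(*) FILTER (WHERE status = 'success') AS success
FROM latest
WHERE rn = 1
GROUP BY url, requesterId
ORDER BY url
""").fetchall()
```

For ('/x', requester 1) there are two main requests, of which only 'a' ultimately succeeded, so the row is ('/x', 1, 2, 1), giving a rate of 50; the single CTE is scanned once instead of twice as in the b/c version.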
If the query above doesn't meet the requirement, then try the query below. This query would be a better fit if LEFT JOIN were used instead of INNER JOIN:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT count(1) total, a2.url as url, a2."requesterId"
FROM a a2
GROUP BY a2.url, a2."requesterId"
), c AS (SELECT count(1) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status = 'success'
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c
ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u
ON b."requesterId" = u.id
LIMIT 10;

PostgreSQL performance, using ILIKE with just two percentages versus not at all

I'm using ILIKE to search for the title of a row based on user input. When the user has nothing inputted (empty string), all rows should return.
Is there a performance difference if you query a SELECT statement with ILIKE '%%' versus without it at all? In other words, is it okay to just query ILIKE empty or should I get rid of it in my query if there is no search filter text?
Is there a performance difference if you query a SELECT statement with ILIKE '%%' versus without it at all?
The two queries:
select *
from some_table
where some_column ILIKE '%'
and
select *
from some_table
will return different results.
The first one is equivalent to where some_column is not null - so it will never return rows where some_column is null, but the second one will.
So it's not only about performance, but also about correctness.
Performance wise they will most likely be identical - doing a Seq Scan in both cases.
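The NULL difference is easy to demonstrate. SQLite's LIKE is case-insensitive for ASCII, so it serves as a rough stand-in for ILIKE here (table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (title TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("Alpha",), ("beta",), (None,)])

# without any filter, all rows come back, including the NULL title
total = conn.execute("SELECT count(*) FROM t").fetchone()[0]

# NULL LIKE '%' evaluates to NULL, so the WHERE clause drops that row
matched = conn.execute("SELECT count(*) FROM t WHERE title LIKE '%'").fetchone()[0]
```

total is 3 but matched is 2: the match-everything pattern silently excludes NULLs, which is exactly why the two queries are not interchangeable.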
On PostgreSQL (13.1) the two queries are not equivalent:
test=# select count(*) from test_ilike where test_string ilike '%%';
count
--------
100000
(1 row)
Time: 87,211 ms
test=# select count(*) from test_ilike where test_string ilike '';
count
-------
0
(1 row)
Time: 85,521 ms
test=# explain analyze select count(*) from test_ilike where test_string ilike '%%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2333.97..2333.99 rows=1 width=8) (actual time=86.859..86.860 rows=1 loops=1)
-> Seq Scan on test_ilike (cost=0.00..2084.00 rows=99990 width=0) (actual time=0.022..81.497 rows=100000 loops=1)
Filter: (test_string ~~* '%%'::text)
Planning Time: 0.313 ms
Execution Time: 86.893 ms
(5 rows)
Time: 87,582 ms
test=# explain analyze select count(*) from test_ilike where test_string ilike '';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Aggregate (cost=2084.00..2084.01 rows=1 width=8) (actual time=83.223..83.224 rows=1 loops=1)
-> Seq Scan on test_ilike (cost=0.00..2084.00 rows=1 width=0) (actual time=83.219..83.219 rows=0 loops=1)
Filter: (test_string ~~* ''::text)
Rows Removed by Filter: 100000
Planning Time: 0.104 ms
Execution Time: 83.257 ms
(6 rows)
Time: 83,728 ms

Does PostgreSQL share the ordering of a CTE?

In PostgreSQL (before version 12), common table expressions (CTEs) are optimization fences. This means that the CTE is materialized into memory and that predicates from another query will never be pushed down into the CTE.
Now I am wondering if other metadata about the CTE, such as ordering, is shared to the other queries. Let's take the following query:
WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects
Here, MIN(type) is obviously always the first row of ordered_objects (or NULL if ordered_objects is empty), because ordered_objects is already ordered by type. Is this knowledge about ordered_objects available when evaluating SELECT MIN(type) FROM ordered_objects?
If I understand your question correctly: no, it does not; no such knowledge is shared, as you will find in the example below. When you LIMIT to 10 rows, execution is dramatically faster because there is less data to process (in my case a million times less), which means that without the LIMIT, the CTE scans the whole ordered set, ignoring the fact that the MIN would be in the first rows...
data:
t=# create table object (type bigint);
CREATE TABLE
Time: 4.636 ms
t=# insert into object select generate_series(1,9999999);
INSERT 0 9999999
Time: 7769.275 ms
with limit:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 3150.183 ms
https://explain.depesz.com/s/5yXe
without:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 16032.989 ms
https://explain.depesz.com/s/1SU
I did warm up the data before the tests, of course.
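Since the planner will not exploit the CTE's ordering for you, you can exploit it by hand: the CTE is already ordered, so its first row is the minimum, and a LIMIT 1 avoids aggregating the whole set. A sketch in SQLite via Python's sqlite3 (sample data made up; note that relying on a subquery's ORDER BY surviving into the outer query is common practice but not guaranteed by the SQL standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object (type INTEGER)")
conn.executemany("INSERT INTO object VALUES (?)", [(5,), (3,), (9,), (1,), (7,)])

cte = "WITH ordered_objects AS (SELECT * FROM object ORDER BY type ASC LIMIT 10) "

# the aggregate scans every CTE row, even though they are already ordered
min_agg = conn.execute(cte + "SELECT MIN(type) FROM ordered_objects").fetchone()[0]

# manually exploiting the ordering: the first row of the CTE is the minimum
first_row = conn.execute(cte + "SELECT type FROM ordered_objects LIMIT 1").fetchone()[0]
```

Both queries return 1 here; the second form is the manual version of the optimization the question hoped the planner would apply on its own.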
In Postgres, a CTE is always executed once, even if it is referenced more than once.
Its result is stored in a temporary table (materialized).
The outer query has no knowledge of the CTE's internal structure (indexes are not available) or ordering (I'm not sure about frequency estimates); it just scans the temporary result.
In the fragment below, the CTE is scanned twice, even though the results are known to be identical.
\d react
EXPLAIN ANALYZE
WITH omg AS (
SELECT topic_id
, row_number() OVER (PARTITION by krant_id ORDER BY topic_id) AS rn
FROM react
WHERE krant_id = 1
AND topic_id < 5000000
ORDER BY topic_id ASC
)
SELECT MIN (o2.topic_id)
FROM omg o1 --
JOIN omg o2 ON o1.rn = o2.rn -- exactly the same
WHERE o1.rn = 1
;
Table "public.react"
Column | Type | Modifiers
------------+--------------------------+--------------------
krant_id | integer | not null default 1
topic_id | integer | not null
react_id | integer | not null
react_date | timestamp with time zone |
react_nick | character varying(1000) |
react_body | character varying(4000) |
zoek | tsvector |
Indexes:
"react_pkey" PRIMARY KEY, btree (krant_id, topic_id, react_id)
"react_krant_id_react_nick_react_date_topic_id_react_id_idx" UNIQUE, btree (krant_id, react_nick, react_date, topic_id, react_id)
"react_date" btree (krant_id, topic_id, react_date)
"react_nick" btree (krant_id, topic_id, react_nick)
"react_zoek" gin (zoek)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON react FOR EACH ROW EXECUTE PROCEDURE tf_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON react FOR EACH ROW WHEN (new.react_body::text <> old.react_body::text) EXECUTE PROCEDURE tf_upd_zzoek()
----------
Aggregate (cost=232824.29..232824.29 rows=1 width=4) (actual time=1773.643..1773.645 rows=1 loops=1)
CTE omg
-> WindowAgg (cost=0.43..123557.17 rows=402521 width=8) (actual time=0.217..1246.577 rows=230822 loops=1)
-> Index Only Scan using react_pkey on react (cost=0.43..117519.35 rows=402521 width=8) (actual time=0.161..419.916 rows=230822 loops=1)
Index Cond: ((krant_id = 1) AND (topic_id < 5000000))
Heap Fetches: 442
-> Nested Loop (cost=0.00..99136.69 rows=4052169 width=4) (actual time=0.264..1773.624 rows=1 loops=1)
-> CTE Scan on omg o1 (cost=0.00..9056.72 rows=2013 width=8) (actual time=0.249..59.252 rows=1 loops=1)
Filter: (rn = 1)
Rows Removed by Filter: 230821
-> CTE Scan on omg o2 (cost=0.00..9056.72 rows=2013 width=12) (actual time=0.003..1714.355 rows=1 loops=1)
Filter: (rn = 1)
Rows Removed by Filter: 230821
Total runtime: 1782.887 ms
(14 rows)

Tricky postgresql query optimization (distinct row aggregation with ordering)

I have a table of events that has a very similar schema and data distribution as this artificial table that can easily be generated locally:
CREATE TABLE events AS
WITH args AS (
SELECT
300 AS scale_factor, -- feel free to reduce this to speed up local testing
1000 AS pa_count,
1 AS l_count_min,
29 AS l_count_rand,
10 AS c_count,
10 AS pr_count,
3 AS r_count,
'10 days'::interval AS time_range -- edit 2017-05-02: the real data set has years worth of data here, but the query time ranges stay small (a couple days)
)
SELECT
p.c_id,
'ABC'||lpad(p.pa_id::text, 13, '0') AS pa_id,
'abcdefgh-'||((random()*(SELECT pr_count-1 FROM args)+1))::int AS pr_id,
((random()*(SELECT r_count-1 FROM args)+1))::int AS r,
'2017-01-01Z00:00:00'::timestamp without time zone + random()*(SELECT time_range FROM args) AS t
FROM (
SELECT
pa_id,
((random()*(SELECT c_count-1 FROM args)+1))::int AS c_id,
(random()*(SELECT l_count_rand FROM args)+(SELECT l_count_min FROM args))::int AS l_count
FROM generate_series(1, (SELECT pa_count*scale_factor FROM args)) pa_id
) p
JOIN LATERAL (
SELECT generate_series(1, p.l_count)
) l(id) ON (true);
Excerpt from SELECT * FROM events:
What I need is a query that selects all rows for a given c_id in a given time range of t, then filters them down to only include the most recent rows (by t) for each unique pr_id and pa_id combination, and then counts the number of pr_id and r combinations of those rows.
That's a quite a mouthful, so here are 3 SQL queries that I came up with that produce the desired results:
WITH query_a AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
),
query_b AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT
pr_id,
pa_id,
first_not_null(r ORDER BY t DESC) AS r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
GROUP BY
1,
2
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
),
query_c AS (
SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT
pr_id,
pa_id,
first_not_null(r) AS r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
GROUP BY
1,
2
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
)
And here is the custom aggregate function used by query_b and query_c, as well as what I believe to be the most optimal index, settings and conditions:
CREATE FUNCTION first_not_null_agg(before anyelement, value anyelement) RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT
AS $_$
SELECT $1;
$_$;
CREATE AGGREGATE first_not_null(anyelement) (
SFUNC = first_not_null_agg,
STYPE = anyelement
);
CREATE INDEX events_idx ON events USING btree (c_id, t DESC, pr_id, pa_id, r);
VACUUM ANALYZE events;
SET work_mem='128MB';
My dilemma is that query_c outperforms query_a and query_b by a factor of > 6x, but is technically not guaranteed to produce the same result as the other queries (notice the missing ORDER BY in the first_not_null aggregate). However, in practice it seems to pick a query plan that I believe to be correct and most optimal.
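For clarity, here is what that aggregate actually computes, sketched as a hypothetical plain-Python reconstruction of its semantics: because the transition function is STRICT, NULL inputs are skipped and the first non-NULL input initialises the state, after which SELECT $1 keeps returning that state unchanged.

```python
def first_not_null(values):
    """Python analog of the first_not_null Postgres aggregate above
    (a reconstruction of its semantics, not the real implementation)."""
    state = None                # no INITCOND: the state starts out as NULL
    for v in values:
        if v is None:
            continue            # STRICT: transition is skipped for NULL inputs
        if state is None:
            state = v           # STRICT: first non-NULL input becomes the state
        # otherwise "SELECT $1" returns the old state, keeping the first value
    return state
```

So first_not_null([None, 3, 7]) returns 3. The ORDER BY t DESC in query_b pins down which value is "first" (the most recent), whereas query_c leaves the input order up to the plan, which is exactly the correctness gap described above.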
Below are the EXPLAIN (ANALYZE, VERBOSE) outputs for all 3 queries on my local machine:
query_a:
CTE Scan on query_a (cost=25810.77..26071.25 rows=13024 width=44) (actual time=3329.921..3329.934 rows=30 loops=1)
Output: query_a.pr_id, query_a.r, query_a.quantity
CTE query_a
-> Sort (cost=25778.21..25810.77 rows=13024 width=23) (actual time=3329.918..3329.921 rows=30 loops=1)
Output: events.pr_id, events.r, (count(1))
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=24757.86..24888.10 rows=13024 width=23) (actual time=3329.849..3329.892 rows=30 loops=1)
Output: events.pr_id, events.r, count(1)
Group Key: events.pr_id, events.r
-> Unique (cost=21350.90..22478.71 rows=130237 width=40) (actual time=3168.656..3257.299 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
-> Sort (cost=21350.90..21726.83 rows=150375 width=40) (actual time=3168.655..3209.095 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Sort Key: events.pr_id, events.pa_id, events.t DESC
Sort Method: quicksort Memory: 18160kB
-> Index Only Scan using events_idx on public.events (cost=0.56..8420.00 rows=150375 width=40) (actual time=0.038..101.584 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.316 ms
Execution time: 3331.082 ms
query_b:
CTE Scan on query_b (cost=67140.75..67409.53 rows=13439 width=44) (actual time=3761.077..3761.090 rows=30 loops=1)
Output: query_b.pr_id, query_b.r, query_b.quantity
CTE query_b
-> Sort (cost=67107.15..67140.75 rows=13439 width=23) (actual time=3761.074..3761.081 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), (count(1))
Sort Key: (count(1)), (first_not_null(events.r ORDER BY events.t DESC)), events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=66051.24..66185.63 rows=13439 width=23) (actual time=3760.997..3761.049 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r ORDER BY events.t DESC)), count(1)
Group Key: events.pr_id, first_not_null(events.r ORDER BY events.t DESC)
-> GroupAggregate (cost=22188.98..63699.49 rows=134386 width=32) (actual time=2961.471..3671.669 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, first_not_null(events.r ORDER BY events.t DESC)
Group Key: events.pr_id, events.pa_id
-> Sort (cost=22188.98..22578.94 rows=155987 width=40) (actual time=2961.436..3012.440 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Sort Key: events.pr_id, events.pa_id
Sort Method: quicksort Memory: 18160kB
-> Index Only Scan using events_idx on public.events (cost=0.56..8734.27 rows=155987 width=40) (actual time=0.038..97.336 rows=153795 loops=1)
Output: events.pr_id, events.pa_id, events.r, events.t
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.385 ms
Execution time: 3761.852 ms
query_c:
CTE Scan on query_c (cost=51400.06..51660.54 rows=13024 width=44) (actual time=524.382..524.395 rows=30 loops=1)
Output: query_c.pr_id, query_c.r, query_c.quantity
CTE query_c
-> Sort (cost=51367.50..51400.06 rows=13024 width=23) (actual time=524.380..524.384 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r)), (count(1))
Sort Key: (count(1)), (first_not_null(events.r)), events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=50347.14..50477.38 rows=13024 width=23) (actual time=524.311..524.349 rows=30 loops=1)
Output: events.pr_id, (first_not_null(events.r)), count(1)
Group Key: events.pr_id, first_not_null(events.r)
-> HashAggregate (cost=46765.62..48067.99 rows=130237 width=32) (actual time=401.480..459.962 rows=116547 loops=1)
Output: events.pr_id, events.pa_id, first_not_null(events.r)
Group Key: events.pr_id, events.pa_id
-> Index Only Scan using events_idx on public.events (cost=0.56..8420.00 rows=150375 width=32) (actual time=0.027..109.459 rows=153795 loops=1)
Output: events.c_id, events.t, events.pr_id, events.pa_id, events.r
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.296 ms
Execution time: 525.566 ms
Broadly speaking, I believe that the index above should allow query_a and query_b to be executed without the Sort nodes that slow them down, but so far I've failed to convince the postgres query optimizer to do my bidding.
I'm also somewhat confused about the t column not being included in the Sort key for query_b, considering that quicksort is not stable. It seems like this could yield the wrong results.
I've verified that all 3 queries generate the same results running the following queries and verifying they produce an empty result set:
SELECT * FROM query_a
EXCEPT
SELECT * FROM query_b;
and
SELECT * FROM query_a
EXCEPT
SELECT * FROM query_c;
I'd consider query_a to be the canonical query when in doubt.
I greatly appreciate any input on this. I've actually found a terribly hacky workaround to achieve acceptable performance in my application, but this problem continues to haunt me in my sleep (and in fact on vacation, which I'm currently on) ... 😬.
FWIW, I've looked at many similar questions and answers, which have guided my current thinking, but I believe there is something unique about the two-column grouping (pr_id, pa_id) combined with sorting by a 3rd column (t) that keeps this from being a duplicate question.
Edit: The outer queries in the example may be entirely irrelevant to the question, so feel free to ignore them if it helps.
I'd consider query_a to be the canonical query when in doubt.
I found a way to make query_a half a second fast.
Your inner query from query_a
SELECT DISTINCT ON (pr_id, pa_id)
needs to go with
ORDER BY pr_id, pa_id, t DESC
especially with pr_id and pa_id listed first.
c_id = 5 is constant, but you cannot use your index events_idx (c_id, t DESC, pr_id, pa_id, r), because its columns are not organized as (pr_id, pa_id, t DESC), which your ORDER BY clause demands.
If you had an index on at least (pr_id, pa_id, t DESC), then the sorting would not have to happen, because the ORDER BY condition aligns with that index.
So here is what I did.
CREATE INDEX events_idx2 ON events (c_id, pr_id, pa_id, t DESC, r);
This index can be used by your inner query - at least in theory.
Unfortunately the query planner thinks that it's better to reduce the number of rows by using index events_idx with c_id and x <= t < y.
Postgres does not have index hints, so we need a different way to convince the query planner to take the new index events_idx2.
One way to force the use of events_idx2 is to make the other index more expensive.
This can be done by removing the last column r from events_idx and make it unusable for query_a (at least unusable without loading the pages from the heap).
It is counter-intuitive to move the t column later in the index layout, because usually the first columns will be chosen for = and ranges, which c_id and t qualify well for.
However, your ORDER BY (pr_id, pa_id, t DESC) mandates at least this subset as-is in your index. Of course we still put the c_id first to reduce the rows as soon as possible.
You can still have an index on (c_id, t DESC, pr_id, pa_id), if you need, but it cannot be used in query_a.
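The effect described above, where an index whose column order matches the ORDER BY makes the sort node disappear, can be seen in miniature with SQLite's EXPLAIN QUERY PLAN (standing in for Postgres; the schema is reduced to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (c_id INT, pr_id INT, pa_id INT, t TEXT, r INT)")

def plan(sql):
    # flatten the EXPLAIN QUERY PLAN detail rows into one string
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = ("SELECT pr_id, pa_id, t FROM events "
     "WHERE c_id = 5 ORDER BY pr_id, pa_id, t DESC")

before = plan(q)   # no matching index: an explicit sort step is needed

conn.execute("CREATE INDEX events_idx2 ON events (c_id, pr_id, pa_id, t DESC)")
after = plan(q)    # index order matches the ORDER BY: the sort disappears
```

Before the index exists, the plan contains a "USE TEMP B-TREE FOR ORDER BY" step (SQLite's equivalent of the Sort node); after creating the (c_id, pr_id, pa_id, t DESC) index, the rows come out of the index already in the requested order.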
Here is the query plan for query_a with events_idx2 used and events_idx deleted.
Look for events_c_id_pr_id_pa_id_t_r_idx, which is how PG names indices automatically, when you don't give them a name.
I like it this way, because I can see the order of the columns in the name of the index in every query plan.
Sort (cost=30076.71..30110.75 rows=13618 width=23) (actual time=426.898..426.914 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=29005.43..29141.61 rows=13618 width=23) (actual time=426.820..426.859 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.56..26622.33 rows=136177 width=40) (actual time=0.037..328.828 rows=117204 loops=1)
-> Index Only Scan using events_c_id_pr_id_pa_id_t_r_idx on events (cost=0.56..25830.50 rows=158366 width=40) (actual time=0.035..178.594 rows=154940 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.201 ms
Execution time: 427.017 ms
(11 rows)
Planning is instantaneous and execution is sub-second, because the index matches the ORDER BY of the inner query.
With good performance on query_a there is no need for an additional function to make alternative queries query_b and query_c faster.
Remarks:
Somehow I could not find a primary key in your relation.
The aforementioned proposed solution works without any primary key assumption.
I still think that you have some primary key, but maybe forgot to mention it.
The natural key is pa_id. Each pa_id refers to "a thing" that has
~1...30 events recorded about it.
If pa_id is in relation to multiple c_id's, then pa_id alone cannot be key.
If pr_id and r are data, then maybe (c_id, pa_id, t) is unique key?
Also your index events_idx is not created unique, but spans all columns of the relation, so you could have multiple equal rows - do you want to allow that?
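If (c_id, pa_id, t) really is unique in your data, you could encode that assumption explicitly; a sketch (the constraint name is made up, and this fails if duplicate rows already exist):

```sql
-- Assumption: no two events share the same (c_id, pa_id, t).
ALTER TABLE events
    ADD CONSTRAINT events_cid_paid_t_key UNIQUE (c_id, pa_id, t);
```

Besides preventing duplicate rows, the backing unique index may itself be usable by some queries.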
If you really need both indices events_idx and the proposed events_idx2, then you will have the data stored 3 times in total (twice in indices, once on the heap).
Since this really is a tricky query optimization, I kindly ask you to at least consider adding a bounty for whoever answers your question, also since it has been sitting on SO without answer for quite some time.
EDIT A
I inserted another set of data using your excellently generated setup above, basically doubling the number of rows.
The dates started from '2017-01-10' this time.
All other parameters stayed the same.
Here is a partial index on the time attribute and its query behaviour.
CREATE INDEX events_timerange ON events (c_id, pr_id, pa_id, t DESC, r) WHERE '2017-01-03' <= t AND t < '2017-01-06';
Sort (cost=12510.07..12546.55 rows=14591 width=23) (actual time=361.579..361.595 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=11354.99..11500.90 rows=14591 width=23) (actual time=361.503..361.543 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.55..8801.60 rows=145908 width=40) (actual time=0.026..265.084 rows=118571 loops=1)
-> Index Only Scan using events_timerange on events (cost=0.55..8014.70 rows=157380 width=40) (actual time=0.024..115.265 rows=155800 loops=1)
Index Cond: (c_id = 5)
Heap Fetches: 0
Planning time: 0.214 ms
Execution time: 361.692 ms
(11 rows)
Without the index events_timerange (i.e. with the regular full index):
Sort (cost=65431.46..65467.93 rows=14591 width=23) (actual time=472.809..472.824 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=64276.38..64422.29 rows=14591 width=23) (actual time=472.732..472.776 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.56..61722.99 rows=145908 width=40) (actual time=0.024..374.392 rows=118571 loops=1)
-> Index Only Scan using events_c_id_pr_id_pa_id_t_r_idx on events (cost=0.56..60936.08 rows=157380 width=40) (actual time=0.021..222.987 rows=155800 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.171 ms
Execution time: 472.925 ms
(11 rows)
With the partial index the runtime is about 100 ms faster, even though the whole table is twice as big.
(Note: the second time around it was only 50 ms faster. The advantage should grow as more events are recorded, though, because the queries requiring the full index will become slower, as you suspect (and I agree).)
Also, on my machine, the full index is 810 MB for two inserts (create table + additional from 2017-01-10).
The partial index covering WHERE '2017-01-03' <= t AND t < '2017-01-06' is only 91 MB.
Maybe you can create partial indices on a monthly or yearly basis?
Depending on what time range is queried, maybe only recent data needs to be indexed, or otherwise only old data partially?
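A monthly scheme could look like this sketch (index names and month boundaries are examples; you would create one per month, e.g. from a scheduled job). Bear in mind that the planner only picks a partial index when the query's WHERE clause, with literal constants, provably implies the index predicate:

```sql
-- One partial index per month; a query whose time range falls entirely
-- inside one month's predicate can use that month's (much smaller) index.
CREATE INDEX events_2017_01 ON events (c_id, pr_id, pa_id, t DESC, r)
    WHERE '2017-01-01' <= t AND t < '2017-02-01';
CREATE INDEX events_2017_02 ON events (c_id, pr_id, pa_id, t DESC, r)
    WHERE '2017-02-01' <= t AND t < '2017-03-01';
```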
I also tried a partial index with WHERE c_id = 5, effectively partitioning by c_id.
Sort (cost=51324.27..51361.47 rows=14880 width=23) (actual time=550.579..550.592 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id DESC
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=50144.21..50293.01 rows=14880 width=23) (actual time=550.481..550.528 rows=30 loops=1)
Group Key: events.pr_id, events.r
-> Unique (cost=0.42..47540.21 rows=148800 width=40) (actual time=0.050..443.393 rows=118571 loops=1)
-> Index Only Scan using events_cid on events (cost=0.42..46736.42 rows=160758 width=40) (actual time=0.047..269.676 rows=155800 loops=1)
Index Cond: ((t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Planning time: 0.366 ms
Execution time: 550.706 ms
(11 rows)
So partial indexing may also be a viable option.
If you get ever more data, then you may also consider partitioning, for example all rows aged two years and older into a separate table or something.
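On PostgreSQL 10+ that could be declarative range partitioning on t; a sketch only (column types are assumed from the queries above, and on 9.x you would use table inheritance plus constraint exclusion instead):

```sql
-- Sketch: range-partition events by time (PostgreSQL 10+ syntax).
CREATE TABLE events_p (
    c_id  integer,
    pr_id text,
    pa_id text,
    t     timestamp,
    r     text
) PARTITION BY RANGE (t);

-- Old rows live in one partition, recent rows in another.
CREATE TABLE events_p_old PARTITION OF events_p
    FOR VALUES FROM (MINVALUE) TO ('2017-01-01');
CREATE TABLE events_p_2017 PARTITION OF events_p
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
```

Queries with a range condition on t then only scan the matching partitions.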
I don't think Block Range Indexes (BRIN) would help here, though.
If your machine is beefier than mine, then you can just insert 10 times the amount of data and check how the regular full index behaves on an ever-growing table.
[EDITED]
OK, as this depends on your data distribution, here is another way to do it.
First, add the following index:
CREATE INDEX events_idx2 ON events (c_id, t DESC, pr_id, pa_id, r);
This extracts MAX(t) as quickly as possible, on the assumption that the resulting subset will be much smaller to join back against the parent table. It will probably be slower, however, if that subset is not actually small.
SELECT
e.pr_id,
e.r,
count(1) AS quantity
FROM events e
JOIN (
SELECT
pr_id,
pa_id,
MAX(t) last_t
FROM events e
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pr_id,
pa_id
) latest
ON (
c_id = 5
AND latest.pr_id = e.pr_id
AND latest.pa_id = e.pa_id
AND latest.last_t = e.t
)
GROUP BY
e.pr_id,
e.r
ORDER BY 3, 2, 1 DESC
Full Fiddle
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
--PostgreSQL 9.6
--'\\' is a delimiter
-- CREATE TABLE events AS...
VACUUM ANALYZE events;
CREATE INDEX idx_events_idx ON events (c_id, t DESC, pr_id, pa_id, r);
Query 1:
-- query A
explain analyze SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=2170.24..2170.74 rows=200 width=15) (actual time=358.239..358.245 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=2160.60..2162.60 rows=200 width=15) (actual time=358.181..358.189 rows=30 loops=1)
-> Unique (cost=2012.69..2132.61 rows=1599 width=40) (actual time=327.345..353.750 rows=12098 loops=1)
-> Sort (cost=2012.69..2052.66 rows=15990 width=40) (actual time=327.344..348.686 rows=15966 loops=1)
Sort Key: events.pr_id, events.pa_id, events.t
Sort Method: external merge Disk: 792kB
-> Index Only Scan using idx_events_idx on events (cost=0.42..896.20 rows=15990 width=40) (actual time=0.059..5.475 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 358.610 ms
Query 2:
-- query max/JOIN
explain analyze SELECT
e.pr_id,
e.r,
count(1) AS quantity
FROM events e
JOIN (
SELECT
pr_id,
pa_id,
MAX(t) last_t
FROM events e
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pr_id,
pa_id
) latest
ON (
c_id = 5
AND latest.pr_id = e.pr_id
AND latest.pa_id = e.pa_id
AND latest.last_t = e.t
)
GROUP BY
e.pr_id,
e.r
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=4153.31..4153.32 rows=1 width=15) (actual time=68.398..68.402 rows=30 loops=1)
Sort Key: (count(1)), e.r, e.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=4153.29..4153.30 rows=1 width=15) (actual time=68.363..68.371 rows=30 loops=1)
-> Merge Join (cost=1133.62..4153.29 rows=1 width=15) (actual time=35.083..64.154 rows=12098 loops=1)
Merge Cond: ((e.t = (max(e_1.t))) AND (e.pr_id = e_1.pr_id))
Join Filter: (e.pa_id = e_1.pa_id)
-> Index Only Scan Backward using idx_events_idx on events e (cost=0.42..2739.72 rows=53674 width=40) (actual time=0.010..8.073 rows=26661 loops=1)
Index Cond: (c_id = 5)
Heap Fetches: 0
-> Sort (cost=1133.19..1137.19 rows=1599 width=36) (actual time=29.778..32.885 rows=12098 loops=1)
Sort Key: (max(e_1.t)), e_1.pr_id
Sort Method: external sort Disk: 640kB
-> HashAggregate (cost=1016.12..1032.11 rows=1599 width=36) (actual time=12.731..16.738 rows=12098 loops=1)
-> Index Only Scan using idx_events_idx on events e_1 (cost=0.42..896.20 rows=15990 width=36) (actual time=0.029..5.084 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 68.736 ms
Query 3:
DROP INDEX idx_events_idx
CREATE INDEX idx_events_flutter ON events (c_id, pr_id, pa_id, t DESC, r)
Query 5:
-- query A + index by flutter
explain analyze SELECT
pr_id,
r,
count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id,
pa_id,
r
FROM events
WHERE
c_id = 5 AND
t >= '2017-01-03Z00:00:00' AND
t < '2017-01-06Z00:00:00'
ORDER BY pr_id, pa_id, t DESC
) latest
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC
Results:
QUERY PLAN
Sort (cost=2744.82..2745.32 rows=200 width=15) (actual time=20.915..20.916 rows=30 loops=1)
Sort Key: (count(1)), events.r, events.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=2735.18..2737.18 rows=200 width=15) (actual time=20.883..20.892 rows=30 loops=1)
-> Unique (cost=0.42..2707.20 rows=1599 width=40) (actual time=0.037..16.488 rows=12098 loops=1)
-> Index Only Scan using idx_events_flutter on events (cost=0.42..2627.25 rows=15990 width=40) (actual time=0.036..10.893 rows=15966 loops=1)
Index Cond: ((c_id = 5) AND (t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 20.964 ms
Just two different methods (YMMV):
-- using a window function to find the record with the most recent t:
EXPLAIN ANALYZE
SELECT pr_id, r, count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id, pa_id,
first_value(r) OVER www AS r
-- last_value(r) OVER www AS r
FROM events
WHERE c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
WINDOW www AS (PARTITION BY pr_id, pa_id ORDER BY t DESC)
ORDER BY 1, 2, t DESC
) sss
GROUP BY 1, 2
ORDER BY 3, 2, 1 DESC
;
-- Avoiding the window function; find the MAX via NOT EXISTS() ::
EXPLAIN ANALYZE
SELECT pr_id, r, count(1) AS quantity
FROM (
SELECT DISTINCT ON (pr_id, pa_id)
pr_id, pa_id, r
FROM events e
WHERE c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
AND NOT EXISTS ( SELECT * FROM events nx
WHERE nx.c_id = 5 AND nx.pr_id =e.pr_id AND nx.pa_id =e.pa_id
AND nx.t >= '2017-01-03Z00:00:00'
AND nx.t < '2017-01-06Z00:00:00'
AND nx.t > e.t
)
) sss
GROUP BY 1, 2
ORDER BY 3, 2, 1 DESC
;
Note: the DISTINCT ON can be omitted from the second query, the results are already unique.
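For completeness, on PostgreSQL 9.3+ a LATERAL join is a third way to pick the latest row per (pr_id, pa_id); a hedged sketch against the same assumed schema:

```sql
-- For each distinct (pr_id, pa_id) pair in the range, fetch the r of its
-- most recent event via a correlated LATERAL subquery, then aggregate.
SELECT latest.pr_id, latest.r, count(1) AS quantity
FROM (SELECT DISTINCT pr_id, pa_id
      FROM events
      WHERE c_id = 5
        AND t >= '2017-01-03 00:00:00'
        AND t <  '2017-01-06 00:00:00') keys
CROSS JOIN LATERAL (
    SELECT e.pr_id, e.r
    FROM events e
    WHERE e.c_id = 5
      AND e.pr_id = keys.pr_id
      AND e.pa_id = keys.pa_id
      AND e.t >= '2017-01-03 00:00:00'
      AND e.t <  '2017-01-06 00:00:00'
    ORDER BY e.t DESC
    LIMIT 1
) latest
GROUP BY 1, 2
ORDER BY 3, 2, 1 DESC;
```

This tends to pay off when there are few distinct pairs relative to the number of events, since each LIMIT 1 probe can stop at the first index entry.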
I'd try to use a standard ROW_NUMBER() function with a matching index instead of Postgres-specific DISTINCT ON to find the "latest" rows.
Index
CREATE INDEX ix_events ON events USING btree (c_id, pa_id, pr_id, t DESC, r);
Query
WITH
CTE_RN
AS
(
SELECT
pa_id
,pr_id
,r
,ROW_NUMBER() OVER (PARTITION BY c_id, pa_id, pr_id ORDER BY t DESC) AS rn
FROM events
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
)
SELECT
pr_id
,r
,COUNT(*) AS quantity
FROM CTE_RN
WHERE rn = 1
GROUP BY
pr_id
,r
ORDER BY quantity, r, pr_id DESC
;
I don't have Postgres at hand, so I'm using http://rextester.com for testing. I set the scale_factor to 30 in the data generation script, otherwise it takes too long for rextester. I'm getting the following query plan. The time component should be ignored, but you can see that there are no intermediate sorts, only the sort for the final ORDER BY. See http://rextester.com/GUFXY36037
Please try this query on your hardware and your data. It would be interesting to see how it compares to your query. I noticed that optimizer doesn't choose this index if the table has the index that you defined. If you see the same on your server, please try to drop or disable other indexes to get the plan that I got.
Sort (cost=158.07..158.08 rows=1 width=44) (actual time=81.445..81.448 rows=30 loops=1)
Output: cte_rn.pr_id, cte_rn.r, (count(*))
Sort Key: (count(*)), cte_rn.r, cte_rn.pr_id DESC
Sort Method: quicksort Memory: 27kB
CTE cte_rn
-> WindowAgg (cost=0.42..157.78 rows=12 width=88) (actual time=0.204..56.215 rows=15130 loops=1)
Output: events.pa_id, events.pr_id, events.r, row_number() OVER (?), events.t, events.c_id
-> Index Only Scan using ix_events3 on public.events (cost=0.42..157.51 rows=12 width=80) (actual time=0.184..28.688 rows=15130 loops=1)
Output: events.c_id, events.pa_id, events.pr_id, events.t, events.r
Index Cond: ((events.c_id = 5) AND (events.t >= '2017-01-03 00:00:00'::timestamp without time zone) AND (events.t < '2017-01-06 00:00:00'::timestamp without time zone))
Heap Fetches: 15130
-> HashAggregate (cost=0.28..0.29 rows=1 width=44) (actual time=81.363..81.402 rows=30 loops=1)
Output: cte_rn.pr_id, cte_rn.r, count(*)
Group Key: cte_rn.pr_id, cte_rn.r
-> CTE Scan on cte_rn (cost=0.00..0.27 rows=1 width=36) (actual time=0.214..72.841 rows=11491 loops=1)
Output: cte_rn.pa_id, cte_rn.pr_id, cte_rn.r, cte_rn.rn
Filter: (cte_rn.rn = 1)
Rows Removed by Filter: 3639
Planning time: 0.452 ms
Execution time: 83.234 ms
There is one more optimisation you could do that relies on the external knowledge of your data.
If you can guarantee that each pair of pa_id, pr_id has values for each, say, day, then you can safely reduce the user-defined range of t to just one day.
This will reduce the number of rows that engine reads and sorts if user usually specifies range of t longer than 1 day.
If you can't provide this kind of guarantee in your data for all values, but you still know that usually all pa_id, pr_id are close to each other (by t) and user usually provides a wide range for t, you can run a preliminary query to narrow down the range of t for the main query.
Something like this:
SELECT
MIN(MaxT) AS StartT
,MAX(MaxT) AS EndT
FROM
(
SELECT
pa_id
,pr_id
,MAX(t) AS MaxT
FROM events
WHERE
c_id = 5
AND t >= '2017-01-03Z00:00:00'
AND t < '2017-01-06Z00:00:00'
GROUP BY
pa_id
,pr_id
) AS T
And then use the found StartT,EndT in the main query hoping that new range would be much narrower than original defined by the user.
The query above doesn't have to sort rows, so it should be fast. The main query has to sort rows, but there will be less rows to sort, so overall run-time may be better.
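The two steps can also be combined into a single statement via CTEs; a sketch built from the queries above (you lose the ability to inspect the narrowed range, but save a round trip):

```sql
-- First narrow the time range to [StartT, EndT], then run the
-- ROW_NUMBER() query over only that range.
WITH narrowed AS (
    SELECT MIN(MaxT) AS StartT, MAX(MaxT) AS EndT
    FROM (SELECT pa_id, pr_id, MAX(t) AS MaxT
          FROM events
          WHERE c_id = 5
            AND t >= '2017-01-03 00:00:00'
            AND t <  '2017-01-06 00:00:00'
          GROUP BY pa_id, pr_id) AS T
),
cte_rn AS (
    SELECT pa_id, pr_id, r,
           ROW_NUMBER() OVER (PARTITION BY c_id, pa_id, pr_id
                              ORDER BY t DESC) AS rn
    FROM events, narrowed
    WHERE c_id = 5
      AND t >= narrowed.StartT
      AND t <= narrowed.EndT   -- inclusive: EndT is an actual MAX(t)
)
SELECT pr_id, r, COUNT(*) AS quantity
FROM cte_rn
WHERE rn = 1
GROUP BY pr_id, r
ORDER BY quantity, r, pr_id DESC;
```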
So I've taken a slightly different tack and tried moving your grouping and distinct data into their own tables, so that we can leverage multiple table indexes. Note that this solution only works if you have control over the way data gets inserted into the database, i.e. you can change the data source application. If not, alas, this is moot.
In practice, instead of inserting into the events table immediately, you would first check if the relational date and prpa exist in their relevant tables. If not, create them. Then get their ids and use that for your insert statement to the events table.
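On PostgreSQL 9.5+, the "check, create if missing, fetch id" dance can be sketched with INSERT ... ON CONFLICT. This assumes a UNIQUE constraint on prpa(pr_id, pa_id) (the prpa_idx created below is non-unique, so you would add one), and the values are hypothetical:

```sql
-- Get-or-create a prpa row and return its id in one statement.
WITH ins AS (
    INSERT INTO prpa (pr_id, pa_id)
    VALUES ('pr_42', 'pa_17')
    ON CONFLICT (pr_id, pa_id) DO NOTHING
    RETURNING id
)
SELECT id FROM ins            -- row was just inserted
UNION ALL
SELECT id FROM prpa           -- row already existed
WHERE pr_id = 'pr_42' AND pa_id = 'pa_17'
LIMIT 1;
```

The same pattern works for the dates table; the returned ids then go into the events insert.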
Before I start: I was seeing a 10x increase in performance on query_c over query_a, and my final result for the rewritten query_a is about a 4x improvement. If that's not good enough, feel free to switch off.
Given the initial data seeding queries you gave in the first instance, I calculated the following benchmarks:
query_a: 5228.518 ms
query_b: 5708.962 ms
query_c: 538.329 ms
So, about a 10x increase in performance, give or take.
I'm going to alter the data that's generated in events, and this alteration takes quite a while. You would not need to do this in practice, as your INSERTs to the tables would be covered already.
For my optimisation, the first step is to create a table that houses dates and then transfer the data over, and relate back to it in the events table, like so:
CREATE TABLE dates (
id SERIAL,
year_part INTEGER NOT NULL,
month_part INTEGER NOT NULL,
day_part INTEGER NOT NULL
);
-- Total runtime: 8.281 ms
INSERT INTO dates(year_part, month_part, day_part) SELECT DISTINCT
EXTRACT(YEAR FROM t), EXTRACT(MONTH FROM t), EXTRACT(DAY FROM t)
FROM events;
-- Total runtime: 12802.900 ms
CREATE INDEX dates_ymd ON dates USING btree(year_part, month_part, day_part);
-- Total runtime: 13.750 ms
ALTER TABLE events ADD COLUMN date_id INTEGER;
-- Total runtime: 2.468ms
UPDATE events SET date_id = dates.id
FROM dates
WHERE EXTRACT(YEAR FROM t) = dates.year_part
AND EXTRACT(MONTH FROM t) = dates.month_part
AND EXTRACT(DAY FROM T) = dates.day_part
;
-- Total runtime: 388024.520 ms
Next, we do the same, but with the key pair (pr_id, pa_id), which doesn't reduce the cardinality too much, but when we're talking large sets it can help with memory usage and swapping in and out:
CREATE TABLE prpa (
id SERIAL,
pr_id TEXT NOT NULL,
pa_id TEXT NOT NULL
);
-- Total runtime: 5.451 ms
CREATE INDEX events_prpa ON events USING btree(pr_id, pa_id);
-- Total runtime: 218,908.894 ms
INSERT INTO prpa(pr_id, pa_id) SELECT DISTINCT pr_id, pa_id FROM events;
-- Total runtime: 5566.760 ms
CREATE INDEX prpa_idx ON prpa USING btree(pr_id, pa_id);
-- Total runtime: 84185.057 ms
ALTER TABLE events ADD COLUMN prpa_id INTEGER;
-- Total runtime: 2.067 ms
UPDATE events SET prpa_id = prpa.id
FROM prpa
WHERE events.pr_id = prpa.pr_id
AND events.pa_id = prpa.pa_id;
-- Total runtime: 757915.192
DROP INDEX events_prpa;
-- Total runtime: 1041.556 ms
Finally, let's get rid of the old indexes and the now defunct columns, and then vacuum up the new tables:
DROP INDEX events_idx;
-- Total runtime: 1139.508 ms
ALTER TABLE events
DROP COLUMN pr_id,
DROP COLUMN pa_id
;
-- Total runtime: 5.376 ms
VACUUM ANALYSE prpa;
-- Total runtime: 1030.142
VACUUM ANALYSE dates;
-- Total runtime: 6652.151
So we now have the following tables:
events (c_id, r, t, prpa_id, date_id)
dates (id, year_part, month_part, day_part)
prpa (id, pr_id, pa_id)
Let's toss on an index now, pushing t DESC to the end where it belongs. We can do this because we're filtering results on dates before ORDERing, which reduces the need for t DESC to be so prominent in the index:
CREATE INDEX events_idx_new ON events USING btree (c_id, date_id, prpa_id, t DESC);
-- Total runtime: 27697.795
VACUUM ANALYSE events;
Now we rewrite the query, (using a table to store intermediary results - I find this works well with large datasets) and awaaaaaay we go!
DROP TABLE IF EXISTS temp_results;
SELECT DISTINCT ON (prpa_id)
prpa_id,
r
INTO TEMPORARY temp_results
FROM events
INNER JOIN dates
ON dates.id = events.date_id
WHERE c_id = 5
AND dates.year_part BETWEEN 2017 AND 2017
AND dates.month_part BETWEEN 1 AND 1
AND dates.day_part BETWEEN 3 AND 5
ORDER BY prpa_id, t DESC;
SELECT
prpa.pr_id,
r,
count(1) AS quantity
FROM temp_results
INNER JOIN prpa ON prpa.id = temp_results.prpa_id
GROUP BY
1,
2
ORDER BY 3, 2, 1 DESC;
-- Total runtime: 1233.281 ms
So, not a 10x increase in performance, but 4x which is still alright.
This solution is a combination of a couple of techniques I have found work well with large datasets and with date ranges. Even if it's not good enough for your purposes, there might be some gems in here you can repurpose throughout your career.
EDIT:
EXPLAIN ANALYSE on SELECT INTO query:
Unique (cost=171839.95..172360.53 rows=51332 width=16) (actual time=819.385..857.777 rows=117471 loops=1)
-> Sort (cost=171839.95..172100.24 rows=104117 width=16) (actual time=819.382..836.924 rows=155202 loops=1)
Sort Key: events.prpa_id, events.t
Sort Method: external sort Disk: 3944kB
-> Hash Join (cost=14340.24..163162.92 rows=104117 width=16) (actual time=126.929..673.293 rows=155202 loops=1)
Hash Cond: (events.date_id = dates.id)
-> Bitmap Heap Scan on events (cost=14338.97..160168.28 rows=520585 width=20) (actual time=126.572..575.852 rows=516503 loops=1)
Recheck Cond: (c_id = 5)
Heap Blocks: exact=29610
-> Bitmap Index Scan on events_idx2 (cost=0.00..14208.82 rows=520585 width=0) (actual time=118.769..118.769 rows=516503 loops=1)
Index Cond: (c_id = 5)
-> Hash (cost=1.25..1.25 rows=2 width=4) (actual time=0.326..0.326 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on dates (cost=0.00..1.25 rows=2 width=4) (actual time=0.320..0.323 rows=3 loops=1)
Filter: ((year_part >= 2017) AND (year_part <= 2017) AND (month_part >= 1) AND (month_part <= 1) AND (day_part >= 3) AND (day_part <= 5))
Rows Removed by Filter: 7
Planning time: 3.091 ms
Execution time: 913.543 ms
EXPLAIN ANALYSE on SELECT query:
(Note: I had to alter the first query to select into an actual table, not a temporary table, in order to get the query plan for this one. AFAIK EXPLAIN ANALYSE only works on single queries)
Sort (cost=89590.66..89595.66 rows=2000 width=15) (actual time=1248.535..1248.537 rows=30 loops=1)
Sort Key: (count(1)), temp_results.r, prpa.pr_id
Sort Method: quicksort Memory: 27kB
-> HashAggregate (cost=89461.00..89481.00 rows=2000 width=15) (actual time=1248.460..1248.468 rows=30 loops=1)
Group Key: prpa.pr_id, temp_results.r
-> Hash Join (cost=73821.20..88626.40 rows=111280 width=15) (actual time=798.861..1213.494 rows=117471 loops=1)
Hash Cond: (temp_results.prpa_id = prpa.id)
-> Seq Scan on temp_results (cost=0.00..1632.80 rows=111280 width=8) (actual time=0.024..17.401 rows=117471 loops=1)
-> Hash (cost=36958.31..36958.31 rows=2120631 width=15) (actual time=798.484..798.484 rows=2120631 loops=1)
Buckets: 16384 Batches: 32 Memory Usage: 3129kB
-> Seq Scan on prpa (cost=0.00..36958.31 rows=2120631 width=15) (actual time=0.126..350.664 rows=2120631 loops=1)
Planning time: 1.073 ms
Execution time: 1248.660 ms

How to count signups and cancellations with a sql query efficiently (postgresql 9.0)

Imagine an account table that looks like this:
Column | Type | Modifiers
------------+-----------------------------+-----------
id | bigint | not null
signupdate | timestamp without time zone | not null
canceldate | timestamp without time zone |
I want to get a report of the number of signups and cancellations by month.
It is pretty straight-forward to do it in two queries, one for the signups by month and then one for the cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups and cancellations, and should show up with a zero in the results.
With source data like this:
id signupDate cancelDate
1 2012-01-13
2 2012-01-15 2012-02-05
3 2012-03-01 2012-03-20
we should get the following results:
Date signups cancellations
2012-01 2 0
2012-02 0 1
2012-03 1 1
I'm using postgresql 9.0
Update after the first answer:
Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere; it returned incorrect results.
Looking at the results from an explain analyze (and my table does have an index on signup_date), the first query returns:
Sort (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 2 (returns $1)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 3 (returns $2)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
-> HashAggregate (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
-> Nested Loop (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
-> Materialize (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
-> Seq Scan on account a (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms
and the third query returns:
Sort (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 4 (returns $2)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
InitPlan 3 (returns $1)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 5 (returns $3)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
-> HashAggregate (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
Filter: (date_trunc('month'::text, created) = $0)
SubPlan 2
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms
This certainly makes it appear that the third query is faster, but when I run the queries from the command-line using the 'time' command, the first query is consistently faster, though only by a few milliseconds.
Surprisingly to me, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It took less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it still might be the best solution. Here are the single queries:
select
m,
count(a.id) as "signups"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m
;
select
m,
count(a.id) as "cancellations"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m
;
I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it
Here are three different ways to do it. All depend on generating a time series then scanning it. One uses subqueries to aggregate data for each month. One joins the table twice against the series with different criteria. An alternate form does a single join on the time series, retaining rows that match either start or end date, then uses predicates in the counts to further filter the results.
EXPLAIN ANALYZE will help you pick which approach works best for your data.
http://sqlfiddle.com/#!12/99c2a/9
Test setup:
CREATE TABLE accounts
("id" int, "signup_date" timestamp, "cancel_date" timestamp);
INSERT INTO accounts
("id", "signup_date", "cancel_date")
VALUES
(1, '2012-01-13 00:00:00', NULL),
(2, '2012-01-15 00:00:00', '2012-02-05'),
(3, '2012-03-01 00:00:00', '2012-03-20')
;
By single join and filter in count:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m OR date_trunc('month',a.cancel_date) = m)
GROUP BY m
ORDER BY m;
By joining the accounts table twice:
SELECT m, count(s.signup_date) AS n_signups, count(c.cancel_date) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m LEFT OUTER JOIN accounts s ON (date_trunc('month',s.signup_date) = m) LEFT OUTER JOIN accounts c ON (date_trunc('month',c.cancel_date) = m)
GROUP BY m
ORDER BY m;
Alternately, using subqueries:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE date_trunc('month',signup_date) = m
) AS n_signups, (
SELECT count(cancel_date)
FROM accounts
WHERE date_trunc('month',cancel_date) = m
)AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
New answer after update.
I'm not shocked that you get better results from two simpler queries; sometimes it's simply more efficient to do things that way. However, there was an issue with my original answer that significantly impacted performance.
Erwin accurately pointed out in another answer that Pg can't use a simple b-tree index on a date with date_trunc, so you're better off using ranges. It can use an index created on the expression date_trunc('month',colname) but you're better off avoiding the creation of another unnecessary index.
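For reference, the expression index he's alluding to would look like this sketch (the index name is made up; as noted, the range-based rewrite that follows avoids needing it):

```sql
-- An index on the expression itself makes date_trunc() comparisons
-- indexable; plain range conditions work with an ordinary b-tree instead.
CREATE INDEX account_signup_month
    ON accounts (date_trunc('month', signup_date));
```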
Rephrasing the single-scan-and-filter query to use ranges produces:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (
(a.signup_date >= m AND a.signup_date < m + INTERVAL '1' MONTH)
OR (a.cancel_date >= m AND a.cancel_date < m + INTERVAL '1' MONTH))
GROUP BY m
ORDER BY m;
There's no need to avoid date_trunc in non-indexable conditions, so I've only changed to use interval ranges in the join condition.
Where the original query used a seq scan and materialize, this now uses a bitmap index scan if there are indexes on signup_date and cancel_date.
In PostgreSQL 9.2 better performance may possibly be gained by adding:
CREATE INDEX account_signup_or_cancel ON accounts(signup_date,cancel_date);
and possibly:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_desc_nonnull
ON accounts(cancel_date DESC) WHERE (cancel_date IS NOT NULL);
to allow index-only scans. It's hard to make solid index recommendations without the actual data to test with.
Alternately, the subquery based approach with improved indexable filter condition:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE signup_date >= m AND signup_date < m + INTERVAL '1' MONTH
) AS n_signups, (
SELECT count(cancel_date)
FROM accounts
WHERE cancel_date >= m AND cancel_date < m + INTERVAL '1' MONTH
) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
will benefit from ordinary b-tree indexes on signup_date and cancel_date, or from:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_nonnull
ON accounts(cancel_date) WHERE (cancel_date IS NOT NULL);
Remember that every index you create imposes a penalty on INSERT and UPDATE performance, and competes with other indexes and heap data for cache space. Try to create only indexes that make a big difference and are useful for other queries.
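One way to spot indexes that aren't pulling their weight is the statistics collector's pg_stat_user_indexes view; a zero idx_scan after a representative workload marks a drop candidate (a diagnostic sketch, not a definitive rule):

```sql
-- Indexes never used for scans since the stats were last reset,
-- largest first.
SELECT relname        AS table_name,
       indexrelname   AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Note that unique/primary-key indexes enforce constraints even when never scanned, so don't drop those on idx_scan alone.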