How to optimize a GROUP BY query (PostgreSQL)

I was given the task of optimizing the following query (not written by me):
SELECT
"u"."email" as email,
r.url as "domain",
"u"."id" as "requesterId",
s.total * 100 / count("r"."id") as "rate",
count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM
(
SELECT
url,
id,
"requesterId",
created_at
FROM
(
SELECT
url,
id,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
id,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "r"
INNER JOIN (
SELECT
"requesterId",
url,
count(created_at) AS "total"
FROM
(
SELECT
url,
status,
created_at,
"requesterId"
FROM
(
SELECT
url,
status,
created_at,
"requesterId",
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
status,
created_at,
"requesterId"
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "s"
WHERE
status IN ('success')
AND s."created_at" :: date >= '2022-01-07' :: date
AND s."created_at" :: date <= '2022-02-07' :: date
GROUP BY
s.url,
s."requesterId"
) "s" ON s."requesterId" = r."requesterId"
AND s.url = r.url
INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE
r."created_at" :: date >= '2022-01-07' :: date
AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY
r.url,
"u"."email",
"u"."id",
s.total
LIMIT
10
So there is the requests table, which stores API requests. There is a retry mechanism for failed requests: a request is retried up to 5 times, with a separate row kept for each retry, and if it still fails after 5 attempts it is not retried any more. This is the reason for the partition-by subquery, which keeps only one row per main_request_uuid (the main request).
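(Side note, not part of the original query: that one-row-per-main_request_uuid step can also be written with PostgreSQL's DISTINCT ON instead of a window function. A minimal sketch, using the columns from the table structure shown further below:)
SELECT DISTINCT ON (main_request_uuid)
       url, id, status, "requesterId", created_at
FROM "requests"
ORDER BY main_request_uuid, retry_number DESC;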
What the query should return is the total number of requests and the success rate, grouped by url and requesterId. The query I was given is not only wrong, but also takes a huge amount of time to execute, so I came up with the optimized version below.
WITH a AS (SELECT url,
id,
status,
"requesterId",
created_at
FROM (
SELECT url,
id,
status,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM "requests" "request"
WHERE
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
GROUP BY main_request_uuid,
retry_number,
url,
id,
status,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE request_.row_number = 1),
b AS (SELECT count(*) total, a2.url as url, a2."requesterId" FROM a a2 GROUP BY a2.url, a2."requesterId"),
c AS (SELECT count(*) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status IN ('success')
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId" JOIN users u ON b."requesterId" = u.id
LIMIT 10;
What the new version basically does is select all the main requests, then count the successful ones and the total. The new version still takes a lot of time to execute (around 60 s on a table with 4 million requests).
Is there a way to optimize this further?
You can see the table structure below. The table has no relevant indexes, but adding one on (url, requesterId) had no effect
column_name         | data_type
--------------------+--------------------------
id                  | bigint
requesterId         | bigint
proxyId             | bigint
url                 | character varying
status              | USER-DEFINED
time_spent          | integer
created_at          | timestamp with time zone
request_information | jsonb
retry_number        | smallint
main_request_uuid   | character varying
And here is the execution plan, on a backup table with 100k rows. It takes 1.1 s for 100k rows, but I'd like to cut that down to around 200 ms for this case:
Limit (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
CTE a
-> Subquery Scan on request_ (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
Filter: (request_.row_number = 1)
Rows Removed by Filter: 70962
-> WindowAgg (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
-> Group (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Gather Merge (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Group (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
" Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
-> Sort (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
" Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request.""requesterId"", request.created_at"
Sort Method: external merge Disk: 7952kB
Worker 0: Sort Method: external merge Disk: 8568kB
Worker 1: Sort Method: external merge Disk: 9072kB
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
-> Nested Loop (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
" Join Filter: (a2.""requesterId"" = u.id)"
-> Nested Loop (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
" Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2.""requesterId"" = a3.""requesterId""))"
Rows Removed by Join Filter: 69
-> HashAggregate (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
" Group Key: a3.url, a3.""requesterId"""
Batches: 5 Memory Usage: 4297kB Disk Usage: 7040kB
-> CTE Scan on a a3 (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
Filter: (status = 'success'::requests_status_enum)
Rows Removed by Filter: 8196
-> HashAggregate (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
" Group Key: a2.url, a2.""requesterId"""
Batches: 41 Memory Usage: 4297kB Disk Usage: 7328kB
-> CTE Scan on a a2 (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
" -> Index Scan using ""PK_a3ffb1c0c8416b9fc6f907b7433"" on users u (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)"
" Index Cond: (id = a3.""requesterId"")"
Planning Time: 1.494 ms
Execution Time: 1102.488 ms

These lines of your plan point to a possible optimization.
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
Sequential scans, parallel or not, are somewhat costly.
So, try changing these WHERE conditions to make them sargable and useful for a range scan.
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
Change these to
created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY
And, put a BTREE index on the created_at column.
CREATE INDEX ON requests (created_at);
Your query is complex, so I'm not totally sure this will work. But try it. The index should pull out only the rows for the dates you need.
And your LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL permission to return whichever ten rows it wants from the result set. Don't use LIMIT without ORDER BY, and don't use LIMIT at all unless you need it.
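With that index in place, here is a minimal sketch of the inner scan rewritten with the sargable date range (table and column names are taken from the question; whether the planner actually picks the index depends on how selective the date range is):
SELECT url, id, status, "requesterId", created_at,
       row_number() OVER (PARTITION BY main_request_uuid ORDER BY retry_number DESC) AS row_number
FROM "requests"
WHERE created_at >= DATE '2022-01-07'
  AND created_at < DATE '2022-02-07' + INTERVAL '1 day';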

Writing the query efficiently is a major part of query optimization, especially when handling a huge data set. When working with a large volume of data, avoid unnecessary GROUP BY or ORDER BY clauses, explicit type casts, extra joins and subqueries, and LIMIT without ORDER BY, wherever that still meets the requirement. Create an index on the created_at column. If a LEFT JOIN were used in the given query, the query pattern would change. My observations are:
-- avoid unnecessary GROUP BY (no aggregate function is used) or ORDER BY
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
N.B.: if the created_at column's data type is timestamp without time zone, then there is no need for the extra casting. Use the condition below instead (the upper bound is written as an exclusive comparison so that the whole of 2022-02-07 is still included):
(created_at >= '2022-01-07'
AND created_at < '2022-02-08')
-- Combine the two CTEs into one, as per the requirement
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
So the final query looks like the one below:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
) select (success * 100) / total as rate
, b.url, b."requesterId", total, email
from b
JOIN users u
ON u.id = b."requesterId"
limit 10;
If the query above doesn't meet the requirement, then try the query below. This variant is a better fit if a LEFT JOIN is used instead of an INNER JOIN.
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT count(1) total, a2.url as url, a2."requesterId"
FROM a a2
GROUP BY a2.url, a2."requesterId"
), c AS (SELECT count(1) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status = 'success'
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c
ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u
ON b."requesterId" = u.id
LIMIT 10;

Related

Does PostgreSQL share the ordering of a CTE?

In PostgreSQL, common table expressions (CTE) are optimization fences. This means that the CTE is materialized into memory and that predicates from another query will never be pushed down into the CTE.
Now I am wondering if other metadata about the CTE, such as ordering, is shared to the other queries. Let's take the following query:
WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects
Here, MIN(type) is obviously always the first row of ordered_objects (or NULL if ordered_objects is empty), because ordered_objects is already ordered by type. Is this knowledge about ordered_objects available when evaluating SELECT MIN(type) FROM ordered_objects?
If I understand your question correctly: no, it does not; no such knowledge is shared, as the example below shows. When the CTE is limited to 10 rows, execution is dramatically faster because there is far less data to process (in my case about a million times less). Without the LIMIT, the CTE scans the whole ordered set, ignoring the fact that the minimum is in the first rows.
data:
t=# create table object (type bigint);
CREATE TABLE
Time: 4.636 ms
t=# insert into object select generate_series(1,9999999);
INSERT 0 9999999
Time: 7769.275 ms
with limit:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 3150.183 ms
https://explain.depesz.com/s/5yXe
without:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 16032.989 ms
https://explain.depesz.com/s/1SU
I made sure to warm up the data before the tests.
In Postgres, a CTE is always executed once, even if it is referenced more than once.
Its result is stored in a temporary, materialized result set.
The outer query has no knowledge of the CTE's internal structure (indexes are not available) or its ordering (I'm not sure about frequency estimates); it just scans the temporary results.
In the fragment below, the CTE is scanned twice, even though the results are known to be identical.
\d react
EXPLAIN ANALYZE
WITH omg AS (
SELECT topic_id
, row_number() OVER (PARTITION by krant_id ORDER BY topic_id) AS rn
FROM react
WHERE krant_id = 1
AND topic_id < 5000000
ORDER BY topic_id ASC
)
SELECT MIN (o2.topic_id)
FROM omg o1 --
JOIN omg o2 ON o1.rn = o2.rn -- exactly the same
WHERE o1.rn = 1
;
Table "public.react"
Column | Type | Modifiers
------------+--------------------------+--------------------
krant_id | integer | not null default 1
topic_id | integer | not null
react_id | integer | not null
react_date | timestamp with time zone |
react_nick | character varying(1000) |
react_body | character varying(4000) |
zoek | tsvector |
Indexes:
"react_pkey" PRIMARY KEY, btree (krant_id, topic_id, react_id)
"react_krant_id_react_nick_react_date_topic_id_react_id_idx" UNIQUE, btree (krant_id, react_nick, react_date, topic_id, react_id)
"react_date" btree (krant_id, topic_id, react_date)
"react_nick" btree (krant_id, topic_id, react_nick)
"react_zoek" gin (zoek)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON react FOR EACH ROW EXECUTE PROCEDURE tf_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON react FOR EACH ROW WHEN (new.react_body::text <> old.react_body::text) EXECUTE PROCEDURE tf_upd_zzoek()
----------
Aggregate (cost=232824.29..232824.29 rows=1 width=4) (actual time=1773.643..1773.645 rows=1 loops=1)
CTE omg
-> WindowAgg (cost=0.43..123557.17 rows=402521 width=8) (actual time=0.217..1246.577 rows=230822 loops=1)
-> Index Only Scan using react_pkey on react (cost=0.43..117519.35 rows=402521 width=8) (actual time=0.161..419.916 rows=230822 loops=1)
Index Cond: ((krant_id = 1) AND (topic_id < 5000000))
Heap Fetches: 442
-> Nested Loop (cost=0.00..99136.69 rows=4052169 width=4) (actual time=0.264..1773.624 rows=1 loops=1)
-> CTE Scan on omg o1 (cost=0.00..9056.72 rows=2013 width=8) (actual time=0.249..59.252 rows=1 loops=1)
Filter: (rn = 1)
Rows Removed by Filter: 230821
-> CTE Scan on omg o2 (cost=0.00..9056.72 rows=2013 width=12) (actual time=0.003..1714.355 rows=1 loops=1)
Filter: (rn = 1)
Rows Removed by Filter: 230821
Total runtime: 1782.887 ms
(14 rows)
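As a practical aside, not from either answer above: if all you really need is the minimum, query the base table directly so the planner can use an index, which the materialized CTE hides. A sketch using the object table from the first test (that test table has no index, so one would need to be created first):
CREATE INDEX object_type_idx ON object (type);
-- min() can now be answered from one end of the index,
-- instead of scanning the CTE's materialized output
SELECT min(type) FROM object;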

Lazy order by/where evaluation

Edit
It seems that a pure materialization can be stored as a column on the table and indexed; however, my specific use case (semver.satisfies) requires a more general solution:
create table Submissions (
version text,
created_at timestamp
);
create index Submissions_1 on Submissions (created_at);
My query would then look like:
select * from Submissions
where
created_at <= '2016-07-12' and
satisfies(version, '>=1.2.3 <4.5.6')
order by created_at desc
limit 1;
Where I wouldn't be able to practically use the same memoization technique.
Original
I have a table storing text data and the dates at which they were created:
create table Submissions (
content text,
created_at timestamp
);
create index Submissions_1 on Submissions (created_at);
Given a checksum and a reference date, I want to get the latest Submission where the content field matches that checksum:
select * from Submissions
where
created_at <= '2016-07-12' and
expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
order by created_at desc
limit 1;
This works, but it's very slow. What Postgres ends up doing is taking a checksum of every row, and then performing the order by:
Limit (cost=270834.18..270834.18 rows=1 width=32) (actual time=1132.898..1132.898 rows=1 loops=1)
-> Sort (cost=270834.18..271561.27 rows=290836 width=32) (actual time=1132.898..1132.898 rows=1 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on installation (cost=0.00..269380.00 rows=290836 width=32) (actual time=0.118..1129.961 rows=17305 loops=1)
Filter: created_at <= '2016-07-12' AND expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
Rows Removed by Filter: 982695
Planning time: 0.066 ms
Execution time: 1246.941 ms
Without the order by, it is a sub-millisecond operation, because Postgres knows that I only want the first result. The only difference is that I want Postgres to start searching from the latest date down.
Ideally, Postgres would:
filter by created_at
sort by created_at, descending
return the first row where the checksum matches
I've tried to write queries with inline views, but an explain analyze shows that it will just be rewritten into what I already had above.
You can create an index on both fields together:
create index Submissions_1 on Submissions (created_at DESC, expensive_chksm(content));
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.15..8.16 rows=1 width=40) (actual time=0.004..0.004 rows=0 loops=1)
-> Index Scan using submissions_1 on submissions (cost=0.15..16.17 rows=2 width=40) (actual time=0.002..0.002 rows=0 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND ((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text))
Planning time: 0.414 ms
Execution time: 0.036 ms
It is important to also use DESC in the index. Note that expensive_chksm must be declared IMMUTABLE, otherwise it cannot be used in an index expression.
UPDATED:
For storing and comparing versions you can use int[]:
create table Submissions (
version int[],
created_at timestamp
);
INSERT INTO Submissions SELECT ARRAY [ (random() * 10)::int2, (random() * 10)::int2, (random() * 10)::int2], '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 1000000);
create index Submissions_1 on Submissions (created_at DESC, version);
EXPLAIN ANALYZE select * from Submissions
where
created_at <= '2016-07-12'
AND version <= ARRAY [5,2,3]
AND version > ARRAY [1,2,3]
order by created_at desc
limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..13.24 rows=1 width=40) (actual time=0.074..0.075 rows=1 loops=1)
-> Index Only Scan using submissions_1 on submissions (cost=0.42..21355.76 rows=1667 width=40) (actual time=0.073..0.073 rows=1 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND (version <= '{5,2,3}'::integer[]) AND (version > '{1,2,3}'::integer[]))
Heap Fetches: 1
Planning time: 3.019 ms
Execution time: 0.100 ms
In response to a_horse_with_no_name's comment:
The order of the conditions in the where clause is irrelevant for the index usage. It's better to put the one that can be used for the equality expression first in the index, then the range expression. –
BEGIN;
create table Submissions (
content text,
created_at timestamp
);
CREATE FUNCTION expensive_chksm(varchar) RETURNS varchar AS $$
SELECT $1;
$$ LANGUAGE sql IMMUTABLE;
INSERT INTO Submissions SELECT (random() * 1000000000)::text, '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 1000000);
INSERT INTO Submissions SELECT '77ac76dc0d4622ba9aa795acafc05f1e', '2016-01-01'::timestamp + ('1 hour')::interval * random() * 10000 FROM generate_series(1, 100000);
create index Submissions_1 on Submissions (created_at DESC, expensive_chksm(content));
-- create index Submissions_2 on Submissions (expensive_chksm(content), created_at DESC);
EXPLAIN ANALYZE select * from Submissions
where
created_at <= '2016-07-12' and
expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
order by created_at desc
limit 1;
Using Submissions_1:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..10.98 rows=1 width=40) (actual time=0.018..0.019 rows=1 loops=1)
-> Index Scan using submissions_1 on submissions (cost=0.43..19341.43 rows=1833 width=40) (actual time=0.018..0.018 rows=1 loops=1)
Index Cond: ((created_at <= '2016-07-12 00:00:00'::timestamp without time zone) AND ((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text))
Planning time: 0.257 ms
Execution time: 0.033 ms
Using Submissions_2:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4482.39..4482.40 rows=1 width=40) (actual time=29.096..29.096 rows=1 loops=1)
-> Sort (cost=4482.39..4486.98 rows=1833 width=40) (actual time=29.095..29.095 rows=1 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on submissions (cost=67.22..4473.23 rows=1833 width=40) (actual time=15.457..23.683 rows=46419 loops=1)
Recheck Cond: (((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text) AND (created_at <= '2016-07-12 00:00:00'::timestamp without time zone))
Heap Blocks: exact=936
-> Bitmap Index Scan on submissions_1 (cost=0.00..66.76 rows=1833 width=0) (actual time=15.284..15.284 rows=46419 loops=1)
Index Cond: (((content)::text = '77ac76dc0d4622ba9aa795acafc05f1e'::text) AND (created_at <= '2016-07-12 00:00:00'::timestamp without time zone))
Planning time: 0.583 ms
Execution time: 29.134 ms
PostgreSQL 9.6.1
You can use a subquery for the timestamp and ordering part, and run the checksum outside:
select * from (
select * from submissions where
created_at <= '2016-07-12'
order by created_at desc) as S
where expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e'
LIMIT 1
If you are always going to query on the checksum, then an alternative would be to have another column called checksum in the table, e.g.:
create table Submissions (
content text,
created_at timestamp,
checksum varchar
);
You can then insert/update the checksum whenever a row gets inserted or updated (or write a trigger that does it for you) and query the checksum column directly for a quick result.
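A minimal sketch of such a trigger; the names here are illustrative, and expensive_chksm is assumed to be the function from the question:
CREATE OR REPLACE FUNCTION submissions_set_checksum() RETURNS trigger AS $$
BEGIN
    -- recompute the stored checksum whenever content is written
    NEW.checksum := expensive_chksm(NEW.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER submissions_checksum_trg
    BEFORE INSERT OR UPDATE OF content ON Submissions
    FOR EACH ROW EXECUTE PROCEDURE submissions_set_checksum();
-- an index on (checksum, created_at DESC) then matches the original query shape
CREATE INDEX submissions_checksum_created_at ON Submissions (checksum, created_at DESC);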
Try this
select *
from Submissions
where created_at = (
select max(created_at)
from Submissions
where expensive_chksm(content) = '77ac76dc0d4622ba9aa795acafc05f1e')

How to tweak index_scan cost in postgres?

For the following query:
SELECT *
FROM "routes_trackpoint"
WHERE "routes_trackpoint"."track_id" = 593
ORDER BY "routes_trackpoint"."id" ASC
LIMIT 1;
Postgres is choosing a query plan which reads all the rows in the "id" index to perform the ordering, and then filters them manually to keep only the entries with the correct track id:
Limit (cost=0.43..511.22 rows=1 width=65) (actual time=4797.964..4797.966 rows=1 loops=1)
Buffers: shared hit=3388505
-> Index Scan using routes_trackpoint_pkey on routes_trackpoint (cost=0.43..719699.79 rows=1409 width=65) (actual time=4797.958..4797.958 rows=1 loops=1)
Filter: (track_id = 75934)
Rows Removed by Filter: 13005436
Buffers: shared hit=3388505
Total runtime: 4798.019 ms
(7 rows)
With index scans disabled (SET enable_indexscan = OFF;), the query plan is better and the response is much faster.
Limit (cost=6242.46..6242.46 rows=1 width=65) (actual time=77.584..77.586 rows=1 loops=1)
Buffers: shared hit=1075 read=6
-> Sort (cost=6242.46..6246.64 rows=1674 width=65) (actual time=77.577..77.577 rows=1 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=1075 read=6
-> Bitmap Heap Scan on routes_trackpoint (cost=53.41..6234.09 rows=1674 width=65) (actual time=70.384..74.782 rows=1454 loops=1)
Recheck Cond: (track_id = 75934)
Buffers: shared hit=1075 read=6
-> Bitmap Index Scan on routes_trackpoint_track_id (cost=0.00..52.99 rows=1674 width=0) (actual time=70.206..70.206 rows=1454 loops=1)
Index Cond: (track_id = 75934)
Buffers: shared hit=2 read=6
Total runtime: 77.655 ms
(13 rows)
How can I get Postgres to select the better plan automatically?
I have tried the following:
ALTER TABLE routes_trackpoint ALTER COLUMN track_id SET STATISTICS 5000;
ALTER TABLE routes_trackpoint ALTER COLUMN id SET STATISTICS 5000;
ANALYZE routes_trackpoint;
But the query plan remained the same.
The table definition is:
watchdog2=# \d routes_trackpoint
Table "public.routes_trackpoint"
Column | Type | Modifiers
----------+--------------------------+----------------------------------------------------------------
id | integer | not null default nextval('routes_trackpoint_id_seq'::regclass)
track_id | integer | not null
position | geometry(Point,4326) | not null
speed | double precision | not null
bearing | double precision | not null
valid | boolean | not null
created | timestamp with time zone | not null
Indexes:
"routes_trackpoint_pkey" PRIMARY KEY, btree (id)
"routes_trackpoint_position_id" gist ("position")
"routes_trackpoint_track_id" btree (track_id)
Foreign-key constraints:
"track_id_refs_id_d59447ae" FOREIGN KEY (track_id) REFERENCES routes_track(id) DEFERRABLE INITIALLY DEFERRED
PS: We have forced Postgres to sort by "created" instead, which also helped it use the index on "track_id".
Avoid LIMIT as much as you can.
Plan #1: use NOT EXISTS() to get the first one
EXPLAIN ANALYZE
SELECT * FROM routes_trackpoint tp
WHERE tp.track_id = 593
AND NOT EXISTS (
SELECT * FROM routes_trackpoint nx
WHERE nx.track_id = tp.track_id AND nx.id < tp.id
);
Plan #2: use row_number() OVER some_window to get the first one of the group.
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
) omg ON omg.id = tp.id
WHERE tp.track_id = 593
AND omg.rn = 1
;
Or, even better, move the WHERE clause into the subquery:
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
) omg ON omg.id = tp.id
WHERE 1=1
-- AND tp.track_id = 593
AND omg.rn = 1
;
Plan #3: use the Postgres-specific DISTINCT ON () construct (thanks to @a_horse_with_no_name):
-- EXPLAIN ANALYZE
SELECT DISTINCT ON (track_id) track_id, id
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
-- order by track_id, created desc
order by track_id, id
;
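As an aside, not part of the answer above: a two-column index matching both the filter and the ordering lets the original LIMIT 1 query run as a single index scan, which may be the simplest fix if the query itself cannot be changed. A sketch (the index name is arbitrary):
CREATE INDEX routes_trackpoint_track_id_id ON routes_trackpoint (track_id, id);
-- the original query can then scan that index directly:
SELECT *
FROM routes_trackpoint
WHERE routes_trackpoint.track_id = 593
ORDER BY routes_trackpoint.id ASC
LIMIT 1;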

How to count signups and cancellations with a sql query efficiently (postgresql 9.0)

Imagine an account table that looks like this:
Column | Type | Modifiers
------------+-----------------------------+-----------
id | bigint | not null
signupdate | timestamp without time zone | not null
canceldate | timestamp without time zone |
I want to get a report of the number of signups and cancellations by month.
It is pretty straight-forward to do it in two queries, one for the signups by month and then one for the cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups and cancellations, and should show up with a zero in the results.
With source data like this:
id signupDate cancelDate
1 2012-01-13
2 2012-01-15 2012-02-05
3 2012-03-01 2012-03-20
we should get the following results:
Date signups cancellations
2012-01 2 0
2012-02 0 1
2012-03 1 1
I'm using postgresql 9.0
Update after the first answer:
Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere; it returned incorrect results.
Looking at the results from an explain analyze (and my table does have an index on signup_date), the first query returns:
Sort (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 2 (returns $1)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 3 (returns $2)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
-> HashAggregate (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
-> Nested Loop (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
-> Materialize (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
-> Seq Scan on account a (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms
and the third query returns:
Sort (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 4 (returns $2)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
InitPlan 3 (returns $1)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 5 (returns $3)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
-> HashAggregate (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
Filter: (date_trunc('month'::text, created) = $0)
SubPlan 2
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms
This certainly makes it appear that the third query is faster, but when I run the queries from the command-line using the 'time' command, the first query is consistently faster, though only by a few milliseconds.
Surprisingly to me, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It took less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it still might be the best solution. Here are the single queries:
select
m,
count(a.id) as "signups"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m
;
select
m,
count(a.id) as "cancellations"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m
;
I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it
Here are three different ways to do it. All depend on generating a time series then scanning it. One uses subqueries to aggregate data for each month. One joins the table twice against the series with different criteria. An alternate form does a single join on the time series, retaining rows that match either start or end date, then uses predicates in the counts to further filter the results.
EXPLAIN ANALYZE will help you pick which approach works best for your data.
http://sqlfiddle.com/#!12/99c2a/9
Test setup:
CREATE TABLE accounts
("id" int, "signup_date" timestamp, "cancel_date" timestamp);
INSERT INTO accounts
("id", "signup_date", "cancel_date")
VALUES
(1, '2012-01-13 00:00:00', NULL),
(2, '2012-01-15 00:00:00', '2012-02-05'),
(3, '2012-03-01 00:00:00', '2012-03-20')
;
By single join and filter in count:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m OR date_trunc('month',a.cancel_date) = m)
GROUP BY m
ORDER BY m;
By joining the accounts table twice:
SELECT m, count(s.signup_date) AS n_signups, count(c.cancel_date) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m LEFT OUTER JOIN accounts s ON (date_trunc('month',s.signup_date) = m) LEFT OUTER JOIN accounts c ON (date_trunc('month',c.cancel_date) = m)
GROUP BY m
ORDER BY m;
Alternately, using subqueries:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE date_trunc('month',signup_date) = m
) AS n_signups, (
SELECT count(signup_date)
FROM accounts
WHERE date_trunc('month',cancel_date) = m
)AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
New answer after update.
I'm not shocked that you get better results from two simpler queries; sometimes it's simply more efficient to do things that way. However, there was an issue with my original answer that will have significantly impacted performance.
Erwin accurately pointed out in another answer that Pg can't use a simple b-tree index on a date with date_trunc, so you're better off using ranges. It can use an index created on the expression date_trunc('month',colname) but you're better off avoiding the creation of another unnecessary index.
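For reference, a sketch of that expression index, in case the range rewrite below is not an option (date_trunc on a plain timestamp column is immutable, so it is allowed in an index; the index name is arbitrary):
CREATE INDEX accounts_signup_month_idx ON accounts (date_trunc('month', signup_date));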
Rephrasing the single-scan-and-filter query to use ranges produces:
SELECT m,
count(nullif(date_trunc('month',a.signup_date) = m,'f')),
count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (
(a.signup_date >= m AND a.signup_date < m + INTERVAL '1' MONTH)
OR (a.cancel_date >= m AND a.cancel_date < m + INTERVAL '1' MONTH))
GROUP BY m
ORDER BY m;
There's no need to avoid date_trunc in non-indexable conditions, so I've only changed the join condition to use interval ranges.
Where the original query used a seq scan and materialize, this now uses a bitmap index scan if there are indexes on signup_date and cancel_date.
In PostgreSQL 9.2 better performance may possibly be gained by adding:
CREATE INDEX account_signup_or_cancel ON accounts(signup_date,cancel_date);
and possibly:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_desc_nonnull
ON accounts(cancel_date DESC) WHERE (cancel_date IS NOT NULL);
to allow index-only scans. It's hard to make solid index recommendations without the actual data to test with.
Alternately, the subquery based approach with improved indexable filter condition:
SELECT m, (
SELECT count(signup_date)
FROM accounts
WHERE signup_date >= m AND signup_date < m + INTERVAL '1' MONTH
) AS n_signups, (
SELECT count(cancel_date)
FROM accounts
WHERE cancel_date >= m AND cancel_date < m + INTERVAL '1' MONTH
) AS n_cancels
FROM generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
will benefit from ordinary b-tree indexes on signup_date and cancel_date, or from:
CREATE INDEX account_signup_date_nonnull
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);
CREATE INDEX account_cancel_date_nonnull
ON accounts(cancel_date) WHERE (cancel_date IS NOT NULL);
Remember that every index you create imposes a penalty on INSERT and UPDATE performance, and competes with other indexes and heap data for cache space. Try to create only indexes that make a big difference and are useful for other queries.
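One way to check, after a representative workload has run, whether an index is actually earning its keep; this is a sketch against the standard statistics view, not something from the answer above:
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'accounts'
ORDER BY idx_scan;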

Is there a postgres CLOSEST operator?

I'm looking for something that, given a table like:
| id | number |
| 1 | .7 |
| 2 | 1.25 |
| 3 | 1.01 |
| 4 | 3.0 |
the query SELECT * FROM my_table WHERE number CLOSEST(1) would return row 3. I only care about numbers. Right now I've got a procedure that just loops over every row and does a comparison, but I figure the information should be available from a b-tree index, so this might be possible as a builtin; I just can't find any documentation suggesting that it exists.
I may be a little off on the syntax, but this parameterized query (each ? takes the place of the '1' from the original question) should run fast: basically two B-tree lookups [assuming number is indexed].
SELECT * FROM
(
(SELECT id, number FROM t WHERE number >= ? ORDER BY number LIMIT 1) AS above
UNION ALL
(SELECT id, number FROM t WHERE number < ? ORDER BY number DESC LIMIT 1) as below
)
ORDER BY abs(?-number) LIMIT 1;
The query plan for this with a table of ~5e5 rows (with an index on number) looks like this:
psql => explain select * from (
(SELECT id, number FROM t WHERE number >= 1 order by number limit 1)
union all
(select id, number from t where number < 1 order by number desc limit 1)
) as make_postgresql_happy
order by abs (1 - number)
limit 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=0.24..0.24 rows=1 width=12)
-> Sort (cost=0.24..0.24 rows=2 width=12)
Sort Key: (abs((1::double precision - public.t.number)))
-> Result (cost=0.00..0.23 rows=2 width=12)
-> Append (cost=0.00..0.22 rows=2 width=12)
-> Limit (cost=0.00..0.06 rows=1 width=12)
-> Index Scan using idx_t on t (cost=0.00..15046.74 rows=255683 width=12)
Index Cond: (number >= 1::double precision)
-> Limit (cost=0.00..0.14 rows=1 width=12)
-> Index Scan Backward using idx_t on t (cost=0.00..9053.67 rows=66136 width=12)
Index Cond: (number < 1::double precision)
(11 rows)
You could try something like this:
select *
from my_table
where abs(1 - number) = (select min(abs(1 - number)) from my_table)
This isn't that much different than manually looping through the table but at least it lets the database do the looping inside "database space" rather than having to jump back and forth between your function and the database internals. Also, pushing it all into a single query lets the query engine know what you're trying to do and then it can try to do it in a sensible way.
The 2nd answer is correct, but I encountered an error on "UNION ALL":
DBD::Pg::st execute failed: ERROR: syntax error at or near "UNION"
I fixed it with this code:
SELECT * FROM
(
(SELECT * FROM table WHERE num >= ? ORDER BY num LIMIT 1)
UNION ALL
(SELECT * FROM table WHERE num < ? ORDER BY num DESC LIMIT 1)
) as foo
ORDER BY abs(?-num) LIMIT 1;
the trick is to remove the AS from the inner tables and use it only on the UNION.
This code is helpful if you wish to find the closest value within groups. Here, I split my table tb by column_you_wish_to_group_by and keep, for each group, the row whose column val is closest to my target value 0.5.
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY t.column_you_wish_to_group_by ORDER BY abs(t.val - 0.5) ASC) AS r,
t.*
FROM
tb t) x
WHERE x.r = 1;
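A tiny usage example with hypothetical data (not from the answer): for each group, the query keeps the row whose val is nearest the target.
CREATE TABLE tb (column_you_wish_to_group_by int, val numeric);
INSERT INTO tb VALUES (1, 0.2), (1, 0.55), (2, 0.9), (2, 0.4);
-- the query above then returns the row with val 0.55 for group 1
-- and the row with val 0.4 for group 2, each with r = 1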