Is there a Postgres CLOSEST operator?

I'm looking for something that, given a table like:
| id | number |
|----|--------|
| 1  | .7     |
| 2  | 1.25   |
| 3  | 1.01   |
| 4  | 3.0    |
the query SELECT * FROM my_table WHERE number CLOSEST(1) would return row 3. I only care about numbers. Right now I've got a procedure that just loops over every row and does a comparison, but I figure the information should be available from a B-tree index, so this might be possible as a builtin, but I can't find any documentation suggesting that one exists.

I may be a little off on the syntax, but this parameterized query (each ? takes the '1' of the original question) should run fast: basically two B-tree lookups [assuming number is indexed].
SELECT * FROM
(
(SELECT id, number FROM t WHERE number >= ? ORDER BY number LIMIT 1) AS above
UNION ALL
(SELECT id, number FROM t WHERE number < ? ORDER BY number DESC LIMIT 1) as below
)
ORDER BY abs(?-number) LIMIT 1;
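For the two-lookup behavior to be available, the assumed index has to exist; a minimal sketch (matching the idx_t name that appears in the plan below):
-- B-tree index on number; each branch of the UNION ALL then resolves as
-- a single index descent plus one row fetch
CREATE INDEX idx_t ON t (number);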
The query plan for this with a table of ~5e5 rows (with an index on number) looks like this:
psql => explain select * from (
    (SELECT id, number FROM t WHERE number >= 1 order by number limit 1)
    union all
    (select id, number from t where number < 1 order by number desc limit 1)
) as make_postgresql_happy
order by abs (1 - number)
limit 1;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.24..0.24 rows=1 width=12)
   ->  Sort  (cost=0.24..0.24 rows=2 width=12)
         Sort Key: (abs((1::double precision - public.t.number)))
         ->  Result  (cost=0.00..0.23 rows=2 width=12)
               ->  Append  (cost=0.00..0.22 rows=2 width=12)
                     ->  Limit  (cost=0.00..0.06 rows=1 width=12)
                           ->  Index Scan using idx_t on t  (cost=0.00..15046.74 rows=255683 width=12)
                                 Index Cond: (number >= 1::double precision)
                     ->  Limit  (cost=0.00..0.14 rows=1 width=12)
                           ->  Index Scan Backward using idx_t on t  (cost=0.00..9053.67 rows=66136 width=12)
                                 Index Cond: (number < 1::double precision)
(11 rows)

You could try something like this:
select *
from my_table
where abs(1 - number) = (select min(abs(1 - number)) from my_table)
This isn't much different from manually looping through the table, but at least it lets the database do the looping in "database space" rather than jumping back and forth between your function and the database internals. Also, pushing it all into a single query tells the query engine what you're trying to do, so it can try to do it in a sensible way.

The 2nd answer is correct, but I ran into an error on the UNION ALL:
DBD::Pg::st execute failed: ERROR: syntax error at or near "UNION"
I fixed it with this code:
SELECT * FROM
(
(SELECT * FROM table WHERE num >= ? ORDER BY num LIMIT 1)
UNION ALL
(SELECT * FROM table WHERE num < ? ORDER BY num DESC LIMIT 1)
) as foo
ORDER BY abs(?-num) LIMIT 1;
The trick is to remove the AS from the inner subqueries and use it only on the UNION as a whole.
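A minimal self-contained illustration of that aliasing rule (toy values only, nothing from the original tables):
SELECT * FROM
(
    (SELECT 1 AS n)
    UNION ALL
    (SELECT 2)
) AS u
ORDER BY abs(1.1 - n) LIMIT 1;
-- the derived table as a whole carries the alias; the UNION branches must not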

This code is helpful if you wish to find the closest value within groups. Here, I split my table tb by column_you_wish_to_group_by and rank each row by how close its val is to my target value 0.5.
SELECT *
FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY t.column_you_wish_to_group_by ORDER BY abs(t.val - 0.5) ASC) AS r,
        t.*
    FROM tb t
) x
WHERE x.r = 1;
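A quick sketch of what that returns, on assumed toy data:
-- hypothetical data; the query above keeps, per group, the row whose val
-- is nearest 0.5
CREATE TABLE tb (column_you_wish_to_group_by text, val numeric);
INSERT INTO tb VALUES ('a', 0.1), ('a', 0.55), ('b', 0.4), ('b', 2.0);
-- result: the ('a', 0.55) and ('b', 0.4) rows, each with r = 1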


How to optimize a GROUP BY query

I was given the task of optimizing the following query (not written by me):
SELECT
    "u"."email" as email,
    r.url as "domain",
    "u"."id" as "requesterId",
    s.total * 100 / count("r"."id") as "rate",
    count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM (
    SELECT url, id, "requesterId", created_at
    FROM (
        SELECT url, id, "requesterId", created_at,
               row_number() over (partition by main_request_uuid) as row_number
        FROM "requests" "request"
        GROUP BY main_request_uuid, retry_number, url, id, "requesterId", created_at
        ORDER BY main_request_uuid ASC, retry_number DESC
    ) "request_"
    WHERE request_.row_number = 1
) "r"
INNER JOIN (
    SELECT "requesterId", url, count(created_at) AS "total"
    FROM (
        SELECT url, status, created_at, "requesterId"
        FROM (
            SELECT url, status, created_at, "requesterId",
                   row_number() over (partition by main_request_uuid) as row_number
            FROM "requests" "request"
            GROUP BY main_request_uuid, retry_number, url, status, created_at, "requesterId"
            ORDER BY main_request_uuid ASC, retry_number DESC
        ) "request_"
        WHERE request_.row_number = 1
    ) "s"
    WHERE status IN ('success')
      AND s."created_at" :: date >= '2022-01-07' :: date
      AND s."created_at" :: date <= '2022-02-07' :: date
    GROUP BY s.url, s."requesterId"
) "s" ON s."requesterId" = r."requesterId" AND s.url = r.url
INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE r."created_at" :: date >= '2022-01-07' :: date
  AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY r.url, "u"."email", "u"."id", s.total
LIMIT 10
So there is the requests table, which stores API requests, with a mechanism that retries a failed request up to 5 times, keeping a separate row for each retry. If it still fails after 5 attempts, it is abandoned. This is the reason for the partition by subquery, which selects only the main requests.
What the query should return is the total number of requests and the success rate, grouped by url and requesterId. The query I was given is not only wrong but also takes a huge amount of time to execute, so I came up with the optimized version below.
WITH a AS (
    SELECT url,
           id,
           status,
           "requesterId",
           created_at
    FROM (
        SELECT url,
               id,
               status,
               "requesterId",
               created_at,
               row_number() over (partition by main_request_uuid) as row_number
        FROM "requests" "request"
        WHERE created_at :: date >= '2022-01-07' :: date
          AND created_at :: date <= '2022-02-07' :: date
        GROUP BY main_request_uuid,
                 retry_number,
                 url,
                 id,
                 status,
                 "requesterId",
                 created_at
        ORDER BY main_request_uuid ASC,
                 retry_number DESC
    ) "request_"
    WHERE request_.row_number = 1
),
b AS (
    SELECT count(*) total, a2.url as url, a2."requesterId"
    FROM a a2
    GROUP BY a2.url, a2."requesterId"
),
c AS (
    SELECT count(*) success, a3.url as url, a3."requesterId"
    FROM a a3
    WHERE status IN ('success')
    GROUP BY a3.url, a3."requesterId"
)
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u ON b."requesterId" = u.id
LIMIT 10;
What the new version basically does is select all the main requests, then count the successful ones and the total. It still takes a long time to execute (around 60 s on a table with 4 million requests).
Is there a way to optimize this further?
You can see the table structure below. The table has no relevant indexes; adding one on (url, requesterId) had no effect.
| column_name         | data_type                |
|---------------------|--------------------------|
| id                  | bigint                   |
| requesterId         | bigint                   |
| proxyId             | bigint                   |
| url                 | character varying        |
| status              | USER-DEFINED             |
| time_spent          | integer                  |
| created_at          | timestamp with time zone |
| request_information | jsonb                    |
| retry_number        | smallint                 |
| main_request_uuid   | character varying        |
And here is the execution plan, run on a backup table with 100k rows. It takes 1.1 s for 100k rows; I'd like to cut that down to at least 200 ms for this case.
Limit  (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
  CTE a
    ->  Subquery Scan on request_  (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
          Filter: (request_.row_number = 1)
          Rows Removed by Filter: 70962
          ->  WindowAgg  (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
                ->  Group  (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
                      Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request."requesterId", request.created_at
                      ->  Gather Merge  (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            ->  Group  (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
                                  Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request."requesterId", request.created_at
                                  ->  Sort  (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
                                        Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request."requesterId", request.created_at
                                        Sort Method: external merge  Disk: 7952kB
                                        Worker 0:  Sort Method: external merge  Disk: 8568kB
                                        Worker 1:  Sort Method: external merge  Disk: 9072kB
                                        ->  Parallel Seq Scan on requests request  (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
                                              Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
  ->  Nested Loop  (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
        Join Filter: (a2."requesterId" = u.id)
        ->  Nested Loop  (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
              Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2."requesterId" = a3."requesterId"))
              Rows Removed by Join Filter: 69
              ->  HashAggregate  (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
                    Group Key: a3.url, a3."requesterId"
                    Batches: 5  Memory Usage: 4297kB  Disk Usage: 7040kB
                    ->  CTE Scan on a a3  (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
                          Filter: (status = 'success'::requests_status_enum)
                          Rows Removed by Filter: 8196
              ->  HashAggregate  (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
                    Group Key: a2.url, a2."requesterId"
                    Batches: 41  Memory Usage: 4297kB  Disk Usage: 7328kB
                    ->  CTE Scan on a a2  (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
        ->  Index Scan using "PK_a3ffb1c0c8416b9fc6f907b7433" on users u  (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)
              Index Cond: (id = a3."requesterId")
Planning Time: 1.494 ms
Execution Time: 1102.488 ms
These lines of your plan point to a possible optimization.
->  Parallel Seq Scan on requests request  (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
      Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
Sequential scans, parallel or not, are somewhat costly.
So, try changing these WHERE conditions to make them sargable and useful for a range scan.
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
Change these to
created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY
And, put a BTREE index on the created_at column.
CREATE INDEX ON requests (created_at);
Your query is complex, so I'm not totally sure this will work. But try it. The index should pull out only the rows for the dates you need.
Also, a LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL permission to return whichever ten rows it wants from the result set. Don't use LIMIT without ORDER BY, and don't use it at all unless you need it.
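A toy demonstration of that last point (hypothetical table, not from the question):
CREATE TABLE demo (n int);
INSERT INTO demo SELECT generate_series(1, 1000);
SELECT n FROM demo LIMIT 10;             -- any ten rows may come back
SELECT n FROM demo ORDER BY n LIMIT 10;  -- always 1 through 10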
Writing queries efficiently is a major part of query optimization, especially when handling huge data sets. When working with large volumes of data, avoid unnecessary GROUP BY or ORDER BY clauses, explicit type casts, too many joins, extra subqueries, and LIMIT (or LIMIT without ORDER BY) wherever possible while still meeting the requirement. Create an index on the created_at column. If a LEFT JOIN were used in the given query, the query pattern would change. My observations:
-- avoid unnecessary GROUP BY (no aggregate function use) or ORDER BY
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
N.B.: if the created_at column's data type is timestamp without time zone, then no extra casting is needed; use the statement below instead.
(created_at >= '2022-01-07'
AND created_at <= '2022-02-07')
-- Combine the two CTEs into one as per the requirement
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
So the final query looks like this:
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT url, "requesterId", COUNT(1) total
, COUNT(1) FILTER (WHERE status = 'success') success
FROM a
-- WHERE status IN ('success')
GROUP BY url, "requesterId"
) select (success * 100) / total as rate
, b.url, b."requesterId", total, email
from b
JOIN users u
ON u.id = b."requesterId"
limit 10;
If the query above doesn't meet the requirement, then try the query below. Note that it would be more correct with a LEFT JOIN instead of an INNER JOIN, so that url/requesterId groups with zero successes are not dropped.
WITH a AS (
SELECT url
, id
, status
, "requesterId"
, created_at
FROM (SELECT url
, id
, status
, "requesterId"
, created_at
, row_number() over (partition by main_request_uuid order by retry_number DESC) as row_number
FROM "requests" "request"
WHERE (created_at:: date >= '2022-01-07'
AND created_at :: date <= '2022-02-07')) "request_"
WHERE request_.row_number = 1
), b as (
SELECT count(1) total, a2.url as url, a2."requesterId"
FROM a a2
GROUP BY a2.url, a2."requesterId"
), c AS (SELECT count(1) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status = 'success'
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c
ON b.url = c.url AND b."requesterId" = c."requesterId"
JOIN users u
ON b."requesterId" = u.id
LIMIT 10;

Optimizing this counting query in PostgreSQL

I need to implement a basic facet-search sidebar in my app. Unfortunately I can't use Elasticsearch/Solr/alternatives and am limited to Postgres.
I have around 10+ columns ('status', 'classification', 'filing_type', ...) for which I need to return counts of every distinct value after every search, and display them accordingly. I've drafted this bit of SQL; however, it won't take me very far in the long run, as it will slow down massively once I reach a high number of rows.
select row_to_json(t) from (
select 'status' as column, status as value, count(*) from api_articles_mv_temp group by status
union
select 'classification' as column, classification as value, count(*) from api_articles_mv_temp group by classification
union
select 'filing_type' as column, filing_type as value, count(*) from api_articles_mv_temp group by filing_type
union
...) t;
This yields
{"column":"classification","value":"State","count":2001}
{"column":"classification","value":"Territory","count":23}
{"column":"filing_type","value":"Joint","count":169}
{"column":"classification","value":"SRO","count":771}
{"column":"filing_type","value":"Single","count":4238}
{"column":"status","value":"Updated","count":506}
{"column":"classification","value":"Federal","count":1612}
{"column":"status","value":"New","count":3901}
From the query plan, the HashAggregates are slowing it down.
Subquery Scan on t  (cost=2397.58..2397.76 rows=8 width=32) (actual time=212.822..213.022 rows=8 loops=1)
  ->  HashAggregate  (cost=2397.58..2397.66 rows=8 width=186) (actual time=212.780..212.856 rows=8 loops=1)
        Group Key: ('status'::text), api_articles_mv_temp.status, (count(*))
        ->  Append  (cost=799.11..2397.52 rows=8 width=186) (actual time=75.238..212.701 rows=8 loops=1)
              ->  HashAggregate  (cost=799.11..799.13 rows=2 width=44) (actual time=75.221..75.242 rows=2 loops=1)
                    Group Key: api_articles_mv_temp.status
                    ...
Is there a simpler, more optimized way of getting this result?
Reading api_articles_mv_temp just once may improve performance.
Here are two examples; can you try them?
If the combinations of "column" and "value" are fixed, the query looks like this:
select row_to_json(t) from (
select "column", "value", count(*) as "count"
from column_temp left outer join api_articles_mv_temp on
"value"=
case "column"
when 'status' then status
when 'classification' then classification
when 'filing_type' then filing_type
end
group by "column", "value"
) t;
The column_temp table has the records below:
column         | value
---------------+-----------
status         | New
status         | Updated
classification | State
classification | Territory
classification | SRO
filing_type    | Single
filing_type    | Joint
DB Fiddle
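For completeness, hypothetical DDL for that lookup table (the original answer only shows its contents):
CREATE TABLE column_temp ("column" text, "value" text);
INSERT INTO column_temp ("column", "value") VALUES
    ('status', 'New'),
    ('status', 'Updated'),
    ('classification', 'State'),
    ('classification', 'Territory'),
    ('classification', 'SRO'),
    ('filing_type', 'Single'),
    ('filing_type', 'Joint');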
If just the "column" is fixed, the query looks like this:
select row_to_json(t) from (
select "column",
case "column"
when 'status' then status
when 'classification' then classification
when 'filing_type' then filing_type
end as "value",
sum("count") as "count"
from column_temp a
cross join (
select
status,
classification,
filing_type,
count(*) as "count"
from api_articles_mv_temp
group by
status,
classification,
filing_type) b
group by "column", "value"
) t;
For this variant, column_temp has the records below:
column
---------------
status
classification
filing_type
DB Fiddle
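And the corresponding hypothetical DDL for this single-column variant:
CREATE TABLE column_temp ("column" text);
INSERT INTO column_temp VALUES ('status'), ('classification'), ('filing_type');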

Hive: how to get global ordering with sort by

order by funnels everything through a single reducer, so it is slow. I'm trying to find a faster way. sort by sorts within each reducer, so how can we get a global ordering?
I found this via a search engine:
select * from
(select title,cast(price as FLOAT) p from tablename
distribute by time
sort by p desc
limit 10 ) t
order by t.p desc
limit 10;
Then I tried to validate it.
1. Get the right answer from my Hive table. There are 215,666 records in the table named tablename.
SELECT title,cast(price as FLOAT) p
from tablename
WHERE dt='2020-03-08'
and price IS NOT NULL
ORDER BY p DESC
LIMIT 10
;
2. Use the query found above.
set hive.execution.engine=mr;
set mapred.reduce.tasks=5;
SELECT title,cast(price as FLOAT) p
from tablename
WHERE dt='2020-03-08'
and price IS NOT NULL
DISTRIBUTE BY title
SORT BY p desc
LIMIT 10
;
The result is the same as the right answer!
Here are my questions:
1. Why are only 10 rows returned? There are 5 reducers, and each reducer returns 10 rows, so shouldn't it be 5*10=50?
2. If it does return only 10 rows, why is the result globally ordered? Those 10 rows don't all come from the same reducer, and a per-reducer LIMIT picks rows arbitrarily, so a global order across 5 reducers shouldn't be guaranteed.
3. If it does return only 10 rows, is the outer part of the query I found redundant?
select * from
(
) t
order by t.p desc
limit 10;
Consider using a total order partitioner; see https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad#HBaseBulkLoad-PrepareRangePartitioning for details (just ignore the HBase part).
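For reference, a rough sketch of the knobs that recipe involves (following the linked wiki; the HDFS path is a placeholder, and the range-key file must be prepared beforehand as the wiki describes):
set mapred.reduce.tasks=5;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/range_key_list; -- placeholder path
-- each reducer then receives a disjoint key range, so concatenating the
-- reducer outputs yields a global ordering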

Does PostgreSQL share the ordering of a CTE?

In PostgreSQL, common table expressions (CTEs) are optimization fences. This means that the CTE is materialized into memory and that predicates from another query will never be pushed down into the CTE.
Now I am wondering whether other metadata about the CTE, such as ordering, is shared with the outer query. Let's take the following query:
WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects
Here, MIN(type) is obviously always the type of the first row of ordered_objects (or NULL if ordered_objects is empty), because ordered_objects is already ordered by type. Is this knowledge about ordered_objects available when evaluating SELECT MIN(type) FROM ordered_objects?
If I understand your question correctly: no, no such knowledge is shared, as you will see in the example below. When you limit the CTE to 10 rows, execution is dramatically faster since there is far less data to process (in my case, a million times less). That means that without the LIMIT, the outer query scans the whole ordered set, ignoring the fact that the min would be in the first rows...
data:
t=# create table object (type bigint);
CREATE TABLE
Time: 4.636 ms
t=# insert into object select generate_series(1,9999999);
INSERT 0 9999999
Time: 7769.275 ms
with limit:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC LIMIT 10
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 3150.183 ms
https://explain.depesz.com/s/5yXe
without:
explain analyze WITH ordered_objects AS
(
SELECT * FROM object ORDER BY type ASC
)
SELECT MIN(type) FROM ordered_objects;
Execution time: 16032.989 ms
https://explain.depesz.com/s/1SU
I made sure to warm up the data before the tests.
In Postgres:
- a CTE is always executed once, even if it is referenced more than once;
- its result is stored in a temporary table (materialized);
- the outer query has no knowledge of the CTE's internal structure (indexes are not available) or ordering (not sure about frequency estimates); it just scans the temporary results.
In the fragment below, the CTE is scanned twice, even though the results are known to be identical.
\d react
EXPLAIN ANALYZE
WITH omg AS (
SELECT topic_id
, row_number() OVER (PARTITION by krant_id ORDER BY topic_id) AS rn
FROM react
WHERE krant_id = 1
AND topic_id < 5000000
ORDER BY topic_id ASC
)
SELECT MIN (o2.topic_id)
FROM omg o1 --
JOIN omg o2 ON o1.rn = o2.rn -- exactly the same
WHERE o1.rn = 1
;
Table "public.react"
Column | Type | Modifiers
------------+--------------------------+--------------------
krant_id | integer | not null default 1
topic_id | integer | not null
react_id | integer | not null
react_date | timestamp with time zone |
react_nick | character varying(1000) |
react_body | character varying(4000) |
zoek | tsvector |
Indexes:
"react_pkey" PRIMARY KEY, btree (krant_id, topic_id, react_id)
"react_krant_id_react_nick_react_date_topic_id_react_id_idx" UNIQUE, btree (krant_id, react_nick, react_date, topic_id, react_id)
"react_date" btree (krant_id, topic_id, react_date)
"react_nick" btree (krant_id, topic_id, react_nick)
"react_zoek" gin (zoek)
Triggers:
tr_upd_zzoek_i BEFORE INSERT ON react FOR EACH ROW EXECUTE PROCEDURE tf_upd_zzoek()
tr_upd_zzoek_u BEFORE UPDATE ON react FOR EACH ROW WHEN (new.react_body::text <> old.react_body::text) EXECUTE PROCEDURE tf_upd_zzoek()
----------
Aggregate  (cost=232824.29..232824.29 rows=1 width=4) (actual time=1773.643..1773.645 rows=1 loops=1)
  CTE omg
    ->  WindowAgg  (cost=0.43..123557.17 rows=402521 width=8) (actual time=0.217..1246.577 rows=230822 loops=1)
          ->  Index Only Scan using react_pkey on react  (cost=0.43..117519.35 rows=402521 width=8) (actual time=0.161..419.916 rows=230822 loops=1)
                Index Cond: ((krant_id = 1) AND (topic_id < 5000000))
                Heap Fetches: 442
  ->  Nested Loop  (cost=0.00..99136.69 rows=4052169 width=4) (actual time=0.264..1773.624 rows=1 loops=1)
        ->  CTE Scan on omg o1  (cost=0.00..9056.72 rows=2013 width=8) (actual time=0.249..59.252 rows=1 loops=1)
              Filter: (rn = 1)
              Rows Removed by Filter: 230821
        ->  CTE Scan on omg o2  (cost=0.00..9056.72 rows=2013 width=12) (actual time=0.003..1714.355 rows=1 loops=1)
              Filter: (rn = 1)
              Rows Removed by Filter: 230821
Total runtime: 1782.887 ms
(14 rows)

How to tweak index_scan cost in postgres?

For the following query:
SELECT *
FROM "routes_trackpoint"
WHERE "routes_trackpoint"."track_id" = 593
ORDER BY "routes_trackpoint"."id" ASC
LIMIT 1;
Postgres is choosing a query plan that reads all the rows in the "id" index to perform the ordering, then filters them manually to keep the entries with the correct track id:
Limit  (cost=0.43..511.22 rows=1 width=65) (actual time=4797.964..4797.966 rows=1 loops=1)
  Buffers: shared hit=3388505
  ->  Index Scan using routes_trackpoint_pkey on routes_trackpoint  (cost=0.43..719699.79 rows=1409 width=65) (actual time=4797.958..4797.958 rows=1 loops=1)
        Filter: (track_id = 75934)
        Rows Removed by Filter: 13005436
        Buffers: shared hit=3388505
Total runtime: 4798.019 ms
(7 rows)
With index scans disabled (SET enable_indexscan=OFF;), the query plan is better and the response much faster.
Limit  (cost=6242.46..6242.46 rows=1 width=65) (actual time=77.584..77.586 rows=1 loops=1)
  Buffers: shared hit=1075 read=6
  ->  Sort  (cost=6242.46..6246.64 rows=1674 width=65) (actual time=77.577..77.577 rows=1 loops=1)
        Sort Key: id
        Sort Method: top-N heapsort  Memory: 25kB
        Buffers: shared hit=1075 read=6
        ->  Bitmap Heap Scan on routes_trackpoint  (cost=53.41..6234.09 rows=1674 width=65) (actual time=70.384..74.782 rows=1454 loops=1)
              Recheck Cond: (track_id = 75934)
              Buffers: shared hit=1075 read=6
              ->  Bitmap Index Scan on routes_trackpoint_track_id  (cost=0.00..52.99 rows=1674 width=0) (actual time=70.206..70.206 rows=1454 loops=1)
                    Index Cond: (track_id = 75934)
                    Buffers: shared hit=2 read=6
Total runtime: 77.655 ms
(13 rows)
How can I get Postgres to select the better plan automatically?
I have tried the following:
ALTER TABLE routes_trackpoint ALTER COLUMN track_id SET STATISTICS 5000;
ALTER TABLE routes_trackpoint ALTER COLUMN id SET STATISTICS 5000;
ANALYZE routes_trackpoint;
But the query plan remained the same.
The table definition is:
watchdog2=# \d routes_trackpoint
Table "public.routes_trackpoint"
Column | Type | Modifiers
----------+--------------------------+----------------------------------------------------------------
id | integer | not null default nextval('routes_trackpoint_id_seq'::regclass)
track_id | integer | not null
position | geometry(Point,4326) | not null
speed | double precision | not null
bearing | double precision | not null
valid | boolean | not null
created | timestamp with time zone | not null
Indexes:
"routes_trackpoint_pkey" PRIMARY KEY, btree (id)
"routes_trackpoint_position_id" gist ("position")
"routes_trackpoint_track_id" btree (track_id)
Foreign-key constraints:
"track_id_refs_id_d59447ae" FOREIGN KEY (track_id) REFERENCES routes_track(id) DEFERRABLE INITIALLY DEFERRED
PS: We have forced Postgres to sort by "created" instead, which also helped it use the index on "track_id".
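A sketch of that workaround (assuming "created" increases with "id" within a track, so the first point by created is also the first by id):
EXPLAIN ANALYZE
SELECT *
FROM routes_trackpoint
WHERE track_id = 593
ORDER BY created ASC
LIMIT 1;
-- ordering by a column the primary key does not cover removes the
-- temptation to walk the id index, so the planner uses the track_id index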
Avoid LIMIT as much as you can.
Plan #1: use NOT EXISTS() to get the first one
EXPLAIN ANALYZE
SELECT * FROM routes_trackpoint tp
WHERE tp.track_id = 593
AND NOT EXISTS (
SELECT * FROM routes_trackpoint nx
WHERE nx.track_id = tp.track_id AND nx.id < tp.id
);
Plan #2: use row_number() OVER some_window to get the first one of the group.
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
) omg ON omg.id = tp.id
WHERE tp.track_id = 593
AND omg.rn = 1
;
Or, even better, move the WHERE clause into the subquery:
EXPLAIN ANALYZE
SELECT tp.*
FROM routes_trackpoint tp
JOIN (select track_id, id
, row_number() OVER (partition BY track_id ORDER BY id) rn
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
) omg ON omg.id = tp.id
WHERE 1=1
-- AND tp.track_id = 593
AND omg.rn = 1
;
Plan #3: use the Postgres-specific DISTINCT ON () construct (thanks to @a_horse_with_no_name):
-- EXPLAIN ANALYZE
SELECT DISTINCT ON (track_id) track_id, id
FROM routes_trackpoint tp2
WHERE tp2.track_id = 593
-- order by track_id, created desc
order by track_id, id
;
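A hedged closing note: none of the indexes in the schema shown above covers (track_id, id) together. A composite B-tree index, sketched below as a hypothetical addition, should let the original ORDER BY id LIMIT 1 query, and all three plans above, resolve with a single index descent per track:
-- hypothetical addition, not part of the original schema
CREATE INDEX ON routes_trackpoint (track_id, id);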