explain analyse
SELECT COUNT(*) FROM syo_event WHERE id_group = 'OPPORTUNITY' AND id_type = 'NEW'
My query have this plan:
Aggregate (cost=654.16..654.17 rows=1 width=0) (actual time=3.783..3.783 rows=1 loops=1)
-> Bitmap Heap Scan on syo_event (cost=428.76..654.01 rows=58 width=0) (actual time=2.774..3.686 rows=1703 loops=1)
Recheck Cond: ((id_group = 'OPPORTUNITY'::text) AND (id_type = 'NEW'::text))
-> BitmapAnd (cost=428.76..428.76 rows=58 width=0) (actual time=2.635..2.635 rows=0 loops=1)
-> Bitmap Index Scan on syo_list_group (cost=0.00..35.03 rows=1429 width=0) (actual time=0.261..0.261 rows=2187 loops=1)
Index Cond: (id_group = 'OPPORTUNITY'::text)
-> Bitmap Index Scan on syo_list_type (cost=0.00..393.45 rows=17752 width=0) (actual time=2.177..2.177 rows=17555 loops=1)
Index Cond: (id_type = 'NEW'::text)
Total runtime: 3.827 ms
In the first line:
(actual time=3.783..3.783 rows=1 loops=1
(Why the actual time not match with the last line, Total runtime ?)
In the second line:
cost=428.76..654.01
(Start the Bitmap Heap Scan with cost 428.76 and ends with 654.01) ?
rows=58 width=0)
(Wath is rows and width, anything important ?)
rows=1703
(this is the result)
loops=1)
(Used in subqueries ?)
From the postgres docs:
Note that the "actual time" values are in milliseconds of real time, whereas the cost estimates are expressed in arbitrary units; so they are unlikely to match up. The thing that's usually most important to look for is whether the estimated row counts are reasonably close to reality.
The estimated cost is computed as (disk pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost). By default, seq_page_cost is 1.0 and cpu_tuple_cost is 0.01
As for the first line, EXPLAIN executed the query and it took 3.783 ms, but presenting you with the output of the plan takes some time to, so that total runtime is increased by the time spent on doing that.
Basically EXPLAIN ANALYZE displays estimates that a plain EXPLAIN would show you along with values collected from actually running the query, hence the difference in the second line.
Both rows and width are important. Respectively this is the number of rows output estimation and an average width of rows in bytes. Your estimated total cost will be lower if the estimation of rows returned is smaller and you need to take that into account.
To understand what loops actually presents you need to know that a query plan is actually a tree of plan nodes. There are different types of nodes that serve different purposes - a scan node for example is responsible for returning raw rows from a table. If your query is doing some operation on rows there will be additional nodes above the scan nodes to handle that.
The first line in your output from EXPLAIN is a summary from the level 1 (at the top) node with estimated cost for entire plan.
Knowing that, loops represents a value of total number of executions of a particular node. This is because in some plans a subplan node can be executed more than once and if that happens then to make the numbers comparable with other estimates it multiplies time and rows values by loops to get the total time spent in that node.
You can get more insight on the topic in the documentation.
Related
I'm struggling to make sense of postgres EXPLAIN to figure out why my query is slow. Can someone help? This is my query, it's a pretty simple join:
SELECT DISTINCT graph_concepts.*
FROM graph_concepts
INNER JOIN graph_family_links
ON graph_concepts.id = graph_family_links.descendent_concept_id
WHERE graph_family_links.ancestor_concept_id = 1016
AND graph_family_links.generation = 1
AND graph_concepts.state != 2
It's starting from a concept and it's getting a bunch of related concepts through the links table.
Notably, I have an index on graph_family_links.descendent_concept_id, yet this query takes about 3 seconds to return a result. This is way too long for my purposes.
This is the SQL explain:
Unique (cost=46347.01..46846.16 rows=4485 width=108) (actual time=27.406..33.667 rows=13 loops=1)
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
-> Gather Merge (cost=46347.01..46825.98 rows=4485 width=108) (actual time=27.404..33.656 rows=13 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
-> Sort (cost=45347.01..45348.32 rows=2638 width=108) (actual time=23.618..23.621 rows=6 loops=2)
Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
Worker 0: Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=301.97..45317.02 rows=2638 width=108) (actual time=8.890..23.557 rows=6 loops=2)
Buffers: shared hit=13039 read=5
I/O Timings: read=0.074
-> Parallel Bitmap Heap Scan on graph_family_links (cost=301.88..39380.60 rows=2640 width=4) (actual time=8.766..23.293 rows=6 loops=2)
Recheck Cond: (ancestor_concept_id = 1016)
Filter: (generation = 1)
Rows Removed by Filter: 18850
Heap Blocks: exact=2558
Buffers: shared hit=12985
-> Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id (cost=0.00..301.66 rows=38382 width=0) (actual time=4.744..4.744 rows=47346 loops=1)
Index Cond: (ancestor_concept_id = 1016)
Buffers: shared hit=67
-> Index Scan using graph_concepts_pkey on graph_concepts (cost=0.08..2.25 rows=1 width=108) (actual time=0.036..0.036 rows=1 loops=13)
Index Cond: (id = graph_family_links.descendent_concept_id)
Filter: (state <> 2)
Buffers: shared hit=54 read=5
I/O Timings: read=0.074
Planning:
Buffers: shared hit=19
Planning Time: 0.306 ms
Execution Time: 33.747 ms
(35 rows)
I'm doing lots of googling to help me figure out how to read this EXPLAIN and I'm struggling. Can someone help translate this into plain english for me?
Answering myself (for the benefit of future people):
My question was primarily how to understand EXPLAIN. Many people below contributed to my understanding but no one really gave me the beginner unpacking I was looking for. I want to teach myself to fish rather than simply having other people read this and give me advice on solving this specific issue, although I do greatly appreciate the specific suggestions!
For others trying to understand EXPLAIN, this is the important context you need to know, which was holding me back:
"Cost" is some arbitrary unit of how expense each step of the process is, you can think of it almost like a stopwatch.
Look near the end of your EXPLAIN until you find: cost=0.00.. This is the very start of your query execution. In my case, cost=0.00..301.66 is the first step and cost=0.08..2.25 runs in parallel (from step 0.08 to step 2.25, just a small fraction of the 0 to 300).
Find the step with the biggest "span" of cost. In my case, cost=301.88..39380.60. Although I was confused because I also have a cost=301.97..45317.02. I think those are, again, both happening in parallel so I'm not sure which one is contributing more.
SELECT DISTINCT
graph_concepts.*
FROM
graph_concepts
INNER JOIN graph_family_links ON graph_concepts.id = graph_family_links.descendent_concept_id
WHERE
graph_family_links.ancestor_concept_id = 1016
AND graph_family_links.generation = 1
AND graph_concepts.state != 2
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46347.01..46846.16 rows=4485 width=108)
## (Merge records DISTINCT)
-> Gather Merge (cost=46347.01..46825.98 rows=4485 width=108)
Workers Planned: 1
## (Sort table graph_concepts.* )
-> Sort (cost=45347.01..45348.32 rows=2638 width=108)
Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
-> Nested Loop (cost=301.97..45317.02 rows=2638 width=108)
## WHERE graph_family_links.ancestor_concept_id = 1016 (Use Parallel Bitmap Heap Scan table and filter record)
-> Parallel Bitmap Heap Scan on graph_family_links (cost=301.88..39380.60 rows=2640 width=4)
Recheck Cond: (ancestor_concept_id = 1016)
Filter: (generation = 1)
## AND graph_family_links.generation = 1 (Use Bitmap Index Scan table and filter record)
-> Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id (cost=0.00..301.66 rows=38382 width=0)
Index Cond: (ancestor_concept_id = 1016)
## AND graph_concepts.state != 2 (Use Index Scan table and filter record)
-> Index Scan using graph_concepts_pkey on graph_concepts (cost=0.08..2.25 rows=1 width=108)
Index Cond: (id = graph_family_links.descendent_concept_id)
Filter: (state <> 2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Please refer to the below sql script.
SELECT DISTINCT graph_concepts.*
FROM graph_concepts
INNER JOIN (select descendent_concept_id from graph_family_links where ancestor_concept_id = 1016 and generation = 1) A ON graph_concepts.id = A.descendent_concept_id
WHERE graph_concepts.state != 2
Explain read from bottom to up, generally.
https://www.postgresql.org/docs/current/sql-explain.html
This command displays the execution plan that the PostgreSQL planner
generates for the supplied statement. The execution plan shows how the
table(s) referenced by the statement will be scanned — by plain
sequential scan, index scan, etc. — and if multiple tables are
referenced, what join algorithms will be used to bring together the
required rows from each input table.
Since you only do explain your select command meaning the output is the system planner generate a execute plan for this select query. But if you do explain analyze then it will plan and execute the query.
First there is two table there. For each table use some ways to found out which rows meet the where criteria. index scan (one of the way to find out where is the row) found out in table graph_concepts which row meet the condition: graph_concepts.state != 2.
Also in the mean time(Parallel) use Bitmap Heap Scan to found out in table
graph_family_links which row meet the criteria: graph_family_links.ancestor_concept_id = 1016
After that then do join operation. In this case, it's Nested Loop.
After join then do Sort. Why we need to sort operation? Because you specified: SELECT DISTINCT : https://www.postgresql.org/docs/current/sql-select.html
-After sort then since you specified key word DISTINCT then eliminate the duplicates.
People have given you general links and advice on understanding plans. To focus on one relevant part of the plan, it expects to find 38382 rows satisfying ancestor_concept_id = 1016, but then well over 90% of them are expected to fail the generation = 1 filter. But that is expensive as it was to jump to some random table page to fetch the "generation" value to apply the filter.
If you had a combined index on (ancestor_concept_id, generation) it could apply both restrictions efficiently simultaneously. Alternatively, of you had separate single column indexes on those columns, it could combine then with a BitmapAnd operation. That would be more efficient than what you are currently doing to but less efficient than the combined index.
I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
Which outputs 500 rows having a single result column and takes around 1 min and 43 secs to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (in the list of 500 ids) calculate the sum of the integer column pb.nr_size associated with them, returning the lesser value between this amount and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that they'll match at least one row in the subquery, so will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is of integer type also. Table tb_pr_dc has many rows for each id_pr. Actually, we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same amount of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size) and then finding the same amount looking in tb_pr_dc_bn. Imo, the indexes look fine and the amount of rows to evaluate is minimal, so I can't understand why it's taking so much time to run this query. A lot of other queries reading a lot more of data on other tables and doing more calculations are running fine. The DBA just ran a reindex and vacuum analyze, but still it's running the same slow way. We are running PostgreSQL 11 on Linux. I'm running this query in a replica without concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.
The time is spent jumping all over the table to find 109468 randomly scattered rows, issuing random IO requests to do so. You can verify that be turning track_io_timing on and redoing the plans (probably just leave it turned on globally and by default, the overhead is low and the value it produces is high), but I'm sure enough that I don't need to see that output before reaching this conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or is organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
You ask about why so many columns are output in the internal nodes of the plan. The reason for that is that PostgreSQL often just passes around pointers to where the tuple lives in the shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If it was a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do it without a reason.
One way to sped this up is to create indexes which will enable index-only scans. Those would be on tb_pr_dc (id_pr, id_pr_dc_bn) and on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.
I have a reference table for UUIDs that is roughly 200M rows. I have ~5000 UUIDs that I want to look up in the reference table. Reference table looks like:
CREATE TABLE object_store AS (
project_id UUID,
object_id UUID,
object_name VARCHAR(20),
description VARCHAR(80)
);
CREATE INDEX object_store_project_idx ON object_store(project_id);
CREATE INDEX object_store_id_idx ON object_store(object_id);
* Edit #2 *
Request for the temp_objects table definition.
CREATE TEMPORARY TABLE temp_objects AS (
object_id UUID
)
ON COMMIT DELETE ROWS;
The reason for the separate index is because object_id is not unique, and can belong to many different projects. The reference table is just a temp table of UUIDs (temp_objects) that I want to check (5000 object_ids).
If I query the above reference table with 1 object_id literal value, it's almost instantaneous (2ms). If the temp table only has 1 row, again, instantaneous (2ms). But with 5000 rows it takes 25 minutes to even return. Granted it pulls back >3M rows of matches.
* Edited *
This is for 1 row comparison (4.198 ms):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..475780.22 rows=494005 width=65) (actual time=0.038..2.631 rows=1194 loops=1)
Buffers: shared hit=1202, local hit=1
-> Seq Scan on temp_objects t (cost=0.00..13.60 rows=360 width=16) (actual time=0.007..0.009 rows=1 loops=1)
Buffers: local hit=1
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1307.85 rows=1372 width=81) (actual time=0.027..1.707 rows=1194 loops=1)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=1202
Planning time: 0.173 ms
Execution time: 3.096 ms
(9 rows)
Time: 4.198 ms
This is for 4911 row comparison (1579082.974 ms (26:19.083)):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..3217316.86 rows=3507438 width=65) (actual time=0.041..1576913.100 rows=8043500 loops=1)
Buffers: shared hit=5185078 read=2887548, local hit=71
-> Seq Scan on temp_objects d (cost=0.00..96.56 rows=2556 width=16) (actual time=0.009..3.945 rows=4911 loops=1)
Buffers: local hit=71
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1244.97 rows=1372 width=81) (actual time=1.492..320.081 rows=1638 loops=4911)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=5185078 read=2887548
Planning time: 0.169 ms
Execution time: 1579078.811 ms
(9 rows)
Time: 1579082.974 ms (26:19.083)
Eventually I want to group and get a count of the matching object_ids by project_id, using standard grouping. The aggregate is at the upper end (of course) of the cost. It took just about 25 minutes again to complete the below query. Yet, when I limit the temp table to only 1 row, it comes back in 21ms. Something is not adding up...
EXPLAIN SELECT O.project_id, count(*)
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id GROUP BY O.project_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6189484.10..6189682.84 rows=19874 width=73)
Group Key: o.project_id
-> Nested Loop (cost=0.57..6155795.69 rows=6737683 width=65)
-> Seq Scan on temp_objects t (cost=0.00..120.10 rows=4910 width=16)
-> Index Scan using object_store_id_idx on object_store o (cost=0.57..1239.98 rows=1372 width=81)
Index Cond: (object_id = t.object_id)
(6 rows)
I'm on PostgreSQL 10.6, running 2 CPUs and 8GB of RAM on an SSD. I have ANALYZEd the tables, I have set the work_mem to 50MB, shared_buffers to 2GB, and have set the random_page_cost to 1. All helped the queries actually to come back in several minutes, but still not as fast as I feel it should be.
I have the option to go to cloud computing if CPUs/RAM/parallelization make a big difference. Just looking for suggestions on how to get this simple query to return in < few seconds (if possible).
* UPDATE *
Taking the hint from Jürgen Zornig, I changed both object_id fields to be bigint, using just the top half of the UUID and reducing my datasize by half. Doing the aggregate query above the query now performs at ~16min.
Next, taking jjane's suggestion of set enable_nestloop to off, my aggregate query jumped to 6min! Unfortunately, all the other suggestions haven't sped it up past 6min, although it's interesting that changing my "TEMPORARY" table to a permanent one allowed 2 workers to work it, it didn't change the time. I think jjane is accurate by saying the IO is the binding factor here. Here is the latest explain plan from the 6min (wish it were faster, still, but it's better!):
explain (analyze, buffers, format text) select project_id, count(*) from object_store natural join temp_object group by project_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=3966899.86..3967396.69 rows=19873 width=73) (actual time=368124.126..368744.157 rows=153633 loops=1)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Sort (cost=3966899.86..3966999.23 rows=39746 width=73) (actual time=368124.116..368586.497 rows=333427 loops=1)
Sort Key: object_store.project_id
Sort Method: external merge Disk: 29392kB
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Gather (cost=3959690.23..3963863.56 rows=39746 width=73) (actual time=366476.369..366827.313 rows=333427 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Partial HashAggregate (cost=3958690.23..3958888.96 rows=19873 width=73) (actual time=366472.712..366568.313 rows=111142 loops=3)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Hash Join (cost=132.50..3944473.09 rows=2843429 width=65) (actual time=7.880..363848.830 rows=2681167 loops=3)
Hash Cond: (object_store.object_id = temp_object.object_id)
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Parallel Seq Scan on object_store (cost=0.00..3499320.53 rows=83317153 width=73) (actual time=0.467..324932.880 rows=66653718 loops=3)
Buffers: shared hit=242934 read=2423215
I/O Timings: read=870720.440
-> Hash (cost=71.11..71.11 rows=4911 width=8) (actual time=7.349..7.349 rows=4911 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 256kB
Buffers: shared hit=66
-> Seq Scan on temp_object (cost=0.00..71.11 rows=4911 width=8) (actual time=0.014..2.101 rows=4911 loops=3)
Buffers: shared hit=66
Planning time: 0.247 ms
Execution time: 368779.757 ms
(32 rows)
Time: 368780.532 ms (06:08.781)
So I'm at 6min per query now. I think with I/O costs, I may try for an in-memory store on this table if possible to see if getting it off SSD makes it even better.
UUIDs are (EDIT) working against adaptive cache management and, because of their random nature effectively dropping the cache hit ratio because the index space is larger than memory. Ids cover a numerically wide range equally distributed, so in fact every Id lands pretty much on its own leaf on the index tree. As the index leaf determines in which data page the row is saved in disk pretty much every row gets its own page resulting in a whole lot of extremely expensive I/O Operations to get all these rows read in.
That's the reason why its generally not recommended to use UUIDs and if you really need UUIDs then at least generate timestamp/mac-prefixed UUIDs (have a look at uuid_generate_v1() - https://www.postgresql.org/docs/9.4/uuid-ossp.html) that are numerically close to each other, therefore chances are higher that data rows are clustered together on lesser data Pages resulting in fewer I/O Operations to get more Data in.
Long Story Short: Randomness over a large range kills your index (well actually not the index, it results in a lot of expensive I/O to get data on reading and to maintain the index on writing) and therefore slows queries down to a point where it is as good as having no index at all.
Here is also an article for reference
It looks like the centerpiece of your question is why it doesn't scale up from one input row to 5000 input rows linearly. But I think that this is a red herring. How are you choosing the one row? If you choose the same one row each time, then the data will stay in cache and will be very fast. I bet this is what you are doing. If you choose a different random one row each time you do a one-row plan, you will probably find the scaling to be more linear.
You should turn on track_io_timing. I have little doubt that IO is actually the bottleneck, but it is always nice to see it actually measured and reported, I have been surprised before.
The use of temporary table will inhibit parallel query. You might want to test with a permanent table, to see if you do get use of parallel workers, and if so, whether that actually helps. If you do this test, you should use your aggregation version of the query. They parallelize more efficiently than non-aggregation queries do, and if that is your ultimate goal that is what you should initially test with.
Another thing you could try is a large setting of effective_io_concurrency. But, that will only help if your plan uses bitmap scans to start with, which the plans you show do not. Setting random_page_cost from 1 to a slightly higher value might encourage it to use bitmap scans. (effective_io_concurrency is weird because bitmap plans can get a substantial realistic benefit from a higher setting, but the planner doesn't give bitmap plans any credit for that benefit they receive. So you must be "accidentally" using that plan already in order to get the benefit)
At some point (as you increase the number of rows in temp_objects) it is going to be faster to hash that table, and hashjoin it to a seq-scan of the object_store table. Is 5000 already past the point at which that would be faster? The planner clearly doesn't think so, but the planner never gets the cut-over point exactly right, and is often off by quite a bit. What happens if you set enable_nestloop TO off; before running your query?
Have you done low-level benchmarking of your SSD (outside of the database)? Assuming substantially all of your time is spent on IO reads and nearly none of those are fulfilled by the filesystem cache, you are getting 1576913/2887548 = 0.55ms per read. That seems pretty long. That is about what I get on a bottom-dollar laptop where the SSD is being exposed through a VM layer. I'd expect better than that from server-grade hardware.
Be sure you have also a proper index for temp_objects table
CREATE INDEX temp_object_id_idx ON temp_objects(object_id);
SELECT O.project_id
FROM temp_objects T
JOIN object_store O ON T.object_id = O.object_id;
Firstly: I would try to get the index into memory. What is shared_buffers set to? If it is small, lets make that bigger first. See if we can reduce the index scan IO.
Next: Are parallel queries enabled? I'm not sure that will help here very much because you have only 2 cpus, but it wouldn't hurt.
Even though the object column is completely random, I'd also bump up the statistics on that table from the default (100 rows or something like that) to a few thousand rows. Then run Analyze again. (or for thoroughness, vacuum analyze)
Work Mem at 50M may be low too. It could potentially be larger if you don't have a lot of concurrent users and you have G's of RAM to work with. Too large and it can be counter productive, but you could go up a bit more to see if it helps.
You could try CTAS on the big table into a new table to sort object id so that it isn't completely random.
There might be a crazy partitioning scheme you could come up with if you were using PostgreSQL 12 that would group the object ids into some even partition distribution.
EXPLAIN SELECT a.name, m.name FROM Casting c JOIN Movie m ON c.m_id = m.m_id JOIN Actor a ON a.a_id = c.a_id AND c.a_id < 50;
Output
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=26.20..18354.49 rows=1090 width=27) (actual time=0.240..5.603 rows=1011 loops=1)
-> Nested Loop (cost=25.78..12465.01 rows=1090 width=15) (actual time=0.236..4.046 rows=1011 loops=1)
-> Bitmap Heap Scan on casting c (cost=25.35..3660.19 rows=1151 width=8) (actual time=0.229..1.059 rows=1011 loops=1)
Recheck Cond: (a_id < 50)
Heap Blocks: exact=989
-> Bitmap Index Scan on casting_a_id_index (cost=0.00..25.06 rows=1151 width=0) (actual time=0.114..0.114 rows=1011 loops=1)
Index Cond: (a_id < 50)
-> Index Scan using movie_pkey on movie m (cost=0.42..7.64 rows=1 width=15) (actual time=0.003..0.003 rows=1 loops=1011)
Index Cond: (m_id = c.m_id)
-> Index Scan using actor_pkey on actor a (cost=0.42..5.39 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=1011)
Index Cond: (a_id = c.a_id)
Planning time: 0.334 ms
Execution time: 5.672 ms
(13 rows)
I am trying to understand how query planner works? I am able to understand the process it choose, but I am not getting why ?
Can someone explain query optimizer choices (choice of query processing algorithms, join order) in these queries based on parameters like query selectivity and cost models or anything that effects choice?
Also why there is use of Recheck Cond, after index scan ?
There are two reasons why there has to be a Bitmap Heap Scan:
PostgreSQL has to check whether the rows found are visible for the current transaction or not. Remember that PostgreSQL keeps old row versions in the table until VACUUM removes them. This visibility information is not stored in the index.
If work_mem is not large enough to contain a bitmap with one bit per table row, PostgreSQL uses one bit per table page, which loses some information. The PostgreSQL needs to check the lossy blocks to see which of the rows in the block really satisfy the condition.
You can see this when you use EXPLAIN (ANALYZE, BUFFERS), then PostgreSQL will show if there were lossy matches, see this example on rextester:
-> Bitmap Heap Scan on t (cost=177.14..4719.43 rows=9383 width=0)
(actual time=2.130..144.729 rows=10001 loops=1)
Recheck Cond: (val = 10)
Rows Removed by Index Recheck: 738586
Heap Blocks: exact=646 lossy=3305
Buffers: shared hit=1891 read=2090
-> Bitmap Index Scan on t_val_idx (cost=0.00..174.80 rows=9383 width=0)
(actual time=1.978..1.978 rows=10001 loops=1)
Index Cond: (val = 10)
Buffers: shared read=30
I cannot explain the whole of the PostgreSQL optimizer in this answer, but what it does is to try all possible ways to compute the result, estimate how much each one will cost and choose the cheapest plan.
To estimate how big the result set will be, it uses the object definitions and the table statistics, which contain detailed data about how the column values are distributed.
It then calculates how many disk blocks it will have to read sequentially and by random access (I/O cost), and how many tables and index rows and function calls it will have to process (CPU cost) to come up with a grand total. The weights for each of these components in the total can be configured.
Usually the best plan is one that reduces the number of result rows as quickly as possible by applying the most selective condition first. In your case this seems to be casting.a_id < 50.
Nested loop joins are often preferred if the number of rows in the outer (upper in EXPLAIN output) table is small.
In PostgreSQL, I have an index on a date field on my tickets table.
When I compare the field against now(), the query is pretty efficient:
# explain analyze select count(1) as count from tickets where updated_at > now();
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=90.64..90.66 rows=1 width=0) (actual time=33.238..33.238 rows=1 loops=1)
-> Index Scan using tickets_updated_at_idx on tickets (cost=0.01..90.27 rows=74 width=0) (actual time=0.016..29.318 rows=40250 loops=1)
Index Cond: (updated_at > now())
Total runtime: 33.271 ms
It goes downhill and uses a Bitmap Heap Scan if I try to compare it against now() minus an interval.
# explain analyze select count(1) as count from tickets where updated_at > (now() - '24 hours'::interval);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=180450.15..180450.17 rows=1 width=0) (actual time=543.898..543.898 rows=1 loops=1)
-> Bitmap Heap Scan on tickets (cost=21296.43..175963.31 rows=897368 width=0) (actual time=251.700..457.916 rows=924373 loops=1)
Recheck Cond: (updated_at > (now() - '24:00:00'::interval))
-> Bitmap Index Scan on tickets_updated_at_idx (cost=0.00..20847.74 rows=897368 width=0) (actual time=238.799..238.799 rows=924699 loops=1)
Index Cond: (updated_at > (now() - '24:00:00'::interval))
Total runtime: 543.952 ms
Is there a more efficient way to query using date arithmetic?
The 1st query expects to find rows=74, but actually finds rows=40250.
The 2nd query expects to find rows=897368 and actually finds rows=924699.
Of course, processing 23 x as many rows takes considerably more time. So your actual times are not surprising.
Statistics for data with updated_at > now() are outdated. Run:
ANALYZE tickets;
and repeat your queries. And you seriously have data with updated_at > now()? That sounds wrong.
It's not surprising, however, that statistics are outdated for data most recently changed. That's in the logic of things. If your query depends on current statistics, you have to run ANALYZE before you run your query.
Also test with (in your session only):
SET enable_bitmapscan = off;
and repeat your second query to see times without bitmap index scan.
Why bitmap index scan for more rows?
A plain index scan fetches rows from the heap sequentially as found in the index. That's simple, dumb and without overhead. Fast for few rows, but may end up more expensive than a bitmap index scan with a growing number of rows.
A bitmap index scan collects rows from the index before looking up the table. If multiple rows reside on the same data page, that saves repeated visits and can make things considerably faster. The more rows, the greater the chance, a bitmap index scan will save time.
For even more rows (around 5% of the table, heavily depends on actual data), the planner switches to a sequential scan of the table and doesn't use the index at all.
The optimum would be an index only scan, introduced with Postgres 9.2. That's only possible if some preconditions are met. If all relevant columns are included in the index, the index type support it and the visibility map indicates that all rows on a data page are visible to all transactions, that page doesn't have to be fetched from the heap (the table) and the information in the index is enough.
The decision depends on your statistics (how many rows Postgres expects to find and their distribution) and on cost settings, most importantly random_page_cost, cpu_index_tuple_cost and effective_cache_size.