Order of columns in INNER JOIN condition affects the performance badly - sql

I have two tables which links to each other like this:
Table answered_questions with the following columns and indexes:
id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table called questions)
indexes: (taken_test_id, question_id)
Table taken_tests
id: primary key
user_id: (foreign key, links to table Users)
indexes: user_id column
First query (with EXPLAIN ANALYZE output):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.00
2..0.003 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.276 ms
Execution time: 2.365 ms
(7 rows)
Another query (this is generated automatically by Rails when using joins method in ActiveRecord)
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
And here is the output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=2.07
1..3.195 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.302 ms
Execution time: 1258.035 ms
(7 rows)
The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.
Do you have any idea why this happens? I read some articles and it says that the order of columns in JOIN condition should not affect the execution time (ex: Best practices for the order of joined columns in a sql join?)
I am using Postgres 9.6. There are more than 40 million rows in answered_questions table and more than 3 million rows in taken_tests table
Update 1:
When I ran the EXPLAIN with (analyze true, verbose true, buffers true), I got a much better result for the second query (quite similar to the first query)
EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE)
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, a
nswered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Buffers: shared hit=1986
-> Index Scan using index_taken_tests_on_user_id on public.taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
Output: taken_tests.id
Index Cond: (taken_tests.user_id = 1)
Buffers: shared hit=269
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual ti
me=0.002..0.003 rows=2 loops=371)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated
_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Index Cond: (answered_questions.taken_test_id = taken_tests.id)
Buffers: shared hit=1717
Planning time: 0.238 ms
Execution time: 2.335 ms

As you can see from the initial EXPLAIN ANALYZE statement results -- the queries are resulting in the equivalent query plan and are executed exactly the same.
The difference comes from the very same unit's execution time:
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483rows=371 loops=1)
and
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474rows=371 loops=1)
As the commenters already pointed out (see documentation links in the wuestion comments), the query plan for an inner join is expected to be the same regardless of the table order. It is ordered based on the query planner decisions. This means that you should really look at other performance-optimisation parts of the query execution. One of those would be memory used for caching (SHARED BUFFER). It looks like the query results would depend a lot on whether this data has already been loaded into memory. Just as you have noticed -- the query execution time grows after you have waited some time. This clearly indicates the cache expiry issue more than the plan problem.
Increasing the size of the shared buffers may help resolve it, but the initial execution of the query will always take longer -- this is just your disk access speed.
For more hints on memory configuration of Pg database see here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Note: VACUUM or ANALYZE commands will be unlikely to help here. Both queries are using the same plan already. Keep in mind, though, that due to PostgreSQL transaction isolation mechanism (MVCC) it may have to read the underlying table rows to validate that they are still visible to the current transaction after getting the results from the index. This could be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.

Related

Understanding SQL EXPLAIN on a JOIN query

I'm struggling to make sense of postgres EXPLAIN to figure out why my query is slow. Can someone help? This is my query, it's a pretty simple join:
SELECT DISTINCT graph_concepts.*
FROM graph_concepts
INNER JOIN graph_family_links
ON graph_concepts.id = graph_family_links.descendent_concept_id
WHERE graph_family_links.ancestor_concept_id = 1016
AND graph_family_links.generation = 1
AND graph_concepts.state != 2
It's starting from a concept and it's getting a bunch of related concepts through the links table.
Notably, I have an index on graph_family_links.descendent_concept_id, yet this query takes about 3 seconds to return a result. This is way too long for my purposes.
This is the SQL explain:
Unique (cost=46347.01..46846.16 rows=4485 width=108) (actual time=27.406..33.667 rows=13 loops=1)
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
-> Gather Merge (cost=46347.01..46825.98 rows=4485 width=108) (actual time=27.404..33.656 rows=13 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
-> Sort (cost=45347.01..45348.32 rows=2638 width=108) (actual time=23.618..23.621 rows=6 loops=2)
Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=13068 read=5
I/O Timings: read=0.074
Worker 0: Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=301.97..45317.02 rows=2638 width=108) (actual time=8.890..23.557 rows=6 loops=2)
Buffers: shared hit=13039 read=5
I/O Timings: read=0.074
-> Parallel Bitmap Heap Scan on graph_family_links (cost=301.88..39380.60 rows=2640 width=4) (actual time=8.766..23.293 rows=6 loops=2)
Recheck Cond: (ancestor_concept_id = 1016)
Filter: (generation = 1)
Rows Removed by Filter: 18850
Heap Blocks: exact=2558
Buffers: shared hit=12985
-> Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id (cost=0.00..301.66 rows=38382 width=0) (actual time=4.744..4.744 rows=47346 loops=1)
Index Cond: (ancestor_concept_id = 1016)
Buffers: shared hit=67
-> Index Scan using graph_concepts_pkey on graph_concepts (cost=0.08..2.25 rows=1 width=108) (actual time=0.036..0.036 rows=1 loops=13)
Index Cond: (id = graph_family_links.descendent_concept_id)
Filter: (state <> 2)
Buffers: shared hit=54 read=5
I/O Timings: read=0.074
Planning:
Buffers: shared hit=19
Planning Time: 0.306 ms
Execution Time: 33.747 ms
(35 rows)
I'm doing lots of googling to help me figure out how to read this EXPLAIN and I'm struggling. Can someone help translate this into plain english for me?
Answering myself (for the benefit of future people):
My question was primarily how to understand EXPLAIN. Many people below contributed to my understanding but no one really gave me the beginner unpacking I was looking for. I want to teach myself to fish rather than simply having other people read this and give me advice on solving this specific issue, although I do greatly appreciate the specific suggestions!
For others trying to understand EXPLAIN, this is the important context you need to know, which was holding me back:
"Cost" is some arbitrary unit of how expense each step of the process is, you can think of it almost like a stopwatch.
Look near the end of your EXPLAIN until you find: cost=0.00.. This is the very start of your query execution. In my case, cost=0.00..301.66 is the first step and cost=0.08..2.25 runs in parallel (from step 0.08 to step 2.25, just a small fraction of the 0 to 300).
Find the step with the biggest "span" of cost. In my case, cost=301.88..39380.60. Although I was confused because I also have a cost=301.97..45317.02. I think those are, again, both happening in parallel so I'm not sure which one is contributing more.
SELECT DISTINCT
graph_concepts.*
FROM
graph_concepts
INNER JOIN graph_family_links ON graph_concepts.id = graph_family_links.descendent_concept_id
WHERE
graph_family_links.ancestor_concept_id = 1016
AND graph_family_links.generation = 1
AND graph_concepts.state != 2
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46347.01..46846.16 rows=4485 width=108)
## (Merge records DISTINCT)
-> Gather Merge (cost=46347.01..46825.98 rows=4485 width=108)
Workers Planned: 1
## (Sort table graph_concepts.* )
-> Sort (cost=45347.01..45348.32 rows=2638 width=108)
Sort Key: graph_concepts.id, graph_concepts.definition, graph_concepts.checkvist_task_id, graph_concepts.primary_question_id, graph_concepts.created_at, graph_concepts.updated_at, graph_concepts.tsn, graph_concepts.state, graph_concepts.search_phrases
-> Nested Loop (cost=301.97..45317.02 rows=2638 width=108)
## WHERE graph_family_links.ancestor_concept_id = 1016 (Use Parallel Bitmap Heap Scan table and filter record)
-> Parallel Bitmap Heap Scan on graph_family_links (cost=301.88..39380.60 rows=2640 width=4)
Recheck Cond: (ancestor_concept_id = 1016)
Filter: (generation = 1)
## AND graph_family_links.generation = 1 (Use Bitmap Index Scan table and filter record)
-> Bitmap Index Scan on index_graph_family_links_on_ancestor_concept_id (cost=0.00..301.66 rows=38382 width=0)
Index Cond: (ancestor_concept_id = 1016)
## AND graph_concepts.state != 2 (Use Index Scan table and filter record)
-> Index Scan using graph_concepts_pkey on graph_concepts (cost=0.08..2.25 rows=1 width=108)
Index Cond: (id = graph_family_links.descendent_concept_id)
Filter: (state <> 2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Please refer to the below sql script.
SELECT DISTINCT graph_concepts.*
FROM graph_concepts
INNER JOIN (select descendent_concept_id from graph_family_links where ancestor_concept_id = 1016 and generation = 1) A ON graph_concepts.id = A.descendent_concept_id
WHERE graph_concepts.state != 2
Explain read from bottom to up, generally.
https://www.postgresql.org/docs/current/sql-explain.html
This command displays the execution plan that the PostgreSQL planner
generates for the supplied statement. The execution plan shows how the
table(s) referenced by the statement will be scanned — by plain
sequential scan, index scan, etc. — and if multiple tables are
referenced, what join algorithms will be used to bring together the
required rows from each input table.
Since you only do explain your select command meaning the output is the system planner generate a execute plan for this select query. But if you do explain analyze then it will plan and execute the query.
First there is two table there. For each table use some ways to found out which rows meet the where criteria. index scan (one of the way to find out where is the row) found out in table graph_concepts which row meet the condition: graph_concepts.state != 2.
Also in the mean time(Parallel) use Bitmap Heap Scan to found out in table
graph_family_links which row meet the criteria: graph_family_links.ancestor_concept_id = 1016
After that then do join operation. In this case, it's Nested Loop.
After join then do Sort. Why we need to sort operation? Because you specified: SELECT DISTINCT : https://www.postgresql.org/docs/current/sql-select.html
-After sort then since you specified key word DISTINCT then eliminate the duplicates.
People have given you general links and advice on understanding plans. To focus on one relevant part of the plan, it expects to find 38382 rows satisfying ancestor_concept_id = 1016, but then well over 90% of them are expected to fail the generation = 1 filter. But that is expensive as it was to jump to some random table page to fetch the "generation" value to apply the filter.
If you had a combined index on (ancestor_concept_id, generation) it could apply both restrictions efficiently simultaneously. Alternatively, of you had separate single column indexes on those columns, it could combine then with a BitmapAnd operation. That would be more efficient than what you are currently doing to but less efficient than the combined index.

Improve Postgre SQL query performance

I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
Which outputs 500 rows having a single result column and takes around 1 min and 43 secs to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (in the list of 500 ids) calculate the sum of the integer column pb.nr_size associated with them, returning the lesser value between this amount and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that they'll match at least one row in the subquery, so will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is of integer type also. Table tb_pr_dc has many rows for each id_pr. Actually, we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same amount of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size) and then finding the same amount looking in tb_pr_dc_bn. Imo, the indexes look fine and the amount of rows to evaluate is minimal, so I can't understand why it's taking so much time to run this query. A lot of other queries reading a lot more of data on other tables and doing more calculations are running fine. The DBA just ran a reindex and vacuum analyze, but still it's running the same slow way. We are running PostgreSQL 11 on Linux. I'm running this query in a replica without concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.
The time is spent jumping all over the table to find 109468 randomly scattered rows, issuing random IO requests to do so. You can verify that be turning track_io_timing on and redoing the plans (probably just leave it turned on globally and by default, the overhead is low and the value it produces is high), but I'm sure enough that I don't need to see that output before reaching this conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or is organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
You ask about why so many columns are output in the internal nodes of the plan. The reason for that is that PostgreSQL often just passes around pointers to where the tuple lives in the shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If it was a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do it without a reason.
One way to sped this up is to create indexes which will enable index-only scans. Those would be on tb_pr_dc (id_pr, id_pr_dc_bn) and on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.

Simple aggregate query running slow

I am trying to determine why a fairly simple aggregate query is taking so long to perform on a single table. The table is called plots, and it is [id, device_id, time, ...] There are two indices, UNIQUE(id) and UNIQUE(device_id, time).
The query is simply:
SELECT device_id, MIN(time)
FROM plots
GROUP BY device_id
To me, this should be very fast, but it is taking 3+ minutes. The table has ~45 million rows, divided roughly equally among 1200 or so device_id's.
EXPLAIN for the query:
Finalize GroupAggregate (cost=1502955.41..1503055.97 rows=906 width=12)
Group Key: device_id
-> Gather Merge (cost=1502955.41..1503052.35 rows=906 width=12)
Workers Planned: 1
-> Sort (cost=1501955.41..1501955.86 rows=906 width=12)
Sort Key: device_id
-> Partial HashAggregate (cost=1501943.79..1501946.51 rows=906 width=12)
Group Key: device_id
-> Parallel Seq Scan on plots (cost=0.00..1476417.34 rows=25526447 width=12)
EXPLAIN for query with a where device_id = xxx:
GroupAggregate (cost=398.86..78038.77 rows=906 width=12)
Group Key: device_id
-> Bitmap Heap Scan on plots (cost=398.86..77992.99 rows=43065 width=12)
Recheck Cond: (device_id = 6780)
-> Bitmap Index Scan on index_plots_on_device_id_and_time (cost=0.00..396.71 rows=43065 width=0)
Index Cond: (device_id = 6780)
I have done VACUUM (FULL, ANALYZE) as well as REINDEX DATABASE.
I have also tried doing partition queries to accomplish the same.
Any pointers on making this faster? Or am I just boned on the table size. It seems like it should be fine with the index though. Maybe I am missing something...
EDIT / UPDATE:
The problem seems to be resolved at this point, though I am not sure why. I have dropped and rebuilt the index many times, and suddenly the query is only taking ~7 seconds, which is acceptable. Of note, this morning I dropped the index and created a new one with the reverse column order (time, device_id) and I was surprised to see good results. I then reverted to the previous index, and the results were improved further. I will refork the production database and try to retrace my steps and post an update. Should I be worried about the query planner wonking out in the future?
Current EXPLAIN with analysis (as requested):
Finalize GroupAggregate (cost=1000.12..480787.58 rows=905 width=12) (actual time=36.299..7530.403 rows=916 loops=1)
Group Key: device_id
Buffers: shared hit=135087 read=40325
I/O Timings: read=138.419
-> Gather Merge (cost=1000.12..480783.96 rows=905 width=12) (actual time=36.226..7552.052 rows=1829 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Partial GroupAggregate (cost=0.11..479687.58 rows=905 width=12) (actual time=15.779..5026.094 rows=914 loops=2)
Group Key: device_id
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Parallel Index Only Scan using index_plots_time_and_device_id on plots (cost=0.11..454158.41 rows=25526447 width=12) (actual time=0.033..2999.764 rows=21697480 loops=2)
Heap Fetches: 0
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
Planning Time: 0.092 ms
Execution Time: 7554.100 ms
(19 rows)
Approach 1:
You can try to remove your UNIQUE to an index on your database. CREATE UNIQUE INDEX and CREATE INDEX have different behaviors. I believe that you can get benefits from CREATE INDEX.
Approach 2:
You can create a materialized view. If you can get some delay on your information, you can do the following:
CREATE MATERIALIZED VIEW myreport AS
SELECT device_id,
MIN(time) AS mintime
FROM plots
GROUP BY device_id
CREATE INDEX myreport_device_id ON myreport(device_id);
Also, you need to remember to regularly do:
REFRESH MATERIALIZED VIEW CONCURRENTLY myreport;
And less regularly do:
VACUUM ANALYZE myreport

Why is PostgreSQL 11 optimizer refusing best plan which would use an index w/ included column?

PostgreSQL 11 isn't smart enough to use indexes with included columns?
CREATE INDEX organization_locations__org_id_is_headquarters__inc_location_id_ix
ON organization_locations(org_id, is_headquarters) INCLUDE (location_id);
ANALYZE organization_locations;
ANALYZE organizations;
EXPLAIN VERBOSE
SELECT location_id
FROM organization_locations ol
WHERE org_id = (SELECT id FROM organizations WHERE code = 'akron')
AND is_headquarters = 1;
QUERY PLAN
Seq Scan on organization_locations ol (cost=8.44..14.61 rows=1 width=4)
Output: ol.location_id
Filter: ((ol.org_id = $0) AND (ol.is_headquarters = 1))
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4)
Output: organizations.id
Index Cond: ((organizations.code)::text = 'akron'::text)
There are only 211 rows currently in organization_locations, average row length 91 bytes.
I get only loading one data page. But the I/O is the same to grab the index page and the target data is right there (no extra lookup into the data page from the index). What is PG thinking with this plan?
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
EDIT: Here is the explain with buffers:
Seq Scan on organization_locations ol (cost=8.44..14.33 rows=1 width=4) (actual time=0.018..0.032 rows=1 loops=1)
Filter: ((org_id = $0) AND (is_headquarters = 1))
Rows Removed by Filter: 210
Buffers: shared hit=7
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: ((code)::text = 'akron'::text)
Buffers: shared hit=4
Planning Time: 0.402 ms
Execution Time: 0.048 ms
Reading one index page is not cheaper than reading a table page, so with tiny tables you cannot expect a gain from an index-only scan.
Besides, did you
VACUUM organization_locations;
Without that, the visibility map won't show that the table block is all-visible, so you cannot get an index-only scan no matter what.
In addition to the other answers, this is probably a silly index to have in the first place. INCLUDE is good when you need a unique index but you also want to tack on a column which is not part of the unique constraint, or when the included column doesn't have btree operators and so can't be in the main body of the index. In other cases, you should just put the extra column in the index itself.
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
This is your workflow problem that you can't expect PostgreSQL to solve for you. Do you really think PostgreSQL should create actual plans based on imaginary scenarios?

how to understand postgres EXPLAIN output

EXPLAIN SELECT a.name, m.name FROM Casting c JOIN Movie m ON c.m_id = m.m_id JOIN Actor a ON a.a_id = c.a_id AND c.a_id < 50;
Output
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=26.20..18354.49 rows=1090 width=27) (actual time=0.240..5.603 rows=1011 loops=1)
-> Nested Loop (cost=25.78..12465.01 rows=1090 width=15) (actual time=0.236..4.046 rows=1011 loops=1)
-> Bitmap Heap Scan on casting c (cost=25.35..3660.19 rows=1151 width=8) (actual time=0.229..1.059 rows=1011 loops=1)
Recheck Cond: (a_id < 50)
Heap Blocks: exact=989
-> Bitmap Index Scan on casting_a_id_index (cost=0.00..25.06 rows=1151 width=0) (actual time=0.114..0.114 rows=1011 loops=1)
Index Cond: (a_id < 50)
-> Index Scan using movie_pkey on movie m (cost=0.42..7.64 rows=1 width=15) (actual time=0.003..0.003 rows=1 loops=1011)
Index Cond: (m_id = c.m_id)
-> Index Scan using actor_pkey on actor a (cost=0.42..5.39 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=1011)
Index Cond: (a_id = c.a_id)
Planning time: 0.334 ms
Execution time: 5.672 ms
(13 rows)
I am trying to understand how query planner works? I am able to understand the process it choose, but I am not getting why ?
Can someone explain query optimizer choices (choice of query processing algorithms, join order) in these queries based on parameters like query selectivity and cost models or anything that effects choice?
Also why there is use of Recheck Cond, after index scan ?
There are two reasons why there has to be a Bitmap Heap Scan:
PostgreSQL has to check whether the rows found are visible for the current transaction or not. Remember that PostgreSQL keeps old row versions in the table until VACUUM removes them. This visibility information is not stored in the index.
If work_mem is not large enough to contain a bitmap with one bit per table row, PostgreSQL uses one bit per table page, which loses some information. The PostgreSQL needs to check the lossy blocks to see which of the rows in the block really satisfy the condition.
You can see this when you use EXPLAIN (ANALYZE, BUFFERS), then PostgreSQL will show if there were lossy matches, see this example on rextester:
-> Bitmap Heap Scan on t (cost=177.14..4719.43 rows=9383 width=0)
(actual time=2.130..144.729 rows=10001 loops=1)
Recheck Cond: (val = 10)
Rows Removed by Index Recheck: 738586
Heap Blocks: exact=646 lossy=3305
Buffers: shared hit=1891 read=2090
-> Bitmap Index Scan on t_val_idx (cost=0.00..174.80 rows=9383 width=0)
(actual time=1.978..1.978 rows=10001 loops=1)
Index Cond: (val = 10)
Buffers: shared read=30
I cannot explain the whole of the PostgreSQL optimizer in this answer, but what it does is to try all possible ways to compute the result, estimate how much each one will cost and choose the cheapest plan.
To estimate how big the result set will be, it uses the object definitions and the table statistics, which contain detailed data about how the column values are distributed.
It then calculates how many disk blocks it will have to read sequentially and by random access (I/O cost), and how many tables and index rows and function calls it will have to process (CPU cost) to come up with a grand total. The weights for each of these components in the total can be configured.
Usually the best plan is one that reduces the number of result rows as quickly as possible by applying the most selective condition first. In your case this seems to be casting.a_id < 50.
Nested loop joins are often preferred if the number of rows in the outer (upper in EXPLAIN output) table is small.