I have a query that is taking too long to run.
I'm using PostgreSQL 10.3.
In my tables involved in this query, I have about 3.5 million records in each.
The query is:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one AS thf
INNER JOIN tb_two AS ths
ON ths.tb_hit_hitid = thf.tb_hit_hitid
WHERE ths.source IN ('source1', 'source2')
On these tables, I have the following indexes:
CREATE INDEX tb_two_idx_1 on tb_two (Source ASC, attr5 ASC);
CREATE INDEX tb_one_idx_1 on tb_one USING btree (attr1 ASC,attr2 ASC,attr3 ASC,attr4 ASC);
CREATE INDEX tb_one_idx_2 on tb_one (tb_hit_HitId ASC);
CREATE INDEX tb_two_idx_2 on tb_two (tb_hit_HitId ASC);
This is the query plan from EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=3.85..287880.35 rows=1771004 width=44) (actual time=0.091..3894.024 rows=1726970 loops=1)
Merge Cond: (thf.tb_hit_hitid = ths.tb_hit_hitid)
Buffers: shared hit=354821
-> Index Scan using tb_one_idx_2 on tb_one thf (cost=0.43..124322.43 rows=3230800 width=52) (actual time=0.014..655.036 rows=1726946 loops=1)
Buffers: shared hit=27201
-> Index Scan using tb_two_idx_2 on tb_two ths (cost=0.43..139531.97 rows=1771004 width=8) (actual time=0.069..1604.789 rows=1726973 loops=1)
Filter: ((source)::text = ANY ('{source1,source2}'::text[]))
Rows Removed by Filter: 1651946
Buffers: shared hit=327620
Planning time: 2.737 ms
Execution time: 4117.573 ms
(11 rows)
For this query:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one thf INNER JOIN
tb_two ths
ON ths.tb_hit_hitid = thf.tb_hit_hitid
WHERE ths.source IN ('source1', 'source2');
You want indexes on tb_two(source, tb_hit_hitid) and tb_one(tb_hit_hitid). Those are probably the best indexes for this query.
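Spelled out as DDL, that would look something like this (the index names here are just illustrative):
CREATE INDEX tb_two_source_hitid_idx ON tb_two (source, tb_hit_hitid);
CREATE INDEX tb_one_hitid_idx ON tb_one (tb_hit_hitid);
The first index lets the planner locate the tb_two rows for the two sources directly, instead of scanning all of tb_two and filtering out the non-matching sources, as the plan above does.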
In case the query returns duplicates (due to the join), I might suggest writing this as:
SELECT thf.attr1, thf.attr2, thf.attr3, thf.attr4
FROM tb_one thf
WHERE EXISTS (SELECT 1
FROM tb_two ths
WHERE ths.tb_hit_hitid = thf.tb_hit_hitid AND
ths.source IN ('source1', 'source2')
);
For this version, you want the index to be tb_two(tb_hit_hitid, source).
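Again as an illustrative sketch:
CREATE INDEX tb_two_hitid_source_idx ON tb_two (tb_hit_hitid, source);
With tb_hit_hitid first, each EXISTS probe can jump straight to the rows for one hit id and check source within them.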
What is the right way to set up an index for the following query?
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
LEFT JOIN xml_files t2
ON t1.purchaseNumber = t2.purchaseNumber
AND t1.docPublishDate < t2.docPublishDate
WHERE t1.parsing_status IS NULL
AND t2.parsing_status IS NULL
AND t2.docPublishDate IS NULL
AND t1.section_name='contracts' AND t1.parsing_status IS NULL AND t1.random IN (1,2,3,4)
Should I create a composite index, or is it better to create a separate index for each table used in the query?
Also, since I am comparing the timestamp column docPublishDate, how should I create the index? Should I use the DESC keyword?
purchaseNumber - varchar(50)
parsing_status - varchar(10)
random - integer
section_name - varchar(10)
EXPLAIN (ANALYZE, BUFFERS) output:
Gather (cost=1000.86..137158.61 rows=43091 width=35) (actual time=22366.063..72674.678 rows=46518 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=99244069 read=144071
-> Nested Loop Anti Join (cost=0.86..131849.51 rows=17955 width=35) (actual time=22309.989..72440.514 rows=15506 loops=3)
Buffers: shared hit=99244069 read=144071
-> Parallel Index Scan using index_for_xml_files_parsing_status on xml_files t1 (cost=0.43..42606.31 rows=26932 width=35) (actual time=0.086..193.982 rows=40725 loops=3)
Index Cond: ((parsing_status IS NULL) AND (parsing_status IS NULL))
Filter: (((section_name)::text = 'contracts'::text) AND (random = ANY ('{1,2,3,4}'::integer[])))
Rows Removed by Filter: 383974
Buffers: shared hit=15724 read=42304
-> Index Scan using "index_for_xml_files_purchaseNumber" on xml_files t2 (cost=0.43..4.72 rows=3 width=27) (actual time=1.773..1.773 rows=1 loops=122174)
Index Cond: (("purchaseNumber")::text = (t1."purchaseNumber")::text)
Filter: (t1."docPublishDate" < "docPublishDate")
Rows Removed by Filter: 6499
Buffers: shared hit=99228345 read=101767
Planning Time: 0.396 ms
Execution Time: 72681.868 ms
How can I improve the speed of the query?
You should explain what you want the query to do. I would write the query more clearly as:
SELECT t1.purchaseNumber, t1.parsing_status, t1.docPublishDate
FROM xml_files t1
WHERE t1.section_name = 'contracts' AND
t1.parsing_status IS NULL AND
t1.random IN (1, 2, 3, 4) AND
NOT EXISTS (SELECT 1
FROM xml_files t2
WHERE t1.purchaseNumber = t2.purchaseNumber AND
t1.docPublishDate < t2.docPublishDate
);
For this query, I would suggest the following indexes:
create index idx_xml_files_3 on xml_files(section_name, random)
where parsing_status is null;
create index idx_xml_files_2 on xml_files("purchaseNumber", "docPublishDate");
There is probably an even better way to write the query, using window functions for instance. However, it is not clear what your data looks like nor what the query is intended to do.
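Purely for illustration, here is one window-function sketch. It assumes the intent of the NOT EXISTS is "keep the row with the latest docPublishDate per purchaseNumber", and it glosses over ties and NULL ordering:
SELECT "purchaseNumber", parsing_status, "docPublishDate"
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY "purchaseNumber" ORDER BY "docPublishDate" DESC) as seqnum
      FROM xml_files t
     ) t
WHERE seqnum = 1 AND
      section_name = 'contracts' AND
      parsing_status IS NULL AND
      random IN (1, 2, 3, 4);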
The index scan on the inner side of the nested loop join is inefficient: on average, 6499 of the 6500 rows found are discarded.
Create a better index:
CREATE INDEX ON xml_files ("purchaseNumber", "docPublishDate");
I am trying to work out query optimisation on id, and I'm not sure which way I should use. Below are the query plans using EXPLAIN; cost-wise they look similar.
1. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid,...]);
QUERY PLAN:
Index Scan using table1_pkey on table1 (cost=0.42..641.44 rows=76 width=835) (actual time=0.258..2.603 rows=76 loops=1)
Index Cond: (id = ANY ('{00e289b0-1ac8-451f-957f-e00bc289148e,...}'::uuid[]))
Buffers: shared hit=231 read=73
Planning Time: 0.487 ms
Execution Time: 2.715 ms
2. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (VALUES ('00e289b0-1ac8-451f-957f-e00bc289148e'::uuid),...);
QUERY PLAN:
Nested Loop (cost=1.56..644.10 rows=76 width=835) (actual time=0.058..0.297 rows=76 loops=1)
Buffers: shared hit=304
-> HashAggregate (cost=1.14..1.90 rows=76 width=16) (actual time=0.049..0.060 rows=76 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.95 rows=76 width=16) (actual time=0.006..0.022 rows=76 loops=1)
-> Index Scan using table1_pkey on table1 (cost=0.42..8.44 rows=1 width=835) (actual time=0.002..0.003 rows=1 loops=76)
Index Cond: (id = "*VALUES*".column1)
Buffers: shared hit=304
Planning Time: 0.437 ms
Execution Time: 0.389 ms
It looks like the VALUES () variant does some hashing and joining to improve performance, but I'm not sure.
NOTE: In my practical use case, id is uuid_generate_v4(), e.g. d31cddc0-1771-4de8-ad41-e6c568b39a5d, but the column may not be indexed as such.
Also, I have a table with 5-10 million records.
Which way gives the better query performance?
Both options seem reasonable. I would, however, suggest avoiding a cast on the column you filter on. Instead, cast the literal values to uuid:
SELECT *
FROM table1
WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid, ...]);
This should allow the database to take advantage of an index on column id.
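For contrast, the pattern to avoid casts the column itself; with id of type uuid, something like this would keep the primary-key index from being used (a sketch of the anti-pattern, keeping the question's elided list):
SELECT *
FROM table1
WHERE id::text = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e', ...]);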
I have two tables which link to each other like this:
Table answered_questions with the following columns and indexes:
id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table called questions)
indexes: (taken_test_id, question_id)
Table taken_tests
id: primary key
user_id: (foreign key, links to table Users)
indexes: user_id column
First query (with EXPLAIN ANALYZE output):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.276 ms
Execution time: 2.365 ms
(7 rows)
Another query (this one is generated automatically by Rails when using the joins method in ActiveRecord):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
And here is the output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=2.071..3.195 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.302 ms
Execution time: 1258.035 ms
(7 rows)
The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.
Do you have any idea why this happens? I read some articles and it says that the order of columns in JOIN condition should not affect the execution time (ex: Best practices for the order of joined columns in a sql join?)
I am using Postgres 9.6. There are more than 40 million rows in the answered_questions table and more than 3 million rows in the taken_tests table.
Update 1:
When I ran the EXPLAIN with (analyze true, verbose true, buffers true), I got a much better result for the second query (quite similar to the first query)
EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE)
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Buffers: shared hit=1986
-> Index Scan using index_taken_tests_on_user_id on public.taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
Output: taken_tests.id
Index Cond: (taken_tests.user_id = 1)
Buffers: shared hit=269
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Index Cond: (answered_questions.taken_test_id = taken_tests.id)
Buffers: shared hit=1717
Planning time: 0.238 ms
Execution time: 2.335 ms
As you can see from the initial EXPLAIN ANALYZE results, the queries produce equivalent query plans and are executed in exactly the same way.
The difference comes from the execution time of the very same plan node:
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
and
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
As the commenters already pointed out (see the documentation links in the question comments), the query plan for an inner join is expected to be the same regardless of the table order in the join condition; it is chosen by the query planner. This means that you should really look at other parts of query execution for the performance difference. One of those is the memory used for caching (shared buffers). It looks like the query's runtime depends a lot on whether the data has already been loaded into memory. Just as you noticed, the execution time grows again after you have waited for some time. This clearly indicates a cache-expiry effect rather than a plan problem.
Increasing the size of the shared buffers may help resolve it, but the initial execution of the query will always take longer -- this is just your disk access speed.
For more hints on the memory configuration of a PostgreSQL database, see: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
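For example, you could inspect and raise the setting like this (the 2GB figure is purely illustrative, and changing shared_buffers requires a server restart to take effect):
SHOW shared_buffers;
ALTER SYSTEM SET shared_buffers = '2GB';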
Note: VACUUM or ANALYZE are unlikely to help here, since both queries are already using the same plan. Keep in mind, though, that due to PostgreSQL's transaction isolation mechanism (MVCC), it may have to read the underlying table rows to validate that they are still visible to the current transaction after getting the results from the index. This can be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.
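A plain VACUUM updates the visibility map as a side effect; for example, on the table from the question:
VACUUM answered_questions;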
I was hoping to get some expert eyes over my query and help me see why I'm getting inconsistent performance.
The problem I'm trying to solve: I need orders, which can have one to many items, and these orders need to be paginated.
To do this I've taken the following approach. I'm using a subquery to filter orders by the required item attributes, and I'm then rejoining to the items to get their required fields. This means that when paginating I will not incorrectly filter out order rows when there are orders with 2 or more items.
I'm seeing intermittently slow queries. The second time they are run they're much quicker. I presume this is because Postgres is loading indexes and such into memory?
I don't fully understand what is happening from the EXPLAIN output. It looks like it needs to scan every order to see if it has an item that fits the subquery? I'm a bit confused by the following line: it says it needs to scan 286853 rows, but also only 165?
Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=64.598..2114.676 rows=165 loops=1)
Is there a way to get Postgres to filter by the items first or am I reading this incorrectly and it is doing that already?
Query:
SELECT
"orders"."id_orders" as "orders.id_orders",
"items"."id_items" as "items"."id_items",
...,
orders.created_at, orders.updated_at
FROM (
SELECT
orders.id_orders,
orders.created_at,
orders.updated_at
FROM orders
WHERE orders.status in ('completed','pending') AND
(
SELECT fk_vendor_id FROM items
WHERE (
items.fk_order_id = orders.id_orders AND
items.fk_vendor_id = '0012800001YVccUAAT' AND
items.fk_offer = '0060I00000RAKFYQA5' AND
items.status IN ('completed','cancelled')
) LIMIT 1
) IS NOT NULL ORDER BY orders.created_at DESC LIMIT 50 OFFSET 150
) as orders INNER JOIN items ON items.fk_order_id = orders.id_orders;
1st explain:
Nested Loop (cost=1417.11..2311.77 rows=67 width=1705) (actual time=2785.221..17025.325 rows=17 loops=1)
-> Limit (cost=1416.68..1888.77 rows=50 width=301) (actual time=2785.216..17024.918 rows=15 loops=1)
-> Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=1214.013..17024.897 rows=165 loops=1)
Filter: ((status = ANY ('{completed,pending}'::orders_status_enum[])) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 313631
SubPlan 1
-> Limit (cost=0.42..8.45 rows=1 width=19) (actual time=0.047..0.047 rows=0 loops=287719)
-> Index Scan using items_fk_order_id_index on items items_1 (cost=0.42..8.45 rows=1 width=19) (actual time=0.047..0.047 rows=0 loops=287719)
Index Cond: (fk_order_id = orders.id_orders)
Filter: ((status = ANY ('{completed,cancelled}'::items_status_enum[])) AND (fk_vendor_id = '0012800001YVccUAAT'::text) AND (fk_offer = '0060I00000RAKFYQA5'::text))
Rows Removed by Filter: 1
-> Index Scan using items_fk_order_id_index on items (cost=0.42..8.44 rows=1 width=1404) (actual time=0.002..0.026 rows=1 loops=15)
Index Cond: (fk_order_id = orders.id_orders)
Planning time: 1.791 ms
Execution time: 17025.624 ms
(15 rows)
2nd explain:
Nested Loop (cost=1417.11..2311.77 rows=67 width=1705) (actual time=115.659..2114.739 rows=17 loops=1)
-> Limit (cost=1416.68..1888.77 rows=50 width=301) (actual time=115.654..2114.691 rows=15 loops=1)
-> Index Scan Backward using orders_created_at_idx on orders (cost=0.42..2708393.65 rows=286853 width=301) (actual time=64.598..2114.676 rows=165 loops=1)
Filter: ((status = ANY ('{completed,pending}'::orders_status_enum[])) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 313631
SubPlan 1
-> Limit (cost=0.42..8.45 rows=1 width=19) (actual time=0.006..0.006 rows=0 loops=287719)
-> Index Scan using items_fk_order_id_index on items items_1 (cost=0.42..8.45 rows=1 width=19) (actual time=0.006..0.006 rows=0 loops=287719)
Index Cond: (fk_order_id = orders.id_orders)
Filter: ((status = ANY ('{completed,cancelled}'::items_status_enum[])) AND (fk_vendor_id = '0012800001YVccUAAT'::text) AND (fk_offer = '0060I00000RAKFYQA5'::text))
Rows Removed by Filter: 1
-> Index Scan using items_fk_order_id_index on items (cost=0.42..8.44 rows=1 width=1404) (actual time=0.002..0.002 rows=1 loops=15)
Index Cond: (fk_order_id = orders.id_orders)
Planning time: 2.011 ms
Execution time: 2115.052 ms
(15 rows)
Order indexes:
"cart_pkey" PRIMARY KEY, btree (id_orders)
"orders_legacy_id_uindex" UNIQUE, btree (legacy_id_orders)
"orders_transaction_key_uindex" UNIQUE, btree (transaction_key)
"orders_created_at_idx" btree (created_at)
"orders_customer_email_idx" gin (customer_email gin_trgm_ops)
"orders_customer_full_name_idx" gin (customer_full_name gin_trgm_ops)
Referenced by:
TABLE "items" CONSTRAINT "items_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders) ON DELETE RESTRICT
TABLE "items_log" CONSTRAINT "items_log_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders)
Items indexes:
"items_pkey" PRIMARY KEY, btree (id_items)
"items_fk_vendor_id_booking_number_unique" UNIQUE, btree (fk_vendor_id, booking_number) WHERE legacy_id_items IS NULL
"items_legacy_id_uindex" UNIQUE, btree (legacy_id_items)
"items_transaction_key_uindex" UNIQUE, btree (transaction_key)
"items_booking_number_index" btree (booking_number)
"items_fk_order_id_index" btree (fk_order_id)
"items_fk_vendor_id_index" btree (fk_vendor_id)
"items_status_index" btree (status)
Foreign-key constraints:
"items_fk_order_id_fkey" FOREIGN KEY (fk_order_id) REFERENCES orders(id_orders) ON DELETE RESTRICT
The difference in execution times is probably really an effect of caching. You could use EXPLAIN (ANALYZE, BUFFERS) to see how many pages are found in the database cache: "shared hit" counts pages found in the cache, while "read" counts pages that had to be fetched from disk (or the OS cache).
To make your query more readable, you should rewrite
WHERE (
SELECT fk_vendor_id FROM items
WHERE (
items.fk_order_id = orders.id_orders AND
items.fk_vendor_id = '0012800001YVccUAAT' AND
items.fk_offer = '0060I00000RAKFYQA5' AND
items.status IN ('completed','cancelled')
) LIMIT 1
) IS NOT NULL
to
WHERE EXISTS
(SELECT 1 FROM items
WHERE items.fk_order_id = orders.id_orders
AND items.fk_vendor_id = '0012800001YVccUAAT'
AND items.fk_offer = '0060I00000RAKFYQA5'
AND items.status IN ('completed','cancelled')
)
The best thing you could do to speed up the query is to create an index:
CREATE INDEX ON items(fk_order_id, fk_vendor_id, fk_offer);
I have a simple query (Postgres 9.4):
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
bo_labels L
LEFT JOIN bo_party party ON (party.id = L.bo_party_fkey)
LEFT JOIN bo_document_base D ON (D.id = L.bo_doc_base_fkey)
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = D.id)
WHERE
party.inn = '?'
The EXPLAIN output looks like this:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2385.30..2385.30 rows=1 width=0) (actual time=31762.367..31762.367 rows=1 loops=1)
-> Nested Loop Left Join (cost=1.28..2385.30 rows=1 width=0) (actual time=7.621..31760.776 rows=1694 loops=1)
Join Filter: ((c.bo_document_fkey)::text = (d.id)::text)
Rows Removed by Join Filter: 101658634
-> Nested Loop Left Join (cost=1.28..106.33 rows=1 width=10) (actual time=0.110..54.635 rows=1694 loops=1)
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
Heap Fetches: 0
-> Index Only Scan using bo_document_pkey on bo_document_base d (cost=0.43..0.64 rows=1 width=10) (actual time=0.022..0.025 rows=1 loops=1694)
Index Cond: (id = (l.bo_doc_base_fkey)::text)
Heap Fetches: 1134
-> Seq Scan on bo_contract_hardwood_deal c (cost=0.00..2069.77 rows=59770 width=9) (actual time=0.003..11.829 rows=60012 loops=1694)
Planning time: 13.484 ms
Execution time: 31762.885 ms
http://explain.depesz.com/s/V2wn
What is very annoying is the incorrect row estimate:
Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
Because of that, Postgres chooses nested loops and the query runs for about 30 seconds.
With SET LOCAL enable_nestloop = OFF; it completes in just a second.
What is also interesting: I have default_statistics_target = 10000 (the maximum value) and ran VACUUM VERBOSE ANALYZE on all four tables just beforehand.
Since Postgres does not gather statistics across tables, a case like this can easily happen for other joins too.
Without the external extension pg_hint_plan, it is not possible to change enable_nestloop for just that one query.
Is there some other way I could force a faster plan for this query?
Update from the comments
I can't eliminate the join in the usual way. What I am mainly looking for: is there any possibility to change the statistics (for example) so that they include the desired values, even if that breaks their normal shape? Or maybe some other way to make Postgres weight nested loops so that it chooses them less frequently?
Could someone also explain, or point me to documentation on, how the Postgres planner, given a nested loop over two inputs with 3 rows (exactly correct) and 1289 rows (really 565, but that estimation error is a separate question), arrives at the assumption that the result will be only 1 row? I am speaking about this part of the plan:
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
At first glance it looks wrong. What statistics are used there, and how? Does Postgres also maintain statistics for indexes?
Actually, I don't have good sample data to test my answer, but I think it might help.
Based on your join columns, I'm assuming the following relationship cardinalities:
1) bo_party (id 1:N bo_party_fkey) bo_labels
2) bo_labels (bo_doc_base_fkey N:1 id) bo_document_base
3) bo_document_base (id 1:N bo_document_fkey) bo_contract_hardwood_deal
You want to count how many rows are selected. Based on the cardinalities in 1) and 2), the table "bo_labels" implements a many-to-many relationship. This means that joining it with "bo_party" and "bo_document_base" will produce no more rows than exist in "bo_labels" itself.
But after joining "bo_document_base", another join is done to "bo_contract_hardwood_deal", whose cardinality, described in 3), is one-to-many, possibly generating more rows in the final result.
This way, to find the right row count, you can simplify the join structure to "bo_labels" and "bo_contract_hardwood_deal" through:
4) bo_labels (bo_doc_base_fkey 1:N bo_document_fkey) bo_contract_hardwood_deal
A sample query could be one of the following:
SELECT COUNT(*)
FROM bo_labels L
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = L.bo_doc_base_fkey)
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
or
SELECT sum((select COUNT(*) from bo_contract_hardwood_deal C where C.bo_document_fkey = L.bo_doc_base_fkey))
FROM bo_labels L
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
I could not test with large tables, so I don't know exactly if it will improve performance against your original query, but I think it might help.