Postgres where query optimization - sql

In our database we have a table menus having 515502 rows. It has a column status which is of type smallint.
Currently, a simple count query takes about 700 ms for the set of rows whose status is 2.
explain analyze select count(id) from menus where status = 2;
Aggregate (cost=72973.71..72973.72 rows=1 width=4) (actual time=692.564..692.565 rows=1 loops=1)
-> Bitmap Heap Scan on menus (cost=2510.63..72638.80 rows=133962 width=4) (actual time=28.179..623.077 rows=135429 loops=1)
Recheck Cond: (status = 2)
Rows Removed by Index Recheck: 199654
-> Bitmap Index Scan on menus_status (cost=0.00..2477.14 rows=133962 width=0) (actual time=26.211..26.211 rows=135429 loops=1)
Index Cond: (status = 2)
Total runtime: 692.705 ms
(7 rows)
Other rows have a status of 4, and for those the query runs very fast.
explain analyze select count(id) from menus where status = 4;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=7198.73..7198.74 rows=1 width=4) (actual time=24.926..24.926 rows=1 loops=1)
-> Bitmap Heap Scan on menus (cost=40.53..7193.53 rows=2079 width=4) (actual time=1.461..23.418 rows=2220 loops=1)
Recheck Cond: (status = 4)
-> Bitmap Index Scan on menus_status (cost=0.00..40.02 rows=2079 width=0) (actual time=0.858..0.858 rows=2220 loops=1)
Index Cond: (status = 4)
Total runtime: 25.089 ms
(6 rows)
I observed that the most general btree index is the best indexing strategy for simple equality based queries. Both gin and hash were slower than btree.
Any tips for making count queries faster for any filter that is using an index?
I understand that this is a beginner level question, so apologies in advance for any kind of mistakes I might have made.

Maybe your table has more rows with status = 2 than with status = 4, so the total table access time is higher in the first case.
For status = 2 there are so many matching rows that the bitmap built for the Bitmap Heap Scan switches to "lossy" mode, and a recheck is needed after the operation (note the 199654 rows removed by the index recheck). So there are two things to consider: either your result is simply too big (and you can't do much about that without reorganizing your tables, say, with partitioning), or your work_mem parameter is too small to hold the intermediate result. Try increasing its value if you can.
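One way to test the work_mem theory is to raise the setting for a single session and compare plans. This is a minimal sketch; the value '64MB' is only an illustrative assumption, not a recommendation for your server:

```sql
-- Raise work_mem for the current session only ('64MB' is an arbitrary test value).
SET work_mem = '64MB';

-- Re-run the slow query and compare it with the original plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(id) FROM menus WHERE status = 2;

-- Restore the server default for this session.
RESET work_mem;
```

If the bitmap now fits in memory, the "Rows Removed by Index Recheck" line should shrink or disappear. Newer PostgreSQL versions also print a "Heap Blocks: exact=... lossy=..." line that shows this directly.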

Related

Postgres changing the query from index Only scan to bit map scan when data set increases

I have two same queries but with different where condition values
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 1565 and account_id = 225 and deleted_at is NULL group by survey_contact_id, relation_id;
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 888 and account_id = 12 and deleted_at is NULL group by survey_contact_id, relation_id;
When I run these two queries, they produce different plans.
For the first query, the result is:
GroupAggregate (cost=0.28..8.32 rows=1 width=24) (actual time=0.016..0.021 rows=4 loops=1)
Group Key: survey_contact_id, relation_id
-> Index Only Scan using test on nomination (cost=0.28..8.30 rows=1 width=8) (actual time=0.010..0.012 rows=5 loops=1)
Index Cond: ((account_id = 225) AND (survey_id = 1565))
Heap Fetches: 5
Planning time: 0.148 ms
Execution time: 0.058 ms
and for the 2nd one
GroupAggregate (cost=11.08..11.12 rows=2 width=24) (actual time=0.015..0.015 rows=0 loops=1)
Group Key: survey_contact_id, relation_id
-> Sort (cost=11.08..11.08 rows=2 width=8) (actual time=0.013..0.013 rows=0 loops=1)
Sort Key: survey_contact_id, relation_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on nomination (cost=4.30..11.07 rows=2 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Recheck Cond: ((account_id = 12) AND (survey_id = 888) AND (deleted_at IS NULL))
-> Bitmap Index Scan on test (cost=0.00..4.30 rows=2 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((account_id = 12) AND (survey_id = 888))
Planning time: 0.149 ms
Execution time: 0.052 ms
Can anyone explain why Postgres is doing a Bitmap Scan instead of an Index Only Scan?
The short version is that Postgres has a cost-based approach, so it has estimated that the cost of doing so is less in the second case, based on the statistics it has.
In your case, the total cost (estimated) of each of these queries is 8.32 and 11.12 respectively. You may be able to see the cost of the index-only scan for the second query by running set enable_bitmapscan = off.
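To try this without affecting other sessions, the setting can be changed inside a transaction. A minimal sketch that reuses the second query from the question; the plan and costs you get back will depend on your data and statistics:

```sql
BEGIN;

-- SET LOCAL only lasts until the end of this transaction.
-- Note that this discourages bitmap scans rather than strictly forbidding them.
SET LOCAL enable_bitmapscan = off;

EXPLAIN ANALYZE
SELECT survey_contact_id, relation_id, count(survey_contact_id), count(relation_id)
FROM nomination
WHERE survey_id = 888 AND account_id = 12 AND deleted_at IS NULL
GROUP BY survey_contact_id, relation_id;

ROLLBACK;
```

Comparing the top-level cost of this plan with the 11.12 of the bitmap plan shows what the planner thought the alternative would cost.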
Note that, based on its statistics, Postgres estimated that the first query would return 1 row (actually 4), and that the second would return 2 rows (actually 0).
There are several ways to get better statistics, but if analyze (or autovacuum) hasn't been run on that table for a while, that is a common cause of bad estimates. Another tell-tale that vacuum may not have been run recently (at least on this table) is the Heap Fetches: 5 you can see in the first query plan.
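To check when the table was last vacuumed and analyzed, and to refresh both by hand, something along these lines should work. It is a sketch that assumes the table sits in a schema on your search_path:

```sql
-- When did (auto)vacuum and (auto)analyze last touch the table?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'nomination';

-- Refresh the visibility map and the planner statistics in one pass.
VACUUM (ANALYZE) nomination;
```

After the vacuum, the Heap Fetches figure of an index-only scan should drop towards 0 for pages that have not been modified since.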
I'm confused by the "when data set increases" part of your question, please do add more context on that front if relevant.
Finally, if you're not already planning a PostgreSQL upgrade, I highly recommend doing so soon. 9.6 is nearly out of support, and versions 10, 11, 12, and 13 each contained a host of performance focussed features.

Even the simplest PostgreSQL query takes ages in a big table with indexes

I have a problem with very slow query execution on a PostgreSQL table which has 145 602 995 rows (yes, more than 145 million). I have indexes created, but even the simplest query takes ages to execute. For example, a query like SELECT COUNT(*) FROM events; takes 708 seconds (~12 minutes).
I have a column called org_id which has an index and when I try to execute query like:
EXPLAIN ANALYZE SELECT COUNT(*) FROM events WHERE org_id = 1;
Aggregate (cost=8191.76..8191.77 rows=1 width=8) (actual time=9.758..9.758 rows=1 loops=1)
-> Index Only Scan using org_id on events (cost=0.57..8179.63 rows=4853 width=0) (actual time=1.172..9.729 rows=48 loops=1)
Index Cond: (org_id = 1)
Heap Fetches: 48
Planning time: 0.167 ms
Execution time: 9.803 ms
It uses the org_id index, but the estimated cost is huge.
As I increase the org_id value, execution times get slower and slower. That is probably because the org_ids with smaller numbers have fewer records. When I get to org_id = 9, where there are a lot of records, it stops using an Index Only Scan on the org_id index and uses a Bitmap Heap Scan and Bitmap Index Scan instead.
EXPLAIN SELECT COUNT(*) FROM events WHERE org_id = 9;
Aggregate (cost=10834654.32..10834654.33 rows=1 width=8)
-> Bitmap Heap Scan on events (cost=147380.56..10814983.35 rows=7868386 width=0)
Recheck Cond: (org_id = 9)
-> Bitmap Index Scan on org_id (cost=0.00..145413.46 rows=7868386 width=0)
Index Cond: (org_id = 9)
Is there any way of improving speed with such a big tables? One extra info is that I have 11 columns in this table where one of them is of the jsonb NOT NULL type. Just mentioning. Maybe it's important.
EDIT:
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM events;
Aggregate (cost=12873195.66..12873195.67 rows=1 width=8) (actual time=653255.247..653255.248 rows=1 loops=1)
Buffers: shared hit=292755 read=10754192
-> Seq Scan on events (cost=0.00..12507945.93 rows=146099893 width=0) (actual time=0.015..638846.285 rows=146318426 loops=1)
Buffers: shared hit=292755 read=10754192
Planning time: 0.215 ms
Execution time: 653255.315 ms

PL/pgSQL Query Plan Worse Inside Function Than Outside

I have a function that is running too slowly. I've isolated which piece of the function is slow: a small SELECT statement.
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
ON (fam.provider_data_id = pf.provider_data_id
AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL)
LIMIT 1
When I run the function this piece of SQL generates the following query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=0.56..6.75 rows=1 width=6) (actual time=3471.004..3471.004 rows=0 loops=1)
-> Nested Loop (cost=0.56..594054.42 rows=96017 width=6) (actual time=3471.002..3471.002 rows=0 loops=1)
-> Seq Scan on image_family fam (cost=0.00..391880.08 rows=96023 width=6) (actual time=3471.001..3471.001 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
When I run the selected query in a query tool (outside of the function) the query plan looks like this:
Limit (cost=1.12..3.81 rows=1 width=6) (actual time=0.043..0.043 rows=1 loops=1)
Output: pf.image_group_id
Buffers: shared hit=11
-> Nested Loop (cost=1.12..14.55 rows=5 width=6) (actual time=0.041..0.041 rows=1 loops=1)
Output: pf.image_group_id
Inner Unique: true
Buffers: shared hit=11
-> Index Only Scan using image_family_family_id_provider_data_id_idx on programs.image_family fam (cost=0.56..1.65 rows=5 width=6) (actual time=0.024..0.024 rows=1 loops=1)
Output: fam.family_id, fam.provider_data_id
Index Cond: (fam.family_id = 8419853)
Heap Fetches: 2
Buffers: shared hit=6
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on programs.provider_file pf (cost=0.56..2.58 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
Output: pf.provider_data_id, pf.provider_file_path, pf.posted_dt, pf.file_repository_id, pf.restricted_size, pf.image_group_id, pf.is_master, pf.is_biggest
Index Cond: (pf.provider_data_id = fam.provider_data_id)
Filter: (pf.image_group_id IS NOT NULL)
Buffers: shared hit=5
Planning time: 0.809 ms
Execution time: 0.100 ms
If I disable sequential scans in the function, I can get a similar query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=1.12..8.00 rows=1 width=6) (actual time=3855.722..3855.722 rows=0 loops=1)
-> Nested Loop (cost=1.12..660217.34 rows=96017 width=6) (actual time=3855.721..3855.721 rows=0 loops=1)
-> Index Only Scan using image_family_family_id_provider_data_id_idx on image_family fam (cost=0.56..458043.00 rows=96023 width=6) (actual time=3855.720..3855.720 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
Heap Fetches: 368
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
The query plans differ in the Filter applied to the Index Only Scan. The plan inside the function has more Heap Fetches and seems to treat the argument as a string cast to numeric.
Things I've tried:
Increasing statistics (and running vacuum/analyze)
Calling the problematic piece of SQL in another function with language SQL
Add another index (the one that its using now to perform an INDEX ONLY scan)
Create a CTE for the image_family table (this did help performance, but it would still do a sequential scan on image_family instead of using the index, so it was still too slow)
Change from executing raw SQL to using an EXECUTE ... INTO .. USING in the function.
Makeup of the two tables:
image_family:
provider_data_id: numeric(16)
family_id: int4
(rest omitted for brevity)
unique index on provider_data_id
index on family_id
I recently added a unique index on (family_id, provider_data_id) as well
Approximately 20 million rows here. Families have many provider_data_ids but not all provider_data_ids are part of families and thus aren't all in this table.
provider_file:
provider_data_id numeric(16)
image_group_id numeric(16)
(rest omitted for brevity)
unique index on provider_data_id
Approximately 32 million rows in this table. Most rows (> 95%) have a Non-Null image_group_id.
Postgres Version 10
How can I get the query performance to match whether I call it from a function or as raw SQL in a query tool?
The problem is exhibited in this line:
Filter: ((family_id)::numeric = '8419853'::numeric)
The index on family_id cannot be used because family_id is compared to a numeric value. This requires a cast to numeric, and there is no index on family_id::numeric.
Even though integer and numeric both are types representing numbers, their internal representation is quite different, and so the indexes are incompatible. In other words, the cast to numeric is like a function for PostgreSQL, and since it has no index on that functional expression, it has to resort to a scan of the whole table (or index).
The solution is simple, however: use an integer instead of a numeric parameter for the query. If in doubt, use a cast like
fam.family_id = $1::integer
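A minimal sketch of the first option, with the parameter declared as integer so that it matches the int4 column. The function name first_image_group and the parameter and variable names are hypothetical; only the tables and columns come from the question:

```sql
-- Hypothetical wrapper: the parameter type matches image_family.family_id (int4),
-- so the comparison needs no cast and the index on family_id can be used.
CREATE OR REPLACE FUNCTION programs.first_image_group(p_family_id integer)
RETURNS numeric
LANGUAGE plpgsql
STABLE
AS $$
DECLARE
    v_image_group_id numeric(16);
BEGIN
    SELECT pf.image_group_id
      INTO v_image_group_id
      FROM programs.image_family fam
      JOIN programs.provider_file pf
        ON fam.provider_data_id = pf.provider_data_id
     WHERE fam.family_id = p_family_id
       AND pf.image_group_id IS NOT NULL
     LIMIT 1;

    -- NULL is returned when no matching row exists.
    RETURN v_image_group_id;
END;
$$;

-- Example call with the value from the question.
SELECT programs.first_image_group(8419853);
```

If the existing function signature cannot be changed, the $1::integer cast shown above achieves the same thing inside the query.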

how to understand postgres EXPLAIN output

EXPLAIN SELECT a.name, m.name FROM Casting c JOIN Movie m ON c.m_id = m.m_id JOIN Actor a ON a.a_id = c.a_id AND c.a_id < 50;
Output
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=26.20..18354.49 rows=1090 width=27) (actual time=0.240..5.603 rows=1011 loops=1)
-> Nested Loop (cost=25.78..12465.01 rows=1090 width=15) (actual time=0.236..4.046 rows=1011 loops=1)
-> Bitmap Heap Scan on casting c (cost=25.35..3660.19 rows=1151 width=8) (actual time=0.229..1.059 rows=1011 loops=1)
Recheck Cond: (a_id < 50)
Heap Blocks: exact=989
-> Bitmap Index Scan on casting_a_id_index (cost=0.00..25.06 rows=1151 width=0) (actual time=0.114..0.114 rows=1011 loops=1)
Index Cond: (a_id < 50)
-> Index Scan using movie_pkey on movie m (cost=0.42..7.64 rows=1 width=15) (actual time=0.003..0.003 rows=1 loops=1011)
Index Cond: (m_id = c.m_id)
-> Index Scan using actor_pkey on actor a (cost=0.42..5.39 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=1011)
Index Cond: (a_id = c.a_id)
Planning time: 0.334 ms
Execution time: 5.672 ms
(13 rows)
I am trying to understand how the query planner works. I can follow the plan it chose, but I don't understand why it chose it.
Can someone explain the query optimizer's choices (choice of query processing algorithms, join order) in this query, based on parameters like query selectivity and cost models, or anything else that affects the choice?
Also, why is there a Recheck Cond after the index scan?
There are two reasons why there has to be a Bitmap Heap Scan:
PostgreSQL has to check whether the rows found are visible for the current transaction or not. Remember that PostgreSQL keeps old row versions in the table until VACUUM removes them. This visibility information is not stored in the index.
If work_mem is not large enough to contain a bitmap with one bit per table row, PostgreSQL uses one bit per table page, which loses some information. PostgreSQL then needs to check the lossy blocks to see which of the rows in each block really satisfy the condition.
You can see this when you use EXPLAIN (ANALYZE, BUFFERS), then PostgreSQL will show if there were lossy matches, see this example on rextester:
-> Bitmap Heap Scan on t (cost=177.14..4719.43 rows=9383 width=0)
(actual time=2.130..144.729 rows=10001 loops=1)
Recheck Cond: (val = 10)
Rows Removed by Index Recheck: 738586
Heap Blocks: exact=646 lossy=3305
Buffers: shared hit=1891 read=2090
-> Bitmap Index Scan on t_val_idx (cost=0.00..174.80 rows=9383 width=0)
(actual time=1.978..1.978 rows=10001 loops=1)
Index Cond: (val = 10)
Buffers: shared read=30
I cannot explain the whole of the PostgreSQL optimizer in this answer, but what it does is to try all possible ways to compute the result, estimate how much each one will cost and choose the cheapest plan.
To estimate how big the result set will be, it uses the object definitions and the table statistics, which contain detailed data about how the column values are distributed.
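The statistics the planner uses for a column can be inspected through the pg_stats view. A small sketch for the column filtered in the question, assuming the table was created with an unquoted name and is therefore stored in lower case:

```sql
-- Column statistics used to estimate how many rows match "a_id < 50".
SELECT attname,
       null_frac,
       n_distinct,
       most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'casting'
  AND attname = 'a_id';
```

The estimate of 1151 rows in the plan (against 1011 actual rows) is derived from data like the histogram bounds shown here.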
It then calculates how many disk blocks it will have to read sequentially and by random access (I/O cost), and how many table and index rows and function calls it will have to process (CPU cost) to come up with a grand total. The weights for each of these components in the total can be configured.
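The configurable weights mentioned above are ordinary settings, so the values currently in effect can be listed like this (a sketch; the values you see depend on your configuration):

```sql
-- Planner cost constants: I/O weights first, then CPU weights.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost',
               'random_page_cost',
               'cpu_tuple_cost',
               'cpu_index_tuple_cost',
               'cpu_operator_cost')
ORDER BY name;
```

The cost figures in EXPLAIN output are expressed in these arbitrary units, not in milliseconds.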
Usually the best plan is one that reduces the number of result rows as quickly as possible by applying the most selective condition first. In your case this seems to be casting.a_id < 50.
Nested loop joins are often preferred if the number of rows in the outer (upper in EXPLAIN output) table is small.

PostgreSQL index query speed inconsistency

We have 2 identical (double precision) columns on the same table, with 2 identical indices, running 2 identical queries. Yet one runs nearly 10x quicker than the other. What's causing this?
1) SELECT MIN("reports"."longitude") AS min_id FROM "reports" WHERE (area2 = 18)
2) SELECT MIN("reports"."latitude") AS min_id FROM "reports" WHERE (area2 = 18)
1 runs in 28ms and 2 runs in >300ms
Here are the 'explains':
1)
Result (cost=6.07..6.08 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..6.07 rows=1 width=8)
-> Index Scan using longitude on reports (cost=0.00..139617.49 rows=22983 width=8)
Index Cond: (longitude IS NOT NULL)
Filter: (area2 = 18)
2)
Result (cost=5.95..5.96 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..5.95 rows=1 width=8)
-> Index Scan using latitude on reports (cost=0.00..136754.07 rows=22983 width=8)
Index Cond: (latitude IS NOT NULL)
Filter: (area2 = 18)
as requested here is the explain analyse output...
1)
Result (cost=6.07..6.08 rows=1 width=0) (actual time=10.992..10.993 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..6.07 rows=1 width=8) (actual time=10.985..10.986 rows=1 loops=1)
-> Index Scan using longitude on reports (cost=0.00..139617.49 rows=22983 width=8) (actual time=10.983..10.983 rows=1 loops=1)
Index Cond: (longitude IS NOT NULL)
Filter: (area2 = 18)
Total runtime: 11.033 ms
2)
Result (cost=5.95..5.96 rows=1 width=0) (actual time=259.749..259.749 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..5.95 rows=1 width=8) (actual time=259.740..259.740 rows=1 loops=1)
-> Index Scan using latitude on reports (cost=0.00..136754.07 rows=22983 width=8) (actual time=259.739..259.739 rows=1 loops=1)
Index Cond: (latitude IS NOT NULL)
Filter: (area2 = 18)
Total runtime: 259.789 ms
---------------------
What is going on? How can I get the second query to behave properly and run quickly? Both setups are identical as far as I can tell.
First, there is no guarantee that indexes speed up queries. Second, when measuring performance, you need to run each query multiple times. Loading the index and table pages into the cache adds overhead that can affect how long a query takes.
I am not a specialist in Postgres, but on thinking about this, I'm not that surprised.
The query plan is looping through the index, finding the corresponding row that matches area2 = 18, and then hopefully stopping at the first one (it is using the index, so it can start at the lowest value and move upwards). This is speculation on how it is working; I don't know that Postgres is doing this exactly.
In any case, what is happening is that the area is much closer to the beginning of the longitude index than the beginning of the latitude index. So, it finds the first matching record there first. If this explanation is correct, it would suggest that the area is relatively west (lower longitude) and relatively north (higher latitude), compared to other things in the database.
By the way, assuming that there are lots of areas, you might get better results with an index on Area2.
You are getting an index scan, but the number of records examined depends on how far up the list you have to go to match the area2 condition.
Unless your area2 distribution is strange, to optimize this query you should put composite indices on (area2, latitude) and (area2, longitude). I suspect you will get <10 ms. PG may also be able to combine a separate index on area2 with the existing indices, in lieu of composite indices, using its Bitmap Heap Scan capabilities.
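The composite indices suggested above could be created like this. The index names are made up for illustration; the table and column names come from the question:

```sql
-- Hypothetical index names; each index is ordered by area2 first, then the coordinate,
-- so the minimum for one area is the first entry in that area's range.
CREATE INDEX reports_area2_latitude_idx  ON reports (area2, latitude);
CREATE INDEX reports_area2_longitude_idx ON reports (area2, longitude);

-- Refresh statistics, then check whether the planner picks the new index.
ANALYZE reports;

EXPLAIN ANALYZE
SELECT MIN("reports"."latitude") AS min_id
FROM "reports"
WHERE area2 = 18;
```

With such an index the scan no longer has to walk through rows of other areas before reaching the first row with area2 = 18, which is what makes the latitude query slow in the original plan.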