How to make postgres not use a particular index?

I have the following query:
devapp=> Explain SELECT DISTINCT "chaindata_tokentransfer"."emitting_contract" FROM "chaindata_tokentransfer" WHERE (("chaindata_tokentransfer"."to_addr" = 100 OR "chaindata_tokentransfer"."from_addr" = 100) AND "chaindata_tokentransfer"."chain_id" = 1 AND "chaindata_tokentransfer"."block_number" >= 10000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=29062023.48..29062321.43 rows=8870 width=4)
   ->  Sort  (cost=29062023.48..29062172.45 rows=59591 width=4)
         Sort Key: emitting_contract
         ->  Bitmap Heap Scan on chaindata_tokentransfer  (cost=28822428.06..29057297.07 rows=59591 width=4)
               Recheck Cond: (((to_addr = 100) OR (from_addr = 100)) AND (chain_id = 1) AND (block_number >= 10000))
               ->  BitmapAnd  (cost=28822428.06..28822428.06 rows=59591 width=0)
                     ->  BitmapOr  (cost=4209.94..4209.94 rows=351330 width=0)
                           ->  Bitmap Index Scan on chaindata_tokentransfer_to_addr_284dc4bc  (cost=0.00..1800.73 rows=150953 width=0)
                                 Index Cond: (to_addr = 100)
                           ->  Bitmap Index Scan on chaindata_tokentransfer_from_addr_ef8ecd8c  (cost=0.00..2379.41 rows=200377 width=0)
                                 Index Cond: (from_addr = 100)
                     ->  Bitmap Index Scan on chaindata_tokentransfer_chain_id_block_number_tx_eeeac2a4_idx  (cost=0.00..28818202.98 rows=1315431027 width=0)
                           Index Cond: ((chain_id = 1) AND (block_number >= 10000))
(13 rows)
As you can see, the cost of the last index scan, on chaindata_tokentransfer_chain_id_block_number_tx_eeeac2a4_idx, is very high, and the query is timing out. If I remove the filter on chain_id and block_number, the query executes in a reasonable amount of time. Since this less constrained query works, I'd expect the original, more constrained query to work too if that index didn't exist and those conditions were applied as a plain filter. How can I achieve that without deleting the index?

You can probably disable the index by doing some dummy arithmetic on the indexed column.
...AND "chaindata_tokentransfer"."chain_id" + 0 = 1...
If you put that into production, make sure to add a code comment on why you are doing such an odd thing.
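Applied to the query above, that would look like this (a sketch; the + 0 is a no-op arithmetically, but it stops the planner from matching chain_id against the index):
SELECT DISTINCT "chaindata_tokentransfer"."emitting_contract"
FROM "chaindata_tokentransfer"
WHERE ("chaindata_tokentransfer"."to_addr" = 100 OR "chaindata_tokentransfer"."from_addr" = 100)
  AND "chaindata_tokentransfer"."chain_id" + 0 = 1  -- dummy arithmetic: hides the column from the index
  AND "chaindata_tokentransfer"."block_number" >= 10000;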
I'm curious why it chooses to use that index, despite apparently knowing how astonishingly awful it is. If you show the plan for the query with the index disabled, maybe we could figure that out.
If the dummy arithmetic doesn't work, what you could do is start a transaction, drop the index, execute the query (or just the EXPLAIN of it), then roll back the drop. That is probably not something you want to do often in production (especially since the table will be locked from when the index is dropped until the rollback, and because you might accidentally commit!), but getting the plan is probably worth doing once.
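A minimal sketch of that approach, using the index name from the plan above (DROP INDEX is transactional in Postgres, so the rollback restores it):
BEGIN;
DROP INDEX chaindata_tokentransfer_chain_id_block_number_tx_eeeac2a4_idx;  -- table is locked from here on
EXPLAIN SELECT DISTINCT "chaindata_tokentransfer"."emitting_contract"
FROM "chaindata_tokentransfer"
WHERE ("chaindata_tokentransfer"."to_addr" = 100 OR "chaindata_tokentransfer"."from_addr" = 100)
  AND "chaindata_tokentransfer"."chain_id" = 1
  AND "chaindata_tokentransfer"."block_number" >= 10000;
ROLLBACK;  -- undoes the DROP INDEX, as if nothing happened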

Related

How do I fetch similar posts from database where text length can be more than 2000 characters

As far as I'm aware, there is no simple, quick solution. I am trying to do a full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers created specifically for that purpose, but is there still a way I can implement this with a query execution time of less than a second?
Here's what I have tried so far:
begin;
SET pg_trgm.similarity_threshold = 0.3;
select
    id,
    <col_name>,
    similarity(<column with gin index>, '<text to be searched>') as sml
from
    <table> p
where
    <clauses>
    and <indexed_col> % '<text to be searched>'
    and <indexed_col> <-> '<text to be searched>' < 0.5
order by
    <indexed_col> <-> '<text to be searched>'
limit 10;
end;
Index created is as follows:
CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);
The above query takes around 6-7 seconds to execute, and sometimes only 200 ms, which is weird to me because the query plan changes depending on the input I pass to similarity().
I tried tsvector @@ tsquery, but it turns out to be too strict due to the & operator.
EDIT: Here's the EXPLAIN ANALYZE of the above query:
->  Sort  (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
      Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
      Sort Method: quicksort  Memory: 25kB
      ->  Bitmap Heap Scan on posts p  (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
            Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
            Rows Removed by Index Recheck: 36258
            Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
            Heap Blocks: exact=24043
            ->  Bitmap Index Scan on trgm_idx  (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
                  Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms
Your redundant/overlapping query conditions aren't helpful. Setting similarity_threshold = 0.3 and then writing
t % q and t <-> q < 0.5
just throws away index selectivity for no reason. Set similarity_threshold to as stringent a value as you actually want to use, then get rid of the unnecessary <-> condition.
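A sketch of the simplified query, assuming a threshold of 0.5 to match the old <-> cutoff (placeholders as in the question):
SET pg_trgm.similarity_threshold = 0.5;  -- t % q now means similarity(t, q) >= 0.5
select
    id,
    <col_name>,
    similarity(<column with gin index>, '<text to be searched>') as sml
from
    <table> p
where
    <clauses>
    and <indexed_col> % '<text to be searched>'  -- the only trigram condition, fully indexable
order by
    <indexed_col> <-> '<text to be searched>'
limit 10;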
You could try the GiST version of trigram indexing. It can support the ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very effective with 2000-character strings, but it is worth a try.
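A sketch of that variant (index name hypothetical); the gist_trgm_ops operator class supports both % and <-> distance ordering:
CREATE INDEX trgm_gist_idx ON posts USING gist (post_title_combined gist_trgm_ops);
-- with this index, "order by <indexed_col> <-> '<text>' limit 10"
-- can run as a k-nearest-neighbour index scan that stops after 10 rows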

Postgres index for finding the most recent row

I've got a table with 1.5 million rows. It will get significantly bigger than this. I have a very simple query that I think I should be able to index to be lightning fast.
select * from instant_power_reads where site_id = 22 order by read_at desc limit 1;
I've been reading around and see most people saying that a combined index on site_id and read_at should take care of this, but I can't get it to work.
My index looks like this:
CREATE INDEX idx_ipr_site_read
ON public.instant_power_reads USING btree
(site_id ASC NULLS LAST, read_at DESC NULLS LAST)
TABLESPACE pg_default;
When I run the query, it takes about a third of a second, which I believe is too long. This is what the explain tells me. It doesn't seem to be using the index like I think it should. Shouldn't this be able to do a quick "Index Only Scan"?
Limit  (cost=10904.20..10904.21 rows=1 width=30)
  ->  Sort  (cost=10904.20..10959.73 rows=22209 width=30)
        Sort Key: read_at DESC
        ->  Bitmap Heap Scan on instant_power_reads  (cost=516.55..10793.16 rows=22209 width=30)
              Recheck Cond: (site_id = 22)
              ->  Bitmap Index Scan on idx_ipr_site_read  (cost=0.00..511.00 rows=22209 width=0)
                    Index Cond: (site_id = 22)
I know I have other options. My main idea is to maintain a separate temp table that just contains the latest records for each site. But I'd like to learn what I'm doing wrong here.
EDIT: I was asked to post the EXPLAIN (ANALYZE, BUFFERS) for the basic index.
Limit  (cost=0.43..2.24 rows=1 width=30) (actual time=0.027..0.028 rows=1 loops=1)
  Buffers: shared hit=4
  ->  Index Scan Backward using idx_ipr_site_read on instant_power_reads  (cost=0.43..40405.75 rows=22288 width=30) (actual time=0.026..0.026 rows=1 loops=1)
        Index Cond: (site_id = 22)
        Buffers: shared hit=4
Planning Time: 0.124 ms
Execution Time: 0.042 ms
EDIT 2: I accepted Laurenz' answer, but wanted to point out that the key to figuring this out was running EXPLAIN (ANALYZE, BUFFERS), because it shows the actual planning and execution time, which can be very different from the time your GUI (like pgAdmin) shows you.
Looks like the obvious index is the right choice:
CREATE INDEX ON instant_power_reads (site_id, read_at);
If you measure a long execution time on the client side, it is probably network latency.
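To rule that out, time the query on the server itself, for example:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM instant_power_reads
WHERE site_id = 22
ORDER BY read_at DESC
LIMIT 1;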

Why is PostgreSQL 11 optimizer refusing best plan which would use an index w/ included column?

PostgreSQL 11 isn't smart enough to use indexes with included columns?
CREATE INDEX organization_locations__org_id_is_headquarters__inc_location_id_ix
ON organization_locations(org_id, is_headquarters) INCLUDE (location_id);
ANALYZE organization_locations;
ANALYZE organizations;
EXPLAIN VERBOSE
SELECT location_id
FROM organization_locations ol
WHERE org_id = (SELECT id FROM organizations WHERE code = 'akron')
AND is_headquarters = 1;
QUERY PLAN
Seq Scan on organization_locations ol  (cost=8.44..14.61 rows=1 width=4)
  Output: ol.location_id
  Filter: ((ol.org_id = $0) AND (ol.is_headquarters = 1))
  InitPlan 1 (returns $0)
    ->  Index Scan using organizations__code_ux on organizations  (cost=0.42..8.44 rows=1 width=4)
          Output: organizations.id
          Index Cond: ((organizations.code)::text = 'akron'::text)
There are only 211 rows currently in organization_locations, average row length 91 bytes.
I get that it only loads one data page. But the I/O is the same to grab the index page, and the target data is right there (no extra lookup from the index into the data page). What is PG thinking with this plan?
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
EDIT: Here is the explain with buffers:
Seq Scan on organization_locations ol  (cost=8.44..14.33 rows=1 width=4) (actual time=0.018..0.032 rows=1 loops=1)
  Filter: ((org_id = $0) AND (is_headquarters = 1))
  Rows Removed by Filter: 210
  Buffers: shared hit=7
  InitPlan 1 (returns $0)
    ->  Index Scan using organizations__code_ux on organizations  (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
          Index Cond: ((code)::text = 'akron'::text)
          Buffers: shared hit=4
Planning Time: 0.402 ms
Execution Time: 0.048 ms
Reading one index page is not cheaper than reading a table page, so with tiny tables you cannot expect a gain from an index-only scan.
Besides, did you
VACUUM organization_locations;
Without that, the visibility map won't show that the table block is all-visible, so you cannot get an index-only scan no matter what.
In addition to the other answers, this is probably a silly index to have in the first place. INCLUDE is good when you need a unique index but you also want to tack on a column which is not part of the unique constraint, or when the included column doesn't have btree operators and so can't be in the main body of the index. In other cases, you should just put the extra column in the index itself.
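For this query, that would mean something like the following (a sketch; index name hypothetical):
CREATE INDEX organization_locations__org_id_hq_location_ix
    ON organization_locations (org_id, is_headquarters, location_id);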
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
This is a workflow problem, and you can't expect PostgreSQL to solve it for you. Do you really think PostgreSQL should create actual plans based on imaginary scenarios?

Efficient PostgreSQL query on timestamp using index or bitmap index scan?

In PostgreSQL, I have an index on a date field on my tickets table.
When I compare the field against now(), the query is pretty efficient:
# explain analyze select count(1) as count from tickets where updated_at > now();
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=90.64..90.66 rows=1 width=0) (actual time=33.238..33.238 rows=1 loops=1)
   ->  Index Scan using tickets_updated_at_idx on tickets  (cost=0.01..90.27 rows=74 width=0) (actual time=0.016..29.318 rows=40250 loops=1)
         Index Cond: (updated_at > now())
 Total runtime: 33.271 ms
It goes downhill and uses a Bitmap Heap Scan if I try to compare it against now() minus an interval.
# explain analyze select count(1) as count from tickets where updated_at > (now() - '24 hours'::interval);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=180450.15..180450.17 rows=1 width=0) (actual time=543.898..543.898 rows=1 loops=1)
   ->  Bitmap Heap Scan on tickets  (cost=21296.43..175963.31 rows=897368 width=0) (actual time=251.700..457.916 rows=924373 loops=1)
         Recheck Cond: (updated_at > (now() - '24:00:00'::interval))
         ->  Bitmap Index Scan on tickets_updated_at_idx  (cost=0.00..20847.74 rows=897368 width=0) (actual time=238.799..238.799 rows=924699 loops=1)
               Index Cond: (updated_at > (now() - '24:00:00'::interval))
 Total runtime: 543.952 ms
Is there a more efficient way to query using date arithmetic?
The 1st query expects to find rows=74, but actually finds rows=40250.
The 2nd query expects to find rows=897368 and actually finds rows=924699.
Of course, processing 23 times as many rows takes considerably more time, so your actual times are not surprising.
Statistics for data with updated_at > now() are outdated. Run:
ANALYZE tickets;
and repeat your queries. And you seriously have data with updated_at > now()? That sounds wrong.
It's not surprising, however, that statistics are outdated for the most recently changed data. That's in the nature of things. If your query depends on current statistics, you have to run ANALYZE before you run your query.
Also test with (in your session only):
SET enable_bitmapscan = off;
and repeat your second query to see times without bitmap index scan.
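A sketch combining both suggestions; SET LOCAL confines the setting to the transaction, so it cannot leak into the rest of the session:
ANALYZE tickets;  -- refresh statistics first

BEGIN;
SET LOCAL enable_bitmapscan = off;
EXPLAIN ANALYZE
SELECT count(1) AS count FROM tickets
WHERE updated_at > (now() - '24 hours'::interval);
ROLLBACK;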
Why bitmap index scan for more rows?
A plain index scan fetches rows from the heap sequentially, as they are found in the index. That's simple, dumb and without overhead: fast for few rows, but it may end up more expensive than a bitmap index scan as the number of rows grows.
A bitmap index scan collects row locations from the index before looking up the table. If multiple rows reside on the same data page, that saves repeated visits and can make things considerably faster. The more rows, the greater the chance that a bitmap index scan will save time.
For even more rows (around 5% of the table, heavily depending on actual data), the planner switches to a sequential scan of the table and doesn't use the index at all.
The optimum would be an index-only scan, introduced with Postgres 9.2. That's only possible if some preconditions are met: if all relevant columns are included in the index, the index type supports it, and the visibility map indicates that all rows on a data page are visible to all transactions, then that page doesn't have to be fetched from the heap (the table), and the information in the index is enough.
The decision depends on your statistics (how many rows Postgres expects to find and their distribution) and on cost settings, most importantly random_page_cost, cpu_index_tuple_cost and effective_cache_size.
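For illustration, a sketch of how to meet those preconditions for the query above: the count only references updated_at, so the existing index already contains every column it needs, and VACUUM keeps the visibility map current.
VACUUM tickets;  -- sets all-visible bits in the visibility map
EXPLAIN ANALYZE
SELECT count(1) AS count FROM tickets
WHERE updated_at > (now() - '24 hours'::interval);
-- on Postgres 9.2 or later, with a mostly all-visible table, the planner
-- can now choose an Index Only Scan using tickets_updated_at_idx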

What is the difference between Seq Scan and Bitmap heap scan in postgres?

In output of explain command I found two terms 'Seq Scan' and 'Bitmap heap Scan'. Can somebody tell me what is the difference between these two types of scan? (I am using PostgreSql)
http://www.postgresql.org/docs/8.2/static/using-explain.html
Basically, a sequential scan goes to the actual rows and starts reading from row 1, continuing until the query is satisfied (this may not be the entire table, e.g., in the case of LIMIT).
A bitmap heap scan means that PostgreSQL has found a small subset of rows to fetch (e.g., from an index) and is going to fetch only those rows. This will of course involve a lot more seeking, so it is faster only when a small subset of the rows is needed.
Take an example:
create table test (a int primary key, b int unique, c int);
insert into test values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5);
Now, we can easily get a seq scan:
explain select * from test where a != 4
QUERY PLAN
---------------------------------------------------------
 Seq Scan on test  (cost=0.00..34.25 rows=1930 width=12)
   Filter: (a <> 4)
It did a sequential scan because it estimates it's going to grab the vast majority of the table; seeking to do that (instead of a big, seekless read) would be silly.
Now, we can use the index:
explain select * from test where a = 4 ;
QUERY PLAN
----------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..8.27 rows=1 width=4)
   Index Cond: (a = 4)
And finally, we can get some bitmap operations:
explain select * from test where a = 4 or a = 3;
QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=8.52..13.86 rows=2 width=12)
   Recheck Cond: ((a = 4) OR (a = 3))
   ->  BitmapOr  (cost=8.52..8.52 rows=2 width=0)
         ->  Bitmap Index Scan on test_pkey  (cost=0.00..4.26 rows=1 width=0)
               Index Cond: (a = 4)
         ->  Bitmap Index Scan on test_pkey  (cost=0.00..4.26 rows=1 width=0)
               Index Cond: (a = 3)
We can read this as:
1. Build a bitmap of the rows we want for a = 4 (Bitmap Index Scan).
2. Build a bitmap of the rows we want for a = 3 (Bitmap Index Scan).
3. OR the two bitmaps together (BitmapOr).
4. Look those rows up in the table (Bitmap Heap Scan) and check to make sure a = 4 or a = 3 (Recheck Cond).
[Yes, these query plans are stupid, but that's because we failed to analyze test. Had we analyzed it, they'd all be sequential scans, since there are only 5 tiny rows.]
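For reference, collecting those statistics is a single statement:
analyze test;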