PostgreSQL GIN index slower than GIST for pg_trgm?


Despite what all the documentation says, I'm finding GIN indexes to be significantly slower than GIST indexes for pg_trgm related searches. This is on a table of 25 million rows with a relatively short text field (average length of 21 characters). Most of the rows of text are addresses of the form "123 Main st, City".
GIST index takes about 4 seconds with a search like
select suggestion from search_suggestions where suggestion % 'seattle';
But GIN takes 90 seconds and the following result when running with EXPLAIN ANALYZE:
Bitmap Heap Scan on search_suggestions (cost=330.09..73514.15 rows=25043 width=22) (actual time=671.606..86318.553 rows=40482 loops=1)
Recheck Cond: ((suggestion)::text % 'seattle'::text)
Rows Removed by Index Recheck: 23214341
Heap Blocks: exact=7625 lossy=223807
-> Bitmap Index Scan on tri_suggestions_idx (cost=0.00..323.83 rows=25043 width=0) (actual time=669.841..669.841 rows=1358175 loops=1)
Index Cond: ((suggestion)::text % 'seattle'::text)
Planning time: 1.420 ms
Execution time: 86327.246 ms
Note that over a million rows are being selected by the index, even though only 40k rows actually match. Any ideas why this is performing so poorly? This is on PostgreSQL 9.4.

Some issues stand out:
First, consider upgrading to a current version of Postgres. At the time of writing that's pg 9.6 or pg 10 (currently beta). Since Pg 9.4 there have been multiple improvements for GIN indexes, the additional module pg_trgm and big data in general.
Next, you need much more RAM, in particular a higher work_mem setting. I can tell from this line in the EXPLAIN output:
Heap Blocks: exact=7625 lossy=223807
"lossy" in the details for a Bitmap Heap Scan (with your particular numbers) indicates a dramatic shortage of work_mem. Postgres only collects block addresses in the bitmap index scan instead of row pointers because that's expected to be faster with your low work_mem setting (can't hold exact addresses in RAM). Many more non-qualifying rows have to be filtered in the following Bitmap Heap Scan this way. This related answer has details:
“Recheck Cond:” line in query plans with a bitmap index scan
But don't set work_mem too high without considering the whole situation:
Optimize simple query using ORDER BY date and text
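A minimal sketch of how the work_mem suggestion could be tried without touching the server-wide setting. The value 256MB is only an illustrative assumption, not a figure from the answer; the table and query are the ones from the question.

```sql
-- Sketch: raise work_mem for one transaction only, then re-check the plan.
-- '256MB' is an assumed illustrative value; choose one that fits the
-- available RAM and the number of concurrent sessions.
BEGIN;
SET LOCAL work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT suggestion
FROM   search_suggestions
WHERE  suggestion % 'seattle';

ROLLBACK;  -- SET LOCAL only lasts until the end of the transaction anyway
```

If the setting is high enough, the "Heap Blocks" line should report few or no lossy blocks and "Rows Removed by Index Recheck" should drop substantially, although the trigram index itself still returns candidates that have to be rechecked.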
There may be other problems, like index or table bloat or more configuration bottlenecks. But if you fix just these two items, the query should be much faster already.
Also, do you really need to retrieve all 40k rows in the example? You probably want to add a small LIMIT to the query and make it a "nearest-neighbor" search, in which case a GiST index is the better choice after all, because nearest-neighbor searches are supposed to be faster with a GiST index. Example:
Best index for similarity function
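A hedged sketch of what such a nearest-neighbor query might look like for the table in the question. The index name and the LIMIT value are assumptions made for illustration; the pg_trgm extension must already be installed.

```sql
-- Sketch: nearest-neighbor search backed by a GiST trigram index.
-- The index name is hypothetical.
CREATE INDEX search_suggestions_trgm_gist_idx
    ON search_suggestions USING gist (suggestion gist_trgm_ops);

-- <-> is the pg_trgm distance operator (1 minus the similarity).
SELECT suggestion
FROM   search_suggestions
ORDER  BY suggestion <-> 'seattle'
LIMIT  10;  -- assumed limit, for illustration only
```

With ORDER BY on the distance operator plus a small LIMIT, the GiST index can return the closest matches first instead of collecting every row above the similarity threshold.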

Related

Where do i begin diagnosing PostgreSQL performance problems?

I have a table called modifications with 42 columns and 84M rows. Total size is 64GB.
I am running Postgres 9.6.11 on Amazon RDS with 16GB of RAM on a db.m4.xlarge instance.
When I run a simple SELECT count(*) FROM modifications; it takes 380 seconds to finish executing.
When I run SELECT * FROM modifications WHERE post_date = '2016-05-03'; to limit to a single date, it takes 156 seconds to return the 4.6M rows in the result.
When I limit the result set even further to about 1M rows, the query still takes over 100 seconds to complete.
I know these are large result sets, but I'm fairly new to database query performance testing, so I'd like some pointers about what to try.
I've run EXPLAIN ANALYZE on these queries, but I'm not sure exactly what to do. Many of these queries are very simple and don't have clear ways to restructure them to improve performance.
I've also tried adding more indexes; I have indexes on each of the most commonly queried columns.
I am using the default settings for AWS RDS PostgreSQL configuration and have tried tweaking the work_mem settings using SET LOCAL work_mem = 'XXXMB'. That has not made a difference. Other default settings for things like shared_buffers (0.5GB) and effective_cache_size (0.5GB) are reasonably set.
Any advice or strategies on how to approach troubleshooting this would be greatly appreciated. Please let me know in comments if I should include more information.
EDIT: Here's the execution plan for the last SELECT query
Bitmap Heap Scan on modifications (cost=479407.01..1692971.07 rows=460492 width=279)
Recheck Cond: ((post_date = '2016-05-03 00:00:00'::timestamp without time zone) AND (change_type = 'residence_address_line_1'::text))
-> BitmapAnd (cost=479407.01..479407.01 rows=460492 width=0)
-> Bitmap Index Scan on modifications_post_date_idx (cost=0.00..130733.87 rows=4478040 width=0)
Index Cond: (post_date = '2016-05-03 00:00:00'::timestamp without time zone)
-> Bitmap Index Scan on modifications_change_type_idx (cost=0.00..348442.64 rows=8677610 width=0)
Index Cond: (change_type = 'residence_address_line_1'::text)

You should turn track_io_timing on, then you should do EXPLAIN (ANALYZE, BUFFERS) to look at the query's performance.
For the query whose plan you showed, it would probably be optimal to have a multi-column index on (change_type, post_date). But it isn't feasible to have hundreds of multi-column indexes to support hundreds of different queries. So you should look at the EXPLAIN (ANALYZE, BUFFERS) for that query both with the multicolumn index, and with the two single column indexes.
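The two suggestions above can be sketched as follows. The index name is hypothetical, and the query is reconstructed from the conditions shown in the question's plan, so treat it as an assumption about what the original statement looked like.

```sql
-- Sketch: collect I/O timings and buffer statistics for the query in question.
-- track_io_timing usually needs superuser rights; on a managed service it may
-- have to be enabled through the service's own parameter settings instead.
SET track_io_timing = on;

-- Hypothetical index name; column order follows the suggestion above.
CREATE INDEX modifications_change_type_post_date_idx
    ON modifications (change_type, post_date);

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   modifications
WHERE  post_date = '2016-05-03'
AND    change_type = 'residence_address_line_1';
```

Running the same EXPLAIN before and after creating the multi-column index gives the two plans to compare, including how many buffers were read from disk and how long those reads took.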
You have listed 3 dramatically different queries. Which one is like the ones you care about most? You generally need to optimize the query that gives you the results you need; you can't pick among very different queries based on how easy they are to optimize.

Postgresql performance issue when querying on time range

I'm trying to understand a strange performance issue on Postgres (v10.9).
We have a requests table and I want to get all requests made by a set of particular users in several time ranges. Here is the relevant info:
There is no user_id column in the table. Rather, there is a jsonb column named params, where the user_id field is stored as a string.
The set of users in question is very large, in the thousands.
There is a time column of type timestamptz and it's indexed with a standard BTREE index.
There is also a separate BTREE index on params->>'user_id'.
The queries I am running are based on the following template:
SELECT *
FROM requests
WHERE params->>'user_id' = ANY (VALUES ('id1'), ('id2'), ('id3')...)
AND time > 't1' AND time < 't2'
Where the ids and times here are placeholders for actual ids and times.
I am running a query like this for several consecutive time ranges of 2 weeks each. The queries for the first few time ranges take a couple of minutes each, which is obviously very long in terms of production but OK for research purposes. Then suddenly there is a dramatic spike in query runtime, and they start taking hours each, which begins to be untenable even for offline purposes.
This spike happens in the same range every time. It's worth noting that in this time range there is a x1.5 increase in total requests. Certainly more compared with the previous time range, but not enough to warrant a spike by a full order of magnitude.
Here is the output for EXPLAIN ANALYZE for the last time range with the reasonable running time:
Hash Join (cost=442.69..446645.35 rows=986171 width=1217) (actual time=66.305..203593.238 rows=445175 loops=1)
Hash Cond: ((requests.params ->> 'user_id'::text) = "*VALUES*".column1)
-> Index Scan using requests_time_idx on requests (cost=0.56..428686.19 rows=1976888 width=1217) (actual time=14.336..201643.439 rows=2139604 loops=1)
Index Cond: (("time" > '2019-02-12 22:00:00+00'::timestamp with time zone) AND ("time" < '2019-02-26 22:00:00+00'::timestamp with time zone))
-> Hash (cost=439.62..439.62 rows=200 width=32) (actual time=43.818..43.818 rows=29175 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 2536kB
-> HashAggregate (cost=437.62..439.62 rows=200 width=32) (actual time=24.887..33.775 rows=29175 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..364.69 rows=29175 width=32) (actual time=0.006..10.303 rows=29175 loops=1)
Planning time: 133.807 ms
Execution time: 203697.360 ms
If I understand this correctly, it seems that most of the time is spent on filtering the requests by time range, even though:
The time index seems to be used.
When running the same queries without the filter on the users (basically just fetching all requests by time range only), they both run in OK times.
Any thoughts on how to solve this problem would be appreciated, thanks!

Since you are retrieving so many rows, the query will never be really fast.
Unfortunately there is no single index to cover both conditions, but you can use these two:
CREATE INDEX ON requests ((params->>'user_id'));
CREATE INDEX ON requests (time);
Then you can hope for two bitmap index scans which get combined by a "BitmapAnd".
I am not sure if that will improve performance; PostgreSQL may still opt for the current plan, which is not a bad one. If your indexes are cached or random access to your storage is fast, set effective_cache_size or random_page_cost accordingly; that will make PostgreSQL lean towards an index scan.
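A small sketch of how those two planner settings could be tried for a single session before changing anything globally. Both values are illustrative assumptions rather than recommendations, and the id list is shortened to the placeholders used in the question.

```sql
-- Sketch: session-level planner settings, to be compared with EXPLAIN.
-- Both values are assumed for illustration, not measured.
SET random_page_cost = 1.1;        -- default is 4; lower values favor index scans
SET effective_cache_size = '8GB';  -- planner hint only, allocates no memory

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   requests
WHERE  params->>'user_id' = ANY (VALUES ('id1'), ('id2'), ('id3'))
AND    "time" > '2019-02-12 22:00:00+00'
AND    "time" < '2019-02-26 22:00:00+00';
```

Because SET only affects the current session, the plan can be compared with and without the changed settings before touching the server configuration.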

Understanding index scan in postgreSQL and random access concept

I'm trying to understand how an index scan is actually performed.
EXPLAIN ANALYZE SELECT * FROM tbl WHERE id = 46983
Consider the following plan:
Index Scan using pk_tbl on tbl (cost=0.29..8.30 rows=1 width=1064) (actual time=0.012..0.013 rows=1 loops=1)
Index Cond: (id = 46983)
Planning time: 0.101 ms
Execution time: 0.050 ms
As far as I understand, the index scan process consists of two random page reads. In my case
SHOW random_page_cost
returns 4.
So, I guess we need to find the block that the row with id = 46983 is stored in (random access in the index) and then we need to read that block by its address (random access to the block in physical storage). That's clear: two random accesses actually occur. But on the wiki I read that
In data structures, direct access implies the ability to access any
entry in a list in constant time
But obviously traversing a balanced tree doesn't have constant-time complexity, because it depends on the depth of the tree.
So how is it correct to say that requesting the block of the index is actually random access?

The reason is that indexes in databases are normally stored as B-trees or B+trees, n-ary tree structures with a variable but very large number of children per node. A typical tree of this kind with three or four levels can address millions of records, and almost certainly at least the root is kept in the cache (buffer pool), so that a typical access for a random key has a cost in the order of 1 or 2 disk accesses. For this reason, in the database field (and when costs are estimated) the access to a B-tree index is considered as a small constant.
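As a rough worked example of why the depth stays so small, the number of levels can be estimated as the logarithm of the row count to the base of the fan-out. The fan-out of 300 entries per page and the row count of 100 million are assumptions chosen for illustration; real values depend on key width and page fill.

```sql
-- Sketch: estimated number of B-tree levels for N keys with fan-out F,
-- computed as ceil(log_F(N)). N = 100 million and F = 300 are assumed values.
SELECT ceil(ln(100000000.0) / ln(300.0)) AS estimated_levels;

-- With the pageinspect extension installed (superuser only), the actual
-- height of the index from the question can be read from its metapage.
-- "level" is the level of the root page, counting leaf pages as level 0.
SELECT level FROM bt_metap('pk_tbl');
```

Under these assumptions the estimate comes out at about four levels, so even a very large index needs only a handful of page reads per lookup, most of them served from the buffer pool.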

Why are both SELECT count(PK) and SELECT count(*) so slow?

I've got a simple table with a single-column PRIMARY KEY called id, type serial. There are exactly 100,000,000 rows in there. The table takes 48GB, the PK index ca. 2.1GB. The machine is dedicated to Postgres and is something like a Core i5 with a 500GB HDD and 8GB RAM. The Pg config was created by the pgtune utility (shared buffers ca. 2GB, effective cache ca. 7GB). The OS is Ubuntu Server 14.04, Postgres 9.3.6.
Why are both SELECT count(id) and SELECT count(*) so slow in this simple case (ca. 11 minutes)?
Why is the PostgreSQL planner choosing a full table scan instead of an index scan, which should be at least 25 times faster (in the case where it would have to read the whole index from HDD)? Or where am I wrong?
By the way, running the query several times in a row does not change anything; it still takes ca. 11 minutes.
Execution plan here:
Aggregate (cost=7500001.00..7500001.01 rows=1 width=0) (actual time=698316.978..698316.979 rows=1 loops=1)
Buffers: shared hit=192 read=6249809
-> Seq Scan on transaction (cost=0.00..7250001.00 rows=100000000 width=0) (actual time=0.009..680594.049 rows=100000001 loops=1)
Buffers: shared hit=192 read=6249809
Total runtime: 698317.044 ms

Considering that the sequential throughput of an HDD is usually somewhere between 50MB/s and 100MB/s, I would expect reading all 48GB to take somewhere between 500 and 1000 seconds.
Since you have no WHERE clause, the planner sees that you are interested in the large majority of records, so it does not use the index, as this would require additional reads on top of the table itself. The reason PostgreSQL cannot answer the count from the index alone lies in MVCC, which PostgreSQL uses for transaction consistency. This requires that the rows are pulled to ensure accurate results. (see https://wiki.postgresql.org/wiki/Slow_Counting)
Neither the cache, the CPU, nor changes to the caching settings will affect this. The query is I/O bound, and the cache will be completely trashed after the query.
If you can live with an approximation you can use the reltuples field in the table metadata:
SELECT reltuples FROM pg_class WHERE relname = 'tbl';
Since this is just a single row this is blazing fast.
Update: since 9.2, a new way to store the visibility information allows index-only counts to happen. However, there are quite a few caveats, especially in the case where there is no predicate to limit the rows. See https://wiki.postgresql.org/wiki/Index-only_scans for more details.
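A hedged sketch of how one might check whether an index-only scan is available for the count. The table name is taken from the plan in the question; whether the planner actually switches away from the sequential scan depends on the visibility map and the cost estimates, so this is something to test rather than a guaranteed fix.

```sql
-- Sketch: VACUUM updates the visibility map, which index-only scans rely on.
-- The table name is quoted because transaction is also an SQL keyword.
VACUUM ANALYZE "transaction";

-- Check which plan the planner now chooses for the count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM "transaction";
```

If the plan shows an Index Only Scan, its "Heap Fetches" figure indicates how many rows still had to be checked in the table itself.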

Postgresql: How do I ensure that indexes are in memory

I have been trying out postgres 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server... I am now running it on a 14GB Azure VM. I went up a size when trying to solve the problem described below.
I am quite new to PostgreSQL by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, I and my colleagues simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.
Problem description:
I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:
id (bigserial)
url (character varying 300)
I created indexes after importing the data from a few CSV files. Both columns are indexed.... the id is the primary key. The index on the url is a normal btree created using the defaults through pgAdmin.
When I ran:
SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;
... The total size is 5980MB
The individual sizes of the 2 indexes in question here are as follows, and I got them by running:
# SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE
reltype=0 ORDER BY relpages DESC LIMIT 10;
relname | mb | reltype
----------------------------------+------+--------
test_url_idx | 3684 | 0
test_pk | 2161 | 0
There are other indexes on other smaller tables, but they are tiny (< 5MB).... so I ignored them here
The trouble when querying the test_table using the url, particularly when using a wildcard in the search, is the speed (or lack of it). e.g.
select * from test_table where url like 'orange%' limit 20;
...would take anything from 20-40 seconds to run.
Running explain analyze on the above gives the following:
# explain analyze select * from test_table where
url like 'orange%' limit 20;
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..4787.96 rows=20 width=57) (actual time=0.304..1898.583 rows=20 loops=1)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57) (actual time=0.302..1898.542 rows=20 loops=1)
Filter: ((url)::text ~~ 'orange%'::text)
Rows Removed by Filter: 210286
Total runtime: 1898.650 ms
(5 rows)
Taking another example... this time with the wildcard between american and .com....
# explain select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-------------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
Filter: ((url)::text ~~ 'american%.com'::text)
(3 rows)
# explain analyze select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-----------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57) (actual time=83.470..3035.696 rows=50 loops=1)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57) (actual time=83.467..3035.614 rows=50 loops=1)
Filter: ((url)::text ~~ 'american%.com'::text)
Rows Removed by Filter: 276142
Total runtime: 3035.774 ms
(5 rows)
I then went from a 7GB to a 14GB server. Query Speeds were no better.
Observations on the server
I can see that Memory usage never really goes beyond 2MB.
Disk reads go off the charts when running a query using a LIKE statement.
Query speed is perfectly fine when matching against the id (primary key)
The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.
Changes to conf:
shared_buffers = 512MB
checkpoint_segments = 10
(I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)
cpu_index_tuple_cost = 0.0005
effective_cache_size = 10GB # recommendation in the blog post was 2GB...
On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:
CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)
Memory: 1.5/14.0GB (11%)
More details on Memory:
In use: 1.4GB
Available: 12.5GB
Committed 1.9/16.1 GB
Cached: 835MB
Paged Pool: 95.2MB
Non-paged pool: 71.2 MB
Questions
How can I ensure an index will sit in memory (providing it doesn't get too big for memory)? Is it just configuration tweaking I need here?
Is implementing my own search index (e.g. Lucene) a better option here?
Are the full-text indexing features in postgres going to improve performance dramatically, even if I can solve the index in memory issue?
Thanks for reading.

Those seq scans make it look like you didn't run analyze on the table after importing your data.
http://www.postgresql.org/docs/current/static/sql-analyze.html
During normal operation, scheduling to run vacuum analyze isn't useful, because the autovacuum periodically kicks in. But it is important when doing massive writes, such as during imports.
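A minimal sketch of the post-import step described above, using the table name from the question. The second statement is just one way to confirm that statistics have been collected.

```sql
-- Sketch: refresh planner statistics after a bulk load.
VACUUM ANALYZE test_table;

-- Check when the table was last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'test_table';
```

If both timestamp columns are empty, the planner has been working without statistics for that table.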
On a slightly related note, see this reversed index tip on Pavel's PostgreSQL Tricks site, if you ever need to run queries anchored at the end rather than at the beginning, e.g. like '%.com'
http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
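The reversed-index idea can be sketched roughly like this. The index name is hypothetical, and the example is a reconstruction of the general technique, not a copy of the linked tip.

```sql
-- Sketch: index the reversed string so that a pattern anchored at the end
-- becomes a pattern anchored at the start.
-- reverse() is built in since PostgreSQL 9.1; it returns text, hence
-- text_pattern_ops. The index name is hypothetical.
CREATE INDEX test_table_url_reverse_idx
    ON test_table (reverse(url) text_pattern_ops);

-- '%.com' reversed becomes the left-anchored pattern 'moc.%'.
SELECT *
FROM   test_table
WHERE  reverse(url) LIKE 'moc.%'
LIMIT  20;
```

The query has to use the same reverse(url) expression as the index definition, otherwise the planner cannot match the two.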
Regarding your actual questions, be wary that some of the suggestions in that post you linked to are dubious at best. Changing the cost of index use is frequently dubious and disabling seq scan is downright silly. (Sometimes, it is cheaper to seq scan a table than it is to use an index.)
With that being said:
1. Postgres primarily caches indexes based on how often they're used, and it will not use an index if the stats suggest that it shouldn't -- hence the need to analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood it's in memory too, but keep the latter points in mind.
2. and 3. Full text search works fine.
For further reading on fine-tuning, see the manual and:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Two last notes on your schema:
Last I checked, bigint (bigserial in your case) was slower than plain int. (This was a while ago, so the difference might now be negligible on modern, 64-bit servers.) Unless you foresee that you'll actually need more than 2.1 billion entries, int is plenty and takes less space.
From an implementation standpoint, the only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need data to fit that size and are merely doing so for no reason other than habit, your db inserts and updates will run faster by getting rid of that constraint.

Unless your encoding or collation is C or POSIX, an ordinary btree index cannot efficiently satisfy an anchored like query. You may have to declare a btree index with the varchar_pattern_ops op class to benefit.
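A short sketch of the operator-class suggestion for the table in the question. The index name is hypothetical.

```sql
-- Sketch: B-tree index that can serve left-anchored LIKE patterns
-- regardless of the database collation.
CREATE INDEX test_table_url_pattern_idx
    ON test_table (url varchar_pattern_ops);

-- Re-check the plan for the prefix search from the question.
EXPLAIN ANALYZE
SELECT * FROM test_table WHERE url LIKE 'orange%' LIMIT 20;
```

Note that an index with a pattern operator class does not help with ordinary <, <=, > or >= comparisons under a non-C collation, so the existing default index may still be needed for those.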

The problem is that you're getting hit with a full table scan for each of those lookups ("index in memory" isn't really an issue). Each time you run one of those queries the database is visiting every single row, which is causing the high disk usage. You might check here for a little more information (especially follow the links to the docs on operator classes and index types). If you follow that advice you should be able to get prefix lookups working fine, i.e. those situations where you're matching something like 'orange%'.
Full text search is nice for more natural text search, like written documents, but it might be more difficult to get it working well for URL searching. There was also this thread in the mailing lists a few months back that might have more domain-specific information for what you're trying to do.

explain analyze select * from test_table where
url like 'orange%' limit 20;
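You probably want to use a gin/gist trigram index (pg_trgm) for like queries. It should give you much better results than btree; a plain btree index supports like queries only for left-anchored patterns, and only under the C locale or with a pattern operator class.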
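A hedged sketch of what a trigram index for these LIKE queries might look like. It assumes the pg_trgm extension is available on the server; the index name is hypothetical.

```sql
-- Sketch: trigram index that can also serve LIKE patterns with a wildcard
-- in the middle or at the start.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX test_table_url_trgm_idx
    ON test_table USING gin (url gin_trgm_ops);

-- Re-check the plan for the second example from the question.
EXPLAIN ANALYZE
SELECT * FROM test_table WHERE url LIKE 'american%.com' LIMIT 50;
```

On a table of roughly 90 million rows such an index will be large and take a while to build, so it is worth testing on a copy of the data first.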
