I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles with the highest score matching their input. So I used this query (here searching for all titles starting with "s"):
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow to use for autocomplete. With some information from "Using PostgreSQL in Web 2.0 Applications" I was able to improve on that with a special index:
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of about a factor of 4. But can it be improved further? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting decent performance with PostgreSQL in that case too, or should I try a different solution (Lucene? Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation:
Partition the table on the title key. This keeps the row sets Postgres has to work with smaller.
Give PostgreSQL more memory so the cache hit rate is above 98%. This table will take about 0.5 GB; 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats view.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying.
The longer the substring, the faster the query will run. Create a separate table for small substrings and store in it the top ten (or however many) choices you would want to show; there are only about 20,000 combinations of 1-, 2- and 3-character strings. See the sketch after this list.
You can use the same idea if you want to allow %abc% queries, but at that point switching to Lucene probably makes sense.
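A minimal sketch of that last idea, assuming a PostgreSQL version with window functions (8.4+); the table, column and index names are illustrative, not from the original post:
create table title_prefix_top (
    prefix text not null,
    title  character varying(256),
    score  double precision
);
-- store the ten highest-scoring titles for every 1-, 2- and 3-character prefix
insert into title_prefix_top (prefix, title, score)
select prefix, title, score
from (
    select lower(substring(title from 1 for len)) as prefix,
           title,
           score,
           row_number() over (partition by lower(substring(title from 1 for len))
                              order by score desc nulls last) as rn
    from nodes
    cross join generate_series(1, 3) as len
    where title is not null
      and char_length(title) >= len
) ranked
where rn <= 10;
create index title_prefix_top_idx on title_prefix_top (prefix, score desc);
-- autocomplete for an input of 's' is then a trivial index lookup
select title, score
from title_prefix_top
where prefix = lower('s')
order by score desc
limit 10;
The table has to be refreshed (or maintained by a trigger) whenever nodes changes, which is the usual trade-off for this kind of materialized top-N list.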
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case.
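For example (the index name is illustrative): with order by score desc limit 10, the planner can scan such an index in score order and stop as soon as ten rows pass the title filter.
create index nodes_score_idx on nodes (score desc);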
I have a table with 2.2 million rows.
Table "public.index"
Column | Type | Modifiers
-----------+-----------------------------+-----------------------------------------------------
fid | integer | not null default nextval('index_fid_seq'::regclass)
location | character varying |
Indexes:
"index_pkey" PRIMARY KEY, btree (fid)
"location_index" btree (location text_pattern_ops)
The location is the full path to a file, but I need to query using the name of the folder the file is located in. That folder name is unique in the table.
To avoid % at the beginning, I search for the full path which I know:
select fid from index where location like '/path/to/folder/%'
Explain Analyze:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on index (cost=0.00..120223.34 rows=217 width=4) (actual time=1181.701..1181.701 rows=0 loops=1)
Filter: ((location)::text ~~ '/path/to/folder/%'::text)
Rows Removed by Filter: 2166034
Planning time: 0.954 ms
Execution time: 1181.748 ms
(5 rows)
The question is not how to make a workaround, because
I have found that for my case:
When creating a foldername_index:
create index foldername_index on index (substring(location, '(?<=/path/to/)[^\/]*'));
I can successfully use the folder name to query:
explain analyze select fid from index where substring(location, '(?<=/path/to/)[^\/]*') = 'foldername';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on index (cost=600.49..31524.74 rows=10830 width=12) (actual time=0.030..0.030 rows=1 loops=1)
Recheck Cond: ("substring"((location)::text, '(?<=/path/to/)[^\/]*'::text) = 'folder_name'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on foldername_index (cost=0.00..597.78 rows=10830 width=0) (actual time=0.023..0.023 rows=1 loops=1)
Index Cond: ("substring"((location)::text, '(?<=/path/to/)[^\/]*'::text) = 'folder_name'::text)
Planning time: 0.115 ms
Execution time: 0.059 ms
(7 rows)
I have followed the PostgreSQL FAQ:
When using wild-card operators such as LIKE or ~, indexes can only be
used in certain circumstances:
The beginning of the search string must be anchored to the start of the string, i.e.
LIKE patterns must not start with % or _.
The search string can not start with a character class, e.g. [a-e].
None of this is the case in my query.
C locale must be used during initdb because sorting in a non-C locale often doesn't match the behavior of LIKE. You can create a special text_pattern_ops index that will work in such cases, but note it is only helpful for LIKE indexing.
I have C Locale:
# show LC_COLLATE;
lc_collate
------------
C
(1 row)
I also followed the instructions from this great answer here on Stack Overflow, which is why I use text_pattern_ops; it did not change anything. Unfortunately, I cannot install new modules.
So: Why does my query perform a seq scan?
I have found the solution myself by thinking about it over and over again. Although it might be obvious for some people, it may help others:
/path/to/folder is actually /the_path/to/folder/ (There are underscores in the path). But _ is a wildcard in SQL (like %).
select fid from index where location like '/the_path/to/folder/%'
uses a seq scan, because the index cannot filter any rows: _ is a wildcard, so the part of the pattern the index can use ends before the underscore, and that prefix is the same for every row.
select fid from index where location like '/the\_path/to/folder/%'
uses an index scan.
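If the path comes from application input, it is more robust to escape every LIKE metacharacter before building the pattern instead of fixing patterns by hand. A sketch, assuming standard_conforming_strings is on (the default in recent versions) and the default backslash escape character for LIKE:
select fid
from index
where location like replace(replace(replace('/the_path/to/folder/', '\', '\\'), '_', '\_'), '%', '\%') || '%';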
I want to fetch users that have one or more processed bets. I do this with the following SQL:
SELECT user_id FROM bets
WHERE bets.state in ('guessed', 'losed')
GROUP BY user_id
HAVING count(*) > 0;
But running EXPLAIN ANALYZE I noticed that no index is used and the query execution time is very high. I tried adding a partial index:
CREATE INDEX processed_bets_index ON bets(state) WHERE state in ('guessed', 'losed');
But the EXPLAIN ANALYZE output did not change:
HashAggregate (cost=34116.36..34233.54 rows=9375 width=4) (actual time=235.195..237.623 rows=13310 loops=1)
Filter: (count(*) > 0)
-> Seq Scan on bets (cost=0.00..30980.44 rows=627184 width=4) (actual time=0.020..150.346 rows=626674 loops=1)
Filter: ((state)::text = ANY ('{guessed,losed}'::text[]))
Rows Removed by Filter: 20951
Total runtime: 238.115 ms
(6 rows)
There are only a few records with statuses other than guessed and losed.
How do I create proper index?
I'm using PostgreSQL 9.3.4.
I assume that state mostly consists of 'guessed' and 'losed', with maybe a few other states in there as well. So most probably the optimizer does not see the need to use the index, since it would still fetch most of the rows.
What you do need is an index on the user_id, so perhaps something like this would work:
CREATE INDEX idx_bets_user_id_in_guessed_losed ON bets(user_id) WHERE state in ('guessed', 'losed');
Or, by not using a partial index:
CREATE INDEX idx_bets_state_user_id ON bets(state, user_id);
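Either way, the query itself can be simplified: the HAVING count(*) > 0 is redundant, because GROUP BY only produces groups that contain at least one row. A sketch:
SELECT user_id
FROM bets
WHERE state IN ('guessed', 'losed')
GROUP BY user_id;
Whether the planner actually uses either index still depends on how selective the state filter is, as explained above.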
I've got a pretty large table with nearly 1 million rows and some of the queries are taking a long time (over a minute).
Here is one that's giving me a particularly hard time...
EXPLAIN ANALYZE SELECT "apps".* FROM "apps" WHERE "apps"."kind" = 'software' ORDER BY itunes_release_date DESC, rating_count DESC LIMIT 12;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=153823.03..153823.03 rows=12 width=2091) (actual time=162681.166..162681.194 rows=12 loops=1)
-> Sort (cost=153823.03..154234.66 rows=823260 width=2091) (actual time=162681.159..162681.169 rows=12 loops=1)
Sort Key: itunes_release_date, rating_count
Sort Method: top-N heapsort Memory: 48kB
-> Seq Scan on apps (cost=0.00..150048.41 rows=823260 width=2091) (actual time=0.718..161561.149 rows=808554 loops=1)
Filter: (kind = 'software'::text)
Total runtime: 162682.143 ms
(7 rows)
So, how would I optimize that? PG version is 9.2.4, FWIW.
There are already indexes on (kind) and on (kind, itunes_release_date).
Looks like you're missing an index, e.g. on (kind, itunes_release_date desc, rating_count desc).
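A sketch of such an index (the name is illustrative); with the equality on kind leading, the ORDER BY ... LIMIT 12 can be satisfied straight from the index:
CREATE INDEX index_apps_on_kind_release_rating
  ON apps (kind, itunes_release_date DESC, rating_count DESC);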
How big is the apps table? Do you have at least this much memory allocated to postgres? If it's having to read from disk every time, query speed will be much slower.
Another thing that may help is to cluster the table on the index on kind. This may speed up disk access, since all the 'software' rows will be stored sequentially on disk.
The only way to speed up this query is to create a composite index on (itunes_release_date, rating_count). It will allow Postgres to pick the first N rows directly from the index.
I'm storing a relatively reasonable (~3 million) number of very small rows (the entire DB is ~300MB) in PostgreSQL. The data is organized thus:
Table "public.tr_rating"
Column | Type | Modifiers
-----------+--------------------------+---------------------------------------------------------------
user_id | bigint | not null
place_id | bigint | not null
rating | smallint | not null
rated_at | timestamp with time zone | not null default now()
rating_id | bigint | not null default nextval('tr_rating_rating_id_seq'::regclass)
Indexes:
"tr_rating_rating_id_key" UNIQUE, btree (rating_id)
"tr_rating_user_idx" btree (user_id, place_id)
Now, I would like to retrieve the ratings left on a set of places by your friends (a set of users).
The natural query I wrote is:
SELECT * FROM tr_rating WHERE user_id=ANY(?) AND place_id=ANY(?)
The size of the user_id array is ~500, while the place_id array is ~10,000
This turns into:
Bitmap Heap Scan on tr_rating (cost=2453743.43..2492013.53 rows=3627 width=34) (actual time=10174.044..10174.234 rows=1111 loops=1)
Buffers: shared hit=27922214
-> Bitmap Index Scan on tr_rating_user_idx (cost=0.00..2453742.53 rows=3627 width=0) (actual time=10174.031..10174.031 rows=1111 loops=1)
Index Cond: ((user_id = ANY (...) ))
Buffers: shared hit=27922214
Total runtime: 10279.290 ms
The first suspicious thing I see here is that it estimates that scanning the index for 500 users will take 2.5M disk seeks
Everything else here looks reasonable, except that it takes ten full seconds to do this! The index (via \di) looks like:
public | tr_rating_user_idx | index | tr_rating | 67 MB |
At 67 MB, I would expect it could tear through the index in a trivial amount of time, even if it had to do so sequentially. As the buffers accounting from the EXPLAIN ANALYZE shows, everything is already in memory (all values other than shared hit are zero and thus suppressed).
I have tried various combinations of REINDEX, VACUUM, ANALYZE, and CLUSTER with no measurable improvement.
Any thoughts as to what I am doing wrong here, or how I could debug further? I'm mystified; 67MB of data is a puny amount to spend so much time searching through...
For reference, the hardware is an 8-way recent Xeon with 8 15K 300 GB drives in RAID-10. Should be enough :-)
EDIT
Per btilly's suggestion, I tried out temporary tables:
=> explain analyze select * from tr_rating NATURAL JOIN user_ids NATURAL JOIN place_ids;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=49133.46..49299.51 rows=3524 width=34) (actual time=13.801..15.676 rows=1111 loops=1)
Hash Cond: (place_ids.place_id = tr_rating.place_id)
-> Seq Scan on place_ids (cost=0.00..59.66 rows=4066 width=8) (actual time=0.009..0.619 rows=4251 loops=1)
-> Hash (cost=48208.02..48208.02 rows=74035 width=34) (actual time=13.767..13.767 rows=7486 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 527kB
-> Nested Loop (cost=0.00..48208.02 rows=74035 width=34) (actual time=0.047..11.055 rows=7486 loops=1)
-> Seq Scan on user_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.006..0.399 rows=2189 loops=1)
-> Index Scan using tr_rating_user_idx on tr_rating (cost=0.00..22.07 rows=35 width=34) (actual time=0.002..0.003 rows=3 loops=2189)
Index Cond: (tr_rating.user_id = user_ids.user_id)
Total runtime: 15.931 ms
Why is the query plan so much better with temporary tables than with arrays? The data is exactly the same, simply presented in a different way. Additionally, I've measured the time to create a temporary table at tens to hundreds of milliseconds, which is a pretty steep overhead to pay. Can I continue to use the array approach, yet allow Postgres to use the hash join, which is so much faster?
EDIT 2
By creating a hash index on user_id, the runtime reduces to 250ms. Adding another hash index to place_id reduces the runtime further to 50ms. This is still twice as slow as using temporary tables, but the overhead of making the table negates any gains I see. I still do not understand how doing O(500) lookups in a btree index can take ten seconds, but the hash index is unquestionably much faster.
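For reference, the hash indexes described here would be created like this (index names are illustrative); note that hash indexes were not WAL-logged before PostgreSQL 10, so they are not crash-safe on older versions:
CREATE INDEX tr_rating_user_hash_idx  ON tr_rating USING hash (user_id);
CREATE INDEX tr_rating_place_hash_idx ON tr_rating USING hash (place_id);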
It looks like it is taking each row in the index, scanning through your user_id array, and then, if it finds a match, scanning through your place_id array. That means that for 3 million rows it has to scan through ~500 user_ids, and for each match it scans through ~10,000 place_ids. Those matches are individually fast, but this is a poor algorithm that could potentially result in up to 30 billion operations.
You'd be better off creating two temporary tables, giving them indexes, and doing a join. If it does a hash join, then you'd potentially have 6 million hash lookups. (3 million for user_id and 3 million for place_id.)
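A sketch of that temporary-table approach (the table names match the EDIT in the question; the unnest() literals stand in for the real id lists):
CREATE TEMP TABLE user_ids  (user_id  bigint PRIMARY KEY);
CREATE TEMP TABLE place_ids (place_id bigint PRIMARY KEY);
INSERT INTO user_ids  SELECT unnest('{1,2,3}'::bigint[]);    -- the ~500 friend ids
INSERT INTO place_ids SELECT unnest('{10,20,30}'::bigint[]); -- the ~10,000 place ids
ANALYZE user_ids;
ANALYZE place_ids;
SELECT r.*
FROM tr_rating r
JOIN user_ids  USING (user_id)
JOIN place_ids USING (place_id);
The ANALYZE calls matter: without fresh statistics on the temporary tables the planner may misestimate the join sizes.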
There is a table:
doc_id(integer)-value(integer)
Approximately 100,000 doc_ids and 27,000,000 rows.
The main query on this table searches for documents similar to the current document:
select the 10 documents with the maximum of
(count of values in common with the current document) / (count of values in the document).
We currently use PostgreSQL. The table weighs ~1.5 GB (with index). Average query time is ~0.5 s, which is too high, and in my opinion it will keep growing steeply as the database grows.
Should I move all of this to a NoSQL database, and if so, which one?
QUERY:
EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM testing.text_attachment D
WHERE D.doc_id !=29758 -- 29758 - is random id
AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Limit (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
-> Sort (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
-> Nested Loop (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
-> Index Scan using crc32_idx on text_attachment d (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
Filter: (d.doc_id <> 29758)
Total runtime: 1849.753 ms
(12 rows)
1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
I don't think your main problem here is the kind of database you're using, but the fact that you don't actually have an "index" for what you're searching: similarity between documents.
My proposal is to determine once which are the 10 documents most similar to each of the 100,000 doc_ids and cache the result in a new table like this:
doc_id(integer)-similar_doc(integer)-score(integer)
where you'll insert 10 rows per document, each of them representing one of the 10 best matches for it. You'll get about 1,000,000 rows, which you can access directly by index; that should take search time down to something like O(log n) (depending on the index implementation).
Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.
e.g. when a new document is inserted:
for each of the documents already in the table
you calculate its match score and
if the score is higher than the lowest score of the similar documents cached in the new table you swap in the similar_doc and score of the newly inserted document
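A sketch of what that cache table and its lookup could look like (the names are illustrative):
CREATE TABLE doc_similarity (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,
    PRIMARY KEY (doc_id, similar_doc)
);
-- fetching the 10 most similar documents becomes a single index range scan
SELECT similar_doc, score
FROM doc_similarity
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;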
If you're getting that bad performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query and possibly your data model. A query like that should be served a lot faster on such a small table.
First, is 0.5 s a problem or not? And did you already optimize your queries, data model and configuration settings? If not, you can still get better performance. Performance is a choice.
Besides speed, there is also functionality; that's what you will lose.
===
What about pushing the function to a JOIN:
EXPLAIN ANALYZE
SELECT
D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM
testing.text_attachment D
JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE
D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10