I'm storing a relatively reasonable (~3 million) number of very small rows (the entire DB is ~300MB) in PostgreSQL. The data is organized thus:
Table "public.tr_rating"
Column | Type | Modifiers
-----------+--------------------------+---------------------------------------------------------------
user_id | bigint | not null
place_id | bigint | not null
rating | smallint | not null
rated_at | timestamp with time zone | not null default now()
rating_id | bigint | not null default nextval('tr_rating_rating_id_seq'::regclass)
Indexes:
"tr_rating_rating_id_key" UNIQUE, btree (rating_id)
"tr_rating_user_idx" btree (user_id, place_id)
Now, I would like to retrieve the ratings left on a set of places by a user's friends (a set of users).
The natural query I wrote is:
SELECT * FROM tr_rating WHERE user_id=ANY(?) AND place_id=ANY(?)
The size of the user_id array is ~500, while the place_id array is ~10,000
This turns into:
Bitmap Heap Scan on tr_rating (cost=2453743.43..2492013.53 rows=3627 width=34) (actual time=10174.044..10174.234 rows=1111 loops=1)
Buffers: shared hit=27922214
-> Bitmap Index Scan on tr_rating_user_idx (cost=0.00..2453742.53 rows=3627 width=0) (actual time=10174.031..10174.031 rows=1111 loops=1)
Index Cond: ((user_id = ANY (...) ))
Buffers: shared hit=27922214
Total runtime: 10279.290 ms
The first suspicious thing I see here is that it estimates that scanning the index for 500 users will take 2.5M disk seeks
Everything else here looks reasonable, except that it takes ten full seconds to do this! The index (via \di) looks like:
public | tr_rating_user_idx | index | tr_rating | 67 MB |
at 67 MB, I would expect it could tear through the index in a trivial amount of time, even if it has to do it sequentially. As the buffers accounting from the EXPLAIN ANALYZE shows, everything is already in memory (as all values other than shared_hit are zero and thus suppressed).
I have tried various combinations of REINDEX, VACUUM, ANALYZE, and CLUSTER with no measurable improvement.
Any thoughts as to what I am doing wrong here, or how I could debug further? I'm mystified; 67MB of data is a puny amount to spend so much time searching through...
For reference, the hardware is an 8-way recent Xeon with 8 15K 300GB drives in RAID-10. Should be enough :-)
EDIT
Per btilly's suggestion, I tried out temporary tables:
=> explain analyze select * from tr_rating NATURAL JOIN user_ids NATURAL JOIN place_ids;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=49133.46..49299.51 rows=3524 width=34) (actual time=13.801..15.676 rows=1111 loops=1)
Hash Cond: (place_ids.place_id = tr_rating.place_id)
-> Seq Scan on place_ids (cost=0.00..59.66 rows=4066 width=8) (actual time=0.009..0.619 rows=4251 loops=1)
-> Hash (cost=48208.02..48208.02 rows=74035 width=34) (actual time=13.767..13.767 rows=7486 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 527kB
-> Nested Loop (cost=0.00..48208.02 rows=74035 width=34) (actual time=0.047..11.055 rows=7486 loops=1)
-> Seq Scan on user_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.006..0.399 rows=2189 loops=1)
-> Index Scan using tr_rating_user_idx on tr_rating (cost=0.00..22.07 rows=35 width=34) (actual time=0.002..0.003 rows=3 loops=2189)
Index Cond: (tr_rating.user_id = user_ids.user_id)
Total runtime: 15.931 ms
Why is the query plan so much better when faced with temporary tables, rather than arrays? The data is exactly the same, simply presented in a different way. Additionally, I've measured the time to create a temporary table at tens to hundreds of milliseconds, which is a pretty steep overhead to pay. Can I continue to use the array approach, yet allow Postgres to use the hash join which is so much faster, instead?
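(For comparison, a minimal sketch of one way to hand the same arrays to the planner as row sets without materializing temporary tables is to join against unnest(); the literal arrays below are placeholders, and whether this actually yields the hash join plan is not verified here.)
-- hedged sketch: present the arrays as row sets via unnest()
SELECT r.*
FROM tr_rating r
JOIN unnest(ARRAY[1, 2, 3]::bigint[])    AS u(user_id)  ON r.user_id  = u.user_id
JOIN unnest(ARRAY[10, 20, 30]::bigint[]) AS p(place_id) ON r.place_id = p.place_id;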
EDIT 2
By creating a hash index on user_id, the runtime reduces to 250ms. Adding another hash index to place_id reduces the runtime further to 50ms. This is still twice as slow as using temporary tables, but the overhead of making the table negates any gains I see. I still do not understand how doing O(500) lookups in a btree index can take ten seconds, but the hash index is unquestionably much faster.
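(Roughly what those hash indexes look like; the index names here are made up.)
CREATE INDEX tr_rating_user_hash_idx  ON tr_rating USING hash (user_id);
CREATE INDEX tr_rating_place_hash_idx ON tr_rating USING hash (place_id);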
It looks like it is taking each row in the index, then scanning through your user_id array, and if it finds a match scanning through your place_id array. That means that for 3 million rows it has to scan through up to ~500 user_ids, and for each match it scans through up to 10,000 place_ids. Those comparisons are individually fast, but this is a poor algorithm that could potentially result in up to 30 billion operations.
You'd be better off creating two temporary tables, giving them indexes, and doing a join. If it does a hash join, then you'd potentially have 6 million hash lookups. (3 million for user_id and 3 million for place_id.)
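A minimal sketch of that setup, reusing the table and column names from the EDIT above (loading the ids is left out):
CREATE TEMPORARY TABLE user_ids  (user_id  bigint PRIMARY KEY);
CREATE TEMPORARY TABLE place_ids (place_id bigint PRIMARY KEY);
-- INSERT or COPY the ~500 user ids and ~10,000 place ids here
ANALYZE user_ids;
ANALYZE place_ids;
-- then run the NATURAL JOIN query shown in the EDIT above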
There is a table with 10 million records that has an array-typed column; it looks like:
id | content | contained_special_ids
----------------------------------------
1 | abc | { 1, 2 }
2 | abd | { 1, 3 }
3 | abe | { 1, 4 }
4 | abf | { 3 }
5 | abg | { 2 }
6 | abh | { 3 }
and I want to know how many records have a contained_special_ids that includes 3, so my SQL is:
select count(*) from my_table where contained_special_ids #> array[3]
It works fine when the data is small, but it takes a long time (about 30+ seconds) when the table has 10 million records.
I have added index to this column:
"index_my_table_on_contained_special_ids" gin (contained_special_ids)
So, how to optimize this select count query?
Thanks a lot!
UPDATE
below is the explain:
Finalize Aggregate (cost=1049019.17..1049019.18 rows=1 width=8) (actual time=44343.230..44362.224 rows=1 loops=1)
Output: count(*)
-> Gather (cost=1049018.95..1049019.16 rows=2 width=8) (actual time=44340.332..44362.217 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=1048018.95..1048018.96 rows=1 width=8) (actual time=44337.615..44337.615 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=44336.442..44336.442 rows=1 loops=1
Worker 1: actual time=44336.564..44336.564 rows=1 loops=1
-> Parallel Bitmap Heap Scan on public.my_table (cost=9116.31..1046912.22 rows=442694 width=0) (actual time=330.602..44304.221 rows=391431 loops=3)
Recheck Cond: (my_table.contained_special_ids #> '{12511}'::bigint[])
Rows Removed by Index Recheck: 501077
Heap Blocks: exact=67496 lossy=109789
Worker 0: actual time=329.547..44301.513 rows=409272 loops=1
Worker 1: actual time=329.794..44304.582 rows=378538 loops=1
-> Bitmap Index Scan on index_my_table_on_contained_special_ids (cost=0.00..8850.69 rows=1062465 width=0) (actual time=278.413..278.414 rows=1176563 loops=1)
Index Cond: (my_table.contained_special_ids #> '{12511}'::bigint[])
Planning Time: 1.041 ms
Execution Time: 44362.262 ms
Increase work_mem until the lossy blocks go away. Also, make sure the table is well vacuumed to support index-only bitmap scans, and that you are using a new enough version (which you should tell us) to support those. Finally, you can try increasing effective_io_concurrency.
Also, post plans as text, not images; and turn on track_io_timing.
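(A rough sketch of those knobs as session-level settings; the values are only starting points to experiment with, and track_io_timing needs superuser privileges to change.)
-- hedged sketch: the settings mentioned above, with guessed values
SET work_mem = '256MB';               -- raise until the "lossy" heap blocks disappear from EXPLAIN (ANALYZE, BUFFERS)
SET effective_io_concurrency = 200;   -- only relevant for bitmap heap scans
SET track_io_timing = on;             -- superuser; makes I/O time visible in the plans
VACUUM (ANALYZE) my_table;            -- keep the visibility map current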
There is no way to fully optimize such a query, due to two factors:
The use of a non-atomic value, which violates FIRST NORMAL FORM.
The fact that PostgreSQL cannot perform aggregate computations quickly.
On the first problem: 1st NORMAL FORM
Every value in a table's columns must be atomic, and an array containing multiple values is of course not atomic.
So no index on such a column will be truly efficient, because the type violates 1NF.
This can be mitigated by using a table instead of an array; a sketch follows below.
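(A minimal sketch of that normalized design, assuming id is the primary key of my_table; the new names here are hypothetical.)
-- one row per (record, special id) pair instead of an array column
CREATE TABLE my_table_special_ids (
    my_table_id bigint NOT NULL REFERENCES my_table (id),
    special_id  bigint NOT NULL,
    PRIMARY KEY (my_table_id, special_id)
);
CREATE INDEX ON my_table_special_ids (special_id);
-- the original count becomes a plain equality lookup
SELECT count(*) FROM my_table_special_ids WHERE special_id = 3;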
On the poor performance of PostgreSQL's aggregates
PostgreSQL uses an MVCC model that mixes, in the same data pages, dead (phantom) row versions and valid rows, so to count the valid rows it has to read the records one by one to distinguish the ones that should be counted from the ones that should not.
Most other DBMSs do not work the way PostgreSQL does: Oracle and SQL Server, for example, do not keep phantom records inside the data pages, and some others store the exact count of valid rows in the page header.
As an example, read the tests I have done comparing COUNT and other aggregate functions between PostgreSQL and SQL Server; some queries run 1500 times faster on SQL Server.
I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
Which outputs 500 rows having a single result column and takes around 1 min and 43 secs to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ','))::integer)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ','))
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (in the list of 500 ids), calculate the sum of the integer column pb.nr_size associated with it, returning the lesser of this amount and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that each will match at least one row in the subquery, so it will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is of integer type also. Table tb_pr_dc has many rows for each id_pr. Actually, we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same amount of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size) and then finding the same amount looking in tb_pr_dc_bn. Imo, the indexes look fine and the amount of rows to evaluate is minimal, so I can't understand why it's taking so much time to run this query. A lot of other queries reading a lot more of data on other tables and doing more calculations are running fine. The DBA just ran a reindex and vacuum analyze, but still it's running the same slow way. We are running PostgreSQL 11 on Linux. I'm running this query in a replica without concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.
The time is spent jumping all over the table to find 109468 randomly scattered rows, issuing random IO requests to do so. You can verify that by turning track_io_timing on and redoing the plans (probably just leave it turned on globally and by default; the overhead is low and the value it produces is high), but I'm sure enough that I don't need to see that output before reaching this conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or is organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
You ask about why so many columns are output in the internal nodes of the plan. The reason for that is that PostgreSQL often just passes around pointers to where the tuple lives in the shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If it was a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do it without a reason.
One way to speed this up is to create indexes which will enable index-only scans. Those would be on tb_pr_dc (id_pr, id_pr_dc_bn) and on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
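(A minimal sketch of those two indexes exactly as stated above; the index names are left to PostgreSQL's defaults.)
CREATE INDEX ON tb_pr_dc    (id_pr, id_pr_dc_bn);
CREATE INDEX ON tb_pr_dc_bn (id_pr_dc_bn, nr_size);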
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.
I have a reference table for UUIDs that is roughly 200M rows. I have ~5000 UUIDs that I want to look up in the reference table. Reference table looks like:
CREATE TABLE object_store (
project_id UUID,
object_id UUID,
object_name VARCHAR(20),
description VARCHAR(80)
);
CREATE INDEX object_store_project_idx ON object_store(project_id);
CREATE INDEX object_store_id_idx ON object_store(object_id);
* Edit #2 *
Request for the temp_objects table definition.
CREATE TEMPORARY TABLE temp_objects (
object_id UUID
)
ON COMMIT DELETE ROWS;
The reason for the separate index is that object_id is not unique, and can belong to many different projects. The lookup set is just a temp table of UUIDs (temp_objects) containing the 5000 object_ids I want to check.
If I query the above reference table with 1 object_id literal value, it's almost instantaneous (2ms). If the temp table only has 1 row, again, instantaneous (2ms). But with 5000 rows it takes 25 minutes to even return. Granted it pulls back >3M rows of matches.
* Edited *
This is for 1 row comparison (4.198 ms):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..475780.22 rows=494005 width=65) (actual time=0.038..2.631 rows=1194 loops=1)
Buffers: shared hit=1202, local hit=1
-> Seq Scan on temp_objects t (cost=0.00..13.60 rows=360 width=16) (actual time=0.007..0.009 rows=1 loops=1)
Buffers: local hit=1
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1307.85 rows=1372 width=81) (actual time=0.027..1.707 rows=1194 loops=1)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=1202
Planning time: 0.173 ms
Execution time: 3.096 ms
(9 rows)
Time: 4.198 ms
This is for 4911 row comparison (1579082.974 ms (26:19.083)):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..3217316.86 rows=3507438 width=65) (actual time=0.041..1576913.100 rows=8043500 loops=1)
Buffers: shared hit=5185078 read=2887548, local hit=71
-> Seq Scan on temp_objects d (cost=0.00..96.56 rows=2556 width=16) (actual time=0.009..3.945 rows=4911 loops=1)
Buffers: local hit=71
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1244.97 rows=1372 width=81) (actual time=1.492..320.081 rows=1638 loops=4911)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=5185078 read=2887548
Planning time: 0.169 ms
Execution time: 1579078.811 ms
(9 rows)
Time: 1579082.974 ms (26:19.083)
Eventually I want to group and get a count of the matching object_ids by project_id, using standard grouping. The aggregate is at the upper end (of course) of the cost. It took just about 25 minutes again to complete the below query. Yet, when I limit the temp table to only 1 row, it comes back in 21ms. Something is not adding up...
EXPLAIN SELECT O.project_id, count(*)
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id GROUP BY O.project_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6189484.10..6189682.84 rows=19874 width=73)
Group Key: o.project_id
-> Nested Loop (cost=0.57..6155795.69 rows=6737683 width=65)
-> Seq Scan on temp_objects t (cost=0.00..120.10 rows=4910 width=16)
-> Index Scan using object_store_id_idx on object_store o (cost=0.57..1239.98 rows=1372 width=81)
Index Cond: (object_id = t.object_id)
(6 rows)
I'm on PostgreSQL 10.6, running 2 CPUs and 8GB of RAM on an SSD. I have ANALYZEd the tables, I have set the work_mem to 50MB, shared_buffers to 2GB, and have set the random_page_cost to 1. All helped the queries actually to come back in several minutes, but still not as fast as I feel it should be.
I have the option to go to cloud computing if CPUs/RAM/parallelization make a big difference. Just looking for suggestions on how to get this simple query to return in < few seconds (if possible).
* UPDATE *
Taking the hint from Jürgen Zornig, I changed both object_id fields to bigint, using just the top half of the UUID and cutting my data size in half. With that change, the aggregate query above now runs in ~16 min.
Next, taking jjanes's suggestion of set enable_nestloop to off, my aggregate query dropped to 6 min! Unfortunately, none of the other suggestions have gotten it below 6 min, although it's interesting that changing my "TEMPORARY" table to a permanent one allowed 2 workers to work on it; that didn't change the time. I think jjanes is right that the I/O is the binding factor here. Here is the latest explain plan from the 6-min run (I wish it were faster still, but it's better!):
explain (analyze, buffers, format text) select project_id, count(*) from object_store natural join temp_object group by project_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=3966899.86..3967396.69 rows=19873 width=73) (actual time=368124.126..368744.157 rows=153633 loops=1)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Sort (cost=3966899.86..3966999.23 rows=39746 width=73) (actual time=368124.116..368586.497 rows=333427 loops=1)
Sort Key: object_store.project_id
Sort Method: external merge Disk: 29392kB
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Gather (cost=3959690.23..3963863.56 rows=39746 width=73) (actual time=366476.369..366827.313 rows=333427 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Partial HashAggregate (cost=3958690.23..3958888.96 rows=19873 width=73) (actual time=366472.712..366568.313 rows=111142 loops=3)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Hash Join (cost=132.50..3944473.09 rows=2843429 width=65) (actual time=7.880..363848.830 rows=2681167 loops=3)
Hash Cond: (object_store.object_id = temp_object.object_id)
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Parallel Seq Scan on object_store (cost=0.00..3499320.53 rows=83317153 width=73) (actual time=0.467..324932.880 rows=66653718 loops=3)
Buffers: shared hit=242934 read=2423215
I/O Timings: read=870720.440
-> Hash (cost=71.11..71.11 rows=4911 width=8) (actual time=7.349..7.349 rows=4911 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 256kB
Buffers: shared hit=66
-> Seq Scan on temp_object (cost=0.00..71.11 rows=4911 width=8) (actual time=0.014..2.101 rows=4911 loops=3)
Buffers: shared hit=66
Planning time: 0.247 ms
Execution time: 368779.757 ms
(32 rows)
Time: 368780.532 ms (06:08.781)
So I'm at 6min per query now. I think with I/O costs, I may try for an in-memory store on this table if possible to see if getting it off SSD makes it even better.
UUIDs are (EDIT) working against adaptive cache management: because of their random nature they effectively drop the cache hit ratio once the index space is larger than memory. The ids cover a numerically wide range and are equally distributed, so in practice every id lands on its own leaf of the index tree. Since the index leaf determines which data page the row is saved in on disk, pretty much every row gets its own page, resulting in a whole lot of extremely expensive I/O operations to read all of these rows in.
That's the reason why it's generally not recommended to use UUIDs, and if you really need UUIDs then at least generate timestamp/mac-prefixed UUIDs (have a look at uuid_generate_v1() - https://www.postgresql.org/docs/9.4/uuid-ossp.html) that are numerically close to each other; that way the chances are higher that data rows are clustered together on fewer data pages, resulting in fewer I/O operations to get more data in.
Long story short: randomness over a large range kills your index (well, actually not the index itself; it results in a lot of expensive I/O to fetch data on reads and to maintain the index on writes) and therefore slows queries down to the point where it is as good as having no index at all.
Here is also an article for reference
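(A minimal sketch of the uuid-ossp route mentioned above for generating time-ordered UUIDs.)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
SELECT uuid_generate_v1();   -- timestamp/MAC-prefixed, so consecutive values land near each other in the index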
It looks like the centerpiece of your question is why it doesn't scale up from one input row to 5000 input rows linearly. But I think that this is a red herring. How are you choosing the one row? If you choose the same one row each time, then the data will stay in cache and will be very fast. I bet this is what you are doing. If you choose a different random one row each time you do a one-row plan, you will probably find the scaling to be more linear.
You should turn on track_io_timing. I have little doubt that IO is actually the bottleneck, but it is always nice to see it actually measured and reported; I have been surprised before.
The use of temporary table will inhibit parallel query. You might want to test with a permanent table, to see if you do get use of parallel workers, and if so, whether that actually helps. If you do this test, you should use your aggregation version of the query. They parallelize more efficiently than non-aggregation queries do, and if that is your ultimate goal that is what you should initially test with.
Another thing you could try is a large setting of effective_io_concurrency. But, that will only help if your plan uses bitmap scans to start with, which the plans you show do not. Setting random_page_cost from 1 to a slightly higher value might encourage it to use bitmap scans. (effective_io_concurrency is weird because bitmap plans can get a substantial realistic benefit from a higher setting, but the planner doesn't give bitmap plans any credit for that benefit they receive. So you must be "accidentally" using that plan already in order to get the benefit)
At some point (as you increase the number of rows in temp_objects) it is going to be faster to hash that table, and hashjoin it to a seq-scan of the object_store table. Is 5000 already past the point at which that would be faster? The planner clearly doesn't think so, but the planner never gets the cut-over point exactly right, and is often off by quite a bit. What happens if you set enable_nestloop TO off; before running your query?
Have you done low-level benchmarking of your SSD (outside of the database)? Assuming substantially all of your time is spent on IO reads and nearly none of those are fulfilled by the filesystem cache, you are getting 1576913/2887548 = 0.55ms per read. That seems pretty long. That is about what I get on a bottom-dollar laptop where the SSD is being exposed through a VM layer. I'd expect better than that from server-grade hardware.
Also be sure you have a proper index on the temp_objects table:
CREATE INDEX temp_object_id_idx ON temp_objects(object_id);
SELECT O.project_id
FROM temp_objects T
JOIN object_store O ON T.object_id = O.object_id;
Firstly: I would try to get the index into memory. What is shared_buffers set to? If it is small, let's make that bigger first. See if we can reduce the index scan IO.
Next: Are parallel queries enabled? I'm not sure that will help here very much because you have only 2 cpus, but it wouldn't hurt.
Even though the object column is completely random, I'd also bump up the statistics target on that column from the default (100, or something like that) to a few thousand. Then run ANALYZE again (or, for thoroughness, VACUUM ANALYZE); a sketch follows below.
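(A rough sketch of raising the per-column statistics target; the value 2000 is only an example.)
ALTER TABLE object_store ALTER COLUMN object_id SET STATISTICS 2000;
VACUUM ANALYZE object_store;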
Work Mem at 50M may be low too. It could potentially be larger if you don't have a lot of concurrent users and you have G's of RAM to work with. Too large and it can be counter productive, but you could go up a bit more to see if it helps.
You could try a CTAS (CREATE TABLE AS) from the big table into a new table sorted by object_id, so that the data isn't completely random.
There might be a crazy partitioning scheme you could come up with, if you were using PostgreSQL 12, that would group the object ids into some even partition distribution; a rough sketch of one such scheme is below.
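(A hedged sketch of what that could look like with hash partitioning, which exists from PostgreSQL 11 on; the names and modulus are hypothetical.)
CREATE TABLE object_store_part (
    project_id  UUID,
    object_id   UUID,
    object_name VARCHAR(20),
    description VARCHAR(80)
) PARTITION BY HASH (object_id);
CREATE TABLE object_store_part_0 PARTITION OF object_store_part
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...and so on for remainders 1 through 7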
We have a table that contains raw analytics (like Google Analytics and similar) numbers for views on our videos. It contains numbers like raw views, downloads, loads, etc. Each video is identified by a video_id.
Data is recorded per day, but because we need to extract a number of metrics, each day can contain multiple records for a specific video_id. Example:
date | video_id | country | source | downloads | etc...
----------------------------------------------------------------
2014-01-02 | 1 | us | facebook | 10 |
2014-01-02 | 1 | dk | facebook | 13 |
2014-01-02 | 1 | dk | admin | 20 |
I have a query where I need to get aggregate data for all videos that have new data beyond a certain date. To get the video IDs I do this query: SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id (alternatively I could do a DISTINCT(video_id) without a GROUP BY; performance is identical).
Once I have these IDs I need the total aggregate data (for all time). Combined, this turns into the following query:
SELECT
video_id,
SUM(downloads),
SUM(loads),
<more SUMs>
FROM
table
WHERE
video_id IN (SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id)
GROUP BY
video_id
There are around 10 columns we SUM (5-10 depending on the query). The EXPLAIN ANALYZE gives the following:
GroupAggregate (cost=2370840.59..2475948.90 rows=42537 width=72) (actual time=153790.362..162668.962 rows=87661 loops=1)
-> Sort (cost=2370840.59..2378295.16 rows=2981826 width=72) (actual time=153790.329..155833.770 rows=3285001 loops=1)
Sort Key: table.video_id
Sort Method: external merge Disk: 263528kB
-> Hash Join (cost=57066.94..1683266.53 rows=2981826 width=72) (actual time=740.210..143814.921 rows=3285001 loops=1)
Hash Cond: (table.video_id = table.video_id)
-> Seq Scan on table (cost=0.00..1550549.52 rows=5963652 width=72) (actual time=1.768..47613.953 rows=5963652 loops=1)
-> Hash (cost=56924.17..56924.17 rows=11422 width=8) (actual time=734.881..734.881 rows=87661 loops=1)
Buckets: 2048 Batches: 4 (originally 1) Memory Usage: 1025kB
-> HashAggregate (cost=56695.73..56809.95 rows=11422 width=8) (actual time=693.769..715.665 rows=87661 loops=1)
-> Index Only Scan using table_recent_ids on table (cost=0.00..52692.41 rows=1601328 width=8) (actual time=1.279..314.249 rows=1614339 loops=1)
Index Cond: (date >= '2014-01-01'::date)
Heap Fetches: 0
Total runtime: 162693.367 ms
As you can see, it's using a (quite big) external disk merge sort and taking a long time. I am unsure of why the sorts are triggered in the first place, and I am looking for a way to avoid it or at least minimize it. I know increasing work_mem can alleviate external disk merges, but in this case it seems to be excessive and having a work_mem above 500MB seems like a bad idea.
The table has two (relevant) indexes: One on video_id alone and another on (date, video_id).
EDIT: Updated query after running ANALYZE table.
Edited to match the revised query plan.
You are getting a sort because Postgres needs to sort the result rows to group them.
This query looks like it could really benefit from an index on table(video_id, date), or even just an index on table(video_id). Having such an index would likely avoid the need to sort.
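(A one-line sketch of the suggested index; "analytics_table" stands in for the anonymized table name used in the question.)
CREATE INDEX ON analytics_table (video_id, date);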
Edited (#2) to suggest an alternative.
You could also consider testing an alternative query such as this:
SELECT
video_id,
MAX(date) as latest_date,
<SUMs>
FROM
table
GROUP BY
video_id
HAVING
MAX(date) >= '2014-01-01'
That avoids any join or subquery, and given an index on table(video_id [, other columns]) it can be hoped that the sort will be avoided as well. It will compute the sums over the whole base table before filtering out the groups you don't want, but that operation is O(n), whereas sorting is O(m log m). Thus, if the date criterion is not very selective then checking it after the fact may be an improvement.
I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles having the highest score fitting to his input. So I used this query (here searching for all titles starting with "s")
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow to use for autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve that with a special index
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of factor 4. But can this be further improved? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting a decent performance with PostgreSQL in that case, too? Or should I better try a different solution (Lucene?, Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation:
Partition the table on the title key. This makes the lists that Postgres needs to work with smaller.
Give PostgreSQL more memory so the cache hit rate is > 98%. This table will take about 0.5 GB; I think 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats tables.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying.
The longer the substring, the faster the query will run. Create a separate table for small substrings, and store as the value the top ten (or however many) choices you would want to show. There are about 20000 combinations of 1-, 2-, and 3-character strings.
You can use the same idea if you want to have %abc% queries, but probably switching to Lucene makes sense at that point. A sketch of the small-substring table is below.
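(A hedged sketch of that substring table, here with a single prefix length of 3; the table name, prefix length, and top-10 cutoff are all illustrative choices.)
-- precompute the top 10 titles per lowercase 3-character prefix
CREATE TABLE title_prefix_top AS
SELECT prefix, title, score
FROM (
    SELECT substr(lower(title), 1, 3) AS prefix,
           title,
           score,
           row_number() OVER (PARTITION BY substr(lower(title), 1, 3)
                              ORDER BY score DESC) AS rn
    FROM nodes
) t
WHERE rn <= 10;
CREATE INDEX ON title_prefix_top (prefix);
-- autocomplete lookup for a 3-character prefix
SELECT title, score FROM title_prefix_top WHERE prefix = lower('sta') ORDER BY score DESC;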
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index, and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case.
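(A one-line sketch of that score index; the index name is made up.)
CREATE INDEX nodes_score_idx ON nodes (score DESC);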