PostgreSQL query against millions of rows takes long on UUIDs

PostgreSQL query against millions of rows takes long on UUIDs - sql

I have a reference table for UUIDs that is roughly 200M rows. I have ~5000 UUIDs that I want to look up in the reference table. Reference table looks like:
CREATE TABLE object_store AS (
project_id UUID,
object_id UUID,
object_name VARCHAR(20),
description VARCHAR(80)
);
CREATE INDEX object_store_project_idx ON object_store(project_id);
CREATE INDEX object_store_id_idx ON object_store(object_id);
* Edit #2 *
Request for the temp_objects table definition.
CREATE TEMPORARY TABLE temp_objects AS (
object_id UUID
)
ON COMMIT DELETE ROWS;
The reason for the separate index is because object_id is not unique, and can belong to many different projects. The reference table is just a temp table of UUIDs (temp_objects) that I want to check (5000 object_ids).
If I query the above reference table with 1 object_id literal value, it's almost instantaneous (2ms). If the temp table only has 1 row, again, instantaneous (2ms). But with 5000 rows it takes 25 minutes to even return. Granted it pulls back >3M rows of matches.
* Edited *
This is for 1 row comparison (4.198 ms):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..475780.22 rows=494005 width=65) (actual time=0.038..2.631 rows=1194 loops=1)
Buffers: shared hit=1202, local hit=1
-> Seq Scan on temp_objects t (cost=0.00..13.60 rows=360 width=16) (actual time=0.007..0.009 rows=1 loops=1)
Buffers: local hit=1
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1307.85 rows=1372 width=81) (actual time=0.027..1.707 rows=1194 loops=1)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=1202
Planning time: 0.173 ms
Execution time: 3.096 ms
(9 rows)
Time: 4.198 ms
This is for 4911 row comparison (1579082.974 ms (26:19.083)):
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)SELECT O.project_id
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..3217316.86 rows=3507438 width=65) (actual time=0.041..1576913.100 rows=8043500 loops=1)
Buffers: shared hit=5185078 read=2887548, local hit=71
-> Seq Scan on temp_objects d (cost=0.00..96.56 rows=2556 width=16) (actual time=0.009..3.945 rows=4911 loops=1)
Buffers: local hit=71
-> Index Scan using object_store_id_idx on object_store l (cost=0.57..1244.97 rows=1372 width=81) (actual time=1.492..320.081 rows=1638 loops=4911)
Index Cond: (object_id = t.object_id)
Buffers: shared hit=5185078 read=2887548
Planning time: 0.169 ms
Execution time: 1579078.811 ms
(9 rows)
Time: 1579082.974 ms (26:19.083)
Eventually I want to group and get a count of the matching object_ids by project_id, using standard grouping. The aggregate is at the upper end (of course) of the cost. It took just about 25 minutes again to complete the below query. Yet, when I limit the temp table to only 1 row, it comes back in 21ms. Something is not adding up...
EXPLAIN SELECT O.project_id, count(*)
FROM temp_objects T JOIN object_store O ON T.object_id = O.object_id GROUP BY O.project_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6189484.10..6189682.84 rows=19874 width=73)
Group Key: o.project_id
-> Nested Loop (cost=0.57..6155795.69 rows=6737683 width=65)
-> Seq Scan on temp_objects t (cost=0.00..120.10 rows=4910 width=16)
-> Index Scan using object_store_id_idx on object_store o (cost=0.57..1239.98 rows=1372 width=81)
Index Cond: (object_id = t.object_id)
(6 rows)
I'm on PostgreSQL 10.6, running 2 CPUs and 8GB of RAM on an SSD. I have ANALYZEd the tables, I have set the work_mem to 50MB, shared_buffers to 2GB, and have set the random_page_cost to 1. All helped the queries actually to come back in several minutes, but still not as fast as I feel it should be.
I have the option to go to cloud computing if CPUs/RAM/parallelization make a big difference. Just looking for suggestions on how to get this simple query to return in < few seconds (if possible).
* UPDATE *
Taking the hint from Jürgen Zornig, I changed both object_id fields to be bigint, using just the top half of the UUID and reducing my datasize by half. Doing the aggregate query above the query now performs at ~16min.
Next, taking jjane's suggestion of set enable_nestloop to off, my aggregate query jumped to 6min! Unfortunately, all the other suggestions haven't sped it up past 6min, although it's interesting that changing my "TEMPORARY" table to a permanent one allowed 2 workers to work it, it didn't change the time. I think jjane is accurate by saying the IO is the binding factor here. Here is the latest explain plan from the 6min (wish it were faster, still, but it's better!):
explain (analyze, buffers, format text) select project_id, count(*) from object_store natural join temp_object group by project_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=3966899.86..3967396.69 rows=19873 width=73) (actual time=368124.126..368744.157 rows=153633 loops=1)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Sort (cost=3966899.86..3966999.23 rows=39746 width=73) (actual time=368124.116..368586.497 rows=333427 loops=1)
Sort Key: object_store.project_id
Sort Method: external merge Disk: 29392kB
Buffers: shared hit=243022 read=2423215, temp read=3674 written=3687
I/O Timings: read=870720.440
-> Gather (cost=3959690.23..3963863.56 rows=39746 width=73) (actual time=366476.369..366827.313 rows=333427 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Partial HashAggregate (cost=3958690.23..3958888.96 rows=19873 width=73) (actual time=366472.712..366568.313 rows=111142 loops=3)
Group Key: object_store.project_id
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Hash Join (cost=132.50..3944473.09 rows=2843429 width=65) (actual time=7.880..363848.830 rows=2681167 loops=3)
Hash Cond: (object_store.object_id = temp_object.object_id)
Buffers: shared hit=243022 read=2423215
I/O Timings: read=870720.440
-> Parallel Seq Scan on object_store (cost=0.00..3499320.53 rows=83317153 width=73) (actual time=0.467..324932.880 rows=66653718 loops=3)
Buffers: shared hit=242934 read=2423215
I/O Timings: read=870720.440
-> Hash (cost=71.11..71.11 rows=4911 width=8) (actual time=7.349..7.349 rows=4911 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 256kB
Buffers: shared hit=66
-> Seq Scan on temp_object (cost=0.00..71.11 rows=4911 width=8) (actual time=0.014..2.101 rows=4911 loops=3)
Buffers: shared hit=66
Planning time: 0.247 ms
Execution time: 368779.757 ms
(32 rows)
Time: 368780.532 ms (06:08.781)
So I'm at 6min per query now. I think with I/O costs, I may try for an in-memory store on this table if possible to see if getting it off SSD makes it even better.

UUIDs are (EDIT) working against adaptive cache management and, because of their random nature effectively dropping the cache hit ratio because the index space is larger than memory. Ids cover a numerically wide range equally distributed, so in fact every Id lands pretty much on its own leaf on the index tree. As the index leaf determines in which data page the row is saved in disk pretty much every row gets its own page resulting in a whole lot of extremely expensive I/O Operations to get all these rows read in.
That's the reason why its generally not recommended to use UUIDs and if you really need UUIDs then at least generate timestamp/mac-prefixed UUIDs (have a look at uuid_generate_v1() - https://www.postgresql.org/docs/9.4/uuid-ossp.html) that are numerically close to each other, therefore chances are higher that data rows are clustered together on lesser data Pages resulting in fewer I/O Operations to get more Data in.
Long Story Short: Randomness over a large range kills your index (well actually not the index, it results in a lot of expensive I/O to get data on reading and to maintain the index on writing) and therefore slows queries down to a point where it is as good as having no index at all.
Here is also an article for reference

It looks like the centerpiece of your question is why it doesn't scale up from one input row to 5000 input rows linearly. But I think that this is a red herring. How are you choosing the one row? If you choose the same one row each time, then the data will stay in cache and will be very fast. I bet this is what you are doing. If you choose a different random one row each time you do a one-row plan, you will probably find the scaling to be more linear.
You should turn on track_io_timing. I have little doubt that IO is actually the bottleneck, but it is always nice to see it actually measured and reported, I have been surprised before.
The use of temporary table will inhibit parallel query. You might want to test with a permanent table, to see if you do get use of parallel workers, and if so, whether that actually helps. If you do this test, you should use your aggregation version of the query. They parallelize more efficiently than non-aggregation queries do, and if that is your ultimate goal that is what you should initially test with.
Another thing you could try is a large setting of effective_io_concurrency. But, that will only help if your plan uses bitmap scans to start with, which the plans you show do not. Setting random_page_cost from 1 to a slightly higher value might encourage it to use bitmap scans. (effective_io_concurrency is weird because bitmap plans can get a substantial realistic benefit from a higher setting, but the planner doesn't give bitmap plans any credit for that benefit they receive. So you must be "accidentally" using that plan already in order to get the benefit)
At some point (as you increase the number of rows in temp_objects) it is going to be faster to hash that table, and hashjoin it to a seq-scan of the object_store table. Is 5000 already past the point at which that would be faster? The planner clearly doesn't think so, but the planner never gets the cut-over point exactly right, and is often off by quite a bit. What happens if you set enable_nestloop TO off; before running your query?
Have you done low-level benchmarking of your SSD (outside of the database)? Assuming substantially all of your time is spent on IO reads and nearly none of those are fulfilled by the filesystem cache, you are getting 1576913/2887548 = 0.55ms per read. That seems pretty long. That is about what I get on a bottom-dollar laptop where the SSD is being exposed through a VM layer. I'd expect better than that from server-grade hardware.

Be sure you have also a proper index for temp_objects table
CREATE INDEX temp_object_id_idx ON temp_objects(object_id);
SELECT O.project_id
FROM temp_objects T
JOIN object_store O ON T.object_id = O.object_id;

Firstly: I would try to get the index into memory. What is shared_buffers set to? If it is small, lets make that bigger first. See if we can reduce the index scan IO.
Next: Are parallel queries enabled? I'm not sure that will help here very much because you have only 2 cpus, but it wouldn't hurt.
Even though the object column is completely random, I'd also bump up the statistics on that table from the default (100 rows or something like that) to a few thousand rows. Then run Analyze again. (or for thoroughness, vacuum analyze)
Work Mem at 50M may be low too. It could potentially be larger if you don't have a lot of concurrent users and you have G's of RAM to work with. Too large and it can be counter productive, but you could go up a bit more to see if it helps.
You could try CTAS on the big table into a new table to sort object id so that it isn't completely random.
There might be a crazy partitioning scheme you could come up with if you were using PostgreSQL 12 that would group the object ids into some even partition distribution.

Related

How to optimize large database query accessed with an item ID?

I'm making a site that stores a large amount of data (8 data points for 313 item_ids every 10 seconds over 24 hr) and I serve that data to users on demand. The request is supplied with an item ID with which I query the database with something along the lines of SELECT * FROM current_day_data WHERE item_id = <supplied ID> (assuming the id is valid).
CREATE TABLE current_day_data (
"time" bigint,
item_id text NOT NULL,
-- some data,
id integer NOT NULL
);
CREATE INDEX item_id_lookup ON public.current_day_data USING btree (item_id);
This works fine, but the request takes about a third of a second, so I'm looking into either other database options to help optimize this, or some way to optimize the query itself.
My current setup is a PostgreSQL database with an index on the item ID column, but I feel like there's options in the realm of NoSQL (an area I'm unfamiliar with) due to it's similarity to a hash table.
My ideal solution would be a hash table with the item IDs as the key and the data as a JSON-like object but I don't know what options could achieve that.
tl;dr how to optimize SELECT * FROM current_day_data WHERE item_id = <supplied ID> through better querying or new database solution?
edit: here's the EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM current_day_data
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on current_day_data (cost=0.00..46811.09 rows=2584364 width=75) (actual time=0.013..291.667 rows=2700251 loops=1)
Buffers: shared hit=39058
Planning:
Buffers: shared hit=112
Planning Time: 0.584 ms
Execution Time: 446.622 ms
(6 rows)
EXPLAIN with a specified item id EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM current_day_data WHERE item_id = 'SUGAR_CANE';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on current_day_info (cost=33.40..12099.27 rows=8592 width=75) (actual time=2.949..12.236 rows=8627 loops=1)
Recheck Cond: (product_id = 'SUGAR_CANE'::text)
Heap Blocks: exact=8570
Buffers: shared hit=8619
-> Bitmap Index Scan on prod_id_lookup (cost=0.00..32.97 rows=8592 width=0) (actual time=1.751..1.751 rows=8665 loops=1)
Index Cond: (product_id = 'SUGAR_CANE'::text)
Buffers: shared hit=12
Planning:
Buffers: shared hit=68
Planning Time: 0.339 ms
Execution Time: 12.686 ms
(11 rows)
Now this says 12.7ms which makes me think the 300ms has something to do with the library I'm using (SQLAlchemy), but that wouldn't really make sense since it's a popular library. More specifically, the line I'm using is:
results = CurrentDayData.query.filter(CurrentDayData.item_id == item_id).all()

That’s a very simple query that uses an index, therefore the only way to possible speed it up would be to improve the specification of your hardware.
Moving to a different form of database, on the same hardware, is not going to make a significant difference in performance to this type of query.

Improve Postgre SQL query performance

I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
Which outputs 500 rows having a single result column and takes around 1 min and 43 secs to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (in the list of 500 ids) calculate the sum of the integer column pb.nr_size associated with them, returning the lesser value between this amount and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that they'll match at least one row in the subquery, so will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is of integer type also. Table tb_pr_dc has many rows for each id_pr. Actually, we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same amount of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size) and then finding the same amount looking in tb_pr_dc_bn. Imo, the indexes look fine and the amount of rows to evaluate is minimal, so I can't understand why it's taking so much time to run this query. A lot of other queries reading a lot more of data on other tables and doing more calculations are running fine. The DBA just ran a reindex and vacuum analyze, but still it's running the same slow way. We are running PostgreSQL 11 on Linux. I'm running this query in a replica without concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.

The time is spent jumping all over the table to find 109468 randomly scattered rows, issuing random IO requests to do so. You can verify that be turning track_io_timing on and redoing the plans (probably just leave it turned on globally and by default, the overhead is low and the value it produces is high), but I'm sure enough that I don't need to see that output before reaching this conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or is organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
You ask about why so many columns are output in the internal nodes of the plan. The reason for that is that PostgreSQL often just passes around pointers to where the tuple lives in the shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If it was a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do it without a reason.
One way to sped this up is to create indexes which will enable index-only scans. Those would be on tb_pr_dc (id_pr, id_pr_dc_bn) and on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.

PostgreSQL: Limit in query makes it not use index

I have a large table with BRIN index, and if I do query with limit it ignores the index and go for sequence scan, without it it uses index (I tried it several times with same results)
explain (analyze,verbose,buffers,timing,costs)
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293
limit 100
Limit (cost=0.00..349.26 rows=100 width=207) (actual time=28927.179..28927.214 rows=100 loops=1)
Output: id, date_key_trainjr...
Buffers: shared hit=225 read=1680241
-> Seq Scan on testj.cdc_s5_gpps_ind (cost=0.00..3894204.10 rows=1114998 width=207) (actual time=28927.175..28927.202 rows=100 loops=1)
Output: id, date_key_trainjr...
Filter: (cdc_s5_gpps_ind.id_transformace = 1293)
Rows Removed by Filter: 59204140
Buffers: shared hit=225 read=1680241
Planning Time: 0.149 ms
Execution Time: 28927.255 ms
explain (analyze,verbose,buffers,timing,costs)
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293
Bitmap Heap Scan on testj.cdc_s5_gpps_ind (cost=324.36..979783.34 rows=1114998 width=207) (actual time=110.103..467.008 rows=1073725 loops=1)
Output: id, date_key_trainjr...
Recheck Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
Rows Removed by Index Recheck: 11663
Heap Blocks: lossy=32000
Buffers: shared hit=32056
-> Bitmap Index Scan on gpps_brin_index (cost=0.00..45.61 rows=1120373 width=0) (actual time=2.326..2.326 rows=320000 loops=1)
Index Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
Buffers: shared hit=56
Planning Time: 1.343 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.540 ms, Inlining 32.246 ms, Optimization 44.423 ms, Emission 22.524 ms, Total 99.732 ms
Execution Time: 537.627 ms
Is there a reason for this behavior?
PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1
20191121 (Red Hat 8.3.1-5), 64-bit

There is a very simple (which is not to say good) reason for this. The planner assumes rows with id_transformace = 1293 are evenly distributed throughout the table, and so it will be able to collect 100 of them very quickly with a seq scan and then stop early. But this assumption is very wrong, and needs to go through a big chunk of the table to find 100 qualifying rows.
This assumption is not based on any statistics gathered on the table, so increasing the statistics target will not help. And extended statistics will not help either, as it only offers statistic between columns, not between a column and the physical ordering.
There are no good clean ways to solve this purely on the stock server side. One work-around is to set enable_seqscan=off before running the query, then reset afterwords. Another would be to add ORDER BY random() to your query, that way the planner knows it can't stop early. Or maybe the extension pg_hint_plan could help, I've never used it.
You might get it to change the plan by tweaking your some of your *_cost parameters, but that would likely make other things worse. Seeing the output of the EXPLAIN (ANALYZE, BUFFERS) of the LIMITed query run with enable_seqscan=off could inform that decision.

Since the column appears to be sparse/skew, you could try to increase the statistics size :
ALTER TABLE testj.cdc_s5_gpps_ind
ALTER COLUMN id_transformace SET STATISTICS 1000;
ANALYZE testj.cdc_s5_gpps_ind;
Postgres-11 and above also has extended statistics, allowing multi-column correlations to be recognised and exploited. You must have some understanding of the actual structure of the data in the table to use them effectively.

time-dependent postgres query speed

I have a situation where the select query could be done in 3 seconds or more than 1 hours still not finish (I could not wait that long and killed it). I believe it may have something to do with the automatic statistics collection behavior of postgres server. I have a 3 table join one of them has over 70 million rows.
-- tmp_variant_filtered has about 4000 rows
-- variant_quick > 70 million rows
-- filtered_variant_quick has about 70 k rows
select count(*)
from "tmp_variant_filtered" t join "variant_quick" v on getchrnum(t.seqname)=v.chrom
and t.pos_start=v.pos and t.ref=v.ref
and t.alt=v.alt
join "filtered_variant_quick" f on f.variantid=v.id
where v.samplerun=165
;
-- running the query immediately after tmp_variant_filtered was loaded
-- Query plan that will take > 1 hour and not finish
Aggregate (cost=332.05..332.06 rows=1 width=8)
-> Nested Loop (cost=0.86..332.05 rows=1 width=0)
-> Nested Loop (cost=0.57..323.74 rows=1 width=8)
Join Filter: ((t.pos_start = v.pos) AND ((t.ref)::text = (v.ref)::text) AND ((t.alt)::text = (v.alt)::text) AND (getchrnum(t.seqname) = v.chrom))
-> Seq Scan on tmp_variant_filtered t (cost=0.00..315.00 rows=1 width=1126)
-> Index Scan using variant_quick_samplerun_chrom_pos_ref_alt_key on variant_quick v (cost=0.57..8.47 rows=1 width=20)
Index Cond: (samplerun = 165)
-> Index Only Scan using filtered_variant_quick_pkey on filtered_variant_quick f (cost=0.29..8.31 rows=1 width=8)
Index Cond: (variantid = v.id)
-- running the query a few minutes after tmp_variant_filtered was loaded with copy command
-- query plan that will take less than 5 seconds to finish
Aggregate (cost=425.69..425.70 rows=1 width=8)
-> Nested Loop (cost=8.78..425.68 rows=1 width=0)
-> Hash Join (cost=8.48..417.37 rows=1 width=8)
Hash Cond: ((t.pos_start = v.pos) AND ((t.ref)::text = (v.ref)::text) AND ((t.alt)::text = (v.alt)::text))
Join Filter: (getchrnum(t.seqname) = v.chrom)
-> Seq Scan on tmp_variant_filtered t (cost=0.00..359.06 rows=4406 width=13)
-> Hash (cost=8.47..8.47 rows=1 width=20)
-> Index Scan using variant_quick_samplerun_chrom_pos_ref_alt_key on variant_quick v (cost=0.57..8.47 rows=1 width=20)
Index Cond: (samplerun = 165)
-> Index Only Scan using filtered_variant_quick_pkey on filtered_variant_quick f (cost=0.29..8.31 rows=1 width=8)
Index Cond: (variantid = v.id)
If you run the query immediately after the tmp table got populated, it will give you the plan as shown on top, and the query will take a very long time. If you wait a few minutes, the the plan will be the lower with hash-join. The cost estimate for the upper is less than the lower.
Since the query was embedded in some scripting language, the top plan is used and usually it got finished in a couple of hours. If I do this on a terminal, after I terminated the script, the lower plan would be used, and it usually take a couple of seconds to finish.
I even did an experiment by copying the tmp_variant_filtered table into another table, say 'test'. If I run the query immediately after the copy (manually, there will be a couple of seconds of delay), then I was stuck. Killing the current job, wait for a few minutes, the the same query become blazing fast.
It was long time ago that I was doing query tuning; now I am just starting to pick it up again. I am reading and trying to understand why postgres has such a behavior. Would appreciate the experts to give a hint.

Immediately after inserting the rows into the table, there are no statistics available for column values and their distribution. Thus the optimizer assumes the table is empty. The only sensible strategy to retrieve all rows from an (supposedly) empty table is to do a Seq Scan. You can see this assumption in the execution plan:
Seq Scan on tmp_variant_filtered t (cost=0.00..315.00 rows=1 width=1126)
The rows=1 means that the optimizer expects that only one row will be returned by the Seq Scan. Because it's only one row, the planner chooses a nested loop to do the join - which means the Seq Scan is done once for each row in the other table (you could see that more clearly if your use explain (analyze, verbose) to generate the execution plan)
The statistics are updated in the background by the "autovacuum daemon" if you don't do it manually. That's why after waiting a while, you see a better plan, as the optimizer now know the table isn't empty.
Once the optimizer has better knowledge of the size of the table, it chooses the much more efficient Hash Join to bring the two tables together - which means the Seq Scan is only executed once, rather than multiple times.
It is always recommended to run analyze (or vacuum analyze) on tables where you changed the number of rows significantly if you need a good execution plan immediately after populating the table.
Quote from the manual
Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This includes bulk loading large amounts of data into the table. Running ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning, leading to poor performance on any tables with inaccurate or nonexistent statistics

Regardless the mechanism for this time dependent behavior, I figure out a solution with VACUUM ANALYZE my_table. Not sure it is the cure or just give a little bit time delay. I was using psocopg2 to execute the query and had to avoid the 'cannot vacuum inside a transaction' exception. Here I list the code block you need:
self.conn.commit()
self.conn.set_session(autocommit=True)
self.cursor.execute("vacuum analyze {}".format(one_of_my_tables))
# here you probably should have used sql.SQL("...").format()
# to be more secure, I am using the text composition for example
self.conn.set_session(autocommit=False)
I applied to two of the three tables involved in my join in the question. Maybe apply vacuum analyze to one should be sufficient. As mentioned by Basil, I should have asked the question in the dba group.

How can I optimize this SQL query in Postgres?

I've got a pretty large table with nearly 1 million rows and some of the queries are taking a long time (over a minute).
Here is one that's giving me a particularly hard time...
EXPLAIN ANALYZE SELECT "apps".* FROM "apps" WHERE "apps"."kind" = 'software' ORDER BY itunes_release_date DESC, rating_count DESC LIMIT 12;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=153823.03..153823.03 rows=12 width=2091) (actual time=162681.166..162681.194 rows=12 loops=1)
-> Sort (cost=153823.03..154234.66 rows=823260 width=2091) (actual time=162681.159..162681.169 rows=12 loops=1)
Sort Key: itunes_release_date, rating_count
Sort Method: top-N heapsort Memory: 48kB
-> Seq Scan on apps (cost=0.00..150048.41 rows=823260 width=2091) (actual time=0.718..161561.149 rows=808554 loops=1)
Filter: (kind = 'software'::text)
Total runtime: 162682.143 ms
(7 rows)
So, how would I optimize that? PG version is 9.2.4, FWIW.
There are already indexes on kind and kind, itunes_release_date.

Looks like you're missing an index, e.g. on (kind, itunes_release_date desc, rating_count desc).

How big is the apps table? Do you have at least this much memory allocated to postgres? If it's having to read from disk every time, query speed will be much slower.
Another thing that may help is to cluster the table on the 'apps' column. This may speed up disk access since all the software rows will be stored sequentially on disk.

The only way to speed up this query is to create a composite index on (itunes_release_date, rating_count). It will allow Postgres to pick first N rows from the index directly.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas