Postgres index for finding the most recent row - sql

I've got a table with 1.5 million rows. It will get significantly bigger than this. I have a very simple query that I think I should be able to index to be lightning fast.
select * from instant_power_reads where site_id = 22 order by read_at desc limit 1;
I've been reading around and see most people saying that a combined index on site_id and read_at should take care of this, but I can't get it to work.
My index looks like this:
CREATE INDEX idx_ipr_site_read
ON public.instant_power_reads USING btree
(site_id ASC NULLS LAST, read_at DESC NULLS LAST)
TABLESPACE pg_default;
When I run the query, it takes about a third of a second, which I believe is too long. This is what the explain tells me. It doesn't seem to be using the index like I think it should. Shouldn't this be able to do a quick "Index Only Scan"?
Limit (cost=10904.20..10904.21 rows=1 width=30)
-> Sort (cost=10904.20..10959.73 rows=22209 width=30)
Sort Key: read_at DESC
-> Bitmap Heap Scan on instant_power_reads (cost=516.55..10793.16 rows=22209 width=30)
Recheck Cond: (site_id = 22)
-> Bitmap Index Scan on idx_ipr_site_read (cost=0.00..511.00 rows=22209 width=0)
Index Cond: (site_id = 22)
I know I have other options. My main idea is to maintain a separate temp table that just contains the latest records for each site. But I'd like to learn what I'm doing wrong here.
EDIT: I was asked to post the EXPLAIN (ANALYZE, BUFFERS) for the basic index.
Limit (cost=0.43..2.24 rows=1 width=30) (actual time=0.027..0.028 rows=1 loops=1)
Buffers: shared hit=4
-> Index Scan Backward using idx_ipr_site_read on instant_power_reads (cost=0.43..40405.75 rows=22288 width=30) (actual time=0.026..0.026 rows=1 loops=1)
Index Cond: (site_id = 22)
Buffers: shared hit=4
Planning Time: 0.124 ms
Execution Time: 0.042 ms
EDIT 2: I accepted Laurenz' answer, but wanted to specify that the key to figuring this out was running "EXPLAIN (ANALYZE, BUFFERS)" because it shows the actual planning and execution time, which may be much different from the time your GUI (like pgAdmin) is showing you.

Looks like the obvious index is the right choice:
CREATE INDEX ON instant_power_reads (site_id, read_at);
If you measure a long execution time on the client side, it is probably network latency.

Related

Small result query with LIMIT 1000x slower than queries with >100 rows

I am trying to debug a query that runs faster the more records it returns but performance severely degrades (>10x slower) with smaller returns (i.e. <10 rows) using small LIMIT (ie 10).
Example:
Fast query with 5 results out of 1M rows - no LIMIT
SELECT *
FROM transaction_internal_by_addresses
WHERE address = 'foo'
ORDER BY block_number desc;
Explain:
Sort (cost=7733.14..7749.31 rows=6468 width=126) (actual time=0.030..0.031 rows=5 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Sort Key: transaction_internal_by_addresses.block_number
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=10
-> Index Scan using transaction_internal_by_addresses_pkey on public.transaction_internal_by_addresses (cost=0.69..7323.75 rows=6468 width=126) (actual time=0.018..0.021 rows=5 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Index Cond: (transaction_internal_by_addresses.address = 'foo'::text)
Buffers: shared hit=10
Query Identifier: -8912211611755432198
Planning Time: 0.051 ms
Execution Time: 0.041 ms
Fast query with 5 results out of 1M rows: - High LIMIT
SELECT *
FROM transaction_internal_by_addresses
WHERE address = 'foo'
ORDER BY block_number desc
LIMIT 100;
Limit (cost=7570.95..7571.20 rows=100 width=126) (actual time=0.024..0.025 rows=5 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Buffers: shared hit=10
-> Sort (cost=7570.95..7587.12 rows=6468 width=126) (actual time=0.023..0.024 rows=5 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Sort Key: transaction_internal_by_addresses.block_number DESC
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=10
-> Index Scan using transaction_internal_by_addresses_pkey on public.transaction_internal_by_addresses (cost=0.69..7323.75 rows=6468 width=126) (actual time=0.016..0.020 rows=5 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Index Cond: (transaction_internal_by_addresses.address = 'foo'::text)
Buffers: shared hit=10
Query Identifier: 3421253327669991203
Planning Time: 0.042 ms
Execution Time: 0.034 ms
Slow query: - Low LIMIT
SELECT *
FROM transaction_internal_by_addresses
WHERE address = 'foo'
ORDER BY block_number desc
LIMIT 10;
Explain result:
Limit (cost=1000.63..6133.94 rows=10 width=126) (actual time=10277.845..11861.269 rows=0 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Buffers: shared hit=56313576
-> Gather Merge (cost=1000.63..3333036.90 rows=6491 width=126) (actual time=10277.844..11861.266 rows=0 loops=1)
" Output: address, block_number, log_index, transaction_hash"
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=56313576
-> Parallel Index Scan Backward using transaction_internal_by_address_idx_block_number on public.transaction_internal_by_addresses (cost=0.57..3331263.70 rows=1623 width=126) (actual time=10256.995..10256.995 rows=0 loops=5)
" Output: address, block_number, log_index, transaction_hash"
Filter: (transaction_internal_by_addresses.address = 'foo'::text)
Rows Removed by Filter: 18485480
Buffers: shared hit=56313576
Worker 0: actual time=10251.822..10251.823 rows=0 loops=1
Buffers: shared hit=11387166
Worker 1: actual time=10250.971..10250.972 rows=0 loops=1
Buffers: shared hit=10215941
Worker 2: actual time=10252.269..10252.269 rows=0 loops=1
Buffers: shared hit=10191990
Worker 3: actual time=10252.513..10252.514 rows=0 loops=1
Buffers: shared hit=10238279
Query Identifier: 2050754902087402293
Planning Time: 0.081 ms
Execution Time: 11861.297 ms
DDL
create table transaction_internal_by_addresses
(
address text not null,
block_number bigint,
log_index bigint not null,
transaction_hash text not null,
primary key (address, log_index, transaction_hash)
);
alter table transaction_internal_by_addresses
owner to "icon-worker";
create index transaction_internal_by_address_idx_block_number
on transaction_internal_by_addresses (block_number);
So my questions
Should I just be looking at ways to force the query planner to apply the WHERE on the address (primary key)?
As you can see in the explain, the row block_number is scanned in the slow query but I am not sure why. Can anyone explain?
Is this normal? Seems like the more data, the harder the query, not the other way around as in this case.
Update
Apologies for A, the delay in responding and B, some of the inconsistencies in this question.
I have updated the EXPLAIN clearly showing the 1000x performance degradation
A multicolumn BTREE index on (address, block_number DESC) is exactly what the query planner needs to generate the result sets you mentioned. It will random-access the index to the first eligible row, then read the rows out in sequential order until it hits the LIMIT. You can also omit the DESC with no ill effects.
create index address_block_number
on transaction_internal_by_addresses
(address, block_number DESC);
As for asking "why" about query planner results, that's often an enduring mystery.
Sub-millisecond differences are hardly predictable so you're pretty much staring at noise, random miniscule differences caused by other things happening on the system. Your fastest query runs in tens of microseconds, slowest in a single millisecond - all of these are below typical network, mouse click, screen refresh latencies.
The planner already applies a where on your address: Index Cond: (a_table.address = 'foo'::text)
You're ordering by block_number, so it makes sense to scan it. It's also taking place in all three of your queries because they all do that.
It is normal - here's an online demo with similar differences. If what you're after is some reliable time estimation, use pgbench to run your queries multiple times and average out the timings.
Your third query plan seems to be coming from a different query, against a different table: a_table, compared to the initial two: transaction_internal_by_addresses.
If you were just wondering why these timings look like that, it's pretty much random and/or irrelevant at this level. If you're facing some kind of problem because of how these queries behave, it'd be better to focus on describing that problem - the queries themselves all do the same thing and the difference in their execution times is negligible.
Should I just be looking at ways to force the query planner to apply the WHERE on the address (primary key)?
Yes, it can be improve performance
As you can see in the explain, the row block_number is scanned in the slow query but I am not sure why. Can anyone explain?
Because Sort keys are different. Look carefully:
Sort Key: transaction_internal_by_addresses.block_number DESC
Sort Key: a_table.a_indexed_row DESC
it seems a_table.a_indexed_row has less performant stuff (eg more columns, more complex structure etc.)
Is this normal? Seems like the more data, the harder the query, not the other way around as in this case.
Normally more queries cause more time. But as I mentioned above, maybe a_table.a_indexed_row returns more values, has more columns etc.

Simple aggregate query running slow

I am trying to determine why a fairly simple aggregate query is taking so long to perform on a single table. The table is called plots, and it is [id, device_id, time, ...] There are two indices, UNIQUE(id) and UNIQUE(device_id, time).
The query is simply:
SELECT device_id, MIN(time)
FROM plots
GROUP BY device_id
To me, this should be very fast, but it is taking 3+ minutes. The table has ~45 million rows, divided roughly equally among 1200 or so device_id's.
EXPLAIN for the query:
Finalize GroupAggregate (cost=1502955.41..1503055.97 rows=906 width=12)
Group Key: device_id
-> Gather Merge (cost=1502955.41..1503052.35 rows=906 width=12)
Workers Planned: 1
-> Sort (cost=1501955.41..1501955.86 rows=906 width=12)
Sort Key: device_id
-> Partial HashAggregate (cost=1501943.79..1501946.51 rows=906 width=12)
Group Key: device_id
-> Parallel Seq Scan on plots (cost=0.00..1476417.34 rows=25526447 width=12)
EXPLAIN for query with a where device_id = xxx:
GroupAggregate (cost=398.86..78038.77 rows=906 width=12)
Group Key: device_id
-> Bitmap Heap Scan on plots (cost=398.86..77992.99 rows=43065 width=12)
Recheck Cond: (device_id = 6780)
-> Bitmap Index Scan on index_plots_on_device_id_and_time (cost=0.00..396.71 rows=43065 width=0)
Index Cond: (device_id = 6780)
I have done VACUUM (FULL, ANALYZE) as well as REINDEX DATABASE.
I have also tried doing partition queries to accomplish the same.
Any pointers on making this faster? Or am I just boned on the table size. It seems like it should be fine with the index though. Maybe I am missing something...
EDIT / UPDATE:
The problem seems to be resolved at this point, though I am not sure why. I have dropped and rebuilt the index many times, and suddenly the query is only taking ~7 seconds, which is acceptable. Of note, this morning I dropped the index and created a new one with the reverse column order (time, device_id) and I was surprised to see good results. I then reverted to the previous index, and the results were improved further. I will refork the production database and try to retrace my steps and post an update. Should I be worried about the query planner wonking out in the future?
Current EXPLAIN with analysis (as requested):
Finalize GroupAggregate (cost=1000.12..480787.58 rows=905 width=12) (actual time=36.299..7530.403 rows=916 loops=1)
Group Key: device_id
Buffers: shared hit=135087 read=40325
I/O Timings: read=138.419
-> Gather Merge (cost=1000.12..480783.96 rows=905 width=12) (actual time=36.226..7552.052 rows=1829 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Partial GroupAggregate (cost=0.11..479687.58 rows=905 width=12) (actual time=15.779..5026.094 rows=914 loops=2)
Group Key: device_id
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
-> Parallel Index Only Scan using index_plots_time_and_device_id on plots (cost=0.11..454158.41 rows=25526447 width=12) (actual time=0.033..2999.764 rows=21697480 loops=2)
Heap Fetches: 0
Buffers: shared hit=509502 read=160807
I/O Timings: read=639.797
Planning Time: 0.092 ms
Execution Time: 7554.100 ms
(19 rows)
Approach 1:
You can try to remove your UNIQUE to an index on your database. CREATE UNIQUE INDEX and CREATE INDEX have different behaviors. I believe that you can get benefits from CREATE INDEX.
Approach 2:
You can create a materialized view. If you can get some delay on your information, you can do the following:
CREATE MATERIALIZED VIEW myreport AS
SELECT device_id,
MIN(time) AS mintime
FROM plots
GROUP BY device_id
CREATE INDEX myreport_device_id ON myreport(device_id);
Also, you need to remember to regularly do:
REFRESH MATERIALIZED VIEW CONCURRENTLY myreport;
And less regularly do:
VACUUM ANALYZE myreport

Improve PostgresSQL aggregation query performance

I am aggregating data from a Postgres table, the query is taking approx 2 seconds which I want to reduce to less than a second.
Please find below the execution details:
Query
select
a.search_keyword,
hll_cardinality( hll_union_agg(a.users) ):: int as user_count,
hll_cardinality( hll_union_agg(a.sessions) ):: int as session_count,
sum(a.total) as keyword_count
from
rollup_day a
where
a.created_date between '2018-09-01' and '2019-09-30'
and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
group by
a.search_keyword
order by
session_count desc
limit 100;
Table metadata
Total number of rows - 506527
Composite Index on columns : tenant_id and created_date
Query plan
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1722.685..1722.694 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=5454 dbname=postgres
-> Limit (cost=64250.24..64250.49 rows=100 width=42) (actual time=1783.087..1783.106 rows=100 loops=1)
-> Sort (cost=64250.24..64558.81 rows=123430 width=42) (actual time=1783.085..1783.093 rows=100 loops=1)
Sort Key: ((hll_cardinality(hll_union_agg(sessions)))::integer) DESC
Sort Method: top-N heapsort Memory: 33kB
-> GroupAggregate (cost=52933.89..59532.83 rows=123430 width=42) (actual time=905.502..1724.363 rows=212633 loops=1)
Group Key: search_keyword
-> Sort (cost=52933.89..53636.53 rows=281055 width=54) (actual time=905.483..1351.212 rows=280981 loops=1)
Sort Key: search_keyword
Sort Method: external merge Disk: 18496kB
-> Seq Scan on rollup_day a (cost=0.00..17890.22 rows=281055 width=54) (actual time=29.720..112.161 rows=280981 loops=1)
Filter: ((created_date >= '2018-09-01'::date) AND (created_date <= '2019-09-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
Rows Removed by Filter: 225546
Planning Time: 0.129 ms
Execution Time: 1786.222 ms
Planning Time: 0.103 ms
Execution Time: 1722.718 ms
What I've tried
I've tried with indexes on tenant_id and created_date but as the data is huge so it's always doing sequence scan rather than an index scan for filters. I've read about it and found, the Postgres query engine switch to sequence scan if the data returned is > 5-10% of the total rows. Please follow the link for more reference.
I've increased the work_mem to 100MB but it only improved the performance a little bit.
Any help would be really appreciated.
Update
Query plan after setting work_mem to 100MB
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1375.926..1375.935 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=5454 dbname=postgres
-> Limit (cost=48348.85..48349.10 rows=100 width=42) (actual time=1307.072..1307.093 rows=100 loops=1)
-> Sort (cost=48348.85..48633.55 rows=113880 width=42) (actual time=1307.071..1307.080 rows=100 loops=1)
Sort Key: (sum(total)) DESC
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=38285.79..43996.44 rows=113880 width=42) (actual time=941.504..1261.177 rows=172945 loops=1)
Group Key: search_keyword
-> Sort (cost=38285.79..38858.52 rows=229092 width=54) (actual time=941.484..963.061 rows=227261 loops=1)
Sort Key: search_keyword
Sort Method: quicksort Memory: 32982kB
-> Seq Scan on rollup_day_104290 a (cost=0.00..17890.22 rows=229092 width=54) (actual time=38.803..104.350 rows=227261 loops=1)
Filter: ((created_date >= '2019-01-01'::date) AND (created_date <= '2019-12-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
Rows Removed by Filter: 279266
Planning Time: 0.131 ms
Execution Time: 1308.814 ms
Planning Time: 0.112 ms
Execution Time: 1375.961 ms
Update 2
After creating an index on created_date and increased work_mem to 120MB
create index date_idx on rollup_day(created_date);
The total number of rows is: 12,124,608
Query Plan is:
Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=2635.530..2635.540 rows=100 loops=1)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=localhost port=9702 dbname=postgres
-> Limit (cost=73545.19..73545.44 rows=100 width=51) (actual time=2755.849..2755.873 rows=100 loops=1)
-> Sort (cost=73545.19..73911.25 rows=146424 width=51) (actual time=2755.847..2755.858 rows=100 loops=1)
Sort Key: (sum(total)) DESC
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=59173.97..67948.97 rows=146424 width=51) (actual time=2014.260..2670.732 rows=296537 loops=1)
Group Key: search_keyword
-> Sort (cost=59173.97..60196.85 rows=409152 width=55) (actual time=2013.885..2064.775 rows=410618 loops=1)
Sort Key: search_keyword
Sort Method: quicksort Memory: 61381kB
-> Index Scan using date_idx_102913 on rollup_day_102913 a (cost=0.42..21036.35 rows=409152 width=55) (actual time=0.026..183.370 rows=410618 loops=1)
Index Cond: ((created_date >= '2018-01-01'::date) AND (created_date <= '2018-12-31'::date))
Filter: (tenant_id = '12850a62-19ac-477d-9cd7-837f3d716885'::uuid)
Planning Time: 0.135 ms
Execution Time: 2760.667 ms
Planning Time: 0.090 ms
Execution Time: 2635.568 ms
You should experiment with higher settings of work_mem until you get an in-memory sort. Of course you can only be generous with memory if your machine has enough of it.
What would make your query way faster is if you store pre-aggregated data, either using a materialized view or a second table and a trigger on your original table that keeps the sums in the other table updated. I don't know if that is possible with your data, as I don't know what hll_cardinality and hll_union_agg are.
Have you tried a Covering indexes, so the optimizer will use the index, and not do a sequential scan ?
create index covering on rollup_day(tenant_id, created_date, search_keyword, users, sessions, total);
If Postgres 11
create index covering on rollup_day(tenant_id, created_date) INCLUDE (search_keyword, users, sessions, total);
But since you also do a sort/group by on search_keyword maybe :
create index covering on rollup_day(tenant_id, created_date, search_keyword);
create index covering on rollup_day(tenant_id, search_keyword, created_date);
Or :
create index covering on rollup_day(tenant_id, created_date, search_keyword) INCLUDE (users, sessions, total);
create index covering on rollup_day(tenant_id, search_keyword, created_date) INCLUDE (users, sessions, total);
One of these indexes should make the query faster. You should only add one of these indexes.
Even if it makes this query faster, having big indexes will/might make your write operations slower (especially HOT updates are not available on indexed columns). And you will use more storage.
Idea came from here , there is also an hint about size for the work_mem
Another example where the index was not used
use the table partitions and create a composite index it will bring down the total cost as:
it will save huge cost on scans for you.
partitions will segregate data and will be very helpful in future purge operations as well.
I have personally tried and tested table partitions with such cases and the throughput is amazing with the combination of
partitions & composite indexes.
Partitioning can be done on the range of created date and then composite indexes on date & tenant.
Remember you can always have a composite index with a condition in it if there is a very specific requirement for the condition in your query. This way the data will be sorted already in the index and will save huge costs for sort operations as well.
Hope this helps.
PS: Also, is it possible to share any test sample data for the same?
my suggestion would be to break up the select.
Now what I would try also in combination with this to setup 2 indices on the table. One on the Dates the other on the ID. One of the problem with weird IDs is, that it takes time to compare and they can be treated as string compare in the background. Thats why the break up, to prefilter the data before the between command is executed. Now the between command can make a select slow. Here I would suggest to break it up into 2 selects and inner join (I now the memory consumption is a problem).
Here is an example what I mean. I hope the optimizer is smart enough to restructure your query.
SELECT
a.search_keyword,
hll_cardinality( hll_union_agg(a.users) ):: int as user_count,
hll_cardinality( hll_union_agg(a.sessions) ):: int as session_count,
sum(a.total) as keyword_count
FROM
(SELECT
*
FROM
rollup_day a
WHERE
a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885') t1
WHERE
a.created_date between '2018-09-01' and '2019-09-30'
group by
a.search_keyword
order by
session_count desc
Now if this does not work then you need more specific optimizations. For example. Can the total be equal to 0, then you need filtered index on the data where the total is > 0. Are there any other criteria that make it easy to exclude rows from the select.
The next consideration would be to create a row where there is a short ID (instead of 62850a62-19ac-477d-9cd7-837f3d716885 -> 62850 ), that can be a number and that would make preselection very easy and memory consumption less.

Why is PostgreSQL 11 optimizer refusing best plan which would use an index w/ included column?

PostgreSQL 11 isn't smart enough to use indexes with included columns?
CREATE INDEX organization_locations__org_id_is_headquarters__inc_location_id_ix
ON organization_locations(org_id, is_headquarters) INCLUDE (location_id);
ANALYZE organization_locations;
ANALYZE organizations;
EXPLAIN VERBOSE
SELECT location_id
FROM organization_locations ol
WHERE org_id = (SELECT id FROM organizations WHERE code = 'akron')
AND is_headquarters = 1;
QUERY PLAN
Seq Scan on organization_locations ol (cost=8.44..14.61 rows=1 width=4)
Output: ol.location_id
Filter: ((ol.org_id = $0) AND (ol.is_headquarters = 1))
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4)
Output: organizations.id
Index Cond: ((organizations.code)::text = 'akron'::text)
There are only 211 rows currently in organization_locations, average row length 91 bytes.
I get only loading one data page. But the I/O is the same to grab the index page and the target data is right there (no extra lookup into the data page from the index). What is PG thinking with this plan?
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
EDIT: Here is the explain with buffers:
Seq Scan on organization_locations ol (cost=8.44..14.33 rows=1 width=4) (actual time=0.018..0.032 rows=1 loops=1)
Filter: ((org_id = $0) AND (is_headquarters = 1))
Rows Removed by Filter: 210
Buffers: shared hit=7
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: ((code)::text = 'akron'::text)
Buffers: shared hit=4
Planning Time: 0.402 ms
Execution Time: 0.048 ms
Reading one index page is not cheaper than reading a table page, so with tiny tables you cannot expect a gain from an index-only scan.
Besides, did you
VACUUM organization_locations;
Without that, the visibility map won't show that the table block is all-visible, so you cannot get an index-only scan no matter what.
In addition to the other answers, this is probably a silly index to have in the first place. INCLUDE is good when you need a unique index but you also want to tack on a column which is not part of the unique constraint, or when the included column doesn't have btree operators and so can't be in the main body of the index. In other cases, you should just put the extra column in the index itself.
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
This is your workflow problem that you can't expect PostgreSQL to solve for you. Do you really think PostgreSQL should create actual plans based on imaginary scenarios?

Order of columns in INNER JOIN condition affects the performance badly

I have two tables which links to each other like this:
Table answered_questions with the following columns and indexes:
id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table called questions)
indexes: (taken_test_id, question_id)
Table taken_tests
id: primary key
user_id: (foreign key, links to table Users)
indexes: user_id column
First query (with EXPLAIN ANALYZE output):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.00
2..0.003 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.276 ms
Execution time: 2.365 ms
(7 rows)
Another query (this is generated automatically by Rails when using joins method in ActiveRecord)
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
And here is the output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=2.07
1..3.195 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.302 ms
Execution time: 1258.035 ms
(7 rows)
The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.
Do you have any idea why this happens? I read some articles and it says that the order of columns in JOIN condition should not affect the execution time (ex: Best practices for the order of joined columns in a sql join?)
I am using Postgres 9.6. There are more than 40 million rows in answered_questions table and more than 3 million rows in taken_tests table
Update 1:
When I ran the EXPLAIN with (analyze true, verbose true, buffers true), I got a much better result for the second query (quite similar to the first query)
EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE)
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, a
nswered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Buffers: shared hit=1986
-> Index Scan using index_taken_tests_on_user_id on public.taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
Output: taken_tests.id
Index Cond: (taken_tests.user_id = 1)
Buffers: shared hit=269
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual ti
me=0.002..0.003 rows=2 loops=371)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated
_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Index Cond: (answered_questions.taken_test_id = taken_tests.id)
Buffers: shared hit=1717
Planning time: 0.238 ms
Execution time: 2.335 ms
As you can see from the initial EXPLAIN ANALYZE statement results -- the queries are resulting in the equivalent query plan and are executed exactly the same.
The difference comes from the very same unit's execution time:
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483rows=371 loops=1)
and
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474rows=371 loops=1)
As the commenters already pointed out (see documentation links in the wuestion comments), the query plan for an inner join is expected to be the same regardless of the table order. It is ordered based on the query planner decisions. This means that you should really look at other performance-optimisation parts of the query execution. One of those would be memory used for caching (SHARED BUFFER). It looks like the query results would depend a lot on whether this data has already been loaded into memory. Just as you have noticed -- the query execution time grows after you have waited some time. This clearly indicates the cache expiry issue more than the plan problem.
Increasing the size of the shared buffers may help resolve it, but the initial execution of the query will always take longer -- this is just your disk access speed.
For more hints on memory configuration of Pg database see here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Note: VACUUM or ANALYZE commands will be unlikely to help here. Both queries are using the same plan already. Keep in mind, though, that due to PostgreSQL transaction isolation mechanism (MVCC) it may have to read the underlying table rows to validate that they are still visible to the current transaction after getting the results from the index. This could be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.