I have a .exists() query in an app I am writing. I want to optimize it.
The current ORM expression yields SQL that looks like this:
SELECT DISTINCT
(1) AS "a",
"the_model"."id",
... snip every single column on the_model
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
The explain plan looks like this:
Limit (cost=176.60..176.63 rows=1 width=223)
-> Unique (cost=176.60..177.40 rows=29 width=223)
-> Sort (cost=176.60..176.67 rows=29 width=223)
Sort Key: id, ...SNIP...
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=223)
Index Cond: (user_id = 6)
Filter: ...SNIP...
If I manually modify the above SQL and remove the individual table columns so it looks like this:
SELECT DISTINCT
(1) AS "a",
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
the explain plan shows a couple fewer steps, which is great.
Limit (cost=0.43..175.89 rows=1 width=4)
-> Unique (cost=0.43..175.89 rows=1 width=4)
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ...SNIP...
I can go further by removing the DISTINCT keyword from the query, thus yielding an even shallower execution plan, although the cost saving here is minor:
Limit (cost=0.43..6.48 rows=1 width=4)
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ...SNIP...
I can modify the ORM expression using .only('id') to select only one field. However, that does not produce the result I would like: it still does an unnecessary sort on id. Ideally, I would like to do something like .only(None), since none of the columns are needed here; they only add weight.
I would also like to remove the DISTINCT keyword, if possible, although I don't think it adds much execution time once the columns are removed.
It seems like this could be done across the board, as .exists() returns a boolean. None of the returned columns are used for anything. They only complicate the query and reduce performance.
I found that during my QuerySet construction, .distinct() was being called before .exists(). That is what was causing the columns to be selected unnecessarily.
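For reference, with that .distinct() call removed, the SQL emitted by .exists() collapses to a minimal probe along these lines (a sketch only; the exact aliasing and quoting depend on the Django version):
SELECT (1) AS "a"
FROM "the_model"
WHERE (
...snip criteria...
)
LIMIT 1
With no table columns and no DISTINCT, the planner can stop at the first matching row, which is exactly the shallow plan shown above.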
Related
I'm running this query in our database:
select
(
select least(2147483647, sum(pb.nr_size))
from tb_pr_dc pd
inner join tb_pr_dc_bn pb on 1=1
and pb.id_pr_dc_bn = pd.id_pr_dc_bn
where 1=1
and pd.id_pr = pt.id_pr -- outer query column
)
from
(
select regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
) pt
;
This outputs 500 rows with a single result column and takes around 1 minute 43 seconds to run. The explain (analyze, verbose, buffers) outputs the following plan:
Subquery Scan on pt (cost=0.00..805828.19 rows=1000 width=8) (actual time=96.791..103205.872 rows=500 loops=1)
Output: (SubPlan 1)
Buffers: shared hit=373771 read=153484
-> Result (cost=0.00..22.52 rows=1000 width=4) (actual time=0.434..3.729 rows=500 loops=1)
Output: ((regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr)
-> ProjectSet (cost=0.00..5.02 rows=1000 width=32) (actual time=0.429..2.288 rows=500 loops=1)
Output: (regexp_split_to_table('[list of 500 ids]', ',')::integer id_pr
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
SubPlan 1
-> Aggregate (cost=805.78..805.80 rows=1 width=8) (actual time=206.399..206.400 rows=1 loops=500)
Output: LEAST('2147483647'::bigint, sum((pb.nr_size)::integer))
Buffers: shared hit=373771 read=153484
-> Nested Loop (cost=0.87..805.58 rows=83 width=4) (actual time=1.468..206.247 rows=219 loops=500)
Output: pb.nr_size
Inner Unique: true
Buffers: shared hit=373771 read=153484
-> Index Scan using tb_pr_dc_in05 on db.tb_pr_dc pd (cost=0.43..104.02 rows=83 width=4) (actual time=0.233..49.289 rows=219 loops=500)
Output: pd.id_pr_dc, pd.ds_pr_dc, pd.id_pr, pd.id_user_in, pd.id_user_ex, pd.dt_in, pd.dt_ex, pd.ds_mt_ex, pd.in_at, pd.id_tp_pr_dc, pd.id_pr_xz (...)
Index Cond: ((pd.id_pr)::integer = pt.id_pr)
Buffers: shared hit=24859 read=64222
-> Index Scan using tb_pr_dc_bn_pk on db.tb_pr_dc_bn pb (cost=0.43..8.45 rows=1 width=8) (actual time=0.715..0.715 rows=1 loops=109468)
Output: pb.id_pr_dc_bn, pb.ds_ex, pb.ds_md_dc, pb.ds_m5_dc, pb.nm_aq, pb.id_user, pb.dt_in, pb.ob_pr_dc, pb.nr_size, pb.ds_sg, pb.ds_cr_ch, pb.id_user_ (...)
Index Cond: ((pb.id_pr_dc_bn)::integer = (pd.id_pr_dc_bn)::integer)
Buffers: shared hit=348912 read=89262
Planning Time: 1.151 ms
Execution Time: 103206.243 ms
The logic is: for each id_pr chosen (from the list of 500 ids), calculate the sum of the integer column pb.nr_size associated with it, returning the lesser of that sum and the number 2,147,483,647. The result must contain 500 rows, one for each id, and we already know that each id matches at least one row in the subquery, so the query will not produce null values.
The index tb_pr_dc_in05 is a b-tree on id_pr only, which is of integer type. The index tb_pr_dc_bn_pk is a b-tree on the primary key id_pr_dc_bn only, which is also of integer type. Table tb_pr_dc has many rows for each id_pr: we have 209,217 unique id_prs in tb_pr_dc for a total of 13,910,855 rows. Table tb_pr_dc_bn has the same number of rows.
As can be seen, we defined 500 ids to query tb_pr_dc, finding 109,468 rows (less than 1% of the table size), and then found the same number of rows in tb_pr_dc_bn. In my opinion the indexes look fine and the number of rows to evaluate is small, so I can't understand why this query takes so long to run. Plenty of other queries that read far more data from other tables and do more calculations run fine. The DBA just ran a reindex and vacuum analyze, but the query is still just as slow. We are running PostgreSQL 11 on Linux, and I'm running this query on a replica with no concurrent access.
What could I be missing that could improve this query performance?
Thanks for your attention.
The time is spent jumping all over the table to find 109,468 randomly scattered rows, issuing random IO requests to do so. You can verify that by turning track_io_timing on and redoing the plans (probably just leave it turned on globally by default; the overhead is low and the value it produces is high), but I'm sure enough of this that I don't need to see that output before reaching the conclusion. The other queries that are faster are probably accessing fewer disk pages because they access data that is more tightly packed, or organized so that it can be read more sequentially. In fact, I would say your query is quite fast given how many pages it had to read.
You ask why so many columns are output in the internal nodes of the plan. The reason is that PostgreSQL often just passes around pointers to where the tuple lives in shared_buffers, and the tuple being pointed to has the columns that the table itself has. It could allocate memory in which to store a reformatted version of the tuple with the unnecessary columns stripped out, but that would generally be more work, not less. If there is a reason to copy and re-form the tuple anyway, it will remove the extraneous columns while it does so. But it won't do that without a reason.
One way to speed this up is to create indexes that enable index-only scans: one on tb_pr_dc (id_pr, id_pr_dc_bn) and one on tb_pr_dc_bn (id_pr_dc_bn, nr_size).
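As a sketch in SQL, using the table and column names from the question and letting PostgreSQL pick the default index names, those covering indexes would look like this:
CREATE INDEX ON tb_pr_dc (id_pr, id_pr_dc_bn);
CREATE INDEX ON tb_pr_dc_bn (id_pr_dc_bn, nr_size);
With both in place, each side of the nested loop can be answered from the index alone (provided the visibility map is reasonably up to date, e.g. after a VACUUM), avoiding most of the random heap fetches described above.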
If this isn't enough, there might be other ways to improve this too; but I can't think through them if I keep getting distracted by the long strings of unmemorable unpronounceable gibberish you have for table and column names.
PostgreSQL 11 isn't smart enough to use indexes with included columns?
CREATE INDEX organization_locations__org_id_is_headquarters__inc_location_id_ix
ON organization_locations(org_id, is_headquarters) INCLUDE (location_id);
ANALYZE organization_locations;
ANALYZE organizations;
EXPLAIN VERBOSE
SELECT location_id
FROM organization_locations ol
WHERE org_id = (SELECT id FROM organizations WHERE code = 'akron')
AND is_headquarters = 1;
QUERY PLAN
Seq Scan on organization_locations ol (cost=8.44..14.61 rows=1 width=4)
Output: ol.location_id
Filter: ((ol.org_id = $0) AND (ol.is_headquarters = 1))
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4)
Output: organizations.id
Index Cond: ((organizations.code)::text = 'akron'::text)
There are only 211 rows currently in organization_locations, average row length 91 bytes.
I get that only one data page needs to be loaded. But the I/O to grab the index page would be the same, and with the covering index the target data is right there (no extra lookup into the data page from the index). What is PG thinking with this plan?
This just creates a TODO for me to circle back and check to make sure the right plan starts getting generated once the table burgeons.
EDIT: Here is the explain with buffers:
Seq Scan on organization_locations ol (cost=8.44..14.33 rows=1 width=4) (actual time=0.018..0.032 rows=1 loops=1)
Filter: ((org_id = $0) AND (is_headquarters = 1))
Rows Removed by Filter: 210
Buffers: shared hit=7
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: ((code)::text = 'akron'::text)
Buffers: shared hit=4
Planning Time: 0.402 ms
Execution Time: 0.048 ms
Reading one index page is not cheaper than reading a table page, so with tiny tables you cannot expect a gain from an index-only scan.
Besides, did you
VACUUM organization_locations;
Without that, the visibility map won't show that the table block is all-visible, so you cannot get an index-only scan no matter what.
In addition to the other answers, this is probably a silly index to have in the first place. INCLUDE is good when you need a unique index but you also want to tack on a column which is not part of the unique constraint, or when the included column doesn't have btree operators and so can't be in the main body of the index. In other cases, you should just put the extra column in the index itself.
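A sketch of that plain multicolumn alternative, reusing the column names from the question (the index name is left to PostgreSQL's default):
CREATE INDEX ON organization_locations (org_id, is_headquarters, location_id);
With location_id as a key column rather than an INCLUDE column, it can also appear in index conditions and sort keys, and the index can still serve the query as an index-only scan once the table is large enough for that to pay off.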
This just creates a TODO for me to circle back and check to make sure the right plan starts getting generated once the table burgeons.
That is a problem with your workflow, and you can't expect PostgreSQL to solve it for you. Do you really think PostgreSQL should create actual plans based on imaginary scenarios?
I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases are implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to PostgreSQL.
The issue I've found is that the EXPLAIN ANALYZE times do not make sense to me, and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 separate queries would be better than just one optimized query!
The table, words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[...]), yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal... 37 seconds with one word in the list. Moving up to three words that return a total of 256 rows pushes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85 ms result makes me wonder if there is something else I'm missing with the VALUES and ARRAY approaches.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the like pattern starts with a constant string. This means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table; a trigram sketch follows below.
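Expanding on that last note: the pg_trgm extension provides a GIN operator class that can index LIKE patterns even when they contain leading wildcards. A minimal sketch, assuming the words table from the question and letting PostgreSQL name the index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON words USING gin (text gin_trgm_ops);
Each individual LIKE condition (for example each branch of the UNION rewrite above) can then be answered with a bitmap scan on this index instead of a sequential scan; whether that pays off on a table of this size is something to verify with EXPLAIN ANALYZE.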
I want to fetch users that have 1 or more processed bets. I do this using the following SQL:
SELECT user_id FROM bets
WHERE bets.state in ('guessed', 'losed')
GROUP BY user_id
HAVING count(*) > 0;
But running EXPLAIN ANALYZE I noticed that no index is used and the query execution time is very high. I tried adding a partial index like this:
CREATE INDEX processed_bets_index ON bets(state) WHERE state in ('guessed', 'losed');
But the EXPLAIN ANALYZE output did not change:
HashAggregate (cost=34116.36..34233.54 rows=9375 width=4) (actual time=235.195..237.623 rows=13310 loops=1)
Filter: (count(*) > 0)
-> Seq Scan on bets (cost=0.00..30980.44 rows=627184 width=4) (actual time=0.020..150.346 rows=626674 loops=1)
Filter: ((state)::text = ANY ('{guessed,losed}'::text[]))
Rows Removed by Filter: 20951
Total runtime: 238.115 ms
(6 rows)
There are only a few records with statuses other than guessed and losed.
How do I create proper index?
I'm using PostgreSQL 9.3.4.
I assume that state mostly consists of 'guessed' and 'losed', with maybe a few other states in there as well. So most probably the optimizer does not see the need to use the index, since it would still fetch most of the rows.
What you do need is an index on the user_id, so perhaps something like this would work:
CREATE INDEX idx_bets_user_id_in_guessed_losed ON bets(user_id) WHERE state in ('guessed', 'losed');
Or, by not using a partial index:
CREATE INDEX idx_bets_state_user_id ON bets(state, user_id);
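As an aside, the HAVING count(*) > 0 clause is redundant: GROUP BY only ever produces groups that contain at least one row. A sketch of the simplified query, which either of the indexes above can support:
SELECT DISTINCT user_id
FROM bets
WHERE state IN ('guessed', 'losed');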
How can I tune the PostgreSQL query plan (or the SQL query itself) to make more optimal use of available indexes when the query contains an SQL OR condition that uses the LIKE operator instead of =?
For example consider the following query that takes just 5 milliseconds to execute:
explain analyze select *
from report_workflow.request_attribute
where domain = 'externalId'
and scoped_id_value[2] = 'G130324135454100';
"Bitmap Heap Scan on request_attribute (cost=44.94..6617.85 rows=2081 width=139) (actual time=4.619..4.619 rows=2 loops=1)"
" Recheck Cond: (((scoped_id_value[2])::text = 'G130324135454100'::text) AND ((domain)::text = 'externalId'::text))"
" -> Bitmap Index Scan on request_attribute_accession_number (cost=0.00..44.42 rows=2081 width=0) (actual time=3.777..3.777 rows=2 loops=1)"
" Index Cond: ((scoped_id_value[2])::text = 'G130324135454100'::text)"
"Total runtime: 5.059 ms"
As the query plan shows, this query takes advantage of partial index request_attribute_accession_number and index condition scoped_id_value[2] = 'G130324135454100'. Index request_attribute_accession_number has the following definition:
CREATE INDEX request_attribute_accession_number
ON report_workflow.request_attribute((scoped_id_value[2]))
WHERE domain = 'externalId';
(Note that column scoped_id_value in table request_attribute has type character varying[].)
However, when I add an extra OR condition to the same query, using the same array column element scoped_id_value[2] but with the LIKE operator instead of =, the query, despite producing the same result, now takes 7553 ms:
explain analyze select *
from report_workflow.request_attribute
where domain = 'externalId'
and (scoped_id_value[2] = 'G130324135454100'
or scoped_id_value[2] like '%G130324135454100%');
"Bitmap Heap Scan on request_attribute (cost=7664.77..46768.27 rows=2122 width=139) (actual time=142.164..7552.650 rows=2 loops=1)"
" Recheck Cond: ((domain)::text = 'externalId'::text)"
" Rows Removed by Index Recheck: 1728712"
" Filter: (((scoped_id_value[2])::text = 'G130324135454100'::text) OR ((scoped_id_value[2])::text ~~ '%G130324135454100%'::text))"
" Rows Removed by Filter: 415884"
" -> Bitmap Index Scan on request_attribute_accession_number (cost=0.00..7664.24 rows=416143 width=0) (actual time=136.249..136.249 rows=415886 loops=1)"
"Total runtime: 7553.154 ms"
Notice that this time the query optimizer ignores the index condition scoped_id_value[2] = 'G130324135454100' when it performs the inner bitmap index scan using index request_attribute_accession_number, and consequently generates 415,886 rows instead of just two as the first query did.
When the OR condition with the LIKE operator is introduced into this second query, why does the optimizer produce a much less optimal plan than for the first? How can I tune the query optimizer, or the query itself, to perform more like the first query?
In the second plan, you have:
scoped_id_value[2] like '%G130324135454100%'
Postgres (like any other database) cannot use the index to solve this. Where would it look in the index? It doesn't even know where to start, so it has to do a full table scan.
You can handle this, for this one case, by building an index on an expression. However, that would be very specific to the string 'G130324135454100'.
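A minimal sketch of that expression index, using the names from the question (how you would spell it is my assumption, and the planner will only consider it for predicates that reference this exact expression):
CREATE INDEX ON report_workflow.request_attribute
((scoped_id_value[2] LIKE '%G130324135454100%'))
WHERE domain = 'externalId';
Whether the planner actually combines it with the existing partial index for the OR query is something to confirm with EXPLAIN.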
I should add, the problem is not the like. Postgres would use the index on:
scoped_id_value[2] like 'G130324135454100%'
It is not possible to short-circuit this expression:
scoped_id_value[2] = 'G130324135454100'
or scoped_id_value[2] like '%G130324135454100%'
into this one:
scoped_id_value[2] = 'G130324135454100'
because it would not catch rows with characters before or after the value, which would still be matched by:
scoped_id_value[2] like '%G130324135454100%'
The only short-circuit possible would be to keep just the last condition, and only if PostgreSQL realized that the core string (between the %s) in that condition is the same as in the equality condition.
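In other words, since the constant 'G130324135454100' trivially matches its own substring pattern, the equality arm is implied by the LIKE arm, and the only logically equivalent simplification is to drop the equality (a sketch of the rewrite, not something the planner will do for you):
select *
from report_workflow.request_attribute
where domain = 'externalId'
and scoped_id_value[2] like '%G130324135454100%';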