I have to extract the DB for a piece of licensed software to an external DB server.
The DB has to be Postgres, and I cannot change the SELECT query issued by the application (I cannot change the source code).
The table (it has to be one table) holds around 6.5M rows and has unique values in the main column (prefix).
All requests are reads; there are no inserts/updates/deletes. There are ~200k selects/day with peaks of 15 TPS.
The select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
EXPLAIN ANALYZE shows the following:
Limit (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
-> Sort (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on table (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms
The slowest part of the query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which alone takes ~1.6 s (I tested only this part of the query).
The plan for that part tested separately:
Seq Scan on table (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
Filter: ('004366491997142'::text ~~ (prefix)::text)
Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About the data itself:
values in the "prefix" column share the same first several digits (the first 5); the rest are different, unique values.
Postgres version is 9.5
I've changed the following Postgres settings:
random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB
I have tried several index types (unique, GIN, GiST, hash), but in all cases the indexes are not used (as shown in the explain above) and the resulting speed is the same.
I have also run the following, with no visible improvement:
vacuum analyze verbose table
Please recommend DB settings and/or an index configuration to speed up the execution time of this query.
Current HW is:
i5, SSD, 16GB RAM on Win7, but I have the option to buy stronger HW.
As I understand it, for workloads where reads dominate (no inserts/updates), faster CPU cores matter much more than the number of cores or disk speed. Please confirm.
Add-on 1:
After adding 9 indexes, no index is used either.
Add-on 2:
1) I found the reason the index is not used: it is the operand order in the LIKE part of the query. If the query were:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
it uses the index.
Notice the difference between the original query, which does not use the index:
... WHERE '00436641997142%' like prefix ...
and the query which uses the index correctly:
... WHERE prefix like '00436641997142%' ...
Since I cannot change the query itself, any idea how to overcome this? I can change the data and Postgres settings, but not the query itself.
2) Also, I installed Postgres 9.6 in order to use parallel seq. scan. In this case, a parallel scan is used only if the last part of the query is omitted. So, this query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force the original query (which I cannot change):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It's hard to make an index for queries of the form string LIKE pattern because the wildcards (% and _) can appear anywhere.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width - for example char(32), or any length sufficient for the task. Also add one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
            prefix_low            |           prefix_high            | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
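The existing rows could then be back-filled in one pass. A minimal sketch, assuming the table/column names above and that the stored prefix values are LIKE patterns (possibly ending in %):
UPDATE table
SET prefix_low    = rpad(rtrim(prefix, '%'), 32, '0'),
    prefix_high   = rpad(rtrim(prefix, '%'), 32, '9'),
    prefix_length = length(rtrim(prefix, '%'));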
Make an index on these values:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Check the modified query against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with the indexes and try to tune it - add/remove an index on prefix_length, add it to the (prefix_low, prefix_high) index, and so on.
Now you need to rewrite the queries sent to the database. Install PgBouncer with the pgbouncer-rr patch. It lets you rewrite queries on the fly with simple Python code, like in this example:
import re

def rewrite_query(username, query):
    # The pattern is only a sketch - tweak it against the real queries your application sends.
    q1 = r"^SELECT .* WHERE '(?P<number>\d+)' LIKE prefix\b.* ORDER BY position\('%' in prefix\) ASC, char_length\(prefix\) DESC LIMIT "
    m = re.match(q1, query, re.IGNORECASE | re.DOTALL)
    if not m:
        return query  # nothing to do with other queries
    new_query = query  # ... build the rewritten query here, e.g. using m.group('number')
    return new_query
Run PgBouncer and connect it to the DB. Try issuing different queries the way your application does and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all required queries and rewrite them properly.
When the proxy is ready and debugged, reconnect your application to PgBouncer.
Pro:
no changes to application
no changes to basic structure of DB
Contra:
extra maintenance - you need triggers to keep the new columns up to date (a sketch follows below)
extra tools to support
the rewrite uses regexps, so it's closely tied to the particular queries issued by your app. You need to run it for some time and make the rewrite rules robust.
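For the maintenance point above, a sketch of such a trigger (function and trigger names are made up; column names as introduced earlier):
CREATE OR REPLACE FUNCTION sync_prefix_bounds() RETURNS trigger AS $$
BEGIN
    -- derive the helper columns from the prefix pattern
    NEW.prefix_low    := rpad(rtrim(NEW.prefix, '%'), 32, '0');
    NEW.prefix_high   := rpad(rtrim(NEW.prefix, '%'), 32, '9');
    NEW.prefix_length := length(rtrim(NEW.prefix, '%'));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_sync_prefix_bounds
BEFORE INSERT OR UPDATE ON table
FOR EACH ROW EXECUTE PROCEDURE sync_prefix_bounds();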
Further development:
hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating a proxy server which rewrites queries could be the solution here.
Here is an example from another question.
Then you could change "LIKE" to "=" in your query, and it would run a lot faster.
You should change your index by adding the proper operator class; according to the documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and
bpchar_pattern_ops support B-tree indexes on the types text, varchar,
and char respectively. The difference from the default operator
classes is that the values are compared strictly character by
character rather than according to the locale-specific collation
rules. This makes these operator classes suitable for use by queries
involving pattern matching expressions (LIKE or POSIX regular
expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
Related
I am designing a table that has a jsonb column representing permissions, with the following format:
[
{"role": 5, "perm": "view"},
{"role": 30, "perm": "edit"},
{"role": 52, "perm": "view"}
]
TL;DR
How do I convert such jsonb value into an SQL array of integer roles? In this example, it would be '{5,30,52}'::int[]. I have some solutions but none are fast enough. Keep reading...
Each logged-in user has some roles (one or more). The idea is to filter the records using the overlap operator (&&) on int[].
SELECT * FROM data WHERE extract_roles(access) && '{1,5,17}'::int[]
I am looking for the extract_roles function/expression that can also be used in the definition of an index:
CREATE INDEX data_roles ON data USING gin ((extract_roles(access)))
jsonb in Postgres seems to have broad support for building and transforming but less for extracting values - SQL arrays in this case.
What I tried:
create or replace function extract_roles(access jsonb) returns int[]
language sql
strict
parallel safe
immutable
-- with the following bodies:
-- (0) 629ms
select translate(jsonb_path_query_array(access, '$.role')::text, '[]', '{}')::int[]
-- (1) 890ms
select array_agg(r::int) from jsonb_path_query(access, '$.role') r
-- (2) 866ms
select array_agg((t ->> 'role')::int) from jsonb_array_elements(access) as x(t)
-- (3) 706ms
select f1 from jsonb_populate_record(row('{}'::int[]), jsonb_build_object('f1', jsonb_path_query_array(access, '$.role'))) as x (f1 int[])
When the index is used, the query is fast. But there are two problems with these expressions:
some of the functions are only stable, not immutable; this also applies to the cast. Am I allowed to mark my function as immutable? The immutability is required by the index definition.
they are slow; the planner does not use the index in some scenarios, and then the query can become really slow (times above are on a table with 3M records):
explain (analyse)
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id
limit 100
with the following plan (same for all variants above; prefers scanning the index associated with the primary key, gets sorted results and hopes that it finds 100 of them soon):
Limit (cost=1000.45..2624.21 rows=100 width=247) (actual time=40.668..629.193 rows=100 loops=1)
-> Gather Merge (cost=1000.45..476565.03 rows=29288 width=247) (actual time=40.667..629.162 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using data_pkey on data (cost=0.43..472184.44 rows=12203 width=247) (actual time=25.522..513.463 rows=35 loops=3)
Filter: (extract_roles(access) && '{-3,99}'::integer[])
Rows Removed by Filter: 84918
Planning Time: 0.182 ms
Execution Time: 629.245 ms
Removing the LIMIT clause paradoxically makes it fast:
Gather Merge (cost=70570.65..73480.29 rows=24938 width=247) (actual time=63.263..75.710 rows=40094 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=69570.63..69601.80 rows=12469 width=247) (actual time=59.870..61.569 rows=13365 loops=3)
Sort Key: id
Sort Method: external merge Disk: 3744kB
Worker 0: Sort Method: external merge Disk: 3232kB
Worker 1: Sort Method: external merge Disk: 3160kB
-> Parallel Bitmap Heap Scan on data (cost=299.93..68722.36 rows=12469 width=247) (actual time=13.823..49.336 rows=13365 loops=3)
Recheck Cond: (extract_roles(access) && '{-3,99}'::integer[])
Heap Blocks: exact=9033
-> Bitmap Index Scan on data_roles (cost=0.00..292.44 rows=29926 width=0) (actual time=9.429..9.430 rows=40094 loops=1)
Index Cond: (extract_roles(access) && '{-3,99}'::integer[])
Planning Time: 0.234 ms
Execution Time: 77.719 ms
Is there any better and faster way to extract int[] from a jsonb? Because I cannot rely on the planner always using the index. Playing with COST of the extract_roles function helps a bit (planner starts using the index for LIMIT 1000) but even an insanely high value does not force the index for LIMIT 100.
Comments:
If there is none, I will probably store the information in another column roles int[], which is fast but takes extra space and requires extra treatment (this can be solved using generated columns on Postgres 12+, which Azure still does not provide, or a trigger, or in the application logic).
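For illustration, the generated-column variant on Postgres 12+ could look like this (the index name is made up, and it only works if extract_roles may be declared immutable, as discussed above):
ALTER TABLE data
    ADD COLUMN roles int[] GENERATED ALWAYS AS (extract_roles(access)) STORED;

CREATE INDEX data_roles_int ON data USING gin (roles);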
Looking into the future, will there be any better support in Postgres 15? Maybe JSON_QUERY but I don’t see any immediate improvement because its RETURNING clause probably refers to the whole result and not its elements.
Maybe jsonb_populate_record could also consider non-composite types (its signature allows it) such as:
select jsonb_populate_record(null::int[], '[123,456]'::jsonb)
The two closest questions are:
Extract integer array from jsonb within postgres 9.6
Cast postgresql jsonb value as array of int and remove element from it
Reaction to suggested normalization:
Normalization is probably not viable. But let's follow the train of thought.
I assume that the extra table would look like this: *_perm (id, role, perm). There would be an index on id and another index on role.
Because a user has multiple roles, it could join multiple records for the same id, which would cause multiplication of the records in the data table and force a group by aggregation.
A GROUP BY is bad for performance because it prevents some optimizations. I am designing a building block, so there can be, for example, two data tables at play:
select pd.*, jsonb_agg(to_jsonb(pp))
from posts_data pd
join posts_perm pp on pd.id = pp.id
where exists(
    select 1
    from comments_data cd
    join comments_perm cp on cp.id = cd.id
    where cd.post_id = pd.id
      and cd.reputation > 100
      and cp.role in (3,34,52)
    -- no group by needed due to semi-join
)
and pp.role in (3,34,52)
group by pd.id
order by pd.title
limit 10
If I am not mistaken, this query will require the aggregation of all records before they are sorted. No index can help here. That will never be fast with millions of records. Moreover, there is non-trivial logic behind group by usage - it is not always needed.
What if we did not need to return the permissions but only cared about its existence?
select pd.*
from posts_data pd
where exists(
    select 1
    from posts_perm pp
    where pp.id = pd.id
      and pp.role in (3,34,52)
)
and exists(
    select 1
    from comments_data cd
    where cd.post_id = pd.id
      and cd.reputation > 100
      and exists(
          select 1
          from comments_perm cp
          where cp.id = cd.id
            and cp.role in (3,34,52)
      )
)
order by pd.title
limit 10
Then we don't need any aggregation - the database will simply issue a SEMI-JOIN. If there is an index on title, the database may consider using it. We can even fetch the permissions in the projection; something like this:
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pp.id = pd.id) perm
...
Here a nested-loop join will be issued for only the few (10) returned records. I will test this approach.
Another option is to keep the data in both tables - the data table would only store an int[] of roles. Then we save a JOIN and only fetch from the permission table at the end. Now we need an index that supports array operations - GIN.
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp where pp.id = pd.id) perm
from posts_data pd
where pd.roles && '{3,34,52}'::int[]
and exists(
    select 1
    from comments_data cd
    where cd.post_id = pd.id
      and cd.roles && '{3,34,52}'::int[]
      and cd.reputation > 100
)
order by pd.title
limit 10
Because we always aggregate all permissions for the returned records (their interpretation happens in the application, so it does not matter that we return all of them), we can store the post_perms as JSON. Because we never need to work with the values in SQL, storing them directly in the data table seems reasonable.
We will need to support some bulk-sharing operations later that update the permissions for many records, but that is much rarer than selects. Because of this we could favor jsonb instead.
The projection does not need the select of permissions anymore:
select pd.*
...
But now the roles column is redundant - we have the same information in the same table, just in JSON format. If we can write a function that extracts just the roles, we can directly index it.
And we are back at the beginning. But it looks like the extract_roles function is never going to be fast, so we need to keep the roles column.
Another reason for keeping permissions in the same table is the possibility of combining multiple indices using Bitmap And and avoiding a join.
There will be a huge bias in the roles. Some are going to be present on almost all rows (admin can edit everything), others will be rare (John Doe can only access these 3 records that were explicitly shared with him). I am not sure how well statistics will work on the int[] approach but so far my tests show that the GIN index is used when the role is infrequent (high selectivity).
It looks like the core problem here is the classic one with WHERE...ORDER BY...LIMIT, that the planner assumes all of the qualifying rows are scattered evenly throughout the ordering. But that isn't the case here: rows meeting your && condition are selectively deficient in low-numbered "id". So it has to walk that index far farther than it thought it would need to before it catches the LIMIT.
There is nothing you can do (in any version) to get the planner to estimate that better. You could just prevent that index from being used by rewriting it to order by id+0. But then it wouldn't use that plan even when it would truly be faster, like the admin who is on everything. (Which by the way seems like a bad idea--an exceptional user should probably be handled exceptionally, not shoehorned into the normal system).
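Spelled out, the id+0 rewrite mentioned above looks like this; the expression hides the pkey ordering from the planner, so it can no longer pick the walk-the-id-index plan:
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id + 0
limit 100;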
The immutable extraction function certainly is slow, but if the above planning problem were fixed that wouldn't matter. Making the function faster would probably require some compiled code, and Azure surely would not let you link the .so file into their managed server.
Because the JSON has a regular structure (int, text), I also considered two alternative storages:
create a composite type role of (int, text) and store the array role[]; extract_roles function is still needed;
store two arrays int[] and text[].
The latter one won for the following reasons:
smallest disk space (important for queries that require seq scan);
no need for extract_roles function - the int array is stored directly;
no need for functional index;
easy append (but same is true for JSON);
the library that I am using (jOOQ) has a good binding for arrays so working with them may even be more pleasant than with a JSON.
Disadvantages are:
harder removal - you need to unnest and re-aggregate.
I have a simple select query:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id
Then I want to get the first 100 results, so I use this:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id LIMIT 100
The problem is that the second query works much slower than the first one. It takes less than a second to execute the first query and more than a minute to execute the second one.
These are execution plans for the queries:
without limit:
Sort (cost=26201.43..26231.42 rows=11994 width=72)
Sort Key: entity_id
-> Index Scan using entity_type_id_idx on entities (cost=0.00..24895.34 rows=11994 width=72)
Index Cond: (entity_type_id = 1)
with limit:
Limit (cost=0.00..8134.39 rows=100 width=72)
-> Index Scan using xpkentities on entities (cost=0.00..975638.85 rows=11994 width=72)
Filter: (entity_type_id = 1)
I don't understand why these two plans are so different and why the performance decreases so much. How should I tweak the second query to make it work faster?
I use PostgreSQL 9.2.
You want the 100 smallest entity_id's matching your condition. Now - if those were numbers 1..100 then clearly using the entity_id index is the best way to handle this - everything is pre-sorted. In fact, if the 100 you wanted were in the range 1..200 then it still makes sense. Probably 1..1000 would.
So - PostgreSQL thinks it will find lots of entity_type_id=1 values at the "start" of the table. It estimates a cost of 8134 vs 26231 to filter by type then sort. In your case it is wrong.
Now - either there is some correlation which isn't obvious (that's bad - we can't tell the planner about that at present), or we don't have up-to-date or sufficient stats.
Does an ANALYZE entities make any difference? You can see what values the planner knows about by reading the planner-stats page in the manuals.
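A quick way to check both things at once, using the standard pg_stats view that the planner-stats page describes:
ANALYZE entities;

SELECT attname, n_distinct, most_common_vals, correlation
FROM pg_stats
WHERE tablename = 'entities'
  AND attname = 'entity_type_id';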
In PostgreSQL 8.4, this query:
explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from ALGSA
where laonr=1 and kuupaev <='9999-12-31' and
kuupaev||kellaaeg <= '9999-12-3123 59'
takes 3 seconds to run:
"Aggregate (cost=3164.49..3164.50 rows=1 width=10) (actual time=2714.269..2714.270 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..3110.04 rows=21778 width=10) (actual time=0.105..1418.743 rows=70708 loops=1)"
" Filter: ((kuupaev <= '9999-12-31'::date) AND (laonr = 1::numeric) AND ((kuupaev || (kellaaeg)::text) <= '9999-12-3123 59'::text))"
"Total runtime: 2714.363 ms"
How can I speed it up in PostgreSQL 8.4.4?
Table structure is below.
The algsa table has an index on kuupaev; maybe this can be used?
Or is it possible to change the query or add some other index to make it fast? Existing columns in the table cannot be changed.
CREATE TABLE firma1.algsa
(
  id serial NOT NULL,
  laonr numeric(2,0),
  kuupaev date NOT NULL,
  kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
  ... other columns
  CONSTRAINT algsa_pkey PRIMARY KEY (id),
  CONSTRAINT algsa_id_check CHECK (id > 0)
);
CREATE INDEX algsa_kuupaev_idx ON firma1.algsa USING btree (kuupaev);
Update
Tried analyze verbose firma1.algsa;
INFO: analyzing "firma1.algsa"
INFO: "algsa": scanned 1640 of 1640 pages, containing 70708 live rows and 13 dead rows; 30000 rows in sample, 70708 estimated total rows
Query returned successfully with no result in 1185 ms.
but the query run time was still 2.7 seconds.
Why are there 30000 rows in the sample? Isn't that too much; should it be decreased?
This was a known issue in old versions of PostgreSQL - but it looks like it might've been resolved by 8.4; in fact, the docs for 8.0 have the caveat but the docs for 8.1 do not.
So you don't need to upgrade major versions for this reason, at least. You should however upgrade to the current 8.4 series release, 8.4.16, as you're missing several years' worth of bug fixes and tweaks.
The real problem here is that you're using max on an expression, not a simple value, and there's no functional index for that expression.
You could try creating an index on the expression kuupaev||kellaaeg ... but I suspect you have data model problems, and that there's a better solution by fixing your data model.
It looks like kuupaev is kuupäev, or date, and kellaaeg might be time. If so: never use the concatenation (||) operator for combining dates and times; use interval addition, eg kuupaev + kellaaeg. Instead of char you should be using the data type time or interval with a CHECK constraint for kellaaeg, depending on what it means and whether it's limited to 24 hours or not. Or, better still, use a single field of type timestamp (for local time) or timestamp with time zone (for global time) to store the combined date and time.
If you do this, you can create a simple index on the combined column that replaces both kellaaeg and kuupaev and use that for min and max among other things. If you need just the date part or just the time part for some things, use the date_trunc, extract and date_part functions; see the documentation.
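A sketch of that combined-column approach (the new column name is made up, and the to_timestamp format mask is a guess that must be adjusted to however kellaaeg is actually stored):
ALTER TABLE firma1.algsa ADD COLUMN toimumisaeg timestamp;

UPDATE firma1.algsa
   SET toimumisaeg = kuupaev + to_timestamp(kellaaeg, 'HH24 MI')::time;

CREATE INDEX algsa_toimumisaeg_idx ON firma1.algsa (toimumisaeg);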
See this earlier answer for another example of where separate date and time columns are a bad idea.
You should still plan an upgrade to 9.2. The upgrade path from 8.4 to 9.2 isn't too rough, you really just have to watch out for the setting of standard_conforming_strings on by default and the change of bytea_output from escape to hex. Both can be set back to the 8.4 defaults during transition and porting work. 8.4 won't be supported for much longer.
My first instinct would be to try an index:
create index algsa_laonr_kuupaev_kellaaeg_idx
on ALGSA (laonr asc, (kuupaev||kellaaeg) desc)
... and try the query as:
SELECT kuupaev||kellaaeg as res
from ALGSA
where laonr=1 and
kuupaev||kellaaeg <= '9999-12-3123 59'
order by
laonr asc,
kuupaev||kellaaeg desc
limit 1
I need the lowest value for runnerId.
This query:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ;
takes 80 ms (1968 result rows).
This:
SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794' ;
takes 1600 ms.
Is there a faster way to find the minimum, or should I calculate the min in my Java program?
"Result (cost=100.88..100.89 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Limit (cost=0.00..100.88 rows=1 width=9)"
" -> Index Scan using runneridindex on betlog (cost=0.00..410066.33 rows=4065 width=9)"
" Index Cond: ("runnerId" IS NOT NULL)"
" Filter: ("marketId" = 107416794::bigint)"
CREATE INDEX marketidindex
ON betlog
USING btree
("marketId" COLLATE pg_catalog."default");
Another idea:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" LIMIT 1 >1600ms
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" >>100ms
How can a LIMIT slow the query down?
What you need is a multi-column index:
CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");
If interested, you'll find in-depth information about multi-column indexes in PostgreSQL, links and benchmarks under this related question on dba.SE.
How did I figure?
In a multi-column index, rows are ordered (and thereby clustered) by the first column of the index ("marketId"), and each cluster is in turn ordered by the second column of the index - so the first row matches the condition min("runnerId"). This makes the index scan extremely fast.
Concerning the paradox effect of LIMIT slowing down a query - the Postgres query planner has a weakness there. The common workaround is to use a CTE (not necessary in this case). Find more information under this recent, closely related question:
PostgreSQL query taking too long
The min statement will be executed by PostgreSQL using a sequential scan of the entire table. You could optimize the query using the following approach:
SELECT col FROM sometable ORDER BY col ASC LIMIT 1;
When you had an index on ("runnerId") (or at least with "runnerId" as the high order column) but did not have the index on ("marketId", "runnerId") it compared the cost of passing all rows with a matching "marketId" using the index on that column and picking out the minimum "runnerId" from that set to the cost of scanning using the index on "runnerId" and stopping when it found the first row with a matching "marketId". Based on available statistics and the assumption that "marketId" values would be randomly distributed within the index entries for the index on "runnerId" it estimated a lower cost for the latter approach.
It also estimated the cost of scanning the whole table and picking the minimum from matching rows as well as probably a number of other alternatives. It does not always use a certain type of plan, but compares costs of all the alternatives.
The problem is that the assumption that values will be randomly distributed in the range is not necessarily true (as in this example), leading to a scan of a high percentage of the range to find the rows lurking at the end. For some values of "marketId", where the chosen value is available near the beginning of the "runnerId" index, this plan should be very fast.
There has been discussion in the PostgreSQL developer community of how we might bias against plans which are "risky" in terms of running long if the data distribution is not what was assumed, and there has been work on tracking multi-column statistics so that correlated values don't run into such problems. Expect improvements in this area in the next few releases. Until then, Erwin's suggestions are on target for how to work around the issue.
Basically it comes down to making a more attractive plan available or introducing an optimization barrier. In this case you can provide a more attractive option by adding the index on ("marketId", "runnerId") -- which allows a very direct way to get straight to the answer. The planner assigns a very low cost to that alternative, causing it to be chosen. If you preferred not to add the index, you could force an optimization barrier by doing something like this:
SELECT min("runnerId")
FROM (SELECT "runnerId" FROM betlog
WHERE "marketId" = '107416794'
OFFSET 0) x;
When there is an OFFSET clause (even for an offset of zero) it forces the subquery to be planned separately and its results fed to the outer query. I would expect this to run in 80 ms rather than the 1600 ms you get without the optimization barrier. Of course, if you can add the index, the speed of the query when data is cached should be less than 1 ms.
UPDATE: Crap! It's not an integer, it's character varying(10).
Executing the query like this uses the index
SELECT t."FieldID"
FROM table t
WHERE t."FieldID" = '0123456789'
But it does not use the index if I execute this
SELECT t."FieldID"
FROM table t
WHERE t."FieldID" LIKE '01%'
or this
SELECT t."FieldID"
FROM table t
WHERE "substring"(t."FieldID", 0, 3) = '01'
also this
SELECT t."FieldID"
FROM table t
WHERE t."FieldID" ~ '^01'
My index looks like this
CREATE UNIQUE INDEX fieldid_index
ON "table"
USING btree
("FieldID");
Running PostgreSQL 7.4 (Yep Upgrading)
I'm optimizing my query and wanted to know if there are any performance gains from using one of the three types of expressions in either the SELECT or WHERE clause of the statement.
NOTE: The query that executes with this style of constraint returns around 200,000 records.
Example data is character varying(10): 0123456789, and the column is indexed as well.
1. (Substring)
SELECT CASE
WHEN "substring"(t."FieldID"::text, 0, 3) = '01'::text
THEN 'Found Match'::text
ELSE NULL::text
END AS matching_group
2. (Like)
SELECT CASE
WHEN t."FieldID"::text LIKE '01%'
THEN 'Found Match'::text
ELSE NULL::text
END AS matching_group
3. (RegEx)
SELECT CASE
WHEN t."FieldID" ~ '^01'
THEN 'Found Match'::text
ELSE NULL::text
END AS matching_group
Also is there any performance advantages using one over the other in the WHERE clause?
1. (Substring)
WHERE CASE
WHEN "substring"(t."FieldID"::text, 0, 3) = '01'::text
THEN 1
ELSE 0
END = 1
2. (Like)
WHERE CASE
WHEN t."FieldID"::text LIKE '01%'
THEN 1
ELSE 0
END = 1
3. (RegEx)
WHERE CASE
WHEN t."FieldID" ~ '^01'
THEN 1
ELSE 0
END = 1
Would using one option in the SELECT and a different option in the WHERE clause improve performance?
Personally I think that someone who creates this kind of a problem should not be allowed to use the word "performance". Restrictions (like those in the WHERE clause) on the text representation of the contents of a numeric field (maybe even a key field) indicate bad design, IMHO.
If this were my data, I would add a flag field to the record, indicating wanted / not wanted in query xyz. One could even put it into a separate table. I prefer adding a (redundant?) column to creating an entire index based on GW-BASIC-style substring rubbish.
The two things that have the most effect are indexing and sargability. Sargability means using an expression that can take advantage of an index. You measure their effect by using
ANALYZE your_first_table;
-- ANALYZE other tables used in this query.
EXPLAIN ANALYZE
SELECT ...
See the docs for Examining index usage.
You might be able to take advantage of indexes on expressions or partial indexes; PostgreSQL 7.4 supports both. For testing, you can discourage certain kinds of query plans (also in 7.4).
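For example, to see whether an index could be used at all for one of these predicates, you can temporarily discourage sequential scans in your session (purely a diagnostic, not a production setting; table/column names are the placeholders from the question):
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT t."FieldID"
FROM table t
WHERE t."FieldID" LIKE '01%';

SET enable_seqscan = on;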
An expression-based index that might work for you:
create index firsttwochars
on your-table-name (substring(your-column-name from 1 for 2));
But you still need to test your queries to see whether they actually use the index. (Whether they're sargable.) This one might work.
select your-column-name
from your-table-name
where substring(your-column-name from 1 for 2) = '01'
Query plan without the index on the first two characters. (My test table uses random text-only usernames, which is why I searched on 'ab' instead of '01'.)
Seq Scan on substring (cost=0.00..205.00 rows=50 width=11) (actual time=0.315..4.377 rows=14 loops=1)
Filter: (substring((username)::text, 1, 2) = 'ab'::text)
Total runtime: 4.414 ms
Query plan with the index on the first two characters.
Bitmap Heap Scan on substring (cost=4.36..37.61 rows=14 width=11) (actual time=0.036..0.056 rows=14 loops=1)
Recheck Cond: (substring((username)::text, 1, 2) = 'ab'::text)
-> Bitmap Index Scan on firsttwochars (cost=0.00..4.36 rows=14 width=0) (actual time=0.028..0.028 rows=14 loops=1)
Index Cond: (substring((username)::text, 1, 2) = 'ab'::text)
Total runtime: 0.098 ms
In SQL Server the version with LIKE '01%' would be sargable. It actually converts these LIKE queries without leading wildcards to range queries.
The execution plan shows the seek predicate as YourCol >= '01' AND YourCol < '02'; perhaps a similar sort of rewrite could help in PostgreSQL?
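The hand-written equivalent of that range rewrite would be something like the following; whether it is strictly equivalent to LIKE '01%' depends on the collation in use, which is the same caveat text_pattern_ops addresses:
SELECT t."FieldID"
FROM table t
WHERE t."FieldID" >= '01'
  AND t."FieldID" < '02';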
In the select list, there will probably not be much difference between the three expressions. It's all CPU time.
For the WHERE clause, you could add an expression index such as
CREATE INDEX foo ON sometable ((
CASE
WHEN "substring"("FieldID"::text, 0, 3) = '01'::text
THEN 1
ELSE 0
END
));
but the selectivity of such a Boolean index will likely be bad enough to not interest the planner. It would be better to rewrite the WHERE clause to just
WHERE "substring"("FieldID"::text, 0, 3) = '01'::text
and then index that.
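In other words, pair the simplified predicate with a matching expression index, along these lines (the index name is made up; sometable as above):
CREATE INDEX sometable_fieldid_prefix_idx
    ON sometable (("substring"("FieldID"::text, 0, 3)));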
For the LIKE and regex cases you could consider a text_pattern_ops index as well; see the documentation.
All in all, I think you have some cleanup work to do on that query.