I need help optimizing a Postgres query which uses the BETWEEN clause with a timestamp field.
I have 2 tables:
ONE(int id_one(PK), datetime cut_time, int f1 ...)
containing about 3394 rows
TWO(int id_two(PK), int id_one(FK), int f2 ...)
containing about 4000000 rows
There are btree indexes on both PKs id_one and id_two, on the FK id_one and cut_time.
I want to perform a query like:
select o.id_one, Date(o.cut_time), o.f1, t.f2
from one o
inner join two t ON (o.id_one = t.id_one)
where o.cut_time between '2013-01-01' and '2013-01-31';
This query retrieves about 1,700,000 rows in about 7 seconds.
The EXPLAIN ANALYZE output is below:
Merge Join (cost=20000000003.53..20000197562.38 rows=1680916 width=24) (actual time=0.017..741.718 rows=1692345 loops=1)
Merge Cond: (c.coilid = hf.coilid)
-> Index Scan using pk_coils on coils c (cost=10000000000.00..10000000382.13 rows=1420 width=16) (actual time=0.008..4.539 rows=1404 loops=1)
Filter: ((cut_time >= '2013-01-01 00:00:00'::timestamp without time zone) AND (cut_time <= '2013-01-31 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1990
-> Index Scan using idx_fk_lf_data on hf_data hf (cost=10000000000.00..10000166145.90 rows=4017625 width=16) (actual time=0.003..392.535 rows=1963386 loops=1)
Total runtime: 768.473 ms
The index on the timestamp column isn't used. How to optimize this query?
Proper DDL script
A proper setup could look like this:
db<>fiddle here
Old sqlfiddle
More about this fiddle further down.
Assuming data type timestamp for the column cut_time (declared as datetime in the question).
Incorrect query
BETWEEN is almost always wrong on principle with timestamp columns. See:
Find overlapping date ranges in PostgreSQL
In your query:
SELECT o.one_id, date(o.cut_time), o.f1, t.f2
FROM one o
JOIN two t USING (one_id)
WHERE o.cut_time BETWEEN '2013-01-01' AND '2013-01-31';
... the string constants '2013-01-01' and '2013-01-31' are coerced to the timestamps '2013-01-01 00:00' and '2013-01-31 00:00'. This excludes most of Jan. 31. The timestamp '2013-01-31 12:00' would not qualify, which is most certainly wrong.
If you'd use '2013-02-01' as upper bound instead, it would include '2013-02-01 00:00'. Still wrong.
To get all timestamps of "January 2013" it needs to be:
SELECT o.one_id, date(o.cut_time), o.f1, t.f2
FROM one o
JOIN two t USING (one_id)
WHERE o.cut_time >= '2013-01-01'
AND o.cut_time < '2013-02-01';
Exclude the upper bound.
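To make the boundary behaviour concrete, here is a minimal sketch in Python (used only to illustrate the comparison logic outside the database). The two helper functions are hypothetical names; they mirror the BETWEEN predicate and the half-open range from the queries above.

```python
from datetime import datetime

def in_january_between(ts):
    # Mirrors BETWEEN '2013-01-01' AND '2013-01-31': both bounds inclusive,
    # and the upper bound is midnight at the START of Jan 31.
    return datetime(2013, 1, 1) <= ts <= datetime(2013, 1, 31)

def in_january_half_open(ts):
    # Mirrors >= '2013-01-01' AND < '2013-02-01': upper bound excluded.
    return datetime(2013, 1, 1) <= ts < datetime(2013, 2, 1)

noon_jan_31 = datetime(2013, 1, 31, 12, 0)
midnight_feb_1 = datetime(2013, 2, 1)

print(in_january_between(noon_jan_31))       # False: most of Jan 31 is lost
print(in_january_half_open(noon_jan_31))     # True
print(in_january_half_open(midnight_feb_1))  # False: Feb 1 is excluded
```

The half-open form also needs no knowledge of how many days the month has.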
Optimize query
It's probably pointless to retrieve 1.7 million rows. Aggregate before you retrieve the result.
Since table two is so much bigger, it's crucial how many rows you get from there. When retrieving more than ~ 5 %, a plain index on two.one_id will typically not be used, because it is faster to scan the table sequentially right away.
Your table statistics are outdated, or you have messed with cost constants and other parameters (which you obviously have, see below) to force Postgres into using the index anyway.
The only chance I would see for an index on two is a covering index:
CREATE INDEX two_one_id_f2 ON two(one_id, f2);
This way, Postgres could read from the index directly, if some preconditions are met. Might be a bit faster, not much. Didn't test.
Strange numbers in EXPLAIN output
As to your strange numbers in your EXPLAIN ANALYZE. The fiddle should explain it.
Seems like you had these debug settings:
SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
All of them should be on (default setting), except for debugging. Else it cripples performance! Check with:
SELECT * FROM pg_settings WHERE name ~~ 'enable%';
The query executes in less than one second. The other 6+ seconds are spent on traffic between server and client.
Is there any optimization I can do to speed up this query? It is currently taking 30 minutes to run.
SELECT
*
FROM
service s
JOIN
bucket b ON s.procedure = b.hcpc
WHERE
month >= '201904'
AND bucket = 'Respirator'
Explain execution plan -
Gather (cost=1002.24..81397944.91 rows=9782404 width=212)
Workers Planned: 2
-> Hash Join (cost=2.24..80418704.51 rows=4076002 width=212)
Hash Cond: ((s .procedure)::text = (bac.hcpc)::text)
-> Parallel Seq Scan on service s (cost=0.00..77753288.33 rows=699907712 width=154)
Filter: ((month)::text >= '201904'::text)
-> Hash (cost=2.06..2.06 rows=14 width=58)
-> Seq Scan on buckets b (cost=0.00..2.06 rows=14 width=58)
Filter: ((bucket)::text = 'Respirator'::text)
SELECT *
FROM service s JOIN
bucket b
ON s.procedure = b.hcpc
WHERE s.month >= '201904' AND
b.bucket = 'Respirator';
I would suggest indexes on:
bucket(bucket, hcpc)
service(procedure, month)
Query optimization is something that doesn't have super hard and fast rules, it's more of a trial and error thing. Sometimes you will try a technique and it will work really well, but then the same technique will have little to no effect on another query. That being said, here are a couple of things that I would try to get you started.
Instead of SELECT *, list out the column names that you need. If you need all of both tables, still list them out.
Are there any numeric columns that you can use in your WHERE clause to do some preliminary filtering? Comparing only string data types is almost always a pain point in query optimization.
Look at the existing indexes on the table and see if any changes need to be made. Indexes can have a huge impact on query performance, both positive and negative depending on setup.
Again, it's all trial and error, these are just a couple of places to start.
I have to move a database to an external DB server for licensed software.
The database has to be Postgres, and I cannot change the SELECT query issued by the application (I cannot change the source code).
The table (it has to be a single table) holds around 6.5M rows and has unique values in its main column (prefix).
All requests are reads (no inserts, updates, or deletes); there are ~200k selects per day with peaks of 15 TPS.
Select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
Explain analyze shows following
Limit (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
-> Sort (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on table (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms
The slowest part of the query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which takes 1.6 s (tested with only this part of the query).
Part of query tested separately:
Seq Scan on table (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
Filter: ('004366491997142'::text ~~ (prefix)::text)
Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About data itself:
column "prefix" has identical first several digits (first 5) and rest are different, unique ones.
Postgres version is 9.5
I've changed the following Postgres settings:
random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB
I have tried several index types (unique, gin, gist, hash), but in all cases the indexes are not used (as shown in the explain output above) and the speed is unchanged.
I also ran the following, with no visible improvement:
vacuum analyze verbose table
Please recommend settings of DB and/or index configuration in order to speed up execution time of this query.
Current hardware is
i5, SSD, 16GB RAM on Win7, but I have the option to buy stronger hardware.
As I understand it, for read-dominated workloads (no inserts/updates), faster CPU cores matter much more than the number of cores or disk speed. Please confirm.
Add-on 1:
Even after adding 9 indexes, none of them is used.
Add-on 2:
1) I found the reason the index is not used: the operand order in the LIKE part of the query. If the query were:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
it would use the index.
Notice the difference. The original query:
... WHERE '00436641997142%' like prefix ...
The query that uses the index correctly:
... WHERE prefix like '00436641997142%' ...
Since I cannot change the query itself, any idea how to overcome this? I can change the data and the Postgres settings, but not the query.
2) I also installed Postgres 9.6 in order to use parallel sequential scans. In this case, a parallel scan is used only if the last part of the query is omitted. So, the query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force original query (I cannot change query):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It's hard to build an index for queries of the form string LIKE pattern, because the wildcards (% and _) can appear anywhere in the pattern.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width: for example char(32), or any length sufficient for the task. Also add a smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
            prefix_low            |           prefix_high            | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
Create an index on these columns:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Test the modified query against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with the indexes and tune it: for example, add or remove an index on prefix_length, or add prefix_length to the (prefix_low, prefix_high) index.
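The bounds used for the range lookup can be sketched in Python as follows. This is only an illustration of the rpad/rtrim logic shown earlier, assuming prefixes are digit strings with a single trailing % and are no longer than the column width; the names prefix_bounds and matches are hypothetical, and zero-padding the looked-up number before comparing is an assumption of this sketch, not part of the original suggestion.

```python
WIDTH = 32  # same arbitrary width as the char(32) columns

def prefix_bounds(prefix, width=WIDTH):
    """Return (prefix_low, prefix_high, prefix_length) for a pattern like '0043664%'."""
    digits = prefix.rstrip('%')      # like rtrim(prefix, '%')
    low = digits.ljust(width, '0')   # like rpad(..., width, '0')
    high = digits.ljust(width, '9')  # like rpad(..., width, '9')
    return low, high, len(digits)

def matches(number, prefix, width=WIDTH):
    # Assumption: the number is zero-padded to the column width before comparing,
    # so that a prefix as long as the number itself still falls inside the range.
    low, high, _ = prefix_bounds(prefix, width)
    return low <= number.ljust(width, '0') <= high

print(prefix_bounds('00436641997142%'))
print(matches('00436641997142', '0043664%'))  # True
print(matches('00436641997142', '0049%'))     # False
```

Among matching rows, the longest prefix_length identifies the most specific prefix, which is what the ORDER BY prefix_length DESC LIMIT 1 in the rewritten query relies on.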
Now you need to rewrite the queries sent to the database. Install PgBouncer with the PgBouncer-RR patch. It lets you rewrite queries on the fly with simple Python code, as in this example:
import re

# Pattern for the application's lookup query; adjust it to the exact text your app sends.
Q1 = re.compile(
    r"^SELECT [^']*'(?P<id>\d+)%?'.*\s+ORDER BY position\('%' in prefix\) ASC, "
    r"char_length\(prefix\) DESC\s+LIMIT\s",
    re.DOTALL,
)

def rewrite_query(username, query):
    if not Q1.match(query):
        return query  # nothing to do with other queries
    new_query = query  # ... rewrite query here
    return new_query
Run PgBouncer and connect it to the DB. Issue the same kinds of queries your application does and check how they are rewritten. Because you are matching on query text, you will have to tweak the regexps until they match all the required queries and rewrite them properly.
When the proxy is ready and debugged, point your application at PgBouncer.
Pro:
no changes to application
no changes to basic structure of DB
Contra:
extra maintenance - you need triggers to keep the new columns in sync with the current data
extra tools to support
rewrite uses regexp so it's closely tied to particular queries issued by your app. You need to run it for some time and make robust rewrite rules.
Further development:
hijack the parsed query tree in pgsql itself https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating a proxy server that rewrites queries could be a solution here.
Here is an example from another question.
Then you could change "LIKE" to "=" in your query, and it would run a lot faster.
You should change your index by adding the proper operator class. According to the documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and
bpchar_pattern_ops support B-tree indexes on the types text, varchar,
and char respectively. The difference from the default operator
classes is that the values are compared strictly character by
character rather than according to the locale-specific collation
rules. This makes these operator classes suitable for use by queries
involving pattern matching expressions (LIKE or POSIX regular
expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
I am running a query against a table in PostgreSQL 9.2.
The table has a lot of fields, but the ones relevant here are:
video_id BIGINT NOT NULL
day_date DATE NOT NULL
total_plays BIGINT default 0
total_playthrough_average DOUBLE PRECISION
total_downloads BIGINT default 0
The query takes this form:
SELECT
SUM(total_plays) AS total_plays,
CASE SUM(total_downloads)
WHEN 0 THEN 100
ELSE SUM(total_playthrough_average * total_downloads) / SUM(total_downloads) END AS total_playthrough_average
FROM
mytable
WHERE
video_id = XXXX
-- Date parameter, exemplified by the current month
AND day_date >= DATE('2013-09-01') AND day_date <= DATE('2013-09-30')
The point of the query is to find the playthrough_average (a score of how much of the video the average person sees, between 0 and 100) of all videos, weighted by the downloads each video has (so the average playthrough of a video with 100 downloads weighs more than that of a video with 10 downloads).
The table uses the following index (among others):
"video_index1" btree (video_id, day_date, textfield1, textfield2, textfield3)
Doing an EXPLAIN ANALYZE on the query gives me this:
Aggregate (cost=153.33..153.35 rows=1 width=24) (actual time=6.219..6.221 rows=1 loops=1)
-> Index Scan using video_index1 on mytable (cost=0.00..152.73 rows=40 width=24) (actual time=0.461..5.387 rows=105 loops=1)
Index Cond: ((video_id = 6702200) AND (day_date >= '2013-01-01'::date) AND (day_date <= '2013-12-31'::date))
Total runtime: 6.757 ms
This seems like everything is dandy, but this is only when I test with a query that has already been performed. When my program is running I get a lot of queries taking 10-30 seconds (usually every few seconds). I am running it with 6-10 simultaneous processes making these queries (among others).
Is there something I can tweak in the postgresql settings to get better performance out of this? The table is updated constantly, although maybe only once or twice per hour per video_id, with both INSERT and UPDATE queries.
Your summing does not make sense to me. I think what you want is
select
sum(total_plays) as total_plays,
sum(total_downloads) as total_downloads,
sum(total_playthrough_average * total_downloads) as total_playthrough_average
from mytable
where
video_id = 1
and day_date between '2013-09-01' and '2013-09-30'
SQL Fiddle
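For reference, the download-weighted average described in the question can be sketched in Python like this. It is an illustration only, not the poster's code; the function name is hypothetical, rows are assumed to be (total_playthrough_average, total_downloads) pairs with no NULLs, and the 100 fallback copies the CASE branch from the question.

```python
def weighted_playthrough(rows):
    """rows: list of (total_playthrough_average, total_downloads) pairs."""
    total_downloads = sum(downloads for _, downloads in rows)
    if total_downloads == 0:
        return 100  # same fallback as CASE ... WHEN 0 THEN 100
    weighted_sum = sum(avg * downloads for avg, downloads in rows)
    return weighted_sum / total_downloads

# A day with 100 downloads weighs ten times more than a day with 10.
print(weighted_playthrough([(50.0, 100), (100.0, 10)]))
print(weighted_playthrough([]))  # 100
```

Dividing the weighted sum by the download total is what turns SUM(total_playthrough_average * total_downloads) back into a score on the 0 to 100 scale.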
In PostgreSQL 8.4, the query
explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from ALGSA
where laonr=1 and kuupaev <='9999-12-31' and
kuupaev||kellaaeg <= '9999-12-3123 59'
Takes 3 seconds to run:
"Aggregate (cost=3164.49..3164.50 rows=1 width=10) (actual time=2714.269..2714.270 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..3110.04 rows=21778 width=10) (actual time=0.105..1418.743 rows=70708 loops=1)"
" Filter: ((kuupaev <= '9999-12-31'::date) AND (laonr = 1::numeric) AND ((kuupaev || (kellaaeg)::text) <= '9999-12-3123 59'::text))"
"Total runtime: 2714.363 ms"
How to speed it up in PostgreSQL 8.4.4 ?
Table structure is below.
The algsa table has an index on kuupaev; maybe this can be used?
Or is it possible to change the query, or add some other index, to make it fast? Existing columns in the table cannot be changed.
CREATE TABLE firma1.algsa
(
id serial NOT NULL,
laonr numeric(2,0),
kuupaev date NOT NULL,
kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
... other columns
CONSTRAINT algsa_pkey PRIMARY KEY (id),
CONSTRAINT algsa_id_check CHECK (id > 0)
);
CREATE INDEX algsa_kuupaev_idx ON firma1.algsa USING btree (kuupaev);
Update
Tried analyze verbose firma1.algsa;
INFO: analyzing "firma1.algsa"
INFO: "algsa": scanned 1640 of 1640 pages, containing 70708 live rows and 13 dead rows; 30000 rows in sample, 70708 estimated total rows
Query returned successfully with no result in 1185 ms.
but the query run time was still 2.7 seconds.
Why are there 30000 rows in the sample? Isn't that too many, and should it be decreased?
This was a known issue in old versions of PostgreSQL - but it looks like it might've been resolved by 8.4; in fact, the docs for 8.0 have the caveat but the docs for 8.1 do not.
So you don't need to upgrade major versions for this reason, at least. You should however upgrade to the current 8.4 series release 8.4.16, as you're missing several years worth of bug fixes and tweaks.
The real problem here is that you're using max on an expression, not a simple value, and there's no functional index for that expression.
You could try creating an index on the expression kuupaev||kellaaeg ... but I suspect you have data model problems, and that there's a better solution by fixing your data model.
It looks like kuupaev is kuupäev, or date, and kellaaeg might be time. If so: never use the concatenation (||) operator for combining dates and times; use interval addition, eg kuupaev + kellaaeg. Instead of char you should be using the data type time or interval with a CHECK constraint for kellaaeg, depending on what it means and whether it's limited to 24 hours or not. Or, better still, use a single field of type timestamp (for local time) or timestamp with time zone (for global time) to store the combined date and time.
If you do this, you can create a simple index on the combined column that replaces both kellaaeg and kuupaev and use that for min and max among other things. If you need just the date part or just the time part for some things, use the date_trunc, extract and date_part functions; see the documentation.
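The idea of comparing one combined value instead of concatenated text can be sketched in Python. This is a hedged illustration only: to_timestamp is a hypothetical helper, the 'HH MM' layout of kellaaeg is inferred from the '23 59' literal in the question, and treating a blank kellaaeg as midnight is an assumption.

```python
from datetime import date, datetime, time

def to_timestamp(kuupaev, kellaaeg):
    """Combine a date and an 'HH MM' string into one datetime value."""
    parts = kellaaeg.split()
    if not parts:
        # Assumption: a blank kellaaeg (the column default) means midnight
        return datetime.combine(kuupaev, time(0, 0))
    hours, minutes = (int(p) for p in parts)
    return datetime.combine(kuupaev, time(hours, minutes))

rows = [
    (date(2013, 9, 1), "23 59"),
    (date(2013, 9, 2), "08 30"),
    (date(2013, 9, 2), ""),
]
latest = max(to_timestamp(d, t) for d, t in rows)
print(latest)  # 2013-09-02 08:30:00
```

With a real timestamp the maximum is an ordinary ordered comparison, which is the kind of value a plain btree index can serve.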
See this earlier answer for another example of where separate date and time columns are a bad idea.
You should still plan an upgrade to 9.2. The upgrade path from 8.4 to 9.2 isn't too rough, you really just have to watch out for the setting of standard_conforming_strings on by default and the change of bytea_output from escape to hex. Both can be set back to the 8.4 defaults during transition and porting work. 8.4 won't be supported for much longer.
My first instinct would be to try an index:
create index algsa_laonr_kuupaev_kellaaeg_idx
on ALGSA (laonr asc, (kuupaev||kellaaeg) desc)
... and try the query as:
SELECT kuupaev||kellaaeg as res
from ALGSA
where laonr=1 and
kuupaev||kellaaeg <= '9999-12-3123 59'
order by
laonr asc,
kuupaev||kellaaeg desc
limit 1
I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.
Query #1:
SELECT id
FROM subsource_position
WHERE
id NOT IN (SELECT position_id FROM subsource)
This comes back with the following plan:
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2566.50 rows=101500 width=4)
-> Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4)
Query #2:
SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;
Plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------
SetOp Except (cost=24760.35..25668.66 rows=95997 width=4)
-> Sort (cost=24760.35..25214.50 rows=181663 width=4)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..6406.26 rows=181663 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4)
-> Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4)
-> Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)
I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?
Additional data:
=> select count(*) from subsource;
count
-------
85158
(1 row)
=> select count(*) from subsource_position;
count
-------
93261
(1 row)
Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.
Edit 2: I'm using PostgreSQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.
Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.
Query #1 is not an elegant way of doing this. (NOT) IN (SELECT ...) is fine for a few entries, but it can't use indexes (hence the Seq Scan).
Without EXCEPT, the alternative is to use a JOIN (hash join):
SELECT sp.id
FROM subsource_position AS sp
LEFT JOIN subsource AS s ON (s.position_id = sp.id)
WHERE
s.position_id IS NULL
EXCEPT appeared in Postgres a long time ago. In MySQL, I believe this join is still the only way to achieve this using indexes.
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1MB of work memory. Try 10MB or 20MB.
Your queries are not functionally equivalent so any comparison of their query plans is meaningless.
Your first query is, in set theory terms, this:
{subsource.position_id} - {subsource_position.id}
^ ^ ^ ^
but your second is this:
{subsource_position.id} - {subsource.position_id}
^ ^ ^ ^
And A - B is not the same as B - A for arbitrary sets A and B.
Fix your queries to be semantically equivalent and try again.
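A small Python sketch with made-up ids shows why the order of the operands matters. The set contents are hypothetical; only the table and column names come from the question.

```python
# Hypothetical sample ids, not data from the question
subsource_position_id = {1, 2, 3, 4}
subsource_position_ref = {3, 4, 5}  # values of subsource.position_id

a_minus_b = subsource_position_id - subsource_position_ref
b_minus_a = subsource_position_ref - subsource_position_id

print(a_minus_b)  # {1, 2}
print(b_minus_a)  # {5}
```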
If id and position_id are both indexed (either on their own or first column in a multi-column index), then two index scans are all that are necessary - it's a trivial sorted-merge based set algorithm.
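The sorted-merge idea can be sketched in Python as a single pass over two already sorted lists, which is what two index scans would deliver. This is an illustration of the technique, not of PostgreSQL internals; sorted_difference is a hypothetical name.

```python
def sorted_difference(left, right):
    """Return the values of sorted `left` that are absent from sorted `right`."""
    result = []
    j = 0
    for value in left:
        # Advance the right-hand cursor past everything smaller than value
        while j < len(right) and right[j] < value:
            j += 1
        # Keep value only if the right side has no equal element
        if j == len(right) or right[j] != value:
            result.append(value)
    return result

print(sorted_difference([1, 2, 3, 5, 8], [2, 3, 4, 8, 9]))  # [1, 5]
```

Each list is walked once, so the cost grows with the sum of the two sizes rather than their product, unlike rescanning the subquery result for every outer row.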
Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.
(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u on the command line in seconds. Database less than 50MB when exported with pg_dump.)
PS: more interesting comment here:
more work has been put into optimizing
EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially
less useful due to its unintuitive but spec-mandated handling of NULLs.
We're not going to apologize for that, and we're not going to regard it as a bug.
What it comes down to is that except is different to not in with respect to null handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize it.
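The NULL difference is easy to reproduce. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, on the assumption that both follow the same SQL-standard three-valued logic for these operators; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subsource_position (id INTEGER);
    CREATE TABLE subsource (position_id INTEGER);
    INSERT INTO subsource_position VALUES (1), (2), (3);
    INSERT INTO subsource VALUES (2), (NULL);
""")

# A single NULL in the subquery makes "id NOT IN (...)" unknown for every
# non-matching id, so no row qualifies.
not_in = conn.execute(
    "SELECT id FROM subsource_position "
    "WHERE id NOT IN (SELECT position_id FROM subsource) ORDER BY id"
).fetchall()

# EXCEPT and NOT EXISTS are not affected by the NULL.
except_rows = conn.execute(
    "SELECT id FROM subsource_position "
    "EXCEPT SELECT position_id FROM subsource ORDER BY id"
).fetchall()
not_exists = conn.execute(
    "SELECT id FROM subsource_position sp WHERE NOT EXISTS "
    "(SELECT 1 FROM subsource s WHERE s.position_id = sp.id) ORDER BY id"
).fetchall()

print(not_in)       # []
print(except_rows)  # [(1,), (3,)]
print(not_exists)   # [(1,), (3,)]
```

Because NOT IN must honour this behaviour, it cannot simply be turned into an anti-join the way NOT EXISTS can.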
The second query makes use of the HASH JOIN feature of PostgreSQL. This is much faster than the Seq Scan of the first one.