SQL performance tuning with EXPLAIN - sql

I have a big table (~44 GB, 421631931 rows).
I'm attempting to optimize this type of SQL query:
SELECT fid, sid, dsc_entry, clstr_first_entry, date_part('epoch',start_time)::numeric(20,7) AS time_epoch
FROM frames
WHERE (sid = 1)
AND start_time <= to_timestamp('1471161210.776')
ORDER BY start_time DESC
LIMIT 1;
So far, I have set up an index on the column start_time:
"idx_start_time" btree (start_time) CLUSTER
When I run EXPLAIN, I get this plan:
Limit (cost=0.57..0.92 rows=1 width=24)
-> Index Scan Backward using idx_start_time on frames (cost=0.57..19347837.35 rows=55108378 width=24)
Index Cond: (start_time <= '2016-08-14 09:53:30.776+02'::timestamp with time zone)
Filter: (sid = 1)
This looks good to me (note that I have never attempted to optimize a database this way before), but the query still takes about 80 seconds to complete.
Can you please point out how I can speed this up further? (Disk space is not an issue.)
Thanks,
Petr.

For this query:
SELECT fid, sid, dsc_entry, clstr_first_entry,
date_part('epoch', start_time)::numeric(20,7) AS time_epoch
FROM frames
WHERE (sid = 1) AND start_time <= to_timestamp('1471161210.776')
ORDER BY start_time DESC
LIMIT 1;
I would recommend an index on frames(sid, start_time).
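To see why the composite index helps: with an index on start_time alone, the backward scan has to step over every row with a different sid before it reaches one with sid = 1, whereas an index on (sid, start_time) lets the matching entry be located directly. A minimal Python sketch of that idea, using a sorted list of tuples as a stand-in for the B-tree (the sample rows and the helper name latest_frame are made up for illustration):

```python
from bisect import bisect_right

# Stand-in for a B-tree on (sid, start_time): tuples kept in sorted order.
# The rows are made-up sample data, not taken from the question.
index = sorted([
    (1, 100.0), (1, 250.5), (1, 900.0),
    (2, 120.0), (2, 260.0), (2, 899.0),
    (3, 50.0), (3, 901.0),
])

def latest_frame(index, sid, max_start_time):
    """Return the entry with this sid and the largest start_time <= max_start_time."""
    # One binary search lands just past the last candidate entry.
    pos = bisect_right(index, (sid, max_start_time))
    if pos == 0:
        return None
    candidate = index[pos - 1]
    # If no row of this sid is old enough, the neighbour belongs to a smaller sid.
    return candidate if candidate[0] == sid else None

print(latest_frame(index, 1, 300.0))  # (1, 250.5)
print(latest_frame(index, 2, 100.0))  # None
```

The lookup touches one position in the ordered structure instead of walking backward through entries that fail the sid filter.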

The problem is the ORDER BY, because it is the inverse of the index order.
Try using:
CREATE INDEX ix ON frames (start_time DESC NULLS LAST);
The ORDER BY ... DESC requires additional time to re-sort the data in a temp table.

Related

postgres large table select optimization

I have to extract a DB to an external DB server for licensed software.
The DB has to be Postgres, and I cannot change the select query issued by the application (I cannot change the source code).
The table (it has to be a single table) holds around 6.5M rows and has unique values in the main column (prefix).
All requests are reads (no inserts/updates/deletes), and there are ~200k selects/day with peaks of 15 TPS.
Select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
EXPLAIN ANALYZE shows the following:
Limit (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
-> Sort (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on table (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms
The slowest part of the query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which takes 1.6 s (I tested only this part of the query).
This part of the query, tested separately:
Seq Scan on table (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
Filter: ('004366491997142'::text ~~ (prefix)::text)
Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About the data itself:
The values in column "prefix" share the same first five digits; the remaining digits differ and are unique.
The Postgres version is 9.5.
I've changed the following Postgres settings:
random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB
I have tried several index types (unique, gin, gist, hash), but in all cases the indexes are not used (as shown in the EXPLAIN output above) and the speed stays the same.
I also ran the following, with no visible improvement:
vacuum analyze verbose table
Please recommend DB settings and/or an index configuration to speed up this query.
Current HW is
i5, SSD, 16GB RAM on Win7, but I have the option to buy stronger HW.
As I understand it, when reads dominate (no inserts/updates), faster CPU cores matter much more than the number of cores or disk speed. Please confirm.
Add-on 1:
Even after adding 9 indexes, no index is used.
Add-on 2:
1) I found the reason the index is not used: it is the order of the operands in the LIKE expression. If the query were:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
it would use the index.
Notice the difference. The query as issued:
... WHERE '00436641997142%' like prefix ...
The query which uses the index correctly:
... WHERE prefix like '00436641997142%' ...
Since I cannot change the query itself, any idea how to overcome this? I can change the data and the Postgres settings, but not the query.
2) I also installed Postgres 9.6 in order to use parallel seq. scan. In this case, a parallel scan is used only if the last part of the query is omitted. So, the query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force the original query (which I cannot change):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It is hard to build an index for queries of the form string LIKE pattern, because the wildcards (% and _) can appear anywhere in the pattern.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width, for example char(32) or any length sufficient for the task. Also add one smallint column for the prefix length. Fill them with the lowest and highest values that match the prefix, and with the prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
            prefix_low            |           prefix_high            | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
Create an index on these values:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Test the modified query against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with indexes and try to tune it: add or remove an index on prefix_length, add that column to the BETWEEN index, and so on.
Next, the queries sent to the database need to be rewritten. Install PgBouncer with the PgBouncer-RR patch. It lets you rewrite queries on the fly with simple Python code, as in this example:
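The range columns can be reasoned about outside the database as well. The following is a minimal Python sketch of the rpad/rtrim computation shown above, plus a range test standing in for the BETWEEN condition. It assumes that every stored prefix consists of digits followed by at most one trailing %, and that the dialled number is padded to the same width before comparison; the function names are made up for illustration.

```python
WIDTH = 32  # same fixed width as the char(32) columns in the example above

def prefix_bounds(pattern, width=WIDTH):
    """Return (prefix_low, prefix_high, prefix_length) for a pattern such as '0043664%'."""
    digits = pattern.rstrip('%')
    return digits.ljust(width, '0'), digits.ljust(width, '9'), len(digits)

def matches(number, pattern, width=WIDTH):
    """Range test standing in for: number BETWEEN prefix_low AND prefix_high."""
    low, high, _ = prefix_bounds(pattern, width)
    # Assumption: the number is padded with '0' to the same width, so that a
    # number equal to the bare prefix still falls inside the range.
    return low <= number.ljust(width, '0') <= high

low, high, length = prefix_bounds('00436641997142%')
print(low)     # 14 digits followed by 18 zeros
print(high)    # 14 digits followed by 18 nines
print(length)  # 14
print(matches('00436641997142', '0043664%'))  # True
print(matches('00436641997142', '0043665%'))  # False
```

Patterns with wildcards in the middle, or with _, are not covered by this range trick and would still need the original LIKE test.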
import re

def rewrite_query(username, query):
    # The pattern is illustrative; tweak it until it matches the queries your application sends.
    q1 = r"""^SELECT [^']*'(?P<id>\d+)%'.* ORDER BY position\('%' in prefix\) ASC, char_length\(prefix\) DESC LIMIT """
    if not re.match(q1, query):
        return query  # nothing to do with other queries
    else:
        new_query = query  # ... rewrite query here
        return new_query
Run PgBouncer and connect it to the DB. Issue the different queries your application sends and check how they are rewritten. Because you are dealing with text, you have to tweak the regexps so that they match all required queries and rewrite them properly.
When the proxy is ready and debugged, reconnect your application to PgBouncer.
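As a rough illustration of what the "rewrite query here" step could look like, here is a sketch that keeps the rewrite_query(username, query) signature from the example above. The two regular expressions are assumptions about the exact text the application sends, and the prefix_low, prefix_high and prefix_length columns are the hypothetical ones introduced earlier in this answer; adjust both to your setup.

```python
import re

# Hypothetical pattern for the application's query: a quoted run of digits
# followed by LIKE prefix. Adjust it to the exact text your application sends.
LIKE_CLAUSE = re.compile(r"'(?P<number>\d+)'\s+LIKE\s+prefix", re.IGNORECASE)
ORDER_CLAUSE = re.compile(
    r"ORDER BY position\('%' in prefix\) ASC, char_length\(prefix\) DESC",
    re.IGNORECASE,
)

def rewrite_query(username, query):
    match = LIKE_CLAUSE.search(query)
    if match is None or ORDER_CLAUSE.search(query) is None:
        return query  # leave every other query untouched
    number = match.group("number")  # digits only, so safe to re-embed
    rewritten = LIKE_CLAUSE.sub(
        "'%s' BETWEEN prefix_low AND prefix_high" % number, query, count=1
    )
    # Replace the two-key sort with the precomputed length column.
    return ORDER_CLAUSE.sub("ORDER BY prefix_length DESC", rewritten, count=1)
```

Queries that do not contain both clauses pass through unchanged, which keeps the proxy harmless for everything else the application sends. Note that sorting by prefix_length alone drops the position('%' in prefix) key, so check that the ordering still picks the row you expect.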
Pro:
no changes to application
no changes to basic structure of DB
Contra:
extra maintenance - you need triggers to keep all new columns with actual data
extra tools to support
the rewrite uses regexps, so it is closely tied to the particular queries issued by your app. You need to run it for some time to arrive at robust rewrite rules.
Further development:
hijack the parsed query tree in pgsql itself https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating a proxy server which rewrites queries could be a solution here.
Here is an example from another question.
Then you could change "LIKE" to "=" in your query, and it would run a lot faster.
You should change your index by adding the proper operator class. According to the documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and
bpchar_pattern_ops support B-tree indexes on the types text, varchar,
and char respectively. The difference from the default operator
classes is that the values are compared strictly character by
character rather than according to the locale-specific collation
rules. This makes these operator classes suitable for use by queries
involving pattern matching expressions (LIKE or POSIX regular
expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);

PostgreSql Select query performance issue

I have a simple select query:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id
Then I want to get the first 100 results, so I use this:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id LIMIT 100
The problem is that the second query runs much slower than the first one. It takes less than a second to execute the first query and more than a minute to execute the second one.
These are execution plans for the queries:
without limit:
Sort (cost=26201.43..26231.42 rows=11994 width=72)
Sort Key: entity_id
-> Index Scan using entity_type_id_idx on entities (cost=0.00..24895.34 rows=11994 width=72)
Index Cond: (entity_type_id = 1)
with limit:
Limit (cost=0.00..8134.39 rows=100 width=72)
-> Index Scan using xpkentities on entities (cost=0.00..975638.85 rows=11994 width=72)
Filter: (entity_type_id = 1)
I don't understand why these two plans are so different and why the performance decreases so much. How should I tweak the second query to make it work faster?
I use PostgreSql 9.2.
You want the 100 smallest entity_id's matching your condition. Now - if those were numbers 1..100 then clearly using the entity_id index is the best way to handle this - everything is pre-sorted. In fact, if the 100 you wanted were in the range 1..200 then it still makes sense. Probably 1..1000 would.
So - PostgreSQL thinks it will find lots of entity_type_id=1 values at the "start" of the table. It estimates a cost of 8134 vs 26231 to filter by type then sort. In your case it is wrong.
Now - either there is some correlation which isn't obvious (that's bad - we can't tell the planner about that at present), or we don't have up-to-date or sufficient stats.
Does an ANALYZE entities make any difference? You can see what values the planner knows about by reading the planner-stats page in the manuals.
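The 8134 figure can be reproduced from the plan itself. Under a LIMIT, the planner prorates the cost of the underlying scan by the fraction of rows it expects to need, on the assumption that matching rows are spread evenly through the index. A small Python sketch of that arithmetic, using the numbers from the two plans above (the function name is made up for illustration):

```python
def limited_cost(startup_cost, total_cost, plan_rows, limit):
    """Prorate a scan's run cost for a LIMIT, assuming matches are spread evenly."""
    fraction = min(limit / plan_rows, 1.0)
    return startup_cost + (total_cost - startup_cost) * fraction

# Numbers from the "with limit" plan: Index Scan using xpkentities.
pk_scan = limited_cost(0.00, 975638.85, 11994, 100)
# Total cost of the "without limit" plan (filter by type, then sort).
sort_plan = 26231.42

print(round(pk_scan, 2))    # 8134.39
print(pk_scan < sort_plan)  # True
```

If the rows with entity_type_id = 1 actually sit at the far end of the primary key range, the even-spread assumption fails and the scan reads far more than the estimated fraction, which matches the slowdown described in the question.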

Optimization of aggregate SQL query

I am running a query against a table in PostgreSQL 9.2.
The table has a lot of fields, but the ones relevant to this question are:
video_id BIGINT NOT NULL
day_date DATE NOT NULL
total_plays BIGINT default 0
total_playthrough_average DOUBLE PRECISION
total_downloads BIGINT default 0
The query takes this form:
SELECT
SUM(total_plays) AS total_plays,
CASE SUM(total_downloads)
WHEN 0 THEN 100
ELSE SUM(total_playthrough_average * total_downloads) / SUM(total_downloads) END AS total_playthrough_average
FROM
mytable
WHERE
video_id = XXXX
-- Date parameter, exemplified by the current month
AND day_date >= DATE('2013-09-01') AND day_date <= DATE('2013-09-30')
The point of the query is to find the playthrough_average (a score of how much of the video the average person sees, between 0 and 100) of all videos, weighted by the downloads each video has (so the average playthrough of a video with 100 downloads weighs more than that of a video with 10 downloads).
The table uses the following index (among others):
"video_index1" btree (video_id, day_date, textfield1, textfield2, textfield3)
Doing an EXPLAIN ANALYZE on the query gives me this:
Aggregate (cost=153.33..153.35 rows=1 width=24) (actual time=6.219..6.221 rows=1 loops=1)
-> Index Scan using video_index1 on mytable (cost=0.00..152.73 rows=40 width=24) (actual time=0.461..5.387 rows=105 loops=1)
Index Cond: ((video_id = 6702200) AND (day_date >= '2013-01-01'::date) AND (day_date <= '2013-12-31'::date))
Total runtime: 6.757 ms
This seems like everything is dandy, but this is only when I test with a query that has already been performed. When my program is running I get a lot of queries taking 10-30 seconds (usually every few seconds). I am running it with 6-10 simultaneous processes making these queries (among others).
Is there something I can tweak in the postgresql settings to get better performance out of this? The table is updated constantly, although maybe only once or twice per hour per video_id, with both INSERT and UPDATE queries.
Your summing does not make sense to me. I think what you want is
select
sum(total_plays) as total_plays,
sum(total_downloads) as total_downloads,
sum(total_playthrough_average * total_downloads) as total_playthrough_average
from mytable
where
video_id = 1
and day_date between '2013-09-01' and '2013-09-30'
SQL Fiddle
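The query above returns the sum of products and the sum of downloads as separate columns; dividing the first by the second gives the download-weighted average the question describes. A minimal Python sketch of that calculation, with made-up sample rows and the same fallback to 100 that the question's CASE expression uses when there are no downloads:

```python
def weighted_playthrough(rows):
    """rows: list of (total_playthrough_average, total_downloads) pairs."""
    total_downloads = sum(downloads for _, downloads in rows)
    if total_downloads == 0:
        return 100  # same fallback as the CASE ... WHEN 0 THEN 100 branch
    weighted_sum = sum(avg * downloads for avg, downloads in rows)
    return weighted_sum / total_downloads

# Made-up sample: a day with 100 downloads outweighs a day with 10.
rows = [(80.0, 100), (20.0, 10)]
print(weighted_playthrough(rows))
```

The sketch does not handle a NULL total_playthrough_average, which the table definition allows.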

Optimize BETWEEN date statement

I need help optimizing a Postgres query which uses the BETWEEN clause with a timestamp field.
I have 2 tables:
ONE(int id_one(PK), datetime cut_time, int f1 ...)
containing about 3394 rows
TWO(int id_two(PK), int id_one(FK), int f2 ...)
containing about 4000000 rows
There are btree indexes on both PKs id_one and id_two, on the FK id_one and cut_time.
I want to perform a query like:
select o.id_one, Date(o.cut_time), o.f1, t.f2
from one o
inner join two t ON (o.id_one = t.id_one)
where o.cut_time between '2013-01-01' and '2013-01-31';
This query retrieves about 1,700,000 rows in about 7 seconds.
The EXPLAIN ANALYZE output is below:
Merge Join (cost=20000000003.53..20000197562.38 rows=1680916 width=24) (actual time=0.017..741.718 rows=1692345 loops=1)
Merge Cond: (c.coilid = hf.coilid)
-> Index Scan using pk_coils on coils c (cost=10000000000.00..10000000382.13 rows=1420 width=16) (actual time=0.008..4.539 rows=1404 loops=1)
Filter: ((cut_time >= '2013-01-01 00:00:00'::timestamp without time zone) AND (cut_time <= '2013-01-31 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1990
-> Index Scan using idx_fk_lf_data on hf_data hf (cost=10000000000.00..10000166145.90 rows=4017625 width=16) (actual time=0.003..392.535 rows=1963386 loops=1)
Total runtime: 768.473 ms
The index on the timestamp column isn't used. How to optimize this query?
Proper DDL script
A proper setup could look like this:
db<>fiddle here
Old sqlfiddle
More about this fiddle further down.
Assuming the data type timestamp for the column declared as datetime (cut_time).
Incorrect query
BETWEEN is almost always wrong on principle with timestamp columns. See:
Find overlapping date ranges in PostgreSQL
In your query:
SELECT o.one_id, date(o.cut_time), o.f1, t.f2
FROM one o
JOIN two t USING (one_id)
WHERE o.cut_time BETWEEN '2013-01-01' AND '2013-01-31';
... the string constants '2013-01-01' and '2013-01-31' are coerced to the timestamps '2013-01-01 00:00' and '2013-01-31 00:00'. This excludes most of Jan. 31. The timestamp '2013-01-31 12:00' would not qualify, which is most certainly wrong.
If you'd use '2013-02-01' as upper bound instead, it would include '2013-02-01 00:00'. Still wrong.
To get all timestamps of "January 2013" it needs to be:
SELECT o.one_id, date(o.cut_time), o.f1, t.f2
FROM one o
JOIN two t USING (one_id)
WHERE o.cut_time >= '2013-01-01'
AND o.cut_time < '2013-02-01';
Exclude the upper bound.
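The difference between the two conditions is easy to check outside the database. A short Python sketch with datetime values standing in for the timestamp column (the function names are made up for illustration):

```python
from datetime import datetime

LOWER = datetime(2013, 1, 1)             # '2013-01-01' coerced to midnight
UPPER_INCLUSIVE = datetime(2013, 1, 31)  # '2013-01-31' coerced to midnight
UPPER_EXCLUSIVE = datetime(2013, 2, 1)

def in_between(ts):
    """Mirrors: cut_time BETWEEN '2013-01-01' AND '2013-01-31'."""
    return LOWER <= ts <= UPPER_INCLUSIVE

def in_january(ts):
    """Mirrors: cut_time >= '2013-01-01' AND cut_time < '2013-02-01'."""
    return LOWER <= ts < UPPER_EXCLUSIVE

noon_jan_31 = datetime(2013, 1, 31, 12, 0)
print(in_between(noon_jan_31))           # False: most of Jan. 31 is lost
print(in_january(noon_jan_31))           # True
print(in_january(datetime(2013, 2, 1)))  # False: Feb. 1 00:00 is excluded
```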
Optimize query
It's probably pointless to retrieve 1.7 million rows. Aggregate before you retrieve the result.
Since table two is so much bigger, it's crucial how many rows you get from there. When retrieving more than ~ 5 %, a plain index on two.one_id will typically not be used, because it is faster to scan the table sequentially right away.
Your table statistics are outdated, or you have messed with cost constants and other parameters (which you obviously have, see below) to force Postgres into using the index anyway.
The only chance I would see for an index on two is a covering index:
CREATE INDEX two_one_id_f2 ON two(one_id, f2);
This way, Postgres could read from the index directly, if some preconditions are met. Might be a bit faster, not much. Didn't test.
Strange numbers in EXPLAIN output
As for the strange numbers in your EXPLAIN ANALYZE output, the fiddle should explain them.
Seems like you had these debug settings:
SET enable_seqscan = off;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
All of them should be on (default setting), except for debugging. Else it cripples performance! Check with:
SELECT * FROM pg_settings WHERE name ~~ 'enable%';
The query executes in less than one second. The other 6+ seconds are spent on traffic between server and client.

how to speed up max() query

In PostgreSQL 8.4, the query
explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from ALGSA
where laonr=1 and kuupaev <='9999-12-31' and
kuupaev||kellaaeg <= '9999-12-3123 59'
takes 3 seconds to run:
"Aggregate (cost=3164.49..3164.50 rows=1 width=10) (actual time=2714.269..2714.270 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..3110.04 rows=21778 width=10) (actual time=0.105..1418.743 rows=70708 loops=1)"
" Filter: ((kuupaev <= '9999-12-31'::date) AND (laonr = 1::numeric) AND ((kuupaev || (kellaaeg)::text) <= '9999-12-3123 59'::text))"
"Total runtime: 2714.363 ms"
How can I speed it up in PostgreSQL 8.4.4?
The table structure is below.
The algsa table has an index on kuupaev; maybe this can be used?
Or is it possible to change the query or add some other index to make it fast? Existing columns in the table cannot be changed.
CREATE TABLE firma1.algsa
(
id serial NOT NULL,
laonr numeric(2,0),
kuupaev date NOT NULL,
kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
... other columns
CONSTRAINT algsa_pkey PRIMARY KEY (id),
CONSTRAINT algsa_id_check CHECK (id > 0)
);
CREATE INDEX algsa_kuupaev_idx ON firma1.algsa USING btree (kuupaev);
Update
I tried analyze verbose firma1.algsa;
INFO: analyzing "firma1.algsa"
INFO: "algsa": scanned 1640 of 1640 pages, containing 70708 live rows and 13 dead rows; 30000 rows in sample, 70708 estimated total rows
Query returned successfully with no result in 1185 ms.
but the query run time was still 2.7 seconds.
Why are there 30000 rows in the sample? Isn't that too many; should it be decreased?
This was a known issue in old versions of PostgreSQL - but it looks like it might've been resolved by 8.4; in fact, the docs for 8.0 have the caveat but the docs for 8.1 do not.
So you don't need to upgrade major versions for this reason, at least. You should however upgrade to the current 8.4 series release 8.4.16, as you're missing several years worth of bug fixes and tweaks.
The real problem here is that you're using max on an expression, not a simple value, and there's no functional index for that expression.
You could try creating an index on the expression kuupaev||kellaaeg ... but I suspect you have data model problems, and that there's a better solution by fixing your data model.
It looks like kuupaev is kuupäev, or date, and kellaaeg might be time. If so: never use the concatenation (||) operator for combining dates and times; use interval addition, e.g. kuupaev + kellaaeg. Instead of char you should be using the data type time, or interval with a CHECK constraint, for kellaaeg, depending on what it means and whether it's limited to 24 hours or not. Or, better still, use a single field of type timestamp (for local time) or timestamp with time zone (for global time) to store the combined date and time.
If you do this, you can create a simple index on the combined column that replaces both kellaaeg and kuupaev and use that for min and max among other things. If you need just the date part or just the time part for some things, use the date_trunc, extract and date_part functions; see the documentation.
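The same idea can be illustrated outside SQL. In the Python sketch below, separate date and time values (made-up sample data) are combined into one timestamp-like value, after which the maximum is a plain comparison and the parts can still be taken back out when needed:

```python
from datetime import date, datetime, time

# Made-up sample rows standing in for (kuupaev, kellaaeg) pairs.
rows = [
    (date(2013, 3, 1), time(9, 30)),
    (date(2013, 3, 1), time(14, 5)),
    (date(2013, 2, 28), time(23, 59)),
]

# Combine each pair into a single value, as a timestamp column would store it.
combined = [datetime.combine(d, t) for d, t in rows]
latest = max(combined)

print(latest)         # 2013-03-01 14:05:00
print(latest.date())  # 2013-03-01
print(latest.time())  # 14:05:00
```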
See this earlier answer for another example of where separate date and time columns are a bad idea.
You should still plan an upgrade to 9.2. The upgrade path from 8.4 to 9.2 isn't too rough; you really just have to watch out for standard_conforming_strings now being on by default and for the change of bytea_output from escape to hex. Both can be set back to the 8.4 defaults during transition and porting work. 8.4 won't be supported for much longer.
My first instinct would be to try an index:
create index algsa_laonr_kuupaev_kellaaeg_idx
on ALGSA (laonr asc, (kuupaev||kellaaeg) desc)
... and try the query as:
SELECT kuupaev||kellaaeg as res
from ALGSA
where laonr=1 and
kuupaev||kellaaeg <= '9999-12-3123 59'
order by
laonr asc,
kuupaev||kellaaeg desc
limit 1