Efficient querying of multi-partition Postgres table

I've just restructured my database to use partitioning in Postgres 8.2. Now I have a problem with query performance:
SELECT *
FROM my_table
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100;
There are 45 million rows in the table. Prior to partitioning, this would use a reverse index scan and stop as soon as it hit the limit.
After partitioning (on time_stamp ranges), Postgres does a full index scan of the master table and the relevant partition and merges the results, sorts them, then applies the limit. This takes way too long.
I can fix it with:
SELECT * FROM (
SELECT *
FROM my_table_part_a
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
SELECT * FROM (
SELECT *
FROM my_table_part_b
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
... and so on ...
ORDER BY id DESC
LIMIT 100
This runs quickly. The partitions whose timestamps are out of range aren't even included in the query plan.
My question is: Is there some hint or syntax I can use in Postgres 8.2 to prevent the query-planner from scanning the full table but still using simple syntax that only refers to the master table?
Basically, can I avoid the pain of dynamically building the big UNION query over each partition that happens to be currently defined?
EDIT: I have constraint_exclusion enabled (thanks @Vinko Vrsalovic)

Have you tried constraint exclusion (section 5.9.4 in the document you've linked to)?
Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example:
SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';
Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan. You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off.

I had a similar problem that I was able to fix by casting the conditions in the WHERE clause.
E.g. (assuming the time_stamp column is of type timestamptz):
WHERE time_stamp >= '2010-02-10'::timestamptz and time_stamp < '2010-02-11'::timestamptz
Also, make sure the CHECK condition on the table is defined the same way...
E.g.:
CHECK (time_stamp < '2010-02-10'::timestamptz)
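For exclusion to work reliably, the partition's CHECK constraint should use the same type and casts as the query predicate. A minimal sketch of such a partition definition (the table names are from the question; the date range is illustrative):

CREATE TABLE my_table_part_a (
    -- constraint uses timestamptz, matching the casts in the query above
    CHECK (time_stamp >= '2010-02-01'::timestamptz
       AND time_stamp <  '2010-03-01'::timestamptz)
) INHERITS (my_table);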

I had the same problem, and it boiled down to two reasons in my case:
I had an indexed column of type timestamp WITH time zone, but the partition constraint on that column used timestamp WITHOUT time zone.
After fixing the constraints, an ANALYZE of all child tables was needed.
Edit: another bit of knowledge - it's important to remember that constraint exclusion (which allows PG to skip scanning some tables based on your partitioning criteria) doesn't work with, quote, "non-immutable functions such as CURRENT_TIMESTAMP".
My queries used CURRENT_DATE, and that was part of my problem.
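To illustrate that last point (a sketch, not from the original posts): the planner cannot prove exclusion against a non-immutable expression at plan time, so interpolate a literal from the application instead:

-- scans every partition: CURRENT_DATE is not immutable,
-- so the planner cannot exclude partitions at plan time
SELECT * FROM my_table WHERE time_stamp >= CURRENT_DATE;

-- prunes: a literal can be checked against each partition's CHECK constraint
SELECT * FROM my_table WHERE time_stamp >= '2010-02-10'::timestamptz;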

Related

Performance issue in PostgreSQL

When I run this query in Oracle it takes 0.04054 s, but when I run the same query in PostgreSQL it takes 49.8 min. How can I change the query to improve performance in PostgreSQL?
SELECT
"ID","IMAGE","TITLE","SERVICE_DESC"
,"STATUS", "ACTION","REMOVAL_TEXT","SERVICE_PROVIDER"
, "SERVICE_PROVIDER_NAME"
FROM (
SELECT DISTINCT "ID","IMAGE"
,"TITLE", "SERVICE_DESC"
, COALESCE("STATUS",'N') as "STATUS"
,"ACTION","REMOVAL_TEXT","CREATED_DT"
,"SERVICE_PROVIDER", "SERVICE_PROVIDER_NAME"
FROM MZP_ADP.ALL_SERVICE_DETAILS
WHERE "ZIP_CODE"='55005' AND "MAKE_LIVE" = 'Y'
AND "LOCATION_ID" = '2407605'
AND "END_DATE" > CURRENT_TIMESTAMP(0)::TIMESTAMP WITHOUT TIME ZONE
AND "IS_ACTIVE" = 'Y' order by "CREATED_DT" desc
) alias;
There could be many problems (row counts, hardware, missing indexes).
First, what is the row count of the table?
Have you inserted a lot of rows recently?
(If so, REINDEX TABLE table_name and VACUUM ANALYZE table_name can help.)
Check for indexes on these columns:
LOCATION_ID
ZIP_CODE
CREATED_DT
END_DATE
Why is the SELECT wrapped in a subselect? Please eliminate it.
Can you eliminate the DISTINCT with an additional WHERE clause?
Please share the query plan and row counts so we can give more details:
EXPLAIN ANALYZE SELECT ...
You can try:
Create this index
create index ALL_SERVICE_DETAILS_CMP_INDEX on MZP_ADP.ALL_SERVICE_DETAILS ("ZIP_CODE", "MAKE_LIVE", "LOCATION_ID", "END_DATE", "IS_ACTIVE");
Remove the parent SELECT
Remove the DISTINCT (if at least one column in the select list is unique)
Apply a few things for a performance boost:
Run VACUUM FULL on the tables (it also rebuilds indexes); if in doubt, rebuild the indexes explicitly:
VACUUM (FULL, ANALYZE) table_name;
REINDEX TABLE table_name;
Increase work_mem and maintenance_work_mem according to your memory and server configuration
Use GROUP BY instead of DISTINCT (DISTINCT is slower)
Remove the ORDER BY inside the subquery; if it is needed, apply it outside
Create a composite index on ZIP_CODE, LOCATION_ID and END_DATE, and order the WHERE clause to match (as MAKE_LIVE and IS_ACTIVE are flag-type columns, they need to be added first in the index)
Run EXPLAIN ANALYZE on the query to check the execution time and that the proper index is used
Pseudocode:
SELECT columns
FROM (SELECT columns
FROM table
WHERE searching columns as per index creation
GROUP BY WITHOUT aggregated COLUMNS) t
ORDER BY columns -- if needed
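Putting those suggestions together, the rewritten query might look like this (a sketch only; whether GROUP BY can safely replace the DISTINCT here depends on the actual data, and the column list is taken from the question):

SELECT "ID", "IMAGE", "TITLE", "SERVICE_DESC",
       COALESCE("STATUS", 'N') AS "STATUS",
       "ACTION", "REMOVAL_TEXT", "SERVICE_PROVIDER", "SERVICE_PROVIDER_NAME"
FROM MZP_ADP.ALL_SERVICE_DETAILS
WHERE "ZIP_CODE" = '55005'
  AND "MAKE_LIVE" = 'Y'
  AND "LOCATION_ID" = '2407605'
  AND "END_DATE" > CURRENT_TIMESTAMP(0)::TIMESTAMP WITHOUT TIME ZONE
  AND "IS_ACTIVE" = 'Y'
-- GROUP BY replaces the DISTINCT; CREATED_DT is grouped so it can be sorted on
GROUP BY "ID", "IMAGE", "TITLE", "SERVICE_DESC", COALESCE("STATUS", 'N'),
         "ACTION", "REMOVAL_TEXT", "CREATED_DT",
         "SERVICE_PROVIDER", "SERVICE_PROVIDER_NAME"
ORDER BY "CREATED_DT" DESC;  -- applied once, outside any subquery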

MariaDB scans all partitions on a timestamp column

I have a table Partitioned by:
HASH(timestamp DIV 43200)
When I perform this query
SELECT max(id)
FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807
it scans all partitions, even though both numbers 1581708508 and 1581708807 (and every number between them) fall in the same partition. How can I make it scan only that partition?
You have discovered one of the reasons why PARTITION BY HASH is useless.
In your situation, the Optimizer sees the range (BETWEEN) and says "punt, I'll just scan all the partitions".
That is, "partition pruning" does not work when the WHERE clause involves a range and you are using PARTITION BY HASH. PARTITION BY RANGE, on the other hand, may be able to prune. But... What's the advantage? It does not make the query any faster.
I have found only four uses for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint . It sounds like your application does not fit any of those cases.
That particular query would best be done without partitioning. Instead have a non-partitioned table with this 'composite' index:
INDEX(timestamp, id)
It must scan all the rows to discover the MAX(id), but with this index, it is
Scanning only the 2-column index
Not touching any rows outside the timestamp range.
Hence it will be as fast as possible. Even if PARTITION BY HASH were smart enough to do the desired pruning, it would not run any faster.
In particular, when you ask for a range on the partition key, such as with WHERE timestamp BETWEEN 1581708508 AND 1581708807, the execution looks in all partitions for the desired rows. This is one of the major failings of HASH. Even if it could realize that only one partition is needed, it would be no faster than simply using the index I suggest.
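A minimal sketch of that suggestion (assuming messages is rebuilt without partitioning; table and column names are from the question):

-- composite index so MAX(id) over a timestamp range is resolved from the index alone
ALTER TABLE messages ADD INDEX idx_ts_id (timestamp, id);

SELECT MAX(id)
FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807;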
You can determine that individual partition by using modular arithmetic:
MOD(<expression given to the HASH function>, <number of partitions>)
assuming you have 2 partitions
CREATE TABLE messages(ID int, timestamp int)
PARTITION BY HASH( timestamp DIV 43200 )
PARTITIONS 2;
Look up the partition names with:
SELECT CONCAT( 'p',MOD(timestamp DIV 43200,2)) AS partition_name, timestamp
FROM messages;
and determine the related partition name for the value 1581708508 of the timestamp column (assume p1). Then use
SELECT MAX(id)
FROM messages PARTITION(p1)
to get all the records in partition p1 only, without needing a WHERE condition such as
WHERE timestamp BETWEEN 1581708508 AND 1581708807
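As a sanity check for the boundary values in the question (with the two-partition table above), both endpoints hash to the same partition:

SELECT MOD(1581708508 DIV 43200, 2);  -- 1581708508 DIV 43200 = 36613; MOD(36613, 2) = 1 -> p1
SELECT MOD(1581708807 DIV 43200, 2);  -- 1581708807 DIV 43200 = 36613; MOD(36613, 2) = 1 -> p1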
By the way, all partitions can be listed through:
SELECT *
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_name='messages'

ORACLE - How are indexes related to COST

This question might be weird since my knowledge of SQL is limited.
I am trying to optimize a database containing millions of records, but I can't seem to tell when there is a performance gain. I have been playing around with the Explain Plan in SQL Developer, but those numbers really don't mean a lot to me.
For instance, I got a table with following columns :
ID | Creation_date | End_date | Critical_date | Flag
I do run a query on this table looking like this :
SELECT COUNT(ID) FROM TABLE
WHERE FLAG = 1
AND Creation_date < :p_d_one
AND End_date > :p_d_two
AND Critical_date < :p_d_three
This gives me a COST of 64, so I decided to add indexes.
Now the thing I do not understand is that when I create an index on Creation_date, the cost drops drastically (64 -> 9):
CREATE INDEX TABLE_INDEX_CREATION_DATE ON TABLE (Creation_date desc)
but when I create an index on the other fields, it doesn't change a bit (even though they are also dates).
Any idea if it is still worth including them in my index?
Should I still create indexes for other queries if they don't have any influence on the cost?
Is the cost calculated using live statistics from the database (cardinality, ...)?
You could create an index over all the fields mentioned in the query (flag, creation_date, end_date, critical_date, id in that order) so your query becomes index-only, i.e. the RDBMS does not have to read the table at all, it can get everything it needs from the index.
For a particular query, Oracle will not use more than one index on a given table. Having more than one index is only necessary to support different queries.
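A hedged sketch of that covering index (the index name is made up, and my_table stands in for the question's placeholder table name):

-- covering index: the COUNT query can be answered from the index alone
CREATE INDEX ix_covering ON my_table (Flag, Creation_date, End_date, Critical_date, ID);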

Indexed ORDER BY with LIMIT 1

I'm trying to fetch the most recent row in a table. I have a simple timestamp column created_at, which is indexed. When I query ORDER BY created_at DESC LIMIT 1, it takes far longer than I think it should (about 50 ms on my machine, over 36k rows).
EXPLAIN shows that it uses a backward index scan, but I confirmed that changing the index to (created_at DESC) does not change the cost in the query planner for a simple index scan.
How can I optimize this use case?
Running PostgreSQL 9.2.4.
Edit:
# EXPLAIN SELECT * FROM articles ORDER BY created_at DESC LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.58 rows=1 width=1752)
-> Index Scan Backward using index_articles_on_created_at on articles (cost=0.00..20667.37 rows=35696 width=1752)
(2 rows)
Assuming we are dealing with a big table, a partial index might help:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > '2013-09-15 0:0'::timestamp;
As you already found out: descending or ascending hardly matters here. Postgres can scan backwards at almost the same speed (exceptions apply with multi-column indices).
Query to use this index:
SELECT * FROM tbl
WHERE created_at > '2013-09-15 0:0'::timestamp -- matches index
ORDER BY created_at DESC
LIMIT 1;
The point here is to make the index much smaller, so it should be easier to cache and maintain.
You need to pick a timestamp that is guaranteed to be smaller than the most recent one.
You should recreate the index from time to time to cut off old data.
The condition needs to be IMMUTABLE.
So the one-time effect deteriorates over time. The specific problem is the hard-coded condition:
WHERE created_at > '2013-09-15 0:0'::timestamp
Automate
You could update the index and your queries manually from time to time. Or you automate it with the help of a function like this one:
CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-09-15 0:0'::timestamp$$;
Index:
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();
Query:
SELECT * FROM tbl
WHERE created_at > f_min_ts()
ORDER BY created_at DESC
LIMIT 1;
Automate recreation with a cron job or some trigger-based event. Your queries can stay the same now, but you need to recreate all indexes that use this function whenever you change it. Just drop and create each one.
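A sketch of such a periodic maintenance script (the new cutoff date is illustrative):

-- move the cutoff forward; CREATE OR REPLACE keeps the function's identity,
-- so the dependent index definition survives
CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-12-01 0:0'::timestamp$$;

-- the index must be rebuilt so its entries match the new predicate
DROP INDEX IF EXISTS tbl_created_recently_idx;
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();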
First ...
... test whether you are actually hitting the bottleneck with this.
Try whether a simple DROP INDEX ...; CREATE INDEX ... does the job; if so, your index might have been bloated and your autovacuum settings may be off.
Or try VACUUM FULL ANALYZE to get your whole table plus indexes into pristine condition and check again.
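For the table in the question's plan (articles, per the EXPLAIN output above):

VACUUM FULL ANALYZE articles;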
Other options include the usual general performance tuning and covering indexes, depending on what you actually retrieve from the table.

Two questions on PostgreSQL performance

1) What is the best way to implement paging in PostgreSQL?
Assume we need to implement paging. The simplest query is select * from MY_TABLE order by date_field DESC limit 10 offset 20. As far as I understand, we have two problems here: if the dates have duplicate values, every run of this query may return different results, and the larger the offset value, the longer the query runs. We have to provide an additional column, date_field_index:
--date_field--date_field_index--
  12-01-2012        1
  12-01-2012        2
  14-01-2012        1
  16-01-2012        1
--------------------------------
Now we can write something like
create index MY_INDEX on MY_TABLE (date_field, date_field_index);
select * from MY_TABLE where date_field <= last_page_date and not (date_field_index >= last_page_date_index and date_field = last_page_date) order by date_field DESC, date_field_index DESC limit 20;
...thus using the WHERE clause and the corresponding index instead of OFFSET. OK, now the questions:
1) Is this the best way to improve the initial query?
2) How can we populate that date_field_index field? Do we have to provide a trigger for this?
3) We should not use row_number() window functions in Postgres because they do not use indexes and are thus very slow. Is that correct?
2) Why does column order in a concatenated index not affect query performance?
My measurements show that when searching with a concatenated index (an index over two or more columns), it makes no difference whether the most selective column is placed first or last. Why? If we place the most selective column first, we scan a shorter range of matching rows, which should have an impact on performance. Am I right?
Use the primary key as the tie-breaker instead of the date_field_index column, or explain why that is not an option:
order by date_field DESC, "primary_key_column(s)" DESC
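A sketch of the resulting keyset query (assuming a single primary key column id, a supporting index on (date_field, id), and that :last_date and :last_id come from the last row of the previous page):

SELECT *
FROM MY_TABLE
WHERE (date_field, id) < (:last_date, :last_id)  -- row-value comparison pairs well with the index
ORDER BY date_field DESC, id DESC
LIMIT 10;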
The combined index with the most selective column first is the best performer, but it will not be used if:
the matched values are more than a few percent of the table
there aren't enough rows to make it worthwhile
the range of dates is not small enough
What is the output of EXPLAIN my_query?