Postgres query optimization - indexing

On Postgres 9.0, why does setting both enable_indexscan and enable_seqscan to off improve query performance by 2x?

This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning enable_indexscan and enable_seqscan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
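For example, a rough sketch of such a per-session experiment; the specific values, the table name my_table, and the column some_col are only illustrative:
-- Session-local overrides, discarded when the connection closes:
SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
SET effective_cache_size = '6GB';  -- shared_buffers plus OS cache on this machine
-- Compare the plan and timing against the defaults:
EXPLAIN ANALYZE SELECT * FROM my_table WHERE some_col = 42;
-- RESET ALL;  -- or simply reconnect to go back to the postgresql.conf values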
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.

Why?
The most logical answer is because of the way your database tables are configured.
Without your posting your table schemas, I can only hazard a guess that your indices don't have high cardinality.
That is to say, if your index contains too much information to be useful, then it will be far less efficient, or indeed slower.
Cardinality is a measure of how unique a row in your index is. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database and it has a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
So if you ran 'select * from Contacts where OptIn = true', you can imagine that you'd return a lot of Contacts; say 50% of them in our case.
Now if you add this 'OptIn' column to an index on that same table, it stands to reason that no matter how fine the other selectors are, you will always return 50% of the table, because of the value of 'OptIn'.
This is a perfect example of low cardinality; it will be slow because any query involving that index will have to select 50% of the rows in the table; to then be able to apply further WHERE filters to reduce the dataset again.
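A minimal sketch of that scenario; the table, column, and index names are made up purely for illustration:
create table Contacts (
    id    serial primary key,
    name  text,
    OptIn boolean
);
-- An index led by the low-cardinality boolean column:
create index idx_contacts_optin on Contacts (OptIn);
-- Roughly half the table matches, so the index narrows almost nothing:
select * from Contacts where OptIn = true;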
Long story short: if your indices include bad fields or simply cover every column in the table, then the SQL engine has to resort to testing row-by-agonizing-row.
Anyway, the above is theoretical in your case; but it is a known common reason for why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!

Related

Will the query plan change for different data sizes?

Suppose the data distribution does not change. For the same query, if only the dataset is enlarged by some factor, will the time taken grow by the same factor? And if the data distribution does not change, will the query plan change, in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but it may happen for other reasons (wrap-around prevention vacuum, etc.).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
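For instance, a sketch assuming a hypothetical table events with a column user_id; the target of 500 is only an illustrative value (the default is 100):
ALTER TABLE events ALTER COLUMN user_id SET STATISTICS 500;
ANALYZE events;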
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. Suppose all your records fit on one page and you have an index. Consider the query:
select t.*
from table t
where col = x;
And assume you have an index on col. With just one page of records, the fastest way is simply to read the page and check the where clause against each record. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.

SQL SERVER - Execution plan

I have a VIEW in both databases. In one database it takes less than 1 second to run, but in the other it takes 1 minute or more. I checked the indexes and everything is the same. The difference in row counts between the two databases is less than 10 million rows.
I checked the execution plan, and what I found is that in the database that takes more time there are 3 Hash Match operators (1 aggregate and 2 right outer joins) responsible for 100% of the query batch. The other database doesn't have these in its execution plan.
Can anyone tell me where I can begin searching for the problem?
Thank you, and sorry for the bad English.
You can check this link here for a quick explanation on different types of joins.
Basically, with the information you've given us, here are some of the alternatives for what might be wrong:
One DB has indexes the other doesn't.
The size difference between some of the joined tables in one DB over the other, is dramatic enough to change the type of join used.
While your indexes might be the same on both DB table groups, as you said, it's possible the other DB has outdated/bad statistics or too much index fragmentation, resulting in sub-optimal plans.
EDIT:
Regarding your comment below, it's true that rebuilding indexes is similar to dropping & recreating indexes. And since creating indexes also creates the statistics for those indexes, rebuilding will take care of them as well. Sometimes that's not enough however.
While officially the default statistics should be built with about a 20% sampling rate of the actual data, in reality the sampling rate can be as low as just a few percent, depending on how massive the table is. It's rarely anywhere near 20%. Because of that, many DBAs build statistics manually with FULLSCAN to obtain a 100% sampling rate.
The statistics take up the same amount of storage space either way, so there is really no downside to this aside from the extra time required in maintenance plans. In my current project, we have several situations where the default sampling rate for the statistics is not enough and would still produce bad plans, so we routinely update all statistics with FULLSCAN every few weeks to make sure the performance stays top notch.
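A sketch of that maintenance step in T-SQL, with a placeholder table name:
-- Rebuild all statistics on the table with a 100% sampling rate:
UPDATE STATISTICS dbo.MyBigTable WITH FULLSCAN;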

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now, my server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. The simple query of getting total count is taking 4-5 minutes using select count(*) from <tableName>;.
I have indexed my table with the time I am entering the data (timecode). Most of the time I need count with different WHERE / AND clauses added.
Is there any way to make things fast? I need to make it as fast as possible because once the server moves to network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*) which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link. One is to count an indexed column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another suggestion is to use a Postgres-specific catalog lookup, like:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
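A rough way to take that measurement, assuming the table is called mytable (\timing is a psql meta-command that reports elapsed time):
-- Table size on disk:
SELECT pg_relation_size('mytable') AS size_bytes,
       pg_size_pretty(pg_relation_size('mytable')) AS size_pretty;
-- Time the full scan, then divide size_bytes by the elapsed seconds:
\timing on
SELECT count(*) FROM mytable;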
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
Also note that the speed of select count(*)-style queries has nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table this can be helpful.
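For illustration, a minimal partitioning sketch using declarative partitioning (available in PostgreSQL 10 and later; the 9.1 documentation linked above describes the older inheritance-based approach). The table and column names are hypothetical:
CREATE TABLE samples (
    timecode timestamptz NOT NULL,
    value    double precision
) PARTITION BY RANGE (timecode);

CREATE TABLE samples_2024 PARTITION OF samples
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Queries filtering on timecode only touch the relevant partitions:
SELECT count(*) FROM samples
WHERE timecode >= '2024-06-01' AND timecode < '2024-07-01';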

How many rows in a database are TOO MANY?

I have a MySQL InnoDB table with 1,000,000 records. Is this too much? Or can databases handle this and more? I ask because I noticed that some queries (for example, getting the last row of a table) are slower (seconds) on the table with 1 million rows than on one with 100.
I've a MySQL InnoDB table with 1000000 registers. Is this too much?
No, 1,000,000 rows (AKA records) is not too much for a database.
I ask because I noticed that some queries (for example, getting the last register of a table) are slower (seconds) in the table with 1 million registers than in one with 100.
There's a lot to account for in that statement. The usual suspects are:
Poorly written query
Not using a primary key, assuming one even exists on the table
Poorly designed data model (table structure)
Lack of indexes
I have a database with more than 97,000,000 records (a 30 GB datafile) and have no problems.
Just remember to define and tune your table indexes.
So it's obvious that 1,000,000 is not MANY! (But if you don't index, then yes, it is MANY.)
Use EXPLAIN to examine your query and see if there is anything wrong with the query plan.
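For instance, for the slow "last row" query from the question, assuming a hypothetical table mytable with an auto-increment primary key id:
-- With a usable index on id, the plan should show an index read in reverse
-- order rather than a full table scan followed by a sort:
EXPLAIN SELECT * FROM mytable ORDER BY id DESC LIMIT 1;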
I think this is a common misconception - size is only one part of the equation when it comes to database scalability. There are other issues that are hard (or harder):
How large is the working set (i.e., how much data needs to be loaded into memory and actively worked on)? If you just insert data and then do nothing with it, it's actually an easy problem to solve.
What level of concurrency is required? Is there just one user inserting/reading, or do we have many thousands of clients operating at once?
What levels of promise/durability and consistency of performance are required? Do we have to make sure that we can honor each commit? Is it okay if the average transaction is fast, or do we want to make sure that all transactions are reliably fast (six-sigma-style quality control - http://www.mysqlperformanceblog.com/2010/06/07/performance-optimization-and-six-sigma/)?
Do you need to handle operational tasks, such as ALTERing the table schema? In InnoDB this is possible, but incredibly slow, since it often has to create a temporary table in the foreground (blocking all connections).
So I'm going to state the two limiting issues are going to be:
Your own skill at writing queries / having good indexes.
How much pain you can tolerate waiting on ALTER TABLE statements.
If you mean 1 million rows, then it depends on how your indexing is done and the configuration of your hardware. A million rows is not a large amount for an enterprise database, or even a dev database on decent equipment.
If you mean 1 million columns (not sure that's even possible in MySQL), then yes, this seems a bit large and will probably cause problems.
Register? Do you mean record?
One million records is not a real big deal for a database these days. If you run into any issue, it's likely not the database system itself, but rather the hardware that you're running it on. You're not going to run into a problem with the DB before you run out of hardware to throw at it, most likely.
Now, obviously some queries are slower than others, but if two very similar queries run in vastly different times, you need to figure out what the database's execution plan is and optimize for it, i.e. use correct indexes, proper normalization, etc.
Incidentally, there is no such thing as a "last" record in a table; from a logical standpoint, rows have no inherent order.
I've seen non-partitioned tables with several billion (indexed) records, that self-joined for analytical work. We eventually partitioned the thing but honestly we didn't see that much difference.
That said, that was in Oracle and I have not tested that volume of data in MySQL. Indexes are your friend :)
Assuming you mean "records" by "registers": no, it's not too much. MySQL scales really well and can hold as many records as you have disk space for.
Obviously though search queries will be slower. There is really no way around that except making sure that the fields are properly indexed.
The larger the table gets (as in more rows in it), the slower queries will typically run if there are no indexes. Once you add the right indexes your query performance should improve or at least not degrade as much as the table grows. However, if the query itself returns more rows as the table gets bigger, then you'll start to see degradation again.
While 1M rows are not that many, it also depends on how much memory you have on the DB server. If the table is too big to be cached in memory by the server, then queries will be slower.
Using the query provided will be exceptionally slow because it uses a sort-merge method to sort the data.
I would recommend rethinking the design so you are using indexes to retrieve it or make sure it is already ordered in that manner so no sorting is needed.
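A minimal sketch of that idea; the table and column names are hypothetical:
-- An index matching the ORDER BY lets the engine read rows already in order
-- instead of sorting the whole table:
CREATE INDEX idx_events_created_at ON events (created_at);
SELECT * FROM events ORDER BY created_at DESC LIMIT 1;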

How do I force Postgres to use a particular index?

How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan and enable_indexscan parameters. See:
Examining index usage
enable_ parameters
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_ params and walk away.
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
For small tables, it's faster to do sequential scans.
Postgres doesn't use indexes when datatypes don't match properly; you may need to include appropriate casts.
Your planner settings might be causing problems.
See also this old newsgroup post.
Probably the only valid reason for using
set enable_seqscan=false
is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.
TL;DR
Run the following three commands and check whether the problem is fixed:
ANALYZE;
SET random_page_cost = 1.0;
SET effective_cache_size = 'X GB'; -- replace X with total RAM size minus 2 GB
Read on for further details and background information about this.
Step 1: Analyze tables
As a simple first attempt to fix the issue, run the ANALYZE; command as the database superuser in order to update all table statistics. From the documentation:
The query planner uses these statistics to help determine the most efficient execution plans for queries.
Step 2: Set the correct random page cost
Index scans require non-sequential disk page fetches. PostgreSQL uses the random_page_cost configuration parameter to estimate the cost of such non-sequential fetches in relation to sequential fetches. From the documentation:
Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.
The default value is 4.0, thus assuming an average cost factor of 4 compared to sequential fetches, taking caching effects into account. However, if your database is stored on an SSD drive, then you should actually set random_page_cost to 1.1 according to the documentation:
Storage that has a low random read cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value for random_page_cost, e.g., 1.1.
Also, if an index is mostly (or even entirely) cached in RAM, then an index scan will always be significantly faster than a disk-served sequential scan. The query planner however doesn't know which parts of the index are already cached, and thus might make an incorrect decision.
If your database indices are frequently used, and if the system has sufficient RAM, then the indices are likely to be cached eventually. In such a case, random_page_cost can be set to 1.0, or even to a value below 1.0 to aggressively prefer using index scans (although the documentation advises against doing that). You'll have to experiment with different values and see what works for you.
As a side note, you could also consider using the pg_prewarm extension to explicitly cache your indices into RAM.
You can set the random_page_cost like this:
SET random_page_cost = 1.0;
Step 3: Set the correct cache size
On a system with 8 or more GB RAM, you should set the effective_cache_size configuration parameter to the amount of memory which is typically available to PostgreSQL for data caching. From the documentation:
A higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.
Note that this parameter doesn't change the amount of memory which PostgreSQL will actually allocate, but is only used to compute cost estimates. A reasonable value (on a dedicated database server, at least) is the total RAM size minus 2 GB. The default value is 4 GB.
You can set the effective_cache_size like this:
SET effective_cache_size = '14 GB'; -- e.g. on a dedicated server with 16 GB RAM
Step 4: Fix the problem permanently
You probably want to use ALTER SYSTEM SET ... or ALTER DATABASE db_name SET ... to set the new configuration parameter values permanently (either globally or per-database). See the documentation for details about setting parameters.
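For example (the parameter values are only illustrative, and mydb is a placeholder database name):
-- Persist globally (written to postgresql.auto.conf):
ALTER SYSTEM SET random_page_cost = 1.0;
SELECT pg_reload_conf();   -- reload so the change takes effect

-- Or persist for a single database only:
ALTER DATABASE mydb SET effective_cache_size = '14GB';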
Step 5: Additional resources
If it still doesn't work, then you might also want to take a look at this PostgreSQL Wiki page about server tuning.
Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description = 'Refund'
GROUP BY client_id
PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description||'' = 'Refund'
GROUP BY client_id
The question in itself is somewhat invalid. Forcing (by doing enable_seqscan=off, for example) is a very bad idea. It might be useful to check whether it would be faster, but production code should never use such tricks.
Instead, run EXPLAIN ANALYZE on your query, read it, and find out why PostgreSQL chooses a bad (in your opinion) plan.
There are tools on the web that help with reading EXPLAIN ANALYZE output - one of them is explain.depesz.com, written by me.
Another option is to join the #postgresql channel on the Freenode IRC network and talk to the people there to help you out, as optimizing a query is not a matter of "ask a question, get an answer, be happy". It's more like a conversation, with many things to check and many things to be learned.
One thing to check with PostgreSQL, when you are expecting an index to be used and it is not being used, is to VACUUM ANALYZE the table.
VACUUM ANALYZE schema.table;
This updates statistics used by the planner to determine the most efficient way to execute a query. Which may result in the index being used.
Another thing to check is the types.
Is the index on an int8 column and you are querying with numeric? The query will work but the index will not be used.
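A minimal illustration of that pitfall, with hypothetical names (id is an indexed int8/bigint column):
-- The numeric comparison can prevent use of the btree index on the int8 column:
SELECT * FROM orders WHERE id = 12345::numeric;
-- Comparing against a matching integer type allows the index to be used:
SELECT * FROM orders WHERE id = 12345::int8;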
There is a trick to push Postgres to prefer a seqscan: add an OFFSET 0 to the subquery.
This is handy for optimizing requests joining big/huge tables when all you need is only the first/last n elements.
Let's say you are looking for the first/last 20 elements involving multiple tables with 100k (or more) entries; there is no point building and joining the whole query over all the data when what you're looking for is in the first 100 or 1000 entries. In this scenario, for example, it turns out to be over 10x faster to do a sequential scan.
see How can I prevent Postgres from inlining a subquery?
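A sketch of the trick, with made-up table and column names; the OFFSET 0 acts as an optimization fence that keeps the subquery from being flattened into the outer query, so it is planned (and limited) on its own:
SELECT *
FROM (
    SELECT e.*
    FROM events e
    ORDER BY e.created_at DESC
    LIMIT 20
    OFFSET 0          -- fence: prevents the subquery from being inlined
) recent
JOIN users u ON u.id = recent.user_id;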
Indexes can only be used under certain circumstances.
For example, the type of the value must fit the type of the column.
You must not be applying an operation to the column before comparing it to the value.
Given a customer table with 3 columns and an index on each of the columns:
create table customer(id numeric(10), age int, phone varchar(200));
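The three indexes the example assumes might look like this (the names are made up, matching the idx_age mentioned below):
create index idx_id    on customer(id);
create index idx_age   on customer(age);
create index idx_phone on customer(phone);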
It might happen that the database tries to use, for example, the index idx_age instead of the one on the phone number.
You can sabotage the use of the index on age by applying an operation to age:
select * from customer where phone = '1235' and age+1 = 24
(although you are looking for the age 23)
This is of course a very simple example, and the intelligence of Postgres is probably good enough to make the right choice. But sometimes there is no other way than tricking the system.
Another example is to
select * from customer where phone = '1235' and age::varchar = '23'
But this is probably more costly than the option above.
Unfortunately you CANNOT specify the name of the index in the query like you can in MSSQL or Sybase.
select * from customer (index idx_phone) where phone = '1235' and age = 23.
This would help a lot to avoid problems like this.
Apparently there are cases where Postgres can be hinted into using an index by repeating a similar condition twice.
The specific case I observed was using a PostGIS spatial (GiST) index and the ST_Within predicate, like this:
select *
from address
natural join city
natural join restaurant
where st_within(address.location, restaurant.delivery_area)
and restaurant.delivery_area ~ address.location
Note that the first predicate st_within(address.location, restaurant.delivery_area) is automatically decomposed by PostGIS into (restaurant.delivery_area ~ address.location) AND _st_contains(restaurant.delivery_area, address.location), so adding the second predicate restaurant.delivery_area ~ address.location is completely redundant. Nevertheless, the second predicate convinced the planner to use the spatial index on address.location and, in the specific case I needed, improved the running time eightfold.