I have been tasked with improving the performance of a slow-running process which updates some data in a PostgreSQL 8.3 database (running on Solaris; the updates are driven by Perl 5.8 scripts through SOAP). I have very little control over about 50% of the time consumed, so tuning my 50% is quite important.
There are usually about 4,500,000 rows in the table although I've seen it bloat out to about 7,000,000. The id that the update is querying on (not primary or unique) has just under 9000 distinct values and the spread of occurrences is weighted heavily towards 1 per id (median value is 20, max value 7000).
There is an index on this id but with such sparse data I wonder if there's a better way of doing things. I'm also considering denormalising things a bit (database is not super-normalised anyway) & pulling data out into a separate table (probably controlled/maintained by triggers) to help speed things up.
So far, I have made some pretty basic tweaks (not pinging the database every n seconds to see if it's alive, not setting session variables unnecessarily etc) and this is helping but I really feel that there's something I'm missing with the data ...
Even if someone says that pulling relevant data out into a separate table is an excellent/terrible idea that would be really helpful! Any other ideas (or further questions for clarification) gratefully received!
Query:
UPDATE tab1 SET client = 'abcd', invoice = 999
WHERE id = 'A1000062' and releasetime < '02-11-09'::DATE
AND charge IS NOT NULL AND invoice IS NULL AND client IS NULL;
I realise the 'is not null' is far from ideal. Id is indexed, as are invoice & client (btrees, so I understand PostgreSQL will/should/can use an index there). It's a pretty trivial query ...
Query plan (explain with analyze):
Bitmap Heap Scan on tab1 (cost=17.42..1760.21 rows=133 width=670) (actual time=0.603..0.603 rows=0 loops=1)
Recheck Cond: (((id)::text = 'A1000062'::text) AND (invoice IS NULL))
Filter: ((charge IS NOT NULL) AND (client IS NULL) AND (releasetime < '2009-11-02'::date))
-> Bitmap Index Scan on cdr_snapshot_2007_09_12_snbs_invoice (cost=0.00..17.39 rows=450 width=0) (actual time=0.089..0.089 rows=63 loops=1)
Index Cond: (((snbs)::text = 'A1000062'::text) AND (invoice IS NULL))
Total runtime: 0.674 ms
Autovacuum is, I believe, enabled. There are no foreign key constraints but thanks for the tip as I didn't know that.
I am really liking the idea of increasing the statistics value - I'll be having a play around with that straight away.
You really need to get some query plans, and edit your question to include them. In addition to helping to figure out better ways of doing things, they can also be used to easily measure the improvement.
You can affect performance either by changing the SQL, or by adjusting the indexes and statistics that are used to determine the query plan.
One possibility is that you have foreign key constraints that do not have supporting indexes. PostgreSQL does not add them automatically when you create a foreign key constraint. If the referenced table has a row deleted (or a referenced field updated), the referencing table will need to be scanned entirely, either to cascade the delete or to ensure that there are no rows referencing the deleted one.
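If that turns out to be the case, the fix is an index on the referencing column; a minimal sketch, with made-up table and column names:
-- PostgreSQL indexes the referenced (primary key) side automatically,
-- but not the referencing column, so create it by hand.
CREATE INDEX child_parent_id_idx ON child_table (parent_id);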
If the distribution of your id field is quite skewed, increasing the statistics on that column may help.
If the statistics target is set to 100, then the 100 most common ids (from a sample) will be recorded, along with their frequencies. Say that covers about 50% of your table, leaving roughly 2 to 3.5 million rows which PostgreSQL will assume fall evenly amongst your other 8900 ids, or about 250 to 400 times each.
If the statistics target were increased to 1000 and the top 1000 ids covered 95% of your rows, PostgreSQL would assume that ids not among your 1000 most common occur about 30 to 40 times each.
That change in estimates can affect the chosen query plan. If the pattern of queries more often selects the less frequently occurring ids, PostgreSQL will be overestimating how many times those ids will be found.
There is a performance cost for storing so many most frequent values, so you really need supporting query plan analysis to determine whether you're getting a net gain.
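As a sketch (the table and column names are taken from the query above), raising the per-column statistics target and refreshing the statistics looks like this:
-- Track up to 1000 most-common values for the id column,
-- then re-sample so the planner sees the new target.
ALTER TABLE tab1 ALTER COLUMN id SET STATISTICS 1000;
ANALYZE tab1;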
I have an interesting problem: I want to enforce a limit on how many offers a user can place. The offers are saved in a PostgreSQL (version 10) database and should not exceed 1000 per user.
I am using the following sql query to check how many offers a user has and check it against the limit:
select count(*) from offers where offers.userId = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';
In the performance metrics I see that most of the time is spent on this query. Therefore I looked it up and found this: https://wiki.postgresql.org/wiki/Slow_Counting
PostgreSQL will still need to read the resulting rows to verify that they exist;
In the query plan I can see that, in addition to the index-only scan, some heap fetches are needed, which I assume slows down the whole query:
Index Only Scan using offers_by_user_id_index on offers
Index Cond: (account_id = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716'::uuid)
Heap Fetches: 650
- What are ways to speed this up?
- Is tracking the row count a good approach to speed up the check?
Thanks for your help!
Edit: UserId is a UUID and an index exists on that column.
The number of heap fetches suggests that the table is not vacuumed often enough. If you manually VACUUM it, does that speed things up?
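For example (assuming the table from the question is simply called offers):
-- A vacuum updates the visibility map, which lets index-only scans
-- skip most of the heap fetches.
VACUUM (ANALYZE) offers;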
I'd say that the right tool for that is denormalization:
In the users table, add a column offersCount
Create an index on users (userid, offersCount)
Add two triggers to the offers table (sketched below)
Insert trigger - will update the users table and increase the offersCount column
Delete trigger - will update the users table and decrease the offersCount column
With this there will be almost no latency.
Note: if you don't want to touch the users table, just create a new one with only these two columns.
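A rough sketch of those triggers (column names such as users.offersCount and offers.userId follow the outline above and are assumptions about the schema):
-- Sketch only; adjust table and column names to the real schema.
ALTER TABLE users ADD COLUMN offersCount integer NOT NULL DEFAULT 0;

CREATE FUNCTION offers_count_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE users SET offersCount = offersCount + 1 WHERE id = NEW.userId;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE users SET offersCount = offersCount - 1 WHERE id = OLD.userId;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER offers_count_ins AFTER INSERT ON offers
    FOR EACH ROW EXECUTE PROCEDURE offers_count_trg();
CREATE TRIGGER offers_count_del AFTER DELETE ON offers
    FOR EACH ROW EXECUTE PROCEDURE offers_count_trg();
The check against the 1000 limit then becomes a single indexed lookup on users instead of a count over offers.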
First, your ids are presumably numbers, so the comparisons should not be to a string. So:
select count(*)
from offers
where offers.userId = 1;
For this query, I recommend an index on offers(userid). That might be a big help.
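If such an index does not exist yet, it would look something like this (the index name is illustrative):
-- Lets the count be answered from the index entries for that user
-- rather than by scanning the whole table.
CREATE INDEX offers_userid_idx ON offers (userId);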
This might be a situation where storing the ids in an array is beneficial. Then you can just add:
alter table users add constraint chk_offers check (array_length(offers, 1) <= 1000);
This also changes how to insert and delete values.
For many purposes this will work well. It does not work well if you care about keeping lots of other information about the user/offer, such as creation date, offer date, channel, and so on.
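A minimal sketch of that variant, assuming a hypothetical uuid[] column named offers on users (the ids below are placeholders):
-- The array itself holds the offer ids; the check above limits its length.
ALTER TABLE users ADD COLUMN offers uuid[] NOT NULL DEFAULT '{}';

-- "Inserting" an offer becomes appending to the array ...
UPDATE users
SET offers = array_append(offers, '00000000-0000-0000-0000-000000000001'::uuid)
WHERE id = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';

-- ... and "deleting" becomes removing it again.
UPDATE users
SET offers = array_remove(offers, '00000000-0000-0000-0000-000000000001'::uuid)
WHERE id = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';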
I'm having trouble regarding speed in a SELECT query on a Postgres database.
I have a table with two integer columns as key: (int1,int2)
This table has around 70 million rows.
I need to make two kinds of simple SELECT queries in this environment:
SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;
These two selects return around 10,000 rows each out of the 70 million. To make this as fast as possible I thought of using two HASH indexes, one for each column. Unfortunately the results are not that good:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on lec_sim (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
Recheck Cond: (lec2_id = 11782)
-> Bitmap Index Scan on lec_sim_lec2_hash_ind (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
Index Cond: (lec2_id = 11782)
Total runtime: 23342.534 ms
(5 rows)
This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds. My expectations are to get this information in less than a second.
These are some parameters of the postgres db config:
work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB
Any help, comment or thought would be really appreciated.
Thank you in advance.
Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
Thank you willglynn. As you noticed, the problem was seeking through the HD, not looking up the indexes. You proposed several solutions, like loading the dataset into RAM or buying SSDs. Setting those two aside, since they involve managing things outside the database itself, you proposed two ideas:
Reorganize the data to reduce the seeking of the data.
Use PostgreSQL 9.2 feature "index-only scans"
Since I am on a PostgreSQL 9.1 server, I decided to take option 1.
I made a copy of the table, so now I have the same table with the same data twice. I created an index on each one, the first indexed by (int1) and the second by (int2). Then I clustered them both (CLUSTER table USING ind_intX) by their respective indexes.
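In SQL terms, the duplicate-and-cluster step for the int2 copy looks roughly like this (the table and index names are taken from the plan below):
-- Second physical copy of the table, ordered on disk by lec2_id.
CREATE TABLE lec_sim_lec2id AS SELECT * FROM lec_sim;
CREATE INDEX lec_sim_lec2id_ind ON lec_sim_lec2id (lec2_id);
CLUSTER lec_sim_lec2id USING lec_sim_lec2id_ind;
ANALYZE lec_sim_lec2id;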
I'm posting now an EXPLAIN ANALYZE of the same query, done in one of these clustered tables:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
Index Cond: (lec2_id = 12300)
Total runtime: 1.822 ms
(3 rows)
Now the seeking is really fast. I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I think this problem is solved for me, I hope this might be useful also for others experiencing the same problem.
Thank you so much willglynn.
I had a case of super slow queries where simple one to many joins (in PG v9.1) were performed between a table that was 33 million rows to a child table that was 2.4 billion rows in size. I performed a CLUSTER on the foreign key index for the child table, but found that this didn't solve my problem with query timeouts, for even the simplest of queries. Running ANALYZE also did not solve the issue.
What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even as the parent table was completing its VACUUM process, I went from 10 minute timeouts to results coming back in one second.
What I am taking away from this is that regular VACUUM operations are still critical, even for v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to improve the autovacuum trigger to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
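If autovacuum keeps falling behind on tables this size, the per-table thresholds can be tightened; a sketch (the table name is illustrative):
-- Trigger autovacuum after ~1% of rows change instead of the default 20%.
ALTER TABLE child_table SET (autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE child_table SET (autovacuum_analyze_scale_factor = 0.005);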
For a very simple and effective one liner, if you have fast solid-state storage on your postgres machine, try setting:
random_page_cost=1.0
in your postgresql.conf.
The default is random_page_cost=4.0, which is optimized for storage with high seek times like old spinning disks. Lowering it changes the cost calculation for random access and relies less on your memory (which could ultimately be going to swap anyway).
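You can also test it for a single session before editing postgresql.conf:
-- Session-only change; rerun the slow query with EXPLAIN ANALYZE to compare plans.
SET random_page_cost = 1.0;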
This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.
The other major improvement came from making indexes on all of the boolean columns on my table. This reduced the 2 second query to about 1 second. Check #willglynn's answer for that.
Hope this helps!
I am working with a table with a "state" column, which typically holds only 2 or 3 different values. Sometimes, when this table holds several million rows, the following SQL statement becomes slow (I assume a full table scan is done):
SELECT state, count(*) FROM mytable GROUP BY state
I expect to get something like this:
disabled | 500000
enabled | 2000000
(basically I want to know how many items are "enabled" and how many items are "disabled" - actually that's a number instead of a text in my real application)
I guess adding an index for my state column is pretty useless, since only very few different values can be found there. What other options do I have?
There is also a "timestamp" column (with an index). Ideally the solution should also work well if I add:
WHERE timestamp BETWEEN x AND y
Right now I'm using an SQLite3 database, but it looks like other database engines are not too different, so solutions for other DB engines might be interesting as well.
Thank you!
I would put a covering index on timestamp,state (in that order). The rationale is:
the condition on the timestamp will be much more selective than the state
if the state is still in the index (i.e. a covering index), the engine only has to perform a range scan on the index itself (without having to pay for random I/Os to access the main data of the table).
Note: if the timestamp range is too wide, it will become slow despite the index, because random I/Os are more expensive than sequential I/Os; there is a point where the index range scan becomes more expensive than the table scan. As a rule of thumb, if you need to scan more than 10% of the table, the engine should consider keeping the table scan and ignoring the index. I'm not sure SQLite is smart enough to apply this kind of optimization, though.
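A sketch of the suggested index and the query it is meant to serve (column names as in the question):
-- state rides along in the index, so the engine can answer the GROUP BY
-- from a range scan on timestamp without touching the table rows.
CREATE INDEX mytable_ts_state_idx ON mytable (timestamp, state);

SELECT state, count(*)
FROM mytable
WHERE timestamp BETWEEN x AND y
GROUP BY state;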
While tracking index usage and analyzing the tables we added indexes to, we have run into some odd situations:
Some of our tables have an index, but when I execute a query with a WHERE clause on the indexed field, the corresponding idx_scan counter does not increase. The relname and schemaname match, so I can't be looking at the wrong row.
Testing further, I dropped and recreated the table; after that, the query incremented idx_scan again.
The same thing happened with other tables: we executed queries that should use indexes, but only seq_scan increased, never idx_scan, and even if I add another indexed column to the same table, queries on the new column don't increment idx_scan either.
What is wrong with these tables? What are we doing wrong? Only newly created tables with indexes increment idx_scan; the old tables behave incorrectly.
We have migrated this database a few times; could that be the problem? It happens both on localhost and on the online server.
Another thing we saw: some indexes had idx_scan > 0, but when we executed a SELECT the counter no longer increased; the number stayed fixed and only seq_scan went up.
I believe these problems are related.
I'd appreciate some help; it's a big mystery prowling our DB and I have no idea what the problem could be.
A couple suggestions (and what to add to your question).
The first is that index scans are not always favored over sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be skipped in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
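A sketch of that recursive approach, assuming a hypothetical table big_table with a btree-indexed timestamp column ts:
-- Each step jumps to the first row on or after the next year boundary,
-- so every iteration is a single index probe instead of a scan.
WITH RECURSIVE years AS (
    SELECT date_trunc('year', min(ts)) AS year_start FROM big_table
  UNION ALL
    SELECT (SELECT date_trunc('year', min(ts))
            FROM big_table
            WHERE ts >= y.year_start + interval '1 year')
    FROM years y
    WHERE y.year_start IS NOT NULL
)
SELECT extract(year FROM year_start) AS year
FROM years
WHERE year_start IS NOT NULL;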
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.
I have been running an UPDATE on a table containing 250 million rows with 3 indexes; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if there is a way to find out how close it is to being done, because if it plans to take a million days I will kill it, but if it only needs another day or two I will let it run. Here is the command-query:
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0
;
The EXPLAIN is not the issue here and I only mention the big table's having multiple indexes in order to somewhat justify how long it takes to UPDATE it. But here is the EXPLAIN anyway:
Merge Join (cost=127710692.21..135714045.43 rows=452882848 width=57)
Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
-> Sort (cost=3193335.39..3219544.38 rows=10483593 width=41)
Sort Key: page.page_namespace, (page.page_title)::text
-> Seq Scan on page (cost=0.00..439678.01 rows=10483593 width=41)
Filter: (page_is_redirect = 0::numeric)
-> Sort (cost=124517356.82..125285665.74 rows=307323566 width=46)
Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
-> Seq Scan on pagelinks (cost=0.00..6169460.66 rows=307323566 width=46)
Now I also sent a parallel query-command in order to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the DROP INDEX postmaster process?).
So I am wondering if there is a table that keeps track of the number of dead tuples or something like that, since it would be nice to know how fast, or how far along, the UPDATE is in completing its task.
Thx
(PostgreSQL is not as intelligent as I thought; it needs heuristics)
Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?
I'm not a regular PostgreSQL user, but I just read that doc and compared it to the EXPLAIN output you're showing. Your UPDATE query seems to be using no indexes, and it's forced to do table scans to sort both page and pagelinks. The sort is no doubt large enough to need temporary disk files, which I think are created under your temp_tablespace.
Then I see the estimated database pages read. The top-level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are in disk I/O accesses. So it's going to access the disk over 135 million times to do this UPDATE.
Note that even 10,000rpm disks with 5ms seek time can achieve at best 200 I/O operations per second under optimal conditions. This would mean that your UPDATE would take 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that period (i.e. continuous reads/writes with no breaks). This is impossible, and I'd expect the actual throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
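One possible shape for that (only a sketch; the namespace value 0 is just an example, and it assumes indexes on (pl_namespace, pl_title) and (page_namespace, page_title) exist) is to run the update in per-namespace batches, so each pass can use those indexes instead of sorting both tables on disk:
UPDATE pagelinks pl
SET    pl_to = p.page_id
FROM   page p
WHERE  pl.pl_namespace = 0          -- repeat for each namespace value
  AND  p.page_namespace = 0
  AND  pl.pl_title = p.page_title
  AND  p.page_is_redirect = 0;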
As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
This is very old, but if you want a way to monitor your update, remember that sequence changes are visible globally (they are not rolled back with a transaction), so you can just create one to monitor this update from another session by doing this:
create sequence yourprogress;
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0 AND NEXTVAL('yourprogress')!=0;
Then in another session just do this (don't worry about transactions; sequence values are visible globally):
select last_value from yourprogress;
This will show how many rows have been processed so far, so you can estimate how long it will take.
At the end, just restart your sequence for another try:
alter sequence yourprogress restart with 1;
Or just drop it:
drop sequence yourprogress;
You need indexes or, as Bill pointed out, it will need to do sequential scans on all the tables.
CREATE INDEX page_ns_title_idx on page(page_namespace, page_title);
CREATE INDEX pl_ns_title_idx on pagelinks(pl_namespace, pl_title);
CREATE INDEX page_redir_idx on page(page_is_redirect);