Postgres Index drop query performance - sql

SELECT T.id_task,
T.group,
A.type
FROM TASK T
JOIN ACTION A ON T.id_task = A.id_task
WHERE T.dt_inc BETWEEN 1511146800000 AND 1511492399999
AND A.dt_scheduled BETWEEN 1511146800000 AND 1511492399999
AND A.id_action > ( SELECT MIN(id_action) FROM ACTION WHERE T.id_task = id_task AND type <> 'TRANSFER' )
The original query is much bigger, but the problematic part is here.
Its take one minute to finish, using only ( task.dt_inc, action.dt_scheduled ), BUT if i DROP INDEX action_dt_scheduled, the execution time drops to one second with same results using only ( task.dt_inc, action.id_task ) indexes.
Why a index is performing so bad ?
Can i ignore this index without droping it ?
I recreated the index DROP > CREATE, i make REINDEX wath should be the same than recreate, what can i do now ?
EDIT:
I was trying to get the EXPLAIN of the slow query, but this is not slow anymore, the ANALYSE and REINDEX of the tables has solved the problem.

There are more reasons why the performance can be worse with index instead without it:
bad estimations - when index is used, then random io is used. random io is usually significantly slower then seq scan - but if index selects small part of table, then it is acceptable. But when planner has bad estimation of result, then index can be used although the seq scan is better. You can check the estimation by EXPLAIN ANALYZE query command.
bloated index - a state of index can be bad if long time index was not reindexed - then access and usage of index can be slow.
wrong sized effective_cache_size parameter - when this paremeter is not accurate or it is not safe, then pages of index can push from RAM a heap (table) pages. In this case a page cache (shared buffers) is not stable - you can check the stability of shared buffers with a pg_buffercache extension. Too low shared_buffers can have similar effect.

Related

Optimize SQL update query

I have an update which takes a lot of time to finish. 10 millions of rows need to be updated. The execution ended after 6 hours.
This is the query :
update A
set a_top = 'N'
where (a_toto, a_num, a_titi) in
(select a_toto, a_num, a_titi
from A
where a_titi <> 'test' and a_top is null limit 10000000);
Two indexes have been created :
CREATE UNIQUE INDEX pk_A ON A USING btree (a_toto, a_num, a_titi)
CREATE INDEX id1_A ON A USING btree (a_num)
These are the things I already checked :
No locks
No triggers on A
The execution plan shows me that the indexes are not used, would it change anything if I drop the indexes, update the rows and then create the indexes after that ?
Is there a way of improving the query itself ?
Here is the execution plan :
Update on A (cost=3561856.86..10792071.61 rows=80305304 width=200)
-> Hash Join (cost=3561856.86..10792071.61 rows=80305304 width=200)
Hash Cond: (((A.a_toto)::text = (("ANY_subquery".a_toto)::text)) AND ((A.a_num)::text = (("ANY_subquery".a_num)::text)) AND ((A.a_titi)::text = (("ANY_subquery".a_titi)::text)))
-> Seq Scan on A (cost=0.00..2509069.04 rows=80305304 width=126)
-> Hash (cost=3490830.00..3490830.00 rows=2082792 width=108)
-> Unique (cost=3390830.00..3490830.00 rows=2082792 width=108)
-> Sort (cost=3390830.00..3415830.00 rows=10000000 width=108)
Sort Key: (("ANY_subquery".a_toto)::text), (("ANY_subquery".a_num)::text), (("ANY_subquery".a_titi)::text)
-> Subquery Scan on "ANY_subquery" (cost=0.00..484987.17 rows=10000000 width=108)
-> Limit (cost=0.00..384987.17 rows=10000000 width=42)
-> Seq Scan on A A_1 (cost=0.00..2709832.30 rows=70387600 width=42)
Filter: ((a_top IS NULL) AND ((a_titi)::text <> 'test'::text))
(12 rows)
Thanks for you help.
The index I would have suggested is:
CREATE UNIQUE INDEX pk_A ON A USING btree (a_titi, a_top, a_toto, a_num);
This index covers the WHERE clause of the subquery, allowing Postgres to throw away records which don't meet the criteria. The index also includes the three columns which are needed in the SELECT clause.
One reason your current first index is not being used is that it doesn't cover the WHERE clause. This index, if used, might require a full index scan, during which Postgres would have to manually filter off non matching records.
PostgreSQL does a pretty poor job of optimizing bulk updates, because it optimizes it (almost) just like a select, and throws an update on top. It doesn't consider how the order of the rows returned by the select-like-portion will effect the IO pattern of the update itself. This can have devastatingly poor performance for high latency devices, like spinning disks. And is often bad even for SSD.
My theory is that you could get greatly improved performance by injecting a Sort by ctid node just below the Update node. But it looks really hard to do this, even in a gross and hackish way just to get a proof-of-concept.
But if the Hash node can fit the entire hash table in work_mem, rather than spilling to disk, then the Hash Join should return tuples in physical order so they can be updated efficiently. This would require a work_mem very much larger than 4MB, though. (But it is hard to say how much larger without trial and error. But even if spills to disk in 4 batches, that should be much better than hundreds.)
You can probably get it to use an index plan by setting both enable_hashjoin and enable_mergejoin to off. But whether this will actually be faster is another question, it might have the same random IO problem as the current method does. Unless the table is clustered or something like that.
You should really go back to your client and ask what they are trying to accomplish here. If they would just update the table in one shot without the self join, they wouldn't have this problem. If they are using the LIMIT to try to get the UPDATE to run faster, then it is probably backfiring spectacularly. If they are doing it for some other reason, well, what is it?

Suboptimal execution plan is better than optimal plan when statistics are stale

Why the cost of the execution plan that was generated based on stale statistics is cheaper than the cost of the plan based on updated statistics?
To understand my problem please to follow below scenario:
Assumption: auto statistics update is off.
When I updated statistics with full scan manually then I executed following batch including actual execution plan:
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT *
FROM AWSales -- table has 60000 rows
WHERE SalesOrderID = 44133
OPTION(RECOMPILE);
--returns 17 rows
The optimizer generated a plan that used non clustered index seek and key lookup - that was definitely fine.
Then I wanted to cheat the optimizer so I inserted 60 000 rows where input value for SalesOrderID column was 44133.
Without updating statistics I executed the mentioned batch again and the optimizer returned the same plan (with index seek) but with different cost (60 000 rows returned), of course.
Next I updated statistics with full scan manually for the table and I executed the batch again. This time the optimizer returned different plan with index scan operator. Predictable. At first glance it looked good. But when I compared the query plan costs it was more expensive than the cost of plan that used index seek. So after updating statistics query with optimal plan was slower. NONSENSE!
Then I wanted to compare costs of index seek before update and after update. Updated statistics caused the optimizer chose plan with index scan, so to force generating plan with index seek I added hint into the query. After executing it turned out that the actual query cost (forced index scan) was MUCH bigger then cost when statistics were stale. How is that possible?
For more details please look at sample script

Postgresql: How do I ensure that indexes are in memory

I have been trying out postgres 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server... I am now running it on a 14GB Azure VM. I went up a size when trying to solve the problem described below.
I am quite new to posgresql by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, I and my colleagues simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.
Problem description:
I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:
id (bigserial)
url (charachter varying 300)
I created indexes after importing the data from a few CSV files. Both columns are indexed.... the id is the primary key. The index on the url is a normal btree created using the defaults through pgAdmin.
When I ran:
SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;
... The total size is 5980MB
The indiviual size of the 2 indexes in question here are as follows, and I got them by running:
# SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE
reltype=0 ORDER BY relpages DESC LIMIT 10;
relname | mb | reltype
----------------------------------+------+--------
test_url_idx | 3684 | 0
test_pk | 2161 | 0
There are other indexes on other smaller tables, but they are tiny (< 5MB).... so I ignored them here
The trouble when querying the test_table using the url, particularly when using a wildcard in the search, is the speed (or lack of it). e.g.
select * from test_table where url like 'orange%' limit 20;
...would take anything from 20-40 seconds to run.
Running explain analyze on the above gives the following:
# explain analyze select * from test_table where
url like 'orange%' limit 20;
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..4787.96 rows=20 width=57)
(actual time=0.304..1898.583 rows=20 loops=1)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
(actual time=0.302..1898
.542 rows=20 loops=1)
Filter: ((url)::text ~~ 'orange%'::text)
Rows Removed by Filter: 210286
Total runtime: 1898.650 ms
(5 rows)
Taking another example... this time with the wildcard between american and .com....
# explain select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-------------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
Filter: ((url)::text ~~ 'american%.com'::text)
(3 rows)
# explain analyze select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-----------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57)
(actual time=83.470..3035.696 rows=50 loops=1)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
(actual time=83.467..303
5.614 rows=50 loops=1)
Filter: ((url)::text ~~ 'american%.com'::text)
Rows Removed by Filter: 276142
Total runtime: 3035.774 ms
(5 rows)
I then went from a 7GB to a 14GB server. Query Speeds were no better.
Observations on the server
I can see that Memory usage never really goes beyond 2MB.
Disk reads go off the charts when running a query using a LIKE statement.
Query speed is perfectly fine when matching against the id (primary key)
The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.
Changes to conf:
shared_buffers = 512MB
checkpoint_segments = 10
(I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)
cpu_index_tuple_cost = 0.0005
effective_cache_size = 10GB # recommendation in the blog post was 2GB...
On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:
CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)
Memory: 1.5/14.0GB (11%)
More details on Memory:
In use: 1.4GB
Available: 12.5GB
Committed 1.9/16.1 GB
Cached: 835MB
Paged Pool: 95.2MB
Non-paged pool: 71.2 MB
Questions
How can I ensure an index will sit in memory (providing it doesn't get too big for memory)? Is it just configuration tweaking I need here?
Is implementing my own search index (e.g. Lucene) a better option here?
Are the full-text indexing features in postgres going to improve performance dramatically, even if I can solve the index in memory issue?
Thanks for reading.
Those seq scans make it look like you didn't run analyze on the table after importing your data.
http://www.postgresql.org/docs/current/static/sql-analyze.html
During normal operation, scheduling to run vacuum analyze isn't useful, because the autovacuum periodically kicks in. But it is important when doing massive writes, such as during imports.
On a slightly related note, see this reversed index tip on Pavel's PostgreSQL Tricks site, if you ever need to run anchord queries at the end, rather than at the beginning, e.g. like '%.com'
http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
Regarding your actual questions, be wary that some of the suggestions in that post you liked to are dubious at best. Changing the cost of index use is frequently dubious and disabling seq scan is downright silly. (Sometimes, it is cheaper to seq scan a table than itis to use an index.)
With that being said:
Postgres primarily caches indexes based on how often they're used, and it will not use an index if the stats suggest that it shouldn't -- hence the need to analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood it's in memory too, but keep the latter points in mind.
and 3. Full text search works fine.
For further reading on fine-tuning, see the manual and:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Two last notes on your schema:
Last I checked, bigint (bigserial in your case) was slower than plain int. (This was a while ago, so the difference might now be negligible on modern, 64-bit servers.) Unless you foresee that you'll actually need more than 2.3 billion entries, int is plenty and takes less space.
From an implementation standpoint, the only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need data to fit that size and are merely doing so for no reason other than habit, your db inserts and updates will run faster by getting rid of that constraint.
Unless your encoding or collation is C or POSIX, an ordinary btree index cannot efficiently satisfy an anchored like query. You may have to declare a btree index with the varchar_pattern_ops op class to benefit.
The problem is that you're getting hit with a full table scan for each of those lookups ("index in memory" isn't really an issue). Each time you run one of those queries the database is visiting every single row, which is causing the high disk usage. You might check here for a little more information (especially follow the links to the docs on operator classes and index types). If you follow that advice you should be able to get prefix lookups working fine, i.e. those situations where you're matching something like 'orange%'.
Full text search is nice for more natural text search, like written documents, but it might be more difficult to get it working well for URL searching. There was also this thread in the mailing lists a few months back that might have more domain-specific information for what you're trying to do.
explain analyze select * from test_table where
url like 'orange%' limit 20;
You probably want to use a gin/gist index for like queries. Should give you much better results than btree - I don't think btree supports like queries at all.

Improving query speed: simple SELECT in big postgres table

I'm having trouble regarding speed in a SELECT query on a Postgres database.
I have a table with two integer columns as key: (int1,int2)
This table has around 70 million rows.
I need to make two kind of simple SELECT queries in this environment:
SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;
These two selects returns around 10.000 rows each out of these 70 million. For this to work as fast as possible I thought on using two HASH indexes, one for each column. Unfortunately the results are not that good:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on lec_sim (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
Recheck Cond: (lec2_id = 11782)
-> Bitmap Index Scan on lec_sim_lec2_hash_ind (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
Index Cond: (lec2_id = 11782)
Total runtime: 23342.534 ms
(5 rows)
This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds. My expectations are to get this information in less than a second.
These are some parameters of the postgres db config:
work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB
Any help, comment or thought would be really appreciated.
Thank you in advance.
Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
Thank you willglyn. As you noticed, the problem was the seeking through the HD and not looking up for the indexes. You proposed many solutions, like loading the dataset in RAM or buy an SSDs HD. But forgetting about these two, that involve managing things outside the database itself, you proposed two ideas:
Reorganize the data to reduce the seeking of the data.
Use PostgreSQL 9.2 feature "index-only scans"
Since I am under a PostgreSQL 9.1 Server, I decided to take option "1".
I made a copy of the table. So now I have the same table with the same data twice. I created an index for each one, the first one being indexed by (int1) and the second one by (int2). Then I clustered them both (CLUSTER table USING ind_intX) by its respective indexes.
I'm posting now an EXPLAIN ANALYZE of the same query, done in one of these clustered tables:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
Index Cond: (lec2_id = 12300) Total runtime:
1.822 ms (3 rows)
Now the seeking is really fast. I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I think this problem is solved for me, I hope this might be useful also for others experiencing the same problem.
Thank you so much willglynn.
I had a case of super slow queries where simple one to many joins (in PG v9.1) were performed between a table that was 33 million rows to a child table that was 2.4 billion rows in size. I performed a CLUSTER on the foreign key index for the child table, but found that this didn't solve my problem with query timeouts, for even the simplest of queries. Running ANALYZE also did not solve the issue.
What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even as the parent table was completing its VACUUM process, I went from 10 minute timeouts to results coming back in one second.
What I am taking away from this is that regular VACUUM operations are still critical, even for v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to improve the autovacuum trigger to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
For a very simple and effective one liner, if you have fast solid-state storage on your postgres machine, try setting:
random_page_cost=1.0
In your in your postgresql.conf.
The default is random_page_cost=4.0 and this is optimized for storage with high seek times like old spinning disks. This changes the cost calculation for seeking and relies less on your memory (which could ultimately be going to swap anyway)
This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.
The other major improvement came from making indexes with all of the booleen columns on my table. This reduced the 2 second query to about 1 second. Check #willglynn's answer for that.
Hope this helps!

Long UPDATE in postgresql

I have been running an UPDATE on a table containing 250 million rows with 3 index'; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if their is a way to find out how close it is to being done for if it plans to take a million days to do its thing, I will kill it; yet if it only needs another day or two, I will let it run. Here is the command-query:
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0
;
The EXPLAIN is not the issue here and I only mention the big table's having multiple indexes in order to somewhat justify how long it takes to UPDATE it. But here is the EXPLAIN anyway:
Merge Join (cost=127710692.21..135714045.43 rows=452882848 width=57)
Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
-> Sort (cost=3193335.39..3219544.38 rows=10483593 width=41)
Sort Key: page.page_namespace, (page.page_title)::text
-> Seq Scan on page (cost=0.00..439678.01 rows=10483593 width=41)
Filter: (page_is_redirect = 0::numeric)
-> Sort (cost=124517356.82..125285665.74 rows=307323566 width=46)
Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text"
-> Seq Scan on pagelinks (cost=0.00..6169460.66 rows=307323566 width=46)
Now I also sent a parallel query-command in order to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the DROP INDEX postmaster process?).
So I am wondering if their is a table that would keep track of the amount of dead tuples or something for It would be nice to know how fast or how far the UPDATE is in the completion of its task.
Thx
(PostgreSQL is not as intelligent as I thought; it needs heuristics)
Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?
I'm not a regular PostgreSQL user, but I just read that doc, and then compared to the EXPLAIN output you're showing. Your UPDATE query seems to be using no indexes, and it's forced to do table-scans to sort both page and pagelinks. The sort is no doubt large enough to need temporary disk files, which I think are created under your temp_tablespace.
Then I see the estimated database pages read. The top-level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are in disk I/O accesses. So it's going to access the disk over 135 million times to do this UPDATE.
Note that even 10,000rpm disks with 5ms seek time can achieve at best 200 I/O operations per second under optimal conditions. This would mean that your UPDATE would take 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that period (i.e. continuous reads/writes with no breaks). This is impossible, and I'd expect the actual throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
This is very old, but if you want a way for you to monitore your update... Remember that sequences are affected globally, so you just can create one to monitore this update in another session by doing this:
create sequence yourprogress;
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0 AND NEXTVAL('yourprogress')!=0;
Then in another session just do this (don't worry about transactions, as sequences are affected globally):
select last_value from yourprogress;
This will show how many lines are being affected, so you can estimate how long you will take.
At just end restart your sequence to do another try:
alter sequence yourprogress restart with 1;
Or just drop it:
drop sequence yourprogress;
You need indexes or, as Bill pointed out, it will need to do sequential scans on all the tables.
CREATE INDEX page_ns_title_idx on page(page_namespace, page_title);
CREATE INDEX pl_ns_title_idx on pagelink(pl_namespace, pl_title);
CREATE INDEX page_redir_idx on page(page_is_redirect);