Improving query speed: simple SELECT in big postgres table - sql

I'm having trouble regarding speed in a SELECT query on a Postgres database.
I have a table with two integer columns as key: (int1,int2)
This table has around 70 million rows.
I need to make two kinds of simple SELECT queries in this environment:
SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;
These two selects return around 10,000 rows each out of the 70 million. To make this as fast as possible I thought of using two HASH indexes, one for each column. Unfortunately, the results are not that good:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on lec_sim (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
Recheck Cond: (lec2_id = 11782)
-> Bitmap Index Scan on lec_sim_lec2_hash_ind (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
Index Cond: (lec2_id = 11782)
Total runtime: 23342.534 ms
(5 rows)
This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds, while my expectation is to get this information in less than a second.
These are some parameters of the postgres db config:
work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB
Any help, comment or thought would be really appreciated.
Thank you in advance.

Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks; it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.

Thank you willglynn. As you noticed, the problem was the seeking through the HD, not the index lookups. You proposed several solutions, like loading the dataset into RAM or buying an SSD. But setting those two aside, since they involve managing things outside the database itself, you proposed two ideas:
Reorganize the data to reduce disk seeks.
Use PostgreSQL 9.2 feature "index-only scans"
Since I am on a PostgreSQL 9.1 server, I went with option 1.
I made a copy of the table, so now I have the same data twice. I created an index on each copy, the first indexed by (int1) and the second by (int2). Then I clustered both (CLUSTER table USING ind_intX) by their respective indexes.
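A rough sketch of those steps (the second copy's names match the EXPLAIN below; the lec1_id names are my guess, since the post doesn't show them):
CREATE TABLE lec_sim_lec1id AS SELECT * FROM lec_sim;
CREATE TABLE lec_sim_lec2id AS SELECT * FROM lec_sim;
CREATE INDEX lec_sim_lec1id_ind ON lec_sim_lec1id (lec1_id);
CREATE INDEX lec_sim_lec2id_ind ON lec_sim_lec2id (lec2_id);
CLUSTER lec_sim_lec1id USING lec_sim_lec1id_ind;  -- rewrite each copy in index order
CLUSTER lec_sim_lec2id USING lec_sim_lec2id_ind;
ANALYZE lec_sim_lec1id;  -- refresh statistics after the rewrite
ANALYZE lec_sim_lec2id;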
Here is an EXPLAIN ANALYZE of the same query, run against one of these clustered tables:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
Index Cond: (lec2_id = 12300)
Total runtime: 1.822 ms
(3 rows)
Now the seeking is really fast: I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I consider this problem solved for me, and I hope this is useful for others experiencing the same problem.
Thank you so much willglynn.

I had a case of super slow queries where simple one-to-many joins (in PG v9.1) were performed between a 33-million-row table and a 2.4-billion-row child table. I performed a CLUSTER on the foreign key index for the child table, but found that this didn't solve my problem with query timeouts, even for the simplest of queries. Running ANALYZE also did not solve the issue.
What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even as the parent table was completing its VACUUM process, I went from 10 minute timeouts to results coming back in one second.
What I am taking away from this is that regular VACUUM operations are still critical, even for v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to improve the autovacuum trigger to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
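For reference, a hedged sketch of that check and cleanup (the table names here are placeholders, not the real ones):
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('parent_table', 'child_table');  -- when did (auto)vacuum last run?
VACUUM ANALYZE parent_table;  -- manual cleanup, as described above
VACUUM ANALYZE child_table;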

For a very simple and effective one-liner, if you have fast solid-state storage on your postgres machine, try setting:
random_page_cost=1.0
in your postgresql.conf.
The default is random_page_cost=4.0, which is optimized for storage with high seek times like old spinning disks. This changes the cost calculation for seeking and relies less on your memory (which could ultimately be going to swap anyway).
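If you want to try it before touching postgresql.conf, a session-only sketch (the query is just an illustration, not from this post):
SET random_page_cost = 1.0;   -- reverts when the session ends
EXPLAIN ANALYZE SELECT * FROM my_table WHERE my_column = 42;  -- hypothetical table/column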
This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.
The other major improvement came from making indexes with all of the boolean columns on my table. This reduced the 2-second query to about 1 second. See willglynn's answer for that.
Hope this helps!

Related

Optimize SQL update query

I have an UPDATE which takes a long time to finish. 10 million rows need to be updated. The execution ended after 6 hours.
This is the query :
update A
set a_top = 'N'
where (a_toto, a_num, a_titi) in
(select a_toto, a_num, a_titi
from A
where a_titi <> 'test' and a_top is null limit 10000000);
Two indexes have been created:
CREATE UNIQUE INDEX pk_A ON A USING btree (a_toto, a_num, a_titi)
CREATE INDEX id1_A ON A USING btree (a_num)
These are the things I already checked:
No locks
No triggers on A
The execution plan shows me that the indexes are not used. Would it change anything if I dropped the indexes, updated the rows, and then recreated the indexes afterwards?
Is there a way of improving the query itself ?
Here is the execution plan :
Update on A (cost=3561856.86..10792071.61 rows=80305304 width=200)
-> Hash Join (cost=3561856.86..10792071.61 rows=80305304 width=200)
Hash Cond: (((A.a_toto)::text = (("ANY_subquery".a_toto)::text)) AND ((A.a_num)::text = (("ANY_subquery".a_num)::text)) AND ((A.a_titi)::text = (("ANY_subquery".a_titi)::text)))
-> Seq Scan on A (cost=0.00..2509069.04 rows=80305304 width=126)
-> Hash (cost=3490830.00..3490830.00 rows=2082792 width=108)
-> Unique (cost=3390830.00..3490830.00 rows=2082792 width=108)
-> Sort (cost=3390830.00..3415830.00 rows=10000000 width=108)
Sort Key: (("ANY_subquery".a_toto)::text), (("ANY_subquery".a_num)::text), (("ANY_subquery".a_titi)::text)
-> Subquery Scan on "ANY_subquery" (cost=0.00..484987.17 rows=10000000 width=108)
-> Limit (cost=0.00..384987.17 rows=10000000 width=42)
-> Seq Scan on A A_1 (cost=0.00..2709832.30 rows=70387600 width=42)
Filter: ((a_top IS NULL) AND ((a_titi)::text <> 'test'::text))
(12 rows)
Thanks for your help.
The index I would have suggested is:
CREATE UNIQUE INDEX pk_A ON A USING btree (a_titi, a_top, a_toto, a_num);
This index covers the WHERE clause of the subquery, allowing Postgres to throw away records which don't meet the criteria. The index also includes the three columns which are needed in the SELECT clause.
One reason your current first index is not being used is that it doesn't cover the WHERE clause. This index, if used, might require a full index scan, during which Postgres would have to manually filter out non-matching records.
PostgreSQL does a pretty poor job of optimizing bulk updates, because it optimizes them (almost) just like a select and throws an update on top. It doesn't consider how the order of the rows returned by the select-like portion will affect the IO pattern of the update itself. This can have devastatingly poor performance on high-latency devices, like spinning disks, and is often bad even for SSDs.
My theory is that you could get greatly improved performance by injecting a Sort by ctid node just below the Update node. But it looks really hard to do this, even in a gross and hackish way just to get a proof-of-concept.
But if the Hash node can fit the entire hash table in work_mem, rather than spilling to disk, then the Hash Join should return tuples in physical order so they can be updated efficiently. This would require a work_mem very much larger than 4MB, though. (It is hard to say how much larger without trial and error, but even if it spills to disk in 4 batches, that should be much better than hundreds.)
You can probably get it to use an index plan by setting both enable_hashjoin and enable_mergejoin to off. But whether this will actually be faster is another question; it might have the same random IO problem as the current method, unless the table is clustered or something like that.
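As a sketch of those two experiments, both scoped to the current session (the work_mem value is a guess, and EXPLAIN without ANALYZE only shows the plan, it doesn't run the update):
SET work_mem = '256MB';       -- try to keep the Hash node in memory
-- or, to see what plan you get without hash/merge joins:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN
UPDATE A SET a_top = 'N'
WHERE (a_toto, a_num, a_titi) IN
      (SELECT a_toto, a_num, a_titi
       FROM A
       WHERE a_titi <> 'test' AND a_top IS NULL LIMIT 10000000);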
You should really go back to your client and ask what they are trying to accomplish here. If they would just update the table in one shot without the self join, they wouldn't have this problem. If they are using the LIMIT to try to get the UPDATE to run faster, then it is probably backfiring spectacularly. If they are doing it for some other reason, well, what is it?
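For illustration, the "one shot" version without the self-join would simply be (assuming they really want every qualifying row updated, not just an arbitrary 10 million of them):
UPDATE A
SET a_top = 'N'
WHERE a_titi <> 'test'
  AND a_top IS NULL;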

Postgresql: How do I ensure that indexes are in memory

I have been trying out postgres 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server... I am now running it on a 14GB Azure VM. I went up a size when trying to solve the problem described below.
I am quite new to postgresql by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, my colleagues and I simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.
Problem description:
I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:
id (bigserial)
url (character varying(300))
I created indexes after importing the data from a few CSV files. Both columns are indexed.... the id is the primary key. The index on the url is a normal btree created using the defaults through pgAdmin.
When I ran:
SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;
... The total size is 5980MB
The individual sizes of the 2 indexes in question are as follows; I got them by running:
# SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE
reltype=0 ORDER BY relpages DESC LIMIT 10;
relname | mb | reltype
----------------------------------+------+--------
test_url_idx | 3684 | 0
test_pk | 2161 | 0
There are other indexes on other smaller tables, but they are tiny (< 5MB).... so I ignored them here
The trouble when querying test_table on the url column, particularly when using a wildcard in the search, is the speed (or lack of it), e.g.
select * from test_table where url like 'orange%' limit 20;
...would take anything from 20-40 seconds to run.
Running explain analyze on the above gives the following:
# explain analyze select * from test_table where
url like 'orange%' limit 20;
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..4787.96 rows=20 width=57) (actual time=0.304..1898.583 rows=20 loops=1)
  -> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57) (actual time=0.302..1898.542 rows=20 loops=1)
Filter: ((url)::text ~~ 'orange%'::text)
Rows Removed by Filter: 210286
Total runtime: 1898.650 ms
(5 rows)
Taking another example... this time with the wildcard between american and .com....
# explain select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-------------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
Filter: ((url)::text ~~ 'american%.com'::text)
(3 rows)
# explain analyze select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-----------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57) (actual time=83.470..3035.696 rows=50 loops=1)
  -> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57) (actual time=83.467..3035.614 rows=50 loops=1)
Filter: ((url)::text ~~ 'american%.com'::text)
Rows Removed by Filter: 276142
Total runtime: 3035.774 ms
(5 rows)
I then went from a 7GB to a 14GB server. Query Speeds were no better.
Observations on the server
I can see that Memory usage never really goes beyond 2MB.
Disk reads go off the charts when running a query using a LIKE statement.
Query speed is perfectly fine when matching against the id (primary key)
The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.
Changes to conf:
shared_buffers = 512MB
checkpoint_segments = 10
(I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)
cpu_index_tuple_cost = 0.0005
effective_cache_size = 10GB # recommendation in the blog post was 2GB...
On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:
CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)
Memory: 1.5/14.0GB (11%)
More details on Memory:
In use: 1.4GB
Available: 12.5GB
Committed 1.9/16.1 GB
Cached: 835MB
Paged Pool: 95.2MB
Non-paged pool: 71.2 MB
Questions
How can I ensure an index will sit in memory (providing it doesn't get too big for memory)? Is it just configuration tweaking I need here?
Is implementing my own search index (e.g. Lucene) a better option here?
Are the full-text indexing features in postgres going to improve performance dramatically, even if I can solve the index in memory issue?
Thanks for reading.
Those seq scans make it look like you didn't run analyze on the table after importing your data.
http://www.postgresql.org/docs/current/static/sql-analyze.html
During normal operation, scheduling to run vacuum analyze isn't useful, because the autovacuum periodically kicks in. But it is important when doing massive writes, such as during imports.
On a slightly related note, see this reversed-index tip on Pavel's PostgreSQL Tricks site if you ever need to run queries anchored at the end rather than at the beginning, e.g. LIKE '%.com':
http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
Regarding your actual questions, be wary that some of the suggestions in that post you linked to are dubious at best. Changing the cost of index use is frequently dubious and disabling seq scans is downright silly. (Sometimes it is cheaper to seq scan a table than it is to use an index.)
With that being said:
Regarding question 1: Postgres primarily caches indexes based on how often they're used, and it will not use an index if the stats suggest that it shouldn't -- hence the need to analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood it's in memory too, but keep the latter points in mind.
Regarding questions 2 and 3: full-text search works fine.
For further reading on fine-tuning, see the manual and:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Two last notes on your schema:
Last I checked, bigint (bigserial in your case) was slower than plain int. (This was a while ago, so the difference might now be negligible on modern 64-bit servers.) Unless you foresee that you'll actually need more than about 2.1 billion entries, int is plenty and takes less space.
From an implementation standpoint, the only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need the data to fit within that size and are merely doing so out of habit, your db inserts and updates will run faster by getting rid of that constraint.
Unless your encoding or collation is C or POSIX, an ordinary btree index cannot efficiently satisfy an anchored like query. You may have to declare a btree index with the varchar_pattern_ops op class to benefit.
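A sketch of such an index for the table in this question (the index name is made up):
CREATE INDEX test_url_pattern_idx ON test_table (url varchar_pattern_ops);
ANALYZE test_table;
-- Anchored patterns such as url LIKE 'orange%' can now use this index;
-- unanchored patterns like '%.com' still cannot.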
The problem is that you're getting hit with a full table scan for each of those lookups ("index in memory" isn't really an issue). Each time you run one of those queries the database is visiting every single row, which is causing the high disk usage. You might check here for a little more information (especially follow the links to the docs on operator classes and index types). If you follow that advice you should be able to get prefix lookups working fine, i.e. those situations where you're matching something like 'orange%'.
Full text search is nice for more natural text search, like written documents, but it might be more difficult to get it working well for URL searching. There was also this thread in the mailing lists a few months back that might have more domain-specific information for what you're trying to do.
explain analyze select * from test_table where
url like 'orange%' limit 20;
You probably want to use a gin/gist index for LIKE queries. It should give you much better results than btree -- I don't think a plain btree index helps with these LIKE queries at all.
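If you go the GIN route, one hedged sketch uses the pg_trgm contrib module (PostgreSQL 9.1+; the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX test_url_trgm_idx ON test_table USING gin (url gin_trgm_ops);
-- Handles both anchored and unanchored patterns:
SELECT * FROM test_table WHERE url LIKE 'american%.com' LIMIT 50;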

PostgreSQL: How to optimize my database for storing and querying a huge graph

I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have a stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema is like:
CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256));
CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link));
CREATE INDEX id_idx ON edges (id);
CREATE INDEX link_idx ON edges (link);
The data in the table edges looks like
id link
1 234
1 88865
1 6
2 365
2 12
...
So for each node with id x it stores the outgoing links to nodes y.
The time for searching all the outgoing links is ok:
=# explain analyze select link from edges where id=4620;
QUERY PLAN
---------------------------------------------------------------------------------
Index Scan using id_idx on edges (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
Index Cond: (id = 4620)
Total runtime: 158.348 ms
(3 rows)
However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):
=# explain analyze select id from edges where link=4620;
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on edges (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
Recheck Cond: (link = 4620)
-> Bitmap Index Scan on link_idx (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1)
Index Cond: (link = 4620)
Total runtime: 49001.936 ms
(5 rows)
I tried to force Postgres not to use a Bitmap Scan via
=# set enable_bitmapscan = false;
but the speed of the query for incoming links didn't improve:
=# explain analyze select id from edges where link=1588;
QUERY PLAN
-------------------------------------------------------------------------------------------
Index Scan using link_idx on edges (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
Index Cond: (link = 1588)
Total runtime: 51300.041 ms
(3 rows)
I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such asymmetric behaviour? Is something wrong with my choice of indexes? Or would it be better to create a third table holding all the incoming links for a node with id x? But that would be quite a waste of disk space. And since I'm new to SQL databases, maybe I'm missing something basic here?
I guess it is because of the "density" of same-key records on disk.
I think the records with the same id are stored densely (i.e., in a small number of blocks) while those with the same link are stored sparsely (i.e., scattered across a huge number of blocks).
If you have inserted records in the order of id, this situation can happen.
Assume that:
1. there are 10,000 records,
2. they're stored in the order such as (id, link) = (1, 1), (1, 2),..., (1, 100), (2, 1)..., and
3. 50 records can be stored in a block.
Under these assumptions, blocks #1-#3 consist of the records (1, 1)-(1, 50), (1, 51)-(1, 100) and (2, 1)-(2, 50) respectively.
When you SELECT * FROM edges WHERE id=1, only 2 blocks (#1, #2) need to be loaded and scanned.
On the other hand, SELECT * FROM edges WHERE link=1 requires 100 blocks (#1, #3, #5, ...), one per id, even though the number of rows returned is the same.
If you need good performance and can deal without foreign key constraints (or use triggers to implement them manually), try the intarray and intagg extension modules. Instead of the edges table, have an outedges integer[] column on the nodes table. This will add about 140MB to the table, so the whole thing will still probably fit into memory. For reverse lookups, either create a GIN index on the outedges column (for an additional 280MB), or just add an inedges column.
Postgresql has pretty high row overhead, so the naive edges table will result in 1GB of space for the table alone, plus another 1.5GB for the indexes. Given your dataset size, you have a good chance of having most of it in cache if you use integer arrays to store the relations. This will make any lookups blazingly fast. I see around 0.08ms lookup times to get edges in either direction for a given node. Even if you don't fit it all in memory, you'll still have a larger fraction in memory and much better cache locality.
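An illustrative sketch of that layout (column and index names are made up; array_agg needs 8.4+, on 8.3 the intagg module's int_array_aggregate plays the same role):
ALTER TABLE nodes ADD COLUMN outedges integer[];
UPDATE nodes n
SET outedges = (SELECT array_agg(e.link) FROM edges e WHERE e.id = n.id);
CREATE INDEX nodes_outedges_gin ON nodes USING gin (outedges);  -- for reverse lookups
SELECT outedges FROM nodes WHERE id = 4620;               -- outgoing links
SELECT id FROM nodes WHERE outedges @> ARRAY[4620];       -- incoming links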
I think habe is right.
You can check this by running cluster link_idx on edges; analyze edges; after filling the table. Now the second query should be fast, and the first should be slow.
To have both queries fast you'll have to denormalize by using a second table, as you have proposed. Just remember to cluster and analyze this second table after loading your data, so all edges linking to a node will be physically grouped.
If you will not query this all the time and you do not want to store and backup this second table then you can create it temporarily before querying:
create temporary table edges_backwards
as select link, id from edges order by link, id;
create index edges_backwards_link_idx on edges_backwards(link);
You do not have to cluster this temporary table, as it will be physically ordered right on creation. It does not make sense for one query, but can help for several queries in a row.
Your issue seems to be disk-IO related. Postgres has to read the tuples matched by the index in order to see whether or not each row is visible (this cannot be done from the index alone, as it doesn't contain the necessary information).
VACUUM ANALYZE (or simply ANALYZE) will help if you have lots of deleted rows and/or updated rows. Run it first and see if you get any improvements.
CLUSTER might also help. Based on your examples, I'd say using link_idx as the cluster-key. "CLUSTER edges USING link_idx". It might degrade the performance of your id queries though (your id queries might be quick because they are already sorted on disk). Remember to run ANALYZE after CLUSTER.
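In command form, that sequence would be roughly:
VACUUM ANALYZE edges;           -- clean up dead rows and refresh stats first
CLUSTER edges USING link_idx;   -- rewrite the table in link order (takes an exclusive lock)
ANALYZE edges;                  -- refresh statistics after the rewrite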
Next steps include fine-tuning memory parameters, adding more memory, or adding a faster disk subsystem.
Have you tried doing this in www.neo4j.org? This is almost trivial in a graph database and should give you millisecond-range performance for your use case.

Performance optimisation - Postgres

I have been tasked with improving the performance of a slow running process which updates some data in a PostGres 8.3 database (running on Solaris, updates are driven by Perl 5.8 scripts through SOAP). About 50% of the time consumed I have very little control over so tuning my 50% is quite important.
There are usually about 4,500,000 rows in the table although I've seen it bloat out to about 7,000,000. The id that the update is querying on (not primary or unique) has just under 9000 distinct values and the spread of occurrences is weighted heavily towards 1 per id (median value is 20, max value 7000).
There is an index on this id but with such sparse data I wonder if there's a better way of doing things. I'm also considering denormalising things a bit (database is not super-normalised anyway) & pulling data out into a separate table (probably controlled/maintained by triggers) to help speed things up.
So far, I have made some pretty basic tweaks (not pinging the database every n seconds to see if it's alive, not setting session variables unnecessarily etc) and this is helping but I really feel that there's something I'm missing with the data ...
Even if someone says that pulling relevant data out into a separate table is an excellent/terrible idea that would be really helpful! Any other ideas (or further questions for clarification) gratefully received!
Query:
UPDATE tab1 SET client = 'abcd', invoice = 999
WHERE id = 'A1000062' and releasetime < '02-11-09'::DATE
AND charge IS NOT NULL AND invoice IS NULL AND client IS NULL;
I realise the 'is not null' is far from ideal. Id is indexed as are invoice & client (btrees, so I understand PostGres will/should/can use an index there). It's a pretty trivial query ...
Query plan (explain with analyze):
Bitmap Heap Scan on tab1 (cost=17.42..1760.21 rows=133 width=670) (actual time=0.603..0.603 rows=0 loops=1)
Recheck Cond: (((id)::text = 'A1000062'::text) AND (invoice IS NULL))
Filter: ((charge IS NOT NULL) AND (client IS NULL) AND (releasetime < '2009-11-02'::date))
-> Bitmap Index Scan on cdr_snapshot_2007_09_12_snbs_invoice (cost=0.00..17.39 rows=450 width=0) (actual time=0.089..0.089 rows=63 loops=1)
Index Cond: (((snbs)::text = 'A1000062'::text) AND (invoice IS NULL))
Total runtime: 0.674 ms
Autovacuum is, I believe, enabled. There are no foreign key constraints but thanks for the tip as I didn't know that.
I am really liking the idea of increasing the statistics value - I'll be having a play around with that straight away.
You really need to get some query plans, and edit your question to include them. In addition to helping to figure out better ways of doing things, they can also be used to easily measure the improvement.
You can affect performance either by changing the SQL, or by adjusting the indexes and statistics that are used to determine the query plan.
One possibility is that you have foreign key constraints that do not have supporting indexes. PostgreSQL does not add them automatically when you create a foreign key constraint. If the referenced table has a row deleted, (or referenced field updated), the referencing table will need to be scanned entirely to either cascade the delete, or to ensure that there are no rows referencing the deleted one.
If the distribution of your id field is quite skewed, increasing the statistics on that column may help.
If the statistics target is set to 100, then the 100 most common ids (from a sample) will be recorded, along with their frequency. Say that covers about 50% of your table, leaving say 2 to 3.5 million rows which PostgreSQL will assume fall evenly amongst your other 8900 ids, or about 250 to 400 times each.
If the statistics was increased to 1000 and the top 1000 ids cover 95% of your rows, PostgreSQL will assume ids that are not in your list of 1000 most common will occur about 30 to 40 times each.
That change in estimates can affect the chosen query plan. If the pattern of queries more often selects the less frequently found ids, PostgreSQL will be overestimating how many times those ids will be found.
There is a performance cost for storing so many most frequent values, so you really need supporting query plan analysis to determine whether you're getting a net gain.
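As a sketch, raising the per-column target and re-analyzing looks like this (using the column name from the query; the plan suggests the real column may actually be called snbs):
ALTER TABLE tab1 ALTER COLUMN id SET STATISTICS 1000;
ANALYZE tab1;   -- re-gather stats so the planner sees the larger most-common-values list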

Long UPDATE in postgresql

I have been running an UPDATE on a table containing 250 million rows with 3 indexes; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if there is a way to find out how close it is to being done: if it is going to take a million days to do its thing, I will kill it; but if it only needs another day or two, I will let it run. Here is the command-query:
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0
;
The EXPLAIN is not the issue here and I only mention the big table's having multiple indexes in order to somewhat justify how long it takes to UPDATE it. But here is the EXPLAIN anyway:
Merge Join (cost=127710692.21..135714045.43 rows=452882848 width=57)
Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
-> Sort (cost=3193335.39..3219544.38 rows=10483593 width=41)
Sort Key: page.page_namespace, (page.page_title)::text
-> Seq Scan on page (cost=0.00..439678.01 rows=10483593 width=41)
Filter: (page_is_redirect = 0::numeric)
-> Sort (cost=124517356.82..125285665.74 rows=307323566 width=46)
Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
-> Seq Scan on pagelinks (cost=0.00..6169460.66 rows=307323566 width=46)
Now I also sent a parallel query-command in order to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the DROP INDEX postmaster process?).
So I am wondering if there is a table that keeps track of the amount of dead tuples or something like that, because it would be nice to know how fast or how far along the UPDATE is in completing its task.
Thx
(PostgreSQL is not as intelligent as I thought; it needs heuristics)
Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?
I'm not a regular PostgreSQL user, but I just read that doc, and then compared to the EXPLAIN output you're showing. Your UPDATE query seems to be using no indexes, and it's forced to do table-scans to sort both page and pagelinks. The sort is no doubt large enough to need temporary disk files, which I think are created under your temp_tablespace.
Then I see the estimated database pages read. The top-level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are in disk I/O accesses. So it's going to access the disk over 135 million times to do this UPDATE.
Note that even 10,000rpm disks with 5ms seek time can achieve at best 200 I/O operations per second under optimal conditions. This would mean that your UPDATE would take 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that period (i.e. continuous reads/writes with no breaks). This is impossible, and I'd expect the actual throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
This is very old, but if you want a way to monitor your update... Remember that sequences are affected globally (they are not rolled back with transactions), so you can just create one to monitor this update from another session by doing this:
create sequence yourprogress;
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0 AND NEXTVAL('yourprogress')!=0;
Then in another session just do this (don't worry about transactions, as sequences are affected globally):
select last_value from yourprogress;
This will show how many rows have been processed so far, so you can estimate how long it will take.
At the end, just restart your sequence for another try:
alter sequence yourprogress restart with 1;
Or just drop it:
drop sequence yourprogress;
You need indexes or, as Bill pointed out, it will need to do sequential scans on all the tables.
CREATE INDEX page_ns_title_idx on page(page_namespace, page_title);
CREATE INDEX pl_ns_title_idx on pagelink(pl_namespace, pl_title);
CREATE INDEX page_redir_idx on page(page_is_redirect);