My Postgres database wasn't using my index; I resolved it, but don't understand the fix, can anyone explain what happened?

The relevant part of my database schema is a table called User with a boolean field Admin, and an index on that Admin field.
The day before, I had restored my full production database onto my development machine and then made only very minor changes, so the two databases should have been very similar.
When I ran the following command on my development machine, I got the expected result:
EXPLAIN SELECT * FROM user WHERE admin IS TRUE;
Index Scan using index_user_on_admin on user (cost=0.00..9.14 rows=165 width=3658)
Index Cond: (admin = true)
Filter: (admin IS TRUE)
However, when I ran the exact same command on my production machine, I got this:
Seq Scan on user (cost=0.00..620794.93 rows=4966489 width=3871)
Filter: (admin IS TRUE)
So instead of using the exact index that was a perfect match for the query, it was using a sequential scan of almost 5 million rows!
I then tried to run EXPLAIN ANALYZE SELECT * FROM user WHERE admin IS TRUE; with the hope that ANALYZE would make Postgres realize a sequential scan of 5 million rows wasn't as good as using the index, but that didn't change anything.
I also tried to run REINDEX INDEX index_user_on_admin in case the index was corrupted, without any benefit.
Finally, I called VACUUM ANALYZE user and that resolved the problem in short order.
My main understanding of vacuum is that it is used to reclaim wasted space. What could have been going on that would cause my index to misbehave so badly, and why did vacuum fix it?

It was most likely the ANALYZE that helped, by updating the data statistics used by the planner to determine what would be the best way to run a query. VACUUM ANALYZE just runs the two commands in order, VACUUM first, ANALYZE second, but ANALYZE itself would probably be enough to help.
The ANALYZE option to EXPLAIN has nothing at all to do with the ANALYZE command. It just causes Postgres to run the query and report the actual run times, so that they can be compared with the planner's predictions (EXPLAIN without ANALYZE only displays the query plan and what the planner thinks it will cost, but does not actually run the query). So EXPLAIN ANALYZE did not help because it did not update the statistics. ANALYZE and EXPLAIN ANALYZE are two completely different actions that just happen to use the same word.
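For example, a minimal sketch using the table from the question (quoting "user" because it is a reserved word in PostgreSQL):
-- Updates the planner's statistics for the table; this is the part that fixes the plan:
ANALYZE "user";
-- Runs the query and reports actual timings, but does not touch the statistics:
EXPLAIN ANALYZE SELECT * FROM "user" WHERE admin IS TRUE;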

PostgreSQL keeps a number of statistics about table contents, indexes, data distribution, and so on. These can get out of date sometimes; running VACUUM ANALYZE corrects the problem.
It is likely that when you reloaded the table from scratch on development, it had the same effect.
Take a look at this:
http://www.postgresql.org/docs/current/static/maintenance.html#VACUUM-FOR-STATISTICS
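You can also look directly at the statistics the planner uses; a minimal sketch against the table and column names taken from the question:
-- Distribution statistics that ANALYZE maintains for the admin column:
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'user' AND attname = 'admin';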

A partial index seems like a good solution for your issue:
CREATE INDEX admin_users_ix ON users (admin)
WHERE admin IS TRUE;
It makes no sense to index a large number of tuples that all have the same value.
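A small usage sketch (using the users table from this answer): the planner can only consider the partial index when the query's WHERE clause implies the index predicate.
-- Can use admin_users_ix, since the predicate admin IS TRUE is implied:
EXPLAIN SELECT * FROM users WHERE admin IS TRUE;
-- Cannot use it; the predicate is not implied here:
EXPLAIN SELECT * FROM users WHERE admin IS FALSE;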

Here is what I think is the most likely explanation.
Your index is useful only when a very small number of rows are returned (by the way, I don't like to index booleans for this reason; you might consider using a partial index instead, i.e. adding a WHERE admin IS TRUE to the index definition, since that restricts the index to the cases where it is likely to be usable anyway).
If more than around 10% (if I recall correctly) of the pages in the table are to be retrieved, the planner is likely to choose a lot of sequential disk I/O over a smaller amount of random disk I/O, because that way you don't have to wait for platters to turn. Seek speed is a big issue there, and PostgreSQL tries to balance it against the amount of actual data to be retrieved from the relation.
Your gathered statistics indicated either that the table was smaller than it actually was, or that admins made up a larger portion of users than they actually did, so the planner used bad information to make its decision.
VACUUM ANALYZE does three things. First, it freezes old tuples visible to all transactions so that transaction ID wraparound is not an issue. Second, it marks tuples visible to no transaction as free space. Neither of these affected your issue. The third is that it analyzes the table and gathers statistics on it. Keep in mind this is a random sample and can therefore sometimes be off. My guess is that on the previous run it happened to sample pages with lots of admins and thus grossly overestimated the number of admins in the system.
This is probably a good time to double-check your autovacuum settings, because it is also possible that statistics are badly out of date elsewhere, though that is far from certain. In particular, the cost-based vacuum settings have defaults that can sometimes keep vacuum from ever fully catching up.
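A quick way to check both things, as a sketch (pg_stat_user_tables and pg_settings are standard PostgreSQL views; the table name is taken from the question):
-- When were statistics last gathered on the table?
SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'user';
-- Current autovacuum and cost-based vacuum settings:
SELECT name, setting FROM pg_settings
WHERE name LIKE 'autovacuum%' OR name LIKE 'vacuum_cost%';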

Related

What might cause Oracle to choose parallel execution on one database versus the other for the same objects?

I have a test and development Oracle 19c database that is running out of temp tablespace on an older, pre-existing query. On the database that runs out of space, the explain plan uses a lot of parallel execution steps (PX SEND BROADCAST, PX RECEIVE, PX BLOCK ITERATOR). The database is also buffering a lot of the scanned data, which I assume is what is eating up all the space.
On the dev database, the same query, on the same objects, with the same indexes and everything else the same as far as I have checked, runs without running out of space. Its explain plan uses about half the steps and does not use parallel execution at all.
I am trying to work with one of our DBAs to find what is causing the difference. What are some things I should look at that might explain such a difference in explain plan? I have made sure the indexes are the same and the data size is the same, verified that statistics have been gathered recently, and also looked at these settings:
select PDML_ENABLED, PDML_STATUS, PDDL_STATUS, PQ_STATUS FROM V$session where sid = (select sid from v$mystat where rownum = 1);
Are there any global or session parameters I might compare between the two databases?
When you say the tables/indexes are the same, make sure to check their "parallel" attribute.
At the system level, check parallel_degree_policy.
Also, an explain plan should tell you why a specific degree of parallelism was chosen; that might provide a clue.
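As a sketch of where to look (MY_TABLE is a placeholder; the dictionary views are standard Oracle):
-- Table- and index-level parallel (DEGREE) attribute:
SELECT table_name, degree FROM user_tables WHERE table_name = 'MY_TABLE';
SELECT index_name, degree FROM user_indexes WHERE table_name = 'MY_TABLE';
-- System-level parallel policy:
SELECT name, value FROM v$parameter WHERE name = 'parallel_degree_policy';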

Select LIMIT 1 takes long time on postgresql

I'm running a simple query on localhost PostgreSQL database and it runs too long:
SELECT * FROM features LIMIT 1;
I expect such a query to finish in a fraction of a second, as it basically says "peek anywhere in the table and pick one row". Or doesn't it?
table size is 75GB with estimated row count 1.84405e+008
I'm the only user of the database
the database server was just started, so I guess nothing is cached in memory
I totally agree with #larwa1n's comment on your post.
The reason here, I guess, is that the SELECT itself is simply slow on this table.
In my experience there can be other reasons as well. I list them below:
The table is too big, so add a WHERE clause and an index to support it.
The performance of your server/disk drive is too slow.
Another process is taking most of the resources.
Another possible reason is a missed maintenance task: check whether autovacuum is running. If not, check whether this table has ever been vacuumed; if it has not, run VACUUM FULL on it. When you do a lot of inserts/updates/deletes on a large table without vacuuming, the live rows end up spread across fragmented disk blocks, which makes even a simple query take longer.
Hopefully this answer helps you find the actual cause.
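As a first diagnostic step, a sketch using the table name from the question:
-- Shows where the time goes and how many pages are read just to find one live row:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM features LIMIT 1;
-- A large n_dead_tup or a long-ago last_autovacuum points at table bloat:
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'features';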

Index not used Postgres

While tracking index usage and analyzing the tables we added indexes to, we ran into some odd situations:
Some of our tables have an index, but when I execute a query with a WHERE clause on the indexed field, it is not counted in the corresponding idx_scan field. The relname and schemaname match, so I don't think I'm looking at the wrong row.
Testing further, I dropped and recreated the table, and after that the query was counted in idx_scan again.
That happened with other tables too: we executed queries that should use indexes, but only seq_scan increased, not idx_scan, and even if I add another indexed field to the same table, queries on this new field do not count in idx_scan either.
What is the problem with these tables? What are we doing wrong? Only newly created tables with indexes are counted in idx_scan; it is only the old tables that behave wrongly.
We have migrated this database a few times; could that be the problem? It happens both on localhost and on the online server.
Another thing we saw: some indexes had been counted before (idx_scan > 0), but when we execute a SELECT query now, idx_scan does not increase any more; the number stays fixed and only seq_scan increases.
I believe these problems may be related.
I'd appreciate some help; it's a big mystery prowling our DB and I have no idea what the problem could be.
A couple suggestions (and what to add to your question).
The first is that index scans are not always favored over sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be skipped in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly, if you have to retrieve, say, 50% of the pages of a relation, doing an index scan trades somewhat less total disk I/O for a great deal more random disk I/O. It might be a win if you use SSDs, but certainly not with conventional hard drives; after all, you don't really want to be waiting for platters to turn. If you are using SSDs you can tweak the planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
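A sketch of that pattern, with a hypothetical events table that has an indexed year column (not from the original question):
-- Each step probes the index once for the next distinct year,
-- instead of scanning millions of rows per year:
WITH RECURSIVE distinct_years AS (
    SELECT min(year) AS year FROM events
    UNION ALL
    SELECT (SELECT min(e.year) FROM events e WHERE e.year > d.year)
    FROM distinct_years d
    WHERE d.year IS NOT NULL
)
SELECT year FROM distinct_years WHERE year IS NOT NULL;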
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.

Postgres query optimization

On Postgres 9.0, I set both enable_indexscan and enable_seqscan to off. Why does this improve query performance by 2x?
This may help some queries run faster, but is almost certain to make other queries slower. It's interesting information for diagnostic purposes, but a bad idea for a long-term "solution".
PostgreSQL uses a cost-based optimizer, which looks at the costs of all possible plans based on statistics gathered by scanning your tables (normally by autovacuum) and costing factors. If it's not choosing the fastest plan, it is usually because your costing factors don't accurately model actual costs for your environment, statistics are not up-to-date, or statistics are not fine-grained enough.
After turning enable_indexscan and enable_seqscan back on:
I have generally found the cpu_tuple_cost default to be too low; I have often seen better plans chosen by setting that to 0.03 instead of the default 0.01; and I've never seen that override cause problems.
If the active portion of your database fits in RAM, try reducing both seq_page_cost and random_page_cost to 0.1.
Be sure to set effective_cache_size to the sum of shared_buffers and whatever your OS is showing as cached.
Never disable autovacuum. You might want to adjust parameters, but do that very carefully, with small incremental changes and subsequent monitoring.
You may need to occasionally run explicit VACUUM ANALYZE or ANALYZE commands, especially for temporary tables or tables which have just had a lot of modifications and are about to be used in queries.
You might want to increase default_statistics_target, from_collapse_limit, join_collapse_limit, or some geqo settings; but it's hard to tell whether those are appropriate without a lot more detail than you've given so far.
You can try out a query with different costing factors set on a single connection. When you confirm a configuration which works well for your whole mix (i.e., it accurately models costs in your environment), you should make the updates in your postgresql.conf file.
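A sketch of that kind of session-level experiment (the values are the ones suggested above, not universal recommendations; the effective_cache_size figure is a placeholder):
-- These SET commands apply only to the current connection:
SET cpu_tuple_cost = 0.03;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;
SET effective_cache_size = '6GB';  -- placeholder: shared_buffers plus OS cache
-- Re-run EXPLAIN ANALYZE on the query you are tuning and compare plans;
-- once satisfied, persist the winning values in postgresql.conf.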
If you want more targeted help, please show the structure of the tables, the query itself, and the results of running EXPLAIN ANALYZE for the query. A description of your OS and hardware helps a lot, too, along with your PostgreSQL configuration.
Why?
The most logical answer is because of the way your database tables are configured.
Without your table schemas I can only hazard a guess that your indexes don't have high cardinality.
That is to say, if your index matches too large a share of the table to be useful, it will be far less efficient, or indeed slower, than a plain scan.
Cardinality is a measure of how unique the rows in your index are. The lower the cardinality, the slower your query will be.
A perfect example is having a boolean field in your index; perhaps you have a Contacts table in your database with a boolean column that records true or false depending on whether the customer would like to be contacted by a third party.
On average, if you ran select * from Contacts where OptIn = true;, you can imagine that you'd return a lot of contacts; say 50% of them in our case.
Now if you add this OptIn column to an index on that same table, it stands to reason that no matter how selective the other columns are, you will always match 50% of the table because of the value of OptIn.
This is a perfect example of low cardinality: it is slow because any query involving that index has to select 50% of the rows in the table and then apply further WHERE filters to reduce the dataset again.
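A quick way to see this in numbers, as a sketch using the hypothetical Contacts/OptIn example from this answer:
-- If one value accounts for ~50% of the rows, an index on OptIn alone
-- will rarely beat a sequential scan for that value:
SELECT OptIn, count(*) FROM Contacts GROUP BY OptIn;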
Long story short: if your indexes include low-selectivity fields or simply cover every column in the table, the SQL engine has to resort to testing row-by-agonizing-row.
Anyway, the above is theoretical in your case, but it is a common, well-known reason why queries suddenly start taking much longer.
Please fill in the gaps regarding your data structure, index definitions and the actual query that is really slow!

PostgreSQL Query Optimization and the Postmaster Processes

I am currently working with a large PostgreSQL database derived from a Wikipedia dump; it contains about 40 GB of data. The database is running on an HP ProLiant ML370 G5 server with SUSE Linux Enterprise Server 10; I am querying it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both the laptop and the server.
Anyway, from my laptop, using pgAdmin III, I send off some SQL commands/queries; some of these are CREATE INDEX, DROP INDEX, DELETE, SELECT, etc. Sometimes I send a command (like CREATE INDEX), it returns, telling me that the query executed perfectly, etc. However, the postmaster process assigned to such a command seems to remain sleeping on the server. Now, I do not really mind this, since I tell myself that PostgreSQL maintains a pool of postmasters ready to process queries. Yet when this process eats up 6 GB of its 9.4 GB of assigned RAM, I worry (and at the moment it does). Maybe this is a cache of data kept in [shared] memory in case another query happens to need the same data, but I do not know.
Another thing is bothering me.
I have 2 tables. One is the page table; I have an index on its page_id column. The other is the pagelinks table, which has a pl_from column that references either nothing or a value in the page.page_id column; unlike page_id, the pl_from column has no index (yet). To give you an idea of the scale of the tables and the necessity of finding a viable solution: the page table has 13.4 million rows (after I deleted those I do not need), while the pagelinks table has 293 million.
I need to execute the following command to clean the pagelinks table of some of its useless rows:
DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);
So basically, I wish to rid the pagelinks table of all links coming from a page not in the page table. Even after disabling the nested loops and/or sequential scans, the query optimizer always gives me the following "solution":
Nested Loop (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
Join Filter: ("outer".pl_from <> "inner".page_id)"
-> Seq Scan on pagelinks (cost=0.00..5889791.00 rows=293392800 width=17)
-> Materialize (cost=494640.60..708341.51 rows=13474691 width=11)
-> Seq Scan on page (cost=0.00..402211.91 rows=13474691 width=11)
It seems that such a task would take weeks to complete; obviously, this is unacceptable. It seems to me that I would much rather it used the page_id index to do its thing... but it is a stubborn optimizer and I might be wrong.
To your second question: you could try creating a new table with just the records you need, using a CREATE TABLE AS statement; if the new table is sufficiently small, it might be faster, but it might not help either.
Indeed, I decided to CREATE a temporary table to speed up query execution:
CREATE TABLE temp_to_delete AS(
(SELECT DISTINCT pl_from FROM pagelinks)
EXCEPT
(SELECT page_id FROM page));
DELETE FROM pagelinks USING temp_to_delete
WHERE pagelinks.pl_from IN (temp_to_delete.pl_from);
Surprisingly, this query completed in about 4 hours, while the initial query had remained active for about 14 hours before I decided to kill it. More specifically, the DELETE returned:
Query returned successfully: 31340904 rows affected, 4415166 ms execution time.
As for the first part of my question, it seems that the postmaster process does indeed keep some info in cache; when another query requires info not in the cache and some memory (RAM), the cache is emptied. And the postmasters are indeed just a pool of processes.
It has also occurred to me that gnome-system-monitor gives incomplete information and is of little informational value. It is mostly due to this application that I have been so confused lately; for example, it does not consider the memory usage of other users (like the postgres user!) and even tells me that I have 12 GB of RAM free when this is untrue. Hence, I tried out a couple of system monitors, since I like to know how PostgreSQL is using its resources, and it seems that xosview is a valid tool.
Hope this helps!
Your postmaster process will stay there as long as the connection to the client is open.
Does pgAdmin close the connection? I don't know.
Memory used could be shared_buffers (check your config settings) or not.
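A quick sketch to see what is configured on your server:
-- Size of the buffer cache PostgreSQL keeps in shared memory:
SHOW shared_buffers;
SHOW work_mem;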
Now, the query. For big maintenance operations like this, feel free to set work_mem to something large, like a few GB. It looks like you have lots of RAM, so use it.
set work_mem to '4GB';
EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);
It should seq scan page, hash it, and seq scan pagelinks, peeking into the hash to check for page_ids. It should be quite fast (much faster than 4 hours!) but you need a large work_mem for the hash.
But since you delete a significant portion of your table, it might be faster to do it like this :
CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN page b ON a.pl_from = b.page_id;
(you could use a simple JOIN instead of IN)
You can also add an ORDER BY on this query, and your new table will be nicely ordered on disk for optimal access later.
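A sketch of the follow-up steps once pagelinks2 is built (the index name is hypothetical; recreate whatever indexes and constraints the original table had):
BEGIN;
DROP TABLE pagelinks;
ALTER TABLE pagelinks2 RENAME TO pagelinks;
COMMIT;
-- Recreate indexes and refresh statistics on the new table:
CREATE INDEX pagelinks_pl_from_idx ON pagelinks (pl_from);
ANALYZE pagelinks;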