Why is the cost of the execution plan generated from stale statistics lower than the cost of the plan generated from updated statistics?
To understand my problem, please follow the scenario below:
Assumption: auto statistics update is off.
I manually updated statistics with a full scan, then executed the following batch with the actual execution plan included:
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT *
FROM AWSales -- table has 60000 rows
WHERE SalesOrderID = 44133
OPTION(RECOMPILE);
--returns 17 rows
The optimizer generated a plan that used a nonclustered index seek and a key lookup - that was definitely fine.
Then I wanted to cheat the optimizer, so I inserted 60,000 rows with the value 44133 in the SalesOrderID column.
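A minimal sketch of that skewing step; only AWSales and SalesOrderID come from the question, the other column name is an assumption:
INSERT INTO AWSales (SalesOrderID, OrderQty)       -- OrderQty is a hypothetical column
SELECT TOP (60000) 44133, 1
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;                   -- cross join only to generate 60,000 rows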
Without updating statistics, I executed the batch again and the optimizer returned the same plan (with the index seek), but with a different cost (60,000 rows returned), of course.
Next I updated statistics with a full scan for the table and executed the batch again. This time the optimizer returned a different plan, with an index scan operator. Predictable. At first glance it looked good. But when I compared the query plan costs, the new plan was more expensive than the plan that used the index seek. So after updating statistics, the query with the optimal plan appeared slower. NONSENSE!
Then I wanted to compare the cost of the index seek before and after the update. The updated statistics caused the optimizer to choose the plan with the index scan, so to force a plan with an index seek I added a hint to the query. After executing it, it turned out that the actual cost of the forced index seek was MUCH bigger than the cost when the statistics were stale. How is that possible?
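For reference, a sketch of how the seek can be forced; the index name is hypothetical, only the table and column come from the question:
SELECT *
FROM AWSales WITH (INDEX(IX_AWSales_SalesOrderID), FORCESEEK)  -- IX_AWSales_SalesOrderID is an assumed name
WHERE SalesOrderID = 44133
OPTION (RECOMPILE);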
For more details, please look at the sample script.
SELECT T.id_task,
T."group",
A.type
FROM TASK T
JOIN ACTION A ON T.id_task = A.id_task
WHERE T.dt_inc BETWEEN 1511146800000 AND 1511492399999
AND A.dt_scheduled BETWEEN 1511146800000 AND 1511492399999
AND A.id_action > ( SELECT MIN(id_action) FROM ACTION WHERE T.id_task = id_task AND type <> 'TRANSFER' )
The original query is much bigger, but the problematic part is here.
It takes one minute to finish, using only the (task.dt_inc, action.dt_scheduled) indexes, BUT if I DROP INDEX action_dt_scheduled, the execution time drops to one second with the same results, using only the (task.dt_inc, action.id_task) indexes.
Why is the index performing so badly?
Can I make the planner ignore this index without dropping it?
I already recreated the index (DROP then CREATE) and also ran REINDEX, which should be the same as recreating it. What can I do now?
EDIT:
I was trying to get the EXPLAIN output of the slow query, but it is not slow anymore; running ANALYZE and REINDEX on the tables has solved the problem.
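For reference, the maintenance that resolved it was roughly the following sketch (the table names come from the query above):
REINDEX TABLE action;   -- rebuild the indexes on the child table
ANALYZE action;         -- refresh the planner statistics
ANALYZE task;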
There are several reasons why performance can be worse with an index than without it:
bad estimations - when an index is used, random I/O is used. Random I/O is usually significantly slower than a sequential scan, but it is acceptable if the index selects a small part of the table. When the planner has a bad estimate of the result size, it can use the index even though the sequential scan would be better. You can check the estimates with the EXPLAIN ANALYZE command (see the sketch after this list).
bloated index - the state of an index can be bad if it has not been reindexed for a long time; then access to and usage of the index can be slow.
wrongly sized effective_cache_size parameter - when this parameter is not accurate or not safe, index pages can push heap (table) pages out of RAM. In this case the page cache (shared buffers) is not stable - you can check the stability of shared buffers with the pg_buffercache extension. Too low a shared_buffers setting can have a similar effect.
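A sketch of both checks; the table and column names come from the question above, everything else is standard:
-- Compare the planner's row estimates with reality; a large gap on the index
-- scan node suggests stale statistics:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM action
WHERE dt_scheduled BETWEEN 1511146800000 AND 1511492399999;

-- Refresh statistics if the estimates are far off:
ANALYZE action;

-- Inspect which relations currently occupy shared_buffers (requires the
-- pg_buffercache extension):
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;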
I'm having trouble regarding speed in a SELECT query on a Postgres database.
I have a table with two integer columns as key: (int1,int2)
This table has around 70 million rows.
I need to make two kinds of simple SELECT queries in this environment:
SELECT * FROM table WHERE int1=X;
SELECT * FROM table WHERE int2=X;
These two selects return around 10,000 rows each out of the 70 million. To make this as fast as possible, I thought of using two HASH indexes, one for each column. Unfortunately the results are not that good:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on lec_sim (cost=232.21..25054.38 rows=6565 width=36) (actual time=14.759..23339.545 rows=7871 loops=1)
Recheck Cond: (lec2_id = 11782)
-> Bitmap Index Scan on lec_sim_lec2_hash_ind (cost=0.00..230.56 rows=6565 width=0) (actual time=13.495..13.495 rows=7871 loops=1)
Index Cond: (lec2_id = 11782)
Total runtime: 23342.534 ms
(5 rows)
This is an EXPLAIN ANALYZE example of one of these queries. It is taking around 23 seconds; my expectation was to get this information in less than a second.
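For reference, the two hash indexes described above would look roughly like this (lec_sim and lec2_id come from the plan; lec1_id is an assumed name for the other column):
CREATE INDEX lec_sim_lec1_hash_ind ON lec_sim USING hash (lec1_id);  -- lec1_id assumed
CREATE INDEX lec_sim_lec2_hash_ind ON lec_sim USING hash (lec2_id);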
These are some parameters of the postgres db config:
work_mem = 128MB
shared_buffers = 2GB
maintenance_work_mem = 512MB
fsync = off
synchronous_commit = off
effective_cache_size = 4GB
Any help, comment or thought would be really appreciated.
Thank you in advance.
Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
Thank you, willglynn. As you noticed, the problem was seeking through the disk, not looking up the indexes. You proposed several solutions, like loading the dataset into RAM or buying SSDs. Setting aside those two, which involve managing things outside the database itself, you proposed two ideas:
Reorganize the data to reduce the seeking of the data.
Use PostgreSQL 9.2 feature "index-only scans"
Since I am on a PostgreSQL 9.1 server, I decided to take option 1.
I made a copy of the table, so now I have the same table with the same data twice. I created an index on each one, the first indexed by (int1) and the second by (int2). Then I clustered each of them (CLUSTER table USING ind_intX) on its respective index.
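A rough sketch of those steps, with hypothetical table and index names:
CREATE TABLE lec_sim_by_int1 AS SELECT * FROM lec_sim;   -- first copy
CREATE TABLE lec_sim_by_int2 AS SELECT * FROM lec_sim;   -- second copy
CREATE INDEX ind_int1 ON lec_sim_by_int1 (int1);
CREATE INDEX ind_int2 ON lec_sim_by_int2 (int2);
CLUSTER lec_sim_by_int1 USING ind_int1;                  -- physically reorder each copy by its index
CLUSTER lec_sim_by_int2 USING ind_int2;
ANALYZE lec_sim_by_int1;
ANALYZE lec_sim_by_int2;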
I'm now posting an EXPLAIN ANALYZE of the same query, run against one of these clustered tables:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using lec_sim_lec2id_ind on lec_sim_lec2id (cost=0.00..21626.82 rows=6604 width=36) (actual time=0.051..1.500 rows=8119 loops=1)
Index Cond: (lec2_id = 12300)
Total runtime: 1.822 ms
(3 rows)
Now the seeking is really fast; I went down from 23 seconds to ~2 milliseconds, which is an impressive improvement. I think this problem is solved for me, and I hope it might also be useful for others experiencing the same problem.
Thank you so much willglynn.
I had a case of super slow queries where simple one-to-many joins (in PG v9.1) were performed between a table of 33 million rows and a child table of 2.4 billion rows. I performed a CLUSTER on the foreign key index of the child table, but found that this didn't solve my problem with query timeouts, even for the simplest of queries. Running ANALYZE also did not solve the issue.
What made a huge difference was performing a manual VACUUM on both the parent table and the child table. Even as the parent table was completing its VACUUM process, I went from 10 minute timeouts to results coming back in one second.
What I am taking away from this is that regular VACUUM operations are still critical, even in v9.1. The reason I did this was that I noticed autovacuum hadn't run on either of the tables for at least two weeks, and lots of upserts and inserts had occurred since then. It may be that I need to tune the autovacuum thresholds to take care of this issue going forward, but what I can say is that a 640GB table with a couple of billion rows does perform well if everything is cleaned up. I haven't yet had to partition the table to get good performance.
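A sketch of the manual maintenance described above, with hypothetical table names:
VACUUM (VERBOSE, ANALYZE) parent_table;
VACUUM (VERBOSE, ANALYZE) child_table;

-- Check when autovacuum last ran on the two tables:
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('parent_table', 'child_table');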
For a very simple and effective one-liner: if you have fast solid-state storage on your Postgres machine, try setting:
random_page_cost=1.0
in your postgresql.conf.
The default is random_page_cost=4.0, which is optimized for storage with high seek times like old spinning disks. This changes the cost calculation for seeks and relies less on your memory (which could ultimately be going to swap anyway).
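You can also test the effect for a single session before editing postgresql.conf; a sketch, with a placeholder table and column:
SET random_page_cost = 1.0;               -- applies to this session only
EXPLAIN ANALYZE
SELECT * FROM my_table WHERE my_flag;     -- hypothetical table and boolean column
RESET random_page_cost;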
This setting alone improved my filtering query from 8 seconds down to 2 seconds on a long table with a couple million records.
The other major improvement came from creating indexes on all of the boolean columns of my table. This reduced the 2-second query to about 1 second. Check #willglynn's answer for that.
Hope this helps!
Using the following
SELECT *
FROM dbo.GRSM_WETLAND_POLY
CROSS APPLY (SELECT TOP 1 Name, shape
FROM GRSM.dbo.GRSM_Trails --WITH(index(S319_idx))
WHERE GRSM_Trails.Shape.STDistance(dbo.GRSM_WETLAND_POLY.Shape) IS NOT NULL
ORDER BY GRSM_Trails.Shape.STDistance(dbo.GRSM_WETLAND_POLY.Shape) ASC) fnc
runs very slowly on 134 rows (56 seconds); however, with the index hint uncommented, it returns:
Msg 8635, Level 16, State 4, Line 3
The query processor could not produce a query plan for a query with a spatial index hint. Reason: Spatial indexes do not support the
comparator supplied in the predicate. Try removing the index hints or
removing SET FORCEPLAN.
The execution plan shows the filter cost at 98%. It's querying against 1,400 rows in the other table, so the total cost is 134 * 1,400 individual seeks, which is where the delay is. On their own, the spatial indexes in each table perform great, with no fragmentation, 99% page fullness, and MEDIUM for all four grid levels with 16 cells per object. Changing the spatial index properties on either table had no effect on performance.
Documentation suggests that spatial index hints can only be used in queries in SQL Server 2012, but surely there's a workaround for this?
The main question would be: why are you forcing the hint? If SQL Server didn't choose that index in the plan it generated, forcing another plan will almost always result in decreased performance.
What you should do is analyze each node of the resulting execution plan to see where the bottleneck that takes so long is. If you post a screenshot, maybe we can help.
I am using the SQL Server execution plan to analyze the performance of a stored procedure. I have two results, with and without the index. In both results the estimated cost shows the same value (.0032831), but the cost % differs: without the index it is 7%, and with the index it is 14%.
What does it really means?
Please help me with this.
Thanks in advance.
It means that the plan without the index is costed as being twice as expensive.
For the first one the total plan cost is .0032831/0.07 = 0.0469, for the second one the total plan cost is .0032831/0.14 which is clearly going to be half that number.
I have a SQL Server 2000 query that performs a clustered index scan and displays a very high number of rows. For a table where I have 260,000 records, the actual number of rows displayed in the execution plan is ... 34,000,000.
Does this make sense? What am I misunderstanding?
Thanks.
The values displayed on the execution plan are estimates based on statistics. Normal queries like:
SELECT COUNT(*) FROM Table
are 100% accurate for your transaction*.
Here's a related question.
*Edge cases may vary depending on transaction isolation level.
More information on stats:
Updating statistics
How often and how (maintenance plans++)
If your row counts are off from the query plan, you'll need to update the statistics, or else the query optimizer may choose the wrong plan. Also, a clustered index scan is almost like a table scan... try to fix up the indexes so you get a clustered index seek, or at least an index seek.
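A sketch of refreshing the statistics, with a hypothetical table name:
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;  -- one table, full scan
-- or, for every table in the database:
EXEC sp_updatestats;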
But ... If it's the "Actual Number of Rows" ... why is that based on statistics?
I assumed that the Estimated Number of Rows is used when building the query plan (and collected from statistics at that time), and that the Actual Number of Rows is extra information added after the query executes, for debugging and tuning purposes.
Isn't this right?