I'm developing a PL/pgSQL function for PostgreSQL 9.1. When I use variables in a SQL query, the optimizer builds a bad execution plan. But if I replace a variable with its value, the plan is fine.
For instance:
v_param := 100;
select count(*)
into result
from <some tables>
where <some conditions>
and id = v_param
executes in 3s,
and
select count(*)
into result
from <some tables>
where <some conditions>
and id = 100
executes in 300ms.
In the first case, the optimizer generates a fixed plan for any value of v_param.
In the second case, the optimizer generates a plan based on the specified value, and it's significantly more efficient, despite not benefiting from plan caching.
Is it possible to make the optimizer generate a plan without dynamic binding, i.e. generate a fresh plan every time I execute the query?
This has been dramatically improved by Tom Lane in the just-released PostgreSQL 9.2; see What's new in PostgreSQL 9.2, particularly:
Prepared statements used to be optimized once, without any knowledge
of the parameters' values. With 9.2, the planner will use specific
plans regarding to the parameters sent (the query will be planned at
execution), except if the query is executed several times and the
planner decides that the generic plan is not too much more expensive
than the specific plans.
This has been a long-standing and painful wart that previously required SET enable_... parameters, wrapper functions using EXECUTE, or other ugly hacks. Now it should "just work".
Upgrade.
For anyone else reading this: you can tell if this problem is biting you because auto_explain plans of parameterised / prepared queries will differ from those you get when you EXPLAIN the query yourself. To verify, try PREPARE ... SELECT, then EXPLAIN EXECUTE, and see if you get a different plan from EXPLAIN SELECT.
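For example (a sketch; some_table and its id column are hypothetical, the pattern is what matters):

PREPARE stmt (integer) AS
    SELECT count(*) FROM some_table WHERE id = $1;

EXPLAIN EXECUTE stmt(100);      -- the plan the server uses for the prepared statement
EXPLAIN SELECT count(*) FROM some_table WHERE id = 100;   -- the plan for the literal

DEALLOCATE stmt;

If the two plans differ, generic-plan caching is likely what is hurting you.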
See also this prior answer.
Dynamic queries don't use cached plans, so you can use an EXECUTE ... USING statement in 9.1 and older. 9.2 should work without this workaround, as Craig wrote.
v_param := 100;
EXECUTE 'select count(*) from <some tables> where <some conditions>
and id = $1' INTO result USING v_param;
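For context, a minimal sketch of the whole function (the function and table names are made up; note that INTO belongs to the EXECUTE statement itself, not inside the query string):

CREATE OR REPLACE FUNCTION count_rows_for(v_param integer)
RETURNS bigint AS $$
DECLARE
    result bigint;
BEGIN
    -- The dynamic query is re-planned on every execution, using the actual value.
    EXECUTE 'SELECT count(*) FROM my_table WHERE id = $1'
        INTO result
        USING v_param;
    RETURN result;
END;
$$ LANGUAGE plpgsql;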
Related
I am trying to understand query optimization in PostgreSQL, and I have a function with some queries in it. Some of them are simple queries that save a value into a variable, and then the next query uses this variable to find something. Let's say:
function()...
select type into t
from tableA
where code = a_c;
select num into n
from tableB
where id = t;
end function...
and many more... If I want to explain analyse the whole function, I execute the command explain analyse select function(); Is this the right way to do it, or should I explain analyse every query inside the function, and if so, with what values?
Consider using the auto_explain module:
The auto_explain module provides a means for logging execution plans
of slow statements automatically, without having to run EXPLAIN by
hand. This is especially helpful for tracking down un-optimized
queries in large applications.
with auto_explain.log_nested_statements turned on:
auto_explain.log_nested_statements (boolean)
auto_explain.log_nested_statements causes nested statements
(statements executed inside a function) to be considered for logging.
When it is off, only top-level query plans are logged. This parameter
is off by default. Only superusers can change this setting.
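For example, to see the plans of the queries inside function() for the current session (a sketch; as the quoted docs note, setting log_nested_statements requires superuser):

LOAD 'auto_explain';                            -- load the module into this session
SET auto_explain.log_min_duration = 0;          -- log a plan for every statement
SET auto_explain.log_nested_statements = on;    -- include statements inside functions
SET auto_explain.log_analyze = on;              -- log actual times, not just estimates

SELECT function();   -- the plans for the queries inside now appear in the server log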
I am trying the example given in this blog on my laptop.
https://jonathanlewis.wordpress.com/2013/06/25/system-stats-2/
I get the same values as mentioned in the blog, but when I use the parallelism hint, the system does not use the requested DOP; instead, the same old plan is generated. I am not sure what I am missing or which values I did not set.
I have set my parallel_max_servers using the following statement:
alter system set parallel_max_servers=40 scope=both;
When I run the explain statement as:
explain plan for select /*+ parallel(t1 5) */ max(n1) from t1;
I am still getting the same old plan, as if no parallelism were used. Is there any other parameter I need to set to make my system use parallelism?
Thanks!
With the parallel hint, you don't need to specify a table name, just the degree of parallelism. Like this:
select /*+ PARALLEL (4) */ max(n1)
from t1;
I confirmed that adding the table name prevents parallelism from occurring in the execution plan.
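One way to verify is to display the plan with DBMS_XPLAN and look for the parallel-execution operators (a sketch based on the question's table):

EXPLAIN PLAN FOR
    SELECT /*+ PARALLEL (4) */ MAX(n1) FROM t1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- A parallel plan shows PX COORDINATOR / PX SEND / PX RECEIVE steps and a TQ column;
-- a serial plan shows none of these.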
I have a sql query, the exact code of which is generated in C#, and passed through ADO.Net as a text-based SqlCommand.
The query looks something like this:
SELECT TOP (@n)
    a.ID,
    a.Event_Type_ID as EventType,
    a.Date_Created,
    a.Meta_Data
FROM net.Activity a
LEFT JOIN net.vu_Network_Activity na WITH (NOEXPAND)
    ON na.Member_ID = @memberId AND na.Activity_ID = a.ID
LEFT JOIN net.Member_Activity_Xref ma
    ON ma.Member_ID = @memberId AND ma.Activity_ID = a.ID
WHERE
    a.ID < @LatestId
    AND (
        (Event_Type_ID IN (1,2,3))
        OR
        (
            (na.Activity_ID IS NOT NULL OR ma.Activity_ID IS NOT NULL)
            AND
            Event_Type_ID IN (4,5,6)
        )
    )
ORDER BY a.ID DESC
This query has been working well for quite some time. It takes advantage of some indexes we have on these tables.
In any event, all of a sudden this query started running really slowly, but it ran almost instantaneously in SSMS.
Eventually, after reading several resources, I was able to verify that the slowdown we were getting was from poor parameter sniffing.
By copying all of the parameters to local variables, I was able to successfully reduce the problem. The thing is, this just feels all kinds of wrong to me.
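For clarity, the workaround being described looks roughly like this (a sketch; the query is abbreviated and the local variable names are illustrative):

-- Shadow each sniffable parameter with a local variable; the optimizer cannot
-- sniff local variables, so it plans from average density statistics instead.
DECLARE @nLocal int, @memberIdLocal int, @LatestIdLocal int;
SELECT @nLocal = @n, @memberIdLocal = @memberId, @LatestIdLocal = @LatestId;

SELECT TOP (@nLocal) a.ID, a.Event_Type_ID, a.Date_Created, a.Meta_Data
FROM net.Activity a
WHERE a.ID < @LatestIdLocal
ORDER BY a.ID DESC;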
I'm assuming that what happened is that the statistics of one of these tables were updated, and then, by some crappy luck, the very first time this query was recompiled, it was called with parameter values that caused the execution plan to differ?
I was able to track down the query in the Activity Monitor; the execution plan that resulted in the query running in ~13 seconds was: [execution plan screenshot not reproduced here]
Running it in SSMS results in the following execution plan (and only takes ~100ms): [execution plan screenshot not reproduced here]
So what is the question?
I guess my question is this: How can I fix this problem, without copying the parameters to local variables, which could lead to a large number of cached execution plans?
Quote from the linked comment / Jes Borland:
You can use local variables in stored procedures to “avoid” parameter sniffing. Understand, though, that this can lead to many plans stored in the cache. That can have its own performance implications. There isn’t a one-size-fits-all solution to the problem!
My thinking is that if there is some way for me to manually remove the current execution plan from the plan cache, that might just be good enough... but everything I have found online only shows how to do this for an actual named stored procedure.
Since this is a text-based SqlCommand coming from C#, I do not know how to find the cached execution plan, with the sniffed parameter values, and remove it.
Note: the somewhat obvious solution of "just create a proper stored procedure" is difficult to do because this query can get generated in a number of different ways... and would require a somewhat unpleasant refactor.
If you want to remove a specific plan from the cache, it is really a two-step process: first obtain the plan handle for that specific plan, then use DBCC FREEPROCCACHE to remove that plan from the cache.
To get the plan handle, you need to look in the execution plan cache. The T-SQL below is an example of how you could search for the plan and get the handle (you may need to play with the filter clause a bit to hone in on your particular plan):
SELECT top (10)
qs.last_execution_time,
qs.creation_time,
cp.objtype,
SUBSTRING(qt.[text], qs.statement_start_offset/2 + 1, (
CASE
WHEN qs.statement_end_offset = -1
THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2 + 1
) AS query_text,
qt.text as full_query_text,
tp.query_plan,
qs.sql_handle,
qs.plan_handle
FROM
sys.dm_exec_query_stats qs
LEFT JOIN sys.dm_exec_cached_plans cp ON cp.plan_handle=qs.plan_handle
CROSS APPLY sys.dm_exec_sql_text (qs.[sql_handle]) AS qt
OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
WHERE qt.text like '%vu_Network_Activity%'
Once you have the plan handle, call DBCC FREEPROCCACHE as below:
DBCC FREEPROCCACHE(<plan_handle>)
There are many ways to delete/invalidate a query plan:
DBCC FREEPROCCACHE(plan_handle)
or
EXEC sp_recompile 'net.Activity'
or
adding the OPTION (RECOMPILE) query hint at the end of your query
or
using the optimize for ad hoc workloads server setting
or
updating statistics
If you have a crappy product from a crappy vendor, the best way to handle parameter sniffing is to create your own plan using EXEC sp_create_plan_guide.
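A rough sketch of such a plan guide (the guide name is made up, and the statement text shown is a shortened stand-in: @stmt must match the application's statement exactly, character for character):

EXEC sp_create_plan_guide
    @name   = N'Guide_Net_Activity_Feed',     -- hypothetical guide name
    @stmt   = N'SELECT TOP (@n) a.ID FROM net.Activity a WHERE a.ID < @LatestId ORDER BY a.ID DESC',
    @type   = N'SQL',
    @module_or_batch = NULL,                  -- NULL: @stmt is treated as its own batch
    @params = N'@n int, @LatestId int',       -- must match the app's parameter list
    @hints  = N'OPTION (RECOMPILE)';          -- or OPTIMIZE FOR, MAXDOP, etc.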
I've just recently become aware of the fact that you can now index your views in SQL Server (see http://technet.microsoft.com/en-us/library/cc917715.aspx). I'm now trying to figure out when I'd get better performance from a query against an indexed view versus the same query inside a stored procedure that's had its execution plan cached.
Take for example the following:
SELECT colA, colB, sum(colC), sum(colD), colE
FROM myTable
WHERE colFDate < '9/30/2011'
GROUP BY colA, colB, colE
The date will be different every time it's run, so if this were a view, I wouldn't include the WHERE in the view and would instead have that as part of my select against the view. If it were a stored procedure, the date would be a parameter. Note: there are about 300,000 rows in the table; 200,000 of them would meet the WHERE clause with the date, and 10,000 would be returned after the GROUP BY.
If this were an indexed view, should I expect better performance from it than from a stored procedure that's had an opportunity to cache its execution plan? Or would the proc be faster? Or would the difference be negligible? I know we could say "just try both out", but there are too many factors that could bias the results and lead me to a false conclusion, so I'd like to hear the theory behind it and what the expected outcomes are instead.
Thanks!
An indexed view can be regarded as a normal table - it's a materialized collection of rows.
So the question really boils down to whether or not a "normal" query is faster than a stored procedure.
If you look at the steps SQL Server goes through to execute any query (stored procedure call or ad-hoc SQL statement), you'll find (roughly) these steps:
syntactically check the query
if it's okay - it checks the plan cache to see if it already has an execution plan for that query
if there is an execution plan - that plan is (re-)used and the query executed
if there is no plan yet, an execution plan is determined
that plan is stored into the plan cache for later reuse
the query is executed
The point is: ad-hoc SQL and stored procedures are treated no differently.
If an ad-hoc SQL query is properly using parameters - as it should anyway, to prevent SQL injection attacks - its performance characteristics are no different and most definitely no worse than executing a stored procedure.
Stored procedures have other benefits (no need to grant users direct table access, for instance), but in terms of performance, using properly parametrized ad-hoc SQL queries is just as efficient as using stored procedures.
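For instance, a properly parametrized version of the query from the question would reach the server as something like this (a sketch; the parameter name and type are assumptions):

EXEC sp_executesql
    N'SELECT colA, colB, SUM(colC), SUM(colD), colE
      FROM myTable
      WHERE colFDate < @cutoff
      GROUP BY colA, colB, colE',
    N'@cutoff datetime',
    @cutoff = '20110930';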
Using stored procedures over non-parametrized queries is better for two main reasons:
since each non-parametrized query is a new, different query to SQL Server, it has to go through all the steps of determining an execution plan for each query (thus wasting time - and also wasting plan cache space, since storing the execution plan in the cache doesn't really help in the end: that particular query will probably never be executed again)
non-parametrized queries are at risk of SQL injection attack and should be avoided at all costs
Now of course, if your indexed view can reduce the number of rows significantly (by using a GROUP BY clause), that indexed view will be significantly faster than running a stored procedure against the whole dataset. But that's not because of the different approaches taken - it's just a matter of scale: querying a few dozen or a few hundred rows will be faster than querying 200,000 or more rows, no matter which way you query.
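For reference, a minimal sketch of what an indexed view for the question's query might look like (the dbo schema, NOT NULL colC/colD, and keeping colFDate in the grouping so the date filter still works are all assumptions; SCHEMABINDING and COUNT_BIG(*) are required in an indexed view with GROUP BY):

CREATE VIEW dbo.vMyTableSummary
WITH SCHEMABINDING
AS
SELECT colA, colB, colE, colFDate,
       SUM(colC)    AS sumC,
       SUM(colD)    AS sumD,
       COUNT_BIG(*) AS row_count      -- mandatory in an indexed view with GROUP BY
FROM dbo.myTable
GROUP BY colA, colB, colE, colFDate;  -- colFDate kept so you can still filter by date
GO

CREATE UNIQUE CLUSTERED INDEX IX_vMyTableSummary
    ON dbo.vMyTableSummary (colA, colB, colE, colFDate);
GO

A query with WHERE colFDate < '9/30/2011' then re-aggregates the pre-grouped rows instead of scanning the 200,000 matching base rows.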
We're seeing strange behavior when running two versions of a query on SQL Server 2005:
version A:
SELECT otherattributes.* FROM listcontacts JOIN otherattributes
ON listcontacts.contactId = otherattributes.contactId WHERE listcontacts.listid = 1234
ORDER BY name ASC
version B:
DECLARE @Id AS INT;
SET @Id = 1234;
SELECT otherattributes.* FROM listcontacts JOIN otherattributes
ON listcontacts.contactId = otherattributes.contactId
WHERE listcontacts.listid = @Id
ORDER BY name ASC
Both queries return 1000 rows; version A takes on average 15s; version B on average takes 4s.
Could anyone help us understand the difference in execution times of these two versions of SQL?
If we invoke this query via named parameters using NHibernate, we see the following query via SQL Server profiler:
EXEC sp_executesql N'SELECT otherattributes.* FROM listcontacts JOIN otherattributes ON listcontacts.contactId = otherattributes.contactId WHERE listcontacts.listid = @id ORDER BY name ASC',
N'@id INT',
@id=1234;
...and this tends to perform as badly as version A.
Try taking a look at the execution plan for your query. This should give you more insight into how your query is executed.
I've not seen the execution plans, but I strongly suspect that they are different in these two cases. The issue you are having is that in case A the optimiser knows the value you are using for the list id (1234) and, using a combination of the distribution statistics and the indexes, chooses a plan tailored to that value.
In the second case, the optimiser is not able to sniff the value of the ID and so produces a plan that would be acceptable for any passed in list id. And where I say acceptable I do not mean optimal.
So what can you do to improve the scenario? There are a couple of alternatives here:
1) Create a stored procedure to perform the query as below:
CREATE PROCEDURE Foo
@Id INT
AS
SELECT otherattributes.* FROM listcontacts JOIN otherattributes
ON listcontacts.contactId = otherattributes.contactId WHERE listcontacts.listid = @Id
ORDER BY name ASC
GO
This will allow the optimiser to sniff the value of the input parameter when it is passed in and produce an appropriate execution plan for the first execution. Unfortunately, it will cache that plan for later reuse, so unless you generally call the sproc with similarly selective values, this may not help you much.
2) Create a stored procedure as above, but specify it to be WITH RECOMPILE. This will ensure that the stored procedure is recompiled each time it is executed and hence produce a new plan optimised for this input value
3) Add OPTION (RECOMPILE) to the end of the SQL Statement. Forces recompilation of this statement, and is able to optimise for the input value
4) Add OPTION (OPTIMIZE FOR (@Id = 1234)) to the end of the SQL statement. This will cause the plan that gets cached to be optimised for this specific input value. Great if this is a highly common value, or if most common values are similarly selective, but not so great if the distribution of selectivity is more widely spread.
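Concretely, options 3 and 4 just append a hint to the statement from the question:

SELECT otherattributes.*
FROM listcontacts
JOIN otherattributes ON listcontacts.contactId = otherattributes.contactId
WHERE listcontacts.listid = @Id
ORDER BY name ASC
OPTION (RECOMPILE);                      -- option 3: re-plan on every execution
-- ...or, to always plan as if @Id were 1234:
-- OPTION (OPTIMIZE FOR (@Id = 1234));   -- option 4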
It's possible that instead of casting 1234 to the same type as listcontacts.listid and then doing the comparison once, it is casting the value in each row to match the literal's type. The first requires just one cast; the second needs a cast per row (and that's probably over far more than 1000 rows - it may be every row in the table). I'm not sure what type that constant will be interpreted as, but it may be 'numeric' rather than 'int'.
If this is the cause, the second version is faster because it's forcing 1234 to be interpreted as an int and thus removing the need to cast the value in every row.
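One cheap way to test this theory would be to make the literal's type explicit and see whether version A speeds up (a sketch):

SELECT otherattributes.*
FROM listcontacts
JOIN otherattributes ON listcontacts.contactId = otherattributes.contactId
WHERE listcontacts.listid = CAST(1234 AS int)   -- one cast, same type as the column
ORDER BY name ASC;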
However, as the previous poster suggests, the query plan shown in SQL Server Management Studio may indicate an alternative explanation.
The best way to see what is happening is to compare the execution plans, everything else is speculation based on the limited details presented in the question.
To see the execution plan, go into SQL Server Management Studio and run SET SHOWPLAN_XML ON, then run query version A; the query will not actually run, but the execution plan will be displayed as XML. Then run query version B and look at its execution plan. If you still can't tell the difference or solve the problem, post both execution plans and someone here will explain them.
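For example (SET SHOWPLAN_XML must be the only statement in its batch, hence the GO separators):

SET SHOWPLAN_XML ON;
GO
SELECT otherattributes.* FROM listcontacts JOIN otherattributes
    ON listcontacts.contactId = otherattributes.contactId
WHERE listcontacts.listid = 1234
ORDER BY name ASC;     -- returns the XML plan instead of the rows
GO
SET SHOWPLAN_XML OFF;
GO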