Table Valued Function where did my query plan go? - sql

I've just wrapped a complex SQL statement in a table-valued function on SQL Server 2000.
When I look at the query plan for SELECT * FROM dbo.NewFunc, it just shows a Table Scan of the table I have created.
I'm guessing that this is because the table is created in tempdb and I am just selecting from it.
So the query is simply :
SELECT * FROM table in tempdb
My questions are:
Is the UDF using the same plan as the complex SQL statement?
How can I tune indexes for this UDF?
Can I see the true plan?

Multi-statement table-valued functions (TVFs) are black boxes to the optimiser for the outer query. You can only see IO, CPU etc. from Profiler.
The TVF must run to completion and return all rows before any further processing happens. That means, for example, that a WHERE clause on the outer query is not pushed into the function.
So if this TVF returns a million rows, all of them have to be produced and sorted first:
SELECT TOP 1 x FROM dbo.MyTVF() ORDER BY x DESC
Single-statement (inline) TVFs do not suffer from this because they are expanded like macros into the outer query and optimised with it. The example above would then be able to use indexes etc.
Also here too: Does query plan optimizer works well with joined/filtered table-valued functions? and Relative Efficiency of JOIN vs APPLY in Microsoft SQL Server 2008
To answer exactly: no, no, and no
I have very few multi statement TVFs: where I do, I have lots of parameters to filter inside the UDF.
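A minimal sketch of what "parameters to filter inside the UDF" could look like. The table dbo.Orders, its columns, and the function name dbo.GetOrdersFiltered are hypothetical and only illustrate the pattern; the point is that the WHERE clause runs inside the function against the base table, so only matching rows are written to the returned table variable.

```sql
-- Hypothetical multi-statement TVF: dbo.Orders and its columns are assumptions.
CREATE FUNCTION dbo.GetOrdersFiltered
(
    @CustomerID INT,
    @FromDate   DATETIME
)
RETURNS @Result TABLE
(
    OrderID    INT      NOT NULL,
    CustomerID INT      NOT NULL,
    OrderDate  DATETIME NOT NULL
)
AS
BEGIN
    -- The filter is applied here, where the base table's indexes can be used,
    -- instead of in an outer WHERE clause applied after all rows are returned.
    INSERT @Result (OrderID, CustomerID, OrderDate)
    SELECT o.OrderID, o.CustomerID, o.OrderDate
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID
      AND o.OrderDate >= @FromDate
    RETURN
END
GO

-- The caller filters through the parameters rather than an outer WHERE clause.
SELECT OrderID, OrderDate
FROM dbo.GetOrdersFiltered(42, '20000101')
```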

Related

Oracle: why doesn't use parallel execution?

Look at the following query:
If I comment out the subquery it uses parallel execution; otherwise it doesn't.
After the query has been
SELECT /*+ parallel(c, 20) */
1, (SELECT 2 FROM DUAL)
FROM DUAL c;
You could have found the answer in the documentation:
A SELECT statement can be parallelized only if the following
conditions are satisfied:
The query includes a parallel hint specification (PARALLEL or
PARALLEL_INDEX) or the schema objects referred to in the query have a
PARALLEL declaration associated with them.
At least one of the tables specified in the query requires one of
the following:
A full table scan
An index range scan spanning multiple partitions
No scalar subqueries are in the SELECT list.
Your query falls at the final hurdle: it has a scalar subquery in its projection. If you want to parallelize the query you need to find another way to write it.
One idea is to avoid the subquery and use a join instead. Your subquery seems fairly simple (no grouping etc.), so it should not be an issue to translate it into a join.
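As a rough sketch of that idea, the scalar subquery can be moved out of the SELECT list and into the FROM clause. The alias s and the column name two are made up for illustration. Whether Oracle then actually chooses a parallel plan (especially for DUAL) is not something this sketch guarantees; it only removes the scalar subquery from the projection.

```sql
-- Sketch only: the scalar subquery becomes a one-row inline view that is joined.
SELECT /*+ parallel(c, 20) */
       1, s.two
FROM   DUAL c
CROSS JOIN (SELECT 2 AS two FROM DUAL) s;
```

Note that the two forms are only equivalent when the subquery always returns exactly one row. A scalar subquery that finds no row yields NULL, whereas a join would drop the outer row, so an outer join would be needed in that case.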
Maybe the optimizer is not capable of parallel execution when there are subqueries.

Use of function calls in stored procedure sql server 2005?

Use of function calls in where clause of stored procedure slows down performance in sql server 2005?
SELECT * FROM Member M
WHERE LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'administrator'
AND LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'moderator'
In this query GetLookupDetailTitle is a user-defined function and LOWER() is a built-in function; I am asking about both.
Yes.
Both of these are practices to be avoided where possible.
Applying almost any function to a column makes the expression unsargable, which means an index cannot be used to seek on it. Even if the column is not indexed, it makes cardinality estimates incorrect for the rest of the plan.
Additionally your dbo.GetLookupDetailTitle scalar function looks like it does data access and this should be inlined into the query.
The query optimiser does not inline logic from scalar UDFs and your query will be performing this lookup for each row in your source data, which will effectively enforce a nested loops join irrespective of its suitability.
Additionally this will actually happen twice per row because of the 2 function invocations. You should probably rewrite as something like
SELECT M.* /*But don't use * either, list columns explicitly... */
FROM Member M
WHERE NOT EXISTS(SELECT *
FROM MemberRoles R
WHERE R.MemberId = M.MemberId
AND R.RoleId IN (1,2)
)
Don't be tempted to replace the literal values 1,2 with variables with more descriptive names as this too can mess up cardinality estimates.
Applying a function to a column in a WHERE clause forces a scan.
There's no way to seek on an index, since the engine can't know what the result will be until it runs the function on every row in the table.
You can avoid both the user-defined function and the built-in one by either
defining "magic" values for the administrator and moderator roles and comparing Member.RoleId against these scalars, or
defining IsAdministrator and IsModerator flags on a MemberRole table and joining it with Member to filter on those flags
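The second option might look roughly like this. The MemberRole table and its RoleId, IsAdministrator and IsModerator columns are assumptions taken from the suggestion above, not from the asker's actual schema, and the flags are assumed to be BIT NOT NULL.

```sql
-- Assumed schema: MemberRole(RoleId, IsAdministrator BIT NOT NULL, IsModerator BIT NOT NULL).
SELECT M.MemberId, M.RoleId
FROM Member AS M
INNER JOIN MemberRole AS MR
    ON MR.RoleId = M.RoleId      -- plain column comparison, no function call
WHERE MR.IsAdministrator = 0
  AND MR.IsModerator = 0;
```

Because no function wraps M.RoleId or the flag columns, the optimizer is free to use indexes on them and to choose the join type itself.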

Multi-statement Table Valued Function vs Inline Table Valued Function

A few examples to show, just in case:
Inline Table Valued
CREATE FUNCTION MyNS.GetUnshippedOrders()
RETURNS TABLE
AS
RETURN SELECT a.SaleId, a.CustomerID, b.Qty
FROM Sales.Sales a INNER JOIN Sales.SaleDetail b
ON a.SaleId = b.SaleId
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.ShipDate IS NULL
GO
Multi Statement Table Valued
CREATE FUNCTION MyNS.GetLastShipped(@CustomerID INT)
RETURNS @CustomerOrder TABLE
(SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL)
AS
BEGIN
DECLARE @MaxDate DATETIME
SELECT @MaxDate = MAX(OrderDate)
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.OrderDate = @MaxDate
AND a.CustomerID = @CustomerID
RETURN
END
GO
Is there an advantage to using one type (inline or multi-statement) over the other? Are there certain scenarios when one is better than the other, or are the differences purely syntactical? I realise the two example queries are doing different things, but is there a reason I would write them in that way?
I have read about them, but the advantages/differences haven't really been explained.
In researching Matt's comment, I have revised my original statement. He is correct: there will be a difference in performance between an inline table valued function (ITVF) and a multi-statement table valued function (MSTVF) even if they both simply execute a SELECT statement. SQL Server will treat an ITVF somewhat like a VIEW in that it will calculate an execution plan using the latest statistics on the tables in question. An MSTVF is equivalent to stuffing the entire contents of your SELECT statement into a table variable and then joining to that. Thus, the compiler cannot use any table statistics on the tables in the MSTVF. So, all things being equal (which they rarely are), the ITVF will perform better than the MSTVF. In my tests, the performance difference in completion time was negligible; however, from a statistics standpoint, it was noticeable.
In your case, the two functions are not functionally equivalent. The MSTV function does an extra query each time it is called and, most importantly, filters on the customer id. In a large query, the optimizer would not be able to take advantage of other types of joins as it would need to call the function for each customerId passed. However, if you re-wrote your MSTV function like so:
CREATE FUNCTION MyNS.GetLastShipped()
RETURNS @CustomerOrder TABLE
(
SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL
)
AS
BEGIN
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a
INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c
ON b.ProductID = c.ProductID
WHERE a.OrderDate = (
SELECT MAX(SH1.OrderDate)
FROM Sales.SalesOrderHeader AS SH1
WHERE SH1.CustomerID = a.CustomerID
)
RETURN
END
GO
In a query, the optimizer would be able to call that function once and build a better execution plan, but it still would not be better than an equivalent, non-parameterized ITVF or a VIEW.
ITVFs should be preferred over MSTVFs when feasible because the data types, nullability and collation are derived from the columns in the underlying tables, whereas you have to declare those properties yourself in a multi-statement table valued function, and, importantly, you will get better execution plans from the ITVF. In my experience, I have not found many circumstances where an ITVF was a better option than a VIEW, but mileage may vary.
Thanks to Matt.
Addition
Since I saw this come up recently, here is an excellent analysis done by Wayne Sheffield comparing the performance difference between Inline Table Valued functions and Multi-Statement functions.
His original blog post.
Copy on SQL Server Central
Internally, SQL Server treats an inline table valued function much like it would a view and treats a multi-statement table valued function similar to how it would a stored procedure.
When an inline table-valued function is used as part of an outer query, the query processor expands the UDF definition and generates an execution plan that accesses the underlying objects, using the indexes on these objects.
For a multi-statement table valued function, an execution plan is created for the function itself and stored in the execution plan cache (once the function has been executed the first time). If multi-statement table valued functions are used as part of larger queries then the optimiser does not know what the function returns, and so makes some standard assumptions - in effect it assumes that the function will return a single row, and that the returns of the function will be accessed by using a table scan against a table with a single row.
Where multi-statement table valued functions can perform poorly is when they return a large number of rows and are joined against in outer queries. The performance issues are primarily down to the fact that the optimiser will produce a plan assuming that a single row is returned, which will not necessarily be the most appropriate plan.
As a general rule of thumb we have found that where possible inline table valued functions should be used in preference to multi-statement ones (when the UDF will be used as part of an outer query) due to these potential performance issues.
There is another difference. An inline table-valued function can be inserted into, updated, and deleted from - just like a view. Similar restrictions apply - can't update functions using aggregates, can't update calculated columns, and so on.
Your examples, I think, answer the question very well. The first function can be done as a single select, and is a good reason to use the inline style. The second could probably be done as a single statement (using a sub-query to get the max date), but some coders may find it easier to read or more natural to do it in multiple statements as you have done. Some functions just plain can't get done in one statement, and so require the multi-statement version.
I suggest using the simplest (inline) whenever possible, and using multi-statements when necessary (obviously) or when personal preference/readability makes it worth the extra typing.
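For illustration, the single-statement version of the second function mentioned above could be sketched as follows. The name GetLastShippedInline is made up, and the sketch assumes the AdventureWorks-style Sales.SalesOrderDetail table (with OrderQty and ProductID columns) is the intended detail table.

```sql
-- Sketch of an inline equivalent; assumes Sales.SalesOrderDetail holds OrderQty/ProductID.
CREATE FUNCTION MyNS.GetLastShippedInline (@CustomerID INT)
RETURNS TABLE
AS
RETURN
    SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
    FROM Sales.SalesOrderHeader AS a
    INNER JOIN Sales.SalesOrderDetail AS b
        ON a.SalesOrderID = b.SalesOrderID
    INNER JOIN Production.Product AS c
        ON b.ProductID = c.ProductID
    WHERE a.CustomerID = @CustomerID
      -- The sub-query replaces the separate "SELECT @MaxDate = MAX(...)" step.
      AND a.OrderDate = (SELECT MAX(h.OrderDate)
                         FROM Sales.SalesOrderHeader AS h
                         WHERE h.CustomerID = @CustomerID);
GO

-- Usage: 11000 is just a placeholder customer id.
SELECT SalesOrderID, OrderDate, OrderQty
FROM MyNS.GetLastShippedInline(11000);
```

Since there is no declared return table, the column names and types come straight from the SELECT list.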
Another case for using a multi-statement function would be to stop SQL Server from pushing down the WHERE clause.
For example, I have a table of table names. Some of the names are formatted like C05_2019 and C12_2018, and all tables named that way have the same schema. I wanted to merge all that data into one table and parse out 05 and 12 into a CompNo column and 2018 and 2019 into a year column. However, there are other tables, like ACA_StupidTable, from which I cannot extract CompNo and CompYr; I would get a conversion error if I tried. So my query was in two parts: an inner query that returned only tables formatted like 'C_______', and an outer query that did a substring and int conversion, i.e. Cast(Substring(2, 2) as int) as CompNo. All looks good, except that SQL Server decided to evaluate my Cast before the results were filtered, so I got a mind-scrambling conversion error. A multi-statement table function may prevent that from happening, since it is basically a "new" table.
Look at Comparing Inline and Multi-Statement Table-Valued Functions; you can find good descriptions and performance benchmarks there.
I have not tested this, but a multi-statement function caches the result set. There may be cases where there is too much going on for the optimizer to inline the function. For example, suppose you have a function that returns a result from different databases depending on what you pass as a "Company Number". Normally, you could create a view with a UNION ALL and then filter by company number, but I found that sometimes SQL Server pulls back the entire union and is not smart enough to call just the one SELECT. A table function can have logic to choose the source.
Maybe in a very condensed way.
ITVF (inline TVF): more for the DB person; it is a kind of parameterized view that takes a single SELECT statement.
MTVF (multi-statement TVF): more for the developer; it creates and loads a table variable.
If you are going to write a query, you can join to your inline table-valued function like this:
SELECT
a.*,b.*
FROM AAAA a
INNER JOIN MyNS.GetUnshippedOrders() b ON a.z=b.z
It will incur little overhead and run fine.
If you try to use the multi-statement table-valued function in a similar query, you will have performance issues:
SELECT
x.a,x.b,x.c,(SELECT OrderQty FROM MyNS.GetLastShipped(x.CustomerID)) AS Qty
FROM xxxx x
because the function is executed once for each row returned. As the result set gets larger, it will run slower and slower.

In which sequence are queries and sub-queries executed by the SQL engine?

Hello, I took a SQL test and am curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> prime query
3. the whole query is interpreted at one time
4. There is no fixed sequence of interpretation, the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable compared with the others).
Now the curiosity:
Where can I read about this and, briefly, what is the mechanism behind all of that?
Thank you.
I think answer 4 is correct. There are a few considerations:
Type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
Here the subquery is correlated: there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly".
Even with identical servers, schema, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
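To see what the optimizer actually decided for a given query, you can ask the engine for its plan instead of running the query. The sketch below uses SQL Server's SHOWPLAN_TEXT setting with the t1/t2 tables from the earlier example (assumed to exist); other engines expose the same idea through some form of EXPLAIN.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GO separators.
SET SHOWPLAN_TEXT ON;
GO
-- The query is compiled but not executed; the estimated plan is returned instead.
SELECT *
FROM t1
WHERE id IN (SELECT id FROM t2);
GO
SET SHOWPLAN_TEXT OFF;
GO
```

The operator tree that comes back shows which table is read first and how the sub-query was folded into the plan, which may change as statistics and row counts change.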
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but I think the second answer is more plausible.
The primary query usually can't be calculated without the sub-query results.

Can queries that read table variables generate parallel execution plans in SQL Server 2008?

First, from BOL:
Queries that modify table variables do not generate parallel query execution plans. Performance can be affected when very large table variables, or table variables in complex queries, are modified. In these situations, consider using temporary tables instead. For more information, see CREATE TABLE (Transact-SQL). Queries that read table variables without modifying them can still be parallelized.
That seems clear enough. Queries that read table variables, without modifying them, can still be parallelized.
But then over at SQL Server Storage Engine, an otherwise reputable source, Sunil Agarwal said this in an article on tempdb from March 30, 2008:
Queries involving table variables don't generate parallel plans.
Was Sunil paraphrasing BOL re: INSERT, or does the presence of table variables in the FROM clause prevent parallelism? If so, why?
I am thinking specifically of the control table use case, where you have a small control table being joined to a larger table, to map values, act as a filter, or both.
Thanks!
OK, I have a parallel select but not on the table variable
I've anonymised it and:
BigParallelTable is 900k rows and wide
For legacy reasons, BigParallelTable is partially denormalised (I'll fix it, later, promise)
BigParallelTable often generates parallel plans because it's not ideal and is "expensive"
SQL Server 2005 x64, SP3, build 4035, 16 cores
Query + plan:
DECLARE @FilterList TABLE (bar varchar(100) NOT NULL)
INSERT @FilterList (bar)
SELECT 'val1' UNION ALL SELECT 'val2' UNION ALL SELECT 'val3'
--snipped
SELECT
*
FROM
dbo.BigParallelTable BPT
JOIN
@FilterList FL ON BPT.Thing = FL.Bar
StmtText
  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([FL].[bar])=([BPT].[Thing]), RESIDUAL:(@FilterList.[bar] as [FL].[bar]=[MyDB].[dbo].[BigParallelTable].[Thing] as [BPT].[Thing]))
            |--Parallelism(Distribute Streams, Broadcast Partitioning)
            |    |--Table Scan(OBJECT:(@FilterList AS [FL]))
            |--Clustered Index Scan(OBJECT:([MyDB].[dbo].[BigParallelTable].[PK_BigParallelTable] AS [BPT]))
Now, thinking about it, a table variable is almost always a table scan, has no stats and is assumed one row "Estimated number of rows = 1", "Actual.. = 3".
Can we declare that table variables are not used in parallel, but the containing plan can use parallelism elsewhere? If so, BOL is correct and the SQL Server Storage Engine article is wrong.
Simple Example showing a parallel operator on a table variable itself.
DECLARE @T TABLE
(
X INT
)
INSERT INTO @T
SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master..spt_values v1,master..spt_values v2;
WITH E8(N)
AS (SELECT 1
FROM @T a,
@T b),
Nums(N)
AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM E8)
SELECT COUNT(N)
FROM Nums
OPTION (RECOMPILE)
[Answering my own question here, so I can present the relevant quotes appropriately....]
Boris B, from a thread at the MSDN SQL Server forums:
Read-only queries that use table variables can still be parallelized. Queries that involve table variables that are modified run serially. We will correct the statement in Books Online. (emp. added)
and:
Note that there are two flavors of parallelism support:
A. The operator can/can not be in a parallel thread
B. The query can/can not be run in parallel because this operator exists in the tree.
B is a superset of A.
As best I can tell, table variables are not B and may be A.
Another relevant quote, re: non-inlined T-SQL TVFs:
Non-inlined T-SQL TVFs...is considered for parallelism if the TVF inputs are run-time constants, e.g. variables and parameters. If the input is a column (from a cross apply) then parallelism is disabled for the whole statement.
My understanding is that parallelism is blocked on table variables for UPDATE/DELETE/INSERT operations, but not for SELECTs. Proving that would be a lot more difficult than just hypothesizing, of course. :-)