I have a query which uses a complicated set of CASE statements, some nested, some with a COALESCE of CASE statements, and it is a pain to manage.
The same basic logic applies to around 12 columns in the SELECT, and could be modularised and placed in a function - with benefits including consistency and ease of prototyping.
Will putting the logic into a function massively impede performance? Are functions inherently slower?
The SELECT pulls from 5 tables, each of around 100,000 rows, so performance is of some importance.
SQL Server 2005
Normally, selecting the result of a scalar function won't hurt much, but filtering by one can easily add hundreds of seconds (though not always).
If you need to filter by a scalar function result (WHERE col = dbo.scalar_function()), it often helps to make an inline table-valued function instead. It would return its value as the only row of its result table. You would then do an inner join against the function result, effectively filtering by the returned value. This works because SQL Server is always able to unwind inline table-valued functions and inline them into the calling query.
Note this trick won't work if the function is a multi-statement one. Those cannot be unwound.
I'm trying to optimize a query in a database. That query is similar to the following:
select * from Account
inner join dbo.udf_Account('user') udfAccount
on Account.Id = udfAccount.AccountId
Actually the real query is much longer, but the important point is that it contains a few inner joins on user-defined functions (UDFs) which depend on the user id. (So this is a constant parameter which does not change during query evaluation.)
Due to a large amount of data, my query takes approximately 20 seconds on a production database, which is not acceptable.
I have already seen that storing the results of the functions in temporary tables and using those tables in the query greatly reduces its duration.
I'm asking the following questions:
Can I avoid the temporary tables? Is there a way to tell SQL Server that the function needs to be evaluated only once? Using temporary tables would imply significant changes to my code, which is why I would be happy to have another solution.
Are there any other ways to optimize my query ?
In SQL Server, if your functions are Inline rather than Multi-Statement, SQL Server expands them (macro-like) into your queries. It's as if they become sub-queries in your main query.
This notionally allows the optimiser to make a 'better' execution plan.
For example, provided that the fields you are joining on are directly derived from their source tables, this should make indexes on those fields available.
Without looking at the whole query and your individual functions, it appears that you're already in a good place with regards to your syntax. The next place to look is at the indexes that exist, and aim for index-seeks rather than table-scans or index-scans.
(That's all a bit simplistic, but it's a good start for query optimisation, which is an immense topic.)
Another option is to look at using CROSS APPLY with your inline table valued functions.
(Available in SQL Server 2005 onwards)
This allows the values from tables in your queries to be used as parameters to your functions. Again, provided that the functions are inline, SQL Server expands the function inline when building the execution plan.
An example could be...
SELECT
    Account.AccountID,
    SubAccount.AccountID AS SubAccountID,
    Balance.currentAvailable AS SubAccountBalance
FROM
    Account
CROSS APPLY
    dbo.getSubAccounts('User', Account.AccountID) AS SubAccount
CROSS APPLY
    dbo.getCurrentBalance(SubAccount.AccountID) AS Balance
WHERE
    Account.AccountID = 1234
I believe you want to define what MySQL calls a "deterministic" function. Depending on your flavor of SQL this will have different syntax. But ultimately the biggest optimisation would be to not use a function at all, but simply add an account column to the user table.
I have a sql query that I will be reusing in multiple stored procedures. The query works against multiple tables and returns an integer value based on 2 variables passed to it.
Rather than repeating the query in different stored procedures I want to share it and have 2 options:
create a view which I can join to based on the variables and get the integer value from it.
create a function, again with the criteria passed to it, returning the integer value
I am leaning towards option 1 but would like opinions on which is better and more common practice. Which would perform better, joining to a view or calling a function?
EDIT: The RDBMS is SQL Server
If you will always be using the same parametrised predicate to filter the results, then I'd go for a parametrised inline table-valued function. In theory this is treated the same as a view, in that both get expanded out by the optimiser; in practice it can avoid predicate-pushing issues. An example of such a case can be seen in the second part of this article.
As Andomar points out in the comments, most of the time the query optimiser does do a good job of pushing the predicate down to where it is needed, but I'm not aware of any circumstance in which the inline TVF will perform worse, so it seems a rational default choice between the two (very similar) constructs.
The one advantage I can see for the view is that it would allow you to select without a filter or with different filters, so it is more versatile.
Inline TVFs can also be used to replace scalar UDFs for efficiency gains as in this example.
You cannot pass variables into a view, so your only option it seems is to use a function. There are two options for this:
a SCALAR function
a TABLE-VALUED function (inline or multi-statement)
If you were returning records, you could apply a WHERE clause from outside a not-too-complex VIEW, which can get inlined into the query within the view; but since all you are returning is a single integer value, a view won't work.
An inline TVF can be expanded by the query optimizer to work together with the outer (calling) query, so it can be faster in most cases when compared to a SCALAR function.
However, the usages are different - a SCALAR function returns a single value immediately
select dbo.scalarme(col1, col2), other from ..
whereas an inline TVF requires you to either subquery it or CROSS APPLY it against another table
select (select value from dbo.tvf(col1, col2)), other from ..
-- or
select f.value, t.other
from tbl t
cross apply dbo.tvf(col1, col2) f -- or outer apply
I'm going to give you a half-answer, because I cannot be sure what is better in terms of performance, I'm sorry. But other people surely have good advice on that score.
I will stick to your 'common practice' part of question.
So, a scalar function would seem to me a natural solution in this case. Why, you only want a value, an integer value, to be returned - this is what scalar functions are for, isn't it?
But then, if I could see a probability that later I would need more than one value, I might consider switching to a TVF. Then again, what if you have already implemented your scalar function and used it in many places of your application, and now you need a row, a column, or a table of values returned using basically the same logic?
In my view (no pun intended), a view could become something like the greatest common divisor for both scalar and table-valued functions. The functions would only need to apply the parameters.
Now you have said that you are only planning to choose which option to use. Yet, considering the above, I still think that views can be a good choice and prove useful when scaling your application, and you could actually use both views and functions (if only that didn't upset the performance too badly) just as I have described.
One advantage a TVF has over a view is that you can force whoever calls it to target a specific index.
I am wondering about the performance difference between stored procedures and scalar-valued functions. For now I mostly use scalar-valued functions because I can use them inside other queries (and 99% of the time they return one value anyway).
But there are scalar-valued functions that I never use within other queries; usually I call them with a simple SELECT dbo.somefunction(parameter) and that's it.
Would it be better from performance point of view to migrate them to stored procedure?
Scalar-valued functions are executed separately every time they are called. This is because their bodies cannot be folded into the cached plan that the query optimizer/processor builds for the calling SQL.
For cases where you only call the UDF once (as in SELECT dbo.UDFName(params)) it really doesn't matter much, but if you use a scalar-valued UDF in a query that examines a million rows, the UDF will be executed one million times. This is definitely a performance hit. There is a technique whereby (if the algorithm can be written in a set-based way) you can convert scalar UDFs into inline table-valued UDFs that return one-row/one-column tables... and these can be expanded into the calling query's plan along with the rest of the SQL, so they are not subject to this performance hit...
There are no absolutes here, the devil is in the details. The best approach is to test for a given situation by examining the query plan and statistics.
Also, whether the function is CLR or not can make a difference of several orders of magnitude.
I have an expensive scalar UDF that I need to include in a select statement and use that value to narrow the results in the where clause. The UDF takes parameters from the current row so I can't just store it in a var and select from that.
Running the UDF twice per row just feels wrong:
SELECT someField,
       someOtherField,
       dbo.MyExpensiveScalarUDF(someField, someOtherField)
FROM someTable
WHERE dbo.MyExpensiveScalarUDF(someField, someOtherField) IN (aHandfulOfValues)
How do I clean this up so that the function only runs once per row?
Just because you happen to mention the function twice doesn't mean it will be computed twice per row. With luck, the query optimizer will compute it only once per row. Whether it does or not may depend in part on whether the UDF appears to be deterministic or nondeterministic.
Take a look at the estimated execution plan. Maybe you'll find out you're worrying about nothing.
If it's computed twice, you could try this and see if it changes the plan, but it's still no guarantee:
WITH T (someField, someOtherField, expensiveResult) AS (
    SELECT someField, someOtherField, dbo.MyExpensiveScalarUDF(someField, someOtherField)
    FROM someTable
)
SELECT *
FROM T
WHERE expensiveResult IN (thisVal, thatVal, theotherVal);
Steve is correct - the query plan will probably not re-evaluate identical expressions if the UDF is deterministic.
However, repeating yourself is a potential maintenance problem:
WITH temp AS (
    SELECT someField,
           someOtherField,
           dbo.MyExpensiveScalarUDF(someField, someOtherField) AS scalar
    FROM someTable
)
SELECT *
FROM temp
WHERE scalar IN (aHandfulOfValues)
You can avoid it with a CTE or nested query.
Scalar UDFs are best avoided if at all possible for rowsets of any significant size (say, half a million evaluations). If you expand the logic inline here (and with a CTE you won't have to repeat yourself), you'll probably find a huge performance boost. Scalar UDFs should be a last resort. In my experience you're far better off using a persisted computed column, inlining the logic, or just about any other technique before relying on a scalar UDF.
I'd need a lot more detail before I could address this specific question, but two general ideas hit me right off:
(1) Can you make it a table-based function, join it in the FROM clause, and work from there?
(2) Look into the OUTER APPLY and CROSS APPLY join clauses. Essentially, they allow joins on table-based functions, where the parameters passed to the function are based on the row being joined (as opposed to a single call). Good examples of this are in BOL.
My system does some pretty heavy processing, and I've been attacking the performance in order to give me the ability to run more test runs in shorter times.
I have quite a few cases where a UDF has to get called on, say, 5 million rows (and I pretty much thought there was no way around it).
Well, it turns out, there is a way to work around it and it gives huge performance improvements when UDFs are called over a set of distinct parameters somewhat smaller than the total set of rows.
Consider a UDF that takes a set of inputs and returns a result based on complex logic. For a set of inputs over 5m rows there may be only 100,000 distinct inputs, say, so it will only produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete). The fundamental point of this technique is that you can determine whether the trick will work simply by running the SELECT DISTINCT.
I found that by doing something like this:
INSERT INTO PreCalcs
SELECT param1
    ,param2
    ,dbo.udf_result(param1, param2) AS result
FROM (
    SELECT DISTINCT param1, param2 FROM big_table
) AS distinct_params -- a derived table alias is required here
When PreCalcs is suitably indexed, the combination of that with:
SELECT big_table.param1
,big_table.param2
,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
ON PreCalcs.param1 = big_table.param1
AND PreCalcs.param2 = big_table.param2
You get a HUGE boost in performance. Apparently, just because something is deterministic, it doesn't mean SQL Server is caching the past calls and re-using them, as one might think.
The only thing you have to watch out for is where NULLs are allowed; then you need to fix up your joins carefully:
SELECT big_table.param1
    ,big_table.param2
    ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON (
        -- NULL-safe match: COALESCE(a, b) IS NULL only when both sides are NULL
        PreCalcs.param1 = big_table.param1
        OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL
    )
    AND (
        PreCalcs.param2 = big_table.param2
        OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL
    )
Hope this helps and any similar tricks with UDFs, or refactoring queries for performance are welcome.
I guess the question is, why is manual caching like this necessary - isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?
Yes, the optimizer will not memoize UDF calls for you. Your trick is very nice in the cases where you can collapse the output set down in this way.
Another technique that can improve performance, if your UDF's parameters are indices into other tables and the UDF selects values from those tables to calculate the scalar result, is to rewrite the scalar UDF as a table-valued UDF that selects the result values over all your potential parameters.
I've used this approach when the tables the UDF query was based on were subject to a lot of inserts and updates, the query involved was relatively complex, and the number of rows the original UDF had to be applied to was large. You can achieve a great improvement in performance in this case, as the table-valued UDF only needs to be run once and can run as an optimized set-oriented query.
How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?
By using the PreCalcs table, you are simply running the udf over 100k rows rather than 5 million rows, before expanding back out again.
No optimiser in existence would be able to divine this useful information.
The scalar udf is a black box.
For a more practical solution, I'd use a computed, persisted column that does the UDF call.
That way it's available in all queries and can be indexed/included.
This suits OLTP more, maybe... I query a table for trading cash and positions in real time in many different ways, so this approach suits me: it avoids the UDF math overhead every time.