SQL Efficiency with Function

So I've got this database that helps organize information for academic conferences, but sometimes we need to know whether an item is "incomplete". The rules behind what can make something incomplete are a bit complex, so I built them into a scalar function that just returns 1 if the item is complete and 0 otherwise.
The problem I'm running into is that when I call the function on a big table of data, it'll take about 1 minute to return the results. This is causing time-outs on the web site.
I don't think there's much I can do about the function itself. But I was wondering if anybody knows any techniques generally for these kinds of situations? What do you do when you have a big function like that that just has to be run sometimes on everything? Could I actually store the results of the function and then have it refreshed every now and then? Is there a good and efficient way to have it stored, but refresh it if the record is updated? I thought I could do that as a trigger or something, but if somebody ever runs a big update, it'll take forever.
Thanks,
Mike

If the function is deterministic, you could add it as a computed column and then index on it, which might improve your performance.
MSDN documentation.
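A minimal sketch of that idea, with assumed table, column, and function names (to persist and index the column, the UDF must be deterministic, precise, and created WITH SCHEMABINDING):

    -- store the function result once per row instead of evaluating it per query
    ALTER TABLE dbo.ConferenceItems
        ADD IsComplete AS dbo.fn_IsComplete(Title, PageCount) PERSISTED;

    -- let queries seek rather than scan
    CREATE INDEX IX_ConferenceItems_IsComplete
        ON dbo.ConferenceItems (IsComplete);

Queries can then filter on WHERE IsComplete = 0 and use the index instead of running the function against every row.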

The problem is that the function looks at an individual record and has logic such as "if this column is null" or "if that column is greater than 0". This logic is basically a black box to the query optimizer. There might be indexes on those fields it could use, but it has no way to know about them. It has to run this logic on every available record, rather than using the criteria to pare down the result set up front. In database parlance, we would say that the UDF is not sargable.
So what you want is some way to build your logic for incomplete conferences into a structure that the query optimizer can take better advantage of: matching conditions to indexes and so forth. Off the top of my head, your options include a view or a computed column.

Scalar UDFs in SQL Server perform very poorly at the moment; I only use them as a carefully planned last resort. There are probably ways to solve your problem using other techniques (even deeply nested views or an inline TVF that builds up all the rules and is re-joined), but it's hard to tell without seeing the requirements.
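For illustration, a minimal inline TVF sketch under assumed table, column, and rule names; the rules live in one place but stay visible to the optimizer:

    CREATE FUNCTION dbo.fn_IncompleteItems()
    RETURNS TABLE
    AS
    RETURN
        SELECT i.ItemID
        FROM dbo.ConferenceItems AS i
        WHERE i.Title IS NULL         -- rule 1: missing title
           OR i.PageCount <= 0;       -- rule 2: no pages recorded
    GO
    -- re-join it to the base table as needed:
    SELECT c.*
    FROM dbo.ConferenceItems AS c
    JOIN dbo.fn_IncompleteItems() AS x ON x.ItemID = c.ItemID;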

If your function is that inefficient, you'll have to deal with either out-of-date data or slow results.
It sounds like you care more about performance, so as @cmsjr said, add the data to the table.
Also, create a cron job to refresh the results periodically. Perhaps add an updated column to your table; then the cron job only has to re-process those rows.
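A sketch of that incremental refresh, assuming hypothetical IsComplete, UpdatedAt, and CheckedAt columns on the table:

    -- run from a scheduled job; touches only rows changed since the last pass
    UPDATE dbo.ConferenceItems
    SET IsComplete = dbo.fn_IsComplete(Title, PageCount),
        CheckedAt  = GETDATE()
    WHERE CheckedAt IS NULL            -- never checked
       OR UpdatedAt > CheckedAt;       -- changed since last check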
One more thing: how complex is the function? Could you reduce its run time by pulling it out of SQL, perhaps implementing it a layer above the database?

I've encountered cases, in SQL Server 2000 at least, where a function performs terribly and just breaking that logic out and putting it into the query speeds things up tremendously. This is an edge case, but if you think the function is fine, you could try that. Otherwise I'd look at computing the column and storing it, as others are suggesting.

Don't be so sure that you can't tune your function.
Typically, with a 'completeness' check, your worst time is when the record's actually complete. For everything else, you can abort early, so either test the cases that are fastest to compute first, or those that are most likely to cause the record to be flagged incomplete.
For bulk updates, you either have to just sit and wait, or come up with a system where you can run a less complete but faster check first, and then a more thorough check in the background.
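A sketch of the early-abort shape, with assumed rule and column names; the cheapest tests run first so most incomplete records return without evaluating the expensive rules:

    CREATE FUNCTION dbo.fn_IsComplete (@title nvarchar(200), @pageCount int)
    RETURNS bit
    AS
    BEGIN
        IF @title IS NULL RETURN 0;                        -- cheapest test first
        IF @pageCount IS NULL OR @pageCount <= 0 RETURN 0; -- next cheapest
        -- only records that pass the cheap tests reach the expensive rules
        RETURN 1;
    END;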

As Cade Roux says, scalar functions are evil: they are interpreted for each row and as a result are a big problem where performance is concerned. If possible, use a table-valued function or a computed column.

Related

Functions in where clause - BQ

I have always avoided using functions in WHERE or ON clauses.
However, I am now working with BigQuery, and I wonder whether the same advice applies as in the "old" data warehouses, given that it doesn't have indexes.
I still avoid it, but I am doing code reviews where I see this, and I don't really know if using a function in a WHERE clause will actually affect anything, or whether it just bothers me.
Does anyone know?
Thanks :)
BigQuery is a massively parallel database. Basically, the generic equi-join algorithm partitions on the key values and sends the rows from both tables with the same key to the same node. Whether the key is a column or a function result adds little overhead beyond the actual function call.
This actually works pretty well even if you are using functions. For instance, I have found that sometimes I need to compare strings and integers -- say, when using an id value embedded in a string. This requires conversion and that has reasonable performance.
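For instance, a hypothetical join on an id embedded in a string (dataset, table, and column names are assumptions):

    SELECT o.order_id, u.name
    FROM mydataset.orders AS o
    JOIN mydataset.users AS u
      -- extract the numeric id from the string and convert it for the join
      ON CAST(REGEXP_EXTRACT(o.user_ref, r'[0-9]+') AS INT64) = u.user_id;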
So, I would say that what you learned about function calls defeating indexes is true, but there are no indexes in BQ, so that isn't a concern. Of course, expensive function calls can be an issue, and function calls can impede partition pruning. So they can have an effect.

SQL alternative to a View

I don't really know how to even ask for what I need, so I'll try to explain my situation.
I have a rather simple SQL query that joins various tables, but I need to execute this query with slightly different conditions over and over again.
The execution time of the query is somewhere around 0.25 seconds, but all the queries I need to execute easily take 15 seconds in total. This is way too long.
What I need is a table or view that holds the query results for me, so that I only need to select from this one table instead of joining large tables over and over again.
A view wouldn't really help because, as far as I know, it would just execute the same query over and over again.
Is there a way to have something like a view which holds its data as long as its source tables don't change, and only re-executes the query if it is really necessary?
I think what you describe fits materialized view usage with fast refresh on commit very well. However, your query needs to be eligible for fast refresh.
Another option is the result_cache, which is automatically invalidated when one of the source tables changes. I would try both to decide which one suits this particular task better.
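A minimal Oracle sketch of both options with assumed table names; fast refresh of a join materialized view also requires materialized view logs and the base-table rowids in the select list:

    -- prerequisites, one per base table:
    CREATE MATERIALIZED VIEW LOG ON orders WITH ROWID;
    CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;

    CREATE MATERIALIZED VIEW mv_report
    REFRESH FAST ON COMMIT
    AS
    SELECT o.rowid AS o_rid, c.rowid AS c_rid,   -- rowids needed for fast refresh
           o.order_id, c.customer_name, o.total
    FROM   orders o, customers c
    WHERE  c.customer_id = o.customer_id;

    -- result_cache alternative: Oracle caches the result set and invalidates
    -- it automatically when orders or customers change
    SELECT /*+ RESULT_CACHE */ o.order_id, c.customer_name, o.total
    FROM   orders o, customers c
    WHERE  c.customer_id = o.customer_id;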
I would suggest table-valued functions for this purpose. Defining such a function requires coding in PL/SQL, but it is not that hard if the function is based on a single query.
You can think of such functions as a way of parameterizing views.
Here is a good place to start learning about them.
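A sketch of that idea in Oracle with assumed names, using a pipelined table function as a parameterized view:

    CREATE TYPE t_order_row AS OBJECT (order_id NUMBER, total NUMBER);
    /
    CREATE TYPE t_order_tab AS TABLE OF t_order_row;
    /
    CREATE OR REPLACE FUNCTION orders_for (p_customer IN NUMBER)
    RETURN t_order_tab PIPELINED
    IS
    BEGIN
      FOR r IN (SELECT order_id, total
                FROM   orders
                WHERE  customer_id = p_customer)
      LOOP
        PIPE ROW (t_order_row(r.order_id, r.total));
      END LOOP;
      RETURN;
    END;
    /
    -- the parameter plays the role of the view's changing condition:
    SELECT * FROM TABLE(orders_for(42));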

Keeping dynamic out of SQL while using specifications with stored procedures

A specification essentially is a text string representing a "where" clause created by an end user.
I have stored procedures that copy a set of related tables and records to other places. The operation is always the same, but dependent on some crazy user requirements like "products that are frozen and blue and on sale on Tuesday".
What if we fed the user specification (or string parameter) to a scalar function that returns true/false and executes the specification as dynamic SQL, or just exec (@variable)?
It could tell us whether those records exist. We could add the result of the function to the where clause of our copy-products query.
It would keep us from recompiling the copy script each time our where clauses changed. Plus, it would isolate the product selection into a single function.
Anyone ever do anything like this or have examples? What bad things could come of it?
EDIT:
This is the specification I simply added to the end of each insert/select statement:
and exists (
    select null as nothing
    from SameTableAsOutsideTable inside
    where inside.ID = outside.id        -- join operations to outside table
      and inside.page in (6, 7)         -- criteria 1
      and inside.dept in (7, 6, 2, 4)   -- criteria 2
)
It would be great to feed a parameter into a function that produces records based on the user criteria, so all that above could be something like:
and dbo.UserCriteria(@page = '6,7', @dept = '7,6,2,4')
Dynamic Search Conditions in T-SQL
When optimizing SQL, the important thing is optimizing the access path to the data (i.e. index usage). This trumps code reuse, maintainability, nice formatting, and just about every other development perk you can think of, because a bad access path will cause the query to perform hundreds of times slower than it should. The linked article sums up all the options you have very well, and your envisioned function is nowhere on the radar. Your options will gravitate around dynamic SQL or very complicated static queries. I'm afraid there is no free lunch on this topic.
It doesn't sound like a very good idea to me. Even supposing that you had proper defensive coding to avoid SQL injection attacks, it's not really going to buy you anything: the code still needs to be compiled each time.
Also, it's pretty much always a bad idea to let users create free-form WHERE clauses. Users are pretty good at finding new and innovative ways to bring a server to a grinding halt.
If you or your users or someone else in the business can't come up with some concrete search requirements then it's likely that someone isn't thinking about it hard enough and doesn't really know what they want. You can have pretty versatile search capabilities without letting the users completely loose on the system. Alternatively, look at some of the BI tools out there and consider creating a data mart where they can do these kinds of ad hoc searches.
How about this:
You create another stored procedure (instead of a function) and pass the right condition to it.
Based on that condition, it dumps the matching record ids to a temp table.
Your copy procedure then reads the ids from that table and does the rest.
Or you could create a user function that returns a table which is nothing but the ids of the records that match your (dynamic) criteria.
If I am totally off, please correct me.
Hope this helps.
If you are forced to use dynamic queries and you don't have solid, predefined search requirements, it is strongly recommended to use sp_executesql instead of EXEC. It provides parameterized queries to prevent SQL injection attacks (to some extent), and it makes use of cached execution plans to speed up performance. (More info)
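A minimal sketch of the difference, with assumed table and parameter names; the plan is cached and the inputs can never inject SQL:

    DECLARE @sql nvarchar(max);
    SET @sql = N'SELECT ProductID
                 FROM   dbo.Products
                 WHERE  Page = @page AND Dept = @dept';

    EXEC sp_executesql @sql,
         N'@page int, @dept int',   -- parameter declarations
         @page = 6, @dept = 7;      -- actual values, never concatenated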

using distinct command

Is using the DISTINCT command in SQL good practice or not? Are there any drawbacks to the DISTINCT command?
It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.
If you need to use it regularly to get correct output, then you have a design or JOIN issue.
It's perfectly valid for use otherwise.
It is a kind of aggregate, though: the equivalent of a GROUP BY on all output columns. So it is an extra step in query processing.
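To illustrate the equivalence (dbo.Orders and its columns are assumed), these two statements produce the same deduplicated result:

    SELECT DISTINCT CustomerID, Status FROM dbo.Orders;

    SELECT CustomerID, Status FROM dbo.Orders GROUP BY CustomerID, Status;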
From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes beginners who get duplicates in their result set reach for DISTINCT to hide them, but this has its own disadvantages.
DISTINCT decreases the query's performance, because the usual procedure is to sort the results and then remove each row that is equal to the row immediately before it.
DISTINCT compares all fields of the record, so it increases computation.
It is part of the language, so it should be used where appropriate.
In some circumstances, using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.
If you want the work of making the results distinct to happen inside SQL Server, on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load), then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.

Is it Good / Bad / Okay to use IF/While Conditions in Stored Procedures?

My primary concern is SQL Server 2005. I went through many websites and each says something different.
What are the scenarios that are good or okay to use them in? For example, does it hurt to even set variable values inside an IF, or only if I run a query? Supposing my SP builds dynamic SQL based on several conditions in the input parameters, do I need to rethink the query? What about an SP that runs a different query based on whether some record exists in a table? And so on. My question is not limited to these scenarios; I'm looking for a slightly more generalised answer so that I can improve my future SPs.
In essence: which statements are good to use in branching conditions and loops, which are bad, and which are okay?
Generally... Avoid procedural code in your database, and stick to queries. That gives the Query Optimizer the chance to do its job much better.
The exceptions would be code that is designed to do many things rather than produce a result set, and cases where a query would need to join rows exponentially to get a result.
It is very hard to answer this question if you don't provide any code. No language construct is Good/Bad/Okay by itself, its what you want to achieve and how well that can be expressed with those constructs.
There's no definitive answer as it really depends on the situation.
In general, I think it's best to keep the logic within a sproc as simple and set-based as possible. Making it too complicated, with multiple nested IF conditions for example, may complicate things for the query optimiser, meaning it can't create a good execution plan suitable for all paths through the sproc. For example, the first time the sproc runs, it takes path A through the logic and the execution plan reflects this. The next time it runs with different parameters, it takes path B but reuses the original execution plan, which is not optimal for this second path. One solution is to break the load into separate stored procedures called depending on the path being followed; this allows each sub-sproc to be optimised and its execution plan cached independently.
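A sketch of that split, with assumed procedure names; each path calls its own procedure, so each gets its own cached plan:

    CREATE PROCEDURE dbo.SearchOrders
        @customerId int = NULL,
        @status     int = NULL
    AS
    BEGIN
        IF @customerId IS NOT NULL
            EXEC dbo.SearchOrders_ByCustomer @customerId;  -- own cached plan
        ELSE
            EXEC dbo.SearchOrders_ByStatus @status;        -- own cached plan
    END;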
Loops can sometimes be the only viable option, but in general I'd try not to use them; always try to do things in a set-based fashion if possible.