I have always avoided using functions in the WHERE or ON clauses.
However, I am now working with BigQuery, and since it doesn't have indexes I wonder whether the same advice applies there as in the "old" data warehouses.
I still avoid it, but I am doing code reviews where I see this, and I don't really know whether a function in a WHERE clause affects anything beyond my own habits.
Does anyone know about it?
Thanks :)
BigQuery is a massively parallel database. Basically, the generic equi-join algorithm partitions on the key values and sends the rows from both tables that share a key to the same node. Whether the key is a column or a function result adds little overhead beyond the actual function call.
This actually works pretty well even if you are using functions. For instance, I have found that sometimes I need to compare strings and integers -- say, when using an id value embedded in a string. This requires conversion and that has reasonable performance.
So, I would say that what you learned about function calls affecting indexes is true. But there are no indexes in BQ so that isn't a concern. Of course, expensive function calls can be an issue. And function calls can impede partition pruning. So they can have an effect.
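To make the partitioning idea concrete, here is a toy sketch in Python. It is only an illustration of a hash-partitioned equi-join, not BigQuery's actual implementation; the function name partitioned_join, the sample rows and the "id-7" string format are all made up for the example. The point is that the join key can be a plain column on one side and a computed value on the other, and the routing works the same way.

```python
from collections import defaultdict

def partitioned_join(left, right, left_key, right_key, num_nodes=4):
    """Toy equi-join: route rows to a 'node' by hash of the key, then join locally."""
    # Each node holds a lookup of left rows by key and a list of right rows to probe with.
    nodes = [(defaultdict(list), []) for _ in range(num_nodes)]
    for row in left:
        key = left_key(row)
        nodes[hash(key) % num_nodes][0][key].append(row)
    for row in right:
        key = right_key(row)
        nodes[hash(key) % num_nodes][1].append((key, row))

    result = []
    for build, probe in nodes:
        for key, right_row in probe:
            for left_row in build.get(key, []):
                result.append((left_row, right_row))
    return result

# Hypothetical data: the order side embeds the id in a string, the customer side has an integer.
orders = [{"order_ref": "id-7"}, {"order_ref": "id-9"}, {"order_ref": "id-7"}]
customers = [{"id": 7, "name": "Ada"}, {"id": 8, "name": "Bob"}]

joined = partitioned_join(
    orders,
    customers,
    left_key=lambda o: int(o["order_ref"].split("-")[1]),  # key is a function result
    right_key=lambda c: c["id"],                           # key is a plain column
)
print(len(joined))
```

The only extra cost of the computed key in this model is the conversion call itself, which is made once per row before routing.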
Related
On this question that I asked the other day I got the following comment.
In almost any database, almost any function on a column prevents the use of indexes. There are exceptions here and there, but in general, functions prevent the use of indexes
I googled around and found more mentions of this same behavior, but I had trouble finding something more in depth than what the comment already told me.
Could someone elaborate on why this occurs, and perhaps strategies for avoiding it?
An index in its most basic form is just the sorted column data, making it easy to look up by some value. For example, a textbook can have the pages in some order, but then have an index in the back for all the terms. As you can see, the data is precomputed/sorted and stored in a separate area.
When you apply a function to the column and try to match/filter based on the output, the index is no longer useful. Let's take a look at our book example again, and say that the function we're applying is the reverse of the term (so reverse('integral') becomes 'largetni'). You won't find this value in the index, so you have to take all the terms, put them through the function, and only then compare, all at query time. Originally we could skip straight to i, then in, then int and so on, which made it easy to find the term; with the function in the way that shortcut is gone, so everything becomes much slower.
If you query using this function often, you could make an index with reverse(term) ahead of time to speed up lookups. But without doing so explicitly, it will always be slow.
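A small experiment can show the effect. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in; other engines have their own syntax and planner output. SQLite has no built-in reverse(), so upper() plays the role of the function here, and the table and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT)")
conn.executemany(
    "INSERT INTO terms (term) VALUES (?)",
    [("integral",), ("index",), ("join",)],
)
conn.execute("CREATE INDEX idx_term ON terms (term)")

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the 4th column is the detail text

# 1) Plain column comparison: the index on term can be searched.
plain_plan = plan("SELECT id FROM terms WHERE term = 'integral'")

# 2) Function applied to the column: no index holds upper(term), so every row is visited.
func_plan = plan("SELECT id FROM terms WHERE upper(term) = 'INTEGRAL'")

# 3) Index the function result ahead of time, then run the same query again.
conn.execute("CREATE INDEX idx_term_upper ON terms (upper(term))")
expr_plan = plan("SELECT id FROM terms WHERE upper(term) = 'INTEGRAL'")

for label, text in [("plain", plain_plan), ("function", func_plan), ("expression index", expr_plan)]:
    print(label, "->", text)
```

The exact wording of the plan varies between SQLite versions, but the first and third queries should report a SEARCH and the second a SCAN.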
Indexes are stored separately from the data itself on the SQL server. When you run a query, the B-tree index that would normally provide the speed can no longer be used, because an operation (the function) is applied to every value of the column, so the query optimiser opts not to use the index.
Here is a good explanation of why this occurs (this is a SQL Server specific article, but probably applies to other SQL RDBMS systems):
https://www.mssqltips.com/sqlservertip/1236/avoid-sql-server-functions-in-the-where-clause-for-performance/
The line from the article that really stands out is "The reason for this is is that the function value has to be evaluated for each row of data to determine it matches your criteria."
Let's consider an extreme example. Let's say that you're looking up a row using a cryptographic hash function, like HASH(email_address) = 0x123456. The database has an index built on email_address, but now you're asking it to look up data on HASH(email_address) which it doesn't have. It could still use the index, but it would end up having to look at every single index entry for email_address and see if HASH(email_address) matches. If it's going to have to scan the full index, it may as well just scan the full table instead so it doesn't have to bounce back and forth fetching individual row locations.
Is there any kind of performance gain between 'MOVE TO' vs x = y? I have a really old program I am optimizing and would like to know if it's worth it to pull out all the MOVE TO. Any other general tips on ABAP optimization would be great as well.
No, that is just the same operation expressed in two different ways. Nothing to gain there. If you're out for generic hints, there's a good book available that I'd recommend studying in detail. If you have to optimize a specific program, use the tracing tools (transaction SAT in sufficiently current releases).
The two statements are equivalent:
"
To assign the value of a data object source to a variable destination, use the following statement:
MOVE source TO destination.
or the equivalent statement
destination = source.
"
No, they're the same.
Here's a couple quick hints from my years of performance enhancement:
1) if you use move-corresponding where possible, your code can be a lot more concise, modular, and extendable (in the distant past this was frowned upon but the technical reasons for this are generally not applicable anymore).
2) Use SAT at every opportunity, and be sure to turn on internal table tracking. This is like turning on the lights versus stumbling over furniture in the dark.
3) Make the database layer do as much work as possible for you. Try to combine queries wherever possible, especially when combining result sets. Two queries linked by a join is usually much better than select > itab > select FOR ALL ENTRIES.
4) This is a bit advanced, but FOR ALL ENTRIES often has much slower performance than the equivalent select-options IN phrase. This seems to be because the latter is built as one big query to the database layer while the former requires multiple trips to the database layer. The caveat, of course, is that if you have too many records in your select-options the generated query at the database layer will exceed the allowable size on your system, but large performance gains are possible within that limitation. In general, SAP just loves select-options.
5) Index, index, index!
First of all, MOVE does not really affect performance much.
What has affected performance quite a lot in the projects I have worked on is the following:
Nested loops (very evil). For example, looping through all documents and, for each document, doing a SELECT SINGLE to check whether its company code is allowed to be displayed.
Instead, build a list of company codes, read them all from the DB once, and consult that results table instead.
Use hashed or sorted tables where possible. Where not possible, use a standard table, but sort it by its keys and use BINARY SEARCH.
Select from the DB by all key fields. If that is not possible, consider creating indexes.
For small and simple selects, use joins. For bigger selects joins will still run faster, but the code becomes harder to follow.
Minor thing: use field symbols to read table lines; this makes it much faster.
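The first two points are not specific to ABAP, so here is a rough analogy in Python rather than ABAP. The document list, the company codes and the helper is_allowed are all invented for the illustration: a set stands in for a hashed table, and a sorted list with bisect stands in for a sorted standard table read with BINARY SEARCH.

```python
from bisect import bisect_left

documents = [
    {"doc": 1, "company_code": "1000"},
    {"doc": 2, "company_code": "2000"},
    {"doc": 3, "company_code": "3000"},
]

# Pretend this list came back from ONE database query,
# instead of one SELECT SINGLE per document inside the loop.
allowed_from_db = ["3000", "1000"]

# Hashed-table analogue: constant-time membership test per document.
allowed = set(allowed_from_db)
visible = [d["doc"] for d in documents if d["company_code"] in allowed]

# Sorted-table analogue: sort once, then binary-search for each lookup.
sorted_codes = sorted(allowed_from_db)

def is_allowed(code):
    """Binary search in the sorted list, like READ TABLE ... BINARY SEARCH."""
    i = bisect_left(sorted_codes, code)
    return i < len(sorted_codes) and sorted_codes[i] == code

print(visible, is_allowed("2000"))
```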
1) Be careful with SELECT statements in ABAP.
Unnecessary database connections significantly decrease the performance of an ABAP program.
2) When passing internal tables to functions, pass them by reference to reduce memory usage.
Call By Reference:
Passes a pointer to the memory location. Changes made to the variable within the subroutine affect the variable outside the subroutine.
3) Do not use internal tables with a work area.
4) When using nested loops, sort the tables involved.
They are the same, as is the ADD keyword and + operator.
If you do want to optimize your ABAP, I have found the largest culprits to be:
Not using binary lookups and/or (internal) table keys properly.
The syntax of ABAP is brain-dead when it comes to table use. Know how
to work with tables efficiently. Basically write
better/optimal/elegant high-level code. This is always a winner!
Fewer instructions == less time. The fewer instructions you hit the
faster the program will run. This is important in tight loops... I
know this sounds obvious, but ABAP is so slow, that if you are really
trying to optimize critical programs, this will make a difference.
(We have processes that run days... and shaving off an hour or so
makes a difference!)
Don't mix types. There is a little bit of overhead to some
implicit conversions... for instance if you are initializing a
string data type, then use the correct literal string with
(backtick) quotes: `literal`. This also counts for looking up in
tables using keys... use exact match datatypes.
Function calls... I cannot stress the overhead of function calls
enough... the less you have the better. Goes against anything a real
computer programmer believes, but there you have it... ABAP is a
special case.
Loop using ASSIGNING (or REF TO, which is slightly slower on certain
types); avoid INTO like the plague.
PS: Also keep in mind that SWITCH statements are just glorified IF conditionals... thus move the most common conditions to the top!
You can create CDS views with ADT in Eclipse. Alternatively, views created in SE11 perform well for selecting.
"MOVE a TO b" and "b = a" are exactly the same in ABAP. There is no performance difference; "MOVE" is just a more visible, noticeable version.
But if you are talking about "MOVE-CORRESPONDING", yes, there is a performance difference. It's more practical to code, but it actually runs slower than direct assignment.
Is using the DISTINCT command in SQL good practice or not? Are there any drawbacks to it?
It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.
If you need to use it regularly to get the correct output, then you have a design or JOIN issue.
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent to a GROUP BY on all output columns. So it is an extra step in query processing.
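The equivalence is easy to check on a tiny table. This is a minimal sketch using SQLite via Python's sqlite3 module as a stand-in for whichever engine you use; the orders table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "Oslo"), ("ann", "Oslo"), ("bob", "Rome"), ("ann", "Rome")],
)

# DISTINCT over the output columns...
distinct_rows = sorted(conn.execute("SELECT DISTINCT customer, city FROM orders"))

# ...returns the same rows as grouping by every output column.
grouped_rows = sorted(
    conn.execute("SELECT customer, city FROM orders GROUP BY customer, city")
)

print(distinct_rows == grouped_rows, len(distinct_rows))
```

Either way the engine has to compare whole rows against each other, which is the extra step mentioned above.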
From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes beginners who get duplicates in their result set simply add DISTINCT. But this has its own disadvantages.
DISTINCT decreases the query's performance, because the normal procedure is to sort the results and then remove rows that are equal to the row immediately before them.
DISTINCT compares all fields of the record, so it increases computation.
It is part of the language, so should be used.
In some circumstances, using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.
If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.
From the MSDN docs for create function:
User-defined functions cannot be used to perform actions that modify the database state.
My question is simply - why?
Yes, a UDF that modifies data may have potentially unwanted side-effects.
Yes, there is overhead involved if a UDF is called thousands of times.
But that is the whole point of design and testing - to ensure that such issues are ironed out before deployment. So why do DB vendors insist on imposing these artificial limitations on developers? What is the point of a language construct that can essentially only be used as a wrapper for select statements?
The reason for this question is as follows: I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID. (Yes, this sounds long-winded and possibly crazy, but when you're sending data to another dev company who believes their design was handed down by God and cannot be improved upon, it's easier just to smile and nod and do what they ask).
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
It's usually best to think of the available tools as a spectrum, from Views, through UDFs, out to Stored Procedures. At the one end (Views) you have a lot of restrictions, but this means the optimizer can actually "see through" the code and make intelligent choices. At the other end (Stored Procedures), you've got lots of flexibility, but because you have such freedom, you lose some abilities (e.g. because you can return multiple result sets from a stored proc, you lose the ability to "compose" it as part of a larger query).
UDFs sit in a middle ground - you can do more than you can do in a view (multiple statements, for example), but you don't have as much flexibility as a stored proc. Giving up this freedom is what allows the outputs to be composed as part of a larger query. By not having side effects, you guarantee that, for example, it doesn't matter in which row order the UDF is applied. If you could have side effects, the optimizer might have to give an ordering guarantee.
I understand your issue, I think, but taking this from your comment:
I want to do something like select my_udf(my_variable) from my_table, where my_udf either selects or creates the value it returns
So you want a select that (potentially) modifies data. Can you look at that sentence on its own and tell me that that reads perfectly OK? - I certainly can't.
Reading your description of what you actually need to do:
I am writing a function to return a GUID for a certain unique integer ID. If a GUID is already allocated for that ID I simply return it; otherwise I want to generate a new GUID, store that into a table, and return the newly-generated GUID.
I know that I can use a stored procedure with an output parameter to achieve the same result, but then I have to declare a new variable just to hold the result of the sproc. Not only that, I then have to convert my simple select into a while loop that inserts into a temporary table, and call the sproc for every iteration of that loop.
from that last sentence it sounds like you have to process many rows at once, so how about a single INSERT that inserts the GUIDs for those IDs that don't already have them, followed by a single SELECT that returns all the GUIDs that (now) exist?
Sometimes if you cannot implement the solution you came up with, it may be an indication that your solution is not optimal.
Using a statement like this
INSERT INTO IntGuids(IntValue, GuidValue)
SELECT MyIntValues.IntValue, NEWID()
FROM MyIntValues
LEFT OUTER JOIN IntGuids ON MyIntValues.IntValue = IntGuids.IntValue
WHERE IntGuids.IntValue IS NULL
creates all the GUIDs you need to have in 1 statement. No need to SELECT+INSERT for every single value.
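Here is a runnable sketch of the "one INSERT, then one SELECT" pattern, using SQLite through Python's sqlite3 module as a stand-in for SQL Server. SQLite has no NEWID(), so lower(hex(randomblob(16))) is used as a substitute random identifier; the helper guids_for_all_ids and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyIntValues (IntValue INTEGER PRIMARY KEY);
CREATE TABLE IntGuids (IntValue INTEGER PRIMARY KEY, GuidValue TEXT NOT NULL);
INSERT INTO MyIntValues VALUES (1), (2), (3);
INSERT INTO IntGuids VALUES (2, 'existing-guid');
""")

# Same shape as the statement above; randomblob stands in for NEWID().
FILL_MISSING = """
INSERT INTO IntGuids (IntValue, GuidValue)
SELECT MyIntValues.IntValue, lower(hex(randomblob(16)))
FROM MyIntValues
LEFT OUTER JOIN IntGuids ON MyIntValues.IntValue = IntGuids.IntValue
WHERE IntGuids.IntValue IS NULL
"""

def guids_for_all_ids():
    """Create any missing GUIDs in one statement, then read them all in one query."""
    before = conn.execute("SELECT COUNT(*) FROM IntGuids").fetchone()[0]
    conn.execute(FILL_MISSING)
    after = conn.execute("SELECT COUNT(*) FROM IntGuids").fetchone()[0]
    rows = conn.execute("""
        SELECT m.IntValue, g.GuidValue
        FROM MyIntValues m
        JOIN IntGuids g ON g.IntValue = m.IntValue
        ORDER BY m.IntValue
    """).fetchall()
    return after - before, dict(rows)

created_first, guids_first = guids_for_all_ids()
created_second, guids_second = guids_for_all_ids()
print(created_first, created_second)
```

Running it twice shows that the second pass creates nothing and returns the same GUIDs, which is the behaviour the original function was meant to have.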
So I've got this database that helps organize information for academic conferences, but we need to know sometimes whether an item is "incomplete" - the rules behind what could make something incomplete are a bit complex, so I built them into a scalar function that just returns 1 if the item is complete and 0 otherwise.
The problem I'm running into is that when I call the function on a big table of data, it'll take about 1 minute to return the results. This is causing time-outs on the web site.
I don't think there's much I can do about the function itself. But I was wondering if anybody knows any techniques generally for these kinds of situations? What do you do when you have a big function like that that just has to be run sometimes on everything? Could I actually store the results of the function and then have it refreshed every now and then? Is there a good and efficient way to have it stored, but refresh it if the record is updated? I thought I could do that as a trigger or something, but if somebody ever runs a big update, it'll take forever.
Thanks,
Mike
If the function is deterministic you could add it as a computed column, and then index on it, which might improve your performance.
MSDN documentation.
The problem is that the function looks at an individual record and has logic such as "if this column is null" or "if that column is greater than 0". This logic is basically a black box to the query optimizer. There might be indexes on those fields it could use, but it has no way to know about it. It has to run this logic on every available record, rather than using the criteria in a functional manner to pare down the result set. In database parlance, we would say that the UDF is not sargable.
So what you want is some way to build your logic for incomplete conferences into a structure that the query optimizer can take better advantage of: match conditions to indexes and so forth. Off the top of my head, your options to do this include a view or a computed column.
Scalar UDFs in SQL Server perform very poorly at the moment. I only use them as a carefully planned last resort. There are probably ways to solve your problem using other techniques (even deeply nested views or inline TVF which build up all the rules and are re-joined) but it's hard to tell without seeing the requirements.
If your function is that inefficient, you'll have to deal with either out of date data, or slow results.
It sounds like you care more about performance, so like #cmsjr said, add the data to the table.
Also, create a cron job to refresh the results periodically. Perhaps add an updated column to your database table, and then the cron job only has to re-process those rows.
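A minimal sketch of that idea, assuming a made-up items table and a placeholder completeness rule (slow_is_complete stands in for the real, complex function). It uses SQLite via Python's sqlite3 module only so that it is runnable; the needs_refresh flag plays the role of the "updated" column and refresh_stale_rows is what the cron job would call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    id            INTEGER PRIMARY KEY,
    abstract      TEXT,
    page_count    INTEGER,
    is_complete   INTEGER,                     -- cached result of the slow check
    needs_refresh INTEGER NOT NULL DEFAULT 1   -- set whenever the row changes
);
CREATE INDEX idx_items_complete ON items (is_complete);
""")

def slow_is_complete(abstract, page_count):
    """Placeholder for the real rules; returns 1 if complete, 0 otherwise."""
    return int(abstract is not None and page_count is not None and page_count > 0)

def refresh_stale_rows():
    """Recompute the cached flag only for rows marked as changed."""
    stale = conn.execute(
        "SELECT id, abstract, page_count FROM items WHERE needs_refresh = 1"
    ).fetchall()
    conn.executemany(
        "UPDATE items SET is_complete = ?, needs_refresh = 0 WHERE id = ?",
        [(slow_is_complete(a, p), item_id) for item_id, a, p in stale],
    )
    return len(stale)

def incomplete_ids():
    # Reads the stored flag; the slow function is not involved at query time.
    return [r[0] for r in conn.execute(
        "SELECT id FROM items WHERE is_complete = 0 ORDER BY id")]

conn.executemany(
    "INSERT INTO items (abstract, page_count) VALUES (?, ?)",
    [("Talk A", 12), (None, 3), ("Talk C", 0)],
)
first_pass = refresh_stale_rows()
incomplete_before = incomplete_ids()

# Whoever updates a row also marks it as needing a refresh.
conn.execute("UPDATE items SET abstract = 'Talk B', needs_refresh = 1 WHERE id = 2")
second_pass = refresh_stale_rows()
incomplete_after = incomplete_ids()
print(first_pass, second_pass, incomplete_before, incomplete_after)
```

The trade-off is the one already mentioned: between refreshes the flag can be out of date, but the web page only ever reads an indexed column.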
One more thing, how complex is the function? Could you reduce the run-time of the function by pulling it out of SQL, perhaps writing it a layer above the database layer?
I've encountered cases where in SQL Server 2000 at least a function will perform terribly and just breaking that logic out and putting it into the query speeds things tremendously. This is an edge case but if you think the function is fine then you could try that. Otherwise I'd look to compute the column and store it as others are suggesting.
Don't be so sure that you can't tune your function.
Typically, with a 'completeness' check, your worst time is when the record's actually complete. For everything else, you can abort early, so either test the cases that are fastest to compute first, or those that are most likely to cause the record to be flagged incomplete.
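As a rough illustration of the early-abort idea (in Python rather than T-SQL, with invented check names and an invented item shape), the checks are listed cheapest first and evaluation stops at the first failure. The calls list is only there to show which checks actually ran.

```python
calls = []

def has_title(item):
    calls.append("title")          # cheap check
    return bool(item.get("title"))

def has_valid_schedule(item):
    calls.append("schedule")       # imagine this one is expensive
    return item.get("start") is not None and item.get("end") is not None

# Order matters: cheapest (or most likely to fail) first.
CHECKS = [has_title, has_valid_schedule]

def is_complete(item):
    """all() stops at the first failing check, so later checks are skipped."""
    return all(check(item) for check in CHECKS)

result = is_complete({"title": "", "start": 1, "end": 2})
print(result, calls)
```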
For bulk updates, you either have to just sit and wait, or come up with a system where you can run a less complete but faster check first, and then a more thorough check in the background.
As Cade Roux says, scalar functions are evil: they are interpreted for each row and, as a result, are a big problem where performance is concerned. If possible, use a table-valued function or a computed column.