I am new to the concept of pipelined functions, and I have some questions about them from a database point of view:
What actually is a pipelined function?
What is the advantage of using a pipelined function?
What challenges are solved by using pipelined functions?
Are there any optimization advantages to using pipelined functions?
Thanks.
To quote from "Ask Tom Oracle":
pipelined functions are simply "code you can pretend is a database table"
pipelined functions give you the (amazing to me) ability to
select * from PLSQL_FUNCTION;
anytime you think you can use it -- to select * from a function, instead of a table, it
might be "useful".
As for advantages: a big advantage of using a pipelined function is that your function can return rows one by one, as opposed to building the entire result set in memory before returning it.
The above gives the obvious optimization: memory savings for something that would otherwise return a big result set.
A fairly interesting example of using pipelined functions is here
What seems to be a good use of them is ETL (extract/transform/load) - for example see here
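As a minimal sketch of what "select * from a function" looks like in Oracle (the type and function names here are hypothetical):

```sql
-- A collection type for the rows the function will emit.
CREATE TYPE number_tab AS TABLE OF NUMBER;
/
-- The PIPELINED keyword lets the function stream rows back one at a
-- time via PIPE ROW, instead of materializing the whole collection
-- in memory before returning.
CREATE OR REPLACE FUNCTION gen_numbers(n IN NUMBER)
    RETURN number_tab PIPELINED
AS
BEGIN
    FOR i IN 1 .. n LOOP
        PIPE ROW (i);
    END LOOP;
    RETURN;
END;
/
-- The function can now be queried as if it were a table:
SELECT * FROM TABLE(gen_numbers(5));
```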
We have a table with two top-level columns of type STRUCT: a 'before' image and an 'after' image. The struct schemas are non-trivial: nested, with arrays, to a variable depth. They are sent to us from replication, so the two schemas are always the same (the schemas can of course be updated at some point, but always together).
The objective is, given the two input structs, to return two struct 'diffs' of the before and after, containing only the fields that have changed: essentially the delta of the changes produced by the replication source. We know something has changed, but not 'what', since we get the full before and after image. This raw data lands in BigQuery and is then processed from there, but we need to determine the more granular change for higher-order BigQuery processing.
The table schema is very wide (thousands of leaf fields), and the data is fairly sparse (so a lot of NULLs will be present on both sides of the snapshot), so this needs to be as performant as possible when executing over tens of millions of rows.
All things are nullable for maximum flexibility.
So change could look like:
null -> value
value -> null
valueA -> valueB
Arrays:
recursive use of the above for arrays of structs; ordering could be relaxed if that makes it easier?
It might not be possible.
I've not attempted this yet as it seems really difficult, so I'm looking to the community boffins for some support. I feel the arrays could be the difficult part. There is probably an easy way in Python that I don't know of, or even by doing some JSON conversion and comparison using JSON tools? It feels like it would be a super cool feature built into BigQuery as well, so if I can get this to work, I will add a feature request for it.
I'd like to have a SQL UDF for reuse (we have SQL skills, not Python, although if it's easier in Python then that's OK), and now with the new feature of persistent SQL UDFs, this seems the right time to ask and to test the feature out!
Something along the lines of this signature, but open to suggestions:
def struct_diff(before STRUCT, after STRUCT) RETURNS (beforeChange, afterChange)
It appears to be really difficult to get a reusable piece of code for this. Since there is currently no support for recursive SQL UDFs, you cannot use a recursive approach for the nested structs.
However, you might be able to write specific SQL UDFs depending on your array and struct structures. You can use an approach like this one to compare the structs.
CREATE TEMP FUNCTION final_compare(s1 ANY TYPE, s2 ANY TYPE) AS (
STRUCT(s1 as prev, s2 as cur)
);
CREATE TEMP FUNCTION compare(s1 ANY TYPE, s2 ANY TYPE) AS (
STRUCT(final_compare(s1.structA, s2.structA))
);
You can use UNNEST to work with arrays, and the final SQL UDF would really depend on your data.
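For arrays, one possible sketch (assuming positional comparison is acceptable, and using TO_JSON_STRING as a generic deep-equality test; the column and table names are hypothetical) is to zip the two arrays with UNNEST ... WITH OFFSET and keep only the elements that differ:

```sql
-- Hypothetical columns before_arr / after_arr: arrays of STRUCTs in mytable.
-- Elements are matched by position; TO_JSON_STRING serializes each element,
-- so the same pattern works regardless of the element's struct shape.
SELECT
  ARRAY(
    SELECT STRUCT(b AS prev, a AS cur)
    FROM UNNEST(t.before_arr) AS b WITH OFFSET pos
    JOIN UNNEST(t.after_arr)  AS a WITH OFFSET pos2
      ON pos = pos2
    WHERE TO_JSON_STRING(b) != TO_JSON_STRING(a)
  ) AS changed_elements
FROM mytable AS t;
```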
As @rtenha suggested, Python could be a lot easier for handling this problem.
Finally, I did some tests using a JavaScript UDF, and it was basically the same result, if not worse than the SQL UDF.
The console allows a recursive definition of the function; however, it will fail during execution. Also, JavaScript UDFs don't allow the ANY TYPE data type in the signature, so you would have to define the whole STRUCT type, or use a workaround like applying TO_JSON_STRING to your struct in order to pass it as a string.
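As a non-recursive sketch of the leaf-level comparison (the function name here is hypothetical), TO_JSON_STRING can also serve as the change test at each level, returning NULL when nothing changed:

```sql
-- Returns NULL if the two values are deeply equal, otherwise a
-- (prev, cur) pair. Works for scalars, STRUCTs, and ARRAYs alike,
-- because TO_JSON_STRING serializes the whole value.
CREATE TEMP FUNCTION leaf_diff(before ANY TYPE, after ANY TYPE) AS (
  IF(TO_JSON_STRING(before) = TO_JSON_STRING(after),
     NULL,
     STRUCT(before AS prev, after AS cur))
);

SELECT leaf_diff(t.before_image.structA, t.after_image.structA) AS structA_change
FROM mytable AS t;
```

The drawback is that a non-NULL result still contains the whole subtree, not just the changed leaves, so you would still compose one call per field of interest rather than getting a fully pruned diff.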
Yesterday we had a scenario where we had to get the type of a DB field, and based on that we had to write the description of the field. Like:
SELECT (CASE DB_Type WHEN 'I' THEN 'Intermediate'
                     WHEN 'P' THEN 'Pending'
                     ELSE 'Basic'
        END)
FROM DB_table
I suggested writing a DB function instead of this CASE expression, because that would be more reusable. Like:
Select dbo.GetTypeName(DB_Type)
from DB_table
The interesting part is, one of our developers said that using a database function would be inefficient, as database functions are slower than CASE expressions. I searched the internet to find out which is the better approach in terms of efficiency, but unfortunately I found nothing that could be considered a satisfactory answer. Please enlighten me with your thoughts: which approach is better?
A UDF is always slower than a CASE expression.
Please refer to this article:
http://blogs.msdn.com/b/sqlserverfaq/archive/2009/10/06/performance-benefits-of-using-expression-over-user-defined-functions.aspx
The following article suggests when to use a UDF:
http://www.sql-server-performance.com/2005/sql-server-udfs/
Summary:
There is a large performance penalty paid when a user-defined function is used. This penalty shows up as poor query execution time when a query applies a UDF to a large number of rows, typically 1,000 or more. The penalty is incurred because the SQL Server database engine must create its own internal cursor-like processing: it must invoke the UDF on each row. If the UDF is used in the WHERE clause, this may happen as part of filtering the rows. If the UDF is used in the select list, this happens when creating the results of the query to pass to the next stage of query processing.
It's the row by row processing that slows SQL Server the most.
When using a scalar function (a function that returns one value) the contents of the function will be executed once per row but the case statement will be executed across the entire set.
By operating against the entire set you allow the server to optimise your query more efficiently.
So the theory goes that if the same query is run both ways against a large dataset, the function version should be slower. However, the difference may be trivial against your data, so you should try both methods and test them to determine whether any performance trade-off is worth the increased utility of a function.
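One way to keep the reusability without the scalar-UDF penalty is an inline table-valued function, which SQL Server expands into the calling query's plan. A sketch (the function name follows the question; the RETURNS TABLE form is the change):

```sql
-- Inline table-valued function: the optimizer inlines its body into the
-- calling query, so it is costed like the plain CASE expression.
CREATE FUNCTION dbo.GetTypeName (@DB_Type CHAR(1))
RETURNS TABLE
AS RETURN
(
    SELECT CASE @DB_Type WHEN 'I' THEN 'Intermediate'
                         WHEN 'P' THEN 'Pending'
                         ELSE 'Basic'
           END AS TypeName
);
GO

-- Usage: CROSS APPLY invokes it per row syntactically,
-- but the query stays set-based after inlining.
SELECT t.DB_Type, f.TypeName
FROM DB_table AS t
CROSS APPLY dbo.GetTypeName(t.DB_Type) AS f;
```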
Your developer is right. Functions will slow down your query.
https://sqlserverfast.com/?s=user+defined+ugly
Calling a function is like:
wrap the parts in paper
put the package in a bag
carry it to the mechanic
let him unwrap it, do something, then wrap the result
carry it back
use it
I would like to create a function in CFML taking 3 parameters (a number and two dates). The function would then perform a few cfquery queries on this data, like SELECT, UPDATE and INSERT. Any idea on how to code this function? I'm a CFML newbie, so be nice.
What you're asking for is very basic and reviewing the documentation of cffunction should be enough to get you started: http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7f5c.html
Adobe docs have a section on how to write UDFs (user-defined functions). Probably best to start there:
http://livedocs.adobe.com/coldfusion/8/htmldocs/help.html?content=UDFs_03.html
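As a minimal sketch of what such a function might look like (the function name, argument names, datasource, and table are all hypothetical placeholders):

```cfml
<cffunction name="processRecord" returntype="query" output="false">
    <cfargument name="recordId" type="numeric" required="true">
    <cfargument name="startDate" type="date" required="true">
    <cfargument name="endDate" type="date" required="true">

    <!--- Look up rows for the given id and date range --->
    <cfquery name="local.result" datasource="myDSN">
        SELECT  id, status
        FROM    my_table
        WHERE   id = <cfqueryparam value="#arguments.recordId#" cfsqltype="cf_sql_integer">
        AND     created_at BETWEEN
                <cfqueryparam value="#arguments.startDate#" cfsqltype="cf_sql_date">
                AND <cfqueryparam value="#arguments.endDate#" cfsqltype="cf_sql_date">
    </cfquery>

    <!--- Further UPDATE / INSERT cfquery calls would go here --->
    <cfreturn local.result>
</cffunction>
```

Note the use of cfqueryparam: it parameterizes the queries, which protects against SQL injection and lets the database reuse query plans.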
I am maintaining a function in SQL Server 2005, that based on an integer input parameter needs to call different functions e.g.
IF @rule_id = 1
    -- execute function 1
ELSE IF @rule_id = 2
    -- execute function 2
ELSE IF @rule_id = 3
    -- ... etc.
The problem is that there are a fair few rules (about 100), and although the above is fairly readable, its performance isn't great. At the moment it's implemented as a series of IF's that do a binary-chop, which is much faster, but becomes fairly unpleasant to read and maintain. Any alternative ideas for something that performs well and is fairly maintainable?
I would suggest you generate the code programmatically, e.g. via XML+XSLT. The resulting T-SQL will be the same as you have now, but maintaining it will be much easier (adding/removing functions).
Inside a function you don't have much choice; using IFs is pretty much the only solution. You can't do dynamic SQL in functions (you can't invoke EXEC). If it's a stored procedure, then you have much more liberty, as you can use dynamic SQL and tricks like a lookup table: SELECT @function = function FROM table WHERE rule_id = @rule_id; EXEC sp_executesql @function;
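A sketch of that lookup-table trick in a stored procedure (the table and handler-procedure names are hypothetical):

```sql
-- Maps each rule to the stored procedure that implements it.
CREATE TABLE dbo.rule_handlers
(
    rule_id INT PRIMARY KEY,
    handler sysname NOT NULL   -- e.g. N'dbo.handle_rule_1'
);

CREATE PROCEDURE dbo.dispatch_rule
    @rule_id INT
AS
BEGIN
    DECLARE @handler sysname;

    SELECT @handler = handler
    FROM dbo.rule_handlers
    WHERE rule_id = @rule_id;

    -- EXEC accepts a variable holding a procedure name,
    -- so no string-building of dynamic SQL is needed here.
    EXEC @handler;
END
```

Adding or removing a rule then becomes a row insert or delete plus a new handler procedure, with no change to the dispatch logic.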
Can you change it so that it EXECs a function name built as a string? I'd normally recommend against this sort of dynamic SQL, and there may be better ways if you step back and look at the overall design... but with what is known here, you may have found one of the rare exceptions where it's better.
ex:
SET @functionCall = 'functionRootName' + CAST(@rule_id AS VARCHAR(10))
EXEC @functionCall
Whatever is calling the SQL function - why does it not choose the function?
This seems like a poorly chosen distribution of responsibility.
I'm rewriting some old stored procedure and I've come across an unexpected performance issue when using a function instead of inline code.
The function is very simple as follow:
ALTER FUNCTION [dbo].[GetDateDifferenceInDays]
(
    @first_date SMALLDATETIME,
    @second_date SMALLDATETIME
)
RETURNS INT
AS
BEGIN
    RETURN ABS(DATEDIFF(DAY, @first_date, @second_date))
END
So I've got two identical queries, but one uses the function and the other does the calculation in the query itself:
ABS(DATEDIFF(DAY, [mytable].first_date, [mytable].second_date))
Now the query with the inline code runs 3 times faster than the one using the function.
What you have is a scalar UDF ( takes 0 to n parameters and returns a scalar value ). Such UDFs typically cause a row-by-row operation of your query, unless called with constant parameters, with exactly the kind of performance degradation that you're experiencing with your query.
See here, here and here for detailed explanations of the performance pitfalls of using UDFs.
Depending on the usage context, the query optimizer may be able to analyze the inline code and figure out a great index-using query plan, while it doesn't "inline the function" for similarly detailed analysis and so ends up with an inferior query plan when the function is involved. Look at the two query plans, side by side, and you should be able to confirm (or disprove) this hypothesis pretty easily!
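A quick way to confirm this on your own data (a sketch; [mytable] and its columns are from the question) is to run both forms with timing statistics on and compare the reported CPU and elapsed times, alongside the two actual execution plans:

```sql
SET STATISTICS TIME ON;

-- Inline expression: the optimizer can cost and simplify it
-- as part of the query.
SELECT ABS(DATEDIFF(DAY, first_date, second_date))
FROM [mytable];

-- Scalar UDF: invoked once per row, opaque to the optimizer.
SELECT dbo.GetDateDifferenceInDays(first_date, second_date)
FROM [mytable];

SET STATISTICS TIME OFF;
```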