When to use user defined functions in a SQL Server data warehouse

I am building a DWH where I load the data into a staging DB and, before loading it into the final DB, apply all the UDFs I have created to the data.
Source DB : Oracle
Dest DB : SQL Server
ETL Process : SSIS packages
I was not processing anything in staging, to keep the load quick.
Question: is it quicker to apply the UDFs while the data is still in staging, or should it be done when loading the data into the final DB?
Below, facility_cd is a float value and I am passing it to a function emr_get_code_Description to get the corresponding description. The table it reads the description from is in the final DB. udf_replace_special_char is a simple function which strips a few special characters.
LTRIM(RTRIM([Dest_DWH].[dbo].udf_replace_special_char([Dest_DWH].[dbo].[emr_get_code_Description](Stg_ap.Facility_cd))))
In general, what is the better practice? Should I apply these conversions in staging and then load the converted data into the final DB?
Function definitions :
Function 1 :
USE [PROD_DWH]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER function [dbo].[emr_get_code_Description](@cv int)
returns varchar(80)
as begin
-- Returns the code value display
declare @ret varchar(80)
select @ret = cv.DESCRIPTION
from PROD_DWH.DBO.table cv
where cv.code_value = @cv
and cv.active_ind = 1
return isnull(@ret, 0)
end;
Function 2 :
USE [PROD_DWH]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER function [dbo].[udf_replace_special_char](@var varchar(1000))
returns varchar(1000)
as begin
-- Strips CR, LF, and tab, and turns double quotes into single quotes
declare @return_var varchar(1000)
set @return_var = @var
set @return_var = replace(@return_var,CHAR(13),'')
set @return_var = replace(@return_var,CHAR(10),'')
set @return_var = replace(@return_var,CHAR(09),'')
set @return_var = replace(@return_var,CHAR(34),CHAR(39))
return isnull(@return_var, 0)
end;

First of all, as @Nick.McDermaid mentioned in the comments: best practice is to avoid user-defined functions. There are many articles on how such functions affect query performance:
Removing Function Calls for Better Performance in SQL Server
Performance Considerations of User-Defined Functions in SQL Server 2012
Are SQL Server Functions Dragging Your Query Down?
T-SQL Best Practices - Don't Use Scalar Value Functions in Column List or WHERE Clauses
There is no ideal answer to this question; it depends on the case you are working with, but here are some tips you can take into consideration:
First, if you are using SSIS to import data into the staging table, try replacing the user-defined functions with SSIS data flow components such as the Derived Column transformation and Lookups, in a way that improves the performance of the import.
If you cannot replace the UDFs with SSIS components: if you are collecting data at high speed into a data lake (staging level) and then loading it when needed, it is better to avoid functions when importing into the staging table.
If you need to guarantee high speed when loading data from the staging table, then apply the functions in the first data import phase instead.
If the first import phase (into the staging table) and the second phase (from the staging table) are not executed on the same machine, it can be better to execute the functions on the more powerful machine.
If a function performs lookups, try replacing them with joins (see the sketch after this list).
...
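To illustrate the last tip with the expression from the question, here is a minimal sketch of a set-based rewrite. The staging database name (Staging_DWH) and the code-value table name (cv_table) are hypothetical stand-ins, since the function definition only shows a placeholder table:
-- Hedged sketch: inline both UDFs as a LEFT JOIN plus nested REPLACEs.
-- ISNULL(..., '0') mimics the function's isnull(@ret, 0) fallback.
SELECT LTRIM(RTRIM(ISNULL(
           REPLACE(REPLACE(REPLACE(REPLACE(cv.DESCRIPTION,
               CHAR(13), ''), CHAR(10), ''), CHAR(09), ''), CHAR(34), CHAR(39)),
           '0'))) AS Facility_Description
FROM Staging_DWH.dbo.Stg_ap AS Stg_ap
LEFT JOIN PROD_DWH.dbo.cv_table AS cv
       ON cv.code_value = Stg_ap.Facility_cd
      AND cv.active_ind = 1;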
Update 1
Now that you have posted the function definitions: you can replace Function 2 with a Derived Column transformation in your SSIS package (the SSIS expression language has no CHAR() function, so string escape sequences such as "\n" are used instead):
ISNULL([Column]) ? "" : REPLACE(REPLACE(REPLACE(REPLACE([Column],"\n",""),"\r",""),"\t",""),"\"","'")
You can also replace Function 1 with a Lookup transformation in the SSIS package, or with a LEFT JOIN in the source SQL query (as in the join sketch above).
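For the Lookup transformation, the reference query could be as simple as the following; again, cv_table is a hypothetical name for the placeholder table in the function definition:
-- Hedged sketch: reference query for the SSIS Lookup transformation,
-- restricted to active rows just like the UDF.
SELECT code_value, DESCRIPTION
FROM PROD_DWH.dbo.cv_table
WHERE active_ind = 1;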

Related

SQL how to call user defined function by dynamic variable name

In SQL, I have a list of user-defined function names in a table. Based on some logic, I need to call/exec one of those functions.
Please see my high-level code logic below:
DECLARE @MY_FUNCTION VARCHAR(1000);
DECLARE @MY_INPUT_PARAMETER INT;
DECLARE @MY_OUTPUT_PARAMETER INT;
SET @MY_FUNCTION = '' -- Function name will be provided dynamically based on some big logic
--Note: function has input and output parameter
--my query
-- call the function by @MY_FUNCTION (@MY_INPUT_PARAMETER )
@MY_OUTPUT_PARAMETER = EXEC @MY_FUNCTION (@MY_INPUT_PARAMETER)
--Some big sql script using @MY_OUTPUT_PARAMETER
(
-- Script goes here
)
You will need to construct the function call with its parameters inside a variable and then run sp_executesql. Check out the samples in https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-executesql-transact-sql?view=sql-server-ver15#c-using-the-output-parameter
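A minimal sketch of that approach, assuming the function takes one int and returns an int; dbo.MyFunc is a hypothetical name standing in for the value read from your table:
-- Hedged sketch: build the call dynamically and capture the result via an
-- OUTPUT parameter. QUOTENAME guards the name read from the table; since this
-- is still dynamic SQL, also validate the name against sys.objects.
DECLARE @MY_FUNCTION sysname = N'MyFunc';   -- hypothetical function name
DECLARE @MY_INPUT_PARAMETER INT = 42;
DECLARE @MY_OUTPUT_PARAMETER INT;
DECLARE @sql nvarchar(max) =
    N'SELECT @out = dbo.' + QUOTENAME(@MY_FUNCTION) + N'(@in);';
EXEC sp_executesql @sql,
     N'@in INT, @out INT OUTPUT',
     @in = @MY_INPUT_PARAMETER,
     @out = @MY_OUTPUT_PARAMETER OUTPUT;
SELECT @MY_OUTPUT_PARAMETER AS result;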
Important
However, try to avoid this method of execution if possible. Let the application decide what SP to call and the SP can then use the right function to make the call. There are two advantages to this.
Your SP will be compiled, and SQL Server will be able to cache an execution plan and continue to fine-tune it. Hence, better performance.
You will have less chance of SQL injection, depending on how the table of function names is populated.

SSIS pre-evaluation phase taking long

I have a data flow that contains an OLE DB source (statement generated through a variable) which calls a stored procedure.
In SSMS, it takes 8 minutes but the package itself takes 3 times longer to complete.
I've set DelayValidation to true, so validation is deferred to run time. I've also set the validation of the metadata in the data flow component, as well as in the connection manager.
The data flows have ReadUncommitted on them as well.
I'm not sure where else to look; any assistance on how to make this run faster would be great.
I suspect the real problem is in your stored procedure, but I've included some basic SSIS items as well to try to fix your problem:
Ensure connection managers for OLE DB sources all have DelayValidation set to True.
Ensure that ValidateExternalMetadata is set to False.
Set DefaultBufferMaxRows and DefaultBufferSize to correspond to the table's row sizes.
Drop and recreate your destination component in SSIS.
Ensure your stored procedure has SET ANSI_NULLS ON.
Ensure that the SQL in your sproc hits an index.
Add the query hint OPTION (FAST 10000). This hint optimises the plan for returning the first 10,000 rows, which matches the default SSIS buffer size.
Review your stored procedure for SQL Server parameter sniffing:
Slow way (the optimizer sniffs @CustID and may cache a plan that suits only the first value it sees):
create procedure GetOrderForCustomers(@CustID varchar(20))
as
begin
select * from orders
where customerid = @CustID
end
Fast way (copying the parameter into a local variable defeats parameter sniffing):
create procedure GetOrderForCustomersWithoutPS(@CustID varchar(20))
as
begin
declare @LocCustID varchar(20)
set @LocCustID = @CustID
select * from orders
where customerid = @LocCustID
end
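A different technique, not mentioned in the original answer but worth knowing: OPTION (RECOMPILE) forces a fresh plan per execution using the actual parameter value, trading compile time for plan quality. A hedged sketch using the same hypothetical orders table:
-- Hedged sketch: recompile each call so the plan fits the actual @CustID.
create procedure GetOrderForCustomersRecompile(@CustID varchar(20))
as
begin
select * from orders
where customerid = @CustID
option (recompile)
end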

Selecting data from a different schema within a stored procedure

Consider this:
CREATE PROCEDURE [dbo].[setIdentifier](@oldIdentifierName as varchar(50), @newIdentifierName as varchar(50))
AS
BEGIN
DECLARE @old_id as int;
DECLARE @new_id as int;
SET @old_id = (SELECT value FROM Configuration WHERE id = @oldIdentifierName);
SET @new_id = (SELECT value FROM Configuration WHERE id = @newIdentifierName);
IF @old_id IS NOT NULL AND @new_id IS NOT NULL
BEGIN
UPDATE Customer
SET type = @new_id
WHERE type = @old_id;
END;
END
[...]
EXECUTE dbo.setIdentifier '1', '2';
What this does is create a stored procedure that accepts two parameters which it then uses to update a Customer table.
The problem is that the entire script above runs within a schema other than "dbo". Let's just assume the schema is "company1". And when the stored procedure is called, I get an error from the SELECT statement, which says that the Configuration table cannot be found. I'm guessing this is because MS SQL by default looks for tables within the same schema as the location of the stored procedure, and not within the calling context.
My question is this:
Is there some option or parameter or switch of some kind that will tell MS SQL to look for tables in the "caller's default schema" and not within the schema that the procedure itself is stored in?
If not, what would you recommend? I don't really want to prefix the tables with the schema name, because that would be rather inflexible. So I'm thinking about using dynamic SQL (and the schema_name() function, which returns the correct value even within the procedure), but I am just not experienced enough with MS SQL to construct the proper syntax.
It would be a tad more efficient to explicitly specify the schema name. And generally speaking, schemas are mainly used to divide a database into logical areas; I would not expect tables to hop between schemas often.
Regarding your question, you might want to have a look at the 'execute as' documentation on MSDN, since it allows you to explicitly control your execution context.
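If you do go down the dynamic SQL route mentioned in the question, a minimal sketch of the lookup inside the procedure could look like this, assuming (as the question states) that schema_name() resolves to the caller's default schema and that Configuration lives there:
-- Hedged sketch: resolve the schema at run time with SCHEMA_NAME(), then
-- fetch one value via sp_executesql with an OUTPUT parameter.
DECLARE @sql nvarchar(max) =
    N'SELECT @val = value FROM ' + QUOTENAME(SCHEMA_NAME())
    + N'.Configuration WHERE id = @id;';
DECLARE @old_id int;
EXEC sp_executesql @sql,
     N'@id varchar(50), @val int OUTPUT',
     @id = @oldIdentifierName, @val = @old_id OUTPUT;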
I ended up passing the schema name to my script as a property on the command line for the "sqlcmd" command. Like this:
C:\> sqlcmd -v SCHEMANAME=myschema -i mysqlfile
In the SQL script I can then access this variable like this:
SELECT * from $(SCHEMANAME).myTable WHERE.... etc
Not quite as flexible as dynamic sql, but "good enough" as it were.
Thanks all for taking time to respond.

Cast Stored Procedure Result as a Table? [duplicate]

This question already has answers here:
SQL: how to predicate over stored procedure's result sets?
(3 answers)
Closed 6 years ago.
I currently have a stored procedure that runs a complex query and returns a data set. I'd like to cast this data set to a table (on which I can perform further queries) if at all possible. I know I can do this using a table-valued UDF but I'd prefer to avoid that at this point. Is there any way I can accomplish this task?
EDIT: OK... so the SProc I'm using (written by a third party, and I'm not supposed to change it) runs a fairly complex select statement to return a bunch of line item data about purchase orders. I can recreate it as a UDF, but then I'd have to maintain the UDF and ensure it gets changed as and when our vendor changes their SProc. I'd like to further refine this line item info by a number of criteria such as (but not limited to) item numbers, vendor codes, cost centers, etc. All of this information is brought back by the original SProc; I just need to be able to manipulate it further. My thought process was that if I could somehow treat the results of the SProc as a table (or get them into a table format of some type) then I could run further queries against the original result set to limit by the criteria mentioned above. Please let me know if any further details are needed.
There are various means of sharing data between stored procedures; this link is pretty exhaustive.
But I'm curious why you want a table-valued stored procedure (which doesn't exist in SQL Server) when there are table-valued functions...
Cast Stored Procedure Result as a Table?
Yes, and this is used quite often. It simply needs one or more select statements:
Create Procedure #Foo
As
Select object_id, name
From sys.columns
That said, you cannot join to this resultset nor can you easily consume it from another stored proc (although there is a way). Given your edit, it appears the question is whether you can consume the results of a stored proc by another stored proc. Technically, yes. You can populate a temp table with the results of a proc. However, you must declare your temp variable or temp table with the same column structure as is returned by the first resultset of the stored proc.
Declare @Data Table ( object_id int, name nvarchar(128) )
Insert @Data
Exec #Foo
Select *
From @Data
(Or use the far more clever OPENROWSET solution as mentioned by Cade Roux and OMG Ponies)
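For reference, the OPENROWSET trick loops the query back through an ad hoc connection so the proc's first result set can be queried like a table. A hedged sketch; it requires 'Ad Hoc Distributed Queries' to be enabled, and the connection string, proc name, and filter column are assumptions:
-- Hedged sketch: query a stored procedure's result set via OPENROWSET.
SELECT r.*
FROM OPENROWSET('SQLNCLI',
                'Server=(local);Trusted_Connection=yes;',
                'EXEC dbo.MySproc') AS r   -- hypothetical proc name
WHERE r.some_column = 1;                   -- hypothetical filter column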
Have you considered using table-valued parameters? They are new in SQL 2008.
-- Edit --
Nope, never mind, they're only good for passing data into stored procedures.
You could try using a View instead of a Stored Procedure. Store your complex query as part of the view, and you have the functionality to perform more queries on the view.
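A quick sketch of that idea; all the names here are hypothetical stand-ins for the vendor's query:
-- Hedged sketch: wrap the complex select in a view, then filter it further.
CREATE VIEW dbo.vw_PO_LineItems AS
SELECT item_number, vendor_code, cost_center, cost
FROM dbo.PurchaseOrderLines;      -- stand-in for the vendor's complex query
GO
SELECT *
FROM dbo.vw_PO_LineItems
WHERE vendor_code = 'ACME' AND cost_center = 'CC100';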

Access to Result sets from within Stored procedures Transact-SQL SQL Server

I'm using SQL Server 2005, and I would like to know how to access different result sets from within Transact-SQL. The following stored procedure returns two result sets; how do I access them from, for example, another stored procedure?
CREATE PROCEDURE getOrder (@orderId as numeric) AS
BEGIN
select order_address, order_number from order_table where order_id = @orderId
select item, number_of_items, cost from order_line where order_id = @orderId
END
I need to be able to iterate through both result sets individually.
EDIT: Just to clarify the question, I want to test the stored procedures. I have a set of stored procedures which are used from a VB.NET client, which return multiple result sets. These are not going to be changed to a table valued function, I can't in fact change the procedures at all. Changing the procedure is not an option.
The result sets returned by the procedures are not the same data types or number of columns.
The short answer is: you can't do it.
From T-SQL there is no way to access multiple results of a nested stored procedure call, without changing the stored procedure as others have suggested.
To be complete, if the procedure were returning a single result, you could insert it into a temp table or table variable with the following syntax:
INSERT INTO #Table (...columns...)
EXEC MySproc ...parameters...
You can use the same syntax for a procedure that returns multiple results, but it will only process the first result, the rest will be discarded.
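Applied to the getOrder procedure above, capturing the first result set could look like this; the column types are assumptions, since the question doesn't show the table definitions:
-- Hedged sketch: only the order_table select is captured; the order_line
-- result set is discarded.
CREATE TABLE #OrderHeader (order_address varchar(200), order_number int);
INSERT INTO #OrderHeader (order_address, order_number)
EXEC dbo.getOrder @orderId = 1;
SELECT * FROM #OrderHeader;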
I was easily able to do this by creating a SQL2005 CLR stored procedure which contained an internal dataset.
You see, a new SqlDataAdapter will .Fill a multiple-result-set sproc into a multiple-table dataset by default. The data in these tables can in turn be inserted into #Temp tables in the calling sproc you wish to write. dataset.ReadXmlSchema will show you the schema of each result set.
Step 1: Begin writing the sproc which will read the data from the multi-result-set sproc
a. Create a separate table for each result set according to the schema.
CREATE PROCEDURE [dbo].[usp_SF_Read] AS
SET NOCOUNT ON;
CREATE TABLE #Table01 (Document_ID VARCHAR(100)
, Document_status_definition_uid INT
, Document_status_Code VARCHAR(100)
, Attachment_count INT
, PRIMARY KEY (Document_ID));
b. At this point you may need to declare a cursor to repetitively call the CLR sproc you will create here:
Step 2: Make the CLR Sproc
Partial Public Class StoredProcedures
<Microsoft.SqlServer.Server.SqlProcedure()> _
Public Shared Sub usp_SF_ReadSFIntoTables()
End Sub
End Class
a. Connect using New SqlConnection("context connection=true").
b. Set up a command object (cmd) to contain the multiple-result-set sproc.
c. Get all the data using the following:
Dim dataset As DataSet = New DataSet
With New SqlDataAdapter(cmd)
.Fill(dataset) ' get all the data.
End With
'you can use dataset.ReadXmlSchema at this point...
d. Iterate over each table and insert every row into the appropriate temp table (which you created in step one above).
Final note:
In my experience, you may wish to enforce some relationships between your tables so you know which batch each record came from.
That's all there was to it!
~ Shaun, Near Seattle
There is a kludge that you can do as well. Add an optional parameter @N int to your sproc. Default the value of @N to -1. If the value of @N is -1, then run every one of your selects. Otherwise, run the Nth select and only the Nth select.
For example,
if (@N = -1 or @N = 0)
select ... -- first result set
if (@N = -1 or @N = 1)
select ... -- second result set
The callers of your sproc who do not specify @N will get a result set with more than one table. If you need to extract one or more of these tables from another sproc, simply call your sproc specifying a value for @N. You'll have to call the sproc once for each table you wish to extract. Inefficient if you need more than one table from the result set, but it does work in pure T-SQL, as the usage sketch below shows.
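Applied to the getOrder example from the question (after adding the optional @N parameter to it), extracting just the second result set could look like this; the column types are assumptions:
-- Hedged usage sketch: @N = 1 returns only the order_line select.
CREATE TABLE #OrderLines (item varchar(100), number_of_items int, cost money);
INSERT INTO #OrderLines (item, number_of_items, cost)
EXEC dbo.getOrder @orderId = 1, @N = 1;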
Note that there's an extra, undocumented limitation to the INSERT INTO ... EXEC statement: it cannot be nested. That is, the stored proc that the EXEC calls (or any that it calls in turn) cannot itself do an INSERT INTO ... EXEC. It appears that there's a single scratchpad per process that accumulates the result, and if they're nested you'll get an error when the caller opens this up, and then the callee tries to open it again.
Matthieu, you'd need to maintain separate temp tables for each "type" of result. Also, if you're executing the same one multiple times, you might need to add an extra column to that result to indicate which call it resulted from.
Sadly it is impossible to do this. The problem is, of course, that there is no SQL syntax to allow it. It happens 'beneath the hood' of course, but you can't get at these other results in T-SQL, only from the application via ODBC or whatever.
There is a way round it, as with most things. The trick is to use OLE Automation in T-SQL to create an ADODB object which opens each resultset in turn and writes the results to the tables you nominate (or does whatever you want with the resultsets). You can also do it in DMO if you enjoy pain.
There are two ways to do this easily: either stick the results in a temp table and then reference the temp table from your sproc, or put the results into an XML variable that is used as an OUTPUT parameter.
There are, however, pros and cons to both of these options. With a temporary table, you'll need to add code to the script that creates the calling procedure so that the temporary table exists before you modify the procedure. Also, you should clean up the temp table at the end of the procedure.
With the XML, it can be memory intensive and slow.
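A minimal sketch of the XML OUTPUT variant, reusing the hypothetical order_table from the question above:
-- Hedged sketch: return one result set as XML through an OUTPUT parameter.
CREATE PROCEDURE dbo.getOrderXml (@orderId numeric, @result xml OUTPUT) AS
BEGIN
    SET @result = (SELECT order_address, order_number
                   FROM order_table
                   WHERE order_id = @orderId
                   FOR XML PATH('row'), ROOT('orders'));
END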
You could select them into temp tables, or write table-valued functions to return the result sets. Are you asking how to iterate through the result sets?