What is the most efficient way to process rows in a table? - sql

I am teaching myself basic and intermediate SQL concepts for a project I am working on.
I have a lot of data that needs to undergo processing so it can be presented in different ways. Right now I am using scalar function calls in my SELECT statement to process the data.
A simple example: let's say I have an attribute in my table called fun of data type int. I want to process my table so that all rows with fun < 10 become 'foo' and all rows with fun >= 10 become 'faa'.
So I write an SQL function something like
CREATE FUNCTION dbo.fooORfaa
(
    @fun AS int
)
RETURNS VARCHAR(3)
AS
BEGIN
    IF (@fun < 10)
        RETURN 'foo'
    RETURN 'faa'
END
Then I use my function in something like this select statement
SELECT dbo.fooORfaa([mytable].[fun]) AS [blah]
FROM mytable
This example is trivial, but in my real code I need to perform some fairly involved logic against one or more columns, and I am selecting sub-results from procedures, joining tables together, and doing the other things you need to do in a database.
I have to process lots of records in a reasonable time span. Is this method an efficient way to tackle this problem? Is there another technique I should be using instead?

For this use case, you need a CASE construct.
SELECT
    CASE
        WHEN T.fun < 10 THEN 'foo'
        ELSE 'faa'
    END AS foo_faa
FROM myTable AS T
Always try to use set-based operations. User-defined functions will (mostly) kill your performance, and should be a last resort.
See: CASE (Transact-SQL)
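If the involved logic must stay reusable across queries, one set-friendly option is an inline table-valued function applied with CROSS APPLY. A minimal sketch, assuming the table and column names from the question:
CREATE FUNCTION dbo.fooORfaa_itvf (@fun int)
RETURNS TABLE
AS
RETURN
(
    SELECT CASE WHEN @fun < 10 THEN 'foo' ELSE 'faa' END AS blah
);
GO

SELECT f.blah
FROM mytable AS m
CROSS APPLY dbo.fooORfaa_itvf(m.fun) AS f;
Because the function is inline, the optimizer expands it into the calling query like a macro, so it performs like the plain CASE expression rather than a per-row scalar call.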

Related

Using functions for easy writing makes SQL really slow

create or replace FUNCTION FUNCTION_X
(
    N_STRING IN VARCHAR2
) RETURN VARCHAR2 AS
BEGIN
    RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
This SELECT takes around 5 seconds (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
UPPER(UPPER(translate(TABLE_A.STRING_X,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu')))
=
UPPER(translate(TABLE_B.N_STRING,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'))
Using the function takes over 3 minutes (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
FUNCTION_X(TABLE_A.STRING_X) = FUNCTION_X(TABLE_B.N_STRING)
I don't know what makes it so heavy.
If your first query, with UPPER(UPPER(translate(...))) inline, takes only 5 seconds on big tables, I would look to see whether you have a function-based index containing those expressions on either or both tables.
An index, as you probably know, stores a sorted version of the data so that rows can be found quickly. But it is only useful if you are searching on the data that is sorted in the index. (Think of an index in a book, in which keywords are sorted alphabetically: useful for finding a particular word, not so useful for finding references to words ending in the letter "r".)
If there is a function-based index on UPPER(UPPER(translate(...))) that is helping your original query, you lose the benefit when your query specifies FUNCTION_X(...) instead. Oracle is not smart enough to realize they are the same expression. You would need to create function-based indexes on the expression you actually use in the query, i.e., on FUNCTION_X(...).
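For illustration only, assuming the tables and columns from the question, the function-based indexes could be created like this (Oracle requires the function to be declared DETERMINISTIC before it will accept it in an index):
CREATE INDEX idx_table_a_fx ON TABLE_A (FUNCTION_X(STRING_X));
CREATE INDEX idx_table_b_fx ON TABLE_B (FUNCTION_X(N_STRING));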
Also, you can help performance by telling Oracle that your function is deterministic (i.e., it always returns the same value for the same input) and intended to be used in SQL queries. So, in addition to the function-based indexes, a better definition of your function would be:
create or replace FUNCTION FUNCTION_X
(
N_STRING IN VARCHAR2
) RETURN VARCHAR2
DETERMINISTIC -- add this
AS
PRAGMA UDF; -- add this too
BEGIN
RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
JOINs, of course, are meant to exploit indexes. The problem with your second query is that it demands that the SQL engine execute the function call for each and every row. It therefore cannot do anything better than a full table scan, evaluating the function over every row, and in the worst case over the combination of the two tables taken together.
You must find an alternative way to do this.
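One possible alternative, sketched here under the assumption that the normalized values can be materialized: precompute them once per row into a helper table, then join on the stored results.
CREATE TABLE TABLE_B_NORM AS
SELECT ID, FUNCTION_X(N_STRING) AS N_STRING_NORM
FROM TABLE_B;

-- the same precomputation can be applied to TABLE_A to avoid per-row calls entirely
SELECT a.STRING_X
FROM TABLE_A a
INNER JOIN TABLE_B_NORM b ON b.ID = a.IDTB
WHERE FUNCTION_X(a.STRING_X) = b.N_STRING_NORM;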

WHERE-CASE clause Subquery Performance

This question may be specific to SQL Server.
When I write a query such as :
SELECT * FROM IndustryData
WHERE Date = '20131231'
AND ReportTypeID = CASE WHEN (fnQuarterDate('20131231') = '20131231') THEN 1
                        WHEN (fnQuarterDate('20131231') != '20131231') THEN 4
                   END;
Is the function call fnQuarterDate (or any subquery) within the CASE inside the WHERE clause executed for EACH row of the table?
Would it be better if I fetched the function's (or subquery's) value beforehand into a variable, like this:
DECLARE @X INT
IF fnQuarterDate('20131231') = '20131231'
    SET @X = 1
ELSE
    SET @X = 0

SELECT * FROM IndustryData
WHERE Date = '20131231'
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        WHEN (@X = 0) THEN 4
                   END;
I know that in MySQL, if there is a subquery inside IN(...) within a WHERE clause, it is executed for each row; I just wanted to find out whether the same holds for SQL Server.
...
I just populated the table with about 30K rows and found the time difference:
Query 1 = 70 ms; Query 2 = 6 ms. I think that explains it, but I still don't know the actual facts behind it.
Also, would there be any difference if instead of a UDF there was a simple subquery?
I think the solution may in theory help you increase performance, but it also depends on what the scalar function actually does. I think that in this case (my guess is it maps the date to the last day of the quarter) the difference would really be negligible.
You may want to read this page with suggested workarounds:
http://connect.microsoft.com/SQLServer/feedback/details/273443/the-scalar-expression-function-would-speed-performance-while-keeping-the-benefits-of-functions#
Because SQL Server must execute each function on every row, using any function incurs a cursor-like performance penalty.
And among the workarounds, there is a comment:
I had the same problem when I used a scalar UDF in a join column; the performance was horrible. After I replaced the UDF with a temp table that contains the results of the UDF and used it in the join clause, the performance was orders of magnitude better. The MS team should fix UDFs to be more reliable.
So it appears that yes, this may increase the performance.
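A sketch of that workaround, with hypothetical names (dbo.MyUdf, SomeTable, and JoinCol are placeholders): evaluate the UDF once per distinct input, then join on the materialized results.
SELECT DISTINCT JoinCol, dbo.MyUdf(JoinCol) AS UdfValue -- hypothetical UDF and column
INTO #UdfResults
FROM SomeTable;

SELECT s.*, r.UdfValue
FROM SomeTable AS s
INNER JOIN #UdfResults AS r ON r.JoinCol = s.JoinCol;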
Your solution is correct, but I would recommend improving the SQL to use ELSE instead; it looks cleaner to me:
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        ELSE 4
                   END;
It depends. See User-Defined Functions:
The number of times that a function specified in a query is actually executed can vary between execution plans built by the optimizer. An example is a function invoked by a subquery in a WHERE clause. The number of times the subquery and its function is executed can vary with different access paths chosen by the optimizer.
This approach uses inline MySQL variables. The query alias "sqlvars" will prepare @dateBasis first with the date in question, then a second variable @qtrReportType based on the function call, done ONCE for the entire query. Then, by cross join (with no WHERE clause between the tables, since sqlvars is a single row anyhow), those values are used to fetch data from your IndustryData table.
select
    ID.*
from
    ( select
          @dateBasis := '20131231',
          @qtrReportType := case when fnQuarterDate(@dateBasis) = @dateBasis
                                 then 1 else 4 end ) sqlvars,
    IndustryData ID
where
        ID.Date = @dateBasis
    AND ID.ReportTypeID = @qtrReportType

Executing SQL function for each row in the table

I have implemented a SQL function and I want to execute it for every row returned from a table. For simplicity, let's assume that the function accepts @StudentId as a parameter and returns a table (multiple rows) with Subject and Score as columns.
I want to execute this function for every student returned from a table, say Students, and aggregate the data in a temp table with the following format:
StudentId | Subject | Score
One way I can think of is writing a cursor over Students table, executing the function for each student and collecting the information in a temp table. Is this the only option I have or is there any better way to handle this?
I am using SQL server 2008 R2.
Amey
Well, without proper specs, I can only offer that you start with something like this (please stop and provide more details before writing any cursors, loops or temp tables):
SELECT s.StudentId, f.someColumn FROM dbo.Students AS s
CROSS APPLY dbo.FunctionName(s.StudentId) AS f;
Also, if you provide more details about the function itself, we may have input on better ways to write it (or to not use a function at all).
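To make the pattern concrete, here is a minimal sketch, assuming the function can be written as an inline table-valued function; dbo.GetStudentScores and the dbo.Exams source table are hypothetical names, not from the question:
CREATE FUNCTION dbo.GetStudentScores (@StudentId int)
RETURNS TABLE
AS
RETURN
(
    SELECT e.Subject, e.Score
    FROM dbo.Exams AS e -- hypothetical source table
    WHERE e.StudentId = @StudentId
);
GO

-- collect the results for every student into a temp table, no cursor needed
SELECT s.StudentId, f.Subject, f.Score
INTO #StudentScores
FROM dbo.Students AS s
CROSS APPLY dbo.GetStudentScores(s.StudentId) AS f;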

Create a function with whole columns as input and output

I have several programs written in R that I now need to translate into T-SQL to deliver to the client. I am new to T-SQL and I'm facing some difficulties translating all my R functions.
An example is the numerical derivative function, which for two input columns (values and time) would return another column (of the same length) with the computed derivative.
My current understanding is:
I can't use SPs, because I'll need to use these functions inline in a SELECT statement, like:
SELECT Customer_ID, Date, Amount, derivative(Amount, Date) FROM Customer_Detail
I can't use scalar UDFs, because they take only scalars as input parameters. I need vectorised functions, both for speed and because for some of the functions I have, like the one above, running row by row wouldn't be meaningful (each value needs the next and the previous).
UDAs take a whole column but, as the name says, they aggregate the column the way SUM or AVG would.
If the above is correct, which other techniques would allow me to create the type of function I need? An example of a built-in SQL function similar to what I'm after is SQUARE(), which (apparently) takes a column and returns its square. My goal is to create a library of functions which behave like SQUARE, POWER, etc., but internally work differently, because SQUARE reads and returns each scalar as the rows are read. I would like to know whether it is possible to have a user-defined function with an accumulate method (like a UDA's) that operates on all the data once the import is finished and then returns a column of the same length.
NB: At the moment I'm on SQL Server 2005, but we'll switch soon to 2012 (or possibly 2014 in a few months), so answers based on any 2005+ version of SQL Server are fine.
EDIT: added the R tag for R developers who have, hopefully, already faced such difficulties.
EDIT2: Added the CLR tag. I went through CLR user-defined aggregates as described in the Pro T-SQL 2005 Programmer's Guide. I already said above that this type of function wouldn't fit my needs, but it was worth looking into. The four methods needed by a UDA are Init, Accumulate, Merge and Terminate. My case would need the whole data set to be analysed together by the same instance of the UDA, so options involving Merge methods to combine partial results from multicore processing won't work.
I think you may want to change your mind a bit. The SQL language is very good when working with sets of data, especially in modern RDBMS implementations (like SQL Server 2012), but you have to think in sets, not in rows or columns. While I still don't know your exact tasks, let's see: SQL Server 2012 has a very nice set of window functions + ranking functions + analytic functions + common table expressions, so you can write almost any query inline. You can use chains of common table expressions to turn your data any way you want, to calculate running totals, to calculate averages or other aggregates over a window, and so on.
Actually, I've always liked SQL, and when I learned a bit of functional languages (ML and Scala), my thought was that my approach to SQL is very similar to the functional paradigm: just slicing and dicing data without saving anything into variables, until you have the result set you need.
Just a quick example: here's a question from SO - How to get average of the 'middle' values in a group?. The goal was to get, for each group, the average of the middle 3 values:
TEST_ID  TEST_VALUE  GROUP_ID
      1           5         1  -+
      2          10         1   +- these values for group_id = 1
      3          15         1  -+
      4          25         2  -+
      5          35         2   +- these values for group_id = 2
      6           5         2  -+
      7          15         2
      8          25         3
      9          45         3  -+
     10          55         3   +- these values for group_id = 3
     11          15         3  -+
     12           5         3
     13          25         3
     14          45         4   +- this value for group_id = 4
For me, it's not an easy task to do in R, but in SQL it could be a really simple query like this:
with cte as (
select
*,
row_number() over(partition by group_id order by test_value) as rn,
count(*) over(partition by group_id) as cnt
from test
)
select
group_id, avg(test_value)
from cte
where
cnt <= 3 or
(rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
You can also easily expand this query to get 5 values around the middle.
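For instance, keeping the same CTE, the filter might become something like this (a sketch; how you handle groups smaller than five is up to you):
where
    cnt <= 5 or
    (rn >= cnt / 2 - 2 and rn <= cnt / 2 + 2)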
Take a closer look at analytic functions, and try to rethink your calculations in terms of window functions; maybe it's not so hard to rewrite your R procedures in plain SQL.
Hope it helps.
I would solve this by passing a reference to the record(s) you want to process and using a so-called "inline table-valued function" to return the record(s) after processing the initial records.
You find the table-function reference here:
http://technet.microsoft.com/en-en/library/ms186755.aspx
A Sample:
CREATE FUNCTION Sales.CustomerExtendedInfo (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
    SELECT FirstName + LastName AS CompleteName,
           DATEDIFF(Day, CreateDate, GetDate()) AS DaysSinceCreation
    FROM Customer_Detail
    WHERE CustomerID = @CustomerID
);
GO
CustomerID would be the primary key of the records you want to process.
The table function can afterwards be joined to other query results if you want to process more than one record at once.
Here is a sample:
SELECT * FROM Customer_Detail
CROSS APPLY Sales.CustomerExtendedInfo (CustomerID)
Using a normal stored procedure would do more or less the same, but it's a bit tricky to work with the results programmatically.
But keep one thing in mind: SQL Server is not really good at "functional programming". It's brilliant at working with data and sets of data, but the more you use it as an "application server", the more you will realize it's not made for that.
I don't think this is possible in pure T-SQL without using cursors. But with cursors, things will usually be very slow. Cursors process the table row by row, and some people call this "slow-by-slow".
But you can create your own aggregate function (see Technet for more details). You have to implement the function using the .NET CLR (e.g. C# or R.NET).
For a nice example see here.
I think interfacing R with SQL is a very nice solution. Oracle is offering this combo as a commercial product, so why not go the same way with SQL Server.
When integrating R in the code using your own aggregate functions, you will only pay a small performance penalty. Custom aggregate functions are quite fast according to the Microsoft documentation: "Managed code generally performs slightly slower than built-in SQL Server aggregate functions." And the R.NET solution also seems to be quite fast, since it loads the native R DLL directly into the running process, so it should be much faster than using R over ODBC.
ORIGINAL RESPONSE:
If you already know which functions you will need, one approach I can think of is creating one inline function for each method/operation you want to apply per table.
What do I mean by that? For example, you mentioned the Customer_Detail table; when you select from it, you might need one method, derivative(Amount, Date). Let's say a second method you might need (I am just making this up for explanation) is derivative1(Amount1, Date1).
We create two inline functions; each does its own calculation inside the function on the intended columns and also returns the remaining columns as they are. That way you get all the columns you would get from the table and also perform the custom calculation as a set-based operation instead of a scalar operation.
Later you can combine the independent calculations of columns into the same function if that makes sense.
You can still use all these functions and JOIN them to get all the custom calculations in a single set if needed, as all the functions will have the common/unprocessed columns coming through as-is.
See the example below.
IF object_id('Product','u') IS NOT NULL
DROP TABLE Product
GO
CREATE TABLE Product
(
pname sysname NOT NULL
,pid INT NOT NULL
,totalqty INT NOT NULL DEFAULT 1
,uprice NUMERIC(28,10) NOT NULL DEFAULT 0
)
GO
INSERT INTO Product( pname, pid, totalqty, uprice )
SELECT 'pen',1,100,1.2
UNION ALL SELECT 'book',2,300,10.00
UNION ALL SELECT 'lock',3,500,15.00
GO
IF object_id('ufn_Product_totalValue','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValue
GO
CREATE FUNCTION ufn_Product_totalValue
(
    @newqty int
    ,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
    SELECT pname, pid, totalqty, uprice, totalqty * uprice AS totalValue
    FROM
    (
        SELECT
            pname
            ,pid
            ,totalqty + @newqty AS totalqty
            ,uprice + @newuprice AS uprice
        FROM Product
    ) qry
)
GO
IF object_id('ufn_Product_totalValuePct','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValuePct
GO
CREATE FUNCTION ufn_Product_totalValuePct
(
    @newqty int
    ,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
    SELECT pname, pid, totalqty, uprice, totalqty * uprice / 100 AS totalValuePct
    FROM
    (
        SELECT
            pname
            ,pid
            ,totalqty + @newqty AS totalqty
            ,uprice + @newuprice AS uprice
        FROM Product
    ) qry
)
GO
SELECT * FROM ufn_Product_totalValue(10,5)
SELECT * FROM ufn_Product_totalValuePct(10,5)
SELECT tv.pname, tv.pid, tv.totalValue, pct.totalValuePct
FROM ufn_Product_totalValue(10,5) tv
JOIN ufn_Product_totalValuePct(10,5) pct
    ON tv.pid = pct.pid
EDIT2: a three-point smoothing algorithm:
IF OBJECT_ID('Test3PointSmoothingAlgo','u') IS NOT NULL
DROP TABLE Test3PointSmoothingAlgo
GO
CREATE TABLE Test3PointSmoothingAlgo
(
qty INT NOT NULL
,id INT IDENTITY NOT NULL
)
GO
INSERT Test3PointSmoothingAlgo( qty ) SELECT 10 UNION SELECT 20 UNION SELECT 30
GO
IF object_id('ufn_Test3PointSmoothingAlgo_qty','IF') IS NOT NULL
DROP FUNCTION ufn_Test3PointSmoothingAlgo_qty
GO
CREATE FUNCTION ufn_Test3PointSmoothingAlgo_qty
(
@ID INT --this is a dummy parameter
)
RETURNS TABLE AS
RETURN
(
WITH CTE_3PSA(SmoothingPoint,Coefficients)
AS --finding the ID of adjacent points
(
SELECT id,id
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id-1
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id+1
FROM Test3PointSmoothingAlgo
)
--apply the three-point smoothing formula
SELECT a.SmoothingPoint, SUM(ISNULL(b.qty, 0)) / 3 AS Qty_Smoothed
FROM CTE_3PSA a
LEFT JOIN Test3PointSmoothingAlgo b
ON a.Coefficients=b.id
GROUP BY a.SmoothingPoint
)
GO
SELECT SmoothingPoint,Qty_Smoothed FROM dbo.ufn_Test3PointSmoothingAlgo_qty(NULL)
I think you may need to break your functionality into two parts: UDAs, which can work over scopes thanks to the OVER (...) clause, and formulas which combine the resulting scalars.
What you are asking for, defining objects in such a way as to make an aggregate/scalar combo, is probably beyond regular SQL Server's capabilities, unless you fall back to CLR code that is effectively equivalent to a cursor in terms of performance, or worse.
Your best shot is probably to define an SP (I know you don't want that) that produces the whole result, like a [derivative] stored procedure that takes table and column names as parameters. You can even expand on the idea, but in the end that's not exactly what you want.
Since you mention you will be upgrading to SQL Server 2012: SQL Server 2008 introduced table-valued parameters (TVPs).
This feature will do what you want. You will have to define a user-defined table type in your DB, which is like a table definition with columns and their respective types.
You can then use that UDT as a parameter type for any other stored procedure or function in your DB.
You can combine these UDTs with CLR integration to achieve what you require.
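A minimal sketch of a TVP; dbo.AmountSeries and dbo.ComputeDerivative are illustrative names, not part of your schema:
CREATE TYPE dbo.AmountSeries AS TABLE
(
    [Date] date NOT NULL,
    Amount numeric(28,10) NOT NULL
);
GO

CREATE PROCEDURE dbo.ComputeDerivative
    @series dbo.AmountSeries READONLY -- TVPs must be declared READONLY
AS
BEGIN
    -- whole-column, set-based logic over @series goes here
    SELECT [Date], Amount
    FROM @series;
END
GO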
As mentioned, SQL is not good when you are comparing rows to other rows; it's much better at set-based operations where every row is treated as an independent entity.
But before looking at cursors and CLR, you should make sure it can't be done in pure T-SQL, which will almost always be faster and scale better as your table grows.
One method for comparing rows based on order is to wrap your data in a CTE, adding a ranking function like ROW_NUMBER to set the row order, followed by a self-join of the CTE onto itself.
The join is performed on the ordered field, e.g. ROW_NUMBER = (ROW_NUMBER - 1).
Look at this article for an example
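As a hedged sketch of that pattern applied to the derivative example from the question (column names are taken from your SELECT; the day-based time delta is an assumption):
WITH Ordered AS
(
    SELECT Customer_ID, [Date], Amount,
           ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY [Date]) AS rn
    FROM Customer_Detail
)
SELECT cur.Customer_ID, cur.[Date], cur.Amount,
       -- discrete derivative: change in Amount over elapsed days
       (cur.Amount - prev.Amount)
           / NULLIF(DATEDIFF(DAY, prev.[Date], cur.[Date]), 0) AS Derivative
FROM Ordered AS cur
INNER JOIN Ordered AS prev
    ON prev.Customer_ID = cur.Customer_ID
   AND prev.rn = cur.rn - 1;
On SQL Server 2012+ the LAG() window function would replace the self-join entirely.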

Is this an inefficient way to write a SQL query?

Let's suppose I had a view, like this:
CREATE VIEW EmployeeView
AS
SELECT ID, Name, Salary(PaymentPlanID) AS Payment
FROM Employees
The user-defined function, Salary, is somewhat expensive.
If I wanted to do something like this,
SELECT *
FROM TempWorkers t
INNER JOIN EmployeeView e ON t.ID = e.ID
will Salary be executed on every row of Employees, or will it do the join first and then only be called on the rows filtered by the join? Could I expect the same behavior if EmployeeView were a subquery or a table-valued function instead of a view?
The function will only be called where relevant. If your final select statement does not include that field, it's not called at all. If your final select refers to 1% of your table, it will only be called for that 1% of the table.
This is effectively the same for sub-queries/inline views. You could specify the function for a field in a sub-query, then never use that field, in which case the function never gets called.
As an aside: scalar functions are indeed notoriously expensive in many regards. You may be able to reduce the cost by rewriting it as an inline table-valued function.
SELECT
myTable.*,
myFunction.Value
FROM
myTable
CROSS APPLY
myFunction(myTable.field1, myTable.field2) as myFunction
As long as myFunction is inline (not multi-statement) and returns only one row for each set of inputs, this often scales much better than scalar functions.
This is slightly different from making the whole view a table valued function, that returns many rows.
If such a TVF is multi-statement, it WILL call the Salary function for every record. But inline functions can be expanded inline, as if a SQL macro, and so only call Salary as required, like the view.
As a general rule for TVFs though, don't return records that will then be discarded.
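For example, the Salary function could be reshaped into an inline TVF; this is a sketch only, since the original function body isn't shown (dbo.PaymentPlans and its columns are assumptions):
CREATE FUNCTION dbo.SalaryTVF (@PaymentPlanID int)
RETURNS TABLE
AS
RETURN
(
    SELECT p.BaseAmount * p.Multiplier AS Payment -- hypothetical logic
    FROM dbo.PaymentPlans AS p
    WHERE p.PaymentPlanID = @PaymentPlanID
);
GO

CREATE VIEW EmployeeView
AS
SELECT e.ID, e.Name, s.Payment
FROM Employees AS e
CROSS APPLY dbo.SalaryTVF(e.PaymentPlanID) AS s;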
It should only execute the Salary function for the joined rows. But note that you are not filtering the tables any further: if ID is a non-null foreign key column, it will execute the function for all the rows anyway.
The actual execution plan is a good place to see for sure.
As said above, the function will only be called for relevant rows. For your further questions, and to get a really good idea of what's happening, you need to gather performance data, either through SQL Profiler or by viewing the actual execution plan and elapsed times. Then test out a few theories and find which performs best.