Is this an inefficient way to write a SQL query?

Let's suppose I had a view, like this:
CREATE VIEW EmployeeView
AS
SELECT ID, Name, Salary(PaymentPlanID) AS Payment
FROM Employees
The user-defined function, Salary, is somewhat expensive.
If I wanted to do something like this,
SELECT *
FROM TempWorkers t
INNER JOIN EmployeeView e ON t.ID = e.ID
will Salary be executed on every row of Employees, or will it do the join first and then only be called on the rows filtered by the join? Could I expect the same behavior if EmployeeView was a subquery or a table valued function instead of a view?

The function will only be called where relevant. If your final select statement does not include that field, it's not called at all. If your final select refers to 1% of your table, it will only be called for that 1% of the table.
This is effectively the same for sub-queries/inline views. You could specify the function for a field in a sub-query, then never use that field, in which case the function never gets called.
As an aside: scalar functions are indeed notoriously expensive in many regards. You may be able to reduce its cost by rewriting it as an inline table-valued function.
SELECT
myTable.*,
myFunction.Value
FROM
myTable
CROSS APPLY
myFunction(myTable.field1, myTable.field2) as myFunction
As long as MyFunction is Inline (not multistatement) and returns only one row for each set of inputs, this often scales much better than Scalar Functions.
This is slightly different from making the whole view a table valued function, that returns many rows.
If such a TVF is multi-statement, it WILL call the Salary function for every record. But inline functions can be expanded inline, as if they were a SQL macro, and so only call Salary as required, like the view.
As a general rule for TVFs though, don't return records that will then be discarded.

It should only execute the Salary function for the joined rows. But you are not filtering the tables any further. If ID is a foreign key column and not null then it will execute that function for all the rows.
The actual execution plan is a good place to see for sure.

As said above, the function will only be called for relevant rows. For your further questions, and to get a really good idea of what's happening, you need to gather performance data either through SQL Profiler, or by viewing the actual execution plan and elapsed times. Then test out a few theories and find which performs best.
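One way to test such theories without a profiler is to count the calls directly. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than SQL Server, registers a hypothetical Python stand-in for Salary that increments a counter, and re-creates the view and join from the question with made-up rows. The count it prints reflects SQLite's planner, not SQL Server's, so treat it as a measuring technique rather than an answer for your engine.

```python
import sqlite3

calls = 0  # how many times the engine invoked the stand-in UDF


def salary(payment_plan_id):
    """Hypothetical stand-in for the expensive Salary UDF; counts its calls."""
    global calls
    calls += 1
    return payment_plan_id * 1000


def count_salary_calls():
    """Run the view join from the question; return (result rows, UDF call count)."""
    global calls
    calls = 0
    con = sqlite3.connect(":memory:")
    con.create_function("Salary", 1, salary)
    con.executescript("""
        CREATE TABLE Employees (ID INTEGER PRIMARY KEY, Name TEXT, PaymentPlanID INTEGER);
        CREATE TABLE TempWorkers (ID INTEGER PRIMARY KEY);
        CREATE VIEW EmployeeView AS
            SELECT ID, Name, Salary(PaymentPlanID) AS Payment FROM Employees;
    """)
    # 100 made-up employees, only 3 of whom are temp workers
    con.executemany("INSERT INTO Employees VALUES (?, ?, ?)",
                    [(i, f"emp{i}", i % 5 + 1) for i in range(1, 101)])
    con.executemany("INSERT INTO TempWorkers VALUES (?)", [(i,) for i in (3, 7, 42)])
    rows = con.execute("""
        SELECT e.ID, e.Payment
        FROM TempWorkers t
        INNER JOIN EmployeeView e ON t.ID = e.ID
        ORDER BY e.ID
    """).fetchall()
    con.close()
    return rows, calls


rows, n_calls = count_salary_calls()
print(rows)
print("Salary was called", n_calls, "times for", len(rows), "result rows")
```

If the printed call count is close to the number of result rows, the engine evaluated the function after the join; if it is close to the size of Employees, it evaluated the view first.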

Related

Clarification about Select from (select...) statement

I came across a SQL practice question. The revealed answer is
SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM (
SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
FROM station);
Normally, I would encounter
select[] from[] where [] (select...)
which implies that the value selected by the inner query in the WHERE clause determines what is queried in the outer query. As mentioned at the beginning, this time the SELECT comes after FROM, and I'm curious about what this does. Is it creating an imaginary table?

The piece in parentheses:
(SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d FROM station)
is a subquery.
What's important here is that the result of a subquery looks like a regular table to the outer query. In some SQL flavors, an alias is necessary immediately following the closing parenthesis (i.e. a name by which to refer to the table-like result).
Whether this is technically a "temporary table" is a bit of a detail, as its result isn't stored outside the scope of the query; and there is also a thing called a temporary table, which is stored.
Additionally (and this might be the source of confusion), subqueries can also be used in the WHERE clause with an operator (e.g. IN) like this:
SELECT student_name
FROM students
WHERE student_school IN (SELECT school_name FROM schools WHERE location='Springfield')
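Both uses of a subquery can be tried on a small scale. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents are invented for illustration, and the alias extremes is an assumed name added because some SQL flavors require one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE station (lat_n REAL, long_w REAL);
    INSERT INTO station VALUES (10.0, 20.0), (12.5, 26.25), (11.0, 23.0);
    CREATE TABLE schools (school_name TEXT, location TEXT);
    INSERT INTO schools VALUES ('Springfield Elementary', 'Springfield'),
                               ('Shelbyville Elementary', 'Shelbyville');
    CREATE TABLE students (student_name TEXT, student_school TEXT);
    INSERT INTO students VALUES ('Bart', 'Springfield Elementary'),
                                ('Lisa', 'Springfield Elementary'),
                                ('Jessica', 'Shelbyville Elementary');
""")

# Run on its own, the inner query returns an ordinary one-row result set.
inner = con.execute("""
    SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
    FROM station
""").fetchall()

# In the FROM clause the same query acts as a table; its columns a, b, c, d
# are visible to the outer SELECT.
distance = con.execute("""
    SELECT ROUND(ABS(a - c) + ABS(b - d), 4)
    FROM (
        SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
        FROM station
    ) AS extremes
""").fetchone()[0]

# In the WHERE clause a subquery supplies the list of values for IN.
springfield = con.execute("""
    SELECT student_name
    FROM students
    WHERE student_school IN (SELECT school_name FROM schools WHERE location = 'Springfield')
    ORDER BY student_name
""").fetchall()

print(inner, distance, springfield)
```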

This is, as discussed in the comments and the other answer, a subquery.
Logically, such a subquery (when it appears in the FROM clause) is executed "first", and then the results are treated as a table [1]. Importantly though, that is not required by the SQL language [2]. The entire query (including any subqueries) is optimized as a whole.
This can include the optimizer doing things like pushing a predicate from the outer WHERE clause (which, admittedly, your query doesn't have) down into the subquery, if it's better to evaluate that predicate earlier rather than later.
Similarly, if you had two subqueries in your query that both access the same base table, that does not necessarily mean that the database system will actually query that base table exactly twice.
In any case, whether the database system chooses to materialize the results (store them somewhere) is also decided during the optimization phase. So without knowing your exact RDBMS and the decisions that the optimizer takes to optimize this particular query, it's impossible to say whether it will result in something actually being stored.
[1] Note that there is no standard terminology for this "result set as a table" produced by a subquery. Some people have mentioned "temporary tables" but since that is a term with a specific meaning in SQL, I shall not be using it here. I generally use the term "result set" to describe any set of data consisting of both columns and rows. This can be used both as a description of the result of the overall query and to describe smaller sections within a query.
[2] Provided that the final results are the same "as if" the query had been executed in its logical processing order, implementations are free to perform processing in any order they choose.

As there are so many terms involved, I just thought I'll throw in another answer ...
In a relational database we deal with tables. A query reads from tables and its result again is a table (albeit not a stored one).
So in the FROM clause we can access query results just like any stored table:
select * from (select * from t) x;
This makes the inner query a subquery to our main query. We could also call this an ad-hoc view, because view is the word we use for queries we access data from. We can move it to the beginning of our main query in order to enhance readability and possibly use it multiple times in it:
with x as (select * from t) select * from x;
We can even store such queries for later access:
create view v as select * from t;
select * from v;
In the SQL standard these terms are used:
BASE TABLE is a stored table we create with CREATE TABLE .... t in the above examples is supposed to be a base table.
VIEWED TABLE is a view we create with CREATE VIEW .... v in the above examples is a viewed table.
DERIVED TABLE is an ad-hoc view, such as x in the examples above.
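The three kinds of table named in this list can be put side by side. This is a small sketch with Python's built-in sqlite3 module, where the columns of t and its two rows are assumptions made up for the demonstration. It also looks at SQLite's catalog (sqlite_master, which is specific to SQLite) to show that only the base table and the view are stored objects.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (id INTEGER, val TEXT);   -- base table
    INSERT INTO t VALUES (1, 'a'), (2, 'b');
    CREATE VIEW v AS SELECT * FROM t;        -- viewed table
""")

# Derived table: the query exists only inside this one statement.
derived = con.execute("SELECT * FROM (SELECT * FROM t) x ORDER BY id").fetchall()

# The same derived table, moved to the front of the statement as a CTE.
cte = con.execute("WITH x AS (SELECT * FROM t) SELECT * FROM x ORDER BY id").fetchall()

# Viewed table: the query text is stored under the name v.
viewed = con.execute("SELECT * FROM v ORDER BY id").fetchall()

# The catalog lists t and v, but has no entry for x.
kinds = dict(con.execute("SELECT name, type FROM sqlite_master").fetchall())

print(derived, cte, viewed, kinds)
```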
When using subqueries in other clauses than FROM (e.g. in the SELECT clause or the WHERE clause), we don't use the term "derived table". This is because in these clauses we don't access tables (i.e. something like WHERE mytable = ... does not exist), but columns and expression results. So the term "subquery" is more general than the term "derived table". In those clauses we still use various terms for subqueries, though. There are correlated and non-correlated subqueries and scalar and non-scalar ones.
And to make things even more complicated we can use correlated subqueries in the FROM clause in modern DBMS that feature lateral joins (sometimes implemented as CROSS APPLY and OUTER APPLY). The standard calls these LATERAL DERIVED TABLES.

Subquery is faster using a function

I have a long query (~200 lines) that I have embedded in a function:
CREATE FUNCTION spot_rate(base_currency character(3),
contra_currency character(3),
pricing_date date) RETURNS numeric(20,8)
Whether I run the query directly or the function I get similar results and similar performance. So far so good.
Now I have another long query that looks like:
SELECT x, sum(y * spot_rates.spot)
FROM (SELECT a, b, sum(c) FROM t1 JOIN t2 etc. (6 joins here)) AS table_1,
(SELECT
currency,
spot_rate(currency, 'USD', current_date) AS "spot"
FROM (SELECT DISTINCT currency FROM table_2) AS "currencies"
) AS "spot_rates"
WHERE
table_1.currency = spot_rates.currency
GROUP BY / ORDER BY
This query runs in 300 ms, which is slowish but fast enough at this stage (and probably makes sense given the number of rows and aggregation operations).
If however I replace spot_rate(currency, 'USD', current_date) by its equivalent query, it runs in 5+ seconds.
Running the subquery alone returns in ~200ms whether I use the function or the equivalent query.
Why would the query run more slowly than the function when used as a subquery?
ps: I hope there is a generic answer to this generic problem - if not I'll post more details but creating a contrived example is not straightforward.
EDIT: EXPLAIN ANALYZE run on the 2 subqueries and whole queries
subquery with function: http://explain.depesz.com/s/UHCF
subquery with direct query: http://explain.depesz.com/s/q5Q
whole query with function: http://explain.depesz.com/s/ZDt
whole query with direct query: http://explain.depesz.com/s/R2f
just the function body, using one set of arguments: http://explain.depesz.com/s/mEp

Just a wild guess: your query's range-table is exceeding the join_collapse_limit, causing a suboptimal plan to be used.
Try moving the subquery body (the equivalent of the function) into a CTE, to keep it intact. (In PostgreSQL versions before 12, CTEs are always executed as written and never broken up by the query generator/planner; from version 12 on, declare the CTE AS MATERIALIZED to get the same effect.)
Pre-calculating parts of the query into (TEMP) tables or materialised views can also help to reduce the number of RTEs (range table entries).
You could (temporarily) increase join_collapse_limit, but this will cost more planning time, and there certainly is a limit to this (the number of possible plans grows exponentially with the size of the range table.)
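To get a feel for that growth, here is a back-of-the-envelope calculation in Python. It is a deliberate simplification: it counts only the possible left-deep join orders (n! for n relations) and ignores bushy trees, join methods and the pruning a real planner performs, so the figures are a rough guide to the search space, not PostgreSQL's actual plan count. The value 8 is PostgreSQL's default join_collapse_limit.

```python
from math import factorial


def left_deep_join_orders(n_relations: int) -> int:
    """Number of orderings in which n relations can be joined one after another."""
    return factorial(n_relations)


for n in (4, 8, 12, 16):
    print(f"{n:2d} relations -> {left_deep_join_orders(n):,} left-deep join orders")
```

Going from 8 to 12 relations multiplies the number of orders by 11,880, which is why raising the limit quickly becomes expensive in planning time.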
Normally, you can detect this behaviour by the bad query plan (like here: fewer index scans), but you'll need knowledge of the schema, and there must be some kind of reasonable plan possible (read: PK/FK and indices must be correct, too)

Use of function calls in stored procedure sql server 2005?

Does the use of function calls in the WHERE clause of a stored procedure slow down performance in SQL Server 2005?
SELECT * FROM Member M
WHERE LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'administrator'
AND LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'moderator'
In this query GetLookupDetailTitle is a user-defined function and LOWER() is a built-in function. I am asking about both.

Yes.
Both of these are practices to be avoided where possible.
Applying almost any function to a column makes the expression unsargable, which means an index seek cannot be used, and even if the column is not indexed it makes cardinality estimates incorrect for the rest of the plan.
Additionally your dbo.GetLookupDetailTitle scalar function looks like it does data access and this should be inlined into the query.
The query optimiser does not inline logic from scalar UDFs and your query will be performing this lookup for each row in your source data, which will effectively enforce a nested loops join irrespective of its suitability.
Additionally this will actually happen twice per row because of the 2 function invocations. You should probably rewrite as something like
SELECT M.* /*But don't use * either, list columns explicitly... */
FROM Member M
WHERE NOT EXISTS(SELECT *
FROM MemberRoles R
WHERE R.MemberId = M.MemberId
AND R.RoleId IN (1,2)
)
Don't be tempted to replace the literal values 1,2 with variables with more descriptive names as this too can mess up cardinality estimates.

Applying a function to a column in a WHERE clause generally forces a scan of the table or index.
There's no way to use an index since the engine can't know what the result will be until it runs the function on every row in the table.
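The effect is easy to observe in a query plan. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for SQL Server's execution plan; the Member table, its RoleTitle column and the index name are assumptions invented for the demonstration, and an equality test is used instead of the question's != to make the contrast clear. SQLite reports SEARCH when it can seek on an index and SCAN when it must read every entry.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Member (MemberId INTEGER PRIMARY KEY, RoleTitle TEXT);
    CREATE INDEX ix_member_role ON Member (RoleTitle);
""")
con.executemany("INSERT INTO Member (RoleTitle) VALUES (?)",
                [("administrator",), ("moderator",), ("member",)] * 100)


def plan(sql):
    """Return the planner's one-line description of each step of the query."""
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]


# Bare column: the planner can look the value up in the index.
bare = plan("SELECT MemberId FROM Member WHERE RoleTitle = 'administrator'")

# Column wrapped in a function: the planner has to evaluate LOWER() row by row.
wrapped = plan("SELECT MemberId FROM Member WHERE LOWER(RoleTitle) = 'administrator'")

print(bare)
print(wrapped)
```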

You can avoid both the user-defined function and the built-in by either
defining "magic" values for the administrator and moderator roles and comparing Member.RoleId against these scalars, or
defining IsAdministrator and IsModerator flags on a MemberRole table and joining it with Member to filter on those flags
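Both options can be sketched as follows, using Python's built-in sqlite3 module. The schema is hypothetical: the column layout of MemberRole, the role ids 1 and 2, and the sample members are all assumptions, since the original tables are not shown.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE MemberRole (
        RoleId INTEGER PRIMARY KEY,
        Title TEXT NOT NULL,
        IsAdministrator INTEGER NOT NULL DEFAULT 0,
        IsModerator INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE Member (
        MemberId INTEGER PRIMARY KEY,
        Name TEXT NOT NULL,
        RoleId INTEGER NOT NULL REFERENCES MemberRole (RoleId)
    );
    INSERT INTO MemberRole VALUES (1, 'Administrator', 1, 0),
                                  (2, 'Moderator', 0, 1),
                                  (3, 'Member', 0, 0);
    INSERT INTO Member VALUES (1, 'ann', 1), (2, 'bob', 2), (3, 'cy', 3), (4, 'di', 3);
""")

# Option 1: compare RoleId against agreed constant values (assumed to be 1 and 2).
by_id = con.execute("""
    SELECT Name FROM Member
    WHERE RoleId NOT IN (1, 2)
    ORDER BY MemberId
""").fetchall()

# Option 2: join to the role table and filter on its flag columns.
by_flag = con.execute("""
    SELECT M.Name
    FROM Member M
    INNER JOIN MemberRole R ON R.RoleId = M.RoleId
    WHERE R.IsAdministrator = 0 AND R.IsModerator = 0
    ORDER BY M.MemberId
""").fetchall()

print(by_id, by_flag)
```

Neither query calls a function on a column, so both leave the optimizer free to use indexes on RoleId.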

Question on Query execution

In the query below, if the Patients table has 1000 records, how many times does TableValueFunction execute? Only once, or 1000 times?
This is a query in a stored procedure. Do you have a better idea to improve it?
SELECT * FROM Patients
WHERE Patients.Id IN (SELECT PatientId FROM TableValueFunction(parameters..))

It depends on what you are using as parameters. If the parameters are constants the function will execute one time but if the parameters are fields from Patients the function will execute as many times as there are rows in table Patients.

To some extent it depends on whether you are talking about an inline TVF or a multi statement one.
A multi statement TVF is totally opaque to the query optimiser. It always assumes that it will return 1 row and it will not get expanded out into the main query.
Because of the 1 row assumption then if your Patients table is indexed on PatientId you will probably get a nested loops join with the TVF as the driving table meaning that it is only executed once.
If it is not indexed and you get a hash or merge join both of these methods only process both inputs once.
An inline TVF gets merged into the query itself. So the function itself is never executed as such. However SQL Server can then refer to cardinality information and might order the plan such that the query contained in the TVF appears on the inner side of a nested loops join and has a number of executions greater than one.

Multi-statement Table Valued Function vs Inline Table Valued Function

A few examples to show, just in case:
Inline Table Valued
CREATE FUNCTION MyNS.GetUnshippedOrders()
RETURNS TABLE
AS
RETURN SELECT a.SaleId, a.CustomerID, b.Qty
FROM Sales.Sales a INNER JOIN Sales.SaleDetail b
ON a.SaleId = b.SaleId
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.ShipDate IS NULL
GO
Multi Statement Table Valued
CREATE FUNCTION MyNS.GetLastShipped(@CustomerID INT)
RETURNS @CustomerOrder TABLE
(SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL)
AS
BEGIN
DECLARE @MaxDate DATETIME
SELECT @MaxDate = MAX(OrderDate)
FROM Sales.SalesOrderHeader
WHERE CustomerID = @CustomerID
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
-- OrderQty and ProductID are columns of the detail table, not the header
FROM Sales.SalesOrderHeader a INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c ON b.ProductID = c.ProductID
WHERE a.OrderDate = @MaxDate
AND a.CustomerID = @CustomerID
RETURN
END
GO
Is there an advantage to using one type (inline or multi-statement) over the other? Are there certain scenarios when one is better than the other, or are the differences purely syntactical? I realise the two example queries are doing different things, but is there a reason I would write them in that way?
I have read about them, but the advantages/differences haven't really been explained.

In researching Matt's comment, I have revised my original statement. He is correct, there will be a difference in performance between an inline table valued function (ITVF) and a multi-statement table valued function (MSTVF) even if they both simply execute a SELECT statement. SQL Server will treat an ITVF somewhat like a VIEW in that it will calculate an execution plan using the latest statistics on the tables in question. A MSTVF is equivalent to stuffing the entire contents of your SELECT statement into a table variable and then joining to that. Thus, the compiler cannot use any table statistics on the tables in the MSTVF. So, all things being equal, (which they rarely are), the ITVF will perform better than the MSTVF. In my tests, the performance difference in completion time was negligible however from a statistics standpoint, it was noticeable.
In your case, the two functions are not functionally equivalent. The MSTV function does an extra query each time it is called and, most importantly, filters on the customer id. In a large query, the optimizer would not be able to take advantage of other types of joins as it would need to call the function for each customerId passed. However, if you re-wrote your MSTV function like so:
CREATE FUNCTION MyNS.GetLastShipped()
RETURNS @CustomerOrder TABLE
(
SaleOrderID INT NOT NULL,
CustomerID INT NOT NULL,
OrderDate DATETIME NOT NULL,
OrderQty INT NOT NULL
)
AS
BEGIN
INSERT @CustomerOrder
SELECT a.SalesOrderID, a.CustomerID, a.OrderDate, b.OrderQty
FROM Sales.SalesOrderHeader a
-- OrderQty and ProductID are columns of the detail table, not the header
INNER JOIN Sales.SalesOrderDetail b
ON a.SalesOrderID = b.SalesOrderID
INNER JOIN Production.Product c
ON b.ProductID = c.ProductID
WHERE a.OrderDate = (
SELECT MAX(SH1.OrderDate)
FROM Sales.SalesOrderHeader AS SH1
WHERE SH1.CustomerID = a.CustomerID
)
RETURN
END
GO
In a query, the optimizer would be able to call that function once and build a better execution plan, but it still would not be better than an equivalent, non-parameterized ITVF or a VIEW.
ITVFs should be preferred over MSTVFs when feasible because the datatypes, nullability and collation are taken from the columns in the underlying tables, whereas you have to declare those properties yourself in a multi-statement table-valued function and, importantly, you will get better execution plans from the ITVF. In my experience, I have not found many circumstances where an ITVF was a better option than a VIEW, but mileage may vary.
Thanks to Matt.
Addition
Since I saw this come up recently, here is an excellent analysis done by Wayne Sheffield comparing the performance difference between Inline Table Valued functions and Multi-Statement functions.
His original blog post.
Copy on SQL Server Central

Internally, SQL Server treats an inline table valued function much like it would a view and treats a multi-statement table valued function similar to how it would a stored procedure.
When an inline table-valued function is used as part of an outer query, the query processor expands the UDF definition and generates an execution plan that accesses the underlying objects, using the indexes on these objects.
For a multi-statement table valued function, an execution plan is created for the function itself and stored in the execution plan cache (once the function has been executed the first time). If multi-statement table valued functions are used as part of larger queries then the optimiser does not know what the function returns, and so makes some standard assumptions - in effect it assumes that the function will return a single row, and that the returns of the function will be accessed by using a table scan against a table with a single row.
Where multi-statement table valued functions can perform poorly is when they return a large number of rows and are joined against in outer queries. The performance issues are primarily down to the fact that the optimiser will produce a plan assuming that a single row is returned, which will not necessarily be the most appropriate plan.
As a general rule of thumb we have found that where possible inline table valued functions should be used in preference to multi-statement ones (when the UDF will be used as part of an outer query) due to these potential performance issues.

There is another difference. An inline table-valued function can be inserted into, updated, and deleted from - just like a view. Similar restrictions apply - can't update functions using aggregates, can't update calculated columns, and so on.

Your examples, I think, answer the question very well. The first function can be done as a single select, and is a good reason to use the inline style. The second could probably be done as a single statement (using a sub-query to get the max date), but some coders may find it easier to read or more natural to do it in multiple statements as you have done. Some functions just plain can't get done in one statement, and so require the multi-statement version.
I suggest using the simplest (inline) whenever possible, and using multi-statement functions when necessary (obviously) or when personal preference/readability makes it worth the extra typing.

Another case for using a multi-statement function would be to prevent SQL Server from pushing down the WHERE clause.
For example, I have a table of table names, and some table names are formatted like C05_2019 and C12_2018, and all tables formatted that way have the same schema. I wanted to merge all that data into one table and parse out 05 and 12 into a CompNo column and 2018 and 2019 into a year column. However, there are other tables like ACA_StupidTable from which I cannot extract CompNo and CompYr, and I would get a conversion error if I tried. So, my query was in two parts: an inner query that returned only tables formatted like 'C_______', then an outer query that did a substring and int conversion, i.e. Cast(Substring(2, 2) as int) as CompNo. All looks good, except that SQL Server decided to apply my Cast before the results were filtered, and so I got a mind-scrambling conversion error. A multi-statement table function may prevent that from happening, since it is basically a "new" table.

Look at "Comparing Inline and Multi-Statement Table-Valued Functions"; you can find good descriptions and performance benchmarks there.

I have not tested this, but a multi statement function caches the result set. There may be cases where there is too much going on for the optimizer to inline the function. For example suppose you have a function that returns a result from different databases depending on what you pass as a "Company Number". Normally, you could create a view with a union all then filter by company number but I found that sometimes sql server pulls back the entire union and is not smart enough to call the one select. A table function can have logic to choose the source.

Maybe in a very condensed way:
ITVF (inline TVF): more for a DB person; it is a kind of parameterized view that takes a single SELECT statement.
MTVF (multi-statement TVF): more for a developer; it creates and loads a table variable.

If you are going to do a query, you can join to your inline table-valued function like:
SELECT
a.*,b.*
FROM AAAA a
INNER JOIN MyNS.GetUnshippedOrders() b ON a.z=b.z
It will incur little overhead and run fine.
If you try to use your multi-statement table-valued function in a similar query, you will have performance issues:
SELECT
x.a,x.b,x.c,(SELECT OrderQty FROM MyNS.GetLastShipped(x.CustomerID)) AS Qty
FROM xxxx x
because you will execute the function once for each row returned. As the result set gets large, it will run slower and slower.
