SQL update scalar function - sql

When I run this query:
UPDATE myTable SET x = (SELECT round(random()*100));
All records have the same value for x. This makes sense.
When I run this query:
UPDATE myTable SET x = round(random()*100);
It updates all records in the table and for each record the value of x is different.
I would like to know whats happening in the background for the second query. Is it running a update query for each record id(1....n)?
I am guessing it works similar to triggers, where for each row before updating
the trigger intercepts
calls the function and sets value of x
executes query
What is actually happening?

It's rather simple, really. The function random() is defined VOLATILE.
If you put it into a subquery, you generate a derived table with a single row. Postgres can materialize the result and reuse it in the outer query many times.
Otherwise Postgres calls the function for every row. That's how VOLATILE functions have to be treated. Per documentation on function volatility:
A VOLATILE function can do anything, including modifying the database.
It can return different results on successive calls with the same
arguments. The optimizer makes no assumptions about the behavior of
such functions. A query using a volatile function will re-evaluate the
function at every row where its value is needed.
Bold emphasis mine.

Related

Replace SQL User defined Function with Set operation to Update

I have a table say, TestTable which has columns StartDate, 1stDeliveryDate, DeliveryFrequency, HolidayCalenderId(which links to other table calender).
Now i have to add a column FinalDeliveryDate which needs some calulation using the other fields mentioned above.I have managed to write a Function which does the calculations and updates the field. The final query to update looks something like,
Update TestTable
Set FinalDeliveryDate = dbo.GetCalculatedFinalDeliveryDate(StartDate, 1stDeliveryDate, HolidayCalenderId)
WHere dbo.GetCalculatedFinalDeliveryDate(StartDate, 1stDeliveryDate, HolidayCalenderId) >= GETDATE
This example above is just to explain the scenario how the function is used. Now the size of TestTable is increasing rapidly as the business is growing. I am worried about performance as the query needs to run quite frequently and on all the rows in the test table. Just to clarify, I am a developer and not a DBA, but after researching about functions, I know its best to avoid them in Select and Where clause. If possible can someone point me in the right direction for how to remove this function and do a set operation to update the rows.. I am not able to think of any other way to replace this fucntion..

TSQL: Is there a way to limit the rows returned and count the total that would have been returned without the limit (without adding it to every row)?

I'm working to update a stored procedure that current selects up to n rows, if the rows returned = n, does a select count without the limit, and then returns the original select and the total impacted rows.
Kinda like:
SELECT TOP (#rowsToReturn)
A.data1,
A.data2
FROM
mytable A
SET #maxRows = ##ROWCOUNT
IF #rowsToReturn = ##ROWCOUNT
BEGIN
SET #maxRows = (SELECT COUNT(1) FROM mytableA)
END
I'm wanting reduce this to a single select statement. Based on this question, COUNT(*) OVER() allows this, but it is put on every single row instead of in an output parameter. Maybe something like FOUND_ROWS() in MYSQL, such as a ##TOTALROWCOUNT or such.
As a side note, since the actual select has an order by, the data base will need to already traverse the entire set (to make sure that it gets the correct first n ordered records), so the database should already have this count somewhere.
As #MartinSmith mentioned in a comment on this question, there is no direct (i.e. pure T-SQL) way of getting the total numbers of rows that would be returned while at the same time limiting it. In the past I have done the method of:
dump the query to a temp table to grab ##ROWCOUNT (the total set)
use ROW_NUBMER() AS [ResultID] on the ordered results of the main query
SELECT TOP (n) FROM #Temp ORDER BY [ResultID] or something similar
Of course, the downside here is that you have the disk I/O cost of getting those records into the temp table. Put [tempdb] on SSD? :)
I have also experienced the "run COUNT(*) with the same rest of the query first, then run the regular SELECT" method (as advocated by #Blam), and it is not a "free" re-run of the query:
It is a full re-run in many cases. The issue is that when doing COUNT(*) (hence not returning any fields), the optimizer only needs to worry about indexes in terms of the JOIN, WHERE, GROUP BY, ORDER BY clauses. But when you want some actual data back, that could change the execution plan quite a bit, especially if the indexes used to get the COUNT(*) are not "covering" for the fields in the SELECT list.
The other issue is that even if the indexes are all the same and hence all of the data pages are still in cache, that just saves you from the physical reads. But you still have the logical reads.
I'm not saying this method doesn't work, but I think the method in the Question that only does the COUNT(*) conditionally is far less stressful on the system.
The method advocated by #Gordon is actually functionally very similar to the temp table method I described above: it dumps the full result set to [tempdb] (the INSERTED table is in [tempdb]) to get the full ##ROWCOUNT and then it gets a subset. On the downside, the INSTEAD OF TRIGGER method is:
a lot more work to set up (as in 10x - 20x more): you need a real table to represent each distinct result set, you need a trigger, the trigger needs to either be built dynamically, or get the number of rows to return from some config table, or I suppose it could get it from CONTEXT_INFO() or a temp table. Still, the whole process is quite a few steps and convoluted.
very inefficient: first it does the same amount of work dumping the full result set to a table (i.e. into the INSERTED table--which lives in [tempdb]) but then it does an additional step of selecting the desired subset of records (not really a problem as this should still be in the buffer pool) to go back into the real table. What's worse is that second step is actually double I/O as the operation is also represented in the transaction log for the database where that real table exists. But wait, there's more: what about the next run of the query? You need to clear out this real table. Whether via DELETE or TRUNCATE TABLE, it is another operation that shows up (the amount of representation based on which of those two operations is used) in the transaction log, plus is additional time spent on the additional operation. AND, let's not forget about the step that selects the subset out of INSERTED into the real table: it doesn't have the opportunity to use an index since you can't index the INSERTED and DELETED tables. Not that you always would want to add an index to the temp table, but sometimes it helps (depending on the situation) and you at least have that choice.
overly complicated: what happens when two processes need to run the query at the same time? If they are sharing the same real table to dump into and then select out of for the final output, then there needs to be another column added to distinguish between the SPIDs. It could be ##SPID. Or it could be a GUID created before the initial INSERT into the real table is called (so that it can be passed to the INSTEAD OF trigger via CONTEXT_INFO() or a temp table). Whatever the value is, it would then be used to do the DELETE operation once the final output has been selected. And if not obvious, this part influences a performance issue brought up in the prior bullet: TRUNCATE TABLE cannot be used as it clears the entire table, leaving DELETE FROM dbo.RealTable WHERE ProcessID = #WhateverID; as the only option.
Now, to be fair, it is possible to do the final SELECT from within the trigger itself. This would reduce some of the inefficiency as the data never makes it into the real table and then also never needs to be deleted. It also reduces the over-complication as there should be no need to separate the data by SPID. However, this is a very time-limited solution as the ability to return results from within a trigger is going bye-bye in the next release of SQL Server, so sayeth the MSDN page for the disallow results from triggers Server Configuration Option:
This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible. We recommend that you set this value to 1.
The only actual way to do:
the query one time
get a subset of rows
and still get the total row count of the full result set
is to use .Net. If the procs are being called from app code, please see "EDIT 2" at the bottom. If you want to be able to randomly run various stored procedures via ad hoc queries, then it would have to be a SQLCLR stored procedure so that it could be generic and work for any query as stored procedures can return dynamic result sets and functions cannot. The proc would need at least 3 parameters:
#QueryToExec NVARCHAR(MAX)
#RowsToReturn INT
#TotalRows INT OUTPUT
The idea is to use "Context Connection = true;" to make use of the internal / in-process connection. You then do these basic steps:
call ExecuteDataReader()
before you read any rows, do a GetSchemaTable()
from the SchemaTable you get the result set field names and datatypes
from the result set structure you construct a SqlDataRecord
with that SqlDataRecord you call SqlContext.Pipe.SendResultsStart(_DataRecord)
now you start calling Reader.Read()
for each row you call:
Reader.GetValues()
DataRecord.SetValues()
SqlContext.Pipe.SendResultRow(_DataRecord)
RowCounter++
Rather than doing the typical "while (Reader.Read())", you instead include the #RowsToReturn param: while(Reader.Read() && RowCounter < RowsToReturn.Value)
After that while loop, call SqlContext.Pipe.SendResultsEnd() to close the result set (the one that you are sending, not the one you are reading)
then do a second while loop that cycles through the rest of the result, but never gets any of the fields:
while (Reader.Read())
{
RowCounter++;
}
then just set TotalRows = RowCounter; which will pass back the number of rows for the full result set, even though you only returned the top n rows of it :)
Not sure how this performs against the temp table method, the dual call method, or even #M.Ali's method (which I have also tried and kinda like, but the question was specific to not sending the value as a column), but it should be fine and does accomplish the task as requested.
EDIT:
Even better! Another option (a variation on the above C# suggestion) is to use the ##ROWCOUNT from the T-SQL stored procedure, sent as an OUTPUT parameter, rather than cycling through the rest of the rows in the SqlDataReader. So the stored procedure would be similar to:
CREATE PROCEDURE SchemaName.ProcName
(
#Param1 INT,
#Param2 VARCHAR(05),
#RowCount INT OUTPUT = -1 -- default so it doesn't have to be passed in
)
AS
SET NOCOUNT ON;
{any ol' query}
SET #RowCount = ##ROWCOUNT;
Then, in the app code, create a new SqlParameter, Direction = Output, for "#RowCount". The numbered steps above stay the same, except the last two (10 and 11), which change to:
Instead of the 2nd while loop, just call Reader.Close()
Instead of using the RowCounter variable, set TotalRows = (int)RowCountOutputParam.Value;
I have tried this and it does work. But so far I have not had time to test the performance against the other methods.
EDIT 2:
If the T-SQL stored procs are being called from the app layer (i.e. no need for ad hoc execution) then this is actually a much simpler variation of the above C# methods. In this case you don't need to worry about the SqlDataRecord or the SqlContext.Pipe methods. Assuming you already have a SqlDataReader set up to pull back the results, you just need to:
Make sure the T-SQL stored proc has a #RowCount INT OUTPUT = -1 parameter
Make sure to SET #RowCount = ##ROWCOUNT; immediately after the query
Register the OUTPUT param as a SqlParameter having Direction = Output
Use a loop similar to: while(Reader.Read() && RowCounter < RowsToReturn) so that you can stop retrieving results once you have pulled back the desired amount.
Remember to not limit the result in the stored proc (i.e. no TOP (n))
At that point, just like what was mentioned in the first "EDIT" above, just close the SqlDataReader and grab the .Value of the OUTPUT param :).
How about this....
DECLARE #N INT = 10
;WITH CTE AS
(
SELECT
A.data1,
A.data2
FROM mytable A
)
SELECT TOP (#N) * , (SELECT COUNT(*) FROM CTE) Total_Rows
FROM CTE
The last column will be populated with the total number of rows it would have returned without the TOP Clause.
The issue with your requirement is, you are expecting a SINGLE select statement to return a table and also a scalar value. which is not possible.
A Single select statement will return a table or a scalar value. OR you can have two separate selects one returning a Scalar value and other returning a scalar. Choice is yours :)
Just because you think TSQL should have a row count because of a sort doe not mean it does. And if it does it does it is not currently sharing it with the outside world.
What you are missing is this is very efficient
select count(*)
from ...
where ...
select top x
from ...
where ...
order by ...
With the count(*) unless the query is just plain ugly those indexes should be in memory.
It has to perform a count to sort based on what?
Did you actually evaluate any query plans?
If TSQL has to perform a sort then explain the following.
Why is the count(*) 100% of the cost when the second had to do a count anyway?
Just where in that second query plan is there a free opportunity to count?
Why are those query plans so different if they both need to count?
I think there is an arcane way to do what you want. It involves triggers and non-temporary tables. And, I should mention, although I have implemented each piece (for different purposes), I have never put them together for this purpose.
The idea starts with this Stack Overflow question. According to this source, ##ROWCOUNT counts the number of attempted inserts, even when they don't really happen. Now, I must admit that a perusal of available documentation doesn't seem to touch on this topic, so this may or may not be "correct" behavior. This method is relying on this "problem".
So, you could do what you want by:
Creating a new table for the output -- but not a table variable or a temporary table.
Creating an "instead of" trigger that prevents more than #maxRows from going into the table.
Select the query results into the table.
Read ##ROWCOUNT after the select.
Note that you can create the table and trigger using dynamic SQL. You could also create it once, and have the trigger read the #maxRows value from some sort of parameter table. As mentioned before, this needs to be a real table that supports triggers.

How to cope with null results in SQL Tasks that return single rows in SSIS 2005?

In a dataflow task, I can slip a rowcount into the processing flow and place the count into a variable. I can later use that variable to conditionally perform some other work if the rowcount was > 0. This works well for me, but I have no corresponding strategy for sql tasks expected to return a single row. In that event, I'm returning those values into variables. If the lookup produces no rows, the sql task fails when assigning values into those variables. I can branch on that component failing, but there's a side effect of that - if I'm running the job as a SQL server agent job step, the step returns DTSER_FAILURE, causing the step to fail. I can tell the sql agent to disregard the step failure, but then I won't know if I have a legitimate error in that step. This seems harder than it should be.
The only strategy I can think of is to run the same query with a count(*) aggregate and test if that returns a number > 0 and if so running the query again without the count. That's ugly because I have the same query in two places that I need to keep in sync.
Is there a better way?
In that same condition you can have additional logic (&& or ||). I would take one of the variables for your single statement and say something to the effect:
If #User::rowcount>0 || #User:single_record_var!=Default
That should help.
What kind of SQL statement? Can you change it to still return a single row with all NULLs instead of no rows?
What stops it from returning more than one row? The package would fail if it ended up returning more than one row, right?
You could also change it to call a stored procedure, and then call the stored procedure in two places without code duplication. You could also change it to be a view or user-defined function (if parameters are needed), SELECT COUNT(*) FROM udf() to check if there is data, SELECT * FROM udf() to get the row.

Incorrect Results when calling a Python UDF in Redshift multiple times within a single column inside a select statement

I am encountering an issue in Redshift where calling a UDF more than once per column inside a select statement is returning the same result as the first call to that UDF.
Bit of Background
I have a very simple Python UDF that calculates an md5 hash. The reason for this function is to be able to handle UTF-16/UTF-8 conversion before doing the hash so it is consistent with SQL server. Now the syntax or logic inside the function does not seem to be the issue as we have tried creating even simpler functions that produce the same behavior.
The Problem
My function is named MD5_UTF16 and is called by doing MD5_UTF16(yourvalue), and returns a hash string / hexdigest of the value you pass into the argument.
In my query I need to be able to do this (postgresql syntax):
SELECT MD5_UTF16(column1) || MD5_UTF16(column2)|| MD5_UTF16(column3) AS concatenatedhash
FROM MyTable
i.e. I need to calculate each hash and concatenate them in a single column. If I calculated each of those hashes separately in their own columns, the function generates the correct hashes for those columns. However, in my example above I have called each function and concatenated the results with the results of the other calls. In this scenario what is happening is all the calls to the functions are returning the hash for the first call i.e. MD5_UTF16(column1).
To clarify a bit further using example hash values. Let's pretend these are the hashes for each of the columns above:
Column 1: 275AB169CBEE4550F752C634B9335AE0
Column 2: B2214041A94F50B027FE1DEEC4C8474C
Column 3: 91050DAEFFEE20CDA2FC9914B6E4EBE9
My expected result for the concatenatedhash column would be a simple concatenation of the strings above (275AB169CBEE4550F752C634B9335AE0B2214041A94F50B027FE1DEEC4C8474C91050DAEFFEE20CDA2FC9914B6E4EBE9)
Instead, what I am getting is a concatenation of column 1's hash 3 times:
(275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0)
In my SELECT statement if I had called the function on column 2 (instead of column 1) first, then it would be the hash for column 2 that is repeated.
Has anyone encountered this before?
NOTE: You can only replicate this behavior if you are selecting data out of a table. So doing a:
SELECT MD5_UTF16('hard-coded value 1') || MD5_UTF16('hard-coded value 2')
with no table source will not replicate this behavior.
Work-arounds I am aware of
I do know of a possible workaround but I still would have expected my method above to work, so this question is not about applying the following workaround, but more understanding why the above method is not working.
- Workaround: Calculate each hash in a separate column first then concatenate them. This will have potential performance implications on our end among other things.
EDIT 1
Have found that the issue I've described only happens when there is a join in my query.. even if none of the column data from the joined table are being used in my UDF calls i.e.
SELECT ...concatenated hashes..
FROM table1
JOIN table2 ...
Removing the join seems to cause the hashes to be calculated correctly. Will attempt a workaround using this new knowledge. Not sure if it has anything to do with the execution plan running the UDF's differently when a join is involved - even though none of the column data from the joined table is being used for the UDF calls.

Can I maintain state between calls to a SQL Server UDF?

I have a SQL script that inserts data (via INSERT statements currently numbering in the thousands) One of the columns contains a unique identifier (though not an IDENTITY type, just a plain ol' int) that's actually unique across a few different tables.
I'd like to add a scalar function to my script that gets the next available ID (i.e. last used ID + 1) but I'm not sure this is possible because there doesn't seem to be a way to use a global or static variable from within a UDF, I can't use a temp table, and I can't update a permanent table from within a function.
Currently my script looks like this:
declare #v_baseID int
exec dbo.getNextID #v_baseID out --sproc to get the next available id
--Lots of these - where n is a hardcoded value
insert into tableOfStuff (someStuff, uniqueID) values ('stuff', #v_baseID + n )
exec dbo.UpdateNextID #v_baseID + lastUsedn --sproc to update the last used id
But I would like it to look like this:
--Lots of these
insert into tableOfStuff (someStuff, uniqueID) values ('stuff', getNextID() )
Hardcoding the offset is a pain in the arse, and is error prone. Packaging it up into a simple scalar function is very appealing, but I'm starting to think it can't be done that way since there doesn't seem to be a way to maintain the offset counter between calls. Is that right, or is there something I'm missing.
We're using SQL Server 2005 at the moment.
edits for clarification:
Two users hitting it won't happen. This is an upgrade script that will be run only once, and never concurrently.
The actual sproc isn't prefixed with sp_, fixed the example code.
In normal usage, we do use an id table and a sproc to get IDs as needed, I was just looking for a cleaner way to do it in this script, which essentially just dumps a bunch of data into the db.
I'm starting to think it can't be done that way since there doesn't seem to be a way to maintain the offset counter between calls. Is that right, or is there something I'm missing.
You aren't missing anything; SQL Server does not support global variables, and it doesn't support data modification within UDFs. And even if you wanted to do something as kludgy as using CONTEXT_INFO (see http://weblogs.sqlteam.com/mladenp/archive/2007/04/23/60185.aspx), you can't set that from within a UDF anyway.
Is there a way you can get around the "hardcoding" of the offset by making that a variable and looping over the iteration of it, doing the inserts within that loop?
If you have 2 users hitting it at the same time they will get the same id. Why didn't you use an id table with an identity instead, insert into that and use that as the unique (which is guaranteed) id, this will also perform much faster
sp_getNextID
never ever prefix procs with sp_, this has performance implication because the optimizer first checks the master DB to see if that proc exists there and then th local DB, also if MS decide to create a sp_getNextID in a service pack yours will never get executed
It would probably be more work than it's worth, but you can use static C#/VB variables in a SQL CLR UDF, so I think you'd be able to do what you want to do by simply incrementing this variable every time the UDF is called. The static variable would be lost whenever the appdomain unloaded, of course. So if you need continuity of your ID from one day to the next, you'd need a way, on first access of NextId, to poll all of tables that use this ID, to find the highest value.