Which are the best variables to use in stored procedures? - sql-server-2005

I'm often dealing with interfaces between two systems involving data import or data export, so I'm programming some T-SQL procedures. It is often necessary to use some variables inside these procedures to hold values or single records.
The last time, I set up some temp tables, e.g. one named #tmpGlobals and another named #tmpOutput. The names don't matter, but I eliminated the use of declaring variables such as @MainID int or the like.
Is this a good idea? Is it a performance issue?

As Alexander suggests, it really depends. I won't draw hard lines in the sand about number of rows, because it can also depend on the data types and hence the size of each row. Where one will make more sense than the other in your environment can depend on several factors aside from just the size of the data, including access patterns, sensitivity of performance to recompiles, your hardware, etc.
There is a common misconception that @table variables are only in memory, do not incur I/O, do not use tempdb, etc. While in certain isolated cases some of this is true, it is not something you can or should rely on.
Some other limitations of @table variables that may prevent your use of them, even for small data sets (a quick sketch follows the list):
cannot index (other than primary key / unique constraint declarations on creation)
no statistics are maintained (unlike #temp tables)
cannot ALTER
cannot use as INSERT EXEC target in SQL Server 2005 (this restriction was lifted in 2008)
cannot use as SELECT INTO target
cannot truncate
can't use an alias type in definition
no parallelism
not visible to nested procs (unlike #temp tables)
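A quick sketch of the indexing difference on SQL Server 2005 (the table and column names here are made up purely for illustration):

DECLARE @WorkItems TABLE
(
    ItemID int NOT NULL PRIMARY KEY,   -- the only way to get an index: declare it inline
    ItemValue varchar(50) NULL
)

CREATE TABLE #WorkItems
(
    ItemID int NOT NULL,
    ItemValue varchar(50) NULL
)
CREATE CLUSTERED INDEX IX_WorkItems ON #WorkItems (ItemID)   -- fine on a #temp table
ALTER TABLE #WorkItems ADD ItemStatus tinyint NULL           -- also fine; impossible on @WorkItems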

It really depends on the amount of data. If you're working with under 100 records or so, then DECLARE @MainID or the like is probably better, since it's a smaller amount of data. For anything much over 100 records, though, you should definitely use #tmpGlobals or similar, since a temp table is handled better by SQL Server's memory management.
EDIT: It's not bad to use #tmpGlobals for smaller sets; there is just not much of a performance loss or gain compared with DECLARE @MainID. You will see a performance gain from #tmpGlobals over DECLARE @MainID on a high number of records.
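To make the two approaches concrete, a rough sketch (dbo.ImportBatch and its columns are invented for the example):

-- A single value: a scalar variable
DECLARE @MainID int
SELECT @MainID = MainID FROM dbo.ImportBatch WHERE BatchName = 'Current'

-- A working set of rows: a temp table, as in the question
CREATE TABLE #tmpGlobals
(
    MainID     int      NOT NULL,
    ImportDate datetime NOT NULL
)
INSERT #tmpGlobals (MainID, ImportDate)
SELECT MainID, ImportDate FROM dbo.ImportBatch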

In general, you should choose the reverse if possible. It depends on whether you need to store a set of items or just result values.
Scoped variables other than table variables are relatively cheap. Values that fit into typed, non-table variables are faster to work with than the same values stored as single rows in a table.
Table variables and temp tables tend to be quite expensive. They may require space in tempdb and also offer no optimizations by default. In addition, table variables should be avoided for large sets of data. When processing large sets, you can apply indexes and define primary keys on temp tables if you wish, but you cannot do this for table variables. Finally, temp tables need cleanup before exiting scope.
For parameters, table variables are useful for return sets from functions. Temp tables cannot be returned from functions. Depending on the task at hand, use of functions may make it easier to encapsulate specific portions of work. You may find that some stored procedures are doing work that is better suited to functions, especially if you are reusing but not modifying the results.
Finally, if you just need one-time storage of results in the middle of stored-procedure work, try CTEs. These usually beat out both table variables and temp tables, as SQL Server can make better decisions about how to handle that data for you. Also, as a matter of syntax, they can make your queries more legible.
Using Common Table Expressions (MSDN)
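A minimal CTE sketch (dbo.Orders and dbo.Customers are placeholder tables); the CTE exists only for the single statement that follows it:

;WITH RecentOrders AS
(
    SELECT CustomerID, COUNT(*) AS OrderCount
    FROM dbo.Orders
    WHERE OrderDate >= '20050101'
    GROUP BY CustomerID
)
SELECT c.CustomerName, r.OrderCount
FROM dbo.Customers c
INNER JOIN RecentOrders r ON r.CustomerID = c.CustomerID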
edit: (regarding temp tables)
Local temp tables go away when the query session ends, which can be an indeterminate amount of time in the future. Global temp tables don't go away until the connection that created them is closed and no other users are using the table, which can be even longer. In either case, it is best to drop temp tables once they are no longer needed, on exit of a procedure, to avoid tying up memory and other resources.
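For example, at the end of a procedure you can release a temp table explicitly rather than waiting for the session to end (using the #tmpOutput name from the question):

IF OBJECT_ID('tempdb..#tmpOutput') IS NOT NULL
    DROP TABLE #tmpOutput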
CTEs avert this in many cases because they are scoped only to the single statement in which they are defined; there is nothing left over to clean up when the stored procedure or function that contains them exits.

Related

Ok to use temp tables and table variables in the same stored procedure?

I have one select in my stored procedure that returns 4000 + rows. I was going to make this a temp table to work off the data later in the procedure.
I also have various other selects that only return 100-300 rows. I was going to make these table variables, again to work off later in the procedure.
Is it ok to use temp tables and table variables in the same procedure, or will this cause any performance issues?
Yes, it is ok.
As for programming practice, if I'm reading a stored procedure I would prefer to see one type or the other (and I lean toward table variables). However, if you have a good reason for using one or the other, such as needing an index on a temp table or using it as a SELECT INTO target, then go ahead.
This is where you need to look for a full set of options: sommarskog.se - share_data
Being able to add various indexes to temp tables is one particular reason I'll sometimes choose temporary tables.
If indexes are not required, I'll use table variables to avoid hitting tempdb continuously.
Quite often now I use lots of CTEs that work together and avoid using any sort of materialized table at all.
Classic answer - "it depends!"
I think there are many factors here that we don't know, such as your company's resources, your time-constraints, etc.
Generally speaking, it is fine to use temp tables for this purpose. And the 100-300 rows mentioned in the selects - that's peanuts. No worries.

situations requiring temporary tables in stored procedures

Can anyone explain the situations in which we need to make use of temporary tables in stored procedures?
There are many cases where a complex join can really trip up the optimizer and make it do very expensive things. Sometimes the easiest way to cool the optimizer down is to break the complex query into smaller parts. You'll find a lot of misinformation out there about using a @table variable instead of a #temp table because @table variables always live in memory - this is a myth and don't believe it.
You may also find this worthwhile if you have an outlier query that would really benefit from a different index that is not on the base table, and you are not permitted (or it may be detrimental) to add that index to the base table (it may be an alternate clustered index, for example). A way to get around that would be to put the data in a #temp table (it may be a limited subset of the base table, acting like a filtered index), create the alternate index on the #temp table, and run the join against the #temp table. This is especially true if the data filtered into the #temp table is going to be used multiple times.
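A rough sketch of that pattern (dbo.Orders, dbo.Customers and the date filter are invented for illustration):

-- Pull the narrow subset into a #temp table...
SELECT OrderID, CustomerID, OrderDate
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= '20050101'

-- ...give it the index the base table doesn't (or can't) have...
CREATE CLUSTERED INDEX IX_RecentOrders ON #RecentOrders (CustomerID)

-- ...then run the join(s) against the #temp table, possibly several times.
SELECT c.CustomerName, o.OrderID, o.OrderDate
FROM dbo.Customers c
INNER JOIN #RecentOrders o ON o.CustomerID = c.CustomerID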
There are also times when you need to make many updates against some data, but you don't want to update the base table multiple times. You may have multiple things you need to do against a variety of other data that can't be done in one query. It can be more efficient to put the affected data into a #temp table, perform your series of calculations / modifications, then update back to the base table once instead of n times. If you use a transaction here against the base tables you could be locking them from your users for an extended period of time.
Another example is if you are using linked servers and the join across servers turns out to be very expensive. Instead you can stuff the remote data into a local #temp table first, create indexes on it locally, and run the query locally.
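Something like this sketch (REMOTESRV and the database/table names are placeholders for your linked server setup):

-- Copy only the remote rows you need into a local #temp table
SELECT CustomerID, Region
INTO #RemoteCustomers
FROM REMOTESRV.SalesDB.dbo.Customers
WHERE Region = 'EMEA'

CREATE CLUSTERED INDEX IX_RemoteCustomers ON #RemoteCustomers (CustomerID)

-- The join now runs entirely on the local server
SELECT o.OrderID, rc.Region
FROM dbo.Orders o
INNER JOIN #RemoteCustomers rc ON rc.CustomerID = o.CustomerID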

Temp table or permanent tables?

For my company I am redesigning some stored procedures. The original procedures use lots of permanent tables which are filled during the execution of the procedure, and the values are deleted at the end. The number of rows can range from 100 to 50,000, used for calculating aggregations.
My question is: will there be severe performance issues if I replace those tables with temp tables? Is it feasible to use temp tables?
It depends on how often you're using them, how long the processing takes, and whether you are concurrently accessing data from the tables while writing.
If you use a temp table, it won't be sitting around waiting for indexing and caching while it's not in use, so that should save an ever-so-slight bit of resources. However, you will incur some overhead with the temp tables themselves (i.e. creating and destroying them).
I would re-examine how your queries function in the procedures and consider employing more in-procedure CURSOR operations instead of loading everything into tables and deleting it afterwards.
However, databases are for storing information and retrieving information. I would shy away from using permanent tables for routine temp work and stick with the temp tables.
Overall performance shouldn't be affected by the use case you described in your question.
Hope this helps,
Jeffrey Kevin Pry
Yes, it's certainly feasible. You may want to check whether the permanent tables have any indexing on them to speed up joins and so on.
I agree with Jeffrey. It always depends.
Since you're using SQL Server 2008 you might have a look at table variables.
They should be lighter-weight than temp tables.
I define a User Defined Function which returns a table variable like this:
CREATE FUNCTION dbo.ufd_GetUsers ( @UserCode INT )
RETURNS @UsersTemp TABLE
(
    UserCode INT NOT NULL,
    RoleCode INT NOT NULL
)
AS
BEGIN
    -- Roles assigned to the user through the relations table
    INSERT @UsersTemp
    SELECT
        dbo.UsersRoles.Code, Roles.Code
    FROM
        dbo.UsersRoles
        INNER JOIN dbo.UsersRolesRelations ON dbo.UsersRoles.Code = dbo.UsersRolesRelations.UserCode
        INNER JOIN dbo.UsersRoles Roles ON dbo.UsersRolesRelations.RoleCode = Roles.Code
    WHERE dbo.UsersRoles.Code = @UserCode

    -- The user's own code is also returned as a "role" row
    INSERT @UsersTemp VALUES(@UserCode, @UserCode)
    RETURN
END
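The function is then used like any other table source, e.g. (42 is just an arbitrary example user code):

SELECT UserCode, RoleCode
FROM dbo.ufd_GetUsers(42)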
A big question is: can more than one person run one of these stored procedures at a time? I regularly see this kind of table carried over from old single-user databases (or from programmers who couldn't do subqueries or much of anything beyond SELECT * FROM). What happens if more than one user tries to run the same procedure? What happens if it crashes midway through - does the table get cleaned up? With temp tables or table variables you have the ability to properly scope the table to just the current connection.
Definitely use a temporary table, especially since you've alluded to the fact that its purpose is to assist with calculations and aggregates. If you used a table inside one of your database's schemas all that work is going to be logged - written, backed up, and so on. Using a temporary table eliminates that overhead for data that in the end you probably don't care about.
You might actually save some time from the fact that you can drop the temp tables at the end instead of deleting rows (you said you have multiple users, so you have to delete rather than truncate). Deleting is a logged operation and can add considerable time to the process. If the permanent tables are indexed, then create the temp tables and index them as well. I would bet you would see an increase in performance, unless your tempdb is close to running out of space.
Table variables might also work, but they can't be indexed and they are generally only faster for smaller datasets. So you might try a combination: temp tables for the things that will be large enough to benefit from indexing, and table variables for the smaller items.
An advantage of using temp tables and table variables is that you guarantee one user's process won't interfere with another user's process. You say they currently have a way to identify which records belong to which process, but all it takes is one bug being introduced to break that when using permanent tables. Permanent tables for temporary processing are a very risky choice. Temp tables and table variables can never see the data from someone else's process and are thus a far safer choice.
Table variables are normally the way to go.
SQL2K and below can have significant performance bottlenecks if there are many temp tables being manipulated - the issue is the blocking DDL on the system tables.
SQL 2005 is better, but table vars avoid the whole issue by not using those system tables at all, so they can perform without inter-user locking issues (except those involved with the source data).
The issue then is that table vars only persist within scope, so if there is genuinely a large amount of data that needs to be processed repeatedly and persisted over a (relatively) long duration, then 'static' work tables may actually be faster - but they do need a user key of some sort and regular cleaning. A last resort, really.

Best Practice: One Stored Proc that always returns all fields or different stored procedure for each field set needed?

If I have a table with Field 1, Field 2, Field 3, Field 4, and for one instance I need just Field 1 and Field 2, for another I need Field 3 and Field 4, and yet another needs all of them...
Is it better to have an SP for each combination I need, or one SP that always returns them all?
Very important question:
Writing many stored procs that run the same query will make you spend a lot of time documenting and apologising to future maintainers.
Every time anyone wants to introduce a change, they have to consider whether it should apply to all of the stored procs, to some, or to only one...
I would do only one stored proc.
I would just have one Stored Procedure as it will be easier to maintain.
Does it need to be a Stored Procedure? You could rewrite it as a View then simply select the columns that you need.
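For instance, a rough sketch (the view and table names are invented): one view exposes the full column list and each caller selects only what it needs.

CREATE VIEW dbo.vw_CustomerFields
AS
SELECT Field1, Field2, Field3, Field4
FROM dbo.MyTable
GO

-- One caller needs only the first two fields:
SELECT Field1, Field2 FROM dbo.vw_CustomerFields
-- Another needs the last two:
SELECT Field3, Field4 FROM dbo.vw_CustomerFields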
If network bandwidth and memory usage is more important than hours of work and project simplicity, then make a separate SP for each task. Otherwise there's no point. (the gains aren't that great, and are noticeable only when the rowset is extremely large, or there are a lot of simultaneous requests)
As a general rule it is good practice to select only the columns we need to serve a particular purpose. This is particularly true for tables which have:
lots of columns
LOB columns
sensitive or restricted data
However, if we have a complicated system with lots of tables it is obviously impractical to build a separate stored procedure for each distinct query. In fact it is probably undesirable to do so. The resultant API would be overwhelming to use and a lot of effort to maintain.
The solutions are several and various, and really depend on the nature of the applications. Views can help, although they share some of the same maintenance issues. Dynamic SQL is another approach: we can write complicated procedures which return many different result sets depending on the input parameters. Heck, sometimes we can even write SQL statements in the actual application.
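A rough sketch of the dynamic SQL approach (the procedure, table and column names are invented; the column list is built from a fixed whitelist rather than raw user input):

CREATE PROCEDURE dbo.GetCustomerFields
    @FieldSet varchar(20)    -- e.g. 'basic' or 'address'
AS
BEGIN
    DECLARE @cols nvarchar(200), @sql nvarchar(500)

    SELECT @cols = CASE @FieldSet
                       WHEN 'basic'   THEN N'CustomerID, CustomerName'
                       WHEN 'address' THEN N'CustomerID, Street, City'
                       ELSE N'CustomerID'
                   END

    SET @sql = N'SELECT ' + @cols + N' FROM dbo.Customers'
    EXEC sp_executesql @sql
END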
Oh, and there is the simple procedure which basically wraps a SELECT * FROM some_table but that comes with its own suite of problems.

Why are relational set-based queries better than cursors?

When writing database queries in something like TSQL or PLSQL, we often have a choice of iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.
Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.
Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?
The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each separately in their own thread. SQL engines can do similar things with huge amounts of data in one set-based query.
When you perform cursor-based operations, the engine can only run sequentially and the operation has to be single threaded.
Set based queries are (usually) faster because:
They have more information for the query optimizer to optimize
They can batch reads from disk
There's less logging involved for rollbacks, transaction logs, etc.
Fewer locks are taken, which decreases overhead
Set based logic is the focus of RDBMSs, so they've been heavily optimized for it (often, at the expense of procedural performance)
Pulling data out to the middle tier to process it can be useful, though, because it takes the processing overhead off the DB server (which is the hardest thing to scale, and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transactional logging, built-in locking and blocking, etc. - sometimes these are necessary and useful, other times they're just a waste of resources.
A simple cursor with procedural logic vs. set based example (T-SQL) that will assign an area code based on the telephone exchange:
--Cursor
DECLARE @phoneNumber char(7)
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO @phoneNumber
WHILE @@FETCH_STATUS = 0 BEGIN
    DECLARE @exchange char(3), @areaCode char(3)
    SELECT @exchange = LEFT(@phoneNumber, 3)
    SELECT @areaCode = AreaCode
    FROM AreaCode_Exchange
    WHERE Exchange = @exchange
    IF @areaCode IS NOT NULL BEGIN
        UPDATE Customer SET AreaCode = @areaCode
        WHERE CURRENT OF c
    END
    FETCH NEXT FROM c INTO @phoneNumber
END
CLOSE c
DEALLOCATE c
--Set
UPDATE Customer SET
AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
Customer.AreaCode IS NULL
In addition to the above "let the DBMS do the work" (which is a great solution), there are a couple other good reasons to leave the query in the DBMS:
It's (subjectively) easier to read. When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
It avoids network round trips. Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
It's wasteful. Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
Why wouldn't you? You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 seconds to do the same task without the cursor. In another case I removed a cursor and cut the processing time from over 24 hours to less than a minute. One was an insert that used the VALUES clause instead of a SELECT, and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.
Cursors have their uses (or the code wouldn't be there in the first place), but they should be extremely rare when querying a relational database (except Oracle, which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceding record (running totals). But even that should be tested.
Another limited case for using a cursor is to do some batch processing. If you are trying to do too much at once in set-based fashion, it can lock the table against other users. If you have a truly large set, it may be best to break it up into smaller set-based inserts, updates or deletes that will not hold the lock too long, and then run through the sets using a cursor.
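A rough sketch of that batching idea, here using a plain WHILE loop with TOP rather than an explicit cursor (the table, filter and batch size are invented for illustration):

DECLARE @rows int
SET @rows = 1

WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM dbo.StagingRows
    WHERE ProcessedDate < '20050101'

    SET @rows = @@ROWCOUNT    -- stop once a batch deletes nothing
END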
A third use of a cursor is to run system stored procs over a group of input values. Since this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an administrator to do. I do not recommend doing the same thing with a user-created stored proc in order to process a large batch and to re-use code. It is better to write a set-based version that will be a better performer, as performance should trump code reuse in most cases.
I think the real answer is, like all approaches in programming, that it depends on which one is better. Generally, a set based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:
You are updating a large data set in a database where locking rows is not acceptable (during production hours maybe). A set based update has a possibility of locking a table for several seconds (or minutes), where a cursor (if written correctly) does not. The cursor can meander through the rows updating one at a time and you don't have to worry about affecting anything else.
The advantage of using SQL is that the bulk of the optimization work is handled by the database engine in most circumstances; with the enterprise-class DB engines, the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set-based language: you have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the internal optimizers in the engine can't effectively create an execution path, and guess what happens... your super-powerful box with 32 processors uses a single thread to execute the query because it doesn't know how to do anything else. So you waste processor time on the database server, of which there is generally only one, as opposed to multiple application servers (so, back to reason 1, you run into resource contention with other things that need to run on the database server).
With a row-based language (C#, PHP, Java, etc.), you have more control over what happens: you can retrieve a data set and force it to execute the way you want (separate the data set out to run on multiple threads, etc.). Most of the time it still isn't going to be as efficient as running it on the database engine, because it will still have to access the engine to update each row, but when you have to do 1000+ calculations to update a row (and let's say you have a million rows), a database server can start to have problems.
I think it comes down to using the database as it was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.
Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) RDBMSs are built at least partially on top of ISAM engines. If the question is appropriate, and the veneer is thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of DBMS, before trying it.
As has been said, the database is optimized for set operations. Literally, engineers sat down and debugged/tuned that database for long periods of time. The chances of you out-optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with, like batching disk reads/writes together, caching, and multi-threading. Also, some operations have a high overhead cost, but if you do them to a bunch of data at once the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.
For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely with a cursor you go row by row in one table and then select values you need from another table. Basically it's like a nested loop only without the tightness of the loop (which is most likely compiled into machine language and super optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm, if one table is small, it may turn one table into a hash lookup table and do the join by performing O(1) lookups from one table into the lookup table. There are a number of join strategies that many DBMS have that will beat you looking up values from one table in a cursor.
Just look at the example of creating a hash lookup table. To build the table is probably m operations if you are joining two tables, one of length n and one of length m, where m is the smaller table. Each lookup should be constant time, so that is n operations. So basically the efficiency of a hash join is around m (setup) + n (lookups). If you do it yourself, assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average that equates to m/2 searches). So basically the level of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor). Also, these operation counts are simplifications: depending upon the cursor type, fetching each row of a cursor may be the same as doing another select from the first table.
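To put rough, purely illustrative numbers on that: with m = 1,000 and n = 1,000,000, the hash join costs on the order of 1,000 + 1,000,000, i.e. about a million operations, while the row-by-row approach costs roughly 1,000,000 * 1,000 / 2 = 500 million operations - several hundred times more work, before you even count the per-fetch overhead of the cursor itself.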
Locks also kill you. If you have cursors on a table you are locking up rows (in SQL Server this is less severe for static and forward_only cursors... but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up, but for a lesser amount of time. Also, the optimizer can see what you are doing, and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line, the optimizer has no idea.
The other thing is I have heard that in Oracle's case it is super optimized to do cursor operations so it's nowhere near the same penalty for set based operations versus cursors in Oracle as it is in SQL Server. I'm not an Oracle expert so I can't say for sure. But more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors, consult your local highly paid Oracle DBA :)
The idea behind preferring to do the work in queries is that the database engine can optimize by reformulating it. That's also why you'd want to run EXPLAIN on your query, to see what the db is actually doing. (e.g. taking advantage of indices, table sizes and sometimes even knowledge about the distributions of values in columns.)
That said, to get good performance in your actual concrete case, you may have to bend or break rules.
Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after all the updates, but generates a collision if done one-by-one.
Set-based is done in one operation.
A cursor takes as many operations as there are rows in the cursor's rowset.
The REAL answer is to go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: very few actually study computation. You know... the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, the RDBMSs have optimized memory management and read/write, but the same could be said for procedural languages. As I read it, the original question is not about the IDE or the software, but rather about the efficiency of one method of computation vs. another.
Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.
Simply put, in most cases, it's faster/easier to let the database do it for you.
The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.