SQL Server - standard pattern for doing row by row operations on a table/view - sql

I want to iterate through a table/view and then kick off some process (e.g. run a job, send an email) based on some criteria.
My arbitrary constraint here is that I want to do this inside the database itself, using T-SQL on a stored proc, trigger, etc.
Does this scenario require cursors, or is there some other native T-SQL row-based operation I can take advantage of?

Your best bet is a cursor. SQL being declarative and set based, any 'workaround you may find that tries to force SQL to do imperative row oriented operations is unreliable and may break. Eg. the optimizer may cut out your 'operation' from the execution, or do it in strange order or for an unexpected number of times.
The general bad name cursors get is when they are deployed instead of set based operations (like do a computation and update, or return a report) because the developer did not found a set oriented way of doing the same functionality. But for non-SQL operations (ie. launch a process) they are appropriate.
You may also use some variations on the cursor theme, like client side iterating through a result set. That is very similar in spirit to a cursor, although not using explicit cursors.

The standard way to do this would be SSIS. Just use an Execute SQL task to get the rows, and a For Each task container to iterate once per row. Inside the container, run whatever tasks you like, and they'll have access to the columns of each row.

If you are planning on sending an email to each record with an email address (or a similar row-based operation) then you would indeed plan on using a cursor.
There is no other "row-based" operation that you'd do within SQL itself (although I like John's suggestion to investigate SSIS - as long as you have SQL Server Standard or Enterprise). However, if you are summing, searching or any other kind of operation and then kicking off an event once done the entire selection set, then you would certainly not use a cursor. Just so you know - cursors are generally considered a "last resort" approach to problems in SQL Server.

The first thought which comes to my mind when I need to iterate over the result set of a query is to use cursors. Yes, it is a quick and dirty way of programming. But cursors have their setbacks as well - They incur overheads and can be performance bottle necks.
There are alternatives to using cursors. You can try using a temp table with an identity column. Copy you table to the temp table and using a while loop to iterate over the rows. Then based on a condition call your stored procedure.
Here, check this link for alternatives to cursors - http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1339242,00.html
cheers

Related

What is the best way to call stored proc for each row?

I try to copy this set of tables to other set with the same scheme as the source.
I wrote stored proc, in SQL, that receives ID from TableA and copies all tables from B-G.
Now I want for each row of TalbeA to call that stored proc. I can use CURSOR or WHILE for this but, I read that CURSOR is not recommended and that WHILE is slower than CURSOR.
Is there another way or in this cases CURSOR\WHILE is the solution?
Thank you
CURSOR / WHILE is fine in this instance - there isn't a better way to call a sproc per row. If the performance of this is likely to have an impact on the system, though, be careful when you run it.
There is a better alternative if you can code it up - and that's to perform all the "copying" for the records in TableA and below in a bunch of SQL statements, avoiding cursors altogether. To summarise this suggestion - set-based rather than row-based.
Unless there's a significant advantage in having this copying performed in the stored proc, you'd be better off re-writing the copying inline in your current script.
If you're wondering how this might be achieved, if you're having to maintain foreign keys, work with identity columns, etc, you might check out my answer to How Can I avoid using a cursor....

SQL server temporary tables vs cursors

In SQL Server stored procedures when to use temporary tables and when to use cursors. which is the best option performance wise?
If ever possible avoid cursors like the plague. SQL Server is set-based - anything you need to do in an RBAR (row-by-agonizing-row) fashion will be slow, sluggish and goes against the basic principles of how SQL works.
Your question is very vague - based on that information, we cannot really tell what you're trying to do. But the main recommendation remains: whenever possible (and it's possible in the vast majority of cases), use set-based operations - SELECT, UPDATE, INSERT and joins - don't force your procedural thinking onto SQL Server - that's not the best way to go.
So if you can use set-based operations to fill and use your temporary tables, I would prefer that method over cursors every time.
Cursors work row-by-row and are extremely poor performers. They can in almost all cases be replaced by better set-based code (not normally temp tables though)
Temp tables can be fine or bad depending on the data amount and what you are doing with them. They are not generally a replacement for a cursor.
Suggest you read this:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
I believe SARAVAN originally made the comparison between cursors and temp tables because many times you are confronted with a situation where using a temp table with an identity column and a #counter variable can be used to scroll/navigate through a data set much like one in a cursor.
In my experience, using the temp table (or table variable) scenario can help me get the job done 95% of the time and is faster than the typically slow cursor.

T-SQL optimizing performance of various stored procedures question

so I have written several stored procedures that act on individual rows of data by taking in an ID number. I would like to keep several stored procedures that can call this stored procedure at different levels of my database scheme. For instance, when a row is inserted I call this stored procedure. When something else is modified I would like to call this stored procedure for each line. This is so I can have one set of base code that can be called everywhere else but that acts on different amounts of data. I have been able to produce this result with Cursors, but I am told these are very inefficient. Is there any other way to produce this kind of functionality without sacrificing performance? Thanks.
Yes. Use standard joins to operate on sets rather than RBAR (Row By Agonising Row). i.e. Rather than call a function for each row, design a join that performs the required operation on every applicable row as a set operation.
I often see devs use the 'function operates on a each row', and although this seems to be the obvious way to encapsulate logic, it doesn't perform well on SQL Server or most DB engines.
In some circumstances, a table-valued function can be used effectively (MS SQL Server).
(BTW, you are correct in saying cursors are inefficient).

Cursors vs Procedures in SQL

So, I just learned about CURSORS but still don't exactly grasp them. What is the difference between a cursor and procedure or even a function?
So far from the various examples (DECLARE CURSOR ... SELECT ... FROM ...) It seems at most its a variable to hold a query. Is the data real time, or a snapshot of when the cursor was declared?
i.e.
I have a table with one row and one col with a value of 2.
I do DECLARE CURSOR ... SELECT * FROM table1
I then insert a new row with a value of 3.
When I run the cursor, would I Just get the one row from before the cursor was declared, or both rows?
Thanks
I would recommend researching a bit on "Set based" versus "Row Based". This article does a decent job.
Most database systems are geared toward performing set based operations. Because of this you will often see performance problems when you perform row based operations (like using a cursor). In my experience A LOT of sql that uses cursors can be rewritten without cursors.
In the example you asked about your cursor would only have one record in it.
Also, keep in mind that a stored procedure can make use of a cursor.
I believe the documentation will answer some of your questions. Read through the different options like "Insensitive" to see what they mean.
Also, as a general rule, it is frowned upon to use cursors. You should always try to find a "set based" solution before going the cursor route. There is much debate and documentation on this subject as well that is easily accessible.
My advice would be to forget you learned the syntax for cursors. Cursors are the last resort technique and should only be used by an expert who understnds what impact the cursor will have on performance and why the set-based alternatives won't work in his specific case. Most things done in cursors are far more easily done in a set-based fashion if you understand set operations. This link will help you learn what you need to know so that you will only rarely have to write a cursor:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
CURSOR:
do something that is to be done row by row and not possible with simple query
for example, if you have a temporary table to store hierarchy of categories, cursor can be useful to populate it
procedures or stored procedures are something that can have collection of sql statements (including cursor) in it. when executed by passing params (if any) it will execute the statements in it
A function or procedure is a set of instructions to perform some task.
A cursor is an array that can stores the result set of a query.
Stored procedures are pre-compiled objects and executes as bulk of statements, whereas cursors are used to execute row by row.
For Example: You can take cursors like a bag (cursors are similar to pointer that points out on a single row out of so many rows) and put all the result of a query run from within procedures, whenever you will need the results, just open the bag and find out the results row by row.

Why are relational set-based queries better than cursors?

When writing database queries in something like TSQL or PLSQL, we often have a choice of iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.
Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.
Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?
The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each separately in their own thread. SQL engines can do similar things with huge amounts of data in one set-based query.
When you perform cursor-based operations, the engine can only run sequentially and the operation has to be single threaded.
Set based queries are (usually) faster because:
They have more information for the query optimizer to optimize
They can batch reads from disk
There's less logging involved for rollbacks, transaction logs, etc.
Less locks are taken, which decreases overhead
Set based logic is the focus of RDBMSs, so they've been heavily optimized for it (often, at the expense of procedural performance)
Pulling data out to the middle tier to process it can be useful, though, because it removes the processing overhead off the DB server (which is the hardest thing to scale, and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transactional logging, built-in locking and blocking, etc. - sometimes these are necessary and useful, other times they're just a waste of resources.
A simple cursor with procedural logic vs. set based example (T-SQL) that will assign an area code based on the telephone exchange:
--Cursor
DECLARE #phoneNumber char(7)
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO #phoneNumber
WHILE ##FETCH_STATUS = 0 BEGIN
DECLARE #exchange char(3), #areaCode char(3)
SELECT #exchange = LEFT(#phoneNumber, 3)
SELECT #areaCode = AreaCode
FROM AreaCode_Exchange
WHERE Exchange = #exchange
IF #areaCode IS NOT NULL BEGIN
UPDATE Customer SET AreaCode = #areaCode
WHERE CURRENT OF c
END
FETCH NEXT FROM c INTO #phoneNumber
END
CLOSE c
DEALLOCATE c
END
--Set
UPDATE Customer SET
AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
Customer.AreaCode IS NULL
In addition to the above "let the DBMS do the work" (which is a great solution), there are a couple other good reasons to leave the query in the DBMS:
It's (subjectively) easier to read. When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
It avoids network round trips. Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
It's wasteful. Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
Why wouldn't you? You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 second to do the same task without the cursor. In another case I removed a cursor and sent the processing time from over 24 hours to less than a minute. One was an insert using the values clause instead of a select and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.
Cursors have their uses (or the code wouldn't be their in the first place), but they should be extremely rare when querying a relational database (Except Oracle which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceeding record (running totals). BUt even that should be tested.
Another limited case of using a cursor is to do some batch processing. If you are trying to do too much at once in set-based fashion it can lock the table to other users. If you havea truly large set, it may be best to break it up into smaller set-based inserts, updates or deletes that will not hold the lock too long and then run through the sets using a cursor.
A third use of a cursor is to run system stored procs through a group of input values. SInce this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an adminstrator to do. I do not recommend doing the same thing with a user created stored proc in order to process a large batch and to re-use code. It is better to write a set-based version that will be a better performer as performance should trump code reuse in most cases.
I think the real answer is, like all approaches in programming, that it depends on which one is better. Generally, a set based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:
You are updating a large data set in a database where locking rows is not acceptable (during production hours maybe). A set based update has a possibility of locking a table for several seconds (or minutes), where a cursor (if written correctly) does not. The cursor can meander through the rows updating one at a time and you don't have to worry about affecting anything else.
The advantage to using SQL is that the bulk of the work for optimization is handled by the database engine in most circumstances. With the enterprise class db engines the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set based language. You have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the internal optimizers in the engine can't effectively create an execution path, and guess what happens... your super powerful box with 32 processors uses a single thread to execute the query because it doesn't know how to do anything else, so you waste processor time on the database server which generally there is only one of as opposed to multiple application servers (so back to reason 1, you run into resource contentions with other things needing to run on the database server). With a row based language (C#, PHP, JAVA etc.), you have more control as to what happens. You can retrieve a data set and force it to execute the way you want it to. (Separate the data set out to run on multiple threads etc). Most of the time, it still isn't going to be efficient as running it on the database engine, because it will still have to access the engine to update the row, but when you have to do 1000+ calculations to update a row (and lets say you have a million rows), a database server can start to have problems.
I think it comes down to using the database is was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.
Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) rdbmss are built at least partially on top of isam engines. If the question is appropriate, and the veneer thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of dbms, before trying it.
As has been said, the database is optimized for set operations. Literally engineers sat down and debugged/tuned that database for long periods of time. The chances of you out optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with like batching disk reads/writes together, caching, multi-threading. Also some operations have a high overhead cost but if you do it to a bunch of data at once the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.
For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely with a cursor you go row by row in one table and then select values you need from another table. Basically it's like a nested loop only without the tightness of the loop (which is most likely compiled into machine language and super optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm, if one table is small, it may turn one table into a hash lookup table and do the join by performing O(1) lookups from one table into the lookup table. There are a number of join strategies that many DBMS have that will beat you looking up values from one table in a cursor.
Just look at the example of creating a hash lookup table. To build the table is probably m operations if you are joining two tables one of length n and one of length m where m is the smaller table. Each lookup should be constant time, so that is n operations. so basically the efficiency of a hash join is around m (setup) + n (lookups). If you do it yourself and assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average it equates to m/2 searches). So basically the level of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor). Also the operations are simplifications. Depending upon the cursor type, fetching each row of a cursor may be the same as doing another select from the first table.
Locks also kill you. If you have cursors on a table you are locking up rows (in SQL server this is less severe for static and forward_only cursors...but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up but for a lesser amount of time. Also the optimizer can see what you are doing and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line the optimizer has no idea.
The other thing is I have heard that in Oracle's case it is super optimized to do cursor operations so it's nowhere near the same penalty for set based operations versus cursors in Oracle as it is in SQL Server. I'm not an Oracle expert so I can't say for sure. But more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors, consult your local highly paid Oracle DBA :)
The idea behind preferring to do the work in queries is that the database engine can optimize by reformulating it. That's also why you'd want to run EXPLAIN on your query, to see what the db is actually doing. (e.g. taking advantage of indices, table sizes and sometimes even knowledge about the distributions of values in columns.)
That said, to get good performance in your actual concrete case, you may have to bend or break rules.
Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after all the updates, but generates a collision if done one-by-one.
set based is done in one operation
cursor as many operations as the rowset of the cursor
The REAL answer is go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: Very few actually study computation. You know...the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, the RDBMS have optimized memory management and read/write but the same could be said for procedural languages. As I read it, the original question is not about the IDE, the software, but rather about the efficiency of one method of computation vs. another.
Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.
Simply put, in most cases, it's faster/easier to let the database do it for you.
The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.