Why do people hate SQL cursors so much? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I can understand wanting to avoid having to use a cursor due to the overhead and inconvenience, but it looks like there's some serious cursor-phobia-mania going on where people are going to great lengths to avoid having to use one.
For example, one question asked how to do something obviously trivial with a cursor, and the accepted answer proposed using a recursive common table expression (CTE) query with a recursive custom function, even though this limits the number of rows that could be processed to 32 (due to the recursive function call limit in SQL Server). This strikes me as a terrible solution for system longevity, not to mention a tremendous effort just to avoid using a simple cursor.
What is the reason for this level of insane hatred? Has some 'noted authority' issued a fatwa against cursors? Does some unspeakable evil lurk in the heart of cursors that corrupts the morals of children or something?
Wiki question, more interested in the answer than the rep.
Related Info:
SQL Server Fast Forward Cursors
EDIT: let me be more precise: I understand that cursors should not be used instead of normal relational operations; that is a no-brainer. What I don't understand is people going waaaaay out of their way to avoid cursors like they have cooties or something, even when a cursor is a simpler and/or more efficient solution. It's the irrational hatred that baffles me, not the obvious technical efficiencies.

The "overhead" with cursors is merely part of the API. Cursors are how parts of the RDBMS work under the hood. Often CREATE TABLE and INSERT have SELECT statements, and the implementation is the obvious internal cursor implementation.
Using higher-level "set-based operators" bundles the cursor results into a single result set, meaning less API back-and-forth.
Cursors predate modern languages that provide first-class collections. Old C, COBOL, Fortran, etc., had to process rows one at a time because there was no notion of "collection" that could be used widely. Java, C#, Python, etc., have first-class list structures to contain result sets.
The Slow Issue
In some circles, relational joins are a mystery, and folks will write nested cursors rather than a simple join. I've seen truly epic nested-loop operations written out as lots and lots of cursors, defeating the RDBMS's optimizer and running really slowly.
Simple SQL rewrites that replace nested cursor loops with joins and a single, flat cursor loop can make programs run in 1/100th the time. [They thought I was the god of optimization. All I did was replace nested loops with joins. Still used cursors.]
This confusion often leads to an indictment of cursors. However, it isn't the cursor, it's the misuse of the cursor that's the problem.
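For illustration only, the shape of that rewrite in T-SQL might look like the sketch below; Orders and OrderItems are invented names, not tables from any system mentioned here:

-- Nested-cursor shape (slow): one query per outer row defeats the optimizer.
DECLARE @OrderID int
DECLARE order_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT OrderID FROM Orders
OPEN order_cur
FETCH NEXT FROM order_cur INTO @OrderID
WHILE @@FETCH_STATUS = 0
BEGIN
    -- the inner lookup (or a second, nested cursor) runs once per order
    SELECT ProductID, Quantity FROM OrderItems WHERE OrderID = @OrderID
    FETCH NEXT FROM order_cur INTO @OrderID
END
CLOSE order_cur
DEALLOCATE order_cur

-- Join shape (fast): one statement, one flat result; the optimizer is free
-- to pick a hash or merge join instead of a nested loop.
SELECT o.OrderID, i.ProductID, i.Quantity
FROM Orders AS o
JOIN OrderItems AS i ON i.OrderID = o.OrderID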
The Size Issue
For really epic result sets (i.e., dumping a table to a file), cursors are essential. The set-based operations can't materialize really large result sets as a single collection in memory.
Alternatives
I try to use an ORM layer as much as possible. But that has two purposes. First, the cursors are managed by the ORM component. Second, the SQL is separated from the application into a configuration file. It's not that the cursors are bad. It's that coding all those opens, closes and fetches is not value-add programming.

Cursors make people overly apply a procedural mindset to a set-based environment.
And they are SLOW!!!
From SQLTeam:
Please note that cursors are the SLOWEST way to access data inside SQL Server. They should only be used when you truly need to access one row at a time. The only reason I can think of for that is to call a stored procedure on each row. In the Cursor Performance article I discovered that cursors are over thirty times slower than set-based alternatives.

There's an answer above which says "cursors are the SLOWEST way to access data inside SQL Server... cursors are over thirty times slower than set based alternatives."
This statement may be true under many circumstances, but as a blanket statement it's problematic. For example, I've made good use of cursors in situations where I want to perform an update or delete operation affecting many rows of a large table which is receiving constant production reads. Running a stored procedure which does these updates one row at a time ends up being faster than set-based operations, because the set-based operation conflicts with the read operation and ends up causing horrific locking problems (and may kill the production system entirely, in extreme cases).
In the absence of other database activity, set-based operations are universally faster. In production systems, it depends.
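As a minimal T-SQL sketch of that idea, batching the writes into small chunks rather than literally one row at a time (the table, column, and batch size are invented for illustration):

-- Purge old rows in small batches so concurrent readers are never blocked long.
DECLARE @rows int
SET @rows = 1
WHILE @rows > 0
BEGIN
    DELETE TOP (500) FROM AuditLog
    WHERE LoggedAt < DATEADD(day, -90, GETDATE())

    SET @rows = @@ROWCOUNT  -- stop once nothing is left to delete
END

Each small delete finishes quickly and releases its locks, which is the same trade the row-at-a-time stored procedure makes: more total work in exchange for never holding a large lock.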

Cursors tend to be used by beginning SQL developers in places where set-based operations would be better. Particularly when people learn SQL after learning a traditional programming language, the "iterate over these records" mentality tends to lead people to use cursors inappropriately.
Most serious SQL books include a chapter warning against the use of cursors; well-written ones make it clear that cursors have their place but shouldn't be used for set-based operations.
There are obviously situations where cursors are the correct choice, or at least A correct choice.

The optimizer often cannot use the relational algebra to transform the problem when a cursor method is used. Often a cursor is a great way to solve a problem, but SQL is a declarative language, and there is a lot of information in the database, from constraints, to statistics and indexes which mean that the optimizer has a lot of options to solve the problem, whereas a cursor pretty much explicitly directs the solution.

In Oracle PL/SQL, cursors will not result in table locks, and it is possible to use bulk collecting/bulk fetching.
In Oracle 10 the frequently used implicit cursor

for x in (select ....) loop
  -- do something
end loop;

implicitly fetches 100 rows at a time. Explicit bulk collecting/bulk fetching is also possible.
However, PL/SQL cursors are something of a last resort; use them when you are unable to solve a problem with set-based SQL.
Another reason is parallelization: it is easier for the database to parallelize big set-based statements than row-by-row imperative code. It is the same reason functional programming is becoming more and more popular (Haskell, F#, Lisp, C# LINQ, MapReduce ...): functional programming makes parallelization easier. The number of CPUs per computer is rising, so parallelization is becoming more and more of an issue.

In general, because on a relational database the performance of code using cursors is an order of magnitude worse than that of set-based operations.

The answers above have not emphasized enough the importance of locking. I'm not a big fan of cursors because they often result in table level locks.

For what it's worth, I have read that the "one" place a cursor will outperform its set-based counterpart is in a running total. Over a small table, the speed of summing the rows over the ORDER BY columns favors the set-based operation, but as the table grows the cursor becomes faster, because it can simply carry the running-total value to the next pass of the loop. Now, where you should do a running total is a different argument...
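For illustration, a running-total cursor might look like the T-SQL sketch below, over an invented Sales(ID, Amount) table (on SQL Server 2012 and later, a windowed SUM() OVER (ORDER BY ...) usually settles this argument):

DECLARE @Results TABLE (ID int, RunningTotal money)
DECLARE @ID int, @Amount money, @Total money
SET @Total = 0

DECLARE sales_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT ID, Amount FROM Sales ORDER BY ID

OPEN sales_cur
FETCH NEXT FROM sales_cur INTO @ID, @Amount
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Total = @Total + @Amount   -- carried to the next pass of the loop
    INSERT INTO @Results (ID, RunningTotal) VALUES (@ID, @Total)
    FETCH NEXT FROM sales_cur INTO @ID, @Amount
END
CLOSE sales_cur
DEALLOCATE sales_cur

SELECT ID, RunningTotal FROM @Results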

Outside of the performance (non-)issues, I think the biggest failing of cursors is that they are painful to debug, especially compared to code in most client applications, where debugging tends to be comparatively easy and the language features tend to be much friendlier. In fact, I contend that nearly anything one is doing in SQL with a cursor should probably be happening in the client app in the first place.

Can you post that cursor example or link to the question? There's probably an even better way than a recursive CTE.
In addition to other comments, cursors when used improperly (which is often) cause unnecessary page/row locks.

You could have probably concluded your question after the second paragraph, rather than calling people "insane" simply because they have a different viewpoint than you do and otherwise trying to mock professionals who may have a very good reason for feeling the way that they do.
As to your question, while there are certainly situations where a cursor may be called for, in my experience developers decide that a cursor "must" be used FAR more often than is actually the case. The chance of someone erring on the side of too much use of cursors vs. not using them when they should is MUCH higher in my opinion.

Basically, two blocks of code that do the same thing. Maybe it's a bit of a weird example, but it proves the point. SQL Server 2005:
SELECT * INTO #temp FROM master..spt_values

DECLARE @startTime DATETIME

BEGIN TRAN
SELECT @startTime = GETDATE()
UPDATE #temp SET number = 0
SELECT DATEDIFF(ms, @startTime, GETDATE())
ROLLBACK

BEGIN TRAN
DECLARE @name VARCHAR(35)  -- note: a bare VARCHAR would default to length 1
DECLARE tempCursor CURSOR
FOR SELECT name FROM #temp

OPEN tempCursor
FETCH NEXT FROM tempCursor INTO @name
SELECT @startTime = GETDATE()

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE #temp SET number = 0 WHERE name = @name
    FETCH NEXT FROM tempCursor INTO @name
END

SELECT DATEDIFF(ms, @startTime, GETDATE())

CLOSE tempCursor
DEALLOCATE tempCursor
ROLLBACK

DROP TABLE #temp
The single UPDATE takes 156 ms, while the cursor takes 2016 ms.

Related

SQL server temporary tables vs cursors

In SQL Server stored procedures, when should you use temporary tables and when should you use cursors? Which is the best option performance-wise?
If at all possible, avoid cursors like the plague. SQL Server is set-based - anything you need to do in an RBAR (row-by-agonizing-row) fashion will be slow and sluggish and goes against the basic principles of how SQL works.
Your question is very vague - based on that information, we cannot really tell what you're trying to do. But the main recommendation remains: whenever possible (and it's possible in the vast majority of cases), use set-based operations - SELECT, UPDATE, INSERT and joins - don't force your procedural thinking onto SQL Server - that's not the best way to go.
So if you can use set-based operations to fill and use your temporary tables, I would prefer that method over cursors every time.
Cursors work row by row and are extremely poor performers. They can in almost all cases be replaced by better set-based code (not normally by temp tables, though).
Temp tables can be fine or bad depending on the data amount and what you are doing with them. They are not generally a replacement for a cursor.
Suggest you read this:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
I believe SARAVAN originally made the comparison between cursors and temp tables because many times you are confronted with a situation where a temp table with an identity column and a @counter variable can be used to scroll/navigate through a data set much as a cursor would.
In my experience, the temp table (or table variable) approach gets the job done 95% of the time and is faster than the typically slow cursor.
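A sketch of that pattern, with invented table and procedure names; the identity column plays the role of the cursor's position:

-- Copy the work set into a temp table with its own row numbering.
SELECT IDENTITY(int, 1, 1) AS RowNum, CustomerID
INTO #work
FROM Customers
WHERE NeedsProcessing = 1

DECLARE @counter int, @max int, @CustomerID int
SET @counter = 1
SELECT @max = MAX(RowNum) FROM #work

WHILE @counter <= @max
BEGIN
    SELECT @CustomerID = CustomerID FROM #work WHERE RowNum = @counter
    -- per-row work goes here, e.g. EXEC dbo.ProcessCustomer @CustomerID
    SET @counter = @counter + 1
END

DROP TABLE #work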

Difference between FETCH/FOR to loop a CURSOR in PL/SQL

I know that fetching a cursor will give me access to variables like %ROWCOUNT, %ROWTYPE, %FOUND, %NOTFOUND, %ISOPEN
...but I was wondering if there are any other reasons to use
Open - Fetch - Close instructions to loop over a cursor
rather than
looping the cursor with a FOR cycle (in my opinion the FOR cycle is better because it is simple).
What do you think?
From a performance standpoint, the difference is a lot more complicated than the Tim Hall tip that OMG Ponies linked to would imply. I believe that this tip is an introduction to a larger section that has been excerpted for the web-- I expect that Tim went on to make most if not all of these points in the book. Additionally, this entire discussion depends on the Oracle version you're using. I believe this is correct for 10.2, 11.1, and 11.2 but there are definitely differences if you start going back to older releases.
The particular example in the tip, first of all, is rather unrealistic. I've never seen anyone code a single-row fetch using an explicit cursor rather than a SELECT INTO. So the fact that SELECT INTO is more efficient is of very limited practical importance. If we're discussing loops, the performance we're interested in is how expensive it is to fetch many rows. And that's where the complexity starts to come in.
Oracle introduced the ability to do a BULK COLLECT of data from a cursor into a PL/SQL collection in 10.1. This is a much more efficient way to get data from the SQL engine to the PL/SQL collection because it allows you to minimize context shifts by fetching many rows at once. And subsequent operations on those collections are more efficient because your code can stay within the PL/SQL engine.
In order to take maximum advantage of the BULK COLLECT syntax, though, you generally have to use explicit cursors because that way you can populate a PL/SQL collection and then subsequently use the FORALL syntax to write the data back to the database (on the reasonable assumption that if you are fetching a bunch of data in a cursor, there is a strong probability that you are doing some sort of manipulation and saving the manipulated data somewhere). If you use an implicit cursor in a FOR loop, as OMG Ponies correctly points out, Oracle will be doing a BULK COLLECT behind the scenes to make the fetching of the data less expensive. But your code will be doing slower row-by-row inserts and updates because the data is not in a collection. Explicit cursors also offer the opportunity to set the LIMIT explicitly which can improve performance over the default of 100 for an implicit cursor in a FOR loop.
In general, assuming that you're on 10.2 or greater and that your code is fetching data and writing it back to the database, the ranking from fastest to slowest is:
1) Explicit cursors doing a BULK COLLECT into a local collection (with an appropriate LIMIT) and using FORALL to write back to the database (see the sketch after this list).
2) Implicit cursors doing a BULK COLLECT for you behind the scenes, along with single-row writes back to the database.
3) Explicit cursors that are not doing a BULK COLLECT and not taking advantage of PL/SQL collections.
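For illustration, a sketch of the first (fastest) pattern; the table and column names (employees, employee_id, salary) and the LIMIT of 500 are invented, and two scalar collections are used because some releases restrict referencing record fields inside FORALL:

DECLARE
  CURSOR emp_cur IS
    SELECT employee_id, salary FROM employees;

  TYPE id_tab_t  IS TABLE OF employees.employee_id%TYPE;
  TYPE sal_tab_t IS TABLE OF employees.salary%TYPE;
  l_ids  id_tab_t;
  l_sals sal_tab_t;
BEGIN
  OPEN emp_cur;
  LOOP
    FETCH emp_cur BULK COLLECT INTO l_ids, l_sals LIMIT 500;  -- explicit batch size
    EXIT WHEN l_ids.COUNT = 0;

    FOR i IN 1 .. l_ids.COUNT LOOP
      l_sals(i) := l_sals(i) * 1.1;   -- manipulate the data inside PL/SQL
    END LOOP;

    FORALL i IN 1 .. l_ids.COUNT     -- one context shift for the whole batch
      UPDATE employees
         SET salary = l_sals(i)
       WHERE employee_id = l_ids(i);
  END LOOP;
  CLOSE emp_cur;
END;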
On the other hand, using implicit cursors gets you quite a bit of the benefit of using bulk operations for very little of the upfront cost in refactoring old code or learning the new feature. If most of your PL/SQL development is done by developers whose primary language is something else or who don't necessarily keep up with new language features, FOR loops are going to be easier to understand and maintain than explicit cursor code that used all the new BULK COLLECT functionality. And when Oracle introduces new optimizations in the future, it's far more likely that the implicit cursor code would get the benefit automatically while the explicit code may require some manual rework.
Of course, by the time you're troubleshooting performance to the point where you really care about how much faster different variants of your looping code might be, you're often at the point where you would want to consider moving more logic into pure SQL and ditching the looping code entirely.
The OPEN / FETCH / CLOSE is called explicit cursor syntax; the latter is called implicit cursor syntax.
One key difference you've already noticed is that you can't use %FOUND/%NOTFOUND/etc in implicit cursors... Another thing to be aware of is that implicit cursors are faster than explicit ones--they read ahead (~100 records?) besides not supporting the explicit logic.
Additional info:
Implicit vs. Explicit Cursors
I don't know of any crucial differences between the two realizations besides one: for ... loop implicitly closes the cursor after the loop is finished, while with the open ... fetch ... close syntax you should close the cursor yourself (just good manners), though this is not a necessity: Oracle will close the cursor automatically when it goes out of scope. Also, you can't use %FOUND and %NOTFOUND in for ... loop cursors.
As for me, I find the for ... loop realization much easier to read and support.
Correct me if I'm wrong, but I think each one has a nice feature the other doesn't have.
With a for loop you can do this:
for i in (select * from dual) loop
  dbms_output.put_line('ffffuuu');
end loop;
And with open ... fetch you can do this:
declare
  cur sys_refcursor;
  tmp dual.dummy%type;
begin
  open cur for 'select dummy from dual';
  loop
    fetch cur into tmp;
    exit when cur%notfound;
    dbms_output.put_line('ffffuuu');
  end loop;
  close cur;
end;
So with open ... fetch you can use dynamic cursors, but with a for loop you can use a normal cursor without declaring it.

Cursors vs Procedures in SQL

So, I just learned about CURSORS but still don't exactly grasp them. What is the difference between a cursor and a procedure, or even a function?
So far, from the various examples (DECLARE CURSOR ... SELECT ... FROM ...), it seems at most to be a variable that holds a query. Is the data real-time, or a snapshot taken when the cursor was declared?
i.e.
I have a table with one row and one col with a value of 2.
I do DECLARE CURSOR ... SELECT * FROM table1
I then insert a new row with a value of 3.
When I run the cursor, would I just get the one row from before the cursor was declared, or both rows?
Thanks
I would recommend researching a bit on "Set based" versus "Row Based". This article does a decent job.
Most database systems are geared toward performing set-based operations. Because of this, you will often see performance problems when you perform row-based operations (like using a cursor). In my experience, A LOT of SQL that uses cursors can be rewritten without cursors.
In the example you asked about, your cursor would only have one record in it.
Also, keep in mind that a stored procedure can make use of a cursor.
I believe the documentation will answer some of your questions. Read through the different options like "Insensitive" to see what they mean.
Also, as a general rule, it is frowned upon to use cursors. You should always try to find a "set based" solution before going the cursor route. There is much debate and documentation on this subject as well that is easily accessible.
My advice would be to forget you learned the syntax for cursors. Cursors are a last-resort technique and should only be used by an expert who understands what impact the cursor will have on performance and why the set-based alternatives won't work in his specific case. Most things done in cursors are far more easily done in a set-based fashion if you understand set operations. This link will help you learn what you need to know so that you will only rarely have to write a cursor:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
CURSOR:
A cursor does something row by row that is not possible with a simple query. For example, if you have a temporary table to store a hierarchy of categories, a cursor can be useful to populate it.
Procedures (or stored procedures) are collections of SQL statements (possibly including cursors). When executed, passing parameters if any, a procedure runs the statements it contains.
A function or procedure is a set of instructions to perform some task.
A cursor is a structure that can store the result set of a query and let you step through it row by row.
Stored procedures are pre-compiled objects that execute as a batch of statements, whereas cursors are used to process rows one at a time.
For example: you can think of a cursor like a bag (a cursor is similar to a pointer that singles out one row among many). Put the results of a query run from within a procedure into it, and whenever you need the results, just open the bag and read them out row by row.
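To make the relationship concrete, here is a minimal, hypothetical T-SQL stored procedure (invented names) that declares and loops over a cursor; the procedure is the container, and the cursor is just one construct it uses internally:

CREATE PROCEDURE dbo.PrintCustomerNames
AS
BEGIN
    DECLARE @name varchar(100)

    -- The cursor lives inside the procedure and holds the query's result set.
    DECLARE name_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT Name FROM Customers

    OPEN name_cur
    FETCH NEXT FROM name_cur INTO @name
    WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT @name                        -- the per-row work goes here
        FETCH NEXT FROM name_cur INTO @name
    END

    CLOSE name_cur
    DEALLOCATE name_cur
END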

SQL Server Fast Forward Cursors

It is generally accepted that the use of cursors in stored procedures should be avoided where possible (replaced with set-based logic, etc.). If you take the cases where you need to iterate over some data and can do so in a read-only manner, are fast-forward (read-only, forward-only) cursors more or less efficient than, say, WHILE loops? From my investigations it looks as though the cursor option is generally faster and uses fewer reads and less CPU time. I haven't done any extensive testing, but is this what others find? Do cursors of this type (fast forward) carry additional overhead or resources that could be expensive that I don't know about?
Is all the talk about not using cursors really about avoiding the use of cursors when set-based approaches are available, and the use of updatable cursors etc.?
While a fast-forward cursor does have some optimizations in SQL Server 2005, it is not true that they are anywhere close to a set-based query in terms of performance. There are very few situations where cursor logic cannot be replaced by a set-based query. Cursors will always be inherently slower, due in part to the fact that you have to keep interrupting the execution in order to fill your local variables.
Here are a few references, which would only be the tip of the iceberg if you research this issue:
http://www.code-magazine.com/Article.aspx?quickid=060113
http://dataeducation.com/re-inventing-the-recursive-cte/
This answer hopes to consolidate the replies given to date.
1) If at all possible, use set-based logic for your queries, i.e. try to use just SELECT, INSERT, UPDATE, or DELETE with the appropriate FROM clauses or nested queries - these will almost always be faster.
2) If the above is not possible, then in SQL Server 2005+ FAST_FORWARD cursors are efficient, perform well, and should be used in preference to WHILE loops.
You can avoid cursors most of the time, but sometimes one is necessary.
Just keep in mind that FAST_FORWARD is DYNAMIC; FORWARD_ONLY you can use with a STATIC cursor.
Try using it on the Halloween problem to see what happens !!!
IF OBJECT_ID('Funcionarios') IS NOT NULL
    DROP TABLE Funcionarios
GO
CREATE TABLE Funcionarios (ID INT IDENTITY(1,1) PRIMARY KEY,
                           ContactName CHAR(7000),
                           Salario NUMERIC(18,2));
GO
INSERT INTO Funcionarios (ContactName, Salario) VALUES ('Fabiano', 1900)
INSERT INTO Funcionarios (ContactName, Salario) VALUES ('Luciano', 2050)
INSERT INTO Funcionarios (ContactName, Salario) VALUES ('Gilberto', 2070)
INSERT INTO Funcionarios (ContactName, Salario) VALUES ('Ivan', 2090)
GO
CREATE NONCLUSTERED INDEX ix_Salario ON Funcionarios (Salario)
GO

-- Halloween problem: will update all rows until they reach 3000!!!
UPDATE Funcionarios SET Salario = Salario * 1.1
  FROM Funcionarios WITH (INDEX = ix_Salario)
 WHERE Salario < 3000
GO

-- Simulate here with all the different CURSOR declarations:
-- DYNAMIC updates the rows until all of them reach 3000
-- FAST_FORWARD updates the rows until all of them reach 3000
-- STATIC updates the rows only one time
BEGIN TRAN

DECLARE @ID INT

DECLARE TMP_Cursor CURSOR DYNAMIC
--DECLARE TMP_Cursor CURSOR FAST_FORWARD
--DECLARE TMP_Cursor CURSOR STATIC READ_ONLY FORWARD_ONLY
FOR SELECT ID
      FROM Funcionarios WITH (INDEX = ix_Salario)
     WHERE Salario < 3000

OPEN TMP_Cursor

FETCH NEXT FROM TMP_Cursor INTO @ID

WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT * FROM Funcionarios WITH (INDEX = ix_Salario)

    UPDATE Funcionarios
       SET Salario = Salario * 1.1
     WHERE ID = @ID

    FETCH NEXT FROM TMP_Cursor INTO @ID
END

CLOSE TMP_Cursor
DEALLOCATE TMP_Cursor

SELECT * FROM Funcionarios

ROLLBACK TRAN
GO
Some alternatives to using a cursor:
WHILE loops
Temp tables
Derived tables
Correlated subqueries
CASE statements
Multiple queries
Often, cursor operations can also be achieved with non-cursor techniques.
If you are sure a cursor needs to be used, reduce the number of records to be processed as much as possible. One way of doing this is to pull the records to be processed into a temp table first, so the cursor runs over the records in the temp table rather than the original table. This assumes that the number of records in the temp table is greatly reduced compared to the original table. With fewer records, the cursor completes faster; see the sketch below.
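A brief sketch of that approach (invented names):

-- Filter first; the cursor then walks the much smaller temp table.
SELECT OrderID
INTO #pending
FROM Orders
WHERE Status = 'PENDING'    -- drastically reduces the row count

DECLARE @OrderID int
DECLARE pending_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT OrderID FROM #pending

OPEN pending_cur
FETCH NEXT FROM pending_cur INTO @OrderID
WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row processing here
    FETCH NEXT FROM pending_cur INTO @OrderID
END
CLOSE pending_cur
DEALLOCATE pending_cur
DROP TABLE #pending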
Some cursor options that affect performance:
FORWARD_ONLY: The cursor can only move forward, from the first row to the last, with FETCH NEXT. Unless KEYSET or STATIC is also specified, the SELECT clause is re-evaluated on each fetch.
STATIC: Creates a temp copy of the data, which the cursor then reads. This prevents the cursor from being recalculated on each fetch, which improves performance. Updates through the cursor are not allowed, and changes made to the table are not reflected in the fetched rows.
KEYSET: The cursored rows' keys are placed in a table in tempdb, and changes to non-key columns are reflected on each fetch. However, new records added to the table are not reflected. With a keyset cursor, the SELECT statement is not evaluated again.
DYNAMIC: All changes to the table are reflected in the cursor, and the cursor is re-evaluated on each fetch. It uses a lot of resources and adversely affects performance.
FAST_FORWARD: The cursor is one-way, like FORWARD_ONLY, but also read-only. The combination improves performance, and the cursor is not re-evaluated on every fetch. It gives the best performance when it suits the task.
OPTIMISTIC: This option can be used to update rows through the cursor. If a row is fetched and updated, and the same row is also updated by another process between the fetch and update operations, the cursor's update operation fails. If an OPTIMISTIC cursor is used to perform row updates, the rows should not be updated by another process.
NOTE: If no type is specified, the default is FORWARD_ONLY.
"If You want a even faster cursor than FAST FORWARD then use a STATIC cursor. They are faster than FAST FORWARD. Not extremely faster but can make a difference."
Not so fast! According to Microsoft:
"Typically, when these conversions occurred, the cursor type degraded to a ‘more expensive’ cursor type. Generally, a (FAST) FORWARD-ONLY cursor is the most performant, followed by DYNAMIC, KEYSET, and finally STATIC which is generally the least performant."
from: Link
People avoid cursors because they are generally more difficult to write than a simple WHILE loop; however, a WHILE loop can be expensive because you are constantly selecting data from a table, temporary or otherwise.
With a cursor that is read-only and fast-forward, the data is kept in memory and the structure has been specifically designed for looping.
This article highlights that an average cursor runs 50 times faster than a while loop.
To answer Mile's original questions...
Fast Forward, Read Only, Static cursors (affectionately known as a "Fire Hose Cursor") are typically as fast as or faster than an equivalent temp table and a WHILE loop, because such a cursor is nothing more than a temp table and a WHILE loop that has been optimized a bit behind the scenes.
To add to what Eric Z. Beard posted on this thread and to further answer the question of...
"Is all the talk about not using cursors really about avoiding the use
of cursors when set-based approaches are available, and the use of
updatable cursors etc."
Yes. With very few exceptions, it takes less time and less code to write proper set-based code to do the same thing as most cursors and has the added benefit of using much fewer resources and usually runs MUCH faster than a cursor or While loop. Generally speaking and with the exception of certain administrative tasks, they really should be avoided in favor of properly written set-based code. There are, of course, exceptions to every "rule" but, in the case of Cursors, While loops, and other forms of RBAR, most people can count the exceptions on one hand without using all of the fingers. ;-)
There's also the notion of "Hidden RBAR". This is code that looks set-based but actually isn't. This type of "set-based" code is the reason why certain people have embraced RBAR methods and say they're "OK". For example, solving the running-total problem using an aggregated (SUM) correlated subquery with an inequality in it to build the running total isn't really set-based in my book. Instead, it's RBAR on steroids, because for each row calculated it has to repeatedly "touch" many other rows at a rate of N*(N+1)/2. That's known as a "Triangular Join" and is at least half as bad as a full Cartesian join (cross join, or "Square Join").
Although MS has made some improvements in how cursors work since SQL Server 2005, the term "fast cursor" is still an oxymoron compared to properly written set-based code. That also holds true even in Oracle. I worked with Oracle for three short years in the past, and my job was to make performance improvements in existing code. Most of the really substantial improvements were realized when I converted cursors to set-based code. Many jobs that previously took 4 to 8 hours to execute were reduced to minutes and, sometimes, seconds.
The 'best practice' of avoiding cursors in SQL Server dates back to SQL Server 2000 and earlier versions. The rewrite of the engine in SQL 2005 addressed most of the issues related to the problems of cursors, particularly with the introduction of the fast-forward option. Cursors are not necessarily worse than set-based operations and are used extensively and successfully in Oracle PL/SQL (LOOP).
The 'generally accepted' wisdom you refer to was valid but is now outdated and incorrect - go on the assumption that fast-forward cursors behave as advertised and perform. Do some tests and research, basing your findings on SQL 2005 and later.
If you want an even faster cursor than FAST_FORWARD, then use a STATIC cursor. They are faster than FAST_FORWARD. Not extremely faster, but it can make a difference.

Why are relational set-based queries better than cursors?

When writing database queries in something like TSQL or PLSQL, we often have a choice of iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.
Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.
Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?
The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each separately in their own thread. SQL engines can do similar things with huge amounts of data in one set-based query.
When you perform cursor-based operations, the engine can only run sequentially and the operation has to be single threaded.
Set based queries are (usually) faster because:
They have more information for the query optimizer to optimize
They can batch reads from disk
There's less logging involved for rollbacks, transaction logs, etc.
Fewer locks are taken, which decreases overhead
Set based logic is the focus of RDBMSs, so they've been heavily optimized for it (often, at the expense of procedural performance)
Pulling data out to the middle tier to process it can be useful, though, because it takes the processing load off the DB server (which is the hardest thing to scale and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transactional logging, built-in locking and blocking, etc. - sometimes these are necessary and useful, other times they're just a waste of resources.
A simple cursor-with-procedural-logic vs. set-based example (T-SQL) that assigns an area code based on the telephone exchange:
--Cursor
DECLARE @phoneNumber char(7)
-- note: not FAST_FORWARD here; WHERE CURRENT OF requires an updatable cursor
DECLARE c CURSOR LOCAL FOR
    SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO @phoneNumber
WHILE @@FETCH_STATUS = 0 BEGIN
    DECLARE @exchange char(3), @areaCode char(3)
    SELECT @exchange = LEFT(@phoneNumber, 3)
    SELECT @areaCode = AreaCode
      FROM AreaCode_Exchange
     WHERE Exchange = @exchange
    IF @areaCode IS NOT NULL BEGIN
        UPDATE Customer SET AreaCode = @areaCode
        WHERE CURRENT OF c
    END
    FETCH NEXT FROM c INTO @phoneNumber
END
CLOSE c
DEALLOCATE c

--Set
UPDATE Customer SET
    AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
    LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
    Customer.AreaCode IS NULL
In addition to the above "let the DBMS do the work" (which is a great answer), there are a couple of other good reasons to leave the query in the DBMS:
It's (subjectively) easier to read. When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
It avoids network round trips. Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
It's wasteful. Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
Why wouldn't you? You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 seconds to do the same task without the cursor. In another case I removed a cursor and cut the processing time from over 24 hours to less than a minute. One was an insert using the VALUES clause instead of a SELECT, and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.
Cursors have their uses (or the code wouldn't be there in the first place), but they should be extremely rare when querying a relational database (except Oracle, which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceding record (running totals). But even that should be tested.
Another limited case for using a cursor is batch processing. If you try to do too much at once in set-based fashion, it can lock the table against other users. If you have a truly large set, it may be best to break it up into smaller set-based inserts, updates, or deletes that will not hold the lock too long, and then run through the sets using a cursor.
A third use of a cursor is to run system stored procs over a group of input values. Since this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an administrator to do. I do not recommend doing the same thing with a user-created stored proc in order to process a large batch and to reuse code. It is better to write a set-based version that will be a better performer, as performance should trump code reuse in most cases.
I think the real answer is, as with all approaches in programming, that it depends. Generally, a set-based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:
1. You are updating a large data set in a database where locking rows is not acceptable (during production hours, maybe). A set-based update has a possibility of locking a table for several seconds (or minutes), whereas a cursor (if written correctly) does not. The cursor can meander through the rows, updating one at a time, and you don't have to worry about affecting anything else.
2. The advantage of using SQL is that the bulk of the work for optimization is handled by the database engine in most circumstances. With the enterprise-class DB engines, the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set-based language: you have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the internal optimizers in the engine can't effectively create an execution path, and guess what happens... your super-powerful box with 32 processors uses a single thread to execute the query, because it doesn't know how to do anything else, so you waste processor time on the database server, of which there is generally only one, as opposed to multiple application servers (so, back to reason 1, you run into resource contention with other things needing to run on the database server). With a row-based language (C#, PHP, Java, etc.), you have more control over what happens. You can retrieve a data set and force it to execute the way you want (separate the data set out to run on multiple threads, etc.). Most of the time it still isn't going to be as efficient as running it on the database engine, because it will still have to access the engine to update the row, but when you have to do 1000+ calculations to update a row (and let's say you have a million rows), a database server can start to have problems.
I think it comes down to using the database as it was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.
Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) RDBMSs are built at least partially on top of ISAM engines. If the question is appropriate, and the veneer thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of DBMS, before trying it.
As has been said, the database is optimized for set operations. Literally, engineers sat down and debugged/tuned that database for long periods of time. The chances of you out-optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with, like batching disk reads/writes together, caching, and multi-threading. Also, some operations have a high overhead cost, but if you do them to a bunch of data at once, the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.
For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely with a cursor you go row by row in one table and then select values you need from another table. Basically it's like a nested loop only without the tightness of the loop (which is most likely compiled into machine language and super optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm, if one table is small, it may turn one table into a hash lookup table and do the join by performing O(1) lookups from one table into the lookup table. There are a number of join strategies that many DBMS have that will beat you looking up values from one table in a cursor.
Just look at the example of creating a hash lookup table. To build the table is probably m operations if you are joining two tables, one of length n and one of length m, where m is the smaller table. Each lookup should be constant time, so that is n operations. So basically the efficiency of a hash join is around m (setup) + n (lookups). If you do it yourself, and assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average that equates to m/2 searches). So basically the level of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor). Also, these operation counts are simplifications. Depending upon the cursor type, fetching each row of a cursor may be the same as doing another select from the first table.
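To put rough, purely illustrative numbers on that: with m = 10,000 and n = 1,000,000, the hash join costs on the order of m + n = 1,010,000 operations, while looking rows up one at a time through a cursor costs on the order of m * n / 2 = 5,000,000,000 operations, a difference of roughly a factor of 5,000.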
Locks also kill you. If you have cursors on a table you are locking up rows (in SQL server this is less severe for static and forward_only cursors...but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up but for a lesser amount of time. Also the optimizer can see what you are doing and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line the optimizer has no idea.
The other thing is I have heard that in Oracle's case it is super optimized to do cursor operations so it's nowhere near the same penalty for set based operations versus cursors in Oracle as it is in SQL Server. I'm not an Oracle expert so I can't say for sure. But more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors, consult your local highly paid Oracle DBA :)
The idea behind preferring to do the work in queries is that the database engine can optimize by reformulating it. That's also why you'd want to run EXPLAIN on your query, to see what the db is actually doing. (e.g. taking advantage of indices, table sizes and sometimes even knowledge about the distributions of values in columns.)
That said, to get good performance in your actual concrete case, you may have to bend or break rules.
Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after all the updates, but generates a collision if done one-by-one.
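A tiny T-SQL illustration of that constraint point (hypothetical table; behavior as in SQL Server, where the unique check applies to the statement as a whole):

CREATE TABLE Seats (SeatNo int UNIQUE)
INSERT INTO Seats VALUES (1)
INSERT INTO Seats VALUES (2)
INSERT INTO Seats VALUES (3)

-- As one set-based statement, the unique check is applied to the finished
-- result, so shifting every value up by one succeeds:
UPDATE Seats SET SeatNo = SeatNo + 1

-- Starting again from 1, 2, 3 and going row by row in ascending order, the
-- very first step collides with an existing row (commented out so the
-- script runs cleanly):
-- UPDATE Seats SET SeatNo = 2 WHERE SeatNo = 1   -- duplicate key: 2 exists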
Set-based is done in one operation; a cursor is as many operations as there are rows in its rowset.
The REAL answer is to go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: very few actually study computation. You know... the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, RDBMSs have optimized memory management and read/write, but the same could be said for procedural languages. As I read it, the original question is not about the IDE or the software, but rather about the efficiency of one method of computation vs. another.
Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.
Simply put, in most cases, it's faster/easier to let the database do it for you.
The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.