Stored procedure vs multiple queries (activerecord) - sql

Just got started with activerecord. I am just wondering if I am making multiple queries in a request, is it better to have a stored procudure? That way, I am making 1 SQL query as oppose to multiple.
For example, a lot of times, I check if a record exists, if not, I create it. That's 2 queries. I can have that in a stored procedure which will make only 1 queries. Is that a way to go to increase performance?
Thanks

The performance increase will be gained by the fact that the stored procedure will stay compiled and the execution plan will stay stored. If you make your query from the application, and they are few and far between, that probably won't be the case.
As far as typing goes, it's about the same amount, just in different places.

For example, a lot of times, I check if a record exists, if not, I create it. That's 2 queries.
A DBA would probably advise you to
just insert the row, or
use the equivalent of the SQL MERGE statement, or
use whatever other statement your dbms supports (and which I presume ActiveRecord implements in one way or another), like REPLACE, or ON DUPLICATE KEY UPDATE ...
and trap the error that results from inserting a duplicate row.
All those should require only one round-trip to the database. And you have to trap errors anyway--there are a lot of things that can prevent an INSERT statement from succeeding besides a duplicate key.
On the other hand, if you have a business process that really requires a succession of SQL statements, it probably makes sense to write a stored procedure. That's not quite the same thing as "I am making 1 SQL query as oppose to multiple." You still have to execute the succession of SQL statements. You pass less to the server, but that's probably not important. The server still executes all the SQL statements.

I'm not sure that I'd bother with this level of performance tuning.
As far as single row inserts, updates, deletes etc are concerned, if you've got the right indexes in place then database performance is likely to be a very small performance concern compared with ruby, DOM processing on the client, network time, etc.. You might save yourself a few milliseconds, but that's unlikely to be significant.
You'll probably appreciate saving time by using:
Model.where(:blah => etc).first_or_initialize(:whatever => 'wotsit')
... so that you can later fret over a significant performance issue.

Related

SQL: Loop single select vs one select with IN clause

I'd like to ask you what is faster if use
Loop array and
call select XXX where id=
Call select XXX where id IN (list value of array)
The second one is almost always faster. Remember that in the first option, the client (usually) has to do a full database connection, log in, send the query, wait for the query to get parsed, wait for the query to get optimized, wait for the query to execute and then wait for the result to get back. In the second option, all of these steps are done once.
There may be cases where the first option is actually faster if your index-schema is bad and can't be fixed or the server is seriously wrong about how to run the disjunction that is the IN-clause and can't be told otherwise.
Doing all the work in the database should always be faster. Each time you connect to the database you incur some overhead. This overhead might be relatively minor, if the query plan is cached and the cache is optimized, but the data still needs to go back and forth.
More importantly, database engines are optimized to run queries. Some databases optimize in expressions, using a binary lookup. Parallel databases can also take advantage of multiple processors for the query. The performance only gets worse if the from is a view or if your query is more complicated.
Under some conditions, the performance difference may not really be noticeable -- say for a local database where the table is cached in memory. However, your habit should be to do such work in the database, not in the application.

Performance in these two SQL scenarios

I had assumed the first call below was more efficient, because it only makes this check once to see if #myInputParameter is less than 5000.
If the check fails, I avoid a query altogether. However, I've seen other people code like the second example, saying it's just as efficient, if not more.
Can anyone tell me which is quicker? It seems like the second one would be much slower, especially if the call is combing-through a large data set.
First call:
IF (#myInputParameter < 5000)
BEGIN
SELECT
#myCount = COUNT(1)
FROM myTable
WHERE someColumn=#someInputParameter
AND someOtherColumn='Hello'
--and so on
END
Second call:
SELECT
#myCount = COUNT(1)
FROM myTable
WHERE someColumn=#someInputParameter
AND someOtherColumn='Hello'
AND #myInputParameter < 5000
--and so on
Edit: I'm using SQL Server 2008 R2, but I'm really asking to get a feel for which query is "best practice" for SQL. I'm sure the difference in query-time between these two statements is a thousandth of a second, so it's not THAT critical. I'm just interested in writing better SQL code in general. Thanks
Sometimes, SQL Server is clever enough to transform the latter into the former. This manifests itself as a "startup predicate" on some plan operator like a filter or a loop-join. This causes the query to evaluate very quickly in small, constant time. I just tested this, btw.
You can't rely on this for all queries but once having verified that it works for a particular query through testing I'd rely on it.
If you use OPTION (recompile) it becomes even more reliable because this option inlines the parameter values into the query plan causing the entire query to turn into a constant scan.
SQL Server is well designed. It will do literal evaluations without actually scanning the table/indexes if it doesn't need to. So, I would expect them to perform basically identical.
From a practice standpoint, I think one should use the if statement, especially if it wraps multiple statements. But, this really is a matter of preference to me. For me, code that can't be executed would logically be faster than code that "should" execute without actually hitting the data.
Also, there is the possibility that SQL Server should create a bad plan and actually do a hit the data. I've never seen this specific scenario with literals, but I've had bad execution plans be created.

SQL Server Query Performance, Large where statements vs. queries

I am wondering which is a more efficent method to retrieve data from the database.
ex.
One particular part of my application can have well over 100 objects. Right now I have it setup to query the database twice for each object. This part of the application periodically refreshes itself, say every 2 minutes, and this application will probably end of being installed on 25-30 pc's. I am thinking that this is a large number of select statements to make from the database, and I am thinking about trying to optimize the procedure. I have to pull the information out of several tables, and both queries are using join statements.
Would it be better to rewrite the queries so that I am only executing the queries twice per update instead of 200 times? For example using a large where statement to include each object, and then do the processing of the data outside of the object, rather than inside each object?
Using SQL Server, .net No indexes on the tables, size of database is less than 10-5th
all things being equal, many statements with few rows is usually worse than few statements with many rows.
show the actual code and get better answers.
The default answer for optimization must always be: don't optimize until you have a demonstrated need for it. The followup is: once you have a demonstrated need for it, and an alternative approach: try both ways to determine which is faster. You can't predict which will be faster, and we certainly can't. KM's answer is good - fewer queries tends to be better - but the way to know is to test.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those and you'll see more of a performance degradation from the second query then the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptable to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need having one single SQL statement would be better performance 99% of the time. Not sure if the connections is being creating dynamically in this case or not but if so doing so is expensive. Even if the process if reusing existing connections the DBMS is not getting optimize the queries be best way and not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so a lot of joins in it and it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would lead in more performance as the SQL server (Which sometimes doesn't share the same location) just needs to handle one request, if you would use multiple SQL queries then you introduce a lot of overhead:
Executing more CPU instructions,
sending a second query to the server,
create a second thread on the server,
execute possible more CPU instructions
on the sever, destroy a second thread
on the server, send the second results
back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
You should always try to minimize the number of query to the database when you can. Your example is perfect for only 1 query. This way you will be able later to cache more easily or to handle more request in same time because instead of always using 2-3 query that require a connexion, you will have only 1 each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
Join scans both the tables and loops to match the first table record in second table. Simple select query will work faster in many cases as It only take cares for the primary/unique key(if exists) to search the data internally.

Why are relational set-based queries better than cursors?

When writing database queries in something like TSQL or PLSQL, we often have a choice of iterating over rows with a cursor to accomplish the task, or crafting a single SQL statement that does the same job all at once.
Also, we have the choice of simply pulling a large set of data back into our application and then processing it row by row, with C# or Java or PHP or whatever.
Why is it better to use set-based queries? What is the theory behind this choice? What is a good example of a cursor-based solution and its relational equivalent?
The main reason that I'm aware of is that set-based operations can be optimised by the engine by running them across multiple threads. For example, think of a quicksort - you can separate the list you're sorting into multiple "chunks" and sort each separately in their own thread. SQL engines can do similar things with huge amounts of data in one set-based query.
When you perform cursor-based operations, the engine can only run sequentially and the operation has to be single threaded.
Set based queries are (usually) faster because:
They have more information for the query optimizer to optimize
They can batch reads from disk
There's less logging involved for rollbacks, transaction logs, etc.
Less locks are taken, which decreases overhead
Set based logic is the focus of RDBMSs, so they've been heavily optimized for it (often, at the expense of procedural performance)
Pulling data out to the middle tier to process it can be useful, though, because it removes the processing overhead off the DB server (which is the hardest thing to scale, and is normally doing other things as well). Also, you normally don't have the same overheads (or benefits) in the middle tier. Things like transactional logging, built-in locking and blocking, etc. - sometimes these are necessary and useful, other times they're just a waste of resources.
A simple cursor with procedural logic vs. set based example (T-SQL) that will assign an area code based on the telephone exchange:
--Cursor
DECLARE #phoneNumber char(7)
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
SELECT PhoneNumber FROM Customer WHERE AreaCode IS NULL
OPEN c
FETCH NEXT FROM c INTO #phoneNumber
WHILE ##FETCH_STATUS = 0 BEGIN
DECLARE #exchange char(3), #areaCode char(3)
SELECT #exchange = LEFT(#phoneNumber, 3)
SELECT #areaCode = AreaCode
FROM AreaCode_Exchange
WHERE Exchange = #exchange
IF #areaCode IS NOT NULL BEGIN
UPDATE Customer SET AreaCode = #areaCode
WHERE CURRENT OF c
END
FETCH NEXT FROM c INTO #phoneNumber
END
CLOSE c
DEALLOCATE c
END
--Set
UPDATE Customer SET
AreaCode = AreaCode_Exchange.AreaCode
FROM Customer
JOIN AreaCode_Exchange ON
LEFT(Customer.PhoneNumber, 3) = AreaCode_Exchange.Exchange
WHERE
Customer.AreaCode IS NULL
In addition to the above "let the DBMS do the work" (which is a great solution), there are a couple other good reasons to leave the query in the DBMS:
It's (subjectively) easier to read. When looking at the code later, would you rather try and parse a complex stored procedure (or client-side code) with loops and things, or would you rather look at a concise SQL statement?
It avoids network round trips. Why shove all that data to the client and then shove more back? Why thrash the network if you don't need to?
It's wasteful. Your DBMS and app server(s) will need to buffer some/all of that data to work on it. If you don't have infinite memory you'll likely page out other data; why kick out possibly important things from memory to buffer a result set that is mostly useless?
Why wouldn't you? You bought (or are otherwise using) a highly reliable, very fast DBMS. Why wouldn't you use it?
You wanted some real-life examples. My company had a cursor that took over 40 minutes to process 30,000 records (and there were times when I needed to update over 200,000 records). It took 45 second to do the same task without the cursor. In another case I removed a cursor and sent the processing time from over 24 hours to less than a minute. One was an insert using the values clause instead of a select and the other was an update that used variables instead of a join. A good rule of thumb is that if it is an insert, update, or delete, you should look for a set-based way to perform the task.
Cursors have their uses (or the code wouldn't be their in the first place), but they should be extremely rare when querying a relational database (Except Oracle which is optimized to use them). One place where they can be faster is when doing calculations based on the value of the preceeding record (running totals). BUt even that should be tested.
Another limited case of using a cursor is to do some batch processing. If you are trying to do too much at once in set-based fashion it can lock the table to other users. If you havea truly large set, it may be best to break it up into smaller set-based inserts, updates or deletes that will not hold the lock too long and then run through the sets using a cursor.
A third use of a cursor is to run system stored procs through a group of input values. SInce this is limited to a generally small set and no one should mess with the system procs, this is an acceptable thing for an adminstrator to do. I do not recommend doing the same thing with a user created stored proc in order to process a large batch and to re-use code. It is better to write a set-based version that will be a better performer as performance should trump code reuse in most cases.
I think the real answer is, like all approaches in programming, that it depends on which one is better. Generally, a set based language is going to be more efficient, because that is what it was designed to do. There are two places where a cursor is at an advantage:
You are updating a large data set in a database where locking rows is not acceptable (during production hours maybe). A set based update has a possibility of locking a table for several seconds (or minutes), where a cursor (if written correctly) does not. The cursor can meander through the rows updating one at a time and you don't have to worry about affecting anything else.
The advantage to using SQL is that the bulk of the work for optimization is handled by the database engine in most circumstances. With the enterprise class db engines the designers have gone to painstaking lengths to make sure the system is efficient at handling data. The drawback is that SQL is a set based language. You have to be able to define a set of data to use it. Although this sounds easy, in some circumstances it is not. A query can be so complex that the internal optimizers in the engine can't effectively create an execution path, and guess what happens... your super powerful box with 32 processors uses a single thread to execute the query because it doesn't know how to do anything else, so you waste processor time on the database server which generally there is only one of as opposed to multiple application servers (so back to reason 1, you run into resource contentions with other things needing to run on the database server). With a row based language (C#, PHP, JAVA etc.), you have more control as to what happens. You can retrieve a data set and force it to execute the way you want it to. (Separate the data set out to run on multiple threads etc). Most of the time, it still isn't going to be efficient as running it on the database engine, because it will still have to access the engine to update the row, but when you have to do 1000+ calculations to update a row (and lets say you have a million rows), a database server can start to have problems.
I think it comes down to using the database is was designed to be used. Relational database servers are specifically developed and optimized to respond best to questions expressed in set logic.
Functionally, the penalty for cursors will vary hugely from product to product. Some (most?) rdbmss are built at least partially on top of isam engines. If the question is appropriate, and the veneer thin enough, it might in fact be as efficient to use a cursor. But that's one of the things you should become intimately familiar with, in terms of your brand of dbms, before trying it.
As has been said, the database is optimized for set operations. Literally engineers sat down and debugged/tuned that database for long periods of time. The chances of you out optimizing them are pretty slim. There are all sorts of fun tricks you can play with if you have a set of data to work with like batching disk reads/writes together, caching, multi-threading. Also some operations have a high overhead cost but if you do it to a bunch of data at once the cost per piece of data is low. If you are only working one row at a time, a lot of these methods and operations just can't happen.
For example, just look at the way the database joins. By looking at explain plans you can see several ways of doing joins. Most likely with a cursor you go row by row in one table and then select values you need from another table. Basically it's like a nested loop only without the tightness of the loop (which is most likely compiled into machine language and super optimized). SQL Server on its own has a whole bunch of ways of joining. If the rows are sorted, it will use some type of merge algorithm, if one table is small, it may turn one table into a hash lookup table and do the join by performing O(1) lookups from one table into the lookup table. There are a number of join strategies that many DBMS have that will beat you looking up values from one table in a cursor.
Just look at the example of creating a hash lookup table. To build the table is probably m operations if you are joining two tables one of length n and one of length m where m is the smaller table. Each lookup should be constant time, so that is n operations. so basically the efficiency of a hash join is around m (setup) + n (lookups). If you do it yourself and assuming no lookups/indexes, then for each of the n rows you will have to search m records (on average it equates to m/2 searches). So basically the level of operations goes from m + n (joining a bunch of records at once) to m * n / 2 (doing lookups through a cursor). Also the operations are simplifications. Depending upon the cursor type, fetching each row of a cursor may be the same as doing another select from the first table.
Locks also kill you. If you have cursors on a table you are locking up rows (in SQL server this is less severe for static and forward_only cursors...but the majority of cursor code I see just opens a cursor without specifying any of these options). If you do the operation in a set, the rows will still be locked up but for a lesser amount of time. Also the optimizer can see what you are doing and it may decide it is more efficient to lock the whole table instead of a bunch of rows or pages. But if you go line by line the optimizer has no idea.
The other thing is I have heard that in Oracle's case it is super optimized to do cursor operations so it's nowhere near the same penalty for set based operations versus cursors in Oracle as it is in SQL Server. I'm not an Oracle expert so I can't say for sure. But more than one Oracle person has told me that cursors are way more efficient in Oracle. So if you sacrificed your firstborn son for Oracle you may not have to worry about cursors, consult your local highly paid Oracle DBA :)
The idea behind preferring to do the work in queries is that the database engine can optimize by reformulating it. That's also why you'd want to run EXPLAIN on your query, to see what the db is actually doing. (e.g. taking advantage of indices, table sizes and sometimes even knowledge about the distributions of values in columns.)
That said, to get good performance in your actual concrete case, you may have to bend or break rules.
Oh, another reason might be constraints: Incrementing a unique column by one might be okay if constraints are checked after all the updates, but generates a collision if done one-by-one.
set based is done in one operation
cursor as many operations as the rowset of the cursor
The REAL answer is go get one of E.F. Codd's books and brush up on relational algebra. Then get a good book on Big O notation. After nearly two decades in IT this is, IMHO, one of the big tragedies of the modern MIS or CS degree: Very few actually study computation. You know...the "compute" part of "computer"? Structured Query Language (and all its supersets) is merely a practical application of relational algebra. Yes, the RDBMS have optimized memory management and read/write but the same could be said for procedural languages. As I read it, the original question is not about the IDE, the software, but rather about the efficiency of one method of computation vs. another.
Even a quick familiarization with Big O notation will begin to shed light on why, when dealing with sets of data, iteration is more expensive than a declarative statement.
Simply put, in most cases, it's faster/easier to let the database do it for you.
The database's purpose in life is to store/retrieve/manipulate data in set formats and to be really fast. Your VB.NET/ASP.NET code is likely nowhere near as fast as a dedicated database engine. Leveraging this is a wise use of resources.