SQL UPDATE WHERE IN (List) or UPDATE each individually? - sql

Lately I've been trying to find the best way to run certain SQL queries that could potentially be written in multiple different ways. During my research I've come across quite a lot of hate for the WHERE IN construct, due to an inherent inefficiency in how it works.
eg: WHERE Col IN (val1, val2, val3)
In my current project, I'm doing an UPDATE on a large set of data and am wondering which of the following is more efficient (or whether a better option exists):
UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (id1, id2, id3 ....);
In the above, the list can contain up to 1.5k IDs.
VS
Looping through all ID's in code, and running the following statement for each:
UPDATE table1 SET somecolumn = 'someVal' WHERE ID = 'theID';
To me, it seems more logical that the former would work better/faster, because there are fewer queries to run. That said, I'm not 100% familiar with the ins and outs of SQL and how query queueing works.
I'm also unsure as to which would be friendlier on the DB as far as table locks and other general performance.
General info in case it helps, I'm using Microsoft SQL Server 2014, and the primary development language is C#.
Any help is much appreciated.
EDIT:
Option 3:
UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (SELECT ID FROM @definedTable);
In the above, @definedTable is a SQL 'User-Defined Table Type', where the data inside comes through to a stored procedure as (in C#) type SqlDbType.Structured
People are asking how the ID's come in:
The IDs are in a List<string> in the code, and are used for other things in the code before being sent to a stored procedure. Currently, the IDs come into the stored procedure as a 'User-Defined Table Type' with only one column (the IDs).
I thought having them in a table might be better than having the code concatenate a massive string and just spit it into the SP as a variable that looks like id1, id2, id3, id4, etc.

I'm using your third option and it works great.
My stored procedure has a table-valued parameter. See also Use Table-Valued Parameters.
In the procedure there is one statement, no loops, like you said:
UPDATE table1 SET somecolumn = 'someVal' WHERE ID IN (SELECT ID FROM @definedTable);
It is better to call the procedure once, than 1,500 times. It is better to have one transaction, than 1,500 transactions.
If the number of rows in @definedTable goes above, say, 10K, I'd consider splitting it into batches of 10K.
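For reference, the table type and procedure behind that statement might look something like this sketch (the type and procedure names are illustrative, not from the original schema; the ID column type is an assumption and should match table1.ID):
-- Illustrative names; only the shape matters here.
CREATE TYPE dbo.IdList AS TABLE (ID int PRIMARY KEY);  -- assumed int; use the actual type of table1.ID
GO

CREATE PROCEDURE dbo.UpdateSomeColumn
    @definedTable dbo.IdList READONLY   -- table-valued parameters must be READONLY
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE table1
    SET somecolumn = 'someVal'
    WHERE ID IN (SELECT ID FROM @definedTable);
END
GO
On the C# side, this is the parameter the question passes as SqlDbType.Structured.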
Your first variant is OK for a few values in the IN clause, but when you get to really high numbers (60K+) you can see something like this, as shown in this answer:
Msg 8623, Level 16, State 1, Line 1 The query processor ran out of
internal resources and could not produce a query plan. This is a rare
event and only expected for extremely complex queries or queries that
reference a very large number of tables or partitions. Please simplify
the query. If you believe you have received this message in error,
contact Customer Support Services for more information.

Your first or third options are the best way to go. For either of them, you want an index on table1(id).
In general, it is better to run one query rather than multiple queries, because the overhead of passing data in and out of the database adds up. In addition, each update starts a transaction and commits it -- more overhead. That said, this will probably not be important unless you are updating thousands of records. The overhead is measured in hundreds of microseconds or milliseconds on a typical system.
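For example, if ID is not already table1's primary key (and therefore already indexed), the index could be as simple as this sketch:
-- Only needed if table1.ID is not already covered by the primary key / clustered index.
CREATE NONCLUSTERED INDEX IX_table1_ID ON table1 (ID);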

You should definitely NOT use a loop and send an entire new SQL statement for each ID. In that case, the SQL engine has to recompile the SQL statement and come up with an execution plan, etc. every single time.
Probably the best thing to do is to make a prepared statement with a placeholder then loop through your data executing the statement for each value. Then the statement stays in the database engine's memory and it quickly just executes it with the new value each time you call it rather than start from scratch.
If you have a large database and/or run this often, also make sure you create an index on that ID value, otherwise it will have to do a full table scan with every value.
EDIT:
Perl pseudocode demonstrating the prepared-statement approach described above:
#!/usr/bin/perl
use DBI;

# AutoCommit off: all updates run in a single transaction, committed at the end.
my $dbh = DBI->connect('dbi:Oracle:MY_DB', 'scott', 'tiger',
                       { RaiseError => 1, PrintError => 1, AutoCommit => 0 });

# Prepare once; only the bind values change on each execute.
my $sth = $dbh->prepare('UPDATE table1 SET somecolumn = ? WHERE id = ?');

# Each element of @updatetuples is a reference to [id, new_value].
foreach my $tuple (@updatetuples) {
    $sth->execute($tuple->[1], $tuple->[0]);
}

$dbh->commit;
$sth->finish;
$dbh->disconnect;
exit(0);

I came upon this post when trying to solve a very similar problem so thought I'd share what I found. My answer uses the case keyword, and applies to when you are trying to run an update for a list of key-value pairs (not when you are trying to update a bunch of rows to a single value). Normally I'd just run an update query and join the relevant tables, but I am using SQLite rather than MySQL and SQLite doesn't support joined update queries as well as MySQL. You can do something like this:
UPDATE mytable
SET somefield = (CASE WHEN id = 100 THEN 'some value 1'
                      WHEN id = 101 THEN 'some value 2'
                 END)
WHERE id IN (100, 101);

Related

What is the less expensive method of inserting records into Oracle?

I need to be able to repeatedly process an XML file and insert large amounts of data into an Oracle database. The procedure needs to be able to create new records, or update existing ones if data already exists.
I can think of two ways to process inserting/updating 100,000 records into an Oracle database. But which is the better method? Or is there another way?
1. Attempt the INSERT. If there is no exception, the insert works and all is good. If there is an exception, catch it and do an UPDATE instead.
2. Look up the record first (SELECT). If not found, do an INSERT. If found, do an UPDATE.
Obviously, if the Oracle table is empty then the first method saves time by foregoing lookups. But if the file was previously imported, and someone then changes a few lines and re-imports, the number of exceptions generated becomes huge.
The 2nd method takes longer on an empty database due to lookups but does not generate expensive exceptions during subsequent imports.
Is there a "normal" pattern for working with data like this?
Thanks!
I don't know what 'the' pattern is, but if you are generating a statement, then maybe you can generate a union of select from dual queries that contains all the data from the XML file. Then, you can wrap this select in a MERGE INTO statement, so your SQL looks something like:
MERGE INTO YourTable t
USING (
SELECT 'Val1FromXML' as SomeKey, 'Val2FromXML' as Extrafield, 'Val3FromXML' as OtherField FROM DUAL
UNION ALL
SELECT 'Val1FromRow2' as SomeKey, 'Val2FromXML' as Extrafield, 'Val3FromXML' as OtherField FROM DUAL
...) x ON (x.SomeKey = t.SomeKey)
WHEN MATCHED THEN
UPDATE SET
t.ExtraField = x.ExtraField,
t.OtherField = x.OtherField
WHEN NOT MATCHED THEN
INSERT (SomeKey, ExtraField, OtherField) VALUES (x.SomeKey, x.ExtraField, x.OtherField)
The advantage of this statement is that it's only one statement, so it saves the overhead of initializing a statement for each row. Also, as a single statement it will completely fail or completely succeed, which you would otherwise accomplish with a transaction.
And that's a pitfall as well. For an import like this, you may want to do only a limited number of rows at a time and then commit. That way you don't lock a large part of the table for too long, and you can break the import and continue later. But fortunately, it should be pretty easy to generate a MERGE INTO statement for a limited number of rows too, by simply putting no more than, say, 500 rows in the unioned select-from-duals.
The "normal" pattern would be to wrap your file with an external table and then perform an upsert via the merge keyword.
Depending on your hardware loading the file into a staging table via SQL*Loader can be much faster than using an external table.
edit - just realized you're processing a file and not trying to load it directly. GolezTrol's answer is a good way to deal with the rows you're generating. If there's a huge amount, though, I would still recommend populating a staging table and consider loading it separately via SQL*Loader instead of a massive SQL statement.

Better way to update TSQL

We are in the process of some data integration and I get update scripts in the form of
UPDATE Table1 SET Table1.field1 = '12345' WHERE Table1.field2 = '345667';
UPDATE Table1 SET Table1.field1 = '12365' WHERE Table1.field2 = '567885';
Table1.field2 is not indexed.
The scripts run without problems, but they take forever: 8,000 rows affected in a bit over 7 minutes, which I feel is a bit long. (It's running on a dev server which is not the best, but a look at the server doesn't indicate that it is overly busy.)
So my question is: is there a better (i.e. faster) way to run this type of update statement? (SQL 2008 R2)
Many Thanks!
You may be RBAR-ing (row-by-agonizing-row) the server with multiple UPDATE statements. Essentially, you're having it do a table scan for each query, which is obviously non-ideal. While an index would help the most, executing multiple single-value statements will still cost you.
SQL Server allows you to use JOINs for update statements, so you may see some improvement doing something like this:
WITH Incoming AS (SELECT field1, field2
FROM (VALUES('12345', '345667'),
('12365', '567885')) i(field1, field2))
UPDATE Table1
SET Table1.field1 = Incoming.field1
FROM Table1
JOIN Incoming
ON Incoming.field2 = Table1.field2;
SQL Fiddle Example
If it turns out that the number of rows in Incoming is large, you should probably realize it as an actual table that you bulk-load into first. You should be able to put an index on the load table (refreshed after the import, to make sure statistics are correct).
But really, an index on field2 should probably be the first thing, especially if there are multiple queries that use that column.
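A sketch of that load-table variant (the table and index names are hypothetical, and the bulk load itself is omitted):
-- Hypothetical load table; fill it with a bulk load (bcp / BULK INSERT / SqlBulkCopy),
-- not with thousands of single-row INSERTs.
CREATE TABLE dbo.Incoming
(
    field1 varchar(50) NOT NULL,   -- column types are assumptions; match Table1
    field2 varchar(50) NOT NULL
);

-- ... bulk-load dbo.Incoming here ...

-- Index the join column after the load so its statistics are up to date.
CREATE INDEX IX_Incoming_field2 ON dbo.Incoming (field2);

UPDATE t
SET t.field1 = i.field1
FROM Table1 AS t
JOIN dbo.Incoming AS i
    ON i.field2 = t.field2;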
As I see it, you can try different things (like storing all values on another table, then updating one using the other), but in the end, the engine is going to search using a single field, testing equality with a value.
That would require an index. If you can at least test in dev, maybe you can show the performance improvement to someone who can authorize the creation of the new index in the production environment.
That's my answer, I hope someone comes up with a better one!

Multiple small queries, or one big query while storing xml data into SQL Server

I'm parsing an XML file which I loop through, storing the information into SQL Server. I send a MERGE query to either insert or update the information.
Is it best to store this information in a variable and send one query after the loop has finished, or send numerous small queries within the loop? I expect 60-100 queries for each loop.
$DOM = simplexml_load_file($url);
$info = $DOM->info;
foreach ($info as $i) {
    $i_name = $i['name'];
    $i_id = $i['id'];
    ...
    $q = sqlsrv_query($conn, "
        MERGE dbo.members m USING (
            SELECT
                '$i_name' as name,
                '$i_id' as id,
                ...
        ) s ON ( m.id = s.id )
        WHEN MATCHED THEN
            UPDATE SET ...
        WHEN NOT MATCHED THEN
            INSERT ...
    ");
}
My experience is that the best performance comes from batching the SQL statements several hundred at a time.
Hopefully the language you're using (PHP? Perl? can't tell) has a utility for this; otherwise you can easily code it up yourself.
Of course, if your DB is on the same machine it probably makes no difference.
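One way to batch on the SQL Server side is to send a single MERGE whose source carries many rows at once, for example via a VALUES constructor. A sketch, where the table and column names follow the question and the literal rows stand in for values generated from the parsed XML:
MERGE dbo.members AS m
USING (VALUES
        ('name 1', 1),
        ('name 2', 2)
        -- ... a few hundred generated rows per statement ...
      ) AS s (name, id)
ON (m.id = s.id)
WHEN MATCHED THEN
    UPDATE SET name = s.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (s.id, s.name);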
It depends on various factors. You could set up a test scenario and check the performance of both options, then choose whatever is better for your case. We had a similar case, and the best option was to have a stored procedure that received a table with all the needed values.
Check these other similar questions; they are not exactly the same as yours, but I believe the answers given there will help you a lot:
Update or Merge very big tables in SQL Server
Multiple INSERT statements vs. single INSERT with multiple VALUES

Static vs dynamic sql

In my database, in several places, developers have used dynamic SQL instead of static SQL, and they say the reason for this is to improve performance. Can someone tell me whether dynamic SQL can really increase performance in a stored procedure or PL/SQL block?
Which will execute faster, and why?
1.
begin
execute immediate 'delete from X';
end;
2.
begin
delete from X;
end;
Your example code is so simple that there will be little difference, but in that case the static version would most likely execute better.
The main reason to use dynamic SQL for performance is when the SQL statement can vary in a significant way - i.e. you might be able to add extra code to the WHERE clause at runtime based on the state of the system (restrict by a sub-query on Address, if Address entered, etc).
Another reason is that sometimes using Bind variables as parameters can be counter-productive.
An example is if you have something like a status field, where data is not evenly distributed (but is indexed).
Consider the following 3 statements, when 95% of the data is 'P'rocessed
SELECT col FROM table
WHERE status = 'U' -- unprocessed
AND company = :company
SELECT col FROM table
WHERE status = 'P' -- processed
AND company = :company
SELECT col FROM table
WHERE status = :status
AND company = :company
In the final version, Oracle will choose a generic explain plan. In the first version, it may decide the best plan is to start with the index on status (knowing that 'U'nprocessed entries are a very small part of the total).
You could implement that through different static statements, but where you have more complex statements which only change by a couple of characters, dynamic SQL may be a better option.
Downsides
Each repetition of the same dynamic SQL statement incurs a soft parse, which is a small overhead compared to a static statement, but still an overhead.
Each NEW sql statement (dynamic or static) also incurs a lock on the SGA (shared memory), and can result in pushing 'old' statements out.
A bad, but common, system design is for someone to use dynamic SQL to generate simple selects that only vary by key - i.e.
SELECT col FROM table WHERE id = 5
SELECT col FROM table WHERE id = 20
SELECT col FROM table WHERE id = 7
The individual statements will be quick, but the overall system performance will deteriorate, as it is killing the shared resources.
Also - it is far harder to trap errors at compile time with dynamic SQL. If using PL/SQL this is throwing away a good compilation time check. Even when using something like JDBC (where you move all your database code into strings - good idea!) you can get pre-parsers to validate the JDBC content. Dynamic SQL = runtime testing only.
Overheads
The overhead of execute immediate is small - it is in the thousandths of a second - however, it can add up if this is inside a loop / on a method called once per object / etc. I once got a 10x speed improvement by replacing dynamic SQL with generated static SQL. However, this complicated the code, and was only done because we required the speed.
Unfortunately, this does vary on a case-by-case basis.
For your given examples, there is probably no measurable difference. But for a more complicated example, you'd probably want to test your own code.
The link @DumbCoder gave in the comments has some excellent rules of thumb which also apply to Oracle for the most part. You can use something like this to assist you in deciding, but there is no simple rule like "dynamic is faster than static".

Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?

I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way instead of the most efficient way to solve it. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECTs and UPDATEs) based on a user-supplied list of IDs. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of IDs can potentially be unbounded, but they will most likely be in the tens or hundreds (we may limit them for performance reasons later).
If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build queries dynamically. The core of the problem is that the input lists of IDs will be really arbitrary, and so no matter how clever the database is or how cleverly we implement it, we always have a random subset of integers to start with, and so eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.
One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of IDs to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the IDs into the table variable and joining the master table on it, i.e. something like this:
-- 1. Temporary table for IDs:
DECLARE @IDS TABLE (ID int);
-- 2. Split the given string of IDs, and add each ID to @IDS.
-- Omitted for brevity.
-- 3. Join the main table to @IDS:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS AS IDS ON MyTable.ID = IDS.ID;
Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step SQL Server says: “Thank you, that’s nice, but I just need a list of the IDs”, and it scans the table variable @IDS, and then does n seeks in MyTable where n is the number of IDs. I’ve done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting, and all the extra INSERTs are for nothing.
Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I’m saying is that the SQL Server has to do n index seeks no matter what and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.
Well, it depends on what's really going on. How is the user choosing these IDs?
Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?
It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).
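For instance, with a hypothetical SelectedIDs table holding the user's current choices (MyTable and SomeColumn follow the question; the new value is a placeholder), the update could be written as a sketch like this:
-- SelectedIDs is a hypothetical table holding the user's chosen IDs.
UPDATE t
SET t.SomeColumn = 'new value'
FROM MyTable AS t
WHERE EXISTS (SELECT 1
              FROM SelectedIDs AS s
              WHERE s.ID = t.ID);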
I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart client-side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter values for your query, because you're only making one change at a time.
Don't forget the old adage (with apologies to Benjamin Franklin):
He who would trade correctness for performance deserves neither
Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.
The IN clause does not guarantee an INDEX SEEK. I faced this problem before when using SQL Mobile Edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses boosted my query by approximately 400%.
Another approach is to have a temp table that stores the ID's and join it against the target table, but if this operation is used too often a permanent/indexed table can help the optimizer.
For me the IN (...) is not the preferred option due to many reasons, including the limitation on the number of parameters.
Following up on a note from Jan Zich regarding the performance of the various temp-table implementations, here are some numbers from the SQL execution plan:
XML solution: 99% time - xml parsing
comma-separated procedure using a UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
CLR UDF to split string: 98% - index seek.
Here is the code for the CLR UDF:
using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    // Streaming table-valued CLR function: one output row per comma-separated token.
    [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "ID int")]
    public static IEnumerable InitMethod(String inputString)
    {
        return inputString.Split(',');
    }

    // Converts each token produced by InitMethod into the ID column of the result row.
    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}
So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.
I tested the search of 1K records in a table of 200K rows.
A table var has issues: using a temp table with index has benefits for statistics.
A table var is assumed to always have one row, whereas a temp table has stats the optimiser can use.
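For illustration, a sketch of the two variants:
-- Table variable: no statistics; the optimiser typically assumes it holds very few rows.
DECLARE @IDS TABLE (ID int PRIMARY KEY);

-- Temp table: has statistics and can be indexed after it is loaded.
CREATE TABLE #IDS (ID int PRIMARY KEY);
-- (optionally: CREATE INDEX / UPDATE STATISTICS on #IDS after inserting the IDs)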
Parsing a CSV is easy: see questions on right...
Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using
WHERE IN
or
WHERE EXISTS
or
JOIN someholdingtable ON ...
Obviously, there are other factors which influence plan choice (like covering indexes, etc). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table variables around, but the overall result is the same.
So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.
Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).
There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.
In SQL Server 2008 or later you should be looking to use table-valued parameters.
2008 makes it simple to pass a comma-separated list to SQL Server using this method.
Here is an excellent source of information and performance tests on the subject:
Arrays-in-sql-2008
Here is a great tutorial:
passing-table-valued-parameters-in-sql-server-2008
For many years I used the 3rd approach, but when I started using an O/RM it seemed to be unnecessary.
Even loading each row by ID is not as inefficient as it looks.
If the problems with string manipulation are put aside, I think that:
WHERE ID=1 OR ID=2 OR ID=3 ...
is more efficient; nevertheless, I wouldn't do it.
You could compare performance between both approaches.
To answer the question directly, there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore what most people do in these cases is passing a comma-delimited list of identifiers, which I did as well.
Since SQL 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP) and "native" to work with since 2005:
CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
DECLARE @x XML
SELECT @x = CONVERT(XML, @MyXmlAsSTR)
Then you can join your base query directly with the XML select, like this (not tested):
SELECT t.*
FROM myTable t
INNER JOIN @x.nodes('/ROOT/ROW') AS R(x)
ON t.ID = x.value('@ID', 'INT')
when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTive.
select t.*
from (
select id = 35 union all
select id = 87 union all
select id = 445 union all
...
select id = 33643
) ids
join my_table t on t.id = ids.id
If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.
Note that query engines tend to treat
select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)
as equivalent to
select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643
and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with a disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.