I need to be able to repeatedly process an XML file and insert large amounts of data into an Oracle database. The procedure needs to be able to create new records, or update existing ones if data already exists.
I can think of two ways to process inserting/updating 100,000 records into an Oracle database. But which is the better method? Or is there another way?
1. Attempt the INSERT. If no exception, the insert works and all is good. If there is an exception, catch it and do an UPDATE instead.
2. Look up the record first (SELECT). If not found, do an INSERT. If found, do an UPDATE.
Obviously, if the Oracle table is empty, the first method saves time by skipping the lookups. But if the file was previously imported, and then someone changes a few lines and re-imports, the number of exceptions generated becomes huge.
The second method takes longer on an empty database because of the lookups, but it does not generate expensive exceptions during subsequent imports.
Is there a "normal" pattern for working with data like this?
Thanks!
I don't know what 'the' pattern is, but if you are generating a statement anyway, you could generate a union of SELECT ... FROM DUAL queries that contains all the data from the XML file. Then you can wrap this select in a MERGE INTO statement, so your SQL looks something like:
MERGE INTO YourTable t
USING (
SELECT 'Val1FromXML' as SomeKey, 'Val2FromXML' as ExtraField, 'Val3FromXML' as OtherField FROM DUAL
UNION ALL
SELECT 'Val1FromRow2' as SomeKey, 'Val2FromRow2' as ExtraField, 'Val3FromRow2' as OtherField FROM DUAL
...) x ON (x.SomeKey = t.SomeKey)
WHEN MATCHED THEN
UPDATE SET
t.ExtraField = x.ExtraField,
t.OtherField = x.OtherField
WHEN NOT MATCHED THEN
INSERT (SomeKey, ExtraField, OtherField) VALUES (x.SomeKey, x.ExtraField, x.OtherField)
The advantage of this approach is that it's only one statement, so it saves the overhead of initializing a statement for each row. Also, as a single statement it will either completely fail or completely succeed, which you would otherwise accomplish with a transaction.
And that's a pitfall as well. For an import like this, you may want to do only a limited number of rows at a time and then commit. That way you don't lock a large part of the table for too long, and you can break the import and continue later. But fortunately, it should be pretty easy to generate a MERGE INTO statement for a limited number of rows too, by simply putting no more than, say, 500 rows in the unioned select-from-duals.
The "normal" pattern would be to wrap your file with an external table and then perform an upsert via the merge keyword.
Depending on your hardware loading the file into a staging table via SQL*Loader can be much faster than using an external table.
edit - just realized you're processing the file and not trying to load it directly. GolezTrol's answer is a good way to deal with the rows you're generating. If there's a huge amount, though, I would still recommend populating a staging table and loading it separately via SQL*Loader instead of using one massive SQL statement.
We are in the process of some data integration and I get update scripts in the form of
UPDATE Table1 SET Table1.field1 = '12345' WHERE Table1.field2 = '345667';
UPDATE Table1 SET Table1.field1 = '12365' WHERE Table1.field2 = '567885';
Table1.field2 is not indexed.
The scripts run without problems, but it takes forever. 8000 rows affected in a bit over 7 minutes, which I feel is a bit long. (It's running on a dev server which is not the best, but a look at the server doesn't indicate that it is overly busy).
So my question is: is there a better (i.e. faster) way to run this type of update statement? (SQL Server 2008 R2)
Many Thanks!
You may be RBAR-ing (row-by-agonizing-row) the server with multiple UPDATE statements. Essentially, you're having it do a table scan for each query, which is obviously non-ideal. While an index would help the most, executing multiple single-value statements will still cost you.
SQL Server allows you to use JOINs for update statements, so you may see some improvement doing something like this:
WITH Incoming AS (SELECT field1, field2
FROM (VALUES('12345', '345667'),
('12365', '567885')) i(field1, field2))
UPDATE Table1
SET Table1.field1 = Incoming.field1
FROM Table1
JOIN Incoming
ON Incoming.field2 = Table1.field2;
SQL Fiddle Example
If it turns out that the number of rows in Incoming is large, you should probably materialize it as an actual table that you bulk-load into first. You should be able to put an index on the load table (and refresh statistics after the import, to make sure they are correct).
But really, an index on field2 should probably be the first thing, especially if there are multiple queries that use that column.
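If you can add it, the index itself is a one-liner; the index name below is illustrative:

CREATE INDEX IX_Table1_field2 ON Table1 (field2);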
As I see it, you can try different things (like storing all the values in another table, then updating one table using the other), but in the end, the engine is going to search using a single field, testing equality with a value.
That would require an index. If you can at least test in dev, maybe you can show the performance improvement to someone who can authorize the creation of the new index in the production environment.
That's my answer, I hope someone comes up with a better one!
I'm parsing an XML file which I loop through, storing the information in SQL Server. I send a MERGE query to either insert or update the information.
Is it best to store this information in a variable and send one query after the loop has finished, or to send numerous small queries within the loop? I expect 60-100 queries for each loop.
$DOM=simplexml_load_file($url);
$info=$DOM->info;
foreach($info as $i){
    $i_name=$i['name'];
    $i_id=$i['id'];
    ...
    $q=sqlsrv_query($conn,"
        MERGE dbo.members m USING (
            SELECT
                '$i_name' as name,
                '$i_id' as id,
                ...
        ) s ON ( m.id=s.id )
        WHEN MATCHED THEN
            UPDATE SET ...
        WHEN NOT MATCHED THEN
            INSERT ...
    ");
}
My experience is that the best performance comes from batching the SQL statements several hundred at a time.
Hopefully the language you're using (PHP? Perl? Can't tell) has a utility for this; otherwise you can easily code it up yourself.
Of course, if your DB is on the same machine it probably makes no difference.
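As a rough illustration of what a batched statement could look like in T-SQL (the table and columns follow your query; the values are made up), one MERGE can carry many rows through a VALUES constructor:

MERGE dbo.members AS m
USING (VALUES
    ('Alice', 1),
    ('Bob',   2)
    -- ... a few hundred rows per statement
) AS s(name, id)
ON m.id = s.id
WHEN MATCHED THEN
    UPDATE SET m.name = s.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (s.id, s.name);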
It depends on various factors. You could set up a test scenario and check the performance of both options, then choose whatever is better for your case. We had a similar case, and the best option was to have a stored procedure that received a table with all the needed values.
Check these other similar questions; they are not exactly the same as yours, but I believe the answers given there will help you a lot.
Update or Merge very big tables in SQL Server
Multiple INSERT statements vs. single INSERT with multiple VALUES
In several places in my database, developers have used dynamic SQL instead of static SQL, and they say the reason for this is to improve performance. Can someone tell me whether dynamic SQL can really increase performance in a stored procedure or PL/SQL block?
Which will execute faster, and why?
1.
begin
execute immediate 'delete from X';
end;
2.
begin
delete from X;
end;
Your example code is so simple that there will be little difference, but in that case the static version would most likely execute better.
The main reason to use dynamic SQL for performance is when the SQL statement can vary in a significant way - i.e. you might be able to add extra code to the WHERE clause at runtime based on the state of the system (restrict by a sub-query on Address, if Address entered, etc).
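For example, a minimal PL/SQL sketch of that idea (the table, column and variable names are made up for illustration):

DECLARE
  l_sql     VARCHAR2(4000);
  l_count   NUMBER;
  l_company NUMBER := 42;          -- illustrative input
  l_city    VARCHAR2(30) := NULL;  -- optional filter entered by the user
BEGIN
  l_sql := 'SELECT COUNT(*) FROM orders WHERE company_id = :company';
  IF l_city IS NOT NULL THEN
    -- extra predicate only when the user supplied a value
    l_sql := l_sql || ' AND city = :city';
    EXECUTE IMMEDIATE l_sql INTO l_count USING l_company, l_city;
  ELSE
    EXECUTE IMMEDIATE l_sql INTO l_count USING l_company;
  END IF;
END;
/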
Another reason is that sometimes using Bind variables as parameters can be counter-productive.
An example is if you have something like a status field, where data is not evenly distributed (but is indexed).
Consider the following three statements, when 95% of the data is 'P'rocessed:
SELECT col FROM table
WHERE status = 'U'-- unprocessed
AND company = :company
SELECT col FROM table
WHERE status = 'P' -- processed
AND company = :company
SELECT col FROM table
WHERE status = :status
AND company = :company
In the final version, Oracle will choose a generic explain plan. In the first version, it may decide the best plan is to start with the index on status (knowing that 'U'nprocessed entries are a very small part of the total).
You could implement that through different static statements, but where you have more complex statements which only change by a couple of characters, dynamic SQL may be a better option.
Downsides
Each repetition of the same dynamic SQL statement incurs a soft parse, which is a small overhead compared to a static statement, but still an overhead.
Each NEW sql statement (dynamic or static) also incurs a lock on the SGA (shared memory), and can result in pushing 'old' statements out.
A bad, but common, system design is for someone to use dynamic SQL to generate simple selects that only vary by key - i.e.
SELECT col FROM table WHERE id = 5
SELECT col FROM table WHERE id = 20
SELECT col FROM table WHERE id = 7
The individual statements will be quick, but the overall system performance will deteriorate, as it is killing the shared resources.
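The shared-pool-friendly alternative is a single statement with a bind variable, parsed once and reused for every key, e.g. (using the same placeholder names as above):

SELECT col FROM table WHERE id = :id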
Also - it is far harder to trap errors at compile time with dynamic SQL. If using PL/SQL this is throwing away a good compilation time check. Even when using something like JDBC (where you move all your database code into strings - good idea!) you can get pre-parsers to validate the JDBC content. Dynamic SQL = runtime testing only.
Overheads
The overhead of execute immediate is small - it is in the thousandths of a second - however, it can add up if this is inside a loop / on a method called once per object / etc. I once got a 10x speed improvement by replacing dynamic SQL with generated static SQL. However, this complicated the code, and was only done because we required the speed.
Unfortunately, this does vary on a case-by-case basis.
For your given examples, there is probably no measurable difference. But for a more complicated example, you'd probably want to test your own code.
The link @DumbCoder gave in the comments has some excellent rules of thumb which also apply to Oracle for the most part. You can use something like this to assist you in deciding, but there is no simple rule like "dynamic is faster than static".
I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way rather than the most efficient way to solve it. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECTs and UPDATEs) based on a user-supplied list of IDs. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of IDs can potentially be unbounded, but they will most likely be in the tens or hundreds (we may limit them for performance reasons later).
If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build the queries dynamically. The core of the problem is that the input lists of IDs will be really arbitrary, so no matter how clever the database is or how cleverly we implement it, we always have a random subset of integers to start with, and eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.
One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of IDs to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the IDs into the table variable and joining the master table on it, i.e. something like this:
-- 1. Table variable for the IDs:
DECLARE @IDS TABLE (ID int);
-- 2. Split the given string of IDs, and insert each ID into @IDS.
-- Omitted for brevity.
-- 3. Join the main table to @IDS:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS AS IDS ON MyTable.ID = IDS.ID;
Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step SQL Server says: "Thank you, that's nice, but I just need a list of the IDs", and it scans the table variable @IDS and then does n seeks in MyTable, where n is the number of IDs. I've done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting, and all the extra INSERTs are for nothing.
Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I'm saying is that SQL Server has to do n index seeks no matter what, and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.
Well, it depends on what's really going on. How is the user choosing these IDs?
Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?
It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).
I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart client-side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.
Don't forget the old adage (with apologies to Benjamin Franklin):
He who would trade correctness for performance deserves neither
Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.
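If you do hit that limit, one common workaround (just a sketch; my_table is illustrative) is to split the list across several IN predicates joined with OR:

SELECT col FROM my_table
WHERE id IN (1, 2, 3 /* ... up to 1000 values ... */)
   OR id IN (1001, 1002 /* ... next chunk ... */);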
The IN clause does not guarantee an index seek. I faced this problem before when using SQL Server Mobile Edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses boosted my query by approximately 400%.
Another approach is to have a temp table that stores the IDs and join it against the target table; but if this operation is used too often, a permanent/indexed table can help the optimizer.
For me, IN (...) is not the preferred option, for many reasons including the limitation on the number of parameters.
Following up on a note from Jan Zich regarding the performance of various temp-table implementations, here are some numbers from the SQL execution plan:
XML solution: 99% time - xml parsing
comma-separated procedure using a UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
CLR UDF to split string: 98% - index seek.
Here is the code for CLR UDF:
using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    [SqlFunction(FillRowMethodName = "FillRow")]
    public static IEnumerable InitMethod(String inputString)
    {
        // One element per comma-separated token
        return inputString.Split(',');
    }

    public static void FillRow(Object obj, out int ID)
    {
        // Each token becomes one row with a single int column
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}
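For completeness, a hypothetical registration and usage of that UDF from T-SQL might look like this (the assembly name and DLL path are made up; MyTable/SomeColumn follow the question):

CREATE ASSEMBLY SplitAssembly FROM 'C:\clr\SplitString.dll';  -- path is illustrative
GO
CREATE FUNCTION dbo.SplitString (@list NVARCHAR(MAX))
RETURNS TABLE (ID INT)
AS EXTERNAL NAME SplitAssembly.SplitString.InitMethod;
GO
-- The filter then becomes a simple join:
SELECT t.ID, t.SomeColumn
FROM MyTable t
JOIN dbo.SplitString('35,87,445') s ON t.ID = s.ID;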
So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.
I tested the search of 1K records in a table of 200K rows.
A table var has issues: using a temp table with an index has benefits for statistics.
A table var is assumed to always have one row, whereas a temp table has stats the optimiser can use.
Parsing a CSV is easy: see questions on right...
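A quick sketch of the temp-table variant (names are illustrative, reusing the question's MyTable):

CREATE TABLE #ids (ID INT PRIMARY KEY);              -- real statistics, unlike a table variable
INSERT INTO #ids (ID) VALUES (35), (87), (445);      -- filled from the parsed CSV
SELECT t.ID, t.SomeColumn
FROM MyTable t
JOIN #ids i ON i.ID = t.ID;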
Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using
WHERE IN
or
WHERE EXISTS
or
JOIN someholdingtable ON ...
Obviously, there are other factors which influence plan choice (like covering indexes, etc). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table-valued parameters around, but the overall result is the same.
So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.
Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).
There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.
In SQL Server 2008 or later you should be looking to use table-valued parameters.
2008 makes it simple to pass a comma-separated list to SQL Server using this method.
Here is an excellent source of information and performance tests on the subject:
Arrays-in-sql-2008
Here is a great tutorial:
passing-table-valued-parameters-in-sql-server-2008
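A minimal sketch of the table-valued-parameter approach (the type, procedure and table names are illustrative):

CREATE TYPE dbo.IdList AS TABLE (ID INT PRIMARY KEY);
GO
CREATE PROCEDURE dbo.GetByIds
    @Ids dbo.IdList READONLY
AS
BEGIN
    SELECT t.ID, t.SomeColumn
    FROM MyTable t
    JOIN @Ids i ON i.ID = t.ID;
END
GO
-- Calling it from T-SQL:
DECLARE @list dbo.IdList;
INSERT INTO @list (ID) VALUES (35), (87), (445);
EXEC dbo.GetByIds @Ids = @list;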
For many years I used the third approach, but since I started using an OR/M it seems to be unnecessary.
Even loading each row by ID is not as inefficient as it looks.
If problems with string manipulation are put aside, I think that:
WHERE ID=1 OR ID=2 OR ID=3 ...
is more efficient; nevertheless, I wouldn't do it.
You could compare performance between both approaches.
To answer the question directly: there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore, what most people do in these cases is pass a comma-delimited list of identifiers, which I did as well.
Since SQL 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP) and "native" to work with since 2005:
CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
DECLARE @x XML
SELECT @x = CONVERT(XML, @MyXmlAsSTR)
Then you can join your base query directly with the XML select as (not tested):
SELECT t.*
FROM myTable t
INNER JOIN @x.nodes('/ROOT/ROW') AS R(x)
ON t.ID = R.x.value('@ID', 'INT')
when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTive.
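A call from T-SQL would then look something like:

EXEC myProc @MyXmlAsSTR = N'<ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>';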
select t.*
from (
select id = 35 union all
select id = 87 union all
select id = 445 union all
...
select id = 33643
) ids
join my_table t on t.id = ids.id
If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.
Note that query engines tend to treat
select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)
as equivalent to
select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643
and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with a disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.