Performance consideration: cursor over 30000 records vs. a script with 30000 UPDATE statements - sql

Which is better?
1) A cursor that loops over 30000 records and performs the updates one by one
2) A script that contains 30000 UPDATE commands
Thanks

Both should take about the same time, depending mainly on how the CURSOR is declared.
The reason: either way you run 30,000 individual updates, and that is usually the dominant cost.
Note that 30,000 individual UPDATEs in one batch will probably fail anyway because of batch size and compile time...
SQL is a set-based language, and you can most likely write a single UPDATE that changes all rows in one go. If you can't, it is usually for one of two reasons:
You need "per row" logic: this can usually be achieved with CASE expressions, UDFs, etc.
You don't understand sets and SQL
With more information (the SQL and the logic) we could help you more...
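As a rough illustration of the "per row logic in a CASE expression" point, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the real server, and the parts table, its columns and the price rules are all made up for the example:

```python
import sqlite3

# In-memory SQLite database as a stand-in; table and columns are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO parts (id, category, price) VALUES (?, ?, ?)",
    [(i, "A" if i % 2 == 0 else "B", 100.0) for i in range(1, 30001)],
)

# One set-based statement; the CASE expression carries the per-row logic
cur = conn.execute(
    """
    UPDATE parts
    SET price = CASE category
                    WHEN 'A' THEN price * 1.10
                    WHEN 'B' THEN price * 0.95
                    ELSE price
                END
    """
)
conn.commit()
updated_rows = cur.rowcount  # number of rows touched by the single UPDATE
```

All 30,000 rows are changed by one statement, so there is one parse and one plan instead of 30,000.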

There is a very easy way to tell: Do it and measure the time.
Other than that, having 30000 lines does not make a lot of sense when you can have just 10.
Making updates this way for reasons other than data migration or maintenance doesn't sound wise either, and in those cases performance is not an issue, but maintainability and legibility always are.

You know, that depends on context.
It helps, though, to learn SQL, for example. You are at too low a level to see the real optimizations possible here. SQL is a lot more than just UPDATE, INSERT and simple SELECT statements.
1) A cursor that loops over 30000 records and performs the updates one by one
Linear, step-by-step processing. There is no way to parallelize it, as SQL itself has no threading mechanisms available to the user. Optimization is also one by one, i.e. the query optimizer looks at one statement at a time.
2) A script that contains 30000 UPDATE commands
Assuming the script is external, it could split the work and run it concurrently on multiple connections, i.e. run more than one statement in parallel.
But there is more:
Make a script that calculates the new values.
Bulk import them into a temporary table using the bulk copy API.
Issue ONE UPDATE statement that takes the updated values from the temporary table and applies them to the final one.
Maybe have a script that issues a MERGE statement for a multi-row update? There are tons of variations if you know more of the SQL API than "update, open cursor, simple select".
I do that, though with a lot more data (batches of 50,000, sometimes 4-6 at the same time). The catch is that SQL bulk copy has some overhead, but I manage 75,000 inserts per second that way.
A lot depends on the business questions and the complexity of the logic. If it is simple updates, then the question is: calculated or externally driven? Multiplying values by 2 = calculated; updating addresses = data driven (i.e. you need the new data from somewhere).
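The "stage the new values, then issue one UPDATE" steps can be sketched roughly as follows. This is only an illustration: it uses sqlite3 and executemany in place of a real bulk copy API, the customers and staged tables are hypothetical, and on SQL Server or Postgres the final statement would more likely be an UPDATE with a join or a MERGE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, address TEXT)")
conn.executemany(
    "INSERT INTO customers (id, address) VALUES (?, ?)",
    [(i, "old") for i in range(1, 1001)],
)

# Step 1: new values calculated outside the database (made-up data)
new_values = [(i, f"new address {i}") for i in range(1, 1001, 2)]

# Step 2: load them into a temporary staging table
# (executemany stands in for a bulk copy API here)
conn.execute("CREATE TEMP TABLE staged (id INTEGER PRIMARY KEY, address TEXT)")
conn.executemany("INSERT INTO staged (id, address) VALUES (?, ?)", new_values)

# Step 3: ONE update that pulls the values from the staging table
conn.execute(
    """
    UPDATE customers
    SET address = (SELECT s.address FROM staged AS s WHERE s.id = customers.id)
    WHERE id IN (SELECT id FROM staged)
    """
)
conn.commit()
```

The WHERE clause restricts the update to rows that actually have a staged value, so untouched rows keep their old address.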

Related

What Are the Performance Differences Between Running One vs Many Inserts

I'm currently in a situation where I'm building a script that I know will need to insert multiple rows. I'm doing this in Perl, so in terms of parameterization, it's much easier to insert each row individually. In terms of speed, I'm guessing running just one insert statement will be faster (although latency will be relatively low as I'm quite close to the database itself). I'm thinking the number of rows per run of the script will be about 20-40 on average. That said, what would be the approximate performance difference between running just one INSERT INTO statement vs. running one for each row? Note: The server is running SQL Server 2008.
[EDIT]Since there seems to be a lot of confusion, I'd like to clarify that what I'm really asking for is the theory behind how a multi-row insert is handled by SQL Server 2008. Does it essentially just convert it internally into a bunch of individual insert statements and run those over one connection, or does it do something more intelligent?
Yes, I know I can run timed loops. No, that's not what I'm asking for. [/EDIT]
Combining multiple inserts into one command is always going to execute much more quickly than executing separate inserts. The reasons are:
A lot of work goes into parsing the SQL; with the multi-row version, there is only one parsing effort
More work goes into checking permissions; again, this is done only once
Database connections are "chatty"; with the multi-row version, the handshaking is done only once. You really notice this on a poor network connection
Finally, the multi-row version gives the server an opportunity to optimize the operation
There is a general idea to let the SQL database do its thing and not try to treat the database as some sort of disk read. I've seen many times where a developer will read from one table, then another, or do a general query and then run through each row to see if it's the one they want. Generally, it's better to let the SQL database do its thing.
In this case, I can't really see an advantage of doing a single vs. multiple row insert. I guess there might be some because you don't have to do multiple prepares and commits.
It shouldn't be too difficult to actually create a temporary database and try this out. Create a table with two columns, and have the program generate data to toss into it. Give yourself a decent amount to do. For example, how many items will this table have? And how many do you think you'll be inserting at once? Say, create a table of 1,000,000 items, and insert into this table 1000 items at a time, 100 items at a time, and one item at a time. Just generate data using the increment operator. There may be a "sweet spot" for the number of items you can insert at once.
In my unbiased, and always correct opinion, you'll probably find that the difference isn't worth fretting over, and you should instead employ the method that makes your code the easiest to maintain.
I have a programming dictum: The place where you want to optimize your code is probably the wrong place. We like efficiency, but we usually attack the wrong item. And whatever we've squeezed out in terms of efficiency, we end up wasting in maintenance.
So, just program what is the easiest to understand and don't fret about being overly efficient.
Just to add a couple of other performance differentiators to think about on insertion:
Foreign Keys - If the table you are inserting into has foreign keys, SQL Server effectively needs to join to the foreign key tables on insert. When you do your inserts in one query, SQL server can be more efficient in doing these joins.
Transactions - As you don't mention transactions, I assume you must be using SQL Server auto-commit mode. With such a small number of rows, it is likely that the overhead of creating 40 transactions vs. 1 transaction would be higher than maintaining the log to allow rollback. However, if you were inserting 400000 rows, it would likely be more expensive to insert in one statement/transaction than insert 400000 separate rows as the cost to be prepared to roll back up to 400000 rows is very high (if you were to insert 400000 rows, it usually is best to insert in batches -> the optimal batch size can be determined through testing). Also, above a certain row count, it may become more efficient to disable the foreign keys, insert the rows, then re-enable them.
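To make the "insert in batches, one transaction per batch" idea concrete, here is a small sketch. It again uses sqlite3 as a stand-in, and the items table, the insert_in_batches helper and the batch size of 1000 are assumptions for illustration; the right batch size for a real server has to be found by testing, as noted above:

```python
import sqlite3
from itertools import islice

def insert_in_batches(conn, rows, batch_size):
    """Insert rows in chunks, committing once per chunk.

    Returns the number of transactions (commits) that were needed.
    """
    it = iter(rows)
    commits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        with conn:  # commits on success, rolls back the chunk on an exception
            conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", batch)
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# 2500 made-up rows, inserted 1000 at a time
commits = insert_in_batches(
    conn, ((i, f"item {i}") for i in range(2500)), batch_size=1000
)
```

With 20-40 rows per run, a single batch (one transaction) covers everything; the chunking only starts to matter for the very large inserts mentioned above.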

Postgres: How to fire multiple queries in same time?

I have a procedure which updates record values, and I want to run it against all records in a table (over 30k records). The procedure's execution time is anywhere from 2 to 10 seconds, because it depends on network load.
Right now I'm doing UPDATE table SET field = procedure_name(paramns); but with that number of records it takes up to 40 minutes to process the whole table.
To speed this up I'm using 4 different connections which fork to the background and fire the query with a WHERE clause that iterates over the row ids modulo 4 ( WHERE id_field % 4 = ), and this works well and cuts the table population down to ~10 mins.
But I want to avoid using cron, shell jobs and multiple connections for this. I know that it can be done with libpq, but is there a way to fire off a query (4 different non-blocking queries) within a single connection without waiting for it to finish?
Or can anyone point me to some clues on how to write that function, using Postgres internals, or simply in C and bound as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case and not that it's either an I/O bottleneck on the DB or that procedure_name(paramns); is taking a long time. (If the procedure itself were taking 2-10 seconds per record, it would take something like 2500 min to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to 1/4. So in particular it is not an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to set this up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as these (note that this will require a lot of modification if multiple users can run it concurrently):
Refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a SELECT statement. You may need to use creative joins. If it's an SP, of course, you are now moving the logic to the client.
Using that, have the program create an XML file or other importable flat file format with the PK of each record to update and the new field value or values. Write all the updates to this file instead of executing them on the DB.
Have a temp table on the database that matches the layout of this flat file.
Run an import on the database: clear the temp table and import the file.
Do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl SET myval = mytemp.mytempnewval FROM mytemp WHERE mytbl.myPK = mytemp.mytempPK (check the join syntax for your database, of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
    table_name
SET
    column_name = temp.column_name
FROM
    (VALUES
        (<id1>, <value1>),
        (<id2>, <value2>),
        (<id3>, <value3>)
    ) AS temp("id", "column_name")
WHERE
    table_name.id = temp.id;
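When the (id, value) pairs come from a program, the VALUES list is better built with placeholders than by pasting values into the string. The sketch below only builds the statement text and the flat parameter list and does not talk to a database. The build_values_update helper is hypothetical, and the %s placeholder style is an assumption (it is what psycopg-style drivers use); depending on the driver, explicit casts on the placeholders may also be needed:

```python
def build_values_update(pairs):
    """Return (sql, params) for one UPDATE ... FROM (VALUES ...) statement.

    pairs is a list of (id, new_value) tuples. Table and column names are
    the placeholders from the example above, not real objects.
    """
    if not pairs:
        raise ValueError("need at least one (id, value) pair")
    # One "(%s, %s)" group per row to update
    placeholders = ", ".join(["(%s, %s)"] * len(pairs))
    sql = (
        "UPDATE table_name "
        "SET column_name = temp.column_name "
        f"FROM (VALUES {placeholders}) AS temp(id, column_name) "
        "WHERE table_name.id = temp.id"
    )
    # Flatten [(id, value), ...] into [id, value, id, value, ...]
    params = [item for pair in pairs for item in pair]
    return sql, params

sql, params = build_values_update([(1, "a"), (2, "b"), (3, "c")])
```

The resulting pair would then be handed to the driver's execute call, so all three rows are updated in a single round trip.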
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages; you'll have to dig into the manuals.
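For comparison, the asker's own workaround (splitting the rows by id modulo 4 across separate workers) can be sketched from a single client program instead of cron and shell jobs. This is only an outline of the partitioning: process_partition is a hypothetical worker that just collects the ids it would handle, where real code would open its own connection and run the UPDATE with the matching WHERE id_field % 4 condition:

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4

def process_partition(remainder, ids):
    """Hypothetical worker for one partition.

    Real code would open its OWN database connection here and run the
    UPDATE restricted to id_field % WORKERS = remainder. This stand-in
    only returns the ids that fall into the partition.
    """
    return [i for i in ids if i % WORKERS == remainder]

ids = list(range(1, 30001))

# One thread per partition; each would hold its own connection
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    parts = list(pool.map(lambda r: process_partition(r, ids), range(WORKERS)))
```

Every id lands in exactly one partition, so the workers never update the same row.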
I think you can't. A single connection can handle only one query at a time. This is described in the libpq documentation, chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."

Update thousands of records in a DataSet to SQL Server

I have half a million records in a data set of which 50,000 are updated. Now I need to commit the updated records back to the SQL Server 2005 Database.
What is the best and most efficient way to do this, considering that such updates could be frequent (concurrency is not an issue, but performance is)?
I would use a Batch Update.
Also documented here.
I agree with David's answer, as that's what I use. However, there is an alternative approach you could take which is worth considering (all situations are different after all) - it's something I would consider in the future if I had another similar requirement.
You could bulk insert the updated records into a new table in the DB (using SqlBulkCopy), which is an extremely fast way of loading data into the DB. Then run an UPDATE statement on your main table to pull in the updated values from this new table, which you would drop at the end.
The batched update approach of using SqlDataAdapter allows you to easily deal with any errors on specific rows (e.g. you could tell it to continue in the event of an error with a specific updated row so it doesn't stop the whole process).

Should I break down large SQL queries (MS)

This is in regards to MS SQL Server 2005.
I have an SSIS package that validates data between two different data sources. If it finds differences it builds and executes a SQL update script to fix the problem. The SQL Update script runs at the end of the package after all differences are found.
I'm wondering if it is necessary or a good idea to somehow break down the SQL update script into multiple transactions, and what the best way to do this is.
The update script looks similar to this, but longer (example):
UPDATE MyPartTable SET MyPartGroup = (SELECT PartGroupID FROM MyPartGroupTable
    WHERE PartGroup = 'Widgets'), PartAttr1 = 'ABC', PartAttr2 = 'DEF', PartAttr3 = '123'
WHERE PartNumber = 'ABC123';
For every error/difference found an additional Update query is added to the Update Script.
I only expect about 300 updates on a daily basis, but sometimes there could be 50,000. Should I break the script down into transactions every, say, 500 update queries or something?
Don't optimize anything before you know there is a problem. If it is running fast, let it go. If it is running slow, make some changes.
No, I think the statement is fine as it is. It won't make much of a difference in speed at all.
Billy makes a valid point if you care about the readability of the query (you should if it is a query that will be seen or used in the future).
Would your system handle other processes reading the data that has yet to be updated? If so, you might want to perform multiple transactions.
The benefit of performing multiple transactions is that you will not continually accumulate locks. If you perform all these updates at once, SQL Server will eventually run out of small-grained lock resources (row/key) and escalate to a table lock. When it does this, nobody else will be able to read from these tables until the transaction completes (unless they use dirty reads or are in snapshot mode).
The side effect is that other processes that read data may get inconsistent results.
So if nobody else needs to use this data while you are updating, then sure, do all the updates in one transaction. If there are other processes that need to use the table, then yes, do it in chunks.
It shouldn't be a problem to split things up. However, if you want to A. maintain consistency between the items, and/or B. perform slightly better, you might want to use a single transaction for the whole thing.
BEGIN TRANSACTION;
-- Write 500 things
-- Write 500 things
-- Write 500 things
COMMIT TRANSACTION;
Transactions exist for just this reason -- where program logic would be clearer by splitting up queries but where data consistency between multiple actions is desired.
All records affected by the query will be either locked or, if the transaction operates at the SNAPSHOT isolation level, copied into tempdb.
If the number of records is high enough, the locks may be escalated.
If the transaction isolation level is not SNAPSHOT, then a concurrent query will not be able to read the locked records, which may be a concurrency problem for your application.
If the transaction isolation level is SNAPSHOT, then tempdb must have enough space to accommodate the old versions of the records, or the query will fail.
If either of these is a problem for you, then you should split the update into several chunks.

SQL, selecting and updating

I am trying to select hundreds of rows from a DB that contains hundreds of thousands of rows, and update those rows afterwards.
The problem is that I don't want to go to the DB twice for this, since the update only marks those rows as "read".
Is there any way I can do this in Java using the plain JDBC libraries? (Hopefully without using stored procedures.)
Update: OK, here is some clarification.
There are a few instances of the same application running on different servers. They all need to select hundreds of "UNREAD" rows sorted by the creation_date column, read the blob data in them, write it to a file and FTP that file to some server. (I know, prehistoric, but requirements are requirements.)
The read-and-update part is there to ensure each instance gets a different set of data. (In order, so tricks like odds and evens won't work :/)
We select the data for update, the data transfers over the wire (we wait and wait), and then we mark the rows as "READ" and release the lock. This entire thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the SELECT FOR UPDATE to the actual update) so that using multiple instances would increase the rows read per second.
Still have ideas?
It seems to me there might be more than one way to interpret the question here.
1) You are selecting the rows for the sole purpose of updating them and not reading them.
2) You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3) You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received them, or whatever). For this, you'll probably have to do another update. If you query for the primary key in addition to the other columns you need in the first select, you will probably have an easier time updating, as the DB won't have to do table or index scans to find the rows.
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That's worked out well when I need to perform a lot of updates that are of the exact same form.
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
Going to the DB isn't so bad. If you aren't returning anything 'across the wire' then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?
If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.
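One possible way to apply "mark first with a single UPDATE, read afterwards" to the clarified question is to tag each batch with a claim token, so the slow transfer happens after the short update has committed. This is a sketch of a different technique from the SELECT FOR UPDATE flow described in the question, not something taken from the answers: the messages table, the claimed_by column and the claim_batch helper are all hypothetical, sqlite3 stands in for the real database, and the LIMIT syntax and the behaviour under concurrent claims depend on the actual server:

```python
import sqlite3
import uuid

def claim_batch(conn, batch_size):
    """Mark up to batch_size unread rows as claimed by this call, then return them."""
    token = uuid.uuid4().hex  # hypothetical per-claim marker
    with conn:  # one short transaction: the UPDATE both picks and marks the rows
        conn.execute(
            """
            UPDATE messages
            SET status = 'READ', claimed_by = ?
            WHERE id IN (
                SELECT id FROM messages
                WHERE status = 'UNREAD'
                ORDER BY creation_date
                LIMIT ?
            )
            """,
            (token, batch_size),
        )
    # The rows are already marked, so the slow blob transfer needs no lock
    return conn.execute(
        "SELECT id, payload FROM messages WHERE claimed_by = ? ORDER BY creation_date",
        (token,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, creation_date INTEGER, "
    "status TEXT, claimed_by TEXT, payload BLOB)"
)
conn.executemany(
    "INSERT INTO messages (id, creation_date, status, payload) VALUES (?, ?, 'UNREAD', ?)",
    [(i, i, b"x") for i in range(1, 251)],
)
conn.commit()

first = claim_batch(conn, 100)   # oldest 100 unread rows
second = claim_batch(conn, 100)  # the next 100, never overlapping with first
```

The trade-off is that a row is marked "READ" before the file has actually been sent, so a crashed instance would need some way to release its claimed rows.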
Can't you just use the same connection without closing it?