Best approach for Oracle database mass update - sql

I need to do a mass update of a table (~30,000 records) in an Oracle 10g database. The challenge is there is no "where" clause that can select the target rows. Each target row can, however, be identified via a composite key. The catch here is that the list of composite keys is from an external source (not in the database).
Currently I have a Java program that loops through the list of composite keys and spits out a PL/SQL procedure which is essentially just a bunch of repeated update statements similar to the following:
update table1 t set myfield='Updated' where t.comp_key1='12345' and t.comp_key2='98765';
Is there a better way to do this, or is this "good enough" considering we're only dealing with ~30K records?

Possibly good enough, but if the external source of the keys is a file, create an external table pointing at that file to expose its contents as a relational table; then you can potentially do the whole thing in a single MERGE (update) statement.
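If you go that route, the shape of it would be roughly as follows (a sketch only: the directory object, file name and column types are invented, and a CREATE DIRECTORY pointing at the file's location is assumed to exist):
-- External table over the CSV of composite keys (names are illustrative).
CREATE TABLE ext_keys (
  comp_key1 VARCHAR2(20),
  comp_key2 VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('keys.csv')
);

-- One set-based update driven by the key list.
MERGE INTO table1 t
USING ext_keys k
   ON (t.comp_key1 = k.comp_key1 AND t.comp_key2 = k.comp_key2)
 WHEN MATCHED THEN UPDATE SET t.myfield = 'Updated';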

Good enough.
30,000 updates using the primary key, even if they are all hard-parsed, will normally only take a few seconds. You could probably speed things up by combining the updates, as @Ed Gibbs suggested. But so far this looks like a very quick process that isn't worth optimizing. Putting it all in a single PL/SQL procedure was a smart move, and saved 99% of the time that would be needed for a really naive, row-by-row-from-the-client solution.
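One way to combine them (just a sketch, not necessarily what was suggested) is to have the generated PL/SQL bulk-bind the key pairs into collections and issue a single FORALL update; the key values below are placeholders:
-- Sketch of a bulk-bound alternative to thousands of literal UPDATEs.
DECLARE
  TYPE t_keys IS TABLE OF VARCHAR2(20);
  l_key1 t_keys := t_keys('12345', '12346');   -- placeholder values; the
  l_key2 t_keys := t_keys('98765', '98766');   -- generator would fill these
BEGIN
  FORALL i IN 1 .. l_key1.COUNT
    UPDATE table1 t
       SET t.myfield = 'Updated'
     WHERE t.comp_key1 = l_key1(i)
       AND t.comp_key2 = l_key2(i);
  COMMIT;
END;
/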

Related

Speeding up deletes that have joins

I am running a stored procedure to delete data from two tables:
delete from TESTING_TestResults
from TESTING_TestResults
inner join TESTING_QuickLabDump
    on TESTING_QuickLabDump.quicklabdumpid = TESTING_TestResults.quicklabdumpid
where TESTING_QuickLabDump.[Specimen ID] = @specimen

delete from TESTING_QuickLabDump
where [Specimen ID] = @specimen
One table has 60M rows and the other about 2M rows.
The procedure takes about 3 seconds to run.
Is there any way I can speed this up? Perhaps using EXISTS?
Meaning IF EXISTS...THEN DELETE - because the delete should not be occurring every single time.
Something like this:
if @specimen exists in TESTING_QuickLabDump then do the procedure with the two deletes
thank you !!!
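For what it's worth, the guard described in the question would look roughly like this in T-SQL (a sketch only - a DELETE that matches no rows is already cheap, so the check may not buy much):
IF EXISTS (SELECT 1 FROM TESTING_QuickLabDump WHERE [Specimen ID] = @specimen)
BEGIN
    DELETE tr
    FROM TESTING_TestResults tr
    INNER JOIN TESTING_QuickLabDump q
        ON q.quicklabdumpid = tr.quicklabdumpid
    WHERE q.[Specimen ID] = @specimen;

    DELETE FROM TESTING_QuickLabDump
    WHERE [Specimen ID] = @specimen;
END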
Rewriting the query probably won't help speed this up. Use the profiler to find out which parts of the query are slow; have it output the execution plan. Then try adding appropriate indexes. Perhaps one or both tables could use an index over [Specimen ID].
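If that turns out to be the case, the supporting indexes might look something like this (the index names are made up; confirm against the actual execution plan that they get used):
CREATE NONCLUSTERED INDEX IX_QuickLabDump_SpecimenID
    ON TESTING_QuickLabDump ([Specimen ID]);
CREATE NONCLUSTERED INDEX IX_TestResults_QuickLabDumpId
    ON TESTING_TestResults (quicklabdumpid);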
For a table with 60 mil rows I would definitely look into partitioning the data horizontally and/or vertically. If it's time-sensitive data then you ought to be able to move old data into a history table. That's usually the first and most obvious thing people do so I would imagine if that were a possibility you would have already done it.
If there are many columns then it would definitely benefit you to denormalize the data into multiple tables. If you did this, I would suggest renaming the tables and creating a view of all the partitioned tables named after the original table. Doing that should ensure existing code isn't broken.
If you really want to fine-tune the speed then you should look into getting a faster hard drive and learn a little about how hard drives work. Whether the data is stored towards the inner or outer section of the disk will slightly affect access speed, for example. And solid-state drives have come a long way, so you might look into getting one of those.
Beside indexing "obvious" fields, also look in your database schema and check if you have any FOREIGN KEYs whose ON DELETE CASCADE or SET NULL might be triggered by your delete (unlike Oracle, MS SQL Server will tend to show these in the execution plan). Fortunately, this is usually fairly easy to fix by indexing the child endpoint of the FOREIGN KEY.
Also check if you have any expensive triggers.

Temp table or permanent tables?

For my company I am redesigning some stored procedures. The original procedures use lots of permanent tables which are filled during execution of the procedure, and at the end the values are deleted. The number of rows can range from 100 to 50,000 for the aggregation calculations.
My question is: will there be severe performance issues if I replace those tables with temp tables? Is it feasible to use temp tables?
It depends on how often you're using them, how long the processing takes, and whether you are concurrently accessing data from the tables while writing.
If you use a temp table, it won't be sitting around waiting for indexing and caching while it's not in use. So it should save an ever so slight bit of resources there. However, you will incur overhead with the temp tables (i.e. creating and destroying).
I would re-examine how your queries function in the procedures and consider employing more in-procedure CURSOR operations instead of loading everything into tables and deleting them afterwards.
However, databases are for storing information and retrieving information. I would shy away from using permanent tables for routine temp work and stick with the temp tables.
The overall performance shouldn't have any effect with the use case you specified in your question.
Hope this helps,
Jeffrey Kevin Pry
Yes, it's certainly feasible. You may want to check whether the permanent tables have any indexing on them to speed up joins and so on.
I agree with Jeffrey. It always depends.
Since you're using SQL Server 2008 you might have a look at table variables.
They should be lighter-weight than temp tables.
I define a User Defined Function which returns a table variable like this:
CREATE FUNCTION dbo.ufd_GetUsers ( @UserCode INT )
RETURNS @UsersTemp TABLE
(
    UserCode INT NOT NULL,
    RoleCode INT NOT NULL
)
AS
BEGIN
    INSERT @UsersTemp
    SELECT
        dbo.UsersRoles.Code, Roles.Code
    FROM
        dbo.UsersRoles
    INNER JOIN
        dbo.UsersRolesRelations ON dbo.UsersRoles.Code = dbo.UsersRolesRelations.UserCode
    INNER JOIN
        dbo.UsersRoles Roles ON dbo.UsersRolesRelations.RoleCode = Roles.Code
    WHERE dbo.UsersRoles.Code = @UserCode

    INSERT @UsersTemp VALUES(@UserCode, @UserCode)

    RETURN
END
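The function can then be queried like a table, for example (the user code here is just an example value):
SELECT UserCode, RoleCode
FROM dbo.ufd_GetUsers(42);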
A big question is, can more than one person run one of these stored procedures at a time? I regularly see this kind of table carried over from old single-user databases (or from programmers who couldn't do subqueries or much of anything beyond SELECT * FROM). What happens if more than one user tries to run the same procedure? What happens if it crashes midway through - does the table get cleaned up? With temp tables or table variables you have the ability to properly scope the table to just the current connection.
Definitely use a temporary table, especially since you've alluded to the fact that its purpose is to assist with calculations and aggregates. If you used a table inside one of your database's schemas all that work is going to be logged - written, backed up, and so on. Using a temporary table eliminates that overhead for data that in the end you probably don't care about.
You might actually save some time from the fact that you can drop the temp tables at the end instead of deleting rows (you said you have multiple users, so you have to delete rather than truncate). Deleting is a logged operation and can add considerable time to the process. If the permanent tables are indexed, then create the temp tables and index them as well. I would bet you would see an increase in performance unless your tempdb is close to out of space.
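A minimal sketch of that pattern, with made-up table and column names:
-- Build the working set in tempdb, index it, aggregate, then drop it.
SELECT CustomerId, Amount
INTO #work
FROM dbo.Orders;                     -- stands in for the real source query

CREATE INDEX IX_work_CustomerId ON #work (CustomerId);

SELECT CustomerId, SUM(Amount) AS Total
FROM #work
GROUP BY CustomerId;

DROP TABLE #work;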
Table variables also might work, but they can't be indexed and they are generally only faster for smaller datasets. So you might try a combination of temp tables for the things that will be large enough to benefit from indexing, and table variables for the smaller items.
An advantage of using temp tables and table variables is that you guarantee that one user's process won't interfere with another user's process. You say they currently have a way to identify which records belong to which run, but all it takes is one bug being introduced to break that when using permanent tables. Permanent tables for temporary processing are a very risky choice. Temp tables and table variables can never see the data from someone else's process and thus are far safer as a choice.
Table variables are normally the way to go.
SQL2K and below can have significant performance bottlenecks if there are many temp tables being manipulated - the issue is the blocking DDL on the system tables.
SQL 2005 is better, but table vars avoid the whole issue by not using those system tables at all, so they can perform without inter-user locking issues (except those involved with the source data).
The issue then is that table vars only persist within scope, so if there is genuinely a large amount of data that needs to be processed repeatedly and persisted over a (relatively) long duration, then 'static' work tables may actually be faster - but they do need a user key of some sort and regular cleaning. A last resort, really.
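For comparison, the table-variable form of the earlier working-set sketch (again with made-up names) needs no explicit cleanup because it goes out of scope with the batch:
DECLARE @work TABLE (
    CustomerId INT NOT NULL,
    Amount     MONEY NOT NULL
);

INSERT @work (CustomerId, Amount)
SELECT CustomerId, Amount
FROM dbo.Orders;                     -- illustrative source

SELECT CustomerId, SUM(Amount) AS Total
FROM @work
GROUP BY CustomerId;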

sql: DELETE + INSERT vs UPDATE + INSERT

A similar question has been asked, but since it always depends, I'm asking for my specific situation separately.
I have a web-site page that shows some data that comes from a database, and to generate that data I have to run some fairly complex queries with multiple joins.
The data is being updated once a day (nightly).
I would like to pre-generate the data for the said view to speed up the page access.
For that I am creating a table that contains exact data I need.
Question: for my situation, is it reasonable to do a complete table wipe followed by insert, or should I do update + insert?
SQL-wise it seems like DELETE + INSERT will be easier (the INSERT part is a single SQL expression).
EDIT: RDBMS: MS SQL Server 2008 Ent
TRUNCATE will be faster than DELETE, so if you need to empty a table, do that instead.
You didn't specify your RDBMS vendor, but some of them also have MERGE/UPSERT commands. These enable you to update the table if the data exists and insert it if it doesn't.
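Since the edit mentions SQL Server 2008, a MERGE would look roughly like this (the target, staging table and column names are placeholders):
MERGE dbo.ReportData AS target
USING dbo.ReportData_Staging AS source
    ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Value = source.Value
WHEN NOT MATCHED THEN
    INSERT (Id, Value) VALUES (source.Id, source.Value);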
It partly depends on how the data is accessed. If you have a period of time with no (or very few) users accessing it, then there won't be much impact on the data disappearing (between the DELETE and the completion of the INSERT) for a short while.
Have you considered using a materialized view (MSSQL calls them indexed views) instead of doing it manually? This could also have other performance benefits, as an indexed view gives the query optimizer more choices when it's constructing execution plans for other queries that reference the table(s) in the view.
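A rough sketch of an indexed view (table and column names are invented; indexed views carry restrictions such as SCHEMABINDING, two-part names, no outer joins, and COUNT_BIG(*) when aggregating):
CREATE VIEW dbo.vReportData
WITH SCHEMABINDING
AS
SELECT CustomerId,
       SUM(Amount) AS TotalAmount,   -- Amount assumed NOT NULL
       COUNT_BIG(*) AS RowCnt
FROM dbo.Orders
GROUP BY CustomerId;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vReportData ON dbo.vReportData (CustomerId);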
It partly depends on the size of the table and the recovery model of the database. Deleting many hundreds of thousands of records and reinstating them, versus updating a small batch of a few hundred and inserting tens of rows, will add unnecessary size to your transaction logs. However, you could use TRUNCATE to get around this, as it won't bloat the transaction log.
Do you have the option of a MERGE/UPSERT? If you're using MS-SQL you can use CROSS APPLY to do something similar if you don't.
One approach to handling this type of problem is to insert into a new table, then do a table rename. This ensures that all of the new data is present at the same time.
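In SQL Server that swap could look something like this (object names are invented):
-- 1. Build the replacement table from the complex query.
SELECT CustomerId, TotalAmount
INTO dbo.ReportData_New
FROM dbo.SomeComplexSource;          -- stands in for the multi-join query

-- 2. Swap the tables with two quick metadata renames.
BEGIN TRAN;
EXEC sp_rename 'dbo.ReportData', 'ReportData_Old';
EXEC sp_rename 'dbo.ReportData_New', 'ReportData';
COMMIT;

DROP TABLE dbo.ReportData_Old;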
What if some data that was present yesterday isn't there anymore? DELETE may be safer, or you could end up having to delete some records anyway.
And in the end it doesn't really matter which way you go.
Unless it's the case @kevinw mentioned.
Although I fully agree with SQLMenace's answer, I would like to point out that MERGE does NOT remove unneeded records! If you're sure that your new data will be a superset of the existing data, then MERGE is great; otherwise you'll either need to make sure that you delete any superfluous records later on, or use the TRUNCATE + INSERT method ...
(Personally I'm still a fan of the latter as it usually is quite fast; just make sure to drop all indexes/unique constraints upfront and rebuild them one by one afterwards. This has the benefit of the INSERT transaction being smaller and the index-adding being done in (smaller) transactions again later on.) (**)
(**: yes, this might be tricky on a live system, but then again he already mentioned this was done during some kind of overnight job anyway, so I'm assuming there is no user access at that time)
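That approach, sketched with invented object names:
TRUNCATE TABLE dbo.ReportData;

-- Drop nonclustered indexes up front so the bulk insert writes less.
DROP INDEX IX_ReportData_CustomerId ON dbo.ReportData;

INSERT INTO dbo.ReportData (CustomerId, TotalAmount)
SELECT CustomerId, SUM(Amount)
FROM dbo.Orders
GROUP BY CustomerId;

-- Rebuild the indexes afterwards, one at a time.
CREATE INDEX IX_ReportData_CustomerId ON dbo.ReportData (CustomerId);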

RENAME faster than DROP+ADD in MySQL alter table

I'm performing some MySQL table maintenance that will mean removing some redundant columns and adding some new ones.
Some of the columns to drop are of the same type as ones to add. Would the procedure be faster if I took advantage of this and reused some of the existing columns?
My rationale is that changing column names should be a simple table metadata change, whereas removing and adding columns means either finding room at the end of the file (fragmenting data) or rebuilding every row with the correct columns so that they're at the same place on the disk.
The engine in question is MyISAM and I'm not up to scratch on exactly how it'll treat this, so I'd like to hear from anyone who has been in the same situation before!
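For reference, the two alternatives being weighed look roughly like this (table and column names are made up):
-- Reuse an existing column of the right type by renaming it:
ALTER TABLE my_table CHANGE old_col new_col INT NOT NULL;

-- Versus dropping and adding:
ALTER TABLE my_table DROP COLUMN old_col, ADD COLUMN new_col INT NOT NULL;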
Unless you have a serious issue with performance, I wouldn't take the renaming approach - because of all the dirty data you're going to leave lying around.
Also, by dropping the table, you will cause any indexes to get re-built - which is a good idea every once in a while...
Martin
I would drop the columns. You will have fragmentation either way. That should be handled in your regular maintenance plans. You could accelerate those after a large number of modification operations.
In case you don't know: on a MyISAM table, every ALTER TABLE operation makes a copy of the entire table, so the table will be locked for the time your server needs to copy it.
I've used that same logic, and got stung because even with changes that are supposed to not require rewriting the table (i.e. a table rename), a MySQL bug caused it to think it was a change that required rewriting the table.
If the fields you are dealing with are date, datetime or timestamp fields, you are likely to be hit by this, which means that you should just assume it has to do a full rewrite and plan that way.

MySQL SELECT statement using Regex to recognise existing data

My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.
SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?
This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.
Whilst this works in memory, my question is can this cleaning be done in a SQL statement so I can move this matching logic to the database, something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?
Where CLEAN_ME is a proc which strips the field of the erroneous data.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
Thanks a lot
can this cleaning be done in a SQL statement
Yes, you can write a stored procedure to do it in the database layer:
mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
-> RETURNS VARCHAR(255) DETERMINISTIC
-> RETURN REPLACE(s, '.', ' ');
mysql> SELECT clean_me('BANK.12323.DESCRIPTION');
BANK 12323 DESCRIPTION
This will perform very poorly across a large table though.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).
Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated-on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a CSV and loading it into a temporary table to get the most out of MySQL's built-in functions, which are surely faster than anything you could write yourself - especially considering that otherwise you would have to pull the data into your own application, while MySQL does everything in place.
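A sketch of that flow (the file path and column types are hypothetical, and it assumes a unique index already exists on the deduplication columns of transactions):
CREATE TEMPORARY TABLE transactions_staging (
    `desc`   VARCHAR(255),
    dated_on DATE,
    amount   DECIMAL(10,2)
);

LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE transactions_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

INSERT INTO transactions (`desc`, dated_on, amount)
SELECT `desc`, dated_on, amount
FROM transactions_staging
ON DUPLICATE KEY UPDATE amount = VALUES(amount);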
The cleanest way is indeed to make sure only correct data is in the database.
In this example the "BANK.12323.DESCRIPTION" would be returned by:
SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?
But this might impose performance issues when you have a lot of data in the table.
Another way that you could do it is as follows:
Clean the description before inserting.
Create a primary key for the table that is a combination of the columns that uniquely identify the entry. Sounds like that might be cleaned description, date and amount.
Use either the 'replace' or 'on duplicate key' syntax, whichever is more appropriate. 'replace' actually replaces the existing row in the db with the updated one when a unique key conflict occurs, e.g.:
REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)
'on duplicate key' allows you to specify which columns to update on a duplicate key error:
INSERT INTO transaction (desc, dated_on, amount) values (?,?,?)
ON DUPLICATE KEY UPDATE amount = amount
By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.
If you prefer to keep your existing primary key, you could also create a unique index on those three columns.
Whichever way you choose, I would recommend cleaning the description before going into the db, even if you also store the original description and just use the cleaned one for indexing.
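A sketch of that last suggestion - a separate cleaned column used for the uniqueness check (column sizes and the index name are illustrative):
ALTER TABLE transactions
    ADD COLUMN clean_desc VARCHAR(200) NOT NULL DEFAULT '',
    ADD UNIQUE KEY uq_txn (clean_desc, dated_on, amount);

-- The application cleans the description once, then supplies both forms:
INSERT INTO transactions (`desc`, clean_desc, dated_on, amount)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE amount = amount;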