Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
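Concretely, the loop looks roughly like this (a sketch, not my exact code; note the subquery also has to exclude rows that are already NULL, otherwise the loop never terminates):

DECLARE
  v_rows PLS_INTEGER;
BEGIN
  LOOP
    UPDATE my_table
       SET a_col = NULL
     WHERE my_table_id IN
           (SELECT my_table_id
              FROM my_table
             WHERE some_col < some_val
               AND a_col IS NOT NULL
               AND ROWNUM < 1000);
    v_rows := SQL%ROWCOUNT;   -- number of rows updated in this batch
    COMMIT;                   -- commit each batch
    EXIT WHEN v_rows = 0;     -- stop when nothing is left to update
  END LOOP;
END;
/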
Is this the best approach?
Thanks,

The amount of redo and undo generated will not be reduced at all if you break the UPDATE up into multiple runs of, say, 1,000 records. On top of that, the total execution time will most likely be higher than running a single large SQL statement.
There's no real way to address the UNDO/REDO issue for UPDATEs. With INSERTs and CREATE TABLE ... AS SELECT you can use direct-path (APPEND) operations, but I guess that doesn't easily apply in your case.

It depends on the percentage of rows almost as much as on the absolute number. It also depends on whether the update makes the rows longer than before, i.e. going from NULL to 200 bytes in every row. That could affect your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the SELECT instead of doing an update. You can build that new table via CTAS (Create Table As Select), which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way.
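A rough sketch of that sequence (table and column names follow the question; some_val stands in for the actual filter value, and NOLOGGING only skips redo for the data load if the database is not in FORCE LOGGING mode):

-- 1. Build the corrected table with a minimally logged load
CREATE TABLE my_table_new NOLOGGING AS
SELECT my_table_id,
       CASE WHEN some_col < some_val THEN NULL ELSE a_col END AS a_col,
       some_col
  FROM my_table;

-- 2. Swap the tables
DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;

-- 3. Recreate indexes, constraints, grants and triggers, then recompile dependents
CREATE INDEX my_table_some_col_idx ON my_table (some_col) NOLOGGING;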

Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then rebuilding them with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/validated as a result of the update, getting rid of those temporarily might be helpful.
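For example (a rough sketch; the index name is made up, and with SKIP_UNUSABLE_INDEXES enabled, the default in 10g and later, DML can proceed while a non-unique index is unusable):

ALTER INDEX my_table_a_col_idx UNUSABLE;           -- stop maintaining the index during the update

UPDATE my_table SET a_col = NULL WHERE some_col < some_val;
COMMIT;

ALTER INDEX my_table_a_col_idx REBUILD NOLOGGING;  -- rebuild with minimal redo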

Related

Does it affect performance to frequently repopulate a highly read database table?

I have a database table with about 2500 rows in, which is frequently read by my web application. Will it affect the performance of reading from that table if all of the data in it is frequently (e.g. every 1-5 minutes) deleted and re-inserted?
By that I mean:
DELETE FROM MyTable
INSERT INTO MyTable SELECT ...
Probably not, at the given numbers ...
However, if you have one or more indexes on the table (to speed up reads/selects, or created automatically for any PK/UK), consider that every delete/insert also has to maintain each of those indexes (on top of the delete/insert itself). That doesn't directly affect table reads, but it adds to the overall load on the DB server.
There is no source code shown, but it appears you are using this table as an intermediate/interface to something else, so while 'updating' you'd probably want to bundle your deletes/inserts into transactions as far as you can, rather than executing them all individually, e.g. in a loop. Or see whether you can keep your PKs and just update the rows instead?
This could also help reduce fragmentation in the underlying storage ...
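For example, a minimal sketch of bundling the repopulation into one transaction (table and column names are made up; the syntax shown is SQL Server style):

BEGIN TRANSACTION;

DELETE FROM MyTable;

INSERT INTO MyTable (id, payload)
SELECT id, payload
FROM   StagingSource;

COMMIT;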

Latest table modified time in PostgreSQL

I need to get the latest modified time of a table, so I came across
select relfilenode from pg_class where relname = 'test';
which gives me the relfilenode id; this seems to be a file name under
L:\Databases\PostgresSQL\data\base\inodenumber
from which I later extract the latest modified time.
Is this the right way to do this, or are there better methods to achieve the same?
Testing the mtime of the table's relfilenode won't work well. As Eelke noted, VACUUM among other operations will modify the timestamp. Hint bit setting will modify the table too, causing it to appear to be "modified" by a SELECT. Additionally, a table's on-disk relation is sometimes split into more than one segment file (1GB chunks), and you'd have to check all of them to find the most recent.
If you want to keep a last modified time for a table, add an AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE ... FOR EACH STATEMENT trigger that updates a timestamp row in a table you use for tracking modification times.
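A minimal sketch of that approach (the tracking table, function, and trigger names are made up; test is the table from the question):

-- One row per tracked table
CREATE TABLE table_mod_times (
    table_name    text PRIMARY KEY,
    last_modified timestamptz NOT NULL DEFAULT now()
);
INSERT INTO table_mod_times (table_name) VALUES ('test');   -- seed the row for the tracked table

CREATE OR REPLACE FUNCTION track_table_mod() RETURNS trigger AS $$
BEGIN
    UPDATE table_mod_times
       SET last_modified = current_timestamp
     WHERE table_name = TG_TABLE_NAME;
    RETURN NULL;   -- AFTER ... FOR EACH STATEMENT: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_track_mod
AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE ON test
FOR EACH STATEMENT EXECUTE PROCEDURE track_table_mod();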
The downside of the trigger is that it'll contest a single row lock on the table, so it'll serialize all your transactions. It'll also greatly increase the chance of getting deadlocks. What you really want is probably something nontransactional that doesn't have to roll back when the transaction does, where if multiple transactions update a counter the highest value wins. There's nothing like that built in, though it might not be too hard as a C extension.
A slightly more sophisticated option would be to create a trigger that uses dblink to do an update of the last-updated counter. That'll avoid most of the contention problems but it'll actually make deadlocking worse because PostgreSQL's deadlock detection won't be able to "see" the fact that the two sessions are deadlocked via an intermediary. You'd need a way to SELECT ... FOR UPDATE with a timeout to make it reliable without aborting transactions too often.
A trigger won't catch DDL, though. DDL triggers ("event triggers") are coming in Pg 9.3.
See also:
How do I find the last time that a PostgreSQL database has been updated?
How to get 'last modified time' of the table in postgres?
I don't think that would be completely reliable, as a VACUUM would also modify the file(s) containing the table even though the logical content of the table does not change during a vacuum.
You could create triggers for INSERT, UPDATE and DELETE that maintain the last modified timestamp for each table in another table. This method has a slight performance impact but will provide accurate information.

SQL DELETE - Maximum number of rows

What limit should be placed on the number of rows to delete in a SQL statement?
We need to delete from 1 to several hundred thousand rows and need to apply some sort of best practise limit in order to not absolutely kill the SQL server or fill up the logs every time we empty a waste-basket.
This question is not specific to any type of database.
That's a very very broad question that basically boils down to "it depends". The factors that influence it include:
What is your level of concurrency? A delete statement places an exclusive lock on affected rows. Depending on the database engine, deleted data distribution, etc., that could escalate to a page or the entire table. Can your data readers afford to be blocked for the duration of the delete?
How complex is the delete statement? How many other tables are you joining to, or are there complex WHERE clauses? Sometimes the identification of rows to delete can be more "expensive" than the delete itself, so one big delete may be "cheaper".
Are you fearful about deadlocks? As you decrease the size of your delete, your deadlock "foot print" is reduced. Ideally, single-row deletes will always succeed.
Do you care about throughput performance? As with any SQL statement, there is a generally constant amount of overhead (connection stuff, query parsing, returning results, etc.). From a single-connection point of view, a 1000-line delete will be faster than 1000 x 1-line deletes.
Don't forget about index maintenance overhead, fragmentation cleanup, or any triggers. They can also affect your system.
In general, though, I start benchmarking at 1,000 rows per statement. Most systems I've worked with (sub-"enterprise") end up with a sweet spot between 500 and 5,000 records per delete. I like to do something like this:
set rowcount 500
select 1 -- just to force @@ROWCOUNT > 0
while @@ROWCOUNT > 0
delete from [table]
[where ...]
Though limiting the number of rows affected by your delete using the SET ROWCOUNT option and then performing a loop is very good (and I've used it many a time before), be aware that from SQL Server 2012 onwards this usage is deprecated: per BOL, SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release.
Therefore, another option may be to limit the number of rows being deleted using the TOP clause. i.e.
SELECT 1
WHILE @@ROWCOUNT > 0
BEGIN
DELETE TOP (#)
FROM mytable
[WHERE ...]
END
Unless you have a lot of triggers or integrity constraints to verify, deletion shouldn't be that expensive an operation.
But if you're that concerned about performance, my initial hunch would be to mark the appropriate rows as deleted and then physically delete them later during a periodic cleanup. But I'm not a big fan of this because you'll have to change any queries on that table to exclude logically- but not physically-deleted rows.
Whenever I see a database that routinely deletes large amounts of rows in bulk, it makes me think the data model or processing design is not optimal. Why load 1 million rows and then delete them? If you need to do something like purge historical data, then consider table partitioning.
I ran into this question and found my own answer to be quite effective: use a subselect.
delete from urls where url in ( select top 10000 url from urls)
A general answer is to drop the table and re-create it; that performs well, but it only applies when you are emptying the entire table.

Would a table lock speed up an update statement in Oracle 10g enterprise?

We have a fairly wide table BaseData with some 33 million rows in it. Then we have an update query that joins it to several other tables containing all kinds of parameters; some functions are applied, there is a GROUP BY on the original Id, and the results are written back to a few columns of the BaseData table.
This process is very slow, so I'm looking into ways of speeding it up. Most of my experience is with SQL Server, so I don't know this kind of Oracle internals yet.
One thing I suspect is that during the update Oracle creates versions of every row so any other readers can read the unaffected row. This, however, takes up considerable resources. Is there any way to have the update take a write lock on the table so it wouldn't create versions of every row?
Any other tips you guys have for large updates? We already broke it down into batches. Each batch is in a separate partition of the table, and then several updates are run in parallel. But it's still all much too slow.
The short answer is that no, in Oracle, taking an exclusive lock on a table won't prevent other sessions from reading it, or having to incur the work of generating a read-consistent view of the data. Similarly, in Oracle, you can't tell a session to enable "dirty reads."
Well, the first question is what's slow - is it all the work of joining and applying functions, or is it the writing back? How does a SELECT my_updated_resultset FROM BASEDATA JOIN... perform compared to your update statement? Have you verified that there's contention between the readers of BaseData and the update process? Also, is it too slow for the business, or just slower than you think it should be?
Another option to consider is to use partition exchange to perform your updates. The high level concept would be:
CREATE TABLE BASEDATA_XCHG as SELECT * FROM BASEDATA WHERE 1 = 0;
INSERT /*+ append */ INTO BASEDATA_XCHG SELECT my_updated_resultset FROM BASEDATA PARTITION (ONLY_ONE_PARTITION) JOIN...
Create all the required indexes and constraints on the BASEDATA_XCHG table.
ALTER TABLE BASEDATA EXCHANGE PARTITION ONLY_ONE_PARTITION WITH TABLE BASEDATA_XCHG;
If you're updating most of the rows in a partition of BASEDATA table, don't update them - create a new table and exchange it out. Tim Gorman has an excellent paper called "Scaling to Infinity" that covers this concept in greater depth; you may wish to check it out.
In addition to Adam's answer:
Run an EXPLAIN PLAN on your update statement and check the execution plan.
Chances are that adding indexes to support your joins and WHERE conditions can speed up the query.
Oracle uses undo segments for read consistency (along with SCNs).
I'm assuming these large batch processes are running on a staging area and not a "prod" instance that is being used by a lot of various processes. If you are updating 25% or more (rough figures) of some big table, it may be better to do a CTAS (create table as select...) than attempting updates. Your CTAS would contain the update logic for the new table. Once done, add indexes/grants/etc on new table and rename new to old. You can also add a parallel hint and nologging on the CTAS to potentially speed things up even more.
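A rough sketch of that CTAS pattern (table and column names are placeholders; the actual update logic goes into the SELECT):

CREATE TABLE basedata_new PARALLEL NOLOGGING AS
SELECT b.id,
       b.unchanged_col,
       p.derived_value AS updated_col     -- the "update" logic is folded into the SELECT
  FROM basedata b
  JOIN param_table p ON p.id = b.id;

-- add indexes, grants and constraints on basedata_new, then swap names:
-- RENAME basedata TO basedata_old;
-- RENAME basedata_new TO basedata;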

Optimizing Delete on SQL Server

Deletes on SQL Server are sometimes slow, and I've often needed to optimize them to reduce the time they take.
I've been googling a bit for tips on how to do that, and I've found diverse suggestions.
I'd like to know your favorite and most effective techniques to tame the delete beast, and how and why they work.
So far:
be sure foreign keys have indexes
be sure the where conditions are indexed
use of WITH ROWLOCK
destroy unused indexes, delete, rebuild the indexes
now, your turn.
The following article, Fast Ordered Delete Operations, may be of interest to you:
Performing fast SQL Server delete operations
The solution focuses on using a view in order to simplify the execution plan produced for a batched delete operation. This is achieved by referencing the given table once rather than twice, which in turn reduces the amount of I/O required.
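Roughly, the pattern looks like this (table, column, and view names are made up for illustration):

-- The view exposes "the next batch of rows to delete", so the batched
-- DELETE references the base table only once in its plan.
CREATE VIEW dbo.v_MyTable_DeleteBatch
AS
SELECT TOP (1000) *
FROM dbo.MyTable
ORDER BY CreatedDate;                    -- oldest rows first
GO

DELETE FROM dbo.v_MyTable_DeleteBatch;   -- removes the 1,000 rows exposed by the view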
I have much more experience with Oracle, but very likely the same applies to SQL Server as well:
when deleting a large number of rows, issue a table lock, so the database doesn't have to do lots of row locks
if the table you delete from is referenced by other tables, make sure those other tables have indexes on the foreign key column(s) (otherwise the database will do a full table scan for each deleted row on the other table to ensure that deleting the row doesn't violate the foreign key constraint)
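For example (hypothetical parent/child table and column names):

-- Without this index, every row deleted from parent_table forces a scan
-- of child_table to verify that no rows still reference it.
CREATE INDEX idx_child_parent_id ON child_table (parent_id);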
I wonder if it's time for garbage-collecting databases? You mark a row for deletion and the server deletes it later during a sweep. You wouldn't want this for every delete - because sometimes a row must go now - but it would be handy on occasion.
Summary of Answers through 2014-11-05
This answer is flagged as community wiki since this is an ever-evolving topic with a lot of nuances, but very few possible answers overall.
The first issue is you must ask yourself what scenario you're optimizing for? This is generally either performance with a single user on the db, or scale with many users on the db. Sometimes the answers are the exact opposite.
For single user optimization
Hint a TABLELOCK
Remove indexes not used in the delete then rebuild them afterward
Batch using something like SET ROWCOUNT 20000 (or whatever, depending on log space) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0)
If deleting a large % of table, just make a new one and delete the old table
Partition the rows to delete, then drop the partition.
For multi user optimization
Hint row locks
Use the clustered index
Design clustered index to minimize page re-organization if large blocks are deleted
Update "is_deleted" column, then do actual deletion later during a maintenance window
For general optimization
Be sure FKs have indexes on their source tables
Be sure WHERE clause has indexes
Identify the rows to delete in the WHERE clause with a view or derived table instead of referencing the table directly.
To be honest, deleting a million rows from a table scales just as badly as inserting or updating a million rows. It's the size of the rowset that's the problem, and there's not much you can do about that.
My suggestions:
Make sure that the table has a primary key and clustered index (this is vital for all operations).
Make sure that the clustered index is such that minimal page re-organisation would occur if a large block of rows were to be deleted.
Make sure that your selection criteria are SARGable.
Make sure that all your foreign key constraints are currently trusted.
(if the indexes are "unused", why are they there at all?)
One option I've used in the past is to do the work in batches. The crude way would be to use SET ROWCOUNT 20000 (or whatever) and loop (perhaps with a WAITFOR DELAY) until you get rid of it all (@@ROWCOUNT = 0).
This might help reduce the impact upon other systems.
The problem is you haven't defined your conditions enough. I.e. what exactly are you optimizing?
For example, is the system down for nightly maintenance and no users are on the system? And are you deleting a large % of the database?
If you're offline and deleting a large %, it may make sense to just build a new table with the data you want to keep, drop the old table, and rename. If deleting a small %, you likely want to batch things in as large batches as your log space allows. It entirely depends on your database, but dropping indexes for the duration of the rebuild may hurt or help -- if it's even possible given that you're "offline".
If you're online, what's the likelihood your deletes are conflicting with user activity (and is user activity predominantly read, update, or what)? Or, are you trying to optimize for user experience or speed of getting your query done? If you're deleting from a table that's frequently updated by other users, you need to batch but with smaller batch sizes. Even if you do something like a table lock to enforce isolation, that doesn't do much good if your delete statement takes an hour.
When you define your conditions better, you can pick one of the other answers here. I like the link in Rob Sanders' post for batching things.
If you have lots of foreign key tables, start at the bottom of the chain and work up. The final delete will go faster and block fewer things if there are no child records to cascade delete (which I would NOT turn on if I had a large number of child tables, as it will kill performance).
Delete in batches.
If you have foreign key tables that are no longer being used (you'd be surprised how often production databases end up with old tables nobody will get rid of), get rid of them or at least break the FK/PK connection. No sense checking a table for records if it isn't being used.
Don't delete - mark records as deleted and then exclude marked records from all queries. This is best set up at the time of database design. A lot of people use this because it is also the fastest way to get back records accidentally deleted. But it is a lot of work to set up in an already existing system.
I'll add another one to this:
Make sure your transaction isolation level and database options are set appropriately. If your SQL server is set not to use row versioning, or you're using an isolation level on other queries where you will wait for the rows to be deleted, you could be setting yourself up for some very poor performance while the operation is happening.
On very large tables where you have a very specific set of criteria for deletes, you could also partition the table, switch out the partition, and then process the deletions.
The SQLCAT team has been using this technique on really really large volumes of data. I found some references to it here but I'll try and find something more definitive.
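A rough sketch of the switch-out step (assumes dbo.BigTable is already partitioned, dbo.BigTable_Staging has an identical structure on the same filegroup, and the partition number is made up):

-- Switching a whole partition out of the live table is a metadata-only
-- operation, so it completes almost instantly regardless of row count.
ALTER TABLE dbo.BigTable SWITCH PARTITION 3 TO dbo.BigTable_Staging;

-- The switched-out rows can now be archived or removed without touching the live table.
TRUNCATE TABLE dbo.BigTable_Staging;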
I think the big trap with deletes that kills performance is that, after each row is deleted, SQL updates all the related indexes for any column in that row. What about dropping all indexes before the bulk delete?
There are deletes and then there are deletes. If you are aging out data as part of a trim job, you will hopefully be able to delete contiguous blocks of rows by clustered key. If you have to age out data from a high volume table that is not contiguous it is very very painful.
If it is true that UPDATES are faster than DELETES, you could add a status column called DELETED and filter on it in your selects. Then run a proc at night that does the actual deletes.
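For instance (a sketch; the table, column, and cutoff value are hypothetical):

-- During business hours: a cheap status update instead of a delete
UPDATE dbo.Orders
   SET Deleted = 1
 WHERE OrderDate < '20140101';

-- Nightly proc: physically remove the flagged rows in batches
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) FROM dbo.Orders WHERE Deleted = 1;
    IF @@ROWCOUNT = 0 BREAK;
END;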
Do you have foreign keys with referential integrity activated?
Do you have triggers active?
Simplify any use of functions in your WHERE clause! Example:
DELETE FROM Claims
WHERE dbo.YearMonthGet(DataFileYearMonth) = dbo.YearMonthGet(@DataFileYearMonth)
This form of the WHERE clause required 8 minutes to delete 125,837 records.
The YearMonthGet function composed a date with the year and month from the input date and set day = 1. This was to ensure we deleted records based on year and month but not day of month.
I rewrote the WHERE clause to:
WHERE YEAR(DataFileYearMonth) = YEAR(@DataFileYearMonth)
AND MONTH(DataFileYearMonth) = MONTH(@DataFileYearMonth)
The result: The delete required about 38-44 seconds to delete those 125,837 records!