Fast Update database with more than 10 million records

I am fairly new to SQL and was wondering if someone can help me.
I have a table with around 10 million rows.
I need to write a script that finds the records that have NULL in certain fields and updates them to a fixed value.
The problem with a simple UPDATE statement is that it blows the rollback space.
I read that I need to use BULK COLLECT with FETCH.
My idea was to fetch 10,000 records at a time, update, commit, and continue fetching.
I tried looking for examples on Google but have not found anything yet.
Any help?
Thanks!!
This is what I have so far:
DECLARE
CURSOR rec_cur IS
SELECT DATE_ORIGIN
FROM MAIN_TBL WHERE DATE_ORIGIN IS NULL;
TYPE date_tab_t IS TABLE OF DATE;
date_tab date_tab_t;
BEGIN
OPEN rec_cur;
LOOP
FETCH rec_cur BULK COLLECT INTO date_tab LIMIT 1000;
EXIT WHEN date_tab.COUNT() = 0;
FORALL i IN 1 .. date_tab.COUNT
UPDATE MAIN_TBL SET DATE_ORIGIN = '23-JAN-2012'
WHERE DATE_ORIGIN IS NULL;
END LOOP;
CLOSE rec_cur;
END;

I think I see what you're trying to do. There are a number of points I want to make about the differences between the code below and yours.
Your forall loop will not use an index. This is easy to get round by using rowid to update your table.
By committing after each forall you reduce the amount of undo needed, but you make it more difficult to roll back if something goes wrong. Logically, though, your update could easily be restarted in the middle without detriment to your objective.
rowids are small, so collect at least 25k at a time, if not 100k.
Oracle does not store entirely null keys in a B-tree index, so a plain index on date_origin cannot be used to find the null rows. There are plenty of questions on Stack Overflow about this if you need more information. A function-based index on something like nvl(date_origin,'x'), as a loose example, would increase the speed at which you select data. It also means you never actually have to read the table itself; you only select from the index.
Your date data-type seems to be a string. I've kept this but it's not wise.
If you can get someone to increase your undo tablespace size then a straight up update will be quicker.
Assuming as per your comments date_origin is a date then the index should be on something like:
nvl(date_origin,to_date('absolute_minimum_date_in_Oracle_as_a_string','yyyymmdd'))
I don't have access to a DB at the moment but to find out the amdiOaas run the following query:
select to_date('0001','yyyy') from dual;
It should raise a useful error for you.
Working example in PL/SQL Developer.
create table main_tbl as
select cast( null as date ) as date_origin
from all_objects
;
create index i_main_tbl
on main_tbl ( nvl( date_origin
, to_date('0001-01-01','yyyy-mm-dd') )
)
;
declare
cursor c_rec is
select rowid
from main_tbl
where nvl(date_origin,to_date('0001-01-01','yyyy-mm-dd'))
= to_date('0001-01-01','yyyy-mm-dd')
;
type t__rec is table of rowid index by binary_integer;
t_rec t__rec;
begin
open c_rec;
loop
fetch c_rec bulk collect into t_rec limit 50000;
exit when t_rec.count = 0;
forall i in t_rec.first .. t_rec.last
update main_tbl
set date_origin = to_date('23-JAN-2012','DD-MON-YYYY')
where rowid = t_rec(i)
;
commit ;
end loop;
close c_rec;
end;
/

Related

Updating Millions of Records Oracle

I wrote a query to update one column across 35 million records,
but unfortunately it took more than an hour to run.
Did I miss anything in the query below?
DECLARE
CURSOR exp_cur IS
SELECT
DECODE(
COLUMN_NAME,
NULL, NULL,
standard_hash(COLUMN_NAME)
) AS COLUMN_NAME
FROM TABLE1;
TYPE nt_fName IS TABLE OF VARCHAR2(100);
fname nt_fName;
BEGIN
OPEN exp_cur;
FETCH exp_cur BULK COLLECT INTO fname LIMIT 1000000;
CLOSE exp_cur;
--Print data
FOR idx IN 1 .. fname.COUNT
LOOP
UPDATE TABLE1 SET COLUMN_NAME=fname(idx);
commit;
DBMS_OUTPUT.PUT_LINE (idx||' '||fname(idx) );
END LOOP;
END;

The reason why bulk collect used with a forall construction is generally faster than the equivalent row-by-row loop is that it applies all the updates in one shot, instead of laboriously stepping through the rows one at a time and launching 35 million separate update statements, each one requiring the database to search for the individual row before updating it. But what you have written (even when the bugs are fixed) is still a row-by-row loop with 35 million search and update statements, plus the additional overhead of populating a 700 MB array in memory, 35 million commits, and 35 million dbms_output messages. It has to be slower because it has significantly more work to do than a plain update.
If it is practical to copy the data to a new table, insert will be a lot faster than update. At the end you can reapply any grants, indexes and constraints to the new table, rename both tables and drop the old one. You can also insert /*+ parallel enable_parallel_dml */ (or prior to Oracle 12c, you have to alter session enable parallel dml separately.) You could define the new table as nologging during the copy, but check with your DBA as that can affect replication and backups, though that might not matter if this is a test system. This will all need careful scripting if it's going to form part of a routine workflow.
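The copy-and-swap approach described above could look roughly like the sketch below. It is an outline under assumptions rather than a tested script: table1_new and table1_old are hypothetical names, id stands in for whatever other columns TABLE1 really has, and RAWTOHEX is used only so that the hashed column comes out as text.

```sql
-- Build the replacement table in a single pass. NOLOGGING reduces redo for
-- the copy (check with your DBA first, as noted above).
CREATE TABLE table1_new NOLOGGING AS
SELECT t.id,                                    -- hypothetical: list the real columns here
       CASE
         WHEN t.column_name IS NULL THEN NULL
         ELSE RAWTOHEX(STANDARD_HASH(t.column_name))
       END AS column_name
FROM   table1 t;

-- Recreate indexes, constraints and grants on table1_new at this point.

-- Swap the tables. Each RENAME is DDL and commits implicitly.
ALTER TABLE table1 RENAME TO table1_old;
ALTER TABLE table1_new RENAME TO table1;

-- Once the result has been verified:
DROP TABLE table1_old;
```

Because the two renames are separate DDL statements, there is a short window in which TABLE1 does not exist, so this is best run while nothing else is using the table.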

Your code updates every record of TABLE1 on each iteration: the UPDATE has no WHERE clause, so each pass through the loop (up to 1 million passes, one per fetched value) rewrites all 35 million rows. That is why it takes so long.
You can simply use a single update statement as follows:
UPDATE TABLE1 SET COLUMN_NAME = standard_hash(COLUMN_NAME)
WHERE COLUMN_NAME IS NOT NULL;
If you want to use BULK COLLECT and FORALL, you can do it as follows:
DECLARE
CURSOR EXP_CUR IS
SELECT COLUMN_NAME FROM TABLE1
WHERE COLUMN_NAME IS NOT NULL;
TYPE NT_FNAME IS TABLE OF VARCHAR2(100);
FNAME NT_FNAME;
BEGIN
OPEN EXP_CUR;
LOOP
-- fetch the next batch; an empty batch means the cursor is exhausted
FETCH EXP_CUR BULK COLLECT INTO FNAME LIMIT 1000000;
EXIT WHEN FNAME.COUNT = 0;
FORALL IDX IN 1 .. FNAME.COUNT
UPDATE TABLE1
SET COLUMN_NAME = STANDARD_HASH(COLUMN_NAME)
WHERE COLUMN_NAME = FNAME(IDX);
COMMIT;
END LOOP;
CLOSE EXP_CUR;
END;
/

PL SQL bulk collect fetchall not completing

I made this procedure to bulk delete data (35m records). Can you see why this pl/sql procedure runs without exiting and rows are not getting deleted ?
create or replace procedure clear_logs
as
CURSOR c_logstodel IS SELECT * FROM test where id=23;
TYPE typ_log is table of test%ROWTYPE;
v_log_del typ_log;
BEGIN
OPEN c_logstodel;
LOOP
FETCH c_logstodel BULK COLLECT INTO v_log_del LIMIT 5000;
EXIT WHEN c_logstodel%NOTFOUND;
FORALL i IN v_log_del.FIRST..v_log_del.LAST
DELETE FROM test WHERE id =v_log_del(i).id;
COMMIT;
END LOOP;
CLOSE c_logstodel;
END clear_logs;

Selecting rowid instead of the whole row, using exit when v_log_del.count = 0; instead of EXIT WHEN c_logstodel%NOTFOUND;, and raising the chunk limit to 50,000 allowed the script to clear 35 million rows in 15 minutes:
create or replace procedure clear_logs
as
CURSOR c_logstodel IS SELECT rowid FROM test where id=23;
TYPE typ_log is table of rowid index by binary_integer;
v_log_del typ_log;
BEGIN
OPEN c_logstodel;
LOOP
FETCH c_logstodel BULK COLLECT INTO v_log_del LIMIT 50000;
exit when v_log_del.count = 0;
FORALL i IN v_log_del.FIRST..v_log_del.LAST
DELETE FROM test WHERE rowid =v_log_del(i);
COMMIT;
END LOOP;
COMMIT;
CLOSE c_logstodel;
END clear_logs;

First off, when using BULK COLLECT with LIMIT X, %NOTFOUND takes on a slightly unexpected meaning: it is set when Oracle could not retrieve the full X rows. (Technically it always means that; when you fetch a single row, it says it could not fill the one-row buffer.) So the final, partial batch sets %NOTFOUND even though it still contains rows to process. Just move the EXIT WHEN %NOTFOUND to after the FORALL. But there is actually no reason to retrieve the data and then delete the retrieved rows. While one statement would be considerably faster, 35M rows would require significant rollback space. There is an intermediate solution.
Although not commonly used, DELETE statements generate rownum just as SELECTs do. This value can be used to limit the number of rows processed. So to break the work into a given commit size, just limit rownum on the delete:
create or replace procedure clear_logs
as
k_max_rows_per_iteration constant integer := 50000;
begin
loop
delete
from test
where id=23
and rownum <= k_max_rows_per_iteration;
exit when sql%rowcount < k_max_rows_per_iteration;
commit;
end loop;
commit;
end;
As @Stilgar points out, deletes are expensive, meaning slow, so their solution may be better. But this approach has the advantage that it does not essentially take the table completely out of service during the operation. NOTE: I tend to use a much larger commit interval, generally around 300,000 to 400,000 rows. I suggest you talk with your DBA to see what they think this limit should be. Remember it is their job to properly size rollback space for typical operations. If this is normal in your operation, they need to set it correctly. If you can get rollback space for 35M deletes, then a single statement is the fastest you are going to get.

Tuning SQL Updation query

I am using this update query, which takes about 8 hours to execute. I want it to take less time; how can I do that? My query is:
BEGIN
LOOP
UPDATE ENCORE_LIVE.INSTRUMENT_CLASSIFICATION SET CODE = NULL WHERE TYPE_ID =
15 AND CODE is not NULL
and rownum < 50000;
exit when SQL%rowcount < 49999;
commit;
END LOOP;
commit;
END;

So you are updating the table in batches of 50,000 records, right?
You should check whether the TYPE_ID and CODE columns of the INSTRUMENT_CLASSIFICATION table are indexed, and create an index if they are not.
Check also this answer.
Your UPDATE statement is basic and cannot be optimized much further on its own. To create an index you can use:
CREATE INDEX instrumentClasification_TypeCode_idx
ON ENCORE_LIVE.INSTRUMENT_CLASSIFICATION
(TYPE_ID, CODE);
You should also consider running the update as a single statement instead of splitting it into batches.
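The single-statement variant suggested above would be, as a sketch that assumes the undo tablespace is large enough to hold the whole change:

```sql
-- One pass over the table instead of repeated 50,000-row batches.
UPDATE ENCORE_LIVE.INSTRUMENT_CLASSIFICATION
SET    CODE = NULL
WHERE  TYPE_ID = 15
AND    CODE IS NOT NULL;

COMMIT;
```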

Efficient way to update all rows in a table

I have a table with a lot of records (could be more than 500 000 or 1 000 000). I added a new column in this table and I need to fill a value for every row in the column, using the corresponding row value of another column in this table.
I tried to use separate transactions for selecting every next chunk of 100 records and update the value for them, but still this takes hours to update all records in Oracle10 for example.
What is the most efficient way to do this in SQL, without using dialect-specific features, so that it works everywhere (Oracle, MSSQL, MySQL, PostgreSQL, etc.)?
ADDITIONAL INFO: There are no calculated fields. There are indexes. Used generated SQL statements which update the table row by row.

The usual way is to use UPDATE:
UPDATE mytable
SET new_column = <expr containing old_column>
You should be able to do this in a single transaction.

As Marcelo suggests:
UPDATE mytable
SET new_column = <expr containing old_column>;
If this takes too long and fails due to "snapshot too old" errors (e.g. if the expression queries another highly-active table), and if the new value for the column is always NOT NULL, you could update the table in batches:
UPDATE mytable
SET new_column = <expr containing old_column>
WHERE new_column IS NULL
AND ROWNUM <= 100000;
Just run this statement, COMMIT, then run it again; rinse, repeat until it reports "0 rows updated". It'll take longer but each update is less likely to fail.
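In Oracle, the run, commit, repeat cycle described above can be wrapped in a small PL/SQL block. This is a minimal sketch, not a tested script: UPPER(old_column) is a hypothetical stand-in for the real expression, and the loop only terminates if that expression never yields NULL, as the answer already requires.

```sql
BEGIN
  LOOP
    UPDATE mytable
    SET    new_column = UPPER(old_column)   -- hypothetical expression; must never be NULL
    WHERE  new_column IS NULL
    AND    ROWNUM <= 100000;

    -- Stop once a pass finds nothing left to update.
    EXIT WHEN SQL%ROWCOUNT = 0;

    COMMIT;                                 -- release undo after each batch
  END LOOP;
  COMMIT;
END;
/
```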
EDIT:
A better alternative that should be more efficient is to use the DBMS_PARALLEL_EXECUTE API.
Sample code (from Oracle docs):
DECLARE
l_sql_stmt VARCHAR2(1000);
l_try NUMBER;
l_status NUMBER;
BEGIN
-- Create the TASK
DBMS_PARALLEL_EXECUTE.CREATE_TASK ('mytask');
-- Chunk the table by ROWID
DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID('mytask', 'HR', 'EMPLOYEES', true, 100);
-- Execute the DML in parallel
l_sql_stmt := 'update EMPLOYEES e
SET e.salary = e.salary + 10
WHERE rowid BETWEEN :start_id AND :end_id';
DBMS_PARALLEL_EXECUTE.RUN_TASK('mytask', l_sql_stmt, DBMS_SQL.NATIVE,
parallel_level => 10);
-- If there is an error, RESUME it for at most 2 times.
l_try := 0;
l_status := DBMS_PARALLEL_EXECUTE.TASK_STATUS('mytask');
WHILE(l_try < 2 and l_status != DBMS_PARALLEL_EXECUTE.FINISHED)
LOOP
l_try := l_try + 1;
DBMS_PARALLEL_EXECUTE.RESUME_TASK('mytask');
l_status := DBMS_PARALLEL_EXECUTE.TASK_STATUS('mytask');
END LOOP;
-- Done with processing; drop the task
DBMS_PARALLEL_EXECUTE.DROP_TASK('mytask');
END;
/
Oracle Docs: https://docs.oracle.com/database/121/ARPLS/d_parallel_ex.htm#ARPLS67333

You could drop any indexes on the table, then do your insert, and then recreate the indexes.

This might not work for you, but it is a technique I've used a couple of times in the past in similar circumstances.
Create updated_{table_name}, then insert into this table with INSERT ... SELECT in batches. Once finished (and this hinges on Oracle, which I don't know or use, supporting atomic table renames), updated_{table_name} becomes {table_name} while {table_name} becomes original_{table_name}.
The last time I had to do this was for a heavily indexed table with several million rows that absolutely, positively could not be locked for the duration needed to make some serious changes to it.

What is the database version? Check out virtual columns in 11g:
Adding Columns with a Default Value
http://www.oracle.com/technology/pub/articles/oracle-database-11g-top-features/11g-schemamanagement.html
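As a sketch of the virtual-column idea (Oracle 11g and later): mytable, new_column and old_column are the placeholder names used earlier in this thread, and UPPER(old_column) is a hypothetical stand-in for the real expression. It applies only if new_column has not already been added as a regular column.

```sql
-- The value is computed from old_column whenever it is read,
-- so no rows need to be updated when the column is added.
ALTER TABLE mytable
  ADD (new_column GENERATED ALWAYS AS (UPPER(old_column)) VIRTUAL);
```

A virtual column cannot be written to directly, so this fits only when the new column should always follow the old one.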

update Hotels set Discount=30 where Hotelid >= 1 and Hotelid <= 5504

For PostgreSQL I do something like this (only if we are sure no other updates or inserts take place in the meantime):
create table new_table as table orig_table with data;
update new_table set column = <expr>;
start transaction;
drop table orig_table;
alter table new_table rename to orig_table;
commit;
Update:
The advantage is that the original table is not locked while the copy is being updated, which on a very large table could take minutes; only the final drop and rename need a brief lock.
This is only safe if you are sure no inserts or updates take place during the process.

SQL optimization question (oracle)

Edit: Please answer one of the two questions I ask. I know there are other options that would be better in a different case. These other potential options (partitioning the table, running as one large delete statement w/o committing in batches, etc) are NOT options in my case due to things outside my control.
I have several very large tables to delete from. All have the same foreign key that is indexed. I need to delete certain records from all tables.
table source
id --primary_key
import_source --used for choosing the ids to delete
table t1
id --foreign key
--other fields
table t2
id --foreign key
--different other fields
Usually when doing a delete like this, I'll put together a loop to step through all the ids:
declare
my_counter integer := 0;
begin
for cur in (
select id from source where import_source = 'bad.txt'
) loop
begin
delete from source where id = cur.id;
delete from t1 where id = cur.id;
delete from t2 where id = cur.id;
my_counter := my_counter + 1;
if my_counter > 500 then
my_counter := 0;
commit;
end if;
end;
end loop;
commit;
end;
However, in some code I saw elsewhere, it was put together in separate loops, one for each delete.
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
my_count integer := 0;
begin
select id bulk collect into my_import_ids from source where import_source = 'bad.txt';
for h in 1..my_import_ids.count loop
delete from t1 where id = my_import_ids(h);
--do commit check
end loop;
for h in 1..my_import_ids.count loop
delete from t2 where id = my_import_ids(h);
--do commit check
end loop;
end;
--do commit check will be replaced with the same chunk to commit every 500 rows as the above query
So I need one of the following answered:
1) Which of these is better?
2) How can I find out which is better for my particular case? (IE if it depends on how many tables I have, how big they are, etc)
Edit:
I must do this in a loop due to the size of these tables. I will be deleting thousands of records from tables with hundreds of millions of records. This is happening on a system that can't afford to have the tables locked for that long.
EDIT:
NOTE: I am required to commit in batches. The amount of data is too large to do it in one batch. The rollback tables will crash our database.
If there is a way to commit in batches other than looping, I'd be willing to hear it. Otherwise, don't bother saying that I shouldn't use a loop...

Why loop at all?
delete from t1 where id IN (select id from source where import_source = 'bad.txt');
delete from t2 where id IN (select id from source where import_source = 'bad.txt');
delete from source where import_source = 'bad.txt';
That's using standard SQL. I don't know Oracle specifically, but many DBMSes also feature multi-table JOIN-based DELETEs as well that would let you do the whole thing in a single statement.

David,
If you insist on committing, you can use the following code:
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
cursor c is select id from source where import_source = 'bad.txt';
begin
open c;
loop
fetch c bulk collect into my_import_ids limit 500;
forall h in 1..my_import_ids.count
delete from t1 where id = my_import_ids(h);
forall h in 1..my_import_ids.count
delete from t2 where id = my_import_ids(h);
commit;
exit when c%notfound;
end loop;
close c;
end;
This program fetches ids in batches of 500 rows, deleting and committing each batch. It should be much faster than row-by-row processing, because bulk collect and forall each work as a single operation (one round trip between the PL/SQL and SQL engines per batch), thus minimizing the number of context switches. See Bulk Binds, Forall, Bulk Collect for details.

First of all, you shouldn't commit in the loop: it is not efficient (it generates lots of redo), and if some error occurs you can't roll back.
As mentioned in previous answers, you should issue single deletes, or, if you are deleting most of the records, then it could be more optimal to create new tables with remaining rows, drop old ones and rename the new ones to old names.
Something like this:
CREATE TABLE new_table as select * from old_table where <filter only remaining rows>;
index new_table
grant on new table
add constraints on new_table
etc on new_table
drop table old_table
rename new_table to old_table;
See also Ask Tom

Larry Lustig is right that you don't need a loop. Nonetheless there may be some benefit in doing the delete in smaller chunks. Here PL/SQL bulk binds can improve speed greatly:
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
my_count integer := 0;
begin
select id bulk collect into my_import_ids from source where import_source = 'bad.txt';
forall h in 1..my_import_ids.count
delete from t1 where id = my_import_ids(h);
forall h in 1..my_import_ids.count
delete from t2 where id = my_import_ids(h);
end;
The way I wrote it, it does it all at once, in which case, yes, the single SQL statement is better. But you can change your loop conditions to break it into chunks. The key points are:
Don't commit on every row. If anything, commit only every N rows.
When using chunks of N, don't run the delete in an ordinary loop. Use forall to run the delete as a bulk bind, which is much faster.
The reason, aside from the overhead of commits, is that each time you execute an SQL statement inside PL/SQL code it essentially does a context switch. Bulk binds avoid that.

You may try partitioning anyway to use parallel execution, not just to drop one partition. The Oracle documentation may prove useful in setting this up. Each partition would use its own rollback segment in this case.

If you are doing the delete from the source before the t1/t2 deletes, that suggests you don't have referential integrity constraints (as otherwise you'd get errors saying child records exist).
I'd go for creating the constraint with ON DELETE CASCADE. Then a simple
DECLARE
v_cnt NUMBER := 1;
BEGIN
WHILE v_cnt > 0 LOOP
DELETE FROM source WHERE import_source = 'bad.txt' and rownum < 5000;
v_cnt := SQL%ROWCOUNT;
COMMIT;
END LOOP;
END;
The child records would get deleted automatically.
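The cascading constraints this relies on could be declared roughly as follows. This is a sketch: the constraint names are hypothetical, and it assumes source.id is the primary key and that t1 and t2 do not already have a non-cascading foreign key on id (which would have to be dropped first).

```sql
ALTER TABLE t1
  ADD CONSTRAINT t1_source_fk            -- hypothetical constraint name
  FOREIGN KEY (id) REFERENCES source (id)
  ON DELETE CASCADE;

ALTER TABLE t2
  ADD CONSTRAINT t2_source_fk            -- hypothetical constraint name
  FOREIGN KEY (id) REFERENCES source (id)
  ON DELETE CASCADE;
```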
If you can't have the ON DELETE CASCADE, I'd go with a GLOBAL TEMPORARY TABLE with ON COMMIT DELETE ROWS
DECLARE
v_cnt NUMBER := 1;
BEGIN
WHILE v_cnt > 0 LOOP
INSERT INTO temp (id)
SELECT id FROM source WHERE import_source = 'bad.txt' and rownum < 5000;
v_cnt := SQL%ROWCOUNT;
DELETE FROM t1 WHERE id IN (SELECT id FROM temp);
DELETE FROM t2 WHERE id IN (SELECT id FROM temp);
DELETE FROM source WHERE id IN (SELECT id FROM temp);
COMMIT;
END LOOP;
END;
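The loop above assumes the temporary table already exists. A minimal sketch of its definition, with the column type assumed to match source.id:

```sql
-- Rows are private to the session and vanish at each COMMIT,
-- so every pass of the loop starts with an empty table.
CREATE GLOBAL TEMPORARY TABLE temp (
  id NUMBER            -- assumption: same type as source.id
) ON COMMIT DELETE ROWS;
```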
I'd also go for the largest chunk your DBA will allow.
I'd expect each transaction to last for at least a minute. More frequent commits would be a waste.
"This is happening on a system that can't afford to have the tables locked for that long."
Oracle doesn't lock tables, only rows. I'm assuming no-one will be locking the rows you are deleting (or at least not for long). So locking is not an issue.
