Split data from select query and commit into temp table - sql

I have a scenario where I'm creating a temp table with the data coming in from a select statement. My problem is when the data coming in from the select query is huge, I'm running into insufficient memory problems which results in my query failing to give results.
I was wondering if there was a way to commit a chunk of data (say every 1000 rows) into the temp table before moving on to the next one.
Eg:
CREATE TABLE NEW_TABLE AS
SELECT [ column1, column2...columnN ]
FROM EXISTING_TABLE
[ WHERE ] <ALL CONDITIONS>
Now let's assume that the inner select returns 100 rows. I don't want all 100 rows to be inserted into NEW_TABLE at once. I want to split this up. How do I go about doing this efficiently?

If you're creating a temp table, why not create it as a GLOBAL TEMPORARY TABLE, like this?
CREATE GLOBAL TEMPORARY TABLE NEW_TABLE
ON COMMIT PRESERVE ROWS
AS
SELECT [ column1, column2...columnN ]
FROM EXISTING_TABLE
[ WHERE ] <ALL CONDITIONS>;
It can lower resource usage because of "Decreased redo generation as, by definition, they are non-logging" (http://psoug.org/reference/gtt.html).
Elaborating on what I wrote:
"Redo records are buffered in a
circular fashion in the redo log
buffer of the SGA (see "How Oracle
Database Writes to the Redo Log" )
and are written to one of the redo
log files by the Log Writer (LGWR)
database background process." - docs.oracle.com/cd/B28359_01/server.111/b28310/onlineredo001.htm
Using a global temporary table lets you avoid that redundant cost. I assume you don't care about the durability of the temporary data. Besides, a global temporary table is by default written to the temporary tablespace, which is optimized for the storage of transient data.
" When Oracle needs to store data in a global temporary table or build a hash table for a hash join, Oracle also starts the operation in memory and completes the task without writing to disk if the amount of data involved is small enough. While populating a global temporary table or building a hash is not a sorting operation, we will lump all of these activities together in this paper because they are handled in a similar way by Oracle.
If an operation uses up a threshold amount of memory, then Oracle breaks the operation into smaller ones that can each be performed in memory. Partial results are written to disk in a temporary tablespace."
Based on the above:
- you use a memory area other than the default process memory (SGA, undo), which appeared too small in your case
- when memory is insufficient, Oracle will write the data to disk
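A minimal sketch of the two-step version (create the global temporary table once, then fill it in each session); the column list and conditions are the same placeholders as above:
CREATE GLOBAL TEMPORARY TABLE NEW_TABLE (
  column1 <datatype>,
  column2 <datatype>
) ON COMMIT PRESERVE ROWS;

INSERT /*+ APPEND */ INTO NEW_TABLE
SELECT column1, column2
FROM EXISTING_TABLE
WHERE <all conditions>;
COMMIT;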

OK. I think I have a temp solution/work around for my problem: Something like this:
DECLARE
  counter number(6) := 1;
  cursor c1 is
    <select query>;
  type t__data is table of c1%rowtype index by binary_integer;
  t_data t__data;
begin
  open c1;
  loop
    fetch c1 bulk collect into t_data;
    exit when t_data.count = 0;
    for idx in t_data.first .. t_data.last loop
      insert into TEMP_ONE1 (x, y)
      values (t_data(idx).x, t_data(idx).y);
      counter := counter + 1;
      if counter = 10000 then
        counter := 0;
        commit;
      end if;
    end loop;
  end loop;
  close c1;
end;
Do you think this is a good solution to the problem?
(I ran the query for a small sample of data and it does work)

Related

How to list all tables that have data in it?

I have around 2000 tables, most of which are not in use and do not have any data in them.
I know how to list all tables as below
SELECT owner, table_name FROM ALL_TABLES
But I do not know how to list the one that has at least one row of data in it.
Is there any way to do that?
There are a few ways you could do this:
Brute-force and count the rows in every table
Check the table stats
Check if there is any storage allocated
Brute force
This loops through the tables, counts the rows, and spits out those that are empty:
declare
  c integer;
begin
  for t in (
    select table_name from user_tables
    where external = 'NO'
    and temporary = 'N'
  ) loop
    execute immediate
      'select count(*) from ' || t.table_name
      into c;
    if c = 0 then
      dbms_output.put_line ( t.table_name );
    end if;
  end loop;
end;
/
This is the only way to be sure there are no rows in the table now. The main drawback to this is it could take a looooong time if you have many tables with millions+ rows.
I've excluded:
Temporary tables. You can only see data inserted in your session. If they're in use in another session, you can't see that data.
External tables. These point to files on the database server's file system. The files could be temporarily missing/blank/etc.
There may be other table types with issues like these - make sure you double check any that are reported as empty.
Check the stats
If all the table stats are up-to-date, you can check the num_rows:
select table_name
from user_tables ut
where external = 'NO'
and temporary = 'N'
and num_rows = 0;
The caveat with this is that these figures may be out-of-date. You can force a regather now by running:
exec dbms_stats.gather_schema_stats ( user );
Though this is likely to take a while and - if gathering has been disabled/deferred - might result in unwanted plan changes. Avoid doing this on your production database!
Check storage allocation
You can look for tables with no segments allocated with:
select table_name
from user_tables ut
where external = 'NO'
and temporary = 'N'
and segment_created = 'NO';
As there's no space allocated to these, there are definitely no rows in them! But a table could have space allocated yet contain no rows. So this may omit some of the empty tables - this is particularly likely for tables that did have rows in the past, but are empty now.
Final thoughts
It's worth remembering that a table with no rows now could still be in use. Staging tables used for daily/weekly/monthly loads may be purged at the end of the process; removing these will still break your apps!
There could also be code which refers to empty tables which work as-is, but would error if you drop the table.
A better approach would be to enable auditing, then run this for "a while". Any tables with no audited access in the time period are probably safe to remove.
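As a rough sketch with traditional auditing (this assumes AUDIT_TRAIL=DB and audit privileges; on 12c and later you would use a unified audit policy instead):
AUDIT SELECT TABLE, INSERT TABLE, UPDATE TABLE, DELETE TABLE BY ACCESS;

-- after "a while", list tables that never show up in the audit trail
SELECT table_name
FROM user_tables
WHERE table_name NOT IN (
  SELECT obj_name
  FROM dba_audit_object
  WHERE owner = user
);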

Update million rows using rowids from one table to another Oracle

Hi, I have two tables with a million rows in each. I have Oracle 11g R1.
I am sure many of us must have gone through this situation.
What is the most efficient and fast way to update from one table to another where the values are DIFFERENT.
E.g.: Table 1 has 4 NUMBER columns with high precision, e.g. 0.2212454215454212.
Table 2 has 6 columns.
I need to update four of Table 2's columns based on a common column in both tables, and only where the values differ.
I have something like this
DECLARE
  TYPE test1_t IS TABLE OF test.score%TYPE INDEX BY PLS_INTEGER;
  TYPE test2_t IS TABLE OF test.id%TYPE INDEX BY PLS_INTEGER;
  TYPE test3_t IS TABLE OF test.Crank%TYPE INDEX BY PLS_INTEGER;
  vscore test1_t;
  vid test2_t;
  vurank test3_t;
BEGIN
  SELECT id, score, urank
  BULK COLLECT INTO vid, vscore, vurank
  FROM test;

  FORALL i IN 1 .. vid.COUNT
    MERGE INTO final T
    USING (SELECT vid (i) AS o_id,
                  vurank (i) AS o_urank,
                  vscore (i) AS o_score FROM DUAL) S
    ON (S.o_id = T.id)
    WHEN MATCHED THEN
      UPDATE SET T.crank = S.o_urank
      WHERE T.crank <> S.o_urank;
END;
Since the numbers have high precision, is that slowing it down?
I tried the Bulk Collect and Merge combination, but it still takes ~30 minutes in the worst case, when I have to update 1 million rows.
Is there something I can do with rowid?
Help will be appreciated.
If you want to update all the rows, then just use update:
update table_1
set (col1,
col2) = (
select col1,
col2
from table2
where table2.col_a = table1.col_a and
table2.col_b = table1.col_b)
Bulk collect or any PL/SQL technique will always be slower than a pure SQL technique.
The numeric precision is probably not significant, and rowid is not relevant as there is no common value between the two tables.
When dealing with millions of rows, parallel DML is a game changer. Of course you need to have Enterprise Edition to use parallel, but it's really the only thing which will make much difference.
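For example, a sketch of parallel DML applied to the update above (the degree of 8 is only an example; check the parallel DML restrictions for your Oracle version and table layout):
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(t 8) */ table_1 t
SET (col1, col2) = (SELECT col1, col2
                    FROM table2
                    WHERE table2.col_a = t.col_a
                    AND table2.col_b = t.col_b);

COMMIT;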
I recommend you read an article on OraFAQ by rleishman comparing 8 Bulk Update Methods. His key finding is that "the cost of disk reads so far outweighs the context switches that that they are barely noticable (sic)". In other words, unless your data is already cached in memory there really isn't a significant difference between SQL and PL/SQL approaches.
The article does have some neat suggestions on employing parallel. The surprising outcome is that a parallel pipelined function offers the best performance.
Focusing on the syntax that has been used and skipping the logic (a pure update + pure insert may solve the problem; merge cost, indexes, and a possible full scan on merge are other considerations):
You should use LIMIT in the BULK COLLECT syntax.
Using a bulk collect with no limit will cause all records to be loaded into memory, and with no partially committed merges you will create a large redo log that must be applied at the end of the process.
Both will result in low performance.
DECLARE
  v_fetchSize NUMBER := 1000; -- based on hardware, design and .... could be scaled
  CURSOR a_cur IS
    SELECT id, score, urank FROM test;
  TYPE myarray IS TABLE OF a_cur%ROWTYPE;
  cur_array myarray;
BEGIN
  OPEN a_cur;
  LOOP
    FETCH a_cur BULK COLLECT INTO cur_array LIMIT v_fetchSize;
    FORALL i IN 1 .. cur_array.COUNT
      -- DO Operation
    COMMIT;
    EXIT WHEN a_cur%NOTFOUND;
  END LOOP;
  CLOSE a_cur;
END;
Just to be sure: test.id and final.id must be indexed.
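If they are not, a quick sketch (the index names are illustrative):
CREATE INDEX test_id_idx ON test (id);
CREATE INDEX final_id_idx ON final (id);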
With the first SELECT ... FROM test you get too many records from Table 1, and after that you need to compare all of them with the records in Table 2. Try to select only what you need to update. So, there are at least 2 variants:
a) select only changed records:
SELECT source_table.id, source_table.score, source_table.urank
BULK COLLECT INTO vid,vscore,vurank
FROM
test source_table,
final destination_table
where
source_table.id = destination_table.id
and
source_table.crank <> destination_table.crank
;
b) Add a new datetime field to the source table and fill it with the current time in a trigger (a sketch of the column and trigger follows at the end of this answer). While synchronizing, pick only the records changed during the last day. This field needs to be indexed.
After such a change, in the update phase you don't need to compare other fields, only match the IDs:
FORALL i IN 1 .. vid.COUNT
  MERGE INTO FINAL T
  USING (
    SELECT vid (i) AS o_id,
           vurank (i) AS o_urank,
           vscore (i) AS o_score FROM DUAL
  ) S
  ON (S.o_id = T.id)
  WHEN MATCHED THEN
    UPDATE SET T.crank = S.o_urank
If you worry about the size of the undo/redo segments, then variant b) is more useful, because you can take the records from source Table 1 in time slices and commit the changes after updating every slice, e.g. from 00:00 to 01:00, from 01:00 to 02:00, etc.
In this variant the update can be done by a plain SQL statement, without selecting data into collections in PL/SQL, while maintaining acceptable sizes of the redo/undo logs.
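For variant b), a possible sketch of the tracking column and trigger (all names here are illustrative, not from the original schema):
ALTER TABLE test ADD (last_changed DATE);

CREATE INDEX test_last_changed_idx ON test (last_changed);

CREATE OR REPLACE TRIGGER test_track_changes
BEFORE INSERT OR UPDATE ON test
FOR EACH ROW
BEGIN
  :NEW.last_changed := SYSDATE;
END;
/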

update x set y = null takes a long time

At work, I have a large table (some 3 million rows, like 40-50 columns). I sometimes need to empty some of the columns and fill them with new data. What I did not expect is that
UPDATE table1 SET y = null
takes much more time than filling the column with data which is generated, for example, in the SQL query from other columns of the same table or queried from other tables in a subquery. It does not matter if I go through all table rows at once (like in the update query above) or if I use a cursor to go through the table row by row (using the pk). It does not matter if I use the large table at work or if I create a small test table and fill it with a few hundred thousand test rows. Setting the column to null always takes way longer (throughout the tests, I encountered factors of 2 to 10) than updating the column with some dynamic data (which is different for each row).
What's the reason for this? What does Oracle do when setting a column to null? Or what is my error in reasoning?
Thanks for your help!
P.S.: I am using oracle 11g2, and found these results using both plsql developer and oracle sql developer.
Is column Y indexed? It could be that setting the column to null means Oracle has to delete from the index, rather than just update it. If that's the case, you could drop and rebuild it after updating the data.
EDIT:
Is it just column Y that exhibits the issue, or is it independent of the column being updated? Can you post the table definition, including constraints?
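If the index does turn out to be the culprit, a sketch of the drop-and-rebuild approach (the index name is hypothetical):
DROP INDEX table1_y_idx;

UPDATE table1 SET y = NULL;
COMMIT;

CREATE INDEX table1_y_idx ON table1 (y);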
Summary
I think updating to null is slower because Oracle (incorrectly) tries to take advantage of the way it stores nulls, causing it to frequently re-organize the rows in the block ("heap block compress"), creating a lot of extra UNDO and REDO.
What's so special about null?
From the Oracle Database Concepts:
"Nulls are stored in the database if they fall between columns with data values. In these cases they require 1 byte to store the length of the column (zero).
Trailing nulls in a row require no storage because a new row header signals that the remaining columns in the previous row are null. For example, if the last three columns of a table are null, no information is stored for those columns. In tables with many columns,
the columns more likely to contain nulls should be defined last to conserve disk space."
Test
Benchmarking updates is very difficult because the true cost of an update cannot be measured just from the update statement. For example, log switches will
not happen with every update, and delayed block cleanout will happen later. To accurately test an update, there should be multiple runs,
objects should be recreated for each run, and the high and low values should be discarded.
For simplicity the script below does not throw out high and low results, and only tests a table with a single column. But the problem still occurs regardless of the number of columns, their data, and which column is updated.
I used the RunStats utility from http://www.oracle-developer.net/utilities.php to compare the resource consumption of updating-to-a-value with updating-to-a-null.
create table test1(col1 number);
BEGIN
dbms_output.enable(1000000);
runstats_pkg.rs_start;
for i in 1 .. 10 loop
execute immediate 'drop table test1 purge';
execute immediate 'create table test1 (col1 number)';
execute immediate 'insert /*+ append */ into test1 select 1 col1
from dual connect by level <= 100000';
commit;
execute immediate 'update test1 set col1 = 1';
commit;
end loop;
runstats_pkg.rs_pause;
runstats_pkg.rs_resume;
for i in 1 .. 10 loop
execute immediate 'drop table test1 purge';
execute immediate 'create table test1 (col1 number)';
execute immediate 'insert /*+ append */ into test1 select 1 col1
from dual connect by level <= 100000';
commit;
execute immediate 'update test1 set col1 = null';
commit;
end loop;
runstats_pkg.rs_stop();
END;
/
Result
There are dozens of differences, these are the four I think are most relevant:
Type Name Run1 Run2 Diff
----- ---------------------------- ------------ ------------ ------------
TIMER elapsed time (hsecs) 1,269 4,738 3,469
STAT heap block compress 1 2,028 2,027
STAT undo change vector size 55,855,008 181,387,456 125,532,448
STAT redo size 133,260,596 581,641,084 448,380,488
Solutions?
The only possible solution I can think of is to enable table compression. The trailing-null storage trick doesn't happen for compressed tables.
So even though the "heap block compress" number gets even higher for Run2, from 2028 to 23208, I guess it doesn't actually do anything.
The redo, undo, and elapsed time between the two runs is almost identical with table compression enabled.
However, there are lots of potential downsides to table compression. Updating to a null will run much faster, but every other update will run at least slightly slower.
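For reference, one way to enable basic compression on an existing table, as a sketch (this rewrites the table, and any indexes would then need a rebuild):
ALTER TABLE test1 MOVE COMPRESS;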
That's because it deletes that data from the blocks.
And a delete is the hardest operation. If you can avoid a delete, do it.
I recommend you create another table with that column null (CREATE TABLE AS SELECT, for example, or INSERT ... SELECT) and fill the column with your procedure. Drop the old table and then rename the new table to the current name.
UPDATE:
Another important thing is that you should update the column directly with the new values. It is useless to set it to null and refill it afterwards.
If you do not have values for all rows, you can do the update like this:
update table1
set y = (select new_value from source where source.key = table1.key)
and it will set to null those rows that do not exist in source.
I would try what Tom Kyte suggested for large updates.
When it comes to huge tables, it's best to go like this: take a few rows, update them, take some more, update those, etc. Don't try to issue an update on the whole table. That's a killer move right from the start.
Basically, create a binary_integer-indexed table, fetch 10 rows at a time, and update them.
Here is a piece of code that I have used on large tables with success. Because I'm lazy and it's like 2 AM now, I'll just copy-paste it here and let you figure it out, but let me know if you need help:
DECLARE
TYPE BookingRecord IS RECORD (
bprice number,
bevent_id number,
book_id number
);
TYPE array is TABLE of BookingRecord index by binary_integer;
l_data array;
CURSOR c1 is
SELECT LVC_USD_PRICE_V2(ev.activity_version_id,ev.course_start_date,t.local_update_date,ev.currency,nvl(t.delegate_country,ev.sponsor_org_country),ev.price,ev.currency,t.ota_status,ev.location_type) x,
ev.title,
t.ota_booking_id
FROM ota_gsi_delegate_bookings_t#diseulprod t,
inted_parted_events_t#diseulprod ev
WHERE t.event_id = ev.event_id
and t.ota_booking_id =
BEGIN
open c1;
loop
fetch c1 bulk collect into l_data limit 20;
for i in 1..l_data.count
loop
update ou_inc_int_t_01
set price = l_data(i).bprice,
updated = 'Y'
where booking_id = l_data(i).book_id;
end loop;
exit when c1%notfound;
end loop;
close c1;
END;
What can also help speed up updates is to use ALTER TABLE table1 NOLOGGING so that the update won't generate redo logs. Another possibility is to drop the column and re-add it; since this is a DDL operation, it will generate neither redo nor undo.

Slow simple update query on PostgreSQL database with 3 million rows

I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 minutes.
Before, I tried to run a VACUUM and ANALYZE commands on that table and I also tried to create some indexes (although I doubt this will make any difference in this case) but none seems to help.
Any other ideas?
Update:
This is the table structure:
CREATE TABLE myTable
(
id bigserial NOT NULL,
title text,
description text,
link text,
"type" character varying(255),
generalFreq real,
generalWeight real,
author_id bigint,
status_id bigint,
CONSTRAINT resources_pkey PRIMARY KEY (id),
CONSTRAINT author_pkey FOREIGN KEY (author_id)
REFERENCES users (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT c_unique_status_id UNIQUE (status_id)
);
I am trying to run UPDATE myTable SET generalFreq = 0;
I have to update tables with 1 or 2 billion rows, with various values for each row. Each run makes ~100 million changes (10%).
My first try was to group them in transactions of 300K updates directly on a specific partition, as PostgreSQL does not always optimize prepared queries if you use partitions.
- Transactions of a bunch of "UPDATE myTable SET myField=value WHERE myId=id" statements give 1,500 updates/sec, which means each run would take at least 18 hours.
- The HOT updates solution, as described here with FILLFACTOR=50, gives 1,600 updates/sec. I use SSDs, so it's a costly improvement as it doubles the storage size.
- Inserting the updated values into a temporary table and merging them afterwards with UPDATE...FROM gives 18,000 updates/sec if I do a VACUUM for each partition, and 100,000 updates/sec otherwise. Cool. Here is the sequence of operations:
CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));
Accumulate a bunch of updates in a buffer, depending on available RAM.
When it's filled, or when you need to change table/partition, or when you're done:
COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;
That means a run now takes 1.5 hours instead of 18 for 100 million updates, vacuum included. To save time, it's not necessary to do a VACUUM FULL at the end, but even a fast regular vacuum is useful to control your transaction IDs on the database and avoid unwanted autovacuum during rush hours.
Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates
First start with a better FILLFACTOR, do a VACUUM FULL to force table rewrite and check the HOT-updates after your UPDATE-query:
SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';
HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.
PS: You need version 8.3 or later.
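A sketch of the first two steps (50 is just an example fillfactor):
ALTER TABLE myTable SET (fillfactor = 50);
VACUUM FULL myTable;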
After waiting 35 minutes for my UPDATE query to finish (and it still hadn't), I decided to try something different. So what I did was a command:
CREATE TABLE table2 AS
SELECT
all the fields of table1 except the one I wanted to update, 0 as theFieldToUpdate
from myTable
Then add indexes, then drop the old table and rename the new one to take its place. That took only 1.7 min. to process plus some extra time to recreate the indexes and constraints. But it did help! :)
Of course that did work only because nobody else was using the database. I would need to lock the table first if this was in a production environment.
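For reference, a concrete version of that statement using the column list from the question (the new table name is arbitrary, and the primary key, foreign key, and unique constraints still have to be recreated before the swap):
CREATE TABLE myTable_new AS
SELECT id, title, description, link, "type",
       0::real AS generalFreq,
       generalWeight, author_id, status_id
FROM myTable;

-- recreate indexes and constraints here, then swap the tables:
DROP TABLE myTable;
ALTER TABLE myTable_new RENAME TO myTable;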
Today I've spent many hours with a similar issue. I've found a solution: drop all the constraints/indexes before the update. No matter whether the column being updated is indexed or not, it seems like Postgres updates all the indexes for all the updated rows. After the update is finished, add the constraints/indexes back.
Try this (note that generalFreq starts as type REAL, and stays the same):
ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;
This will rewrite the table, similar to a DROP + CREATE, and rebuild all indices. But all in one command. Much faster (about 2x) and you don't have to deal with dependencies and recreating indexes and other stuff, though it does lock the table (access exclusive--i.e. full lock) for the duration. Or maybe that's what you want if you want everything else to queue up behind it. If you aren't updating "too many" rows this way is slower than just an update.
How are you running it? If you are looping each row and performing an update statement, you are running potentially millions of individual updates which is why it will perform incredibly slowly.
If you are running a single update statement for all records in one statement it would run a lot faster, and if this process is slow then it's probably down to your hardware more than anything else. 3 million is a lot of records.
The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, ex:
UPDATE myTable SET generalFreq = 0 where generalFreq != 0;
(might also need an index on generalFreq). Then you'll update fewer rows. Though not if the values are all non zero already, but updating fewer rows "can help" since otherwise it updates them and all indexes regardless of whether the value changed or not.
Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create another by just adjusting metadata, instant time.
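A sketch of that metadata-only route (on older versions the new column must be added without a default to avoid a rewrite; PostgreSQL 11+ can also add a constant default without rewriting):
ALTER TABLE myTable DROP COLUMN generalFreq;
ALTER TABLE myTable ADD COLUMN generalFreq real;  -- values start out as NULL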
In my tests I noticed that a big update of more than 200,000 rows is slower than 2 updates of 100,000 rows, even with a temporary table.
My solution is to loop: in each iteration, create a temporary table of 200,000 rows, compute my values in that table, then update my main table with the new values, and so on...
Every 2,000,000 rows I manually run VACUUM ANALYSE mytable; I noticed that autovacuum doesn't do its job for such updates.
I need to update more than 1B+ rows on PostgreSQL table which contains some indexes. I am working on PostgreSQL 12 + SQLAlchemy + Python.
Inspired by the answers here, I wrote a temp table and UPDATE... FROM based updater to see if it makes a difference. The temp table is then fed from CSV generated by Python, and uploaded over the normal SQL client connection.
The speed-up over the naive approach using SQLAlchemy's bulk_update_mappings is 4x-5x. Not an order of magnitude, but still considerable, and in my case this means 1 day, not 1 week, of a batch job.
Below is the relevant Python code that does CREATE TEMPORARY TABLE, COPY FROM and UPDATE FROM. See the full example in this gist.
def bulk_load_psql_using_temp_table(
    dbsession: Session,
    data_as_dicts: List[dict],
):
    """Bulk update columns in PostgreSQL faster using a temp table.

    Works around speed issues on `bulk_update_mapping()` and PostgreSQL.
    Your mileage and speed may vary, but it is going to be faster.
    The observation was 3x ... 4x faster when doing UPDATEs
    where one of the columns is indexed.

    Contains hardcoded temp table creation and UPDATE FROM statements.
    In our case we are bulk updating three columns.

    - Create a temp table - if not created before
    - Fill it from the in-memory CSV using COPY FROM
    - Then perform UPDATE ... FROM on the actual table from the temp table
    - Between the update chunks, clear the temp table using TRUNCATE

    Why is it faster? I did not get a clear answer from the sources I was reading.
    At least there should be less data uploaded from the client to the server,
    as CSV loading is more compact than bulk updates.

    Further reading

    - `About PSQL temp tables <https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/>`_
    - `Naive bulk_update_mapping approach <https://stackoverflow.com/questions/36272316/using-bulk-update-mappings-in-sqlalchemy-to-update-multiple-rows-with-different>`_
    - `Discussion on UPDATE ... FROM + temp table approach <https://stackoverflow.com/questions/3361291/slow-simple-update-query-on-postgresql-database-with-3-million-rows/24811058#24811058>`_.

    :param dbsession:
        SQLAlchemy session.
        Note that we open a separate connection for the bulk update.

    :param data_as_dicts:
        Inbound data as it would be given to bulk_update_mapping
    """
    # temp table created in SQL
    temp_table_name = "temp_bulk_temp_loader"
    # the real table whose data we are filling
    real_table_name = "swap"
    # columns we need to copy
    columns = ["id", "sync_event_id", "sync_reserve0", "sync_reserve1"]
    # how our CSV fields are separated
    delim = ";"
    # Increase temp buffer size for updates
    temp_buffer_size = "3000MB"

    # Dump data to a local mem buffer using CSV writer.
    # No header - this is specifically addressed in copy_from()
    out = StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter=delim)
    writer.writerows(data_as_dicts)

    # Update data over an alternative raw connection
    engine = dbsession.bind
    conn = engine.connect()

    try:
        # No rollbacks
        conn.execution_options(isolation_level="AUTOCOMMIT")

        # See https://blog.codacy.com/how-to-update-large-tables-in-postgresql/
        conn.execute(f"""SET temp_buffers = "{temp_buffer_size}";""")

        # Temp table is dropped at the end of the session
        # https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/
        # This must match data_as_dicts structure.
        sql = f"""
        CREATE TEMP TABLE IF NOT EXISTS {temp_table_name}
        (
            id int,
            sync_event_id int,
            sync_reserve0 bytea,
            sync_reserve1 bytea
        );
        """
        conn.execute(sql)

        # Clean any pending data in the temp table
        # between update chunks.
        # TODO: Not sure why this does not clear itself at conn.close()
        # as I would expect based on the documentation.
        sql = f"TRUNCATE {temp_table_name}"
        conn.execute(sql)

        # Load data from CSV to the temp table
        # https://www.psycopg.org/docs/cursor.html
        cursor = conn.connection.cursor()
        out.seek(0)
        cursor.copy_from(out, temp_table_name, sep=delim, columns=columns)

        # Fill the real table from the temp table
        # This copies values from the temp table using
        # UPDATE...FROM and matching by the row id.
        sql = f"""
        UPDATE {real_table_name}
        SET
            sync_event_id=b.sync_event_id,
            sync_reserve0=b.sync_reserve0,
            sync_reserve1=b.sync_reserve1
        FROM {temp_table_name} AS b
        WHERE {real_table_name}.id=b.id;
        """
        res = conn.execute(sql)
        logger.debug("Updated %d rows", res.rowcount)
    finally:
        conn.close()
I update millions of rows incrementally in batches with minimal locks using one procedure, loop_execute(). It shows the progress of execution as a percentage and a prediction of the end time of the work!
Try:
UPDATE myTable SET generalFreq = 0.0;
Maybe it is a casting issue.

SQL Insert with large dataset

When running a query like "insert into table " how do we handle the commit size? I.e. are all records from anotherTable inserted in a single transaction OR is there a way to set a commit size?
Thanks very much ~Sri
PS: I am a first timer here, and this site looks very good!
In good databases that is an atomic statement, so no, there is no way to limit the number of records inserted - which is a good thing!
In the context that the original poster wants to avoid rollback space problems, the answer is pretty straightforward. The rollback segments should be sized to accommodate the size of your transactions, not the other way round. You commit when your transaction is complete.
I've written code in various languages, mostly Java, to do bulk inserts like what you described. Each time I did it, mostly from parsing some input file or something like that, I would basically just prepare a sub-set of data to insert from the total amount (usually batches of 4000 or so) and feed that data to our DAO layer. So it was done programmatically. We never noticed any real performance hit for doing it this way and we were dealing with a few million records. If you have large data sets to insert, the operation will "take awhile" regardless of how you do it.
You can't control the commit size unless you explicitly code it. For example, you could use a while loop and code up a way to limit the amount of data you're selecting.
David Aldridge is right, size the rollback segment based on the maximum transaction, when you want the INSERT to either succeed or fail as a whole.
Some alternatives:
If you don't care about being able to roll it back (which is what the segment is there for), you could ALTER TABLE and add the NOLOGGING clause. But that's not a wise move unless you're loading a reporting table where you drop all old rows and load new ones, or some other special cases.
If you're okay with some rows getting inserted and others failing for some reason, then add support for handling the failures using the DML error logging syntax, INSERT ... LOG ERRORS INTO (sketched after the example below).
If you need the data set to be limited, build that limit into the query.
For example, in Microsoft SQL Server parlance, you can use "TOP N" to make sure the query only returns a limited number of rows.
INSERT INTO thisTable
SELECT TOP 100 * FROM anotherTable;
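And a sketch of the Oracle DML error logging route mentioned above (the error table name is the DBMS_ERRLOG default):
-- one-time setup: creates ERR$_THISTABLE
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG('THISTABLE');
END;
/

INSERT INTO thisTable
SELECT * FROM anotherTable
LOG ERRORS INTO err$_thistable ('batch 1') REJECT LIMIT UNLIMITED;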
The reason why I want to do that is to avoid the rollback segment running out of space. Also, I want to see results being populated in the target table at regular intervals.
I don't want to use a while loop because it might add performance overhead, wouldn't it?
~Sri
You are right, you may want to run large inserts in batches. The attached link shows a way to do it in SQL Server; if you are using a different backend you would do something similar, but the exact syntax might be different. This is a case where a loop is acceptable.
http://www.tek-tips.com/faqs.cfm?fid=3141
"The reason why I want to do that is to avoid the rollback segment going out of space. Also, I want to see results being populated in the target table at regular intervals."
The first is simply a matter of sizing the undo tablespace correctly. Since the undo for an insert is a delete of the newly inserted row, it doesn't require a lot of space. Conversely, a delete generally requires more space because it has to keep a copy of the entire deleted row in order to re-insert it when undoing.
For the second, have a look at v$session_longops and/or rows_processed in v$sql
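For example, something along these lines from another session (a sketch; the columns come from v$session_longops):
SELECT sid, opname, target, sofar, totalwork,
       round(sofar / totalwork * 100, 1) AS pct_done,
       time_remaining
FROM v$session_longops
WHERE totalwork > 0
AND sofar <> totalwork;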
INSERT INTO TableInserted
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY ID) AS RowNumber
FROM TableSelected
) X
WHERE RowNumber BETWEEN 101 AND 200
You could wrap the above into a while loop pretty easily, replacing the 101 and 200 with variables. It's better than doing 1 record at a time.
I don't know what versions of Oracle support window functions.
This is an extended comment to demonstrate that setting indexes to NOLOGGING will not help reduce UNDO or REDO for INSERTs.
The manual implies NOLOGGING indexes may help improve DML by reducing UNDO and REDO. And since NOLOGGING helps with table DML, it seems logical that it would also help with index changes. But this test case demonstrates that changing indexes to NOLOGGING has no effect on INSERT statements.
drop table table_no_index;
drop table table_w_log_index;
drop table table_w_nolog_index;
--#0: Before
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#1: NOLOGGING table with no index. This is the best case scenario.
create table table_no_index(a number) nologging;
insert /*+ append */ into table_no_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#2: NOLOGGING table with LOGGING index. This should generate REDO and UNDO.
create table table_w_log_index(a number) nologging;
create index table_w_log_index_idx on table_w_log_index(a);
insert /*+ append */ into table_w_log_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#3: NOLOGGING table with NOLOGGING index. Does this generate as much REDO and UNDO as previous step?
create table table_w_nolog_index(a number) nologging;
create index table_w_nolog_index_idx on table_w_nolog_index(a) nologging;
insert /*+ append */ into table_w_nolog_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
Here are the results from the statistics queries. The numbers are cumulative for the session. Test cases #2 and #3 have the same increase in UNDO and REDO.
--#0: BEFORE: Very little redo or undo since session just started.
redo size 35,436
undo change vector size 10,120
--#1: NOLOGGING table, no index: Very little redo or undo.
redo size 88,460
undo change vector size 21,772
--#2: NOLOGGING table, LOGGING index: Large amount of redo and undo.
redo size 6,895,100
undo change vector size 3,180,920
--#3: NOLOGGING table, NOLOGGING index: Large amount of redo and undo.
redo size 13,736,036
undo change vector size 6,354,032
You may just want to make the indexes NOLOGGING. That way the table data is recoverable, but the indexes will need to be rebuilt if the table is recovered. Index maintenance can create a lot of undo.