When running a query like "insert into someTable select ... from anotherTable", how do we handle the commit size? That is, are all records from anotherTable inserted in a single transaction, or is there a way to set a commit size?
Thanks very much ~Sri
PS: I am a first timer here, and this site looks very good!
In good databases that is an atomic statement, so no, there is no way to limit the number of records inserted - which is a good thing!
In the context that the original poster wants to avoid rollback space problems, the answer is pretty straightforward. The rollback segments should be sized to accommodate the size of your transactions, not the other way round. You commit when your transaction is complete.
I've written code in various languages, mostly Java, to do bulk inserts like what you described. Each time I did it, mostly when parsing some input file or something like that, I would basically just prepare a subset of the total data to insert (usually batches of 4000 or so) and feed that data to our DAO layer. So it was done programmatically. We never noticed any real performance hit doing it this way, and we were dealing with a few million records. If you have large data sets to insert, the operation will "take a while" regardless of how you do it.
You can't control the commit size unless you explicitly code it. For example, you could use a while loop and code up a way to limit the amount of data you're selecting.
David Aldridge is right, size the rollback segment based on the maximum transaction, when you want the INSERT to either succeed or fail as a whole.
Some alternatives:
If you don't care about being able to roll it back (which is what the segment is there for), you could ALTER TABLE and add the NOLOGGING clause. But that's not a wise move unless you're loading a reporting table where you drop all old rows and load new ones, or some other special cases.
If you're okay with some rows getting inserted and others failing for some reason, then add support for handling the failures using the INSERT ... LOG ERRORS INTO syntax (a sketch follows the TOP N example below).
If you need the data set to be limited, build that limit into the query.
For example, in Microsoft SQL Server parlance, you can use "TOP N" to make sure the query only returns a limited number of rows.
INSERT INTO thisTable
SELECT TOP 100 * FROM anotherTable;
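For the Oracle error-logging alternative mentioned above, a minimal sketch might look like this (the error-log table name follows the ERR$_ default that DBMS_ERRLOG generates; the tag string is arbitrary):
-- One-time setup: create the error log table (defaults to ERR$_THISTABLE).
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'THISTABLE');
END;
/
-- Rows that fail (e.g. constraint violations) are logged instead of aborting the whole insert.
INSERT INTO thisTable
SELECT * FROM anotherTable
LOG ERRORS INTO err$_thisTable ('bulk load') REJECT LIMIT UNLIMITED;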
The reason why I want to do that is to avoid the rollback segment going out of space. Also, I want to see results being populated in the target table at regular intervals.
I don't want to use a while loop because it might add performance overhead, wouldn't it?
~Sri
You are right, you may want to run large inserts in batches. The attached link shows a way to do it in SQL Server; if you are using a different backend you would do something similar, but the exact syntax might be different. This is a case where a loop is acceptable.
http://www.tek-tips.com/faqs.cfm?fid=3141
"The reason why I want to do that is to avoid the rollback segment going out of space. Also, I want to see results being populated in the target table at regular intervals."
The first is simply a matter of sizing the undo tablespace correctly. Since the undo for an insert is just a delete of the newly inserted row, it doesn't require a lot of space. Conversely, a delete generally requires more undo space because it has to keep a copy of the entire deleted row in order to re-insert it.
For the second, have a look at v$session_longops and/or rows_processed in v$sql
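For example, a query along these lines (the column choice is just a sketch) shows the progress of long-running operations from another session:
SELECT sid, opname, target, sofar, totalwork, units, time_remaining
FROM   v$session_longops
WHERE  sofar <> totalwork;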
INSERT INTO TableInserted
SELECT * -- list the target table's columns explicitly here if you don't want the extra RowNumber column inserted
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY ID) AS RowNumber
FROM TableSelected
) X
WHERE RowNumber BETWEEN 101 AND 200
You could wrap the above into a while loop pretty easily, replacing the 101 and 200 with variables. It's better than doing 1 record at a time.
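For example, a rough T-SQL sketch of that loop (the column names ID and Col1 are placeholders for your real columns):
DECLARE @lo INT = 1, @hi INT = 100, @batch INT = 100, @rows INT = 1;
WHILE @rows > 0
BEGIN
    INSERT INTO TableInserted (ID, Col1)
    SELECT ID, Col1
    FROM (
        SELECT ID, Col1,
               ROW_NUMBER() OVER (ORDER BY ID) AS RowNumber
        FROM TableSelected
    ) X
    WHERE RowNumber BETWEEN @lo AND @hi;
    SET @rows = @@ROWCOUNT;  -- stop once a window comes back empty
    SET @lo += @batch;
    SET @hi += @batch;
END;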
I don't know what versions of Oracle support window functions.
This is an extended comment to demonstrate that setting indexes to NOLOGGING will not help reduce UNDO or REDO for INSERTs.
The manual implies NOLOGGING indexes may help improve DML by reducing UNDO and REDO. And since NOLOGGING helps with table DML, it seems logical that it would also help with index changes. But this test case demonstrates that changing indexes to NOLOGGING has no effect on INSERT statements.
drop table table_no_index;
drop table table_w_log_index;
drop table table_w_nolog_index;
--#0: Before
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#1: NOLOGGING table with no index. This is the best case scenario.
create table table_no_index(a number) nologging;
insert /*+ append */ into table_no_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#2: NOLOGGING table with LOGGING index. This should generate REDO and UNDO.
create table table_w_log_index(a number) nologging;
create index table_w_log_index_idx on table_w_log_index(a);
insert /*+ append */ into table_w_log_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
--#3: NOLOGGING table with NOLOGGING index. Does this generate as much REDO and UNDO as previous step?
create table table_w_nolog_index(a number) nologging;
create index table_w_nolog_index_idx on table_w_nolog_index(a) nologging;
insert /*+ append */ into table_w_nolog_index select level from dual connect by level <= 100000;
commit;
select name, value from v$mystat natural join v$statname where display_name in ('undo change vector size', 'redo size') order by 1;
Here are the results from the statistics queries. The numbers are cumulative for the session. Test cases #2 and #3 have the same increase in UNDO and REDO.
--#0: BEFORE: Very little redo or undo since session just started.
redo size 35,436
undo change vector size 10,120
--#1: NOLOGGING table, no index: Very little redo or undo.
redo size 88,460
undo change vector size 21,772
--#2: NOLOGGING table, LOGGING index: Large amount of redo and undo.
redo size 6,895,100
undo change vector size 3,180,920
--#3: NOLOGGING table, NOLOGGING index: Large amount of redo and undo.
redo size 13,736,036
undo change vector size 6,354,032
You may just want to make the indexes NOLOGGING. That way the table data is recoverable, but the indexes will need to be rebuilt if the table is recovered. Index maintenance can create a lot of undo.
Related
I have a scenario wherein I need to copy 500 million rows from Table1 to Table2. A couple of points:
Table1 has 2 billion rows.
Table2 is a new table and identical to Table1.
Table1 and Table2 both are of List partition type.
Both tables have to be in the same tablespace, and the tablespace is created in LOGGING mode.
Tablespace block size is 8192, FORCE_LOGGING is NO, AUTOEXTEND is ON, and redo archival is enabled.
So, here is my approach to this activity, and I ask for recommendations to improve it or perhaps prevent some sudden unwanted situations.
Create Table2 with same structure without any indexes or PK.
Alter Table2 nologging; --Putting the table in NOLOGGING mode to stop redo generation. This is done just to improve performance.
Do this activity in 50 parallel jobs (jobs created based on the partition column). The partition column has 120 distinct values, so 120 jobs in total. The first 50 will be submitted, and as soon as one finishes, the 51st will be submitted, and so on.
Use a cursor, a BULK COLLECT fetch with a limit of 5000, and FORALL for the insert (with the APPEND hint). Commit after each iteration, so the commit frequency is 5,000 rows.
After all the jobs are finished, put Table2 back in LOGGING mode.
alter table Table2 logging;
Create all required indexes and PK on Table2 with Parallel mode enabled and then alter index NOPARALLEL.
Any suggestions? Thanks a lot for your time.
Use a single INSERT ... SELECT statement instead of PL/SQL.
There's no need to commit in chunks or to have a parallel strategy that mirrors the partitions. If the APPEND hint works and a direct-path write is used, there won't be any significant REDO or UNDO usage, so you don't need to run in chunks to reduce resource consumption. Oracle can easily divide a segment into granules; it's just copying a bunch of blocks from one place to another, so it doesn't matter whether it processes them per partition. (Possible exceptions are if you're using an unusual column type that doesn't support parallel SQL, or if you're joining tables and relying on partition-wise joins.)
alter session enable parallel dml;
alter table table2 nologging;
--Picking a good DOP can be tricky. 32 might not be the best number for you.
insert /*+ append parallel(32) */ into table2
select * from table1;
commit;
alter table table2 logging;
Before you run this, check the execution plan. There are lots of things that can prevent direct-path writes and you want to find them before you start the DML.
In the execution plan, make sure you see "LOAD AS SELECT" to ensure direct-path writes, "PX" to ensure parallelism, and a "PX" operation before the "LOAD AS SELECT" to ensure that both the writes and the reads are done in parallel.
alter session enable parallel dml;
alter table table2 nologging;
explain plan for
insert /*+ append parallel(32) */ into table2
select * from table1;
select * from table(dbms_xplan.display);
I often find it's not worth dealing with indexes separately. But that may depend on the number of indexes.
I have a transactional database. One of the tables (A) is almost empty; it has a unique indexed column (x) and no clustered index.
2 concurrent transactions:
begin trans
insert into A (x,y,z) values (1,2,3)
WAITFOR DELAY '00:00:02'; -- or manually run the first 2 lines only
select * from A where x=1; -- small tables produce a query plan of table scan here, and block against the transaction below.
commit
begin trans
insert into A (x,y,z) values (2,3,4)
WAITFOR DELAY '00:00:02';
-- on a table with 3 or fewer pages this hint is needed to avoid blocking against the above transaction
select * from A with(forceseek) -- force query plan of index seek + rid lookup
where x=2;
commit
My problem is that when the table has very few rows the 2 transactions can deadlock, because SQL Server generates a table scan for the select, even though there is an index, and both wait on the lock held by the newly inserted row of the other transaction.
When there are lots of rows in this table, the query plan changes to an index seek, and both happily complete.
When the table is small, the WITH(FORCESEEK) hint forces the correct query plan (5% more expensive for tiny tables).
Is it possible to provide a default hint for all queries on a table so they behave as if they had the 'forceseek' hint?
The deadlocking code above was generated by Hibernate; is it possible to have Hibernate emit the needed query hints?
We can make the tables pretend to be large enough that the query optimizer selects the index seek, using the undocumented features in UPDATE STATISTICS: http://msdn.microsoft.com/en-AU/library/ms187348(v=sql.110).aspx (example below). Can anyone see any downsides to making all tables with fewer than 1000 rows pretend they have 1000 rows over 10 pages?
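Something like this, if I read that page correctly (table A from the example above; the values are the hypothetical 1000 rows over 10 pages):
UPDATE STATISTICS A WITH ROWCOUNT = 1000, PAGECOUNT = 10;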
You can create a Plan Guide.
Or you can enable the Read Committed Snapshot isolation level in the database (see the sketch after these suggestions).
Better still: make the index clustered.
For small tables that experience a high update rate, perhaps you can apply the advice from Using tables as Queues.
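For the Read Committed Snapshot suggestion, a minimal sketch (the database name is a placeholder; the ALTER needs no other active connections, or add WITH ROLLBACK IMMEDIATE):
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;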
Can anyone see any downsides to making all tables with less than 1000 rows pretend they have 1000 rows over 10 pages?
If the table appears in another, more complex query (think joins), then the cardinality estimates may be wildly off and cascade through the plan, producing bad plans.
You could create a view that is a copy of the table but with the hint and have queries use the view instead:
create view A2 as
select * from A with(forceseek)
If you want to preserve the table name used by queries, rename the table to something else then name the view "A":
sp_rename 'A', 'A2';
create view A as
select * from A2 with(forceseek)
Just to add another option you may consider.
You can lock the entire table on update by using
ALTER TABLE MyTable SET (LOCK_ESCALATION = TABLE);
This workaround is fine if you do not have too many updates that will queue and slow performance.
It is table-wide and no updates to other code are needed.
There is a table T with column a:
CREATE TABLE T (
id_t integer not null,
text varchar2(100),
a integer
)
/
ALTER TABLE T ADD CONSTRAINT PK_T PRIMARY KEY (ID_T)
/
Index was created like this:
CREATE INDEX IDX_T$A ON T(a);
There's also a check constraint:
ALTER TABLE T ADD CHECK (a is null or a = 1);
Most of the records in T have a null value of a, so a query using the index works really fast if the index is in a consistent state and its statistics are up to date.
But the problem is that the values of a in some rows change really frequently (some rows get a null value, some get 1), so I need to rebuild the index, let's say, every hour.
However, quite often the job that tries to rebuild the index gets an exception:
ORA-00054: resource busy and acquire with NOWAIT specified
Can anybody help me cope with this issue?
Index rebuilds are not needed in most cases. Of course, newly created indexes are efficient and their efficiency decreases over time, but this process stops after a while; it simply converges to some level.
If you really need to optimize indexes, try the less invasive DDL command ALTER INDEX ... SHRINK SPACE COMPACT.
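For example, against the index from the question (this assumes the index lives in an ASSM tablespace, which shrink requires):
ALTER INDEX IDX_T$A SHRINK SPACE COMPACT;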
PS: I would also recommend using a smaller block size (4K or 8K) for your tablespace storage.
Have you tried adding "ONLINE" to that index rebuild statement?
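For example (note that ONLINE rebuild is an Enterprise Edition feature, so check your licensing):
ALTER INDEX IDX_T$A REBUILD ONLINE;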
Edit: If online rebuild is not available, then you might look at a fast-refresh-on-commit materialised view to store the rowids or primary keys of the rows that have a 1 in column A.
Start with a look at the documentation:-
http://docs.oracle.com/cd/B28359_01/server.111/b28326/repmview.htm
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6002.htm#SQLRF01302
You'd create a materialised view log on the table, and then a materialised view.
Think in particular about the resource requirements for this: changes to the master table require a change vector to be written to the materialised view log, which is effectively an additional insert for every change. Then the changes have to be propagated to another table (the materialised view storage table) with additional queries. It is by no means a low-impact option.
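A very rough sketch of what that might look like (the MV name is made up, and fast refresh on commit carries a number of restrictions covered in the docs above, so treat this as a starting point rather than working DDL):
CREATE MATERIALIZED VIEW LOG ON T WITH ROWID (a) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW mv_t_flagged
  REFRESH FAST ON COMMIT
AS
  SELECT rowid AS row_id, id_t
  FROM   T
  WHERE  a = 1;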
Rebuilding for Performance
Most Oracle experts are skeptical of frequently rebuilding indexes. For example, a quick glance at the presentation Rebuilding the Truth will show you that indexes do not behave in the naive way many people assume they do.
One of the relevant points in that presentation is "fully deleted blocks are recycled and are not generally problematic". If your values completely change, then your index should not grow infinitely large. Although your indexes are used in a non-typical way, that behavior is probably a good thing.
Here's a quick example. Create 1 million rows and index 100 of them.
--Create table, constraints, and index.
CREATE TABLE T
(
id_t integer primary key,
text varchar2(100),
a integer check (a is null or a = 1)
);
CREATE INDEX IDX_T$A ON T(a);
--Insert 1M rows, with 100 "1"s.
insert into t
select level, level, case when mod(level, 10000) = 0 then 1 else null end
from dual connect by level <= 1000000;
commit;
--Initial sizes:
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME MB
T 19
IDX_T$A 0.0625
Now completely shuffle the index rows around 1000 times.
--Move the 1s around 1000 times. Takes about 6 minutes.
begin
for i in 9000 .. 10000 loop
update t
set a = case when mod(id_t, i) = 0 then 1 else null end
--Don't update if the value is the same
where nvl(a,-1) <> nvl(case when mod(id_t,i) = 0 then 1 else null end,-1);
commit;
end loop;
end;
/
The index segment size is still the same.
--The index size is the same.
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME MB
T 19
IDX_T$A 0.0625
Rebuilding for Statistics
It's good to worry about the statistics of objects whose data changes so dramatically. But again, although your system is unusual, it may work fine with the default Oracle behavior. Although the rows indexed may completely change, the relevant statistics may stay the same. If there are always 100 rows indexed, the number of rows, blocks, and distinctness will stay the same.
Perhaps the clustering factor will significantly change, if the 100 rows shift from being completely random to being very close to each other. But even that may not matter. If there are millions of rows, but only 100 indexed, the optimizer's decision will probably be the same regardless of the clustering factor. Reading 1 block (awesome clustering factor) or reading 100 blocks (worst-case clustering factor) will still look much better than doing a full table scan of millions of rows.
But statistics are complicated and I'm surely over-simplifying things. If you need to keep your statistics a specific way, you may want to lock them. Unfortunately you can't lock just an index, but you can lock the table and its dependent indexes.
begin
dbms_stats.lock_table_stats(ownname => user, tabname => 'T');
end;
/
Rebuilding anyway
If a rebuild is still necessary, @Robe Eleckers' idea to retry should work. But instead of catching the exception, it would be easier to set DDL_LOCK_TIMEOUT.
alter session set ddl_lock_timeout = 500;
The session will still need to get an exclusive lock on the table, but this will make it much easier to find the right window of opportunity.
Since the field in question has very low cardinality I would suggest using a bitmap index and skipping the rebuilds altogether.
CREATE BITMAP INDEX IDX_T$A ON T(a);
Note (as mentioned in comments): transactional performance is very low for bitmap indexes so this would only work well if there are very few overlapping transactions doing updates to the table.
I got an oracle SQL query that selects entries of the current day like so:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE)
AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
The EVT_END field is of type DATE and T.TYPE is a NUMBER(15,0).
I'm sure that as the table grows (and time goes on), the date constraint will narrow the result set by a much larger factor than the type constraint, since there are only a very limited number of types.
So the basic question is: what's the best index to make the selection on the current date faster? In particular, I wonder what the advantages and disadvantages of a functional index on TRUNC(T.EVT_END) would be compared to a normal index on T.EVT_END. With a functional index the query would look something like this:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE)
AND T.TYPE = 123
Because other queries use the mentioned date constraints without the additional type selection (or maybe with some other fields), multicolumn indexes wouldn't help me a lot.
Thanks, I'd appreciate your hints.
Your index should be TYPE, EVT_END.
CREATE INDEX PIndex
ON MY_TABLE (TYPE, EVT_END)
The optimizer will first go through this index to find the TYPE=123 section. Then, within TYPE=123, it has the EVT_END timestamps sorted, so it can search the b-tree for the first date in the range and walk the dates sequentially until a date falls out of the range.
Based on the query above, the functional index will provide no value. For a functional index to be used, the predicate in the query would need to be written as follows:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
The functional index on the column EVT_END is being ignored. It would be better to have a normal index on the EVT_END date. For a functional index to be used, the left-hand side of the condition must match the declaration of the functional index. I would probably write the query as:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE+1)
AND T.TYPE = 123
And I would create the following index:
CREATE INDEX bla on MY_TABLE( EVT_END )
This is assuming you are trying to find the events that ended within a day.
Results
If your index is cached, a function-based index performs best. If your index is not cached, a compressed function-based index performs best.
Below are the relative times generated by my test code. Lower is better. You cannot compare the numbers between cached and non-cached, they are totally different tests.
                  In cache    Not in cache
Regular                120             139
FBI                    100             138
Compressed FBI         126             100
I'm not sure why the FBI performs better than the regular index. (Although it's probably related to what you said about equality predicates versus range. You can see that the regular index has an extra "FILTER" step in its explain plan.) The compressed FBI has some additional overhead to uncompress the blocks. This small amount of extra CPU time is relevant when everything is already in memory, and CPU waits are most important. But when nothing is cached, and IO is more important, the reduced space of the compressed FBI helps a lot.
Assumptions
There seems to be a lot of confusion about this question. The way I read it, you only care about this one specific query, and you want to know whether a function-based index or a regular index will be faster.
I assume you do not care about other queries that may benefit from this index, additional time spent to maintain the index, if the developers remember to use it, or whether or not the optimizer chooses the index. (If the optimizer doesn't choose the index, which I think is unlikely, you can add a hint.) Let me know if any of these assumptions are wrong.
Code
--Create tables. 1 = regular, 2 = FBI, 3 = Compressed FBI
create table my_table1(evt_end date, type number) nologging;
create table my_table2(evt_end date, type number) nologging;
create table my_table3(evt_end date, type number) nologging;
--Create 1K days, each with 100K values
begin
for i in 1 .. 1000 loop
insert /*+ append */ into my_table1
select sysdate + i - 500 + (level * interval '1' second), 1
from dual connect by level <= 100000;
commit;
end loop;
end;
/
insert /*+ append */ into my_table2 select * from my_table1;
insert /*+ append */ into my_table3 select * from my_table1;
--Create indexes
create index my_table1_idx on my_table1(evt_end);
create index my_table2_idx on my_table2(trunc(evt_end));
create index my_table3_idx on my_table3(trunc(evt_end)) compress;
--Gather statistics
begin
dbms_stats.gather_table_stats(user, 'MY_TABLE1');
dbms_stats.gather_table_stats(user, 'MY_TABLE2');
dbms_stats.gather_table_stats(user, 'MY_TABLE3');
end;
/
--Get the segment size.
--This shows the main advantage of a compressed FBI, the lower space.
select segment_name, bytes/1024/1024/1024 GB
from dba_segments
where segment_name like 'MY_TABLE__IDX'
order by segment_name;
SEGMENT_NAME GB
MY_TABLE1_IDX 2.0595703125
MY_TABLE2_IDX 2.0478515625
MY_TABLE3_IDX 1.1923828125
--Test block.
--Uncomment different lines to generate 6 different test cases.
--Regular, Function-based, and Function-based compressed. Both cached and not-cached.
declare
v_count number;
v_start_time number;
v_total_time number := 0;
begin
--Uncomment two lines to test the server when it's "cold", and nothing is cached.
for i in 1 .. 10 loop
execute immediate 'alter system flush buffer_cache';
--Uncomment one line to test the server when it's "hot", and everything is cached.
--for i in 1 .. 1000 loop
v_start_time := dbms_utility.get_time;
SELECT COUNT(*)
INTO V_COUNT
--#1: Regular
FROM MY_TABLE1 T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400;
--#2: Function-based
--FROM MY_TABLE2 T
--WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE);
--#3: Compressed function-based
--FROM MY_TABLE3 T
--WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE);
v_total_time := v_total_time + (dbms_utility.get_time - v_start_time);
end loop;
dbms_output.put_line('Seconds: '||v_total_time/100);
end;
/
Test Methodology
I ran each block at least 5 times, alternated between run types (in case something was running on my machine only part of the time), threw out the high and the low run times, and averaged them. The code above does not include all that logic, since it would take up 90% of this answer.
Other Things to Consider
There are still many other things to consider. My code assumes the data is inserted in a very index-friendly order. Things will be totally different if this is not true, as compression may not help at all.
Probably the best solution to this problem is to avoid it completely with partitioning. For reading the same amount of data, a full table scan is much faster than an index read because it uses multi-block IO. But there are some downsides to partitioning, like the large amount of money
required to buy the option, and extra maintenance tasks. For example, creating partitions ahead of time, or using interval partitioning (which has some other weird issues), gathering stats, deferred segment creation, etc.
Ultimately, you will need to test this yourself. But remember that testing even such a simple choice is difficult. You need realistic data, realistic tests, and a realistic environment. Realistic data is much harder than it sounds. With indexes, you cannot simply copy the data and build the indexes in one shot: a create table ... as select followed by a create index will produce a different index than creating the table and performing a bunch of inserts and deletes in a specific order.
@S1lence:
I believe you put considerable thought into this question, and I took some time over my answer as well, since I don't like posting guesses.
I would like to share what I found on the web about this choice of a normal index on a date column versus an FBI.
Based on my understanding of the link below, if you are definitely going to use the TRUNC function, then you can rule out the normal index, as this consulting site puts it:
Even though the column may have an index, the trunc built-in function will invalidate the index, causing sub-optimal execution with unnecessary I/O.
I suppose that clears it up: you have to go with an FBI if you're going to use TRUNC for sure. Please let me know if my reply makes sense.
Oracle SQL Tuning with function-based indexes
Cheers,
Lakshmanan C.
The decision over whether or not to use a function-based index should be driven by how you plan to write your queries. If all your queries against the date column will be in the form TRUNC(EVT_END), then you should use the FBI. However, in general it will be better to create an index on just EVT_END for the following reasons:
It will be more reusable. If you ever have queries checking particular times of the day then you can't use TRUNC.
There will be more distinct keys in the index using just the date. If you have 1,000 different times inserted during a day, EVT_END will have 1,000 distinct keys, whereas TRUNC(EVT_END) will only have 1 (this assumes you're storing the time component and not just midnight for all the dates; in the latter case both will have 1 distinct key per day). This matters because the more distinct values an index has, the higher its selectivity and the more likely it is to be used by the optimizer (see this)
The clustering factor is likely to be different, but in the case of using trunc it's more likely to go up, not down as stated in some other comments. This is because the clustering factor represents how closely the order of the values in the index match the physical storage of the data. If all your data is inserted in date order then a plain index will have the same order as the physical data. However, with TRUNC all times on a single day will map to the same value, so the order of rows in the index may be completely different to the physical data. Again, this means the trunc index is less likely to be used. This will entirely depend on your database's insertion/deletion patterns however.
Developers are more likely to write queries where TRUNC isn't applied to the column (in my experience). Whether this holds true for you will depend on your developers and the quality controls you have around deployed SQL.
Personally, I would go with Marlin's answer of TYPE, EVT_END as a first pass. You need to test this in your environment, however, and see how it affects this query and all others using the TYPE and EVT_END columns.
We have a 'merge' script that is used to assign codes to customers. Currently it works by looking at customers in a staging table and assigning them unused codes. Those codes are marked as used, and the staged records, with codes, are loaded into a production table. The staging table gets cleared and life is peachy.
Unfortunately we are working with a larger data set now (both customers and codes) and the process is taking WAY too long to run. I'm hoping the wonderful community here can look at the code and offer either improvements to it or another way of attacking the problem.
Thanks in advance!
Edit - Forgot to mention: part of the reason for some of the checks here is that the staging table is 'living' and can have records feeding into it during the script run.
whenever sqlerror exit 1
-- stagingTable: TAB_000000003134
-- codeTable: TAB_000000003135
-- masterTable: TAB_000000003133
-- dedupe staging table
delete from TAB_000000003134 a
where ROWID > (
select min(rowid)
from TAB_000000003134 b
where a.cust_id = b.cust_id
);
commit;
delete from TAB_000000003134
where cust_id is null;
commit;
-- set row num on staging table
update TAB_000000003134
set row_num = rownum;
commit;
-- reset row nums on code table
update TAB_000000003135
set row_num = NULL;
commit;
-- assign row nums to codes
update TAB_000000003135
set row_num = rownum
where dateassigned is null
and active = 1;
commit;
-- attach codes to staging table
update TAB_000000003134 d
set (CODE1, CODE2) =
(
select CODE1, CODE2
from TAB_000000003135 c
where d.row_num = c.row_num
);
commit;
-- mark used codes compared to template
update TAB_000000003135 c
set dateassigned = sysdate, assignedto = (select cust_id from TAB_000000003134 d where c.CODE1 = d.CODE1)
where exists (select 'x' from TAB_000000003134 d where c.CODE1 = d.CODE1);
commit;
-- clear and copy data to master
truncate table TAB_000000003133;
insert into TAB_000000003133 (
<customer fields>, code1, code2, TIMESTAMP_
)
select <customer fields>, CODE1, CODE2, SYSDATE
from TAB_000000003134;
commit;
-- remove any staging records with code numbers
delete from TAB_000000003134
where CODE1 is not NULL;
commit;
quit
Combine statements as much as possible. For example, combine the first two deletes by simply adding "or cust_id is null" to the first delete. This will definitely reduce the number of reads, and may also significantly decrease the amount of data written. (Oracle writes blocks, not rows, so even if the two statements work with different rows they may be re-writing the same blocks.)
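For example, something along these lines (same staging table as in the script):
-- Dedupe and drop null cust_ids in a single pass.
delete from TAB_000000003134 a
where a.cust_id is null
   or a.rowid > (select min(b.rowid)
                 from TAB_000000003134 b
                 where b.cust_id = a.cust_id);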
It's probably quicker to insert the entire table into another table than to update every row. Oracle does a lot of extra work for updates and deletes, to maintain concurrency and consistency. And updating values to NULL can be especially expensive, see update x set y = null takes a long time for some more details. You can avoid (almost all) UNDO and REDO with direct-path inserts: make sure the table is in NOLOGGING mode (or the database is in NOARCHIVELOG mode), and insert using the APPEND hint.
Replace the UPDATEs with MERGEs. UPDATEs can only use nested loops; MERGEs can also use hash joins. If you're updating a large amount of data, a MERGE can be significantly faster. And a MERGE doesn't have to read a table twice if it's used for both the SET and an EXISTS. (Although creating a new table may also be faster.)
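As a sketch, the "attach codes to staging table" update could become something like this (it assumes row_num is unique on the code table, which the script already relies on):
merge into TAB_000000003134 d
using (select row_num, CODE1, CODE2
       from   TAB_000000003135
       where  row_num is not null) c
on (d.row_num = c.row_num)
when matched then
  update set d.CODE1 = c.CODE1, d.CODE2 = c.CODE2;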
Use /*+ APPEND */ with the TAB_000000003133 insert. If you're truncating the table, I assume you don't need point-in-time recovery of the data, so you might as well insert it directly to the datafile and skip all the overhead.
Use parallelism (if you're not already). There are side-effects and dozens of factors to consider for tuning, but don't let that discourage you. If you're dealing with large amounts of data, sooner or later you'll need to use parallelism if you want to get the most out of your hardware.
Use better names. This advice is more subjective, but in my opinion I think using good names is extremely important. Even though it's all 0s and 1s at some level, and many programmers think that cryptic code is cool, you want people to understand and care about your data. People just won't care as much about TAB_000000003135 as something like TAB_CUSTOMER_CODES. It'll be harder to learn, people are less likely to change it because it looks so complicated, and people are less likely to see errors because the purpose isn't as clear.
Don't commit after every statement. Instead, you should issue one COMMIT at the end of the script. This isn't so much for performance, but because the data is not in a consistent state until the end of the script.
(It turns out there probably are performance benefits to committing less frequently in Oracle, but your primary concern should be about maintaining consistency)
You might look into using global temporary tables. The data in a global temp table is only visible to the current session, so you could skip some of the reset steps in your script.
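For example (the name and columns below are placeholders, not your real schema):
-- Data in a global temporary table is private to the session and disappears
-- automatically, so some of the dedupe/reset steps against a permanent staging table go away.
create global temporary table tmp_staging_customers (
  cust_id number,
  row_num number,
  code1   varchar2(30),
  code2   varchar2(30)
) on commit preserve rows;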