Which codes have correspondence in the db - sql

I'm faced with the task of checking, against a database with millions of records, which codes from a set of about 1500 have a corresponding record. For example, I have 1500 IDs in a CSV file and I want to know which of those IDs exist in the database (and are therefore valid) and which ones don't.
Is there a better way of doing this than "... WHERE id IN (1, 2, 3, ..., 1500);"?
The DB/language in question is Oracle PL/SQL.
Thanks in advance for any help.

Build an external table on your CSV file. External tables are highly neat things which allow us to query the contents of an OS file in SQL; the Oracle documentation covers them in detail.
Then it's a simple matter of issuing a query:
select csv.id
     , case when tgt.id is null then 'invalid' else 'valid' end as valid_id
from your_external_tab csv
     left join target_table tgt on (csv.id = tgt.id)
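For reference, a minimal external table definition might look roughly like this. It is only a sketch: the directory object your_data_directory, the file name and the single ID column are assumptions you would adapt to your environment.
create table your_external_tab (
    id number
)
organization external (
    type oracle_loader
    default directory your_data_directory
    access parameters (
        records delimited by newline
        fields terminated by ','
    )
    location ('your_data.csv')
)
reject limit unlimited;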
"CSV table is hardly ideal from a performance point of view"
Performance is a matter of context. In this case it depends on how often the data in the CSV changes and how often we need to query it. If the file is produced once a day and we only need to check the values after it has been delivered then an external table is the most efficient solution. But if this data set is a permanent repository which needs to be queried often then the overhead of writing to a heap table is obviously justified.
To me, a CSV file consisting of a bunch of IDs and nothing else sounds like transient data and so fits the use case for external tables. But the OP may have additional requirements which they haven't mentioned.
Here is an alternative approach which doesn't require creating any permanent database objects. Consequently it is less elegant, and will probably perform worse.
It reads the CSV file laboriously using UTL_FILE and populates a collection based on SYSTEM.NUMBER_TBL_TYPE, a pre-defined collection (nested table of NUMBER) which should be available in your Oracle database.
declare
    ids system.number_tbl_type := system.number_tbl_type();
    fh  utl_file.file_type;
    idx pls_integer := 0;
    buf varchar2(32767);
begin
    fh := utl_file.fopen('your_data_directory', 'your_data.csv', 'r');
    begin
        loop
            utl_file.get_line(fh, buf);  -- raises NO_DATA_FOUND at end of file
            idx := idx + 1;
            ids.extend();
            ids(idx) := to_number(trim(buf));
        end loop;
    exception
        when no_data_found then
            null;  -- end of file reached
    end;
    if utl_file.is_open(fh) then
        utl_file.fclose(fh);
    end if;
    for id_recs in ( select csv.column_value
                          , case when tgt.id is null then 'invalid' else 'valid' end as valid_id
                     from table(ids) csv
                          left join target_table tgt on (csv.column_value = tgt.id)
                   ) loop
        dbms_output.put_line('ID '||id_recs.column_value||' is '||id_recs.valid_id);
    end loop;
end;
Note: I have not tested this code. The principle is sound but the details may need debugging ;)


Statistics rapidly changing within transaction - fixing execution plan

A problem I'm facing (Oracle 11g):
I create a table, let's call it table_xyz, with an index (not unique, no primary key).
I create a package with a procedure that will insert, say, 10 million records monthly. It's not a simple "insert into"; it's thousands of lines of procedural code, and some of it also selects data from table_xyz to calculate what data to insert further on.
For example, somewhere within the procedure there is a query against table_xyz (see the cursor c_previous in the code below).
Now, there is a problem:
When the procedure is run for the first time, all queries on table_xyz will have execution plans based on the moment when there were 0 records in table_xyz.
So all queries will effectively full scan table_xyz, instead of starting to use indexes at some point.
This leads to terrible performance, and in my case the first run actually never finishes...
Now, there are three approaches I thought of:
At some point within the transaction, recalculate statistics. For example, run "analyze table / analyze index compute statistics" after the count of records in table_xyz reaches a power of 10.
This is not possible, however, since ANALYZE commits the transaction and I cannot allow that.
At some point within the transaction, recalculate statistics as above, but do it in an autonomous transaction. This does not work however, since the new statistics will not be visible to the main transaction (I tested that).
Just hint all the cursors that use table_xyz with USE_INDEX. This gets the job done, but is ugly and generally frowned upon in the codebase.
Are there any other ways?
I attach some code. It is just an example, please do not try to optimize it by removing the procedure and so on.
create table table_xyz (idx number(10) /* specifically this is NOT a primary key */
,some_value varchar2(10)
);
create index table_xyz_idx on table_xyz (idx);
declare
cursor idxes is
select level idx
from dual d
connect by level < 100000;
current_val varchar2(10);
function calculate_some_value(p_idx number) return varchar2
is
cursor c_previous is
select t.some_value
from table_xyz t
where t.idx in (round(p_idx / 2, 0), round(p_idx / 3, 0), round(p_idx / 5, 0))
order by t.idx desc
;
x varchar2(100);
begin
open c_previous;
fetch c_previous into x;
close c_previous;
x := nvl(x, 'XYZ');
if mod(p_idx, 2) = 0 then
x := x || '2';
elsif mod(p_idx, 3) = 0 then
x := '3' || x;
elsif mod(p_idx, 5) = 0 then
x := substr(x, 1,1) || '5' || substr(x, 2, 2 + mod(p_idx, 7));
end if;
x := substr(x, 1, 10);
return x;
end calculate_some_value;
begin
for idx in idxes
loop
current_val := calculate_some_value(idx.idx);
insert into table_xyz(idx, some_value) values (idx.idx, current_val);
end loop;
end;
Consider taking a look at the DBMS_STATS package.
Option A: use the DBMS_STATS procedures for manually setting table, column, and index statistics (i.e., SET_TABLE_STATS, SET_COLUMN_STATS, and SET_INDEX_STATS, respectively). Then use DBMS_STATS.LOCK_TABLE_STATS to keep your manually set statistics from being overwritten (e.g., by a DBA gathering schema statistics while your table happens to be empty).
Option B: run your procedure as is and then manually gather stats on the table afterwards. Then, as above, use DBMS_STATS.LOCK_TABLE_STATS to keep them from being overwritten.
Either way, the idea is to set or gather statistics on your table and then lock them in place.
If you want to get fancier, maybe you could automate this. E.g.,
At install time, manually set the statistics and lock them for your first run
In your procedure code, at the end, unlock the statistics, gather them, and re-lock them.
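For illustration, Option A might be wired up roughly like this. This is only a sketch: the statistic values are placeholders you would tune to your expected data volumes, and the table/index names are taken from the example above.
begin
    dbms_stats.set_table_stats(
        ownname => user,
        tabname => 'TABLE_XYZ',
        numrows => 100000,   -- rough expected row count
        numblks => 1000      -- rough expected block count
    );
    dbms_stats.set_index_stats(
        ownname => user,
        indname => 'TABLE_XYZ_IDX',
        numrows => 100000,
        numdist => 100000    -- rough number of distinct keys
    );
    dbms_stats.lock_table_stats(user, 'TABLE_XYZ');
end;
/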

PL/SQL - where to update and where to rollback

I'm pretty new to PL/SQL and I have to work with it a little bit. I had to make some functions which are pretty similar; I have simplified them for this question.
I've got 2 tables (call them TABLE1 and TABLE2 in this example) which have some data. I have to trim and validate the data and insert it into other tables.
TABLE1 -> TABLE3
TABLE2 -> TABLE4
TABLE1 has some orders, while TABLE2 has several positions for each order. As I said, it's simplified, so I haven't posted things like the exception handling or the open/close of the cursors. At the moment it works like this, but I don't think this structure is anywhere near best practice, and I didn't find any PL/SQL code on the web which covered this problem, though it must be something pretty common.
The COMMIT could be at the end of the outer loop, I guess, or maybe even after the whole function.
So could you tell me whether it's okay like this or completely wrong, and what I should/could change and why? I don't want to get used to a bad coding style, so I want to learn it the right way while I'm still a beginner.
Here's the simplified code:
BEGIN
SAVEPOINT SAVE_Stufe_5;
LOOP
SAVEPOINT SAVE_LOOP;
FETCH CURSOR1 INTO RECORD1;
EXIT WHEN CURSOR1%NOTFOUND OR CURSOR1%NOTFOUND IS NULL;
vError := 0;
-- DATAVALIDATION (vError will be the Errorcode)
IF (vError = 0) THEN
retcode := InsertTABLE3(RECORD1);
IF (retcode != DATABASE_OK) THEN
ROLLBACK TO SAVE_LOOP;
END IF;
END IF;
LOOP
FETCH CURSOR2 INTO RECORD2;
EXIT WHEN CURSOR2%NOTFOUND OR CURSOR2%NOTFOUND IS NULL OR vError != 0 OR retcode != DATABASE_OK;
-- DATAVALIDATION (vError will be the Errorcode)
IF (vError = 0) THEN
retcode := InsertTABLE4(RECORD2);
IF (retcode = DATABASE_OK) THEN
UPDATE TABLE2
SET TABLE2.Status = 20
WHERE TABLE2.ID = RECORD2.ID;
ELSE
ROLLBACK TO SAVE_LOOP;
END IF;
END IF;
END LOOP;
IF (vError = 0) THEN
UPDATE TABLE1
SET TABLE1.Status = 20
WHERE TABLE1.ID = RECORD1.ID;
ELSE
ROLLBACK TO SAVE_LOOP;
UPDATE TABLE1
SET TABLE1.Status = vError
WHERE TABLE1.ID = RECORD1.ID;
UPDATE TABLE2
SET TABLE2.Status = vError
WHERE TABLE2.ID = RECORD2.ID;
END IF;
END LOOP;
END;
Small update:
I managed to do the validation set-based, though I don't really know how to get my data into the other table. I tried an INSERT ... SELECT with the TRIM in it, but that only inserts one row. If I used an implicit cursor as suggested I would still have to loop; I wouldn't loop over the cursor but over the SELECT INTO, since an implicit cursor only holds one row.
I could really use a snippet or a link to help me out. Here's a simplified version of my attempt:
INSERT INTO TABLE3
(
val1,
val2,
val3
)
SELECT TRIM(val1),
TRIM(val2),
TRIM(val3)
FROM TABLE1
WHERE STATUS = 10
AND (TRIM(PK1) || TRIM(PK2)) NOT IN (SELECT (TABLE3.PK1 || TABLE3.PK2) FROM TABLE3);
Generally speaking, it is bad style to do things by looping in SQL. A loop like this will be extremely slow compared to a set-based solution. Instead of all these loops, it would be preferable to use a single insert or merge statement to copy all of table 1 into table 3--or perhaps a few statements if your validation is complex and you need some intermediate steps.
Most types of trimming and data validation you would want to do can be handled like this. Almost never do you need nested loops. There are exceptions, but they are rare. Those who are new to SQL tend to use loops because that is what we know from other languages. I was in that category not so long ago. But to really use the power of the language, you have to get beyond that.
Beyond this general point, not much specific help can be given if we don't know anything about the tables or what kind of validation you are doing.
When you should commit is also dependent on your specific design and what you are trying to accomplish.
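For what it's worth, a set-based version of the insert in the update above might look roughly like this. It is only a sketch: the key columns PK1/PK2 and the STATUS filter are taken from the question's attempt, and a NOT EXISTS is used instead of NOT IN on concatenated keys so the comparison stays column-by-column.
INSERT INTO TABLE3 (PK1, PK2, val1, val2, val3)
SELECT TRIM(t1.PK1), TRIM(t1.PK2), TRIM(t1.val1), TRIM(t1.val2), TRIM(t1.val3)
FROM TABLE1 t1
WHERE t1.STATUS = 10
AND NOT EXISTS (SELECT 1
                FROM TABLE3 t3
                WHERE t3.PK1 = TRIM(t1.PK1)
                AND t3.PK2 = TRIM(t1.PK2));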

Strange runtime behaviour of an update in a cursor

I have an old and a new table. In the old table, there is a VARCHAR2 column, called number, with values like 11-38D402342 and 11-38D402342/43. Also some crappy data, but never null.
In the new table, there are two VARCHAR2 columns, number_left and number_right. These two were filled from the number column of the old table:
number_left = nvl(substr(number,1,instr(number,'/',1)-1),number)
number_right = substr(number,1,instr(rtrim(number,'/'),'/',1)-length(substr(number,instr(number,'/'))))||substr(number,decode(instr(number,'/'),0,null,instr(number,'/')+1))
After some other decisions, we now need only one column again. To make sure the copied numbers get set correctly, I decided to use the number from the old table and the conversion that was used, in order to identify matching rows.
I have a mapping but cannot use it, since number_left and number_right can be copied into new rows by an application and those rows must get the old number too. So a number from the old table can possibly be copied into multiple rows in the new table.
I tried this code:
declare
left_num varchar2(255 char) := '';
right_num varchar2(255 char) := '';
pos number := 0;
CURSOR c_number is
select number from old_table;
begin
for cur in c_number loop
left_num := nvl(substr(cur.number,1,instr(cur.number,'/',1)-1),cur.number);
if instr(cur.number,'/') = 0 then
pos := null;
else
pos := instr(cur.number,'/')+1;
end if;
right_num := substr(cur.number,1,instr(rtrim(cur.number,'/'),'/',1)-length(substr(cur.number,instr(cur.number,'/'))))||substr(cur.number, pos);
update new_table n
set n.number = cur.number
where n.number_left = left_num
and (
n.number_right = right_num
or (
n.number_right is null
and right_num is null
)
);
end loop;
end;
In the old table I have 170,000 rows and in the new one 180,000, since there is also some new data. Peanuts.
But here comes the strange part:
Everything seems to work fine, but after the first 14,000 rows it gets terribly slow, perhaps 3 rows per second, and I think it keeps getting slower and slower.
Any idea?
Instead of asking a bunch of random strangers to take a guess, you should learn how to trace the performance of SQL in the database. There are several different ways you can gain insight into the performance of your statement, but you will need some level of DBA-type access.
This is an area which is covered in the official documentation but I suggest you start with Tim Hall's excellent overview. Find it here.
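As a first step, basic SQL tracing for your own session, formatted with tkprof, will show where the time is going. A minimal sketch (the trace file name and location depend on your database configuration):
-- in the session that runs the PL/SQL block
alter session set tracefile_identifier = 'number_copy';
alter session set sql_trace = true;
-- ... run the anonymous block here ...
alter session set sql_trace = false;
-- then format the trace file on the database server:
-- tkprof <tracefile>.trc output.txt sys=no sort=exeela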

Fast Update database with more than 10 million records

I am fairly new to SQL and was wondering if someone can help me.
I got a database that has around 10 million rows.
I need to make a script that finds the records that have some NULL fields and then updates them to a certain value.
The problem with doing a simple update statement is that it will blow out the rollback space.
I was reading around that I need to use BULK COLLECT AND FETCH.
My idea was to fetch 10,000 records at a time, update, commit, and continue fetching.
I tried looking for examples on Google but I have not found anything yet.
Any help?
Thanks!!
This is what I have so far:
DECLARE
CURSOR rec_cur IS
SELECT DATE_ORIGIN
FROM MAIN_TBL WHERE DATE_ORIGIN IS NULL;
TYPE date_tab_t IS TABLE OF DATE;
date_tab date_tab_t;
BEGIN
OPEN rec_cur;
LOOP
FETCH rec_cur BULK COLLECT INTO date_tab LIMIT 1000;
EXIT WHEN date_tab.COUNT() = 0;
FORALL i IN 1 .. date_tab.COUNT
UPDATE MAIN_TBL SET DATE_ORIGIN = '23-JAN-2012'
WHERE DATE_ORIGIN IS NULL;
END LOOP;
CLOSE rec_cur;
END;
I think I see what you're trying to do. There are a number of points I want to make about the differences between the code below and yours.
Your forall loop will not use an index. This is easy to get round by using rowid to update your table.
By committing after each forall you reduce the amount of undo needed; but make it more difficult to rollback if something goes wrong. Though logically your query could be re-started in the middle easily and without detriment to your objective.
rowids are small, collect at least 25k at a time; if not 100k.
You cannot index a null in Oracle. There are plenty of questions on Stack Overflow about this if you need more information. A functional index on something like nvl(date_origin,'x'), as a loose example, would increase the speed at which you select data. It also means you never actually have to use the table itself; you only select from the index.
Your date data-type seems to be a string. I've kept this but it's not wise.
If you can get someone to increase your undo tablespace size then a straight up update will be quicker.
Assuming as per your comments date_origin is a date then the index should be on something like:
nvl(date_origin,to_date('absolute_minimum_date_in_Oracle_as_a_string','yyyymmdd'))
I don't have access to a DB at the moment, but to find out the absolute minimum date in Oracle as a string, run the following query:
select to_date('0001','yyyy') from dual;
It should raise a useful error for you.
Working example in PL/SQL Developer.
create table main_tbl as
select cast( null as date ) as date_origin
from all_objects
;
create index i_main_tbl
on main_tbl ( nvl( date_origin
                 , to_date('0001-01-01','yyyy-mm-dd') )
            )
;
declare
cursor c_rec is
select rowid
from main_tbl
where nvl(date_origin,to_date('0001-01-01','yyyy-mm-dd'))
= to_date('0001-01-01','yyyy-mm-dd')
;
type t__rec is table of rowid index by binary_integer;
t_rec t__rec;
begin
open c_rec;
loop
fetch c_rec bulk collect into t_rec limit 50000;
exit when t_rec.count = 0;
forall i in t_rec.first .. t_rec.last
update main_tbl
set date_origin = to_date('23-JAN-2012','DD-MON-YYYY')
where rowid = t_rec(i)
;
commit ;
end loop;
close c_rec;
end;
/

SQL optimization question (oracle)

Edit: Please answer one of the two questions I ask. I know there are other options that would be better in a different case. These other potential options (partitioning the table, running as one large delete statement without committing in batches, etc.) are NOT options in my case due to things outside my control.
I have several very large tables to delete from. All have the same foreign key that is indexed. I need to delete certain records from all tables.
table source
id --primary_key
import_source --used for choosing the ids to delete
table t1
id --foreign key
--other fields
table t2
id --foreign key
--different other fields
Usually when doing a delete like this, I'll put together a loop to step through all the ids:
declare
my_counter integer := 0;
begin
for cur in (
select id from source where import_source = 'bad.txt'
) loop
begin
delete from source where id = cur.id;
delete from t1 where id = cur.id;
delete from t2 where id = cur.id;
my_counter := my_counter + 1;
if my_counter > 500 then
my_counter := 0;
commit;
end if;
end;
end loop;
commit;
end;
However, in some code I saw elsewhere, it was put together in separate loops, one for each delete.
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
my_count integer := 0;
begin
select id bulk collect into my_import_ids from source where import_source = 'bad.txt';
for h in 1..my_import_ids.count loop
delete from t1 where id = my_import_ids(h);
--do commit check
end loop;
for h in 1..my_import_ids.count loop
delete from t2 where id = my_import_ids(h);
--do commit check
end loop;
end;
--do commit check will be replaced with the same chunk to commit every 500 rows as the above query
So I need one of the following answered:
1) Which of these is better?
2) How can I find out which is better for my particular case? (IE if it depends on how many tables I have, how big they are, etc)
Edit:
I must do this in a loop due to the size of these tables. I will be deleting thousands of records from tables with hundreds of millions of records. This is happening on a system that can't afford to have the tables locked for that long.
EDIT:
NOTE: I am required to commit in batches. The amount of data is too large to do it in one batch. The rollback tables will crash our database.
If there is a way to commit in batches other than looping, I'd be willing to hear it. Otherwise, don't bother saying that I shouldn't use a loop...
Why loop at all?
delete from t1 where id IN (select id from source where import_source = 'bad.txt');
delete from t2 where id IN (select id from source where import_source = 'bad.txt');
delete from source where import_source = 'bad.txt';
That's using standard SQL. I don't know Oracle specifically, but many DBMSes also feature multi-table JOIN-based DELETEs as well that would let you do the whole thing in a single statement.
David,
If you insist on committing, you can use the following code:
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
cursor c is select id from source where import_source = 'bad.txt';
begin
open c;
loop
fetch c bulk collect into my_import_ids limit 500;
forall h in 1..my_import_ids.count
delete from t1 where id = my_import_ids(h);
forall h in 1..my_import_ids.count
delete from t2 where id = my_import_ids(h);
commit;
exit when c%notfound;
end loop;
close c;
end;
This program fetches ids in pieces of 500 rows, deleting and committing each piece. It should be much faster than row-by-row processing, because bulk collect and forall work as a single operation (in a single round-trip to and from the database), thus minimizing the number of context switches. See Bulk Binds, Forall, Bulk Collect for details.
First of all, you shouldn't commit in the loop - it is not efficient (generates lots of redo) and if some error occurs, you can't roll back.
As mentioned in previous answers, you should issue single deletes, or, if you are deleting most of the records, then it could be more optimal to create new tables with remaining rows, drop old ones and rename the new ones to old names.
Something like this:
CREATE TABLE new_table as select * from old_table where <filter only remaining rows>;
index new_table
grant on new table
add constraints on new_table
etc on new_table
drop table old_table
rename new_table to old_table;
See also Ask Tom
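A concrete sketch of that approach for one of the child tables might look like this; the new table and index names are placeholders, and grants and constraints would be recreated the same way:
create table t1_new as
    select *
    from t1
    where id not in (select id from source where import_source = 'bad.txt');
create index t1_new_idx on t1_new (id);
-- recreate grants and constraints on t1_new as needed
drop table t1;
rename t1_new to t1;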
Larry Lustig is right that you don't need a loop. Nonetheless there may be some benefit in doing the delete in smaller chunks. Here PL/SQL bulk binds can improve speed greatly:
declare
type import_ids is table of integer index by pls_integer;
my_import_ids import_ids;
begin
select id bulk collect into my_import_ids from source where import_source = 'bad.txt';
forall h in 1..my_import_ids.count
delete from t1 where id = my_import_ids(h);
forall h in 1..my_import_ids.count
delete from t2 where id = my_import_ids(h);
end;
The way I wrote it, it does it all at once, in which case, yes, the single SQL statement is better. But you can change your loop conditions to break it into chunks. The key points are:
don't commit on every row. If anything, commit only every N rows.
When using chunks of N, don't run the delete in an ordinary loop. Use forall to run the delete as a bulk bind, which is much faster.
The reason, aside from the overhead of commits, is that each time you execute an SQL statement inside PL/SQL code it essentially does a context switch. Bulk binds avoid that.
You may try partitioning anyway to use parallel execution, not just to drop one partition. The Oracle documentation may prove useful in setting this up. Each partition would use its own rollback segment in this case.
If you are doing the delete from the source before the t1/t2 deletes, that suggests you don't have referential integrity constraints (as otherwise you'd get errors saying child records exist).
I'd go for creating the constraint with ON DELETE CASCADE. Then a simple
DECLARE
v_cnt NUMBER := 1;
BEGIN
WHILE v_cnt > 0 LOOP
DELETE FROM source WHERE import_source = 'bad.txt' and rownum < 5000;
v_cnt := SQL%ROWCOUNT;
COMMIT;
END LOOP;
END;
The child records would get deleted automatically.
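For reference, the cascading constraints themselves would be something like this (a sketch; the constraint names are placeholders):
alter table t1 add constraint t1_source_fk
    foreign key (id) references source (id) on delete cascade;
alter table t2 add constraint t2_source_fk
    foreign key (id) references source (id) on delete cascade;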
If you can't have the ON DELETE CASCADE, I'd go with a GLOBAL TEMPORARY TABLE with ON COMMIT DELETE ROWS
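The temporary table would be created once, up front, roughly like this (the name temp matches the block below):
create global temporary table temp (id number) on commit delete rows;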
DECLARE
v_cnt NUMBER := 1;
BEGIN
WHILE v_cnt > 0 LOOP
INSERT INTO temp (id)
SELECT id FROM source WHERE import_source = 'bad.txt' and rownum < 5000;
v_cnt := SQL%ROWCOUNT;
DELETE FROM t1 WHERE id IN (SELECT id FROM temp);
DELETE FROM t2 WHERE id IN (SELECT id FROM temp);
DELETE FROM source WHERE id IN (SELECT id FROM temp);
COMMIT;
END LOOP;
END;
I'd also go for the largest chunk your DBA will allow.
I'd expect each transaction to last for at least a minute. More frequent commits would be a waste.
"This is happening on a system that can't afford to have the tables locked for that long."
Oracle doesn't lock tables, only rows. I'm assuming no-one will be locking the rows you are deleting (or at least not for long). So locking is not an issue.