Multiple small deletes - SQL

I have a PL/SQL script that loops over records of people (~4 million) and executes multiple updates (~100) and a single delete statement (all of these updates and the delete are on different tables). The problem I am facing is that the one delete statement takes about half the run time by itself. I understand that when you execute a delete statement it needs to maintain the indexes, but this still seems excessive. I am currently testing this script with one thread using dbms_parallel_execute, but I plan to multithread it.
I am executing a query similar to the following:
DELETE FROM table1 t1
 WHERE (t1.key1, t1.key2) IN (SELECT t2.key1, t2.key2
                                FROM table2 t2
                               WHERE t2.parm1 = 1234
                                 AND t2.parm2 = 5678);
Some relevant facts:
Table2 (~30 million records) is ~10 times larger than table1 (~3 million records).
There is a primary key on table1(key1, key2)
There is a primary key on table2(key1, key2)
There is an index on table2(parm1, parm2)
I have disabled the foreign key constraint on table1(key1, key2) that references table2(key1, key2)
There are no other constraints on table1, but many more constraints on table2.
All triggers on table1 have been disabled
The explain plan for this query comes up with a cost lower than that of many of my update statements (but I know this doesn't account for much).
Explain plan output (condensed to the informative columns):
ID  PARENT  OPERATION                    COST  CARDINALITY  BYTES  CPU_COST  IO_COST
--  ------  ---------------------------  ----  -----------  -----  --------  -------
 0          DELETE STATEMENT (ALL_ROWS)     5            1     36     38043        5
 1       0  DELETE
 2       1  NESTED LOOPS                    5            1     36     38043        5
 3       2  TABLE ACCESS BY INDEX ROWID     4            1     25     29022        4
 4       3  INDEX RANGE SCAN                3            1            21564        3
 5       2  INDEX UNIQUE SCAN (UNIQUE)      1            1     11      9021        1
I was wondering if there is any way to make this delete go faster. I tried a bulk delete, but it didn't seem to improve the run time. If there were a way to execute all the deletes and then update the index afterwards, I suspect it would run faster. Obviously a CREATE TABLE AS SELECT is out of the picture, since I am looping over records (and running through multiple conditions) from another table to do the delete.

Each delete call runs a query against table2's ~30 million records, which definitely degrades performance and may also create locking issues, which in turn slow the query down further.
I suggest moving the inline query that selects data from table2 out of the DELETE. Table2 should drive the delete and supply the delete-candidate records; it can run as a cursor, or you can place the data in a temporary table. Execute the delete in chunks of 500 or 1,000 rows, each followed by a commit, as sketched below. The chunk size can be tuned based on results.
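A minimal PL/SQL sketch of that chunked approach (the cursor and the LIMIT of 1000 are illustrative starting points, not tuned values):
DECLARE
  CURSOR c_candidates IS
    SELECT t2.key1, t2.key2
      FROM table2 t2
     WHERE t2.parm1 = 1234
       AND t2.parm2 = 5678;
  TYPE t_key1_tab IS TABLE OF table2.key1%TYPE;
  TYPE t_key2_tab IS TABLE OF table2.key2%TYPE;
  l_key1 t_key1_tab;
  l_key2 t_key2_tab;
BEGIN
  OPEN c_candidates;
  LOOP
    -- fetch the next chunk of delete candidates
    FETCH c_candidates BULK COLLECT INTO l_key1, l_key2 LIMIT 1000;
    EXIT WHEN l_key1.COUNT = 0;
    -- delete the whole chunk in one round trip
    FORALL i IN 1 .. l_key1.COUNT
      DELETE FROM table1 t1
       WHERE t1.key1 = l_key1(i)
         AND t1.key2 = l_key2(i);
    COMMIT;  -- commit per chunk, as suggested above
  END LOOP;
  CLOSE c_candidates;
END;
/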
Index maintenance during a delete cannot be skipped while the index is live. If this process runs during non-working hours, you could mark the index unusable and rebuild it afterwards.
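That trick only works for a non-unique index (Oracle will not let you skip an unusable unique or primary-key index, since uniqueness must still be enforced). With a hypothetical index name, the pattern is roughly:
ALTER INDEX table1_some_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;
-- ... run the delete workload ...
ALTER INDEX table1_some_idx REBUILD;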

If the outer query is "small" and the inner query is "big", a WHERE EXISTS can be quite efficient.
Try an EXISTS clause instead of the IN clause, then check the explain plan and the performance:
DELETE FROM table1 t1
 WHERE EXISTS (SELECT 1
                 FROM table2 t2
                WHERE t2.parm1 = 1234
                  AND t2.parm2 = 5678
                  AND t2.key1 = t1.key1
                  AND t2.key2 = t1.key2);

Related

Multiple Joins - Performance

I have three tables:
Table A (approx. 500 000 records)
ID  ID_B  Text
--  ----  ---------
1   10    bla
2   10    blabla
3   30    blablabla
Table B (approx. 100 000 records)
ID  Text
--  -----
10  blab
20  blaba
30  blabb
Table C (approx. 600 000 records)
ID  ID_A
--  ----
1   1
2   1
3   2
Now I want to join these three tables:
SELECT A.Text
FROM A
JOIN B ON B.ID = A.ID_B
JOIN C ON C.ID_A = A.ID
I have created a clustered primary key index (ID) and a non-clustered index (ID_B) on table A.
According to the execution plan, at the beginning the clustered index is used to join A and C.
Afterwards the result set is sorted on column ID_B and used then in a merge join with B.
[Execution plan screenshot]
The sort operation is the most expensive one. (about 40% of total costs)
Is there any way to optimize this query in terms of overall performance?
You haven't mentioned whether you have any indexes on table B. Perhaps create one on the identifier, 'including' any columns you want to output.
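For example (SQL Server syntax, since the question mentions clustered indexes; the index name is made up, and the INCLUDE is only needed if you actually output columns from B):
CREATE NONCLUSTERED INDEX IX_B_ID
    ON [B] ([ID])
    INCLUDE ([Text]);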
Now, from what I gather in the comments, you're really joining to tables B and C primarily as filters, not because you need to output data from those tables. If that's really the case, you should use EXISTS. You may shy away from subqueries, but the engine knows what to do with EXISTS: you'll see in the plan that it runs a 'semi join'.
select a.text
from a
where exists (select 0 from b where b.id = a.id_b)
and exists (select 0 from c where c.id_a = a.id)

Merge two versions of database tables with conflicting keys

I have been asked to merge 2 Access databases. They are conflicting versions of the same file.
A database was emailed to somebody. (I know.) Somebody added records to the 'main' copy while somebody else added records to their copy. I want to add the new records from the 'unauthorised' copy into the main version, before utterly destroying all other copies.
Unfortunately, the database has several related tables. As would naturally happen when records are added, records in different versions have conflicting primary keys. These conflicting keys are also used as foreign keys in the new records. A foreign key reference to ID x means different things in the 2 versions.
Is there any hope? I thought of maybe importing it all into Excel and using formulas to update the primary and foreign keys.
Is there any way to fix this programmatically?
EDIT: Here is a picture showing the full relationships. Tables teachers, tests, and test_results have been changed; the others are the same in both.
In the main database, add a Long field named [oldID] to each table into which you need to append data. Then create linked tables pointing to the relevant tables in the "other" database. Since the table names are the same, the linked tables will have a '1' appended to their names.
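In Access SQL, adding the new field looks like this (repeat per table):
ALTER TABLE teachers ADD COLUMN oldID LONG;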
For this example, we have
[teachers]
ID  teacher   oldID
--  --------  -----
 1  TeacherA
 2  TeacherB
 3  TeacherX
[teachers1]
ID  teacher
--  --------
 1  TeacherA
 2  TeacherB
 3  TeacherY
[tests]
ID  test_name       teacher  oldID
--  --------------  -------  -----
 1  TeacherA_Test1        1
 2  TeacherA_Test2        1
 3  TeacherB_Test1        2
 4  TeacherX_Test1        3
[tests1]
ID  test_name       teacher
--  --------------  -------
 1  TeacherA_Test1        1
 2  TeacherA_Test2        1
 3  TeacherB_Test1        2
 4  TeacherY_Test1        3
 5  TeacherY_Test2        3
Make a note of where the tables diverge. In this case the [teachers] tables diverge after ID=2. So, insert the new rows from [teachers1] into [teachers], putting [teachers1].[ID] into [teachers].[oldID] so we can map old IDs to new ones:
INSERT INTO [teachers] ([teacher], [oldID])
SELECT [teacher], [ID] FROM [teachers1] WHERE [ID]>2
So now we have
[teachers]
ID  teacher   oldID
--  --------  -----
 1  TeacherA
 2  TeacherB
 3  TeacherX
 4  TeacherY      3
Now when we append the new rows from [tests1] into [tests] we can use an INNER JOIN on [teachers].[oldID] to adjust the foreign key values that get inserted:
INSERT INTO [tests] ([test_name], [teacher], [oldID])
SELECT [tests1].[test_name], [teachers].[ID], [tests1].[ID]
FROM [tests1] INNER JOIN [teachers] ON [tests1].[teacher]=[teachers].[oldID]
giving us
[tests]
ID  test_name       teacher  oldID
--  --------------  -------  -----
 1  TeacherA_Test1        1
 2  TeacherA_Test2        1
 3  TeacherB_Test1        2
 4  TeacherX_Test1        3
 5  TeacherY_Test1        4      4
 6  TeacherY_Test2        4      5
Notice how the [teacher] foreign key has been mapped from the value 3 in [tests1] to 4 in [tests], reflecting the new [teachers].[ID] value for 'TeacherY'.
You can then repeat the process for child tables of [tests].
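For instance, if [test_results1] had a [test] foreign key and a [score] column (these column names are assumptions for illustration, not from the original schema), the same pattern would be:
INSERT INTO [test_results] ([test], [score], [oldID])
SELECT [tests].[ID], [test_results1].[score], [test_results1].[ID]
FROM [test_results1] INNER JOIN [tests]
    ON [test_results1].[test]=[tests].[oldID]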
(Once the cleanup is complete you can remove the table links and drop the [oldID] columns.)
Is there any way to fix this programmatically?
No. This must be done by a human capable of reading and understanding the data and taking decisions.
Create a query with an inner join between table one and table two, another query with an outer join from table one to table two, and a third with an outer join from table two to table one, as sketched below.
Now you can study the differences and decide which version of similar records to keep, and which records are completely new and should be kept as well, some with a new primary key.
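Using the [teachers] tables from the other answer as stand-ins (a sketch; run each query separately in Access):
Records present in both copies, side by side:
SELECT t.ID, t.teacher, t1.teacher AS other_teacher
FROM teachers AS t INNER JOIN teachers1 AS t1 ON t.ID = t1.ID;
Records only in the main copy:
SELECT t.ID, t.teacher
FROM teachers AS t LEFT JOIN teachers1 AS t1 ON t.ID = t1.ID
WHERE t1.ID IS NULL;
Records only in the other copy:
SELECT t1.ID, t1.teacher
FROM teachers1 AS t1 LEFT JOIN teachers AS t ON t1.ID = t.ID
WHERE t.ID IS NULL;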

Removing duplicate records using another table's oid

Table 1              Table 2
------------------   ---------
oid                  oid (J)
sequence             trip_id
stop
trip_update_id (J)

(J) = join
Table 1 and Table 2 are updated every 30 seconds from an API, simultaneously.
At the end of each day Table 1 is ~98% duplicate data, because each feed includes both the new data generated in the last 30 seconds and all data generated by previous feeds from the same day. (The oid is generated automatically upon insertion, so every oid is unique.)
Table 2 has all unique records, so my question is: what SQL will reduce Table 1 to unique records for each trip_id in Table 2?
I'm not quite sure I understand what the problem is, but here are a few suggestions.
To remove rows from table1 with trip_update_id values not found in table2:
delete from table1
where trip_update_id not in (select trip_id from table2 where trip_id is not null)
(The is not null part is very important if trip_id is allowed to have NULL values!!!)
To remove duplicate trip_update_id rows from table1, keeping the one with the highest oid:
delete from table1
where oid not in (select max(oid) from table1
group by trip_update_id)

What about performance of a 1 to 1 or 0 relation when joined into a single view

I have 2 tables: the first table holds raw data used to calculate results that are stored in the second table (tbl_cdn), and sometimes the calculation process may use some records of tbl_fbk, like a self join. My question is: should I use 2 tables in a 1-to-1-or-0 relation, or merge them into a single table? A sketch of the two-table option follows the column lists below.
tbl_fbk
  fbkrec_id
  name
  value_a
  value_b
  reference_result_value (foreign key to fbkrec_id on tbl_cdn)
tbl_cdn
  fbkrec_id
  result_value
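A minimal DDL sketch of the 1-to-1-or-0 option (Oracle-style syntax; the column types are assumptions). Making tbl_cdn's primary key also a foreign key to tbl_fbk guarantees at most one result row per tbl_fbk row:
CREATE TABLE tbl_fbk (
  fbkrec_id              NUMBER PRIMARY KEY,
  name                   VARCHAR2(100),
  value_a                NUMBER,
  value_b                NUMBER,
  reference_result_value NUMBER
);

CREATE TABLE tbl_cdn (
  fbkrec_id    NUMBER PRIMARY KEY REFERENCES tbl_fbk (fbkrec_id),  -- shared key: 1 to 1 or 0
  result_value NUMBER
);

-- the back-reference described above: a calculation may reuse a stored result
ALTER TABLE tbl_fbk ADD CONSTRAINT fbk_ref_result_fk
  FOREIGN KEY (reference_result_value) REFERENCES tbl_cdn (fbkrec_id);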

Understanding this SQL Query

I'm new to Oracle databases; can someone help me understand this query? It eliminates duplicates from a table.
DELETE FROM table_name A
 WHERE ROWID > (SELECT MIN(rowid)
                  FROM table_name B
                 WHERE A.key_values = B.key_values);
Any suggestions for improving the query are welcome.
Edit: No, this is not homework. What I didn't understand is: what is the subquery doing, and what does ROWID > applied to the subquery do?
This is the Source of the query
Dissecting the actual mechanics:
DELETE FROM table_name A
This is a standard query to delete records from the table named "table_name". Here, it has been aliased as "A" to be referred to in the subquery.
WHERE ROWID >
This places a condition on the deletion, such that for each row encountered, the ROWID must meet the condition of being greater than...
(SELECT min(rowid)
FROM table_name B
WHERE A.key_values = B.key_values)
This is a subquery that is correlated to the main DELETE statement. It uses the value A.key_values from the outside query. So given a record from the DELETE statement, it will run this subquery to find the minimum rowid (internal record id) for all records in the same table (aliased as B now) that bear the same key_values value.
So, to put it together, say you had these rows
rowid | key_values
===== | ==========
  1   | A
  2   | B
  3   | B
  4   | C
  5   | A
  6   | B
The subquery works out the min(rowid) for each record, based on ALL records with the same key_values:
rowid | key_values | min(rowid)
===== | ========== | ==========
  1   | A          |  1
  2   | B          |  2
  3   | B          |  2  **
  4   | C          |  4
  5   | A          |  1  **
  6   | B          |  2  **
For the records marked with **, the condition
WHERE ROWID > { subquery }
becomes true, and they are deleted.
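A self-contained way to see this in action (Oracle; the table name and data are illustrative):
CREATE TABLE demo_dups (key_values VARCHAR2(1));

INSERT INTO demo_dups VALUES ('A');
INSERT INTO demo_dups VALUES ('B');
INSERT INTO demo_dups VALUES ('B');
INSERT INTO demo_dups VALUES ('C');
INSERT INTO demo_dups VALUES ('A');
INSERT INTO demo_dups VALUES ('B');

DELETE FROM demo_dups a
 WHERE ROWID > (SELECT MIN(rowid)
                  FROM demo_dups b
                 WHERE a.key_values = b.key_values);
-- 3 rows deleted; one 'A', one 'B' and one 'C' remain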
EDIT - additional info
This answer previously stated that ROWID increased by insertion order. That is very untrue. The truth is that rowid is just a file.block.slot-on-block - a physical address.
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:53140678334596
Tom's Followup December 1, 2008 - 6am Central time zone:
it is quite possible that D will be "first" in the table - as it took over A's place.
If rowids always "grew", then space would never be reused (that would be an implication of rowids growing always - we would never be able to reuse old space, as the rowid is just a file.block.slot-on-block - a physical address)
ROWID is a pseudo-column that uniquely identifies each row in a table. (It is not a simple number; as the edit above explains, it encodes a physical address.)
This query finds all rows in A where A.key_values = B.key_values and deletes all of them except the one with the minimal rowid. It's just a way to arbitrarily choose one duplicate to preserve.
Quote AskTom:
A rowid is assigned to a row upon insert and is immutable (never changing)... unless the row
is deleted and re-inserted (meaning it is another row, not the same row!)
The query you provided is relying on that rowid, and deletes all the rows with a rowid value higher than the minimum one on a per key_values basis. Hence, any duplicates are removed.
The subquery you provided is a correlated subquery, because there's a relationship between the table reference in the subquery, and one outside of the subquery.
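Since the question also invites improvements: an equivalent analytic-function formulation that is often worth comparing on large tables (a sketch, not benchmarked here):
DELETE FROM table_name
 WHERE rowid IN (SELECT rid
                   FROM (SELECT rowid AS rid,
                                ROW_NUMBER() OVER (PARTITION BY key_values
                                                       ORDER BY rowid) AS rn
                           FROM table_name)
                  WHERE rn > 1);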
Note that ROWID is not actually a number that increments with each insert (see the edit above); it is a unique physical row address, so it does not reliably reflect insertion order. What your delete statement does is remove all duplicates, keeping only one row per group of duplicates: the one with the lowest ROWID. Make sense?