Hi all, I have the following MERGE SQL script which works fine for a relatively small number of rows (up to about 20,000, I've found). However, sometimes the data I have in Table B can be up to 100,000 rows, and merging that into Table A (which is currently at 60 million rows) takes quite a while to process, which is understandable as it has to compare 100,000 rows against 60 million existing records!
I was just wondering if there is a better way to do this. Or is it possible to work in batches: merge 20,000 rows from Table B into Table A, then delete those merged rows from Table B, then do the next 20,000 rows and so on, until Table B has no rows left?
Script:
MERGE
[Table A] AS [target]
USING
[Table B] AS [source]
ON
([target].recordID = [source].recordID)
WHEN NOT MATCHED BY TARGET
THEN
INSERT ([recordID],[Field 1],[Field 2],[Field 3],[Field 4],[Field 5])
VALUES ([source].[recordID],[source].[Field 1],[source].[Field 2],[source].[Field 3],[source].[Field 4],[source].[Field 5]
);
MERGE is overkill for this since all you want is to INSERT missing values.
Try:
INSERT INTO Table_A
([recordID],[Field 1],[Field 2],[Field 3],[Field 4],[Field 5])
SELECT B.[recordID],
B.[Field 1],B.[Field 2],B.[Field 3],B.[Field 4],B.[Field 5]
FROM Table_B as B
WHERE NOT EXISTS (SELECT 1 FROM Table_A A
WHERE A.RecordID = B.RecordID)
In my experience MERGE can perform worse for simple operations like this. I try to reserve it for when you need varying operations depending on conditions, like an UPSERT.
You can definitely do (SELECT TOP 20000 * FROM B ORDER BY [some_column]) AS [source] in USING and then delete those records after the MERGE. So your pseudo-code will look like:
1. Merge top 20000
2. Delete 20000 records from source table
3. Check @@ROWCOUNT. If it's 0, exit; otherwise go to step 1
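A minimal T-SQL sketch of that loop (the batch size and the ordering column are placeholders; it assumes recordID is the join key, as in your MERGE):
DECLARE @batch int = 20000;
WHILE 1 = 1
BEGIN
    -- Insert the next batch of source rows that are not yet in the target
    MERGE [Table A] AS [target]
    USING (SELECT TOP (@batch) * FROM [Table B] ORDER BY recordID) AS [source]
    ON ([target].recordID = [source].recordID)
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([recordID],[Field 1],[Field 2],[Field 3],[Field 4],[Field 5])
        VALUES ([source].[recordID],[source].[Field 1],[source].[Field 2],[source].[Field 3],[source].[Field 4],[source].[Field 5]);

    -- Remove the batch we just merged from the source table
    DELETE B
    FROM [Table B] AS B
    WHERE B.recordID IN (SELECT TOP (@batch) recordID FROM [Table B] ORDER BY recordID);

    -- @@ROWCOUNT = 0 here means Table B is empty
    IF @@ROWCOUNT = 0 BREAK;
END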
I'm not sure if it runs any faster than merging all the records at the same time.
Also, are you sure you need MERGE? From what I see in your code INSERT INTO ... SELECT should also work for you.
Related
Consider two very large tables: Table A with 20 million rows, and Table B with 10 million rows which has a large overlap with Table A. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating the rows that already exist.
Both tables have the same structure:
- Identifier int
- Date DateTime
- Identifier A
- Identifier B
- General decimal data (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete-and-insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting duplicates) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one shot delete of the different values out of TableA that exist in TableB and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend a bulk-logged technique: load the desired content into a new table and then perform a table swap.
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
    SELECT * FROM dbo.TableB b   -- all of TableB: the new rows and the updated values
    UNION ALL
    SELECT * FROM dbo.TableA a   -- plus the TableA rows that TableB does not replace
    WHERE NOT EXISTS (SELECT * FROM dbo.TableB b WHERE b.id = a.id)
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
Simple or at least Bulk-Logged recovery is highly recommended for this approach. Also, I assume it has to be done outside business hours, since plenty of objects need to be recreated on the new table: indexes, default constraints, the primary key, etc.
A MERGE is probably your best bet if you want to do both inserts and updates.
MERGE #TableA AS Tgt
USING (SELECT * FROM #TableB) Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
INSERT (Identifier, Date, ...)
VALUES (Src.Identifier, Src.Date, ...);
Note that the merge statement must be terminated with a ;
I have a small doubt about query performance. Basically, I have a table with more than 10 million (1 crore) records. sl_id is the primary key of that table. Currently, I am updating the table's status column to true (default false) using the sl_id.
In my program, I will have 200 unique sl_id values in an array. I update the status to true (always) for each sl_id.
My doubt:
Shall I use individual update queries, specifying each sl_id in a WHERE condition, to update the status?
(OR)
Shall I use the IN operator and put all 200 unique sl_id values in one single query?
Which one will be faster?
In rough order of slower to faster:
200 Individual queries, each in their own transaction
200 Individual queries, all in one transaction
1 big query with WHERE ... IN (...) or WHERE EXISTS (SELECT ...)
1 big query with an INNER JOIN over a VALUES clause
(only faster for very big lists of values): COPY value list to a temp table, index it, and JOIN on the temp table.
If you're using hundreds of values, I really suggest joining over a VALUES clause. For many thousands of values, COPY the list to a temp table, index it, then join on it.
An example of joining on a values clause. Given this IN query:
SELECT *
FROM mytable
WHERE somevalue IN (1, 2, 3, 4, 5);
the equivalent with VALUES is:
SELECT *
FROM mytable
INNER JOIN (
VALUES (1), (2), (3), (4), (5)
) vals(v)
ON (somevalue = v);
Note, however, that using VALUES this way is a PostgreSQL extension, whereas IN, or using a temporary table, is SQL standard.
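A rough sketch of the temp-table variant (wanted_values and the file path are illustrative placeholders; mytable and somevalue are from the example above):
-- Load the value list into an indexed temporary table, then join on it
CREATE TEMP TABLE wanted_values (v integer PRIMARY KEY);
COPY wanted_values (v) FROM '/path/to/values.csv';  -- or \copy from the client side
ANALYZE wanted_values;

SELECT m.*
FROM mytable m
INNER JOIN wanted_values w ON (m.somevalue = w.v);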
See this related question:
Postgres NOT IN performance
You should definitely use the WHERE ... IN operator. Making 200 queries is much slower than one bigger one. Remember, every query you send to the database costs extra round-trip time between the application and the DB, and that will crush your performance.
IN is definitely more powerful, but the number of matches to check in the IN list can itself become a performance issue.
So I would suggest using IN, but in batches: if you have 200 records to update, split them into groups of 50 and run 4 UPDATE queries, or something like that.
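A small sketch of that batched form (my_table and the id values are placeholders):
-- One statement per batch of 50 ids instead of 200 single-row updates
UPDATE my_table
SET status = true
WHERE sl_id IN (101, 102, 103, /* ... up to 50 ids ... */ 150);

UPDATE my_table
SET status = true
WHERE sl_id IN (151, 152, 153, /* ... the next 50 ids ... */ 200);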
Hope it helps...!!
I am using an Oracle database, and what I do is:
1. Take 1 record of table A (table A has a column P, and let's say its values are x, y, z).
2. Put that record into table B, C or D according to the value x, y, z (if P = x then put the record into table B, if P = y then put the record into table C, ...).
3. Delete that record of A which we inserted into table B, C or D.
Note: the size of A is about 200 million rows, B is 170 million, C is 20 million and D is 10 million. The size of A is decreasing while the others stay the same (if a parameter of an A record is negative, it is not inserted into B, C or D because it already exists in those tables, so it is just deleted from A). So there is no size change for B, C and D; only the size of A decreases over time.
The problem is that at the beginning everything works nicely, but over time it becomes extremely slow. At first it does roughly 40 insert+delete operations per second, but after a while it takes about 3 seconds to process one insert+delete.
All tables have indexes on the corresponding columns.
Parallel runs exist, but there is no locking.
Table sizes are approximately 60 million records.
What else could be making it slower over time, if there is no locking and no size increase for the tables?
Note: it is not different processes; within the same process, when I click "execute query" it starts very fast but then becomes extremely slow.
Taking 200 million records from a staging table and inserting them into permanent tables in a single transaction is ambitious. It would be useful if you had a scheme for dividing the records from table A into chunks which could be processed discretely.
Without seeing your code it's hard to tell, but I suspect you are attempting this RBAR (row by agonizing row) rather than with a more efficient set-based approach. I think the key here is to decouple the insertions from clearing down table A. Insert all the records, then zap A at your leisure. Something like this:
INSERT ALL
    WHEN p = 'X' THEN INTO b
    WHEN p = 'Y' THEN INTO c
    WHEN p = 'Z' THEN INTO d
SELECT * FROM a;

TRUNCATE TABLE a;
Hi, I have two tables with a million rows in each. I am on Oracle 11g R1.
I am sure many of us must have gone through this situation.
What is the most efficient and fastest way to update from one table to another where the values are DIFFERENT?
E.g.: Table 1 has 4 NUMBER columns with high precision, e.g. 0.2212454215454212.
Table 2 has 6 columns.
I need to update four of Table 2's columns based on a common column in both tables, and only where the values differ.
I have something like this
DECLARE
   TYPE test1_t IS TABLE OF test.score%TYPE INDEX BY PLS_INTEGER;
   TYPE test2_t IS TABLE OF test.id%TYPE INDEX BY PLS_INTEGER;
   TYPE test3_t IS TABLE OF test.Crank%TYPE INDEX BY PLS_INTEGER;
   vscore test1_t;
   vid test2_t;
   vurank test3_t;
BEGIN
SELECT id,score,urank
BULK COLLECT INTO vid,vscore,vurank
FROM test;
FORALL i IN 1 .. vid.COUNT
MERGE INTO final T
USING (SELECT vid (i) AS o_id,
vurank (i) AS o_urank,
vscore (i) AS o_score FROM DUAL) S
ON (S.o_id = T.id)
WHEN MATCHED THEN
UPDATE SET T.crank = S.o_urank
WHERE T.crank <> S.o_urank;
END;
Is the high precision of the numbers slowing it down?
I tried a BULK COLLECT and MERGE combination, but it still takes ~30 minutes in the worst-case scenario where I have to update 1 million rows.
Is there something I could do with ROWID?
Help will be appreciated.
If you want to update all the rows, then just use update:
UPDATE table_1
SET (col1, col2) = (SELECT col1, col2
                    FROM table2
                    WHERE table2.col_a = table_1.col_a
                      AND table2.col_b = table_1.col_b)
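If only some table_1 rows have a match in table2, a plain correlated update like that will set col1/col2 to NULL for the unmatched rows. A common guard (a sketch using the same illustrative names) is to repeat the correlation in a WHERE EXISTS:
UPDATE table_1
SET (col1, col2) = (SELECT col1, col2
                    FROM table2
                    WHERE table2.col_a = table_1.col_a
                      AND table2.col_b = table_1.col_b)
WHERE EXISTS (SELECT 1
              FROM table2
              WHERE table2.col_a = table_1.col_a
                AND table2.col_b = table_1.col_b);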
Bulk collect or any PL/SQL technique will always be slower than a pure SQL technique.
The numeric precision is probably not significant, and rowid is not relevant as there is no common value between the two tables.
When dealing with millions of rows, parallel DML is a game changer. Of course you need to have Enterprise Edition to use parallel, but it's really the only thing which will make much difference.
I recommend you read an article on OraFAQ by rleishman comparing 8 Bulk Update Methods. His key finding is that "the cost of disk reads so far outweighs the context switches that that they are barely noticable (sic)". In other words, unless your data is already cached in memory there really isn't a significant difference between SQL and PL/SQL approaches.
The article does have some neat suggestions on employing parallel. The surprising outcome is that a parallel pipelined function offers the best performance.
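A minimal sketch of the parallel DML route (assumes Enterprise Edition; the table names follow the update example above, and the degree of parallelism is arbitrary):
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(table_1, 8) */ table_1
SET (col1, col2) = (SELECT col1, col2
                    FROM table2
                    WHERE table2.col_a = table_1.col_a
                      AND table2.col_b = table_1.col_b)
WHERE EXISTS (SELECT 1
              FROM table2
              WHERE table2.col_a = table_1.col_a
                AND table2.col_b = table_1.col_b);

COMMIT;  -- parallel DML requires a commit before the modified table can be queried again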
Focusing on the syntax that has been used, and skipping the logic (a pure UPDATE plus a pure INSERT may solve the problem; there are also the merge cost, indexes, and a possible full scan on MERGE to consider):
You should use LIMIT in the BULK COLLECT syntax.
Using a bulk collect with no limit:
- will cause all records to be loaded into memory;
- with no partially committed merges, will create a large redo log that must be applied at the end of the process.
Both will result in low performance.
DECLARE
v_fetchSize NUMBER := 1000; -- based on hardware, design and .... could be scaled
CURSOR a_cur IS
SELECT id,score,urank FROM test;
TYPE myarray IS TABLE OF a_cur%ROWTYPE;
cur_array myarray;
BEGIN
OPEN a_cur;
LOOP
FETCH a_cur BULK COLLECT INTO cur_array LIMIT v_fetchSize;
FORALL i IN 1 .. cur_array.COUNT
-- DO Operation (the MERGE/UPDATE for this batch of rows goes here)
COMMIT;
EXIT WHEN a_cur%NOTFOUND;
END LOOP;
CLOSE a_cur;
END;
Just to be sure: test.id and final.id must be indexed.
With the first SELECT ... FROM test you get too many records from Table 1, and after that you need to compare all of them with the records in Table 2. Try to select only what you need to update. So, there are at least 2 variants:
a) select only changed records:
SELECT source_table.id, source_table.score, source_table.urank
BULK COLLECT INTO vid,vscore,vurank
FROM
test source_table,
final destination_table
where
source_table.id = destination_table.id
and
source_table.crank <> destination_table.crank
;
b) Add a new datetime field to the source table and fill it with the current time in a trigger. While synchronizing, pick only the records changed during the last day. This field needs to be indexed.
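A minimal sketch of that setup, assuming the source table is test and the new column is called last_modified (the column and object names are illustrative):
ALTER TABLE test ADD (last_modified DATE);

CREATE INDEX test_last_modified_ix ON test (last_modified);

CREATE OR REPLACE TRIGGER test_set_last_modified
BEFORE INSERT OR UPDATE ON test
FOR EACH ROW
BEGIN
   :NEW.last_modified := SYSDATE;
END;
/

-- The synchronization then only reads rows changed during the last day:
-- SELECT id, score, urank FROM test WHERE last_modified >= SYSDATE - 1;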
After such a change, in the update phase you don't need to compare other fields, only match the IDs:
FORALL i IN 1 .. vid.COUNT
MERGE INTO FINAL T
USING (
SELECT vid (i) AS o_id,
vurank (i) AS o_urank,
vscore (i) AS o_score FROM DUAL
) S
ON (S.o_id = T.id)
WHEN MATCHED
THEN UPDATE SET T.crank = S.o_urank
If you worry about the size of the undo/redo segments, then variant b) is more useful, because you can take the records from source Table 1 in time slices and commit the changes after updating every slice, e.g. from 00:00 to 01:00, from 01:00 to 02:00, etc.
In this variant the update can be done with a plain SQL statement, without selecting data into collections, while keeping the redo/undo logs at an acceptable size.
I need to delete some unwanted rows from a table based on the result of a SELECT query on other tables:
DELETE /*+ parallels(fe) */ FROM fact_x fe
WHERE fe.key NOT IN(
SELECT DISTINCT di.key
FROM dim_x di
JOIN fact_y fa
ON fa.code = di.code
WHERE fa.code_type = 'ABC'
);
The inner SELECT query returns 77 rows and executes in a few milliseconds, but the outer DELETE query runs forever (for more than 8 hrs). I tried to count how many rows would be deleted by converting the DELETE to a SELECT COUNT(1), and it's around 66.4 million fact_x rows out of 66.8 million total rows. I am not trying to truncate though; I need to retain the remaining rows.
Is there any other way to achieve this? Would deleting via a PL/SQL cursor work better?
Would it not make more sense just to insert the rows you want to keep into another table, then drop the existing table? Even if there are FKs to disable/recreate/etc., it is almost certain to be faster.
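A rough sketch of that approach (fact_x_keep is an illustrative name; indexes, constraints and grants would still need to be recreated on the new table):
-- Keep only the rows whose key is in the wanted set
CREATE TABLE fact_x_keep AS
SELECT fe.*
FROM fact_x fe
WHERE fe.key IN (SELECT di.key
                 FROM dim_x di
                 JOIN fact_y fa ON fa.code = di.code
                 WHERE fa.code_type = 'ABC');

DROP TABLE fact_x;
ALTER TABLE fact_x_keep RENAME TO fact_x;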
Could you add a "toBeDeleted" column? The query to set that wouldn't need that "NOT IN" construction. Deleting the marked rows should also be "simple".
Then again, deleting 99.4% of the 67 million rows will take some time.
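One possible sketch of that marking idea (to_be_deleted is an illustrative column name, and in practice both the marking update and the delete would be run and committed in batches):
ALTER TABLE fact_x ADD (to_be_deleted CHAR(1));

-- Mark everything, then clear the flag on the rows to keep, so no NOT IN is needed
UPDATE fact_x SET to_be_deleted = 'Y';

UPDATE fact_x fe
SET fe.to_be_deleted = 'N'
WHERE fe.key IN (SELECT di.key
                 FROM dim_x di
                 JOIN fact_y fa ON fa.code = di.code
                 WHERE fa.code_type = 'ABC');

DELETE FROM fact_x WHERE to_be_deleted = 'Y';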
Try /*+ parallel(fe) */. No "S".