Assume that we have two tables, named Tb1 and Tb2, and we are going to transfer data from one to the other. Tb1 is the main source of the data and Tb2 is the destination. This transfer operation has 3 parts.
In the first part we are going to validate all rows in Tb1 and check whether they are correct. For example, a national security code must have exactly 10 digits, or a real customer must have a valid birth date; based on these validation rules, 28 different validation methods and error codes have been defined. During validation, every invalid row's description and status will be updated to a new state.
Part 2 fixes the rows' problems, and part 3 transfers them to Tb2.
For instance, this row says that it has 4 different errors:
-- Tb1.desc=6,8,14,16
-- Tb1.sts=0
A correct row of data:
-- Tb1.desc=Null
-- Tb1.sts=1
I have been working on the first part recently and have come up with a solution which works fine but is too slow. Unfortunately, it takes exactly 31 minutes to validate 100,000 rows. In a real situation we are going to validate more than 2 million records, so it is practically useless despite all its functionality.
Let's take a look at my package:
procedure Val_primary is
begin
   open X_CUSTOMER;
   loop
      fetch X_CUSTOMER bulk collect into CUSTOMER_RECORD;
      exit when X_CUSTOMER%notfound;
      for i in CUSTOMER_RECORD.first..CUSTOMER_RECORD.last loop
         Val_CTYP(CUSTOMER_RECORD(i).XCUSTYP);
         Val_BRNCH(CUSTOMER_RECORD(i).XBRNCH);
         -- Rest of the validations ...
         UptDate_Val(CUSTOMER_RECORD(i).Xrownum);
      end loop;
      CUSTOMER_RECORD.delete;
   end loop;
   close X_CUSTOMER;
end Val_primary;
Inside a validation procedure:
procedure Val_CTYP(customer_type IN number) is
begin
   if (customer_type < 1 or customer_type > 3) then
      RW_FINAL_STATUS := 0;
      FINAL_ERR_DSC := Concat(FINAL_ERR_DSC, ERR_INVALID_CTYP);
   end if;
end Val_CTYP;
Inside the update procedure:
procedure UptDate_Val(rownumb IN number) is
begin
   update tb1 set tb1.xstst = RW_FINAL_STATUS, tb1.xdesc = FINAL_ERR_DSC where tb1.xrownum = rownumb;
   RW_FINAL_STATUS := 1;
   FINAL_ERR_DSC := null;
end UptDate_Val;
Is there any way to reduce the execution time?
It must be done in less than 20 minutes for more than 2 million records.
Maybe each validation check could be a case expression within an inline view, and you could concatenate them etc in the enclosing query, giving you a single SQL statement that could drive an update. Something along the lines of:
select xxx, yyy, zzz -- whatever columns you need from xc1customer
, errors -- concatenation of all error codes that apply
, case when errors is not null then 0 else 1 end as status
from ( select xxx, yyy, zzz
, trim(ltrim(val_ctyp||' ') || ltrim(val_abc||' ') || ltrim(val_xyz||' ') || etc...) as errors
from ( select c.xxx, c.yyy, c.zzz
, case when customer_type < 1 or customer_type > 3 then err_invalid_ctyp end as val_ctyp
, case ... end as val_abc
, case ... end as val_xyz
from xc1customer c
)
);
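If you want that query to write the status and error list back to Tb1, a MERGE can drive it. A minimal sketch, assuming the xrownum, xcustyp, xbrnch, xstst and xdesc columns from the question; the specific checks and the error codes ('6', '8') are purely illustrative:
merge into tb1 t
using ( select xrownum
             , rtrim(   case when xcustyp < 1 or xcustyp > 3 then '6,' end
                     || case when xbrnch is null then '8,' end
                     -- ... one case expression per validation rule ...
                   , ',') as errors
        from tb1 ) v
on (t.xrownum = v.xrownum)
when matched then update
   set t.xdesc = v.errors
     , t.xstst = case when v.errors is not null then 0 else 1 end;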
Sticking with the procedural approach, the slow part seems to be the single-row update. There is no advantage to bulk-collecting all 2 million rows into session memory only to apply 2 million individual updates. The quick fix would be to add a LIMIT clause to the bulk collect (and move the exit to the bottom of the loop, where it should be), have your validation procedures set a value in the array instead of updating the table, and batch the updates into one FORALL per loop iteration.
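A rough sketch of that quick fix, assuming the cursor also selects the status and description columns so the record type carries xstst and xdesc fields (and that ERR_INVALID_CTYP is the package constant from the question):
procedure Val_primary is
begin
   open X_CUSTOMER;
   loop
      fetch X_CUSTOMER bulk collect into CUSTOMER_RECORD limit 1000;

      for i in 1 .. CUSTOMER_RECORD.count loop
         -- validations write into the array instead of issuing an update
         if CUSTOMER_RECORD(i).XCUSTYP < 1 or CUSTOMER_RECORD(i).XCUSTYP > 3 then
            CUSTOMER_RECORD(i).xstst := 0;
            CUSTOMER_RECORD(i).xdesc := CUSTOMER_RECORD(i).xdesc || ERR_INVALID_CTYP;
         end if;
         -- rest of the validations ...
      end loop;

      -- one batched update per fetched chunk
      forall i in 1 .. CUSTOMER_RECORD.count
         update tb1
            set xstst = CUSTOMER_RECORD(i).xstst
              , xdesc = CUSTOMER_RECORD(i).xdesc
          where xrownum = CUSTOMER_RECORD(i).Xrownum;

      exit when CUSTOMER_RECORD.count < 1000;   -- exit test at the bottom of the loop
   end loop;
   close X_CUSTOMER;
end Val_primary;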
You can also be a bit freer about passing records and arrays in and out of procedures rather than having everything as a global variable, since passing by reference means there is no performance overhead.
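For example, a validation routine could take the whole record instead of writing to package globals; a hypothetical signature (customer_rec_t stands for whatever record type backs CUSTOMER_RECORD, and NOCOPY asks for pass-by-reference):
procedure Val_CTYP(p_rec in out nocopy customer_rec_t) is
begin
   if p_rec.XCUSTYP < 1 or p_rec.XCUSTYP > 3 then
      p_rec.xstst := 0;
      p_rec.xdesc := p_rec.xdesc || ERR_INVALID_CTYP;
   end if;
end Val_CTYP;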
There are two potential lines of attack.
Specific implementation. Collections are read into session memory, which is usually quite small compared to the global memory allocation. Reading 100,000 longish rows into session memory is a bad idea and can cause performance issues, so breaking the process into smaller chunks (say 1,000 rows) will most likely improve throughput.
General implementation. What is the point of the tripartite process? Updating Table1 with some error flags is an expensive activity. A more efficient approach would be to apply the fixes to the data in the collection and apply that to Table2. You can write a log record if you need to track what changes are made.
Applying these suggestions, you'd end up with a single procedure which looks a bit like this:
procedure one_and_only is
begin
   open x_customer;
   << tab_loop >>
   loop
      fetch x_customer bulk collect into customer_record
         limit 1000;
      exit when customer_record.count() = 0;
      << rec_loop >>
      for i in customer_record.first..customer_record.last loop
         val_and_fix_ctyp(customer_record(i).xcustyp);
         val_and_fix_brnch(customer_record(i).xbrnch);
         -- rest of the validations ...
      end loop rec_loop;
      -- apply the cleaned data to the target table
      forall j in 1..customer_record.count()
         insert into table_2
         values customer_record(j);
   end loop tab_loop;
   close x_customer;
end one_and_only;
Note that this approach requires the customer_record collection to match the projection of the target table. Also, don't use %notfound to test for end of the cursor unless you can guarantee the total number of read records is an exact multiple of the LIMIT number.
Related
Not sure if I am asking the question correctly
I have a table with 2K records. Each record represents a huge batch processing unit and is marked with an initial status of 0 (zero). Every time a batch is completed, the status for that batch record is updated to 10, which marks that row/batch as completed.
What is the fastest way to see if all records have the status 10? The query can return true or false or a count as soon as it encounters the first 0 (zero). Only in the worst-case scenario does it have to go through the entire table before returning.
The fastest would probably be:
select (case when exists (select 1 from t where status <> 10)
then 'incomplete' else 'complete'
end)
from dual;
This can use an index on (status), which should make the lookup even faster.
However, I would recommend changing the "batch job" to write a message when it is complete -- such as updating a "completed time" column in a batches table. Why query for completeness when the batch knows if it is done?
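A sketch of that suggestion, assuming a batches table with a batch_id key and a nullable completed_time column (both names hypothetical):
-- the batch job records its own completion
update batches
   set completed_time = systimestamp
 where batch_id = :batch_id;

-- "is everything done?" then becomes a check for any unfinished batch
select case when exists (select 1 from batches where completed_time is null)
            then 'incomplete' else 'complete'
       end as batch_state
  from dual;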
I would do something like this:
select null as col
from your_table
where status = 0 AND ROWNUM = 1
;
(or select whatever you want from the first row encountered that has status = 0; the way I wrote the query, it will only tell you if there are incomplete batches, but it won't tell you anything else.)
The ROWNUM condition forces execution to stop as soon as the first status = 0 is encountered - either from the table itself, or from an index if you have an index on the status column (as you might, if this kind of query is common in your business).
The query will return either a single row - with NULL in the only column col, unless you choose to select something from the record - or no rows if all batches are complete.
I am trying to help a colleague with a SAS script she is working with. I am a programmer, so I understand the logic; however, I don't know the syntax in SAS. Basically this is what she is trying to do.
We have:
An array of procedure dates (proc_date[i])
An array of procedures (proc[i])
Each record in our data can have up to 20 procedures and 20 dates, so i = 20.
Each procedure has an associated code; let's just say there are 100 different codes, where codes 1 to 10 are ProcedureA, 11 to 20 are ProcedureB, etc.
We need to loop through each procedure and assign it the correct ProcedureCategory if it falls into one of the 100 codes (i.e. it enters one of the if statements). When this is true, we then need to loop through the corresponding procedure dates in that row: if the dates are different we add the 'Weighted Values' together; otherwise we just take the greater of the two values.
I hope that helps and I hope this makes sense. I could write this in another language (i.e. C/C++/C#/VB), however I'm at a loss with SAS as I'm just not that familiar with the syntax, and the logic doesn't seem to work the way it does in other OO languages.
Thanks in advance for any assistance.
Kind Regards.
You don't want to do 100 if statements, one way or the other.
The answer to the core of your question is that you need the do loop outside of the if statement.
data want;
set have;
array proc[20];
array proc_date[20];
do _i = 1 to dim(proc); *this would be 20;
if proc[_i] = 53 then ... ;
else if proc[_i] = 54 then ...;
end;
run;
Now, what you're trying to do with proc_date sounds like you need to do something with proc_date[_i] in that same loop. The loop is just an iterator - the only thing changing is _i, which is used as an array index. It's welcome to be the array index for either array (or to do any other thing). This is where this differs from common OOP practice, since this isn't an array class; you're not using the individual object to iterate it. This is a procedural style (C would do it the same way, even).
However, the if/else bits there would be unwieldy and long. In SAS you have a lot of ways of dealing with that. You might have another array of 100 values proc can take, and then inside that do loop have another do loop iterating over that array (do _j = 1 to 100;) - or the other way around (iterate through the 100, inside that iterate through the 20) if that makes more sense (if you want to at one time have all of the values).
data want;
set have;
array proc[20];
array proc_date[20];
array proc_val[100]; *needs to be populated in the `have` dataset, or in another;
do _i = 1 to dim(proc_val);
do _j = 1 to dim(proc);
if proc[_j] = proc_val[_i] then ...; *this statement executes 100*20 times;
end;
end;
run;
You could also have a user-defined format, which is really just a one-to-one mapping of values (start value -> label value). Map your 100 values to the 10 procedures they correspond to, or whatever. Then all 100 if statements become:
proc_value[_i] = put(proc[_i],PROCFMT.);
and then proc_value[_i] stores the procedure (or whatever) which you then can evaluate more simply hopefully.
You also may want to look into hash tables; both for a similar concept as the format above, but also for doing the storage. Hash tables are a common idea in programming that perhaps you've already come across, and the way SAS implements them is actually OOP-like. If you're trying to do some sort of summarization based on the procedure values, you could easily do that in a hash table and probably much more efficiently than you could in IF statements.
Here are some of the statements mentioned:
* codes 1 to 10 are ProcedureA, 11 to 20 are ProcedureB;
proc format;
   value codes 1-10  = 'A'
               11-20 = 'B';
run;
Then, in the data step:
Procedure(i) = put(proc(i), codes.);
Another way to recode ranges is with the between syntax:
if 1 <= value <= 10 then variable = <new-value>;
hth
We have an UPDATE in production (below) which processes more or less the same number of rows each day but with drastically different runtimes: some days the query finishes in 2 minutes, while on other days it might take 20 minutes. Per my analysis of the AWR data, the culprit was I/O wait time; whenever the query slows down, the cache hit ratio goes down due to increased physical reads.
The outline of the query itself is below:
update /*+ nologging parallel ( a 12 ) */ huge_table1 a
set col = 1
where col1 > 'A'
and col2 < 'B'
and exists ( select /*+ parallel ( b 12 ) */ 1
from huge_table2 b
where b.col3 = a.col3 );
huge_table1 and huge_table2 contain about 100 million rows, and the execution statistics are below:
Day EXECUTIONS ELAPSED_TIME_S_1EXEC CPU_TIME_S_1EXEC IOWAIT_S_1EXEC ROWS_PROCESSED_1EXEC BUFFER_GETS_1EXEC DISK_READS_1EXEC DIRECT_WRITES_1EXEC
------- ----------- -------------------- ---------------- -------------- -------------------- ----------------- ----------------- -------------------
1 1 133.055 69.110 23.325 2178085.000 3430367.000 90522.000 42561.000
2 1 123.580 65.020 20.282 2179404.000 3341566.000 86614.000 38925.000
3 1 1212.762 72.800 1105.084 1982658.000 3131695.000 268260.000 38446.000
4 1 1085.773 59.600 996.642 1965309.000 2954480.000 200612.000 26790.000
As seen above, the LIO has remained almost the same in each case, although the elapsed time increased on the 3rd and 4th days due to increased IO waits, which, if my assumption is correct, were caused by an increase in PIO. Per Tom Kyte, tuning should focus on reducing LIO rather than PIO, and as LIO reduces, so will PIO. But in this case the LIO has been constant throughout, while the PIO has varied significantly.
My question - What tuning strategy could be adopted here?
I would:
-> Check the execution plan for both cases.
-> Check IO subsystem health.
-> Monitor the server at the time this runs and make sure the IO subsystem is not saturated by another process.
Also, what kind of I/O is leading the read events? Sequential, parallel, scattered? Here you can get a lead on the strategy the plan is following to perform the update.
Is the buffer cache being resized? A small and cold buffer cache which gets resized during this big execution could lead to blocks needing to be read into the buffer cache in order to update them.
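One way to check for that, assuming you can query the V$ views: V$SGA_RESIZE_OPS records automatic grow/shrink operations on SGA components, so resize activity on the buffer cache around the slow runs would show up there (the exact component name can be confirmed in V$SGA_DYNAMIC_COMPONENTS).
select component, oper_type, initial_size, final_size, start_time, end_time
  from v$sga_resize_ops
 where component = 'DEFAULT buffer cache'
 order by start_time desc;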
Some ideas based on the data you showed... please let us know what came out!
Recently I had a problem with a huge update. I found a good solution based on a parallel pipelined function which decreased the update time significantly.
My proposition is not exactly what you asked, but maybe this approach could give you a short and stable runtime from day to day:
Create a type:
CREATE type test_num_arr AS TABLE of INTEGER;
/
Make the updating pipelined function (you can of course adjust it):
create or replace FUNCTION test_parallel_update (
test_cur IN SYS_REFCURSOR
)
RETURN test_num_arr
PARALLEL_ENABLE (PARTITION test_cur BY ANY)
PIPELINED
IS
PRAGMA AUTONOMOUS_TRANSACTION;
test_rec HUGE_TABLE1%ROWTYPE;
TYPE num_tab_t IS TABLE OF NUMBER(38);
pk_tab NUM_TAB_T;
cnt INTEGER := 0;
BEGIN
LOOP
FETCH test_cur BULK COLLECT INTO pk_tab LIMIT 1000;
EXIT WHEN pk_tab.COUNT() = 0;
FORALL i IN pk_tab.FIRST .. pk_tab.LAST
UPDATE HUGE_TABLE1 a
set col = 1
where col1 > 'A'
and col2 < 'B'
and exists ( select 1
from huge_table2 b
where b.col3 = a.col3
)
AND ID = pk_tab(i);
cnt := cnt + pk_tab.COUNT;
END LOOP;
CLOSE test_cur;
COMMIT;
PIPE ROW(cnt);
RETURN;
END;
Lastly, run your update:
SELECT * FROM TABLE(test_parallel_update(CURSOR(SELECT id FROM huge_table1)));
Approach based on:
http://www.orafaq.com/node/2450
To answer your question about the strategy: you must of course focus on LIO. Row access in the buffer cache is much faster than disk operations.
With respect to your problem, we can see that on the first days the execution time is good and on the later days it is not. If you use indexes on the columns in b.col3 = a.col3 and there are a lot of insertions into the tables, the statistics may be out of date, so your query can no longer use the indexes efficiently and reads more blocks, which is why we see an increase in disk reads in your execution statistics.
In this case it would be necessary to run:
EXEC DBMS_STATS.gather_table_stats(schema, table_name);
You should gather statistics periodically with the scheduler, depending on how much and how often your data changes.
During the day you could schedule just an index statistics gathering with:
DBMS_STATS.GATHER_INDEX_STATS
And in the evening:
DBMS_STATS.GATHER_TABLE_STATS
which gathers table and column (and index) statistics.
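A sketch of such a scheduled job using DBMS_SCHEDULER; the job name, schema, table and repeat interval are placeholders to adjust:
begin
   dbms_scheduler.create_job(
      job_name        => 'GATHER_HUGE_TABLE1_STATS',
      job_type        => 'PLSQL_BLOCK',
      job_action      => 'begin dbms_stats.gather_table_stats(''MYSCHEMA'', ''HUGE_TABLE1''); end;',
      start_date      => systimestamp,
      repeat_interval => 'FREQ=DAILY;BYHOUR=22',
      enabled         => true);
end;
/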
In addition to these possibilities, there is also the option of changing the data model. On large volumes, partitioned tables are a good approach to reducing IO.
Hoping that can help.
As bubooal says, we can't help you without the execution plan and the table structure of the 2 tables. Could you give us these 2 pieces of information?
Maybe partitioning could help you to reduce I/O.
Another possibility is to keep the two tables in your cache. It seems that the number of buffer gets is the same, so when the query slows down it's because your tables are no longer in the buffer cache. For that you could use db_keep_cache_size and pin your tables (or the relevant partitions) in this cache.
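A sketch of that KEEP-pool idea; the cache size and partition name are illustrative, and the keep cache must be sized to hold the segments you pin:
alter system set db_keep_cache_size = 4G scope=both;          -- requires an spfile
alter table huge_table2 storage (buffer_pool keep);
-- or pin only the partition the query actually touches:
alter table huge_table1 modify partition p_current storage (buffer_pool keep);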
I'm creating something similar to an advertising system.
I would like to show, for example, 5 ads (5 records) from a given database table.
So I execute something like:
SELECT * FROM mytable
ORDER BY view_counter ASC
LIMIT 5
OK, it works.
But how can I contextually update the "view_counter" (that is, a counter of the number of times the ad has been shown), ideally with a single SQL statement?
And, if I'm not asking too much, is it possible to save the "position" in which my records are returned?
For example, my SQL returns:
- record F (pos. 1)
- record X (pos. 2)
- record Z (pos. 3)
And save in a field "Average_Position" the... average of those positions?
Thanks in advance.
Regards
How can I contextually update the "view_counter" (that is, a counter of the number of times shown), maybe with a single SQL statement?
That's usually something handled by analytic/rank/windowing functions, which MySQL doesn't currently support. But you can use the following query to get the output you want:
SELECT *,
@rownum := @rownum + 1 AS rank
FROM mytable
JOIN (SELECT @rownum := 0) r
ORDER BY view_counter ASC
LIMIT 5
You'd get output like:
description | rank
--------------------------
record F | 1
record X | 2
record Z | 3
If I don't ask too much, is it possible to save the "position" in which my records are returned?
I don't recommend doing this, because it means the data needs to be updated every time there's a change. On other databases I'd recommend using a view so the calculation is made only when the view is used, but MySQL doesn't support variable use in views.
There is an alternative means of getting the rank value using a subselect - this link is for SQL Server, but there's nothing in the solution that is SQL Server specific.
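For what it's worth, a sketch of that subselect approach in MySQL terms (ties share a value, and it rescans the table per row, so it is only suitable for small tables):
SELECT t.*,
       (SELECT COUNT(*) + 1
          FROM mytable t2
         WHERE t2.view_counter < t.view_counter) AS ranking
  FROM mytable t
 ORDER BY t.view_counter ASC
 LIMIT 5;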
You could do something like this, but it is pretty ugly and I would not recommend it (see below for my actual suggestion on how to handle this issue).
Create a dummy_field tinyint field, a sum_position int field and an average_position decimal field, and run the following few statements within the same connection (I am usually very much against MySQL stored procedures, but in this case it could be useful to store this in an SP).
SET @updated_ads := '';
SET @current_position := 0;
UPDATE mytable SET view_counter = view_counter + 1,
dummy_field = (SELECT @updated_ads := CONCAT(@updated_ads,id,"\t",ad_text,"\r\n"))*0, /* I added *0 for saving it as tinyint in dummy_field */
sum_position = sum_position + (@current_position := @current_position + 1),
average_position = sum_position / view_counter
ORDER BY view_counter DESC
LIMIT 5;
SELECT @updated_ads;
Then parse the result string in your code using the delimiters you used (I used \r\n as a row delimiter and \t as the field delimiter).
What I actually suggest you do is:
Query for selected ads.
Write a log file with the selected ads and positions.
Write a job to process the log file and update the view_counter, average_position and sum_position fields in batch (a sketch follows below).
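A sketch of step 3, assuming the log has been loaded into a hypothetical ad_log table with one row per impression (ad_id, position):
-- roll the logged impressions into the counters
UPDATE mytable m
JOIN (SELECT ad_id,
             COUNT(*)      AS views,
             SUM(position) AS sum_pos
        FROM ad_log
       GROUP BY ad_id) l
  ON m.id = l.ad_id
SET m.view_counter = m.view_counter + l.views,
    m.sum_position = m.sum_position + l.sum_pos;

-- recompute the stored average in a second pass to avoid relying on
-- assignment order within a multi-table UPDATE
UPDATE mytable
SET average_position = sum_position / view_counter
WHERE view_counter > 0;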
Thanks for your answer. I solved it by simply executing the same SELECT query (with exactly the same WHERE, ORDER BY and LIMIT clauses) but, instead of a SELECT, I used an UPDATE.
Yes, there's an "overhead", but it's a simple solution.
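In other words, something along these lines (MySQL allows ORDER BY and LIMIT on a single-table UPDATE); note it is not guaranteed to touch exactly the same 5 rows as the earlier SELECT if the counters change in between:
UPDATE mytable
   SET view_counter = view_counter + 1
 ORDER BY view_counter ASC
 LIMIT 5;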
I need to select data from one table and insert it into another table. Currently the SQL looks something like this:
INSERT INTO A (x, y, z)
SELECT x, y, z
FROM B b
WHERE ...
However, the SELECT is huge, resulting in over 2 million rows, and we think it is taking up too much memory. Informix, the db in this case, runs out of virtual memory when the query is run.
How would I go about selecting and inserting a set of rows (say 2,000) at a time, given that I don't think there are any row ids, etc.?
You can do SELECT FIRST n * FROM table, where n is the number of rows you want, say 2000. Also, in the WHERE clause, add an embedded select that checks the table you are inserting into for rows that already exist, so that the next time the statement is run, it will not include already-inserted data.
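A sketch of that idea, assuming the rows can be matched on their x column (or whatever the natural key is); note the caveat in the next answer that FIRST may only be usable when the data is returned to the client:
INSERT INTO A (x, y, z)
SELECT FIRST 2000 b.x, b.y, b.z
  FROM B b
 WHERE ...                                            -- the original filter conditions
   AND NOT EXISTS (SELECT 1 FROM A a WHERE a.x = b.x);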
I assume that you have some script that this is executed from? You can just loop and limit as long as you order the values returned from the nested select. Here is some pseudocode:
total = SELECT COUNT(x) FROM B WHERE ...
while (total > 0)
INSERT INTO A (x, y, z) SELECT x, y, z FROM B b WHERE ... ORDER BY x LIMIT 2000
total = total - 2000
end
I'm almost certain that IDS only lets you use the FIRST clause where the data is returned to the client [1], and that is something you want to avoid if at all possible.
You say you get an out of memory error (rather than, say, a long transaction aborted error)? Have you looked at the configuration of your server to ensure it has a reasonable amount of memory?
It depends in part on how big your data set is and what the constraints are - why you are doing the load across tables. But I would normally aim to determine a way of partitioning the data into loadable subsets and run those sequentially in a loop. For example, if the sequence numbers are between 1 and 10,000,000, I might run the loop ten times, with a condition on the sequence number of "AND seqnum >= 0 AND seqnum < 1000000", then "AND seqnum >= 1000000 AND seqnum < 2000000", and so on. Preferably in a language with the ability to substitute the range via variables.
This is a bit of a nuisance, and you want to err on the conservative side in terms of range size (more smaller partitions rather than fewer bigger ones), to reduce the risk of running out of memory.
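A sketch of one such chunk, with seqnum standing in for whatever partitioning column you choose:
INSERT INTO A (x, y, z)
SELECT x, y, z
  FROM B b
 WHERE ...                                            -- the original filter conditions
   AND b.seqnum >= 0 AND b.seqnum < 1000000;
-- commit, then repeat with 1000000 <= seqnum < 2000000, and so on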
[1] Over-simplifying slightly. A stored procedure would have to count as 'the client', for example, and the communication cost in a stored procedure is (a lot) less than the cost of going to the genuine client.