I have the code below, which takes more than 200 minutes to execute. Could you help me improve its performance?
This code is triggered at user command, for example when users select all records (about 300K) in the first-level output for the next-level drill-down report:
The dynamic internal table <OI_TABLE> has around 300K records, with a component called BOX set to 'X' for the lines selected by the user.
The program reads the selected lines from the dynamic internal table (<OI_TABLE>) and compares them to another standard internal table (GT_BSIS), which also has around 300K records; the clearing key value must be the same in both internal tables (1:N cardinality).
It then inserts the matching records into a third standard internal table (GT_L2_DISP) for further processing/display.
Code:
LOOP AT <oi_table> ASSIGNING <oi_line>.
  ASSIGN COMPONENT 'BOX' OF STRUCTURE <oi_line> TO <oi_field>.
  IF <oi_field> = 'X'.
    ASSIGN COMPONENT 'CLEARING_KEY' OF STRUCTURE <oi_line> TO <oi_field>.
    LOOP AT gt_bsis WHERE clearing_key = <oi_field>.
      MOVE-CORRESPONDING gt_bsis TO gt_l2_disp.
      APPEND gt_l2_disp.
    ENDLOOP.
  ENDIF.
ENDLOOP.
Here, <oi_table> contains the data for the first-level ALV output and GT_BSIS contains the data for the second-level ALV output.
My understanding:
If we could fill the standard internal table GT_BSIS and mark a column in it (say FLAG) with 'X' while the user is selecting rows in the first-level ALV output, it should help performance, since one LOOP ... ENDLOOP could be avoided.
An indexed internal table may also be an option. Please suggest a way to improve the performance.
NB: our SAP system is ECC, ABAP 7.31, so please do not propose inline declarations.
I might be wrong, but reversing the loops might help you somewhat. In theory this should reduce the quadratic growth in iterations as the two tables grow.
I think the reversal might help because you don't appear to need <OI_TABLE> for anything beyond filtering, yet you are looping over it anyway. That means you loop over more rows than necessary, AND you may sometimes trigger a full scan of GT_BSIS when no corresponding key exists (I'm not entirely sure whether such loops make use of indexing).
SORT <oi_table> BY ('CLEARING_KEY'). "Only if the table isn't already a sorted table
LOOP AT gt_bsis ASSIGNING <ls_bsis>. "ASSIGNING! Don't make a copy of each row on every iteration if you can help it
  READ TABLE <oi_table> ASSIGNING <oi_line>
       WITH KEY ('CLEARING_KEY') = <ls_bsis>-clearing_key BINARY SEARCH.
  CHECK sy-subrc = 0.
  ASSIGN COMPONENT 'BOX' OF STRUCTURE <oi_line> TO <oi_field>.
  IF <oi_field> = 'X'.
    MOVE-CORRESPONDING <ls_bsis> TO gt_l2_disp.
    APPEND gt_l2_disp.
  ENDIF.
ENDLOOP.
Thank you so much, Zero, for your response and valued feedback. I have implemented it, and here is the code I used. The runtime dropped drastically, from 210 minutes to well below 5 minutes, for about 340K records in both internal tables:
DATA: git_ibsis TYPE HASHED TABLE OF modbsis_layout1
                WITH UNIQUE KEY clearing_key belnr buzei,
      gwa_ibsis LIKE LINE OF git_ibsis.

FIELD-SYMBOLS: <gfs_ibsis> LIKE LINE OF git_ibsis.

DATA: lv_clkey(12) TYPE c,
      lv_box(3)    TYPE c.

lv_clkey = 'CLEARING_KEY'.
lv_box   = 'BOX'.

SORT: ibsis      BY clearing_key belnr buzei,
      <oi_table> BY ('BOX') ('CLEARING_KEY').

LOOP AT ibsis.
  MOVE-CORRESPONDING ibsis TO gwa_ibsis.
  INSERT gwa_ibsis INTO TABLE git_ibsis.
  CLEAR: gwa_ibsis, ibsis.
ENDLOOP.

SORT git_ibsis BY clearing_key belnr buzei.

LOOP AT git_ibsis ASSIGNING <gfs_ibsis>.
  READ TABLE <oi_table> ASSIGNING <oi_line>
       WITH KEY (lv_box)   = gc_x
                (lv_clkey) = <gfs_ibsis>-clearing_key
       BINARY SEARCH.
  IF sy-subrc = 0.
    MOVE-CORRESPONDING <gfs_ibsis> TO zvxl100.
    APPEND zvxl100.
  ENDIF.
ENDLOOP.
Note: a hashed internal table was used.
The runtime is reduced from 210 minutes to well below 5 minutes.
Related
I am trying to perform an analysis on some data; however, it needs to run much faster.
These are the steps that I follow. Please recommend any solutions that you think might speed up the processing time.
ts is a datetime object and the "time" column in Data is in epoch time. Note that Data might include up to 500,000 records.
Data = pd.DataFrame(RawData) # (RawData is a list of lists)
Data.loc[:, 'time'] = pd.to_datetime(Data.loc[:, 'time'], unit='s')
I find the index of the first row in Data which has a time object greater than my ts as follows:
StartIndex = Data.loc[:, 'time'].searchsorted(ts)
StartIndex is usually very low and is found within a few records from the beginning; however, I have no idea whether the size of Data affects finding this index.
Now we get to the hard part: within Data there is a column called "PDNumber". I have two other variables called Max_1 and Min_1. I have to find the index of the row in which the "PDNumber" value goes above Max_1 or falls below Min_1, whichever happens first; the search starts at StartIndex and runs through the end of the dataframe, and the index found is called SecondStartIndex. We then have another two variables called Max_2 and Min_2. Again, we have to search the "PDNumber" column, from SecondStartIndex onward, for the index of the first row that goes above Max_2 or falls below Min_2; this index is called ThirdIndex.
Right now, I use a for loop that steps through the data, incrementing the index by 1 each time, to find SecondStartIndex, and once it is reached I use a while loop (with its own counter) over the rest of the dataframe to find ThirdIndex.
Any suggestions on speeding up the processing time?
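For reference, here is a minimal vectorized sketch of the search described above. It assumes RawData, ts, Max_1/Min_1/Max_2/Min_2 exist as in the question, and the helper name first_crossing is only illustrative; boolean masks plus np.argmax replace both loops:

import numpy as np
import pandas as pd

def first_crossing(values, start, low, high):
    """Position of the first value at or after `start` that is above
    `high` or below `low`; returns None if no such value exists."""
    window = values[start:]
    mask = (window > high) | (window < low)
    if not mask.any():
        return None
    return start + int(np.argmax(mask))   # argmax of a boolean mask = first True

Data = pd.DataFrame(RawData)                            # RawData as in the question
Data['time'] = pd.to_datetime(Data['time'], unit='s')

pd_values = Data['PDNumber'].to_numpy()
StartIndex = int(Data['time'].searchsorted(ts))

SecondStartIndex = first_crossing(pd_values, StartIndex, Min_1, Max_1)
if SecondStartIndex is not None:
    ThirdIndex = first_crossing(pd_values, SecondStartIndex, Min_2, Max_2)

The crossing test is evaluated on the whole tail of the column at once, so the Python-level loop disappears; whether this beats an early-exit loop depends on how close the crossings are to StartIndex, but for 500,000 rows the vectorized version is usually the safer bet.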
I created a custom table to store reasons for modifying an object. I'm doing a POC with BOPF in order to learn, even if it may not make sense to use it here.
This is what the persistent structure looks like (simplified):
define type zobject_modifications {
  object_id : zobject_id;
  @EndUserText.label : 'Modification Number'
  mod_num : abap.numc(4);
  reason_id : zreason_id;
  @EndUserText.label : 'Modification Comments'
  comments : abap.string(256);
}
The alternative key consists of object_id + mod_num. The mod_num should be an auto-generated counter, always adding 1 to the last modification number for the object_id.
I created a before_save determination to generate it, checking the MAX mod_num from the database BOs and from the currently instantiated BOs and increasing it by 1.
But when I try to create 2 BOs for the same object in a single transaction, I get an error because of the duplicated alternative key, since the field MOD_NUM is still initial and before_save is only triggered later. I tried to change the determination to "After Modify" but I still get the same problem.
The question is: when and how should I generate the next MOD_NUM so that I can create multiple nodes for the same object ID safely?
This must be a very common problem, so there must be a best-practice way to do it, but I was not able to find it.
Use a number range to produce sequential identifiers. It ensures that you won't get duplicates even with ongoing, concurrent transactions.
If you insist on determining the next identifier on your own, use the io_read input parameter of the determination to retrieve the biggest mod_num:
The database contains only those nodes that have already been committed, but your new nodes are not committed yet, so you won't get them from there.
io_read, in contrast, accesses BOPF's temporary buffer, which also contains the nodes you just created, and therefore sees the more current data.
Attempts at changing a data type in Access have failed due to the error:
"There isn't enough disk space or memory". Over 385,325 records exist in the table.
Attempts at the following links, among other StackOverFlow threads, have failed:
Can't change data type on MS Access 2007
Microsoft Access can't change the datatype. There isn't enough disk space or memory
The intention is to change the data type of one column from "Text" to "Number". The aforementioned links cannot accommodate that either, due to the table size or the desired data type.
Breaking up the table may not be an option due to the number of records.
Help on this would be appreciated.
I cannot tell for sure about MS Access, but in MS SQL one can avoid a table rebuild (which requires lots of time and space) by appending a new column that allows NULL values at the rightmost end of the table, updating that column using normal update queries, and AFAIK even dropping the old column and renaming the new one. So in the end only the position of that column has changed.
As for your 385,325 records (I'd expect that number to be correct): even if the table had 1000 columns with 500 Unicode characters each, we'd end up with approximately 385,325 * 1000 * 500 * 2 ≈ 385 GB of data. That should nowadays not exceed what is available, so:
If it's disk space you're running out of, how about moving the data to some other computer, changing the DB there, and moving it back?
If the DB seems to be corrupted (and the standard tools didn't help; make a copy first), it will most probably help to create a new table or database using table-creation queries (better: create manually and append).
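As a rough sketch of that add-update-drop route, driven from Python via pyodbc (the table and column names here are made up, and it assumes the CLng() conversion is accepted by the Access ODBC driver and that the text values are actually numeric):

import pyodbc

# Adjust the path to your .accdb / .mdb file
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\mydb.accdb;"
)
cur = conn.cursor()

# 1) Add a new nullable numeric column instead of converting in place
cur.execute("ALTER TABLE MyTable ADD COLUMN AmountNum LONG")

# 2) Copy/convert the existing text values into the new column
cur.execute("UPDATE MyTable SET AmountNum = CLng(AmountText) "
            "WHERE AmountText IS NOT NULL")

# 3) Drop the old text column once the copied values have been verified.
#    (Renaming AmountNum back to the old name has to be done in the Access
#    UI or via DAO, since Access SQL has no column rename.)
cur.execute("ALTER TABLE MyTable DROP COLUMN AmountText")

conn.commit()
conn.close()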
I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes up? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (not running) the query below, changing <column_name> to the field of your interest,
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement for each, dry-run it (within the loop, per column), and read totalBytesProcessed, which as you already know is the size of the respective column.
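For example, a sketch of that per-column dry-run loop with the google-cloud-bigquery Python client (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()   # uses the default project and credentials

table_ref = "my_project.my_dataset.my_table"   # placeholder
table = client.get_table(table_ref)            # Tables.get: fetches the schema

dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# One dry run per top-level column; totalBytesProcessed of the dry run
# is the storage attributed to that column. Nested fields would need
# their full parent.child paths.
for field in table.schema:
    sql = "SELECT {} FROM `{}`".format(field.name, table_ref)
    job = client.query(sql, job_config=dry_run_config)
    print("{}: {} bytes".format(field.name, job.total_bytes_processed))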
I don't think this is exposed in any of the metadata.
However, you may be able to get good approximations based on your needs. The number of rows is provided, so for some of the data types you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as STRING, you could get the average length by querying e.g. the first 1000 rows, and use this in your storage calculations.
Redis has a SCAN command that may be used to iterate keys matching a pattern, etc.
Redis SCAN doc
You start by giving a cursor value of 0; each call returns a new cursor value, which you pass into the next SCAN call. A returned value of 0 indicates that the iteration is finished. Supposedly no server or client state is needed (except for the cursor value).
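For illustration, with the redis-py client (the key pattern and count are arbitrary), the cursor loop looks like this:

import redis

r = redis.Redis(host="localhost", port=6379)

cursor = 0
keys = []
while True:
    # each SCAN call returns the next cursor and a batch of matching keys
    cursor, batch = r.scan(cursor=cursor, match="user:*", count=100)
    keys.extend(batch)
    if cursor == 0:   # a returned cursor of 0 means the iteration is complete
        break

print("found {} keys (duplicates possible)".format(len(keys)))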
I'm wondering how Redis implements the scanning, algorithm-wise?
You may find the answer in the Redis dict.c source file. I will quote part of it below.
Iterating works the following way:
1) Initially you call the function using a cursor (v) value of 0.
2) The function performs one step of the iteration, and returns the new cursor value you must use in the next call.
3) When the returned cursor is 0, the iteration is complete.
The function guarantees all elements present in the dictionary get returned between the start and the end of the iteration. However it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as the first argument and the dictionary entry 'de' as the second argument.
How it works
The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.
This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always a power of two in size, and they use chaining, so the position of an element in a given table is obtained by computing the bitwise AND between Hash(key) and SIZE-1 (where SIZE-1 is always the mask that is equivalent to taking the remainder of the division of the key's hash by SIZE).
For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.
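A toy Python sketch of that reversed-bit increment (not Redis' actual C code; the 8-bit word size and the size-16 table are just for the example):

def reverse_bits(v, nbits=8):
    """Reverse the lowest nbits bits of v."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (v & 1)
        v >>= 1
    return r

def next_cursor(v, size, nbits=8):
    """One cursor step for a power-of-two table of `size` buckets:
    set the bits above the mask, reverse, add one, reverse back."""
    mask = size - 1
    v |= ~mask & ((1 << nbits) - 1)   # so the +1 carries into the masked bits
    v = reverse_bits(v, nbits)
    v = (v + 1) & ((1 << nbits) - 1)
    return reverse_bits(v, nbits)

# Walk all buckets of a size-16 table: the high-order bits change first
cursor = 0
visited = []
while True:
    visited.append(cursor & 15)       # bucket actually scanned
    cursor = next_cursor(cursor, 16)
    if cursor == 0:
        break
print(visited)   # [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]

Every bucket of the size-16 table is visited exactly once, and because the low bits of a visited bucket stay fixed while the high bits count up, buckets already covered remain covered if the table grows mid-iteration.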
What happens if the table changes in size?
If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).
If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.
By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.
Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.
Wait... You have TWO tables during rehashing!
Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.
Limitations
This iterator is completely stateless, and this is a huge advantage, including no additional memory used.
The disadvantages resulting from this design are:
It is possible we return elements more than once. However this is usually easy to deal with in the application level.
The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.