COLLECT and MODIFY optimization for large itab - abap

I have 2 parts of code. Both of them process 1.5 million records, but the first part takes 20 minutes and the 2nd part takes 13.5 hours!
Here is the 1st part:
loop at it_bkpf.
  select * from bseg into corresponding fields of itab
         where bukrs = it_bkpf-bukrs
           and belnr = it_bkpf-belnr
           and gjahr = it_bkpf-gjahr
           and hkont in s_hkont.
    if sy-subrc = 0.
      itab-budat = it_bkpf-budat.
      clear *bseg.
      select single * from *bseg
             where bukrs = itab-bukrs
               and belnr = itab-belnr
               and gjahr = itab-gjahr
               and hkont = wtax.
      if sy-subrc <> 0.
        itab-budat = '99991231'.
      endif.
    endif.
    append itab.
  endselect.
endloop.
endloop.
The 2nd part, which takes 13.5 hours, is the following:
sort itab by belnr.

loop at itab where hkont(2) = '73'.
  move-corresponding itab to itab2.
  collect itab2.
endloop.

loop at itab2.
  lv_5per_total = con_5per_tax * itab2-dmbtr.
  lv_5per_upper = lv_5per_total + '0.02'.
  lv_5per_lower = lv_5per_total - '0.02'.
  read table itab with key belnr = itab2-belnr
                           hkont = wtax.
  if sy-subrc = 0.
    if itab-dmbtr between lv_5per_lower and lv_5per_upper.
      itab-budat = '99991231'.
      modify itab transporting budat where belnr = itab2-belnr.
    endif.
  endif.
endloop.
Does anyone have an idea on how to fix the 2nd part?
Some extra things:
it_bkpf has 1.5 million records.
After the 1st process ITAB has 1.5 million records.
In the 2nd part, in the 1st loop, I sum the amounts per belnr for the accounts that start with 73.
In the 2nd loop I compare the sum per belnr with the amount of the belnr/account and do what the code says.
Thanks
ADDITIONAL INFORMATION:
First of all, the initial code already existed and I only added the new part. ITAB existed and ITAB2 is mine. So the declaration of the tables was:
DATA: BEGIN OF itab OCCURS 0,
        bukrs  LIKE bseg-bukrs,
        hkont  LIKE bseg-hkont,
        belnr  LIKE bkpf-belnr,
        gjahr  LIKE bkpf-gjahr,
        dmbtr  LIKE bseg-dmbtr,
        shkzg  LIKE bseg-shkzg,
        budat  LIKE bkpf-budat,
        zzcode LIKE bseg-zzcode,
      END OF itab.

DATA: BEGIN OF itab2 OCCURS 0,
        belnr LIKE bkpf-belnr,
        dmbtr LIKE bseg-dmbtr,
      END OF itab2.
After your suggestion I made the following changes:
types: begin of ty_belnr_sums,
         belnr like bkpf-belnr,
         dmbtr like bseg-dmbtr,
       end of ty_belnr_sums.

data: git_belnr_sums type sorted table of ty_belnr_sums
                     with unique key belnr.
data: gwa_belnr_sums type ty_belnr_sums.
data: lv_5per_upper type p decimals 2,
      lv_5per_lower type p decimals 2,
      lv_5per_total type p decimals 2.

sort itab by belnr hkont.

loop at itab where hkont(2) = '73'.
  move-corresponding itab to gwa_belnr_sums.
  collect gwa_belnr_sums into git_belnr_sums.
endloop.

loop at git_belnr_sums into gwa_belnr_sums.
  lv_5per_total = con_5per_tax * gwa_belnr_sums-dmbtr.
  lv_5per_upper = lv_5per_total + '0.02'.
  lv_5per_lower = lv_5per_total - '0.02'.
  read table itab with key belnr = gwa_belnr_sums-belnr
                           hkont = wtax
                  binary search.
  if sy-subrc = 0.
    if itab-dmbtr between lv_5per_lower and lv_5per_upper.
      itab-budat = '99991231'.
      modify itab transporting budat
             where belnr = gwa_belnr_sums-belnr.
    endif.
  endif.
endloop.
Now I am running it in the background for 1.5 million records and it is still running after 1 hour.

I guess people have already given you all the hints to optimize your code. Essentially, as you want to optimize only the 2nd part, the only issues seem to be with the itab operations (LOOP, READ, MODIFY), which you may improve by using an index or a hashed table on the internal table itab. If these concepts are unclear, I recommend the ABAP documentation: Row-Based Administration Costs of Internal Tables and Internal Tables - Performance Notes.
Two more things:
you should measure the performance of your code with transaction SAT, and you will see on which operation the time is spent.
Not related to the performance but to the data model: the key of a financial document is made of 3 fields, company code (BUKRS), document number (BELNR) and year (GJAHR). So you shouldn't group ("collect") by document number only, or you risk mixing documents of distinct company codes or years. Instead, keep all 3 fields, as in the sketch below.
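For illustration, the grouping keyed on the full document key could look roughly like this (the names ty_doc_sum, git_doc_sums and gwa_doc_sum are invented; the fields come from the question's itab):
TYPES: BEGIN OF ty_doc_sum,
         bukrs TYPE bseg-bukrs,
         belnr TYPE bkpf-belnr,
         gjahr TYPE bkpf-gjahr,
         dmbtr TYPE bseg-dmbtr,
       END OF ty_doc_sum.
DATA: git_doc_sums TYPE SORTED TABLE OF ty_doc_sum
                   WITH UNIQUE KEY bukrs belnr gjahr,
      gwa_doc_sum  TYPE ty_doc_sum.

LOOP AT itab WHERE hkont(2) = '73'.
  MOVE-CORRESPONDING itab TO gwa_doc_sum.
  COLLECT gwa_doc_sum INTO git_doc_sums.  " dmbtr is summed per full document key
ENDLOOP.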
Now, if you accept to adapt a few little things in the 1st part, then right after reading the lines of BSEG of one document you may simply loop over these lines, and no index, binary search or hashed table is needed at all. The only thing to do is to first store the lines of BSEG in a temporary itab_temp, so that it contains only the lines of one document at a time, loop only over these lines, and then add them to itab, roughly as sketched below.
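A rough sketch of that idea (itab_temp and lv_sum_73 are invented names):
DATA: itab_temp LIKE itab OCCURS 0 WITH HEADER LINE,
      lv_sum_73 LIKE bseg-dmbtr.

LOOP AT it_bkpf.
  REFRESH itab_temp.
  SELECT * FROM bseg INTO CORRESPONDING FIELDS OF TABLE itab_temp
         WHERE bukrs = it_bkpf-bukrs
           AND belnr = it_bkpf-belnr
           AND gjahr = it_bkpf-gjahr
           AND hkont IN s_hkont.
  " itab_temp now holds only the lines of the current document, so the
  " '73' sum and the wtax check only loop over a handful of lines
  CLEAR lv_sum_73.
  LOOP AT itab_temp WHERE hkont(2) = '73'.
    lv_sum_73 = lv_sum_73 + itab_temp-dmbtr.
  ENDLOOP.
  " ... compare lv_sum_73 with the wtax line of this document and set budat here ...
  APPEND LINES OF itab_temp TO itab.
ENDLOOP.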
If you want to update the BUDAT component later, not in the 1st part, then store the document key + BUDAT in a distinct internal table (say itab_budat), define a non-unique secondary index on itab with the document key fields (say by_doc_key), and you'll only need this straightforward code:
DATA wa_budat LIKE LINE OF itab.

LOOP AT itab_budat ASSIGNING FIELD-SYMBOL(<line_budat>).
  wa_budat-budat = <line_budat>-budat.  " the BUDAT value to transport into itab
  MODIFY itab FROM wa_budat USING KEY by_doc_key TRANSPORTING budat
         WHERE bukrs = <line_budat>-bukrs
           AND belnr = <line_budat>-belnr
           AND gjahr = <line_budat>-gjahr.
ENDLOOP.
Note that I propose a (non-unique) secondary index, not a primary one, because as you don't want to change the 1st part (too much), a secondary index avoids the potential issues that a primary index might cause.
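For completeness, a declaration of itab with such a secondary key could look like the sketch below; the line type is only reconstructed from the fields shown in the question (note that the header line is gone, so the 1st part then needs explicit work areas):
TYPES: BEGIN OF ty_itab_line,
         bukrs  TYPE bseg-bukrs,
         hkont  TYPE bseg-hkont,
         belnr  TYPE bkpf-belnr,
         gjahr  TYPE bkpf-gjahr,
         dmbtr  TYPE bseg-dmbtr,
         shkzg  TYPE bseg-shkzg,
         budat  TYPE bkpf-budat,
         zzcode TYPE bseg-zzcode,
       END OF ty_itab_line.
DATA itab TYPE STANDARD TABLE OF ty_itab_line
          WITH NON-UNIQUE DEFAULT KEY
          WITH NON-UNIQUE SORTED KEY by_doc_key COMPONENTS bukrs belnr gjahr.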
If you need more information on the terms, syntaxes, and so on, you'll find them in the ABAP documentation (inline or web).

There are several performance issues here:
1st part:
Never do SELECT in a LOOP. I said never. Instead use a SELECT ... FOR ALL ENTRIES or a SELECT ... JOIN (BSEG is a cluster table in ECC, so no JOIN is possible there, but it is a transparent table in S/4HANA, so there you can JOIN it with BKPF). As far as I can see from the coding, you select G/L items, so check whether using BSIS/BSAS would be better than BSEG. You can also check table BSET (it looks like you are interested in "tax" lines). A sketch follows below.
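For the 1st part, a FOR ALL ENTRIES variant on BSIS could look roughly like the sketch below; the field list is taken from the question's itab, and whether BSIS/BSAS actually cover your accounts is an assumption you need to verify:
IF it_bkpf[] IS NOT INITIAL.  " FOR ALL ENTRIES with an empty driver table would select everything
  SELECT bukrs belnr gjahr buzei hkont dmbtr shkzg budat
         FROM bsis
         INTO CORRESPONDING FIELDS OF TABLE itab
         FOR ALL ENTRIES IN it_bkpf
         WHERE bukrs = it_bkpf-bukrs
           AND belnr = it_bkpf-belnr
           AND gjahr = it_bkpf-gjahr
           AND hkont IN s_hkont.
  " buzei is selected only so that FOR ALL ENTRIES does not drop lines that
  " differ merely in the item number (FAE removes duplicate result rows)
ENDIF.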
2nd part:
If you do a LOOP with a WHERE condition, you get the best performance if the internal table is TYPE SORTED (with a proper key). However, your WHERE condition uses an offset, so I am not sure the SORTED table will help, but it is worth a try.
However, the real issue is here: COLLECT itab2. COLLECT on a TYPE STANDARD table only performs well while the table is filled by COLLECT alone (SAP Help is explicit about this); as soon as other statements change the table, every COLLECT degrades to a linear scan. The internal table should be TYPE HASHED instead (there are technical reasons behind this, I am not going into details).
After that you LOOP over one internal table (itab2) and READ TABLE the other (itab). To get the best performance for the single-line read, itab has to be TYPE HASHED (or at least TYPE SORTED with the proper key).
Instead of using internal tables with HEADER LINES, use LOOP ... ASSIGNING FIELD-SYMBOL(<...>); then you don't need the MODIFY anymore, which also slightly improves the performance.
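Putting the 2nd-part points together, a rough sketch (the names ty_sum, lt_sums and ls_sum are invented) could be:
TYPES: BEGIN OF ty_sum,
         belnr TYPE bkpf-belnr,
         dmbtr TYPE bseg-dmbtr,
       END OF ty_sum.
DATA: lt_sums TYPE HASHED TABLE OF ty_sum WITH UNIQUE KEY belnr,
      ls_sum  TYPE ty_sum.

LOOP AT itab ASSIGNING FIELD-SYMBOL(<ls_pos>) WHERE hkont(2) = '73'.
  ls_sum-belnr = <ls_pos>-belnr.
  ls_sum-dmbtr = <ls_pos>-dmbtr.
  COLLECT ls_sum INTO lt_sums.  " constant-time key lookup on the HASHED table
ENDLOOP.
The later single-line read on itab benefits in the same way once itab itself is SORTED (or HASHED) by belnr and hkont.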

As said by JozsefSzikszai, it would probably be better to use BSAS/BSIS as they have indexes. If you really want to use BSEG:
BSEG is a cluster table. AFAIK, it is preferable for those tables to perform a SELECT using the key fields to fill an internal table, and then work on the internal table for the additional filtering.
select * from bseg into corresponding fields of table itab
       for all entries in it_bkpf
       where bukrs = it_bkpf-bukrs
         and belnr = it_bkpf-belnr
         and gjahr = it_bkpf-gjahr.

delete itab where hkont not in s_hkont.
In the second part, you're performing a READ TABLE on a standard table (itab).
Using BINARY SEARCH would greatly cut your runtime: a READ TABLE on a standard table without it is a full table scan (the lines are read until the value is found). Sorting itab by belnr and hkont (instead of only belnr) and adding BINARY SEARCH turns it into a dichotomic (binary) search.
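In code, that advice is roughly the following sketch (variables as in the question):
sort itab by belnr hkont.

loop at itab2.
  read table itab with key belnr = itab2-belnr
                           hkont = wtax
       binary search.
  if sy-subrc = 0.
    " ... amount comparison and modify as in the original 2nd part ...
  endif.
endloop.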

The first is faster because it is just linearly bad
The second one is quadratic.
The first one looks quadratic too, but bukrs, belnr and gjahr are almost the complete primary key of bseg, so for every line of it_bkpf only a few lines are found by the SELECTs.
First part
As JozsefSzikszai wrote, FOR ALL ENTRIES is much faster than nested SELECTs. Moreover, a huge amount of unnecessary data is moved from the DB to the application server: the SELECT SINGLE is just an existence check, yet it reads all 362 columns of BSEG.
So create a SORTED table before the LOOP:
TYPES: BEGIN OF tt_bseg,
         bukrs TYPE bseg-bukrs,
         belnr TYPE bseg-belnr,
         gjahr TYPE bseg-gjahr,
       END OF tt_bseg.
DATA: lt_bseg TYPE SORTED TABLE OF tt_bseg WITH NON-UNIQUE KEY bukrs belnr gjahr.

SELECT bukrs belnr gjahr FROM bseg
       INTO TABLE lt_bseg
       FOR ALL ENTRIES IN it_bkpf
       WHERE bukrs = it_bkpf-bukrs
         AND belnr = it_bkpf-belnr
         AND gjahr = it_bkpf-gjahr
         AND hkont = wtax.
Instead of the SELECT SINGLE, use READ TABLE:
READ TABLE lt_bseg TRANSPORTING NO FIELDS
     WITH KEY bukrs = itab-bukrs
              belnr = itab-belnr
              gjahr = itab-gjahr.
Second part
This is quadratic because for every iteration of itab2, all lines of itab are checked for the MODIFY.
Simplest solution:
Instead of SORT itab BY belnr load the content of itab into a SORTED table, and use that later instead:
DATA: lt_itab LIKE SORTED TABLE OF itab
              WITH NON-UNIQUE KEY belnr hkont.

lt_itab = itab[].
FREE itab.
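The 2nd part can then run against lt_itab; a possible continuation (my sketch, reusing itab2, wtax and con_5per_tax from the question) is:
DATA: lv_low  TYPE p DECIMALS 2,
      lv_high TYPE p DECIMALS 2.

LOOP AT itab2.
  lv_low  = con_5per_tax * itab2-dmbtr - '0.02'.
  lv_high = con_5per_tax * itab2-dmbtr + '0.02'.
  " key read on the SORTED table instead of a full scan
  READ TABLE lt_itab ASSIGNING FIELD-SYMBOL(<ls_line>)
       WITH KEY belnr = itab2-belnr
                hkont = wtax.
  IF sy-subrc = 0.
    IF <ls_line>-dmbtr BETWEEN lv_low AND lv_high.
      " the WHERE on belnr also uses the sorted key, so this stays cheap
      LOOP AT lt_itab ASSIGNING FIELD-SYMBOL(<ls_upd>) WHERE belnr = itab2-belnr.
        <ls_upd>-budat = '99991231'.
      ENDLOOP.
    ENDIF.
  ENDIF.
ENDLOOP.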

There were many reasonable recommendations from the others, especially from Sandra Rossi, but I'll just add my two cents to your latest variant that runs for an hour:
use field-symbols as much as possible. They really can influence performance on large datasets (and 1.5M rows is large, yes!)
use secondary keys, as proposed by Sandra. It really matters a lot (see standard program DEMO_SECONDARY_KEYS)
do not use INTO CORRESPONDING FIELDS unless needed, it slows down the queries a bit
there is some double work in your snippet: you collect unnecessary rows in the first loop over itab (BSEG), but BUDAT is written only when the BSEG position sum fits the lower-upper range, so some of the collected sums are wasted effort.
All in all, I haven't fully understood your logic, as you write BUDAT not to the whole matched row set but only to the first row of the group with the matched BELNR, which makes no sense to me.
Nevertheless, if one tries to just repeat your logic without any alterations and applies the new GROUP BY syntax, here is what one might get.
* To omit `INTO CORRESPONDING` you should properly declare your structure
* and fetch only those fields which are needed. Also avoid the obsolete `OCCURS` syntax.
TYPES: BEGIN OF ty_itab,
         bukrs TYPE bseg-bukrs,
         belnr TYPE bkpf-belnr,
         gjahr TYPE bkpf-gjahr,
         hkont TYPE bseg-hkont,
         dmbtr TYPE bseg-dmbtr,
         shkzg TYPE bseg-shkzg,
         budat TYPE bkpf-budat,
       END OF ty_itab.
TYPES: BEGIN OF ty_belnr_sums,
         belnr TYPE bkpf-belnr,
         dmbtr TYPE bseg-dmbtr,
       END OF ty_belnr_sums.
DATA: itab         TYPE TABLE OF ty_itab INITIAL SIZE 0,
      con_5per_tax TYPE p DECIMALS 2 VALUE '0.03'.

SELECT g~bukrs g~belnr g~gjahr g~hkont g~dmbtr g~shkzg f~budat UP TO 1500000 ROWS
       INTO TABLE itab
       FROM bseg AS g
       JOIN bkpf AS f
         ON g~bukrs = f~bukrs
        AND g~belnr = f~belnr
        AND g~gjahr = f~gjahr.
DATA members LIKE itab.
FIELD-SYMBOLS <first_line> LIKE LINE OF itab.

LOOP AT itab ASSIGNING FIELD-SYMBOL(<fs_itab>)
     GROUP BY ( belnr = <fs_itab>-belnr
                hkont = <fs_itab>-hkont )
     ASCENDING
     ASSIGNING FIELD-SYMBOL(<group>).
  CLEAR members.
  UNASSIGN <first_line>.
  CHECK <group>-hkont(2) = '73'.
  LOOP AT GROUP <group> ASSIGNING FIELD-SYMBOL(<group_line>).
    IF <first_line> IS NOT ASSIGNED.
      ASSIGN <group_line> TO <first_line>. " first row of the group, points into itab
    ENDIF.
    members = VALUE #( BASE members ( <group_line> ) ).
  ENDLOOP.
  DATA(sum) = REDUCE dmbtr( INIT val TYPE dmbtr
                            FOR wa IN members
                            NEXT val = val + wa-dmbtr ).
  IF members[ 1 ]-dmbtr BETWEEN con_5per_tax * sum - '0.02'
                            AND con_5per_tax * sum + '0.02'.
    <first_line>-budat = '99991231'.
  ENDIF.
ENDLOOP.
During my test on a 1.5M dataset I measured the runtime with GET RUN TIME FIELD and compared the old snippet with mine.
[runtime measurements for the old and the new snippet not reproduced here]

Related

Function Module for Calculation Schema in MM

I am looking for a function module that evaluates the calculation schema for an arbitrary material.
When opening ME23N and looking at the position details, you have the Conditions tab, where the displayed table contains the base price, various conditions and, below them, the end price. But since the price finding calculates (base price + conditions) * amount as the net value and divides this by the amount, this can lead to rounding issues where a calculated value of 4.738 gets rounded to 4.74, which gets stored as the net price. Now, when calculating net price * amount, this value can differ from the original value printed on the purchase document.
Since the purchase document value is not stored in EKPO, my goal is to re-evaluate this value by simply calling a function module with the material number, the calculation schema and any necessary parameters, so that it gives me the actual value that (again) is printed on the document.
Is there any function module that can do this or do I have to code the logic by myself?
As I wrote in my comment the solution is the FM BAPI_PO_GETDETAIL1. If you supply the PO number you get several tables containing information that is displayed in the PO create/view transaction. One of them is the iTab POCOND that has all conditions. Then you just have to read this iTab and calculate the values and add them up.
lv_ebeln = '4711'.
lv_ebelp = 10.

" Call FM to get the detail data for one PO and each position
call function 'BAPI_PO_GETDETAIL1'
  exporting
    purchaseorder = lv_ebeln
  tables
    pocond        = gt_pocond.

" Loop over the itab and only read entries for position 10
loop at gt_pocond into gs_pocond
     where itm_number = lv_ebelp.
  " Get the netto value NAVS
  if gs_pocond-cond_type = 'NAVS'.
    lv_netwr = gs_pocond-conbaseval.
  endif.
endloop.

Sum Using Reduce Syntax ABAP

I am stuck at a problem that I want to solve with REDUCE syntax.
I have this internal table.
I need to sum the values of the column "quantity" and update the column called "Total quantity",
and do the same with the column price into the column "Total Price".
This is for each purchase order.
I have this code right now
LOOP AT it_numpos INTO DATA(ls_numpos).
  lv_valort = lv_valort + ls_numpos-netpr. " Purchase order total price
  lv_cantt  = lv_cantt + ls_numpos-menge.  " Purchase order total quantity
  AT END OF ebeln.
    ls_numpos-zmenge3 = lv_cantt.
    ls_numpos-znetpr6 = lv_valort.
    MODIFY it_numpos FROM ls_numpos TRANSPORTING zmenge3 znetpr6
           WHERE ebeln = ls_numpos-ebeln.
    CLEAR: lv_cantt, ls_numpos, lv_valort.
  ENDAT.
ENDLOOP.
Is it possible to transform this code to the new ABAP syntax?
I don't think REDUCE is the right tool for the job here, as it is meant to reduce the values of a table to one single value. In your case there is not one single value, as you're calculating new values for each sales order. So you would need to somehow loop over the table grouping the items together, then use REDUCE, then loop again to assign the values back into the table. That would rather complicate the code, and just for the sake of using new syntax it is probably not worth the trouble. I think a LOOP AT is the better choice here, though I'd use a LOOP AT ... GROUP BY and then two LOOP AT GROUP loops, which makes the whole processing quite readable:
LOOP AT order_items ASSIGNING FIELD-SYMBOL(<order_item>)
     GROUP BY <order_item>-id INTO DATA(order).
  DATA(total_price) = 0.
  LOOP AT GROUP order ASSIGNING <order_item>.
    total_price = total_price + <order_item>-price.
  ENDLOOP.
  LOOP AT GROUP order ASSIGNING <order_item>.
    <order_item>-total_price = total_price.
  ENDLOOP.
ENDLOOP.
However whether that is better than group level processing is up to you.

Keyerror on pd.merge()

I am trying to merge 2 data-frames ('credit' and 'info') on the column 'id'.
My code for this is:
c.execute('SELECT * FROM "credit"')
credit=c.fetchall()
credit=pd.DataFrame(credit)
c.execute('SELECT * FROM "info"')
info=c.fetchall()
movies_df=pd.DataFrame(info)
movies_df_merge=pd.merge(credit, movies_df, on='id')
Both of the id column types from the tables ('credit' and 'info') are integers, but I am unsure why I keep getting a KeyError on 'id'.
I have also tried:
movies_df_merge=movies_df.merge(credit, on='id')
The way you read both DataFrames is not relevant here.
Just print both DataFrames (if the number of records is big, it will be enough to print(df.head())).
Then look at them. Especially check whether both DataFrames contain an id column. Maybe one of them is ID, whereas the other is id? The upper/lower case of names does matter here.
Check also that the id column in both DataFrames is a "normal" column (not part of the index).

Slow SELECT FOR ALL ENTRIES

The SELECT below runs with the internal table GIT_KUNNR_TAB containing 2,291,000 lines of unique customers (kunnr) and takes 16 minutes to complete.
select kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
into corresponding fields of table git_oi_tab
from bsid
for all entries in git_kunnr_tab
where bukrs = p_bukrs
and kunnr = git_kunnr_tab-kunnr
and umsks = ' '
and augdt = clear_augdt
and budat le p_key
and blart in s_blart
and xref3 in s_xref3.
BSID contains 20,000,000 records in total, and for the 2,291,000 unique customers it gets 445,000 records from BSID.
Most of the time there are even more lines in GIT_KUNNR_TAB.
Is there any quicker selection?
Drop the FOR ALL ENTRIES part
Most likely the rest of the WHERE condition is selective enough. You get back more records than necessary, but much quicker.
As git_kunnr_tab is unique, you can turn it into a HASHED table, and filter git_oi_tab with that on the application server.
SELECT kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
INTO corresponding fields of table git_oi_tab
FROM bsid
WHERE bukrs = p_bukrs
AND umsks = ' '
AND augdt = clear_augdt
AND budat le p_key
AND blart in s_blart
AND xref3 in s_xref3.
DATA: lt_kunnr_tab TYPE HASHED TABLE OF <type of git_kunnr_tab>
                   WITH UNIQUE KEY kunnr.

lt_kunnr_tab = git_kunnr_tab.

LOOP AT git_oi_tab ASSIGNING FIELD-SYMBOL(<fs_oi>).
  READ TABLE lt_kunnr_tab TRANSPORTING NO FIELDS
       WITH KEY kunnr = <fs_oi>-kunnr.
  IF sy-subrc <> 0.
    DELETE git_oi_tab.
  ENDIF.
ENDLOOP.
FREE lt_kunnr_tab.
This is not a general solution
If the FAE driver table contains more than 20% of the rows of the target table, dropping it completely is mostly beneficial for speed.
If it has fewer rows, FAE is the better solution.
Be careful however, dropping FAE can significantly increase the memory consumption of the resulting internal table!
FOR ALL ENTRIES vs Range table
You can see in many places on the internet that Range tables are faster than FAE. This is true in some very specific cases:
Only one field is used from the FAE driver table (but see the note below on correctness)
There are more rows in the driver table than FAE sends down in one batch
(by default the batch size is 5 in Oracle, 50 in DB2, 100 in HANA)
There are not so many rows in the Range table that it causes a dump
(the maximum statement length is 1 048 576 bytes, see note 1002491)
Range tables can be faster than FAE because they send down all the filtering conditions in one query. This is of course dangerous, as the size of a query is limited; if it exceeds the set limit, you get a dump.
However, using the hints MAX_IN_BLOCKING_FACTOR and MAX_BLOCKING_FACTOR you can give FAE all the benefits of Range tables without their downsides, by increasing the batch size (see the sketch after this list).
So only use Range tables with actual ranges, like between A and C, or not between G and J.
Note on the one-field case: this is not about speed but functional correctness. Range tables treat fields independently, while FAE works with rows.
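The blocking-factor hints are passed with the %_HINTS addition at the end of the SELECT, reusing git_kunnr_tab and p_bukrs from the question. The hint text below is an Oracle-flavoured illustration only; verify the exact string for your database and release before relying on it:
TYPES: BEGIN OF ty_res,
         kunnr TYPE bsid-kunnr,
         belnr TYPE bsid-belnr,
         gjahr TYPE bsid-gjahr,
       END OF ty_res.
DATA lt_result TYPE STANDARD TABLE OF ty_res.

SELECT kunnr belnr gjahr
       FROM bsid
       INTO TABLE lt_result
       FOR ALL ENTRIES IN git_kunnr_tab
       WHERE bukrs = p_bukrs
         AND kunnr = git_kunnr_tab-kunnr
       " hint string assumed, not verified: raises the FAE batch size
       %_HINTS ORACLE '&max_in_blocking_factor 100&'.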
Normally, when only one field is compared, using a Range is much faster.
So if you are selecting the data by some key from the internal table and comparing only one field, turn it into a Range instead of FAE:
TYPES:
  tr_kunnr TYPE RANGE OF kunnr.

* or just do a LOOP/APPEND if you are on an older system (< 7.40)
DATA(lr_kunnr) = VALUE tr_kunnr(
  FOR <fs_kunnr> IN git_kunnr_tab
  ( sign   = 'I'
    option = 'EQ'
    low    = <fs_kunnr>-kunnr ) ).

select kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
       zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
       into corresponding fields of table git_oi_tab
       from bsid
       where bukrs = p_bukrs
         and kunnr in lr_kunnr...
I can't find the article, but an investigation was made, and a Range is much faster than FAE in the case of a one-field comparison.

updating a value with several conditions

I am trying to add several conditions. I'd like to update base2 with the sum of itself and an intermediate value, and I'd like to put some conditions on the intermediate value and base2.
I modified the table manually in the database. IntermediateValue is one of the columns in the table and is calculated based on the base2 value.
In the first row I have a base2 value and use it to calculate the first row's intermediate value; in the second row I need to set the new base2 = previous base2 + previous intermediate value. That is why I have two counters to trace where the items' positions are: counter1 counts the index of itemid, and counter2 counts the rows within one itemid.
The question is how to set this new base2. Is it possible to set my new base2 in one statement, or will I have to put the previous row's intermediate value into another variable and add it to base2?
Here below is what I want to have, but it has errors (function missing):
UPDATE TABLE2 SET base2=
(base2+INTERMEDIATEVALUE WHERE loadingordinal=counter2 AND itemid=counter1)
WHERE loadingordinal=counter2 +1 AND itemid=counter1
In VFP, doing such updates via SQL UPDATE is hard. Fortunately there are many ways to solve the problem, and one is to use the xBase command instead (in xBase you can use REPLACE in place of SQL UPDATE). REPLACE operates on the "current row" by default (when no scope clause is included). You didn't supply any schema nor any sample data, so I will give only a plain sample by guess. Your code would look like:
local lnPrevious, lnItemId
lnItemId = -1
Select Table2
Scan
    && Save the current Id (ItemID?)
    lnItemId = Table2.ItemID
    && Do the calculation using the previous value
    && On the first row it is the same as base2
    lnPrevious = base2
    Scan While Table2.ItemId = m.lnItemId
        Replace Base2 with m.lnPrevious, ;
                IntermediateValue with DoCalculation(Base2)
        && Previous value for the next row
        lnPrevious = Base2 + IntermediateValue
    EndScan
    Skip -1  && EndScan would move the pointer
EndScan
Note, if you need more than ItemId (or maybe passing Base2 and IntermediateValue itself rather than lnPrevious) you could as well do something like:
local loRow
scan
scatter to name loRow memo
scan while table2.ItemId = m.loRow.ItemId
...
I'm not sure what INTERMEDIATEVALUE is, but if it is a table then you can follow the query below, or adjust it as needed. You can use subqueries in order to achieve this type of condition:
UPDATE TABLE2
SET base2 = base2 + (SELECT *
FROM INTERMEDIATEVALUE
WHERE loadingordinal=counter2 AND itemid=counter1)
WHERE loadingordinal=counter2 +1 AND itemid=counter1