Extracting unique values from an internal table - abap

What is the most efficient way to extract the unique values from a column or multiple columns of an internal table?

If you have ABAP 7.40 SP08 or above, you can simply use the inline syntax to populate the target table (no need for a LOOP with GROUP BY):
DATA: it_unique TYPE STANDARD TABLE OF fieldtype.
it_unique = VALUE #(
FOR GROUPS value OF <line> IN it_itab
GROUP BY <line>-field WITHOUT MEMBERS ( value ) ).
This works with any table category for the target table.
For an older release use:
DATA: lt_unique TYPE HASHED TABLE OF fieldtype WITH UNIQUE KEY table_line.
LOOP AT it_itab ASSIGNING <line>.
  INSERT <line>-field INTO TABLE lt_unique.
ENDLOOP.
The above works with sorted tables as well, although I do not recommend using sorted tables for this purpose unless you are sure that only a few lines will end up in the result.
The non-zero sy-subrc of the INSERT is simply ignored, so there is no need to do the key lookup twice (once for the existence check, once for the insert).
If the target must be a STANDARD TABLE and you have an old ABAP stack, you can alternatively use:
DATA: lt_unique TYPE STANDARD TABLE OF fieldtype.
LOOP AT it_itab ASSIGNING <line>.
  READ TABLE lt_unique WITH KEY table_line = <line>-field
       TRANSPORTING NO FIELDS BINARY SEARCH.
  IF sy-subrc <> 0.
    INSERT <line>-field INTO lt_unique INDEX sy-tabix.
  ENDIF.
ENDLOOP.
This provides the same behavior as with a sorted table but with a standard table.
Whether this is more efficient than SORT / DELETE ADJACENT DUPLICATES depends on the number of duplicate entries in itab. The more duplicate entries there are, the faster the above solution will be, because it avoids unnecessary appends to the target table. On the other hand, appends are faster than inserts.

Prior to ABAP 7.40's SP08 release, the most efficient way of extracting unique values from an internal table or itab is the following:
LOOP AT lt_itab ASSIGNING <ls_itab>.
APPEND <ls_itab>-value TO lt_values.
ENDLOOP.
SORT lt_values.
DELETE ADJACENT DUPLICATES FROM lt_values.
Checking for the presence of a given <ls_itab>-value before adding it to the internal table is another way of guaranteeing uniqueness, but it will probably be much more computationally expensive when the destination is a standard table. For a sorted or hashed destination table, use:
LOOP AT lt_itab ASSIGNING <ls_itab>.
  READ TABLE lt_sorted_values WITH TABLE KEY table_line = <ls_itab>-value
       TRANSPORTING NO FIELDS.
  IF sy-subrc <> 0.
    INSERT <ls_itab>-value INTO TABLE lt_sorted_values.
  ENDIF.
ENDLOOP.
Note that using the first method but inserting the values into a dummy table, followed by an APPEND LINES OF lt_dummy TO lt_sorted_values, may be faster, but the size of the intermediate tables can complicate that.
As of ABAP 7.40 Support Package 08 however, the GROUP BY loops offer a better way to extract unique values. As the name indicates these function similarly to SQL's GROUP BY. For instance, the following code will extract unique project numbers from an internal table:
LOOP AT lt_project_data ASSIGNING FIELD-SYMBOL(<ls_grp_proj>)
     GROUP BY ( project = <ls_grp_proj>-proj_number ) ASCENDING
     WITHOUT MEMBERS
     ASSIGNING FIELD-SYMBOL(<ls_grp_unique_proj>).
  APPEND <ls_grp_unique_proj>-project TO lt_unique_projects.
ENDLOOP.
The same logic can be extended to retrieve unique pairs, such as the composite primary key of the EKPO table: EBELN ("Purchasing Document", po_nr) and EBELP ("Item Number of Purchasing Document", po_item):
LOOP AT lt_purchasing_document_items ASSIGNING FIELD-SYMBOL(<ls_grp_po>)
     GROUP BY ( number = <ls_grp_po>-po_nr
                item   = <ls_grp_po>-po_item ) ASCENDING
     WITHOUT MEMBERS
     ASSIGNING FIELD-SYMBOL(<ls_grp_po_item>).
  APPEND VALUE #( ebeln = <ls_grp_po_item>-number
                  ebelp = <ls_grp_po_item>-item ) TO lt_unique_po_items.
ENDLOOP.
According to Horst Keller, one of the SAP designers of the ABAP 7.40 release, the performance of GROUP BY loops is likely to be the same as that of a manually implemented grouping LOOP. Depending on how (in)efficiently such a custom loop is implemented, the GROUP BY loop may even be faster. Note that GROUP BY loops will also be faster than the two methods given above for systems where they are not available.
Note that in most cases, querying the database to return DISTINCT values directly will be much faster; performance-wise it will blow any ABAP code that works on internal tables out of the water, especially on HANA systems.
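For example, if the values originally come from a database table such as EKPO (used in an example above), the de-duplication can be pushed to the database itself; a minimal sketch of the SQL involved (in ABAP this would typically be an Open SQL SELECT DISTINCT ... INTO TABLE):
-- let the database return only the distinct purchasing document numbers
select distinct ebeln
from ekpo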

How about this?
lt_unique[] = lt_itab[].
SORT lt_unique[] BY field1 field2 field3...
DELETE ADJACENT DUPLICATES FROM lt_unique[] COMPARING field1 field2 field3...

Related

Check if a set of related rows exist

I want to manage sets of things in a database. Assume the following two tables:
CREATE TABLE Sets (id BIGINT PRIMARY KEY, name VARCHAR(64));
CREATE TABLE SetItems (fkSet BIGINT, item BIGINT, FOREIGN KEY (fkSet) REFERENCES Sets(id));
I could create Sets by inserting a row into table Sets, and add one or more rows into SetItems with the corresponding fkSet.
Getting the items of a specific set is easy, it's basically SELECT * FROM SetItems WHERE fkSet = :id.
Problem: Now I want to find out if a set exists, given a set of SetItems.
Example: I want to find if there is a set with the items 2 and 5.
What I tried:
(1) I could try something like:
SELECT s.fkSet FROM Sets s, SetItems i1, SetItems i2
WHERE s.id = i1.fkSet AND i1.item = 2
AND s.id = i2.fkSet AND i2.item = 5;
But such an approach has several drawbacks:
I guess it scales very badly if I need to check for more SetItems.
I need to put the SQL query together with string concatenation, which I dislike, as it increases the chances of an injection attack.
It could also find sets which have additional items besides 2 and 5, which I would not like.
To better prevent SQL injection, I would prefer a way where I could use prepared statements. Technically, I could assemble the query string for a prepared statement using string concatenation and then set the query parameters, but this approach feels wrong somehow.
(2) Another solution: I could first get all sets the first SetItem is part of, and then check for each returned Set if it also contains all the other items and none additional ones. If the first SetItem is contained in a large number of Sets, this would result in a large number of queries, which seems inefficient and not scalable.
(3) For each SetItem that should be contained, I could get all sets it is in, and then do an intersection in my code outside SQL. This would require at most as many sql queries as there are SetItems to be checked.
(4) An alternative would be to store the setItems as a comma-separated list as VARCHAR, sorted in increasing order, directly as an additional column in the table Sets. The table SetItems would not be needed then. To check for the existence of a set I could just query if there is a row with the same comma-separated list. But then queries like "in which set is item xy contained" would not be possible so easily, relying on String-matching in the SQL-query. Not very relational...
Question: How can I efficiently query an SQL database if a set of related rows exists?
Should I structure my data differently? Should I use a NoSQL database for such a query?
I'm currently using H2 and would prefer a solution not using some specific SQL-dialect of a single database vendor.
You can use having to check how many distinct matches you have per set:
select i.fkSet
from SetItems i
where i.item in (2, 5)
group by i.fkSet
having count(distinct i.item) = 2
Of course, you need to make sure the final number (here 2) corresponds to the number of values you have listed at the in operator.
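If sets containing additional items besides 2 and 5 also have to be excluded (the third concern raised in the question), one possible extension of the same grouping idea is the following sketch:
select i.fkSet
from SetItems i
group by i.fkSet
-- both wanted items are present ...
having count(distinct case when i.item in (2, 5) then i.item end) = 2
-- ... and the set contains no other items
and sum(case when i.item not in (2, 5) then 1 else 0 end) = 0
As before, the 2 in the first condition must correspond to the number of values you are checking for.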

Is the sort order used in a WHERE condition with OR

If I have an internal table lt_itab, with type SORTED by werks, matnr and …, does this LOOP do a binary search?
LOOP AT lt_itab INTO ls_itab
WHERE ( werks = space OR werks = '*' ).
Or does the OR force a linear scan?
It is a linear scan.
I built a small test program with 100 million rows; the OR makes the LOOP about 150 times slower.
It should be a linear scan.
As with the FILTER operator, multiple comparisons in the WHERE condition can only be joined with AND, which in my opinion is there to make sure a binary search can be used.
And in the case of a hash key, precisely one comparison expression is required for each key component, and the only relational operator allowed is =, again to make sure the lookup is O(1).
From the official documentation:
If no explicit table key name is specified after USING KEY, the order in which the rows are read depends on the table category as follows:
Standard tables and sorted tables: The rows are read by ascending row numbers in the primary table index. In each loop pass, the system field sy-tabix contains the row number of the current row in the primary table index.
Hashed tables: The rows are processed in the order in which they were inserted in the table, and, after the statement SORT, in the sort order used. In each loop pass, the system field sy-tabix contains the value 0.
The loop continues to run until all the table rows that meet the cond condition have been read or until it is exited with a statement. If no appropriate rows are found or if the internal table is blank, the loop is not run at all.
The loop doesn't do a binary search, since it's not searching, but looping, i.e. iterating over every row in lt_itab.
https://help.sap.com/http.svc/rc/abapdocu_751_index_htm/7.51/en-US/abaploop_at_itab.htm

DB2/SQL equivalent of SAS's sum(of ) function

SAS has a sum(of col1 - coln ) function which finds the sum of all the values from col1, col2, col3...coln. (ie, you don't have to list out all the column names, as long as they are numbered consecutively). This is a handy shortcut to find a sum of several (suitably named) variables.
Question - Is there a DB2/SQL equivalent of this? I have 50 columns (they are named col1, col2, col3, ..., col50) and I need to find the sum of them.
ie:
select sum(col1, col2, col3,....,col50) AggregateSum
from foo.table
No, DB2 has no such beast, at least to my knowledge. However, you can dynamically create such a query by first querying the database metadata to extract the columns for a given table.
From memory, DB2 has a sysibm.syscolumns table which basically contains the column information that you could use to construct a query on the fly.
You would first use a query like:
select column from sysibm.syscolumns
where schema = 'foo' and tablename = 'table'
and column like 'col%'
(the column names may not match exactly but, since they're not the same on the differing variants of DB2 (DB2/z, DB2/LUW, iSeries DB2, etc) anyway, that hardly matters).
Then use the results of that query to construct your actual query:
select col1+col2+...+colN AggregateSum from foo.table
where the col1+col2+...+colN bit has been built from the previous query.
If, as you mention in a comment, you only want the eighteen "highest" columns (e.g., if columns 1 thru 100 exist, you only want 83 thru 100), you can modify the first query to do that, with something like:
select column from sysibm.syscolumns
where schema = 'foo' and tablename = 'table'
and column like 'col%'
order by column desc
fetch first 18 rows only
but, in that case, you may want to call the columns col0001, col0145 and so on, or make the sorting able to handle variable width numbers.
Although it may be easier (if you can't change the column names) to get all the colNNN columns, sort them yourself by the numeric (not string) value after the col, and throw away all but the last eighteen when constructing the second query.
Both these options will return only eighteen rows maximum.
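If you go that route, the numeric sorting can also be pushed into the catalog query itself; a rough sketch, using the same approximate catalog column names as the queries above:
select column from sysibm.syscolumns
where schema = 'foo' and tablename = 'table'
and column like 'col%'
-- sort by the numeric value after 'col' rather than by the string value
order by cast(substr(column, 4) as integer) desc
fetch first 18 rows only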
But you may also want to think, in that case, about moving the variable data to another table, if that's possible in your situation. If you ever find yourself maintaining an array within a table, it's usually better to separate that out.
So your main table would then be something like:
main_id primary key
other_data
and your auxiliary table would be akin to:
main_id foreign key to main(main_id)
sequence_num
other_data
primary key (main_id, sequence_num)
That would allow you to have sparse data if needed, and also to add data without having to change the schema of the main table. The query to get the latest eighteen results would be a little more complicated but still a relatively simple join of the two tables.
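A sketch of that join, assuming the main and auxiliary tables are called main and aux, and that "latest" means the highest sequence_num values for a given main_id:
-- the eighteen most recent aux rows for one main row
select m.main_id, a.sequence_num, a.other_data
from main m
join aux a on a.main_id = m.main_id
where m.main_id = ?
order by a.sequence_num desc
fetch first 18 rows only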

Decision when to create Index on table column in database?

I am not a DB guy, but I need to create tables and do CRUD operations on them. I get confused about whether I should create an index on all columns by default or not. Here is the understanding I apply when creating an index.
An index basically contains the memory location range (the starting memory location where the first value is stored through the end memory location where the last value is stored). So when we insert any value into the table, the index for that column needs to be updated as it has got one more value, but an update of a column value won't have any impact on the index value. Right? So the bottom line is: when my column is used in a join between two tables we should consider creating an index on the column used in the join, but all other columns can be skipped, because if we create an index on them it will involve the extra cost of updating the index value when a new value is inserted in the column. Right?
Consider this scenario where the table mytable contains three columns, i.e. col1, col2, col3. Now we fire this query:
select col1,col2 from mytable
Now there are two cases here. In the first case we create an index on col1 and col2. In the second case we don't create any index. As per my understanding, case 1 will be faster than case 2, because in case 1 Oracle can quickly find the column memory locations. So here I have not used any join columns, but the index is still helping. So should I consider creating an index here or not?
What if, in the same scenario above, we fire
select * from mytable
instead of
select col1,col2 from mytable
Will index help here?
Don't create indexes on every column! It will slow things down on insert/delete/update operations.
As a simple reminder, you can create an index on columns that are common in WHERE, ORDER BY and GROUP BY clauses. You may also consider adding an index on columns that are used to relate to other tables (through a JOIN, for example).
Example:
SELECT col1,col2,col3 FROM my_table WHERE col2=1
Here, creating an index on col2 would help this query a lot.
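For instance (the index name is just illustrative):
CREATE INDEX my_table_col2_idx ON my_table (col2);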
Also, consider index selectivity. Simply put, create indexes on columns whose values have a "big domain" (many distinct values), e.g. IDs, names, and so on. Don't create them on Male/Female-type columns.
but an update of a column value won't have any impact on the index value. Right?
No. Updating an indexed column will have an impact. The Oracle 11g performance manual states that:
UPDATE statements that modify indexed columns and INSERT and DELETE
statements that modify indexed tables take longer than if there were
no index. Such SQL statements must modify data in indexes and data in
tables. They also create additional undo and redo.
So the bottom line is: when my column is used in a join between two tables we should consider creating an index on the column used in the join, but all other columns can be skipped, because if we create an index on them it will involve the extra cost of updating the index value when a new value is inserted in the column. Right?
Not just Inserts but any other Data Manipulation Language statement.
Consider this scenario . . . Will index help here?
With regards to this last paragraph, why not build some test cases with representative data volumes so that you prove or disprove your assumptions about which columns you should index?
In the specific scenario you give, there is no WHERE clause, so either a full table scan or a full index scan is going to be used; since you are only dropping one column, the performance might not be that different. In the second scenario (select *), the index shouldn't be used, since it isn't covering and there is no WHERE clause. If there were a WHERE clause, the index could allow the filtering to reduce the number of rows which need to be looked up to get the missing columns.
Oracle has a number of different table types, including heap-organized and index-organized tables.
If an index is covering, it is more likely to be used, especially when it is selective. But note that an index-organized table is not necessarily better than a covering index on a heap table when there are constraints in the WHERE clause and the covering index has far fewer columns than the base table.
Creating indexes with more columns than are actually used only helps if that makes the index covering, and adding all the columns would be similar to an index-organized table. Note that Oracle does not have the equivalent of SQL Server's INCLUDE (column) clause, which can be used to make an index more covering: it effectively stores an additional subset of non-key columns in the index, which is useful if you want the index to be unique but also to carry some data that should not be considered in the uniqueness check, while helping to make the index covering for more queries.
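For reference, a SQL Server sketch of such an index (the names are illustrative):
-- unique only on key_col, but extra_col is carried in the index
-- so queries reading extra_col can be answered from the index alone
CREATE UNIQUE INDEX ix_t_key_col ON t (key_col) INCLUDE (extra_col);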
You need to look at your plans and then determine if indexes will help things. And then look at the plans afterwards to see if they made a difference.

IN vs OR of Oracle, which faster?

I'm developing an application which processes many data in Oracle database.
In some case, I have to get many object based on a given list of conditions, and I use SELECT ...FROM.. WHERE... IN..., but the IN expression just accepts a list whose size is maximum 1,000 items.
So I use OR expressions instead, but as I observe, the query using OR seems to be slower than the one using IN (with the same list of conditions). Is that right? And if so, how can I improve the speed of the query?
IN is preferable to OR -- OR is a notoriously bad performer, and can cause other issues that would require using parentheses in complex queries.
A better option than either IN or OR is to join to a table containing the values you want (or don't want). This table for comparison can be derived, temporary, or one that already exists in your schema.
In this scenario I would do this:
Create a one column global temporary table
Populate this table with your list from the external source (and quickly - another whole discussion)
Do your query by joining the temporary table to the other table (consider dynamic sampling as the temporary table will not have good statistics)
This means you can leave the sort to the database and write a simple query.
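A rough sketch of that approach in Oracle, with hypothetical names (the global temporary table is created once as part of the schema, not per run):
CREATE GLOBAL TEMPORARY TABLE gtt_ids (id NUMBER) ON COMMIT DELETE ROWS;

-- populate gtt_ids from the external source, then join,
-- using dynamic sampling since the temporary table has no useful statistics
SELECT /*+ dynamic_sampling(g 2) */ t.*
FROM my_table t
JOIN gtt_ids g ON g.id = t.id;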
Oracle internally converts IN lists to lists of ORs anyway, so there should really be no performance difference. The only difference is that Oracle has to transform the IN list itself, whereas it has a longer string to parse if you supply the ORs yourself.
Here is how you test that.
CREATE TABLE my_test (id NUMBER);
SELECT 1
FROM my_test
WHERE id IN (1,2,3,4,5,6,7,8,9,10,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,
91,92,93,94,95,96,97,98,99,100
);
SELECT sql_text, hash_value
FROM v$sql
WHERE sql_text LIKE '%my_test%';
SELECT operation, options, filter_predicates
FROM v$sql_plan
WHERE hash_value = '1181594990'; -- hash_value from previous query
SELECT STATEMENT
TABLE ACCESS FULL ("ID"=1 OR "ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5
OR "ID"=6 OR "ID"=7 OR "ID"=8 OR "ID"=9 OR "ID"=10 OR "ID"=21 OR
"ID"=22 OR "ID"=23 OR "ID"=24 OR "ID"=25 OR "ID"=26 OR "ID"=27 OR
"ID"=28 OR "ID"=29 OR "ID"=30 OR "ID"=31 OR "ID"=32 OR "ID"=33 OR
"ID"=34 OR "ID"=35 OR "ID"=36 OR "ID"=37 OR "ID"=38 OR "ID"=39 OR
"ID"=40 OR "ID"=41 OR "ID"=42 OR "ID"=43 OR "ID"=44 OR "ID"=45 OR
"ID"=46 OR "ID"=47 OR "ID"=48 OR "ID"=49 OR "ID"=50 OR "ID"=51 OR
"ID"=52 OR "ID"=53 OR "ID"=54 OR "ID"=55 OR "ID"=56 OR "ID"=57 OR
"ID"=58 OR "ID"=59 OR "ID"=60 OR "ID"=61 OR "ID"=62 OR "ID"=63 OR
"ID"=64 OR "ID"=65 OR "ID"=66 OR "ID"=67 OR "ID"=68 OR "ID"=69 OR
"ID"=70 OR "ID"=71 OR "ID"=72 OR "ID"=73 OR "ID"=74 OR "ID"=75 OR
"ID"=76 OR "ID"=77 OR "ID"=78 OR "ID"=79 OR "ID"=80 OR "ID"=81 OR
"ID"=82 OR "ID"=83 OR "ID"=84 OR "ID"=85 OR "ID"=86 OR "ID"=87 OR
"ID"=88 OR "ID"=89 OR "ID"=90 OR "ID"=91 OR "ID"=92 OR "ID"=93 OR
"ID"=94 OR "ID"=95 OR "ID"=96 OR "ID"=97 OR "ID"=98 OR "ID"=99 OR
"ID"=100)
I would question the whole approach. The client of the SP has to send 100,000 IDs. Where does the client get those IDs from? Sending such a large number of IDs as the parameter of the procedure is going to cost significantly anyway.
If you create the table with a primary key:
CREATE TABLE my_test (id NUMBER,
CONSTRAINT PK PRIMARY KEY (id));
and go through the same steps (running the query with the multiple IN values, then retrieving the execution plan via its hash value), what you get is:
SELECT STATEMENT
INLIST ITERATOR
INDEX RANGE SCAN
This seems to imply that when you have an IN list and are using this with a PK column, Oracle keeps the list internally as an "INLIST" because it is more efficient to process this, rather than converting it to ORs as in the case of an un-indexed table.
I was using Oracle 10gR2 above.