DELETE ADJACENT DUPLICATES delete order? - abap

If there are entries with the same key:
SORT itab BY key.
DELETE ADJACENT DUPLICATES FROM itab COMPARING key.
Does anyone know which one will be deleted by DELETE ADJACENT DUPLICATES ... COMPARING key? The first one or the second one?

From the F1 help on DELETE ADJACENT DUPLICATES:
In the case of several double lines following one another, all the
lines - except for the first - are deleted.
So the second (identical) line should be deleted.
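A minimal sketch to see this behavior (ty_row and its contents are illustrative). Note that SORT is not stable by default, so without the STABLE addition the relative order of rows with equal keys - and therefore which duplicate survives - is undefined:
TYPES: BEGIN OF ty_row,
         key  TYPE i,
         text TYPE string,
       END OF ty_row.
DATA itab TYPE STANDARD TABLE OF ty_row WITH EMPTY KEY.

itab = VALUE #( ( key = 1 text = 'first' )
                ( key = 1 text = 'second' )
                ( key = 2 text = 'only' ) ).

SORT itab STABLE BY key.
DELETE ADJACENT DUPLICATES FROM itab COMPARING key.
" itab now holds ( 1 'first' ) and ( 2 'only' ):
" the first line of each group of duplicates is kept.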

Instead of sorting a standard table, you could consider declaring another internal table as a sorted table of the same type with a unique key corresponding to the fields you're comparing to eliminate the duplicates. It's faster, allows you to keep your original table unchanged, and, in my opinion, makes your code more readable because it's easier to understand which rows are kept and which ones are not. Example:
DATA sorted_itab TYPE SORTED TABLE OF ty_row WITH UNIQUE KEY key. " same row type as itab
LOOP AT itab ASSIGNING FIELD-SYMBOL(<itab_row>).
  INSERT <itab_row> INTO TABLE sorted_itab. " a row with a duplicate key is simply not inserted (sy-subrc = 4)
ENDLOOP.

If the data in your itab are fetched from the database, it's better to use the ORDER BY addition in the SELECT; then you can apply DELETE ADJACENT DUPLICATES directly. Sorting costs O(n log n), and it's better to let the DBMS do this kind of operation than ABAP.
Obviously, if you can do the DISTINCT or GROUP BY in SQL, you avoid both the SORT and the DELETE ADJACENT DUPLICATES, and that should solve any performance problems.
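For instance, a sketch against the classic demo table SFLIGHT (any table works the same way):
" let the database sort - the result is then ready for DELETE ADJACENT DUPLICATES
SELECT carrid, connid, fldate
  FROM sflight
  ORDER BY carrid, connid, fldate
  INTO TABLE @DATA(lt_sorted).

" or let the database remove the duplicates itself
SELECT DISTINCT carrid, connid
  FROM sflight
  INTO TABLE @DATA(lt_unique).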

Related

How to uniquely identify rows in two table copies

I have essentially two tables that are copies of each other. One is dynamic and some DML statements happen on it quite constantly, so this table serves as a stage table; the other is used as a way to synchronize the changes from this stage table. So the tables can have different data at different times, and I use a MERGE statement to sync the tables. Something along these lines:
MERGE INTO source s
USING (
    SELECT *
    FROM stage
) st ON ( s.eim_product_id = st.eim_product_id )
...
The problem is that eim_product_id is neither a primary key nor unique. So my merge statement essentially throws this error:
Error report -
ORA-30926: unable to get a stable set of rows in the source tables
And the only pseudo-columns I can think of using are something like an identity column (id_seq INTEGER GENERATED ALWAYS AS IDENTITY) or a rowid. However, the problem is that this approach will not consistently identify the same row across both tables, right? I believe I need some kind of hash that does the job, but I'm unsure what would be the best and simplest approach in this case.
The rowid pseudo-column won't match between the tables, and isn't necessarily constant. Creating a hash could get expensive in terms of CPU; an updated row in the first table wouldn't have a matching hash in the second table for the merge to find. If you only generate the hash at insert and never update then it's just a more expensive, complicated sequence.
Your best bet is an identity column with a unique constraint on the first table, copied to the second table by the merge: it is unique, calculated very efficiently only once at insert, will always identify the same row in both tables, and need never change.
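A minimal sketch of that approach (all names are illustrative; the identity column needs Oracle 12c or later, and the stage table is assumed to be the one rows are first inserted into):
-- the stage table gets the generated, unique key
ALTER TABLE stage ADD row_key INTEGER GENERATED ALWAYS AS IDENTITY;
ALTER TABLE stage ADD CONSTRAINT stage_row_key_uq UNIQUE (row_key);

-- the synchronized copy gets a plain column to receive it
ALTER TABLE source ADD row_key INTEGER;

MERGE INTO source s
USING ( SELECT * FROM stage ) st
   ON ( s.row_key = st.row_key )
WHEN MATCHED THEN
  UPDATE SET s.eim_product_id = st.eim_product_id
WHEN NOT MATCHED THEN
  INSERT ( row_key, eim_product_id )
  VALUES ( st.row_key, st.eim_product_id );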

Delete duplicates excluding columns

I am trying to delete duplicates from an internal table, comparing all columns excluding some of them. Obviously I can list all the columns that I want to compare using COMPARING, but this would not look good in code.
So let's say there are 100 columns and I want to exclude from the comparing 2.
How can I achieve that in a smart way?
You could use the DELETE ADJACENT DUPLICATES statement, where you can define which columns are compared. You'll just have to sort the itab by those columns before this operation.
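If spelling out ~98 COMPARING fields by hand is the concern, one possible workaround - just a sketch, assuming a flat row type and two excluded fields called FIELD_A and FIELD_B (both illustrative) - is to build the field list at runtime with RTTS, sort by it dynamically, and compare neighbouring rows through field symbols:
" collect all component names of itab's row type, minus the exclusions
DATA(lo_tab)    = CAST cl_abap_tabledescr( cl_abap_typedescr=>describe_by_data( itab ) ).
DATA(lo_struct) = CAST cl_abap_structdescr( lo_tab->get_table_line_type( ) ).

DATA lt_fields TYPE string_table.
DATA lt_sort   TYPE abap_sortorder_tab.
LOOP AT lo_struct->components INTO DATA(ls_comp)
     WHERE name <> 'FIELD_A' AND name <> 'FIELD_B'.
  APPEND |{ ls_comp-name }| TO lt_fields.
  APPEND VALUE abap_sortorder( name = |{ ls_comp-name }| ) TO lt_sort.
ENDLOOP.

" sort by the remaining fields (dynamic sort key)
SORT itab STABLE BY (lt_sort).

" walk backwards and drop each row whose compared fields match its predecessor
DATA(lv_idx) = lines( itab ).
WHILE lv_idx > 1.
  ASSIGN itab[ lv_idx ]     TO FIELD-SYMBOL(<curr>).
  ASSIGN itab[ lv_idx - 1 ] TO FIELD-SYMBOL(<prev>).
  DATA(lv_same) = abap_true.
  LOOP AT lt_fields INTO DATA(lv_field).
    ASSIGN COMPONENT lv_field OF STRUCTURE <curr> TO FIELD-SYMBOL(<f1>).
    ASSIGN COMPONENT lv_field OF STRUCTURE <prev> TO FIELD-SYMBOL(<f2>).
    IF <f1> <> <f2>.
      lv_same = abap_false.
      EXIT.
    ENDIF.
  ENDLOOP.
  IF lv_same = abap_true.
    DELETE itab INDEX lv_idx.
  ENDIF.
  lv_idx = lv_idx - 1.
ENDWHILE.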

Removing non-adjacent duplicates comparing all fields

What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table (non-deep structure, arbitrarily large)?
All I can think of is simply sorting the entire thing by all of its fields before running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS. Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numerics?
To provide context, I'm trying to improve performance on some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing duplicates would likely cut runtime in half but I want to make sure that the deduplication doesn't add too much overhead itself.
I would investigate two methods:
1. Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index.
2. Keep a HASHED TABLE keyed on the fields you want to compare while you LOOP through the data table. Use READ TABLE ... TRANSPORTING NO FIELDS on the hashed table to find out whether the value already existed and, if so, remove the row from the data table - otherwise add the values to the hashed table (sketched below).
I'm not sure about the performance, but I would recommend running SAT on a plausible data set for both methods and comparing the results.
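A sketch of the second method, comparing all fields by making the entire line the hashed key; it uses the sy-subrc of INSERT instead of a separate READ TABLE, and ty_row stands for your row type:
DATA lt_seen TYPE HASHED TABLE OF ty_row WITH UNIQUE KEY table_line.

LOOP AT itab ASSIGNING FIELD-SYMBOL(<row>).
  " the INSERT fails with sy-subrc = 4 if an identical row was seen before
  INSERT <row> INTO TABLE lt_seen.
  IF sy-subrc <> 0.
    DELETE itab. " drop the current loop line, keeping the first occurrence
  ENDIF.
ENDLOOP.

This variant keeps the original order of the surviving rows and avoids sorting twice, at the cost of the hashed table's memory.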

Find which table is causing duplicate rows in a view

I have a view in sql server which should be returning one row per project. A few of the projects have multiple rows. The view has a lot of table joins so I would not like to have to manually run a script on each table to find out which one is causing duplicates. Is there a quick automated way to find out which table is the problem table (aka the one with duplicate rows)?
The quickest way I've found is:
1. find an example dupe (see the query below)
2. copy out the query
3. comment out all joins
4. add the joins back one at a time until you get another row
Whichever join you had just added back when the dupes reappeared is the one with multiple matching records.
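For step 1, a quick way to find an example dupe (view and column names are illustrative):
SELECT project_id, COUNT(*) AS row_count
FROM project_view
GROUP BY project_id
HAVING COUNT(*) > 1;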
My technique is to make a copy of the view and modify it to return every column from every table in the order of the FROM clause, with extra columns between with the table names as the column name (see example below). Then select a few rows and slowly scan to the right until you can find the table that does NOT have duplicate row data, and this is the one causing dupes.
SELECT
TableA = '----------', TableA.*,
TableB = '----------', TableB.*
FROM ...
This is usually a very fast way to find out. The problem with commenting out joins is that then you have to comment out the matching columns in the select clause each time, too.
I used a variation of SpectralGhost's technique to get this working, even though neither method really solves the problem of avoiding the manual checking of each table for duplicate rows.
My variation was to use a divide-and-conquer method of commenting out the joins instead of commenting out each one individually. Due to the sheer number of joins, this was much faster.

Skipping primary key conflicts with SQL copy

I have a large collection of raw data (around 300 million rows) with about 10% replicated data. I need to get the data into a database. For the sake of performance I'm trying to use SQL copy. The problem is that when I commit the data, primary key exceptions prevent any of the data from being processed. Can I change the behavior of primary keys such that conflicting data is simply ignored, or replaced? I don't really care either way - I just need one unique copy of each of the data.
I think your best bet would be to drop the constraint, load the data, then clean it up and reapply the constraint.
That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300 million entry database. The duplicate data also has a spatial relationship, which is why I wanted to try to fix the problem while loading the data rather than after I have it all loaded.
Use a select statement to select exactly the data you want to insert, without the duplicates.
Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).
You might check out AskTom's ideas on how to select the non-duplicate rows.
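One common way to write that query - a sketch, where raw_data, the key columns, and payload are illustrative names - is ROW_NUMBER() over the would-be primary key:
CREATE TABLE clean_data AS
SELECT key_col1, key_col2, payload
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER ( PARTITION BY key_col1, key_col2
                               ORDER BY key_col1 ) AS rn  -- arbitrary tie-break
    FROM raw_data r
)
WHERE rn = 1;  -- keep exactly one row per key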