Removing non-adjacent duplicates comparing all fields

What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table in ABAP (non-deep structure, arbitrarily large)?
All I can think of is simply sorting the entire thing by all of its fields before running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS. Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numerics?
To provide context, I'm trying to improve performance on some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing duplicates would likely cut runtime in half but I want to make sure that the deduplication doesn't add too much overhead itself.

I would investigate two methods:
1) Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index to restore the original order.
2) Use a HASHED TABLE keyed on the fields you want to compare and LOOP through the data table. For each row, use READ TABLE ... TRANSPORTING NO FIELDS on the hashed table to check whether the values already exist. If they do, delete the row from the data table; otherwise, add the values to the hashed table.
I'm not sure which performs better, so I would recommend running SAT (runtime analysis) on a plausible data set for both methods and comparing the results.
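The two methods above can be sketched in a language-neutral way. The following is Python, not ABAP, and is only an illustration of the logic: the function names and sample rows are made up, and rows are assumed to be flat tuples of comparable, hashable values. It says nothing about how the ABAP statements perform.

```python
from typing import Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, ...]


def dedup_by_sort(rows: List[Row]) -> List[Row]:
    """Method 1: remember each row's index, sort, drop adjacent duplicates, restore order."""
    # enumerate() plays the role of the auxiliary index field; sorted() is stable,
    # so among equal rows the one with the lowest original index comes first.
    indexed = sorted(enumerate(rows), key=lambda pair: pair[1])
    kept = []
    previous = object()  # sentinel that is equal to no row
    for index, row in indexed:
        if row != previous:  # only the first row of each run of equal rows survives
            kept.append((index, row))
            previous = row
    kept.sort(key=lambda pair: pair[0])  # re-sort by the stored index
    return [row for _, row in kept]


def dedup_by_hash(rows: Iterable[Row]) -> List[Row]:
    """Method 2: keep a hash set of rows already seen and skip repeats."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:  # hash lookup, analogous to READ TABLE on a hashed table
            seen.add(row)
            result.append(row)
    return result


sample = [("A", 1), ("B", 2), ("A", 1), ("C", 3), ("B", 2)]
print(dedup_by_sort(sample))
print(dedup_by_hash(sample))
```

Both functions keep the first occurrence of each row and preserve the original order. The sort-based variant costs two sorts, while the hash-based variant makes a single pass but needs memory for the set of distinct rows.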

Related

Does indexing get impacted on adding new column to sql based db?

In my SQL-based database, I know that adding more rows causes re-indexing, which slows down write operations.
If I just add an additional column and run UPDATE queries that fill in values for this empty column on existing rows, will my write performance be impacted? My index is still based on the PK, so that isn't changing.

Adding a column will generally require rewriting the data. This depends on the database you are using and the type of the column, but keep in mind that space needs to be reserved for the column on each data page.
If you modify columns that are not parts of indexes, then in general you are pretty safe from impacts on indexes.
There is one exception where something could occur. Some columns (such as strings) have a variable size. If the new value does not fit on a page, then the page needs to be split. This split slows down the update and would also affect indexes.
This should not happen for updates to fixed-size columns -- generally numbers, date/times, and char() values in most databases.

Query on varchar vs foreign key performance

This is for SQL Server.
I have a table that will contain a lot of rows and will be queried many times, so I need to make sure my design is optimized.
Just for the question, let's say the table contains 2 columns: Name and Type.
Name is a varchar and it will be unique.
Type can be one of 5 different values (type1... type5). (It could possibly contain more values in the future.)
Should I make Type a varchar (and create an index), or would it be better to create a table of types that contains 5 rows with only a column for the name, and make Type a foreign key?
Is there a performance difference between the two approaches? The queries will not always have the same condition. Sometimes they will filter on the name, the type, or both, with different values.
EDIT: Consider that in my application, if Type were a table, the IDs would be cached so I wouldn't have to query the Type table every time.

Strictly speaking, you'll probably get better query performance if you keep all the data in one table. However doing this is known as "denormalization" and comes with a number of pretty significant drawbacks.
If your table has "a lot of rows", storing an extra varchar field for every row, as opposed to, say, a smallint or even a tinyint, can add a non-trivial amount of size to your table.
If any of that data needs to change, you'll have to perform lots of updates against that table. This means transaction log growth and potential blocking on the table while modification locks are held. If you store it as a separate table with 5-ish rows and need to update the data associated with a type, you just update the one row you need.
Denormalizing data means that the definition of that data is no longer stored in one place, but in multiple places (actually it's stored across every single row that contains those values).
For all the reasons listed above, managing that data (inserts, updates, deletes, and simply defining the data) can quickly become far more overhead than simply normalizing the data correctly in the first place, and for little to no benefit beyond what can be done with proper indexing.
If you find the need to return both the "big" table and some other information from the type table and you're worried about join performance, truthfully, I wouldn't be. That's a generalization, but if your big table has, say, 500M rows in it, I can't see many use cases where you'd want all those rows returned; you're probably going to get a subset. In that case, the join should be more manageable. Provided you index type, the join should be pretty snappy.
If you do go the route of denormalizing your data, I'd recommend still having the lookup table as the "master definition" of what a "type" is, so it's not a conglomeration of millions of rows of data.
If you STILL want to denormalize the data WITHOUT a lookup table, at least put a CHECK constraint on the column to limit which values are allowable or not.
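As a rough sketch of the two designs discussed above, the following uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the types and syntax differ from T-SQL. The table and column names (item, item_type, item_denorm) are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    -- Normalized design: a small lookup table plus an indexed foreign key.
    CREATE TABLE item_type (
        type_id   INTEGER PRIMARY KEY,
        type_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE item (
        name    TEXT PRIMARY KEY,
        type_id INTEGER NOT NULL REFERENCES item_type(type_id)
    );
    CREATE INDEX ix_item_type_id ON item(type_id);

    -- Denormalized design without a lookup table: a CHECK constraint limits the values.
    CREATE TABLE item_denorm (
        name TEXT PRIMARY KEY,
        type TEXT NOT NULL CHECK (type IN ('type1', 'type2', 'type3', 'type4', 'type5'))
    );
""")
conn.executemany("INSERT INTO item_type VALUES (?, ?)",
                 [(i, f"type{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [("alpha", 1), ("beta", 2), ("gamma", 1)])

# Renaming a type touches one lookup row, not every row of the big table.
conn.execute("UPDATE item_type SET type_name = 'primary' WHERE type_id = 1")

names = [row[0] for row in conn.execute("""
    SELECT i.name
    FROM item AS i
    JOIN item_type AS t ON t.type_id = i.type_id
    WHERE t.type_name = 'primary'
    ORDER BY i.name
""")]

# A row that points at an undefined type is rejected by the foreign key.
try:
    conn.execute("INSERT INTO item VALUES ('delta', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# A value outside the allowed list is rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO item_denorm VALUES ('delta', 'type9')")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

print(names, fk_rejected, check_rejected)
```

In both designs the database, not the application, decides which type values are valid. The lookup table additionally gives a single place to rename or describe a type.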

How much is "a lot of rows"?
If it is hundreds of thousands or more, then a Columnstore Index may be a good fit.
It depends on your needs, but usually you would want the type column to be a numeric type (in your case, tinyint).

Normalizing an extremely big table

I face the following issue. I have an extremely big table. This table was inherited from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
It has about 300 columns. All of them have the "text" type, but some of them should really represent other types (for example, integer or datetime), so the text values have to be converted to the appropriate types before use.
The table has more than 100 million rows. The space used by the table will soon reach 1 terabyte.
The table does not have any indices.
The table does not have any partitioning mechanism.
As you may guess, it is impossible to run any reasonable query against this table. Right now people only insert new records into the table, but nobody uses it. So I need to restructure it. I plan to create a new structure and fill it with the data from the old table. Obviously, I will implement partitioning, but that is not the only thing to be done.
One of the most important features of the table is that the fields that are purely textual (i.e. they don't have to be converted into another type) usually contain frequently repeated values. The number of distinct values in a given column is in the range of 5-30. This suggests normalizing: for every such textual column I will create an additional table with the list of all the different values that may appear in this column, give this additional table a (tinyint) primary key, and then use an appropriate foreign key in the original table instead of keeping the text values there. Then I will put an index on this foreign key column. The number of columns to be processed this way is about 100.
It raises the following questions:
1) Would this normalization really speed up queries that impose conditions on some of those 100 fields? If we set aside the space needed to store those columns, would replacing the initial text columns with tinyint columns improve performance? If I do no normalization and simply put an index on the initial text columns, will the performance be the same as for an index on the planned tinyint column?
2) If I do the described normalization, then building a view that shows the text values will require joining my main table with some 100 additional tables. On the positive side, all of those joins will be on "primary key" = "foreign key" pairs. Still, quite a large number of tables must be joined. So the question is: will queries against this view perform no worse than queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS: I created a related question about joining 100 tables: "Joining 100 tables".

You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries. A single row containing 300 text columns is massive and is almost certainly past the 8,060-byte limit for storing a row on a data page, so it is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As far as the overhead added by performing JOIN to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
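The restructuring described in the question can be sketched on a toy scale. The following uses Python's built-in sqlite3 module rather than SQL Server, with only two made-up text columns (status, region) instead of about 100. All table, column, and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the legacy table: every column is text.
    CREATE TABLE legacy (id INTEGER PRIMARY KEY, status TEXT, region TEXT);
    INSERT INTO legacy (status, region) VALUES
        ('open', 'north'), ('closed', 'north'), ('open', 'south'), ('open', 'north');

    -- One small lookup table per low-cardinality text column.
    CREATE TABLE status_ref (status_id INTEGER PRIMARY KEY, status TEXT UNIQUE);
    CREATE TABLE region_ref (region_id INTEGER PRIMARY KEY, region TEXT UNIQUE);
    INSERT INTO status_ref (status) SELECT DISTINCT status FROM legacy;
    INSERT INTO region_ref (region) SELECT DISTINCT region FROM legacy;

    -- The restructured table stores only the small integer keys.
    CREATE TABLE main (
        id        INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES status_ref(status_id),
        region_id INTEGER NOT NULL REFERENCES region_ref(region_id)
    );
    INSERT INTO main (id, status_id, region_id)
    SELECT l.id, s.status_id, r.region_id
    FROM legacy AS l
    JOIN status_ref AS s ON s.status = l.status
    JOIN region_ref AS r ON r.region = l.region;
    CREATE INDEX ix_main_status ON main(status_id);

    -- A view that shows the original text values again.
    CREATE VIEW main_text AS
    SELECT m.id, s.status, r.region
    FROM main AS m
    JOIN status_ref AS s ON s.status_id = m.status_id
    JOIN region_ref AS r ON r.region_id = m.region_id;
""")

original = conn.execute("SELECT id, status, region FROM legacy ORDER BY id").fetchall()
rebuilt = conn.execute("SELECT id, status, region FROM main_text ORDER BY id").fetchall()
distinct_statuses = conn.execute("SELECT COUNT(*) FROM status_ref").fetchone()[0]
print(original == rebuilt, distinct_statuses)
```

The view returns the same rows as the legacy table, so existing readers could be pointed at it. Whether a view over roughly 100 such joins performs acceptably on SQL Server is exactly what would need to be measured with real execution plans.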

Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
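The size argument above comes down to simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes one byte per character, ignores per-column storage overhead, indexes, and the lookup tables themselves, and the example value lengths are made up.

```python
ROWS = 100_000_000  # the question mentions more than 100 million rows
KEY_BYTES = {"tinyint": 1, "smallint": 2, "int": 4}  # SQL Server integer storage sizes


def bytes_saved_per_row(value_bytes: int, key_type: str) -> int:
    """Bytes saved per row by storing a key instead of the text; negative means growth."""
    return value_bytes - KEY_BYTES[key_type]


examples = {"2-char state code": 2, "3-char country code": 3, "20-char description": 20}
for label, value_bytes in examples.items():
    for key_type in KEY_BYTES:
        saved_mb = bytes_saved_per_row(value_bytes, key_type) * ROWS / 1e6
        print(f"{label:>20} -> {key_type:<8}: {saved_mb:+,.0f} MB")
```

With a four-byte int key, two- and three-character codes make the table larger, as the answer says. With the tinyint key proposed in the question, which is enough for 5-30 distinct values, even short codes shrink slightly, and longer values save far more.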
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill fixed-length types. You only have 300 columns, but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together . . . detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.

1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan will show it, and the query can be amended to make it less likely.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).

Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

What's more efficient: string searches, or joins through SQL?

I have two lists of words and I need to find matches (intersection of the two sets.)
Should I store each list as a string and find matches through string functions (like a regular expression) or store the words in a table, and have SQL find matches by joining?

It is almost impossible to say without more information about the problem. Here are some things to consider:
How many different distinct items do you have?
How many different combinations would be on a typical row?
Do your searches require looking for wildcards?
How long are the individual items?
Specifics on the database engine and hardware you are running on.
I want to emphasize that in almost all situations, you want to store the values in another table. Performance is not necessarily the primary reason. More important are ease of updating and deleting individual values, and the ability to support many more types of queries (such as a list of all available values).
But, we can still think about the performance issues. Storing values in a single string simply requires fetching the page with the record on it, and then applying a function that goes through the string. For simple patterns (such as identifying the presence of a fixed substring), this should go quite fast. There are few things that computers do faster than looping through strings and comparing values (assuming a reasonable implementation).
In the fastest possible join, both tables need to be read in, and the keys need to be matched. This requires additional effort. The situation is even worse, because you really want two additional tables, one for the individual string items and the other for the relationship between the original records and the items.
At this point, you may think "gosh, strings seem like a better idea". This is wrong. One of the big differences is in average size. If your items are, on average, longer than, say, 4 characters, then you save space by using a reference table. This saved space immediately translates into improved performance, because there is less I/O. With indexes, the additional tables would be in memory anyway, so the matching would be quite fast.
And, there is the issue of querying. You can use standard SQL functions for queries such as records that have A and B (many string functions are database specific). You can easily find out exactly which items are in the database, and relatively easily find what pairs exist on records. You can keep track of when an item is added to a record, and the first time it appears in the database. Generally, this flexible functionality -- which is just basic SQL functionality -- is what you need when managing this type of data.

Storing the words in a table will be much faster than using a SQL string manipulation function in most circumstances, especially if you can index the words.

I think you're asking if this:
SELECT word FROM table_one WHERE word in (SELECT word FROM table_two)
is faster than this:
SELECT table_one.word FROM table_one
INNER JOIN table_two ON table_one.word = table_two.word
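The first query should be faster, because the second conceptually creates a (potentially large) temporary object (the joined table). That said, many optimizers produce similar plans for both, so compare the execution plans on your own data. Note also that the join returns a word once per matching row in table_two, so the two queries give different results when table_two contains duplicates.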
Note that I assume you have an index on word. Also: if the strings are very long (URLs, for example), this will be very slow, and you should match on a hash instead.
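To see how the two queries behave, here is a small sketch using Python's built-in sqlite3 module. The word lists are made up, and SQLite is only a stand-in for whichever engine is actually in use. It compares results rather than timings.

```python
import sqlite3

list_one = ["apple", "pear", "plum", "fig"]
list_two = ["fig", "kiwi", "apple", "apple"]  # note the repeated word

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_one (word TEXT);
    CREATE TABLE table_two (word TEXT);
    CREATE INDEX ix_one_word ON table_one(word);
    CREATE INDEX ix_two_word ON table_two(word);
""")
conn.executemany("INSERT INTO table_one VALUES (?)", [(w,) for w in list_one])
conn.executemany("INSERT INTO table_two VALUES (?)", [(w,) for w in list_two])

# Query 1: IN with a subquery (each word of table_one appears at most once per row).
in_rows = [r[0] for r in conn.execute(
    "SELECT word FROM table_one "
    "WHERE word IN (SELECT word FROM table_two) ORDER BY word")]

# Query 2: inner join (one output row per matching pair of rows).
join_rows = [r[0] for r in conn.execute(
    "SELECT table_one.word FROM table_one "
    "INNER JOIN table_two ON table_one.word = table_two.word "
    "ORDER BY table_one.word")]

# For comparison, the same intersection computed in application code.
set_rows = sorted(set(list_one) & set(list_two))

print(in_rows, join_rows, set_rows)
```

The IN query and the set intersection agree, while the join repeats "apple" because it matches two rows in table_two. Adding DISTINCT to the join, or keeping the words unique in each table, removes that difference.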

PostgreSQL: performance impact of extra columns

Given a large table (10-100 million rows) what's the best way to add some extra (unindexed) columns to it?
Just add the columns.
Create a separate table for each extra column, and use joins when you want to access the extra values.
Does the answer change depending on whether the extra columns are dense (mostly not null) or sparse (mostly null)?

A column with a NULL value can be added to a row without any changes to the rest of the data page in most cases. Only one bit has to be set in the NULL bitmap. So, yes, a sparse column is much cheaper to add in most cases.
Whether it is a good idea to create a separate 1:1 table for additional columns very much depends on the use case. It is generally more expensive. For starters, there is an overhead of 28 bytes (heap tuple header plus item identifier) per row and some additional overhead per table. It is also much more expensive to JOIN rows in a query than to read them in one piece. And you need to add a primary / foreign key column plus an index on it. Splitting may be a good idea if you don't need the additional columns in most queries. Mostly it is a bad idea.
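To put the per-row overhead into perspective, here is a rough calculation for the table sizes mentioned in the question. It is a lower bound under stated assumptions: the 28-byte figure is taken from the answer above, the key is assumed to be a 4-byte integer, and alignment padding, the index on the key, and the extra column's own data are all ignored.

```python
TUPLE_OVERHEAD_BYTES = 28  # per-row figure quoted in the answer above
KEY_BYTES = 4              # assumption: the 1:1 table is keyed by a 4-byte integer


def side_table_min_bytes(rows: int) -> int:
    """Rough lower bound on what a 1:1 side table costs before storing any values."""
    return rows * (TUPLE_OVERHEAD_BYTES + KEY_BYTES)


for rows in (10_000_000, 100_000_000):
    extra_gb = side_table_min_bytes(rows) / 1e9
    print(f"{rows:,} rows: at least {extra_gb:.2f} GB of overhead")
```

This cost is paid once per extra table, so one side table per additional column multiplies it by the number of columns.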
Adding a column is fast in PostgreSQL. Updating the values in the column is what may be expensive, because every UPDATE writes a new row (due to the MVCC model). Therefore, it is a good idea to update multiple columns at once.
Database page layout in the manual.
How to calculate row sizes:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
