Does indexing get impacted when adding a new column to a SQL-based DB? - sql

For my SQL-based DB, I know that adding more rows causes re-indexing, which slows down write operations.
If I just add an additional column and push UPDATE queries on these rows so that values are added into the empty column, will my write performance be impacted in this case? My index is still based on the PK, so that's not changing.

Adding a column will generally require rewriting the data. This depends on the database you are using and the type of the column, but in general space needs to be reserved for the column on each data page.
If you modify columns that are not part of any index, then in general you are pretty safe from impacts on indexes.
There is one exception where something could occur. Some columns (such as strings) have a variable size. If the new value does not fit on a page, then the page needs to be split. This split slows down the update and would also affect indexes.
This should not happen for updates to fixed-size columns -- generally numbers, date/times, and char() values in most databases.
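To make that concrete, here is a rough sketch of the pattern described in the question. The table and column names are invented, and syntax details (ADD COLUMN, REPEAT vs. REPLICATE) differ between engines:

    -- Hypothetical table; the PK index is the only index, and the new column is not part of it.
    ALTER TABLE customers ADD COLUMN loyalty_points INT NULL;

    -- Backfilling a fixed-size numeric column: rows can usually be updated in place,
    -- so the PK index itself is not touched.
    UPDATE customers SET loyalty_points = 0;

    -- Backfilling a variable-size column: a wide value may no longer fit on its page,
    -- which can cause page splits and extra write overhead.
    ALTER TABLE customers ADD COLUMN notes VARCHAR(2000) NULL;
    UPDATE customers SET notes = REPEAT('x', 1500) WHERE customer_id = 1;  -- REPEAT is MySQL/PostgreSQL; SQL Server uses REPLICATE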

Related

Removing non-adjacent duplicates comparing all fields

What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table (non-deep structure, arbitrarily large)?
All I can think of is simply sorting the entire thing by all of its fields before running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS. Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numerics?
To provide context, I'm trying to improve performance on some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing duplicates would likely cut runtime in half but I want to make sure that the deduplication doesn't add too much overhead itself.
I would investigate two methods:
Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index.
Using a HASHED TABLE for the fields you want to compare, LOOP through the data table. Use READ TABLE .. TRANSPORTING NO FIELDS on the hashed table to find out whether the value already existed and if so, remove it - otherwise add the values to the hashed table.
I'm not sure about the performance, but I would recommend using SAT on a plausible data set for both methods and comparing the results.

Normalizing an extremely big table

I face the following issue. I have an extremely big table. This table is a heritage from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have the "text" type, but some of them eventually should represent other types (for example, integer or datetime). So one has to convert these text values into the appropriate types before using them
the table has more than 100 million rows. The space for the table would soon reach 1 terabyte
the table does not have any indices
the table does not have any implemented mechanisms of partitioning.
As you may guess, it is impossible to run any reasonable query against this table. Now people only insert new records into the table but nobody uses it. So I need to restructure it. I plan to create a new structure and refill the new structure with the data from the old table. Obviously, I will implement partitioning, but it is not the only thing to be done.
One of the most important features of the table is that those fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values. So the actual variety of values in a given column is in the range of 5-30 different values. This induces the idea of normalization: for every such textual column I will create an additional table with the list of all the different values that may appear in this column, then I will create a (tinyint) primary key in this additional table, and then I will use an appropriate foreign key in the original table instead of keeping those text values in the original table. Then I will put an index on this foreign key column. The number of columns to be processed this way is about 100.
It raises the following questions:
Would this normalization really increase the speed of the queries imposing conditions on some of those 100 fields? If we set aside the space needed to keep those columns, would there be any increase in performance due to the substitution of the initial text columns with tinyint columns? If I do no normalization and simply put an index on those initial text columns, will the performance be the same as for the index on the planned tinyint column?
If I do the described normalization, then building a view showing the text values will require joining my main table with some 100 additional tables. A positive point is that I'll do those joins on "primary key" = "foreign key" pairs. But still quite a large number of tables have to be joined. Here is the question: will the performance of queries made against this view be no worse than the performance of queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables:
Joining 100 tables
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries; currently, having a single row containing 300 text columns is massive, and is almost certainly past the 8,060-byte limit for storing row data in-page... and is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As far as the overhead added by performing JOINs to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
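As a hedged sketch of the lookup-table approach described in the question (ref_status, big_table and the column names are invented for illustration; this is one possible T-SQL shape, not the actual schema):

    -- Lookup table holding the 5-30 distinct text values of one column.
    CREATE TABLE ref_status (
        status_id   TINYINT      NOT NULL PRIMARY KEY,
        status_name VARCHAR(100) NOT NULL UNIQUE
    );

    -- The big table keeps only the tinyint foreign key, which is then indexed.
    CREATE TABLE big_table (
        id        BIGINT  NOT NULL PRIMARY KEY,
        status_id TINYINT NOT NULL REFERENCES ref_status (status_id)
        -- plus the remaining, properly typed columns
    );

    CREATE INDEX ix_big_table_status ON big_table (status_id);

    -- A view can reassemble the readable text values via PK = FK joins.
    CREATE VIEW v_big_table AS
    SELECT b.id, r.status_name
    FROM   big_table b
    JOIN   ref_status r ON r.status_id = b.status_id;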
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill other types. You only have 300 columns but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together: detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan can identify this and the query can be amended to reduce the likelihood of this.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

Index all columns

Knowing that an indexed column leads to better performance, is it worthwhile to index all columns in all tables of the database? What are the advantages/disadvantages of such an approach?
If it is worthwhile, is there a way to auto-create indexes in SQL Server? My application dynamically adds tables and columns (depending on the user configuration) and I would like to have them auto-indexed.
It is difficult to imagine real-world scenarios where indexing every column would be useful, for the reasons mentioned above. The type of scenario would require a bunch of different queries, all accessing exactly one column of the table. Each query could be accessing a different column.
The other answers don't address the issues on the select side of the query. Obviously, maintaining indexes is an issue, but if you are creating the table(s) once and then reading many, many times, the overhead of updates/inserts/deletes is not a consideration.
An index contains the original data along with pointers to the records/pages where the data resides. The structure of an index makes it fast to do things like: find a single value, retrieve values in order, count the number of distinct values, and find the minimum and maximum values.
An index does not only take up space on disk. More importantly, it occupies memory. And memory contention is often the factor that determines query performance. In general, building an index on every column will occupy more space than the original data. (One exception would be a column that is relatively wide and has relatively few values.)
In addition, to satisfy many queries you may need one or more indexes plus the original data. Your page cache gets rather filled with data, which can increase the number of cache misses, which in turn incurs more overhead.
I wonder if your question is really a sign that you have not modelled your data structures adequately. There are few cases where you want users to build ad hoc permanent tables. More typically, their data would be stored in a pre-defined format, which you can optimize for the access requirements.
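As a small illustration of the read-side benefits mentioned above (the orders table, its columns and the index are hypothetical; T-SQL syntax assumed):

    -- One selective, frequently searched column gets an index; the rest do not.
    CREATE INDEX ix_orders_order_date ON orders (order_date);

    -- Queries such an index can serve cheaply:
    SELECT MIN(order_date), MAX(order_date) FROM orders;         -- min/max read the ends of the index
    SELECT * FROM orders WHERE order_date = '2023-01-15';        -- single-value lookup
    SELECT TOP 10 * FROM orders ORDER BY order_date DESC;        -- ordered retrieval without a sort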
No, because you have to take into consideration that every time you add or update a record, your indexes have to be recalculated, and having indexes on all columns would take a lot of time and lead to bad performance.
So for databases like data warehouses, where only SELECT queries are used, it can be a good idea, but on a normal transactional database it's a bad idea.
Also, just because you use a column in a WHERE clause doesn't mean you have to add an index on it.
Try to index columns whose values are almost all unique, like a primary key, and that you don't edit often.
A bad idea would be to index the sex of a person, because there are only 2 possible values; the index would merely split the data in two, and a search would still have to look at almost every record.
No, you should not index all of your columns, and there are several reasons for this:
There is a cost to maintain each index during an insert, update or delete statement, that will cause each of those transactions to take longer.
It will increase the storage required since each index takes up space on disk.
If the column values are not selective enough, the index will likely be ignored (e.g. a gender flag).
Composite indexes (indexes with more than one column) can greatly benefit performance for frequently run WHERE, GROUP BY, ORDER BY or JOIN clauses, and multiple single-column indexes usually cannot be combined to the same effect.
You are much better off using explain plans and observed data access patterns, and adding indexes when necessary (and only when necessary, IMHO), rather than creating them all up front.
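For example (the orders table and index are hypothetical, and how you inspect a plan depends on your tooling), the workflow is roughly: look at the plan for a frequent slow query, then add only the index that query needs:

    -- Inspect the estimated plan first (in SSMS: Ctrl+L, or SET SHOWPLAN_XML ON).
    SELECT order_id, order_date
    FROM   orders
    WHERE  customer_id = 42
    ORDER BY order_date DESC;

    -- If the plan shows a full scan for this frequent query, add one targeted composite index.
    CREATE INDEX ix_orders_customer_date ON orders (customer_id, order_date DESC);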
No, there is overhead in maintaining the indexes, so indexing all columns would slow down all of your insert, update and delete operations. You should index the columns that you are frequently referencing in WHERE clauses, and you will see a benefit.
Indexes take up space. And they take up time to create, rebuild, maintain, etc. So there's not a guaranteed return on performance for indexing just any old column. You should index the columns that give the performance for the operations you'll use. Indexes help reads, so if you're mostly reading, index columns that will be searched on, sorted by, or joined to other tables relationally. Otherwise, it's more expensive than what benefit you may see.
Every index requires additional CPU time and disk I/O overhead during inserts and deletions.
Indices on non-primary keys might have to be changed on updates, although an index on the primary key might not (this is because updates typically do not modify the primary-key attributes).
Each extra index requires additional storage space.
For queries which involve conditions on several search keys, efficiency might not be bad even if only some of the keys have indices on them.
Therefore, database performance is improved less by adding indices when many indices already exist.

PostgreSQL: performance impact of extra columns

Given a large table (10-100 million rows) what's the best way to add some extra (unindexed) columns to it?
Just add the columns.
Create a separate table for each extra column, and use joins when you want to access the extra values.
Does the answer change depending on whether the extra columns are dense (mostly not null) or sparse (mostly null)?
A column with a NULL value can be added to a row without any changes to the rest of the data page in most cases. Only one bit has to be set in the NULL bitmap. So, yes, a sparse column is much cheaper to add in most cases.
Whether it is a good idea to create a separate 1:1 table for additional columns very much depends on the use case. It is generally more expensive. For starters, there is an overhead of 28 bytes (heap tuple header plus item identifier) per row and some additional overhead per table. It is also much more expensive to JOIN rows in a query than to read them in one piece. And you need to add a primary / foreign key column plus an index on it. Splitting may be a good idea if you don't need the additional columns in most queries. Mostly it is a bad idea.
Adding a column is fast in PostgreSQL. Updating the values in the column is what may be expensive, because every UPDATE writes a new row (due to the MVCC model). Therefore, it is a good idea to update multiple columns at once.
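A rough sketch of how that plays out (the measurements table and its columns are made up; adding nullable columns without a rewrite is the normal case in PostgreSQL):

    -- Adding nullable columns is essentially a catalog change; the existing rows are not rewritten.
    ALTER TABLE measurements
        ADD COLUMN note       text,
        ADD COLUMN checked_at timestamptz;

    -- The backfill is the expensive part: every UPDATE writes a new row version (MVCC),
    -- so set all new columns in one pass instead of one UPDATE per column.
    UPDATE measurements
    SET    note       = 'backfilled',
           checked_at = now()
    WHERE  note IS NULL;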
Database page layout in the manual.
How to calculate row sizes:
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL

MySQL query performance

I have the following table structure:
EVENT_ID(INT) EVENT_NAME(VARCHAR) EVENT_DATE(DATETIME) EVENT_OWNER(INT)
I need to add the field EVENT_COMMENTS which should be a text field or a very big VARCHAR.
I have 2 places where I query this table, one is on a page that lists all the events (in that page I do not need to display the event_comments field).
And another page that loads all the details for a specific event, which I will need to display the event_comments field on.
Should I create an extra table with the event_id and the event_comments for that event? Or should I just add that field on the current table?
In other words, what I'm asking is, if I have a text field in my table, but I don't SELECT it, will it affect the performance of the queries to my table?
Adding a field to your table makes it larger.
This means that:
Table scans will take more time
Less records will fit into a page and hence into the cache, thus increasing the risk of cache misses
Selecting this field via a join from a separate table, however, would take more time.
So adding this field into this table will make the queries which don't select it run slower, and those which do select it run faster.
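To illustrate the two options from the question (the events table name is assumed; the column names come from the question; MySQL syntax):

    -- Option 1: add the comments column to the existing table.
    ALTER TABLE events ADD COLUMN event_comments TEXT;

    -- Option 2: keep the large text in a 1:1 side table and join only on the detail page.
    CREATE TABLE event_comments (
        event_id       INT PRIMARY KEY,
        event_comments TEXT,
        FOREIGN KEY (event_id) REFERENCES events (event_id)
    );

    -- List page: no join and no TEXT column touched.
    SELECT event_id, event_name, event_date, event_owner FROM events;

    -- Detail page: fetch the comments only when needed.
    SELECT e.*, c.event_comments
    FROM   events e
    LEFT JOIN event_comments c ON c.event_id = e.event_id
    WHERE  e.event_id = 42;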
Yes, it affects the performance. At least, according to this article published yesterday.
According to it, if you don't want to suffer performance issues, it's better to put them in a separate table and JOIN them when needed.
This is the relevant section:
Try to limit the number of columns in a table. Too many columns in a table can make the scan time for queries much longer than if there are just a few columns. In addition, if you have a table with many columns that aren't typically used, you are also wasting disk space with NULL value fields. This is also true with variable size fields, such as text or blob, where the table size can grow much larger than needed. In this case, you should consider splitting off the additional columns into a different table, joining them together on the primary key of the records.
You should put it in the same table.
Yes, it probably will affect other queries on the same table, and you should probably do it anyway, as you probably don't care.
Depending on the engine, blobs are either stored inline (MyISAM), partially off-page (InnoDB) or entirely off-page (InnoDB Plugin, in some cases).
These have the potential to decrease the number of rows per page, and therefore increase the number of IO operations to satisfy some query.
However, it is extremely unlikely that you care, so you should just do it anyway. How many rows does this table have? 10^9 ? How many of them have non-null values for the blob?
It shouldn't be too much of a hit, but if you're worried about performance, you should always run a few benchmarks and run EXPLAINs on your queries to see the true effect.
How many events are you expecting to have?
Chances are that if you don't have a truckload of hundreds of thousands of events, your performance will be good in any case.