I have a query that joins many tables, which results in poor performance.
To improve the performance, I've created an indexed view, and I see a significant improvement when querying the view with a date filter. However, my concern is about the storage of the index. From what I have read, the unique clustered index is stored on SQL Server. Does that mean it separately stores all the data resulting from the joins within the view? If so, and if I've included all columns from the joined tables in the view, would the disk space consumed on the server be approximately double what it would be without the indexed view? And every time I enter data into the underlying tables, is the data duplicated for the indexed view?
That is correct. An indexed view is basically an additional table that contains a copy of all the data in a sorted way. That's what makes it so fast, but as with everything in SQL Server land, it comes at a price - in this case the additional storage required and the additional time required to keep all the copies of the data in sync.
The same is true for a normal index on a table. It is also a copy of the index keys (plus some information about where to find the original row) that needs additional storage and additional time during updates to be maintained.
To figure out whether adding an index on a table or view makes sense, you have to look at the entire system and see if the performance improvement for the one query is worth the performance degradation of other queries.
In your case you should also (first) check if additional indexes on the underlying tables might help your queries.
Yes, that is correct. An indexed view persists all data in the view separately from the source tables. Depending on the columns and joins, the data is duplicated, and can actually be many times larger than the source tables.
Pretty much, yeah. You've made a trade-off where you get better performance in return for some additional effort by the engine, plus the additional storage needed to persist it.
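To make the duplication concrete, here is a minimal sketch that persists a join result by hand. It uses Python's built-in sqlite3 as a stand-in, because SQLite has no indexed views; the table and column names (customers, orders, orders_with_customer) are made up for illustration. It only mimics what an indexed view stores, not how SQL Server maintains it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES
        (10, 1, '2024-01-01'), (11, 1, '2024-01-02'), (12, 1, '2024-01-03'),
        (13, 2, '2024-01-01'), (14, 2, '2024-01-02'), (15, 2, '2024-01-03');

    -- Hand-made stand-in for an indexed view: the join result is stored
    -- as its own table with a unique index on top.
    CREATE TABLE orders_with_customer AS
        SELECT o.order_id, o.order_date, c.customer_id, c.name
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id;
    CREATE UNIQUE INDEX ux_owc ON orders_with_customer (order_id);
""")

# How many times is a customer name physically stored in each place?
source_names = conn.execute("SELECT COUNT(name) FROM customers").fetchone()[0]
persisted_names = conn.execute(
    "SELECT COUNT(name) FROM orders_with_customer"
).fetchone()[0]
print(source_names, persisted_names)
conn.close()
```

Each customer name is stored once in the source table but once per order in the persisted join, which is why the persisted copy can end up larger than the tables it is built from.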
In the project I'm working with, we have a table which sees a lot of read/write activity.
It's sort of a "visibilities" table; a background job is constantly running to generate records into this table based on the creation of other business domain entities.
This table needs to get searched against (and is being updated) on a regular basis, and we're running into performance problems because of it.
When we introduce indexes to improve the speed of search, it causes timeout issues with writing to the table when people perform updates. The table is relatively large and the search criteria are a bit complex, so the indexes are large.
What I'm wondering is: if I added an "archived" bit column to the table and consistently marked somewhat old records as archived, could I restructure the indexes to only index data where Archived=0? Would that allow me to reduce the size of the indexes (and thus the performance impact of writing to those tables)?
I would assume no since the indexes must still consider which records are archived or not, but I'm not a SQL expert so I wanted to check.
If that would not be an ideal setup, what might I do to accomplish a similar result?
You can create a Filtered Index, which indexes only the rows where Archived=0, and can be used only by queries that specify WHERE Archived=0 and ....
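As a rough illustration of the idea, the sketch below uses Python's built-in sqlite3, where the equivalent feature is called a partial index. The table and index names (visibility, ix_active_owner) are invented for the example, and the SQL Server syntax and optimizer behaviour will differ in detail; treat it as a sketch of the concept only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visibility (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL,
        archived INTEGER NOT NULL DEFAULT 0
    );
    -- Only rows with archived = 0 get an entry in this index.
    CREATE INDEX ix_active_owner ON visibility (owner_id) WHERE archived = 0;
""")
conn.executemany(
    "INSERT INTO visibility (owner_id, archived) VALUES (?, ?)",
    [(i % 50, 0 if i % 10 == 0 else 1) for i in range(1000)],
)


def plan(sql):
    """Return the query plan details as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# The query repeats the index's filter, so the index is eligible.
plan_active = plan(
    "SELECT id FROM visibility WHERE archived = 0 AND owner_id = 7"
)
# No filter on archived, so the partial index cannot answer this query.
plan_all = plan("SELECT id FROM visibility WHERE owner_id = 7")
print(plan_active)
print(plan_all)
```

The second query has to fall back to scanning the table, which matches the caveat above: the filtered index only helps queries that include the same Archived=0 condition.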
We have a table design that consists of 10,000,000 records and 200,000 columns.
The columns are a mixture of:
Binary flags.
Integers.
The queries need to perform and / or operations on 1-100 columns at a time, and should complete in under 0.1 seconds, returning only a projection/subset of each matched row.
Around 10 new columns get added per day.
Around 1,000 new rows get added per day.
There are no joins.
Which DBMS is best suited for this?
Reason behind this approach:
The columns are materialized indexes from user-defined queries: that's why new columns get added each day (as more users come up with their own queries). The other option would be to not use materialized views, and have the users' queries perform joins. The problem here is that the queries could take any form, and in aggregate there would be a large number of very different execution plans across everyone's queries... since the user defines the query, it's kinda impossible to optimise a traditional SQL database using indexes, normalised tables, etc.
First, I'd suggest measuring ad-hoc JOINs, and only doing further optimization if you find the performance lacking. I understand it could be difficult to measure every possible query, but you may be able to cover most common/representative cases, and if they perform well-enough just stop there. There is a lot that can be done with good indexing!
Second, and only if the measurements above warrant it, create a new separate materialized view for each ad-hoc query.
Some databases will be able to maintain such views automatically for you [1], so if the "base" data changes, relevant results will be automatically added or removed from the materialized view (just as they would from the "live" query result).
Other databases may allow periodic refresh [2].
Be warned though: maintaining materialized views is not free, and having thousands of them (especially if they are constantly kept up-to-date, as opposed to periodically refreshed) will definitely impact the insert/update/delete performance on the base data!
[1] E.g. SQL Server indexed views.
[2] E.g. Oracle Materialized views, although it looks like 12c can also do something close to SQL Server's immediate refresh.
Leaving aside why you want thousands of columns, you can look at the databases that support an unlimited number of columns.
References: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
I have a table which will have 3 different queries run against it.
The first query has a where clause which uses two of the columns
The second three of the columns
The third four of the columns
If I run each query through the Estimated Execution Plan, SQL Server Management Studio suggests adding a new, different index for each query.
I'm happy to add three different indexes for maximum performance. The table is never updated and rarely inserted into.
However, is it a good idea to add multiple indexes to the same table, each to accommodate a different query?
If you know the queries well enough, then add the indexes.
Indexes primarily add overhead when you are modifying the data (insert/update/delete). They do incur a bit of extra overhead by also taking up memory in the page cache. This cuts two ways. Sometimes the index itself can entirely replace the table. Sometimes both are needed, depending on the query and the indexes.
There is little downside if the data is not changing, and a lot of potential upside. Because SQL Server is recommending the indexes, you can be pretty confident that they will get used and should increase the performance of the query.
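The point that an index can sometimes entirely replace the table can be sketched with Python's built-in sqlite3, used here only as a convenient stand-in for SQL Server. The table and index names (readings, ix_site_day) are hypothetical. When every column a query needs is in the index, SQLite's plan reports a covering index and the table itself is not read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (
        id INTEGER PRIMARY KEY,
        site TEXT,
        day TEXT,
        value REAL,
        note TEXT
    );
    CREATE INDEX ix_site_day ON readings (site, day);
""")


def plan(sql):
    """Return the query plan details as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Both columns used by this query live in the index: no table lookup needed.
covered = plan("SELECT day FROM readings WHERE site = 'A'")
# 'note' is not in the index, so the table rows must be fetched as well.
not_covered = plan("SELECT note FROM readings WHERE site = 'A'")
print(covered)
print(not_covered)
```

In the second case both the index and the table are touched, which is the "sometimes both are needed" situation described above.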
I face the following issue. I have an extremely big table. This table is a heritage from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have the "text" type, but some of them should really represent other types (for example, integer or datetime), so one has to convert these text values to the appropriate types before using them
the table has more than 100 million rows. The space for the table will soon reach 1 terabyte
the table does not have any indices
the table does not have any implemented mechanisms of partitioning.
As you may guess, it is impossible to run any reasonable query against this table. Now people only insert new records into the table but nobody uses it. So I need to restructure it. I plan to create a new structure and refill the new structure with the data from the old table. Obviously, I will implement partitioning, but it is not the only thing to be done.
One of the most important features of the table is that those fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values. So the actual variety of values in a given column is in the range of 5-30 different values. This suggests normalization: for every such textual column I will create an additional table with the list of all the different values that may appear in this column, then I will create a (tinyint) primary key in this additional table, and then use an appropriate foreign key in the original table instead of keeping those text values in the original table. Then I will put an index on this foreign key column. The number of columns to be processed this way is about 100.
It raises the following questions:
Would this normalization really speed up queries that impose conditions on some of those 100 fields? Leaving aside the space needed to store those columns, would there be any performance gain from replacing the initial text columns with tinyint columns? If I skip normalization and simply put an index on the initial text columns, will the performance be the same as with an index on the planned tinyint column?
If I do the described normalization, then building a view showing the text values will require joining my main table with some 100 additional tables. On the plus side, all of those joins are on "primary key" = "foreign key" pairs. Still, that is a lot of tables to join. So the question is: will queries against this view perform no worse than queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables: "Joining 100 tables"
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries; currently a single row containing 300 text columns is massive, and is almost certainly past the 8,060 byte limit for storing a row on a data page... and is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA Allocation Units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As far as the overhead added by performing JOIN to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
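Here is a minimal sketch of the lookup-table pattern under discussion, again using Python's built-in sqlite3 as a stand-in for SQL Server (SQLite has no TINYINT, so a plain INTEGER key is used). All table, column, and view names (legacy, status_lookup, records, records_text) are invented for the example, and only one text column is normalized rather than 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the legacy table with a repetitive text column.
    CREATE TABLE legacy (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO legacy (status)
        VALUES ('open'), ('closed'), ('open'), ('pending'), ('open');

    -- One small lookup table per repetitive text column.
    CREATE TABLE status_lookup (
        status_id INTEGER PRIMARY KEY,
        status TEXT NOT NULL UNIQUE
    );
    INSERT INTO status_lookup (status) SELECT DISTINCT status FROM legacy;

    -- The restructured table stores the small key instead of the text.
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES status_lookup (status_id)
    );
    INSERT INTO records (id, status_id)
        SELECT l.id, s.status_id
        FROM legacy AS l
        JOIN status_lookup AS s ON s.status = l.status;
    CREATE INDEX ix_records_status ON records (status_id);

    -- A view that joins the text back in for readers who want it.
    CREATE VIEW records_text AS
        SELECT r.id, s.status
        FROM records AS r
        JOIN status_lookup AS s ON s.status_id = r.status_id;
""")

original = conn.execute("SELECT id, status FROM legacy ORDER BY id").fetchall()
rebuilt = conn.execute(
    "SELECT id, status FROM records_text ORDER BY id"
).fetchall()
lookup_rows = conn.execute("SELECT COUNT(*) FROM status_lookup").fetchone()[0]
print(rebuilt, lookup_rows)
conn.close()
```

The view returns the same rows as the legacy table while the main table holds only small keys. Whether the join cost is acceptable at 100 lookup tables is something to measure on the real data.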
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill other types. You only have 300 columns but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together . . . detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is mostly a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan will show it, and the query can be amended to make it less likely.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.
Knowing that an indexed column leads to better performance, is it worth indexing all columns in all tables of the database? What are the advantages/disadvantages of such an approach?
If it is worth it, is there a way to auto-create indexes in SQL Server? My application dynamically adds tables and columns (depending on the user configuration) and I would like to have them auto-indexed.
It is difficult to imagine real-world scenarios where indexing every column would be useful, for the reasons mentioned above. The type of scenario would require a bunch of different queries, all accessing exactly one column of the table. Each query could be accessing a different column.
The other answers don't address the issues during the select side of the query. Obviously, maintaining indexes is an issue, but if you are creating the table/s once and then reading many, many times, the overhead of updates/inserts/deletes is not a consideration.
An index contains the original data along with pointers to the records/pages where the data resides. The structure of an index makes it fast to do things like: find a single value, retrieve values in order, count the number of distinct values, and find the minimum and maximum values.
An index does not only take up space on disk. More importantly, it occupies memory. And memory contention is often the factor that determines query performance. In general, building an index on every column will occupy more space than the original data. (One exception would be a column that is relatively wide and has relatively few values.)
In addition, to satisfy many queries you may need one or more indexes plus the original data. Your page cache gets rather filled with data, which can increase the number of cache misses, which in turn incurs more overhead.
I wonder if your question is really a sign that you have not modelled your data structures adequately. There are few cases where you want users to build ad hoc permanent tables. More typically, their data would be stored in a pre-defined format, which you can optimize for the access requirements.
No, because every time you add or update a record, the indexes have to be updated too, and having indexes on all columns would take a lot of time and lead to bad performance.
So for databases like data warehouses, where only select queries are run, it can be a good idea, but on a normal database it's a bad idea.
Also, just because you use a column in a where clause doesn't mean you have to add an index on it.
Try to find a column where the values are almost all unique, like a primary key, and that you don't edit often.
A bad idea would be to index the sex of a person, because there are only 2 possible values, so the index would only split the data in two and the search would still touch almost every record.
No, you should not index all of your columns, and there are several reasons for this:
There is a cost to maintain each index during an insert, update or delete statement, that will cause each of those transactions to take longer.
It will increase the storage required since each index takes up space on disk.
If the column values are not selective (ex: a gender flag), the index will likely be ignored.
Composite indexes (indexes with more than one column) can greatly benefit performance for frequently run WHERE, GROUP BY, ORDER BY or JOIN clauses, and several single-column indexes are generally not a substitute for one.
You are much better off using Explain plans and data access and adding indexes when necessary (and only when necessary, IMHO), rather than creating them all up front.
No, there is overhead in maintaining the indexes, so indexing all columns would slow down all of your insert, update and delete operations. You should index the columns that you are frequently referencing in WHERE clauses, and you will see a benefit.
Indexes take up space. And they take up time to create, rebuild, maintain, etc. So there's not a guaranteed return on performance for indexing just any old column. You should index the columns that give the performance for the operations you'll use. Indexes help reads, so if you're mostly reading, index columns that will be searched on, sorted by, or joined to other tables relationally. Otherwise, it's more expensive than what benefit you may see.
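The storage side of this is easy to check for yourself. The sketch below uses Python's built-in sqlite3 with a temporary database file (SQLite rather than SQL Server, so the exact numbers will not carry over); the table t and its columns are hypothetical. It measures the database size before and after indexing every column.

```python
import os
import sqlite3
import tempfile


def db_bytes(conn):
    """Size of the database as page count times page size."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return page_count * page_size


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c TEXT)"
    )
    conn.executemany(
        "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
        [(i, i % 97, f"value-{i}") for i in range(5000)],
    )
    conn.commit()
    size_before = db_bytes(conn)

    # Index every non-key column, as the question proposes.
    for column in ("a", "b", "c"):
        conn.execute(f"CREATE INDEX ix_t_{column} ON t ({column})")
    conn.commit()
    size_after = db_bytes(conn)
    conn.close()

print(size_before, size_after)
```

Every index added here also has to be updated on each later insert, update, or delete, which is the maintenance cost mentioned above.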
Every index requires additional CPU time and disk I/O overhead during inserts and deletions.
Indices on non-primary keys might have to be changed on updates, although an index on the primary key might not (this is because updates typically do not modify the primary-key attributes).
Each extra index requires additional storage space.
For queries which involve conditions on several search keys, efficiency might not be bad even if only some of the keys have indices on them. Therefore, database performance is improved less by adding indices when many indices already exist.