SQL Server 2005: Index bigger than data stored

I created 1 database with 2 file groups: 1 primary and 1 index.
The primary filegroup contains one data file (*.mdf) that stores all tables.
The index filegroup contains one index file (*.ndf) that stores all indexes.
Most of the indexes are non-clustered.
After a short time of using the database, the data file is 2 GB but the index file is 12 GB. I do not know what has gone wrong.
I have some questions:
How do I reduce the size of the index file?
How do I know what is stored in the index file?
How do I trace all operations that affect the index file?
How do I limit the growth of the index file?

How do I reduce the size of the index file?
Drop some unneeded indexes or reduce the number of columns in existing ones. Remember that the clustered index column(s) are included as "hidden" columns in every non-clustered index.
If you have an index on (a, b, c, d) and an index on (a, b, c), you might consider dropping the second one, as the first one covers it.
You may also be able to find potentially unused indexes by looking at sys.dm_db_index_usage_stats.
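For example, a query along these lines (a sketch; usage stats are reset whenever the instance restarts, so treat the results with care) lists non-clustered indexes that have not been read since the last restart:
SELECT o.name AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes i
JOIN sys.objects o
    ON o.object_id = i.object_id
LEFT JOIN sys.dm_db_index_usage_stats s
    ON s.object_id = i.object_id
   AND s.index_id = i.index_id
   AND s.database_id = DB_ID()
WHERE o.type = 'U'                                              -- user tables only
  AND i.index_id > 1                                            -- non-clustered indexes
  AND ISNULL(s.user_seeks + s.user_scans + s.user_lookups, 0) = 0
ORDER BY s.user_updates DESC;                                   -- costly to write, never read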
How do I know what is stored in the index file?
It will store whatever you defined it to store! The following query will help you tell which indexes are using the most space, and for what reason (in-row data, LOB data):
SELECT CONVERT(char(8), OBJECT_NAME(i.object_id)) AS table_name,
       i.name AS index_name,
       i.index_id,
       i.type_desc AS index_type,
       p.partition_id,
       p.partition_number AS pnum,
       p.rows,
       a.allocation_unit_id AS au_id,
       a.type_desc AS page_type_desc,
       a.total_pages AS pages
FROM sys.indexes i
JOIN sys.partitions p
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
ORDER BY pages DESC

My guess (which I think is where marc_s is also headed) is that you've declared your clustered indexes for at least some of your tables to be on the index file group. The clustered index determines how (and where) the actual data for your table is stored.
Posting some of your code would certainly help others pinpoint the problem though.
I think that Martin Smith answered your other questions pretty well. I'll just add this... If you want to limit index sizes you need to evaluate your indexes. Don't add indexes just because you think that you might need them. Do testing with realistic (or ideally real-world) loads on the database to see which indexes will actually give you needed boosts to performance. Indexes have costs to them. In addition to the space cost which you're seeing, they also add to the overhead of inserts and updates, which have to keep the indexes in sync. Because of these costs, you should always have a good reason to add an index and you should consciously think about the trade-offs.

Consider that it is actually quite common for the total storage required for indexes to be greater than the storage required for the table data within a given database.
Your particular scenario, however, appears to be quite excessive. As others have pointed out, if you have assigned the clustered index for a given table to reside in a separate data file (your index data file), then the entire physical table itself will reside in that file as well, because, in a manner of speaking, the clustered index is the table.
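If that is what happened and you want the table data back on the primary filegroup, one option (sketched here with made-up table, index and column names) is to rebuild the clustered index specifying the target filegroup:
CREATE UNIQUE CLUSTERED INDEX PK_MyTable
    ON dbo.MyTable (Id)
    WITH (DROP_EXISTING = ON)
    ON [PRIMARY];   -- moves the clustered index, and therefore the table data, to PRIMARY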
Providing details of your Table Schema and Index Structures will enable us to provide you with more specific guidance.
Other posters have mentioned that:
You should review your index definitions for duplicate indexes. Take a look at Identifying Overlapping Indexes by Brent Ozar.
You should look to identify unused indexes. Take a look at the SQL Server Pedia article: Finding Unused Indexes.
Other avenues to explore include reviewing the fragmentation of your indexes, as this can increase the storage requirements.
Heavy fragmentation, particularly in the Clustered Index of a table containing LOB data, can result in a significant increase in storage needs. Reorganizing the Clustered Index on tables that contain LOB data will compact the LOB data.
See Reorganizing and Rebuilding Indexes
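For instance, something along these lines (table name is a placeholder) checks fragmentation and then reorganizes with LOB compaction:
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED');

ALTER INDEX ALL ON dbo.MyTable
REORGANIZE WITH (LOB_COMPACTION = ON);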

@martin-smith's answer is almost what I needed...
Here is how you sort by index size in GB (SQL Server uses 8 KB pages, i.e. 128 pages per MB):
SELECT
object_name(p.object_id) AS table_name
, i.name AS index_name
, i.index_id
, i.type_desc AS index_type
-- , partition_id
-- , partition_number AS pnum
-- , allocation_unit_id AS au_id
, rows
, a.type_desc as page_type_desc
, total_pages/(1024 * 128.0) AS sizeGB
FROM
sys.indexes i
JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a ON p.partition_id = a.container_id
JOIN sys.all_objects ao ON (ao.object_id = i.object_id)
WHERE ao.type_desc = 'USER_TABLE'
ORDER BY
-- table_name
sizeGB DESC

Related

What Columns Should I Index to Improve Performance in SQL

In my query I have a temp table of keys that will be joined to multiple tables later on.
I want to create an index on my temp table to improve performance, because it takes a couple of minutes for my query to run.
SELECT DISTINCT
k.Id, k.Name, a.Address, a.City, a.State, a.Zip, p.Phone, p.Fax, ...
FROM
#tempKeys k
INNER JOIN
dbo.Address a ON a.AddrId = k.AddrId
INNER JOIN
dbo.Phone p ON p.PhoneId = a.PhoneId
...
My question is: should I create a separate index for each column being joined,
CREATE NONCLUSTERED INDEX ... (AddrId ASC)
CREATE NONCLUSTERED INDEX ... (PhoneId ASC)
or can I create one index that includes all the columns being joined?
CREATE NONCLUSTERED INDEX ... (AddrId ASC, PhoneId ASC)
Also, are there other ways I can improve performance in this scenario?
As @DaleK says, this is a complex topic. In general though, an index is only usable when all the leading values are used. Your suggested composite index will likely not help: the indexed value of PhoneId cannot be used independently of AddrId. (The index would be fine for AddrId on its own.)
The best approach is to have a test database with representative data & volumes then check the query plan & suggestions. Don't forget every index you add has a side effect on the insert.
Another factor is that without a WHERE clause or if there are larger data sets (I think over 5-10% of the table), the optimiser will decide it's often faster to not use indexes anyway.
And I'd rethink using temp tables anyway, let alone indexed ones. They're rarely necessary. A single, large query usually runs faster (and has better data integrity depending on your isolation model) than one split into chunks.
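If you do keep the temp table, a minimal sketch (assuming AddrId is the only #tempKeys column used in a join predicate, as in the query above) would be a single index keyed on that column rather than the composite one:
CREATE CLUSTERED INDEX IX_tempKeys_AddrId ON #tempKeys (AddrId);
PhoneId comes from dbo.Address in the posted query, not from #tempKeys, so indexing it on the temp table would not help in any case.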

SQL-SERVER Query is taking too much time

I have this simple query, but it takes 1 minute for just 0.5M records, even though all columns mentioned in the SELECT are in a non-clustered index.
Both tables have approximately 1M records and approximately 200 columns each.
Is having lots of records in the table, or having a lot of indexes, causing the slowness?
SELECT catalog_items.id,
catalog_items.store_code,
catalog_items.web_code AS web_code,
catalog_items.name AS name,
catalog_items.name AS item_description,
catalog_items.image_thumnail AS image_thumnail,
catalog_items.purchase_description AS purchase_description,
catalog_items.sale_description AS sale_description,
catalog_items.taxable,
catalog_items.is_unique_item,
ISNULL(catalog_items.inventory_posting_flag, 'Y') AS inventory_posting_flag,
catalog_item_extensions.total_cost,
catalog_item_extensions.price
FROM catalog_items
LEFT OUTER JOIN catalog_item_extensions ON catalog_items.id = catalog_item_extensions.catalog_item_id
WHERE catalog_items.trans_flag = 'A';
Update: the execution plan shows a missing index, but the same index is already there. Why?
I'm not convinced that the plan is wrong currently, on the basis that you mention selecting 500k rows, out of a table of 1m rows. Even with an index as suggested by others, the selectivity of that index is pretty weak, from a tipping point perspective (https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/ ) - even with 200 columns I wouldn't expect 500k out of 1m rows per table to result in index seeks with lookups, a full scan would be faster in the CBO's view.
The missing index question: if you look closely, it's not just suggesting an index on trans_flag, it's suggesting to index that field and then INCLUDE a number more. We can't see how many columns it's suggesting to include, but I would expect it to be all of the ones in the query; it's basically suggesting you create a covering index. Even in an NC index scan scenario this would be faster to scan than the base table.
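As a rough illustration of what such a suggestion usually expands to (this is a sketch; the INCLUDE list here is just the columns from the posted SELECT, not the optimizer's actual recommendation):
CREATE NONCLUSTERED INDEX IX_catalog_items_trans_flag
ON catalog_items (trans_flag)
INCLUDE (id, store_code, web_code, name, image_thumnail,
         purchase_description, sale_description, taxable,
         is_unique_item, inventory_posting_flag);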
We also have no information about physical layouts as yet, how the page is constructed, level of fragmentation etc, or even what kind of disks the data is on and overall size. That image_thumbnail field for example is suggestive of a large row size overall, which means we are dealing with off page storage into LOB / SLOB.
In short - even with a query plan, there is no 'easy' answer here in my view.
For this query
select . . .
from catalog_items ci left outer join
catalog_item_extensions cie
on ci.id = cie.catalog_item_id
where ci.trans_flag = 'A'
I would recommend an index on catalog_items(trans_flag, id) and catalog_item_extensions(catalog_item_id).
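In DDL form that recommendation would look roughly like this (the index names are made up):
CREATE NONCLUSTERED INDEX IX_catalog_items_trans_flag_id
    ON catalog_items (trans_flag, id);
CREATE NONCLUSTERED INDEX IX_catalog_item_extensions_catalog_item_id
    ON catalog_item_extensions (catalog_item_id);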

Why is a non-clustered index scan faster than a clustered index scan?

As I understand it, heap tables are tables without a clustered index, and they have no physical order.
I have a heap table "scan" with 120k rows and I am using this select:
SELECT id FROM scan
If I create a non-clustered index for the column "id", I get 223 physical reads.
If I remove the non-clustered index and alter the table to make "id" my primary key (and so my clustered index), I get 515 physical reads.
If the clustered index table is something like this picture: [clustered index B-tree diagram omitted]
Why do clustered index scans work like a table scan (or worse, when retrieving all rows)? Why does it not use the "clustered index table", which has fewer blocks and already contains the id that I need?
SQL Server indices are B-trees. A non-clustered index just contains the indexed columns, with the leaf nodes of the B-tree being pointers to the appropriate data page. A clustered index is different: its leaf nodes are the data pages themselves, and the clustered index's B-tree becomes the backing store for the table itself; the heap ceases to exist for the table.
Your non-clustered index contains a single, presumably integer column. It's a small, compact index to start with. Your query select id from scan has a covering index: the query can be satisfied just by examining the index, which is what is happening. If, however, your query included columns not in the index, assuming the optimizer elected to use the non-clustered index, an additional lookup would be required to fetch the data pages required, either from the clustering index or from the heap.
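A rough way to see this for yourself (a sketch assuming the heap table scan with an integer id column, as described in the question) is to compare the reads reported by STATISTICS IO:
CREATE NONCLUSTERED INDEX IX_scan_id ON dbo.scan (id);

SET STATISTICS IO ON;
SELECT id FROM dbo.scan;   -- covered: only the narrow index pages are read
SET STATISTICS IO OFF;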
To understand what's going on, you need to examine the execution plan selected by the optimizer:
See Displaying Graphical Execution Plans
See Red Gate's SQL Server Execution Plans, by Grant Fritchey
A clustered index is generally about as big as the same data in a heap would be (assuming the same page fullness). It should take just a few more reads than a heap because of the additional B-tree levels.
A clustered index cannot be smaller than a heap would be; I don't see why you would think that. Most of the size of a partition (be it a heap or a B-tree) is in the data.
Note that fewer physical reads does not necessarily mean a query is faster. Random IO can be 100x slower than sequential IO.
When to use Clustered Index-
Query Considerations:
1) Return a range of values by using operators such as BETWEEN, >, >=, <, and <=
2) Return large result sets
3) Use JOIN clauses; typically these are foreign key columns
4) Use ORDER BY, or GROUP BY clauses. An index on the columns specified in the ORDER BY or GROUP BY clause may remove the need for the Database Engine to sort the data, because the rows are already sorted. This improves query performance.
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Are unique or contain many distinct values
2) Defined as IDENTITY because the column is guaranteed to be unique within the table
3) Used frequently to sort the data retrieved from a table
Clustered indexes are not a good choice for the following attributes:
1) Columns that undergo frequent changes
2) Wide keys
When to use Nonclustered Index-
Query Considerations:
1) Use JOIN or GROUP BY clauses. Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
2) Queries that do not return large result sets
3) Contain columns frequently involved in search conditions of a query, such as WHERE clause, that return exact matches
Column Considerations :
Consider columns that have one or more of the following attributes:
1) Cover the query. For more information, see Index with Included Columns
2) Lots of distinct values, such as a combination of last name and first name, if a clustered index is used for other columns
3) Used frequently to sort the data retrieved from a table
Database Considerations:
1) Databases or tables with low update requirements, but large volumes of data can benefit from many nonclustered indexes to improve query performance.
2) Online Transaction Processing applications and databases that contain heavily updated tables should avoid over-indexing. Additionally, indexes should be narrow, that is, with as few columns as possible.
Try running
DBCC DROPCLEANBUFFERS
before the queries, if you really want to compare them.
Physical reads don't mean the same thing as logical reads when optimizing a query.
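A sketch of such a cold-cache comparison (requires sysadmin rights; never do this on a production server):
CHECKPOINT;                -- write dirty pages to disk first
DBCC DROPCLEANBUFFERS;     -- empty the buffer pool

SET STATISTICS IO ON;
SELECT id FROM dbo.scan;   -- every page now comes from disk, so physical reads are comparable
SET STATISTICS IO OFF;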

select count(*) vs keep a counter

Assuming indexes are put in place, and absolute-count-accuracy is not necessary (it's okay to be off by one or two), is it okay to use:
Option A
select count(*)
from Table
where Property = @Property
vs
Option B
update PropertyCounters
SET PropertyCount = PropertyCount + 1
where Property = @Property
then doing:
select PropertyCount
from PropertyCounters
where Property = @Property
How much performance degradation can I reasonably expect from doing select count(*) as the table grows into thousands/millions of records?
Keeping a separate count column in addition to the real data is a denormalisation. There are reasons why you might need to do it for performance, but you shouldn't go there until you really need to. It makes your code more complicated, with more chance of inconsistencies creeping in.
For the simple case where the query really is just SELECT COUNT(property) FROM table WHERE property=..., there's no reason to denormalise; you can make that fast by adding an index on the property column.
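For instance (table and column names are taken from the question; the index name is made up):
CREATE NONCLUSTERED INDEX IX_Table_Property ON dbo.[Table] (Property);

SELECT COUNT(*)
FROM dbo.[Table]
WHERE Property = @Property;   -- counted from a narrow range of the index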
You didn't specify the platform, but since you use T-SQL syntax for @variables I'll venture a SQL Server specific answer:
count(*) (or, strictly speaking, count_big(*)) is an expression that can be used in indexed views; see Designing Indexed Views.
create view dbo.vwCounts
with schemabinding
as select Property, count_big(*) as [Count]
from dbo.[Table]
group by Property;
create unique clustered index cdxCounts on dbo.vwCounts(Property);
select [Count]
from dbo.vwCounts with (noexpand)
where Property = @Property;
On Enterprise Edition the optimizer will even use the indexed view for your original query:
select count_big(*)
from dbo.[Table]
where Property = @Property;
So in the end you get to have your cake and eat it too: the count per property is already aggregated and maintained for you, for free, by the engine. The price is that updates have to maintain the indexed view (they will not recompute the aggregate count, though), and the aggregation creates hot spots for contention (locks on separate rows in Table will contend for the same count row in the indexed view).
If you say that you do not need absolute accuracy, then Option B is a strange approach. If Option A becomes too heavy (even after adding indexes), you can cache the output of Option A in memory or in another table (your PropertyCounters), and periodically refresh it.
This isn't something that can be answered in general SQL terms. Quite apart from the normal caveats about indices and so on affecting queries, it's also something where there is considerable different between platforms.
I'd bet on better performance on this from SQL Server than Postgres, to the point where I'd consider the latter approach sooner on Postgres and not on SQL Server. However, with a partial index set just right for matching the criteria, I'd bet on Postgres beating out SQL Server. That's just what I'd bet small winnings on though, either way I'd test if I needed to think about it for real.
If you do go for the latter approach, enforce it with a trigger or similar, so that you can't become inaccurate.
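A minimal sketch of such a trigger, assuming a hypothetical base table dbo.Items with a Property column alongside the PropertyCounters table from the question (deletes and updates would need the same treatment; only inserts are handled here):
CREATE TRIGGER trg_Items_Insert_Count
ON dbo.Items                  -- hypothetical base table
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE pc
       SET pc.PropertyCount = pc.PropertyCount + i.cnt
    FROM dbo.PropertyCounters pc
    JOIN (SELECT Property, COUNT(*) AS cnt
          FROM inserted
          GROUP BY Property) i
      ON i.Property = pc.Property;
END;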
On SQL Server, if you don't need absolutely accurate counts, you could also inspect the catalog views. This would be much easier to do - you don't have to keep a count yourself - and it's a lot less taxing on the system. After all, if you need to count all the rows in a table, you need to scan that table, one way or another - no way around that.
With this SQL statement here, you'll get all the tables in your database, and their row counts, as kept by SQL Server:
SELECT
t.NAME AS TableName,
SUM(p.rows) AS RowCounts
FROM
sys.tables t
INNER JOIN
sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN
sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
WHERE
t.NAME NOT LIKE 'dt%' AND
i.OBJECT_ID > 255 AND
i.index_id <= 1
GROUP BY
t.NAME, i.object_id, i.index_id, i.name
ORDER BY
OBJECT_NAME(i.object_id)
I couldn't find any documentation on exactly how current those numbers typically are, but from my own experience they're usually spot on (unless you're doing some bulk loading or similar, but in that case you wouldn't want to constantly scan the table to get the exact count either).

Defining indexes: Which Columns, and Performance Impact?

I know how to use indexes (clustered and non-clustered),
but when should I use non-clustered indexes on my table?
In what scenarios should I make a column a non-clustered index?
I have gone through the MSDN guidelines but am still a bit confused.
Should I make only unique columns non-clustered, or other columns as well?
If I overload my table with non-clustered indexes, will that also decrease performance?
Should I use a composite non-clustered index on columns that are foreign keys?
I know the primary key should be clustered and unique keys should be non-clustered, but what about foreign keys?
The clustered index defines your table's physical structure (to a certain degree) - e.g. it defines in what order the data is ordered. Think of the phonebook, which is "clustered" by (LastName,FirstName) - at least in most countries it is.
You only get one clustered index per table - so choose it wisely! According to the gospel of the Queen of Indexing, Kimberly Tripp, the clustering key should be narrow, stable (never change), unique (yes!) and ideally ever-increasing.
It should be narrow, because the clustering key will be added to each and every entry of each and every non-clustered index - after all, the clustering key is the value used to ultimately find the actual data.
It should be stable since constantly updating lots of index values is a costly affair - especially since the clustering key would have to updated in all non-clustered indices as well.
It needs to be unique, since again - it's ultimately the value used to locate the actual data. If you choose a column that is not guaranteed to be unique, SQL Server will "uniquify" your clustering key by adding a 4-byte value to it - not a good thing.
And ideally, the clustering key should be ever-increasing since that causes the least page and index fragmentation and thus is best for performance.
The ideal candidate for a clustering key would be an INT (or BIGINT) IDENTITY column; it fulfills all of those requirements.
As for non-clustered indices - use and choose them wisely! There's only one general rule I can give you: all columns that are part of a foreign key (referencing another table) should be in an index - SQL Server will not (contrary to popular belief and lots of myths) put such an index in place automatically - never has, never does.
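A small sketch of both points together: a narrow, ever-increasing IDENTITY clustering key plus an explicit index on the foreign key column (all table and column names here are made up):
CREATE TABLE dbo.OrderItem
(
    OrderItemId INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_OrderItem PRIMARY KEY CLUSTERED,   -- narrow, stable, unique, ever-increasing
    OrderId     INT NOT NULL
        CONSTRAINT FK_OrderItem_Orders REFERENCES dbo.Orders (OrderId),
    Quantity    INT NOT NULL
);

-- SQL Server does NOT create this one for you:
CREATE NONCLUSTERED INDEX IX_OrderItem_OrderId ON dbo.OrderItem (OrderId);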
Other than that, you need to watch your system and see what kind of queries you have. All columns that show up in a WHERE or ORDER BY clause are potential candidates for indexing, but too many indices isn't a good thing either.
You can only have one clustered index per table. It doesn't have to be the primary key, but in most cases it will be.
Beyond that - it really depends on the queries & the tipping point for what indexes will be used. But defining indexes also means there will be an impact to DML - inserts, updates & deletes will take a slight performance hit.
Should I use composite non clustered index(es) on columns that are foreign keys?
Doesn't matter what the column is, it's the usage that matters for the optimizer to determine what index, clustered or otherwise, to use.
Yes, you can overload your tables with too many indexes. In general, every additional index costs performance time in terms of index maintenance. Tables that are heavily updated should generally have fewer indexes.
Another broad rule (from Richard Campbell, on RunAs Radio and DotNetRocks), is that a few broad indexes will perform better than a larger number of narrow indexes. A broad index will cover a wider range of queries, and there's less for the query optimizer to investigate. Remember that the query optimizer has a limited time to run.
Investigate SQL Server Profiler. There are tools there (used to be stand-alone, but they've changed and I haven't used them recently). They can analyze workloads and make indexing recommendations. These will be better choices than indexes picked "intuitively."
If you have queries that are referencing columns that are not in your index, the SQL server engine will have to perform a table lookup to get the non-included columns from the actual table.
If you are running these queries often, you should create non-clustered indexes that "cover" the query by including all the referenced columns in the index. This should include any non-unique columns.
Adding indexes to a table always decreases write performance, since the index will have to be updated every time the table is updated.
What fields are you doing lookups on? Searching? Etc.
Determine what fields you are using when running your queries (WHERE clause)
and they could possibly be good candidates.
For instance, think of a library. The book catalog has a clustered index for the ISBN number and a non clustered index for say publishing year, etc.
Also what helped me is something that Bart Duncan posted a long time ago.
He deserves the credit for this.
The article was entitled "Are you using SQL's Missing Index DMV's?". Look it up and run this query:
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) AS improvement_measure,
'CREATE INDEX [missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*, mid.database_id, mid.[object_id]
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC
It is not the ultimate solution for you but it will help you determine some indexes.
And the link to the article: http://blogs.msdn.com/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
By default, when you create a PK in SQL Server it becomes the clustered index. It doesn't have to be, but it generally is.
Whether or not you should create a clustered index depends on your workload (usually dominated by the number and kind of SELECT statements hitting your table).
A clustered index will force the disk storage order of the rows to be according to the clustered index values. (For this reason, there can be only 1 clustered index per table, as rows are stored on disk only once) This makes sense if most of your queries are always demanding a group of related rows.
Example: suppose you are storing CustomerOrders, and you frequently want to know the number of CustomerOrders (regardless of the customer) in a certain time period. In this case it may be useful to create a clustered index with the OrderDate as the first column. If, on the other hand, you are frequently looking for all CustomerOrders with the same CustomerId, it makes more sense to put the CustomerId as the first column in your clustered index.
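A sketch of the two alternatives (CustomerOrders, OrderDate and CustomerId are the example names from the paragraph above; adding OrderId as a tie-breaker column is my own assumption):
-- If date-range queries dominate:
CREATE CLUSTERED INDEX IX_CustomerOrders_OrderDate
    ON dbo.CustomerOrders (OrderDate, OrderId);

-- If "all orders for one customer" dominates (a table can only have one clustered
-- index, so this would replace the one above rather than complement it):
-- CREATE CLUSTERED INDEX IX_CustomerOrders_CustomerId
--     ON dbo.CustomerOrders (CustomerId, OrderId);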
The disadvantage of clustered indexes is not in the clustered index itself, but in the secondary indexes: secondary indexes are themselves not clustered (by definition, since the rows can only be stored once, in the order of the clustered index), and their index entries point to the entries of the clustered index. So retrieving a row via a secondary index requires two lookups: one in the secondary index, and then one in the clustered index it points to.