eligibility for creating index - sql

I have created script to find selectivity of each columns for every tables. In those some tables with less than 100 rows but selectivity of column is more than 50%.
where Selectivity = Distinct Values / Total Number Rows
So, is those column are eligible for index?
Or, can you tell, how much minimum rows require for eligibility for creating index?

I think I understand what you are trying to accomplish by calculating a 'Selectivity' value for your data but you cannot apply the rule blindly.
In fact in for certain queries the 'Selectivity' value might be really low an index will still be very beneficial. For example:
Assume a 'inbox' table with millions of rows, these rows have a 'Read' boolean field. In this case the distinct values over the number of rows will be really low. If most items are read most of the time then finding unread items with an index on this field will be very efficient.
Creating indexes index come at a cost. Although you get the benefit for reads, you pay for writes and disk usage.
I would rather recommend you profile your queries and index accordingly. You can also look at the data from sys.dm_db_missing_index_group_stats and other Dynamic management views that will give you insight on indexes usage (or missing) ones.

You can create a index on a table with 0 rows, 1 row or a 100 million rows. You can create an index where every column has the same value or unique values.
So you can create an index. The question is really should you create an index and no tool is going to tell you that because indexes can also be multi-value and it depends on what queries you run. Creating indexes is something done when performance tuning queries or preemptively when you know that you'll be creating queries that are using it.
Every index comes with a cost in terms of space and time required to do updates, inserts and deletes. You don't want to be creating them spuriously so you're really going to have to do this by hand, not as a result of a script to see how unique the value of a column is.

A general rule of thumb says that if you have a very large table (over 1 million rows), you should only use an index if a WHERE clause based on that index selects at most something in the neighborhood of 1-2% of the data.
If you have a "gender" column and roughly 50% of values are "male" and roughly 50% "female", then having an index on that really doesn't give you much - SQL Server and most other RDBMS will most likely still do a full table scan in this case, since on average, they'd have to scan at least half the table anyway, so the "detour" by using an index first and then looking up the actual full data based on that index value is just not worth it.
An index is excellent if you have something like unique keys (customer number), or a value that is quite selective. An index is not without cost - it uses up disk space, it needs to be maintained, it will slightly slow down all operations besides the SELECT - so thread carefully, it's not the best idea to just blindly index everything. Having too few indices is bad - but having too many, and the wrong ones, can be even worse! :-) Nobody ever claimed getting your indices right was easy.... :-)
But there's definitely help out there - the best source I know are Kimberly Tripp's excellent blog posts on SQL Server indexing (and many other topics).
Marc

Related

When isn't it appropriate to use a SQL Index

I was asked a question today on when wouldn't I want to create a SQL Index on a table.
The only thing I can think of is when you don't need one (i.e. a small table). That answer doesn't feel right. Is there a thresh-hold on when I should use an index and when I shouldn't?
When not to create an index on the table, there are lots of things to consider.
First, is that there are a lot of possible indexes you could create. For example, you could create an index containing not only every column in the table, but every permutation of the columns (since column ordering in indexes does matter). This could be a huge number of indexes as your column count gets higher.
Every index comes with a number of things that decrease performance in different ways. For example, they may take memory/disk space from what is available. Probably worse than this though, is the fact that indexes need to be updated when the table underneath it is updated. This means that every insert/update/delete in a table, can trigger an index update. As you have more indexes, that's more indexes to update, which can kill performance on your CUD operations, and can kill your server performance if you are doing these often.
Because of this performance impact, you want to avoid 'useless' indexes. Indexes that are used for every query are typically good, but an index used only once a day for a <1s query is probably useless. It's all a tradeoff in attempting to determine which indexes are useful enough to use and whose performance benefits are greater than the performance hits.
You could answer it with the conter question: When do you need an index?
You need an index, if you want to search for entries, to get your results faster. For example if the column is used in a where clause. Of course you could try index everything, but indexing will cause you to use extra memory/hard disk. So you only index columns you use to find your rows.
What rows MySQL for example is reading while trying to find your rows, you can analyze with the EXPLAIN command.
Does this help?
A rule of thumb is, to drop all indices except the unique index on the primary key, on small tables (less than about 100'000 rows).
Also, it is not appropriate to use an index, if the column is not for search purpose (e.g. the salary of employees).

Indexing and performance Implications of moving small table into big table

I have a table with approximately 2.5 million rows that I am thinking about moving into a much larger table, 35 million rows, with a boolean flag set on the original 2.5 million.
If I wanted to run lots of queries against the 2.5 million records in the new larger table, would adding an index be useful / not cause a full table scan on every query? I know that traditionally indexes aren't helpful in booleans, but since only 7% of the records will be true, I thought it might not require a table scan on every query.
Perhaps look at using a partial index.
From docs
A partial index is an index built over a subset of a table; the subset
is defined by a conditional expression (called the predicate of the
partial index). The index contains entries for only those table rows
that satisfy the predicate.
A major motivation for partial indexes is to avoid indexing common
values. Since a query searching for a common value (one that accounts
for more than a few percent of all the table rows) will not use the
index anyway, there is no point in keeping those rows in the index at
all. This reduces the size of the index, which will speed up queries
that do use the index. It will also speed up many table update
operations because the index does not need to be updated in all cases.
Example 11-1 shows a possible application of this idea.
I would be looking at partitioning, if you have a substantial proportion of the table that you want to access efficiently.
If you do "insert into big select * from small", then all of the rows that came from the small table are likely to be physically close to each other. After analyzing the table, PostgreSQL will know this, and so will probably choose to use the index on the boolean.
But, if there a lot of churn in the rows then eventually the "true" rows and the "false" rows will become all jumbled up, making use of the index less and less effective, and PostgreSQL will stop using it.
By using partitioning/inheritance, you can keep the rows physically separate (to make sequential scanning on just the small set faster) while making them look like a single set of data when you want to.
Depending on the nature of the queries you run, you might also benefit from adding other columns to the index, keeping the boolean column as the first column.

Should searchable date fields in a database table always be indexed?

If I have a field in a table of some date type and I know that I will always be searching it using comparisons like between, > or < and never = could there be a good reason not to add an index for it?
The only reason not to add an index on a field you are going to search on is that the cost of maintaining the index overweights its benefits.
This may happen if:
You have a really tough DML on your table
The existence of the index makes it intolerably slow, and
It's more important to have fast DML than the fast queries.
If it's not the case, then just create the index. The optimizer just won't use it if it thinks it's not needed.
There are far more bad reasons.
However, an index on the search column may not be enough if the index is nonclustered and non-covering. Queries like this are often good candidates for clustered indexes, however a covering index is just as good.
This is a great example of why this is as much art as science. Some considerations:
How often is data added to this table? If there is far more reading/searching than adding/changing (the whole point of some tables to dump data into for reporting), then you want to go crazy with indexes. You clustered index might be needed more for the ID field, but you can have plenty of multi-column indexes (where the date fields comes later, with columns listed earlier in the index do a good job of reducing the result set), and covered indexes (where all returned values are in the index, so it's very fast, like you're searching on the clustered index to begin with).
If the table is edited/added to often, or you have limited storage space and hence can't have tons of indexes, then you have to be more careful with your indexes. If your date criteria typically gives a wide range of data, and you don't search often on other fields, then you could give a clustered index over to this date field, but think several times before you do that. You clustered index being on a simple autonumber field is a bonus for all you indexes. Non-covered indexes use the clustered index to zip to the records for the result set. Don't move the clustered index to a date field unless the vast majority of your searching is on that date field. It's the nuclear option.
If you can't have a lot of covered indexes (data changes a lot on the table, there's limited space, your result sets are large and varied), and/or you really need the clustered index for another column, and the typical date criteria gives a wide range of records, and you have to search a lot, you've got problems. If you can dump data to a reporting table, do that. If you can't, then you'll have to balance all these competing factors carefully. Maybe for the top 2-3 searches you minimize the result-set columns as much as you can configure covered indexes, and you let the rest make due with a simple non -clustered index
You can see why good db people should be paid well. I know a lot of the factors, but I envy people to can balance all these things quickly and correctly without having to do a lot of profiling.
Don't index it IF you want to scan the entire table every time. I would want the database to try and do a range scan, so I'd add the index, but I use SQL Server and it will use the index in most cases. However different databases many not use the index.
Depending on the data, I'd go further than that, and suggest it could be a clustered index if you're going to be doing BETWEEN queries, to avoid the table scan.
While an index helps for querying the table, it will also slow down inserts, updates and deletes somewhat. If you have a lot more changes in the table than queries, an index can hurt the overall performance.
If the table is small it might never use the indexes therefore adding them may just be wasting resources.
There are datatypes (like image in SQL Server) and data distributions where indexes are unlikely to be used or can't be used. For instance in SQL Server, it is pointless to index a bit field as there is not enough variability in the data for an index to do any good.
If you usually query with a like clause and a wildcard as the first character, no index will be used, so creating one is another waste of reseources.

Do indexes suck in SQL?

Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values.
If I were to put an index on the column would it be large?
If so, why? If I were to partition the data into the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.
Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?
And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.
Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time of the last several decades ensuring that this is so.
Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.
Partitioning, in the case described is equivalent to applying a clustered index. If the table is sorted otherwise (or is in arbitrary order) then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may reduce in size as the sortedness of the rows with respect to the indexed value increases.
YMMV.
The short answer:
Do indexes suck: Yes and No
The longer answer:
They don't suck if used properly. Maybe you should start reading about how indexes work, why they can work and why they sometimes don't work.
Good starting points:
http://www.sqlservercentral.com/articles/Indexing/
No indexes don't suck, but you have to pay attention to how you use them or they can backfire on the performance of your queries.
First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step to far. Database design is one of the most important things to consider in optimizing performance
Second: Indexes
In a nutshell the indexes will help the database to perform a binary search of your record. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves enumerating each and every record.
It doesn't really matter THAT much for index scans how many records there are in the database table. Because of the (balanced) binary tree search doubling the amount of records will only result in one extra search step.
Determine the primary key of your table, SQL will automatically place a clustered index on that column(s). Clustered indexes perform really well. In addition you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY statements. Do remember that indexes have a certain overlap, try to never include your clustered index into a non-clustered index.
Also interesting might be the fill factor on the indexes. Do you want to optimize your table for reads (high fill factor - less storage, less IO) or for writes (low fill factor more storage, less rebuilding your database pages).
Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records of which 500,000 records are no longer relevant but stored for archiving purposes. In this case you could decide to partition the table and store the 500,000 old records on slow storage and the other 500,000 records on fast storage.
To measure is to know
The best way to get insight in what happens is to measure what happens to your cpu and io. Microsoft SQL server has some tools like the Profiler and Execution plans in Management Studio that will tell you the duration of your query, number of read/writes and cpu usage. Also the execution plan will tell you which or IF indexes are being used. To your surprise you might see a table scan although you didn't expect it.
Say I have a table with a large number of rows and one column which I want to index can have one of 20 values. If I were to put an index on the column would it be large?
The index size will be proportional to the number of your rows and the length of the indexed values.
The index keeps not only the indexed value, but also some kind of a pointer to the row (ROWID in Oracle, LCID in PostgreSQL, primary key in InnoDB etc).
If you have 10,000 rows and a 1 distinct value, you will still have 10,000 records in your index.
If so, why? If I were to partition the data into the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same
In this case, you would come with 20 indexes being same in size in total as your original one.
This technique is sometimes used in fact in such called partitioned indexes. It has its advantages and drawbacks.
Standard b-tree indexes are best suited to fairly selective indexes, which this example would not be. You don't say what DBMS you are using; Oracle has another type of index called a bitmap index which is more suited to low-selectivity indexes in OLAP environments (since these indexes are expensive to maintain, making them unsuitable for OLTP environments).
The optimiser will decide bases on stats whether it thinks the index will help get the data in the fastest time; if it won't, the optmiser won't use it.
Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and for the optimiser can automatically perform "partition elimination" like you suggest.
Sorry, I'm not quite sure what you mean by "large".
If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.
If your index is non-clustered, then only the index related data will be on your leaf pages. Then, depending on suchs things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.
The efficiency of your index will also be determined by the data type of the 20 values you're speaking of going into the column. If those are pre-defined values, then their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key with an index on the column.
Ultimately, you could have a perfect index on a column. But it's best use will be determined for the most part by the queries you write. So if your queries make use of the indexes, you're golden.
Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.
As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.
It would be large enough to hold those values for all the rows, in a sorted order.
Say you have 20 different strings of 4 characters, and 1 million rows, it would at least be 4 million bytes (or 8 if 16-bit unicode) to hold those values.

Table Scan vs. Add Index - which is quicker?

I have a table with many millions of rows. I need to find all the rows with a specific column value. That column is not in an index, so a table scan results.
But would it be quicker to add an index with the column at the head (prime key following), do the query, then drop the index?
I can't add an index permanently as the user is nominating what column they're looking for.
Two questions to think about:
How many columns could be nominated for the query?
Does the data change frequently? A lot of it?
If you have a small number of candidate columns, and the data doesn't change a lot, then you might want to consider adding a permanent index on any or even all candidate column.
"Blasphemy!", I hear. Most sources tell you to "never" index every column of a table, but that advised is rooted on the generic assumption that tables are modified frequently.
You will pay a price in additional storage, as well as a performance hit when the data changes.
How small is small and how much is a lot, and is the tradeoff worth it?
There is no way to tell a priory because "too slow" is usually a subjective measurement.
You will have to try it, measure the size of your indexes and then the effect they have in the searches. You will have to balance the costs against the increase in satisfaction of your customers.
[Added] Oh, one more thing: temporary indexes are not only physically slower than a table scan, but they would destroy your concurrency. Re-indexing a table usually (always?) requires a full table lock, so in effect only one user search could be done at a time.
Good luck.
I'm no DBA, but I would guess that building the index would require scanning the table anyway.
Unless there are going to be multiple queries on that column, I would recommend not creating the index.
Best to check the explain plans/execution times for both ways, though!
As everyone else has said, it most certainly would not be faster to add an index than it would be to do a full scan of that column.
However, I would suggest tracking the query pattern and find out which column(s) are searched for the most, and add indexes at least for them. You may find out that 3-4 indexes speeds up 90% of your queries.
Adding an index requires a table scan, so if you can't add a permanent index it sounds like a single scan will be (slightly) faster.
No, that would not be quicker. What would be quicker is to just add the index and leave it there!
Of course, it may not be practical to index every column, but then again it may. How is data added to the table?
It wouldn't be. Creating an index is more complex than simply scanning the column, even if the computational complexity is the same.
That said - how many columns do you have? Are you sure you can't just create an index for each of them if the query time for a single find is too long?
It depends on the complexity of your query. If you're retrieving the data once, then doing a table scan is faster. However, if you're going back to the table more than once for related information in the same query, then the index is faster.
Another related strategy is to do the table scan, and put all the data in a temporary table. Then index THAT and then you can do all your subsequent selects, groupings, and as many other queries on the subset of indexed data. The benefit being that looking up related information in related tables using the temp table is MUCH faster.
However, space is cheap these days, so you'd probably best be served by examining how your users actually USE your system and adding indexes on those frequent columns. I have yet to see users use ALL the search parameters ALL the time.
Your solution will not scale unless you add a permanent index to each column, with all of the columns that are returned in the query in the list of included columns (a covering index). These indexes will be very large, and inserts and updates to that table will be a bit slower, but you don't have much of a choice if you are allowing a user to arbitrarily select a search column.
How many columns are there? How often does the data get updated? How fast do inserts and updates need to run? There are trade-offs involved, depending on the answers to those questions. Do plenty of experimentation and testing so you know for sure how things will perform.
But to your original question, adding and dropping an index for the purpose of a single query is only beneficial if you do more than one select during the query (for example, the select is in a sub-query that gets run for each row returned).